
Agentic AI Architecture Patterns

A guide to agentic AI patterns: ReAct loops, tool-use protocols, multi-step planning, memory, and multi-agent coordination in production.

S5 Labs Team · February 21, 2026

For most of the LLM era, AI applications followed a simple pattern: take user input, call a model, return the output. Chatbots, summarizers, classifiers---they all operated in a single turn. The user asked, the model answered, and that was the end of the interaction.

That model is breaking down. The most consequential AI systems being built today don’t just answer questions---they take actions. They browse codebases, query databases, write and execute code, call APIs, and iterate on their own outputs until a task is complete. These are agents: systems where an LLM operates in a loop, reasoning about what to do next, executing actions in the world, and adapting based on what it observes.

The shift from chatbot to agent is not just a product evolution---it is an architectural one. Agents introduce loops where pipelines had straight lines, state where systems were stateless, and autonomy where humans were always in the driver’s seat. This guide covers the architecture patterns that make agents work: ReAct loops, tool-use protocols, planning strategies, memory hierarchies, multi-agent coordination, and the reliability patterns that keep everything from going off the rails.

Why Now?

The 2025—2026 period has been a clear inflection point for agentic AI. Several converging developments made agents practical:

  • Models got good enough at tool use. GPT-5.3-Codex demonstrated that frontier models could reliably operate in multi-step agentic coding loops, autonomously navigating large codebases for extended periods. Claude Opus 4.6 pushed similar boundaries with adaptive reasoning and computer use capabilities.
  • Tool-use protocols standardized. The Model Context Protocol (MCP), introduced by Anthropic in late 2024 and adopted by OpenAI, Google, and others throughout 2025, gave agents a universal way to connect to tools and data sources.
  • Industry bet on agents. The OpenAI Frontier Alliance signaled that enterprise agent deployment was moving from experimentation to strategy. Gartner reported a 1,445% surge in multi-agent system inquiries from Q1 2024 to Q2 2025.
  • Frameworks matured. LangGraph, CrewAI, AutoGen, and others moved from experimental repos to production-grade orchestration layers.

The result is that building an agent is no longer a research exercise. It is an engineering discipline with emerging best practices---and that is what this guide is about.

The Anatomy of an AI Agent

Before diving into specific patterns, it helps to establish what an agent actually is. At the core, every AI agent has four components:

┌─────────────────────────────────────────────────────────────────────┐
│                         THE AGENT LOOP                              │
│                                                                     │
│   ┌───────────┐    ┌───────────┐    ┌───────────┐                  │
│   │ PERCEIVE  │───►│  REASON   │───►│    ACT    │                  │
│   │           │    │           │    │           │                  │
│   │ Parse     │    │ LLM core  │    │ Execute   │                  │
│   │ inputs,   │    │ decides   │    │ tool call │                  │
│   │ read obs. │    │ next step │    │ or output │                  │
│   └───────────┘    └───────────┘    └───────────┘                  │
│        ▲                                  │                         │
│        │          ┌───────────┐           │                         │
│        │          │  MEMORY   │           │                         │
│        │          │           │           │                         │
│        └──────────│ Context,  │◄──────────┘                         │
│                   │ state,    │                                     │
│                   │ history   │                                     │
│                   └───────────┘                                     │
└─────────────────────────────────────────────────────────────────────┘

Perception is how the agent takes in information. This includes the initial user request, but also observations from tool executions, API responses, file contents, error messages---anything the agent reads from its environment.

Reasoning is the LLM core. Given what the agent perceives and remembers, the model decides what to do next. This is where chain-of-thought happens, where plans are formed, and where the agent evaluates whether it has achieved its goal.

Action is how the agent affects the world. This includes calling tools (search, code execution, API calls), writing files, sending messages, or producing a final response to the user.

Memory ties the loop together. It holds the conversation history, tool results, intermediate reasoning, and any persistent state the agent needs across steps. Without memory, each iteration of the loop would start from scratch.

The critical insight is the loop. A traditional LLM pipeline is a straight line: input goes in, output comes out. An agent loops: it reasons, acts, observes the result, and reasons again. This loop continues until the agent determines it has completed the task or hit a stopping condition.

Levels of Autonomy

Not every system that uses an LLM needs full autonomy. It helps to think of a spectrum:

Level  Name              Description                              Example
0      Single-turn       One input, one output, no tools          Chatbot, classifier
1      Copilot           Suggests actions, human executes         Code completion, email drafts
2      Tool-assisted     Uses tools in a single step              RAG lookup, function calling
3      Supervised agent  Multi-step loop with human checkpoints   Claude Code with permission prompts
4      Autonomous agent  Runs to completion independently         Background coding agents, CI bots

Most production systems today operate at Levels 2–3. Level 4 is where the industry is heading, but it demands significantly more reliability and safety infrastructure---more on that later.

The ReAct Pattern

ReAct (Reasoning + Acting) is the foundational pattern behind most modern agents. Introduced by Yao et al. in their 2022 paper “ReAct: Synergizing Reasoning and Acting in Language Models”, it interleaves explicit reasoning with tool execution in a loop.

How It Works

The ReAct loop has three phases that repeat until the task is complete:

  1. Think: The model reasons about what it has observed and what it should do next
  2. Act: The model selects and executes a tool or action
  3. Observe: The system captures the result of the action and feeds it back to the model

┌──────────┐     ┌──────────┐     ┌──────────┐
│  THINK   │────►│   ACT    │────►│ OBSERVE  │
│          │     │          │     │          │
│ "I need  │     │ Execute: │     │ Result:  │
│ to find  │     │ search(  │     │ Found 3  │
│ the bug  │     │  'error  │     │ files    │
│ in the   │     │   log')  │     │ with     │
│ parser"  │     │          │     │ errors   │
└──────────┘     └──────────┘     └──────────┘
      ▲                                 │
      └─────────────────────────────────┘
              Loop until done

Why ReAct Works

The key insight is that reasoning grounds actions, and observations ground reasoning. Without the think step, the model might take random or repetitive actions. Without observations, the model’s reasoning would drift from reality. The interleaving creates a self-correcting dynamic.

Compare this to chain-of-thought alone (reasoning without acting), which can hallucinate facts it cannot verify. And compare it to acting without reasoning, which produces agents that flail---calling tools randomly without a coherent strategy.

ReAct Trace: A Coding Task

Here is what a ReAct trace looks like for a practical task---finding and fixing a bug reported in a test failure:

User: The test_payment_processing test is failing. Fix it.

Think: The test is failing, so I need to first understand the error. Let me
       run the test to see the failure message.

Act:   execute_command("pytest tests/test_payment.py::test_payment_processing -v")

Observe: FAILED - AssertionError: Expected status 'completed' but got 'pending'.
         The test calls process_payment() with a valid card and expects
         the status to transition to 'completed'.

Think: The payment is staying in 'pending' instead of moving to 'completed'.
       This likely means the status transition logic in process_payment() has
       a bug. Let me look at the implementation.

Act:   read_file("src/payments/processor.py")

Observe: [File contents showing process_payment function. Line 47 has:
         if payment.amount > 0 and payment.card.is_valid:
             payment.status = "processing"
         # Missing transition from "processing" to "completed"]

Think: I see the issue. The function sets status to "processing" but never
       transitions to "completed". There should be a step after the payment
       gateway confirms the charge. Let me check if there's a confirmation
       step that should update the status.

Act:   search_code("payment.status = \"completed\"")

Observe: No results found. The status is never set to "completed" anywhere
         in the codebase.

Think: Confirmed---the completion status is never set. After the gateway
       processes the charge successfully (line 52: gateway_response =
       self.gateway.charge(payment)), the status should be updated. Let me
       add that transition.

Act:   edit_file("src/payments/processor.py", line=53,
         insert='        if gateway_response.success:\n            payment.status = "completed"')

Observe: File updated successfully.

Think: Now let me run the test again to verify the fix works.

Act:   execute_command("pytest tests/test_payment.py::test_payment_processing -v")

Observe: PASSED

Think: The test passes now. The fix was adding the status transition to
       "completed" after a successful gateway charge. The task is done.

Final response: Fixed the failing test. The issue was in src/payments/processor.py
                --- the process_payment function never transitioned the payment
                status from "processing" to "completed" after a successful
                gateway charge. Added the status update on line 53.

ReAct in Pseudocode

def react_agent(task: str, tools: ToolRegistry, max_steps: int = 20) -> str:
    """Core ReAct loop."""
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": task}
    ]

    for step in range(max_steps):
        # THINK: The model reasons about the state and selects the next action
        response = llm.generate(messages, tools=tools)

        if response.is_final_answer:
            return response.content

        # Extract the tool call from the response
        tool_call = response.tool_calls[0]

        # ACT: Execute the tool
        try:
            observation = tools.execute(
                tool_call.name,
                tool_call.arguments
            )
        except ToolError as e:
            observation = f"Error: {e}"

        # OBSERVE: Feed result back into context
        messages.append({"role": "assistant", "content": response.content})
        messages.append({"role": "tool", "content": observation})

    return "Max steps reached without completing the task."

The simplicity of this loop is its strength. The LLM handles all the hard parts---deciding what to do, interpreting results, knowing when to stop. The framework just manages the loop, tool execution, and message history.

Tool Use and Function Calling

Tools are what give agents their power. Without tools, an LLM can only generate text. With tools, it can read files, query databases, call APIs, execute code, and interact with arbitrary external systems.

The Function Calling Pattern

Modern LLMs support tool use through function calling: the model outputs a structured tool invocation instead of (or alongside) natural language. The system then executes the tool and returns the result. Here is the typical flow:

┌──────────┐     ┌───────────┐     ┌──────────┐     ┌──────────┐
│  User    │────►│   LLM     │────►│  System  │────►│   LLM    │
│  query   │     │ generates │     │ executes │     │ uses     │
│          │     │ tool call │     │ the tool │     │ result   │
└──────────┘     └───────────┘     └──────────┘     └──────────┘

Tool Schema Design

Tools are defined with a name, description, and parameter schema. The description is critical---it is how the model decides when and how to use the tool. Here is an example:

{
  "name": "get_weather",
  "description": "Get current weather conditions for a specific location. Use this when the user asks about weather, temperature, or outdoor conditions for a city or address.",
  "parameters": {
    "type": "object",
    "properties": {
      "location": {
        "type": "string",
        "description": "City name or address, e.g., 'San Francisco, CA'"
      },
      "units": {
        "type": "string",
        "enum": ["celsius", "fahrenheit"],
        "description": "Temperature unit. Default: fahrenheit"
      }
    },
    "required": ["location"]
  }
}

And the invocation and result:

// Model outputs:
{
  "tool_call": {
    "name": "get_weather",
    "arguments": {
      "location": "San Francisco, CA",
      "units": "fahrenheit"
    }
  }
}

// System executes, returns:
{
  "tool_result": {
    "temperature": 62,
    "condition": "partly cloudy",
    "humidity": 73,
    "wind_speed_mph": 12
  }
}

Principles of Good Tool Design

The quality of your tools directly determines agent reliability. Several principles matter:

Make tools atomic. Each tool should do one thing well. A search_and_summarize tool conflates retrieval with generation---split it into search and summarize so the agent can reason between steps.

Write descriptions for the model, not the developer. The model reads the description to decide when to use the tool. “Queries the PostgreSQL database” is less useful than “Search for customer records matching a name, email, or account ID. Returns up to 10 matching records with their order history.”

Return structured, informative results. Return enough information for the model to reason about the result, but not so much that it overwhelms the context window. Include error details when things fail---“Connection timeout after 5s to api.example.com” is more useful than “Error.”

Handle errors gracefully. Agents will encounter tool failures. Return error information as tool results rather than throwing exceptions. This lets the model reason about the failure and try an alternative approach.

Model Context Protocol (MCP)

One of the most significant developments in agent infrastructure is the Model Context Protocol (MCP), an open standard for connecting LLMs to tools and data sources. Introduced by Anthropic in November 2024 and since adopted by OpenAI, Google, and others, MCP provides a universal interface between agents and their capabilities.

Before MCP, every agent framework had its own tool definition format. LangChain tools were incompatible with AutoGen tools, which were incompatible with native function calling APIs. MCP standardizes this with a client-server architecture:

┌─────────────┐                    ┌──────────────────┐
│  AI Agent   │◄──── MCP ────────►│  MCP Server:     │
│  (Client)   │     Protocol       │  GitHub          │
└─────────────┘                    └──────────────────┘

       │                           ┌──────────────────┐
       └──────── MCP ─────────────►│  MCP Server:     │
                 Protocol          │  Database        │
                                   └──────────────────┘

An MCP server exposes tools, resources (data the agent can read), and prompts (templated instructions) through a standardized JSON-RPC interface. Any MCP-compatible client can discover and use these capabilities without custom integration code.

The November 2025 specification expanded MCP significantly, adding asynchronous tasks (for long-running operations), OAuth 2.1 authorization, and resource subscriptions. By early 2026, MCP had been donated to the Agentic AI Foundation under the Linux Foundation, cementing its role as an industry standard.

For production agent systems, MCP matters because it decouples tool development from agent development. Your team can build an MCP server for your internal APIs once, and it works with Claude, GPT, Gemini, or any MCP-compatible agent framework.
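To make the interface concrete, here is roughly what the JSON-RPC exchange looks like when a client discovers and then calls a tool. The message shapes follow the MCP specification as published; the weather server and its values are invented for illustration:

```json
// Client discovers available tools:
{"jsonrpc": "2.0", "id": 1, "method": "tools/list"}

// Server responds with tool definitions:
{"jsonrpc": "2.0", "id": 1, "result": {"tools": [
  {"name": "get_weather",
   "description": "Get current weather for a location.",
   "inputSchema": {"type": "object",
                   "properties": {"location": {"type": "string"}},
                   "required": ["location"]}}]}}

// Client invokes the tool:
{"jsonrpc": "2.0", "id": 2, "method": "tools/call",
 "params": {"name": "get_weather", "arguments": {"location": "San Francisco, CA"}}}

// Server returns the result as content blocks:
{"jsonrpc": "2.0", "id": 2, "result": {
  "content": [{"type": "text", "text": "62°F, partly cloudy"}]}}
```

Because every server speaks this same wire format, the client needs no per-integration glue code.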

Error Handling and Retry Patterns

Tools fail. APIs time out, rate limits hit, files don’t exist, permissions get denied. Robust agents need strategies for handling these failures:

import time

def execute_tool_with_retry(
    tool_name: str,
    arguments: dict,
    max_retries: int = 3,
    backoff_base: float = 1.0
) -> ToolResult:
    """Execute a tool with exponential backoff retry."""
    for attempt in range(max_retries):
        try:
            result = tools.execute(tool_name, arguments)
            return ToolResult(success=True, data=result)
        except RateLimitError:
            wait_time = backoff_base * (2 ** attempt)
            time.sleep(wait_time)
        except ToolNotFoundError as e:
            # Don't retry—this won't resolve itself
            return ToolResult(
                success=False,
                error=f"Tool '{tool_name}' not found: {e}"
            )
        except ToolExecutionError as e:
            # Return error to the model so it can adapt
            return ToolResult(
                success=False,
                error=f"Execution failed: {e}"
            )

    return ToolResult(
        success=False,
        error=f"Max retries ({max_retries}) exceeded for {tool_name}"
    )

The key principle is: return errors to the model, don’t hide them. An agent that receives “File not found: config.yaml” can reason about alternative file names or locations. An agent that receives a generic “tool failed” message cannot.

Multi-Step Planning

Simple agents react to each observation independently---they decide the very next action based on the current state. This works for straightforward tasks, but complex tasks benefit from planning: decomposing the goal into sub-tasks before executing them.

Plan-Then-Execute

The simplest planning pattern separates the planning phase from the execution phase:

┌──────────────────────┐     ┌──────────────────────┐
│    PLANNING PHASE    │     │   EXECUTION PHASE    │
│                      │     │                      │
│  Input: User task    │     │  For each sub-task:  │
│  Output: Ordered     │────►│    1. Execute step   │
│    list of sub-tasks │     │    2. Check result   │
│                      │     │    3. Proceed or     │
│                      │     │       re-plan        │
└──────────────────────┘     └──────────────────────┘

def plan_and_execute(task: str, tools: list[Tool]) -> str:
    """Plan-then-execute agent pattern."""

    # Phase 1: Generate a plan
    plan_prompt = f"""Break this task into a numbered list of concrete steps.
Each step should be a single action I can take with available tools.

Task: {task}
Available tools: {[t.name for t in tools]}

Output a JSON array of step descriptions."""

    plan = llm.generate(plan_prompt)
    steps = json.loads(plan)

    # Phase 2: Execute each step
    results = []
    for i, step in enumerate(steps):
        result = react_agent(
            task=f"Execute step {i+1}: {step}\n\nPrevious results: {results}",
            tools=tools,
            max_steps=10
        )
        results.append({"step": step, "result": result})

    # Phase 3: Synthesize final answer
    return llm.generate(
        f"Task: {task}\nCompleted steps and results: {results}\n"
        f"Synthesize a final answer."
    )

This pattern works well when the task is decomposable upfront---you can reasonably figure out all the steps before starting. It struggles when early results change what steps are needed.

Interleaved Planning

A more flexible approach interleaves planning with execution. The agent maintains a plan but revises it after each step based on what it learns:

def interleaved_plan_and_execute(
    task: str, tools: list[Tool], max_iterations: int = 20
) -> str:
    """Agent that re-plans after each execution step."""
    context = {"task": task, "completed_steps": [], "observations": []}

    for _ in range(max_iterations):
        # Generate or update the plan in light of everything observed so far
        plan = generate_plan(context)

        if plan.is_complete:
            return synthesize_answer(context)

        # Execute only the next step, then loop back and re-plan
        next_step = plan.steps[0]
        observation = react_agent(next_step, tools, max_steps=5)

        # Update context with results
        context["completed_steps"].append(next_step)
        context["observations"].append(observation)

    # Iteration budget exhausted: return the best answer we can
    return synthesize_answer(context)

This is more robust to unexpected results but uses more tokens (you regenerate the plan each iteration) and can be slower.

ReWOO: Reasoning Without Observation

ReWOO (Reasoning Without Observation) takes a different approach. Instead of interleaving reasoning with observations, ReWOO generates the entire plan upfront, including which tools to call and how to use their results---before any tool is actually executed.

The key innovation is the use of placeholders. The planner writes tool calls that reference the outputs of previous tools using variables:

Plan:
  Step 1: search("Python async best practices") -> #result1
  Step 2: read_url(#result1.top_link) -> #result2
  Step 3: search("Python asyncio common mistakes") -> #result3
  Step 4: synthesize(#result2, #result3) -> final answer

The worker executes all tool calls in order, substituting actual results for placeholders. Finally, a solver LLM synthesizes the answer from all collected evidence.

ReWOO’s advantage is efficiency: it calls the LLM only twice (planning and synthesis) instead of once per step. The original paper showed 5x token savings compared to ReAct on multi-step reasoning benchmarks. The tradeoff is rigidity---if the first search returns nothing useful, the plan cannot adapt.
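The worker's placeholder substitution is straightforward to implement. The sketch below mirrors the `#resultN` convention from the trace above, but the function itself and its argument shapes are illustrative, not from the paper:

```python
import re

def run_rewoo_plan(plan: list[tuple[str, dict]], tools: dict) -> dict:
    """Execute a ReWOO-style plan, substituting #resultN placeholders.

    plan is an ordered list of (tool_name, arguments) pairs; any string
    argument may reference an earlier step's output as '#resultN'.
    tools maps tool names to plain callables.
    """
    results: dict[str, str] = {}
    for i, (tool_name, args) in enumerate(plan, start=1):
        # Replace placeholders like '#result1' with the earlier step's output
        resolved = {
            key: re.sub(r"#result(\d+)",
                        lambda m: results[f"#result{m.group(1)}"], value)
            if isinstance(value, str) else value
            for key, value in args.items()
        }
        results[f"#result{i}"] = tools[tool_name](**resolved)
    return results
```

The solver then reads the filled-in `results` dictionary to synthesize the final answer, with no LLM calls spent between steps.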

Tree-of-Thought Search

For problems where the right approach is not obvious, tree-of-thought search explores multiple reasoning paths in parallel:

                    ┌─── Approach A: Use regex ─── [evaluate] ─── Score: 3/10

Task: Parse dates ──┼─── Approach B: Use dateutil ── [evaluate] ─── Score: 8/10 ✓

                    └─── Approach C: Manual split ─ [evaluate] ─── Score: 5/10

The agent generates multiple candidate approaches, evaluates each one (possibly by partially executing them), and pursues the most promising path. This is expensive---you multiply token usage by the branching factor---but it is powerful for tasks where the first approach often fails.
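A single expansion step of this search can be sketched as: propose candidate approaches, score each, keep the best. In practice `propose` and `evaluate` would both be LLM calls; the names and the one-step framing here are illustrative (full tree-of-thought also recurses on the chosen branch):

```python
def tree_of_thought_step(task, propose, evaluate, n_candidates=3):
    """Expand one node: propose candidates, score them, keep the best.

    propose(task, n) -> list of candidate approaches (an LLM call in practice)
    evaluate(task, candidate) -> numeric score (an LLM call or partial execution)
    """
    candidates = propose(task, n_candidates)
    scored = [(evaluate(task, c), c) for c in candidates]
    # Sort on score only, so candidates themselves never need comparing
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[0]  # (best_score, best_candidate)
```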

When Planning Helps vs. When It Hurts

Planning adds overhead. For every task, you spend tokens generating a plan that might not survive contact with reality. Here are rough guidelines:

Scenario                                           Planning Value   Recommended Approach
Simple lookup (“What’s the weather?”)              Low              Direct tool call, no planning
Multi-step retrieval (“Summarize these 5 docs”)    Medium           Plan-then-execute
Open-ended exploration (“Find and fix the bug”)    High             Interleaved planning
Batch processing (“Format all files”)              High             Plan-then-execute (parallelizable)
Time-critical (“Respond in <2s”)                   Low              Skip planning, use direct ReAct

The general rule: plan when the cost of doing the wrong thing first is high, and skip planning when the task is straightforward or latency matters.

Memory Architectures

An agent without memory is like a developer who forgets everything between keystrokes. Memory is what allows agents to maintain context, learn from past actions, and operate coherently across long tasks.

The Memory Hierarchy

Agent memory naturally organizes into layers, analogous to computer memory hierarchies:

┌─────────────────────────────────────────────┐
│          WORKING MEMORY                     │
│    LLM context window (current turn)        │
│    Capacity: 128K-1M+ tokens                │
│    Speed: Instant (in-context)              │
├─────────────────────────────────────────────┤
│          SHORT-TERM MEMORY                  │
│    Conversation history, scratchpads        │
│    Capacity: Limited by context window      │
│    Speed: Instant (in-context)              │
├─────────────────────────────────────────────┤
│          LONG-TERM MEMORY                   │
│    Vector stores, databases, RAG            │
│    Capacity: Virtually unlimited            │
│    Speed: Retrieval latency (50-500ms)      │
├─────────────────────────────────────────────┤
│          EPISODIC MEMORY                    │
│    Past task traces, outcomes, learnings    │
│    Capacity: Virtually unlimited            │
│    Speed: Retrieval latency (50-500ms)      │
└─────────────────────────────────────────────┘

Working Memory: The Context Window

Working memory is what the LLM can attend to right now---the current contents of its context window. This is the most important memory system because it directly determines what the model knows when making decisions.

The challenge with working memory is that it is finite and expensive. Even with 128K or 1M token windows, a long-running agent can exhaust its context with tool results, code files, and conversation history. Strategies for managing working memory include:

  • Summarization: Periodically compress earlier conversation turns into summaries
  • Selective inclusion: Only include the most relevant tool results, not all of them
  • Sliding window: Drop the oldest messages when approaching the context limit
  • Hierarchical context: Keep a high-level summary always present, load details on demand
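A minimal version of the summarization + sliding-window combination might look like the following sketch, where `count_tokens` and `summarize` stand in for a real tokenizer and an LLM summarization call:

```python
def trim_context(messages: list[dict], max_tokens: int,
                 count_tokens, summarize) -> list[dict]:
    """Keep the system prompt plus the newest messages under a token budget,
    folding everything dropped into a single summary message.

    count_tokens(text) -> int and summarize(messages) -> str are stand-ins
    for a real tokenizer and an LLM summarization call.
    """
    system, rest = messages[0], messages[1:]
    budget = max_tokens - count_tokens(system["content"])

    kept: list[dict] = []
    for msg in reversed(rest):                 # walk newest-to-oldest
        cost = count_tokens(msg["content"])
        if cost > budget:
            break
        kept.append(msg)
        budget -= cost
    kept.reverse()

    dropped = rest[:len(rest) - len(kept)]
    if dropped:
        summary = {"role": "system",
                   "content": "Summary of earlier turns: " + summarize(dropped)}
        return [system, summary] + kept
    return [system] + kept
```

Running this before each model call keeps the freshest turns verbatim while older history survives only as a compressed summary.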

Short-Term Memory: Conversation State

Short-term memory tracks the current task’s state---what has been tried, what worked, what the current plan is. In most implementations, this is simply the conversation history maintained by the agent framework.

More sophisticated agents use explicit scratchpads---a designated section of the prompt where the agent can write and update notes to itself:

# Agent scratchpad pattern
scratchpad = """
## Current Task State
- Goal: Migrate database schema from v2 to v3
- Completed: Backed up existing data, created migration script
- Blocked: Need to verify foreign key constraints before running migration
- Next: Query information_schema for FK references to 'users' table
"""

The scratchpad gives the agent a structured place to maintain state without relying on the conversation history alone, which can be noisy with tool outputs and intermediate reasoning.

Long-Term Memory: RAG and External Storage

When agents need to access information beyond their context window---corporate knowledge bases, documentation, previous conversations---they use long-term memory. This is typically implemented as a RAG pipeline backed by a vector store.

The architecture for agent long-term memory looks like:

def retrieve_relevant_context(
    query: str,
    memory_store: VectorStore,
    top_k: int = 5
) -> list[str]:
    """Retrieve relevant memories for the current query."""
    # Embed the query
    query_embedding = embed(query)

    # Search the vector store
    results = memory_store.similarity_search(
        query_embedding,
        top_k=top_k,
        filter={"type": "knowledge"}  # Filter by memory type
    )

    return [r.content for r in results]

Long-term memory is critical for agents that operate across sessions (remembering user preferences, past decisions, project context) or that need access to large knowledge bases.

Episodic Memory: Learning from Experience

Episodic memory stores records of past task executions---what the agent tried, what worked, and what failed. This allows agents to improve over time without retraining the underlying model.

# After completing a task, store the episode
episode = {
    "task": "Fix authentication bug in login flow",
    "approach": "Traced the error to a missing token refresh call",
    "tools_used": ["search_code", "read_file", "edit_file", "run_tests"],
    "outcome": "success",
    "steps_taken": 7,
    "key_insight": "Token refresh was skipped when session cookie existed",
    "timestamp": "2026-02-21T14:30:00Z"
}
memory_store.add(episode, type="episodic")

# When encountering a similar task later, retrieve relevant episodes
similar_episodes = memory_store.similarity_search(
    embed("authentication is failing for some users"),
    filter={"type": "episodic", "outcome": "success"}
)
# Inject as context: "In a previous similar task, you found that..."

Memory Comparison

Memory Type            Capacity         Latency     Persistence         Best For
Working (context)      128K–1M tokens   Instant     Current turn only   Active reasoning, current step
Short-term (history)   Context-limited  Instant     Current session     Task state, conversation flow
Long-term (RAG)        Unlimited        50–500ms    Permanent           Knowledge bases, documentation
Episodic (traces)      Unlimited        50–500ms    Permanent           Learning from past tasks

The most effective agents combine all four layers. Working memory holds the immediate context. Short-term memory tracks the task state. Long-term memory provides knowledge retrieval. Episodic memory offers relevant past experiences. The agent framework manages what gets loaded into the context window and when, balancing completeness against context window limits.

Multi-Agent Coordination

Single agents hit ceilings. Context windows fill up, tasks require disparate expertise, and error rates compound over long sequences of actions. Multi-agent architectures address these limitations by distributing work across specialized agents that coordinate to solve complex tasks.

Orchestrator-Worker Pattern

The most common multi-agent pattern uses a central orchestrator that delegates tasks to specialized worker agents:

                    ┌──────────────────┐
                    │   ORCHESTRATOR   │
                    │                  │
                    │  Understands     │
                    │  the full task,  │
                    │  delegates       │
                    │  sub-tasks       │
                    └──────┬───────────┘

              ┌────────────┼────────────┐
              ▼            ▼            ▼
     ┌────────────┐ ┌────────────┐ ┌────────────┐
     │  WORKER:   │ │  WORKER:   │ │  WORKER:   │
     │  Research  │ │  Code      │ │  Review    │
     │            │ │            │ │            │
     │  Search,   │ │  Write,    │ │  Read,     │
     │  read,     │ │  edit,     │ │  analyze,  │
     │  summarize │ │  test      │ │  critique  │
     └────────────┘ └────────────┘ └────────────┘

Each worker has its own context window, system prompt, and tool access scoped to its specialty. The orchestrator maintains the big picture and assembles results.

Claude Code’s subagent architecture follows this pattern. The main agent delegates read-only exploration tasks to an Explore subagent with constrained tool access (read, search, but no write). This keeps the main agent’s context focused on decision-making while the subagent handles information gathering in its own context window.

Advantages of orchestrator-worker:

  • Specialization: Each worker can have a focused system prompt and tool set
  • Context isolation: Workers don’t pollute each other’s context windows
  • Parallelization: Independent sub-tasks can run concurrently
  • Failure isolation: One worker failing doesn’t crash the entire system
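A minimal orchestrator can be sketched with plain functions standing in for LLM-backed agents (`decompose` and the worker callables are illustrative names): split the task, fan the sub-tasks out to specialists in parallel, and collect the results by role:

```python
from concurrent.futures import ThreadPoolExecutor

def orchestrate(task: str, decompose, workers: dict) -> dict:
    """Delegate sub-tasks to specialist workers and gather their results.

    decompose(task) -> list of (role, sub_task) pairs
    workers maps each role to a callable (an LLM-backed agent in practice).
    """
    subtasks = decompose(task)
    # Independent sub-tasks run concurrently, each in its own worker
    with ThreadPoolExecutor() as pool:
        futures = {pool.submit(workers[role], spec): role
                   for role, spec in subtasks}
        return {role: future.result() for future, role in futures.items()}
```

A real orchestrator would also inspect worker results, re-delegate failed sub-tasks, and synthesize a final answer, but the fan-out/gather core stays this simple.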

Peer-to-Peer / Swarm Pattern

In a swarm architecture, there is no central controller. Agents communicate as peers, negotiating who handles what and sharing information through a common protocol.

OpenAI’s experimental Swarm framework demonstrated this with two primitives: routines (instructions an agent follows) and handoffs (one agent transferring control to another). Each agent decides when to hand off based on its own assessment of whether the current task falls within its expertise.

# Simplified swarm handoff pattern (OpenAI's experimental Swarm library)
from swarm import Agent

triage_agent = Agent(
    name="Triage",
    instructions="Route customer requests to the appropriate specialist.",
    functions=[transfer_to_sales, transfer_to_support, transfer_to_billing]
)

sales_agent = Agent(
    name="Sales",
    instructions="Handle pricing questions and upsell opportunities.",
    functions=[lookup_pricing, create_quote, transfer_to_triage]
)

support_agent = Agent(
    name="Support",
    instructions="Resolve technical issues and bug reports.",
    functions=[search_kb, create_ticket, transfer_to_triage]
)

The swarm pattern excels when the routing logic is complex and dynamic---when you cannot easily pre-determine which agent should handle what. It struggles with tasks requiring tight coordination between agents, since there is no central planner ensuring coherence.

Debate / Adversarial Pattern

Some multi-agent systems use disagreement productively. Two or more agents independently analyze a problem and then challenge each other’s conclusions:

┌──────────────┐     ┌──────────────┐
│   AGENT A    │     │   AGENT B    │
│              │     │              │
│  Generates   │     │  Generates   │
│  answer +    │     │  answer +    │
│  reasoning   │     │  reasoning   │
└──────┬───────┘     └──────┬───────┘
       │                    │
       └──────┐   ┌─────────┘
              ▼   ▼
       ┌──────────────┐
       │   CRITIQUE   │
       │              │
       │  Each agent  │
       │  challenges  │
       │  the other   │
       └──────┬───────┘
              │
              ▼
       ┌──────────────┐
       │    JUDGE     │
       │              │
       │  Selects or  │
       │  synthesizes │
       │  best answer │
       └──────────────┘

This pattern is particularly effective for:

  • Code review: One agent writes code, another reviews it for bugs and style issues
  • Fact verification: Multiple agents independently research a claim and cross-check
  • Risk assessment: Optimistic and pessimistic agents debate a decision

The debate pattern catches errors that a single agent would miss because each agent acts as an adversary for the other’s reasoning blind spots.
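As a sketch, the debate loop reduces to a handful of model calls. Here `generate` is a stub standing in for a real LLM request:

```python
def generate(prompt: str) -> str:
    # Stub standing in for an LLM call.
    return f"response to: {prompt[:40]}"


def debate(question: str, rounds: int = 1) -> str:
    """Two agents answer independently, critique each other, and a judge
    selects or synthesizes the final answer."""
    answer_a = generate(f"Answer with reasoning: {question}")
    answer_b = generate(f"Answer with reasoning: {question}")

    for _ in range(rounds):
        # Each agent attacks the other's reasoning, then revises its own answer.
        critique_of_a = generate(f"Find flaws in this answer: {answer_a}")
        critique_of_b = generate(f"Find flaws in this answer: {answer_b}")
        answer_a = generate(f"Revise.\nAnswer: {answer_a}\nCritique: {critique_of_a}")
        answer_b = generate(f"Revise.\nAnswer: {answer_b}\nCritique: {critique_of_b}")

    # The judge is a third call that sees both final positions.
    return generate(
        f"Question: {question}\nA: {answer_a}\nB: {answer_b}\n"
        "Select or merge the stronger answer."
    )
```

The extra calls are the cost of the pattern: one debate round triples the number of model requests compared to a single-agent answer.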

Pipeline Pattern

The simplest multi-agent coordination is sequential handoff---one agent’s output becomes the next agent’s input:

┌──────────┐     ┌──────────┐     ┌──────────┐     ┌──────────┐
│ Research │────►│  Draft   │────►│  Review  │────►│  Polish  │
│ Agent    │     │  Agent   │     │  Agent   │     │  Agent   │
└──────────┘     └──────────┘     └──────────┘     └──────────┘

Each agent in the pipeline has a clear, focused role. The research agent gathers information. The draft agent writes content. The review agent identifies issues. The polish agent refines the final output.

Pipelines are predictable and easy to debug---you can inspect the output at each stage. They are well-suited for tasks with natural sequential phases. The downside is inflexibility: if the review agent finds a fundamental research gap, it cannot easily send the task back to the research agent without additional coordination logic.
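The sequential handoff is simple enough to sketch directly; each stage below is a stub where a real system would run an agent:

```python
from typing import Callable

Stage = Callable[[str], str]


def run_pipeline(task: str, stages: list[tuple[str, Stage]]) -> str:
    """Feed each stage's output into the next, logging every hand-off."""
    output = task
    for name, stage in stages:
        output = stage(output)
        # Inspectability is the pipeline's main benefit: every intermediate
        # result can be logged and examined.
        print(f"[{name}] -> {output[:60]}")
    return output


# Stub stages; each would be an agent call in a real pipeline.
pipeline = [
    ("research", lambda t: f"notes({t})"),
    ("draft",    lambda t: f"draft({t})"),
    ("review",   lambda t: f"reviewed({t})"),
    ("polish",   lambda t: f"final({t})"),
]
```

Running `run_pipeline("changelog", pipeline)` returns `final(reviewed(draft(notes(changelog))))`, and the printed lines show exactly which stage produced which intermediate output.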

Multi-Agent Coordination Challenges

Multi-agent systems introduce coordination overhead that does not exist in single-agent systems:

  • Shared state: How do agents share information? Through message passing, shared memory, or a central store? Each approach has different consistency and latency characteristics.
  • Token multiplication: Multi-agent systems consume roughly 5—15x more tokens than equivalent single-agent systems, because each agent needs context about the overall task plus its specific sub-task.
  • Error propagation: A mistake by one agent can cascade through the system. The orchestrator needs to detect and recover from worker failures.
  • Debugging difficulty: When the final output is wrong, tracing the error back to the responsible agent and step requires comprehensive logging.

The practical advice: start with a single agent and only move to multi-agent when you hit a concrete limitation---context window exhaustion, need for parallel execution, or tasks requiring genuinely different expertise. Multi-agent architectures solve real problems, but they also multiply complexity.

Reliability and Safety Patterns

Agents can fail in ways that traditional software cannot. A chatbot that gives a wrong answer is annoying. An agent that takes wrong actions---deleting files, sending incorrect API calls, modifying production data---can cause real damage. As agents gain more autonomy, reliability and safety patterns become essential infrastructure, not optional features.

Self-Correction: Verify-Then-Fix Loops

The simplest reliability pattern is having the agent check its own work:

def self_correcting_agent(task: str, tools: list[Tool]) -> str:
    """Agent that verifies its own outputs."""
    # Step 1: Execute the task
    result = react_agent(task, tools)

    # Step 2: Verify the result
    verification_prompt = f"""
    Original task: {task}
    Agent's result: {result}

    Verify this result:
    1. Does it fully address the original task?
    2. Are there any errors or inconsistencies?
    3. Were any steps missed?

    If the result is correct, respond with "VERIFIED".
    If not, explain what needs to be fixed.
    """
    verification = llm.generate(verification_prompt)

    if "VERIFIED" in verification:
        return result

    # Step 3: Attempt a fix (a production loop would verify again,
    # bounded by a retry limit)
    fix_prompt = f"""
    Task: {task}
    Previous result: {result}
    Issues found: {verification}

    Fix the issues and produce a corrected result.
    """
    return react_agent(fix_prompt, tools)

For coding agents, self-correction often takes the form of “write code, run tests, fix failures.” This is one of the most effective reliability patterns because it grounds verification in objective signals (test pass/fail) rather than the model’s self-assessment.
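A minimal sketch of that write-test-fix loop, with stubbed `write_code` and `run_tests` standing in for the model call and a real test runner:

```python
def write_code(prompt: str) -> str:
    # Stub for the model call: the first attempt is buggy, and the retry
    # (which sees the failure report in its prompt) is correct.
    if "failed" in prompt:
        return "def add(a, b):\n    return a + b"
    return "def add(a, b):\n    return a - b"


def run_tests(code: str) -> tuple[bool, str]:
    # Stub test runner: execute the code and check one assertion.
    namespace: dict = {}
    exec(code, namespace)
    if namespace["add"](2, 3) == 5:
        return True, "all tests passed"
    return False, "add(2, 3) returned the wrong value"


def code_test_fix(task: str, max_attempts: int = 3) -> str:
    """Write code, run tests, feed failures back until tests pass."""
    code = write_code(task)
    for _ in range(max_attempts):
        passed, report = run_tests(code)
        if passed:
            return code
        # The failing test output is the objective grounding signal.
        code = write_code(f"{task}\nPrevious attempt failed:\n{report}")
    return code
```

The loop is bounded by `max_attempts` so a model that cannot fix the failure does not retry forever.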

Human-in-the-Loop Checkpoints

For high-stakes actions, agents should pause and ask for human approval:

SENSITIVE_ACTIONS = {
    "delete_file", "drop_table", "send_email",
    "deploy", "modify_production", "transfer_funds"
}

def guarded_tool_execution(tool_call: ToolCall) -> ToolResult:
    """Execute tool with human approval for sensitive actions."""
    if tool_call.name in SENSITIVE_ACTIONS:
        # Present the action to the user
        approved = request_human_approval(
            action=tool_call.name,
            arguments=tool_call.arguments,
            reasoning=tool_call.reasoning
        )
        if not approved:
            return ToolResult(
                success=False,
                error="Action rejected by user"
            )

    return tools.execute(tool_call.name, tool_call.arguments)

The best human-in-the-loop systems are selective. They do not ask for approval on every action (which defeats the purpose of automation) but identify the specific actions where human judgment adds the most value. Good prompt engineering can also reduce the frequency of problematic actions by giving the model clearer constraints.

Guardrails: Input and Output Filtering

Guardrails are checks that run before the model sees input and after it produces output:

class AgentGuardrails:
    """Input and output filtering for agent safety."""

    def filter_input(self, user_input: str) -> str:
        """Check user input before it reaches the agent."""
        # Detect prompt injection attempts
        if self.detect_injection(user_input):
            raise GuardrailViolation("Potential prompt injection detected")

        # Check for sensitive data that shouldn't be processed
        if self.contains_pii(user_input):
            user_input = self.redact_pii(user_input)

        return user_input

    def filter_output(self, agent_output: str) -> str:
        """Check agent output before it reaches the user."""
        # Prevent leaking sensitive information
        if self.contains_secrets(agent_output):
            raise GuardrailViolation("Output contains sensitive data")

        # Check for harmful content
        if self.is_harmful(agent_output):
            raise GuardrailViolation("Output flagged as harmful")

        return agent_output

Sandboxing Tool Execution

Agents that execute code or shell commands need sandboxing to limit the blast radius of mistakes:

  • Filesystem sandboxing: Restrict file operations to a specific directory tree. The agent can read and write within its workspace but cannot access system files or other projects.
  • Network sandboxing: Limit which hosts and ports the agent can reach. Prevent access to internal services that the agent does not need.
  • Resource limits: Cap CPU time, memory usage, and disk writes. An agent stuck in a loop should not be able to consume unbounded resources.
  • Execution isolation: Run tool code in containers or sandboxed processes. If the agent generates and executes malicious code, the damage is contained.

# Example: sandboxed code execution. Note that `docker run` has no
# --timeout flag; the time limit comes from `timeout` inside the
# container (assumes coreutils in the image) plus a subprocess backstop.
import subprocess

def execute_code_sandboxed(code: str, workspace: str, timeout: int = 30) -> str:
    """Run agent-generated code in an isolated container."""
    try:
        result = subprocess.run(
            ["docker", "run", "--rm",
             "--network=none",            # No network access
             "--memory=512m",             # Memory limit
             "--cpus=1",                  # CPU limit
             "-v", f"{workspace}:/work",  # Only mount the workspace
             "python-sandbox",
             "timeout", str(timeout),     # Enforce the time limit in-container
             "python", "-c", code],
            capture_output=True,
            timeout=timeout + 5,          # Backstop if the container hangs
        )
    except subprocess.TimeoutExpired:
        return "Error: execution timed out"
    return result.stdout.decode() if result.returncode == 0 else result.stderr.decode()

Audit Trails and Observability

Every agent action should be logged in a structured format that supports debugging and compliance:

from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class AgentEvent:
    timestamp: datetime
    session_id: str
    step_number: int
    event_type: str          # "think", "tool_call", "tool_result", "error", "final"
    content: str             # The reasoning, tool arguments, or result
    token_usage: int
    latency_ms: float
    model: str
    tool_name: Optional[str]
    tool_arguments: Optional[dict]

Good observability lets you answer questions like: “Why did the agent delete that file?” by tracing back through its reasoning chain. This is essential for debugging production incidents and for building trust with stakeholders who are cautious about autonomous systems.
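Given an event schema like the one above, "why did the agent delete that file?" becomes a query over the log. A sketch, using a trimmed-down event type for brevity:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Event:  # trimmed version of AgentEvent, enough for the example
    session_id: str
    step_number: int
    event_type: str
    content: str
    tool_name: Optional[str] = None


def explain_action(events: list[Event], tool_name: str) -> list[Event]:
    """Return the chain of events in the same session up to and including
    the first call of the given tool, i.e. that call's causal context."""
    for i, event in enumerate(events):
        if event.event_type == "tool_call" and event.tool_name == tool_name:
            return [e for e in events[: i + 1]
                    if e.session_id == event.session_id]
    return []
```

Walking the returned chain surfaces the "think" events that preceded the tool call, which is usually enough to reconstruct the agent's rationale.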

Why Autonomous Agents Need More Safety

There is a direct relationship between autonomy and required safety infrastructure:

Autonomy Level                     Safety Requirements
─────────────────────────────────────────────────────────────────────────────
Copilot (suggestions only)         Basic output filtering
Tool-assisted (single step)        Input validation, output filtering
Supervised agent (multi-step)      All above + human checkpoints, audit trail
Autonomous agent (unsupervised)    All above + sandboxing, self-correction,
                                   comprehensive monitoring, rollback capability

The principle: every increase in autonomy should be matched by a corresponding increase in safety infrastructure. An agent that runs unsupervised for hours needs dramatically more guardrails than one that asks for permission at each step.
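One way to enforce that principle is to encode the ladder as data and have the deployment pipeline check it. The level and safeguard names below are illustrative, not a standard:

```python
# Illustrative autonomy levels and safeguard names; the cumulative
# ladder structure is the point, not the specific labels.
AUTONOMY_LADDER = ["copilot", "tool_assisted", "supervised", "autonomous"]

ADDED_SAFEGUARDS = {
    "copilot": {"output_filtering"},
    "tool_assisted": {"input_validation"},
    "supervised": {"human_checkpoints", "audit_trail"},
    "autonomous": {"sandboxing", "self_correction", "monitoring", "rollback"},
}


def required_safeguards(level: str) -> set[str]:
    """Each level inherits everything required by the levels below it."""
    idx = AUTONOMY_LADDER.index(level)
    required: set[str] = set()
    for lower in AUTONOMY_LADDER[: idx + 1]:
        required |= ADDED_SAFEGUARDS[lower]
    return required


def deployment_allowed(level: str, enabled: set[str]) -> bool:
    """Block deployment unless every required safeguard is enabled."""
    return required_safeguards(level) <= enabled
```

A check like `deployment_allowed("autonomous", enabled_safeguards)` in CI makes it impossible to quietly ship more autonomy without the matching infrastructure.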

Production Considerations

Building an agent that works in a demo is straightforward. Building one that works reliably in production---at scale, within a budget, with acceptable latency---requires careful engineering.

Cost Management

Agent loops multiply API costs. A single-turn interaction might use 1,000—2,000 tokens. An agent loop solving a complex task might execute 20+ steps, each consuming 5,000—50,000 tokens as the context window grows. Multi-agent systems multiply this further.

Strategies for managing cost:

Set token budgets. Define a maximum token spend per task and terminate gracefully when the budget is exhausted:

class BudgetedAgent:
    def __init__(self, max_tokens: int = 500_000):
        self.max_tokens = max_tokens
        self.tokens_used = 0

    def run(self, task: str) -> str:
        while self.tokens_used < self.max_tokens:
            response = self.step()
            self.tokens_used += response.usage.total_tokens

            if response.is_complete:
                return response.content

        return self.graceful_termination()

Use smaller models for simple steps. Not every step in an agent loop needs a frontier model. Use a fast, cheap model for tool selection and routing, and reserve the expensive model for complex reasoning.

Cache tool results. If the agent reads the same file twice, serve the cached version instead of executing the tool again.

Minimize context growth. Summarize older conversation turns, drop redundant tool results, and use selective retrieval instead of stuffing everything into the context window.
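A sketch of the last strategy: keep the system prompt and the most recent turns verbatim, and collapse everything older into a summary. The message shape and the `summarize` stub are assumptions; in practice `summarize` would call a cheap model:

```python
def summarize(messages: list[dict]) -> str:
    # Stub; a real implementation calls a cheap model to summarize.
    return f"{len(messages)} earlier messages elided"


def compact_context(messages: list[dict], keep_recent: int = 6) -> list[dict]:
    """Keep the system prompt and last `keep_recent` turns verbatim;
    collapse older turns into a single summary message."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    if len(rest) <= keep_recent:
        return messages

    old, recent = rest[:-keep_recent], rest[-keep_recent:]
    summary = {"role": "user",
               "content": "Summary of earlier conversation: " + summarize(old)}
    return system + [summary] + recent
```

Run before each model call, this caps context growth at the system prompt plus `keep_recent + 1` messages regardless of how long the loop runs.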

Latency Budgets

Users have different tolerance for latency depending on the context:

Use Case              Acceptable Latency    Implication
────────────────────────────────────────────────────────────────────
Interactive chat      1—5 seconds           Limit to 1—2 agent steps
Background task       Minutes               Full agent loop acceptable
Batch processing      Hours                 Multi-agent pipeline acceptable
CI/CD integration     Minutes               Bounded agent with timeout

For interactive use cases, streaming the agent’s reasoning (showing “Searching codebase…” or “Reading file…”) helps manage perceived latency even when actual latency is high.

Evaluating Agents

Traditional software testing---unit tests, integration tests---does not straightforwardly apply to agents. Agent behavior is non-deterministic, multi-step, and influenced by tool outputs that may change over time. Here are evaluation approaches that work:

Task completion benchmarks. Define a set of tasks with known solutions and measure how often the agent completes them correctly:

benchmarks = [
    {
        "task": "Find all TODO comments in the codebase",
        "expected": ["src/auth.py:42", "src/api.py:18", "tests/conftest.py:7"],
        # The harness passes this entry's "expected" list as the second
        # argument; a bare `expected` inside the lambda would be unresolved
        "eval": lambda result, expected: all(e in result for e in expected)
    },
    {
        "task": "Add input validation to the signup endpoint",
        "eval": lambda result, expected=None: (
            run_tests("tests/test_signup.py") == "PASS"
            and "validate" in read_file("src/routes/signup.py")
        )
    }
]

Trajectory analysis. Even when the final answer is correct, the agent might have taken a wildly inefficient path. Log and analyze the number of steps, tools used, tokens consumed, and errors encountered.

A/B testing. For user-facing agents, compare different configurations (model, system prompt, tools, planning strategy) by routing traffic and measuring user satisfaction and task completion.

Adversarial testing. Deliberately try to break the agent with edge cases: ambiguous instructions, conflicting information, tools that return errors, and tasks that are impossible to complete. How the agent handles failure is as important as how it handles success.

When to Use Agents vs. Simple Prompts

Not every task needs an agent. Agents add complexity, cost, and latency. Use them when you need the loop---when the task requires multiple steps, tool use, or adaptation based on intermediate results.

Use an Agent When…                         Use a Simple Prompt When…
────────────────────────────────────────────────────────────────────────────
Task requires multiple tool calls          Task is single-turn Q&A
Output depends on external data            All information is in the prompt
Task requires adaptation (if X then Y)     Processing is deterministic
User needs interactive problem-solving     User needs a one-shot answer
Quality requires self-verification         Output is easy to validate externally

The simplest system that solves the problem is usually the best one. Start with a prompt. Add tools if the prompt needs external data. Add a loop if a single tool call is not enough. Add planning if the loop needs structure. Add multiple agents if one agent cannot handle the scope. Each layer of complexity should be justified by a concrete limitation of the simpler approach.

Conclusion

Agentic AI is not a single technology---it is a set of architecture patterns that give LLMs the ability to operate in loops, use tools, maintain state, and coordinate with other agents. The patterns we covered---ReAct, function calling, MCP, multi-step planning, memory hierarchies, multi-agent coordination, and reliability safeguards---form the toolkit for building systems that go beyond question-answering into genuine task completion.

The field is moving fast. MCP has become an industry standard in barely a year. Multi-agent frameworks are evolving from experimental repos to production infrastructure. Models are getting better at long-horizon reasoning and tool use with every generation. But the fundamental architecture patterns are stabilizing. The ReAct loop, the orchestrator-worker pattern, the memory hierarchy---these are becoming the standard vocabulary of agent engineering, much as MVC and microservices became standard vocabulary for web development.

The practical advice is to start simple and add complexity only when you need it. A well-crafted ReAct agent with good tools and clear guardrails can solve most tasks that people reach for multi-agent systems to handle. When you do need multi-agent coordination, start with the orchestrator-worker pattern before exploring swarms or debate architectures. And regardless of the architecture, invest in reliability infrastructure---self-correction, human checkpoints, sandboxing, and observability---proportional to the autonomy you grant your agents.

The systems we are building today are laying the foundation for how humans and AI will collaborate for years to come. Getting the architecture right matters.
