Orchestrating LLM Agents in Production

Running a single LLM call is straightforward. You send a prompt, get a response, handle the output. But real production systems rarely stay that simple: tasks grow complex enough that no single context window can hold them, latency requirements push toward parallel execution, and different subtasks benefit from different model capabilities. Multi-agent systems address these problems, and they introduce a new class of failure modes that most tutorials ignore. Four patterns recur across production deployments: routing, task decomposition, state management, and failure recovery.

Routing: Dispatching Tasks to the Right Agent

The first problem in any multi-agent system is deciding which agent handles a given input. Get this wrong and you waste tokens sending a simple factual lookup to a costly reasoning model, or you send a nuanced analysis to a model that cannot handle it.

Three approaches dominate production systems. Rule-based routing is the most predictable: a set of explicit conditions maps input properties to agent identifiers. It is fast, debuggable, and has no LLM latency — but falls apart when input categories are fuzzy or overlap. Embedding-similarity routing computes a vector embedding of the input and finds the nearest labeled example via cosine similarity; it generalizes well but requires a maintained labeled set and adds one embedding-model call per request. LLM-based routing uses a small, fast model to classify the intent before forwarding to the specialist — the most flexible approach, but it adds a round-trip.

from __future__ import annotations
from dataclasses import dataclass
from typing import Callable
from openai import OpenAI
 
client = OpenAI()
 
@dataclass
class Route:
    name: str
    description: str
    handler: Callable[[str], str]
 
def llm_router(query: str, routes: list[Route]) -> str:
    """Use a cheap model to classify the query and dispatch to the correct handler."""
    route_descriptions = "\n".join(
        f"- {r.name}: {r.description}" for r in routes
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a routing classifier. Given a user query, respond with "
                    "exactly one route name from the list below. Output only the name.\n\n"
                    f"{route_descriptions}"
                ),
            },
            {"role": "user", "content": query},
        ],
        max_tokens=20,
        temperature=0,
    )
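    # content can be None (for example on a refusal), so guard before stripping
    chosen = (response.choices[0].message.content or "").strip()
    for route in routes:
        if route.name == chosen:
            return route.handler(query)
    # The classifier returned a name that is not registered: fail loudly
    # instead of silently dispatching to a default route.
    raise ValueError(f"Router returned unknown route name: {chosen!r}")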

LLM routers can hallucinate route names. Always validate the returned route name against the registered list before dispatching. A silent fallback to the first route can mask classification failures for hours before anyone notices.

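The key production trade-off is routing accuracy versus routing cost. For high-volume systems a hybrid usually works best: for example, a highly accurate rule-based router that handles the roughly 80% of queries with clear-cut categories, combined with an LLM router for the remaining 20%, costs less and fails more predictably than sending every query through an LLM.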
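The hybrid idea can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the rule patterns, the route names, and the `llm_classify` callable (standing in for an LLM classifier such as the one above) are all hypothetical.

```python
import re
from typing import Callable, Optional

# Hypothetical rule table: (compiled pattern, route name). Illustrative only.
RULES = [
    (re.compile(r"\b(refund|invoice|billing)\b", re.I), "billing"),
    (re.compile(r"\b(traceback|stack trace|exception)\b", re.I), "debugging"),
]


def rule_route(query: str) -> Optional[str]:
    """Return a route name only when exactly one rule matches."""
    matches = {name for pattern, name in RULES if pattern.search(query)}
    return matches.pop() if len(matches) == 1 else None


def hybrid_route(
    query: str,
    valid_routes: set,
    llm_classify: Callable[[str], str],
) -> str:
    """Try cheap rules first; call the LLM classifier only for the remainder."""
    name = rule_route(query)
    if name is None:
        # No rule matched, or several did: defer to the (injected) classifier.
        name = llm_classify(query).strip()
    if name not in valid_routes:
        # Surface the failure instead of silently picking a default route.
        raise ValueError(f"Unknown route name: {name!r}")
    return name
```

Treating an ambiguous match (two rules firing) the same as no match keeps the rule layer conservative: it only answers when it is unambiguous, and everything else pays for the LLM round-trip.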

Task Decomposition: Breaking Complex Jobs Into Agent-Sized Steps

Once you have routing, the next challenge is tasks that are too large for a single agent to complete in one shot. A request to “audit this codebase for security vulnerabilities, summarize the findings, and draft remediation tickets” spans multiple domains and easily exceeds any context window.

The naive approach — passing the entire task to one agent — fails in two predictable ways: the model runs out of context mid-execution, or it produces a shallow response that covers all three subtasks poorly. The production approach is to decompose the job before execution.

Two strategies exist. A static plan hardcodes the decomposition into the orchestrator: you know ahead of time that a codebase audit always decomposes into file scanning, pattern matching, and report generation. A planner agent is more flexible: you send the top-level goal to a capable model and ask it to produce a structured execution plan, which the orchestrator then runs. Static plans are faster and cheaper; planner agents handle tasks whose structure cannot be anticipated.

import json
from anthropic import Anthropic
from pydantic import BaseModel
 
anthropic = Anthropic()
 
class Step(BaseModel):
    id: str
    description: str
    depends_on: list[str]
    agent: str  # e.g. "code-analyst", "report-writer"
 
class ExecutionPlan(BaseModel):
    goal: str
    steps: list[Step]
 
def plan_task(goal: str) -> ExecutionPlan:
    """Ask a capable model to produce a structured execution plan."""
    response = anthropic.messages.create(
        model="claude-opus-4-7",
        max_tokens=1024,
        messages=[
            {
                "role": "user",
                "content": (
                    f"Break this goal into a structured execution plan as JSON.\n"
                    f"Goal: {goal}\n\n"
                    "Output a JSON object matching this schema:\n"
                    '{"goal": "...", "steps": [{"id": "step-1", '
                    '"description": "...", "depends_on": [], "agent": "..."}]}'
                ),
            }
        ],
    )
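    raw = response.content[0].text.strip()
    # Strip a Markdown code fence if the model wrapped its JSON in one.
    # removeprefix drops the literal "json" tag; lstrip("json") would strip
    # any leading run of the characters j, s, o, n instead.
    if raw.startswith("```"):
        raw = raw.split("```")[1].removeprefix("json").strip()
    # Raises pydantic.ValidationError on malformed JSON or a schema mismatch
    return ExecutionPlan.model_validate_json(raw)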
 
def execute_plan(plan: ExecutionPlan) -> dict[str, str]:
    """Execute steps in dependency order, collecting results."""
    completed: dict[str, str] = {}
    pending = list(plan.steps)
 
    while pending:
        ready = [s for s in pending if all(d in completed for d in s.depends_on)]
        if not ready:
            # No step is runnable: either a cycle or a depends_on id that no step defines
            raise RuntimeError("Circular or unsatisfiable dependency detected in plan")
        for step in ready:
            completed[step.id] = run_agent(step.agent, step.description, completed)
            pending.remove(step)
 
    return completed
 
def run_agent(agent: str, task: str, context: dict[str, str]) -> str:
    """Placeholder — dispatch to the appropriate agent implementation."""
    context_str = "\n".join(f"{k}: {v[:200]}" for k, v in context.items())
    response = anthropic.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        messages=[
            {
                "role": "user",
                "content": f"Agent: {agent}\nTask: {task}\nContext:\n{context_str}",
            }
        ],
    )
    return response.content[0].text

Pydantic validation on the planner’s output is not optional. LLMs produce structurally valid JSON most of the time — but “most of the time” is not production. Wrap every JSON parse in a try/except and have a fallback (retry with a stricter prompt, or fall back to a static plan).
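Schema validation checks field types, but it does not check that the `depends_on` references are consistent with each other. One way to catch that before execution is sketched below using the standard library's `graphlib`. The function name `validate_plan_graph` and the plain-dict input (step id mapped to the ids it depends on) are assumptions made for illustration.

```python
from graphlib import CycleError, TopologicalSorter


def validate_plan_graph(steps: dict) -> list:
    """Check a plan's dependency graph and return one valid execution order.

    `steps` maps each step id to the list of step ids it depends on.
    Raises ValueError for unknown dependencies or cycles.
    """
    for step_id, deps in steps.items():
        missing = [d for d in deps if d not in steps]
        if missing:
            raise ValueError(f"{step_id} depends on unknown steps: {missing}")
    try:
        # static_order() yields every node after all of its predecessors
        return list(TopologicalSorter(steps).static_order())
    except CycleError as exc:
        # The second element of args holds the detected cycle
        raise ValueError(f"Plan contains a dependency cycle: {exc.args[1]}") from exc
```

A plan that fails this check is a good candidate for the same fallback path as a JSON parse failure: retry the planner with a stricter prompt, or fall back to a static plan.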

The expected cost of a planner-based decomposition scales with the number of steps and the size of each step’s context. For a plan with $n$ steps, where step $i$ consumes $t_i$ tokens on average at a per-token cost of $c_i$ (which depends on the model that step uses), the expected total cost is:

$\mathbb{E}[\text{cost}] = \sum_{i=1}^{n} p_i \cdot t_i \cdot c_i$

where $p_i$ is the probability that step $i$ actually executes (steps with unsatisfied dependencies may be skipped on partial success). In practice, estimate this before choosing between static and dynamic decomposition, and remember that a planner agent adds one high-token planning call up front that this sum does not include.
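As a rough sketch, the sum translates directly into a few lines of Python. The `StepEstimate` type and the optional `planning_cost` argument (the one-off planner call) are assumptions added for illustration, and any numbers you feed in are your own estimates.

```python
from dataclasses import dataclass


@dataclass
class StepEstimate:
    p_execute: float       # p_i: probability the step actually runs
    tokens: float          # t_i: average tokens consumed by the step
    cost_per_token: float  # c_i: per-token price of the step's model


def expected_cost(steps: list, planning_cost: float = 0.0) -> float:
    """Sum p_i * t_i * c_i over all steps, plus an optional planner call."""
    return planning_cost + sum(
        s.p_execute * s.tokens * s.cost_per_token for s in steps
    )
```

Comparing `expected_cost(steps)` for a static plan against `expected_cost(steps, planning_cost=...)` for a planner-generated one makes the overhead of dynamic decomposition explicit.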

State Management: Shared Memory at Scale

With multiple agents executing in parallel or sequence, state becomes the hardest problem. Each agent needs access to results from upstream steps, but you cannot pass the entire accumulated context to every downstream call — context windows blow up, latency compounds, and you hit the “lost in the middle” effect where models ignore information buried in long prompts.

Three patterns dominate production state management. A scratchpad pattern stores intermediate results in a structured object and passes only the relevant subset to each agent. A shared message bus (often Redis or a message queue) lets agents publish and subscribe to results without tight coupling. A summarization layer periodically condenses prior steps into a compact summary that downstream agents use instead of the full history.

import redis
import json
from dataclasses import dataclass, field, asdict
 
@dataclass
class AgentState:
    task_id: str
    goal: str
    completed_steps: dict[str, str] = field(default_factory=dict)
    summary: str = ""
    token_budget_remaining: int = 100_000
 
class RedisStateStore:
    def __init__(self, host: str = "localhost", port: int = 6379):
        self._client = redis.Redis(host=host, port=port, decode_responses=True)
        self._ttl = 3600  # 1 hour — tasks should not run forever
 
    def save(self, state: AgentState) -> None:
        key = f"agent_state:{state.task_id}"
        self._client.setex(key, self._ttl, json.dumps(asdict(state)))
 
    def load(self, task_id: str) -> AgentState | None:
        key = f"agent_state:{task_id}"
        raw = self._client.get(key)
        if raw is None:
            return None
        return AgentState(**json.loads(raw))
 
    def update_step(self, task_id: str, step_id: str, result: str) -> None:
        state = self.load(task_id)
        if state is None:
            raise KeyError(f"No state found for task {task_id}")
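        # Trim result to avoid unbounded state growth
        state.completed_steps[step_id] = result[:2000]
        # Rough budget estimate: word count is only a proxy for token count
        state.token_budget_remaining -= len(result.split())
        # Note: this load-modify-save sequence is not atomic; concurrent
        # writers to the same task can overwrite each other's updates.
        self.save(state)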

Stale state is the silent killer in long-running agent workflows. If a step fails and is retried, the previous (partial) result may still be in the store. Always write state atomically: update the step result and its status flag (pending/complete/failed) in a single operation. Reads that see “complete” without a result, or a result without “complete”, indicate a partial write — treat them as failures.
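One simple way to get that atomicity is to serialize the status and the result into a single value, so they can never be written separately. The sketch below shows only the encoding and the consistency check; the function names and record layout are hypothetical. Because Redis executes each command atomically, storing the encoded string with one SET under a per-step key writes both fields together.

```python
import json
from typing import Optional

VALID_STATUSES = {"pending", "complete", "failed"}


def encode_step_record(status: str, result: Optional[str] = None) -> str:
    """Serialize status and result together so one write stores both."""
    if status not in VALID_STATUSES:
        raise ValueError(f"Unknown status: {status!r}")
    if (status == "complete") != (result is not None):
        raise ValueError("A result must be present if and only if status is 'complete'")
    return json.dumps({"status": status, "result": result})


def decode_step_record(raw: str) -> tuple:
    """Parse a stored record; report inconsistent records as failed."""
    record = json.loads(raw)
    status, result = record.get("status"), record.get("result")
    consistent = (status == "complete") == (result is not None)
    if status not in VALID_STATUSES or not consistent:
        # "complete" without a result, or a result without "complete"
        return ("failed", None)
    return (status, result)
```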

At scale, context window blow-up becomes the dominant cost driver. Rather than accumulating raw step outputs, run a summarization agent after every $k$ completed steps to compress the history. Downstream agents then receive the summary plus only the immediately preceding step, which can cut prompt tokens sharply (on the order of 80% fewer than passing full history, depending on $k$ and on how verbose the step outputs are).
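A minimal sketch of that rolling summary follows. The `RollingHistory` class is hypothetical, and `summarize` is an injected callable standing in for the summarization agent (in a real system it would call a model).

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class RollingHistory:
    k: int                                     # compress after every k outputs
    summarize: Callable[[str, list], str]      # (old summary, new outputs) -> summary
    summary: str = ""
    unsummarized: list = field(default_factory=list)
    last_output: str = ""

    def add(self, output: str) -> None:
        """Record a step output; compress once k outputs have accumulated."""
        self.unsummarized.append(output)
        self.last_output = output
        if len(self.unsummarized) >= self.k:
            self.summary = self.summarize(self.summary, self.unsummarized)
            self.unsummarized = []

    def context(self) -> str:
        """Summary plus only the immediately preceding step's output."""
        return "\n\n".join(p for p in (self.summary, self.last_output) if p)
```

Note the trade-off this design carries: outputs that are neither summarized yet nor the most recent one are absent from `context()` until the next compression, so a small `k` keeps that gap short at the price of more summarization calls.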

Failure Recovery: When Agents and Networks Break

Every agent call can fail: the model API returns a 500, the response exceeds the timeout, the output fails validation, a downstream dependency is unavailable. Production systems need explicit failure handling at every layer.

The simplest pattern is exponential backoff with jitter on transient errors. tenacity wraps this cleanly:

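import logging

from openai import (
    APIConnectionError,
    APITimeoutError,
    InternalServerError,
    OpenAI,
    RateLimitError,
)
from tenacity import (
    before_sleep_log,
    retry,
    retry_if_exception_type,
    stop_after_attempt,
    wait_random_exponential,
)

logger = logging.getLogger(__name__)
client = OpenAI()  # create once and reuse across calls

# Retry only errors that are plausibly transient. The base APIError also covers
# 4xx responses such as invalid requests, which will not succeed on retry.
TRANSIENT_ERRORS = (APIConnectionError, APITimeoutError, InternalServerError, RateLimitError)

@retry(
    retry=retry_if_exception_type(TRANSIENT_ERRORS),
    stop=stop_after_attempt(4),
    wait=wait_random_exponential(multiplier=1, max=30),  # exponential backoff with jitter
    before_sleep=before_sleep_log(logger, logging.WARNING),
    reraise=True,  # surface the original exception instead of tenacity.RetryError
)
def call_with_retry(model: str, messages: list[dict]) -> str:
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        timeout=30,
    )
    # content can be None; normalize to an empty string for callers
    return response.choices[0].message.content or ""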

Retries handle transient failures, but they do not handle semantic failures — when the model returns a response that is syntactically valid but does not satisfy the acceptance criteria. For these cases you need a fallback agent: a simpler, more constrained prompt that is less likely to produce garbage but also less capable.

from pydantic import BaseModel, ValidationError
 
class StructuredOutput(BaseModel):
    summary: str
    action_items: list[str]
    confidence: float
 
def call_with_fallback(primary_prompt: str, fallback_prompt: str) -> StructuredOutput:
    """Try the primary agent; fall back to a simpler prompt on validation failure."""
    for prompt in [primary_prompt, fallback_prompt]:
        try:
            raw = call_with_retry("gpt-4o", [{"role": "user", "content": prompt}])
            # Strip a Markdown code fence if present
            if raw.startswith("```"):
                raw = raw.split("```")[1].removeprefix("json").strip()
            return StructuredOutput.model_validate_json(raw)
        except Exception as exc:  # includes pydantic.ValidationError
            logger.warning("Agent call failed, moving to next option: %s", exc)
    # Both failed: log it and return a safe empty result rather than raising,
    # so callers must treat confidence == 0.0 as "no usable output"
    logger.error("Primary and fallback prompts both failed")
    return StructuredOutput(summary="", action_items=[], confidence=0.0)

Idempotency is non-negotiable for agents that write to external systems. Assign a stable task_id before any agent call begins. Before executing a step, check whether a successful result already exists in the state store. This makes your orchestrator safe to retry at the task level without duplicating side effects — essential when a partial workflow crashes and must be resumed.

For tasks that may partially succeed (say, five of eight subtasks complete before a failure), partial-result reconciliation is better than full retry. Store each step’s result atomically as it completes. On re-execution, skip any step with a successful result in the state store and only re-run the failed or missing steps. With $\bar{t}$ the average tokens per step and $\bar{c}$ the average cost per token, this converts a full retry (cost: $n \cdot \bar{t} \cdot \bar{c}$) into an incremental retry (cost: roughly $f \cdot \bar{t} \cdot \bar{c}$, where $f$ is the number of failed or missing steps).
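The resume logic can be sketched in a few lines. Here a plain dict stands in for the state store, and both `resume_plan` and the `run_step` callable are hypothetical names used for illustration.

```python
from typing import Callable


def resume_plan(
    step_ids: list,
    results: dict,
    run_step: Callable[[str], str],
) -> list:
    """Run only the steps without a stored result; return the ids that ran.

    `step_ids` must already be in dependency order. `results` stands in for
    the state store and is updated in place as each step completes.
    """
    executed = []
    for step_id in step_ids:
        if step_id in results:
            # Completed in an earlier run: skip it so side effects are not repeated
            continue
        # If run_step raises, results from earlier steps are already stored
        results[step_id] = run_step(step_id)
        executed.append(step_id)
    return executed
```

Calling the same function for both the first run (empty `results`) and every resume keeps the orchestrator's control flow identical in the normal and recovery paths.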

Putting It Together

These four patterns compose. In a production workflow: the router dispatches the input to the appropriate decomposer; the decomposer produces a plan (static or dynamic); the orchestrator executes the plan, writing results to a state store and summarizing periodically; and every agent call is wrapped with retry logic and fallback agents. The state store is the backbone — it makes the entire system resumable, observable, and idempotent.

Order matters: build routing and decomposition first, then state management and failure recovery. Trying to implement all four at once tends to leave none of them working well. The hybrid router (rules for roughly 80% of volume, an LLM for the remaining 20%) costs less and fails more predictably than a pure LLM router. And the state store is not an infrastructure detail: it is what turns a prototype that crashes into a system that can resume, be observed, and be audited.