15 Patterns That Keep Production AI Agents From Burning Down Prod
Your agent demo works perfectly. It calls tools, reasons through problems, and delivers impressive results. Then you deploy it.
Suddenly it’s retrying a dead API 4,000 times. It’s spending $200 in tokens on a single user request. It’s mutating production data without guardrails. Three agents are fighting over the same ticket, each convinced the other two are wrong.
The gap between “agent that works” and “agent you can trust in production” isn’t about smarter models. It’s about the same boring infrastructure patterns that have kept distributed systems alive for decades, adapted for a world where your “microservice” can hallucinate.
Here are 15 patterns that close that gap. I’ve grouped them into five categories: resilience, containment, architecture, state management, and operations.
The Resilience Stack
These four patterns work together to handle failures gracefully instead of cascading them.
1. Agent Circuit Breaker
A circuit breaker stops an agent from hammering a dependency that is clearly failing: a provider outage, persistent 5xx responses, a buggy tool. Without one, your agent will retry into oblivion, burning tokens and cascading the failure downstream.
The pattern has three states:
%%{init: {"layout": "dagre"}}%%
stateDiagram-v2
[*] --> Closed
Closed --> Open : failure threshold crossed
Open --> HalfOpen : cooldown expires
HalfOpen --> Closed : probe succeeds
HalfOpen --> Open : probe fails
In the closed state, requests flow normally. When error rates cross a threshold (say, 5 failures in 60 seconds), the circuit opens and all new requests fail fast with a controlled error. After a cooldown, the circuit moves to half-open and probes with limited traffic to test recovery.
You can also apply this concept to safety. Research on representation-level circuit breaking shows you can detect harmful internal activations and short-circuit a model before it emits unsafe content. For production agents, that means guardrail layers that terminate runs when certain patterns appear (policy violations, repeated jailbreak attempts, calls to dangerous tools) instead of relying purely on output filters.
Design checklist:
- Track rolling failure metrics (error rate, timeouts, provider-specific 4xx/5xx) per backend or tool
- Model the three states explicitly. Don’t just do “retry 3 times then give up”
- Combine with retries and fallbacks: transient issues get retried, sustained issues trip the breaker
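The state machine above is small enough to sketch directly. Here is a minimal Python version, assuming a fixed failure threshold over a rolling window; the class name and defaults are illustrative, not from any particular library:

```python
import time

class CircuitBreaker:
    """Minimal three-state breaker: closed -> open -> half_open -> closed."""

    def __init__(self, failure_threshold=5, window_s=60, cooldown_s=30):
        self.failure_threshold = failure_threshold
        self.window_s = window_s          # rolling window for failure counting
        self.cooldown_s = cooldown_s      # how long to stay open before probing
        self.failures = []                # timestamps of recent failures
        self.state = "closed"
        self.opened_at = None

    def allow(self):
        """Call before each request; False means fail fast."""
        if self.state == "open":
            if time.monotonic() - self.opened_at >= self.cooldown_s:
                self.state = "half_open"  # let a probe request through
                return True
            return False
        return True

    def record_success(self):
        self.failures.clear()
        self.state = "closed"

    def record_failure(self):
        now = time.monotonic()
        # Keep only failures inside the rolling window.
        self.failures = [t for t in self.failures if now - t < self.window_s]
        self.failures.append(now)
        if self.state == "half_open" or len(self.failures) >= self.failure_threshold:
            self.state = "open"
            self.opened_at = now
```

Give each backend or tool its own breaker instance; call `allow()` before the request and `record_success()` / `record_failure()` after, so transient blips retry normally while sustained failures trip the breaker.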
2. Tool Invocation Timeout
Without timeouts, a slow or hung dependency stalls your entire agent run. The user sees a spinner. Your token meter keeps ticking. Nothing happens.
Set per-tool timeout budgets based on realistic latency data:
| Tool Type | Typical p95 | Timeout Budget |
|---|---|---|
| Database query | 50ms | 150ms |
| Web search | 800ms | 2,000ms |
| Code execution | 2,000ms | 5,000ms |
| LLM sub-call | 3,000ms | 8,000ms |
| External API | 500ms | 1,500ms |
| Overall wall clock | — | 30,000ms |
Set each tool’s timeout at 1.5-2x its typical p95. Add an overall “wall clock” limit per agent run so no single request can block indefinitely regardless of how many tools it chains.
The key insight: classify timeouts separately from logical errors. A timeout means “we don’t know what happened.” A 400 error means “the request was bad.” Different root causes, different remediation paths. When a tool crosses its timeout threshold repeatedly, couple it with the circuit breaker to stop trying altogether.
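A per-tool timeout wrapper can be sketched with a worker thread, so the agent loop gives up without waiting on the hung call. `ToolTimeout` and `call_with_timeout` are hypothetical names for illustration:

```python
import concurrent.futures

class ToolTimeout(Exception):
    """Timeout -- 'we don't know what happened', distinct from a logical error."""

def call_with_timeout(fn, *args, timeout_s=2.0, **kwargs):
    """Run a tool call with a hard wall-clock budget."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn, *args, **kwargs)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        raise ToolTimeout(f"{getattr(fn, '__name__', 'tool')} exceeded {timeout_s}s budget")
    finally:
        # Don't block the agent waiting for a hung worker to finish.
        pool.shutdown(wait=False)
```

Raising a distinct `ToolTimeout` type is what keeps timeouts classified separately from logical errors, so the retry and circuit-breaker layers can treat the two differently.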
3. Idempotent Tool Calls
Retries are only safe if your tools can handle being called twice with the same inputs. Without idempotency, a retry after a timeout might double-charge a credit card, send a duplicate email, or create two Jira tickets.
%%{init: {"layout": "dagre"}}%%
flowchart LR
Agent -->|"op_id: abc-123"| Tool[Tool: Create Order]
Tool --> Log["Operation Log\n(abc-123: completed)"]
Agent -->|"retry op_id: abc-123"| Tool
Tool --> Log
Log -->|"already exists"| NoOp[Return cached result]
For read-only tools: they’re already idempotent. No changes needed.
For write tools: require a caller-supplied idempotency key. Store operation logs keyed by that ID so retries return the cached result instead of re-executing side effects.
For non-idempotent operations you can’t redesign (legacy APIs, third-party services): simulate idempotency with deduplication keys or “upsert” semantics at the integration layer.
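The caller-supplied key plus operation log can be sketched as a decorator. The `op_log` dict stands in for a durable store (in production, a database table keyed by idempotency key); all names here are illustrative:

```python
def idempotent(op_log):
    """Wrap a write tool so retries with the same op_id return the logged
    result instead of re-executing side effects."""
    def wrap(fn):
        def inner(op_id, *args, **kwargs):
            if op_id in op_log:           # retry: serve cached result, no side effect
                return op_log[op_id]
            result = fn(*args, **kwargs)
            op_log[op_id] = result        # record completion before acknowledging
            return result
        return inner
    return wrap
```

With this in place, the retry path in the diagram above becomes safe: the second call with `op_id` "abc-123" finds the log entry and returns the cached result without creating a second order.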
4. Dead Letter Queue for Failed Runs
A dead letter queue (DLQ) holds agent runs that couldn’t complete after configured retry attempts. Instead of losing them or retrying forever, you park them for human triage.
%%{init: {"layout": "dagre"}}%%
flowchart LR
Queue[Task Queue] --> Agent
Agent -->|success| Done[Completed]
Agent -->|retry 1| Agent
Agent -->|retry 2| Agent
Agent -->|retry 3 fails| DLQ[Dead Letter Queue]
DLQ --> Human[Human Review]
Human -->|fixed| Queue
Why this matters for agents specifically: agent failures are messier than typical service failures. A failed run might have partial state, tool outputs from earlier steps, and a decision history that matters for debugging. Attach all of it as metadata.
- Define per-task max attempts before DLQ. Three is a good default. Don’t allow unbounded retries
- Build dashboards and alerts on DLQ volume. Spikes are early canaries for regressions
- Once the underlying bug is fixed, forward repaired messages back to the main queue for reprocessing
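The retry-then-park flow can be sketched with an in-memory queue; a real system would use a durable broker, and the failure metadata attached here is deliberately minimal:

```python
from collections import deque

def process_with_dlq(tasks, handler, max_attempts=3):
    """Run tasks through handler; park tasks that exhaust max_attempts
    in a dead letter queue with context for human triage."""
    queue = deque((task, 0) for task in tasks)
    dlq, done = [], []
    while queue:
        task, attempts = queue.popleft()
        try:
            done.append(handler(task))
        except Exception as exc:
            attempts += 1
            if attempts >= max_attempts:
                # Attach whatever failure context you have for debugging.
                dlq.append({"task": task, "attempts": attempts, "error": str(exc)})
            else:
                queue.append((task, attempts))  # bounded retry
    return done, dlq
```

In a real agent system the DLQ entry would also carry partial state, earlier tool outputs, and the decision history, per the point above.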
The Containment Layer
These patterns limit the blast radius when an agent misbehaves.
5. Blast Radius Limiter
Even with circuit breakers and DLQs, you need hard caps on what an agent can do per request. Think of it as the agentic equivalent of IAM policies, rate limits, and spending quotas.
| Resource | Per-Request Limit | Per-Session Limit | Per-Day Limit |
|---|---|---|---|
| LLM tokens | 8,000 | 50,000 | 500,000 |
| Tool calls | 10 | 50 | 500 |
| DB mutations | 5 | 20 | 100 |
| Emails sent | 1 | 5 | 20 |
| Estimated cost | $0.50 | $5.00 | $50.00 |
Separate “read-only” and “write” environments. Reads get generous limits. Writes get strict limits and approval gates. When a limit is hit, alert and escalate to human review instead of silently dropping work.
Why this works: gateway-level observability makes this enforceable by tracking latency, token usage, and costs per route, user, or workflow. Limit breaches trigger automated shutdowns before they become incidents.
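A per-request limiter is mostly bookkeeping plus a hard stop. A minimal sketch, using the per-request column from the table above (class and limit names are illustrative):

```python
class BlastRadiusLimiter:
    """Hard per-request caps; raises before the action executes."""

    LIMITS = {"llm_tokens": 8000, "tool_calls": 10, "db_mutations": 5}

    def __init__(self):
        self.used = {k: 0 for k in self.LIMITS}

    def charge(self, resource, amount=1):
        """Call before spending the resource; refuses work past the cap."""
        if self.used[resource] + amount > self.LIMITS[resource]:
            raise RuntimeError(f"blast-radius limit hit: {resource}")
        self.used[resource] += amount
```

The key property is that `charge` runs before the side effect, so a runaway loop fails on the 11th tool call instead of the 4,000th; the `RuntimeError` is where you hook alerting and escalation.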
6. Confidence Threshold Gate
A confidence gate blocks risky actions when the model is uncertain and routes them to safer alternatives: ask a clarifying question, use a simpler flow, or escalate to a human.
%%{init: {"layout": "dagre"}}%%
flowchart TB
Agent[Agent Decision] --> Score{Confidence\nScore}
Score -->|"> 0.85"| Execute[Auto-execute]
Score -->|"0.60 - 0.85"| Confirm[Ask User\nto Confirm]
Score -->|"< 0.60"| Escalate[Route to\nHuman]
You can estimate confidence through model self-critique, external verifier models, or classifier heads. Anything below threshold gets treated as “not safe to automate.”
Design checklist:
- Define per-route confidence thresholds and escalation policies before launch, not after an incident
- Add secondary triggers beyond confidence scores: negative sentiment spikes, repeated user rephrasing, explicit request for a human
- Log confidence scores alongside outcomes to tune thresholds over time. Start conservative, loosen as you gather data
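The routing in the diagram is a small dispatch function. A sketch assuming the thresholds shown (0.85 and 0.60); the callables stand in for whatever execute/confirm/escalate mean in your system:

```python
def route_by_confidence(confidence, execute, confirm, escalate,
                        auto_threshold=0.85, confirm_threshold=0.60):
    """Route an action by confidence score: auto-execute, ask the user
    to confirm, or escalate to a human."""
    if confidence > auto_threshold:
        return execute()
    if confidence >= confirm_threshold:
        return confirm()
    return escalate()   # below threshold: not safe to automate
```

Making the thresholds parameters rather than constants is what lets you define per-route policies and tune them from logged outcomes.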
7. Human Escalation Protocol
Human-in-the-loop (HITL) isn’t just “sometimes ask a human.” It’s a designed protocol for when, how, and where humans intervene.
Naive approaches escalate entire conversations when the AI gets stuck. Better approaches let humans answer targeted questions while the AI stays primary. One expert supporting many concurrent sessions, responding to precise prompts instead of reading raw logs.
| Trigger | Escalation Type | Format |
|---|---|---|
| Confidence below threshold | Targeted question | "Is this the correct account? Options: A, B, C" |
| High-risk topic (legal, medical) | Full handoff | Complete context summary + conversation history |
| Negative sentiment spike | Collaborative | Agent stays primary, human reviews in real-time |
| User requests human | Immediate transfer | Warm handoff with context |
The critical feedback loop: capture human responses as labeled data so the agent improves at those exact edge cases over time. Every escalation should make the next one less likely.
Architecture Decisions
These patterns shape how you structure the overall system.
8. Orchestration vs Choreography
In orchestration, a central controller drives the whole workflow. It calls tools and agents in order, handles branching, retries, and compensation. The control flow lives in one place.
In choreography, agents react to events and each other without a central brain. Behavior emerges from event subscriptions and message flows.
| Dimension | Orchestration | Choreography |
|---|---|---|
| Control flow | Explicit, centralized | Emergent, distributed |
| Debugging | Follow the conductor | Trace across N services |
| Failure handling | One place to catch errors | Each agent handles its own |
| Coupling | Conductor knows all agents | Agents only know events |
| Best for | Regulated, business-critical flows | Loosely coupled, event-driven tasks |
For most production agent systems, start with orchestration. You need explicit governance and observability over every step, especially in financial operations, customer-facing flows, or anything with compliance requirements.
Reserve choreography for loosely coupled agents that discover and react to events (notifications, enrichment, personalization). But back it with strong distributed tracing and DLQs, because debugging emergent behavior across independent agents is genuinely hard.
9. LLM Gateway
An LLM gateway is a layer that all model traffic passes through. It abstracts over multiple providers while centralizing routing, auth, quotas, and observability.
%%{init: {"layout": "dagre"}}%%
flowchart TB
A1[Agent 1] --> GW[LLM Gateway]
A2[Agent 2] --> GW
A3[Agent 3] --> GW
GW --> |"routing\npolicy"| P1[OpenAI]
GW --> P2[Anthropic]
GW --> P3[Google]
GW --> Trace[Trace Store]
GW --> Budget[Cost Budget]
GW --> Limits[Rate Limits]
Because the gateway sees every LLM call, it’s the natural place to enforce rate limits, cost budgets, circuit breakers, and audit policies without touching application code at each call site.
- Use the gateway as the only way agents talk to models. No side-door credentials
- Implement routing policies (cheaper models for simple tasks, low-latency models for real-time, fallbacks for outages) and log them as spans
- Export traces via OpenTelemetry so LLM telemetry lines up with your existing infrastructure monitoring
10. Semantic Caching
In production, a large fraction of queries are paraphrases of previous ones. Without caching, you recompute full retrieval and generation pipelines every time.
A semantic cache stores results keyed by meaning, not exact string matches. It sits at the front of the agent loop: if the user’s intent matches a previous one closely enough, return the cached answer instead of re-executing the whole chain.
Query: "What's the refund policy?"
Cache: HIT (similarity: 0.94, threshold: 0.90)
↳ matched: "How do I get a refund?"
↳ saved: ~2,100 tokens, ~1.8s latency
Case studies report up to 80% cost and latency reductions for high-frequency query patterns. But the threshold tuning matters. Too aggressive and you serve stale or wrong answers. Too conservative and you barely save anything.
Cache in layers with different TTLs:
| Layer | TTL | Example |
|---|---|---|
| LLM responses | 1-4 hours | Factual Q&A |
| Deterministic tool results | 5-15 minutes | API lookups |
| Embeddings | 24-72 hours | Document vectors |
| Session state | Session duration | Conversation context |
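The lookup side of a semantic cache is just nearest-neighbor search over query vectors with a similarity cutoff. A toy sketch using a bag-of-words vector as a stand-in for a real embedding model (production systems would use an embedding API and a vector store; all names here are illustrative):

```python
import math
from collections import Counter

def embed(text):
    # Toy stand-in: bag-of-words counts instead of a learned embedding.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold=0.90):
        self.threshold = threshold
        self.entries = []                 # (query_vector, cached_answer)

    def get(self, query):
        qv = embed(query)
        best = max(self.entries, key=lambda e: cosine(qv, e[0]), default=None)
        if best and cosine(qv, best[0]) >= self.threshold:
            return best[1]                # close enough: serve the cached answer
        return None                       # miss: run the full pipeline

    def put(self, query, answer):
        self.entries.append((embed(query), answer))
```

The `threshold` parameter is exactly the tuning knob discussed above: lower it and you serve more (possibly wrong) cache hits, raise it and you barely save anything.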
State Management
These patterns keep long-running and multi-agent systems coherent.
11. Context Window Checkpointing
Long-running agents eventually hit context limits. Without a strategy, they either truncate important history or blow up with oversized prompts.
Context checkpointing periodically distills the current agent state into a compact summary, then continues with a fresh context window seeded from that checkpoint.
%%{init: {"layout": "dagre"}}%%
flowchart LR
T1["Turns 1-20\n(raw history)"] --> CP1["Checkpoint 1\n(summary + facts)"]
CP1 --> T2["Turns 21-40\n(raw history)"]
T2 --> CP2["Checkpoint 2\n(summary + facts)"]
CP2 --> T3["Turns 41+\n(current context)"]
Implementation approach:
- Summarize every N turns or when token count nears 70-80% of the limit
- Replace raw history with a model-generated summary plus key extracted facts (entity IDs, decisions made, constraints discovered)
- Store durable state (task graph, external IDs, approval records) in a database, not prompt text. Prompts are volatile. Databases are not
- Make checkpointing itself idempotent and observable so you can debug “lost context” bugs
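The trigger-and-distill step can be sketched as a pure function. Token counting is approximated by word count here, and `summarize` stands in for a model call that returns a compact summary; both are assumptions for illustration:

```python
def maybe_checkpoint(history, summarize, token_limit=8000, trigger=0.75):
    """Distill history into a checkpoint once token usage nears the limit,
    then continue with a fresh context seeded from the summary."""
    tokens = sum(len(turn.split()) for turn in history)   # crude token estimate
    if tokens < trigger * token_limit:
        return history                                    # room left: keep raw history
    checkpoint = "[checkpoint] " + summarize(history)
    return [checkpoint]                                   # fresh window, seeded
```

Because the function is deterministic given its inputs, re-running it after a crash produces the same checkpoint, which is what the idempotency bullet above asks for.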
12. Multi-Agent State Sync
As soon as you have multiple agents working on the same user, ticket, or dataset, you need a shared state model. Without one, you get conflicting actions and weird loops.
The fix: put the “world state” in a database or event log that agents read and write explicitly, rather than burying it inside each agent’s prompt history.
%%{init: {"layout": "dagre"}}%%
flowchart TB
Agent1[Research Agent] --> |"read/write"| State[(Shared State\nDB + Event Log)]
Agent2[Writing Agent] --> |"read/write"| State
Agent3[Review Agent] --> |"read/write"| State
State --> |"events"| Agent1
State --> |"events"| Agent2
State --> |"events"| Agent3
- Use a single source of truth for shared entities (ticket, order, document) with optimistic locking or versioning
- Have agents publish state changes as events that others consume, with idempotent handlers and DLQs for failed processing
- Include state snapshots in traces so you can replay and debug multi-agent interactions after the fact
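The optimistic-locking bullet is the heart of this pattern: every write must carry the version it read, and stale writes are rejected instead of silently clobbering another agent's work. A minimal in-memory sketch (a real system would back this with a database):

```python
class VersionConflict(Exception):
    pass

class SharedState:
    """Single source of truth with optimistic locking."""

    def __init__(self):
        self.entities = {}   # entity_id -> (version, data)

    def read(self, entity_id):
        return self.entities.get(entity_id, (0, None))

    def write(self, entity_id, expected_version, data):
        current_version, _ = self.read(entity_id)
        if current_version != expected_version:
            # Another agent wrote since this one read: reject, force re-read.
            raise VersionConflict(
                f"{entity_id}: expected v{expected_version}, found v{current_version}")
        self.entities[entity_id] = (current_version + 1, data)
        return current_version + 1
```

On a `VersionConflict`, the losing agent re-reads the current state and decides whether its change still makes sense, which is exactly the conflicting-actions scenario this pattern prevents.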
13. Replanning Loop
Most agent frameworks model a perceive-decide-act cycle. But production designs need to make the “decide” step explicit: reconsider objectives, check constraints, and potentially regenerate the plan.
This matters when tools fail, quotas get exhausted, or the user updates their request mid-flow.
Step 1: Search API ✓
Step 2: Analyze results ✓
Step 3: Call pricing API ✗ (timeout)
→ REPLAN triggered
- Drop step 3 (pricing API down)
- Substitute step 3b: Use cached pricing data
- Continue from step 4
Step 3b: Load cached prices ✓
Step 4: Generate report ✓
Design checklist:
- Represent plans as explicit structures (lists of steps, DAGs) that can be edited programmatically, not buried in natural language
- Define triggers for replanning: repeated tool failures, significant context changes, hitting blast-radius limits
- Gate replanning by severity. Minor failures get a local patch. Critical assumption breaks trigger full replanning or escalation
- Log plan versions in traces so you can compare “planned vs executed” when debugging behavior
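A plan represented as an explicit list of steps can be patched programmatically, as in the pricing example above. A sketch where `tools` maps names to callables and `fallbacks` maps a failing tool to its substitute; all names are illustrative:

```python
def run_plan(steps, tools, fallbacks):
    """Execute an explicit plan (list of (step_name, tool_name) pairs).
    On a step failure, patch in the tool's fallback; each fallback is
    consumed once so a failing substitute cannot loop forever."""
    executed, plan = [], list(steps)
    i = 0
    while i < len(plan):
        name, tool = plan[i]
        try:
            tools[tool]()
            executed.append((name, tool, "ok"))
            i += 1
        except Exception:
            if tool in fallbacks:
                # Local patch: replace the failed step with its substitute.
                plan[i] = (name + "b", fallbacks.pop(tool))
            else:
                raise   # no fallback left: escalate instead of retrying blindly
    return executed
```

Keeping both `steps` (planned) and `executed` (actual) around gives you the "planned vs executed" comparison the last bullet calls for.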
Operational Safety
These patterns handle deployment, monitoring, and ongoing reliability.
14. Canary Agent Deployment
Canary deployment rolls out a new agent or prompt to a small subset of traffic first, then gradually increases exposure while monitoring metrics.
Day 1: 1% traffic → new agent v2.1
↳ monitoring: error rate, hallucination rate, cost, latency
Day 2: 5% traffic (metrics look good)
Day 3: 25% traffic
Day 5: 100% traffic
The same pattern from microservice deployments works for agents and prompts: assign a fraction of users to the new version, compare quality metrics against the control group, and roll back instantly if hallucinations, safety incidents, or latency spike.
- Route internal users or low-risk segments to new agents first
- Define success/failure thresholds on quality, safety, and cost. Automate rollback when thresholds are violated
- Use your gateway and observability traces to compare canary vs baseline behavior in detail
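Traffic splitting for a canary needs to be deterministic per user, so the same user sees the same version for the whole rollout stage. A sketch using a hashed user ID (function and version names are illustrative):

```python
import hashlib

def assign_version(user_id, canary_pct):
    """Deterministically bucket a user into canary or baseline.
    canary_pct is the percentage of traffic on the new version (0-100)."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest, 16) % 100        # stable bucket in [0, 100)
    return "canary" if bucket < canary_pct else "baseline"
```

Ramping from 1% to 100% is then just raising `canary_pct` at each stage while the gateway compares canary vs baseline metrics; rollback is setting it back to 0.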
15. Agentic Observability Tracing
Traditional logging captures final responses. Agentic observability captures every step: prompts, tool calls, retrievals, decisions, and outputs as linked spans in a trace.
Trace: user-request-abc123
├─ Span: llm-call (model: claude-4, tokens: 1,847, latency: 1,200ms)
├─ Span: tool-call:search (query: "refund policy", latency: 340ms)
├─ Span: tool-call:db-lookup (user_id: 42, latency: 12ms)
├─ Span: guardrail-check (policy: pii-filter, result: pass)
├─ Span: llm-call (model: claude-4, tokens: 892, latency: 680ms)
└─ Span: response (tokens: 234, total-latency: 2,450ms, cost: $0.03)
This level of tracing answers the questions that matter in production: which tool slowed this request, why a guardrail fired, how much a particular workflow costs, and where multi-agent interactions went off the rails.
- Instrument agents so each LLM call, tool invocation, retrieval, and guardrail check is a span with correlation IDs
- Export traces via OpenTelemetry to unify agent monitoring with your existing infrastructure stack
- Build dashboards for latency, cost, error rate, and quality per route and agent version. Wire alerts to DLQs and circuit breakers
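The span structure above can be sketched with a context manager. This toy tracer only collects spans in memory; in production you would use the OpenTelemetry SDK, and the class and attribute names here are illustrative:

```python
import time
import uuid
from contextlib import contextmanager

class Tracer:
    """Toy span recorder: each span captures a name, attributes, and latency."""

    def __init__(self):
        self.spans = []

    @contextmanager
    def span(self, name, **attrs):
        record = {"id": uuid.uuid4().hex[:8], "name": name,
                  "attrs": attrs, "start": time.monotonic()}
        try:
            yield record                  # body of the span runs here
        finally:
            record["latency_ms"] = (time.monotonic() - record["start"]) * 1000
            self.spans.append(record)     # spans close innermost-first
```

Wrapping every LLM call, tool invocation, and guardrail check in a `span(...)` block is what makes "which tool slowed this request" a query over data instead of an archaeology project.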
The Patterns Interlock
These 15 patterns aren’t isolated techniques. They form a layered defense system:
┌──────────────────────────────────────────────────────────┐
│ Operations: Canary Deployment + Observability │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Architecture: Gateway + Orchestration + Cache │ │
│ │ ┌────────────────────────────────────────────────┐ │ │
│ │ │ State: Checkpointing + Sync + Replanning │ │ │
│ │ │ ┌───────────────────────────────────────────┐ │ │ │
│ │ │ │ Containment: Limits + Gates + HITL │ │ │ │
│ │ │ │ ┌──────────────────────────────────────┐ │ │ │ │
│ │ │ │ │ Resilience: CB + Timeouts + DLQ │ │ │ │ │
│ │ │ │ └──────────────────────────────────────┘ │ │ │ │
│ │ │ └───────────────────────────────────────────┘ │ │ │
│ │ └────────────────────────────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────┘
The resilience stack (circuit breakers + timeouts + idempotency + DLQs) forms the core. Containment wraps around it. State management keeps long-running and multi-agent work coherent. Architecture provides the structural framework. Operations gives you the visibility and deployment safety to evolve everything else.
You don’t need all 15 on day one. But you need to know where each one goes, because the gap between “agent that works” and “agent that survives production” is exactly these patterns.
Start with three: observability tracing (you can’t fix what you can’t see), blast radius limiters (cap the damage), and circuit breakers (stop cascading failures). Build outward from there as your agent system grows in complexity and traffic.
Building production agent systems? I’d love to hear which patterns saved you (or which ones you learned about the hard way). Reach out on LinkedIn.