Debugging AI Agent Failures in Production

After you read this, you should be able to take a vague “agent failed” report and turn it into: a classified failure mode, a run timeline with evidence, a replay you can execute locally, and a fix that doesn’t regress next week.

The thing that surprised me building production agents is how rarely the model is the root cause. Most incidents were boundary failures: a tool returned partial JSON, retrieval pulled the wrong chunk, or the planner got stuck looping. If you treat everything as “prompt engineering,” you’ll ship random changes and never get stable.

What counts as an agent failure (and why it matters)

In production, “failure” isn’t “the model said something weird.” It’s “the user didn’t get the outcome they came for” or “we violated an operational constraint.” You need a definition that your on-call brain can apply at 2am.

I like to define failures as run outcomes with a small set of buckets. Not 40. Not free-form strings that turn into taxonomy hell. A handful you can graph over time.

Here’s a set that held up for me:

  • Tool Error: network/auth/schema/rate limit. Anything where the agent couldn’t execute an external action.
  • Bad Decision: the agent chose the wrong tool, wrong order, or wrong branch (even if tools worked).
  • Bad Output: final output violates format/constraints, missing required fields, hallucinated facts, wrong tone.
  • Timeout: exceeded wall-clock budget (often from loops or slow tools).
  • Cost Blowup: exceeded token/cost budget (often from over-retrieval or repeated self-critiques).

The point isn’t the names. The point is that every production run ends in exactly one of these (or “Completed”).

A concrete way to make this real is to define an outcome object you can attach to runs. Even if you’re not using a tracing product, you can log this to your DB.

{
  "runId": "run_01JMYZ3G4H4Y7J2R9N4F4M1WQK",
  "label": "SupportAgent",
  "status": "Failed",
  "failureMode": "Tool Error",
  "reason": "CRM 401: token expired",
  "retryable": true,
  "userImpact": "No reply sent",
  "createdAt": "2026-02-12T09:41:22.114Z"
}

Two things that didn’t work for me:

  1. “Success = no exception thrown.” Agents can return garbage while happily exiting 0.
  2. “Failure = user complained.” User reports are delayed, incomplete, and biased toward visible failures. You need internal detection.

If you can’t name and classify outcomes, you can’t debug. You’ll also never know if you improved anything—because you can’t measure regressions.
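
To make the classification mechanical, I run a small mapper at the end of every run. This is a minimal sketch; ToolCallError, BudgetExceededError, and validateOutput are placeholders for whatever your own codebase uses.

// Minimal sketch: map what the runtime already knows onto the buckets above.
// ToolCallError, BudgetExceededError, and validateOutput are placeholders.
export function classifyOutcome({ error, output, validateOutput }) {
  if (error?.name === 'ToolCallError') {
    return { status: 'Failed', failureMode: 'Tool Error', reason: error.message, retryable: error.retryable ?? false };
  }
  if (error?.name === 'BudgetExceededError') {
    return {
      status: 'Failed',
      failureMode: error.kind === 'cost' ? 'Cost Blowup' : 'Timeout',
      reason: error.message,
      retryable: false,
    };
  }
  if (error) {
    // Unknown exception: park it under Bad Decision until triaged instead of inventing a new bucket.
    return { status: 'Failed', failureMode: 'Bad Decision', reason: error.message, retryable: false };
  }
  const validation = validateOutput(output);
  if (!validation.ok) {
    return { status: 'Failed', failureMode: 'Bad Output', reason: validation.errors.join('; '), retryable: false };
  }
  return { status: 'Completed' };
}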

Instrument the run so you can answer "what happened?"

If you want to debug AI agent failures in production, you need to answer one question fast: what happened in this specific run? Not “what does the agent usually do,” but the exact timeline: inputs → calls → tool invocations → retries → output → outcome.

The minimum instrumentation that paid for itself:

  • Run-scoped correlation ID that you propagate everywhere (logs, tool calls, DB writes).
  • Model settings (model name, temperature, max tokens, top_p, etc.).
  • Every tool call: name, request payload, response payload (or a redacted version), latency, status.
  • Every LLM call: messages (or hashed/redacted), tool definitions, response, tokens, latency, status.
  • Retries: count, reason, backoff.
  • Final outcome: one of your buckets, plus structured metadata.

Hierarchical tracing matters. Flat “here are 12 LLM calls” logs are hard to reason about. I want:

  • Run: one user request / one agent attempt
  • Groups/phases: plan, retrieve, execute, validate
  • Calls: the individual LLM/tool invocations

If you’re using OpenTelemetry, model it like spans. If you’re not, you can still encode the hierarchy in your own tables.

Here’s a schema I’ve used (Postgres) that makes debugging practical without over-modeling:

create table agent_runs (
  id text primary key,
  label text not null,
  correlation_id text not null,
  started_at timestamptz not null,
  ended_at timestamptz,
  outcome text,
  failure_mode text,
  metadata jsonb not null default '{}'
);

create table agent_events (
  id bigserial primary key,
  run_id text not null references agent_runs(id),
  parent_event_id bigint references agent_events(id),
  kind text not null, -- 'group' | 'llm_call' | 'tool_call' | 'validation'
  name text not null,
  started_at timestamptz not null,
  ended_at timestamptz,
  status text not null, -- 'ok' | 'error'
  data jsonb not null
);

And here’s what a single tool call event looks like when it’s actually useful (note: real numbers, not toy values):

{
  "kind": "tool_call",
  "name": "crm.searchCustomer",
  "status": "error",
  "data": {
    "request": { "email": "jane@acme.com" },
    "response": { "error": "Unauthorized", "code": 401 },
    "latencyMs": 312,
    "attempt": 1
  }
}
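
If you log to the Postgres tables above, a thin helper keeps every call linked to its run and its parent group. Here's a sketch using node-postgres; the pool config, table, and columns assume the schema shown earlier.

import { Pool } from 'pg';

const pool = new Pool(); // connection settings come from the usual PG* env vars

// Insert one event row linked to its run (and optionally a parent group/event).
// pg serializes the plain `data` object into the jsonb column.
export async function recordEvent({ runId, parentEventId = null, kind, name, startedAt, endedAt = null, status, data }) {
  const { rows } = await pool.query(
    `insert into agent_events (run_id, parent_event_id, kind, name, started_at, ended_at, status, data)
     values ($1, $2, $3, $4, $5, $6, $7, $8)
     returning id`,
    [runId, parentEventId, kind, name, startedAt, endedAt, status, data]
  );
  return rows[0].id; // pass this as parentEventId for child calls
}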

Tools like Langfuse/Helicone can capture parts of this. The hard part is consistency: every run gets an ID, every call is linked, and every run ends with an outcome. Without that discipline, you’ll have “observability” and still be guessing.

Reproduce production failures without guessing

Most agent incidents are unfixable until you can replay them. The failure is usually a specific combination of: messages, retrieved docs, tool responses, and model settings. If you only log the final prompt, you’re missing half the state.

What you need to snapshot for replay:

  • Full message history (or a reversible redaction scheme)
  • Model + all sampling settings
  • Tool definitions (schemas) that were provided to the model
  • Tool inputs + tool outputs (including errors)
  • Retrieval results (doc IDs + exact chunks returned)
  • Any external state you read (feature flags, user plan, permissions)

Then you implement a “replay mode” that swaps real tools for recorded outputs. This is the only way I’ve found to avoid phantom bugs where you “fix” something that never reproduces.

A small pattern that works: record tool interactions as fixtures keyed by (runId, stepName, callIndex).

export function createReplayTools(fixtures) {
  return {
    async crm_searchCustomer(args) {
      const fx = fixtures.next('crm.searchCustomer', args);
      return fx.output; // can be success or an error-shaped object
    },
    async kb_search(args) {
      const fx = fixtures.next('kb.search', args);
      return fx.output;
    },
  };
}

Now your agent code can run unchanged, except the tool implementations are swapped.
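
The fixtures object itself can be a thin wrapper over the recorded tool calls for one run, handing outputs back in call order. Here's a sketch keyed by tool name plus per-tool call index (fold in step names if your agent reuses tools across phases); it assumes each recorded call carries a name and an output, like the agent_events data above.

// Replay fixture store for one recorded run. Calls are matched by tool name
// and per-tool call index, in the order they originally happened.
export function createFixtureStore(recordedCalls) {
  const counters = new Map();

  return {
    next(toolName, args) {
      const index = counters.get(toolName) ?? 0;
      counters.set(toolName, index + 1);

      const match = recordedCalls.filter(c => c.name === toolName)[index];
      if (!match) {
        throw new Error(`No fixture for ${toolName} call #${index} (args: ${JSON.stringify(args)})`);
      }
      return match; // { name, output }, where output can be success or error-shaped
    },
  };
}

Feed that into createReplayTools above and the agent replays exactly what production saw.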

For retrieval, “replay” means freezing the retrieval result set, not just the query. If you re-run against a live index, you’re not reproducing the incident—you’re testing today’s index.

This is the smallest useful retrieval snapshot I’ve used:

{
  "query": "How do I reset my API token?",
  "topK": 6,
  "results": [
    { "docId": "docs_1281", "chunkId": "c07", "score": 0.82, "text": "To rotate an API token, go to Settings..." },
    { "docId": "docs_992", "chunkId": "c03", "score": 0.79, "text": "Tokens expire after 90 days..." }
  ]
}

Controlling nondeterminism:

  • Set temperature: 0 for replay when possible. It won’t perfectly match production, but it removes a big source of noise.
  • Use seed if your provider supports it (some do, some don’t, and it’s not universal across endpoints).
  • Stub tools and freeze retrieval.
  • If your agent uses “current time” or random IDs in prompts, inject a deterministic clock/id generator in replay (sketch below).
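
The injection itself can be tiny; the important part is that the agent never calls Date.now() or randomUUID() directly. A minimal sketch:

import { randomUUID } from 'node:crypto';

// Production gets a real clock and real IDs; replay gets values frozen from the recorded run.
export function createRuntimeContext({ replay = false, frozenTime, frozenIds = [] } = {}) {
  let idIndex = 0;
  return {
    now: () => (replay ? new Date(frozenTime) : new Date()),
    newId: () => (replay ? frozenIds[idIndex++] : randomUUID()),
  };
}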

What didn’t work for me:

  • Relying on user screenshots. You’ll miss tool errors, retries, and the hidden chain-of-thought you don’t even log.
  • Logging only the final assembled prompt. Multi-step agents fail in the middle. You need intermediate state and tool I/O.

Once you can replay, debugging becomes normal software engineering again: bisect changes, add assertions, write regression tests.

Debug the three hard classes: tools, retrieval, planning

When a run fails, I start with a decision tree. Not because it’s fancy, but because it prevents the default failure mode: tweaking prompts until the incident “goes away.”

1) Tool failures

Symptoms:

  • sudden spike in “Tool Error”
  • LLM output looks fine, but the action didn’t happen
  • repeated retries with identical args
  • partial JSON responses that parse but are missing fields

Checks that actually catch things:

  • Schema drift: tool returned {customer_id: ...} but your model expects customerId.
  • Auth: expired tokens, wrong scopes, multi-tenant mixups.
  • Rate limits: 429s that your retry policy amplifies into timeouts.
  • Partial responses: upstream returns 200 with an embedded error.

I’m opinionated here: enforce strict schemas at the tool boundary. Let the model be flexible; don’t let your system be.

If you’re using JSON Schema for tool args, validate on both sides: before sending (model args) and after receiving (tool output).

import Ajv from 'ajv';

const ajv = new Ajv({ allErrors: true });

export function validateOrThrow(schema, value, label) {
  const validate = ajv.compile(schema);
  if (!validate(value)) {
    const err = new Error(`Invalid ${label}: ${ajv.errorsText(validate.errors)}`);
    err.details = validate.errors;
    throw err;
  }
}
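
In practice that means wrapping every tool call site. Here's a sketch using the validateOrThrow helper above; the schemas and the crm client are illustrative, not a real API.

// Illustrative schemas; shape them to your real tool contract.
const crmSearchArgsSchema = {
  type: 'object',
  required: ['email'],
  properties: { email: { type: 'string', minLength: 3 } },
  additionalProperties: false,
};

const crmSearchResultSchema = {
  type: 'object',
  required: ['customerId'],
  properties: { customerId: { type: 'string' }, plan: { type: 'string' } },
};

// Validate what the model asked for, call the real tool, then validate what the
// tool returned before the model ever sees it. `crm` is your real client.
export async function searchCustomerStrict(crm, modelArgs) {
  validateOrThrow(crmSearchArgsSchema, modelArgs, 'crm.searchCustomer args');
  const result = await crm.searchCustomer(modelArgs);
  validateOrThrow(crmSearchResultSchema, result, 'crm.searchCustomer output');
  return result;
}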

Tradeoff: strict validation increases “hard failures” early. That’s good. Silent corruption is worse.

2) Retrieval failures (RAG)

Symptoms:

  • confident, wrong answers with plausible citations
  • missing citations entirely
  • answers that ignore recent docs
  • “it worked yesterday” after an index rebuild

My checklist:

  • Chunking: if chunks are too big, recall drops; too small, context is fragmented.
  • Staleness: index lag vs source of truth. Know your freshness SLO.
  • Recall vs precision: topK too low causes missing facts; topK too high bloats context and cost.
  • Citation plumbing: the model can’t cite what you don’t pass through.

A practical trick: log retrieval coverage metrics per run. Not “vector score,” but “did we retrieve anything from the doc family we expected?” For internal KBs, doc IDs are often stable.

export function retrievalCoverage(results) {
  const docIds = new Set(results.map(r => r.docId));
  return {
    uniqueDocs: docIds.size,
    topScore: results[0]?.score ?? null,
    hasPolicyDoc: [...docIds].some(id => id.startsWith('policy_')),
  };
}

Tradeoff: caching retrieval results improves latency and cost, but it can make correctness worse if your docs change frequently. If you cache, cache with a TTL and include a “source version” in the cache key.
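
If you do cache, here's a sketch of the key shape; indexVersion is whatever your pipeline already exposes (an index build ID or timestamp works).

import { createHash } from 'node:crypto';

// The key ties cached results to the query, the retrieval settings, and the index
// build that produced them, so a reindex naturally invalidates old entries.
export function retrievalCacheKey({ query, topK, indexVersion }) {
  const raw = JSON.stringify({ query: query.trim().toLowerCase(), topK, indexVersion });
  return `retrieval:${createHash('sha256').update(raw).digest('hex')}`;
}

// When storing, keep the TTL short, e.g. cache.set(key, results, { ttlSeconds: 900 }).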

3) Planning / loop failures

These are the most “agent-y” bugs. The model isn’t wrong in one step; the control flow is wrong across steps.

Symptoms:

  • infinite tool loops (“search → read → search → read”)
  • premature termination (“I’m done” without doing the action)
  • bad branching (chooses billing flow for a technical ticket)
  • oscillation between two plans

The fix is rarely “better prompt.” It’s usually guardrails in the planner.

I use three controls:

  1. Step budget (max tool calls / max LLM calls)
  2. State machine (explicit phases; planner can’t jump arbitrarily; sketch after this list)
  3. Loop detection (same tool + same args repeated)
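
The step budget gets its own snippet in the budgets section below, and loop detection follows right after this. The state machine can be as small as an allow-list of phase transitions; here's a sketch using the plan/retrieve/execute/validate phases from earlier (the transition table is an example, not a prescription).

// Explicit phases with an allow-list of transitions. The planner proposes the
// next phase; this guard decides whether the jump is legal.
const ALLOWED_TRANSITIONS = {
  plan: ['retrieve', 'execute'],
  retrieve: ['execute', 'plan'],
  execute: ['validate'],
  validate: ['done', 'plan'],
};

export function nextPhase(current, proposed) {
  if (!ALLOWED_TRANSITIONS[current]?.includes(proposed)) {
    throw new Error(`Illegal phase transition: ${current} -> ${proposed}`);
  }
  return proposed;
}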

Loop detection can be dumb and still effective:

export function detectRepeatToolCall(history, toolName, args) {
  const key = `${toolName}:${JSON.stringify(args)}`;
  const seen = history.toolKeys || new Set();
  const repeated = seen.has(key);
  seen.add(key);
  history.toolKeys = seen;
  return repeated;
}

If it repeats, don’t “try harder.” Record an outcome like “Planner Loop” (or fold it into Bad Decision if you’re keeping the buckets small) and stop. Autonomy without budgets is how you get 3am cost spikes.

Tradeoff: guardrails reduce autonomy. I’m fine with that in production. I’d rather fail fast with a clear outcome than spend $4 generating nonsense.

Ship fixes that stick: tests, canaries, and budgets

A fix isn’t real until it’s protected by automation and monitored in prod.

Turn incidents into regression tests

For structured outputs, deterministic tests are non-negotiable. If your agent outputs JSON, validate it. If it outputs a specific format, parse it.

import Ajv from 'ajv';
const ajv = new Ajv();

const OutputSchema = {
  type: 'object',
  required: ['action', 'customerId'],
  properties: {
    action: { enum: ['refund', 'replace_card', 'escalate'] },
    customerId: { type: 'string', minLength: 6 }
  },
  additionalProperties: false
};

export function assertValidOutput(obj) {
  const ok = ajv.validate(OutputSchema, obj);
  if (!ok) throw new Error(ajv.errorsText(ajv.errors));
}

For “bad output” that’s subjective (tone, completeness), I avoid LLM-as-judge unless I have to. Heuristics catch a lot:

  • max length
  • required sections present
  • no banned phrases
  • includes citations when required

export function validateSupportReply(text) {
  const errors = [];
  if (text.length < 120) errors.push('Too short');
  if (!/https?:\/\//.test(text)) errors.push('Missing link');
  if (/as an ai language model/i.test(text)) errors.push('Refusal boilerplate');
  return { ok: errors.length === 0, errors };
}

LLM-as-judge is fine for “is this helpful?” but it’s expensive and can drift. If you use it, pin the judge model and treat it as a dependency.

Canary and budgets

Budgets are the easiest way to prevent incidents from turning into outages.

I enforce three per-run budgets:

  • timeMs: hard timeout for the run
  • maxCalls: max LLM calls + tool calls
  • maxCostUsd (or maxTokens): stop before you burn money

A simple budget guard around your agent loop is enough; this one covers the time and step budgets, and cost works the same way with a running token count:

export function createBudget({ deadlineMs, maxSteps }) {
  const started = Date.now();
  let steps = 0;

  return {
    step() {
      steps += 1;
      if (steps > maxSteps) throw new Error('Budget: max steps exceeded');
      if (Date.now() - started > deadlineMs) throw new Error('Budget: deadline exceeded');
    }
  };
}
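
Wiring it into the loop is one call per iteration. A sketch follows; agent.decideNextAction and agent.executeAction are placeholders for your own planner and tool layer. When the budget throws, record a Timeout (or Cost Blowup, if you also track tokens) instead of retrying.

// Sketch of the guard in use. agent.decideNextAction and agent.executeAction
// are placeholders for your own planner and tool layer.
export async function runAgentWithBudget(agent, input) {
  const budget = createBudget({ deadlineMs: 60_000, maxSteps: 12 });

  let state = { input, done: false };
  while (!state.done) {
    budget.step(); // throws once the run is over time or over its step count
    const action = await agent.decideNextAction(state);
    state = await agent.executeAction(state, action);
  }
  return state;
}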

For canaries: route 1–5% of traffic to the new prompt/tooling and compare outcome rates and cost/latency deltas. If you can’t do traffic splitting, run shadow executions (don’t return result to user) and score outcomes offline.
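
If you do have traffic splitting, deterministic bucketing by user (or ticket) ID keeps each user on the same variant across retries. A sketch; the configs in the usage comment are hypothetical.

import { createHash } from 'node:crypto';

// Stable 0-99 bucket per user, so the same user always hits the same variant.
export function inCanary(userId, percent = 5) {
  const bucket = createHash('sha256').update(String(userId)).digest().readUInt16BE(0) % 100;
  return bucket < percent;
}

// const agentConfig = inCanary(user.id) ? canaryConfig : stableConfig;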

The key is closing the loop: after deploy, track whether “Tool Error” went down, whether “Cost Blowup” went up, and whether success rate improved. If you can’t answer that, you didn’t ship a fix—you shipped a change.


Doing this with WarpMetrics (runs, groups, outcomes)

If you want the “one run, clear timeline, explicit outcome” workflow without building your own tables, WarpMetrics maps cleanly to the mental model above: Run → Group → Call → Outcome.

Here’s the smallest snippet that creates a run, groups phases, links calls, and records a failure mode you can aggregate later:

import { warp, run, group, call, outcome, flush } from '@warpmetrics/warp';
import OpenAI from 'openai';

const openai = warp(new OpenAI());

const r = run('SupportAgent', { ticketId: 18422 });
const planning = group(r, 'Planning');

const planRes = await openai.chat.completions.create({ model: 'gpt-4o', messages: [{ role: 'user', content: 'Plan the next action.' }] });
call(planning, planRes);

outcome(r, 'Tool Error', { tool: 'crm.searchCustomer', code: 401, retryable: true });
await flush();

That’s enough to correlate an incident to a single run, see the timeline, and track outcome rates over time. The important part isn’t the SDK—it’s the discipline: every run gets a label, every phase is grouped, every run ends with an outcome you can measure.

Start building with WarpMetrics

Track every LLM call, measure outcomes, and let your agents query their own performance data.