Retry Patterns for AI Agents That Actually Work
You should be able to ship an agent that retries the right things (429s, timeouts, flaky tools), stops retrying the wrong things (bad plan, missing context), and proves with metrics that retries improved outcomes instead of just burning tokens.
The main thing I learned: most “ai agent retry” logic fails because it treats all failures the same, and because it retries without changing anything. A retry that doesn’t change the conditions is usually just a more expensive way to fail.
What should an agent retry vs fail fast?
Start with a failure taxonomy. Not a giant one. Just enough to route failures into different policies.
I use four buckets:
- Transient infra failures: network timeouts, DNS, provider 5xx, rate limits (429). Usually retryable.
- Model/tool flakiness: tool returns malformed JSON sometimes, model violates schema, streaming cuts out mid-flight. Sometimes retryable, but often needs a transform.
- Task failures: the agent made a bad plan, lacks context, or the requested action is impossible. Retrying is usually wasted.
- Safety/permission failures: auth errors, forbidden operations, policy refusals. Fail fast and escalate.
The routing is the whole game. Here’s a compact classifier that’s “good enough” in Node without dragging in a framework:
function classifyFailure(err) {
  const status = err?.status ?? err?.response?.status;
  const code = err?.code;
  const msg = String(err?.message || '');
  if (status === 429) return { kind: 'RATE_LIMIT', retryable: true };
  if (status >= 500) return { kind: 'PROVIDER_5XX', retryable: true };
  if (code === 'ETIMEDOUT' || code === 'ECONNRESET') {
    return { kind: 'NETWORK', retryable: true };
  }
  if (status === 401 || status === 403) return { kind: 'AUTH', retryable: false };
  if (msg.includes('JSON') && msg.includes('parse')) {
    return { kind: 'BAD_TOOL_OUTPUT', retryable: true };
  }
  return { kind: 'UNKNOWN', retryable: false };
}

This isn’t perfect. It doesn’t need to be. The goal is to stop doing blind retries for non-retryable work.
Once you have a kind, attach a policy. The key is that different kinds need different actions:
- Immediate retry for one-off network blips.
- Delayed retry with jitter for rate limits and provider overload.
- Alternate model/tool for schema violations or tool flakiness.
- Stop and escalate for auth/permission and true task failures.
I like to encode that mapping explicitly so it’s hard to “accidentally retry everything” during a refactor:
const RETRY_POLICY = {
  RATE_LIMIT: { mode: 'backoff', maxAttempts: 5 },
  PROVIDER_5XX: { mode: 'backoff', maxAttempts: 4 },
  NETWORK: { mode: 'immediate', maxAttempts: 3 },
  BAD_TOOL_OUTPUT: { mode: 'transform', maxAttempts: 2 },
  AUTH: { mode: 'failfast', maxAttempts: 1 },
  UNKNOWN: { mode: 'failfast', maxAttempts: 1 },
};

What didn’t work for me: “retry N times for any error”. It artificially boosts your success rate while hiding the real failure mode (missing context, wrong tool, wrong plan). It’s also how you get agents that spin for 90 seconds and then still fail, except now users also hate the latency.
How to do backoff, jitter, and budgets sanely
A good LLM retry pattern is:
- exponential backoff
- full jitter (randomized delay)
- per-call max attempts
- run-level budgets (time, cost, tokens)
Backoff is easy. Jitter is the part people skip, and it matters. Without jitter, a fleet of agents gets rate limited, then retries at the same intervals, and you create a thundering herd that keeps you rate limited.
This is the backoff function I actually use (full jitter from AWS’ guidance):
function backoffMs(attempt, {
  baseMs = 250,
  capMs = 8000,
} = {}) {
  const exp = Math.min(capMs, baseMs * (2 ** (attempt - 1)));
  return Math.floor(Math.random() * exp); // full jitter: [0, exp)
}

Now the budgets. If you don’t enforce budgets, your agent will be “infinitely helpful” right up until it runs out of money. I track three ceilings:
- deadlineMs: hard wall-clock limit
- maxCostUsd: cap spend per run
- maxTokens: cap total tokens per run (rough proxy when cost isn’t known until after the call)
A small budget guard makes your retry loop predictable:
function assertBudget(b, now = Date.now()) {
  if (now > b.deadlineAt) throw new Error('BUDGET_DEADLINE_EXCEEDED');
  if (b.costUsd >= b.maxCostUsd) throw new Error('BUDGET_COST_EXCEEDED');
  if (b.tokens >= b.maxTokens) throw new Error('BUDGET_TOKENS_EXCEEDED');
}

Then in your retry loop, you check budget before sleeping and before attempting. Otherwise you’ll happily wait 8 seconds only to immediately fail.
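For reference, the budget object itself is just the three ceilings plus running totals. The deadlineMs from the list above becomes an absolute deadlineAt so assertBudget stays a plain comparison (makeBudget is my own helper, not a library call):

function makeBudget({ deadlineMs, maxCostUsd, maxTokens }) {
  return {
    deadlineAt: Date.now() + deadlineMs, // hard wall-clock limit as an absolute timestamp
    maxCostUsd,
    maxTokens,
    costUsd: 0, // incremented after each attempt
    tokens: 0,  // incremented after each attempt
  };
}

const budget = makeBudget({ deadlineMs: 30000, maxCostUsd: 0.5, maxTokens: 60000 });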
One more gotcha: streaming. If you stream tokens and the stream dies halfway through, you can’t always assume “no output happened”. The user may have already seen partial text. For streaming UI, I treat stream interruption as non-retryable unless I can restart cleanly (same prompt, same idempotency key, and the UI can reconcile).
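If you want that decision in code instead of in your head, a tiny gate works. The field names here (tokensShownToUser, uiCanReconcile) are mine, not from any particular SDK:

function canRetryStream({ tokensShownToUser, idempotencyKey, uiCanReconcile }) {
  // Nothing rendered yet: safe to restart the same request.
  if (tokensShownToUser === 0) return true;
  // Partial output reached the user: only retry if we can restart cleanly
  // (same prompt, same idempotency key) and the UI can reconcile the text.
  return Boolean(idempotencyKey && uiCanReconcile);
}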
For tool calls, I’m stricter: if the tool might have performed a side effect, I don’t auto-retry the tool call unless it’s idempotent (more on that later). That’s how you avoid double-charging and duplicate emails.
Here’s a compact retry runner that respects policy + jitter + budget. The interesting part is that it makes the sleep conditional on remaining time:
async function runWithRetry(fn, budget, policyForKind) {
  let attempt = 0;
  while (true) {
    attempt++;
    assertBudget(budget);
    try {
      return await fn({ attempt });
    } catch (err) {
      const { kind, retryable } = classifyFailure(err);
      const policy = policyForKind(kind);
      if (!retryable || attempt >= policy.maxAttempts) throw err;
      const delay = policy.mode === 'immediate' ? 0 : backoffMs(attempt);
      const latestWake = Date.now() + delay;
      if (latestWake > budget.deadlineAt) throw err;
      if (delay) await new Promise(r => setTimeout(r, delay));
    }
  }
}

This is intentionally not “generic middleware”. It’s a small, explicit loop you can reason about. When retries go wrong, debugging “retry frameworks” is miserable.
How to retry with intent (not repetition)
Blind retries are repetition. Intentful retries are transforms.
If you retry and keep the same prompt, same tool args, same model, same context… you’re mostly paying for variance. Sometimes variance saves you. Usually it doesn’t.
These are the transforms that actually moved reliability for me:
1) Tighten the output contract (schema + stop conditions)
If the model keeps giving you “almost JSON”, don’t just retry. Add a stricter contract and a repair step.
A minimal pattern is: attempt → parse → if parse fails, retry with a “repair” instruction that includes the raw output.
function buildJsonRepairPrompt(raw) {
  return [
    { role: 'system', content: 'Return ONLY valid JSON. No prose.' },
    { role: 'user', content: `Fix this into valid JSON:\n\n${raw}` },
  ];
}

Tradeoff: repair prompts can hide upstream prompt issues. I still do it, but I log how often repair happened. If it’s frequent, I fix the original prompt.
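Wired together, attempt → parse → repair looks roughly like this. `complete(messages)` stands in for however you call the model, and `logRepair` is whatever counter you already emit:

async function completeJson(messages, complete, logRepair) {
  const raw = await complete(messages);
  try {
    return JSON.parse(raw);
  } catch {
    logRepair(); // track how often repair happens so prompt rot stays visible
    const repaired = await complete(buildJsonRepairPrompt(raw));
    return JSON.parse(repaired); // if this throws too, the retry runner takes over
  }
}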
2) Add missing context (but only the missing part)
A common agent failure is “it didn’t know X”, and the retry repeats the same context-free prompt. Instead, detect what’s missing and patch that.
I like a targeted self-check that outputs a small JSON blob: which fields are missing, which tool results are required, etc.
const selfCheck = [
  { role: 'system', content: 'Answer in JSON with keys: missing, nextTool.' },
  { role: 'user', content: 'Do you have enough info to complete the task?' },
];

If missing includes “repo name” or “customer id”, you don’t retry the main generation. You go fetch it (or ask the user). That’s a “fail fast into clarification”, not a retry.
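A sketch of that routing. canFetch, fetchContext, askUser, and runTool are placeholders for whatever tools and UI you already have:

async function resolveMissing(check, deps) {
  // `check` is the parsed self-check JSON: { missing: [...], nextTool: '...' }
  for (const item of check.missing ?? []) {
    if (deps.canFetch(item)) {
      await deps.fetchContext(item); // patch only the missing piece
    } else {
      return deps.askUser(item);     // fail fast into clarification
    }
  }
  if (check.nextTool) return deps.runTool(check.nextTool);
}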
3) Decompose the task on retry
When the failure is reasoning-related (bad plan, too many constraints), decomposition helps. It costs more and adds latency. But it’s predictable.
I’ll often switch from “do everything” to “plan → execute → verify” only after the first attempt fails:
async function decompose(task) {
  return [
    { role: 'system', content: 'Break into 3-5 executable steps.' },
    { role: 'user', content: task },
  ];
}

Tradeoff: decomposition increases the number of calls, which increases surface area for 429s and tool failures. That’s where budgets matter.
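Here’s roughly how I gate the switch: single shot first, decompose only when the first attempt fails with a reasoning-type error. completeOnce, model, executeAndVerify, and isReasoningFailure are illustrative stand-ins for your own call sites:

async function attemptWithEscalation(task, deps) {
  try {
    // Cheap "do everything" attempt first.
    return await deps.completeOnce(task);
  } catch (err) {
    // Only escalate for reasoning-type failures; infra errors stay with the retry runner.
    if (!deps.isReasoningFailure(err)) throw err;
    const steps = await deps.model(await decompose(task)); // plan
    return deps.executeAndVerify(steps, task);             // execute → verify
  }
}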
4) Switch models (sparingly)
Switching models can fix some reasoning failures and some formatting failures. It also makes behavior less predictable across runs, which can be a problem if you’re debugging.
I only switch models for specific kinds:
- repeated schema violations
- repeated “refusal” patterns that are model-specific
- repeated tool-call argument mistakes
I don’t switch models for timeouts/429s. That’s an infra problem; switching models just moves load.
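I keep that rule explicit with another small map, same spirit as RETRY_POLICY. The kind names here are finer-grained than the classifier above, and the model names are placeholders:

const MODEL_FALLBACK = {
  SCHEMA_VIOLATION: 'fallback-model-a', // repeated schema violations
  MODEL_REFUSAL: 'fallback-model-b',    // model-specific refusal patterns
  BAD_TOOL_ARGS: 'fallback-model-a',    // repeated tool-call argument mistakes
  // Deliberately no entries for RATE_LIMIT / PROVIDER_5XX / NETWORK:
  // switching models for infra failures just moves load around.
};

function pickModel(kind, currentModel) {
  return MODEL_FALLBACK[kind] ?? currentModel;
}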
The practical rule: if the retry transform doesn’t change inputs, constraints, or approach, it’s usually wasted tokens.
How to make retries idempotent and safe
This is the production pain nobody mentions in prompt engineering threads: retries duplicate side effects.
If your agent can send emails, charge cards, create tickets, or write DB rows, then retries turn into distributed systems bugs. Assume partial success:
- tool succeeded, but the agent crashed before recording it
- tool timed out, but actually succeeded server-side
- agent got a 500, but the provider processed the request
You need idempotency keys per side-effecting action. Not per run. Per action.
Pattern 1: idempotency keys on tool calls
If your tool is an HTTP API you own, accept an idempotency key header. Store the response keyed by that value.
function actionKey({ runId, action, targetId }) {
  return `${runId}:${action}:${targetId}`;
}

async function sendEmail(api, { runId, to, templateId }) {
  const key = actionKey({ runId, action: 'sendEmail', targetId: to });
  return api.post('/emails/send', {
    to, templateId,
  }, {
    headers: { 'Idempotency-Key': key },
  });
}

This makes retries safe even when you don’t know if the previous attempt succeeded.
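On the server side, “store the response keyed by that value” can be this small. A sketch assuming an Express-style handler plus some `responses` store you already have (Redis, a DB table, whatever), and an `actuallySendEmail` function that does the real work:

app.post('/emails/send', async (req, res) => {
  const key = req.get('Idempotency-Key');
  const cached = key && await responses.get(key);
  if (cached) return res.status(cached.status).json(cached.body); // replay, don't re-send

  const result = await actuallySendEmail(req.body);                // the real side effect
  if (key) await responses.set(key, { status: 200, body: result });
  res.json(result);
});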
Pattern 2: “plan then commit”
For multi-step agents, I separate “decide what to do” from “do it”. The plan can be retried freely. The commit phase is where idempotency matters.
The simplest version is to have the model output a structured plan that includes stable action IDs, then execute those IDs exactly once.
const plan = {
  actions: [
    { id: 'a1', type: 'create_ticket', payload: { subject: 'Refund' } },
    { id: 'a2', type: 'send_email', payload: { to: 'sam@acme.io' } }
  ]
};

If you retry planning, you must either (a) reuse the original plan if it’s valid, or (b) generate a new plan but dedupe by action type + target + payload hash. Otherwise you’ll create different IDs and double-execute.
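The dedupe key for option (b) is mechanical: hash what the action does, not the ID the model happened to generate. A minimal version with Node’s crypto module:

import { createHash } from 'node:crypto';

function dedupeKey(action) {
  // Same type + target + payload = same action, regardless of which plan produced it.
  // Assumes payload key order is stable; canonicalize first if it is not.
  const body = JSON.stringify({ type: action.type, payload: action.payload });
  return createHash('sha256').update(body).digest('hex');
}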
Pattern 3: outbox / transactional messaging
If your agent writes to a DB and triggers side effects, the outbox pattern is still the least bad option. You commit “intent” and “state” in one DB transaction, then a worker delivers side effects with idempotency keys.
Here’s a concrete schema I’ve used:
create table agent_outbox (
  id bigserial primary key,
  run_id text not null,
  action_key text not null unique,
  kind text not null,
  payload jsonb not null,
  status text not null default 'pending',
  created_at timestamptz not null default now()
);

And a real example row:
{
  "run_id": "wm_run_01JMY2K9V7Q8W3K1D3A1QZ9H2N",
  "action_key": "wm_run_01JMY2K9V7Q8W3K1D3A1QZ9H2N:sendEmail:sam@acme.io",
  "kind": "sendEmail",
  "payload": { "to": "sam@acme.io", "templateId": "refund-approved-v2" },
  "status": "pending"
}

Where to store dedupe state?
- DB if you need correctness and auditability.
- Redis if you only need short-lived dedupe and can tolerate rare duplicates.
- In-memory is basically useless once you have >1 process.
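However you store it, the delivery side of the outbox stays boring: claim a pending row, deliver with the stored action_key as the idempotency key, mark it done. A sketch assuming a Postgres client with a pg-style db.query and a generic deliver function:

async function drainOutbox(db, deliver) {
  // Claim one pending row at a time so concurrent workers don't double-deliver.
  const { rows } = await db.query(
    `update agent_outbox set status = 'in_flight'
     where id = (select id from agent_outbox where status = 'pending'
                 order by id limit 1 for update skip locked)
     returning *`
  );
  if (!rows.length) return;

  const row = rows[0];
  await deliver(row.kind, row.payload, { idempotencyKey: row.action_key });
  await db.query(`update agent_outbox set status = 'done' where id = $1`, [row.id]);
}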
If you do nothing else from this post: treat agent retries like distributed systems retries. Assume partial success and build for it.
How to measure whether retries help or hide bugs
Retries can make your graphs look better while your product gets worse.
The trap: you measure “run success rate” and see it go up. But you don’t notice that:
- first-attempt quality dropped
- p95 latency doubled
- cost per successful run spiked
You need metrics at two levels: attempt and run.
What I log (minimum viable):
- run_id, attempt number
- failure kind (from taxonomy)
- retry transform applied (if any)
- per-attempt latency, tokens, cost
- final run outcome (success/failure)
- retries-per-success (how many attempts to get a success)
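Concretely, one attempt record can be this small (field names are just how I shape it; match whatever your telemetry already uses):

const attemptLog = {
  run_id: 'wm_run_01JMY2K9V7Q8W3K1D3A1QZ9H2N',
  attempt: 2,
  failure_kind: 'RATE_LIMIT',        // from the taxonomy
  retry_transform: 'backoff+jitter', // or 'repair', 'decompose', 'model_switch', null
  latency_ms: 1840,
  tokens: 2210,
  cost_usd: 0.011,
  final_outcome: null,               // filled on the run's last attempt
};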
If you have SQL access to your telemetry, you can compute the big ones. Here’s a query shape that catches “retries hiding regressions”: compare first-attempt success vs eventual success.
select
  date_trunc('day', started_at) as day,
  count(*) as runs,
  avg(case when first_attempt_outcome = 'Completed' then 1 else 0 end) as first_attempt_success_rate,
  avg(case when final_outcome = 'Completed' then 1 else 0 end) as eventual_success_rate,
  avg(total_cost_usd) as avg_cost_usd,
  percentile_cont(0.95) within group (order by total_latency_ms) as p95_latency_ms
from agent_runs
where label = 'Support agent'
group by 1
order by 1 desc;

If eventual success is flat but first-attempt success drops, you didn’t improve reliability. You just added retries.
Observability tools help correlate attempts. OpenTelemetry traces can do it. Langfuse/Helicone-style logging can do it. WarpMetrics can do it with explicit Run/Outcome/Act chains. But the tooling isn’t the important part. The classification + per-attempt/per-run metrics are.
One more thing: don’t lump all failures together. A spike in RATE_LIMIT failures means “buy capacity or slow down”. A spike in BAD_TOOL_OUTPUT means “fix schema/tooling”. If you don’t separate them, you’ll “fix” the wrong thing.
Wiring this up with WarpMetrics (practical, not fancy)
If you’re already using WarpMetrics, the clean way to make retries observable is: record an outcome for the failed attempt, record an act that describes the retry transform, then start a follow-up run linked to that act. That gives you a chain you can query later.
Here’s the smallest snippet that shows the pattern:
import { run, outcome, act } from '@warpmetrics/warp';
const r1 = run('Support agent', { attempt: 1 });
const oc1 = outcome(r1, 'Rate Limited', { code: 429, kind: 'RATE_LIMIT' });
const a1 = act(oc1, 'Retry', { strategy: 'backoff+jitter', delayMs: 1200 });
const r2 = run(a1, 'Support agent', { attempt: 2 });
// ... do the second attempt ...
outcome(r2, 'Completed');

That’s it. You can now measure retries-per-success, see which failure kinds trigger retries, and check whether your “retry with intent” transforms actually improve outcomes instead of just spending more.
Start building with WarpMetrics
Track every LLM call, measure outcomes, and let your agents query their own performance data.