Building a Self-Improving AI Agent with Feedback Loops
You’ll be able to take an agent that “sometimes works” and turn it into one that gets measurably better over time: same task, repeated runs, explicit pass/fail gates, and retries that change one thing on purpose.
The trick is to stop calling random retries “learning”. Most agent loops are just a slot machine: run again, hope the model feels different. Real self-improvement is boring: metrics, controlled interventions, and logging enough evidence that you can explain why attempt 3 succeeded.
What makes an agent “self-improving” in practice?
A “self-improving AI agent” doesn’t need to train a model. In practice, you’re usually doing one (or more) of these:
- Updating prompts (system prompt, few-shots, rubrics)
- Updating state (memory, retrieved context, tool outputs)
- Updating policy/config (which model, temperature, decomposition strategy)
- Adding verification steps (validators, judges, diff checks)
That’s not gradient descent. It’s closer to automated debugging: run → observe → adjust → rerun.
The minimum architecture I’ve found that actually works:
- Repeatable task loop: a task you can run multiple times with the same input.
- Measurable success criteria: a gate that says pass/fail (or score).
- Controlled intervention: a specific change between attempts.
- Stop rule: max attempts, budget/time caps, or “good enough” threshold.
If any of those are missing, you get the usual failure modes:
- No measurable target → you can’t tell if it improved, only that it produced different text.
- No controlled changes → retries become a random walk (sometimes worse, sometimes better).
- No stop rule → infinite loops that burn money.
A simple way to encode this is to treat each attempt as a pure function of (input, policy) and make policy explicit. Don’t bury it across ten files.
const basePolicy = {
model: 'gpt-4o-mini',
temperature: 0.2,
rubricVersion: 'v3',
maxOutputTokens: 800,
};
function nextPolicy(policy, intervention) {
return { ...policy, ...intervention, revision: (policy.revision || 0) + 1 };
}
The “learning” is just choosing the next intervention based on the last outcome. If you can’t name the intervention in one sentence (“added missing schema context”, “switched judge model”, “split into two subtasks”), you’re not doing controlled improvement.
One more opinion: don’t start with multi-agent orchestration. Start with one agent that can pass a deterministic check. Then add subjective quality gates. Most teams do the reverse and end up with a complex system that can’t explain failures.
How do you design an AI agent feedback loop?
The core AI agent feedback loop pattern I use looks like:
Run → Generate → Evaluate → Record outcome (+ evidence) → Act → Follow-up run
The part people skip is “record evidence”. If you only log a score, you can’t debug judge drift, rubric bugs, or prompt regressions. You want the inputs that produced the verdict: the candidate output, the rubric, and any extracted facts used in the decision.
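A minimal sketch of what that record can look like, assuming the evaluator hands back an ok flag plus whatever it used to reach the verdict (the field names here are illustrative, not a required schema):
// Illustrative attempt record: enough evidence to re-check the verdict later.
function recordAttempt({ chainId, attempt, policy, output, evaluation }) {
  return {
    chainId, // groups retries into one run chain
    attempt, // 1, 2, 3...
    policy, // model, temperature, rubricVersion, ...
    output, // the exact candidate the evaluator saw
    verdict: evaluation.ok ? 'pass' : 'fail',
    evidence: evaluation.evidence || [], // validator errors, judge reasons, extracted facts
    recordedAt: new Date().toISOString(),
  };
}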
Evaluation strategies (and when they don’t work)
1) Deterministic checks
Use these whenever you can. They’re cheap, stable, and hard to game.
Example: the agent must output JSON matching a schema and include required fields.
import Ajv from 'ajv';
const ajv = new Ajv({ allErrors: true });
const schema = {
type: 'object',
required: ['title', 'steps', 'riskLevel'],
properties: {
title: { type: 'string', minLength: 8 },
steps: { type: 'array', minItems: 3, items: { type: 'string' } },
riskLevel: { enum: ['low', 'medium', 'high'] },
},
};
const validate = ajv.compile(schema);
function evalDeterministic(json) {
const ok = validate(json);
return { ok, errors: ok ? [] : validate.errors.map(e => `${e.instancePath} ${e.message}`) };
}
2) Heuristics
Useful when correctness isn’t binary but you can still check properties: length bounds, presence of citations, “must mention X”, no forbidden phrases.
Heuristics are fragile, but they’re predictable. I’ll take predictable over “LLM judge vibes” for anything high volume.
function evalHeuristic(text) {
const tooLong = text.length > 6000;
const hasChecklist = /- \[[ x]\]/.test(text);
const mentionsRollback = /rollback/i.test(text);
const ok = !tooLong && hasChecklist && mentionsRollback;
return { ok, signals: { tooLong, hasChecklist, mentionsRollback } };
}
3) LLM-as-judge
Use this for subjective qualities: helpfulness, tone, completeness, reasoning quality. It’s also the easiest to screw up.
Common judge failures I’ve hit:
- Verbosity reward: judge equates “long” with “good”.
- Rubric leakage: the candidate sees the rubric and optimizes for it.
- Inconsistent scoring: same output gets different scores across runs.
- Overfitting: you optimize to the judge, not to user value.
You mitigate these by:
- Keeping the judge prompt short and strict.
- Forcing structured output (JSON) so you can parse it.
- Logging the judge prompt + candidate + score, so you can audit.
Here’s a judge prompt pattern that’s boring but stable: single score, short reasons, explicit “must cite evidence from the text”.
function judgePrompt({ task, candidate }) {
return [
{ role: 'system', content: 'You are a strict evaluator. Output JSON only.' },
{
role: 'user',
content:
`Task: ${task}
Candidate:
${candidate}
Return JSON:
{"score":1-10,"verdict":"pass|fail","reasons":[...],"evidence":[...]}`
},
];
}
Then parse and treat it as a gate. If parsing fails, that’s a failure outcome too (don’t silently accept it).
function parseJudge(jsonText) {
const j = JSON.parse(jsonText);
const score = Number(j.score);
const verdict = j.verdict;
if (!Number.isFinite(score) || score < 1 || score > 10) throw new Error('bad score');
if (!['pass', 'fail'].includes(verdict)) throw new Error('bad verdict');
return { score, verdict, reasons: j.reasons || [], evidence: j.evidence || [] };
}
The key takeaway: treat evaluation as a gate with evidence. The evidence is what makes the loop debuggable. Without it, you can’t tell whether your agent improved or your judge got easier.
What should the agent change between attempts?
Retries only help if you change something specific. Otherwise you’re just sampling.
Interventions that consistently move metrics for me:
- Refine the prompt (add constraints, add examples, clarify output format)
- Add missing context (RAG, tool results, user constraints)
- Decompose the task (plan → execute → verify)
- Switch model (cheap draft model → expensive fixer)
- Add a verification step (schema validation, link checker, unit test)
- Escalate (human review, or a different subsystem)
Tradeoffs are real:
- Prompt refinement is cheap but can regress other cases (prompt drift).
- Adding context increases tokens and latency; can also distract the model.
- Decomposition adds calls (cost/latency) but improves reliability.
- Switching models is often the fastest win, but it’s the easiest way to blow budget.
Don’t do “random walk” retries
If attempt 1 fails and attempt 2 changes the prompt, the model, the temperature, and the context all at once, you learned nothing. Even if it passes, you can’t attribute the improvement.
Tag each attempt with what changed. Keep it one variable when possible.
const interventions = {
refinePrompt: { temperature: 0.1, rubricVersion: 'v4' },
switchModel: { model: 'gpt-4o' },
addVerifier: { enableJsonSchema: true },
};
function chooseIntervention({ failureMode }) {
if (failureMode === 'invalid_json') return { name: 'Add Verifier', patch: interventions.addVerifier };
if (failureMode === 'missing_steps') return { name: 'Refine Prompt', patch: interventions.refinePrompt };
return { name: 'Switch Model', patch: interventions.switchModel };
}
Add a verification step before you spend on a better model
This surprised me: a lot of “quality” failures are actually formatting failures. Fixing those with a verifier is cheaper than jumping to a bigger model.
Pattern: generate → validate → if invalid, run a “repair” step that only fixes structure.
async function repairToJson(llm, text) {
const res = await llm.chat.completions.create({
model: 'gpt-4o-mini',
temperature: 0,
messages: [
{ role: 'system', content: 'Convert to valid JSON. Output JSON only.' },
{ role: 'user', content: text },
],
});
return res.choices[0].message.content;
}
This isn’t glamorous, but it reduces failure rate without increasing the “main” model cost much.
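For completeness, here’s a sketch of how the pattern wires together, reusing evalDeterministic and repairToJson from above; generateDraft is a hypothetical placeholder for your main generation call:
// Sketch: generate → validate → repair (structure only), then gate on the schema check.
async function generateValidJson(llm, prompt) {
  const draft = await generateDraft(llm, prompt); // hypothetical: your main generation call
  let parsed;
  try {
    parsed = JSON.parse(draft);
  } catch {
    parsed = JSON.parse(await repairToJson(llm, draft)); // structural repair only
  }
  const check = evalDeterministic(parsed);
  return { parsed, ok: check.ok, errors: check.errors };
}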
The key takeaway: retries only help if you change something specific—and you can attribute the change to a measurable outcome delta.
How do you measure improvement without fooling yourself?
If you only measure “did the final attempt succeed”, you’ll convince yourself the agent is improving when it’s just retrying more.
I track improvement as a time series with these metrics:
- Success rate per run (not per attempt)
- Cost per successful run
- Latency per successful run
- Attempts per successful run (should go down)
- Failure-mode distribution (what kind of failures are happening)
Avoid survivorship bias
The common mistake: only logging the final successful attempt. That hides the cost of getting there.
You want to attribute all attempts to the run chain, then compute cost/latency across the chain. Even if you don’t have fancy tooling, you can do this with a simple table.
Here’s a minimal schema that’s enough to compute the metrics above:
create table agent_attempts (
chain_id text not null,
attempt int not null,
outcome text not null,
failure_mode text,
cost_usd numeric(10,4) not null,
latency_ms int not null,
intervention text,
created_at timestamptz not null default now(),
primary key (chain_id, attempt)
);
And an example row that’s realistic (not “0.00” everywhere):
insert into agent_attempts
(chain_id, attempt, outcome, failure_mode, cost_usd, latency_ms, intervention)
values
('req_2026_02_12_8f3c', 2, 'success', null, 0.0437, 2810, 'Switch Model');Now you can compute cost per successful chain:
with chains as (
select chain_id,
bool_or(outcome = 'success') as succeeded,
sum(cost_usd) as total_cost,
sum(latency_ms) as total_latency,
max(attempt) as attempts
from agent_attempts
group by chain_id
)
select
count(*) filter (where succeeded) * 1.0 / count(*) as success_rate,
avg(total_cost) filter (where succeeded) as cost_per_success,
avg(total_latency) filter (where succeeded) as latency_per_success,
avg(attempts) filter (where succeeded) as attempts_per_success
from chains;
Watch for judge overfitting
If you use an LLM judge, you can “improve” by learning the judge’s quirks. Symptoms:
- Judge score goes up, but user complaints don’t go down.
- Outputs become rubric-shaped and unnatural.
- Small prompt tweaks cause huge score shifts.
Mitigations:
- Keep a small set of golden tasks with human-labeled outcomes.
- Periodically re-run those tasks and compare deltas (regression testing).
- Rotate judges or use two judges and require agreement for promotion.
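The two-judge idea is cheap to encode. A sketch, assuming both verdicts arrive in the { score, verdict } shape that parseJudge returns, and that 7 is your pass threshold (an assumption, tune it to your rubric):
// Sketch: require two independent judges to agree before accepting or promoting an output.
function judgesAgree(judgeA, judgeB, minScore = 7) {
  const bothPass = judgeA.verdict === 'pass' && judgeB.verdict === 'pass';
  const bothAboveThreshold = judgeA.score >= minScore && judgeB.score >= minScore;
  return { ok: bothPass && bothAboveThreshold, scores: [judgeA.score, judgeB.score] };
}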
Prompt drift is real
If your agent updates prompts automatically, you need a way to roll back. Otherwise you’ll get a slow degradation: it improves one failure mode while breaking two others.
At minimum, version your prompts and record the version on each attempt. If you can’t answer “when did rubricVersion v4 ship and what did it break?”, you’re flying blind.
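One way to keep that answerable, sketched without any particular tool (the rubric text is placeholder content): keep versions in one addressable place, stamp the active version onto every attempt, and make rollback a data change instead of a code change.
// Sketch: versioned rubrics with an explicit active pointer, so rollback is one assignment.
const rubrics = {
  v3: 'Score the draft 1-10 against the support rubric. Cite evidence.', // placeholder text
  v4: 'Score the draft 1-10. Missing refund policy is an automatic fail. Cite evidence.', // placeholder text
};
let activeRubricVersion = 'v4';
function rollbackRubric(version) {
  if (!rubrics[version]) throw new Error(`unknown rubric version: ${version}`);
  activeRubricVersion = version; // every subsequent attempt records this version
}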
Observability tools help here (Langfuse, Helicone, etc.) because they centralize runs/calls/outcomes. The important part isn’t the dashboard. It’s that you can query history and compute the metrics above without stitching logs from five places.
Key takeaway: measure improvement per successful outcome (not per attempt), and track failure modes so you know what actually got better.
How do you close the loop safely in production?
Self-improvement loops are production footguns. The failure mode isn’t “it returns the wrong answer”. It’s “it burns $400 overnight retrying the same thing”.
Hard limits are not optional:
- Max attempts per chain (I default to 3)
- Budget caps per chain (and per day)
- Time caps (wall clock)
- Rollback for prompt/config updates
- Escalation path (human or a safe fallback)
Here’s a control loop skeleton that enforces the boring stuff. Notice it makes the stop rule explicit and returns a structured result even on failure.
async function runWithLimits({ runAttempt, evalAttempt, maxAttempts, maxCostUsd }) {
let totalCost = 0;
for (let attempt = 1; attempt <= maxAttempts; attempt++) {
const { output, costUsd } = await runAttempt({ attempt });
totalCost += costUsd;
const evaluation = await evalAttempt({ output });
if (evaluation.ok) return { ok: true, attempt, totalCost, output };
if (totalCost >= maxCostUsd) {
return { ok: false, attempt, totalCost, failureMode: 'budget_cap' };
}
}
return { ok: false, attempt: maxAttempts, totalCost, failureMode: 'max_attempts' };
}
Privacy and retention
Logging prompts/responses is useful and also a liability.
Practical approach:
- Redact known secrets (API keys, tokens) before logging.
- Hash or omit user PII fields.
- Store only what you need for evaluation evidence (often the candidate output + judge verdict + minimal context).
- Set retention windows. Keep full payloads short, keep aggregates longer.
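A minimal redaction pass before anything hits the log might look like this; the patterns are illustrative, not a complete secrets/PII strategy:
// Sketch: strip obvious secrets and emails before a payload is logged.
function redactForLogging(text) {
  return text
    .replace(/sk-[A-Za-z0-9_-]{16,}/g, '[REDACTED_API_KEY]') // OpenAI-style keys
    .replace(/Bearer\s+[A-Za-z0-9._-]+/g, 'Bearer [REDACTED_TOKEN]') // auth headers
    .replace(/[\w.+-]+@[\w-]+\.[\w.-]+/g, '[REDACTED_EMAIL]'); // crude email match
}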
Rollout plan that doesn’t ruin your week
I’ve had “self-improving” agents get worse before they get better. That’s normal: you’re exploring changes.
Roll out in stages:
- Shadow mode: run the loop, record outcomes, but don’t affect users.
- Limited traffic: a small percentage of requests, with strict budgets.
- Promotion rules: only adopt a new prompt/config if it beats the baseline on golden tasks and doesn’t regress key metrics.
- Expand: increase traffic gradually.
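The “promotion rules” stage can be a pure function over golden-task results. A sketch, assuming baseline and candidate are arrays of { passed, costUsd } results on the same golden set and the thresholds are yours to tune:
// Sketch: promote a new prompt/config only if it beats the baseline without blowing up cost.
function shouldPromote(baseline, candidate, { minLift = 0.02, maxCostRatio = 1.1 } = {}) {
  const passRate = runs => runs.filter(r => r.passed).length / runs.length;
  const avgCost = runs => runs.reduce((sum, r) => sum + r.costUsd, 0) / runs.length;
  const qualityOk = passRate(candidate) >= passRate(baseline) + minLift;
  const costOk = avgCost(candidate) <= avgCost(baseline) * maxCostRatio;
  return qualityOk && costOk;
}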
Key takeaway: production feedback loops need hard limits (attempts, spend, time) and a rollback/escalation plan, or they’ll turn into expensive infinite loops.
Practical: wiring the loop with WarpMetrics
If you already have your own logging, keep it. The useful bit in WarpMetrics is that it models the loop explicitly as Run → Outcome → Act → follow-up Run, and it captures cost/latency/tokens for each LLM call without you writing custom middleware.
This is what it looks like when you tag a failed attempt, declare the intervention, and link the retry as a follow-up run:
import { warp, run, call, outcome, act, flush } from '@warpmetrics/warp';
import OpenAI from 'openai';
const openai = warp(new OpenAI());
const r1 = run('Support Draft', { attempt: 1 });
const res1 = await openai.chat.completions.create({ model: 'gpt-4o-mini', messages: [{ role: 'user', content: 'Draft a reply...' }] });
call(r1, res1);
const oc1 = outcome(r1, 'Below Threshold', { score: 6, evidence: 'Missed refund policy' });
const a1 = act(oc1, 'Add Context', { change: 'Included refund policy excerpt' });
const r2 = run(a1, 'Support Draft', { attempt: 2 });
const res2 = await openai.chat.completions.create({ model: 'gpt-4o-mini', messages: [{ role: 'user', content: 'Draft a reply with policy...' }] });
call(r2, res2);
outcome(r2, 'Completed', { score: 9 });
await flush();
That’s the whole point: you can later query chains of attempts, see which interventions correlate with success, and compute cost/latency per successful outcome instead of pretending retries are free.
Start building with WarpMetrics
Track every LLM call, measure outcomes, and let your agents query their own performance data.