Agents that debug themselves.
WarpMetrics exposes performance data back to agents via MCP tools. They query their own success rates, find failures, and self-correct.
Other tools show dashboards to humans.
Your agents can't read dashboards.
The debugging loop is manual: you read logs, find the failure, fix the prompt, redeploy. Your agents run blind between fixes.
WarpMetrics
- ✓ 17 MCP tools for agent self-querying
- ✓ Runs with structured outcomes and success rates
- ✓ Agents query → adjust → verify in a loop
- ✓ Automatic tracking of streaming, tool calls, and errors
- ✓ Async SDK — no proxy, no added latency
Other tools
Langfuse, Helicone, etc.
- ✗ Human-only dashboards, no agent access
- ✗ Flat traces without run-level outcomes
- ✗ No programmatic query interface for agents
- ✗ No MCP integration
- ✗ No outcome classification or success rate tracking
How it works
Agents that see their own performance and self-correct
Your code review agent runs. Success rate drops from 94% to 67%. The agent calls get_outcome_stats and discovers failures correlate with long inputs. It switches to a model with larger context. Success rate recovers to 91%.
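The recovery above hinges on a check the agent can make itself: has the rate dropped, and do the failures skew toward long inputs? A minimal sketch of that decision, using a hypothetical stats shape (field names here are illustrative assumptions, not the actual get_outcome_stats response):

```typescript
// Hypothetical shape of a get_outcome_stats response (illustrative only).
interface OutcomeStats {
  successRate: number;                  // current window, e.g. 0.67
  baselineRate: number;                 // earlier window, e.g. 0.94
  failures: { inputTokens: number }[];  // per-failure input sizes
}

// Switch models only if the rate dropped AND most failures had long inputs.
function shouldSwitchToLargerContext(
  stats: OutcomeStats,
  tokenThreshold = 8000,
): boolean {
  const degraded = stats.successRate < stats.baselineRate - 0.1;
  const longFailures = stats.failures.filter(
    (f) => f.inputTokens > tokenThreshold,
  ).length;
  const mostlyLong = longFailures / Math.max(stats.failures.length, 1) > 0.5;
  return degraded && mostlyLong;
}

const stats: OutcomeStats = {
  successRate: 0.67,
  baselineRate: 0.94,
  failures: [{ inputTokens: 12000 }, { inputTokens: 15000 }, { inputTokens: 900 }],
};
console.log(shouldSwitchToLargerContext(stats)); // → true
```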
Instrument
Wrap your client with warp(). All LLM calls are captured.
Run
Each execution is a run with cost, latency, tokens, and outcomes.
Query
Agents call MCP tools to query success rates, find failures, and compare costs.
Improve
The agent adjusts prompts, swaps models, or changes logic — then verifies the improvement.
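The four steps above form a loop an agent can drive programmatically. A sketch of that query → adjust → verify cycle, with the MCP query and the adjustment injected as plain functions (the names and types here are illustrative, not the SDK's API):

```typescript
// Illustrative control loop: query stats, adjust if below target, re-verify.
type QueryStats = () => { successRate: number };
type Adjust = () => void;

function queryAdjustVerify(
  query: QueryStats,
  adjust: Adjust,
  target = 0.9,
  maxPasses = 3,
): number {
  let rate = query().successRate;
  for (let pass = 0; pass < maxPasses && rate < target; pass++) {
    adjust();                   // e.g. swap model, tweak prompt
    rate = query().successRate; // verify the change actually helped
  }
  return rate;
}

// Simulated environment: the adjustment raises the success rate.
let current = 0.67;
const finalRate = queryAdjustVerify(
  () => ({ successRate: current }),
  () => { current = 0.91; },
);
console.log(finalRate); // → 0.91
```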
MCP integration
17 tools agents call to query their own data
Agents call these during execution to check success rates, find failures, compare costs, and decide what to change.
list_runs
Filter runs by label, date range, and outcome. Returns success rates, costs, and durations.
get_outcome_stats
Returns success/failure counts and rates per outcome name, with trend data over time.
get_call
Returns the full call record: model, provider, tokens, cost, latency, tool calls, and status.
get_run_timeline
Returns the execution sequence of a run: groups, calls, and outcomes in order.
get_timeseries
Returns metrics over time — success rates, costs, and latency bucketed by hour or day.
get_stats
Summary statistics across all runs: total cost, average latency, call counts, and success rates.
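Once an agent has a list_runs response in hand, deciding what to change is ordinary data manipulation. A sketch of one such decision, picking the cheapest run label that still clears a success-rate floor, over an assumed response row shape (the field names are illustrative, not the real schema):

```typescript
// Hypothetical list_runs response rows (field names are assumptions).
interface RunSummary {
  label: string;
  successRate: number;
  totalCostUsd: number;
}

// Cheapest run label that still meets the success-rate floor.
function cheapestPassing(
  runs: RunSummary[],
  floor = 0.9,
): RunSummary | undefined {
  return runs
    .filter((r) => r.successRate >= floor)
    .sort((a, b) => a.totalCostUsd - b.totalCostUsd)[0];
}

const runs: RunSummary[] = [
  { label: 'review-v1', successRate: 0.94, totalCostUsd: 1.8 },
  { label: 'review-v2', successRate: 0.91, totalCostUsd: 0.9 },
  { label: 'review-v3', successRate: 0.72, totalCostUsd: 0.4 },
];
console.log(cheapestPassing(runs)?.label); // → review-v2
```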
Data agents consume
What your agents see via MCP
Every LLM call, grouped into runs, with structured outcomes and full cost accounting. Your team sees the dashboard. Your agents query the same data programmatically.
Every agent execution is a run
→ list_runs: Runs have labels, success rates, total cost, and duration. Filter by label to compare different agent tasks or versions.
Runs contain groups, calls, and outcomes
→ get_run_timeline: Groups organize logical steps. Calls are individual LLM requests with cost, latency, and token counts. Outcomes classify the result.
Outcomes classify run results
→ get_outcome_stats: Record outcomes as success, failure, or neutral. WarpMetrics computes success rates per label, per time period, and across your entire project.
Full detail on every LLM call
→ get_call: Provider, model, cost, latency, prompt/completion tokens, tool calls, and streaming — captured automatically by the SDK wrapper.
Tool calls are captured with their full arguments, for example:
{
  "query": "wireless headphones",
  "filters": {
    "price_max": 200,
    "rating_min": 4.0
  }
}
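The per-label success rates described above reduce to a simple aggregation over outcome records. A sketch with an assumed record shape (the field names are illustrative), where neutral outcomes are excluded from the denominator:

```typescript
// Assumed outcome record shape (illustrative only).
interface OutcomeRecord {
  label: string;
  result: 'success' | 'failure' | 'neutral';
}

// Success rate per label; neutral outcomes don't count toward the total.
function successRateByLabel(records: OutcomeRecord[]): Map<string, number> {
  const tally = new Map<string, { ok: number; total: number }>();
  for (const r of records) {
    if (r.result === 'neutral') continue;
    const t = tally.get(r.label) ?? { ok: 0, total: 0 };
    t.total += 1;
    if (r.result === 'success') t.ok += 1;
    tally.set(r.label, t);
  }
  return new Map([...tally].map(([label, t]) => [label, t.ok / t.total]));
}

const rates = successRateByLabel([
  { label: 'review', result: 'success' },
  { label: 'review', result: 'failure' },
  { label: 'review', result: 'success' },
  { label: 'triage', result: 'neutral' },
  { label: 'triage', result: 'success' },
]);
console.log(rates.get('review')); // 2/3 ≈ 0.667
```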
Three SDK calls to start tracking
warp() wraps your client. run() starts a tracked execution. outcome() records the result.
import { warp, run, outcome } from '@warpmetrics/warp';
import OpenAI from 'openai';

const openai = warp(new OpenAI());
const r = run('Code review');
const res = await openai.responses.create({...});
outcome(r, 'Completed');

Once tracked, this data is queryable by agents via MCP tools.
Start tracking in under a minute
npm install, wrap your client, deploy. Your agents can query their own performance data immediately.
Free tier includes 7 days of data retention. No credit card required.