Agents that debug themselves.

WarpMetrics exposes performance data back to agents via MCP tools. They query their own success rates, find failures, and self-correct.

Other tools show dashboards to humans.
Your agents can't read dashboards.

The debugging loop is manual: you read logs, find the failure, fix the prompt, redeploy. Your agents run blind between fixes.

WarpMetrics

 

  • 17 MCP tools for agent self-querying
  • Runs with structured outcomes and success rates
  • Agents query → adjust → verify in a loop
  • Automatic tracking of streaming, tool calls, and errors
  • Async SDK — no proxy, no added latency

Other tools

Langfuse, Helicone, etc.

  • Human-only dashboards, no agent access
  • Flat traces without run-level outcomes
  • No programmatic query interface for agents
  • No MCP integration
  • No outcome classification or success rate tracking

How it works

Agents that see their own performance and self-correct

Your code review agent runs. Success rate drops from 94% to 67%. The agent calls get_outcome_stats and discovers failures correlate with long inputs. It switches to a model with larger context. Success rate recovers to 91%.
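That decision step can be sketched in code. This is illustrative only: `callTool` stands in for whatever MCP client wiring your agent framework provides, and the stats shape, threshold, and fallback model are assumptions, not the actual WarpMetrics schema.

```typescript
// Hypothetical shape of a get_outcome_stats result — field names are
// illustrative, not the real WarpMetrics response schema.
type OutcomeStats = {
  successRate: number;           // e.g. 0.67
  failureCorrelations: string[]; // e.g. ['long_input']
};

// Sketch of the self-correction step: query stats, and if failures
// correlate with long inputs, fall back to a larger-context model.
async function selfCorrect(
  callTool: (name: string, args: object) => Promise<OutcomeStats>,
  currentModel: string
): Promise<string> {
  const stats = await callTool('get_outcome_stats', { label: 'code-review' });
  if (stats.successRate < 0.8 && stats.failureCorrelations.indexOf('long_input') !== -1) {
    return 'gpt-4o'; // assumed larger-context fallback
  }
  return currentModel;
}
```

The agent then verifies the change by querying `get_outcome_stats` again on fresh runs.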

The loop

1. Instrument
   Wrap your client with warp(). All LLM calls are captured.

2. Run
   Each execution is a run with cost, latency, tokens, and outcomes.

3. Query
   Agents call MCP tools to query success rates, find failures, and compare costs.

4. Improve
   The agent adjusts prompts, swaps models, or changes logic, then verifies the improvement.

Each cycle feeds data back into the next run.
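The query → adjust → verify cycle can be sketched as a simple loop. Every function name here is a placeholder for your own agent logic; only the loop shape comes from the steps above.

```typescript
// Sketch of the improve loop: query the current success rate, make one
// change, then verify against fresh run data — up to maxCycles times.
async function improveLoop(
  query: () => Promise<number>, // current success rate, via MCP tools
  adjust: () => Promise<void>,  // change prompt, model, or logic
  target = 0.9,
  maxCycles = 5
): Promise<number> {
  let rate = await query();
  for (let i = 0; i < maxCycles && rate < target; i++) {
    await adjust();       // make one change...
    rate = await query(); // ...then check whether it helped
  }
  return rate;
}
```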

MCP integration

17 tools agents call to query their own data

Agents call these during execution to check success rates, find failures, compare costs, and decide what to change.

list_runs

Filter runs by label, date range, and outcome. Returns success rates, costs, and durations.

Example filter: label=review, status=fail

get_outcome_stats

Returns success/failure counts and rates per outcome name, with trend data over time.

Example trend: 84% → 92%

get_call

Returns the full call record: model, provider, tokens, cost, latency, tool calls, and status.

Example: model gpt-4o, cost $0.03

get_run_timeline

Returns the execution sequence of a run: groups, calls, and outcomes in order.

Example phases: Research → Write → Review

get_timeseries

Returns metrics over time — success rates, costs, and latency bucketed by hour or day.

get_stats

Summary statistics across all runs: total cost, average latency, call counts, and success rates.

Example: 2.1k runs, $14 total cost
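As a sketch of what an agent does with this data: feed `list_runs` output into a comparison. The field names below are assumptions based on the descriptions above, not the actual WarpMetrics response schema.

```typescript
// Illustrative run summary, inferred from the list_runs description.
interface RunSummary {
  label: string;
  successRate: number; // 0–1
  totalCost: number;   // USD
  durationMs: number;
}

// Pick the cheapest configuration that still meets a success-rate floor —
// the kind of cost/quality trade-off an agent can make from run data.
function cheapestPassing(runs: RunSummary[], minRate: number): RunSummary | undefined {
  return runs
    .filter(r => r.successRate >= minRate)
    .sort((a, b) => a.totalCost - b.totalCost)[0];
}
```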

Data agents consume

What your agents see via MCP

Every LLM call, grouped into runs, with structured outcomes and full cost accounting. Your team sees the dashboard. Your agents query the same data programmatically.

Product research       847 runs   $0.12 avg   3.2s    94% success   last run 2m ago
Code review          1,203 runs   $0.08 avg   2.1s    89% success   last run 5m ago
Content generation     634 runs   $0.15 avg   4.5s    97% success   last run 12m ago
Data extraction      2,156 runs   $0.05 avg   1.8s    82% success   last run 1m ago
Report generation      423 runs   $0.22 avg   6.7s    67% success   last run 8m ago
Customer support     3,421 runs   $0.06 avg   1.4s    91% success   last run 30s ago

Every agent execution is a run

→ list_runs

Runs have labels, success rates, total cost, and duration. Filter by label to compare different agent tasks or versions.

Runs contain groups, calls, and outcomes

→ get_run_timeline

Groups organize logical steps. Calls are individual LLM requests with cost, latency, and token counts. Outcomes classify the result.
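That run → groups → calls structure can be modeled as plain types. This is an illustrative data model inferred from the description above; actual WarpMetrics field names may differ.

```typescript
// Illustrative shapes for a run timeline (field names are assumptions).
interface Call  { model: string; costUsd: number; latencyMs: number; tokens: number; }
interface Group { name: string; calls: Call[]; outcome?: string; }
interface Run   { label: string; groups: Group[]; }

// A run's total cost is the sum over every call in every group.
function runCost(run: Run): number {
  let total = 0;
  for (const g of run.groups) {
    for (const c of g.calls) total += c.costUsd;
  }
  return total;
}
```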

Example run: Generate product description

Research phase → research_complete (2.4s)
  gpt-4o 1.1s, gpt-4o 1.0s
Writing phase → draft_created (3.3s)
  gpt-4o 1.9s, gpt-4o-mini 1.1s
Review phase → approved (1.3s)
  gpt-4o-mini 1.1s
Success rate: 94% (2,002 successes, 132 failures)

Outcome distribution:
  task_completed     847
  data_extracted     634
  validation_passed  521
  retry_needed       156
  invalid_input       89
  timeout             43

Outcomes classify run results

→ get_outcome_stats

Record outcomes as success, failure, or neutral. WarpMetrics computes success rates per label, per time period, and across your entire project.
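The success-rate math described above can be sketched as follows. The success/failure/neutral classification comes from the text; the record shape, and the choice to exclude neutral outcomes from the denominator, are assumptions.

```typescript
// Illustrative outcome record — classification values are from the docs,
// the shape is assumed.
type Classification = 'success' | 'failure' | 'neutral';
interface OutcomeRecord { label: string; classification: Classification; }

// Success rate per label, with neutral outcomes excluded from the tally
// (an assumption about how neutral is treated).
function successRates(records: OutcomeRecord[]): Record<string, number> {
  const tally: Record<string, { ok: number; total: number }> = {};
  for (const r of records) {
    if (r.classification === 'neutral') continue;
    const t = tally[r.label] || { ok: 0, total: 0 };
    t.total++;
    if (r.classification === 'success') t.ok++;
    tally[r.label] = t;
  }
  const rates: Record<string, number> = {};
  for (const label in tally) rates[label] = tally[label].ok / tally[label].total;
  return rates;
}
```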

Full detail on every LLM call

→ get_call

Provider, model, cost, latency, prompt/completion tokens, tool calls, and streaming — captured automatically by the SDK wrapper.

Example call record:

model: gpt-4o (OpenAI)   status: success
duration: 2.3s   cost: $0.08
tokens: 3,247 (2.1k in / 1.1k out)

Tool call: search_products
{
  "query": "wireless headphones",
  "filters": {
    "price_max": 200,
    "rating_min": 4.0
  }
}

Messages:
user: Find me the best wireless headphones under $200 with good reviews
assistant: I found 3 excellent options for you. The Sony WH-1000XM4 is currently the top pick at $198...

Three SDK calls to start tracking

warp() wraps your client. run() starts a tracked execution. outcome() records the result.

import { warp, run, outcome } from '@warpmetrics/warp';
import OpenAI from 'openai';

const openai = warp(new OpenAI());

const r = run('Code review');
const res = await openai.responses.create({...});
outcome(r, 'Completed');

Once tracked, this data is queryable by agents via MCP tools.

Start tracking in under a minute

npm install, wrap your client, deploy. Your agents can query their own performance data immediately.

Free tier includes 7 days of data retention. No credit card required.