Agents that debug themselves.

WarpMetrics exposes performance data back to agents via MCP tools. They query their own success rates, find failures, and self-correct.

Other tools show dashboards to humans.
Your agents can't read dashboards.

The debugging loop is manual: you read logs, find the failure, fix the prompt, redeploy. Your agents run blind between fixes.

WarpMetrics

 

  • 17 MCP tools for agent self-querying
  • Runs with structured outcomes and success rates
  • Agents query → adjust → verify in a loop
  • Automatic tracking of streaming, tool calls, and errors
  • Async SDK — no proxy, no added latency

Other tools

Langfuse, Helicone, etc.

  • Human-only dashboards, no agent access
  • Flat traces without run-level outcomes
  • No programmatic query interface for agents
  • No MCP integration
  • No outcome classification or success rate tracking

How it works

Agents that see their own performance and self-correct

Your code review agent runs. Success rate drops from 94% to 67%. The agent calls get_outcome_stats and discovers failures correlate with long inputs. It switches to a model with larger context. Success rate recovers to 91%.
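That decision step can be sketched in code. This is illustrative only: `callTool` stands in for whatever MCP client wiring your agent framework provides, and the stats shape, threshold, and fallback model are assumptions, not the actual WarpMetrics schema.

```typescript
// Hypothetical shape of a get_outcome_stats result — field names are
// illustrative, not the real WarpMetrics response schema.
type OutcomeStats = {
  successRate: number;           // e.g. 0.67
  failureCorrelations: string[]; // e.g. ['long_input']
};

// Sketch of the self-correction step: query stats, and if failures
// correlate with long inputs, fall back to a larger-context model.
async function selfCorrect(
  callTool: (name: string, args: object) => Promise<OutcomeStats>,
  currentModel: string
): Promise<string> {
  const stats = await callTool('get_outcome_stats', { label: 'code-review' });
  if (stats.successRate < 0.8 && stats.failureCorrelations.indexOf('long_input') !== -1) {
    return 'gpt-4o'; // assumed larger-context fallback
  }
  return currentModel;
}
```

The agent then verifies the change by querying `get_outcome_stats` again on fresh runs.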

The loop

1. Instrument
   Wrap your client with warp(). All LLM calls are captured.

2. Run
   Each execution is a run with cost, latency, tokens, and outcomes.

3. Query
   Agents call MCP tools to query success rates, find failures, and compare costs.

4. Improve
   The agent adjusts prompts, swaps models, or changes logic, then verifies the improvement.

Each cycle feeds data back into the next run.
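The query → adjust → verify cycle can be sketched as a simple loop. Every function name here is a placeholder for your own agent logic; only the loop shape comes from the steps above.

```typescript
// Sketch of the improve loop: query the current success rate, make one
// change, then verify against fresh run data — up to maxCycles times.
async function improveLoop(
  query: () => Promise<number>, // current success rate, via MCP tools
  adjust: () => Promise<void>,  // change prompt, model, or logic
  target = 0.9,
  maxCycles = 5
): Promise<number> {
  let rate = await query();
  for (let i = 0; i < maxCycles && rate < target; i++) {
    await adjust();       // make one change...
    rate = await query(); // ...then check whether it helped
  }
  return rate;
}
```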

MCP integration

17 tools agents call to query their own data

Agents call these during execution to check success rates, find failures, compare costs, and decide what to change.

list_runs

Filter runs by label, date range, and outcome. Returns success rates, costs, and durations.

Example filter: label=review, status=fail

get_outcome_stats

Returns success/failure counts and rates per outcome name, with trend data over time.

Example trend: 84% → 92%

get_call

Returns the full call record: model, provider, tokens, cost, latency, tool calls, and status.

Example: model gpt-4o, cost $0.03

get_run_timeline

Returns the execution sequence of a run: groups, calls, and outcomes in order.

Example phases: Research → Write → Review

get_timeseries

Returns metrics over time — success rates, costs, and latency bucketed by hour or day.

get_stats

Summary statistics across all runs: total cost, average latency, call counts, and success rates.

Example: 2.1k runs, $14 total cost
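As a sketch of what an agent does with this data: feed `list_runs` output into a comparison. The field names below are assumptions based on the descriptions above, not the actual WarpMetrics response schema.

```typescript
// Illustrative run summary, inferred from the list_runs description.
interface RunSummary {
  label: string;
  successRate: number; // 0–1
  totalCost: number;   // USD
  durationMs: number;
}

// Pick the cheapest configuration that still meets a success-rate floor —
// the kind of cost/quality trade-off an agent can make from run data.
function cheapestPassing(runs: RunSummary[], minRate: number): RunSummary | undefined {
  return runs
    .filter(r => r.successRate >= minRate)
    .sort((a, b) => a.totalCost - b.totalCost)[0];
}
```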

Data agents consume

What your agents see via MCP

Every LLM call, grouped into runs, with structured outcomes and full cost accounting. Your team sees the dashboard. Your agents query the same data programmatically.

Product research       847 runs   $0.12 avg   3.2s    94% success   last run 2m ago
Code review          1,203 runs   $0.08 avg   2.1s    89% success   last run 5m ago
Content generation     634 runs   $0.15 avg   4.5s    97% success   last run 12m ago
Data extraction      2,156 runs   $0.05 avg   1.8s    82% success   last run 1m ago
Report generation      423 runs   $0.22 avg   6.7s    67% success   last run 8m ago
Customer support     3,421 runs   $0.06 avg   1.4s    91% success   last run 30s ago

Every agent execution is a run

→ list_runs

Runs have labels, success rates, total cost, and duration. Filter by label to compare different agent tasks or versions.

Runs contain groups, calls, and outcomes

→ get_run_timeline

Groups organize logical steps. Calls are individual LLM requests with cost, latency, and token counts. Outcomes classify the result.
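That run → groups → calls structure can be modeled as plain types. This is an illustrative data model inferred from the description above; actual WarpMetrics field names may differ.

```typescript
// Illustrative shapes for a run timeline (field names are assumptions).
interface Call  { model: string; costUsd: number; latencyMs: number; tokens: number; }
interface Group { name: string; calls: Call[]; outcome?: string; }
interface Run   { label: string; groups: Group[]; }

// A run's total cost is the sum over every call in every group.
function runCost(run: Run): number {
  let total = 0;
  for (const g of run.groups) {
    for (const c of g.calls) total += c.costUsd;
  }
  return total;
}
```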

Example run: Generate product description

Research phase → research_complete (2.4s)
  gpt-4o 1.1s, gpt-4o 1.0s
Writing phase → draft_created (3.3s)
  gpt-4o 1.9s, gpt-4o-mini 1.1s
Review phase → approved (1.3s)
  gpt-4o-mini 1.1s
Success rate: 94% (2,002 successes, 132 failures)

Outcome distribution:
  task_completed     847
  data_extracted     634
  validation_passed  521
  retry_needed       156
  invalid_input       89
  timeout             43

Outcomes classify run results

→ get_outcome_stats

Record outcomes as success, failure, or neutral. WarpMetrics computes success rates per label, per time period, and across your entire project.
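The success-rate math described above can be sketched as follows. The success/failure/neutral classification comes from the text; the record shape, and the choice to exclude neutral outcomes from the denominator, are assumptions.

```typescript
// Illustrative outcome record — classification values are from the docs,
// the shape is assumed.
type Classification = 'success' | 'failure' | 'neutral';
interface OutcomeRecord { label: string; classification: Classification; }

// Success rate per label, with neutral outcomes excluded from the tally
// (an assumption about how neutral is treated).
function successRates(records: OutcomeRecord[]): Record<string, number> {
  const tally: Record<string, { ok: number; total: number }> = {};
  for (const r of records) {
    if (r.classification === 'neutral') continue;
    const t = tally[r.label] || { ok: 0, total: 0 };
    t.total++;
    if (r.classification === 'success') t.ok++;
    tally[r.label] = t;
  }
  const rates: Record<string, number> = {};
  for (const label in tally) rates[label] = tally[label].ok / tally[label].total;
  return rates;
}
```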

Full detail on every LLM call

→ get_call

Provider, model, cost, latency, prompt/completion tokens, tool calls, and streaming — captured automatically by the SDK wrapper.

Example call record:

model: gpt-4o (OpenAI)   status: success
duration: 2.3s   cost: $0.08
tokens: 3,247 (2.1k in / 1.1k out)

Tool call: search_products
{
  "query": "wireless headphones",
  "filters": {
    "price_max": 200,
    "rating_min": 4.0
  }
}

Messages:
user: Find me the best wireless headphones under $200 with good reviews
assistant: I found 3 excellent options for you. The Sony WH-1000XM4 is currently the top pick at $198...

Three SDK calls to start tracking

warp() wraps your client. run() starts a tracked execution. outcome() records the result.

import { warp, run, outcome } from '@warpmetrics/warp';
import OpenAI from 'openai';

const openai = warp(new OpenAI());

const r = run('Code review');
const res = await openai.responses.create({...});
outcome(r, 'Completed');

Once tracked, this data is queryable by agents via MCP tools.

Start tracking in under a minute

npm install, wrap your client, deploy. Your agents can query their own performance data immediately.

Free tier includes 7 days of data retention. No credit card required.