How it works

Five primitives that structure every agent execution

Every agent run flows through a hierarchy: Run → Group → Call → Outcome → Act. Agents query this structure via MCP to find failures and adjust.

The self-improvement loop: fail → act → retry → pass

The data model

Five concepts that map every agent execution

Each primitive is recorded, stored, and queryable via both the dashboard and MCP tools.

Run

One complete agent execution. Label it, link it to a PR or ticket, record whether it worked.

const r = run('Code review');

Group

Organize steps within a run. Planning, execution, validation — groups can nest to any depth.

const g = group(r, 'Planning');

Call

Every LLM API call, tracked automatically by the SDK wrapper. Model, tokens, cost, latency, tool calls, and streaming — all captured.

const res = await openai.responses.create({...});

Outcome

Classify the result of any run, group, or call as success, failure, or neutral. Three strategies: unit test, heuristic, LLM-as-judge.

outcome(r, 'Passed');

Act

When an outcome fails, an act records the corrective action — retry, change prompt, switch model — and links to a follow-up run.

const a = act(o, 'Retry'); run(a, 'Code review');
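
Putting the five primitives together, here is a minimal sketch of one full loop. The import path is a placeholder, and the calls assume the same run/group/outcome/act signatures shown above:

// Placeholder import path; use the package name from the install docs.
import { run, group, outcome, act } from 'agent-sdk';

const r = run('Code review');      // Run: one complete execution
const g = group(r, 'Planning');    // Group: a named step inside the run
// ...wrapped LLM calls happen here; each one is recorded as a Call...
const o = outcome(r, 'Failed');    // Outcome: classify the result
const a = act(o, 'Retry');         // Act: record the corrective action
run(a, 'Code review');             // ...and link it to a follow-up run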

The self-improvement loop

The instrument → run → query → improve cycle

Each run is recorded with structured outcomes. Agents query their own history via MCP tools, identify failing steps, and adjust prompts or models before the next run.

Evaluation strategies

Three ways to evaluate

Not everything needs a test suite. Pick the strategy that fits — or combine them.

Deterministic

Unit tests, exact match. When tests pass, the outcome passes.
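
For example, a deterministic check can decide the outcome directly. A sketch, assuming res is the LLM response from the Call snippet above and '42' is the known expected answer:

// Deterministic: the test result decides the outcome.
const expected = '42';
const passed = res.output_text.trim() === expected; // exact match against a known answer
outcome(r, passed ? 'Passed' : 'Failed');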

Heuristic

Programmatic checks. Length, format, keywords. No "right answer" needed.
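
A sketch of a heuristic check, again assuming res is the response from the Call snippet; the length and keyword thresholds are arbitrary:

// Heuristic: programmatic checks, no reference answer required.
const text = res.output_text;
const longEnough = text.length > 200;
const mentionsTests = text.toLowerCase().includes('test');
outcome(r, longEnough && mentionsTests ? 'Passed' : 'Failed');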

LLM-as-Judge

Second LLM scores output. For subjective quality: tone, clarity, completeness.

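A sketch of the judge pattern: a second model scores the output, and the score decides the outcome. The judge prompt and the 7/10 threshold are illustrative:

// LLM-as-judge: a second model rates the output for subjective quality.
const judge = await openai.responses.create({
  model: 'gpt-4o-mini',
  input: `Rate this review for tone, clarity, and completeness from 1 to 10. Reply with only the number.\n\n${res.output_text}`,
});
const score = parseInt(judge.output_text, 10);
outcome(r, score >= 7 ? 'Passed' : 'Failed');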

MCP integration

Agents query their own run history via MCP

17 MCP tools — including list_runs, get_outcome_stats, get_run_timeline, and get_timeseries — let your agent see its own performance history and use it to improve.

  • Query success rates by outcome name or time range
  • Drill into run timelines to find which step failed
  • Inspect individual calls — inputs, outputs, cost, latency
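
How an agent might call these tools, sketched with the MCP TypeScript SDK. The tool names come from the list above; the server command and argument shapes are assumptions, so check the MCP server docs for the real ones:

import { Client } from '@modelcontextprotocol/sdk/client/index.js';
import { StdioClientTransport } from '@modelcontextprotocol/sdk/client/stdio.js';

const client = new Client({ name: 'my-agent', version: '1.0.0' });
await client.connect(new StdioClientTransport({ command: 'npx', args: ['agent-tracker-mcp'] })); // placeholder server command

// Query outcome stats and recent failing runs, then adjust before the next attempt.
const stats = await client.callTool({ name: 'get_outcome_stats', arguments: { label: 'Code review' } });
const failures = await client.callTool({ name: 'list_runs', arguments: { status: 'fail' } });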

The platform

Dashboard for your team, API for your agents

Agents query the API. Your team gets a dashboard. Both see the same data — cost, performance, outcomes, and trends.

Cost & token tracking

Spend per run, per model, per group. Prompt, completion, and cached tokens broken down on every call.

Flow visualization

Waterfall timeline of every execution step. Compare runs side-by-side to identify where execution paths diverged.

Outcomes & errors

Classify outcomes as success, failure, or neutral. Errors captured automatically with full details.

Streaming support

OpenAI and Anthropic streaming handled transparently. Full content, tokens, and timing captured.

Search & filter everything

Filter runs by label, calls by model, outcomes by classification. Sort by cost, latency, or time.

Teams & projects

Organize data across projects with separate API keys. Invite your team with role-based access.

Data retention

7 days on Free, 90 days on Pro, unlimited on Enterprise. Data remains queryable by agents and the dashboard for the full retention window.

Start tracking agent executions

Install the SDK, wrap your client, record outcomes. Agents can query their own data immediately.
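
A minimal sketch of that flow. The package name and the wrap() helper are placeholders; the real names come from the install docs:

// npm install openai agent-sdk   (placeholder package name)
import OpenAI from 'openai';
import { wrap, run, outcome } from 'agent-sdk';

const openai = wrap(new OpenAI());   // wrap the client so every call is captured
const r = run('Code review');        // start a run
const res = await openai.responses.create({ model: 'gpt-4o', input: 'Review this diff.' });
outcome(r, res.output_text ? 'Passed' : 'Failed'); // record the result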