How it works
Five primitives that structure every agent execution
Every agent run flows through a hierarchy: Run → Group → Call → Outcome → Act. Agents query this structure via MCP to find failures and adjust.
The data model
Five concepts that map every agent execution
Each primitive is recorded, stored, and queryable via both the dashboard and MCP tools.
Run
One complete agent execution. Label it, link it to a PR or ticket, record whether it worked.
const r = run('Code review');
Group
Organize steps within a run. Planning, execution, validation — groups nest infinitely.
const g = group(r, 'Planning');
Call
Every LLM API call, tracked automatically by the SDK wrapper. Model, tokens, cost, latency, tool calls, and streaming — all captured.
const res = await openai.responses.create({...});
Outcome
Classify the result of any run, group, or call as success, failure, or neutral. Three strategies: unit test, heuristic, LLM-as-judge.
const o = outcome(r, 'Passed');
Act
When an outcome fails, an act records the corrective action — retry, change prompt, switch model — and links to a follow-up run.
const a = act(o, 'Retry'); run(a, 'Code review');
The self-improvement loop
The instrument → run → query → improve cycle
Each run is recorded with structured outcomes. Agents query their own history via MCP tools, identify failing steps, and adjust prompts or models before the next run.
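Put together, one pass through the cycle might look like the sketch below. It reuses the helpers from the snippets above (run, group, outcome, act) plus a wrap() helper standing in for the SDK wrapper; the package name, import, and helper signatures are assumptions, not the documented API.

import OpenAI from 'openai';
import { wrap, run, group, outcome, act } from 'your-sdk-package'; // placeholder: actual package name not shown on this page

const openai = wrap(new OpenAI());               // hypothetical wrapper: LLM calls are then captured automatically

const r = run('Code review');                    // instrument: one Run per execution
const g = group(r, 'Review diff');               // a step within the run

const res = await openai.responses.create({      // every call recorded with model, tokens, cost, latency
  model: 'gpt-4.1-mini',
  input: 'Review this diff: ...',
});

const passed = res.output_text.includes('LGTM'); // toy heuristic check
const o = outcome(g, passed ? 'Passed' : 'Failed');

if (!passed) {
  const a = act(o, 'Retry with a stricter prompt'); // corrective action recorded against the failed outcome
  run(a, 'Code review');                            // linked follow-up run closes the loop
}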
Evaluation strategies
Three ways to evaluate
Not everything needs a test suite. Pick the strategy that fits — or combine them.
Deterministic
Unit tests, exact match. When tests pass, the outcome passes.
Heuristic
Programmatic checks. Length, format, keywords. No "right answer" needed.
LLM-as-Judge
Second LLM scores output. For subjective quality: tone, clarity, completeness.
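As a sketch, reusing the openai client, r, g, and res from the example above together with the outcome helper, the three strategies might be recorded like this; the check logic, thresholds, and judge prompt are illustrative only.

// Deterministic: tie the outcome to a unit-test result.
// testsPassed would come from your own test harness (assumption).
outcome(r, testsPassed ? 'Tests passed' : 'Tests failed');

// Heuristic: programmatic checks, no "right answer" needed.
const review = res.output_text;
const formatOk = review.length < 2000 && /approve|request changes/i.test(review);
outcome(g, formatOk ? 'Format ok' : 'Format check failed');

// LLM-as-judge: a second model scores subjective quality.
const judge = await openai.responses.create({
  model: 'gpt-4.1-mini',
  input: `Rate the clarity of this code review from 1 to 5. Reply with the number only.\n\n${review}`,
});
outcome(g, Number(judge.output_text.trim()) >= 4 ? 'Clear' : 'Unclear');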
MCP integration
Agents query their own run history via MCP
17 MCP tools — including list_runs, get_outcome_stats, get_run_timeline, and get_timeseries — let your agent see its own performance history and use it to improve.
- ✓ Query success rates by outcome name or time range
- ✓ Drill into run timelines to find which step failed
- ✓ Inspect individual calls — inputs, outputs, cost, latency
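For example, an agent built on the MCP TypeScript SDK could call these tools directly. Only the tool names below come from the list above; the server URL and argument shapes are placeholders.

import { Client } from '@modelcontextprotocol/sdk/client/index.js';
import { StreamableHTTPClientTransport } from '@modelcontextprotocol/sdk/client/streamableHttp.js';

const client = new Client({ name: 'my-agent', version: '1.0.0' });
await client.connect(new StreamableHTTPClientTransport(new URL('https://your-mcp-endpoint.example/mcp'))); // placeholder URL

// Success rates by outcome name and time range (argument shape is a guess).
const stats = await client.callTool({
  name: 'get_outcome_stats',
  arguments: { outcome: 'Code review', range: '7d' },
});

// Find recent failing runs, then drill into one timeline to see which step failed.
const failing = await client.callTool({ name: 'list_runs', arguments: { status: 'failure', limit: 5 } });
const timeline = await client.callTool({ name: 'get_run_timeline', arguments: { run_id: '...' } });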
The platform
Dashboard for your team, API for your agents
Agents query the API. Your team gets a dashboard. Both see the same data — cost, performance, outcomes, and trends.
Cost & token tracking
Spend per run, per model, per group. Prompt, completion, and cached tokens broken down on every call.
Flow visualization
Waterfall timeline of every execution step. Compare runs side-by-side to identify where execution paths diverged.
Outcomes & errors
Classify outcomes as success, failure, or neutral. Errors captured automatically with full details.
Streaming support
OpenAI and Anthropic streaming handled transparently. Full content, tokens, and timing captured.
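Assuming the wrapper preserves the client's streaming interface, a streaming call reads the same as it would without instrumentation. This sketch reuses the wrapped openai client from the earlier example with the OpenAI Responses event stream.

const stream = await openai.responses.create({
  model: 'gpt-4.1-mini',
  input: 'Summarize the failing step.',
  stream: true,
});

for await (const event of stream) {
  if (event.type === 'response.output_text.delta') {
    process.stdout.write(event.delta); // your code still sees every delta; content, tokens, and timing are captured alongside
  }
}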
Search & filter everything
Filter runs by label, calls by model, outcomes by classification. Sort by cost, latency, or time.
Teams & projects
Organize data across projects with separate API keys. Invite your team with role-based access.
Data retention
7 days on Free, 90 days on Pro, unlimited on Enterprise. Data remains queryable by agents and the dashboard for the full retention window.
Start tracking agent executions
Install the SDK, wrap your client, record outcomes. Agents can query their own data immediately.