LLM Observability: What to Track and Why
After you ship, you’ll get the same three pages over and over: cost spikes, latency spikes, and “the agent is doing weird stuff.” This post lays out the metrics and events that let you answer those pages with evidence, not vibes. The trick is to start from the questions you must answer, then design a stable schema that makes those answers cheap to compute.
What questions should LLM monitoring answer?
If you don’t write these down, you’ll log everything (prompts, tool outputs, embeddings, traces) and still not be able to answer basic operational questions. For LLM observability and LLM monitoring, I’ve found 8 questions cover 90% of production pain.
1) “Why did spend jump?”
You need cost per call and cost per successful task. Cost per call tells you which model got expensive. Cost per successful run tells you when retries/repair loops are silently eating your budget.
- Metric: `cost_usd` per call, aggregated by `model`, `provider`
- Metric: `cost_usd` per run, and `cost_usd / successful_run` (see the rollup sketch below)
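To make the two denominators concrete, here’s a minimal rollup sketch in TypeScript. It assumes call and run records shaped like the schema later in this post (`cost_usd`, `model`, `run_id`, and a `Completed` outcome); adapt the field names to whatever your store actually uses.

```ts
type CallRow = { model: string; run_id: string; cost_usd: number };
type RunRow = { run_id: string; outcome: string };

// "Which model got expensive?"
function costByModel(calls: CallRow[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const c of calls) totals.set(c.model, (totals.get(c.model) ?? 0) + c.cost_usd);
  return totals;
}

// "Are retries/repair loops silently eating the budget?"
function costPerSuccessfulRun(calls: CallRow[], runs: RunRow[]): number {
  const totalCost = calls.reduce((sum, c) => sum + c.cost_usd, 0);
  const successes = runs.filter((r) => r.outcome === 'Completed').length;
  // Infinity is a deliberately loud answer when nothing succeeded.
  return successes === 0 ? Infinity : totalCost / successes;
}
```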
2) “Where did latency regress?”
Average latency lies. Tail latency is what users feel, and multi-step agents amplify it.
- Metric: `latency_ms` per call + `p95`/`p99` by model/route (see the percentile sketch below)
- Metric: run duration (wall clock), plus phase durations (planning/retrieval/etc.)
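If your metrics store doesn’t compute percentiles for you, a nearest-rank sketch over raw call records is enough for a first dashboard. It assumes `model` and `latency_ms` fields as in the per-call schema later in this post.

```ts
// Nearest-rank percentile: sort, then index. Good enough for dashboards.
function percentile(values: number[], p: number): number {
  const sorted = [...values].sort((a, b) => a - b);
  const idx = Math.max(0, Math.ceil((p / 100) * sorted.length) - 1);
  return sorted[idx];
}

function p95LatencyByModel(calls: { model: string; latency_ms: number }[]): Map<string, number> {
  const byModel = new Map<string, number[]>();
  for (const c of calls) {
    const arr = byModel.get(c.model) ?? [];
    arr.push(c.latency_ms);
    byModel.set(c.model, arr);
  }
  const result = new Map<string, number>();
  for (const [model, xs] of byModel) result.set(model, percentile(xs, 95));
  return result;
}
```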
3) “What’s failing, and is it retryable?”
You need to distinguish model errors (429/500), your own timeouts, tool failures, and validation failures. Otherwise you’ll “fix” the wrong thing.
- Event: `status=error` with a normalized `error_type` + provider error code if available (see the normalization sketch below)
- Metric: retry count, and success-after-retry rate
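A sketch of what “normalized” means in practice. The `err.status` property is an assumption about your provider SDK’s error shape, so adapt the checks to whatever your client actually throws.

```ts
type NormalizedError = {
  error_type: 'rate_limited' | 'provider_error' | 'timeout' | 'unknown';
  retryable: boolean;
};

function normalizeError(err: unknown): NormalizedError {
  // Many provider SDKs expose an HTTP status on thrown errors; adjust to your client's shape.
  const status = (err as { status?: number } | null)?.status;
  if (status === 429) return { error_type: 'rate_limited', retryable: true };
  if (status !== undefined && status >= 500) return { error_type: 'provider_error', retryable: true };
  // Our own aborted requests (client-side timeouts) usually surface as AbortError.
  if (err instanceof Error && err.name === 'AbortError') return { error_type: 'timeout', retryable: true };
  return { error_type: 'unknown', retryable: false };
}
```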
4) “Did the model output change (quality drift)?”
Token/latency changes are often the first signal of prompt/model drift. But drift only matters if outcomes drift.
- Metric: output length distribution (completion tokens, chars)
- Metric: outcome rate over time (success/failure/neutral), segmented by prompt version/model
5) “Which step in the agent is the money pit?”
Per-call dashboards break down when an agent makes 10–200 calls. You need aggregation by phase.
- Event: hierarchical traces (run → group/phase → call)
- Metric: cost/latency per phase label
6) “What are users actually experiencing?”
You need correlation IDs that bridge product events (user request) to agent runs and LLM calls.
- Field: `trace_id` / `request_id` on everything
- Field: `user_id` / `org_id` (or hashed equivalents), and `route`/`feature`
7) “Which prompts/tools correlate with failures?”
You don’t need to log every byte. You do need enough context to reproduce.
- Field: prompt/template version, tool names invoked, and key config (temperature, max tokens)
- Event: tool call records (name, arguments shape, result status)
8) “Are we paying for tokens we didn’t need?”
Caching and prompt bloat are the easiest wins, but provider-specific.
- Metric: prompt tokens vs cached tokens (if available)
- Metric: prompt size per route/prompt version
A lightweight way to force this discipline is to keep a “questions → signals” table in your repo. Example:
```yaml
# observability/questions.yml
- question: "Why did spend jump?"
  signals:
    - call.cost_usd
    - call.model
    - run.cost_usd
    - run.outcome
- question: "Where did latency regress?"
  signals:
    - call.latency_ms_p95
    - run.duration_ms_p95
    - group.duration_ms_p95
- question: "Which step fails?"
  signals:
    - group.outcome
    - call.status
    - tool.status
```

The point isn’t the file format. The point is: if a signal doesn’t answer a question, it’s probably expensive noise.
What to track on every LLM call
Per-call telemetry is the unit of truth. If you don’t have consistent tokens/cost/latency/status, you can’t explain spend or failures, and you can’t aggregate correctly at the run level.
Here’s the baseline schema I aim for on every LLM API invocation, regardless of provider (sketched as a type after the list):
- Identity: `provider`, `model`
- Usage: `prompt_tokens`, `completion_tokens`, `total_tokens` (when available)
- Money: `estimated_cost_usd` (or computed server-side)
- Time: `latency_ms` (wall clock)
- Reliability: `status` (success|error), `error_message`, `error_type`
- Control flow: `retry_count`, `idempotency_key` (if you use one)
- Tools: `tools_provided[]`, `tool_calls[]` (name + arguments)
- Correlation: `trace_id`, `run_id` (or equivalent), `user_id`/`org_id`
- Debug context: prompt/template version, and redacted request/response snapshots
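The same baseline as a TypeScript type helps keep logging code honest. Field names match the NDJSON example that follows; optional fields stay optional so records from providers (or streams) that omit them still validate.

```ts
interface LlmCallEvent {
  ts: string;                      // ISO-8601 timestamp
  trace_id: string;
  run_id?: string;
  provider: string;
  model: string;
  prompt_tokens?: number;          // usage fields are optional: some providers/streams omit them
  completion_tokens?: number;
  total_tokens?: number;
  estimated_cost_usd?: number;
  latency_ms: number;
  status: 'success' | 'error';
  error_type?: string;
  error_message?: string;
  retry_count: number;
  tools_provided?: string[];
  tool_calls?: { name: string; arguments_bytes?: number }[];
  prompt_version?: string;
}
```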
A concrete NDJSON record looks like this (real numbers, real fields):
```json
{
  "ts": "2026-02-12T09:41:22.184Z",
  "trace_id": "req_7d1f2c9a",
  "provider": "openai",
  "model": "gpt-4o",
  "prompt_tokens": 1842,
  "completion_tokens": 312,
  "total_tokens": 2154,
  "estimated_cost_usd": 0.0286,
  "latency_ms": 1468,
  "status": "success",
  "retry_count": 0,
  "tools_provided": ["searchDocs", "getInvoice"],
  "tool_calls": [
    { "name": "searchDocs", "arguments_bytes": 438 }
  ],
  "prompt_version": "support-triage@2026-02-01"
}
```

Gotcha: streaming usage arrives late
With streaming, you often don’t know token usage until the stream completes. If you record the call at request start, you’ll either miss usage or have to patch it later.
What worked for me: record a “call started” timestamp locally, but emit the final call event only after the stream ends (or errors). Minimal example:
```js
async function consumeStream(stream) {
  const startedAt = Date.now();
  const chunks = [];
  let usage = null;
  for await (const part of stream) {
    const text = part.choices?.[0]?.delta?.content;
    if (text) chunks.push(text);
    // Some providers attach usage to the final chunk
    // (e.g. OpenAI when stream_options: { include_usage: true } is set).
    if (part.usage) usage = part.usage;
  }
  // Only after the stream ends do we know tokens and true latency, so emit the call event here.
  return { text: chunks.join(''), usage, latency_ms: Date.now() - startedAt };
}
```

The observability implication: don’t build dashboards that assume “call event exists instantly.” For streaming endpoints, your call record is naturally delayed.
Gotcha: cached tokens are provider-specific
Some providers report cached input tokens or cache read/write tokens. Treat these as optional fields, not required schema, or you’ll end up with half your calls failing validation.
In SQL, that means NULL-able columns and careful rollups. Example rollup that won’t explode when cached fields are missing:
```sql
SELECT
  model,
  COUNT(*) AS calls,
  SUM(prompt_tokens) AS prompt_tokens,
  SUM(COALESCE(cached_input_tokens, 0)) AS cached_input_tokens,
  SUM(estimated_cost_usd) AS cost_usd
FROM llm_calls
WHERE ts >= NOW() - INTERVAL '7 days'
GROUP BY model
ORDER BY cost_usd DESC;
```

Capturing request/response context (without leaking secrets)
You want request/response snapshots because debugging without them is miserable. But raw prompts often contain PII, API keys, or customer data. Two rules that saved me:
- Store redacted content by default.
- Store references to raw content behind stricter access controls (or don’t store it at all).
A simple redaction pass (not perfect, but better than nothing):
```js
function redact(text) {
  return text
    .replace(/sk-[A-Za-z0-9]{20,}/g, 'sk-REDACTED')
    .replace(/[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}/gi, 'email-REDACTED');
}
```

Don’t pretend regex solves compliance. It doesn’t. But it catches the common foot-guns while you build real policy.
How to observe multi-step agents (runs and phases)
Single-call metrics work for “LLM as a function.” Agents aren’t that. Agents are workflows: plan, retrieve, call tools, revise, validate, retry. If you only track calls, your dashboards turn into a pile of unrelated rows.
You need three layers:
- Run: one user task / one agent attempt (top-level unit)
- Group/Phase: a labeled step inside the run (planning, retrieval, execution, validation)
- Call: the individual LLM API invocation
This hierarchy answers the questions you actually get paged for:
- “Where did time go?” → run timeline + phase durations + call latencies
- “Where did cost go?” → cost by phase label, not just by model
- “Which step fails?” → outcomes/errors attached at group level
A minimal data model (vendor-agnostic) looks like this:
-- runs: one row per user task attempt
-- groups: phases within a run (can be nested)
-- calls: LLM calls linked to either a run or a group
SELECT
r.run_id,
g.label AS phase,
COUNT(c.call_id) AS calls,
SUM(c.estimated_cost_usd) AS cost_usd,
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY c.latency_ms) AS p95_call_latency_ms
FROM runs r
JOIN groups g ON g.run_id = r.run_id
JOIN calls c ON c.group_id = g.group_id
WHERE r.started_at >= NOW() - INTERVAL '24 hours'
GROUP BY r.run_id, g.label
ORDER BY cost_usd DESC;Tracing vs “agent structure”
OpenTelemetry-style tracing is fine for stitching spans together, but vanilla tracing doesn’t give you semantics like “this was the validation phase” or “this run succeeded.” You can layer that on with span attributes, but you have to be disciplined or it devolves into inconsistent tags.
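If you stay on plain OpenTelemetry, the discipline looks roughly like this: a small, fixed set of attribute keys for run and phase semantics, set on every span. A sketch with @opentelemetry/api; the `agent.*` attribute names are my own convention, not a standard.

```ts
import { trace, Span } from '@opentelemetry/api';

const tracer = trace.getTracer('agent');

async function runPhase<T>(runId: string, label: string, fn: () => Promise<T>): Promise<T> {
  // One span per phase, with agent semantics as attributes instead of free-form tags.
  return tracer.startActiveSpan(`phase:${label}`, async (span: Span) => {
    span.setAttribute('agent.run_id', runId);
    span.setAttribute('agent.phase', label);
    try {
      return await fn();
    } finally {
      span.end();
    }
  });
}
```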
Tools like Langfuse/Helicone can capture traces/calls. Agent-first systems model runs/phases explicitly. I don’t care which you pick. I care that you can reliably answer:
- cost per successful run
- phase-level cost/latency
- failure rate by phase label
- retry loops and repair attempts
Where time is spent: wall clock vs sum of calls
Agents do parallelism (tool calls, concurrent retrieval) and idle time (waiting on rate limits). If you only sum call latencies, you’ll miss “dead time.”
So track both:
- `run.duration_ms` (wall clock)
- `sum(call.latency_ms)` (work time)
- optionally: `queue_wait_ms` if you have internal queues
A quick sanity check query I use: runs where wall clock is much larger than sum of call latency. Those are usually rate limiting or tool slowness.
```sql
SELECT
  run_id,
  duration_ms,
  (SELECT SUM(latency_ms) FROM calls WHERE calls.run_id = runs.run_id) AS sum_call_latency_ms
FROM runs
WHERE started_at >= NOW() - INTERVAL '7 days'
  AND duration_ms > 2 * (SELECT COALESCE(SUM(latency_ms), 0) FROM calls WHERE calls.run_id = runs.run_id)
ORDER BY duration_ms DESC
LIMIT 50;
```

How to measure quality with outcomes (not vibes)
You can monitor cost and latency forever and still ship a system that doesn’t work. Quality needs a first-class signal, and in production that means outcomes: explicit records of whether the task succeeded, and why.
I like to standardize a small set of outcome names and keep details in metadata. Example taxonomy that stayed sane:
- `Completed`
- `Failed`
- `Rate Limited`
- `Validation Error`
- `Tool Error`
- `Escalated`
- `Below Threshold`
If you let every engineer invent outcome names, you get 200 nearly-duplicates and no usable success rate.
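One cheap way to enforce the taxonomy is a closed list in code, so unknown names fail the type checker instead of polluting your success rate. A sketch; the `recordOutcome` emitter is a placeholder for however you actually ship events.

```ts
const OUTCOMES = [
  'Completed',
  'Failed',
  'Rate Limited',
  'Validation Error',
  'Tool Error',
  'Escalated',
  'Below Threshold',
] as const;

type Outcome = (typeof OUTCOMES)[number];

// Helpers accept only Outcome, so typos like 'Complete' or 'completed_ok' don't compile.
function recordOutcome(runId: string, outcome: Outcome, metadata?: Record<string, unknown>) {
  // Hypothetical emitter: replace with however you ship events to your store.
  console.log(JSON.stringify({ run_id: runId, outcome, metadata }));
}
```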
Strategy 1: deterministic checks (best ROI)
If your agent outputs structured data, validate it. Deterministic checks are cheap, stable, and don’t drift.
```js
import { z } from 'zod';

const Answer = z.object({
  status: z.enum(['ok', 'needs_human']),
  message: z.string().min(1),
  citations: z.array(z.string().url()).max(5)
});

export function validateAnswer(jsonText) {
  // Malformed JSON is a validation failure too, not an unhandled exception.
  let parsed;
  try {
    parsed = JSON.parse(jsonText);
  } catch (err) {
    return { success: false, error: err };
  }
  return Answer.safeParse(parsed);
}
```

When this fails, record `Validation Error` with the zod issues. That’s actionable.
Strategy 2: heuristics (good enough, fast)
Heuristics catch obvious failures: empty answers, missing sections, forbidden content, or “agent ignored instructions.”
```js
export function heuristicScore(text) {
  const tooShort = text.trim().length < 200;
  const hasRefusal = /\bI (can't|cannot|won't)\b/i.test(text);
  const hasCodeFence = /```/.test(text);
  if (tooShort) return { ok: false, reason: 'too_short' };
  if (hasRefusal) return { ok: false, reason: 'refusal_language' };
  if (!hasCodeFence) return { ok: false, reason: 'missing_code_fence' };
  return { ok: true };
}
```

Heuristics are blunt. That’s fine. They’re great for alerting and triage, not for nuanced evaluation.
Strategy 3: LLM-as-judge (use carefully)
Judges are useful for subjective quality (tone, completeness), but they drift, and they can be gamed. Two guardrails that helped:
- Fix the judge prompt and version it (`judge@2026-02-10`)
- Store the judge’s score and the rubric text you used
Example judge response schema:
```json
{
  "verdict": "pass",
  "score": 8,
  "reasons": ["Answered the question", "Included concrete steps", "No policy violations"]
}
```

Pitfalls I’ve hit:
- Judge model changes silently → score distribution shifts → false alarms.
- Too many labels → nobody classifies them → success rate becomes meaningless.
- “One score to rule them all” → hides why it failed. Always keep a short reason code (schema sketch below).
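A small guard that enforces the shape above before you store judge scores, sketched with zod. Same `verdict`/`score`/`reasons` fields as the example; the `fail` verdict and the 0–10 score range are assumptions, since the example only shows a passing case.

```js
import { z } from 'zod';

export const JudgeResult = z.object({
  // 'fail' and the 0-10 range are assumed; adjust to your rubric.
  verdict: z.enum(['pass', 'fail']),
  score: z.number().int().min(0).max(10),
  // Require at least one reason so "one score to rule them all" can't sneak back in.
  reasons: z.array(z.string().min(1)).min(1),
});
```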
Outcomes are also how you connect observability back to product metrics. “p95 latency went up” matters, but “cost per completed run doubled” is what gets attention.
Common observability traps and how to avoid them
Most LLM observability failures aren’t technical. They’re discipline failures.
Trap: logging prompts without correlation IDs
If you can’t go from a user complaint to a run/call in one hop, you’ll waste hours.
Guardrail: generate a trace_id at request entry and propagate it everywhere. If you use HTTP, put it in response headers too.
```js
import crypto from 'node:crypto';

export function getTraceId(req) {
  // Reuse an upstream request ID when present; otherwise mint one at the edge.
  return req.headers['x-request-id'] ?? `req_${crypto.randomUUID()}`;
}
```

Trap: sampling that hides rare failures
If you sample “normal traffic,” you’ll miss the weird edge cases that break agents: long contexts, tool timeouts, multilingual inputs.
Guardrail: sample by outcome and error, not uniformly. Keep 100% of failures, and a small % of successes.
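A sketch of outcome-aware sampling; the 1% keep rate for successes is an arbitrary placeholder to tune against your traffic volume.

```ts
function shouldKeep(event: { status: 'success' | 'error' }, successSampleRate = 0.01): boolean {
  // Keep every failure; failures are where the debugging value is.
  if (event.status === 'error') return true;
  // Keep a small random slice of successes for baselines and drift detection.
  return Math.random() < successSampleRate;
}
```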
Trap: dashboards without alert thresholds
A dashboard is not monitoring. If you don’t have thresholds, you’re just hoping someone looks.
Guardrail: alert on SLO-style signals:
- success rate per run label
- p95 run duration
- p95 call latency by model
- cost per successful run
Example of an alert query for “cost per successful run”:
```sql
SELECT
  label,
  SUM(cost_usd) / NULLIF(SUM(CASE WHEN outcome = 'Completed' THEN 1 ELSE 0 END), 0) AS cost_per_success
FROM run_daily_rollup
WHERE day = CURRENT_DATE
GROUP BY label
HAVING SUM(cost_usd) / NULLIF(SUM(CASE WHEN outcome = 'Completed' THEN 1 ELSE 0 END), 0) > 0.35;
```

Trap: ignoring tail latency
Agents amplify tail latency because they do multiple calls. A small p95 regression per call becomes a huge p95 regression per run.
Guardrail: track p95/p99 at the run level, not just call level.
Trap: not tracking retries and repair loops
Retries are where budgets go to die. If you don’t track retry count and “success after retry,” you’ll think the system is fine because it eventually succeeds.
Guardrail: record retry attempts as explicit events/links, and compute (see the sketch after this list):
- retries per run
- cost per retry
- success rate with/without retries
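A sketch of that rollup from call-level events, assuming each call record carries `run_id`, `retry_count`, `cost_usd`, and `status` as in the per-call schema above.

```ts
type CallRecord = { run_id: string; retry_count: number; cost_usd: number; status: 'success' | 'error' };

function retryMetrics(calls: CallRecord[]) {
  const runs = new Set(calls.map((c) => c.run_id));
  const retriedCalls = calls.filter((c) => c.retry_count > 0);
  const totalRetries = calls.reduce((sum, c) => sum + c.retry_count, 0);
  return {
    // How much retrying is happening at all.
    retries_per_run: totalRetries / Math.max(1, runs.size),
    // What the retries cost, not just whether they eventually worked.
    retry_cost_usd: retriedCalls.reduce((sum, c) => sum + c.cost_usd, 0),
    // Did the extra spend actually buy a success?
    success_after_retry_rate:
      retriedCalls.length === 0
        ? null
        : retriedCalls.filter((c) => c.status === 'success').length / retriedCalls.length,
  };
}
```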
Trap: schema churn
If you change field names every week, you’ll never get stable trend lines.
Guardrail: keep a minimal stable schema for calls/runs/outcomes. Add fields carefully. Deprecate slowly.
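One pattern that keeps trend lines stable, sketched with zod: the core fields never change, and anything new or provider-specific lands as optional so old events keep validating.

```js
import { z } from 'zod';

// Core: stable since day one. Changing anything here breaks every dashboard downstream.
const CallEventCore = z.object({
  trace_id: z.string(),
  provider: z.string(),
  model: z.string(),
  latency_ms: z.number(),
  status: z.enum(['success', 'error']),
});

// Additive: new, provider-specific, or experimental fields are optional so old events still validate.
export const CallEvent = CallEventCore.extend({
  cached_input_tokens: z.number().optional(),
  prompt_version: z.string().optional(),
});
```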
Doing this with WarpMetrics (runs, groups, outcomes)
If you want the run/group/call/outcome structure without building it yourself, WarpMetrics maps cleanly onto the hierarchy above. The SDK wraps your LLM client, then you explicitly link calls into runs/groups and record outcomes.
Here’s the smallest useful pattern: one run, two phases, one outcome.
```js
import OpenAI from 'openai';
import { warp, run, group, call, outcome, flush } from '@warpmetrics/warp';

const openai = warp(new OpenAI());

const r = run('Support agent', { traceId: 'req_7d1f2c9a', userId: 'u_1842' });

const planning = group(r, 'Planning');
call(planning, await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: 'Plan the next action.' }]
}));

const execution = group(r, 'Execution');
call(execution, await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: 'Write the customer reply.' }]
}));

outcome(r, 'Completed', { channel: 'email' });
await flush();
```

That snippet is boring on purpose. Boring instrumentation is what stays running in prod. Once you have runs + phases + outcomes, you can answer the operational questions: where time/cost went, which phase failed, and what “success” looks like in numbers.
Start building with WarpMetrics
Track every LLM call, measure outcomes, and let your agents query their own performance data.