LLM Observability: What to Track and Why
After you ship, you’ll get the same three pages over and over: cost spikes, latency spikes, and “the agent is doing weird stuff.” This post lays out the metrics and events that let you answer those pages with evidence, not vibes. The trick is to start from the questions you must answer, then design a stable schema that makes those answers cheap to compute.
What questions should LLM monitoring answer?
If you don’t write these down, you’ll log everything (prompts, tool outputs, embeddings, traces) and still not be able to answer basic operational questions. For LLM observability and LLM monitoring, I’ve found 8 questions cover 90% of production pain.
1) “Why did spend jump?”
You need cost per call and cost per successful task. Cost per call tells you which model got expensive. Cost per successful run tells you when retries/repair loops are silently eating your budget.
- Metric: `cost_usd` per call, aggregated by `model`, `provider`
- Metric: `cost_usd` per run, and `cost_usd / successful_run` (see the rollup sketch below)
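To make the two denominators concrete, here’s a minimal rollup sketch in TypeScript. It assumes call and run records shaped like the schema later in this post (`cost_usd`, `model`, `run_id`, and a `Completed` outcome); adapt the field names to whatever your store actually uses.

```ts
type CallRow = { model: string; run_id: string; cost_usd: number };
type RunRow = { run_id: string; outcome: string };

// "Which model got expensive?"
function costByModel(calls: CallRow[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const c of calls) totals.set(c.model, (totals.get(c.model) ?? 0) + c.cost_usd);
  return totals;
}

// "Are retries/repair loops silently eating the budget?"
function costPerSuccessfulRun(calls: CallRow[], runs: RunRow[]): number {
  const totalCost = calls.reduce((sum, c) => sum + c.cost_usd, 0);
  const successes = runs.filter((r) => r.outcome === 'Completed').length;
  // Infinity is a deliberately loud answer when nothing succeeded.
  return successes === 0 ? Infinity : totalCost / successes;
}
```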
2) “Where did latency regress?”
Average latency lies. Tail latency is what users feel, and multi-step agents amplify it.
- Metric: `latency_ms` per call + `p95`/`p99` by model/route (see the percentile sketch below)
- Metric: run duration (wall clock), plus phase durations (planning/retrieval/etc.)
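If your metrics store doesn’t compute percentiles for you, a nearest-rank sketch over raw call records is enough for a first dashboard. It assumes `model` and `latency_ms` fields as in the per-call schema later in this post.

```ts
// Nearest-rank percentile: sort, then index. Good enough for dashboards.
function percentile(values: number[], p: number): number {
  const sorted = [...values].sort((a, b) => a - b);
  const idx = Math.max(0, Math.ceil((p / 100) * sorted.length) - 1);
  return sorted[idx];
}

function p95LatencyByModel(calls: { model: string; latency_ms: number }[]): Map<string, number> {
  const byModel = new Map<string, number[]>();
  for (const c of calls) {
    const arr = byModel.get(c.model) ?? [];
    arr.push(c.latency_ms);
    byModel.set(c.model, arr);
  }
  const result = new Map<string, number>();
  for (const [model, xs] of byModel) result.set(model, percentile(xs, 95));
  return result;
}
```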
3) “What’s failing, and is it retryable?”
You need to distinguish model errors (429/500), your own timeouts, tool failures, and validation failures. Otherwise you’ll “fix” the wrong thing.
- Event: `status=error` with a normalized `error_type` + provider error code if available (see the normalization sketch below)
- Metric: retry count, and success-after-retry rate
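A sketch of what “normalized” means in practice. The `err.status` property is an assumption about your provider SDK’s error shape, so adapt the checks to whatever your client actually throws.

```ts
type NormalizedError = {
  error_type: 'rate_limited' | 'provider_error' | 'timeout' | 'unknown';
  retryable: boolean;
};

function normalizeError(err: unknown): NormalizedError {
  // Many provider SDKs expose an HTTP status on thrown errors; adjust to your client's shape.
  const status = (err as { status?: number } | null)?.status;
  if (status === 429) return { error_type: 'rate_limited', retryable: true };
  if (status !== undefined && status >= 500) return { error_type: 'provider_error', retryable: true };
  // Our own aborted requests (client-side timeouts) usually surface as AbortError.
  if (err instanceof Error && err.name === 'AbortError') return { error_type: 'timeout', retryable: true };
  return { error_type: 'unknown', retryable: false };
}
```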
4) “Did the model output change (quality drift)?”
Token/latency changes are often the first signal of prompt/model drift. But drift only matters if outcomes drift.
- Metric: output length distribution (completion tokens, chars)
- Metric: outcome rate over time (success/failure/neutral), segmented by prompt version/model
5) “Which step in the agent is the money pit?”
Per-call dashboards break down when an agent makes 10–200 calls. You need aggregation by phase.
- Event: hierarchical traces (run → group/phase → call)
- Metric: cost/latency per phase label
6) “What are users actually experiencing?”
You need correlation IDs that bridge product events (user request) to agent runs and LLM calls.
- Field: `trace_id` / `request_id` on everything
- Field: `user_id` / `org_id` (or hashed equivalents), and `route`/`feature`
7) “Which prompts/tools correlate with failures?”
You don’t need to log every byte. You do need enough context to reproduce.
- Field: prompt/template version, tool names invoked, and key config (temperature, max tokens)
- Event: tool call records (name, arguments shape, result status)
8) “Are we paying for tokens we didn’t need?”
Caching and prompt bloat are the easiest wins, but provider-specific.
- Metric: prompt tokens vs cached tokens (if available)
- Metric: prompt size per route/prompt version
A lightweight way to force this discipline is to keep a “questions → signals” table in your repo. Example:
```yaml
# observability/questions.yml
- question: "Why did spend jump?"
  signals:
    - call.cost_usd
    - call.model
    - run.cost_usd
    - run.outcome
- question: "Where did latency regress?"
  signals:
    - call.latency_ms_p95
    - run.duration_ms_p95
    - group.duration_ms_p95
- question: "Which step fails?"
  signals:
    - group.outcome
    - call.status
    - tool.status
```

The point isn’t the file format. The point is: if a signal doesn’t answer a question, it’s probably expensive noise.
What to track on every LLM call
Per-call telemetry is the unit of truth. If you don’t have consistent tokens/cost/latency/status, you can’t explain spend or failures, and you can’t aggregate correctly at the run level.
Here’s the baseline schema I aim for on every LLM API invocation, regardless of provider (sketched as a type after the list):
- Identity: `provider`, `model`
- Usage: `prompt_tokens`, `completion_tokens`, `total_tokens` (when available)
- Money: `estimated_cost_usd` (or computed server-side)
- Time: `latency_ms` (wall clock)
- Reliability: `status` (success|error), `error_message`, `error_type`
- Control flow: `retry_count`, `idempotency_key` (if you use one)
- Tools: `tools_provided[]`, `tool_calls[]` (name + arguments)
- Correlation: `trace_id`, `run_id` (or equivalent), `user_id`/`org_id`
- Debug context: prompt/template version, and redacted request/response snapshots
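The same baseline as a TypeScript type helps keep logging code honest. Field names match the NDJSON example that follows; optional fields stay optional so records from providers (or streams) that omit them still validate.

```ts
interface LlmCallEvent {
  ts: string;                      // ISO-8601 timestamp
  trace_id: string;
  run_id?: string;
  provider: string;
  model: string;
  prompt_tokens?: number;          // usage fields are optional: some providers/streams omit them
  completion_tokens?: number;
  total_tokens?: number;
  estimated_cost_usd?: number;
  latency_ms: number;
  status: 'success' | 'error';
  error_type?: string;
  error_message?: string;
  retry_count: number;
  tools_provided?: string[];
  tool_calls?: { name: string; arguments_bytes?: number }[];
  prompt_version?: string;
}
```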
A concrete NDJSON record looks like this (real numbers, real fields):
```json
{
  "ts": "2026-02-12T09:41:22.184Z",
  "trace_id": "req_7d1f2c9a",
  "provider": "openai",
  "model": "gpt-4o",
  "prompt_tokens": 1842,
  "completion_tokens": 312,
  "total_tokens": 2154,
  "estimated_cost_usd": 0.0286,
  "latency_ms": 1468,
  "status": "success",
  "retry_count": 0,
  "tools_provided": ["searchDocs", "getInvoice"],
  "tool_calls": [
    { "name": "searchDocs", "arguments_bytes": 438 }
  ],
  "prompt_version": "support-triage@2026-02-01"
}
```

Gotcha: streaming usage arrives late
With streaming, you often don’t know token usage until the stream completes. If you record the call at request start, you’ll either miss usage or have to patch it later.
What worked for me: record a “call started” timestamp locally, but emit the final call event only after the stream ends (or errors). Minimal example:
```js
async function consumeStream(stream) {
  const startedAt = Date.now();
  const chunks = [];
  let usage = null;
  for await (const part of stream) {
    const text = part.choices?.[0]?.delta?.content;
    if (text) chunks.push(text);
    // Some providers attach usage to the final chunk
    // (e.g. OpenAI when stream_options: { include_usage: true } is set).
    if (part.usage) usage = part.usage;
  }
  // Only after the stream ends do we know tokens and true latency, so emit the call event here.
  return { text: chunks.join(''), usage, latency_ms: Date.now() - startedAt };
}
```

The observability implication: don’t build dashboards that assume “call event exists instantly.” For streaming endpoints, your call record is naturally delayed.
Gotcha: cached tokens are provider-specific
Some providers report cached input tokens or cache read/write tokens. Treat these as optional fields, not required schema, or you’ll end up with half your calls failing validation.
In SQL, that means NULL-able columns and careful rollups. Example rollup that won’t explode when cached fields are missing:
```sql
SELECT
  model,
  COUNT(*) AS calls,
  SUM(prompt_tokens) AS prompt_tokens,
  SUM(COALESCE(cached_input_tokens, 0)) AS cached_input_tokens,
  SUM(estimated_cost_usd) AS cost_usd
FROM llm_calls
WHERE ts >= NOW() - INTERVAL '7 days'
GROUP BY model
ORDER BY cost_usd DESC;
```

Capturing request/response context (without leaking secrets)
You want request/response snapshots because debugging without them is miserable. But raw prompts often contain PII, API keys, or customer data. Two rules that saved me:
- Store redacted content by default.
- Store references to raw content behind stricter access controls (or don’t store it at all).
A simple redaction pass (not perfect, but better than nothing):
```js
function redact(text) {
  return text
    .replace(/sk-[A-Za-z0-9]{20,}/g, 'sk-REDACTED')
    .replace(/[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}/gi, 'email-REDACTED');
}
```

Don’t pretend regex solves compliance. It doesn’t. But it catches the common foot-guns while you build real policy.
How to observe multi-step agents (runs and phases)
Single-call metrics work for “LLM as a function.” Agents aren’t that. Agents are workflows: plan, retrieve, call tools, revise, validate, retry. If you only track calls, your dashboards turn into a pile of unrelated rows.
You need three layers:
- Run: one user task / one agent attempt (top-level unit)
- Group/Phase: a labeled step inside the run (planning, retrieval, execution, validation)
- Call: the individual LLM API invocation
This hierarchy answers the questions you actually get paged for:
- “Where did time go?” → run timeline + phase durations + call latencies
- “Where did cost go?” → cost by phase label, not just by model
- “Which step fails?” → outcomes/errors attached at group level
A minimal data model (vendor-agnostic) looks like this:
-- runs: one row per user task attempt
-- groups: phases within a run (can be nested)
-- calls: LLM calls linked to either a run or a group
SELECT
r.run_id,
g.label AS phase,
COUNT(c.call_id) AS calls,
SUM(c.estimated_cost_usd) AS cost_usd,
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY c.latency_ms) AS p95_call_latency_ms
FROM runs r
JOIN groups g ON g.run_id = r.run_id
JOIN calls c ON c.group_id = g.group_id
WHERE r.started_at >= NOW() - INTERVAL '24 hours'
GROUP BY r.run_id, g.label
ORDER BY cost_usd DESC;Tracing vs “agent structure”
OpenTelemetry-style tracing is fine for stitching spans together, but vanilla tracing doesn’t give you semantics like “this was the validation phase” or “this run succeeded.” You can layer that on with span attributes, but you have to be disciplined or it devolves into inconsistent tags.
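If you stay on plain OpenTelemetry, the discipline looks roughly like this: a small, fixed set of attribute keys for run and phase semantics, set on every span. A sketch with @opentelemetry/api; the `agent.*` attribute names are my own convention, not a standard.

```ts
import { trace, Span } from '@opentelemetry/api';

const tracer = trace.getTracer('agent');

async function runPhase<T>(runId: string, label: string, fn: () => Promise<T>): Promise<T> {
  // One span per phase, with agent semantics as attributes instead of free-form tags.
  return tracer.startActiveSpan(`phase:${label}`, async (span: Span) => {
    span.setAttribute('agent.run_id', runId);
    span.setAttribute('agent.phase', label);
    try {
      return await fn();
    } finally {
      span.end();
    }
  });
}
```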
Tools like Langfuse/Helicone can capture traces/calls. Agent-first systems model runs/phases explicitly. I don’t care which you pick. I care that you can reliably answer:
- cost per successful run
- phase-level cost/latency
- failure rate by phase label
- retry loops and repair attempts
Where time is spent: wall clock vs sum of calls
Agents do parallelism (tool calls, concurrent retrieval) and idle time (waiting on rate limits). If you only sum call latencies, you’ll miss “dead time.”
So track both:
- `run.duration_ms` (wall clock)
- `sum(call.latency_ms)` (work time)
- optionally: `queue_wait_ms` if you have internal queues
A quick sanity check query I use: runs where wall clock is much larger than sum of call latency. Those are usually rate limiting or tool slowness.
```sql
SELECT
  run_id,
  duration_ms,
  (SELECT SUM(latency_ms) FROM calls WHERE calls.run_id = runs.run_id) AS sum_call_latency_ms
FROM runs
WHERE started_at >= NOW() - INTERVAL '7 days'
  AND duration_ms > 2 * (SELECT COALESCE(SUM(latency_ms), 0) FROM calls WHERE calls.run_id = runs.run_id)
ORDER BY duration_ms DESC
LIMIT 50;
```

How to measure quality with outcomes (not vibes)
You can monitor cost and latency forever and still ship a system that doesn’t work. Quality needs a first-class signal, and in production that means outcomes: explicit records of whether the task succeeded, and why.
I like to standardize a small set of outcome names and keep details in metadata. Example taxonomy that stayed sane:
- `Completed`
- `Failed`
- `Rate Limited`
- `Validation Error`
- `Tool Error`
- `Escalated`
- `Below Threshold`
If you let every engineer invent outcome names, you get 200 nearly-duplicates and no usable success rate.
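One cheap way to enforce the taxonomy is a closed list in code, so unknown names fail the type checker instead of polluting your success rate. A sketch; the `recordOutcome` emitter is a placeholder for however you actually ship events.

```ts
const OUTCOMES = [
  'Completed',
  'Failed',
  'Rate Limited',
  'Validation Error',
  'Tool Error',
  'Escalated',
  'Below Threshold',
] as const;

type Outcome = (typeof OUTCOMES)[number];

// Helpers accept only Outcome, so typos like 'Complete' or 'completed_ok' don't compile.
function recordOutcome(runId: string, outcome: Outcome, metadata?: Record<string, unknown>) {
  // Hypothetical emitter: replace with however you ship events to your store.
  console.log(JSON.stringify({ run_id: runId, outcome, metadata }));
}
```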
Strategy 1: deterministic checks (best ROI)
If your agent outputs structured data, validate it. Deterministic checks are cheap, stable, and don’t drift.
```js
import { z } from 'zod';

const Answer = z.object({
  status: z.enum(['ok', 'needs_human']),
  message: z.string().min(1),
  citations: z.array(z.string().url()).max(5)
});

export function validateAnswer(jsonText) {
  // Malformed JSON is a validation failure too, not an unhandled exception.
  let parsed;
  try {
    parsed = JSON.parse(jsonText);
  } catch (err) {
    return { success: false, error: err };
  }
  return Answer.safeParse(parsed);
}
```

When this fails, record `Validation Error` with the zod issues. That’s actionable.
Strategy 2: heuristics (good enough, fast)
Heuristics catch obvious failures: empty answers, missing sections, forbidden content, or “agent ignored instructions.”
```js
export function heuristicScore(text) {
  const tooShort = text.trim().length < 200;
  const hasRefusal = /\bI (can't|cannot|won't)\b/i.test(text);
  const hasCodeFence = /```/.test(text);
  if (tooShort) return { ok: false, reason: 'too_short' };
  if (hasRefusal) return { ok: false, reason: 'refusal_language' };
  if (!hasCodeFence) return { ok: false, reason: 'missing_code_fence' };
  return { ok: true };
}
```

Heuristics are blunt. That’s fine. They’re great for alerting and triage, not for nuanced evaluation.
Strategy 3: LLM-as-judge (use carefully)
Judges are useful for subjective quality (tone, completeness), but they drift, and they can be gamed. Two guardrails that helped:
- Fix the judge prompt and version it (`judge@2026-02-10`)
- Store the judge’s score and the rubric text you used
Example judge response schema:
```json
{
  "verdict": "pass",
  "score": 8,
  "reasons": ["Answered the question", "Included concrete steps", "No policy violations"]
}
```

Pitfalls I’ve hit:
- Judge model changes silently → score distribution shifts → false alarms.
- Too many labels → nobody classifies them → success rate becomes meaningless.
- “One score to rule them all” → hides why it failed. Always keep a short reason code (schema sketch below).
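A small guard that enforces the shape above before you store judge scores, sketched with zod. Same `verdict`/`score`/`reasons` fields as the example; the `fail` verdict and the 0–10 score range are assumptions, since the example only shows a passing case.

```js
import { z } from 'zod';

export const JudgeResult = z.object({
  // 'fail' and the 0-10 range are assumed; adjust to your rubric.
  verdict: z.enum(['pass', 'fail']),
  score: z.number().int().min(0).max(10),
  // Require at least one reason so "one score to rule them all" can't sneak back in.
  reasons: z.array(z.string().min(1)).min(1),
});
```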
Outcomes are also how you connect observability back to product metrics. “p95 latency went up” matters, but “cost per completed run doubled” is what gets attention.
Common observability traps and how to avoid them
Most LLM observability failures aren’t technical. They’re discipline failures.
Trap: logging prompts without correlation IDs
If you can’t go from a user complaint to a run/call in one hop, you’ll waste hours.
Guardrail: generate a trace_id at request entry and propagate it everywhere. If you use HTTP, put it in response headers too.
```js
import crypto from 'node:crypto';

export function getTraceId(req) {
  // Reuse an upstream request ID when present; otherwise mint one at the edge.
  return req.headers['x-request-id'] ?? `req_${crypto.randomUUID()}`;
}
```

Trap: sampling that hides rare failures
If you sample “normal traffic,” you’ll miss the weird edge cases that break agents: long contexts, tool timeouts, multilingual inputs.
Guardrail: sample by outcome and error, not uniformly. Keep 100% of failures, and a small % of successes.
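A sketch of outcome-aware sampling; the 1% keep rate for successes is an arbitrary placeholder to tune against your traffic volume.

```ts
function shouldKeep(event: { status: 'success' | 'error' }, successSampleRate = 0.01): boolean {
  // Keep every failure; failures are where the debugging value is.
  if (event.status === 'error') return true;
  // Keep a small random slice of successes for baselines and drift detection.
  return Math.random() < successSampleRate;
}
```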
Trap: dashboards without alert thresholds
A dashboard is not monitoring. If you don’t have thresholds, you’re just hoping someone looks.
Guardrail: alert on SLO-style signals:
- success rate per run label
- p95 run duration
- p95 call latency by model
- cost per successful run
Example of an alert query for “cost per successful run”:
```sql
SELECT
  label,
  SUM(cost_usd) / NULLIF(SUM(CASE WHEN outcome = 'Completed' THEN 1 ELSE 0 END), 0) AS cost_per_success
FROM run_daily_rollup
WHERE day = CURRENT_DATE
GROUP BY label
HAVING SUM(cost_usd) / NULLIF(SUM(CASE WHEN outcome = 'Completed' THEN 1 ELSE 0 END), 0) > 0.35;
```

Trap: ignoring tail latency
Agents amplify tail latency because they do multiple calls. A small p95 regression per call becomes a huge p95 regression per run.
Guardrail: track p95/p99 at the run level, not just call level.
Trap: not tracking retries and repair loops
Retries are where budgets go to die. If you don’t track retry count and “success after retry,” you’ll think the system is fine because it eventually succeeds.
Guardrail: record retry attempts as explicit events/links, and compute (see the sketch after this list):
- retries per run
- cost per retry
- success rate with/without retries
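A sketch of that rollup from call-level events, assuming each call record carries `run_id`, `retry_count`, `cost_usd`, and `status` as in the per-call schema above.

```ts
type CallRecord = { run_id: string; retry_count: number; cost_usd: number; status: 'success' | 'error' };

function retryMetrics(calls: CallRecord[]) {
  const runs = new Set(calls.map((c) => c.run_id));
  const retriedCalls = calls.filter((c) => c.retry_count > 0);
  const totalRetries = calls.reduce((sum, c) => sum + c.retry_count, 0);
  return {
    // How much retrying is happening at all.
    retries_per_run: totalRetries / Math.max(1, runs.size),
    // What the retries cost, not just whether they eventually worked.
    retry_cost_usd: retriedCalls.reduce((sum, c) => sum + c.cost_usd, 0),
    // Did the extra spend actually buy a success?
    success_after_retry_rate:
      retriedCalls.length === 0
        ? null
        : retriedCalls.filter((c) => c.status === 'success').length / retriedCalls.length,
  };
}
```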
Trap: schema churn
If you change field names every week, you’ll never get stable trend lines.
Guardrail: keep a minimal stable schema for calls/runs/outcomes. Add fields carefully. Deprecate slowly.
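One pattern that keeps trend lines stable, sketched with zod: the core fields never change, and anything new or provider-specific lands as optional so old events keep validating.

```js
import { z } from 'zod';

// Core: stable since day one. Changing anything here breaks every dashboard downstream.
const CallEventCore = z.object({
  trace_id: z.string(),
  provider: z.string(),
  model: z.string(),
  latency_ms: z.number(),
  status: z.enum(['success', 'error']),
});

// Additive: new, provider-specific, or experimental fields are optional so old events still validate.
export const CallEvent = CallEventCore.extend({
  cached_input_tokens: z.number().optional(),
  prompt_version: z.string().optional(),
});
```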
Doing this with WarpMetrics (runs, groups, outcomes)
If you want the run/group/call/outcome structure without building it yourself, WarpMetrics maps cleanly onto the hierarchy above. The SDK wraps your LLM client, then you explicitly link calls into runs/groups and record outcomes.
Here’s the smallest useful pattern: one run, two phases, one outcome.
```js
import OpenAI from 'openai';
import { warp, run, group, call, outcome, flush } from '@warpmetrics/warp';

const openai = warp(new OpenAI());

const r = run('Support agent', { traceId: 'req_7d1f2c9a', userId: 'u_1842' });

const planning = group(r, 'Planning');
call(planning, await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: 'Plan the next action.' }]
}));

const execution = group(r, 'Execution');
call(execution, await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: 'Write the customer reply.' }]
}));

outcome(r, 'Completed', { channel: 'email' });
await flush();
```

That snippet is boring on purpose. Boring instrumentation is what stays running in prod. Once you have runs + phases + outcomes, you can answer the operational questions: where time/cost went, which phase failed, and what “success” looks like in numbers.
Start building with WarpMetrics
Track every LLM call, measure outcomes, and let your agents query their own performance data.