
How to Track LLM API Costs Across Providers


You want to answer questions like: “Why did spend spike yesterday?”, “Which feature is burning tokens?”, and “Are retries doubling our bill?” across OpenAI, Anthropic, and whatever you add next quarter. You can’t do that from provider invoices. You need per-call logs with stable attribution, a normalized usage schema, and cost computation you can recompute when pricing changes.

I’ve built this a few times now. The hard part isn’t writing a logger. It’s getting the data model right so you’re not stuck later with “total_tokens: 1234” and no clue what happened.

What data you must log for cost attribution

Minimum viable cost attribution is one row per LLM call with:

  • Who/what to blame: userId, orgId, feature, and a stable runId (or request/job ID).
  • What you called: provider, model, and (if you have it) the endpoint/operation (responses.create, messages.create, etc.).
  • What it cost in usage terms: input/output tokens, plus cache/read/write where available.
  • What happened: status (success/error), errorCode/errorType (normalized), and latencyMs.
  • When: timestamp (and ideally region).
  • Raw payloads: the provider response (and sometimes the request) so you can remap later.

The common mistake: logging only totals, or only prompt/completion tokens. That’s fine until you need to compare providers (different semantics) or explain a spike (retries, cache misses, streaming loops).

Here’s the shape I aim for in my own logs (NDJSON, DB row, whatever). Keep it boring and explicit:

{
  "ts": "2026-02-12T18:42:11.392Z",
  "runId": "run_01JMC6Q2V1K1Q8H3G4Y7G2W6QK",
  "requestId": "req_9f3b2c0a",
  "userId": "usr_123",
  "orgId": "org_acme",
  "feature": "support.reply",
  "provider": "openai",
  "model": "gpt-4o-mini",
  "status": "success",
  "latencyMs": 842,
  "usage": {
    "inputTokens": 812,
    "outputTokens": 146,
    "cachedInputTokens": 640,
    "cacheReadTokens": 0,
    "cacheWriteTokens": 0
  },
  "rawResponse": { "id": "chatcmpl-abc...", "usage": { "prompt_tokens": 812, "completion_tokens": 146, "total_tokens": 958 } }
}

A few opinions from getting burned:

  • Always log a stable correlation ID that survives retries. If every retry gets a new request ID, you’ll never notice a retry storm until the bill arrives. (A retry wrapper that does this is sketched after this list.)
  • Log latency even for errors. Rate limits and timeouts are “cost events” because they trigger retries.
  • Don’t log full prompts by default unless you have a real need and a privacy story. For cost attribution, you usually only need token counts + metadata. If you do log prompts, store them separately with retention controls.
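
Here’s a minimal sketch of that first point: a retry wrapper that keeps one correlation ID for the whole retry chain and logs latency for failures too. logCall is a stand-in for whatever per-call logger you use, and the backoff is deliberately crude:

import { randomUUID } from 'node:crypto';

// `logCall` is a stand-in for your own per-call logger (NDJSON write, DB insert, etc).
export async function withRetries(makeCall, { maxAttempts = 3, logCall }) {
  const correlationId = randomUUID(); // one ID for every attempt in this chain

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const startedAt = Date.now();
    try {
      const res = await makeCall();
      logCall({ correlationId, attempt, status: 'success', latencyMs: Date.now() - startedAt });
      return res;
    } catch (err) {
      // Errors are cost events too: they trigger the next attempt.
      logCall({ correlationId, attempt, status: 'error', errorType: err?.code ?? err?.name, latencyMs: Date.now() - startedAt });
      if (attempt === maxAttempts) throw err;
      await new Promise((r) => setTimeout(r, 250 * attempt)); // crude backoff
    }
  }
}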

If you’re doing agentic workflows, “requestId” isn’t enough. You want a hierarchy (run → phase → call). More on that later.

How to normalize usage across providers

Providers disagree on naming and sometimes on meaning. If you want to track LLM costs across providers, normalize into one internal schema and keep the raw response next to it.

My internal schema is basically:

  • inputTokens
  • outputTokens
  • cachedInputTokens (OpenAI-style)
  • cacheReadTokens / cacheWriteTokens (Anthropic-style)
  • totalTokens (derived)

Make it explicit that some fields are optional. Don’t pretend they exist when they don’t.

export function normalizeUsage(u) {
  const usage = {
    inputTokens: u?.inputTokens ?? 0,
    outputTokens: u?.outputTokens ?? 0,
    cachedInputTokens: u?.cachedInputTokens ?? 0,
    cacheReadTokens: u?.cacheReadTokens ?? 0,
    cacheWriteTokens: u?.cacheWriteTokens ?? 0,
  };

  return {
    ...usage,
    // cachedInputTokens is a subset of inputTokens (OpenAI semantics), so it isn't
    // added again; Anthropic-style cache read/write tokens are separate buckets.
    totalTokens:
      usage.inputTokens +
      usage.outputTokens +
      usage.cacheReadTokens +
      usage.cacheWriteTokens,
  };
}

Then write small mappers per provider/endpoint. The trick is: map what you know, don’t guess.

OpenAI responses often look like “prompt_tokens / completion_tokens”, and cached tokens may show up depending on API/feature. Map conservatively:

export function mapOpenAIUsage(raw) {
  const u = raw?.usage;
  return normalizeUsage({
    inputTokens: u?.prompt_tokens ?? 0,
    outputTokens: u?.completion_tokens ?? 0,
    cachedInputTokens: u?.prompt_tokens_details?.cached_tokens ?? 0,
  });
}

Anthropic’s Messages API uses different fields and has cache read/write semantics. If you don’t use prompt caching, those are zero. If you do, they matter a lot:

export function mapAnthropicUsage(raw) {
  const u = raw?.usage;
  return normalizeUsage({
    inputTokens: u?.input_tokens ?? 0,
    outputTokens: u?.output_tokens ?? 0,
    cacheReadTokens: u?.cache_read_input_tokens ?? 0,
    cacheWriteTokens: u?.cache_creation_input_tokens ?? 0,
  });
}

Annoying differences you’ll hit:

  • Streaming usage arrives late. Some SDKs only give you usage after the stream completes. If your logger writes the row at “first token”, you’ll store zeros. Fix: write a “call started” event and a “call finished” event, or buffer until the end for streams (sketched after this list).
  • Missing fields. Some responses don’t include usage if the call errors early. Store what you have (status/latency/error), and leave usage null/zero.
  • Cache semantics aren’t portable. OpenAI’s “cached input tokens” and Anthropic’s “cache read/write” aren’t the same thing. Normalize them into separate fields and don’t combine them unless you’re very sure.
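
For the streaming point, here’s a sketch of the two-event pattern, assuming OpenAI’s Chat Completions streaming with stream_options.include_usage (usage arrives on the final chunk). logEvent is a stand-in for your own logger:

import OpenAI from 'openai';
import { randomUUID } from 'node:crypto';

const openai = new OpenAI();

const callId = randomUUID();
// logEvent is hypothetical: whatever writes your "call started" / "call finished" rows.
logEvent({ type: 'call.started', callId, provider: 'openai', model: 'gpt-4o-mini', ts: new Date().toISOString() });

const stream = await openai.chat.completions.create({
  model: 'gpt-4o-mini',
  messages: [{ role: 'user', content: 'Reply to this ticket: ...' }],
  stream: true,
  stream_options: { include_usage: true }, // ask for usage on the final chunk
});

let rawUsage = null;
for await (const chunk of stream) {
  if (chunk.usage) rawUsage = chunk.usage; // only present on the last chunk
  // ...forward chunk.choices[0]?.delta?.content to the client as usual...
}

logEvent({
  type: 'call.finished',
  callId,
  usage: rawUsage ? mapOpenAIUsage({ usage: rawUsage }) : null, // null if the stream died early
});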

One more thing: store the raw provider payload. Pricing changes. Models get renamed. You will need to reprocess.

I usually store rawResponse (and sometimes a redacted rawRequest) as JSONB in Postgres, plus a cost_usd column so the rollup queries later in this post have something to sum (recompute it whenever pricing changes). Example table:

create table llm_calls (
  id bigserial primary key,
  ts timestamptz not null,
  run_id text not null,
  user_id text,
  org_id text,
  feature text not null,
  provider text not null,
  model text not null,
  status text not null,
  latency_ms int not null,
  input_tokens int not null,
  output_tokens int not null,
  cached_input_tokens int not null,
  cache_read_tokens int not null,
  cache_write_tokens int not null,
  cost_usd numeric(12, 6),
  raw_response jsonb not null
);

create index on llm_calls (ts);
create index on llm_calls (run_id);
create index on llm_calls (feature, ts);

Yes, JSONB costs storage. It’s still cheaper than being unable to explain spend.

How to compute LLM API costs correctly

Cost computation is a pure function:

cost = f(provider, model, timestamp, usage breakdown, pricing table version)

The “timestamp” part matters because pricing changes. Also, providers rename models and add aliases. If you don’t version pricing, you’ll end up recomputing historical spend with today’s rates, which is wrong and confusing.

I treat pricing as time-versioned config. A row has: provider, model (or model family), effective start, and per-1M token rates for each bucket you care about.

Example pricing config (simplified, but real structure):

{
  "version": "2026-01-15",
  "prices": [
    {
      "provider": "openai",
      "model": "gpt-4o-mini",
      "effectiveFrom": "2026-01-15T00:00:00Z",
      "usdPer1MInput": 0.15,
      "usdPer1MOutput": 0.60,
      "usdPer1MCachedInput": 0.075
    },
    {
      "provider": "anthropic",
      "model": "claude-sonnet-4-5-20250929",
      "effectiveFrom": "2026-01-15T00:00:00Z",
      "usdPer1MInput": 3.0,
      "usdPer1MOutput": 15.0,
      "usdPer1MCacheRead": 0.3,
      "usdPer1MCacheWrite": 3.75
    }
  ]
}
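
To use that config you need a lookup that picks the row whose effectiveFrom is the latest one at or before the call’s timestamp. A minimal sketch; the file path is an assumption, and priceFor matches how I call it later on:

import { readFileSync } from 'node:fs';

// Assumes the versioned config above lives in pricing.json next to this module.
const pricing = JSON.parse(readFileSync(new URL('./pricing.json', import.meta.url), 'utf8'));

export function priceFor(provider, model, ts) {
  const at = new Date(ts).getTime();

  const match = pricing.prices
    .filter(
      (p) =>
        p.provider === provider &&
        p.model === model &&
        new Date(p.effectiveFrom).getTime() <= at
    )
    // The most recent effectiveFrom at or before the call's timestamp wins.
    .sort((a, b) => new Date(b.effectiveFrom) - new Date(a.effectiveFrom))[0];

  return match ?? null;
}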

Then cost is just math. Keep it boring. Keep it testable.

export function computeCostUsd({ provider, model, ts, usage, price }) {
  const u = normalizeUsage(usage);

  if (!price) throw new Error(`Missing price for ${provider}:${model} @ ${ts}`);

  const perToken = (usdPer1M) => usdPer1M / 1_000_000;

  // OpenAI-style cached input tokens are a subset of inputTokens: bill them at the
  // cached rate and only the remainder at the full input rate (no double-counting).
  const cachedInput = price.usdPer1MCachedInput != null ? u.cachedInputTokens : 0;
  const nonCachedInput = Math.max(0, u.inputTokens - cachedInput);

  let cost =
    nonCachedInput * perToken(price.usdPer1MInput) +
    u.outputTokens * perToken(price.usdPer1MOutput);

  if (price.usdPer1MCachedInput != null) {
    cost += cachedInput * perToken(price.usdPer1MCachedInput);
  }
  // Anthropic-style cache read/write tokens are separate from input_tokens,
  // so they're added on top.
  if (price.usdPer1MCacheRead != null) {
    cost += u.cacheReadTokens * perToken(price.usdPer1MCacheRead);
  }
  if (price.usdPer1MCacheWrite != null) {
    cost += u.cacheWriteTokens * perToken(price.usdPer1MCacheWrite);
  }

  return Math.round(cost * 1e6) / 1e6; // keep 6 decimals, avoid float noise
}

Things that surprised me the first time:

  • Model aliases/renames will break joins. Store both model (as returned) and a modelKey you control (your canonical mapping; sketched after this list). When a provider changes a name, you update the mapping, not your whole history.
  • Partial failures still cost money. A streaming call that errors after emitting tokens should be counted if you have usage. If you don’t have usage, at least count the event and flag it for later.
  • Retries multiply cost invisibly. If your retry logic sits in a middleware, and your logging sits above it, you’ll undercount. Logging needs to happen at the actual call boundary.
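
The modelKey mapping from the first point is just a lookup table you own. A sketch; the entries are illustrative, not exhaustive:

// Map provider-returned model names (including dated snapshots) to keys you control.
const MODEL_KEYS = {
  'gpt-4o-mini': 'openai:gpt-4o-mini',
  'gpt-4o-mini-2024-07-18': 'openai:gpt-4o-mini',
  'claude-sonnet-4-5': 'anthropic:claude-sonnet-4-5',
  'claude-sonnet-4-5-20250929': 'anthropic:claude-sonnet-4-5',
};

export function modelKey(provider, model) {
  // Fall back to the raw name so unknown models still get logged (and mapped later).
  return MODEL_KEYS[model] ?? `${provider}:${model}`;
}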

Local estimate vs authoritative server-side:

  • Local estimate is great for “budget checks before you run the expensive step” and for unit tests. But it will drift unless you keep pricing perfectly synced.
  • Server-side authoritative is better for reporting and billing because you can update pricing centrally and recompute. The tradeoff is you can’t always enforce budgets in real time unless you also do a local estimate.

My compromise: compute estimatedCostUsd locally (optional) and store it, but treat it as a hint. For reporting, recompute from stored usage + time-versioned pricing.

How to roll up spend by feature, user, and run

Per-call logs are necessary but not sufficient. The moment you ship an agent that does 8 calls per request, “flat logs” stop being explainable.

You want rollups that match how your product behaves:

  • Per request/run: “This user action cost $0.04.”
  • Per feature: “Autofix PR comments is 60% of spend.”
  • Per agent run hierarchy: “Planning step is cheap; tool loop is exploding.”

If you already log runId, rollups are straightforward SQL.

Spend per run:

select
  run_id,
  count(*) as calls,
  sum(input_tokens + cache_read_tokens + cache_write_tokens) as in_like_tokens,
  sum(output_tokens) as out_tokens,
  sum(cost_usd) as cost_usd
from llm_calls
where ts >= now() - interval '1 day'
group by run_id
order by cost_usd desc
limit 20;

Spend per feature (this is where budgets usually live):

select
  feature,
  sum(cost_usd) as cost_usd,
  count(*) as calls,
  percentile_cont(0.95) within group (order by latency_ms) as p95_latency_ms
from llm_calls
where ts >= date_trunc('day', now())
group by feature
order by cost_usd desc;

Cardinality pitfalls are real:

  • Don’t put raw URLs, prompt hashes, or free-form labels into feature. You’ll create a million groups and your rollups become useless.
  • Prefer small controlled vocabularies: feature = "support.reply", feature = "code.review", etc. (A tiny guard for this is sketched after this list.)
  • Put high-cardinality stuff in metadata (JSON) if you must, and only index what you actually query.
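
A closed vocabulary is easiest to enforce at the call site. A tiny sketch, assuming you keep the allowed labels in code:

// The allowed feature labels live in one place; anything else belongs in metadata.
const FEATURES = new Set(['support.reply', 'code.review', 'search.summarize']);

export function featureLabel(name) {
  if (!FEATURES.has(name)) {
    throw new Error(`Unknown feature label: ${name} (add it to FEATURES or put it in metadata)`);
  }
  return name;
}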

Hierarchical grouping beats flat logs. In agent systems, a run often has phases (“plan”, “retrieve”, “draft”, “self-check”) and loops (“tool calls until done”). If you model that hierarchy, you can answer: “Which phase is burning money?”

If you don’t want to build your own UI for this, observability tools can help (Langfuse, Helicone, etc.). The tool choice matters less than the data model: run → group → call.

Guardrails: budgets, alerts, and anomaly detection

Once you can attribute spend, you can control it. The guardrails that actually work in production are boring and ruthless:

  1. Hard caps per request/run (fail fast).
  2. Soft budgets per user/org (degrade gracefully).
  3. Model fallback rules (switch to cheaper model or smaller context).
  4. Alerts on cost per successful outcome, not just raw spend.

Hard caps are easiest if you estimate cost before each call and stop when you’re about to exceed a limit. You don’t need perfect estimates; you need “good enough to prevent disasters.”

Here’s a pattern I use: keep a running tally per run, and refuse the next call if it would exceed the budget.

export function createRunBudget({ maxCostUsd }) {
  let spent = 0;

  return {
    get spent() {
      return spent;
    },
    charge(estimatedCostUsd) {
      const next = spent + estimatedCostUsd;
      if (next > maxCostUsd) {
        const err = new Error(`Run budget exceeded: ${next.toFixed(4)} > ${maxCostUsd}`);
        err.code = 'BUDGET_EXCEEDED';
        throw err;
      }
      spent = next;
    },
  };
}

Then, right before an expensive call, estimate + charge:

const budget = createRunBudget({ maxCostUsd: 0.25 });

const est = computeCostUsd({
  provider: 'openai',
  model: 'gpt-4o',
  ts: new Date().toISOString(),
  usage: { inputTokens: 5000, outputTokens: 800 },
  price: priceFor('openai', 'gpt-4o', new Date()),
});

budget.charge(est);
// proceed with the call

Alerting: don’t just alert on “daily spend > X”. That catches growth and misses regressions. Alert on ratios:

  • cost per successful run
  • retries per run
  • tool calls per run
  • tokens per outcome class (success vs failure)

A simple anomaly query that catches retry storms is “cost per run p95 doubled”:

with per_run as (
  select
    date_trunc('hour', ts) as bucket,
    run_id,
    sum(cost_usd) as run_cost
  from llm_calls
  where ts >= now() - interval '48 hours'
  group by 1, 2
)
select
  bucket,
  percentile_cont(0.95) within group (order by run_cost) as p95_run_cost
from per_run
group by bucket
order by bucket;

Common failure modes that blow budgets:

  • Retries multiplying cost. Especially with exponential backoff + “retry on any error”.
  • Streaming loops. A bug that re-consumes or re-issues the same stream can double-count output tokens in your logs, and double your actual spend if it re-calls the provider.
  • Tool-call explosions. Agents that decide to call tools 30 times because you didn’t cap iterations (a simple cap is sketched after this list).
  • Context growth. Each step appends messages; token usage grows superlinearly across a run.
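
For the tool-call point, the fix is unglamorous: a hard iteration cap. A sketch, where runAgentStep is a stand-in for one model call plus whatever tools it requested:

const MAX_TOOL_ITERATIONS = 8; // pick a limit above your worst legitimate case

export async function runToolLoop(runAgentStep, initialState) {
  let state = initialState;
  for (let i = 0; i < MAX_TOOL_ITERATIONS; i++) {
    const step = await runAgentStep(state); // expected to return { done, state }
    if (step.done) return step.state;
    state = step.state;
  }
  const err = new Error(`Tool loop hit the ${MAX_TOOL_ITERATIONS}-iteration cap`);
  err.code = 'TOOL_LOOP_CAP';
  throw err;
}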

The important mindset shift: optimize cost vs outcome. If you only optimize for lower spend, you’ll “save money” by failing faster. Track success, and look at cost per success.

Doing this with WarpMetrics (practical, not magical)

If you don’t want to hand-roll the run/group/call hierarchy and per-call capture, WarpMetrics already models the structure (Run → Group → Call → Outcome/Act) and captures tokens/cost/latency for OpenAI and Anthropic calls. The useful part for cost tracking is that you get stable run IDs and phase grouping without inventing your own schema.

Here’s the minimal pattern I’d use to attribute spend to a run and feature-like label:

import OpenAI from 'openai';
import { warp, run, group, call, outcome, flush } from '@warpmetrics/warp';

const openai = warp(new OpenAI());

const r = run('support.reply', { userId: 'usr_123', orgId: 'org_acme' });
const drafting = group(r, 'Draft');

const res = await openai.chat.completions.create({
  model: 'gpt-4o-mini',
  messages: [{ role: 'user', content: 'Reply to this ticket: ...' }],
});

call(drafting, res);
outcome(r, 'Completed');

await flush();

Even if you use something else for dashboards, the core lessons still apply: log per-call usage + stable attribution, normalize usage into one schema, version pricing so you can recompute, and roll up spend along the same hierarchy your agent actually runs. That’s how you track LLM API costs without lying to yourself.

Start building with WarpMetrics

Track every LLM call, measure outcomes, and let your agents query their own performance data.