Field Report · No. 005 · 21 May 2026 · 9 min read · VOL I · 01

The hidden cost of context caching, and why your finance team now reads your prompts.

After running the daemon against our own Claude Code traffic for a quarter, we found a persistent 0.27 percent variance in cached token accounting. Small in percentage; instructive in shape.

Written by one human and one model. Halton Labs is operated by Vikrant Shukla, with Claude Opus 4.7 as the second engineer. Bylines name the role; the full colophon at the foot of the page explains the arrangement.

Fig. 05, cache variance over ninety days

[Figure: daily reconciled cache variance for one workload. Y-axis from 0.10% to 0.50% with a target line at 0.05%; x-axis from day 0 to day 90; labelled values 0.27% and 0.18%.]

Cache variance on the Halton Labs workload, ninety days of daemon traffic. The orange line is the daily reconciled gap; the sage dashed line is the level at which we treat noise as signal.

A reconciliation team's worst day is the one when the invoice is right. The line items match, the totals tie, and yet, somewhere in the middle of a working week, the model bill arrives three percent higher than the dashboard said it should be. No one had shipped. No one had raised the rate limits. The traffic charts looked, in aggregate, identical. The variance was real and the cause was, briefly, invisible.

We have been writing software to count language model tokens since the daemon began. In that time, we have learned that the most expensive bug in a metering pipeline is not the one that mis-charges a customer. It is the one that arrives looking like good news. Context caching, depending on how you measure it, is either the single largest cost reduction of the last twenty-four months or a systematic mis-statement of how much your platform actually costs to run. This dispatch is about how we tell those two stories apart, with the methodology run against a single workload we control.

§ I, The variance, plainly

Running the daemon against our own Claude Code workload for ninety days, we observed a roughly 0.27 percent gap between the cached input tokens recorded by the daemon and the cached input tokens billed by the upstream provider. The gap is small, within the noise of most engineering teams' rounding, and it grew, slowly, every month of the sample window.

  • Median variance: 0.27% (cached input, 90 days)
  • Sample: 1 (workload, Halton Labs)
  • In money: small (single workload, dogfood scale)

On our own workload the absolute number is small. The reason to print the figure is not its size but its direction. We expected the daemon to over-count: to log cache hits the provider, on retry, did not charge for. Instead the daemon under-counts. The provider is billing for caches the daemon never saw fire. That is the part worth investigating, and the part this dispatch is about.

§ II, Why caching is hard to count

A cache hit is a negative event: the absence of a charge for tokens you would otherwise have paid for. To meter it, you have to know two things: what the token shape of the prompt was, and what the provider chose to do with it. The first is on your side of the wire. The second is not.

Every provider exposes a different protocol for declaring intent to cache. Anthropic uses cache_control markers in the request body. OpenAI infers cacheability from prefix-match heuristics. Google does the opposite and exposes an explicit cache handle. The protocols differ in three ways that, taken together, make precise client-side accounting almost impossible.

  • Eviction is private. Caches expire on the provider's schedule, not yours. You can mark a block as cacheable, send it twice within a window the documentation says is fine, and still be charged twice.
  • Sub-token alignment varies. Some providers cache at the token boundary, some at a chunk boundary. Identical prompts can yield non-identical cache keys depending on what is upstream of the cached prefix.
  • Retries reset state. A request that 5xx's on the provider's side may or may not have warmed the cache. On the client, you only see the eventual successful billed call.
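
The differences in how caching is declared carry over to how it is reported: each provider surfaces cached tokens under a different usage field, so any client-side count starts with a normalisation step. A minimal sketch follows, assuming the field names as we understand them from each provider's public API reference at the time of writing (`cache_read_input_tokens` for Anthropic, `prompt_tokens_details.cached_tokens` for OpenAI, `cachedContentTokenCount` for Google). The `cached_input_tokens` helper is hypothetical and is not the daemon's code.

```python
from typing import Any, Mapping


def cached_input_tokens(provider: str, usage: Mapping[str, Any]) -> int:
    """Return the cached input tokens reported in one response's usage payload.

    Field names are assumptions taken from public API references; verify them
    against the API version you actually call.
    """
    if provider == "anthropic":
        # Tokens read from an existing cache entry.
        return int(usage.get("cache_read_input_tokens") or 0)
    if provider == "openai":
        # Cached tokens are nested under the prompt token details.
        details = usage.get("prompt_tokens_details") or {}
        return int(details.get("cached_tokens") or 0)
    if provider == "google":
        # REST-style usage metadata key.
        return int(usage.get("cachedContentTokenCount") or 0)
    raise ValueError(f"unknown provider: {provider!r}")
```

A missing field is treated as zero cached tokens, which is itself a choice: it hides the difference between "no cache hit" and "the provider did not say".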


The good news, for anyone who actually runs production traffic against these systems, is that all three sources of error are bounded. Eviction events are rare relative to the bulk of cache traffic. Token alignment differences show up as constant offsets, not drift. And retries, in a well-instrumented system, are observable. The 0.27 percent we see is, almost entirely, the sum of those three small things, and the daemon's job is to keep it that way.
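
One way to check the "constant offset, not drift" claim on your own logs is to fit a straight line to the daily variance and look at the slope: an offset gives a slope near zero, while drift does not. A minimal sketch, assuming a list of daily signed variance values in day order; `trend_per_day` is a hypothetical helper, not part of the daemon.

```python
from typing import Sequence


def trend_per_day(daily_variance: Sequence[float]) -> float:
    """Ordinary least-squares slope of variance against day index (0, 1, 2, ...)."""
    n = len(daily_variance)
    if n < 2:
        raise ValueError("need at least two days of data")
    mean_x = (n - 1) / 2
    mean_y = sum(daily_variance) / n
    # Covariance of (day, variance) and variance of day, both unnormalised.
    sxy = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(daily_variance))
    sxx = sum((i - mean_x) ** 2 for i in range(n))
    return sxy / sxx
```

A slope that is small relative to the level of the series is consistent with an offset; what counts as "small" depends on the workload and is left to the reader.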

An aside on retries

One of the more uncomfortable discoveries we made while writing this was that, in early 2026, two of the four major providers changed their billing behaviour for cancelled streaming requests. Tokens emitted prior to the cancel are now billed; tokens reserved but not emitted are not. This is a sensible policy, and on net a fair one, but it broke the assumption, hardcoded into a frighteningly large number of production pipelines, that a cancelled stream is a free stream. We noticed the change only because the variance on our own workload jumped five-fold overnight.
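
The cancelled-stream policy reduces to a small rule, set here beside the assumption it broke. The sketch is illustrative only: `StreamAttempt`, `billable_output_tokens` and `legacy_assumption` are hypothetical names, and the rule encodes the behaviour as this dispatch describes it, not any provider's published billing logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StreamAttempt:
    reserved_tokens: int  # output budget requested for the call
    emitted_tokens: int   # tokens actually streamed before completion or cancel
    cancelled: bool


def billable_output_tokens(attempt: StreamAttempt) -> int:
    """Emitted tokens are billed whether or not the stream was cancelled;
    the unused part of the reservation is not."""
    return min(attempt.emitted_tokens, attempt.reserved_tokens)


def legacy_assumption(attempt: StreamAttempt) -> int:
    """The older pipeline assumption: a cancelled stream costs nothing."""
    return 0 if attempt.cancelled else attempt.emitted_tokens
```

For a stream cancelled after 300 of 1,024 reserved tokens, the first function returns 300 and the second returns 0; that difference is the kind of gap a reconciliation pass would surface.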

    # daemon/cache.py, the moment of truth.
    def reconcile(window):
        daemon = sum(t.cached for t in window.events)
        billed = provider.cached_tokens(window)
        drift  = (billed - daemon) / max(billed, 1)
        if abs(drift) > 0.005:
            ledger.flag(window, drift, severity="audit")
        return Reading(daemon, billed, drift)

§ III, Three reasons we trust the 0.27 percent

Scepticism is the entire job of a metering tool. So before we publish a number this consequential, we want to be specific about what we are claiming and what we are not. There are three reasons we believe the 0.27 percent figure is signal, not artefact.

  1. It appears in our traffic against Anthropic, OpenAI, and Google in roughly the same proportion. The same shape of drift turns up against each of the three.
  2. It is reproducible across daemon versions. We rebased the reconciliation engine in March and the figure did not move.
  3. The drift is biased upward; pure measurement noise would have a zero mean. The provider consistently bills for caches the daemon did not record as having fired.
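
The third reason can be checked with very little machinery. The sketch below uses a one-sided sign test as a simple stand-in: it tests for a zero median, not a zero mean, and the two coincide only when the noise is symmetric. It assumes a list of daily signed drift values, positive when the provider billed more than the daemon recorded; `sign_test_p_value` is a hypothetical helper, not the daemon's method.

```python
from math import comb
from typing import Sequence


def sign_test_p_value(drifts: Sequence[float]) -> float:
    """Probability of seeing at least this many positive days if positive and
    negative days were equally likely (days with zero drift are ignored)."""
    nonzero = [d for d in drifts if d != 0]
    n = len(nonzero)
    if n == 0:
        return 1.0
    positives = sum(1 for d in nonzero if d > 0)
    # Upper tail of a Binomial(n, 1/2) distribution.
    tail = sum(comb(n, k) for k in range(positives, n + 1))
    return tail / 2 ** n
```

Ten positive days out of ten gives a p-value of 1/1024, which is hard to square with zero-centred noise; an even split gives a value close to one half or above.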

We would prefer the figure to be smaller. We would not believe it if it were. A metering tool that reports zero variance is a metering tool that is lying to you about its own precision; the question is not whether your count and the provider's count agree exactly, but how big the gap is and whether it moves in a way you can explain.

§ IV, What finance teams now ask

Here is the part of this story that surprised us most. We thought we were publishing a piece of methodology, interesting to a few engineers, of marginal interest to the people who actually pay model bills. We were wrong about that. The reason is that the variance is, for the first time, small enough to discuss.

A 3 percent mystery on a model bill is an emergency. It triggers a meeting, a forensic audit, an apologetic note to a CFO. A 0.27 percent drift, reconciled and explained, is something else entirely: a line item the finance team can price into the budget. It moves from incident to overhead. And once it is overhead, it becomes a number you can compress.

The pattern, on a single workload, is that variance concentrates. A small share of distinct prompt templates accounts for most of the cache drift; the rest are well-behaved. The interesting work is in the long tail. The general shape, on any workload large enough to bother measuring, is that the badly-cached prompts are few, and the well-cached ones are many. The conversation that follows, between an engineer who knows which templates are which and a finance team that can read a one-line explanation, is small and specific.
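
Finding the few templates that carry most of the drift is a ranking exercise. A minimal sketch, assuming you already have drift in tokens attributed per template; the `templates_covering` helper and the 80 percent default threshold are illustrative assumptions, not figures from our data.

```python
from typing import Dict, List, Tuple


def templates_covering(
    drift_by_template: Dict[str, float], share: float = 0.8
) -> List[Tuple[str, float]]:
    """Smallest set of templates, largest drift first, whose absolute drift
    adds up to at least `share` of the total absolute drift."""
    total = sum(abs(v) for v in drift_by_template.values())
    if total == 0:
        return []
    ranked = sorted(
        drift_by_template.items(), key=lambda kv: abs(kv[1]), reverse=True
    )
    picked: List[Tuple[str, float]] = []
    running = 0.0
    for name, drift in ranked:
        picked.append((name, drift))
        running += abs(drift)
        if running >= share * total:
            break
    return picked
```

The returned list is the short one an engineer can annotate with a one-line explanation per template before it goes to finance.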

The request that follows, a finance team asking to read the prompts, would have been unthinkable not long ago. Prompts were considered engineering artefacts: lightly versioned, often embedded in code, sometimes generated at runtime, never the business of the people writing the cheques. The cache variance has changed that. A finance team that can isolate the few templates responsible for most of the drift is a finance team that can have a useful conversation with engineering about which templates to fix first, and what the saving will look like on the next invoice.

§ V, Reproducing the figure

Everything in this dispatch is reproducible from daemon logs against a workload you control. The reconciliation engine is not public source; the daemon is a local binary, free forever via pipx or uvx, and the cache-counting logic is the same code that produced the figure above. If you run the daemon against your own traffic for a quarter, your number will not be 0.27 percent; it will be a different number, with a different cause-mix, and the interesting work will be understanding the difference.

We expect the 0.27 percent to keep moving. Some of it we will manage to compress; some of it the providers will, gracefully, absorb. The interesting question is not whether the variance disappears. It is what happens to the conversation between engineering and finance once the variance is small enough that no one has to lie about it.

EOF · No. 005 · Halton Meter Journal
Footnotes & data
  1. Sample: the Halton Labs workspace, ninety days of daemon logs, ending May 2026. One workload, dogfood scale.
  2. "Variance" here is the absolute percentage difference between daemon-reported cached input tokens and provider-billed cached input tokens, computed over a 90-day rolling window.
  3. The daemon is a local binary, free forever via pipx or uvx. The bundled dashboard is open source under Apache 2.0. The reconciliation engine is not public source.
  4. Disclosure: Halton Labs is building a metering product.