Methodology · No. 004 · 19 May 2026 · 8 min read · VOL I · 01

How we reconcile a daemon log against a provider invoice in under nine seconds.

Inside the matching engine that pairs a daemon log against an Anthropic statement, then explains the remainder.

Written by one human and one model. Halton Labs is operated by Vikrant Shukla, with Claude Opus 4.7 as the second engineer. Bylines name the role; the full colophon at the foot of the page explains the arrangement.

Fig. 04, reconcile match book

One workload, one statement, one residual to explain.

[Figure: two columns joined by match lines.
DAEMON LOG: msg_8a31 · claude-opus; msg_8a32 · claude-opus; msg_8a33 · claude-opus; msg_8a34 · gpt-5; msg_8a35 · gemini-pro; msg_8a36 · claude-opus; msg_8a37 · claude-opus.
PROVIDER INVOICE: inv_anth_R001; inv_anth_R002; inv_anth_R003; inv_oai_R104; inv_anth_R004 · extra; inv_anth_R005; inv_anth_R006.
Annotations: stream cancel; retry billed. MATCHED · 99.73%. RESIDUAL · 0.27%.]

A single sampled window from the reconciliation engine. Most rows match cleanly; the orange pair shows a stream the daemon saw cancelled, and an invoice line the provider billed on retry. Both get classified, neither gets dropped.

The reconciliation desk gets two artefacts a day. One is a stream of events from the daemon: every LLM call we observed on the wire, with timestamps to the millisecond and a token shape we computed locally. The other is a printed-looking PDF from a provider, or an API endpoint that pretends to be a printed-looking PDF, listing what they intend to charge us for. Our job, every day, is to make those two artefacts agree. When they cannot agree, our job is to explain the gap precisely enough that someone in finance can sign off on it without holding their nose.

At present we run the engine against a single workspace, the Halton Labs daemon, at small scale. The target on closing a reconciliation window is nine seconds. This dispatch walks through how the matching engine works, why the target is what it is, and what the unmatched 0.27 percent turned out to be made of when we finally took the time to break it open.

§ I, Reconcile · What the word means here

Reconciliation, in the accountant's sense, is the practice of comparing two independent records of the same transaction and resolving any disagreement. In our case the two records are the daemon log (observed locally, on the wire, at the moment of the call) and the provider invoice (computed centrally, after the fact, with whatever discounts and corrections the provider has applied). They are observing the same physical event from different sides of a network and a billing system. They will not, on any given day, agree perfectly. They will agree to within a small residual whose composition is knowable.

We pair them by writing a match predicate, running it across every window, classifying every line as matched or unmatched, attributing every unmatched line to a known cause or flagging it for review. The goal is not zero residual. The goal is a residual that is small, stable, and explained.

§ II, The key · The match predicate, and the four fields we trust

A match is a hypothesis: this line in the invoice is the same event as this line in the daemon log. To form the hypothesis we use four fields. Anything more is overfit; anything less is ambiguous.

  1. Timestamp window. The provider's stamp and our stamp are independent clocks; we accept a match if they fall inside a ten-second symmetric window. Most matches land inside one second. The wider band is for stream cancels and for provider-side queueing.
  2. Model id. The string returned by the provider in the response body, not the string the caller asked for. Caller-side aliases (e.g. claude-opus-latest) are resolved to a concrete model id before matching.
  3. Project tag. The workspace and project pair the daemon attributed at the time of the call. The provider does not know about projects; we propagate ours by binding invoice lines to the workspace whose daemon emitted the closest call.
  4. Token shape. A four-tuple of input, output, cache-read, cache-write tokens, compared with a tolerance of one token to absorb tokenizer rounding at chunk boundaries.

Notably absent from this list is any kind of request id. Providers do expose request ids; we capture them in the daemon. They are useful for human investigation but unreliable as a match key because not every provider includes the id on the invoice line, and when they do, the formatting drifts. We use them as a tiebreaker when the four-field predicate produces more than one candidate.

Here is the predicate, more or less:

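# engine/match.py · the matching predicate
# Event, Line, WINDOW_SEC and within() are not shown here;
# within(a, b, tol) is true when a and b differ by at most tol.
def matches(daemon: Event, invoice: Line) -> bool:
    # Model id: compare the resolved id, never the caller-side alias.
    if daemon.model_resolved != invoice.model:
        return False
    # Project tag: the invoice line must be bound to the same workspace.
    if daemon.workspace != invoice.workspace:
        return False
    # Timestamp window: the two clocks are independent, so allow drift.
    if abs(daemon.ts - invoice.ts) > WINDOW_SEC:
        return False
    # Token shape: all four counts must agree to within one token.
    return (
        within(daemon.input_tokens,  invoice.input_tokens,  1) and
        within(daemon.output_tokens, invoice.output_tokens, 1) and
        within(daemon.cache_read,    invoice.cache_read,    1) and
        within(daemon.cache_write,   invoice.cache_write,   1)
    )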

The actual engine is a bit larger because it works in batched bipartite-match mode rather than pairwise, but the predicate is what you would write. The interesting code is not the matcher; it is the four classifiers we run on whatever the matcher leaves unpaired.
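For readers who want something they can run, here is a minimal, self-contained sketch of the predicate together with the request-id tiebreaker described earlier. It is not the engine's batched matcher. The `Record` type, the `pick` function, the reading of the window as plus or minus ten seconds, and the closest-timestamp fallback are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

WINDOW_SEC = 10.0  # assumed: +/- 10 s around the daemon timestamp
TOKEN_TOL = 1      # one-token tolerance on each count


@dataclass(frozen=True)
class Record:
    """Hypothetical shared shape for a daemon event or an invoice line."""
    ts: float                            # seconds since epoch
    model: str                           # resolved model id
    workspace: str
    tokens: tuple[int, int, int, int]    # input, output, cache_read, cache_write
    request_id: Optional[str] = None


def matches(daemon: Record, invoice: Record) -> bool:
    """Four-field predicate: model, workspace, timestamp window, token shape."""
    if daemon.model != invoice.model or daemon.workspace != invoice.workspace:
        return False
    if abs(daemon.ts - invoice.ts) > WINDOW_SEC:
        return False
    return all(abs(d - i) <= TOKEN_TOL
               for d, i in zip(daemon.tokens, invoice.tokens))


def pick(invoice: Record, daemon_events: list[Record]) -> Optional[Record]:
    """Return the daemon event to pair with `invoice`, or None if unmatched."""
    candidates = [d for d in daemon_events if matches(d, invoice)]
    if not candidates:
        return None
    # Tiebreak 1: an exact request-id agreement wins, when the invoice has one.
    if invoice.request_id is not None:
        for d in candidates:
            if d.request_id == invoice.request_id:
                return d
    # Tiebreak 2 (assumption): otherwise take the closest timestamp.
    return min(candidates, key=lambda d: abs(d.ts - invoice.ts))
```

The sketch compares request ids exactly; since provider formatting drifts, a real tiebreaker would likely normalise the id first.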

§ III, Mismatch · The four reasons a row is unmatched, and what we do about each

We have categorised every unmatched line we have seen, across the runtime we have measured so far, into one of four buckets. All four are mechanical and self-explaining. Whatever fits none of them is the true residual.

Cache eviction. The daemon marked a request as a cache hit; the provider charged for it as a cache miss because the cache had expired between our send and their receive. The signature is a daemon line with non-zero cache_read and an invoice line whose cache_read is zero and whose input_tokens are correspondingly higher. We catch this by re-matching the pair under a "cache-evicted" predicate and reclassifying. The line is then matched, with a note. This accounts for roughly 38% of our residual.
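The cache-evicted signature can be sketched as a second predicate. This is an illustration under assumptions, not the engine's classifier: the `Shape` type and `is_cache_evicted` name are hypothetical, and the check assumes the invoice's input count rises by exactly the daemon's cache-read count. It deliberately leaves `cache_write` unchecked.

```python
from dataclasses import dataclass

TOKEN_TOL = 1  # same one-token tolerance as the main predicate


@dataclass(frozen=True)
class Shape:
    """Hypothetical token four-tuple for one side of a candidate pair."""
    input_tokens: int
    output_tokens: int
    cache_read: int
    cache_write: int


def is_cache_evicted(daemon: Shape, invoice: Shape) -> bool:
    """True when the daemon saw a cache hit but the provider billed a miss."""
    # The daemon must have recorded a hit, and the invoice must show none.
    if daemon.cache_read == 0 or invoice.cache_read != 0:
        return False
    # Assumption: the tokens we counted as cache reads reappear as input.
    expected_input = daemon.input_tokens + daemon.cache_read
    return (
        abs(invoice.input_tokens - expected_input) <= TOKEN_TOL
        and abs(invoice.output_tokens - daemon.output_tokens) <= TOKEN_TOL
    )
```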

Retry duplicates. The SDK retried internally; the provider billed both attempts; the daemon only saw the second one because the first never escaped the local socket. The invoice has two lines, the daemon has one, and the two invoice lines have near-identical timestamps and identical token shapes. We pair the orphaned invoice line with the daemon line its twin already matched, and mark it with a retry suffix. Roughly 24% of our residual.
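A rough sketch of the twin search, under stated assumptions: `InvoiceLine`, `find_retry_twin`, and the two-second reading of "near-identical" are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional

RETRY_GAP_SEC = 2.0  # assumed meaning of "near-identical"; not from the engine


@dataclass(frozen=True)
class InvoiceLine:
    """Hypothetical invoice line, reduced to the fields the check needs."""
    line_id: str
    ts: float
    model: str
    tokens: tuple[int, int, int, int]  # input, output, cache_read, cache_write


def find_retry_twin(orphan: InvoiceLine,
                    matched: list[InvoiceLine]) -> Optional[InvoiceLine]:
    """Find an already-matched line that the orphan appears to duplicate."""
    for line in matched:
        same_shape = line.model == orphan.model and line.tokens == orphan.tokens
        if same_shape and abs(line.ts - orphan.ts) <= RETRY_GAP_SEC:
            return line
    return None
```

If a twin is found, the orphan can be attached to the twin's daemon line and marked as a retry; if not, it falls through to the next classifier.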

Streaming cancels. The application closed the connection mid-stream; the daemon saw the bytes that arrived and computed a token count from those; the provider billed for whatever they had emitted at the moment of cancel, which is sometimes a few tokens more (still in their send buffer) or fewer (their counter advanced after our last seen chunk). The signature is a small constant offset on output tokens, usually 1 to 4 tokens. About 21% of the residual.
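The streaming-cancel signature might be checked along these lines. The names are hypothetical, the `daemon_saw_cancel` flag is an assumed field on the daemon event, and treating four tokens as a hard cap is an assumption drawn from the "usually 1 to 4" observation rather than a rule the engine is known to apply.

```python
from dataclasses import dataclass

TOKEN_TOL = 1          # tolerance on the fields that should still agree
MAX_CANCEL_OFFSET = 4  # assumed cap on the output-token offset


@dataclass(frozen=True)
class Shape:
    """Hypothetical token four-tuple for one side of a candidate pair."""
    input_tokens: int
    output_tokens: int
    cache_read: int
    cache_write: int


def is_stream_cancel(daemon: Shape, invoice: Shape,
                     daemon_saw_cancel: bool) -> bool:
    """True when a cancelled stream explains a small output-token offset."""
    if not daemon_saw_cancel:
        return False
    others_agree = (
        abs(daemon.input_tokens - invoice.input_tokens) <= TOKEN_TOL
        and abs(daemon.cache_read - invoice.cache_read) <= TOKEN_TOL
        and abs(daemon.cache_write - invoice.cache_write) <= TOKEN_TOL
    )
    # The offset can go either way: a few tokens more, or a few fewer.
    offset = abs(invoice.output_tokens - daemon.output_tokens)
    return others_agree and 1 <= offset <= MAX_CANCEL_OFFSET
```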

Rate-limit fallbacks. Some clients are configured to silently fall back to a smaller model when a rate limit is hit on a larger one. The daemon sees the original request, with the original model id; the response, and the invoice, name the fallback model. We catch this by checking whether the invoice line's model is a known fallback target for the daemon line's model, with a matching token shape. About 11% of the residual, and rising.
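One way to picture the fallback check is a lookup table of known fallback targets plus the usual token-shape comparison. The table contents below are placeholders, not real model ids or real client configuration, and `is_rate_limit_fallback` is a hypothetical name.

```python
TOKEN_TOL = 1  # same one-token tolerance as the main predicate

# Placeholder mapping: requested model -> models a client may fall back to.
KNOWN_FALLBACKS: dict[str, frozenset[str]] = {
    "example-large": frozenset({"example-small"}),
}


def is_rate_limit_fallback(daemon_model: str, invoice_model: str,
                           daemon_tokens: tuple[int, int, int, int],
                           invoice_tokens: tuple[int, int, int, int]) -> bool:
    """True when the invoice names a known fallback for the daemon's model."""
    if invoice_model not in KNOWN_FALLBACKS.get(daemon_model, frozenset()):
        return False
    # The token shape must still agree, field by field, within tolerance.
    return all(abs(d - i) <= TOKEN_TOL
               for d, i in zip(daemon_tokens, invoice_tokens))
```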

Whatever is left after those four passes is the genuinely unexplained residual. At present that is roughly 6 percent of the 0.27 percent, or about 0.016 percent overall (roughly 1.6 basis points). We flag every such line for human review. Most of them resolve, on inspection, to one of the four causes above with an unusual fingerprint; a small handful turn into a ticket with the provider.
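The arithmetic behind that figure, worked from the rounded shares quoted in the four classifier paragraphs (so the result is approximate to the same degree):

```python
# Shares of the residual attributed to each classifier, in percent.
shares = {
    "cache eviction": 38,
    "retry duplicates": 24,
    "streaming cancels": 21,
    "rate-limit fallbacks": 11,
}

explained_share = sum(shares.values())      # 94 percent of the residual
unexplained_share = 100 - explained_share   # 6 percent of the residual

residual_pct = 0.27                         # residual, in percent
unexplained_pct = residual_pct * unexplained_share / 100
basis_points = unexplained_pct * 100        # 1 basis point = 0.01 percent

print(f"{unexplained_pct:.4f} percent, or {basis_points:.2f} basis points")
# prints "0.0162 percent, or 1.62 basis points"
```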

§ IV, The clock · Why nine seconds, and how it is measured

The nine-second target is the time between an invoice window closing in the backend and the reconciliation report being ready for a human to look at. It is measured wall-clock, end to end, and it is the number we publish on the operator status page. The clock starts when the last invoice line for a window has been ingested. It stops when the reconciliation result, residual and all, has been written to the ledger and the dashboard has been re-rendered.

Workload: 1 (single workspace, Halton Labs)
Residual: 0.27% (unmatched by the predicate, before the four classifiers)
Target: 9 s (window-close to dashboard, p95)

Why nine? It is the longest interval at which a person who has just refreshed the dashboard does not start to wonder whether something has broken. Ten seconds is the well-known boundary at which web users start clicking refresh; eight is the cadence at which the kinds of finance dashboards we have looked at tend to re-poll. Nine sits between the two, one second from each. It is a deeply unscientific number that we have nevertheless found to be the right one. Nothing in the engine is tuned to exactly nine; we set the budget at nine and built the engine to stay under it.

The engine is a Rust binary that reads the invoice lines and the daemon events from Postgres into memory, builds a hash index on the four-field predicate, runs the matcher in parallel across model partitions, and writes results back in a single transaction. The hot path is about 4,400 lines. The dominant cost is not the matching, which is essentially free at our current volumes, but the Postgres roundtrip; we have been working on a path that streams directly out of the WAL and back into a results table, and that is in testing for v2 of the engine. v1 averages 4.2 seconds on the test workload, p95 of 7.8 seconds, p99 of 9.6 seconds. The tail is the Python invoice parser; we know what to fix.
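To make the hash-index idea concrete, here is a small sketch in Python rather than Rust. It is one plausible way to index for the four-field predicate, not a description of the engine's internals: the `Rec` type, the function names, the bucketing scheme, and the plus-or-minus ten-second window are all assumptions. Events are keyed on model, workspace, and a coarse time bucket, so each invoice line only inspects three buckets instead of the whole window.

```python
from collections import defaultdict
from dataclasses import dataclass

WINDOW_SEC = 10.0  # assumed: +/- 10 s timestamp tolerance
TOKEN_TOL = 1


@dataclass(frozen=True)
class Rec:
    """Hypothetical shared shape for a daemon event or an invoice line."""
    ts: float
    model: str
    workspace: str
    tokens: tuple[int, int, int, int]  # input, output, cache_read, cache_write


def bucket(ts: float) -> int:
    """Coarse time bucket, one bucket per WINDOW_SEC."""
    return int(ts // WINDOW_SEC)


def build_index(events: list[Rec]) -> dict[tuple[str, str, int], list[Rec]]:
    """Group daemon events by (model, workspace, time bucket)."""
    index: dict[tuple[str, str, int], list[Rec]] = defaultdict(list)
    for e in events:
        index[(e.model, e.workspace, bucket(e.ts))].append(e)
    return index


def candidates(index: dict[tuple[str, str, int], list[Rec]],
               line: Rec) -> list[Rec]:
    """Daemon events that satisfy the predicate for one invoice line."""
    b = bucket(line.ts)
    found = []
    # Two stamps within WINDOW_SEC always fall in the same or adjacent buckets.
    for nb in (b - 1, b, b + 1):
        for e in index.get((line.model, line.workspace, nb), []):
            if abs(e.ts - line.ts) > WINDOW_SEC:
                continue
            if all(abs(x - y) <= TOKEN_TOL
                   for x, y in zip(e.tokens, line.tokens)):
                found.append(e)
    return found
```

Because the key starts with the model id, the index also splits naturally into per-model partitions that can be processed independently.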

§ V, Worth · What the residual is worth, in pounds

At Halton Labs' own scale the absolute numbers are small. A 0.27 percent residual on our own monthly spend works out to single-digit pounds. The point of writing it down is not the amount; it is that the residual has a known shape and a known cause-mix, which is what makes it a budget item rather than a suspicion. A suspected leak corrodes a team. A measured one, explained, becomes a line on a forecast.

We will be filing a longer methodology piece next quarter with the full classifier source open-sourced, the test fixtures published, and an invitation to read the predicates in detail. If you reconcile differently and think our 0.27 percent should be your 0.1 percent, we would like to hear from you on the desk.

EOF · No. 004 · Halton Meter Journal
Notes & sources
  1. The reconciliation engine is not public source. We expect a future commercial deployment to require a controlled-access boundary around the matcher itself. The classifier code is intended to be published; see the methodology piece next quarter.
  2. The figures in this dispatch are from the Halton Labs workspace running the daemon at small scale. Residual percentages are stable on this single workload over four months.
  3. Disclosure: Halton Labs is building a metering product whose reconciliation engine is the subject of this dispatch. Figures here are from a single dogfooded workspace.