Methodology · No. 010 · 17 May 2026 · 38 min read · VOL I · 01

Reading the tape: a primer on LLM observability for finance leaders.

Most cost dashboards are charts. A tape is a sequence. This is a case for treating language model traffic the way trading desks treat order flow: as a stream you read in real time, not a bill you reconcile in arrears.

Written by one human and one model. Halton Labs is operated by Vikrant Shukla, with Claude Opus 4.7 as the second engineer. Bylines name the role; the full colophon at the foot of the page explains the arrangement.

Fig. 10, ninety seconds of LLM traffic

Every tick is one call. Height is duration; colour is model; annotation is cost.

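[Figure: a tape of ninety seconds of LLM traffic. Horizontal axis: elapsed time in seconds, 0s to 90s (T+0 = window open). Model legend: opus, sonnet, haiku, gpt, gemini. Annotated ticks: opus £0.082, opus £0.114, opus £0.094. Readout panel: run rate £42.18 per hour; 90s total £0.794; calls 30.]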

The tape is append-only. Each tick is one API call to a language model, printed at the second it was issued. Height encodes round-trip duration; colour encodes the model served. The panel on the right is the running readout: per-hour run rate, ninety-second total, call count.

A dashboard tells you, with admirable confidence, what has already happened. The big number on the upper left is the total for the period. The line chart underneath is the same total, broken into days. The pie chart on the right is the same total again, broken into providers. Every element on the page is a different visualisation of one fact, which is a fact about the past. This is a perfectly reasonable way to present a bill that has been settled. It is a strange way to manage a stream that has not.

Language model traffic is a stream. Every call is an event with a timestamp, a model, a duration, an input and output volume, and a price. The events arrive at irregular intervals, in volumes that vary by three orders of magnitude across the day, with prices that are knowable at the moment of the call. The data structure that matches this shape is not a chart. It is a tape: an append-only sequence of events, printed in the order they arrived, with a synchronised clock running underneath. Trading desks have used tapes for a century. Cost management for LLM traffic has not yet borrowed the idea. This essay is a long argument that it should.

§ I, Shape · Why a dashboard is the wrong primary view

I want to be careful here. The objection is not that dashboards are useless; the objection is that they are not the primary instrument for a workload whose mental model is an event stream. Dashboards are aggregations. They take a stream, bucket it, summarise it, and render the summary at a refresh interval that is rarely faster than once a minute. The aggregation is where the information goes to die.

Consider the question a finance lead asks during a Tuesday standup: is anything weird happening right now? A dashboard answers a different question. It answers: what was the total yesterday, segmented by team, compared to the rolling seven-day average? That is a useful answer if your time horizon is yesterday. It is not the answer to the Tuesday standup question. The Tuesday standup question requires a stream.

Consider also the question that a head of platform asks at 2pm on a Friday: did the deploy we shipped at noon change our per-call cost? A dashboard with a daily roll-up will surface that on Monday. By Monday the answer has either compounded or been forgotten. The question deserves a tool that shows you the answer at 2:04pm, four minutes after the relevant traffic begins to arrive. The tool that does this is a tape.

The third, the one I hear most often from CFOs working with LLM-heavy product teams, is less a question than a complaint, and it is the one that motivates this entire essay: I trust the monthly invoice. I do not trust the dashboard, because I cannot tell where the gap comes from. The reason for the distrust is not the maths in the dashboard. It is that the dashboard is two abstractions removed from the wire. The provider bills you for events; the dashboard shows you aggregates of aggregates; the gap between an event and an aggregate of aggregates is where every measurement error in the system can hide. A tape, by contrast, is one abstraction removed from the wire. That is one removal fewer than a dashboard, and it is the difference between a finance team that trusts the number and one that does not.

§ II, Analogy · What the trading floor analogy actually buys you

The market data terminal on a trading desk has a particular shape, and that shape was not arrived at by accident. At the centre is a tape: a vertical strip on which every trade in a tracked instrument prints, in order, with a timestamp accurate to the millisecond. Around the tape are derived views: a depth ladder, a volume profile, a daily chart, a P&L blotter. The derived views are slower than the tape and the trader knows this. The tape is the source of truth. Everything else is a story told about it.

Borrowing this shape for LLM observability gets you four things at once, and none of them are obvious until you have them.

First, you get a printable record. Every call is a row that can be pointed at. When a finance lead and a platform lead disagree about a number, they can both look at the tape, point at a row, and have a smaller and more productive disagreement. Aggregates do not let you do this; you can only disagree about the aggregate, not about the events that built it.

Second, you get an honest clock. A tape has a single timestamp per row, written at the moment the event landed, never reordered. Most cost dashboards quietly normalise events to the timezone of the user and the granularity of the bucket, which means two people in two timezones see different stories about the same minute. The tape settles this.

Third, you get streaming attention. A tape that is printing in real time draws the eye to anomalies in a way that a chart with a one-minute refresh cannot. The first time a large model fires after a deploy, the operator sees it. The fifth time it fires, the operator sees a pattern. The dashboard would not have shown the pattern for another fifty minutes.

Fourth, and this is the one finance teams care about most, you get reconcilable artefacts. A tape is a sequence of events with a stable schema. The provider's invoice is a sequence of events with a different stable schema. The reconciliation between the two is a comparison of two sequences, which is a problem with a clean solution. Comparing a dashboard to an invoice is a problem with no solution at all, because the dashboard is not a sequence of events; it is a chart of summaries.

§ III, Structure · The tape as a data structure, in unromantic detail

A tape, in the form we use, is a single table with the following schema. The schema is small on purpose; if the schema grows, the tape stops being a tape and starts being something a database administrator needs to maintain.

-- the tape, one row per LLM call
CREATE TABLE tape (
  ts          TIMESTAMPTZ NOT NULL,         -- when the call left the daemon
  project     TEXT NOT NULL,                -- attribution code
  provider    TEXT NOT NULL,                -- anthropic, openai, ...
  model       TEXT NOT NULL,                -- claude-sonnet-4-7, gpt-5, ...
  input_tok   INTEGER NOT NULL,
  output_tok  INTEGER NOT NULL,
  cached_tok  INTEGER NOT NULL DEFAULT 0,
  duration_ms INTEGER NOT NULL,
  cost_pence  INTEGER NOT NULL,             -- in pence, computed locally
  status      SMALLINT NOT NULL,            -- 200, 429, 500, etc.
  PRIMARY KEY (ts, project, provider)
);

-- the only index you need at first
CREATE INDEX tape_by_project_ts ON tape (project, ts DESC);

Three properties of this table matter more than the columns themselves. It is append-only: once a row is written it is not updated, ever. It is ordered: rows print in the order they arrived, and the order is preserved through every downstream consumer. It is self-priced: the cost is computed at insertion time, from a price list the daemon holds locally, rather than at query time from a join against a separate prices table that may have changed.

The append-only property is what makes the tape reconcilable. If a provider invoice arrives on the first of the month and disagrees with the tape, the disagreement is a finite set of pairwise comparisons between two ordered sequences. The diff is computable. Aggregate-against-aggregate comparisons are not computable in that sense; they are interpretable, which is a much weaker guarantee.
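To make "the diff is computable" concrete, here is a minimal sketch of the comparison as a single merge pass over two sorted sequences. It is an illustration, not the reconciliation engine itself. The `Row` type, the shared `key` field (for instance a request id that appears on both sides), and the premise that the invoice can be normalised to one row per call are all assumptions made for the example.

```python
from typing import Iterable, Iterator, NamedTuple, Tuple


class Row(NamedTuple):
    key: str         # hypothetical shared identifier, e.g. a request id
    cost_pence: int  # cost of the call in pence


def diff_sequences(tape: Iterable[Row], invoice: Iterable[Row]) -> Iterator[Tuple[str, str]]:
    """Yield (key, reason) for every disagreement in one O(n) merge pass.

    Both inputs must already be sorted by key.
    """
    tape_it, inv_it = iter(tape), iter(invoice)
    t, i = next(tape_it, None), next(inv_it, None)
    while t is not None or i is not None:
        if i is None or (t is not None and t.key < i.key):
            # The tape has a row the invoice lacks.
            yield t.key, "missing from invoice"
            t = next(tape_it, None)
        elif t is None or i.key < t.key:
            # The invoice has a row the tape lacks.
            yield i.key, "missing from tape"
            i = next(inv_it, None)
        else:
            # Same key on both sides: compare the recorded cost.
            if t.cost_pence != i.cost_pence:
                yield t.key, f"cost differs: tape {t.cost_pence}p, invoice {i.cost_pence}p"
            t, i = next(tape_it, None), next(inv_it, None)
```

Every disagreement comes out as a row that can be pointed at, which is the property an aggregate-against-aggregate comparison lacks.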

The self-priced property is what makes the tape useful in real time. If you wait for the provider's price to be applied at month end, you are running blind for thirty days. If you apply the price at the moment of the call, with a published price list whose version you record, you have a running estimate that is correct to within the variance of the price list itself. The reconciliation at month end becomes a check on the variance rather than a discovery of the cost.
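Pricing at the moment of the call might look something like the sketch below. Everything named here is hypothetical: the model names, the prices, the version label, and the choice to treat cached tokens as separate from input tokens are placeholders rather than any provider's real price list. The sketch keeps four decimal places of a penny; storing the result in an integer `cost_pence` column would need an explicit rounding rule or a smaller unit, and that is a design decision the sketch does not make.

```python
from decimal import Decimal, ROUND_HALF_UP
from typing import Tuple

# Hypothetical price list: pence per million tokens, keyed by model.
# Figures and version label are placeholders, not real provider prices.
PRICE_LIST_VERSION = "example-v1"
PRICE_LIST = {
    "example-large": {"input": Decimal(1500), "output": Decimal(7500), "cached": Decimal(150)},
    "example-small": {"input": Decimal(100), "output": Decimal(500), "cached": Decimal(10)},
}
PER_MILLION = Decimal(1_000_000)


def price_call(model: str, input_tok: int, output_tok: int, cached_tok: int = 0) -> Tuple[Decimal, str]:
    """Return (cost in pence, price list version) for one call, priced at insertion time."""
    prices = PRICE_LIST[model]
    cost = (
        input_tok * prices["input"]
        + output_tok * prices["output"]
        + cached_tok * prices["cached"]
    ) / PER_MILLION
    # Keep sub-penny precision so that small calls do not round to zero.
    return cost.quantize(Decimal("0.0001"), rounding=ROUND_HALF_UP), PRICE_LIST_VERSION
```

Returning the version with the cost means any variance found at month end can be traced to the price list that was in force when the row was written.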

Tape schema · 10 cols · small on purpose
Reconciliation · O(n) · two ordered sequences, diffable
Latency target · < 2s · event-to-tape, p95

§ IV, Readouts · The four things a finance leader should glance at

Reading a tape directly is a specialist skill. Most of the time you want a small number of derived readouts that summarise the tape without losing its temporal character. Four of these are enough, and the fourth is the one that most teams skip.

Cost of output per hour (run rate)

The run rate is the simplest derived view of the tape: pounds per hour, computed over a trailing window short enough to feel live (we use sixty seconds) and a window long enough to be stable (we use fifteen minutes), printed side by side. The two numbers together tell you two different things. The sixty-second figure tells you what is happening right now. The fifteen-minute figure tells you what has been happening lately. When they diverge, something interesting is starting.
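As a rough sketch of the arithmetic, the function below computes a run rate over one trailing window; calling it twice, with 60 and 900 seconds, gives the side-by-side pair. The function name and the `(ts, cost_pence)` row shape are assumptions for the example, and the rows are taken to be already loaded from the tape.

```python
from datetime import datetime, timedelta, timezone
from typing import Sequence, Tuple


def run_rate_gbp_per_hour(rows: Sequence[Tuple[datetime, int]], now: datetime, window_s: int) -> float:
    """Pounds per hour over the trailing window of `window_s` seconds ending at `now`.

    `rows` is a sequence of (ts, cost_pence) pairs taken from the tape.
    """
    start = now - timedelta(seconds=window_s)
    pence_in_window = sum(cost for ts, cost in rows if start < ts <= now)
    # Scale the window total up to an hour and convert pence to pounds.
    return pence_in_window * 3600 / (100 * window_s)


if __name__ == "__main__":
    now = datetime(2026, 5, 17, 12, 0, tzinfo=timezone.utc)
    rows = [(now - timedelta(seconds=10), 50), (now - timedelta(seconds=300), 200)]
    print(run_rate_gbp_per_hour(rows, now, 60), run_rate_gbp_per_hour(rows, now, 900))
```

A wide gap between the two figures is the divergence worth looking into.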

Variance band (reconciled versus estimated)

The variance band is the running difference between the tape's own cost estimate and the provider's reconciled cost, expressed as a percentage. The band should be small and stable. When it widens, it is a leading indicator of one of three things: the price list is stale, the provider has changed something they have not announced, or your daemon has missed a class of calls. We treat any sustained move above 0.5 percent as an incident, even before the bill arrives.[1]
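A minimal sketch of the band and the alert follows. The 0.5 percent threshold comes from the text; the definition of "sustained" as three consecutive samples above it is an assumption made for the example, as are the function names.

```python
from typing import Sequence


def variance_pct(estimated_pence: int, reconciled_pence: int) -> float:
    """Absolute gap between the tape estimate and the reconciled cost, as a percentage of reconciled."""
    if reconciled_pence == 0:
        raise ValueError("no reconciled cost to compare against")
    return abs(estimated_pence - reconciled_pence) / reconciled_pence * 100


def is_incident(samples: Sequence[float], threshold_pct: float = 0.5, sustained: int = 3) -> bool:
    """True when the most recent `sustained` samples all sit above the threshold.

    `sustained` is an illustrative choice, not a value taken from the essay.
    """
    recent = list(samples)[-sustained:]
    return len(recent) == sustained and all(v > threshold_pct for v in recent)
```

Requiring several consecutive samples keeps a single noisy reading from paging anyone.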

Top-n contributors, refreshed continuously

At any moment, three projects are responsible for most of the spend. A top-n widget that refreshes every five seconds and shows the three biggest contributors over the last hour, with their per-hour rates and their share of the running total, is the cheapest possible early warning system. When one of those rows changes, an operator notices. When the rows reorder, the operator asks why. The widget is not pretty. It is, however, the single most effective piece of finance infrastructure we have built.
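The widget's underlying query is small enough to sketch. The version below assumes the last hour of tape rows has already been fetched as `(project, cost_pence)` pairs; the function name and return shape are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_contributors(rows: Iterable[Tuple[str, int]], n: int = 3) -> List[Tuple[str, float, float]]:
    """Return (project, pounds_per_hour, share_of_total) for the n biggest spenders.

    `rows` holds (project, cost_pence) pairs covering the last hour of the tape,
    so each project's total in pounds is also its per-hour rate.
    """
    totals: Counter = Counter()
    for project, cost_pence in rows:
        totals[project] += cost_pence
    grand_total = sum(totals.values())
    return [
        (project, pence / 100, pence / grand_total if grand_total else 0.0)
        for project, pence in totals.most_common(n)
    ]
```

Re-running this every five seconds and printing the result is the whole widget; a change in the order of the returned list is the signal an operator reacts to.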

Ahead-of-budget gauge

The last readout is the one teams routinely skip and most regret skipping. The ahead-of-budget gauge takes the running rate, projects it forward to month end, and compares to the budget. The gauge prints in one of three states: under, on, or ahead. The granularity is intentional; finer granularity would imply more precision than the projection has. The point of the gauge is not to predict the bill; it is to start the conversation about the bill a fortnight before the bill arrives.
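The projection behind the gauge can be sketched in a few lines. The linear projection and the three states come from the description above; the 5 percent tolerance that defines "on" is an assumption for the example, as are the function and parameter names.

```python
import calendar
from datetime import datetime, timedelta, timezone


def budget_state(spent_gbp: float, run_rate_gbp_per_hour: float, budget_gbp: float,
                 now: datetime, tolerance: float = 0.05) -> str:
    """Project spend to month end at the current run rate; return 'under', 'on', or 'ahead'."""
    days_in_month = calendar.monthrange(now.year, now.month)[1]
    # Start of the next month, in the same timezone as `now`.
    month_end = now.replace(day=1, hour=0, minute=0, second=0, microsecond=0) + timedelta(days=days_in_month)
    hours_left = (month_end - now).total_seconds() / 3600
    projected = spent_gbp + run_rate_gbp_per_hour * hours_left
    if projected > budget_gbp * (1 + tolerance):
        return "ahead"
    if projected < budget_gbp * (1 - tolerance):
        return "under"
    return "on"
```

Returning a label rather than the projected figure is deliberate: it mirrors the point that finer granularity would imply more precision than a straight-line projection has.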

Run rate · 60s / 15m · two windows, side by side
Variance alert · > 0.5% · treated as an incident
Budget states · 3 · under, on, ahead

§ V, Plumbing · How to get a tape out of systems you already have

Halton Meter emits a tape natively; the daemon writes one row per call into the local SQLite buffer and ships it to the backend within a few seconds. If you do not use Halton Meter, you can still build a tape. The minimum viable plumbing is three lines of OpenTelemetry instrumentation around your provider SDK, an OTLP collector that strips everything except the span attributes you care about, and a single SQL table that receives them.

The instrumentation looks something like this, in Python:

import os  # needed for os.environ below

from opentelemetry import trace
import anthropic

tracer = trace.get_tracer("llm.tape")
client = anthropic.Anthropic()

def call_with_tape(messages, model="claude-sonnet-4-7"):
    # one span per LLM call; the collector turns each span into a tape row
    with tracer.start_as_current_span("llm.call") as span:
        span.set_attribute("llm.model", model)
        span.set_attribute("llm.project", os.environ["PROJECT_CODE"])
        resp = client.messages.create(model=model, messages=messages, max_tokens=1024)
        # the three lines that matter
        span.set_attribute("llm.input_tokens", resp.usage.input_tokens)
        span.set_attribute("llm.output_tokens", resp.usage.output_tokens)
        span.set_attribute("llm.cached_tokens", resp.usage.cache_read_input_tokens or 0)
        return resp

The collector configuration that drains those spans into the tape table is a half page of YAML; the SQL table is the one I showed above. None of this is exotic. The exotic part, the part you cannot buy from a vendor, is the discipline of doing it consistently across every call site in the codebase, so that the tape is complete. A tape that misses 5 percent of calls is worse than a dashboard that misses none, because the tape's authority depends on its completeness.[2]

This is also where the local-proxy approach has its advantage. A daemon on the wire sees every call by construction, regardless of which call site issued it. SDK instrumentation, however cleanly written, can be bypassed by a stray requests.post(...) and a developer in a hurry. We have caught a lot of calls dropped this way under SDK instrumentation; under the proxy approach we have not.

§ VI, Anti-pattern · The monthly export to a spreadsheet

I have to say something about the most common anti-pattern, because nothing in the essay above works if a finance team continues to operate this way, and I have not yet said why. The anti-pattern is the monthly CSV export: on the first of the month, a member of the finance team downloads a CSV from the provider portal, pivots it in a spreadsheet, and compares it to the budget.

This is not a failure of effort. The finance team is doing the most that the available tools allow. The problem is that the available tools force the work into a monthly cadence, which means an overrun that starts early in the month is discovered up to four weeks too late. The CSV is a tape of sorts, but it is a tape printed once a month, which is the opposite of a tape's purpose. A tape is interesting because it prints in real time. A tape that prints monthly is just a bill.

The migration path away from this is not a procurement decision; it is a habit change. The first habit is to look at the run rate before the standup, not after the month. The second is to ask, when the run rate looks wrong, who shipped what in the last hour. The third, and the one that takes the longest, is to stop apologising for the question. Asking what an LLM call cost two minutes ago is the same kind of question as asking what a server is doing right now. We do not consider it impolite to ask the latter. The former should follow.

§ VII, Northwind · A hypothetical week, walked through

To make the readouts concrete, here is a week at a fictional but representative company we will call Northwind Labs. Northwind is a series-B SaaS business with a Claude-heavy product (an underwriting assistant) and a CFO, Asha, who has been pushed by the board to get LLM cost under control. The cost is not catastrophic; it is opaque, and the board does not distinguish between the two.

Asha's Monday morning ritual is to open the tape view for ten minutes before the standup. She glances at the run rate (£128 per hour, up from £96 last Monday), the variance band (0.21 percent, stable), and the top-n contributors. The top three projects are the underwriting agent (£72/hr), the customer support copilot (£28/hr), and a new internal tool for the sales team (£14/hr). She did not know the sales tool existed. She makes a note.

At standup, Asha says the run rate is up 33 percent week-over-week and the new contributor is a sales tool she would like to be briefed on. The head of platform pulls up the tape for the sales tool in real time during the meeting. The tape shows the tool firing roughly every 12 seconds, on a route that issues two Opus calls per request. The team agrees, in the meeting, that the route should be moved to Sonnet by Wednesday. The change is shipped Tuesday afternoon. Asha's Wednesday morning glance shows the sales tool now at £4/hr, £10 lower than on Monday; the overall run rate is down to £118. The episode took three calendar days and one ten-minute meeting; the equivalent dashboard-based episode would have surfaced on the first of next month with no remaining context about the deploy that caused it.

On Friday, the variance band ticks up to 0.43 percent. Asha's tape view highlights it; she pings the platform lead. The platform lead checks the price list version, finds it has not been refreshed in two weeks, refreshes it, and the variance settles back to 0.19 percent. No one needed to wait for the bill.

None of this is heroic. None of it requires a new vendor or a new team. It requires the tape to be present, and the habit of glancing at it. The CFO who acquires the habit becomes, structurally, less expensive than the CFO who does not.

§ VIII, Limits · What the tape does not solve, and what comes next

It is important, at the end of a long essay arguing for a tool, to be specific about what the tool does not do. The tape solves measurement. It does not solve two adjacent problems that finance teams often hope a measurement tool will solve.

The first is unit economics across LLM-augmented features. A tape tells you what a call cost; it does not tell you what a call contributed in revenue. Pairing the tape with a revenue attribution model is a separate engineering project, and a hard one. The naive version (divide LLM cost per feature by feature usage and call the result a margin) is worse than no answer at all, because it produces a number that looks like a margin and is not one. We are working on a methodology piece for a later issue on what the honest version of this looks like; it involves a lot more product analytics and a lot more humility than a cost tool can supply on its own.

The second is model selection trade-offs. The tape will tell you, accurately, that Opus costs eight times what Sonnet costs on the same workload. It will not tell you whether Opus is worth the difference for that workload. That question requires a quality evaluation, a human-in-the-loop process, and probably a sustained programme of A/B testing with real users. There is a tape-shaped pattern for this, too (every eval run is an event, every score is a value, every comparison is reconcilable), but it lives in a different system from the cost tape, and conflating them produces nonsense.

What the tape does do is establish the substrate on which both of those later projects become possible. You cannot reason about unit economics without honest cost data. You cannot reason about model selection trade-offs without honest cost data on both candidates. The tape's job is to be the honest cost data. Once it is in place and trusted, the harder questions become legible. Until it is in place, the harder questions are guesses.

Finance leaders who have lived through the equivalent transition in cloud cost management will recognise the shape. The cloud cost world spent five years arguing about dashboards, three years arguing about tagging, and finally settled into a regime where the events are the source of truth and the dashboards are the storytelling layer. LLM cost management is starting from the same place. It does not have to take eight years this time. Print the tape, glance at it before the standup, and most of the rest will follow.

EOF · No. 010 · Halton Meter Journal
Notes
  1. The 0.5 percent threshold is a working rule rather than a measured one. On the dogfooded workload, every sustained move above it so far has tracked back to either a stale price list or an unannounced provider-side change. Below 0.5 percent the noise dominates.
  2. The completeness property is also why we recommend, on top of OTel instrumentation, an audit cron that periodically samples upstream traffic with tcpdump and checks that every captured call has a corresponding tape row. The audit is unglamorous and catches the kind of bug that nothing else does.
  3. Northwind Labs is fictional, and the walkthrough is illustrative. The patterns it shows (an unbriefed internal tool, a price-list staleness, a single ten-minute meeting that closes the loop) are the kind we have seen on the dogfooded workload. The CFO character is a composite.
  4. The methodology of pricing a call locally at the moment it is issued, versus reconciling at month end, is in the reconciliation engine dispatch elsewhere in this issue.