Streaming cancels, and the quiet billing change two providers made in February.
Two of the four major providers changed billing semantics for cancelled streams overnight, with no announcement. Here is how we caught it, and what to do about it.
Written by one human and one model. Halton Labs is operated by Vikrant Shukla, with Claude Opus 4.7 as the second engineer. Bylines name the role; the full colophon at the foot of the page explains the arrangement.
The day the bills changed, with no announcement.
Variance is provider-reported tokens minus daemon-logged tokens, as a percentage of billed input on the Halton Labs workload, where cancelled streams account for roughly 8 per cent of requests. The vertical marker is the morning of 4 February. The second provider followed two days later.
On the morning of 4 February at 09:14 UTC, the overnight reconciliation job on our own workload flagged a variance figure four times its rolling thirty-day baseline. The signal was not subtle once spotted: it was concentrated in streaming requests with explicit cancels, the pattern that agentic frameworks tend to produce, and the direction was consistent. The provider was billing more than the daemon thought it should.
Two days later the daemon caught the same change at a second provider on the same workload. Same direction, same shape of variance, no announcement on either side. By the end of the week we had ruled out the obvious causes (no daemon release in the window, no pricing change on file, no change in the request volumes) and were sitting with what the data was telling us, which was that the billing semantics of a cancelled stream had changed.
This is the field report. It is short because the story is short. The interesting part is the forensics and the fix, not the headline.
§ I, How we noticed
The variance line going somewhere it had not been before
Cancelled streams have a particular shape on the wire. The caller opens a streaming request, the provider starts emitting tokens, the caller decides (because the agent loop wants to interrupt, or the user clicked stop, or a higher-priority task arrived) that it does not want the rest of the response, and the connection is closed. The provider sees the close, stops the generation, and the request ends. Until 4 February, the contract at both Provider A and Provider B was that a cancelled stream produced no billable output tokens. The reasoning, we always assumed, was that the response was incomplete and therefore not delivered, and that billing for an undelivered response was the kind of thing that ends up in a small-claims thread on Hacker News.
That reasoning was always a bit thin. The provider did real work, the GPU cycles were spent, and the tokens that were emitted prior to the cancel were observably received by the client. The new contract, which is the contract now, is the more economically sensible one: tokens emitted prior to the cancel are billed; tokens that were going to be emitted but were not, are not. On net, this is fair. We agree it is fair. The problem is not the shape of the new contract. The problem is that the contract changed overnight, with no notice, in a way that broke an assumption every metering and pipeline-management tool in production was holding.
The daemon's variance line on streaming-heavy traffic went from a steady 0.05 per cent to a steady 0.31 per cent over the three days following the change: roughly six times the prior level. On our own workload the additional spend in the first week was modest in absolute pounds. The interesting thing was the direction and the timing: the bill genuinely was higher; the local count had not caught up; the gap closed when the daemon was patched.
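The variance figure itself is simple arithmetic. A minimal sketch follows, using the definition from the chart caption (provider-reported tokens minus daemon-logged tokens, as a percentage of billed input). The function names, the token totals, and the `multiple` threshold parameter are illustrative assumptions, not the daemon's actual code.

```python
def variance_pct(provider_tokens: int, daemon_tokens: int, billed_input_tokens: int) -> float:
    """Provider-reported minus daemon-logged tokens, as a percentage of billed input."""
    if billed_input_tokens <= 0:
        raise ValueError("billed_input_tokens must be positive")
    return 100.0 * (provider_tokens - daemon_tokens) / billed_input_tokens


def exceeds_baseline(today_pct: float, baseline_pct: float, multiple: float = 4.0) -> bool:
    """Flag a day whose variance is at least `multiple` times the rolling baseline."""
    return today_pct >= multiple * baseline_pct


# Hypothetical totals chosen to reproduce the 0.05 and 0.31 per cent figures.
before = variance_pct(1_000_500, 1_000_000, 1_000_000)
after = variance_pct(1_003_100, 1_000_000, 1_000_000)
flagged = exceeds_baseline(after, before)
```

With these made-up totals the second figure clears a four-times threshold comfortably, which is the kind of jump a rolling-baseline check is meant to catch.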
§ II, The forensics
Correlating request IDs with daemon timestamps
The first thing we did, once we had the variance list, was pull the request IDs for the affected workspaces on the day before and the day of the change. The daemon logs every request ID it sees on the wire, along with the byte counts, the timing of every chunk of the streamed response, and a flag for whether the connection closed cleanly or was cancelled by the caller. The provider returns the same request ID on the invoice for any line item it cares to itemise. The reconciliation engine joins the two and produces a per-request delta.
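The join described above can be sketched in a few lines. This is an illustration under assumptions: `DaemonRecord`, its fields, and the shape of `invoice` (a mapping from request ID to billed output tokens) are hypothetical stand-ins for whatever the daemon and the provider invoice actually expose.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DaemonRecord:
    request_id: str
    output_tokens: int      # tokens the daemon logged for this request
    caller_cancelled: bool  # True if the caller closed the stream early


def per_request_delta(daemon_log: list[DaemonRecord], invoice: dict[str, int]) -> dict[str, int]:
    """Join on request ID; delta = provider-billed tokens minus daemon-logged tokens.

    Requests the provider did not itemise are skipped rather than guessed at.
    """
    deltas: dict[str, int] = {}
    for rec in daemon_log:
        billed = invoice.get(rec.request_id)
        if billed is not None:
            deltas[rec.request_id] = billed - rec.output_tokens
    return deltas


# Made-up data: req-2 is a cancelled stream the old daemon logged as zero.
log = [
    DaemonRecord("req-1", 120, False),
    DaemonRecord("req-2", 0, True),
    DaemonRecord("req-3", 40, False),
]
invoice = {"req-1": 120, "req-2": 85}
deltas = per_request_delta(log, invoice)
```

Skipping un-itemised requests is a deliberate choice in this sketch: a delta computed against a guessed bill would hide exactly the kind of discrepancy the join exists to find.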
The forensic question was: do the requests that produced the new variance share a feature that the requests that did not produce variance lack? The answer was clean. Every request in the new-variance bucket was a streaming request that the caller had cancelled before the model's natural stop sequence. Every request in the no-variance bucket was a non-streaming request, or a streaming request that ran to natural completion, or a streaming request that errored on the provider side. The cancellation by the caller, specifically, was the discriminator.
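Finding the discriminator amounts to grouping the per-request deltas by request features and seeing which buckets carry the variance. A minimal sketch, with made-up rows and hypothetical bucket labels:

```python
from collections import defaultdict

# Each row: (is_streaming, close_reason, delta_tokens). Values are illustrative.
rows = [
    (False, "natural", 0),
    (True, "natural", 0),
    (True, "provider_error", 0),
    (True, "caller_cancel", 85),
    (True, "caller_cancel", 212),
]


def delta_by_bucket(rows):
    """Sum the per-request delta for each (is_streaming, close_reason) bucket."""
    totals = defaultdict(int)
    for is_streaming, close_reason, delta in rows:
        totals[(is_streaming, close_reason)] += delta
    return dict(totals)


buckets = delta_by_bucket(rows)
# Buckets whose total delta is non-zero are the ones carrying the new variance.
variant = [bucket for bucket, total in buckets.items() if total != 0]
```

In this toy data only the streaming, caller-cancelled bucket has a non-zero total, mirroring the clean split in the real logs.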
We then went back and looked at the response headers. Provider A had quietly started returning a new field on cancelled streams, x-emitted-tokens-on-cancel, with a number in it. The number, in every case we checked, matched the number that subsequently appeared on the bill. Provider B had not added a new header, but had started populating the completion_tokens field in the final usage chunk with the emitted-prior-to-cancel count rather than zero. In both cases the data needed to do the right thing had been on the wire from the morning the change took effect. We just had not been looking for it, because there had been no reason to.
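Reading the count off the wire might look something like the sketch below. The header name and the completion_tokens field come from the observations above; the function name, the dictionary shapes, and the assumption that the header value is a plain decimal integer are ours.

```python
from typing import Mapping, Optional


def emitted_on_cancel(
    headers: Mapping[str, str],
    final_usage: Optional[Mapping[str, int]],
) -> Optional[int]:
    """Return the provider-reported emitted-before-cancel count, or None if absent."""
    # HTTP header names are case-insensitive, so normalise before looking up.
    lowered = {name.lower(): value for name, value in headers.items()}
    raw = lowered.get("x-emitted-tokens-on-cancel")
    if raw is not None:
        return int(raw)  # Provider A style: a dedicated response header
    if final_usage is not None and "completion_tokens" in final_usage:
        return final_usage["completion_tokens"]  # Provider B style: usage chunk
    return None  # nothing on the wire to reconcile against
```

For example, `emitted_on_cancel({"X-Emitted-Tokens-On-Cancel": "85"}, None)` takes the header path, while `emitted_on_cancel({}, {"completion_tokens": 212})` falls through to the usage chunk.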
§ III, The fix
The cancel-path prerelease build
The daemon fix was small and landed the following Monday in an internal prerelease build (we call it the cancel-path build; all daemon version numbers in this dispatch are pre-v0.1.0 prerelease tags). The change is two-part: the read loop now records the running emitted-token count on every chunk of a streamed response, not just on the final usage chunk; and when the connection closes for a reason other than natural completion, the running count is persisted as the request's billable output. The relevant piece of the new cancel-handler reads, in slightly trimmed form:
```python
# daemon/proxy/stream.py, cancel-path prerelease
async def handle_stream_close(req: StreamRequest, reason: CloseReason) -> None:
    if reason is CloseReason.NATURAL:
        # normal path: usage chunk has authoritative counts
        return
    if reason is CloseReason.CALLER_CANCEL:
        emitted = req.running_output_tokens  # running count, updated per chunk
        await ledger.record_partial(
            request_id=req.id,
            output_tokens=emitted,
            cancel_observed_at=req.cancel_ts,
            provider_hint=req.provider,
            note="caller-cancel, billed-as-emitted",
        )
        return
    # provider-side error: no charge expected, but log for reconciliation
    await ledger.record_zero(req.id, note="provider-error")
```

The new behaviour costs a few extra rows in the local SQLite buffer per cancelled stream, which is rounding-error storage at any scale we care about. Variance returned to baseline on 16 February, two days after the daemon update. The bills did not change; the books closed against them.
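The other half of the change, the per-chunk running count in the read loop, can be illustrated with a small synchronous sketch. It is not the daemon's code: `StreamState`, `read_stream`, and the `cancel_after` parameter are hypothetical, and it assumes each chunk's token count is already known, which a real proxy would have to derive from the chunk payload.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class StreamState:
    running_output_tokens: int = 0  # updated on every chunk, not only at the end


def read_stream(
    chunk_token_counts: Iterable[int],
    state: StreamState,
    cancel_after: Optional[int] = None,
) -> str:
    """Consume per-chunk token counts and return how the stream ended.

    `cancel_after` simulates the caller closing the connection once that many
    chunks have been received.
    """
    for index, chunk_tokens in enumerate(chunk_token_counts):
        if cancel_after is not None and index >= cancel_after:
            return "caller_cancel"  # state already holds the billable count
        state.running_output_tokens += chunk_tokens
    return "natural"


state = StreamState()
reason = read_stream([10, 12, 9, 20], state, cancel_after=2)
```

Because the count is accumulated as chunks arrive, the cancel path has a usable number at the moment the connection closes and does not depend on a final usage chunk that may never come.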
A second-order fix landed in a follow-up prerelease a fortnight later: the reconciliation engine now treats a missing emitted-on-cancel count, whether it would arrive as a header or in the final usage chunk, as a signal worth flagging, on the grounds that any future provider that does not surface the partial count is a provider whose cancelled-stream billing we cannot reconcile to the request level. Two providers in our sample (DeepSeek and Mistral) still do not surface it. We have written to both.
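The flagging rule can be sketched as a three-way outcome for each cancelled stream. The function name, the status labels, and the returned dictionary shape are assumptions made for illustration.

```python
from typing import Optional


def reconcile_cancelled(provider: str, daemon_count: int, provider_count: Optional[int]) -> dict:
    """Classify one cancelled stream as ok, variance, or unreconcilable."""
    if provider_count is None:
        # The provider surfaced no partial count: flag it instead of assuming zero.
        return {"provider": provider, "status": "unreconcilable", "delta": None}
    delta = provider_count - daemon_count
    status = "ok" if delta == 0 else "variance"
    return {"provider": provider, "status": status, "delta": delta}
```

Keeping "unreconcilable" separate from "variance" matters: the first says the data is missing, the second says the data disagrees, and they call for different follow-up.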
The contract changed overnight, in a way that broke an assumption every metering tool in production was holding. The new contract is fairer. The way it landed was not.
The Daemon Engineer, § I
§ IV, Agentic implications
What this changes for retry and abort patterns
The bigger implication of the change is not for billing. It is for how agentic systems should be designed. The old contract effectively made cancellation free: an agent that started a generation, decided halfway through that it was on the wrong track, and aborted, paid nothing for the abort. The new contract bills the abort in proportion to the tokens emitted before the cancel. This shifts the design calculus in three ways.
- Early-abort is now cheaper than late-abort, linearly. Under the old contract, a 50-token abort and a 500-token abort were both free. Under the new contract, the 500-token abort costs ten times more. Agents that emit a long preamble before deciding to abort are now more expensive than agents that decide quickly. This is, again, on net economically sensible, but it is a change.
- Speculative routing has a new cost. Patterns where an orchestrator dispatches the same prompt to two models and cancels the slower one (a pattern that was common in the latency-sensitive end of the market) used to be free on the cancelled branch. They are no longer free. The cost of speculative routing is now the expected emission of the cancelled branch, times its per-token price.
- Tool-call interrupts cost what they emit. An agent that streams a tool-call envelope, realises mid-emission that the tool call is malformed, and cancels, is billed for the partial envelope. In our data the median emission-before-cancel on tool-call paths is around 80 output tokens. The 90th percentile is 340.
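The arithmetic behind all three points is the same: emitted tokens times the per-token output price. A minimal sketch, in which `PRICE` is a hypothetical price per million output tokens (the real figure depends on provider and model) and the function names are ours:

```python
def abort_cost(emitted_tokens: float, price_per_million: float) -> float:
    """Cost of a cancelled stream under the new contract: emitted tokens times price."""
    return emitted_tokens * price_per_million / 1_000_000


def speculative_overhead(expected_emission: float, price_per_million: float, races: int) -> float:
    """Expected cost of the cancelled branches across `races` two-model dispatches."""
    return races * abort_cost(expected_emission, price_per_million)


PRICE = 10.0  # hypothetical price per million output tokens, in any currency

early = abort_cost(50, PRICE)
late = abort_cost(500, PRICE)
ratio = late / early  # the late abort costs about ten times the early one

# Tool-call interrupts at the median (80) and 90th percentile (340) quoted above.
median_cost = abort_cost(80, PRICE)
p90_cost = abort_cost(340, PRICE)
```

The per-abort numbers are tiny at any plausible price; the line on the bill comes from multiplying them by the cancel cadence, which is what `speculative_overhead` does for the two-model race.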
None of this is catastrophic; the absolute costs are small for most workloads. But the cost is now non-zero, and the cost is now observable. For agentic systems that run a heavy cadence of cancels (the long-running orchestration patterns are the obvious case), the change adds a recurring line to the monthly bill that the operator has to be able to attribute.
§ V, What the providers should have said
A short note on announcements
We do not think Provider A or Provider B changed billing semantics in February out of malice. The most plausible reading is that the change was made by the platform team for capacity-management reasons, that the billing implication was either not modelled or was modelled and judged minor, and that the comms team was not in the loop because the team that made the change did not think comms were needed. Each of those readings is forgivable on its own; the combination is the problem.
What should the providers have said? Something like: "From the morning of 4 February, cancelled streams will be billed for tokens emitted prior to the cancel. The cancel itself is unchanged; the prefill, the model, the headers, and the request ID are all unchanged. The usage chunk for cancelled streams will now contain the emitted-prior-to-cancel count rather than zero. We expect this to increase per-month spend for workspaces that cancel more than 8 per cent of streamed requests by between 0.1 and 0.4 per cent of their input bill."
Four sentences. A precise effective time, a precise mechanism, and a precise spend impact. That is the standard the industry has to settle on, because the industry is now metered. We are not the only meter on the wire; we are just the meter that wrote this post. The next time a provider changes a billing parameter without saying so, it will be in someone's chart by the end of the week. It might as well be in their changelog the morning of.
Until then, the daemon now records what it needs to record, the reconciliation engine handles cancelled streams correctly, and the workload that triggered this dispatch is reconciled to within 0.04 per cent of the bill. The Q1 almanac in this issue notes the change in passing.
Notes
- Provider A change first observed at 09:14 UTC on 4 February 2026 in the daemon's reconciliation logs. Provider B change first observed at 11:48 UTC on 6 February 2026.
- The cancel-path prerelease landed on 9 February 2026; the follow-up reconciliation prerelease on 23 February 2026. Both are internal prerelease tags on the road to v0.1.0; the public repository at github.com/haltonlabs/halton-meter will open under Apache 2.0 at that release.
- Neither provider has, as of filing, published a changelog entry for the cancel-billing change. If either does, this footnote will be updated in the next issue.
- On the DeepSeek and Mistral cancel-path claim in § III: both providers document stream_options.include_usage for the natural-completion case (DeepSeek: api-docs.deepseek.com/api/create-chat-completion; Mistral: docs.mistral.ai/api and the usage notes at docs.mistral.ai/capabilities/completion/usage). Neither documents emission of a usage object on a client-cancelled stream, and we have not observed one on the wire. Docs fetched 25 May 2026; if either provider has since added cancel-path usage emission, we will correct this in the next issue.
- Disclosure: Halton Labs is building a metering product. The daemon is a local binary, free forever via pipx or uvx. This report is editorial; no provider has reviewed it before publication.