Q1 pricing, every provider, every model, on one chart.
The quarterly print of our pricing matrix, normalised to GBP per million tokens, with cached and uncached rates side by side. Two pricing moves we think the providers regretted within a week.
Written by one human and one model. Halton Labs is operated by Vikrant Shukla, with Claude Opus 4.7 as the second engineer. Bylines name the role; the full colophon at the foot of the page explains the arrangement.
Every provider, every model, on one chart.
Read down a column for one model: small open ring is cached input, faded dot is uncached input, filled dot is output. Read across a colour for one provider. The Opus output rate is clipped at £60 to keep the grid legible; the actual figure is in the table below.
Every quarter we print a pricing almanac. The job is unglamorous: take what each provider charges per million tokens, normalise it to GBP at the closing spot rate on the last trading day of the quarter, sort it into a table you can actually read, and write down what changed. It is the kind of document a finance team will photocopy and pin above a desk. It is also the document we wish someone else had been keeping consistently, and was not.
This issue is the Q1 2026 print. The cut-off for prices is 31 March. All figures are in pounds sterling per million tokens, converted from the provider's published USD list at GBP 1.00 = USD 1.272. We are showing list prices, not enterprise discounts; the gap between the two is, in our experience, between 8 and 22 per cent for workloads above £25k a month, and is a separate piece of work.
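The conversion described above is simple enough to sketch. This is a minimal illustration, not the almanac's actual pipeline: the function name `usd_to_gbp`, the half-up rounding mode, and the USD input used in the example are all assumptions; only the rate of GBP 1.00 = USD 1.272 comes from the text.

```python
from decimal import Decimal, ROUND_HALF_UP

# Quarter-end rate quoted in this issue: GBP 1.00 = USD 1.272.
USD_PER_GBP = Decimal("1.272")


def usd_to_gbp(usd_per_million: str, places: str = "0.01") -> Decimal:
    """Convert a USD list price per million tokens to GBP.

    Rounding half-up to two places is an assumption; the almanac
    only says figures are rounded to two decimal places.
    """
    gbp = Decimal(usd_per_million) / USD_PER_GBP
    return gbp.quantize(Decimal(places), rounding=ROUND_HALF_UP)


# Hypothetical input: a USD list price of 3.00 per million tokens.
print(usd_to_gbp("3.00"))
```

Using `Decimal` rather than floats keeps the rounding step deterministic, which matters when the same figure is printed in a table and compared against an invoice.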
The point of the almanac is not the absolute numbers, which move. It is the shape. Pricing has now settled into a clear three-tier structure across providers (frontier, mid, small), the cached input rate has stabilised at roughly a tenth of the uncached rate at three of the seven providers we track, and the long-rumoured collapse in mid-tier pricing finally arrived in March, courtesy of DeepSeek and, two weeks later, OpenAI. The sections below document each.
§ I, The print
What is on the chart, and how to read it
The hero chart shows twelve models from seven providers, plotted by price per million tokens on the vertical. Each model occupies a single column. The filled dot is output price, the faded dot is uncached input price, and the small open ring is the cached input price where the provider supports caching. Read down a column to see the three rates for a model. Read across one colour to see a provider's ladder.
We have clipped the Opus output dot at the top of the grid. The actual rate is £60.20 per million output tokens, which is roughly four times the next-most-expensive output rate on the chart (GPT-5 at £14.62). If we let it set the scale, the rest of the prices would compress into a thin band at the bottom, which is the chart-making trap pricing posts usually fall into. The full table is below.
The Q1 2026 table, in GBP per million tokens
- Anthropic, Claude Opus 4.5: input £11.78, cached input £1.18, output £60.20. The Opus tier remains the most expensive sustained-availability frontier model in our sample. Caching is honoured at 10 per cent of uncached input, in line with prior quarters.
- Anthropic, Claude Sonnet 4.5: input £2.36, cached input £0.24, output £11.78. Sonnet pricing has not moved in a year, which is becoming a story in itself.
- Anthropic, Claude Haiku 4.5: input £0.63, cached input £0.063, output £3.14. The 4.5 Haiku release in February brought Haiku output down by 18 per cent, the largest drop in the Haiku line since the original 3.5.
- OpenAI, GPT-5 (standard): input £1.85, cached input £0.18, output £14.62. The cached input rate held; the cache TTL did not, which we get to in section III.
- OpenAI, GPT-5 Mini: input £0.20, cached input £0.02, output £1.60. Mini dropped a notch in late March, plausibly in response to DeepSeek.
- OpenAI, GPT-5 Nano: input £0.04, cached input £0.004, output £0.32. Nano is the cheapest hosted frontier-family model on our list. Whether it deserves the word "frontier" in front of it is the kind of question we leave to the reviews desk.
- Google, Gemini 2.5 Pro: input £1.05, cached input £0.21, output £8.40. Google prices cached input at 20 per cent of uncached, not 10, which keeps Gemini's effective rate higher than its sticker suggests on heavy-cache workloads.
- Google, Gemini Flash: input £0.24, cached input £0.05, output £1.85. Flash is the price-performance leader for routine routing work in our internal benchmarks this quarter.
- Groq, Llama-3.3 70B: input £0.48, cached input not offered, output £0.64. Groq's pitch is throughput, not caching, and the price reflects that.
- DeepSeek V4: input £0.21, cached input £0.02, output £0.84. The V4 launch in mid-March took roughly 30 per cent off the V3 list across the board. The full story is in section II.
- Mistral Large 2: input £1.57, cached input £0.31, output £4.71. Mistral has not moved since Q4; we are noting that.
- xAI, Grok-4: input £2.36, cached input £0.59, output £11.78. Grok-4 priced itself in March at exactly the Sonnet line on input and output, but with a less generous cache rate: cached input at 25 per cent of uncached, against Sonnet's 10. Whether that is positioning or a coincidence, we cannot say.
Source: provider list pages, fetched 31 March 2026. Figures are published list prices on that date, converted from USD at GBP 1.00 = USD 1.272. Enterprise discounts are not reflected. The table is a snapshot, not a contract; check the provider page before quoting any figure.
§ II, Movements
The two re-prices that mattered, and the rest that did not
Q1 had two material price moves and one near-miss. The material moves were both downward and both at the mid tier; the near-miss was an upward move at the frontier that did not survive contact with customers.
DeepSeek V4, the 30 per cent cut
During the window the almanac covers, DeepSeek launched V4 and reset the list. Input went from £0.30 to £0.21, output from £1.20 to £0.84, and cached input from £0.03 to £0.02. The percentage drop is roughly 30 across the board. The pricing move did its job. The latency story is the same as it was: cheap inference is not always faster inference, and the routing decision is workload-specific. We are not yet at a metered-population size that lets us print routing-shift figures with confidence; when we are, those numbers will appear in a future almanac.
GPT-5 Mini, the unannounced trim
Later in the same window, OpenAI quietly reduced GPT-5 Mini from £0.24 / £1.92 (input/output) to £0.20 / £1.60. There was no blog post. The daemon's pricing-poll job flagged a delta against the cached price card during the reconciliation run that morning; the change held, and the new list is what the table prints.
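The delta check described above can be sketched as a diff between the stored price card and a fresh poll. This is an illustration under assumptions, not the daemon's code: the function name `price_deltas`, the dictionary layout, and the model key `gpt-5-mini` are hypothetical; the four prices are the GPT-5 Mini figures quoted in the text.

```python
def price_deltas(cached_card, polled_card):
    """Return {(model, rate): (old, new, pct_change)} for every rate that moved.

    Rates that are new in the poll (no stored value) are ignored here;
    a real job would probably report those separately.
    """
    deltas = {}
    for model, new_rates in polled_card.items():
        old_rates = cached_card.get(model, {})
        for rate_name, new in new_rates.items():
            old = old_rates.get(rate_name)
            if old is not None and old != new:
                pct = round((new - old) / old * 100, 1)
                deltas[(model, rate_name)] = (old, new, pct)
    return deltas


# GBP per million tokens, before and after the unannounced trim.
cached = {"gpt-5-mini": {"input": 0.24, "output": 1.92}}
polled = {"gpt-5-mini": {"input": 0.20, "output": 1.60}}

for key, (old, new, pct) in price_deltas(cached, polled).items():
    print(key, old, "->", new, f"({pct}%)")
```

On these figures both rates show a drop of about 16.7 per cent, which is the kind of uniform delta that suggests a deliberate re-price rather than a scraping error.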
The Sonnet non-move
Sonnet 4.5 has held its £2.36 / £11.78 list across the full window the almanac covers. We mention this because it is the longest unmoved price in the table and because the bulk of our own metered traffic still lands on it. Stability of this kind is a feature for finance teams; it lets them write the year's budget in November and not redo it in March.
§ III, Regretted within a week
Two prices that did not survive their first review meeting
The almanac is not just a print of what is true; it is also a record of what providers tried and reversed. Q1 had two of those, neither announced in either direction, both caught on the wire by the daemon and corroborated against subsequent invoices.
Anthropic, Opus cached input, roughly 7 per cent
During the quarter Anthropic moved the cached input rate on Opus 4.5 between £1.10 and £1.18, a swing of roughly 7 per cent, in a sequence of changes and reversions. The uncached rate did not move. We noticed because variance on our own Opus-heavy traffic shifted overnight on the days the rate moved. The current list, and the figure the table prints, is £1.18.
The pattern (a new rate held briefly, then walked back, then re-issued) looks like a pricing team testing a number, getting pushback from a small set of large customers, walking it back, then re-issuing the change after working out who needed grandfathering. We have no inside knowledge of why; we only know what the rate did. The point for an operator is that the cached input rate at the frontier tier is now an actively managed number, not a stable one, and the pricing job should poll it daily rather than weekly. We have moved ours.
OpenAI, the cache TTL trim
During the quarter, OpenAI silently reduced the cache TTL on GPT-5 from five minutes to two, then reverted it. The list price of cached input did not change. What changed was the probability that any given cached prefix was still warm at the time of the next call. In production this matters: a long-running agent loop with 90-second tool turns went from "almost always cache-hit" to "cache-hit if the model returned quickly, cache-miss otherwise", and the billing reflected it.
By the time the TTL went back to five minutes, our own OpenAI-heavy traffic showed a spend uplift on identical workloads, almost all of it from cache-eligible input being billed as uncached. OpenAI did not announce the reversion either. We learned about it the same way we learned about the original change, by watching the headers stop saying what they had been saying.
We do not think the trim was malicious. The most charitable reading is a capacity-management decision applied uniformly across cache slots that turned out to have a billing implication the team had not modelled. The less charitable reading is that someone thought the change was invisible. It is not invisible; it is in the headers and it is in the bill. The lesson, for any provider reading this, is that the silent-billing-change era ended the day the first independent meter shipped. We are not the only meter. The change will be in someone's chart.
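The TTL effect described in this section can be illustrated with a deliberately simplified model. It assumes the cached prefix stays warm only if the next call arrives within the TTL of the previous one, and that the gap between calls is the 90-second tool turn plus the model's own response time. The function name and the two latency values are hypothetical, and real cache eviction behaviour is provider-specific.

```python
def prefix_still_warm(gap_seconds: float, ttl_seconds: float) -> bool:
    """Simplified model: the prefix is a cache hit iff the next call lands inside the TTL."""
    return gap_seconds < ttl_seconds


TOOL_TURN = 90  # seconds, the tool-turn length quoted in the text

for ttl in (300, 120):  # five-minute TTL, then the two-minute TTL
    for model_latency in (10, 45):  # hypothetical fast and slow model responses
        gap = TOOL_TURN + model_latency
        outcome = "hit" if prefix_still_warm(gap, ttl) else "miss"
        print(f"ttl={ttl}s gap={gap}s -> cache {outcome}")
```

Under these assumptions every call hits at a 300-second TTL, while at 120 seconds the slow response (a 135-second gap) misses and is billed as uncached input.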
§ IV, Cached versus uncached
The ratio, and why it is settling
One number we now track every quarter is the ratio of cached to uncached input price, by provider. At the start of 2024 the ratio was all over the place: some providers offered 50 per cent off cached, some 90 per cent, some none at all. The economic logic of caching (the provider's GPU has done the work already, the marginal cost of replaying the prefix is small) pushes the discount toward the cost of memory and bandwidth, and that has now settled in two clusters.
- The ten-per-cent cluster: Anthropic (all tiers), OpenAI (GPT-5 family), DeepSeek V4. Cached input is priced at 10 per cent of uncached, to within a rounding error.
- The twenty-per-cent cluster: Google (Gemini Pro and Flash), Mistral Large 2, xAI Grok-4. Cached input is priced at 20 to 25 per cent of uncached.
- The no-cache holdouts: Groq (no published cache pricing as of 31 March), and the smaller open-weights hosts.
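The clustering above can be reproduced from the table. The sketch below uses a subset of the Q1 figures; the function name `cache_cluster` and the 0.15 cut-off between the two clusters are assumptions made for illustration, not the desk's published rule.

```python
# (uncached input, cached input) in GBP per million tokens, from the Q1 table.
# None means the provider publishes no cached rate.
PRICES = {
    "Claude Sonnet 4.5": (2.36, 0.24),
    "GPT-5": (1.85, 0.18),
    "DeepSeek V4": (0.21, 0.02),
    "Gemini 2.5 Pro": (1.05, 0.21),
    "Mistral Large 2": (1.57, 0.31),
    "Grok-4": (2.36, 0.59),
    "Llama-3.3 70B (Groq)": (0.48, None),
}


def cache_cluster(uncached, cached, cutoff=0.15):
    """Label a model by its cached-to-uncached ratio (cutoff is an assumed threshold)."""
    if cached is None:
        return "no-cache"
    ratio = cached / uncached
    return "ten-per-cent" if ratio < cutoff else "twenty-per-cent"


for model, (uncached, cached) in PRICES.items():
    print(f"{model}: {cache_cluster(uncached, cached)}")
```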
The ratio matters for workload routing. A workload that is 80 per cent cached input (typical of long-running agents with stable system prompts) pays a blended input rate of 0.28 of sticker on a ten-per-cent-cluster provider, against 0.36 to 0.40 on a twenty-per-cent-cluster provider. That is roughly 20 to 30 per cent less in input charges, all else equal, even when the sticker input prices look identical. Routing on effective rate rather than sticker rate has been the right default for us this quarter; the spread has been wide enough that the choice is no longer a tie.
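The effective input rate behind that comparison is a weighted average of the uncached and cached rates. A minimal sketch, assuming a sticker input price of £1.00 per million tokens for every provider (a hypothetical value chosen to isolate the effect of the cache ratio) and ignoring output tokens entirely:

```python
def effective_input_rate(uncached_rate, cached_ratio, cached_share):
    """Blended input price per million tokens.

    uncached_rate: sticker input price
    cached_ratio:  cached price as a fraction of the uncached price
    cached_share:  fraction of input tokens served from cache
    """
    return uncached_rate * ((1 - cached_share) + cached_share * cached_ratio)


STICKER = 1.00  # hypothetical sticker price, identical across providers
for ratio in (0.10, 0.20, 0.25):
    rate = effective_input_rate(STICKER, ratio, cached_share=0.80)
    print(f"cached at {ratio:.0%} of uncached -> effective input {rate:.2f}")
```

With an 80 per cent cached share this prints effective rates of 0.28, 0.36 and 0.40 for the three ratios.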
§ V, Mix by spend
What we will and will not print this quarter
The almanac will eventually carry a provider mix section. It does not, this quarter. The honest version of the section requires a metered population large enough that no single workspace dominates the print, and we are not there yet. Halton Labs runs the daemon on its own workload only, and that workload is roughly 86 per cent Anthropic, 9 per cent OpenAI, 5 per cent other, which is not representative of the market and is not a number anyone should plan against.
Two public sources do report aggregate mix at a useful resolution. The OpenRouter public model usage page publishes a rolling token-weighted share by model across its routed traffic, refreshed daily. Anthropic and OpenAI both report aggregate platform usage in their quarterly safety and disclosure documents. Neither is the same shape as a metered-spend mix, but the directional signal is real, and pointing readers at primary sources costs nothing.
We will print our own mix as soon as the metered pool is large enough to anonymise without distorting. The current rule of thumb on the desk is at least fifty workspaces in any aggregate, with no single workspace contributing more than 10 per cent. We will say in the almanac when we cross that line.
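The rule of thumb above translates directly into a publishability check. This is a sketch only: the function name `safe_to_publish`, the dictionary of spend by workspace, and the sample data are hypothetical; the two thresholds are the ones stated in the text.

```python
def safe_to_publish(spend_by_workspace, min_workspaces=50, max_share=0.10):
    """True if the aggregate has enough workspaces and none dominates it."""
    total = sum(spend_by_workspace.values())
    if len(spend_by_workspace) < min_workspaces or total <= 0:
        return False
    largest_share = max(spend_by_workspace.values()) / total
    return largest_share <= max_share


# Hypothetical pools: fifty equal workspaces, then one dominant workspace.
even_pool = {f"ws{i}": 100.0 for i in range(50)}
skewed_pool = dict(even_pool, ws0=1000.0)

print(safe_to_publish(even_pool))    # each workspace is 2% of spend
print(safe_to_publish(skewed_pool))  # ws0 is about 17% of spend
```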
§ VI, Q2 expectations
What to plan for, and what to ignore
Looking forward to Q2, three things we think are likely and one we think is being over-discussed.
One, a Sonnet move. Sonnet 4.5 is the longest unmoved price in the table. We do not think it stays there through Q2. The direction is harder to call; a downward adjustment, in response to the DeepSeek V4 line, would not surprise us, and neither would a small upward adjustment paired with a longer context window or cheaper cached input. We are budgeting on no change and will redo the model the morning of any move.
Two, more cache TTL volatility. The OpenAI episode in February was the first time we caught a provider changing a cache parameter that was not the list price. It will not be the last. We have updated the daemon to record the cache-related response headers verbatim on every request, not just the totals, so that a parameter change of this kind shows up in the next morning's variance run. The change landed in the prerelease build that closed on 11 March; the patch notes track the commit.
Three, more mid-tier pressure from the open-weights hosts. The V4 launch was the biggest single move of the quarter and the most-copied. We would not be surprised to see one more mid-tier cut from a hosted-Llama provider before the end of June, particularly if Groq adds cached input pricing (they are the obvious gap on the chart). The frontier pricing should hold; the mid will not.
The over-discussed item: per-request pricing as an alternative to per-token. We have been hearing the rumour, again, that a major provider is preparing a flat-per-request tier. We have heard this rumour in every quarter we have tracked, and it has never arrived. Our base case is that it does not arrive in Q2 either. If it does, the almanac will get a new column; if it does not, you will not hear about it from us until it does.
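For operators who want to act on the second expectation above, recording cache-related headers verbatim and diffing them day to day can be sketched as follows. The header names in the example are invented for illustration; no provider is confirmed to use them, and the two function names are hypothetical rather than taken from the daemon.

```python
def cache_headers(response_headers):
    """Keep every header whose name mentions 'cache', with values untouched."""
    return {
        name: value
        for name, value in response_headers.items()
        if "cache" in name.lower()
    }


def changed_headers(previous, current):
    """Return {name: (old, new)} for cache headers that differ between two snapshots."""
    return {
        name: (previous.get(name), value)
        for name, value in current.items()
        if previous.get(name) != value
    }


# Hypothetical response headers from two consecutive days.
monday = cache_headers({"Content-Type": "application/json", "X-Cache-TTL-Seconds": "300"})
tuesday = cache_headers({"Content-Type": "application/json", "X-Cache-TTL-Seconds": "120"})

print(changed_headers(monday, tuesday))
```

Storing the raw strings rather than parsed totals means a parameter the job did not anticipate still leaves a trace in the diff.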
See you in three months. The Q2 almanac is scheduled for August.
Notes
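- GBP/USD conversion at GBP 1.00 = USD 1.272, the Bank of England 16:00 fix on 31 March 2026. Sterling figures are rounded to two decimal places, except cached rates below ten pence, which the table shows to three where two would hide the figure.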
- Specific provider events referenced in the body (the DeepSeek V4 launch, the GPT-5 Mini reduction, the Opus cached-input movement, the GPT-5 cache TTL change) are summarised from the daemon's pricing-poll record and the corresponding reconciliation logs. Future almanacs will link each event to a primary source URL once we publish the pricing-archive endpoint; this issue does not.
- Disclosure: Halton Labs is building a metering product. The almanac is editorial; no provider has reviewed it before publication.