Field Report No. 009 · 20 May 2026 · 12 min read · VOL I · 01

Twelve weeks of dogfooding: the daemon on the workload it was built to meter.

Halton Labs is a one-person shop. For twelve weeks I ran the daemon on every workload I touched, tagged by internal project code, and reconciled the result against the bills. This is the diary.

Written by one human and one model. Halton Labs is operated by Vikrant Shukla, with Claude Opus 4.7 as the second engineer. Bylines name the role; the full colophon at the foot of the page explains the arrangement.

Fig. 09, twelve weeks of dogfooding spend

One workspace, seven internal project codes, twelve weeks.

[Chart: weekly spend per project code. Y-axis £0 to £40/wk; x-axis WK 01 to WK 12. Series: HM-DAEMON, HM-CLOUD, HM-LANDING, HM-DASH, HM-JOURNAL, HM-DOCS, HM-EXP.]

Each line is one internal project code on the meter codebase. Spend is weekly pounds, twelve weeks. The two highlighted projects (the daemon itself and the experiments scratch project) carry most of the cost. Numbers are small in absolute terms; the shape is what the dispatch is about.

Three months ago I shipped a build script that quietly loaded the entire meter codebase into a Claude context window on every commit. It worked. The tests passed. The diff looked clean. For four days, without my noticing, every push was costing roughly £40 of Anthropic spend before the CI suite even started. The daemon caught it on day five. That episode is what made me run the meter on my own work, all of it, for the next twelve weeks.

Halton Labs is currently a one-person shop. I build the daemon. I also use it on every workload I run, which is mostly Claude Code sessions, the occasional OpenAI API call, and a steady drip of small scripts that ingest invoices, write documentation, or summarise something. For the last twelve weeks I have run the daemon on the laptop I work on and on the CI runner for the meter repo, attributed every call to an internal project code, and reconciled the result against the bills. This dispatch is the diary.

§ I, Setup · The workspace, the projects, the goal

Seven internal project codes were active in the window. They are the meter daemon itself (HM-DAEMON), the cloud backend (HM-CLOUD), this landing site (HM-LANDING), the open-source dashboard (HM-DASH), this journal (HM-JOURNAL), the docs site (HM-DOCS), and an experiments scratch bucket for anything that has not been promoted to its own code yet (HM-EXP). There is no client work in the table because there is no client work to put in it.

The goal of the exercise was modest. I wanted to know what the meter saw, week by week, on the work I was actually doing. I wanted the dashboard sitting in a second monitor tab to be telling me the truth about my own spend, in pounds, with enough granularity that I could point at a number and say which project earned it. And I wanted to keep the absolute numbers honest, which is why every figure in this dispatch is weekly pounds and there is no annual total at the bottom.

Sample: 1 (one workspace, one person)
Window: 12 wks (March to May 2026)
Project codes: 7 (all internal to Halton Labs)

The daemon ran on the laptop I work on and on the CI runner for the meter repo. The dashboard sat in a second monitor tab. I had a working theory that watching the number would be enough to keep it down. The theory was wrong, but only by about a week.

§ II, Expectations · What I expected, and what twelve weeks showed

My prior was the conventional one in March: Claude was the cheapest option per task for the kind of work I do, OpenAI was competitive on small jobs, and the rest of the field was a sideshow on a workload this size. That prior turned out to be roughly right for most of the window and useful in one specific way I had not anticipated.

The useful surprise was prompt caching on long Claude Code sessions. Once I got serious about caching the system prompt and tool definitions across a working day, the per-task cost of Claude on context-heavy work fell to a number I did not believe at first. I reran a few sessions with caching turned off to double-check; the difference was real. On the projects that used Claude Code heavily and had cache hits above 70 percent, the effective cost of input tokens collapsed to roughly an eighth of the headline price. For those projects, Claude was not just cheaper than the alternatives; it was cheaper by a margin large enough that the question stopped being interesting.
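
The relationship between cache hit rate and effective input price can be sketched as a simple blend. This is a minimal illustration, not the daemon's accounting: the function names are hypothetical, the default `cached_read_multiplier` of 0.1 is an assumption (cached reads billed at a tenth of the base input price; check the provider's current price list), and the one-off surcharge for writing to the cache is ignored.

```python
def effective_input_multiplier(hit_rate: float,
                               cached_read_multiplier: float = 0.1) -> float:
    """Blended input price as a fraction of the headline (uncached) price.

    hit_rate is the share of input tokens served from cache, 0.0 to 1.0.
    cached_read_multiplier is an assumed discount, not a quoted price.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    cached_part = hit_rate * cached_read_multiplier
    uncached_part = (1.0 - hit_rate) * 1.0
    return cached_part + uncached_part


def hit_rate_for_multiplier(target: float,
                            cached_read_multiplier: float = 0.1) -> float:
    """Invert the blend: the hit rate needed to reach a target fraction."""
    return (1.0 - target) / (1.0 - cached_read_multiplier)
```

Under these assumptions a hit rate of exactly 70 percent gives about 0.37 of the headline price, and an eighth of the headline price corresponds to a hit rate of about 97 percent, so the collapse described above belongs to sessions sitting well above the 70 percent threshold.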

For everything else, the picture was less tidy. Short bursts of structured-output work (form filling, classification, the boring half of any integration) were roughly at parity between Sonnet and GPT-class models. Long-form generation (drafts, summaries, anything where the model wrote more than two thousand tokens at a stretch) was where the bill grew fastest, and where the choice of model mattered most. I lost a small amount of money in week four by defaulting to Opus on a workload that did not need it, and made it back over weeks five and six by being more disciplined about model selection per route. That discipline is the methodology of the twelve weeks. It is also, embarrassingly, the obvious thing that no one does until a meter forces them to.

§ III, The afternoon · The agent loop that cost £400 in 88 minutes

A Friday afternoon in week six. I was building an agentic workflow on the experiments scratch project. The workflow had a tool that fetched a document, summarised it, and decided whether to fetch more. The tool returned a list of related documents. The decision step said: if the summary mentions any term in the related-document list, fetch those too. You can see the bug. The summary always mentions the terms in the related-document list, because the related-document list was being passed in as context.
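
One generic guard against this class of loop is to track which documents have already been fetched and to cap the total number of fetches. The sketch below is an illustration only, not the fix that actually shipped (the dispatch does not describe it); `crawl`, `fetch_related`, `should_follow`, and `max_fetches` are hypothetical names.

```python
from collections import deque
from typing import Callable, Iterable


def crawl(start: str,
          fetch_related: Callable[[str], Iterable[str]],
          should_follow: Callable[[str], bool],
          max_fetches: int = 20) -> list[str]:
    """Follow related documents breadth-first.

    No document is fetched twice, and the walk stops after max_fetches
    documents even if the decision step keeps saying "fetch more".
    """
    seen = {start}
    queue = deque([start])
    fetched: list[str] = []
    while queue and len(fetched) < max_fetches:
        doc = queue.popleft()
        fetched.append(doc)  # stands in for "fetch and summarise"
        for related in fetch_related(doc):
            if related not in seen and should_follow(related):
                seen.add(related)
                queue.append(related)
    return fetched
```

With a decision step that always answers yes, which is effectively what the bug produced, the walk still terminates once every reachable document has been seen or the fetch budget is spent.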

The loop ran on Opus. It started at 3:14pm. By 4:42pm the daemon's budget alert (£50 per hour, per project) had fired three times. The third alert paged me. Total spend when I killed the process: £401.18, in 88 minutes, on a single workstation. Out of context, that is a fortune. In context, it is the cost of one afternoon's mistake, which is not nothing but is also not catastrophic. The fix took ten minutes. The lesson took longer: per-project hourly caps now default to the project's monthly budget divided by 100, and the third alert hard-pauses the daemon for that project until acknowledged. [1]
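
The cap-and-pause rule can be sketched as a small state machine. This is an assumption-laden illustration rather than the daemon's code: `ProjectBudget` and its methods are hypothetical, and how the daemon windows spend between checks is not described here, so each call to `check` simply stands for one evaluation of spend in the current window.

```python
from dataclasses import dataclass


@dataclass
class ProjectBudget:
    monthly_budget: float          # pounds
    alerts_before_pause: int = 3   # the third alert hard-pauses the project
    alerts_fired: int = 0
    paused: bool = False

    @property
    def hourly_cap(self) -> float:
        # Working default from the text: monthly budget divided by 100.
        return self.monthly_budget / 100

    def check(self, spend_in_window: float) -> str:
        """Return 'ok', 'alert', or 'paused' for one evaluation."""
        if self.paused:
            return "paused"
        if spend_in_window <= self.hourly_cap:
            return "ok"
        self.alerts_fired += 1
        if self.alerts_fired >= self.alerts_before_pause:
            self.paused = True
            return "paused"
        return "alert"

    def acknowledge(self) -> None:
        """Clear the pause and reset the alert count."""
        self.paused = False
        self.alerts_fired = 0
```

For a hypothetical monthly budget of £1,200 the cap works out at £12 per hour. Two breaches return `alert`, the third returns `paused`, and every later check stays `paused` until `acknowledge()` is called.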

The point of telling the story is not the £401. It is that the daemon caught it inside an hour and a half, on a Friday afternoon, on the only workstation that was running anything against Opus that day. Without the meter, the same loop would have run until the credit card hit a provider-side limit some time on Sunday evening. The difference between the two timelines is the case for installing the daemon at all.

§ IV, Plumbing · How project codes get on the wire

The bookkeeping I built around the daemon is small and unglamorous. Every internal project has a code (HM-DAEMON, HM-CLOUD, etc.). My shell exports a HALTON_PROJECT variable, set per terminal. The daemon reads that variable from the environment of the calling process via a small header injection at proxy time, attributes every call to the project, and ships it. There is also a halton CLI wrapper that lets me switch project codes between commands without thinking about shell state.

The configuration is roughly this:

    # ~/.zshrc
    export HALTON_PROJECT="HM-DAEMON"
    export HALTON_DAEMON="http://127.0.0.1:8788"

    # switch project in a new shell
    halton use HM-EXP
    halton status
    #  project: HM-EXP · daemon: ok · last call: 12s ago

The CI runner uses the same pattern: every job exports a project code from the pipeline definition. The daemon never trusts an unmapped code; if a call arrives without a known project, the daemon attributes it to a bucket called HM-UNALLOCATED and surfaces the volume on the dashboard at the end of the day. That bucket is how I catch new work that has slipped into a terminal without a code.
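
The fallback rule can be sketched in a few lines. The header name `x-halton-project` and the function names are hypothetical, chosen for illustration; only the project codes and the HM-UNALLOCATED bucket come from the setup described above.

```python
KNOWN_PROJECTS = frozenset({
    "HM-DAEMON", "HM-CLOUD", "HM-LANDING", "HM-DASH",
    "HM-JOURNAL", "HM-DOCS", "HM-EXP",
})
UNALLOCATED = "HM-UNALLOCATED"
PROJECT_HEADER = "x-halton-project"  # hypothetical header name


def attribute(headers: dict[str, str]) -> str:
    """Map a call's headers to a project code, never trusting unknown codes."""
    lowered = {name.lower(): value for name, value in headers.items()}
    code = lowered.get(PROJECT_HEADER, "").strip()
    return code if code in KNOWN_PROJECTS else UNALLOCATED


def unallocated_volume(calls: list[dict[str, str]]) -> int:
    """Count calls that landed in the unallocated bucket."""
    return sum(1 for headers in calls if attribute(headers) == UNALLOCATED)
```

A missing header and a mistyped code end up in the same bucket, which is the property that lets the end-of-day volume surface work that slipped in without a code.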

The point of all of this is not the plumbing, it is the report. At the end of every week the dashboard exports a ledger per project, in a CSV I can open in any spreadsheet. The ledger has date, model, input tokens, output tokens, cached tokens, cost, and the running total. The exercise is the same one a hosted client would run if Halton Meter were already a product with customers; running it on my own workload is how I know the report is the right shape before I ship it to anyone.
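
A ledger of that shape can be produced with the standard library alone. The sketch below is a guess at the layout, not the dashboard's exporter: the column names, the `write_ledger` and `ledger_csv` functions, and the choice to pass cost as a decimal string are all assumptions.

```python
import csv
import io
from decimal import Decimal

COLUMNS = ["date", "model", "input_tokens", "output_tokens",
           "cached_tokens", "cost", "running_total"]


def write_ledger(calls, out) -> None:
    """Write one row per call, sorted by date, with a running cost total.

    Each call is a dict holding every column except running_total;
    cost is a decimal string in pounds, e.g. "0.10".
    """
    writer = csv.DictWriter(out, fieldnames=COLUMNS)
    writer.writeheader()
    running = Decimal("0")
    for call in sorted(calls, key=lambda c: c["date"]):
        running += Decimal(call["cost"])
        writer.writerow({**call, "running_total": str(running)})


def ledger_csv(calls) -> str:
    """Convenience wrapper that returns the ledger as a CSV string."""
    buffer = io.StringIO(newline="")
    write_ledger(calls, buffer)
    return buffer.getvalue()
```

Using `Decimal` rather than floats keeps the running total exact to the penny, which matters once the ledger is reconciled against an invoice.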

§ V, A model swap · The dogfood eval that went the wrong way

In week eight I tried to switch the journal's draft-generation workload (HM-JOURNAL in the table) from Opus to Sonnet. The setup was straightforward: on a representative batch of drafts, I ran the same prompts through both models, scored the outputs against the editorial rubric used elsewhere in the project, and confirmed that the scores were indistinguishable. The cost saving was roughly 38 percent on that workload.

I rolled it over the following Monday. Three weeks later, doing the line edit on a longer piece, I noticed the drafts had started feeling thinner. Not worse on any specific axis, just less interesting. I went back to the originals and confirmed it was not in my head. The Sonnet drafts were technically fine; the Opus drafts had a quality the rubric did not score for. I switched back.

I do not have a clean takeaway from that story. The A/B was good. The qualitative judgement, three weeks later, was also good. Both can be true. The meter recorded the cost of the qualitative judgement at roughly £56 a month on this workload, which I am happy to pay and which I would not have been able to make an informed decision about without a number to argue against.

§ VI, What twelve weeks taught me · The shape of the surprises

Twelve weeks of dogfooding does not produce a chart anyone outside the workload should care about. The absolute numbers are small. The interesting part is the shape of the surprises: which calls go uncached when they should be cached, which agent loops compound silently, which configuration mistakes cost real money in the first five minutes after they ship. Each of those was caught on the wire before it showed up on the monthly bill. That is the case for building the daemon at all, and the case for running it on the work you actually do, before you ask anyone else to.

Three lessons travel further than this workload.

One, the budget alert is the feature. The dashboard tells you what happened. The alert tells you what is happening. They are different products. The £401 loop was caught by the alert, not by the dashboard; the dashboard caught up the next morning. Anyone shipping a metering tool that does not page on a budget breach is shipping a forensic tool, which is useful but is not what most users mean when they ask for a meter.

Two, project codes on the wire beat project codes in the dashboard. The shell-variable approach lets a single workstation switch projects between terminals without any state living in a config file. The CI runner uses the same mechanism. The unallocated bucket catches the drift before it becomes a category. None of this is novel; it is the bog-standard way time-tracking software has worked for decades. It is, however, the difference between a metering tool that finance teams can use and one that they ignore.

Three, qualitative judgement is allowed to overrule the cost number. The Sonnet swap was the right decision on the rubric and the wrong decision on the read-through. Both halves of that are real. A meter that pretends the cost number is the only number is a worse tool than one that lets a person trade pounds for quality on a workload they understand. The role of the meter is to make the trade legible, not to win it.

Twelve weeks in, the diary is dull, which is the result I wanted. There are no surprises in this week's invoice. The dashboard still sits in a second monitor tab. I still glance at it more than I should. The number is, in the beginning, an irritation. After three months, it is a colleague.

EOF · No. 009 · Halton Meter Journal

Notes
  1. The hourly cap and hard-pause behaviour are in the current daemon prerelease build (pre-v0.1.0). Per-project monthly budget divided by 100 is a working default, not a recommendation; for legitimate short spikes in spend, you will want a higher ceiling.
  2. Reconciliation methodology, daemon ledger versus provider invoice, is documented in the methodology piece in this same issue. Weekly variance over the twelve weeks ran between 0.18 and 0.41 percent, averaging 0.27 percent.
  3. Project codes in this dispatch are the real internal codes used on the meter repo. The figures are weekly pounds, rounded, from the daemon's own ledger, reconciled against provider invoices to within the variance noted above.
  4. Disclosure: Halton Labs is one person shipping Halton Meter. The diary is from that one workload. There are no client engagements to disclose.