Inside the daemon's read loop, every byte from request to ledger entry.
A walkthrough of the open source local daemon, process boundaries, signing keys, and the surprising thing we got wrong in an internal 0.2 prerelease.
Written by one human and one model. Halton Labs is operated by Vikrant Shukla, with Claude Opus 4.7 as the second engineer. Bylines name the role; the full colophon at the foot of the page explains the arrangement.
One POST, three lanes, eighty-four milliseconds.
One outbound call to Anthropic, traced through the local daemon. The application never speaks to the provider directly. The ledger entry is written after the last byte of the response has been forwarded, not before.
A daemon is a small, stubborn program that you start once and then forget about. The Halton Meter daemon is one of those. It listens on a local port, watches API calls to language model providers go past, writes them down, and ships the records to a backend that turns them into a ledger. The total amount of cleverness in that sentence is small on purpose. Most of the engineering goes into the parts that are not clever: the bytes you do not lose, the certificate you do not get wrong, the response you do not buffer when you were supposed to stream it. This dispatch is a walk through one request, end to end, with the corners that bit us called out as we pass them.
The daemon is a local binary, free forever via pipx or uvx. The public repository at haltonlabs/halton-meter will open under Apache 2.0 when v0.1.0 ships; every version number in this dispatch is an internal prerelease tag on the road to that release. I will quote real function names. Nothing in this dispatch has been simplified for the page.
§ I, On the wire
What actually sits between the application and the provider
The daemon is a mitmproxy addon, in Python, running as a single OS process. When a developer installs Halton Meter, three things happen on their machine. A local certificate authority is generated and added to the operating system's trust store. The proxy is started on 127.0.0.1:8421. Environment variables HTTPS_PROXY and HTTP_PROXY are set to point at that address, in the shell profile or in a launchd plist depending on the platform. After that, the developer does not interact with the daemon again. It is a piece of furniture.
Their application, whether that is Claude Code, a Cursor session, a curl call, or a Python script that imports the Anthropic SDK, does not need to know any of this. The HTTPS request is built normally, signed normally, and addressed to api.anthropic.com. The application's HTTP client reads the proxy environment variable, opens a socket to the daemon's listening port instead of to the provider, and asks the proxy to connect to api.anthropic.com on its behalf. The daemon performs a TLS handshake using its own CA, decrypts the request, reads the body, opens its own upstream TLS connection to the real provider, forwards the request, streams the response back, and writes a ledger entry once the last byte has crossed. The application sees a normal HTTPS response. It does not know there was an intermediary.
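The redirection step depends on the HTTP client honouring the proxy environment variables. As a minimal sketch, Python's standard library can show what a client sees once HTTPS_PROXY points at the daemon. The helper name `proxies_seen_by_client` is hypothetical; the address is the one quoted above.

```python
import os
import urllib.request


def proxies_seen_by_client(overrides: dict) -> dict:
    """Return the proxy map urllib would use with the given env overrides applied."""
    saved = dict(os.environ)
    try:
        # Clear any pre-existing proxy settings so only the overrides count.
        for name in list(os.environ):
            if name.lower().endswith("_proxy"):
                del os.environ[name]
        os.environ.update(overrides)
        return urllib.request.getproxies()
    finally:
        # Restore the caller's environment exactly as it was.
        os.environ.clear()
        os.environ.update(saved)


proxies = proxies_seen_by_client({"HTTPS_PROXY": "http://127.0.0.1:8421"})
print(proxies.get("https"))
```

A client that ignores these variables never reaches the daemon, which is worth checking when a tool's traffic is unexpectedly absent from the ledger.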
That is the entire architecture. There is one process. There is one local SQLite database. There is one HTTP client to the backend. The decision to keep it that small is the most important design decision in the daemon, and it is the one that took the longest to commit to.
The daemon must never break the user's work. If it falls over, traffic falls through to the provider as if the daemon had never been installed.
Daemon engineering rule one
The reason for the discipline is the rule above. If the daemon dies, traffic falls through. If the daemon hangs, the proxy timeout fires and the application retries direct. If the daemon is slow, every millisecond it adds is a millisecond a developer waits for their model. The daemon's read loop is therefore a single thing with one job: get out of the way as fast as possible while losing nothing.
§ II, TLS
Certificates, trust, and why we ship our own CA
To read the body of an HTTPS request, the daemon has to terminate the TLS connection from the application and originate a fresh one to the provider. There is no way to inspect cleartext bytes without doing this. The way mitmproxy does it, and the way we do it, is to ship a local CA: on first run, the daemon generates a 2048-bit RSA key pair, builds a self-signed root certificate, and adds it to the operating system's trust store. After that, the daemon can mint per-host leaf certificates on demand, signed by that local root, and the application accepts them because the root is trusted.
We are explicit about this in the install script, because it is the most invasive thing the product does. The CA is local; the private key never leaves the machine; the daemon refuses to start unless its own key file is readable and has mode 600. We log the SHA-256 fingerprint of the root in the CLI on every start, so a paranoid operator can verify it has not been swapped. If you want to remove Halton Meter, the uninstaller removes the CA from the trust store first, then exits. That order matters.
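The two start-up checks described here, the key file mode and the root fingerprint, can be sketched with the standard library alone. This is an illustration under assumptions, not the daemon's code: the function names are hypothetical, and the fingerprint is taken over the DER encoding of the certificate, which is the usual convention.

```python
import hashlib
import os
import ssl
import stat


def key_file_mode_ok(path: str) -> bool:
    """True only if the file's permission bits are exactly 0600 (POSIX)."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode == 0o600


def root_fingerprint(pem_text: str) -> str:
    """SHA-256 over the DER bytes of a PEM certificate, as colon-separated hex."""
    der = ssl.PEM_cert_to_DER_cert(pem_text)
    digest = hashlib.sha256(der).hexdigest().upper()
    return ":".join(digest[i:i + 2] for i in range(0, len(digest), 2))
```

An operator can compare the printed value against the fingerprint recorded at install time; any difference means the root on disk is not the one that was originally trusted.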
On macOS the CA install touches the user keychain, not the system one, so it does not require sudo. On Linux it writes to /usr/local/share/ca-certificates/ and runs update-ca-certificates, which does. On Windows it uses certutil -addstore -user Root. There is a different one-liner for each, and a long tail of failure modes for each, and most of the daemon's friction in the first six months was about exactly this. We now refuse to install if the CA cannot be added; we used to install in a degraded mode where HTTPS interception silently fell back to plaintext logging, and that was wrong. Silent fallback in a security boundary is worse than failing loudly.
§ III, Adapters
One provider per file, no central knowledge
The daemon does not know what a token is. It does not know what Claude Opus 4.7 costs. It does not know that Gemini exposes a different streaming envelope than Anthropic. All of that lives in adapter files, one per provider, in daemon/adapters/. The read loop hands every intercepted flow to a dispatcher that picks an adapter based on the destination host. The adapter does the parsing, attributes the request to a project (from a request header, an env var, or a configured fallback), and returns a typed event. The read loop writes the event and forgets.
The reason for this is not aesthetic. It is that providers change. Pricing pages get edited on Wednesdays without notice; a new model id appears in a response payload before it appears on a docs page; a streaming format gets a new event type. When that happens we want to fix one file, ship one patch release, and move on. The core of the daemon has been stable across our internal prerelease builds; the churn has lived in the adapters, which is the design we wanted.
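The host-based dispatch described above can be sketched as a small registry keyed by destination host. The `Dispatcher` class and the stand-in `Adapter` and `Event` types here are hypothetical and self-contained; they are not the classes from halton_meter.core.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Event:
    provider: str
    model: str


class Adapter:
    host = ""

    def parse_request(self, body: dict) -> Event:
        raise NotImplementedError


class AnthropicAdapter(Adapter):
    host = "api.anthropic.com"

    def parse_request(self, body: dict) -> Event:
        return Event(provider="anthropic", model=body["model"])


class Dispatcher:
    """Maps a destination host to the one adapter that understands it."""

    def __init__(self) -> None:
        self._by_host: Dict[str, Adapter] = {}

    def register(self, adapter: Adapter) -> None:
        self._by_host[adapter.host.lower()] = adapter

    def pick(self, host: str) -> Optional[Adapter]:
        # Unknown hosts return None so the caller can forward without parsing.
        return self._by_host.get(host.lower())


dispatcher = Dispatcher()
dispatcher.register(AnthropicAdapter())
adapter = dispatcher.pick("API.Anthropic.com")
event = adapter.parse_request({"model": "example-model"})
```

Because the core only ever calls `pick`, supporting a new provider in this sketch means adding one class and one `register` call, which mirrors the one-file-per-provider intent.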
Here is the shape of an adapter, lightly elided. The full thing is about 180 lines per provider:
    # daemon/adapters/anthropic.py
    from halton_meter.core import Adapter, Event, Project


    class AnthropicAdapter(Adapter):
        host = "api.anthropic.com"
        paths = ("/v1/messages", "/v1/complete")

        def parse_request(self, flow) -> Event:
            body = flow.request.json()
            project = Project.from_headers(flow.request.headers) \
                or Project.default()
            return Event(
                provider="anthropic",
                model=body["model"],
                project=project,
                input_tokens=None,  # filled on response
                cache_read=None,
                stream=body.get("stream", False),
                ts=flow.request.timestamp_start,
            )

        def parse_response(self, flow, ev: Event) -> Event:
            if ev.stream:
                usage = self._tee_stream(flow.response)
            else:
                usage = flow.response.json()["usage"]
            ev.input_tokens = usage["input_tokens"]
            ev.output_tokens = usage["output_tokens"]
            ev.cache_read = usage.get("cache_read_input_tokens", 0)
            ev.cache_write = usage.get("cache_creation_input_tokens", 0)
            return ev
We chose mitmproxy over the obvious alternative, which would be to ship a thin SDK wrapper around anthropic, openai, etc. and ask developers to import that instead. SDK shims are easier to write and lighter to install. They are also useless in the situation that matters: when a developer pip-installs a new dependency that talks to a provider you did not ship a shim for. The proxy approach catches everything that goes over the network, including subprocess shells, including binary tools, including a curl in a debug session at 11pm. We pay a cost for that, in TLS plumbing and CA management. We have decided the cost is worth it. In our internal testing, the number of times we have been surprised by logged traffic is zero. The number of times a team would be surprised by missing traffic under an SDK-shim approach is, by construction, unbounded.
§ IV, Storage
SQLite, WAL, and the atomic write
Every parsed event is written to a local SQLite database at ~/.halton-meter/logs.db. We chose SQLite for the same reason everyone chooses SQLite: it is a library, not a server, and it does not fall over when you forget about it for six months. The interesting bit is the journal mode. SQLite has two families: the rollback journal, which is the default, and the write-ahead log (WAL), which we use. The difference matters because the daemon is writing while a separate goroutine, sorry, a separate thread is reading the same database to batch records up to the backend. In rollback mode, the writer takes an exclusive lock to commit and the reader blocks while it is held. In WAL mode the writer appends to a log and the reader sees a snapshot. The writer never blocks the reader. The reader never blocks the writer.
We learned to set WAL mode at table-create time, not at runtime. If you set it at runtime in the wrong process you can get a database file that is half-WAL and half-rollback and the resulting behaviour is what you would expect from a half-and-half database. We also learned to set synchronous=NORMAL rather than the default FULL. FULL waits for the OS to fsync after every commit, which is the correct setting for a financial ledger but the wrong setting for a buffer that ships to a backend that has its own durability story. NORMAL gives up the last few hundred microseconds of writes before a kernel panic or power cut in exchange for a write loop that fits inside our latency budget. In the workloads we have observed so far, power loss has bitten us a couple of times; latency budget overruns hit every week until we changed this setting.
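The two settings discussed above can be applied with Python's built-in sqlite3 module. This is a minimal sketch: the function name `open_event_db` and the `events` table layout are assumptions for illustration, and a temporary file stands in for ~/.halton-meter/logs.db because WAL does not apply to in-memory databases.

```python
import os
import sqlite3
import tempfile


def open_event_db(path: str) -> sqlite3.Connection:
    """Create or open the event database with WAL and synchronous=NORMAL."""
    conn = sqlite3.connect(path)
    # The pragma reports the mode actually in effect; refuse anything else.
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    if mode.lower() != "wal":
        conn.close()
        raise RuntimeError(f"could not enable WAL, journal_mode is {mode!r}")
    conn.execute("PRAGMA synchronous=NORMAL")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events ("
        " id INTEGER PRIMARY KEY,"
        " payload TEXT NOT NULL,"
        " sent INTEGER NOT NULL DEFAULT 0)"
    )
    conn.commit()
    return conn


with tempfile.TemporaryDirectory() as tmp:
    conn = open_event_db(os.path.join(tmp, "logs.db"))
    journal = conn.execute("PRAGMA journal_mode").fetchone()[0]
    sync = conn.execute("PRAGMA synchronous").fetchone()[0]
    conn.close()
print(journal, sync)
```

The journal mode is stored in the database file, while the synchronous setting is per connection, so every connection that writes needs to set it again.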
Writes themselves are batched, but batched conservatively. The default is one event per transaction. Under load the daemon will pack up to 32 events into a single commit. We do not pack more than that because the batch is also the unit of loss on crash, and the SQLite WAL fsync cycle is fast enough that 32 is rarely a bottleneck. The batching code is twelve lines. If you find yourself writing more, you are probably wrong.
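A conservative batcher of the kind described above might look like the following sketch. It is not the daemon's twelve lines: the queue-based hand-off, the names `drain_batch` and `write_batch`, and the single-column table are assumptions made to keep the example self-contained.

```python
import queue
import sqlite3

MAX_BATCH = 32  # the batch is also the unit of loss on crash


def drain_batch(q: "queue.Queue[str]", limit: int = MAX_BATCH) -> list:
    """Block for one event, then take whatever else is already queued, up to limit."""
    batch = [q.get()]
    while len(batch) < limit:
        try:
            batch.append(q.get_nowait())
        except queue.Empty:
            break
    return batch


def write_batch(conn: sqlite3.Connection, batch: list) -> None:
    """Write the whole batch in one transaction: all rows commit or none do."""
    with conn:
        conn.executemany(
            "INSERT INTO events (payload) VALUES (?)", [(p,) for p in batch]
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT NOT NULL)")
q: "queue.Queue[str]" = queue.Queue()
for i in range(40):
    q.put(f"event-{i}")

sizes = []
while not q.empty():
    batch = drain_batch(q)
    write_batch(conn, batch)
    sizes.append(len(batch))
count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
```

When the queue holds a single event the batch has size one, which matches the default of one event per transaction; packing only happens when events are already waiting.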
The shipper, a separate thread, wakes every five seconds, opens a read transaction, selects up to 500 unsent events, signs the batch with a per-workspace HMAC key, POSTs it to the backend, and on a 200 marks the events sent. On a 5xx it retries with exponential backoff up to one hour and then surfaces an error in the CLI. The HMAC key is provisioned at workspace setup and never leaves the workspace's key store; the backend keeps a copy and verifies the signature before accepting a batch. This is how we know that a batch claiming to be from workspace X actually is. It is the simplest possible thing that prevents a misconfigured daemon from poisoning the ledger of an adjacent workspace, and it is the one piece of cryptography in the daemon other than TLS.
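The sign-and-verify step can be sketched with the standard hmac module. The hash function (SHA-256), the canonical JSON encoding, and the function names are assumptions; the text above only states that batches are signed with a per-workspace HMAC key and verified by the backend.

```python
import hashlib
import hmac
import json
from typing import List, Tuple


def sign_batch(key: bytes, events: List[dict]) -> Tuple[bytes, str]:
    """Serialize the batch deterministically and return (body, hex signature)."""
    body = json.dumps(events, sort_keys=True, separators=(",", ":")).encode("utf-8")
    signature = hmac.new(key, body, hashlib.sha256).hexdigest()
    return body, signature


def verify_batch(key: bytes, body: bytes, signature: str) -> bool:
    """Recompute the MAC over the received bytes and compare in constant time."""
    expected = hmac.new(key, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)


workspace_key = b"example-workspace-key"  # placeholder, not a real key format
body, sig = sign_batch(workspace_key, [{"model": "example-model", "output_tokens": 7}])
accepted = verify_batch(workspace_key, body, sig)
rejected_wrong_key = verify_batch(b"another-workspace-key", body, sig)
rejected_tampered = verify_batch(workspace_key, body + b" ", sig)
```

The signature covers the exact bytes that are sent, so the receiver should verify before parsing rather than re-serializing the parsed JSON.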
§ V, 0.2 prerelease
The internal disaster, and what we did about it
I want to spend a section on the internal 0.2 tag because it is the most useful thing I can tell another engineer working on a similar product. As a reminder: none of these version numbers correspond to a public release. They are internal prerelease tags inside the team on the road to public v0.1.0. The 0.2 line was, briefly, the worst build we have cut. It was also the build that taught us the read loop discipline.
The earlier internal 0.1 tag only supported non-streaming requests. For a metering tool that is a strange limitation, and it was a deliberate one: we wanted the smallest possible thing that worked end to end before we touched streams. The body of a non-streaming response is small and synchronous. You read it once, you JSON-parse it, you write the event, you forward it. Easy.
The internal 0.2 build added streaming, and we made the mistake every observability tool makes the first time it touches a stream. The adapter, on seeing "stream": true in the request, read the entire server-sent-events response from the provider into memory, parsed every event to extract the usage block, computed the totals, wrote the ledger entry, and then sent the full response body to the application. Functionally correct. Operationally a disaster.
Streaming responses exist for a reason. The reason is that the first token of a model's reply is interesting, and people watching their terminal want to see it now, not in 14 seconds when the model finishes. A response that takes 14 seconds to fully arrive is supposed to begin arriving at the application at 600ms, with chunks landing every few tens of milliseconds after that. The 0.2 build turned every streaming response into a single 14-second wait followed by an instant dump of the entire output. From the application's point of view, every Claude call had become a non-streaming call with a worse latency profile. The first time we noticed was during dogfooding, when streaming Claude Code sessions felt like the network had broken.
The fix was the tee. We rewrote the streaming path so that bytes from the provider are forwarded to the application as they arrive, with a parallel parser running on a copy of the stream. The parser maintains the running totals; when the stream terminates (either by the provider closing the connection or by the application cancelling), the daemon writes the ledger entry from whatever state the parser is in. No buffering in the forward path; full parsing in the side path; both finish at roughly the same wall-clock time. The follow-up internal tag landed four days after the bad one. The patch was 41 lines added, 78 deleted. The streaming path is now the only path in the daemon that we treat as latency-critical, and the only one that gets a perf regression test on every commit.
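The tee can be illustrated with a generator that forwards each chunk untouched and feeds the same bytes to an incremental parser. This is a single-threaded sketch, not the daemon's `_tee_stream`: the `UsageParser` and `tee` names are hypothetical, and the parser assumes server-sent events whose `data:` lines carry a `usage` object either at the top level or under `message`.

```python
import json
from typing import Iterable, Iterator


class UsageParser:
    """Keeps running token totals from SSE data lines, tolerant of split chunks."""

    def __init__(self) -> None:
        self._buf = b""
        self.input_tokens = 0
        self.output_tokens = 0

    def feed(self, chunk: bytes) -> None:
        self._buf += chunk
        while b"\n" in self._buf:
            line, self._buf = self._buf.split(b"\n", 1)
            line = line.strip()
            if not line.startswith(b"data:"):
                continue
            try:
                payload = json.loads(line[len(b"data:"):])
            except ValueError:
                continue  # not JSON: ignore and keep forwarding
            if not isinstance(payload, dict):
                continue
            usage = payload.get("usage")
            if usage is None and isinstance(payload.get("message"), dict):
                usage = payload["message"].get("usage")
            if isinstance(usage, dict):
                self.input_tokens = usage.get("input_tokens", self.input_tokens)
                self.output_tokens = usage.get("output_tokens", self.output_tokens)


def tee(chunks: Iterable[bytes], parser: UsageParser) -> Iterator[bytes]:
    for chunk in chunks:
        try:
            yield chunk  # forward first; nothing is held back for parsing
        finally:
            parser.feed(chunk)  # runs even if the consumer cancels mid-stream


stream = (
    b'event: message_start\n'
    b'data: {"type":"message_start","message":{"usage":{"input_tokens":12,"output_tokens":1}}}\n\n'
    b'event: message_delta\n'
    b'data: {"type":"message_delta","usage":{"output_tokens":7}}\n\n'
)
chunks = [stream[i:i + 10] for i in range(0, len(stream), 10)]
parser = UsageParser()
forwarded = b"".join(tee(chunks, parser))
```

Whenever the stream ends, whether the provider closes it or the consumer stops iterating, the parser's current totals are what a ledger entry would be written from.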
The thing I keep returning to about that build is that nothing in the code review caught it. The adapter author had written tests; the tests passed. The integration tests passed because they used the recorded fixture replay path, which does not exercise wall-clock timing. The reason we found it within hours was that the daemon was on the wire while we were actively using it ourselves, not because of any test. After the streaming fix landed we added a perf gate that fails CI if the time-to-first-byte for a streaming response, measured against a local fixture, regresses by more than 5%. That gate has tripped twice since. Both times it caught real regressions.
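A time-to-first-byte gate of the kind described above reduces to two small pieces: measuring how long a stream takes to produce its first chunk, and comparing that against a stored baseline with a tolerance. The sketch below assumes the 5% relative threshold mentioned in the text; the function names and the simulated slow stream are hypothetical.

```python
import time
from typing import Iterable, Iterator


def time_to_first_chunk(chunks: Iterable[bytes]) -> float:
    """Seconds from the start of iteration until the first chunk is available."""
    start = time.perf_counter()
    for _ in chunks:
        return time.perf_counter() - start
    raise ValueError("stream produced no chunks")


def ttfb_regressed(baseline_s: float, measured_s: float, tolerance: float = 0.05) -> bool:
    """True if the measurement exceeds the baseline by more than the tolerance."""
    return measured_s > baseline_s * (1.0 + tolerance)


def slow_stream(delay_s: float) -> Iterator[bytes]:
    """Stand-in for a fixture replay that delays its first chunk."""
    time.sleep(delay_s)
    yield b"data: {}\n\n"


measured = time_to_first_chunk(slow_stream(0.02))
```

A single measurement is noisy, so a real gate would compare a percentile over many runs rather than one sample.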
§ VI, Honest accounting
What we are still wrong about
Three things, in roughly descending order of how much they bother me.
First, we cannot see retries that happen above us. If an SDK retries internally before the request ever reaches our socket, we count it as one call when it was, on the wire, two. The provider invoice will show two; we will show one. The reconciliation engine catches the discrepancy, attributes it to "upstream retry", and the operator gets a note. This is a measurement gap, not a bug, but it is a gap. We have an open RFC to ask SDK vendors to expose a retry counter in a header. Anthropic added one in March 2026. The others have not.
Second, our project attribution is heuristic. We read a header called X-Halton-Project if it is set, otherwise we fall back to the parent process name. The header is correct when applications set it correctly. The process-name fallback is correct often enough to be useful and wrong often enough to be embarrassing. We are working on a per-port project binding so that an application can claim a local port at start and have every call from that port attributed regardless. It is not in the current prerelease.
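The fallback order described above can be written as a short function that also records which rule produced the answer, so a ledger consumer can tell a declared project from a guessed one. The function name, the source labels, and the "unattributed" default are assumptions for illustration.

```python
from typing import Mapping, Optional, Tuple

PROJECT_HEADER = "x-halton-project"


def attribute_project(
    headers: Mapping[str, str], parent_process: Optional[str]
) -> Tuple[str, str]:
    """Return (project, source), preferring the explicit header over the heuristic."""
    for name, value in headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() == PROJECT_HEADER and value.strip():
            return value.strip(), "header"
    if parent_process:
        return parent_process, "process-name"
    return "unattributed", "default"


from_header = attribute_project({"X-Halton-Project": "billing-api"}, "python3")
from_process = attribute_project({"Content-Type": "application/json"}, "python3")
from_default = attribute_project({}, None)
```

Keeping the source alongside the project name makes the embarrassing cases visible: anything labelled "process-name" is a guess and can be reviewed as one.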
Third, we still do not have a clean story for tool-use sub-events. A single Claude call can fan out into model usage, tool invocations, and a second model turn, and the usage we get back is the aggregate. For attribution purposes you sometimes want the breakdown. The provider does not expose it. We can infer it from the request structure, sometimes, with confidence less than we'd like. We have not shipped that yet because we have not figured out how to be honest about the confidence.
None of these are show-stoppers. All of them are the kind of corner that, over a year, costs someone real money if it is not declared. The point of writing them up here, in the same issue as the read loop walkthrough, is that you cannot trust a daemon you only hear good news about. The next dispatch in this series, when the next prerelease ships, will either close one of these gaps or explain why we couldn't. The build is already cut. We are testing it now.
Notes & sources
- The daemon is a local binary, free forever via pipx or uvx. The public repository at github.com/haltonlabs/halton-meter will open under Apache 2.0 at v0.1.0. All versions referenced here are internal prerelease tags.
- mitmproxy: mitmproxy.org. We pin to the 11.x line and contribute patches upstream when our addon hits a sharp edge.
- The perf regression gate referenced in § V is in daemon/tests/perf_stream.py. It runs a recorded SSE fixture through the daemon and fails the build if p50 TTFB exceeds 8 ms.
- Disclosure: Halton Meter Cloud sells a metered product on top of the daemon. The daemon is free forever via pipx or uvx; the bundled dashboard is open source under Apache 2.0. No paid placements.