Headroom: The Missing Layer Between Your Agent and Your Wallet

I’ve been saying context is the hardest part of agentic systems for months. You can have the best model in the world, but if you shove 200k tokens of build logs, JSON arrays, and stale conversation history into every call, you get slow, expensive, and distracted outputs.

The typical fix? Manual curation. Write better CLAUDE.md files. Prune your context. Be disciplined.

That works until your agent makes 47 tool calls in a session and each one returns 3,000 tokens of JSON you can’t control.

Today I found Headroom, and it solves this problem at the infrastructure layer rather than the discipline layer. Same thesis as harness engineering: stop blaming the model, fix the environment. Headroom fixes the most expensive part of the environment: what actually gets sent as tokens.

What It Actually Does

Headroom is an open-source, local-first proxy that intercepts LLM API calls and rewrites the messages array before it reaches the provider. Your application code doesn’t change. The model doesn’t change. Headroom sits between them and makes the context smaller, smarter, and cheaper.

You can drop it in as:

  • A Python/TS library: compress(messages)
  • A network proxy: headroom proxy
  • Wrappers for Claude Code, Cursor, Aider, Copilot CLI

Everything that isn’t a system prompt or user message is fair game for compression: tool outputs, logs, RAG chunks, file contents, conversation history. System prompts get a different treatment (more on that below).

Drop-In Integration

The simplest path is two lines. Wrap your messages before sending them:

from headroom import compress

# Your existing messages array (system, user, tool outputs, etc.)
result = compress(messages, model="claude-sonnet-4-5-20250929")

# Send the optimized messages to the LLM instead
response = client.messages.create(messages=result.messages, ...)

That’s it. compress() returns optimized messages plus metrics:

print(f"Tokens: {result.tokens_before} → {result.tokens_after}")
print(f"Saved: {result.tokens_saved} ({result.tokens_saved / result.tokens_before:.0%})")
print(f"Transforms applied: {result.transforms_applied}")

For LangGraph agents, you replace your tool node with one that compresses results before they re-enter the conversation:

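from headroom import compress
from langchain_core.messages import ToolMessage

# Assumes State (your graph state type) and tools_by_name (a name -> tool mapping)
# are already defined elsewhere in your LangGraph setup.
def tool_node_with_compression(state: State):
    result = []
    for tool_call in state["messages"][-1].tool_calls:
        # Run the requested tool as usual
        tool = tools_by_name[tool_call["name"]]
        observation = tool.invoke(tool_call["args"])

        # Pair the original user request with the raw tool output for compression
        temp_messages = [
            {"role": "user", "content": state["messages"][0].content},
            {"role": "tool", "content": observation, "tool_call_id": tool_call["id"]},
        ]

        # Only the compressed tool output re-enters the conversation
        compressed = compress(temp_messages, model="claude-sonnet-4-5-20250929")
        compressed_content = compressed.messages[-1]["content"]

        result.append(ToolMessage(content=compressed_content, tool_call_id=tool_call["id"]))
    return {"messages": result}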

No LLM call needed for the compression itself. It runs locally, so the cost is zero and the latency is milliseconds.

The Compression Pipeline

Every LLM call flows through three stages before hitting the provider:

%%{init: {"layout": "dagre"}}%%
flowchart LR
    A[Messages] --> B[CacheAligner]
    B --> C[ContentRouter + Compressors]
    C --> D[Context Manager]
    D --> E[Optimized Request]
    E --> F[LLM Provider]

CacheAligner pulls dynamic bits (timestamps, UUIDs, per-request tokens) out of your system prompt and pushes them to the tail. The prefix becomes byte-identical across calls, so Anthropic and OpenAI prompt caching can reuse those tokens instead of billing them at the full input rate on every call. This is invisible savings: you don’t see fewer tokens in the payload, but your effective bill drops.
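To make the idea concrete, here is a minimal sketch of prefix stabilization. It is not Headroom's implementation: the `align_prompt` function, the two regex patterns, and the `<dyn:N>` placeholder format are all assumptions made for illustration.

```python
import re

# Hypothetical patterns for "dynamic" fragments; a real aligner likely detects more.
DYNAMIC_PATTERNS = [
    # ISO-8601 timestamps such as 2025-01-01T10:00:00Z
    re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?Z?"),
    # UUIDs such as 123e4567-e89b-12d3-a456-426614174000
    re.compile(r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}", re.I),
]


def align_prompt(system_prompt: str) -> tuple[str, str]:
    """Split a prompt into a stable prefix and a tail holding the dynamic values."""
    dynamic: list[str] = []

    def stash(match: re.Match) -> str:
        # Replace the dynamic value with a numbered placeholder and remember it.
        dynamic.append(match.group(0))
        return f"<dyn:{len(dynamic) - 1}>"

    stable = system_prompt
    for pattern in DYNAMIC_PATTERNS:
        stable = pattern.sub(stash, stable)

    # The tail carries the values that change per request.
    tail = "\n".join(f"<dyn:{i}> = {value}" for i, value in enumerate(dynamic))
    return stable, tail


p1 = "You are a build agent. Now: 2025-01-01T10:00:00Z. Request 123e4567-e89b-12d3-a456-426614174000."
p2 = "You are a build agent. Now: 2025-06-30T23:59:59Z. Request 00000000-0000-4000-8000-000000000000."
stable1, tail1 = align_prompt(p1)
stable2, tail2 = align_prompt(p2)
```

Both prompts yield the same stable prefix, which is the property a provider-side prefix cache needs; only the tails differ.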

ContentRouter + Specialized Compressors detect the structure of each context segment and route it to the right compressor. This is where the bulk of token reduction happens.

Context Manager enforces your context window budget. Messages that get dropped aren’t thrown away. They go into the CCR cache (more on this next), so the model can retrieve them later if needed.
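A rough sketch of budget enforcement along these lines, assuming a word count as a stand-in for a real tokenizer and a plain dict as the cache. The `enforce_budget` and `approx_tokens` names and the oldest-first drop order are hypothetical, not Headroom's documented behavior.

```python
import hashlib


def approx_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: count whitespace-separated words.
    return len(text.split())


def enforce_budget(messages: list[dict], budget: int, cache: dict) -> tuple[list[dict], list[str]]:
    """Drop the oldest non-system, non-user messages until the budget fits.

    Dropped messages are stored in `cache` under a content hash instead of
    being discarded, so they can be looked up again later.
    """
    kept = list(messages)
    total = sum(approx_tokens(m["content"]) for m in kept)
    dropped_keys: list[str] = []
    i = 0
    while total > budget and i < len(kept):
        msg = kept[i]
        if msg["role"] in ("system", "user"):
            i += 1  # system prompts and user messages are preserved
            continue
        key = hashlib.sha256(msg["content"].encode("utf-8")).hexdigest()[:12]
        cache[key] = msg
        dropped_keys.append(key)
        total -= approx_tokens(msg["content"])
        kept.pop(i)  # do not advance i: the next message shifted into this slot
    return kept, dropped_keys


messages = [
    {"role": "system", "content": "You are helpful."},
    {"role": "tool", "content": "row " * 100},
    {"role": "user", "content": "What failed?"},
    {"role": "tool", "content": "two rows only"},
]
cache: dict = {}
kept, dropped_keys = enforce_budget(messages, budget=20, cache=cache)
```

Here only the large, older tool output is evicted, and it stays reachable through its key in `cache`.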

Every transform is fail-open. If a compressor can’t understand the content, it passes it through unchanged. Worst case: no savings. Never breakage.
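The fail-open contract is easy to express as a wrapper. The following is a sketch of the pattern only; the `fail_open` decorator and the `compact_json` transform are hypothetical names, not part of Headroom's API.

```python
import functools
import json
import logging
from typing import Callable

logger = logging.getLogger(__name__)


def fail_open(transform: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap a transform so any failure returns the input unchanged."""

    @functools.wraps(transform)
    def wrapper(content: str) -> str:
        try:
            return transform(content)
        except Exception:
            # Worst case is no savings: log and pass the content through.
            logger.debug("transform %s failed; passing through", transform.__name__)
            return content

    return wrapper


@fail_open
def compact_json(content: str) -> str:
    # Raises on anything that is not valid JSON.
    return json.dumps(json.loads(content), separators=(",", ":"))


compacted = compact_json('{"a": 1, "b": [1, 2]}')
untouched = compact_json("plain text, not JSON")
```

Valid JSON comes back without the extra whitespace, while the non-JSON input is returned exactly as it went in.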

Five Compressors, Not One

Headroom doesn’t use a single generic compression strategy. It routes content to specialized compressors by type:

Compressor | Target Content | Mechanism | Typical Savings
--- | --- | --- | ---
SmartCrusher | JSON arrays, tool outputs | Field-level statistics, keeps outliers and boundaries | 70-90%
LogCompressor | Build/test logs | Clusters patterns, keeps failures and stack traces | 85-95%
SearchCompressor | Search results | Prunes to top matches + group representatives | 60-80%
CodeCompressor | Source code | AST-aware (tree-sitter), keeps signatures and structure | 40-70%
Kompress-base | Free-form text | ModernBERT-family model trained on agent traces | 30-50%

SmartCrusher is the most interesting one for agent workloads. When it sees a 1,000-row JSON array from a tool response, it doesn’t just truncate. It runs field-level statistics: variance, uniqueness, change-points per column. Then it selects a subset that preserves schema, distribution boundaries, outliers, and error cases. The “boring” rows (all passing tests, all identical status codes) get collapsed.

Why this matters for accuracy: The compressors retain anomalies first. Failing tests, error responses, boundary conditions. These are exactly the things the model needs to reason about. The bulk “everything is fine” rows that would dilute attention get removed.
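The selection logic can be approximated in a few lines. This is a simplified sketch, not SmartCrusher itself: it only keeps the first and last rows, rows whose status is not the "ok" value, and numeric outliers by z-score. The `crush_rows` name, its parameters, and the threshold of 3 standard deviations are assumptions.

```python
from statistics import mean, pstdev


def crush_rows(rows: list[dict], status_field: str = "status",
               ok_value: str = "passed", z_threshold: float = 3.0) -> list[dict]:
    """Keep boundary rows, non-ok rows, and per-field numeric outliers."""
    if len(rows) <= 2:
        return list(rows)

    keep = {0, len(rows) - 1}  # boundaries preserve the shape of the array

    # Error cases: anything whose status differs from the "boring" value.
    for i, row in enumerate(rows):
        if row.get(status_field) != ok_value:
            keep.add(i)

    # Numeric outliers: flag values far from the column mean.
    numeric_fields = [
        k for k, v in rows[0].items()
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]
    for field in numeric_fields:
        indexed = [(i, r[field]) for i, r in enumerate(rows)
                   if isinstance(r.get(field), (int, float))]
        values = [v for _, v in indexed]
        mu, sigma = mean(values), pstdev(values)
        if sigma == 0:
            continue  # constant column: nothing stands out
        for i, v in indexed:
            if abs(v - mu) > z_threshold * sigma:
                keep.add(i)

    return [rows[i] for i in sorted(keep)]


rows = [{"id": i, "status": "passed", "duration_ms": 10} for i in range(100)]
rows[40]["status"] = "failed"    # an error case
rows[70]["duration_ms"] = 1000   # a numeric outlier
kept = crush_rows(rows)
```

On this input the 100 rows collapse to four: the two boundary rows, the failed test, and the slow one.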

CCR: The Trick That Makes Aggressive Compression Safe

Here’s the insight that separates Headroom from naive summarization: Compress-Cache-Retrieve (CCR).

When SmartCrusher compresses a 1,000-row tool response down to 20 representative rows, the original 1,000 rows aren’t deleted. They’re written to a local LRU cache with a hash key like abc123. The compressed payload sent to the model includes both the 20 rows and a marker:

[1000 items → 20 representative, hash=abc123]

Headroom also injects a headroom_retrieve tool into the LLM’s tool schema:

{
  "name": "headroom_retrieve",
  "parameters": {
    "hash": "string",
    "query": "string (optional)"
  }
}

The model now has a choice. If 20 representative rows are enough to answer the question (and for most tasks, they are), it proceeds normally. No retrieval needed. Zero extra tokens.

But if the model decides it needs more detail (“show me the exact stack trace for error X”), it calls headroom_retrieve with the hash and an optional query. Headroom intercepts that tool call, looks up the original in the local cache, and either returns the full payload or runs a BM25 search over the stored items when a query is present.

The economics work because retrieval is rare. Benchmarks show the model almost never needs to retrieve. The compression strategies keep enough signal (anomalies, boundaries, representative examples) that the compressed view is sufficient for reasoning. CCR gives you the economics of lossy compression with the safety net of lossless retrieval.
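The whole loop fits in a small sketch. Everything below is illustrative: the `CCRCache` class, `compress_with_marker`, and `retrieve` are hypothetical names, the marker string copies the format shown above, and the query path uses a plain substring filter as a stand-in for the BM25 search described in this post.

```python
import hashlib
import json
from collections import OrderedDict
from typing import Optional


class CCRCache:
    """Tiny LRU cache keyed by a content hash."""

    def __init__(self, max_entries: int = 128) -> None:
        self._store: OrderedDict = OrderedDict()
        self._max = max_entries

    def put(self, items: list) -> str:
        blob = json.dumps(items, sort_keys=True).encode("utf-8")
        key = hashlib.sha256(blob).hexdigest()[:12]
        self._store[key] = items
        self._store.move_to_end(key)          # mark as most recently used
        while len(self._store) > self._max:
            self._store.popitem(last=False)   # evict the least recently used
        return key

    def get(self, key: str) -> Optional[list]:
        items = self._store.get(key)
        if items is not None:
            self._store.move_to_end(key)
        return items


def compress_with_marker(items: list, representatives: list, cache: CCRCache) -> dict:
    """Cache the originals and return the compressed view plus a marker."""
    key = cache.put(items)
    marker = f"[{len(items)} items → {len(representatives)} representative, hash={key}]"
    return {"marker": marker, "items": representatives, "hash": key}


def retrieve(cache: CCRCache, key: str, query: Optional[str] = None) -> list:
    """Return the full original payload, or only items matching the query."""
    items = cache.get(key)
    if items is None:
        return []  # evicted or unknown hash
    if not query:
        return items
    terms = query.lower().split()
    # Substring filter as a simple stand-in for ranked search.
    return [it for it in items if any(t in json.dumps(it).lower() for t in terms)]


cache = CCRCache()
items = [{"id": i, "msg": "ok"} for i in range(1000)]
items[500]["msg"] = "timeout error"
payload = compress_with_marker(items, items[:20], cache)
full = retrieve(cache, payload["hash"])
matches = retrieve(cache, payload["hash"], query="timeout")
```

The model sees only `payload["marker"]` and the 20 representative rows; the other 980 stay on disk or in memory until a retrieval call asks for them.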

Where the Real Savings Appear

The numbers from benchmarks and independent testing:

Workload Type | Token Savings | Why
--- | --- | ---
JSON-heavy agents (DB queries, API responses) | 70-90% | Model rarely needs every row
Build/test logs, SRE incidents | 85-95% | Most log lines are noise
Multi-tool, long-running agent sessions | 73-92% | Accumulated structure and duplication
Short chats (<200 tokens) | Minimal | Nothing to compress
Code-only sessions | Low-moderate | Conservative by default for correctness

The pattern is clear: the longer the session and the more tool calls, the bigger the savings. This maps directly to the production agent patterns I wrote about. Complex multi-step workflows with dozens of tool invocations are exactly where costs explode and where Headroom pays for itself.
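As a back-of-the-envelope check using only figures quoted in this post (47 tool calls at 3,000 tokens each, and the 70-90% range for JSON-heavy workloads), the arithmetic looks like this. The helper name is hypothetical, and the calculation deliberately ignores that accumulated history is re-sent on every turn, which would make the absolute numbers larger.

```python
def tokens_after_compression(tool_calls: int, tokens_per_call: int, savings_pct: int) -> int:
    """Tokens remaining after removing savings_pct percent of raw tool output."""
    raw = tool_calls * tokens_per_call
    return raw * (100 - savings_pct) // 100


raw_tokens = 47 * 3_000                                   # 141,000 raw tokens
at_low_savings = tokens_after_compression(47, 3_000, 70)   # 42,300 tokens remain
at_high_savings = tokens_after_compression(47, 3_000, 90)  # 14,100 tokens remain
```

So a session that would have carried 141,000 tokens of tool output ends up carrying somewhere between roughly 14,000 and 42,000.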

How This Connects to Context Engineering

I’ve been writing about context as infrastructure and harness engineering as separate concerns. Headroom collapses them into one layer.

In the harness engineering model, context provision is the second layer where agents fail. You either give the model too little context (it hallucinates) or too much (it gets distracted and costs explode). The manual solution is careful CLAUDE.md files, structured prompts, and disciplined tool output formatting.

Headroom automates the “too much context” problem. It’s a context harness component that sits at the infrastructure level:

┌──────────────────────────────────────────────────┐
│  Your Agent Code                                 │
├──────────────────────────────────────────────────┤
│  Headroom (context optimization layer)           │
│  ┌──────────┐ ┌──────────┐ ┌────────────────────┐│
│  │CacheAlign│ │Compressor│ │CCR Cache + Retrieve││
│  └──────────┘ └──────────┘ └────────────────────┘│
├──────────────────────────────────────────────────┤
│  LLM Provider (Anthropic, OpenAI, Bedrock)       │
└──────────────────────────────────────────────────┘

For large codebases where context windows are already the bottleneck, this kind of automatic compression could mean the difference between “agent loses track after 15 tool calls” and “agent maintains coherence through the entire session.”

The BM25 Connection

If you read my piece on why grep is not BM25, the CCR retrieval mechanism should feel familiar. When the model calls headroom_retrieve with a query, it’s doing sparse retrieval over the cached original content. BM25 ranking over stored items, returning the top matches instead of dumping everything back into context.

This is retrieval-augmented generation at the infrastructure level. Not over an external knowledge base. Over your own agent’s prior tool outputs. The model effectively gets a side-channel database of everything that was compressed away, searchable on demand.
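For readers who want to see what BM25 ranking over cached items involves, here is a minimal sketch of the standard Okapi BM25 scoring formula over whitespace tokens. It is not Headroom's retrieval code; the `bm25_rank` function, its tokenization, and the `k1` and `b` defaults are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_rank(query: str, docs: list[str], k1: float = 1.5,
              b: float = 0.75, top_k: int = 3) -> list[str]:
    """Rank docs against the query with Okapi BM25; return the top matches."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n if n else 0.0
    # Document frequency: in how many docs each term appears.
    df = Counter(term for d in tokenized for term in set(d))

    scores = []
    for idx, d in enumerate(tokenized):
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: long documents are penalized slightly.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append((score, idx))

    # Highest score first; ties keep original order.
    ranked = sorted(scores, key=lambda s: (-s[0], s[1]))
    return [docs[i] for score, i in ranked[:top_k] if score > 0]


cached_items = [
    "test_login passed in 12ms",
    "test_checkout failed: TimeoutError in payment gateway",
    "test_search passed in 8ms",
]
top = bm25_rank("TimeoutError payment", cached_items, top_k=1)
```

Only the matching item comes back, rather than the whole cached payload being dumped into context.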

What Could Go Wrong

The failure mode isn’t data loss. It’s tool-calling behavior.

If the model fails to call headroom_retrieve when it should, you get incomplete answers. The compressed view looked sufficient, but the model missed something in the compressed-away portion. Headroom mitigates this by:

  1. Including explicit omission metadata: “500 items compressed to 20 (480 passed, 15 failed, 5 errored)”
  2. Making CCR markers visible so the model knows retrieval is available
  3. Running a headroom learn command that analyzes failed sessions and writes guidance to improve compression strategies

The headroom learn feedback loop is particularly smart. It observes which compressions lead to retrieval (meaning the initial compression was too aggressive) and adjusts policies per tool and content type over time.

Running as a Proxy (Zero Code Changes)

If you don’t want to touch application code at all, Headroom runs as a local proxy that intercepts API calls transparently:

# Start the proxy (Anthropic backend)
headroom proxy --port 8787

# Point your tools at it
export ANTHROPIC_BASE_URL=http://localhost:8787

For production on macOS, you can run it as a LaunchAgent that starts on login, restarts on crash, and auto-configures your shell:

# Install as persistent service
./install.sh --port 8787

# Add to ~/.zshrc
export HEADROOM_PROXY_PORT=8787
source /path/to/shell-integration.sh

Every LLM call from Claude Code, Cursor, Aider, or your custom agent now flows through Headroom automatically. The proxy applies the full compression pipeline, manages the CCR cache, and forwards optimized requests to the provider.

For AWS Bedrock users, the proxy supports Bedrock as a backend and handles the cachePoint translation for prefix caching:

headroom proxy --backend bedrock --region us-west-2 --port 8787

When It Doesn’t Help

Headroom is explicit about low-benefit zones:

  • Short chats under 200 tokens. Nothing to compress.
  • Code-only sessions with AST compression disabled. Conservative defaults for correctness.
  • Already-compact outputs. If your tools return terse, grep-like results, there’s no fat to trim.
  • System prompts and user messages. Preserved for correctness and cacheability.

The sweet spot is long-running agents with verbose tool outputs: incident debugging, code search, test result analysis, database query results. Exactly the production workloads that cost the most.

The Bottom Line

The industry conversation around agent costs has been focused on model pricing and context window limits. Headroom shifts the question from “how do I fit everything into the context window?” to “why am I sending all of this in the first place?”

The architectural insight is simple: most context an agent accumulates is noise. Build logs where 95% of lines say “OK.” JSON arrays where 900 of 1,000 rows are identical. Search results below the relevance threshold. You’re paying per token for all of it.

Headroom treats this as an infrastructure problem, not a prompting problem. Compress automatically, cache originals locally, let the model pull back what it actually needs. 60-95% savings on heavy workloads without changing your code or your model.

If you’re running production agents and your token bill is climbing, this is worth a look: github.com/chopratejas/headroom.


Running AI agents at scale and fighting context bloat? I’d love to hear what strategies you’re using. Reach out on LinkedIn.