Claude Code in Large Codebases: What Anthropic Got Right and What They Left Out
Anthropic just published “How Claude Code works in large codebases”, their enterprise playbook for deploying Claude Code at scale. Multi-million-line monorepos. Decades-old legacy systems. Distributed architectures spanning dozens of repositories.
I read it and had the same reaction I get when a framework’s creator finally documents the patterns early adopters figured out through pain: “Yes. All of that. And also some things you didn’t mention.”
I’ve been using Claude Code daily for over a year. Not on demo projects. On real production systems. A greenfield voice AI platform with real-time latency constraints and multi-stage state machines. Legacy data services processing thousands of events per second with zero documentation. Client projects where the previous team left and took all tribal knowledge with them. I’ve written about how Claude Code works under the hood, about harness engineering, about context as infrastructure.
The Anthropic article validates a lot of what I’ve learned. But it’s written for enterprise teams with dedicated developer experience organizations. Here’s the practitioner version. What actually works when you’re one engineer deploying this across codebases of wildly different shapes and ages.
The Harness Hierarchy They Got Exactly Right
Anthropic’s article describes five extension points: CLAUDE.md files, hooks, skills, plugins, and MCP servers. They say the order you build them matters.
They’re right. And the order is more important than people realize.
I made the mistake of jumping to MCP servers early because they felt powerful. Connected Claude to a documentation system, a project tracker, a database inspector. Impressive capabilities. But Claude kept making basic mistakes because it didn’t understand the codebase conventions.
The hierarchy should be brutal in its prioritization:
| Priority | Layer | ROI When Added First | ROI When Added Last |
|---|---|---|---|
| 1 | CLAUDE.md files | Immediate, 2-3x quality | Already captured by other layers |
| 2 | Hooks (lint, format, verify) | Eliminates entire error classes | Marginal after good skills |
| 3 | Skills (task-specific) | Progressive context loading | Redundant if CLAUDE.md is bloated |
| 4 | MCP servers | Extends reach to external tools | High value regardless of order |
| 5 | Plugins (distribution) | Team-wide consistency | Only matters at scale |
The key insight: Layers 1-3 are about making the model smarter for your specific codebase. Layers 4-5 are about extending what the model can reach. If you extend reach before establishing intelligence, you get a confused agent with access to everything and understanding of nothing.
Greenfield vs. Brownfield: Two Completely Different Games
The Anthropic article treats large codebases as one category. In practice, greenfield and brownfield deployments need fundamentally different strategies.
Greenfield: Build the Harness as You Build the Code
Starting a new project with Claude Code is the easy case. You have no technical debt, no undocumented conventions, no “everyone knows but nobody wrote down” constraints.
The trap is thinking you don’t need a harness yet because the codebase is small.
I started a voice AI platform from scratch last year. Real-time conversation handling, a multi-stage finite state machine, integrations with speech-to-text and TTS providers. For the first week, Claude worked fine without any configuration. The codebase was tiny, the patterns were obvious.
By week three, it was making dangerous mistakes. Choosing synchronous patterns in code paths that needed streaming. Using raw WebSocket connections instead of the LiveKit SDK abstractions we’d established. Writing state transitions that violated the FSM invariants. Hallucinating API signatures for voice providers it hadn’t seen the docs for.
The codebase was only 4,000 lines. But the domain constraints had already outgrown what Claude could infer from context alone.
The greenfield pattern that works:
## Session 1-2: Scaffold + CLAUDE.md seed
- Set up project structure
- Write initial CLAUDE.md with tech stack, commands, and structure
- Keep it under 50 lines
## Session 3-10: Grow the harness with the code
- Every time Claude makes a wrong assumption, add the constraint
- Move domain-specific rules to subdirectory CLAUDE.md files
- Add verification commands as you add test infrastructure
## Session 10+: Stabilize and prune
- Review CLAUDE.md for outdated rules
- Consolidate patterns into skills for repeatable workflows
- Remove constraints the model no longer needs
The key difference from brownfield: you’re writing the rules as you discover them. There’s no archaeology required.
Brownfield: The Cold-Start Problem Is Real
Legacy codebases are where Claude Code either transforms your productivity or wastes your time. There’s no middle ground.
The Anthropic article mentions “making the codebase navigable at scale.” For brownfield projects, this understates the challenge. The problem isn’t navigation. It’s that nobody alive remembers why half the decisions were made.
I took over a data pipeline service where the original team had moved on. No architecture docs. No ADRs. Git history going back four years with commit messages like “fix” and “wip.” Tests that hadn’t been run in months. Build scripts that referenced deprecated internal services. Thirty-plus microservices with undocumented cross-dependencies.
Claude Code’s first session on that codebase was essentially useless. It would read files, follow references, and produce suggestions that violated invisible constraints baked into the code through years of institutional knowledge.
The brownfield cold-start protocol I developed:
Session 1: Archaeological survey (no code changes)
├── Map the directory structure into CLAUDE.md
├── Identify build commands and document them
├── Run existing tests, document what passes/fails
└── Note patterns: naming conventions, error handling, imports
Session 2: Constraint extraction
├── Read through the most-modified files (git log --stat)
├── Extract implicit rules into explicit CLAUDE.md entries
├── Document "don't touch" areas (legacy code that works but is fragile)
└── Identify the verification boundary (what can be safely tested)
Session 3+: Incremental work with guardrails
├── Never modify more than one module per session
├── Always run verification before declaring done
└── Update CLAUDE.md with every new constraint discovered
Measured difference from my experience:
| Metric | Without cold-start protocol | With cold-start protocol |
|---|---|---|
| First useful PR | Session 5-6 | Session 3 |
| “Broke something invisible” incidents | 4 per week | 0-1 per week |
| Time spent re-explaining context | 40% of sessions | 10% of sessions |
The two hours invested in sessions 1-2 saved roughly eight hours over the following two weeks.
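The "most-modified files" step in session 2 can be made concrete with a small churn ranking. This is a minimal sketch, not part of the protocol as I originally ran it: the function name `rank_by_churn` is hypothetical, and it assumes you feed it the text produced by `git log --name-only --pretty=format:`, which prints one touched path per line with blank lines between commits.

```python
from collections import Counter


def rank_by_churn(name_only_log: str, top: int = 10) -> list[tuple[str, int]]:
    """Count how many commits touched each path, most-touched first.

    `name_only_log` is assumed to be the output of
    `git log --name-only --pretty=format:` (paths only, blank-line separated).
    """
    counts = Counter(
        line.strip() for line in name_only_log.splitlines() if line.strip()
    )
    return counts.most_common(top)


# Tiny hand-written sample standing in for real git output.
sample = "src/a.py\nsrc/b.py\n\nsrc/a.py\nsrc/b.py\n\nsrc/a.py\nsrc/c.py\n"
for path, commits in rank_by_churn(sample, top=2):
    print(f"{commits:3d}  {path}")
```

The files at the top of this list are usually where the implicit rules live, so they are the ones worth reading first during constraint extraction.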
The “Lost in the Middle” Problem Is Worse Than They Say
Anthropic’s article recommends “keeping CLAUDE.md files lean and layered.” I’ve written about this before. But the problem goes deeper than file length.
Claude Code injects a system reminder alongside your CLAUDE.md contents telling the model that these instructions “may or may not be relevant.” The more irrelevant content you have, the more likely Claude ignores everything, including your critical rules.
The research backs this up. Liu et al.’s “Lost in the Middle” paper found that LLMs use information in the middle of long contexts significantly less reliably than information at the beginning or end.
What this means practically:
Your first 5 lines and last 5 lines of CLAUDE.md are the most likely to be followed. Everything in the middle is probabilistic. A 600-line CLAUDE.md isn’t 6x more informative than a 100-line one. It’s probably less effective.
I keep my root CLAUDE.md under 80 lines. Here’s the structure that survives the “relevance filter”:
# CLAUDE.md
## Commands (always relevant)
pnpm dev / pnpm build / pnpm test / pnpm lint
## Architecture (always relevant)
[5-7 lines: tech stack, file structure, key patterns]
## Critical Constraints (5-10 hard rules)
[Only rules that apply to >80% of tasks]
## Navigation (pointers, not content)
[Links to subdirectory CLAUDE.md files]
Everything else lives in subdirectory files or skills that load on demand. This isn’t optional. It’s the difference between Claude following your rules and Claude ignoring them.
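A line budget only holds if something checks it. The sketch below is one way to do that, under assumptions of my own: the name `check_claude_md` is hypothetical, the budget counts non-blank lines, and the four required headings are taken from the structure shown above.

```python
REQUIRED_SECTIONS = ("Commands", "Architecture", "Critical Constraints", "Navigation")


def check_claude_md(text: str, max_lines: int = 80) -> list[str]:
    """Return human-readable warnings; an empty list means the file passes."""
    warnings = []
    lines = text.splitlines()

    # Assumption: blank lines do not count against the budget.
    non_blank = [ln for ln in lines if ln.strip()]
    if len(non_blank) > max_lines:
        warnings.append(
            f"{len(non_blank)} non-blank lines exceeds budget of {max_lines}"
        )

    # Collect level-2 headings and check each required section is present.
    headings = [ln[3:].strip() for ln in lines if ln.startswith("## ")]
    for required in REQUIRED_SECTIONS:
        if not any(h.startswith(required) for h in headings):
            warnings.append(f"missing section: {required}")
    return warnings
```

Run against the root file in CI or a pre-commit step, a non-empty result is the signal to move content into a subdirectory file or a skill.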
Agentic Search vs. RAG: The Tradeoff Nobody Talks About
The Anthropic article correctly explains why Claude Code uses grep-based agentic search instead of RAG. No stale indexes. Always-fresh results. Zero infrastructure.
What they don’t say: the token cost on large codebases is brutal.
On a large monorepo with tens of thousands of files, I’ve watched Claude burn 15,000+ tokens just finding the right entry point. Three minutes of grep loops before it even starts working on the actual task. That’s $0.30-0.50 of API cost and three minutes of wall-clock time on search alone.
The article mentions LSP integrations as the fix. That’s correct but incomplete. Here’s the full hierarchy of search optimization:
Level 1: Good CLAUDE.md navigation (free, immediate)
├── "API routes are in src/api/routes/*.ts"
├── "Database models are in src/models/"
└── Result: Claude goes directly to the right directory
Level 2: .claudeignore / permissions.deny (free, immediate)
├── Exclude node_modules, dist, generated files
├── Exclude vendored third-party code
└── Result: 60-80% reduction in search space
Level 3: LSP integration (setup cost, high value)
├── Symbol-level navigation instead of string matching
├── "Go to definition" instead of grep
└── Result: Eliminates false-positive grep matches
Level 4: MCP-based search (infra cost, highest value)
├── Semantic search for conceptual queries
├── Structured codebase index with metadata
└── Result: "Find all error handling patterns" actually works
Most developers jump to Level 4. But Levels 1-2 are free and eliminate most of the wasted search tokens. I’ve seen my average search cost drop from ~12,000 tokens to ~3,000 tokens, roughly a 75% reduction, just by writing better navigation in CLAUDE.md and excluding irrelevant directories.
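For Level 2, the exclusions can be generated rather than hand-typed. This is a hedged sketch: `deny_rules` is a hypothetical helper, and the `Read(./path/**)` rule form is my understanding of the permission syntax in Claude Code's settings file, so verify it against the current documentation before relying on it.

```python
import json


def deny_rules(directories: list[str]) -> dict:
    """Build a settings fragment that denies reads under each directory.

    The `Read(./dir/**)` rule shape is an assumption about the settings
    format, not something confirmed by this article.
    """
    rules = [f"Read(./{d.strip('/')}/**)" for d in directories]
    return {"permissions": {"deny": rules}}


# Directories named in Level 2 above, plus a vendored-code example.
fragment = deny_rules(["node_modules", "dist", "vendor/"])
print(json.dumps(fragment, indent=2))
```

The printed JSON would be merged into the project's settings file by hand; the sketch deliberately does not write to disk.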
Subagents Changed How I Work
The Anthropic article mentions subagents briefly: “split exploration from editing.” That undersells it.
Subagents are the single biggest workflow improvement I’ve made with Claude Code in 2026. Not because they’re faster. Because they protect the main context window.
The problem with large codebase work: by the time Claude has explored enough to understand the problem, it’s burned 40-60% of its context window on exploration output. Then it has less room to reason about the solution.
The pattern I use daily:
Main agent (orchestrator):
├── Spawn subagent 1: "Map the authentication flow. Write findings to /tmp/auth-map.md"
├── Spawn subagent 2: "Find all callers of processPayment. List them in /tmp/payment-callers.md"
├── Wait for results
├── Read the summary files (not the raw exploration)
└── Make edits with full context window available for reasoning
Each subagent gets its own context window. It can burn 50,000 tokens exploring without polluting the main agent’s working memory. The main agent only ingests the distilled findings.
Measured impact on a real refactoring task:
| Approach | Context used for exploration | Context remaining for edits | Quality of solution |
|---|---|---|---|
| Single agent | 62% | 38% | Missed 2 edge cases |
| Orchestrator + 2 subagents | 8% (reading summaries) | 92% | Caught all edge cases |
The orchestrator pattern isn’t just for huge codebases. I use it on mid-sized projects too, any time a task requires understanding multiple systems before making changes.
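The arithmetic behind that table is simple enough to sketch. The 200,000-token window below is an assumption chosen for illustration, as are the token counts, which were picked to reproduce the 62% and 8% figures; the point is only that tokens a subagent spends never count against the orchestrator's window.

```python
def remaining_for_edits(window_tokens: int, exploration_tokens: int) -> float:
    """Fraction of the main context window still free after exploration."""
    if not 0 <= exploration_tokens <= window_tokens:
        raise ValueError("exploration_tokens must be between 0 and window_tokens")
    return 1 - exploration_tokens / window_tokens


WINDOW = 200_000  # assumed window size, for illustration only

# Single agent: all exploration output lands in the main window.
single_agent = remaining_for_edits(WINDOW, 124_000)

# Orchestrator: subagents explore in their own windows; the main agent
# only pays for reading their summary files.
orchestrator = remaining_for_edits(WINDOW, 16_000)

print(f"{single_agent:.0%} vs {orchestrator:.0%}")  # prints "38% vs 92%"
```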
Hooks for Self-Improvement, Not Just Guardrails
Anthropic calls this out explicitly: “hooks make the setup self-improving.” Most teams still use hooks as guardrails only. Don’t commit without tests passing. Don’t push to main directly.
The self-improving use case is more powerful. Here’s a stop hook pattern I use:
#!/bin/bash
# .claude/hooks/post-session-reflect.sh
# Runs after each Claude Code session
# Check if CLAUDE.md was violated during the session
if grep -q "FIXME: violated convention" /tmp/claude-session-log 2>/dev/null; then
  echo "Consider updating CLAUDE.md with the convention that was violated"
fi
But the real game-changer is using Claude itself as the hook. After completing a task, ask it: “Did you encounter any conventions in this codebase that aren’t documented in CLAUDE.md? If so, what would you add?”
I do this manually every 5-10 sessions. The answers that come back are surprisingly precise:
- “The `src/utils/` directory uses a barrel export pattern, but this isn’t documented anywhere. I had to discover it by reading existing files.”
- “Error responses in the API routes always use the `AppError` class, but I almost used a raw throw because nothing in CLAUDE.md mentioned it.”
- “Tests in this project use factories in `test/factories/` rather than inline fixtures. I learned this by reading other tests, but it should be explicit.”
Each of these becomes a one-line addition to the relevant CLAUDE.md. The harness literally improves itself through use.
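Folding those answers back in is mostly a deduplication problem. Here is a minimal sketch, assuming the suggestions have already been trimmed to one line each; `merge_conventions` is a hypothetical helper, not something Claude Code provides.

```python
def merge_conventions(claude_md: str, suggestions: list[str]) -> str:
    """Append suggestions as bullets, skipping ones already present.

    Matching is case-insensitive and ignores a leading "- " bullet marker.
    """
    existing = {
        line.lstrip("- ").strip().lower()
        for line in claude_md.splitlines()
        if line.strip()
    }
    additions = []
    for suggestion in suggestions:
        key = suggestion.strip().lower()
        if key and key not in existing:
            additions.append(f"- {suggestion.strip()}")
            existing.add(key)  # also guards against duplicates in the input
    if not additions:
        return claude_md
    return claude_md.rstrip("\n") + "\n" + "\n".join(additions) + "\n"
```

I would still review the result before committing it. The sketch only prevents the same rule from piling up twice; it does not judge whether a suggestion deserves a line.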
Let Deterministic Tools Handle the Housekeeping
Here’s something the Anthropic article doesn’t address: how much of your CLAUDE.md should actually be enforced by the model at all.
The answer is less than you think.
Every rule that can be expressed as a deterministic check should be a tool, not an instruction. “Use double quotes” is a linter rule. “Import order should be stdlib, third-party, local” is a formatter rule. “No unused variables” is a static analysis rule. Writing these in CLAUDE.md is asking a probabilistic system to enforce deterministic constraints. It will forget. Guaranteed.
I’ve written about the Rust rewrite revolution in modern tooling. The tools that matter for Claude Code harnesses:
Python: Ruff handles everything. Linting, formatting, import sorting. Its documentation advertises it as 10-100x faster than existing linters and formatters such as Flake8 and Black. One config in pyproject.toml. When Claude generates Python code with wrong import order or inconsistent quotes, Ruff fixes it in about 0.3 seconds without burning a single token on “please remember to sort imports.”
TypeScript/JavaScript: Ultracite (Biome-based) does the same. One binary. Lint + format. No ESLint/Prettier config conflicts. No plugin dependencies. I use it on every JS/TS project now. The pre-configured rules are tuned for AI-generated code specifically, catching patterns that Claude tends to produce (unused imports from component frontmatter, over-eager type assertions).
The harness pattern:
# Hook: runs after every Claude edit
ruff check --fix . # Python: 0.3s
ruff format . # Python: 0.1s
pnpm lint:fix # TypeScript (Ultracite/Biome): 0.4s
This means my CLAUDE.md never contains formatting or style rules. Zero lines wasted on things tools handle deterministically. Every line in CLAUDE.md is reserved for things only the model can judge: architectural decisions, domain constraints, business logic patterns.
| Category | Enforced By | Example |
|---|---|---|
| Formatting | Ruff / Ultracite | Quote style, indentation, line length |
| Import order | Ruff / Biome | Stdlib first, third-party, local |
| Unused code | Ruff / Biome | Dead imports, unreachable branches |
| Naming conventions | Ruff / Biome rules | snake_case for Python, camelCase for TS |
| Architecture decisions | CLAUDE.md | “Use the repository pattern for data access” |
| Domain constraints | CLAUDE.md | “FSM transitions must be validated against the state graph” |
| Integration patterns | CLAUDE.md | “All voice provider calls go through the adapter layer” |
The split is clean: tools handle syntax, the model handles semantics. This frees up CLAUDE.md for the rules that actually need model intelligence. It also means Claude’s output is automatically cleaned on every edit, so you never waste a follow-up prompt on “fix the formatting.”
The compound effect: With deterministic tools as hooks, Claude’s code is always clean when you review it. You never waste cognitive cycles on style issues. Every review focuses on logic, architecture, and correctness. This adds up fast across hundreds of sessions.
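In a mixed Python and TypeScript repository, the per-edit hook in this section can dispatch on file type instead of running every tool on every edit. The sketch below covers only that dispatch step. The `FORMATTERS` table and `commands_for` are hypothetical names, the commands are the ones listed in the hook above, and how the edited path reaches the script depends on your hook configuration, so that part is left out.

```python
from pathlib import Path

# Hypothetical mapping from file suffix to the commands from the hook above.
FORMATTERS = {
    ".py": [["ruff", "check", "--fix"], ["ruff", "format"]],
    ".ts": [["pnpm", "lint:fix"]],
    ".tsx": [["pnpm", "lint:fix"]],
}


def commands_for(file_path: str) -> list[list[str]]:
    """Return the argv lists to run for one edited file (empty if none apply)."""
    suffix = Path(file_path).suffix.lower()
    commands = []
    for base in FORMATTERS.get(suffix, []):
        if base[0] == "ruff":
            # Ruff accepts a path, so scope it to the edited file.
            commands.append(base + [file_path])
        else:
            # The pnpm script is assumed to run project-wide.
            commands.append(list(base))
    return commands
```

Each returned list can be passed to `subprocess.run`. Scoping Ruff to the single edited file keeps the hook fast as the repository grows.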
Compound Engineering: Making the Harness Grow Itself
Anthropic talks about plugins as a way to “distribute what works.” That’s the organizational answer. But what about the individual developer question: how do you make your harness get better automatically, not just through manual CLAUDE.md updates every few sessions?
This is where the Compound Engineering plugin changed my workflow. I’ve written a full comparison of it against other frameworks, but in the context of large codebases, it solves a specific problem the Anthropic article doesn’t address: harness decay.
The problem: you set up a great CLAUDE.md, build hooks, create skills. Then you ship features for three weeks. The codebase evolves. New patterns emerge that aren’t documented. Old patterns get deprecated but the docs still reference them. Your harness slowly drifts out of sync with reality.
Compound Engineering adds a fourth step to the standard plan/work/review cycle:
/ce:plan → Structure the implementation
/ce:work → Agents execute with parallel task management
/ce:review → Multi-persona code review (5 specialized reviewers)
/ce:compound → "What did we learn? What should change?"
That last step is what makes this relevant to large codebases. After every meaningful piece of work, /ce:compound asks: what patterns emerged? What should agents know next time? What documentation is now stale?
The answers get codified. Not as ephemeral chat history. As actual updates to your project’s knowledge base that agents consult in future sessions.
Why this matters at scale: In a large codebase, the rate of pattern discovery is high. You’re constantly learning “oh, this module uses a different error handling approach” or “this service has an undocumented rate limit.” Without a compounding mechanism, those learnings exist only in your head until you remember to update CLAUDE.md manually.
The multi-persona review is the other differentiator for large codebases. When /ce:review spawns five specialized reviewers (correctness, security, performance, testing, maintainability), each reviewing your diff independently, it catches cross-cutting concerns that a single-pass review misses. On a voice AI system with real-time constraints, the performance reviewer caught a blocking call in an async pipeline that would have added 200ms of latency. The security reviewer flagged an unvalidated webhook signature I’d missed.
Measured impact over one month:
| Metric | Without compounding | With /ce:compound |
|---|---|---|
| Times Claude repeated a known mistake | 8-10 per week | 1-2 per week |
| CLAUDE.md staleness (days since last meaningful update) | 12-15 days | 2-3 days |
| First-draft quality (PRs needing zero revision) | ~30% | ~55% |
The compounding isn’t magic. It’s discipline. But having it built into the workflow loop means it actually happens instead of being a thing you intend to do “when you have time.”
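Independent of any plugin, one cheap decay signal is a CLAUDE.md that points at paths which no longer exist. This is a minimal sketch under my own assumptions: paths are written in backticks and contain at least one slash, glob patterns are skipped, and `stale_references` is a hypothetical name.

```python
import re

# Backticked tokens that look like paths: at least one "/" and no spaces.
PATH_PATTERN = re.compile(r"`([\w./-]+/[\w./*-]*)`")


def stale_references(claude_md: str, existing: set[str]) -> list[str]:
    """Return backticked paths in `claude_md` that are not in `existing`."""
    seen, stale = set(), []
    for match in PATH_PATTERN.finditer(claude_md):
        path = match.group(1).rstrip("/")
        if "*" in path or path in seen:
            continue  # skip glob patterns and repeated mentions
        seen.add(path)
        if path not in existing:
            stale.append(path)
    return stale
```

The `existing` set could be built from the working tree, for example with `pathlib.Path.rglob` on a POSIX system. Anything this returns is a candidate for the "what documentation is now stale?" question, not proof that a rule is wrong.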
The Model Upgrade Trap
The Anthropic article warns about this: “instructions written for your current model can work against a future one.” I’ve experienced this firsthand twice now.
When Opus 4.5 was released, I had 15 rules in my CLAUDE.md that were workarounds for Sonnet’s limitations. Rules like “always break refactors into single-file changes” and “never modify more than 3 files without verification.” These made sense for Sonnet. They actively handicapped Opus, which handles coordinated multi-file edits well.
My review cadence:
After every major model release:
1. Disable all "behavioral guardrail" rules (keep structural ones)
2. Run 5 representative tasks with no constraints
3. Note which tasks still need guardrails vs. which the model handles natively
4. Re-enable only the rules that still provide value
5. Archive removed rules with a note: "Removed for Opus 4.5+, model handles natively"
The archive step is important. If you downgrade models for cost reasons or switch between models per task, you may need those rules back. Don’t delete. Annotate and shelve.
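The annotate-and-shelve step can be sketched as a small data transformation. The `Rule` type, the "structural" and "behavioral" labels as field values, and `review_rules` are all hypothetical; the note format mirrors the one in step 5 above.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    text: str
    kind: str  # "structural" or "behavioral" (labels assumed for this sketch)


def review_rules(
    rules: list[Rule], still_needed: set[str], model: str
) -> tuple[list[str], list[str]]:
    """Split rules into (active, archived) after a model-upgrade trial.

    Structural rules always stay. Behavioral rules stay only if the trial
    tasks showed they are still needed; the rest are archived with a note.
    """
    active, archived = [], []
    for rule in rules:
        if rule.kind == "structural" or rule.text in still_needed:
            active.append(rule.text)
        else:
            archived.append(
                f"{rule.text}  (Removed for {model}+, model handles natively)"
            )
    return active, archived
```

Keeping the archived list in a separate file next to CLAUDE.md makes it easy to restore a rule when switching back to a smaller model.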
What Anthropic Missed: The Solo Developer Story
The Anthropic article is written for enterprise teams. Dedicated infrastructure teams. Managed plugin marketplaces. Cross-functional working groups. Agent managers.
Most developers using Claude Code don’t have any of that. They’re solo developers or small teams. No one is going to “assign ownership for Claude Code management and adoption” when the team is three people.
For solo developers and small teams, the priority order shifts:
- CLAUDE.md files. Same as enterprise. This is universal.
- Verification hooks. Automate what you’d forget to check manually. Lint, typecheck, test.
- Skills for your top 3 workflows. Not 30 skills. Three. The ones you do daily.
- MCP for tools you actually use. Your project tracker. Your documentation. That’s it.
- Skip plugins entirely unless you’re distributing to others.
The ROI curve for solo developers is steeper on the basics and flatter on the organizational tooling. Invest 80% of your configuration time in CLAUDE.md quality and hook-based verification. The rest is diminishing returns until your team grows.
The Bottom Line
Anthropic’s article confirms what practitioners have known: the harness matters more than the model. The patterns they describe, layered CLAUDE.md files, progressive context loading, scoped verification, are exactly right.
What I’d add from a year of daily use: the greenfield/brownfield distinction matters enormously, the “Lost in the Middle” problem makes CLAUDE.md length a critical design decision, subagents change the economics of exploration, deterministic tools like Ruff and Ultracite should handle everything they can so CLAUDE.md stays focused on what only the model can judge, and compounding mechanisms like /ce:compound prevent harness decay across weeks of active development.
The strongest Claude Code setup isn’t the one with the most features configured. It’s the one where every configuration element earns its keep by making the next session measurably better than the last.
Start with 50 lines of CLAUDE.md. Add one thing at a time. Measure whether it helps. Remove what doesn’t. That’s the whole playbook.
Deploying Claude Code on a large codebase, legacy system, or greenfield project? I’d love to hear what patterns are working and where you’re still hitting walls. Reach out on LinkedIn.