grep Is Not BM25: The Difference Between Pattern Matching and Retrieval

5 min read

Search · Information Retrieval · RAG · Linux

I caught myself explaining BM25 as “fancy grep” to a colleague last week. Mid-sentence, I realized I was saying something fundamentally wrong.

grep and sparse retrieval (BM25, SPLADE, TF-IDF) look similar on the surface. Both find text. Both match terms. But they solve different problems at entirely different points on the “from raw bytes to relevance” spectrum. Conflating them leads to bad architecture decisions, especially when building RAG systems.

In my Dense vs Sparse Retrieval post, I covered how sparse retrieval compares to dense (embedding-based) search. This post zooms in on the other end: what separates sparse retrieval from plain text matching in the first place.

What grep Actually Does

grep scans bytes. That’s it. Given a pattern (literal string or regex), it reads files line by line and prints every line that matches. No indexing. No scoring. No concept of “documents” or “relevance.”

# Find all lines containing "connection_pool" in Python files
grep -rn "connection_pool" --include="*.py" .

# Regex: find function definitions with "cache" in the name
grep -rn "def .*cache.*(" --include="*.py" .

The output is binary: a line matches or it doesn’t. Results come back in file order, not relevance order. grep doesn’t know which match is more important.

The complexity model is simple: search time scales linearly with input size. Scan 1GB of logs, it reads 1GB. Scan 10GB, it reads 10GB. Tools like ripgrep optimize this with parallelism and memory mapping, but the fundamental model is the same: no index, brute-force scan.
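That brute-force model fits in a few lines of Python. This is a sketch to illustrate the mechanics, not a grep replacement (real grep is far faster and handles encodings, binary files, and more):

```python
import re
from pathlib import Path

def mini_grep(pattern: str, root: str, glob: str = "*.py"):
    """Minimal grep: linear scan, binary match, results in file order."""
    regex = re.compile(pattern)
    for path in sorted(Path(root).rglob(glob)):
        try:
            lines = path.read_text(errors="ignore").splitlines()
        except OSError:
            continue
        for lineno, line in enumerate(lines, start=1):
            if regex.search(line):  # match or no match -- nothing in between
                yield f"{path}:{lineno}: {line}"

# No index, no scoring: every line of every candidate file is read
# on every search, so cost grows linearly with corpus size.
```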

This makes grep perfect for:

  • Hunting through logs for a specific error code
  • Finding where a function is defined in a codebase
  • One-off pattern checks where you know exactly what you’re looking for

What BM25 Actually Does

BM25 is a completely different beast. It operates on three concepts that grep doesn’t have:

1. An inverted index. Before any search happens, BM25 builds a data structure that maps every term in the vocabulary to a list of documents where it appears, along with per-document counts (many indexes also store positions, though BM25 scoring only needs the counts). This is the “pre-built library catalog” that makes retrieval fast.

2. Term statistics. BM25 computes two signals per term: how often it appears in the current document (term frequency) and how rare it is across the entire corpus (inverse document frequency). A term that appears in 90% of documents (“the”, “is”) gets low weight. A term that appears in 0.1% of documents (“SPLADE”, “pgvector”) gets high weight.

3. Relevance scoring. Each document gets a numerical score based on how well it matches the query, weighted by those statistics. The output is a ranked list: document #47 scores 8.3, document #12 scores 6.1, document #203 scores 4.7.

grep output:
  file1.py:23: connection_pool = create_pool(max_size=10)
  file1.py:45: connection_pool.close()
  file2.py:8:  from db import connection_pool

BM25 output:
  Doc #47  →  score: 8.3  (3 mentions, rare term, short doc)
  Doc #12  →  score: 6.1  (2 mentions, common co-terms)
  Doc #203 →  score: 4.7  (1 mention, long doc dilutes score)

The complexity model is fundamentally different: retrieval is sublinear in corpus size. BM25 only touches the postings lists for terms in your query, not every byte of every document. Search 10 million documents for “connection pooling” and it reads two postings lists, not 10 million files.
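To make the three concepts concrete, here is a minimal in-memory sketch of the standard Okapi BM25 formula (with the conventional k1=1.5, b=0.75 defaults; the whitespace tokenizer is a deliberate simplification, not what production engines do):

```python
import math
from collections import Counter, defaultdict

class BM25Index:
    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.doc_tokens = [doc.lower().split() for doc in docs]
        self.doc_len = [len(toks) for toks in self.doc_tokens]
        self.avg_len = sum(self.doc_len) / len(self.doc_len)
        self.n_docs = len(docs)
        # Concept 1: the inverted index, term -> {doc_id: term_frequency}
        self.postings = defaultdict(dict)
        for doc_id, tokens in enumerate(self.doc_tokens):
            for term, tf in Counter(tokens).items():
                self.postings[term][doc_id] = tf

    def idf(self, term):
        # Concept 2: corpus statistics. Rare terms weigh heavily,
        # ubiquitous terms ("the", "is") score near zero.
        df = len(self.postings.get(term, {}))
        return math.log((self.n_docs - df + 0.5) / (df + 0.5) + 1)

    def search(self, query, top_k=5):
        # Concept 3: relevance scoring. Only the postings lists for the
        # query terms are touched -- sublinear in corpus size.
        scores = defaultdict(float)
        for term in query.lower().split():
            weight = self.idf(term)
            for doc_id, tf in self.postings.get(term, {}).items():
                norm = 1 - self.b + self.b * self.doc_len[doc_id] / self.avg_len
                scores[doc_id] += weight * tf * (self.k1 + 1) / (tf + self.k1 * norm)
        return sorted(scores.items(), key=lambda x: -x[1])[:top_k]
```

The output of `search` is exactly the ranked (doc_id, score) list described above; documents containing none of the query terms are never even visited.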

The Comparison Table

Aspect               grep                            BM25 / Sparse Retrieval
-------------------  ------------------------------  -----------------------------------------
Input model          Raw bytes/lines in files        Tokenized documents in an inverted index
Query type           Literal string or regex         Bag of terms (keywords)
Output               All matching lines, unranked    Top-k documents, ranked by relevance
Scoring              None (match or no match)        TF-IDF weighted relevance score
Indexing             None, scans files directly      Pre-built inverted index
Search complexity    O(n) in corpus size             Sublinear, touches only matching postings
Scale sweet spot     Local files, GBs                Large collections, millions of docs

Where the Confusion Comes From

The confusion is understandable. Both grep and BM25 work with exact terms. Neither understands meaning or synonyms (that’s what dense retrieval adds). And for small corpora, the practical difference is invisible.

Search 500 markdown files with grep? You get results in milliseconds. Search the same 500 files with BM25? Also milliseconds. At that scale, the index overhead of BM25 feels like unnecessary complexity.

The gap becomes obvious at scale:

Corpus: 500 files (2MB)
  grep:  ~5ms
  BM25:  ~3ms (after indexing)

Corpus: 100K documents (500MB)
  grep:  ~800ms
  BM25:  ~5ms

Corpus: 10M documents (50GB)
  grep:  not practical
  BM25:  ~10ms

The key insight: grep is O(n) in corpus size. BM25’s query cost depends on the length of the postings lists for the query terms, which grows far more slowly than the corpus itself. As your corpus grows, grep slows down proportionally. BM25 barely notices.

When to Use Which

Use grep when:

  • You’re searching local files you can see (source code, configs, logs)
  • You know the exact string or pattern you want
  • The corpus is small enough that linear scan is fast
  • You need regex power (lookaheads, capture groups, complex patterns)

Use BM25/sparse retrieval when:

  • You’re building search for users who don’t know exact terms
  • You need ranked results, not all matches
  • Your corpus is large (thousands to millions of documents)
  • You want to combine with dense retrieval for hybrid search

Use both:

  • Agentic coding tools like Claude Code combine grep-style search for precise code lookups with retrieval-based search for broader context gathering. The two are complementary, not competing.

The Agentic Search Angle

Here’s where it gets interesting. Modern AI coding agents blur the line. Tools like Claude Code, Cursor, and Windsurf use grep (or ripgrep) as a first-pass tool for precise pattern matching: “find all files importing this module.” But they increasingly layer retrieval on top for broader questions: “find code related to authentication.”

Some recent approaches even use LLMs to generate better grep patterns. But the fundamental limitation remains: grep scales linearly and has no notion of relevance. As codebases grow past a certain size, agents that rely only on grep-style search start missing context that retrieval-based approaches would surface.

The evolution looks like this:

grep (exact match) → BM25 (ranked terms) → Dense (semantic) → Hybrid (all three)

Each step adds a capability the previous one lacks. grep adds pattern power over raw text. BM25 adds ranking and corpus statistics. Dense adds semantic understanding. Hybrid combines all three. The best retrieval systems in production use the full spectrum.
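One common way hybrid systems merge these signals is reciprocal rank fusion (RRF), which combines ranked lists from different retrievers without having to calibrate their incompatible score scales. A sketch (k=60 is the conventional default; the input rankings are hypothetical):

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked result lists (each a list of doc ids, best first).
    A doc at rank r in one list contributes 1 / (k + r) to its fused score."""
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda x: -x[1])

# Hypothetical outputs from two retrievers over the same query:
bm25_hits = ["doc_47", "doc_12", "doc_203"]
dense_hits = ["doc_12", "doc_91", "doc_47"]
# doc_12 wins the fused ranking: it sits near the top of both lists.
print(reciprocal_rank_fusion([bm25_hits, dense_hits]))
```

The appeal is that each retriever only contributes ranks, not raw scores, so a BM25 score of 8.3 and a cosine similarity of 0.71 never have to be compared directly.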

The Bottom Line

grep is a microscope. It shows you exactly where a pattern appears in raw text, with precision and speed. BM25 is a library catalog. It tells you which documents are most relevant to a query, ranked by statistical importance.

Calling BM25 “fancy grep” is like calling a search engine “fancy Ctrl+F.” They share the surface behavior of finding text, but the underlying mechanics, scaling properties, and use cases are fundamentally different. Understanding where each one sits on the search spectrum is how you make the right architecture call for your retrieval system.


Building search systems and debating grep vs proper retrieval? I’d love to hear what you’re using. Reach out on LinkedIn.