RAG for Dummies: What Nobody Tells You About Building Search That Works
Everyone talks about RAG like it’s magic. Throw some documents into a vector database, add an LLM, and suddenly you have a chatbot that knows everything about your company.
Here’s the truth I learned after building multiple RAG systems: it’s not magic. It’s plumbing. And like all plumbing, the quality of your results depends entirely on how well you understand the pipes.
Most RAG tutorials skip the fundamentals. They show you the happy path, the demo that works on 10 documents, then leave you wondering why your production system returns garbage when you scale to 10,000. This guide fixes that.
What RAG Actually Is
RAG stands for Retrieval-Augmented Generation. Strip away the jargon, and it’s a two-step process:
- Find relevant information from your documents
- Generate an answer using that information
The “retrieval” part searches your knowledge base. The “generation” part is the LLM synthesizing an answer. Simple concept, complex execution.
%%{init: {"layout": "dagre"}}%%
flowchart LR
Q[User Question] --> E1[Embed Query]
E1 --> VS[Vector Search]
VS --> R[Retrieved Chunks]
R --> LLM[LLM Generation]
LLM --> A[Answer]
subgraph Knowledge Base
D[Documents] --> C[Chunking]
C --> E2[Embed Chunks]
E2 --> VDB[(Vector DB)]
end
VDB --> VS
Why this matters: The LLM only knows what you retrieve. Bad retrieval means bad answers. No amount of prompt engineering fixes garbage context.
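In code, the whole loop is only a few lines. This is a sketch of the shape of it, with retrieve and llm as placeholders for the retriever and model client the rest of this guide covers:
# Sketch of the RAG loop; `retrieve` and `llm` are placeholders for your own retriever and model client
def answer(question, retrieve, llm, k=5):
    chunks = retrieve(question, k=k)                          # step 1: find relevant information
    context = "\n\n".join(chunk["text"] for chunk in chunks)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return llm(prompt)                                        # step 2: generate with that context
Everything that follows is about making that retrieve call return the right chunks.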
Embeddings: The Math Behind Meaning
Embeddings are where the magic actually happens. They convert text into numbers, specifically high-dimensional vectors, where similar meanings cluster together.
Think of it like GPS coordinates for concepts. “Car” and “automobile” end up near each other in vector space. “Car” and “banana” are far apart.
How embeddings learn this:
The model trains on billions of text examples, learning patterns like “these two sentences mean the same thing” and “these two are different.” Over time, it learns to place similar concepts close together in vector space.
Why this matters: Better training means the model captures subtle differences. A well-trained model distinguishes between “company vacation policy” and “company sick leave policy.” A poorly-trained one might confuse them.
The dimension trade-off:
| Dimensions | Memory | Accuracy | Use Case |
|---|---|---|---|
| 384 | Low | Good | Standard semantic search |
| 768 | Medium | Better | Complex documents |
| 1536 | High | Best | Multi-modal, specialized domains |
Start at 384-768 dimensions. Higher dimensions don’t automatically mean better results. A well-trained 384-dimensional model beats a poorly-trained 4096-dimensional one every time.
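To see what "close in vector space" means in practice, here's a minimal sketch using the open-source sentence-transformers library and all-MiniLM-L6-v2, one common 384-dimensional model (any model from the table behaves the same way):
# Sketch: embed a few phrases and measure how close they sit in vector space.
# Assumes `pip install sentence-transformers`; all-MiniLM-L6-v2 is one common 384-dimensional model.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(["car", "automobile", "banana"], normalize_embeddings=True)

print(embeddings.shape)                            # (3, 384): one 384-dimensional vector per phrase
print(util.cos_sim(embeddings[0], embeddings[1]))  # car vs automobile: high similarity
print(util.cos_sim(embeddings[0], embeddings[2]))  # car vs banana: much lower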
Chunking: The Most Underrated Decision
Here’s what trips up every beginner: you can’t embed entire documents. A 50-page PDF doesn’t fit in one vector. You need to break it into chunks.
Chunk size matters more than you think:
Too small (100 tokens):
"The company policy states that" → Missing context
"employees should" → Incomplete thought
Too large (2000 tokens):
"[entire section about HR policies]" → Diluted meaning, matches everything poorly
Just right (300-500 tokens):
"Employees are entitled to 15 days of annual leave per year.
Leave requests must be submitted at least two weeks in advance
through the HR portal." → Complete, specific, searchable
The overlap trick:
Chunks with 20-30% overlap prevent losing context at boundaries. If your chunk size is 512 tokens, use a 128-token overlap.
%%{init: {"layout": "dagre"}}%%
flowchart LR
subgraph Document
A[Chunk 1: tokens 0-512]
B[Chunk 2: tokens 384-896]
C[Chunk 3: tokens 768-1280]
end
A -.overlap.-> B
B -.overlap.-> C
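In code, that sliding window looks roughly like this. The sketch assumes the tiktoken tokenizer; use whatever tokenizer matches your embedding model:
# Sketch: split text into 512-token chunks with 128-token overlap.
# Assumes `pip install tiktoken`; swap in the tokenizer that matches your embedding model.
import tiktoken

def chunk_text(text, chunk_size=512, overlap=128):
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    step = chunk_size - overlap  # advance 384 tokens per chunk, so 128 tokens repeat at each boundary
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append(enc.decode(window))
        if start + chunk_size >= len(tokens):
            break
    return chunks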
Semantic chunking: The advanced approach uses AI to detect topic boundaries rather than arbitrary token limits. More expensive during preprocessing, but finds 15-25% more relevant results. Worth it for documents with distinct sections.
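One way to approximate semantic chunking yourself: embed individual sentences and start a new chunk wherever similarity between neighbors drops. A rough sketch, again using sentence-transformers (the 0.5 threshold is an arbitrary starting point you would tune, not a recommendation):
# Sketch: split where adjacent sentences stop resembling each other.
# Assumes `pip install sentence-transformers`; the threshold is a tuning knob, not a magic number.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_chunks(sentences, threshold=0.5):
    embeddings = model.encode(sentences, normalize_embeddings=True)
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        # A similarity drop between neighboring sentences is treated as a topic boundary
        if util.cos_sim(embeddings[i - 1], embeddings[i]).item() < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks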
Vector Databases: Choose Your Weapon
Once you have embeddings, you need somewhere to store and search them. The options:
| Database | Best For | Latency | Operational Overhead |
|---|---|---|---|
| PostgreSQL + pgvector | < 10M vectors, existing Postgres stack | ~4ms | Lowest |
| Pinecone | Managed, auto-scaling | ~10ms | Low |
| Weaviate | Hybrid search built-in | ~15ms | Medium |
| Qdrant | High performance, self-hosted | ~5ms | Medium |
| Milvus | Massive scale (billions of vectors) | ~10ms | Higher |
My recommendation: Start with pgvector if you already have PostgreSQL. The operational simplicity beats the marginal performance gains of specialized databases until you hit real scale problems.
Most of these databases also offer managed versions, which further reduces operational overhead.
The search itself:
When you search, the database compares your query’s vector to every stored vector and returns the closest matches. “Closest” usually means cosine similarity: how closely the two vectors point in the same direction. A score of 1.0 means identical direction, 0 means unrelated, -1 means opposite.
# What the database does conceptually: compare the query to every stored vector
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def search(query_embedding, vector_db, k=5):
    # vector_db: iterable of (doc_id, doc_embedding) pairs
    similarities = []
    for doc_id, doc_embedding in vector_db:
        score = cosine_similarity(query_embedding, doc_embedding)
        similarities.append((doc_id, score))
    # Highest similarity first, keep the top k
    return sorted(similarities, key=lambda x: x[1], reverse=True)[:k]
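In production the database runs that scan (or an approximate version of it) for you. With pgvector, for example, the top-k query is a single SQL statement. This sketch assumes a documents table with an embedding vector(384) column and the psycopg driver:
# Sketch: the same top-5 search in PostgreSQL with pgvector.
# Assumes a documents(id, content, embedding vector(384)) table and `pip install psycopg pgvector`.
import psycopg
from pgvector.psycopg import register_vector

def search_pgvector(query_embedding, k=5):
    # query_embedding: the embedded user query as a list or numpy array
    with psycopg.connect("dbname=rag") as conn:   # connection string is a placeholder
        register_vector(conn)                     # teach psycopg to send/receive vector columns
        return conn.execute(
            """
            SELECT id, content, 1 - (embedding <=> %s) AS cosine_similarity
            FROM documents
            ORDER BY embedding <=> %s  -- <=> is pgvector's cosine distance operator
            LIMIT %s
            """,
            (query_embedding, query_embedding, k),
        ).fetchall()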
Hybrid Search: Why Vector-Only Fails
Here’s a dirty secret: pure vector search only finds about 75% of the relevant documents in your database. Sounds decent until your user searches for “ERR_CONNECTION_REFUSED” and gets results about network connectivity philosophy instead of the actual error code.
The problem: Embeddings understand meaning, not exact text. Error codes, entity names, and technical acronyms get semantically smoothed into oblivion.
The solution: Hybrid search combines two approaches: vector search (finds similar meaning) and keyword search (finds exact text matches, like traditional search engines).
%%{init: {"layout": "dagre"}}%%
flowchart TB
Q[Query] --> VS[Vector Search]
Q --> KW[Keyword Search]
VS --> |semantic matches| Merge[Merge & Rank]
KW --> |exact matches| Merge
Merge --> Results[Combined Results]
How they combine: The system merges both result lists. Documents that rank high in both searches (meaning match AND keyword match) bubble to the top. Documents that only appear in one list rank lower.
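One common merging scheme (not the only one) is reciprocal rank fusion, which scores each document by its rank position in each list rather than by raw scores; a minimal sketch:
# Sketch: reciprocal rank fusion (RRF) over two ranked result lists.
# vector_hits and keyword_hits are lists of doc_ids, best match first.
def reciprocal_rank_fusion(vector_hits, keyword_hits, k=60):
    scores = {}
    for hits in (vector_hits, keyword_hits):
        for rank, doc_id in enumerate(hits):
            # 1 / (k + rank): higher-ranked documents contribute more to the combined score
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    # Documents that rank high in both lists end up with the largest combined score
    return sorted(scores, key=scores.get, reverse=True)
The k=60 constant is the value commonly used in practice; rank fusion avoids having to normalize vector scores and keyword scores onto the same scale.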
The numbers: Hybrid search finds 87% of relevant documents versus 75% for vector-only. That 12-point gap often contains the exact answer your user needed.
Reranking: The Expensive Upgrade
Initial retrieval casts a wide net, grabbing many candidates that might be relevant. Reranking sorts those candidates by actual relevance, pushing the best matches to the top.
Rerankers read your query and each document together, understanding the relationship between them. This is more accurate than comparing vectors, but 10-50x slower because it processes each document individually.
The trade-off:
| Approach | Accuracy Boost | Added Latency |
|---|---|---|
| No reranking | Baseline | 0ms |
| Reranking (50 candidates → 5 best) | +20-35% | 200-500ms |
When to use it: If you can afford 500ms+ delay and getting the best results matters (legal, medical, compliance), always rerank. For sub-100ms requirements, skip it.
The pattern:
# Step 1: Get 50 candidates quickly
candidates = hybrid_search(query, k=50)
# Step 2: Rerank them carefully (slower but more accurate)
reranked = reranker.sort_by_relevance(query, candidates)
# Step 3: Take the top 5 for the LLM
context = reranked[:5]
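If you want a concrete reranker to slot into that pattern, a cross-encoder from the sentence-transformers library is one open-source option. A sketch, assuming each candidate is a dict carrying its text:
# Sketch: rerank candidates with a cross-encoder, which reads query + document together.
# Assumes `pip install sentence-transformers`; the model name is one common open-source choice.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, candidates, top_n=5):
    # candidates: list of dicts like {"id": ..., "text": ...} returned by hybrid search
    scores = reranker.predict([(query, c["text"]) for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [c for c, _ in ranked[:top_n]]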
The Production Optimization Playbook
Everything above gets you a working RAG system. These optimizations get you a fast one:
1. Metadata filtering
Pre-filter documents by metadata (date, category, source) before vector search. Reduces scan cost by 50-70%.
# Instead of searching all documents
results = vector_db.search(query_embedding, k=10)
# Filter first, then search
results = vector_db.search(
    query_embedding,
    k=10,
    filter={"category": "policy", "year": 2025},
)
2. Embedding caching
Cache frequently-accessed embeddings in Redis or local memory. Cuts recomputation by 80%.
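A minimal in-process version, assuming an embed function that wraps whatever model or API you call:
# Sketch: cache embeddings in process memory so repeated text skips the model/API call.
# `embed` is a placeholder for your existing embedding call.
from functools import lru_cache

@lru_cache(maxsize=10_000)
def embed_cached(text: str):
    # Identical text hits the cache instead of recomputing; the string itself is the cache key
    return embed(text)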
3. Batch processing
Buffer upserts into batches rather than single-record writes. Improves throughput 3-5x.
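A sketch of the buffering pattern, with vector_db.upsert standing in for your database’s bulk-write call:
# Sketch: write vectors in batches of 256 instead of one network round-trip per record.
# `vector_db.upsert` is a placeholder for your database's bulk-write method.
def batch_upsert(vector_db, records, batch_size=256):
    for i in range(0, len(records), batch_size):
        vector_db.upsert(records[i:i + batch_size])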
4. Response caching
Cache LLM outputs for repeated queries. Reduces API costs by 10%+.
5. Index pruning
Periodically remove obsolete vectors. Storage and I/O overhead compound over time.
Choosing Your Embedding Model
The model matters more than any other decision. Here’s the current landscape:
| Model | Dimensions | Strength | Open Source |
|---|---|---|---|
| BGE-M3 | 1024 | Multilingual, fine-tuning friendly | Yes |
| E5-Large | 1024 | General-purpose excellence | Yes |
| OpenAI text-embedding-3-large | 3072 | Best absolute metrics | No |
| Cohere Embed v3 | 1024 | Variable context windows | No |
| Mistral Embed | 1024 | Speed/accuracy trade-off | No |
The benchmark: MTEB (Massive Text Embedding Benchmark) evaluates models across 8 tasks spanning 50+ datasets. Check the retrieval scores specifically, not overall rankings.
The catch: No universal best model exists. A model excelling at semantic similarity may underperform on clustering. Domain-specific models (medical, legal) outperform general-purpose ones by 15%+ in their domains.
Fine-Tuning: When and How
Off-the-shelf embeddings handle 80% of use cases. Fine-tuning is a last resort; it makes sense when:
- Your domain has specialized vocabulary (medical, legal, internal jargon)
- You have 500+ labeled examples of query-document relevance
- You’ve already optimized chunking, hybrid search, and reranking
LoRA (Low-Rank Adaptation): A technique that freezes the original model and trains only a small set of added low-rank adapter weights, a tiny fraction of the total parameters. This means you can customize embedding models on a regular laptop instead of needing expensive GPU clusters.
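To give a sense of how little of the model LoRA actually trains, here's a sketch using Hugging Face's peft library on a BERT-style embedding model (the model name and hyperparameters are illustrative, not a recipe):
# Sketch: wrap an embedding model with LoRA adapters via the peft library.
# Assumes `pip install transformers peft`; target_modules match BERT-style attention layer names.
from transformers import AutoModel
from peft import LoraConfig, get_peft_model

base = AutoModel.from_pretrained("BAAI/bge-small-en-v1.5")  # any BERT-style embedding model
config = LoraConfig(r=8, lora_alpha=16, target_modules=["query", "value"], lora_dropout=0.1)
model = get_peft_model(base, config)

model.print_trainable_parameters()  # typically well under 1% of the parameters are trainable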
The trade-off is worth it: fine-tuning on your domain data can improve search accuracy by 15-20% for specialized vocabularies.
Query Expansion for Complex Questions
Simple queries work fine with basic retrieval. Complex queries spanning multiple concepts need decomposition.
Before:
"What regulations govern AI in healthcare, and how do they differ from financial services?"
After decomposition:
1. "Regulations governing AI in healthcare"
2. "Regulations governing AI in financial services"
3. "Differences between healthcare and financial AI regulations"
Retrieving for each subquery independently, then merging results, finds significantly more relevant documents than asking the original complex question directly.
%%{init: {"layout": "dagre"}}%%
flowchart TB
Q[Complex Query] --> LLM1[LLM: Decompose]
LLM1 --> Q1[Subquery 1]
LLM1 --> Q2[Subquery 2]
LLM1 --> Q3[Subquery 3]
Q1 --> R1[Retrieve]
Q2 --> R2[Retrieve]
Q3 --> R3[Retrieve]
R1 --> M[Merge Results]
R2 --> M
R3 --> M
M --> LLM2[LLM: Synthesize]
LLM2 --> A[Final Answer]
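A sketch of that flow, with llm and hybrid_search standing in for whatever model client and retriever you already use:
# Sketch: decompose a complex query, retrieve per subquery, and merge unique chunks.
# `llm(prompt)` and `hybrid_search(query, k)` are placeholders for your own stack.
def decompose_and_retrieve(question, llm, hybrid_search, k_per_subquery=10):
    prompt = (
        "Break this question into 2-4 standalone search queries, one per line:\n"
        f"{question}"
    )
    subqueries = [line.strip() for line in llm(prompt).splitlines() if line.strip()]

    merged, seen = [], set()
    for sub in subqueries:
        for chunk in hybrid_search(sub, k=k_per_subquery):
            if chunk["id"] not in seen:   # deduplicate chunks retrieved by multiple subqueries
                seen.add(chunk["id"])
                merged.append(chunk)
    return subqueries, merged
The merged chunks then go to the LLM along with the original question for the final synthesis step.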
The Latency Budget
Production RAG lives or dies on latency. The LLM’s answer can stream token by token, but everything before it (embedding, search, any reranking) adds to the time before the first token appears, so keep that retrieval path as close to 100ms as you can. Here’s where the time goes:
Embedding generation: 5-20ms
Vector search: 10-30ms
Reranking (optional): 200-500ms
LLM generation: 500-2000ms
─────────────────────────────────
Total without rerank: 515-2050ms
Total with rerank: 715-2550ms
Optimization priorities:
- Skip reranking if latency-constrained
- Use smaller embedding models (384 vs 1536 dimensions)
- Pre-filter with metadata to reduce search space
- Cache aggressively
The Bottom Line
RAG isn’t complicated. It’s just precise. Every decision, from chunk size to embedding model to hybrid search weights, compounds into your final result quality.
Start simple:
- 384-768 dimension embeddings
- 400-token chunks with 20% overlap
- pgvector for storage
- Hybrid search (vector + keyword combined)
- No reranking until you need better accuracy
Add complexity only when you’ve measured the baseline:
- Reranking when getting the exact right answer matters more than speed
- Query decomposition for complex multi-part questions
- Fine-tuning when you have domain-specific vocabulary and labeled data
The difference between a RAG demo and a RAG product is understanding these trade-offs. Now you do.
Building RAG systems and hitting walls? I’d love to hear what challenges you’re facing. Reach out on LinkedIn.