RAG for Dummies: What Nobody Tells You About Building Search That Works
Everyone talks about RAG like it’s magic. Throw some documents into a vector database, add an LLM, and suddenly you have a chatbot that knows everything about your company.
Here’s the truth I learned after building multiple RAG systems: it’s not magic. It’s plumbing. And like all plumbing, the quality of your results depends entirely on how well you understand the pipes.
Most RAG tutorials skip the fundamentals. They show you the happy path, the demo that works on 10 documents, then leave you wondering why your production system returns garbage when you scale to 10,000. This guide fixes that.
What RAG Actually Is
RAG stands for Retrieval-Augmented Generation. Strip away the jargon, and it’s a two-step process:
- Find relevant information from your documents
- Generate an answer using that information
The “retrieval” part searches your knowledge base. The “generation” part is the LLM synthesizing an answer. Simple concept, complex execution.
%%{init: {"layout": "dagre"}}%%
flowchart LR
Q[User Question] --> E1[Embed Query]
E1 --> VS[Vector Search]
VS --> R[Retrieved Chunks]
R --> LLM[LLM Generation]
LLM --> A[Answer]
subgraph Knowledge Base
D[Documents] --> C[Chunking]
C --> E2[Embed Chunks]
E2 --> VDB[(Vector DB)]
end
VDB --> VS
Why this matters: The LLM only knows what you retrieve. Bad retrieval means bad answers. No amount of prompt engineering fixes garbage context.
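In code, the whole loop is only a few lines. This is a sketch of the shape of it, with retrieve and llm as placeholders for the retriever and model client the rest of this guide covers:
# Sketch of the RAG loop; `retrieve` and `llm` are placeholders for your own retriever and model client
def answer(question, retrieve, llm, k=5):
    chunks = retrieve(question, k=k)                          # step 1: find relevant information
    context = "\n\n".join(chunk["text"] for chunk in chunks)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return llm(prompt)                                        # step 2: generate with that context
Everything that follows is about making that retrieve call return the right chunks.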
Embeddings: The Math Behind Meaning
Embeddings are where the magic actually happens. They convert text into numbers, specifically high-dimensional vectors, where similar meanings cluster together.
Think of it like GPS coordinates for concepts. “Car” and “automobile” end up near each other in vector space. “Car” and “banana” are far apart.
How embeddings learn this:
The model trains on billions of text examples, learning patterns like “these two sentences mean the same thing” and “these two are different.” Over time, it learns to place similar concepts close together in vector space.
Why this matters: Better training means the model captures subtle differences. A well-trained model distinguishes between “company vacation policy” and “company sick leave policy.” A poorly-trained one might confuse them.
The dimension trade-off:
| Dimensions | Memory | Accuracy | Use Case |
|---|---|---|---|
| 384 | Low | Good | Standard semantic search |
| 768 | Medium | Better | Complex documents |
| 1536 | High | Best | Multi-modal, specialized domains |
Start at 384-768 dimensions. Higher dimensions don’t automatically mean better results. A well-trained 384-dimensional model beats a poorly-trained 4096-dimensional one every time.
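To see what "close in vector space" means in practice, here's a minimal sketch using the open-source sentence-transformers library and all-MiniLM-L6-v2, one common 384-dimensional model (any model from the table behaves the same way):
# Sketch: embed a few phrases and measure how close they sit in vector space.
# Assumes `pip install sentence-transformers`; all-MiniLM-L6-v2 is one common 384-dimensional model.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(["car", "automobile", "banana"], normalize_embeddings=True)

print(embeddings.shape)                            # (3, 384): one 384-dimensional vector per phrase
print(util.cos_sim(embeddings[0], embeddings[1]))  # car vs automobile: high similarity
print(util.cos_sim(embeddings[0], embeddings[2]))  # car vs banana: much lower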
Chunking: The Most Underrated Decision
Here’s what trips up every beginner: you can’t embed entire documents. A 50-page PDF doesn’t fit in one vector. You need to break it into chunks.
Chunk size matters more than you think:
Too small (100 tokens):
"The company policy states that" → Missing context
"employees should" → Incomplete thought
Too large (2000 tokens):
"[entire section about HR policies]" → Diluted meaning, matches everything poorly
Just right (300-500 tokens):
"Employees are entitled to 15 days of annual leave per year.
Leave requests must be submitted at least two weeks in advance
through the HR portal." → Complete, specific, searchable
The overlap trick:
Chunks with 20-30% overlap prevent losing context at boundaries. If your chunk size is 512 tokens, use a 128-token overlap.
%%{init: {"layout": "dagre"}}%%
flowchart LR
subgraph Document
A[Chunk 1: tokens 0-512]
B[Chunk 2: tokens 384-896]
C[Chunk 3: tokens 768-1280]
end
A -.overlap.-> B
B -.overlap.-> C
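In code, that sliding window looks roughly like this. The sketch assumes the tiktoken tokenizer; use whatever tokenizer matches your embedding model:
# Sketch: split text into 512-token chunks with 128-token overlap.
# Assumes `pip install tiktoken`; swap in the tokenizer that matches your embedding model.
import tiktoken

def chunk_text(text, chunk_size=512, overlap=128):
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    step = chunk_size - overlap  # advance 384 tokens per chunk, so 128 tokens repeat at each boundary
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append(enc.decode(window))
        if start + chunk_size >= len(tokens):
            break
    return chunks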
Semantic chunking: The advanced approach uses AI to detect topic boundaries rather than arbitrary token limits. More expensive during preprocessing, but finds 15-25% more relevant results. Worth it for documents with distinct sections.
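One way to approximate semantic chunking yourself: embed individual sentences and start a new chunk wherever similarity between neighbors drops. A rough sketch, again using sentence-transformers (the 0.5 threshold is an arbitrary starting point you would tune, not a recommendation):
# Sketch: split where adjacent sentences stop resembling each other.
# Assumes `pip install sentence-transformers`; the threshold is a tuning knob, not a magic number.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_chunks(sentences, threshold=0.5):
    embeddings = model.encode(sentences, normalize_embeddings=True)
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        # A similarity drop between neighboring sentences is treated as a topic boundary
        if util.cos_sim(embeddings[i - 1], embeddings[i]).item() < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks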
Vector Databases: Choose Your Weapon
Once you have embeddings, you need somewhere to store and search them. The options:
| Database | Best For | Latency | Operational Overhead |
|---|---|---|---|
| PostgreSQL + pgvector | < 10M vectors, existing Postgres stack | ~4ms | Lowest |
| Pinecone | Managed, auto-scaling | ~10ms | Low |
| Weaviate | Hybrid search built-in | ~15ms | Medium |
| Qdrant | High performance, self-hosted | ~5ms | Medium |
| Milvus | Massive scale (billions of vectors) | ~10ms | Higher |
My recommendation: Start with pgvector if you already have PostgreSQL. The operational simplicity beats the marginal performance gains of specialized databases until you hit real scale problems.
Most of these databases also offer managed versions, which further reduces operational overhead.
The search itself:
When you search, the database compares your query’s vector to every stored vector and returns the closest matches. “Closest” usually means cosine similarity: how closely the two vectors point in the same direction. A score of 1.0 means identical direction, 0 means unrelated, -1 means opposite.
# What the database does conceptually: compare the query to every stored vector
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def search(query_embedding, vector_db, k=5):
    # vector_db: iterable of (doc_id, doc_embedding) pairs
    similarities = []
    for doc_id, doc_embedding in vector_db:
        score = cosine_similarity(query_embedding, doc_embedding)
        similarities.append((doc_id, score))
    # Highest similarity first, keep the top k
    return sorted(similarities, key=lambda x: x[1], reverse=True)[:k]
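In production the database runs that scan (or an approximate version of it) for you. With pgvector, for example, the top-k query is a single SQL statement. This sketch assumes a documents table with an embedding vector(384) column and the psycopg driver:
# Sketch: the same top-5 search in PostgreSQL with pgvector.
# Assumes a documents(id, content, embedding vector(384)) table and `pip install psycopg pgvector`.
import psycopg
from pgvector.psycopg import register_vector

def search_pgvector(query_embedding, k=5):
    # query_embedding: the embedded user query as a list or numpy array
    with psycopg.connect("dbname=rag") as conn:   # connection string is a placeholder
        register_vector(conn)                     # teach psycopg to send/receive vector columns
        return conn.execute(
            """
            SELECT id, content, 1 - (embedding <=> %s) AS cosine_similarity
            FROM documents
            ORDER BY embedding <=> %s  -- <=> is pgvector's cosine distance operator
            LIMIT %s
            """,
            (query_embedding, query_embedding, k),
        ).fetchall()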
Hybrid Search: Why Vector-Only Fails
Here’s a dirty secret: pure vector search only finds about 75% of the relevant documents in your database. Sounds decent until your user searches for “ERR_CONNECTION_REFUSED” and gets results about network connectivity philosophy instead of the actual error code.
The problem: Embeddings understand meaning, not exact text. Error codes, entity names, and technical acronyms get semantically smoothed into oblivion.
The solution: Hybrid search combines two approaches: vector search (finds similar meaning) and keyword search (finds exact text matches, like traditional search engines).
%%{init: {"layout": "dagre"}}%%
flowchart TB
Q[Query] --> VS[Vector Search]
Q --> KW[Keyword Search]
VS --> |semantic matches| Merge[Merge & Rank]
KW --> |exact matches| Merge
Merge --> Results[Combined Results]
How they combine: The system merges both result lists. Documents that rank high in both searches (meaning match AND keyword match) bubble to the top. Documents that only appear in one list rank lower.
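One common merging scheme (not the only one) is reciprocal rank fusion, which scores each document by its rank position in each list rather than by raw scores; a minimal sketch:
# Sketch: reciprocal rank fusion (RRF) over two ranked result lists.
# vector_hits and keyword_hits are lists of doc_ids, best match first.
def reciprocal_rank_fusion(vector_hits, keyword_hits, k=60):
    scores = {}
    for hits in (vector_hits, keyword_hits):
        for rank, doc_id in enumerate(hits):
            # 1 / (k + rank): higher-ranked documents contribute more to the combined score
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    # Documents that rank high in both lists end up with the largest combined score
    return sorted(scores, key=scores.get, reverse=True)
The k=60 constant is the value commonly used in practice; rank fusion avoids having to normalize vector scores and keyword scores onto the same scale.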
The numbers: Hybrid search finds 87% of relevant documents versus 75% for vector-only. That 12-point gap often contains the exact answer your user needed.
Reranking: The Expensive Upgrade
Initial retrieval casts a wide net, grabbing many candidates that might be relevant. Reranking sorts those candidates by actual relevance, pushing the best matches to the top.
Rerankers read your query and each document together, understanding the relationship between them. This is more accurate than comparing vectors, but 10-50x slower because it processes each document individually.
The trade-off:
| Approach | Accuracy Boost | Added Latency |
|---|---|---|
| No reranking | Baseline | 0ms |
| Reranking (50 candidates → 5 best) | +20-35% | 200-500ms |
When to use it: If you can afford 500ms+ delay and getting the best results matters (legal, medical, compliance), always rerank. For sub-100ms requirements, skip it.
The pattern:
# Step 1: Get 50 candidates quickly
candidates = hybrid_search(query, k=50)
# Step 2: Rerank them carefully (slower but more accurate)
reranked = reranker.sort_by_relevance(query, candidates)
# Step 3: Take the top 5 for the LLM
context = reranked[:5]
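If you want a concrete reranker to slot into that pattern, a cross-encoder from the sentence-transformers library is one open-source option. A sketch, assuming each candidate is a dict carrying its text:
# Sketch: rerank candidates with a cross-encoder, which reads query + document together.
# Assumes `pip install sentence-transformers`; the model name is one common open-source choice.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, candidates, top_n=5):
    # candidates: list of dicts like {"id": ..., "text": ...} returned by hybrid search
    scores = reranker.predict([(query, c["text"]) for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [c for c, _ in ranked[:top_n]]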
The Production Optimization Playbook
Everything above gets you a working RAG system. These optimizations get you a fast one:
1. Metadata filtering
Pre-filter documents by metadata (date, category, source) before vector search. Reduces scan cost by 50-70%.
# Instead of searching all documents
results = vector_db.search(query_embedding, k=10)
# Filter first, then search
results = vector_db.search(
    query_embedding,
    k=10,
    filter={"category": "policy", "year": 2025},
)
2. Embedding caching
Cache frequently-accessed embeddings in Redis or local memory. Cuts recomputation by 80%.
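A minimal in-process version, assuming an embed function that wraps whatever model or API you call:
# Sketch: cache embeddings in process memory so repeated text skips the model/API call.
# `embed` is a placeholder for your existing embedding call.
from functools import lru_cache

@lru_cache(maxsize=10_000)
def embed_cached(text: str):
    # Identical text hits the cache instead of recomputing; the string itself is the cache key
    return embed(text)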
3. Batch processing
Buffer upserts into batches rather than single-record writes. Improves throughput 3-5x.
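A sketch of the buffering pattern, with vector_db.upsert standing in for your database’s bulk-write call:
# Sketch: write vectors in batches of 256 instead of one network round-trip per record.
# `vector_db.upsert` is a placeholder for your database's bulk-write method.
def batch_upsert(vector_db, records, batch_size=256):
    for i in range(0, len(records), batch_size):
        vector_db.upsert(records[i:i + batch_size])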
4. Response caching
Cache LLM outputs for repeated queries. Reduces API costs by 10%+.
5. Index pruning
Periodically remove obsolete vectors. Storage and I/O overhead compound over time.
Choosing Your Embedding Model
The model matters more than any other decision. Here’s the current landscape:
| Model | Dimensions | Strength | Open Source |
|---|---|---|---|
| BGE-M3 | 1024 | Multilingual, fine-tuning friendly | Yes |
| E5-Large | 1024 | General-purpose excellence | Yes |
| OpenAI text-embedding-3-large | 3072 | Best absolute metrics | No |
| Cohere Embed v3 | 1024 | Variable context windows | No |
| Mistral Embed | 1024 | Speed/accuracy trade-off | No |
The benchmark: MTEB (Massive Text Embedding Benchmark) evaluates models across 8 tasks spanning 50+ datasets. Check the retrieval scores specifically, not overall rankings.
The catch: No universal best model exists. A model excelling at semantic similarity may underperform on clustering. Domain-specific models (medical, legal) outperform general-purpose ones by 15%+ in their domains.
Fine-Tuning: When and How
Off-the-shelf embeddings handle 80% of use cases. Fine-tuning is a last resort; it makes sense when:
- Your domain has specialized vocabulary (medical, legal, internal jargon)
- You have 500+ labeled examples of query-document relevance
- You’ve already optimized chunking, hybrid search, and reranking
LoRA (Low-Rank Adaptation): A technique that freezes the original model and trains only a small set of added low-rank adapter weights, a tiny fraction of the total parameters. This means you can customize embedding models on a regular laptop instead of needing expensive GPU clusters.
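To give a sense of how little of the model LoRA actually trains, here's a sketch using Hugging Face's peft library on a BERT-style embedding model (the model name and hyperparameters are illustrative, not a recipe):
# Sketch: wrap an embedding model with LoRA adapters via the peft library.
# Assumes `pip install transformers peft`; target_modules match BERT-style attention layer names.
from transformers import AutoModel
from peft import LoraConfig, get_peft_model

base = AutoModel.from_pretrained("BAAI/bge-small-en-v1.5")  # any BERT-style embedding model
config = LoraConfig(r=8, lora_alpha=16, target_modules=["query", "value"], lora_dropout=0.1)
model = get_peft_model(base, config)

model.print_trainable_parameters()  # typically well under 1% of the parameters are trainable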
The trade-off is worth it: fine-tuning on your domain data can improve search accuracy by 15-20% for specialized vocabularies.
Query Expansion for Complex Questions
Simple queries work fine with basic retrieval. Complex queries spanning multiple concepts need decomposition.
Before:
"What regulations govern AI in healthcare, and how do they differ from financial services?"
After decomposition:
1. "Regulations governing AI in healthcare"
2. "Regulations governing AI in financial services"
3. "Differences between healthcare and financial AI regulations"
Retrieving for each subquery independently, then merging results, finds significantly more relevant documents than asking the original complex question directly.
%%{init: {"layout": "dagre"}}%%
flowchart TB
Q[Complex Query] --> LLM1[LLM: Decompose]
LLM1 --> Q1[Subquery 1]
LLM1 --> Q2[Subquery 2]
LLM1 --> Q3[Subquery 3]
Q1 --> R1[Retrieve]
Q2 --> R2[Retrieve]
Q3 --> R3[Retrieve]
R1 --> M[Merge Results]
R2 --> M
R3 --> M
M --> LLM2[LLM: Synthesize]
LLM2 --> A[Final Answer]
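A sketch of that flow, with llm and hybrid_search standing in for whatever model client and retriever you already use:
# Sketch: decompose a complex query, retrieve per subquery, and merge unique chunks.
# `llm(prompt)` and `hybrid_search(query, k)` are placeholders for your own stack.
def decompose_and_retrieve(question, llm, hybrid_search, k_per_subquery=10):
    prompt = (
        "Break this question into 2-4 standalone search queries, one per line:\n"
        f"{question}"
    )
    subqueries = [line.strip() for line in llm(prompt).splitlines() if line.strip()]

    merged, seen = [], set()
    for sub in subqueries:
        for chunk in hybrid_search(sub, k=k_per_subquery):
            if chunk["id"] not in seen:   # deduplicate chunks retrieved by multiple subqueries
                seen.add(chunk["id"])
                merged.append(chunk)
    return subqueries, merged
The merged chunks then go to the LLM along with the original question for the final synthesis step.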
The Latency Budget
Production RAG lives or dies on latency. The LLM’s answer can stream token by token, but everything before it (embedding, search, any reranking) adds to the time before the first token appears, so keep that retrieval path as close to 100ms as you can. Here’s where the time goes:
Embedding generation: 5-20ms
Vector search: 10-30ms
Reranking (optional): 200-500ms
LLM generation: 500-2000ms
─────────────────────────────────
Total without rerank: 515-2050ms
Total with rerank: 715-2550ms
Optimization priorities:
- Skip reranking if latency-constrained
- Use smaller embedding models (384 vs 1536 dimensions)
- Pre-filter with metadata to reduce search space
- Cache aggressively
The Bottom Line
RAG isn’t complicated. It’s just precise. Every decision, from chunk size to embedding model to hybrid search weights, compounds into your final result quality.
Start simple:
- 384-768 dimension embeddings
- 400-token chunks with 20% overlap
- pgvector for storage
- Hybrid search (vector + keyword combined)
- No reranking until you need better accuracy
Add complexity only when you’ve measured the baseline:
- Reranking when getting the exact right answer matters more than speed
- Query decomposition for complex multi-part questions
- Fine-tuning when you have domain-specific vocabulary and labeled data
The difference between a RAG demo and a RAG product is understanding these trade-offs. Now you do.
Building RAG systems and hitting walls? I’d love to hear what challenges you’re facing. Reach out on LinkedIn.