RAG System Design
The Knowledge Cutoff Problem That Cost a Fortune
In 2023, a major financial services firm deployed a customer support chatbot built on fine-tuned GPT-3.5. Fine-tuning was expensive: each retraining cycle cost $40,000. Meanwhile, their legal team started flagging the chatbot's responses as a regulatory liability.
Their AI engineering lead ran an experiment on a Friday afternoon. She took the existing GPT-3.5 model - no fine-tuning, no special training - and connected it to the company's internal knowledge base. She built a simple retrieval pipeline: embed the user's question, find the three most relevant document chunks, prepend them to the prompt, and let the base model answer from that context. Monday morning she showed the results to her team. The accuracy on time-sensitive questions went from 61% to 87%. The product catalog questions were now correct because the model was reading the current catalog, not remembering a stale version of it from training.
This is Retrieval Augmented Generation. The insight is almost embarrassingly simple: LLMs are excellent at reading comprehension. Give them a relevant document and a question, and they will extract the answer accurately. The hard part is not the generation - it is the retrieval. Getting the right chunks into the context window, in the right order, with enough surrounding context to make sense, is an engineering problem that takes months to solve well.
The naive RAG implementation the lead built that Friday was good enough to prove the concept. But it broke on questions that spanned multiple documents, it failed on questions where the user's phrasing didn't match the document vocabulary, it returned irrelevant chunks when the question was ambiguous, and it hallucinated confidently when nothing in the retrieval matched the query. Building a production RAG system - one that serves tens of thousands of queries per day with consistent accuracy - requires addressing each of these failure modes systematically.
This lesson is about building that production system. We will go from the naive three-chunk RAG pipeline to the advanced architecture that handles multi-hop reasoning, vocabulary mismatch, ambiguous queries, and evaluation at scale.
Why This Exists
The Fine-Tuning Trap
Fine-tuning bakes knowledge into model weights. This creates three problems in production:
Stale knowledge. Weights are fixed after training. Any information created or updated after the training cutoff is invisible to the model. For most enterprise use cases - legal documents, product catalogs, policy manuals, technical documentation - information changes continuously.
Catastrophic forgetting. When you fine-tune on your domain data, the model loses some of its general reasoning capability. Fine-tuning for customer support on financial products produces a model that is better at answering questions about your products but worse at everything else. For multi-purpose deployments, this is a regression.
No provenance. A fine-tuned model cannot tell you which document its answer came from. This is a compliance problem in any regulated industry. "The model said so" is not a defensible answer when a customer asks for evidence.
RAG solves all three. Knowledge lives in the retrieval store, not in weights. Update the store and the model immediately has the new information. The model's general capabilities are untouched. Every answer is grounded in specific retrieved documents that can be cited.
The Evolution: Naive RAG to Advanced RAG
Historical Context
RAG was formalized by Lewis et al. at Facebook AI Research in the 2020 paper "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks." The paper showed that combining a dense retrieval model (DPR) with a sequence-to-sequence generator (BART) outperformed both pure retrieval and pure parametric models on open-domain QA benchmarks.
The underlying idea - retrieve relevant documents and use them to condition text generation - appeared much earlier. IBM's Watson used a similar retrieve-then-read pipeline in the 2011 Jeopardy! challenge. The 2017 DrQA system from Facebook used TF-IDF retrieval with a neural reader. What changed in 2020 was the combination of dense neural retrieval (much better than TF-IDF for semantic matching) with powerful auto-regressive language models (GPT-3 era), making the quality high enough for production deployment.
The explosion of production RAG systems began in 2023, driven by the GPT-4 API and the rapid adoption of LangChain and LlamaIndex as RAG orchestration frameworks. By 2024, RAG had become the dominant pattern for enterprise AI deployments.
Core Concepts
The RAG Pipeline Architecture
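As a baseline for the sections that follow, the naive pipeline from the introduction (embed the question, retrieve the three closest chunks, prepend them to the prompt) can be sketched in a few lines. This is a minimal sketch under stated assumptions: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model, the sample chunks are invented, and the final LLM call is left out so the example runs without any external service.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def toy_embed(text: str) -> Dict[str, float]:
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return dict(Counter(re.findall(r"[a-z0-9]+", text.lower())))


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(term, 0.0) for term, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: List[str], top_k: int = 3) -> List[Tuple[str, float]]:
    """Rank every chunk by similarity to the query and keep the top_k."""
    query_vec = toy_embed(query)
    scored = [(chunk, cosine(query_vec, toy_embed(chunk))) for chunk in chunks]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


def build_prompt(query: str, retrieved: List[Tuple[str, float]]) -> str:
    """Prepend the retrieved chunks to the question, as in the naive pipeline."""
    context = "\n\n".join(f"[{i}] {chunk}" for i, (chunk, _) in enumerate(retrieved, 1))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )


# Invented sample chunks, for illustration only.
chunks = [
    "Refunds are issued within 14 days of a return request.",
    "The premium card has an annual fee of 95 dollars.",
    "Wire transfers submitted after 5 pm are processed the next business day.",
    "Our offices are closed on public holidays.",
]
query = "When are wire transfers processed?"
top_chunks = retrieve(query, chunks)
prompt = build_prompt(query, top_chunks)  # this string would be sent to the LLM
```

Everything after this point in the lesson replaces one of these pieces (the chunks, the embedding, the ranking, or the prompt assembly) with something more robust.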
Chunking Strategies
Chunking is the most underestimated decision in RAG design. The chunk is the unit of retrieval: too small, and retrieved chunks lack context; too large, and they introduce irrelevant content that confuses the model and wastes context window budget.
Fixed-size chunking: Split documents into chunks of $N$ tokens with $O$ tokens of overlap between consecutive chunks. Simple, predictable. Works well for homogeneous documents (all similar structure). Breaks semantic units - a chunk may start mid-sentence or cut off a critical piece of reasoning.
Recommended defaults: $N = 512$ tokens, $O = 64$ tokens overlap.
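A sliding window is enough to implement fixed-size chunking. The sketch below assumes the document has already been tokenized; the function name `fixed_size_chunks` and the toy token list are illustrative, and a real pipeline would pass model tokens from its tokenizer.

```python
from typing import List


def fixed_size_chunks(
    tokens: List[str], chunk_size: int = 512, overlap: int = 64
) -> List[List[str]]:
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start : start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks


# Toy tokens stand in for model tokens (an assumption made for brevity).
tokens = [f"t{i}" for i in range(10)]
chunks = fixed_size_chunks(tokens, chunk_size=4, overlap=1)
```

The early `break` avoids emitting a trailing chunk that consists only of tokens already covered by the previous window.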
Semantic chunking: Split at natural boundaries detected by the model. Use a sentence encoder to compute the embedding of each sentence. When the cosine similarity between consecutive sentence embeddings drops below a threshold, start a new chunk. This preserves semantic coherence at the cost of variable chunk sizes.
Hierarchical chunking (parent-child): Index at two levels. Create small chunks (~128 tokens) for precise retrieval. Store parent chunks (~512 tokens) that contain the small chunks. When a small chunk is retrieved, replace it with its parent for the generation context. This gives you the precision of small-chunk retrieval with the coherence of large-chunk context.
Document-level chunking: For short documents (under 512 tokens), keep the entire document as a single chunk. Never split a document if it fits in the context.
```python
import re
from typing import List

import numpy as np
import tiktoken
from sentence_transformers import SentenceTransformer


class SemanticChunker:
    """
    Splits documents at semantic boundaries detected by embedding similarity drops.
    """

    def __init__(
        self,
        model_name: str = "sentence-transformers/all-MiniLM-L6-v2",
        similarity_threshold: float = 0.4,
        min_chunk_size: int = 100,  # tokens
        max_chunk_size: int = 600,  # tokens
    ):
        self.encoder = SentenceTransformer(model_name)
        self.threshold = similarity_threshold
        self.min_size = min_chunk_size
        self.max_size = max_chunk_size
        self.tokenizer = tiktoken.get_encoding("cl100k_base")

    def split(self, text: str) -> List[str]:
        """Split text into semantically coherent chunks."""
        # Split into sentences (rough approximation)
        sentences = self._split_sentences(text)
        if len(sentences) <= 1:
            return sentences

        # Embed all sentences; unit-normalize so the dot product below is cosine similarity
        embeddings = self.encoder.encode(
            sentences, batch_size=32, normalize_embeddings=True
        )

        # Start a new chunk wherever consecutive sentences are dissimilar
        boundaries = [0]  # Always start a chunk at position 0
        for i in range(1, len(sentences)):
            sim = float(np.dot(embeddings[i - 1], embeddings[i]))
            if sim < self.threshold:
                boundaries.append(i)
        boundaries.append(len(sentences))

        # Merge into chunks, respecting size constraints
        chunks = []
        for start, end in zip(boundaries[:-1], boundaries[1:]):
            chunk_text = " ".join(sentences[start:end])
            tokens = len(self.tokenizer.encode(chunk_text))
            if tokens > self.max_size:
                # Fall back to fixed-size splitting for oversized chunks
                chunks.extend(self._fixed_split(chunk_text))
            elif tokens >= self.min_size:
                chunks.append(chunk_text)
            elif chunks:
                # Merge undersized chunk with previous
                chunks[-1] += " " + chunk_text
            else:
                chunks.append(chunk_text)
        return chunks

    def _split_sentences(self, text: str) -> List[str]:
        sentences = re.split(r'(?<=[.!?])\s+', text)
        return [s.strip() for s in sentences if s.strip()]

    def _fixed_split(self, text: str, chunk_size: int = 512, overlap: int = 64) -> List[str]:
        tokens = self.tokenizer.encode(text)
        chunks = []
        for i in range(0, len(tokens), chunk_size - overlap):
            chunks.append(self.tokenizer.decode(tokens[i : i + chunk_size]))
            if i + chunk_size >= len(tokens):
                break  # avoid a trailing chunk made only of overlap tokens
        return chunks
```
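The hierarchical (parent-child) strategy described above can be sketched in the same spirit. This is an illustrative sketch, not a reference implementation: the function names are hypothetical, whitespace tokens stand in for model tokens, and the retrieval step is represented only by a list of child IDs that a vector search is assumed to have returned.

```python
from typing import List, Tuple


def build_parent_child_index(
    tokens: List[str], parent_size: int = 512, child_size: int = 128
) -> Tuple[List[str], List[Tuple[str, int]]]:
    """Return (parents, children); each child is (child_text, parent_index)."""
    parents: List[str] = []
    children: List[Tuple[str, int]] = []
    for p_start in range(0, len(tokens), parent_size):
        parent_tokens = tokens[p_start : p_start + parent_size]
        parent_id = len(parents)
        parents.append(" ".join(parent_tokens))
        # Children are cut from inside the parent, so they never span two parents.
        for c_start in range(0, len(parent_tokens), child_size):
            child_tokens = parent_tokens[c_start : c_start + child_size]
            children.append((" ".join(child_tokens), parent_id))
    return parents, children


def expand_to_parents(
    hit_child_ids: List[int], children: List[Tuple[str, int]], parents: List[str]
) -> List[str]:
    """Swap retrieved children for their parents, dropping duplicate parents."""
    seen, context = set(), []
    for child_id in hit_child_ids:
        parent_id = children[child_id][1]
        if parent_id not in seen:
            seen.add(parent_id)
            context.append(parents[parent_id])
    return context


tokens = [f"w{i}" for i in range(20)]
parents, children = build_parent_child_index(tokens, parent_size=8, child_size=4)
# Assume vector search over the children returned child IDs 1, 0 and 3.
context = expand_to_parents([1, 0, 3], children, parents)
```

Only the children are embedded and searched; the parents are stored as plain text and looked up by index, which is why two hits from the same parent collapse into a single context entry.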
Embedding Model Selection
The embedding model determines how well semantic similarity maps to retrieval relevance. Key dimensions to evaluate:
| Model | Dimensions | Context Window | Retrieval Quality | Cost |
|---|---|---|---|---|
| text-embedding-3-small | 1536 | 8191 tokens | Good | Low |
| text-embedding-3-large | 3072 | 8191 tokens | Excellent | Medium |
| voyage-2 | 1024 | 4000 tokens | Excellent | Medium |
| bge-large-en-v1.5 | 1024 | 512 tokens | Very Good | Free (self-hosted) |
| e5-mistral-7b-instruct | 4096 | 32768 tokens | Best | High (self-hosted) |
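The Massive Text Embedding Benchmark (MTEB) is the standard evaluation. Look at the retrieval-specific BEIR subset, not just the overall MTEB score, because the overall score averages in tasks such as classification and clustering that say little about retrieval quality.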
Critical decision: Use the same embedding model for indexing and querying. If you change embedding models, you must reindex the entire document store - there is no compatibility between embedding spaces from different models.
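One way to enforce this rule is to record the embedding model in the index itself and reject any vector produced by a different model. The sketch below is a toy in-memory illustration; the `VectorIndex` class and the model names `model-a` and `model-b` are hypothetical placeholders, not part of any vector store's API.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class VectorIndex:
    """Toy in-memory index that remembers which embedding model built it."""

    embedding_model: str
    dimensions: int
    vectors: Dict[str, List[float]] = field(default_factory=dict)

    def _check(self, model: str, vector: List[float]) -> None:
        if model != self.embedding_model:
            raise ValueError(
                f"index was built with {self.embedding_model!r}, got {model!r}; reindex first"
            )
        if len(vector) != self.dimensions:
            raise ValueError(f"expected {self.dimensions} dimensions, got {len(vector)}")

    def add(self, doc_id: str, vector: List[float], model: str) -> None:
        self._check(model, vector)
        self.vectors[doc_id] = vector

    def search(self, query_vector: List[float], model: str, top_k: int = 3) -> List[str]:
        self._check(model, query_vector)
        # Dot-product ranking; adequate for a toy example with comparable norms.
        scores = {
            doc_id: sum(q * v for q, v in zip(query_vector, vec))
            for doc_id, vec in self.vectors.items()
        }
        return sorted(scores, key=scores.get, reverse=True)[:top_k]


index = VectorIndex(embedding_model="model-a", dimensions=3)
index.add("doc1", [1.0, 0.0, 0.0], model="model-a")
index.add("doc2", [0.0, 1.0, 0.0], model="model-a")
hits = index.search([0.9, 0.1, 0.0], model="model-a")
```

Querying this index with `model="model-b"` raises a `ValueError` instead of silently returning meaningless neighbors, which is the failure mode the rule is meant to prevent.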
Vector Store Design
pgvector: PostgreSQL extension. Best choice if you already run PostgreSQL. ACID compliance, standard SQL queries, easy joins with metadata. Supports HNSW and IVFFlat indexes. Scales to ~10M vectors comfortably.
Pinecone: Managed service. Horizontal scaling to billions of vectors. Handles index management, replication, and serving. No infrastructure management. Higher cost at scale.
Weaviate: Open-source, self-hosted. Supports vector search + keyword search in one system. Hybrid search native. Good for teams that want control over infrastructure.
Qdrant: Open-source, self-hosted. Strong performance on HNSW. Payload filtering before vector search (efficient metadata filtering). Good Rust-based implementation.
Chroma: Development and small-scale production. Simple API. Not recommended for billion-scale.
Hybrid Search: BM25 + Dense Retrieval
Hybrid search combines sparse lexical matching (BM25) with dense semantic search. This combination outperforms either approach alone on most retrieval tasks because:
- BM25 is excellent at exact keyword matching - it finds documents that contain the exact terms in the query
- Dense retrieval is excellent at semantic similarity - it finds documents that are conceptually related even when vocabulary differs
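The two rankings are typically fused with Reciprocal Rank Fusion (RRF):

$$\text{RRF}(d) = \sum_{r \in R} \frac{1}{k + \text{rank}_r(d)}$$

where $R$ is the set of ranked lists (BM25 ranks, dense ranks), $\text{rank}_r(d)$ is document $d$'s rank in list $r$, and $k$ is a smoothing constant.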
```python
from typing import Dict, List, Optional, Tuple

import numpy as np
from rank_bm25 import BM25Okapi


def reciprocal_rank_fusion(
    ranked_lists: List[List[Tuple[str, float]]],
    k: int = 60,
    weights: Optional[List[float]] = None,
) -> List[Tuple[str, float]]:
    """
    Combine multiple ranked lists using Reciprocal Rank Fusion.

    Args:
        ranked_lists: each element is a list of (doc_id, score) tuples sorted by score desc
        k: smoothing constant (60 is standard)
        weights: optional per-list weights (default equal weighting)

    Returns:
        Fused ranked list of (doc_id, rrf_score) tuples
    """
    if weights is None:
        weights = [1.0] * len(ranked_lists)
    rrf_scores: Dict[str, float] = {}
    for weight, ranked_list in zip(weights, ranked_lists):
        # Only the rank position is used; the raw score is ignored
        for rank, (doc_id, _) in enumerate(ranked_list, start=1):
            rrf_scores[doc_id] = rrf_scores.get(doc_id, 0) + weight * (1.0 / (k + rank))
    return sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)


class HybridRetriever:
    """Combines BM25 sparse and dense vector retrieval."""

    def __init__(
        self,
        vector_store,  # your vector store client
        corpus: List[Dict],  # {"id": str, "text": str, "metadata": dict}
        embedding_model,
        bm25_weight: float = 0.3,
        dense_weight: float = 0.7,
    ):
        self.vector_store = vector_store
        self.embedding_model = embedding_model
        self.bm25_weight = bm25_weight
        self.dense_weight = dense_weight

        # Build BM25 index (whitespace tokenization, lowercased)
        tokenized_corpus = [doc["text"].lower().split() for doc in corpus]
        self.bm25 = BM25Okapi(tokenized_corpus)
        self.doc_ids = [doc["id"] for doc in corpus]

    def retrieve(self, query: str, top_k: int = 20) -> List[Dict]:
        """Retrieve top-k documents using hybrid search."""
        # BM25 retrieval: over-fetch 2x so fusion has candidates to work with
        bm25_scores = self.bm25.get_scores(query.lower().split())
        bm25_top_k = np.argsort(bm25_scores)[::-1][: top_k * 2]
        bm25_results = [(self.doc_ids[i], float(bm25_scores[i])) for i in bm25_top_k]

        # Dense retrieval; the store is assumed to return (doc_id, score) tuples, best first
        query_embedding = self.embedding_model.encode(query)
        dense_results = self.vector_store.search(query_embedding, top_k=top_k * 2)

        # RRF fusion
        fused = reciprocal_rank_fusion(
            [bm25_results, dense_results],
            weights=[self.bm25_weight, self.dense_weight],
        )
        return [{"id": doc_id, "score": score} for doc_id, score in fused[:top_k]]
```
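To make the arithmetic concrete, here is a small worked example of unweighted RRF with $k = 60$. The document IDs and both rankings are invented for illustration, and `rrf` is a compact restatement of the fusion logic rather than a separate algorithm.

```python
from typing import Dict, List


def rrf(ranked_lists: List[List[str]], k: int = 60) -> Dict[str, float]:
    """Unweighted Reciprocal Rank Fusion over lists of doc IDs (best first)."""
    scores: Dict[str, float] = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return scores


bm25_ranking = ["doc_a", "doc_b", "doc_c"]   # hypothetical lexical ranking
dense_ranking = ["doc_c", "doc_a", "doc_d"]  # hypothetical semantic ranking

scores = rrf([bm25_ranking, dense_ranking])
fused = sorted(scores, key=scores.get, reverse=True)

# doc_a: 1/61 + 1/62   (rank 1 in BM25, rank 2 in dense)
# doc_c: 1/63 + 1/61   (rank 3 in BM25, rank 1 in dense)
# doc_b: 1/62          (BM25 only)
# doc_d: 1/63          (dense only)
# fused == ["doc_a", "doc_c", "doc_b", "doc_d"]
```

Documents that appear in both lists outrank documents that appear in only one, even when the single-list document was ranked higher there. That is the behavior hybrid search relies on.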
Reranking with Cross-Encoders
The retrieval stage retrieves the top 20-100 candidates quickly. The reranking stage applies a more expensive but more accurate model to reorder them. Cross-encoders process the query and candidate together (no precomputation), allowing deep interaction modeling.
```python
from typing import Dict, List

from sentence_transformers import CrossEncoder


class Reranker:
    """
    Cross-encoder reranker. Processes query + document jointly for accurate scoring.
    Much more accurate than bi-encoder retrieval, but 10-50x slower - use on small candidate sets.
    """

    def __init__(self, model_name: str = "cross-encoder/ms-marco-MiniLM-L-6-v2"):
        self.model = CrossEncoder(model_name)

    def rerank(
        self,
        query: str,
        candidates: List[Dict],  # [{"id": str, "text": str, ...}]
        top_k: int = 5,
    ) -> List[Dict]:
        """
        Rerank candidates using cross-encoder.
        Returns top_k candidates sorted by relevance score.
        """
        if not candidates:
            return []
        # One (query, document) pair per candidate, scored jointly by the model
        pairs = [(query, doc["text"]) for doc in candidates]
        scores = self.model.predict(pairs)
        # Note: this annotates the candidate dicts in place
        for doc, score in zip(candidates, scores):
            doc["rerank_score"] = float(score)
        reranked = sorted(candidates, key=lambda x: x["rerank_score"], reverse=True)
        return reranked[:top_k]
```
Query Rewriting
User queries are often ambiguous, incomplete, or phrased differently from the documents. Query rewriting expands or reformulates the query before retrieval.
HyDE (Hypothetical Document Embedding): Ask the LLM to generate a hypothetical document that would answer the query. Embed the hypothetical document instead of the raw query. Since the hypothetical document is phrased like an actual document, it retrieves real documents more accurately.
Multi-query expansion: Generate 3-5 reformulations of the query. Run retrieval for each, then deduplicate and fuse results.
Step-back prompting: For complex questions, ask the LLM to identify the underlying concept and retrieve on that concept first.
```python
async def hyde_query_expansion(query: str, llm_client) -> str:
    """
    Hypothetical Document Embedding: generate a hypothetical answer document
    and use its embedding for retrieval instead of the query embedding.

    `llm_client` is a placeholder for your async LLM client; it is assumed to
    expose `complete(prompt, max_tokens=...)` and return an object with `.text`.
    """
    prompt = f"""Generate a short document (2-3 sentences) that would be the perfect
answer to this question. Write it as if you are an expert writing documentation,
not as if you are answering a question.
Question: {query}
Document:"""
    response = await llm_client.complete(prompt, max_tokens=150)
    # Embed this text (not the raw query) and use it for vector search
    return response.text
```
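Multi-query expansion can be sketched along the same lines: retrieve once per reformulation, then deduplicate and fuse with RRF. In the sketch below, `reformulate` and `retrieve` are injected callables, and the stubs `fake_reformulate` and `FAKE_INDEX` are hypothetical stand-ins for an LLM call and a vector store so that the example runs offline.

```python
from typing import Callable, Dict, List


def multi_query_retrieve(
    query: str,
    reformulate: Callable[[str], List[str]],
    retrieve: Callable[[str], List[str]],
    top_k: int = 5,
    k: int = 60,
) -> List[str]:
    """Retrieve for the query and each reformulation, then fuse with RRF."""
    variants = [query] + [v for v in reformulate(query) if v != query]
    scores: Dict[str, float] = {}
    for variant in variants:
        for rank, doc_id in enumerate(retrieve(variant), start=1):
            # Summing per doc_id deduplicates documents found by several variants.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Hypothetical stubs: in practice `reformulate` is an LLM call and `retrieve`
# queries the vector store.
def fake_reformulate(query: str) -> List[str]:
    return ["myocardial infarction symptoms", "signs of a heart attack"]


FAKE_INDEX = {
    "heart attack symptoms": ["doc_faq", "doc_blog"],
    "myocardial infarction symptoms": ["doc_clinical", "doc_faq"],
    "signs of a heart attack": ["doc_faq", "doc_clinical"],
}

results = multi_query_retrieve(
    "heart attack symptoms", fake_reformulate, lambda q: FAKE_INDEX.get(q, [])
)
```

Here `doc_clinical` is never returned for the original phrasing but surfaces through the reformulations, which is the vocabulary-mismatch case this technique targets. The retrieval calls are independent, so in a real system they can run concurrently.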
RAG Evaluation with RAGAS
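RAGAS (Retrieval Augmented Generation Assessment) provides four core metrics for evaluating RAG quality. Faithfulness and answer relevance need no labeled ground truth; context recall needs reference answers, and depending on the RAGAS version, context precision may as well: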
Faithfulness: Does the answer contain only information supported by the retrieved context? Measures hallucination.
Answer Relevance: Does the answer address the question asked? Measures verbosity and topic drift.
Context Precision: Are the retrieved chunks relevant to the question? Measures retrieval precision.
Context Recall: Does the retrieved context contain all the information needed to answer the question? Measures retrieval recall.
```python
from typing import Dict, List, Optional

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)


def evaluate_rag_pipeline(
    questions: List[str],
    answers: List[str],
    contexts: List[List[str]],
    ground_truths: Optional[List[str]] = None,
) -> Dict:
    """
    Evaluate RAG pipeline quality using RAGAS metrics.

    Note: ground_truths are required for context_recall.
    If not available, omit context_recall from metrics.
    Column names and metric requirements vary between RAGAS releases;
    check them against the version you have installed.
    """
    data = {
        "question": questions,
        "answer": answers,
        "contexts": contexts,  # one list of retrieved chunks per question
    }
    if ground_truths:
        data["ground_truth"] = ground_truths
    dataset = Dataset.from_dict(data)

    metrics = [faithfulness, answer_relevancy, context_precision]
    if ground_truths:
        metrics.append(context_recall)

    result = evaluate(dataset, metrics=metrics)
    return result
```
Production Engineering Notes
Chunking in Practice
The right chunk size depends on your document type:
- Technical documentation: 512 tokens with 64-token overlap. Technical docs are dense with information; large chunks introduce unrelated content.
- Legal contracts: 1024 tokens. Legal clauses are long and must be read in full context.
- FAQ documents: Keep each Q&A as a single chunk. Do not split across Q&A pairs.
- Tabular data: Convert tables to natural language before chunking. Never split a table across chunks.
Test your chunking visually: take 10 representative queries, retrieve the top-5 chunks, and manually inspect whether the chunks make sense. The best parameter tuning tool is human inspection.
Latency Decomposition
A production RAG pipeline has several latency components:
| Stage | Typical Latency | Notes |
|---|---|---|
| Query rewriting | 300-800ms | LLM call; can be parallelized |
| Dense embedding | 10-30ms | Batch GPU inference |
| BM25 search | 1-5ms | In-memory, very fast |
| Vector search | 5-20ms | Depends on index size |
| Reranking | 50-200ms | Cross-encoder on GPU |
| LLM generation | 500-3000ms | Dominates total latency |
The LLM generation dominates. Optimize retrieval first (it's cheaper), but measure end-to-end to make sure retrieval quality improvements translate to quality improvements in the final answer.
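The numbers in the table are typical ranges, so it is worth measuring your own pipeline per stage. A minimal sketch of per-stage timing follows; `StageTimer` is a hypothetical helper, and the `time.sleep` calls are placeholders for the real retrieval and generation calls.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator


class StageTimer:
    """Accumulates wall-clock time per pipeline stage, in milliseconds."""

    def __init__(self) -> None:
        self.timings_ms: Dict[str, float] = {}

    @contextmanager
    def stage(self, name: str) -> Iterator[None]:
        start = time.perf_counter()
        try:
            yield
        finally:
            # Recorded even if the stage raises, so failures still show up.
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            self.timings_ms[name] = self.timings_ms.get(name, 0.0) + elapsed_ms

    def slowest(self) -> str:
        return max(self.timings_ms, key=self.timings_ms.get)


timer = StageTimer()
with timer.stage("vector_search"):
    time.sleep(0.002)  # placeholder for the real vector search call
with timer.stage("llm_generation"):
    time.sleep(0.05)   # placeholder for the real LLM call

bottleneck = timer.slowest()
```

Logging `timer.timings_ms` per request makes it easy to see whether a latency regression comes from retrieval, reranking, or generation.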
Metadata Filtering
Always store metadata alongside vectors: document source, creation date, document type, access permissions. Use metadata pre-filtering to reduce the search space before vector search. This improves both latency and relevance.
```python
# Example: filter to only recent documents before vector search.
# The operator syntax ($gte, $in, $lte) is illustrative; filter syntax varies by vector store.
results = vector_store.search(
    query_embedding=query_embedding,
    top_k=20,
    filter={
        "created_at": {"$gte": "2024-01-01"},
        "document_type": {"$in": ["policy", "manual"]},
        "access_level": {"$lte": user.access_level},
    },
)
```
Common Mistakes
Mistake: Using fixed-size chunking for structured documents.
PDFs with tables, code with functions, and legal contracts with numbered clauses all have natural structural boundaries. Splitting these by token count destroys semantic coherence. A chunk that ends mid-table or mid-code-function is nearly useless for retrieval. Use structure-aware chunking: parse the document structure (headings, sections, table boundaries) and chunk at those boundaries.
Mistake: Not evaluating retrieval independently from generation.
It is tempting to evaluate RAG end-to-end by asking humans to rate answers. But if answers are poor, you do not know whether retrieval failed (wrong chunks retrieved) or generation failed (right chunks, wrong answer). Always evaluate retrieval separately: for a set of queries, manually identify which chunks should be retrieved, and compute recall@K. This tells you where to focus improvement effort.
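Recall@K is simple to compute once the relevant chunks are labeled. The sketch below assumes chunk IDs as strings; the labels and retriever output are invented for illustration.

```python
from typing import Dict, List, Set


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant chunk IDs that appear in the top-k retrieved IDs."""
    if not relevant:
        raise ValueError("each query needs at least one labeled relevant chunk")
    hits = relevant.intersection(retrieved[:k])
    return len(hits) / len(relevant)


def mean_recall_at_k(
    runs: Dict[str, List[str]], labels: Dict[str, Set[str]], k: int
) -> float:
    """Average recall@k over every labeled query."""
    return sum(recall_at_k(runs[q], labels[q], k) for q in labels) / len(labels)


# Hypothetical labels (query -> chunk IDs a human marked as relevant) and
# retriever output (query -> ranked chunk IDs).
labels = {"q1": {"c1", "c4"}, "q2": {"c7"}}
runs = {"q1": ["c1", "c2", "c3", "c4"], "q2": ["c9", "c8", "c7"]}

recall_3 = mean_recall_at_k(runs, labels, k=3)  # (0.5 + 1.0) / 2 = 0.75
recall_2 = mean_recall_at_k(runs, labels, k=2)  # (0.5 + 0.0) / 2 = 0.25
```

Tracking this number across chunking, embedding, and hybrid-search changes shows whether a change helped retrieval before any generation is involved.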
Mistake: Reranking with the same model used for retrieval.
Reranking should use a more powerful model than retrieval. If you use the same bi-encoder for both, reranking just reorders based on the same signal the retrieval already maximized - it adds latency with no quality gain. Use a cross-encoder for reranking (it processes query + document jointly, unlike the bi-encoder which processes them separately).
Mistake: Including the entire retrieved context in the prompt without truncation.
LLMs experience "lost in the middle" - they pay much more attention to the beginning and end of their context window than the middle. If you dump 10 retrieved chunks into the middle of a 16K context window, the model may ignore the most relevant ones. Limit retrieved context to 3-5 highly relevant chunks, and order them so the most relevant chunk is either first or last.
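One possible way to apply this ordering advice is to truncate to the top few chunks and then alternate them between the front and the back of the context, so the weakest of the kept chunks land in the middle. The function below is an illustrative sketch with a hypothetical name, and it assumes the input is already sorted from most to least relevant (for example, by rerank score).

```python
from typing import List


def order_for_context(chunks_by_relevance: List[str], max_chunks: int = 5) -> List[str]:
    """Keep the top max_chunks and place the strongest at the edges of the context.

    Input must be sorted from most to least relevant. Ranks alternate between the
    front and the back, so the weakest kept chunks end up in the middle.
    """
    kept = chunks_by_relevance[:max_chunks]
    front: List[str] = []
    back: List[str] = []
    for i, chunk in enumerate(kept):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]


ranked = ["c1", "c2", "c3", "c4", "c5", "c6", "c7"]  # c1 is the most relevant
ordered = order_for_context(ranked)
# ordered == ["c1", "c3", "c5", "c4", "c2"]
```

The most relevant chunk comes first, the second most relevant comes last, and `c6` and `c7` are dropped entirely.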
Tip: Cache embeddings aggressively.
User queries often repeat. "What is the refund policy?" appears hundreds of times per day. Cache query embeddings with a TTL of several hours. Cache retrieved chunks for popular queries. This reduces latency and embedding API costs significantly. Use Redis, keyed by a hash of the normalized query string (for example, MD5), with the embedding stored as the value.
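The caching pattern can be sketched without Redis. In the example below, an in-memory dict stands in for the Redis instance, `EmbeddingCache` is a hypothetical helper, and the lambda passed as `embed_fn` is a placeholder for the real embedding call; the four-hour TTL is just one value in the "several hours" range.

```python
import hashlib
import time
from typing import Callable, Dict, List, Tuple


class EmbeddingCache:
    """In-memory stand-in for a Redis cache: hashed query -> (expiry, embedding)."""

    def __init__(
        self, embed_fn: Callable[[str], List[float]], ttl_seconds: float = 4 * 3600
    ) -> None:
        self.embed_fn = embed_fn
        self.ttl = ttl_seconds
        self.store: Dict[str, Tuple[float, List[float]]] = {}
        self.misses = 0

    @staticmethod
    def key_for(query: str) -> str:
        # Normalize case and whitespace so trivially different queries share a key.
        normalized = " ".join(query.lower().split())
        return hashlib.md5(normalized.encode("utf-8")).hexdigest()

    def get(self, query: str) -> List[float]:
        key = self.key_for(query)
        entry = self.store.get(key)
        now = time.monotonic()
        if entry is not None and entry[0] > now:
            return entry[1]  # cache hit, still within TTL
        self.misses += 1
        embedding = self.embed_fn(query)
        self.store[key] = (now + self.ttl, embedding)
        return embedding


# Hypothetical embedding function; a real one would call the embedding model.
cache = EmbeddingCache(embed_fn=lambda q: [float(len(q))])
first = cache.get("What is the refund policy?")
second = cache.get("  what is the REFUND policy? ")
```

The second lookup differs only in case and whitespace, so it hits the cache and the embedding function is called once. With Redis, the same key would be written with an expiry instead of storing the expiry timestamp by hand.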
Interview Q&A
Q: What is the difference between naive RAG and advanced RAG, and when does each approach fail?
A: Naive RAG takes the user query, embeds it, retrieves the top-K chunks by cosine similarity, and passes them to the LLM. It fails in four scenarios: vocabulary mismatch (user asks about "heart attack," documents use "myocardial infarction"), multi-hop queries (the answer requires combining information from two separate documents), ambiguous queries (a short query matches many irrelevant chunks), and poor chunking (the relevant information spans a chunk boundary). Advanced RAG addresses these with query rewriting (fixes vocabulary mismatch), hierarchical retrieval (fixes multi-hop), query expansion (fixes ambiguity), and semantic chunking (fixes boundary issues). Advanced RAG also adds reranking - a cross-encoder that reorders the retrieved candidates more accurately than bi-encoder similarity.
Q: How do you choose between pgvector, Pinecone, and Weaviate for a production RAG deployment?
A: The decision comes down to scale, operational complexity, and existing infrastructure. If you already run PostgreSQL and your corpus is under 10 million vectors, pgvector is the right default - it is one less system to manage, it supports ACID transactions, and you can filter with standard SQL WHERE clauses. If you expect to scale beyond 100M vectors or need multi-region deployment, a managed service like Pinecone removes operational burden at a higher cost. Weaviate is the right choice when you need hybrid search (BM25 + dense) as a first-class feature built into the vector store, rather than implementing it at the application layer. For most enterprise RAG systems starting out, pgvector until you hit its limits, then Pinecone.
Q: Explain Reciprocal Rank Fusion and why it is preferred over score normalization for hybrid search.
A: Score normalization attempts to put BM25 scores and cosine similarity scores on the same scale (e.g., min-max normalize both to [0,1]) and then take a weighted average. The problem is that BM25 scores and cosine similarity scores have different distributional shapes - BM25 scores are skewed right and depend on document length, while cosine similarities are bounded and concentrated near 0 for unrelated documents. Normalizing them makes assumptions about their distributions that are often violated. RRF uses only the rank position, not the raw scores, which makes it distribution-free. The formula gives most weight to top-ranked items and is robust to the long tail of low-ranked items. In practice RRF is usually the safer default because it is insensitive to scale differences between ranking systems and needs no per-corpus calibration.
Q: How would you debug a RAG system that gives confident but wrong answers?
A: Structured debugging. First, separate retrieval from generation: is the correct answer in the retrieved chunks? If yes, the LLM is hallucinating or misinterpreting correct context (generation failure). If no, the wrong information was retrieved (retrieval failure). For generation failures: check if the LLM is ignoring context in favor of parametric knowledge (add "answer only from the provided context, do not use prior knowledge" to the system prompt), and check if the prompt format is confusing the model. For retrieval failures: check if the correct chunks exist in the corpus at all, check if the relevant passage was split across chunks by poor chunking, and check recall@5 for the query. The most common cause of confident wrong answers is retrieval failure combined with a question that could plausibly be answered from general knowledge - the LLM fills the gap with parametric memory.
Q: What is HyDE and when does it help?
A: Hypothetical Document Embedding asks the LLM to generate a hypothetical answer to the query before retrieval. This hypothetical answer is then embedded and used as the retrieval query instead of the original question. It helps primarily in asymmetric retrieval tasks where the query and the relevant documents have different linguistic styles - for example, a question ("What causes bubble formation in fermentation?") is phrased differently from the document that contains the answer ("Bubble formation during fermentation results from carbon dioxide production by yeast..."). The hypothetical answer is phrased more like a document, so its embedding is closer to the relevant document's embedding in the vector space. HyDE works well for open-domain factual questions but adds 300-800ms latency for the extra LLM call. Do not use it for queries where the vocabulary already matches the corpus well.
