Why RAG Exists
Reading time: 45 minutes | Interview relevance: Very High | Target roles: AI Engineer, ML Engineer, Backend Engineer building AI products
The Contract Review Catastrophe
The system worked beautifully in the demo. Meridian Legal AI had built a contract review assistant that could analyze NDAs, employment agreements, and software licensing contracts in seconds - a task that previously took a junior associate two hours. The founders showed it to twenty law firms and got twelve signed letters of intent. The investors were excited. The engineering team was proud.
Then they shipped it to their first real client, a mid-sized pharmaceutical company with a 340-page master services agreement that had been negotiated over four years and contained seventeen custom annexes. Clause 14.3(b) of Annex G defined the specific conditions under which the licensor could trigger a penalty clause - a bespoke provision that appeared in no standard contract template, no legal textbook, and no publicly available case law. The AI assistant was asked: "What are the conditions that trigger the penalty under clause 14.3(b)?" It answered confidently, citing conditions that were plausible but completely fabricated. The real clause had three specific carve-outs. The AI's answer had none of them.
The legal team caught it before it reached the client. But it shook everyone's confidence. The engineering team's first instinct was fine-tuning. They collected all the client's contracts, formatted them as instruction-response pairs, and spent three weeks running supervised fine-tuning jobs. The resulting model was better on generic contract language but still hallucinated on the specific provisions of this specific client's specific annex. How could it not? The model had never seen those exact documents during training, and fine-tuning on a few hundred examples wasn't going to change that.
The root cause was simple, and it was architectural: the model had no access to the actual contract. It was generating answers from parametric memory - knowledge baked into its weights during training - and applying that general knowledge to highly specific, proprietary questions. The solution wasn't a better model. It wasn't more fine-tuning. It was giving the model the document and asking it to answer from the text in front of it. That is Retrieval-Augmented Generation.
They rebuilt the pipeline in four days. The new system retrieved the relevant clauses from the contract, passed them to the model as context, and instructed it to answer only from what was provided and to cite the clause number. Accuracy on specific contractual questions went from roughly 40% to over 95%. The AI assistant shipped to the pharmaceutical client the following week and has been in production ever since. The three weeks spent on fine-tuning taught them something more valuable than any accuracy metric: they learned that the problem was information access, not model capability - and those are solved by entirely different tools.
Why This Exists
Language models store knowledge in their weights. During training, billions of parameters adjust to encode statistical relationships across the training corpus - facts, grammar, reasoning patterns, world knowledge. When you ask a model a question, it retrieves this knowledge from its own parameters to generate a response. This is called parametric memory: knowledge stored in the parameters of the model itself.
Parametric memory has three deep structural problems for production AI systems:
1. Knowledge cutoff. The model only knows what was in its training data. If your organization's policy changed last quarter, if a regulation was updated six months ago, if a product specification was written last week - the model has no knowledge of these things. No amount of clever prompt wording will fix this, because the information is simply not in the model's weights; it has to be supplied from outside.
2. Hallucination under uncertainty. When a model is asked about something it has weak or partial knowledge of, it rarely says "I don't know." It generates a plausible-sounding answer by pattern-matching to similar contexts it has seen. This is not a bug in a particular model - it is a property of how autoregressive text generation works. At each step the model picks the next token from a probability distribution conditioned on the text so far, and nothing in that process checks the output against a source of truth. If a likely continuation of "What does clause 14.3(b) say?" reads like other contract clauses the model has seen, it will produce that answer, even if it's wrong.
3. No source attribution. Even when a model produces a correct answer, it cannot tell you where that answer came from. For legal, medical, financial, and compliance use cases, this is often a hard requirement. "The model said so" is not an acceptable citation in a court filing.
RAG exists to address all three problems at once. Instead of asking the model to answer from its parametric memory, RAG retrieves relevant documents and passes them as context. The model then answers from the retrieved text - which is current, specific, and attributable.
Historical Context
The term "Retrieval-Augmented Generation" was coined in a 2020 paper from Facebook AI Research: "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" by Patrick Lewis, Ethan Perez, Aleksandra Piktus, and colleagues. Published at NeurIPS 2020, the paper introduced a model architecture that combined a dense retrieval system with a sequence-to-sequence generator.
The key insight in the Lewis et al. paper was the distinction between parametric and non-parametric memory in language models:
- Parametric memory: knowledge stored in the model's weights, fixed after training
- Non-parametric memory: an external datastore that can be updated without retraining
The original RAG paper used a dense retrieval system called DPR (Dense Passage Retrieval, also from Facebook AI Research) to retrieve relevant Wikipedia passages, then fed them to a BART sequence-to-sequence model as context. On knowledge-intensive benchmarks like Natural Questions, TriviaQA, and WebQuestions, RAG significantly outperformed models that relied on parametric memory alone.
Before the Lewis paper, the dominant approach to knowledge-intensive NLP was to pack more knowledge into larger models - essentially, if the model doesn't know something, train a bigger model on more data. This approach had diminishing returns and was fundamentally brittle for anything requiring specific, current, or private knowledge. RAG proposed a different architecture: keep the model relatively small and general, and pair it with a dynamically updatable retrieval system.
The field has evolved rapidly since 2020. Simple retrieve-and-read pipelines in the style of the original Lewis et al. architecture are now commonly called Naive RAG. The subsequent evolution, with approximate dates:
- Naive RAG (2020-2022): basic dense retrieval, fixed chunking, single retrieval step
- Advanced RAG (2022-2023): query rewriting, reranking, hybrid search, better chunking
- Modular RAG (2023-2024): routing, multi-step retrieval, iterative refinement, CRAG
- Agentic RAG (2024-present): self-correcting retrieval loops, tool-augmented retrieval, multi-hop reasoning over heterogeneous sources
Three Alternatives - When Each Is Right
Before building a RAG system, you should seriously consider three alternatives. Each is the right answer for a different kind of problem.
Option 1: Fine-Tuning
Fine-tuning updates the model's weights on domain-specific data. It is the right tool when:
- You need the model to adopt a specific style or format (e.g., always respond in JSON, always write in a specific legal voice)
- You need to internalize domain vocabulary (medical terminology, legal Latin, internal jargon) so the model understands questions without needing those terms in context
- You need to improve reasoning patterns for a specific task type (e.g., always break contract review into five specific categories)
- Your knowledge is stable and doesn't change frequently - you're not trying to serve current information
Fine-tuning is the wrong tool when:
- You need to teach the model specific facts from specific documents
- Your knowledge base changes weekly or monthly
- You need the model to cite sources
- You have too few high-quality training examples (as a rough rule of thumb, fewer than ~10,000)
The Meridian Legal team's mistake was classic: they tried to fine-tune domain knowledge into the model when what they needed was document access. Fine-tuning cannot memorize every clause of every contract. Even if it could, new contracts arrive every week.
Option 2: Prompt Stuffing
If your context window is large enough, you can simply put all relevant documents directly into the prompt. With 200K token context windows now available, this sounds appealing.
Prompt stuffing works well when:
- Your knowledge fits in the context window
- You always know which documents are relevant in advance
- The query is broad enough that you need the entire knowledge base
It breaks down when:
- Your knowledge base is larger than the context window (a 50,000-document corpus at 1,000 tokens each is 50 million tokens - not feasible)
- Costs are prohibitive (you pay for every input token on every query, so sending hundreds of thousands of tokens per request becomes expensive at scale)
- Relevant documents must be selected dynamically based on the query
- Long-context retrieval has diminishing performance ("lost in the middle" phenomenon - models pay less attention to content in the middle of very long contexts)
- Latency matters - processing 200K tokens takes several seconds
The "lost in the middle" effect, documented by Liu et al. (2023), showed that LLM performance drops significantly when relevant information appears in the middle of a very long context. RAG mitigates this by retrieving only the most relevant chunks, keeping the context short and high-signal.
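The corpus-size arithmetic in the list above can be checked in a few lines. The sketch below is a back-of-the-envelope feasibility check, not a tokenizer: the helper names and the 4,000-token reservation for instructions and the answer are illustrative assumptions.

```python
def corpus_tokens(num_docs: int, avg_tokens_per_doc: int) -> int:
    """Total tokens needed to place every document in the prompt."""
    return num_docs * avg_tokens_per_doc


def fits_in_context(
    num_docs: int,
    avg_tokens_per_doc: int,
    context_window: int = 200_000,
    reserved_tokens: int = 4_000,  # assumed allowance for instructions + answer
) -> bool:
    """True if the whole corpus fits next to the reserved budget."""
    usable = context_window - reserved_tokens
    return corpus_tokens(num_docs, avg_tokens_per_doc) <= usable


# The 50,000-document corpus from the text: 50 million tokens, far too large
print(f"{corpus_tokens(50_000, 1_000):,} tokens")
print(fits_in_context(50_000, 1_000))  # False

# A 100-document knowledge base is a plausible prompt-stuffing candidate
print(fits_in_context(100, 1_000))     # True (100,000 <= 196,000)
```

If the check passes comfortably and the document set is known in advance, prompt stuffing may be the simpler choice; cost and latency per query still scale with the full prompt size.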
Option 3: RAG - The Right Tool for Factual, Dynamic, Citable Knowledge
RAG is the right architecture when:
- Your knowledge base is too large to fit in a context window
- Your knowledge changes frequently (product docs, regulations, research papers)
- Users need citations - they need to know where the answer came from
- You need to control hallucination by grounding answers in specific documents
- Different users need access to different subsets of the knowledge base (tenant isolation)
RAG's core trade-off: it introduces retrieval latency and retrieval failure modes (if you retrieve the wrong chunks, the answer will be wrong even if the model is perfect). Good RAG engineering is largely about minimizing retrieval failure.
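One way to make "retrieval failure" measurable is a hit-rate check: for each test question, does at least one known-relevant chunk appear in the top-k results? The sketch below assumes a small hand-labeled evaluation set; the chunk IDs and the `hit_rate_at_k` helper are hypothetical and only illustrate the idea.

```python
def hit_rate_at_k(retrieved: list[list[str]], relevant: list[set[str]], k: int) -> float:
    """Fraction of queries whose top-k results contain at least one relevant chunk."""
    if len(retrieved) != len(relevant):
        raise ValueError("retrieved and relevant need one entry per query")
    if not retrieved:
        return 0.0
    hits = sum(
        1
        for ranked, gold in zip(retrieved, relevant)
        if any(chunk_id in gold for chunk_id in ranked[:k])
    )
    return hits / len(retrieved)


# Made-up evaluation data: ranked chunk IDs per query, and the labeled relevant set
retrieved = [
    ["annex-g-2", "annex-g-1", "exhibit-c-0"],
    ["exhibit-c-3", "annex-g-0", "exhibit-c-1"],
    ["annex-g-4", "annex-g-0", "annex-g-2"],
]
relevant = [{"annex-g-1"}, {"exhibit-c-1"}, {"annex-h-0"}]

print(hit_rate_at_k(retrieved, relevant, k=1))  # 0.0: no query has a hit at rank 1
print(hit_rate_at_k(retrieved, relevant, k=3))  # 2 of 3 queries have a hit in the top 3
```

A query that misses at every k (the third one here) can never be answered correctly, no matter how capable the generator is.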
Core Concepts: Parametric vs. Non-Parametric Knowledge
Understanding the parametric/non-parametric distinction is foundational to reasoning about RAG architectures.
Parametric knowledge lives in the model's weights. It was baked in during the pre-training and fine-tuning process. It's fast to query (no retrieval step), but it's fixed, opaque, and impossible to update without retraining. When a model produces a hallucination, it is often because parametric knowledge was insufficient for a specific query and the model "completed" the answer with plausible-but-wrong content.
Non-parametric knowledge lives in an external store - a database, a document collection, a vector index. It can be updated at any time without touching the model. It's slower to query (requires a retrieval step), but it's transparent (you can inspect exactly what was retrieved), attributable (you can cite the source), and current.
RAG combines both: the model's parametric knowledge provides general reasoning capability, language understanding, and synthesis ability, while the non-parametric store provides factual grounding. The model says "based on these documents, the answer is X" rather than "based on what I learned during training, I think the answer is X."
Dense vs. Sparse Retrieval
Before writing code, you need to understand the two families of retrieval methods.
Sparse retrieval (BM25, TF-IDF) represents documents and queries as high-dimensional sparse vectors where each dimension corresponds to a vocabulary term. BM25 is the standard baseline - it scores relevance using term frequency, inverse document frequency, and document length normalization. Sparse retrieval is excellent at lexical matching: if the query contains "clause 14.3(b)", BM25 will rank documents containing that exact term highly, provided the tokenizer keeps it intact. It fails on semantic matching: "When can the penalty be triggered?" won't match a passage that describes clause 14.3(b) in different words, because the two share few terms.
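To make the three ingredients concrete, here is a minimal BM25 scorer in plain Python. It is a sketch under stated assumptions: a whitespace tokenizer, the commonly used parameter values k1=1.5 and b=0.75, and three made-up one-line documents. Production systems would use a search engine or a maintained library instead.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase whitespace tokenizer; real systems use a proper analyzer."""
    return text.lower().split()


class BM25:
    def __init__(self, docs: list[str], k1: float = 1.5, b: float = 0.75):
        self.k1, self.b = k1, b
        self.doc_tokens = [tokenize(d) for d in docs]
        self.doc_freqs = [Counter(tokens) for tokens in self.doc_tokens]
        self.n_docs = len(docs)
        self.avgdl = sum(len(t) for t in self.doc_tokens) / self.n_docs
        # Document frequency: in how many documents does each term appear?
        self.df = Counter(term for freqs in self.doc_freqs for term in freqs)

    def idf(self, term: str) -> float:
        n = self.df.get(term, 0)
        return math.log((self.n_docs - n + 0.5) / (n + 0.5) + 1.0)

    def score(self, query: str, doc_index: int) -> float:
        freqs = self.doc_freqs[doc_index]
        doc_len = len(self.doc_tokens[doc_index])
        # Length normalization: long documents need more matches to score equally
        norm = self.k1 * (1 - self.b + self.b * doc_len / self.avgdl)
        total = 0.0
        for term in tokenize(query):
            f = freqs.get(term, 0)
            if f:
                total += self.idf(term) * f * (self.k1 + 1) / (f + norm)
        return total


docs = [
    "clause 14.3(b) defines the penalty trigger conditions",
    "the licensor may activate the fee provision after repeated test failures",
    "uptime must stay above 99.5% each month",
]
bm25 = BM25(docs)
for query in ["clause 14.3(b)", "fee activated"]:
    print(query, "->", [round(bm25.score(query, i), 3) for i in range(len(docs))])
```

The first query scores only against the document that contains the literal clause reference. The second matches on "fee" but gets no credit for "activated", because the document says "activate": exact-term matching cuts both ways.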
Dense retrieval uses neural embedding models to map documents and queries into a continuous vector space. Similar meanings end up close together in this space, regardless of whether they share vocabulary. A query about "penalty trigger conditions" will retrieve text about "conditions that activate the fee clause" because they're semantically similar. Dense retrieval excels at semantic matching but can miss exact terms (especially proper nouns, IDs, codes) that don't have meaningful embeddings.
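The ranking step of dense retrieval is just nearest-neighbor search by cosine similarity. The sketch below uses hand-written three-dimensional vectors as stand-ins for real embeddings (an assumption made purely for illustration; real embedding models output hundreds of dimensions), so it shows the mechanics rather than actual semantic quality.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy vectors standing in for embedding-model output (values are made up)
chunk_vectors = {
    "conditions that activate the fee clause": [0.9, 0.1, 0.0],
    "monthly uptime reporting schedule": [0.1, 0.9, 0.1],
    "office relocation notice": [0.0, 0.2, 0.9],
}
query_vector = [0.8, 0.2, 0.1]  # pretend embedding of "penalty trigger conditions"

ranked = sorted(
    chunk_vectors,
    key=lambda text: cosine_similarity(query_vector, chunk_vectors[text]),
    reverse=True,
)
print(ranked[0])  # conditions that activate the fee clause
```

The query and the top chunk share no words at all; they rank together only because their vectors point in a similar direction.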
Many production RAG systems use hybrid retrieval: run both BM25 and dense retrieval, merge the results (typically using Reciprocal Rank Fusion), and rerank. This is covered in detail in Lesson 05.
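Reciprocal Rank Fusion needs only the rank positions from each retriever, not their raw scores, which is why it works across BM25 and cosine similarity without score calibration. A minimal sketch follows; the chunk IDs are made up, and k=60 is the commonly used default constant rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists: each list contributes 1 / (k + rank) per document."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical top-3 results from each retriever
bm25_ranking = ["chunk-7", "chunk-2", "chunk-9"]
dense_ranking = ["chunk-2", "chunk-4", "chunk-7"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print([doc_id for doc_id, _ in fused])
# ['chunk-2', 'chunk-7', 'chunk-4', 'chunk-9']
```

Chunks that appear in both lists rise to the top, while a chunk ranked first by only one retriever does not automatically win.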
The RAG Pipeline: End to End
A complete RAG system has two separate pipelines that operate at different times:
Offline (indexing) pipeline - runs whenever documents are added or updated, not on every query:
- Parse raw documents (PDF, DOCX, HTML, code)
- Clean and normalize text
- Chunk into retrievable segments with overlap
- Enrich with metadata (source, page, section, date)
- Embed each chunk using an embedding model
- Store chunk text and embedding in a vector store
Online (query) pipeline - runs for every user query:
- Receive user query
- Optionally transform the query (HyDE, multi-query expansion)
- Embed the query using the same embedding model
- Retrieve top-k similar chunks from the vector store
- Optionally rerank using a cross-encoder
- Assemble an augmented prompt with retrieved context
- Generate an answer with citations using the LLM
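The split between the two pipelines can be shown in miniature. The toy sketch below swaps in a bag-of-words counter for the embedding model and stops at prompt assembly instead of calling an LLM; both are simplifying assumptions, and every function name here is hypothetical.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in for an embedding model: a bag-of-words term-count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_index(documents: dict[str, str]) -> list[tuple[str, str, Counter]]:
    """Offline: split each document into paragraph chunks and embed them."""
    index = []
    for source, text in documents.items():
        paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
        for i, para in enumerate(paragraphs):
            index.append((f"{source}#{i}", para, toy_embed(para)))
    return index


def build_prompt(index: list[tuple[str, str, Counter]], question: str, top_k: int = 2) -> str:
    """Online: embed the query, take the top-k chunks, assemble the prompt."""
    q = toy_embed(question)
    ranked = sorted(index, key=lambda entry: cosine(q, entry[2]), reverse=True)[:top_k]
    context = "\n\n".join(f"[{chunk_id}]\n{text}" for chunk_id, text, _ in ranked)
    return (
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer only from the context and cite chunk IDs."
    )


documents = {
    "sla.txt": "Uptime must stay above 99.5% each month.\n\nP1 issues need a response within 1 hour.",
    "penalties.txt": "Late delivery incurs a penalty of 0.5% per day.",
}
index = build_index(documents)  # offline: rerun only when documents change
prompt = build_prompt(index, "What is the penalty for late delivery?", top_k=1)  # online: per query
print(prompt)
```

The index is built once and reused across queries; only `build_prompt` runs on the request path, which is why indexing cost and query latency are budgeted separately.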
LLM Failure Modes Without RAG
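To fully appreciate what RAG solves, it helps to enumerate exactly how a model fails on knowledge-intensive queries without retrieval: it cannot know anything past its knowledge cutoff, it hallucinates plausible answers when its knowledge is weak, and it cannot attribute its answers to a source.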
Production Code: Minimal RAG Pipeline from Scratch
The following implements a complete, working RAG pipeline without relying on LangChain, LlamaIndex, or any high-level RAG framework. Building it from scratch teaches you exactly what those frameworks are doing under the hood - and gives you full control over every decision.
This implementation uses:
- sentence-transformers for local embeddings (no API cost for indexing)
- faiss-cpu for vector search
- anthropic SDK for generation
"""
Minimal RAG pipeline from scratch.
No LangChain, no LlamaIndex - every component explicit.
Install: pip install anthropic sentence-transformers faiss-cpu numpy
"""
import os
import json
import time
import hashlib
import numpy as np
from dataclasses import dataclass, field, asdict
from typing import Optional
import anthropic

# Try to import optional dependencies with clear error messages
try:
    from sentence_transformers import SentenceTransformer
    HAS_ST = True
except ImportError:
    HAS_ST = False
    print("sentence-transformers not installed. Run: pip install sentence-transformers")

try:
    import faiss
    HAS_FAISS = True
except ImportError:
    HAS_FAISS = False
    print("faiss-cpu not installed. Run: pip install faiss-cpu")

# ---------------------------------------------------------------------------
# Data structures
# ---------------------------------------------------------------------------
@dataclass
class Document:
    """A raw document before chunking."""
    content: str
    source: str  # file path, URL, or identifier
    title: Optional[str] = None
    created_at: Optional[str] = None
    doc_type: Optional[str] = None  # "contract", "policy", "manual", etc.

    def doc_id(self) -> str:
        """Stable ID derived from the source and the first 200 characters of content."""
        payload = f"{self.source}:{self.content[:200]}"
        return hashlib.md5(payload.encode()).hexdigest()[:12]


@dataclass
class Chunk:
    """A retrievable chunk of text with metadata."""
    chunk_id: str
    text: str
    source: str
    doc_id: str
    chunk_index: int  # position within the document
    start_char: int  # character offset in original doc
    end_char: int
    metadata: dict = field(default_factory=dict)
    embedding: Optional[np.ndarray] = field(default=None, repr=False)

    def to_context_string(self) -> str:
        """Format this chunk for inclusion in a prompt."""
        source_label = self.metadata.get("title", self.source)
        return f"[Source: {source_label}, chunk {self.chunk_index}]\n{self.text}"


@dataclass
class RetrievedContext:
    """The result of a retrieval query: ranked chunks with scores."""
    query: str
    chunks: list[Chunk]
    scores: list[float]
    retrieval_time_ms: float

    def to_prompt_context(self, max_chunks: int = 5) -> str:
        """Build the context block to inject into the LLM prompt."""
        selected = list(zip(self.chunks, self.scores))[:max_chunks]
        parts = []
        for i, (chunk, score) in enumerate(selected, 1):
            parts.append(
                f"--- Document {i} (relevance: {score:.3f}) ---\n"
                f"{chunk.to_context_string()}"
            )
        return "\n\n".join(parts)

    def source_list(self) -> list[str]:
        """Return unique sources for citation, in ranked order."""
        seen = set()
        sources = []
        for chunk in self.chunks:
            src = chunk.metadata.get("title", chunk.source)
            if src not in seen:
                seen.add(src)
                sources.append(src)
        return sources

# ---------------------------------------------------------------------------
# Step 1: Chunking
# ---------------------------------------------------------------------------
class FixedSizeChunker:
    """
    Splits documents into fixed-size chunks with overlap.
    Sizes are measured in characters, not tokens.
    Overlap ensures that content near chunk boundaries is captured
    by at least one chunk.
    """

    def __init__(self, chunk_size: int = 512, overlap: int = 64):
        # An overlap >= chunk_size would stop the window from advancing
        if not 0 <= overlap < chunk_size:
            raise ValueError("overlap must be >= 0 and smaller than chunk_size")
        self.chunk_size = chunk_size
        self.overlap = overlap

    def chunk(self, doc: Document) -> list[Chunk]:
        text = doc.content
        doc_id = doc.doc_id()
        chunks: list[Chunk] = []
        start = 0
        chunk_index = 0
        while start < len(text):
            end = start + self.chunk_size
            chunk_text = text[start:end]
            # Don't create a chunk that is pure whitespace
            if chunk_text.strip():
                chunks.append(Chunk(
                    chunk_id=f"{doc_id}-{chunk_index}",
                    text=chunk_text.strip(),
                    source=doc.source,
                    doc_id=doc_id,
                    chunk_index=chunk_index,
                    start_char=start,
                    end_char=min(end, len(text)),
                    metadata={
                        "title": doc.title or doc.source,
                        "doc_type": doc.doc_type,
                        "created_at": doc.created_at,
                    },
                ))
                chunk_index += 1
            if end >= len(text):
                break  # this window reached the end; skip a redundant tail chunk
            start = end - self.overlap  # step forward with overlap
        return chunks


class ParagraphChunker:
    """
    Splits on paragraph boundaries (double newlines).
    Better preserves semantic units than fixed-size splitting.
    Character offsets are approximate because paragraphs are stripped.
    """

    def __init__(self, max_chunk_size: int = 800, min_chunk_size: int = 50):
        self.max_chunk_size = max_chunk_size
        self.min_chunk_size = min_chunk_size

    def chunk(self, doc: Document) -> list[Chunk]:
        paragraphs = [p.strip() for p in doc.content.split("\n\n") if p.strip()]
        doc_id = doc.doc_id()
        chunks: list[Chunk] = []
        buffer = ""
        buffer_start = 0
        chunk_index = 0
        char_offset = 0
        for para in paragraphs:
            if len(buffer) + len(para) > self.max_chunk_size and len(buffer) >= self.min_chunk_size:
                # Flush buffer as a chunk
                chunks.append(Chunk(
                    chunk_id=f"{doc_id}-{chunk_index}",
                    text=buffer,
                    source=doc.source,
                    doc_id=doc_id,
                    chunk_index=chunk_index,
                    start_char=buffer_start,
                    end_char=char_offset,
                    metadata={"title": doc.title or doc.source},
                ))
                chunk_index += 1
                buffer = para
                buffer_start = char_offset
            else:
                buffer = f"{buffer}\n\n{para}".strip() if buffer else para
            char_offset += len(para) + 2  # +2 for "\n\n"
        # Flush remaining buffer
        if buffer.strip():
            chunks.append(Chunk(
                chunk_id=f"{doc_id}-{chunk_index}",
                text=buffer,
                source=doc.source,
                doc_id=doc_id,
                chunk_index=chunk_index,
                start_char=buffer_start,
                end_char=len(doc.content),
                metadata={"title": doc.title or doc.source},
            ))
        return chunks

# ---------------------------------------------------------------------------
# Step 2: Embedding
# ---------------------------------------------------------------------------
class EmbeddingModel:
    """
    Wraps a sentence-transformers model for local embedding.
    For production, consider: text-embedding-3-small (OpenAI),
    voyage-3 (Voyage AI), or cohere-embed-v3.
    """

    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        if not HAS_ST:
            raise RuntimeError("sentence-transformers required. pip install sentence-transformers")
        print(f"Loading embedding model: {model_name}")
        self.model = SentenceTransformer(model_name)
        self.model_name = model_name
        self.dim = self.model.get_sentence_embedding_dimension()
        print(f"Embedding dimension: {self.dim}")

    def embed(self, texts: list[str], batch_size: int = 64) -> np.ndarray:
        """Embed a list of texts, returning float32 numpy array of shape (N, dim)."""
        if not texts:
            # Keep the (N, dim) shape contract even when there is nothing to embed
            return np.zeros((0, self.dim), dtype=np.float32)
        embeddings = self.model.encode(
            texts,
            batch_size=batch_size,
            show_progress_bar=len(texts) > 100,
            normalize_embeddings=True,  # normalize for cosine similarity via dot product
            convert_to_numpy=True,
        )
        return embeddings.astype(np.float32)

    def embed_query(self, query: str) -> np.ndarray:
        """Embed a single query string, returning shape (1, dim)."""
        embedding = self.model.encode([query], normalize_embeddings=True, convert_to_numpy=True)
        return embedding.astype(np.float32)

# ---------------------------------------------------------------------------
# Step 3: Vector Store
# ---------------------------------------------------------------------------
class FAISSVectorStore:
    """
    In-memory FAISS vector store with chunk metadata stored in a parallel list.
    For production: use pgvector (Postgres), Pinecone, or Weaviate.
    """

    def __init__(self, dim: int):
        if not HAS_FAISS:
            raise RuntimeError("faiss-cpu required. pip install faiss-cpu")
        # Inner Product index - works as cosine similarity when vectors are L2-normalized
        self.index = faiss.IndexFlatIP(dim)
        self.chunks: list[Chunk] = []  # parallel list: index i → chunk
        self.dim = dim

    def add_chunks(self, chunks: list[Chunk], embeddings: np.ndarray) -> None:
        """
        Add chunks and their embeddings to the index.
        embeddings: shape (N, dim), float32, already L2-normalized
        """
        assert len(chunks) == len(embeddings), "chunks and embeddings must have same length"
        assert embeddings.shape[1] == self.dim, f"Expected dim {self.dim}, got {embeddings.shape[1]}"
        self.index.add(embeddings)
        self.chunks.extend(chunks)
        print(f"Indexed {len(chunks)} chunks. Total: {len(self.chunks)}")

    def search(self, query_embedding: np.ndarray, top_k: int = 10) -> tuple[list[Chunk], list[float]]:
        """
        Return the top-k most similar chunks with their similarity scores.
        query_embedding: shape (1, dim), float32, L2-normalized
        """
        if self.index.ntotal == 0:
            return [], []
        top_k = min(top_k, self.index.ntotal)
        scores, indices = self.index.search(query_embedding, top_k)
        result_chunks = []
        result_scores = []
        for score, idx in zip(scores[0], indices[0]):
            if idx >= 0:  # -1 indicates FAISS found no result
                result_chunks.append(self.chunks[idx])
                result_scores.append(float(score))
        return result_chunks, result_scores

    def total_chunks(self) -> int:
        return len(self.chunks)

    def save(self, path: str) -> None:
        """Persist index and metadata to disk."""
        faiss.write_index(self.index, f"{path}.faiss")
        # Serialize chunk metadata without embeddings (which are in FAISS)
        metadata = []
        for chunk in self.chunks:
            d = asdict(chunk)
            d.pop("embedding", None)  # don't serialize numpy array as JSON
            metadata.append(d)
        with open(f"{path}.meta.json", "w") as f:
            json.dump(metadata, f, indent=2)
        print(f"Saved index ({self.index.ntotal} vectors) to {path}.faiss")

    @classmethod
    def load(cls, path: str, dim: int) -> "FAISSVectorStore":
        """Load index and metadata from disk."""
        store = cls(dim=dim)
        store.index = faiss.read_index(f"{path}.faiss")
        with open(f"{path}.meta.json") as f:
            metadata = json.load(f)
        store.chunks = [Chunk(**m) for m in metadata]
        print(f"Loaded index ({store.index.ntotal} vectors) from {path}.faiss")
        return store

# ---------------------------------------------------------------------------
# Step 4: Retriever
# ---------------------------------------------------------------------------
class Retriever:
    """
    Combines the embedding model and vector store to answer retrieval queries.
    """

    def __init__(self, embedding_model: EmbeddingModel, vector_store: FAISSVectorStore):
        self.embedding_model = embedding_model
        self.vector_store = vector_store

    def retrieve(self, query: str, top_k: int = 5) -> RetrievedContext:
        """
        Retrieve the most relevant chunks for a query.
        Returns a RetrievedContext with ranked chunks and scores.
        """
        t0 = time.perf_counter()
        query_embedding = self.embedding_model.embed_query(query)
        chunks, scores = self.vector_store.search(query_embedding, top_k=top_k)
        elapsed_ms = (time.perf_counter() - t0) * 1000
        return RetrievedContext(
            query=query,
            chunks=chunks,
            scores=scores,
            retrieval_time_ms=elapsed_ms,
        )

# ---------------------------------------------------------------------------
# Step 5: Generator with Citations
# ---------------------------------------------------------------------------
SYSTEM_PROMPT = """You are a precise document assistant. Your job is to answer questions based strictly on the provided context documents.
Rules:
1. Answer ONLY from the provided context. Do not use outside knowledge.
2. If the context does not contain enough information to answer, say so clearly.
3. Always cite your sources using the document labels provided (e.g., "According to [Source: Annual Report, chunk 3]...").
4. Keep answers concise and accurate. Do not pad with unnecessary information.
5. If multiple documents contain relevant information, synthesize them and cite each."""

USER_PROMPT_TEMPLATE = """Context Documents:
{context}
---
Question: {question}
Answer based on the context above. Cite your sources."""


class RAGGenerator:
    """
    Generates answers from retrieved context using Claude.
    Tracks citation sources for attribution.
    """

    def __init__(self, model: str = "claude-haiku-4-5"):
        # Reads the API key from the ANTHROPIC_API_KEY environment variable
        self.client = anthropic.Anthropic()
        self.model = model

    def generate(
        self,
        question: str,
        context: RetrievedContext,
        max_tokens: int = 1024,
        max_context_chunks: int = 5,
    ) -> dict:
        """
        Generate an answer from the retrieved context.
        Returns a dict with:
        - answer: the generated text
        - sources: list of source identifiers
        - model: the model used
        - usage: token usage statistics
        - retrieval_time_ms: retrieval latency
        - generation_time_ms: generation latency
        """
        context_str = context.to_prompt_context(max_chunks=max_context_chunks)
        # Report only the sources that were actually shown to the model
        sources = list(dict.fromkeys(
            chunk.metadata.get("title", chunk.source)
            for chunk in context.chunks[:max_context_chunks]
        ))
        prompt = USER_PROMPT_TEMPLATE.format(
            context=context_str,
            question=question,
        )
        t0 = time.perf_counter()
        response = self.client.messages.create(
            model=self.model,
            max_tokens=max_tokens,
            system=SYSTEM_PROMPT,
            messages=[{"role": "user", "content": prompt}],
        )
        generation_time_ms = (time.perf_counter() - t0) * 1000
        return {
            "question": question,
            "answer": response.content[0].text,
            "sources": sources,
            "model": self.model,
            "usage": {
                "input_tokens": response.usage.input_tokens,
                "output_tokens": response.usage.output_tokens,
            },
            "retrieval_time_ms": context.retrieval_time_ms,
            "generation_time_ms": generation_time_ms,
            "total_time_ms": context.retrieval_time_ms + generation_time_ms,
            "top_chunk_scores": context.scores[:5],
        }

# ---------------------------------------------------------------------------
# Full Pipeline: Ingestion + Query
# ---------------------------------------------------------------------------
class RAGPipeline:
    """
    End-to-end RAG pipeline: ingest documents, retrieve, generate.
    """

    def __init__(
        self,
        embedding_model_name: str = "all-MiniLM-L6-v2",
        chunker: str = "paragraph",  # "fixed" or "paragraph"
        chunk_size: int = 512,
        chunk_overlap: int = 64,
        llm_model: str = "claude-haiku-4-5",
    ):
        self.embedding_model = EmbeddingModel(embedding_model_name)
        if chunker == "fixed":
            self.chunker = FixedSizeChunker(chunk_size=chunk_size, overlap=chunk_overlap)
        else:
            self.chunker = ParagraphChunker()
        self.vector_store = FAISSVectorStore(dim=self.embedding_model.dim)
        self.retriever = Retriever(self.embedding_model, self.vector_store)
        self.generator = RAGGenerator(model=llm_model)

    def ingest(self, documents: list[Document], batch_size: int = 64) -> dict:
        """
        Ingest a list of documents: chunk → embed → store.
        Returns ingestion statistics.
        """
        t0 = time.perf_counter()
        all_chunks: list[Chunk] = []
        for doc in documents:
            chunks = self.chunker.chunk(doc)
            all_chunks.extend(chunks)
            print(f" {doc.source}: {len(chunks)} chunks")
        # Embed in batches
        texts = [chunk.text for chunk in all_chunks]
        embeddings = self.embedding_model.embed(texts, batch_size=batch_size)
        # Attach embeddings to chunks and store
        for chunk, emb in zip(all_chunks, embeddings):
            chunk.embedding = emb
        self.vector_store.add_chunks(all_chunks, embeddings)
        elapsed = time.perf_counter() - t0
        return {
            "documents_ingested": len(documents),
            "chunks_created": len(all_chunks),
            "ingestion_time_s": elapsed,
            "chunks_per_second": len(all_chunks) / elapsed,
        }

    def query(
        self,
        question: str,
        top_k: int = 5,
        max_tokens: int = 1024,
    ) -> dict:
        """
        Answer a question using the ingested knowledge base.
        """
        context = self.retriever.retrieve(question, top_k=top_k)
        result = self.generator.generate(question, context, max_tokens=max_tokens)
        return result

# ---------------------------------------------------------------------------
# Example Usage
# ---------------------------------------------------------------------------
def main():
    # Create sample documents (in production: load from PDFs, DOCX, etc.)
    documents = [
        Document(
            content="""MASTER SERVICES AGREEMENT - ANNEX G

Annex G: Penalty and Fee Structures

Section 14: Penalty Clauses

14.1 Standard Penalties
The standard penalty for late delivery is 0.5% of the project value per day, capped at 10% of the total contract value.

14.2 Escalated Penalties
In cases where delivery delay exceeds 30 days, the penalty escalates to 1.0% per day.

14.3 Penalty Trigger Conditions
Penalties under this Annex are triggered when:
(a) The licensor fails to deliver milestones within the agreed schedule as defined in Schedule B;
(b) The licensed software fails to pass User Acceptance Testing (UAT) within three consecutive test cycles, provided that:
(i) The licensee has provided complete and accurate test data as specified in Annex H;
(ii) The failure is not attributable to infrastructure issues on the licensee's side;
(iii) Written notice of UAT failure has been provided within 48 hours of each failed cycle;
(c) The licensor fails to maintain the minimum service levels defined in the SLA (Exhibit C) for two consecutive calendar months.

14.4 Carve-outs
The penalty provisions in 14.3(b) shall not apply in the following circumstances:
- Force majeure events as defined in Section 22
- Failures attributable to third-party integrations listed in Appendix D
- Failures during the first 90 days after initial deployment (grace period)""",
            source="annex-g-penalties.txt",
            title="MSA Annex G: Penalty Clauses",
            doc_type="contract",
            created_at="2024-01-15",
        ),
        Document(
            content="""MASTER SERVICES AGREEMENT - EXHIBIT C

Service Level Agreement (SLA)

1. Availability
The licensed software must maintain 99.5% uptime measured monthly, excluding scheduled maintenance windows.

2. Response Times
- Critical issues (P1): initial response within 1 hour, resolution within 4 hours
- High issues (P2): initial response within 4 hours, resolution within 24 hours
- Medium issues (P3): initial response within 1 business day, resolution within 5 business days

3. SLA Measurement
Uptime is measured using the monitoring infrastructure specified in Appendix E.
Monthly SLA reports must be provided by the 5th business day of the following month.

4. SLA Breach Consequences
SLA breaches trigger the penalty provisions in Annex G, Section 14.3(c).""",
            source="exhibit-c-sla.txt",
            title="MSA Exhibit C: SLA",
            doc_type="contract",
            created_at="2024-01-15",
        ),
    ]

    # Initialize pipeline
    print("\n=== Initializing RAG Pipeline ===")
    pipeline = RAGPipeline(
        embedding_model_name="all-MiniLM-L6-v2",
        chunker="paragraph",
        llm_model="claude-haiku-4-5",
    )

    # Ingest
    print("\n=== Ingesting Documents ===")
    stats = pipeline.ingest(documents)
    print(f"Ingestion complete: {stats}")

    # Query 1: Specific clause question (would fail without RAG)
    print("\n=== Query 1: Specific clause ===")
    result = pipeline.query(
        "What are the specific conditions that trigger the penalty under clause 14.3(b)?",
        top_k=5,
    )
    print(f"Answer:\n{result['answer']}")
    print(f"\nSources: {result['sources']}")
    print(f"Retrieval: {result['retrieval_time_ms']:.1f}ms | Generation: {result['generation_time_ms']:.1f}ms")
    print(f"Tokens: {result['usage']}")

    # Query 2: Cross-document question
    print("\n=== Query 2: Cross-document synthesis ===")
    result2 = pipeline.query(
        "If the software fails UAT and also misses SLA targets, what penalties apply?",
        top_k=5,
    )
    print(f"Answer:\n{result2['answer']}")
    print(f"\nSources: {result2['sources']}")

    # Query 3: Out-of-scope question (should be handled gracefully)
    print("\n=== Query 3: Out of scope ===")
    result3 = pipeline.query(
        "What is the total contract value?",
        top_k=3,
    )
    print(f"Answer:\n{result3['answer']}")


if __name__ == "__main__":
    main()

Production Engineering Notes
Latency Budget
A production RAG query has a multi-step latency budget. Each step adds time:
| Step | Typical Latency | Notes |
|---|---|---|
| Query embedding | 5-15ms | Local model; use GPU for less than 5ms |
| ANN vector search | 1-10ms | FAISS flat: exact but slow; IVF: approximate but fast |
| Network to vector DB | 5-50ms | pgvector (local): less than 5ms; Pinecone: 20-80ms |
| Reranking (cross-encoder) | 50-200ms | Often skipped to save latency; cap the batch at about 20 chunks |
| LLM generation (TTFT) | 300-800ms | Time to first token; streaming hides this |
| LLM generation (total) | 1-5s | Depends on output length |
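These totals are measured to the first streamed token: P50 is 500-800ms for a lean pipeline without reranking and 1-2s with reranking. If you do not stream, the total generation time (1-5s) comes on top.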
Budget your latency from the user experience backward: if you need a 2-second response, you have roughly 500ms for retrieval and 1.5 seconds for generation. This directly constrains how much reranking you can do.
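Working backward from a target is easy to sketch in code. The snippet below is a minimal illustration, not a benchmark: the `Stage` class, the `remaining_budget_ms` helper, and the per-stage numbers are all hypothetical, with the values picked from inside the ranges in the table above.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    p50_ms: float  # illustrative estimate, not a measurement


def remaining_budget_ms(target_ms: float, stages: list[Stage]) -> float:
    """Return the slack left after all stages; negative means over budget."""
    return target_ms - sum(s.p50_ms for s in stages)


# Hypothetical values chosen from the ranges in the latency table.
lean = [
    Stage("query_embedding", 10),
    Stage("ann_search", 5),
    Stage("vector_db_network", 25),
    Stage("llm_ttft", 500),
]
with_rerank = lean + [Stage("rerank_cross_encoder", 120)]

# Target: 600ms to first token.
print(remaining_budget_ms(600, lean))         # positive: the lean pipeline fits
print(remaining_budget_ms(600, with_rerank))  # negative: reranking exceeds the target
```

With these assumed numbers the lean pipeline has 60ms of slack against a 600ms time-to-first-token target, and adding the reranker puts it 60ms over. In a real system you would replace the estimates with measured percentiles per stage.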
When Retrieval Fails
Retrieval can fail in two ways:
Low recall: the correct chunk is not in the top-k results. Causes: poor embedding quality for this domain, query vocabulary mismatch, or badly sized chunks (too large and the relevant passage is diluted by surrounding text; too small and the context needed to match the query is lost). Fix: increase k, add BM25 hybrid retrieval, or fine-tune embeddings.
Low precision: the top-k results contain many irrelevant chunks. Causes: query too broad, embedding space not discriminative enough, too many similar documents. Fix: add reranking, improve chunking (smaller chunks are more specific), use metadata filters.
The critical diagnostic: track retrieval accuracy separately from answer accuracy. If retrieval accuracy is high but answer accuracy is low, the problem is in generation (the model is ignoring context or misinterpreting it). If retrieval accuracy is low, fix the retrieval layer first - no amount of prompt engineering will help if the wrong chunks are retrieved.
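Recall and precision at k are the usual way to put numbers on these two failure modes. The sketch below assumes you have a small labeled set mapping each test question to the chunk IDs that actually contain the answer; the function names and the chunk IDs are hypothetical.

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids)


def precision_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for chunk_id in top_k if chunk_id in relevant_ids)
    return hits / len(top_k)


# Hypothetical ranked output for the clause 14.3(b) question.
retrieved = ["annex-g-3", "exhibit-c-1", "annex-g-4", "exhibit-c-2", "annex-h-1"]
relevant = {"annex-g-3", "annex-g-4"}

print(recall_at_k(retrieved, relevant, k=5))     # 1.0: both relevant chunks were found
print(precision_at_k(retrieved, relevant, k=5))  # 0.4: 2 of the 5 results are relevant
```

High recall with low precision, as in this example, points toward reranking or metadata filters rather than a larger k.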
Detecting Retrieval vs. Generation Errors
import json

import anthropic


# RetrievedContext is the class defined earlier in this lesson.
def diagnose_failure(question: str, answer: str, context: RetrievedContext) -> dict:
    """
    Use Claude to determine whether a failure is a retrieval error or generation error.
    Pass this the question, the answer, and the retrieved chunks.
    """
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    context_str = context.to_prompt_context(max_chunks=5)
    diagnosis_prompt = f"""You are evaluating a RAG system failure.
Question: {question}
Retrieved context:
{context_str}
System answer: {answer}
Tasks:
1. Is the correct answer present in the retrieved context? (yes/no)
2. If yes, did the system's answer correctly reflect the context? (yes/no)
3. Classify: "retrieval_failure" (correct info not retrieved) or "generation_failure" (correct info retrieved but wrong answer) or "correct" (answer is right)
4. Brief explanation (1 sentence).
Respond in JSON: {{"answer_in_context": bool, "answer_correct": bool, "failure_type": str, "explanation": str}}"""
    response = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=256,
        messages=[{"role": "user", "content": diagnosis_prompt}],
    )
    raw = response.content[0].text
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # The model may wrap the JSON in prose or a code fence; keep the raw text for inspection.
        return {"raw": raw}
:::tip Production Monitoring
Track retrieval accuracy and generation accuracy as separate metrics in your observability stack. When accuracy drops, you need to know which layer failed. A retrieval failure requires different fixes than a generation failure, and mixing up the diagnosis wastes engineering time.
:::
:::warning Context Window Overflow
If you retrieve too many chunks, you can exceed the model's context window or trigger the "lost in the middle" degradation. Always calculate your token budget: `prompt_tokens + sum(chunk_tokens) + max_output_tokens < context_limit`. For Claude models, the standard context window is 200K tokens, but performance is usually best with much shorter contexts (under 20K tokens of retrieved content).
:::
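The budget inequality can be enforced before the prompt is assembled. This is a minimal sketch: `fit_chunks_to_budget` is a hypothetical helper, and the default `count_tokens` is a crude four-characters-per-token heuristic used only so the example runs on its own. In practice you would pass in a real token counter for your model.

```python
def fit_chunks_to_budget(
    chunks: list[str],
    prompt_tokens: int,
    max_output_tokens: int,
    context_limit: int,
    count_tokens=lambda text: len(text) // 4 + 1,  # crude heuristic, NOT a real tokenizer
) -> list[str]:
    """Keep chunks in rank order until the next one would break the budget.

    Enforces: prompt_tokens + sum(chunk_tokens) + max_output_tokens < context_limit
    """
    budget = context_limit - prompt_tokens - max_output_tokens
    kept: list[str] = []
    used = 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost >= budget:
            break  # adding this chunk would violate the strict inequality
        kept.append(chunk)
        used += cost
    return kept


# Toy numbers: each 40-character chunk costs 11 "tokens" under the heuristic.
ranked_chunks = ["a" * 40, "b" * 40, "c" * 40]
kept = fit_chunks_to_budget(
    ranked_chunks, prompt_tokens=50, max_output_tokens=25, context_limit=100
)
print(len(kept))  # 2: the third chunk would exceed the 25-token budget
```

Because the chunks are consumed in rank order, truncation drops the least relevant chunks first. Setting `context_limit` well below the model's hard limit is one way to apply the shorter-context guidance above.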
:::danger Never Use Prompt-Stuffing in Place of Retrieval
Putting your entire knowledge base in every prompt is tempting with large context windows, but it has three serious production problems: (1) it's extremely expensive at scale: 100K tokens per query × 10,000 queries/day = 1 billion tokens/day; (2) "lost in the middle" means the model pays less attention to content buried in the middle of very long contexts; (3) it prevents tenant isolation, because you can't selectively show different users different subsets of your knowledge base. Use retrieval.
:::
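The scale argument is plain arithmetic. The sketch below reproduces the 1 billion tokens/day figure and compares it with a retrieval-based prompt; the 4,000-token retrieved prompt size is an assumption chosen for illustration, and `daily_input_tokens` is a hypothetical helper.

```python
def daily_input_tokens(tokens_per_query: int, queries_per_day: int) -> int:
    """Input tokens sent to the model per day."""
    return tokens_per_query * queries_per_day


stuffed = daily_input_tokens(100_000, 10_000)  # whole knowledge base in every prompt
retrieved = daily_input_tokens(4_000, 10_000)  # assumed: ~4K tokens of retrieved context

print(f"{stuffed:,} vs {retrieved:,} tokens/day ({stuffed // retrieved}x)")
```

Multiply either figure by your provider's current input-token price to turn it into a daily cost.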
Interview Questions and Answers
Q1: What is RAG and why was it invented?
RAG (Retrieval-Augmented Generation) is an architecture that retrieves relevant documents from an external store and passes them as context to a language model before generation. It was invented because language models have two fundamental limitations: their knowledge is fixed at training time (knowledge cutoff), and they hallucinate when asked about facts they have weak knowledge of. The founding paper (Lewis et al., NeurIPS 2020) introduced the parametric/non-parametric memory distinction: the model's weights hold general reasoning capability, while an external document store holds specific, current, attributable facts. RAG combines both.
Q2: When would you choose fine-tuning over RAG?
Fine-tuning is the right choice when you need to change the model's behavior or style, not its factual knowledge. Concrete examples: training the model to always respond in JSON, adopting domain-specific vocabulary so it understands jargon in questions, improving reasoning patterns for a specific task type, or teaching the model to follow a specific format (e.g., always structure contract analysis into five sections). RAG is the right choice when you need the model to know specific facts from specific documents, especially when those documents are proprietary, change frequently, or need to be cited. The key diagnostic: if you're trying to teach facts, use RAG. If you're trying to change behavior, use fine-tuning. Many production systems need both.
Q3: What is the "lost in the middle" problem and how does RAG address it?
The "lost in the middle" phenomenon (Liu et al., 2023) shows that LLMs perform worse when relevant information appears in the middle of a long context, compared to when it appears at the beginning or end. In the paper's multi-document question-answering experiments, accuracy for some models fell by 20 percentage points or more when the relevant passage sat in the middle of a 20-document context rather than at the beginning. RAG addresses this by retrieving only the most relevant chunks (typically 3-10) and placing them at the beginning of the context. This keeps the context short and positions the most relevant information where the model pays most attention. Reranking the retrieved chunks and putting the highest-scored chunk first further helps.
Q4: How do you detect whether a RAG failure is a retrieval error or a generation error?
First, log every retrieval: store the query, the top-k chunks returned, and the final answer. To diagnose, you need to check whether the correct answer was present in the retrieved chunks. If the correct answer was in the chunks but the model gave a wrong answer, that's a generation error - fix it by improving the prompt, adding more explicit instructions to "answer only from context," or reducing context length. If the correct answer was not in the chunks, that's a retrieval error - fix it by increasing k, improving chunking, adding hybrid BM25 retrieval, or fine-tuning the embedding model. You can automate this diagnosis using an LLM-as-judge: pass the question, retrieved context, and answer to Claude and ask it to classify the failure type. This is the foundation of RAG evaluation in production.
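When you have a labeled test set, the same triage can be done without an LLM call. The sketch below is a deliberately crude version that assumes the correct answer can be expressed as a short string and uses normalized substring matching; `triage_failure` is a hypothetical helper, and the labels match the ones used in the judge prompt earlier.

```python
def triage_failure(gold_answer: str, retrieved_chunks: list[str], system_answer: str) -> str:
    """Classify one logged query using a known-correct answer string."""

    def norm(text: str) -> str:
        # Lowercase and collapse whitespace so formatting differences don't matter.
        return " ".join(text.lower().split())

    gold = norm(gold_answer)
    in_context = any(gold in norm(chunk) for chunk in retrieved_chunks)
    in_answer = gold in norm(system_answer)

    if in_answer:
        return "correct"
    # Wrong answer: blame generation only if retrieval surfaced the right text.
    return "generation_failure" if in_context else "retrieval_failure"


chunks = ["Written notice of UAT failure has been provided within 48 hours of each failed cycle"]
print(triage_failure("within 48 hours", chunks, "Notice is due within 7 days."))
```

Here the right text was retrieved but the answer contradicts it, so the call returns `generation_failure`. Substring matching misses paraphrased answers, which is where the LLM-as-judge approach earns its cost.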
Q5: What are the main components of a RAG pipeline and what does each do?
A RAG system has two pipelines: offline (indexing) and online (query). Offline: (1) Parser - converts raw files (PDF, DOCX, HTML) into plain text; (2) Chunker - splits the text into retrievable segments with appropriate size and overlap; (3) Enricher - adds metadata (source, page, section, date); (4) Embedder - converts each chunk into a dense vector using an embedding model; (5) Vector store - persists the vectors with their chunk text for fast similarity search. Online: (1) Query transformer - optionally rewrites or expands the query; (2) Retriever - embeds the query and finds the most similar chunks using ANN search; (3) Reranker - optionally re-scores the top-k chunks using a more accurate cross-encoder; (4) Context assembler - formats the retrieved chunks into a prompt with source labels; (5) Generator - passes the augmented prompt to the LLM and produces an answer with citations.
Q6: What is the difference between dense and sparse retrieval, and when would you use each?
Sparse retrieval (BM25, TF-IDF) represents documents as term-frequency vectors. It excels at lexical matching - finding documents that contain exact query terms. If the user queries for "ISO 27001 clause 6.2", BM25 will find documents containing that exact string with high precision. Dense retrieval uses neural embeddings to capture semantic similarity - "conditions that activate the fee clause" will retrieve text about "penalty trigger conditions" because they're semantically close in the embedding space. Dense retrieval fails on lexical precision; sparse retrieval fails on semantic recall. Production systems use hybrid retrieval: run both, merge results using Reciprocal Rank Fusion (RRF), then rerank. The typical recipe: BM25 for lexical precision on IDs, codes, names, and exact terms, plus dense retrieval for semantic understanding of concepts and paraphrases.
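Reciprocal Rank Fusion needs only the rank positions from each retriever, not their raw scores, which is why it works across BM25 and cosine similarity without score normalization. The sketch below is a minimal version: the function name and chunk IDs are hypothetical, and `k=60` is a commonly used default rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked ID lists: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical top-3 results from each retriever for the same query.
bm25_ranking = ["c1", "c2", "c3"]
dense_ranking = ["c3", "c1", "c4"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)  # ['c1', 'c3', 'c2', 'c4']
```

Chunks that appear in both lists (`c1`, `c3`) rise above chunks found by only one retriever, and the fused list is what you would hand to the reranker.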
Q7: What are the main failure modes of RAG in production?
The RAG failure stack, from bottom to top: (1) Parsing failures - scanned PDFs produce garbled OCR text, tables are linearized incorrectly, code loses indentation; (2) Chunking failures - semantic units split across chunk boundaries, context lost at edges; (3) Embedding failures - out-of-domain embedding model fails to capture domain similarity; (4) Retrieval failures - wrong chunks returned because the query vocabulary doesn't match the document vocabulary; (5) Reranking failures - cross-encoder is skipped to save latency, reducing precision; (6) Generation failures - model ignores retrieved context and uses parametric memory; (7) Evaluation failures - offline benchmarks don't reflect production distribution, so failures are invisible until users complain. Good RAG engineering instruments every layer with separate metrics so you can identify which layer is failing.
