What is retrieval-augmented generation?

Understand why Retrieval-Augmented Generation was invented, what problems it solves that fine-tuning and prompt stuffing cannot, and how to architect a minimal RAG pipeline from scratch.


Why RAG Exists


Reading time: 45 minutes | Interview relevance: Very High | Target roles: AI Engineer, ML Engineer, Backend Engineer building AI products

The Contract Review Catastrophe

The system worked beautifully in the demo. Meridian Legal AI had built a contract review assistant that could analyze NDAs, employment agreements, and software licensing contracts in seconds - a task that previously took a junior associate two hours. The founders showed it to twenty law firms and got twelve signed letters of intent. The investors were excited. The engineering team was proud.

Then they shipped it to their first real client, a mid-sized pharmaceutical company with a 340-page master services agreement that had been negotiated over four years and contained seventeen custom annexes. Clause 14.3(b) of Annex G defined the specific conditions under which the licensor could trigger a penalty clause - a bespoke provision that appeared in no standard contract template, no legal textbook, and no publicly available case law. The AI assistant was asked: "What are the conditions that trigger the penalty under clause 14.3(b)?" It answered confidently, citing conditions that were plausible but completely fabricated. The real clause had three specific carve-outs. The AI's answer had none of them.

The legal team caught it before it reached the client. But it shook everyone's confidence. The engineering team's first instinct was fine-tuning. They collected all the client's contracts, formatted them as instruction-response pairs, and spent three weeks running supervised fine-tuning jobs. The resulting model was better on generic contract language but still hallucinated on the specific provisions of this specific client's specific annex. How could it not? Fine-tuning on a few hundred examples shifts a model's style and general behavior; it does not reliably store the exact wording of one clause in one annex.

The root cause was simple, and it was architectural: the model had no access to the actual contract. It was generating answers from parametric memory - knowledge baked into its weights during training - and applying that general knowledge to highly specific, proprietary questions. The solution wasn't a better model. It wasn't more fine-tuning. It was giving the model the document and asking it to answer from the text in front of it. That is Retrieval-Augmented Generation.

They rebuilt the pipeline in four days. The new system retrieved the relevant clauses from the contract, passed them to the model as context, and instructed it to answer only from what was provided and to cite the clause number. Accuracy on specific contractual questions went from roughly 40% to over 95%. The AI assistant shipped to the pharmaceutical client the following week and has been in production ever since. The three weeks spent on fine-tuning taught them something more valuable than any accuracy metric: they learned that the problem was information access, not model capability - and those are solved by entirely different tools.

Why This Exists

Language models store knowledge in their weights. During training, billions of parameters adjust to encode statistical relationships across the training corpus - facts, grammar, reasoning patterns, world knowledge. When you ask a model a question, it retrieves this knowledge from its own parameters to generate a response. This is called parametric memory: knowledge stored in the parameters of the model itself.

Parametric memory has three deep structural problems for production AI systems:

1. Knowledge cutoff. The model only knows what was in its training data. If your organization's policy changed last quarter, if a regulation was updated six months ago, if a product specification was written last week - the model has no knowledge of these things. No amount of prompting will fix this, because the information is literally not in the model's weights.

2. Hallucination under uncertainty. When a model is asked about something it has weak or partial knowledge of, it rarely says "I don't know." It generates a plausible-sounding answer by pattern-matching to similar contexts it has seen. This is not a bug in one particular model - it follows from how autoregressive text generation works. At each step the model picks the next token from a probability distribution over its vocabulary, and nothing in that step checks the result against a source of truth. If the most probable answer to "What does clause 14.3(b) say?" sounds like other contract clauses the model has seen, it will produce that answer, even if it's wrong.

3. No source attribution. Even when a model produces a correct answer, it cannot tell you where that answer came from. For legal, medical, financial, and compliance use cases, this is often a hard requirement. "The model said so" is not an acceptable citation in a court filing.
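Point 2 can be made concrete with a toy decoding step. The sketch below is not a real language model: the candidate continuations and their scores are invented for illustration. It only shows that a softmax always yields a valid distribution, so greedy decoding returns some answer even when the scores are nearly flat.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)                              # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_pick(candidates, logits):
    """Return the highest-probability candidate and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return candidates[best], probs[best]

# Hypothetical continuations for:
# "The penalty under clause 14.3(b) is triggered when ..."
candidates = [
    "delivery is late",
    "payment is missed",
    "the licence lapses",
    "an audit fails",
]

# Invented scores: the model has no real signal, so they are almost identical.
uncertain_logits = [1.02, 1.00, 0.99, 0.98]

answer, prob = greedy_pick(candidates, uncertain_logits)
print(answer, round(prob, 3))   # an answer is produced at roughly chance-level confidence
```

There is no "abstain" outcome in this step unless one of the candidate tokens happens to express it, which is why grounding has to come from outside the decoding loop.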

RAG exists to solve all three problems at once. Instead of asking the model to answer from its parametric memory, RAG retrieves relevant documents and passes them as context. The model then answers from the retrieved text - which is current, specific, and attributable.

Historical Context

The term "Retrieval-Augmented Generation" was coined in a 2020 paper from Facebook AI Research: "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" by Patrick Lewis, Ethan Perez, Aleksandra Piktus, and colleagues. Published at NeurIPS 2020, the paper introduced a model architecture that combined a dense retrieval system with a sequence-to-sequence generator.

The key insight in the Lewis et al. paper was the distinction between parametric and non-parametric memory in language models:

Parametric memory: knowledge stored in the model's weights, fixed after training
Non-parametric memory: an external datastore that can be updated without retraining

The original RAG paper used a dense retrieval system called DPR (Dense Passage Retrieval, also from Facebook AI Research) to retrieve relevant Wikipedia passages, then fed them to a BART sequence-to-sequence model as context. On knowledge-intensive benchmarks like Natural Questions, TriviaQA, and WebQuestions, RAG significantly outperformed models that relied on parametric memory alone.

Before the Lewis paper, the dominant approach to knowledge-intensive NLP was to pack more knowledge into larger models - essentially, if the model doesn't know something, train a bigger model on more data. This approach had diminishing returns and was fundamentally brittle for anything requiring specific, current, or private knowledge. RAG proposed a different architecture: keep the model relatively small and general, and pair it with a dynamically updatable retrieval system.

The field has evolved rapidly since 2020. The simple retrieve-and-read pattern that the Lewis et al. paper popularized is now commonly called Naive RAG. The subsequent evolution, with approximate dates:

Naive RAG (2020-2022): basic dense retrieval, fixed chunking, single retrieval step
Advanced RAG (2022-2023): query rewriting, reranking, hybrid search, better chunking
Modular RAG (2023-2024): routing, multi-step retrieval, iterative refinement, CRAG
Agentic RAG (2024-present): self-correcting retrieval loops, tool-augmented retrieval, multi-hop reasoning over heterogeneous sources

Three Alternatives - When Each Is Right

Before building a RAG system, you should seriously consider three alternatives. Each is the right answer for a different kind of problem.

Option 1: Fine-Tuning

Fine-tuning updates the model's weights on domain-specific data. It is the right tool when:

You need the model to adopt a specific style or format (e.g., always respond in JSON, always write in a specific legal voice)
You need to internalize domain vocabulary (medical terminology, legal Latin, internal jargon) so the model understands questions without needing those terms in context
You need to improve reasoning patterns for a specific task type (e.g., always break contract review into five specific categories)
Your knowledge is stable and doesn't change frequently - you're not trying to serve current information

Fine-tuning is the wrong tool when:

You need to teach the model specific facts from specific documents
Your knowledge base changes weekly or monthly
You need the model to cite sources
You have fewer than ~10,000 high-quality training examples

The Meridian Legal team's mistake was classic: they tried to fine-tune domain knowledge into the model when what they needed was document access. Fine-tuning cannot memorize every clause of every contract. Even if it could, new contracts arrive every week.

Option 2: Prompt Stuffing

If your context window is large enough, you can simply put all relevant documents directly into the prompt. With 200K token context windows now available, this sounds appealing.

Prompt stuffing works well when:

Your knowledge fits in the context window
You always know which documents are relevant in advance
The query is broad enough that you need the entire knowledge base

It breaks down when:

Your knowledge base is larger than the context window (a 50,000-document corpus at 1,000 tokens each is 50 million tokens - not feasible)
Costs are prohibitive (every query pays for the full prompt, so sending 1M tokens per query at per-token API pricing gets expensive quickly at scale)
Relevant documents must be selected dynamically based on the query
Answer quality degrades on very long contexts (the "lost in the middle" phenomenon - models pay less attention to content in the middle of very long contexts)
Latency matters - processing 200K tokens can take several seconds or more

Liu et al. (2023) documented the "lost in the middle" effect: LLM performance drops significantly when the relevant information sits in the middle of a very long context rather than near its beginning or end. RAG mitigates this by retrieving only the most relevant chunks, keeping the context short and high-signal.
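The size argument against prompt stuffing is plain arithmetic. The sketch below uses the corpus figures and the 200K window from the text; the RAG budget of five chunks at about 300 tokens each is an assumed figure for comparison, not a measurement.

```python
def corpus_tokens(num_docs: int, tokens_per_doc: int) -> int:
    """Total tokens needed to place every document in the prompt."""
    return num_docs * tokens_per_doc

CONTEXT_WINDOW = 200_000           # window size mentioned in the text

total = corpus_tokens(50_000, 1_000)
print(total)                       # 50000000
print(total <= CONTEXT_WINDOW)     # False
print(total // CONTEXT_WINDOW)     # 250 (the corpus is 250x larger than the window)

# Assumed RAG budget: 5 retrieved chunks of roughly 300 tokens each.
rag_context_tokens = 5 * 300
print(rag_context_tokens)          # 1500
```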

Option 3: RAG - The Right Tool for Factual, Dynamic, Citable Knowledge

RAG is the right architecture when:

Your knowledge base is too large to fit in a context window
Your knowledge changes frequently (product docs, regulations, research papers)
Users need citations - they need to know where the answer came from
You need to control hallucination by grounding answers in specific documents
Different users need access to different subsets of the knowledge base (tenant isolation)

RAG's core trade-off: it introduces retrieval latency and retrieval failure modes (if you retrieve the wrong chunks, the answer will be wrong even if the model is perfect). Good RAG engineering is largely about minimizing retrieval failure.

Core Concepts: Parametric vs. Non-Parametric Knowledge

Understanding the parametric/non-parametric distinction is foundational to reasoning about RAG architectures.

Parametric knowledge lives in the model's weights. It was baked in during the pre-training and fine-tuning process. It's fast to query (no retrieval step), but it's fixed, opaque, and impossible to update without retraining. When a model produces a hallucination, it is often because parametric knowledge was insufficient for a specific query and the model "completed" the answer with plausible-but-wrong content.

Non-parametric knowledge lives in an external store - a database, a document collection, a vector index. It can be updated at any time without touching the model. It's slower to query (requires a retrieval step), but it's transparent (you can inspect exactly what was retrieved), attributable (you can cite the source), and current.

RAG combines both: the model's parametric knowledge provides general reasoning capability, language understanding, and synthesis ability, while the non-parametric store provides factual grounding. The model says "based on these documents, the answer is X" rather than "based on what I learned during training, I think the answer is X."
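The "updatable without retraining" property can be illustrated with a very small sketch. `TinyStore`, `Passage`, and the refund policy text are hypothetical names and data invented for this example, and the substring lookup stands in for real retrieval.

```python
from dataclasses import dataclass

@dataclass
class Passage:
    source: str   # identifier used for attribution
    text: str

class TinyStore:
    """Hypothetical in-memory non-parametric store keyed by source."""

    def __init__(self) -> None:
        self._passages: dict[str, Passage] = {}

    def upsert(self, source: str, text: str) -> None:
        """Insert or overwrite a passage. No model weights are involved."""
        self._passages[source] = Passage(source, text)

    def lookup(self, term: str) -> list[Passage]:
        """Naive substring match, standing in for a real retriever."""
        term = term.lower()
        return [p for p in self._passages.values() if term in p.text.lower()]

store = TinyStore()
store.upsert("refund-policy", "Refunds are accepted within 30 days of purchase.")
before = store.lookup("refunds")

# The policy changes: only the store is updated.
store.upsert("refund-policy", "Refunds are accepted within 14 days of purchase.")
after = store.lookup("refunds")

print(before[0].text)
print(after[0].source, "->", after[0].text)
```

Each hit carries its `source`, which is what makes the answer inspectable and citable.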

Dense vs. Sparse Retrieval

Before writing code, you need to understand the two families of retrieval methods.

Sparse retrieval (BM25, TF-IDF) represents documents and queries as high-dimensional sparse vectors where each dimension corresponds to a vocabulary term. BM25 is the standard baseline - it scores relevance using term frequency, inverse document frequency, and document length normalization. Sparse retrieval is excellent at lexical matching: if the query contains "clause 14.3(b)", BM25 will find documents containing exactly that string with high precision. It fails on semantic matching: the query "When can the penalty be triggered?" will not surface clause 14.3(b) unless the clause text happens to use the same words.
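A minimal BM25 scorer makes both behaviors visible. This is a sketch, not a production implementation: the tokenizer is a simple whitespace split, the IDF uses the smoothed form that stays non-negative, `k1=1.5` and `b=0.75` are conventional defaults, and the three documents are invented.

```python
import math
from collections import Counter

def tokenize(text: str) -> list[str]:
    """Lowercase, split on whitespace, trim edge punctuation (keeps '14.3(b)' intact)."""
    words = (w.strip(".,;:?!\"'") for w in text.lower().split())
    return [w for w in words if w]

def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many documents each term appears
    df = Counter(term for toks in tokenized for term in set(toks))
    query_terms = tokenize(query)

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores

docs = [
    "Clause 14.3(b) lists conditions under which the licensor may charge the fee.",
    "Clause 9.1 covers confidentiality obligations for both parties.",
    "Late delivery incurs daily charges capped at ten percent.",
]

exact = bm25_scores("clause 14.3(b)", docs)
paraphrase = bm25_scores("When is a penalty triggered?", docs)

print(exact)        # first document scores highest: it contains both query terms
print(paraphrase)   # all zeros: no query word appears in any document
```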

Dense retrieval uses neural embedding models to map documents and queries into a continuous vector space. Similar meanings end up close together in this space, regardless of whether they share vocabulary. A query about "penalty trigger conditions" will retrieve text about "conditions that activate the fee clause" because they're semantically similar. Dense retrieval excels at semantic matching but can miss exact terms (especially proper nouns, IDs, codes) that don't have meaningful embeddings.
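The geometry behind dense retrieval can be shown without a model. The 3-dimensional vectors below are hand-made stand-ins for real embeddings (an assumption for illustration); only the similarity computation is the point.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def normalize(v: np.ndarray) -> np.ndarray:
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)

# Hand-made vectors, NOT the output of an embedding model.
query = np.array([0.9, 0.1, 0.0])             # "penalty trigger conditions"
fee_clause = np.array([0.8, 0.2, 0.1])        # "conditions that activate the fee clause"
confidentiality = np.array([0.0, 0.2, 0.9])   # unrelated confidentiality text

sim_fee = cosine_similarity(query, fee_clause)
sim_conf = cosine_similarity(query, confidentiality)
print(round(sim_fee, 3), round(sim_conf, 3))

# On unit-length vectors, a plain dot product equals cosine similarity.
dot_normalized = float(np.dot(normalize(query), normalize(fee_clause)))
print(abs(dot_normalized - sim_fee) < 1e-9)
```

The last check is the reason an inner-product index can serve as a cosine-similarity index once embeddings are L2-normalized.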

Production RAG systems commonly use hybrid retrieval: run both BM25 and dense retrieval, merge the results (typically using Reciprocal Rank Fusion), and then rerank the merged list. This is covered in detail in Lesson 05.
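Reciprocal Rank Fusion itself is only a few lines. The sketch below assumes each retriever returns an ordered list of chunk IDs; the IDs are made up, and `k=60` is the commonly used default constant.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: each item scores sum(1 / (k + rank)) over the lists it appears in."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

# Hypothetical outputs from the two retrievers, best match first.
bm25_ranking = ["c7", "c2", "c9"]
dense_ranking = ["c2", "c4", "c7"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)   # ['c2', 'c7', 'c4', 'c9']
```

RRF uses only rank positions, so BM25 scores and cosine similarities never need to be put on a common scale.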

The RAG Pipeline: End to End

A complete RAG system has two separate pipelines that operate at different times:

Offline (indexing) pipeline - runs once when documents are added:

1. Parse raw documents (PDF, DOCX, HTML, code)
2. Clean and normalize text
3. Chunk into retrievable segments with overlap
4. Enrich with metadata (source, page, section, date)
5. Embed each chunk using an embedding model
6. Store chunk text and embedding in a vector store

Online (query) pipeline - runs for every user query:

1. Receive user query
2. Optionally transform the query (HyDE, multi-query expansion)
3. Embed the query using the same embedding model
4. Retrieve top-k similar chunks from the vector store
5. Optionally rerank using a cross-encoder
6. Assemble an augmented prompt with retrieved context
7. Generate an answer with citations using the LLM
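Before the full implementation later in this lesson, the two pipelines can be compressed into a toy sketch. Everything here is a simplification made for illustration: `toy_embed` is a hashed bag-of-words vector rather than a neural embedding, chunks are single sentences, the "vector store" is a NumPy matrix, and the final LLM call is left out so that the sketch stops at the assembled prompt.

```python
import zlib
import numpy as np

DIM = 1024  # arbitrary size for the toy vectors

def toy_embed(text: str) -> np.ndarray:
    """Hashed bag-of-words vector, L2-normalized. A stand-in for a real embedding model."""
    vec = np.zeros(DIM, dtype=np.float32)
    for word in text.lower().split():
        word = word.strip(".,;:?!")
        if word:
            vec[zlib.crc32(word.encode("utf-8")) % DIM] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

# --- Offline (indexing): chunk -> embed -> store ---
document = (
    "Standard penalty for late delivery is 0.5% of project value per day. "
    "Confidential information must not be disclosed. "
    "Either party may terminate with 60 days notice."
)
chunks = [s.strip() for s in document.split(". ") if s.strip()]
index = np.stack([toy_embed(c) for c in chunks])    # shape: (num_chunks, DIM)

# --- Online (query): embed -> retrieve top-k -> assemble prompt ---
def retrieve(query: str, top_k: int = 2) -> list[tuple[str, float]]:
    scores = index @ toy_embed(query)               # cosine similarity on unit vectors
    order = np.argsort(-scores)[:top_k]
    return [(chunks[i], float(scores[i])) for i in order]

def build_prompt(query: str, retrieved: list[tuple[str, float]]) -> str:
    context = "\n".join(f"[{n}] {text}" for n, (text, _) in enumerate(retrieved, 1))
    return (
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer only from the context and cite [n]."
    )

question = "What is the penalty for late delivery?"
hits = retrieve(question)
prompt = build_prompt(question, hits)
print(prompt)
```

The same embedding function is used on both sides; mixing two different embedding models between indexing and querying would make the similarity scores meaningless.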

LLM Failure Modes Without RAG

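To fully appreciate what RAG solves, it helps to recall exactly how a model fails on knowledge-intensive queries without retrieval. The three structural problems described earlier are the failure modes: answers that are stale because of the knowledge cutoff, confident fabrication where parametric knowledge is thin, and claims that cannot be traced to a source.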

Production Code: Minimal RAG Pipeline from Scratch

The following implements a complete, working RAG pipeline without relying on LangChain, LlamaIndex, or any high-level RAG framework. Building it from scratch teaches you exactly what those frameworks are doing under the hood - and gives you full control over every decision.

This implementation uses:

sentence-transformers for local embeddings (no API cost for indexing)
faiss-cpu for vector search
anthropic SDK for generation

"""
Minimal RAG pipeline from scratch.
No LangChain, no LlamaIndex - every component explicit.

Install: pip install anthropic sentence-transformers faiss-cpu numpy
"""

import os
import json
import time
import hashlib
import numpy as np
from dataclasses import dataclass, field, asdict
from typing import Optional
import anthropic

# Try to import optional dependencies with clear error messages
try:
    from sentence_transformers import SentenceTransformer
    HAS_ST = True
except ImportError:
    HAS_ST = False
    print("sentence-transformers not installed. Run: pip install sentence-transformers")

try:
    import faiss
    HAS_FAISS = True
except ImportError:
    HAS_FAISS = False
    print("faiss-cpu not installed. Run: pip install faiss-cpu")


# ---------------------------------------------------------------------------
# Data structures
# ---------------------------------------------------------------------------

@dataclass
class Document:
    """A raw document before chunking."""
    content: str
    source: str                          # file path, URL, or identifier
    title: Optional[str] = None
    created_at: Optional[str] = None
    doc_type: Optional[str] = None       # "contract", "policy", "manual", etc.

    def doc_id(self) -> str:
        """Stable ID based on source + content hash."""
        # Hash the full content so documents sharing a 200-character prefix get distinct IDs
        payload = f"{self.source}:{self.content}"
        return hashlib.md5(payload.encode()).hexdigest()[:12]


@dataclass
class Chunk:
    """A retrievable chunk of text with metadata."""
    chunk_id: str
    text: str
    source: str
    doc_id: str
    chunk_index: int                     # position within the document
    start_char: int                      # character offset in original doc
    end_char: int
    metadata: dict = field(default_factory=dict)
    embedding: Optional[np.ndarray] = field(default=None, repr=False)

    def to_context_string(self) -> str:
        """Format this chunk for inclusion in a prompt."""
        source_label = self.metadata.get("title", self.source)
        return f"[Source: {source_label}, chunk {self.chunk_index}]\n{self.text}"


@dataclass
class RetrievedContext:
    """The result of a retrieval query: ranked chunks with scores."""
    query: str
    chunks: list[Chunk]
    scores: list[float]
    retrieval_time_ms: float

    def to_prompt_context(self, max_chunks: int = 5) -> str:
        """Build the context block to inject into the LLM prompt."""
        selected = list(zip(self.chunks, self.scores))[:max_chunks]
        parts = []
        for i, (chunk, score) in enumerate(selected, 1):
            parts.append(
                f"--- Document {i} (relevance: {score:.3f}) ---\n"
                f"{chunk.to_context_string()}"
            )
        return "\n\n".join(parts)

    def source_list(self) -> list[str]:
        """Return unique sources for citation."""
        seen = set()
        sources = []
        for chunk in self.chunks:
            src = chunk.metadata.get("title", chunk.source)
            if src not in seen:
                seen.add(src)
                sources.append(src)
        return sources


# ---------------------------------------------------------------------------
# Step 1: Chunking
# ---------------------------------------------------------------------------

class FixedSizeChunker:
    """
    Splits documents into fixed-size chunks with overlap.
    Overlap ensures that content near chunk boundaries is captured
    by at least one chunk.
    """

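    def __init__(self, chunk_size: int = 512, overlap: int = 64):
        # Sizes are measured in characters, not tokens.
        # overlap >= chunk_size would stop the window from advancing (infinite loop).
        if not 0 <= overlap < chunk_size:
            raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
        self.chunk_size = chunk_size
        self.overlap = overlap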

    def chunk(self, doc: Document) -> list[Chunk]:
        text = doc.content
        chunks: list[Chunk] = []
        start = 0
        chunk_index = 0

        while start < len(text):
            end = start + self.chunk_size
            chunk_text = text[start:end]

            # Don't create a chunk that is pure whitespace
            if not chunk_text.strip():
                start = end - self.overlap
                continue

            chunk_id = f"{doc.doc_id()}-{chunk_index}"
            chunk = Chunk(
                chunk_id=chunk_id,
                text=chunk_text.strip(),
                source=doc.source,
                doc_id=doc.doc_id(),
                chunk_index=chunk_index,
                start_char=start,
                end_char=min(end, len(text)),
                metadata={
                    "title": doc.title or doc.source,
                    "doc_type": doc.doc_type,
                    "created_at": doc.created_at,
                },
            )
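            chunks.append(chunk)
            chunk_index += 1
            if end >= len(text):
                break  # final chunk reached the end; avoid a redundant overlap-only tail chunk
            start = end - self.overlap  # step forward with overlap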

        return chunks


class ParagraphChunker:
    """
    Splits on paragraph boundaries (double newlines).
    Better preserves semantic units than fixed-size splitting.
    """

    def __init__(self, max_chunk_size: int = 800, min_chunk_size: int = 50):
        self.max_chunk_size = max_chunk_size
        self.min_chunk_size = min_chunk_size

    def chunk(self, doc: Document) -> list[Chunk]:
        paragraphs = [p.strip() for p in doc.content.split("\n\n") if p.strip()]
        chunks: list[Chunk] = []
        buffer = ""
        buffer_start = 0
        chunk_index = 0
        char_offset = 0

        for para in paragraphs:
            if len(buffer) + len(para) > self.max_chunk_size and len(buffer) >= self.min_chunk_size:
                # Flush buffer as a chunk
                chunk = Chunk(
                    chunk_id=f"{doc.doc_id()}-{chunk_index}",
                    text=buffer,
                    source=doc.source,
                    doc_id=doc.doc_id(),
                    chunk_index=chunk_index,
                    start_char=buffer_start,
                    end_char=char_offset,
                    metadata={"title": doc.title or doc.source},
                )
                chunks.append(chunk)
                chunk_index += 1
                buffer = para
                buffer_start = char_offset
            else:
                buffer = f"{buffer}\n\n{para}".strip() if buffer else para

            char_offset += len(para) + 2  # +2 for "\n\n"

        # Flush remaining buffer
        if buffer.strip():
            chunks.append(Chunk(
                chunk_id=f"{doc.doc_id()}-{chunk_index}",
                text=buffer,
                source=doc.source,
                doc_id=doc.doc_id(),
                chunk_index=chunk_index,
                start_char=buffer_start,
                end_char=len(doc.content),
                metadata={"title": doc.title or doc.source},
            ))

        return chunks


# ---------------------------------------------------------------------------
# Step 2: Embedding
# ---------------------------------------------------------------------------

class EmbeddingModel:
    """
    Wraps a sentence-transformers model for local embedding.
    For production, consider: text-embedding-3-small (OpenAI),
    voyage-3 (Voyage AI), or cohere-embed-v3.
    """

    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        if not HAS_ST:
            raise RuntimeError("sentence-transformers required. pip install sentence-transformers")
        print(f"Loading embedding model: {model_name}")
        self.model = SentenceTransformer(model_name)
        self.model_name = model_name
        self.dim = self.model.get_sentence_embedding_dimension()
        print(f"Embedding dimension: {self.dim}")

    def embed(self, texts: list[str], batch_size: int = 64) -> np.ndarray:
        """Embed a list of texts, returning float32 numpy array of shape (N, dim)."""
        embeddings = self.model.encode(
            texts,
            batch_size=batch_size,
            show_progress_bar=len(texts) > 100,
            normalize_embeddings=True,   # normalize for cosine similarity via dot product
            convert_to_numpy=True,
        )
        return embeddings.astype(np.float32)

    def embed_query(self, query: str) -> np.ndarray:
        """Embed a single query string."""
        embedding = self.model.encode([query], normalize_embeddings=True, convert_to_numpy=True)
        return embedding.astype(np.float32)


# ---------------------------------------------------------------------------
# Step 3: Vector Store
# ---------------------------------------------------------------------------

class FAISSVectorStore:
    """
    In-memory FAISS vector store with chunk metadata stored in a parallel list.
    For production: use pgvector (Postgres), Pinecone, or Weaviate.
    """

    def __init__(self, dim: int):
        if not HAS_FAISS:
            raise RuntimeError("faiss-cpu required. pip install faiss-cpu")
        # Inner Product index - works as cosine similarity when vectors are L2-normalized
        self.index = faiss.IndexFlatIP(dim)
        self.chunks: list[Chunk] = []   # parallel list: index i → chunk
        self.dim = dim

    def add_chunks(self, chunks: list[Chunk], embeddings: np.ndarray) -> None:
        """
        Add chunks and their embeddings to the index.
        embeddings: shape (N, dim), float32, already L2-normalized
        """
        assert len(chunks) == len(embeddings), "chunks and embeddings must have same length"
        assert embeddings.shape[1] == self.dim, f"Expected dim {self.dim}, got {embeddings.shape[1]}"

        self.index.add(embeddings)
        self.chunks.extend(chunks)
        print(f"Indexed {len(chunks)} chunks. Total: {len(self.chunks)}")

    def search(self, query_embedding: np.ndarray, top_k: int = 10) -> tuple[list[Chunk], list[float]]:
        """
        Return the top-k most similar chunks with their similarity scores.
        query_embedding: shape (1, dim), float32, L2-normalized
        """
        if self.index.ntotal == 0:
            return [], []

        top_k = min(top_k, self.index.ntotal)
        scores, indices = self.index.search(query_embedding, top_k)

        result_chunks = []
        result_scores = []
        for score, idx in zip(scores[0], indices[0]):
            if idx >= 0:  # -1 indicates FAISS found no result
                result_chunks.append(self.chunks[idx])
                result_scores.append(float(score))

        return result_chunks, result_scores

    def total_chunks(self) -> int:
        return len(self.chunks)

    def save(self, path: str) -> None:
        """Persist index and metadata to disk."""
        faiss.write_index(self.index, f"{path}.faiss")
        # Serialize chunk metadata without embeddings (which are in FAISS)
        metadata = []
        for chunk in self.chunks:
            d = asdict(chunk)
            d.pop("embedding", None)  # don't serialize numpy array as JSON
            metadata.append(d)
        with open(f"{path}.meta.json", "w") as f:
            json.dump(metadata, f, indent=2)
        print(f"Saved index ({self.index.ntotal} vectors) to {path}.faiss")

    @classmethod
    def load(cls, path: str, dim: int) -> "FAISSVectorStore":
        """Load index and metadata from disk."""
        store = cls(dim=dim)
        store.index = faiss.read_index(f"{path}.faiss")
        with open(f"{path}.meta.json") as f:
            metadata = json.load(f)
        store.chunks = [Chunk(**m) for m in metadata]
        print(f"Loaded index ({store.index.ntotal} vectors) from {path}.faiss")
        return store


# ---------------------------------------------------------------------------
# Step 4: Retriever
# ---------------------------------------------------------------------------

class Retriever:
    """
    Combines the embedding model and vector store to answer retrieval queries.
    """

    def __init__(self, embedding_model: EmbeddingModel, vector_store: FAISSVectorStore):
        self.embedding_model = embedding_model
        self.vector_store = vector_store

    def retrieve(self, query: str, top_k: int = 5) -> RetrievedContext:
        """
        Retrieve the most relevant chunks for a query.
        Returns a RetrievedContext with ranked chunks and scores.
        """
        t0 = time.perf_counter()

        query_embedding = self.embedding_model.embed_query(query)
        chunks, scores = self.vector_store.search(query_embedding, top_k=top_k)

        elapsed_ms = (time.perf_counter() - t0) * 1000

        return RetrievedContext(
            query=query,
            chunks=chunks,
            scores=scores,
            retrieval_time_ms=elapsed_ms,
        )


# ---------------------------------------------------------------------------
# Step 5: Generator with Citations
# ---------------------------------------------------------------------------

SYSTEM_PROMPT = """You are a precise document assistant. Your job is to answer questions based strictly on the provided context documents.

Rules:
1. Answer ONLY from the provided context. Do not use outside knowledge.
2. If the context does not contain enough information to answer, say so clearly.
3. Always cite your sources using the document labels provided (e.g., "According to [Source: Annual Report, chunk 3]...").
4. Keep answers concise and accurate. Do not pad with unnecessary information.
5. If multiple documents contain relevant information, synthesize them and cite each."""

USER_PROMPT_TEMPLATE = """Context Documents:
{context}

---

Question: {question}

Answer based on the context above. Cite your sources."""


class RAGGenerator:
    """
    Generates answers from retrieved context using Claude.
    Tracks citation sources for attribution.
    """

    def __init__(self, model: str = "claude-haiku-4-5"):
        self.client = anthropic.Anthropic()
        self.model = model

    def generate(
        self,
        question: str,
        context: RetrievedContext,
        max_tokens: int = 1024,
        max_context_chunks: int = 5,
    ) -> dict:
        """
        Generate an answer from the retrieved context.

        Returns a dict with:
        - answer: the generated text
        - sources: list of source identifiers
        - model: the model used
        - usage: token usage statistics
        - retrieval_time_ms: retrieval latency
        - generation_time_ms: generation latency
        """
        context_str = context.to_prompt_context(max_chunks=max_context_chunks)
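        # Cite only the chunks that are actually placed in the prompt
        sources = list(dict.fromkeys(
            chunk.metadata.get("title", chunk.source)
            for chunk in context.chunks[:max_context_chunks]
        ))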

        prompt = USER_PROMPT_TEMPLATE.format(
            context=context_str,
            question=question,
        )

        t0 = time.perf_counter()

        response = self.client.messages.create(
            model=self.model,
            max_tokens=max_tokens,
            system=SYSTEM_PROMPT,
            messages=[{"role": "user", "content": prompt}],
        )

        generation_time_ms = (time.perf_counter() - t0) * 1000

        return {
            "question": question,
            "answer": response.content[0].text,
            "sources": sources,
            "model": self.model,
            "usage": {
                "input_tokens": response.usage.input_tokens,
                "output_tokens": response.usage.output_tokens,
            },
            "retrieval_time_ms": context.retrieval_time_ms,
            "generation_time_ms": generation_time_ms,
            "total_time_ms": context.retrieval_time_ms + generation_time_ms,
            "top_chunk_scores": context.scores[:max_context_chunks],
        }


# ---------------------------------------------------------------------------
# Full Pipeline: Ingestion + Query
# ---------------------------------------------------------------------------

class RAGPipeline:
    """
    End-to-end RAG pipeline: ingest documents, retrieve, generate.
    """

    def __init__(
        self,
        embedding_model_name: str = "all-MiniLM-L6-v2",
        chunker: str = "paragraph",   # "fixed" or "paragraph"
        chunk_size: int = 512,
        chunk_overlap: int = 64,
        llm_model: str = "claude-haiku-4-5",
    ):
        self.embedding_model = EmbeddingModel(embedding_model_name)

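        if chunker == "fixed":
            self.chunker = FixedSizeChunker(chunk_size=chunk_size, overlap=chunk_overlap)
        elif chunker == "paragraph":
            # ParagraphChunker uses its own defaults; chunk_size/chunk_overlap apply to "fixed" only
            self.chunker = ParagraphChunker()
        else:
            raise ValueError(f"Unknown chunker {chunker!r}; expected 'fixed' or 'paragraph'")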

        self.vector_store = FAISSVectorStore(dim=self.embedding_model.dim)
        self.retriever = Retriever(self.embedding_model, self.vector_store)
        self.generator = RAGGenerator(model=llm_model)

    def ingest(self, documents: list[Document], batch_size: int = 64) -> dict:
        """
        Ingest a list of documents: chunk → embed → store.
        Returns ingestion statistics.
        """
        t0 = time.perf_counter()

        all_chunks: list[Chunk] = []
        for doc in documents:
            chunks = self.chunker.chunk(doc)
            all_chunks.extend(chunks)
            print(f"  {doc.source}: {len(chunks)} chunks")

        # Embed in batches
        texts = [chunk.text for chunk in all_chunks]
        embeddings = self.embedding_model.embed(texts, batch_size=batch_size)

        # Attach embeddings to chunks and store
        for chunk, emb in zip(all_chunks, embeddings):
            chunk.embedding = emb

        self.vector_store.add_chunks(all_chunks, embeddings)

        elapsed = time.perf_counter() - t0
        return {
            "documents_ingested": len(documents),
            "chunks_created": len(all_chunks),
            "ingestion_time_s": elapsed,
            "chunks_per_second": len(all_chunks) / elapsed,
        }

    def query(
        self,
        question: str,
        top_k: int = 5,
        max_tokens: int = 1024,
    ) -> dict:
        """
        Answer a question using the ingested knowledge base.
        """
        context = self.retriever.retrieve(question, top_k=top_k)
        result = self.generator.generate(question, context, max_tokens=max_tokens)
        return result


# ---------------------------------------------------------------------------
# Example Usage
# ---------------------------------------------------------------------------

def main():
    # Create sample documents (in production: load from PDFs, DOCX, etc.)
    documents = [
        Document(
            content="""MASTER SERVICES AGREEMENT - ANNEX G

Annex G: Penalty and Fee Structures

Section 14: Penalty Clauses

14.1 Standard Penalties
The standard penalty for late delivery is 0.5% of the project value per day, capped at 10% of the total contract value.

14.2 Escalated Penalties
In cases where delivery delay exceeds 30 days, the penalty escalates to 1.0% per day.

14.3 Penalty Trigger Conditions
Penalties under this Annex are triggered when:
(a) The licensor fails to deliver milestones within the agreed schedule as defined in Schedule B;
(b) The licensed software fails to pass User Acceptance Testing (UAT) within three consecutive test cycles, provided that:
    (i) The licensee has provided complete and accurate test data as specified in Annex H;
    (ii) The failure is not attributable to infrastructure issues on the licensee's side;
    (iii) Written notice of UAT failure has been provided within 48 hours of each failed cycle;
(c) The licensor fails to maintain the minimum service levels defined in the SLA (Exhibit C) for two consecutive calendar months.

14.4 Carve-outs
The penalty provisions in 14.3(b) shall not apply in the following circumstances:
- Force majeure events as defined in Section 22
- Failures attributable to third-party integrations listed in Appendix D
- Failures during the first 90 days after initial deployment (grace period)""",
            source="annex-g-penalties.txt",
            title="MSA Annex G: Penalty Clauses",
            doc_type="contract",
            created_at="2024-01-15",
        ),
        Document(
            content="""MASTER SERVICES AGREEMENT - EXHIBIT C

Service Level Agreement (SLA)

1. Availability
The licensed software must maintain 99.5% uptime measured monthly, excluding scheduled maintenance windows.

2. Response Times
- Critical issues (P1): initial response within 1 hour, resolution within 4 hours
- High issues (P2): initial response within 4 hours, resolution within 24 hours
- Medium issues (P3): initial response within 1 business day, resolution within 5 business days

3. SLA Measurement
Uptime is measured using the monitoring infrastructure specified in Appendix E.
Monthly SLA reports must be provided by the 5th business day of the following month.

4. SLA Breach Consequences
SLA breaches trigger the penalty provisions in Annex G, Section 14.3(c).""",
            source="exhibit-c-sla.txt",
            title="MSA Exhibit C: SLA",
            doc_type="contract",
            created_at="2024-01-15",
        ),
    ]

    # Initialize pipeline
    print("\n=== Initializing RAG Pipeline ===")
    pipeline = RAGPipeline(
        embedding_model_name="all-MiniLM-L6-v2",
        chunker="paragraph",
        llm_model="claude-haiku-4-5",
    )

    # Ingest
    print("\n=== Ingesting Documents ===")
    stats = pipeline.ingest(documents)
    print(f"Ingestion complete: {stats}")

    # Query 1: Specific clause question (would fail without RAG)
    print("\n=== Query 1: Specific clause ===")
    result = pipeline.query(
        "What are the specific conditions that trigger the penalty under clause 14.3(b)?",
        top_k=5,
    )
    print(f"Answer:\n{result['answer']}")
    print(f"\nSources: {result['sources']}")
    print(f"Retrieval: {result['retrieval_time_ms']:.1f}ms | Generation: {result['generation_time_ms']:.1f}ms")
    print(f"Tokens: {result['usage']}")

    # Query 2: Cross-document question
    print("\n=== Query 2: Cross-document synthesis ===")
    result2 = pipeline.query(
        "If the software fails UAT and also misses SLA targets, what penalties apply?",
        top_k=5,
    )
    print(f"Answer:\n{result2['answer']}")
    print(f"\nSources: {result2['sources']}")

    # Query 3: Out-of-scope question (should be handled gracefully)
    print("\n=== Query 3: Out of scope ===")
    result3 = pipeline.query(
        "What is the total contract value?",
        top_k=3,
    )
    print(f"Answer:\n{result3['answer']}")


if __name__ == "__main__":
    main()

Production Engineering Notes

Latency Budget

A production RAG query has a multi-step latency budget. Each step adds time:

| Step | Typical Latency | Notes |
| --- | --- | --- |
| Query embedding | 5-15ms | Local model; use GPU for less than 5ms |
| ANN vector search | 1-10ms | FAISS flat: exact but slow; IVF: approximate but fast |
| Network to vector DB | 5-50ms | pgvector (local): less than 5ms; Pinecone: 20-80ms |
| Reranking (cross-encoder) | 50-200ms | Often skipped for latency; batch limit 20 chunks |
| LLM generation (TTFT) | 300-800ms | Time to first token; streaming hides this |
| LLM generation (total) | 1-5s | Depends on output length |

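Total P50: 500-800ms for a lean pipeline without reranking, and 1-2s with reranking. These totals are best read as time to first token: the full answer takes longer, because total generation time (1-5s) depends on output length.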

Budget your latency from the user experience backward: if you need a 2-second response, you have roughly 500ms for retrieval and 1.5 seconds for generation. This directly constrains how much reranking you can do.
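The backward-budgeting arithmetic above can be sketched in a few lines. This is a minimal illustration, not a measurement: the stage names are made up for the example, the numbers are the upper ends of the ranges quoted in this section, and `fits_budget` is a hypothetical helper.

```python
def fits_budget(stage_latencies_ms: dict[str, float], budget_ms: float) -> tuple[bool, float]:
    """Return (fits, headroom_ms) for a set of stages against a latency budget."""
    total = sum(stage_latencies_ms.values())
    return total <= budget_ms, budget_ms - total


# Retrieval budget from the 2-second example: roughly 500ms for retrieval.
RETRIEVAL_BUDGET_MS = 500

# Assumed worst-case figures (upper ends of the ranges quoted above).
retrieval_stages = {
    "query_embedding": 15,
    "ann_search": 10,
    "vector_db_network": 50,
}
with_reranking = {**retrieval_stages, "reranking": 200}

lean_ok, lean_headroom = fits_budget(retrieval_stages, RETRIEVAL_BUDGET_MS)
rerank_ok, rerank_headroom = fits_budget(with_reranking, RETRIEVAL_BUDGET_MS)

print(f"Lean pipeline fits: {lean_ok}, headroom {lean_headroom:.0f}ms")
print(f"With reranking fits: {rerank_ok}, headroom {rerank_headroom:.0f}ms")
```

Under these assumed numbers reranking still fits, but it consumes most of the retrieval headroom, which is why it is the first stage to be cut when the budget tightens.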

When Retrieval Fails

Retrieval can fail in two ways:

Low recall: the correct chunk is not in the top-k results. Causes: poor embedding quality for this domain, query vocabulary mismatch, or badly sized chunks (too large and the relevant sentence is diluted by surrounding text; too small and the context needed to match the query is cut off). Fix: increase k, add BM25 hybrid retrieval, or fine-tune embeddings.

Low precision: the top-k results contain many irrelevant chunks. Causes: query too broad, embedding space not discriminative enough, too many similar documents. Fix: add reranking, improve chunking (smaller chunks are more specific), use metadata filters.

The critical diagnostic: track retrieval accuracy separately from answer accuracy. If retrieval accuracy is high but answer accuracy is low, the problem is in generation (the model is ignoring context or misinterpreting it). If retrieval accuracy is low, fix the retrieval layer first - no amount of prompt engineering will help if the wrong chunks are retrieved.
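To track retrieval accuracy on its own, you need a small labeled set that maps each test question to the chunk IDs that actually contain the answer. The sketch below computes recall@k and precision@k from such labels. The chunk IDs are hypothetical and the function names are illustrative; they are not part of the pipeline code above.

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids)


def precision_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for chunk_id in top_k if chunk_id in relevant_ids)
    return hits / len(top_k)


# Hypothetical ranked results for the clause 14.3(b) question.
retrieved = ["annex-g#3", "exhibit-c#1", "annex-g#4", "exhibit-c#2", "annex-g#1"]
relevant = {"annex-g#3", "annex-g#4"}  # labeled by a human reviewer

print(f"recall@3    = {recall_at_k(retrieved, relevant, 3):.2f}")
print(f"precision@3 = {precision_at_k(retrieved, relevant, 3):.2f}")
```

Low recall@k points to the low-recall fixes listed above; high recall with low precision@k points to reranking, chunking, or metadata filters.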

Detecting Retrieval vs. Generation Errors

def diagnose_failure(question: str, answer: str, context: RetrievedContext) -> dict:
    """
    Use Claude to determine whether a failure is a retrieval error or generation error.
    Pass this the question, the answer, and the retrieved chunks.
    """
    client = anthropic.Anthropic()

    context_str = context.to_prompt_context(max_chunks=5)

    diagnosis_prompt = f"""You are evaluating a RAG system failure.

Question: {question}

Retrieved context:
{context_str}

System answer: {answer}

Tasks:
1. Is the correct answer present in the retrieved context? (yes/no)
2. If yes, did the system's answer correctly reflect the context? (yes/no)
3. Classify: "retrieval_failure" (correct info not retrieved) or "generation_failure" (correct info retrieved but wrong answer) or "correct" (answer is right)
4. Brief explanation (1 sentence).

Respond in JSON: {{"answer_in_context": bool, "answer_correct": bool, "failure_type": str, "explanation": str}}"""

    response = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=256,
        messages=[{"role": "user", "content": diagnosis_prompt}],
    )

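    import json

    raw = response.content[0].text
    # The model may wrap its JSON in prose or a Markdown fence; keep only the outermost object.
    start, end = raw.find("{"), raw.rfind("}")
    try:
        return json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        # Return the unparsed text so the caller can inspect what went wrong.
        return {"raw": raw}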

:::tip Production Monitoring Track retrieval accuracy and generation accuracy as separate metrics in your observability stack. When accuracy drops, you need to know which layer failed. A retrieval failure requires different fixes than a generation failure - and mixing up the diagnosis wastes engineering time. :::

:::warning Context Window Overflow If you retrieve too many chunks, you can exceed the model's context window or trigger the "lost in the middle" degradation. Always calculate your token budget: prompt_tokens + sum(chunk_tokens) + max_output_tokens < context_limit. For Claude, the limit is 200K tokens - but optimal performance is usually with much shorter contexts (under 20K tokens of retrieved content). :::
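The token budget check from the warning above can be sketched as a greedy selection over ranked chunks. This is a rough illustration under stated assumptions: `estimate_tokens` uses a crude heuristic of about 4 characters per token for English text, and both function names are hypothetical. In production, count tokens with your model provider's tokenizer or token-counting endpoint instead.

```python
def estimate_tokens(text: str) -> int:
    """Crude heuristic (assumption): roughly 4 characters per token for English text."""
    return max(1, len(text) // 4)


def select_chunks_within_budget(
    ranked_chunks: list[str],
    prompt_tokens: int,
    max_output_tokens: int,
    context_limit: int = 200_000,
    retrieved_budget: int = 20_000,
) -> list[str]:
    """Keep chunks in ranked order while the running total stays under both limits."""
    hard_room = context_limit - prompt_tokens - max_output_tokens
    room = min(hard_room, retrieved_budget)

    selected: list[str] = []
    used = 0
    for chunk in ranked_chunks:
        cost = estimate_tokens(chunk)
        # Strict inequality, matching the budget formula: stop before reaching the limit.
        if used + cost >= room:
            break
        selected.append(chunk)
        used += cost
    return selected


# Three chunks of about 100 estimated tokens each, with a deliberately tiny budget.
chunks = ["a" * 400, "b" * 400, "c" * 400]
kept = select_chunks_within_budget(
    chunks, prompt_tokens=50, max_output_tokens=100,
    context_limit=1_000, retrieved_budget=250,
)
print(f"Kept {len(kept)} of {len(chunks)} chunks")
```

Stopping at the first chunk that does not fit preserves rank order; skipping it and trying smaller, lower-ranked chunks is an alternative that trades relevance for fuller use of the budget.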

:::danger Never Use Prompt Stuffing in Place of Retrieval Putting your entire knowledge base in every prompt is tempting with large context windows, but it has three serious production problems: (1) it's extremely expensive at scale - 100K tokens per query × 10,000 queries/day = 1 billion tokens/day; (2) "lost in the middle" means the model pays less attention to content buried in the middle of very long contexts; (3) it prevents tenant isolation - you can't selectively show different users different subsets of your knowledge base. Use retrieval. :::

Interview Questions and Answers

Q1: What is RAG and why was it invented?

RAG (Retrieval-Augmented Generation) is an architecture that retrieves relevant documents from an external store and passes them as context to a language model before generation. It was invented because language models have two fundamental limitations: their knowledge is fixed at training time (knowledge cutoff), and they hallucinate when asked about facts they have weak knowledge of. The founding paper (Lewis et al., NeurIPS 2020) introduced the parametric/non-parametric memory distinction: the model's weights hold general reasoning capability, while an external document store holds specific, current, attributable facts. RAG combines both.

Q2: When would you choose fine-tuning over RAG?

Fine-tuning is the right choice when you need to change the model's behavior or style, not its factual knowledge. Concrete examples: training the model to always respond in JSON, adopting domain-specific vocabulary so it understands jargon in questions, improving reasoning patterns for a specific task type, or teaching the model to follow a specific format (e.g., always structure contract analysis into five sections). RAG is the right choice when you need the model to know specific facts from specific documents, especially when those documents are proprietary, change frequently, or need to be cited. The key diagnostic: if you're trying to teach facts, use RAG. If you're trying to change behavior, use fine-tuning. Many production systems need both.

Q3: What is the "lost in the middle" problem and how does RAG address it?

The "lost in the middle" phenomenon (Liu et al., 2023) shows that LLMs perform worse when relevant information appears in the middle of a long context, compared to when it appears at the beginning or end. Performance on knowledge-intensive tasks drops by 20-30% when the relevant passage is in the middle of a 20-document context versus at the beginning. RAG addresses this by retrieving only the most relevant chunks (typically 3-10) and placing them at the beginning of the context. This keeps the context short and positions the most relevant information where the model pays most attention. Reranking the retrieved chunks and putting the highest-scored chunk first further helps.
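The ordering step described in this answer is small enough to show directly. The sketch below assumes each retrieved chunk arrives as a `(text, score)` pair where a higher score means more relevant; `order_for_prompt` is a hypothetical helper, and the scores are invented for illustration.

```python
def order_for_prompt(scored_chunks: list[tuple[str, float]], top_k: int = 5) -> list[str]:
    """Keep the top_k highest-scoring chunks, best first, so the strongest
    evidence sits at the start of the context."""
    ranked = sorted(scored_chunks, key=lambda pair: pair[1], reverse=True)
    return [text for text, _score in ranked[:top_k]]


# Hypothetical retrieval results in arbitrary order.
results = [
    ("Exhibit C: SLA breach consequences", 0.41),
    ("14.4 Carve-outs", 0.77),
    ("14.3 Penalty Trigger Conditions", 0.92),
    ("14.1 Standard Penalties", 0.30),
]

for position, text in enumerate(order_for_prompt(results, top_k=3), start=1):
    print(f"[{position}] {text}")
```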

Q4: How do you detect whether a RAG failure is a retrieval error or a generation error?

First, log every retrieval: store the query, the top-k chunks returned, and the final answer. To diagnose, you need to check whether the correct answer was present in the retrieved chunks. If the correct answer was in the chunks but the model gave a wrong answer, that's a generation error - fix it by improving the prompt, adding more explicit instructions to "answer only from context," or reducing context length. If the correct answer was not in the chunks, that's a retrieval error - fix it by increasing k, improving chunking, adding hybrid BM25 retrieval, or fine-tuning the embedding model. You can automate this diagnosis using an LLM-as-judge: pass the question, retrieved context, and answer to Claude and ask it to classify the failure type. This is the foundation of RAG evaluation in production.
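When you already have a labeled test set, the same classification can be done without an LLM judge. The sketch below is a simplification under stated assumptions: `QueryLog` and `classify_failure` are hypothetical names, the expected fact is matched by a case-insensitive substring check (which misses paraphrases), and answer correctness is supplied by a human label or a separate grader.

```python
from dataclasses import dataclass


@dataclass
class QueryLog:
    """One logged RAG request: what was asked, what was retrieved, what was answered."""
    question: str
    retrieved_chunks: list[str]
    answer: str


def classify_failure(log: QueryLog, expected_fact: str, answer_is_correct: bool) -> str:
    """Classify a logged request as correct, a generation failure, or a retrieval failure."""
    if answer_is_correct:
        return "correct"
    needle = expected_fact.lower()
    in_context = any(needle in chunk.lower() for chunk in log.retrieved_chunks)
    # The fact was retrieved but the answer is wrong: the generator is at fault.
    return "generation_failure" if in_context else "retrieval_failure"


log = QueryLog(
    question="How quickly must UAT failure notice be given?",
    retrieved_chunks=["Written notice of UAT failure has been provided within 48 hours of each failed cycle"],
    answer="Notice must be given within 30 days.",
)
print(classify_failure(log, expected_fact="within 48 hours", answer_is_correct=False))
```

The substring check is cheap enough to run over every logged request; the LLM-as-judge approach is better suited to cases where the expected fact can be phrased many ways.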

Q5: What are the main components of a RAG pipeline and what does each do?

A RAG system has two pipelines: offline (indexing) and online (query). Offline: (1) Parser - converts raw files (PDF, DOCX, HTML) into plain text; (2) Chunker - splits the text into retrievable segments with appropriate size and overlap; (3) Enricher - adds metadata (source, page, section, date); (4) Embedder - converts each chunk into a dense vector using an embedding model; (5) Vector store - persists the vectors with their chunk text for fast similarity search. Online: (1) Query transformer - optionally rewrites or expands the query; (2) Retriever - embeds the query and finds the most similar chunks using ANN search; (3) Reranker - optionally re-scores the top-k chunks using a more accurate cross-encoder; (4) Context assembler - formats the retrieved chunks into a prompt with source labels; (5) Generator - passes the augmented prompt to the LLM and produces an answer with citations.

Q6: What is the difference between dense and sparse retrieval, and when would you use each?

Sparse retrieval (BM25, TF-IDF) represents documents as term-frequency vectors. It excels at lexical matching - finding documents that contain exact query terms. If the user queries for "ISO 27001 clause 6.2", BM25 will find documents containing that exact string with high precision. Dense retrieval uses neural embeddings to capture semantic similarity - "conditions that activate the fee clause" will retrieve text about "penalty trigger conditions" because they're semantically close in the embedding space. Dense retrieval fails on lexical precision; sparse retrieval fails on semantic recall. Production systems use hybrid retrieval: run both, merge results using Reciprocal Rank Fusion (RRF), then rerank. The typical recipe: BM25 for lexical precision on IDs, codes, names, and exact terms, plus dense retrieval for semantic understanding of concepts and paraphrases.
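Reciprocal Rank Fusion itself is only a few lines: each document scores the sum of 1 / (k + rank) over every ranking it appears in. The sketch below is a minimal version assuming each retriever returns an ordered list of chunk IDs; the IDs are hypothetical, and k = 60 is the constant commonly used with RRF.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked ID lists; each ID scores the sum of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical results for the same query from two retrievers.
bm25_ranking = ["clause-14.3", "clause-14.1", "exhibit-c"]
dense_ranking = ["clause-14.4", "clause-14.3", "exhibit-c"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.4f}")
```

Because RRF uses only ranks, it needs no score normalization between BM25 and cosine similarity. Chunks that both retrievers return rise to the top, and the fused list is what you would pass to the reranker.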

Q7: What are the main failure modes of RAG in production?

The RAG failure stack, from bottom to top: (1) Parsing failures - scanned PDFs produce garbled OCR text, tables are linearized incorrectly, code loses indentation; (2) Chunking failures - semantic units split across chunk boundaries, context lost at edges; (3) Embedding failures - out-of-domain embedding model fails to capture domain similarity; (4) Retrieval failures - wrong chunks returned because the query vocabulary doesn't match the document vocabulary; (5) Reranking failures - cross-encoder is skipped to save latency, reducing precision; (6) Generation failures - model ignores retrieved context and uses parametric memory; (7) Evaluation failures - offline benchmarks don't reflect production distribution, so failures are invisible until users complain. Good RAG engineering instruments every layer with separate metrics so you can identify which layer is failing.
