RAG Papers - Grounding Language Models in External Knowledge

Reading time: ~35 min | Interview relevance: Critical | Roles: AI Engineer, MLE, Data Scientist

The Real Interview Moment

You're in a system design interview at a company building enterprise AI products. The interviewer describes their challenge: "Our customers have millions of internal documents - policies, technical manuals, contracts. They want an AI assistant that can answer questions using their specific documents, not just general knowledge. Walk me through the RAG architecture, starting from the original paper. Then tell me what goes wrong with naive RAG and how recent papers like Self-RAG and CRAG address those problems. Finally, when would you recommend RAG versus fine-tuning?"

This is arguably one of the most practically important questions in AI engineering interviews today. RAG underpins many enterprise AI products: chatbots, search assistants, and knowledge management systems commonly use some form of it. The interviewer wants to know whether you understand both the academic foundations and the production engineering challenges.

If you can only cite one paper and describe one retrieval architecture, you're not ready for this interview. They want to see depth: the evolution from Dense Passage Retrieval to Self-RAG, the failure modes, and the engineering trade-offs.

What You Will Master

After reading this page, you will be able to:

  • Explain the original RAG architecture and its training objective
  • Describe Dense Passage Retrieval (DPR) and dual-encoder retrieval
  • Explain REALM and how retrieval can be trained end-to-end
  • Describe Self-RAG and its self-reflection mechanism
  • Explain CRAG and adaptive retrieval
  • Compare all RAG variants on architecture, training, and quality
  • Analyze when RAG is superior to fine-tuning and vice versa
  • Design production RAG systems with awareness of failure modes

Part 1 - The Problem RAG Solves

Why Language Models Need External Knowledge

Language models have fundamental limitations that retrieval addresses:

| Limitation | Description | RAG Solution |
| --- | --- | --- |
| Knowledge cutoff | Training data has a fixed date | Retrieve current information |
| Hallucination | Models confidently generate false information | Ground responses in retrieved evidence |
| Domain specificity | Models lack specialized knowledge | Retrieve domain-specific documents |
| Attribution | Models can't cite sources | Provide source documents with answers |
| Update cost | Retraining is expensive | Update the knowledge base instead |

60-Second Answer

"RAG combines a retriever with a generator. Instead of relying solely on the model's parametric knowledge, RAG retrieves relevant documents from an external knowledge base and conditions the generation on those documents. This grounds the model's responses in actual evidence, reduces hallucination, enables real-time knowledge updates without retraining, and provides citation capability. The original RAG paper showed this outperforms both pure retrieval and pure generation on knowledge-intensive tasks."

[Figure: RAG Overview: Retrieve then Generate]

Part 2 - Dense Passage Retrieval (DPR)

From Sparse to Dense Retrieval

Before DPR: Retrieval used sparse methods like BM25 (TF-IDF variant) - matching exact keywords.
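To make the sparse baseline concrete, the following is a minimal sketch of BM25 scoring in pure Python. The function name `bm25_scores`, the whitespace tokenizer, the parameter values `k1=1.5` and `b=0.75` (common defaults), and the three toy documents are all assumptions for illustration; production systems would use a search engine or a dedicated library rather than this loop.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25 (toy tokenizer)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: in how many documents each term appears
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # only exact keyword matches contribute
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = 1 - b + b * len(toks) / avg_len
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores

# Hypothetical mini-corpus
docs = [
    "reset your password from the account settings page",
    "error code E42 means the license server is unreachable",
    "how to recover a forgotten login credential",
]
scores = bm25_scores("error code E42", docs)
best = max(range(len(docs)), key=scores.__getitem__)
```

Only the second document shares any term with the query, so it is the only one with a non-zero score. A query such as "forgot my login" would score zero against the first document even though it is a close paraphrase, which is the gap dense retrieval targets.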

DPR insight (Karpukhin et al., 2020): Use two BERT encoders to embed queries and passages into the same dense vector space, then retrieve by maximum inner product search (MIPS).

Architecture:

$$\text{sim}(q, p) = E_Q(q)^T E_P(p)$$

where:

  • $E_Q$ is the query encoder (BERT)
  • $E_P$ is the passage encoder (BERT)
  • Both produce 768-dimensional vectors

Training objective - contrastive learning with in-batch negatives:

$$\mathcal{L} = -\log \frac{e^{\text{sim}(q_i, p_i^+)}}{e^{\text{sim}(q_i, p_i^+)} + \sum_{j=1}^{n} e^{\text{sim}(q_i, p_j^-)}}$$

where $p_i^+$ is the positive passage and $p_j^-$ are negative passages (other passages in the batch + hard negatives from BM25).

import torch
import torch.nn as nn
from transformers import BertModel


class DPREncoder(nn.Module):
    """Dual encoder for Dense Passage Retrieval."""

    def __init__(self, model_name="bert-base-uncased"):
        super().__init__()
        # Two separate BERT towers: one for queries, one for passages
        self.query_encoder = BertModel.from_pretrained(model_name)
        self.passage_encoder = BertModel.from_pretrained(model_name)

    def encode_query(self, input_ids, attention_mask):
        outputs = self.query_encoder(input_ids, attention_mask=attention_mask)
        return outputs.last_hidden_state[:, 0, :]  # [CLS] token

    def encode_passage(self, input_ids, attention_mask):
        outputs = self.passage_encoder(input_ids, attention_mask=attention_mask)
        return outputs.last_hidden_state[:, 0, :]  # [CLS] token

    def forward(self, q_ids, q_mask, p_ids, p_mask):
        q_emb = self.encode_query(q_ids, q_mask)    # (B, D)
        p_emb = self.encode_passage(p_ids, p_mask)  # (B, D)
        # scores[i, j] = inner product of query i with passage j
        scores = torch.matmul(q_emb, p_emb.T)       # (B, B)
        return scores


def dpr_loss(scores, temperature=1.0):
    """In-batch negative contrastive loss."""
    # Query i's positive passage is in column i; other columns act as negatives
    labels = torch.arange(scores.shape[0], device=scores.device)
    return nn.functional.cross_entropy(scores / temperature, labels)
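
As a numeric sanity check on the in-batch objective, here is a small pure-Python sketch that computes the same cross-entropy by hand. The function name `in_batch_contrastive_loss` and both 3x3 similarity matrices are made up for illustration; `scores[i][j]` stands for sim(q_i, p_j), with the positive passage on the diagonal.

```python
import math

def in_batch_contrastive_loss(scores):
    """Mean cross-entropy where row i's correct column is i (its positive passage)."""
    total = 0.0
    for i, row in enumerate(scores):
        # Numerically stable log-sum-exp over the positive and all in-batch negatives
        m = max(row)
        log_denominator = m + math.log(sum(math.exp(s - m) for s in row))
        total += log_denominator - row[i]
    return total / len(scores)

# Hypothetical similarity matrices
well_separated = [[9.0, 1.0, 0.5], [0.8, 8.5, 1.2], [0.3, 0.9, 9.2]]
uninformative = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]

loss_good = in_batch_contrastive_loss(well_separated)
loss_flat = in_batch_contrastive_loss(uninformative)
```

When every passage scores the same, the loss equals log(3), the cost of guessing uniformly among three passages. When the diagonal dominates, the loss approaches zero, which is what training pushes toward.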

DPR vs BM25

| Aspect | BM25 | DPR | Hybrid |
| --- | --- | --- | --- |
| Matching | Exact keyword | Semantic similarity | Both |
| Speed | Very fast (inverted index) | Fast (ANN search) | Moderate |
| Paraphrases / vocabulary mismatch | Fails on paraphrases | Handles semantic similarity | Best |
| Domain transfer | Works immediately | Requires domain-specific training | Moderate |
| Rare terms | Excellent (exact match) | May miss rare terms | Best |
| Best for | Keyword-heavy queries | Natural language questions | Production |

Common Trap

Many candidates dismiss BM25 as "old technology." In production, hybrid retrieval (BM25 + dense) consistently outperforms pure dense retrieval. BM25 excels at exact term matching (product IDs, error codes, names) where dense models struggle. Always mention hybrid approaches in system design interviews.
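
One common way to combine a BM25 ranking with a dense ranking is reciprocal rank fusion (RRF), which needs only the rank positions and not comparable scores. The sketch below assumes each retriever has already returned an ordered list of document IDs; the IDs are invented, and `k=60` is the conventional constant rather than a tuned value.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc IDs; each list contributes 1 / (k + rank) per doc."""
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism
    return sorted(fused, key=lambda d: (-fused[d], d))

# Hypothetical outputs of the two retrievers for the same query
bm25_ranking = ["doc_err42", "doc_license", "doc_install"]
dense_ranking = ["doc_license", "doc_network", "doc_err42"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

Documents that appear high in both lists rise to the top, while a document found by only one retriever is still kept, just ranked lower.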

Part 3 - The Original RAG Paper

RAG Architecture (Lewis et al., 2020)

The original RAG paper, from Facebook AI Research (now Meta AI), proposed two variants:

RAG-Sequence: Retrieve once and condition the entire output sequence on the same retrieved document, then marginalize over the top-k documents at the sequence level.

$$P_\text{RAG-Seq}(y|x) = \sum_{z \in \text{top-k}} P_\eta(z|x) \cdot P_\theta(y|x, z)$$

RAG-Token: Also retrieve once, but marginalize over the top-k documents at every generated token, so each token can draw on a different document (more flexible).

$$P_\text{RAG-Token}(y|x) = \prod_i \sum_{z \in \text{top-k}} P_\eta(z|x) \cdot P_\theta(y_i|x, z, y_{<i})$$

where:

  • $P_\eta(z|x)$ is the retrieval probability (DPR)
  • $P_\theta(y|x, z)$ is the generation probability (BART)
  • $z$ is a retrieved document
  • top-k is typically 5
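
A toy calculation can show how the two marginalizations differ. The sketch below uses two retrieved documents and a two-token target; all probabilities are invented, and `token_probs[z][i]` stands in for the generator's probability of target token i given document z (treated as fixed numbers here for simplicity).

```python
import math

def rag_sequence_prob(doc_probs, token_probs):
    """Sum over docs of P(z|x) times the product of token probabilities under that doc."""
    return sum(p_z * math.prod(tokens) for p_z, tokens in zip(doc_probs, token_probs))

def rag_token_prob(doc_probs, token_probs):
    """Product over tokens of the doc-marginalized probability of each token."""
    n_tokens = len(token_probs[0])
    return math.prod(
        sum(p_z * tokens[i] for p_z, tokens in zip(doc_probs, token_probs))
        for i in range(n_tokens)
    )

# Hypothetical numbers: doc A supports token 1, doc B supports token 2
doc_probs = [0.6, 0.4]                    # P(z|x) for docs A and B
token_probs = [[0.9, 0.1], [0.1, 0.9]]    # per-doc probability of each target token

p_seq = rag_sequence_prob(doc_probs, token_probs)
p_tok = rag_token_prob(doc_probs, token_probs)
```

Here `p_seq` is 0.09 and `p_tok` is 0.2436. Because no single document supports both tokens, RAG-Sequence assigns the answer a low probability, while RAG-Token can combine evidence from both documents.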

[Figure: RAG-Sequence Marginalization Over Documents]

Key Innovations

  1. End-to-end training: The retriever and generator are trained jointly. Gradients reach the retriever through the marginalization over documents; in practice only the query encoder is updated.
  2. Non-parametric memory: The knowledge base can be updated without retraining - just swap the document index.
  3. Interpretability: You can inspect which documents were retrieved, providing a form of attribution.

Training Details

  • Retriever: DPR (initialized from a pre-trained checkpoint; the passage encoder and its index stay frozen during RAG training, so only the query encoder is fine-tuned)
  • Generator: BART-large (400M parameters)
  • Knowledge base: Wikipedia (21M passages, 100-word chunks)
  • Index: FAISS for approximate nearest neighbor search

Part 4 - REALM: Pre-Training with Retrieval

REALM (Guu et al., 2020)

REALM (Retrieval-Augmented Language Model pre-training) integrates retrieval into the pre-training phase, not just fine-tuning.

Key idea: During masked language modeling, retrieve relevant documents to help predict masked tokens:

$$P(y|x) = \sum_{z \in \text{top-k}} P(z|x) \cdot P(y|x, z)$$

Critical difference from RAG: REALM trains the retriever and language model jointly during pre-training, rather than attaching a separately trained retriever at fine-tuning time. The retriever learns what information is useful for language modeling in general, not just for a specific downstream task.

The asynchronous index update problem: As the passage encoder is updated during training, the maximum inner product search (MIPS) index becomes stale. REALM periodically re-encodes all passages and rebuilds the index asynchronously (every few hundred steps).

REALM vs RAG

| Aspect | REALM | RAG |
| --- | --- | --- |
| When retrieval is added | Pre-training | Fine-tuning |
| Retriever training | Joint with the LM during pre-training | Initialize from DPR, partially frozen |
| Compute cost | Very high (re-index during pre-training) | Moderate |
| Generality | Better general knowledge retrieval | Better task-specific retrieval |
| Practical adoption | Low (expensive to pre-train) | High (easy to apply to any LLM) |

Part 5 - Self-RAG: Teaching Models When to Retrieve

The Problem with Naive RAG

Standard RAG always retrieves, even when retrieval is unnecessary or harmful:

| Scenario | Problem with Always-Retrieve |
| --- | --- |
| "What is 2+2?" | Retrieves irrelevant documents, confuses the model |
| "Write a poem about love" | Retrieved documents add noise to creative tasks |
| "Summarize this passage" | The passage IS the context; retrieval adds nothing |
| Contradictory retrieval | Retrieved documents conflict with each other |

Self-RAG (Asai et al., 2023)

Self-RAG trains the language model to generate special reflection tokens that control retrieval and evaluate its own outputs:

Reflection tokens:

  • [Retrieve] - Should I retrieve for this query? (Yes/No)
  • [IsRel] - Is the retrieved document relevant? (Relevant/Irrelevant)
  • [IsSup] - Does the retrieved document support my response? (Fully/Partially/Not supported)
  • [IsUse] - Is my overall response useful? (5/4/3/2/1)

[Figure: Self-RAG Reflection Tokens]

Self-RAG Training

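  1. Critic model: Prompt GPT-4 to label a seed set of examples with reflection tokens, then train a smaller critic model on those labels so that annotation becomes cheap.
  2. Distillation: Use the critic to insert reflection tokens into the training corpus offline, then train the target model with the standard next-token objective so it learns to both generate text AND output reflection tokens.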
  3. Inference-time control: Use the reflection tokens to decide whether to retrieve, filter irrelevant documents, and select the best response
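
The inference-time control step can be sketched as a filter followed by a weighted score. This is a simplification under stated assumptions: Self-RAG scores candidates using the probabilities of the reflection tokens, whereas the sketch uses hard labels, and the `Candidate` class, the weights, and the example texts are all hypothetical.

```python
from dataclasses import dataclass

# Numeric value assigned to each [IsSup] label (illustrative choice)
SUPPORT_VALUE = {"fully": 1.0, "partially": 0.5, "none": 0.0}

@dataclass
class Candidate:
    text: str
    is_relevant: bool   # [IsRel]
    support: str        # [IsSup]: "fully", "partially", or "none"
    usefulness: int     # [IsUse]: 1 to 5

def select_response(candidates, w_support=1.0, w_use=0.5):
    """Drop candidates built on irrelevant documents, then pick the best-scored one."""
    relevant = [c for c in candidates if c.is_relevant]
    if not relevant:
        return None  # caller could fall back to answering without retrieval

    def score(c):
        return w_support * SUPPORT_VALUE[c.support] + w_use * (c.usefulness / 5)

    return max(relevant, key=score)

candidates = [
    Candidate("Answer grounded in doc 1", True, "fully", 4),
    Candidate("Answer loosely based on doc 2", True, "partially", 5),
    Candidate("Answer based on an off-topic doc", False, "fully", 5),
]
best = select_response(candidates)
```

With these weights the fully supported answer wins (score 1.4 versus 1.0), and the third candidate never competes because its document was judged irrelevant.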

The result: Self-RAG outperforms both vanilla RAG and instruction-tuned LLMs without retrieval, because it retrieves only when helpful and critically evaluates retrieved content.

Part 6 - CRAG: Corrective Retrieval-Augmented Generation

The Problem CRAG Solves

Even when retrieval finds relevant documents, the model might:

  • Rely on a document that's partially correct
  • Miss a crucial piece of information across multiple documents
  • Generate content that contradicts the retrieved evidence

CRAG (Yan et al., 2024)

CRAG introduces a lightweight retrieval evaluator and corrective actions:

Step 1: Evaluate retrieval quality

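A lightweight trained evaluator scores how relevant each retrieved document is to the query. Comparing those scores against an upper and a lower threshold yields one of three judgments for the retrieval as a whole: Correct, Ambiguous, or Incorrect.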

Step 2: Take corrective action based on evaluation

| Evaluation | Action |
| --- | --- |
| Correct (confidence above the upper threshold) | Use the retrieved documents, after knowledge refinement |
| Ambiguous (mixed signals) | Combine the refined retrieved documents with web search results |
| Incorrect (confidence below the lower threshold) | Discard retrieval and fall back to web search |
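
The routing logic can be sketched as a pair of threshold checks over the evaluator's per-document scores. The threshold values, the function name `choose_action`, and the score lists are assumptions for illustration; a real evaluator would be a trained model and its thresholds would be tuned on validation data.

```python
def choose_action(doc_scores, upper=0.6, lower=0.2):
    """Map evaluator relevance scores for the retrieved docs to a corrective action."""
    if not doc_scores:
        return "incorrect"              # nothing retrieved: treat as failed retrieval
    if max(doc_scores) > upper:
        return "correct"                # at least one doc is confidently relevant
    if max(doc_scores) < lower:
        return "incorrect"              # every doc is confidently irrelevant
    return "ambiguous"                  # in between: hedge by combining sources

# Hypothetical evaluator outputs for three different queries
action_a = choose_action([0.9, 0.1])    # one strong hit
action_b = choose_action([0.05, 0.1])   # all weak
action_c = choose_action([0.4, 0.1])    # uncertain
```

Keeping the router separate from the generator makes the thresholds easy to tune without retraining anything.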

Step 3: Knowledge refinement

For retrieved documents judged as correct, apply knowledge decomposition: extract only the relevant sentences/facts rather than feeding the entire document to the generator.
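
Knowledge refinement can be sketched as split, score, filter, and rejoin. The sketch assumes sentence-level strips and uses word overlap as a stand-in for a learned relevance scorer; `overlap_score`, `refine`, the threshold, and the sample document are all hypothetical.

```python
import re

def overlap_score(query, strip):
    """Stand-in relevance scorer: fraction of query words present in the strip."""
    q_words = set(re.findall(r"\w+", query.lower()))
    s_words = set(re.findall(r"\w+", strip.lower()))
    return len(q_words & s_words) / len(q_words) if q_words else 0.0

def refine(query, document, threshold=0.3):
    """Split a document into sentence strips and keep only the relevant ones."""
    strips = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    kept = [s for s in strips if overlap_score(query, s) >= threshold]
    return " ".join(kept)

document = (
    "The warranty period for the X200 pump is 24 months. "
    "Our company was founded in 1987. "
    "Warranty claims for the pump require the original invoice."
)
refined = refine("X200 pump warranty period", document)
```

The sentence about the company's founding shares no words with the query and is dropped, so the generator sees only the two warranty sentences.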

[Figure: CRAG Corrective Retrieval Flow]

Company Variation
  • Google: Uses retrieval-augmented approaches in Gemini. Details are proprietary but include multi-hop retrieval.
  • OpenAI: ChatGPT with browsing uses a form of RAG. Their retrieval approach is undisclosed.
  • Anthropic: Claude uses retrieval capabilities with a focus on long-context processing over traditional RAG.
  • Perplexity: Their entire product is essentially production-grade RAG with web search.
  • Enterprise (LangChain, LlamaIndex): Many enterprise RAG stacks are built with frameworks such as LangChain or LlamaIndex, combining chunking, embedding, and vector stores.

Part 7 - RAG Evolution Summary

Comparison of All RAG Variants

| Method | Year | Retrieval Timing | Retrieval Training | Key Innovation |
| --- | --- | --- | --- | --- |
| DPR | 2020 | N/A (retriever only) | Contrastive learning | Dense retrieval with dual encoders |
| REALM | 2020 | During pre-training | End-to-end with LM | Joint retrieval + LM pre-training |
| RAG | 2020 | During fine-tuning/inference | DPR (partially frozen) | Marginalize over retrieved docs |
| FiD | 2021 | During inference | Frozen | Process each doc independently, fuse in decoder |
| RETRO | 2022 | During pre-training | Frozen | Chunked cross-attention with retrieval |
| Self-RAG | 2023 | Adaptive (model decides) | Distilled from GPT-4 | Reflection tokens for self-evaluation |
| CRAG | 2024 | Adaptive (evaluator decides) | Separate evaluator | Corrective actions based on retrieval quality |

[Figure: RAG Evolution Timeline: DPR to CRAG]

Part 8 - RAG vs Fine-Tuning

When to Use Each

This is one of the most commonly asked questions in AI engineering interviews.

[Figure: RAG vs Fine-Tuning Decision Tree]

Detailed Comparison

| Factor | RAG | Fine-Tuning | RAG + Fine-Tuning |
| --- | --- | --- | --- |
| New knowledge | Excellent (update index) | Limited (needs retraining) | Excellent |
| Knowledge freshness | Real-time updates | Stale after training | Real-time |
| Hallucination reduction | Strong (grounded in docs) | Moderate | Strongest |
| Citation/attribution | Natural (source docs) | Not possible | Natural |
| Behavioral changes | Limited | Excellent | Excellent |
| Latency | Higher (retrieval step) | Lower (no retrieval) | Higher (retrieval step) |
| Cost to update | Low (re-index docs) | High (retrain model) | Medium |
| Domain adaptation | Good (domain docs) | Better (learns domain patterns) | Best |
| Compute (inference) | Retrieval + generation | Generation only | Retrieval + generation |
| Data privacy | Documents stay in your infra | Training data exposure risk | Mixed |

Instant Rejection

Never say "RAG is always better than fine-tuning" or vice versa. The correct answer is always "it depends on the use case." Specifically: RAG excels at injecting factual knowledge and providing attribution. Fine-tuning excels at changing behavior, style, and output format. The best production systems often combine both: fine-tune for behavior + RAG for knowledge.

Part 9 - Production RAG: Failure Modes and Solutions

Common RAG Failures

| Failure Mode | Description | Solution |
| --- | --- | --- |
| Poor chunking | Documents split mid-sentence or mid-paragraph | Semantic chunking, overlap, respect document structure |
| Irrelevant retrieval | Top-k documents don't answer the question | Better embeddings, re-ranking, query expansion |
| Missing context | Answer requires information across multiple chunks | Multi-hop retrieval, larger chunks, parent-child chunking |
| Hallucination despite retrieval | Model ignores retrieved docs and generates from memory | Stronger prompting, constrained decoding, Self-RAG |
| Outdated embeddings | New documents not yet embedded | Streaming embedding pipeline, incremental indexing |
| Conflicting documents | Retrieved docs disagree with each other | Timestamp-based ranking, authority scoring |
| Query-document mismatch | User asks in different terms than the documents use | Query expansion, hypothetical document embedding (HyDE) |

Production Architecture

[Figure: Production RAG Pipeline Architecture]

Chunking Strategies

| Strategy | Description | Best For |
| --- | --- | --- |
| Fixed-size | Split every N tokens with overlap | Simple baseline, unstructured text |
| Sentence-level | Split on sentence boundaries | Q&A over short factual content |
| Paragraph-level | Split on paragraph boundaries | Longer-form reasoning |
| Semantic | Split where embedding similarity drops | Heterogeneous documents |
| Parent-child | Small chunks for retrieval, return parent chunk for context | Best of both worlds |
| Document-aware | Respect headers, sections, tables | Structured documents (manuals, reports) |
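
Two of these strategies are simple enough to sketch directly: fixed-size chunking with overlap, and a parent-child index built on top of it. The function names, the whitespace tokenization, and the sample text are assumptions for illustration; real pipelines would count model tokens rather than words.

```python
def fixed_size_chunks(tokens, size, overlap):
    """Split a token list into windows of `size` tokens that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks

def parent_child_index(parents, child_size, overlap=0):
    """Pair each small child chunk with the index of the parent it came from."""
    index = []
    for parent_id, parent in enumerate(parents):
        for child in fixed_size_chunks(parent.split(), child_size, overlap):
            index.append((" ".join(child), parent_id))
    return index

tokens = "a b c d e f g h i j".split()
chunks = fixed_size_chunks(tokens, size=4, overlap=1)

# Hypothetical parent passages
parents = [
    "termination requires ninety days written notice",
    "fees are invoiced monthly in arrears",
]
index = parent_child_index(parents, child_size=3)
```

At query time the small child chunks are what gets embedded and searched, and the stored `parent_id` is used to hand the full parent passage to the generator.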

Re-Ranking: The Secret Weapon

Two-stage retrieval with re-ranking typically improves the precision of the final top-k, at the cost of some extra latency:

  1. Stage 1 (Retriever): Fast, approximate - retrieve top 100 candidates with bi-encoder
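  2. Stage 2 (Re-ranker): Slow, precise - re-rank top 100 with cross-encoder to get top 5

from sentence_transformers import CrossEncoder, SentenceTransformer, util

# Toy inputs so the snippet runs end to end (illustrative, not a real corpus)
query = "How do I reset my password?"
docs = ["Reset your password from the account settings page.",
        "Error code E42 means the license server is unreachable."]

# Stage 1: Bi-encoder retrieval (fast); the model name is one common choice,
# and an in-memory search stands in for a vector store here
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
doc_embeddings = bi_encoder.encode(docs, convert_to_tensor=True)
query_embedding = bi_encoder.encode(query, convert_to_tensor=True)
hits = util.semantic_search(query_embedding, doc_embeddings, top_k=100)[0]
candidates = [docs[hit["corpus_id"]] for hit in hits]

# Stage 2: Cross-encoder re-ranking (precise)
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")
pairs = [(query, doc) for doc in candidates]
scores = reranker.predict(pairs)

# Sort by re-ranker score, take top 5
reranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)[:5]

60-Second Answer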

"Production RAG has three critical components beyond the basic retrieve-and-generate pattern: First, hybrid retrieval combining dense and sparse (BM25) search with reciprocal rank fusion. Second, cross-encoder re-ranking to filter the top 100 candidates down to the top 5 with much higher precision. Third, intelligent chunking that respects document structure and uses parent-child relationships. These three improvements take RAG from 'demo quality' to 'production quality'."

Part 10 - Practice Problems

Problem 1: Chunking Strategy Design

You're building a RAG system for a legal firm. Documents include contracts (structured, with sections and clauses), case law (long-form narrative), and email correspondence (short, informal). Design a chunking strategy for each document type.

Hint 1 - Direction

Each document type has different structure. Think about what constitutes a "complete thought" in each type and what context is needed to understand a chunk.

Full Answer + Rubric

Contracts:

  • Strategy: Document-aware chunking by clause/section. Each clause is a natural unit.
  • Implementation: Parse section numbers and headers. Keep each clause as one chunk. Add the contract title and section path as metadata (e.g., "Section 3.2: Termination Conditions").
  • Overlap: Include the section header in every sub-chunk for context.
  • Parent-child: Use the full section as parent, individual clauses as children.
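
The contract strategy above can be sketched as a small parser. It assumes numbered headings such as "3.2 Termination Conditions" sit on their own line and that body lines never start with a section number; the regex, the field names, and the sample contract are hypothetical.

```python
import re

# Matches lines like "3 Term" or "3.2 Termination Conditions"
SECTION_RE = re.compile(r"^(\d+(?:\.\d+)*)\.?\s+(.+)$")

def chunk_contract(text, title):
    """Return one chunk per numbered clause, with its heading and section path as metadata."""
    chunks, current = [], None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = SECTION_RE.match(line)
        if match:
            section, heading = match.group(1), match.group(2)
            current = {
                "contract": title,
                "section": section,
                "parent": section.split(".")[0],  # top-level section acts as the parent
                "heading": heading,
                "body": [],
            }
            chunks.append(current)
        elif current is not None:
            current["body"].append(line)
    for chunk in chunks:
        # Prefix the heading so each chunk carries its own context
        body = " ".join(chunk.pop("body"))
        chunk["text"] = f"Section {chunk['section']}: {chunk['heading']}\n{body}"
    return chunks

sample = """
3 Term and Termination
3.1 Term
This agreement runs for two years.
3.2 Termination Conditions
Either party may terminate with ninety days written notice.
"""
chunks = chunk_contract(sample, title="Master Services Agreement")
```

Each clause becomes its own chunk, and the `parent` field lets the retriever return the whole of Section 3 when a single clause matches.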

Case Law:

  • Strategy: Paragraph-level chunking with semantic boundaries.
  • Implementation: Split on paragraph boundaries. Use semantic similarity to merge short paragraphs and split long ones. Include case citation (party names, court, year) as metadata in every chunk.
  • Special handling: The "holding" paragraph (the court's decision) should be tagged specially, as it's the most important part.

Email Correspondence:

  • Strategy: Full-email as a single chunk (emails are typically short).
  • Implementation: Keep entire email as one chunk. Extract metadata: sender, recipients, date, subject line, thread ID. For long email threads, split by individual messages.
  • Threading: Link emails in the same thread for multi-turn context retrieval.

Cross-cutting concerns:

  • Maintain a metadata layer: document type, date, parties involved, confidentiality level
  • Use different embedding models or fine-tune for legal domain (e.g., legal-BERT)
  • Access control: filter retrieval by user permissions
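
Permission filtering is easiest to reason about when it runs before ranking, so restricted text can never reach the prompt. The sketch below assumes each chunk carries an `allowed_groups` metadata field and uses word overlap as a placeholder ranker; the field name, group names, and documents are all invented.

```python
def allowed_chunks(chunks, user_groups):
    """Keep only chunks whose allowed groups intersect the user's groups."""
    groups = set(user_groups)
    return [c for c in chunks if groups & set(c["allowed_groups"])]

def retrieve(query, chunks, user_groups, top_k=3):
    """Filter by permission first, then rank the remaining chunks (toy ranker)."""
    q_words = set(query.lower().split())
    candidates = allowed_chunks(chunks, user_groups)
    ranked = sorted(
        candidates,
        key=lambda c: len(q_words & set(c["text"].lower().split())),
        reverse=True,
    )
    return ranked[:top_k]

# Hypothetical chunk store with access metadata
chunks = [
    {"text": "merger term sheet draft for project falcon", "allowed_groups": ["partners"]},
    {"text": "standard nda template for vendors", "allowed_groups": ["partners", "associates"]},
]
associate_results = retrieve("project falcon merger", chunks, user_groups=["associates"])
partner_results = retrieve("project falcon merger", chunks, user_groups=["partners"])
```

The associate never sees the term sheet even though it is the best match for the query, while a partner gets it ranked first.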

Scoring:

  • Strong Hire: Different strategies per document type, considers metadata, mentions parent-child chunking, addresses access control
  • Lean Hire: Reasonable chunking but same strategy for all document types
  • No Hire: Uses fixed 512-token chunks for everything

Problem 2: RAG vs Fine-Tuning Decision

Your company has a medical knowledge base of 50,000 clinical guidelines (updated quarterly) and wants to build an AI assistant for doctors. Compare RAG, fine-tuning, and RAG + fine-tuning approaches for this use case.

Hint 1 - Direction

Medical applications have unique requirements: accuracy is critical (patient safety), guidelines change, regulatory compliance requires attribution, and the model needs to handle medical terminology.

Full Answer + Rubric

RAG only:

  • Pros: Citations (critical for medical liability), quarterly updates are easy (re-index), no training compute.
  • Cons: May not understand medical terminology well. Chunking medical documents is hard (tables, dosage charts, cross-references).
  • Risk: Retrieval might miss relevant guidelines if query terms don't match medical terminology.

Fine-tuning only:

  • Pros: Better medical language understanding. Can learn clinical reasoning patterns.
  • Cons: No citations. Can't easily update when guidelines change. Risk of hallucinating medical information (catastrophic in healthcare).
  • Risk: A fine-tuned model that confidently hallucinates a drug dosage is dangerous.

RAG + Fine-tuning (recommended):

  • Fine-tune on medical QA datasets to improve medical language understanding and response format (e.g., always structure as: diagnosis → evidence → recommendation → sources).
  • RAG for real-time knowledge retrieval with citations.
  • Use Self-RAG-style reflection to verify that responses are supported by retrieved guidelines.
  • Re-ranking with a medical domain-specific cross-encoder.

Additional considerations:

  • Regulatory: FDA may require citation trails. RAG provides this naturally.
  • Liability: If the model gives wrong advice, citations help determine if the error was in retrieval or generation.
  • Evaluation: Use medical domain experts for eval, not just automated metrics.

Scoring:

  • Strong Hire: Recommends RAG + FT with specific medical considerations (safety, citations, regulatory), addresses hallucination risk
  • Lean Hire: Correctly identifies RAG as better for this case but doesn't address medical-specific concerns
  • No Hire: Recommends fine-tuning only without addressing hallucination risk in medical contexts

Problem 3: Multi-Hop RAG

A user asks: "How does the company's parental leave policy compare to industry standards?" This requires: (1) retrieving the company's parental leave policy, (2) retrieving industry benchmarks, (3) synthesizing a comparison. Design a multi-hop retrieval system for this.

Hint 1 - Direction

Single-query retrieval won't work here because "parental leave policy" and "industry standards" are different information needs. Think about query decomposition.

Full Answer + Rubric

Multi-hop RAG Architecture:

  1. Query decomposition: Use an LLM to break the query into sub-queries:

    • Sub-query 1: "Company parental leave policy details"
    • Sub-query 2: "Industry standard parental leave benefits 2024"
  2. Parallel retrieval: Execute both sub-queries against appropriate knowledge bases:

    • Sub-query 1 → internal document store (HR policies)
    • Sub-query 2 → external knowledge base or web search
  3. Relevance filtering: Re-rank and filter results for each sub-query independently.

  4. Context assembly: Combine the most relevant chunks from both retrievals into a structured context:

    [COMPANY POLICY]:
    {retrieved company policy chunks}

    [INDUSTRY BENCHMARKS]:
    {retrieved industry data chunks}
  5. Synthesis prompt: Ask the LLM to generate a comparison table, citing specific sections from each source.

  6. Verification: Optionally, use a Self-RAG-style check to verify that the comparison accurately reflects both sources.
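
The first four steps can be sketched end to end with stand-ins for the model calls. Everything here is hypothetical: `decompose` returns hard-coded sub-queries where a real system would call an LLM, `keyword_retrieve` is a toy ranker, the two stores are tiny in-memory lists, and the retrievals run sequentially rather than in parallel for simplicity.

```python
import re

def decompose(query):
    """Placeholder for an LLM call that splits a question into (sub_query, source) pairs."""
    return [
        ("Company parental leave policy details", "internal"),
        ("Industry standard parental leave benefits", "external"),
    ]

def keyword_retrieve(sub_query, corpus, top_k=1):
    """Toy retriever: rank passages by the number of words shared with the sub-query."""
    q_words = set(re.findall(r"\w+", sub_query.lower()))
    ranked = sorted(
        corpus,
        key=lambda p: len(q_words & set(re.findall(r"\w+", p.lower()))),
        reverse=True,
    )
    return ranked[:top_k]

def build_context(query, stores):
    """Retrieve per sub-query from the matching store and assemble a labeled context."""
    labels = {"internal": "[COMPANY POLICY]", "external": "[INDUSTRY BENCHMARKS]"}
    sections = []
    for sub_query, source in decompose(query):
        passages = keyword_retrieve(sub_query, stores[source])
        sections.append(labels[source] + ":\n" + "\n".join(passages))
    return "\n\n".join(sections)

# Hypothetical knowledge bases
stores = {
    "internal": [
        "Employees receive 16 weeks of paid parental leave.",
        "The office closes at 6 pm.",
    ],
    "external": [
        "Industry surveys report a median of 12 weeks of paid parental leave.",
        "Cloud spending grew last year.",
    ],
}
context = build_context(
    "How does the company's parental leave policy compare to industry standards?", stores
)
```

The labeled sections keep the two sources separate in the prompt, which makes it easier for the synthesis step to cite each side of the comparison.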

Scoring:

  • Strong Hire: Describes query decomposition, parallel retrieval from different sources, structured context assembly, and verification
  • Lean Hire: Recognizes the need for multiple retrievals but doesn't describe how to decompose and synthesize
  • No Hire: Tries to answer with a single retrieval query

Interview Cheat Sheet

| Question Pattern | Framework | Key Phrases |
| --- | --- | --- |
| "Explain RAG" | Retriever + Generator → retrieve relevant docs → condition generation → ground in evidence | "RAG retrieves relevant documents from an external knowledge base and conditions the language model's generation on those documents, grounding responses in actual evidence" |
| "DPR vs BM25?" | Sparse (keyword) vs dense (semantic) → DPR for paraphrases, BM25 for exact terms → hybrid is best | "DPR handles semantic similarity but misses exact terms. BM25 handles exact matches but misses paraphrases. Hybrid with reciprocal rank fusion is production standard." |
| "RAG vs fine-tuning?" | RAG for knowledge, FT for behavior → combine for best results → consider update frequency | "RAG for dynamic knowledge with citations. Fine-tuning for behavioral changes. Most production systems combine both." |
| "What is Self-RAG?" | Reflection tokens → model decides when to retrieve → evaluates retrieval quality → self-corrects | "Self-RAG trains the model to output reflection tokens that decide whether to retrieve, evaluate document relevance, and verify response support." |
| "RAG failure modes?" | Poor chunking → irrelevant retrieval → hallucination despite retrieval → query mismatch | "The three biggest failure modes are: bad chunking losing context, irrelevant top-k results, and the model ignoring retrieved docs." |
| "Production RAG?" | Hybrid retrieval → re-ranking → smart chunking → parent-child → metadata filtering | "Production RAG needs hybrid retrieval, cross-encoder re-ranking, document-aware chunking, and metadata filtering. This takes it from demo to production quality." |

Spaced Repetition Checkpoints

  • Day 0: Read this page. Draw the RAG pipeline. Explain DPR's training objective.
  • Day 3: Compare RAG-Sequence and RAG-Token. Explain Self-RAG's reflection tokens without looking.
  • Day 7: Design a production RAG system with hybrid retrieval and re-ranking. Include a chunking strategy.
  • Day 14: Argue RAG vs fine-tuning for three different use cases (medical, legal, customer support).
  • Day 21: Solve all three practice problems from memory. Time yourself - 8-10 minutes each.

© 2026 EngineersOfAI. All rights reserved.