RAG Papers - Grounding Language Models in External Knowledge

Reading time: ~35 min | Interview relevance: Critical | Roles: AI Engineer, MLE, Data Scientist

The Real Interview Moment

You're in a system design interview at a company building enterprise AI products. The interviewer describes their challenge: "Our customers have millions of internal documents - policies, technical manuals, contracts. They want an AI assistant that can answer questions using their specific documents, not just general knowledge. Walk me through the RAG architecture, starting from the original paper. Then tell me what goes wrong with naive RAG and how recent papers like Self-RAG and CRAG address those problems. Finally, when would you recommend RAG versus fine-tuning?"

This is possibly the most practically important question in AI engineering interviews today. RAG is the backbone of enterprise AI products: most chatbots, search assistants, and knowledge management systems use some form of it. The interviewer wants to know if you understand both the academic foundations and the production engineering challenges.

If you can only cite one paper and describe one retrieval architecture, you're not ready for this interview. They want to see depth: the evolution from Dense Passage Retrieval to Self-RAG, the failure modes, and the engineering trade-offs.

What You Will Master

After reading this page, you will be able to:

Explain the original RAG architecture and its training objective
Describe Dense Passage Retrieval (DPR) and dual-encoder retrieval
Explain REALM and how retrieval can be trained end-to-end
Describe Self-RAG and its self-reflection mechanism
Explain CRAG and adaptive retrieval
Compare all RAG variants on architecture, training, and quality
Analyze when RAG is superior to fine-tuning and vice versa
Design production RAG systems with awareness of failure modes

Part 1 - The Problem RAG Solves

Why Language Models Need External Knowledge

Language models have fundamental limitations that retrieval addresses:

Limitation	Description	RAG Solution
Knowledge cutoff	Training data has a fixed date	Retrieve current information
Hallucination	Models confidently generate false information	Ground responses in retrieved evidence
Domain specificity	Models lack specialized knowledge	Retrieve domain-specific documents
Attribution	Models can't cite sources	Provide source documents with answers
Update cost	Retraining is expensive	Update the knowledge base instead

60-Second Answer

"RAG combines a retriever with a generator. Instead of relying solely on the model's parametric knowledge, RAG retrieves relevant documents from an external knowledge base and conditions the generation on those documents. This grounds the model's responses in actual evidence, reduces hallucination, enables real-time knowledge updates without retraining, and provides citation capability. The original RAG paper showed this outperforms both pure retrieval and pure generation on knowledge-intensive tasks."

[Figure: RAG Overview: Retrieve then Generate]
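
To make the retrieve-then-generate pattern concrete, here is a minimal sketch. It assumes a toy bag-of-words similarity in place of a real encoder and stops at building the prompt, since the generator call depends on the model you use. The function names (`embed`, `retrieve`, `build_prompt`) and the sample corpus are illustrative, not from any paper.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' (a stand-in for a real encoder)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Return the k passages most similar to the query."""
    q = embed(query)
    ranked = sorted(corpus, key=lambda doc: cosine(q, embed(doc)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, passages: list[str]) -> str:
    """Condition the generator on numbered sources so it can cite them."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer using only the sources below and cite them by number.\n\n"
        f"Sources:\n{context}\n\nQuestion: {query}\nAnswer:"
    )


corpus = [
    "Employees accrue 20 days of paid vacation per year.",
    "The VPN client must be updated every quarter.",
    "Unused vacation days expire at the end of March.",
]
query = "How many vacation days do employees get?"
passages = retrieve(query, corpus, k=2)
prompt = build_prompt(query, passages)  # pass this to the generator of your choice
```

Numbering the sources in the prompt is what makes citation possible: the generator can refer to `[1]` or `[2]`, and the application can map those back to documents.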

Part 2 - Dense Passage Retrieval (DPR)

From Sparse to Dense Retrieval

Before DPR: Retrieval relied on sparse methods such as BM25 (a TF-IDF-style ranking function), which match exact keywords.

DPR insight (Karpukhin et al., 2020): Use two BERT encoders to embed queries and passages into the same dense vector space, then retrieve by maximum inner product search (MIPS).

Architecture:

$\text{sim}(q, p) = E_Q(q)^T E_P(p)$

where:

$E_Q$ is the query encoder (BERT)
$E_P$ is the passage encoder (BERT)
Both produce 768-dimensional vectors

Training objective - contrastive learning with in-batch negatives:

$\mathcal{L} = -\log \frac{e^{\text{sim}(q_i, p_i^+)}}{e^{\text{sim}(q_i, p_i^+)} + \sum_{j=1}^{n} e^{\text{sim}(q_i, p_j^-)}}$

where $p_i^+$ is the positive passage and $p_j^-$ are negative passages (other passages in the batch + hard negatives from BM25).

import torch
import torch.nn as nn
from transformers import BertModel

class DPREncoder(nn.Module):
    """Dual encoder for Dense Passage Retrieval."""

    def __init__(self, model_name="bert-base-uncased"):
        super().__init__()
        self.query_encoder = BertModel.from_pretrained(model_name)
        self.passage_encoder = BertModel.from_pretrained(model_name)

    def encode_query(self, input_ids, attention_mask):
        outputs = self.query_encoder(input_ids, attention_mask=attention_mask)
        return outputs.last_hidden_state[:, 0, :]  # [CLS] token

    def encode_passage(self, input_ids, attention_mask):
        outputs = self.passage_encoder(input_ids, attention_mask=attention_mask)
        return outputs.last_hidden_state[:, 0, :]  # [CLS] token

    def forward(self, q_ids, q_mask, p_ids, p_mask):
        q_emb = self.encode_query(q_ids, q_mask)       # (B, D)
        p_emb = self.encode_passage(p_ids, p_mask)     # (B, D)
        scores = torch.matmul(q_emb, p_emb.T)          # (B, B)
        return scores

def dpr_loss(scores, temperature=1.0):
    """In-batch negative contrastive loss."""
    labels = torch.arange(scores.shape[0], device=scores.device)
    return nn.functional.cross_entropy(scores / temperature, labels)
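
Once queries and passages live in the same vector space, retrieval is a maximum inner product search. The sketch below does this by brute force with NumPy on random vectors, purely to illustrate the operation; production systems typically replace the exhaustive scan with an approximate nearest neighbor index. The function name and the synthetic data are assumptions.

```python
import numpy as np


def top_k_inner_product(query_vec, passage_matrix, k):
    """Exact (brute-force) maximum inner product search."""
    scores = passage_matrix @ query_vec      # (N,) one score per passage
    top = np.argsort(-scores)[:k]            # indices of the k largest scores
    return top, scores[top]


rng = np.random.default_rng(0)
# Synthetic stand-ins for 1,000 passage embeddings of dimension 768.
passage_matrix = rng.normal(size=(1000, 768)).astype(np.float32)
# A query that is a slightly perturbed copy of passage 42.
query_vec = passage_matrix[42] + 0.1 * rng.normal(size=768).astype(np.float32)

ids, scores = top_k_inner_product(query_vec, passage_matrix, k=5)
```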

DPR vs BM25

Aspect	BM25	DPR	Hybrid
Matching	Exact keyword	Semantic similarity	Both
Speed	Very fast (inverted index)	Fast (ANN search)	Moderate
Vocabulary mismatch (paraphrases)	Fails when the wording differs	Handles paraphrases via semantic similarity	Best
Domain transfer	Works immediately	Requires domain-specific training	Moderate
Rare terms	Excellent (exact match)	May miss rare terms	Best
Best for	Keyword-heavy queries	Natural language questions	Production
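
To make the BM25 column concrete, here is a small pure-Python scorer. It is a sketch, assuming whitespace tokenization, the commonly used defaults `k1=1.5` and `b=0.75`, and the always-positive IDF variant `log(1 + (N - df + 0.5) / (df + 0.5))`. Real systems use an inverted index rather than scoring every document.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with BM25."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for d in tokenized for term in set(d))

    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # exact match only: unseen terms contribute nothing
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


docs = [
    "error code E1234 indicates a failed disk",
    "the disk could not be read because of a hardware fault",
    "restart the service after updating the configuration",
]
scores = bm25_scores("E1234 disk", docs)
```

The rare token `E1234` gets a high IDF weight, which is why BM25 is strong on identifiers such as error codes.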

Common Trap

Many candidates dismiss BM25 as "old technology." In production, hybrid retrieval (BM25 + dense) often outperforms pure dense retrieval. BM25 excels at exact term matching (product IDs, error codes, names) where dense models struggle. Always mention hybrid approaches in system design interviews.
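
One common way to combine a BM25 ranking with a dense ranking is reciprocal rank fusion (RRF), which this page names later as a production standard. The sketch below assumes each retriever returns an ordered list of document IDs; `k=60` is the constant commonly used with RRF, and the document IDs are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists: each doc scores sum(1 / (k + rank))."""
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(fused, key=fused.get, reverse=True)


bm25_ranking = ["doc_err_code", "doc_manual", "doc_faq"]
dense_ranking = ["doc_manual", "doc_howto", "doc_err_code"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
# fused == ["doc_manual", "doc_err_code", "doc_howto", "doc_faq"]
```

RRF only uses ranks, not raw scores, so it sidesteps the problem that BM25 scores and inner products live on different scales.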

Part 3 - The Original RAG Paper

RAG Architecture (Lewis et al., 2020)

The original RAG paper from Facebook AI Research (now Meta AI) proposed two variants:

RAG-Sequence: Retrieve once, use the same documents for the entire generation.

$P_\text{RAG-Seq}(y|x) = \sum_{z \in \text{top-k}} P_\eta(z|x) \cdot P_\theta(y|x, z)$

RAG-Token: Also retrieve once per input, but marginalize over the top-k documents at every generated token, so each token can draw on a different document (more flexible but more expensive).

$P_\text{RAG-Token}(y|x) = \prod_i \sum_{z \in \text{top-k}} P_\eta(z|x) \cdot P_\theta(y_i|x, z, y_{<i})$

where:

$P_\eta(z|x)$ is the retrieval probability (DPR)
$P_\theta(y|x, z)$ is the generation probability (BART)
$z$ is a retrieved document
$k$ (the number of retrieved documents) is typically 5 or 10
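
A small numerical example shows how the two formulas differ. The probabilities below are invented for illustration: two retrieved documents and a three-token target sequence.

```python
import numpy as np

# Assumed toy values, not from the paper.
p_doc = np.array([0.6, 0.4])          # P_eta(z | x) for each of 2 documents
# p_tok[z, i] = P_theta(y_i | x, z, y_<i) for 3 target tokens
p_tok = np.array([
    [0.9, 0.2, 0.8],
    [0.3, 0.7, 0.5],
])

# RAG-Sequence: score the whole sequence under each document, then marginalize.
p_seq_per_doc = p_tok.prod(axis=1)                 # [0.144, 0.105]
p_rag_sequence = float(p_doc @ p_seq_per_doc)      # 0.6*0.144 + 0.4*0.105 = 0.1284

# RAG-Token: marginalize over documents at every token, then multiply.
p_tok_marginal = p_doc @ p_tok                     # [0.66, 0.40, 0.68]
p_rag_token = float(p_tok_marginal.prod())         # 0.66*0.40*0.68 = 0.17952
```

In this example RAG-Token assigns higher probability because token 2 is best explained by the second document while tokens 1 and 3 are best explained by the first. RAG-Sequence cannot mix documents within one sequence.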

[Figure: RAG-Sequence Marginalization Over Documents]

Key Innovations

End-to-end training: The retriever and generator are trained jointly. Gradients reach the retriever's query encoder through the marginalization over documents, while the passage encoder and its index stay fixed.
Non-parametric memory: The knowledge base can be updated without retraining - just swap the document index.
Interpretability: You can inspect which documents were retrieved, providing a form of attribution.

Training Details

Retriever: DPR (initialized from pre-trained, passage encoder frozen during RAG training)
Generator: BART-large (400M parameters)
Knowledge base: Wikipedia (21M passages, 100-word chunks)
Index: FAISS for approximate nearest neighbor search

Part 4 - REALM: Pre-Training with Retrieval

REALM (Guu et al., 2020)

REALM (Retrieval-Augmented Language Model pre-training) integrates retrieval into the pre-training phase, not just fine-tuning.

Key idea: During masked language modeling, retrieve relevant documents to help predict masked tokens:

$P(y|x) = \sum_{z \in \text{top-k}} P(z|x) \cdot P(y|x, z)$

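Critical difference from RAG: REALM trains the retriever and language model jointly during pre-training, warm-starting the retriever with the unsupervised Inverse Cloze Task rather than a supervised retriever such as DPR. The retriever learns what information is useful for language modeling in general, not just for a specific downstream task.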

The stale index problem: As the passage encoder is updated during training, the precomputed passage index used for maximum inner product search becomes stale. REALM re-encodes all passages and rebuilds the index asynchronously, every few hundred training steps.
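
The refresh schedule can be sketched as follows. This toy version rebuilds synchronously inside the training loop, whereas REALM runs the re-embedding in the background. `RefreshingIndex`, `ToyEncoder`, and `refresh_every=500` are illustrative names and values, not taken from the paper.

```python
import numpy as np


class ToyEncoder:
    """Stand-in for a passage encoder whose parameters drift during training."""

    def __init__(self):
        self.scale = 1.0

    def __call__(self, passage):
        return self.scale * np.asarray(passage, dtype=float)


class RefreshingIndex:
    """Passage index that is rebuilt every `refresh_every` training steps."""

    def __init__(self, passages, encode_fn, refresh_every):
        self.passages = passages
        self.encode_fn = encode_fn
        self.refresh_every = refresh_every
        self.rebuilds = 0
        self._rebuild()

    def _rebuild(self):
        # Re-encode every passage with the current encoder parameters.
        self.vectors = np.stack([self.encode_fn(p) for p in self.passages])
        self.rebuilds += 1

    def maybe_refresh(self, step):
        if step > 0 and step % self.refresh_every == 0:
            self._rebuild()

    def search(self, query_vec, k):
        return np.argsort(-(self.vectors @ query_vec))[:k]


encoder = ToyEncoder()
index = RefreshingIndex([[1, 0], [0, 1]], encoder, refresh_every=500)
for step in range(1, 1501):
    encoder.scale += 0.001      # stand-in for a gradient update
    index.maybe_refresh(step)   # between refreshes the index lags the encoder
```

Between refreshes the index reflects older encoder parameters, which is the staleness the training procedure has to tolerate.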

REALM vs RAG

Aspect	REALM	RAG
When retrieval is added	Pre-training	Fine-tuning
Retriever training	Joint with the LM during pre-training	Initialize from DPR, partially frozen
Compute cost	Very high (re-index during pre-training)	Moderate
Generality	Better general knowledge retrieval	Better task-specific retrieval
Practical adoption	Low (expensive to pre-train)	High (easy to apply to any LLM)

Part 5 - Self-RAG: Teaching Models When to Retrieve

The Problem with Naive RAG

Standard RAG always retrieves, even when retrieval is unnecessary or harmful:

Scenario	Problem with Always-Retrieve
"What is 2+2?"	Retrieves irrelevant documents, confuses the model
"Write a poem about love"	Retrieved documents add noise to creative tasks
"Summarize this passage"	The passage IS the context; retrieval adds nothing
Contradictory retrieval	Retrieved documents conflict with each other

Self-RAG (Asai et al., 2023)

Self-RAG trains the language model to generate special reflection tokens that control retrieval and evaluate its own outputs:

Reflection tokens:

[Retrieve] - Should I retrieve for this query? (Yes/No)
[IsRel] - Is the retrieved document relevant? (Relevant/Irrelevant)
[IsSup] - Does the retrieved document support my response? (Fully/Partially/Not supported)
[IsUse] - Is my overall response useful? (5/4/3/2/1)

[Figure: Self-RAG Reflection Tokens]

Self-RAG Training

Critic model: Prompt GPT-4 to label a seed dataset with reflection tokens, then train a smaller critic model on those labels
Distillation: Use the critic to annotate the training corpus offline, then train the target model on it to both generate text AND output reflection tokens
Inference-time control: Use the reflection tokens to decide whether to retrieve, filter irrelevant documents, and select the best response

The result: In the paper's experiments, Self-RAG outperforms both vanilla RAG and instruction-tuned LLMs without retrieval on a range of tasks, because it retrieves only when helpful and critically evaluates retrieved content.
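
Inference-time control can be sketched as a scoring rule over candidate segments, one candidate per retrieved passage. The sketch loosely follows the idea of adding a weighted sum of critique-token probabilities to the generation score. The weights, the field names, the use of log-probability, and the 0.5 retrieval threshold are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    log_prob: float      # generator log-probability of the segment
    p_relevant: float    # probability mass on [IsRel] = Relevant
    p_supported: float   # probability mass on [IsSup] = Fully supported
    p_useful: float      # normalized expected [IsUse] rating in [0, 1]


def should_retrieve(p_retrieve_yes: float, threshold: float = 0.5) -> bool:
    """Retrieve only when the model puts enough mass on [Retrieve] = Yes."""
    return p_retrieve_yes >= threshold


def segment_score(c: Candidate, w_rel=1.0, w_sup=1.0, w_use=0.5) -> float:
    """Generation score plus a weighted sum of critique scores."""
    critique = w_rel * c.p_relevant + w_sup * c.p_supported + w_use * c.p_useful
    return c.log_prob + critique


def select_best(candidates):
    return max(candidates, key=segment_score)


candidates = [
    Candidate("Grounded answer citing the policy.", -1.0, 0.9, 0.9, 0.8),
    Candidate("Fluent answer with weak support.", -0.5, 0.3, 0.1, 0.6),
]
best = select_best(candidates)   # scores: 1.2 vs 0.2, so the grounded one wins
```

Raising `w_sup` trades fluency for stricter grounding, which is the kind of knob this design exposes at inference time without retraining.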

Part 6 - CRAG: Corrective Retrieval-Augmented Generation

The Problem CRAG Solves

Even when retrieval finds relevant documents, the model might:

Rely on a document that's partially correct
Miss a crucial piece of information across multiple documents
Generate content that contradicts the retrieved evidence

CRAG (Yan et al., 2024)

CRAG introduces a lightweight retrieval evaluator and corrective actions:

Step 1: Evaluate retrieval quality

A trained evaluator scores each retrieved document for relevance to the query. The scores then trigger one of three outcomes: Correct, Ambiguous, or Incorrect.

Step 2: Take corrective action based on evaluation

Evaluation	Action
Correct (confidence > threshold)	Keep the retrieved documents and pass them to knowledge refinement
Ambiguous (mixed signals)	Combine the refined retrieved documents with web search results
Incorrect (low confidence)	Discard the retrieved documents and fall back to web search

Step 3: Knowledge refinement

For retrieved documents judged as correct, apply knowledge decomposition: extract only the relevant sentences/facts rather than feeding the entire document to the generator.
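
The three steps can be sketched in a few functions. This is a simplified illustration: the thresholds (`upper=0.7`, `lower=0.3`, `min_score=0.5`), the word-overlap scorer standing in for the trained evaluator, and all function names are assumptions.

```python
def choose_action(doc_scores, upper=0.7, lower=0.3):
    """Map evaluator scores for the retrieved documents to a CRAG-style action."""
    best = max(doc_scores)
    if best > upper:
        return "correct"      # at least one document is clearly relevant
    if best < lower:
        return "incorrect"    # every document looks irrelevant
    return "ambiguous"


def overlap_score(query, strip):
    """Toy relevance score: fraction of query words found in the strip."""
    q, s = set(query.lower().split()), set(strip.lower().split())
    return len(q & s) / len(q) if q else 0.0


def refine(document, query, score_fn=overlap_score, min_score=0.5):
    """Split a document into sentence strips and keep only the relevant ones."""
    strips = [s.strip() for s in document.split(".") if s.strip()]
    return [s for s in strips if score_fn(query, s) >= min_score]


def build_knowledge(action, refined_strips, web_results):
    """Pick the knowledge passed to the generator based on the action."""
    if action == "correct":
        return refined_strips
    if action == "incorrect":
        return web_results
    return refined_strips + web_results   # ambiguous: use both


query = "parental leave weeks"
document = (
    "Parental leave is 16 weeks. The cafeteria opens at 8am. "
    "Leave requests go through HR."
)
kept = refine(document, query)            # ["Parental leave is 16 weeks"]
action = choose_action([0.9, 0.2])        # "correct"
knowledge = build_knowledge(action, kept, web_results=["(web snippet)"])
```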

[Figure: CRAG Corrective Retrieval Flow]

Company Variation

Google: Uses retrieval-augmented approaches in Gemini. Details are proprietary but include multi-hop retrieval.
OpenAI: ChatGPT with browsing uses a form of RAG. Their retrieval approach is undisclosed.
Anthropic: Claude uses retrieval capabilities with a focus on long-context processing over traditional RAG.
Perplexity: Their entire product is essentially production-grade RAG with web search.
Enterprise (LangChain, LlamaIndex): Many enterprise RAG stacks are built with frameworks such as LangChain or LlamaIndex, combining chunking, embedding, and vector stores.

Part 7 - RAG Evolution Summary

Comparison of All RAG Variants

Method	Year	Retrieval Timing	Retrieval Training	Key Innovation
DPR	2020	N/A (retriever only)	Contrastive learning	Dense retrieval with dual encoders
REALM	2020	During pre-training	End-to-end with LM	Joint retrieval + LM pre-training
RAG	2020	During fine-tuning/inference	DPR (partially frozen)	Marginalize over retrieved docs
FiD	2021	During inference	Frozen	Process each doc independently, fuse in decoder
RETRO	2022	During pre-training	Frozen	Chunked cross-attention with retrieval
Self-RAG	2023	Adaptive (model decides)	Distilled from GPT-4	Reflection tokens for self-evaluation
CRAG	2024	Adaptive (evaluator decides)	Separate evaluator	Corrective actions based on retrieval quality

[Figure: RAG Evolution Timeline: DPR to CRAG]

Part 8 - RAG vs Fine-Tuning

When to Use Each

This is one of the most commonly asked questions in AI engineering interviews.

[Figure: RAG vs Fine-Tuning Decision Tree]

Detailed Comparison

Factor	RAG	Fine-Tuning	RAG + Fine-Tuning
New knowledge	Excellent (update index)	Limited (needs retraining)	Excellent
Knowledge freshness	Real-time updates	Stale after training	Real-time
Hallucination reduction	Strong (grounded in docs)	Moderate	Strongest
Citation/attribution	Natural (source docs)	Unreliable (no source documents to point to)	Natural
Behavioral changes	Limited	Excellent	Excellent
Latency	Higher (retrieval step)	Lower (no retrieval)	Higher (retrieval step)
Cost to update	Low (re-index docs)	High (retrain model)	Medium
Domain adaptation	Good (domain docs)	Better (learns domain patterns)	Best
Compute (inference)	Retrieval + generation	Generation only	Retrieval + generation
Data privacy	Documents stay in your infra	Training data exposure risk	Mixed

Instant Rejection

Never say "RAG is always better than fine-tuning" or vice versa. The correct answer is always "it depends on the use case." Specifically: RAG excels at injecting factual knowledge and providing attribution. Fine-tuning excels at changing behavior, style, and output format. The best production systems often combine both: fine-tune for behavior + RAG for knowledge.

Part 9 - Production RAG: Failure Modes and Solutions

Common RAG Failures

Failure Mode	Description	Solution
Poor chunking	Documents split mid-sentence or mid-paragraph	Semantic chunking, overlap, respect document structure
Irrelevant retrieval	Top-k documents don't answer the question	Better embeddings, re-ranking, query expansion
Missing context	Answer requires information across multiple chunks	Multi-hop retrieval, larger chunks, parent-child chunking
Hallucination despite retrieval	Model ignores retrieved docs and generates from memory	Stronger prompting, constrained decoding, Self-RAG
Outdated embeddings	New documents not yet embedded	Streaming embedding pipeline, incremental indexing
Conflicting documents	Retrieved docs disagree with each other	Timestamp-based ranking, authority scoring
Query-document mismatch	User asks in different terms than the documents use	Query expansion, hypothetical document embedding (HyDE)

Production Architecture

[Figure: Production RAG Pipeline Architecture]

Chunking Strategies

Strategy	Description	Best For
Fixed-size	Split every N tokens with overlap	Simple baseline, unstructured text
Sentence-level	Split on sentence boundaries	Q&A over short factual content
Paragraph-level	Split on paragraph boundaries	Longer-form reasoning
Semantic	Split where embedding similarity drops	Heterogeneous documents
Parent-child	Small chunks for retrieval, return parent chunk for context	Best of both worlds
Document-aware	Respect headers, sections, tables	Structured documents (manuals, reports)
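
Two of the strategies in the table are easy to sketch: fixed-size chunking with overlap, and parent-child chunking built on top of it. The sketch assumes whitespace tokens rather than model tokens, and the function names and sizes are illustrative.

```python
def fixed_size_chunks(tokens, size, overlap):
    """Split tokens into windows of `size` that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    if not tokens:
        return []
    step = size - overlap
    # Stop once the remaining tail is already covered by the previous window.
    last_start = max(len(tokens) - overlap, 1)
    return [tokens[i:i + size] for i in range(0, last_start, step)]


def parent_child_chunks(paragraphs, child_size, overlap):
    """Index small child chunks, each pointing back to its parent paragraph."""
    index = []
    for parent_id, paragraph in enumerate(paragraphs):
        for child in fixed_size_chunks(paragraph.split(), child_size, overlap):
            index.append({"child": " ".join(child), "parent_id": parent_id})
    return index


chunks = fixed_size_chunks(list(range(10)), size=4, overlap=1)
# [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]

paragraphs = ["a b c d e", "f g"]
index = parent_child_chunks(paragraphs, child_size=3, overlap=1)
# Retrieve on index[i]["child"], then hand paragraphs[index[i]["parent_id"]]
# to the generator so it sees the full surrounding context.
```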

Re-Ranking: The Secret Weapon

Two-stage retrieval with re-ranking typically improves precision in exchange for a modest latency cost:

Stage 1 (Retriever): Fast, approximate - retrieve top 100 candidates with bi-encoder
Stage 2 (Re-ranker): Slow, precise - re-rank top 100 with cross-encoder to get top 5

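from sentence_transformers import CrossEncoder, SentenceTransformer, util

# Toy corpus and query so the snippet runs end to end (illustrative only).
docs = [
    "Parental leave is 16 weeks for all full-time employees.",
    "The VPN client must be updated every quarter.",
    "Leave requests are submitted through the HR portal.",
]
query = "How long is parental leave?"

# Stage 1: Bi-encoder retrieval (fast). In production this is an ANN
# search against a vector store; here it is an in-memory cosine search.
bi_encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
doc_embeddings = bi_encoder.encode(docs, convert_to_tensor=True)
query_embedding = bi_encoder.encode(query, convert_to_tensor=True)
hits = util.semantic_search(query_embedding, doc_embeddings, top_k=100)[0]
candidates = [docs[hit["corpus_id"]] for hit in hits]

# Stage 2: Cross-encoder re-ranking (precise)
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")
pairs = [(query, doc) for doc in candidates]
scores = reranker.predict(pairs)

# Sort by re-ranker score, take top 5
reranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)[:5]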

60-Second Answer

"Production RAG has three critical components beyond the basic retrieve-and-generate pattern: First, hybrid retrieval combining dense and sparse (BM25) search with reciprocal rank fusion. Second, cross-encoder re-ranking to filter the top 100 candidates down to the top 5 with much higher precision. Third, intelligent chunking that respects document structure and uses parent-child relationships. These three improvements take RAG from 'demo quality' to 'production quality'."

Part 10 - Practice Problems

Problem 1: Chunking Strategy Design

You're building a RAG system for a legal firm. Documents include contracts (structured, with sections and clauses), case law (long-form narrative), and email correspondence (short, informal). Design a chunking strategy for each document type.

Hint 1 - Direction

Each document type has different structure. Think about what constitutes a "complete thought" in each type and what context is needed to understand a chunk.

Full Answer + Rubric

Contracts:

Strategy: Document-aware chunking by clause/section. Each clause is a natural unit.
Implementation: Parse section numbers and headers. Keep each clause as one chunk. Add the contract title and section path as metadata (e.g., "Section 3.2: Termination Conditions").
Overlap: Include the section header in every sub-chunk for context.
Parent-child: Use the full section as parent, individual clauses as children.
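
A minimal sketch of clause-level chunking for contracts follows. It assumes clause headers look like `3.2 Termination Conditions` at the start of a line, which real contracts will not always follow; the regex, field names, and sample text are hypothetical.

```python
import re

# Assumed header format: dotted section number, whitespace, heading text.
CLAUSE_HEADER = re.compile(r"^(\d+(?:\.\d+)*)\s+(.+)$")


def chunk_contract(title, text):
    """Split a contract into one chunk per clause, with section metadata."""
    chunks, current = [], None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        match = CLAUSE_HEADER.match(line)
        if match:
            current = {
                "contract": title,
                "section": match.group(1),
                "heading": match.group(2),
                "body": [],
            }
            chunks.append(current)
        elif current is not None:
            current["body"].append(line)

    for chunk in chunks:
        # Top-level section number links each clause to its parent section.
        chunk["parent_section"] = chunk["section"].split(".")[0]
        body = " ".join(chunk.pop("body"))
        # Prefix the section path so every chunk carries its own context.
        chunk["text"] = (
            f'{chunk["contract"]} > Section {chunk["section"]}: '
            f'{chunk["heading"]}\n{body}'
        )
    return chunks


sample = """
3 Termination
3.1 Termination for Convenience
Either party may terminate with thirty days written notice.
3.2 Termination Conditions
The agreement ends immediately upon material breach.
"""
chunks = chunk_contract("MSA", sample)
```

A body line that happens to start with a number would be misread as a header under this regex, so a production parser needs a stricter rule or a layout-aware extractor.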

Case Law:

Strategy: Paragraph-level chunking with semantic boundaries.
Implementation: Split on paragraph boundaries. Use semantic similarity to merge short paragraphs and split long ones. Include case citation (party names, court, year) as metadata in every chunk.
Special handling: The "holding" paragraph (the court's decision) should be tagged specially, as it's the most important part.

Email Correspondence:

Strategy: Full-email as a single chunk (emails are typically short).
Implementation: Keep entire email as one chunk. Extract metadata: sender, recipients, date, subject line, thread ID. For long email threads, split by individual messages.
Threading: Link emails in the same thread for multi-turn context retrieval.

Cross-cutting concerns:

Maintain a metadata layer: document type, date, parties involved, confidentiality level
Use different embedding models or fine-tune for legal domain (e.g., legal-BERT)
Access control: filter retrieval by user permissions

Scoring:

Strong Hire: Different strategies per document type, considers metadata, mentions parent-child chunking, addresses access control
Lean Hire: Reasonable chunking but same strategy for all document types
No Hire: Uses fixed 512-token chunks for everything

Problem 2: RAG vs Fine-Tuning Decision

Your company has a medical knowledge base of 50,000 clinical guidelines (updated quarterly) and wants to build an AI assistant for doctors. Compare RAG, fine-tuning, and RAG + fine-tuning approaches for this use case.

Hint 1 - Direction

Medical applications have unique requirements: accuracy is critical (patient safety), guidelines change, regulatory compliance requires attribution, and the model needs to handle medical terminology.

Full Answer + Rubric

RAG only:

Pros: Citations (critical for medical liability), quarterly updates are easy (re-index), no training compute.
Cons: May not understand medical terminology well. Chunking medical documents is hard (tables, dosage charts, cross-references).
Risk: Retrieval might miss relevant guidelines if query terms don't match medical terminology.

Fine-tuning only:

Pros: Better medical language understanding. Can learn clinical reasoning patterns.
Cons: No citations. Can't easily update when guidelines change. Risk of hallucinating medical information (catastrophic in healthcare).
Risk: A fine-tuned model that confidently hallucinates a drug dosage is dangerous.

RAG + Fine-tuning (recommended):

Fine-tune on medical QA datasets to improve medical language understanding and response format (e.g., always structure as: diagnosis → evidence → recommendation → sources).
RAG for real-time knowledge retrieval with citations.
Use Self-RAG-style reflection to verify that responses are supported by retrieved guidelines.
Re-ranking with a medical domain-specific cross-encoder.

Additional considerations:

Regulatory: FDA may require citation trails. RAG provides this naturally.
Liability: If the model gives wrong advice, citations help determine if the error was in retrieval or generation.
Evaluation: Use medical domain experts for eval, not just automated metrics.

Scoring:

Strong Hire: Recommends RAG + FT with specific medical considerations (safety, citations, regulatory), addresses hallucination risk
Lean Hire: Correctly identifies RAG as better for this case but doesn't address medical-specific concerns
No Hire: Recommends fine-tuning only without addressing hallucination risk in medical contexts

Problem 3: Multi-Hop RAG

A user asks: "How does the company's parental leave policy compare to industry standards?" This requires: (1) retrieving the company's parental leave policy, (2) retrieving industry benchmarks, (3) synthesizing a comparison. Design a multi-hop retrieval system for this.

Hint 1 - Direction

Single-query retrieval won't work here because "parental leave policy" and "industry standards" are different information needs. Think about query decomposition.

Full Answer + Rubric

Multi-hop RAG Architecture:

Query decomposition: Use an LLM to break the query into sub-queries:
- Sub-query 1: "Company parental leave policy details"
- Sub-query 2: "Industry standard parental leave benefits 2024"
Parallel retrieval: Execute both sub-queries against appropriate knowledge bases:
- Sub-query 1 → internal document store (HR policies)
- Sub-query 2 → external knowledge base or web search
Relevance filtering: Re-rank and filter results for each sub-query independently.

Context assembly: Combine the most relevant chunks from both retrievals into a structured context:

[COMPANY POLICY]:
{retrieved company policy chunks}

[INDUSTRY BENCHMARKS]:
{retrieved industry data chunks}

Synthesis prompt: Ask the LLM to generate a comparison table, citing specific sections from each source.
Verification: Optionally, use a Self-RAG-style check to verify that the comparison accurately reflects both sources.
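
The decomposition, parallel retrieval, and context assembly steps can be sketched as below. `decompose` is a hard-coded stand-in for an LLM call, `keyword_search` is a toy retriever, and the two in-memory stores and their contents are invented for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def decompose(query):
    """Hypothetical stand-in for an LLM that splits a query into sub-queries."""
    return [
        {"label": "COMPANY POLICY", "source": "internal",
         "query": "Company parental leave policy details"},
        {"label": "INDUSTRY BENCHMARKS", "source": "external",
         "query": "Industry standard parental leave benefits"},
    ]


def keyword_search(store, query, k=1):
    """Toy retriever: rank documents by number of shared words."""
    terms = set(query.lower().split())
    ranked = sorted(
        store,
        key=lambda doc: len(terms & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]


STORES = {
    "internal": [
        "Our parental leave policy grants 16 weeks of paid leave.",
        "Expense reports are due monthly.",
    ],
    "external": [
        "Industry surveys report a median of 12 weeks of paid parental leave.",
        "Cloud spending grew last year.",
    ],
}


def multi_hop_context(query):
    sub_queries = decompose(query)
    # Run each sub-query against its own knowledge base in parallel.
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(
            lambda sq: keyword_search(STORES[sq["source"]], sq["query"]),
            sub_queries,
        ))
    # Assemble a labeled context so the generator can attribute each claim.
    sections = [
        f'[{sq["label"]}]:\n' + "\n".join(chunks)
        for sq, chunks in zip(sub_queries, results)
    ]
    return "\n\n".join(sections)


context = multi_hop_context(
    "How does the company's parental leave policy compare to industry standards?"
)
```

Keeping the two sources under separate labels lets the synthesis prompt ask for a side-by-side comparison and makes it easy to check which source each statement came from.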

Scoring:

Strong Hire: Describes query decomposition, parallel retrieval from different sources, structured context assembly, and verification
Lean Hire: Recognizes the need for multiple retrievals but doesn't describe how to decompose and synthesize
No Hire: Tries to answer with a single retrieval query

Interview Cheat Sheet

Question Pattern	Framework	Key Phrases
"Explain RAG"	Retriever + Generator → retrieve relevant docs → condition generation → ground in evidence	"RAG retrieves relevant documents from an external knowledge base and conditions the language model's generation on those documents, grounding responses in actual evidence"
"DPR vs BM25?"	Sparse (keyword) vs dense (semantic) → DPR for paraphrases, BM25 for exact terms → hybrid is best	"DPR handles semantic similarity but misses exact terms. BM25 handles exact matches but misses paraphrases. Hybrid with reciprocal rank fusion is production standard."
"RAG vs fine-tuning?"	RAG for knowledge, FT for behavior → combine for best results → consider update frequency	"RAG for dynamic knowledge with citations. Fine-tuning for behavioral changes. Most production systems combine both."
"What is Self-RAG?"	Reflection tokens → model decides when to retrieve → evaluates retrieval quality → self-corrects	"Self-RAG trains the model to output reflection tokens that decide whether to retrieve, evaluate document relevance, and verify response support."
"RAG failure modes?"	Poor chunking → irrelevant retrieval → hallucination despite retrieval → query mismatch	"The three biggest failure modes are: bad chunking losing context, irrelevant top-k results, and the model ignoring retrieved docs."
"Production RAG?"	Hybrid retrieval → re-ranking → smart chunking → parent-child → metadata filtering	"Production RAG needs hybrid retrieval, cross-encoder re-ranking, document-aware chunking, and metadata filtering. This takes it from demo to production quality."

Spaced Repetition Checkpoints

Day 0: Read this page. Draw the RAG pipeline. Explain DPR's training objective.
Day 3: Compare RAG-Sequence and RAG-Token. Explain Self-RAG's reflection tokens without looking.
Day 7: Design a production RAG system with hybrid retrieval and re-ranking. Include a chunking strategy.
Day 14: Argue RAG vs fine-tuning for three different use cases (medical, legal, customer support).
Day 21: Solve all three practice problems from memory. Time yourself - 8-10 minutes each.

Next Steps

Continue to Scaling Laws for understanding compute-optimal training
Review LoRA and PEFT for when you choose fine-tuning over RAG
For system design with RAG, see ML System Design
