Advanced RAG Patterns

When Naive RAG Reaches Its Ceiling

The baseline RAG system was working - 73% of queries got correct answers based on human evaluation. The team was proud of it. Then their product manager walked over with a list of queries that users had actually submitted that the system was failing on.

"Compare the termination provisions in our MSA with the SLA agreement."

"What changed in our refund policy between version 2.1 and version 3.0?"

"If a customer qualifies for both the premium discount and the enterprise tier, what's the total reduction?"

Single-pass retrieval, which had worked fine for direct factual lookups, completely failed on these queries. They required comparing two documents, reasoning about versions, and combining information from multiple sources. The naive RAG pattern - embed query, retrieve chunks, generate answer - was designed for a fundamentally simpler problem.

Advanced RAG patterns exist because queries in the real world are not simple lookups. They require multi-step reasoning, comparisons, temporal awareness, and sometimes the ability to recognize when current retrieval isn't sufficient and to try again. This lesson covers the patterns that transform RAG from a Q&A system into a genuine knowledge reasoning system.

Why Naive RAG Fails at Complex Queries

Naive RAG has a single retrieval step with the raw user query. This fails when:

- The query is ambiguous or poorly phrased. Users often ask questions differently than documents are written. "What's the deal with their pricing?" doesn't retrieve well against structured pricing documentation.
- The answer requires multiple documents. "Compare X and Y" requires retrieving passages about both X and Y, but a single query embedding points toward one centroid in embedding space, not two.
- The query is too abstract. "What's our risk exposure?" is too high-level to retrieve specific relevant clauses. The embedding is diffuse.
- The relevant passage uses different vocabulary. A contract says "termination for convenience" while the user asks "can we cancel the contract early?"
- The answer requires synthesis across many documents. "What are the common themes in our customer complaints?" can't be answered from 5 retrieved chunks.

Advanced patterns address each of these failure modes.
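
To make the "one centroid, not two" failure concrete, here is a toy sketch. The 2-D vectors are hand-picked stand-ins for real embeddings (an assumption for illustration only): axis 0 loosely means "MSA termination" and axis 1 "SLA termination". A comparison query sits between the two topics, so a generic chunk that touches both can outrank each document you actually need, while a focused sub-query ranks its target first.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Hypothetical toy "embeddings" (not produced by any real model)
chunks = {
    "msa_clause": [1.0, 0.0],        # specific MSA termination clause
    "sla_clause": [0.0, 1.0],        # specific SLA termination clause
    "generic_overview": [0.8, 0.6],  # mentions both topics superficially
}

compare_query = [1.0, 1.0]  # "Compare the MSA and the SLA"
msa_subquery = [1.0, 0.1]   # "What does the MSA say about termination?"

def rank(query_vec):
    """Return chunk names sorted by similarity to the query, best first."""
    return sorted(chunks, key=lambda name: cosine(query_vec, chunks[name]), reverse=True)

compare_ranking = rank(compare_query)
subquery_ranking = rank(msa_subquery)
print(compare_ranking)
print(subquery_ranking)
```

With these toy numbers the comparison query ranks the generic overview above both specific clauses, which is the behavior that multi-query retrieval and sub-question decomposition are meant to counter.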

Pattern 1: HyDE (Hypothetical Document Embeddings)

Invented by: Gao et al. (2022), "Precise Zero-Shot Dense Retrieval without Relevance Labels"

The insight: Instead of embedding the user's question (which may be brief and poorly phrased for retrieval), ask an LLM to write a hypothetical document that would answer the question. Embed that hypothetical document. Use it for retrieval.

Why it works: The hypothetical document uses the vocabulary, structure, and phrasing style of the actual documents in your corpus. Its embedding is closer to the target documents in embedding space than the raw question's embedding.

from openai import OpenAI

client = OpenAI()

def hyde_retrieve(
    query: str,
    vector_store,
    top_k: int = 5,
    hypothetical_model: str = "gpt-4o-mini",
) -> list:
    """
    HyDE: Generate a hypothetical document, embed it, use for retrieval.
    """
    # Step 1: Generate a hypothetical document
    response = client.chat.completions.create(
        model=hypothetical_model,
        messages=[
            {
                "role": "system",
                "content": (
                    "Write a concise passage that would directly answer the following question. "
                    "Write it as if it were an excerpt from a professional document. "
                    "Do not reference that it is hypothetical."
                )
            },
            {
                "role": "user",
                "content": f"Question: {query}"
            }
        ],
        temperature=0,
        max_tokens=300,
    )
    hypothetical_doc = response.choices[0].message.content

    # Step 2: Embed the hypothetical document (not the original query)
    emb_response = client.embeddings.create(
        model="text-embedding-3-small",
        input=hypothetical_doc,
    )
    hyde_vector = emb_response.data[0].embedding

    # Step 3: Retrieve using the hypothetical document's vector
    results = vector_store.search_by_vector(hyde_vector, top_k=top_k)

    return results


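# Compare HyDE vs standard retrieval
# (assumes `vector_store` is an already-populated store that exposes
# search(text, top_k=...) and search_by_vector(vector, top_k=...))
query = "can we cancel the contract early"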

# Standard retrieval
standard_results = vector_store.search(query, top_k=5)

# HyDE retrieval
hyde_results = hyde_retrieve(query, vector_store, top_k=5)

print("Standard top result:", standard_results[0]["text"][:100])
print("HyDE top result:", hyde_results[0]["text"][:100])

When HyDE works well: Queries that use informal language against formal documents (legal, technical, academic), and queries where the user doesn't know the correct terminology.

When HyDE underperforms: When the LLM's hypothetical document is confidently wrong about domain specifics, it embeds toward the wrong part of the corpus. This is most likely in highly specialized domains where the LLM lacks sufficient domain knowledge.
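
The vocabulary-mismatch effect can be illustrated without any API calls. The sketch below swaps dense embeddings for a plain bag-of-words cosine, which is a deliberate simplification and not how HyDE itself measures similarity. The clause and the hypothetical passage are the example texts from this lesson's interview section, lightly adapted; `bow_cosine` is a helper name introduced here.

```python
import math
import re
from collections import Counter

def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity over lowercase word counts (lexical stand-in for embeddings)."""
    va = Counter(re.findall(r"[a-z0-9]+", a.lower()))
    vb = Counter(re.findall(r"[a-z0-9]+", b.lower()))
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

corpus_clause = "Either party may terminate this agreement with 30 days written notice"
user_query = "can we cancel the contract early"
# What an LLM might plausibly write as a hypothetical answer passage
hypothetical_doc = "Either party may terminate this agreement early by providing written notice"

query_sim = bow_cosine(user_query, corpus_clause)
hyde_sim = bow_cosine(hypothetical_doc, corpus_clause)
print(f"query vs clause:        {query_sim:.3f}")
print(f"hypothetical vs clause: {hyde_sim:.3f}")
```

Real embedding models capture some synonymy, so the gap is usually smaller than in this lexical toy, but the direction of the effect is the same.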

Pattern 2: Query Expansion and Multi-Query Retrieval

The problem: A single query embedding represents one point in embedding space. Complex queries may have multiple facets, each pulling toward different parts of the corpus.

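The solution: Generate multiple paraphrases or related queries, retrieve for each, and combine the ranked results with Reciprocal Rank Fusion (RRF), which scores each document by summing 1 / (k + rank) over every ranked list it appears in.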

from typing import List
import json

def generate_query_variants(query: str, n_variants: int = 4) -> List[str]:
    """Generate alternative phrasings of the same query for multi-query retrieval."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": (
                    f"Generate {n_variants} different phrasings of the user's question. "
                    "Each phrasing should express the same information need but use different vocabulary. "
                    "Include the original query. "
                    "Respond with JSON: {\"queries\": [\"query1\", \"query2\", ...]}"
                )
            },
            {
                "role": "user",
                "content": query,
            }
        ],
        temperature=0.4,
        response_format={"type": "json_object"},
    )
    result = json.loads(response.choices[0].message.content)
    queries = result.get("queries", [query])
    # Always include original query
    if query not in queries:
        queries.insert(0, query)
    return queries[:n_variants + 1]


def multi_query_retrieve(
    query: str,
    vector_store,
    top_k_per_query: int = 10,
    final_k: int = 5,
) -> List[dict]:
    """Retrieve with multiple query variants and fuse with RRF."""
    from collections import defaultdict

    # Generate query variants
    variants = generate_query_variants(query, n_variants=3)
    print(f"Query variants: {variants}")

    # Retrieve for each variant
    all_ranked_lists = []
    doc_registry = {}  # id -> document for deduplication

    for variant in variants:
        results = vector_store.search(variant, top_k=top_k_per_query)
        ranked_ids = []
        for doc in results:
            doc_id = doc["id"]
            doc_registry[doc_id] = doc
            ranked_ids.append(doc_id)
        all_ranked_lists.append(ranked_ids)

    # RRF fusion
    rrf_scores = defaultdict(float)
    k_rrf = 60
    for ranked_list in all_ranked_lists:
        for rank, doc_id in enumerate(ranked_list):
            rrf_scores[doc_id] += 1.0 / (k_rrf + rank + 1)

    # Sort by RRF score
    sorted_docs = sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)
    return [doc_registry[doc_id] for doc_id, _ in sorted_docs[:final_k]]
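
The fusion step is easier to reason about in isolation. Below is a minimal, dependency-free sketch of RRF using the same constant (k = 60) as the code above, with 1-based ranks; `rrf_fuse` and the toy document ids are names introduced here for illustration.

```python
from collections import defaultdict

def rrf_fuse(ranked_lists, k=60):
    """Fuse ranked lists of doc ids: score(d) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Three query variants, each returning its own ranking of toy doc ids
ranked_lists = [
    ["a", "b", "c"],
    ["b", "a", "d"],
    ["b", "c", "a"],
]
fused = rrf_fuse(ranked_lists)
fused_ids = [doc_id for doc_id, _ in fused]
print(fused_ids)
```

Document "b" wins because it ranks first in two lists, even though "a" also appears in all three. RRF uses only ranks, so it needs no score normalization across query variants.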

Pattern 3: Step-Back Prompting

Invented by: Zheng et al. (2023), "Take a Step Back: Evoking Reasoning via Abstraction in Large Language Models"

The insight: Before retrieving for a specific question, first retrieve for a more abstract, general version of the question. Then use both to answer the original.

Why it works: Specific questions sometimes can't be answered from specific context without understanding the broader principle. Stepping back retrieves foundational information that enables better reasoning.

def step_back_retrieve(
    query: str,
    vector_store,
    top_k: int = 5,
) -> dict:
    """
    Step-back prompting: retrieve at both specific and abstract level.
    """
    # Generate abstract version of the query
    abstract_response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": (
                    "Rephrase the question into a more general, abstract version "
                    "that would help retrieve background principles or concepts. "
                    "Respond with only the abstract question, no explanation."
                )
            },
            {"role": "user", "content": query}
        ],
        temperature=0,
    )
    abstract_query = abstract_response.choices[0].message.content

    # Retrieve for both specific and abstract queries
    specific_results = vector_store.search(query, top_k=top_k)
    abstract_results = vector_store.search(abstract_query, top_k=top_k)

    return {
        "specific_context": specific_results,
        "abstract_context": abstract_results,
        "abstract_query": abstract_query,
    }


def generate_with_step_back(query: str, step_back_results: dict) -> str:
    """Generate answer using both specific and abstract context."""
    specific_ctx = "\n\n".join([r["text"] for r in step_back_results["specific_context"][:3]])
    abstract_ctx = "\n\n".join([r["text"] for r in step_back_results["abstract_context"][:2]])

    return client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "Answer the question using the provided context. "
                    "Background principles are provided to help with reasoning. "
                    "Cite your sources."
                )
            },
            {
                "role": "user",
                "content": (
                    f"Background Principles:\n{abstract_ctx}\n\n"
                    f"Specific Information:\n{specific_ctx}\n\n"
                    f"Question: {query}"
                )
            }
        ],
    ).choices[0].message.content

Pattern 4: Sub-Question Decomposition

For complex multi-part questions, decompose into atomic sub-questions, answer each independently, then synthesize.

def decompose_and_answer(
    complex_query: str,
    vector_store,
    model: str = "gpt-4o",
) -> str:
    """
    Break complex queries into sub-questions, answer each, synthesize.
    """
    # Step 1: Decompose into sub-questions
    decomp_response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": (
                    "Break the user's complex question into 2-4 simpler sub-questions. "
                    "Each sub-question should be independently answerable. "
                    "Respond with JSON: {\"sub_questions\": [\"q1\", \"q2\", ...]}"
                )
            },
            {"role": "user", "content": complex_query}
        ],
        temperature=0,
        response_format={"type": "json_object"},
    )
    sub_questions = json.loads(decomp_response.choices[0].message.content).get("sub_questions", [complex_query])

    # Step 2: Answer each sub-question independently via RAG
    sub_answers = []
    for sq in sub_questions:
        results = vector_store.search(sq, top_k=3)
        context = "\n\n".join([r["text"] for r in results])

        answer_resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {
                    "role": "system",
                    "content": "Answer briefly based only on the provided context. Say 'No information found' if context is irrelevant."
                },
                {
                    "role": "user",
                    "content": f"Context:\n{context}\n\nQuestion: {sq}"
                }
            ],
            temperature=0,
        )
        sub_answers.append({
            "question": sq,
            "answer": answer_resp.choices[0].message.content,
        })

    # Step 3: Synthesize sub-answers into final answer
    sub_qa_text = "\n".join([f"Q: {sa['question']}\nA: {sa['answer']}" for sa in sub_answers])

    synthesis = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "system",
                "content": "Synthesize the following Q&A pairs into a comprehensive answer to the original question."
            },
            {
                "role": "user",
                "content": f"Original question: {complex_query}\n\nSub-questions and answers:\n{sub_qa_text}"
            }
        ],
    )
    return synthesis.choices[0].message.content


# Example
complex_q = "Compare the refund policy and shipping terms between the standard and premium tiers."
answer = decompose_and_answer(complex_q, vector_store)

Pattern 5: Contextual Compression

Sometimes you retrieve large chunks where only a small portion is relevant to the query. Contextual compression extracts just the relevant part, reducing context noise.

from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain.retrievers import ContextualCompressionRetriever
from langchain_openai import ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# Compressor: extracts only the relevant portion of each retrieved chunk
compressor = LLMChainExtractor.from_llm(llm)

# Base retriever
base_retriever = Chroma(
    collection_name="documents",
    embedding_function=OpenAIEmbeddings()
).as_retriever(search_kwargs={"k": 5})

# Compression retriever: retrieves then compresses each chunk
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=base_retriever,
)

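# Query: retrieve and compress
# (`invoke` is the current retriever entry point; older LangChain releases
# used `get_relevant_documents`, which is now deprecated)
compressed_docs = compression_retriever.invoke(
    "What is the notice period for contract termination?"
)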
for doc in compressed_docs:
    print(f"Compressed chunk ({len(doc.page_content)} chars): {doc.page_content[:200]}")
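
To see what compression does without an LLM or LangChain, here is a rough stand-in that keeps only sentences sharing enough non-stopword terms with the query. This is a keyword-overlap filter, not the LLM-based extraction that `LLMChainExtractor` performs; `compress_chunk`, the stopword list, and the sample chunk are all assumptions made for this sketch.

```python
import re

STOPWORDS = {"what", "is", "the", "for", "a", "an", "of", "by", "with", "are", "this"}

def tokenize(text: str) -> set:
    """Lowercase word set with stopwords removed."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS

def compress_chunk(query: str, chunk: str, min_overlap: int = 2) -> str:
    """Keep only sentences that share at least `min_overlap` terms with the query."""
    query_terms = tokenize(query)
    sentences = re.split(r"(?<=[.!?])\s+", chunk.strip())
    kept = [s for s in sentences if len(query_terms & tokenize(s)) >= min_overlap]
    return " ".join(kept)

chunk = (
    "This Agreement is governed by the laws of Delaware. "
    "Either party may end the contract with a notice period of 30 days. "
    "Invoices are due within 45 days of receipt."
)
compressed = compress_chunk("What is the notice period for contract termination?", chunk)
print(compressed)
```

A lexical filter like this is cheap but misses paraphrases, which is the gap an LLM-based compressor is meant to close at the cost of one model call per chunk.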

Pattern 6: Self-RAG

Published by: Asai et al. (2023), "Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection"

Self-RAG trains the model with special reflection tokens that allow it to:

- Decide whether retrieval is needed at all
- Evaluate the relevance of retrieved passages
- Assess whether its own generation is supported by the retrieved text
- Critique and revise its own output

The key self-reflection tokens:

- [Retrieve] / [No Retrieve] - model decides whether to retrieve
- [Relevant] / [Irrelevant] - model judges each retrieved passage
- [Fully supported] / [Partially supported] / [No support] - model assesses its generation
- [Utility] scores 1-5 - model rates usefulness of its response

Training a Self-RAG model from scratch requires significant compute, but you can approximate its behavior with prompting:

def self_rag_pipeline(
    query: str,
    vector_store,
    model: str = "gpt-4o",
) -> dict:
    """
    Approximate Self-RAG behavior through prompted reflection.
    """
    # Step 1: Does this query need retrieval?
    need_retrieval = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": (
                    "Does answering this question require looking up specific factual information "
                    "that might not be in general knowledge? Answer only 'YES' or 'NO'."
                )
            },
            {"role": "user", "content": query}
        ],
        temperature=0,
    ).choices[0].message.content.strip().upper()

    if not need_retrieval.startswith("YES"):  # tolerate replies such as "YES."
        # Answer directly without retrieval
        answer = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": query}],
        ).choices[0].message.content
        return {"answer": answer, "retrieved": False, "reflection": "No retrieval needed"}

    # Step 2: Retrieve
    results = vector_store.search(query, top_k=5)

    # Step 3: Evaluate relevance of each retrieved chunk
    relevant_chunks = []
    for chunk in results:
        relevance = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {
                    "role": "system",
                    "content": "Is this passage relevant to answering the question? Answer only 'RELEVANT' or 'IRRELEVANT'."
                },
                {
                    "role": "user",
                    "content": f"Question: {query}\n\nPassage: {chunk['text']}"
                }
            ],
            temperature=0,
        ).choices[0].message.content.strip().upper()

        if relevance.startswith("RELEVANT"):  # "IRRELEVANT" does not start with "RELEVANT"
            relevant_chunks.append(chunk)

    if not relevant_chunks:
        return {"answer": "I don't have relevant information to answer this question.", "retrieved": True}

    # Step 4: Generate answer
    context = "\n\n".join([c["text"] for c in relevant_chunks])
    answer = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "system",
                "content": "Answer based on the provided context. Cite what you use."
            },
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {query}"
            }
        ],
    ).choices[0].message.content

    # Step 5: Assess whether answer is supported
    support_check = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": "Is the answer fully supported by the provided context? Answer 'FULLY', 'PARTIALLY', or 'NOT SUPPORTED'."
            },
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nAnswer: {answer}"
            }
        ],
        temperature=0,
    ).choices[0].message.content.strip()

    return {
        "answer": answer,
        "retrieved": True,
        "relevant_chunks": len(relevant_chunks),
        "support": support_check,
    }
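
Prompted reflection depends on parsing short labels out of free-form model output, and models do not always return the bare label. The helper below is one possible way to normalize those replies; `parse_label` is a name introduced here, and the choice to fall back to a conservative default is an assumption about what you want when the reply is unparseable.

```python
import re

def parse_label(raw: str, allowed: list[str], default: str) -> str:
    """Map a free-form model reply onto one of the allowed labels.

    The reply must begin with the label (ignoring case, whitespace and
    surrounding quotes); anything else returns `default`.
    """
    cleaned = raw.strip().strip("\"'`").upper()
    # Check longer labels first so a short label never shadows a longer one
    for label in sorted(allowed, key=len, reverse=True):
        # \b prevents "NO" from matching "NOT SURE"
        if re.match(rf"{re.escape(label)}\b", cleaned):
            return label
    return default

print(parse_label("Yes.", ["YES", "NO"], default="NO"))
print(parse_label(" irrelevant\n", ["RELEVANT", "IRRELEVANT"], default="IRRELEVANT"))
print(parse_label("Not sure", ["YES", "NO"], default="YES"))
```

Defaulting the relevance check to "IRRELEVANT" and the support check to "NOT SUPPORTED" keeps parsing failures from silently passing a verification step.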

Pattern 7: Corrective RAG (CRAG)

Published by: Yan et al. (2024), "Corrective Retrieval Augmented Generation"

CRAG adds a retrieval evaluator that grades the retrieved documents. If documents are ambiguous or incorrect relative to the query, it triggers a web search fallback.

import requests
from typing import Literal

def evaluate_retrieval_quality(
    query: str,
    retrieved_docs: List[dict],
) -> Literal["correct", "ambiguous", "incorrect"]:
    """Evaluate whether retrieved docs are sufficient to answer query."""
    docs_text = "\n\n".join([f"[{i+1}] {d['text']}" for i, d in enumerate(retrieved_docs[:3])])

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": (
                    "Evaluate whether the retrieved documents are sufficient to answer the query. "
                    "Respond with JSON: {\"quality\": \"correct|ambiguous|incorrect\", \"reason\": \"...\"}"
                )
            },
            {
                "role": "user",
                "content": f"Query: {query}\n\nRetrieved documents:\n{docs_text}"
            }
        ],
        temperature=0,
        response_format={"type": "json_object"},
    )
    result = json.loads(response.choices[0].message.content)
    return result.get("quality", "ambiguous")


def web_search(query: str) -> List[dict]:
    """Search the web via the Tavily REST API (sign up at tavily.com)."""
    import os
    api_key = os.getenv("TAVILY_API_KEY")
    if not api_key:
        return []  # Graceful fallback when no key is configured

    try:
        response = requests.post(
            "https://api.tavily.com/search",
            json={"query": query, "max_results": 5, "search_depth": "basic"},
            # Bearer-token auth per Tavily's current docs; verify against your API version
            headers={"Content-Type": "application/json", "Authorization": f"Bearer {api_key}"},
            timeout=10,
        )
        response.raise_for_status()
    except requests.RequestException:
        return []  # Treat network or API errors as "no web results"

    results = response.json().get("results", [])
    return [{"text": r.get("content", ""), "source": r.get("url", "")} for r in results]


def crag_pipeline(query: str, vector_store) -> str:
    """Corrective RAG: evaluate retrieval and fall back to web search if needed."""
    # Retrieve from vector store
    local_docs = vector_store.search(query, top_k=5)

    # Evaluate retrieval quality
    quality = evaluate_retrieval_quality(query, local_docs)
    print(f"Retrieval quality: {quality}")

    if quality == "correct":
        context_docs = local_docs
    elif quality == "ambiguous":
        # Use both local and web. Cap each source: local_docs already holds
        # 5 items, so concatenating and slicing [:5] would drop every web result.
        web_docs = web_search(query)
        context_docs = (local_docs[:3] + web_docs[:2]) if web_docs else local_docs
    else:  # incorrect
        # Rely on web search
        context_docs = web_search(query) or local_docs  # fallback to local if web fails

    context = "\n\n".join([d["text"] for d in context_docs[:5]])

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": "Answer the question based on the provided context. Cite your sources."
            },
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {query}"
            }
        ],
    )
    return response.choices[0].message.content
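
The routing decision in CRAG is worth isolating as a pure function so it can be unit-tested without any API calls. The sketch below is one way to do that; `select_context` is a name introduced here, and interleaving local and web results in the ambiguous case is a design assumption rather than something the CRAG paper prescribes.

```python
from itertools import zip_longest

def select_context(quality: str, local_docs: list, web_docs: list, max_docs: int = 5) -> list:
    """Choose which documents reach the generator based on the evaluator's grade."""
    if quality == "correct":
        chosen = local_docs
    elif quality == "ambiguous":
        # Alternate local and web results so neither source is cut off by max_docs
        chosen = [
            doc
            for pair in zip_longest(local_docs, web_docs)
            for doc in pair
            if doc is not None
        ]
    else:  # "incorrect" or any unexpected grade
        chosen = web_docs or local_docs  # fall back to local if the web search returned nothing
    return chosen[:max_docs]

local = [{"text": f"L{i}"} for i in range(5)]
web = [{"text": f"W{i}"} for i in range(3)]

ambiguous_texts = [d["text"] for d in select_context("ambiguous", local, web)]
print(ambiguous_texts)
```

Treating unexpected grades like "incorrect" means a malformed evaluator reply leans on the web search rather than on documents the evaluator may have rejected.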

Pattern 8: Iterative Retrieval

For queries that require building up context over multiple steps - where knowing one thing reveals what else you need to know - iterative retrieval loops between retrieval and generation.

def iterative_retrieval(
    query: str,
    vector_store,
    max_iterations: int = 3,
    model: str = "gpt-4o",
) -> str:
    """
    Iteratively retrieve, assess completeness, and retrieve more if needed.
    """
    accumulated_context = []
    current_query = query

    for iteration in range(max_iterations):
        # Retrieve for current query
        new_docs = vector_store.search(current_query, top_k=3)

        # Add new unique docs to accumulated context
        existing_texts = {d["text"] for d in accumulated_context}
        for doc in new_docs:
            if doc["text"] not in existing_texts:
                accumulated_context.append(doc)
                existing_texts.add(doc["text"])  # also skip duplicates within this batch

        context = "\n\n".join([d["text"] for d in accumulated_context])

        # Ask LLM: can you answer the question now? If not, what else do you need?
        reflection = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {
                    "role": "system",
                    "content": (
                        "Given the accumulated context and question, determine: "
                        "1. Can you answer the question fully? "
                        "2. If not, what specific information is still missing? "
                        "Respond with JSON: {\"can_answer\": true/false, \"missing_info\": \"what do you still need?\"}"
                    )
                },
                {
                    "role": "user",
                    "content": f"Question: {query}\n\nAccumulated context:\n{context}"
                }
            ],
            temperature=0,
            response_format={"type": "json_object"},
        )
        reflection_result = json.loads(reflection.choices[0].message.content)

        if reflection_result.get("can_answer", False):
            break  # We have enough context

        # Generate a follow-up query for the missing information
        missing = reflection_result.get("missing_info", "")
        if missing:
            current_query = missing  # Next iteration retrieves missing info
            print(f"Iteration {iteration + 1}: Retrieving for '{missing[:80]}'")
        else:
            break

    # Final answer generation
    final_context = "\n\n".join([d["text"] for d in accumulated_context])
    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "system",
                "content": "Answer the question comprehensively using all provided context."
            },
            {
                "role": "user",
                "content": f"Context:\n{final_context}\n\nQuestion: {query}"
            }
        ],
    )
    return response.choices[0].message.content
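
The control flow of the loop can be exercised offline by injecting the retriever and the reflection step as plain functions. In the sketch below, `iterative_loop`, `fake_retrieve`, and `fake_reflect` are hypothetical stand-ins (no vector store or LLM is called), and the toy corpus mirrors the sublicensing example used later in this lesson.

```python
from typing import Callable

def iterative_loop(
    query: str,
    retrieve: Callable[[str], list],
    reflect: Callable[[str, list], dict],
    max_iterations: int = 3,
):
    """Retrieve, reflect, and re-query until answerable or out of iterations."""
    context: list = []
    trace: list = []  # queries issued, useful for debugging
    current_query = query
    for _ in range(max_iterations):
        trace.append(current_query)
        for passage in retrieve(current_query):
            if passage not in context:
                context.append(passage)
        verdict = reflect(query, context)
        if verdict.get("can_answer") or not verdict.get("missing_info"):
            break
        current_query = verdict["missing_info"]
    return context, trace

question = "Can Acme sublicense our software to its Singapore subsidiary?"
toy_corpus = {
    question: ["Sublicensing is permitted only within approved territories."],
    "definition of approved territories": ["Approved territories: Singapore, Japan, Australia."],
}

def fake_retrieve(q: str) -> list:
    return toy_corpus.get(q, [])

def fake_reflect(original_question: str, context: list) -> dict:
    if any(p.startswith("Approved territories:") for p in context):
        return {"can_answer": True, "missing_info": ""}
    return {"can_answer": False, "missing_info": "definition of approved territories"}

context, trace = iterative_loop(question, fake_retrieve, fake_reflect)
print(trace)
print(context)
```

Returning the query trace alongside the context makes each hop inspectable, which helps when an iterative pipeline retrieves the wrong follow-up.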

Pattern Comparison: Which to Use When

Production Trade-offs

| Pattern | Latency Added | LLM Calls Added | When Worth It |
|---|---|---|---|
| HyDE | +200-500ms | +1 | Always worth testing; often 5-15% recall improvement |
| Multi-Query | +500ms-1s | +1 (gen) + 3-4 extra retrievals | Complex domains, diverse query types |
| Step-Back | +500ms | +1 | Abstract/conceptual queries |
| Sub-Question | +1-3s | +3-6 | Complex multi-part questions |
| Contextual Compression | +200-500ms | +N (per chunk) | Large chunks with low density relevant content |
| Self-RAG (prompted) | +1-2s | +3-5 | High-stakes answers needing verification |
| CRAG | +1-3s | +2-3 (+ web latency) | Knowledge gaps are common |
| Iterative RAG | +2-10s | +N × iterations | Multi-hop questions, deep research |

The most important production advice: Measure first. Add advanced patterns only when you have eval evidence they improve your specific failure cases. Teams routinely add HyDE + multi-query + step-back + iterative all at once, creating a pipeline that takes 8 seconds per query and is impossible to debug. Start with naive RAG. Add one pattern. Measure. Repeat.
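
A measure-first workflow needs very little code. The sketch below computes hit rate at k (the fraction of queries with at least one relevant document in the top k) for a baseline and a candidate retriever, then applies a keep-or-revert rule. The function names, the toy eval set, and the canned retrievers are assumptions for illustration; the 3-percentage-point threshold is the one suggested in this lesson.

```python
def hit_rate_at_k(retriever, eval_set, k=5):
    """Fraction of eval queries with at least one relevant doc id in the top k."""
    hits = 0
    for example in eval_set:
        retrieved_ids = retriever(example["query"])[:k]
        if set(retrieved_ids) & set(example["relevant_ids"]):
            hits += 1
    return hits / len(eval_set)

def keep_pattern(baseline: float, candidate: float, min_gain: float = 0.03) -> bool:
    """Keep the new pattern only if it beats the baseline by more than min_gain."""
    return (candidate - baseline) > min_gain

# Toy eval set: each query lists the doc ids a human marked as relevant
eval_set = [
    {"query": "q1", "relevant_ids": ["d1"]},
    {"query": "q2", "relevant_ids": ["d2"]},
    {"query": "q3", "relevant_ids": ["d3"]},
    {"query": "q4", "relevant_ids": ["d4"]},
]

# Canned retriever outputs standing in for real pipelines
baseline_runs = {"q1": ["d1", "d9"], "q2": ["d8"], "q3": ["d3"], "q4": ["d7"]}
candidate_runs = {"q1": ["d1"], "q2": ["d2", "d8"], "q3": ["d3"], "q4": ["d7"]}

baseline_score = hit_rate_at_k(lambda q: baseline_runs[q], eval_set)
candidate_score = hit_rate_at_k(lambda q: candidate_runs[q], eval_set)
print(baseline_score, candidate_score, keep_pattern(baseline_score, candidate_score))
```

Running the same eval set against each pipeline variant, one pattern at a time, is what makes later ablation possible.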

Common Mistakes

:::danger Adding Complexity Without Evaluation
Advanced patterns add latency, cost, and failure modes. HyDE can embed a confidently-wrong hypothetical document. Multi-query can retrieve redundant content. Sub-question decomposition can fragment questions incorrectly. Before adding any advanced pattern, establish a baseline eval score. Add the pattern. If it improves your target metric by more than 3 percentage points, keep it. Otherwise, revert. Complexity without measurement is waste.
:::

:::warning Combining Too Many Patterns
A pipeline that chains HyDE → multi-query → step-back → reranking → contextual compression → self-reflection is extremely difficult to debug. When it fails (and it will), you won't know which component caused the failure. Build incrementally. Add one pattern at a time. Maintain the ability to ablate each component.
:::

:::tip Start with Multi-Query + RRF
If you're going to add just one advanced pattern to a naive RAG system, multi-query with RRF fusion is consistently the best first choice. It's simple to implement, doesn't add many LLM calls, addresses the most common failure mode (single-query retrieval missing facets), and almost always improves recall by 5-15%. Add it first.
:::

Interview Questions and Answers

Q: Explain HyDE. Why does embedding a hypothetical document improve retrieval?

A: HyDE (Hypothetical Document Embeddings) addresses the vocabulary mismatch problem between user queries and corpus documents. User queries are often short, informal, and use different vocabulary than the formal documents in the corpus. A user asking "can we cancel early?" has an embedding that may not be close to a contract clause saying "either party may terminate with 30 days written notice." HyDE generates a hypothetical document that would answer the query, using document-like language: "Either party may terminate this agreement by providing written notice 30 days in advance..." This hypothetical document uses the vocabulary and structure of real corpus documents, so its embedding is much closer to the actual relevant document in embedding space. The hypothetical document is never shown to the user - it's only used for retrieval.

Q: When would you use sub-question decomposition in a RAG system?

A: Sub-question decomposition is valuable for queries that require information from multiple distinct sources or multiple facts combined. "Compare our enterprise pricing to the standard tier" requires retrieving enterprise pricing AND standard pricing - two separate facts from potentially two different documents. A single query embedding points toward one region of embedding space, not two. Decomposing into "what is enterprise pricing?" and "what is standard pricing?" allows each to retrieve precisely. Use decomposition when: (1) the question joins several information needs with words such as "and", "compare", or "between"; (2) the question asks for a comparison between two entities; (3) the question asks about changes over time (requiring multiple temporal versions); (4) the answer requires combining facts from different document sections. The cost: 2-3x more LLM calls and retrieval operations, plus the synthesis step.

Q: What is Corrective RAG and what problem does it solve that standard RAG doesn't?

A: Corrective RAG adds a retrieval quality evaluation step between retrieval and generation. Standard RAG always generates an answer from retrieved documents regardless of quality - if retrieval fails (wrong documents, insufficient information), the LLM either hallucinates or correctly says "I don't know." CRAG addresses this by evaluating whether retrieved documents are actually sufficient to answer the query. If retrieval quality is "correct," it proceeds normally. If "ambiguous," it supplements with web search. If "incorrect," it falls back entirely to web search. The key benefit: CRAG can handle queries about information not in the knowledge base (current events, very recent information) by gracefully falling back to web search rather than returning a degraded answer. The cost: an extra LLM call for evaluation and potential web search latency.

Q: How would you decide which advanced RAG pattern to add first to a system with naive RAG?

A: Measure the failure modes of your current system first. Run your evaluation set and categorize failures: (1) vocabulary mismatch failures (query says "cancel," document says "terminate") → start with HyDE; (2) complex multi-facet query failures → start with multi-query + RRF; (3) abstract query failures (the question is too high-level) → try step-back; (4) multi-hop failures (answer requires two facts) → try sub-question decomposition; (5) out-of-scope failures (knowledge base doesn't have the answer) → try CRAG with web search. In practice, multi-query + RRF is the best first addition for most systems because retrieval vocabulary mismatch is nearly universal and multi-query specifically addresses it with minimal added complexity. After adding the first pattern, re-evaluate. Add a second pattern only if you can identify a remaining specific failure mode.

Q: What is iterative retrieval and what class of queries does it excel at?

A: Iterative retrieval loops: retrieve, assess whether you have enough information to answer, and if not, formulate a follow-up retrieval query based on what's still missing. This loop continues until either the LLM determines it has sufficient context or a maximum iteration count is reached. It excels at multi-hop queries: questions where the answer to one sub-question determines what to look up next. Example: "Does our contract with Acme allow them to sublicense our software to their subsidiary in Singapore?" Step 1: retrieve the contract's sublicensing clause. Step 2: if the clause references "approved territories," retrieve what constitutes an approved territory. Step 3: determine if Singapore is covered. Each retrieval step depends on what was learned in the previous step. Standard single-pass retrieval can't do this because it doesn't know what follow-up facts are needed until after reading the initial results.
