Agentic RAG
The Question That Broke the System
The research assistant RAG system had a strong track record on single-domain questions. It handled regulatory lookups, earnings analysis, and market commentary with consistent quality. The team had good reason to trust it.
Then a compliance analyst submitted this question: "What are the implications of the 2023 SEC climate disclosure rules for companies that also have European CSRD obligations?" The system responded with a three-paragraph answer about the SEC's climate rules. It was accurate. It was well-sourced. And it answered approximately half the question. The CSRD - the European Corporate Sustainability Reporting Directive - was mentioned once, briefly, almost as an afterthought. The actual question - how do these two regulatory frameworks interact for dual-obligated companies? - was never addressed.
The analyst submitted a follow-up: "What about CSRD specifically?" The system responded with a solid explanation of the CSRD framework. Still missing: the intersection analysis. How do the reporting timelines differ? Where do the material topics overlap? Where do they conflict? What must a company do that the SEC rules require but CSRD does not, and vice versa? Answering these questions requires: (1) retrieve on SEC rules, (2) retrieve on CSRD, (3) identify where both regulations address the same topics, (4) compare the requirements in those overlapping areas, (5) reason about the implications for companies subject to both. A single retrieval pass cannot do this. Neither can two retrieval passes followed by separate answers. The system needed to think.
This is the fundamental limitation of standard RAG: it has no concept of "I retrieved X, but now I need Y before I can answer." Standard RAG retrieves once, generates once, and stops. The retrieval is not informed by what the generation discovers is missing. The generation cannot trigger additional retrieval when it encounters a gap. The system is a pipeline, not a loop.
Agentic RAG replaces the pipeline with a loop. The agent retrieves, generates, checks whether the generated content is well-grounded, identifies what is still missing, retrieves again for the missing pieces, and continues until the answer is complete and supported. On complex multi-hop questions, this iterative process is not an optimization - it is the only correct architecture. Standard RAG on the SEC+CSRD question is not "slightly worse" than agentic RAG. It is architecturally incapable of answering it.
This lesson covers five agentic RAG patterns in production depth: Self-RAG, FLARE, ReAct with RAG as a tool, RAPTOR, and Corrective RAG. For each pattern you will understand the mechanism, the implementation using the Anthropic SDK, the failure modes, and how to build a production system that handles complex multi-hop questions reliably.
Why Agentic RAG Exists
The Single-Shot Limitation
Standard RAG has a fixed, non-adaptive structure:
query → embed query → retrieve K chunks → concatenate → generate answer → done
This structure has three critical limitations for complex questions:
No iterative refinement: The retrieval step runs once, before the LLM has generated anything. The LLM cannot discover, mid-generation, that it needs additional context and trigger a second retrieval. Once the generation starts, the context is fixed.
No self-correction: If the retrieved chunks are poor - irrelevant, outdated, insufficient - the LLM generates an answer based on poor context. There is no step where the system checks: "Is this answer actually grounded in what I retrieved? Is any claim unsupported?"
No sequential information building: Some questions require building knowledge step by step. Answer A must be known before question B can even be formulated. Standard RAG retrieves all context at once, which means question B cannot be asked in terms of answer A.
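The fixed structure can be made concrete with a minimal sketch. The retriever and generator here are toy stand-ins (`toy_retrieve`, `toy_generate`, and `_DOCS` are hypothetical names, not part of any library); the point is only that retrieval runs exactly once and nothing in the generation step can request more context.

```python
from typing import Callable


def single_shot_rag(
    query: str,
    retrieve: Callable[[str, int], list[str]],
    generate: Callable[[str], str],
    k: int = 3,
) -> str:
    """Retrieve once, generate once, stop. There is no feedback path."""
    chunks = retrieve(query, k)        # runs before any generation happens
    context = "\n\n".join(chunks)      # context is frozen from here on
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return generate(prompt)            # cannot trigger another retrieval


# Toy stand-ins so the sketch runs without a vector store or an LLM.
_DOCS = [
    "SEC rule: Scope 1 and 2 emissions disclosure.",
    "CSRD: double materiality under ESRS.",
    "Connection pool defaults to 10 connections.",
]


def toy_retrieve(query: str, k: int) -> list[str]:
    words = set(query.lower().split())
    ranked = sorted(_DOCS, key=lambda d: -len(words & set(d.lower().split())))
    return ranked[:k]


def toy_generate(prompt: str) -> str:
    return f"(answer generated from a {len(prompt)}-character prompt)"


retrieval_calls: list[str] = []


def counting_retrieve(query: str, k: int) -> list[str]:
    retrieval_calls.append(query)
    return toy_retrieve(query, k)


answer = single_shot_rag(
    "How do SEC and CSRD emissions rules compare?",
    counting_retrieve,
    toy_generate,
    k=2,
)
```

However complex the question, `retrieval_calls` ends up with a single entry: the query as the user typed it.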
The Multi-Hop Structure of Complex Questions
Complex questions have a directed acyclic graph structure. To fully answer the SEC+CSRD question:
- Node 1: What does the SEC 2023 climate rule require? (retrieve independently)
- Node 2: What does CSRD require? (retrieve independently)
- Node 3: What topics appear in both? (depends on Nodes 1 and 2)
- Node 4: Where do the requirements differ on overlapping topics? (depends on Node 3)
- Node 5: What are the practical implications for dual-obligated companies? (depends on Node 4)
Standard RAG collapses this graph into a single retrieval step. Nodes 3, 4, and 5 cannot be answered without the outputs of earlier nodes. The graph must be traversed in dependency order - which requires an agent that can plan, retrieve, reason, and retrieve again.
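One way to picture traversal in dependency order is a topological sort over the sub-questions. The sketch below uses Python's standard `graphlib` module; the node names and the `answer_node` helper are illustrative assumptions standing in for a real "retrieve and generate" step on each sub-question.

```python
from graphlib import TopologicalSorter

# Each node maps to the set of nodes it depends on (its predecessors).
QUESTION_GRAPH: dict[str, set[str]] = {
    "sec_requirements": set(),
    "csrd_requirements": set(),
    "overlapping_topics": {"sec_requirements", "csrd_requirements"},
    "requirement_differences": {"overlapping_topics"},
    "dual_obligation_implications": {"requirement_differences"},
}


def answer_node(node: str, upstream: dict[str, str]) -> str:
    """Hypothetical stand-in for 'retrieve + generate' on one sub-question."""
    used = ", ".join(sorted(upstream)) or "nothing"
    return f"answer({node}) using {used}"


def traverse(graph: dict[str, set[str]]) -> dict[str, str]:
    answers: dict[str, str] = {}
    # static_order() yields each node only after all of its dependencies.
    for node in TopologicalSorter(graph).static_order():
        upstream = {dep: answers[dep] for dep in graph[node]}
        answers[node] = answer_node(node, upstream)
    return answers


answers = traverse(QUESTION_GRAPH)
order = list(answers)
```

In a real agent the graph is usually not known up front; the agent discovers later nodes as earlier answers come back. The fixed graph here only illustrates why the order of retrieval matters.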
The Self-Correction Necessity
Retrieval quality is not uniform. Some queries retrieve excellent context. Others - because of vocabulary mismatch, sparse corpus coverage, or query ambiguity - retrieve mediocre or irrelevant content. Standard RAG passes all retrieved context to the LLM regardless of quality. Agentic RAG can evaluate retrieved context and take corrective action: reformulate the query, try a different retrieval strategy, or fall back to web search.
This is not an edge case. In production RAG systems, it is common for a noticeable fraction of queries, on the order of 10-20%, to retrieve substantially poor context. Without correction, these queries produce confident-sounding wrong answers, which is the worst failure mode in any knowledge system.
Historical Context
Agentic RAG sits at the intersection of two research threads: retrieval-augmented generation and LLM agents.
RAG was introduced by Lewis et al. (Facebook AI Research, 2020) as a way to ground generation in non-parametric memory. The original RAG paper used a single-shot retrieve-then-generate pattern. It worked well for factoid questions and knowledge-intensive NLP tasks, but practitioners soon ran into its limitations on complex questions.
ReAct (Yao et al., Princeton and Google, 2022) showed that interleaving reasoning traces and action execution significantly improves LLM performance on multi-step tasks. The key contribution was the synergy between thinking and acting - each reasoning step informs the next action, and each action result informs the next reasoning step.
Self-RAG (Asai et al., University of Washington, 2023) was a landmark paper that formalized iterative retrieval with self-reflection. Rather than treating retrieval as a fixed input step, Self-RAG trains the LLM to decide when to retrieve, whether the retrieved content is relevant, whether the generated content is supported, and whether the answer is useful. These decisions are encoded as special reflection tokens.
FLARE (Jiang et al., CMU, 2023) took a different approach: rather than periodically checking whether to retrieve, FLARE monitors the LLM's token-level confidence during generation and triggers retrieval when confidence drops below a threshold. This enables a natural integration of retrieval into the generation process.
RAPTOR (Sarthi et al., Stanford, 2024) addressed a different problem: how to retrieve effectively across different levels of specificity. RAPTOR builds a tree of progressively abstracted document summaries, enabling retrieval at the right level of granularity for each question.
Corrective RAG (CRAG) (Yan et al., 2024) formalized the corrective action pattern: evaluate retrieval quality post-retrieval, and if it is poor, take corrective action (query reformulation, web search fallback, or explicit disclaimer).
Pattern 1: Self-RAG - Iterative Retrieval with Reflection
The Core Mechanism
Self-RAG (Asai et al., 2023) adds four reflection decisions to the generate-retrieve loop:
- Retrieve?: Should I retrieve before generating the next passage? (RETRIEVE / NO_RETRIEVE)
- ISREL: Is the retrieved passage relevant to the query? (Relevant / Irrelevant)
- ISSUP: Are the claims in my generated response supported by the retrieved passage? (Fully Supported / Partially Supported / No Support)
- ISUSE: Is my response useful overall? (1-5 scale)
In the original paper, these decisions are trained as special tokens. For production systems without custom training, you implement them as LLM classification calls - asking Claude to make these judgments explicitly.
The Self-RAG Loop
1. Generate initial passage
2. IF low confidence → retrieve
3. Score retrieved chunks: ISREL for each
4. For each relevant chunk: generate response segment
5. Score: ISSUP (is this segment supported by the chunk?)
6. IF ISSUP = "No Support" → retrieve again with refined query
7. Continue until response is fully grounded
8. Score: ISUSE
9. Return highest-ISUSE, fully-supported response
The key property: retrieval is triggered by generation quality, not by a fixed schedule. The agent retrieves more when it detects low confidence or poor grounding, not on every turn.
Pattern 2: FLARE - Forward-Looking Active Retrieval
The Core Idea
FLARE (Forward-Looking Active REtrieval, Jiang et al., 2023) monitors generation confidence at the token level. When the model generates a low-probability token - indicating uncertainty about a fact - it pauses, formulates a retrieval query based on the current context, retrieves, and continues generation with the new context.
The mechanism:
- Generate tokens one by one, tracking log probabilities
- If any token's probability falls below threshold → stop
- Formulate a retrieval query from the incomplete generated text
- Retrieve relevant documents
- Regenerate the uncertain segment with retrieved context
- Continue
FLARE is particularly effective for long-form generation where different parts of the answer require different retrieved knowledge. It naturally allocates retrieval effort to the parts of the generation that need it.
In practice, direct access to token probabilities is often unavailable (many API providers do not expose them). In that case, approximate FLARE by generating a candidate answer, having the LLM explicitly flag the statements it is uncertain about, and then retrieving for those statements.
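A minimal sketch of that approximation follows. It assumes no access to log probabilities, uses a keyword check (`flag_uncertain`) as a stand-in for the LLM call that flags uncertain statements, and uses the flagged sentence itself as the retrieval query. All function names and the `HEDGE_MARKERS` list are illustrative assumptions, not part of FLARE or any SDK.

```python
import re
from typing import Callable

HEDGE_MARKERS = ("might", "may", "possibly", "i think", "unclear", "probably")


def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def flag_uncertain(sentence: str) -> bool:
    """Stand-in for an LLM call that answers 'is this statement uncertain?'."""
    lowered = sentence.lower()
    return any(re.search(rf"\b{re.escape(m)}\b", lowered) for m in HEDGE_MARKERS)


def flare_approx(
    draft: str,
    retrieve: Callable[[str], list[str]],
    rewrite: Callable[[str, list[str]], str],
) -> tuple[str, list[str]]:
    """Retrieve only for flagged sentences and rewrite them with the new context."""
    revised: list[str] = []
    queries: list[str] = []
    for sentence in split_sentences(draft):
        if flag_uncertain(sentence):
            queries.append(sentence)  # the uncertain sentence doubles as the query
            sentence = rewrite(sentence, retrieve(sentence))
        revised.append(sentence)
    return " ".join(revised), queries


# Toy retriever and rewriter so the sketch runs without external services.
def toy_retrieve(query: str) -> list[str]:
    return ["CSRD requires assurance from the first reporting year."]


def toy_rewrite(sentence: str, chunks: list[str]) -> str:
    return chunks[0] if chunks else sentence


draft = (
    "The SEC rule covers Scope 1 and 2 emissions. "
    "CSRD might require assurance at some point."
)
answer, queries = flare_approx(draft, toy_retrieve, toy_rewrite)
```

Only the second sentence triggers retrieval, which mirrors the property described above: retrieval effort goes to the parts of the generation that need it.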
Pattern 3: ReAct + RAG - Tool-Augmented Agentic Retrieval
The Core Idea
ReAct (Yao et al., 2022) interleaves reasoning traces and action execution. Applied to RAG, retrieval becomes a tool that the agent can call when it decides retrieval is needed. The agent:
- Reasons about what information it needs
- Calls the retrieval tool with a specific query
- Processes the retrieved content
- Reasons about whether the content is sufficient
- Either retrieves again (with a refined or different query) or synthesizes the answer
This is the most flexible agentic RAG pattern. The agent can:
- Call retrieval multiple times with different queries
- Use different knowledge sources (internal KB, web, database)
- Retrieve sequentially where each retrieval informs the next query
- Stop retrieval when it determines it has enough information
The Anthropic SDK's tool use API makes this pattern straightforward to implement. You define retrieval as a tool, and the agent decides when and how to call it.
Pattern 4: RAPTOR - Multi-Level Retrieval
The Core Idea
RAPTOR (Recursive Abstractive Processing for Tree-Organized Retrieval, Sarthi et al., Stanford, 2024) builds a hierarchical tree of document summaries. Leaf nodes are raw document chunks. Parent nodes are summaries of their children. The root node is a high-level summary of the entire corpus.
At retrieval time, RAPTOR queries at the appropriate level:
- Abstract, broad questions → retrieve from high-level summary nodes
- Specific, detailed questions → retrieve from leaf-level chunks
- Mixed questions → retrieve from both levels
This addresses a fundamental limitation of flat chunk retrieval: detailed chunks may be too granular to answer abstract questions, and high-level summaries may lack the specifics needed for detailed questions. RAPTOR serves both by building the hierarchy and retrieving from the level that fits the question.
RAPTOR Tree Construction
Raw documents → chunk → leaf nodes (level 0)
Leaf nodes → cluster + summarize → level 1 summaries
Level 1 summaries → cluster + summarize → level 2 summaries
Level 2 → level 3 → ... → root summary
Clustering uses UMAP dimensionality reduction followed by Gaussian Mixture Models. Summarization uses an LLM. The tree construction is an offline process run during corpus ingestion.
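The bottom-up construction can be sketched in a few lines. To stay dependency-free, this sketch deliberately swaps in two simplifications: adjacent nodes are grouped in fixed-size batches instead of being clustered with UMAP and Gaussian Mixture Models, and `toy_summarize` concatenates text instead of calling an LLM. `TreeNode`, `build_tree`, and `group_size` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class TreeNode:
    text: str
    level: int
    children: list["TreeNode"] = field(default_factory=list)


def build_tree(
    chunks: list[str],
    summarize: Callable[[list[str]], str],
    group_size: int = 2,
) -> list[list[TreeNode]]:
    """Return the tree as a list of levels; levels[0] holds the leaf chunks."""
    if group_size < 2:
        raise ValueError("group_size must be at least 2 so each level shrinks")
    levels = [[TreeNode(text=c, level=0) for c in chunks]]
    # Keep summarizing until a single root node remains.
    while len(levels[-1]) > 1:
        current = levels[-1]
        parents = []
        for i in range(0, len(current), group_size):
            group = current[i:i + group_size]  # stand-in for one cluster
            parents.append(TreeNode(
                text=summarize([n.text for n in group]),
                level=len(levels),
                children=group,
            ))
        levels.append(parents)
    return levels


def toy_summarize(texts: list[str]) -> str:
    """Stand-in for an LLM summarization call."""
    return " + ".join(texts)


levels = build_tree(["a", "b", "c", "d", "e"], toy_summarize)
```

At query time, every node in every level can be embedded and indexed together, so a broad question can match a summary node while a detailed question matches a leaf.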
Pattern 5: Corrective RAG (CRAG)
The Core Idea
CRAG (Yan et al., 2024) adds a retrieval quality evaluation step between retrieval and generation. After retrieving, it scores each chunk for relevance to the query. Based on aggregate relevance:
- High relevance: Proceed with retrieved context (standard RAG)
- Low relevance: Reformulate query + search web as fallback
- Mixed relevance: Keep high-relevance chunks, discard low-relevance, supplement with reformulated search
CRAG acknowledges that retrieval quality is not uniform and builds explicit correction logic. For RAG systems serving questions that may fall outside the corpus (current events, recent updates), the web search fallback is essential for graceful degradation.
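The three-way routing above can be sketched as a small decision function. The thresholds (`high=0.7`, `low=0.3`), the `CragAction` enum, and `plan_correction` are assumptions made for illustration; in a real system the relevance scores would come from an evaluator model and the thresholds would be tuned on your own data.

```python
from enum import Enum


class CragAction(Enum):
    USE_RETRIEVED = "use_retrieved"
    WEB_FALLBACK = "web_fallback"
    FILTER_AND_SUPPLEMENT = "filter_and_supplement"


def plan_correction(
    scored: list[tuple[str, float]],
    high: float = 0.7,
    low: float = 0.3,
) -> tuple[CragAction, list[str]]:
    """Map (chunk_id, relevance) pairs to an action and the chunks to keep."""
    # Nothing retrieved, or everything scored poorly: reformulate and go to the web.
    if not scored or all(score < low for _, score in scored):
        return CragAction.WEB_FALLBACK, []
    # Everything scored well: behave like standard RAG.
    if all(score >= high for _, score in scored):
        return CragAction.USE_RETRIEVED, [chunk for chunk, _ in scored]
    # Mixed: keep the strong chunks and supplement with a reformulated search.
    kept = [chunk for chunk, score in scored if score >= high]
    return CragAction.FILTER_AND_SUPPLEMENT, kept


action, kept = plan_correction([("csrd-001", 0.9), ("api-timeout-001", 0.1)])
```

Here the mixed result keeps `csrd-001`, drops the unrelated API chunk, and signals that a supplementary search is needed before generation.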
Production Code
import anthropic
import asyncio
import json
import re
from dataclasses import dataclass, field
from typing import Optional
from enum import Enum
client = anthropic.Anthropic()
async_client = anthropic.AsyncAnthropic()
# ─────────────────────────────────────────────
# Data Structures
# ─────────────────────────────────────────────
class RetrievalDecision(Enum):
    RETRIEVE = "RETRIEVE"
    NO_RETRIEVE = "NO_RETRIEVE"


class RelevanceScore(Enum):
    RELEVANT = "RELEVANT"
    IRRELEVANT = "IRRELEVANT"


class SupportScore(Enum):
    FULLY_SUPPORTED = "FULLY_SUPPORTED"
    PARTIALLY_SUPPORTED = "PARTIALLY_SUPPORTED"
    NO_SUPPORT = "NO_SUPPORT"


@dataclass
class KnowledgeChunk:
    """A retrieved document chunk."""
    chunk_id: str
    content: str
    source: str
    score: float = 0.0
    rank: int = 0
    relevance: Optional[RelevanceScore] = None


@dataclass
class AgentStep:
    """A single step in an agentic RAG execution trace."""
    step_number: int
    step_type: str  # THINK, RETRIEVE, EVALUATE, GENERATE, SYNTHESIZE
    input_query: Optional[str] = None
    output: Optional[str] = None
    chunks: list[KnowledgeChunk] = field(default_factory=list)
    metadata: dict = field(default_factory=dict)


@dataclass
class AgenticRAGResult:
    """Complete result from an agentic RAG pipeline."""
    original_query: str
    final_answer: str
    steps: list[AgentStep]
    total_llm_calls: int
    total_retrievals: int
    strategy_used: str

# ─────────────────────────────────────────────
# Mock Knowledge Base
# ─────────────────────────────────────────────
class KnowledgeBase:
    """
    Mock knowledge base for demonstration.
    In production: Pinecone, Weaviate, Qdrant, pgvector, etc.
    """

    _CORPUS = {
        "sec-climate-001": KnowledgeChunk(
            chunk_id="sec-climate-001",
            content=(
                "The SEC's 2023 climate disclosure rule (Release Nos. 33-11275) requires large "
                "accelerated filers to disclose material climate-related risks in their annual "
                "reports starting FY2025. Required disclosures include: Scope 1 and Scope 2 GHG "
                "emissions (with limited assurance for large accelerated filers), material climate "
                "risks and their financial impacts, governance over climate risk, and progress on "
                "climate-related targets. Scope 3 emissions disclosure was removed from the final rule."
            ),
            source="sec-rules/climate-disclosure-2023.md",
        ),
        "sec-climate-002": KnowledgeChunk(
            chunk_id="sec-climate-002",
            content=(
                "SEC climate rule enforcement timeline: Large accelerated filers - FY2025 for risk "
                "disclosures, FY2026 for Scope 1/2 emissions. Accelerated filers - FY2026 for risks, "
                "FY2028 for emissions. Non-accelerated filers and smaller reporting companies - "
                "FY2027 for risk disclosures only (emissions not required). Financial statement "
                "disclosures for severe weather events required from FY2025 for all affected filers."
            ),
            source="sec-rules/climate-timelines.md",
        ),
        "csrd-001": KnowledgeChunk(
            chunk_id="csrd-001",
            content=(
                "The EU Corporate Sustainability Reporting Directive (CSRD) requires large EU companies "
                "and EU-listed companies to report on sustainability matters under European Sustainability "
                "Reporting Standards (ESRS). Mandatory topics under ESRS include: climate change "
                "(E1), pollution (E2), water (E3), biodiversity (E4), resource use (E5), own workforce "
                "(S1), workers in value chain (S2), affected communities (S3), consumers (S4), and "
                "business conduct (G1). CSRD scope is substantially broader than SEC climate rules."
            ),
            source="eu-regulations/csrd-overview.md",
        ),
        "csrd-002": KnowledgeChunk(
            chunk_id="csrd-002",
            content=(
                "CSRD reporting timeline: Large EU companies already subject to NFRD - FY2024 reports "
                "(published 2025). Other large EU companies (meet 2 of 3: >250 employees, >€40M revenue, "
                ">€20M total assets) - FY2025 reports. Listed SMEs (optional until 2028, then mandatory). "
                "Non-EU companies with >€150M EU turnover and at least one EU subsidiary - FY2028 reports. "
                "CSRD requires independent third-party assurance from the first reporting year."
            ),
            source="eu-regulations/csrd-timelines.md",
        ),
        "csrd-sec-overlap": KnowledgeChunk(
            chunk_id="csrd-sec-overlap",
            content=(
                "Comparison of CSRD and SEC climate rules for dual-obligated companies: Both require "
                "disclosure of climate-related risks and governance. Key differences: (1) Scope - CSRD "
                "covers all sustainability topics; SEC rules cover climate only. (2) GHG reporting - "
                "CSRD requires Scope 1, 2, and 3 under ESRS E1; SEC requires only Scope 1 and 2 (if "
                "material). (3) Assurance - CSRD requires third-party assurance from year 1; SEC requires "
                "limited assurance only for large accelerated filers starting FY2028. (4) Double "
                "materiality - CSRD uses both financial and impact materiality; SEC uses financial "
                "materiality only."
            ),
            source="compliance/csrd-sec-comparison.md",
        ),
        "api-timeout-001": KnowledgeChunk(
            chunk_id="api-timeout-001",
            content=(
                "API timeout errors typically manifest as HTTP 504 Gateway Timeout responses. Root "
                "causes include: server-side processing time exceeding configured limits, slow "
                "database queries blocking request threads, connection pool exhaustion causing "
                "queue buildup, and upstream service dependencies timing out. Default timeout is "
                "30 seconds for synchronous requests."
            ),
            source="api-docs/timeouts.md",
        ),
        "connection-pool-001": KnowledgeChunk(
            chunk_id="connection-pool-001",
            content=(
                "Connection pool configuration: max_pool_size defaults to 10 connections. Under "
                "high load, when all 10 connections are active, new requests queue. If the queue "
                "fills before a connection is released, requests fail with timeout. Increase "
                "max_pool_size proportional to concurrent request rate. Monitor pool_active and "
                "pool_idle metrics to detect saturation."
            ),
            source="api-docs/connection-pool.md",
        ),
    }

    def search(self, query: str, k: int = 5) -> list[KnowledgeChunk]:
        """Mock semantic search. In production: vector similarity search."""
        query_words = query.lower().split()
        scored = []
        for chunk in self._CORPUS.values():
            doc_text = chunk.content.lower()
            # Fraction of query words that occur (as substrings) in the chunk text.
            score = sum(1 for w in query_words if w in doc_text) / max(len(query_words), 1)
            # Copy the chunk so per-query score and rank never mutate the corpus.
            chunk_copy = KnowledgeChunk(
                chunk_id=chunk.chunk_id,
                content=chunk.content,
                source=chunk.source,
                score=round(score, 4),
            )
            scored.append(chunk_copy)
        scored.sort(key=lambda x: x.score, reverse=True)
        for i, c in enumerate(scored[:k]):
            c.rank = i + 1
        return scored[:k]

    def search_web(self, query: str) -> list[KnowledgeChunk]:
        """Mock web search fallback for out-of-corpus queries."""
        return [
            KnowledgeChunk(
                chunk_id="web-001",
                content=f"[Web search result for '{query}'] This information was retrieved from public web sources as a fallback because the internal knowledge base did not contain sufficient relevant content.",
                source="web-search",
                score=0.5,
                rank=1,
            )
        ]

_kb = KnowledgeBase()
# ─────────────────────────────────────────────
# Pattern 1: Self-RAG Agent
# ─────────────────────────────────────────────
class SelfRAGAgent:
    """
    Iterative RAG with self-reflection (Asai et al., 2023).
    At each iteration:
    1. Generate a response draft
    2. Retrieve if confidence is low or grounding is poor
    3. Evaluate relevance of retrieved chunks (ISREL)
    4. Evaluate support of draft against chunks (ISSUP)
    5. Iterate until well-grounded or max_iterations reached
    """

    _RETRIEVE_DECISION_PROMPT = """Given this question and current response draft, decide whether
additional retrieval is needed to improve grounding.
Question: {question}
Current response draft: {draft}
Respond with RETRIEVE if:
- The draft contains claims not supported by specific evidence
- Important aspects of the question are not addressed
- The draft contains uncertainty markers like "might", "could", "I think"
Respond with NO_RETRIEVE if:
- The draft is well-grounded with specific facts
- All main aspects of the question are addressed
Respond with exactly one word: RETRIEVE or NO_RETRIEVE"""

    _ISREL_PROMPT = """Is this retrieved chunk relevant to answering the question?
Question: {question}
Retrieved chunk: {chunk}
Respond with exactly one word: RELEVANT or IRRELEVANT"""

    _ISSUP_PROMPT = """Does the retrieved chunk support the claims in the response?
Response: {response}
Retrieved chunk: {chunk}
Classify support level:
- FULLY_SUPPORTED: All claims in the response are supported by the chunk
- PARTIALLY_SUPPORTED: Some claims are supported, some are not
- NO_SUPPORT: The chunk does not support the response claims
Respond with exactly one of: FULLY_SUPPORTED, PARTIALLY_SUPPORTED, NO_SUPPORT"""

    _GENERATE_PROMPT = """Answer the following question using the provided context.
Be specific and cite evidence from the context. If the context is insufficient, say so explicitly.
Question: {question}
Context:
{context}
Answer:"""

    def generate_draft(self, question: str, context: str) -> str:
        """Generate a response draft with available context."""
        message = client.messages.create(
            model="claude-opus-4-6",
            max_tokens=700,
            messages=[{
                "role": "user",
                "content": self._GENERATE_PROMPT.format(question=question, context=context)
            }]
        )
        return message.content[0].text.strip()

    def should_retrieve(self, question: str, draft: str) -> RetrievalDecision:
        """Decide whether additional retrieval is needed."""
        message = client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=10,
            messages=[{
                "role": "user",
                "content": self._RETRIEVE_DECISION_PROMPT.format(
                    question=question, draft=draft
                )
            }]
        )
        raw = message.content[0].text.strip().upper()
        try:
            return RetrievalDecision(raw)
        except ValueError:
            return RetrievalDecision.RETRIEVE  # Default: retrieve when uncertain

    def evaluate_relevance(self, question: str, chunk: KnowledgeChunk) -> RelevanceScore:
        """Evaluate whether a retrieved chunk is relevant to the question."""
        message = client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=10,
            messages=[{
                "role": "user",
                "content": self._ISREL_PROMPT.format(
                    question=question,
                    chunk=chunk.content[:500],
                )
            }]
        )
        raw = message.content[0].text.strip().upper()
        try:
            return RelevanceScore(raw)
        except ValueError:
            return RelevanceScore.RELEVANT  # Default to relevant when uncertain

    def evaluate_support(self, response: str, chunk: KnowledgeChunk) -> SupportScore:
        """Evaluate how well a chunk supports the generated response."""
        message = client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=20,
            messages=[{
                "role": "user",
                "content": self._ISSUP_PROMPT.format(
                    response=response[:500],
                    chunk=chunk.content[:500],
                )
            }]
        )
        raw = message.content[0].text.strip().upper()
        try:
            return SupportScore(raw)
        except ValueError:
            # Unparseable label: fall back to the middle category.
            return SupportScore.PARTIALLY_SUPPORTED

    def iterative_retrieve(
        self,
        query: str,
        max_iterations: int = 3,
    ) -> AgenticRAGResult:
        """
        Main Self-RAG loop with iterative retrieval and self-reflection.
        """
        steps: list[AgentStep] = []
        accumulated_context: list[KnowledgeChunk] = []
        llm_calls = 0
        retrievals = 0

        # Step 1: Initial retrieval
        initial_chunks = _kb.search(query, k=5)
        retrievals += 1
        accumulated_context.extend(initial_chunks)
        steps.append(AgentStep(
            step_number=1,
            step_type="RETRIEVE",
            input_query=query,
            chunks=initial_chunks,
            output=f"Retrieved {len(initial_chunks)} chunks",
        ))

        # Filter to relevant chunks
        relevant_chunks = []
        for chunk in initial_chunks:
            relevance = self.evaluate_relevance(query, chunk)
            llm_calls += 1
            chunk.relevance = relevance
            if relevance == RelevanceScore.RELEVANT:
                relevant_chunks.append(chunk)
        steps.append(AgentStep(
            step_number=2,
            step_type="EVALUATE",
            output=f"ISREL: {len(relevant_chunks)}/{len(initial_chunks)} chunks relevant",
            chunks=relevant_chunks,
        ))

        # Step 2: Generate initial draft
        context_text = "\n\n".join([
            f"[{c.source}] {c.content}"
            for c in (relevant_chunks if relevant_chunks else initial_chunks)
        ])
        draft = self.generate_draft(query, context_text)
        llm_calls += 1
        steps.append(AgentStep(
            step_number=3,
            step_type="GENERATE",
            output=draft,
        ))

        # Step 3: Iterative refinement
        for iteration in range(max_iterations - 1):
            decision = self.should_retrieve(query, draft)
            llm_calls += 1
            if decision == RetrievalDecision.NO_RETRIEVE:
                steps.append(AgentStep(
                    step_number=len(steps) + 1,
                    step_type="EVALUATE",
                    output="RETRIEVE decision: NO_RETRIEVE - response is well grounded",
                ))
                break

            # Generate a refined retrieval query based on the draft's gaps
            refinement_query = self._generate_refinement_query(query, draft)
            llm_calls += 1
            additional_chunks = _kb.search(refinement_query, k=4)
            retrievals += 1

            # Filter for relevance and uniqueness
            existing_ids = {c.chunk_id for c in accumulated_context}
            new_chunks = [c for c in additional_chunks if c.chunk_id not in existing_ids]
            relevant_new = [
                c for c in new_chunks
                if self.evaluate_relevance(query, c) == RelevanceScore.RELEVANT
            ]
            llm_calls += len(new_chunks)  # one ISREL call per new chunk
            accumulated_context.extend(relevant_new)
            steps.append(AgentStep(
                step_number=len(steps) + 1,
                step_type="RETRIEVE",
                input_query=refinement_query,
                chunks=relevant_new,
                output=f"Iteration {iteration + 1}: retrieved {len(relevant_new)} new relevant chunks",
            ))

            if relevant_new:
                # Regenerate with expanded context
                full_context = "\n\n".join([f"[{c.source}] {c.content}" for c in accumulated_context])
                draft = self.generate_draft(query, full_context)
                llm_calls += 1
                steps.append(AgentStep(
                    step_number=len(steps) + 1,
                    step_type="GENERATE",
                    output=draft,
                    metadata={"iteration": iteration + 1},
                ))

        return AgenticRAGResult(
            original_query=query,
            final_answer=draft,
            steps=steps,
            total_llm_calls=llm_calls,
            total_retrievals=retrievals,
            strategy_used="SELF_RAG",
        )

    def _generate_refinement_query(self, original_query: str, current_draft: str) -> str:
        """Generate a query targeting the gaps in the current draft."""
        prompt = f"""Given this question and partial answer, generate a search query to find
information that would improve or complete the answer.
Original question: {original_query}
Current draft (may be incomplete): {current_draft[:500]}
Generate a specific search query for the missing information.
Return ONLY the search query, nothing else."""
        message = client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=100,
            messages=[{"role": "user", "content": prompt}]
        )
        return message.content[0].text.strip()

# ─────────────────────────────────────────────
# Pattern 2: ReAct RAG Agent with Tool Use
# ─────────────────────────────────────────────
class ReActRAGAgent:
    """
    ReAct-style RAG agent using Anthropic's tool use API.
    The agent reasons about what information it needs, calls retrieval
    tools, processes results, and decides whether to retrieve more or
    synthesize the answer. Tool use API handles the ReAct loop naturally.
    Tools available to the agent:
    - retrieve_from_knowledge_base: semantic search over internal corpus
    - search_web: web search fallback for out-of-corpus information
    There is no separate completion tool: the agent signals completion by
    ending its turn without a tool call (stop_reason == "end_turn").
    """

    _TOOLS = [
        {
            "name": "retrieve_from_knowledge_base",
            "description": (
                "Search the internal knowledge base for relevant information. "
                "Use specific, targeted queries. Can be called multiple times "
                "with different queries to gather information from different angles."
            ),
            "input_schema": {
                "type": "object",
                "properties": {
                    "query": {
                        "type": "string",
                        "description": "A specific search query to find relevant documents",
                    },
                    "k": {
                        "type": "integer",
                        "description": "Number of documents to retrieve (default 5, max 10)",
                        "default": 5,
                    },
                },
                "required": ["query"],
            },
        },
        {
            "name": "search_web",
            "description": (
                "Search the public web for information not in the internal knowledge base. "
                "Use when the knowledge base returns insufficient results for a query."
            ),
            "input_schema": {
                "type": "object",
                "properties": {
                    "query": {
                        "type": "string",
                        "description": "Web search query",
                    },
                },
                "required": ["query"],
            },
        },
    ]

    _SYSTEM_PROMPT = """You are a precise research assistant with access to a knowledge base and web search.
When answering a question:
1. Think carefully about what information you need
2. Use retrieve_from_knowledge_base for internal knowledge
3. Use search_web only if the knowledge base is insufficient
4. Make multiple retrieval calls if needed to gather all necessary information
5. Build your answer from retrieved evidence - do not rely on prior knowledge
6. Synthesize a comprehensive, well-grounded answer from all retrieved information
For complex multi-part questions, retrieve information for each part separately."""

    def execute_tool(self, tool_name: str, tool_input: dict) -> str:
        """Execute a tool call and return the result as a string."""
        if tool_name == "retrieve_from_knowledge_base":
            query = tool_input["query"]
            # Enforce the limits advertised in the tool schema (default 5, max 10).
            k = max(1, min(int(tool_input.get("k", 5)), 10))
            chunks = _kb.search(query, k=k)
            if not chunks:
                return "No relevant documents found in the knowledge base for this query."
            result_parts = []
            for i, chunk in enumerate(chunks):
                result_parts.append(
                    f"Result {i + 1} [Source: {chunk.source}, Score: {chunk.score:.3f}]:\n{chunk.content}"
                )
            return "\n\n---\n\n".join(result_parts)
        elif tool_name == "search_web":
            query = tool_input["query"]
            chunks = _kb.search_web(query)
            return f"Web search results for '{query}':\n\n" + chunks[0].content
        return f"Unknown tool: {tool_name}"

    def agentic_loop(self, query: str, max_steps: int = 8) -> AgenticRAGResult:
        """
        Run the ReAct agent loop using Anthropic's tool use API.
        The agent iterates: generate → tool call → process result → generate
        until it produces a final answer or hits max_steps.
        """
        steps: list[AgentStep] = []
        llm_calls = 0
        retrievals = 0
        messages = [{"role": "user", "content": query}]

        for step_num in range(max_steps):
            response = client.messages.create(
                model="claude-opus-4-6",
                max_tokens=1500,
                system=self._SYSTEM_PROMPT,
                tools=self._TOOLS,
                messages=messages,
            )
            llm_calls += 1

            # Check stop condition
            if response.stop_reason == "end_turn":
                # No tool call - agent has synthesized the final answer
                final_answer = ""
                for block in response.content:
                    if hasattr(block, "text"):
                        final_answer += block.text
                steps.append(AgentStep(
                    step_number=step_num + 1,
                    step_type="SYNTHESIZE",
                    output=final_answer,
                ))
                break

            if response.stop_reason != "tool_use":
                # Unexpected stop - extract any text as answer
                final_answer = "".join(
                    block.text for block in response.content if hasattr(block, "text")
                )
                break

            # Process tool calls
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    tool_name = block.name
                    tool_input = block.input
                    tool_output = self.execute_tool(tool_name, tool_input)
                    if tool_name in ("retrieve_from_knowledge_base", "search_web"):
                        retrievals += 1
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": tool_output,
                    })
                    step_type = "RETRIEVE" if "retrieve" in tool_name else "WEB_SEARCH"
                    steps.append(AgentStep(
                        step_number=step_num + 1,
                        step_type=step_type,
                        input_query=tool_input.get("query", ""),
                        output=tool_output[:300] + "..." if len(tool_output) > 300 else tool_output,
                    ))

            # Feed tool results back to the conversation
            messages.append({"role": "assistant", "content": response.content})
            messages.append({"role": "user", "content": tool_results})
        else:
            # max_steps reached without completion (the loop never hit a break)
            final_answer = "Maximum steps reached. Partial answer based on retrieved information."

        return AgenticRAGResult(
            original_query=query,
            final_answer=final_answer,
            steps=steps,
            total_llm_calls=llm_calls,
            total_retrievals=retrievals,
            strategy_used="REACT_RAG",
        )

# ─────────────────────────────────────────────
# Pattern 3: Multi-Hop RAG Pipeline
# ─────────────────────────────────────────────
@dataclass
class HopResult:
    """Result from a single hop in multi-hop retrieval."""
    hop_number: int
    query: str
    chunks: list[KnowledgeChunk]
    intermediate_answer: str

class MultiHopRAGPipeline:
    """
    Sequential multi-hop RAG for complex questions.
    Explicitly plans the retrieval hops needed to answer the question,
    executes each hop using the answers from previous hops to inform
    subsequent queries, and synthesizes a final answer.
    Best for: questions with clear logical dependency structure where
    each piece of information builds on the previous.
    """

    _PLAN_PROMPT = """Break down this complex question into a sequence of retrieval hops.
Each hop is a specific search query. Later hops can reference what earlier hops will find.
Maximum 4 hops. Keep queries specific and targeted.
Return a JSON array of strings - each string is one search query, in order.
Return ONLY the JSON array.
Complex question: {question}
JSON array of hop queries:"""

    _HOP_ANSWER_PROMPT = """Answer this specific sub-question using only the provided context.
Be concise and precise. Extract the key fact(s) needed.
Previous hop answers (use as additional context):
{previous_answers}
Retrieved context:
{context}
Sub-question: {question}
Concise answer:"""

    _SYNTHESIS_PROMPT = """Synthesize a comprehensive final answer using the research gathered.
Original question: {original_question}
Research gathered through multiple retrieval hops:
{hop_summaries}
Write a complete, precise final answer that addresses the original question.
Cite specific evidence from the research. Be comprehensive but not redundant."""

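    def plan_hops(self, question: str) -> list[str]:
        """Plan the sequence of retrieval hops needed."""
        message = client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=400,
            messages=[{
                "role": "user",
                "content": self._PLAN_PROMPT.format(question=question)
            }]
        )
        raw = message.content[0].text.strip()
        try:
            # Greedy match so a "]" inside a query string does not cut the array short
            match = re.search(r'\[.*\]', raw, re.DOTALL)
            parsed = json.loads(match.group() if match else raw)
        except json.JSONDecodeError:
            return [question]  # Fallback: single hop
        if not isinstance(parsed, list):
            return [question]
        # Keep only non-empty strings and enforce the 4-hop cap stated in the prompt
        hops = [h.strip() for h in parsed if isinstance(h, str) and h.strip()]
        return hops[:4] or [question]
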
    def execute_hop(
        self,
        hop_query: str,
        hop_number: int,
        previous_hops: list[HopResult],
        k: int = 5,
    ) -> HopResult:
        """Execute a single retrieval hop."""
        chunks = _kb.search(hop_query, k=k)

        # Format previous answers for context
        prev_answers = ""
        for prev in previous_hops:
            prev_answers += f"Hop {prev.hop_number}: {prev.query}\nAnswer: {prev.intermediate_answer}\n\n"

        context = "\n\n".join([f"[{c.source}] {c.content}" for c in chunks])
        message = client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=350,
            messages=[{
                "role": "user",
                "content": self._HOP_ANSWER_PROMPT.format(
                    previous_answers=prev_answers or "None",
                    context=context,
                    question=hop_query,
                )
            }]
        )
        return HopResult(
            hop_number=hop_number,
            query=hop_query,
            chunks=chunks,
            intermediate_answer=message.content[0].text.strip(),
        )

    def run(self, question: str) -> AgenticRAGResult:
        """Execute the full multi-hop RAG pipeline."""
        steps: list[AgentStep] = []

        # Step 1: Plan hops
        hop_queries = self.plan_hops(question)
        llm_calls = 1
        steps.append(AgentStep(
            step_number=1,
            step_type="THINK",
            output=f"Planned {len(hop_queries)} retrieval hops: {hop_queries}",
        ))

        # Step 2: Execute hops sequentially
        completed_hops: list[HopResult] = []
        for i, hop_query in enumerate(hop_queries):
            hop_result = self.execute_hop(hop_query, i + 1, completed_hops)
            llm_calls += 1
            completed_hops.append(hop_result)
            steps.append(AgentStep(
                step_number=len(steps) + 1,
                step_type="RETRIEVE",
                input_query=hop_query,
                chunks=hop_result.chunks,
                output=hop_result.intermediate_answer,
                metadata={"hop": i + 1},
            ))

        # Step 3: Synthesize
        hop_summaries = "\n\n".join([
            f"Hop {h.hop_number} - Query: {h.query}\nFindings: {h.intermediate_answer}"
            for h in completed_hops
        ])
        synthesis_message = client.messages.create(
            model="claude-opus-4-6",
            max_tokens=1000,
            messages=[{
                "role": "user",
                "content": self._SYNTHESIS_PROMPT.format(
                    original_question=question,
                    hop_summaries=hop_summaries,
                )
            }]
        )
        llm_calls += 1
        final_answer = synthesis_message.content[0].text.strip()
        all_chunks = [c for h in completed_hops for c in h.chunks]

        return AgenticRAGResult(
            original_query=question,
            final_answer=final_answer,
            steps=steps,
            total_llm_calls=llm_calls,
            total_retrievals=len(hop_queries),
            strategy_used="MULTI_HOP",
        )

# ─────────────────────────────────────────────
# Pattern 4: Corrective RAG Pipeline
# ─────────────────────────────────────────────
@dataclass
class RelevanceEvaluation:
    """Evaluation result for a set of retrieved chunks."""
    chunks: list[KnowledgeChunk]
    individual_scores: list[float]  # 0.0 - 1.0 per chunk
    aggregate_score: float
    action: str  # PROCEED, REFORMULATE, WEB_SEARCH

class CorrectiveRAGPipeline:
    """
    Corrective RAG (CRAG, Yan et al., 2024).
    Evaluates retrieval quality after each retrieval step.
    Takes corrective action if quality is poor:
    - High quality (aggregate >= 0.7): proceed with generation
    - Low quality (aggregate <= 0.3): fall back to web search
    - Mixed quality (in between): keep good chunks, discard bad,
      reformulate the query and retry
    """

    RELEVANCE_THRESHOLDS = {
        "high": 0.7,  # Proceed with generation
        "low": 0.3,   # Web search fallback
        # Between 0.3 and 0.7: reformulate and retry
    }

    _RELEVANCE_SCORE_PROMPT = """Score how relevant this retrieved passage is to answering the question.
Question: {question}
Passage: {passage}
Score from 0.0 to 1.0 where:
1.0 = directly answers the question with specific relevant information
0.5 = partially relevant, contains some useful information
0.0 = irrelevant, does not help answer the question
Return ONLY a decimal number between 0.0 and 1.0. No explanation."""

    _REFORMULATE_PROMPT = """The initial search query returned poor results. Reformulate it to better
target the information needed.
Original question: {question}
Original query: {original_query}
Retrieved content (insufficient): {retrieved_preview}
Generate an improved search query that would find more relevant information.
Return ONLY the query, nothing else."""

    _GENERATE_PROMPT = """Answer this question using the provided context.
Be specific. Cite evidence. If context is insufficient for any part, say so explicitly.
Note the source of any web-retrieved information.
Question: {question}
Context (sources noted):
{context}
Answer:"""

    def score_chunk_relevance(self, question: str, chunk: KnowledgeChunk) -> float:
        """Score the relevance of a single chunk to the question (0-1)."""
        message = client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=10,
            messages=[{
                "role": "user",
                "content": self._RELEVANCE_SCORE_PROMPT.format(
                    question=question,
                    passage=chunk.content[:600],
                )
            }]
        )
        try:
            score = float(re.search(r'\d+\.?\d*', message.content[0].text).group())
            return min(max(score, 0.0), 1.0)
        except (AttributeError, ValueError):
            # No parseable number in the reply: fall back to a neutral score
            return 0.5

    def evaluate_retrieval(self, question: str, chunks: list[KnowledgeChunk]) -> RelevanceEvaluation:
        """Evaluate the overall relevance of a retrieved chunk set."""
        if not chunks:
            return RelevanceEvaluation(
                chunks=[], individual_scores=[], aggregate_score=0.0, action="WEB_SEARCH"
            )
        scores = [self.score_chunk_relevance(question, c) for c in chunks]
        aggregate = sum(scores) / len(scores)
        if aggregate >= self.RELEVANCE_THRESHOLDS["high"]:
            action = "PROCEED"
        elif aggregate <= self.RELEVANCE_THRESHOLDS["low"]:
            action = "WEB_SEARCH"
        else:
            action = "REFORMULATE"
        return RelevanceEvaluation(
            chunks=chunks,
            individual_scores=scores,
            aggregate_score=aggregate,
            action=action,
        )

    def filter_good_chunks(
        self,
        chunks: list[KnowledgeChunk],
        scores: list[float],
        threshold: float = 0.4,
    ) -> list[KnowledgeChunk]:
        """Keep only chunks whose relevance score is at or above the threshold."""
        return [c for c, s in zip(chunks, scores) if s >= threshold]

    def reformulate_query(
        self,
        question: str,
        original_query: str,
        poor_chunks: list[KnowledgeChunk],
    ) -> str:
        """Generate a better search query based on poor retrieval results."""
        retrieved_preview = "\n".join([c.content[:200] for c in poor_chunks[:2]])
        message = client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=100,
            messages=[{
                "role": "user",
                "content": self._REFORMULATE_PROMPT.format(
                    question=question,
                    original_query=original_query,
                    retrieved_preview=retrieved_preview,
                )
            }]
        )
        return message.content[0].text.strip()

    def run(self, question: str, max_attempts: int = 3) -> AgenticRAGResult:
        """Execute the CRAG pipeline with corrective retrieval."""
        steps: list[AgentStep] = []
        llm_calls = 0
        retrievals = 0
        final_context_chunks: list[KnowledgeChunk] = []
        context_source_notes: list[str] = []
        current_query = question

        for attempt in range(max_attempts):
            # Retrieve
            chunks = _kb.search(current_query, k=6)
            retrievals += 1
            steps.append(AgentStep(
                step_number=len(steps) + 1,
                step_type="RETRIEVE",
                input_query=current_query,
                chunks=chunks,
                output=f"Retrieved {len(chunks)} chunks",
            ))

            # Evaluate relevance
            evaluation = self.evaluate_retrieval(question, chunks)
            llm_calls += len(chunks)  # One haiku call per chunk for scoring
            steps.append(AgentStep(
                step_number=len(steps) + 1,
                step_type="EVALUATE",
                output=(
                    f"Aggregate relevance: {evaluation.aggregate_score:.2f} | "
                    f"Action: {evaluation.action}"
                ),
                metadata={"scores": evaluation.individual_scores},
            ))

            if evaluation.action == "PROCEED":
                final_context_chunks.extend(chunks)
                context_source_notes.append("Internal knowledge base")
                break
            elif evaluation.action == "WEB_SEARCH":
                # Fall back to web search
                web_chunks = _kb.search_web(question)
                retrievals += 1
                final_context_chunks.extend(web_chunks)
                context_source_notes.append("Web search (knowledge base insufficient)")
                steps.append(AgentStep(
                    step_number=len(steps) + 1,
                    step_type="WEB_SEARCH",
                    input_query=question,
                    chunks=web_chunks,
                    output="Web search fallback executed",
                ))
                break
            else:  # REFORMULATE
                # Keep good chunks, reformulate for the rest
                good_chunks = self.filter_good_chunks(chunks, evaluation.individual_scores)
                final_context_chunks.extend(good_chunks)
                if attempt < max_attempts - 1:
                    new_query = self.reformulate_query(question, current_query, chunks)
                    llm_calls += 1
                    current_query = new_query
                    steps.append(AgentStep(
                        step_number=len(steps) + 1,
                        step_type="THINK",
                        output=f"Reformulated query: {new_query}",
                    ))

        # Generate final answer
        source_label = " + ".join(context_source_notes) or "Mixed sources"
        context_text = "\n\n".join([
            f"[Source: {c.source}]\n{c.content}"
            for c in final_context_chunks[:6]
        ])
        message = client.messages.create(
            model="claude-opus-4-6",
            max_tokens=900,
            messages=[{
                "role": "user",
                "content": self._GENERATE_PROMPT.format(
                    question=question,
                    context=f"[Sources: {source_label}]\n\n{context_text}",
                )
            }]
        )
        llm_calls += 1
        final_answer = message.content[0].text.strip()
        steps.append(AgentStep(
            step_number=len(steps) + 1,
            step_type="GENERATE",
            output=final_answer,
        ))

        return AgenticRAGResult(
            original_query=question,
            final_answer=final_answer,
            steps=steps,
            total_llm_calls=llm_calls,
            total_retrievals=retrievals,
            strategy_used="CORRECTIVE_RAG",
        )

# ─────────────────────────────────────────────
# Orchestrator: Auto-Select Strategy
# ─────────────────────────────────────────────
class AgenticRAGOrchestrator:
    """
    Routes complex questions to the appropriate agentic RAG pattern.
    Selection heuristics:
    - Simple, single-facet questions → Self-RAG (efficient, iterative)
    - Multi-facet questions requiring sequential knowledge → Multi-Hop
    - Open-ended research with unknown information needs → ReAct (most flexible)
    - Questions that may be out-of-corpus → Corrective RAG
    """

    _COMPLEXITY_PROMPT = """Classify this question's complexity and retrieval pattern:
SIMPLE - Single information need, one retrieval hop likely sufficient
ITERATIVE - May need 2-3 refinement hops but mostly sequential
MULTI_HOP - Clearly requires multiple sequential information lookups
OPEN_ENDED - Complex research question, unknown information needs
OUT_OF_CORPUS - May require information beyond the internal knowledge base
Return ONLY one of: SIMPLE, ITERATIVE, MULTI_HOP, OPEN_ENDED, OUT_OF_CORPUS
Question: {question}
Classification:"""

    _STRATEGY_MAP = {
        "SIMPLE": "SELF_RAG",
        "ITERATIVE": "SELF_RAG",
        "MULTI_HOP": "MULTI_HOP",
        "OPEN_ENDED": "REACT",
        "OUT_OF_CORPUS": "CORRECTIVE",
    }

    def __init__(self):
        self.self_rag = SelfRAGAgent()
        self.react = ReActRAGAgent()
        self.multi_hop = MultiHopRAGPipeline()
        self.corrective = CorrectiveRAGPipeline()

    def auto_select_strategy(self, question: str) -> str:
        """Classify question complexity and select the appropriate strategy."""
        message = client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=20,
            messages=[{
                "role": "user",
                "content": self._COMPLEXITY_PROMPT.format(question=question)
            }]
        )
        raw = message.content[0].text.strip().upper()
        # Tolerate punctuation or extra words around the label (e.g. "MULTI_HOP.")
        for label, strategy in self._STRATEGY_MAP.items():
            if label in raw:
                return strategy
        return "SELF_RAG"  # Default when no known label is found

    def execute(self, question: str, force_strategy: Optional[str] = None) -> AgenticRAGResult:
        """Route question to the appropriate agentic RAG strategy."""
        strategy = force_strategy or self.auto_select_strategy(question)
        print(f"\nStrategy selected: {strategy}")
        print(f"Question: {question[:80]}...")
        if strategy == "SELF_RAG":
            return self.self_rag.iterative_retrieve(question, max_iterations=3)
        elif strategy == "REACT":
            return self.react.agentic_loop(question, max_steps=8)
        elif strategy == "MULTI_HOP":
            return self.multi_hop.run(question)
        elif strategy == "CORRECTIVE":
            return self.corrective.run(question)
        else:
            return self.self_rag.iterative_retrieve(question)

# ─────────────────────────────────────────────
# Demo
# ─────────────────────────────────────────────
def run_demo():
    orchestrator = AgenticRAGOrchestrator()
    test_questions = [
        # Multi-hop: requires SEC + CSRD + intersection
        "What are the implications of the 2023 SEC climate disclosure rules for companies that also have European CSRD obligations?",
        # Simple iterative: single domain
        "Why do I get 504 errors when my API server is under high load?",
    ]
    for question in test_questions:
        print(f"\n{'=' * 70}")
        result = orchestrator.execute(question)
        print(f"Strategy: {result.strategy_used}")
        print(f"LLM calls: {result.total_llm_calls}")
        print(f"Retrievals: {result.total_retrievals}")
        print(f"Steps: {len(result.steps)}")
        print(f"Answer:\n{result.final_answer[:400]}...")


if __name__ == "__main__":
    run_demo()
Self-RAG Iterative Loop
ReAct + RAG Tool Loop
Corrective RAG with Web Search Fallback
Multi-Hop Sequential Retrieval
Production Engineering Notes
Cost Control: Agentic Loops Multiply API Calls
The primary production risk with agentic RAG is unbounded cost. A ReAct loop that makes 8 retrieval calls, each triggering claude-opus-4-6 to reason about results, can cost 10-20x more than standard single-shot RAG.
Cost control strategies:
- Budget enforcement: Track token usage per query. If accumulated cost exceeds a per-query budget (e.g., $0.10), force the agent to synthesize with what it has.
- Step limits: Hard cap on the number of agent iterations (e.g., max_steps=8). Log queries that hit the cap - they are either very complex or the agent is looping.
- Strategy tiering: Route simple queries to single-shot RAG. Use agentic patterns only for complex queries that meet a complexity threshold (detected by the router at haiku cost).
- Model tiering: Use claude-haiku-4-5-20251001 for retrieval decisions, relevance scoring, and sub-question answering. Reserve claude-opus-4-6 for final synthesis only.
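Budget enforcement can be sketched as a small per-query accumulator. This is a minimal illustration, not part of the lesson's pipeline: the `QueryBudget` class, the tier names, and the prices in `PRICE_PER_MTOK` are all hypothetical placeholders, so substitute your provider's actual rates.

```python
from dataclasses import dataclass

# Hypothetical prices in USD per million tokens; replace with real rates.
PRICE_PER_MTOK = {
    "large": {"input": 5.00, "output": 25.00},
    "small": {"input": 1.00, "output": 5.00},
}


@dataclass
class QueryBudget:
    """Tracks accumulated spend for one query and signals when to stop."""
    limit_usd: float = 0.10
    spent_usd: float = 0.0

    def record(self, tier: str, input_tokens: int, output_tokens: int) -> None:
        """Add the cost of one LLM call to the running total."""
        price = PRICE_PER_MTOK[tier]
        self.spent_usd += (
            input_tokens * price["input"] + output_tokens * price["output"]
        ) / 1_000_000

    @property
    def exhausted(self) -> bool:
        """True once the per-query limit has been reached."""
        return self.spent_usd >= self.limit_usd


budget = QueryBudget(limit_usd=0.10)
budget.record("large", input_tokens=10_000, output_tokens=1_000)
if budget.exhausted:
    print("Budget reached: synthesize with the evidence gathered so far")
```

In an agent loop, one would call `record` after every model response (the Anthropic SDK reports token counts on `response.usage.input_tokens` and `response.usage.output_tokens`) and check `exhausted` before starting the next step.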
Latency: Sequential Retrieval Adds Real Latency
Agentic RAG is fundamentally sequential - each hop's output informs the next hop's query. You cannot parallelize sequential dependencies.
Parallelism opportunities:
- Independent sub-questions in multi-hop: If sub-questions A and B have no dependency, retrieve them in parallel using asyncio.gather.
- Relevance scoring: Score all retrieved chunks in parallel - each scoring call is independent.
- Initial retrieval breadth: Retrieve more chunks initially (k=10) to reduce the chance of needing a second retrieval pass.
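Parallel relevance scoring might look like the sketch below. The `score_chunk` coroutine is a hypothetical stand-in that uses word overlap and a short sleep in place of a real async LLM call; `score_all` and `max_concurrency` are likewise illustrative names. The point is the pattern: independent calls go through `asyncio.gather`, with a semaphore to stay within rate limits.

```python
import asyncio


async def score_chunk(question: str, passage: str) -> float:
    """Placeholder scorer standing in for one async LLM relevance call."""
    await asyncio.sleep(0.01)  # simulate network latency
    q_words = set(question.lower().split())
    p_words = set(passage.lower().split())
    return len(q_words & p_words) / max(len(q_words), 1)


async def score_all(
    question: str, passages: list[str], max_concurrency: int = 8
) -> list[float]:
    """Score every passage concurrently; results keep the input order."""
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(passage: str) -> float:
        async with sem:  # cap the number of in-flight calls
            return await score_chunk(question, passage)

    return await asyncio.gather(*(bounded(p) for p in passages))


scores = asyncio.run(score_all(
    "climate disclosure rules",
    ["SEC climate disclosure rules apply", "unrelated text"],
))
print(scores)
```

With six chunks per retrieval, the scoring stage then takes roughly one call's latency instead of six, at the same token cost.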
Target latencies for production:
- Self-RAG (1-2 iterations): 2-4 seconds
- Multi-hop (3 hops): 4-8 seconds
- ReAct (5 steps): 5-10 seconds
For interactive applications, show incremental results - display retrieved evidence as it arrives rather than waiting for synthesis.
Loop Detection and Termination
Agentic systems can loop. Common patterns:
- The agent repeatedly retrieves the same documents with slight query variations
- Relevance scoring oscillates - good then bad then good - never terminating
- Sub-question decomposition produces questions that depend on each other
Guards:
- Deduplication: Track chunk IDs across all retrievals. Stop retrieving when new retrievals produce no new unique chunks.
- Query deduplication: Track normalized query strings (or their hashes). If a query has already been issued, do not issue it again.
- Divergence detection: If relevance scores are not improving after 2 iterations of REFORMULATE, escalate to WEB_SEARCH or return a partial answer with an explicit acknowledgment of limitations.
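The three guards above can be combined in one small helper. This is a sketch under assumptions: `RetrievalLoopGuard`, its method names, and the `patience` parameter are hypothetical, and chunk IDs are assumed to be strings.

```python
import hashlib


class RetrievalLoopGuard:
    """Tracks seen chunks, seen queries, and score progress for one query."""

    def __init__(self, patience: int = 2):
        self.seen_chunk_ids: set[str] = set()
        self.seen_queries: set[str] = set()
        self.best_score = float("-inf")
        self.stalled_rounds = 0
        self.patience = patience  # rounds without improvement before giving up

    @staticmethod
    def _fingerprint(query: str) -> str:
        # Normalize case and whitespace so trivial rephrasings hash the same
        normalized = " ".join(query.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def is_repeat_query(self, query: str) -> bool:
        """Return True if this query was already issued; otherwise record it."""
        fp = self._fingerprint(query)
        if fp in self.seen_queries:
            return True
        self.seen_queries.add(fp)
        return False

    def count_new_chunks(self, chunk_ids: list[str]) -> int:
        """Record the retrieved IDs and return how many were not seen before."""
        new_ids = set(chunk_ids) - self.seen_chunk_ids
        self.seen_chunk_ids |= new_ids
        return len(new_ids)

    def is_diverging(self, aggregate_score: float) -> bool:
        """Return True once scores have failed to improve for `patience` rounds."""
        if aggregate_score > self.best_score:
            self.best_score = aggregate_score
            self.stalled_rounds = 0
        else:
            self.stalled_rounds += 1
        return self.stalled_rounds >= self.patience
```

A loop would stop retrieving when `is_repeat_query` returns True or `count_new_chunks` returns 0, and escalate to web search or a partial answer when `is_diverging` returns True.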
Observability: Trace Every Step
Agentic RAG without observability is unmaintainable. When an answer is wrong, you need to know whether it was wrong because of poor retrieval (fix: improve retrieval strategy), poor chunk quality (fix: re-ingest documents), poor reasoning (fix: update prompts), or poor synthesis (fix: update synthesis prompt).
Instrument at minimum:
- Each retrieval call: query, timestamp, chunk IDs returned, scores
- Each LLM call: model, token counts, latency, input/output (truncated)
- Each evaluation decision: ISREL scores, RETRIEVE/NO_RETRIEVE decisions
- Final answer: query, answer, total cost, total latency, step count
Store traces in a structured log or observability platform (Langfuse, Weave, Arize). Sample 5-10% of production traces for human evaluation weekly.
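As a minimal sketch of what such a trace could look like before it is shipped to a platform, the snippet below collects events in memory and serializes them as JSON Lines. `TraceEvent`, `QueryTrace`, and the field names are hypothetical; a hosted tool would replace this with its own SDK.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field


@dataclass
class TraceEvent:
    trace_id: str
    kind: str      # e.g. "retrieval", "llm_call", "evaluation", "final"
    payload: dict  # event-specific fields (query, chunk IDs, token counts, ...)
    timestamp: float = field(default_factory=time.time)


class QueryTrace:
    """Collects every step of one agentic query under a shared trace ID."""

    def __init__(self, query: str):
        self.trace_id = uuid.uuid4().hex
        self.query = query
        self.events: list[TraceEvent] = []

    def log(self, kind: str, **payload) -> None:
        self.events.append(TraceEvent(self.trace_id, kind, payload))

    def to_jsonl(self) -> str:
        """One JSON object per line, ready for a structured log sink."""
        return "\n".join(json.dumps(asdict(e)) for e in self.events)


trace = QueryTrace("How do SEC and CSRD timelines differ?")
trace.log("retrieval", query="SEC climate rule timeline", chunk_ids=["c1", "c7"], scores=[0.82, 0.41])
trace.log("llm_call", model="claude-haiku-4-5-20251001", input_tokens=812, output_tokens=9, latency_ms=430.0)
print(trace.to_jsonl())
```

Keeping one trace ID per query makes it possible to tell a retrieval failure from a synthesis failure by reading the events in order.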
:::tip When to Use Agentic vs. Standard RAG
Use standard single-shot RAG when:
- The question has a single clear information need
- The corpus is high-quality and well-indexed
- Latency under 1 second is required
- Cost per query must be minimized
Use agentic RAG when:
- The question requires information from multiple non-adjacent parts of the corpus
- Single-shot retrieval quality is demonstrably poor (measure recall@K offline)
- The question type is inherently multi-hop
- Answer quality justifies 3-5x higher cost and latency
Start with standard RAG. Move to agentic when you have evidence - from evaluation metrics or user feedback - that single-shot retrieval is insufficient.
:::
:::warning Self-RAG Reflection Quality Depends on Prompt Quality
The ISREL, ISSUP, and RETRIEVE/NO_RETRIEVE decisions in Self-RAG are made by the LLM evaluating its own output. This has a well-known failure mode: the LLM is biased toward reporting its own generations as "well-grounded" even when they contain hallucinated claims.
Mitigation:
- Use claim-level verification, not response-level: extract each factual claim separately and verify it against specific retrieved text, not the full context
- Use a different model for verification than for generation - cross-model evaluation is less biased than self-evaluation
- Run ISSUP verification on a random sample of production outputs as a quality audit
:::
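Claim-level verification can be sketched as follows. Everything here is illustrative: `split_claims` uses naive sentence splitting where a real system would likely use an LLM extractor, and `keyword_overlap_verifier` is a crude stand-in for a verification call to a second model. The structure is what matters: each claim is checked on its own, and the verifier is injectable so it can differ from the generator.

```python
import re
from typing import Callable


def split_claims(answer: str) -> list[str]:
    """Naive claim extraction: treat each sentence as one claim."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]


def keyword_overlap_verifier(claim: str, passage: str, min_overlap: float = 0.6) -> bool:
    """Stand-in verifier: share of the claim's content words found in the passage."""
    claim_words = {w for w in re.findall(r"[a-z0-9]+", claim.lower()) if len(w) > 3}
    passage_words = set(re.findall(r"[a-z0-9]+", passage.lower()))
    if not claim_words:
        return False
    return len(claim_words & passage_words) / len(claim_words) >= min_overlap


def verify_claims(
    answer: str,
    passages: list[str],
    is_supported: Callable[[str, str], bool] = keyword_overlap_verifier,
) -> dict[str, bool]:
    """Map each claim to whether at least one passage supports it."""
    return {
        claim: any(is_supported(claim, p) for p in passages)
        for claim in split_claims(answer)
    }


passages = ["The CSRD requires double materiality assessments for in-scope companies."]
answer = (
    "The CSRD requires double materiality assessments. "
    "The SEC mandates Scope 3 reporting for everyone."
)
for claim, ok in verify_claims(answer, passages).items():
    print("SUPPORTED  " if ok else "UNSUPPORTED", claim)
```

A response-level check could pass this answer because half of it is grounded; the per-claim report flags the second sentence on its own.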
:::danger Infinite Retrieval Loops
Agentic RAG loops can spiral into infinite retrieval without hard termination guards. A common failure pattern: the agent retrieves, finds insufficient context, reformulates, retrieves again, finds the same insufficient context, reformulates again with a slightly different phrasing, and continues indefinitely.
Always implement:
- Hard iteration cap (max_iterations, max_steps)
- Chunk ID deduplication - stop if a new retrieval produces no new unique chunks
- Query hash tracking - never issue the same query twice
- Budget cap - stop if accumulated token cost exceeds the per-query limit
Without all four guards, a single complex query can result in dozens of API calls and significant unexpected cost.
:::
Interview Q&A
Q1: What is the difference between standard RAG and agentic RAG? When does each break down?
Standard RAG is a fixed pipeline: retrieve once → generate once → done. It breaks down for three classes of questions: (1) multi-hop questions that require building knowledge step by step, where the information needed for step 2 cannot be known until step 1 is answered; (2) ambiguous questions where the right retrieval query cannot be known from the raw user query alone; (3) questions that fall partially or fully outside the corpus, where the single retrieval pass returns poor context and there is no recovery mechanism.
Agentic RAG replaces the pipeline with a loop: retrieve → evaluate → generate → check grounding → retrieve again if needed → synthesize. It handles multi-hop questions through sequential hopping, handles ambiguous questions by refining queries based on retrieval results, and handles poor retrieval through corrective mechanisms (reformulation, web search fallback). The tradeoff is 3-10x higher latency and cost. Agentic RAG is not universally better - it is better for complex questions where single-shot retrieval demonstrably fails.
Q2: Explain Self-RAG. What are the four reflection tokens and what failure mode does each detect?
Self-RAG (Asai et al., 2023) trains the LLM to interleave generation with self-reflection using four special decision tokens:
Retrieve: Should additional retrieval happen before generating the next segment? Detects when the model is about to generate content that requires external knowledge it may not have. Without this token, the model generates regardless of whether it has adequate knowledge.
ISREL: Is the retrieved passage relevant to the query? Detects when the retrieval system returns plausible-but-wrong documents - documents that match the query surface-form but do not actually answer it. Filters noise from the context window before generation.
ISSUP: Is the generated segment supported by the retrieved passage? Detects hallucination - when the model generates claims that go beyond what the retrieved context actually says. This is the primary hallucination detection mechanism in Self-RAG.
ISUSE: Is the overall response useful? A final quality gate. A response can be fully grounded but still not useful - it might answer a subtly different question than the one asked.
In production without custom fine-tuning, you implement these as explicit LLM classification calls using a cheap model like claude-haiku-4-5-20251001.
Q3: How does RAPTOR differ from flat chunk retrieval, and when is the overhead justified?
RAPTOR (Sarthi et al., Stanford, 2024) builds a hierarchical tree of document summaries: leaf nodes are raw chunks, higher nodes are progressively abstracted summaries of their children. At retrieval time, questions are answered from the appropriate tree level: abstract questions from high-level summary nodes, specific questions from leaf nodes.
Flat chunk retrieval fails at the extremes. For a high-level question like "What is this company's overall approach to climate risk?", no individual leaf chunk contains the answer - it emerges from the aggregate of many chunks. A leaf-level search returns specific numeric details that don't constitute an answer. For this question, a root-level summary retrieval is appropriate.
For a specific question like "What is the exact Scope 2 reporting deadline for accelerated filers?", a root-level summary cannot give the precise answer - only a leaf chunk containing the specific regulatory text can. Leaf-level retrieval is appropriate.
RAPTOR overhead: tree construction adds O(N × depth) LLM calls during ingestion, where N is the number of chunks and depth is the tree height (typically 3-4 levels). This is a one-time offline cost. Retrieval overhead at inference time is minimal - you query the appropriate level based on question classification. The overhead is justified when your corpus is large (100K+ chunks) and your queries span multiple abstraction levels.
Q4: CRAG adds a relevance evaluation step. What evidence would you look for in production to know whether CRAG is worth deploying?
Deploy CRAG if you observe any of these signals in production:
(1) High "I don't know" rate from the generator: When the LLM frequently says "I don't have sufficient information," the retrieval is failing. CRAG's reformulation and web fallback addresses this directly.
(2) High hallucination rate on out-of-scope queries: Track queries where users mark answers as incorrect. If incorrect answers cluster around questions that are near but not quite covered by your corpus, CRAG's relevance evaluation and fallback would catch these.
(3) Corpus coverage measurement: Sample 100 random user queries. For each, manually score whether the top-5 retrieved chunks are actually relevant. If aggregate relevance score is below 0.5 for more than 20% of queries, CRAG's corrective mechanisms have clear targets to improve.
(4) Query distribution shift: If your user queries are drifting toward topics not well-covered in your corpus - due to new products, recent events, or expanding user base - CRAG's web fallback provides graceful degradation rather than confident wrong answers.
Measure CRAG impact by A/B testing: faithfulness score (are claims grounded in retrieved content?) and answer correctness (sampled human evaluation) before and after deployment.
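The coverage measurement in signal (3) reduces to a small calculation over the manually scored sample. The function name `coverage_gap_rate` and the sample scores below are made up for illustration; the 0.5 floor and 20% trigger come from the text above.

```python
from statistics import mean


def coverage_gap_rate(
    per_query_scores: list[list[float]], relevance_floor: float = 0.5
) -> float:
    """Fraction of queries whose mean top-k relevance falls below the floor."""
    if not per_query_scores:
        return 0.0
    weak = sum(
        1 for scores in per_query_scores
        if not scores or mean(scores) < relevance_floor
    )
    return weak / len(per_query_scores)


# Illustrative sample: one list of human-assigned chunk scores per query
sample = [
    [0.9, 0.8, 0.7],
    [0.2, 0.1, 0.3],
    [0.6, 0.7, 0.8],
    [0.0, 0.1, 0.2],
]
rate = coverage_gap_rate(sample)
print(f"Weak-retrieval rate: {rate:.0%}")
print("CRAG has clear targets" if rate > 0.20 else "Retrieval coverage looks adequate")
```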
Q5: How do you prevent an agentic RAG loop from running indefinitely and running up API costs?
Four complementary guards, all required:
Hard step cap: Never allow more than N LLM calls per query. N should be based on your latency and cost budgets. At N=8 with claude-opus-4-6, you are looking at roughly $0.06-0.10 per query. Track queries that hit the cap - they indicate either genuinely complex questions (expected) or prompt/loop logic issues (bugs).
Chunk deduplication: After each retrieval, compare the returned chunk IDs against all previously retrieved chunk IDs. If the new retrieval returns no chunks not already in context, there is no information gain from further retrieval. Terminate and synthesize with available context.
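Query fingerprinting: Normalize each query string (lowercase, collapse whitespace) and hash it before issuing it. If the same hash has already been issued in this session, do not issue it again. This catches reformulation loops where the model regenerates an identical or trivially reworded query. Paraphrases that differ in wording but return the same results hash differently; the chunk deduplication guard catches those.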
Budget enforcement: Track accumulated token usage in real time. If usage exceeds a per-query budget (e.g., 10,000 tokens), terminate the loop immediately and generate a partial answer with a note: "Answer based on partial research due to query budget limit."
Log all termination reasons separately. "Hit max steps" should be rare (under 5% of agentic queries). If it is common, your routing is sending too-simple queries to agentic pipelines, or your agent is not converging efficiently.
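Logging termination reasons separately could be as simple as the sketch below. The `Termination` labels and the `termination_report` helper are hypothetical names; the 5% alert threshold mirrors the figure mentioned above.

```python
from collections import Counter
from enum import Enum


class Termination(str, Enum):
    """Why an agentic loop stopped (labels are illustrative)."""
    COMPLETED = "completed"
    MAX_STEPS = "max_steps"
    NO_NEW_CHUNKS = "no_new_chunks"
    REPEATED_QUERY = "repeated_query"
    BUDGET_EXCEEDED = "budget_exceeded"


def termination_report(
    reasons: list[Termination], max_steps_alert: float = 0.05
) -> dict:
    """Summarize termination reasons and flag an elevated max-steps rate."""
    counts = Counter(reasons)
    total = len(reasons)
    rate = counts[Termination.MAX_STEPS] / total if total else 0.0
    return {
        "counts": {reason.value: counts[reason] for reason in Termination},
        "max_steps_rate": rate,
        "alert": rate > max_steps_alert,
    }


reasons = [Termination.COMPLETED] * 18 + [Termination.MAX_STEPS] * 2
print(termination_report(reasons))
```

An alert here is a prompt to inspect the traces of the capped queries, since the rate alone cannot distinguish hard questions from a non-converging agent.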
Summary
Agentic RAG exists because single-shot retrieval is architecturally incapable of answering multi-hop questions correctly. The five patterns covered in this lesson address different dimensions of the problem:
| Pattern | Key Mechanism | Best For | Latency | Cost |
|---|---|---|---|---|
| Self-RAG | ISREL/ISSUP reflection, conditional retrieval | Iterative refinement, grounding verification | Medium | Medium |
| FLARE | Token-level confidence monitoring | Long-form generation with uncertain facts | Medium | Medium |
| ReAct + RAG | Tool use API, agent loop | Open-ended research, unknown information needs | High | High |
| RAPTOR | Hierarchical summary tree retrieval | Mixed-specificity questions over large corpora | Low (offline build) | Low (at query time) |
| CRAG | Post-retrieval relevance evaluation + correction | Systems where corpus coverage is uncertain | Medium | Medium |
Do not default to agentic RAG. Start with standard RAG, measure where it fails, and deploy agentic patterns specifically where measurement shows they help. Agentic RAG is a powerful tool for genuinely complex questions - but it is expensive, slow, and harder to debug. Use it where it is truly needed.
