
Agentic RAG

The Research Task That Broke Every Static Pipeline

A hedge fund's ML team had built a solid RAG system for their investment research. Analysts could ask natural language questions and get answers grounded in their document corpus - earnings transcripts, SEC filings, internal models, market data.

Then a senior analyst submitted a query: "Given current interest rate dynamics, compare how our top 5 portfolio companies discussed margin compression in their last two earnings calls, and flag any that seem to be taking on additional leverage to compensate."

This query required: (1) identifying the top 5 portfolio companies from an internal database, (2) retrieving the last two earnings transcripts for each (10 documents), (3) searching for margin compression commentary in each, (4) searching for leverage-related disclosures, (5) comparing findings across companies, (6) applying domain judgment to "flag" concerning patterns.

Six dependent steps, where each step's input depended on the results of the previous one. No static RAG pipeline could do this. Even iterative RAG struggled - it had no way to call an internal database, cross-reference findings across 10 documents, or make domain-specific judgments about what constitutes a concerning leverage pattern.

What they needed wasn't a better retrieval algorithm. They needed an agent that could reason about what information it needed, decide which tools to call, interpret the results, and decide what to do next - in a loop.

This is agentic RAG: retrieval as a tool in a reasoning loop, not as a fixed pipeline step.

Why This Exists: The Static Pipeline Limitation

Static RAG has one retrieval step: embed query, retrieve chunks, generate answer. This works when:

  • The query maps directly to a retrievable fact
  • One retrieval is sufficient to gather all needed context
  • The query doesn't require understanding the results before knowing what to retrieve next

Static RAG fails when:

  • Retrieval is multi-step: what to retrieve next depends on what you retrieved first
  • Multiple data sources: some information is in a vector DB, some in a SQL database, some requires a web search
  • Query requires interpretation: the user's question implies retrieval strategies that aren't obvious from the literal words
  • Verification is needed: after generating an answer, check whether it's actually grounded and correct
  • The task is open-ended: "research X and tell me what I need to know" rather than "answer question Y"

Agents solve these problems by making retrieval one of many tools available to a reasoning loop, rather than a fixed step in a predetermined pipeline.

The Core Architecture: ReAct + RAG

ReAct (Reasoning + Acting, Yao et al. 2022) is the foundational pattern for agentic systems. The agent alternates between reasoning ("what do I need to do?") and acting (calling a tool). Each action's result informs the next reasoning step.

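In agentic RAG, the "act" step includes retrieval tools: in each Thought → Action → Observation cycle the agent can issue a search, inspect what came back, and decide what to retrieve next.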

Building a Minimal Agentic RAG with OpenAI Function Calling

from openai import OpenAI
import json
from typing import List, Dict, Any, Optional

client = OpenAI()

# Define retrieval tools for the agent
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "search_knowledge_base",
            "description": (
                "Search the internal knowledge base for relevant documents. "
                "Use for company policies, product documentation, internal procedures."
            ),
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {
                        "type": "string",
                        "description": "The semantic search query",
                    },
                    "top_k": {
                        "type": "integer",
                        "description": "Number of results to return (1-10)",
                        "default": 5,
                    },
                    "filter_doc_type": {
                        "type": "string",
                        "description": "Optional: filter by document type (faq, policy, manual, etc.)",
                    },
                },
                "required": ["query"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "search_web",
            "description": (
                "Search the web for current information. "
                "Use for recent news, current events, information that might be outdated in the knowledge base."
            ),
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {
                        "type": "string",
                        "description": "The web search query",
                    },
                },
                "required": ["query"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "query_database",
            "description": (
                "Query structured data from the internal database. "
                "Use for customer records, order history, numerical data, statistics."
            ),
            "parameters": {
                "type": "object",
                "properties": {
                    "sql_intent": {
                        "type": "string",
                        "description": "Natural language description of what data you need",
                    },
                },
                "required": ["sql_intent"],
            },
        },
    },
]


class ToolExecutor:
    """Handles tool execution for the agent."""

    def __init__(self, vector_store):
        self.vector_store = vector_store

    def execute(self, tool_name: str, tool_args: Dict[str, Any]) -> str:
        if tool_name == "search_knowledge_base":
            return self._search_kb(**tool_args)
        elif tool_name == "search_web":
            return self._search_web(**tool_args)
        elif tool_name == "query_database":
            return self._query_db(**tool_args)
        else:
            return f"Unknown tool: {tool_name}"

    def _search_kb(self, query: str, top_k: int = 5, filter_doc_type: Optional[str] = None) -> str:
        filters = {}
        if filter_doc_type:
            filters["doc_type"] = filter_doc_type
        results = self.vector_store.search(query, top_k=top_k, filters=filters)
        if not results:
            return "No relevant documents found in the knowledge base."
        formatted = "\n\n".join([
            f"[Source: {r.get('metadata', {}).get('source', 'unknown')}]\n{r['text']}"
            for r in results
        ])
        return f"Found {len(results)} relevant documents:\n\n{formatted}"

    def _search_web(self, query: str) -> str:
        # In production: use Tavily, SerpAPI, or Bing Search API
        return f"[Web search for '{query}' - integrate with Tavily/SerpAPI in production]"

    def _query_db(self, sql_intent: str) -> str:
        # In production: use text-to-SQL model or LLM to generate and execute SQL
        return f"[Database query for '{sql_intent}' - integrate with your database in production]"


def run_agentic_rag(
    user_query: str,
    tool_executor: ToolExecutor,
    model: str = "gpt-4o",
    max_iterations: int = 10,
) -> str:
    """
    Run a ReAct-style agentic RAG loop.
    The agent decides when to retrieve and which tools to use.
    """
    messages = [
        {
            "role": "system",
            "content": (
                "You are a helpful research assistant with access to several retrieval tools. "
                "For complex questions, use multiple tool calls to gather comprehensive information. "
                "Think carefully about what information you need before calling each tool. "
                "After gathering sufficient information, provide a comprehensive, well-cited answer. "
                "Always base your final answer on retrieved information, not prior knowledge."
            ),
        },
        {
            "role": "user",
            "content": user_query,
        },
    ]

    for _ in range(max_iterations):
        # Call the model
        response = client.chat.completions.create(
            model=model,
            messages=messages,
            tools=TOOLS,
            tool_choice="auto",  # model decides whether to use tools
        )

        message = response.choices[0].message
        messages.append(message)

        # No tool calls means the model has finished - return the final answer
        if not message.tool_calls:
            return message.content or ""

        # Execute each tool call; every call must get a matching "tool" message
        for tool_call in message.tool_calls:
            tool_name = tool_call.function.name
            try:
                tool_args = json.loads(tool_call.function.arguments)
                print(f"Agent calling: {tool_name}({tool_args})")
                tool_result = tool_executor.execute(tool_name, tool_args)
            except (json.JSONDecodeError, TypeError) as exc:
                # Malformed or unexpected arguments: report back so the model can retry
                tool_result = f"Invalid arguments for {tool_name}: {exc}"

            # Add tool result to message history
            messages.append({
                "role": "tool",
                "content": tool_result,
                "tool_call_id": tool_call.id,
            })

    return "Maximum iterations reached. Partial results may be incomplete."


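# Usage
# your_vector_store is a placeholder: any object exposing
# search(query, top_k=..., filters=...) that returns dicts with "text" and "metadata"
executor = ToolExecutor(vector_store=your_vector_store)
answer = run_agentic_rag(
    "What is our current refund policy for premium customers?",
    tool_executor=executor,
)
print(answer)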
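
The usage snippet above relies on a `your_vector_store` object that the article never defines. To try the loop without a real vector database, a stand-in along the following lines may be enough. This is only a sketch: the class name `InMemoryVectorStore` and the sample documents are hypothetical, and word overlap is used in place of real embedding similarity.

```python
from typing import Any, Dict, List, Optional


class InMemoryVectorStore:
    """Toy stand-in for a vector store: ranks by word overlap, not embeddings."""

    def __init__(self, docs: List[Dict[str, Any]]):
        # Each doc: {"text": str, "metadata": {"source": str, "doc_type": str}}
        self.docs = docs

    def search(
        self,
        query: str,
        top_k: int = 5,
        filters: Optional[Dict[str, str]] = None,
    ) -> List[Dict[str, Any]]:
        query_words = set(query.lower().split())
        scored = []
        for doc in self.docs:
            metadata = doc.get("metadata", {})
            # Skip documents that fail any metadata filter
            if filters and any(metadata.get(k) != v for k, v in filters.items()):
                continue
            overlap = len(query_words & set(doc["text"].lower().split()))
            if overlap > 0:
                scored.append((overlap, doc))
        # Highest overlap first; ties keep insertion order
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [doc for _, doc in scored[:top_k]]


store = InMemoryVectorStore([
    {"text": "Premium customers can request a refund within 60 days.",
     "metadata": {"source": "refund_policy.md", "doc_type": "policy"}},
    {"text": "Reset your password from the account settings page.",
     "metadata": {"source": "faq.md", "doc_type": "faq"}},
])
hits = store.search("refund policy for premium customers", top_k=3)
print([h["metadata"]["source"] for h in hits])
```

Passing `store` as `vector_store` to `ToolExecutor` should then let the knowledge-base tool return real text, although the model calls still need an API key.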

LangGraph: Stateful Agentic RAG

For production systems, LangGraph provides graph-based state management - each node in the graph is a function, and edges represent transitions between states. This makes complex agentic workflows explicit, debuggable, and resumable.

from langgraph.graph import StateGraph, END
from langgraph.prebuilt import ToolNode
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, SystemMessage
from typing import TypedDict, Annotated, List
import operator

# Define the agent's state
class AgentState(TypedDict):
    messages: Annotated[List, operator.add]  # Messages accumulate
    query: str
    retrieved_context: List[str]
    iterations: int
    final_answer: str

# Define the nodes
def should_retrieve(state: AgentState) -> str:
    """Decide whether to retrieve more or answer."""
    last_message = state["messages"][-1]

    # Check the cap first; otherwise a model that keeps requesting tools never stops
    if state.get("iterations", 0) >= 5:
        return "generate_answer"

    # If the last message has tool calls, execute them
    if getattr(last_message, "tool_calls", None):
        return "execute_tools"

    # The model generated a final response, so we're done
    return "end"


def agent_node(state: AgentState, llm) -> AgentState:
    """The reasoning node: decides what tools to call."""
    from langchain_core.tools import tool

    @tool
    def search_documents(query: str) -> str:
        """Search the document knowledge base for relevant information."""
        # your_vector_store is a placeholder for your own store
        results = your_vector_store.search(query, top_k=5)
        return "\n\n".join([r["text"] for r in results])

    @tool
    def search_web(query: str) -> str:
        """Search the web for current information."""
        return "[Web search results would appear here]"

    tools = [search_documents, search_web]
    llm_with_tools = llm.bind_tools(tools)

    response = llm_with_tools.invoke(state["messages"])

    return {
        "messages": [response],
        "iterations": state.get("iterations", 0) + 1,
    }


def build_agentic_rag_graph():
    """Build the LangGraph agentic RAG workflow."""
    from langchain_core.tools import tool

    llm = ChatOpenAI(model="gpt-4o", temperature=0)

    # Define tools as LangChain tools
    @tool
    def search_documents(query: str) -> str:
        """Search the document knowledge base."""
        # your_vector_store is a placeholder for your own store
        results = your_vector_store.search(query, top_k=5)
        return "\n\n".join([r["text"] for r in results])

    @tool
    def search_web(query: str) -> str:
        """Search the web for current information."""
        # Integrate with Tavily/SerpAPI here
        return f"[Web results for: {query}]"

    tools = [search_documents, search_web]
    llm_with_tools = llm.bind_tools(tools)

    def call_agent(state: AgentState) -> dict:
        # Compact version of the reasoning node: one model call per visit
        response = llm_with_tools.invoke(state["messages"])
        return {
            "messages": [response],
            "iterations": state.get("iterations", 0) + 1,
        }

    # Build the graph
    workflow = StateGraph(AgentState)

    # Add nodes
    workflow.add_node("agent", call_agent)
    workflow.add_node("tools", ToolNode(tools))

    # Add edges
    workflow.set_entry_point("agent")
    workflow.add_conditional_edges(
        "agent",
        # If last message has tool calls -> execute tools; else -> end.
        # There is no explicit iteration cap here; LangGraph's recursion limit
        # is what stops a model that never stops calling tools.
        lambda state: "tools" if state["messages"][-1].tool_calls else END,
    )
    workflow.add_edge("tools", "agent")  # After tools, go back to agent

    return workflow.compile()


# Run the graph
graph = build_agentic_rag_graph()

result = graph.invoke({
    "messages": [
        SystemMessage(content="You are a research assistant. Use tools to find accurate information."),
        HumanMessage(content="What is our refund policy for enterprise customers, and has it changed recently?"),
    ],
    "query": "enterprise customer refund policy",
    "retrieved_context": [],
    "iterations": 0,
    "final_answer": "",
})

final_message = result["messages"][-1]
print(final_message.content)

Router Agent: Multiple Knowledge Sources

A router agent classifies incoming queries and routes them to the most appropriate retrieval source.

from typing import Any, Dict
import json

class RouterAgent:
    """Routes queries to appropriate retrieval sources."""

    ROUTES = {
        "vector_search": "Questions about company policies, product documentation, procedures",
        "sql_database": "Questions about customer records, orders, numerical data, statistics",
        "web_search": "Questions about current events, recent news, external information",
        "code_search": "Questions about code implementations, APIs, technical specifications",
        "direct_answer": "Simple factual questions the model can answer from training data",
    }

    def classify_query(self, query: str) -> Dict[str, Any]:
        """Classify query and determine routing."""
        routes_desc = "\n".join([f"- {k}: {v}" for k, v in self.ROUTES.items()])

        # `client` is the OpenAI client created in the first example
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {
                    "role": "system",
                    "content": (
                        f"Classify the following query to the best retrieval source.\n"
                        f"Sources:\n{routes_desc}\n\n"
                        "Respond with JSON: {\"route\": \"source_name\", \"confidence\": 0-1, "
                        "\"reasoning\": \"why\", \"rewritten_query\": \"optimized query for this source\"}"
                    ),
                },
                {"role": "user", "content": query},
            ],
            temperature=0,
            response_format={"type": "json_object"},
        )
        return json.loads(response.choices[0].message.content)

    def route_and_retrieve(self, query: str) -> Dict[str, Any]:
        classification = self.classify_query(query)
        # The model's JSON is not guaranteed to contain every key, so read defensively
        route = classification.get("route", "direct_answer")
        rewritten = classification.get("rewritten_query", query)
        confidence = float(classification.get("confidence", 0.0))

        print(f"Routing '{query[:50]}...' → {route} (confidence: {confidence:.2f})")

        # execute_sql_query, web_search, and search_codebase are placeholders
        # for your own integrations, as is your_vector_store
        if route == "vector_search":
            results = your_vector_store.search(rewritten, top_k=5)
            context = "\n\n".join([r["text"] for r in results])
        elif route == "sql_database":
            # Use text-to-SQL pipeline
            context = execute_sql_query(rewritten)
        elif route == "web_search":
            context = web_search(rewritten)
        elif route == "code_search":
            context = search_codebase(rewritten)
        else:  # direct_answer (or an unrecognized route)
            context = None

        return {
            "route": route,
            "context": context,
            "rewritten_query": rewritten,
        }


router = RouterAgent()

# Route different query types
queries = [
    "What is our refund policy?",                  # → vector_search
    "How many orders did we process last month?",  # → sql_database
    "What is the current Fed funds rate?",         # → web_search
    "What is Python?",                             # → direct_answer
]

for q in queries:
    result = router.route_and_retrieve(q)
    print(f"Route: {result['route']}")

Reflection: Self-Verifying Agentic RAG

Add a reflection step where the agent evaluates its own answer before returning it to the user.

def agentic_rag_with_reflection(
    query: str,
    tool_executor: ToolExecutor,
    model: str = "gpt-4o",
) -> Dict[str, Any]:
    """
    Agentic RAG with self-reflection and verification.
    The agent can revise its answer if reflection reveals issues.
    """
    # Phase 1: Information gathering (same as before)
    initial_answer = run_agentic_rag(query, tool_executor, model)

    # Phase 2: Self-reflection
    # Note: the reviewer sees only the question and the draft, not the retrieved
    # sources, so it can flag gaps and suspicious claims but not verify grounding.
    reflection_response = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a quality control reviewer. Evaluate the answer for: "
                    "1. Factual accuracy relative to the question "
                    "2. Completeness - does it fully address the question? "
                    "3. Any unsupported claims that could be hallucinations "
                    "Respond with JSON: {\"quality\": \"high|medium|low\", "
                    "\"issues\": [...], \"needs_revision\": true/false, "
                    "\"missing_information\": \"what else is needed\"}"
                ),
            },
            {
                "role": "user",
                "content": f"Original question: {query}\n\nDraft answer: {initial_answer}",
            },
        ],
        temperature=0,
        response_format={"type": "json_object"},
    )

    reflection = json.loads(reflection_response.choices[0].message.content)

    was_revised = False
    final_answer = initial_answer

    if reflection.get("needs_revision") and reflection.get("missing_information"):
        # Phase 3: Targeted follow-up retrieval
        follow_up = reflection["missing_information"]
        additional_context = tool_executor.execute(
            "search_knowledge_base",
            {"query": follow_up},
        )

        # Phase 4: Revised answer
        revision_response = client.chat.completions.create(
            model=model,
            messages=[
                {
                    "role": "user",
                    "content": (
                        "Revise and improve the following answer based on additional context.\n\n"
                        f"Original question: {query}\n"
                        f"Draft answer: {initial_answer}\n\n"
                        f"Issues identified: {reflection.get('issues', [])}\n\n"
                        f"Additional context: {additional_context}"
                    ),
                }
            ],
        )
        final_answer = revision_response.choices[0].message.content
        was_revised = True

    return {
        "answer": final_answer,
        "quality": reflection.get("quality"),
        # True only if a revision was actually generated
        "was_revised": was_revised,
    }

Production Patterns: Caching, Circuit Breakers, Fallbacks

Agentic RAG systems have non-deterministic behavior and variable latency. Production deployments need guardrails:

import concurrent.futures
import hashlib
import time

class ProductionAgenticRAG:
    """Production-hardened agentic RAG with caching and fallbacks."""

    def __init__(self, tool_executor: ToolExecutor, cache_ttl: int = 3600):
        self.executor = tool_executor
        self.cache: Dict[str, Dict] = {}
        self.cache_ttl = cache_ttl
        self.max_retries = 3
        self.timeout_seconds = 30

    def _cache_key(self, query: str) -> str:
        # md5 is used only as a cache key here, not for security
        return hashlib.md5(query.lower().strip().encode()).hexdigest()

    def _is_cached(self, key: str) -> bool:
        if key not in self.cache:
            return False
        entry = self.cache[key]
        return (time.time() - entry["timestamp"]) < self.cache_ttl

    def query(self, user_query: str) -> Dict[str, Any]:
        cache_key = self._cache_key(user_query)

        # Check cache
        if self._is_cached(cache_key):
            print("Cache hit for query")
            return {**self.cache[cache_key]["result"], "cached": True}

        # Try agentic with retry on errors
        for attempt in range(self.max_retries):
            try:
                result = self._run_with_timeout(user_query)
            except TimeoutError:
                # Retrying after a timeout would multiply the user's wait; fall back now
                print(f"Timeout on attempt {attempt + 1}")
                break
            except Exception as e:
                print(f"Error on attempt {attempt + 1}: {e}")
                continue
            # Cache successful result
            self.cache[cache_key] = {
                "result": result,
                "timestamp": time.time(),
            }
            return result

        return self._fallback_rag(user_query)

    def _run_with_timeout(self, query: str) -> Dict:
        """Run agentic RAG with a wall-clock limit."""
        # No `with` block: its exit waits for the worker and would defeat the timeout
        pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
        future = pool.submit(run_agentic_rag, query, self.executor)
        try:
            answer = future.result(timeout=self.timeout_seconds)
            return {"answer": answer, "method": "agentic", "cached": False}
        except concurrent.futures.TimeoutError:
            raise TimeoutError(f"Query exceeded {self.timeout_seconds}s timeout") from None
        finally:
            # Threads cannot be killed; an abandoned call finishes in the background
            pool.shutdown(wait=False)

    def _fallback_rag(self, query: str) -> Dict:
        """Simple RAG fallback when agentic pipeline fails/times out."""
        print("Falling back to simple RAG")
        results = self.executor.vector_store.search(query, top_k=3)
        context = "\n\n".join([r["text"] for r in results])
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {
                    "role": "system",
                    "content": "Answer based only on the provided context. If insufficient, say so.",
                },
                {
                    "role": "user",
                    "content": f"Context:\n{context}\n\nQuestion: {query}",
                },
            ],
        )
        return {
            "answer": response.choices[0].message.content,
            "method": "simple_rag_fallback",
            "cached": False,
        }
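
The section heading mentions circuit breakers, but the class above only covers caching, retries, and fallback. One way a per-tool breaker could look is sketched below. It assumes a simple policy (open after a number of consecutive failures, allow a trial call after a cooldown); the names `CircuitBreaker`, `failure_threshold`, and `cooldown_seconds` are illustrative and not from any library.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Opens after N consecutive failures; allows a trial call after a cooldown."""

    def __init__(
        self,
        failure_threshold: int = 3,
        cooldown_seconds: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock  # injectable so the breaker can be tested without sleeping
        self.consecutive_failures = 0
        self.opened_at: Optional[float] = None

    def is_open(self) -> bool:
        if self.opened_at is None:
            return False
        # Once the cooldown has passed, let a trial call through (half-open)
        return (self.clock() - self.opened_at) < self.cooldown_seconds

    def call(self, fn: Callable[..., str], *args, fallback: str = "", **kwargs) -> str:
        if self.is_open():
            return fallback  # fail fast instead of waiting on a broken tool
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback
        # Success closes the circuit and resets the failure count
        self.consecutive_failures = 0
        self.opened_at = None
        return result


def flaky_search(query: str) -> str:
    raise ConnectionError("vector DB unavailable")


fake_now = [0.0]  # fake clock so the example runs instantly
breaker = CircuitBreaker(failure_threshold=2, cooldown_seconds=10.0,
                         clock=lambda: fake_now[0])
outputs = [
    breaker.call(flaky_search, "refund policy", fallback="[search unavailable]")
    for _ in range(3)
]
print(outputs, breaker.is_open())
```

The fallback string is returned to the agent as the tool result, so the model can continue with other tools instead of stalling on the failing one.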

When Agentic RAG Is Worth the Cost

Agentic RAG adds significant complexity: non-determinism, variable latency (1-30 seconds per query), debugging difficulty, and LLM API cost proportional to the number of tool calls.

Worth it when:

  • Queries are complex, multi-step, and open-ended
  • Multiple data sources must be combined
  • The user's need can't be captured in a single retrieval query
  • High-stakes answers where verification is critical
  • The business value of a correct comprehensive answer greatly exceeds the cost of extra LLM calls

Not worth it when:

  • Most queries are simple factual lookups
  • Latency SLA is under 2 seconds (agentic RAG often takes 5-30s)
  • Query volume is high (cost scales with number of tool calls)
  • The task is well-defined and bounded (deterministic pipeline is better)

The practical hybrid: Classify incoming queries. Route simple factual queries to standard RAG (fast, cheap) and complex research queries to agentic RAG (slow, expensive). If most of your traffic is simple lookups (an 80/20 split is a common rule of thumb, not a measured figure), this keeps the bulk of volume cheap while reserving agent capabilities for the queries that need them.

Production Engineering Notes

Streaming responses: Agentic RAG can take 10-30 seconds. Stream the final generation so users see tokens as soon as they are produced, and emit progress events while tool calls run (for example, "searching knowledge base") so the wait is not silent. Use SSE (Server-Sent Events) or WebSocket for streaming.

Token budget management: An agentic loop that calls tools 10 times and accumulates all context can easily hit 100K tokens in the message history. Implement context compression: summarize older tool results when the message history grows too long.
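
A minimal sketch of the compression idea follows. It assumes the history holds plain dict messages (SDK message objects would need converting to dicts first) and uses truncation as a stand-in for LLM summarization. The function name `compact_tool_messages` and its parameters are hypothetical.

```python
from typing import Any, Dict, List


def compact_tool_messages(
    messages: List[Dict[str, Any]],
    keep_recent: int = 2,
    max_chars: int = 200,
) -> List[Dict[str, Any]]:
    """Truncate all but the most recent `keep_recent` tool results.

    Truncation is a stand-in here; a production system might summarize
    older results with a cheap model instead.
    """
    tool_indices = [i for i, m in enumerate(messages) if m.get("role") == "tool"]
    recent = set(tool_indices[-keep_recent:]) if keep_recent > 0 else set()

    compacted = []
    for i, message in enumerate(messages):
        content = message.get("content") or ""
        is_old_tool_result = message.get("role") == "tool" and i not in recent
        if is_old_tool_result and len(content) > max_chars:
            # Copy so the caller's history is not mutated; tool_call_id is preserved
            message = {**message, "content": content[:max_chars] + " [truncated]"}
        compacted.append(message)
    return compacted


history = [
    {"role": "user", "content": "Compare Q1 and Q2 margins."},
    {"role": "tool", "tool_call_id": "call_1", "content": "A" * 1000},
    {"role": "tool", "tool_call_id": "call_2", "content": "B" * 1000},
    {"role": "tool", "tool_call_id": "call_3", "content": "C" * 1000},
]
compacted = compact_tool_messages(history, keep_recent=2, max_chars=200)
print([len(m["content"]) for m in compacted])
```

Every tool message is kept, only shortened, so each `tool_call_id` still has its matching result when the history is sent back to the model.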

Tool call logging: Every tool call should be logged with: timestamp, tool name, arguments, result preview, and latency. This is essential for debugging non-deterministic failures and for cost analysis.
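
The fields listed above could be captured with a thin wrapper around whatever function executes tools. The sketch below is one possible shape: `LoggingToolRunner` and `fake_execute` are hypothetical names, and records are kept in a list where a real system would send them to its logging backend.

```python
import json
import time
from datetime import datetime, timezone
from typing import Any, Callable, Dict, List


class LoggingToolRunner:
    """Wraps a tool-execution callable and records one log entry per call."""

    def __init__(
        self,
        execute: Callable[[str, Dict[str, Any]], str],
        preview_chars: int = 120,
    ):
        self.execute_fn = execute
        self.preview_chars = preview_chars
        self.records: List[Dict[str, Any]] = []

    def execute(self, tool_name: str, tool_args: Dict[str, Any]) -> str:
        start = time.perf_counter()
        result = self.execute_fn(tool_name, tool_args)
        latency_ms = (time.perf_counter() - start) * 1000
        self.records.append({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "tool": tool_name,
            "args": tool_args,
            "result_preview": result[: self.preview_chars],
            "result_chars": len(result),
            "latency_ms": round(latency_ms, 2),
        })
        return result


def fake_execute(tool_name: str, tool_args: Dict[str, Any]) -> str:
    # Stand-in for ToolExecutor.execute
    return f"results for {tool_args['query']}"


runner = LoggingToolRunner(fake_execute)
runner.execute("search_knowledge_base", {"query": "refund policy"})
print(json.dumps(runner.records[-1], indent=2))
```

Because the wrapper exposes the same `execute(tool_name, tool_args)` signature, it can be passed anywhere the loop expects a tool executor. This sketch does not log calls that raise; wrapping the call in `try`/`finally` would cover that case.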

Parallelizing tool calls: When the agent needs results from multiple independent tools (e.g., search knowledge base AND search web), call both in parallel. OpenAI's API returns multiple tool_calls in one response - execute them concurrently.

import asyncio
import json
from openai import AsyncOpenAI

async_client = AsyncOpenAI()  # for async model calls elsewhere in the loop

async def execute_tools_parallel(tool_calls: list, executor: ToolExecutor) -> list:
    """Execute multiple tool calls in parallel."""
    async def execute_one(tool_call) -> dict:
        tool_name = tool_call.function.name
        tool_args = json.loads(tool_call.function.arguments)

        # Run in the default thread pool to avoid blocking the event loop
        loop = asyncio.get_running_loop()
        result = await loop.run_in_executor(
            None,
            executor.execute,
            tool_name,
            tool_args,
        )
        return {"tool_call_id": tool_call.id, "result": result}

    # gather preserves input order, so results line up with tool_calls
    return await asyncio.gather(*[execute_one(tc) for tc in tool_calls])

Determinism for debugging: Pure agentic systems are non-deterministic (the same query may trigger different tool sequences). For debugging, log the full tool call sequence. For reproducibility in testing, set temperature=0 and pass a seed when the API supports it; this reduces run-to-run variation but does not guarantee identical outputs.

Common Mistakes

:::danger No Timeout on Agentic Loops

Without a timeout, an agentic loop can run indefinitely - making tool calls, getting confused, making more tool calls. Set a hard iteration limit (max_iterations=10) and a wall-clock timeout (30 seconds). Fall back to simple RAG when the limit is hit. Users should never wait more than 30 seconds for any single response.

:::

:::danger Letting the Agent Call Tools Unnecessarily

Agents will often call retrieval tools for simple questions that the model already knows the answer to ("What is Python?"). This wastes API calls and adds latency. Either: (1) use a router that bypasses the agent for simple queries, or (2) include a direct_answer tool that the agent can call when no retrieval is needed, and write the system prompt so the agent prefers it for obvious cases.

:::

:::warning Accumulating Full Tool Outputs in Context

If each tool call returns 2000 tokens and the agent makes 10 tool calls, you're at 20K tokens of tool outputs before generation. At a few dollars per million input tokens, that is several cents per query for the tool results alone, and more in practice because the growing history is re-sent to the model on every iteration. Implement context management: summarize or truncate older tool results when total context grows beyond a threshold. Keep only the most recently retrieved and most relevant content.

:::
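
The arithmetic behind this warning can be made explicit. The sketch below assumes one tool call per loop iteration, no prompt caching, and a placeholder price; the function names and the rate are illustrative, so substitute your provider's current pricing.

```python
def tool_output_token_load(tokens_per_result: int, num_calls: int) -> dict:
    """Token load from tool outputs when every result stays in the history.

    Assumes one tool call per loop iteration, so the model call after the
    k-th tool result re-reads k results.
    """
    final_context = tokens_per_result * num_calls
    # 1 + 2 + ... + num_calls results are re-read across the whole loop
    cumulative_input = tokens_per_result * num_calls * (num_calls + 1) // 2
    return {"final_context": final_context, "cumulative_input": cumulative_input}


def input_cost_usd(tokens: int, usd_per_million_tokens: float) -> float:
    return tokens / 1_000_000 * usd_per_million_tokens


load = tool_output_token_load(tokens_per_result=2000, num_calls=10)
# The rate below is a placeholder, not a quoted price.
cost = input_cost_usd(load["cumulative_input"], usd_per_million_tokens=5.0)
print(load, round(cost, 3))
```

Under these assumptions the final context holds 20,000 tokens of tool output, but the model reads 110,000 tokens of tool output in total across the ten iterations, which is why truncating older results pays off quickly.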

:::warning Not Evaluating Agentic Systems Differently

Standard RAG eval (RAGAS faithfulness, context precision) doesn't capture agentic behavior. For agentic RAG, also evaluate: (1) task completion rate (did the agent answer the actual question?), (2) tool efficiency (how many tool calls did it take?), (3) answer completeness (did it use all relevant retrieved information?). Build evals specifically for your complex query types - multi-hop reasoning, cross-source synthesis, temporal queries.

:::

Interview Questions and Answers

Q: What is agentic RAG and how does it differ from standard RAG?

A: Standard RAG has a fixed pipeline: embed query → retrieve chunks → generate answer. Its control flow is predictable and fast, but it handles only queries that can be answered with a single retrieval step. Agentic RAG makes retrieval a tool in a reasoning loop - the agent (LLM) decides when to retrieve, what to retrieve, which retrieval source to use, and whether the retrieved information is sufficient or if another retrieval step is needed. This enables multi-step retrieval, multi-source synthesis (combining vector search, SQL, web search), and self-verification. The cost: non-determinism, higher latency (5-30 seconds vs under 2 seconds), and higher API cost (multiple LLM calls per query). Agentic RAG is appropriate for complex open-ended research queries; standard RAG is appropriate for simple factual lookups. Production systems often route between both based on query complexity.

Q: Describe the ReAct pattern and how it applies to RAG systems.

A: ReAct (Reasoning and Acting, Yao et al. 2022) structures agent behavior as an alternating sequence of thought and action steps. In a ReAct RAG agent: (1) The model receives a query and generates a "thought" - reasoning about what information it needs; (2) Based on the thought, it calls a retrieval "action" (vector search, web search, SQL query); (3) The tool returns an "observation" (retrieved results); (4) The model generates another thought, deciding whether it has enough information or needs another retrieval step; (5) This continues until the model determines it can answer, at which point it generates a final response. The key advantage over standard RAG: each retrieval query is informed by the results of previous retrievals. The model can decide "I found that the refund policy mentions 'tier-specific exceptions' - I need to retrieve more information about tier-specific policies" - a chain of reasoning that flat RAG cannot express.

Q: How would you handle a 30-second timeout SLA when agentic RAG sometimes takes 60 seconds?

A: Three strategies in combination. First, query routing: classify incoming queries and route simple factual queries to standard RAG (under 2 seconds). Only complex queries need agentic RAG, and these users typically tolerate more latency. Second, hard timeouts: set a wall-clock limit of 25 seconds on the agentic pipeline (leaving 5 seconds buffer). When the timeout hits, immediately fall back to simple RAG using whatever context has been accumulated so far, and generate a partial answer with a disclaimer that it may be incomplete. Third, streaming: even if the agentic loop takes 20 seconds, stream the final answer generation token-by-token so the user sees progress within 2-3 seconds. This dramatically improves perceived latency. Users tolerate waiting longer when they can see something is happening. Log queries that hit timeouts; these are candidates for query routing improvement - if a pattern of queries consistently times out, build a specialized fast-path handler for that query type.

Q: How do you prevent an agentic RAG system from running up API costs with unnecessary tool calls?

A: Several mechanisms. At the system prompt level: explicitly instruct the agent to call the minimum number of tools necessary, prefer direct_answer when the model already knows something, and avoid calling retrieval for questions about general knowledge. At the architectural level: add a query classifier that routes simple questions (where the answer is well-known) to direct LLM response with no tools. At the loop level: set a maximum tool call budget per query (e.g., max 5 tool calls) and include the budget in the system prompt ("You have a budget of 5 tool calls for this query; use them efficiently"). Monitor cost per query type in production - if a particular query pattern consistently uses 8+ tool calls, build a specialized sub-agent or hardcoded pipeline for that pattern rather than relying on the general agent.
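
The loop-level budget described in this answer could be enforced with a small counter that sits between the agent loop and the tool executor. This is a sketch under assumed names (`ToolCallBudget`, `execute_within_budget`), not part of any framework.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class ToolCallBudget:
    limit: int = 5
    used: int = 0

    @property
    def remaining(self) -> int:
        return max(0, self.limit - self.used)

    def try_spend(self) -> bool:
        """Consume one call if available; return False once exhausted."""
        if self.used >= self.limit:
            return False
        self.used += 1
        return True

    def system_prompt_note(self) -> str:
        return (
            f"You have a budget of {self.limit} tool calls for this query; "
            f"{self.remaining} remain. Use them efficiently."
        )


def execute_within_budget(
    budget: ToolCallBudget,
    execute: Callable[[str, Dict[str, Any]], str],
    tool_name: str,
    tool_args: Dict[str, Any],
) -> str:
    if not budget.try_spend():
        # Returned as the tool result so the model knows to stop calling tools
        return "Tool call budget exhausted. Answer with the information already gathered."
    return execute(tool_name, tool_args)


budget = ToolCallBudget(limit=2)
results = [
    execute_within_budget(budget, lambda name, args: "ok", "search_web", {"query": "x"})
    for _ in range(3)
]
print(results)
print(budget.system_prompt_note())
```

Returning the exhaustion notice as a tool result, instead of raising, keeps the message history valid and gives the model a chance to produce a final answer from what it already has.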

Q: When would you recommend building a custom agentic RAG pipeline vs using a framework like LangGraph or LlamaIndex?

A: Use a framework (LangGraph, LlamaIndex QueryPipeline, CrewAI) when: you're prototyping or building a standard agentic pattern (ReAct, multi-agent), the framework's abstractions match your architecture well, the team has limited ML infrastructure experience, and you don't need deep customization of the orchestration logic. LangGraph is specifically strong for stateful multi-step agents where you need explicit state transitions and the ability to checkpoint/resume. Build custom when: you have specific performance requirements the framework can't meet (frameworks add overhead), your orchestration logic is fundamentally different from what frameworks support, you need fine-grained control over parallelism and batching, or you're integrating with systems that don't have framework support. A pragmatic approach: start with a framework for the first version to move fast, then rewrite the performance-critical components custom if profiling reveals framework overhead is significant. Most agentic RAG systems don't need to go custom - the bottleneck is LLM inference latency, not framework overhead.

Evaluating Agentic RAG Systems

Standard RAG evaluation (RAGAS faithfulness, context precision) captures only part of agentic system quality. Agentic systems need additional metrics:

from dataclasses import dataclass
from typing import Callable
import time


@dataclass
class AgenticRAGEvalResult:
    query: str
    final_answer: str
    tool_calls: list[dict]
    total_latency_ms: float
    task_completed: bool
    answer_completeness: float  # 0-1 from LLM judge
    tool_efficiency: float  # 0-1; 1.0 for a single call, falling linearly with extra calls


def evaluate_agentic_rag(
    agentic_pipeline: Callable,
    test_cases: list[dict],
    judge_client,
) -> list[AgenticRAGEvalResult]:
    """
    Evaluate an agentic RAG system on test cases with ground truth.

    agentic_pipeline(query) is expected to return a dict with
    "answer" (str) and "tool_calls" (list of dicts).

    Each test_case: {
        "query": str,
        "expected_topics": list[str],  # topics the answer should cover
        "max_acceptable_tool_calls": int,
    }
    """
    results = []

    for case in test_cases:
        start = time.perf_counter()
        response = agentic_pipeline(case["query"])
        latency_ms = (time.perf_counter() - start) * 1000

        answer = response.get("answer", "") or ""
        answer_lower = answer.lower()

        # Task completion (heuristic): did the agent produce a substantive answer?
        task_completed = (
            len(answer) > 50
            and "i don't know" not in answer_lower
            and "cannot find" not in answer_lower
        )

        # Answer completeness: LLM judges whether all expected topics are covered
        completeness_prompt = f"""
        Query: {case['query']}
        Answer: {answer}
        Expected topics to cover: {', '.join(case['expected_topics'])}

        Score the answer completeness from 0.0 to 1.0:
        - 1.0: all expected topics are addressed with substance
        - 0.7: most topics addressed, minor gaps
        - 0.4: some topics addressed, significant gaps
        - 0.0: answer does not address the expected topics

        Return only a float between 0.0 and 1.0.
        """
        completion = judge_client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": completeness_prompt}],
            temperature=0,
        )
        try:
            completeness = float(completion.choices[0].message.content.strip())
        except (ValueError, AttributeError):
            # Judge returned something other than a bare number (or no content)
            completeness = 0.0
        completeness = min(1.0, max(0.0, completeness))

        # Tool efficiency: fewer tool calls = higher score, clamped to [0, 1]
        num_tool_calls = len(response.get("tool_calls", []))
        max_calls = case.get("max_acceptable_tool_calls", 5)
        tool_efficiency = min(1.0, max(0.0, 1.0 - (num_tool_calls - 1) / max_calls))

        results.append(AgenticRAGEvalResult(
            query=case["query"],
            final_answer=answer,
            tool_calls=response.get("tool_calls", []),
            total_latency_ms=latency_ms,
            task_completed=task_completed,
            answer_completeness=completeness,
            tool_efficiency=tool_efficiency,
        ))

    return results


def summarize_eval(results: list[AgenticRAGEvalResult]) -> dict:
    n = len(results)
    if n == 0:
        return {}  # nothing to summarize; avoids division by zero
    latencies = sorted(r.total_latency_ms for r in results)
    return {
        "task_completion_rate": sum(r.task_completed for r in results) / n,
        "mean_answer_completeness": sum(r.answer_completeness for r in results) / n,
        "mean_tool_efficiency": sum(r.tool_efficiency for r in results) / n,
        "mean_latency_ms": sum(latencies) / n,
        "p95_latency_ms": latencies[min(n - 1, int(0.95 * n))],
    }

The key metrics for agentic evaluation:

  • Task completion rate: Did the agent actually answer the question vs. saying it couldn't find information?
  • Answer completeness: Does the final answer cover all aspects of a complex multi-part question?
  • Tool efficiency: How many retrieval steps did it take? An agent that answers correctly in 2 calls is better than one that answers correctly in 8 calls.
  • P95 latency: At the 95th percentile, is the system within acceptable bounds? Mean latency hides the long tail that frustrates users.

Production Deployment Checklist

Before deploying an agentic RAG system to production, verify:

:::tip Deployment Checklist

  • Query router implemented - simple queries go to fast-path standard RAG
  • Hard timeout with graceful fallback (standard RAG on timeout, not error)
  • Maximum tool call budget enforced in system prompt and agent loop
  • All tool calls logged with latency and result size (for cost/performance monitoring)
  • Response streaming enabled - users see token generation within 2-3 seconds
  • Circuit breaker on each tool - if vector DB is slow, agent degrades gracefully
  • Eval set covers complex multi-step queries (not just simple factual lookups)
  • Task completion rate measured in production (1% sampling with LLM judge)
  • Alert on tool call budget exhaustion rate (if over 10% of queries hit the budget, investigate)
  • Rollback plan: feature flag to route all traffic back to standard RAG

:::

Summary

Agentic RAG is the right architecture for queries requiring multi-step reasoning, multi-source synthesis, or dynamic retrieval strategy selection. It is not the right architecture for simple factual lookups - the latency and cost overhead of an agentic loop is unjustifiable when a single retrieval step is sufficient. The key engineering pattern: always implement a query router that sends simple queries to fast-path standard RAG and only routes complex queries to the agentic pipeline. Within the agentic pipeline, enforce hard constraints - timeout with graceful fallback, maximum tool call budget, streaming output - that protect against the failure modes unique to agentic systems. Evaluate with metrics designed for agentic behavior: task completion rate, answer completeness across multi-part questions, tool call efficiency, and P95 latency - not just faithfulness and context recall.


© 2026 EngineersOfAI. All rights reserved.