
LlamaIndex Architecture

The Query That Should Have Been Simple

The product manager sends the request on a Monday morning: "We have 10,000 customer support tickets from the last year. Can we build an agent that answers questions about our product issues - specific ones, like 'how many customers complained about login timeouts in Q3' and 'what's the most common error when exporting to PDF'?" She adds a deadline: end of the week.

Your team's first instinct is to build the agent with raw Anthropic API calls - a simple RAG pipeline, a vector store, a search tool. Three hours later, the spike reveals the shape of the problem. The tickets are in different formats: some are JSON exports, some are CSV, some are email threads parsed into markdown. Retrieving the right tickets requires hybrid search - keyword for ticket IDs and error codes, semantic for conceptual questions about user problems. Some questions need aggregation over many tickets ("how many in Q3"), which requires structured querying, not just retrieval. Some questions need the agent to read specific tickets deeply, not just retrieve snippets.

You are not building a simple RAG pipeline. You are building a knowledge system with multiple retrieval strategies, document understanding, and agentic reasoning over the results. The raw API approach will work, but you will spend three days building infrastructure that LlamaIndex already has: document loaders for every format, a query engine that picks the right retrieval strategy per query, structured data extraction, and agentic tools built on top of all of it.

By Wednesday, your LlamaIndex-based system is handling queries the team had not even thought of. The agent routes simple lookup questions to a vector search tool, complex aggregation questions to a SQL-style structured tool, and comparative questions to a multi-document synthesis tool. The product manager is satisfied. The ticket is closed.

LlamaIndex is the right tool for knowledge-intensive agents - systems where the primary challenge is accessing and reasoning over large document collections, not orchestrating complex state transitions. This lesson explains its architecture and when to use it.



Why This Exists

The Gap Between LangChain and Real RAG

When LangChain launched in 2022, its retrieval capabilities were basic: load documents, chunk them, embed them, search. This covered simple retrieval-augmented generation but not the complex document understanding that enterprise applications needed.

Jerry Liu saw the gap. The existing tools were insufficient for connecting LLMs to large private document collections. That work needed sophisticated document parsing (not just text extraction), multiple index types (not just vector search), query routing (different query strategies for different question types), and composable query pipelines. He built GPT Index (later renamed LlamaIndex) to solve these problems and open-sourced it in November 2022. Simon Suo later joined him as co-founder of the company behind the project.

Where LangChain positioned itself as the general-purpose LLM framework, LlamaIndex positioned itself as the data framework for LLMs: a system optimized for connecting LLMs to complex, heterogeneous data collections. The distinction held through 2023 and 2024 as LlamaIndex added agentic capabilities on top of its data foundation.

The Document-Centric Philosophy

LlamaIndex's architecture reflects a philosophy: the hard part of most enterprise LLM applications is not the LLM - it is getting the right data to the LLM at the right time in the right format.

Every LlamaIndex abstraction flows from this philosophy. Document is the first-class citizen. Node is a chunk of a document with metadata that preserves the document context. Index is a queryable structure over nodes. QueryEngine is the retrieval + synthesis pipeline. Agent is built on top of QueryEngine, not the other way around.


Historical Context

| Date | Event |
| --- | --- |
| November 2022 | GPT Index (later LlamaIndex) open-sourced by Jerry Liu |
| Early 2023 | GPT Index renamed LlamaIndex |
| June 2023 | $8.5M seed funding announced |
| November 2023 | LlamaIndex v0.9 - full rewrite with better composability |
| February 2024 | LlamaIndex v0.10 - split into llama-index-core + provider packages |
| August 2024 | LlamaIndex Workflows announced - event-driven orchestration |
| October 2024 | LlamaParse (document parsing API) goes GA |
| December 2024 | LlamaIndex Cloud and managed RAG pipelines |

The v0.10 split mirrors LangChain's package restructuring: a stable core with optional integration packages. pip install llama-index-core gives you the framework without 100+ dependencies. Individual integrations like llama-index-llms-anthropic and llama-index-vector-stores-chroma are installed separately.


Core Architecture

Layer 1: Documents and Nodes

Everything starts with Document objects:

from llama_index.core import Document
from llama_index.core.node_parser import SentenceSplitter

# Create documents manually
doc = Document(
    text="Your document text here...",
    metadata={
        "source": "support_ticket_4721",
        "date": "2024-09-15",
        "category": "login",
        "priority": "high"
    },
    doc_id="ticket_4721"
)

# Load from files
from llama_index.core import SimpleDirectoryReader

reader = SimpleDirectoryReader(
    input_dir="./tickets/",
    required_exts=[".txt", ".md", ".pdf"],
    recursive=True
)
documents = reader.load_data()
print(f"Loaded {len(documents)} documents")

# Split into nodes (chunks)
splitter = SentenceSplitter(chunk_size=512, chunk_overlap=50)
nodes = splitter.get_nodes_from_documents(documents)
print(f"Created {len(nodes)} nodes")

# Nodes preserve document relationship
for node in nodes[:3]:
    print(f"Node: {node.node_id[:8]}... from doc: {node.ref_doc_id[:8]}...")
    print(f"  Content: {node.text[:100]}...")
    print(f"  Metadata: {node.metadata}")

The distinction between Document and Node matters for metadata filtering. Documents have document-level metadata (the ticket's creation date, priority, category). Nodes inherit document metadata, allowing retrieval to filter by document properties even when returning chunk-level results.
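To make the inheritance concrete, here is a minimal pure-Python sketch of the idea, not LlamaIndex's implementation. The names `Doc`, `Chunk`, and `split_with_metadata` are hypothetical, and it splits on fixed character widths rather than on sentence boundaries as SentenceSplitter does:

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    doc_id: str
    text: str
    metadata: dict = field(default_factory=dict)


@dataclass
class Chunk:
    text: str
    ref_doc_id: str  # pointer back to the parent document
    metadata: dict   # copied from the parent document


def split_with_metadata(doc: Doc, chunk_size: int = 40, overlap: int = 10) -> list[Chunk]:
    """Fixed-width character chunks; each chunk copies the parent's metadata."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(doc.text), step):
        piece = doc.text[start:start + chunk_size]
        chunks.append(Chunk(piece, doc.doc_id, dict(doc.metadata)))
        if start + chunk_size >= len(doc.text):
            break  # the last window already reached the end of the text
    return chunks


ticket = Doc(
    "ticket_4721",
    "Login times out after 30 seconds on the SSO page. " * 3,
    {"category": "login", "priority": "high"},
)
chunks = split_with_metadata(ticket)

# Chunk-level filtering works because every chunk carries document-level metadata
high_priority = [c for c in chunks if c.metadata["priority"] == "high"]
print(len(chunks), len(high_priority))
```

Because every chunk carries a copy of its parent's metadata and a `ref_doc_id`, a filter on priority or category can run at chunk level and still be traced back to the originating ticket.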

Layer 2: Indexes

LlamaIndex supports multiple index types, each suited to different retrieval patterns:

VectorStoreIndex - semantic similarity search:

from llama_index.core import VectorStoreIndex, StorageContext
from llama_index.llms.anthropic import Anthropic
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.vector_stores.chroma import ChromaVectorStore
import chromadb

# Setup
llm = Anthropic(model="claude-opus-4-6", max_tokens=4096)
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# Persistent vector store
chroma_client = chromadb.PersistentClient(path="./chroma_db")
chroma_collection = chroma_client.get_or_create_collection("tickets")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Build index
index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    embed_model=embed_model,
    show_progress=True
)

# Or load existing index
# index = VectorStoreIndex.from_vector_store(vector_store, embed_model=embed_model)

SummaryIndex - summarizes across all documents, useful for "what are the main themes" queries:

from llama_index.core import SummaryIndex

# SummaryIndex passes all nodes to the LLM for synthesis
# Best for: "Summarize all Q3 issues", "What are the recurring themes?"
summary_index = SummaryIndex(nodes)

KeywordTableIndex - sparse keyword matching, fast for exact lookups:

from llama_index.core import KeywordTableIndex

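# Good for: "Find tickets mentioning 'PDF export error'", "Tickets from user_id 4721"
# Note: KeywordTableIndex extracts keywords with an LLM at build time, so pass the
# LLM explicitly. SimpleKeywordTableIndex uses a regex extractor and makes no LLM calls.
keyword_index = KeywordTableIndex(nodes, llm=llm)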
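The keyword-to-node mapping behind this index type can be sketched in a few lines of plain Python. This is an illustration of the general inverted-index idea under simplifying assumptions (a regex tokenizer and a tiny stopword list), not LlamaIndex's extraction logic; `extract_keywords`, `build_table`, and `lookup` are hypothetical names:

```python
import re
from collections import defaultdict

STOPWORDS = {"the", "a", "an", "when", "to", "on", "in", "of", "with"}


def extract_keywords(text: str) -> set[str]:
    """Lowercase alphanumeric tokens minus a small stopword list."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}


def build_table(nodes: dict[str, str]) -> dict[str, set[str]]:
    """Map each keyword to the set of node IDs that contain it."""
    table: dict[str, set[str]] = defaultdict(set)
    for node_id, text in nodes.items():
        for keyword in extract_keywords(text):
            table[keyword].add(node_id)
    return table


def lookup(table: dict[str, set[str]], query: str) -> list[str]:
    """Rank nodes by how many query keywords they share (ties broken by ID)."""
    counts: dict[str, int] = defaultdict(int)
    for keyword in extract_keywords(query):
        for node_id in table.get(keyword, ()):
            counts[node_id] += 1
    return sorted(counts, key=lambda n: (-counts[n], n))


nodes = {
    "n1": "Export to PDF fails with error E4021",
    "n2": "Login timeout on SSO page",
    "n3": "Error E4021 when saving report",
}
table = build_table(nodes)
matches = lookup(table, "tickets with error E4021")
print(matches)
```

Exact tokens such as error codes match reliably here, while a conceptual query that shares no literal words with the tickets would return nothing, which is why this index complements rather than replaces vector search.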

Layer 3: Query Engines

A QueryEngine combines a retriever (find relevant nodes) with a response synthesizer (generate an answer from retrieved nodes):

# Basic vector query engine
vector_engine = index.as_query_engine(
    llm=llm,
    similarity_top_k=5,       # Retrieve top 5 nodes
    response_mode="compact",  # Compact multiple nodes into one context
    # streaming=True would make query() return a streaming response, which is
    # consumed with response.print_response_stream() rather than read from .response
)

# Query it
response = vector_engine.query(
    "What are the most common complaints about the login system?"
)
print(response.response)

# Inspect the source nodes used
for node_with_score in response.source_nodes:
    print(f"Score: {node_with_score.score:.3f}")
    print(f"Text: {node_with_score.node.text[:200]}")
    print(f"Metadata: {node_with_score.node.metadata}")

Response modes control how retrieved nodes are synthesized:

  • compact - fit as many nodes as possible into the context, synthesize once
  • refine - start with first node's answer, iteratively refine with each subsequent node
  • tree_summarize - build a tree of summaries over retrieved nodes (best for many long nodes)
  • no_text - return only source nodes without synthesis (useful for debugging retrieval)
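The difference between the first two modes is mostly a difference in how many LLM calls are made. The sketch below illustrates that with a stand-in model; it is a simplified picture under stated assumptions (a character budget instead of a token budget, hypothetical function names), not the library's synthesizer code:

```python
from typing import Callable

LLM = Callable[[str], str]


def refine(query: str, chunks: list[str], llm: LLM) -> str:
    """One LLM call per chunk; each call sees the previous draft answer."""
    answer = ""
    for chunk in chunks:
        answer = llm(f"Question: {query}\nExisting answer: {answer}\nContext: {chunk}")
    return answer


def compact(query: str, chunks: list[str], llm: LLM, budget: int = 200) -> str:
    """Pack chunks into as few prompts as fit the budget, then refine over those."""
    batches, current = [], ""
    for chunk in chunks:
        if current and len(current) + len(chunk) > budget:
            batches.append(current)
            current = ""
        current += chunk + "\n"
    if current:
        batches.append(current)
    return refine(query, batches, llm)


calls: list[str] = []


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model: records the prompt and returns a canned draft."""
    calls.append(prompt)
    return f"draft {len(calls)}"


chunks = ["x" * 80] * 4

refine("What fails?", chunks, fake_llm)
refine_calls = len(calls)

calls.clear()
compact("What fails?", chunks, fake_llm, budget=200)
compact_calls = len(calls)

print(refine_calls, compact_calls)
```

With four 80-character chunks and a 200-character budget, refining makes one call per chunk while compacting first packs two chunks per prompt, halving the number of calls.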

Metadata Filtering

Filter retrieval by document metadata before semantic search:

from llama_index.core.vector_stores import (
    MetadataFilter,
    MetadataFilters,
    FilterOperator
)

# Find only high-priority login issues from Q3.
# Range operators on string dates depend on the vector store backend: some stores
# only support numeric comparisons, in which case store dates as integers
# (for example 20240701) and filter on those instead.
filters = MetadataFilters(filters=[
    MetadataFilter(key="category", value="login"),
    MetadataFilter(key="priority", value="high"),
    MetadataFilter(
        key="date",
        value="2024-07-01",
        operator=FilterOperator.GTE
    ),
    MetadataFilter(
        key="date",
        value="2024-09-30",
        operator=FilterOperator.LTE
    )
])

filtered_engine = index.as_query_engine(
    llm=llm,
    similarity_top_k=10,
    filters=filters
)

response = filtered_engine.query("What login issues occurred in Q3?")
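Conceptually, the filters narrow the candidate set first and similarity ranking runs only over what survives. A minimal in-memory sketch of that order of operations follows; the node data, two-dimensional vectors, and the `filtered_search` helper are all made up for illustration, and a real vector store pushes the filter down into its own query engine:

```python
from math import sqrt


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))


nodes = [
    {"id": "n1", "vec": [1.0, 0.0], "meta": {"category": "login", "date": "2024-08-02"}},
    {"id": "n2", "vec": [0.9, 0.1], "meta": {"category": "export", "date": "2024-08-10"}},
    {"id": "n3", "vec": [0.7, 0.7], "meta": {"category": "login", "date": "2024-11-20"}},
    {"id": "n4", "vec": [0.6, 0.8], "meta": {"category": "login", "date": "2024-09-12"}},
]


def filtered_search(query_vec, nodes, category, start, end, top_k=2):
    """Apply metadata predicates first, then rank the survivors by cosine similarity."""
    # ISO-8601 date strings (YYYY-MM-DD) sort lexicographically in date order
    candidates = [
        n for n in nodes
        if n["meta"]["category"] == category and start <= n["meta"]["date"] <= end
    ]
    ranked = sorted(candidates, key=lambda n: cosine(query_vec, n["vec"]), reverse=True)
    return [n["id"] for n in ranked[:top_k]]


hits = filtered_search([1.0, 0.0], nodes, "login", "2024-07-01", "2024-09-30")
print(hits)
```

Note that `n2` is the second most similar vector overall but never reaches the ranking stage, because it fails the category filter.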

RouterQueryEngine: Automatic Strategy Selection

The RouterQueryEngine automatically selects the appropriate query engine based on the query content - one of LlamaIndex's most powerful features:

from llama_index.core.query_engine import RouterQueryEngine
from llama_index.core.selectors import LLMSingleSelector
from llama_index.core.tools import QueryEngineTool

# Define each specialized engine as a tool with a description
vector_tool = QueryEngineTool.from_defaults(
    query_engine=vector_engine,
    name="semantic_search",
    description="Best for questions about specific issues, symptoms, or user experiences. "
                "Use when the question is about 'what', 'how', or 'why' something happened."
)

summary_tool = QueryEngineTool.from_defaults(
    query_engine=summary_index.as_query_engine(llm=llm),
    name="summary_analysis",
    description="Best for questions requiring synthesis across many tickets. "
                "Use for 'what are the main themes', 'summarize all issues', or trend analysis."
)

keyword_tool = QueryEngineTool.from_defaults(
    query_engine=keyword_index.as_query_engine(llm=llm),
    name="keyword_lookup",
    description="Best for finding tickets with specific error codes, ticket IDs, or exact phrases. "
                "Use for 'find tickets containing X' or 'tickets from user Y'."
)

# Router selects the best engine for each query
router_engine = RouterQueryEngine(
    selector=LLMSingleSelector.from_defaults(llm=llm),
    query_engine_tools=[vector_tool, summary_tool, keyword_tool],
    verbose=True  # Logs which engine was selected and why
)

# The router will select the appropriate engine
queries = [
    "What is the most common error when exporting to PDF?",    # → semantic search
    "Summarize all high-priority issues from the last month",  # → summary analysis
    "Find all tickets mentioning error code E4021",            # → keyword lookup
]

for query in queries:
    print(f"\nQuery: {query}")
    response = router_engine.query(query)
    print(f"Answer: {response.response[:300]}")
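The routing signal is the tool description, which is why those strings deserve care. The sketch below makes that visible by swapping the LLM selector for a deliberately crude word-overlap score; this substitution is an assumption for illustration only, and `select_tool` is a hypothetical helper, not part of LlamaIndex:

```python
import re


def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z]+", text.lower()))


def select_tool(query: str, descriptions: dict[str, str]) -> str:
    """Pick the tool whose description shares the most words with the query."""
    query_tokens = tokens(query)
    return max(descriptions, key=lambda name: len(query_tokens & tokens(descriptions[name])))


descriptions = {
    "semantic_search": "questions about symptoms or user experiences: what, how, why",
    "summary_analysis": "summarize themes and trends across many tickets",
    "keyword_lookup": "find tickets containing an exact error code or ticket id",
}

choice = select_tool("Find all tickets mentioning error code E4021", descriptions)
print(choice)
```

An LLM selector makes the same kind of decision with far better language understanding, but it still only sees the query and the descriptions, so vague or overlapping descriptions degrade routing in both cases.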

LlamaIndex Agents

LlamaIndex provides two agent types, both built on top of query engines as tools.

FunctionCallingAgent

The recommended agent for models with native function calling (Claude, GPT-4):

from llama_index.core.agent import FunctionCallingAgent
from llama_index.core.tools import FunctionTool, QueryEngineTool

# Custom function tool
def count_tickets_by_category(category: str, start_date: str, end_date: str) -> str:
    """
    Count the number of support tickets in a specific category within a date range.

    Args:
        category: The ticket category (login, export, billing, performance)
        start_date: Start date in YYYY-MM-DD format
        end_date: End date in YYYY-MM-DD format

    Returns:
        Count and summary of tickets matching the criteria
    """
    # Placeholder result: a real implementation would query your database or filtered index
    return f"Found 47 tickets in '{category}' between {start_date} and {end_date}"

def get_top_user_complaints(n: int = 5) -> str:
    """
    Return the top N most frequently reported user complaints.

    Args:
        n: Number of top complaints to return (default 5)

    Returns:
        Ranked list of complaints with counts
    """
    # Placeholder result: a real implementation would query your aggregated data
    return f"Top {n} complaints: 1. Login timeouts (234), 2. PDF export (178), ..."

# Build tools
function_tools = [
    FunctionTool.from_defaults(fn=count_tickets_by_category),
    FunctionTool.from_defaults(fn=get_top_user_complaints),
    QueryEngineTool.from_defaults(
        query_engine=vector_engine,
        name="search_tickets",
        description="Search through customer support tickets semantically"
    ),
    QueryEngineTool.from_defaults(
        query_engine=router_engine,
        name="analyze_tickets",
        description="Comprehensive ticket analysis - routes to best search strategy automatically"
    ),
]

# Create the agent
agent = FunctionCallingAgent.from_tools(
    tools=function_tools,
    llm=llm,
    verbose=True,
    max_function_calls=10,
    system_prompt="""You are a customer support analytics agent.
    You help teams understand patterns in support tickets using various analytical tools.
    Always provide specific numbers when available and cite the relevant ticket information."""
)

# Multi-step agentic query
response = agent.chat(
    "Compare the volume and severity of login issues vs PDF export issues in Q3 2024. "
    "Which one should we prioritize fixing first?"
)
print(response.response)
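The docstrings and type hints above are not decoration: the tool's name, description, and parameter schema shown to the model are derived from the function itself. A rough sketch of that derivation using only the standard library follows. `describe_tool` is a hypothetical helper, and the actual schema LlamaIndex builds is richer than this dictionary:

```python
import inspect


def describe_tool(fn) -> dict:
    """Build a minimal tool description from a function's signature and docstring."""
    signature = inspect.signature(fn)
    parameters = {}
    for name, param in signature.parameters.items():
        parameters[name] = {
            "type": getattr(param.annotation, "__name__", str(param.annotation)),
            "required": param.default is inspect.Parameter.empty,
        }
    doc = inspect.getdoc(fn) or ""
    return {
        "name": fn.__name__,
        "description": doc.split("\n\n")[0],  # first paragraph of the docstring
        "parameters": parameters,
    }


def get_top_user_complaints(n: int = 5) -> str:
    """Return the top N most frequently reported user complaints.

    Args:
        n: Number of top complaints to return (default 5)
    """
    return f"Top {n} complaints: ..."


schema = describe_tool(get_top_user_complaints)
print(schema)
```

A function with a missing docstring or untyped parameters gives the model little to work with, so treat the signature as part of the prompt.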

ReActAgent

For models without native function calling or when you want ReAct-style reasoning traces:

from llama_index.core.agent import ReActAgent

react_agent = ReActAgent.from_tools(
    tools=function_tools,
    llm=llm,
    verbose=True,
    max_iterations=10,
    react_chat_formatter=None  # Use default ReAct format
)

# The agent will explicitly show Thought/Action/Observation steps
response = react_agent.chat(
    "What percentage of all tickets in the last 6 months were about login issues?"
)
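The Thought/Action/Observation cycle is a plain loop around the model. The sketch below shows its shape with a scripted stand-in for the LLM; the text format, the `react_loop` helper, and the `count` tool are assumptions chosen for brevity, not the prompt format ReActAgent actually uses:

```python
import re


def react_loop(question, llm, tools, max_iterations=5):
    """Alternate model steps and tool calls until the model emits a final answer."""
    transcript = f"Question: {question}\n"
    for _ in range(max_iterations):
        step = llm(transcript)
        transcript += step + "\n"

        final = re.search(r"Answer: (.*)", step)
        if final:
            return final.group(1), transcript

        action = re.search(r"Action: (\w+)\[(.*?)\]", step)
        if action:
            name, argument = action.groups()
            # The tool result is appended so the next model step can read it
            transcript += f"Observation: {tools[name](argument)}\n"
    return None, transcript  # gave up after max_iterations


# Scripted model output standing in for a real LLM
script = iter([
    "Thought: I need the login ticket count.\nAction: count[login]",
    "Thought: I have the number.\nAnswer: 47 login tickets",
])

answer, trace = react_loop(
    "How many login tickets?",
    llm=lambda _prompt: next(script),
    tools={"count": lambda category: "47"},
)
print(answer)
```

The `max_iterations` cap plays the same role as the parameter of the same name above: it bounds cost when the model never converges on an answer.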

LlamaIndex Workflows: Event-Driven Orchestration

Workflows are LlamaIndex's answer to LangGraph: an event-driven orchestration system for complex multi-step agent pipelines. Introduced in mid-2024, Workflows often need less boilerplate than LangGraph for knowledge-intensive pipelines that naturally fit the event-driven model.

from llama_index.core.workflow import (
    Workflow,
    StartEvent,
    StopEvent,
    Event,
    step,
    Context,
)
from typing import Optional

# Define custom events that flow between steps
class QueryReceivedEvent(Event):
    query: str
    strategy: str = "semantic"  # "semantic" or "summary", set by route_query
    filters: Optional[dict] = None

class RetrievalCompleteEvent(Event):
    query: str
    nodes: list
    strategy_used: str

class AnalysisCompleteEvent(Event):
    query: str
    analysis: str
    source_count: int

class RAGWorkflow(Workflow):
    """
    Event-driven RAG pipeline:
    Start → route_query → [semantic_search | summary_search] → synthesize_answer → Stop
    """

    @step
    async def route_query(self, ctx: Context, ev: StartEvent) -> QueryReceivedEvent:
        """Determine the best retrieval strategy for this query."""
        query = ev.get("query", "")
        await ctx.set("original_query", query)

        # Simple routing heuristic - in production, use the router engine
        is_aggregation = any(w in query.lower() for w in ["how many", "count", "total", "percentage"])
        strategy = "summary" if is_aggregation else "semantic"

        return QueryReceivedEvent(query=query, strategy=strategy)

    @step
    async def semantic_search(self, ctx: Context, ev: QueryReceivedEvent) -> RetrievalCompleteEvent | None:
        """Execute semantic vector search."""
        if ev.strategy != "semantic":
            return None  # Handled by summary_search

        # Use the public retriever property rather than the private _retriever attribute
        nodes = await vector_engine.retriever.aretrieve(ev.query)
        return RetrievalCompleteEvent(query=ev.query, nodes=nodes, strategy_used="semantic")

    @step
    async def summary_search(self, ctx: Context, ev: QueryReceivedEvent) -> RetrievalCompleteEvent | None:
        """Retrieve from the summary index for aggregation-style queries."""
        if ev.strategy != "summary":
            return None  # Handled by semantic_search

        nodes = await summary_index.as_retriever().aretrieve(ev.query)
        return RetrievalCompleteEvent(query=ev.query, nodes=nodes, strategy_used="summary")

    @step
    async def synthesize_answer(self, ctx: Context, ev: RetrievalCompleteEvent) -> StopEvent:
        """Synthesize the final answer from retrieved nodes."""
        query = ev.query
        nodes = ev.nodes
        strategy = ev.strategy_used

        # Build context from retrieved nodes
        context_parts = []
        for node in nodes[:5]:
            score = getattr(node, 'score', 'N/A')
            text = node.node.text if hasattr(node, 'node') else str(node)
            context_parts.append(f"[Relevance: {score}]\n{text[:500]}")

        context = "\n\n---\n\n".join(context_parts)

        # Call Claude for synthesis, reusing the llm configured earlier
        prompt = (
            "Answer the following question based on these support tickets:\n\n"
            f"Question: {query}\n\n"
            f"Relevant tickets:\n{context}\n\n"
            "Provide a specific, data-driven answer. Cite ticket examples where relevant."
        )
        response = await llm.acomplete(prompt)

        return StopEvent(result={
            "answer": str(response),
            "sources_used": len(nodes),
            "strategy": strategy
        })

# Run the workflow
async def run_workflow():
    workflow = RAGWorkflow(timeout=120, verbose=True)
    result = await workflow.run(query="What are the top login issues this quarter?")
    print(result["answer"])
    print(f"\nSources: {result['sources_used']} nodes, Strategy: {result['strategy']}")

import asyncio
asyncio.run(run_workflow())

Full Production Example: Agentic RAG System

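A complete example combining LlamaIndex indexing, tools, and agents with Claude through the llama-index-llms-anthropic integration. The analytical tools return placeholder data; replace them with real database queries before production use: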

from typing import Optional

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, StorageContext
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.agent import FunctionCallingAgent
from llama_index.core.tools import QueryEngineTool, FunctionTool
from llama_index.llms.anthropic import Anthropic as LlamaAnthropic
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.vector_stores.chroma import ChromaVectorStore
import chromadb
import json

# ─── Setup ────────────────────────────────────────────────────────────────────

llm = LlamaAnthropic(model="claude-opus-4-6", max_tokens=4096)
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

def build_ticket_knowledge_base(tickets_dir: str) -> VectorStoreIndex:
    """Load and index support tickets."""
    # Load documents
    reader = SimpleDirectoryReader(tickets_dir, recursive=True)
    documents = reader.load_data()
    print(f"Loaded {len(documents)} ticket documents")

    # Parse into nodes; each node inherits its document's metadata
    splitter = SentenceSplitter(chunk_size=512, chunk_overlap=50)
    nodes = splitter.get_nodes_from_documents(documents)
    print(f"Created {len(nodes)} nodes")

    # Persistent vector store
    chroma_client = chromadb.PersistentClient(path="./ticket_db")
    collection = chroma_client.get_or_create_collection("support_tickets")
    vector_store = ChromaVectorStore(chroma_collection=collection)
    storage_context = StorageContext.from_defaults(vector_store=vector_store)

    index = VectorStoreIndex(
        nodes,
        storage_context=storage_context,
        embed_model=embed_model
    )
    return index

# ─── Tools ────────────────────────────────────────────────────────────────────

def build_agent_tools(index: VectorStoreIndex) -> list:
    """Build the complete tool set for the support analytics agent."""

    # Semantic search tool
    semantic_engine = index.as_query_engine(
        llm=llm,
        similarity_top_k=8,
        response_mode="tree_summarize"
    )

    search_tool = QueryEngineTool.from_defaults(
        query_engine=semantic_engine,
        name="search_support_tickets",
        description="""Search through customer support tickets using semantic similarity.
        Use this to find tickets about specific symptoms, error messages, or user experiences.
        Best for: 'What errors do users see when...', 'How do users describe the problem with...'"""
    )

    # Custom analytical tools (placeholder data throughout)
    def count_tickets(category: Optional[str] = None, date_range: Optional[str] = None) -> str:
        """Count tickets with optional category and date filters."""
        # Real implementation: query your database
        filters = []
        if category:
            filters.append(f"category='{category}'")
        if date_range:
            filters.append(f"date IN {date_range}")
        filter_str = " AND ".join(filters) if filters else "no filters"
        return json.dumps({"count": 42, "filters_applied": filter_str})

    def get_resolution_rate(issue_type: str) -> str:
        """Get the resolution rate for a specific issue type."""
        return json.dumps({
            "issue_type": issue_type,
            "total_tickets": 120,
            "resolved": 89,
            "resolution_rate": "74.2%",
            "avg_resolution_hours": 18.3
        })

    def get_trending_issues(days: int = 7) -> str:
        """Get issues that have increased in frequency over the past N days."""
        return json.dumps({
            "period_days": days,
            "trending_up": ["login_timeout", "pdf_export_failure"],
            "trending_down": ["slow_load_time"],
            "new_issues": ["mobile_app_crash_ios17"]
        })

    return [
        search_tool,
        FunctionTool.from_defaults(fn=count_tickets),
        FunctionTool.from_defaults(fn=get_resolution_rate),
        FunctionTool.from_defaults(fn=get_trending_issues),
    ]

# ─── Agent ────────────────────────────────────────────────────────────────────

def create_analytics_agent(index: VectorStoreIndex) -> FunctionCallingAgent:
    tools = build_agent_tools(index)

    agent = FunctionCallingAgent.from_tools(
        tools=tools,
        llm=llm,
        verbose=True,
        max_function_calls=15,
        system_prompt="""You are a senior customer support analyst.
        You have access to all customer support tickets and can search them semantically
        or analyze them statistically. When answering:
        1. Use multiple tools to gather comprehensive data
        2. Provide specific numbers, percentages, and dates
        3. Identify actionable patterns, not just descriptions
        4. Flag any data quality issues you notice
        5. Recommend next steps based on the analysis"""
    )
    return agent

# ─── Main ─────────────────────────────────────────────────────────────────────

if __name__ == "__main__":
    # Build the knowledge base (one-time)
    index = build_ticket_knowledge_base("./support_tickets/")

    # Create the agent
    agent = create_analytics_agent(index)

    # Run queries
    questions = [
        "What are the top 3 most common issues in the last 7 days and their resolution rates?",
        "Compare login issues vs PDF export issues - which should we prioritize?",
        "Are there any new issues trending upward that we should investigate?",
    ]

    for question in questions:
        print(f"\n{'='*60}")
        print(f"Q: {question}")
        print('='*60)
        response = agent.chat(question)
        print(f"A: {response.response}")

LlamaIndex vs LangChain vs LangGraph

| Dimension | LlamaIndex | LangChain | LangGraph |
| --- | --- | --- | --- |
| Primary focus | Data + RAG | Chains + integrations | Stateful graphs |
| Document loading | 160+ connectors | 50+ connectors | Via LangChain |
| Index types | VectorStore, Summary, KG, SQL | VectorStore only | Via LangChain |
| Query routing | RouterQueryEngine built-in | Manual | Via conditional edges |
| Agent type | FunctionCalling, ReAct | Tool-calling, ReAct | Any (explicit graph) |
| State management | Basic (workflow) | Implicit | Explicit TypedDict |
| Checkpointing | Limited | Not built-in | First-class |
| Human-in-loop | Via workflow | Via callbacks | interrupt() built-in |
| Best for | Knowledge-intensive agents | Integration-heavy chains | Complex state workflows |

Use LlamaIndex when:

  • Your agent primarily queries and reasons over documents
  • You need multiple retrieval strategies (semantic, keyword, structured)
  • You need metadata filtering on large document collections
  • RAG quality (chunking, retrieval, synthesis) is the core engineering challenge

Use LangGraph when:

  • Complex state routing is the primary challenge (not retrieval)
  • You need production-grade checkpointing and resumption
  • You need explicit human-in-the-loop workflows
  • Multi-agent coordination with explicit state passing

Production Engineering Notes

Index Versioning

When your document collection changes, you need to update the index without rebuilding it from scratch:

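from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.storage.docstore import SimpleDocumentStore

pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=512, chunk_overlap=50),
        embed_model,  # nodes need embeddings before they can be written to the vector store
    ],
    vector_store=vector_store,
    docstore=SimpleDocumentStore(),  # required for doc_id-based deduplication
)

# Incrementally add new documents (deduplication by doc_id and content hash).
# Persist the docstore between runs (pipeline.persist / pipeline.load) so that
# deduplication survives restarts.
# load_new_tickets_since_last_sync() is a placeholder for your own loader.
new_documents = load_new_tickets_since_last_sync()
pipeline.run(documents=new_documents, show_progress=True)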
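The deduplication decision itself is simple bookkeeping: remember a content hash per doc_id and compare on the next sync. A minimal sketch of that logic follows, assuming an in-memory dict as the document store; `plan_upserts` is a hypothetical helper and the library's own strategy handling has more options than shown here:

```python
import hashlib


def plan_upserts(docstore: dict[str, str], incoming: dict[str, str]) -> dict[str, list[str]]:
    """Classify incoming documents as insert, update, or skip by doc_id and content hash."""
    plan: dict[str, list[str]] = {"insert": [], "update": [], "skip": []}
    for doc_id, text in incoming.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if doc_id not in docstore:
            plan["insert"].append(doc_id)
        elif docstore[doc_id] != digest:
            plan["update"].append(doc_id)  # same ID, changed content: re-chunk and re-embed
        else:
            plan["skip"].append(doc_id)    # unchanged: no work needed
            continue
        docstore[doc_id] = digest
    return plan


store: dict[str, str] = {}
first_sync = plan_upserts(store, {"t1": "login fails", "t2": "pdf export fails"})
second_sync = plan_upserts(
    store,
    {"t1": "login fails", "t2": "pdf export fails on Safari", "t3": "billing error"},
)
print(first_sync)
print(second_sync)
```

Only the inserted and updated documents need to pass through chunking and embedding, which is where the cost savings of incremental ingestion come from.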

Retrieval Quality Evaluation

Never deploy a RAG system without evaluating retrieval quality:

from llama_index.core.evaluation import (
    RetrieverEvaluator,
    RelevancyEvaluator,    # answer-level evaluators, used separately from retrieval metrics
    FaithfulnessEvaluator
)

# Evaluate retrieval hit rate and MRR
retriever_evaluator = RetrieverEvaluator.from_metric_names(
    ["mrr", "hit_rate"],
    retriever=vector_engine.retriever
)

# Test dataset: (query, expected_node_ids). The IDs must be node IDs as stored
# in the index; the values below are placeholders.
test_cases = [
    ("login timeout error", ["ticket_123", "ticket_456"]),
    ("PDF export fails", ["ticket_789", "ticket_012"]),
]

for query, expected_ids in test_cases:
    # evaluate() is the synchronous counterpart of aevaluate(), so no event loop is needed
    result = retriever_evaluator.evaluate(query=query, expected_ids=expected_ids)
    print(f"Query: {query}")
    print(f"  Hit rate: {result.metric_vals_dict['hit_rate']:.2f}")
    print(f"  MRR: {result.metric_vals_dict['mrr']:.2f}")
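For reference, the two metrics reduce to a few lines of arithmetic. This is a minimal sketch with hypothetical helper names and made-up node IDs; RetrieverEvaluator computes the same quantities for you:

```python
def hit(retrieved: list[str], expected: set[str]) -> float:
    """1.0 if any expected node appears in the retrieved list, else 0.0."""
    return 1.0 if any(node_id in expected for node_id in retrieved) else 0.0


def reciprocal_rank(retrieved: list[str], expected: set[str]) -> float:
    """1 / rank of the first expected node, or 0.0 if none was retrieved."""
    for rank, node_id in enumerate(retrieved, start=1):
        if node_id in expected:
            return 1.0 / rank
    return 0.0


# (retrieved node IDs in rank order, expected node IDs) per test query
results = [
    (["n7", "n3", "n9"], {"n3"}),  # expected node at rank 2
    (["n1", "n2"], {"n1"}),        # expected node at rank 1
    (["n4", "n5"], {"n8"}),        # miss
]

hit_rate = sum(hit(r, e) for r, e in results) / len(results)
mrr = sum(reciprocal_rank(r, e) for r, e in results) / len(results)
print(f"hit rate = {hit_rate:.3f}, MRR = {mrr:.3f}")
```

Hit rate only asks whether the right node was retrieved at all, while MRR also rewards ranking it near the top, so a system can have a good hit rate and a mediocre MRR.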

Caching for Production

LlamaIndex can cache ingestion results so that unchanged text is not re-processed or re-embedded. A Redis-backed cache (requires the llama-index-storage-kvstore-redis package):

from llama_index.core.ingestion import IngestionCache, IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.storage.kvstore.redis import RedisKVStore as RedisCache

# Cache transformation outputs (including embeddings) to avoid re-embedding the same text
cache = IngestionCache(
    cache=RedisCache.from_host_and_port("localhost", 6379),
    collection="embedding_cache"
)
pipeline = IngestionPipeline(
    transformations=[SentenceSplitter(), embed_model],
    cache=cache
)

:::danger Never Use LlamaIndex's ReActAgent with Complex State

LlamaIndex's ReActAgent is designed for simple tool use with a ReAct reasoning trace. It does not support complex state management, checkpointing, or human-in-the-loop. If your agent needs to pause for human review, recover from partial completion, or maintain complex non-message state, use LlamaIndex Workflows or switch to LangGraph for the orchestration layer while keeping LlamaIndex for the retrieval layer.

These two frameworks compose well: LangGraph nodes that call LlamaIndex query engines give you the best of both - LangGraph's stateful orchestration with LlamaIndex's retrieval quality.

:::

:::warning Retrieval Quality vs. Generation Quality

Engineers often optimize LLM generation quality (better prompts, larger models) when the actual problem is retrieval quality (retrieving the wrong nodes). Before changing your model or prompt, instrument your retrieval: log which nodes were retrieved for each query, log their relevance scores, and manually inspect whether the retrieved nodes contain the information needed to answer the question.

A vector store returning the wrong nodes cannot be fixed by a better prompt. Fix the retrieval first: adjust similarity_top_k, add metadata filters, try hybrid search (vector + keyword), or improve your chunking strategy.

:::
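One common way to combine vector and keyword results for hybrid search is reciprocal rank fusion (RRF), which merges ranked lists using only rank positions. The sketch below is a generic illustration of that technique with made-up node IDs, not a description of how any particular LlamaIndex retriever is implemented; `k=60` is the constant conventionally used with RRF:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: each list contributes 1 / (k + rank) to a node's score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, node_id in enumerate(ranking, start=1):
            scores[node_id] = scores.get(node_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


vector_hits = ["n2", "n5", "n1"]   # ranked by embedding similarity
keyword_hits = ["n1", "n2", "n9"]  # ranked by keyword overlap

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)
```

Nodes that both retrievers agree on rise to the top, which is usually the behavior you want when queries mix error codes with conceptual phrasing.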


Interview Questions and Answers

Q1: How does LlamaIndex differ from LangChain architecturally, and when would you choose it?

LlamaIndex is document-centric. Its primary abstraction is the document pipeline: load documents, parse into nodes, index nodes, query the index. Everything else - agents, workflows, routing - is built on top of this retrieval foundation. LangChain is chain-centric: the primary abstraction is composing components into chains, with retrieval as one of many integration types.

Choose LlamaIndex when the core engineering challenge is retrieval quality: how to get the right information out of a large, heterogeneous document collection for LLM consumption. Its multiple index types (VectorStore, Summary, KnowledgeGraph), metadata filtering, hybrid retrieval, and RouterQueryEngine are significantly more sophisticated than LangChain's retrieval capabilities.

Choose LangChain when you need many external service integrations (Slack, Notion, databases) or when the chain composition model matches your problem. Choose LangGraph when the challenge is stateful orchestration. These frameworks are not mutually exclusive - LlamaIndex retrieval as a LangGraph node is a common production pattern.

Q2: What is the RouterQueryEngine in LlamaIndex and why is it useful?

RouterQueryEngine uses an LLM (or a lightweight classifier) to select the most appropriate query engine for each incoming query. You define multiple engines - each specialized for a different query type - with human-readable descriptions. The router reads the query, reads the descriptions, and routes to the best-fit engine.

This is useful because different question types require different retrieval strategies. "What is the error message users see when PDF export fails?" needs semantic search - it is looking for similar content. "Summarize all issues from Q3" needs a summary index that reads many documents. "Find ticket #4721" needs keyword lookup. A single vector search engine handles none of these optimally. RouterQueryEngine picks the right tool for each query automatically, without requiring you to classify queries in your application code.

Q3: Explain the difference between LlamaIndex's VectorStoreIndex, SummaryIndex, and KeywordTableIndex. When would each be appropriate?

VectorStoreIndex converts all nodes to embeddings and stores them in a vector database. Retrieval is by semantic similarity: the query is embedded, and the nearest neighbors are returned. Best for: questions where the answer is semantically related to the query - user experience questions, symptom matching, thematic search.

SummaryIndex keeps all nodes in a list without embedding them. At query time, it passes all (or a selected subset of) nodes to the LLM for synthesis. Best for: summarization questions that require reading across many documents - "What are the main themes in this month's tickets?", "Summarize all issues the high-priority customer raised."

KeywordTableIndex extracts keywords from each node and builds a keyword-to-node mapping. Retrieval is by keyword overlap. Best for: exact lookup queries - "Find tickets containing error code E4021", "Show me all tickets from user ID 7823." Keyword retrieval is fast and precise for known terms but fails for conceptual queries.

In production, combine all three via RouterQueryEngine so each query uses the most appropriate strategy.

Q4: What are LlamaIndex Workflows and how do they compare to LangGraph?

LlamaIndex Workflows (2024) are an event-driven orchestration system. You define steps decorated with @step, each consuming specific event types and emitting event types. The workflow engine routes events to the correct steps automatically based on type matching. Complex orchestration emerges from event flows rather than explicit graph edges.
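The type-matching idea can be illustrated without the library. The sketch below is a deliberately reduced model under stated assumptions: it is synchronous, it allows one handler per event type, and the event classes and `run` function are hypothetical. The real engine is asynchronous and infers the mapping from the type annotations on each @step method:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Start:
    query: str


@dataclass
class Retrieved:
    query: str
    nodes: list


@dataclass
class Stop:
    result: str


def run(steps: dict[type, Callable], event):
    """Dispatch each event to the step registered for its type until a Stop appears."""
    while not isinstance(event, Stop):
        event = steps[type(event)](event)
    return event.result


steps = {
    Start: lambda ev: Retrieved(ev.query, ["ticket A", "ticket B"]),
    Retrieved: lambda ev: Stop(f"{len(ev.nodes)} nodes for '{ev.query}'"),
}

result = run(steps, Start("login issues"))
print(result)
```

No step names another step directly; the pipeline's shape is implied entirely by which event types each step accepts and emits.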

Compared to LangGraph: Workflows are more natural for data pipeline patterns where data transforms through clear stages. LangGraph is more natural for control flow patterns where routing decisions depend on accumulated state. Workflows have less boilerplate for linear pipelines. LangGraph has better support for cycles, explicit state inspection, and production persistence.

The practical choice: use LlamaIndex Workflows when your pipeline is primarily a RAG pipeline with some orchestration on top. Use LangGraph when the orchestration complexity is the primary engineering challenge, and use LlamaIndex query engines as tools within LangGraph nodes.

Q5: How do you evaluate whether a LlamaIndex RAG system is working well in production?

Evaluate at three layers:

Retrieval evaluation: does the system retrieve the right nodes? Measure hit rate (is the expected node in the top-K results?) and MRR (mean reciprocal rank - how high is the expected node ranked?). Use RetrieverEvaluator with a golden dataset of (query, expected_node_ids) pairs. As a rough rule of thumb, a hit rate below about 70% at K=5 suggests that your chunking, embedding model, or metadata filtering needs improvement.

Faithfulness evaluation: does the generated answer contain only information present in the retrieved nodes? Use FaithfulnessEvaluator with an LLM judge. Low faithfulness means hallucination - the model is generating beyond what it retrieved. Fix: improve retrieval quality so the answer is always grounded, or reduce response_mode complexity.

Relevance evaluation: is the answer relevant to the question? Use RelevancyEvaluator. High faithfulness but low relevance means you are retrieving faithful but off-topic information - adjust your retrieval strategy, chunk sizes, or similarity threshold.

In production: log every query, log retrieved node IDs and scores, log the final answer. Sample 5% of queries for human evaluation weekly. Alert when faithfulness or relevance scores drop below your threshold.

© 2026 EngineersOfAI. All rights reserved.