LlamaIndex Deep Dive
A Production Scenario
A law firm wants an AI system that can answer questions across their entire document library - 20 years of contracts, case notes, legal opinions, precedents, and regulatory filings. The collection is 500,000 documents. Each question might require synthesizing information across 5-10 documents. Some questions require comparing a current contract against 50 similar historical contracts. Some require finding the one document in 500,000 that contains a specific clause.
You build the first version with LangChain's RetrievalQA. It works for simple factual questions. But when an attorney asks "How have our contract termination clauses evolved over the past 10 years?", the retriever returns 5 documents and the answer is superficial - it cannot synthesize across enough sources to see the evolution. When they ask "Compare the indemnification clause in Contract-2024-001 with our standard template and flag any deviations", the system retrieves generically related documents instead of the specific ones needed.
The problem is that LangChain treats retrieval as a generic step. It retrieves a fixed number of chunks and pipes them into a prompt. It does not know that your question requires comparing two specific documents, or that your question about "evolution over 10 years" requires time-aware retrieval, or that your query about indemnification is better answered by searching a specialized index of clause types rather than the full document corpus.
LlamaIndex was built precisely for this problem. It is not a generic LLM application framework - it is a data framework for LLM applications. It models the complexity of real document collections: different document types need different indices, different query types need different retrieval strategies, and complex questions need to be decomposed into sub-questions and answered separately before synthesis.
Why This Exists
LangChain Is Great at Agents; LlamaIndex Is Great at Data
Both LangChain and LlamaIndex build LLM applications. But they start from different assumptions.
LangChain starts from the model: how do I chain LLM calls, use tools, and build agents? Data retrieval is one component among many.
LlamaIndex starts from the data: how do I represent complex document collections so that LLMs can query them intelligently? The LLM is the consumer of a sophisticated data infrastructure.
For most RAG applications, LangChain's retrieval is good enough. For applications where the data complexity is the core challenge - large collections, heterogeneous document types, complex query patterns - LlamaIndex's abstractions are significantly better.
The Key Abstraction Gap
In LangChain, a retriever takes a query string and returns List[Document]; that is essentially the whole interface. You control chunking and embedding, but the retrieval step itself is a single, uniform call.
In LlamaIndex, you have multiple index types, each optimized for different query patterns. You have query engines that can decompose questions. You have response synthesizers with different strategies. You have post-processors that filter, rerank, or expand results. The retrieval pipeline is composable at a finer level of granularity.
Historical Context
LlamaIndex was created by Jerry Liu and Simon Suo in late 2022, originally called "GPT Index." The initial focus was on making it easy to query documents with GPT-3. It quickly evolved into a full framework.
Key releases:
- v0.8 (2023): Added the query pipeline concept, making components composable
- v0.10 (early 2024): Major refactor to llama-index-core with modular integrations
- Workflows (2024): Event-driven, async workflow system - LlamaIndex's answer to LangGraph
The framework has consistently focused on the data layer: better chunking, smarter retrieval, and more sophisticated query understanding.
Core Abstractions
Documents and Nodes
A Document is the raw input (a PDF, a web page, a database row). LlamaIndex processes Documents into Node objects - chunks with metadata, embeddings, and relationships to other nodes.
from llama_index.core import Document, SimpleDirectoryReader
from llama_index.core.node_parser import (
    SentenceSplitter,
    SemanticSplitterNodeParser,
    HierarchicalNodeParser,
)
from llama_index.embeddings.openai import OpenAIEmbedding

# Load documents
documents = SimpleDirectoryReader("./documents/").load_data()
print(f"Loaded {len(documents)} documents")

# Basic chunking: split into sentences
splitter = SentenceSplitter(
    chunk_size=1024,
    chunk_overlap=128,
)
nodes = splitter.get_nodes_from_documents(documents)
print(f"Created {len(nodes)} nodes")

# Semantic chunking: split where meaning changes
embed_model = OpenAIEmbedding(model="text-embedding-3-small")
semantic_splitter = SemanticSplitterNodeParser(
    buffer_size=1,
    breakpoint_percentile_threshold=95,
    embed_model=embed_model,
)
semantic_nodes = semantic_splitter.get_nodes_from_documents(documents)

# Hierarchical chunking: multiple granularities
hier_parser = HierarchicalNodeParser.from_defaults(
    chunk_sizes=[2048, 512, 128]
)
hier_nodes = hier_parser.get_nodes_from_documents(documents)
# Creates parent nodes (2048 tokens) with child nodes (512, 128 tokens)
# Retrieval can use small chunks for precision, large for context
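To make the interaction between chunk_size and chunk_overlap concrete, here is a minimal pure-Python sketch of a sliding-window splitter. It is a toy illustration, not LlamaIndex's SentenceSplitter (which also respects sentence boundaries); the function name `split_with_overlap` and the token list are hypothetical.

```python
def split_with_overlap(tokens, chunk_size, chunk_overlap):
    """Toy splitter: fixed-size windows where neighbours share `chunk_overlap` tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the input
    return chunks


tokens = [f"t{i}" for i in range(10)]  # hypothetical token stream t0..t9
chunks = split_with_overlap(tokens, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last token of the previous one, so a sentence that straddles a boundary is still seen whole in at least one chunk. Larger overlap buys more boundary safety at the cost of more nodes to embed and store.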
Index Types
from llama_index.core import (
    VectorStoreIndex,
    SummaryIndex,
    TreeIndex,
    KnowledgeGraphIndex,
    Settings,
)
from llama_index.llms.anthropic import Anthropic
from llama_index.embeddings.openai import OpenAIEmbedding

# Configure global settings
Settings.llm = Anthropic(model="claude-opus-4-6", max_tokens=2048)
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
Settings.chunk_size = 1024

# ── VectorStoreIndex: semantic similarity search ───────────────────────────────
# Best for: fact retrieval, document QA, general semantic search
vector_index = VectorStoreIndex.from_documents(documents)

# Persist to disk
vector_index.storage_context.persist(persist_dir="./vector_index")

# Load from disk
from llama_index.core import StorageContext, load_index_from_storage
storage_context = StorageContext.from_defaults(persist_dir="./vector_index")
vector_index = load_index_from_storage(storage_context)

# ── SummaryIndex: sequential synthesis over all documents ─────────────────────
# Best for: "summarize all documents", "what are the key themes across documents"
# NOT for: targeted fact retrieval (retrieves ALL nodes)
summary_index = SummaryIndex.from_documents(documents)

# ── KnowledgeGraphIndex: entity-relationship graph ───────────────────────────
# Best for: "how is X related to Y?", graph traversal queries
kg_index = KnowledgeGraphIndex.from_documents(
    documents,
    max_triplets_per_chunk=5,
    include_embeddings=True,
)
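The practical difference between a vector index and a summary index is what the retrieval step hands to the LLM. The sketch below illustrates that contrast in plain Python with made-up two-dimensional "embeddings"; the node names, vectors, and helper functions are all hypothetical and do not reflect LlamaIndex internals.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical nodes with toy 2-d embeddings
nodes = {
    "termination_clause": [0.9, 0.1],
    "indemnification_clause": [0.2, 0.8],
    "payment_terms": [0.6, 0.4],
    "governing_law": [0.1, 0.9],
}


def vector_retrieve(query_vec, top_k):
    """Vector-index style: rank by similarity and keep only the top_k nodes."""
    ranked = sorted(nodes, key=lambda name: cosine(query_vec, nodes[name]), reverse=True)
    return ranked[:top_k]


def summary_retrieve():
    """Summary-index style: no ranking, every node goes to synthesis."""
    return list(nodes)


query_vec = [1.0, 0.0]  # toy embedding of a query about termination
top = vector_retrieve(query_vec, top_k=2)
everything = summary_retrieve()
print("vector index passes:", top)
print("summary index passes:", everything)
```

With top_k=2 the vector path never shows the indemnification or governing-law nodes to the LLM, which is exactly what you want for a targeted question and exactly what you do not want for "summarize everything".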
Query Engines
A QueryEngine wraps an index with retrieval, postprocessing, and response synthesis into a single callable interface.
from llama_index.core.query_engine import (
    RetrieverQueryEngine,
    SubQuestionQueryEngine,
)
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.postprocessor import (
    SimilarityPostprocessor,
    KeywordNodePostprocessor,
    LongContextReorder,
)
# CohereRerank ships in a separate integration package, not in llama_index.core:
#   pip install llama-index-postprocessor-cohere-rerank
# from llama_index.postprocessor.cohere_rerank import CohereRerank
from llama_index.core.response_synthesizers import (
    ResponseMode,
    get_response_synthesizer,
)

# ── Basic Query Engine ─────────────────────────────────────────────────────────
basic_engine = vector_index.as_query_engine(
    similarity_top_k=5,
    response_mode="compact",  # Compact multiple chunks into one prompt
)
response = basic_engine.query("What are the key contract termination clauses?")
print(response.response)

# Access source nodes (which chunks were used)
for node in response.source_nodes:
    print(f"  Source: {node.metadata.get('file_name', 'unknown')}, "
          f"Score: {node.score:.3f}")

# ── Query Engine with Reranking ────────────────────────────────────────────────
retriever = VectorIndexRetriever(
    index=vector_index,
    similarity_top_k=20,  # Retrieve many candidates, then filter/rerank down
)

# Cohere reranking (requires API key and the import shown above)
# reranker = CohereRerank(api_key="...", top_n=5)

# Simple similarity threshold filter
similarity_filter = SimilarityPostprocessor(similarity_cutoff=0.75)

# Reorder for long context (put best at beginning/end)
long_context_reorder = LongContextReorder()

synthesizer = get_response_synthesizer(
    response_mode=ResponseMode.REFINE,  # Iteratively refine answer with each node
)
engine_with_reranking = RetrieverQueryEngine(
    retriever=retriever,
    node_postprocessors=[
        similarity_filter,
        long_context_reorder,
    ],
    response_synthesizer=synthesizer,
)

# ── Response Modes ─────────────────────────────────────────────────────────────
# COMPACT: concatenate chunks, one LLM call (fast, cheap)
compact_engine = vector_index.as_query_engine(response_mode="compact")

# REFINE: iterate through chunks, refining answer at each step (slow, thorough)
refine_engine = vector_index.as_query_engine(response_mode="refine")

# TREE_SUMMARIZE: hierarchical summarization (good for many chunks)
tree_engine = vector_index.as_query_engine(response_mode="tree_summarize")

# NO_TEXT: just retrieve nodes, no synthesis (for custom synthesis)
no_text_engine = vector_index.as_query_engine(response_mode="no_text")
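The cost difference between response modes is easiest to see by counting LLM calls. The following sketch uses a stub LLM and simplified versions of the three strategies. It is an assumption-laden illustration: `CountingLLM`, the character budget `max_chars`, and the `fan_in` parameter are hypothetical, and the sketch assumes that compact mode spills into additional refine steps when the chunks do not fit in one prompt.

```python
class CountingLLM:
    """Stub LLM that records how many times it is called."""

    def __init__(self):
        self.calls = 0

    def complete(self, prompt):
        self.calls += 1
        return f"answer after call {self.calls}"


def refine(llm, question, chunks):
    """One call per chunk; each call sees the previous answer."""
    answer = ""
    for chunk in chunks:
        answer = llm.complete(f"Q: {question}\nSo far: {answer}\nContext: {chunk}")
    return answer


def compact(llm, question, chunks, max_chars):
    """Pack chunks into as few prompts as fit, then refine over the packed prompts."""
    packed, current = [], ""
    for chunk in chunks:
        if current and len(current) + len(chunk) > max_chars:
            packed.append(current)
            current = ""
        current += chunk
    if current:
        packed.append(current)
    return refine(llm, question, packed)


def tree_summarize(llm, question, chunks, fan_in=2):
    """Summarize groups of `fan_in` texts, then summarize the summaries."""
    if not chunks:
        raise ValueError("need at least one chunk")
    level = list(chunks)
    while True:
        level = [
            llm.complete(f"Q: {question}\nCombine: {' | '.join(level[i:i + fan_in])}")
            for i in range(0, len(level), fan_in)
        ]
        if len(level) == 1:
            return level[0]


chunks = ["x" * 40] * 4  # four 40-character chunks
calls = {}
for name, run in [
    ("refine", lambda llm: refine(llm, "q", chunks)),
    ("compact", lambda llm: compact(llm, "q", chunks, max_chars=100)),
    ("tree_summarize", lambda llm: tree_summarize(llm, "q", chunks)),
]:
    llm = CountingLLM()
    run(llm)
    calls[name] = llm.calls
print(calls)
```

Under these toy settings refine makes one call per chunk, compact makes one call per packed prompt, and tree_summarize sits in between. Unlike refine, the calls within one tree level are independent of each other, so they could in principle run concurrently.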
Sub-Question Query Engine: Decompose Complex Queries
The sub-question query engine is one of LlamaIndex's most powerful features. It automatically breaks a complex question into sub-questions, routes each to the appropriate index, then synthesizes the combined results.
from llama_index.core import VectorStoreIndex
from llama_index.core.query_engine import SubQuestionQueryEngine
from llama_index.core.tools import QueryEngineTool

# Create separate indices for different document collections
# (contract_documents, regulatory_documents and case_law_documents are assumed
# to be lists of Document objects loaded elsewhere)
contracts_index = VectorStoreIndex.from_documents(contract_documents)
regulations_index = VectorStoreIndex.from_documents(regulatory_documents)
case_law_index = VectorStoreIndex.from_documents(case_law_documents)

# Wrap each index as a tool with a description
tools = [
    QueryEngineTool.from_defaults(
        query_engine=contracts_index.as_query_engine(),
        name="contracts",
        description=(
            "Contains all client contracts and agreements from 2015-2024. "
            "Use for questions about contract terms, clauses, parties, and dates."
        ),
    ),
    QueryEngineTool.from_defaults(
        query_engine=regulations_index.as_query_engine(),
        name="regulations",
        description=(
            "Contains regulatory filings, compliance documents, and legal standards. "
            "Use for questions about compliance requirements and regulatory obligations."
        ),
    ),
    QueryEngineTool.from_defaults(
        query_engine=case_law_index.as_query_engine(),
        name="case_law",
        description=(
            "Contains relevant legal precedents and case summaries. "
            "Use for questions about how courts have interpreted specific clauses."
        ),
    ),
]

# Sub-question engine automatically decomposes queries
sub_question_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=tools,
    verbose=True,
)

# For a complex question, it might generate:
# Sub-Q1 → contracts: "What does Contract-2024-001 say about indemnification?"
# Sub-Q2 → regulations: "What are the regulatory requirements for indemnification?"
# Sub-Q3 → case_law: "How have courts interpreted standard indemnification clauses?"
# Then synthesize all three answers
response = sub_question_engine.query(
    "Does Contract-2024-001's indemnification clause comply with current regulations "
    "and how does it compare to recent court interpretations?"
)
print(response)
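The decompose, route, answer, synthesize loop described above can be sketched without any LLM at all. In the toy version below every piece that a real engine delegates to a model is replaced by a stub; `Tool`, `SubQuestion`, `stub_decompose`, and `stub_synthesize` are hypothetical names for illustration, not LlamaIndex classes.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Tool:
    name: str
    description: str
    answer: Callable[[str], str]  # stand-in for a query engine


@dataclass
class SubQuestion:
    tool_name: str
    text: str


def stub_decompose(question: str, tools: List[Tool]) -> List[SubQuestion]:
    # A real engine asks an LLM to plan using the tool descriptions.
    # This stub simply emits one sub-question per tool.
    return [SubQuestion(t.name, f"{question} (according to {t.name})") for t in tools]


def stub_synthesize(question: str, answers: List[Tuple[SubQuestion, str]]) -> str:
    # A real engine makes a final LLM call; this stub concatenates the answers.
    return " ".join(f"[{sq.tool_name}] {ans}" for sq, ans in answers)


def sub_question_query(question, tools, decompose=stub_decompose, synthesize=stub_synthesize):
    by_name = {t.name: t for t in tools}
    answers = []
    for sq in decompose(question, tools):
        # Route each sub-question to the tool the planner picked for it
        answers.append((sq, by_name[sq.tool_name].answer(sq.text)))
    return synthesize(question, answers)


tools = [
    Tool("contracts", "client contracts", lambda q: "clause 9 caps liability"),
    Tool("regulations", "regulatory filings", lambda q: "caps are permitted"),
]
result = sub_question_query("Is the indemnification cap compliant?", tools)
print(result)
```

The structure makes the cost model visible: one planning call, one query per sub-question, and one synthesis call. It also shows why tool descriptions matter, since they are the only information the planner has for routing.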
LlamaIndex Workflows: Event-Driven Async Agents
LlamaIndex Workflows is the framework's stateful, async, event-driven answer to LangGraph. Instead of a graph with typed edges, you define event types and handlers that react to events.
from llama_index.core.workflow import (
    Workflow,
    StartEvent,
    StopEvent,
    Event,
    step,
    Context,
)
from llama_index.core import VectorStoreIndex
from llama_index.llms.anthropic import Anthropic


# ── Define Events ─────────────────────────────────────────────────────────────
class QueryReceivedEvent(Event):
    query: str


class RetrievalDoneEvent(Event):
    query: str
    retrieved_context: str
    needs_clarification: bool


class ClarificationNeededEvent(Event):
    query: str
    clarification_question: str


class ResponseReadyEvent(Event):
    response: str


# ── Define Workflow ────────────────────────────────────────────────────────────
class IntelligentRAGWorkflow(Workflow):
    """
    A RAG workflow with a query clarity check and optional clarification.
    """

    def __init__(self, index: VectorStoreIndex, **kwargs):
        super().__init__(**kwargs)
        self.index = index
        self.llm = Anthropic(model="claude-opus-4-6")

    @step
    async def classify_and_retrieve(
        self,
        ctx: Context,
        ev: StartEvent
    ) -> RetrievalDoneEvent:
        """Check that the query is clear, then retrieve relevant context."""
        query = ev.query

        # Check if query is clear enough
        clarity_check = await self.llm.acomplete(
            f"Is this question clear and answerable? "
            f"Reply YES or NO and briefly why.\n\nQuestion: {query}"
        )
        # Only look at the start of the reply, so a "NO, ... yes ..." answer
        # is not misread as clear
        is_clear = clarity_check.text.strip().upper().startswith("YES")
        if not is_clear:
            return RetrievalDoneEvent(
                query=query,
                retrieved_context="",
                needs_clarification=True
            )

        # Retrieve context
        retriever = self.index.as_retriever(similarity_top_k=5)
        nodes = await retriever.aretrieve(query)
        context = "\n\n".join(
            f"[Source: {n.metadata.get('file_name', 'unknown')}]\n{n.text}"
            for n in nodes
        )
        return RetrievalDoneEvent(
            query=query,
            retrieved_context=context,
            needs_clarification=False
        )

    @step
    async def handle_clarification(
        self,
        ctx: Context,
        ev: RetrievalDoneEvent
    ) -> ClarificationNeededEvent | ResponseReadyEvent:
        """Branch: ask for clarification or generate answer."""
        if ev.needs_clarification:
            # Generate a clarifying question
            clarification = await self.llm.acomplete(
                f"Generate a specific clarifying question for: {ev.query}"
            )
            return ClarificationNeededEvent(
                query=ev.query,
                clarification_question=clarification.text
            )

        # Generate the answer
        response = await self.llm.acomplete(
            f"Context:\n{ev.retrieved_context}\n\nQuestion: {ev.query}\n\nAnswer:"
        )
        return ResponseReadyEvent(response=response.text)

    @step
    async def handle_clarification_needed(
        self,
        ctx: Context,
        ev: ClarificationNeededEvent
    ) -> StopEvent:
        """Return clarification question to user."""
        return StopEvent(result={
            "type": "clarification_needed",
            "question": ev.clarification_question
        })

    @step
    async def finalize_response(
        self,
        ctx: Context,
        ev: ResponseReadyEvent
    ) -> StopEvent:
        """Return final answer."""
        return StopEvent(result={
            "type": "answer",
            "response": ev.response
        })


# ── Run the Workflow ──────────────────────────────────────────────────────────
import asyncio


async def run_workflow():
    # In real code, create an actual index
    # index = VectorStoreIndex.from_documents(documents)
    # workflow = IntelligentRAGWorkflow(index=index, timeout=60, verbose=True)
    # result = await workflow.run(query="What are the termination clauses?")
    # print(result)
    pass
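To see what "event-driven" means mechanically, here is a minimal dispatcher in plain asyncio: handlers are registered per event type, and the loop keeps routing whatever event a handler returns until a stop event appears. This is a sketch of the pattern only. `MiniWorkflow` and its `on` decorator are hypothetical, and the real Workflow class infers routing from type annotations on each step rather than from an explicit decorator argument.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Start:
    query: str


@dataclass
class Retrieved:
    query: str
    context: str


@dataclass
class Stop:
    result: str


class MiniWorkflow:
    """Toy event loop: dispatch each event to the handler registered for its type."""

    def __init__(self):
        self.handlers = {}

    def on(self, event_type):
        def register(fn):
            self.handlers[event_type] = fn
            return fn
        return register

    async def run(self, event):
        while not isinstance(event, Stop):
            handler = self.handlers[type(event)]  # routing is by event type only
            event = await handler(event)
        return event.result


wf = MiniWorkflow()


@wf.on(Start)
async def retrieve(ev):
    return Retrieved(query=ev.query, context="(stub context)")


@wf.on(Retrieved)
async def answer(ev):
    return Stop(result=f"Answer to {ev.query!r} using {ev.context}")


result = asyncio.run(wf.run(Start(query="termination clauses")))
print(result)
```

No step calls another step directly. Adding a branch means introducing a new event type and a handler for it, which is why the workflow above can split into a clarification path and an answer path without any explicit edge definitions.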
Multi-Document Agents
LlamaIndex's agent system is optimized for routing queries across multiple document collections using tool-based selection.
from llama_index.core import VectorStoreIndex
from llama_index.core.agent import ReActAgent, FunctionCallingAgent
from llama_index.core.tools import QueryEngineTool, FunctionTool
from llama_index.llms.anthropic import Anthropic


# Build specialized indices
def create_document_agent(documents, index_name: str) -> QueryEngineTool:
    """Create a query engine tool for a document collection."""
    index = VectorStoreIndex.from_documents(documents)
    engine = index.as_query_engine(similarity_top_k=3, response_mode="compact")
    return QueryEngineTool.from_defaults(
        query_engine=engine,
        name=index_name,
        description=f"Query the {index_name} document collection"
    )


# In production, you would load real documents
# contract_tool = create_document_agent(contract_docs, "contracts")
# policy_tool = create_document_agent(policy_docs, "company_policies")
# tech_tool = create_document_agent(tech_docs, "technical_documentation")

# Add custom function tools alongside query tools
def check_compliance(clause_text: str) -> str:
    """Check if a contract clause meets compliance requirements."""
    return f"Compliance check result for clause: {clause_text[:100]}..."


compliance_tool = FunctionTool.from_defaults(
    fn=check_compliance,
    name="compliance_checker",
    description="Check if a specific contract clause meets current compliance requirements"
)

# Build the agent
tools = [compliance_tool]  # Add contract_tool, policy_tool, etc.
agent = ReActAgent.from_tools(
    tools=tools,
    llm=Anthropic(model="claude-opus-4-6"),
    verbose=True,
    max_iterations=10,
)

# The agent routes across document collections automatically
response = agent.chat(
    "Check the indemnification clause in our standard contract template "
    "for compliance with current regulations."
)
print(response.response)
Index Persistence and Incremental Updates
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.vector_stores.chroma import ChromaVectorStore
import chromadb


def create_persistent_index(
    documents,
    persist_dir: str = "./index_storage"
) -> VectorStoreIndex:
    """Create a vector index with persistent storage."""
    chroma_client = chromadb.PersistentClient(path=persist_dir)
    chroma_collection = chroma_client.get_or_create_collection("documents")
    vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
    storage_context = StorageContext.from_defaults(vector_store=vector_store)
    index = VectorStoreIndex.from_documents(
        documents,
        storage_context=storage_context,
        show_progress=True
    )
    return index


def add_documents_incrementally(
    index: VectorStoreIndex,
    new_documents: list
) -> None:
    """Add new documents without rebuilding the entire index."""
    for doc in new_documents:
        # LlamaIndex handles chunking, embedding, and insertion
        index.insert(doc)
    print(f"Added {len(new_documents)} new documents to index")


def delete_document(index: VectorStoreIndex, doc_id: str) -> None:
    """Remove a document from the index by ID."""
    index.delete_ref_doc(doc_id, delete_from_docstore=True)
    print(f"Deleted document: {doc_id}")
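Before calling insert or delete in bulk, an ingestion job has to decide which incoming documents are new, which have changed, and which can be skipped. One common approach is to compare content hashes keyed by a stable document id. The sketch below shows that decision in isolation; `plan_refresh` and the in-memory `stored_hashes` dict are hypothetical, and it assumes every document carries a stable id.

```python
import hashlib


def content_hash(text):
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_refresh(stored_hashes, incoming_docs):
    """Split incoming docs into ids to insert, ids to update, and unchanged ids."""
    to_insert, to_update, unchanged = [], [], []
    for doc_id, text in incoming_docs.items():
        new_hash = content_hash(text)
        if doc_id not in stored_hashes:
            to_insert.append(doc_id)      # never seen before
        elif stored_hashes[doc_id] != new_hash:
            to_update.append(doc_id)      # same id, different content
        else:
            unchanged.append(doc_id)      # skip: no re-embedding needed
    return to_insert, to_update, unchanged


# Hypothetical state from a previous indexing run
stored_hashes = {"a": content_hash("v1"), "b": content_hash("same")}
incoming_docs = {"a": "v2", "b": "same", "c": "new"}

to_insert, to_update, unchanged = plan_refresh(stored_hashes, incoming_docs)
print("insert:", to_insert, "update:", to_update, "skip:", unchanged)
```

Skipping unchanged documents matters at the scale of the scenario above: re-embedding 500,000 documents on every sync is slow and expensive, while hashing them is cheap.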
LlamaIndex vs LangChain for RAG
| Dimension | LangChain | LlamaIndex |
|---|---|---|
| Focus | Agent orchestration, model chaining | Data indexing, retrieval, query processing |
| Index types | Primarily vector store | Vector, Summary, Tree, Knowledge Graph |
| Retrieval | Retriever (returns chunks) | Multiple retrievers with postprocessors |
| Query complexity | Single retrieval step | Sub-question decomposition, query fusion |
| Document loaders | Good selection | Excellent selection (85+ connectors) |
| Agents | AgentExecutor, LangGraph | ReActAgent, FunctionCallingAgent, Workflows |
| Observability | LangSmith (excellent) | Arize Phoenix, LlamaTrace |
| Learning curve | Moderate | Moderate to steep |
| Best for | General agents, LLM pipelines | Complex RAG, multi-document QA |
Rule of Thumb
Use LlamaIndex when the core challenge is your data: you have complex document collections, need multiple index types, need sophisticated retrieval beyond basic similarity search, or need sub-question decomposition.
Use LangChain when the core challenge is your workflow: you need complex agent orchestration, conditional routing, multi-agent coordination, or integration with many different APIs.
They are not mutually exclusive. Many production systems use LlamaIndex for the data layer and LangChain or direct API calls for the agent layer.
Tracing with Arize Phoenix
import phoenix as px
from openinference.instrumentation.llama_index import LlamaIndexInstrumentor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk import trace as trace_sdk
from opentelemetry.sdk.trace.export import SimpleSpanProcessor

# Start Phoenix (local tracing UI at localhost:6006)
px.launch_app()

# Set up OpenTelemetry: export spans to the local Phoenix collector over OTLP/HTTP
# (the endpoint assumes Phoenix is running on its default local port)
tracer_provider = trace_sdk.TracerProvider()
tracer_provider.add_span_processor(
    SimpleSpanProcessor(OTLPSpanExporter(endpoint="http://127.0.0.1:6006/v1/traces"))
)

# Instrument LlamaIndex
LlamaIndexInstrumentor().instrument(tracer_provider=tracer_provider)
# Now all LlamaIndex operations are automatically traced
# index = VectorStoreIndex.from_documents(documents)
# engine = index.as_query_engine()
# response = engine.query("...") # This call is traced in Phoenix UI
Common Mistakes
:::danger Using VectorStoreIndex for Summarization Tasks
If the user asks "summarize all documents" and you use a VectorStoreIndex, you will only summarize the top-5 retrieved chunks - not all documents. For comprehensive tasks that need to process all content, use SummaryIndex which iterates through all nodes. Use VectorStoreIndex only for targeted retrieval.
:::
:::danger Not Configuring chunk_size and chunk_overlap for Your Domain
The default chunk sizes (1024 tokens, 200 overlap) work for general text. Legal documents, code, and scientific papers have very different optimal chunk sizes. Legal documents often need larger chunks (2000+ tokens) to preserve clause context. Code needs smaller chunks aligned to function boundaries. Always profile your retrieval quality with different chunk settings.
:::
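One way to make that profiling concrete is a hit-rate check: for each evaluation query, does the top retrieved chunk contain the expected answer? The toy below uses word-count chunks and keyword overlap instead of real embeddings, so the numbers only illustrate the mechanism; `chunk_words`, `top1`, `hit_rate`, and the one-item evaluation set are hypothetical.

```python
def chunk_words(words, chunk_size):
    """Split a word list into consecutive chunks of `chunk_size` words."""
    return [words[i:i + chunk_size] for i in range(0, len(words), chunk_size)]


def top1(chunks, query_terms):
    """Toy retriever: pick the chunk sharing the most terms with the query."""
    return max(chunks, key=lambda chunk: len(query_terms & {w.lower() for w in chunk}))


def hit_rate(words, eval_set, chunk_size):
    """Fraction of queries whose top chunk contains the expected answer word."""
    chunks = chunk_words(words, chunk_size)
    hits = 0
    for query_terms, expected in eval_set:
        hits += expected in top1(chunks, query_terms)
    return hits / len(eval_set)


text = ("Either party may terminate this agreement "
        "Termination requires ninety days written notice")
words = text.split()

# One evaluation item: query terms plus the word the right chunk must contain
eval_set = [({"terminate", "agreement", "notice"}, "ninety")]

small = hit_rate(words, eval_set, chunk_size=6)
large = hit_rate(words, eval_set, chunk_size=12)
print(f"chunk_size=6  hit rate: {small}")
print(f"chunk_size=12 hit rate: {large}")
```

With the small chunk size, the clause is split so that the chunk matching the query no longer contains the notice period, which is the failure mode the admonition above warns about for legal text. A real evaluation would use your own embedding model and a labeled query set.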
:::warning Ignoring the response_mode Parameter
The default response mode is compact (concatenate all chunks into one prompt). For complex questions across many documents, refine (iterate through chunks, refining the answer) or tree_summarize (hierarchical synthesis) often produces significantly better answers at the cost of more LLM calls.
:::
:::warning Not Using Node Postprocessors
Raw retrieved nodes often include low-relevance chunks that dilute the answer. Always add at least a similarity threshold filter (SimilarityPostprocessor(similarity_cutoff=0.7)) to remove low-quality retrievals. For production, add a reranker (Cohere, BGE) to reorder results by actual relevance.
:::
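As a rough picture of what a postprocessor chain does to retrieved results, the sketch below applies a similarity cutoff and then a long-context reorder that moves the strongest results to the two ends of the list. It operates on plain (name, score) tuples; the function names and scores are hypothetical, and the reorder is an approximation of the idea rather than the library's code.

```python
def similarity_cutoff(scored_nodes, cutoff):
    """Drop nodes whose score falls below the cutoff."""
    return [(name, score) for name, score in scored_nodes if score >= cutoff]


def reorder_for_long_context(scored_nodes):
    """Put the highest-scoring nodes at the ends and the weakest in the middle."""
    ordered = sorted(scored_nodes, key=lambda pair: pair[1])  # weakest first
    result = []
    for i, node in enumerate(ordered):
        if i % 2 == 0:
            result.insert(0, node)  # alternate between the front ...
        else:
            result.append(node)     # ... and the back of the list
    return result


# Hypothetical retrieval results: (node name, similarity score)
retrieved = [("a", 0.91), ("b", 0.85), ("c", 0.80), ("d", 0.72), ("e", 0.40)]

kept = similarity_cutoff(retrieved, cutoff=0.7)
final = reorder_for_long_context(kept)
names = [name for name, _ in final]
print(names)
```

The low-scoring node "e" never reaches the prompt, and the two best nodes end up first and last, where models tend to attend most reliably in long contexts.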
Interview Q&A
Q: What is the main conceptual difference between LlamaIndex and LangChain?
LangChain is primarily an agent orchestration framework - it focuses on chaining LLM calls, tool use, and agent loops. LlamaIndex is primarily a data framework - it focuses on representing complex document collections in ways that LLMs can query intelligently. LangChain's retrieval is a component in a larger agent pipeline. In LlamaIndex, retrieval is the primary concern, with multiple index types, query engines, and postprocessors designed specifically for complex retrieval patterns.
Q: When would you use SummaryIndex instead of VectorStoreIndex?
SummaryIndex processes ALL nodes (no retrieval step) and is designed for questions that require synthesizing across an entire document collection: "What are the main themes?", "Summarize everything." VectorStoreIndex retrieves the top-k most similar chunks - it is targeted and works for factual Q&A but will miss information not captured in the top-k. Use VectorStoreIndex for most Q&A tasks; use SummaryIndex when you need comprehensive coverage of the full document set.
Q: How does the Sub-Question Query Engine work?
The Sub-Question Query Engine has an LLM decompose a complex question into multiple simpler sub-questions. Each sub-question is routed to the most appropriate tool (index/query engine) based on the tool descriptions. Each sub-question is answered independently, then all answers are synthesized into a final response. This is particularly powerful for questions that span multiple data sources or require comparing information across different document collections.
Q: What are LlamaIndex Workflows and how do they compare to LangGraph?
Both are stateful workflow systems for building complex agent behaviors, but they model control flow differently. LlamaIndex Workflows is event-driven: you define event types and step handlers, and each step runs when an event of the type it accepts is emitted. LangGraph is graph-based: you define nodes and edges explicitly. Both support async execution, persistence, and complex routing. LlamaIndex Workflows integrates more naturally with LlamaIndex's indexing and retrieval components; LangGraph integrates more naturally with LangChain's chains and tools. For projects already using one ecosystem, use that ecosystem's workflow system.
Q: How do you handle incremental index updates as new documents arrive?
Use index.insert(document) to add individual documents without rebuilding. For bulk updates, use index.refresh_ref_docs(new_documents) which adds new documents and updates changed ones. To delete, use index.delete_ref_doc(doc_id). For high-throughput updates, batch the inserts and use an async indexing pipeline. Always use a persistent vector store (Chroma, Pinecone, Weaviate) rather than the in-memory store so updates survive process restarts.
