Framework Comparison
The $200K Decision
A startup building an enterprise AI research platform spends six months and $200,000 in engineering time discovering that their chosen framework - LangChain - does not support the stateful, human-in-the-loop workflow their enterprise customers require. The agent they built works for demos. It does not work for the scenario where a research report takes three days to complete, requires three different human approvals, and must be resumable if a cloud instance crashes.
The rebuild costs another three months and takes them to LangGraph. Had they made the right framework choice in month one, they would have saved five months and approximately $180,000 in salary costs.
The right framework choice is not about which framework is "best." It is about which framework's abstraction model matches the specific shape of your problem. This lesson gives you the tools to make that match correctly.
:::tip 🎮 Interactive Playground Visualize this concept: Try the Agent Frameworks demo on the EngineersOfAI Playground - no code required. :::
Why This Matters
Framework selection is an architectural decision that compounds over time. The framework you choose shapes:
- What is easy to build - the framework's "happy path" is frictionless
- What is hard to build - anything outside the happy path requires working against the framework
- What is impossible without major changes - some requirements are structurally incompatible with some frameworks
- Your debugging experience - different frameworks have vastly different debuggability at 3 AM
Choose well and you move fast for two years. Choose poorly and you spend six months fighting your framework, then spend another three months migrating.
The 12-Dimension Comparison
The comparison covers six options - Raw API, LangGraph, LangChain, CrewAI, AutoGen v0.4, and LlamaIndex - across twelve production dimensions rated on a 1-5 scale.
| Dimension | Raw API | LangGraph | LangChain | CrewAI | AutoGen v0.4 | LlamaIndex |
|---|---|---|---|---|---|---|
| Learning curve | 1 (easy) | 3 | 4 | 2 | 3 | 3 |
| Debugging quality | 5 | 4 | 2 | 3 | 3 | 3 |
| Flexibility | 5 | 4 | 3 | 2 | 3 | 3 |
| State management | 1 (manual) | 5 | 2 | 2 | 3 | 3 |
| Multi-agent support | 1 (manual) | 4 | 2 | 5 | 5 | 3 |
| Human-in-the-loop | 2 (manual) | 5 | 2 | 3 | 4 | 2 |
| Checkpointing | 1 (none) | 5 | 1 | 1 | 2 | 2 |
| Streaming | 5 | 4 | 3 | 2 | 3 | 3 |
| RAG/Retrieval quality | 1 (manual) | 2 | 3 | 2 | 2 | 5 |
| Integration ecosystem | 1 (write your own) | 3 | 5 | 3 | 3 | 4 |
| Production stability | 5 | 4 | 3 | 3 | 3 | 4 |
| Community/Support | 2 | 4 | 5 | 4 | 4 | 4 |
Scale: 1 = weakest, 5 = strongest for that dimension
Reading the Matrix
Raw API scores highest on debugging (5), flexibility (5), streaming (5), and production stability (5). It scores lowest on state management, multi-agent support, checkpointing, integration ecosystem, and community. Use it when you need maximum control and your problem is simple enough to not need framework features.
LangGraph scores highest on state management (5), human-in-the-loop (5), and checkpointing (5). Scores well on debugging (4) and multi-agent (4). Lowest on RAG/retrieval. Use it when complex stateful workflows are the core engineering challenge.
LangChain scores highest on integration ecosystem (5) and community (5). Scores poorly on debugging (2), state management (2), human-in-the-loop (2), and checkpointing (1). Use it when integration breadth is the primary requirement, not when agent complexity is high.
CrewAI scores highest on multi-agent (5). Scores well on learning curve (2 - easier). Scores poorly on flexibility (2) and checkpointing (1). Use it when role-based team coordination is the primary model.
AutoGen v0.4 scores highest on multi-agent (5) and conversational coordination. Reasonable across most dimensions. Use it when conversation and debate between agents is the primary interaction model.
LlamaIndex scores highest on RAG/retrieval quality (5). Strong on integration ecosystem (4). Use it when document retrieval and knowledge management is the core challenge.
Decision Flowchart
Case Studies: "We Chose X Because..."
Case Study 1: Enterprise Research Platform (LangGraph)
Company: B2B analytics startup, 40 engineers Use case: An agent that researches companies for sales teams - gathering financials, news, executive changes, and competitive intelligence over 2-4 hours with multiple human approval gates Framework chosen: LangGraph with PostgreSQL checkpointer Why LangGraph:
The task runs for hours, not minutes. It needs to pause three times for human approval (research plan approval, preliminary findings approval, final report approval). If the cloud instance crashes, the task must resume from the last approved checkpoint, not restart.
These requirements - multi-hour runtime, human-in-the-loop gates, crash resumption - map directly to LangGraph's interrupt_before, PostgreSQL checkpointing, and explicit state. No other framework handles all three in production.
What they gave up: LangGraph's learning curve is higher than CrewAI. The initial scaffolding took two weeks to understand and configure correctly. But in production, the explicit state model made debugging fast - every failure could be diagnosed by inspecting the checkpoint.
# The key LangGraph config for this system
from langgraph.checkpoint.postgres import PostgresSaver
graph = research_graph.compile(
checkpointer=PostgresSaver.from_conn_string(DATABASE_URL),
interrupt_before=["plan_approval", "findings_approval", "final_approval"]
)
config = {"configurable": {"thread_id": f"research-{company_id}-{request_id}"}}
Case Study 2: Legal Document Analysis (LlamaIndex)
Company: Legal tech startup, 15 engineers Use case: An agent that answers questions about a firm's contract history - "What are our standard indemnification terms?", "Which contracts have uncapped liability clauses?", "How many contracts expire in Q3?" Framework chosen: LlamaIndex with VectorStoreIndex + KeywordTableIndex + RouterQueryEngine Why LlamaIndex:
The problem is fundamentally about retrieval quality over 50,000 legal documents. A semantic question ("standard indemnification terms") needs vector search. A structural question ("which contracts have X clause") needs exact extraction. A counting question ("how many in Q3") needs structured querying over metadata.
LlamaIndex's RouterQueryEngine routes each query to the appropriate retrieval strategy automatically. Its metadata filtering enables structured queries by contract date, counterparty, and jurisdiction without building a separate database query layer.
What they gave up: LlamaIndex's agent orchestration is weaker than LangGraph. For simple queries, the FunctionCallingAgent is fine. For complex multi-step reasoning over many documents, they had to add LangGraph as the orchestration layer, using LlamaIndex query engines as LangGraph node tools.
# LlamaIndex routing + LangGraph orchestration
def build_legal_tools(index: VectorStoreIndex) -> list:
return [
QueryEngineTool.from_defaults(
query_engine=index.as_query_engine(response_mode="compact"),
name="semantic_search",
description="Find contracts based on meaning and concepts"
),
QueryEngineTool.from_defaults(
query_engine=metadata_filtered_engine,
name="structural_search",
description="Find contracts with specific clauses or structural elements"
)
]
Case Study 3: Content Production Pipeline (CrewAI)
Company: B2B SaaS company, marketing team of 5 Use case: Automated content pipeline - from topic brief to published blog post, including research, writing, fact-checking, SEO optimization Framework chosen: CrewAI with sequential process Why CrewAI:
The content pipeline has clear role boundaries: researcher, writer, editor, SEO specialist. Each role has a distinct goal and distinct tools. Tasks flow linearly. The expected output of each task is well-defined. This maps perfectly to CrewAI's Agent/Task/Crew model - no complex state routing, no human-in-the-loop (the human reviews the final output, not intermediate steps), no sophisticated retrieval.
CrewAI's learning curve was lowest for the team (two days to first working crew). The role/goal/backstory abstraction improved prompt quality because thinking in role terms led to better system messages than abstract LLM prompting.
What they gave up: limited flexibility when requirements change. Adding a conditional step (only run SEO optimization if the post is longer than 1000 words) required CrewAI Flows, which added complexity they had not anticipated.
Case Study 4: Internal Coding Assistant (AutoGen v0.4)
Company: Enterprise software company, internal tooling team Use case: A Slack bot that engineers can ask to write, test, and explain code - with the bot able to actually run the code it writes and iterate based on test output Framework chosen: AutoGen v0.4 with Docker code execution Why AutoGen:
The interaction model is naturally conversational: engineer asks a question, assistant responds with code, engineer says "that fails with X error," assistant revises. The Docker code executor runs every code block automatically, returning stdout/stderr to the assistant for self-correction.
AutoGen's RoundRobinGroupChat with an AssistantAgent and CodeExecutorAgent implements this pattern with minimal code. The conversation history is the full context - the engineer can refer to "the function from three messages ago" and the assistant has the full context.
What they gave up: AutoGen v0.4's observability is weaker than LangSmith. They had to implement custom logging to track conversation quality over time. Migration from v0.2 (which some team members had used previously) was non-trivial - the APIs are completely different.
Case Study 5: Customer Support Routing (Raw API + thin wrapper)
Company: Consumer fintech, 200K+ users Use case: First-response agent that triages support tickets, answers routine questions (account status, fee explanations, simple troubleshooting), and escalates complex cases Framework chosen: Raw Anthropic API with a 150-line custom wrapper Why raw API:
The agent has five tools: look up account, check transaction history, search FAQ, create escalation ticket, send notification. The logic is almost entirely linear. The p99 latency requirement (under 3 seconds for first response) ruled out frameworks with significant abstraction overhead. The compliance requirements mandated that every LLM call be logged in a specific format to a specific SIEM - something no framework natively supports.
The 150-line wrapper handles the agentic loop, the specific logging format, cost tracking per support ticket, and error handling with the company's specific retry and fallback policies. Everything is explicit, testable, and auditable.
What they gave up: when they later wanted to add multi-agent routing (for certain ticket types, route to a specialist agent rather than direct tool use), they had to build it themselves. This took two weeks. With LangGraph or AutoGen Swarm, it might have taken two days. The trade-off was accepted because the existing system was reliable, debuggable, and meeting SLAs.
Combining Frameworks
Production systems often use multiple frameworks for different concerns:
A real production system might use:
- LangGraph as the master orchestrator (state management, checkpointing, human-in-the-loop)
- LlamaIndex for document retrieval nodes within the graph
- CrewAI for content generation sub-workflows run as LangGraph nodes
- Raw API for specific nodes where latency or cost control is critical
# LangGraph node that runs a CrewAI crew
from crewai import Crew, Process
def content_generation_node(state: ResearchState) -> dict:
"""Run a CrewAI content generation crew as a LangGraph node."""
crew = Crew(
agents=[writer_agent, editor_agent],
tasks=[writing_task, editing_task],
process=Process.sequential
)
result = crew.kickoff(inputs={
"research": state["research_results"][0] if state["research_results"] else "",
"topic": state["topic"]
})
return {"draft": result.raw}
# LangGraph node that uses LlamaIndex retrieval
def retrieval_node(state: ResearchState) -> dict:
"""Run LlamaIndex retrieval as a LangGraph node."""
query_engine = vector_index.as_query_engine(similarity_top_k=8)
response = query_engine.query(state["topic"])
return {
"research_results": [response.response],
"sources": [n.node.metadata for n in response.source_nodes]
}
The 2025 Landscape: What Is Winning and Why
LangGraph Is Winning for Production Complex Agents
By the end of 2024, LangGraph had established itself as the production standard for complex stateful agents. The explicit state model, PostgreSQL checkpointing, and interrupt() human-in-the-loop support address the real problems that production agents face. Teams that started with LangChain's AgentExecutor have largely migrated to LangGraph.
What drove adoption: enterprises building agents that run for hours with multiple approval gates found that only LangGraph provided the reliability they needed. The learning curve was justified by the production gains.
CrewAI Is Winning for Content and Research Pipelines
For content generation, research pipelines, and marketing automation, CrewAI's role-based model has found a large audience. The learning curve is low enough that non-engineering teams can configure crews, and the quality improvements from role-based prompting are real.
What drove adoption: marketing and content teams who needed to automate pipelines without deep engineering investment. The role/goal/backstory model translates naturally to how these teams think about their workflows.
LlamaIndex Is Winning for Knowledge-Intensive Systems
Legal tech, financial services, healthcare, and enterprise document search - domains with large document collections and complex retrieval requirements - have converged on LlamaIndex. Its retrieval quality is genuinely superior to LangChain's for complex use cases.
What drove adoption: teams that tried LangChain for RAG and hit retrieval quality limits. LlamaIndex's multiple index types and RouterQueryEngine solve problems that LangChain's single vector search approach cannot.
Raw API Is Winning for Latency and Cost-Critical Systems
High-volume, low-latency agent systems - customer support, real-time analysis, consumer applications - are increasingly built on raw API to meet performance requirements. The simplicity also appeals to teams with strong engineering culture who prefer owning every line of behavior.
Framework Consolidation Predictions
The framework ecosystem is likely to consolidate around 2-3 winners:
-
LangGraph (complex orchestration) - the LangChain team's investment here is strong and the framework is clearly more mature than competitors for production stateful agents
-
LlamaIndex (knowledge access) - strong position in the enterprise document space that is hard for orchestration-focused frameworks to displace
-
A new entrant - the current frameworks were designed for the 2023-2024 era of agents. As models become more capable and agent task complexity increases, a framework designed for the 2026+ era may emerge
CrewAI and AutoGen are likely to remain relevant but will face pressure from LangGraph, which is increasingly adding multi-agent support that competes with their core value.
Quick Reference: The Right Framework in One Line
| Situation | Use This |
|---|---|
| Less than 5 tools, simple linear loop | Raw API |
| Complex routing, checkpointing, human-in-loop | LangGraph |
| Multi-step content or research pipeline | CrewAI |
| Agent debate, critique, collaborative code review | AutoGen v0.4 |
| Large document collection, sophisticated RAG | LlamaIndex |
| Maximum integration with Anthropic ecosystem | Raw API + Anthropic SDK |
| Need to move fast for a prototype | LangChain or CrewAI |
| Enterprise with strict control requirements | Raw API |
| Academic research on agent behavior | AutoGen or LangGraph |
Production Engineering Notes
The Migration Cost Is Real
Switching frameworks in production is not a 1:1 translation. LangChain's AgentExecutor is not the same as LangGraph's StateGraph. CrewAI's Crew is not the same as AutoGen's Team. Each framework embeds architectural assumptions that shape how you structure your application.
Budget for migration realistically: a well-built LangChain agent migrated to LangGraph typically takes 2-4 weeks for an experienced engineer. A raw API agent migrated to LangGraph takes 1-2 weeks. Plan the migration explicitly rather than treating it as incidental.
The Thin Adapter Pattern
Protect your application logic from framework churn with a thin adapter layer:
from abc import ABC, abstractmethod
class AgentBackend(ABC):
@abstractmethod
def run(self, task: str, **kwargs) -> str:
pass
class LangGraphBackend(AgentBackend):
def __init__(self, graph, config):
self.graph = graph
self.config = config
def run(self, task: str, **kwargs) -> str:
state = {"topic": task, **kwargs}
result = self.graph.invoke(state, self.config)
return result.get("draft", "")
class CrewAIBackend(AgentBackend):
def __init__(self, crew):
self.crew = crew
def run(self, task: str, **kwargs) -> str:
result = self.crew.kickoff(inputs={"topic": task, **kwargs})
return result.raw
class RawAPIBackend(AgentBackend):
def __init__(self, agent):
self.agent = agent
def run(self, task: str, **kwargs) -> str:
run = self.agent.run(task)
return run.final_answer if run.success else run.error
# Your application never imports from the framework directly
class ResearchApplication:
def __init__(self, backend: AgentBackend):
self.backend = backend
def research(self, topic: str) -> str:
return self.backend.run(topic)
:::danger Do Not Choose a Framework Because It Is Trending
Framework adoption follows hype cycles. LangChain was "the only real choice" in early 2023. By late 2023, it was "too complex and broken." By mid-2024, LangGraph was "the new LangChain killer." By late 2024, it was "the production standard."
Make framework choices based on your specific requirements, not on what is trending on X or HackerNews. The right framework for your problem might be the one that got less press this month. Evaluate frameworks against your use case, run a spike, measure the developer experience on your actual problem, then decide.
:::
:::warning Framework Version Pinning Is a Production Requirement
Every framework mentioned in this lesson has had breaking API changes between minor versions. This is not a criticism - it reflects the rapid pace of development. But it means that pip install crewai (without a pinned version) in a production Dockerfile is a reliability risk.
Pin all framework versions in your production dependencies. Test upgrades in staging before deploying. Read the changelog for every minor version upgrade. Make framework upgrades explicit engineering work, not incidental side effects of dependency updates.
:::
Interview Questions and Answers
Q1: If you had to recommend a single framework for a team just starting to build production agents, what would it be and why?
My recommendation depends on the use case, but if forced to give a single starting point for a team with no existing production agents: LangGraph.
LangGraph's learning curve is higher than CrewAI or LangChain, but it is the right curve to climb. The explicit state model, once understood, prevents the class of hard-to-debug bugs that implicit state causes. The checkpointing and human-in-the-loop support are production requirements that most agents eventually need - better to have them from the start than to retrofit them later.
The alternative recommendation for teams that need to move fast first and refactor later: raw API for the initial prototype. Raw API has the fastest time to a working agent, the easiest debugging, and the cleanest migration path to LangGraph when state complexity grows. The "prototype fast, migrate when pain is real" pattern works well for teams with the discipline to actually migrate when the pain signals appear.
Q2: How would you compare LangGraph and CrewAI for a multi-agent system that needs five specialized agents coordinating on a complex task?
The key question is whether the coordination is state-driven or role-driven.
State-driven coordination: the next agent to act depends on what the previous agent produced. A routing decision requires reading intermediate results. The agents share a complex state with many fields beyond message history. If this describes your system, LangGraph is clearly better - its conditional edges and typed state handle dynamic routing naturally.
Role-driven coordination: the agents have distinct expertise and responsibilities. The pipeline is relatively fixed (researcher → writer → editor, or analyst → fact-checker → publisher). The coordination is about handoffs between well-defined roles, not complex routing. If this describes your system, CrewAI is a natural fit and will get you to production faster.
In practice, complex production systems often need both. The right answer for many five-agent systems is: LangGraph for orchestration (which agent is active and when, with what state), with each agent potentially being a CrewAI crew or an AutoGen team for its internal coordination.
Q3: A team is choosing between LlamaIndex and LangChain for a RAG-powered agent. They have 50,000 documents in mixed formats (PDF, Word, email, CSV). Walk through how you would evaluate the two frameworks.
Build a retrieval quality benchmark first. Take 50 representative queries with known correct answers. Build the simplest possible RAG pipeline with each framework. Measure hit rate (is the answer in the retrieved chunks?) and end-to-end answer quality.
For 50,000 documents in mixed formats: LlamaIndex's document loading ecosystem includes production-grade parsers for PDF, Word, and email. LangChain's loaders work but are less mature for complex document types. LlamaIndex's metadata handling and filtering are more flexible - critical when you need to filter by document date, type, or source.
For RAG quality specifically: LlamaIndex's multiple index types (VectorStore for semantic, Summary for synthesis, Keyword for exact) and RouterQueryEngine are built for exactly this use case. A question like "summarize all emails from Q3" maps to SummaryIndex, which LangChain cannot easily replicate without significant custom code.
My prediction: LlamaIndex would show meaningfully better retrieval quality in the benchmark, especially for questions requiring synthesis across many documents or exact matching on known terms. If that benchmark confirms the hypothesis, LlamaIndex is the right choice.
Q4: What is the most significant architectural difference between AutoGen v0.2 and v0.4, and why does it matter?
The most significant difference is the execution model. AutoGen v0.2 used synchronous message passing: user_proxy.initiate_chat(assistant, message="task") blocked until the conversation completed. This made concurrent agent execution awkward and limited scalability.
AutoGen v0.4 uses an asynchronous event-driven runtime: agents communicate through an async message bus, each agent is an async coroutine, and team coordination is built on top of async primitives. This makes it possible to run many agent conversations concurrently without blocking, stream partial results, and integrate cleanly with async web frameworks.
Why it matters in production: a synchronous customer support agent can only handle one ticket at a time per process. An async agent can handle many tickets concurrently using the same event loop. For high-volume applications, this difference in architecture translates to 10-100x difference in throughput per process.
The cost: v0.2 code is completely incompatible with v0.4. Migration requires rewriting your agent definitions, your conversation initiation logic, and your termination conditions. Teams with stable v0.2 production deployments should evaluate migration cost carefully before committing.
Q5: In 2025, what determines whether a team should be building agents with a framework or building them on the raw API?
Three factors determine the decision, weighted in this order:
First: operational complexity requirements. Does the agent need checkpointing and crash recovery? Human-in-the-loop gates? Complex stateful routing? If yes, a framework - specifically LangGraph - provides these features in a production-tested form. Building them from raw API is possible but expensive. If no, raw API is often sufficient and preferable.
Second: team cognitive capacity. A five-person startup where everyone is working on multiple things has less capacity to learn and maintain a framework than a twenty-person company with dedicated agent infrastructure engineers. Frameworks have a learning curve and a maintenance burden. Smaller teams with simpler problems should bias toward raw API.
Third: retrieval and integration requirements. Does the agent need sophisticated RAG over large document collections? Use LlamaIndex. Does it need integration with many external services? LangChain's integration ecosystem is genuinely valuable. Raw API means writing every integration yourself.
The meta-principle: the right framework is the one that solves a specific problem you have, not a generic "best practice." Build with raw API until a specific problem appears - then adopt the framework that solves that specific problem.
