
AI Letters #29 - Week 3 Scorecard: Six Agent Benchmarks, Three Frameworks, One Uncomfortable Truth

· 10 min read
EngineersOfAI
AI Engineering Education

Six benchmarks. SynapseKit wins 4 on ergonomics. LangChain wins the one you'll hit in production: per-tool error recovery. LlamaIndex scores 7/18 - not a maturity gap, an architectural one. It's a retrieval framework that added agents.

The Six Benchmarks

#     Notebook           Dimension                                 Winner
------------------------------------------------------------------------
#15   ReAct Agents       LoC + built-in tools + loop control       SynapseKit
#16   Function Calling   Schema LoC + multi-format export          SynapseKit
#17   Built-in Tools     Tool count + zero-config coverage         SynapseKit
#18   Multi-Agent        LoC + orchestration patterns supported    SynapseKit
#19   Observability      LoC to enable + local feature depth       3-way tie
#20   Error Handling     LoC + built-in error primitives           LangChain
Week 3 Points (max 18):

Framework #15 #16 #17 #18 #19 #20 Total
----------------------------------------------------
SynapseKit 3 3 3 3 2 2 16
LangChain 2 2 2 2 2 3 13
LlamaIndex 1 1 1 1 2 1 7

SynapseKit: 16. LangChain: 13. LlamaIndex: 7.


What SynapseKit Actually Wins On

The four wins are not flukes. There is a coherent pattern.

ReAct Agents (#15): CalculatorTool and DateTimeTool are built in. You construct an agent with a list of tools and a model - that's the entire setup. LangChain's create_react_agent is clean but requires you to wire the tool list separately from the agent executor. LlamaIndex's ReActAgent matches SynapseKit on line count but ships no built-in calculation or datetime tooling.
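
For reference, here is the LangChain side of that wiring - a minimal sketch, assuming an OpenAI key in the environment and the langchainhub package for the stock prompt; the word-count tool is a stand-in, not a built-in:

    # Minimal LangChain ReAct setup: the tool list is wired twice,
    # once into the agent and again into the executor.
    from langchain import hub
    from langchain.agents import AgentExecutor, create_react_agent
    from langchain_core.tools import tool
    from langchain_openai import ChatOpenAI

    @tool
    def word_count(text: str) -> int:
        """Count the words in a piece of text."""
        return len(text.split())

    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
    prompt = hub.pull("hwchase17/react")   # the stock ReAct prompt
    tools = [word_count]

    agent = create_react_agent(llm, tools, prompt)
    executor = AgentExecutor(agent=agent, tools=tools)
    print(executor.invoke({"input": "How many words are in 'agents fail in interesting ways'?"}))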

Function Calling (#16): Define a function schema once. Call .schema() for OpenAI format. Call .anthropic_schema() for Anthropic format. Same source of truth, zero duplication. LangChain requires StructuredTool plus convert_to_openai_function - two different objects. LlamaIndex requires FunctionTool plus a separate get_parameters_dict() call. Neither provides a single definition that exports to both provider formats.
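
A sketch of what that single source of truth looks like in practice. The .schema() / .anthropic_schema() calls are the ones named above; the class name, import path, and constructor arguments here are illustrative guesses, not SynapseKit's documented API:

    # Illustrative only - one definition, two provider formats.
    from synapsekit.tools import FunctionSchema  # hypothetical import path

    get_weather = FunctionSchema(
        name="get_weather",
        description="Look up current weather for a city.",
        parameters={"city": {"type": "string", "description": "City name"}},
    )

    openai_format = get_weather.schema()               # OpenAI tools format
    anthropic_format = get_weather.anthropic_schema()  # Anthropic tools format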

Built-in Tools (#17): 30 tools. 12 that work with zero configuration - no pip install, no API key, no setup. 9 categories. LangChain ships 17 core tools, most requiring a per-tool pip install and an API key before they'll run. LlamaIndex ships 3 core tool wrappers. This is the widest margin in the entire week: 30 vs 17 vs 3.

Multi-Agent (#18): SynapseKit supports 6 of 6 orchestration patterns - sequential, parallel, supervisor, hierarchical, pipeline, and feedback loop. LangChain supports 5 (LangGraph handles the complex DAG cases well). LlamaIndex supports 3. The Crew + Task(context_from=[...]) pattern in SynapseKit is the most concise way to express inter-agent dependencies across all three frameworks.
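
The pattern, sketched from the class names quoted above - everything beyond Crew, Task, and context_from is an assumption about the API, not a quote from its docs:

    # Illustrative sketch of inter-agent dependencies via context_from.
    from synapsekit.agents import Agent, Crew, Task  # hypothetical import path

    researcher = Agent(name="researcher", instructions="Gather sources on the topic.")
    writer = Agent(name="writer", instructions="Summarize the research findings.")

    research = Task(agent=researcher, description="Collect five relevant sources.")
    draft = Task(
        agent=writer,
        description="Write a 200-word summary.",
        context_from=[research],  # the draft task receives the research task's output
    )

    result = Crew(agents=[researcher, writer], tasks=[research, draft]).run()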


The One LangChain Win That Matters

Error handling. LangChain scores 3/3.

LangChain:

  1. ToolException is raised inside the tool
  2. The tool catches it (handle_tool_error=True - a flag on the tool, not the executor)
  3. The error message becomes the LLM's next Observation
  4. The LLM reasons: retry, use a different tool, or report to the user

vs.

SynapseKit / LlamaIndex:

  1. try/except inside the tool function (manual, for every tool)
  2. Return an error string (if you remembered to)
  3. No structured recovery loop

ToolException is not just a named exception type. It is a design decision: tool failures are information for the reasoning loop, not crashes to be caught. Raise ToolException("The search API timed out") and the LLM's next observation is that string. It can reason: try a different query, use a fallback tool, tell the user. Five lines including imports. No boilerplate per tool.
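
The pattern in code - a minimal sketch using the public langchain_core tool API; the failing search function is a stand-in:

    from langchain_core.tools import StructuredTool, ToolException

    def search(query: str) -> str:
        """Search the web for a query."""
        raise ToolException("The search API timed out")

    search_tool = StructuredTool.from_function(
        func=search,
        handle_tool_error=True,  # convert ToolException into an observation string
    )

    # Even outside an agent loop you can see the behavior: instead of crashing,
    # the call returns the error message for the LLM to reason about.
    print(search_tool.invoke({"query": "latest benchmark results"}))
    # -> The search API timed out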

LangChain also ships handle_parsing_errors=True - which catches malformed LLM outputs before they crash the agent. This is the failure mode no one talks about until it happens in production: the model returns something that doesn't match the expected ReAct format, the parser throws, the agent is gone. One kwarg prevents it. SynapseKit and LlamaIndex both crash on malformed output without custom handling.
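
Put together - assuming the agent and tools from the ReAct sketch earlier - handle_parsing_errors lives on the executor, handle_tool_error on each tool:

    from langchain.agents import AgentExecutor

    executor = AgentExecutor(
        agent=agent,                 # from the ReAct sketch above
        tools=tools,
        handle_parsing_errors=True,  # malformed LLM output becomes a retry, not a crash
        max_iterations=8,            # and cap the loop while you're here
    )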

SynapseKit's CircuitState is the stronger primitive for a different failure class - repeated failures at the LLM or network level. But per-tool error handling is where engineers spend most of their production debugging time. LangChain wins that battle.


The Uncomfortable Truth About LlamaIndex

LlamaIndex scored 7 out of 18 possible points in the Agents & Tools week: third place in 5 of the 6 benchmarks - ReAct ergonomics, function calling, built-in tools, multi-agent patterns, and error handling - and level with the other two in observability only because all three frameworks cover the basics.

This is not a performance gap or a maturity gap. It is an architectural conclusion: LlamaIndex is a retrieval and indexing framework. It added agents. It is not an agent framework that also handles retrieval.

In Week 2 (RAG Pipelines), LlamaIndex came second overall. Its chunking benchmark (#9) was the most detailed of any framework. Its document loading and indexing abstractions are the most mature. VectorStoreIndex, SummaryIndex, KnowledgeGraphIndex - these are not bolt-ons. They are the product.

When your application is 80% retrieval and 20% agent orchestration, LlamaIndex is the correct choice. When the ratio flips, you are fighting the framework's grain.


The 3-Week Cumulative Picture

Framework Week 1 Week 2 Week 3 Total (21 benchmarks)
------------------------------------------------------------
SynapseKit 15 14 16 45
LangChain 8 10 13 31
LlamaIndex 7 12 7 26

The trend line for LangChain is important. Week 1: 8 points. Week 2: 10. Week 3: 13. The delta between first and second place has shrunk from 7 points to 3 points over three weeks. Week 4 tests production concerns - async throughput, graph workflows, cost tracking, guardrails, MCP support. LangChain's ecosystem depth tends to surface there. The gap may close further.

LlamaIndex's pattern is the mirror image: strong in Week 2 (12 points, retrieval week), weak in Weeks 1 and 3 (7 points each, everything else). A specialist framework trading against generalists.


What This Means for Engineers

  1. If you're building an agent-first application, SynapseKit's batteries-included approach saves real time. 30 built-in tools, concise multi-agent patterns, single function schema definition. The upfront ergonomics advantage compounds over the first month of development.

  2. Set handle_tool_error=True on your LangChain tools and handle_parsing_errors=True on every AgentExecutor immediately. These two kwargs are free insurance. Without them, tool exceptions crash the agent and malformed LLM outputs crash the agent. With them, both become recoverable observations. No changes to your tool logic required.

  3. LangChain's per-tool error recovery is better than writing your own. If you are currently wrapping every tool function in a try/except and returning error strings manually - in any framework - you are doing more work than LangChain's ToolException pattern requires.

  4. Use LlamaIndex specifically when your application is knowledge-graph-heavy or your chunking requirements are sophisticated. SemanticSplitterNodeParser, recursive splitting with boundary detection, KnowledgeGraphIndex - these have no equivalent in SynapseKit or LangChain.

  5. The framework choice is not permanent, but the migration cost is real. Switching from LangChain's AgentExecutor to SynapseKit's Crew mid-project is not a find-and-replace operation. Pick based on what your application's core pattern is.


The Thing Most People Miss

The benchmarks measure ergonomics. Ergonomics predicts developer velocity in the first 90 days. It does not predict the failure modes you encounter in production at month six.

The most common production failure in LLM agents is not a missing built-in tool or a verbose schema definition. It is uncontrolled loops - agents that retry a failing operation until they exhaust either the max_iterations cap or the API rate limit. SynapseKit's CircuitState and LangChain's ToolException both address this, from opposite directions. SynapseKit short-circuits before the LLM sees the failure. LangChain routes the failure through the LLM and hopes it reasons its way out.
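
For the "invisible to the LLM" side, the shape of the pattern is a plain circuit breaker. This is a generic illustration, not SynapseKit's CircuitState API:

    import time

    class CircuitBreaker:
        """Fail fast after repeated failures instead of retrying a dead backend."""

        def __init__(self, max_failures: int = 3, reset_after: float = 30.0):
            self.max_failures = max_failures
            self.reset_after = reset_after
            self.failures = 0
            self.opened_at = 0.0

        def call(self, fn, *args, **kwargs):
            if self.failures >= self.max_failures:
                if time.monotonic() - self.opened_at < self.reset_after:
                    # Open: refuse immediately, so the agent loop can't burn
                    # iterations (or the rate limit) hammering a failing service.
                    raise RuntimeError("circuit open: backend disabled for now")
                self.failures = 0  # half-open: allow one trial call
            try:
                result = fn(*args, **kwargs)
            except Exception:
                self.failures += 1
                if self.failures >= self.max_failures:
                    self.opened_at = time.monotonic()
                raise
            self.failures = 0
            return result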

Both work for different failure classes. Neither is universal.

The engineer who builds the most reliable production agent will be the one who understands which failures should be invisible to the LLM (circuit-break them) and which failures the LLM should reason about (ToolException them). That judgment call is not in any benchmark. It comes from shipping something, watching it break, and learning the shape of the break.

Week 4 shifts to production: async throughput, graph-based workflows, built-in evaluation, cost tracking, guardrails, MCP support. That is where the ergonomics winner and the production winner may diverge for the first time.


Three Things Worth Doing This Week

  1. Map your application's agent-to-retrieval ratio. Write it down as a fraction. If it's above 60% agents, audit whether your current framework has built-in error primitives. If it's below 40% agents, audit whether your retrieval path uses framework-native indexing or custom code.

  2. Count your framework's built-in tools and test three of them. The tools you're pip-installing and wrapping manually might already be built in. SynapseKit's 12 zero-config tools cover most of what agents need without any setup.

  3. Write a deliberate failure test for your agent. Pick the tool your agent calls most frequently, make it throw an exception, and watch what happens. Does the agent recover? Does it loop? Does it crash? How long it takes you to answer those questions is the measurement that matters most for production reliability. A minimal version of the test is sketched below.
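
A minimal sketch, using the LangChain primitives from earlier; the flaky tool is a stand-in for whichever tool your agent calls most:

    from langchain_core.tools import StructuredTool, ToolException

    def flaky_search(query: str) -> str:
        """Search the web for a query."""
        raise ToolException("search backend returned HTTP 503")

    def test_tool_failure_becomes_observation():
        tool = StructuredTool.from_function(func=flaky_search, handle_tool_error=True)
        observation = tool.invoke({"query": "anything"})
        # With recovery on, the failure comes back as a string the agent can
        # reason about; without it, this call raises and the run dies.
        assert "503" in observation

    test_tool_failure_becomes_observation()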

Three weeks of benchmarks point to a framework with strong agent ergonomics. Six months of production data will point to something more nuanced. The race is not over.


Engineers of AI

Read more: www.engineersofai.com

If this was useful, forward it to one engineer who should be reading it.

Want to Think Like an AI Architect?

Join engineers receiving weekly breakdowns of AI systems, production failures, and architectural decisions.