6-Dimension Agent Scorecard Explorer
Click any benchmark to see exact scores, code comparisons, and what the winner got right
NB #15: ReAct Agents (Winner: SynapseKit)
NB #16: Function Calling (Winner: SynapseKit)
NB #17: Built-in Tools (Winner: SynapseKit)
NB #18: Multi-Agent (Winner: SynapseKit)
NB #19: Observability (3-Way Tie)
NB #20: Error Handling (Winner: LangChain)
SK wins: built-in CalculatorTool + DateTimeTool; most concise agent setup. LC's create_react_agent is clean but requires more wiring. LI has no built-in calc or datetime tooling.
The core test: implement an identical ReAct agent that uses a calculator and datetime tool, with a max_iterations guard. SynapseKit's advantage is that CalculatorTool and DateTimeTool are imports — no custom code required. LangChain's create_react_agent is genuinely clean but you wire the tool list separately from the AgentExecutor. LlamaIndex's ReActAgent matches SynapseKit on syntax length but you're writing the tool functions yourself.
from synapsekit import Agent
from synapsekit.tools import (
    CalculatorTool,
    DateTimeTool)

agent = Agent(
    model="gpt-4o-mini",
    tools=[CalculatorTool(),
           DateTimeTool()],
    max_iterations=5)
result = await agent.run(
    "What is 847 * 23? "
    "What day is today?")
from langchain.agents import (
    create_react_agent,
    AgentExecutor)
from langchain.tools import Tool
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")
# calc_tool / datetime_tool: hand-built
# Tool(...) wrappers, defined elsewhere
tools = [calc_tool, datetime_tool]
# prompt: a ReAct prompt template,
# e.g. hub.pull("hwchase17/react")
agent = create_react_agent(
    llm, tools, prompt)
executor = AgentExecutor(
    agent=agent, tools=tools,
    max_iterations=5)
result = executor.invoke(
    {"input": "What is 847 * 23?"})
from llama_index.core.agent import (
    ReActAgent)
from llama_index.core.tools import (
    FunctionTool)

# Must write calc + datetime fns
# (see sketch below)
calc = FunctionTool.from_defaults(
    fn=calculate)
dt = FunctionTool.from_defaults(
    fn=get_datetime)
agent = ReActAgent.from_tools(
    [calc, dt],
    max_iterations=5)
response = agent.chat(query)
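To make the "write the tool functions yourself" cost concrete, here is a minimal sketch of the two helpers the LlamaIndex snippet assumes; the names calculate and get_datetime come from the block above, and the bodies are illustrative only.

from datetime import datetime

def calculate(expression: str) -> str:
    """Evaluate a basic arithmetic expression."""
    # eval() keeps the demo short; use a real
    # expression parser in production code
    return str(eval(expression))

def get_datetime() -> str:
    """Return the current date and time."""
    return datetime.now().isoformat()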
SK wins: .schema() + .anthropic_schema() from one definition. LC: StructuredTool + convert_to_openai_function (two objects). LI: FunctionTool + get_parameters_dict() (no Anthropic export).
The multi-provider reality: OpenAI's tool schema format and Anthropic's tool schema format differ in structure and field naming. A team using both Claude and GPT needs two synchronized schema definitions — or a framework that generates both from one source. SynapseKit's @tool decorator makes the function definition the source of truth. .schema() generates OpenAI format; .anthropic_schema() generates Anthropic format. One change propagates to both.
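For reference, the two provider formats look roughly like this. The dicts below are abridged sketches of the OpenAI tools format and the Anthropic tools format, not output captured from any framework.

# OpenAI tools format (abridged):
openai_style = {
    "type": "function",
    "function": {
        "name": "search_web",
        "description": "Search the web.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string"},
                "max_results": {"type": "integer"}},
            "required": ["query"]}}}

# Anthropic tools format (abridged):
# same JSON Schema, but flat, and nested
# under input_schema instead of parameters
anthropic_style = {
    "name": "search_web",
    "description": "Search the web.",
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {"type": "string"},
            "max_results": {"type": "integer"}},
        "required": ["query"]}}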
from synapsekit import tool

@tool
def search_web(
    query: str,
    max_results: int = 5
) -> str:
    """Search the web."""
    return do_search(query)

# One definition, two formats:
openai_fmt = search_web.schema()
anthropic_fmt = (
    search_web.anthropic_schema())
from langchain.tools import (
    StructuredTool)
from langchain_core.utils.function_calling import (
    convert_to_openai_function)

tool = StructuredTool.from_function(
    func=search_web,
    name="search_web",
    description="Search the web")
# Separate conversion step:
openai_fmt = (
    convert_to_openai_function(tool))
# No built-in Anthropic export
from llama_index.core.tools import (
    FunctionTool)

tool = FunctionTool.from_defaults(
    fn=search_web,
    name="search_web",
    description="Search the web")
# Manual parameter extraction:
params = (tool.metadata
    .get_parameters_dict())
# No Anthropic schema method
# No unified export API
Widest margin in Week 3. SK: 30 tools, 12 zero-config, 9 categories. LC: 17 core tools (most need per-tool pip install). LI: 3 core wrappers.
Zero-config means the tool works the moment you import it — no pip install, no API key, no environment variable. Calculation, datetime, text processing, JSON parsing, regex, UUID generation, hashing — these are the tools that come up constantly in agent applications. SynapseKit ships 12 that meet this standard. LangChain ships a handful (mostly wrappers that need API keys). LlamaIndex ships 3 core FunctionTool types, leaving everything else to the user.
# 12 zero-config tools:
CalculatorTool() # math
DateTimeTool() # time/date
TextTool() # regex, split
JSONTool() # parse, format
HashTool() # md5, sha256
UUIDTool() # generation
FileReadTool() # local files
CounterTool() # tallying
SortTool() # sorting
FilterTool() # list ops
StringFormatTool() # templates
ValidateTool() # schema check
# 9 categories total
# Most need pip install:
# pip install wikipedia
WikipediaQueryRun(
    api_wrapper=WikipediaAPIWrapper())
# pip install duckduckgo-search
DuckDuckGoSearchRun()
# Needs API key:
TavilySearchResults()
# Zero-config subset (~4):
# - BaseTool (abstract)
# - StructuredTool
# - tool decorator
# No built-in calculator
# No built-in datetime
# Only 3 core wrappers:
FunctionTool # any function
QueryEngineTool # index query
ToolMetadata # schema only
# Everything else:
# write it yourself
# Community tools exist
# but not in llama-index-core
# require separate pip installs
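To make the gap concrete: reproducing even one zero-config tool in LlamaIndex means writing the body yourself and wrapping it in FunctionTool. A minimal sketch; the sha256_hash helper is hypothetical, and only FunctionTool.from_defaults is framework API.

import hashlib
from llama_index.core.tools import FunctionTool

def sha256_hash(text: str) -> str:
    """Return the SHA-256 hex digest of text."""
    return hashlib.sha256(text.encode()).hexdigest()

# The wrapper is one line; the tool body is yours.
hash_tool = FunctionTool.from_defaults(fn=sha256_hash)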
SK: 6/6 patterns, most concise Crew+Task API. LC: 5/6, LangGraph wins on complex DAG flexibility. LI: 3/6 patterns — handoff only, no parallel or supervisor.
Six orchestration patterns were tested: sequential, parallel, supervisor, hierarchical, pipeline, and feedback loop. SynapseKit's Crew + Task(context_from=[...]) is the most concise way to express inter-agent dependencies. LangChain's LangGraph is the most flexible for complex conditional workflows but costs more lines. LlamaIndex supports handoff-based patterns only — no parallel execution, no supervisor pattern, no feedback loops.
from synapsekit import Agent, Crew, Task

researcher = Agent(role="Researcher",
    tools=[WebSearchTool()])
writer = Agent(role="Writer")

research_task = Task(
    description="Find key facts about {topic}",
    agent=researcher)
write_task = Task(
    description="Write a summary",
    agent=writer,
    context_from=[research_task])

crew = Crew(
    agents=[researcher, writer],
    tasks=[research_task, write_task])
result = await crew.run(
    topic="LLM frameworks")
from typing import Annotated, TypedDict

from langgraph.graph import StateGraph
from langgraph.graph.message import (
    add_messages)

class State(TypedDict):
    messages: Annotated[list, add_messages]
    next: str

def supervisor(state):
    # Route to researcher or writer
    ...

graph = StateGraph(State)
graph.add_node("supervisor", supervisor)
# research / write: node fns defined elsewhere
graph.add_node("researcher", research)
graph.add_node("writer", write)
graph.add_conditional_edges(
    "supervisor",
    lambda s: s["next"],
    {"researcher": "researcher",
     "writer": "writer"})
graph.set_entry_point("supervisor")
app = graph.compile()
from llama_index.core.agent import (
    AgentRunner, FunctionCallingAgent)

# Handoff-only pattern:
primary = FunctionCallingAgent.from_tools(
    tools=[handoff_tool],
    llm=llm)
# No parallel support
# No supervisor pattern
# No feedback loops
# Must implement manually
# using external Python code
# (not framework primitives)
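What "implement manually" looks like in practice: parallel fan-out is plain asyncio around two agent calls rather than a framework primitive. A sketch, assuming two agent instances named researcher and writer already exist.

import asyncio

async def fan_out(question: str):
    # Plain asyncio, not a LlamaIndex
    # orchestration primitive:
    research, draft = await asyncio.gather(
        researcher.achat(question),
        writer.achat(question))
    return research, draft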
3-way tie: LC wins on LoC (1 line via set_verbose). SK+LI tie on local feature depth (7/8 features). LI has best post-run query API (CBEventType). LC missing step latency locally — needs LangSmith.
LangChain enables tracing in 1 line: set_verbose(True). SynapseKit requires 4-5 lines for the Tracer middleware pattern. LlamaIndex requires 4 lines for LlamaDebugHandler + CallbackManager. But LangChain's 1-line setup doesn't expose step latency locally — timing data requires LangSmith. SynapseKit's TraceSpan.duration_ms and LlamaIndex's CBEventType timestamps both work without an external service. Score: all tied at 2 points because the local depth difference partially offsets the LoC advantage.
from synapsekit.middleware import Tracer

tracer = Tracer()
agent = Agent(
    model="gpt-4o-mini",
    middleware=[tracer])
result = await agent.run(query)
# Query structured spans:
for span in tracer.spans:
    print(span.name,
          span.duration_ms,
          span.token_usage)
from langchain.globals import (
    set_verbose, set_debug)
# 1 line enables tracing:
set_verbose(True)
# Optional: full prompt logging
set_debug(True)
# No structured object to query
# No step latency locally
# No programmatic access
# (redirect stderr to capture)
# LangSmith needed for timing
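If you do need programmatic access to that verbose trace without LangSmith, capturing the printed output is roughly the best you can do locally. A sketch using only the standard library, assuming the AgentExecutor from the ReAct section is in scope as executor.

import contextlib
import io

buf = io.StringIO()
# Capture both streams, since the verbose
# handler prints rather than returning objects
with contextlib.redirect_stdout(buf), \
     contextlib.redirect_stderr(buf):
    result = executor.invoke(
        {"input": "What is 847 * 23?"})
trace_text = buf.getvalue()  # raw text, not structured spans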
from llama_index.core.callbacks import (
    LlamaDebugHandler,
    CallbackManager,
    CBEventType)
from llama_index.core import Settings

debug = LlamaDebugHandler()
Settings.callback_manager = (
    CallbackManager([debug]))
# Best post-run query API:
llm_events = debug.get_event_pairs(
    CBEventType.LLM)
tool_events = debug.get_event_pairs(
    CBEventType.FUNCTION_CALL)
LC wins: ToolException + handle_tool_error + handle_parsing_errors in 5 lines. SK wins on LLM-level resilience (FallbackChain + CircuitState). LI: fully manual, no built-in error primitives.
LangChain wins the benchmark that matters most in production. Raising ToolException turns a tool failure into an observation the LLM can reason about: the error becomes the next reasoning step. handle_tool_error=True on the tool enables that conversion, and handle_parsing_errors=True on AgentExecutor catches malformed LLM outputs before they crash the agent. Two kwargs, zero custom code. SynapseKit's FallbackChain and CircuitState are stronger for LLM-level failures (model unavailable, repeated timeouts) but weaker for per-tool error handling. LlamaIndex has max_iterations as its only error primitive; everything else is a try/except you write yourself.
from synapsekit import Agent
from synapsekit.resilience import (
    FallbackChain, CircuitState)

# LLM-level resilience:
agent = Agent(
    model=FallbackChain([
        "gpt-4o-mini",
        "gpt-3.5-turbo"]),
    circuit_state=CircuitState(
        max_failures=3))
# Per-tool: manual try/except
# in each tool.run() method
from langchain_core.tools import (
    StructuredTool, ToolException)

def search(query: str) -> str:
    """Search the web."""
    if api_down:
        raise ToolException(
            "Search unavailable. "
            "Answer from training data.")
    return do_search(query)

# handle_tool_error turns the exception
# into an observation for the LLM:
search_tool = StructuredTool.from_function(
    func=search, handle_tool_error=True)

executor = AgentExecutor(
    agent=agent, tools=[search_tool],
    handle_parsing_errors=True)
from llama_index.core.agent import (
    ReActAgent)

# Only built-in primitive:
agent = ReActAgent.from_tools(
    tools,
    max_iterations=5)
# Everything else is manual:
def safe_search(query):
    try:
        return do_search(query)
    except Exception as e:
        return f"Error: {e}"
# No ToolException
# No handle_tool_error kwarg
# No parse error handling