AI Letters #24 - ReAct Agents: Six Lines vs Nineteen (And What You Lose in Between)
"Six lines to build a working ReAct agent sounds like a win. It is - until your agent starts looping and you have no idea why."
The ReAct loop is the first pattern every engineer reaches for when they need an agent. Thought, Action, Observation. Repeat until done. It's elegant on paper. In production it breaks in exactly the ways you'd expect: infinite loops, wrong tool selection, hallucinated tool calls that return nothing useful.
The question isn't whether ReAct agents work. It's whether your framework lets you see inside the loop when things go wrong.
Notebook #15 of the LLM Showdown measured three things: lines of code to build a working ReAct agent with two tools, the built-in tool inventory available without writing any tool code, and loop control parameters exposed to the caller. SynapseKit wins on LoC. LangChain wins on observability. LlamaIndex sits between them on code and matches LangChain on observability. The numbers are not the story. The tradeoff they reveal is.
What ReAct Actually Requires
A minimal working ReAct agent needs four things: an LLM, at least one tool with a schema, a prompt that formats Thought/Action/Observation, and a loop that parses the model's output and dispatches tool calls. Getting all four wired together is where the frameworks diverge.
The benchmark task was identical across all three: define a calculator tool and a datetime tool, build a ReAct agent, run one query that requires at least one tool call.
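Before the numbers, it helps to see the loop with no framework at all. The sketch below is illustrative only - the prompt text, the regex, and the llm callable are placeholders, not any library's API - but it shows the four moving parts and where loop control has to live.
    import re
    from datetime import datetime

    # Illustrative only: a bare-bones ReAct loop with no framework.
    # `llm` is assumed to be any callable mapping a prompt string to a completion string.
    TOOLS = {
        "calculator": lambda expr: str(eval(expr)),      # demo only; never eval untrusted input
        "datetime": lambda _: datetime.now().isoformat(),
    }

    PROMPT = """Answer the question, using tools when needed.
    Thought: <reasoning>
    Action: <tool>[<input>]
    Observation: <tool result>
    (repeat as needed, then)
    Final Answer: <answer>
    Question: {question}
    {scratchpad}"""

    def react(question, llm, max_iterations=5):
        scratchpad = ""
        for _ in range(max_iterations):                  # loop cap lives here
            output = llm(PROMPT.format(question=question, scratchpad=scratchpad))
            if "Final Answer:" in output:
                return output.split("Final Answer:")[-1].strip()
            match = re.search(r"Action:\s*(\w+)\[(.*)\]", output)
            if match is None:                            # parsing-error handling lives here too
                scratchpad += "\nObservation: could not parse an action; follow the format."
                continue
            name, arg = match.groups()
            observation = TOOLS.get(name, lambda _: "unknown tool")(arg)
            scratchpad += f"\n{output.strip()}\nObservation: {observation}"
        return "Stopped after max_iterations without a final answer."  # early stop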
The Evidence
Lines of code - imports + setup to a working agent:
Framework     Imports   Functional   Total
--------------------------------------------
SynapseKit       3           3          6
LlamaIndex       3          10         13
LangChain        5          14         19
SynapseKit gets to 6 lines because CalculatorTool and DateTimeTool are shipped in the library. You import them like any other class. There is no tool-definition code because there is nothing to define.
LangChain's 19 lines include two @tool-decorated functions - that's 10 lines of the gap right there. Strip those and LangChain's agent setup is 9 lines. The decorator approach is not verbose; it's complete. The tool code is what you'd write in any framework.
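For reference, a sketch of that shape of setup - not the notebook's exact code. It assumes langchain 0.1+ with langchain-openai and langchainhub installed; the model name and the query are placeholders.
    from datetime import datetime
    from langchain import hub
    from langchain.agents import AgentExecutor, create_react_agent
    from langchain_core.tools import tool
    from langchain_openai import ChatOpenAI

    @tool
    def calculator(expression: str) -> str:
        """Evaluate a basic arithmetic expression."""
        return str(eval(expression))  # demo only

    @tool
    def current_datetime(query: str) -> str:
        """Return the current date and time."""
        return datetime.now().isoformat()

    tools = [calculator, current_datetime]
    llm = ChatOpenAI(model="gpt-4o-mini")
    prompt = hub.pull("hwchase17/react")               # the stock ReAct prompt
    agent = create_react_agent(llm, tools, prompt)
    executor = AgentExecutor(agent=agent, tools=tools)
    result = executor.invoke({"input": "What is 17 * 23, and what day is it today?"})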
LlamaIndex at 13 lines uses FunctionTool.from_defaults() - plain Python functions wrapped into tool objects. Slightly more explicit than LangChain's decorator, slightly less so than SynapseKit's class hierarchy.
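The LlamaIndex equivalent, again as a sketch rather than the notebook's exact code - this assumes llama-index 0.10.x, where ReActAgent.from_tools is the entry point, plus the OpenAI integration package.
    from datetime import datetime
    from llama_index.core.agent import ReActAgent
    from llama_index.core.tools import FunctionTool
    from llama_index.llms.openai import OpenAI

    def calculator(expression: str) -> str:
        """Evaluate a basic arithmetic expression."""
        return str(eval(expression))  # demo only

    def current_datetime() -> str:
        """Return the current date and time."""
        return datetime.now().isoformat()

    tools = [FunctionTool.from_defaults(fn=calculator),
             FunctionTool.from_defaults(fn=current_datetime)]
    agent = ReActAgent.from_tools(tools, llm=OpenAI(model="gpt-4o-mini"), verbose=True)
    response = agent.chat("What is 17 * 23, and what day is it today?")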
Custom tool definition - what it costs when built-ins don't cover your use case:
SynapseKit 6 lines (subclass BaseTool, implement async run())
LangChain 5 lines (@tool decorator on any annotated function)
LlamaIndex 5 lines (plain function + FunctionTool.from_defaults())
SynapseKit's advantage evaporates here. The moment you need a tool that isn't in their library, you're writing more code than the alternatives, not less. The subclass pattern is also more rigid - you're tied to their async interface, their error handling convention, their schema format.
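To make the commitment concrete: the notebook only tells us the pattern is "subclass BaseTool, implement async run()", so everything below - the import path, the attribute names, the conversion logic - is an illustration of that pattern, not SynapseKit's verified API.
    from synapsekit.tools import BaseTool   # hypothetical import path

    class CurrencyTool(BaseTool):
        name = "currency"
        description = "Convert an amount using a fixed exchange rate."

        async def run(self, amount: float, rate: float) -> str:
            # your logic is now tied to the framework's async interface and schema format
            return str(round(amount * rate, 2))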
Built-in tool inventory (no tool code required):
Framework     Built-in tools
------------------------------
SynapseKit         18
LangChain          15
LlamaIndex          9
SynapseKit leads: web scraping, arXiv, PubMed, SQL, shell, Python REPL, translation, sentiment - all importable. LangChain has 15, but many require third-party API keys (Tavily, Brave, Google). LlamaIndex's 9 are mostly retrieval-oriented, which makes sense given its RAG-first heritage.
Loop control parameters exposed to the caller:
Parameter                    SynapseKit   LangChain   LlamaIndex
-----------------------------------------------------------------
max_iterations                   Yes          Yes         Yes
early stop                       Yes          Yes         Yes
handle_parsing_errors            Yes          Yes         Yes
verbose                          No           Yes         Yes
return_intermediate_steps        No           Yes         Yes
async support                    Yes          Yes         Yes
Score (out of 6)                  4            6           6
This is the number that matters in production.
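In LangChain, every one of those knobs is a constructor argument on AgentExecutor - a sketch, reusing the agent and tools from the earlier example.
    executor = AgentExecutor(
        agent=agent,
        tools=tools,
        max_iterations=5,                  # cap the Thought/Action/Observation loop
        early_stopping_method="force",     # return a best-effort answer when the cap is hit
        handle_parsing_errors=True,        # feed malformed model output back as an observation
        verbose=True,                      # print each step as it runs
        return_intermediate_steps=True,    # keep (action, observation) pairs in the result
    )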
The Contrast
ReAct Loop - What You Can Observe
SynapseKit                 LangChain / LlamaIndex
──────────────────────     ──────────────────────────────
[Thought]                  [Thought]       <- verbose logs
    |                          |
[Action]                   [Action]        <- intermediate steps
    |                          |
[Observation]              [Observation]   <- response.sources
    |                          |
[Answer]                   [Answer]
    ^ opaque                   ^ full trace available
SynapseKit's loop runs. You get the final answer. What happened in between - which tools were called, in what order, with what arguments, what they returned - is not surfaced by default. There is no verbose=True. There is no return_intermediate_steps. If the agent gives you a wrong answer, your debugging path is: re-run with print statements you've injected manually, or read source code.
LangChain gives you return_intermediate_steps=True on AgentExecutor. Every thought, every tool call, every observation is accessible in the response object. LlamaIndex surfaces the same through response.sources. This is not a nice-to-have. It is the difference between an agent you can ship and an agent you can't explain.
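Reading the trace back out looks like this, assuming the executors from the sketches above; the attribute names are the ones both libraries expose for agent steps and tool outputs.
    # LangChain: each intermediate step is an (AgentAction, observation) pair.
    result = executor.invoke({"input": "What is 17 * 23?"})
    for action, observation in result["intermediate_steps"]:
        print(action.tool, action.tool_input, "->", observation)

    # LlamaIndex: every tool call and its raw output is on response.sources.
    response = agent.chat("What is 17 * 23?")
    for source in response.sources:
        print(source.tool_name, "->", source.raw_output)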
What This Means for Engineers
- The 6-line number is real but context-dependent. If your use case fits SynapseKit's 18 built-in tools, you genuinely write less code. If it doesn't, you write more.
- Observability is not optional in production. The first time a ReAct agent gives a customer a wrong answer, you will need to reconstruct exactly what it thought and did. SynapseKit makes that hard by default.
- LangChain's verbosity is load-bearing. return_intermediate_steps, verbose, handle_parsing_errors - these aren't academic features. They are the handles you grab during an incident.
- LlamaIndex at 13 lines is the quiet winner. FunctionTool is clean. response.sources gives you the trace. The tool count (9 built-in) is lower, but the RAG-tool integration is first-class. If you're already using LlamaIndex for retrieval, adding agents costs almost nothing structurally.
- The custom tool cost comparison exposes the real architecture. SynapseKit's BaseTool subclass is not burdensome at 6 lines - but it is a commitment. LangChain's @tool decorator composes with any Python function you already wrote. The closer your existing codebase is to plain Python, the more that matters.
The Thing Most People Miss
The benchmark measured the cost to build a ReAct agent. It didn't measure the cost to debug one. Debugging cost scales with agent complexity, agent usage, and how long the loop runs. A 6-line setup that produces an opaque loop will cost you more time over a quarter than a 19-line setup with full observability - assuming the agent actually runs in production. Most of them do, eventually.
The frameworks that win on setup lines tend to lose on debuggability. This is not a coincidence. It is the fundamental tradeoff in API design: the more you hide, the less you write. The more you expose, the more you can see.
Three Things Worth Doing This Week
- Check your current agent setup for return_intermediate_steps or equivalent. If you can't reconstruct the last 10 agent traces from your logs, you don't have production observability yet.
- Audit your tool definitions. If they are tightly coupled to a framework's base class, write one clean Python function that does the same thing. Keep framework-agnostic logic separate from framework integration - see the sketch after this list.
- Run notebook #15 yourself against your own framework of choice: github.com/engineersofai/llm-showdown. The task is simple enough to replicate in 20 minutes. The loop control gaps show up immediately.
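For the second item, one way to keep the seam clean - convert_currency stands in for your own business logic, and the two wrappers reuse the LangChain and LlamaIndex APIs from the earlier sketches.
    from langchain_core.tools import tool
    from llama_index.core.tools import FunctionTool

    def convert_currency(amount: float, rate: float) -> float:
        """Pure logic: no framework imports, unit-testable on its own."""
        return round(amount * rate, 2)

    # Thin adapters at the edge; only these two lines know which framework you chose.
    lc_tool = tool(convert_currency)                             # LangChain
    li_tool = FunctionTool.from_defaults(fn=convert_currency)    # LlamaIndex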
The conciseness race is worth running. Just know what you're trading away when you win it.
Engineers of AI
Read more: www.engineersofai.com
If this was useful, forward it to one engineer who should be reading it.
