AI Letters #27 - Agent Observability: 3 Lines Gets You In, But What Can You Actually See?
"Three lines to enable tracing in LangChain. Zero lines of latency data when you're done."
Every agent fails eventually. A tool returns nothing. The LLM loops on the same thought. The retrieved documents are all wrong. What separates a two-minute debug from a two-hour one is not how the agent was built - it's how much you can see when it breaks.
Notebook #19 of the LLM Showdown measured one thing: how much can you observe about a running agent without leaving your local environment? No external service. No API key for a tracing platform. No paid tier. Just framework-native observability on the same machine where your code runs.
LangChain enables tracing in the fewest lines. What those lines actually surface is a different question.
The Numbers
Lines of code to enable useful local tracing (no external service, no API key):
Framework     Imports   Enable   Total
---------------------------------------
LangChain        1         2       3
LlamaIndex       2         2       4
SynapseKit       2         5       7
LangChain wins by a wide margin. set_verbose(True) is one line. Add set_debug(True) for full raw prompt logging. That's it.
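The whole local setup, as a sketch (in recent releases both langchain.globals and langchain_core.globals export these flags):

```python
# LangChain's complete local tracing setup: one import line, two global flags.
from langchain.globals import set_verbose, set_debug

set_verbose(True)  # chain inputs/outputs, tool calls, agent reasoning printed as they happen
set_debug(True)    # adds the full raw prompts and low-level callback events
```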
What those lines actually surface locally (feature depth score):
Feature                            SynapseKit   LangChain   LlamaIndex
-----------------------------------------------------------------------
Token usage                           Yes         Partial       Yes
Step latency                          Yes         No            Yes
Intermediate agent steps              Yes         Yes           Yes
Tool call args + returns              Yes         Yes           Yes
Full raw LLM prompt                   Yes         Yes           Yes
Retrieved documents                   Yes         Yes           Yes
Zero-config enable (1-2 lines)        Yes         Yes           No
-----------------------------------------------------------------------
Score (out of 7)                       7           5             6
The latency row is where LangChain's 3-line win costs you the most. set_verbose(True) and set_debug(True) print chain I/O, tool calls, and agent reasoning to stdout. They do not record how long any step took. For timing data - how long did the LLM call take, how long did the tool execution take, which step is the bottleneck - LangChain requires LangSmith, which is an external service.
Token usage is similarly partial: verbose mode shows counts in the output, but not in a structured object you can query. For cost tracking per run, again: LangSmith.
The Three Design Philosophies
How does tracing work?
SynapseKit: explicit object.
    tracer = Tracer()
    agent = Agent(middleware=[tracer])
    result = await agent.run(query)
    tracer.spans -> structured list of spans to query after the run

LangChain: global side effect.
    set_verbose(True)
    # all agents now emit to stdout automatically
    No object to query; redirect stdout to capture the output

LlamaIndex: injected callback.
    handler = LlamaDebugHandler()
    Settings.callback_manager = CallbackManager([handler])
    handler.get_event_pairs(CBEventType.LLM) -> typed event list
SynapseKit uses an explicit Tracer object. You pass it into the agent at construction time. After the run, you query tracer.spans to get a structured list of TraceSpan objects - one per event, with duration_ms, metadata, and full payload. This is testable: you can assert on specific spans in a unit test. It's composable: you can pass different tracers to different agents in the same application.
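In code the pattern looks roughly like this - a sketch built from the names used in this issue (Tracer, Agent(middleware=[...]), tracer.spans, duration_ms); the import path and exact signatures are assumptions, not a verified SynapseKit API:

```python
# Explicit-tracer pattern, sketched from the names in this comparison.
# The synapsekit import path and Agent/Tracer signatures are assumptions.
import asyncio
from synapsekit import Agent, Tracer

async def main():
    tracer = Tracer()                      # one tracer per agent - no global state
    agent = Agent(middleware=[tracer])     # attached explicitly at construction time
    await agent.run("Summarise yesterday's failed tool calls")

    # Post-run: a structured, queryable list of spans.
    for span in tracer.spans:
        print(span.name, span.duration_ms, span.metadata)

asyncio.run(main())
```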
LangChain uses global flags. set_verbose(True) is a global side effect that makes all subsequent LangChain objects print their activity to stdout. No object to query. No programmatic access to events after the run. To capture the output you redirect stdout - which is exactly the kind of code you don't want in production. The upside: one line, zero configuration, works immediately on any existing agent.
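If you do need that output programmatically, stream redirection is the only route - a sketch, assuming agent_executor is an already-built LangChain agent:

```python
# Capturing LangChain's verbose output means redirecting the stream it prints to -
# exactly the workaround you don't want near production code.
import io
from contextlib import redirect_stdout

from langchain.globals import set_verbose

set_verbose(True)

buffer = io.StringIO()
with redirect_stdout(buffer):
    agent_executor.invoke({"input": "Why did the nightly job fail?"})  # assumed pre-built agent

raw_trace = buffer.getvalue()  # one big string: no spans, no durations, no structure
```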
LlamaIndex uses a callback manager injected via Settings. LlamaDebugHandler is the most sophisticated of the three locally. After a run, you call debug_handler.get_event_pairs(CBEventType.LLM) to get typed event pairs (start + end) for every LLM call. CBEventType.FUNCTION_CALL for tool events. CBEventType.RETRIEVE for retrieval events. The event type enum covers the full taxonomy of what an LLM pipeline does. The downside: 4 lines to set up, and the Settings injection pattern means it affects all agents globally - same problem as LangChain's flags, just more structured.
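Setup and post-run querying look like this (import paths per recent llama_index.core releases; building and running the agent or query engine is elided):

```python
# LlamaIndex local tracing: inject a debug handler globally via Settings,
# then query typed event pairs after the run.
from llama_index.core import Settings
from llama_index.core.callbacks import CallbackManager, CBEventType, LlamaDebugHandler

debug_handler = LlamaDebugHandler(print_trace_on_end=True)
Settings.callback_manager = CallbackManager([debug_handler])

# ... build and run your agent or query engine here ...

llm_calls = debug_handler.get_event_pairs(CBEventType.LLM)             # (start, end) per LLM call
tool_calls = debug_handler.get_event_pairs(CBEventType.FUNCTION_CALL)  # tool invocations
retrievals = debug_handler.get_event_pairs(CBEventType.RETRIEVE)       # retrieval events
print(f"{len(llm_calls)} LLM calls, {len(tool_calls)} tool calls, {len(retrievals)} retrievals")
```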
What LangSmith Actually Solves
LangChain's local observability gap is not an accident. The missing features - step latency, structured cost tracking, run replay - are exactly what LangSmith provides. This is an intentional split: local verbose mode for development debugging, LangSmith for production observability.
LangSmith is free to start (up to 5,000 traces/month). For production systems it becomes a meaningful cost. More importantly, it's an external dependency: your observability now requires internet access, an API key in your environment, and a third-party service to be running. For air-gapped deployments, containerised CI environments, or applications where you can't send LLM prompts to a third party, this is a hard constraint.
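For reference, the switch to LangSmith is configuration rather than code - environment variables that must be present before the agent runs (names per LangSmith's standard setup; newer releases also accept LANGSMITH_-prefixed equivalents):

```python
# Turning on LangSmith tracing: the point where observability stops being local.
import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"               # send every run to LangSmith
os.environ["LANGCHAIN_API_KEY"] = "<your-langsmith-key>"  # external dependency in your environment
os.environ["LANGCHAIN_PROJECT"] = "agent-debugging"       # optional: group runs by project
```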
SynapseKit and LlamaIndex both give you timing and structured event access locally. That's not because LangChain missed these features - it's because they made a different product decision about where the boundary between framework and platform should be.
What This Means for Engineers
- For development debugging, LangChain's set_verbose(True) is genuinely the fastest path. One line, immediate output, zero configuration. If all you need is "show me what the agent is doing", this works.
- If you need timing data locally, LangChain is the wrong tool. There is no step latency without LangSmith. If you're profiling which part of your agent pipeline is slow - LLM call, tool execution, retrieval - you need SynapseKit's TraceSpan.duration_ms or LlamaIndex's event timestamps.
- LlamaIndex's CBEventType query API is the most powerful post-run interface. After a run you can ask: how many LLM calls happened? What were the inputs and outputs? Which retrieval queries ran? All typed, all queryable. It's verbose to set up but the richest local interface of the three.
- SynapseKit's Tracer is the only one designed for testing. Because it returns a structured object, you can write assertions: assert tracer.spans[2].name == "TOOL_CALL". You can verify that a tool was called with the right arguments. You can check that the token count stayed under a budget. None of this is possible with global flags or Settings injection (see the test sketch after this list).
- Global state is a production smell. Both set_verbose(True) and Settings.callback_manager are global mutations. In a multi-tenant system, a test suite, or any application where you want different tracing behaviour for different agents, global state is a problem. SynapseKit's explicit middleware pattern is the only one that avoids this.
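A sketch of what that testing story looks like, again assuming the SynapseKit-style API described above (span names, metadata keys, and the pytest-asyncio plugin are all assumptions here):

```python
# Regression-test sketch: structured spans make tool usage and token budgets assertable.
# Span names and metadata keys are assumed, not verified SynapseKit API.
import pytest
from synapsekit import Agent, Tracer

@pytest.mark.asyncio  # requires the pytest-asyncio plugin
async def test_agent_uses_search_within_token_budget():
    tracer = Tracer()
    agent = Agent(middleware=[tracer])
    await agent.run("Find the latest incident report")

    tool_spans = [s for s in tracer.spans if s.name == "TOOL_CALL"]
    assert tool_spans, "expected at least one tool call"
    assert tool_spans[0].metadata["tool"] == "search"
    assert sum(s.metadata.get("tokens", 0) for s in tracer.spans) < 2_000
```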
The Thing Most People Miss
Observability during development and observability in production are different problems.
During development, you want maximum visibility with minimum setup. LangChain's set_verbose(True) wins here. You run the agent, watch the terminal, understand what happened.
In production, you need structured, queryable, per-run data without global side effects. You need latency. You need the ability to replay a specific failing run. You need to assert "this run used fewer than 2,000 tokens" in a regression test. LangChain's local tooling doesn't give you this - LangSmith does, but at the cost of an external dependency.
The frameworks that win on development convenience (global flags, one-line setup) tend to create friction in production (no structured objects, no local timing). The frameworks that win on production correctness (explicit Tracer, typed callbacks) require more setup. This is not a bug in either design. It's the same tradeoff that appears in every layer of software engineering: explicitness versus convenience, and you buy one at the cost of the other.
Three Things Worth Doing This Week
- Add one timing assertion to your agent test suite. Pick the most critical tool call in your pipeline and assert that it completes under a threshold. If your framework doesn't expose duration, that's the data point you need. (A minimal sketch follows this list.)
- Check whether your tracing uses global state. If you're using set_verbose(True) or Settings.callback_manager in a production environment, document exactly what gets emitted and where. Uncontrolled log output in a containerised environment is a reliability hazard.
- Run an agent that fails intentionally and time how long it takes to diagnose. Inject a tool that throws an exception mid-run. Measure how long it takes to identify: which step failed, what arguments it was called with, and what the LLM thought immediately before the call. That time is your observability gap.
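For the first item: with a SynapseKit-style span API it's a one-line assert on duration_ms; if your framework exposes no timing at all, fall back to timing the tool directly. A minimal sketch of the fallback (fetch_metrics is a stand-in for your own tool):

```python
# Fallback timing assertion when the framework exposes no per-step duration:
# time the critical tool directly.
import time

def fetch_metrics(query: str) -> dict:
    ...  # your real tool implementation goes here

def test_critical_tool_latency():
    start = time.perf_counter()
    fetch_metrics("error rate, last 24h")
    elapsed_ms = (time.perf_counter() - start) * 1000
    assert elapsed_ms < 1_500, f"tool took {elapsed_ms:.0f}ms against a 1500ms budget"
```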
The 3-line setup is the beginning of observability, not the end of it.
Engineers of AI
Read more: www.engineersofai.com
If this was useful, forward it to one engineer who should be reading it.
