AI Letters #27 - Agent Observability: 3 Lines Gets You In, But What Can You Actually See?
"Three lines to enable tracing in LangChain. Zero lines of latency data when you're done."
Every agent fails eventually. A tool returns nothing. The LLM loops on the same thought. The retrieved documents are all wrong. What separates a two-minute debug from a two-hour one is not how the agent was built - it's how much you can see when it breaks.
Notebook #19 of the LLM Showdown measured one thing: how much can you observe about a running agent without leaving your local environment? No external service. No API key for a tracing platform. No paid tier. Just framework-native observability on the same machine where your code runs.
LangChain enables tracing in the fewest lines. What those lines actually surface is a different question.
The Numbers
Lines of code to enable useful local tracing (no external service, no API key):
Framework     Imports   Enable   Total
---------------------------------------
LangChain        1         2       3
LlamaIndex       2         2       4
SynapseKit       2         5       7
LangChain wins by a wide margin. set_verbose(True) is one line. Add set_debug(True) for full raw prompt logging. That's it.
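The whole local setup, as a sketch (in recent releases both langchain.globals and langchain_core.globals export these flags):

```python
# LangChain's complete local tracing setup: one import line, two global flags.
from langchain.globals import set_verbose, set_debug

set_verbose(True)  # chain inputs/outputs, tool calls, agent reasoning printed as they happen
set_debug(True)    # adds the full raw prompts and low-level callback events
```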
What those lines actually surface locally (feature depth score):
Feature                            SynapseKit   LangChain   LlamaIndex
-----------------------------------------------------------------------
Token usage                           Yes         Partial       Yes
Step latency                          Yes         No            Yes
Intermediate agent steps              Yes         Yes           Yes
Tool call args + returns              Yes         Yes           Yes
Full raw LLM prompt                   Yes         Yes           Yes
Retrieved documents                   Yes         Yes           Yes
Zero-config enable (1-2 lines)        Yes         Yes           No
-----------------------------------------------------------------------
Score (out of 7)                       7           5             6
The latency row is where LangChain's 3-line win costs you the most. set_verbose(True) and set_debug(True) print chain I/O, tool calls, and agent reasoning to stdout. They do not record how long any step took. For timing data - how long did the LLM call take, how long did the tool execution take, which step is the bottleneck - LangChain requires LangSmith, which is an external service.
Token usage is similarly partial: verbose mode shows counts in the output, but not in a structured object you can query. For cost tracking per run, again: LangSmith.
The Three Design Philosophies
How does tracing work?
SynapseKit: explicit object.
    tracer = Tracer()
    agent = Agent(middleware=[tracer])
    result = await agent.run(query)
    tracer.spans -> structured list of spans to query after the run

LangChain: global side effect.
    set_verbose(True)
    # all agents now emit to stdout automatically
    No object to query; redirect stdout to capture the output

LlamaIndex: injected callback.
    handler = LlamaDebugHandler()
    Settings.callback_manager = CallbackManager([handler])
    handler.get_event_pairs(CBEventType.LLM) -> typed event list
SynapseKit uses an explicit Tracer object. You pass it into the agent at construction time. After the run, you query tracer.spans to get a structured list of TraceSpan objects - one per event, with duration_ms, metadata, and full payload. This is testable: you can assert on specific spans in a unit test. It's composable: you can pass different tracers to different agents in the same application.
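In code the pattern looks roughly like this - a sketch built from the names used in this issue (Tracer, Agent(middleware=[...]), tracer.spans, duration_ms); the import path and exact signatures are assumptions, not a verified SynapseKit API:

```python
# Explicit-tracer pattern, sketched from the names in this comparison.
# The synapsekit import path and Agent/Tracer signatures are assumptions.
import asyncio
from synapsekit import Agent, Tracer

async def main():
    tracer = Tracer()                      # one tracer per agent - no global state
    agent = Agent(middleware=[tracer])     # attached explicitly at construction time
    await agent.run("Summarise yesterday's failed tool calls")

    # Post-run: a structured, queryable list of spans.
    for span in tracer.spans:
        print(span.name, span.duration_ms, span.metadata)

asyncio.run(main())
```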
LangChain uses global flags. set_verbose(True) is a global side effect that makes all subsequent LangChain objects print their activity to stdout. No object to query. No programmatic access to events after the run. To capture the output you redirect stdout - which is exactly the kind of code you don't want in production. The upside: one line, zero configuration, works immediately on any existing agent.
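If you do need that output programmatically, stream redirection is the only route - a sketch, assuming agent_executor is an already-built LangChain agent:

```python
# Capturing LangChain's verbose output means redirecting the stream it prints to -
# exactly the workaround you don't want near production code.
import io
from contextlib import redirect_stdout

from langchain.globals import set_verbose

set_verbose(True)

buffer = io.StringIO()
with redirect_stdout(buffer):
    agent_executor.invoke({"input": "Why did the nightly job fail?"})  # assumed pre-built agent

raw_trace = buffer.getvalue()  # one big string: no spans, no durations, no structure
```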
LlamaIndex uses a callback manager injected via Settings. LlamaDebugHandler is the most sophisticated of the three locally. After a run, you call debug_handler.get_event_pairs(CBEventType.LLM) to get typed event pairs (start + end) for every LLM call. CBEventType.FUNCTION_CALL for tool events. CBEventType.RETRIEVE for retrieval events. The event type enum covers the full taxonomy of what an LLM pipeline does. The downside: 4 lines to set up, and the Settings injection pattern means it affects all agents globally - same problem as LangChain's flags, just more structured.
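Setup and post-run querying look like this (import paths per recent llama_index.core releases; building and running the agent or query engine is elided):

```python
# LlamaIndex local tracing: inject a debug handler globally via Settings,
# then query typed event pairs after the run.
from llama_index.core import Settings
from llama_index.core.callbacks import CallbackManager, CBEventType, LlamaDebugHandler

debug_handler = LlamaDebugHandler(print_trace_on_end=True)
Settings.callback_manager = CallbackManager([debug_handler])

# ... build and run your agent or query engine here ...

llm_calls = debug_handler.get_event_pairs(CBEventType.LLM)             # (start, end) per LLM call
tool_calls = debug_handler.get_event_pairs(CBEventType.FUNCTION_CALL)  # tool invocations
retrievals = debug_handler.get_event_pairs(CBEventType.RETRIEVE)       # retrieval events
print(f"{len(llm_calls)} LLM calls, {len(tool_calls)} tool calls, {len(retrievals)} retrievals")
```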
What LangSmith Actually Solves
LangChain's local observability gap is not an accident. The missing features - step latency, structured cost tracking, run replay - are exactly what LangSmith provides. This is an intentional split: local verbose mode for development debugging, LangSmith for production observability.
LangSmith is free to start (up to 5,000 traces/month). For production systems it becomes a meaningful cost. More importantly, it's an external dependency: your observability now requires internet access, an API key in your environment, and a third-party service to be running. For air-gapped deployments, containerised CI environments, or applications where you can't send LLM prompts to a third party, this is a hard constraint.
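For reference, the switch to LangSmith is configuration rather than code - environment variables that must be present before the agent runs (names per LangSmith's standard setup; newer releases also accept LANGSMITH_-prefixed equivalents):

```python
# Turning on LangSmith tracing: the point where observability stops being local.
import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"               # send every run to LangSmith
os.environ["LANGCHAIN_API_KEY"] = "<your-langsmith-key>"  # external dependency in your environment
os.environ["LANGCHAIN_PROJECT"] = "agent-debugging"       # optional: group runs by project
```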
SynapseKit and LlamaIndex both give you timing and structured event access locally. That's not because LangChain missed these features - it's because they made a different product decision about where the boundary between framework and platform should be.
What This Means for Engineers
- For development debugging, LangChain's set_verbose(True) is genuinely the fastest path. One line, immediate output, zero configuration. If all you need is "show me what the agent is doing", this works.
- If you need timing data locally, LangChain is the wrong tool. There is no step latency without LangSmith. If you're profiling which part of your agent pipeline is slow - LLM call, tool execution, retrieval - you need SynapseKit's TraceSpan.duration_ms or LlamaIndex's event timestamps.
- LlamaIndex's CBEventType query API is the most powerful post-run interface. After a run you can ask: how many LLM calls happened? What were the inputs and outputs? Which retrieval queries ran? All typed, all queryable. It's verbose to set up but the richest local interface of the three.
- SynapseKit's Tracer is the only one designed for testing. Because it returns a structured object, you can write assertions: assert tracer.spans[2].name == "TOOL_CALL". You can verify that a tool was called with the right arguments. You can check that the token count stayed under a budget. None of this is possible with global flags or Settings injection (see the test sketch after this list).
- Global state is a production smell. Both set_verbose(True) and Settings.callback_manager are global mutations. In a multi-tenant system, a test suite, or any application where you want different tracing behaviour for different agents, global state is a problem. SynapseKit's explicit middleware pattern is the only one that avoids this.
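A sketch of what that testing story looks like, again assuming the SynapseKit-style API described above (span names, metadata keys, and the pytest-asyncio plugin are all assumptions here):

```python
# Regression-test sketch: structured spans make tool usage and token budgets assertable.
# Span names and metadata keys are assumed, not verified SynapseKit API.
import pytest
from synapsekit import Agent, Tracer

@pytest.mark.asyncio  # requires the pytest-asyncio plugin
async def test_agent_uses_search_within_token_budget():
    tracer = Tracer()
    agent = Agent(middleware=[tracer])
    await agent.run("Find the latest incident report")

    tool_spans = [s for s in tracer.spans if s.name == "TOOL_CALL"]
    assert tool_spans, "expected at least one tool call"
    assert tool_spans[0].metadata["tool"] == "search"
    assert sum(s.metadata.get("tokens", 0) for s in tracer.spans) < 2_000
```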
The Thing Most People Miss
Observability during development and observability in production are different problems.
During development, you want maximum visibility with minimum setup. LangChain's set_verbose(True) wins here. You run the agent, watch the terminal, understand what happened.
In production, you need structured, queryable, per-run data without global side effects. You need latency. You need the ability to replay a specific failing run. You need to assert "this run used fewer than 2,000 tokens" in a regression test. LangChain's local tooling doesn't give you this - LangSmith does, but at the cost of an external dependency.
The frameworks that win on development convenience (global flags, one-line setup) tend to create friction in production (no structured objects, no local timing). The frameworks that win on production correctness (explicit Tracer, typed callbacks) require more setup. This is not a bug in either design. It's the same tradeoff that appears in every layer of software engineering: explicitness versus convenience, and you buy one at the cost of the other.
Three Things Worth Doing This Week
- Add one timing assertion to your agent test suite. Pick the most critical tool call in your pipeline and assert that it completes under a threshold. If your framework doesn't expose duration, that's the data point you need. (A minimal sketch follows this list.)
- Check whether your tracing uses global state. If you're using set_verbose(True) or Settings.callback_manager in a production environment, document exactly what gets emitted and where. Uncontrolled log output in a containerised environment is a reliability hazard.
- Run an agent that fails intentionally and time how long it takes to diagnose. Inject a tool that throws an exception mid-run. Measure how long it takes to identify: which step failed, what arguments it was called with, and what the LLM thought immediately before the call. That time is your observability gap.
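For the first item: with a SynapseKit-style span API it's a one-line assert on duration_ms; if your framework exposes no timing at all, fall back to timing the tool directly. A minimal sketch of the fallback (fetch_metrics is a stand-in for your own tool):

```python
# Fallback timing assertion when the framework exposes no per-step duration:
# time the critical tool directly.
import time

def fetch_metrics(query: str) -> dict:
    ...  # your real tool implementation goes here

def test_critical_tool_latency():
    start = time.perf_counter()
    fetch_metrics("error rate, last 24h")
    elapsed_ms = (time.perf_counter() - start) * 1000
    assert elapsed_ms < 1_500, f"tool took {elapsed_ms:.0f}ms against a 1500ms budget"
```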
The 3-line setup is the beginning of observability, not the end of it.
Engineers of AI
Read more: www.engineersofai.com
If this was useful, forward it to one engineer who should be reading it.
