
AI Letters #30 - Async Throughput: The Framework Tax on Every Concurrent Request

10 min read
EngineersOfAI
AI Engineering Education

Every framework says await. Every framework says "production-ready". At one concurrent request, the difference is invisible. At 50 concurrent requests, LangChain's LCEL middleware costs 19.2% of theoretical throughput while SynapseKit loses only 3.2%. Notebook #22 of the LLM Showdown isolates the framework tax on async IO - and the gap is 7x in overhead milliseconds.

"The difference between wrapping a sync call in a thread and genuinely non-blocking async IO only shows up under real concurrency. At 50 simultaneous requests, that difference is 19%."

Every LLM framework claims async support. The documentation says await. The examples show ainvoke. The marketing page says "production-ready". And when you run a single request, every framework delivers the same result in approximately the same time. The overhead per call is sub-millisecond. Nobody notices.

Then you deploy to a FastAPI endpoint handling 20 simultaneous users. Or you fire off 50 tool calls in an asyncio.gather batch. And one framework quietly adds 12 milliseconds of overhead per batch while the leanest adds less than 2. At scale, those milliseconds compound into throughput ceilings that are invisible in development and painful in production.

Notebook #22 of the LLM Showdown isolates exactly this. A mock async function with a fixed 50ms sleep - simulating an LLM API call - wrapped in each framework's async primitive. Fire N concurrent requests. Measure total time. A perfect async implementation processes 50 requests in ~50ms. Any extra time is pure framework tax.

The results are not close.

What We Measured

Each framework wraps a mock async function - asyncio.sleep(0.05) - simulating a 50ms LLM API call. We fire N concurrent requests using asyncio.gather and measure total wall-clock time. A perfect async implementation processes N requests in ~50ms regardless of N, because all sleeps run concurrently in the event loop.

Metric               What it captures
-----------------------------------------------------------------
Requests/sec         Throughput at 1, 5, 10, 20, 50 concurrent requests
Async efficiency     Actual rps vs theoretical max (% of ideal)
Scaling factor       rps at n=50 / rps at n=1 - perfect async gives 50x
Framework overhead   Milliseconds added per batch beyond raw asyncio

Frameworks: SynapseKit 1.4 (BaseTool.run()), LangChain 1.2 (RunnableLambda.ainvoke()), LlamaIndex Core 0.14 (FunctionTool.acall())
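
For context, here is a minimal sketch of the kind of harness the notebook describes: a raw-asyncio baseline that fires N concurrent 50ms mock calls and reports requests/sec. Names and structure are illustrative, not the notebook's actual code.

    import asyncio
    import time

    async def mock_llm_call() -> str:
        # Stand-in for a 50ms LLM API call: pure IO wait, no CPU work.
        await asyncio.sleep(0.05)
        return "ok"

    async def measure_throughput(n: int) -> float:
        # Fire n concurrent requests and return requests/sec.
        start = time.perf_counter()
        await asyncio.gather(*(mock_llm_call() for _ in range(n)))
        elapsed = time.perf_counter() - start
        return n / elapsed

    async def main() -> None:
        for n in (1, 5, 10, 20, 50):
            rps = await measure_throughput(n)
            print(f"n={n:>3}  {rps:7.1f} rps")

    asyncio.run(main())

To measure a framework, swap mock_llm_call() for the framework's async wrapper around the same sleep - ainvoke on a RunnableLambda, acall on a FunctionTool - and compare against this baseline.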


The Numbers

Throughput (requests/sec):

Concurrency Baseline SynapseKit LangChain LlamaIndex
----------------------------------------------------------
n=1 19.6 19.8 19.4 19.7
n=5 97.8 98.8 96.1 97.3
n=10 194.9 195.7 184.2 193.3
n=20 391.3 388.9 360.5 381.9
n=50 986.6 967.5 808.3 927.2

At n=1, everyone looks the same. The mock call takes ~50ms. Each framework adds sub-millisecond overhead. If this were the only data point, you would conclude that async performance is irrelevant to framework choice.

At n=50, the picture changes. The baseline (raw asyncio.sleep) achieves 986.6 rps - nearly the theoretical maximum of 1000 rps (50 requests / 0.05s). SynapseKit tracks close at 967.5. LlamaIndex at 927.2. LangChain drops to 808.3.

Async efficiency at n=50 concurrent calls:

Framework rps overhead efficiency
--------------------------------------------
Baseline 986.6 0.7ms 98.7%
SynapseKit 967.5 1.7ms 96.8%
LlamaIndex 927.2 3.9ms 92.7%
LangChain 808.3 11.9ms 80.8%

LangChain adds 11.9ms of overhead per batch at 50 concurrent requests. SynapseKit adds 1.7ms. That is a 7x difference in framework-introduced latency.
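
The overhead column follows directly from the rps numbers: a batch of 50 requests at rate r takes 50/r seconds, and anything beyond the ideal 50ms is framework tax. A quick sanity check (my arithmetic on the published figures, not the notebook's code):

    # Batch time for 50 requests = 50 / rps; the ideal batch is 0.050s.
    for name, rps in [("baseline", 986.6), ("synapsekit", 967.5),
                      ("llamaindex", 927.2), ("langchain", 808.3)]:
        batch_s = 50 / rps
        overhead_ms = (batch_s - 0.050) * 1000
        efficiency = rps / 1000          # theoretical max is 1000 rps
        print(f"{name:<11} {overhead_ms:4.1f}ms  {efficiency:.1%}")
    # baseline ~0.7ms 98.7%, synapsekit ~1.7ms 96.8%,
    # llamaindex ~3.9ms 92.7%, langchain ~11.9ms 80.8%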


The Scaling Factor

The cleanest way to read this: how close does each framework get to 50x throughput when you send 50x more concurrent requests?

Scaling factor: rps(n=50) / rps(n=1)
Perfect async = 50x

Framework rps n=1 rps n=50 scaling vs perfect
------------------------------------------------------
Baseline 19.6 986.6 50.4x 100.9%
SynapseKit 19.8 967.5 48.9x 97.7%
LlamaIndex 19.7 927.2 47.1x 94.2%
LangChain 19.4 808.3 41.7x 83.5%

SynapseKit: 97.7% of perfect scaling. LlamaIndex: 94.2%. LangChain: 83.5%.

The gap between SynapseKit's 97.7% and LangChain's 83.5% of perfect scaling at 50 concurrent requests is not a rounding error. It is a consistent pattern across multiple runs (median of 3 repeats, after warmup). Something in LangChain's LCEL ainvoke path does more work per invocation than the other frameworks' async primitives.


Where the Overhead Comes From

This benchmark isolates the framework call path. The mock function is identical - asyncio.sleep(0.05) - so the overhead is entirely in:

  1. Object construction - creating/validating the invocation context
  2. Callback routing - LCEL's pipe chain, middleware, callbacks
  3. Serialization/validation - input/output schema checks

LangChain's LCEL is a composable chain architecture. Every ainvoke passes through the Runnable protocol - input validation, callbacks, tracing hooks, output parsing. This is powerful for composition (chain1 | chain2 | chain3) but adds overhead per invocation. At n=1, the overhead is 0.51ms - invisible. At n=50, the total accumulated overhead is 11.9ms per batch.
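
As a rough illustration of what the LangChain side of the benchmark looks like - a sketch assuming a recent langchain_core, not the notebook's actual code:

    import asyncio
    from langchain_core.runnables import RunnableLambda

    async def mock_llm_call(_: dict) -> str:
        await asyncio.sleep(0.05)  # simulated 50ms LLM call
        return "ok"

    chain = RunnableLambda(mock_llm_call)

    async def batch(n: int) -> None:
        # Every ainvoke passes through the Runnable protocol:
        # config handling, callbacks, tracing hooks.
        await asyncio.gather(*(chain.ainvoke({}) for _ in range(n)))

    asyncio.run(batch(50))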

SynapseKit's BaseTool.run() is a thin wrapper. Validate the input against the JSON schema, call the function, return the result. No middleware chain, no callback infrastructure. The tradeoff: less composability, less overhead.

LlamaIndex's FunctionTool.acall() falls in between - some validation overhead but no LCEL-style chain traversal.
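
The LlamaIndex equivalent, sketched under the same assumptions (llama_index.core installed; exact from_defaults arguments may differ by version):

    import asyncio
    from llama_index.core.tools import FunctionTool

    async def mock_llm_call() -> str:
        await asyncio.sleep(0.05)  # simulated 50ms LLM call
        return "ok"

    tool = FunctionTool.from_defaults(async_fn=mock_llm_call,
                                      name="mock_call",
                                      description="50ms mock LLM call")

    async def batch(n: int) -> None:
        # acall validates arguments and wraps the result in a ToolOutput.
        await asyncio.gather(*(tool.acall() for _ in range(n)))

    asyncio.run(batch(50))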


The Real-World Caveat

This benchmark tests the framework call path under synthetic concurrency. In a production RAG pipeline, the bottleneck is rarely the framework wrapper. It is the retrieval step, the LLM API itself, or the embedding computation.

Production async bottleneck stack:

LLM API call 200-2000ms <-- actual bottleneck
Embedding call 10-100ms <-- second bottleneck
Vector DB query 5-50ms <-- third bottleneck
Framework overhead 1-12ms <-- what we measured
Python event loop <0.1ms <-- irrelevant

The framework overhead matters when:

  • Batch processing with asyncio.gather: If you fire 100+ concurrent tool calls in a batch, the per-batch overhead compounds. LangChain's 11.9ms at n=50 extrapolates to ~25ms at n=100. SynapseKit's 1.7ms extrapolates to ~3.5ms. Still small in absolute terms - but the ratio stays 7x.

  • FastAPI endpoints at high QPS: When your server handles 50-100 simultaneous requests, framework overhead becomes a contributor to p99 latency. Not the primary contributor, but a non-trivial one.

  • Streaming with concurrent tool calls: Agents that call multiple tools in parallel between reasoning steps accumulate framework overhead on every tool invocation cycle.

The framework overhead does NOT matter when:

  • Your bottleneck is the LLM API (it almost always is)
  • You're running 1-5 concurrent requests (all frameworks are equivalent)
  • Your tools are CPU/GPU bound (use asyncio.to_thread, not await)
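
For the CPU-bound case, the escape hatch looks roughly like this (a sketch; tokenize_corpus is a hypothetical CPU-heavy function):

    import asyncio

    def tokenize_corpus(docs: list[str]) -> int:
        # Hypothetical CPU-bound work: running this directly in a coroutine
        # would block the event loop and serialize every other request.
        return sum(len(d.split()) for d in docs)

    async def handle_request(docs: list[str]) -> int:
        # Offload to a worker thread so the event loop stays free for IO.
        return await asyncio.to_thread(tokenize_corpus, docs)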

What This Means for Engineers

  1. At low concurrency, framework async performance is irrelevant. All three frameworks add sub-millisecond overhead at n=1 through n=5. If your application handles fewer than 10 simultaneous requests, async efficiency should not factor into your framework choice.

  2. At high concurrency, LangChain's LCEL overhead becomes measurable. The 11.9ms per-batch overhead at n=50 is not a dealbreaker, but it is a consistent tax. If you are building a high-throughput batch processing pipeline with asyncio.gather, this matters.

  3. SynapseKit's thin async wrapper pays off at scale. 96.8% async efficiency at n=50 - nearly indistinguishable from raw asyncio. The tradeoff is less middleware infrastructure. If you need LCEL-style composability, you pay for it.

  4. LlamaIndex's async path is cleaner than expected. 92.7% efficiency at n=50 is solid. After weeks of ranking third, this is a genuine strength - LlamaIndex's FunctionTool.acall() adds minimal overhead.

  5. Profile your actual bottleneck before optimizing framework overhead. If your LLM API calls take 500ms and your framework adds 2ms, the framework overhead is 0.4% of total latency. Optimize the API call first.


The Thing Most People Miss

Async efficiency is not the same as async correctness.

A framework can achieve 99% async efficiency on a synthetic benchmark and still serialize your real workload if any component in the chain is synchronous. One sync database call in a retriever. One blocking file read in a document loader. One sync HTTP request wrapped in asyncio.to_thread that exhausts the thread pool.

The benchmark above proves that the framework call paths themselves are non-blocking. That is necessary but not sufficient. The production question is whether every component you plug into the framework - retrievers, embedders, tool functions, document loaders - is also genuinely async.
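
The failure mode is easy to reproduce: one blocking call inside an otherwise async path serializes the whole gather. A sketch (names are illustrative):

    import asyncio
    import time

    async def looks_async_but_blocks() -> str:
        time.sleep(0.05)           # sync sleep: blocks the event loop for 50ms
        return "ok"

    async def genuinely_async() -> str:
        await asyncio.sleep(0.05)  # yields to the loop; runs concurrently
        return "ok"

    async def main() -> None:
        for fn in (genuinely_async, looks_async_but_blocks):
            start = time.perf_counter()
            await asyncio.gather(*(fn() for _ in range(50)))
            print(fn.__name__, f"{time.perf_counter() - start:.2f}s")
            # genuinely_async: ~0.05s; looks_async_but_blocks: ~2.5s

    asyncio.run(main())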

SynapseKit's retriever and tool base classes are async-native. LlamaIndex's retriever base classes are async-native. LangChain's retrievers are inconsistent - some have native _aget_relevant_documents, some fall back to run_in_executor.

The 19.2% throughput loss LangChain shows in this benchmark is the framework's own overhead. In production, if your retriever falls back to run_in_executor, the loss compounds further. The framework tax and the component tax stack.

The engineer who builds the highest-throughput async pipeline will not be the one who picks the framework with the best synthetic benchmark. They will be the one who audits every component in their chain for sync fallbacks and eliminates them. The framework choice sets the floor. The component audit determines the ceiling.

Week 4 continues: graph workflows, cost tracking, guardrails, MCP support. The async result gives SynapseKit another point. The cumulative race tightens.


Three Things Worth Doing This Week

  1. Audit your async chain for sync fallbacks. Open every retriever, tool, and loader in your pipeline. Search for run_in_executor or asyncio.to_thread. Each one is a thread-pool bottleneck masquerading as async code. Replace with native async implementations where they exist.

  2. Run a throughput test on your actual pipeline. Fire 20 concurrent requests at your full pipeline (not just the LLM call). Measure wall-clock time. Compare against 20 sequential requests. If the ratio is less than 15x, something in your chain is serializing. Find it. A sketch of this check follows this list.

  3. Set a p99 latency budget for framework overhead. If your LLM call takes 500ms, your framework overhead budget should be less than 5ms (1%). Measure it with the same technique as notebook #22: wrap a known-latency mock function and compare. If you exceed the budget, simplify the call chain.
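
A minimal version of the throughput test from item 2, assuming pipeline_call is an async callable wrapping your full chain (the name is a placeholder):

    import asyncio
    import time

    async def compare(pipeline_call, n: int = 20) -> None:
        start = time.perf_counter()
        for _ in range(n):                       # sequential baseline
            await pipeline_call()
        sequential = time.perf_counter() - start

        start = time.perf_counter()
        await asyncio.gather(*(pipeline_call() for _ in range(n)))
        concurrent = time.perf_counter() - start

        ratio = sequential / concurrent
        print(f"sequential {sequential:.2f}s  "
              f"concurrent {concurrent:.2f}s  ratio {ratio:.1f}x")
        # For n=20, a ratio well below ~15x means something is serializing.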

The fastest async code is the code that does nothing between your function call and the event loop. Every layer of abstraction between await and the actual IO operation is overhead. Sometimes that abstraction is worth the cost. Sometimes it is not. Measure before you assume.


Engineers of AI

Read more: www.engineersofai.com

If this was useful, forward it to one engineer who should be reading it.

Want to Think Like an AI Architect?

Join engineers receiving weekly breakdowns of AI systems, production failures, and architectural decisions.