
AI Letters #30 - Async Throughput: The Framework Tax on Every Concurrent Request

10 min read
EngineersOfAI
AI Engineering Education

Every framework says await. Every framework says "production-ready". At one concurrent request, the difference is invisible. At 50 concurrent requests, LangChain's LCEL middleware costs 19.2% of theoretical throughput while SynapseKit loses only 3.2%. Notebook #22 of the LLM Showdown isolates the framework tax on async IO - and the gap is 7x in overhead milliseconds.

"The difference between wrapping a sync call in a thread and genuinely non-blocking async IO only shows up under real concurrency. At 50 simultaneous requests, that difference is 19%."

Every LLM framework claims async support. The documentation says await. The examples show ainvoke. The marketing page says "production-ready". And when you run a single request, every framework delivers the same result in approximately the same time. The overhead per call is sub-millisecond. Nobody notices.

Then you deploy to a FastAPI endpoint handling 20 simultaneous users. Or you fire off 50 tool calls in an asyncio.gather batch. And one framework quietly adds 12 milliseconds of overhead per batch while the leanest adds less than 2. At scale, those milliseconds compound into throughput ceilings that are invisible in development and painful in production.

Notebook #22 of the LLM Showdown isolates exactly this. A mock async function with a fixed 50ms sleep - simulating an LLM API call - wrapped in each framework's async primitive. Fire N concurrent requests. Measure total time. A perfect async implementation processes 50 requests in ~50ms. Any extra time is pure framework tax.

The results are not close.

What We Measured

Each framework wraps a mock async function - asyncio.sleep(0.05) - simulating a 50ms LLM API call. We fire N concurrent requests using asyncio.gather and measure total wall-clock time. A perfect async implementation processes N requests in ~50ms regardless of N, because all sleeps run concurrently in the event loop.

Metric               What it captures
-----------------------------------------------------------------
Requests/sec         Throughput at 1, 5, 10, 20, 50 concurrent requests
Async efficiency     Actual rps vs theoretical max (% of ideal)
Scaling factor       rps at n=50 / rps at n=1 - perfect async gives 50x
Framework overhead   Milliseconds added per batch beyond raw asyncio

Frameworks: SynapseKit 1.4 (BaseTool.run()), LangChain 1.2 (RunnableLambda.ainvoke()), LlamaIndex Core 0.14 (FunctionTool.acall())
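
For context, here is a minimal sketch of the kind of harness the notebook describes: a raw-asyncio baseline that fires N concurrent 50ms mock calls and reports requests/sec. Names and structure are illustrative, not the notebook's actual code.

    import asyncio
    import time

    async def mock_llm_call() -> str:
        # Stand-in for a 50ms LLM API call: pure IO wait, no CPU work.
        await asyncio.sleep(0.05)
        return "ok"

    async def measure_throughput(n: int) -> float:
        # Fire n concurrent requests and return requests/sec.
        start = time.perf_counter()
        await asyncio.gather(*(mock_llm_call() for _ in range(n)))
        elapsed = time.perf_counter() - start
        return n / elapsed

    async def main() -> None:
        for n in (1, 5, 10, 20, 50):
            rps = await measure_throughput(n)
            print(f"n={n:>3}  {rps:7.1f} rps")

    asyncio.run(main())

To measure a framework, swap mock_llm_call() for the framework's async wrapper around the same sleep - ainvoke on a RunnableLambda, acall on a FunctionTool - and compare against this baseline.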


The Numbers

Throughput (requests/sec):

Concurrency Baseline SynapseKit LangChain LlamaIndex
----------------------------------------------------------
n=1 19.6 19.8 19.4 19.7
n=5 97.8 98.8 96.1 97.3
n=10 194.9 195.7 184.2 193.3
n=20 391.3 388.9 360.5 381.9
n=50 986.6 967.5 808.3 927.2

At n=1, everyone looks the same. The mock call takes ~50ms. Each framework adds sub-millisecond overhead. If this were the only data point, you would conclude that async performance is irrelevant to framework choice.

At n=50, the picture changes. The baseline (raw asyncio.sleep) achieves 986.6 rps - nearly the theoretical maximum of 1000 rps (50 requests / 0.05s). SynapseKit tracks close at 967.5. LlamaIndex at 927.2. LangChain drops to 808.3.

Async efficiency at n=50 concurrent calls:

Framework rps overhead efficiency
--------------------------------------------
Baseline 986.6 0.7ms 98.7%
SynapseKit 967.5 1.7ms 96.8%
LlamaIndex 927.2 3.9ms 92.7%
LangChain 808.3 11.9ms 80.8%

LangChain adds 11.9ms of overhead per batch at 50 concurrent requests. SynapseKit adds 1.7ms. That is a 7x difference in framework-introduced latency.
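
The overhead column follows directly from the rps numbers: a batch of 50 requests at rate r takes 50/r seconds, and anything beyond the ideal 50ms is framework tax. A quick sanity check (my arithmetic on the published figures, not the notebook's code):

    # Batch time for 50 requests = 50 / rps; the ideal batch is 0.050s.
    for name, rps in [("baseline", 986.6), ("synapsekit", 967.5),
                      ("llamaindex", 927.2), ("langchain", 808.3)]:
        batch_s = 50 / rps
        overhead_ms = (batch_s - 0.050) * 1000
        efficiency = rps / 1000          # theoretical max is 1000 rps
        print(f"{name:<11} {overhead_ms:4.1f}ms  {efficiency:.1%}")
    # baseline ~0.7ms 98.7%, synapsekit ~1.7ms 96.8%,
    # llamaindex ~3.9ms 92.7%, langchain ~11.9ms 80.8%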


The Scaling Factor

The cleanest way to read this: how close does each framework get to 50x throughput when you send 50x more concurrent requests?

Scaling factor: rps(n=50) / rps(n=1)
Perfect async = 50x

Framework rps n=1 rps n=50 scaling vs perfect
------------------------------------------------------
Baseline 19.6 986.6 50.4x 100.9%
SynapseKit 19.8 967.5 48.9x 97.7%
LlamaIndex 19.7 927.2 47.1x 94.2%
LangChain 19.4 808.3 41.7x 83.5%

SynapseKit: 97.7% of perfect scaling. LlamaIndex: 94.2%. LangChain: 83.5%.

The gap between SynapseKit's 97.7% and LangChain's 83.5% of perfect scaling at 50 concurrent requests is not a rounding error. It is a consistent pattern across multiple runs (median of 3 repeats, after warmup). Something in LangChain's LCEL ainvoke path does more work per invocation than the other frameworks' async primitives.


Where the Overhead Comes From

This benchmark isolates the framework call path. The mock function is identical - asyncio.sleep(0.05) - so the overhead is entirely in:

  1. Object construction - creating/validating the invocation context
  2. Callback routing - LCEL's pipe chain, middleware, callbacks
  3. Serialization/validation - input/output schema checks

LangChain's LCEL is a composable chain architecture. Every ainvoke passes through the Runnable protocol - input validation, callbacks, tracing hooks, output parsing. This is powerful for composition (chain1 | chain2 | chain3) but adds overhead per invocation. At n=1, the overhead is 0.51ms - invisible. At n=50, the total accumulated overhead is 11.9ms per batch.
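
As a rough illustration of what the LangChain side of the benchmark looks like - a sketch assuming a recent langchain_core, not the notebook's actual code:

    import asyncio
    from langchain_core.runnables import RunnableLambda

    async def mock_llm_call(_: dict) -> str:
        await asyncio.sleep(0.05)  # simulated 50ms LLM call
        return "ok"

    chain = RunnableLambda(mock_llm_call)

    async def batch(n: int) -> None:
        # Every ainvoke passes through the Runnable protocol:
        # config handling, callbacks, tracing hooks.
        await asyncio.gather(*(chain.ainvoke({}) for _ in range(n)))

    asyncio.run(batch(50))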

SynapseKit's BaseTool.run() is a thin wrapper. Validate the input against the JSON schema, call the function, return the result. No middleware chain, no callback infrastructure. The tradeoff: less composability, less overhead.

LlamaIndex's FunctionTool.acall() falls in between - some validation overhead but no LCEL-style chain traversal.
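
The LlamaIndex equivalent, sketched under the same assumptions (llama_index.core installed; exact from_defaults arguments may differ by version):

    import asyncio
    from llama_index.core.tools import FunctionTool

    async def mock_llm_call() -> str:
        await asyncio.sleep(0.05)  # simulated 50ms LLM call
        return "ok"

    tool = FunctionTool.from_defaults(async_fn=mock_llm_call,
                                      name="mock_call",
                                      description="50ms mock LLM call")

    async def batch(n: int) -> None:
        # acall validates arguments and wraps the result in a ToolOutput.
        await asyncio.gather(*(tool.acall() for _ in range(n)))

    asyncio.run(batch(50))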


The Real-World Caveat

This benchmark tests the framework call path under synthetic concurrency. In a production RAG pipeline, the bottleneck is rarely the framework wrapper. It is the retrieval step, the LLM API itself, or the embedding computation.

Production async bottleneck stack:

LLM API call 200-2000ms <-- actual bottleneck
Embedding call 10-100ms <-- second bottleneck
Vector DB query 5-50ms <-- third bottleneck
Framework overhead 1-12ms <-- what we measured
Python event loop <0.1ms <-- irrelevant

The framework overhead matters when:

  • Batch processing with asyncio.gather: If you fire 100+ concurrent tool calls in a batch, the per-batch overhead compounds. LangChain's 11.9ms at n=50 extrapolates to ~25ms at n=100. SynapseKit's 1.7ms extrapolates to ~3.5ms. Still small in absolute terms - but the ratio stays 7x.

  • FastAPI endpoints at high QPS: When your server handles 50-100 simultaneous requests, framework overhead becomes a contributor to p99 latency. Not the primary contributor, but a non-trivial one.

  • Streaming with concurrent tool calls: Agents that call multiple tools in parallel between reasoning steps accumulate framework overhead on every tool invocation cycle.

The framework overhead does NOT matter when:

  • Your bottleneck is the LLM API (it almost always is)
  • You're running 1-5 concurrent requests (all frameworks are equivalent)
  • Your tools are CPU/GPU bound (use asyncio.to_thread, not await)
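
For the CPU-bound case, the escape hatch looks roughly like this (a sketch; tokenize_corpus is a hypothetical CPU-heavy function):

    import asyncio

    def tokenize_corpus(docs: list[str]) -> int:
        # Hypothetical CPU-bound work: running this directly in a coroutine
        # would block the event loop and serialize every other request.
        return sum(len(d.split()) for d in docs)

    async def handle_request(docs: list[str]) -> int:
        # Offload to a worker thread so the event loop stays free for IO.
        return await asyncio.to_thread(tokenize_corpus, docs)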

What This Means for Engineers

  1. At low concurrency, framework async performance is irrelevant. All three frameworks add sub-millisecond overhead at n=1 through n=5. If your application handles fewer than 10 simultaneous requests, async efficiency should not factor into your framework choice.

  2. At high concurrency, LangChain's LCEL overhead becomes measurable. The 11.9ms per-batch overhead at n=50 is not a dealbreaker, but it is a consistent tax. If you are building a high-throughput batch processing pipeline with asyncio.gather, this matters.

  3. SynapseKit's thin async wrapper pays off at scale. 96.8% async efficiency at n=50 - nearly indistinguishable from raw asyncio. The tradeoff is less middleware infrastructure. If you need LCEL-style composability, you pay for it.

  4. LlamaIndex's async path is cleaner than expected. 92.7% efficiency at n=50 is solid. After weeks of ranking third, this is a genuine strength - LlamaIndex's FunctionTool.acall() adds minimal overhead.

  5. Profile your actual bottleneck before optimizing framework overhead. If your LLM API calls take 500ms and your framework adds 2ms, the framework overhead is 0.4% of total latency. Optimize the API call first.


The Thing Most People Miss

Async efficiency is not the same as async correctness.

A framework can achieve 99% async efficiency on a synthetic benchmark and still serialize your real workload if any component in the chain is synchronous. One sync database call in a retriever. One blocking file read in a document loader. One sync HTTP request wrapped in asyncio.to_thread that exhausts the thread pool.

The benchmark above proves that the framework call paths themselves are non-blocking. That is necessary but not sufficient. The production question is whether every component you plug into the framework - retrievers, embedders, tool functions, document loaders - is also genuinely async.
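
The failure mode is easy to reproduce: one blocking call inside an otherwise async path serializes the whole gather. A sketch (names are illustrative):

    import asyncio
    import time

    async def looks_async_but_blocks() -> str:
        time.sleep(0.05)           # sync sleep: blocks the event loop for 50ms
        return "ok"

    async def genuinely_async() -> str:
        await asyncio.sleep(0.05)  # yields to the loop; runs concurrently
        return "ok"

    async def main() -> None:
        for fn in (genuinely_async, looks_async_but_blocks):
            start = time.perf_counter()
            await asyncio.gather(*(fn() for _ in range(50)))
            print(fn.__name__, f"{time.perf_counter() - start:.2f}s")
            # genuinely_async: ~0.05s; looks_async_but_blocks: ~2.5s

    asyncio.run(main())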

SynapseKit's retriever and tool base classes are async-native. LlamaIndex's retriever base classes are async-native. LangChain's retrievers are inconsistent - some have native _aget_relevant_documents, some fall back to run_in_executor.

The 19.2% throughput loss LangChain shows in this benchmark is the framework's own overhead. In production, if your retriever falls back to run_in_executor, the loss compounds further. The framework tax and the component tax stack.

The engineer who builds the highest-throughput async pipeline will not be the one who picks the framework with the best synthetic benchmark. They will be the one who audits every component in their chain for sync fallbacks and eliminates them. The framework choice sets the floor. The component audit determines the ceiling.

Week 4 continues: graph workflows, cost tracking, guardrails, MCP support. The async result gives SynapseKit another point. The cumulative race tightens.


Three Things Worth Doing This Week

  1. Audit your async chain for sync fallbacks. Open every retriever, tool, and loader in your pipeline. Search for run_in_executor or asyncio.to_thread. Each one is a thread-pool bottleneck masquerading as async code. Replace with native async implementations where they exist.

  2. Run a throughput test on your actual pipeline. Fire 20 concurrent requests at your full pipeline (not just the LLM call). Measure wall-clock time. Compare against 20 sequential requests. If the ratio is less than 15x, something in your chain is serializing. Find it. A sketch of this check follows this list.

  3. Set a p99 latency budget for framework overhead. If your LLM call takes 500ms, your framework overhead budget should be less than 5ms (1%). Measure it with the same technique as notebook #22: wrap a known-latency mock function and compare. If you exceed the budget, simplify the call chain.
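
A minimal version of the throughput test from item 2, assuming pipeline_call is an async callable wrapping your full chain (the name is a placeholder):

    import asyncio
    import time

    async def compare(pipeline_call, n: int = 20) -> None:
        start = time.perf_counter()
        for _ in range(n):                       # sequential baseline
            await pipeline_call()
        sequential = time.perf_counter() - start

        start = time.perf_counter()
        await asyncio.gather(*(pipeline_call() for _ in range(n)))
        concurrent = time.perf_counter() - start

        ratio = sequential / concurrent
        print(f"sequential {sequential:.2f}s  "
              f"concurrent {concurrent:.2f}s  ratio {ratio:.1f}x")
        # For n=20, a ratio well below ~15x means something is serializing.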

The fastest async code is the code that does nothing between your function call and the event loop. Every layer of abstraction between await and the actual IO operation is overhead. Sometimes that abstraction is worth the cost. Sometimes it is not. Measure before you assume.


Engineers of AI

Read more: www.engineersofai.com

If this was useful, forward it to one engineer who should be reading it.

Want to Think Like an AI Architect?

Join engineers receiving weekly breakdowns of AI systems, production failures, and architectural decisions.