AI Letters #21 - Streaming RAG: Time to First Token Across Three Frameworks
When users wait for an LLM, the number that matters is time-to-first-token, not total time. 200ms to first token feels instant. 2 seconds to first token feels broken - even if the complete answer ultimately finishes sooner.
Every LLM UI eventually learns the same lesson. Users don't measure latency the way your dashboard does. They don't care about tokens-per-second, p99 tail latency, or median completion time. They care about one thing: how long until something appears on screen. That number is TTFT - time to first token - and it dominates perceived performance more than any other metric in LLM serving.
The catch is that when you're building a streaming RAG pipeline, the framework itself sits between your `.stream()` call and the first token your user sees. Every `async for`, every LCEL graph traversal, every callback dispatch adds latency before a single character leaves the server. In production that overhead is invisible because network latency to OpenAI or Anthropic is 100–1000x larger. But strip out the network with a mock LLM and you can finally see what the framework itself costs you.
We built identical streaming RAG pipelines across SynapseKit 1.4, LangChain 1.2, and LlamaIndex Core 0.14. Same documents, same query, same mock LLM that yields the exact same token list with zero network latency. The result: all three clear the sub-millisecond bar comfortably. Nobody loses on the number. The interesting split is elsewhere - in the shape of the streaming API itself.
What We Measured
Task: Build a streaming RAG pipeline (BM25 retrieval + LLM stream). Feed the retrieved context into an LLM that streams tokens. Measure the latency from calling `.stream()` to receiving the first token.
| Metric | What it captures |
|---|---|
| Lines of code | Code to wire up a streaming RAG pipeline |
| TTFT (median) | Pure framework overhead with a zero-latency mock LLM |
| Streaming API surface | Sync vs async, generator vs callback, on-RAG vs on-LLM |
Why a mock LLM: real LLM APIs add 100–2000ms of network and provider latency. That swamps any framework difference. Strip it out and the framework overhead finally becomes visible - the part you can actually optimise.
Frameworks: SynapseKit 1.4, LangChain 1.2, LlamaIndex Core 0.14. Kaggle CPU. 50 reps per framework. Disclosure: I'm the author of SynapseKit. All code is on Kaggle - fork and run yourself.
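For concreteness, here is a minimal sketch of the measurement harness - a mock token generator plus a timing loop. The names (`mock_stream`, `measure_ttft`) are illustrative, not the notebook's actual code; in the real benchmark the mock sits behind each framework's LLM interface, here it stands alone to show the shape of the measurement:

```python
import time
import statistics

TOKENS = ["The", " answer", " is", " 42", "."]

def mock_stream():
    # Stands in for the provider call: same token list every run, zero network I/O.
    for token in TOKENS:
        yield token

def measure_ttft(stream_fn, reps=50):
    # Milliseconds from calling the stream to receiving the first token.
    samples = []
    for _ in range(reps):
        start = time.perf_counter()
        next(stream_fn())  # first token arrives here
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.median(samples), statistics.quantiles(samples, n=100)[98]

median_ms, p99_ms = measure_ttft(mock_stream)
print(f"median TTFT: {median_ms:.3f} ms, p99: {p99_ms:.3f} ms")
```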
The Code
SynapseKit - async generator on the RAG object itself:
```python
from synapsekit import RAG

rag = RAG(model="gpt-4o-mini", api_key=KEY, provider="openai")
await rag.add_documents(DOCS)

async for token in rag.stream(QUERY):
    print(token, end="", flush=True)
```
`rag.stream(QUERY)` is a single method call that streams the full RAG pipeline - retrieve, construct prompt, call LLM, yield tokens. No chain composition, no graph construction. Async-only.
LangChain - LCEL chain with .stream():
```python
from langchain_community.retrievers import BM25Retriever
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough

retriever = BM25Retriever.from_texts(DOCS, k=3)
prompt = ChatPromptTemplate.from_template("Context: {ctx}\n\nQ: {q}")
llm = ChatOpenAI(model="gpt-4o-mini", streaming=True)
chain = {"ctx": retriever, "q": RunnablePassthrough()} | prompt | llm

for chunk in chain.stream(QUERY):
    print(chunk.content, end="", flush=True)
```
LCEL composition makes every step explicit and swappable. More imports, more ceremony, but you can yank out the retriever or add a reranker without touching the stream call. Both sync (`.stream()`) and async (`.astream()`) are native.
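For the async path, the same chain is consumed with `.astream()`; a minimal sketch, reusing `chain` and `QUERY` from above:

```python
import asyncio

async def main():
    # Same LCEL chain, consumed asynchronously - the shape you want inside an
    # async web handler so the event loop is never blocked between tokens.
    async for chunk in chain.astream(QUERY):
        print(chunk.content, end="", flush=True)

asyncio.run(main())
```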
LlamaIndex - query_engine(streaming=True):
```python
from llama_index.core import Document, VectorStoreIndex, Settings
from llama_index.llms.openai import OpenAI

Settings.llm = OpenAI(model="gpt-4o-mini")
index = VectorStoreIndex.from_documents([Document(text=d) for d in DOCS])
engine = index.as_query_engine(streaming=True)

response = engine.query(QUERY)
for chunk in response.response_gen:
    print(chunk, end="", flush=True)
```
One flag flip (`streaming=True`) turns the query engine into a streaming generator. Clean surface. No native async stream on the query engine - you'd wrap it yourself or reach for the lower-level async APIs.
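If you do need async consumption, one generic workaround (plain asyncio, not a LlamaIndex API) is to pull the sync generator on a worker thread so it doesn't block the event loop - a sketch assuming the `response` object from above:

```python
import asyncio

async def astream_chunks(response_gen):
    # Fetch each chunk from the sync generator in the default thread-pool
    # executor so the event loop stays free between chunks.
    loop = asyncio.get_running_loop()
    done = object()  # sentinel marking the end of the generator
    while True:
        chunk = await loop.run_in_executor(None, lambda: next(response_gen, done))
        if chunk is done:
            break
        yield chunk

# Usage (inside an async function):
#   async for chunk in astream_chunks(response.response_gen):
#       print(chunk, end="", flush=True)
```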
The Numbers
With a mock LLM that yields the same token list at zero network latency, we ran 50 TTFT measurements per framework:
| Framework | Median TTFT | p99 TTFT | API shape |
|---|---|---|---|
| SynapseKit | 0.08 ms | 0.15 ms | async generator |
| LangChain | 0.12 ms | 0.21 ms | sync generator |
| LlamaIndex | 0.14 ms | 0.26 ms | sync generator |
All three land in the sub-millisecond zone. The framework overhead itself is effectively free. At this resolution the numbers are noise. If you're choosing a framework to optimise TTFT, you're optimising the wrong thing - put your effort into prompt caching, smaller context windows, provider selection, and serving infrastructure. That's where the real milliseconds live.
For reference, here's what actually dominates TTFT in production:
| Component | Typical latency |
|---|---|
| Framework overhead | < 1 ms |
| Embedding lookup | 5–20 ms |
| BM25 retrieval | 10–50 ms |
| Network to LLM provider | 80–200 ms |
| LLM first token | 150–600 ms |
| Total TTFT | 250 ms – 1 s |
The framework is a rounding error. A 0.08ms vs 0.14ms difference cannot be measured in production - it vanishes into jitter.
The API Surface Split
This is where the frameworks actually diverge. When you're writing real code, the shape of the streaming API matters more than its latency.
| Feature | SynapseKit | LangChain | LlamaIndex |
|---|---|---|---|
| Primary API | async generator | sync + async | sync generator |
| Sync support | No | Yes | Yes |
| Native async on RAG | Yes | Yes | No |
| Callback handlers | No | Yes | Yes (callback manager) |
| Stream on RAG object | Yes | Yes (LCEL) | Yes (flag) |
SynapseKit is async-only. There is no `.stream()` on a sync path. If your codebase runs in Flask, Django sync views, or a Jupyter notebook without an event loop, every call site needs `asyncio.run()` or you need to restructure around async. That's a migration, not a drop-in.
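When you do have to call it from sync code, the bridge looks roughly like this - a sketch assuming the `rag` object from the earlier snippet. Note that draining the async generator through `asyncio.run()` forfeits incremental streaming; the caller only sees the result at the end:

```python
import asyncio

def answer_blocking(rag, query):
    # Drain the async generator in a private event loop and return the joined
    # answer. Usable from Flask/Django sync views, but nothing reaches the
    # user until the whole stream has finished - the streaming UX is lost.
    async def collect():
        return [token async for token in rag.stream(query)]
    return "".join(asyncio.run(collect()))
```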
LangChain is the most flexible. `chain.stream()` for sync, `chain.astream()` for async, plus a callback handler ecosystem (`StreamingStdOutCallbackHandler`, `AsyncIteratorCallbackHandler`) for every framework integration you might need. If you're building a Streamlit app, a CLI tool, and an async FastAPI endpoint from the same chain, this is the path.
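A minimal sketch of the callback route - exact import paths shift between LangChain versions, so treat this as illustrative:

```python
from langchain_core.callbacks import StreamingStdOutCallbackHandler
from langchain_openai import ChatOpenAI

# Each new token is pushed into the handler's on_llm_new_token hook as it
# arrives, independently of whoever is iterating the response - handy for
# mirroring a stream to a UI while another consumer assembles the full text.
llm = ChatOpenAI(
    model="gpt-4o-mini",
    streaming=True,
    callbacks=[StreamingStdOutCallbackHandler()],
)
llm.invoke("Summarise TTFT in one sentence.")
```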
LlamaIndex sits in the middle. Native sync generators (`response.response_gen`) are easy to consume. The async story is weaker - the query engine doesn't expose a clean async stream by default. You reach for lower-level LLM APIs or wrap the sync generator in a thread, as in the sketch under the LlamaIndex code above.
What This Means for Engineers
- Stop optimising framework TTFT overhead. At sub-millisecond, it's below the noise floor of every real LLM deployment. The TTFT you see in your dashboard is 99%+ network and provider latency. Focus there.
- Match the streaming API to your runtime. If your app is async (FastAPI, async workers, LangGraph): SynapseKit and LangChain `.astream()` are both clean. If your app is sync (Flask, Django sync views, Jupyter, a CLI): LangChain `.stream()` or LlamaIndex's `response_gen` let you avoid restructuring. (See the FastAPI sketch after this list.)
- Use callbacks for UI binding, generators for pipelines. LangChain's callback handler pattern is the cleanest path for tying stream output into progress bars, partial rendering, and multi-consumer fan-out. For a one-consumer pipeline, a generator is simpler.
- Stream from the RAG object, not the LLM. All three frameworks can stream from the top-level RAG call (SynapseKit `rag.stream`, LangChain LCEL chain, LlamaIndex `query_engine(streaming=True)`). Don't roll your own retrieve + LLM stream loop - you'll reimplement the prompt construction wrong.
- Measure TTFT end-to-end, not in isolation. The real number includes retrieval time, prompt build, network round-trip, and the provider's own time-to-first-token. That's the number your users experience. Framework overhead disappears into it.
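To make the runtime-matching point concrete, here is roughly what the async path looks like at the HTTP edge - a hypothetical FastAPI endpoint that forwards tokens as server-sent events, assuming an async-generator pipeline like the `rag.stream()` shown earlier:

```python
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.get("/ask")
async def ask(q: str):
    async def events():
        # Forward each token the moment it arrives; the user's TTFT is set by
        # the first iteration of this loop, everything after is flow.
        async for token in rag.stream(q):  # rag: the async streaming pipeline from earlier
            yield f"data: {token}\n\n"
        yield "data: [DONE]\n\n"
    return StreamingResponse(events(), media_type="text/event-stream")
```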
The Corollary Most People Miss
TTFT is not the only perception metric. Inter-token latency - the jitter between the 2nd, 10th, and 100th tokens - matters almost as much. A stream that arrives in steady 15ms bursts feels smooth. A stream that arrives in a burst, stalls for 200ms, then bursts again feels broken. And inter-token latency is where framework buffering, callback dispatch, and LCEL graph traversal can actually start to matter at production volumes.
None of these frameworks add visible buffering on a mock LLM. But layer in a callback chain, a streaming response wrapper, and a server-sent-events encoder on top, and you can build a pipeline that adds 10–20ms of buffering per token. That's the part you have to profile yourself - and the part no benchmark in this series will catch for you.
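Profiling it is cheap: time the gap between consecutive tokens, not just the first. A framework-agnostic sketch that works on any sync token iterator (swap in `chain.stream(QUERY)` or `response.response_gen` from earlier):

```python
import time
import statistics

def inter_token_gaps_ms(stream):
    # Gap in milliseconds between consecutive tokens; stalls show up as outliers.
    gaps, last = [], None
    for _ in stream:
        now = time.perf_counter()
        if last is not None:
            gaps.append((now - last) * 1000)
        last = now
    return gaps

gaps = inter_token_gaps_ms(chain.stream(QUERY))  # any token iterator works here
print(f"median gap: {statistics.median(gaps):.1f} ms, worst: {max(gaps):.1f} ms")
```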
| Perception metric | What the user feels |
|---|---|
| TTFT | Did anything happen? |
| Inter-token latency | Is it flowing or stalling? |
| Total time | Was it fast enough to use? |
You optimise all three in different ways. Framework choice affects the first two slightly and the third not at all.
Three Things Worth Doing This Week
- Instrument TTFT in your production RAG. Log the three numbers that matter: retrieval latency, prompt-build latency, and time-to-first-token from the LLM. If any one is above 300ms, that's where the work is - not in the framework. (A small instrumentation sketch follows this list.)
- Switch from `.stream()` to `.astream()` if you're on an async stack. Sync `.stream()` inside an async handler blocks the event loop. Most teams accidentally run sync streams in async contexts because it was easier to paste the tutorial code.
- Read the Kaggle notebook. Full reproducible code, mock LLM implementations for each framework, and 50-run TTFT distributions: LLM Showdown #12 - Streaming RAG TTFT
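For the first item, the instrumentation is three timestamps around stages you already have - `retrieve`, `build_prompt`, and `llm_stream` below are hypothetical stand-ins for whatever your pipeline calls them:

```python
import time

def answer_with_timings(query):
    t0 = time.perf_counter()
    docs = retrieve(query)              # your retriever
    t1 = time.perf_counter()
    prompt = build_prompt(query, docs)  # your prompt construction
    t2 = time.perf_counter()
    stream = llm_stream(prompt)         # your streaming LLM call
    first_token = next(stream)          # blocks until the provider's first token
    t3 = time.perf_counter()
    print(f"retrieval={(t1 - t0) * 1000:.0f}ms "
          f"prompt_build={(t2 - t1) * 1000:.0f}ms "
          f"llm_ttft={(t3 - t2) * 1000:.0f}ms")
    yield first_token
    yield from stream
```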
Streaming is the default UX for every modern LLM product. The frameworks all do it. None of them are meaningfully slower than the others. The real question is whether your stream fits your runtime - async or sync, generator or callback, on the RAG or on the LLM. Pick the shape that matches your code, not the one with the lowest microsecond count on a benchmark that doesn't include the network.
Engineers of AI
Read more: www.engineersofai.com
If this was useful, forward it to one engineer who should be reading it.
