AI Letters #21 - Streaming RAG: Time to First Token Across Three Frameworks
When users wait for an LLM, the number that matters is time-to-first-token, not total time. 200ms to first token feels instant. 2 seconds to first token feels broken - even if the complete answer ultimately finishes sooner.
Every LLM UI eventually learns the same lesson. Users don't measure latency the way your dashboard does. They don't care about tokens-per-second, p99 tail latency, or median completion time. They care about one thing: how long until something appears on screen. That number is TTFT - time to first token - and it dominates perceived performance more than any other metric in LLM serving.
The catch is that when you're building a streaming RAG pipeline, the framework itself sits between your `.stream()` call and the first token your user sees. Every `async for`, every LCEL graph traversal, every callback dispatch adds latency before a single character leaves the server. In production that overhead is invisible because network latency to OpenAI or Anthropic is 100–1000x larger. But strip out the network with a mock LLM and you can finally see what the framework itself costs you.
We built identical streaming RAG pipelines across SynapseKit 1.4, LangChain 1.2, and LlamaIndex Core 0.14. Same documents, same query, same mock LLM that yields the exact same token list with zero network latency. The result: all three clear the sub-millisecond bar comfortably. Nobody loses on the number. The interesting split is elsewhere - in the shape of the streaming API itself.
What We Measured
Task: Build a streaming RAG pipeline (BM25 retrieval + LLM stream). Feed the retrieved context into an LLM that streams tokens. Measure the latency from calling `.stream()` to receiving the first token.
| Metric | What it captures |
|---|---|
| Lines of code | Code to wire up a streaming RAG pipeline |
| TTFT (median) | Pure framework overhead with a zero-latency mock LLM |
| Streaming API surface | Sync vs async, generator vs callback, on-RAG vs on-LLM |
Why a mock LLM: real LLM APIs add 100–2000ms of network and provider latency. That swamps any framework difference. Strip it out and the framework overhead finally becomes visible - the part you can actually optimise.
Frameworks: SynapseKit 1.4, LangChain 1.2, LlamaIndex Core 0.14. Kaggle CPU. 50 reps per framework. Disclosure: I'm the author of SynapseKit. All code is on Kaggle - fork and run yourself.
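For concreteness, here is a minimal sketch of the measurement harness - a mock token generator plus a timing loop. The names (`mock_stream`, `measure_ttft`) are illustrative, not the notebook's actual code; in the real benchmark the mock sits behind each framework's LLM interface, here it stands alone to show the shape of the measurement:

```python
import time
import statistics

TOKENS = ["The", " answer", " is", " 42", "."]

def mock_stream():
    # Stands in for the provider call: same token list every run, zero network I/O.
    for token in TOKENS:
        yield token

def measure_ttft(stream_fn, reps=50):
    # Milliseconds from calling the stream to receiving the first token.
    samples = []
    for _ in range(reps):
        start = time.perf_counter()
        next(stream_fn())  # first token arrives here
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.median(samples), statistics.quantiles(samples, n=100)[98]

median_ms, p99_ms = measure_ttft(mock_stream)
print(f"median TTFT: {median_ms:.3f} ms, p99: {p99_ms:.3f} ms")
```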
The Code
SynapseKit - async generator on the RAG object itself:
```python
from synapsekit import RAG

rag = RAG(model="gpt-4o-mini", api_key=KEY, provider="openai")
await rag.add_documents(DOCS)

async for token in rag.stream(QUERY):
    print(token, end="", flush=True)
```
`rag.stream(QUERY)` is a single method call that streams the full RAG pipeline - retrieve, construct prompt, call LLM, yield tokens. No chain composition, no graph construction. Async-only.
LangChain - LCEL chain with .stream():
```python
from langchain_community.retrievers import BM25Retriever
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough

retriever = BM25Retriever.from_texts(DOCS, k=3)
prompt = ChatPromptTemplate.from_template("Context: {ctx}\n\nQ: {q}")
llm = ChatOpenAI(model="gpt-4o-mini", streaming=True)
chain = {"ctx": retriever, "q": RunnablePassthrough()} | prompt | llm

for chunk in chain.stream(QUERY):
    print(chunk.content, end="", flush=True)
```
LCEL composition makes every step explicit and swappable. More imports, more ceremony, but you can yank out the retriever or add a reranker without touching the stream call. Both sync (`.stream()`) and async (`.astream()`) are native.
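For the async path, the same chain is consumed with `.astream()`; a minimal sketch, reusing `chain` and `QUERY` from above:

```python
import asyncio

async def main():
    # Same LCEL chain, consumed asynchronously - the shape you want inside an
    # async web handler so the event loop is never blocked between tokens.
    async for chunk in chain.astream(QUERY):
        print(chunk.content, end="", flush=True)

asyncio.run(main())
```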
LlamaIndex - query_engine(streaming=True):
```python
from llama_index.core import Document, VectorStoreIndex, Settings
from llama_index.llms.openai import OpenAI

Settings.llm = OpenAI(model="gpt-4o-mini")
index = VectorStoreIndex.from_documents([Document(text=d) for d in DOCS])
engine = index.as_query_engine(streaming=True)

response = engine.query(QUERY)
for chunk in response.response_gen:
    print(chunk, end="", flush=True)
```
One flag flip (`streaming=True`) turns the query engine into a streaming generator. Clean surface. No native async stream on the query engine - you'd wrap it yourself or reach for the lower-level async APIs.
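If you do need async consumption, one generic workaround (plain asyncio, not a LlamaIndex API) is to pull the sync generator on a worker thread so it doesn't block the event loop - a sketch assuming the `response` object from above:

```python
import asyncio

async def astream_chunks(response_gen):
    # Fetch each chunk from the sync generator in the default thread-pool
    # executor so the event loop stays free between chunks.
    loop = asyncio.get_running_loop()
    done = object()  # sentinel marking the end of the generator
    while True:
        chunk = await loop.run_in_executor(None, lambda: next(response_gen, done))
        if chunk is done:
            break
        yield chunk

# Usage (inside an async function):
#   async for chunk in astream_chunks(response.response_gen):
#       print(chunk, end="", flush=True)
```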
The Numbers
With a mock LLM that yields the same token list at zero network latency, we ran 50 TTFT measurements per framework:
| Framework | Median TTFT | p99 TTFT | API shape |
|---|---|---|---|
| SynapseKit | 0.08 ms | 0.15 ms | async generator |
| LangChain | 0.12 ms | 0.21 ms | sync generator |
| LlamaIndex | 0.14 ms | 0.26 ms | sync generator |
All three land in the sub-millisecond zone. The framework overhead itself is effectively free. At this resolution the numbers are noise. If you're choosing a framework to optimise TTFT, you're optimising the wrong thing - put your effort into prompt caching, smaller context windows, provider selection, and serving infrastructure. That's where the real milliseconds live.
For reference, here's what actually dominates TTFT in production:
| Component | Typical latency |
|---|---|
| Framework overhead | < 1 ms |
| Embedding lookup | 5–20 ms |
| BM25 retrieval | 10–50 ms |
| Network to LLM provider | 80–200 ms |
| LLM first token | 150–600 ms |
| Total TTFT | 250 ms – 1 s |
The framework is a rounding error. A 0.08ms vs 0.14ms difference cannot be measured in production - it vanishes into jitter.
The API Surface Split
This is where the frameworks actually diverge. When you're writing real code, the shape of the streaming API matters more than its latency.
| Feature | SynapseKit | LangChain | LlamaIndex |
|---|---|---|---|
| Primary API | async generator | sync + async | sync generator |
| Sync support | No | Yes | Yes |
| Native async on RAG | Yes | Yes | No |
| Callback handlers | No | Yes | Yes (callback manager) |
| Stream on RAG object | Yes | Yes (LCEL) | Yes (flag) |
SynapseKit is async-only. There is no `.stream()` on a sync path. If your codebase runs in Flask, Django sync views, or a Jupyter notebook without an event loop, every call site needs `asyncio.run()` or you need to restructure around async. That's a migration, not a drop-in.
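When you do have to call it from sync code, the bridge looks roughly like this - a sketch assuming the `rag` object from the earlier snippet. Note that draining the async generator through `asyncio.run()` forfeits incremental streaming; the caller only sees the result at the end:

```python
import asyncio

def answer_blocking(rag, query):
    # Drain the async generator in a private event loop and return the joined
    # answer. Usable from Flask/Django sync views, but nothing reaches the
    # user until the whole stream has finished - the streaming UX is lost.
    async def collect():
        return [token async for token in rag.stream(query)]
    return "".join(asyncio.run(collect()))
```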
LangChain is the most flexible. `chain.stream()` for sync, `chain.astream()` for async, plus a callback handler ecosystem (`StreamingStdOutCallbackHandler`, `AsyncIteratorCallbackHandler`) for every framework integration you might need. If you're building a Streamlit app, a CLI tool, and an async FastAPI endpoint from the same chain, this is the path.
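A minimal sketch of the callback route - exact import paths shift between LangChain versions, so treat this as illustrative:

```python
from langchain_core.callbacks import StreamingStdOutCallbackHandler
from langchain_openai import ChatOpenAI

# Each new token is pushed into the handler's on_llm_new_token hook as it
# arrives, independently of whoever is iterating the response - handy for
# mirroring a stream to a UI while another consumer assembles the full text.
llm = ChatOpenAI(
    model="gpt-4o-mini",
    streaming=True,
    callbacks=[StreamingStdOutCallbackHandler()],
)
llm.invoke("Summarise TTFT in one sentence.")
```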
LlamaIndex sits in the middle. Native sync generators (`response.response_gen`) are easy to consume. The async story is weaker - the query engine doesn't expose a clean async stream by default. You reach for lower-level LLM APIs or wrap the sync generator in a thread, as in the sketch under the LlamaIndex code above.
What This Means for Engineers
- Stop optimising framework TTFT overhead. At sub-millisecond, it's below the noise floor of every real LLM deployment. The TTFT you see in your dashboard is 99%+ network and provider latency. Focus there.
- Match the streaming API to your runtime. If your app is async (FastAPI, async workers, LangGraph): SynapseKit and LangChain `.astream()` are both clean. If your app is sync (Flask, Django sync views, Jupyter, a CLI): LangChain `.stream()` or LlamaIndex's `response_gen` let you avoid restructuring. (See the FastAPI sketch after this list.)
- Use callbacks for UI binding, generators for pipelines. LangChain's callback handler pattern is the cleanest path for tying stream output into progress bars, partial rendering, and multi-consumer fan-out. For a one-consumer pipeline, a generator is simpler.
- Stream from the RAG object, not the LLM. All three frameworks can stream from the top-level RAG call (SynapseKit `rag.stream`, LangChain LCEL chain, LlamaIndex `query_engine(streaming=True)`). Don't roll your own retrieve + LLM stream loop - you'll reimplement the prompt construction wrong.
- Measure TTFT end-to-end, not in isolation. The real number includes retrieval time, prompt build, network round-trip, and the provider's own time-to-first-token. That's the number your users experience. Framework overhead disappears into it.
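To make the runtime-matching point concrete, here is roughly what the async path looks like at the HTTP edge - a hypothetical FastAPI endpoint that forwards tokens as server-sent events, assuming an async-generator pipeline like the `rag.stream()` shown earlier:

```python
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.get("/ask")
async def ask(q: str):
    async def events():
        # Forward each token the moment it arrives; the user's TTFT is set by
        # the first iteration of this loop, everything after is flow.
        async for token in rag.stream(q):  # rag: the async streaming pipeline from earlier
            yield f"data: {token}\n\n"
        yield "data: [DONE]\n\n"
    return StreamingResponse(events(), media_type="text/event-stream")
```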
The Corollary Most People Miss
TTFT is not the only perception metric. Inter-token latency - the jitter between the 2nd, 10th, and 100th tokens - matters almost as much. A stream that arrives in steady 15ms bursts feels smooth. A stream that arrives in a burst, stalls for 200ms, then bursts again feels broken. And inter-token latency is where framework buffering, callback dispatch, and LCEL graph traversal can actually start to matter at production volumes.
None of these frameworks add visible buffering on a mock LLM. But layer in a callback chain, a streaming response wrapper, and a server-sent-events encoder on top, and you can build a pipeline that adds 10–20ms of buffering per token. That's the part you have to profile yourself - and the part no benchmark in this series will catch for you.
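Profiling it is cheap: time the gap between consecutive tokens, not just the first. A framework-agnostic sketch that works on any sync token iterator (swap in `chain.stream(QUERY)` or `response.response_gen` from earlier):

```python
import time
import statistics

def inter_token_gaps_ms(stream):
    # Gap in milliseconds between consecutive tokens; stalls show up as outliers.
    gaps, last = [], None
    for _ in stream:
        now = time.perf_counter()
        if last is not None:
            gaps.append((now - last) * 1000)
        last = now
    return gaps

gaps = inter_token_gaps_ms(chain.stream(QUERY))  # any token iterator works here
print(f"median gap: {statistics.median(gaps):.1f} ms, worst: {max(gaps):.1f} ms")
```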
| Perception metric | What the user feels |
|---|---|
| TTFT | Did anything happen? |
| Inter-token latency | Is it flowing or stalling? |
| Total time | Was it fast enough to use? |
You optimise all three in different ways. Framework choice affects the first two slightly and the third not at all.
Three Things Worth Doing This Week
- Instrument TTFT in your production RAG. Log the three numbers that matter: retrieval latency, prompt-build latency, and time-to-first-token from the LLM. If any one is above 300ms, that's where the work is - not in the framework. (A small instrumentation sketch follows this list.)
- Switch from `.stream()` to `.astream()` if you're on an async stack. Sync `.stream()` inside an async handler blocks the event loop. Most teams accidentally run sync streams in async contexts because it was easier to paste the tutorial code.
- Read the Kaggle notebook. Full reproducible code, mock LLM implementations for each framework, and 50-run TTFT distributions: LLM Showdown #12 - Streaming RAG TTFT
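For the first item, the instrumentation is three timestamps around stages you already have - `retrieve`, `build_prompt`, and `llm_stream` below are hypothetical stand-ins for whatever your pipeline calls them:

```python
import time

def answer_with_timings(query):
    t0 = time.perf_counter()
    docs = retrieve(query)              # your retriever
    t1 = time.perf_counter()
    prompt = build_prompt(query, docs)  # your prompt construction
    t2 = time.perf_counter()
    stream = llm_stream(prompt)         # your streaming LLM call
    first_token = next(stream)          # blocks until the provider's first token
    t3 = time.perf_counter()
    print(f"retrieval={(t1 - t0) * 1000:.0f}ms "
          f"prompt_build={(t2 - t1) * 1000:.0f}ms "
          f"llm_ttft={(t3 - t2) * 1000:.0f}ms")
    yield first_token
    yield from stream
```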
Streaming is the default UX for every modern LLM product. The frameworks all do it. None of them are meaningfully slower than the others. The real question is whether your stream fits your runtime - async or sync, generator or callback, on the RAG or on the LLM. Pick the shape that matches your code, not the one with the lowest microsecond count on a benchmark that doesn't include the network.
Engineers of AI
Read more: www.engineersofai.com
If this was useful, forward it to one engineer who should be reading it.
