

Streaming Responses

The Blinking Cursor Problem

When Preethi's team first deployed their LLM-powered writing assistant, they made a design decision that seemed obvious at the time: wait for the complete response before showing it to the user. The logic was clean - no partial states to manage, no complex frontend code, no risk of showing an incomplete thought.

The user feedback came back brutal. "It feels like the app is frozen." "I thought it crashed." "The three-second wait is unbearable." The problem was not the quality of the responses - users actually loved the output. The problem was the silence. Three seconds of nothing, then suddenly a 400-word response appeared. The experience felt wrong because it violated a fundamental human expectation: when you send something, something should happen.

They switched to streaming. Within two weeks, user session length increased 34%. Not because responses were faster - they were not. The time to complete response was identical. But time-to-first-token dropped from 3 seconds to 0.3 seconds, and users could read the beginning of the response while the model was still generating the end. Perceived performance is often more important than actual performance. Users judge "fast" by when something appears, not by when it finishes.

Streaming is not just a nice-to-have for LLM applications. It is the foundational UX pattern that makes LLM-powered products feel responsive and alive. Any production LLM application that blocks on the full response before showing anything is leaving significant user experience on the table.

How LLM Streaming Works

Language models generate tokens one at a time through a forward pass and sampling step. Without streaming, the server accumulates all tokens into a complete response before sending anything to the client. With streaming, each token (or small group of tokens) is sent to the client as soon as it is generated.

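The transport mechanism is typically Server-Sent Events (SSE) - a text-based event stream format, defined in the HTML standard, in which the server pushes data to the client over a single long-lived HTTP response. Each event is one or more lines prefixed with data: and is terminated by a blank line. The [DONE] marker at the end is an application-level convention, not part of SSE itself. The format is deliberately simple: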

data: Hello

data: , world

data: !

data: [DONE]

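SSE is preferred over WebSockets for LLM streaming because it is unidirectional (server to client), works over standard HTTP, and does not require a protocol upgrade. Clients built on the browser's EventSource API also reconnect automatically after a drop; a fetch-based reader, which is what you need for POST requests, has to handle reconnection itself. WebSockets add bidirectional complexity that is unnecessary when you only need to stream in one direction.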
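The framing rules above can be sketched in a few lines. This is a minimal illustration, not the format any particular provider uses: the helper names format_sse and parse_sse are hypothetical, and the choice to JSON-encode each payload is an assumption made here so that text containing newlines cannot split an event.

```python
import json


def format_sse(payload: dict) -> str:
    """Serialize one event: a single data: line followed by a blank line."""
    # json.dumps escapes newlines inside strings, so the serialized payload
    # never contains a raw "\n" that would end the event early.
    return f"data: {json.dumps(payload)}\n\n"


def parse_sse(stream_text: str) -> list[dict]:
    """Parse a complete SSE body produced by format_sse back into payloads."""
    events = []
    for block in stream_text.split("\n\n"):      # blank line separates events
        for line in block.split("\n"):
            if line.startswith("data: "):
                events.append(json.loads(line[len("data: "):]))
    return events


wire = format_sse({"text": "Hello"}) + format_sse({"text": ",\nworld"})
print(parse_sse(wire))
```

Sending raw text after data: is simpler to read on the wire, but a chunk that contains a newline would then be split across events, which is why this sketch encodes the payload.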

Implementation: Basic Streaming with the Anthropic SDK

The Anthropic SDK provides a stream context manager that handles all SSE parsing internally. You get a clean iterator over text chunks (synchronous or async, depending on which client you use):

import asyncio
import json
from typing import AsyncGenerator, Iterator

import anthropic
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

# Sync client for simple use cases
sync_client = anthropic.Anthropic()

# Async client for FastAPI / high-concurrency production use
async_client = anthropic.AsyncAnthropic()

app = FastAPI()


def stream_claude_sync(
    messages: list[dict],
    system_prompt: str = "",
    model: str = "claude-opus-4-6",
    max_tokens: int = 1024,
) -> Iterator[str]:
    """
    Synchronous streaming generator for simple use cases.
    Use in scripts or synchronous frameworks.
    Yields SSE-formatted strings.
    """
    with sync_client.messages.stream(
        model=model,
        max_tokens=max_tokens,
        system=system_prompt,
        messages=messages,
    ) as stream:
        for text_chunk in stream.text_stream:
            # SSE format: data: <content>\n\n
            # Double newline signals end of this event.
            # Caveat: a chunk that itself contains "\n" breaks this framing;
            # JSON-encode the payload if newlines must survive the trip.
            yield f"data: {text_chunk}\n\n"

    # Signal completion
    yield "data: [DONE]\n\n"


async def stream_claude_async(
    messages: list[dict],
    system_prompt: str = "",
    model: str = "claude-opus-4-6",
    max_tokens: int = 1024,
    include_usage: bool = True,
) -> AsyncGenerator[str, None]:
    """
    Async streaming generator - use this in FastAPI for high concurrency.
    Supports thousands of concurrent streaming connections.

    Args:
        messages: Conversation messages
        system_prompt: System prompt
        model: Claude model to use
        max_tokens: Maximum tokens to generate
        include_usage: Whether to include usage stats in the final event

    Yields:
        SSE-formatted strings
    """
    async with async_client.messages.stream(
        model=model,
        max_tokens=max_tokens,
        system=system_prompt,
        messages=messages,
    ) as stream:
        async for text in stream.text_stream:
            yield f"data: {text}\n\n"
            # Yield control to event loop between tokens
            # Prevents a single stream from starving other coroutines
            await asyncio.sleep(0)

        if include_usage:
            # Include usage stats in the final event for cost tracking
            final_message = await stream.get_final_message()
            usage = final_message.usage
            stats = json.dumps({
                "done": True,
                "input_tokens": usage.input_tokens,
                "output_tokens": usage.output_tokens,
                "stop_reason": final_message.stop_reason,
            })
            yield f"data: {stats}\n\n"

    yield "data: [DONE]\n\n"


@app.post("/chat/stream")
async def chat_stream(request: dict):
    """
    FastAPI endpoint that streams LLM responses via SSE.

    Critical headers:
    - Cache-Control: no-cache - prevents proxy caching of SSE events
    - X-Accel-Buffering: no - disables nginx proxy buffering
    - Connection: keep-alive - keeps the connection open for SSE
    """
    messages = request.get("messages", [])
    system = request.get("system", "You are a helpful assistant.")
    model = request.get("model", "claude-opus-4-6")

    return StreamingResponse(
        stream_claude_async(messages, system, model),
        media_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "Connection": "keep-alive",
            "X-Accel-Buffering": "no",  # Critical: disable nginx buffering
            "Access-Control-Allow-Origin": "*",
        },
    )

Implementation: Client-Side Streaming

JavaScript Fetch + ReadableStream

// Production-grade streaming client in TypeScript
interface StreamingCallbacks {
onChunk: (text: string) => void;
onComplete: (fullText: string, usage?: UsageStats) => void;
onError: (error: Error) => void;
}

interface UsageStats {
input_tokens: number;
output_tokens: number;
stop_reason: string;
}

async function streamChatResponse(
messages: Array<{ role: string; content: string }>,
systemPrompt: string,
callbacks: StreamingCallbacks,
signal?: AbortSignal
): Promise<void> {
const response = await fetch("/api/chat/stream", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({ messages, system: systemPrompt }),
signal, // AbortController signal for cancellation
});

if (!response.ok) {
throw new Error(`HTTP ${response.status}: ${response.statusText}`);
}

const reader = response.body?.getReader();
const decoder = new TextDecoder();
let fullText = "";
let buffer = "";

if (!reader) throw new Error("Response body is null");

try {
while (true) {
const { done, value } = await reader.read();
if (done) break;

// Decode chunk and add to buffer
// stream: true maintains state across chunks for multi-byte characters
buffer += decoder.decode(value, { stream: true });

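// Process complete lines. Events are separated by a blank line ("\n\n"),
// so splitting on "\n" yields each "data: ..." line plus empty separator lines.
const lines = buffer.split("\n");
buffer = lines.pop() || ""; // Incomplete last line stays in buffer

for (const line of lines) {
if (!line.startsWith("data: ")) continue;

// Do not trim: tokens often begin with a space (" world"), and trimming
// would glue words together. Only drop a trailing "\r" from CRLF streams.
const data = line.slice(6).replace(/\r$/, "");
if (!data) continue;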

// [DONE] sentinel - stream complete
if (data === "[DONE]") {
callbacks.onComplete(fullText);
return;
}

// JSON events (usage stats, done signal)
if (data.startsWith("{")) {
try {
const parsed = JSON.parse(data);
if (parsed.done) {
callbacks.onComplete(fullText, parsed as UsageStats);
return;
}
} catch {
// Not valid JSON - treat as text
}
continue;
}

// Plain text token
fullText += data;
callbacks.onChunk(data);
}
}
} catch (error) {
if (error instanceof Error && error.name === "AbortError") {
return; // Normal cancellation
}
callbacks.onError(error as Error);
} finally {
reader.releaseLock();
}
}


// React hook for production streaming
import { useState, useCallback, useRef } from "react";

interface StreamingState {
response: string;
isStreaming: boolean;
error: Error | null;
usage: UsageStats | null;
}

function useStreamingChat() {
const [state, setState] = useState<StreamingState>({
response: "",
isStreaming: false,
error: null,
usage: null,
});

const abortControllerRef = useRef<AbortController | null>(null);
const startTimeRef = useRef<number>(0);

const sendMessage = useCallback(
async (
messages: Array<{ role: string; content: string }>,
systemPrompt: string = ""
) => {
// Cancel any existing stream
abortControllerRef.current?.abort();
abortControllerRef.current = new AbortController();

startTimeRef.current = performance.now();

setState({ response: "", isStreaming: true, error: null, usage: null });

try {
await streamChatResponse(
messages,
systemPrompt,
{
onChunk: (text) => {
setState((prev) => ({
...prev,
response: prev.response + text,
}));
},
onComplete: (fullText, usage) => {
const totalTime = performance.now() - startTimeRef.current;
console.log(
`Stream complete: ${usage?.output_tokens} tokens in ${totalTime.toFixed(0)}ms`
);
setState({ response: fullText, isStreaming: false, error: null, usage: usage || null });
},
onError: (err) => {
setState((prev) => ({ ...prev, isStreaming: false, error: err }));
},
},
abortControllerRef.current.signal
);
} catch (err) {
if (err instanceof Error && err.name !== "AbortError") {
setState((prev) => ({ ...prev, isStreaming: false, error: err as Error }));
} else {
setState((prev) => ({ ...prev, isStreaming: false }));
}
}
},
[]
);

const cancel = useCallback(() => {
abortControllerRef.current?.abort();
setState((prev) => ({ ...prev, isStreaming: false }));
}, []);

return { ...state, sendMessage, cancel };
}

Production Streaming Patterns

Pattern 1: Buffered Chunking for Smooth Rendering

Raw token-by-token streaming can create a choppy "letter-by-letter" effect in some UIs. Buffer tokens and flush at word or sentence boundaries for a more natural reading experience:

from typing import AsyncGenerator

import anthropic

async_client = anthropic.AsyncAnthropic()

WORD_BREAKS = " .,!?;:\n"
SENTENCE_ENDS = [". ", "! ", "? ", ".\n", "!\n", "?\n"]


async def buffered_word_stream(
    messages: list[dict],
    system: str = "",
    buffer_mode: str = "word",  # "token", "word", "sentence"
    model: str = "claude-opus-4-6",
) -> AsyncGenerator[str, None]:
    """
    Buffer tokens and yield at natural boundaries.

    Modes:
    - "token": yield immediately (fastest first token, most granular)
    - "word": buffer until space or punctuation (smooth word-by-word)
    - "sentence": buffer until sentence end (chunkier but natural prose flow)

    Tradeoff: more buffering = smoother rendering but slightly higher perceived latency.
    For most chat UIs, "word" mode is the right balance.

    Caveat: chunks are sent as raw text, so a chunk containing "\n" breaks
    the SSE framing; JSON-encode the payload if newlines must be preserved.
    """
    buffer = ""

    async with async_client.messages.stream(
        model=model,
        max_tokens=1024,
        system=system,
        messages=messages,
    ) as stream:
        async for text in stream.text_stream:
            if buffer_mode == "token":
                yield f"data: {text}\n\n"
                continue

            buffer += text

            if buffer_mode == "word":
                # Flush every complete word: cut just after each break character
                while True:
                    break_pos = next(
                        (i + 1 for i, ch in enumerate(buffer) if ch in WORD_BREAKS),
                        -1,
                    )
                    if break_pos == -1:
                        break  # No break found - keep buffering
                    chunk, buffer = buffer[:break_pos], buffer[break_pos:]
                    yield f"data: {chunk}\n\n"

            elif buffer_mode == "sentence":
                # Flush every complete sentence, earliest boundary first
                while True:
                    ends = [
                        buffer.index(p) + len(p)
                        for p in SENTENCE_ENDS
                        if p in buffer
                    ]
                    if not ends:
                        break  # No sentence end yet - keep buffering
                    idx = min(ends)
                    chunk, buffer = buffer[:idx], buffer[idx:]
                    yield f"data: {chunk}\n\n"

    # Flush any remaining buffer
    if buffer:
        yield f"data: {buffer}\n\n"

    yield "data: [DONE]\n\n"
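The boundary logic can be exercised without an API call. The sketch below is a simplified variant, not the function above: the name flush_at_word_boundaries is hypothetical, it takes any iterable of text chunks, and it flushes everything up to the last break character in one piece instead of emitting each word separately.

```python
from typing import Iterable, Iterator

BREAK_CHARS = " .,!?;:\n"


def flush_at_word_boundaries(chunks: Iterable[str]) -> Iterator[str]:
    """Re-chunk a token stream so every piece ends on a word boundary."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        # Index just past the last break character currently in the buffer
        cut = max(buffer.rfind(ch) for ch in BREAK_CHARS) + 1
        if cut > 0:
            yield buffer[:cut]
            buffer = buffer[cut:]
    if buffer:
        yield buffer  # trailing partial word at end of stream


pieces = list(flush_at_word_boundaries(["Hel", "lo wor", "ld, str", "eam"]))
print(pieces)  # ['Hello ', 'world, ', 'stream']
```

Joining the pieces reproduces the original text exactly, which is a useful property to check in a unit test for any buffering layer.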

Pattern 2: Streaming with Tool Use

When the model uses tools, the stream pauses while tools execute. Handle this gracefully - signal the tool call to the client so users see progress indicators instead of a frozen cursor:

import json
from typing import Any, Callable, Iterator

import anthropic

client = anthropic.Anthropic()


def stream_with_tool_use(
    messages: list[dict],
    tools: list[dict],
    tool_executor: Callable[[str, dict], Any],
    system: str = "",
    model: str = "claude-opus-4-6",
) -> Iterator[str]:
    """
    Stream responses that include tool use.

    Handles the pause-execute-resume cycle:
    1. Stream text tokens normally
    2. When tool_use block starts: notify client ("searching...")
    3. Execute tool synchronously
    4. Resume stream with tool results incorporated

    Yields SSE events including:
    - Text chunks: data: <text>
    - Tool call start: data: {"event": "tool_start", "name": "..."}
    - Tool call end: data: {"event": "tool_end", "name": "..."}
    - Done: data: [DONE]
    """
    conversation = messages.copy()
    max_tool_iterations = 10  # Prevent infinite tool loops

    for _ in range(max_tool_iterations):
        with client.messages.stream(
            model=model,
            max_tokens=2048,
            system=system,
            tools=tools,
            messages=conversation,
        ) as stream:
            for event in stream:
                event_type = getattr(event, "type", "")

                if event_type == "content_block_start":
                    block = event.content_block
                    if getattr(block, "type", "") == "tool_use":
                        # Notify client that a tool is being called
                        tool_event = json.dumps({
                            "event": "tool_start",
                            "name": block.name,
                            "id": block.id,
                        })
                        yield f"data: {tool_event}\n\n"

                elif event_type == "content_block_delta":
                    delta = event.delta
                    if getattr(delta, "type", "") == "text_delta":
                        yield f"data: {delta.text}\n\n"
                    # input_json_delta events carry partial tool-input JSON.
                    # They are not forwarded to the user; the complete input
                    # is read from the final message below.

            # Get final message to check stop reason
            final_message = stream.get_final_message()

        if final_message.stop_reason == "end_turn":
            yield "data: [DONE]\n\n"
            return

        elif final_message.stop_reason == "tool_use":
            # Execute all requested tools
            tool_uses = [
                block for block in final_message.content
                if block.type == "tool_use"
            ]

            tool_results = []
            for tool_use in tool_uses:
                # Execute the tool (synchronous - add async handling for real tools)
                try:
                    result = tool_executor(tool_use.name, tool_use.input)
                    result_content = str(result)
                    is_error = False
                except Exception as e:
                    result_content = f"Tool execution error: {e}"
                    is_error = True

                # Notify client tool finished
                tool_done_event = json.dumps({
                    "event": "tool_end",
                    "name": tool_use.name,
                    "success": not is_error,
                })
                yield f"data: {tool_done_event}\n\n"

                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": tool_use.id,
                    "content": result_content,
                    "is_error": is_error,
                })

            # Add assistant message and tool results to conversation
            conversation.append({
                "role": "assistant",
                "content": final_message.content,
            })
            conversation.append({
                "role": "user",
                "content": tool_results,
            })
            # Loop continues - model will respond to tool results

        else:
            # Unexpected stop reason (for example max_tokens)
            yield "data: [DONE]\n\n"
            return

    # Safety: if we exhaust max iterations
    error_event = json.dumps({"error": "max_tool_iterations_exceeded"})
    yield f"data: {error_event}\n\n"
    yield "data: [DONE]\n\n"

Pattern 3: Streaming with Backpressure

When the client is slow (mobile, high latency), naive streaming buffers too many tokens in memory on the server. Use a bounded queue to apply backpressure - stop pulling tokens from the provider stream when the client cannot keep up:

import asyncio
import json
from asyncio import Queue
from typing import AsyncGenerator

import anthropic

async_client = anthropic.AsyncAnthropic()


async def streaming_with_backpressure(
    messages: list[dict],
    system: str = "",
    model: str = "claude-opus-4-6",
    queue_size: int = 100,  # Max tokens to buffer
    max_tokens: int = 1024,
) -> AsyncGenerator[str, None]:
    """
    Producer-consumer streaming with backpressure control.

    When the client reads slowly, the queue fills up and the producer
    (which reads the LLM stream) pauses - preventing memory exhaustion
    on the server.

    queue_size: smaller = less memory, more backpressure
                larger = more buffering, smoother delivery

    For most use cases, 50-100 tokens is a good queue size.
    """
    # Items are text chunks, an Exception (error), or None (completion sentinel)
    token_queue: Queue[str | Exception | None] = Queue(maxsize=queue_size)

    async def producer() -> None:
        """Generate tokens from LLM and push to queue."""
        try:
            async with async_client.messages.stream(
                model=model,
                max_tokens=max_tokens,
                system=system,
                messages=messages,
            ) as stream:
                async for text in stream.text_stream:
                    # This blocks when queue is full - natural backpressure
                    await token_queue.put(text)
            await token_queue.put(None)  # Sentinel: stream complete
        except asyncio.CancelledError:
            # Consumer went away. Do not put anything: a put() on a full
            # queue here would block forever and hang the cleanup below.
            raise
        except Exception as e:
            await token_queue.put(e)

    # Start producer in background
    producer_task = asyncio.create_task(producer())

    try:
        while True:
            token = await token_queue.get()

            if token is None:
                # Sentinel: stream complete
                yield "data: [DONE]\n\n"
                break

            if isinstance(token, Exception):
                # json.dumps escapes quotes and newlines in the message
                error_event = json.dumps({"error": str(token)})
                yield f"data: {error_event}\n\n"
                yield "data: [DONE]\n\n"
                break

            yield f"data: {token}\n\n"
    finally:
        # Ensure producer is cleaned up even if client disconnects
        producer_task.cancel()
        try:
            await producer_task
        except asyncio.CancelledError:
            pass
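To see the bounded queue doing its job without calling an API, the sketch below swaps the LLM stream for a hypothetical fake_tokens generator and simulates a slow client with a short sleep. The names run_demo and max_depth are illustrative; the point is only that the queue depth never exceeds its configured size.

```python
import asyncio
from typing import AsyncIterator


async def fake_tokens(n: int) -> AsyncIterator[str]:
    """Stand-in for an LLM text stream (assumption: produces instantly)."""
    for i in range(n):
        yield f"tok{i} "


async def run_demo(queue_size: int = 3, n_tokens: int = 10) -> tuple[list[str], int]:
    queue: asyncio.Queue = asyncio.Queue(maxsize=queue_size)
    max_depth = 0

    async def producer() -> None:
        nonlocal max_depth
        async for tok in fake_tokens(n_tokens):
            await queue.put(tok)  # suspends while the queue is full
            max_depth = max(max_depth, queue.qsize())
        await queue.put(None)     # completion sentinel

    task = asyncio.create_task(producer())
    received = []
    while (tok := await queue.get()) is not None:
        await asyncio.sleep(0.001)  # simulate a slow client
        received.append(tok)
    await task
    return received, max_depth


received, max_depth = asyncio.run(run_demo())
print(len(received), max_depth)
```

Even though the fake producer could emit all ten tokens immediately, it is held back to at most three queued items at a time, which is the memory bound the pattern is meant to provide.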

Pattern 4: Streaming with Latency Metrics

Measure and monitor streaming performance in production. TTFT (Time to First Token) is the metric that most directly correlates with user perception of responsiveness:

import time
from dataclasses import dataclass
from typing import Optional

import anthropic

client = anthropic.Anthropic()


@dataclass
class StreamingMetrics:
    """Metrics captured from a single streaming response."""
    model: str
    ttft_ms: Optional[float]       # Time to first token
    total_time_ms: float           # Total generation time
    input_tokens: int
    output_tokens: int
    tokens_per_second: float
    inter_token_latency_ms: float  # Average time between tokens


def measure_streaming_performance(
    messages: list[dict],
    system: str = "",
    model: str = "claude-opus-4-6",
    max_tokens: int = 500,
) -> StreamingMetrics:
    """
    Measure streaming performance metrics.
    Use this for:
    - Model comparison (Haiku vs Sonnet vs Opus TTFT)
    - Regional routing optimization
    - SLA monitoring
    - Cache effectiveness measurement
    """
    start_time = time.perf_counter()
    first_token_time: Optional[float] = None

    with client.messages.stream(
        model=model,
        max_tokens=max_tokens,
        system=system,
        messages=messages,
    ) as stream:
        for _text in stream.text_stream:
            # A text chunk may hold more than one token, so token counts
            # come from the usage block below, not from counting chunks.
            if first_token_time is None:
                first_token_time = time.perf_counter()

        final_message = stream.get_final_message()
    end_time = time.perf_counter()

    total_time = (end_time - start_time) * 1000
    ttft = (first_token_time - start_time) * 1000 if first_token_time else None
    output_tokens = final_message.usage.output_tokens
    generation_time_s = end_time - (first_token_time or start_time)
    tps = output_tokens / max(generation_time_s, 0.001)

    return StreamingMetrics(
        model=model,
        ttft_ms=ttft,
        total_time_ms=total_time,
        input_tokens=final_message.usage.input_tokens,
        output_tokens=output_tokens,
        tokens_per_second=tps,
        inter_token_latency_ms=(total_time - (ttft or 0)) / max(output_tokens - 1, 1),
    )


# TTFT Optimization Lookup Table
# Impact figures are rough, illustrative estimates, not benchmarks;
# measure against your own workload before relying on them.
TTFT_OPTIMIZATION_GUIDE = {
    "prompt_caching": {
        "impact": "60-80% TTFT reduction",
        "how": "Cache system prompt and static documents with cache_control",
        "when": "Static prefix meets the model's minimum cacheable length "
                "(1024 tokens or more, depending on the model) and is the same across requests",
    },
    "model_selection": {
        "impact": "2-4x TTFT variation",
        "how": "claude-haiku: ~300ms | claude-sonnet: ~500ms | claude-opus: ~1000ms",
        "when": "Quality requirements allow using a smaller model",
    },
    "input_minimization": {
        "impact": "10-30% TTFT reduction",
        "how": "Shorter context = faster first token processing",
        "when": "Conversation history is growing large",
    },
    "regional_routing": {
        "impact": "50-200ms based on geography",
        "how": "Route to nearest API endpoint",
        "when": "Global user base with significant network latency",
    },
    "connection_warmup": {
        "impact": "100-300ms first-call overhead",
        "how": "Keep HTTP connections alive (connection pooling)",
        "when": "High-frequency low-latency applications",
    },
}

Streaming with Prompt Caching

Prompt caching dramatically reduces TTFT for requests that share a static prefix - the most impactful optimization for most streaming applications:

import json
import time
from typing import AsyncGenerator

import anthropic

async_client = anthropic.AsyncAnthropic()


async def stream_with_prompt_cache(
    static_context: str,           # Long static document, product catalog, etc.
    dynamic_messages: list[dict],  # User-specific, changes per request
    system_instruction: str = "You are a helpful assistant.",
    model: str = "claude-opus-4-6",
    max_tokens: int = 1000,
) -> AsyncGenerator[str, None]:
    """
    Stream with Anthropic prompt caching enabled.

    First call: Full input processing cost, content cached for 5 minutes.
    Subsequent calls within 5min: 90% cheaper for cached portion, faster TTFT.

    Cost breakdown, relative to the model's base input price
    (example figures for a base price of $15/M tokens; check current
    pricing for the model you actually use):
    - Standard input: 1.00x  ($15/M tokens)
    - Cache creation: 1.25x  ($18.75/M tokens, 25% surcharge on first write)
    - Cache hit reads: 0.10x ($1.50/M tokens, 90% savings)

    TTFT improvement: often substantial for long prefixes, because the
    cached prefix does not have to be reprocessed. Measure it for your workload.
    """
    start_time = time.perf_counter()

    async with async_client.messages.stream(
        model=model,
        max_tokens=max_tokens,
        system=[
            {
                "type": "text",
                "text": system_instruction,
            },
            {
                "type": "text",
                "text": static_context,
                "cache_control": {"type": "ephemeral"},  # Cache this block
            },
        ],
        messages=dynamic_messages,
    ) as stream:
        first_token = True
        async for text in stream.text_stream:
            if first_token:
                ttft = (time.perf_counter() - start_time) * 1000
                print(f"TTFT: {ttft:.0f}ms")
                first_token = False

            yield f"data: {text}\n\n"

        # Get final message for cache metrics
        final = await stream.get_final_message()
        usage = final.usage
        # The cache fields can be missing or None; treat both as zero
        cache_read = getattr(usage, "cache_read_input_tokens", 0) or 0
        cache_creation = getattr(usage, "cache_creation_input_tokens", 0) or 0
        cache_hit = cache_read > 0

        stats = json.dumps({
            "done": True,
            "input_tokens": usage.input_tokens,
            "output_tokens": usage.output_tokens,
            "cache_hit": cache_hit,
            "cache_read_tokens": cache_read,
            "cache_creation_tokens": cache_creation,
        })
        yield f"data: {stats}\n\n"

    yield "data: [DONE]\n\n"
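The cost side of caching is easy to check with arithmetic. The sketch below uses the 25% write surcharge and 90% read discount described above; the function name cached_prefix_cost, the $10/M base price, and the assumption that every request arrives before the cache expires are all placeholders for illustration.

```python
def cached_prefix_cost(
    prefix_tokens: int,
    requests: int,
    base_price_per_mtok: float,
    write_multiplier: float = 1.25,  # 25% surcharge on the first write
    read_multiplier: float = 0.10,   # 90% discount on cache hits
) -> tuple[float, float]:
    """Return (uncached_cost, cached_cost) in dollars for the static prefix only."""
    per_request = prefix_tokens / 1_000_000 * base_price_per_mtok
    uncached = per_request * requests
    # First request writes the cache; the remaining ones read from it
    cached = per_request * (write_multiplier + read_multiplier * (requests - 1))
    return uncached, cached


# Hypothetical workload: 10k-token prefix, 20 requests, $10/M base input price
uncached, cached = cached_prefix_cost(10_000, 20, base_price_per_mtok=10.0)
print(f"uncached=${uncached:.3f} cached=${cached:.3f}")
```

With these multipliers a single request costs more with caching (1.25x versus 1x), and caching breaks even from the second request onward (1.35x versus 2x).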

Architecture: Production Streaming System

Handling Client Disconnects

One of the most common streaming bugs: the server continues generating tokens after the client disconnects, wasting compute and burning tokens:

import asyncio
import json

import anthropic
from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse

async_client = anthropic.AsyncAnthropic()
app = FastAPI()


@app.post("/chat/stream")
async def chat_stream(request: Request, body: dict):
    """
    Streaming endpoint with proper disconnect handling.
    Stops generation when the client disconnects.
    """
    messages = body.get("messages", [])
    system = body.get("system", "")

    async def generate_with_disconnect_detection():
        try:
            async with async_client.messages.stream(
                model="claude-opus-4-6",
                max_tokens=1024,
                system=system,
                messages=messages,
            ) as stream:
                async for text in stream.text_stream:
                    # Check if client disconnected before sending each chunk
                    if await request.is_disconnected():
                        print("Client disconnected - stopping generation")
                        return

                    yield f"data: {text}\n\n"
                    # Yield control to allow disconnect detection
                    await asyncio.sleep(0)

            yield "data: [DONE]\n\n"

        except asyncio.CancelledError:
            # Normal: happens when client disconnects and the task is cancelled.
            # Re-raise so the cancellation is not silently swallowed.
            print("Stream task cancelled (client disconnected)")
            raise
        except Exception as e:
            error_event = json.dumps({"error": str(e), "type": type(e).__name__})
            yield f"data: {error_event}\n\n"
            yield "data: [DONE]\n\n"

    return StreamingResponse(
        generate_with_disconnect_detection(),
        media_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "Connection": "keep-alive",
            "X-Accel-Buffering": "no",
        },
    )

Common Pitfalls

:::danger Nginx/Proxy Buffering Destroys Streaming By default, nginx and many other reverse proxies buffer upstream responses before forwarding them to the client. This can completely destroy the streaming effect - the user sees the same "wait then show" behavior as non-streaming. Fix: add X-Accel-Buffering: no to your response headers, or set proxy_buffering off; in the nginx config for the streaming location. This is the most common cause of "why isn't my streaming working?" reports. :::

:::danger Not Handling Client Disconnects If a user closes their browser tab mid-stream, the HTTP connection closes. Without proper disconnect handling, your server continues generating tokens for nobody - burning API quota and compute. Always check request.is_disconnected() or wrap in try/except asyncio.CancelledError. :::

:::warning Blocking the Event Loop in Async Streaming Using synchronous operations inside an async streaming generator blocks all other concurrent streams. time.sleep() in an async function is the most common offender - it blocks the entire event loop. Use await asyncio.sleep(0) to yield control between tokens. Never make synchronous DB calls or file I/O inside async generators. :::

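:::warning Missing Streaming Timeout LLM generation for long responses can take minutes. Without a timeout, a stuck connection holds a server slot forever. Set timeout=httpx.Timeout(30.0, connect=5.0) on your async client to bound connection and read stalls. For an overall deadline, note that asyncio.wait_for() cannot wrap an async generator directly: apply it to each __anext__() call, or run the consuming loop inside async with asyncio.timeout(300) on Python 3.11+. :::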
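One way to apply a timeout to a stream is to bound the gap between consecutive chunks. The sketch below is a generic wrapper, assuming nothing about the SDK: with_idle_timeout and slow_source are hypothetical names, and the source is a fake generator that stalls after its first item.

```python
import asyncio
from typing import AsyncIterator


async def with_idle_timeout(
    source: AsyncIterator[str], idle_seconds: float
) -> AsyncIterator[str]:
    """Re-yield items, raising a timeout if the gap between items is too long."""
    iterator = source.__aiter__()
    while True:
        try:
            # wait_for works on a single awaitable, so wrap each __anext__()
            item = await asyncio.wait_for(iterator.__anext__(), timeout=idle_seconds)
        except StopAsyncIteration:
            return
        yield item


async def slow_source() -> AsyncIterator[str]:
    yield "a"
    await asyncio.sleep(0.5)  # simulated stall, longer than the idle limit
    yield "b"


async def demo() -> list[str]:
    got = []
    try:
        async for item in with_idle_timeout(slow_source(), idle_seconds=0.05):
            got.append(item)
    except asyncio.TimeoutError:
        got.append("<timeout>")
    return got


result = asyncio.run(demo())
print(result)  # ['a', '<timeout>']
```

An idle timeout suits streaming better than a single overall deadline in many cases, since a long but healthy response keeps resetting the clock while a stalled one is cut off quickly.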

:::tip Progressive Enhancement Design your UI to work without streaming (show a loading indicator, then display the complete response) and enhance with streaming when available. This ensures graceful degradation if the streaming connection fails or the client environment doesn't support SSE. Progressive enhancement means your feature works for everyone and feels magical for users with good connections. :::

Streaming vs Non-Streaming: When to Use Each

| Scenario | Recommendation | Reason |
| --- | --- | --- |
| User-facing chat UI | Always stream | UX: perceived performance, user reads while generating |
| Background summarization | Non-streaming | No user waiting; simpler code |
| Structured JSON extraction | Non-streaming | JSON must be complete to parse |
| Agent tool use loops | Stream text, buffer tool calls | User sees progress; tool calls need full input |
| Batch processing | Non-streaming | Throughput > latency |
| API-to-API calls (no user) | Non-streaming | Simplicity; no UX benefit |
| Mobile with poor connection | Stream + backpressure | Reduces buffer memory requirements |
| Markdown rendering | Stream with word buffering | Prevents partial markdown render artifacts |

Interview Q&A

Q1: Why does streaming improve perceived performance even when total response time is identical?

Streaming improves perceived performance through two distinct mechanisms. First, TTFT (Time to First Token) drops dramatically - the user sees the first tokens appear almost immediately instead of waiting for the full response. This provides immediate confirmation the system is working. Research on perceived waiting time shows that "something happening" feels faster than equivalent "nothing happening" even with identical total duration. Second, progressive comprehension - users begin reading and understanding the response while it is still generating, effectively overlapping reading with generation. For a 5-second response, streaming turns 5 seconds of silence into 0.3 seconds of silence followed by 4.7 seconds of reading - a fundamentally different experience. The 34% session length increase Preethi's team observed reflects users being more willing to send follow-up messages when they do not feel they are "waiting."

Q2: What is Server-Sent Events and why is it preferred over WebSockets for LLM streaming?

SSE is a text-based event stream format, defined in the HTML standard, for pushing events from server to client over a long-lived HTTP response. Each event is data: <content>\n\n. SSE is preferred for LLM streaming for five reasons: (1) it is unidirectional (server to client), which matches exactly what LLM streaming needs - one request, many response chunks; (2) it works over standard HTTP/HTTPS without protocol upgrade, so it passes through existing load balancers, CDNs, and proxies as long as response buffering is disabled; (3) browsers reconnect automatically on connection drop when the client uses the EventSource API (which only issues GET requests, so a fetch-based POST client must reconnect itself); (4) it has much lower implementation complexity on both client and server; (5) it supports event IDs and retry logic natively. WebSockets add bidirectional complexity that is unnecessary for LLM streaming. Only switch to WebSockets if you need simultaneous server-to-client AND client-to-server real-time communication, like live collaborative editing.

Q3: How do you handle streaming in a microservices architecture where the LLM call happens in a backend service?

Two approaches. Pass-through streaming: the backend service opens a streaming connection to the LLM API and immediately forwards each token chunk to the calling client. The client connects to the backend, which acts as a streaming proxy. Latency overhead is minimal - just network hops. This requires all intermediaries (API gateways, nginx, load balancers) to be configured for streaming (disable buffering). Queue-based streaming: the backend writes chunks to Redis Streams or a similar pub/sub system; the frontend client subscribes and receives chunks. This adds latency but enables features like stream resumption (client reconnects and gets missed chunks), audit logging of every token, stream fanout (multiple clients watching same generation), and cross-service stream handoff. Use pass-through for user-facing chat; use queue-based when you need advanced streaming features or the generation is triggered asynchronously.

Q4: What are the key production metrics to monitor for a streaming API?

Four critical metrics: (1) TTFT (Time to First Token) - the most impactful UX metric. Measure P50, P95, P99. Alert if P95 TTFT exceeds your threshold (typically 1-2 seconds for user-facing applications). (2) ITL (Inter-Token Latency) - time between consecutive tokens. High ITL causes choppy rendering and usually indicates LLM provider degradation. (3) Stream error rate - connections dropped mid-stream, protocol errors, timeout errors. Alert if above 1% of initiated streams. (4) Stream completion rate - what fraction of initiated streams complete successfully vs abort mid-way. Also track tokens per second (TPS) for capacity planning and cache hit rate for cost optimization. Build a daily TTFT histogram - improvements from caching and model changes show up immediately.
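As a rough sketch of the P50/P95/P99 check described above, the snippet below computes nearest-rank percentiles over a list of TTFT samples and flags a breach. The helper names, the sample values, and the 1500 ms threshold are made up for illustration; a real deployment would more likely rely on its metrics backend.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def ttft_report(ttft_ms: list[float], p95_threshold_ms: float) -> dict:
    """Summarize TTFT samples and flag a P95 breach."""
    report = {f"p{p}": percentile(ttft_ms, p) for p in (50, 95, 99)}
    report["alert"] = report["p95"] > p95_threshold_ms
    return report


# Hypothetical TTFT samples in milliseconds
samples = [210, 250, 280, 300, 310, 330, 350, 400, 900, 2400]
report = ttft_report(samples, p95_threshold_ms=1500)
print(report)  # {'p50': 310, 'p95': 2400, 'p99': 2400, 'alert': True}
```

Note how the median looks healthy while the tail does not, which is why the alert is placed on P95 and not on the average.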

Q5: How do you implement token-level stream cancellation when a user clicks "Stop generating"?

Client sends an abort signal (AbortController in browser, connection close in mobile). On the server: (1) In FastAPI, check await request.is_disconnected() inside the generator loop - return immediately when true; (2) Wrap the streaming client in asyncio.CancelledError handling - FastAPI will cancel the task when the client disconnects; (3) Use the Anthropic async client's streaming context manager (async with client.messages.stream()) - when the context exits (due to cancellation), the underlying HTTP connection closes and generation stops at the provider; (4) Store the abort status in Redis if you need cross-service stream cancellation. The key requirement is that stopping the generator must also stop the LLM API call - otherwise you burn tokens for nobody. The context manager pattern ensures this automatically.
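The claim that exiting the context manager releases the upstream connection can be illustrated with a stand-in. In the sketch below, FakeUpstream is a hypothetical class that mimics a provider stream and records whether it was closed; it is not the Anthropic SDK, and the timings are arbitrary.

```python
import asyncio


class FakeUpstream:
    """Stands in for a provider stream; records whether it was closed."""

    def __init__(self) -> None:
        self.closed = False
        self.sent = 0

    async def __aenter__(self) -> "FakeUpstream":
        return self

    async def __aexit__(self, exc_type, exc, tb) -> bool:
        self.closed = True  # a real client would close the HTTP connection here
        return False        # do not swallow the CancelledError

    async def tokens(self):
        while True:  # endless generation until someone stops it
            await asyncio.sleep(0.01)
            self.sent += 1
            yield f"tok{self.sent}"


async def relay(upstream: FakeUpstream, out: list[str]) -> None:
    async with upstream as stream:
        async for tok in stream.tokens():
            out.append(tok)


async def demo() -> tuple[bool, int]:
    upstream, out = FakeUpstream(), []
    task = asyncio.create_task(relay(upstream, out))
    await asyncio.sleep(0.05)  # user clicks "Stop generating"
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return upstream.closed, len(out)


closed, count = asyncio.run(demo())
print(closed, count)
```

Cancelling the relay task raises CancelledError inside the async with block, so __aexit__ runs and the upstream is closed even though the generator itself would never have finished.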

© 2026 EngineersOfAI. All rights reserved.