What is streaming LLM?

Implementing and optimizing streaming for real-time LLM response delivery - SSE, chunking strategies, backpressure, tool use streaming, and production patterns for perceived performance.


Streaming Responses

The Blinking Cursor Problem

When Preethi's team first deployed their LLM-powered writing assistant, they made a design decision that seemed obvious at the time: wait for the complete response before showing it to the user. The logic was clean - no partial states to manage, no complex frontend code, no risk of showing an incomplete thought.

The user feedback came back brutal. "It feels like the app is frozen." "I thought it crashed." "The three-second wait is unbearable." The problem was not the quality of the responses - users actually loved the output. The problem was the silence. Three seconds of nothing, then suddenly a 400-word response appeared. The experience felt wrong because it violated a fundamental human expectation: when you send something, something should happen.

They switched to streaming. Within two weeks, user session length increased 34%. Not because responses were faster - they were not. The time to complete response was identical. But time-to-first-token dropped from 3 seconds to 0.3 seconds, and users could read the beginning of the response while the model was still generating the end. Perceived performance is often more important than actual performance. Users judge "fast" by when something appears, not by when it finishes.

Streaming is not just a nice-to-have for LLM applications. It is the foundational UX pattern that makes LLM-powered products feel responsive and alive. Any production LLM application that blocks on the full response before showing anything is leaving significant user experience on the table.

How LLM Streaming Works

Language models generate tokens one at a time through a forward pass and sampling step. Without streaming, the server accumulates all tokens into a complete response before sending anything to the client. With streaming, each token (or small group of tokens) is sent to the client as soon as it is generated.

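The transport mechanism is typically Server-Sent Events (SSE), a text format defined in the HTML standard in which the server pushes events to the client over a single long-lived HTTP response. Each event is one or more lines prefixed with data: and is terminated by a blank line. The format is deliberately simple; the final [DONE] line in this example is an application-level convention, not part of SSE itself: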

data: Hello

data: , world

data: !

data: [DONE]

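SSE is preferred over WebSockets for LLM streaming because it is unidirectional (server to client), works over standard HTTP, and does not require a protocol upgrade. The browser's EventSource API also reconnects automatically on drop, but it only issues GET requests; a fetch-based client that sends a POST body, like the one later in this lesson, has to handle reconnection itself. WebSockets add bidirectional complexity that is unnecessary when you only need to stream in one direction.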
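One detail of the framing matters in practice: a raw text chunk that contains a newline cannot be written as a single data: line, because a blank line ends the event. The sketch below shows two ways around that, under the assumption that you control both server and client. The helper names format_sse and format_sse_json are hypothetical and not part of any SDK.

```python
import json
from typing import Optional


def format_sse(data: str, event: Optional[str] = None) -> str:
    """Serialize one SSE event; multi-line data becomes several data: lines."""
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    # split("\n") keeps empty segments, so blank lines inside data survive
    for part in data.split("\n"):
        lines.append(f"data: {part}")
    # A blank line terminates the event
    return "\n".join(lines) + "\n\n"


def format_sse_json(payload: dict, event: Optional[str] = None) -> str:
    """JSON-encode the payload so newlines and leading spaces are escaped."""
    return format_sse(json.dumps(payload), event=event)


print(format_sse("Hello"), end="")
print(format_sse_json({"text": " world\n"}), end="")
```

JSON-encoding every chunk costs a few bytes per event but removes any ambiguity between text tokens and control events, at the price of requiring the client to JSON-decode each payload.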

Implementation: Basic Streaming with the Anthropic SDK

The Anthropic SDK provides a stream context manager that handles all SSE parsing internally. Its text_stream attribute iterates over text chunks: a regular iterator on the sync client and an async iterator on the async client:

import anthropic
import asyncio
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from typing import AsyncGenerator, Iterator

# Sync client for simple use cases
sync_client = anthropic.Anthropic()

# Async client for FastAPI / high-concurrency production use
async_client = anthropic.AsyncAnthropic()

app = FastAPI()


def stream_claude_sync(
    messages: list[dict],
    system_prompt: str = "",
    model: str = "claude-opus-4-6",
    max_tokens: int = 1024,
) -> Iterator[str]:
    """
    Synchronous streaming generator for simple use cases.
    Use in scripts or synchronous frameworks.
    Yields SSE-formatted strings.
    """
    with sync_client.messages.stream(
        model=model,
        max_tokens=max_tokens,
        system=system_prompt,
        messages=messages,
    ) as stream:
        for text_chunk in stream.text_stream:
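            # SSE format: data: <content>\n\n
            # Double newline signals end of this event.
            # Caveat: a chunk that itself contains "\n" breaks this framing;
            # JSON-encode chunks if the output can contain newlines.
            yield f"data: {text_chunk}\n\n"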

        # Signal completion
        yield "data: [DONE]\n\n"


async def stream_claude_async(
    messages: list[dict],
    system_prompt: str = "",
    model: str = "claude-opus-4-6",
    max_tokens: int = 1024,
    include_usage: bool = True,
) -> AsyncGenerator[str, None]:
    """
    Async streaming generator - use this in FastAPI for high concurrency.
    Supports thousands of concurrent streaming connections.

    Args:
        messages: Conversation messages
        system_prompt: System prompt
        model: Claude model to use
        max_tokens: Maximum tokens to generate
        include_usage: Whether to include usage stats in the final event

    Yields:
        SSE-formatted strings
    """
    async with async_client.messages.stream(
        model=model,
        max_tokens=max_tokens,
        system=system_prompt,
        messages=messages,
    ) as stream:
        async for text in stream.text_stream:
            yield f"data: {text}\n\n"
            # Yield control to event loop between tokens
            # Prevents a single stream from starving other coroutines
            await asyncio.sleep(0)

        if include_usage:
            # Include usage stats in the final event for cost tracking
            final_message = await stream.get_final_message()
            usage = final_message.usage
            import json
            stats = json.dumps({
                "done": True,
                "input_tokens": usage.input_tokens,
                "output_tokens": usage.output_tokens,
                "stop_reason": final_message.stop_reason,
            })
            yield f"data: {stats}\n\n"

    yield "data: [DONE]\n\n"


@app.post("/chat/stream")
async def chat_stream(request: dict):
    """
    FastAPI endpoint that streams LLM responses via SSE.

    Critical headers:
    - Cache-Control: no-cache - prevents proxy caching of SSE events
    - X-Accel-Buffering: no - disables nginx proxy buffering
    - Connection: keep-alive - keeps the connection open for SSE
    """
    messages = request.get("messages", [])
    system = request.get("system", "You are a helpful assistant.")
    model = request.get("model", "claude-opus-4-6")

    return StreamingResponse(
        stream_claude_async(messages, system, model),
        media_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "Connection": "keep-alive",
            "X-Accel-Buffering": "no",  # Critical: disable nginx buffering
            "Access-Control-Allow-Origin": "*",
        },
    )

Implementation: Client-Side Streaming

JavaScript Fetch + ReadableStream

// Production-grade streaming client in TypeScript
interface StreamingCallbacks {
  onChunk: (text: string) => void;
  onComplete: (fullText: string, usage?: UsageStats) => void;
  onError: (error: Error) => void;
}

interface UsageStats {
  input_tokens: number;
  output_tokens: number;
  stop_reason: string;
}

async function streamChatResponse(
  messages: Array<{ role: string; content: string }>,
  systemPrompt: string,
  callbacks: StreamingCallbacks,
  signal?: AbortSignal
): Promise<void> {
  const response = await fetch("/api/chat/stream", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ messages, system: systemPrompt }),
    signal, // AbortController signal for cancellation
  });

  if (!response.ok) {
    throw new Error(`HTTP ${response.status}: ${response.statusText}`);
  }

  const reader = response.body?.getReader();
  const decoder = new TextDecoder();
  let fullText = "";
  let buffer = "";

  if (!reader) throw new Error("Response body is null");

  try {
    while (true) {
      const { done, value } = await reader.read();
      if (done) break;

      // Decode chunk and add to buffer
      // stream: true maintains state across chunks for multi-byte characters
      buffer += decoder.decode(value, { stream: true });

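      // Process complete lines. Events are separated by a blank line,
      // which the startsWith check below skips.
      const lines = buffer.split("\n");
      buffer = lines.pop() || ""; // Incomplete last line stays in buffer

      for (const rawLine of lines) {
        // Tolerate CRLF line endings
        const line = rawLine.endsWith("\r") ? rawLine.slice(0, -1) : rawLine;
        if (!line.startsWith("data: ")) continue;

        // Do not trim: leading spaces are part of the token (" world")
        const data = line.slice(6);
        if (!data) continue;

        // [DONE] sentinel - stream complete
        if (data === "[DONE]") {
          callbacks.onComplete(fullText);
          return;
        }

        // JSON control events (usage stats, done signal, tool events)
        if (data.startsWith("{")) {
          let parsed: unknown = null;
          try {
            parsed = JSON.parse(data);
          } catch {
            // Not valid JSON - fall through and treat as text
          }
          if (parsed !== null && typeof parsed === "object") {
            if ((parsed as { done?: boolean }).done) {
              callbacks.onComplete(fullText, parsed as UsageStats);
              return;
            }
            continue; // Other control events are not part of the text
          }
        }

        // Plain text token
        fullText += data;
        callbacks.onChunk(data);
      }
    }

    // Stream ended without a [DONE] sentinel - still report completion
    callbacks.onComplete(fullText);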
  } catch (error) {
    if (error instanceof Error && error.name === "AbortError") {
      return; // Normal cancellation
    }
    callbacks.onError(error as Error);
  } finally {
    reader.releaseLock();
  }
}


// React hook for production streaming
import { useState, useCallback, useRef } from "react";

interface StreamingState {
  response: string;
  isStreaming: boolean;
  error: Error | null;
  usage: UsageStats | null;
}

function useStreamingChat() {
  const [state, setState] = useState<StreamingState>({
    response: "",
    isStreaming: false,
    error: null,
    usage: null,
  });

  const abortControllerRef = useRef<AbortController | null>(null);
  const startTimeRef = useRef<number>(0);

  const sendMessage = useCallback(
    async (
      messages: Array<{ role: string; content: string }>,
      systemPrompt: string = ""
    ) => {
      // Cancel any existing stream
      abortControllerRef.current?.abort();
      abortControllerRef.current = new AbortController();

      startTimeRef.current = performance.now();

      setState({ response: "", isStreaming: true, error: null, usage: null });

      try {
        await streamChatResponse(
          messages,
          systemPrompt,
          {
            onChunk: (text) => {
              setState((prev) => ({
                ...prev,
                response: prev.response + text,
              }));
            },
            onComplete: (fullText, usage) => {
              const totalTime = performance.now() - startTimeRef.current;
              console.log(
                `Stream complete: ${usage?.output_tokens} tokens in ${totalTime.toFixed(0)}ms`
              );
              setState({ response: fullText, isStreaming: false, error: null, usage: usage || null });
            },
            onError: (err) => {
              setState((prev) => ({ ...prev, isStreaming: false, error: err }));
            },
          },
          abortControllerRef.current.signal
        );
      } catch (err) {
        if (err instanceof Error && err.name !== "AbortError") {
          setState((prev) => ({ ...prev, isStreaming: false, error: err as Error }));
        } else {
          setState((prev) => ({ ...prev, isStreaming: false }));
        }
      }
    },
    []
  );

  const cancel = useCallback(() => {
    abortControllerRef.current?.abort();
    setState((prev) => ({ ...prev, isStreaming: false }));
  }, []);

  return { ...state, sendMessage, cancel };
}
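The same buffering idea is handy on the Python side, for example when testing the endpoint from a script. The sketch below is a minimal incremental parser; SSEBuffer is a hypothetical name, and the code assumes LF line endings and only looks at data: fields.

```python
from typing import List


class SSEBuffer:
    """Accumulates decoded text and returns the data payload of each complete event."""

    def __init__(self) -> None:
        self._buffer = ""

    def feed(self, chunk: str) -> List[str]:
        self._buffer += chunk
        events: List[str] = []
        # An event ends at a blank line; whatever follows the last one is incomplete
        while "\n\n" in self._buffer:
            raw, self._buffer = self._buffer.split("\n\n", 1)
            data_lines = []
            for line in raw.split("\n"):
                if line.startswith("data:"):
                    value = line[5:]
                    # Strip exactly one leading space, not all whitespace
                    if value.startswith(" "):
                        value = value[1:]
                    data_lines.append(value)
            if data_lines:
                # Multiple data: lines in one event are joined with newlines
                events.append("\n".join(data_lines))
        return events


buf = SSEBuffer()
for piece in ["data: Hel", "lo\n\ndata:  wor", "ld\n\ndata: [DONE]\n\n"]:
    for payload in buf.feed(piece):
        print(repr(payload))
```

Network chunks rarely align with event boundaries, which is why the parser keeps the unfinished tail in its buffer instead of parsing each chunk on its own.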

Production Streaming Patterns

Pattern 1: Buffered Chunking for Smooth Rendering

Raw token-by-token streaming can create a choppy "letter-by-letter" effect in some UIs. Buffer tokens and flush at word or sentence boundaries for a more natural reading experience:

import asyncio
import anthropic
import json
from typing import AsyncGenerator

async_client = anthropic.AsyncAnthropic()


async def buffered_word_stream(
    messages: list[dict],
    system: str = "",
    buffer_mode: str = "word",  # "token", "word", "sentence"
    model: str = "claude-opus-4-6",
) -> AsyncGenerator[str, None]:
    """
    Buffer tokens and yield at natural boundaries.

    Modes:
    - "token": yield immediately (fastest first token, most granular)
    - "word": buffer until space or punctuation (smooth word-by-word)
    - "sentence": buffer until sentence end (chunkier but natural prose flow)

    Tradeoff: more buffering = smoother rendering but slightly higher perceived latency.
    For most chat UIs, "word" mode is the right balance.
    """
    buffer = ""

    async with async_client.messages.stream(
        model=model,
        max_tokens=1024,
        system=system,
        messages=messages,
    ) as stream:
        async for text in stream.text_stream:
            if buffer_mode == "token":
                yield f"data: {text}\n\n"
                continue

            buffer += text

            if buffer_mode == "word":
                # Yield complete words - flush on space
                while " " in buffer or any(p in buffer for p in ".,!?;:\n"):
                    # Find the next natural break
                    break_pos = -1
                    for i, ch in enumerate(buffer):
                        if ch in " .,!?;:\n":
                            break_pos = i + 1
                            break

                    if break_pos == -1:
                        break  # No break found - keep buffering

                    chunk = buffer[:break_pos]
                    buffer = buffer[break_pos:]
                    if chunk.strip() or chunk in " \n":
                        yield f"data: {chunk}\n\n"

            elif buffer_mode == "sentence":
                # Yield complete sentences
                for punct in [". ", "! ", "? ", ".\n", "!\n", "?\n"]:
                    if punct in buffer:
                        idx = buffer.index(punct) + len(punct)
                        chunk = buffer[:idx]
                        buffer = buffer[idx:]
                        yield f"data: {chunk}\n\n"
                        break

        # Flush any remaining buffer
        if buffer.strip():
            yield f"data: {buffer}\n\n"

        yield "data: [DONE]\n\n"
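The boundary logic can be exercised without calling the API. The sketch below is a simplified variant of the word mode: instead of emitting one word per event, it flushes everything up to the last boundary character seen so far. The function name rechunk_at_word_boundaries is hypothetical.

```python
from typing import Iterable, Iterator

BREAK_CHARS = " .,!?;:\n"


def rechunk_at_word_boundaries(chunks: Iterable[str]) -> Iterator[str]:
    """Re-group arbitrary text chunks so each output ends on a boundary character."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        # Index just past the last boundary character in the buffer (0 if none)
        cut = max(buffer.rfind(ch) for ch in BREAK_CHARS) + 1
        if cut > 0:
            yield buffer[:cut]
            buffer = buffer[cut:]
    # Flush whatever is left once the input is exhausted
    if buffer:
        yield buffer


pieces = ["Hel", "lo wor", "ld, str", "eam", "ing!"]
print(list(rechunk_at_word_boundaries(pieces)))
```

Because the function only moves chunk boundaries, concatenating its output always reproduces the original text, which makes it easy to unit test.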

Pattern 2: Streaming with Tool Use

When the model uses tools, the stream pauses while tools execute. Handle this gracefully - signal the tool call to the client so users see progress indicators instead of a frozen cursor:

import anthropic
import json
from typing import Callable, Any

client = anthropic.Anthropic()


def stream_with_tool_use(
    messages: list[dict],
    tools: list[dict],
    tool_executor: Callable[[str, dict], Any],
    system: str = "",
    model: str = "claude-opus-4-6",
):
    """
    Stream responses that include tool use.

    Handles the pause-execute-resume cycle:
    1. Stream text tokens normally
    2. When tool_use block starts: notify client ("searching...")
    3. Execute tool synchronously
    4. Resume stream with tool results incorporated

    Yields SSE events including:
    - Text chunks: data: <text>
    - Tool call start: data: {"event": "tool_start", "name": "..."}
    - Tool call end: data: {"event": "tool_end", "name": "..."}
    - Done: data: [DONE]
    """
    conversation = messages.copy()
    max_tool_iterations = 10  # Prevent infinite tool loops

    for iteration in range(max_tool_iterations):
        with client.messages.stream(
            model=model,
            max_tokens=2048,
            system=system,
            tools=tools,
            messages=conversation,
        ) as stream:
            # The SDK assembles each tool's input JSON for us; the complete
            # input is read from the final message below, so the
            # input_json_delta fragments are not forwarded to the client.
            for event in stream:
                event_type = getattr(event, "type", "")

                if event_type == "content_block_start":
                    block = event.content_block
                    if getattr(block, "type", "") == "tool_use":
                        # Notify client that a tool is being called
                        tool_event = json.dumps({
                            "event": "tool_start",
                            "name": block.name,
                            "id": block.id,
                        })
                        yield f"data: {tool_event}\n\n"

                elif event_type == "content_block_delta":
                    delta = event.delta
                    if getattr(delta, "type", "") == "text_delta":
                        yield f"data: {delta.text}\n\n"

            # Get final message to check stop reason
            final_message = stream.get_final_message()

            if final_message.stop_reason == "end_turn":
                yield "data: [DONE]\n\n"
                return

            elif final_message.stop_reason == "tool_use":
                # Execute all requested tools
                tool_uses = [
                    block for block in final_message.content
                    if block.type == "tool_use"
                ]

                tool_results = []
                for tool_use in tool_uses:
                    # Execute the tool (synchronous - add async handling for real tools)
                    try:
                        result = tool_executor(tool_use.name, tool_use.input)
                        result_content = str(result)
                        is_error = False
                    except Exception as e:
                        result_content = f"Tool execution error: {str(e)}"
                        is_error = True

                    # Notify client tool finished
                    tool_done_event = json.dumps({
                        "event": "tool_end",
                        "name": tool_use.name,
                        "success": not is_error,
                    })
                    yield f"data: {tool_done_event}\n\n"

                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": tool_use.id,
                        "content": result_content,
                        "is_error": is_error,
                    })

                # Add assistant message and tool results to conversation
                conversation.append({
                    "role": "assistant",
                    "content": final_message.content,
                })
                conversation.append({
                    "role": "user",
                    "content": tool_results,
                })
                # Loop continues - model will respond to tool results

            else:
                # Unexpected stop reason
                yield "data: [DONE]\n\n"
                return

    # Safety: if we exhaust max iterations
    yield f"data: {{\"error\": \"max_tool_iterations_exceeded\"}}\n\n"
    yield "data: [DONE]\n\n"
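The tool_executor argument is any callable that takes a tool name and an input dict. One way to build it is a small registry that dispatches by name, sketched below. The helper make_tool_executor and the two tools are hypothetical and exist only for illustration.

```python
from typing import Any, Callable, Dict


def make_tool_executor(
    registry: Dict[str, Callable[..., Any]],
) -> Callable[[str, dict], Any]:
    """Build a (name, input) -> result callable from a name-to-function mapping."""

    def execute(name: str, tool_input: dict) -> Any:
        if name not in registry:
            # The streaming loop catches exceptions and reports them as tool errors
            raise KeyError(f"Unknown tool: {name}")
        # Tool input keys are passed as keyword arguments
        return registry[name](**tool_input)

    return execute


# Hypothetical tools, for illustration only
def get_weather(city: str) -> str:
    return f"Sunny in {city}"


def add(a: float, b: float) -> float:
    return a + b


tool_executor = make_tool_executor({"get_weather": get_weather, "add": add})
print(tool_executor("add", {"a": 2, "b": 3}))
```

Raising on an unknown tool name, rather than returning a placeholder, lets the loop above send an is_error tool result back to the model so it can recover.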

Pattern 3: Streaming with Backpressure

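When the client is slow (mobile, high latency), tokens can pile up in server memory faster than they are delivered. A bounded queue applies backpressure: once it fills, the producer stops reading from the upstream stream until the client catches up. With a hosted API this throttles consumption on your server; whether the provider slows generation in response is outside your control: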

import asyncio
import anthropic
from asyncio import Queue
from typing import AsyncGenerator

async_client = anthropic.AsyncAnthropic()


async def streaming_with_backpressure(
    messages: list[dict],
    system: str = "",
    model: str = "claude-opus-4-6",
    queue_size: int = 100,  # Max tokens to buffer
    max_tokens: int = 1024,
) -> AsyncGenerator[str, None]:
    """
    Producer-consumer streaming with backpressure control.

    When the client reads slowly, the queue fills up and the producer
    (LLM generation) pauses - preventing memory exhaustion on the server.

    queue_size: smaller = less memory, more backpressure
                larger = more buffering, smoother delivery

    For most use cases, 50-100 tokens is a good queue size.
    """
    token_queue: Queue[str | None] = Queue(maxsize=queue_size)

    async def producer() -> None:
        """Generate tokens from LLM and push to queue."""
        try:
            async with async_client.messages.stream(
                model=model,
                max_tokens=max_tokens,
                system=system,
                messages=messages,
            ) as stream:
                async for text in stream.text_stream:
                    # This blocks when queue is full - natural backpressure
                    await token_queue.put(text)
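        except asyncio.CancelledError:
            # Consumer is gone. Do not put anything: the queue may be full,
            # and a blocking put() here would stall the cancellation forever.
            raise
        except Exception as e:
            # Put error sentinel
            await token_queue.put(f"__ERROR__:{str(e)}")
        # Sentinel to signal completion (after a normal end or after an error)
        await token_queue.put(None)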

    # Start producer in background
    producer_task = asyncio.create_task(producer())

    try:
        while True:
            token = await token_queue.get()

            if token is None:
                # Sentinel: stream complete
                yield "data: [DONE]\n\n"
                token_queue.task_done()
                break

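            if isinstance(token, str) and token.startswith("__ERROR__:"):
                import json  # local import, as in the other examples
                # json.dumps escapes quotes and newlines so the event stays valid JSON
                error_event = json.dumps({"error": token[len("__ERROR__:"):]})
                yield f"data: {error_event}\n\n"
                yield "data: [DONE]\n\n"
                token_queue.task_done()
                break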

            yield f"data: {token}\n\n"
            token_queue.task_done()
    finally:
        # Ensure producer is cleaned up even if client disconnects
        producer_task.cancel()
        try:
            await producer_task
        except asyncio.CancelledError:
            pass
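The blocking behaviour of a bounded queue can be observed without an LLM. The sketch below swaps the model for a fast fake producer and adds an artificial delay to the consumer; run_demo and the token strings are hypothetical.

```python
import asyncio
from typing import List


async def run_demo(queue_size: int = 3, total: int = 10) -> dict:
    """Fast producer, slow consumer: record the deepest the queue ever gets."""
    queue: asyncio.Queue = asyncio.Queue(maxsize=queue_size)
    max_depth = 0

    async def producer() -> None:
        nonlocal max_depth
        for i in range(total):
            await queue.put(f"tok{i}")  # suspends while the queue is full
            max_depth = max(max_depth, queue.qsize())
        await queue.put(None)  # sentinel: no more tokens

    task = asyncio.create_task(producer())
    received: List[str] = []
    while True:
        token = await queue.get()
        if token is None:
            break
        received.append(token)
        await asyncio.sleep(0.001)  # simulate a slow client
    await task
    return {"received": received, "max_depth": max_depth}


result = asyncio.run(run_demo())
print(len(result["received"]), result["max_depth"])
```

However far ahead the producer could run, the recorded depth never exceeds queue_size, which is the memory bound the pattern is meant to provide.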

Pattern 4: Streaming with Latency Metrics

Measure and monitor streaming performance in production. TTFT (Time to First Token) is the metric that most directly correlates with user perception of responsiveness:

import anthropic
import time
import json
from dataclasses import dataclass
from typing import Optional

client = anthropic.Anthropic()


@dataclass
class StreamingMetrics:
    """Metrics captured from a single streaming response."""
    model: str
    ttft_ms: Optional[float]          # Time to first token
    total_time_ms: float              # Total generation time
    input_tokens: int
    output_tokens: int
    tokens_per_second: float
    inter_token_latency_ms: float     # Average time between tokens


def measure_streaming_performance(
    messages: list[dict],
    system: str = "",
    model: str = "claude-opus-4-6",
    max_tokens: int = 500,
) -> StreamingMetrics:
    """
    Measure streaming performance metrics.
    Use this for:
    - Model comparison (Haiku vs Sonnet vs Opus TTFT)
    - Regional routing optimization
    - SLA monitoring
    - Cache effectiveness measurement
    """
    start_time = time.perf_counter()
    first_token_time: Optional[float] = None
    token_count = 0
    full_text = ""

    with client.messages.stream(
        model=model,
        max_tokens=max_tokens,
        system=system,
        messages=messages,
    ) as stream:
        for text in stream.text_stream:
            if first_token_time is None:
                first_token_time = time.perf_counter()

            token_count += 1
            full_text += text

        final_message = stream.get_final_message()
        end_time = time.perf_counter()

    total_time = (end_time - start_time) * 1000
    ttft = (first_token_time - start_time) * 1000 if first_token_time else None
    output_tokens = final_message.usage.output_tokens
    generation_time_s = (end_time - (first_token_time or start_time))
    tps = output_tokens / max(generation_time_s, 0.001)

    return StreamingMetrics(
        model=model,
        ttft_ms=ttft,
        total_time_ms=total_time,
        input_tokens=final_message.usage.input_tokens,
        output_tokens=output_tokens,
        tokens_per_second=tps,
        inter_token_latency_ms=(total_time - (ttft or 0)) / max(output_tokens - 1, 1),
    )


# TTFT Optimization Lookup Table (rough, illustrative figures - measure on your own workload)
TTFT_OPTIMIZATION_GUIDE = {
    "prompt_caching": {
        "impact": "60-80% TTFT reduction",
        "how": "Cache system prompt and static documents with cache_control",
        "when": "System prompt > 1024 tokens and same across requests",
    },
    "model_selection": {
        "impact": "2-4x TTFT variation",
        "how": "claude-haiku: ~300ms | claude-sonnet: ~500ms | claude-opus: ~1000ms",
        "when": "Quality requirements allow using a smaller model",
    },
    "input_minimization": {
        "impact": "10-30% TTFT reduction",
        "how": "Shorter context = faster first token processing",
        "when": "Conversation history is growing large",
    },
    "regional_routing": {
        "impact": "50-200ms based on geography",
        "how": "Route to nearest API endpoint",
        "when": "Global user base with significant network latency",
    },
    "connection_warmup": {
        "impact": "100-300ms first-call overhead",
        "how": "Keep HTTP connections alive (connection pooling)",
        "when": "High-frequency low-latency applications",
    },
}
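A single measurement says little about production behaviour, so TTFT is usually tracked as percentiles over many requests. The sketch below aggregates a list of ttft_ms values with the standard library; summarize_ttft is a hypothetical helper and the sample data is synthetic.

```python
import statistics
from typing import Dict, List


def summarize_ttft(samples_ms: List[float]) -> Dict[str, float]:
    """Return p50/p95/p99 and max for a list of TTFT samples in milliseconds."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two samples")
    # quantiles(n=100) returns the 99 cut points p1..p99
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
        "max": max(samples_ms),
    }


# Synthetic samples: 1 ms, 2 ms, ..., 101 ms
print(summarize_ttft([float(x) for x in range(1, 102)]))
```

Alerting on p95 or p99 rather than the mean tends to surface cold caches and slow regions that an average hides.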

Streaming with Prompt Caching

Prompt caching dramatically reduces TTFT for requests that share a static prefix - the most impactful optimization for most streaming applications:

import anthropic
import time
from typing import AsyncGenerator

async_client = anthropic.AsyncAnthropic()


async def stream_with_prompt_cache(
    static_context: str,           # Long static document, product catalog, etc.
    dynamic_messages: list[dict],  # User-specific, changes per request
    system_instruction: str = "You are a helpful assistant.",
    model: str = "claude-opus-4-6",
    max_tokens: int = 1000,
) -> AsyncGenerator[str, None]:
    """
    Stream with Anthropic prompt caching enabled.

    First call: full input processing cost; the marked prefix is cached
    (default lifetime 5 minutes, refreshed each time the cache is hit).
    Subsequent calls within that window: cached portion is 90% cheaper, TTFT is lower.

    Cost breakdown (illustrative; verify current per-model pricing):
    - Standard input: $15/M tokens
    - Cache creation: $18.75/M tokens (25% surcharge on first write)
    - Cache hit reads: $1.50/M tokens (90% savings)

    TTFT improvement: roughly 40-60% for cache hits, because prefill over the
    cached prefix is skipped. Actual gains depend on prefix length.
    """
    start_time = time.perf_counter()

    async with async_client.messages.stream(
        model=model,
        max_tokens=max_tokens,
        system=[
            {
                "type": "text",
                "text": system_instruction,
            },
            {
                "type": "text",
                "text": static_context,
                "cache_control": {"type": "ephemeral"},  # Cache this block
            },
        ],
        messages=dynamic_messages,
    ) as stream:
        first_token = True
        async for text in stream.text_stream:
            if first_token:
                ttft = (time.perf_counter() - start_time) * 1000
                print(f"TTFT: {ttft:.0f}ms")
                first_token = False

            yield f"data: {text}\n\n"

        # Get final message for cache metrics
        final = await stream.get_final_message()
        usage = final.usage
        # These fields can be missing or None, so fall back to 0 before comparing
        cache_read = getattr(usage, "cache_read_input_tokens", 0) or 0
        cache_creation = getattr(usage, "cache_creation_input_tokens", 0) or 0
        cache_hit = cache_read > 0

        import json
        stats = json.dumps({
            "done": True,
            "input_tokens": usage.input_tokens,
            "output_tokens": usage.output_tokens,
            "cache_hit": cache_hit,
            "cache_read_tokens": cache_read,
            "cache_creation_tokens": cache_creation,
        })
        yield f"data: {stats}\n\n"

    yield "data: [DONE]\n\n"
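The cost effect of caching follows directly from the two multipliers in the docstring above (1.25x for a cache write, 0.10x for a cache read). The sketch below works through the arithmetic; the function name is hypothetical, and the $15 base price and token counts are example inputs rather than current pricing.

```python
def cached_request_cost(
    base_price_per_mtok: float,
    cached_tokens: int,
    uncached_tokens: int,
    cache_hit: bool,
    write_multiplier: float = 1.25,  # 25% surcharge when the cache is written
    read_multiplier: float = 0.10,   # 90% discount when the cache is read
) -> float:
    """Input cost in dollars for one request with a cacheable prefix."""
    multiplier = read_multiplier if cache_hit else write_multiplier
    cached_cost = cached_tokens * base_price_per_mtok * multiplier
    uncached_cost = uncached_tokens * base_price_per_mtok
    return (cached_cost + uncached_cost) / 1_000_000


# Example: 10,000-token static prefix, 500 dynamic tokens, $15 per million tokens
first_call = cached_request_cost(15.0, 10_000, 500, cache_hit=False)
later_call = cached_request_cost(15.0, 10_000, 500, cache_hit=True)
print(f"first: ${first_call:.4f}  later: ${later_call:.4f}")
```

With these example numbers the first call costs slightly more than an uncached request, and every later hit costs a fraction of it, so the write surcharge is recovered on the first cache hit.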

Architecture: Production Streaming System

Handling Client Disconnects

One of the most common streaming bugs: the server continues generating tokens after the client disconnects, wasting compute and burning tokens:

import anthropic
import asyncio
from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse

async_client = anthropic.AsyncAnthropic()
app = FastAPI()


@app.post("/chat/stream")
async def chat_stream(request: Request, body: dict):
    """
    Streaming endpoint with proper disconnect handling.
    Stops generation when the client disconnects.
    """
    messages = body.get("messages", [])
    system = body.get("system", "")

    async def generate_with_disconnect_detection():
        try:
            async with async_client.messages.stream(
                model="claude-opus-4-6",
                max_tokens=1024,
                system=system,
                messages=messages,
            ) as stream:
                async for text in stream.text_stream:
                    # Check if client disconnected before sending each chunk
                    if await request.is_disconnected():
                        print("Client disconnected - stopping generation")
                        return

                    yield f"data: {text}\n\n"
                    # Yield control to allow disconnect detection
                    await asyncio.sleep(0)

                yield "data: [DONE]\n\n"

        except asyncio.CancelledError:
            # Normal: happens when the client disconnects and the server cancels the task.
            # Re-raise so the cancellation propagates instead of being swallowed.
            print("Stream task cancelled (client disconnected)")
            raise
        except Exception as e:
            import json
            error_event = json.dumps({"error": str(e), "type": type(e).__name__})
            yield f"data: {error_event}\n\n"
            yield "data: [DONE]\n\n"

    return StreamingResponse(
        generate_with_disconnect_detection(),
        media_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "Connection": "keep-alive",
            "X-Accel-Buffering": "no",
        },
    )

Common Pitfalls

:::danger Nginx/Proxy Buffering Destroys Streaming
By default, nginx and many other reverse proxies buffer upstream responses before forwarding them to the client. With buffering on, tokens arrive in large delayed bursts or all at once, so the user sees the same "wait then show" behavior as non-streaming. Fix: add X-Accel-Buffering: no to your response headers, or set proxy_buffering off; in the nginx config for that location. This is the most common cause of "why isn't my streaming working?" reports.
:::

:::danger Not Handling Client Disconnects
If a user closes their browser tab mid-stream, the HTTP connection closes. Without disconnect handling, your server keeps generating tokens for nobody, burning API quota and compute. Check request.is_disconnected() inside the streaming loop, and if you catch asyncio.CancelledError for cleanup, re-raise it.
:::

:::warning Blocking the Event Loop in Async Streaming
Synchronous operations inside an async streaming generator block every other concurrent stream. time.sleep() in an async function is the most common offender: it stalls the entire event loop. Use await asyncio.sleep(...) instead, and await asyncio.sleep(0) when you only need to yield control between tokens. Never make synchronous DB calls or file I/O inside async generators; use an async driver or move the call to a thread with asyncio.to_thread().
:::

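:::warning Missing Streaming Timeout
LLM generation for long responses can take minutes. Without a timeout, a stuck connection holds a server slot forever. Set timeout=httpx.Timeout(30.0, connect=5.0) on your async client; for a stream, the 30-second read timeout applies to each wait for the next chunk, not to the whole response. To cap total duration, wrap the streaming loop in async with asyncio.timeout(300) (Python 3.11+). Note that asyncio.wait_for() accepts an awaitable, so it cannot wrap an async generator directly; apply it to each __anext__() call instead.
:::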
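A per-chunk timeout on an async generator can be sketched as a small wrapper that applies asyncio.wait_for() to each step of the iteration. The name with_idle_timeout is hypothetical, and slow_source stands in for a stalled LLM stream.

```python
import asyncio
from typing import AsyncIterator, List


async def with_idle_timeout(
    source: AsyncIterator[str], idle_timeout: float
) -> AsyncIterator[str]:
    """Re-yield items from source; raise asyncio.TimeoutError if the next item is late."""
    iterator = source.__aiter__()
    while True:
        try:
            # wait_for needs an awaitable, so wrap each __anext__() call
            item = await asyncio.wait_for(iterator.__anext__(), timeout=idle_timeout)
        except StopAsyncIteration:
            return
        yield item


async def slow_source() -> AsyncIterator[str]:
    yield "a"
    await asyncio.sleep(0.5)  # simulated stall
    yield "b"


async def collect() -> List[str]:
    got: List[str] = []
    try:
        async for item in with_idle_timeout(slow_source(), idle_timeout=0.05):
            got.append(item)
    except asyncio.TimeoutError:
        got.append("TIMEOUT")
    return got


print(asyncio.run(collect()))
```

An idle timeout of this kind catches stalled streams without cutting off long but healthy responses; a separate overall deadline is still useful as an upper bound.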

:::tip Progressive Enhancement Design your UI to work without streaming (show a loading indicator, then display the complete response) and enhance with streaming when available. This ensures graceful degradation if the streaming connection fails or the client environment doesn't support SSE. Progressive enhancement means your feature works for everyone and feels magical for users with good connections. :::

Streaming vs Non-Streaming: When to Use Each

| Scenario | Recommendation | Reason |
| --- | --- | --- |
| User-facing chat UI | Always stream | UX: perceived performance, user reads while generating |
| Background summarization | Non-streaming | No user waiting; simpler code |
| Structured JSON extraction | Non-streaming | JSON must be complete to parse |
| Agent tool use loops | Stream text, buffer tool calls | User sees progress; tool calls need full input |
| Batch processing | Non-streaming | Throughput > latency |
| API-to-API calls (no user) | Non-streaming | Simplicity; no UX benefit |
| Mobile with poor connection | Stream + backpressure | Reduces buffer memory requirements |
| Markdown rendering | Stream with word buffering | Prevents partial markdown render artifacts |

Interview Q&A

Q1: Why does streaming improve perceived performance even when total response time is identical?

Streaming improves perceived performance through two distinct mechanisms. First, TTFT (Time to First Token) drops dramatically - the user sees the first tokens appear almost immediately instead of waiting for the full response. This provides immediate confirmation the system is working. Research on perceived waiting time shows that "something happening" feels faster than equivalent "nothing happening" even with identical total duration. Second, progressive comprehension - users begin reading and understanding the response while it is still generating, effectively overlapping reading with generation. For a 5-second response, streaming turns 5 seconds of silence into 0.3 seconds of silence followed by 4.7 seconds of reading - a fundamentally different experience. The 34% session length increase Preethi's team observed reflects users being more willing to send follow-up messages when they do not feel they are "waiting."

Q2: What is Server-Sent Events and why is it preferred over WebSockets for LLM streaming?

SSE (Server-Sent Events) is a text format, defined in the HTML standard, for pushing events from server to client over a single long-lived HTTP response with Content-Type: text/event-stream. Each event is data: <content>\n\n. SSE is preferred for LLM streaming for five reasons: (1) it is unidirectional (server to client), which matches exactly what LLM streaming needs - one request, many response chunks; (2) it works over standard HTTP/HTTPS without protocol upgrade, so it passes through existing load balancers, CDNs, and proxies as long as response buffering is disabled; (3) browsers reconnect automatically via the EventSource API on connection drop (EventSource only issues GET requests, so clients that POST the prompt with fetch must handle reconnection themselves); (4) it has much lower implementation complexity on both client and server; (5) it supports event IDs and retry intervals natively. WebSockets add bidirectional complexity that is unnecessary for LLM streaming. Only switch to WebSockets if you need simultaneous server-to-client AND client-to-server real-time communication, like live collaborative editing.
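
The wire format described above is small enough to sketch in full. The helpers below are illustrative (`format_sse` and `parse_sse` are hypothetical names, and the parser is deliberately simplified: it expects a complete body with `\n` line endings). They show the `event`, `id`, and `retry` fields, the one-`data:`-line-per-payload-line rule, and comment lines used as keep-alives.

```python
from typing import Optional


def format_sse(
    data: str,
    event: Optional[str] = None,
    event_id: Optional[str] = None,
    retry_ms: Optional[int] = None,
) -> str:
    """Serialize one event in the text/event-stream wire format."""
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    if event_id is not None:
        lines.append(f"id: {event_id}")
    if retry_ms is not None:
        lines.append(f"retry: {retry_ms}")
    # A payload containing newlines becomes one data: line per payload line
    for part in data.split("\n"):
        lines.append(f"data: {part}")
    return "\n".join(lines) + "\n\n"  # a blank line terminates the event


def parse_sse(raw: str) -> list[dict]:
    """Parse a complete event-stream body into event dicts (simplified)."""
    events = []
    for block in raw.split("\n\n"):
        fields: dict = {}
        data_lines: list[str] = []
        for line in block.split("\n"):
            if not line or line.startswith(":"):
                continue  # empty line or comment (often a keep-alive ping)
            name, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]  # one leading space after the colon is dropped
            if name == "data":
                data_lines.append(value)
            elif name in ("event", "id", "retry"):
                fields[name] = value
        if data_lines:  # blocks without data are not dispatched
            fields["data"] = "\n".join(data_lines)
            events.append(fields)
    return events
```

For example, `format_sse("hello")` produces `data: hello\n\n`, the same shape as the `data: [DONE]\n\n` terminator used elsewhere in this lesson.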

Q3: How do you handle streaming in a microservices architecture where the LLM call happens in a backend service?

Two approaches. Pass-through streaming: the backend service opens a streaming connection to the LLM API and immediately forwards each token chunk to the calling client. The client connects to the backend, which acts as a streaming proxy. Latency overhead is minimal - just network hops. This requires all intermediaries (API gateways, nginx, load balancers) to be configured for streaming (disable buffering). Queue-based streaming: the backend writes chunks to Redis Streams or a similar pub/sub system; the frontend client subscribes and receives chunks. This adds latency but enables features like stream resumption (client reconnects and gets missed chunks), audit logging of every token, stream fanout (multiple clients watching same generation), and cross-service stream handoff. Use pass-through for user-facing chat; use queue-based when you need advanced streaming features or the generation is triggered asynchronously.
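
Stream resumption in the queue-based approach comes down to giving every chunk a sequence ID and letting the client ask for everything after the last ID it saw. The sketch below uses an in-memory list as a stand-in for Redis Streams, so `ChunkLog` and its methods are hypothetical names chosen for illustration, not a Redis API.

```python
from dataclasses import dataclass, field


@dataclass
class ChunkLog:
    """In-memory stand-in for an append-only stream such as Redis Streams."""

    entries: list[tuple[int, str]] = field(default_factory=list)

    def append(self, chunk: str) -> int:
        """Producer side: store a chunk and return its sequence ID."""
        seq = len(self.entries) + 1
        self.entries.append((seq, chunk))
        return seq

    def read_after(self, last_seen: int = 0) -> list[tuple[int, str]]:
        """Consumer side: return every chunk with an ID greater than last_seen."""
        return [(seq, chunk) for seq, chunk in self.entries if seq > last_seen]


def demo_resume() -> str:
    """Simulate a client that drops mid-stream and reconnects without losing chunks."""
    log = ChunkLog()
    for tok in ["Stream", "ing ", "is "]:
        log.append(tok)
    # First connection delivers two chunks, then drops
    received = log.read_after(0)[:2]
    last_seen = received[-1][0]
    # Generation continues while the client is offline
    log.append("resum")
    log.append("able")
    # On reconnect the client reports last_seen (with SSE, via Last-Event-ID)
    received += log.read_after(last_seen)
    return "".join(chunk for _, chunk in received)
```

The same log also serves fanout and auditing: any number of readers can call `read_after(0)` independently, and the stored entries are a complete record of what was generated.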

Q4: What are the key production metrics to monitor for a streaming API?

Four critical metrics: (1) TTFT (Time to First Token) - the most impactful UX metric. Measure P50, P95, P99. Alert if P95 TTFT exceeds your threshold (typically 1-2 seconds for user-facing applications). (2) ITL (Inter-Token Latency) - time between consecutive tokens. High ITL causes choppy rendering and usually indicates LLM provider degradation. (3) Stream error rate - connections dropped mid-stream, protocol errors, timeout errors. Alert if above 1% of initiated streams. (4) Stream completion rate - what fraction of initiated streams complete successfully vs abort mid-way. Also track tokens per second (TPS) for capacity planning and cache hit rate for cost optimization. Build a daily TTFT histogram - improvements from caching and model changes show up immediately.
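
The P50/P95/P99 figures above can be derived from per-stream timestamps. The following is a sketch, assuming you already record the request time and each token's arrival time in seconds; `percentile` uses the nearest-rank definition, and both function names are illustrative.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def stream_latencies(request_time: float, token_times: list[float]) -> dict:
    """Derive TTFT and inter-token latencies (ITL) from one stream's timestamps."""
    ttft = token_times[0] - request_time
    itl = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    return {"ttft": ttft, "itl": itl}
```

Feeding the `ttft` value of every stream into `percentile(..., 95)` gives the number to alert on. A single slow outlier moves P95 and P99 far more than P50, which is why the tail percentiles are the ones worth alerting on.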

Q5: How do you implement token-level stream cancellation when a user clicks "Stop generating"?

The client aborts the request (AbortController in the browser, closing the connection on mobile). On the server: (1) In FastAPI, check await request.is_disconnected() inside the generator loop and return immediately when it is true; (2) Handle asyncio.CancelledError around the streaming call, since the server cancels the streaming task when the client disconnects; (3) Use the Anthropic async client's streaming context manager (async with client.messages.stream()) - when the context exits (due to cancellation), the underlying HTTP connection closes and generation stops at the provider; (4) Store the abort status in Redis if you need cross-service stream cancellation. The key requirement is that stopping the generator must also stop the LLM API call - otherwise you burn tokens for nobody. The context manager pattern ensures this automatically.
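
The claim that a context manager ties generator shutdown to upstream shutdown can be checked without any network calls. In this sketch `FakeUpstream` is a hypothetical stand-in for a provider stream (it is not the Anthropic SDK); cancelling the relay task plays the role of the "Stop generating" click.

```python
import asyncio


class FakeUpstream:
    """Hypothetical stand-in for a provider stream opened with `async with`."""

    def __init__(self) -> None:
        self.closed = False
        self.sent = 0

    async def __aenter__(self) -> "FakeUpstream":
        return self

    async def __aexit__(self, exc_type, exc, tb) -> None:
        # Runs on normal exit, on errors, and on task cancellation
        self.closed = True

    async def tokens(self):
        while True:  # pretend the model never stops on its own
            await asyncio.sleep(0.01)
            self.sent += 1
            yield f"tok{self.sent}"


async def relay(upstream: FakeUpstream, out: list[str]) -> None:
    """Forward tokens until the task is cancelled."""
    async with upstream as stream:
        async for tok in stream.tokens():
            out.append(tok)


async def stop_after(delay: float) -> tuple[bool, int, int]:
    """Cancel the relay after `delay` seconds; report cleanup and token counts."""
    upstream, out = FakeUpstream(), []
    task = asyncio.create_task(relay(upstream, out))
    await asyncio.sleep(delay)
    task.cancel()  # what the server does when the client aborts
    try:
        await task
    except asyncio.CancelledError:
        pass
    sent_at_cancel = upstream.sent
    await asyncio.sleep(0.05)  # if generation kept running, sent would grow
    return upstream.closed, sent_at_cancel, upstream.sent
```

`asyncio.run(stop_after(0.1))` should report that the upstream was closed and that the token count stopped growing at the moment of cancellation. A relay that iterated over the upstream without `async with` or `try/finally` would have no place to run that cleanup.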
