
Batching Strategies for LLM Serving

The Economics of a Single GPU Decode Step

You have a freshly deployed LLaMA-3 70B cluster. Eight A100 80GB GPUs, NVLink, the works. Your serving team has configured TensorRT-LLM and the system is live. The first week of production goes smoothly. Then your VP of Product sends a message: "Can we get the cost per 1000 tokens down? We are losing money on every API call."

You look at the GPU utilization graphs. During decode - the token-by-token generation phase - utilization is hovering around 8%. Eight percent. On hardware that costs $3.50 per hour per GPU. You have $28 per hour of compute producing work at the rate of $2.24 per hour. The rest is waste.

The root cause is not software bugs or misconfigurations. It is a fundamental property of autoregressive generation. Every single decode step reads the entire 70B parameter model from HBM memory - all 140 GB of weights - regardless of whether you are generating one token for one user or one token for 64 users simultaneously. The matrix multiplications during decode are memory-bandwidth-bound, not compute-bound. You are paying for memory bandwidth that can serve many users in parallel but is being used to serve one.

This is the batching problem for LLM serving. It is not the same batching problem as in traditional deep learning inference. Traditional models like ResNet or BERT have fixed-length inputs and outputs - you can simply stack 32 requests into a batch and run one forward pass. Language generation does not work this way. Users send prompts of wildly different lengths. They request different numbers of output tokens. A request for "summarize this 5000-word document" and a request for "what is 2+2" arrive at the same second. Naively treating them as a batch means padding the short request to match the long one, which wastes compute, or waiting until you have enough similar requests to batch efficiently, which adds latency.

Every batching strategy in LLM serving is an attempt to solve this fundamental mismatch: the hardware wants to process many requests simultaneously, but the autoregressive nature of generation creates variable-length sequences that are hard to batch. The evolution from static batching to continuous batching to chunked prefill is the story of incrementally solving this mismatch. Getting it right is the difference between a system that is 8% efficient and one that is 70%+ efficient - the difference between losing money and operating profitably.

The three papers that define modern LLM serving batching are Orca (2022), which introduced continuous batching; vLLM (2023), which demonstrated it in practice with paged KV caches; and Sarathi-Serve (2023), which introduced chunked prefill. Together they account for most of the 10-50x throughput improvement that the best production LLM systems have over naive implementations.

Why This Exists - The Weight Read Problem

To understand why batching matters so much for LLMs specifically, you need to understand one key fact: during the decode phase, a single step generates exactly one token. To generate that token, the model must read every weight in every transformer layer.

For a 70B parameter model in FP16:

$$\text{Weight size} = 70 \times 10^9 \times 2 \text{ bytes} = 140 \text{ GB}$$

An A100 80GB has HBM bandwidth of 2 TB/s. Reading 140 GB takes:

$$\text{Read time} = \frac{140 \text{ GB}}{2000 \text{ GB/s}} = 0.07 \text{ seconds} = 70 \text{ ms per decode step}$$

Meanwhile, the actual arithmetic for a single-token decode step is trivially small. You are computing matrix-vector products (matrix times a single vector), not matrix-matrix products. Matrix-vector products are memory-bandwidth-bound by definition - you spend far more time loading weights than computing with them. The A100 has 312 TFLOPS of FP16 compute but only 2 TB/s memory bandwidth. The ratio is 312 / 2 = 156 FLOPs per byte. A matrix-vector product for a single token needs roughly 1 FLOP per byte. You are running at less than 1% arithmetic efficiency.

Now consider what happens when you batch 64 requests together. You still read the same 140 GB of weights - once. But now you are generating 64 tokens simultaneously. The weight read cost is amortized across 64 outputs. Your arithmetic efficiency climbs toward the roofline because you are now computing a matrix-matrix product (matrix times 64 vectors) instead of a matrix-vector product. Utilization goes from 1% toward 64%.

This is why batching is the primary lever for LLM serving efficiency. Unlike image classification where batching improves GPU utilization by reducing kernel launch overhead, for LLM decode, batching is about amortizing weight reads across multiple outputs. Every decode step reads the full model whether you batch or not. The question is how many tokens you produce per read.
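To make the arithmetic concrete, here is a back-of-envelope calculator for decode step time and throughput versus batch size. It is a deliberately crude sketch using the A100-class numbers from above (140 GB of weights, 2 TB/s bandwidth, 312 TFLOPS); real kernel efficiency, attention, and KV cache reads are ignored.

# Back-of-envelope model of decode cost vs batch size (illustrative only).
# Assumes a 140 GB FP16 model, 2 TB/s HBM bandwidth, and 312 TFLOPS peak
# FP16 compute - the A100-class numbers used in the text above.

WEIGHT_BYTES = 140e9          # 70B params x 2 bytes
HBM_BANDWIDTH = 2e12          # bytes/s
PEAK_FLOPS = 312e12           # FP16 FLOPs/s

def decode_step_time(batch_size: int) -> float:
    """Time for one decode step: max of weight-read time and compute time."""
    read_time = WEIGHT_BYTES / HBM_BANDWIDTH
    # ~2 FLOPs per weight per token in the batch
    compute_time = (2 * 70e9 * batch_size) / PEAK_FLOPS
    return max(read_time, compute_time)

for b in (1, 8, 64, 256):
    t = decode_step_time(b)
    print(f"batch {b:>3}: {t * 1000:.1f} ms/step, {b / t:,.0f} tokens/s")
# Throughput scales ~linearly with batch size until the compute time catches
# up with the weight-read time (around batch ~150 for these numbers).

Running it shows the step time pinned at roughly 70 ms until batch sizes around 150, with tokens per second scaling almost linearly up to that point.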

The challenge is that previous systems (before 2022) could not easily batch decode steps together because requests were at different positions in generation, had different KV cache sizes, and required different amounts of computation. The solution was continuous batching - a batching strategy designed specifically for the sequential nature of autoregressive generation.

Historical Context - From Triton Inference Server to Orca

Before 2022, the dominant approach to LLM serving in production was what we now call static batching. Systems like NVIDIA's Triton Inference Server and FasterTransformer would accept a batch of requests, process them together for the full generation (from first token to last), and release the GPU resources only when every request in the batch had finished. This worked adequately for small models and homogeneous workloads. For production LLM serving at scale, it was a disaster.

The problem with static batching becomes obvious when you think about request completion times. In a batch of 64 requests, some users ask for 10-token responses and some ask for 500-token responses. The 10-token requests finish quickly, but the entire batch is held in the GPU until the 500-token request completes. The 54 requests that finished early have their memory tied up in KV cache storage. The GPU is running at batch size 64 until step 10, then batch size 10 (just the remaining requests), then batch size 1, even though there are new requests waiting in the queue that could be using that capacity.

The fundamental insight behind continuous batching came from the Orca paper ("ORCA: A Distributed Serving System for Transformer-Based Generative Models", Yu et al., 2022). The key observation: each decode step is independent. Unlike attention in BERT (where you need the full sequence to compute attention), autoregressive decode generates exactly one token per step using the KV cache as memory. You can add a new request to the "batch" at the beginning of any decode step, and remove a completed request from the batch at the end of any decode step. There is no reason to wait for slow requests to finish before starting new ones.

The vLLM paper (Kwon et al., 2023) demonstrated continuous batching in practice at scale and solved the KV cache memory fragmentation problem that had made continuous batching operationally difficult. By treating KV cache memory like OS virtual memory - using a page table to map logical blocks to physical blocks - vLLM achieved both continuous batching and efficient memory use, reporting up to 24x higher throughput than a naive Hugging Face Transformers baseline in their benchmarks.

The "aha moment" in the Orca paper can be summarized as: LLM generation is iteration-level scheduling, not request-level scheduling. Traditional request batching waits for the entire request to complete. Iteration-level scheduling makes scheduling decisions at each decode step. This small conceptual shift unlocked an order-of-magnitude throughput improvement.

Core Concepts

Static Batching - The Baseline

Static batching is the simplest approach: collect N requests, pad them all to the same length, run one forward pass per decode step, wait until all N requests complete, then start the next batch.

The padding waste is severe. If your batch has one request that needs 500 tokens and 63 requests that need 20 tokens, every decode step from token 20 to token 500 is computing on a batch where 63 out of 64 "requests" are actually padding tokens. You are doing real compute on fake data.

$$\text{Padding waste} = \frac{\sum_{i=1}^{N}(\text{max\_tokens} - \text{tokens}_i)}{\sum_{i=1}^{N} \text{max\_tokens}}$$

In practice, if request lengths follow a power-law distribution (which they do in most production LLM workloads), the padding waste in static batching exceeds 60-70% for typical batch sizes.
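A quick simulation makes the point. The sketch below draws request lengths from a heavy-tailed lognormal distribution (an assumption standing in for real traffic) and applies the padding-waste formula above to fixed batches of 64.

import numpy as np

# Illustrative estimate of static-batching padding waste using the formula
# above, with output lengths drawn from a heavy-tailed (lognormal) distribution
# as a stand-in for real production traffic.
rng = np.random.default_rng(0)
lengths = np.clip(rng.lognormal(mean=3.5, sigma=1.2, size=10_000), 1, 2048).astype(int)

def padding_waste(tokens: np.ndarray, batch_size: int = 64) -> float:
    waste, total = 0, 0
    for i in range(0, len(tokens), batch_size):
        batch = tokens[i:i + batch_size]
        waste += (batch.max() - batch).sum()   # tokens padded up to the batch max
        total += batch.max() * len(batch)       # tokens actually computed
    return waste / total

print(f"Padding waste at batch 64: {padding_waste(lengths):.0%}")
# Heavy-tailed length distributions typically push this well above 60%.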

The secondary problem is GPU idle time. When the fast requests in a static batch finish, the remaining GPU capacity cannot be used for new requests until the entire batch completes. This creates a staircase utilization pattern - high at the start of a batch, declining as requests finish, followed by a reset when the new batch starts.

Dynamic Batching - Better but Not Enough

Dynamic batching (used in systems like NVIDIA Triton with its dynamic batching plugin) groups requests that arrive within a time window and sends them together as a batch. Requests arriving within 5ms of each other go into the same batch. This improves GPU utilization compared to processing each request individually, but still has the static batching problem: once a batch starts, it runs to completion.

Dynamic batching also introduces a latency-throughput tradeoff via the batching timeout parameter. A 5ms wait window adds up to 5ms of latency to isolated requests while improving throughput under bursty traffic (better batching when requests cluster together). Tuning this parameter is painful: the right value depends on your traffic patterns, which change throughout the day.
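For illustration, a time-window batcher fits in a few lines of asyncio. This is a hypothetical sketch - the names and structure are not Triton's API - but it shows where the window knob sits:

import asyncio
import time

# Minimal sketch of a dynamic-batching window, just to illustrate the timeout
# tradeoff discussed above. Names (collect_batch, queue) are hypothetical,
# not any framework's API.
async def collect_batch(queue: asyncio.Queue, max_batch: int = 32,
                        window_ms: float = 5.0) -> list:
    """Wait for the first request, then gather more until the window closes."""
    batch = [await queue.get()]                  # block until something arrives
    deadline = time.monotonic() + window_ms / 1000
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(await asyncio.wait_for(queue.get(), timeout=remaining))
        except asyncio.TimeoutError:
            break
    return batch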

The more fundamental problem: dynamic batching does not help with the decode-phase inefficiency. It helps with the prefill phase (processing the initial prompt) by batching multiple prefills together, but once you are in the decode phase generating tokens, you still have the variable-length completion problem.

Continuous Batching - The Modern Standard

Continuous batching (also called in-flight batching) makes scheduling decisions at the granularity of a single decode step rather than at the granularity of a full request. At the beginning of each decode step, the scheduler can:

  1. Check if any running requests have just completed (generated their last token or hit the max-length limit)
  2. Add new requests from the queue to fill the freed slots
  3. Run the decode step with the new combined batch

The result is that the GPU is always processing a batch of size B, where B is constrained by KV cache memory rather than by artificial batching windows. Short requests complete and are immediately replaced by new requests. The batch is continuously changing composition, but always at full capacity.

Time -->

Static batching:
[Req A (500 tok)][Req B (500 tok)][Req C (500 tok)][Req D (500 tok)]
████████████████████████████████████████████████████████ <- full batch
[Req E][Req F][Req G][Req H] <- waiting in queue

Continuous batching:
Step 1: [A][B][C][D]
Step 2: [A][B][C][D]
Step 10: [A][B][C][E] <- D finished at step 9, E entered immediately
Step 11: [A][B][F][E] <- C finished at step 10, F entered immediately
Step 25: [A][G][H][I] <- B, E, F finished, three new requests entered
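The scheduling loop behind this picture is small. The following is a minimal sketch of iteration-level scheduling; the names (waiting_queue, model_decode_step, has_kv_capacity) are hypothetical placeholders rather than any framework's internals.

# Minimal sketch of iteration-level (continuous) batching. All names here
# (waiting_queue, running, model_decode_step, has_kv_capacity) are hypothetical
# placeholders, not the API of any particular serving framework.
def serving_loop(waiting_queue, model_decode_step, has_kv_capacity, max_batch_size):
    running = []                                    # requests currently in the batch
    while True:
        # 1. Drop requests that finished on the previous step.
        running = [r for r in running if not r.finished]
        # 2. Pull new work into the freed slots, subject to KV cache capacity.
        while waiting_queue and len(running) < max_batch_size and has_kv_capacity():
            running.append(waiting_queue.pop(0))
        if not running:
            continue                                # nothing to do (a real system would wait here)
        # 3. One decode step for the whole (changing) batch: each request
        #    advances by exactly one token.
        model_decode_step(running)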

The key implementation requirement for continuous batching is that the KV cache must be managed dynamically. In static batching, you pre-allocate a contiguous KV cache block for each sequence's maximum length. In continuous batching, you do not know the final length of each sequence at the start, and you need to add and remove sequences from the batch mid-run. This is where paged KV cache (as implemented in vLLM) becomes essential.

Paged KV cache allocates KV cache in fixed-size blocks (e.g., 16 tokens per block). Each sequence gets as many blocks as it currently needs, and blocks are allocated/freed dynamically. A page table maps each sequence's logical token positions to physical blocks in GPU memory. This allows the KV cache to be non-contiguous in physical memory while appearing contiguous to the computation.
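A toy version of the block table conveys the idea; this is an illustration of the concept, not vLLM's actual data structures.

# Minimal sketch of a paged KV cache block table: logical token positions map
# to fixed-size physical blocks allocated on demand.
BLOCK_SIZE = 16  # tokens per block

class BlockTable:
    def __init__(self, num_physical_blocks: int):
        self.free_blocks = list(range(num_physical_blocks))
        self.tables: dict[str, list[int]] = {}      # seq_id -> physical block ids

    def append_token(self, seq_id: str, num_tokens_so_far: int) -> None:
        """Allocate a new physical block when a sequence crosses a block boundary."""
        table = self.tables.setdefault(seq_id, [])
        if num_tokens_so_far % BLOCK_SIZE == 0:     # current blocks are full
            table.append(self.free_blocks.pop())

    def free(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the free pool immediately."""
        self.free_blocks.extend(self.tables.pop(seq_id, []))

    def physical_location(self, seq_id: str, token_pos: int) -> tuple[int, int]:
        """Map a logical token position to (physical block, offset within block)."""
        return self.tables[seq_id][token_pos // BLOCK_SIZE], token_pos % BLOCK_SIZE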

The throughput improvement from continuous batching over static batching is typically 4-20x, depending on the distribution of request lengths. The largest gains come when request lengths vary widely (long-tailed distributions), because that is when static batching wastes the most GPU time on padding and idle periods.

Chunked Prefill - Solving the Prefill-Decode Interference Problem

Continuous batching solves the padding and idle GPU problems, but introduces a new problem: prefill-decode interference.

When a new request enters the continuously-batched system, it needs to process its full prompt (the prefill phase). Prefill is compute-intensive - you are running attention over the full sequence length in parallel. A 4096-token prompt prefill does roughly 4096x the arithmetic of a single decode step and can occupy the GPU for hundreds of milliseconds. During that prefill, the decode steps for all other requests in the batch are blocked.

The problem: you have 50 users in the middle of long responses (in decode phase), and a new user arrives with a 4096-token prompt. You insert the new request into the batch. For the next several hundred milliseconds, your 50 in-flight requests are frozen while the prompt is processed. These 50 users experience a sudden "pause" in their token stream - perceptible to humans as latency spikes.

This is the tension between Time to First Token (TTFT) and Inter-Token Latency (ITL). Prioritizing prefill reduces TTFT for the new request (good) but inflates ITL for every request already in the batch (bad).

Chunked prefill, introduced in Sarathi-Serve (Agrawal et al., 2023), solves this by breaking long prompts into chunks and processing one chunk per decode step:

  • Instead of processing a 4096-token prompt in one step, process 256 tokens per step
  • Interleave prompt chunks with decode steps for existing requests
  • New request's prompt is processed over 16 steps instead of 1, spreading the compute impact

The decode latency impact of chunked prefill is minimal: each step now has slightly more compute (the chunk), but the GPU is already memory-bandwidth-bound during decode, so the extra compute is "free" in terms of wall-clock time. The ITL improvement for existing requests is substantial - no more multi-hundred-millisecond pauses when new long prompts arrive.

Tuning the chunk size involves a tradeoff: smaller chunks are more fair to existing decode requests but add more overhead per token of prefill (more kernel launches, more scheduling overhead). Typical production values are 256-1024 tokens per chunk.
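As a rough sketch of the scheduling idea (hypothetical names, not Sarathi-Serve's or vLLM's code), a 4096-token prompt at a 256-token chunk size contributes one chunk to each of 16 consecutive steps while the decode batch keeps advancing:

# Minimal sketch of chunked prefill scheduling: one prompt chunk plus the
# current decode batch per forward pass. Names (schedule_step, pending_prompt)
# are hypothetical placeholders.
def schedule_step(pending_prompt: list[int], decode_batch: list,
                  chunk_size: int = 256) -> tuple[list[int], list[int]]:
    """Return (prompt tokens prefilled this step, remaining prompt tokens)."""
    chunk, rest = pending_prompt[:chunk_size], pending_prompt[chunk_size:]
    # One forward pass processes this chunk (prefill work for the new request)
    # together with one decode token for every running request.
    step_tokens = len(chunk) + len(decode_batch)
    print(f"step: {len(chunk)} prefill tokens + {len(decode_batch)} decode tokens "
          f"= {step_tokens} tokens this forward pass")
    return chunk, rest

# A 4096-token prompt arrives while 50 requests are mid-decode:
prompt_tokens = list(range(4096))
running_requests = ["req"] * 50
while prompt_tokens:
    _, prompt_tokens = schedule_step(prompt_tokens, running_requests)  # 16 steps at chunk_size=256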

Prefill-Decode Disaggregation - Architectural Separation

The most aggressive solution to the prefill-decode interference problem is to separate them entirely: run prefill on dedicated "prefill machines" and decode on dedicated "decode machines." This is called prefill-decode disaggregation (also called P-D disaggregation or "disaggregated serving").

The insight: prefill and decode have completely different hardware requirements.

Prefill is compute-bound. Processing a 4096-token prompt runs matrix-matrix multiplications with batch size proportional to sequence length. High FLOPS machines (A100, H100) maximize prefill throughput. The optimal hardware for prefill is the same as for training.

Decode is memory-bandwidth-bound. Each decode step is a matrix-vector multiplication (one token). You want maximum HBM bandwidth, not maximum FLOPS. Lower-cost GPUs with good bandwidth-to-cost ratios (L4, A10G) can be efficient for decode if their bandwidth is sufficient to load model weights fast enough.

In a disaggregated system:

  1. Request arrives at a load balancer
  2. Request is routed to a prefill machine, which processes the prompt and generates the KV cache
  3. The KV cache is transferred over the network to a decode machine
  4. Decode machine generates tokens and streams them back to the user

The KV cache transfer overhead (step 3) is the cost of disaggregation. For a LLaMA-3 70B model with 80 transformer layers and a hidden dimension of 8192, the full-attention KV cache for a 1024-token prompt is approximately $80 \times 2 \times 1024 \times 8192 \times 2 \text{ bytes} \approx 2.7 \text{ GB}$. Over a 100 Gbps InfiniBand link (~12.5 GB/s), this transfer takes on the order of 0.2 seconds, which adds meaningfully to TTFT.
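A quick calculation of that transfer cost, using the same full-attention KV cache formula used elsewhere in this article (the model and link numbers are illustrative assumptions):

# Quick estimate of the KV cache transfer cost in a disaggregated setup.
def kv_cache_bytes(num_layers: int, hidden_dim: int, seq_len: int,
                   dtype_bytes: int = 2) -> int:
    return num_layers * 2 * hidden_dim * seq_len * dtype_bytes   # K and V

def transfer_seconds(num_bytes: int, link_gbps: float = 100.0) -> float:
    return num_bytes / (link_gbps / 8 * 1e9)                     # Gbps -> bytes/s

kv = kv_cache_bytes(num_layers=80, hidden_dim=8192, seq_len=1024)
print(f"KV cache: {kv / 1e9:.1f} GB, transfer over 100 Gbps: "
      f"{transfer_seconds(kv):.2f} s")
# ~2.7 GB and ~0.2 s for a 1024-token prompt under these assumptions.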

The disaggregated approach is in active use at large-scale deployments (some have reported it as part of their production serving stack). The optimal deployment is hardware-heterogeneous: a few high-FLOPS A100/H100 prefill nodes serving many A10G/L4 decode nodes. The ratio of prefill to decode nodes depends on the request mix (prompt length distribution, output length distribution).

Architecture Diagrams

(Diagrams: Batching Strategy Comparison; Prefill-Decode Disaggregated Architecture; Throughput vs Latency Tradeoff Space)

Code Examples

vLLM Continuous Batching Configuration

from vllm import LLM, SamplingParams

# Basic synchronous vLLM with continuous batching
llm = LLM(
    model="meta-llama/Llama-3-70b-instruct",
    tensor_parallel_size=4,               # 4 GPUs for 70B model
    max_model_len=8192,                   # max total sequence length
    gpu_memory_utilization=0.90,          # leave 10% for overhead
    # Continuous batching parameters:
    max_num_seqs=256,                     # max concurrent sequences
    max_num_batched_tokens=8192,          # max tokens per decode step (across all seqs)
    enable_chunked_prefill=True,          # enable chunked prefill
    max_num_chunked_prefill_tokens=1024,  # chunk size
)

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=256,
)

# Batch of requests - vLLM handles batching internally
prompts = [
    "Explain quantum entanglement in simple terms.",               # ~100 token response expected
    "Write a haiku about distributed systems.",                    # ~20 token response expected
    "What is the capital of France?",                              # ~5 token response expected
    "Summarize the history of machine learning in 3 paragraphs.",  # ~300 token response
]

# vLLM processes these with continuous batching:
# - short requests complete first and free slots
# - new requests enter without waiting for the long summary to finish
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(f"Prompt: {output.prompt[:50]}...")
    print(f"Generated ({len(output.outputs[0].token_ids)} tokens): {output.outputs[0].text[:100]}")
    print()

Async LLM Engine for Production Serving

from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm.sampling_params import SamplingParams
import asyncio
import uuid

# Async engine for production use - handles continuous batching automatically
engine_args = AsyncEngineArgs(
    model="meta-llama/Llama-3-70b-instruct",
    tensor_parallel_size=4,
    max_model_len=8192,
    max_num_seqs=256,
    max_num_batched_tokens=8192,
    enable_chunked_prefill=True,
    max_num_chunked_prefill_tokens=512,  # smaller chunks = better ITL fairness
    gpu_memory_utilization=0.90,
    dtype="float16",
)

engine = AsyncLLMEngine.from_engine_args(engine_args)

async def generate_stream(prompt: str, max_tokens: int = 200):
    """Stream generation updates as they are produced."""
    request_id = str(uuid.uuid4())
    sampling_params = SamplingParams(
        temperature=0.7,
        max_tokens=max_tokens,
    )

    # Add to in-flight batch - continuous batching happens here
    async for output in engine.generate(prompt, sampling_params, request_id):
        if output.outputs:
            # Yield the cumulative generated text so far (vLLM outputs are cumulative)
            yield output.outputs[0].text

async def serve_concurrent_requests():
    """Demonstrate concurrent serving with continuous batching."""
    # Create 20 concurrent requests of varying lengths
    requests = [
        ("Short question: what is 2+2?", 20),
        ("Write a detailed explanation of attention mechanisms.", 500),
        ("Hello!", 10),
        ("Explain the entire history of computing in detail.", 1000),
    ] * 5  # 20 total requests

    async def handle_request(prompt, max_tokens, req_num):
        updates = []
        async for text in generate_stream(prompt, max_tokens):
            updates.append(text)
        print(f"Request {req_num}: {len(updates)} stream updates received")

    # Fire all requests concurrently - the engine batches them
    tasks = [
        handle_request(prompt, max_tokens, i)
        for i, (prompt, max_tokens) in enumerate(requests)
    ]
    await asyncio.gather(*tasks)

asyncio.run(serve_concurrent_requests())

Measuring Throughput vs Latency

import time
import asyncio
import numpy as np
from typing import List
from dataclasses import dataclass, field

from vllm.sampling_params import SamplingParams

@dataclass
class RequestMetrics:
    request_id: str
    prompt_tokens: int
    output_tokens: int
    ttft_ms: float                                      # Time to first token
    e2e_latency_ms: float                               # Total end-to-end latency
    itl_ms: List[float] = field(default_factory=list)   # Inter-token latencies

async def benchmark_serving(
    engine,
    prompts: List[str],
    max_tokens_list: List[int],
    concurrency: int = 32,
) -> List[RequestMetrics]:
    """
    Benchmark LLM serving with realistic concurrent load.

    concurrency: number of simultaneous in-flight requests.
    """
    semaphore = asyncio.Semaphore(concurrency)

    async def timed_request(prompt: str, max_tokens: int, req_id: int) -> RequestMetrics:
        async with semaphore:
            request_uuid = str(req_id)
            sampling_params = SamplingParams(max_tokens=max_tokens, temperature=0.0)

            start_time = time.perf_counter()
            first_token_time = None
            last_token_time = start_time
            token_times = []
            total_output_tokens = 0

            async for output in engine.generate(prompt, sampling_params, request_uuid):
                now = time.perf_counter()
                if first_token_time is None and output.outputs:
                    first_token_time = now

                if output.outputs:
                    new_tokens = len(output.outputs[0].token_ids)
                    if new_tokens > total_output_tokens:
                        itl = (now - last_token_time) * 1000
                        token_times.append(itl)
                        total_output_tokens = new_tokens
                        last_token_time = now

            end_time = time.perf_counter()

            return RequestMetrics(
                request_id=request_uuid,
                prompt_tokens=len(prompt.split()),  # approximate
                output_tokens=total_output_tokens,
                ttft_ms=(first_token_time - start_time) * 1000 if first_token_time else 0,
                e2e_latency_ms=(end_time - start_time) * 1000,
                itl_ms=token_times[1:],  # skip first (same as TTFT)
            )

    tasks = [
        timed_request(prompt, max_tokens, i)
        for i, (prompt, max_tokens) in enumerate(zip(prompts, max_tokens_list))
    ]
    metrics = await asyncio.gather(*tasks)
    return list(metrics)


def print_benchmark_summary(metrics: List[RequestMetrics]):
    ttfts = [m.ttft_ms for m in metrics]
    e2e = [m.e2e_latency_ms for m in metrics]
    all_itl = [itl for m in metrics for itl in m.itl_ms]
    total_output = sum(m.output_tokens for m in metrics)
    total_time_s = max(m.e2e_latency_ms for m in metrics) / 1000  # approximate wall time

    print(f"=== Benchmark Summary ({len(metrics)} requests) ===")
    print(f"Throughput: {total_output / total_time_s:.1f} tokens/sec")
    print(f"TTFT - Mean: {np.mean(ttfts):.1f}ms P50: {np.percentile(ttfts, 50):.1f}ms P99: {np.percentile(ttfts, 99):.1f}ms")
    print(f"E2E  - Mean: {np.mean(e2e):.1f}ms P50: {np.percentile(e2e, 50):.1f}ms P99: {np.percentile(e2e, 99):.1f}ms")
    print(f"ITL  - Mean: {np.mean(all_itl):.1f}ms P50: {np.percentile(all_itl, 50):.1f}ms P99: {np.percentile(all_itl, 99):.1f}ms")

Tuning max_num_seqs and max_num_batched_tokens

"""
Key vLLM parameters for continuous batching performance:

max_num_seqs:
- Maximum number of sequences processed simultaneously
- Limited by KV cache memory: each sequence needs (num_layers * 2 * head_dim * num_heads * seq_len * 2 bytes)
- For LLaMA-3 70B at max_model_len=4096: ~4 GB per sequence
- With 80 GB GPU x4 (320 GB), budget ~50 GB for KV cache -> ~12 max sequences per server
- Set this based on KV cache memory budget, not arbitrarily

max_num_batched_tokens:
- Maximum total tokens processed per forward pass (across all sequences)
- Affects both prefill throughput and decode batch size
- Setting too low: decode batches are small, low GPU utilization
- Setting too high: large prefills block decode for too long
- Good starting point: max_model_len (e.g., 4096 for a 4K context model)
- With chunked prefill: set to chunk_size * num_concurrent_prefills + decode_batch_tokens

enable_chunked_prefill + max_num_chunked_prefill_tokens:
- Enables Sarathi-style chunked prefill
- chunk_size = max_num_chunked_prefill_tokens
- Smaller = more fairness for decode, more overhead per prompt token
- Larger = faster TTFT, more ITL spikes
- Start at 512 and tune based on P99 ITL

swap_space:
- CPU memory to use for KV cache offloading when GPU is full
- Allows serving more concurrent sequences at cost of swapping latency
- Only useful for highly variable request lengths
- Set to 4-16 GB for production deployments with long-tail request sizes
"""

# Example: profile-guided configuration
# Step 1: measure your production traffic
def analyze_traffic_distribution(access_logs):
prompt_lengths = []
output_lengths = []
for request in access_logs:
prompt_lengths.append(request["prompt_token_count"])
output_lengths.append(request["output_token_count"])

print(f"Prompt length P50: {np.percentile(prompt_lengths, 50):.0f}")
print(f"Prompt length P95: {np.percentile(prompt_lengths, 95):.0f}")
print(f"Prompt length P99: {np.percentile(prompt_lengths, 99):.0f}")
print(f"Output length P50: {np.percentile(output_lengths, 50):.0f}")
print(f"Output length P95: {np.percentile(output_lengths, 95):.0f}")
print(f"Output length P99: {np.percentile(output_lengths, 99):.0f}")

# Rough max_model_len recommendation
max_needed = np.percentile(
[p + o for p, o in zip(prompt_lengths, output_lengths)], 99
)
print(f"\nRecommended max_model_len: {int(max_needed * 1.1)}") # 10% headroom

FastAPI + vLLM Production Endpoint

from fastapi import FastAPI, HTTPException
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.sampling_params import SamplingParams
import uuid
import json

app = FastAPI()

# Engine initialization (done once at startup)
engine_args = AsyncEngineArgs(
    model="meta-llama/Llama-3-8b-instruct",
    tensor_parallel_size=1,
    max_num_seqs=128,
    max_num_batched_tokens=4096,
    enable_chunked_prefill=True,
    gpu_memory_utilization=0.90,
)
engine = AsyncLLMEngine.from_engine_args(engine_args)

class CompletionRequest(BaseModel):
    prompt: str
    max_tokens: int = 200
    temperature: float = 0.7
    stream: bool = False

@app.post("/v1/completions")
async def create_completion(request: CompletionRequest):
    request_id = str(uuid.uuid4())
    sampling_params = SamplingParams(
        temperature=request.temperature,
        max_tokens=request.max_tokens,
    )

    if request.stream:
        async def token_stream():
            async for output in engine.generate(
                request.prompt, sampling_params, request_id
            ):
                if output.outputs:
                    chunk = {
                        "id": request_id,
                        "object": "text_completion",
                        "choices": [{
                            "text": output.outputs[0].text,
                            "finish_reason": output.outputs[0].finish_reason,
                        }]
                    }
                    yield f"data: {json.dumps(chunk)}\n\n"
            yield "data: [DONE]\n\n"

        return StreamingResponse(token_stream(), media_type="text/event-stream")

    # Non-streaming: wait for full completion
    final_output = None
    async for output in engine.generate(request.prompt, sampling_params, request_id):
        final_output = output

    if final_output is None:
        raise HTTPException(status_code=500, detail="Generation failed")

    return {
        "id": request_id,
        "object": "text_completion",
        "choices": [{
            "text": final_output.outputs[0].text,
            "finish_reason": final_output.outputs[0].finish_reason,
            "token_count": len(final_output.outputs[0].token_ids),
        }]
    }

Production Engineering Notes

SLA Design: TTFT vs ITL vs E2E Latency

Production LLM deployments typically have three separate SLA targets, and it is important to understand what each measures:

TTFT (Time to First Token): How long from request submission to the first token appearing. This is dominated by (a) queue wait time and (b) prefill time. Users perceive this as the system "thinking." SLA targets are typically 200-2000ms depending on use case.

ITL (Inter-Token Latency): Time between successive tokens in the generated response. This is dominated by decode batch size and weight memory bandwidth. Users perceive this as the "typing speed." Typical target is 20-80ms per token for real-time applications.

E2E (End-to-End Latency): Total time to receive the complete response. This matters for batch or async use cases. E2E = TTFT + (output_tokens - 1) x ITL.

Optimizing for all three simultaneously requires understanding the tradeoffs. A larger batch size improves throughput and reduces per-token cost, but increases ITL. Chunked prefill reduces ITL spikes but adds overhead to TTFT. The right operating point depends on your application: chatbots are ITL-sensitive, code generation tools are E2E-sensitive, and document processing pipelines care most about throughput.
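A tiny helper makes the E2E relationship concrete; the targets plugged in below are illustrative, not recommendations.

# Sanity check of the E2E latency relationship above.
def e2e_latency_ms(ttft_ms: float, itl_ms: float, output_tokens: int) -> float:
    return ttft_ms + (output_tokens - 1) * itl_ms

# A 200-token chat response at TTFT = 500 ms and ITL = 40 ms:
print(f"{e2e_latency_ms(500, 40, 200) / 1000:.2f} s end-to-end")  # ~8.46 s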

KV Cache Memory Planning

def estimate_kv_cache_memory_gb(
    model_layers: int,
    num_heads: int,
    head_dim: int,
    max_seq_len: int,
    dtype_bytes: int = 2,  # FP16
) -> float:
    """
    Estimate KV cache memory per sequence (full attention, no GQA).

    For LLaMA-3 70B dimensions at full attention:
    layers=80, num_heads=64, head_dim=128, seq_len=8192
    -> 80 * 2 * 64 * 128 * 8192 * 2 bytes = ~20 GB per sequence
    """
    # K and V are separate, hence the factor of 2
    bytes_per_seq = model_layers * 2 * num_heads * head_dim * max_seq_len * dtype_bytes
    return bytes_per_seq / (1024 ** 3)


def max_concurrent_sequences(
    total_gpu_memory_gb: float,
    model_weights_gb: float,
    kv_cache_per_seq_gb: float,
    overhead_fraction: float = 0.10,
) -> int:
    """
    Calculate maximum concurrent sequences given GPU memory.
    """
    available_for_kv = total_gpu_memory_gb * (1 - overhead_fraction) - model_weights_gb
    if available_for_kv <= 0:
        return 0
    return int(available_for_kv / kv_cache_per_seq_gb)


# LLaMA-3 70B on 4x A100 80GB
kv_per_seq = estimate_kv_cache_memory_gb(
    model_layers=80,
    num_heads=64,
    head_dim=128,
    max_seq_len=4096,
)
print(f"KV cache per sequence (4K context): {kv_per_seq:.1f} GB")

max_seqs = max_concurrent_sequences(
    total_gpu_memory_gb=4 * 80,  # 320 GB total
    model_weights_gb=140,        # 70B x 2 bytes
    kv_cache_per_seq_gb=kv_per_seq,
)
print(f"Max concurrent sequences: {max_seqs}")
# ~14 under this full-attention estimate; models using grouped-query attention
# (fewer KV heads) need far less KV cache per sequence, so real limits are higher

Monitoring in Production

Key metrics to track in production:

  • Batch size over time: Should be near max_num_seqs when load is high. If consistently below 50%, either traffic is low or requests are too short to benefit from batching
  • Queue depth (waiting requests): If this grows unboundedly, you need more capacity or lower SLAs
  • GPU memory utilization: Should be 85-95% with continuous batching and paged KV cache. Lower means you are under-utilizing capacity
  • Prefill throughput (tokens/sec): Limited by compute. If this is a bottleneck, consider chunked prefill or prefill disaggregation
  • Decode throughput (tokens/sec): Limited by memory bandwidth. Improve by increasing batch size
  • P99 TTFT: Spikes indicate prefill blocking. Fix with chunked prefill or smaller max_num_chunked_prefill_tokens
  • P99 ITL: Spikes indicate large prefills entering the batch. Fix with smaller chunk size

# vLLM's OpenAI-compatible server exposes Prometheus metrics at a /metrics endpoint
# Key Prometheus metrics:
# vllm:num_requests_running - current batch size
# vllm:num_requests_waiting - queue depth
# vllm:gpu_cache_usage_perc - KV cache utilization
# vllm:time_to_first_token_seconds - TTFT histogram
# vllm:time_per_output_token_seconds - ITL histogram
# vllm:generation_tokens_total - total tokens generated

# Example Prometheus query for P99 TTFT:
# histogram_quantile(0.99, rate(vllm:time_to_first_token_seconds_bucket[5m]))

# Example Grafana dashboard alert:
# Alert if P99 TTFT > 2s for more than 1 minute

Common Mistakes

:::danger Static Batch Size Deployment for LLMs

The single most common mistake in initial LLM deployments is treating generation as static batching - accepting a batch of requests, running them to completion together, and then starting the next batch. This pattern comes naturally from experience with classification or embedding models.

For LLMs, this wastes 60-80% of GPU capacity due to the variable-length problem. Never deploy an LLM without continuous batching. vLLM, TensorRT-LLM, and Hugging Face TGI all provide continuous batching by default. If your serving stack does not use it, fix this before anything else.

:::

:::danger Setting max_num_seqs Too High Without KV Cache Analysis

Setting max_num_seqs=512 sounds like it should maximize throughput. In practice, if your KV cache memory cannot support 512 concurrent sequences at your target context length, the scheduler never actually reaches that batch size - or it spends most of its time preempting and swapping KV cache blocks to CPU, which is slower than simply running fewer concurrent sequences.

Always calculate your KV cache memory budget before setting max_num_seqs. The formula is above. A rule of thumb: leave at least 15% of total GPU memory for KV cache beyond what the model weights need.

:::

:::warning Ignoring the Prefill-Decode Interference Problem

Enabling continuous batching without enabling chunked prefill will improve average throughput but can worsen P99 ITL significantly. A single 8192-token prefill can freeze all decode operations for 500ms+, causing a visible pause in token streams for every other user.

Enable chunked prefill in vLLM (enable_chunked_prefill=True, max_num_chunked_prefill_tokens=512) for any production deployment that serves interactive users. The throughput cost is minimal (< 5%) and the ITL improvement is dramatic.

:::

:::warning Benchmarking with Synthetic Uniform Requests

Benchmarking your LLM server with 1000 requests of identical length will give misleading throughput numbers. Continuous batching and chunked prefill show their biggest advantages when request lengths are heterogeneous. A synthetic benchmark with all 512-token prompts and all 200-token outputs will not reveal the padding waste and head-of-line blocking that appear in production with real traffic distributions.

Always benchmark with realistic traffic shapes sampled from production logs. For pre-production estimates, use distributions that match your use case: long document processing has very different optimal configurations than short chat completions.

:::

Interview Questions and Answers

Q1: Explain why batching is so much more important for LLM decode than for image classification inference.

Image classification is compute-bound at batch sizes above ~8. Increasing batch from 1 to 32 on a ResNet gives moderate improvements because the convolution kernels are already efficient at batch 1 - you are limited by compute, not memory bandwidth. The improvement from batching is roughly proportional to how much parallelism you add over kernel launch overhead.

LLM decode is fundamentally memory-bandwidth-bound. Each decode step generates one token, which requires reading all model weights from HBM (2 TB/s on A100) to perform matrix-vector multiplications. Whether you batch 1 or 64 requests, you read the same 140 GB of weights. The difference is that at batch 64, you generate 64 tokens from one weight read instead of 1 token. Throughput scales nearly linearly with batch size until you become compute-bound (which for most models happens above batch 64-256). At batch 1, a 70B model runs at roughly 1-2% of theoretical arithmetic efficiency. At batch 64, this climbs to 60-80%. Batching is not an optimization for LLM decode, it is a prerequisite for efficient operation.

Q2: What is continuous batching and how does it differ from dynamic batching?

Dynamic batching groups requests that arrive within a time window (e.g., 5ms) and processes them together as a batch from start to finish. It still uses static batch processing - once a batch starts, it runs to completion. New requests wait for the entire current batch to finish.

Continuous batching (Orca 2022) makes scheduling decisions at the granularity of a single decode step. At the start of each decode iteration, the scheduler checks whether any sequences have completed and removes them. If the batch has available capacity, it adds new requests from the queue. The batch composition changes every decode step. This eliminates the "slow request holds up new requests" problem of static batching. A 5-token request completes and frees its slot in 5 steps. A 500-token request runs in the background without blocking new arrivals.

The implementation requires paged KV cache (vLLM's contribution) to manage the dynamic memory allocation efficiently - you cannot pre-allocate contiguous KV cache blocks when sequence lengths are unknown.

Q3: Walk through how paged KV cache enables continuous batching. What was the problem it solved?

Before paged KV cache, LLM serving frameworks pre-allocated a contiguous memory block for each sequence's KV cache equal to max_sequence_length. This wastes memory because most sequences are shorter than max length. More critically, it fragments GPU memory: you have many large pre-allocated blocks, most partially empty. You cannot fit as many concurrent sequences as the total memory would theoretically allow.

Paged KV cache treats KV cache memory like OS virtual memory. Physical memory is divided into fixed-size blocks (e.g., 16 tokens each). Each sequence has a logical view of its KV cache (token positions 0, 1, 2, ..., N), but the physical blocks that store this data are scattered in memory. A page table maps each logical block to a physical block. Physical blocks are allocated on demand as the sequence generates tokens, and freed immediately when the sequence completes.

This allows continuous batching to work efficiently because: (1) you never waste memory on pre-allocated but unused sequence slots, (2) sequences can grow to any length without fragmentation issues, (3) blocks freed by completed sequences are immediately available for new sequences. In practice, paged KV cache reduces KV cache memory waste from 60-80% (in systems with static allocation) to under 5%.

Q4: Explain the prefill-decode interference problem and describe two approaches to solving it.

During prefill (processing the user's prompt), the model runs attention over the full sequence length in parallel. This is compute-intensive: a 4096-token prefill requires O(n^2) attention computation. During this time, every other sequence that is in the middle of generating tokens (decode phase) must wait. Users receiving streamed token responses experience a sudden pause - their token stream freezes for potentially hundreds of milliseconds.

Two solutions: First, chunked prefill (Sarathi-Serve 2023) breaks the prefill into small chunks (e.g., 256 tokens) and processes one chunk per decode step. The decode steps for other sequences continue between prefill chunks. The prefill takes more steps to complete, but ITL for in-flight sequences is nearly unchanged. The compute cost per chunk is small since each chunk is a tiny fraction of the full attention computation.

Second, prefill-decode disaggregation runs prefill and decode on separate hardware. Prefill machines are compute-heavy (A100/H100) and process prompts without impacting any decode operations. The resulting KV cache is transferred to decode machines that handle token generation. This fully eliminates the interference at the cost of network transfer overhead for KV cache data and the operational complexity of managing two separate hardware pools.

Q5: Your LLM serving cluster has P99 TTFT of 8 seconds and P50 TTFT of 200ms. The P99 ITL is 150ms and P50 ITL is 25ms. What is likely happening, and what would you tune?

The large gap between P50 and P99 TTFT (200ms vs 8 seconds) indicates head-of-line blocking by long prompts. Most requests get processed quickly, but some requests with very long prompts are sitting in the queue behind other long prompts being processed. The prefill for a single 8192-token prompt could take 6-8 seconds, during which all queued requests wait.

The high P99 ITL (150ms vs 25ms P50) confirms this interpretation: when long prefills enter the batch mid-generation, they stall the decode steps for existing requests. The existing requests experience multi-hundred-millisecond ITL spikes.

Fixes in order: First, enable chunked prefill if not already enabled. Set max_num_chunked_prefill_tokens=256 to aggressively interleave prefill with decode. This should collapse the ITL P99. Second, check the queue scheduling policy - are very long prompts being processed first? Implement preemption-based scheduling where very long prompts can be partially preempted in favor of shorter ones. Third, if the cluster serves a mix of very short and very long prompts, consider prefill-decode disaggregation: route long prompts to dedicated prefill nodes so they do not interfere with the decode pool at all.

Monitor the fix by checking that P99 TTFT drops to within 3-5x of P50 TTFT, and that P99 ITL drops to within 5x of P50 ITL.

Q6: How would you size a vLLM deployment to handle 500 requests per second with a P99 TTFT SLA of 1 second for a 70B model? What information do you need to answer this?

This is a capacity planning problem. You need: (1) the distribution of prompt lengths - P50, P95, P99, (2) the distribution of output lengths - same percentiles, (3) whether you can use FP16 or need FP8/INT8, (4) your hardware options and their cost.

Starting assumptions: prompt P50 = 512 tokens, output P50 = 200 tokens, 4x A100 80GB per serving node.

Throughput per node: A 70B FP16 model with continuous batching at batch ~32 does roughly 1500-2000 tokens/second decode throughput on 4x A100. With 200 output tokens per request, that is 7-10 requests/second per node. For 500 RPS, you need roughly 50-70 nodes.

For the P99 TTFT SLA, the bottleneck is prefill throughput (compute-bound). At 512 tokens average prompt, a 4x A100 node can prefill roughly 5000-8000 tokens/second. With 500 RPS arriving, you have 250,000 tokens/second of prefill load. With each node doing 6000 tokens/second of prefill, you again need ~40 nodes for prefill capacity.

The P99 constraint changes things: P99 prompt may be 4096 tokens (long document use cases). A single 4096-token prefill at 6000 tokens/second takes 0.68 seconds. With chunked prefill and a queue of requests, P99 TTFT could easily exceed 1 second if prefill capacity is tight. You either over-provision (add 20% more nodes) or implement prefill disaggregation so prefill nodes can run at full compute without sharing resources with decode.

The honest answer: prototype with vLLM benchmark tools (python -m vllm.benchmarks.benchmark_serving) against a sample of real traffic before committing to a production size. Benchmarks on synthetic data are unreliable for SLA planning.

Q7: What is the roofline model for LLM decode, and how does batch size determine whether you are compute-bound or memory-bandwidth-bound?

The roofline model says that achievable throughput is the minimum of two ceilings: the compute roof (peak FLOP/s) and the memory roof (memory bandwidth multiplied by arithmetic intensity).

For LLM decode, arithmetic intensity is defined as FLOPs per byte of memory read. For a weight matrix of shape $[d_{model}, d_{ff}]$ processed against a batch of $B$ tokens:

  • FLOPs: $2 \times B \times d_{model} \times d_{ff}$ (matrix-matrix multiply)
  • Bytes read: $2 \times d_{model} \times d_{ff}$ (weights in FP16)
  • Arithmetic intensity: $B$ FLOPs/byte

The ridge point of the roofline is where arithmetic intensity equals the hardware's FLOPs-to-bandwidth ratio: $B_{ridge} = \text{peak FLOP/s} / \text{bandwidth}$. For an A100: $312 \times 10^{12} / (2 \times 10^{12}) = 156$ FLOPs/byte, so $B_{ridge} \approx 156$.

This means: below batch size ~156, LLM decode is memory-bandwidth-bound. You are not using the A100's compute efficiently. Throughput scales linearly with batch size. Above batch size ~156, decode becomes compute-bound and throughput plateaus.

In practice, most production LLM serving runs at batch sizes well below this ridge point (batch 32-64 is common), meaning the primary optimization target is increasing batch size, not arithmetic efficiency. This is the quantitative argument for why continuous batching and high max_num_seqs configurations improve throughput.
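A few lines of arithmetic reproduce the ridge-point argument (A100-class numbers assumed, as above):

# Tiny roofline check for decode, using the A100-class numbers from this answer.
PEAK_FLOPS = 312e12     # FP16 tensor-core FLOPs/s
BANDWIDTH = 2e12        # bytes/s

def decode_regime(batch_size: int) -> str:
    """Arithmetic intensity for decode is ~batch_size FLOPs/byte (see above)."""
    ridge = PEAK_FLOPS / BANDWIDTH            # ~156 FLOPs/byte
    return "compute-bound" if batch_size > ridge else "memory-bandwidth-bound"

for b in (1, 32, 64, 256):
    print(f"batch {b:>3}: {decode_regime(b)}")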

Q8: Describe the tradeoff between speculative decoding and continuous batching. Do they work well together?

Speculative decoding uses a small "draft" model to propose multiple tokens ahead, then verifies them in parallel with the large "target" model. When speculation succeeds, you generate multiple tokens in the time of a single target model forward pass. This reduces latency for individual requests.

Continuous batching and speculative decoding have a subtle tension. Continuous batching benefits from large batch sizes to amortize weight reads. Speculative decoding works best for individual requests with high acceptance rates (which depends on the draft model quality and the input distribution). When you combine them, the batch contains some requests on draft step N and others on different draft steps - the variable-length speculation makes batch homogeneity harder to achieve.

The practical impact: speculative decoding reduces latency by 2-3x for a single request in isolation. In a heavily batched continuous batching system at high utilization, the latency benefit shrinks because the bottleneck shifts from individual request decode speed to batch scheduling overhead. Speculative decoding is most effective in low-to-medium load scenarios where batch sizes are naturally small (under 16), or in disaggregated setups where single-user decode is common.

TensorRT-LLM and vLLM both support speculative decoding with continuous batching, handling the variable acceptance lengths through careful synchronization. The engineering complexity is substantial: rejected tokens must be cleaned out of the KV cache, and the batch scheduler must handle variable step counts per sequence.

Summary - Choosing Your Batching Strategy

Different workloads call for different configurations. The decision tree below captures the most common production choices:

  • Small model (7B-13B), interactive chat: Enable continuous batching with moderate max_num_seqs (64-128). Enable chunked prefill with chunk size 512. Single-pool deployment is fine.

  • Large model (70B+), mixed workload: Enable continuous batching with max_num_seqs sized to your KV cache budget. Use chunked prefill with chunk size 256 for better ITL. Monitor P99 ITL - if it exceeds 5x P50, reduce chunk size further.

  • Very long context (32K+ tokens), document processing: Prefill-decode disaggregation becomes attractive. Prefill of 32K tokens takes too long to interleave gracefully with decode even with chunked prefill. Separate prefill nodes handle the heavy compute; decode nodes handle the streaming.

  • Batch/offline workloads (no latency SLA): Maximize max_num_batched_tokens and max_num_seqs. Disable chunked prefill (or use large chunks). Throughput is the only metric.

  • Embedding models (no generation): Static batching with dynamic batching windows is fine. There is no decode phase and no variable-length completion problem. Use ONNX + TensorRT with Triton dynamic batching.
