Autoregressive Decoding
The Production Scenario
It is 2:47 AM on a Tuesday. Your on-call phone buzzes. The dashboard shows P95 latency for the coding assistant has spiked from 800 ms to 6.2 seconds. Users are abandoning requests. Revenue is dropping. You pull up the GPU metrics: utilization is 45%, nowhere near saturated. Memory bandwidth utilization is 98%. You stare at the screen trying to understand how the GPU can be nearly idle on compute but completely bottlenecked on memory.
Three weeks later, after reading every inference optimization paper you can find, the answer becomes obvious: LLM decoding is not a compute problem. It was never a compute problem. You were solving the wrong equation entirely.
This lesson explains the fundamental reason inference is hard to optimize, why it defies your intuition about GPU utilization, and what the actual bottleneck is. Everything else in this module - KV caching, quantization, continuous batching, speculative decoding - only makes sense once you understand autoregressive decoding at the hardware level.
The insight is deceptively simple. During each decode step, you are loading the entire model from GPU memory to compute just one token. One token. For LLaMA-2 70B, that is 140 GB of weights loaded from HBM2e memory running at 2 TB/s - taking about 70 milliseconds of memory transfer time - to perform a trivial amount of arithmetic. The compute units sit idle, waiting for data. You are bottlenecked by memory bandwidth, not FLOP count.
Until you internalize this fact, you will keep trying to optimize the wrong thing.
Why This Exists: The Fundamental Problem
Language models are probabilistic. They do not "know" the next word - they compute a probability distribution over the entire vocabulary and sample from it. To get the second word, they need the first word as context. To get the third, they need the first two. This creates an unavoidable sequential dependency: you cannot compute token until you have token .
This is what "autoregressive" means. The model feeds its own output back as input. Every modern LLM - GPT-4, LLaMA, Claude, Gemini - works this way at inference time.
The sequential nature of decoding is the root cause of every performance challenge in LLM serving. You cannot simply throw more parallel compute at it. Each decode step is a tiny computation with enormous memory requirements, executed one step at a time.
Historical Context
The transformer architecture (Vaswani et al., 2017) was designed primarily for training efficiency. The attention mechanism - allowing every token to attend to every other token - was revolutionary for training, where you can process the entire sequence in parallel. The authors did not anticipate that serving these models would become a multi-billion-dollar engineering problem.
Early GPT models (2018–2020) were small enough that inference latency was not the dominant concern. When GPT-3 (Brown et al., 2020) arrived at 175B parameters, the scale changed everything. The inference community began seriously studying the memory-bandwidth bottleneck. The 2022–2023 period saw an explosion of inference optimization work: Flash Attention (Dao et al., 2022), PagedAttention (Kwon et al., 2023), LLM.int8() (Dettmers et al., 2022), and speculative decoding (Leviathan et al., 2022; Chen et al., 2022) all emerged in roughly 18 months.
The Two Phases of LLM Inference
LLM inference has two fundamentally different phases with different performance characteristics.
Phase 1: Prefill
During prefill, you process the entire input prompt in a single forward pass. If your prompt is 512 tokens, the transformer processes all 512 tokens simultaneously using the parallel attention mechanism.
Characteristics:
- All input tokens are known upfront - fully parallel
- Compute-intensive: many tokens, many attention operations
- Produces the KV cache for all input tokens
- Outputs the probability distribution for the first generated token
- Duration scales roughly linearly with prompt length
For a 512-token prompt on an A100 80GB, prefill typically completes in 20–100 ms depending on model size.
Phase 2: Decode
After prefill, you enter the decode loop. Each iteration:
- Takes the last generated token as input (or the last token from prefill on the first step)
- Runs a full forward pass through all model layers
- Computes attention over all previous tokens using the KV cache
- Samples the next token from the output distribution
- Appends the new token and repeats
Characteristics:
- Each step generates exactly one token
- Memory-bandwidth-bound (see roofline model below)
- Duration per step is roughly constant (independent of how many tokens you've already generated, given a fixed batch size)
- Total decode time scales linearly with output length
Why Decoding Is Memory-Bandwidth Bound
The Roofline Model
The roofline model is a visual performance model that tells you whether a computation is limited by compute (FLOPs) or memory bandwidth. Every operation falls somewhere on this curve.
Arithmetic intensity is the key metric:
High arithmetic intensity = compute-bound. Low arithmetic intensity = memory-bound.
For an A100 SXM GPU:
- Peak compute: 312 TFLOPS (FP16 tensor cores)
- Peak memory bandwidth: 2,000 GB/s
- Roofline "ridge point": FLOPs/byte
Any operation with arithmetic intensity below 156 FLOPs/byte on an A100 is memory-bandwidth-bound.
Arithmetic Intensity of a Decode Step
During a single decode step (batch size = 1), for a linear layer with weight matrix :
- FLOPs: (matrix-vector multiply)
- Memory bytes read: (reading the weight matrix in FP16)
- Arithmetic intensity: FLOP/byte
Arithmetic intensity of 1 FLOP/byte. The A100 ridge point is 156. We are 156× below the threshold for compute-bound operation.
This is the core insight: at batch size 1, you spend 99.4% of your time waiting for weight data to arrive from HBM memory. The tensor cores do trivial work. The bottleneck is pure memory bandwidth.
Increasing Batch Size Helps - To A Point
With batch size :
- FLOPs:
- Memory bytes: still (weights loaded once, reused for all batch items)
- Arithmetic intensity: FLOPs/byte
At batch size 156, you finally reach the A100 ridge point and become compute-bound. This is why batching helps so much - it amortizes the weight loading cost across multiple requests.
But there is a catch: at batch size 156 with a 70B model, your KV cache alone could require hundreds of GB of memory. You quickly run out of GPU RAM before reaching the optimal batch size.
GPU Memory Breakdown
Understanding what lives in GPU memory is essential for capacity planning.
For LLaMA-2 70B in FP16:
| Component | Size | Notes |
|---|---|---|
| Model weights | 140 GB | 70B params × 2 bytes (FP16) |
| KV cache (one sequence, 4096 ctx) | ~16 GB | See Module 02 for formula |
| Activations | ~1–2 GB | Temporary per forward pass |
| CUDA context + framework overhead | ~2–3 GB | Fixed cost |
An A100 80GB can barely fit the model weights. You need at least 2× A100s just to load the model, leaving no headroom for the KV cache. This is why large models require multi-GPU serving and why quantization is so important.
The Memory Capacity Formula
Where the factor of 2 is for K and V tensors. For LLaMA-2 70B (80 layers, 64 heads, 128 head dim, FP16):
This is why KV cache management is the central problem in LLM serving.
Latency Metrics
Production SLO definitions for LLM services:
Time to First Token (TTFT)
The time from request submission to receiving the first generated token. Dominated by the prefill phase.
A good TTFT for interactive applications: under 200 ms at P50, under 500 ms at P95.
Time Per Output Token (TPOT)
The average time between consecutive generated tokens during the decode phase. This determines how fast text appears on screen for streaming applications.
For a 70B model on 2× A100s with batch size 1: roughly 25–40 ms per token (25–40 tokens/sec).
End-to-End Latency
For a response of 200 tokens: - 7.5 seconds.
Throughput
Throughput and per-request latency trade off against each other. Larger batches increase throughput but increase TTFT (because new requests wait for the current batch to process).
Code: Profiling Prefill vs Decode Latency
import time
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
def profile_inference(model_name: str = "meta-llama/Llama-2-7b-hf"):
"""
Profile prefill and decode phases separately.
Requires ~14GB GPU memory for 7B FP16.
"""
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.float16,
device_map="cuda"
)
model.eval()
prompt = "Explain the transformer architecture in detail, covering " * 20
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
input_ids = inputs["input_ids"]
print(f"Prompt length: {input_ids.shape[1]} tokens")
# --- Benchmark prefill ---
torch.cuda.synchronize()
NUM_RUNS = 5
prefill_times = []
for _ in range(NUM_RUNS):
torch.cuda.synchronize()
t0 = time.perf_counter()
with torch.no_grad():
# Single forward pass - this IS the prefill
outputs = model(input_ids, use_cache=True)
past_key_values = outputs.past_key_values
torch.cuda.synchronize()
t1 = time.perf_counter()
prefill_times.append((t1 - t0) * 1000)
prefill_ms = sum(prefill_times) / len(prefill_times)
print(f"Prefill latency: {prefill_ms:.1f} ms avg over {NUM_RUNS} runs")
# --- Benchmark single decode step ---
# Start from the prefill state, generate one more token
decode_times = []
last_token = outputs.logits[:, -1, :].argmax(dim=-1, keepdim=True)
for _ in range(20): # 20 decode steps
torch.cuda.synchronize()
t0 = time.perf_counter()
with torch.no_grad():
step_outputs = model(
input_ids=last_token,
past_key_values=past_key_values,
use_cache=True
)
past_key_values = step_outputs.past_key_values
last_token = step_outputs.logits[:, -1, :].argmax(dim=-1, keepdim=True)
torch.cuda.synchronize()
t1 = time.perf_counter()
decode_times.append((t1 - t0) * 1000)
decode_ms = sum(decode_times[2:]) / len(decode_times[2:]) # Skip warmup
print(f"Decode step latency: {decode_ms:.1f} ms avg (= {1000/decode_ms:.0f} tokens/sec)")
# Compute theoretical minimum based on memory bandwidth
# Rough model: 7B params, FP16 = 14GB, must read entire model each step
model_size_gb = sum(p.numel() * 2 for p in model.parameters()) / 1e9
bw_gb_per_sec = 2000 # A100 peak bandwidth in GB/s
theoretical_min_ms = (model_size_gb / bw_gb_per_sec) * 1000
print(f"\nModel size: {model_size_gb:.1f} GB")
print(f"Theoretical min decode step (BW-bound): {theoretical_min_ms:.1f} ms")
print(f"Hardware efficiency: {theoretical_min_ms / decode_ms * 100:.0f}%")
if __name__ == "__main__":
profile_inference()
Expected output for LLaMA-2 7B on an A100 80GB:
Prompt length: 256 tokens
Prefill latency: 18.3 ms avg over 5 runs
Decode step latency: 8.2 ms avg (= 122 tokens/sec)
Model size: 13.5 GB
Theoretical min decode step (BW-bound): 6.75 ms
Hardware efficiency: 82%
The hardware efficiency of ~82% means we are achieving 82% of the theoretical maximum given memory bandwidth. The remaining 18% is overhead from kernel launches, attention computation, and other operations.
Measuring Arithmetic Intensity with Torch Profiler
import torch
from torch.profiler import profile, ProfilerActivity
def measure_arithmetic_intensity(model, input_ids, past_key_values=None):
"""
Use torch profiler to estimate FLOPs and memory operations.
"""
with profile(
activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
record_shapes=True,
with_flops=True,
with_modules=True
) as prof:
with torch.no_grad():
if past_key_values is not None:
# Decode step
out = model(
input_ids=input_ids[:, -1:],
past_key_values=past_key_values,
use_cache=True
)
else:
# Prefill
out = model(input_ids, use_cache=True)
# Summarize
events = prof.key_averages()
total_flops = sum(e.flops for e in events if e.flops > 0)
print(f"Estimated FLOPs: {total_flops / 1e9:.1f} GFLOPs")
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
return total_flops
Understanding the Roofline Visually
This diagram captures the key insight: prefill is compute-bound (good GPU utilization), decode at small batch sizes is deep in memory-bandwidth-bound territory.
Production Engineering Notes
SLO Definition
Define separate SLOs for TTFT and TPOT:
# Example SLO configuration for a production LLM API
SLO_CONFIG = {
"chat": {
"ttft_p50_ms": 150,
"ttft_p95_ms": 400,
"ttft_p99_ms": 800,
"tpot_p50_ms": 35,
"tpot_p95_ms": 70,
"tpot_p99_ms": 120,
},
"batch_generation": {
"ttft_p50_ms": 500,
"ttft_p95_ms": 2000,
"tpot_p50_ms": 20,
"tpot_p95_ms": 50,
}
}
Measuring TTFT and TPOT in Production
import time
from dataclasses import dataclass, field
from typing import List
import statistics
@dataclass
class InferenceMetrics:
request_id: str
prompt_tokens: int
start_time: float = field(default_factory=time.time)
first_token_time: float = None
token_timestamps: List[float] = field(default_factory=list)
@property
def ttft_ms(self) -> float:
if self.first_token_time is None:
return None
return (self.first_token_time - self.start_time) * 1000
@property
def tpot_ms(self) -> float:
if len(self.token_timestamps) < 2:
return None
gaps = [
(self.token_timestamps[i+1] - self.token_timestamps[i]) * 1000
for i in range(len(self.token_timestamps) - 1)
]
return statistics.mean(gaps)
@property
def total_output_tokens(self) -> int:
return len(self.token_timestamps)
@property
def e2e_latency_ms(self) -> float:
if not self.token_timestamps:
return None
return (self.token_timestamps[-1] - self.start_time) * 1000
def streaming_generate_with_metrics(model, tokenizer, prompt: str) -> InferenceMetrics:
"""
Generate tokens one at a time, recording per-token timestamps.
"""
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
metrics = InferenceMetrics(
request_id="req_001",
prompt_tokens=inputs["input_ids"].shape[1]
)
past_key_values = None
input_ids = inputs["input_ids"]
with torch.no_grad():
for step in range(256): # max output tokens
t = time.perf_counter()
outputs = model(
input_ids=input_ids if step == 0 else input_ids[:, -1:],
past_key_values=past_key_values,
use_cache=True
)
past_key_values = outputs.past_key_values
next_token = outputs.logits[:, -1, :].argmax(dim=-1, keepdim=True)
input_ids = torch.cat([input_ids, next_token], dim=1)
now = time.perf_counter()
if step == 0:
metrics.first_token_time = now
metrics.token_timestamps.append(now)
# Check for EOS
if next_token.item() == tokenizer.eos_token_id:
break
return metrics
GPU Memory Estimation Script
def estimate_gpu_memory_requirements(
num_params_billions: float,
dtype_bytes: int,
num_layers: int,
num_kv_heads: int,
head_dim: int,
max_seq_len: int,
batch_size: int,
kv_dtype_bytes: int = 2
) -> dict:
"""
Estimate GPU memory requirements for LLM serving.
Args:
num_params_billions: Model size (e.g., 70 for LLaMA-2 70B)
dtype_bytes: 2 for FP16/BF16, 4 for FP32, 1 for INT8
num_layers: Number of transformer layers
num_kv_heads: Number of KV heads (use GQA value if applicable)
head_dim: Dimension per head
max_seq_len: Maximum sequence length
batch_size: Number of concurrent sequences
kv_dtype_bytes: Bytes per element in KV cache
"""
weights_gb = (num_params_billions * 1e9 * dtype_bytes) / 1e9
# KV cache: 2 (K+V) × layers × kv_heads × head_dim × seq_len × bytes
kv_per_seq_gb = (
2 * num_layers * num_kv_heads * head_dim * max_seq_len * kv_dtype_bytes
) / 1e9
total_kv_gb = kv_per_seq_gb * batch_size
overhead_gb = 3.0 # CUDA context, framework, activations
total_gb = weights_gb + total_kv_gb + overhead_gb
return {
"weights_gb": round(weights_gb, 1),
"kv_per_sequence_gb": round(kv_per_seq_gb, 2),
"kv_total_gb": round(total_kv_gb, 1),
"overhead_gb": overhead_gb,
"total_gb": round(total_gb, 1),
"notes": f"Max batch {batch_size} with {max_seq_len} context tokens"
}
# Example: LLaMA-2 70B on 2× A100 80GB (160GB total)
llama70b = estimate_gpu_memory_requirements(
num_params_billions=70,
dtype_bytes=2, # FP16
num_layers=80,
num_kv_heads=8, # GQA: only 8 KV heads despite 64 query heads
head_dim=128,
max_seq_len=4096,
batch_size=8
)
print("LLaMA-2 70B Memory Estimate:")
for k, v in llama70b.items():
print(f" {k}: {v}")
# Output:
# weights_gb: 140.0
# kv_per_sequence_gb: 0.67
# kv_total_gb: 5.4
# overhead_gb: 3.0
# total_gb: 148.4
Common Mistakes
:::danger Optimizing for GPU compute utilization instead of throughput GPU utilization percentage is a misleading metric for LLM inference. During memory-bandwidth-bound decode, the GPU compute units are mostly idle by design. "60% GPU utilization" during decode does NOT mean you have 40% headroom - it means you are already hitting the memory bandwidth ceiling. Monitor memory bandwidth utilization and tokens/second, not GPU utilization percentage. :::
:::danger Conflating prefill and decode performance Prefill is compute-bound. Decode is memory-bandwidth-bound. They respond differently to optimizations. INT4 quantization helps decode (reads 4× less data from memory) but barely helps prefill (already compute-bound). Batching helps decode dramatically (amortizes weight loading). Prefill can be sped up by increasing compute resources. Decode cannot. Never apply a prefill optimization to a decode bottleneck. :::
:::warning Ignoring the KV cache in memory calculations It is easy to calculate model weights and call it done. For a 7B model at FP16, weights are 14 GB - no problem on a single A100. But add 32 sequences at 4096 tokens each: the KV cache adds another 8–20 GB depending on architecture. With batch size 64, you can easily exceed GPU memory even for "small" models. Always calculate KV cache requirements alongside weights. :::
:::warning Not warming up before benchmarking The first 2–3 inference calls are slower due to CUDA kernel compilation, memory allocation, and cache warming. Always discard warmup runs when measuring latency. A minimum of 10 timed runs with 3 warmup runs is a reasonable benchmark protocol. :::
Interview Questions
Q1: Why is LLM decoding memory-bandwidth-bound and not compute-bound?
During each decode step with batch size 1, the model performs a matrix-vector multiplication: one token (a vector) against each weight matrix. The number of FLOPs is proportional to the number of parameters ( for a matmul), but so is the number of bytes that must be read from memory ( bytes for FP16). The arithmetic intensity is therefore 1 FLOP/byte. The A100 has a roofline ridge point at 156 FLOPs/byte, meaning any operation below that threshold is memory-bound. Decode at small batch sizes is deep in memory-bound territory - the compute units are largely idle, waiting for data from HBM memory.
Q2: What is the difference between TTFT and TPOT, and when does each matter more?
TTFT (Time To First Token) measures latency from request submission to the first output token. It is dominated by the prefill phase. TPOT (Time Per Output Token) measures the average inter-token delay during decode. TTFT matters for interactive applications where users notice the initial pause - a code assistant that takes 2 seconds before typing anything feels broken. TPOT matters for reading experience - if tokens arrive faster than ~30 ms each (33 tokens/sec), reading keeps up with generation. For batch processing (summarization pipelines, document processing), overall throughput matters more than either.
Q3: How does batch size affect the arithmetic intensity of decode?
With batch size , the model performs a matrix-batch-of-vectors multiply. The FLOPs scale as (proportional to batch size), but the bytes of weight data loaded remain (weights are loaded once and reused across all batch items). Therefore arithmetic intensity scales linearly with batch size: FLOPs/byte. At batch size 156 on an A100 (ridge point = 156), decoding transitions from memory-bound to compute-bound. This is why batching is the single most impactful optimization for decode throughput - but it requires enough GPU memory to hold all the concurrent KV caches.
Q4: What are the two phases of inference and what are their compute characteristics?
Prefill processes all input tokens simultaneously using the full parallel attention mechanism. It is compute-intensive because many tokens contribute to many attention operations simultaneously. Arithmetic intensity during prefill is high (300+ FLOPs/byte for typical prompt lengths), making it compute-bound. Decode generates one token per forward pass, resulting in a matrix-vector multiply with arithmetic intensity near 1 FLOPs/byte at batch size 1 - deep in memory-bandwidth-bound territory. These different characteristics mean different optimization strategies apply: compute scaling (more FLOP capacity) helps prefill; memory capacity and bandwidth optimizations help decode.
Q5: Given a 70B parameter model in FP16 on 2× A100 80GB, what limits the maximum batch size you can serve?
Available GPU memory for KV cache = total GPU memory - model weights - overhead = 160 GB - 140 GB - 3 GB = 17 GB. With LLaMA-2 70B using GQA (8 KV heads, 128 head dim, 80 layers), KV cache per sequence at 4096 context = 2 × 80 × 8 × 128 × 4096 × 2 bytes ≈ 1.34 GB. Maximum batch size = floor(17 / 1.34) ≈ 12 sequences. In practice, memory fragmentation and variable sequence lengths reduce this further - PagedAttention (covered in Module 02) addresses this fragmentation problem.
Q6: Why can't you simply parallelize decode steps across multiple GPU cores to speed things up?
Each decode step has a hard sequential dependency on the previous step's output. Step requires the token generated at step as input. You cannot start step until step completes because you do not know the input. This is fundamentally different from training, where you have the full target sequence and can process everything in parallel. The only way to break this sequential bottleneck is speculative decoding (where a draft model speculatively guesses multiple future tokens, and the target model verifies them in parallel) - covered in Module 05.
:::tip 🎮 Interactive Playground
Visualize this concept: Try the Sampling Strategies: Temperature, Top-K, Top-P demo on the EngineersOfAI Playground - no code required.
:::
