
KV Cache Management and PagedAttention

The Night the Inference Cluster Ran Out of Memory

It's 2:47 AM. You're on-call for a production LLM serving cluster handling customer support at scale. The dashboards start lighting up red. GPU memory utilization is spiking to 100% across 16 A100s. Requests are getting dropped. Users are seeing errors.

You pull up the metrics. The model is a 70B parameter LLaMA-2 serving at batch size 32. But the requests tonight are different - customers are submitting long context queries, pasting entire documents into the system. Sequences that normally average 256 tokens are now averaging 1800 tokens. Nothing in the model weights changed. No code was deployed. Yet the system is collapsing.

You dig deeper and find the root cause: the KV cache. Every transformer layer, for every request in the batch, is storing key and value tensors for every token in the sequence. With 80 layers, 64 attention heads, 128-dimensional head vectors, and sequences suddenly 7x longer, the memory consumption didn't grow linearly - it exploded. And because your serving system pre-allocated contiguous memory blocks per request assuming a worst-case sequence length, 65% of that allocated memory was sitting empty, doing nothing, while real requests were being rejected.

You implement an emergency workaround: reduce the maximum sequence length cutoff. Customers get errors for long documents. The business loses revenue. You file an incident report and start researching. Three weeks later, you've migrated the cluster to vLLM with PagedAttention. Memory utilization climbs from 35% to 91%. The problem never recurs.

This is not a hypothetical scenario. It happened at dozens of companies before Woosuk Kwon and the Berkeley Sky Computing Lab published the PagedAttention paper in 2023. Understanding why it happened and how PagedAttention fixed it is foundational knowledge for anyone running LLMs in production.


Why This Exists

The Problem Before PagedAttention

Before 2023, LLM serving frameworks treated GPU memory like a fixed-size array. When a request arrived, the system would allocate a contiguous memory block large enough to hold the KV cache for the maximum possible sequence length - say, 4096 tokens. If the actual request used only 300 tokens, the remaining 3796 token slots sat idle but reserved.

This waste came from a reasonable engineering instinct: dynamic memory allocation on GPUs is slow. If you're reserving and freeing memory mid-inference, you introduce latency. So systems pre-allocated worst-case. The result was catastrophic memory efficiency.

Three specific failure modes plagued naive KV cache management:

Fragmentation - When requests of varying lengths complete at different times, the freed memory blocks are scattered across the GPU in non-contiguous chunks. A new long request might need 2GB of contiguous memory, but while 4GB is technically free, no single contiguous 2GB region exists. The request gets rejected even though GPU memory looks "mostly available."

Over-reservation - Systems didn't know how long a generation would be when it started. A request asking for "a brief summary" and a request asking for "a comprehensive analysis" both start with unknown output lengths. Conservative systems reserved maximum context window memory for both. Most of it went unused.

Beam search explosion - Beam search generates multiple candidate continuations simultaneously. With a beam width of 4, a naive system needs 4x the KV cache memory, each beam holding its own independent copy of all previous tokens' keys and values. For long sequences, this makes beam search economically unviable on anything but very short contexts.

What These Problems Actually Cost

The math is unforgiving. Take a practical production setup: LLaMA-2 70B serving at FP16. Each token's KV cache requires:

$$\text{KV memory per token} = 2 \times n_\text{layers} \times n_\text{heads} \times d_\text{head} \times 2\ \text{bytes}$$

With 80 layers, 64 heads, and 128-dimensional head vectors:

$$2 \times 80 \times 64 \times 128 \times 2 = 2{,}621{,}440 \text{ bytes} \approx 2.5 \text{ MB per token}$$

For a batch of 32 requests at 4096 token maximum context:

$$32 \times 4096 \times 2.5 \text{ MB} = 327{,}680 \text{ MB} \approx 320 \text{ GB}$$

Four A100-80GB GPUs just to hold the KV cache, even if average actual sequence length is 512 tokens (12.5% utilization). You're paying for 320GB of cache and using 40GB of it. The memory cost of running the actual model weights (LLaMA-2 70B at FP16 = 140GB) is almost secondary at this point.

This is the problem PagedAttention was designed to solve.


Historical Context

Operating Systems Already Solved This

In 1962, the Atlas computer at the University of Manchester introduced paged virtual memory. The insight was profound: programs don't need all their data to be physically contiguous in RAM. The operating system can maintain a mapping from virtual addresses (what the program sees) to physical addresses (where data actually lives in RAM). Memory can be allocated in fixed-size pages, scattered physically but appearing contiguous to the program.

This solved exactly the fragmentation problem that naive LLM serving was rediscovering sixty years later.

Woosuk Kwon, a PhD student at UC Berkeley, was working on LLM serving optimization in 2022. He was frustrated by the memory waste in existing systems like FasterTransformer and Orca. While reading operating systems literature, he had the "aha moment": the KV cache problem is structurally identical to virtual memory management. The transformer needs to read previous tokens' keys and values, but it doesn't actually care whether they're physically contiguous - it just needs to access them in the right order.

The PagedAttention paper was published at SOSP 2023 (Symposium on Operating Systems Principles - deliberately chosen to signal the connection). The authors demonstrated 2-4x throughput improvements over the state of the art at the time. vLLM, the serving framework built around PagedAttention, became the dominant open-source LLM serving solution within months of release.

The SOSP Connection

Publishing at SOSP was a strategic choice. The systems community had solved memory management problems for decades. By framing LLM memory management as an OS problem, Kwon et al. could import decades of proven solutions: virtual memory, demand paging, copy-on-write, memory-mapped files. The transformer attention mechanism just needed its access pattern to be expressed in page-friendly terms.


Core Concepts

How the KV Cache Works

To understand PagedAttention, you must first understand why the KV cache exists at all.

In a transformer decoder generating text, each new token attends to all previous tokens. The attention mechanism computes:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Where Q is the query from the current token, and K, V are keys and values from all previous tokens. Without caching, generating token t would require recomputing keys and values for tokens 1 through t-1 - every forward pass reprocesses the entire prefix. That's O(t²) compute for a sequence of length t.

The KV cache stores the key and value tensors for every previously processed token. When generating token t, we only compute K and V for the new token, then append them to the cache. The attention mechanism reads the entire cached history. This reduces compute from O(t²) to O(t) over the full generation - a massive win.

The cost: memory. The KV cache grows linearly with sequence length, and must persist for the entire duration of a request.
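To make the growth pattern concrete, here is a minimal decode-loop sketch in PyTorch. The dimensions are toy values (not LLaMA-2's) and only a single layer is shown; a real engine keeps one such cache per layer, per request.

import torch

# Toy dimensions for illustration - not LLaMA-2's actual configuration
n_kv_heads, d_head, batch = 8, 64, 1

# Empty cache: [batch, kv_heads, seq_len, head_dim] with seq_len = 0
k_cache = torch.empty(batch, n_kv_heads, 0, d_head)
v_cache = torch.empty(batch, n_kv_heads, 0, d_head)

for t in range(16):  # generate 16 tokens
    # Stand-ins for the K/V projections of the newly generated token
    new_k = torch.randn(batch, n_kv_heads, 1, d_head)
    new_v = torch.randn(batch, n_kv_heads, 1, d_head)
    # Append along the sequence dimension - the cache grows by one token per step
    k_cache = torch.cat([k_cache, new_k], dim=2)
    v_cache = torch.cat([v_cache, new_v], dim=2)

print(k_cache.shape)  # torch.Size([1, 8, 16, 64])
bytes_one_layer = (k_cache.numel() + v_cache.numel()) * k_cache.element_size()
print(f"KV bytes for one layer after 16 tokens: {bytes_one_layer}")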

PagedAttention: Virtual Memory for Attention

PagedAttention introduces three key concepts borrowed directly from OS virtual memory:

Physical blocks - The GPU's KV cache memory is divided into fixed-size chunks called physical blocks. Each block holds KV tensors for a fixed number of tokens (the block size, typically 16 tokens). These blocks are scattered across GPU memory.

Virtual blocks - Each request has its own virtual address space. A sequence of 500 tokens maps to ceil(500/16) = 32 virtual blocks. The sequence's virtual blocks are numbered 0, 1, 2, ...31.

Block table - A mapping from each request's virtual block numbers to the physical block addresses in GPU memory. When the attention kernel needs token t's KV data, it looks up which virtual block contains token t (block number = t // block_size), finds the corresponding physical block in the block table, and reads from that physical address plus an offset.

The attention kernel has to be modified to understand this indirection. Instead of reading from a contiguous memory region, it follows the block table for each request. This is the core engineering change that makes everything else possible.

Virtual Sequence (Request A):
Block 0 | Block 1 | Block 2 | Block 3
Tokens | Tokens | Tokens | Tokens
0-15 | 16-31 | 32-47 | 48-63

Block Table (Request A):
Virtual 0 -> Physical 47
Virtual 1 -> Physical 3
Virtual 2 -> Physical 91
Virtual 3 -> Physical 12

Physical GPU Memory:
[block 3: A's tokens 16-31][...free...][block 12: A's tokens 48-63]
[...other requests...][block 47: A's tokens 0-15][...]
[block 91: A's tokens 32-47][...]

The sequence appears contiguous to the attention logic, but its physical storage is scattered wherever free blocks happen to exist.
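The bookkeeping is easy to sketch in a few lines of Python. The class and method names below are illustrative, not vLLM's internals; the point is the translation from a token position to a (physical block, offset) pair via the block table.

BLOCK_SIZE = 16

class BlockAllocator:
    """Hands out physical block IDs from a fixed pool, in any order."""
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("KV block pool exhausted")
        return self.free.pop()

    def release(self, block_id: int) -> None:
        self.free.append(block_id)


class Sequence:
    """Tracks one request's block table: virtual block index -> physical block ID."""
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []
        self.num_tokens = 0

    def append_token(self) -> tuple[int, int]:
        """Reserve a slot for one new token; return (physical_block, offset)."""
        if self.num_tokens % BLOCK_SIZE == 0:   # previous block is full (or no blocks yet)
            self.block_table.append(self.allocator.allocate())
        offset = self.num_tokens % BLOCK_SIZE
        self.num_tokens += 1
        return self.block_table[-1], offset

    def lookup(self, token_idx: int) -> tuple[int, int]:
        """Translate a token position to (physical_block, offset), as the kernel does."""
        return self.block_table[token_idx // BLOCK_SIZE], token_idx % BLOCK_SIZE


alloc = BlockAllocator(num_blocks=1024)
seq = Sequence(alloc)
for _ in range(50):
    seq.append_token()
print(seq.lookup(37))   # (physical block holding token 37, offset within that block)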

Memory Utilization: The Numbers

The improvement in memory utilization is dramatic. With naive contiguous allocation, utilization follows a bimodal pattern: requests that nearly fill their allocation are efficient; requests that end early or use short contexts waste massively. Empirically, production systems saw 30-40% actual utilization.

With PagedAttention, waste is limited to two sources:

  1. Internal fragmentation within the last block - A sequence of 500 tokens fills 31 full blocks (496 tokens) plus 4 tokens in a 16-token block. The last block is 25% utilized. On average, half a block is wasted per sequence.

  2. Reserved-but-unused blocks for in-progress generation - You may pre-allocate 1-2 blocks ahead for ongoing generation to avoid allocation stalls.

Empirically, PagedAttention achieves 90-98% memory utilization, compared to 30-40% for naive systems. This means you can run 2-3x more concurrent requests on the same hardware, directly translating to 2-3x throughput improvement.
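A quick per-request comparison makes the difference tangible, using the ~320 KB-per-token figure for LLaMA-2 70B with GQA and the 4096-token worst-case reservation assumed earlier:

# Per-request KV waste: naive worst-case reservation vs PagedAttention blocks.
# Assumes ~320 KB of KV per token (LLaMA-2 70B with GQA, FP16) and a 4096-token
# pre-allocation for the naive allocator.
KV_BYTES_PER_TOKEN = 320 * 1024
BLOCK_SIZE = 16

def naive_waste_mb(actual_tokens: int, max_seq_len: int = 4096) -> float:
    return (max_seq_len - actual_tokens) * KV_BYTES_PER_TOKEN / 2**20

def paged_waste_mb(actual_tokens: int) -> float:
    empty_slots = (-actual_tokens) % BLOCK_SIZE  # unused slots in the last block
    return empty_slots * KV_BYTES_PER_TOKEN / 2**20

for n in (128, 500, 1800):
    print(f"{n:>5} tokens: naive wastes {naive_waste_mb(n):8.1f} MB, "
          f"paged wastes {paged_waste_mb(n):5.2f} MB")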

Copy-on-Write for Beam Search and Parallel Sampling

Beam search maintains B candidate continuations simultaneously. Without copy-on-write, each beam needs its own copy of the entire KV cache - a B× memory overhead for beam width B.

PagedAttention's copy-on-write optimization works as follows:

When multiple beams share a common prefix (which they all do - they diverged from the same prompt), they can share physical blocks for that prefix. The block tables for beam 1 and beam 2 both point to the same physical block for tokens 0-15. Reference counting tracks how many beams share each block.

When beam 1 needs to write a new token that diverges from beam 2's path - specifically when a shared block needs to be modified - a new physical block is allocated, the shared block's content is copied in, and beam 1's block table is updated to point to the new private block. Beam 2 continues pointing to the original. This is exactly copy-on-write semantics.

For beam search with width 4, the memory overhead for the shared prompt portion is 1x instead of 4x. Only the diverged portions need separate storage. For typical use cases where prompts are long and generated outputs are relatively short, this saves 60-80% of beam search memory compared to naive implementations.

The same mechanism extends to parallel sampling: generating N independent completions from the same prompt shares the prompt's KV cache across all N sampled sequences with zero memory overhead.
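A reference-counting sketch shows the mechanism. This is illustrative Python, not vLLM's block manager, and the actual KV tensor copy is elided as a comment:

BLOCK_SIZE = 16

class BlockPool:
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))
        self.ref_count: dict[int, int] = {}   # physical block id -> number of sharers

    def allocate(self) -> int:
        block = self.free.pop()
        self.ref_count[block] = 1
        return block

    def share(self, block: int) -> None:
        self.ref_count[block] += 1

    def copy_on_write(self, block: int) -> int:
        """Give the writer a private copy only if the block is shared."""
        if self.ref_count[block] == 1:
            return block                      # sole owner - write in place
        self.ref_count[block] -= 1
        new_block = self.allocate()
        # ...a real system copies the KV tensors from `block` into `new_block` here...
        return new_block


pool = BlockPool(64)
prompt_blocks = [pool.allocate() for _ in range(4)]   # a 64-token shared prompt

# Fork 4 beams: each beam's block table points at the same physical prompt blocks
beams = [list(prompt_blocks) for _ in range(4)]
for table in beams[1:]:
    for b in table:
        pool.share(b)

# Beam 0 writes into the last (diverging) block -> it gets a private copy
beams[0][-1] = pool.copy_on_write(beams[0][-1])
print(beams[0][-1] != beams[1][-1])   # True: only the modified block was duplicated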

Radix Attention and Prefix Caching

A natural extension of block-based KV caching is prefix caching: if two different requests share the same prefix (same system prompt, same few-shot examples), their KV blocks for that prefix can be shared.

SGLang's RadixAttention (2024) extends this to a radix tree data structure. The tree's nodes are KV cache blocks. The path from root to a leaf represents a sequence of blocks. If two requests share a common prefix of any length, they share the corresponding path in the radix tree.

This is transformative for use cases with shared system prompts - which is nearly every production deployment. A 2000-token system prompt is computed once and cached. Every subsequent request reuses those KV blocks without recomputation. The effective cache hit rate for common system prompts approaches 100% in steady state.

The memory management becomes more complex: you need an eviction policy (LRU is common), reference counting for shared blocks, and atomic operations to safely modify the tree under concurrent access. But the throughput gains are substantial - prompt processing is often the latency bottleneck for long-context requests.
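A toy version of block-level prefix reuse shows why the hit rate climbs so high. This sketch caches only whole blocks and keys them on a hash of the full prefix - roughly the scheme vLLM's prefix caching uses, with a radix tree generalizing the idea to arbitrary shared prefixes. Eviction and reference counting are omitted.

BLOCK_SIZE = 16

class PrefixCache:
    """Caches KV blocks keyed on a hash of the entire token prefix ending at the block."""
    def __init__(self):
        self.blocks: dict[int, int] = {}   # prefix hash -> physical block id
        self.next_block = 0

    def get_or_allocate(self, token_ids: list[int]) -> tuple[list[int], int]:
        """Return a block table for token_ids and how many blocks were reused."""
        table, reused, prefix_hash = [], 0, 0
        full = len(token_ids) - len(token_ids) % BLOCK_SIZE   # only full blocks are cacheable
        for i in range(0, full, BLOCK_SIZE):
            block_tokens = tuple(token_ids[i:i + BLOCK_SIZE])
            prefix_hash = hash((prefix_hash, block_tokens))   # depends on the whole prefix
            if prefix_hash in self.blocks:
                reused += 1
            else:
                self.blocks[prefix_hash] = self.next_block
                self.next_block += 1
            table.append(self.blocks[prefix_hash])
        return table, reused


cache = PrefixCache()
system_prompt = list(range(2000))            # stand-in for 2000 tokenized system-prompt tokens
req1 = system_prompt + [9001, 9002, 9003]    # different user queries appended
req2 = system_prompt + [7777, 7778]

_, reused1 = cache.get_or_allocate(req1)
_, reused2 = cache.get_or_allocate(req2)
print(reused1, reused2)   # 0 blocks reused on the cold request, 125 on the second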


Architecture Diagrams

[Diagram: KV Cache Memory Layout]

[Diagram: PagedAttention vs Naive Allocation]

[Diagram: Prefix Caching with RadixAttention]


Code Examples

Understanding KV Cache Memory Requirements

Before tuning PagedAttention, you need to understand your memory budget. This script computes exact KV cache requirements for any model configuration.

from dataclasses import dataclass


@dataclass
class ModelConfig:
    """LLM architecture parameters."""
    num_layers: int
    num_kv_heads: int          # May differ from num_attention_heads (GQA)
    head_dim: int
    dtype_bytes: int = 2       # FP16 = 2 bytes, FP8 = 1 byte


@dataclass
class ServingConfig:
    """Serving infrastructure parameters."""
    max_sequence_length: int
    max_batch_size: int
    gpu_memory_gb: float
    model_weight_gb: float
    block_size: int = 16       # tokens per KV block


def compute_kv_cache_per_token(model: ModelConfig) -> int:
    """
    Compute bytes of KV cache required per token per layer.

    Formula: 2 (K and V) * num_kv_heads * head_dim * dtype_bytes
    """
    return 2 * model.num_kv_heads * model.head_dim * model.dtype_bytes


def compute_kv_cache_total_naive(
    model: ModelConfig,
    serving: ServingConfig,
) -> dict:
    """
    Compute KV cache memory for naive contiguous allocation.
    Pre-allocates max_sequence_length per request.
    """
    bytes_per_token = compute_kv_cache_per_token(model)
    bytes_per_token_all_layers = bytes_per_token * model.num_layers

    # Naive: allocate max_seq_len for each request in the batch
    total_kv_bytes = (
        bytes_per_token_all_layers
        * serving.max_sequence_length
        * serving.max_batch_size
    )

    available_for_kv = (
        serving.gpu_memory_gb - serving.model_weight_gb
    ) * 1024**3

    return {
        "bytes_per_token": bytes_per_token_all_layers,
        "total_kv_required_gb": total_kv_bytes / 1024**3,
        "available_gb": available_for_kv / 1024**3,
        "fits": total_kv_bytes <= available_for_kv,
        "utilization_if_avg_512_tokens": (
            512 / serving.max_sequence_length * 100
        ),
    }


def compute_paged_attention_blocks(
    model: ModelConfig,
    serving: ServingConfig,
    overhead_fraction: float = 0.05,
) -> dict:
    """
    Compute PagedAttention block pool configuration.
    overhead_fraction: fraction reserved for system overhead.
    """
    bytes_per_token = compute_kv_cache_per_token(model)
    bytes_per_token_all_layers = bytes_per_token * model.num_layers

    # Bytes per physical block
    bytes_per_block = bytes_per_token_all_layers * serving.block_size

    # Available memory for the KV pool
    available_bytes = (
        serving.gpu_memory_gb - serving.model_weight_gb
    ) * 1024**3 * (1 - overhead_fraction)

    # Number of physical blocks we can hold
    num_physical_blocks = int(available_bytes / bytes_per_block)

    # Maximum tokens we can cache simultaneously
    max_cached_tokens = num_physical_blocks * serving.block_size

    return {
        "bytes_per_block": bytes_per_block / 1024,  # KB
        "num_physical_blocks": num_physical_blocks,
        "max_cached_tokens": max_cached_tokens,
        "effective_max_batch_at_1k_tokens": max_cached_tokens // 1024,
    }


# Example: LLaMA-2 70B with GQA
# 80 layers, 8 KV heads (GQA), 128 head dim, FP16
llama70b = ModelConfig(
    num_layers=80,
    num_kv_heads=8,        # GQA: 8 KV heads for 64 query heads
    head_dim=128,
    dtype_bytes=2,
)

serving = ServingConfig(
    max_sequence_length=4096,
    max_batch_size=32,
    gpu_memory_gb=80.0,
    model_weight_gb=35.0,  # ~140GB of FP16 weights split across 4 GPUs (TP=4)
    block_size=16,
)

naive = compute_kv_cache_total_naive(llama70b, serving)
paged = compute_paged_attention_blocks(llama70b, serving)

print("=== Naive Allocation ===")
print(f"KV bytes per token (all layers): {naive['bytes_per_token'] / 1024:.1f} KB")
print(f"Total KV cache required: {naive['total_kv_required_gb']:.1f} GB")
print(f"Available for KV: {naive['available_gb']:.1f} GB")
print(f"Fits: {naive['fits']}")
print(f"Utilization at avg 512 tokens: {naive['utilization_if_avg_512_tokens']:.1f}%")

print("\n=== PagedAttention ===")
print(f"Bytes per block (block_size=16): {paged['bytes_per_block']:.1f} KB")
print(f"Total physical blocks: {paged['num_physical_blocks']}")
print(f"Max simultaneously cached tokens: {paged['max_cached_tokens']:,}")
print(f"Effective max batch at 1K avg tokens: {paged['effective_max_batch_at_1k_tokens']}")

Output for the example above:

=== Naive Allocation ===
KV bytes per token (all layers): 320.0 KB
Total KV cache required: 40.0 GB
Available for KV: 45.0 GB
Fits: True
Utilization at avg 512 tokens: 12.5%

=== PagedAttention ===
Bytes per block (block_size=16): 5120.0 KB
Total physical blocks: 8755
Max simultaneously cached tokens: 140,080
Effective max batch at 1K avg tokens: 136

Running vLLM with PagedAttention

from vllm import LLM, SamplingParams
import time

# Basic vLLM setup - PagedAttention is used by default
llm = LLM(
    model="meta-llama/Llama-2-13b-chat-hf",
    tensor_parallel_size=1,        # GPUs for this model
    gpu_memory_utilization=0.90,   # fraction of total GPU memory vLLM may use (weights + activations + KV blocks)
    max_num_batched_tokens=8192,   # max total tokens across the batch per step
    max_num_seqs=256,              # max concurrent sequences
    block_size=16,                 # tokens per KV block (default: 16)
    swap_space=4,                  # GB of CPU RAM for block swapping
    enforce_eager=False,           # use CUDA graphs for speed
)

prompts = [
    "Explain transformer attention in simple terms.",
    "What is the difference between CUDA cores and Tensor Cores?",
    "Write a Python function to compute cosine similarity.",
]

sampling_params = SamplingParams(
    temperature=0.8,
    top_p=0.95,
    max_tokens=512,
)

start = time.time()
outputs = llm.generate(prompts, sampling_params)
elapsed = time.time() - start

total_output_tokens = sum(
    len(o.outputs[0].token_ids) for o in outputs
)
print(f"Generated {total_output_tokens} tokens in {elapsed:.2f}s")
print(f"Throughput: {total_output_tokens / elapsed:.1f} tokens/sec")

for output in outputs:
    prompt = output.prompt
    generated = output.outputs[0].text
    print(f"\nPrompt: {prompt[:50]}...")
    print(f"Output: {generated[:100]}...")

Measuring Cache Hit Rates with Prefix Caching

from vllm import LLM, SamplingParams
import time

# Enable prefix caching (available in vLLM >= 0.3.0)
llm = LLM(
    model="meta-llama/Llama-2-13b-chat-hf",
    enable_prefix_caching=True,    # radix-style prefix reuse
    gpu_memory_utilization=0.90,
    max_num_batched_tokens=8192,
)

SYSTEM_PROMPT = """You are a helpful assistant specializing in machine learning
and hardware optimization. You provide technically accurate, detailed responses
with code examples when relevant. Always consider performance implications."""

# Simulate production traffic: same system prompt, varying queries
queries = [
    "How does quantization affect inference latency?",
    "What is the optimal batch size for A100 GPU inference?",
    "Explain the difference between TP and PP parallelism.",
    "How do I profile GPU memory usage in PyTorch?",
    "What causes attention to be memory-bandwidth bound?",
]

# First pass: cache starts cold - the first request computes the system prompt from scratch
print("=== First Pass (Cold Cache) ===")
start = time.time()
for query in queries:
    prompt = f"<s>[INST] <<SYS>>\n{SYSTEM_PROMPT}\n<</SYS>>\n\n{query} [/INST]"
    outputs = llm.generate([prompt], SamplingParams(max_tokens=200))
cold_time = time.time() - start
print(f"Cold cache time: {cold_time:.2f}s")

# Second pass: warm cache - system prompt tokens already cached
print("\n=== Second Pass (Warm Cache) ===")
start = time.time()
for query in queries:
    prompt = f"<s>[INST] <<SYS>>\n{SYSTEM_PROMPT}\n<</SYS>>\n\n{query} [/INST]"
    outputs = llm.generate([prompt], SamplingParams(max_tokens=200))
warm_time = time.time() - start
print(f"Warm cache time: {warm_time:.2f}s")
print(f"Speedup from prefix caching: {cold_time / warm_time:.2f}x")

# Access internal cache statistics (internal API, varies across vLLM versions)
# stats = llm.llm_engine.scheduler.block_manager.get_prefix_cache_stats()
# print(f"Cache hit rate: {stats.hit_rate:.1%}")
# print(f"Cached blocks: {stats.num_cached_blocks}")

Monitoring Block Utilization in Production

import asyncio

from prometheus_client import Gauge, start_http_server
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

# Metrics for Prometheus/Grafana
kv_cache_usage = Gauge(
    'vllm_kv_cache_usage_ratio',
    'Fraction of KV cache blocks currently in use'
)
prefix_cache_hit_rate = Gauge(
    'vllm_prefix_cache_hit_rate',
    'Rolling cache hit rate for prefix caching'
)
num_running_seqs = Gauge(
    'vllm_num_running_sequences',
    'Number of sequences currently being processed'
)
num_waiting_seqs = Gauge(
    'vllm_num_waiting_sequences',
    'Number of sequences waiting for scheduling'
)


async def monitor_engine(engine: AsyncLLMEngine):
    """Periodically scrape engine stats and update the Prometheus gauges."""
    while True:
        # In a real deployment, read block usage from
        # engine.engine.scheduler.block_manager (an internal, version-dependent API)
        # and update the gauges defined above.
        await asyncio.sleep(1.0)


# Expose the metrics endpoint for Prometheus to scrape
start_http_server(8000)

# Production vLLM server configuration
engine_args = AsyncEngineArgs(
    model="meta-llama/Llama-2-70b-chat-hf",
    tensor_parallel_size=4,
    gpu_memory_utilization=0.92,   # cap vLLM at 92% of GPU memory; the rest is headroom
    max_num_batched_tokens=32768,  # allows long context + large batches
    max_num_seqs=512,
    block_size=16,
    enable_prefix_caching=True,
    swap_space=16,                 # 16GB CPU swap for overflow
    disable_log_stats=False,       # emit periodic stats to the log
)

# Key production tuning parameters:
# gpu_memory_utilization: 0.85-0.95
# - Higher = more blocks, more concurrency, risk of OOM on spikes
# - Lower = more headroom, fewer blocks, less throughput
#
# block_size: 8, 16, or 32
# - Smaller = less internal fragmentation, more block table overhead
# - Larger = less overhead, more waste on short sequences
# - 16 is usually optimal
#
# max_num_batched_tokens: controls maximum compute per scheduler step
# - Increase for higher throughput at cost of latency variability
# - Decrease for more consistent latency (TTFT and TPOT)
#
# swap_space: enables CPU offloading when GPU blocks exhaust
# - Introduces latency spikes (PCIe bandwidth ~32GB/s vs HBM ~2TB/s)
# - Better to tune max_num_seqs to avoid needing this

Benchmarking Memory Utilization

# vLLM's throughput benchmark ships as a script in the source tree, not as an
# installed module - clone the repo to get it, and download the ShareGPT dataset
# (realistic sequence lengths) to pass via --dataset.
pip install vllm
git clone https://github.com/vllm-project/vllm.git && cd vllm

# Run the benchmark (flag availability varies across vLLM versions)
python benchmarks/benchmark_throughput.py \
  --model meta-llama/Llama-2-13b-chat-hf \
  --backend vllm \
  --dataset ./ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 1000 \
  --gpu-memory-utilization 0.90 \
  --block-size 16 \
  --tensor-parallel-size 1

# Compare with different block sizes
for BLOCK_SIZE in 8 16 32; do
  echo "=== Block size: $BLOCK_SIZE ==="
  python benchmarks/benchmark_throughput.py \
    --model meta-llama/Llama-2-13b-chat-hf \
    --backend vllm \
    --dataset ./ShareGPT_V3_unfiltered_cleaned_split.json \
    --num-prompts 500 \
    --block-size $BLOCK_SIZE \
    --output-len 256
done

# Profile with NVIDIA Nsight Systems to see actual memory access patterns
nsys profile -o kvcache_profile \
  --trace=cuda,cudnn,nvtx \
  python benchmarks/benchmark_throughput.py \
    --model meta-llama/Llama-2-7b-chat-hf \
    --num-prompts 100

Production Engineering Notes

Tuning gpu_memory_utilization

This is the most important knob. gpu_memory_utilization caps the fraction of total GPU memory vLLM will use for everything it manages: model weights, activation workspace, and the KV block pool. At 0.90, whatever remains of that 90% after loading weights and profiling peak activation memory becomes KV blocks. The 10% left outside the cap is headroom for:

  • CUDA context and kernel workspace memory
  • Activation spikes beyond the profiled peak
  • Temporary buffers for sampling operations
  • Anything else sharing the GPU

In practice, 0.90 is safe for most workloads. Under extremely high concurrency with long sequences, push it to 0.92-0.95. Below 0.85 you're leaving significant throughput on the table.

If you're running multiple models on one GPU (not recommended for production), factor shared overhead carefully.
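A back-of-the-envelope sketch of how the setting turns into KV capacity, assuming the pool is roughly utilization × total memory minus weights minus the profiled activation peak (the weight and activation figures below are illustrative, not measured):

# Rough KV block pool size for a given gpu_memory_utilization setting.
# Assumption: pool ≈ util * total_gpu_mem - weights - profiled activation peak.
def kv_pool_gb(total_gpu_gb: float, util: float,
               weights_gb: float, activation_peak_gb: float) -> float:
    return util * total_gpu_gb - weights_gb - activation_peak_gb

# Illustrative figures for a 13B model in FP16 on an 80GB GPU
for util in (0.85, 0.90, 0.95):
    pool = kv_pool_gb(80.0, util, weights_gb=26.0, activation_peak_gb=4.0)
    print(f"gpu_memory_utilization={util}: ~{pool:.1f} GB of KV blocks")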

Block Size Selection

The default block_size=16 is well-validated across many workloads. The tradeoff (quantified in the sketch after this list):

  • block_size=8: Lower internal fragmentation (average 4 tokens wasted per sequence), higher block table overhead, more entries in the block manager data structures. Good for very short sequences (under 100 tokens).
  • block_size=16: Balanced. Works well for most production workloads with average sequences of 200-2000 tokens.
  • block_size=32: Higher throughput due to coarser memory accesses, but wastes up to 31 tokens per sequence. Only beneficial for very long sequences (2000+).
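The tradeoff is easy to quantify. A rough sketch, assuming the last block of each sequence is on average half full and reusing the ~320 KB-per-token figure from earlier:

# Expected internal fragmentation and block-table size per sequence for a given
# block size. Assumes the last block is on average half full.
KV_BYTES_PER_TOKEN = 320 * 1024   # LLaMA-2 70B with GQA, FP16 (from the section above)

def per_sequence_overhead(avg_seq_len: int, block_size: int) -> tuple[float, int]:
    wasted_mb = (block_size / 2) * KV_BYTES_PER_TOKEN / 2**20   # avg half a block wasted
    table_entries = -(-avg_seq_len // block_size)               # ceil division
    return wasted_mb, table_entries

for bs in (8, 16, 32):
    waste_mb, entries = per_sequence_overhead(avg_seq_len=800, block_size=bs)
    print(f"block_size={bs:>2}: ~{waste_mb:.2f} MB wasted, {entries} block-table entries per sequence")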

Handling Sequence Length Spikes

Production traffic follows heavy-tailed distributions. Your p50 request might be 300 tokens, but your p99 is 3000 and your max is 32000. Configure for the distribution, not the median:

# vLLM configuration for heavy-tailed workloads
llm = LLM(
    model="...",
    max_model_len=8192,            # max sequence length supported
    gpu_memory_utilization=0.88,   # slightly conservative for spikes
    max_num_seqs=256,              # don't let too many long seqs co-exist
    swap_space=8,                  # CPU fallback for extreme spikes
)

Monitor vllm_num_waiting_sequences - if this rises consistently, you're hitting memory limits. Either reduce max_num_seqs, reduce max_model_len, or add more GPUs.

Multi-GPU KV Cache

With tensor parallelism, the KV cache is split across GPUs automatically. Each GPU holds 1/TP of each layer's KV data. The block manager coordinates across devices. For a 70B model on 4xA100:

  • Model weights: ~35GB per GPU (total 140GB for FP16)
  • KV cache per GPU: ~45GB (after weights)
  • Effective total KV pool: 180GB across 4 GPUs

This is additive - TP helps both for model weight memory AND KV cache capacity.
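The split is easiest to see per token: each TP rank holds its share of the KV heads, so per-GPU KV bytes shrink with the TP degree. A quick check against the LLaMA-2 70B numbers used earlier:

# Per-token KV bytes held by each GPU under tensor parallelism:
# each rank stores kv_heads / TP of the heads for every layer.
def kv_bytes_per_token_per_gpu(layers: int, kv_heads: int, head_dim: int,
                               dtype_bytes: int, tp: int) -> int:
    return 2 * layers * (kv_heads // tp) * head_dim * dtype_bytes

per_gpu = kv_bytes_per_token_per_gpu(layers=80, kv_heads=8, head_dim=128,
                                     dtype_bytes=2, tp=4)
print(per_gpu // 1024, "KB per token per GPU")  # 80 KB; the 4 ranks together hold the full 320 KB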


Common Mistakes

:::danger Block Size Too Large for Short-Context Workloads

Setting block_size=32 or block_size=64 when your average sequence is 200 tokens causes significant waste. A 200-token sequence uses 7 blocks: 6 full blocks (192 tokens) plus one block with only 8 tokens. That last block is 75% empty.

At scale with 1000 concurrent short requests, you're wasting 750 blocks (24,000 tokens of capacity). This is internal fragmentation - the one form of waste PagedAttention cannot eliminate - so don't amplify it by choosing a large block size.

Fix: Use block_size=16 as the default. Only increase if profiling shows measurable throughput gains for your specific workload, and only when average sequences are consistently above 1000 tokens. :::

:::danger Ignoring Swap Space Latency

Setting swap_space=32 doesn't give you free capacity. CPU-GPU block swapping happens over PCIe at 32-64 GB/s, compared to HBM bandwidth of 2 TB/s. When a request's blocks get swapped to CPU RAM, resuming that request requires swapping them back - you'll see latency spikes of 100-500ms for affected requests.

In production, swap should be a circuit breaker of last resort, not a routine operation. If you're hitting swap regularly:

  1. Reduce max_num_seqs to limit concurrency
  2. Reduce max_model_len to cap per-request memory
  3. Add GPU capacity

Monitor vllm_num_swapped_sequences - if it's consistently non-zero, you have a memory capacity problem that swapping is merely hiding. :::

:::warning Prefix Caching Requires Consistent Prompt Formatting

Radix attention matches prefixes at the token level. Any variation in system prompt formatting - trailing spaces, different newline styles, slight wording changes - breaks the match from that token onward. A mismatch near the start of the prompt invalidates caching for everything after it, so a prompt that is 95% identical can still be a near-total cache miss.

Standardize prompt templates in your application layer. Run your prompts through the tokenizer and verify the token IDs are identical for requests that should share a prefix (see the sketch after this callout). Log cache hit rates per endpoint; if hits drop, something changed in your prompt construction code.

Also note: prefix caching is incompatible with certain sampling techniques that require independent random states per token. Greedy decoding and temperature sampling work fine; some beam search configurations require careful handling. :::
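A quick way to confirm that two prompts actually share a token-level prefix before blaming the cache - a sketch using the Hugging Face tokenizer (the model name is illustrative):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-chat-hf")

def shared_prefix_length(prompt_a: str, prompt_b: str) -> int:
    """Number of leading token IDs the two prompts have in common."""
    ids_a = tok(prompt_a, add_special_tokens=False).input_ids
    ids_b = tok(prompt_b, add_special_tokens=False).input_ids
    n = 0
    for a, b in zip(ids_a, ids_b):
        if a != b:
            break
        n += 1
    return n

a = "SYSTEM: You are a helpful assistant.\n\nUSER: How do I profile GPU memory?"
b = "SYSTEM: You are a helpful assistant. \n\nUSER: What causes CUDA OOM errors?"  # stray space
print(shared_prefix_length(a, b))  # the stray space cuts the shared prefix short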

:::warning GPU Memory Utilization at 0.95+ Causes OOM on Spikes

gpu_memory_utilization=0.95 allocates 95% of available memory to the block pool, leaving only 5% for activation memory and other overhead. During high-throughput bursts, temporary buffers during the forward pass can exhaust this 5% headroom, causing CUDA OOM errors that crash the entire serving process.

This is particularly dangerous because the error doesn't manifest at steady state - it appears under peak load, exactly when you least want instability.

Keep gpu_memory_utilization at 0.90-0.92 in production. Profile your specific model's activation memory footprint under peak batch sizes before pushing higher. :::


Interview Q&A

Q1: Explain the KV cache memory formula and where the numbers come from.

Answer: The KV cache stores key and value tensors for every previously generated token, in every layer, for the current batch. The formula is:

$$\text{KV bytes per token} = 2 \times n_\text{layers} \times n_\text{kv\_heads} \times d_\text{head} \times \text{dtype\_bytes}$$

The "2" is for keys AND values - each attention head produces both. nkv_headsn_\text{kv\_heads} accounts for Grouped Query Attention (GQA) where models like LLaMA-2 70B use only 8 KV heads even though they have 64 query heads - a significant memory saving. dheadd_\text{head} is typically dmodel/nheadsd_\text{model} / n_\text{heads}.

For LLaMA-2 70B at FP16: 2 × 80 × 8 × 128 × 2 = 327,680 bytes ≈ 320 KB per token. At 4096-token sequences with batch size 32: 320 KB × 4096 × 32 ≈ 40 GB.

Without GQA (using full 64 KV heads), this would be 8x larger: 320 GB just for KV cache. GQA is architecturally essential for serving large models at scale.

Q2: How does PagedAttention differ from the OS virtual memory it's inspired by?

Answer: OS virtual memory manages pages of any data. PagedAttention is specialized for attention access patterns in a few important ways:

First, access is append-only. New tokens are always appended to the end of a sequence's KV cache. There are no random writes into the middle of a cached sequence (unlike general OS memory). This means the physical block layout only needs to support sequential access with occasional appends.

Second, block tables are per-request and must be communicated to GPU kernels. In OS virtual memory, the CPU's MMU handles address translation in hardware. In PagedAttention, the block table is passed as a tensor to the CUDA kernel, which performs software address translation. This adds per-block overhead but enables the GPU kernel to batch multiple requests with different physical layouts simultaneously.

Third, PagedAttention doesn't need demand paging in the OS sense - the entire pool is already on the GPU. The "paging" is only about allocation, not about loading from slower storage (except in the swap case with CPU offloading).

Q3: What is copy-on-write in the context of beam search, and why does it matter?

Answer: Beam search maintains B candidate sequences (beams) simultaneously. All beams start from the same prompt and diverge as generation proceeds. Without optimization, each beam needs its own copy of the full KV cache - a B× memory overhead.

Copy-on-write uses reference counting to share physical blocks between beams that haven't yet diverged. Initially, all beams point to the same physical blocks (via their individual block tables). When beam i needs to write a new token that diverges from beam j's path, only then is the shared block copied and a new private block created for beam i.

In practice, for a prompt of 1000 tokens and 4 beams generating 200 tokens, the shared portion (1000 tokens) uses 1× memory instead of 4×. Only the generated portion (200 tokens) requires separate blocks. Memory savings on the prompt portion: (1000 × 4 - 1000) / (1000 × 4) = 75%.

This makes beam search economically viable for longer contexts. Without it, beam width 4 at 4096-token context is memory-prohibitive on most serving hardware.

Q4: What is the difference between internal and external fragmentation in KV cache management?

Answer: External fragmentation happens when total free memory is sufficient but no contiguous region large enough for a new allocation exists. This is the classic problem with naive contiguous KV cache allocation - free blocks are scattered, and a new long request can't fit even though sufficient total memory is free.

PagedAttention eliminates external fragmentation entirely. Because requests don't need contiguous blocks, any combination of scattered free blocks can serve any request. A 1000-token request needs 63 blocks of size 16; these can be any 63 free blocks in the pool, physically scattered anywhere.

Internal fragmentation is waste within allocated blocks. A sequence of 100 tokens uses 7 blocks (96 tokens in 6 full blocks + 4 tokens in a 7th block). The 7th block is 75% empty. This waste is unavoidable with fixed block sizes and is bounded by at most one block per sequence. With block_size=16, the worst case is 15 wasted token slots per sequence - roughly 5 MB for LLaMA-2 70B with GQA (at ~320 KB of KV per token) - versus the hundreds of megabytes a naive allocator can strand per sequence by reserving the full context window.

The tradeoff: block size controls internal fragmentation (smaller blocks = less waste) vs. block table overhead (smaller blocks = more entries in the block manager = more CPU/memory overhead for metadata).

Q5: How would you explain prefix caching's throughput impact to a product manager deciding whether to implement it?

Answer: Every LLM request has two phases: prompt processing (the "prefill" phase, processing the input tokens) and token generation (producing output tokens). Prefix caching only speeds up prefill.

For a typical production deployment with a 1000-token system prompt and 100-token user queries, the first request processes 1100 tokens in prefill. The second request, with a different query but the same system prompt, would normally process 1100 tokens again. With prefix caching, it processes only 100 tokens (the new query), because the 1000 system prompt tokens are already cached.

The compute saving is dramatic: 91% reduction in prefill compute for every request after the first. This translates directly to:

  1. Lower time-to-first-token (TTFT) - users see the response start faster
  2. Higher throughput - more requests handled per second since prefill compute is reduced
  3. Lower cost - less GPU time consumed per request

Real-world numbers: in a system where prefill takes 200ms and generation takes 1000ms, prefix caching brings prefill to 18ms (for a 91% cache hit). Total latency drops from 1200ms to 1018ms - a 15% improvement in perceived response time. Throughput for the same hardware increases roughly proportionally.

The catch: the benefit is proportional to system prompt length and consistency. If every request has a unique 10-token prompt, prefix caching saves nothing. If every request shares a 5000-token prompt (RAG context, long instructions), prefix caching saves 98% of prefill compute.

Q6: A vLLM deployment is showing high memory pressure and requests are being dropped. Walk through your diagnostic process.

Answer: Start with the metrics, not the config.

Step 1: Identify the failure mode. Check vllm_num_waiting_sequences and vllm_kv_cache_usage_ratio. If cache usage is consistently near 100% and waiting queue is growing, you're hitting memory limits. If cache usage is fine but waiting queue grows during bursts, you have a compute bottleneck (different problem).

Step 2: Characterize the traffic. What are actual sequence lengths? Log the distribution of prompt lengths and output lengths. If your p99 sequence length is 5x your p50, that's the source of pressure - a few long requests are consuming disproportionate blocks.

Step 3: Check for prefix caching effectiveness. If enabled, what's the cache hit rate? If it's low despite shared system prompts, check for inconsistent prompt formatting.

Step 4: Calculate whether memory is correctly configured. Run the KV memory calculation for your model and compare to your allocated block pool size. If you're allocating significantly less than expected, check gpu_memory_utilization setting.

Step 5: Remediation options in order of preference:

  • Reduce max_num_seqs to limit concurrency (reduces peak memory pressure, may increase wait time)
  • Reduce max_model_len to cap per-request memory consumption (breaks very long requests)
  • Enable or tune prefix caching (if system prompts are long and consistent)
  • Add tensor parallelism across more GPUs (most expensive but most effective long-term)

Avoid raising swap_space as a first response - it trades OOM for latency spikes, and latency spikes at scale cause cascade failures in dependent services.

© 2026 EngineersOfAI. All rights reserved.