KV Cache
The Production Scenario
Your LLM serving cluster is running at 30% throughput compared to what the hardware should theoretically deliver. You have 8× A100s, LLaMA-2 70B, and a continuous batching setup. The GPU utilization numbers look fine. Token generation speed looks fine for individual requests. But when you try to increase concurrent users, memory allocation errors start appearing. Requests get queued. Latency spikes.
You dig into the memory profiler and find something shocking: at any given moment, 65% of your allocated KV cache memory is wasted - empty, reserved but unused. You have carefully fragmented your GPU memory into thousands of unusable holes, just like a filesystem that has been writing and deleting files for years without defragmentation.
This is the memory fragmentation problem that PagedAttention was designed to solve. Before we get there, we need to understand what the KV cache is, why it exists, and exactly how much memory it consumes. The numbers are larger than most engineers expect.
Without a KV cache, a 70B model generating a 200-token response would require reading model weights approximately 200 times - once per decode step. With a KV cache, you still do 200 decode steps, but you avoid recomputing the attention keys and values for every previous token at every step. This sounds like a minor optimization. When you actually measure it, eliminating the KV cache causes inference to slow by 10-100× depending on sequence length.
Why This Exists: The Recomputation Problem
Recall the attention mechanism. For each token at position , the attention operation computes:
During decode, the query comes from the current token (position ). But and come from ALL previous tokens (positions through ).
Without any caching, generating token requires:
- Running the full forward pass for token
- Recomputing and for ALL previous tokens from scratch
Generating a 200-token response requires computing KV tensors for position 0 at every one of those 200 steps. Position 0 is recomputed 200 times. Position 1 is recomputed 199 times. Total recomputation grows quadratically with sequence length: where is the output length.
The KV cache stores the and tensors for each previous token. At step , you only compute and for the new token, then concatenate them to the cache. Total computation becomes - linear in sequence length.
Historical Context
The attention mechanism itself (Vaswani et al., 2017) implicitly assumed caching would be used during inference. The original "Attention is All You Need" paper described the autoregressive generation process and noted that key-value pairs from previous positions could be cached.
For years, the KV cache was simply "how inference worked" - an implementation detail. The real challenge, identified around 2021–2022 as model sizes exploded, was that KV cache memory requirements were growing to rival or exceed model weight memory requirements. For GPT-3 scale models (175B) with long contexts, the KV cache for a single sequence could require 50+ GB.
The pivotal work was PagedAttention (Kwon et al., 2023), the system paper behind vLLM. The authors measured that existing serving systems (HuggingFace transformers, FasterTransformer) wasted 60–80% of GPU memory due to KV cache fragmentation. PagedAttention borrowed virtual memory and paging concepts from operating systems to solve this problem, achieving 2–24× higher throughput in benchmarks.
Concurrently, the ML research community addressed KV cache memory at the architecture level: Multi-Query Attention (Shazeer, 2019) and Grouped Query Attention (Ainslie et al., 2023) reduce the KV cache size by factors of 8–64× with minimal quality impact.
What the KV Cache Stores
For each transformer layer and each token position, the attention sublayer computes three projections:
- - query (only needed at current token)
- - key (needed for all future tokens to attend back to this position)
- - value (needed for all future tokens to attend back to this position)
The KV cache stores and tensors. is never cached because each token's query is only used in that token's own decode step.
For a model with:
- layers
- attention heads (or KV heads for GQA)
- head dimension
- tokens in the sequence
- bytes per element (2 for FP16/BF16, 1 for INT8)
The factor of 2 accounts for both K and V tensors.
Memory Calculations for Popular Models
| Model | Layers | KV Heads | Head Dim | KV per token (FP16) |
|---|---|---|---|---|
| LLaMA-2 7B | 32 | 32 (MHA) | 128 | 524 KB |
| LLaMA-2 13B | 40 | 40 (MHA) | 128 | 819 KB |
| LLaMA-2 70B | 80 | 8 (GQA) | 128 | 328 KB |
| LLaMA-3 8B | 32 | 8 (GQA) | 128 | 131 KB |
| LLaMA-3 70B | 80 | 8 (GQA) | 128 | 328 KB |
| Mistral 7B | 32 | 8 (GQA) | 128 | 131 KB |
LLaMA-2 70B at 4096 context, FP16:
With 80 GB per A100 and 140 GB for model weights across 2× A100s, you have roughly 20 GB for KV cache - enough for about 15 concurrent sequences at full 4096-token context.
The Memory Fragmentation Problem
Here is the problem that plagued all KV cache implementations before PagedAttention:
Contiguous Allocation Wastes Memory
Traditional implementations pre-allocate a contiguous block of memory for each sequence's KV cache, sized for the maximum possible sequence length. This is necessary because you cannot move a KV cache once allocated - the attention computation references it by absolute memory address.
Consider a server handling concurrent requests:
- Request A: 2048 tokens max (using 100 tokens currently)
- Request B: 4096 tokens max (using 4096 tokens - done)
- Request C: 1024 tokens max (using 1024 tokens - done)
- Request D: 3000 tokens max (using 300 tokens currently)
After requests B and C complete, their allocated memory is freed. But the freed blocks are 4096 and 1024 token-sized holes. If a new request E needs 5000 tokens, it cannot fit in either hole - even though the total free space is 5120 tokens. This is external fragmentation.
Internal fragmentation is worse: a request allocated for 4096 max tokens that only uses 100 tokens currently wastes 97.5% of its allocation while running.
The PagedAttention paper measured this fragmentation at 60–80% of reserved KV cache memory across typical workloads.
PagedAttention: OS Paging for GPU Memory
PagedAttention (Kwon et al., 2023) solves fragmentation by applying virtual memory concepts from operating systems to KV cache management.
The Core Idea
Instead of one contiguous block per sequence, the KV cache is divided into fixed-size pages (also called blocks). Each page holds the KV tensors for a fixed number of tokens (typically 16 tokens). Pages for a single sequence need not be contiguous in physical GPU memory - a page table maps logical page numbers to physical memory locations.
Benefits of PagedAttention
- No internal fragmentation: Memory is allocated in small blocks (16 tokens). Only the last block of each sequence is partially used.
- No external fragmentation: Any free block can be assigned to any sequence, regardless of prior allocation patterns.
- Dynamic allocation: Blocks are allocated only as needed - a sequence that generates 50 tokens uses only 4 blocks, not a 4096-token pre-allocation.
- Copy-on-write sharing: For parallel sampling (beam search, multiple candidate sequences from one prompt), the prompt's KV cache blocks can be shared until sequences diverge, then copied on the first write.
Memory Utilization Improvement
In the original PagedAttention paper:
- HuggingFace transformers: 20.4% memory utilization
- vLLM with PagedAttention: 96.3% memory utilization
This near-perfect utilization translates directly to serving 4–5× more concurrent sequences on the same hardware.
Multi-Head vs Multi-Query vs Grouped-Query Attention
The architecture of the attention mechanism itself determines KV cache size. Three variants exist:
Multi-Head Attention (MHA)
The original transformer design. Each attention head has its own independent and projection matrices. With heads:
KV cache stores K tensors and V tensors per layer. Most expressive but largest KV cache.
Multi-Query Attention (MQA)
Shazeer (2019) proposed sharing K and V across all query heads. All query heads share a single K and single V projection.
KV cache size is reduced by a factor of compared to MHA. For a model with 32 heads, MQA reduces KV cache by 32×. Quality degradation is modest for most tasks. PaLM and Falcon use MQA.
Grouped Query Attention (GQA)
Ainslie et al. (2023) proposed a middle ground. query heads are divided into groups, each sharing a K and V projection. is typically 4–8.
With query heads and KV heads, KV cache is 4× smaller than MHA while being much larger than MQA. Quality is better than MQA and close to MHA.
LLaMA-2 34B and 70B, LLaMA-3, Mistral, and most modern efficient models use GQA.
Comparison Table
| Method | KV Heads | KV Cache vs MHA | Quality | Used In |
|---|---|---|---|---|
| MHA | H | 1× | Best | GPT-2, early transformers |
| GQA | H/4 to H/8 | 4–8× smaller | Near-MHA | LLaMA-2 70B, LLaMA-3, Mistral |
| MQA | 1 | H× smaller | Slight loss | PaLM, Falcon 40B |
KV Cache Quantization
One additional lever: quantize the KV cache itself.
KV cache tensors are activations, not weights. They are computed on the fly and written to memory. Quantizing them to INT8 reduces KV cache memory by 2× with typically small quality degradation.
Why quantizing KV cache is harder than weight quantization:
- Activation distributions change with each input (dynamic range varies)
- You need per-token or per-channel scale factors
- Quantization error accumulates across sequence length
Despite these challenges, INT8 KV cache is deployed in production by several major providers. At very long contexts (32K+ tokens), quality impact requires careful evaluation task-by-task.
import torch
def quantize_kv_cache_int8(kv_tensor: torch.Tensor):
"""
Quantize KV cache tensor to INT8 with per-token scales.
kv_tensor: shape [batch, heads, seq_len, head_dim]
"""
# Compute per-token max absolute value for scale
# Shape: [batch, heads, seq_len, 1]
abs_max = kv_tensor.abs().max(dim=-1, keepdim=True).values
scale = abs_max / 127.0
scale = scale.clamp(min=1e-8)
# Quantize to INT8
quantized = (kv_tensor / scale).round().clamp(-128, 127).to(torch.int8)
return quantized, scale
def dequantize_kv_cache(quantized: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
"""Dequantize INT8 KV cache back to float for attention computation."""
return quantized.float() * scale
def measure_kv_quantization_error():
"""Measure quantization error for INT8 KV cache."""
torch.manual_seed(42)
# Simulate KV cache tensor [1, 8, 512, 128]
kv_tensor = torch.randn(1, 8, 512, 128, dtype=torch.float32)
# Quantize and dequantize
quantized, scale = quantize_kv_cache_int8(kv_tensor)
reconstructed = dequantize_kv_cache(quantized, scale)
# Measure error
mse = ((kv_tensor - reconstructed) ** 2).mean().item()
max_error = (kv_tensor - reconstructed).abs().max().item()
relative_error = (
(kv_tensor - reconstructed).abs().mean() / kv_tensor.abs().mean()
).item()
print(f"KV Cache INT8 Quantization Error:")
print(f" MSE: {mse:.6f}")
print(f" Max absolute error: {max_error:.4f}")
print(f" Mean relative error: {relative_error:.4f} ({relative_error*100:.2f}%)")
print(f" Memory reduction: 2x (float32 -> INT8)")
KV Cache Offloading to CPU RAM
For very long contexts or large batch sizes, KV cache can overflow GPU HBM and be offloaded to CPU RAM. CPU RAM is cheaper and more plentiful (typically 500 GB–1 TB per server vs 80 GB per GPU).
The trade-off: PCIe bandwidth (typically 32–64 GB/s for PCIe 5.0) is far slower than GPU HBM bandwidth (2 TB/s for A100). Each decode step that accesses offloaded KV cache pays the PCIe latency penalty.
Strategies:
- Swap entire sequences: When a sequence is not being decoded (waiting in queue), move its KV cache to CPU RAM. Reload before resuming.
- Layer-wise offloading: Keep recent layer KVs on GPU, older layers on CPU.
- FlexGen (Sheng et al., 2023): Systematic offloading to CPU RAM and even NVMe SSD for maximum batch size at the cost of throughput.
Implementing a Simple KV Cache
import torch
from dataclasses import dataclass, field
from typing import List, Optional, Tuple
@dataclass
class KVCache:
"""
Simple KV cache implementation.
Production systems use PagedAttention (vLLM) for efficiency.
"""
num_layers: int
num_heads: int
head_dim: int
dtype: torch.dtype = torch.float16
device: str = "cuda"
cache: List = field(default_factory=list)
def __post_init__(self):
self.cache = [None] * self.num_layers
def update(
self,
layer_idx: int,
key: torch.Tensor,
value: torch.Tensor
) -> Tuple[torch.Tensor, torch.Tensor]:
"""
Update cache for a layer and return full K, V tensors.
Args:
layer_idx: Which transformer layer
key: New key tensor [batch, heads, 1, head_dim]
value: New value tensor [batch, heads, 1, head_dim]
Returns:
Full key and value tensors including all previous tokens
"""
if self.cache[layer_idx] is None:
self.cache[layer_idx] = (key, value)
else:
prev_k, prev_v = self.cache[layer_idx]
full_k = torch.cat([prev_k, key], dim=2)
full_v = torch.cat([prev_v, value], dim=2)
self.cache[layer_idx] = (full_k, full_v)
return self.cache[layer_idx]
def get_sequence_length(self) -> int:
if self.cache[0] is None:
return 0
return self.cache[0][0].shape[2]
def memory_bytes(self) -> int:
total = 0
for layer_kv in self.cache:
if layer_kv is not None:
k, v = layer_kv
total += k.numel() * k.element_size()
total += v.numel() * v.element_size()
return total
def memory_gb(self) -> float:
return self.memory_bytes() / 1e9
def measure_kv_cache_growth():
"""
Demonstrate KV cache memory growth with sequence length.
Simulates LLaMA-3 8B: 32 layers, 8 KV heads, 128 head dim.
"""
seq_lengths = [128, 256, 512, 1024, 2048, 4096, 8192]
print("KV Cache Memory Growth - LLaMA-3 8B")
print(f"{'Seq Len':>10} {'KV Cache (MB)':>15} {'Per Token (KB)':>15}")
print("-" * 45)
for target_len in seq_lengths:
# Formula: 2 * layers * kv_heads * head_dim * seq_len * bytes
# 2 * 32 * 8 * 128 * target_len * 2 (FP16)
mem_bytes = 2 * 32 * 8 * 128 * target_len * 2
mem_mb = mem_bytes / 1e6
per_token_kb = (mem_bytes / target_len) / 1e3
print(f"{target_len:>10} {mem_mb:>15.1f} {per_token_kb:>15.1f}")
measure_kv_cache_growth()
Expected output:
KV Cache Memory Growth - LLaMA-3 8B
Seq Len KV Cache (MB) Per Token (KB)
---------------------------------------------
128 67.1 524.3
256 134.2 524.3
512 268.4 524.3
1024 536.9 524.3
2048 1073.7 524.3
4096 2147.5 524.3
8192 4294.9 524.3
Per-token KV cache is constant at 524 KB for this configuration - the total KV cache grows linearly with sequence length.
Sliding Window KV Cache
For very long contexts, you can limit memory by only caching the most recent tokens (sliding window attention). Mistral 7B uses this approach with window size 4096.
class SlidingWindowKVCache:
"""
KV cache that retains only the last `window_size` tokens.
Used by Mistral and similar models for long-context efficiency.
"""
def __init__(self, num_layers: int, window_size: int = 4096):
self.num_layers = num_layers
self.window_size = window_size
self.cache = [None] * num_layers
def update(self, layer_idx: int, key: torch.Tensor, value: torch.Tensor):
if self.cache[layer_idx] is None:
self.cache[layer_idx] = (key, value)
else:
prev_k, prev_v = self.cache[layer_idx]
full_k = torch.cat([prev_k, key], dim=2)
full_v = torch.cat([prev_v, value], dim=2)
# Trim to window size - evict oldest tokens
if full_k.shape[2] > self.window_size:
full_k = full_k[:, :, -self.window_size:, :]
full_v = full_v[:, :, -self.window_size:, :]
self.cache[layer_idx] = (full_k, full_v)
return self.cache[layer_idx]
def max_memory_bytes(self, batch_size: int, num_heads: int, head_dim: int) -> int:
"""Maximum memory usage is bounded regardless of sequence length."""
per_layer = 2 * batch_size * num_heads * self.window_size * head_dim * 2
return self.num_layers * per_layer
Production KV Cache Sizing Guide
def kv_cache_capacity_analysis(
gpu_memory_gb: float,
num_gpus: int,
model_weights_gb: float,
num_layers: int,
num_kv_heads: int,
head_dim: int,
max_seq_len: int,
kv_dtype_bytes: int = 2,
overhead_gb: float = 3.0
) -> dict:
"""
Analyze KV cache capacity for a given hardware configuration.
"""
total_gpu_gb = gpu_memory_gb * num_gpus
available_for_kv = total_gpu_gb - model_weights_gb - overhead_gb
kv_per_sequence_bytes = (
2 * num_layers * num_kv_heads * head_dim * max_seq_len * kv_dtype_bytes
)
kv_per_sequence_gb = kv_per_sequence_bytes / 1e9
max_concurrent_sequences = int(available_for_kv / kv_per_sequence_gb)
return {
"total_gpu_memory_gb": total_gpu_gb,
"model_weights_gb": model_weights_gb,
"available_for_kv_gb": round(available_for_kv, 1),
"kv_per_sequence_gb": round(kv_per_sequence_gb, 3),
"max_concurrent_sequences": max_concurrent_sequences,
}
# LLaMA-2 70B on 2x A100 80GB with GQA
result = kv_cache_capacity_analysis(
gpu_memory_gb=80, num_gpus=2, model_weights_gb=140,
num_layers=80, num_kv_heads=8, head_dim=128,
max_seq_len=4096, kv_dtype_bytes=2
)
print("LLaMA-2 70B on 2x A100 80GB:")
for k, v in result.items():
print(f" {k}: {v}")
# available_for_kv_gb: 17.0
# kv_per_sequence_gb: 0.268 (GQA saves significantly vs MHA!)
# max_concurrent_sequences: 63
# Compare with MHA (if the same model used full 64 heads for KV):
result_mha = kv_cache_capacity_analysis(
gpu_memory_gb=80, num_gpus=2, model_weights_gb=140,
num_layers=80, num_kv_heads=64, head_dim=128,
max_seq_len=4096, kv_dtype_bytes=2
)
print("\nSame model with MHA instead of GQA:")
for k, v in result_mha.items():
print(f" {k}: {v}")
# kv_per_sequence_gb: 2.147
# max_concurrent_sequences: 7
# GQA gives ~9x more concurrent sequences!
Common Mistakes
:::danger Pre-allocating maximum sequence length for all requests The naive approach allocates KV cache memory for the maximum possible sequence length when a request arrives. A request allocated for 4096 tokens that only generates 50 tokens wastes 98.8% of its KV cache memory while running. This internal fragmentation is the primary cause of poor GPU memory utilization in naive serving systems. Use PagedAttention or dynamic allocation. Never pre-allocate maximum sequence length upfront. :::
:::danger Forgetting KV cache in multi-GPU memory math When planning a multi-GPU deployment, engineers correctly calculate: "140 GB model weights / 80 GB per A100 = 2 GPUs minimum." They then forget that each concurrent request adds KV cache. At batch size 16 with 4096-token context for LLaMA-2 70B, the KV cache adds 4.3 GB (with GQA). The correct formula is: required_gpus = ceil((weights + kv_cache_total + overhead) / gpu_memory). :::
:::warning Using query head count instead of KV head count in GQA models
LLaMA-2 70B has 64 query heads but only 8 KV heads (GQA). When calculating KV cache size, always use the KV head count. Using query head count overestimates KV cache size by 8× for this model. Check the HuggingFace config for num_key_value_heads (not num_attention_heads).
:::
:::warning Not accounting for KV cache across all layers simultaneously
Some engineers calculate attention memory as if only one layer's KV cache matters. The full KV cache for all layers must be resident simultaneously during a decode step - you run all 80 layers in sequence, and each layer needs its own cached K, V tensors. The formula includes the sum across all layers: 2 × layers × kv_heads × head_dim × seq_len × bytes.
:::
Interview Questions
Q1: What does the KV cache store and why does it exist?
The KV cache stores the Key and Value tensors computed by the attention mechanism for all previously generated (and prompt) tokens. Without caching, generating token requires recomputing K and V for all previous tokens to - cost grows as for a sequence of length . With KV cache, K and V are computed once and stored, only computing K, V for the new token each step - cost becomes . Query tensors are not cached because each token's query is only used in that token's own decode step.
Q2: What is PagedAttention and what problem does it solve?
PagedAttention (Kwon et al., 2023) solves KV cache memory fragmentation in LLM serving. Traditional implementations allocate a contiguous memory block for each sequence's KV cache, sized for the maximum sequence length. This wastes memory through internal fragmentation (sequences using less than their maximum allocation) and external fragmentation (free blocks too small for new allocations). The paper measured 60–80% memory waste. PagedAttention divides GPU memory into fixed-size blocks (pages), each holding KV tensors for a fixed number of tokens (typically 16). A page table maps logical sequence positions to physical memory blocks. Pages from different sequences interleave freely, achieving 96%+ memory utilization vs 20% before.
Q3: What is the difference between MHA, MQA, and GQA, and how do they affect KV cache size?
Multi-Head Attention (MHA) gives each attention head its own K and V projections - largest KV cache. Multi-Query Attention (MQA, Shazeer 2019) uses a single shared K and V projection across all query heads - KV cache reduced by a factor of (number of query heads), e.g., 32× for a model with 32 heads. Grouped Query Attention (GQA, Ainslie et al. 2023) uses KV head groups for query heads - KV cache reduced by (typically 4–8×) with quality close to MHA. Most modern models (LLaMA-2 34B+, LLaMA-3, Mistral) use GQA as it provides most of MQA's memory savings with much less quality degradation.
Q4: For LLaMA-2 70B on 2× A100 80GB, how many concurrent requests can you serve at 4096-token context in FP16?
Total GPU memory: 160 GB. Model weights (FP16): 140 GB. Overhead: ~3 GB. Available for KV cache: ~17 GB. LLaMA-2 70B uses GQA with 8 KV heads, 128 head_dim, 80 layers. KV per sequence: 2 × 80 × 8 × 128 × 4096 × 2 bytes = 268 MB. Max concurrent sequences: floor(17,000 / 268) ≈ 63 with PagedAttention. With naive contiguous allocation and typical fragmentation (60–80% waste), this drops to roughly 12–25 concurrent sequences on the same hardware.
Q5: How does copy-on-write work in PagedAttention for beam search?
When generating multiple candidate sequences from the same prompt (beam search with beam width ), all candidates share the same prompt KV cache initially. With PagedAttention, prompt pages are reference-counted like OS copy-on-write memory. When a candidate sequence would write to a shared page (appending to the last partial block at the divergence point), PagedAttention allocates a new physical block, copies the shared content, decrements the original block's reference count, and the sequence writes to the new block. This means beam search with width 4 uses nearly the same prompt memory as a single sequence until beams diverge, rather than 4× the memory from the start.
Q6: What are the trade-offs of quantizing the KV cache to INT8?
Benefits: 2× memory reduction, directly enabling 2× more concurrent sequences on the same hardware. Risks: KV cache tensors are activations with dynamic ranges that vary per input, making quantization harder than for static weights. Per-token scale factors are required. Quantization error can accumulate across long sequences. For most chat, coding, and summarization tasks at typical context lengths (under 8K tokens), INT8 KV cache quality loss is negligible. For tasks sensitive to precision (complex multi-step reasoning, very long context) or extreme context lengths (32K+), evaluate carefully on representative benchmarks before deploying.
:::tip 🎮 Interactive Playground
Visualize this concept: Try the KV Cache Explained demo on the EngineersOfAI Playground - no code required.
:::
