Module 5: Memory Systems for AI
LLM inference is a memory problem. A 70B-parameter model in BF16 requires 140GB of memory just to store the weights. The KV cache for a single conversation at 4K context in FP16 adds roughly 1-10GB on top of that, depending on the attention architecture (grouped-query attention at the low end, full multi-head attention at the high end). Serving 100 concurrent users at longer contexts, the KV cache alone runs into hundreds of gigabytes. This is why vLLM, PagedAttention, and quantization exist: they are all solutions to the same fundamental constraint.
Understanding memory systems turns these techniques from black-box optimizations into logical consequences. Once you understand the problem PagedAttention solves, why it works becomes obvious.
The Memory Problem in LLM Inference
There are three distinct memory costs in LLM inference:
Model weights. Static, shared across all requests. Llama 3 70B in BF16 = 140GB. Fixed cost.
KV cache. Per-request, grows with context length. Each token in the context stores keys and values for every layer: 2 (K and V) × n_layers × n_kv_heads × head_dim × bytes_per_element × context_length. For Llama 3 70B (80 layers, 8 KV heads with head dimension 128, BF16) this works out to ~320KB per token, or ~1.3GB per request at 4K tokens; a full multi-head variant (64 KV heads) would need ~10GB. See the sizing sketch after this list.
Activations. Per-batch, temporary. Usually smaller than KV cache but still significant for large batches.
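To make the sizing concrete, here is a minimal calculator. The Llama 3 70B figures (70B parameters, 80 layers, 8 KV heads of dimension 128) reflect the published architecture; the rest is plain arithmetic.

```python
def weight_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Memory to hold the weights alone (BF16/FP16 = 2 bytes per parameter)."""
    return n_params * bytes_per_param / 1e9

def kv_cache_gb(context_len: int, n_layers: int, n_kv_heads: int,
                head_dim: int, bytes_per_elem: int = 2) -> float:
    """Per-request KV cache: 2 (K and V) x layers x kv_heads x head_dim x dtype bytes x tokens."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * context_len / 1e9

# Llama 3 70B: 80 layers, grouped-query attention with 8 KV heads of dim 128
print(f"weights:            {weight_memory_gb(70e9):.0f} GB")                        # ~140 GB
print(f"KV @ 4K (GQA):      {kv_cache_gb(4096, 80, 8, 128):.2f} GB per request")     # ~1.34 GB
print(f"KV @ 4K (full MHA): {kv_cache_gb(4096, 80, 64, 128):.1f} GB per request")    # ~10.7 GB
print(f"100 users @ 32K:    {100 * kv_cache_gb(32768, 80, 8, 128):.0f} GB total")    # ~1074 GB
```

Grouped-query attention, not the parameter count, is what keeps the per-request cache manageable; the 100-user line is why lesson 5 treats context length as a memory budget.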
The classical problem: if you pre-allocate the KV cache as a contiguous block sized for the maximum possible context length, you waste memory on short conversations. If you allocate dynamically, you fragment memory and hurt throughput. PagedAttention solves this by bringing virtual-memory-style paging to the KV cache.
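As a preview of lesson 6, here is a toy sketch of the idea (not vLLM's actual block manager; all names here are made up): the KV cache is carved into fixed-size blocks, and each request keeps a block table mapping its logical token positions to whichever physical blocks happened to be free.

```python
class ToyBlockManager:
    """Toy PagedAttention-style allocator: fixed-size KV blocks, per-request block tables."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size                 # tokens per block
        self.free_blocks = list(range(num_blocks))   # free physical block IDs
        self.block_tables = {}                       # request_id -> [physical block IDs]
        self.num_tokens = {}                         # request_id -> tokens cached so far

    def append_token(self, request_id: str) -> None:
        """Reserve a new physical block only when a request fills its last block."""
        table = self.block_tables.setdefault(request_id, [])
        n = self.num_tokens.get(request_id, 0)
        if n % self.block_size == 0:                 # crossed a block boundary
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted -> preempt or swap a request")
            table.append(self.free_blocks.pop())
        self.num_tokens[request_id] = n + 1

    def free(self, request_id: str) -> None:
        """Return all of a finished request's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.num_tokens.pop(request_id, None)


mgr = ToyBlockManager(num_blocks=1024)
for _ in range(40):                      # a 40-token request occupies ceil(40/16) = 3 blocks
    mgr.append_token("req-0")
print(len(mgr.block_tables["req-0"]))    # 3
```

Because blocks are small and need not be contiguous, waste is bounded to at most one partially filled block per request instead of a full maximum-length reservation.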
Memory Hierarchy for LLM Serving
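The lessons below walk this hierarchy from DRAM up through HBM and the on-chip caches. As a quick orientation, here is a minimal PyTorch sketch (assuming a CUDA GPU; the 1 GiB buffer size is an arbitrary choice) that times a device-to-device copy to estimate the HBM bandwidth you actually achieve, the number most LLM serving math revolves around.

```python
import torch

def measure_copy_bandwidth_gbs(n_bytes: int = 1 << 30) -> float:
    """Time a large device-to-device copy and report effective HBM bandwidth in GB/s."""
    x = torch.empty(n_bytes, dtype=torch.uint8, device="cuda")
    y = torch.empty_like(x)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    y.copy_(x)                                   # warm-up
    torch.cuda.synchronize()
    start.record()
    for _ in range(10):
        y.copy_(x)
    end.record()
    torch.cuda.synchronize()
    seconds = start.elapsed_time(end) / 1e3      # elapsed_time() returns milliseconds
    return 10 * 2 * n_bytes / seconds / 1e9      # each copy reads and writes n_bytes

if torch.cuda.is_available():
    print(f"effective copy bandwidth: {measure_copy_bandwidth_gbs():.0f} GB/s")
```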
Lessons in This Module
| # | Lesson | Key Concept |
|---|---|---|
| 1 | DRAM Architecture and Latency | Rows, columns, refresh cycles, latency numbers |
| 2 | HBM - High Bandwidth Memory | How HBM achieves 3+ TB/s, stacked die architecture |
| 3 | Cache Hierarchies and Locality | L1/L2/L3, spatial and temporal locality in ML |
| 4 | Memory-Bound vs Compute-Bound Workloads | Measuring arithmetic intensity in practice |
| 5 | KV Cache Memory Management | KV cache sizing, context length vs memory tradeoffs |
| 6 | PagedAttention and vLLM | Virtual memory for KV cache, block manager |
| 7 | Quantization and Memory Savings | INT8, NF4, FP8 - memory vs quality tradeoffs |
| 8 | Memory Profiling for ML Engineers | torch.cuda.memory_summary, memory_profiler, OOM debugging |
Key Concepts You Will Master
- KV cache sizing - calculating exact memory requirements for any model at any context length
- PagedAttention - the virtual memory abstraction that enables high-throughput LLM serving
- Memory-bound operation identification - measuring where your model spends memory bandwidth
- Quantization as memory compression - INT8/NF4 from a memory perspective, not just a compute one
- OOM debugging - systematic approach to identifying and fixing GPU out-of-memory errors (see the profiling sketch after this list)
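As a taste of the profiling lesson, here is a minimal sketch built on PyTorch's own memory instrumentation; the model and batch in the usage comment are placeholders for your own.

```python
import torch

def report_peak_memory(step_fn, label: str) -> None:
    """Run one step and report peak allocated vs reserved memory - the first
    numbers to check when debugging an OOM."""
    torch.cuda.reset_peak_memory_stats()
    step_fn()
    torch.cuda.synchronize()
    allocated = torch.cuda.max_memory_allocated() / 1e9
    reserved = torch.cuda.max_memory_reserved() / 1e9
    print(f"{label}: peak allocated {allocated:.2f} GB, peak reserved {reserved:.2f} GB")

# Hypothetical usage with your own model and batch:
# report_peak_memory(lambda: model(batch), "forward pass")
# print(torch.cuda.memory_summary())   # full per-allocator breakdown
```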
Prerequisites
- GPU Architecture
- Basic LLM understanding
