Module 5: Memory Systems for AI
LLM inference is a memory problem. A 70B-parameter model in BF16 requires 140GB of memory just to store the weights. The KV cache for a single conversation at 4K context in FP16 adds roughly 1-10GB on top of that, depending on the attention architecture (grouped-query attention at the low end, full multi-head attention at the high end). Serving 100 concurrent users at longer contexts, the KV cache alone runs into hundreds of gigabytes. This is why vLLM, PagedAttention, and quantization exist: they are all solutions to the same fundamental constraint.
Understanding memory systems turns these techniques from black-box optimizations into logical consequences. Once you understand the problem PagedAttention solves, why it works becomes obvious.
The Memory Problem in LLM Inference
There are three distinct memory costs in LLM inference:
Model weights. Static, shared across all requests. Llama 3 70B in BF16 = 140GB. Fixed cost.
KV cache. Per-request, grows with context length. Each token in the context stores keys and values for every layer: 2 (K and V) × n_layers × n_kv_heads × head_dim × bytes_per_element × context_length. For Llama 3 70B (80 layers, 8 KV heads with head dimension 128, BF16) this works out to ~320KB per token, or ~1.3GB per request at 4K tokens; a full multi-head variant (64 KV heads) would need ~10GB. See the sizing sketch after this list.
Activations. Per-batch, temporary. Usually smaller than KV cache but still significant for large batches.
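To make the sizing concrete, here is a minimal calculator. The Llama 3 70B figures (70B parameters, 80 layers, 8 KV heads of dimension 128) reflect the published architecture; the rest is plain arithmetic.

```python
def weight_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Memory to hold the weights alone (BF16/FP16 = 2 bytes per parameter)."""
    return n_params * bytes_per_param / 1e9

def kv_cache_gb(context_len: int, n_layers: int, n_kv_heads: int,
                head_dim: int, bytes_per_elem: int = 2) -> float:
    """Per-request KV cache: 2 (K and V) x layers x kv_heads x head_dim x dtype bytes x tokens."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * context_len / 1e9

# Llama 3 70B: 80 layers, grouped-query attention with 8 KV heads of dim 128
print(f"weights:            {weight_memory_gb(70e9):.0f} GB")                        # ~140 GB
print(f"KV @ 4K (GQA):      {kv_cache_gb(4096, 80, 8, 128):.2f} GB per request")     # ~1.34 GB
print(f"KV @ 4K (full MHA): {kv_cache_gb(4096, 80, 64, 128):.1f} GB per request")    # ~10.7 GB
print(f"100 users @ 32K:    {100 * kv_cache_gb(32768, 80, 8, 128):.0f} GB total")    # ~1074 GB
```

Grouped-query attention, not the parameter count, is what keeps the per-request cache manageable; the 100-user line is why lesson 5 treats context length as a memory budget.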
The classical problem: if you pre-allocate the KV cache as a contiguous block sized for the maximum possible context length, you waste memory on short conversations. If you allocate dynamically, you fragment memory and hurt throughput. PagedAttention solves this by bringing virtual-memory-style paging to the KV cache.
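As a preview of lesson 6, here is a toy sketch of the idea (not vLLM's actual block manager; all names here are made up): the KV cache is carved into fixed-size blocks, and each request keeps a block table mapping its logical token positions to whichever physical blocks happened to be free.

```python
class ToyBlockManager:
    """Toy PagedAttention-style allocator: fixed-size KV blocks, per-request block tables."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size                 # tokens per block
        self.free_blocks = list(range(num_blocks))   # free physical block IDs
        self.block_tables = {}                       # request_id -> [physical block IDs]
        self.num_tokens = {}                         # request_id -> tokens cached so far

    def append_token(self, request_id: str) -> None:
        """Reserve a new physical block only when a request fills its last block."""
        table = self.block_tables.setdefault(request_id, [])
        n = self.num_tokens.get(request_id, 0)
        if n % self.block_size == 0:                 # crossed a block boundary
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted -> preempt or swap a request")
            table.append(self.free_blocks.pop())
        self.num_tokens[request_id] = n + 1

    def free(self, request_id: str) -> None:
        """Return all of a finished request's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.num_tokens.pop(request_id, None)


mgr = ToyBlockManager(num_blocks=1024)
for _ in range(40):                      # a 40-token request occupies ceil(40/16) = 3 blocks
    mgr.append_token("req-0")
print(len(mgr.block_tables["req-0"]))    # 3
```

Because blocks are small and need not be contiguous, waste is bounded to at most one partially filled block per request instead of a full maximum-length reservation.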
Memory Hierarchy for LLM Serving
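The lessons below walk this hierarchy from DRAM up through HBM and the on-chip caches. As a quick orientation, here is a minimal PyTorch sketch (assuming a CUDA GPU; the 1 GiB buffer size is an arbitrary choice) that times a device-to-device copy to estimate the HBM bandwidth you actually achieve, the number most LLM serving math revolves around.

```python
import torch

def measure_copy_bandwidth_gbs(n_bytes: int = 1 << 30) -> float:
    """Time a large device-to-device copy and report effective HBM bandwidth in GB/s."""
    x = torch.empty(n_bytes, dtype=torch.uint8, device="cuda")
    y = torch.empty_like(x)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    y.copy_(x)                                   # warm-up
    torch.cuda.synchronize()
    start.record()
    for _ in range(10):
        y.copy_(x)
    end.record()
    torch.cuda.synchronize()
    seconds = start.elapsed_time(end) / 1e3      # elapsed_time() returns milliseconds
    return 10 * 2 * n_bytes / seconds / 1e9      # each copy reads and writes n_bytes

if torch.cuda.is_available():
    print(f"effective copy bandwidth: {measure_copy_bandwidth_gbs():.0f} GB/s")
```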
Lessons in This Module
| # | Lesson | Key Concept |
|---|---|---|
| 1 | DRAM Architecture and Latency | Rows, columns, refresh cycles, latency numbers |
| 2 | HBM - High Bandwidth Memory | How HBM achieves 3+ TB/s, stacked die architecture |
| 3 | Cache Hierarchies and Locality | L1/L2/L3, spatial and temporal locality in ML |
| 4 | Memory-Bound vs Compute-Bound Workloads | Measuring arithmetic intensity in practice |
| 5 | KV Cache Memory Management | KV cache sizing, context length vs memory tradeoffs |
| 6 | PagedAttention and vLLM | Virtual memory for KV cache, block manager |
| 7 | Quantization and Memory Savings | INT8, NF4, FP8 - memory vs quality tradeoffs |
| 8 | Memory Profiling for ML Engineers | torch.cuda.memory_summary, memory_profiler, OOM debugging |
Key Concepts You Will Master
- KV cache sizing - calculating exact memory requirements for any model at any context length
- PagedAttention - the virtual memory abstraction that enables high-throughput LLM serving
- Memory-bound operation identification - measuring where your model spends memory bandwidth
- Quantization as memory compression - INT8/NF4 from a memory perspective, not just a compute one
- OOM debugging - systematic approach to identifying and fixing GPU out-of-memory errors (see the profiling sketch after this list)
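As a taste of the profiling lesson, here is a minimal sketch built on PyTorch's own memory instrumentation; the model and batch in the usage comment are placeholders for your own.

```python
import torch

def report_peak_memory(step_fn, label: str) -> None:
    """Run one step and report peak allocated vs reserved memory - the first
    numbers to check when debugging an OOM."""
    torch.cuda.reset_peak_memory_stats()
    step_fn()
    torch.cuda.synchronize()
    allocated = torch.cuda.max_memory_allocated() / 1e9
    reserved = torch.cuda.max_memory_reserved() / 1e9
    print(f"{label}: peak allocated {allocated:.2f} GB, peak reserved {reserved:.2f} GB")

# Hypothetical usage with your own model and batch:
# report_peak_memory(lambda: model(batch), "forward pass")
# print(torch.cuda.memory_summary())   # full per-allocator breakdown
```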
Prerequisites
- GPU Architecture
- Basic LLM understanding
