
Module 5: Memory Systems for AI

LLM inference is a memory problem. A 70B-parameter model in BF16 requires 140GB of memory just to store the weights. The KV cache for a single conversation at 4K context in FP16 requires roughly another 1-10GB, depending on the attention architecture. When you are serving 100 concurrent users, the KV cache alone can exceed a hundred gigabytes. This is why vLLM, PagedAttention, and quantization exist: they are all solutions to the same fundamental constraint.

Understanding memory systems turns these techniques from black-box optimizations into logical consequences. Once you understand the problem PagedAttention solves, how it works becomes obvious.

The Memory Problem in LLM Inference

There are three distinct memory costs in LLM inference:

Model weights. Static, shared across all requests. Llama 3 70B in BF16: 70B parameters × 2 bytes = 140GB. A fixed cost.

KV cache. Per-request, and it grows with context length. Every token in the context stores keys and values for every layer: 2 (K and V) × n_layers × n_kv_heads × head_dim × bytes_per_element × context_length. For Llama 3 70B, which uses grouped-query attention (80 layers, 8 KV heads of dim 128), that works out to ~320KB per token in FP16, or ~1.3GB per request at 4K tokens; a pure multi-head variant caching all 64 heads would need ~10GB for the same request. A small calculator follows this list.

Activations. Per-batch, temporary. Usually smaller than KV cache but still significant for large batches.
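
To make the sizing concrete, here is a minimal calculator for the formula above. The helper name and the comparison are ours; the Llama 3 70B figures (80 layers, 8 KV heads, head dim 128) are its published architecture parameters.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int = 2) -> int:
    """Per-request KV cache: K and V, for every layer, for every token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * context_len

# Llama 3 70B with grouped-query attention: 8 KV heads of dim 128.
gqa = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, context_len=4096)
# Hypothetical full multi-head variant of the same model: all 64 heads cached.
mha = kv_cache_bytes(n_layers=80, n_kv_heads=64, head_dim=128, context_len=4096)

print(f"GQA: {gqa / 2**30:.2f} GiB per request")  # ~1.25 GiB
print(f"MHA: {mha / 2**30:.2f} GiB per request")  # ~10.00 GiB
```

Multiply the GQA figure by 100 concurrent requests and the KV cache alone passes 125GB, which is exactly the serving pressure described above.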

The classic problem: if you pre-allocate the KV cache as one contiguous block sized for the maximum possible context, you waste memory on every short conversation. If you allocate dynamically, you fragment memory and hurt throughput. PagedAttention resolves this by treating the KV cache like virtual memory: fixed-size blocks allocated on demand, with logical token positions mapped onto them.
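
To see why paging helps, here is a toy sketch of block-based KV allocation in the spirit of vLLM's block manager. The class and method names are ours, not vLLM's API; real block tables live on the GPU and are consumed by the attention kernel.

```python
class ToyBlockManager:
    """Toy paged KV allocator: fixed-size blocks handed out on demand."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size          # tokens per physical block
        self.free = list(range(num_blocks))   # free physical block ids
        self.tables = {}                      # request id -> list of block ids
        self.lengths = {}                     # request id -> tokens written

    def append_token(self, req: str) -> None:
        """Reserve room for one more token; grab a new block only on overflow."""
        n = self.lengths.get(req, 0)
        if n % self.block_size == 0:          # first token, or current block full
            if not self.free:
                raise MemoryError("KV cache exhausted")
            self.tables.setdefault(req, []).append(self.free.pop())
        self.lengths[req] = n + 1

    def release(self, req: str) -> None:
        """Return a finished request's blocks to the shared free pool."""
        self.free.extend(self.tables.pop(req, []))
        self.lengths.pop(req, None)

mgr = ToyBlockManager(num_blocks=1024)
for _ in range(40):
    mgr.append_token("req-0")
print(len(mgr.tables["req-0"]))  # 3 blocks for 40 tokens, not a max-context reservation
```

Because every request draws from one shared pool of identically sized blocks, short conversations stay cheap and there is no contiguous allocation left to fragment.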

Memory Hierarchy for LLM Serving

From on-chip SRAM through HBM to host DRAM and NVMe, each tier trades capacity for bandwidth, and most serving optimizations amount to keeping hot data in the fastest tier that fits.

Lessons in This Module

1. DRAM Architecture and Latency: rows, columns, refresh cycles, latency numbers
2. HBM (High Bandwidth Memory): how HBM achieves 3+ TB/s, stacked-die architecture
3. Cache Hierarchies and Locality: L1/L2/L3, spatial and temporal locality in ML
4. Memory-Bound vs Compute-Bound Workloads: measuring arithmetic intensity in practice (previewed in the sketch just after this list)
5. KV Cache Memory Management: KV cache sizing, context length vs memory tradeoffs
6. PagedAttention and vLLM: virtual memory for the KV cache, the block manager
7. Quantization and Memory Savings: INT8, NF4, FP8, and the memory vs quality tradeoffs
8. Memory Profiling for ML Engineers: torch.cuda.memory_summary, memory_profiler, OOM debugging
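
As a preview of lesson 4, arithmetic intensity is just FLOPs per byte moved, and comparing it to the hardware's compute-to-bandwidth ratio predicts whether an op is memory-bound. A minimal sketch, assuming A100-class numbers (312 TFLOP/s BF16, ~2 TB/s HBM on the 80GB part):

```python
def matmul_intensity(m: int, n: int, k: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte for C[m,n] = A[m,k] @ B[k,n], ignoring caching effects."""
    flops = 2 * m * n * k                                   # one multiply-add per (m, n, k)
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)  # read A and B, write C
    return flops / bytes_moved

ridge = 312e12 / 2e12  # A100 ridge point: ~156 FLOPs per byte

# Decode-step GEMV (one token against an 8192x8192 weight): far below the ridge.
print(matmul_intensity(1, 8192, 8192))     # ~1.0  -> memory-bound
# Large square matmul, as in training or prefill: far above it.
print(matmul_intensity(4096, 4096, 4096))  # ~1365 -> compute-bound
```

This one comparison explains why decode is bandwidth-limited while prefill can saturate the tensor cores.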

Key Concepts You Will Master

  • KV cache sizing - calculating exact memory requirements for any model at any context length
  • PagedAttention - the virtual memory abstraction that enables high-throughput LLM serving
  • Memory-bound operation identification - measuring where your model spends memory bandwidth
  • Quantization as memory compression - INT8/NF4 from a memory perspective, not just a compute one
  • OOM debugging - systematic approach to identifying and fixing GPU out-of-memory errors (a first-steps sketch follows this list)
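
As a preview of lesson 8, the first step in OOM debugging is asking PyTorch where the memory went. The torch.cuda calls below are real APIs; the workload is a stand-in for your own forward pass.

```python
import torch

torch.cuda.reset_peak_memory_stats()

# Stand-in workload: replace with your model's forward/backward pass.
x = torch.randn(4096, 4096, device="cuda")
y = x @ x

print(f"allocated: {torch.cuda.memory_allocated() / 2**20:.1f} MiB")      # live tensors
print(f"reserved:  {torch.cuda.memory_reserved() / 2**20:.1f} MiB")       # held by the caching allocator
print(f"peak:      {torch.cuda.max_memory_allocated() / 2**20:.1f} MiB")  # high-water mark since reset

# Full breakdown; reserved far above allocated usually means fragmentation.
print(torch.cuda.memory_summary())
```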

Prerequisites

© 2026 EngineersOfAI. All rights reserved.