
Module 4: Memory Management for ML

Out-of-memory errors are the most common infrastructure problem in ML engineering. A training job that runs fine at batch size 32 OOMs at batch size 64. An inference server that handles 10 concurrent requests crashes at 15. A data pipeline that processes 100k samples hangs on 1M. Every one of these problems traces back to memory management: how memory is allocated, when it is freed, and whether the allocation patterns match the hardware's capacity.

This module gives you the tools to understand memory at every layer: the OS virtual memory system, Python's reference-counting garbage collector, PyTorch's CUDA memory caching allocator, and GPU HBM constraints.

[Diagram: Memory Layers in an ML System]
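As a quick orientation, the sketch below probes three of these layers from a single Python process. It is a minimal example assuming a POSIX host (for the resource module), PyTorch installed, and a CUDA device available; the tensor size is arbitrary.

```python
import gc
import resource  # POSIX-only; reports process-level memory from the OS

import torch

# OS layer: peak resident set size of this process (KiB on Linux).
print(f"peak RSS: {resource.getrusage(resource.RUSAGE_SELF).ru_maxrss} KiB")

# Python layer: how many container objects the cyclic GC is tracking.
print(f"objects tracked by gc: {len(gc.get_objects())}")

# PyTorch/CUDA layer: memory handed out to live tensors vs. memory the
# caching allocator has reserved from the driver (reserved >= allocated).
if torch.cuda.is_available():
    x = torch.randn(1024, 1024, device="cuda")  # ~4 MiB of float32
    print(f"allocated: {torch.cuda.memory_allocated() / 2**20:.1f} MiB")
    print(f"reserved:  {torch.cuda.memory_reserved() / 2**20:.1f} MiB")
```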

Lessons in This Module

#  Lesson                                Key Concept
1  Stack vs Heap Allocation              Memory layout, allocation cost, stack frames
2  Python Memory Model                   Reference counting, cyclic GC, memory views
3  Reference Counting and GC             Python GC internals, weakref, memory leaks
4  Memory Leaks in ML Training           Accumulating tensors, detach, graph retention
5  GPU Memory Allocation Patterns        CUDA allocator, caching, fragmentation
6  Memory Profiling Tools                torch.cuda.memory_summary, memory_profiler, valgrind
7  Zero-Copy Data Transfer               Pinned memory, DMA transfers, avoiding copies
8  Memory-Efficient Training Strategies  Gradient checkpointing, activation offloading, mixed precision

Key Concepts You Will Master

  • CUDA memory caching allocator - why torch.cuda.empty_cache() does not always fix OOM errors (sketched after this list)
  • Python reference cycle detection - finding and fixing memory leaks in training loops (see the leak sketch below)
  • Gradient checkpointing - the memory-compute tradeoff that enables training larger models (sketched below)
  • Pinned memory - using page-locked host memory to accelerate GPU data transfer (sketched below)
  • Fragmentation - why GPU memory fragmentation can cause OOM errors even at 60% utilization and how to avoid it
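A minimal sketch of the caching-allocator point from the first bullet, assuming PyTorch with a CUDA device (the tensor size is arbitrary). Freeing a tensor returns its block to PyTorch's cache, not to the CUDA driver, which is why nvidia-smi still reports the memory as used:

```python
import torch

assert torch.cuda.is_available()

x = torch.randn(4096, 4096, device="cuda")  # ~64 MiB of float32
del x  # refcount hits zero; the block returns to PyTorch's cache

# allocated drops to ~0, but reserved stays high: the caching allocator
# keeps the freed block around for reuse instead of returning it.
print(f"allocated: {torch.cuda.memory_allocated() / 2**20:.1f} MiB")
print(f"reserved:  {torch.cuda.memory_reserved() / 2**20:.1f} MiB")

torch.cuda.empty_cache()  # releases only *unused* cached blocks
print(f"reserved after empty_cache: {torch.cuda.memory_reserved() / 2**20:.1f} MiB")
```

Note that empty_cache() cannot release blocks that still hold live tensors, so a fragmented pool can OOM even after calling it; that failure mode is the subject of the fragmentation bullet and lesson 5.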
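A sketch of the accumulating-tensor leak behind the second bullet (and lesson 4), using a toy linear model on CPU. Keeping the loss tensor itself across iterations retains a reference to each step's autograd graph, so memory grows with the step count; detaching to a plain Python float does not:

```python
import torch

model = torch.nn.Linear(512, 512)
opt = torch.optim.SGD(model.parameters(), lr=0.01)

running_loss = 0.0
for step in range(100):
    x = torch.randn(32, 512)
    loss = model(x).pow(2).mean()

    opt.zero_grad()
    loss.backward()
    opt.step()

    # Leak: `running_loss += loss` would make running_loss a tensor whose
    # grad_fn chain references every iteration's graph, growing each step.
    # Fix: .item() (or .detach()) so nothing autograd-related is retained.
    running_loss += loss.item()

print(f"mean loss: {running_loss / 100:.4f}")
```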
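A sketch of gradient checkpointing from the third bullet, using torch.utils.checkpoint on a toy two-layer block (the sizes are arbitrary). Activations inside the checkpointed segment are dropped after the forward pass and recomputed during backward, trading one extra forward pass through the segment for lower activation memory:

```python
import torch
from torch.utils.checkpoint import checkpoint

block = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.ReLU(),
    torch.nn.Linear(1024, 1024),
)
x = torch.randn(64, 1024, requires_grad=True)

# Intermediate activations inside `block` are not saved; backward()
# re-runs the block's forward pass to recompute them on demand.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
```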
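Finally, a sketch of the pinned-memory pattern from the fourth bullet, assuming a CUDA device. Page-locked host memory is eligible for asynchronous DMA transfer, which is what non_blocking=True (and DataLoader's pin_memory=True) rely on:

```python
import torch

assert torch.cuda.is_available()

pageable = torch.randn(1 << 20)               # ordinary pageable host memory
pinned = torch.randn(1 << 20).pin_memory()    # page-locked host memory

gpu_a = pageable.to("cuda")                   # staged, effectively synchronous
gpu_b = pinned.to("cuda", non_blocking=True)  # async DMA on the current stream
torch.cuda.synchronize()                      # wait before consuming gpu_b

print(gpu_a.shape, gpu_b.shape)
```

In a training loop the same idea typically appears as DataLoader(..., pin_memory=True) combined with batch.to(device, non_blocking=True).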

Prerequisites
