
Module 4: Memory Management for ML

Out-of-memory errors are the most common infrastructure problem in ML engineering. A training job that runs fine at batch size 32 OOMs at batch size 64. An inference server that handles 10 concurrent requests crashes at 15. A data pipeline that processes 100k samples hangs on 1M. Every one of these problems traces back to memory management: how memory is allocated, when it is freed, and whether the allocation patterns match the hardware's capacity.

This module gives you the tools to understand memory at every layer: the OS virtual memory system, Python's reference-counting garbage collector, PyTorch's CUDA memory caching allocator, and GPU HBM constraints.

[Diagram: Memory Layers in an ML System]
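As a quick orientation, the sketch below probes three of these layers from a single Python process. It is a minimal example assuming a POSIX host (for the resource module), PyTorch installed, and a CUDA device available; the tensor size is arbitrary.

```python
import gc
import resource  # POSIX-only; reports process-level memory from the OS

import torch

# OS layer: peak resident set size of this process (KiB on Linux).
print(f"peak RSS: {resource.getrusage(resource.RUSAGE_SELF).ru_maxrss} KiB")

# Python layer: how many container objects the cyclic GC is tracking.
print(f"objects tracked by gc: {len(gc.get_objects())}")

# PyTorch/CUDA layer: memory handed out to live tensors vs. memory the
# caching allocator has reserved from the driver (reserved >= allocated).
if torch.cuda.is_available():
    x = torch.randn(1024, 1024, device="cuda")  # ~4 MiB of float32
    print(f"allocated: {torch.cuda.memory_allocated() / 2**20:.1f} MiB")
    print(f"reserved:  {torch.cuda.memory_reserved() / 2**20:.1f} MiB")
```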

Lessons in This Module

#  Lesson                                Key Concept
1  Stack vs Heap Allocation              Memory layout, allocation cost, stack frames
2  Python Memory Model                   Reference counting, cyclic GC, memory views
3  Reference Counting and GC             Python GC internals, weakref, memory leaks
4  Memory Leaks in ML Training           Accumulating tensors, detach, graph retention
5  GPU Memory Allocation Patterns        CUDA allocator, caching, fragmentation
6  Memory Profiling Tools                torch.cuda.memory_summary, memory_profiler, valgrind
7  Zero-Copy Data Transfer               Pinned memory, DMA transfers, avoiding copies
8  Memory-Efficient Training Strategies  Gradient checkpointing, activation offloading, mixed precision

Key Concepts You Will Master

  • CUDA memory caching allocator - why torch.cuda.empty_cache() does not always fix OOM errors (sketched after this list)
  • Python reference cycle detection - finding and fixing memory leaks in training loops (see the leak sketch below)
  • Gradient checkpointing - the memory-compute tradeoff that enables training larger models (sketched below)
  • Pinned memory - using page-locked host memory to accelerate GPU data transfer (sketched below)
  • Fragmentation - why GPU memory fragmentation can cause OOM errors even at 60% utilization and how to avoid it
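A minimal sketch of the caching-allocator point from the first bullet, assuming PyTorch with a CUDA device (the tensor size is arbitrary). Freeing a tensor returns its block to PyTorch's cache, not to the CUDA driver, which is why nvidia-smi still reports the memory as used:

```python
import torch

assert torch.cuda.is_available()

x = torch.randn(4096, 4096, device="cuda")  # ~64 MiB of float32
del x  # refcount hits zero; the block returns to PyTorch's cache

# allocated drops to ~0, but reserved stays high: the caching allocator
# keeps the freed block around for reuse instead of returning it.
print(f"allocated: {torch.cuda.memory_allocated() / 2**20:.1f} MiB")
print(f"reserved:  {torch.cuda.memory_reserved() / 2**20:.1f} MiB")

torch.cuda.empty_cache()  # releases only *unused* cached blocks
print(f"reserved after empty_cache: {torch.cuda.memory_reserved() / 2**20:.1f} MiB")
```

Note that empty_cache() cannot release blocks that still hold live tensors, so a fragmented pool can OOM even after calling it; that failure mode is the subject of the fragmentation bullet and lesson 5.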
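A sketch of the accumulating-tensor leak behind the second bullet (and lesson 4), using a toy linear model on CPU. Keeping the loss tensor itself across iterations retains a reference to each step's autograd graph, so memory grows with the step count; detaching to a plain Python float does not:

```python
import torch

model = torch.nn.Linear(512, 512)
opt = torch.optim.SGD(model.parameters(), lr=0.01)

running_loss = 0.0
for step in range(100):
    x = torch.randn(32, 512)
    loss = model(x).pow(2).mean()

    opt.zero_grad()
    loss.backward()
    opt.step()

    # Leak: `running_loss += loss` would make running_loss a tensor whose
    # grad_fn chain references every iteration's graph, growing each step.
    # Fix: .item() (or .detach()) so nothing autograd-related is retained.
    running_loss += loss.item()

print(f"mean loss: {running_loss / 100:.4f}")
```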
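A sketch of gradient checkpointing from the third bullet, using torch.utils.checkpoint on a toy two-layer block (the sizes are arbitrary). Activations inside the checkpointed segment are dropped after the forward pass and recomputed during backward, trading one extra forward pass through the segment for lower activation memory:

```python
import torch
from torch.utils.checkpoint import checkpoint

block = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.ReLU(),
    torch.nn.Linear(1024, 1024),
)
x = torch.randn(64, 1024, requires_grad=True)

# Intermediate activations inside `block` are not saved; backward()
# re-runs the block's forward pass to recompute them on demand.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
```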
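Finally, a sketch of the pinned-memory pattern from the fourth bullet, assuming a CUDA device. Page-locked host memory is eligible for asynchronous DMA transfer, which is what non_blocking=True (and DataLoader's pin_memory=True) rely on:

```python
import torch

assert torch.cuda.is_available()

pageable = torch.randn(1 << 20)               # ordinary pageable host memory
pinned = torch.randn(1 << 20).pin_memory()    # page-locked host memory

gpu_a = pageable.to("cuda")                   # staged, effectively synchronous
gpu_b = pinned.to("cuda", non_blocking=True)  # async DMA on the current stream
torch.cuda.synchronize()                      # wait before consuming gpu_b

print(gpu_a.shape, gpu_b.shape)
```

In a training loop the same idea typically appears as DataLoader(..., pin_memory=True) combined with batch.to(device, non_blocking=True).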

Prerequisites
