
Memory Allocators for ML

The Production Scenario

A distributed training job runs fine for 40 minutes, then dies with CUDA out-of-memory on step 8,847. Every time. Not step 8,846. Not step 8,848. Exactly step 8,847. The GPU has 80 GB of HBM. The model needs about 60 GB. There should be 20 GB to spare. But the CUDA allocator reports 79.8 GB reserved and only 62.1 GB allocated. The gap - 17.7 GB - is reserved but not used. It is not available to satisfy new allocations.

This is CUDA memory fragmentation. PyTorch's caching allocator reserves memory from CUDA in large chunks and manages sub-allocation internally. After thousands of allocations and frees of variable-sized tensors, the reserved memory becomes fragmented: it contains many free blocks, but none large enough to satisfy the next allocation request. The allocator cannot compact these blocks (CUDA does not support moving allocations), and it does not release them back to CUDA on its own, because doing so risks another expensive cudaMalloc call later. So the memory sits there, reserved but unusable, until the next allocation causes an OOM.

The fix required understanding exactly how PyTorch's caching allocator works: its block size classes, its split/merge logic, and its reserved-vs-allocated distinction. With that understanding, you can call torch.cuda.empty_cache() at the right moment (not randomly, which is what most engineers try first), tune the PYTORCH_CUDA_ALLOC_CONF environment variable to change block size thresholds, and restructure the training loop to avoid the specific allocation pattern that causes fragmentation.

This lesson explains how allocators work from the CPU side (glibc, jemalloc, tcmalloc) through to the GPU side (PyTorch's CUDA caching allocator). By the end, you will be able to look at a torch.cuda.memory_stats() output and immediately understand what it means.


Why This Exists

The naive approach to memory allocation is to call the OS directly for every allocation. On Linux, this means calling mmap() or brk() for each allocation and returning memory to the OS on every free. This is catastrophically slow: system calls take ~1 microsecond each, and a Python program making millions of allocations per second would spend most of its time in system call overhead. Worse, the OS allocator has no knowledge of the program's allocation patterns, so it cannot optimize for them.

User-space allocators (glibc malloc, jemalloc, tcmalloc) sit between the program and the OS. They request large chunks of memory from the OS (arenas, regions, extents) and satisfy individual allocations from these chunks without going to the OS. This reduces system call frequency from millions-per-second to occasionally-per-second.
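
To see the gap this design closes, the sketch below is a rough micro-benchmark (absolute numbers vary by kernel and hardware): it times an allocation path that goes to the OS for every request - an anonymous mmap() per iteration - against ordinary small allocations that are served from user-space caches without a system call.

import mmap
import time

N = 50_000

# OS path: every iteration performs an mmap() and munmap() system call.
start = time.perf_counter()
for _ in range(N):
    m = mmap.mmap(-1, 4096)   # anonymous 4 KiB mapping straight from the kernel
    m.close()
os_ns = (time.perf_counter() - start) / N * 1e9

# Cached path: small objects served by pymalloc/malloc with no syscall on the hot path.
start = time.perf_counter()
blocks = []
for _ in range(N):
    blocks.append(bytearray(512))
blocks.clear()
cached_ns = (time.perf_counter() - start) / N * 1e9

print(f"mmap()/munmap() per allocation:  {os_ns:7.0f} ns")
print(f"user-space cache per allocation: {cached_ns:7.0f} ns")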

For ML workloads specifically, the challenge is that allocation patterns are very different from general-purpose programs. Training workloads allocate large (multi-GB) tensors with deterministic sizes that repeat every step. Serving workloads allocate moderate-sized buffers at high frequency. Both patterns stress aspects of general-purpose allocators that were not designed with them in mind, motivating both custom CPU allocators (jemalloc for Python serving) and entirely custom GPU allocators (PyTorch's caching allocator for CUDA).


Historical Context

glibc malloc is derived from Doug Lea's dlmalloc (1987). It uses a boundary-tag system: every allocated block has a header and footer containing the block size, enabling neighboring free blocks to be merged (coalesced) into larger free blocks. glibc's ptmalloc2 (2000) adds per-thread arenas to reduce lock contention. ptmalloc2 is still the default allocator on most Linux systems.

jemalloc was developed by Jason Evans for FreeBSD in 2005, and later adopted by Mozilla Firefox and by Meta's production servers. Its key insight is that the traditional allocator design (a few large arenas with complex coalescing) performs poorly on modern multi-core systems due to false sharing and lock contention. jemalloc uses per-CPU arenas, size-class bins with minimal coalescing overhead, and a design that produces lower fragmentation on long-running server workloads.

tcmalloc (thread-caching malloc) was developed at Google (Sanjay Ghemawat and Paul Menage, around 2005). Its key contribution is aggressive thread-local caching: each thread has its own free list for small allocations, so the common case (alloc and free in the same thread) requires no synchronization at all. tcmalloc is used throughout Google's production systems and is published as open source in the google/tcmalloc project, which builds on Abseil.

PyTorch's CUDA caching allocator was developed at Facebook AI Research (Adam Paszke and others, 2016-2018). The fundamental constraint that shaped its design is that cudaMalloc is extremely slow (~100 ms for a large allocation) compared to malloc (~100 ns). The caching allocator's entire design is built around never calling cudaMalloc unnecessarily.


Core Concepts

General Allocator Design: The Common Structure

All general-purpose allocators share a common structure, even though their implementations differ:

  1. They request large chunks of memory from the OS (via brk() or mmap()) and sub-allocate from those chunks.
  2. Small requests are rounded up to a fixed set of size classes, each backed by its own free list.
  3. Freed blocks go back onto a free list for reuse; free blocks may be split or coalesced with neighbors to serve requests of different sizes.
  4. Per-thread or per-arena caching keeps the common allocation path free of lock contention.
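
To make that shared structure concrete, here is a deliberately simplified, illustrative sketch (not how any production allocator is implemented): size classes, one free list per class, and large chunks "requested from the OS" that allocations are carved out of.

# Toy size-class allocator: illustrative only.
CHUNK_SIZE = 1 << 20                      # pretend "mmap from the OS" granularity
SIZE_CLASSES = [64, 128, 256, 512, 1024]  # requests are rounded up to one of these

class ToyAllocator:
    def __init__(self):
        self.free_lists = {c: [] for c in SIZE_CLASSES}  # per-class free lists
        self.chunks = []                 # chunks "requested from the OS"
        self.bump_offset = CHUNK_SIZE    # force a chunk request on the first alloc

    def _size_class(self, size):
        for c in SIZE_CLASSES:
            if size <= c:
                return c
        raise ValueError("large allocations bypass the cache (direct mmap)")

    def alloc(self, size):
        c = self._size_class(size)          # internal fragmentation: c - size bytes wasted
        if self.free_lists[c]:              # fast path: reuse a cached free block
            return self.free_lists[c].pop()
        if self.bump_offset + c > CHUNK_SIZE:   # slow path: get a new chunk from the "OS"
            self.chunks.append(bytearray(CHUNK_SIZE))
            self.bump_offset = 0
        addr = (len(self.chunks) - 1, self.bump_offset, c)
        self.bump_offset += c
        return addr

    def free(self, addr):
        # Freed blocks go back to the per-class free list, not to the OS: this is
        # why RSS stays high after frees, and why mixed sizes can fragment the cache.
        self.free_lists[addr[2]].append(addr)

a = ToyAllocator()
blocks = [a.alloc(s) for s in (100, 100, 500, 1000)]
for b in blocks:
    a.free(b)
print({c: len(fl) for c, fl in a.free_lists.items()})  # cached free blocks per class

Freed blocks stay on the allocator's own free lists rather than returning to the OS, which is exactly the reserved-versus-in-use gap discussed next.
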
Fragmentation: Internal vs External

Internal fragmentation occurs when an allocated block is larger than the requested size. The wasted space is inside the allocation. This is unavoidable in size-class systems: a 513-byte request goes into a 576-byte size class, wasting 63 bytes.

External fragmentation occurs when free memory exists but is not in contiguous blocks large enough to satisfy new requests. This is the dangerous form: the process has enough total free memory but cannot use it.

$$\text{Fragmentation ratio} = 1 - \frac{\text{usable allocated memory}}{\text{total memory reserved from OS}}$$

A fragmentation ratio of 0 means no waste. A ratio of 0.5 means half the reserved memory is wasted due to fragmentation - a sign of pathological allocation patterns.

import ctypes
import ctypes.util
import os
import sys

def get_malloc_stats():
    """
    Get memory allocation statistics using malloc_stats() and /proc/self/status.
    Works on Linux with glibc.
    """
    if sys.platform != 'linux':
        print("malloc_stats only available on Linux with glibc")
        return {}

    try:
        libc = ctypes.CDLL(ctypes.util.find_library('c'), use_errno=True)
        # malloc_stats prints to stderr - useful for debugging
        libc.malloc_stats()
    except Exception as e:
        print(f"malloc_stats not available: {e}")

    # Read from /proc/self/status for RSS/VMS
    stats = {}
    try:
        with open('/proc/self/status', 'r') as f:
            for line in f:
                if line.startswith(('VmRSS:', 'VmSize:', 'VmPeak:', 'VmData:')):
                    parts = line.split()
                    stats[parts[0].rstrip(':')] = int(parts[1])
    except FileNotFoundError:
        pass

    return stats

# Check allocator in use
def detect_allocator():
    """Detect which malloc implementation is in use."""
    try:
        import subprocess
        result = subprocess.run(
            ['python3', '-c',
             'import ctypes; '
             'libc = ctypes.CDLL(None); '
             'print(libc.malloc_usable_size(ctypes.c_void_p(0)))'],
            capture_output=True, text=True
        )
        if result.returncode == 0:
            # malloc_usable_size is provided by glibc (and compatible allocators)
            print("Allocator appears to be glibc malloc")
    except Exception:
        pass

    # Check LD_PRELOAD for jemalloc or tcmalloc
    ld_preload = os.environ.get('LD_PRELOAD', '')
    if 'jemalloc' in ld_preload:
        print("jemalloc detected via LD_PRELOAD")
    elif 'tcmalloc' in ld_preload:
        print("tcmalloc detected via LD_PRELOAD")
    else:
        print(f"LD_PRELOAD: '{ld_preload}' (no custom allocator)")

detect_allocator()
stats = get_malloc_stats()
if stats:
    print(f"RSS: {stats.get('VmRSS', 'N/A')} kB")
    print(f"VSZ: {stats.get('VmSize', 'N/A')} kB")

jemalloc: The ML Python Allocator

jemalloc's design advantages for long-running Python ML workloads:

  1. Per-CPU arenas reduce lock contention in multi-threaded data loading
  2. Decay-based purging returns unused memory to the OS more aggressively than glibc
  3. Lower fragmentation on server workloads - important for multi-day training jobs

The key metric jemalloc improves is RSS (Resident Set Size) stability. With glibc, a Python ML serving process often shows RSS growing indefinitely over days - not because of Python-level memory leaks, but because glibc malloc retains freed memory in its internal heap rather than returning it to the OS. jemalloc's decay-based purging addresses this.

# Using jemalloc via LD_PRELOAD (no code changes required)
# Install: sudo apt-get install libjemalloc-dev

# Check if jemalloc is available
ls /usr/lib/x86_64-linux-gnu/libjemalloc.so* 2>/dev/null || \
echo "jemalloc not found - install with: sudo apt install libjemalloc-dev"

# Use jemalloc for a Python process
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 python3 your_script.py

# Or set in your service startup script
export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2

# For Docker, add to Dockerfile:
# ENV LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2
# Benchmarking allocator fragmentation
# Run this script twice: once normally, once with jemalloc via LD_PRELOAD

import os
import time
import random
import psutil

def measure_rss():
    """Get current RSS in MB."""
    process = psutil.Process(os.getpid())
    return process.memory_info().rss / 1024 / 1024

def fragmentation_workload():
    """
    Simulates a serving workload that causes allocator fragmentation:
    - Many allocations of varying sizes
    - Random lifetimes (some freed immediately, some held)
    """
    held_objects = []

    for iteration in range(200):
        # Allocate objects of varying sizes
        new_objects = []
        for _ in range(500):
            size = random.choice([100, 1000, 10000, 100000])
            obj = bytearray(size)
            new_objects.append(obj)

        # Hold some objects for "a while" (simulate request state)
        held_objects.extend(new_objects[:100])
        if len(held_objects) > 1000:
            held_objects = held_objects[-500:]  # keep recent ones

        # Free the rest
        del new_objects

    return held_objects

rss_before = measure_rss()
print(f"RSS before workload: {rss_before:.1f} MB")

start = time.perf_counter()
result = fragmentation_workload()
elapsed = time.perf_counter() - start

rss_peak = measure_rss()
print(f"RSS after workload: {rss_peak:.1f} MB")
print(f"Elapsed: {elapsed*1000:.0f} ms")

del result
import gc
gc.collect()
time.sleep(1)  # allow allocator to return memory to OS

rss_final = measure_rss()
print(f"RSS after cleanup: {rss_final:.1f} MB")
print(f"Retained by allocator: {rss_final - rss_before:.1f} MB")
print()
print("Run with LD_PRELOAD=libjemalloc.so to compare retention")
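
If jemalloc's defaults still retain more memory than you want, its decay-based purging can be tuned. A hedged sketch follows: the option names are from jemalloc 5.x and the library path is the Debian/Ubuntu default, so verify both against your installation, and note that "your_script.py" is a placeholder for your own entry point. LD_PRELOAD and MALLOC_CONF must be set before the interpreter starts (jemalloc reads them at load time), so set them on a child process rather than in the running one.

import os
import subprocess

env = dict(os.environ)
env["LD_PRELOAD"] = "/usr/lib/x86_64-linux-gnu/libjemalloc.so.2"
# dirty_decay_ms / muzzy_decay_ms: how long freed pages may sit before being
# purged back to the OS; background_thread moves purging off the request path.
env["MALLOC_CONF"] = "dirty_decay_ms:10000,muzzy_decay_ms:10000,background_thread:true"

# Launch the workload (placeholder script name) with jemalloc preloaded and tuned.
subprocess.run(["python3", "your_script.py"], env=env)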

tcmalloc: Thread-Caching Design

tcmalloc's key feature is aggressive thread-local caching. Each thread has its own per-thread cache (a free list for each size class). Allocations and frees that stay within the same thread never touch the global heap - no locks, no atomics, near-zero overhead.

The trade-off: tcmalloc uses more memory (each thread's cache holds pre-fetched blocks even if they are not immediately needed). For ML workloads with many threads (data loading workers), this can mean 10-50 MB of "overhead" per thread in free block caches.

tcmalloc is generally preferred for throughput-critical serving systems (Google uses it production-wide). jemalloc is generally preferred for long-running processes where RSS stability matters more than peak throughput.

# Measuring allocator throughput
import os
import time

def allocation_benchmark(n_allocs=1_000_000, size=64):
    """
    Benchmark raw allocation throughput.
    Run with different allocators via LD_PRELOAD to compare.
    """
    start = time.perf_counter()

    # Python list allocation (goes through pymalloc for small sizes)
    objects = []
    for _ in range(n_allocs):
        obj = bytearray(size)
        objects.append(obj)

    alloc_time = time.perf_counter() - start

    start = time.perf_counter()
    del objects
    free_time = time.perf_counter() - start

    alloc_ns = alloc_time / n_allocs * 1e9
    free_ns = free_time / n_allocs * 1e9

    print(f"Allocation benchmark (n={n_allocs:,}, size={size}B):")
    print(f"  Alloc: {alloc_ns:.1f} ns/op ({alloc_time*1000:.0f} ms total)")
    print(f"  Free:  {free_ns:.1f} ns/op ({free_time*1000:.0f} ms total)")
    print(f"  LD_PRELOAD: {os.environ.get('LD_PRELOAD', 'none (glibc default)')}")
    return alloc_ns

allocation_benchmark(n_allocs=100_000, size=256)
allocation_benchmark(n_allocs=100_000, size=4096)
allocation_benchmark(n_allocs=10_000, size=65536)

PyTorch's CUDA Caching Allocator

PyTorch's GPU memory allocator is the most important allocator for ML engineers to understand. Its design is driven by a fundamental constraint: CUDA memory allocation (cudaMalloc) is orders of magnitude slower than CPU malloc - milliseconds rather than nanoseconds.

  • CPU malloc: ~100 nanoseconds
  • CUDA cudaMalloc: ~50-200 milliseconds (requires synchronization with the GPU driver)

This means you cannot call cudaMalloc for every tensor allocation. PyTorch's caching allocator solves this by maintaining a pool of previously allocated CUDA memory blocks, re-using them for new allocations without going back to CUDA.
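
A quick way to see the cache at work is to time the same allocation twice (this requires a CUDA device; absolute timings vary by GPU and driver). The first allocation of a given size pays for cudaMalloc; after the tensor is freed, a second allocation of the same size is served from the caching allocator.

import time
import torch

def timed_alloc(n_bytes):
    torch.cuda.synchronize()
    start = time.perf_counter()
    t = torch.empty(n_bytes, dtype=torch.uint8, device="cuda")
    torch.cuda.synchronize()
    return t, (time.perf_counter() - start) * 1e3  # milliseconds

if torch.cuda.is_available():
    torch.cuda.empty_cache()
    t, first_ms = timed_alloc(256 * 1024 * 1024)    # first time: goes to cudaMalloc
    del t                                            # block returns to the cache, not to CUDA
    _, second_ms = timed_alloc(256 * 1024 * 1024)   # same size: reused from the cache
    print(f"first allocation:  {first_ms:.3f} ms (cudaMalloc)")
    print(f"cached allocation: {second_ms:.3f} ms (caching allocator)")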

Reserved vs Allocated Memory

The critical distinction in PyTorch's GPU memory model:

  • Allocated memory: memory currently held by live tensors. torch.cuda.memory_allocated()
  • Reserved memory: memory held by the caching allocator (allocated from CUDA but not necessarily in use by tensors). torch.cuda.memory_reserved()
  • Free blocks: reserved - allocated. Blocks the allocator holds for reuse.

The allocator does not proactively call cudaFree. Cached blocks are released only when you explicitly call torch.cuda.empty_cache(), when an allocation fails and the allocator frees its cache before retrying, or when a garbage_collection_threshold is configured (see the tuning section below). In practice, reserved memory grows over a training run and stays high unless the cache is cleared.

import torch

def print_cuda_memory_summary(device=0):
    """Print a human-readable CUDA memory summary."""
    if not torch.cuda.is_available():
        print("CUDA not available")
        return

    stats = torch.cuda.memory_stats(device)
    allocated = torch.cuda.memory_allocated(device)
    reserved = torch.cuda.memory_reserved(device)

    print(f"CUDA Memory Summary (device {device}):")
    print(f"  Allocated:   {allocated / 1024**3:.3f} GB (live tensors)")
    print(f"  Reserved:    {reserved / 1024**3:.3f} GB (held by allocator)")
    print(f"  Free blocks: {(reserved - allocated) / 1024**3:.3f} GB (in cache)")

    if reserved > 0:
        frag_ratio = (reserved - allocated) / reserved
        print(f"  Fragmentation ratio: {frag_ratio:.1%}")

    # Key statistics from memory_stats
    interesting_keys = [
        'active_bytes.all.current',
        'reserved_bytes.all.current',
        'allocated_bytes.all.current',
        'active_bytes.all.peak',
        'num_alloc_retries',  # times the allocator freed its cache and retried cudaMalloc
        'num_ooms',           # number of OOM events
    ]

    print("\n  Detailed stats:")
    for key in interesting_keys:
        if key in stats:
            value = stats[key]
            if 'bytes' in key:
                print(f"    {key}: {value / 1024**2:.1f} MB")
            else:
                print(f"    {key}: {value}")

def demonstrate_cuda_allocator():
    if not torch.cuda.is_available():
        print("CUDA not available - showing CPU simulation")
        # Demonstrate the concept with CPU tensors
        t1 = torch.zeros(1024, 1024)
        t2 = torch.zeros(1024, 1024)
        print("Two 4 MB tensors allocated")
        del t1
        print("After del t1: memory may or may not be returned to the OS")
        del t2
        return

    device = torch.device('cuda:0')
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()

    print("=== Initial state ===")
    print_cuda_memory_summary()

    # Allocate a large tensor
    print("\n=== After allocating 1GB tensor ===")
    t1 = torch.zeros(256 * 1024 * 1024, dtype=torch.float32, device=device)  # 1 GB
    print_cuda_memory_summary()

    # Free it
    print("\n=== After del t1 (memory stays reserved) ===")
    del t1
    torch.cuda.synchronize()
    print_cuda_memory_summary()

    # empty_cache: returns free blocks to CUDA
    print("\n=== After empty_cache ===")
    torch.cuda.empty_cache()
    print_cuda_memory_summary()

demonstrate_cuda_allocator()

CUDA Memory Fragmentation

Fragmentation in the CUDA allocator happens when many different-sized tensors are allocated and freed in an irregular pattern. After fragmentation, the allocator has many small free blocks but cannot satisfy a large allocation:

import torch
import gc

def demonstrate_fragmentation():
    if not torch.cuda.is_available():
        print("CUDA not available")
        return

    device = torch.device('cuda:0')
    torch.cuda.empty_cache()

    print("Demonstrating CUDA memory fragmentation")
    print(f"Total GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")

    # Create fragmentation: allocate many tensors of varying sizes, then free them
    tensors = []
    sizes_mb = [10, 50, 10, 100, 10, 200, 10, 50, 10, 100] * 3  # varying sizes

    print(f"\nAllocating {sum(sizes_mb)} MB in {len(sizes_mb)} tensors...")
    for size_mb in sizes_mb:
        n_floats = size_mb * 1024 * 1024 // 4  # float32 = 4 bytes
        t = torch.zeros(n_floats, dtype=torch.float32, device=device)
        tensors.append(t)

    allocated = torch.cuda.memory_allocated(0) / 1024**2
    reserved = torch.cuda.memory_reserved(0) / 1024**2
    print(f"After allocation: {allocated:.0f} MB allocated, {reserved:.0f} MB reserved")

    # Free every other tensor (create gaps = fragmentation)
    print("\nFreeing every other tensor to create fragmentation...")
    for i in range(0, len(tensors), 2):
        tensors[i] = None  # drop the reference; the block becomes a free block in the cache

    gc.collect()
    torch.cuda.synchronize()

    allocated = torch.cuda.memory_allocated(0) / 1024**2
    reserved = torch.cuda.memory_reserved(0) / 1024**2
    print(f"After partial free: {allocated:.0f} MB allocated, {reserved:.0f} MB reserved")
    print(f"Fragmentation: {reserved - allocated:.0f} MB reserved but not allocated")

    # The fragmented free blocks may not satisfy a large contiguous allocation
    # This is where OOM can occur even with "enough" free memory

    # empty_cache releases all free blocks back to CUDA
    torch.cuda.empty_cache()
    allocated_after = torch.cuda.memory_allocated(0) / 1024**2
    reserved_after = torch.cuda.memory_reserved(0) / 1024**2
    print(f"\nAfter empty_cache: {allocated_after:.0f} MB allocated, {reserved_after:.0f} MB reserved")

    # Clean up
    tensors.clear()
    torch.cuda.empty_cache()

demonstrate_fragmentation()

CUDA OOM Debugging Procedure

When you get torch.cuda.OutOfMemoryError, follow this systematic procedure:

import torch
import gc
from typing import Optional

def cuda_oom_diagnostic(device: int = 0) -> dict:
    """
    Run a diagnostic when CUDA OOM occurs.
    Call this in your except clause to understand why OOM happened.
    """
    if not torch.cuda.is_available():
        return {}

    report = {}

    # 1. Basic memory summary
    report['allocated_gb'] = torch.cuda.memory_allocated(device) / 1024**3
    report['reserved_gb'] = torch.cuda.memory_reserved(device) / 1024**3
    report['total_gb'] = torch.cuda.get_device_properties(device).total_memory / 1024**3
    report['free_in_allocator_gb'] = (
        report['reserved_gb'] - report['allocated_gb']
    )

    # 2. Memory stats for deeper analysis
    stats = torch.cuda.memory_stats(device)
    report['num_alloc_retries'] = stats.get('num_alloc_retries', 0)
    report['peak_allocated_gb'] = stats.get(
        'active_bytes.all.peak', 0
    ) / 1024**3

    # 3. Check if fragmentation is the issue
    # If free_in_allocator > 1 GB, fragmentation is likely the cause
    if report['free_in_allocator_gb'] > 1.0:
        report['likely_cause'] = 'fragmentation'
        report['recommendation'] = (
            'Call torch.cuda.empty_cache() and retry, or restructure '
            'allocation patterns to avoid variable-size tensors'
        )
    else:
        report['likely_cause'] = 'genuine OOM - need more GPU memory or reduce batch size'
        report['recommendation'] = (
            'Reduce batch size, use gradient checkpointing, '
            'or switch to a model with lower memory requirements'
        )

    return report

def safe_allocate(size_gb: float, device: int = 0) -> Optional[torch.Tensor]:
    """Allocate a tensor, handling OOM gracefully."""
    if not torch.cuda.is_available():
        return None

    n_floats = int(size_gb * 1024**3 / 4)
    try:
        t = torch.zeros(n_floats, dtype=torch.float32,
                        device=f'cuda:{device}')
        return t
    except torch.cuda.OutOfMemoryError:
        # Step 1: free Python objects and GC
        gc.collect()
        torch.cuda.empty_cache()

        # Step 2: run diagnostic
        diag = cuda_oom_diagnostic(device)
        print("OOM Diagnostic:")
        for k, v in diag.items():
            if isinstance(v, float):
                print(f"  {k}: {v:.3f}")
            else:
                print(f"  {k}: {v}")

        # Step 3: retry after cache clear
        try:
            t = torch.zeros(n_floats, dtype=torch.float32,
                            device=f'cuda:{device}')
            print("Retry succeeded after empty_cache()")
            return t
        except torch.cuda.OutOfMemoryError:
            print("Retry failed - genuine OOM")
            return None

# Usage
if torch.cuda.is_available():
    t = safe_allocate(2.0)  # try to allocate 2 GB
    if t is not None:
        del t
        torch.cuda.empty_cache()

Memory Budget Planning for Transformer Training

For transformer training, the memory consumed by a single training step is:

$$\text{Total} = M_\theta + M_g + M_o + M_a + M_{\text{batch}}$$

Where:

  • $M_\theta$ = model parameters (weights)
  • $M_g$ = gradients (same size as weights)
  • $M_o$ = optimizer state (Adam: 2x weights for m and v)
  • $M_a$ = activations (depends on sequence length and batch size)
  • $M_{\text{batch}}$ = input/output batch tensors

For mixed precision (fp16/bf16 parameters, fp32 master weights):

$$M_\theta(\text{mixed}) = 2 \text{ bytes/param (fp16)} + 4 \text{ bytes/param (fp32 master)}$$
$$M_g(\text{mixed}) = 2 \text{ bytes/param}$$
$$M_o(\text{Adam, mixed}) = 8 \text{ bytes/param (two fp32 states)}$$

Total for weights, gradients, and optimizer state in mixed precision: approximately 16 bytes per parameter.

def training_memory_budget(
    n_params: int,
    seq_len: int,
    batch_size: int,
    n_layers: int,
    d_model: int,
    n_heads: int,
    mixed_precision: bool = True,
    gradient_checkpointing: bool = False,
) -> dict:
    """
    Estimate GPU memory requirements for transformer training.

    Parameters reflect a decoder-only transformer (GPT-style).
    All values in GB.
    """
    bytes_per_gb = 1024**3

    # Weights
    if mixed_precision:
        # fp16 params + fp32 master weights
        param_bytes = n_params * (2 + 4)
    else:
        param_bytes = n_params * 4  # fp32

    weight_gb = param_bytes / bytes_per_gb

    # Gradients (same precision as params for mixed, fp32 otherwise)
    if mixed_precision:
        grad_gb = n_params * 2 / bytes_per_gb  # fp16 grads
    else:
        grad_gb = n_params * 4 / bytes_per_gb

    # Optimizer state: Adam maintains m (1st moment) and v (2nd moment)
    # Both in fp32 regardless of training precision
    optimizer_gb = n_params * 8 / bytes_per_gb

    # Activations: the expensive variable
    # (rough approximations; the exact value depends on the attention implementation)
    if gradient_checkpointing:
        # Checkpointing recomputes activations during backward
        # Only store activations at checkpoint boundaries (one per layer)
        activation_bytes = n_layers * seq_len * d_model * 2  # fp16
    else:
        # Store all intermediate activations for backward pass
        # Rough formula: ~34 * seq_len * d_model * n_layers bytes
        activation_bytes = n_layers * 34 * seq_len * d_model * 2  # fp16
    activation_gb = activation_bytes * batch_size / bytes_per_gb

    # KV cache per layer (during attention): 4 * seq_len * d_model bytes
    kv_cache_gb = n_layers * 4 * seq_len * d_model * batch_size * 2 / bytes_per_gb

    total_gb = weight_gb + grad_gb + optimizer_gb + activation_gb

    return {
        'weights_gb': round(weight_gb, 2),
        'gradients_gb': round(grad_gb, 2),
        'optimizer_gb': round(optimizer_gb, 2),
        'activations_gb': round(activation_gb, 2),
        'kv_cache_gb': round(kv_cache_gb, 2),
        'total_gb': round(total_gb, 2),
        'overhead_gb': round(total_gb * 0.15, 2),  # 15% allocator overhead estimate
        'total_with_overhead_gb': round(total_gb * 1.15, 2),
    }

# GPT-3 175B on A100-80GB with tensor parallelism (8 GPUs)
gpt3_per_gpu = training_memory_budget(
    n_params=175_000_000_000 // 8,  # divided across 8 GPUs
    seq_len=2048,
    batch_size=1,
    n_layers=96 // 8,  # layer parallelism
    d_model=12288,
    n_heads=96,
    mixed_precision=True,
    gradient_checkpointing=True,
)
print("GPT-3 175B per GPU (8-GPU tensor parallel):")
for k, v in gpt3_per_gpu.items():
    print(f"  {k:35s}: {v:.2f} GB")

# LLaMA-7B single GPU training
llama7b = training_memory_budget(
    n_params=7_000_000_000,
    seq_len=2048,
    batch_size=4,
    n_layers=32,
    d_model=4096,
    n_heads=32,
    mixed_precision=True,
    gradient_checkpointing=True,
)
print("\nLLaMA-7B single GPU (with gradient checkpointing):")
for k, v in llama7b.items():
    print(f"  {k:35s}: {v:.2f} GB")

Parsing torch.cuda.memory_stats()

torch.cuda.memory_stats() returns a large dictionary. Here is how to parse the important fields:

import torch

def parse_cuda_memory_stats(device: int = 0) -> None:
    """
    Parse and explain torch.cuda.memory_stats() output.
    """
    if not torch.cuda.is_available():
        print("CUDA not available")
        return

    stats = torch.cuda.memory_stats(device)

    # Group stats by category
    categories = {
        'Allocation activity': [
            'allocation.all.current',
            'allocation.all.peak',
            'allocation.all.allocated',
            'allocation.all.freed',
        ],
        'Bytes in use': [
            'allocated_bytes.all.current',
            'allocated_bytes.all.peak',
            'active_bytes.all.current',
            'active_bytes.all.peak',
        ],
        'Reserved (caching allocator)': [
            'reserved_bytes.all.current',
            'reserved_bytes.all.peak',
        ],
        'Errors and retries': [
            'num_alloc_retries',
            'num_ooms',
        ],
        'Fragmentation indicators': [
            'inactive_split_bytes.all.current',
            'inactive_split_bytes.all.peak',
            'inactive_split.all.current',
        ],
    }

    print(f"CUDA Memory Stats (device {device}):")
    for category, keys in categories.items():
        print(f"\n  {category}:")
        for key in keys:
            if key in stats:
                value = stats[key]
                if 'bytes' in key.lower():
                    print(f"    {key}: {value / 1024**2:.1f} MB")
                else:
                    print(f"    {key}: {value}")

    # Compute derived metrics
    allocated = stats.get('allocated_bytes.all.current', 0)
    reserved = stats.get('reserved_bytes.all.current', 0)
    inactive_split = stats.get('inactive_split_bytes.all.current', 0)

    if reserved > 0:
        frag_ratio = (reserved - allocated) / reserved
        print("\n  Derived metrics:")
        print(f"    Utilization: {allocated/reserved:.1%} of reserved")
        print(f"    Fragmentation ratio: {frag_ratio:.1%}")
        print(f"    Inactive split bytes: {inactive_split/1024**2:.1f} MB")
        print("    (inactive splits = free blocks left over from splitting larger blocks)")

# After training a batch
if torch.cuda.is_available():
    torch.cuda.reset_peak_memory_stats()

    # Simulate some allocations
    tensors = [
        torch.zeros(1024, 1024, device='cuda'),
        torch.zeros(512, 512, device='cuda'),
        torch.zeros(2048, 256, device='cuda'),
    ]

    parse_cuda_memory_stats()

    del tensors
    torch.cuda.empty_cache()

Warning: torch.cuda.empty_cache() does NOT reduce allocated memory. It only releases free blocks from the caching allocator back to CUDA. If your process is OOM because live tensors are consuming all memory, empty_cache() will not help. Check torch.cuda.memory_allocated() vs torch.cuda.memory_reserved() first. If allocated is close to reserved, the issue is live tensors. If reserved is much larger than allocated, try empty_cache().


Production Engineering Notes

Allocator Selection Decision Framework

  • glibc malloc (the default): adequate for development, short-lived jobs, and most single-process workloads; requires no setup.
  • jemalloc: long-running training or serving processes where RSS stability over days matters more than peak allocation throughput.
  • tcmalloc: throughput-critical, heavily multi-threaded serving where allocation speed dominates.
  • PyTorch's CUDA caching allocator: always in play for GPU tensors; tune it through PYTORCH_CUDA_ALLOC_CONF rather than replacing it.

CUDA Allocator Tuning

PyTorch exposes allocator tuning through environment variables:

# Fragment-reducing: round small allocations up to larger size classes
# By default, no power-of-two rounding is applied
PYTORCH_CUDA_ALLOC_CONF="roundup_power2_divisions:4"

# Garbage collection threshold: fraction of reserved memory that can be
# in the allocator cache before triggering an automatic empty_cache()
PYTORCH_CUDA_ALLOC_CONF="garbage_collection_threshold:0.8"

# Max split size: blocks larger than this are never split
# Default is unlimited; setting a value reduces fragmentation from splits
PYTORCH_CUDA_ALLOC_CONF="max_split_size_mb:512"

# Combine multiple settings
PYTORCH_CUDA_ALLOC_CONF="max_split_size_mb:512,roundup_power2_divisions:4,garbage_collection_threshold:0.8"

import os
import torch

# Check current CUDA allocator config
cuda_alloc_conf = os.environ.get('PYTORCH_CUDA_ALLOC_CONF', 'not set')
print(f"PYTORCH_CUDA_ALLOC_CONF: {cuda_alloc_conf}")

# Memory snapshot: detailed per-block view (PyTorch 2.0+)
if torch.cuda.is_available():
    try:
        torch.cuda.memory._record_memory_history(max_entries=10000)
        # ... run your workload ...
        # Then dump snapshot for analysis
        # snapshot = torch.cuda.memory._snapshot()
        # with open('memory_snapshot.pickle', 'wb') as f:
        #     import pickle
        #     pickle.dump(snapshot, f)
        # Analyze with: python -m torch.cuda._memory_viz trace_plot memory_snapshot.pickle
        torch.cuda.memory._record_memory_history(enabled=None)  # stop recording
        print("Memory snapshot recording available (PyTorch >= 2.0)")
    except AttributeError:
        print("Memory snapshot requires PyTorch >= 2.0")

Danger: Never call torch.cuda.empty_cache() inside a training step's inner loop. The purpose of the caching allocator is to reuse memory blocks between steps without going back to CUDA. Calling empty_cache() every step forces a new cudaMalloc on the next step, turning a fast (nanosecond) allocation into a slow (millisecond) one. Call empty_cache() only at epoch boundaries, during model reload, or when you have detected genuine fragmentation.
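
As a minimal, self-contained sketch of where the call belongs (the tiny model and synthetic data below are stand-ins for a real training setup):

import torch

def train(num_epochs=2, steps_per_epoch=10, device="cuda"):
    model = torch.nn.Linear(1024, 1024).to(device)
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    for epoch in range(num_epochs):
        for _ in range(steps_per_epoch):
            x = torch.randn(64, 1024, device=device)
            opt.zero_grad(set_to_none=True)
            loss = model(x).pow(2).mean()
            loss.backward()
            opt.step()
            # Do NOT call torch.cuda.empty_cache() here: reusing cached blocks
            # is what keeps per-step allocations in the nanosecond range.
        # Epoch boundary: a reasonable place to release cached blocks, for
        # example before running evaluation with different tensor shapes.
        torch.cuda.empty_cache()

if torch.cuda.is_available():
    train()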


Interview Q&A

Q1: What is the difference between glibc malloc, jemalloc, and tcmalloc? When would you choose each for a Python ML workload?

All three are user-space allocators that manage a pool of OS-acquired memory and sub-allocate from it. The differences are in design priorities. glibc malloc (ptmalloc2) is the default on Linux and is adequate for most workloads. It uses per-thread arenas to reduce lock contention but has a tendency to retain freed memory in its internal heap (high "fragmentation" from the OS's perspective). jemalloc uses per-CPU arenas and decay-based purging, which aggressively returns freed memory to the OS. For long-running ML serving processes that run for days, jemalloc prevents the gradual RSS growth that glibc shows. tcmalloc uses aggressive thread-local caching: alloc/free pairs in the same thread require no synchronization. This gives extremely high throughput for allocation-heavy workloads. For a high-throughput serving system with 1000+ requests/second, tcmalloc wins. For a training job running for days where RSS stability matters, jemalloc wins. Neither requires code changes - you swap them via LD_PRELOAD.

Q2: What does torch.cuda.memory_reserved() return, and why is it different from torch.cuda.memory_allocated()?

memory_allocated() returns the bytes currently held by live PyTorch tensors - memory that is actively being used. memory_reserved() returns the total bytes that PyTorch's caching allocator has acquired from CUDA via cudaMalloc - including free blocks that the allocator is holding for future reuse. The gap between them (reserved - allocated) is free memory in the caching allocator. It is not available to other processes and does not show up as available in nvidia-smi. This gap exists because releasing memory to CUDA and re-acquiring it later (cudaFree followed by another cudaMalloc) is expensive, so the allocator holds free blocks rather than returning them. torch.cuda.empty_cache() releases these free blocks back to CUDA, reducing reserved to approximately equal allocated.

Q3: What causes CUDA OOM even when nvidia-smi shows "available" memory?

Two scenarios. First, fragmentation: the caching allocator has enough total free memory but the free blocks are too small to satisfy the new allocation. After many variable-size alloc/free cycles, the free block pool is fragmented. Solution: call torch.cuda.empty_cache() to release free blocks, which allows the next allocation to get a fresh large block from CUDA. Second, memory held by processes not reflected in your process stats: other processes on the same GPU (other training jobs, monitoring daemons, CUDA context overhead). Solution: use nvidia-smi to check all processes, ensure exclusive GPU access for training.

Q4: How would you estimate the GPU memory budget for training a transformer model?

The four main consumers are: (1) model weights - for mixed precision (fp16 params + fp32 master), this is approximately 6 bytes per parameter. (2) Gradients - 2 bytes per parameter in fp16. (3) Optimizer state - Adam requires two fp32 moment vectors, 8 bytes per parameter. (4) Activations - the variable part, proportional to batch size, sequence length, and number of layers. Without gradient checkpointing, activations can dominate at about 34 bytes per (position, layer) in fp16. With gradient checkpointing, activations shrink to about 2 bytes per (position, layer) at the cost of recomputing them during backward. Worked example for a 7B parameter model in mixed precision with gradient checkpointing, batch size 4, sequence length 2048: roughly 39 GB (weights) + 13 GB (gradients) + 52 GB (optimizer state) + ~2 GB (activations), about 106 GB in total - which is why full Adam fine-tuning of a 7B model does not fit on a single 40 GB or 80 GB GPU without sharding or offloading the optimizer state.

Q5: Why is cudaMalloc so much slower than malloc, and how does PyTorch's caching allocator address this?

cudaMalloc requires synchronization with the GPU driver and the CUDA runtime: the driver must find a contiguous region of GPU virtual memory, set up page tables, notify the GPU, and confirm the allocation. This round-trip through driver and hardware takes 50-200 milliseconds. In comparison, malloc operates entirely in user space from a pre-acquired heap, taking about 100 nanoseconds. PyTorch's caching allocator addresses this by calling cudaMalloc only when there is no suitable free block in its cache. For training loops that allocate the same tensor shapes every step (which is typical), after the first step all allocations are satisfied from the cache without touching CUDA, reducing effective allocation cost to near zero. The trade-off is that the caching allocator holds GPU memory for the process even when tensors are freed - which is why reserved > allocated during training.

Q6: What is max_split_size_mb in PYTORCH_CUDA_ALLOC_CONF, and when should you set it?

When the caching allocator needs to satisfy an allocation and finds a free block that is larger than requested, it splits the block: the requested size is returned, and the remainder stays in the free list. Over time, repeated splitting produces many small "inactive split" blocks that cannot satisfy larger allocations - fragmentation. Setting max_split_size_mb prevents splits of blocks larger than the threshold, reducing fragmentation at the cost of internal fragmentation (each allocation wastes more space). Set this when torch.cuda.memory_stats() shows a high inactive_split_bytes.all.current value. A good starting value is the size of your largest regularly-allocated tensor (often the activation or gradient size for a single layer). Setting it too small prevents most splits, so large cached blocks cannot be reused for smaller requests and more allocations fall through to fresh cudaMalloc calls; setting it too large permits splitting almost everywhere and reintroduces the fragmentation you were trying to avoid.

Q7: How does jemalloc reduce RSS fragmentation in Python serving workloads?

glibc malloc uses a brk()-based heap that expands upward. When large objects are freed, glibc cannot return the memory to the OS unless the freed region is at the top of the heap (due to the brk() model). In practice, the heap contains a mix of live and freed objects, and freed memory interior to the heap is retained. Over days, this causes RSS to grow continuously. jemalloc uses mmap() for all allocations (not brk()), which allows individual chunks to be returned to the OS independently of their position in address space. Additionally, jemalloc has a "decay-based purging" mechanism: free blocks are tracked with timestamps, and after a configurable decay period (default: 10 seconds), unused pages within those blocks are returned to the OS via madvise(MADV_FREE). This keeps RSS stable over long running periods. For ML serving processes that run for days and handle millions of requests, jemalloc typically produces 20-40% lower steady-state RSS than glibc.
