Unified Memory and Memory Pooling
Reading time: ~35 min - Interview relevance: High - Target roles: ML Infrastructure, CUDA Developer, MLOps Engineer
The Production Scenario: Three AM and Your Inference Server Is Leaking Memory
It is 3:14 AM. Your on-call phone buzzes. The LLM inference server that powers your company's chatbot product has started throwing CUDA out of memory errors. GPU utilization is dropping off a cliff. You SSH in and run nvidia-smi. The numbers look wrong: the model itself uses 38 GB of your 80 GB A100, but nvidia-smi is showing 76 GB allocated. Something is sitting on 38 GB of GPU memory that nobody asked for.
You open a Python shell and call torch.cuda.memory_stats(). The output is bewildering at first. reserved_bytes.all.current is 76 GB. allocated_bytes.all.current is only 38 GB. There is a 38 GB gap between what PyTorch has reserved from CUDA and what it is actually using for tensors. That gap is fragmented cache. PyTorch's caching allocator grabbed memory aggressively during the batch processing spikes earlier that night and never gave it back to CUDA. Now new allocations are failing even though, technically, enough free physical VRAM exists.
You call torch.cuda.empty_cache(). Reserved memory drops to 39 GB. The server recovers. You go back to sleep. But the next morning you sit down to actually understand what just happened - because patching symptoms at 3 AM is not the same as understanding the system.
This scenario plays out constantly in production ML infrastructure. Memory management on GPUs is not automatic. The CUDA runtime, the driver, PyTorch's allocator, and the operating system's virtual memory subsystem all have separate views of memory. They do not always agree. Understanding how these layers interact is the difference between running a stable serving cluster and firefighting fragmentation leaks every few days.
This lesson covers the full memory management stack - from CUDA Unified Memory (what it is and when to avoid it) through stream-ordered memory pools (the right way to allocate GPU memory at high frequency) to PyTorch's caching allocator (why that 38 GB gap appears and how to reason about it). By the end you will be able to profile, diagnose, and tune GPU memory allocation for both training and inference workloads.
Why This Exists - The Problem Before Memory Pools
To understand why memory pooling was invented, you need to feel the pain of what came before it.
The Original CUDA Memory Model
When CUDA launched in 2007, memory management was explicit and simple. CPU memory and GPU memory were completely separate. You called malloc() on the CPU side and cudaMalloc() on the GPU side. You explicitly copied data between them with cudaMemcpy(). The programmer was responsible for knowing where every byte lived.
This model was fast because it was simple. The GPU driver knew exactly what memory belonged to the GPU. There was no translation layer, no migration logic, no unified address space. A GPU memory address and a CPU memory address were completely different things.
The problem is that explicit memory management is painful and error-prone. Consider a deep learning framework in 2013. Every layer's forward pass allocates output tensors. Every backward pass allocates gradient tensors. Between layers you have intermediate buffers for normalization, attention, residual connections. On a single training step, a large model might perform hundreds of separate cudaMalloc and cudaFree calls. Each of those calls is synchronous - it stalls the GPU pipeline. Benchmarks from the CUDA 8 era showed a naive training loop spending 15-20% of its wall time just waiting for memory allocation and deallocation to complete.
The other problem was data transfer complexity. Preprocessing happens on the CPU. Labels, augmentation, tokenization - all of it runs on CPU threads. To get that data onto the GPU, you needed explicit cudaMemcpy calls with correctly calculated sizes, correctly typed pointers, and careful attention to synchronization. One wrong assumption about when the copy was complete and you had a race condition that produced corrupted gradients. Not a crash - corrupted gradients. The worst kind of bug.
The Promise of Unified Memory
CUDA 6.0, released in 2014, introduced Unified Memory via cudaMallocManaged. The promise was seductive: allocate memory once, use it from both CPU and GPU, let the CUDA runtime figure out where the data needs to be. Write float* data; cudaMallocManaged(&data, size) and then use data on both CPU threads and GPU kernels without any explicit copies.
This worked. But it introduced a new failure mode that took years for the community to fully understand: page migration overhead.
Historical Context - CUDA Unified Memory Origins
Pascal and the Hardware Foundation
cudaMallocManaged existed before Pascal (the 2016 GPU architecture), but it was essentially fake on older hardware. Kepler and Maxwell GPUs could not take page faults at all: the runtime migrated every page of a managed allocation to the GPU at kernel launch, and the CPU was not allowed to touch managed memory while any kernel was running. This was predictably catastrophic for performance.
Pascal changed everything. The Pascal architecture (P100, 2016) introduced hardware page fault support with concurrent page migration. The GPU could now handle page faults without stalling all execution. A warp that triggered a page fault would simply stall, other warps would continue executing, and the migration would complete asynchronously. This was the first time Unified Memory became genuinely usable for performance-sensitive code.
The key insight from the Pascal team at NVIDIA was that the access pattern for most ML workloads is predictable. You stream through a dataset. You access each activation tensor twice - once in the forward pass, once in the backward pass. If the runtime could prefetch pages before they were needed, the performance penalty of Unified Memory could be nearly eliminated.
This led to CUDA's cudaMemPrefetchAsync - an explicit hint to the runtime to start migrating pages before they are accessed. With prefetching, Unified Memory performance on Pascal and later architectures can approach the performance of explicit cudaMemcpy.
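PyTorch does not expose cudaMemPrefetchAsync, but CuPy does, which makes the pattern easy to see from Python. A minimal sketch, assuming a recent CuPy on a Pascal-or-later GPU (buffer size and names are illustrative):

```python
import cupy as cp

# cudaMallocManaged under the hood: one pointer, valid on CPU and GPU
n_bytes = 256 * 1024 * 1024
mem = cp.cuda.malloc_managed(n_bytes)
arr = cp.ndarray((n_bytes // 4,), dtype=cp.float32, memptr=mem)

stream = cp.cuda.Stream(non_blocking=True)
# Hint: migrate pages to device 0 before any kernel touches them
cp.cuda.runtime.memPrefetchAsync(mem.ptr, n_bytes, 0, stream.ptr)
with stream:
    arr *= 2.0  # pages already resident, so no demand faults
stream.synchronize()
```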
The Memory Pool Problem - CUDA 11.2 and Stream-Ordered Allocators
Even with Unified Memory sorted, the high-frequency allocation problem remained. Every call to cudaMalloc or cudaFree goes through the CUDA driver, which holds a global lock. In a multi-threaded serving system with many concurrent requests, that global lock becomes a severe bottleneck.
CUDA 11.2 (2020) introduced the stream-ordered memory allocator: cudaMallocAsync and cudaFreeAsync. Instead of returning memory to the global CUDA heap on cudaFreeAsync, the runtime keeps it in a per-stream pool and recycles it for the next cudaMallocAsync call on the same stream. Allocation becomes essentially free for repeated same-size requests - you are just pulling from a local cache rather than hitting the driver.
PyTorch adopted this approach early. Its caching allocator, which predates the official CUDA stream-ordered API, implements the same philosophy: never return memory to CUDA if you might need it again soon.
Core Concepts: Understanding the Memory Stack
Layer 1 - Physical GPU Memory
At the bottom of the stack is physical VRAM - HBM2e on an A100 (80 GB), GDDR6X on an RTX 4090 (24 GB). This is hardware. It has fixed capacity. When this runs out, nothing works.
Layer 2 - The CUDA Driver Memory Manager
The CUDA driver manages the physical memory. When you call cudaMalloc, you are asking the driver to give you a contiguous virtual address range backed by physical memory. The driver tracks which ranges are allocated. cudaFree returns a range to the driver's free list.
The driver's allocator is not fast. It uses a global lock. Every allocation is a potential serialization point. This is why PyTorch never calls cudaMalloc directly for normal tensor operations.
Layer 3 - PyTorch's Caching Allocator
PyTorch has its own memory allocator that sits between tensor operations and the CUDA driver. The caching allocator's job is to make cudaMalloc and cudaFree calls infrequent.
The logic is:
- When a tensor is freed, do NOT call cudaFree. Instead, put the memory block in a "cached blocks" pool.
- When a new tensor is allocated, search the cached blocks pool for a block that is large enough. If found, return it immediately without touching the CUDA driver.
- Only call cudaMalloc if the cache has no suitable block.
- Only call cudaFree (via torch.cuda.empty_cache()) when the cache is taking up space that could be used for new allocations.
This is why torch.cuda.memory_stats() shows two distinct numbers: reserved_bytes (everything the caching allocator has taken from CUDA, including the free cache) and allocated_bytes (what is actually in use by live tensors).
The gap between them is the cache. It is intentional. Without it, every tensor allocation would hit the CUDA driver. With it, most allocations are just pointer manipulations within the already-reserved memory.
Layer 4 - Unified Memory (cudaMallocManaged)
Unified Memory lives alongside the caching allocator, not inside it. When you call cudaMallocManaged, you are asking the CUDA runtime to manage migration between CPU and GPU automatically. This memory does NOT go through PyTorch's caching allocator by default.
The page migration mechanism works like this:
- Physical pages of unified memory start on whichever device (CPU or GPU) first accesses them, or can be explicitly prefetched.
- When a GPU kernel accesses a page not currently in GPU memory, the GPU issues a page fault.
- The CUDA runtime intercepts the fault, migrates the page over PCIe (or NVLink), and retries the memory access.
- Subsequent accesses to that page from the GPU are fast (in VRAM) until the page is migrated back to CPU.
Page migration has latency. On PCIe 4.0, migrating a 4 KB page takes on the order of microseconds once fault-handling overhead is included. For bulk transfers this amortizes well. For fine-grained random access across a large unified memory buffer, it can be devastating.
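To see demand paging from Python, Numba's managed arrays wrap cudaMallocManaged directly. A minimal sketch, assuming numba with CUDA support on a Pascal-or-later GPU:

```python
import numpy as np
from numba import cuda

@cuda.jit
def scale(buf, factor):
    i = cuda.grid(1)
    if i < buf.size:
        buf[i] *= factor

# Managed (unified) allocation: one pointer valid on both CPU and GPU
buf = cuda.managed_array(1_000_000, dtype=np.float32)
buf[:] = 1.0  # CPU writes fault the pages into host memory

threads = 256
blocks = (buf.size + threads - 1) // threads
scale[blocks, threads](buf, 2.0)  # first GPU touch triggers page migration
cuda.synchronize()

print(buf[:4])  # CPU read may migrate pages back to the host
```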
Math: What Fragmentation Actually Costs
Let's think about fragmentation precisely. The caching allocator uses a bin-based approach - allocations are rounded up to the nearest power of two (roughly) to reduce the number of distinct block sizes and make reuse more likely.
Suppose you have a sequence of allocations with sizes in bytes $s_1, s_2, \ldots, s_n$.

The allocator rounds each up to the next bin boundary $b(s_i) \ge s_i$. Internal fragmentation per allocation is:

$$f_i = b(s_i) - s_i$$

Over a workload with $n$ allocations, total internal fragmentation is:

$$F_{\text{int}} = \sum_{i=1}^{n} \big( b(s_i) - s_i \big)$$
External fragmentation is harder to quantify but conceptually simpler: you have free blocks scattered through memory such that no single contiguous free region is large enough for the next allocation, even though the total free bytes would be sufficient. If your free cache has ten 100 MB blocks but you need one 500 MB tensor, you have 1 GB reserved but cannot satisfy a 500 MB request. The allocator must call cudaMalloc for a fresh 500 MB block.
This is the situation that leads to OOM errors even when nvidia-smi shows substantial free memory. The fragmentation tax can easily consume 10-30% of GPU memory in long-running training or serving workloads.
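A quick back-of-envelope check in Python, using the simplified power-of-two bin model from above (the real allocator's binning is finer-grained, so treat the numbers as illustrative):

```python
def next_pow2_bin(size_mb: int) -> int:
    """Round an allocation up to the next power-of-two bin (simplified model)."""
    b = 1
    while b < size_mb:
        b *= 2
    return b

requested = [100, 200, 50, 300, 75, 150, 400, 25]  # request sizes in MB
binned = [next_pow2_bin(s) for s in requested]     # [128, 256, 64, 512, 128, 256, 512, 32]
internal_frag = sum(b - s for s, b in zip(requested, binned))

print(f"Requested: {sum(requested)} MB, reserved: {sum(binned)} MB")
print(f"Internal fragmentation: {internal_frag} MB "
      f"({internal_frag / sum(binned):.0%} of reserved)")
# ~31% of reserved memory is pure rounding overhead for this mix
```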
Mermaid Diagram: The GPU Memory Allocation Stack
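The stack described below, from tensor operation down to physical VRAM:

```mermaid
flowchart TD
    A[PyTorch tensor ops] --> B["PyTorch caching allocator<br/>(recycle cached blocks)"]
    B -->|"cudaMalloc / cudaFree<br/>only on cache miss"| C["CUDA driver memory manager<br/>(global lock)"]
    E["Unified Memory<br/>cudaMallocManaged"] -->|"page migration<br/>PCIe / NVLink"| C
    C --> D["Physical VRAM<br/>(HBM2e / GDDR6X)"]
```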
Code: Understanding torch.cuda.memory_stats()
The first step to managing GPU memory is understanding what PyTorch is actually doing with it.
import torch
import json
def print_memory_report(device: int = 0) -> dict:
"""
Print a human-readable memory report for a CUDA device.
Returns the raw stats dict for programmatic use.
"""
if not torch.cuda.is_available():
print("No CUDA device available")
return {}
stats = torch.cuda.memory_stats(device)
# Key metrics to understand
reserved_bytes = stats.get("reserved_bytes.all.current", 0)
allocated_bytes = stats.get("allocated_bytes.all.current", 0)
inactive_split = stats.get("inactive_split_bytes.all.current", 0)
num_alloc_retries = stats.get("num_alloc_retries", 0)
num_ooms = stats.get("num_ooms", 0)
reserved_gb = reserved_bytes / 1e9
allocated_gb = allocated_bytes / 1e9
cached_free_gb = (reserved_bytes - allocated_bytes) / 1e9
inactive_gb = inactive_split / 1e9
print(f"=== GPU {device} Memory Report ===")
print(f"Reserved (from CUDA driver): {reserved_gb:.2f} GB")
print(f"Allocated (live tensors): {allocated_gb:.2f} GB")
print(f"Cached free (in pool): {cached_free_gb:.2f} GB")
print(f"Inactive splits: {inactive_gb:.2f} GB")
print(f"Allocation retries: {num_alloc_retries}")
print(f"OOM events: {num_ooms}")
# Fragmentation ratio: how much of reserved is unusable due to splits
if reserved_bytes > 0:
frag_ratio = inactive_split / reserved_bytes
print(f"Fragmentation ratio: {frag_ratio:.1%}")
    # memory_reserved() mirrors reserved_bytes.all.current; nvidia-smi
    # shows this plus the process's CUDA context overhead
    driver_mem = torch.cuda.memory_reserved(device) / 1e9
    print(f"\nApprox. nvidia-smi usage: {driver_mem:.2f} GB (+ context overhead)")
return stats
# Use during training to monitor memory health
stats = print_memory_report(device=0)
What Each Number Means
reserved_bytes.all.current - total bytes the caching allocator has obtained from the CUDA driver. This is roughly what nvidia-smi shows as used by your process (nvidia-smi additionally counts the CUDA context overhead).
allocated_bytes.all.current - bytes actually in live tensors. This is the "real" usage.
inactive_split_bytes.all.current - this is the most important diagnostic number for fragmentation. When the allocator splits a large cached block to satisfy a smaller request, the leftover portion is an "inactive split." A high inactive split count means your allocations have irregular sizes that prevent clean recycling.
num_alloc_retries - the number of times a cudaMalloc failed and the allocator had to free cached blocks and retry. If this is non-zero in production, you have memory pressure.
Code: Detecting and Handling Fragmentation
import torch
class MemoryFragmentationMonitor:
"""
Context manager that detects problematic fragmentation
and optionally triggers cache clearing.
"""
def __init__(
self,
device: int = 0,
frag_threshold: float = 0.3,
auto_clear: bool = False
):
self.device = device
self.frag_threshold = frag_threshold
self.auto_clear = auto_clear
self.initial_stats = None
def __enter__(self):
self.initial_stats = torch.cuda.memory_stats(self.device)
return self
def __exit__(self, exc_type, exc_val, exc_tb):
stats = torch.cuda.memory_stats(self.device)
reserved = stats.get("reserved_bytes.all.current", 0)
inactive = stats.get("inactive_split_bytes.all.current", 0)
if reserved > 0:
frag_ratio = inactive / reserved
if frag_ratio > self.frag_threshold:
print(
f"WARNING: Memory fragmentation ratio {frag_ratio:.1%} "
f"exceeds threshold {self.frag_threshold:.1%}"
)
if self.auto_clear:
print("Clearing CUDA cache...")
torch.cuda.empty_cache()
new_stats = torch.cuda.memory_stats(self.device)
new_reserved = new_stats.get("reserved_bytes.all.current", 0)
freed = (reserved - new_reserved) / 1e9
print(f"Freed {freed:.2f} GB from cache")
def demonstrate_fragmentation():
"""
Artificially create fragmentation to show how it develops.
"""
device = torch.device("cuda:0")
# Allocate a bunch of different-sized tensors
tensors = []
sizes = [100, 200, 50, 300, 75, 150, 400, 25]
print("Creating varied-size allocations...")
for size_mb in sizes:
# Each tensor is a different size to defeat bin-based reuse
n_elements = (size_mb * 1024 * 1024) // 4 # float32
t = torch.empty(n_elements, device=device)
tensors.append(t)
print_memory_report()
    # Free every other tensor - creates holes in the cached pool.
    # Rebinding the list drops the only references to the even-indexed
    # tensors, which frees them back to the caching allocator.
    print("\nFreeing alternating tensors (creating holes)...")
    tensors = [t for i, t in enumerate(tensors) if i % 2 == 1]
print_memory_report()
# Now try to allocate one large contiguous block
# This may fail even though total free bytes are sufficient
print("\nAttempting large contiguous allocation...")
try:
large_tensor = torch.empty(
(600 * 1024 * 1024) // 4,
device=device
)
print("Large allocation succeeded")
except torch.cuda.OutOfMemoryError:
print("OOM: fragmentation prevented contiguous allocation")
print("Clearing cache and retrying...")
torch.cuda.empty_cache()
large_tensor = torch.empty(
(600 * 1024 * 1024) // 4,
device=device
)
print("Retry succeeded after cache clear")
Code: CUDA Unified Memory in Practice
Understanding when cudaMallocManaged helps and when it hurts requires testing with actual access patterns.
import torch
import time
import ctypes
# Note: Direct cudaMallocManaged requires ctypes or pycuda
# PyTorch does not expose it directly - this shows the concept via pycuda
try:
import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np
def benchmark_unified_vs_explicit(size_mb: int = 512):
"""
Compare Unified Memory vs explicit copies for a simple
GPU kernel workload.
"""
size_bytes = size_mb * 1024 * 1024
n_floats = size_bytes // 4
print(f"Benchmarking with {size_mb} MB buffer...")
# === Approach 1: Explicit memory management ===
# Allocate separately on CPU and GPU, copy explicitly
cpu_data = np.random.randn(n_floats).astype(np.float32)
t0 = time.perf_counter()
gpu_data = cuda.mem_alloc(size_bytes)
cuda.memcpy_htod(gpu_data, cpu_data)
# ... kernel would run here ...
result = np.empty(n_floats, dtype=np.float32)
cuda.memcpy_dtoh(result, gpu_data)
t1 = time.perf_counter()
explicit_time = (t1 - t0) * 1000
gpu_data.free()
print(f"Explicit memcpy approach: {explicit_time:.1f} ms")
# === Approach 2: Unified memory (page migration) ===
# Fill on CPU, access on GPU (triggers page migration)
        managed_data = cuda.managed_zeros(
            n_floats, dtype=np.float32,
            mem_flags=cuda.mem_attach_flags.GLOBAL,
        )
managed_data[:] = cpu_data # CPU write
# First GPU access will trigger page migration
t0 = time.perf_counter()
        # In CUDA C++ you would call cudaMemPrefetchAsync (or cudaMemAdvise
        # with cudaMemAdviseSetPreferredLocation) here so pages are resident
        # before the kernel launches; pycuda's bindings for these vary by
        # version, so the call is omitted from this sketch.
# ... kernel would run here using managed_data pointer ...
t1 = time.perf_counter()
unified_time = (t1 - t0) * 1000
print(f"Unified memory + prefetch: {unified_time:.1f} ms")
print()
print("Key insight: without prefetch, first access triggers")
print("on-demand page migration which can be 10-100x slower.")
except ImportError:
print("pycuda not available - showing PyTorch Unified Memory concepts")
print()
print("PyTorch does not use cudaMallocManaged for standard tensors.")
print("It uses its own caching allocator with explicit CUDA memory.")
print()
print("cudaMallocManaged is most useful for:")
print(" 1. CPU+GPU collaborative algorithms (graph traversal, tree search)")
print(" 2. Memory-oversubscription for arrays too large for GPU VRAM")
print(" 3. Rapid prototyping without explicit memcpy management")
print()
print("It is NOT the right choice for:")
print(" 1. Standard deep learning training/inference")
print(" 2. High-throughput data pipelines")
print(" 3. Any workload where you control access patterns precisely")
Code: Stream-Ordered Memory Pools with cudaMallocAsync
CUDA 11.2 introduced the proper solution to high-frequency allocation overhead. PyTorch's default caching allocator implements the same idea in user space, and recent PyTorch builds can switch to the CUDA backend directly with PYTORCH_CUDA_ALLOC_CONF=backend:cudaMallocAsync. You can also use the stream-ordered allocator directly for custom CUDA work.
# This demonstrates the concept using PyTorch's memory pool API
# which wraps the CUDA 11.2 stream-ordered allocator
import torch
def demonstrate_memory_pools():
"""
Show how to use PyTorch's memory pool API for fine-grained control.
Useful for multi-tenant serving with strict memory isolation.
"""
if not torch.cuda.is_available():
print("No CUDA device")
return
device = torch.device("cuda:0")
# === Default behavior: single global caching allocator ===
print("=== Default Allocator Behavior ===")
# Create tensors - all share the same global cache
a = torch.randn(1000, 1000, device=device)
b = torch.randn(1000, 1000, device=device)
print(f"After allocating 2 tensors:")
print(f" Reserved: {torch.cuda.memory_reserved() / 1e6:.1f} MB")
print(f" Allocated: {torch.cuda.memory_allocated() / 1e6:.1f} MB")
del a # Goes to cache, not returned to CUDA driver
print(f"\nAfter deleting tensor a:")
print(f" Reserved: {torch.cuda.memory_reserved() / 1e6:.1f} MB")
print(f" Allocated: {torch.cuda.memory_allocated() / 1e6:.1f} MB")
print(f" (Reserved stays high - b's memory is still in cache)")
torch.cuda.empty_cache()
print(f"\nAfter empty_cache():")
print(f" Reserved: {torch.cuda.memory_reserved() / 1e6:.1f} MB")
del b
torch.cuda.empty_cache()
# === Memory Snapshot for debugging ===
print("\n=== Memory Snapshot Tool ===")
# Record allocations for post-mortem analysis
torch.cuda.memory._record_memory_history(max_entries=100000)
tensors = []
for i in range(5):
t = torch.randn(500, 500, device=device)
tensors.append(t)
# Take snapshot - shows allocation stack traces
snapshot = torch.cuda.memory._snapshot()
print(f"Snapshot captured {len(snapshot['segments'])} memory segments")
# In production, save this for analysis:
# torch.cuda.memory._dump_snapshot("memory_snapshot.pickle")
# Then analyze with: https://pytorch.org/memory_viz
torch.cuda.memory._record_memory_history(enabled=None)
del tensors
torch.cuda.empty_cache()
def memory_pool_for_serving():
"""
Pattern for multi-tenant inference: use separate memory contexts
to prevent one request from starving another.
This is conceptual - actual implementation uses PyTorch's
    MemPool context manager (a prototype API in recent PyTorch releases)
"""
print("=== Memory Pool Pattern for Serving ===")
print()
print("Problem: In a serving system with concurrent requests,")
print("the global caching allocator can cause priority inversion.")
print()
print("Request A allocates 20 GB for batch inference.")
print("Request A finishes - memory goes to global cache.")
print("Request B needs 5 GB - gets it from cache (fast).")
print("Request C needs 25 GB - OOM, even though A's memory is 'free'.")
print()
print("Solution: Isolate high-priority requests in their own pool.")
print()
print("PyTorch 2.1+ pattern:")
print()
# Conceptual code showing the isolation pattern
code = '''
# Using torch.cuda.MemPool (prototype API in recent PyTorch)
from torch.cuda.memory import MemPool
# Create isolated pool for high-priority inference
priority_pool = MemPool()
with torch.cuda.use_mem_pool(priority_pool):
# All allocations here go into the isolated pool
model_output = model(high_priority_input)
# When context exits, pool memory can be reused within the pool
# but does not mix with global cache
# Global allocations are unaffected
regular_tensor = torch.randn(100, 100, device="cuda")
'''
print(code)
Mermaid Diagram: Unified Memory Page Migration
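The fault-and-migrate sequence described in the page migration section above:

```mermaid
sequenceDiagram
    participant W as GPU warp
    participant RT as CUDA runtime/driver
    participant H as CPU (host) memory

    W->>RT: access page not resident in VRAM (page fault)
    Note over W: faulting warp stalls;<br/>other warps keep running
    RT->>H: migrate page over PCIe / NVLink
    H-->>RT: page data
    RT-->>W: map page into VRAM, retry access
    Note over W: later accesses hit VRAM<br/>until the page migrates back
```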
Code: Tuning the PyTorch Caching Allocator
PyTorch exposes environment variables to tune the caching allocator's behavior. These are not well documented but matter significantly in production.
import os
import torch
# === Environment Variables for Caching Allocator Tuning ===
# PYTORCH_CUDA_ALLOC_CONF is a comma-separated list of settings
# Set BEFORE importing torch, or use the programmatic API
# Option 1: Environment variable (set before process start)
# export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128,garbage_collection_threshold:0.8
# Option 2: Set os.environ from Python before torch initializes CUDA
def configure_allocator_for_inference():
"""
Optimal allocator settings for a long-running inference server.
"""
# max_split_size_mb: blocks larger than this are never split
# Setting this prevents large blocks from being fragmented
# Default is unlimited. 512 MB is a good starting point.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = (
"max_split_size_mb:512,"
"garbage_collection_threshold:0.8,"
"expandable_segments:True"
)
# Note: must be set before torch initializes CUDA
print("Allocator configured for inference:")
print(" max_split_size_mb=512: prevents splitting large blocks")
print(" garbage_collection_threshold=0.8: GC when 80% of reserved is in use")
print(" expandable_segments=True: virtual address space pre-reserved")
def configure_allocator_for_training():
"""
Optimal allocator settings for a training job with variable batch sizes.
"""
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = (
"max_split_size_mb:128,"
"garbage_collection_threshold:0.7,"
"expandable_segments:True"
)
print("Allocator configured for training:")
print(" max_split_size_mb=128: allows finer-grained reuse")
print(" garbage_collection_threshold=0.7: more aggressive GC")
# === The expandable_segments setting - crucial for large models ===
def explain_expandable_segments():
"""
expandable_segments=True changes how the allocator requests memory
from CUDA. Instead of requesting fixed-size chunks, it pre-reserves
the entire GPU virtual address space upfront, then maps physical
pages lazily.
Benefits:
- Eliminates the "address space fragmentation" problem
- Large contiguous virtual address ranges available even when
physical memory is fragmented
- Particularly important for models that grow dynamically
(speculative decoding with variable sequence lengths)
    Drawback:
    - Relies on the CUDA virtual memory management APIs, so some
      interop paths (such as sharing tensors across processes via
      CUDA IPC) do not support it
"""
print("expandable_segments: pre-reserves virtual address space")
print("This is the most impactful allocator setting for large models.")
print()
print("Enable with:")
print(' PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True')
print()
print("Or in Python before initializing CUDA:")
print(' torch.cuda.memory.set_per_process_memory_fraction(1.0)')
# === Memory snapshots for production debugging ===
def production_oom_handler():
"""
Pattern for graceful OOM handling with diagnostic information.
Use this as a wrapper around inference calls in production.
"""
def safe_inference(model, inputs, device="cuda:0"):
try:
with torch.no_grad():
return model(inputs)
except torch.cuda.OutOfMemoryError as e:
# Capture diagnostic information before anything is freed
stats = torch.cuda.memory_stats(device)
reserved_gb = stats.get("reserved_bytes.all.current", 0) / 1e9
allocated_gb = stats.get("allocated_bytes.all.current", 0) / 1e9
retries = stats.get("num_alloc_retries", 0)
ooms = stats.get("num_ooms", 0)
print(f"OOM Details:")
print(f" Reserved: {reserved_gb:.2f} GB")
print(f" Allocated: {allocated_gb:.2f} GB")
print(f" Fragmentation: {(reserved_gb - allocated_gb):.2f} GB gap")
print(f" Prior retries: {retries}, prior OOMs: {ooms}")
print(f" Error: {e}")
# Try to recover by clearing cache
print("Attempting cache clear recovery...")
torch.cuda.empty_cache()
try:
with torch.no_grad():
return model(inputs)
except torch.cuda.OutOfMemoryError:
print("Recovery failed - request too large for available memory")
raise
return safe_inference
Production Engineering Notes
When Unified Memory Makes Sense
Unified Memory via cudaMallocManaged is the right choice in a narrow set of scenarios:
GPU memory oversubscription - if your data structure is genuinely too large for GPU VRAM but computation only touches a small fraction at any time (think large graph neural networks with sparse access patterns), Unified Memory lets the GPU work on a working set while the rest stays in CPU RAM. The page migration penalty is acceptable because you would otherwise not be able to run at all.
CPU-GPU collaborative algorithms - tree search algorithms (Monte Carlo tree search in game AI), certain graph algorithms, and hash tables with irregular access patterns are naturally expressed as shared CPU-GPU data structures. Unified Memory simplifies the programming model without necessarily losing performance if access patterns are managed.
Prototyping - when correctness matters more than performance and you want to eliminate explicit copy management during development.
Unified Memory is NOT appropriate for standard DNN training or inference. Every tensor has a predictable lifetime and access pattern. The caching allocator handles this far more efficiently than page migration ever could.
The multi-tenant serving problem
Running multiple models on one GPU requires careful memory partitioning. The default global caching allocator has no concept of tenancy isolation. Model A's idle cached memory can prevent Model B from allocating even though the combined usage would fit.
There are three approaches in production:
Approach 1: MPS (Multi-Process Service) - NVIDIA's MPS daemon allows multiple processes to share a single GPU, overlapping their kernels on the same hardware. Each process keeps its own CUDA context and allocator. The downside is that MPS adds scheduling overhead and does not provide strict memory isolation (one process crashing can affect others).
Approach 2: MIG (Multi-Instance GPU) - A100 and H100 support hardware-level partitioning. A 40 GB A100 can be split into up to 7 isolated GPU instances, each with dedicated memory bandwidth and VRAM. This is the cleanest isolation but reduces available memory per tenant.
Approach 3: Application-level pools - PyTorch's prototype MemPool API allows explicit pool management within a single process. High-priority requests get their own pool; background work shares the default pool.
Checkpoint memory spikes
During training, saving a checkpoint with torch.save(model.state_dict(), path) can momentarily require a second full copy of the weights - for example, when a sharded training setup gathers a full state dict before saving. For a 70B parameter model in bfloat16, that is 140 GB for the live weights plus up to another 140 GB during the gather. This is frequently the cause of OOM events during training runs that were otherwise stable.
The solution is asynchronous checkpointing: copy weights to CPU memory first (which is cheap over NVLink or PCIe), then write to disk from CPU threads while the GPU continues training.
import torch
import threading
class AsyncCheckpointer:
"""
Save checkpoints without blocking GPU training.
Copies state dict to CPU, then saves in background thread.
"""
def __init__(self):
self._save_thread = None
def save_async(self, model: torch.nn.Module, path: str):
# Wait for any previous save to complete
if self._save_thread is not None:
self._save_thread.join()
        # Copy state dict to CPU. The device-to-host copy blocks this
        # thread briefly; the slow disk write then happens off-thread.
        # (Pinned staging buffers would speed up the D2H copy further.)
        cpu_state = {
            k: v.detach().cpu()
            for k, v in model.state_dict().items()
        }
def _save():
torch.save(cpu_state, path)
print(f"Checkpoint saved: {path}")
self._save_thread = threading.Thread(target=_save, daemon=True)
self._save_thread.start()
def wait(self):
if self._save_thread is not None:
self._save_thread.join()
Common Mistakes
:::danger Calling empty_cache() in a hot loop
torch.cuda.empty_cache() is not free. It acquires locks, iterates the entire cached blocks list, and calls cudaFree for each block. In a tight inference loop, calling it every request can add 5-20ms of latency per call and actually slow down your server by eliminating the cache that was preventing cudaMalloc overhead.
Call empty_cache() only when you have diagnosed actual fragmentation or when switching between workloads with very different memory size profiles.
# WRONG: calling every batch
for batch in dataloader:
output = model(batch)
torch.cuda.empty_cache() # Don't do this
# RIGHT: call only on detected pressure
# (num_alloc_retries is cumulative, so compare against the last value)
prev_retries = torch.cuda.memory_stats().get("num_alloc_retries", 0)
for batch in dataloader:
    output = model(batch)
    retries = torch.cuda.memory_stats().get("num_alloc_retries", 0)
    if retries > prev_retries:
        torch.cuda.empty_cache()
        prev_retries = retries
:::
:::danger Assuming nvidia-smi shows true memory usage
nvidia-smi shows memory reserved by your process, which includes PyTorch's cached free blocks. A process showing 75 GB in nvidia-smi might only have 40 GB in live tensors. Do not use nvidia-smi alone to diagnose OOM - use torch.cuda.memory_stats() to understand the actual allocation picture.
:::
:::warning Using cudaMallocManaged for training tensors
Some tutorials show cudaMallocManaged as a way to handle large models. For standard training workloads, this is slower than the caching allocator approach. Page migration over PCIe adds 10-100x latency compared to already-resident VRAM accesses. Use it only for genuine oversubscription scenarios where you cannot fit the working set in VRAM.
:::
:::warning Ignoring inactive_split_bytes
High inactive_split_bytes is the leading indicator of fragmentation that will eventually cause OOM. Monitor it. If it grows steadily over a long training run, your allocation pattern is causing the caching allocator to split large blocks into fragments that cannot be recombined.
The fix is expandable_segments:True plus reviewing your tensor size patterns to reduce the variance of allocation sizes.
:::
Interview Questions and Answers
Q1: What is the difference between torch.cuda.memory_reserved() and torch.cuda.memory_allocated()? When would each be larger than the other?
memory_allocated() returns the bytes currently occupied by live tensors - actual data. memory_reserved() returns the bytes the PyTorch caching allocator has taken from the CUDA driver, including cached free blocks. memory_reserved() is always greater than or equal to memory_allocated(). The gap is the free cache pool.
The gap grows when you delete tensors (they go to cache instead of being returned to CUDA) and shrinks when you call empty_cache() or when the allocator recycles cached blocks for new allocations. In a long-running server, the gap tends to stabilize at the "high water mark" of the workload's peak fragmentation level.
Q2: A colleague says to use cudaMallocManaged to avoid running out of GPU memory when training a large model. Explain why this is usually wrong and what the right approach is.
cudaMallocManaged enables oversubscription by migrating pages from CPU RAM to GPU VRAM on demand. The problem is access latency. A GPU accessing a page that requires migration over PCIe waits potentially thousands of GPU cycles for that migration. GPU memory bandwidth is 3.35 TB/s on an H100. PCIe 5.0 peak is 128 GB/s - about 26x slower. For compute-heavy workloads like transformer training where every tensor is accessed multiple times per step, this migration overhead compounds catastrophically.
The right approach for large models depends on what is causing the memory pressure. For model weights too large for one GPU, use tensor parallelism or pipeline parallelism across multiple GPUs. For activation memory during training, use gradient checkpointing to trade recomputation for memory. For peak allocation spikes from attention, use FlashAttention which is designed to be memory-efficient. Unified Memory should be a last resort for irregular access patterns, not a general solution.
Q3: Describe what happens inside PyTorch's caching allocator when you call torch.randn(1000, 1000, device='cuda') followed by del tensor.
When torch.randn(1000, 1000, device='cuda') is called, the caching allocator computes the required bytes (1 million floats x 4 bytes = 4 MB), rounds up to the next allocation bin, searches its pool of cached free blocks for a block of at least that size, and either returns a cached block directly or calls cudaMalloc if no suitable cached block exists. The CUDA kernel for random number generation is then enqueued on the current stream.
When del tensor is called, Python's reference counting decrements the tensor's refcount to zero. The tensor's destructor notifies the caching allocator. The allocator marks the underlying memory block as free but does NOT call cudaFree. Instead, it adds the block to its free list. The next torch.randn call of similar size will find this block in the free list and reuse it without any driver interaction.
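A quick way to verify the recycling behavior (a sketch; exact reuse depends on prior allocator state):

```python
import torch

a = torch.randn(1000, 1000, device="cuda")
reserved_before = torch.cuda.memory_reserved()
del a  # block goes to the cache, not back to the driver

b = torch.randn(1000, 1000, device="cuda")  # same size: recycles cached block
assert torch.cuda.memory_reserved() == reserved_before  # no new cudaMalloc
```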
Q4: What is max_split_size_mb in PYTORCH_CUDA_ALLOC_CONF and how does setting it affect fragmentation?
max_split_size_mb controls the maximum size of a cached block that the allocator is allowed to split to satisfy a smaller request. Without this setting, the allocator will split a 1 GB cached block to satisfy a 100 MB request, leaving a 900 MB fragment. That 900 MB fragment may sit unused if no subsequent request is exactly the right size.
Setting max_split_size_mb:512 means blocks larger than 512 MB are never split. A 1 GB request will only be satisfied by a block of at least 1 GB. This reduces fragmentation for workloads with a mix of large and small tensors, because the large blocks remain whole and available for large allocations. The tradeoff is that small requests may need to allocate fresh memory via cudaMalloc if no small blocks are available, potentially increasing driver calls.
Q5: In a multi-tenant inference server running two different models on the same GPU, how does memory pooling help and what are the tradeoffs?
Without pooling, both models share the global caching allocator. Model A finishes a batch inference and releases 20 GB back to the cache. Model B then needs to allocate 5 GB - it gets it from Model A's cached blocks (fast). But then Model C needs 25 GB. Even though 20 GB sits in cache from Model A, the allocator cannot satisfy 25 GB from fragmented 20 GB blocks plus 5 GB currently used by Model B.
With per-model memory pools (using MIG or PyTorch's MemPool API), each model gets a fixed memory budget. Model A's cache cannot interfere with Model C's allocation headroom. The tradeoff is reduced flexibility - if Model A is idle, its 20 GB sits reserved and unused even when Model C is desperate for memory. This is the classic isolation vs. efficiency tradeoff. Production systems often use a hybrid: a small reserved pool per model for predictable latency plus a shared overflow pool for burst capacity.
Q6: Explain how stream-ordered allocation (cudaMallocAsync) differs from traditional cudaMalloc and why this matters for throughput.
Traditional cudaMalloc acquires a global driver lock, searches the global memory pool, maps physical pages, and returns. This is safe because all operations on the returned memory must be explicitly synchronized by the programmer. But the global lock means concurrent callers from different CPU threads serialize.
cudaMallocAsync associates the allocation with a CUDA stream. The allocator maintains per-stream free lists. When you call cudaFreeAsync(ptr, stream), the memory is not immediately available - it goes into the stream's pending-free list and is only actually recycled after all operations on that stream before the free have completed (ensured by stream ordering). When you call cudaMallocAsync(size, stream) next, it checks the stream's free list first, then the global pool.
For a serving system where each request runs on its own CUDA stream, this means each stream has its own allocation cache. The common case - same-size requests recycling from the same stream's cache - has no driver interaction at all. Peak throughput for small allocation-heavy workloads (like short sequence inference with many attention head buffers) can improve by 2-5x.
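Outside PyTorch, the stream-ordered allocator is easy to exercise from Python through CuPy's MemoryAsyncPool (experimental). A sketch, assuming CUDA 11.2+ and a recent CuPy:

```python
import cupy as cp

# Route CuPy allocations through cudaMallocAsync / cudaFreeAsync
cp.cuda.set_allocator(cp.cuda.MemoryAsyncPool().malloc)

stream = cp.cuda.Stream(non_blocking=True)
with stream:
    for _ in range(1000):
        # Same-size allocations recycle from the stream-ordered pool;
        # after warm-up this path never takes the global driver lock
        buf = cp.empty((1024, 1024), dtype=cp.float32)
        del buf  # cudaFreeAsync: recycled in stream order
stream.synchronize()
```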
Summary
GPU memory management is a multi-layer system. Physical VRAM sits at the bottom. The CUDA driver manages raw allocation with slow, locked operations. PyTorch's caching allocator sits above the driver and recycles memory aggressively to avoid driver calls. Unified Memory adds a migration layer between CPU and GPU but is rarely the right choice for ML workloads because the page fault latency exceeds what production workloads can absorb.
The key mental model: reserved_bytes - allocated_bytes = fragmented cache. This gap is intentional and healthy at moderate levels. It becomes a problem when it grows to consume memory that could be used for new allocations, causing OOM errors despite "free" memory.
For production systems, the most impactful settings are expandable_segments:True (eliminates virtual address fragmentation for large models), max_split_size_mb tuned to your allocation size distribution, and asynchronous checkpointing to prevent save spikes from causing OOM during training.
