
CPU Memory Architecture for ML

A Bad Morning in a Production Training Cluster

It is 3:14 AM. Your A100 cluster just started a new training run for a 13B parameter recommendation model. GPU utilization is showing 34%. Not a typo - thirty-four percent. The job is technically running. The loss is decreasing. But the GPUs are spending two-thirds of their time waiting.

Your on-call engineer pages you. The cluster is expensive. Every idle GPU-hour costs real money. You SSH in and start digging. GPU memory is fine - 72 GB of HBM2e loaded correctly. NVLink shows healthy bandwidth. The CUDA kernels look normal. Then you run perf stat on the DataLoader processes and the answer hits you like a freight train: memory bandwidth saturation on the host CPU. The 56-core Xeon is trying to decompress JPEG images, shuffle tensors, and feed eight GPUs simultaneously - and it is drowning.

The team had done everything right on the GPU side. They had profiled the forward pass, tuned the attention kernels, enabled torch.compile. But they had never profiled the host side of the pipeline. The CPUs were running four DataLoader workers each, all fighting over the same NUMA node's memory controller. The L3 cache was cold on every batch because workers were accessing discontiguous memory regions in random order. The DRAM bandwidth was maxed. The GPUs were starving.

This scenario plays out constantly in production ML. Engineers obsess over GPU memory bandwidth and flop counts - both valid concerns - but the CPU memory subsystem is often the silent bottleneck. It sits between your storage, your preprocessing logic, and your GPUs. When it breaks down, your expensive accelerators idle and you get paged at 3 AM.

This lesson explains exactly how CPU memory works at the hardware level, why it fails in ML workloads in predictable ways, and what you can do about it with concrete tools and techniques. By the end, you will be able to diagnose a CPU memory bottleneck in under five minutes and fix it in under an hour.


Why This Exists - The Problem CPU Memory Architecture Solves

The Speed Gap That Changed Everything

In the 1970s and 1980s, CPU clock speeds and DRAM speeds grew together. You could design a system where the processor and memory ran at roughly comparable rates. A CPU instruction that needed to fetch from memory waited a handful of cycles - annoying, but manageable.

Then something broke. CPU speeds started doubling every 18 months (Moore's Law in action), but DRAM speeds improved much more slowly - constrained by physics, specifically the RC delay of long metal wires connecting capacitors and transistors on the memory chip. By the mid-1990s, a CPU could execute hundreds of instructions in the time it took to fetch a single value from DRAM. The processor sat idle, spinning its wheels.

The industry's answer was a hierarchy of caches - small, fast, expensive SRAM sitting between the CPU and main memory. L1 cache could respond in 4-5 cycles. L2 in 12-15 cycles. L3 in 30-50 cycles. DRAM in 200-300 cycles. By keeping recently-used data in caches, programs could avoid the 200-cycle DRAM penalty most of the time.

This worked brilliantly for programs with temporal locality - programs that reuse the same data repeatedly. Scientific computing, database queries, rendering - all of these access the same memory regions in tight loops. Caches love them.

ML data pipelines often have the opposite property. A DataLoader shuffles the dataset randomly, accessing different files or array indices every batch. An image preprocessing pipeline reads each sample once and discards it. A sparse feature lookup table in a recommendation system jumps around a 100 GB embedding table unpredictably. These workloads are cache-hostile by design, and the CPU memory hierarchy punishes them severely.


Historical Context - How We Got Here

The Birth of Cache Memory

The concept of a memory cache predates modern CPUs. Maurice Wilkes proposed the idea in 1965 under the name "slave memory," and the IBM System/360 Model 85 (1968) was the first commercial computer to ship with a hardware cache. The insight was simple: a small fast memory that automatically keeps copies of recently-accessed main memory contents. If the CPU needs something already in the cache (a cache hit), it gets it instantly. If not (a cache miss), it fetches from main memory and stores a copy in cache.

The "aha moment" came when researchers measured real programs and found that about 90% of memory accesses hit in cache - despite cache being only 0.1% of the total memory size. This remarkable property, known as the principle of locality, made caches practical: programs tend to use the same memory regions repeatedly (temporal locality) and access memory in sequential patterns (spatial locality).

The NUMA Transition

Single-socket servers with one memory controller worked fine through the 1990s and early 2000s. But as core counts grew, a single memory controller became a bottleneck. Intel and AMD both moved to multi-socket designs where each socket has its own local DRAM - the NUMA (Non-Uniform Memory Access) architecture.

On a dual-socket server, a process running on socket 0 can access socket 0's local DRAM in roughly 80 ns, but accessing socket 1's remote DRAM takes around 140 ns - 75% slower. This matters enormously when you are running DataLoader processes across both sockets without pinning them.

AMD's EPYC processors (Rome, Milan, Genoa) took this further with the "chiplet" design - multiple dies on one package, each with its own local memory controllers connected by Infinity Fabric. An EPYC Genoa (9654) has 12 chiplets, and memory latency varies depending on which chiplet a thread is running on and which chiplet owns the memory being accessed.


Core Concepts - The CPU Memory Hierarchy

The Cache Hierarchy in Detail

A modern server CPU has three levels of cache between the execution units and main DRAM.

L1 Cache - Closest to the CPU core, split into instruction cache (L1i) and data cache (L1d). Typically 32-64 KB per core. Latency: 4-5 cycles (roughly 1-2 ns). Each physical core has its own private L1.

L2 Cache - Larger unified cache, typically 256 KB to 1 MB per core. Latency: 12-15 cycles (roughly 4-5 ns). Also private per core on most modern designs.

L3 Cache - Large shared cache across all cores on a die. Modern server CPUs have tens to hundreds of MB of L3 (a typical Intel Xeon Sapphire Rapids SKU has 60 MB; AMD EPYC Genoa has up to 384 MB, and Genoa-X variants with 3D V-Cache exceed 1 GB). Latency: 30-50 cycles (10-20 ns). Shared means multiple cores compete for it.

DRAM - Main memory, accessed when data is not in any cache level. Modern DDR5 and LPDDR5 latency is 80-100 ns (200-300 cycles). Orders of magnitude slower than L1.

The math that matters: if your ML pipeline causes frequent L3 misses and falls through to DRAM, you are paying a 50-100x latency penalty versus hitting L1. At scale, this becomes the dominant cost.

Cache Hit Rates and Effective Latency:

Effective_Latency = Hit_Rate_L1 * Latency_L1
+ (1 - Hit_Rate_L1) * Hit_Rate_L2 * Latency_L2
+ (1 - Hit_Rate_L1) * (1 - Hit_Rate_L2) * Hit_Rate_L3 * Latency_L3
+ (1 - Hit_Rate_L1) * (1 - Hit_Rate_L2) * (1 - Hit_Rate_L3) * Latency_DRAM

For a typical ML DataLoader with random-access patterns, L1 hit rate might be 60%, L2 hit rate 70%, L3 hit rate 40%. Plugging in approximate cycle counts for an Intel Xeon:

Effective Latency = 0.60 × 4 + 0.40 × 0.70 × 12 + 0.40 × 0.30 × 0.40 × 40 + 0.40 × 0.30 × 0.60 × 250

= 2.4 + 3.36 + 1.92 + 18.0 = 25.68 cycles

Compare that to a cache-friendly sequential access pattern with 95% L1 hit rate - the effective latency drops to about 5 cycles. The random-access penalty is over 5x in cycles, and at memory bandwidth saturation it gets much worse.
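
The effective-latency formula is easy to sanity-check in code. A minimal sketch, plugging in the hit rates and cycle counts assumed above (the sequential-pattern L2/L3 hit rates are illustrative assumptions):

def effective_latency(hits, lat):
    """Expected access latency in cycles for a 3-level cache plus DRAM.

    hits: (L1, L2, L3) hit rates; lat: (L1, L2, L3, DRAM) latencies in cycles.
    """
    h1, h2, h3 = hits
    l1, l2, l3, dram = lat
    return (h1 * l1
            + (1 - h1) * h2 * l2
            + (1 - h1) * (1 - h2) * h3 * l3
            + (1 - h1) * (1 - h2) * (1 - h3) * dram)

# Shuffled-DataLoader-like pattern from the text: ~25.7 cycles
print(effective_latency((0.60, 0.70, 0.40), (4, 12, 40, 250)))

# Cache-friendly sequential pattern (95% L1, 90% L2, 80% L3): ~4.8 cycles
print(effective_latency((0.95, 0.90, 0.80), (4, 12, 40, 250)))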

Cache Lines - The Atom of Memory Transfer

CPUs do not move individual bytes between DRAM and cache. They move cache lines - fixed-size blocks, always 64 bytes on x86. When you access a single byte in DRAM, the hardware fetches all 64 bytes containing it and stores the whole line in cache.
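
A quick way to feel this: sum a large float32 array sequentially, then sum every 16th element. The strided pass reads 1/16th of the bytes but touches exactly the same set of 64-byte lines (16 × 4 B = 64 B), so it runs nowhere near 16x faster. A minimal sketch; absolute timings are machine-dependent:

import time
import numpy as np

a = np.zeros(64 * 1024 * 1024, dtype=np.float32)  # 256 MB, larger than any L3

t0 = time.perf_counter()
a.sum()                       # sequential: pulls all 4M cache lines from DRAM
t_seq = time.perf_counter() - t0

t0 = time.perf_counter()
a[::16].sum()                 # stride of one cache line: the same 4M lines
t_strided = time.perf_counter() - t0

print(f"sequential: {t_seq * 1e3:.1f} ms, stride-16: {t_strided * 1e3:.1f} ms")
# Expect the stride-16 pass to take a comparable time despite summing
# 16x fewer elements - DRAM traffic, not arithmetic, dominates both passes.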

This has a critical implication for ML code: if your data structures are padded or have non-contiguous fields, you waste cache line space. Worse, if multiple threads write to different variables that happen to share a cache line, you get false sharing - a cache coherence nightmare where cores keep invalidating each other's copies of the same line even though they are not accessing the same data.

# False sharing example - what NOT to do in parallel DataLoader workers
import numpy as np
import multiprocessing as mp

# BAD: counter array where adjacent elements share cache lines
# Each int32 is 4 bytes, 16 per cache line
# Workers updating counters[0] and counters[1] thrash each other
counters = mp.Array('i', [0] * 16)

# GOOD: one mp.Value per counter - each lives in its own shared-memory
# allocation, so no two counters share a cache line
counters = [mp.Value('i', 0) for _ in range(16)]

DRAM Technology - DDR5 and LPDDR5

DDR5 is the current standard for server and desktop systems. Key characteristics:

  • Bandwidth: 51.2 GB/s per channel at DDR5-6400; servers run 8 (Sapphire Rapids) to 12 (EPYC Genoa) channels per socket
  • A dual-socket Intel Xeon Sapphire Rapids with 8 channels per socket at DDR5-4800: 307.2 GB/s per socket, 614.4 GB/s peak total
  • Latency: roughly 14-16 ns CAS latency at DDR5-4800
  • Power and reliability: lower voltage (1.1 V vs 1.2 V for DDR4) plus on-die ECC
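
These peak numbers fall straight out of transfer rate times bus width times channel count; a quick sanity check of the figures above:

def peak_bw_gbs(mt_per_s: int, channels: int, bus_bytes: int = 8) -> float:
    """Peak DRAM bandwidth in GB/s: transfers/s * 8 bytes per 64-bit channel."""
    return mt_per_s * 1e6 * bus_bytes * channels / 1e9

print(peak_bw_gbs(6400, 1))    # 51.2  - one DDR5-6400 channel
print(peak_bw_gbs(4800, 8))    # 307.2 - one 8-channel socket at DDR5-4800
print(peak_bw_gbs(4800, 16))   # 614.4 - dual socket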

LPDDR5 is the mobile/embedded variant used in edge inference hardware, Apple M-series chips, and Jetson boards:

  • Higher bandwidth density (bandwidth per mm2) than DDR5
  • Lower absolute bandwidth ceiling than server DDR5
  • The Apple M3 Ultra uses LPDDR5X at roughly 800 GB/s - exceptional for CPU-attached memory, though still well below H100's HBM3 (3.35 TB/s)
  • Key for ML at the edge: LPDDR5 is unified with GPU memory on Apple Silicon, eliminating PCIe transfer overhead entirely

HBM vs DRAM for CPU - Intel's Xeon Max series pairs traditional DDR5 DIMM slots with 64 GB of HBM2e stacked on the package. The HBM provides roughly 1 TB/s of bandwidth versus about 300 GB/s for the socket's eight DDR5 channels. For bandwidth-bound preprocessing workloads (image resizing, tokenization at scale), this makes a measurable difference.

NUMA Topology - The Hidden Tax

NUMA exists because running many memory channels through a single controller creates a bottleneck. Each socket (or chiplet on EPYC) gets its own local DRAM and memory controller, connected to other sockets via a high-speed interconnect.

Dual-Socket Intel Xeon NUMA Topology:

Socket 0 (NUMA node 0)             Socket 1 (NUMA node 1)
+----------------------+           +----------------------+
| Cores 0-27           |           | Cores 28-55          |
| L3 Cache: 60 MB      |<--------->| L3 Cache: 60 MB      |
| Memory Controller    |  3x UPI   | Memory Controller    |
| Channels 0-3         |           | Channels 4-7         |
| DRAM: 256 GB         |           | DRAM: 256 GB         |
+----------------------+           +----------------------+

Local DRAM access: ~80 ns
Remote DRAM access (cross-socket UPI): ~140 ns
NUMA penalty: ~75% latency increase

For ML workloads, NUMA mismatches happen silently. The OS allocates memory on a first-touch basis by default - whichever NUMA node first touches (writes to) a page becomes its owner. If your DataLoader processes spawn on socket 0 but the model weights were first touched on socket 1, every parameter access pays the remote DRAM penalty.
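
First-touch can also be used deliberately. A minimal sketch, assuming the process has already been bound with numactl (shown later in this lesson): write the buffer once from the bound process before any workers fork, so its pages land on the local node:

import numpy as np

# Allocation is virtual: no physical pages are assigned yet
buf = np.empty((10_000, 224 * 224 * 3), dtype=np.float32)

# The first write is what places pages - because this process is bound to one
# NUMA node, every page of buf now lives in that node's local DRAM
buf.fill(0)

# DataLoader workers forked after this point get local-node access to buf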


How CPU Memory Affects ML Workloads

DataLoader Workers - The Critical Path

PyTorch's DataLoader uses multiple worker processes to parallelize data loading and preprocessing. Each worker independently reads samples, applies transforms, and places them into a queue that the main training process reads from.

The CPU memory implications are significant:

  1. Forked process memory - Workers are forked from the main process. On Linux, copy-on-write semantics mean workers share physical pages with the parent until something writes to them. The catch: in CPython, even just reading a shared Python object updates its reference count, which counts as a write to the page holding that object - so pages full of Python objects get copied anyway. With 16 workers and a 10 GB dataset held as Python objects, you can balloon to 170 GB RSS. (A mitigation sketch follows this list.)

  2. Cache thrashing - Multiple workers access different, random dataset indices simultaneously. The L3 cache gets constantly evicted as each worker brings in new cache lines that others do not need.

  3. NUMA interference - Workers scheduled across both sockets access memory allocated on one socket, creating remote DRAM access patterns.
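
A common mitigation for the copy-on-write blow-up in item 1 is to store sample metadata (file names, labels) in refcount-free buffers such as numpy arrays rather than Python lists, so workers can read them without dirtying shared pages. A minimal sketch:

import numpy as np

# COW-unfriendly: a list of PyObjects; every read from a forked worker bumps
# a refcount, which is a write to the shared page holding the object header
filenames = [f"img_{i:06d}.jpg" for i in range(1_000_000)]

# COW-friendly: one contiguous, refcount-free buffer shared by all workers
filenames_np = np.asarray(filenames)  # fixed-width string array, single buffer

print(filenames_np[12345])  # reads do not dirty the shared buffer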

import torch
from torch.utils.data import DataLoader, Dataset
import numpy as np
import os

class ImageDataset(Dataset):
    def __init__(self, data_dir, transform=None):
        self.files = sorted(os.listdir(data_dir))
        self.data_dir = data_dir
        self.transform = transform

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        # This path: disk I/O + decompression + transform all happen in the worker
        img_path = os.path.join(self.data_dir, self.files[idx])
        img = np.load(img_path)  # or PIL.Image.open for JPEGs
        if self.transform:
            img = self.transform(img)
        return torch.from_numpy(img)

dataset = ImageDataset('/path/to/train_data')  # placeholder path

# Baseline - likely suboptimal on multi-socket servers
loader = DataLoader(
    dataset,
    batch_size=64,
    num_workers=8,
    pin_memory=True,          # Allocates pinned (page-locked) host memory
    prefetch_factor=2,        # Number of batches prefetched per worker
    persistent_workers=True,  # Keep worker processes alive between epochs
)

Pinned Memory - Why It Matters for GPU Transfer

Pinned memory (also called page-locked memory) is host DRAM that the OS has been told will never be swapped out. CUDA requires pinned memory for asynchronous (DMA) transfers to the GPU.

Without pinned memory, the copy path is:

  1. CPU copies data from pageable memory to a pinned staging buffer (CPU-to-CPU copy)
  2. DMA engine transfers from pinned buffer to GPU HBM (CPU-to-GPU copy)

With pinned memory (pin_memory=True in DataLoader), step 1 is eliminated. The DMA engine reads directly from the DataLoader's output buffer.

The cost: pinned memory cannot be swapped. On a system with 512 GB RAM running many jobs, pinning too much memory stresses the OS memory manager. A good rule of thumb is to pin no more than 10-15% of total host RAM.

# Manual pinned memory allocation for custom pipelines
import torch

batch_size = 64  # example batch size so the snippet runs standalone

# Allocate pinned memory explicitly
pinned_buffer = torch.zeros(batch_size, 3, 224, 224, pin_memory=True)

# Non-blocking transfer to GPU - returns immediately, transfer happens async
gpu_tensor = pinned_buffer.to('cuda:0', non_blocking=True)

# Do CPU work here while transfer completes in background
# ...

# Synchronize before using gpu_tensor
torch.cuda.synchronize()

The DataLoader Worker Count Question

A common mistake: setting num_workers to the total CPU core count. This seems logical but often hurts.

Why more workers is not always better:

  • Each worker takes CPU cycles for preprocessing. If workers saturate the CPUs, the main training process (which also needs CPU time for graph execution, gradient aggregation) gets starved.
  • More workers means more simultaneous DRAM accesses, increasing memory controller contention.
  • On NUMA systems, workers spread across both sockets create remote memory access patterns.
  • Context switching overhead increases with more processes.

Empirical rule for starting point:

import psutil
import torch
from torch.utils.data import DataLoader

# Starting heuristic - tune from here
num_cpus = psutil.cpu_count(logical=False)  # Physical cores only
num_gpus = torch.cuda.device_count()

# Rule of thumb: 2-4 workers per GPU, not per CPU
suggested_workers = min(num_cpus // 2, num_gpus * 4)
print(f"Suggested num_workers: {suggested_workers}")

# Profile to find the sweet spot
for nw in [2, 4, 8, 16]:
    loader = DataLoader(dataset, num_workers=nw)  # plus your other args
    # Measure actual GPU utilization over 100 batches
    # Pick the nw that maximizes GPU utilization without saturating CPU

Diagnosing CPU Memory Bottlenecks

Tool 1 - perf stat

Linux perf gives hardware performance counters - the closest you can get to reading the CPU's mind.

# Run perf on your DataLoader process (replace PID with actual process ID)
# -e flags: cache misses at each level, DRAM accesses, memory bandwidth
sudo perf stat -e \
cache-references,cache-misses,\
L1-dcache-loads,L1-dcache-load-misses,\
LLC-loads,LLC-load-misses,\
mem-loads,mem-stores \
-p $(pgrep -f "python train.py") \
sleep 30

# Example output interpretation:
# LLC-load-misses / LLC-loads = L3 miss rate
# If L3 miss rate > 20%, you have a cache thrashing problem
# mem-loads rate * 64 bytes = approximate DRAM bandwidth consumed

# Memory bandwidth measurement with perf
sudo perf stat -e \
uncore_imc_0/cas_count_read/,\
uncore_imc_0/cas_count_write/ \
-a sleep 10

# CAS (Column Address Strobe) counts map directly to DRAM transactions
# Each CAS count = 64 bytes = one cache line
# DRAM bandwidth (GB/s) = (read_cas + write_cas) * 64 / measurement_time / 1e9

Tool 2 - numastat and numactl

# Check NUMA memory allocation - are processes using local memory?
numastat -p $(pgrep -f "python train.py")

# numastat -p shows where the process's pages physically live, per NUMA node
# System-wide `numastat` reports numa_hit vs numa_miss counters; if misses
# exceed ~15% of hits, you have a NUMA placement problem

# Run training with NUMA binding to socket 0
numactl --cpunodebind=0 --membind=0 python train.py

# For dual-GPU setups where GPU 0 is on socket 0, GPU 1 on socket 1:
# Launch two processes, each bound to their respective socket
numactl --cpunodebind=0 --membind=0 python train.py --gpu=0 &
numactl --cpunodebind=1 --membind=1 python train.py --gpu=1 &

Tool 3 - Intel VTune / AMD uProf

For deeper cache analysis, vendor tools expose additional counters:

# Intel VTune memory access analysis
vtune -collect memory-access \
-knob analyze-mem-objects=true \
-- python train.py

# This generates a full report showing:
# - Hot memory access regions by source line
# - NUMA access patterns (local vs remote ratio)
# - Memory bandwidth over time
# - LLC miss hotspots

# AMD uProf equivalent
AMDuProfCLI collect --config tbp -o ./profile -- python train.py
AMDuProfCLI report -i ./profile

Tool 4 - Python-level Profiling with py-spy

# py-spy gives CPU flame graphs without code modification
pip install py-spy

# Record a flame graph of the training loop
py-spy record -o profile.svg --pid $(pgrep -f "train.py") --duration 60

# The flame graph shows time in DataLoader.__iter__ vs model.forward()
# If DataLoader time > 20% of total, the data pipeline is the bottleneck

Prefetching Strategies

Hardware Prefetching

Modern CPUs have hardware prefetch engines that detect sequential access patterns and fetch cache lines before the CPU requests them. For sequential array iteration, hardware prefetching is highly effective.

For random access (shuffled dataset), hardware prefetchers fail because they cannot predict the next address. This is why shuffled DataLoader workloads are expensive - hardware prefetch provides no benefit.

Software Prefetching in Python

import torch
from torch.utils.data import DataLoader

class PrefetchDataLoader:
    """
    Explicit prefetch wrapper that loads the next batch
    while the current batch is being processed on GPU.
    """
    def __init__(self, dataloader, device='cuda'):
        self.dataloader = dataloader
        self.device = device
        self.stream = torch.cuda.Stream()

    def __iter__(self):
        loader_iter = iter(self.dataloader)

        # Fetch the first batch
        try:
            batch = next(loader_iter)
        except StopIteration:
            return

        # Start the async transfer on the side stream
        with torch.cuda.stream(self.stream):
            if isinstance(batch, (list, tuple)):
                batch_gpu = [t.to(self.device, non_blocking=True) for t in batch]
            else:
                batch_gpu = batch.to(self.device, non_blocking=True)

        while True:
            # Synchronize - wait for the previous transfer to complete
            torch.cuda.current_stream().wait_stream(self.stream)
            current_batch = batch_gpu

            # Start prefetching the next batch
            try:
                batch = next(loader_iter)
                with torch.cuda.stream(self.stream):
                    if isinstance(batch, (list, tuple)):
                        batch_gpu = [t.to(self.device, non_blocking=True)
                                     for t in batch]
                    else:
                        batch_gpu = batch.to(self.device, non_blocking=True)
            except StopIteration:
                yield current_batch
                break

            yield current_batch

# Usage
base_loader = DataLoader(dataset, batch_size=64, num_workers=8, pin_memory=True)
loader = PrefetchDataLoader(base_loader, device='cuda:0')

for batch in loader:
    # batch is already on GPU when you receive it
    output = model(batch)

NUMA-Aware DataLoader Initialization

import os
import ctypes
import subprocess

from torch.utils.data import DataLoader

def get_numa_node_for_gpu(gpu_id):
    """Find which NUMA node a GPU is closest to."""
    try:
        result = subprocess.run(
            ['nvidia-smi', '--query-gpu=pci.bus_id', '--format=csv,noheader',
             '-i', str(gpu_id)],
            capture_output=True, text=True
        )
        # nvidia-smi prints an 8-digit PCI domain (00000000:3B:00.0);
        # sysfs paths use a 4-digit domain (0000:3b:00.0)
        bus_id = result.stdout.strip().lower()[-12:]
        numa_path = f'/sys/bus/pci/devices/{bus_id}/numa_node'
        with open(numa_path) as f:
            return int(f.read().strip())
    except Exception:
        return 0  # Default to node 0

def bind_process_to_numa_node(numa_node):
    """Bind the current process to a specific NUMA node."""
    # This requires libnuma
    try:
        libnuma = ctypes.CDLL('libnuma.so.1')
        libnuma.numa_run_on_node(numa_node)    # schedule on this node's cores
        libnuma.numa_set_preferred(numa_node)  # prefer this node's DRAM
        print(f"Bound to NUMA node {numa_node}")
    except OSError:
        print("libnuma not available, skipping NUMA binding")

# In your training script
gpu_id = int(os.environ.get('LOCAL_RANK', 0))
numa_node = get_numa_node_for_gpu(gpu_id)
bind_process_to_numa_node(numa_node)

# Now create the DataLoader - workers inherit the NUMA binding from the parent
loader = DataLoader(dataset, num_workers=8, pin_memory=True)

CPU as Memory Controller for GPU Workloads

The CPU does not just run DataLoader workers - it also manages the PCIe fabric that connects the GPUs. The CPU launches and coordinates GPU-to-GPU copies, and when direct peer-to-peer access is unavailable, those copies are staged through host DRAM, putting the CPU's memory controllers squarely in the critical path.
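
A quick way to check whether your GPU pairs can bypass host DRAM is to query peer-to-peer access; a minimal sketch using PyTorch:

import torch

# Pairs without P2P access stage their copies through host (CPU) DRAM,
# putting the CPU's memory controllers on the GPU-to-GPU critical path
for src in range(torch.cuda.device_count()):
    for dst in range(torch.cuda.device_count()):
        if src != dst:
            ok = torch.cuda.can_device_access_peer(src, dst)
            print(f"GPU{src} -> GPU{dst}: {'direct P2P' if ok else 'via host DRAM'}")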

Recommendation System Embedding Lookups

This is where CPU memory architecture matters most in production ML. A typical recommendation system (like those at Meta or Google) has:

  • Embedding tables: 10-500 GB, stored in CPU DRAM (too large for GPU)
  • At each forward pass: look up 100-1000 IDs per sample, 1024 samples per batch
  • Access pattern: completely random into a 500 GB table

import torch
import torch.nn as nn

class SparseEmbeddingLayer(nn.Module):
    """
    Large sparse embedding table that lives in CPU memory.
    Demonstrates the DRAM bandwidth challenge in recommendation systems.
    """
    def __init__(self, num_embeddings, embedding_dim):
        super().__init__()
        # This table stays on CPU - too large for GPU
        # 100M embeddings * 128 dim * 4 bytes = 51.2 GB
        self.embedding = nn.EmbeddingBag(
            num_embeddings=num_embeddings,
            embedding_dim=embedding_dim,
            mode='mean',
            sparse=True  # Sparse gradients - only update accessed rows
        )

    def forward(self, indices, offsets):
        # Each call: random access into the 51.2 GB DRAM table
        # Peak DRAM bandwidth may be 100+ GB/s, but achieved bandwidth
        # is often < 5 GB/s because nearly every lookup misses cache
        return self.embedding(indices, offsets)

# The fix: TorchRec's sharded embedding modules distribute embeddings
# across GPU HBM, CPU DRAM, and SSD, with intelligent caching based on
# access frequency
from torchrec.distributed import DistributedModelParallel

Memory Bandwidth - The Real Bottleneck Number

import time
import numpy as np

def measure_dram_bandwidth():
    """
    Measure achievable DRAM bandwidth with sequential vs random access.
    Run this to understand your system's memory characteristics.
    """
    size_gb = 4
    n_elements = size_gb * 1024**3 // 8  # float64 elements

    data = np.random.randn(n_elements)

    # Sequential read bandwidth
    start = time.perf_counter()
    result = np.sum(data)  # Forces a sequential read of all elements
    elapsed = time.perf_counter() - start
    seq_bw = size_gb / elapsed
    print(f"Sequential read bandwidth: {seq_bw:.1f} GB/s")

    # Random access bandwidth (simulating embedding lookup)
    indices = np.random.randint(0, n_elements, size=100_000)
    start = time.perf_counter()
    result = np.sum(data[indices])
    elapsed = time.perf_counter() - start
    # Each random access reads 8 bytes but fetches a 64-byte cache line
    random_bw = (len(indices) * 64) / elapsed / 1e9
    print(f"Random access effective bandwidth: {random_bw:.2f} GB/s")
    print(f"Bandwidth efficiency: {random_bw / seq_bw * 100:.1f}%")

# Typical results on a modern server:
# Sequential read bandwidth: 85.3 GB/s
# Random access effective bandwidth: 1.2 GB/s
# Bandwidth efficiency: 1.4%
# This is why sparse embedding lookups are so expensive!
measure_dram_bandwidth()

Architecture Diagrams

[Diagrams omitted in this text version: CPU Memory Hierarchy; DataLoader to GPU Data Flow; Intel vs AMD NUMA Topology.]


Production Engineering Notes

Sizing DataLoader Workers for Your Hardware

The correct number of workers depends on:

  1. Preprocessing cost per sample - If transforms are cheap (just normalization), fewer workers needed. If you are doing JPEG decompression, random crops, and augmentation, you need more.
  2. NUMA topology - On a 2-socket server, confine workers to one socket to avoid remote DRAM. This limits you to the core count of one socket.
  3. Memory bandwidth budget - Each worker adds to DRAM bandwidth consumption. Measure with perf stat and ensure you are not saturating memory controllers.
  4. GPU count - The pipeline must be fast enough to feed all GPUs simultaneously.
# Production DataLoader configuration template
# Tune each parameter based on profiling results
import torch.distributed as dist

dataloader_config = {
    'batch_size': 64,
    'num_workers': 8,            # Start at num_gpus * 2, profile and adjust
    'pin_memory': True,          # Always True when training on GPU
    'prefetch_factor': 2,        # Batches prefetched per worker (default: 2)
    'persistent_workers': True,  # Keep workers alive between epochs
    'drop_last': True,           # Ensures uniform batch sizes for efficiency
}

# With DDP, split the node's worker budget across its processes -
# each process gets its own set of workers
if dist.is_initialized():
    dataloader_config['num_workers'] = max(1, 8 // dist.get_world_size())

Memory Bandwidth Monitoring in Production

import time
import threading

class MemoryBandwidthMonitor:
    """
    Skeleton for continuous memory monitoring alongside training.
    Runs as a background thread.
    """
    def __init__(self, interval_sec=5):
        self.interval = interval_sec
        self.running = False
        self.readings = []

    def _read_meminfo(self):
        """Read /proc/meminfo (capacity and usage - not bandwidth)."""
        with open('/proc/meminfo') as f:
            info = {line.split(':')[0]: int(line.split()[1])
                    for line in f.readlines() if ':' in line}
        return info

    def start(self):
        self.running = True
        self.thread = threading.Thread(target=self._monitor_loop, daemon=True)
        self.thread.start()

    def stop(self):
        self.running = False

    def _monitor_loop(self):
        while self.running:
            # Placeholder reading. For true DRAM bandwidth, sample the uncore
            # IMC counters instead (perf uncore events, Intel PCM, AMD uProf)
            self.readings.append(self._read_meminfo())
            time.sleep(self.interval)

# For production, use Prometheus node_exporter with custom metrics
# or integrate with your existing telemetry pipeline (Grafana, DataDog)

Reducing Memory Pressure During Dataset Loading

import numpy as np
import torch

class MemoryMappedDataset:
    """
    Use memory mapping for large numpy array datasets.
    The OS handles paging - only touched pages occupy physical RAM.
    Much better than loading the entire dataset into memory.
    """
    def __init__(self, data_path, shape, dtype=np.float32):
        self.data = np.memmap(data_path, dtype=dtype, mode='r', shape=shape)
        self.length = shape[0]

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        # The OS fetches only the pages containing this sample
        # (one 4 KB page holds 1,024 float32 values)
        sample = self.data[idx].copy()  # .copy() to get a writable array
        return torch.from_numpy(sample)

# For HDF5 datasets
import h5py

class HDF5Dataset:
    def __init__(self, h5_path, dataset_key='data'):
        self.h5_path = h5_path
        self.dataset_key = dataset_key
        # Open the file handle lazily, per worker process, to avoid sharing
        self._file = None
        self._dataset = None

    def _get_dataset(self):
        if self._file is None:
            self._file = h5py.File(self.h5_path, 'r', swmr=True)
            self._dataset = self._file[self.dataset_key]
        return self._dataset

    def __len__(self):
        return len(self._get_dataset())

    def __getitem__(self, idx):
        ds = self._get_dataset()
        return torch.from_numpy(ds[idx])

    def __del__(self):
        if self._file is not None:
            self._file.close()

Common Mistakes

:::danger NUMA-blind DataLoader Configuration

Setting num_workers=32 on a 2-socket 28-core-per-socket server causes half the workers to run on socket 0 and half on socket 1. When DRAM is allocated on socket 0 (where your training process first ran), socket 1 workers pay a 75% latency penalty on every memory access.

Always bind your DataLoader workers to the same NUMA node as your GPU:

# Check GPU NUMA affinity
cat /sys/bus/pci/devices/<gpu_pci_addr>/numa_node

# Bind training process to that NUMA node
numactl --cpunodebind=0 --membind=0 python train.py

Never skip NUMA binding on multi-socket servers. The performance difference can be 30-40% on data-heavy workloads. :::

:::danger Unpinned Memory with Non-Blocking Transfers

If you call tensor.to('cuda', non_blocking=True) on a tensor in regular (pageable) memory, the non-blocking call is silently downgraded to a blocking synchronous copy. You get none of the overlap benefit and your code looks correct in profiling because the synchronization point is hidden.

Always verify your tensors are pinned before using non_blocking:

assert tensor.is_pinned(), "non_blocking=True requires pinned memory"
gpu_tensor = tensor.to('cuda', non_blocking=True)

And ensure DataLoader has pin_memory=True so output tensors are automatically pinned. :::

:::warning Too Many Workers Starving the Training Process

With num_workers=32 on a 32-core machine, all CPU capacity goes to DataLoader workers. The main training process (which handles graph execution, loss computation, and gradient aggregation in DDP) has no cores left. Every core will be pegged, yet training throughput ends up lower than with num_workers=8.

Profile before tuning: py-spy record -o flame.svg --pid $(pgrep -f train.py). If the main process is spending time in CPU-bound ops (backward pass, optimizer step), reduce workers. :::

:::warning False Sharing in Custom Collate Functions

Writing a custom collate_fn that updates shared arrays from multiple workers creates false sharing at cache line boundaries. Workers running on different cores spend cycles invalidating each other's cache entries.

Keep workers stateless - each worker produces an independent output tensor, and the collate function combines them. Never use shared mutable state in collate functions. :::
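
A minimal sketch of a collate function in this spirit - it only reads worker-local samples and allocates fresh output tensors, so no shared cache line is ever written by two workers:

import torch

def collate_stateless(samples):
    """Combine (image, label) samples without touching any shared state."""
    images = torch.stack([img for img, _ in samples])    # new private allocation
    labels = torch.tensor([label for _, label in samples])
    return images, labels

# loader = DataLoader(dataset, batch_size=64, num_workers=8,
#                     collate_fn=collate_stateless)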

:::warning Memory Copy in __getitem__ When Using memmap

What np.memmap does on write depends on the mode. With mode='r' the mapping is read-only, and an in-place transform (e.g., data[idx] /= 255.0) raises an error; with mode='c' (copy-on-write) the same transform silently copies the touched pages into new physical memory. For a 1 TB memory-mapped dataset with 16 workers, those silent copies can exhaust RAM.

Always do sample = data[idx].copy() first to explicitly copy the sample into a small private buffer, then apply transforms to the copy. The .copy() call is intentional and cheap - it copies one sample, not the whole dataset. :::
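
Putting the pattern together (the file name and shape here are illustrative placeholders):

import numpy as np
import torch

data = np.memmap('train.bin', dtype=np.float32, mode='r',
                 shape=(100_000, 1024))

sample = data[1234].copy()   # copies ~4 KB: one sample, not the dataset
sample /= 255.0              # in-place transform touches only the private copy
tensor = torch.from_numpy(sample)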


Interview Q&A

Q1: You have a training job where GPU utilization is 40% despite the model fitting in GPU memory. Your initial instinct is to check the DataLoader. Walk me through your diagnostic process.

The first thing I check is whether the problem is actually in the data pipeline versus something else (CPU-side backward pass, optimizer step, or actual model compute being underestimated). I run py-spy record for 60 seconds to get a CPU flame graph. If DataLoader.__iter__ or worker subprocesses dominate the flame, it is a data pipeline issue.

Next I check memory bandwidth: perf stat -e LLC-load-misses,mem-loads -a sleep 10 gives L3 miss rate and approximate DRAM bandwidth. If L3 miss rate is above 20% and DRAM bandwidth is near saturation (check the theoretical max with dmidecode -t memory to find DDR type and channel count), I have a cache-thrashing problem.

Then I check NUMA: numastat -p <pid> shows local vs remote memory accesses. If remote accesses are more than 15% of total, I bind the process to the correct NUMA node.

Finally I profile the transforms themselves - sometimes a slow augmentation library (e.g., slow JPEG decoder) is the real culprit, in which case I switch to a faster decoder (libjpeg-turbo, nvjpeg for GPU decoding) rather than tuning the memory subsystem.


Q2: Explain why pinned memory is required for asynchronous GPU transfers and what the risks are of pinning too much memory.

CUDA's DMA engine (the hardware that moves data between CPU and GPU over PCIe) can only read from physical memory addresses that are guaranteed not to change. Regular (pageable) OS memory can be swapped out by the OS memory manager at any time - its physical address can change. The DMA engine cannot handle this safely.

Pinned memory is marked as "do not page out" by the OS. The physical address is stable, so the DMA engine can read it directly without CPU involvement. This enables truly asynchronous transfers: the CPU launches the DMA operation and continues executing, while the DMA hardware handles the transfer in parallel.

The risk of pinning too much: the OS cannot use pinned pages for anything else. If you pin 200 GB of a 256 GB server, the OS has only 56 GB of pageable memory for everything else. Under memory pressure, the OOM killer may terminate processes, and the OS's ability to cache filesystem data (page cache) is severely reduced, slowing disk I/O. The practical limit is 10-15% of total RAM for pinned DataLoader buffers.


Q3: A recommendation model uses a 400 GB embedding table stored in CPU DRAM. Inference latency is 80 ms but the target is 20 ms. The embedding lookup alone takes 60 ms. What approaches would you take to fix this?

The core problem is that random access into a 400 GB table hits DRAM constantly - no cache can hold 400 GB. Effective DRAM bandwidth for random access is typically 1-3 GB/s despite peak bandwidth of 100+ GB/s.

Approach 1 - Tiered caching with TorchRec: keep the hot embeddings (the top ~10% most-accessed IDs, which under a power-law access distribution typically serve ~90% of traffic) in GPU HBM, with cold embeddings staying in CPU DRAM. Most lookups then become local HBM reads on the GPU instead of host DRAM reads that must also cross PCIe.

Approach 2 - Quantization: Reduce embedding from fp32 (4 bytes) to int8 (1 byte) with per-row scale factors. The table shrinks to 100 GB, fitting more rows in L3 cache. Cache hit rate increases proportionally.

Approach 3 - Batching and reordering: Sort lookup indices within each batch by row ID before accessing the embedding table. This converts random access to semi-sequential access, improving hardware prefetcher effectiveness. On a batch of 1024 samples each looking up 100 IDs, sorting reduces cache misses significantly.
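
A minimal sketch of this index-sorting idea; table stands in for a hypothetical CPU-resident embedding weight of shape [num_rows, dim]:

import torch

def sorted_lookup(table: torch.Tensor, ids: torch.Tensor) -> torch.Tensor:
    """Gather rows in ascending address order, then restore request order."""
    sorted_ids, perm = torch.sort(ids)
    rows = table[sorted_ids]       # semi-sequential walk through the table
    out = torch.empty_like(rows)
    out[perm] = rows               # undo the sort permutation
    return out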

Approach 4 - DRAM hardware upgrade: HBM-augmented CPUs (Intel's Xeon Max series) offer roughly 1 TB/s of package bandwidth versus around 300 GB/s for 8-channel DDR5. For bandwidth-bound lookup workloads, the effective improvement can be nearly proportional.


Q4: What is NUMA and why does it matter for ML training on multi-socket servers?

NUMA (Non-Uniform Memory Access) is a memory architecture where a multi-socket system has separate DRAM connected to each CPU socket (or chiplet). Each socket can access its own local DRAM quickly (local access, ~80 ns) but must traverse an inter-socket interconnect (Intel UPI, AMD Infinity Fabric) to access the other socket's DRAM (remote access, ~140 ns, a ~75% penalty).

For ML training, NUMA causes problems when:

  1. The OS schedules DataLoader workers on socket 1 while the training process first touched (allocated) the model weights and dataset buffers on socket 0. Every worker access to those buffers is a remote NUMA access.

  2. In DDP (Distributed Data Parallel) with multiple GPUs per node, if GPU 0 is physically connected to socket 0 and GPU 1 to socket 1, PCIe transfers from a socket-0-allocated buffer to GPU 1 must cross the UPI link, adding latency and consuming inter-socket bandwidth.

The fix is NUMA binding: use numactl to ensure the training process and its workers all run on the NUMA node local to their GPU. Check GPU NUMA affinity with cat /sys/bus/pci/devices/<pci_addr>/numa_node. In SLURM environments, request --ntasks-per-node=1 --cpus-per-task=<cores_per_socket> and use numactl in the launch script.


Q5: Compare DDR5 and LPDDR5 from an ML workload perspective. When does the difference matter?

DDR5 and LPDDR5 serve different segments; both are JEDEC standards of the same DRAM generation, but they differ in signaling, packaging, and power management. Key differences:

Bandwidth: DDR5 scales with channel count - a server with 8 DDR5 channels at 6400 MT/s achieves 409.6 GB/s peak per socket. LPDDR5X (as in Apple M3 Ultra) uses narrower, lower-power channels but reaches roughly 800 GB/s by ganging them into a very wide interface (on the order of 1024 bits, versus 64 bits per DDR5 DIMM). Apple's approach trades channel flexibility for raw bandwidth density.

Latency: LPDDR5 typically has slightly higher latency than DDR5 at the same data rate. LPDDR5 is optimized for power efficiency, not minimum latency.

Capacity: DDR5 DIMMs scale to 256 GB per module, enabling servers with multiple TB of DRAM. LPDDR5 is soldered and fixed - Apple's Ultra-class chips top out at a few hundred GB. For large model loading (say, staging a 350 GB model in host memory), only DDR5 server configurations offer comfortable headroom.

When it matters: For edge inference on Apple Silicon, LPDDR5X's unified memory architecture means the CPU and GPU share the same physical memory, so there is no PCIe copy to the GPU at all. For server training with large sparse embedding tables (400 GB+), only DDR5 servers can hold the entire table.


Q6: Your team is migrating from a single-socket server to a dual-socket server for a large language model preprocessing pipeline. What changes do you need to make to avoid NUMA performance regressions?

The migration from single-socket to dual-socket introduces NUMA complexity that does not exist on a single-socket system. Here are the required changes:

Step 1 - Profile the existing workload: Before migration, establish baseline metrics - preprocessing throughput, DRAM bandwidth utilization, and worker CPU utilization. These are your comparison baselines.

Step 2 - Update DataLoader configuration: On a dual-socket system, limit num_workers to the core count of one socket rather than total cores. Bind the training process and workers to a single NUMA node using numactl.

Step 3 - Check GPU placement: Run nvidia-smi topo -m to determine GPU-to-CPU affinity. Each GPU is physically connected to one socket via PCIe. Workers that feed a given GPU should run on the same socket as that GPU.

Step 4 - Memory allocation policy: Consider disabling automatic NUMA balancing for training workloads (sysctl kernel.numa_balancing=0). Automatic balancing migrates pages between nodes to improve locality, but the migration itself consumes bandwidth and can cause latency spikes during training.

Step 5 - Monitor after migration: Use numastat -c during a training run to verify that remote NUMA accesses stay below roughly 5% of the total. If higher, check which processes are not following the NUMA binding.

The key insight: what was a single-socket "all memory is local" workload becomes a workload where process placement relative to memory matters enormously.

© 2026 EngineersOfAI. All rights reserved.