Multicore and NUMA Architecture
The Training Job That Used 96 Cores and Got Slower
An ML platform team at a major tech company was migrating a data preprocessing pipeline from a 16-core workstation to a production server with two 48-core AMD EPYC sockets (96 cores total). The expectation was obvious: 6x more cores, roughly 6x more throughput. They set the PyTorch DataLoader worker count to 90 (leaving 6 cores for the main process and OS), deployed to the server, and started timing.
Throughput was 2.3x better than the 16-core machine. Not 6x. They tried reducing workers to 48. Still 2.3x. They tried 24. Still roughly the same. They tried pinning workers to specific CPUs with taskset. No change. The preprocessing was CPU-bound, not I/O-bound, and the code was embarrassingly parallel - each worker processed an independent batch with no synchronization.
After two days of investigation, the root cause emerged: all 90 DataLoader workers were allocated memory on whichever NUMA node had free memory, with no regard for which CPU socket was running the worker. Workers on socket 0 were frequently reading data from memory physically attached to socket 1, crossing the socket-to-socket Infinity Fabric interconnect (the EPYC counterpart of Intel's QPI/UPI) on every memory access. That interconnect has roughly twice the latency and half the bandwidth of local memory access.
The fix: numactl --interleave=all python train.py for workloads where workers have no locality, or os.sched_setaffinity() combined with NUMA-aware memory allocation to bind each worker to a single socket. With proper NUMA-aware placement, throughput reached 5.1x improvement on 96 cores - within 15% of the theoretical maximum given memory bandwidth constraints.
This story captures the central lesson of multicore architecture for ML engineers. Raw core count is not the whole story. How those cores are connected, where their memory is, and how your OS scheduler and memory allocator interact with the hardware topology determines whether you get 1x or 5x scaling. Understanding NUMA is not a senior engineer nicety - it is the difference between wasting a $50,000 server and actually using it.
Why This Exists - The Single-Core Wall
By 2004, CPU clock speeds hit a wall. Intel's Tejas processor (planned for 2004) was cancelled because it would have required 150 watts and still would not have been significantly faster than its predecessor. The physics of transistor switching speed, heat dissipation, and wire propagation delay conspired to make single-core performance scaling uneconomical.
The solution the industry converged on was multicore: put multiple CPU cores on the same die, each with its own L1 and L2 caches, sharing an L3 cache and a memory controller. Two cores running at 3 GHz can theoretically do twice the work of one core at 3 GHz for parallel workloads, while consuming far less power than one hypothetical core at 6 GHz.
This transition changed everything about how software must be written to extract maximum performance. A 2005 application running on a single-core CPU automatically ran faster every two years as new hardware shipped. A 2024 application running on a 64-core server runs at exactly the same speed as on a 4-core laptop unless it is explicitly written to use multiple cores. The hardware shifted the burden of parallelism onto software.
Historical Context - From Dual-Core to AMD's Chiplet Revolution
Intel shipped the first mainstream dual-core desktop processor (Pentium D) in 2005 by packaging two Prescott cores together. This was inelegant (the two cores communicated through the slow front-side bus, not an on-die interconnect) but proved the commercial value.
AMD's Zen architecture (2017) introduced the CCX (Core Complex) design: a modular unit of 4 CPU cores with shared L3 cache. Multiple CCX units are combined into a single chip. Zen 2 (2019) pushed this further with the "chiplet" model: compute dies (CCD) made on 7nm TSMC silicon, connected to an I/O die (on larger 14nm) via AMD's Infinity Fabric interconnect. An EPYC Rome processor is 8 compute chiplets + 1 I/O chiplet. This made 64-core, 128-thread processors economically feasible and gave AMD a massive yield advantage over Intel's monolithic die approach.
Intel responded with Foveros and EMIB (Embedded Multi-die Interconnect Bridge) packaging technologies. Apple's M-series chips took the chiplet concept further: a unified memory architecture where CPU, GPU, Neural Engine, and I/O blocks share the same memory package, eliminating traditional GPU memory bandwidth bottlenecks.
Shared Memory Multicore - The Hardware Model
In a modern multicore CPU, all cores share the same physical DRAM. The memory address space is flat and shared - any core can read or write any address. This is called shared memory multiprocessing (SMP when all cores have equal access latency).
Each core has its own:
- L1 instruction cache (32-64 KB, private)
- L1 data cache (32-64 KB, private)
- L2 cache (256 KB - 1 MB, private)
Cores share:
- L3 cache (8-64+ MB, shared across all cores on the same die or chiplet)
- Memory controller
- PCIe lanes and I/O
Cache Coherence Overhead in Multicore
When multiple cores share memory and each has its own L1/L2 cache, the hardware must maintain cache coherence - ensuring that all cores see a consistent view of memory. We covered the MESI protocol in the cache design lesson. At scale, this coherence mechanism becomes a significant performance bottleneck.
The cost of coherence manifests in several ways:
Invalidation storms: If one core writes to a memory location cached by all other cores, all other caches must invalidate their copy. With 96 cores, a write to a shared variable sends 95 invalidation messages on the interconnect and stalls 95 cores waiting for their coherence miss to be resolved.
Cache ping-pong: Two threads on different cores alternately writing to the same variable cause the cache line to bounce back and forth between the two cores' L1 caches. Each transition costs a round-trip on the interconnect - typically 20-50 cycles per bounce.
Shared read bandwidth: The L3 cache is shared. If 96 cores are all reading different data, they all compete for L3 bandwidth. Modern L3 caches have high aggregate bandwidth (Intel Sapphire Rapids L3 can sustain ~6 TB/s aggregate), but each individual core's share diminishes as more cores compete.
import multiprocessing
import time
import ctypes
def increment_worker(counter, n):
    """Increment the shared counter n times, taking its lock each time.
    Defined at module level so it can be pickled under any start method."""
    for _ in range(n):
        with counter.get_lock():
            counter.value += 1

def demonstrate_cache_ping_pong():
    """
    Two processes alternately incrementing a shared counter.
    Shows the overhead of cache-line bouncing between cores.
    """
    iterations = 1_000_000
    # Shared counter - both processes will fight over this cache line
    shared_counter = multiprocessing.Value(ctypes.c_int64, 0)
    # Single-process baseline: same lock acquisition, but no contention
    start = time.perf_counter()
    increment_worker(shared_counter, iterations)
    single_time = time.perf_counter() - start
# Two processes sharing the counter
shared_counter.value = 0
start = time.perf_counter()
p1 = multiprocessing.Process(
target=increment_worker, args=(shared_counter, iterations // 2)
)
p2 = multiprocessing.Process(
target=increment_worker, args=(shared_counter, iterations // 2)
)
p1.start()
p2.start()
p1.join()
p2.join()
multi_time = time.perf_counter() - start
print(f"Single process: {single_time:.3f}s")
print(f"Two processes (shared counter): {multi_time:.3f}s")
print(f"Overhead from coherence + lock: {multi_time / single_time:.1f}x slower")
if __name__ == "__main__":
demonstrate_cache_ping_pong()
NUMA Topology - Non-Uniform Memory Access
On single-socket servers, all cores have equal access latency to all memory. This is UMA (Uniform Memory Access). Multi-socket servers break this assumption: each socket has its own memory controller and DRAM physically attached to it. To access memory on a different socket, a request must traverse a high-speed interconnect (Intel UltraPath Interconnect / QPI, or AMD Infinity Fabric). This remote access is slower - typically 1.5x to 2.5x higher latency and lower bandwidth than local access.
This topology is called NUMA (Non-Uniform Memory Access). Each socket (and its directly attached DRAM) is a NUMA node.
AMD EPYC CCX Architecture - NUMA Within a Single Socket
AMD's EPYC processors have an additional NUMA complication: even within a single socket, the CCX (Core Complex) design creates multiple NUMA domains. Each CCX is a group of 4 or 8 cores with their own L3 cache slice. Within a CCX, memory access is fastest. Between CCXs on the same socket, requests traverse the Infinity Fabric - still faster than crossing sockets, but not as fast as within-CCX access.
An EPYC Genoa (Zen 4) processor has up to 12 CCDs (Core Chiplet Dies) with 8 cores each = 96 cores. Depending on the NPS (nodes-per-socket) BIOS setting, the topology can expose up to 4 NUMA nodes within a single socket. Misunderstanding this topology is a common cause of "it should scale but doesn't" problems.
# View the full NUMA topology, including the node distance matrix
numactl --hardware
# View NUMA topology with CPU placement
lstopo --output-format ascii  # requires hwloc package
# Example distance matrix from the numactl --hardware output:
# node distances:
# node   0   1   2   3
#   0:  10  12  32  32
#   1:  12  10  32  32
#   2:  32  32  10  12
#   3:  32  32  12  10
# Nodes 0-1 are NUMA domains on the same physical socket (close)
# Nodes 0 to 2/3 are cross-socket (far)
Detecting NUMA Topology with Python
import os
import subprocess
def detect_numa_topology():
"""
Detect NUMA topology using hwloc/numactl or /sys filesystem.
Works on Linux systems (most cloud VMs).
"""
print("=== NUMA Topology Detection ===\n")
# Method 1: /sys filesystem (always available on Linux)
numa_path = "/sys/devices/system/node"
if os.path.exists(numa_path):
numa_nodes = sorted([
d for d in os.listdir(numa_path)
if d.startswith("node") and d[4:].isdigit()
])
print(f"NUMA nodes found: {len(numa_nodes)}")
for node_dir in numa_nodes:
node_id = int(node_dir[4:])
cpulist_path = f"{numa_path}/{node_dir}/cpulist"
meminfo_path = f"{numa_path}/{node_dir}/meminfo"
try:
with open(cpulist_path) as f:
cpulist = f.read().strip()
print(f" Node {node_id} CPUs: {cpulist}")
except FileNotFoundError:
pass
try:
with open(meminfo_path) as f:
for line in f:
if "MemTotal" in line:
mem_kb = int(line.split()[3])
print(f" Node {node_id} Memory: {mem_kb // 1024 // 1024} GB")
break
except FileNotFoundError:
pass
# Method 2: numactl command
try:
result = subprocess.run(
["numactl", "--hardware"],
capture_output=True, text=True, timeout=5
)
if result.returncode == 0:
print("\n=== numactl --hardware ===")
print(result.stdout)
except (FileNotFoundError, subprocess.TimeoutExpired):
print("numactl not available (install with: apt-get install numastat)")
# Method 3: psutil (cross-platform, less detail)
try:
import psutil
print(f"\n=== psutil ===")
print(f"Physical CPUs (sockets): {psutil.cpu_count(logical=False)}")
print(f"Logical CPUs (with HT): {psutil.cpu_count(logical=True)}")
print(f"Hyperthreading ratio: {psutil.cpu_count(logical=True) / psutil.cpu_count(logical=False):.1f}x")
except ImportError:
print("psutil not installed: pip install psutil")
detect_numa_topology()
CPU Affinity - Binding Processes to CPU Cores
CPU affinity tells the OS scheduler which CPU cores a process or thread is allowed to run on. By default, the scheduler can migrate processes between cores freely, which can cause:
- Cache warmup loss (process moves to a different core with cold cache)
- NUMA locality loss (process moves from socket 0 to socket 1, now its memory is remote)
- Cache pollution (process migrated to a core that was busy with something else)
For ML workloads, pinning workers to specific CPU ranges per NUMA node is often the single highest-impact optimization:
import os
import numpy as np
def set_cpu_affinity(cpu_list: list):
"""
Pin current process to the specified CPU cores.
Requires Linux and sufficient permissions.
Args:
cpu_list: List of CPU core IDs to allow (e.g., [0, 1, 2, 3])
"""
try:
os.sched_setaffinity(0, cpu_list) # 0 = current process
print(f"PID {os.getpid()}: pinned to CPUs {sorted(cpu_list)}")
return True
except (AttributeError, OSError) as e:
print(f"Could not set affinity: {e}")
print("(os.sched_setaffinity requires Linux and may need elevated permissions)")
return False
def get_cpu_affinity() -> set:
"""Get the current CPU affinity set for this process."""
try:
return os.sched_getaffinity(0)
except AttributeError:
return set()
def numa_aware_worker(worker_id: int, numa_node_cpus: list, data_size: int):
"""
Worker that pins itself to a specific NUMA node's CPUs.
This ensures memory allocation happens on the local NUMA node.
"""
# Pin to the CPUs assigned to this worker's NUMA node
set_cpu_affinity(numa_node_cpus)
# Allocate memory after setting affinity
# On NUMA systems, memory allocation defaults to the current NUMA node
# Setting affinity first ensures we get local memory
data = np.random.rand(data_size).astype(np.float32)
# Process the data
result = np.sum(data ** 2)
return result
def demonstrate_affinity():
"""
Show CPU affinity setting and its effect on process scheduling.
"""
print("Default affinity (all CPUs allowed):")
default_affinity = get_cpu_affinity()
print(f" CPUs: {sorted(default_affinity)}")
# Pin to first 4 cores only
set_cpu_affinity([0, 1, 2, 3])
new_affinity = get_cpu_affinity()
print(f"\nAfter setting affinity to [0,1,2,3]:")
print(f" CPUs: {sorted(new_affinity)}")
# Restore full affinity
cpu_count = os.cpu_count() or 4
set_cpu_affinity(list(range(cpu_count)))
print(f"\nRestored to all {cpu_count} CPUs")
demonstrate_affinity()
Using taskset for Process-Level Affinity
# Pin a process to CPU cores 0-23 (first socket on a typical dual-socket, 48-core server)
taskset -c 0-23 python train.py
# Pin to CPU cores 24-47 (second socket)
taskset -c 24-47 python train.py
# Pin with numactl (also controls memory allocation)
numactl --cpunodebind=0 --membind=0 python train.py
# Run on all cores of NUMA node 1 with local memory
numactl --cpunodebind=1 --membind=1 python train.py
# Interleave memory across all NUMA nodes (good for multi-threaded workloads)
numactl --interleave=all python train.py
# Check what NUMA node a process is running on
cat /proc/PID/status | grep Cpu
cat /proc/PID/numa_maps | head -20
Hyperthreading (SMT) - Two Logical Cores Per Physical Core
Hyperthreading (Intel's term) or SMT (Simultaneous Multi-Threading) presents two logical CPU cores per physical core to the operating system. The two logical cores share the physical core's execution units, caches, and TLBs, but each has its own register set, PC (program counter), and retirement state.
The rationale: a single thread running on a core often stalls waiting for memory (cache misses, TLB misses). During these stalls, the execution units are idle. A second thread running on the same core can fill these stalls with its own work, effectively hiding memory latency.
For ML workloads, hyperthreading has mixed results:
When hyperthreading helps:
- I/O-bound workloads (DataLoader preprocessing with disk reads)
- Workloads with many cache misses (random embedding lookups)
- Workloads with high branch misprediction rates (varied input types)
When hyperthreading hurts:
- Compute-bound, well-vectorized workloads (all execution units already saturated by 1 thread)
- Cache-sensitive workloads (second thread pollutes L1/L2 cache, causing misses for the first thread)
- Memory-bandwidth-limited workloads (both threads competing for the same bandwidth)
import psutil
import os
def analyze_hyperthreading():
"""
Detect hyperthreading configuration and assess its impact.
"""
logical_cpus = psutil.cpu_count(logical=True)
physical_cpus = psutil.cpu_count(logical=False)
if logical_cpus is None or physical_cpus is None:
print("Could not determine CPU counts")
return
ht_ratio = logical_cpus / physical_cpus
print(f"Physical cores: {physical_cpus}")
print(f"Logical CPUs (with HT): {logical_cpus}")
print(f"Hyperthreading multiplier: {ht_ratio:.0f}x")
if ht_ratio == 2:
print("\nHyperthreading is ENABLED (2 logical per physical)")
print("Recommendation for ML training:")
print(" - Use physical core count as worker count, not logical")
print(f" - DataLoader workers: {physical_cpus} (not {logical_cpus})")
print(" - For compute-bound work, HT siblings may hurt (cache sharing)")
print(" - Pin workers to physical cores, not HT siblings of same core")
elif ht_ratio == 1:
print("\nHyperthreading is DISABLED or not supported")
print(f" - DataLoader workers: {physical_cpus}")
def get_physical_to_logical_mapping():
"""
Map physical cores to their logical CPU pairs (HT siblings).
Useful for NUMA-aware worker placement.
"""
mapping = {}
try:
for cpu_id in range(os.cpu_count() or 0):
path = f"/sys/devices/system/cpu/cpu{cpu_id}/topology/core_id"
with open(path) as f:
core_id = int(f.read().strip())
path_pkg = f"/sys/devices/system/cpu/cpu{cpu_id}/topology/physical_package_id"
with open(path_pkg) as f:
package_id = int(f.read().strip())
key = (package_id, core_id)
if key not in mapping:
mapping[key] = []
mapping[key].append(cpu_id)
print("\nPhysical core to logical CPU mapping:")
print("(Package, Core) -> [logical CPUs]")
for (pkg, core), cpus in sorted(mapping.items())[:8]: # show first 8
siblings = "HT siblings" if len(cpus) > 1 else "no HT"
print(f" Package {pkg}, Core {core:3d}: {cpus} ({siblings})")
return mapping
except FileNotFoundError:
print("Topology info not available (not Linux or missing sysfs)")
return {}
analyze_hyperthreading()
get_physical_to_logical_mapping()
Memory Bandwidth vs Compute Balance
For ML workloads, the critical question is whether the bottleneck is compute (CPU arithmetic throughput) or memory bandwidth (how fast data can be moved from DRAM to CPU).
The arithmetic intensity of a computation is:

    arithmetic intensity = FLOPs performed / bytes moved to and from DRAM   (FLOPS/byte)

The roofline model says:
- If arithmetic intensity > peak compute / peak bandwidth (the "ridge point"), the workload is compute-bound
- If arithmetic intensity < the ridge point, the workload is memory-bandwidth-bound
For a typical dual-socket Xeon server:
- Peak compute: 2 sockets * 48 cores * 16 GFLOPS/core = 1,536 GFLOPS
- Peak bandwidth: ~250 GB/s (8-channel DDR5)
- Ridge point: 1,536 / 250 = 6.1 FLOPS/byte
A dense matrix multiply of 2048x2048 float32 matrices has arithmetic intensity 2n^3 FLOPs / (3 * n^2 * 4 bytes) = n/6 ≈ 341 FLOPS/byte. Well above the ridge point - it is compute-bound. Adding more memory bandwidth does not help.
A simple element-wise add C = A + B has arithmetic intensity: 1 FLOP / 24 bytes = 0.04 FLOPS/byte. Far below the ridge point - it is memory-bandwidth-bound. Adding more cores does not help; they all share the same memory bandwidth.
import numpy as np
import time
def measure_memory_bandwidth():
"""
Measure actual memory bandwidth using a simple streaming benchmark.
Compare to theoretical peak to assess bandwidth utilization.
"""
    # Use an array large enough to not fit in L3 cache
    # (typical L3 is 32-256 MB, so 0.5 GB guarantees DRAM traffic)
    size_gb = 0.5  # adjust based on available RAM
n_elements = int(size_gb * 1024**3 / 8) # float64 elements
print(f"Array size: {size_gb} GB")
print(f"Elements: {n_elements:,}")
a = np.ones(n_elements, dtype=np.float64)
b = np.ones(n_elements, dtype=np.float64)
    # STREAM-style "add" kernel computed in a single ufunc call:
    # reads 2 arrays, writes 1 array = 3 * array_size bytes per pass
    # (writing "c = alpha * a + b" would materialize a temporary and add passes)
    c = np.empty_like(a)
    # Warmup (also first-touches c's pages)
    np.add(a, b, out=c)
    # Measure
    runs = 5
    start = time.perf_counter()
    for _ in range(runs):
        np.add(a, b, out=c)
    elapsed = (time.perf_counter() - start) / runs
    bytes_transferred = 3 * n_elements * 8  # read a, read b, write c
bandwidth_gbs = bytes_transferred / elapsed / 1e9
print(f"\nMemory bandwidth achieved: {bandwidth_gbs:.1f} GB/s")
print("Theoretical peak (DDR4-3200 quad-channel): ~100 GB/s")
print("Theoretical peak (DDR5-4800 eight-channel): ~300 GB/s")
print(f"Utilization: ~{bandwidth_gbs / 100 * 100:.0f}% of DDR4 peak")
measure_memory_bandwidth()
NUMA-Aware PyTorch DataLoader
PyTorch's DataLoader creates worker processes to preprocess data in parallel. For optimal NUMA performance:
- Workers should be pinned to CPUs on the NUMA node closest to the GPU they are feeding
- Pinned memory (pin_memory=True) should be allocated on the NUMA node where the GPU's PCIe slot is connected
- Worker count should match physical core count, not logical (hyperthreaded) count
import torch
from torch.utils.data import DataLoader, Dataset
import os
import multiprocessing
class NumaAwareDataset(Dataset):
"""
A dataset wrapper that demonstrates NUMA-aware loading.
"""
def __init__(self, data_size: int, feature_dim: int):
self.data_size = data_size
self.feature_dim = feature_dim
# Data is stored as numpy array (CPU memory)
import numpy as np
self.data = np.random.rand(data_size, feature_dim).astype(np.float32)
self.labels = np.random.randint(0, 10, size=data_size)
def __len__(self):
return self.data_size
def __getitem__(self, idx):
return (
torch.from_numpy(self.data[idx].copy()),
torch.tensor(self.labels[idx], dtype=torch.long)
)
def worker_init_fn(worker_id: int):
"""
Called once per DataLoader worker at startup.
Use this to set CPU affinity for NUMA-aware loading.
For a dual-socket server:
- Workers 0-N/2 get pinned to socket 0 CPUs
- Workers N/2-N get pinned to socket 1 CPUs
"""
num_workers = torch.utils.data.get_worker_info().num_workers
# Example: split workers evenly across NUMA nodes
# Adjust cpu_ranges for your actual server topology
# Use `numactl --hardware` to get the correct CPU ranges
total_cpus = os.cpu_count() or 1
cpus_per_worker = max(1, total_cpus // num_workers)
start_cpu = worker_id * cpus_per_worker
end_cpu = min(start_cpu + cpus_per_worker, total_cpus)
cpu_list = list(range(start_cpu, end_cpu))
try:
os.sched_setaffinity(0, cpu_list)
except (AttributeError, OSError):
pass # Not on Linux or insufficient permissions
# Set NumPy random seed per worker (important for reproducibility)
import numpy as np
worker_info = torch.utils.data.get_worker_info()
np.random.seed(worker_info.seed % (2**32))
def create_numa_aware_dataloader(
dataset: Dataset,
batch_size: int = 256,
    num_workers: int | None = None,
pin_memory: bool = True,
) -> DataLoader:
"""
Create a DataLoader configured for NUMA-aware performance.
Args:
dataset: The dataset to load from
batch_size: Batch size
num_workers: Number of worker processes. Defaults to physical core count.
pin_memory: Pin tensors to page-locked memory for faster GPU transfer.
Set True when using a GPU, False for CPU-only.
"""
import psutil
if num_workers is None:
# Use physical core count, not logical (hyperthreaded) count
# Hyperthreaded cores share execution resources - adding HT workers
# for compute-bound preprocessing hurts rather than helps
physical_cores = psutil.cpu_count(logical=False)
num_workers = min(physical_cores or 4, 16) # cap at 16 for diminishing returns
print(f"DataLoader config:")
print(f" num_workers: {num_workers}")
print(f" pin_memory: {pin_memory}")
print(f" (pin_memory=True allocates page-locked memory on the GPU's NUMA node)")
return DataLoader(
dataset,
batch_size=batch_size,
num_workers=num_workers,
pin_memory=pin_memory,
worker_init_fn=worker_init_fn,
persistent_workers=True, # keep workers alive between epochs
prefetch_factor=2, # prefetch 2 batches per worker
shuffle=True,
)
# Usage example
if __name__ == "__main__":
dataset = NumaAwareDataset(data_size=100_000, feature_dim=512)
loader = create_numa_aware_dataloader(dataset, batch_size=256)
import time
start = time.perf_counter()
batch_count = 0
for batch_x, batch_y in loader:
batch_count += 1
if batch_count >= 100:
break
elapsed = time.perf_counter() - start
print(f"\n100 batches loaded in {elapsed:.2f}s")
print(f"Throughput: {100 * 256 / elapsed:.0f} samples/sec")
pin_memory - What It Actually Does
pin_memory=True in DataLoader allocates tensors in page-locked (pinned) host memory instead of regular pageable memory. This has two effects:
- The OS is prevented from swapping out this memory to disk, guaranteeing it stays in DRAM
- GPU DMA (Direct Memory Access) transfers from page-locked memory are faster because the GPU's DMA engine can transfer directly without a CPU-coordinated bounce-copy through a temporary pinned buffer
The performance difference for pin_memory=True vs False:
- Small tensors (< 1 MB): negligible
- Large tensors (> 100 MB): 20-50% faster GPU transfer
- With many workers: important because each worker's output must be combined in the main process
pin_memory=True consumes page-locked memory, which is a limited resource. Over-allocating can cause the system to run out of page-locked memory, causing DataLoader workers to crash or hang. A safe heuristic: total pinned memory should be less than 5-10% of total DRAM. With 256 GB RAM, stay under 25 GB of pinned allocations.
NUMA Memory Allocation in Python
When you import numpy as np and call np.zeros(N), the memory is allocated lazily (pages are not actually backed by physical RAM until first access). On a NUMA system, the first touch of each page determines which NUMA node the physical backing is on. This is the Linux "first-touch" NUMA policy.
This means: if a main process allocates memory and then worker processes (on different NUMA nodes) access it, the memory stays on the main process's NUMA node - forcing remote access for all workers.
import ctypes
import os
def numa_alloc_demo():
"""
Demonstrate NUMA-aware memory allocation using ctypes.
On production systems, use libnuma for proper NUMA allocation.
"""
# Check if libnuma is available
    try:
        libnuma = ctypes.CDLL("libnuma.so.1")
        if libnuma.numa_available() < 0:
            print("libnuma loaded but NUMA is unavailable on this system")
            return
        print("libnuma available")
# Get current NUMA node
# numa_node_of_cpu(cpu_id) returns the NUMA node for a given CPU
libnuma.numa_node_of_cpu.restype = ctypes.c_int
current_cpu = os.sched_getcpu() if hasattr(os, 'sched_getcpu') else 0
numa_node = libnuma.numa_node_of_cpu(current_cpu)
print(f"Current CPU {current_cpu} is on NUMA node {numa_node}")
# numa_alloc_local: allocate on the NUMA node of the calling thread
# This is the preferred allocation strategy when workers are NUMA-pinned
size = 1024 * 1024 * 100 # 100 MB
libnuma.numa_alloc_local.restype = ctypes.c_void_p
ptr = libnuma.numa_alloc_local(ctypes.c_size_t(size))
if ptr:
print(f"Allocated 100 MB on local NUMA node {numa_node}")
# Free the memory
libnuma.numa_free(ctypes.c_void_p(ptr), ctypes.c_size_t(size))
else:
print("Allocation failed")
except OSError:
print("libnuma not available (install with: apt-get install libnuma-dev)")
print("Alternative: use numactl at the process level instead")
print(" numactl --cpunodebind=0 --membind=0 python my_script.py")
numa_alloc_demo()
Production numactl and taskset Recipes
# === SCENARIO 1: Training on a dual-socket server ===
# Run training process on socket 0, allocate memory on socket 0
numactl --cpunodebind=0 --membind=0 \
python train.py --workers 24
# === SCENARIO 2: Multiple GPU training, each on its own NUMA node ===
# GPU 0 is on PCIe attached to socket 0; GPU 1 is on socket 1
# Launch each training process on the correct NUMA node
numactl --cpunodebind=0 --membind=0 \
python train.py --gpu 0 --rank 0 &
numactl --cpunodebind=1 --membind=1 \
python train.py --gpu 1 --rank 1 &
wait
# === SCENARIO 3: Inference server with multiple replicas ===
# Spread replicas across NUMA nodes to balance L3 and memory bandwidth load
for replica in 0 1 2 3; do
numa_node=$((replica % 2)) # alternate between nodes 0 and 1
numactl --cpunodebind=$numa_node --membind=$numa_node \
gunicorn --workers 8 --bind 0.0.0.0:$((8080 + replica)) app:app &
done
# === SCENARIO 4: DataLoader tuning for NUMA ===
# Check which NUMA node each CPU belongs to
lscpu --parse=CPU,NODE | head -20
# (each /sys/devices/system/cpu/cpuN/ directory also contains a nodeM symlink)
# === Check memory NUMA locality of a running process ===
# Shows how much of process memory is local vs remote
numastat -p PID
# === Performance counter for NUMA activity ===
# Count remote memory accesses (NUMA events)
perf stat -e \
node-loads,node-load-misses,\
node-stores,node-store-misses \
-p PID
Production Engineering Notes
Optimal Worker Count Formula
For CPU-bound DataLoader preprocessing (image transforms, tokenization):
import psutil
def optimal_worker_count(
compute_bound: bool = True,
has_gpu: bool = True,
reserved_for_training: int = 2,
) -> int:
"""
Calculate optimal DataLoader worker count.
Args:
compute_bound: True if preprocessing is CPU-heavy (image decode, augmentation)
has_gpu: True if main process is doing GPU training
reserved_for_training: CPUs to reserve for training process and OS
"""
physical_cores = psutil.cpu_count(logical=False) or 4
logical_cores = psutil.cpu_count(logical=True) or 8
if compute_bound:
# Use physical cores only - HT doesn't help for compute-bound work
available = physical_cores - reserved_for_training
else:
# I/O-bound: HT helps, use logical cores
available = logical_cores - reserved_for_training
# Empirical cap: beyond 16-20 workers, gains diminish due to
# Python multiprocessing overhead and IPC costs
optimal = min(available, 16)
optimal = max(optimal, 1)
print(f"Physical cores: {physical_cores}")
print(f"Logical CPUs: {logical_cores}")
print(f"Recommended workers: {optimal}")
return optimal
optimal_worker_count(compute_bound=True, has_gpu=True)
Monitoring NUMA Performance
# numastat ships with the numactl package on most distributions
apt-get install numactl
# Show per-NUMA-node allocation statistics
numastat
# Show per-process NUMA memory statistics
numastat -p $(pgrep -f train.py)
# Output interpretation:
# numa_hit: allocations that succeeded on the intended node (want HIGH)
# numa_miss: allocations that had to go to a different node (want LOW)
# numa_foreign: pages allocated on this node for a process on another node (want LOW)
# interleave_hit: interleaved allocations (expected if using --interleave)
Never set num_workers in PyTorch DataLoader to the logical CPU count on a hyperthreaded system for compute-bound preprocessing. With 96 logical CPUs (48 physical), setting num_workers=90 means you are scheduling 90 processes on 48 physical cores, each pair of co-located workers competing for the same L1/L2 cache and execution units. This causes severe cache thrashing and can make preprocessing slower than using 24-30 workers pinned to physical cores.
persistent_workers=True in PyTorch DataLoader keeps worker processes alive between epochs. This avoids the overhead of respawning workers (which can take 1-5 seconds on complex pipelines). However, persistent workers hold their memory allocations between epochs. On a NUMA system, if you change the worker count between runs without restarting, workers from a previous NUMA placement may remain active with stale affinity settings.
On AMD EPYC multi-socket servers with numactl, misidentifying the NUMA topology is extremely common. An EPYC 7763 in a dual-socket configuration can expose up to 8 NUMA nodes (4 per socket with the NPS4 BIOS setting, reflecting the chiplet structure). Running numactl --cpunodebind=0 --membind=0 then uses only a fraction of socket 0's cores. Always run numactl --hardware first and map out the full topology before writing affinity scripts.
Common Mistakes
Using logical CPU count instead of physical for worker count
os.cpu_count() returns the logical CPU count, which includes hyperthreading. On a 48-core, 96-thread server, os.cpu_count() returns 96. Setting 96 DataLoader workers creates 96 competing processes on 48 physical cores. For compute-bound preprocessing (image decoding, tokenization), use psutil.cpu_count(logical=False) to get the physical core count. Hyperthreaded co-located workers share the same L1/L2 cache and execution units, causing cache thrashing and often delivering less throughput than half as many workers would.
Allocating shared data before forking worker processes
In PyTorch DataLoader (and Python's multiprocessing), workers are created by forking the main process. If you allocate a large dataset or model in the main process before creating workers (the common pattern), all workers inherit this memory via copy-on-write, but on a NUMA system, that memory was allocated on the main process's NUMA node. All workers on remote NUMA nodes will see 2x higher latency on every memory access. Solution: use spawn start method instead of fork so workers initialize fresh (and perform their own NUMA-local allocation), or use numactl --interleave=all to spread allocation across nodes.
Ignoring AMD EPYC's intra-socket NUMA
Many engineers check NUMA topology on a dual-socket Intel server, see 2 NUMA nodes, and assume that's all they need to handle. On AMD EPYC (Zen 2, 3, 4) chips, even a single-socket server can expose 4 or more NUMA nodes (depending on the NPS BIOS setting) corresponding to the CCX/CCD structure. A process bound to "NUMA node 0" on a single-socket EPYC 7763 in that configuration is bound to only 16 of 64 available cores. Always run lstopo (hwloc) to visualize the complete topology before making any affinity decisions.
Forgetting that fork() inherits file descriptors and locks
Python's multiprocessing defaults to fork() on Linux (through Python 3.13; newer versions are moving the default away from fork), inheriting all open file handles, locks, and sockets from the parent. This is fast but dangerous: if the parent has an open database connection, file lock, or GPU context when fork() happens, the child inherits it in an undefined state. PyTorch DataLoader with the fork start method can deadlock if the parent has CUDA context initialized. Always initialize CUDA after the DataLoader is created, not before.
Interview Questions and Answers
Q1: What is NUMA and why does it matter for multi-socket ML training servers?
NUMA (Non-Uniform Memory Access) is a memory architecture where memory access latency depends on the physical location of the memory relative to the requesting CPU. In a dual-socket server, each socket has its own DRAM directly connected to its memory controller. Accessing that local DRAM takes roughly 80-100 ns. Accessing DRAM on the other socket requires traversing an interconnect (UPI or Infinity Fabric) and takes roughly 1.5-2x longer. For ML training, NUMA matters because: (1) DataLoader workers on one socket reading data allocated on the other socket see up to 2x memory latency, (2) model parameter allocation that straddles NUMA nodes causes inefficient access patterns, (3) bandwidth to each NUMA node is limited, so cross-socket traffic can saturate the interconnect. The fix is NUMA-aware placement: bind workers and their memory allocations to the same NUMA node.
Q2: Explain the difference between physical cores and logical CPUs. How should this affect DataLoader worker count?
Physical cores are independent processing units with their own L1 and L2 caches, execution units, and register files. Logical CPUs (the number reported by os.cpu_count()) includes both physical cores and hyperthreaded siblings - where each physical core presents as two logical CPUs. Hyperthreaded siblings share the physical core's execution units, caches, and most hardware resources. For compute-bound preprocessing (CPU-intensive tasks like image decode, tokenization, feature engineering), hyperthreading provides little benefit because the second logical thread competes with the first for the same execution resources and cache. For such workloads, use physical core count as the worker count. For I/O-bound preprocessing (waiting on disk), hyperthreading can help because one thread's wait can be filled by the other. Use psutil.cpu_count(logical=False) to get the physical core count.
Q3: What is CPU affinity and when should you use it for ML workloads?
CPU affinity is a constraint on the OS scheduler specifying which CPU cores a process or thread is allowed to run on. Without affinity constraints, the scheduler can migrate processes between cores freely. For ML workloads, explicit affinity is beneficial when: (1) running on NUMA systems - binding workers to cores on the same NUMA node as their data prevents cross-socket memory access, (2) avoiding cache pollution from migrating workers - a worker that moves to a new core loses its L1/L2 cache warmup, (3) isolating training from other system processes - binding training to specific cores prevents OS background tasks from preempting the training process. Set affinity with os.sched_setaffinity() in Python, taskset at the command line, or numactl --cpunodebind for NUMA-aware binding.
Q4: Why can false sharing cause a parallel program to run slower than a serial one?
False sharing occurs when multiple threads modify different variables that occupy the same 64-byte cache line. Although the threads have no logical data dependency, the hardware cache coherence protocol (MESI) treats the cache line as shared and generates an invalidation every time any thread writes to it. Each invalidation forces all other threads to re-fetch the cache line on their next access. With N threads all writing to the same cache line at a high rate, the coherence traffic can overwhelm the interconnect and cause each thread to stall for 20-100 cycles per write operation - far worse than serial execution where there is no coherence overhead at all. The fix is padding: separate independently-modified variables to different cache lines by aligning them to 64-byte boundaries.
Q5: What does pin_memory=True in PyTorch DataLoader actually do and when should you use it?
pin_memory=True instructs DataLoader workers to allocate output tensors in page-locked (pinned) host memory rather than regular pageable memory. Page-locked memory has two properties that matter for GPU training: (1) the OS guarantees the physical memory backing will not be swapped out or relocated, and (2) GPU DMA engines can transfer directly from page-locked memory without an intermediate copy through a pinned bounce buffer. The result is faster tensor.to(device) transfers - typically 20-50% faster for large tensors. Use pin_memory=True whenever you are training on a GPU. Do not use it for CPU-only workloads (it wastes page-locked memory with no benefit). Be aware that pin_memory=True increases host memory pressure - over-allocating pinned memory can exhaust the pinned memory pool and cause OOM errors or slow degradation.
Q6: AMD EPYC processors have multiple NUMA nodes within a single socket. How does this affect workload placement?
AMD's EPYC architecture uses a chiplet design where each socket contains multiple CCDs (Core Chiplet Dies), each containing 8 cores with a private L3 slice. These chiplets communicate through the Infinity Fabric, which is fast but not as fast as on-die access. This creates "sub-NUMA" domains within a single socket: a core accessing memory served by a memory controller in a different NUMA domain of the same socket sees higher latency than accessing its own domain's memory. An EPYC Genoa processor can expose 4 NUMA nodes per socket. For a single-socket EPYC 7763 (64 cores, 4 NUMA nodes internally), binding a workload to "NUMA node 0" with numactl --cpunodebind=0 constrains it to 16 cores and a quarter of the machine's memory (64 GB of 256 GB in a typical configuration). The correct approach is to run lstopo --output-format ascii to understand the topology, then make explicit decisions about whether to span NUMA nodes (interleaved mode) or stay within one (local mode) based on workload memory requirements.
