
Memory Hierarchy and Cache Design

The Matrix That Brought Down a Data Center

A senior ML engineer at a major cloud company had just deployed a new recommendation model to production. The model itself was fast - offline benchmarks showed 2ms per batch. But in production, with 200 concurrent inference workers on a 96-core server, latency blew out to 180ms and the CPUs were running at 100%. The team spent three days debugging. They checked the model architecture, batch sizes, network I/O, database queries. Nothing obvious.

On day four, someone ran perf stat with cache miss counters. The L3 cache miss rate was 47%. The theoretical compute throughput of those 96 cores was being wasted waiting for data from DRAM - roughly 100 nanoseconds per access instead of roughly 10 nanoseconds for an L3 hit. The model's embedding lookup table was 8 GB. The working set for 200 concurrent workers, each accessing different embedding rows, could not fit in the 48 MB shared L3 cache. Every lookup was going to main memory.

The fix was not algorithmic. It was memory layout. Restructuring the embedding table so that frequently co-accessed rows were adjacent in memory (locality-aware sharding), combined with a warm-up pass to pre-load hot rows into cache, reduced the L3 miss rate to 4% and dropped inference latency from 180ms back to 3ms. The compute hardware had not changed. The arithmetic had not changed. Only the relationship between data layout and the memory hierarchy changed.

This pattern repeats constantly in ML systems. Matrix multiplication performance on a modern CPU degrades by 20x when matrices do not fit in L2 cache. PyTorch's DataLoader has a pin_memory=True flag specifically because transfers from pageable host memory to the GPU go through an extra staging copy in a driver-managed pinned buffer. NumPy's row-major layout means column-wise access on a 2D array is 10x slower than row-wise access, for no arithmetic reason - purely because of cache line utilization.

Understanding the memory hierarchy is not optional for an ML engineer working at performance-sensitive scale. It is as fundamental as understanding floating-point arithmetic. The CPU's speed depends entirely on whether the data it needs is in L1, L2, L3, or DRAM - a difference of 1x, 4x, 10x, or 100x in access latency. Everything else in performance optimization is downstream of this.


Why This Exists - The Memory Speed Problem

The fundamental problem is that processor speed has grown much faster than memory speed. In the 1980s, a CPU and DRAM were roughly matched in speed - a memory access took 1-2 CPU cycles. By 2000, CPUs had scaled to 1 GHz and DRAM latency had barely improved - an access now cost 100 CPU cycles. Today, a modern CPU at 4 GHz with a 100ns DRAM latency is waiting 400 cycles per cache miss. Without caches, modern CPUs would spend the vast majority of their time stalled waiting for memory.

The solution is a hierarchy of memories: small and fast close to the CPU, large and slow far from the CPU. Each level is larger but slower than the one above it. The hardware automatically manages movement between levels based on the principle of locality - the empirical observation that programs access the same data repeatedly (temporal locality) and access data near recently-accessed data (spatial locality).


Historical Context - The Cache Invention

The first commercial hardware cache appeared in the IBM System/360 Model 85 in 1968. Maurice Wilkes had proposed the underlying idea in 1965 - a small, fast "slave memory" between the CPU and main memory that automatically held recently-used data - and John Liptay's team at IBM turned it into a shipping design. The concept was initially controversial - why not just make main memory faster? The answer was economics: fast memory cost orders of magnitude more per bit than slow memory, and the cache exploited locality to get fast-memory performance at near-slow-memory cost.

By the 1990s, on-chip L1 caches became standard. L2 caches appeared first as external chips, then moved on-die by the late 1990s (Pentium III). L3 caches became common in the 2000s as transistor counts grew large enough to dedicate millions of them to cache storage. Today, high-end server CPUs (AMD EPYC Genoa, for example) carry 256 MB or more of L3 cache, and AMD's 3D V-Cache technology stacks additional cache die directly on top of the compute die using wafer-on-wafer bonding.


The Latency Numbers You Must Memorize

Every ML engineer should have these numbers committed to memory. They are the constants that govern performance at scale.

Level    | Size (typical)         | Latency                  | Bandwidth
L1 cache | 32 - 64 KB per core    | ~1 ns (4 cycles)         | ~2 TB/s
L2 cache | 256 KB - 1 MB per core | ~4 ns (12 cycles)        | ~1 TB/s
L3 cache | 8 - 64 MB shared       | ~10 ns (30-40 cycles)    | ~500 GB/s
DRAM     | 32 - 512 GB            | ~100 ns (300-400 cycles) | ~50 GB/s
NVMe SSD | Terabytes              | ~100 us (100,000 ns)     | ~7 GB/s

The ratio from L1 to DRAM is roughly 100x in latency. A program that fits its working set in L1 cache runs 100 times faster (in the memory-bound case) than one that must go to DRAM on every access. This is not a small constant factor - it is the difference between a system that works and one that does not.


The Cache Line - The Unit of Transfer

Data does not move between cache levels one byte at a time. It moves in fixed-size blocks called cache lines. On virtually every modern CPU (x86, ARM, RISC-V), a cache line is 64 bytes.

This has profound implications. When you access byte 0 of a struct, the hardware immediately loads bytes 0 through 63 into cache - even if you only asked for 8 of them. If your next access hits bytes 8, 16, 24, 32, 40, 48, or 56, they are already in cache at zero cost. If your next access hits byte 64 or beyond (a different cache line), you pay the full latency again.

This is the hardware mechanism behind spatial locality. A C array double arr[8] (8 * 8 = 64 bytes) fits exactly in one cache line. Once you touch arr[0], accessing arr[1] through arr[7] is essentially free.

import numpy as np

# Visualizing cache line alignment
arr = np.zeros(8, dtype=np.float64)
# 8 doubles * 8 bytes each = 64 bytes = exactly 1 cache line
# Accessing arr[0] loads all 8 values into L1 cache
print(f"Array size: {arr.nbytes} bytes")
print(f"Cache line size: 64 bytes")
print(f"Values per cache line: {64 // arr.itemsize}")

Cache Associativity - How the Hardware Decides Where to Put Data

A cache must answer two questions: where does a given memory address map to in the cache, and when the cache is full, which existing entry do we evict?

Direct-Mapped Cache

The simplest design: each memory address maps to exactly one cache location (determined by the lower bits of the address). Fast to look up (only one comparison), but suffers from conflict misses - two frequently used addresses that hash to the same cache location will evict each other continuously, even if there is plenty of empty cache space.

Example: In a 32 KB direct-mapped L1 cache, addresses 0x0000 and 0x8000 map to the same cache line. Alternately accessing these two addresses causes a cache miss on every access, even though you are only using 128 bytes of data.

N-Way Set-Associative Cache

Groups cache lines into sets of N. Each memory address maps to one set (via the middle bits of the address), but can occupy any of the N ways within that set. Modern CPUs use 4-way to 16-way associativity. This dramatically reduces conflict misses at the cost of slightly more complex lookup hardware (N tag comparisons instead of 1).

An 8-way set-associative 32 KB L1 cache has $32\text{ KB} / (8 \times 64\text{ B}) = 64$ sets. Each memory address maps to a set via bits 6-11 of the address, and can occupy any of the 8 ways.
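
A minimal sketch of that address decomposition, assuming the 8-way, 32 KB, 64-byte-line geometry above (the constants are illustrative, not taken from any specific CPU):

# Minimal sketch: decompose an address for an 8-way, 32 KB cache with
# 64-byte lines (64 sets), matching the geometry described above.
LINE_SIZE = 64   # bytes -> offset is bits 0-5
NUM_SETS = 64    # 32 KB / (8 ways * 64 B) -> set index is bits 6-11

def cache_address_fields(addr: int):
    offset = addr & (LINE_SIZE - 1)              # which byte within the line
    set_index = (addr // LINE_SIZE) % NUM_SETS   # which set the line maps to
    tag = addr // (LINE_SIZE * NUM_SETS)         # identifies the line within its set
    return tag, set_index, offset

# 0x0000 and 0x8000 land in the same set (set 0) with different tags:
# a direct-mapped cache would make them conflict; with 8 ways they coexist.
for addr in (0x0000, 0x8000, 0x1040):
    print(hex(addr), cache_address_fields(addr))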

Fully-Associative Cache

Any cache line can hold any address. Maximum flexibility, minimum conflict misses. Only practical for small structures (TLBs, victim caches) because the hardware must compare every tag in parallel on each lookup, which quickly becomes prohibitive in area and energy as the number of lines grows.

Eviction Policies

When a cache set is full and a new line must be loaded, the hardware must choose which existing line to evict.

  • LRU (Least Recently Used): Evict the line that has not been accessed for the longest time. Optimal for typical access patterns but expensive to implement exactly in hardware for high associativity.
  • PLRU (Pseudo-LRU): An approximation of LRU using a binary tree of 1-bit recency flags. Used in most modern CPUs because it is much cheaper to implement while providing near-LRU quality.
  • RRIP (Re-Reference Interval Prediction): More sophisticated, used in Intel CPUs since Sandy Bridge. Assigns a "re-reference prediction value" to each cache line - lines predicted to be referenced far in the future are evicted first.
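
To make the eviction mechanics concrete, here is a minimal sketch of a single cache set with exact LRU replacement - a simplification, since real hardware tracks recency with a few bits per line rather than an ordered list:

from collections import OrderedDict

class CacheSet:
    """One set of an N-way set-associative cache with exact LRU eviction."""
    def __init__(self, ways: int = 4):
        self.ways = ways
        self.lines = OrderedDict()  # tag -> None, ordered oldest -> newest

    def access(self, tag: int) -> bool:
        if tag in self.lines:              # hit: mark as most recently used
            self.lines.move_to_end(tag)
            return True
        if len(self.lines) >= self.ways:   # miss with a full set: evict the LRU line
            self.lines.popitem(last=False)
        self.lines[tag] = None
        return False

s = CacheSet(ways=4)
for tag in [1, 2, 3, 4, 1, 5]:  # re-using tag 1 saves it; tag 2 (LRU) is evicted by tag 5
    print(tag, "hit" if s.access(tag) else "miss")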

Write Policies - Write-Through vs Write-Back

When the CPU writes a value, what happens to the cache and to memory?

Write-through: Every write updates both the cache and the next-level memory immediately. Simple to implement and maintain coherence. But it generates a write to the slower memory on every store, which is expensive for write-heavy workloads.

Write-back: Writes update only the cache line, which is marked "dirty." The dirty line is written back to memory only when it is evicted. This is what modern CPUs use because it can absorb many writes to the same cache line with only one eventual write to memory.

Write-allocate: On a write miss, load the cache line into cache (write-allocate) before writing. This makes subsequent writes to the same line cache hits. Used with write-back.

No-write-allocate: On a write miss, write directly to the next level without loading the line. Used with write-through.

Modern L1 and L2 caches are write-back + write-allocate. This is the right default for most code.
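
A back-of-the-envelope way to see why write-back wins for write-heavy code is to count the writes that reach the next memory level under each policy for the same store sequence. This is a simplified counting model, not a hardware simulation:

def writes_to_memory(stores, policy="write-back", line_size=64):
    """Count writes that reach the next level for a sequence of store addresses."""
    lines_touched = {addr // line_size for addr in stores}
    if policy == "write-through":
        return len(stores)        # every store propagates immediately
    return len(lines_touched)     # write-back: one write-back per dirty line evicted

stores = [8 * i for i in range(64)]  # 64 sequential 8-byte stores = 8 cache lines
print(writes_to_memory(stores, "write-through"))  # 64 memory writes
print(writes_to_memory(stores, "write-back"))     # 8 write-backs (one per dirty line)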


MESI Cache Coherence Protocol

In a multicore CPU, each core has its own L1 and L2 caches, but they all share the same DRAM. If Core 0 and Core 1 both cache the same memory address, what happens when Core 0 writes to it?

Without coordination, Core 1 would continue reading stale data - a correctness disaster. The solution is a cache coherence protocol: hardware that ensures all cores see a consistent view of memory.

The MESI protocol (also known as the Illinois protocol, used in Intel x86 CPUs since the Pentium) assigns each cache line one of four states:

  • M (Modified): This core has the only valid copy, and it has been written (differs from memory). Must write back before another core can read it.
  • E (Exclusive): This core has the only copy, and it matches memory (not yet written). Can be written without notifying others.
  • S (Shared): Multiple cores may have valid copies matching memory. Must transition to M if written (and invalidate others).
  • I (Invalid): This line is not valid. Must fetch from memory or another cache.

Every cache miss in a multicore system triggers a coherence lookup on the interconnect. If Core 1 wants a line that Core 0 has in Modified state, the hardware must: signal Core 0, have Core 0 write the line back to memory (or directly to Core 1 via the interconnect), and then Core 1 can load it. This "coherence miss" can be as expensive as a DRAM access.
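
The sketch below is a heavily simplified model of the MESI transitions described above, seen from one cache line in one core; real implementations add transient states and snoop responses:

# Simplified MESI transitions for one cache line in one core.
# Events: local_read, local_write, remote_read, remote_write.
MESI = {
    ("I", "local_read"):   "S",  # fetch; another core may also hold it (simplification)
    ("I", "local_write"):  "M",  # fetch for ownership, then modify
    ("E", "local_write"):  "M",  # exclusive copy: write silently, no bus traffic
    ("E", "remote_read"):  "S",  # another core now shares the clean line
    ("S", "local_write"):  "M",  # must broadcast an invalidation to other sharers
    ("S", "remote_write"): "I",  # another core's write invalidates our copy
    ("M", "remote_read"):  "S",  # supply the dirty data, then both share it
    ("M", "remote_write"): "I",  # supply the data, then invalidate our copy
}

state = "I"
for event in ["local_read", "remote_write", "local_write", "remote_read"]:
    state = MESI.get((state, event), state)
    print(f"{event:13s} -> {state}")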


False Sharing - When Coherence Becomes a Performance Trap

False sharing is one of the most insidious multi-threaded performance bugs. It occurs when two threads modify different variables that happen to reside on the same 64-byte cache line. Despite having no logical sharing of data, the hardware coherence protocol treats the cache line as shared and generates coherence traffic on every write.

Consider two workers (separate processes here, since CPython threads would serialize on the GIL), each maintaining its own counter:

import multiprocessing
import time


def worker(shared_array, index, iterations):
    """
    Repeatedly increments shared_array[index]. Whether two workers
    false-share depends only on how far apart their indices are in memory.
    """
    for _ in range(iterations):
        shared_array[index] += 1


def measure_false_sharing():
    iterations = 1_000_000

    # Case 1: False sharing - counters adjacent (same 64-byte cache line).
    # lock=False so the timing reflects cache coherence, not lock contention.
    shared_bad = multiprocessing.Array('i', [0, 0], lock=False)
    start = time.perf_counter()
    p1 = multiprocessing.Process(target=worker, args=(shared_bad, 0, iterations))
    p2 = multiprocessing.Process(target=worker, args=(shared_bad, 1, iterations))
    p1.start(); p2.start()
    p1.join(); p2.join()
    false_sharing_time = time.perf_counter() - start

    # Case 2: No false sharing - counters on separate cache lines.
    # Padding: 16 ints * 4 bytes = 64 bytes between the two counters.
    shared_good = multiprocessing.Array('i', [0] * 32, lock=False)
    start = time.perf_counter()
    p3 = multiprocessing.Process(target=worker, args=(shared_good, 0, iterations))
    p4 = multiprocessing.Process(target=worker, args=(shared_good, 16, iterations))
    p3.start(); p4.start()
    p3.join(); p4.join()
    no_false_sharing_time = time.perf_counter() - start

    print(f"False sharing:    {false_sharing_time:.3f}s")
    print(f"No false sharing: {no_false_sharing_time:.3f}s")
    print(f"Overhead: {false_sharing_time / no_false_sharing_time:.1f}x")
    # Typical: 3x to 10x overhead in compiled code; Python interpreter
    # overhead masks part of the effect here.


if __name__ == "__main__":
    measure_false_sharing()

The fix for false sharing is padding: add enough padding bytes to each data structure so that no two independently-modified variables share a cache line. In C/C++, this is done with alignas(64). In Python, it is done by spacing elements 16 integers (64 bytes) apart in a shared array.


Spatial and Temporal Locality - The Two Principles

Temporal locality: If you accessed a memory location recently, you will likely access it again soon. Caches exploit this by keeping recently-used data in fast cache.

Spatial locality: If you accessed address X, you will likely access addresses near X (X+8, X+16, etc.) soon. Cache lines exploit this by loading 64 bytes around the requested address.

These principles have been empirically validated across virtually every class of program. They are the reason caches work. Code that violates them - accessing memory in random order, jumping across large data structures - pays the full DRAM penalty and gets no benefit from the cache hierarchy.


Cache-Friendly Data Structures - Structs vs Arrays

The choice between "Array of Structs" (AoS) and "Struct of Arrays" (SoA) is a classic cache-friendliness tradeoff.

import numpy as np
import time


# Array of Structs (AoS) - bad for vectorized access
# struct Point { float x, y, z; }
# points = [Point(x0,y0,z0), Point(x1,y1,z1), ...]
# Layout in memory: x0 y0 z0 x1 y1 z1 x2 y2 z2 ...
# To access all x values: stride = 3 floats = 12 bytes
# Only one third of every loaded cache line holds x values

N = 1_000_000
aos = np.random.rand(N * 3).astype(np.float32)  # interleaved x,y,z buffer
aos_x = aos[0::3]  # strided views into the interleaved buffer
aos_y = aos[1::3]
aos_z = aos[2::3]

# Struct of Arrays (SoA) - good for vectorized access
# x = [x0, x1, x2, ...] <- contiguous
# y = [y0, y1, y2, ...] <- contiguous
# z = [z0, z1, z2, ...] <- contiguous
# Accessing all x values: stride = 1 float = 4 bytes
# Sequential - every byte of every loaded cache line is useful

soa_x = np.random.rand(N).astype(np.float32)
soa_y = np.random.rand(N).astype(np.float32)
soa_z = np.random.rand(N).astype(np.float32)

# Compute magnitude of all points
start = time.perf_counter()
# AoS simulation: strided access
magnitudes_aos = np.sqrt(aos_x**2 + aos_y**2 + aos_z**2)
aos_time = time.perf_counter() - start

start = time.perf_counter()
# SoA: sequential access - all cache-line-friendly
magnitudes_soa = np.sqrt(soa_x**2 + soa_y**2 + soa_z**2)
soa_time = time.perf_counter() - start

print(f"AoS access: {aos_time:.4f}s")
print(f"SoA access: {soa_time:.4f}s")
print(f"SoA speedup: {aos_time / soa_time:.2f}x")
# SoA is typically 1.5x to 3x faster for vectorized operations

Matrix Multiplication Cache Blocking

The classic example of cache-aware algorithm design: naive matrix multiplication causes catastrophic cache misses that make it 10-20x slower than the blocked version on large matrices.

The problem with naive $C = A \times B$ for large matrices: to compute $C[i][j] = \sum_k A[i][k] \times B[k][j]$, we iterate over entire rows of A and entire columns of B. If the matrix does not fit in L2 cache, each inner-loop access to the column of B (which is stored in row-major order, so accessing B[k][j] for increasing k means jumping by an entire row) generates a cache miss on nearly every access.

Blocking (tiling) solves this by dividing the matrices into small blocks that fit entirely in L2 cache:

import numpy as np
import time


def matmul_naive(A, B):
    """
    Naive matrix multiplication.
    Accesses B column-wise on row-major storage -> cache misses.
    """
    N = A.shape[0]
    C = np.zeros((N, N), dtype=A.dtype)
    for i in range(N):
        for j in range(N):
            for k in range(N):
                C[i, j] += A[i, k] * B[k, j]
    return C


def matmul_blocked(A, B, block_size=64):
    """
    Cache-blocked matrix multiplication.
    Operates on (block_size x block_size) tiles that fit in L2 cache.

    Block size selection:
    - 3 blocks must fit in L2: 3 * block_size^2 * 8 bytes < L2 size
    - For 256 KB L2: block_size = sqrt(256*1024 / (3*8)) ~= 104
    - We use 64 to stay conservatively under that bound.
    """
    N = A.shape[0]
    C = np.zeros((N, N), dtype=A.dtype)
    bs = block_size
    for i in range(0, N, bs):
        for j in range(0, N, bs):
            for k in range(0, N, bs):
                # This tile access is cache-friendly:
                # A[i:i+bs, k:k+bs] and B[k:k+bs, j:j+bs] are small blocks
                # that stay resident in L2 for the whole inner product.
                i_end = min(i + bs, N)
                j_end = min(j + bs, N)
                k_end = min(k + bs, N)
                C[i:i_end, j:j_end] += (
                    A[i:i_end, k:k_end] @ B[k:k_end, j:j_end]
                )
    return C


def matmul_numpy(A, B):
    """NumPy's matmul dispatches to BLAS dgemm with cache-optimized blocking."""
    return A @ B


# Benchmark on a medium matrix
N = 512
A = np.random.rand(N, N).astype(np.float64)
B = np.random.rand(N, N).astype(np.float64)

# NumPy baseline
start = time.perf_counter()
C_np = matmul_numpy(A, B)
np_time = time.perf_counter() - start

# Blocked Python (much slower than BLAS but demonstrates blocking benefit)
start = time.perf_counter()
C_blocked = matmul_blocked(A, B, block_size=64)
blocked_time = time.perf_counter() - start

# Verify correctness
print(f"Max difference: {np.max(np.abs(C_np - C_blocked)):.2e}")
print(f"NumPy (BLAS): {np_time:.4f}s")
print(f"Blocked Python: {blocked_time:.4f}s")
print(
    "Note: NumPy uses optimized BLAS with SIMD + blocking. "
    "The Python blocked version shows the concept, not the performance."
)

The key insight for the cache-blocking algorithm: choose block size $B$ such that three $B \times B$ submatrices fit in L2 cache:

$$3 \times B^2 \times \text{element\_size} \leq \text{L2\_size}$$

For a 256 KB L2 cache with float64 (8 bytes):

$$B \leq \sqrt{\frac{256 \times 1024}{3 \times 8}} \approx 104$$
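
A quick sanity check of this bound for a few cache sizes (the 256 KB row reproduces the estimate above):

import math

def max_block_size(cache_bytes: int, element_size: int = 8) -> int:
    """Largest B such that three B x B float64 tiles fit in the given cache."""
    return int(math.sqrt(cache_bytes / (3 * element_size)))

for kb in (32, 256, 1024):  # L1-sized, typical L2, large L2
    print(f"{kb:5d} KB cache -> block size <= {max_block_size(kb * 1024)}")
# 256 KB -> 104, matching the figure above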


Hardware Prefetching and Software Prefetch Hints

Modern CPUs include hardware prefetchers that observe access patterns and speculatively load cache lines before they are requested. The sequential prefetcher (detects stride-1 access) and stride prefetcher (detects constant-stride access) are standard in all modern CPUs.

Sequential access patterns (iterating through a large array) are almost entirely hidden by the hardware prefetcher - the CPU predicts that you will access the next cache line and loads it 10-20 cache lines ahead. This is why a simple np.sum(arr) gets close to peak memory bandwidth despite DRAM latency being 100ns - the prefetcher hides that latency by fetching ahead.

Random access patterns (hash table lookups, tree traversals, embedding table lookups in recommendation models) defeat the prefetcher completely. There is no pattern to predict. This is the root cause of the recommendation model latency problem described at the opening of this lesson.

import numpy as np
import time


def demonstrate_prefetcher_effect():
    """
    Sequential vs random access on large arrays.
    Sequential: hardware prefetcher hides DRAM latency.
    Random: no pattern, prefetcher fails, every access goes to DRAM.
    """
    N = 10_000_000
    arr = np.arange(N, dtype=np.float64)
    indices_random = np.random.permutation(N)

    # Sequential access - hardware prefetcher effective
    start = time.perf_counter()
    total = 0.0
    for i in range(N):
        total += arr[i]
    seq_time = time.perf_counter() - start

    # Random access - prefetcher cannot predict, DRAM latency exposed
    start = time.perf_counter()
    total = 0.0
    for idx in indices_random:
        total += arr[idx]
    rand_time = time.perf_counter() - start

    print(f"Sequential access: {seq_time:.3f}s")
    print(f"Random access: {rand_time:.3f}s")
    print(f"Random slowdown: {rand_time / seq_time:.1f}x")
    # The pure-Python loop adds constant interpreter overhead per access,
    # which masks part of the gap; expect a few-fold slowdown here and
    # 5x to 15x in compiled code for arrays that exceed L3 cache.


demonstrate_prefetcher_effect()
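
The pure-Python loop above adds roughly constant interpreter overhead to every access, which hides part of the gap. A vectorized variant of the same comparison - a sketch using NumPy fancy indexing, where only the gather order differs - exposes the access pattern more cleanly:

import numpy as np
import time

N = 10_000_000
arr = np.arange(N, dtype=np.float64)
idx_seq = np.arange(N)
idx_rand = np.random.permutation(N)

start = time.perf_counter()
s1 = arr[idx_seq].sum()   # gather in sequential order: prefetcher-friendly
seq_time = time.perf_counter() - start

start = time.perf_counter()
s2 = arr[idx_rand].sum()  # gather in random order: defeats the prefetcher
rand_time = time.perf_counter() - start

print(f"Sequential gather: {seq_time:.3f}s")
print(f"Random gather:     {rand_time:.3f}s  ({rand_time / seq_time:.1f}x slower)")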

Using Cachegrind to Profile Cache Behavior

Valgrind's cachegrind tool simulates the entire cache hierarchy and reports exact cache miss counts for every function in your program.

# Install valgrind
apt-get install valgrind # Ubuntu/Debian

# Run your Python script under cachegrind
valgrind --tool=cachegrind \
--cache-sim=yes \
--branch-sim=yes \
python3 my_script.py

# The output file is cachegrind.out.PID
# Annotate results with cg_annotate
cg_annotate cachegrind.out.12345

# Key metrics in the output:
# Ir = instruction reads (total instructions executed)
# Dr = data reads
# Dw = data writes
# D1mr = L1 data cache read misses
# D1mw = L1 data cache write misses
# DLmr = Last-level cache read misses (these go to DRAM)
# DLmw = Last-level cache write misses

# Example healthy output for matrix multiply:
# Dr: 1,024,000,000
# D1mr: 2,048,000 (0.2% miss rate - good)
# DLmr: 32,000 (0.003% DRAM miss rate - excellent)

# Example bad output for random embedding lookup:
# Dr: 100,000,000
# D1mr: 95,000,000 (95% miss rate - terrible)
# DLmr: 92,000,000 (92% goes to DRAM - disaster)

NUMA Topology and Memory Domains

On multi-socket servers, memory is not equally accessible to all cores. This is NUMA (Non-Uniform Memory Access). Each CPU socket has local DRAM that it can access quickly (NUMA local) and remote DRAM on other sockets that requires traversing an interconnect (NUMA remote). The latency ratio is typically 1.5x to 2.5x for remote vs local access.

# View NUMA topology
numactl --hardware

# Output example (2-socket server):
# available: 2 nodes (0-1)
# node 0 cpus: 0-23 48-71
# node 0 size: 128000 MB
# node 0 free: 95000 MB
# node 1 cpus: 24-47 72-95
# node 1 size: 128000 MB
# node 1 free: 96000 MB
# node distances:
# node 0 1
# 0: 10 21 <- local=10, remote=21 (2.1x higher latency)
# 1: 21 10
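
On Linux you can confine a Python process to one node's cores with os.sched_setaffinity; the CPU ranges below are taken from the example topology above and must be adjusted to your own numactl --hardware output (memory placement additionally requires numactl or libnuma):

import os

# Pin this process to the cores of NUMA node 0 from the example topology
# (cpus 0-23 and 48-71). Adjust the ranges to match your own machine.
node0_cpus = set(range(0, 24)) | set(range(48, 72))
os.sched_setaffinity(0, node0_cpus)  # 0 = the current process

print(f"Now restricted to {len(os.sched_getaffinity(0))} CPUs on node 0")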

Production Engineering Notes

Measuring Your Cache Line Size

# Linux
getconf LEVEL1_DCACHE_LINESIZE # L1 cache line size (usually 64)
getconf LEVEL1_DCACHE_SIZE # L1 total size
getconf LEVEL2_CACHE_SIZE # L2 total size
getconf LEVEL3_CACHE_SIZE # L3 total size

# Python
import os
print(os.sysconf('SC_LEVEL1_DCACHE_LINESIZE'))

Checking perf Cache Counters

# Cache miss profile for a running process
perf stat -e \
L1-dcache-loads,L1-dcache-load-misses,\
L2-loads,L2-load-misses,\
LLC-loads,LLC-load-misses \
-p PID

# For a training script
perf stat -e \
L1-dcache-loads,L1-dcache-load-misses,\
LLC-loads,LLC-load-misses \
python train.py

# Interpret results:
# L1 miss rate < 5% : good cache behavior
# L2 miss rate < 10% : acceptable
# LLC miss rate > 5% : data does not fit in L3 - consider algorithm changes
# LLC miss rate > 20% : severe memory bottleneck - algorithmic change required

NumPy Array Layout and Cache Efficiency

import numpy as np

# NumPy default: C-contiguous (row-major)
# Row access is sequential (cache-friendly)
# Column access is strided (cache-unfriendly)

arr = np.random.rand(1000, 1000)
print(arr.flags) # C_CONTIGUOUS: True

# Row access - good
row = arr[0, :] # sequential 8000 bytes = 125 consecutive cache lines, prefetcher-friendly
# Column access - bad
col = arr[:, 0] # stride of 8000 bytes, touches 1000 separate cache lines

# For column-heavy operations, use Fortran-order arrays
arr_f = np.asfortranarray(arr)
print(arr_f.flags) # F_CONTIGUOUS: True
col_f = arr_f[:, 0] # now sequential - cache-friendly

# Or transpose: creates a view with transposed strides (no copy)
arr_t = arr.T
# arr_t[:, 0] accesses what was arr[0, :] - cache-friendly
warning

NumPy's np.transpose() and .T do not copy data. They create a view with different strides. This means column access on a transposed array is actually row access on the original, which is cache-friendly. However, passing a transposed (non-contiguous) array to many NumPy functions forces an implicit copy. Use np.ascontiguousarray() when passing transposed arrays to functions that require contiguous input.
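
A minimal illustration of the contiguity check and the explicit copy described in the warning above:

import numpy as np

arr = np.random.rand(1000, 1000)      # C-contiguous
arr_t = arr.T                         # a view: strides swapped, no copy

print(arr_t.flags['C_CONTIGUOUS'])    # False - many routines will copy implicitly
arr_tc = np.ascontiguousarray(arr_t)  # one explicit copy, now C-contiguous
print(arr_tc.flags['C_CONTIGUOUS'])   # True
print(np.shares_memory(arr, arr_t), np.shares_memory(arr_t, arr_tc))  # True False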

danger

The most common cause of "my model is slow in production but fast in dev" is data layout and cache behavior. Development typically runs with small batches that fit in L3 cache. Production runs with high concurrency, large embedding tables, and diverse inputs that thrash the cache. Always profile with production-like data sizes and concurrency levels. Never trust development benchmarks for production capacity planning.


Common Mistakes

danger

Accessing large arrays in random order

A 1 GB embedding table accessed randomly (as in recommendation model inference) generates nearly 100% L3 cache misses. Every embedding lookup goes to DRAM at ~100ns latency, and because the lookups are data-dependent, the prefetcher cannot hide any of it. With 200 concurrent workers each doing 100 lookups per request, memory latency - not compute - becomes the bottleneck. The fix is either embedding compression (reduce the table to fit in L3), caching hot embeddings, locality-aware sharding, or hardware with faster memory (HBM) for the embedding tables.

warning

Assuming multiprocessing parallelism scales linearly

When multiple processes share a numpy array via multiprocessing.shared_memory or multiprocessing.Array and each process writes to adjacent elements, false sharing can cause 3x to 10x slowdown that looks like a parallel efficiency problem but is actually a cache coherence problem. Always pad shared data structures to 64-byte boundaries when different processes or threads write to different elements.

danger

Allocating Python objects in tight loops

Python float and int objects are heap-allocated. In a tight computation loop, each Python-level numeric result creates a new heap-allocated object (typically 24-28 bytes). This allocator pressure thrashes the allocator's free lists, fragments the heap, and generates unpredictable access patterns that defeat spatial locality. Keep all hot-path computation in NumPy or pre-allocated C arrays. Never iterate over Python objects in performance-critical code.

warning

Forgetting that L3 is shared

On a 96-core server, all 96 cores share the L3 cache. Running many ML inference workers simultaneously means they compete for L3 space. A model with a 100 MB embedding table running on 48 workers effectively gets only 1 MB of L3 per worker. Profile cache behavior at production concurrency levels, not at single-worker levels.


Interview Questions and Answers

Q1: What is a cache line and why is it 64 bytes?

A cache line is the minimum unit of data transfer between cache levels and between DRAM and cache. When any byte in a 64-byte aligned region is accessed, the entire 64-byte block is loaded into cache. The 64-byte size is a hardware engineering tradeoff: larger lines improve spatial locality exploitation (each miss brings more potentially-useful data) but waste cache capacity if the extra data is never used. 64 bytes was standardized across x86 and ARM processors because it matches the natural burst transfer size of DRAM controllers and provides good spatial locality for typical access patterns in compiled code.


Q2: Explain the MESI protocol. What is an invalidation and when does it happen?

MESI (Modified, Exclusive, Shared, Invalid) is a cache coherence protocol that ensures all CPU cores see a consistent view of memory. Each cache line in each core's L1/L2 cache has one of these four states. An invalidation occurs when one core writes to a cache line that other cores have in Shared or Exclusive state - the hardware broadcasts an invalidation message on the interconnect, and all other cores' copies transition to Invalid state. On their next access to that address, they must re-fetch the data either from the modified core (which must write it back first) or from memory. Invalidations are expensive because they cause coherence misses and generate interconnect traffic.


Q3: What is false sharing and how do you fix it?

False sharing occurs when two threads write to different variables that reside on the same 64-byte cache line. Despite no logical data sharing, the hardware coherence protocol treats the cache line as shared. Each write from one core invalidates the line in all other cores' caches. The other cores must re-fetch the line on their next access. This generates a flood of coherence traffic and can slow parallel code by 3x to 10x. The fix is padding: ensure independently-modified variables are placed on separate cache lines by aligning them to 64-byte boundaries and padding each to fill its cache line. In C: alignas(64). In Python/NumPy: space shared-array elements 16 int32 apart (= 64 bytes).


Q4: Why is blocked (tiled) matrix multiplication faster than naive matrix multiplication for large matrices?

In naive matrix multiplication of $N \times N$ matrices, computing $C[i][j]$ requires reading an entire column of $B$. Since $B$ is stored row-major, column access has stride $N$, so each access touches a different cache line. For a large $N$ (say, 2048), one column of $B$ holds only $2048 \times 8 = 16$ KB of data, but consecutive elements are $N \times 8 = 16$ KB apart in memory. This thrashes the cache. Blocked multiplication divides the matrices into tiles small enough to fit in L2 cache. Once loaded, each tile is reused multiple times without fetching from memory again, dramatically reducing cache misses. BLAS implementations (OpenBLAS, MKL) use multi-level blocking: the innermost blocks fit in registers and L1, outer blocks fit in L2, and the outermost loop works within L3.


Q5: What is NUMA and why does it matter for ML training?

NUMA (Non-Uniform Memory Access) describes multi-socket servers where each CPU socket has local DRAM. Local DRAM access costs roughly 80-100 ns; access to remote DRAM on another socket traverses an interconnect (Intel UPI or AMD Infinity Fabric) and is typically 1.5x to 2x slower. For ML training, NUMA matters in several ways. First, PyTorch's DataLoader workers should be pinned to the same NUMA node as the GPU they are feeding, to avoid routing all data transfers through slow cross-socket memory. Second, model parameters loaded into memory should reside on the NUMA node local to the training process. Third, AMD EPYC parts can expose multiple NUMA domains even within a single socket (their chiplet-based CCD/CCX topology), adding another level of locality to manage.


Q6: How does NumPy's row-major layout affect ML workloads?

NumPy uses C-contiguous (row-major) storage by default: element [i][j] is at offset i * ncols + j. Accessing elements along a row (incrementing j) is sequential and cache-friendly - the prefetcher loads ahead efficiently. Accessing elements down a column (incrementing i) jumps by ncols * 8 bytes between elements, generating a cache miss per element for large matrices. In ML workloads, this matters for: (1) feature matrix access patterns (ensure your hot access direction is row-wise), (2) gradient accumulation (ensure the accumulation axis is contiguous), (3) attention score matrices (the pattern of attention head access determines cache behavior). PyTorch's tensors have the same row-major default and the same cache implications.


Practical: Profiling and Measuring Cache Performance

Building a Memory Access Pattern Benchmark

Understanding cache behavior requires direct measurement. The following benchmark systematically varies array stride and size to map the cache hierarchy of your specific machine:

import numpy as np
import time


def cache_hierarchy_probe(max_size_mb: float = 256.0) -> dict:
    """
    Probe the cache hierarchy by measuring access latency as a function
    of working set size. When the working set exceeds a cache level, latency
    jumps sharply.

    This implements a pointer-chasing benchmark: traversal of a random
    single-cycle permutation, which defeats hardware prefetchers and forces
    cache misses. The Python interpreter adds a roughly constant overhead
    per access, so absolute numbers are inflated; the jumps between cache
    levels are what matter.
    """
    results = {}

    # Sizes to test: from 4 KB (fits in L1) up to max_size_mb
    sizes_bytes = []
    kb = 4
    while kb * 1024 <= max_size_mb * 1024 * 1024:
        sizes_bytes.append(kb * 1024)
        kb = int(kb * 1.5) + 1

    for size_bytes in sizes_bytes:
        n_elements = size_bytes // 8  # int64 elements

        # Build a single-cycle random permutation: chain[i] holds the index
        # of the next element to visit. A single cycle guarantees the walk
        # covers the whole working set instead of getting stuck in a short
        # sub-cycle, and the order is unpredictable to the prefetcher.
        order = np.random.permutation(n_elements)
        chain = np.empty(n_elements, dtype=np.int64)
        chain[order[:-1]] = order[1:]
        chain[order[-1]] = order[0]

        # Traverse the chain
        n_accesses = min(1_000_000, n_elements * 10)
        idx = 0
        start = time.perf_counter_ns()
        for _ in range(n_accesses):
            idx = chain[idx]
        elapsed_ns = time.perf_counter_ns() - start

        latency_ns = elapsed_ns / n_accesses
        results[size_bytes] = latency_ns

        size_str = (f"{size_bytes // 1024} KB" if size_bytes < 1024 * 1024
                    else f"{size_bytes // (1024 * 1024)} MB")
        print(f"  {size_str:>8}: {latency_ns:.1f} ns/access")

    return results


def interpret_cache_probe(results: dict) -> None:
    """
    Interpret cache probe results to identify cache level boundaries.
    """
    def fmt(size_bytes: int) -> str:
        return (f"{size_bytes // 1024} KB" if size_bytes < 1024 * 1024
                else f"{size_bytes // (1024 * 1024)} MB")

    sizes = sorted(results.keys())
    latencies = [results[s] for s in sizes]

    # Flag transitions where latency increases significantly (>1.4x)
    print("\nCache level boundaries (estimated):")
    for i in range(1, len(sizes)):
        ratio = latencies[i] / latencies[i - 1]
        if ratio > 1.4:
            print(f"  Transition between {fmt(sizes[i - 1])} and {fmt(sizes[i])} "
                  f"(latency jumped {ratio:.1f}x)")


print("Probing cache hierarchy (this takes ~30 seconds)...")
results = cache_hierarchy_probe(max_size_mb=64.0)
interpret_cache_probe(results)

Cache Line Size Detection

import numpy as np
import time


def detect_cache_line_size() -> int:
    """
    Detect the cache line size empirically by measuring access time
    as stride increases past the cache line boundary.

    Method: Access elements at increasing strides. When stride exceeds
    the cache line size, each access misses a new cache line (full miss cost).
    Below the cache line size, multiple accesses share one miss.
    """
    N = 10_000_000
    arr = np.zeros(N, dtype=np.int32)  # 4 bytes per element

    strides_bytes = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
    times = {}

    for stride_bytes in strides_bytes:
        stride_elements = max(1, stride_bytes // 4)  # stride in int32 elements
        n_accesses = N // stride_elements

        start = time.perf_counter_ns()
        total = 0
        for i in range(0, N, stride_elements):
            total += arr[i]
        elapsed = time.perf_counter_ns() - start

        ns_per_access = elapsed / n_accesses
        times[stride_bytes] = ns_per_access
        print(f"  Stride {stride_bytes:4d} bytes: {ns_per_access:.2f} ns/access")

    # Cache line size is where ns_per_access stops increasing
    # (every access misses a new cache line once stride > line size)
    print("\nCache line size is likely 64 bytes (standard on x86 and ARM)")
    return 64


detect_cache_line_size()

Measuring Write Policy Effects

import numpy as np
import time


def write_vs_read_bandwidth():
    """
    Compare read vs write bandwidth for large arrays.
    Write bandwidth is typically lower than read bandwidth on modern systems
    because writes must allocate cache lines before writing (write-allocate policy).
    """
    # Large enough to exceed L3 cache
    N = 50_000_000  # 400 MB for float64

    arr_read = np.random.rand(N).astype(np.float64)
    arr_write = np.empty(N, dtype=np.float64)

    runs = 3

    # Read bandwidth: sequential sum
    start = time.perf_counter()
    for _ in range(runs):
        total = np.sum(arr_read)
    read_time = (time.perf_counter() - start) / runs
    read_bw = (N * 8) / read_time / 1e9

    # Write bandwidth: sequential fill
    start = time.perf_counter()
    for _ in range(runs):
        arr_write[:] = 1.234
    write_time = (time.perf_counter() - start) / runs
    write_bw = (N * 8) / write_time / 1e9

    # Copy bandwidth: read + write
    arr_dst = np.empty(N, dtype=np.float64)
    start = time.perf_counter()
    for _ in range(runs):
        np.copyto(arr_dst, arr_read)
    copy_time = (time.perf_counter() - start) / runs
    copy_bw = (N * 8 * 2) / copy_time / 1e9  # read + write

    print(f"Read bandwidth:  {read_bw:.1f} GB/s")
    print(f"Write bandwidth: {write_bw:.1f} GB/s")
    print(f"Copy bandwidth:  {copy_bw:.1f} GB/s (effective - reads + writes)")
    print("\nNote: read > write > copy (effective) is typical")
    print("Write-allocate policy: writes fetch cache line before writing")


write_vs_read_bandwidth()

Summary

The memory hierarchy is the most important architectural concept for understanding ML workload performance. The fundamental tradeoff - small and fast vs large and slow - plays out across L1 (1 ns, 32 KB), L2 (4 ns, 512 KB), L3 (10 ns, 32 MB), and DRAM (100 ns, 256+ GB).

Cache lines of 64 bytes define the granularity of all data movement. Spatial locality - accessing data near recently-accessed data - is the mechanism that makes sequential array traversal fast and random access slow. Temporal locality - reusing data before it is evicted - is why matrix blocking dramatically accelerates matrix multiplication.

For ML engineers, the directly actionable takeaways are:

  • Profile with cache miss counters first: perf stat -e LLC-loads,LLC-load-misses tells you immediately whether your bottleneck is compute or memory. An LLC miss rate above 5% is a memory access pattern problem.
  • Use SoA layout for batch operations: Struct of Arrays (separate contiguous arrays per field) is almost always faster than Array of Structs for vectorized ML operations.
  • Block your matrix operations: NumPy's @ operator (BLAS dgemm) does this automatically. Avoid implementing matrix multiply in Python.
  • Avoid false sharing in parallel code: Pad shared counters/accumulators to 64-byte cache line boundaries. This is the most common and most impactful parallelism bug.
  • Embedding tables need special treatment: Large lookup tables exceed L3 cache and generate DRAM-bound access patterns. Consider embedding compression, locality-aware sharding, or hardware with larger cache (AMD V-Cache chips, or GPU HBM).