
GPU Memory Hierarchy Deep Dive

Reading time: ~45 min · Interview relevance: Very High · Target roles: CUDA Developer, Performance Engineer, ML Systems Engineer

Registers: ~1 cycle. L1/shared memory: ~30 cycles. L2 cache: ~200 cycles. HBM: ~400-600 cycles. Getting data into registers and keeping it there is 400-600x faster than fetching from HBM. Every single optimization technique in GPU programming reduces to one thing: managing this hierarchy.

The Transformer That Should Have Been Fast

The model was GPT-scale - 70 billion parameters, 80 layers - sharded comfortably across four A100s in tensor-parallel mode. The team had done everything right. Mixed precision. Pipelined data loading. A custom attention kernel using Flash Attention. Batch size tuned to maximize SM occupancy. On paper, each GPU offered 312 TFLOPS of peak compute.

Achieved TFLOPS in Nsight: 47. Fifteen percent of theoretical peak.

The senior ML infrastructure engineer spent the first two hours suspecting the attention kernel. Custom kernels always contain bugs when performance is unexpectedly low. She profiled, checked register usage, checked shared memory - everything looked plausible. Then she switched Nsight to memory view, and the number appeared immediately: memory bandwidth utilization was at 94%.

The GPU was not compute-bound. It was memory-bound, and nearly saturated. Every clock cycle, the HBM was delivering data as fast as it physically could, and the SMs were still starving. The problem was not the attention kernel. The problem was the feed-forward layers running immediately after attention in each transformer block. Each feed-forward layer was reading the full weight matrix from HBM on every forward pass with no reuse, because the sequence-length-to-model-dimension ratio meant there was no opportunity to tile and keep weights in L2.

The fix required rethinking the entire execution order. They fused the feed-forward layers across adjacent transformer blocks, a technique that allowed weight tiles to stay in L2 cache across the two consecutive matrix multiplications. The change was forty lines of code and two days of validation. Achieved TFLOPS jumped to 189 - four times better, without touching a single weight or changing a single hyperparameter.

This is what memory hierarchy knowledge buys you. Not marginal gains. Not 5% here and there. The difference between a model that runs at 15% efficiency and one that runs at 60% is almost entirely how well the kernel author understands where data lives and how to keep it close to compute.

This lesson teaches that hierarchy from the hardware up. By the end you will know the latency number at every level, the capacity at every level, the bandwidth at every level, and most importantly - the specific programming patterns that exploit each level.

Why This Exists

Early graphics chips had no memory hierarchy at all. Texture units fetched directly from DRAM on every access. This was acceptable when the texture operations were simple and the DRAM latency was low relative to compute time. As shaders became more complex and arithmetic throughput grew faster than DRAM bandwidth, raw DRAM access became the bottleneck.

CPUs had already solved this problem with cache hierarchies - small, fast SRAM close to the processor that held recently used data. But GPU memory hierarchies cannot simply copy the CPU approach, because the memory access patterns are fundamentally different.

A CPU cache is designed around the principle of temporal locality: if you access address X now, you will likely access X again soon. Caches exploit this by keeping recent accesses in fast storage. GPU workloads often violate this assumption. A streaming kernel - an element-wise operation over a multi-gigabyte tensor, say - touches each element exactly once and never again, and even a matrix multiplication reuses its inputs on a schedule the programmer knows in advance, not one a hardware eviction policy must discover. Standard LRU caches provide near-zero benefit for these access patterns.

GPU memory systems are designed around spatial locality and explicit programmer control instead. The L1/shared memory split is the key innovation: half of the fast on-chip memory is a hardware-managed cache (L1), and the other half is explicitly programmer-controlled scratchpad memory (shared memory). The programmer can load exactly the data that will be reused, keep it in shared memory for the duration of the computation, and avoid all cache eviction overhead.

This design choice - explicit scratchpad rather than pure hardware cache - is what makes GPU memory systems fundamentally different from CPU memory systems, and it is why understanding the hierarchy requires understanding not just the latency numbers but the programming model that maps to each level.

Historical Context

The modern GPU memory hierarchy was established with NVIDIA's Fermi architecture in 2010. Before Fermi, GPUs had texture caches and limited constant caches, but no general-purpose L1/L2 hierarchy accessible from compute kernels. The G80 (2006) and GT200 (2008) architectures had shared memory but no L1 cache for global memory accesses - every access to global memory went straight to GDDR DRAM.

Fermi introduced a configurable L1/shared memory split: 16KB L1 + 48KB shared, or 48KB L1 + 16KB shared. This was a significant change because it meant accesses to global memory could now be cached, making some previously inefficient memory access patterns suddenly viable.

The Kepler architecture (2012) pulled back on L1, routing most global loads through a read-only data path and reserving L1 primarily for register spills and local memory. This was a deliberate choice to route more die area toward register files and shared memory, which NVIDIA's workload analysis showed were more valuable for the GEMM-dominated workloads HPC users cared about.

Maxwell (2014) and Pascal (2016) continued refining the balance. Volta (2017) introduced the modern layout that persists through Ampere and Hopper: independent L1 per SM with configurable shared memory allocation, large unified L2, and HBM2 off-chip memory connected via a wide memory subsystem.

The Hopper H100 (2022) represents the current peak: 256KB L1/shared per SM (configurable), 50MB L2, 3.35 TB/s HBM3 bandwidth. These numbers will appear throughout this lesson as the reference hardware.

The Memory Hierarchy - Level by Level

Level 0: Registers

Registers are the fastest memory in the system. When a CUDA thread performs an arithmetic operation, its operands must be in registers. There is no other path to the ALU.

Capacity: Each SM on the H100 has 65,536 32-bit (4-byte) registers. With 2,048 threads maximum per SM, this averages to 32 registers per thread - but the actual allocation depends on your kernel's compiled register usage. If your kernel uses 64 registers per thread, you can only fit 1,024 threads per SM (occupancy limited by registers).

Latency: Effectively zero. Accessing a register takes 1 clock cycle - it is simply a read from a flip-flop inside the arithmetic unit. There is no pipeline stage between the register file and the ALU.

Bandwidth: Essentially unconstrained within a thread. Each thread can read and write multiple registers per clock cycle as needed by the instruction pipeline.

The key constraint: Registers are private to each thread. Thread 0 cannot read thread 1's registers. With the exception of warp-level shuffle intrinsics (__shfl_sync and friends), which exchange register values between threads of the same warp, data shared between threads must pass through shared memory, not registers. Reduction operations, prefix scans, and transpose operations all rely on shared memory (or, within a warp, shuffles) as the communication mechanism.
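A minimal sketch of the intra-warp case: a shuffle-based warp reduction moves values register-to-register with no shared memory at all, assuming all 32 lanes of the warp are active.

// Intra-warp sum using __shfl_down_sync: values move directly between
// the registers of threads in the same warp (assumes 32 active lanes).
__device__ float warpReduceSum(float val) {
    // Each step halves the number of participating lanes:
    // lane i adds the value currently held by lane i + offset.
    for (int offset = 16; offset > 0; offset /= 2) {
        val += __shfl_down_sync(0xffffffff, val, offset);
    }
    return val; // lane 0 ends up holding the full warp sum
}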

Register spilling: When a kernel uses more registers than are available per thread at the desired occupancy, the compiler spills registers to local memory - a region in L2/HBM that is logically local to the thread but physically off-chip. Register spills destroy performance because what was a 1-cycle register access becomes a 200-600 cycle memory access. The CUDA compiler flag --maxrregcount=N limits register usage per thread, forcing spilling if necessary - sometimes the occupancy gain from fitting more warps outweighs the spill cost, but this must be measured.

// Check register usage of a compiled kernel
// nvcc --ptxas-options=-v mykernel.cu
// Output: ptxas info: Used N registers, X bytes smem, Y bytes cmem[0]

// Force a register limit (may cause spilling)
// __launch_bounds__(maxThreadsPerBlock, minBlocksPerSM)
// This tells the compiler: 256 threads/block, want at least 2 blocks per SM.
// The compiler constrains register usage to satisfy this.
__global__ __launch_bounds__(256, 2) void myKernel(float* data) {
    // kernel body
}
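As a sanity check on the register/occupancy tradeoff, the CUDA occupancy API reports how many blocks of a compiled kernel fit per SM at a given block size. A short host-side sketch, reusing the myKernel symbol above:

// Query achievable occupancy for myKernel at 256 threads/block (host code).
int numBlocks = 0;
int blockSize = 256;
size_t dynamicSmemBytes = 0; // no dynamic shared memory in this kernel
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
    &numBlocks, myKernel, blockSize, dynamicSmemBytes);
printf("Active blocks per SM: %d (%d threads per SM)\n",
       numBlocks, numBlocks * blockSize);

If the reported thread count is well below the SM's 2,048-thread maximum, register usage (or shared memory) is the limiter.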

Level 1: L1 Cache and Shared Memory

L1 and shared memory occupy the same physical SRAM on-chip. The partition between them is configurable per kernel.

Capacity on H100: 256KB total per SM, configurable as:

  • 0KB shared + 256KB L1 (cache-heavy workloads)
  • 100KB shared + 156KB L1 (balanced)
  • 164KB shared + 92KB L1 (compute-heavy, lots of explicit tiling)
  • 228KB shared + 28KB L1 (maximum shared memory for tightly tiled kernels)

In practice most production kernels use between 48KB and 128KB of shared memory per SM, leaving the remainder as L1.
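A sketch of how a kernel opts into a particular split at runtime (host code; myKernel stands in for any __global__ function). The carveout is a hint expressed as a percentage of the combined SRAM, and using more than the default 48KB of dynamic shared memory additionally requires the max-size attribute:

// Hint: prefer ~75% of the combined SRAM as shared memory for this kernel.
cudaFuncSetAttribute(myKernel,
    cudaFuncAttributePreferredSharedMemoryCarveout, 75);

// Opt in to more than the default 48KB of dynamic shared memory per block.
cudaFuncSetAttribute(myKernel,
    cudaFuncAttributeMaxDynamicSharedMemorySize, 164 * 1024);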

Latency: ~20-32 cycles for shared memory. ~28-32 cycles for L1 cache hits. These numbers are nearly identical because they share the same physical SRAM - the difference is only in address translation overhead.

Bandwidth: ~19 TB/s aggregate across all SMs on H100 (132 SMs on the SXM5 part x ~150 GB/s per SM). This is roughly 5x the off-chip HBM bandwidth. Keeping computation on-chip leverages this.

L1 cache behavior: L1 caches GPU global memory accesses. Cache lines are 128 bytes. When a warp (32 threads) accesses 32 consecutive floats (128 bytes), it generates one 128-byte cache line load - perfectly coalesced. When a warp accesses 32 scattered addresses, it generates up to 32 separate cache line loads - worst-case uncoalesced access.

Shared memory banks: Shared memory is organized into 32 banks, each 4 bytes wide. Threads in a warp can access different banks simultaneously with no conflict. If two threads access the same bank (but different addresses), a bank conflict occurs and the accesses serialize. Understanding bank layout is critical for matrix transpose and other patterns.

// Shared memory declaration
__global__ void tiledMatMul(const float* A, const float* B, float* C,
                            int N, int TILE_SIZE) {
    // Assumes blockDim == (TILE_SIZE, TILE_SIZE) and N % TILE_SIZE == 0.
    // Launch with dynamic shared memory: 2 * TILE_SIZE * TILE_SIZE * sizeof(float).
    extern __shared__ float tile[];
    float* tileA = tile;
    float* tileB = tile + TILE_SIZE * TILE_SIZE;

    int row = blockIdx.y * TILE_SIZE + threadIdx.y;
    int col = blockIdx.x * TILE_SIZE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < N / TILE_SIZE; t++) {
        // Load tiles into shared memory (coalesced access pattern)
        tileA[threadIdx.y * TILE_SIZE + threadIdx.x] =
            A[row * N + (t * TILE_SIZE + threadIdx.x)];
        tileB[threadIdx.y * TILE_SIZE + threadIdx.x] =
            B[(t * TILE_SIZE + threadIdx.y) * N + col];

        // Synchronize before using shared memory
        __syncthreads();

        // Compute using shared memory (fast, ~30 cycles vs 400-600 cycles)
        for (int k = 0; k < TILE_SIZE; k++) {
            sum += tileA[threadIdx.y * TILE_SIZE + k] *
                   tileB[k * TILE_SIZE + threadIdx.x];
        }

        // Synchronize before the next tile load
        __syncthreads();
    }

    if (row < N && col < N) C[row * N + col] = sum;
}
# Set shared memory configuration from Python (PyCUDA)
import pycuda.autoinit
import pycuda.driver as cuda
from pycuda.compiler import SourceModule

# Module compiled from the CUDA source containing tiledMatMul
# ("tiled_matmul.cu" is a placeholder filename)
module = SourceModule(open("tiled_matmul.cu").read())

# Allow the kernel to use up to 164KB of dynamic shared memory
func = module.get_function("tiledMatMul")
func.set_attribute(cuda.function_attribute.MAX_DYNAMIC_SHARED_SIZE_BYTES, 164 * 1024)

Level 2: L2 Cache

L2 is a unified cache shared across all SMs on the chip. Unlike L1 which is per-SM, L2 is a central resource. On H100, L2 is 50MB - large enough to hold many working sets.

Capacity: 50MB on H100. 40MB on A100. 6MB on V100 (the jump from V100 to A100 was one of the key improvements for transformer workloads).

Latency: ~150-250 cycles. Approximately 5-8x slower than L1, but still 2-3x faster than HBM.

Bandwidth: ~12 TB/s on H100. Lower than L1 aggregate bandwidth but still substantial.

What fits in L2: A 50MB L2 can hold:

  • A 3,500 x 3,500 matrix of float32 values (49MB)
  • Weight matrices for small to medium transformer layers
  • KV-cache for short sequences in LLM inference

L2 residency hint: CUDA provides a mechanism to pin data in L2 for repeated access patterns - useful for weight matrices during inference when the same weights are used across many token generations.

// Pin data in L2 cache (Ampere/Hopper - sm_80+)
// Useful for LLM inference where weight matrices are reused across steps.
// Note: the window size is capped by cudaDeviceProp::accessPolicyMaxWindowSize,
// and persisting capacity is set aside via
// cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, ...).
cudaStreamAttrValue stream_attribute;
stream_attribute.accessPolicyWindow.base_ptr  = weight_matrix_ptr;
stream_attribute.accessPolicyWindow.num_bytes = weight_matrix_bytes;
stream_attribute.accessPolicyWindow.hitRatio  = 1.0f; // try to keep 100% in L2
stream_attribute.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;
stream_attribute.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;

cudaStreamSetAttribute(stream,
                       cudaStreamAttributeAccessPolicyWindow,
                       &stream_attribute);
# Python: PyTorch exposes no public API for L2 access policy windows.
# Setting one requires calling the driver/runtime directly - e.g. via
# NVIDIA's cuda-python bindings or ctypes - using the raw stream handle.
import torch

stream = torch.cuda.current_stream()
stream_handle = stream.cuda_stream  # raw CUstream, usable with cuda-python/ctypes

Level 3: HBM (High Bandwidth Memory)

HBM is the main GPU memory - the "GPU RAM" that shows up as "80GB" or "40GB" in hardware specs. It is physically separate from the GPU die, connected through a silicon interposer via a very wide interface (1,024 bits per HBM stack; the H100 uses five stacks, 5,120 bits total). More detail on the physical construction follows in the next lesson.

Capacity: 80GB on H100 SXM5. 40GB on A100 40GB variant. 24GB on RTX 4090 (GDDR6X, not HBM).

Latency: ~400-600 cycles, depending on memory controller queue depth and access pattern.

Bandwidth: 3.35 TB/s on H100 SXM5. 2.0 TB/s on A100 80GB (1.55 TB/s on the 40GB variant). 1.0 TB/s on RTX 4090 (GDDR6X).

What lives in HBM:

  • All tensors allocated with torch.zeros(...).cuda() or cudaMalloc()
  • Model weights during inference and training
  • Activations for the full forward pass (unless using activation checkpointing)
  • Optimizer states (Adam: 2x the model size in fp32 states)
  • Gradient buffers during backward pass

Coalesced vs uncoalesced access: When 32 threads in a warp access consecutive addresses, the memory controller services the request in one 128-byte transaction. When they access scattered addresses, up to 32 separate transactions are needed - 32x the bandwidth consumption for the same amount of useful data. Coalesced access is the single most important memory optimization for HBM.

// GOOD: Coalesced access - threads access consecutive addresses
// Thread 0 accesses A[0], Thread 1 accesses A[1], ..., Thread 31 accesses A[31]
__global__ void coalescedKernel(float* A, float* B, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) B[idx] = A[idx] * 2.0f;
}

// BAD: Strided access - threads access every 32nd element
// Thread 0 accesses A[0], Thread 1 accesses A[32], Thread 2 accesses A[64]...
// Each access lands in a different 128-byte cache line - 32x bandwidth overhead
__global__ void stridedKernel(float* A, float* B, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx * 32 < N) B[idx * 32] = A[idx * 32] * 2.0f;
}

Level 4: PCIe / Host DRAM

Technically not part of the GPU memory hierarchy, but the PCIe bus to host DRAM is the final level engineers must reason about. Model loading, data pipeline feeding, and gradient communication all touch this level.

PCIe 4.0 x16: ~32 GB/s per direction. This is roughly 100x slower than HBM bandwidth. PCIe 5.0 x16: ~64 GB/s per direction. NVLink (GPU-to-GPU): 900 GB/s on H100 - 28x PCIe 4.0 bandwidth.

The PCIe bottleneck is why data pipeline optimization matters enormously. A model that takes 50ms to run but requires 500ms to load inputs is IO-bound regardless of how fast the GPU is. Overlapping data transfer with compute via CUDA streams is the standard solution.
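A minimal sketch of that overlap pattern with pinned memory and two streams (CHUNK_BYTES, CHUNK_FLOATS, nChunks, processBatch, grid, and block are placeholders, not CUDA APIs):

// Overlap host-to-device copies with compute on alternating streams.
float* h_buf;
cudaMallocHost(&h_buf, nChunks * CHUNK_BYTES); // pinned memory - required for true async copies
float* d_buf[2];
cudaMalloc(&d_buf[0], CHUNK_BYTES);
cudaMalloc(&d_buf[1], CHUNK_BYTES);
cudaStream_t streams[2];
cudaStreamCreate(&streams[0]);
cudaStreamCreate(&streams[1]);

for (int i = 0; i < nChunks; i++) {
    int s = i % 2;
    // Copy chunk i on stream s while the previous chunk's kernel
    // runs on the other stream.
    cudaMemcpyAsync(d_buf[s], h_buf + (size_t)i * CHUNK_FLOATS,
                    CHUNK_BYTES, cudaMemcpyHostToDevice, streams[s]);
    processBatch<<<grid, block, 0, streams[s]>>>(d_buf[s]);
}
cudaDeviceSynchronize();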

The Full Hierarchy in Numbers

| Level | Capacity (H100) | Latency | Bandwidth | Scope |
|---|---|---|---|---|
| Registers | 256KB/SM (65,536 x 32-bit) | ~1 cycle | Effectively unlimited (per-thread) | Per-thread |
| L1/Shared Memory | 256KB/SM | ~20-32 cycles | ~19 TB/s aggregate | Per-SM |
| L2 Cache | 50MB | ~150-250 cycles | ~12 TB/s | Whole GPU |
| HBM3 | 80GB | ~400-600 cycles | 3.35 TB/s | Whole GPU |
| PCIe 4.0 | Host DRAM (TBs) | ~10,000+ cycles | 32 GB/s | Host-GPU |

How the Hierarchy Shapes Kernel Design

The Arithmetic Intensity Lens

Every kernel can be characterized by its arithmetic intensity: the ratio of floating-point operations to bytes transferred from HBM.

$$\text{arithmetic intensity} = \frac{\text{FLOP}}{\text{bytes from HBM}}$$

A kernel with arithmetic intensity above the roofline threshold (discussed in the next module) is compute-bound - adding bandwidth does not help, because the ALUs are the bottleneck. A kernel below the threshold is memory-bound - adding more compute does not help, but reducing memory traffic or adding bandwidth does.

Matrix multiplication (GEMM): For an $M \times N \times K$ multiplication, arithmetic is $2MNK$ FLOP and memory transfer is $(MK + KN + MN) \times 4$ bytes in fp32. For large square matrices with $M = N = K = 1024$:

  • FLOP: $2 \times 1024^3 \approx 2.1 \times 10^9$
  • Bytes: $(3 \times 1024^2) \times 4 \approx 12.6 \times 10^6$
  • Arithmetic intensity: $\approx 167$ FLOP/byte

This is why GEMM is compute-bound on modern GPUs - you do many operations per byte loaded, so compute is the bottleneck, not bandwidth.
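A quick back-of-envelope check of that claim, assuming H100-class peaks (312 TFLOP/s and 3.35 TB/s, giving a ridge point of ~93 FLOP/byte; the exact N=1024 ratio is ~170.7, which the rounded figures above report as ≈167):

#include <cstdio>

int main() {
    const double ridge = 312e12 / 3.35e12; // ~93 FLOP/byte on H100
    for (long n : {256L, 1024L, 4096L}) {
        double flop  = 2.0 * n * n * n;    // 2*M*N*K for a square GEMM
        double bytes = 3.0 * n * n * 4.0;  // A, B, C touched once each in fp32
        double ai = flop / bytes;
        printf("N=%5ld  AI=%7.1f FLOP/byte -> %s\n", n, ai,
               ai > ridge ? "compute-bound" : "memory-bound");
    }
    return 0;
}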

Element-wise operations: For a ReLU applied to a tensor of one billion float32 elements (4GB):

  • FLOP: ~1 billion (one max operation per float)
  • Bytes: 8GB (4GB read + 4GB write)
  • Arithmetic intensity: ~0.125 FLOP/byte

This is heavily memory-bound. The GPU ALUs are mostly idle, waiting for data to arrive from HBM. This is why standalone ReLU and normalization kernels are poor uses of GPU time, and why kernel fusion (combining multiple elementwise ops into one kernel) is so valuable.

Memory Access Pattern Archetypes

Understanding how each layer type in a neural network maps to HBM access patterns lets you predict before profiling whether you have a memory problem.

Convolution: Exhibits high reuse of filters. Filter weights are small and hot in L2. Activations stream through. Compute-bound for large feature maps.

Attention (naive): Quadratic memory access with sequence length. For sequence length $S$, the attention matrix is $S \times S$ and must be read/written from HBM - $O(S^2)$ memory traffic. Flash Attention eliminates this by tiling and keeping the attention computation in shared memory.

Feed-forward (FFN) layers: Linear projections (GEMM) followed by activation. Compute-bound for large batch sizes. Memory-bound for small batch inference (batch=1 or 2) because weight matrices must be loaded from HBM but are used for only a few multiply-accumulate operations per weight element.

Layer normalization: Requires computing mean and variance across the hidden dimension - a reduction. Heavily memory-bound. Fusing with the preceding or following linear layer significantly improves efficiency.


Practical: Profiling the Memory Hierarchy

Using cuda.mem_get_info()

The simplest memory query - how much HBM is free right now.

import torch
import pycuda.driver as cuda

cuda.init()
device = cuda.Device(0)
ctx = device.make_context()

# Query free and total HBM memory
free_mem, total_mem = cuda.mem_get_info()
print(f"Free HBM: {free_mem / 1e9:.2f} GB")
print(f"Total HBM: {total_mem / 1e9:.2f} GB")
print(f"Used HBM: {(total_mem - free_mem) / 1e9:.2f} GB")

ctx.pop()

# PyTorch equivalent
free, total = torch.cuda.mem_get_info()
print(f"Free: {free / 1e9:.2f} GB, Total: {total / 1e9:.2f} GB")

# Detailed PyTorch memory stats
stats = torch.cuda.memory_stats()
print(f"Active memory: {stats['active_bytes.all.current'] / 1e9:.2f} GB")
print(f"Reserved memory: {stats['reserved_bytes.all.current'] / 1e9:.2f} GB")
print(f"Peak active: {stats['active_bytes.all.peak'] / 1e9:.2f} GB")

Writing a Simple Bandwidth Benchmark

Before profiling your model, establish what peak bandwidth your GPU actually delivers. Theoretical specs are rarely achieved.

import torch

def measure_memory_bandwidth(size_gb=1.0, n_trials=20):
    """
    Measure achieved HBM bandwidth using a simple copy kernel.
    Peak theoretical H100: 3.35 TB/s.
    Typically achieves 80-90% of peak = 2.68-3.0 TB/s.
    """
    n_floats = int(size_gb * 1e9 / 4)  # float32 = 4 bytes
    src = torch.ones(n_floats, dtype=torch.float32, device='cuda')
    dst = torch.empty_like(src)

    # Warmup
    for _ in range(5):
        dst.copy_(src)
    torch.cuda.synchronize()

    # Measure
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    start.record()
    for _ in range(n_trials):
        dst.copy_(src)
    end.record()
    torch.cuda.synchronize()

    elapsed_ms = start.elapsed_time(end)
    elapsed_s = elapsed_ms / 1000.0

    # Each copy reads src (size_gb GB) and writes dst (size_gb GB)
    bytes_transferred = 2 * size_gb * 1e9 * n_trials
    bandwidth_tbs = bytes_transferred / elapsed_s / 1e12

    print(f"Buffer size: {size_gb} GB")
    print(f"Trials: {n_trials}")
    print(f"Elapsed: {elapsed_ms:.1f} ms total")
    print(f"Achieved bandwidth: {bandwidth_tbs:.3f} TB/s")
    return bandwidth_tbs

bw = measure_memory_bandwidth(size_gb=2.0)

Nsight Compute - Key Metrics to Watch

When profiling with Nsight Compute (ncu --metrics ...), these are the memory hierarchy metrics that matter most:

# Profile a kernel with full memory metrics
ncu --metrics \
l1tex__t_bytes.sum.per_second,\
lts__t_bytes.sum.per_second,\
dram__bytes.sum.per_second,\
sm__warps_active.avg.pct_of_peak_sustained_active,\
l1tex__data_bank_conflicts_pipe_lsu_mem_shared.sum \
python train.py

# Simpler: use the Memory Workload Analysis section
ncu --section MemoryWorkloadAnalysis python train.py

Key metrics explained:

  • l1tex__t_bytes.sum.per_second: L1 texture/cache throughput. Compare to 19 TB/s max.
  • lts__t_bytes.sum.per_second: L2 throughput. Compare to 12 TB/s max.
  • dram__bytes.sum.per_second: HBM throughput. Compare to 3.35 TB/s max.
  • l1tex__data_bank_conflicts_pipe_lsu_mem_shared.sum: Shared memory bank conflicts. Should be 0 for well-written kernels.
# Detect memory bottlenecks programmatically using the PyTorch profiler
import torch
from torch.profiler import profile, record_function, ProfilerActivity

model = MyTransformerModel().cuda()  # stand-in for your model
inputs = torch.randn(8, 512, 1024, device='cuda')

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    with_stack=True,
    profile_memory=True
) as prof:
    with record_function("forward"):
        out = model(inputs)

# Print memory summary
print(prof.key_averages().table(
    sort_by="self_cuda_memory_usage",
    row_limit=20
))

# Export to Chrome trace
prof.export_chrome_trace("trace.json")

Detecting Shared Memory Bank Conflicts

Bank conflicts serialize memory accesses and can cut shared memory throughput by 2x-32x. Here is the access pattern that causes them and how to fix it.

// BAD: Column access into a row-major shared tile causes bank conflicts.
// Assumes a 32x32 tile and blockDim.x == 32, so a warp spans tx = 0..31.
__global__ void badTransposedAccess() {
    __shared__ float tile[32][32];
    int tx = threadIdx.x; // [0, 31] - varies within a warp
    int ty = threadIdx.y; // fixed within a warp

    // Row access: addresses ty*32 + tx map to banks 0..31 - no conflict.
    float val = tile[ty][tx];  // OK

    // Column access: addresses tx*32 + ty all map to bank ty % 32.
    // All 32 threads hit the SAME bank - 32-way conflict, fully serialized.
    float val2 = tile[tx][ty]; // BAD
}

// FIX: Pad the tile by 1 column to shift each row's bank alignment
__shared__ float tile[32][33]; // 33 instead of 32 - one padding column
// Column access now hits different banks:
// tile[0][ty] is in bank ty % 32
// tile[1][ty] is in bank (33 + ty) % 32 = (1 + ty) % 32 - DIFFERENT

Data Flow Through the Hierarchy - A Worked Example

Let's trace exactly what happens to a float32 value as it participates in a tiled matrix multiplication on an H100.

The operation: C[row][col] += A[row][k] * B[k][col]

Step 1: Loads (on Hopper, optionally the TMA engine) bring a tile of A from HBM into L2
- 128x128 floats = 64KB
- Latency: ~400 cycles for first cache line, pipelined after
- HBM bandwidth consumed: 64KB per tile per block

Step 2: L2 to L1 transfer on SM cache miss
- 64KB tile fills L1 of executing SM
- Latency: ~200 cycles (L2 hit from step 1 onward)
- L2 bandwidth consumed: 64KB

Step 3: Kernel loads A tile from L1/global into shared memory
__syncthreads() ensures all threads have loaded before compute
- Shared memory now holds the 64KB tile
- Latency: ~30 cycles per access from shared mem

Step 4: Inner loop reads A values from shared memory into registers
sum += tileA[ty][k] * tileB[k][tx]
- Register access: 1 cycle
- tileA[ty][k] accessed 128 times (once per k iteration)
- Per inner-loop iteration: 2 shared-memory reads feed 1 FMA on register operands

Step 5: Final sum written back to C in HBM
- 1 float per thread = 4 bytes
- Coalesced write: 32 threads x 4 bytes = 128-byte cache line
- Hits L2 write buffer, later flushed to HBM

This trace shows the key insight: each value in A is loaded from HBM once (step 1-2) but used 128 times in step 4. The tile size determines how much reuse we get. Larger tiles mean fewer HBM reads and more compute per byte - higher arithmetic intensity.

Memory Hierarchy in the Context of LLM Inference

LLM inference has become the defining workload for GPU memory systems. Understanding memory hierarchy behavior during inference explains why model serving is often harder than training.

The decode phase memory problem: During token generation (autoregressive decode), each step generates exactly one token. The batch dimension is effectively 1 (or small). For a single-token forward pass through a layer with hidden dimension H=4096 and FFN expansion 4x:

  • Weight matrix size: 4096 x 16384 x 4 bytes = 256MB
  • Useful compute: 2 x 4096 x 16384 = 134M FLOP
  • Arithmetic intensity: 134M / 256M = 0.52 FLOP/byte

This is extremely memory-bound. At H100's 3.35 TB/s, the theoretical time to load these weights is 256MB / 3.35TB/s = 76 microseconds. At 312 TFLOP/s, the compute would take 134M / 312T = 0.4 microseconds. We are 190x more bandwidth-limited than compute-limited.

This means for LLM decode: every optimization that reduces memory bandwidth directly improves throughput. This is why quantization (W4A16: reducing weights to 4-bit) is so impactful for inference - it cuts the weight-loading bandwidth by 4x relative to fp16 weights (8x relative to fp32).

# Demonstrate the decode vs prefill performance cliff
import torch

# Simulate a linear layer (like an FFN projection in a transformer)
weight = torch.randn(4096, 16384, device='cuda', dtype=torch.float16)

def benchmark_gemm(batch_size, n_trials=100):
    x = torch.randn(batch_size, 4096, device='cuda', dtype=torch.float16)

    # Warmup
    for _ in range(10):
        _ = x @ weight
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(n_trials):
        _ = x @ weight
    end.record()
    torch.cuda.synchronize()

    elapsed_ms = start.elapsed_time(end) / n_trials
    flop = 2 * batch_size * 4096 * 16384
    tflops = flop / (elapsed_ms / 1000) / 1e12
    bytes_read = 4096 * 16384 * 2  # weight in fp16
    bandwidth = bytes_read / (elapsed_ms / 1000) / 1e12

    print(f"Batch={batch_size:4d}: {elapsed_ms:.3f}ms, "
          f"{tflops:.2f} TFLOP/s, "
          f"{bandwidth:.2f} TB/s effective bandwidth")

print("Batch size vs GEMM efficiency (weight: 4096x16384 fp16):")
for bs in [1, 2, 4, 8, 16, 32, 64, 128, 256]:
    benchmark_gemm(bs)


Warp-Level Memory Patterns

Understanding how memory access patterns at the warp level interact with the cache hierarchy is the bridge between the hardware descriptions above and practical kernel design.

Coalescing Across the Warp

A warp is 32 threads that execute in lockstep. When all 32 threads issue a load instruction simultaneously, the memory subsystem combines them into the fewest possible cache line transactions. The rule: threads must access addresses within the same 128-byte aligned chunk for the loads to be coalesced into a single transaction.

// Pattern 1: Perfect coalescing
// Thread i accesses address base + i * 4 (float, stride 1).
// All 32 threads hit the same 128-byte cache line: 1 transaction.
float val = data[threadIdx.x + blockIdx.x * 32];

// Pattern 2: Stride-2 access
// Thread 0: base+0, Thread 1: base+8, Thread 2: base+16...
// Threads 0-15 hit one 128-byte cache line, threads 16-31 hit another:
// 2 transactions for useful data scattered across 256 bytes - 50% efficiency.
float val = data[(threadIdx.x + blockIdx.x * 32) * 2];

// Pattern 3: Array-of-structs vs struct-of-arrays
// AoS (cache-unfriendly for partial struct access):
struct Particle { float x, y, z, w; };
Particle* particles = ...;
// Reading only x values: stride = 4 floats = 16 bytes.
// The warp touches 512 bytes across 4 cache lines for 128 useful bytes - 25% efficiency.
float x = particles[threadIdx.x].x; // BAD for streaming x values

// SoA (cache-friendly):
float* xs = ...; float* ys = ...; float* zs = ...;
// Thread 0 reads xs[0], Thread 1 reads xs[1]... all in 1 cache line.
float x = xs[threadIdx.x]; // GOOD

The Warp Stall Problem

When a warp issues a load from HBM, it cannot proceed until the data arrives (400-600 cycles later). During this stall, the SM switches to another ready warp. If there are no ready warps (all are stalled on memory), the SM goes idle - this shows up in Nsight as "Warp State: Stall - Long Scoreboard."

The solution is having enough warps in flight so that some are always ready while others are stalled. This is the fundamental role of occupancy in latency hiding.

A rough calculation: HBM latency is ~500 cycles. If a kernel issues one memory access every 4 cycles (memory-bound), you need 500/4 = 125 warps in flight to hide all latency. An H100 SM holds at most 64 warps. So a single SM cannot fully hide HBM latency for a simple streaming kernel - but multiple outstanding requests and the prefetch queue in the L2/HBM controller help in practice.
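A sketch of that calculation against the device's actual warp capacity (the 500-cycle latency and 4-cycle issue interval are the assumptions from the paragraph above, not queried values):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int maxThreadsPerSM = 0;
    cudaDeviceGetAttribute(&maxThreadsPerSM,
        cudaDevAttrMaxThreadsPerMultiProcessor, /*device=*/0);
    int maxWarpsPerSM = maxThreadsPerSM / 32; // 64 on H100

    const int hbmLatencyCycles = 500;   // assumed HBM latency
    const int issueIntervalCycles = 4;  // assumed cycles between loads
    int warpsNeeded = hbmLatencyCycles / issueIntervalCycles; // 125

    printf("Warps needed to hide HBM latency: %d, SM capacity: %d warps\n",
           warpsNeeded, maxWarpsPerSM);
    return 0;
}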

# Detecting warp stalls with the PyTorch profiler
import torch
from torch.profiler import profile, ProfilerActivity

a = torch.randn(4096, 4096, device='cuda')
b = torch.randn(4096, 4096, device='cuda')

# Run with CUDA activity profiling
with profile(activities=[ProfilerActivity.CUDA]) as prof:
    result = torch.matmul(a, b)  # your kernel here

# Look for long kernel durations relative to FLOP count
print(prof.key_averages().table(sort_by="cuda_time_total"))

Memory Access Patterns for Common Neural Network Operations

The table below summarizes how different layer types access memory, their typical arithmetic intensity, and the dominant bottleneck.

| Operation | Access Pattern | Arithmetic Intensity | Bottleneck |
|---|---|---|---|
| GEMM (large batch) | Coalesced reads, blocked tiles | 50-200 FLOP/byte | Compute |
| GEMM (decode, batch=1) | Streaming weight reads | 0.5-2 FLOP/byte | Bandwidth |
| Element-wise (ReLU, etc.) | Streaming reads + writes | 0.25 FLOP/byte | Bandwidth |
| Layer Norm | Two-pass reduction | 1-2 FLOP/byte | Bandwidth |
| Softmax | Row-wise max then exp | 2-4 FLOP/byte | Bandwidth |
| Attention (naive) | Quadratic random access | 1-3 FLOP/byte | Bandwidth |
| Flash Attention | Blocked, SRAM-resident | 10-30 FLOP/byte (effective) | Compute |
| Convolution (large feature maps) | Sliding window, high reuse | 20-100 FLOP/byte | Compute |

This table explains why fusing element-wise operations (ReLU, layer norm, residual add) into the preceding GEMM is so valuable. Standalone, each elementwise op loads the activation tensor from HBM, applies a trivial computation, and writes it back. Fused, the activations flow from register file to the elementwise op with no HBM round-trip at all.

Kernel Fusion and Its Effect on the Memory Hierarchy

Kernel fusion is the technique of combining multiple sequential operations into a single CUDA kernel, eliminating intermediate HBM round-trips.

Example: unfused vs fused normalization

Unfused path:
Kernel 1: GEMM - reads weights (HBM), writes activations (HBM)
Kernel 2: Bias add - reads activations (HBM), writes activations (HBM)
Kernel 3: Layer norm (pass 1: mean) - reads activations (HBM)
Kernel 4: Layer norm (pass 2: normalize) - reads activations (HBM), writes (HBM)
Kernel 5: GeLU - reads activations (HBM), writes activations (HBM)

Total HBM reads of the activation tensor: 4x
Total HBM writes of the activation tensor: 4x

Fused path (e.g., in FlashAttention-style kernel):
Single kernel: GEMM + bias + layer norm + GeLU
Activation tensor stays in registers/shared memory throughout

Total HBM reads of the activation tensor: 0 (intermediate)
Total HBM writes of the activation tensor: 1 (final output only)

Bandwidth saved: seven tensor-sized HBM transfers (eight down to one)

For an activation tensor with hidden dim 4096, sequence length 512, batch 8 in fp16 (16.8M elements, ~34MB): fusing eliminates roughly 7 x 34MB ≈ 235MB of HBM traffic per transformer layer. At 3.35 TB/s, that is ~70 microseconds saved per layer, or ~5.6ms per forward pass for an 80-layer model - a substantial improvement for any latency-sensitive use case.
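A minimal sketch of the fusion idea at the elementwise end: bias add and GeLU in one kernel, so the activation makes one HBM round-trip instead of two (tanh-approximation GeLU; the in-place layout is an assumption for illustration):

// Fused bias add + GeLU: the intermediate (x + bias) lives only in a
// register, never in HBM. x is updated in place - one read, one write.
__global__ void fusedBiasGelu(float* x, const float* bias, int rows, int cols) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < rows * cols) {
        float v = x[idx] + bias[idx % cols]; // bias add, in registers
        float inner = 0.7978845608f * (v + 0.044715f * v * v * v);
        x[idx] = 0.5f * v * (1.0f + tanhf(inner)); // GeLU, written once
    }
}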

Production Engineering Notes

Memory Fragmentation

The CUDA memory allocator (cudaMalloc) suffers from fragmentation over time in long-running processes. An 80GB GPU might report 20GB free but be unable to allocate a 10GB contiguous block because free memory is fragmented into many small pieces.

PyTorch's caching allocator mitigates this by pooling allocations. torch.cuda.empty_cache() returns cached (but released) memory back to CUDA, allowing other processes to use it - but does not reduce fragmentation within the same process.

# Memory management for long-running inference servers
import torch
import gc

def run_inference_with_memory_management(model, batch):
    try:
        output = model(batch)
        return output
    except torch.cuda.OutOfMemoryError:
        # Clear cache and retry
        torch.cuda.empty_cache()
        gc.collect()
        try:
            output = model(batch)
            return output
        except torch.cuda.OutOfMemoryError:
            # Log and fail gracefully
            raise RuntimeError("OOM even after cache clear - need smaller batch")
    finally:
        # Aggressive cleanup between batches in low-memory situations
        torch.cuda.empty_cache()

# Monitor fragmentation
def log_memory_fragmentation():
    stats = torch.cuda.memory_stats()
    active = stats['active_bytes.all.current']
    reserved = stats['reserved_bytes.all.current']
    fragmentation = 1.0 - (active / reserved) if reserved > 0 else 0.0
    print(f"Active: {active/1e9:.2f}GB, Reserved: {reserved/1e9:.2f}GB, "
          f"Fragmentation: {fragmentation:.1%}")

NUMA Effects on Multi-GPU Systems

Modern servers have multiple NUMA (Non-Uniform Memory Access) nodes. PCIe switches connect GPUs to specific CPU sockets. A GPU on NUMA node 0 has faster PCIe access to NUMA node 0's DRAM. Cross-NUMA PCIe transfers can be 2x slower than same-NUMA transfers.

# Check NUMA topology
import subprocess
result = subprocess.run(['nvidia-smi', 'topo', '-m'], capture_output=True, text=True)
print(result.stdout)

# PyTorch: pin memory to specific NUMA node for faster H2D transfers
# Use numactl at the process level:
# numactl --cpunodebind=0 --membind=0 python train.py

Memory Bandwidth Saturation Detection

When your GPU is memory-bandwidth-saturated, you will see:

  1. dram__bytes.sum.per_second near 3.35 TB/s in Nsight
  2. SM active time high, but compute (FLOP) low
  3. Scaling batch size does not improve throughput proportionally

The fix is usually one of:

  • Kernel fusion (reduce the number of HBM round-trips)
  • Quantization (reduce bytes per parameter)
  • Operator reordering (increase data reuse before eviction)
  • Flash Attention style algorithms (avoid materializing large intermediate tensors)

Common Mistakes

:::danger Register Spilling Ruins Performance Silently Adding more complex logic to a kernel often crosses a register threshold that forces the compiler to spill registers to local memory. You see no compile error. The kernel still produces correct results. But what was a 1-cycle register access is now a 400-cycle HBM round-trip and performance collapses.

Always check register usage with --ptxas-options=-v during compilation. If registers exceed 32-64 per thread and you need high occupancy, consider splitting the kernel or using __launch_bounds__ to constrain register usage explicitly. :::

:::danger Uncoalesced HBM Access Destroys Bandwidth A warp of 32 threads accessing 32 scattered addresses consumes up to 32x the bandwidth of coalesced access (32 separate 128-byte cache lines vs 1 cache line for the same useful data). This single mistake can reduce effective memory bandwidth utilization from 90% to under 5%.

Always visualize your access pattern: draw which address each thread accesses. If consecutive threads access consecutive addresses, you are coalesced. If threads access addresses with stride > 1 float, investigate and fix. :::

:::warning Shared Memory Bank Conflicts Are Invisible to Bandwidth Metrics Shared memory bank conflicts do not show up in HBM bandwidth metrics because they occur on-chip. A kernel with severe bank conflicts can look fine in all memory counters while running at half speed. The only way to detect them is the specific Nsight counter: l1tex__data_bank_conflicts_pipe_lsu_mem_shared.sum. Any nonzero value in a tightly optimized kernel deserves investigation. Padding tiles by 1 element (e.g., __shared__ float tile[32][33] instead of [32][32]) is the standard fix for transpose-style access patterns. :::

:::warning L2 Cache Eviction Can Surprise You During Multi-Kernel Launches When multiple kernels run concurrently on the same GPU (CUDA streams), they share the L2 cache. A kernel you expected to be L2-resident may get evicted by another concurrent kernel. If you are using L2 residency hints (cudaAccessPropertyPersisting), verify with Nsight that the persisting data is actually staying in L2 and not being evicted by competing streams. :::

:::warning torch.cuda.empty_cache() Does Not Free Live Memory A common misconception: torch.cuda.empty_cache() returns the caching allocator's unused pooled blocks to CUDA, making them available to other processes - but any block backing a live tensor stays allocated. If anything in the Python process still holds a reference to a CUDA tensor, that memory cannot be released. Use del tensor, then gc.collect(), then torch.cuda.empty_cache() to actually free memory. :::

Interview Q&A

Q1: Walk me through the GPU memory hierarchy levels and give me a latency number for each.

The GPU memory hierarchy has four main levels. Registers are at the top - they are the flip-flop storage inside each SM, holding the working values for each thread. Access is 1 clock cycle, and each thread on H100 can use up to 255 registers. Below that is L1 cache and shared memory, which share the same physical 256KB per SM. Access latency is 20-32 cycles. L1 is a hardware-managed cache; shared memory is explicitly controlled by the programmer. Below that is the L2 cache - 50MB unified across all 132 SMs on H100, with 150-250 cycle latency. Finally, HBM (High Bandwidth Memory) is the main GPU DRAM - 80GB on H100, 400-600 cycle latency, 3.35 TB/s bandwidth. The key design principle is that each level is roughly 5-8x slower but orders of magnitude larger than the level above.

Q2: What is the difference between L1 cache and shared memory on a GPU? They are both on-chip SRAM - why are there two?

They occupy the same physical SRAM - on H100, it is 256KB per SM and you configure how much goes to L1 vs shared memory (anywhere from 0/256 to 228/28). The difference is control. L1 is a hardware-managed cache: the hardware decides what goes in and when it is evicted based on recent access patterns. Shared memory is a programmer-managed scratchpad: you explicitly load data into it with load instructions, use it, and it stays there until the block ends or you stop using it.

The reason both exist is that they solve different problems. L1 helps with irregular access patterns that happen to have locality - the hardware captures reuse automatically. Shared memory is better when you know exactly what will be reused and want to guarantee it stays fast - like the tile in a matrix multiplication where you load a tile and use every element K times. With L1, there is a risk the hardware evicts your tile before you finish using it. With shared memory, that cannot happen.

Q3: Explain why LLM inference (decode phase) is memory-bandwidth-bound and not compute-bound.

During decode, you generate one token at a time. The forward pass is a batch-size-1 (or small batch) matrix-vector multiplication through every layer. For a weight matrix of shape [4096, 16384], you perform 2 x 4096 x 16384 = 134 million FLOP but must load the entire matrix (4096 x 16384 x 2 bytes = 128MB in fp16) from HBM.

Arithmetic intensity = 134M FLOP / 128MB = ~1 FLOP/byte. H100's peak compute is 312 TFLOP/s and peak bandwidth is 3.35 TB/s. The compute/bandwidth ratio (the roofline balance point) is 312 / 3.35 = 93 FLOP/byte. Your workload at 1 FLOP/byte is 93x below the balance point - deeply memory-bandwidth-limited.

This is why quantization (especially weight-only quantization like W4A16) is so impactful for inference - cutting weight size from 2 bytes to 0.5 bytes (4-bit) directly reduces HBM bandwidth consumption by 4x and proportionally improves throughput.

Q4: What are shared memory bank conflicts and how do you detect and fix them?

Shared memory is physically organized into 32 banks, each 4 bytes wide. The bank index is (byte address / 4) mod 32, so when two threads in the same warp access different addresses that map to the same bank, the accesses serialize - they happen one after another instead of simultaneously.

The canonical example is matrix transpose: you load an N x N tile row-by-row (coalesced from HBM) into __shared__ float tile[N][N], then write it out column-by-column. The column-access pattern means all threads access different rows but the same column - same bank on every access - maximum conflict.

The fix is padding: declare __shared__ float tile[N][N+1]. Adding 1 column shifts every row's bank alignment by 1, so column accesses now hit different banks. The cost is 1 float per row of wasted shared memory - usually negligible.

Detection: ncu --metrics l1tex__data_bank_conflicts_pipe_lsu_mem_shared.sum. Any number above zero in a tuned kernel warrants investigation.

Q5: If I have a kernel that uses 64 registers per thread and my SM has 65,536 registers, how many threads can I fit on one SM, and what is my occupancy?

Registers per thread = 64. SM has 65,536 registers. Max threads from register perspective = 65,536 / 64 = 1,024 threads per SM.

H100 maximum threads per SM = 2,048. So register usage limits us to 1,024 / 2,048 = 50% occupancy from the register perspective. But we also need to check block constraints - if we are using 256-thread blocks, we can fit 4 blocks per SM (1,024 / 256 = 4), and the SM supports up to 32 blocks, so block count is not the limiter. Finally we check shared memory - if each block uses 20KB of shared memory and we have 4 blocks, that is 80KB total, which fits in 256KB. So register file is the binding constraint and we achieve 50% theoretical occupancy.

Whether 50% is good enough depends on the kernel. For a compute-heavy kernel (matrix multiplication), 50% occupancy with enough warps in flight to hide arithmetic latency is usually fine. For a memory-heavy kernel where each warp stalls often waiting for HBM, you might want higher occupancy. Reducing the algorithm's register pressure (or accepting some spilling to buy back occupancy) is the tradeoff to evaluate.

Q6: A teammate says "just use cudaMemPrefetchAsync to get data to the GPU before you need it and latency hiding takes care of the rest." What is right and wrong about this statement?

The teammate is correct that cudaMemPrefetchAsync is a useful tool - it initiates a DMA transfer on a CUDA stream without blocking the CPU, allowing CPU work and GPU data transfer to overlap. This is the right approach for pipelining data loading with compute.

What is incomplete about the statement is the assumption that latency hiding "takes care of the rest" once data is in HBM. Getting data from host DRAM to HBM is only the first bottleneck. Once in HBM, the data still has to traverse the memory hierarchy - HBM to L2 (~200 cycles), L2 to L1/shared (~30 cycles), shared to registers (1 cycle). If the access pattern is uncoalesced, each HBM access consumes up to 32x the bandwidth. If shared memory bank conflicts are present, on-chip throughput drops by the conflict degree - anywhere from 2x to 32x. Prefetching into HBM does nothing for these issues.

The complete picture: prefetching handles the CPU-GPU transfer bottleneck. Coalesced access handles HBM-to-L2 efficiency. Tiling and shared memory handle L2-to-register efficiency. All three matter and they are independent optimizations.

Summary and Next Steps

The GPU memory hierarchy is a 5-level system with latency ranging from 1 cycle (registers) to 10,000+ cycles (host DRAM via PCIe). Each level is exponentially slower but exponentially larger than the level above. Every GPU optimization technique - tiling, shared memory usage, prefetching, kernel fusion, quantization - is fundamentally about keeping data at a higher level in this hierarchy for longer.

The practical framework:

  1. Profile first with Nsight to determine if you are compute-bound or memory-bound
  2. Check HBM bandwidth utilization - if near 100%, you are bandwidth-saturated
  3. Check coalescing - uncoalesced access is the most common culprit
  4. Check shared memory usage - explicit tiling is the primary tool for improving L1/L2 reuse
  5. Check register usage - avoid spilling, but accept 50% occupancy if the alternative is register spills

The next lesson covers the physical construction of HBM and GDDR memory technologies - understanding why HBM achieves 3.35 TB/s while GDDR6X achieves "only" 1 TB/s, and what the physical constraints are that limit both.

© 2026 EngineersOfAI. All rights reserved.