Memory Hierarchy in GPUs
Reading time: ~40 min · Interview relevance: Very High · Target roles: ML Engineer, CUDA Developer, Systems Engineer
GPU performance is not about compute - it is about memory. The H100 can do 3958 TFLOPS of FP16 compute. It has 3.35 TB/s of HBM bandwidth. At 2 bytes per BF16 value, you can load 1.675 trillion values per second. That sounds like a lot, until you realize one transformer attention head needs to touch billions of values per forward pass - and it needs to do it over and over again.
The Night the Cluster Caught Fire
It is 2:47 AM. You are on call. Your company's production inference cluster - 64 H100s serving a large language model to paying customers - has just fallen to 12% of its normal throughput. Response latencies are spiking from 180ms to over 4 seconds. The pager woke you up. Users are angry. Your manager is online asking what is wrong.
You SSH into the cluster and pull up nvitop. All 64 GPUs are at 100% utilization. No errors. No OOM. The GPUs are running. They are just... slow. Compute utilization shows 98%. This should be fast. Why is it broken?
You pull the profiler traces from earlier in the evening. The culprit becomes obvious: someone pushed a new attention implementation yesterday. The old one used flash attention with careful tiling. The new one - written by an engineer who was confident they had a "cleaner" implementation - accesses the key and value matrices in a strided pattern. Instead of reading memory sequentially, it reads every 128th byte. The GPU's memory controller is issuing 128-byte cache line fetches and using only 2 bytes from each one. Memory utilization: 1.6%. The GPU is doing roughly 64 times more memory work than necessary.
The fix is a one-line change - transpose the KV cache layout before the attention kernel. Throughput goes back to normal in 12 minutes. But those 3 hours of degraded service cost the company real money. And the root cause was not a bug in the math, not a wrong hyperparameter, not a network issue. It was a misunderstanding of how GPU memory hierarchy works.
This lesson is about preventing that 2:47 AM call. It is about understanding, at a hardware level, how memory moves through a GPU - from the tiny, blazing-fast registers an individual thread owns, through shared memory that a thread block shares, through the L2 cache that the entire chip shares, all the way down to HBM (High Bandwidth Memory) sitting on the chip package. Each layer has different capacity, different latency, and different bandwidth. Writing fast GPU code means knowing which layer your data lives in, how to keep it in the fast layers, and how to access memory in patterns the hardware was designed to handle.
This is not academic. Every serious ML performance optimization - FlashAttention, PagedAttention, quantization kernels, custom fused operators - is fundamentally an exercise in GPU memory management. The engineers who built these systems understood the memory hierarchy deeply. After this lesson, so will you.
Why This Exists
Before GPUs had a sophisticated memory hierarchy, every memory access went directly to DRAM. DRAM (Dynamic Random Access Memory) is cheap, dense, and has massive capacity. It is also slow - 100 to 400 clock cycles of latency, and even the best DRAM can only deliver a few hundred GB/s of bandwidth.
Early GPU workloads were graphics: rasterizing triangles, texturing surfaces, blending pixels. These workloads have a natural property: they are embarrassingly parallel and each pixel is roughly independent. You can saturate DRAM bandwidth by having enough threads in flight that while one thread waits for memory, another thread is computing. This technique - hiding memory latency with compute - is called latency hiding through thread-level parallelism.
For a while, this was enough. But then two things happened.
First, GPUs got dramatically more parallel. The number of CUDA cores per chip went from hundreds to thousands to tens of thousands. All those cores can do arithmetic fast. But arithmetic needs data, and data comes from memory. The compute-to-memory ratio - how many floating point operations you can do per byte loaded - kept climbing. By the time we had thousands of cores, memory had become the bottleneck for nearly every workload.
Second, machine learning changed the access patterns. Transformer attention is not like rasterizing triangles. The same weight matrices get used over and over. The query, key, and value projections for a single layer get multiplied by huge matrices millions of times per training run. The data is not random - it is highly structured and highly reused. A hardware system that caches recently-used data and spatial neighbors would dramatically reduce the actual load on DRAM.
The solution was to build a hierarchy. Instead of one flat DRAM, modern GPUs have five distinct layers of memory, each faster and smaller than the one below it. Data lives as close to the computation as physics allows. When a thread needs a value, the hardware checks the fast nearby storage first. Only if the value is not there does it go to the slower, farther storage. This is the same design principle as CPU caches - it just took GPUs longer to adopt it because their original workloads did not benefit from it as much.
The twist for GPUs is that part of the hierarchy is manually managed. On a CPU, you never explicitly put data into the L1 cache. The hardware handles it transparently. On a GPU, shared memory - the fastest per-SM storage below registers - is a scratchpad that your CUDA kernel explicitly controls. You decide what goes in, when it goes in, and when it comes out. This gives you fine-grained control that CPU caches do not offer, but it requires you to understand the hierarchy deeply enough to use it well.
Historical Context
The story of GPU memory hierarchy parallels the broader story of GPU architecture maturation.
2006 - NVIDIA Tesla (G80): The first CUDA GPU introduced shared memory explicitly. Each Streaming Multiprocessor had 16KB of on-chip shared memory. This was the first GPU where programmers could explicitly stage data through a fast scratchpad. The big insight from NVIDIA's architects was that many GPU algorithms - matrix multiply, FFT, image convolution - have regular, predictable data reuse patterns. If the programmer declares the reuse explicitly, the hardware can exploit it. Shared memory was that declaration mechanism.
2010 - Fermi: Fermi was the first GPU architecture with a proper L1/L2 cache hierarchy alongside shared memory. Each SM got 64KB of configurable on-chip memory that could be split between L1 cache (hardware-managed) and shared memory (software-managed) in ratios like 48KB/16KB or 16KB/48KB. A unified L2 cache of 768KB served the entire chip. This was NVIDIA acknowledging that not every programmer wants to manage caches manually - some workloads benefit from hardware-managed caching even on a GPU.
2012 - Kepler: Kepler added the register file boost that still defines the register-heavy ML kernels we run today. Each SM got 65,536 32-bit registers - one of the largest register files ever put on a microprocessor. This matters enormously for register-tiling strategies where you keep an entire output tile in registers across multiple loop iterations.
2017 - Volta: Volta (the V100) was the first GPU built explicitly around deep learning. It introduced Tensor Cores - specialized matrix multiply units that operate on 4x4 matrix tiles. Critically, Volta also introduced independent thread scheduling, which changed how warp divergence and synchronization work at the instruction level. The memory hierarchy got HBM2 with 900 GB/s bandwidth, up from 480 GB/s on Pascal.
2022 - Hopper (H100): The H100 is currently the dominant training and inference chip for large language models. It introduced HBM3 with 3.35 TB/s of bandwidth, 50MB of L2 cache (up from 40MB on A100), and the Transformer Engine that dynamically selects FP8/FP16/BF16 precision per layer. The memory hierarchy decisions made on H100 directly affect how FlashAttention-3 and other LLM kernels are written.
The key historical insight: every major GPU architecture generation has improved the memory hierarchy because compute improvements have consistently outpaced memory improvements. The memory wall - the growing gap between compute speed and memory speed - has been the central challenge of GPU architecture for 15 years. Every new cache level, every bandwidth bump, every new memory technology is an attempt to keep feeding the ever-faster compute units with data fast enough.
The Five Levels of GPU Memory
Before going deep on each level, here is the complete hierarchy in one view. Understand this table and you understand 80% of GPU performance optimization.
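(All figures are the H100 SXM numbers used throughout this lesson; latencies are approximate cycle counts.)

| Level | Scope | Capacity (H100) | Latency | Bandwidth |
|---|---|---|---|---|
| Registers | Private to each thread | 256KB per SM (~33MB chip-wide) | ~1 cycle | Register-file speed |
| Shared memory / L1 | All threads in a block (per SM) | Up to 228KB per SM | ~32 cycles | On-chip SRAM |
| L2 cache | All SMs on the chip | 50MB | ~200 cycles | ~12 TB/s |
| HBM3 | Entire GPU | 80GB | ~500-700 cycles | 3.35 TB/s |
| Host memory (PCIe / NVLink) | Off the GPU package | System RAM | Tens of thousands of cycles | ~64 GB/s (PCIe 5.0); 900 GB/s (NVLink, GPU-to-GPU) |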
Each level is physically closer to the compute units than the one below it. "Closer" means both spatially on the chip (shorter wires = faster signal propagation) and logically (fewer steps in the memory access protocol).
Now let us go through each level with the precision an ML performance engineer needs.
Level 1 - Registers
What They Are
Registers are the fastest storage on the GPU. When a thread executes an instruction like c = a + b, the values a, b, and c live in registers. There is no memory access at all - the arithmetic unit reads directly from the register file and writes directly back. This happens in 1 clock cycle.
Each thread has its own private set of registers. No other thread can read or write your registers. This private ownership is what enables the massively parallel execution model: thousands of threads can all be doing arithmetic simultaneously because each thread's working data is physically separate.
Capacity: On H100, each SM has 65,536 32-bit (4-byte) registers. That is 256KB per SM. With 132 SMs on an H100, that is about 33MB of total register file capacity across the chip.
Per-thread allocation: When you launch a CUDA kernel, the compiler determines how many registers each thread needs based on the kernel's logic. If a thread uses 32 registers and you launch blocks of 1024 threads, the block needs 32,768 registers. If your SM has 65,536 registers, it can hold 2 blocks simultaneously. If your kernel uses 64 registers per thread, only 1 block fits. This affects occupancy - how many warps are active simultaneously on an SM, which is critical for latency hiding.
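A minimal sketch of that occupancy arithmetic, assuming the H100 figures above (real occupancy is also limited by shared memory, warp slots, and per-SM block limits - use the CUDA occupancy calculator for real kernels):

```python
def blocks_per_sm_by_registers(regs_per_thread: int, threads_per_block: int,
                               regs_per_sm: int = 65_536) -> int:
    """How many blocks fit on one SM if registers are the only limiting resource."""
    regs_per_block = regs_per_thread * threads_per_block
    return regs_per_sm // regs_per_block

# 32 registers/thread, 1024-thread blocks -> 2 blocks per SM (as in the text)
print(blocks_per_sm_by_registers(32, 1024))   # 2
# 64 registers/thread -> only 1 block fits, halving the warps available for latency hiding
print(blocks_per_sm_by_registers(64, 1024))   # 1
```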
Register spilling: If a thread tries to use more registers than the hardware provides (or more than the occupancy-limited allocation allows), the compiler spills the excess to local memory. Local memory is not a separate hardware structure - it is a region of global memory (HBM) that acts as a register overflow area. Accessing spilled registers takes 500+ cycles instead of 1 cycle. This is one of the most common causes of unexpected slowdowns in complex CUDA kernels.
Register Tiling - The Key Optimization Pattern
The most powerful use of registers in ML kernels is register tiling: keeping an output accumulation tile in registers across the full inner loop of a matrix multiply. Here is the pattern:
# Conceptual illustration of register tiling in PyTorch CUDA extension
# The actual compute happens in the .cu file but PyTorch exposes the pattern
import torch
# Naive: every output element computed separately, no register reuse
def naive_matmul(A, B):
return A @ B # Let PyTorch/cuBLAS decide the tiling
# What cuBLAS does internally (pseudocode for a 16x16 output tile):
# Each thread holds a 2x2 or 4x4 output accumulator IN REGISTERS
# Thread computes partial sums across K dimension without writing to memory
# Final output written to shared memory or global memory once
#
# for k in range(0, K, TILE_K):
# load A_tile[m_range, k:k+TILE_K] into shared memory
# load B_tile[k:k+TILE_K, n_range] into shared memory
# __syncthreads()
# for kk in range(TILE_K):
# for m in range(TILE_M_PER_THREAD):
# for n in range(TILE_N_PER_THREAD):
# C_reg[m][n] += A_reg[m] * B_reg[n] # pure register ops
# __syncthreads()
# write C_reg back to global memory once
The key insight: by keeping the accumulator C_reg in registers across the entire K loop, you avoid K memory writes and reads. For a 4096x4096 matrix multiply with K=4096, this eliminates 4096 write-read cycles per output element. The register file is fast enough that this fits naturally.
Level 2 - Shared Memory and L1 Cache
The Software-Managed Scratchpad
Shared memory is the most powerful optimization tool in GPU programming. It is on-chip memory - physically located on the SM die, not on HBM - with latency around 32 cycles compared to 500+ cycles for HBM. An SM's shared memory is shared by all threads in a block running on that SM. (No sharing happens between blocks, and no sharing between SMs - hence the name is slightly misleading. It is shared within a block, not globally.)
On H100, each SM has 256KB of combined L1/shared memory, of which up to 228KB can be configured as shared memory. The default split varies by CUDA version and kernel configuration.
The key word is "software-managed." On a CPU, when you access a variable, the hardware automatically checks L1, then L2, then L3, then DRAM, transparently. You never write code to stage data into CPU cache. On a GPU, shared memory is a scratchpad: you explicitly copy data in, compute with it, and copy results out. If you do not explicitly use __shared__ in your CUDA kernel, shared memory stays empty and every memory access goes to L2/HBM.
Why Manual Management Is a Feature, Not a Bug
At first glance, manual cache management sounds like extra work. Why not just let the hardware handle it like a CPU cache? The answer is that GPU workloads - especially matrix operations - have predictable, structured access patterns that the programmer knows about but the hardware cannot easily infer.
Consider matrix multiplication: $C = AB$, where $A$ is $M \times K$ and $B$ is $K \times N$.
Each output element $C_{ij} = \sum_{k=1}^{K} A_{ik} B_{kj}$ requires the entire i-th row of A and the entire j-th column of B.
If you compute naively (one thread per output element, reading directly from HBM), the i-th row of A is re-read for every output element in row i of C - N times in total. For a 4096x4096 matrix, row i of A is loaded 4096 times from HBM when it only needs to be loaded once.
With shared memory tiling: load a tile of A (say 32x32) into shared memory once. All threads in the block that need values from that tile read from shared memory (~32 cycles) instead of HBM (~500 cycles). The row is loaded from HBM once and reused 32 times from shared memory. This is a 15x effective bandwidth improvement for this access pattern.
A hardware-managed cache might achieve the same in theory, but it would need to correctly infer that the entire 32x32 tile will be reused before evicting it. With explicit shared memory, you guarantee it.
Bank Conflicts - The Hidden Performance Killer
Shared memory is organized into banks - 32 banks on current NVIDIA GPUs, each 4 bytes wide. A 32-thread warp can access shared memory in 1 cycle if each thread accesses a different bank. But if multiple threads in a warp access the same bank (a bank conflict), those accesses are serialized.
The classic bank conflict scenario: a 32x32 matrix stored in row-major layout where each thread reads the same column (column-major access). Thread 0 reads element [0][0] (bank 0), thread 1 reads element [1][0] (also bank 0 if stride equals 32 * 4 = 128 bytes, which wraps around to bank 0). This creates 32-way bank conflicts - 32x serialization - and completely destroys the latency benefit of shared memory.
The fix is padding: add one extra element to each row of the shared memory array. This shifts each row's starting bank by 1, eliminating the column-access conflict.
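You can see the mapping with a few lines of arithmetic. This is a minimal sketch of the bank-index rule (bank = word index mod 32), not a hardware measurement:

```python
BANKS = 32
WORD_BYTES = 4

def bank_of(row: int, col: int, row_stride_words: int) -> int:
    """Shared-memory bank holding element [row][col] of a row-major float array."""
    word_index = row * row_stride_words + col
    return word_index % BANKS

# 32 threads of a warp each read column 0 of a different row
no_padding = [bank_of(t, 0, row_stride_words=32) for t in range(32)]
padded     = [bank_of(t, 0, row_stride_words=33) for t in range(32)]  # 1 extra element per row

print(set(no_padding))   # {0}  -> all 32 threads hit bank 0: 32-way conflict
print(len(set(padded)))  # 32   -> every thread hits a different bank: conflict-free
```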
# Measuring a bank-conflict-sensitive operation (matrix transpose) with PyTorch timing.
# Padding itself can only be controlled inside a CUDA/Triton kernel; PyTorch's transpose
# kernel already uses padded shared-memory tiles internally, so we benchmark the result.
import torch
import time

def time_transpose(matrix_size=4096):
    """
    Benchmark a matrix transpose and report effective bandwidth.
    A naive kernel with 32-way bank conflicts would run several times slower
    than a padded shared-memory implementation like the one cuBLAS/PyTorch uses.
    """
    device = 'cuda'
    # Create a large matrix that exceeds L2 cache (50MB on H100)
    # 4096x4096 float32 = 64MB - forces HBM access
    A = torch.randn(matrix_size, matrix_size, device=device, dtype=torch.float32)
    # Warm up
    for _ in range(10):
        B = A.t().contiguous()
    torch.cuda.synchronize()
    # Time the operation
    start = time.perf_counter()
    for _ in range(100):
        B = A.t().contiguous()
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    # Calculate bandwidth
    bytes_transferred = 2 * matrix_size * matrix_size * 4  # read + write
    bandwidth_gb_s = (100 * bytes_transferred) / (elapsed * 1e9)
    return elapsed / 100 * 1000, bandwidth_gb_s

t_ms, bw = time_transpose(4096)
print(f"Transpose 4096x4096: {t_ms:.3f}ms, effective bandwidth: {bw:.1f} GB/s")
# A good implementation should achieve 60-80% of peak HBM bandwidth (~2.0-2.7 TB/s on H100)
L1 Cache - The Hardware-Managed Complement
Alongside shared memory, each SM has an L1 data cache. In modern NVIDIA GPUs (Volta onward), L1 and shared memory share the same physical SRAM. What is not allocated to shared memory is used as L1 cache.
L1 cache is hardware-managed: you do not explicitly control what goes in it. It operates on cache lines of 128 bytes. When you access any byte in global memory, the entire 128-byte cache line containing that byte is fetched into L1. If you then access the adjacent byte (in the same cache line), it is an L1 hit at ~32 cycles instead of an HBM miss at 500+ cycles.
This 128-byte cache line is the fundamental unit you must align your thinking around for any memory access optimization.
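To make that concrete, here is a tiny sketch (illustrative arithmetic only) of how many distinct 128-byte cache lines one 32-thread warp touches at different strides:

```python
CACHE_LINE = 128  # bytes

def warp_cache_lines(stride_bytes: int, access_bytes: int = 4, warp_size: int = 32) -> int:
    """Distinct cache lines touched when thread t reads access_bytes at offset t * stride_bytes."""
    lines = {(t * stride_bytes) // CACHE_LINE for t in range(warp_size)}
    return len(lines)

for stride in [4, 8, 32, 128]:
    lines = warp_cache_lines(stride)
    useful = 32 * 4               # bytes the warp actually wanted
    fetched = lines * CACHE_LINE  # bytes the memory controller had to move
    print(f"stride={stride:4d}B: {lines:2d} cache lines, utilization={useful / fetched:.1%}")
# stride=4B   ->  1 cache line, 100%  (coalesced)
# stride=128B -> 32 cache lines, 3.1% (fully strided)
```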
Level 3 - L2 Cache
The L2 cache is a chip-wide resource - all SMs share it. On H100, the L2 cache is 50MB. On A100, it was 40MB. On V100, it was 6MB. This rapid growth reflects how critical L2 has become for transformer workloads where the same weight matrices are reused across multiple sequence positions.
Latency: approximately 200 clock cycles at H100's 1,980 MHz base clock. That is about 100 nanoseconds.
Bandwidth: The L2 bandwidth on H100 is approximately 12 TB/s - roughly 3.5x the HBM bandwidth. This enormous bandwidth between L2 and the SMs is what makes L2 residency so valuable. If your working set fits in 50MB, you effectively have 12 TB/s of accessible bandwidth instead of 3.35 TB/s.
What fits in L2 on H100 (50MB):
- A 4096x4096 BF16 matrix = 32MB. Fits.
- The KV cache for a 1024-token sequence at 32 heads, 128 head-dim, BF16: 2 (K and V) x 1024 x 32 x 128 x 2 bytes = 16MB. Fits.
- The KV cache for an 8192-token sequence: 128MB. Does not fit. Goes to HBM.
This is why FlashAttention's context-length scaling and the memory pressure of long-context inference are so directly tied to L2 cache capacity. The transition from 6MB (V100) to 50MB (H100) was not an accident - NVIDIA was explicitly targeting attention head computations.
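A small helper for that sizing rule (a sketch of the per-layer arithmetic; plug in your own model's head count, head dimension, and dtype):

```python
def kv_cache_bytes(seq_len: int, n_heads: int, head_dim: int, bytes_per_elem: int = 2) -> int:
    """Per-layer KV cache size: K and V, each of shape [seq_len, n_heads, head_dim]."""
    return 2 * seq_len * n_heads * head_dim * bytes_per_elem

L2_BYTES = 50 * 1024 * 1024  # H100 L2 cache
for seq in [1024, 8192]:
    size = kv_cache_bytes(seq, n_heads=32, head_dim=128)
    fits = "fits in L2" if size <= L2_BYTES else "spills to HBM"
    print(f"seq_len={seq:5d}: {size / 2**20:.0f} MB per layer -> {fits}")
# seq_len= 1024:  16 MB per layer -> fits in L2
# seq_len= 8192: 128 MB per layer -> spills to HBM
```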
Level 4 - HBM (High Bandwidth Memory)
HBM is the main memory of the GPU. On H100 SXM, there are 80GB of HBM3. This is where all your model weights, activations, optimizer states, and KV caches live when they are not actively being computed on.
HBM Architecture
HBM is not located on the GPU die itself. It is a separate chip (or stack of chips) placed adjacent to the GPU die on the same package substrate. They are connected by extremely wide buses - thousands of data lines running in parallel across a tiny distance (a few millimeters).
The bandwidth comes from this width: H100 HBM3 uses a 5120-bit memory bus (compared to 256-bit for a typical consumer GPU's GDDR6). At 3.35 TB/s and a ~1.98 GHz GPU clock, this bus can transfer approximately:

$$\frac{3.35 \times 10^{12}\ \text{bytes/s}}{1.98 \times 10^{9}\ \text{cycles/s}} \approx 1{,}690\ \text{bytes/cycle} \approx 1.65\ \text{KB/cycle}$$

That is 1.65KB of memory moving across the HBM bus every single GPU clock cycle. Impressive. But the compute units can do far more arithmetic than that in one cycle. The memory wall is real.
HBM vs GDDR6 vs LPDDR5
| Memory Type | Bandwidth | Capacity | Who Uses It |
|---|---|---|---|
| HBM3 (H100) | 3.35 TB/s | 80GB | Data center GPUs |
| HBM2e (A100) | 2.0 TB/s | 80GB | Previous gen data center |
| GDDR6X (RTX 4090) | 1.0 TB/s | 24GB | Consumer GPUs |
| GDDR6 (RTX 3090) | 0.94 TB/s | 24GB | Consumer GPUs |
| LPDDR5X (Apple M3 Ultra) | 0.8 TB/s | 192GB | Apple Silicon |
The bandwidth difference between HBM3 and GDDR6X (3.35x) explains why an H100 can run LLMs at 3-4x the throughput of an RTX 4090 even when the compute difference is smaller.
Measuring Your Actual HBM Bandwidth
import torch
import time
def measure_memory_bandwidth(size_gb=1.0, dtype=torch.float16, num_trials=20):
"""
Measure effective HBM read bandwidth using a simple copy kernel.
A copy reads src and writes to dst - 2x the data size in transfers.
"""
device = torch.device('cuda')
# Calculate number of elements
bytes_per_element = torch.finfo(dtype).bits // 8 if dtype.is_floating_point else 1
n_elements = int(size_gb * 1024**3 / bytes_per_element)
# Allocate source and destination
src = torch.randn(n_elements, dtype=dtype, device=device)
dst = torch.empty_like(src)
# Warm up - ensure memory is properly allocated and initialized on device
for _ in range(5):
dst.copy_(src)
torch.cuda.synchronize()
# Timed trials
times = []
for _ in range(num_trials):
torch.cuda.synchronize()
t0 = time.perf_counter()
dst.copy_(src)
torch.cuda.synchronize()
t1 = time.perf_counter()
times.append(t1 - t0)
# Remove outliers (top and bottom 10%)
times.sort()
trimmed = times[num_trials//10 : -num_trials//10]
avg_time = sum(trimmed) / len(trimmed)
# Bandwidth = total bytes transferred / time
# copy transfers size_gb GB read + size_gb GB write = 2 * size_gb GB
total_bytes = 2 * size_gb * 1024**3
bandwidth_tb_s = total_bytes / avg_time / 1e12
theoretical_h100 = 3.35 # TB/s for HBM3 on H100 SXM
efficiency = bandwidth_tb_s / theoretical_h100 * 100
print(f"Array size: {size_gb:.1f} GB ({dtype})")
print(f"Average time: {avg_time*1000:.2f} ms")
print(f"Measured bandwidth: {bandwidth_tb_s:.2f} TB/s")
print(f"Theoretical H100 HBM3: {theoretical_h100} TB/s")
print(f"Efficiency: {efficiency:.1f}%")
print()
return bandwidth_tb_s
# Run at different sizes to see cache effects
print("=== Memory Bandwidth at Different Working Set Sizes ===\n")
# Small: fits in L2 cache (50MB on H100)
measure_memory_bandwidth(size_gb=0.01) # 10MB - L2 resident
# Medium: spills into HBM
measure_memory_bandwidth(size_gb=0.1) # 100MB - HBM resident
# Large: definitely HBM, steady state
measure_memory_bandwidth(size_gb=1.0) # 1GB - pure HBM bandwidth
# Very large: same as above, confirms no strange effects
measure_memory_bandwidth(size_gb=4.0) # 4GB - pure HBM bandwidth
# Expected output on H100:
# 10MB: ~8-12 TB/s (L2 bandwidth dominates)
# 100MB: ~3.0-3.2 TB/s (HBM, initial)
# 1GB: ~3.1-3.3 TB/s (HBM, steady state)
# 4GB: ~3.1-3.3 TB/s (HBM, same)
The Roofline Model - Connecting Compute and Memory
The roofline model is a simple visual and mathematical tool that tells you immediately whether a kernel is compute-bound or memory-bound - and what the theoretical maximum performance is in either case.
Arithmetic Intensity
Arithmetic intensity is the number of floating point operations (FLOPs) per byte of memory traffic:
Units: FLOPs/byte, sometimes written as FLOP/B.
Examples:
- Vector addition ($c = a + b$, N elements): 1 add per 3 elements moved (2 loads + 1 store). In BF16: 1 FLOP per 6 bytes, so $\approx 0.17$ FLOP/byte.
- Matrix-vector multiply ($y = Ax$, $M \times N$ matrix): $2MN$ FLOPs, $\approx 2MN$ bytes in BF16 (the matrix dominates). For large M, N: $\approx 1$ FLOP/byte.
- Matrix-matrix multiply ($C = AB$, all $N \times N$): $2N^3$ FLOPs, $3 \times 2N^2 = 6N^2$ bytes in BF16. For large N: $\approx N/3$ FLOP/byte. At N=4096: $\approx 1365$ FLOP/byte.
The dramatic difference in arithmetic intensity explains why matrix-matrix multiply is compute-bound (and fast, relatively speaking) while matrix-vector multiply and element-wise operations are memory-bound (and slow, relative to the hardware's potential).
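The same numbers as a small helper (a sketch that assumes BF16 operands and counts only compulsory memory traffic, ignoring cache effects):

```python
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    return flops / bytes_moved

N = 4096
cases = {
    "vector add (N elems)":      (N,         3 * N * 2),             # 1 FLOP/elem, 3 BF16 values moved
    "matrix-vector (N x N)":     (2 * N * N, (N * N + 2 * N) * 2),   # matrix + input + output
    "matrix-matrix (N x N x N)": (2 * N**3,  3 * N * N * 2),         # A, B, and C once each
}
for name, (flops, nbytes) in cases.items():
    print(f"{name:28s}: {arithmetic_intensity(flops, nbytes):8.2f} FLOP/byte")
# vector add ~0.17, matrix-vector ~1.0, matrix-matrix ~1365 (= N/3)
```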
The Roofline Equations
Given:
- Peak compute: $P_{\text{peak}}$ in TFLOPS (e.g., 989 TFLOPS BF16 on H100 non-Tensor-Core)
- Peak memory bandwidth: $B_{\text{peak}}$ in TB/s (e.g., 3.35 TB/s on H100 HBM3)
The ridge point is the arithmetic intensity where compute and memory are equally limiting:

$$I_{\text{ridge}} = \frac{P_{\text{peak}}}{B_{\text{peak}}}$$

For a kernel with arithmetic intensity $I$:
- If $I < I_{\text{ridge}}$: memory-bound. Maximum achievable performance = $I \times B_{\text{peak}}$
- If $I \geq I_{\text{ridge}}$: compute-bound. Maximum achievable performance = $P_{\text{peak}}$
Using Tensor Cores ($P_{\text{peak}} = 3958$ TFLOPS BF16 on H100):

$$I_{\text{ridge}} = \frac{3958\ \text{TFLOPS}}{3.35\ \text{TB/s}} \approx 1181\ \text{FLOP/byte}$$

This means even matrix multiply, at $\approx 1365$ FLOP/byte for N=4096, is barely compute-bound with Tensor Cores. At smaller sizes (N=256 gives $N/3 \approx 85$ FLOP/byte) it is clearly memory-bound, and the Tensor Cores are underutilized.
import torch
import time
import math
def roofline_analysis(A: torch.Tensor, B: torch.Tensor, flops_per_element: float,
bytes_per_element: float, operation_name: str):
"""
Measure actual kernel throughput and compare to roofline bounds.
"""
# Peak specs for H100 SXM (adjust for your hardware)
peak_compute_tflops = 3958.0 # BF16 Tensor Core
peak_bandwidth_tb_s = 3.35 # HBM3
n_elements = A.numel()
total_flops = n_elements * flops_per_element
total_bytes = n_elements * bytes_per_element
arithmetic_intensity = total_flops / total_bytes
# Warm up
for _ in range(10):
C = torch.matmul(A, B) if B is not None else A * 2.0
torch.cuda.synchronize()
# Time
n_trials = 50
torch.cuda.synchronize()
t0 = time.perf_counter()
for _ in range(n_trials):
C = torch.matmul(A, B) if B is not None else A * 2.0
torch.cuda.synchronize()
elapsed = (time.perf_counter() - t0) / n_trials
actual_tflops = total_flops / elapsed / 1e12
memory_bound_limit = arithmetic_intensity * peak_bandwidth_tb_s # TFLOPS
compute_bound_limit = peak_compute_tflops
roofline_limit = min(memory_bound_limit, compute_bound_limit)
efficiency = actual_tflops / roofline_limit * 100
bound_type = "MEMORY-BOUND" if arithmetic_intensity < (peak_compute_tflops / peak_bandwidth_tb_s) else "COMPUTE-BOUND"
print(f"\n{'='*50}")
print(f"Operation: {operation_name}")
print(f"Arithmetic intensity: {arithmetic_intensity:.1f} FLOP/byte")
print(f"Kernel type: {bound_type}")
print(f"Actual throughput: {actual_tflops:.1f} TFLOPS")
print(f"Roofline limit: {roofline_limit:.1f} TFLOPS")
print(f"Roofline efficiency: {efficiency:.1f}%")
device = 'cuda'

# Memory-bound: element-wise scaling (very low arithmetic intensity).
# With B=None the function times A * 2.0: 1 FLOP per element,
# 2 bytes read (A) + 2 bytes written (result) = 4 bytes per element.
N = 32 * 1024 * 1024  # 32M elements
A_vec = torch.randn(N, dtype=torch.bfloat16, device=device)
roofline_analysis(A_vec, None, flops_per_element=1.0,
                  bytes_per_element=4.0, operation_name="Element-wise scale (32M BF16)")

# Compute-bound: large matrix multiply (high arithmetic intensity).
# 2*N^3 FLOPs, 3*N^2*2 bytes of compulsory traffic -> intensity = N/3.
# Amortized per element of A (N^2 elements): 2*N FLOPs and 6 bytes.
N_mat = 4096
A_mat = torch.randn(N_mat, N_mat, dtype=torch.bfloat16, device=device)
B_mat = torch.randn(N_mat, N_mat, dtype=torch.bfloat16, device=device)
roofline_analysis(A_mat, B_mat, flops_per_element=2 * N_mat,
                  bytes_per_element=6.0,  # 3 matrices x 2 bytes, per element of A
                  operation_name=f"MatMul {N_mat}x{N_mat} BF16")
Memory Access Patterns - Coalesced vs Strided
This section explains the 2:47 AM incident from the opening. It is arguably the single most important practical concept in GPU memory programming.
Cache Lines and Coalescing
When the GPU memory controller fetches data from HBM, it always fetches an entire 128-byte cache line, regardless of how many bytes your code actually requested. If you request 1 byte, you get 128 bytes of memory traffic. The 127 bytes you did not request are either used (good) or wasted (bad).
Coalesced access: when 32 threads in a warp each access 4-byte values that are sequential in memory (e.g., thread 0 reads address 0, thread 1 reads address 4, thread 2 reads address 8, ... thread 31 reads address 124), all 32 requests fit in a single 128-byte cache line. One memory transaction serves the entire warp.
Strided access: when threads access non-sequential memory (e.g., thread 0 reads address 0, thread 1 reads address 128, thread 2 reads address 256...), each thread's request falls in a different cache line. The memory controller must issue 32 separate cache line fetches to serve the warp. All 32 fetch 128 bytes, use 4 bytes, and discard 124. Memory utilization: 4/128 = 3.1%.
Benchmarking Coalesced vs Strided Access
import torch
import time
def benchmark_access_pattern(n_elements=64*1024*1024, stride=1, dtype=torch.float32):
"""
Compare sequential (coalesced) vs strided memory access patterns.
stride=1: sequential, coalesced access - best case
stride=N: access every Nth element - progressively worse
"""
device = 'cuda'
# Source array in HBM
src = torch.randn(n_elements * stride, dtype=dtype, device=device)
# Create indices for strided access
indices = torch.arange(0, n_elements * stride, stride, device=device, dtype=torch.long)
# Warm up
for _ in range(10):
if stride == 1:
dst = src[:n_elements].clone()
else:
dst = src[indices].clone()
torch.cuda.synchronize()
# Timed run
n_trials = 30
torch.cuda.synchronize()
t0 = time.perf_counter()
for _ in range(n_trials):
if stride == 1:
dst = src[:n_elements].clone()
else:
dst = src[indices].clone()
torch.cuda.synchronize()
elapsed = (time.perf_counter() - t0) / n_trials
useful_bytes = n_elements * dtype.itemsize
# Effective bandwidth (based on useful data only)
effective_bw_gb_s = useful_bytes / elapsed / 1e9
# Theoretical memory traffic (accounting for wasted cache line fetches)
# Each access fetches 128-byte cache line, uses stride*4 bytes if stride<32
# For stride >= 32: each access is a separate cache line
cache_line_bytes = 128
bytes_per_access = dtype.itemsize
if stride == 1:
waste_factor = 1.0
else:
# Each element fetches a full cache line, uses only bytes_per_access
waste_factor = min(stride, cache_line_bytes // bytes_per_access)
actual_bw_gb_s = effective_bw_gb_s * waste_factor
print(f"Stride={stride:4d}: {elapsed*1000:.2f}ms, "
f"effective BW={effective_bw_gb_s:.1f} GB/s, "
f"actual BW consumed={actual_bw_gb_s:.1f} GB/s, "
f"utilization={100/max(waste_factor,1):.0f}%")
print("Memory access pattern benchmark (float32, 256MB array)")
print("="*65)
for stride in [1, 2, 4, 8, 16, 32, 64, 128]:
benchmark_access_pattern(stride=stride)
# Expected output on H100:
# Stride= 1: fast, ~3TB/s, 100% utilization
# Stride= 2: still ok (fits in cache line)
# Stride= 4: degrading
# Stride= 32: ~32x slower effective throughput
# Stride= 128: ~32x slower, each access fully wastes cache line
Practical Fix - Transpose Before Computation
The real-world fix for strided access patterns is usually to transpose the data layout before the memory-intensive operation.
import torch
import time
def attention_kv_access_benchmark(seq_len=2048, n_heads=32, head_dim=128,
batch_size=4, dtype=torch.bfloat16):
"""
Benchmark two KV cache layouts for attention computation.
Layout A (bad for sequential key access):
K shape: [batch, seq, heads, head_dim]
Accessing all heads for position t means strided access over seq dimension
Layout B (good, as used in FlashAttention):
K shape: [batch, heads, seq, head_dim]
Accessing head h's key sequence is sequential in memory
"""
device = 'cuda'
# Create KV cache in layout A: [batch, seq, heads, head_dim]
K_layout_A = torch.randn(batch_size, seq_len, n_heads, head_dim,
dtype=dtype, device=device)
Q = torch.randn(batch_size, n_heads, 1, head_dim,
dtype=dtype, device=device)
# Layout B: [batch, heads, seq, head_dim]
K_layout_B = K_layout_A.permute(0, 2, 1, 3).contiguous()
def compute_attention_scores_A(Q, K):
# Need to permute K to match Q's head dimension - causes strided access
K_perm = K.permute(0, 2, 1, 3) # not contiguous - strided reads
# [batch, heads, 1, head_dim] x [batch, heads, head_dim, seq]
return torch.matmul(Q, K_perm.transpose(-2, -1))
def compute_attention_scores_B(Q, K):
# K is already [batch, heads, seq, head_dim] - sequential reads
return torch.matmul(Q, K.transpose(-2, -1))
# Warm up
for _ in range(10):
s = compute_attention_scores_A(Q, K_layout_A)
s = compute_attention_scores_B(Q, K_layout_B)
torch.cuda.synchronize()
n_trials = 100
# Benchmark layout A
torch.cuda.synchronize()
t0 = time.perf_counter()
for _ in range(n_trials):
scores_A = compute_attention_scores_A(Q, K_layout_A)
torch.cuda.synchronize()
time_A = (time.perf_counter() - t0) / n_trials * 1000
# Benchmark layout B
torch.cuda.synchronize()
t0 = time.perf_counter()
for _ in range(n_trials):
scores_B = compute_attention_scores_B(Q, K_layout_B)
torch.cuda.synchronize()
time_B = (time.perf_counter() - t0) / n_trials * 1000
print(f"seq_len={seq_len}, heads={n_heads}, head_dim={head_dim}, batch={batch_size}")
print(f"Layout A [batch,seq,heads,head_dim]: {time_A:.3f}ms (strided access)")
print(f"Layout B [batch,heads,seq,head_dim]: {time_B:.3f}ms (sequential access)")
print(f"Speedup from correct layout: {time_A/time_B:.2f}x")
attention_kv_access_benchmark(seq_len=2048)
attention_kv_access_benchmark(seq_len=8192)
Shared Memory Tiling - The Complete Pattern
Here is the canonical shared memory tiling pattern, illustrated with PyTorch. This does not replace cuBLAS for production, but understanding this pattern is essential for writing custom fused kernels.
import torch

def tiled_matmul_demo(M=1024, K=512, N=1024, TILE=32, dtype=torch.float32):
    """
    Demonstrates the memory access pattern of tiled matrix multiply.
    PyTorch's @ operator calls cuBLAS, which does this internally.
    This version makes the K-dimension tiling explicit to illustrate the
    shared memory pattern.

    Each tile of C is computed by:
    1. Loading TILE x TILE of A into shared memory
    2. Loading TILE x TILE of B into shared memory
    3. Computing the partial dot product
    4. Repeating across the K dimension
    5. Writing the result once to global memory

    Memory traffic analysis (element reads from HBM):
    - Naive: each output element reads K values of A and K of B
      -> 2*M*N*K reads in total
    - Tiled: each tile of A is read once per tile-column of C and each
      tile of B once per tile-row of C -> 2*M*N*K / TILE reads in total
    - Reduction factor: TILE (32x for TILE=32)
    """
    device = 'cuda'
    A = torch.randn(M, K, dtype=dtype, device=device)
    B = torch.randn(K, N, dtype=dtype, device=device)

    # Reference result: PyTorch mm uses tiled cuBLAS kernels internally
    C_ref = torch.mm(A, B)

    # Manual K-tiled version (illustrates the tiling pattern, not performance).
    # Each iteration corresponds to one shared-memory staging step in a CUDA kernel:
    # partial products accumulate without touching HBM until the end.
    C_manual = torch.zeros(M, N, dtype=dtype, device=device)
    n_k_tiles = (K + TILE - 1) // TILE
    for k0 in range(0, K, TILE):
        A_tile = A[:, k0:k0 + TILE]   # analogous to staging a tile of A in shared memory
        B_tile = B[k0:k0 + TILE, :]   # analogous to staging a tile of B in shared memory
        C_manual += A_tile @ B_tile   # partial sums stay in the accumulator ("registers")

    memory_accesses_naive = 2 * M * N * K          # element reads
    memory_accesses_tiled = 2 * M * N * K // TILE  # element reads with tiling
    reduction_factor = memory_accesses_naive / memory_accesses_tiled

    print(f"Matrix dims: ({M}x{K}) @ ({K}x{N})")
    print(f"Tile size: {TILE}x{TILE}")
    print(f"Number of K-tiles: {n_k_tiles}")
    print(f"Naive HBM reads: {memory_accesses_naive:,} elements")
    print(f"Tiled HBM reads: {memory_accesses_tiled:,} elements")
    print(f"Memory traffic reduction: {reduction_factor:.1f}x")
    print()

    # Verify the tiled computation matches cuBLAS
    assert torch.allclose(C_ref, C_manual, atol=1e-2, rtol=1e-2), "Mismatch!"
    print("Verification passed (tiled result matches cuBLAS)")
    return C_ref

tiled_matmul_demo(M=2048, K=1024, N=2048, TILE=32)
Production Engineering Notes
1. Profile Before Optimizing
Never guess whether a kernel is memory-bound or compute-bound. Use NVIDIA Nsight Systems or torch.profiler to measure:
import torch
from torch.profiler import profile, record_function, ProfilerActivity
def profile_memory_intensive_op():
device = 'cuda'
# Simulate a memory-bound operation: batch normalization
batch_size, channels, height, width = 32, 256, 64, 64
x = torch.randn(batch_size, channels, height, width, device=device, dtype=torch.float16)
bn = torch.nn.BatchNorm2d(channels).to(device).half()
# Warm up
for _ in range(10):
_ = bn(x)
torch.cuda.synchronize()
with profile(
activities=[ProfilerActivity.CUDA, ProfilerActivity.CPU],
record_shapes=True,
with_flops=True,
profile_memory=True
) as prof:
with record_function("batch_norm_forward"):
for _ in range(50):
y = bn(x)
torch.cuda.synchronize()
print(prof.key_averages().table(
sort_by="cuda_time_total",
row_limit=10
))
# Export a Chrome trace for timeline inspection
prof.export_chrome_trace("trace.json")
print("\nTrace saved to trace.json - open in chrome://tracing or Perfetto UI")
profile_memory_intensive_op()
2. Check Memory Bandwidth Utilization in Your Training Loop
import torch
import time
from dataclasses import dataclass
from typing import Optional
@dataclass
class BandwidthStats:
operation: str
elapsed_ms: float
bytes_accessed: int
bandwidth_tb_s: float
theoretical_peak_tb_s: float
utilization_pct: float
def estimate_bytes_for_module(module: torch.nn.Module,
input_tensor: torch.Tensor) -> int:
"""
Rough estimate of HBM bytes touched during a forward pass.
Counts: parameters read once + input read + output written.
Ignores intermediate activations (conservative estimate).
"""
param_bytes = sum(p.numel() * p.element_size() for p in module.parameters())
input_bytes = input_tensor.numel() * input_tensor.element_size()
# Output size (approximate: assume same size as input for most layers)
with torch.no_grad():
try:
out = module(input_tensor)
output_bytes = out.numel() * out.element_size()
except Exception:
output_bytes = input_bytes # fallback
# Total: weights + input + output (all must be read/written from/to HBM)
return param_bytes + input_bytes + output_bytes
def benchmark_layer_bandwidth(module: torch.nn.Module,
input_tensor: torch.Tensor,
n_trials: int = 50,
theoretical_peak_tb_s: float = 3.35) -> BandwidthStats:
device = next(module.parameters()).device
# Warm up
with torch.no_grad():
for _ in range(10):
_ = module(input_tensor)
torch.cuda.synchronize()
bytes_accessed = estimate_bytes_for_module(module, input_tensor)
torch.cuda.synchronize()
t0 = time.perf_counter()
with torch.no_grad():
for _ in range(n_trials):
out = module(input_tensor)
torch.cuda.synchronize()
elapsed = (time.perf_counter() - t0) / n_trials
bandwidth = bytes_accessed / elapsed / 1e12
return BandwidthStats(
operation=module.__class__.__name__,
elapsed_ms=elapsed * 1000,
bytes_accessed=bytes_accessed,
bandwidth_tb_s=bandwidth,
theoretical_peak_tb_s=theoretical_peak_tb_s,
utilization_pct=bandwidth / theoretical_peak_tb_s * 100
)
# Example: benchmark common layers
device = 'cuda'
# Linear layer: classic memory-bound at small batch sizes
linear = torch.nn.Linear(4096, 4096, bias=False).to(device).to(torch.bfloat16)
x_small = torch.randn(1, 4096, dtype=torch.bfloat16, device=device) # batch=1, memory-bound
x_large = torch.randn(512, 4096, dtype=torch.bfloat16, device=device) # batch=512, more compute-bound
stats_small = benchmark_layer_bandwidth(linear, x_small)
stats_large = benchmark_layer_bandwidth(linear, x_large)
print(f"Linear(4096->4096), batch=1: {stats_small.bandwidth_tb_s:.2f} TB/s ({stats_small.utilization_pct:.0f}% utilization)")
print(f"Linear(4096->4096), batch=512: {stats_large.bandwidth_tb_s:.2f} TB/s ({stats_large.utilization_pct:.0f}% utilization)")
3. Use torch.compile for Kernel Fusion
Many memory-bound operations can be fused by torch.compile, reducing the number of HBM round-trips:
import torch
def unfused_ops(x: torch.Tensor, weight: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
"""
Unfused: each op is a separate kernel = separate HBM read+write.
Total HBM trips: 4 reads + 4 writes = 8 HBM round trips for activation pipeline.
"""
x = torch.nn.functional.linear(x, weight, bias) # HBM read weight+x, write x
x = x / (1 + torch.exp(-x)) # SiLU: x * sigmoid(x) - HBM read x, write x
x = x - x.mean(dim=-1, keepdim=True) # mean - HBM read x, write x
x = x / (x.std(dim=-1, keepdim=True) + 1e-5) # std - HBM read x, write x
return x
@torch.compile
def fused_ops(x: torch.Tensor, weight: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
"""
torch.compile fuses these ops into a single kernel where possible.
The normalized + activated values flow through registers and shared memory
without intermediate HBM writes. Reduces HBM trips from 8 to ~2.
"""
x = torch.nn.functional.linear(x, weight, bias)
x = x / (1 + torch.exp(-x))
x = x - x.mean(dim=-1, keepdim=True)
x = x / (x.std(dim=-1, keepdim=True) + 1e-5)
return x
# Benchmark fusion benefit
device = 'cuda'
batch, seq, dim = 8, 2048, 4096
x = torch.randn(batch, seq, dim, dtype=torch.bfloat16, device=device)
W = torch.randn(dim, dim, dtype=torch.bfloat16, device=device)
b = torch.zeros(dim, dtype=torch.bfloat16, device=device)
import time
# Warm up compiled version (first call triggers compilation)
_ = fused_ops(x, W, b)
torch.cuda.synchronize()
n = 50
torch.cuda.synchronize()
t0 = time.perf_counter()
for _ in range(n):
y = unfused_ops(x, W, b)
torch.cuda.synchronize()
t_unfused = (time.perf_counter() - t0) / n * 1000
torch.cuda.synchronize()
t0 = time.perf_counter()
for _ in range(n):
y = fused_ops(x, W, b)
torch.cuda.synchronize()
t_fused = (time.perf_counter() - t0) / n * 1000
print(f"Unfused: {t_unfused:.2f}ms")
print(f"Fused (compiled): {t_fused:.2f}ms")
print(f"Speedup: {t_unfused/t_fused:.2f}x")
4. Monitor and Prevent Memory Fragmentation
import torch
import gc
def print_memory_stats(label=""):
"""Utility: print current GPU memory allocation and fragmentation stats."""
if not torch.cuda.is_available():
return
allocated = torch.cuda.memory_allocated() / 1e9
reserved = torch.cuda.memory_reserved() / 1e9
max_allocated = torch.cuda.max_memory_allocated() / 1e9
# Fragmentation = reserved but not allocated
fragmentation = reserved - allocated
frag_pct = fragmentation / reserved * 100 if reserved > 0 else 0
print(f"[{label}] Allocated: {allocated:.2f}GB, Reserved: {reserved:.2f}GB, "
f"Fragmentation: {fragmentation:.2f}GB ({frag_pct:.0f}%)")
def defrag_if_needed(threshold_gb=2.0):
"""
If fragmentation exceeds threshold, force a garbage collection.
This is a last resort - better to avoid fragmentation through
consistent allocation sizes.
"""
reserved = torch.cuda.memory_reserved() / 1e9
allocated = torch.cuda.memory_allocated() / 1e9
fragmentation = reserved - allocated
if fragmentation > threshold_gb:
gc.collect()
torch.cuda.empty_cache()
print(f"Defragmented: freed {fragmentation:.1f}GB of fragmented memory")
return fragmentation
print_memory_stats("initial")
# Simulate allocation pattern that causes fragmentation
tensors = [torch.randn(1024, 1024, device='cuda') for _ in range(10)]
print_memory_stats("after allocations")
del tensors[::2] # delete every other tensor - creates fragmentation
print_memory_stats("after partial delete")
defrag_if_needed()
print_memory_stats("after defrag")
Common Mistakes
:::danger Assuming L2 Cache Is Sufficient
L2 cache on an H100 is 50MB. This sounds large but it gets exhausted quickly. A GPT-3-scale attention layer's KV cache for a single 2048-token sequence (96 heads, 128 head-dim, BF16) is approximately 2 x 2048 x 96 x 128 x 2 bytes ≈ 100MB. It does not fit. Do not assume your working set fits in L2 - always calculate: seq_len * n_heads * head_dim * 2 (K+V) * bytes_per_element. If it exceeds 50MB, your attention kernel will hit HBM on every access. This is why FlashAttention's tiling strategy exists.
:::
:::danger Strided Memory Access Kills Bandwidth
Transposing a matrix or permuting tensor dimensions without .contiguous() creates a non-contiguous tensor. Subsequent operations on non-contiguous tensors access memory with strides, not sequentially. The GPU fetches full 128-byte cache lines and uses only a few bytes from each one. A single .permute() followed by heavy computation can reduce effective memory bandwidth by 10-30x. Always call .contiguous() after permutations that precede memory-intensive ops. Measure with tensor.is_contiguous() and tensor.stride().
:::
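A quick check of the layout rules in practice (a minimal sketch; the permute itself is free, the cost shows up in whatever touches the non-contiguous tensor afterwards):

```python
import torch

x = torch.randn(4096, 4096, device='cuda')
y = x.permute(1, 0)              # no data movement: only the strides are swapped
print(y.is_contiguous())         # False
print(x.stride(), y.stride())    # (4096, 1) vs (1, 4096) - last dim no longer stride 1

y_fixed = y.contiguous()         # one-time copy that restores a sequential layout
print(y_fixed.is_contiguous())   # True
```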
:::warning Register Spilling Ruins Kernel Performance
If your CUDA kernel (including Triton kernels) uses too many variables, the compiler spills registers to global memory. Each spill access costs 500+ cycles instead of 1 cycle. In Triton kernels, watch for this when using very large tile sizes (BLOCK_M, BLOCK_N > 128). Profile with ncu --metrics l1tex__t_sectors_pipe_lsu_mem_local_op_ld.sum - any nonzero value means register spills. The fix is usually to reduce tile size or split the kernel into stages.
:::
:::warning Small Batch Sizes Are Always Memory-Bound
During inference with batch_size=1 (single user request), every operation is memory-bound. A linear layer y = Wx with W shape [4096, 4096] in BF16 requires reading 32MB of weights plus 8KB of input for roughly 33.5 MFLOPs of work - an arithmetic intensity of about 1 FLOP/byte. At 3.35 TB/s HBM bandwidth, maximum throughput is 3.35 TFLOPS - less than 0.1% of the H100's 3958 TFLOPS Tensor Core throughput. This is the batch=1 inference "memory wall" and the reason batching requests is so important for GPU utilization.
:::
:::warning torch.cuda.synchronize() Is Required for Accurate Timing
Never time GPU operations without torch.cuda.synchronize() before and after the timed region. GPU operations are asynchronous: the Python call returns immediately while the GPU continues working. Without synchronization, you are measuring CPU dispatch latency (microseconds) not GPU execution time (milliseconds). This makes every operation look 100x faster than it is. The pattern is always: sync() -> t0 = time.perf_counter() -> op() -> sync() -> t1 = time.perf_counter().
:::
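A small helper that bakes the pattern in (a sketch using CUDA events for on-device timing; `time.perf_counter()` bracketed by `synchronize()` works equally well):

```python
import torch

def time_gpu(fn, n_warmup=10, n_trials=50):
    """Time a GPU callable correctly: warm up, then synchronize around the timed region."""
    for _ in range(n_warmup):
        fn()
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(n_trials):
        fn()
    end.record()
    torch.cuda.synchronize()                      # wait for all queued work before reading the timer
    return start.elapsed_time(end) / n_trials     # milliseconds per call

x = torch.randn(4096, 4096, device='cuda', dtype=torch.bfloat16)
print(f"matmul: {time_gpu(lambda: x @ x):.3f} ms")
```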
:::warning Contiguous vs Non-Contiguous Tensors Have Different Memory Layouts
tensor.view() requires contiguous memory and raises an error if the requested view is not possible. tensor.reshape() silently falls back to making a full copy when a view is not possible, which hides an extra allocation and HBM round-trip and can cause unexpected performance cliffs in downstream operations. Best practice: after any operation that might create non-contiguous tensors (permute, transpose, narrow, expand), explicitly call .contiguous() before the next memory-intensive operation. Check with tensor.is_contiguous() and tensor.stride() to understand what is actually happening.
:::
Interview Questions and Answers
Q1: What are the latency and bandwidth numbers for each level of the GPU memory hierarchy on a modern data center GPU like the H100?
Answer: The GPU memory hierarchy has five practical levels:
- Registers: 1 clock cycle latency, effectively zero overhead. Each thread gets its own set. 65,536 32-bit registers per SM on H100. Total register file: ~33MB across all 132 SMs. The key constraint is occupancy: if each thread uses many registers, fewer warps fit on the SM simultaneously, reducing latency hiding ability.
- Shared Memory / L1 Cache: approximately 32 clock cycles. Up to 228KB per SM on H100, shared among all threads in a block. Software-managed scratchpad. Also serves as hardware-managed L1 data cache for global memory accesses. Cache line size is 128 bytes. Bank conflicts (32 banks, 4 bytes each) can serialize what should be parallel accesses.
- L2 Cache: approximately 200 clock cycles. 50MB on H100 (was 40MB on A100). Chip-wide - all SMs share it. Internal bandwidth to SMs approximately 12 TB/s. If your working set fits here, you get 3.5x more bandwidth than HBM.
- HBM (High Bandwidth Memory): approximately 500-700 clock cycles. 80GB capacity on H100 SXM, 3.35 TB/s bandwidth (HBM3). This is where all model weights live when not actively computed on. The memory wall lives here - most training and inference bottlenecks are caused by HBM access patterns.
- Host Memory / PCIe: tens of thousands of cycles. PCIe 4.0 provides ~32 GB/s bidirectional, PCIe 5.0 ~64 GB/s. NVLink for GPU-to-GPU: 900 GB/s on H100. The host-to-device transfer is so slow that avoiding it (pinned memory, prefetching, CUDA-aware MPI) is a core systems engineering concern.
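To see that last level concretely, here is a small sketch that measures host-to-device copy bandwidth with pageable versus pinned host memory (the exact numbers depend on your PCIe generation and system):

```python
import torch
import time

def h2d_bandwidth(src: torch.Tensor, n_trials: int = 20) -> float:
    """Host-to-device copy bandwidth in GB/s."""
    dst = torch.empty(src.shape, dtype=src.dtype, device='cuda')
    dst.copy_(src)                       # warm up
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(n_trials):
        dst.copy_(src, non_blocking=True)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - t0
    return n_trials * src.numel() * src.element_size() / elapsed / 1e9

pageable = torch.randn(256 * 1024 * 1024 // 4)   # 256MB float32 in ordinary (pageable) host memory
pinned = pageable.pin_memory()                   # page-locked host memory enables async DMA

print(f"Pageable host -> device: {h2d_bandwidth(pageable):.1f} GB/s")
print(f"Pinned host   -> device: {h2d_bandwidth(pinned):.1f} GB/s")
# Pinned memory typically transfers noticeably faster and lets the copy overlap with compute.
```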
Q2: What is the difference between shared memory and L1 cache in a modern NVIDIA GPU? When would you use each?
Answer: Physically, they are the same SRAM. On H100, each SM has a single pool of high-speed on-chip SRAM. The CUDA runtime (together with your kernel's launch configuration) determines how that pool is split.
Shared memory is software-managed. You allocate it with __shared__ in CUDA (in Triton, the compiler stages block loads through shared memory on your behalf). You explicitly load data into it, synchronize threads to ensure the load is complete, compute with it, and write results out. You have total control over what lives there. This is optimal for algorithms with predictable, structured reuse patterns like matrix multiply (load tiles of A and B, reuse them across inner loop iterations) or FFT (butterfly patterns).
L1 cache is hardware-managed. The GPU's memory controller automatically caches recently-accessed global memory in L1. You do not control it. This is useful for algorithms with irregular or unpredictable access patterns - like sparse attention, tree traversal, or lookup tables - where you cannot easily predict and manually stage the data.
The trade-off: more shared memory allocation means less L1 cache space, and vice versa. You configure the split with cudaFuncSetAttribute(kernel, cudaFuncAttributePreferredSharedMemoryCarveout, pct).
For production ML kernels: allocate as much shared memory as your tiling strategy requires, leave the rest for L1. Measure both configurations with the profiler.
Q3: Explain the roofline model. Given an operation with arithmetic intensity of 10 FLOP/byte, is it memory-bound or compute-bound on an H100?
Answer: The roofline model expresses the maximum achievable performance of a kernel as a function of its arithmetic intensity.
Arithmetic intensity is FLOPs divided by bytes of memory traffic: $I = \text{FLOPs} / \text{bytes moved}$.
The roofline equation is:

$$P_{\text{attainable}} = \min\left(P_{\text{peak}},\; I \times B_{\text{peak}}\right)$$

where $B_{\text{peak}}$ is memory bandwidth and $P_{\text{peak}}$ is peak compute throughput.
The ridge point is the intensity where the two bounds are equal: $I_{\text{ridge}} = P_{\text{peak}} / B_{\text{peak}}$.
For H100 with BF16 Tensor Cores:
- $P_{\text{peak}} = 3958$ TFLOPS
- $B_{\text{peak}} = 3.35$ TB/s
- $I_{\text{ridge}} = 3958 / 3.35 \approx 1181$ FLOP/byte
For an operation with $I = 10$ FLOP/byte:
- Memory-bound limit: $10 \times 3.35 = 33.5$ TFLOPS
- Compute-bound limit: $3958$ TFLOPS
- Attainable: $\min(33.5, 3958) = 33.5$ TFLOPS
This is clearly memory-bound - the operation is running at 33.5 / 3958 = 0.85% of peak compute. Optimizing the CUDA kernel to add more FLOPs does nothing. Reducing memory traffic (fusion, caching, compression) is the only way to improve performance.
Examples of operations at $I \approx 10$ FLOP/byte or below: layer normalization, softmax, many element-wise operations, batched vector-matrix products at small batch sizes.
Q4: What is memory coalescing and why does it matter? How do you check if your operations are coalesced in PyTorch?
Answer: Memory coalescing describes whether the 32 threads in a GPU warp access memory in a pattern that the memory controller can serve efficiently.
When a warp of 32 threads each accesses a 4-byte value, and those 32 values are contiguous in memory (addresses 0, 4, 8, ..., 124), they fit exactly in one 128-byte cache line. The memory controller issues one fetch, serves all 32 threads, wastes nothing. This is a coalesced access.
When those 32 values are scattered (addresses 0, 128, 256, ..., 3968 - stride 128), each value is in a different cache line. The controller must issue 32 separate 128-byte fetches to serve 32 values of 4 bytes each. Memory traffic: 32 x 128 = 4096 bytes. Useful data: 32 x 4 = 128 bytes. Utilization: 3.1%. This is a non-coalesced (strided) access.
In PyTorch, coalescing depends on tensor layout. The rules:
- A tensor is contiguous if tensor.is_contiguous() returns True
- After .permute() or .transpose(), the tensor is non-contiguous
- Operations on non-contiguous tensors often trigger strided (non-coalesced) memory access
- .contiguous() creates a new contiguous copy in memory
To verify: tensor.stride() shows the stride in elements between consecutive values along each dimension. Stride 1 on the last dimension means sequential (coalesced) access. Larger strides indicate potentially non-coalesced access.
For production diagnosis, use NVIDIA Nsight Compute with ncu --metrics l1tex__average_t_sectors_per_request_pipe_lsu_mem_global_op_ld.ratio. A value near 1.0 means well-coalesced access. Higher values mean wasted cache line fetches.
Q5: You are profiling an LLM inference system and notice that GPU utilization is 95% but throughput is much lower than expected. What memory-related issues would you investigate, and what tools would you use?
Answer: High GPU utilization with low throughput is a classic symptom of being memory-bound - the GPU is busy, but mostly waiting for memory rather than computing. Here is the investigation approach:
Step 1 - Determine if compute or memory bound: Use ncu (NVIDIA Nsight Compute):
ncu --metrics sm__throughput.avg.pct_of_peak_sustained_elapsed,l1tex__throughput.avg.pct_of_peak_sustained_elapsed python inference.py
If SM throughput is low but L1/HBM throughput is high, you are memory-bound.
Step 2 - Check memory bandwidth utilization: ncu --metrics l1tex__t_bytes_pipe_lsu_mem_global_op_ld.sum.per_second gives actual HBM read bandwidth. Compare to 3.35 TB/s. If you are achieving < 60% of theoretical bandwidth on memory-bound kernels, there is likely non-coalesced access.
Step 3 - Look for non-coalesced access: ncu --metrics l1tex__average_t_sectors_per_request_pipe_lsu_mem_global_op_ld.ratio. This measures how many 32-byte sectors per access request (ideal = 1, strided = up to 32).
Step 4 - Check register spills: ncu --metrics l1tex__t_sectors_pipe_lsu_mem_local_op_ld.sum. Any nonzero value is a red flag.
Step 5 - Identify the bottleneck kernel: Use PyTorch Profiler to find which specific operation dominates:
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CUDA]) as prof:
    model(input)
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
Common culprits in LLM inference:
- Attention computation with non-contiguous KV cache (fix: ensure [batch, heads, seq, head_dim] layout and .contiguous())
- Batch size = 1 creating memory-bound GEMM (fix: batch requests)
- KV cache eviction causing repeated HBM round-trips (fix: PagedAttention or pre-allocated ring buffer)
- Unfused operations with unnecessary intermediate writes (fix: torch.compile or custom Triton kernel)
Summary - What You Must Remember
The five principles that govern GPU memory performance:
- Data lives in HBM by default. Every optimization moves data closer to compute. Shared memory tiling, register accumulation, and L2 residency are all ways of keeping frequently-used data in faster storage.
- The memory wall is real. At batch size 1, an H100's 3958 TFLOPS of BF16 compute is meaningless. The 3.35 TB/s HBM bandwidth limits you to at most ~30 TFLOPS for memory-bound ops. Most LLM inference tokens are generated in this regime.
- 128-byte cache lines are the unit of HBM access. Every access fetches 128 bytes. Sequential access uses all 128 bytes. Strided access wastes most of them. Coalescing is not optional for high-performance kernels.
- Arithmetic intensity determines which roofline applies. Below the ridge point (~1181 FLOP/byte on H100 Tensor Cores): memory-bound, optimize memory traffic. Above it: compute-bound, optimize parallelism.
- Measure, do not assume. The fastest code is not always the most elegant. Use ncu to verify your assumptions. Memory bandwidth utilization below 60% on a memory-bound kernel means there is room to improve - and usually a specific cause like non-coalesced access or unnecessary intermediate writes.
The engineers who built FlashAttention understood every word of this lesson. Their contribution was applying it systematically to attention - carefully counting bytes moved, eliminating redundant HBM round-trips through tiling, and keeping intermediate results in faster memory. The same principles apply to any kernel you will ever write.