Global, Shared, and Register Memory
Reading time: ~40 min - Interview relevance: Very High - Target roles: CUDA Developer, Kernel Engineer, ML Infrastructure
The classic optimization pattern in CUDA: load from slow global memory into fast shared memory, compute in registers, write back to global memory. Every efficient CUDA kernel is a variation on this pattern.
The 8% Problem
You launch a matrix multiply kernel. The math looks right - outputs check out against NumPy. You profile it with Nsight Compute and the first number you see is 8.3% memory bandwidth utilization.
Your H100 can push 3.35 TB/s to and from HBM. You are using 278 GB/s. The SMs spend most of their time stalled, waiting for data to arrive.
The kernel code itself is fine. The problem is where the data lives when the GPU is computing. Every multiply-add instruction is reading directly from global memory (HBM) - the slowest memory tier in the entire GPU. The latency for each load is roughly 400 cycles. Your threads issue a load, wait 400 cycles, do one FMA, issue another load, wait 400 cycles.
The fix is not algorithmic. The math stays the same. The fix is moving data up the memory hierarchy before you start computing on it. Load a tile of A and B into shared memory - 32 cycles latency instead of 400. Compute all the dot products for that tile entirely from fast on-chip memory. Write one result back to global memory.
That single change - from global memory reads to shared memory reads - typically improves memory bandwidth utilization from 8% to 70-85% on matrix multiply workloads. This lesson is about how and why that works, and how to apply the same pattern to any compute kernel.
Why This Exists: The Memory Wall
In 1994, William Wulf and Sally McKee published a paper describing a fundamental problem in computer architecture: processor speed was growing at roughly 60% per year while memory bandwidth was growing at only 7% per year. They called it "the memory wall." Processors were becoming so fast that they were constantly idle, waiting for data to arrive from DRAM.
GPU architects hit this wall hard. A modern H100 GPU has 16,896 CUDA cores, each capable of executing a multiply-add per clock cycle at 1.98 GHz. That is 16,896 x 1.98 x 10^9 x 2 = 66.9 TFLOPS of compute. But HBM3 bandwidth is 3.35 TB/s. Each FP32 operation needs 8 bytes of input (two float32s). Peak compute-limited throughput implies a memory requirement of 66.9 x 10^12 x 8 = 535 TB/s.
The memory bandwidth is 160x smaller than what peak compute requires. This ratio is called the arithmetic intensity requirement, and it is why memory optimization is the first thing you address in CUDA performance engineering.
The solution the hardware provides is a tiered memory hierarchy: fast but small memory close to the cores, slow but large memory far away. Your job as a CUDA programmer is to explicitly move data between these tiers in a way that keeps the arithmetic units fed.
The Five Memory Spaces: Real Numbers
Understanding CUDA memory optimization requires understanding exactly what each tier costs. These numbers are for an H100 SXM5.
| Memory Tier | Latency | Bandwidth | Size | Scope |
|---|---|---|---|---|
| Registers | ~1 cycle | ~20 TB/s | 256 KB / SM | Per-thread |
| Shared Memory | ~32 cycles | ~19 TB/s | 228 KB / SM | Per-block |
| L1 + Texture | ~40 cycles | ~19 TB/s | 256 KB / SM | Per-SM |
| L2 Cache | ~200 cycles | ~12 TB/s | 50 MB total | All SMs |
| Global (HBM) | ~400 cycles | 3.35 TB/s | 80 GB | All SMs |
| Constant Memory | ~40 cycles | (broadcast) | 64 KB | Read-only |
| Local Memory | ~400 cycles | 3.35 TB/s | (per-thread) | Per-thread (spill) |
The important thing to notice: registers and shared memory are one to two orders of magnitude lower latency than global memory (roughly 400x and 12x lower, respectively). Registers are effectively free - the compiler allocates them and the hardware reads them in the same cycle as the instruction that needs them. Shared memory at 32 cycles is fast enough that you can do a round-trip load-compute-store pattern efficiently.
Global memory at 400 cycles is the bottleneck. If your threads are issuing individual loads from global memory and waiting for them before computing, you will use a fraction of the available compute.
Registers: The Fastest Tier
Registers are the most important memory tier that most tutorials barely mention. They are where your computation actually happens.
Every thread gets its own private set of registers. When you write:
float a = matrix_a[row * K + k];
float b = matrix_b[k * N + col];
c += a * b;   // c was declared earlier, e.g. float c = 0.0f;
The variables a, b, and c live in registers. The FMA instruction (c += a * b) reads both inputs and writes the result entirely from/to registers - zero extra latency beyond the instruction itself.
Register File Size and Occupancy
Each SM on an H100 has 65,536 32-bit registers (256 KB). These are shared among all threads currently resident on that SM. The number of threads per SM is bounded by how many registers your kernel uses.
If your kernel uses 32 registers per thread, you can fit 65536 / 32 = 2048 threads per SM. This is typical - the hardware target for full occupancy. If your kernel uses 64 registers per thread (a complex kernel with many live variables), you can only fit 1024 threads per SM - 50% occupancy.
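You do not have to do this arithmetic by hand - the runtime occupancy API reports how many blocks of a given size fit on an SM for a compiled kernel. A minimal sketch (my_kernel and the block size are placeholders):
#include <cuda_runtime.h>
#include <cstdio>
__global__ void my_kernel(float* data) { /* ... */ }
int main() {
    int blocks_per_sm = 0;
    int block_size = 256;        // threads per block
    size_t dyn_smem_bytes = 0;   // dynamic shared memory per block
    // Accounts for the kernel's register usage, shared memory, and the device's hardware limits
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, my_kernel,
                                                  block_size, dyn_smem_bytes);
    printf("Resident blocks per SM: %d (%d threads)\n",
           blocks_per_sm, blocks_per_sm * block_size);
    return 0;
}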
The compiler decides register allocation automatically. You can see it with:
nvcc -Xptxas -v my_kernel.cu
This prints something like:
ptxas info: Used 32 registers, 4096 bytes smem, 400 bytes cmem[0]
You can also cap registers with __launch_bounds__ to control occupancy explicitly:
__global__ __launch_bounds__(256, 4) void my_kernel(...) {
// max 256 threads per block, minimum 4 blocks per SM
// compiler will try to keep registers <= 65536/(256*4) = 64 per thread
}
Register Spilling: The Silent Killer
What happens when your kernel needs more registers than are available? The compiler spills register contents to local memory - which despite the name, is physically located in global HBM. 400 cycle latency.
Register spilling is the silent killer of kernel performance. The kernel gives correct results, profiling shows poor performance, and the symptom is invisible unless you check the ptxas output for "spilled" registers.
# See register spilling
nvcc -Xptxas -v,-warn-spills my_kernel.cu
The output will show:
ptxas info: Used 96 registers, 4 bytes lmem, ...
lmem is local memory - register spill. Even a few bytes of spill can halve kernel performance because those accesses serialize through the L2 cache and HBM.
To fix spilling: simplify your kernel (break into multiple kernel launches), reduce loop unroll depth, or restructure the computation to use fewer live variables at any given point.
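One concrete knob, sketched below: bounding the unroll depth of a long inner loop limits how many partial results the compiler keeps live at once (the kernel itself is illustrative, not one from this lesson):
__global__ void sum_of_squares(const float* __restrict__ in, float* __restrict__ out, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0.0f;
    // An aggressive unroll can keep many partial values live and push past the register budget;
    // a fixed, small unroll factor keeps register pressure predictable.
    #pragma unroll 4
    for (int i = idx; i < n; i += gridDim.x * blockDim.x) {
        sum += in[i] * in[i];
    }
    if (idx < n) out[idx] = sum;
}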
Shared Memory: Software-Managed Cache
Shared memory is the most important tool for CUDA performance optimization. It is a small, fast, on-chip memory that all threads in a block can read and write. Unlike L1 cache (which is hardware-managed and opaque to the programmer), shared memory is under your complete control.
Declaration
Static shared memory - size known at compile time:
__global__ void kernel_static_smem() {
__shared__ float tile[32][32]; // 32*32*4 = 4096 bytes of smem
__shared__ int count; // 4 bytes of smem
// All threads in the block share this tile
int tid = threadIdx.x + threadIdx.y * blockDim.x;
tile[threadIdx.y][threadIdx.x] = some_value();
__syncthreads(); // MUST sync before reading from smem
float val = tile[threadIdx.x][threadIdx.y]; // transposed read
}
Dynamic shared memory - size specified at kernel launch:
__global__ void kernel_dynamic_smem(const float* global_input, int n) {
    extern __shared__ float shared_data[];   // size determined at launch
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        shared_data[threadIdx.x] = global_input[idx];
    }
    __syncthreads();
    // ... use shared_data
}
// Launch with 3rd argument = bytes of dynamic shared memory (here, one float per thread)
// d_input is a device pointer to the input array
kernel_dynamic_smem<<<grid, block, block.x * sizeof(float)>>>(d_input, n);
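One related detail: a kernel that asks for more than 48 KB of dynamic shared memory per block must opt in before launch via cudaFuncSetAttribute (the 100 KB figure below is just an example):
// Opt in to a dynamic shared memory allocation above the default 48 KB per-block limit
int smem_bytes = 100 * 1024;
cudaFuncSetAttribute(kernel_dynamic_smem,
                     cudaFuncAttributeMaxDynamicSharedMemorySize,
                     smem_bytes);
kernel_dynamic_smem<<<grid, block, smem_bytes>>>(d_input, n);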
Shared Memory per SM
The H100 has up to 228 KB of L1/shared memory per SM. This is configurable - you can split it between L1 and shared memory using cudaFuncSetAttribute:
cudaFuncSetAttribute(my_kernel,
cudaFuncAttributePreferredSharedMemoryCarveout,
cudaSharedmemCarveoutMaxShared); // maximize shared memory
With large shared memory allocations, be aware that more smem per block means fewer blocks per SM, which reduces occupancy. There is always a tradeoff.
__syncthreads(): Why It Is Non-Negotiable
Shared memory is shared by all threads in a block. Threads execute independently and do not stay in lockstep. Without synchronization, one thread might try to read a value from shared memory before another thread has written it.
__syncthreads() is a barrier: all threads in the block must reach this point before any thread can continue past it. This guarantees that all writes before the barrier are visible to all reads after it.
The canonical usage pattern:
// Step 1: cooperatively load data into shared memory
shared_tile[threadIdx.y][threadIdx.x] = global_A[...];
shared_tile_b[threadIdx.y][threadIdx.x] = global_B[...];
// Step 2: synchronize - ensure all threads finished writing
__syncthreads();
// Step 3: compute using shared memory (safe to read now)
float result = 0.0f;
for (int k = 0; k < TILE_SIZE; k++) {
result += shared_tile[threadIdx.y][k] * shared_tile_b[k][threadIdx.x];
}
// Step 4: synchronize before reusing shared memory for next tile
__syncthreads();
:::danger __syncthreads() Inside Conditionals
Never put __syncthreads() inside a conditional branch that not all threads will execute. If thread A reaches the barrier and thread B takes the other branch and never reaches it, thread A waits forever. The GPU deadlocks.
// WRONG - deadlock if some threads take the else branch
if (threadIdx.x < 16) {
shared[threadIdx.x] = val;
__syncthreads(); // Only half the threads reach this!
}
// CORRECT - sync is outside the conditional
shared[threadIdx.x] = (threadIdx.x < 16) ? val : 0.0f;
__syncthreads(); // All threads reach this
:::
The Canonical Example: Tiled Matrix Multiply
Matrix multiply is the canonical CUDA optimization example, not because it is the most complex problem, but because it is the clearest illustration of why memory hierarchy matters.
Why the Naive Kernel Is Slow
Consider multiplying two N x N matrices A and B to produce C, where C[i][j] = sum over k of A[i][k] * B[k][j].
The naive kernel has each thread compute one element of C (the complete kernel appears in the implementation below).
Each thread reads N elements from row i of A and N elements from column j of B - all from global memory. For N=1024, each thread issues 2048 global memory reads. Multiply by the 1024*1024 threads, and the naive kernel issues over 2 billion global memory accesses - each costing ~400 cycles.
Worse, each element of A and B is read by multiple threads. Element A[i][k] is needed by all N threads computing row i of C. Element B[k][j] is needed by all N threads computing column j of C. With N=1024, each element of A and B is loaded from global memory 1024 times.
This is the problem tiling solves.
The Tiling Insight
Instead of each thread independently loading all of A and B from global memory, we divide the work into tiles. A block of threads cooperatively loads a tile of A and a tile of B into shared memory, computes partial results, then moves to the next tile.
Within a tile, each element of A and B is loaded once from global memory (by one thread) and then read T times from shared memory (by T threads, where T is the tile size). The ratio of arithmetic to global memory loads improves by factor T.
For TILE_SIZE=32:
- Naive: 2N loads per thread x N*N threads = 2*N^3 global loads total
- Tiled: each thread loads one element of As and one of Bs per tile, or 2*(N/TILE_SIZE) loads per thread = 2*N^3/TILE_SIZE global loads total
- Each value loaded from global memory is then read TILE_SIZE times out of shared memory instead of being re-fetched from HBM
- Effective global memory load reduction: TILE_SIZE x fewer transactions
Complete Tiled Matrix Multiply Implementation
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>   // rand, RAND_MAX
#define TILE_SIZE 32
// Naive matrix multiply - for comparison
__global__ void matmul_naive(
const float* __restrict__ A,
const float* __restrict__ B,
float* __restrict__ C,
int M, int N, int K
) {
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;
if (row < M && col < N) {
float sum = 0.0f;
for (int k = 0; k < K; k++) {
sum += A[row * K + k] * B[k * N + col];
// Every access hits global memory (400 cycles)
}
C[row * N + col] = sum;
}
}
// Tiled matrix multiply using shared memory
__global__ void matmul_tiled(
const float* __restrict__ A,
const float* __restrict__ B,
float* __restrict__ C,
int M, int N, int K
) {
// Shared memory tiles - each block allocates two tiles
__shared__ float As[TILE_SIZE][TILE_SIZE];
__shared__ float Bs[TILE_SIZE][TILE_SIZE];
int row = blockIdx.y * TILE_SIZE + threadIdx.y;
int col = blockIdx.x * TILE_SIZE + threadIdx.x;
float sum = 0.0f; // accumulator lives in a register
// Iterate over tiles along the K dimension
int num_tiles = (K + TILE_SIZE - 1) / TILE_SIZE;
for (int t = 0; t < num_tiles; t++) {
// Step 1: Cooperatively load tile of A into shared memory
int a_col = t * TILE_SIZE + threadIdx.x;
if (row < M && a_col < K) {
As[threadIdx.y][threadIdx.x] = A[row * K + a_col];
} else {
As[threadIdx.y][threadIdx.x] = 0.0f; // boundary padding
}
// Step 2: Cooperatively load tile of B into shared memory
int b_row = t * TILE_SIZE + threadIdx.y;
if (b_row < K && col < N) {
Bs[threadIdx.y][threadIdx.x] = B[b_row * N + col];
} else {
Bs[threadIdx.y][threadIdx.x] = 0.0f; // boundary padding
}
// Step 3: Synchronize - all threads must finish loading before computing
__syncthreads();
// Step 4: Compute partial dot product from shared memory tiles
// All reads here hit shared memory (~32 cycles) not global (~400 cycles)
for (int k = 0; k < TILE_SIZE; k++) {
sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
}
// Step 5: Synchronize before loading next tile
// Prevents faster threads from overwriting smem before slower ones finish reading
__syncthreads();
}
// Write final result to global memory
if (row < M && col < N) {
C[row * N + col] = sum;
}
}
// Host function to run both kernels and compare timing
void run_matmul_benchmark(int M, int N, int K) {
size_t bytes_A = M * K * sizeof(float);
size_t bytes_B = K * N * sizeof(float);
size_t bytes_C = M * N * sizeof(float);
float *h_A, *h_B, *h_C_naive, *h_C_tiled;
float *d_A, *d_B, *d_C;
// Allocate host memory
h_A = (float*)malloc(bytes_A);
h_B = (float*)malloc(bytes_B);
h_C_naive = (float*)malloc(bytes_C);
h_C_tiled = (float*)malloc(bytes_C);
// Initialize with random values
for (int i = 0; i < M * K; i++) h_A[i] = (float)rand() / RAND_MAX;
for (int i = 0; i < K * N; i++) h_B[i] = (float)rand() / RAND_MAX;
// Allocate device memory
cudaMalloc(&d_A, bytes_A);
cudaMalloc(&d_B, bytes_B);
cudaMalloc(&d_C, bytes_C);
cudaMemcpy(d_A, h_A, bytes_A, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, bytes_B, cudaMemcpyHostToDevice);
dim3 block(TILE_SIZE, TILE_SIZE);
dim3 grid((N + TILE_SIZE - 1) / TILE_SIZE, (M + TILE_SIZE - 1) / TILE_SIZE);
// Warm-up launch so one-time startup overhead does not skew the naive timing
matmul_naive<<<grid, block>>>(d_A, d_B, d_C, M, N, K);
cudaDeviceSynchronize();
// Time naive kernel
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);
matmul_naive<<<grid, block>>>(d_A, d_B, d_C, M, N, K);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float naive_ms;
cudaEventElapsedTime(&naive_ms, start, stop);
cudaMemcpy(h_C_naive, d_C, bytes_C, cudaMemcpyDeviceToHost);
// Time tiled kernel
cudaEventRecord(start);
matmul_tiled<<<grid, block>>>(d_A, d_B, d_C, M, N, K);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float tiled_ms;
cudaEventElapsedTime(&tiled_ms, start, stop);
printf("Matrix size: %dx%dx%d\n", M, N, K);
printf("Naive kernel: %.3f ms\n", naive_ms);
printf("Tiled kernel: %.3f ms (%.1fx speedup)\n",
tiled_ms, naive_ms / tiled_ms);
// Verify correctness
float max_err = 0.0f;
cudaMemcpy(h_C_tiled, d_C, bytes_C, cudaMemcpyDeviceToHost);
for (int i = 0; i < M * N; i++) {
max_err = fmaxf(max_err, fabsf(h_C_naive[i] - h_C_tiled[i]));
}
printf("Max error: %e\n", max_err);
cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
free(h_A); free(h_B); free(h_C_naive); free(h_C_tiled);
cudaEventDestroy(start); cudaEventDestroy(stop);
}
int main() {
run_matmul_benchmark(1024, 1024, 1024);
run_matmul_benchmark(2048, 2048, 2048);
return 0;
}
Tile Size Choices
The tile size determines the tradeoff between shared memory usage and bandwidth savings.
TILE_SIZE = 16:
- Shared memory: 2 x 16 x 16 x 4 bytes = 2 KB per block
- Bandwidth reduction: 16x (each global load serves 16 threads)
- More blocks fit per SM (higher occupancy)
TILE_SIZE = 32:
- Shared memory: 2 x 32 x 32 x 4 bytes = 8 KB per block
- Bandwidth reduction: 32x (each global load serves 32 threads)
- Matches warp size exactly - best coalescing alignment
TILE_SIZE = 64:
- Shared memory: 2 x 64 x 64 x 4 bytes = 32 KB per block
- Bandwidth reduction: 64x
- Requires each thread to compute multiple outputs (a 64 x 64 tile has 4,096 elements, above the 1,024-thread block limit)
- May leave insufficient smem for multiple blocks per SM
For A100/H100, TILE_SIZE=32 is nearly always optimal for FP32 matmul. For FP16 with tensor cores, the tile sizes follow the tensor core instruction shapes (for example 16x16x16 for the WMMA API, 16x8x16 for the PTX mma instructions).
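A compile-time sanity check makes the budget arithmetic above explicit; this sketch simply restates it in code (228 KB is the H100 figure from the table earlier):
constexpr int TILE = 32;
// Two FP32 tiles (As + Bs) per block
constexpr size_t SMEM_PER_BLOCK = 2ull * TILE * TILE * sizeof(float);   // 8 KB
constexpr size_t SMEM_PER_SM    = 228ull * 1024;                        // H100 shared memory per SM
static_assert(SMEM_PER_BLOCK <= SMEM_PER_SM, "tile pair does not fit in shared memory");
// Upper bound on resident blocks per SM imposed by shared memory alone
constexpr int MAX_BLOCKS_BY_SMEM = SMEM_PER_SM / SMEM_PER_BLOCK;        // 28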
Python Wrapper: Measuring the Speedup
import time

# The CUDA kernels above can be compiled into a shared library with:
#   nvcc -O3 --shared -Xcompiler -fPIC -o libmatmul.so matmul.cu
# For a pure Python measurement, this script uses PyTorch as the harness.
import torch
import torch
def benchmark_matmul(M: int, N: int, K: int, device: str = "cuda") -> None:
"""
Compare naive global-memory matmul vs tiled shared-memory matmul.
Uses PyTorch as the harness with custom CUDA kernels loaded via torch.ops.
"""
A = torch.randn(M, K, device=device, dtype=torch.float32)
B = torch.randn(K, N, device=device, dtype=torch.float32)
# Warm up
torch.mm(A, B)
torch.cuda.synchronize()
# Measure PyTorch's optimized matmul (uses cuBLAS, which uses shared memory internally)
n_iters = 100
start = time.perf_counter()
for _ in range(n_iters):
C = torch.mm(A, B)
torch.cuda.synchronize()
elapsed_cublas = (time.perf_counter() - start) / n_iters * 1000
# Compute theoretical peak
flops = 2 * M * N * K # each output element: K multiplies + K adds
gflops_cublas = (flops / elapsed_cublas / 1e6)
print(f"Matrix: {M}x{K} @ {K}x{N}")
print(f"cuBLAS (shared memory): {elapsed_cublas:.3f} ms ({gflops_cublas:.1f} GFLOPS)")
print(f"H100 peak FP32: 66,900 GFLOPS")
print(f"Utilization: {gflops_cublas / 669:.1%}")
def demonstrate_register_spill_cost() -> None:
"""
Demonstrates that register spilling (local memory usage) hurts performance.
Compare a kernel with a large loop body (many live variables) vs a clean one.
"""
N = 4096
A = torch.randn(N, N, device="cuda", dtype=torch.float32)
# Strategy 1: one large N x N matmul
torch.cuda.synchronize()
t0 = time.perf_counter()
for _ in range(10):
C = torch.mm(A, A)
torch.cuda.synchronize()
t1 = time.perf_counter()
print(f"Single large matmul: {(t1-t0)/10*1000:.2f} ms")
# Strategy 2: chained low-rank matmuls (same output shape, roughly half the FLOPs)
B = torch.randn(N, N//4, device="cuda", dtype=torch.float32)
D = torch.randn(N//4, N, device="cuda", dtype=torch.float32)
torch.cuda.synchronize()
t0 = time.perf_counter()
for _ in range(10):
tmp = torch.mm(A, B) # N x N/4
C = torch.mm(tmp, D) # N x N
torch.cuda.synchronize()
t1 = time.perf_counter()
print(f"Chained matmuls: {(t1-t0)/10*1000:.2f} ms")
if __name__ == "__main__":
benchmark_matmul(1024, 1024, 1024)
benchmark_matmul(4096, 4096, 4096)
demonstrate_register_spill_cost()
Constant Memory: Read-Only Broadcast
Constant memory is a 64 KB read-only cache with a special property: when all threads in a warp read the same address, the hardware broadcasts it in a single cycle. This is called a broadcast read.
It is ideal for kernel parameters that all threads need to read but none write:
// Declare in global scope
__constant__ float convolution_kernel[256];
// Copy to constant memory before kernel launch
cudaMemcpyToSymbol(convolution_kernel, h_kernel, 256 * sizeof(float));
__global__ void convolve(const float* input, float* output, int width) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx + 256 > width) return;   // boundary guard
    // All threads reading convolution_kernel[k] in the same iteration
    // hit the constant cache broadcast - essentially free
    float result = 0.0f;
    for (int k = 0; k < 256; k++) {
        result += input[idx + k] * convolution_kernel[k];
        // ^^^ broadcast read, all threads read same k
    }
    output[idx] = result;
}
Use constant memory for: model weights in inference (small models), lookup tables, filter kernels, any array that all threads read identically indexed.
Shared Memory Bank Conflicts
Shared memory is organized into 32 banks, each 4 bytes wide. The bank index for an address is (address / 4) % 32. When two threads in a warp access different addresses that map to the same bank in the same instruction, the accesses are serialized - a bank conflict. (Threads reading the exact same address get a broadcast, which is conflict-free.)
In a TILE_SIZE=32 tile, conflicts appear when a warp reads down a column:
// As is declared As[32][32], row-major: element As[r][c] sits in bank (r*32 + c) % 32 = c % 32
// Row read:    As[threadIdx.y][k]  - every thread in the warp reads the same address -> broadcast, no conflict
// Row read:    Bs[k][threadIdx.x]  - thread x reads consecutive addresses -> 32 different banks, no conflict
// Column read: As[threadIdx.x][k]  - consecutive threads read addresses 32 floats apart,
//              and every one of them maps to bank k % 32 -> all 32 threads hit the same bank -> 32-way conflict
The standard fix is padding: add one extra column to the shared memory allocation.
// Without padding - potential conflict on column reads
__shared__ float As[TILE_SIZE][TILE_SIZE]; // 32x32 = conflict on row stride = 32
// With padding - no conflicts
__shared__ float As[TILE_SIZE][TILE_SIZE + 1]; // 32x33, each row starts at different bank
Why does padding work? A 32x32 float array has rows that are 32 x 4 = 128 bytes = 32 banks wide, so row N+1 starts at bank 0 again; a column read therefore lands every thread in the same bank. Adding one element (4 bytes) to each row makes it 33 x 4 = 132 bytes, so row N+1 starts at bank 1. Now column reads hit 32 different banks - no conflict.
// Full tiled matmul with bank-conflict-free padding
#define TILE_SIZE 32
__global__ void matmul_tiled_padded(
const float* __restrict__ A,
const float* __restrict__ B,
float* __restrict__ C,
int M, int N, int K
) {
// +1 padding eliminates bank conflicts on column accesses
__shared__ float As[TILE_SIZE][TILE_SIZE + 1];
__shared__ float Bs[TILE_SIZE][TILE_SIZE + 1];
int row = blockIdx.y * TILE_SIZE + threadIdx.y;
int col = blockIdx.x * TILE_SIZE + threadIdx.x;
float sum = 0.0f;
for (int t = 0; t < (K + TILE_SIZE - 1) / TILE_SIZE; t++) {
int a_col = t * TILE_SIZE + threadIdx.x;
int b_row = t * TILE_SIZE + threadIdx.y;
As[threadIdx.y][threadIdx.x] = (row < M && a_col < K) ?
A[row * K + a_col] : 0.0f;
Bs[threadIdx.y][threadIdx.x] = (b_row < K && col < N) ?
B[b_row * N + col] : 0.0f;
__syncthreads();
for (int k = 0; k < TILE_SIZE; k++) {
// As[threadIdx.y][k]: every thread in the warp reads the same address - broadcast, no conflict
// Bs[k][threadIdx.x]: consecutive addresses across the warp - no conflict
// The +1 padding is cheap insurance against conflicts on column-wise (transposed) accesses
sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
}
__syncthreads();
}
if (row < M && col < N) {
C[row * N + col] = sum;
}
}
Memory Hierarchy Access Flow
The full picture of how a tiled kernel moves data: a coalesced, cooperative load pulls one tile of A and one of B from global HBM into shared memory; each thread then reads its operands from shared memory into registers; the FMAs run entirely on registers, accumulating into a register; and only the finished element of C is written back to global memory.
Production Engineering Notes
Profiling shared memory usage:
ncu --metrics l1tex__data_pipe_lsu_wavefronts_mem_shared_op_ld,\
l1tex__data_pipe_lsu_wavefronts_mem_shared_op_st \
./my_kernel
Check for bank conflicts:
ncu --metrics l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld \
./my_kernel
A non-zero bank conflict count means your shared memory reads have conflicts. Add padding.
Occupancy analysis: Low occupancy is not always bad. A kernel using 64KB of shared memory per block will have low occupancy (1-2 blocks per SM), but if each block has enough arithmetic intensity to hide the reduced parallelism, it can still achieve good throughput.
Rule of thumb: if your kernel's compute intensity (FLOPs per byte) is high, low occupancy is fine. If it is memory-bound, maximize occupancy to hide latency.
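To put a rough number on "FLOPs per byte", here is a minimal sketch for FP32 matmul, assuming each matrix moves through HBM exactly once (i.e., perfect on-chip reuse):
// Arithmetic intensity of an (M x K) x (K x N) FP32 matmul under ideal reuse
double matmul_arithmetic_intensity(long long M, long long N, long long K) {
    double flops = 2.0 * M * N * K;                  // one multiply + one add per (i, j, k)
    double bytes = 4.0 * (M * K + K * N + M * N);    // read A and B, write C, once each
    return flops / bytes;                            // ~683 FLOPs/byte at 4096^3
}
Anything far above the ~20 FLOPs/byte implied by the H100 numbers earlier (66.9 TFLOPS / 3.35 TB/s) is compute-bound in principle; the naive kernel throws that advantage away by re-fetching operands from HBM.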
Double buffering: Advanced technique - use two sets of shared memory tiles and load tile t+1 while computing on tile t, so the global memory latency of the next load is hidden behind the current tile's arithmetic instead of stalling at __syncthreads().
// Double-buffering sketch: load_tile / compute_from_tiles stand in for the cooperative load and inner-product loops shown earlier
__shared__ float As[2][TILE_SIZE][TILE_SIZE + 1];
__shared__ float Bs[2][TILE_SIZE][TILE_SIZE + 1];
int curr = 0, next = 1;
// Pre-load first tile
load_tile(As[curr], Bs[curr], ...);
__syncthreads();
for (int t = 1; t < num_tiles; t++) {
// Load next tile while computing current
async_load_tile(As[next], Bs[next], ...); // uses cp.async
compute_from_tiles(As[curr], Bs[curr], ...);
__syncthreads();
curr ^= 1; next ^= 1;   // swap buffer roles
}
compute_from_tiles(As[curr], Bs[curr], ...);
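The cp.async mentioned in the sketch maps to CUDA's pipeline intrinsics (CUDA 11+). A hedged sketch of what async_load_tile might do for one element per thread - indices follow the tiled kernel above, and boundary handling is omitted:
#include <cuda_pipeline.h>
// Instead of a plain assignment into shared memory:
//   As[next][threadIdx.y][threadIdx.x] = A[row * K + a_col];
// issue an asynchronous 4-byte copy from global to shared memory:
__pipeline_memcpy_async(&As[next][threadIdx.y][threadIdx.x],
                        &A[row * K + a_col],
                        sizeof(float));
__pipeline_commit();        // close this batch of async copies
// ... compute_from_tiles(As[curr], Bs[curr], ...) runs while the copies are in flight ...
__pipeline_wait_prior(0);   // wait until the outstanding batch has landed in shared memory
__syncthreads();            // then make it visible to every thread in the block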
Common Mistakes
:::danger Using Global Memory Directly in Inner Loops The most common CUDA performance bug: reading from global memory in a tight inner loop.
// WRONG - 400 cycle load on every iteration
for (int k = 0; k < K; k++) {
sum += A[row * K + k] * B[k * N + col]; // global load
}
// RIGHT - load to shared memory once, then compute
// ... (use tiled pattern shown above)
:::
:::warning Forgetting the Second __syncthreads() Many tutorials show the sync after loading but omit the sync after computing. Without the second sync, faster warps in the block will start loading the next tile into shared memory while slower warps are still reading the current tile.
__syncthreads(); // after loading - always present in tutorials
for (int k = 0; k < TILE_SIZE; k++) {
sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
}
__syncthreads(); // BEFORE loading next tile - often forgotten!
// Without this, the next iteration's load will corrupt As/Bs
// while some threads are still reading from them
:::
:::warning Over-Allocating Shared Memory Allocating more shared memory than needed limits the number of blocks that can run concurrently per SM, reducing occupancy.
// If you need 8KB of smem per block but allocate 32KB,
// only 7 blocks can run on an SM (228KB / 32KB = 7)
// vs 28 blocks if you had allocated correctly (228KB / 8KB = 28)
// Check actual usage with nvcc -Xptxas -v
:::
:::danger Register Spilling to Local Memory Kernels with deep loop unrolling or many live variables will spill registers to local memory (HBM). The symptom is correct output but poor performance.
# Check for spilling during compilation
nvcc -Xptxas -v,-warn-spills my_kernel.cu
# If you see "lmem" in the output, there is spilling
# ptxas info: Used 96 registers, 128 bytes lmem, ...
# ^^^^^^^^^^^^^ BAD
:::
Interview Q&A
Q1: What is the difference between shared memory and L1 cache in CUDA?
They occupy the same physical SRAM on modern GPUs (H100: 256KB unified L1/smem per SM). The difference is control. L1 cache is hardware-managed: the hardware decides what to cache and when to evict it. You cannot explicitly load data into L1 or guarantee it stays there. Shared memory is programmer-managed: you explicitly write __shared__ to declare allocations, you explicitly load data into it with assignment statements, and it stays there for the lifetime of the block. This explicit control is critical for algorithms like matrix multiply where you know exactly which data to reuse - the hardware cache would not necessarily make the same choices you would.
Q2: Why must __syncthreads() be called after filling shared memory?
Threads in a block execute independently and do not stay in lockstep within a block (only within a warp). Thread 0 might finish writing its element to shared memory before thread 31 even starts. Without __syncthreads(), a thread that finishes its write early and immediately starts reading from shared memory might read stale data written by a previous iteration - or worse, read locations that other threads have not written yet. __syncthreads() is a barrier that all threads must reach before any can continue, guaranteeing that all writes before the barrier are visible to all reads after it. It is a release-acquire fence over shared memory.
Q3: Walk through the tiled matrix multiply and explain why it is faster.
In the naive kernel, computing C[i][j] requires K reads from row i of A and K reads from column j of B - all from global memory at 400 cycle latency each. The total global memory traffic is O(N^3) for an N x N matrix. In the tiled kernel, we partition the K-dimension into tiles of size T. For each tile, a block of T x T threads cooperatively loads a T x T tile of A and T x T tile of B into shared memory - 2T^2 global memory reads total. Then all T^2 threads compute T multiply-adds each from shared memory (32 cycle latency). Each global load serves T threads instead of just 1, so global memory traffic is reduced by factor T. For T=32, that is a 32x reduction in global memory transactions, which is why you see roughly 10-30x speedups depending on the matrix size and GPU.
Q4: What is register spilling and how does it affect performance?
When a kernel uses more registers per thread than the SM can provide (65,536 total, distributed across resident threads), the compiler spills register contents to "local memory" - which is physically located in global HBM at 400 cycle latency. The kernel still produces correct results (the compiler handles the spills automatically), but each spilled-register access adds the full global memory latency. A kernel that was compute-bound suddenly becomes memory-bound. Common causes: deep loop unrolling with many live accumulator variables, large local arrays, complex control flow that keeps many values live simultaneously. Fix by reducing loop unroll depth, splitting the kernel into multiple launches, or using __launch_bounds__ to tell the compiler to target lower register usage.
Q5: When would you use constant memory over global memory?
Constant memory is optimal when all threads in a warp read the same address simultaneously - the hardware detects this and broadcasts the value in a single operation. This is a 32x speedup over global memory for this access pattern. Ideal use cases: convolution filter kernels (all threads read filter[k] with the same k), model weight lookup tables in small inference kernels, kernel launch parameters that all threads need. The constraint: only 64 KB total, read-only during kernel execution. If threads read different addresses from constant memory, you lose the broadcast benefit and it is no faster than cached global memory.
Q6: How do you choose tile size for tiled matrix multiply?
Three constraints: (1) Tile must fit in shared memory - TILE_SIZE^2 * 2 * 4 bytes < available smem per block. (2) TILE_SIZE should equal or be a multiple of the warp size (32) to ensure coalesced global memory access when loading tiles. (3) Larger tiles reduce the ratio of global memory loads to arithmetic operations (better arithmetic intensity), but also reduce occupancy (fewer blocks per SM). In practice, TILE_SIZE=32 is optimal for FP32 on A100/H100. For FP16 with tensor cores, you use the hardware-prescribed tile shapes (16x16x16 for the WMMA API, 16x8x16 for the PTX mma instructions), not multiples of 32.
Q7: What happens if you forget __syncthreads() between tiles?
Without the second __syncthreads() (after computing, before loading the next tile), faster threads will begin writing the next tile into As and Bs while slower threads are still reading the current tile. This is a race condition on shared memory. The results will be silently incorrect - no error, no warning, just wrong numbers. This is one of the most insidious CUDA bugs: the kernel compiles, launches, and produces output that looks plausible but differs from the correct answer by small amounts. The fix is always the same: __syncthreads() between the compute phase and the next load phase.
Summary
The CUDA memory hierarchy is a performance tool, not just an architectural detail. Registers are free. Shared memory at 32 cycles is 12x faster than global memory. The gap between a kernel that reads from global memory in its inner loop versus one that pre-loads into shared memory is typically 5-30x in practice.
The tiled matrix multiply is the canonical example of the pattern you will use in almost every performance-critical kernel: identify the data that gets reused, load it cooperatively into shared memory, sync, compute, sync again, move to the next tile. Add the +1 padding to avoid bank conflicts. Watch register usage to avoid spilling to local memory.
These are not micro-optimizations. They are the difference between 8% and 85% hardware utilization.
