Streaming Multiprocessors
Reading time: ~35 min · Interview relevance: High · Target roles: ML Engineer, GPU Engineer, AI Infrastructure
The SM is to a GPU what a core is to a CPU - except that instead of hiding latency with caches and out-of-order execution, it hides latency by keeping up to 64 warps resident and switching between them every clock cycle at zero cost.
Your transformer attention kernel is running at 45% GPU utilization. Nsight Compute shows "Warp Cycles Per Issue: 18" - meaning on average, 18 clock cycles pass between consecutive instructions issued to a warp. An ideal kernel would show 4 or fewer. Something is stalling your warps.
The culprit turns out to be a shared memory bank conflict introduced when you transposed a matrix incorrectly. By adding a single padding column - one int per row, 128 bytes total - you eliminate the conflict. Warp cycles per issue drops to 4.2. Utilization jumps to 81%. Training throughput increases by 22%.
This kind of fix is only possible if you understand what is happening inside a Streaming Multiprocessor: how warps are scheduled, what causes stalls, and how shared memory banks work. That is what this lesson is about.
Why This Exists
The previous lesson explained that GPUs run thousands of threads simultaneously, grouped into warps of 32. But "threads run on the GPU" is too vague to reason about performance. You need a precise model of where the computation actually happens: the Streaming Multiprocessor.
Every CUDA kernel executes on SMs. An SM is a self-contained execution engine with its own scheduler, register file, execution units, and fast on-chip memory. Understanding its structure is the foundation for every optimization technique in GPU programming: occupancy tuning, shared memory usage, warp divergence avoidance, and register pressure management.
Historical Context
NVIDIA introduced the Streaming Multiprocessor in the Tesla architecture (G80, 2006) - the same chip that launched CUDA. The name "streaming" referred to the streaming data model from shader programming: data streams through the execution units in a continuous flow.
Early SMs (G80) had 8 scalar processors and ran one warp every 4 clocks. Fermi (2010) introduced the modern SM structure: two warp schedulers, instruction dispatch units, and the unified register file. Kepler (2012) went to 192 CUDA cores per SM. Maxwell (2014) reduced that to 128 but dramatically improved energy efficiency. Volta (2017) added Tensor Cores - hardware matrix multiply units for deep learning. Hopper (H100, 2022) has 128 CUDA cores, 4 Tensor Core groups, 4 warp schedulers, and 256KB of configurable L1/shared memory per SM.
The number and organization of SMs varies by chip:
- GTX 1080 Ti: 28 SMs (Pascal)
- V100: 80 SMs (Volta)
- A100: 108 SMs (Ampere)
- H100 SXM5: 132 SMs (Hopper)
More SMs means more parallelism, but only if you have enough work to fill them all.
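You can query these numbers for the GPU you are actually running on from PyTorch. In the minimal sketch below, multi_processor_count is the SM count; the per-SM fields are read defensively with getattr because they are only exposed in newer PyTorch builds.

import torch

props = torch.cuda.get_device_properties(0)
print(f"GPU: {props.name} (compute capability {props.major}.{props.minor})")
print(f"SM count: {props.multi_processor_count}")
print(f"HBM: {props.total_memory / 1e9:.1f} GB")

# Per-SM limits appear in newer PyTorch versions; skip them gracefully if absent.
for field in ("max_threads_per_multi_processor", "regs_per_multiprocessor",
              "shared_memory_per_multiprocessor"):
    value = getattr(props, field, None)
    if value is not None:
        print(f"{field}: {value}")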
Core Concepts
SM Internal Structure
An SM (using H100 as the reference) contains:
Warp Schedulers (4 per SM). Each scheduler manages a pool of resident warps and selects one per clock cycle to issue an instruction. This is how the GPU hides memory latency: while warp A is waiting for a memory load (100+ cycles), warp B, C, or D is executing. The scheduler switches between warps in a single clock cycle at zero cost - unlike a CPU context switch which takes hundreds of cycles.
CUDA Cores (128 per SM, FP32/INT32). Each CUDA core is a simple ALU: it can do one floating-point or integer multiply-add per clock. With 4 schedulers issuing to 32 CUDA cores each, the SM can start 4 x 32 = 128 operations per cycle.
Tensor Cores (4 groups per SM, H100). Each group executes 4th-generation Tensor Core matrix multiply-accumulate instructions - for example a 16x8x16 fragment in fp16, bf16, fp8, or int8, which is 16 x 8 x 16 = 2,048 multiply-adds in a single instruction. Tensor Cores are how modern GPUs achieve their headline TFLOPS numbers - they are roughly 10-16x faster per SM than standard CUDA cores for matrix math.
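You can observe the Tensor Core path from PyTorch without writing a kernel by toggling TF32 for fp32 matmuls: on Ampere or newer GPUs, torch.backends.cuda.matmul.allow_tf32 routes fp32 GEMMs through Tensor Cores when enabled and through the CUDA cores when disabled. A rough timing sketch (the helper name is mine; the exact gap depends on GPU and matrix shape):

import torch

def time_matmul(A, B, n_iter=50):
    # Warm up, then time with CUDA events for accurate GPU timing.
    for _ in range(5):
        torch.mm(A, B)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(n_iter):
        torch.mm(A, B)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / n_iter  # ms

A = torch.randn(4096, 4096, device='cuda')  # fp32
B = torch.randn(4096, 4096, device='cuda')

torch.backends.cuda.matmul.allow_tf32 = False   # fp32 on CUDA cores
t_fp32 = time_matmul(A, B)
torch.backends.cuda.matmul.allow_tf32 = True    # TF32 on Tensor Cores
t_tf32 = time_matmul(A, B)
print(f"fp32 (CUDA cores): {t_fp32:.2f} ms   TF32 (Tensor Cores): {t_tf32:.2f} ms")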
Register File (65,536 x 32-bit registers per SM). Every thread has its own registers allocated from this pool. With 65,536 registers and a maximum of 2048 threads per SM, each thread gets at most 32 registers if all 2048 threads are resident. Heavy use of local variables increases register count per thread, reducing occupancy.
Shared Memory / L1 Cache (256KB combined per SM on H100, up to 228KB usable as shared memory). This is configurable fast on-chip SRAM, accessible by all threads in a thread block with roughly 32-cycle latency (vs ~400 cycles for HBM). It is the GPU equivalent of a software-managed cache - the programmer controls what goes in it. Common use: load a tile of matrix data into shared memory, then have all threads in the block read from shared memory instead of global memory. This is the foundation of tiled matrix multiply, sketched below.
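To make the reuse pattern concrete, here is a minimal CPU-side analogy in NumPy (an illustration of the data reuse, not GPU code, and the helper name is mine): each tile of A and B is read from the large arrays once, held in small temporaries that play the role of shared memory, and reused for every multiply-add in that tile.

import numpy as np

def tiled_matmul(A, B, tile=32):
    # Each (tile x tile) block of A and B is read from "global" memory once
    # and reused `tile` times - the role shared memory plays on the SM.
    M, K = A.shape
    K2, N = B.shape
    assert K == K2 and M % tile == 0 and N % tile == 0 and K % tile == 0
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            acc = np.zeros((tile, tile), dtype=A.dtype)
            for k in range(0, K, tile):
                a_tile = A[i:i+tile, k:k+tile]   # "load tile into shared memory"
                b_tile = B[k:k+tile, j:j+tile]
                acc += a_tile @ b_tile           # every loaded element is reused
            C[i:i+tile, j:j+tile] = acc
    return C

A = np.random.rand(128, 128).astype(np.float32)
B = np.random.rand(128, 128).astype(np.float32)
assert np.allclose(tiled_matmul(A, B), A @ B, atol=1e-3)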
L2 Cache (50MB total, shared across all SMs). Larger than shared memory, slower, but still much faster than HBM. The L2 cache is not programmable - it caches global memory accesses automatically.
Thread Blocks and the SM
When you launch a CUDA kernel, you specify a grid of thread blocks. Each thread block is assigned to exactly one SM. Multiple thread blocks can run on the same SM simultaneously - up to the SM's resource limits.
Within an SM, threads are organized as warps. A thread block of 256 threads becomes 256/32 = 8 warps. If you run 4 thread blocks simultaneously on an SM, that is 32 warps the SM must manage.
The maximum resident warps per SM (H100): 64 warps. The maximum resident threads per SM (H100): 2048 threads. The maximum thread blocks per SM (H100): 32 blocks.
These limits define the ceiling on parallelism per SM. Reaching the ceiling requires careful configuration.
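A quick back-of-the-envelope calculation shows what these ceilings mean for a kernel launch, using the H100 figures quoted above:

# Back-of-the-envelope: how many blocks does it take to fill an H100?
NUM_SMS = 132
MAX_WARPS_PER_SM = 64
WARP_SIZE = 32

block_size = 256
warps_per_block = block_size // WARP_SIZE                     # 8 warps per block
blocks_to_fill_one_sm = MAX_WARPS_PER_SM // warps_per_block   # 8 blocks per SM
blocks_to_fill_gpu = blocks_to_fill_one_sm * NUM_SMS          # 1056 blocks

print(f"{warps_per_block} warps/block, {blocks_to_fill_one_sm} blocks/SM, "
      f"{blocks_to_fill_gpu} blocks to reach max occupancy on every SM")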
Occupancy: The Key Metric
Occupancy is the ratio of active warps to the maximum possible warps on an SM:
occupancy = active warps per SM / maximum warps per SM
For H100, max warps = 64. If your kernel has 32 active warps per SM, occupancy is 32/64 = 50%.
Higher occupancy generally means better latency hiding. When a warp stalls on a memory load, the scheduler needs other warps to switch to. If occupancy is 25% (16 warps), there are fewer alternatives; the scheduler may find all warps stalled, leaving the SM idle.
But occupancy is not everything. A kernel with 50% occupancy and perfect memory coalescing can outperform a 100% occupancy kernel with poor access patterns. Occupancy is one tool, not the goal.
Three resources limit occupancy:
1. Registers. Each thread's registers come from the SM's 65,536-register pool. If your kernel uses 64 registers per thread and your block size is 256 threads, each block uses 256 x 64 = 16,384 registers. The SM can fit 65,536 / 16,384 = 4 blocks, or 4 x 256 / 32 = 32 warps per SM. Occupancy = 32/64 = 50%.
If you reduce register count to 32 per thread, 8 blocks fit, 64 warps, 100% occupancy.
2. Shared memory. If each block uses 32KB of shared memory and the SM has 96KB available, only 3 blocks fit. Three blocks of 256 threads give 3 x 256 / 32 = 24 warps, so occupancy = 24/64 = 37.5%.
3. Block size. The SM runs at most 32 blocks at a time, so very small blocks hit the block limit before the warp limit: with 32-thread blocks (1 warp each), 32 blocks give only 32 resident warps and occupancy is capped at 50%. Block sizes that are multiples of 32 and at least 128-256 threads are generally recommended. The sketch below turns these three limits into a theoretical occupancy estimate.
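The following is a minimal sketch, not the CUDA occupancy API: the helper name and defaults are mine, the limits default to the H100 numbers used in this lesson, and it ignores allocation granularity (real hardware rounds register and shared memory allocations up), so treat the result as an approximation.

def theoretical_occupancy(block_size, regs_per_thread, smem_per_block_bytes,
                          max_warps=64, max_threads=2048, max_blocks=32,
                          regfile=65536, smem_per_sm_bytes=228 * 1024):
    """Estimate occupancy from the three per-SM resource limits (H100 defaults)."""
    warps_per_block = (block_size + 31) // 32
    # How many blocks fit under each limit?
    by_threads = max_threads // block_size
    by_blocks = max_blocks
    by_regs = regfile // (regs_per_thread * block_size)
    by_smem = (smem_per_sm_bytes // smem_per_block_bytes
               if smem_per_block_bytes else max_blocks)
    blocks = min(by_threads, by_blocks, by_regs, by_smem)
    warps = min(blocks * warps_per_block, max_warps)
    return warps / max_warps, blocks

# Example from the text: 256-thread blocks, 64 registers/thread, no shared memory.
occ, blocks = theoretical_occupancy(256, 64, 0)
print(f"{blocks} blocks/SM, occupancy = {occ:.0%}")   # 4 blocks, 50%
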
# Checking occupancy empirically with PyTorch
import torch
import time
def time_kernel(fn, *args, n_warmup=5, n_iter=50):
"""Profile a CUDA kernel with proper warmup and synchronization."""
for _ in range(n_warmup):
fn(*args)
torch.cuda.synchronize()
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
for _ in range(n_iter):
fn(*args)
end.record()
torch.cuda.synchronize()
return start.elapsed_time(end) / n_iter # ms
# Benchmark: does batch size (i.e., number of thread blocks) affect throughput?
device = 'cuda'
for batch in [32, 64, 128, 256, 512, 1024, 4096]:
A = torch.randn(batch, 512, device=device, dtype=torch.float16)
B = torch.randn(512, 512, device=device, dtype=torch.float16)
t = time_kernel(torch.mm, A, B)
flops = 2 * batch * 512 * 512
tflops = flops / (t * 1e-3) / 1e12
print(f"batch={batch:4d}: {t:.3f}ms {tflops:.3f} TFLOPS")
# Small batches -> few thread blocks -> low SM utilization -> low TFLOPS
# Large batches -> many blocks -> high SM utilization -> peak TFLOPS
The Warp Scheduler in Detail
The warp scheduler is the heart of an SM. Here is exactly how it works:
Each clock cycle, each of the 4 warp schedulers selects one warp that is eligible to execute. A warp is eligible if:
- It has an instruction ready to issue
- The required execution units are available
- Its dependencies are resolved
If a warp has a pending memory load, it is not eligible until the load returns. If a warp has a data dependency (instruction i+1 needs the result of instruction i), it is not eligible for N cycles (the instruction latency, typically 4-8 cycles for arithmetic, 100+ for global memory).
The scheduler selects among eligible warps using a round-robin or greedy strategy. This selection is completely free - zero overhead. This is the fundamental mechanism of latency hiding: while 48 warps wait for memory, 16 warps keep executing.
The key insight: you need enough warps to keep the scheduler busy. If a memory operation stalls a warp for 400 cycles and you only have 2 warps total, the SM sits idle for much of those 400 cycles. If you have 32 warps, the other 31 fill the gap.
Warp Scheduler Timeline (simplified, 4 warps, memory latency = 8 cycles):
Cycle Warp0 Warp1 Warp2 Warp3
1 EXEC EXEC EXEC EXEC
2 MEM EXEC EXEC EXEC <- Warp0 issues load, stalls
3 stall MEM EXEC EXEC <- Warp1 issues load, stalls
4 stall stall MEM EXEC
5 stall stall stall MEM
6 stall stall stall stall <- All 4 warps stalled! SM is idle.
7 stall stall stall stall
8 stall stall stall stall
9 stall stall stall stall
10 EXEC stall stall stall <- Warp0 load returned
11 EXEC EXEC stall stall
With only 4 warps, the SM has nothing to issue on cycles 6-9: 4 of the 11 cycles shown are wasted (~36% idle), and with realistic memory latencies of 100+ cycles the idle fraction grows much larger.
With 32 warps, most stalls are hidden by other warps executing.
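The same model can be simulated to see how the idle fraction shrinks as warps are added. This is a toy, didactic simulation (one issue slot per cycle; each warp issues a few arithmetic instructions, then one load that stalls it for a fixed latency - 4 compute instructions and 100 cycles here, matching the latencies quoted earlier), not a cycle-accurate model, and the helper name is mine.

def idle_fraction(n_warps, compute_instrs=4, mem_latency=100, n_cycles=20_000):
    """Toy model: each warp issues `compute_instrs` back-to-back instructions,
    then one load that makes it ineligible for `mem_latency` cycles."""
    ready_at = [0] * n_warps                  # cycle at which each warp is eligible again
    remaining = [compute_instrs] * n_warps    # compute instructions left before next load
    idle = 0
    for cycle in range(n_cycles):
        eligible = [w for w in range(n_warps) if ready_at[w] <= cycle]
        if not eligible:
            idle += 1                         # no eligible warp: issue slot wasted
            continue
        w = eligible[0]                       # greedy pick, like a simple scheduler
        if remaining[w] > 0:
            remaining[w] -= 1                 # issue a compute instruction
            ready_at[w] = cycle + 1
        else:
            remaining[w] = compute_instrs     # issue the load, then stall
            ready_at[w] = cycle + 1 + mem_latency
    return idle / n_cycles

for n in (2, 4, 8, 16, 32, 64):
    print(f"{n:2d} warps -> {idle_fraction(n):5.1%} of issue cycles idle")

In this toy model, about (4 + 1 + 100) / (4 + 1) = 21 warps are enough to keep the issue slot busy; 32 or more leaves essentially no idle cycles.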
Shared Memory Banks and Conflicts
Shared memory is organized into 32 banks (matching warp width). Successive 4-byte words map to successive banks: bank b holds the words at word addresses b, b+32, b+64, and so on (bank = word address mod 32). All 32 threads in a warp can access shared memory simultaneously - but only if each thread accesses a different bank.
A bank conflict occurs when multiple threads in a warp access different addresses in the same bank. The hardware serializes these accesses. A 2-way conflict halves bandwidth; a 32-way conflict reduces bandwidth to 1/32.
The classic example: a 32x32 float matrix stored row-major in shared memory. To read column j, thread i accesses element (i, j). When all 32 threads read column j simultaneously, they touch rows 0, 1, 2, ..., 31 of that column. The stride between consecutive accesses is 32 elements = 32 words, so every access maps to the same bank. Catastrophic 32-way conflict.
// WRONG: 32-way bank conflict when accessing columns
__shared__ float matrix[32][32];
// Thread i accesses matrix[i][j]:
// Address = base + i * 32 + j
// All threads have stride 32, so all land in bank ((i*32 + j) % 32) = bank (j)
// All 32 threads access bank j - 32-way conflict
// CORRECT: Add one padding column to break the bank alignment
__shared__ float matrix[32][33]; // 33 instead of 32
// Thread i now accesses: base + i * 33 + j
// Bank = (i * 33 + j) % 32
// Stride 33 is not a multiple of 32, so consecutive rows land in different banks
// Zero bank conflicts
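You can verify the bank arithmetic in the comments above with a few lines of Python. The script below (helper name is mine) computes which bank each of the 32 threads hits for a column read with a row stride of 32 versus 33:

N_BANKS = 32

def banks_for_column_read(stride, col=0, n_threads=32):
    # Thread i reads element (i, col) of a row-major tile with the given row stride.
    return [(i * stride + col) % N_BANKS for i in range(n_threads)]

for stride in (32, 33):
    banks = banks_for_column_read(stride)
    worst = max(banks.count(b) for b in set(banks))
    print(f"stride {stride}: {worst}-way conflict, distinct banks hit = {len(set(banks))}")

# stride 32 -> every thread hits bank 0 (32-way conflict)
# stride 33 -> the 32 threads spread across all 32 banks (conflict-free)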
In PyTorch and Triton you control padding in tiled algorithms:
# In Triton, use tl.dot for matmul - it handles padding internally
# But for custom reductions or transposes, pad manually:
import triton
import triton.language as tl

@triton.jit
def transpose_kernel(input_ptr, output_ptr, M, N,
BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr):
pid_m = tl.program_id(0)
pid_n = tl.program_id(1)
# Load tile into shared memory (BLOCK_M x BLOCK_N)
row_offs = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
col_offs = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
# Triton handles bank conflict avoidance internally for tl.dot
# For manual shared memory: use (BLOCK_N + 1) as stride
tile = tl.load(input_ptr + row_offs[:, None] * N + col_offs[None, :],
mask=(row_offs[:, None] < M) & (col_offs[None, :] < N))
# Store transposed
tl.store(output_ptr + col_offs[:, None] * M + row_offs[None, :],
tl.trans(tile),
mask=(col_offs[:, None] < N) & (row_offs[None, :] < M))
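A usage sketch for the kernel above, assuming Triton is installed (block sizes and tensor shapes here are illustrative choices, not tuned values):

import torch
import triton

M, N = 1024, 768
x = torch.randn(M, N, device='cuda', dtype=torch.float32)
y = torch.empty(N, M, device='cuda', dtype=torch.float32)

BLOCK_M, BLOCK_N = 32, 32
# One program instance per (BLOCK_M x BLOCK_N) tile of the input.
grid = (triton.cdiv(M, BLOCK_M), triton.cdiv(N, BLOCK_N))
transpose_kernel[grid](x, y, M, N, BLOCK_M=BLOCK_M, BLOCK_N=BLOCK_N)

assert torch.allclose(y, x.t())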
Special Function Units (SFUs)
Each SM has 16 SFUs (Special Function Units) that compute transcendental functions in hardware: sin, cos, exp, log, sqrt, reciprocal. These are hardware polynomial approximations - a single fast instruction instead of a multi-instruction software sequence, at somewhat reduced accuracy compared to the IEEE-accurate library versions.
For ML applications, SFU operations appear in:
- Softmax (exp, sum, division)
- GELU activation (erf approximation)
- Layer norm (sqrt for RMS normalization)
- Positional encodings (sin, cos)
SFU throughput is 16 operations per SM per cycle versus 128 for the CUDA cores, so a kernel dominated by transcendentals can become SFU-bound. The fast intrinsics (__expf, __logf) map to the SFU directly and skip the extra correction code of the precise library versions; use them where the accuracy loss is acceptable, or fuse operations so the transcendental is a smaller fraction of the work.
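A first sanity check from PyTorch is to time a transcendental-heavy elementwise op against a plain multiply on the same tensor and see whether there is a meaningful gap on your hardware (for large tensors both may be memory-bound, in which case the gap will be small). This sketch reuses the CUDA-event timing pattern from earlier in the lesson; the helper name is mine.

import torch

def time_op(fn, x, n_iter=100):
    # CUDA-event timing with warmup, as in time_kernel above.
    for _ in range(10):
        fn(x)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(n_iter):
        fn(x)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / n_iter  # ms

x = torch.randn(64 * 1024 * 1024, device='cuda')
print(f"mul : {time_op(lambda t: t * 1.0001, x):.3f} ms")  # plain FP32 ALU work
print(f"exp : {time_op(torch.exp, x):.3f} ms")             # exp: one of the SFU functions listed above
print(f"erf : {time_op(torch.erf, x):.3f} ms")             # used in exact GELU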
Analyzing SM Utilization with Nsight Compute
# Profile a PyTorch training step
ncu --set full -o profile_output python -c "
import torch
A = torch.randn(4096, 4096, device='cuda', dtype=torch.float16)
B = torch.randn(4096, 4096, device='cuda', dtype=torch.float16)
for _ in range(10):
C = torch.mm(A, B)
torch.cuda.synchronize()
"
# Key metrics to interpret:
# sm__throughput.avg.pct_of_peak_sustained_elapsed
# -> Overall SM utilization. Below 60% = underutilization issue.
#
# sm__warps_active.avg.pct_of_peak_sustained_active
# -> Achieved occupancy as % of theoretical max. Below 50% = occupancy issue.
#
# smsp__warp_issue_stalled_long_scoreboard_per_warp_active.pct
# -> % of cycles stalled waiting for global memory. >20% = memory bottleneck.
#
# smsp__warp_issue_stalled_mio_throttle_per_warp_active.pct
# -> % of cycles stalled on memory instruction throttle. Reduce memory ops.
#
# l1tex__t_sectors_pipe_lsu_mem_global_op_ld_lookup_hit.sum
# -> L1 cache hit count for loads. Low = poor spatial/temporal locality.
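When ncu is not available (it needs access to GPU performance counters, which shared clusters often restrict), torch.profiler gives a coarser but still useful kernel-level view:

import torch
from torch.profiler import profile, ProfilerActivity

A = torch.randn(4096, 4096, device='cuda', dtype=torch.float16)
B = torch.randn(4096, 4096, device='cuda', dtype=torch.float16)

with profile(activities=[ProfilerActivity.CUDA], record_shapes=True) as prof:
    for _ in range(10):
        C = torch.mm(A, B)
    torch.cuda.synchronize()

# Kernel-level table: names, call counts, and total/average CUDA time.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))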
Practical Block Configuration Guidelines
import torch
def find_optimal_block_size(kernel_fn, n_elements, candidate_sizes=(128, 256, 512, 1024)):
"""
Empirically find the best block size for a kernel.
In production, use Triton's @autotune or CUDA's occupancy API.
"""
best_time = float('inf')
best_size = candidate_sizes[0]
for block_size in candidate_sizes:
times = []
for _ in range(5): # warmup + measure
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
kernel_fn(n_elements, block_size)
end.record()
torch.cuda.synchronize()
times.append(start.elapsed_time(end))
avg = sum(times[1:]) / len(times[1:]) # drop first warmup
print(f"block_size={block_size}: {avg:.3f}ms")
if avg < best_time:
best_time = avg
best_size = block_size
return best_size
# General rules of thumb:
# - Elementwise kernels: 1024 threads/block (high occupancy, simple)
# - Reduction kernels: 256-512 threads/block (shared memory usage)
# - Matmul / attention: tile-based, typically 128x128 tiles with 128-256 threads
# - Memory-bound kernels: maximize occupancy (use 256-1024)
# - Compute-bound kernels: balance register pressure vs occupancy (128-256)
Common Mistakes
:::danger Using Thread Block Sizes That Are Not Multiples of 32
A thread block of 100 threads becomes 3 full warps (96 threads) plus 1 partial warp with only 4 active lanes. Registers are allocated for all 32 lanes of that partial warp, so the resources for the 28 inactive lanes are wasted, and the partial warp still counts against the SM's warp limit. Always use multiples of 32; prefer 128, 256, or 512 for most kernels.
:::
:::warning Ignoring Register Pressure
Local variables and intermediate values in a kernel live in registers. When per-thread register count crosses an allocation threshold, occupancy drops in a discrete step: adding a single register per thread can knock out a whole block's worth of warps. Use nvcc --ptxas-options=-v to see register usage. Aim for 32 or fewer registers per thread when you need maximum occupancy.
:::
:::warning Allocating Shared Memory Larger Than Needed
If you allocate 96KB of shared memory per block on an SM with 96KB total, only one block can co-reside. One block of 256 threads = 8 warps = 12.5% occupancy. Reduce tile sizes, compress data types, or split the work into multiple kernels with smaller shared memory footprints.
:::
:::danger Accessing Shared Memory with Column-Stride Patterns Without Padding
Storing an M x N matrix row-major in shared memory and reading it column-by-column creates a 32-way bank conflict whenever N is a multiple of 32. Throughput drops by up to 32x. Add padding: declare the array as [M][N+1] instead of [M][N]. One extra element per row staggers the rows across banks and eliminates the conflicts; the memory cost is negligible (4 bytes per row).
:::
Interview Questions
Q1: What is GPU occupancy and why does it matter for performance?
Occupancy is the ratio of active warps per SM to the maximum supported warps. H100 supports 64 warps per SM; if your kernel runs 32, occupancy is 50%.
Occupancy matters because the warp scheduler hides memory latency by switching to another warp when one stalls. If occupancy is low, the scheduler has fewer options and the SM may sit idle waiting for memory. Higher occupancy provides more warps to interleave, keeping execution units busy.
However, occupancy is not the goal - throughput is. A kernel with 50% occupancy that does perfectly coalesced memory access can outperform a 100% occupancy kernel with poor access patterns. Profile before assuming occupancy is your bottleneck.
Q2: What limits occupancy, and how would you fix each bottleneck?
Three resources limit occupancy: registers per thread (fix: reduce local variables, use __launch_bounds__), shared memory per block (fix: reduce tile sizes or use L1 instead), and block size (fix: use multiples of 32, prefer 128-256 threads/block). Use Nsight Compute "Theoretical Occupancy" vs "Achieved Occupancy" to identify which constraint is binding.
Q3: Explain shared memory bank conflicts and how to fix them.
Shared memory is divided into 32 banks. Multiple threads in a warp accessing different addresses in the same bank causes a bank conflict - the hardware serializes the accesses, reducing bandwidth. The classic case: reading a column from a 32x32 matrix stored row-major gives every thread the same bank. Fix: pad the array width by 1 (declare [32][33] instead of [32][32]). The stride becomes 33, staggering elements across all 32 banks with zero conflicts.
Q4: How does a warp scheduler hide memory latency without a cache?
When a warp issues a global memory load, it becomes ineligible for 100-400 cycles while the data travels from HBM. The warp scheduler immediately selects another eligible warp and issues its instruction. This context switch is zero-cost because each warp has its own dedicated registers - nothing needs to be saved or restored. As long as enough warps are resident and some are always eligible, the SM execution units stay fully busy throughout the memory latency period. This is fundamentally different from a CPU, which hides latency with a cache hierarchy instead.
Q5: What is the difference between shared memory and L1 cache on an SM, and when would you use each?
Both are fast on-chip SRAM with roughly 32-cycle latency (compared with ~400 cycles for HBM). Shared memory is explicitly programmer-managed: you load data into it with explicit instructions, and it stays there until the block releases it. All threads in the block share it and can access any location. L1 cache is hardware-managed: the GPU automatically caches global memory accesses based on access patterns, with no programmer control over what is cached or evicted.
Use shared memory when multiple threads need the same data (tile reuse in matmul), when you need guaranteed persistence across multiple accesses, or when implementing algorithms that require cross-thread data sharing (reduction, scan). Use L1 for streaming access patterns where temporal reuse is low and you just want to reduce HBM traffic for nearby sequential accesses.
Summary
The Streaming Multiprocessor is the GPU's fundamental execution unit. An H100 SM contains 4 warp schedulers, 128 CUDA cores, 4 Tensor Core groups, a 65,536-register register file, and up to 228KB of shared memory.
The SM hides memory latency by keeping many warps resident and switching between them at zero cost each clock cycle. This requires sufficient occupancy - active warps per SM. Occupancy is limited by registers per thread, shared memory per block, and block size.
Shared memory is the most impactful per-kernel optimization tool: programmer-controlled SRAM that eliminates redundant global memory reads. Bank conflicts - multiple threads in a warp hitting different addresses in the same bank - can destroy its benefit. Padding the array by one element per row is the standard fix.
The next lesson covers the full CUDA execution model: how kernels are launched, how the GPU schedules thread blocks across SMs, and how CUDA streams enable overlapping compute with data transfer.
