GPU Architecture for ML Engineers
The 40% Utilization Mystery
The ML engineer had spent two weeks optimizing a transformer training loop. PyTorch profiler showed GPU utilization at 40%. Every blog post said "target 80%+." The code looked clean - batch sizes were large, data loading was fast, and the model fit in VRAM with room to spare.
Running nvidia-smi, the engineer confirmed 40% GPU utilization. But which 40%? Was the GPU waiting on memory transfers? Was compute underutilized due to poor warp occupancy? Was there a kernel launch overhead problem? Without understanding what the GPU was doing during the other 60%, profiling was just confirming a symptom, not finding a cause.
The investigation revealed the real culprit: the model had a sequence of operations that were individually memory-bandwidth bound, not compute-bound. Each operation would copy its input tensor from HBM (High Bandwidth Memory) to the L1 cache, perform a few operations, then write results back. The GPU compute units sat idle during memory transfers. No amount of "making the code more efficient" would help - the ceiling was the HBM bandwidth.
The solution was fusing operations into custom CUDA kernels (or, for attention, using FlashAttention, which does this for you) so that data stayed on-chip in the L1/L2 cache across multiple operations, dramatically reducing the number of HBM round-trips. After this change, GPU utilization jumped to 72% and training throughput increased 1.8×.
This is the roofline model in practice. To understand why and how to fix GPU underutilization, you need to understand what is actually happening inside the GPU.
Why ML Engineers Need to Understand GPU Architecture
You can write PyTorch for years without knowing what a warp is or what HBM stands for. The code works. But when performance matters - and it always matters at scale - you hit questions that PyTorch's abstraction cannot answer:
- Why does adding one more transformer layer cause a 3× slowdown?
- Why is attention slower on my A6000 than on an A100, even though both have plenty of FLOPS?
- Why does moving from float32 to float16 cause a 5× speedup, not 2×?
- Why does my model train faster with batch size 512 than batch size 1024?
These questions have precise answers grounded in GPU hardware. This lesson gives you the mental model to reason about them.
The GPU vs CPU Design Philosophy
A CPU is designed for latency: execute a single instruction stream as fast as possible, with deep caches, branch prediction, and out-of-order execution to eliminate pipeline stalls. A modern CPU core is a marvel of latency-reduction engineering that runs a single thread extremely fast.
A GPU is designed for throughput: execute thousands of instruction streams simultaneously, accept high latency on any individual thread in exchange for massive parallelism across threads. A GPU has thousands of cores, but each core is simple and slow compared to a CPU core.
This design philosophy is a perfect match for neural network training, where the dominant operation - matrix multiplication - is perfectly parallelizable: each output element can be computed independently.
CPU vs GPU Design Contrast:

```
┌────────────────────────────────────┐
│ CPU (e.g., AMD Threadripper Pro)   │
│ 64 cores × 1 instruction/cycle     │
│ ~64 operations in parallel         │
│ Single-core perf: 5 GHz + cache    │
└────────────────────────────────────┘
┌────────────────────────────────────┐
│ GPU (e.g., NVIDIA H100 SXM)        │
│ 16,896 CUDA cores                  │
│ 132 SMs × 128 CUDA cores each      │
│ ~16,896 ops in parallel            │
│ Single-core perf: ~1.7 GHz         │
└────────────────────────────────────┘
```
CUDA Cores vs Tensor Cores
An A100 GPU has two types of execution units:
CUDA cores (FP32 units): Each performs one scalar floating-point fused multiply-add (FMA, counted as 2 FLOPs) per clock cycle. An A100 has 6,912 CUDA cores providing 19.5 TFLOPS of FP32 throughput.
Tensor Cores: Each performs a small matrix multiply-accumulate per clock cycle. On Volta (V100) this was a 4×4×4 operation, or 64 FP16 FMAs per Tensor Core per clock; each third-generation Tensor Core on the A100 performs 256 FP16 FMAs per clock. An A100 has 432 Tensor Cores providing 312 TFLOPS of dense FP16 throughput (624 TFLOPS with 2:4 structured sparsity).
The ratio: Tensor Cores deliver 16× more throughput for matrix operations than CUDA cores (312 vs 19.5 TFLOPS). Matrix multiplication is the core operation in every transformer layer. This is why "use mixed precision" is often the single highest-leverage optimization for deep learning training.
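As a rough sanity check on these figures, peak throughput can be estimated as units × FMAs per unit per clock × 2 FLOPs per FMA × clock rate. The sketch below assumes the A100's published boost clock of 1.41 GHz and 256 FP16 FMAs per Tensor Core per clock; the function name `peak_tflops` is illustrative, not part of any library.

```python
def peak_tflops(units: int, fma_per_unit_per_clock: int, clock_ghz: float) -> float:
    """Estimate peak throughput in TFLOPS; one FMA counts as 2 FLOPs."""
    flops_per_sec = units * fma_per_unit_per_clock * 2 * clock_ghz * 1e9
    return flops_per_sec / 1e12

# Assumed A100 figures: 1.41 GHz boost clock, 256 FP16 FMAs per Tensor Core per clock
cuda_fp32 = peak_tflops(units=6912, fma_per_unit_per_clock=1, clock_ghz=1.41)
tensor_fp16 = peak_tflops(units=432, fma_per_unit_per_clock=256, clock_ghz=1.41)

print(f"CUDA cores (FP32):   {cuda_fp32:.1f} TFLOPS")
print(f"Tensor Cores (FP16): {tensor_fp16:.1f} TFLOPS")
print(f"Ratio: {tensor_fp16 / cuda_fp32:.0f}x")
```

The Tensor Core advantage comes entirely from the work done per unit per clock: there are 16 times fewer Tensor Cores than CUDA cores, but each one does 256 times more multiply-adds per cycle.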
| GPU Generation | FP32 TFLOPS | FP16 TFLOPS (Tensor Core, dense) | TF32 TFLOPS | BF16 TFLOPS |
|---|---|---|---|---|
| V100 (2017) | 14.0 | 112 | - | - |
| A100 SXM (2020) | 19.5 | 312 | 156 | 312 |
| H100 SXM (2022) | 67.0 | 989 | 494 | 989 |
| H200 SXM (2024) | 67.0 | 989 | 494 | 989 |
The H100's large FP16/BF16 TFLOPS advantage over the A100 comes mainly from the new 4th-generation Tensor Core architecture and higher clocks, not from a proportionally larger number of cores.
Memory Hierarchy
Understanding GPU memory is more important for ML performance than understanding FLOPS. Most ML operations are memory-bandwidth bound, not compute-bound.
HBM (High Bandwidth Memory): What ML engineers call "GPU RAM" or "VRAM." The A100 SXM has 80 GB of HBM2e with 2 TB/s bandwidth. The H100 SXM has 80 GB of HBM3 with 3.35 TB/s bandwidth. Every tensor that does not fit in L1 or L2 cache must be fetched from HBM.
The critical insight: HBM bandwidth (2 TB/s) limits how much data can move to the compute cores per second. If an operation requires reading 2 TB of data and your GPU has 2 TB/s bandwidth, that operation takes at least 1 second - regardless of how many TFLOPS your GPU has.
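To make the bandwidth bound concrete, the sketch below estimates the minimum time for a chain of three element-wise operations on an FP16 tensor, first as three separate kernels and then as one fused kernel. The tensor shape and the helper name `min_time_ms` are illustrative assumptions, and the estimate ignores caches and kernel launch overhead.

```python
def min_time_ms(bytes_moved: float, bandwidth_tb_per_s: float) -> float:
    """Lower bound on kernel time from memory traffic alone, in milliseconds."""
    return bytes_moved / (bandwidth_tb_per_s * 1e12) * 1e3

# Hypothetical activation tensor: (batch=32, seq=1024, hidden=4096) in FP16
tensor_bytes = 32 * 1024 * 4096 * 2

# Unfused: each of the 3 kernels reads its input from HBM and writes its output back
unfused_bytes = 3 * (tensor_bytes + tensor_bytes)
# Fused: one read of the input and one write of the final output
fused_bytes = tensor_bytes + tensor_bytes

unfused_ms = min_time_ms(unfused_bytes, bandwidth_tb_per_s=2.0)  # A100-class HBM
fused_ms = min_time_ms(fused_bytes, bandwidth_tb_per_s=2.0)
print(f"unfused: {unfused_ms:.3f} ms, fused: {fused_ms:.3f} ms")
```

Under these assumptions the fused version has a 3× lower time bound, and no amount of extra compute throughput changes either number.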
The Roofline Model
The roofline model gives you a way to predict the maximum achievable performance of any kernel given its arithmetic intensity.
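Arithmetic intensity ($I$): the ratio of floating-point operations to bytes of memory traffic:

$$I = \frac{\text{FLOPs}}{\text{bytes of memory traffic}}$$

The roofline sets two limits:
- Compute ceiling: maximum performance = peak GPU throughput $P_{\text{peak}}$ (e.g., 312 TFLOPS for A100 dense FP16 on Tensor Cores)
- Memory bandwidth ceiling: maximum performance = bandwidth × arithmetic intensity, $B \times I$ (e.g., 2 TB/s × $I$)

Your kernel's actual performance is bounded by whichever ceiling it hits first:

$$P_{\text{attainable}} = \min\left(P_{\text{peak}},\; B \times I\right)$$

The ridge point is the arithmetic intensity at which both ceilings are equal:

$$I_{\text{ridge}} = \frac{P_{\text{peak}}}{B} = \frac{312 \times 10^{12}\ \text{FLOP/s}}{2 \times 10^{12}\ \text{bytes/s}} = 156\ \text{FLOP/byte}$$

Operations with $I < 156$ FLOP/byte on an A100 (FP16 Tensor Core peak) are memory-bandwidth bound. Operations with $I > 156$ FLOP/byte are compute-bound.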
```python
def roofline_analysis(
    flops: float,          # total floating-point operations
    memory_bytes: float,   # total memory traffic (read + write), in bytes
    peak_tflops: float,    # GPU peak performance in TFLOPS (e.g., 312 for A100 dense FP16)
    bandwidth_tbs: float,  # memory bandwidth in TB/s (e.g., 2.0 for A100)
) -> dict:
    """
    Roofline model: diagnose whether an operation is compute or bandwidth bound.
    """
    bandwidth_bytes_per_sec = bandwidth_tbs * 1e12
    peak_flops_per_sec = peak_tflops * 1e12

    arithmetic_intensity = flops / memory_bytes  # FLOP/byte
    ridge_point = peak_flops_per_sec / bandwidth_bytes_per_sec

    # Maximum achievable performance: the lower of the two ceilings
    memory_bound_peak = bandwidth_bytes_per_sec * arithmetic_intensity
    actual_ceiling = min(peak_flops_per_sec, memory_bound_peak)

    return {
        "arithmetic_intensity_flop_per_byte": round(arithmetic_intensity, 2),
        "ridge_point_flop_per_byte": round(ridge_point, 2),
        "bottleneck": "COMPUTE-BOUND" if arithmetic_intensity > ridge_point else "MEMORY-BANDWIDTH BOUND",
        "peak_achievable_tflops": round(actual_ceiling / 1e12, 2),
        "max_utilization_pct": round(actual_ceiling / peak_flops_per_sec * 100, 1),
        "optimization_direction": (
            # Memory-bound: move fewer bytes per FLOP
            "Kernel fusion (e.g., FlashAttention), increase batch size, reduce precision"
            if arithmetic_intensity < ridge_point
            # Compute-bound: do the FLOPs faster or do fewer of them
            else "Use Tensor Cores, reduce precision, reduce model size"
        ),
    }


# Example: Attention computation
# Q, K, V are (batch=32, heads=16, seq=1024, head_dim=64)
# QK^T matmul: 2 * seq * seq * head_dim * batch * heads FLOPs
seq_len = 1024
head_dim = 64
n_heads = 16
batch_size = 32

flops_qkt = 2 * seq_len * seq_len * head_dim * batch_size * n_heads

# Memory: load Q (batch*heads*seq*head_dim*2 bytes) + load K + write the score matrix
q_bytes = batch_size * n_heads * seq_len * head_dim * 2  # float16
k_bytes = q_bytes
out_bytes = batch_size * n_heads * seq_len * seq_len * 2

result = roofline_analysis(
    flops=flops_qkt,
    memory_bytes=q_bytes + k_bytes + out_bytes,
    peak_tflops=312,  # A100 dense FP16 (Tensor Cores)
    bandwidth_tbs=2.0,
)

print("Attention QK^T Roofline Analysis:")
for key, value in result.items():
    print(f"  {key}: {value}")
```
Common ML Operations and Their Arithmetic Intensity
| Operation | Arithmetic Intensity | Bottleneck |
|---|---|---|
| Element-wise activation (ReLU, GELU) | 0.5–2 FLOP/byte | Memory BW |
| Layer normalization | 4–8 FLOP/byte | Memory BW |
| Attention (naive, long sequence) | 3–10 FLOP/byte | Memory BW |
| Linear layer (large batch) | 50–200 FLOP/byte | Compute |
| Attention (FlashAttention) | 20–40 FLOP/byte | Borderline |
This explains why FlashAttention achieves 2–4× speedup over naive attention: it fuses operations so that intermediate results stay in on-chip SRAM instead of round-tripping through HBM, dramatically increasing the effective arithmetic intensity.
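The pattern in the table can be reproduced with back-of-the-envelope counts. The sketch below compares an element-wise op with a linear layer at several batch sizes, assuming FP16 storage and ideal reuse (each matrix is read or written exactly once). The layer size, the 4 FLOPs per element, and the function names are illustrative assumptions, so the exact values will differ from measured kernels.

```python
BYTES_FP16 = 2

def elementwise_intensity(flops_per_element: float) -> float:
    """Element-wise op: read one element and write one element per output."""
    return flops_per_element / (2 * BYTES_FP16)

def linear_intensity(m: int, k: int, n: int) -> float:
    """(m x k) @ (k x n): 2*m*k*n FLOPs; read X and W once, write Y once."""
    flops = 2 * m * k * n
    bytes_moved = BYTES_FP16 * (m * k + k * n + m * n)
    return flops / bytes_moved

print(f"element-wise (4 FLOPs/elem): {elementwise_intensity(4):.2f} FLOP/byte")
for batch in (1, 64, 256, 4096):
    ai = linear_intensity(batch, 1024, 1024)
    print(f"linear 1024x1024, batch={batch:>4}: {ai:7.1f} FLOP/byte")
```

At batch size 1 the linear layer is about as bandwidth-bound as an element-wise op, because every weight is loaded to produce a single output row. Larger batches reuse each loaded weight across many rows, which is what pushes the layer toward the compute-bound side.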
Warp Execution and Occupancy
A GPU does not execute individual threads - it executes warps of 32 threads simultaneously. Every instruction is executed on all 32 threads of a warp in lockstep (SIMT - Single Instruction, Multiple Threads).
Divergence: If threads within a warp take different code paths (e.g., if statements where different threads evaluate differently), the GPU must execute both branches serially with inactive threads masked off. With two branches of similar cost, this roughly halves throughput for that code region; more distinct paths cost proportionally more. For ML workloads, divergence is rare in inner-loop compute kernels but can appear in custom preprocessing or sparse operations.
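A toy cost model makes the serialization visible. This is a simplified sketch, not a hardware simulator: it assumes every path costs the same and that a warp needs one pass per distinct path taken by its 32 lanes. The names `serialized_passes`, `by_parity`, and `by_warp` are made up for illustration.

```python
WARP_SIZE = 32

def serialized_passes(lane_paths):
    """Passes needed for one warp = number of distinct paths its lanes take."""
    if len(lane_paths) != WARP_SIZE:
        raise ValueError(f"expected {WARP_SIZE} lanes, got {len(lane_paths)}")
    return len(set(lane_paths))

# Every lane takes the same branch: no divergence
uniform = ["then"] * WARP_SIZE
# Branch on lane parity: both paths run, each with half the lanes masked off
by_parity = ["then" if lane % 2 == 0 else "else" for lane in range(WARP_SIZE)]
# Branch on warp index instead: each warp is internally uniform
by_warp = [["then"] * WARP_SIZE, ["else"] * WARP_SIZE]

print(serialized_passes(uniform))               # 1
print(serialized_passes(by_parity))             # 2
print([serialized_passes(w) for w in by_warp])  # [1, 1]
```

The last case is the usual fix: the same amount of branching happens overall, but it is aligned to warp boundaries so no single warp has to execute both paths.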
Occupancy: The fraction of the maximum number of warps that can be active simultaneously on a Streaming Multiprocessor (SM). Low occupancy means the SM has too few active warps to hide memory latency.
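Theoretical occupancy can be estimated from a kernel's resource usage. The sketch below is a simplified calculation that ignores register and shared-memory allocation granularity; the default per-SM limits (64 warps, 32 blocks, 65,536 registers, 164 KB shared memory) are assumed A100-class values, and the function name is hypothetical. For real kernels, a tool such as Nsight Compute reports the actual figure.

```python
def theoretical_occupancy(
    threads_per_block: int,
    regs_per_thread: int,
    smem_per_block_bytes: int = 0,
    max_warps_per_sm: int = 64,           # assumed A100-class limit
    max_blocks_per_sm: int = 32,          # assumed A100-class limit
    regs_per_sm: int = 65536,             # assumed A100-class limit
    smem_per_sm_bytes: int = 164 * 1024,  # assumed A100-class limit
) -> float:
    """Fraction of an SM's warp slots that can be resident at once (simplified)."""
    warps_per_block = -(-threads_per_block // 32)  # ceiling division
    # Each resource caps how many blocks fit on one SM; the tightest cap wins
    block_limits = [
        max_blocks_per_sm,
        max_warps_per_sm // warps_per_block,
        regs_per_sm // (regs_per_thread * warps_per_block * 32),
    ]
    if smem_per_block_bytes > 0:
        block_limits.append(smem_per_sm_bytes // smem_per_block_bytes)
    resident_blocks = min(block_limits)
    return resident_blocks * warps_per_block / max_warps_per_sm

print(theoretical_occupancy(256, regs_per_thread=32))    # 1.0
print(theoretical_occupancy(256, regs_per_thread=128))   # 0.25
print(theoretical_occupancy(256, regs_per_thread=32, smem_per_block_bytes=48 * 1024))  # 0.375
```

In the second call registers are the limiting resource, and in the third it is shared memory. Either way fewer warps are resident, so the SM has less independent work to switch to while a warp waits on memory.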
```python
import torch


def check_gpu_memory_stats():
    """Inspect GPU memory allocation and fragmentation."""
    if not torch.cuda.is_available():
        print("CUDA not available")
        return

    device = torch.cuda.current_device()
    props = torch.cuda.get_device_properties(device)
    # Free memory as reported by the driver; unlike total - allocated, this
    # accounts for the caching allocator's reserved pool and other processes
    free_bytes, _total_bytes = torch.cuda.mem_get_info(device)

    stats = {
        "device_name": torch.cuda.get_device_name(device),
        "total_vram_gb": props.total_memory / 1e9,
        "allocated_gb": torch.cuda.memory_allocated(device) / 1e9,
        "reserved_gb": torch.cuda.memory_reserved(device) / 1e9,
        "free_gb": free_bytes / 1e9,
    }

    for key, value in stats.items():
        if isinstance(value, float):
            print(f"  {key}: {value:.2f}")
        else:
            print(f"  {key}: {value}")

    # Detailed memory summary
    print("\n" + torch.cuda.memory_summary(device, abbreviated=True))
```
NVLink vs PCIe
When training on multiple GPUs, the GPUs must communicate gradients. The bandwidth of this communication channel determines distributed training efficiency.
PCIe 4.0: Up to 64 GB/s bidirectional (32 GB/s per direction) per x16 slot. If 8 GPUs are connected to the CPU only via PCIe, all-reduce across the 8 GPUs is limited by PCIe bandwidth.
NVLink: 900 GB/s total bandwidth per GPU on the H100 (NVLink 4.0) and 600 GB/s on the A100 (NVLink 3.0). This is 9–14× the bidirectional bandwidth of a PCIe 4.0 x16 slot.
The practical impact: for data parallel training, communication is an all-reduce of all gradient tensors. A 7B parameter model has 7 × 10^9 × 4 bytes = 28 GB of FP32 gradients. As a first-order estimate that moves the full payload once over the link, that all-reduce takes 875 ms over PCIe at 32 GB/s (unidirectional) - longer than most forward+backward passes. Over NVLink at 300 GB/s per direction, it takes 93 ms. NVLink makes data parallelism practical; PCIe-only systems require careful gradient compression or pipeline parallelism to compensate.
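The arithmetic above can be wrapped in a small helper. The sketch below reproduces the first-order estimate and, optionally, applies the standard ring all-reduce traffic factor of 2(N−1)/N per GPU. The function name and parameters are illustrative, and the model ignores latency, protocol overhead, and overlap of communication with compute.

```python
def allreduce_time_ms(param_count, bytes_per_grad, link_gb_per_s, n_gpus=None):
    """Estimated gradient all-reduce time in milliseconds.

    With n_gpus=None the full payload crosses the link once (first-order estimate).
    With n_gpus set, each GPU sends 2*(N-1)/N of the payload (ring all-reduce).
    """
    payload_bytes = param_count * bytes_per_grad
    if n_gpus is not None:
        payload_bytes *= 2 * (n_gpus - 1) / n_gpus
    return payload_bytes / (link_gb_per_s * 1e9) * 1e3

for name, gb_per_s in (("PCIe 4.0 x16", 32), ("NVLink (A100)", 300)):
    simple = allreduce_time_ms(7e9, 4, gb_per_s)
    ring = allreduce_time_ms(7e9, 4, gb_per_s, n_gpus=8)
    print(f"{name}: {simple:.0f} ms first-order, {ring:.0f} ms with 8-GPU ring factor")
```

With the ring factor included, the 8-GPU estimates are 1.75× larger than the first-order numbers, which widens the absolute gap between the two interconnects.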
Practical Profiling with PyTorch
```python
import torch
import torch.nn as nn
from torch.profiler import profile, record_function, ProfilerActivity


def profile_model_forward(model: nn.Module, batch_size: int = 32):
    """
    Profile a model's forward pass to identify GPU bottlenecks.
    Assumes the model accepts inputs of shape (batch_size, 512).
    """
    use_cuda = torch.cuda.is_available()
    device = torch.device("cuda" if use_cuda else "cpu")
    model = model.to(device)
    dummy_input = torch.randn(batch_size, 512, device=device)

    # Warm up so one-time initialization does not pollute the profile
    with torch.no_grad():
        for _ in range(5):
            _ = model(dummy_input)
    if use_cuda:
        torch.cuda.synchronize()

    # Only request CUDA activity when a GPU is present
    activities = [ProfilerActivity.CPU]
    if use_cuda:
        activities.append(ProfilerActivity.CUDA)

    # Profile
    with profile(
        activities=activities,
        record_shapes=True,
        profile_memory=True,
        with_flops=True,
    ) as prof:
        with record_function("model_inference"):
            with torch.no_grad():
                for _ in range(20):
                    _ = model(dummy_input)
            if use_cuda:
                # Wait for queued kernels so they are captured in the profile
                torch.cuda.synchronize()

    # Print top operations by CUDA time (CPU time when no GPU is available)
    print(prof.key_averages().table(
        sort_by="cuda_time_total" if use_cuda else "cpu_time_total",
        row_limit=15
    ))

    # Export Chrome trace for visualization
    prof.export_chrome_trace("trace.json")
    print("Trace saved to trace.json - open in chrome://tracing")
```
Production Engineering Notes
Monitor GPU utilization with context. nvidia-smi reports "GPU utilization" as the fraction of time during which at least one kernel was running. A utilization of 95% can mean the GPU is doing useful work near its peak rate, or that a single slow kernel is occupying it 95% of the time. Use Nsight Compute or the PyTorch profiler to see which kernels are running and whether they are compute-bound or memory-bound.
Prefer bf16 over fp16 for training stability. BF16 (Brain Float 16) has the same exponent range as FP32 (8 exponent bits) but only 7 mantissa bits. This means bf16 rarely overflows or underflows (the main cause of fp16 training instability) while still using Tensor Cores. NVIDIA data center GPUs from Ampere onward (A100, H100) and TPUs support bf16 natively.
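The range difference follows directly from the bit layouts. The sketch below derives the largest finite value, smallest normal value, and machine epsilon from the exponent and mantissa widths using the usual IEEE-style conventions (the top exponent code is reserved for infinities and NaNs). The helper name `float_format_stats` is made up for this example.

```python
def float_format_stats(exp_bits: int, mantissa_bits: int) -> dict:
    """Range and precision of a binary float format with the given field widths."""
    bias = 2 ** (exp_bits - 1) - 1
    max_exp = (2 ** exp_bits - 2) - bias  # top exponent code is reserved for inf/NaN
    return {
        "max": (2 - 2.0 ** -mantissa_bits) * 2.0 ** max_exp,
        "min_normal": 2.0 ** (1 - bias),
        "epsilon": 2.0 ** -mantissa_bits,
    }

fp16 = float_format_stats(exp_bits=5, mantissa_bits=10)
bf16 = float_format_stats(exp_bits=8, mantissa_bits=7)

for name, stats in (("fp16", fp16), ("bf16", bf16)):
    print(f"{name}: max={stats['max']:.4g}, "
          f"min_normal={stats['min_normal']:.4g}, epsilon={stats['epsilon']:.4g}")
```

FP16 tops out at 65,504, so a large activation or loss value overflows to infinity unless loss scaling is used. BF16 reaches roughly 3.4 × 10^38 like FP32, at the cost of 8× coarser precision than FP16.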
Use torch.compile() for automatic kernel fusion. In PyTorch 2.0+, model = torch.compile(model) uses TorchInductor to fuse operations and generate optimized GPU kernels automatically. For transformer models, compile() typically improves throughput 10–30% with no other code changes.
Common Mistakes
:::danger Assuming higher VRAM means faster training
VRAM capacity determines what model sizes you can fit. VRAM bandwidth determines how fast memory-bound training steps run. An H100 with 80 GB HBM3 (3.35 TB/s) is faster at training than an A100 with 80 GB HBM2e (2.0 TB/s) - not because of extra VRAM, but because of 67% higher bandwidth (along with higher Tensor Core throughput). Check bandwidth specs, not just VRAM capacity.
:::
:::danger Running FP32 training when FP16/BF16 would work
Strict FP32 matrix multiplies do not use Tensor Cores (TF32 mode on Ampere and newer GPUs is the exception). FP16/BF16 training uses Tensor Cores and can achieve 4–16× higher matmul throughput on the same GPU. Most modern neural networks train stably in BF16. The correct default is mixed precision training (BF16 forward/backward pass, FP32 master weights). Always compare FP32 and BF16 training loss curves before assuming FP32 is necessary.
:::
:::warning Ignoring PCIe bandwidth for multi-GPU setups
Consumer RTX GPUs connect via PCIe x16 to the CPU. For 8-GPU all-reduce, PCIe bandwidth becomes a severe bottleneck. If you are buying hardware for multi-GPU training, check whether you need NVLink (A100/H100 SXM) or NVSwitch (DGX systems). PCIe-only setups (even RTX 4090s) are significantly slower for gradient communication in data parallel training.
:::
:::tip Use NVTX annotations to correlate PyTorch operations with CUDA kernels
When profiling with Nsight Systems, NVTX range markers show which PyTorch operation generated which CUDA kernels. Add torch.cuda.nvtx.range_push("attention") before and range_pop() after attention computation to see exactly how long each model component takes at the CUDA kernel level.
:::
Interview Questions
Q1: What is the roofline model and how do you use it to diagnose GPU performance issues?
The roofline model plots achievable performance as a function of arithmetic intensity (FLOP/byte). Two limits constrain performance: peak compute (TFLOPS) and memory bandwidth (TB/s). An operation with arithmetic intensity below the ridge point is memory-bandwidth bound - adding more compute units will not help. To use it: measure your kernel's FLOP count and memory traffic (using Nsight Compute), compute arithmetic intensity, and compare it to your GPU's ridge point (peak FLOP/s divided by bandwidth, ≈ 156 FLOP/byte for A100 dense FP16 at 312 TFLOPS and 2 TB/s). If below the ridge point, optimize for memory access patterns - fuse kernels to reuse cached data, reduce precision, or use algorithms with higher arithmetic intensity (FlashAttention instead of naive attention).
Q2: Why does FlashAttention achieve 2–4× speedup over standard PyTorch attention without changing the math?
Standard attention splits into three separate kernels: Q×K^T matmul, softmax, and result×V matmul. Each kernel reads its input from HBM and writes its output to HBM. The attention matrix for sequence length 1024 is 1024×1024×2 bytes = 2 MB per head. For batch 32 and 16 heads, that is 1 GB that must be written and re-read between kernels. FlashAttention fuses all three into one kernel with a tiled algorithm that processes the attention matrix block by block in on-chip SRAM. It never materializes the full attention matrix in HBM. This eliminates most of the HBM traffic, raising effective arithmetic intensity from ~5 FLOP/byte to ~35 FLOP/byte and moving the kernel much closer to the compute-bound regime.
Q3: Explain warp divergence. When does it occur in ML workloads?
A warp is 32 CUDA threads executing in lockstep. If threads within a warp take different code paths (if/else branching), the GPU serializes both branches with inactive threads masked - cutting throughput in half for that divergent region. In standard ML workloads (dense matrix multiplication, attention, activation functions), divergence is rare because all threads execute the same operations on different data. Divergence appears in: sparse attention masks (some threads process valid positions, others skip), custom CUDA kernels with conditional logic, mixture-of-experts routing (different tokens routed to different experts), and preprocessing kernels with variable-length inputs. Minimize divergence by restructuring conditionals to be uniform across warp boundaries.
Q4: What is the difference between CUDA cores and Tensor Cores, and why does it matter for training?
CUDA cores perform one scalar floating-point multiply-accumulate per clock. Tensor Cores perform a small matrix multiply-accumulate per clock: 4×4×4, or 64 FP16 MACs, on Volta, and 256 FP16 MACs per Tensor Core on the A100, versus 1 FP32 MAC per CUDA core. An A100 has 432 Tensor Cores delivering 312 TFLOPS of dense FP16 versus 6,912 CUDA cores delivering 19.5 TFLOPS of FP32, a 16× difference in peak matmul throughput. This matters because: (1) always use mixed precision (BF16 or FP16) to engage Tensor Cores; (2) matrix dimensions should be multiples of 8 (for Volta) or 16 (for Ampere) so that libraries can map operations onto Tensor Cores efficiently; (3) FP32 training is not just "safer" - it is several times slower and rarely needed for modern architectures when mixed precision is used with FP32 master weights.
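Point (2) is usually handled by rounding awkward dimensions up to the next multiple. The sketch below is a minimal helper; the GPT-2 vocabulary size of 50,257 is used only as a familiar example of a dimension that is not a multiple of 8, and `pad_to_multiple` is a hypothetical name.

```python
def pad_to_multiple(dim: int, multiple: int) -> int:
    """Round dim up to the nearest multiple (returns dim if already aligned)."""
    return ((dim + multiple - 1) // multiple) * multiple

vocab_size = 50257  # example of an unaligned dimension
print(pad_to_multiple(vocab_size, 8))   # 50264
print(pad_to_multiple(vocab_size, 64))  # 50304
print(pad_to_multiple(4096, 8))         # 4096 (already aligned)
```

The padded rows or columns carry no information and are simply ignored downstream; the small amount of wasted compute is the price of keeping the matmul on the fast path.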
Q5: When would you choose NVLink over PCIe connectivity for a multi-GPU training cluster?
Prefer NVLink whenever gradient synchronization is a significant fraction of iteration time. The threshold: when the gradient size (parameters × 4 bytes for FP32 gradients) divided by your desired all-reduce time exceeds PCIe bandwidth. For a 7B parameter model (28 GB of gradients) at 32 GB/s PCIe bandwidth, all-reduce takes about 875 ms. If your forward+backward pass takes 2 seconds, that is 44% overhead from communication. With NVLink at 300 GB/s, the same all-reduce takes about 93 ms - 9× faster. In practice: if you are training models above ~1B parameters on multi-GPU machines you own, specify A100 SXM or H100 SXM (NVLink interconnect) rather than PCIe variants. The cost premium is significant but the training time savings usually justify it.
