Module 1: GPU Architecture
Every performance optimization in deep learning - from FlashAttention to mixed precision to fused kernels - makes sense only if you understand the hardware it is optimizing for. This module builds the mental model of how a GPU actually works so that every optimization you encounter from this point on has an obvious mechanical explanation.
The goal is not to make you a GPU engineer. It is to give you enough architectural understanding that you can reason about why your code runs fast or slow, make informed choices between hardware options, and read optimization papers with comprehension rather than just applying the results.
The Core Mental Model
A modern GPU like the H100 contains 132 Streaming Multiprocessors (SMs). Each SM can run thousands of threads simultaneously. But this parallelism is only useful if you give each thread meaningful work to do - and if the data those threads need is available on time.
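To make "thousands of threads" concrete, here is a back-of-envelope calculation. The SM count comes from the text above; the per-SM resident-thread limit of 2048 is an assumption based on the CUDA hardware limit for this compute capability, not something stated in this module:

```python
# Back-of-envelope: how many threads can be resident on an H100 at once.
# 132 SMs is from the text; 2048 resident threads per SM is an assumed
# CUDA hardware limit (compute capability 9.0).
SMS = 132
MAX_RESIDENT_THREADS_PER_SM = 2048

resident_threads = SMS * MAX_RESIDENT_THREADS_PER_SM
print(resident_threads)  # 270336
```

The point is not the exact number but the scale: a kernel needs on the order of hundreds of thousands of threads in flight before the GPU has enough parallel work to hide memory latency.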
The fundamental tension in GPU programming is between compute and memory bandwidth. You have an enormous amount of compute (roughly 989 dense BF16 TFLOPS on an H100 SXM). You have high-bandwidth memory (3.35 TB/s of HBM3 on an H100 SXM). But memory bandwidth is still frequently the bottleneck, because attention (when not fused), embedding lookups, and activation functions are typically memory-bound - they spend more time reading and writing data than doing arithmetic.
The roofline model makes this concrete. Every operation has a characteristic arithmetic intensity: the FLOPs it performs per byte of memory traffic. If that intensity is below the hardware's ratio of peak compute to memory bandwidth (the "ridge point"), you are memory-bound; above it, you are compute-bound. This single insight explains most of what FlashAttention, operator fusion, and quantization are doing.
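The roofline classification above can be sketched in a few lines. The hardware numbers come from the text; the two example operations (an elementwise add and a square matmul) and their byte-counting assumptions are illustrative, assuming BF16 storage and that each tensor touches HBM exactly once:

```python
# Roofline sketch: classify an operation as memory- or compute-bound.
# Hardware numbers for an H100 SXM, taken from the text above.
PEAK_FLOPS = 989e12        # dense BF16 tensor-core FLOP/s
PEAK_BANDWIDTH = 3.35e12   # HBM3 bytes/s
RIDGE = PEAK_FLOPS / PEAK_BANDWIDTH  # ~295 FLOPs/byte

def classify(flops, bytes_moved):
    intensity = flops / bytes_moved  # arithmetic intensity, FLOPs per byte
    return "compute-bound" if intensity > RIDGE else "memory-bound"

# Elementwise add of two BF16 tensors of n elements:
# 1 FLOP per element; 6 bytes per element (read a, read b, write out, 2 B each).
n = 1_000_000
print(classify(n, 6 * n))             # memory-bound (intensity ~0.17)

# Square BF16 matmul C = A @ B with side k:
# 2*k^3 FLOPs; 6*k^2 bytes if each of the three matrices hits HBM once.
k = 4096
print(classify(2 * k**3, 6 * k**2))   # compute-bound (intensity ~1365)
```

Note how far apart the two intensities are: the matmul sits three orders of magnitude above the elementwise op. This is why fusing elementwise work into surrounding kernels pays off, while large matmuls are limited by raw FLOPS.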
GPU Architecture Overview
Lessons in This Module
| # | Lesson | Key Concept |
|---|---|---|
| 1 | GPU vs CPU Architecture | Why GPUs win for matrix ops; SIMT execution model |
| 2 | Streaming Multiprocessors | SM internals, warp scheduling, occupancy |
| 3 | Memory Hierarchy in GPUs | Registers, L1, L2, HBM - latency and bandwidth at each level |
| 4 | Tensor Cores and Mixed Precision | How tensor cores work, BF16/FP16, TF32 |
| 5 | Ampere, Hopper, Ada Architectures | What changed across GPU generations for AI |
| 6 | Roofline Model and Bottleneck Analysis | Arithmetic intensity, identifying compute vs memory bottlenecks |
| 7 | PCIe and NVLink Interconnects | Host-device bandwidth, GPU-GPU communication |
| 8 | Selecting GPUs for Training vs Inference | H100, A100, L40S, RTX 4090 - when to use which |
Key Concepts You Will Master
- SIMT execution model - how GPUs execute thousands of threads in lockstep and what happens when they diverge
- Warp occupancy - why filling the GPU with active warps matters for hiding memory latency
- Memory bandwidth vs compute - the roofline model and how to apply it to your workloads
- Tensor core operations - the matrix multiply-accumulate operations that make modern GPU training possible
- Architecture generations - what Ampere (A100), Hopper (H100), and Ada (RTX 40xx) each added for AI workloads
Prerequisites
- Basic Python and PyTorch
- Familiarity with neural network training
