Module 1: GPU Architecture

Every performance optimization in deep learning - from FlashAttention to mixed precision to fused kernels - makes sense only if you understand the hardware it is optimizing for. This module builds the mental model of how a GPU actually works so that every optimization you encounter from this point on has an obvious mechanical explanation.

The goal is not to make you a GPU engineer. It is to give you enough architectural understanding that you can reason about why your code runs fast or slow, make informed choices between hardware options, and read optimization papers with comprehension rather than just applying the results.

The Core Mental Model

A modern GPU like the H100 contains 132 Streaming Multiprocessors (SMs). Each SM can run thousands of threads simultaneously. But this parallelism is only useful if you give each thread meaningful work to do - and if the data those threads need is available on time.
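The scale of that parallelism is worth computing once. A quick sketch, assuming the Hopper hardware limit of 2,048 resident threads per SM (that limit is an assumption here, not stated above):

```python
# Back-of-envelope resident-thread count for an H100.
NUM_SMS = 132              # from the text above
MAX_THREADS_PER_SM = 2048  # assumed Hopper per-SM hardware limit

resident_threads = NUM_SMS * MAX_THREADS_PER_SM
print(resident_threads)  # 270336 threads resident at full occupancy
```

Keeping anywhere near that many threads supplied with useful work and timely data is the whole game.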

The fundamental tension in GPU programming is between compute and memory bandwidth. An H100 SXM offers an enormous amount of compute (989 dense BF16 TFLOPS) and high-bandwidth memory (3.35 TB/s of HBM3). Yet memory bandwidth is still frequently the bottleneck, because common operations like unfused attention, embedding lookups, and activation functions are memory-bound: they spend more time reading and writing data than doing arithmetic.
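The ratio between these two peaks can be computed directly from the spec numbers above (treat them as nominal peaks; real kernels achieve less):

```python
# H100 SXM nominal peaks, taken from the text above.
PEAK_FLOPS = 989e12  # dense BF16 tensor-core throughput, FLOP/s
PEAK_BW = 3.35e12    # HBM3 bandwidth, bytes/s

# FLOPs a kernel must perform per byte moved to keep the compute
# units busy; below this ratio, the kernel is memory-bound.
ridge_point = PEAK_FLOPS / PEAK_BW
print(f"{ridge_point:.0f} FLOPs/byte")  # ~295
```

Roughly 295 FLOPs per byte is a high bar: most non-matmul operations fall far below it.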

The roofline model makes this concrete. Every operation has an arithmetic intensity: the number of FLOPs it performs per byte of memory it moves. If that intensity is below the hardware's ratio of peak compute to peak bandwidth, the operation is memory-bound; above it, compute-bound. This single insight explains most of what FlashAttention, operator fusion, and quantization are doing.
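The roofline itself is a one-line formula. A minimal sketch, reusing the H100 peaks quoted above and two textbook intensity estimates (the operand sizes and byte counts are illustrative assumptions):

```python
def attainable_tflops(intensity, peak_tflops=989.0, peak_bw_tbs=3.35):
    """Roofline: attainable throughput (TFLOPS) at a given arithmetic
    intensity (FLOPs/byte), capped by the compute peak."""
    return min(peak_tflops, intensity * peak_bw_tbs)

# BF16 elementwise add: 1 FLOP per 6 bytes moved (read a, read b, write c).
print(attainable_tflops(1 / 6))     # ~0.56 TFLOPS: deeply memory-bound

# Square BF16 matmul, N=4096: 2*N**3 FLOPs over 3*N**2 * 2 bytes = N/3 FLOPs/byte.
print(attainable_tflops(4096 / 3))  # 989.0 TFLOPS: compute-bound
```

The three-orders-of-magnitude gap between those two numbers is why fusing elementwise ops into neighboring matmuls pays off so dramatically.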

GPU Architecture Overview

Lessons in This Module

  1. GPU vs CPU Architecture - Why GPUs win for matrix ops; SIMT execution model
  2. Streaming Multiprocessors - SM internals, warp scheduling, occupancy
  3. Memory Hierarchy in GPUs - Registers, L1, L2, HBM; latency and bandwidth at each level
  4. Tensor Cores and Mixed Precision - How tensor cores work; BF16/FP16, TF32
  5. Ampere, Hopper, Ada Architectures - What changed across GPU generations for AI
  6. Roofline Model and Bottleneck Analysis - Arithmetic intensity; identifying compute vs memory bottlenecks
  7. PCIe and NVLink Interconnects - Host-device bandwidth, GPU-GPU communication
  8. Selecting GPUs for Training vs Inference - H100, A100, L40S, RTX 4090; when to use which

Key Concepts You Will Master

  • SIMT execution model - how GPUs execute thousands of threads in lockstep and what happens when they diverge
  • Warp occupancy - why filling the GPU with active warps matters for hiding memory latency
  • Memory bandwidth vs compute - the roofline model and how to apply it to your workloads
  • Tensor core operations - the matrix multiply-accumulate operations that make modern GPU training possible
  • Architecture generations - what Ampere (A100), Hopper (H100), and Ada (RTX 40xx) each added for AI workloads
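The SIMT divergence point above can be illustrated with a toy cost model. This is a deliberate simplification, not how real hardware schedules: it assumes a warp serially executes every branch path that any of its 32 lanes takes, with non-participating lanes masked off:

```python
def warp_branch_cycles(lane_predicates, cost_true=10, cost_false=10):
    """Toy SIMT cost model: a warp pays for every branch path
    taken by at least one of its lanes."""
    cycles = 0
    if any(lane_predicates):      # some lane takes the if-side
        cycles += cost_true
    if not all(lane_predicates):  # some lane takes the else-side
        cycles += cost_false
    return cycles

uniform = [True] * 32                             # all 32 lanes agree
diverged = [lane % 2 == 0 for lane in range(32)]  # even/odd lanes split

print(warp_branch_cycles(uniform))   # 10: one path executed
print(warp_branch_cycles(diverged))  # 20: both paths executed serially
```

Even a single diverging lane doubles the cost in this model, which is why branch conditions that vary per-thread within a warp are a classic performance trap.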

Prerequisites

  • Basic Python and PyTorch
  • Familiarity with neural network training
© 2026 EngineersOfAI. All rights reserved.