Module 1: GPU Architecture

Every performance optimization in deep learning - from FlashAttention to mixed precision to fused kernels - makes sense only if you understand the hardware it is optimizing for. This module builds the mental model of how a GPU actually works so that every optimization you encounter from this point on has an obvious mechanical explanation.

The goal is not to make you a GPU engineer. It is to give you enough architectural understanding that you can reason about why your code runs fast or slow, make informed choices between hardware options, and read optimization papers with comprehension rather than just applying the results.

The Core Mental Model

A modern GPU like the H100 contains 132 Streaming Multiprocessors (SMs). Each SM can run thousands of threads simultaneously. But this parallelism is only useful if you give each thread meaningful work to do - and if the data those threads need is available on time.
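The scale of that parallelism is worth computing once. A quick sketch, assuming the Hopper hardware limit of 2,048 resident threads per SM (that limit is an assumption here, not stated above):

```python
# Back-of-envelope resident-thread count for an H100.
NUM_SMS = 132              # from the text above
MAX_THREADS_PER_SM = 2048  # assumed Hopper per-SM hardware limit

resident_threads = NUM_SMS * MAX_THREADS_PER_SM
print(resident_threads)  # 270336 threads resident at full occupancy
```

Keeping anywhere near that many threads supplied with useful work and timely data is the whole game.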

The fundamental tension in GPU programming is between compute and memory bandwidth. An H100 SXM offers an enormous amount of compute (989 dense BF16 TFLOPS) and high-bandwidth memory (3.35 TB/s of HBM3). Yet memory bandwidth is still frequently the bottleneck, because common operations like unfused attention, embedding lookups, and activation functions are memory-bound: they spend more time reading and writing data than doing arithmetic.
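The ratio between these two peaks can be computed directly from the spec numbers above (treat them as nominal peaks; real kernels achieve less):

```python
# H100 SXM nominal peaks, taken from the text above.
PEAK_FLOPS = 989e12  # dense BF16 tensor-core throughput, FLOP/s
PEAK_BW = 3.35e12    # HBM3 bandwidth, bytes/s

# FLOPs a kernel must perform per byte moved to keep the compute
# units busy; below this ratio, the kernel is memory-bound.
ridge_point = PEAK_FLOPS / PEAK_BW
print(f"{ridge_point:.0f} FLOPs/byte")  # ~295
```

Roughly 295 FLOPs per byte is a high bar: most non-matmul operations fall far below it.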

The roofline model makes this concrete. Every operation has an arithmetic intensity: the number of FLOPs it performs per byte of memory it moves. If that intensity is below the hardware's ratio of peak compute to peak bandwidth, the operation is memory-bound; above it, compute-bound. This single insight explains most of what FlashAttention, operator fusion, and quantization are doing.
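The roofline itself is a one-line formula. A minimal sketch, reusing the H100 peaks quoted above and two textbook intensity estimates (the operand sizes and byte counts are illustrative assumptions):

```python
def attainable_tflops(intensity, peak_tflops=989.0, peak_bw_tbs=3.35):
    """Roofline: attainable throughput (TFLOPS) at a given arithmetic
    intensity (FLOPs/byte), capped by the compute peak."""
    return min(peak_tflops, intensity * peak_bw_tbs)

# BF16 elementwise add: 1 FLOP per 6 bytes moved (read a, read b, write c).
print(attainable_tflops(1 / 6))     # ~0.56 TFLOPS: deeply memory-bound

# Square BF16 matmul, N=4096: 2*N**3 FLOPs over 3*N**2 * 2 bytes = N/3 FLOPs/byte.
print(attainable_tflops(4096 / 3))  # 989.0 TFLOPS: compute-bound
```

The three-orders-of-magnitude gap between those two numbers is why fusing elementwise ops into neighboring matmuls pays off so dramatically.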

GPU Architecture Overview

Lessons in This Module

  1. GPU vs CPU Architecture - Why GPUs win for matrix ops; SIMT execution model
  2. Streaming Multiprocessors - SM internals, warp scheduling, occupancy
  3. Memory Hierarchy in GPUs - Registers, L1, L2, HBM; latency and bandwidth at each level
  4. Tensor Cores and Mixed Precision - How tensor cores work; BF16/FP16, TF32
  5. Ampere, Hopper, Ada Architectures - What changed across GPU generations for AI
  6. Roofline Model and Bottleneck Analysis - Arithmetic intensity; identifying compute vs memory bottlenecks
  7. PCIe and NVLink Interconnects - Host-device bandwidth, GPU-GPU communication
  8. Selecting GPUs for Training vs Inference - H100, A100, L40S, RTX 4090; when to use which

Key Concepts You Will Master

  • SIMT execution model - how GPUs execute thousands of threads in lockstep and what happens when they diverge
  • Warp occupancy - why filling the GPU with active warps matters for hiding memory latency
  • Memory bandwidth vs compute - the roofline model and how to apply it to your workloads
  • Tensor core operations - the matrix multiply-accumulate operations that make modern GPU training possible
  • Architecture generations - what Ampere (A100), Hopper (H100), and Ada (RTX 40xx) each added for AI workloads
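The SIMT divergence point above can be illustrated with a toy cost model. This is a deliberate simplification, not how real hardware schedules: it assumes a warp serially executes every branch path that any of its 32 lanes takes, with non-participating lanes masked off:

```python
def warp_branch_cycles(lane_predicates, cost_true=10, cost_false=10):
    """Toy SIMT cost model: a warp pays for every branch path
    taken by at least one of its lanes."""
    cycles = 0
    if any(lane_predicates):      # some lane takes the if-side
        cycles += cost_true
    if not all(lane_predicates):  # some lane takes the else-side
        cycles += cost_false
    return cycles

uniform = [True] * 32                             # all 32 lanes agree
diverged = [lane % 2 == 0 for lane in range(32)]  # even/odd lanes split

print(warp_branch_cycles(uniform))   # 10: one path executed
print(warp_branch_cycles(diverged))  # 20: both paths executed serially
```

Even a single diverging lane doubles the cost in this model, which is why branch conditions that vary per-thread within a warp are a classic performance trap.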

Prerequisites

  • Basic Python and PyTorch
  • Familiarity with neural network training
© 2026 EngineersOfAI. All rights reserved.