GPU architecture, CUDA programming, custom accelerators, kernel optimization, memory systems, and distributed training infrastructure, from transistors to clusters.
The hardware layer that determines whether your model trains fast or sits idle.
SM architecture, warp execution, the CUDA execution model, and the GPU memory hierarchy from first principles. (8 lessons)
Thread hierarchies, memory management, kernel optimization, and writing high-performance GPU kernels. (8 lessons)
TPUs, Trainium, Groq LPU, Cerebras, Gaudi, and Apple Silicon: their architectures, and when to choose a custom accelerator over a GPU. (8 lessons)
Occupancy, tiling, tensor cores, the Flash Attention kernel, kernel fusion, and Triton for custom kernels. (8 lessons)
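Occupancy, the first topic in this module, comes down to which SM resource runs out first. A minimal sketch of the estimate, using illustrative resource limits (roughly A100-class; the real limits depend on the GPU's compute capability):

```python
def sm_occupancy(threads_per_block, regs_per_thread, smem_per_block,
                 max_threads=2048, reg_file=65536, smem_bytes=49152,
                 warp_size=32, max_warps=64):
    """Estimate SM occupancy: resident blocks per SM are capped by
    whichever resource (threads, registers, shared memory) is exhausted
    first, and occupancy is resident warps over the warp limit."""
    warps_per_block = (threads_per_block + warp_size - 1) // warp_size
    by_threads = max_threads // threads_per_block
    by_regs = reg_file // (regs_per_thread * threads_per_block)
    by_smem = smem_bytes // smem_per_block if smem_per_block else by_threads
    blocks = min(by_threads, by_regs, by_smem)
    return blocks * warps_per_block / max_warps

# 256 threads/block at 32 registers/thread fills the SM completely...
print(sm_occupancy(256, 32, 0))   # -> 1.0
# ...while 64 registers/thread halves occupancy (register-limited).
print(sm_occupancy(256, 64, 0))   # -> 0.5
```

The same arithmetic is what NVIDIA's occupancy calculator performs; high occupancy is a prerequisite for hiding memory latency, not a guarantee of peak throughput.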
HBM, roofline analysis, NVLink bandwidth, KV cache capacity planning, and storage I/O bottlenecks. (8 lessons)
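The KV cache sizing and roofline ideas in this module reduce to a few lines of arithmetic. A sketch, using a Llama-7B-like configuration (32 layers, 32 KV heads, head dimension 128, fp16) purely as an assumed example:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    """KV cache footprint: keys plus values (the leading 2) for every
    layer, head, and token position, at the given element width."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * dtype_bytes

def attainable_flops(ai, peak_flops, peak_bw):
    """Roofline model: achievable throughput is the lesser of the compute
    roof and arithmetic intensity (FLOP/byte) times memory bandwidth."""
    return min(peak_flops, ai * peak_bw)

# One 4096-token sequence on the 7B-class config above: 2 GiB of cache.
print(kv_cache_bytes(32, 32, 128, 4096, 1) / 2**30)  # -> 2.0
```

The same two functions answer the planning questions the module poses: how many concurrent sequences fit in HBM, and whether a kernel at a given arithmetic intensity is bandwidth-bound or compute-bound.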
Multi-GPU architectures, NCCL, DGX systems, ZeRO, fault tolerance, and cloud vs. on-prem trade-offs. (8 lessons)
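A taste of the capacity math behind NCCL-style collectives: a ring all-reduce over N GPUs sends 2(N-1)/N of the message per GPU (reduce-scatter plus all-gather), which puts a bandwidth-only lower bound on gradient synchronization time. The link bandwidth you would plug in is whatever your interconnect provides; the sketch below leaves it as a parameter:

```python
def ring_allreduce_traffic(msg_bytes, n_gpus):
    """Per-GPU bytes over the ring: reduce-scatter plus all-gather,
    each phase moving (N-1)/N of the message."""
    return 2 * (n_gpus - 1) / n_gpus * msg_bytes

def allreduce_seconds(msg_bytes, n_gpus, link_bw_bytes_per_s):
    """Bandwidth-only lower bound; ignores latency, launch overhead,
    and any overlap of communication with compute."""
    return ring_allreduce_traffic(msg_bytes, n_gpus) / link_bw_bytes_per_s

# 1 GB of gradients across 8 GPUs -> 1.75 GB crosses each link.
print(ring_allreduce_traffic(1e9, 8))  # -> 1750000000.0
```

Because per-GPU traffic approaches 2x the message size as N grows, all-reduce cost is nearly independent of GPU count, which is why interconnect bandwidth, not cluster size, usually sets the data-parallel scaling ceiling.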
Inference-optimized hardware, KV cache management, speculative decoding, TensorRT, and edge inference. (8 lessons)
Every millisecond of latency and every dollar of compute traces back to hardware decisions.
Start Learning Free →