Module 6: Distributed Training Hardware
Training GPT-4 is estimated to have taken roughly 25,000 A100 GPUs running for months. Training Llama 3 took 24,576 H100 GPUs. These are not just big computers - they are purpose-built infrastructure in which the network between GPUs matters as much as the GPUs themselves.
The Bandwidth Hierarchy
When you scale from 1 GPU to 8, NVLink bandwidth inside the node is the limit. When you scale from 8 to 64, gradients have to leave the node, and the PCIe path from GPU to NIC becomes the constraint. When you scale to hundreds of nodes, the InfiniBand fabric itself - its switch tiers and any oversubscription - becomes the bottleneck. Every scaling step has a different constraint, and understanding those constraints determines whether your training job scales efficiently or grinds to a halt waiting for gradient synchronization.
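To get a feel for why the hierarchy matters, the sketch below estimates how long one Ring-AllReduce gradient sync would take over different links. The bandwidth figures and model size are illustrative assumptions, not measurements, and latency and compute/communication overlap are ignored.

```python
# Back-of-the-envelope AllReduce time for one gradient sync step.
# Illustrative sketch only; bandwidth figures are rough assumed values,
# not measurements, and latency / overlap with compute are ignored.

GRAD_BYTES = 7e9 * 2          # hypothetical 7B-parameter model, fp16 gradients
NUM_GPUS = 8

# Approximate per-GPU unidirectional bandwidths in GB/s (assumed, not exact).
LINKS = {
    "NVLink (intra-node)": 450,
    "PCIe 5.0 x16": 64,
    "InfiniBand NDR 400 Gb/s": 50,
}

def ring_allreduce_seconds(grad_bytes: float, num_gpus: int, gb_per_s: float) -> float:
    """Ring-AllReduce moves ~2*(p-1)/p of the gradient over each GPU's slowest link."""
    volume = 2 * (num_gpus - 1) / num_gpus * grad_bytes
    return volume / (gb_per_s * 1e9)

for name, bw in LINKS.items():
    t = ring_allreduce_seconds(GRAD_BYTES, NUM_GPUS, bw)
    print(f"{name:26s}: ~{t * 1000:7.1f} ms per sync")
```

Even with these rough numbers, the same sync that takes tens of milliseconds over NVLink takes hundreds of milliseconds over PCIe or InfiniBand, which is why parallelism strategy is chosen around the interconnect.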
Lessons in This Module
| # | Lesson | Key Concept |
|---|---|---|
| 1 | Multi-GPU Training Overview | DDP, FSDP, tensor parallelism - what each demands from the hardware |
| 2 | NVLink and NVSwitch | Bandwidth, topology, why NVSwitch changed multi-GPU |
| 3 | InfiniBand for GPU Clusters | HDR/NDR IB, RDMA, switch topologies |
| 4 | RDMA and Collective Operations | Remote DMA, why NCCL uses RDMA |
| 5 | Network Topology for AI Clusters | Fat-tree, Dragonfly, rail-optimized, what Google/Meta use |
| 6 | AllReduce Algorithms | Ring-AllReduce, tree, recursive halving-doubling |
| 7 | Training at 10,000+ GPU Scale | Checkpoint frequency, straggler mitigation, failure rates |
| 8 | Fault Tolerance in Training Clusters | Failure probability at scale, elastic training, checkpointing strategy |
Key Concepts You Will Master
- AllReduce topology - why Ring-AllReduce scales better than a naive parameter server (see the scaling sketch after this list)
- NVLink vs PCIe - the 10-30x bandwidth difference and what it means for model parallelism
- InfiniBand RDMA - how bypassing the OS kernel reduces communication latency in clusters
- NCCL - NVIDIA's collective communications library and how it chooses among communication algorithms (minimal usage sketch after this list)
- Failure probability at scale - why a 1% daily node failure rate means your 10k-GPU job will be interrupted many times (worked numbers after this list)
- Elastic training - approaches to resuming training after node failures without losing all progress
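The scaling sketch referenced in the first bullet: per-link traffic per optimizer step for a naive single parameter server versus Ring-AllReduce as the worker count grows. The gradient size is a hypothetical value, and latency terms and overlap with compute are ignored.

```python
# Per-link communication volume per optimizer step, in bytes.
# Illustrative sketch only; it captures the bandwidth term and nothing else.

def parameter_server_volume(grad_bytes: float, workers: int) -> float:
    """The lone server must receive and then send every worker's full gradient."""
    return 2 * workers * grad_bytes                    # grows linearly with workers

def ring_allreduce_volume(grad_bytes: float, workers: int) -> float:
    """Each worker sends and receives ~2*(p-1)/p of the gradient, independent of p."""
    return 2 * (workers - 1) / workers * grad_bytes    # bounded by 2 * grad_bytes

grad_bytes = 2e9  # hypothetical 1B-parameter model with fp16 gradients
for p in (8, 64, 1024):
    ps = parameter_server_volume(grad_bytes, p) / 1e9
    ring = ring_allreduce_volume(grad_bytes, p) / 1e9
    print(f"p={p:5d}  parameter-server hot link: {ps:8.1f} GB   ring, per link: {ring:4.2f} GB")
```

The parameter server's ingress link carries traffic proportional to the worker count, while each ring link carries a bounded amount regardless of scale - that is the core of the scaling argument.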
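For the NCCL bullet, a minimal PyTorch all-reduce over the NCCL backend. This is a generic sketch, not this module's reference code; it assumes launch via torchrun, which sets RANK, LOCAL_RANK, and WORLD_SIZE. NCCL decides internally how to move the data (ring, tree, and so on), which the later lessons unpack.

```python
import os
import torch
import torch.distributed as dist

def main():
    # torchrun provides RANK / WORLD_SIZE / MASTER_ADDR, so env:// init just works.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Each rank contributes a tensor of ones; after all_reduce every element
    # equals the world size, confirming the collective ran across all GPUs.
    x = torch.ones(1024, device=f"cuda:{local_rank}")
    dist.all_reduce(x, op=dist.ReduceOp.SUM)
    print(f"rank {dist.get_rank()}: element value = {x[0].item()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

On a single 8-GPU node this would be launched with something like `torchrun --nproc_per_node=8 allreduce_demo.py` (the script name is just a placeholder).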
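And the worked numbers behind the failure-probability bullet, taking the 1% daily per-node failure rate as a stated assumption (real rates depend on hardware generation and the data center), with 8 GPUs per node and independent failures assumed:

```python
# Expected node failures during a week-long 10,000-GPU run.
# The 1% daily per-node failure probability is an assumption for illustration.

gpus = 10_000
gpus_per_node = 8
nodes = gpus // gpus_per_node        # 1,250 nodes
p_daily = 0.01                       # assumed per-node daily failure probability
days = 7

expected_failures = nodes * p_daily * days
p_clean_run = (1 - p_daily) ** (nodes * days)   # chance of zero failures, assuming independence

print(f"expected node failures over {days} days: {expected_failures:.0f}")
print(f"probability of a failure-free week:     {p_clean_run:.1e}")
```

Roughly ninety interruptions in a single week and an essentially zero chance of a clean run, which is why checkpointing strategy and elastic training get their own lessons.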
Prerequisites
- GPU Architecture
- Understanding of distributed training concepts (DDP, FSDP)
© 2026 EngineersOfAI. All rights reserved.
