Module 6: Distributed Training Hardware
Training GPT-4 is estimated to have taken roughly 25,000 A100 GPUs running for months. Training Llama 3 took 24,576 H100 GPUs. These are not just big computers - they are purpose-built infrastructure in which the network between GPUs matters as much as the GPUs themselves.
The Bandwidth Hierarchy
When you scale from 1 GPU to 8, NVLink bandwidth inside the node is the limit. When you scale from 8 to 64, gradients have to leave the node, and the PCIe path from GPU to NIC becomes the constraint. When you scale to hundreds of nodes, the InfiniBand fabric itself - its switch tiers and any oversubscription - becomes the bottleneck. Every scaling step has a different constraint, and understanding those constraints determines whether your training job scales efficiently or grinds to a halt waiting for gradient synchronization.
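To get a feel for why the hierarchy matters, the sketch below estimates how long one Ring-AllReduce gradient sync would take over different links. The bandwidth figures and model size are illustrative assumptions, not measurements, and latency and compute/communication overlap are ignored.

```python
# Back-of-the-envelope AllReduce time for one gradient sync step.
# Illustrative sketch only; bandwidth figures are rough assumed values,
# not measurements, and latency / overlap with compute are ignored.

GRAD_BYTES = 7e9 * 2          # hypothetical 7B-parameter model, fp16 gradients
NUM_GPUS = 8

# Approximate per-GPU unidirectional bandwidths in GB/s (assumed, not exact).
LINKS = {
    "NVLink (intra-node)": 450,
    "PCIe 5.0 x16": 64,
    "InfiniBand NDR 400 Gb/s": 50,
}

def ring_allreduce_seconds(grad_bytes: float, num_gpus: int, gb_per_s: float) -> float:
    """Ring-AllReduce moves ~2*(p-1)/p of the gradient over each GPU's slowest link."""
    volume = 2 * (num_gpus - 1) / num_gpus * grad_bytes
    return volume / (gb_per_s * 1e9)

for name, bw in LINKS.items():
    t = ring_allreduce_seconds(GRAD_BYTES, NUM_GPUS, bw)
    print(f"{name:26s}: ~{t * 1000:7.1f} ms per sync")
```

Even with these rough numbers, the same sync that takes tens of milliseconds over NVLink takes hundreds of milliseconds over PCIe or InfiniBand, which is why parallelism strategy is chosen around the interconnect.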
Lessons in This Module
| # | Lesson | Key Concept |
|---|---|---|
| 1 | Multi-GPU Training Overview | DDP, FSDP, tensor parallelism - what each demands from the hardware |
| 2 | NVLink and NVSwitch | Bandwidth, topology, why NVSwitch changed multi-GPU |
| 3 | InfiniBand for GPU Clusters | HDR/NDR IB, RDMA, switch topologies |
| 4 | RDMA and Collective Operations | Remote DMA, why NCCL uses RDMA |
| 5 | Network Topology for AI Clusters | Fat-tree, Dragonfly, rail-optimized, what Google/Meta use |
| 6 | AllReduce Algorithms | Ring-AllReduce, tree, recursive halving-doubling |
| 7 | Training at 10,000+ GPU Scale | Checkpoint frequency, straggler mitigation, failure rates |
| 8 | Fault Tolerance in Training Clusters | Failure probability at scale, elastic training, checkpointing strategy |
Key Concepts You Will Master
- AllReduce topology - why Ring-AllReduce scales better than a naive parameter server (see the scaling sketch after this list)
- NVLink vs PCIe - the 10-30x bandwidth difference and what it means for model parallelism
- InfiniBand RDMA - how bypassing the OS kernel reduces communication latency in clusters
- NCCL - NVIDIA's collective communications library and how it chooses among communication algorithms (minimal usage sketch after this list)
- Failure probability at scale - why a 1% daily node failure rate means your 10k-GPU job will be interrupted many times (worked numbers after this list)
- Elastic training - approaches to resuming training after node failures without losing all progress
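The scaling sketch referenced in the first bullet: per-link traffic per optimizer step for a naive single parameter server versus Ring-AllReduce as the worker count grows. The gradient size is a hypothetical value, and latency terms and overlap with compute are ignored.

```python
# Per-link communication volume per optimizer step, in bytes.
# Illustrative sketch only; it captures the bandwidth term and nothing else.

def parameter_server_volume(grad_bytes: float, workers: int) -> float:
    """The lone server must receive and then send every worker's full gradient."""
    return 2 * workers * grad_bytes                    # grows linearly with workers

def ring_allreduce_volume(grad_bytes: float, workers: int) -> float:
    """Each worker sends and receives ~2*(p-1)/p of the gradient, independent of p."""
    return 2 * (workers - 1) / workers * grad_bytes    # bounded by 2 * grad_bytes

grad_bytes = 2e9  # hypothetical 1B-parameter model with fp16 gradients
for p in (8, 64, 1024):
    ps = parameter_server_volume(grad_bytes, p) / 1e9
    ring = ring_allreduce_volume(grad_bytes, p) / 1e9
    print(f"p={p:5d}  parameter-server hot link: {ps:8.1f} GB   ring, per link: {ring:4.2f} GB")
```

The parameter server's ingress link carries traffic proportional to the worker count, while each ring link carries a bounded amount regardless of scale - that is the core of the scaling argument.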
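For the NCCL bullet, a minimal PyTorch all-reduce over the NCCL backend. This is a generic sketch, not this module's reference code; it assumes launch via torchrun, which sets RANK, LOCAL_RANK, and WORLD_SIZE. NCCL decides internally how to move the data (ring, tree, and so on), which the later lessons unpack.

```python
import os
import torch
import torch.distributed as dist

def main():
    # torchrun provides RANK / WORLD_SIZE / MASTER_ADDR, so env:// init just works.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Each rank contributes a tensor of ones; after all_reduce every element
    # equals the world size, confirming the collective ran across all GPUs.
    x = torch.ones(1024, device=f"cuda:{local_rank}")
    dist.all_reduce(x, op=dist.ReduceOp.SUM)
    print(f"rank {dist.get_rank()}: element value = {x[0].item()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

On a single 8-GPU node this would be launched with something like `torchrun --nproc_per_node=8 allreduce_demo.py` (the script name is just a placeholder).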
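And the worked numbers behind the failure-probability bullet, taking the 1% daily per-node failure rate as a stated assumption (real rates depend on hardware generation and the data center), with 8 GPUs per node and independent failures assumed:

```python
# Expected node failures during a week-long 10,000-GPU run.
# The 1% daily per-node failure probability is an assumption for illustration.

gpus = 10_000
gpus_per_node = 8
nodes = gpus // gpus_per_node        # 1,250 nodes
p_daily = 0.01                       # assumed per-node daily failure probability
days = 7

expected_failures = nodes * p_daily * days
p_clean_run = (1 - p_daily) ** (nodes * days)   # chance of zero failures, assuming independence

print(f"expected node failures over {days} days: {expected_failures:.0f}")
print(f"probability of a failure-free week:     {p_clean_run:.1e}")
```

Roughly ninety interruptions in a single week and an essentially zero chance of a clean run, which is why checkpointing strategy and elastic training get their own lessons.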
Prerequisites
- GPU Architecture
- Understanding of distributed training concepts (DDP, FSDP)
© 2026 EngineersOfAI. All rights reserved.
