
Module 6: Distributed Training Hardware

Training GPT-4 reportedly required approximately 25,000 A100 GPUs running for months. Training Llama 3 required 24,576 H100 GPUs. These are not just big computers - they are purpose-built infrastructure where the networking between GPUs is as important as the GPUs themselves.

When you scale from 1 GPU to 8, you hit NVLink bandwidth limits. When you scale from 8 to 64, you hit the bandwidth limits of the PCIe switches that connect GPUs to the network cards. When you scale across many nodes, the InfiniBand fabric becomes your bottleneck. Every scaling step has a different constraint, and understanding those constraints determines whether your training job scales efficiently or grinds to a halt waiting for gradient synchronization.
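
One way to make this concrete is a back-of-the-envelope estimate of gradient synchronization time. With ring-allreduce (covered in Lesson 6), each GPU sends and receives roughly 2(N-1)/N times the gradient size per step, so synchronization time is roughly gradient bytes divided by the slowest link's bandwidth. The sketch below is illustrative only: the model size, GPU count, and bandwidth figures are assumptions chosen to be in the right ballpark for recent hardware, not measured numbers.

```python
# Back-of-the-envelope: time to ring-allreduce one set of fp16 gradients.
# Bandwidths are assumed, approximate per-GPU rates; real numbers depend
# on hardware generation, topology, and NCCL's algorithm choice.

PARAMS = 70e9                 # assumed 70B-parameter model
GRAD_BYTES = PARAMS * 2       # fp16 gradients: 2 bytes per parameter

# Approximate unidirectional bandwidths in bytes/second (assumptions).
LINKS = {
    "NVLink (intra-node)":   450e9,   # ~450 GB/s per direction (NVLink 4)
    "PCIe 5.0 x16":           64e9,   # ~64 GB/s per direction
    "InfiniBand NDR (400G)":  50e9,   # ~50 GB/s per port
}

N = 1024  # assumed number of GPUs in the data-parallel ring

for name, bw in LINKS.items():
    # Ring-allreduce moves 2 * (N-1)/N * data over each link, assuming
    # the link is the bottleneck and ignoring latency terms.
    traffic = 2 * (N - 1) / N * GRAD_BYTES
    print(f"{name:24s}: {traffic / bw:6.1f} s per synchronization")
```

Even under these simplified assumptions, a 70B model spends multiple seconds per synchronization over PCIe or InfiniBand versus well under a second over NVLink - which is why both faster fabrics and overlapping communication with computation matter so much at scale.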

The Bandwidth Hierarchy
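
Every byte a GPU exchanges travels over one of a few links, each roughly an order of magnitude slower than the one before it. The figures below are approximate and generation-dependent (they reflect the H100 / NVLink 4 / NDR era); the point is the relative scale, not the exact specs:

  • HBM (on-package memory) - roughly 3 TB/s per GPU
  • NVLink/NVSwitch (intra-node) - roughly 900 GB/s aggregate bidirectional per GPU
  • PCIe 5.0 x16 (GPU to host) - roughly 64 GB/s per direction
  • InfiniBand NDR (inter-node) - roughly 50 GB/s (400 Gb/s) per port

Each step down this hierarchy costs roughly an order of magnitude in bandwidth, which is why parallelism strategies are chosen to keep the heaviest communication on the fastest links.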

Lessons in This Module

1. Multi-GPU Training Overview - DDP, FSDP, tensor parallelism: which requires what
2. NVLink and NVSwitch - bandwidth, topology, why NVSwitch changed multi-GPU
3. InfiniBand for GPU Clusters - HDR/NDR IB, RDMA, switch topologies
4. RDMA and Collective Operations - remote DMA, why NCCL uses RDMA
5. Network Topology for AI Clusters - fat-tree, Dragonfly, rail-optimized, what Google/Meta use
6. AllReduce Algorithms - Ring-AllReduce, tree, recursive halving-doubling
7. Training at 10,000+ GPU Scale - checkpoint frequency, straggler mitigation, failure rates
8. Fault Tolerance in Training Clusters - failure probability at scale, elastic training, checkpointing strategy

Key Concepts You Will Master

  • AllReduce topology - why Ring-AllReduce scales better than naive parameter server
  • NVLink vs PCIe - the 10-30x bandwidth difference and what it means for model parallelism
  • InfiniBand RDMA - how bypassing the OS kernel reduces communication latency in clusters
  • NCCL - NVIDIA's collective communication library and how it chooses communication algorithms
  • Failure probability at scale - why a 1% daily node failure rate means your 10k-GPU job will fail repeatedly (see the worked numbers after this list)
  • Elastic training - approaches to resuming training after node failures without losing all progress
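
To back up the failure-probability bullet: assume node failures are independent with a fixed daily rate (a simplification; real failures correlate and rates vary by site). The 1% figure below is the illustrative rate from the bullet above, not a measured statistic.

```python
# Expected failures for a large training job, assuming independent node
# failures at a fixed daily rate (a simplification; real failures correlate).

gpus = 10_000
gpus_per_node = 8
nodes = gpus // gpus_per_node          # 1,250 nodes
daily_failure_rate = 0.01              # illustrative: 1% per node per day

# Probability that at least one node fails on a given day:
p_any_failure = 1 - (1 - daily_failure_rate) ** nodes
print(f"P(at least one failure today) = {p_any_failure:.6f}")   # ~0.999996

# Expected node failures per day and over a 30-day run:
print(f"Expected failures per day: {nodes * daily_failure_rate:.1f}")        # 12.5
print(f"Expected failures over 30 days: {30 * nodes * daily_failure_rate:.0f}")  # 375
```

At an expected dozen node failures per day, checkpointing and restart strategy stop being an afterthought and become part of the training loop design - which is where Lessons 7 and 8 pick up.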

Prerequisites

© 2026 EngineersOfAI. All rights reserved.