9 docs tagged with "distributed-training"

Cloud vs On-Prem GPU Infrastructure

Total cost of ownership analysis for cloud GPU instances vs on-premises clusters, break-even analysis, spot instance economics, Kubernetes GPU scheduling, and FinOps strategies for GPU compute at scale.

DGX and HGX System Design

NVIDIA DGX H100 and HGX reference designs - 8-GPU NVLink mesh, NVSwitch fabric, PCIe host bridge, ConnectX InfiniBand, power and cooling requirements, DGX SuperPOD scale-out, and topology-aware NCCL configuration for maximum distributed training throughput.

Fault Tolerance in Large Cluster Training

Why fault tolerance is critical at scale, how to design checkpointing strategies, detect stragglers, handle spot preemptions, and recover from failures without restarting multi-week training runs.

GPU Cluster Networking

InfiniBand vs RoCE vs Ethernet for GPU cluster communication, fat-tree and rail-optimized topologies, GPUDirect RDMA, SHARP in-network aggregation, and diagnosing collective communication bottlenecks in production ML clusters.

Gradient Checkpointing and Rematerialization

Activation checkpointing to reduce training memory usage, sublinear memory algorithm, selective checkpointing strategies, and implementation in PyTorch and JAX.

Module 6: Distributed Training Hardware

NVLink, InfiniBand, AllReduce algorithms, network topology, fault tolerance, and the hardware that makes training at thousands of GPUs possible.

Multi-GPU Training Architectures

Master data parallelism, tensor parallelism, pipeline parallelism, and 3D parallelism for large-scale model training - with communication volume math, PyTorch DDP vs FSDP, and Megatron-LM weight splitting strategies.

NCCL and Collective Communication

Deep dive into NCCL internals - the five collective operations, ring-allreduce algorithm, tree-reduce for small tensors, algorithm selection heuristics, tuning environment variables, and diagnosing collective hangs in production GPU clusters.

ZeRO and Memory Efficiency

DeepSpeed ZeRO stages 1/2/3 - sharding optimizer states, gradients, and parameters across data parallel workers to enable training models too large for single-GPU memory.