Cloud vs On-Prem GPU Infrastructure
Total cost of ownership analysis for cloud GPU instances vs on-premises clusters, break-even analysis, spot instance economics, Kubernetes GPU scheduling, and FinOps strategies for GPU compute at scale.
Total cost of ownership analysis for cloud GPU instances vs on-premises clusters, break-even analysis, spot instance economics, Kubernetes GPU scheduling, and FinOps strategies for GPU compute at scale.
NVIDIA DGX H100 and HGX reference designs - 8-GPU NVLink mesh, NVSwitch fabric, PCIe host bridge, ConnectX InfiniBand, power and cooling requirements, DGX SuperPOD scale-out, and topology-aware NCCL configuration for maximum distributed training throughput.
Why fault tolerance is critical at scale, how to design checkpointing strategies, detect stragglers, handle spot preemptions, and recover from failures without restarting multi-week training runs.
InfiniBand vs RoCE vs Ethernet for GPU cluster communication, fat-tree and rail-optimized topologies, GPUDirect RDMA, SHARP in-network aggregation, and diagnosing collective communication bottlenecks in production ML clusters.
Activation checkpointing to reduce training memory usage, sublinear memory algorithm, selective checkpointing strategies, and implementation in PyTorch and JAX.
NVLink, InfiniBand, AllReduce algorithms, network topology, fault tolerance, and the hardware that makes training at thousands of GPUs possible.
Master data parallelism, tensor parallelism, pipeline parallelism, and 3D parallelism for large-scale model training - with communication volume math, PyTorch DDP vs FSDP, and Megatron-LM weight splitting strategies.
Deep dive into NCCL internals - the five collective operations, ring-allreduce algorithm, tree-reduce for small tensors, algorithm selection heuristics, tuning environment variables, and diagnosing collective hangs in production GPU clusters.
DeepSpeed ZeRO stages 1/2/3 - sharding optimizer states, gradients, and parameters across data parallel workers to enable training models too large for single-GPU memory.