Module 6 - Distributed Training Hardware
Multi-GPU architectures, GPU cluster networking, NCCL, DGX systems, ZeRO, fault tolerance, and cloud vs on-prem infrastructure.