Module 5: Networking for Distributed AI

Distributed training means moving gradients between GPUs. Distributed inference means moving requests between services. At the scale at which modern AI systems operate, the network - not the compute - is often the bottleneck.

When you scale from 8 GPUs to 64, AllReduce communication cost can dominate total training time if your network is not correctly configured. When you serve a 70B model across 4 H100s with tensor parallelism, every generated token triggers two AllReduce rounds per transformer layer. Understanding networking lets you predict these costs, diagnose bottlenecks, and make architecture decisions that respect physical constraints.
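
To make the training case concrete, here is a back-of-the-envelope sketch in Python. The ring-AllReduce cost model (each GPU sends and receives 2(N-1)/N of the gradient buffer) is standard; the model size, GPU count, and link bandwidth are illustrative assumptions, not measurements.

```python
# Back-of-the-envelope estimate (not a benchmark): per-step gradient
# communication time for ring-AllReduce. All inputs are illustrative
# assumptions -- 7B parameters, 64 GPUs, 400 Gbit/s per-GPU links.

def allreduce_seconds(n_params: float, n_gpus: int, link_gbps: float,
                      bytes_per_param: int = 2) -> float:
    """Estimated seconds for one ring-AllReduce over the full gradient."""
    grad_bytes = n_params * bytes_per_param            # bf16/fp16 gradients
    # Ring-AllReduce: each GPU sends and receives 2*(N-1)/N of the buffer.
    per_gpu_bytes = 2 * (n_gpus - 1) / n_gpus * grad_bytes
    return per_gpu_bytes / (link_gbps * 1e9 / 8)       # Gbit/s -> bytes/s

print(f"{allreduce_seconds(7e9, 64, 400):.2f} s of gradient sync per step")
```

At 400 Gbit/s this works out to roughly half a second per step for a 7B model in bf16; if your compute per step is shorter than that, the network, not the GPUs, sets your throughput.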

[Figure: Networking in the ML System]

Lessons in This Module

#   Lesson                                        Key Concepts
1   TCP/IP Fundamentals for Engineers             IP routing, TCP flow control, latency vs bandwidth
2   RDMA - Remote Direct Memory Access            Bypassing the OS kernel, zero-copy GPU-to-GPU
3   Collective Operations and AllReduce           Ring-AllReduce, tree reduce, NCCL algorithms
4   Bandwidth, Latency, and Throughput            Little's Law, queuing theory basics for serving
5   gRPC for Model Serving                        Protocol Buffers, streaming, bidirectional gRPC
6   Distributed File Systems for Training         NFS, Lustre, S3 - throughput requirements and limits
7   Network Bottlenecks in Distributed Training   Diagnosing communication overhead, profiling NCCL
8   Service Mesh for AI Microservices             Istio/Envoy for ML serving, observability, circuit breaking

Key Concepts You Will Master

  • AllReduce communication volume - calculating gradient communication overhead for any model size
  • RDMA - how remote direct memory access eliminates CPU involvement in inter-GPU communication
  • NCCL topology detection - how NCCL automatically discovers the best communication algorithm
  • gRPC streaming - using server-side streaming for token-by-token LLM responses (see the sketch after this list)
  • Network bottleneck identification - distinguishing compute-bound from communication-bound training (see the second sketch below)
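
First, a minimal sketch of the server-side streaming pattern in Python with grpcio. The service, messages, and helpers here (TokenService, GenerateRequest, Token, run_model, the llm_pb2 modules) are hypothetical stand-ins for whatever your .proto file and model runtime actually define.

```python
# Sketch of server-side streaming for token-by-token responses. The llm_pb2
# modules are assumed to be generated by protoc from a hypothetical proto:
#   rpc Generate(GenerateRequest) returns (stream Token);
from concurrent import futures

import grpc
import llm_pb2        # hypothetical generated messages
import llm_pb2_grpc   # hypothetical generated service stubs

def run_model(prompt):
    # Placeholder decode loop; a real server would call the model here.
    for word in ("Hello", "from", "the", "model"):
        yield word + " "

class TokenService(llm_pb2_grpc.TokenServiceServicer):
    def Generate(self, request, context):
        # Yielding turns this handler into a streaming response: the client
        # receives each token as soon as it is decoded, instead of waiting
        # for the full completion.
        for token in run_model(request.prompt):
            yield llm_pb2.Token(text=token)

def serve() -> None:
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=8))
    llm_pb2_grpc.add_TokenServiceServicer_to_server(TokenService(), server)
    server.add_insecure_port("[::]:50051")
    server.start()
    server.wait_for_termination()
```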

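And a rough sketch of the compute-bound vs communication-bound test: compare an estimated per-step compute time (using the common ~6 FLOPs per parameter per token rule of thumb for forward plus backward) against the same ring-AllReduce estimate as above. All numbers are illustrative assumptions, not measured values.

```python
# Rough test for compute-bound vs communication-bound training. Inputs are
# illustrative assumptions: 7B parameters, 4096 tokens per GPU per step,
# 64 GPUs, 400 Gbit/s links, ~500 TFLOP/s sustained per GPU.

def compute_seconds(n_params, tokens_per_gpu, gpu_flops):
    # ~6 FLOPs per parameter per token covers forward + backward.
    return 6 * n_params * tokens_per_gpu / gpu_flops

def comm_seconds(n_params, n_gpus, link_gbps, bytes_per_param=2):
    # Ring-AllReduce traffic per GPU, as in the earlier sketch.
    per_gpu = 2 * (n_gpus - 1) / n_gpus * n_params * bytes_per_param
    return per_gpu / (link_gbps * 1e9 / 8)

compute = compute_seconds(7e9, 4096, 500e12)
comm = comm_seconds(7e9, 64, 400)
verdict = "communication-bound" if comm > compute else "compute-bound"
print(f"{verdict}: compute {compute:.2f}s vs comm {comm:.2f}s per step")
```

In practice, frameworks typically overlap gradient communication with the backward pass, so treat this ratio as an upper bound on communication overhead rather than a literal stall.
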
Prerequisites
