GPU Cluster Networking
The Training Run That Cost $800,000 and Finished Late
In early 2023, a major AI company was training a 30B parameter model on 256 A100s. The compute budget was allocated, the team was ready, the checkpointing strategy was solid. On day 3, someone noticed that training throughput had dropped from the expected 2,100 tokens per second per GPU to 1,650 tokens per second. A 21% degradation.
The model was training. Losses were decreasing. Nothing was crashing. The team ran nvidia-smi - GPUs looked healthy. The training script showed normal step times. But the throughput numbers did not lie. Somewhere, 21% of the cluster's capacity was being wasted every second.
Four days (and roughly $200,000 in wasted compute) later, an infrastructure engineer finally ran ibnetdiscover on the InfiniBand fabric and spotted it: two core spine switches in the fat-tree topology were incorrectly configured, causing asymmetric routing. Half the all-reduce traffic was being funneled through one overloaded path while the other path was nearly empty. ECMP (Equal-Cost Multi-Path) load balancing was not working correctly because the switch firmware had a subtle hash collision bug that sent the same NCCL communicator flows to the same path.
The fix took 30 minutes. A firmware update and a routing table reset. The cluster jumped back to 2,090 tokens per second per GPU. But those four days were gone.
This is the reality of GPU cluster networking. The network is not a background concern - it is a first-class citizen in training performance. A 400 Gb/s InfiniBand link costs $15,000 per port. A 256-GPU cluster has hundreds of such ports. The network can be the most expensive component of the infrastructure, and the first place training performance dies.
This lesson covers everything you need to understand, deploy, and debug GPU cluster networking: InfiniBand generations, RoCE as an alternative, fat-tree and rail-optimized topologies, GPUDirect RDMA for zero-copy communication, SHARP in-network all-reduce, and the practical tools for diagnosing when your network is hurting your training runs.
Why This Exists - The Interconnect Bottleneck
Early multi-GPU training in 2015-2016 used standard Ethernet. 10 GbE, maybe 25 GbE if the team had budget. This worked fine when models were small - AlexNet's gradient communication could complete in milliseconds over 10 GbE. But as models grew, the problem became structural.
The fundamental issue: GPU compute throughput scales roughly with GPU count (near-linearly with good data parallelism), but Ethernet bandwidth does not scale the same way. Ethernet is designed for general-purpose networking with variable latency and packet loss tolerance. It uses TCP/IP, which has protocol overhead, retransmission timers, and CPU involvement for every packet. When you try to do an all-reduce of 100 GB of gradients, TCP/IP's overhead and latency add up to seconds of wasted time.
More precisely: 10 GbE has a latency of ~5-10 microseconds per hop at best. InfiniBand HDR has a latency of ~0.6 microseconds. A ring all-reduce across $N$ nodes takes $2(N-1)$ sequential collective steps, so even small per-hop latency differences compound dramatically. With 256 nodes, that is 510 collective steps. At a 10 us latency difference per step, that is 5.1 ms of pure latency overhead before any transmission time. For 100 GB of gradients, even with RDMA, latency still matters at every synchronization point.
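To make the compounding concrete, here is a small back-of-the-envelope sketch using the same figures as above (the function name and the 10 us per-step penalty are illustrative, not measured values):
# Rough model of pure latency overhead in a ring all-reduce.
def ring_allreduce_latency_overhead(num_nodes: int, per_step_latency_s: float) -> float:
    """A ring all-reduce takes 2*(N-1) sequential steps (reduce-scatter + all-gather)."""
    steps = 2 * (num_nodes - 1)
    return steps * per_step_latency_s

# 256 nodes, 10 us of extra per-step latency (Ethernet vs InfiniBand difference):
overhead = ring_allreduce_latency_overhead(256, 10e-6)
print(f"{2 * (256 - 1)} steps, {overhead * 1e3:.1f} ms of pure latency overhead")  # ~5.1 ms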
The second failure of Ethernet was CPU overhead. Standard Ethernet requires the CPU to process every packet - interrupts, memory copies, TCP state machine management. On GPU training nodes, CPUs are already busy with data loading, preprocessing, and orchestration. Pulling the CPU into the hot path of gradient communication created CPU bottlenecks that throttled network throughput well below the theoretical link bandwidth.
InfiniBand and RDMA (Remote Direct Memory Access) solved both problems: sub-microsecond latency and zero CPU involvement in the data path.
Historical Context - From Myrinet to NDR InfiniBand
The quest for high-speed interconnects for parallel computing predates GPUs by decades. In the 1990s, Myrinet (from Myricom) offered 1 Gb/s interconnects for supercomputing clusters when Ethernet was at 100 Mb/s. But Myrinet was proprietary and expensive.
InfiniBand was standardized in 2000 by the InfiniBand Trade Association, a consortium including IBM, Intel, and HP. The initial specification defined 2.5 Gb/s per lane, with scalable multi-lane configurations. The RDMA capability was the key innovation: InfiniBand's architecture allowed the network interface card (HCA - Host Channel Adapter) to directly read and write application memory without CPU involvement.
The bandwidth roadmap has been aggressive:
- SDR (Single Data Rate, 2001): 10 Gb/s
- DDR (Double Data Rate, 2005): 20 Gb/s
- QDR (Quad Data Rate, 2008): 40 Gb/s
- FDR (Fourteen Data Rate, 2011): 56 Gb/s
- EDR (Enhanced Data Rate, 2015): 100 Gb/s
- HDR (High Data Rate, 2018): 200 Gb/s
- NDR (Next Data Rate, 2022): 400 Gb/s
- XDR (eXtended Data Rate, 2025+): 800 Gb/s
NVIDIA acquired Mellanox (the dominant InfiniBand vendor) in 2020 for $6.9 billion - a signal of how critical network infrastructure is to the GPU computing business. Today, virtually every serious AI training cluster uses NVIDIA InfiniBand (formerly Mellanox).
The key architectural turning point for ML workloads came with the Hopper (H100) generation in 2022, when NVIDIA made GPUDirect RDMA a first-class feature and integrated it deeply with NCCL. Combined with NDR InfiniBand at 400 Gb/s, the network stopped being the obvious bottleneck for clusters up to a few hundred GPUs.
Core Concepts - Network Technologies
InfiniBand Architecture
InfiniBand uses a switched fabric topology where every server connects to an InfiniBand HCA (Host Channel Adapter). The HCA is the intelligence at the edge - it handles RDMA operations, manages queue pairs (the fundamental abstraction for InfiniBand communication), and moves data between GPU memory and the network without CPU involvement.
Key InfiniBand primitives relevant to ML:
RDMA Write: directly write data into a remote host's registered memory. No CPU involvement at the destination. Conceptually, the inter-node analogue of an NVLink peer-to-peer write.
RDMA Send/Receive: two-sided operation where both sides must post work requests. Used for control messages and synchronization.
Queue Pairs (QPs): the fundamental communication abstraction. Each QP has a Send Queue (SQ) and Receive Queue (RQ). Applications post work requests to QPs; the HCA processes them asynchronously.
Completion Queues (CQs): used to poll for completed operations. NCCL polls CQs to determine when sends and receives have completed.
The bandwidth numbers that matter for ML:
| Generation | Per-port bandwidth | Latency | Year |
|---|---|---|---|
| HDR | 200 Gb/s (25 GB/s) | 0.6 us | 2018 |
| HDR100 | 100 Gb/s (12.5 GB/s) | 0.6 us | 2018 |
| NDR | 400 Gb/s (50 GB/s) | 0.5 us | 2022 |
| NDR200 | 200 Gb/s (25 GB/s) | 0.5 us | 2022 |
| XDR | 800 Gb/s (100 GB/s) | 0.4 us | 2025 |
HDR100 is commonly used for server connections in oversubscribed designs (HDR100 downlinks to servers, full-rate HDR uplinks to spine switches). NDR is the current standard for new H100 cluster deployments.
RoCE - RDMA over Converged Ethernet
RoCE (pronounced "rocky") is a protocol that runs the InfiniBand RDMA transport layer over Ethernet. The key insight: what makes InfiniBand fast is not the physical layer - it is the RDMA semantics. RoCE takes those semantics and runs them over Ethernet switches.
RoCE v2 (standardized in 2014) runs RDMA over UDP/IP, which means it works across L3 boundaries (routable). This is significant: InfiniBand requires its own fabric (you cannot mix InfiniBand with Ethernet switches), while RoCE can use standard Ethernet switches like Arista or Cisco Nexus.
The catch: Ethernet was designed for lossy networks. InfiniBand was designed as a lossless fabric. RDMA over a lossy fabric leads to retransmissions and terrible performance: RDMA NICs traditionally use go-back-N recovery, so a single drop forces retransmission of everything sent after the lost packet, not just the lost packet. RoCE therefore requires a lossless Ethernet fabric, achieved via:
PFC (Priority Flow Control): IEEE 802.1Qbb standard. When a switch buffer fills up, PFC sends a pause frame to the upstream device, temporarily halting transmission on a specific traffic class. This prevents drops.
ECN (Explicit Congestion Notification): marks packets as congested (setting ECN bits in IP header) before dropping them. Endpoints respond by reducing their sending rate. This is the preferred congestion signal as it avoids head-of-line blocking.
DCQCN (Data Center Quantized Congestion Notification): the algorithm used by most RoCE implementations to respond to ECN signals. Combines rate reduction and rate recovery in a way that prevents congestion collapse.
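A highly simplified sketch of the DCQCN idea - multiplicative rate cuts when congestion notifications arrive, gradual recovery when they stop. The constants and function names here are illustrative only; real DCQCN adds timers, byte counters, and multiple recovery stages:
# Illustrative DCQCN-style rate control (not the actual algorithm or its parameters).
def on_congestion_notification(rate_gbps: float, alpha: float) -> tuple[float, float]:
    """Multiplicative decrease when the receiver reports ECN-marked packets."""
    new_rate = rate_gbps * (1 - alpha / 2)            # cut rate in proportion to congestion estimate
    new_alpha = min(1.0, alpha + 0.1 * (1 - alpha))   # congestion estimate rises
    return new_rate, new_alpha

def on_quiet_interval(rate_gbps: float, target_gbps: float, alpha: float) -> tuple[float, float]:
    """Recover toward the pre-congestion target rate when no notifications arrive."""
    new_rate = (rate_gbps + target_gbps) / 2          # step halfway back toward the target
    new_alpha = alpha * 0.9                           # congestion estimate decays
    return new_rate, new_alpha

rate, alpha = 400.0, 0.5                              # 400 Gb/s link, nonzero congestion estimate
rate, alpha = on_congestion_notification(rate, alpha) # congestion seen: rate drops to 300 Gb/s
rate, alpha = on_quiet_interval(rate, 400.0, alpha)   # quiet period: rate recovers to 350 Gb/s
print(f"rate={rate:.0f} Gb/s")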
When to choose RoCE over InfiniBand:
- You have Ethernet expertise in-house but not InfiniBand
- You need L3 routing between pods (InfiniBand is typically L2)
- You want to use commodity Ethernet switches (can be cheaper at 100 GbE)
- You are building a cloud or multi-tenant environment where InfiniBand fabric isolation is complicated
When to choose InfiniBand:
- Maximum performance is the primary goal
- You are building a dedicated ML training cluster
- You want SHARP (in-network reduction - IB only)
- The team is comfortable with InfiniBand management tools
In practice, hyperscalers (Microsoft Azure, Google Cloud, Amazon AWS) use a mix: AWS EFA (Elastic Fabric Adapter) provides RDMA-like semantics over Ethernet with a proprietary transport (SRD). Azure's ND-series GPU VMs use InfiniBand (HDR for the A100-based ND A100 v4, NDR for the H100-based ND v5). Google's TPU pods use a custom interconnect. On-premises clusters for serious training workloads almost universally use InfiniBand.
Fat-Tree Topology
The fat-tree is the dominant topology for GPU training clusters. Understanding it is non-negotiable.
A k-ary fat-tree has three levels of switches: edge (or leaf), aggregation, and core (spine). Each edge switch connects to servers and aggregation switches. Each aggregation switch connects to edge switches and core switches. There are $(k/2)^2$ core switches.
Total servers in a k-ary fat-tree: $k^3/4$.
For a 2-level fat-tree (spine-leaf) built from 48-port switches:
- 24 server ports + 24 uplinks per leaf switch
- 24 spine switches (one uplink from every leaf to every spine)
- up to 48 leaf switches (each 48-port spine switch reaches 48 leaves)
- Total servers: 48 x 24 = 1,152
The defining property of the fat-tree is full bisection bandwidth: across any cut that divides the network into two equal halves, the available bandwidth equals the aggregate NIC bandwidth of the servers on one side ($N \cdot B / 2$ for $N$ servers with $B$ GB/s NICs). This means any server can communicate with any other server at full link rate, simultaneously, without congestion.
Why this matters for all-reduce: in a ring all-reduce across 64 GPUs, traffic flows between non-adjacent GPUs in the ring. If those GPUs happen to be on different racks, the traffic must traverse spine switches. A fat-tree ensures this traffic is not throttled - every server gets its full bandwidth for collective communication.
The cost of full bisection bandwidth: you need $(k/2)^2$ core switches, each with $k$ ports. For a full k=32 fat-tree (8,192 servers): 256 core switches, each with 32 ports. That is 8,192 core switch ports just for the spine layer. InfiniBand NDR switches cost roughly $50,000 each. The switch infrastructure for a large cluster is a multi-million-dollar investment.
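A minimal sizing calculator for the standard 3-level k-ary fat-tree, using the formulas above (the function name is illustrative; real deployments rarely build the full tree):
def fat_tree_sizing(k: int) -> dict:
    """Port and switch counts for a full 3-level k-ary fat-tree built from k-port switches."""
    return {
        "servers": k ** 3 // 4,                  # k pods * (k/2 edge switches) * (k/2 servers each)
        "edge_switches": k * (k // 2),           # k pods, k/2 edge (leaf) switches each
        "aggregation_switches": k * (k // 2),    # k pods, k/2 aggregation switches each
        "core_switches": (k // 2) ** 2,          # spine layer
        "core_ports": (k // 2) ** 2 * k,         # total spine-layer ports
    }

# k=32 (32-port switches): 256 core switches, 8,192 core ports, up to 8,192 servers
print(fat_tree_sizing(32))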
Oversubscription
Not all clusters use non-blocking fat-trees. Oversubscription ratios of 2:1 or 4:1 are common in cost-constrained deployments. A 2:1 oversubscribed fat-tree has half the number of uplinks from leaf to spine as downlinks from leaf to servers. This means in the worst case, half the servers trying to communicate across racks will be throttled.
For ML training, oversubscription is dangerous. An all-reduce operation communicates between all nodes simultaneously. If the inter-rack bandwidth is oversubscribed, the all-reduce takes proportionally longer: with 4:1 oversubscription, expect roughly 1/4 of the non-oversubscribed all-reduce bandwidth.
Rule: ML training clusters should be non-blocking (1:1 subscription) or at most 2:1 oversubscribed. Never 4:1 or higher for the communication paths used by NCCL collectives.
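A quick way to see why the rule matters - worst-case per-node bandwidth when every node is pushing collective traffic across the oversubscribed layer (a crude model; it ignores traffic that stays within a rack):
def effective_allreduce_bandwidth(nic_gbs: float, oversubscription: float) -> float:
    """Worst-case per-node bandwidth when all inter-rack links are saturated simultaneously."""
    return nic_gbs / oversubscription

for ratio in (1, 2, 4):
    print(f"{ratio}:1 oversubscription -> {effective_allreduce_bandwidth(50.0, ratio):.1f} GB/s "
          f"per node (NDR, 50 GB/s NIC)")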
Rail-Optimized Topology
The fat-tree topology optimizes for any-to-any communication. But all-reduce has a very specific pattern: every node communicates with every other node equally. A more specialized topology - the rail-optimized (or HPN, High Performance Network) topology - exploits this structure.
In a rail-optimized topology, each server has multiple NICs (network interface cards), and each NIC connects to a different "rail" - an independent network plane. An NVIDIA DGX H100 has 8 H100 GPUs and 8 InfiniBand ConnectX-7 NICs. In a rail-optimized topology:
- Rail 0: GPU 0's NIC on every server connects to switches on rail 0
- Rail 1: GPU 1's NIC on every server connects to switches on rail 1
- ...
- Rail 7: GPU 7's NIC on every server connects to switches on rail 7
Each rail is an independent fat-tree. When NCCL runs a ring all-reduce, it can spread the traffic across all 8 rails simultaneously, multiplying effective bandwidth by 8.
The advantage over a single fat-tree: a single fat-tree with 8x the bandwidth would require 8x the ports and switches, at 8x the cost. A rail-optimized topology achieves similar aggregate bandwidth with simpler, shallower per-rail switching fabric.
The disadvantage: each rail is an independent L2 domain. Routing between GPUs on different rails requires going through the server's CPU, which adds overhead. NCCL must be configured to use each NIC for the appropriate collective step.
This topology is used by NVIDIA's Quantum-2 InfiniBand switching fabric and is the recommended design for DGX SuperPOD deployments. For a SuperPOD with 32 DGX H100 nodes (256 GPUs total), the rail-optimized topology uses 8 independent rails, each with 32 switch ports in a single-level switch, giving every server 8 x 400 Gb/s = 3.2 Tb/s of aggregate injection bandwidth into the fabric.
GPUDirect RDMA - Bypassing the CPU
GPUDirect RDMA is the technology that allows InfiniBand or RoCE NICs to directly access GPU memory without any CPU involvement. The data path without GPUDirect:
GPU memory -> PCIe bus -> CPU memory -> PCIe bus -> NIC -> network
The data path with GPUDirect RDMA:
GPU memory -> PCIe bus -> NIC -> network
GPUDirect RDMA eliminates the bounce buffer in CPU memory. This removes:
- The CPU copy time (significant for large tensors)
- CPU memory bandwidth consumption
- CPU interrupt overhead
- PCIe traversal in both directions (GPU->CPU memory and CPU memory->NIC)
For a 100 GB gradient tensor, the CPU copy at typical CPU memory bandwidth of 100 GB/s takes 1 second. GPUDirect RDMA eliminates this entirely. Even for smaller tensors, removing the CPU from the hot path reduces jitter and improves consistency.
Requirements for GPUDirect RDMA:
- NVIDIA GPU (Volta or newer for best support)
- NVIDIA ConnectX-5 or newer HCA (or ConnectX-7 for NDR)
- Both GPU and NIC connected to the same PCIe complex (ideally same NUMA node)
- nvidia-peermem kernel module loaded
- GPU and NIC on PCIe Gen4 or Gen5 for adequate bandwidth
Validating GPUDirect RDMA:
# Check that nvidia-peermem is loaded
lsmod | grep nvidia_peermem
# Run the perftest benchmark with GPUDirect
ib_write_bw --use_cuda=0 & # server GPU 0
ib_write_bw --use_cuda=0 <server> # client GPU 0
# Compare with CPU-based test
ib_write_bw &
ib_write_bw <server>
# GPUDirect should show bandwidth within 5-10% of CPU-based test
# If GPUDirect is significantly slower, check:
# 1. NUMA affinity (is the NIC on the same NUMA node as the GPU?)
# 2. PCIe bandwidth (is the GPU bottlenecked on PCIe?)
# 3. nvidia-peermem module loaded correctly?
PCIe topology matters: on a DGX H100 node, all 8 GPUs talk to each other over NVSwitch, but on the PCIe side they are split into two groups of 4, one per CPU socket. Each group has InfiniBand NICs on the same PCIe complex. GPUs 0-3 share PCIe Complex A with 4 of the 8 InfiniBand NICs. GPUs 4-7 share PCIe Complex B with the other 4 NICs. For optimal GPUDirect performance, NCCL assigns NICs to GPUs based on NUMA locality - GPU 0's traffic prefers NIC 0 on Complex A, not NIC 4 on Complex B.
When this affinity is wrong (common misconfiguration: NIC-to-GPU NUMA mapping not set), GPUDirect traffic crosses the PCIe complex boundary via the CPU interconnect (UPI on Intel, Infinity Fabric on AMD), adding 50-100 ns of latency and significantly reducing bandwidth.
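One common way to enforce the right affinity is to pin each rank to its local HCA before the NCCL communicator is created. The sketch below assumes a torchrun-style LOCAL_RANK variable and a one-to-one local-rank-to-mlx5_N mapping - both assumptions must be checked against nvidia-smi topo -m on your own nodes:
import os

# Assumed mapping: local rank i uses HCA mlx5_i on the same PCIe complex as GPU i.
# Verify the actual GPU <-> NIC affinity with `nvidia-smi topo -m` before relying on this.
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
os.environ["NCCL_IB_HCA"] = f"mlx5_{local_rank}"       # pin this rank's traffic to one HCA
os.environ.setdefault("NCCL_NET_GDR_LEVEL", "PHB")     # keep GPUDirect within the host bridge

# These must be set before the first communicator is created, i.e. before
# torch.distributed.init_process_group(backend="nccl") runs in this process.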
SHARP - In-Network All-Reduce
SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) is an NVIDIA Quantum InfiniBand feature that performs the arithmetic of all-reduce operations inside the network switches themselves, rather than at the endpoints.
In a standard ring all-reduce, data flows: GPU -> NIC -> switch -> NIC -> GPU -> NIC -> switch... The switches are just forwarding data. With SHARP, the switches participate in the computation: when data from multiple streams arrives at a SHARP-capable switch, the switch performs the reduction (sum, min, max) and only forwards the result. This cuts the communication volume roughly in half for an all-reduce.
The math: in a ring all-reduce across $N$ nodes with a gradient tensor of $S$ bytes, each node sends $2\frac{N-1}{N}S \approx 2S$ bytes in total. With SHARP tree reduction, each node sends its $S$ bytes up the tree once and receives the reduced result, for roughly $S$ bytes of transmission per node. For N=64 GPUs and a 1 GB gradient tensor:
- Ring all-reduce: ~1.97 GB sent per GPU
- SHARP tree: ~1 GB sent per GPU
This is a 2x reduction in data moved, which translates roughly to 2x improvement in all-reduce throughput - but only if the computation bottleneck is link bandwidth, not switch processing.
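The traffic comparison in code form - a rough model where a ring moves about $2S$ bytes per node and a SHARP tree moves about $S$ (function names are illustrative):
def ring_bytes_per_node(num_nodes: int, tensor_bytes: float) -> float:
    """Ring all-reduce: each node transmits 2*(N-1)/N * S bytes."""
    return 2 * (num_nodes - 1) / num_nodes * tensor_bytes

def sharp_bytes_per_node(tensor_bytes: float) -> float:
    """SHARP tree: each node sends its data up the tree once and receives the reduced result."""
    return float(tensor_bytes)

GB = 1e9
print(f"Ring : {ring_bytes_per_node(64, 1 * GB) / GB:.2f} GB per GPU")   # ~1.97 GB
print(f"SHARP: {sharp_bytes_per_node(1 * GB) / GB:.2f} GB per GPU")      # 1.00 GB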
SHARP requirements and limitations:
- Only available on NVIDIA Quantum InfiniBand switches (not RoCE)
- Requires SHARP-capable HCA firmware (ConnectX-6 or newer)
- Only works for flat data parallel all-reduce, not tensor-parallel (which uses activations of different shapes each step)
- Number of concurrent SHARP trees is limited by switch resources
- Requires NCCL with SHARP plugin enabled
Practical SHARP speedups: benchmarks from NVIDIA show 2-4x all-reduce speedup with SHARP for large messages on clusters of 64+ nodes. For gradient synchronization in data parallel training of large models, this translates to 20-30% end-to-end training speedup.
# Enable SHARP in NCCL (exposed via NCCL's CollNet path and the nccl-rdma-sharp plugin)
export NCCL_COLLNET_ENABLE=1
export SHARP_COLL_LOG_LEVEL=WARN
# Verify SHARP is actually being used
export NCCL_DEBUG=INFO
# Look for "SHARP" in the NCCL output during the first all-reduce
Code Examples
Network Topology Discovery
#!/bin/bash
# Comprehensive InfiniBand fabric diagnostic script
# Run as root or with appropriate IB management privileges
echo "=== InfiniBand Fabric Discovery ==="
# Discover the full fabric topology
# ibnetdiscover outputs a graph of all switches and nodes
ibnetdiscover --show > /tmp/fabric_topology.txt
echo "Topology saved to /tmp/fabric_topology.txt"
echo "Switches found: $(grep -c '^Switch' /tmp/fabric_topology.txt)"
echo "Hosts found: $(grep -c '^Ca' /tmp/fabric_topology.txt)"
echo ""
echo "=== Fabric Status ==="
# Check for degraded links (should be 0 for healthy fabric)
ibdiagnet 2>&1 | tail -20
echo ""
echo "=== Switch Port Counters ==="
# Check error counters on all switch ports
# High values indicate degraded links or cables
perfquery -x 0 2>/dev/null | head -40
echo ""
echo "=== Host Channel Adapters ==="
# Show all HCAs in the fabric
ibstat | grep -E "(CA|Port|State|Rate|Link)"
echo ""
echo "=== Link Speed Check ==="
# Show actual vs expected link speeds
# Look for anything below expected (e.g., 100Gb/s when 200Gb/s expected)
ibdatacounts -v 2>/dev/null | grep -E "(Port|Speed)"
# Check for any ports operating at degraded speed
# NDR should show 4x NDR (400 Gb/s), HDR should show 4x HDR (200 Gb/s)
iblinkinfo | grep -v "Active" | grep -v "^#"
NCCL Diagnostic and Benchmarking
#!/bin/bash
# NCCL bandwidth and latency benchmarks
# Run on all nodes: mpirun -hostfile hostfile -np $NUM_GPUS ./nccl_bench.sh
# Clone and build nccl-tests if not already present
if [ ! -d "/opt/nccl-tests" ]; then
git clone https://github.com/NVIDIA/nccl-tests.git /opt/nccl-tests
cd /opt/nccl-tests && make MPI=1 MPI_HOME=/usr/local/mpi CUDA_HOME=/usr/local/cuda
fi
export NCCL_DEBUG=WARN
export NCCL_IB_DISABLE=0 # ensure InfiniBand is enabled
export NCCL_IB_HCA=mlx5_0 # specify which HCA to use (check with ibstat)
export NCCL_SOCKET_IFNAME=^lo,docker # exclude loopback and docker
echo "=== All-Reduce Bandwidth Test ==="
# Test a range of message sizes
# Critical sizes: gradient tensors typically 1MB - 100GB
# -b/-e: min/max message size, -f: size multiplier, -g: GPUs per process
mpirun -np 8 /opt/nccl-tests/build/all_reduce_perf \
    -b 8 \
    -e 8G \
    -f 2 \
    -g 1 \
    --op sum \
    2>&1 | tee /tmp/allreduce_bench.txt
echo ""
echo "=== All-Reduce Latency (small messages) ==="
mpirun -np 8 /opt/nccl-tests/build/all_reduce_perf \
-b 8 \
-e 4096 \
-f 2 \
-g 1 \
--op sum
echo ""
echo "=== All-Reduce at Typical Gradient Sizes ==="
# 7B model gradients in BF16: ~14 GB
mpirun -np 8 /opt/nccl-tests/build/all_reduce_perf \
-b 1G \
-e 16G \
-f 2 \
-g 1 \
--op sum
# Expected output column meanings:
# size: message size in bytes
# count: number of elements
# type: data type
# redop: reduction operation
# root: root rank (for broadcast)
# time: median time in microseconds
# algbw: algorithmic bandwidth = size/time
# busbw: bus bandwidth = algbw * 2*(N-1)/N (for all-reduce)
# error: max relative error
#
# Target bus bandwidth: >= 80% of theoretical link bandwidth
# e.g., NDR at 400 Gb/s = 50 GB/s -> target busbw >= 40 GB/s
Python: Diagnosing Slow Collectives
import torch
import torch.distributed as dist
import time
import os
import statistics
def diagnose_collective_performance(num_warmup=5, num_trials=20):
"""
Comprehensive diagnostic of collective communication performance.
Run on all ranks simultaneously.
"""
rank = dist.get_rank()
world_size = dist.get_world_size()
device = torch.device(f"cuda:{rank % torch.cuda.device_count()}")
results = {}
# Test 1: Point-to-point latency (baseline)
if rank == 0 or rank == 1:
tensor = torch.zeros(1, device=device)
latencies = []
for i in range(num_warmup + num_trials):
torch.cuda.synchronize()
start = time.perf_counter()
if rank == 0:
dist.send(tensor, dst=1)
dist.recv(tensor, src=1)
else:
dist.recv(tensor, src=0)
dist.send(tensor, dst=0)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start
if i >= num_warmup:
latencies.append(elapsed * 1e6) # microseconds
if rank == 0:
results['p2p_latency_us'] = {
'mean': statistics.mean(latencies),
'median': statistics.median(latencies),
'stdev': statistics.stdev(latencies),
'min': min(latencies),
'max': max(latencies)
}
dist.barrier()
# Test 2: All-reduce bandwidth at multiple sizes
sizes_gb = [0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 4.0] # GB
allreduce_results = {}
for size_gb in sizes_gb:
num_elements = int(size_gb * 1e9 / 2) # BF16 = 2 bytes
tensor = torch.randn(num_elements, dtype=torch.bfloat16, device=device)
# Warmup
for _ in range(num_warmup):
dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
torch.cuda.synchronize()
dist.barrier()
# Benchmark
times = []
for _ in range(num_trials):
torch.cuda.synchronize()
start = time.perf_counter()
dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start
times.append(elapsed)
dist.barrier()
if rank == 0:
median_time = statistics.median(times)
# Bus bandwidth: accounts for ring all-reduce traffic pattern
busbw_gb_s = (
2 * (world_size - 1) / world_size * size_gb / median_time
)
allreduce_results[f"{size_gb:.3f}GB"] = {
'median_ms': median_time * 1000,
'busbw_GB_s': busbw_gb_s,
'stdev_ms': statistics.stdev(times) * 1000
}
if rank == 0:
results['allreduce'] = allreduce_results
# Test 3: Detect stragglers - check for rank-to-rank variance
# Each rank times a barrier + small all-reduce
rank_times = []
for trial in range(10):
local_tensor = torch.tensor([float(rank)], device=device)
torch.cuda.synchronize()
start = time.perf_counter()
dist.all_reduce(local_tensor)
torch.cuda.synchronize()
rank_times.append(time.perf_counter() - start)
# Gather all rank times to rank 0 for analysis
time_tensor = torch.tensor(
[statistics.median(rank_times)], device=device
)
all_times = [torch.zeros(1, device=device) for _ in range(world_size)]
dist.all_gather(all_times, time_tensor)
if rank == 0:
times_list = [t.item() * 1000 for t in all_times]
results['straggler_analysis'] = {
'rank_times_ms': times_list,
'max_rank': times_list.index(max(times_list)),
'variance_ms': max(times_list) - min(times_list),
'warning': max(times_list) > 2 * min(times_list)
}
print("\n=== Collective Performance Diagnostic ===")
print(f"World size: {world_size} GPUs")
if 'p2p_latency_us' in results:
p2p = results['p2p_latency_us']
print(f"\nP2P Latency (rank 0 <-> rank 1):")
print(f" Median: {p2p['median']:.1f} us")
print(f" StdDev: {p2p['stdev']:.1f} us")
# InfiniBand: expect < 2 us. Ethernet: expect > 10 us.
if p2p['median'] < 5:
print(" [OK] Consistent with InfiniBand/NVLink")
elif p2p['median'] < 20:
print(" [WARN] Higher than expected - check RoCE or cable quality")
else:
print(" [CRITICAL] Very high latency - check network path")
print(f"\nAll-Reduce Bandwidth:")
print(f"{'Size':>10} {'Median (ms)':>12} {'BusBW (GB/s)':>14} {'StdDev (ms)':>12}")
for size, stats in results['allreduce'].items():
flag = ""
# NDR target: 40+ GB/s busbw. HDR target: 20+ GB/s.
if stats['busbw_GB_s'] < 15:
flag = " [CRITICAL]"
elif stats['busbw_GB_s'] < 25:
flag = " [WARN]"
print(f"{size:>10} {stats['median_ms']:>12.1f} "
f"{stats['busbw_GB_s']:>14.1f} "
f"{stats['stdev_ms']:>12.2f}{flag}")
straggler = results['straggler_analysis']
print(f"\nStraggler Analysis:")
print(f" Max-Min variance: {straggler['variance_ms']:.2f} ms")
if straggler['warning']:
print(f" [WARN] Rank {straggler['max_rank']} is significantly "
f"slower than others - potential straggler")
else:
print(f" [OK] No significant stragglers detected")
return results
def monitor_nccl_environment():
"""Print all NCCL-relevant environment variables and their effects."""
nccl_vars = {
'NCCL_DEBUG': 'Verbosity: WARN, INFO, TRACE. Use INFO to see ring/tree selection.',
'NCCL_DEBUG_SUBSYS': 'Specific subsystem: INIT, COLL, P2P, NET, GRAPH, ALL',
'NCCL_IB_DISABLE': '1 to disable InfiniBand (forces Ethernet)',
'NCCL_IB_HCA': 'Which HCA to use. Set to specific device if NCCL picks wrong NIC',
'NCCL_NET_GDR_LEVEL': 'GPUDirect RDMA level. 0=disabled, 5=always (default: auto)',
'NCCL_SOCKET_IFNAME': 'Network interface for fallback. Use ^lo to exclude loopback',
'NCCL_ALGO': 'Override algorithm: Ring, Tree, CollnetDirect, CollnetChain, NVLS',
'NCCL_PROTO': 'Override protocol: Simple, LL, LL128',
'NCCL_NTHREADS': 'Number of threads per block for NCCL kernel',
'NCCL_NCHUNKS_COARSE': 'Number of chunks for coarse-grained pipelining',
'NCCL_P2P_LEVEL': 'P2P distance limit: LOC=disabled, NVL=NVLink, PIX=same PCIe switch, PXB=across PCIe bridges, PHB=through host bridge, SYS=across NUMA',
'NCCL_SHM_DISABLE': '1 to disable shared memory transport (useful for debugging)',
'NCCL_COLLNET_ENABLE': '1 to enable CollNet/SHARP in-network reduction (InfiniBand only)',
'NCCL_BUFFSIZE': 'Buffer size per peer channel. Default 4MB. Increase for bandwidth.',
'NCCL_GRAPH_DUMP_FILE': 'Dump NCCL communication graph to file for inspection',
}
print("=== NCCL Environment ===")
for var, description in nccl_vars.items():
value = os.environ.get(var, '<not set>')
print(f"{var:30} = {value:20} # {description}")
Detecting Network Configuration Issues
#!/bin/bash
# Common network misconfiguration checks for ML clusters
echo "=== NUMA Topology - NIC to GPU Affinity ==="
# Check which NUMA node each GPU and NIC is on
# They should be co-located for best GPUDirect RDMA performance
for gpu_id in 0 1 2 3 4 5 6 7; do
# nvidia-smi reports a 32-bit PCI domain (00000000:xx:00.0); sysfs uses a 16-bit domain
bus_id=$(nvidia-smi --query-gpu=pci.bus_id --format=csv,noheader -i $gpu_id | tr '[:upper:]' '[:lower:]')
numa_node=$(cat /sys/bus/pci/devices/${bus_id:4}/numa_node 2>/dev/null)
echo "GPU $gpu_id: NUMA node $numa_node"
done
echo ""
for nic in $(ls /sys/class/infiniband/); do
pci_addr=$(readlink /sys/class/infiniband/$nic/device | sed 's/.*\///')
numa_node=$(cat /sys/bus/pci/devices/$pci_addr/numa_node 2>/dev/null)
echo "IB NIC $nic: NUMA node $numa_node"
done
echo ""
echo "=== MTU Check ==="
# InfiniBand should run at 4096 bytes MTU for best throughput
# Ethernet should be at 9000 (jumbo frames) for RoCE
for ib_dev in $(ls /sys/class/infiniband/); do
echo "$ib_dev:"
# ibv_devinfo reports port state and active MTU (expect 4096 for InfiniBand)
ibv_devinfo -d $ib_dev 2>/dev/null | grep -E "port:|state:|active_mtu"
done
# For Ethernet/RoCE interfaces:
for intf in $(ibdev2netdev | awk '{print $5}'); do
mtu=$(ip link show $intf | grep mtu | awk '{print $5}')
echo "Ethernet $intf MTU: $mtu (should be 9000 for RoCE)"
done
echo ""
echo "=== Flow Control Check (PFC for RoCE) ==="
# PFC must be enabled for RoCE stability
# Check priority flow control settings
for intf in $(ibdev2netdev | awk '{print $5}'); do
echo "PFC on $intf:"
mlnx_qos -i $intf 2>/dev/null | grep -A2 "PFC"
done
echo ""
echo "=== ECMP Routing Check ==="
# Check that ECMP load balancing is enabled and configured correctly
# InfiniBand: check subnet manager routing
sminfo 2>/dev/null
echo ""
echo "=== Link Error Counters ==="
# Persistent errors indicate bad cables or SFPs
# SymbolErrors, LinkRecovers should be 0 on a healthy link
perfquery 2>/dev/null | grep -E "(SymbolErrors|LinkRecovers|LinkDowned|RcvErrors|XmtDiscards)"
echo ""
echo "=== GPUDirect RDMA Status ==="
# Verify nvidia-peermem is loaded
lsmod | grep -E "(nvidia_peermem|ib_core|rdma_core)"
# Check GPUDirect support in NCCL
python3 -c "
import torch
import subprocess
result = subprocess.run(['nvidia-smi', '-q'], capture_output=True, text=True)
lines = [l for l in result.stdout.split('\n') if 'P2P' in l or 'GPUDirect' in l]
for l in lines[:10]:
print(l)
"
Production Engineering Notes
Sizing the Network for Your Workload
The key formula for network sizing: what is the all-reduce bandwidth requirement per GPU?
For data parallel training of a 7B model on $N$ GPUs with step time $t_{step}$:
- Gradient tensor size: $7 \times 10^9 \times 2$ bytes = 14 GB (BF16)
- Ring all-reduce traffic per GPU: $2\frac{N-1}{N} \times 14$ GB $\approx$ 28 GB
- Required bandwidth: $\approx 28\ \text{GB} / t_{step}$
If $t_{step} = 1$ second (roughly a 7B model, batch 32, seq 2048 on H100): you need 28 GB/s of network bandwidth per GPU just for gradient sync. NDR InfiniBand at 50 GB/s per port handles this with headroom. HDR at 25 GB/s is borderline - computation and communication overlap becomes critical.
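The same sizing rule as a small calculator (names are illustrative; it assumes a plain ring all-reduce of BF16 gradients with no overlap and no gradient compression):
def required_gradient_sync_bandwidth(params: float, bytes_per_param: int,
                                     num_gpus: int, step_time_s: float) -> float:
    """Per-GPU network bandwidth (GB/s) needed to finish a ring all-reduce within one step."""
    grad_gb = params * bytes_per_param / 1e9
    ring_traffic_gb = 2 * (num_gpus - 1) / num_gpus * grad_gb
    return ring_traffic_gb / step_time_s

# 7B params, BF16 (2 bytes), 256 GPUs, 1 s step time -> ~28 GB/s per GPU
print(f"{required_gradient_sync_bandwidth(7e9, 2, 256, 1.0):.1f} GB/s per GPU")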
For tensor-parallel training with TP=8 within a DGX node: NVLink 4.0 at 900 GB/s bidirectional is never the bottleneck for tensor-parallel all-reduces. The inter-node network (InfiniBand) handles only pipeline activation transfers and data-parallel gradient sync.
Switch Fabric Sizing
A non-blocking fat-tree for $N$ servers, each with $B$ GB/s of NIC bandwidth, requires one fabric edge port per NIC plus an equal amount of uplink capacity above it, for a bisection bandwidth of $N \cdot B / 2$.
For N = 512 DGX H100 nodes (4,096 GPUs), each with 8 NDR IB NICs (8 x 50 GB/s = 400 GB/s per node), that is 4,096 edge ports at 400 Gb/s (50 GB/s) per port. A 64-port NDR switch terminates 64 such links, so 64 switches is the bare minimum just to connect the servers. In a 2-level spine-leaf design: 64 leaf switches (each with 64 ports: 32 down to servers, 32 up to spine) and 64 spine switches. This is a common configuration for A100/H100 clusters.
The cost: an NVIDIA Quantum-2 NDR switch with 64 ports lists at approximately $80,000 at the time of writing. A 128-switch fabric costs roughly $10.2M in switches alone, before cables, transceivers, and labor.
Congestion Detection and Response
NCCL has a built-in timeout mechanism: if a collective takes longer than expected, NCCL logs a warning. However, distinguishing network congestion from a compute straggler requires instrumentation.
Key metrics to monitor in production:
- NCCL collective time per step: track with a custom NCCL profiler hook or PyTorch's `torch.profiler`
- IB port utilization: SNMP polling of `ifHCInOctets`/`ifHCOutOctets` on switch ports
- PFC pause frames: high PFC pause rates indicate congestion (for RoCE) - monitor via sysfs or SNMP
- IB error counters: `SymbolErrors` and `PortRcvErrors` indicate physical layer issues
# Add to your training loop to track collective times
import torch
import torch.distributed as dist
from contextlib import contextmanager
import time
collective_times = []
@contextmanager
def timed_collective():
torch.cuda.synchronize()
start = time.perf_counter()
yield
torch.cuda.synchronize()
elapsed = time.perf_counter() - start
collective_times.append(elapsed * 1000) # ms
# In training loop:
with timed_collective():
dist.all_reduce(gradients, op=dist.ReduceOp.SUM)
# Periodically report:
if step % 100 == 0 and dist.get_rank() == 0:
recent = collective_times[-100:]
print(f"All-reduce: mean={sum(recent)/len(recent):.1f}ms, "
f"max={max(recent):.1f}ms, "
f"p99={sorted(recent)[99]:.1f}ms")
if max(recent) > 5 * (sum(recent)/len(recent)):
print("WARNING: high variance in collective times - check for stragglers")
Effective Bandwidth vs Theoretical
A common mistake: assuming NCCL achieves the theoretical link bandwidth. In practice:
- Protocol overhead (headers, ACKs): 3-5% overhead
- RDMA queue pair setup and teardown: matters at small message sizes
- PCIe bandwidth limitations: ~32 GB/s per direction for PCIe 4.0 x16 (64 GB/s bidirectional), ~64 GB/s per direction for PCIe 5.0 x16 (128 GB/s bidirectional)
- NVSwitch to NVLink latency for intra-node: adds 1-2 us per hop
- NCCL algorithm selection: Ring vs Tree vs NVLS affects at what message size peak bandwidth is reached
Rule of thumb for achievable NCCL all-reduce bandwidth:
- Intra-node NVLink (8 GPUs): 85-90% of NVLink bandwidth for messages > 1 MB
- Inter-node InfiniBand HDR: 70-80% of link bandwidth for messages > 10 MB
- Inter-node InfiniBand NDR: 75-85% of link bandwidth for messages > 10 MB
- Small messages (< 1 MB): latency-bound, may see only 10-30% of peak bandwidth
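Putting the rules of thumb above into a rough estimator (the efficiency factor is the fraction of link bandwidth you expect NCCL to achieve at your message size; the values are the approximate ranges quoted above, not guarantees):
def estimate_allreduce_ms(message_gb: float, link_gbs: float,
                          efficiency: float, num_gpus: int) -> float:
    """Estimated ring all-reduce time given link bandwidth and an efficiency factor."""
    busbw = link_gbs * efficiency                           # achievable bus bandwidth
    traffic_gb = 2 * (num_gpus - 1) / num_gpus * message_gb
    return traffic_gb / busbw * 1000

# 14 GB of BF16 gradients, 256 GPUs, NDR at 50 GB/s, ~80% efficiency -> roughly 700 ms
print(f"{estimate_allreduce_ms(14, 50, 0.80, 256):.0f} ms")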
Common Mistakes
:::danger Disabling GPUDirect RDMA accidentally
Setting NCCL_NET_GDR_LEVEL=0 or failing to load nvidia-peermem disables GPUDirect RDMA and causes all collective communication to bounce through CPU memory. This can reduce all-reduce throughput by 40-60% for large tensors and add significant CPU pressure. Check that lsmod | grep nvidia_peermem returns output before launching training. If nvidia-peermem is not loading automatically, add it to /etc/modules-load.d/. The symptom: NCCL INFO log will show "NET/IB : No GPUDirect RDMA support" - always check NCCL INFO output on the first run.
:::
:::danger Running RoCE without PFC/ECN configured
RoCE over a lossy Ethernet fabric is catastrophic for collective communication performance. RDMA NICs do not do selective retransmission: a single dropped packet forces a go-back-N replay of everything sent after it, not just the lost packet. With a 100 GB gradient all-reduce, a single packet drop adds seconds of retransmission overhead. Always configure PFC (Priority Flow Control) on the Ethernet fabric and verify it is working with mlnx_qos. The symptom: very high variance in all-reduce times, occasional step times 10-100x the median.
:::
:::warning Wrong NIC-to-GPU NUMA binding
On a DGX H100 node, there are 8 GPUs and 8 InfiniBand NICs. GPUs 0-3 are on NUMA node 0 and GPUs 4-7 are on NUMA node 1. InfiniBand NICs are similarly split. If NCCL maps GPU 0 to a NIC on NUMA node 1, every collective communication crosses the NVSwitch-to-CPU-interconnect boundary, adding 50-100 ns of latency per transfer and reducing bandwidth by 20-40%. Set NCCL_IB_HCA to explicitly specify which HCA each rank should use, matching GPU NUMA affinity. Verify the mapping with nvidia-smi topo -m which shows the NIC-to-GPU topology.
:::
:::warning Oversubscribed fabric for all-reduce workloads
A 2:1 or 4:1 oversubscribed fat-tree reduces the cost of the switch fabric significantly. For web serving or batch inference, this is usually acceptable - not all servers communicate simultaneously. For distributed training all-reduce, all nodes communicate simultaneously and continuously. A 2:1 oversubscription means the effective all-reduce bandwidth is halved when traffic saturates the uplinks. This 2x slowdown in gradient sync translates directly to 2x longer training runs. Never use more than 2:1 oversubscription for paths carrying NCCL collective traffic. If you must use an oversubscribed fabric, configure NCCL to use algorithms that are more tolerant of bandwidth variance (Tree instead of Ring).
:::
:::warning Using iWARP instead of RoCE for RDMA
iWARP (Internet Wide Area RDMA Protocol) runs RDMA over TCP/IP rather than UDP/IP like RoCE v2. iWARP works over lossy networks without special switch configuration. The catch: the TCP overhead (per-packet ACKs, slow start, connection state) adds significant latency and CPU overhead compared to RoCE. For ML collectives with large messages, iWARP typically delivers 50-70% of the throughput of RoCE v2 on the same hardware. Always prefer RoCE v2 with a properly configured lossless Ethernet fabric over iWARP. If you cannot configure PFC/ECN on your switches, that is a switch configuration problem to fix - not a reason to use iWARP.
:::
Interview Q&A
Q1: Explain the difference between InfiniBand and RoCE. When would you choose one over the other for a GPU training cluster?
InfiniBand is a complete interconnect architecture: physical layer, link layer, network layer, transport layer, and programming model are all defined in the InfiniBand specification. InfiniBand switches form their own fabric with their own subnet manager (typically OpenSM) that computes the routing tables, and their own addressing (LIDs - Local Identifiers). It is lossless by design, using credit-based flow control at the link layer. Latency is sub-microsecond and bandwidth scales from 100 Gb/s (EDR) to 400 Gb/s (NDR) to 800 Gb/s (XDR) per port.
RoCE (RDMA over Converged Ethernet) takes the InfiniBand RDMA transport semantics and runs them over Ethernet at the physical and link layers. RoCE v2 uses UDP/IP as the network layer, making it routable across L3 boundaries. The performance is close to InfiniBand on well-configured lossless Ethernet fabric, but RoCE requires explicit configuration of Priority Flow Control and Explicit Congestion Notification on all switches in the fabric. Without these, packet drops cause RDMA retransmissions that destroy throughput.
Choose InfiniBand when: (1) you are building a dedicated training cluster and want maximum performance with minimal configuration risk, (2) you want SHARP in-network reduction (only available on Quantum InfiniBand switches), (3) your team has InfiniBand expertise. Choose RoCE when: (1) you want to use commodity Ethernet switches and integrate with existing Ethernet infrastructure, (2) you need L3 routing between cluster segments, (3) you are building cloud infrastructure where InfiniBand fabric isolation is complex.
Q2: What is SHARP and how does it reduce all-reduce communication volume? What are its limitations?
SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) performs reduction operations inside NVIDIA Quantum InfiniBand switches. Instead of flowing data to endpoints for reduction, SHARP-capable switches aggregate incoming data streams and forward only the reduced result.
In a standard ring all-reduce across $N$ nodes, each node transmits approximately $2\frac{N-1}{N}S \approx 2S$ bytes, where $S$ is the gradient size in bytes. With SHARP, each node sends its $S$ bytes up to its nearest SHARP tree node, the tree aggregates at each level, and the final result is broadcast back. Total transmission is approximately $S$ bytes per node - roughly a 2x reduction in data moved.
Limitations: SHARP is only available on NVIDIA Quantum InfiniBand switches, not RoCE or Ethernet switches. The number of concurrent SHARP trees is limited by switch buffer resources (typically 8-16 concurrent SHARP groups per switch). SHARP works well for flat data-parallel all-reduce where all nodes reduce the same tensor of fixed size; it does not adapt well to tensor-parallel all-reduces where activation shapes change per layer. Also, SHARP provides the most benefit when the link bandwidth (not switch processing) is the bottleneck - for very large clusters where switches themselves are compute-limited, SHARP benefits may be less than theoretical.
Q3: Walk through the GPUDirect RDMA data path. What kernel modules and hardware requirements enable it? What fails when it is not working?
The standard GPU-to-network data path: GPU computes into VRAM, CUDA triggers a copy from VRAM to CPU-pinned memory (CUDA "bounce buffer"), the network driver DMAs from pinned CPU memory into the NIC send buffer, and the NIC transmits. On the receive side: NIC DMA to CPU pinned memory, then CPU copy to GPU VRAM.
GPUDirect RDMA eliminates both CPU bounce buffers. The NVIDIA kernel driver (nvidia.ko) exposes a peer-to-peer memory interface. The InfiniBand HCA driver uses this to map GPU physical memory pages directly into the HCA's DMA space. The NIC can then directly DMA from GPU VRAM to the network (send path) or from the network to GPU VRAM (receive path) with no CPU involvement.
Hardware requirements: NVIDIA GPU (Volta or newer), NVIDIA ConnectX-5 or newer HCA, both on the same PCIe complex (or connected via PCIe 4.0/5.0 with sufficient bandwidth), the nvidia-peermem kernel module loaded (this provides the memory registration interface). The GPU and NIC must be on PCIe - Thunderbolt NICs do not support GPUDirect.
When not working: NCCL logs "NET/IB : No GPUDirect RDMA support" and falls back to the CPU bounce buffer path. Symptoms include: higher CPU utilization during collectives (from memory copy work), 30-50% lower collective throughput, increased NUMA bandwidth consumption. Diagnosis: lsmod | grep nvidia_peermem, NCCL_DEBUG=INFO output, compare ib_write_bw --use_cuda vs plain ib_write_bw benchmark throughput.
Q4: A 512-GPU training cluster is running at 60% of expected all-reduce throughput. Walk through your debugging process.
Step 1: isolate the scope. Is the problem on all nodes or specific nodes? Run nccl-tests all_reduce_perf with progressively smaller node groups (all 512, then 256, then 128, etc.) to narrow whether it is a global fabric issue or a specific set of nodes.
Step 2: check for stragglers. One slow GPU/NIC can bottleneck a ring all-reduce. Run the diagnostic script that times each rank independently and use dist.all_gather to collect rank-level timing. If one rank is consistently 2-3x slower, isolate that node.
Step 3: check the IB fabric. Run ibnetdiscover and iblinkinfo | grep -v Active to find any links running at degraded speed. Check perfquery for error counters on switch ports - SymbolErrors or PortRcvErrors on any port indicate a bad cable or transceiver. Run ibdiagnet for a deeper fabric health report.
Step 4: check for routing asymmetry (the scenario from the opening). In a fat-tree, ECMP should distribute flows evenly across spine switches. If all traffic is going through one path (firmware bug, misconfigured routing), you will see high utilization on some switch ports and near-zero on others. Check switch port counters via SNMP or the switch management interface.
Step 5: check NCCL algorithm selection. Use NCCL_DEBUG=INFO to see which algorithm (Ring vs Tree) NCCL selected. For inter-node collectives, Tree may outperform Ring for large clusters with high-latency paths. Try NCCL_ALGO=Tree and benchmark.
Step 6: check GPUDirect RDMA is active on all nodes. On even one node without GPUDirect, the all-reduce throughput for the entire ring is bottlenecked at that node's CPU-copy-limited speed.
Q5: What is the bisection bandwidth of a fat-tree, and why does it matter for collective communication? How does oversubscription affect training performance?
Bisection bandwidth is the total bandwidth available across the minimum cut that divides the network into two equal halves. For a non-blocking (1:1) fat-tree with $N$ servers each having $B$ GB/s of NIC bandwidth, the bisection bandwidth is $N \cdot B / 2$. This is the "full" bisection - any half of the servers can communicate with the other half at full link rate simultaneously.
For collective communication this is critical because all-reduce is inherently an all-to-all communication pattern. In a ring all-reduce, each node sends to and receives from its ring neighbors. If those neighbors happen to be on the other side of the bisection (different racks or pods), every step of the ring traverses the spine layer. With full bisection bandwidth, this is not a bottleneck. With oversubscription, the spine bandwidth is less than the leaf bandwidth.
For a 2:1 oversubscribed fat-tree: each leaf switch has half as much uplink bandwidth to the spine as it has downlink bandwidth to its servers. Worst case: all N/2 servers on one side are communicating with all N/2 servers on the other side. The effective inter-pod bandwidth is halved. In a ring all-reduce, this bottleneck affects (N-1)/2 of the N-1 ring steps.
Practical impact: for a 256-GPU data-parallel training job with 2:1 oversubscription and a 14 GB gradient tensor, the all-reduce saturates the spine. Expected all-reduce time with a 1:1 fabric: $\approx 28\ \text{GB} / 50\ \text{GB/s} \approx 0.56$ seconds. With 2:1 oversubscription: $\approx 1.12$ seconds. If compute per step is 2 seconds, the ratio drops from 78% compute utilization to 64% - a 14 percentage point degradation in efficiency, which directly multiplies the training cost.
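The utilization arithmetic from this answer as a sketch (it assumes gradient sync is not overlapped with compute, which is the pessimistic case being discussed):
def compute_utilization(compute_s: float, allreduce_s: float) -> float:
    """Fraction of wall-clock time spent computing when gradient sync is not overlapped."""
    return compute_s / (compute_s + allreduce_s)

sync_1_to_1 = 28 / 50            # ~28 GB of ring traffic at 50 GB/s (NDR) ~= 0.56 s
sync_2_to_1 = sync_1_to_1 * 2    # uplink bandwidth halved -> sync time doubles ~= 1.12 s

print(f"1:1 fabric: {compute_utilization(2.0, sync_1_to_1):.0%}")   # ~78%
print(f"2:1 fabric: {compute_utilization(2.0, sync_2_to_1):.0%}")   # ~64%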
Q6: Explain how ECMP load balancing works for NCCL traffic and what can go wrong with it.
ECMP (Equal-Cost Multi-Path) is the mechanism fat-tree switches use to distribute traffic across multiple equal-cost paths to a destination. When a packet arrives at a spine switch with multiple equal-cost paths to the destination leaf, ECMP selects among them based on a hash of the packet's 5-tuple: (source IP, destination IP, source port, destination port, protocol).
For NCCL traffic over InfiniBand, the "5-tuple equivalent" is the source LID, destination LID, and other flow identifiers. Each NCCL communicator creates a set of Queue Pairs (QPs), and each QP has a unique identifier. ECMP hashes QP identifiers to spread different QPs (and their traffic) across different paths.
What can go wrong:
- Hash collisions: if two high-bandwidth NCCL flows hash to the same ECMP path, that path is overloaded while other paths are underutilized. This is the bug from the opening scenario. Firmware bugs or poorly-chosen hash seeds can cause pathological collision patterns where all flows from a specific communicator always land on the same spine switch.
- Elephant flows: a single large all-reduce creates one dominant flow that overwhelms one path. ECMP hashes per-flow, not per-packet, so a single large flow is never split across paths. Technologies like packet spraying (used by custom interconnects such as Google's Jupiter) can split individual flows, but standard ECMP cannot.
- Path asymmetry: if spine switches do not have equal numbers of paths to all destination pods (misconfiguration, switch failure), some destinations are reachable via fewer paths than others, creating unequal ECMP weights.
Mitigation: configure NCCL to open more channels - and therefore more queue pairs and distinct flows - per communicator (NCCL_MIN_NCHANNELS and NCCL_MAX_NCHANNELS), improving ECMP distribution. Run ibnetdiscover regularly to detect topology changes. Monitor per-port utilization on spine switches and alert on imbalance above 20%.
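Hash collisions are more likely than intuition suggests - it is the birthday problem. A small sketch estimating the chance that at least two of F flows land on the same one of P equal-cost paths under an ideal uniform hash (real switch hash functions can behave worse):
def ecmp_collision_probability(num_flows: int, num_paths: int) -> float:
    """P(at least two flows hash to the same path), assuming a uniform random hash."""
    if num_flows > num_paths:
        return 1.0
    p_no_collision = 1.0
    for i in range(num_flows):
        p_no_collision *= (num_paths - i) / num_paths
    return 1.0 - p_no_collision

# Even 8 flows spread over 16 spine paths collide about 88% of the time:
print(f"{ecmp_collision_probability(8, 16):.0%}")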
