Network Debugging for Distributed Training

It's 2:47 AM and your 64-GPU training job - the one that has been running for 18 hours with 40% of the weekly GPU budget already consumed - has stopped making progress. The loss curve flatlined. All 64 processes are alive. No Python exceptions. No OOM errors. The cluster monitoring shows GPU utilization has dropped from 94% to 0%, and it has been that way for eleven minutes. Your on-call phone buzzes a second time. Somewhere in the distributed communication fabric connecting those 64 GPUs, something has gone wrong, and you have no idea where to start.

This is the moment every distributed systems engineer dreads: a silent hang in a multi-node training job. The failure is not loud. There is no traceback. The processes are not dead - they are waiting. They could be waiting for a gradient synchronization that will never complete, a collective operation where one rank sent the wrong tensor shape, a network switch that quietly dropped a packet during a critical AllReduce, or a single slow node that fell 200 milliseconds behind and triggered a timeout cascade. You cannot fix what you cannot see, and the tools to make distributed training visible are scattered across five different debugging systems that most ML engineers have never been taught to use together.

The problem is compounded by the scale of the infrastructure. A 64-GPU job might span 8 nodes, each with 8 A100s connected by NVLink internally and InfiniBand externally. The communication topology is a ring, a tree, or a recursive halving algorithm depending on the collective operation. NCCL - the NVIDIA Collective Communications Library that orchestrates all of this - runs entirely in background CUDA threads, invisible to PyTorch's normal error reporting mechanisms. When NCCL hangs, it does so silently, and the only evidence is the absence of forward progress.

The good news is that the debugging methodology for distributed training network failures is systematic and learnable. NCCL provides debug logging that, when properly configured, exposes exactly which operation is blocking and on which rank. iperf3 and nccl-tests can measure raw network bandwidth and identify underperforming nodes in minutes. torch.profiler generates communication timelines that show compute-communication overlap at microsecond granularity. RDMA diagnostics expose InfiniBand fabric issues that look like software bugs but are actually hardware. This lesson teaches you to use all of these tools together - not in isolation, but as a coherent debugging workflow that takes you from "training is hung" to "here is the specific root cause" in under 30 minutes.

Over the next several thousand words, you will learn the communication patterns that distributed training relies on, why each one fails in characteristic ways, and how to diagnose every category of failure. By the end, you will be the person on your team who does not panic at 2:47 AM - because you will have the tools and the mental model to fix it.

Why This Exists

Collective communication in distributed training was not designed for debugging. NCCL was designed for performance - specifically, for squeezing every last bit of bandwidth out of InfiniBand and NVLink fabrics in HPC environments where every microsecond of communication overhead reduces training throughput. Debuggability was not a primary design goal.

The result is a library that performs magnificently when everything is working and fails silently when anything goes wrong. The operators of large GPU clusters discovered this the hard way starting around 2017-2018, when GPT-1-scale training jobs began running for weeks at a time. A job that runs for 30 days and silently hangs on day 28 is a catastrophic infrastructure failure. The debugging tooling that exists today - NCCL_DEBUG, TORCH_DISTRIBUTED_DEBUG, nccl-tests, the profiler timeline - was largely built in response to painful production incidents at OpenAI, Google Brain, and Meta FAIR between 2018 and 2022.

Understanding why distributed training communication is hard to debug requires understanding the communication patterns themselves.

Historical Context

NCCL (pronounced "nickel") was first released by NVIDIA in 2016 as an open-source library providing optimized implementations of collective operations for multi-GPU systems. The 1.x series focused on single-node multi-GPU communication using NVLink. NCCL 2.0, released in 2017, added multi-node support over InfiniBand and Ethernet, making it the foundation for all large-scale distributed deep learning.

Before NCCL, researchers used MPI (Message Passing Interface) - the standard from high-performance computing. MPI had decades of debugging tooling, but it was not optimized for GPU memory and required copying tensors from GPU memory to CPU before communication. NCCL bypassed this entirely by operating directly on GPU memory using RDMA (Remote Direct Memory Access), achieving near-theoretical bandwidth at the cost of less mature debugging infrastructure.

PyTorch adopted NCCL as its default collective communication backend for DDP (Distributed Data Parallel) in 2018-2019. TensorFlow uses NCCL for its distribution strategies. The JAX/XLA ecosystem uses NCCL on GPU. Essentially every framework that runs distributed training on NVIDIA hardware relies on NCCL, which means NCCL debugging skills are universally applicable.

Core Concepts: Communication Patterns

Before debugging communication failures, you need a precise mental model of what distributed training is actually communicating. There are five fundamental collective operations, each with distinct failure modes.

AllReduce

AllReduce is the workhorse of data-parallel training. Every worker computes gradients on its local batch, then AllReduce sums (or averages) those gradients across all workers so every worker ends up with identical, globally-averaged gradients before the optimizer step.

$\text{AllReduce}: x_i^{\text{out}} = \sum_{j=0}^{N-1} x_j^{\text{in}} \quad \forall i \in \{0, \ldots, N-1\}$

The naive implementation - rank 0 collects everything, sums, broadcasts back - has $O(N)$ communication. NCCL uses a ring-based AllReduce algorithm where communication cost is $O(2(N-1)/N \cdot B)$ where $B$ is tensor size in bytes, nearly optimal for large tensors. For small tensors, NCCL switches to a recursive halving-doubling algorithm.

Failure modes: Ring initialization failures (when the ring topology cannot be established), tensor shape mismatches between ranks (hangs indefinitely), and timeout when one rank is slow to reach the operation.

AllGather

AllGather collects the full tensor from every rank onto every rank. If each rank has a tensor of size $M$ , the output on every rank has size $N \times M$ .

Used in: ZeRO optimizer stage 3 (collecting sharded parameters before forward pass), tensor parallelism (collecting activation shards).

Failure modes: Memory pressure (output tensor is $N \times$ input size - on 128 GPUs this can OOM), timeout on slow ranks.

ReduceScatter

ReduceScatter is the inverse of AllGather: reduce across all ranks, then scatter different portions of the result to different ranks. Combined with AllGather, it implements AllReduce with lower peak memory. This is the foundation of ZeRO optimizer.

Used in: ZeRO stages 2 and 3, Megatron-LM tensor parallelism.

Failure modes: Mismatched tensor sizes (each rank's scatter chunk must be equal), synchronization with AllGather in ZeRO causing double-hang.

Broadcast

One rank sends a tensor to all others. Used for: initial parameter synchronization at training start, broadcasting batch normalization statistics in some architectures.

Failure modes: Root rank (rank 0) slow to compute or load the tensor - all other ranks wait indefinitely.

Point-to-Point (P2P)

Direct send/receive between specific rank pairs. Used in pipeline parallelism (stage N sends activations to stage N+1). Unlike collectives, P2P operations involve only two ranks, making them easier to debug but harder to profile (each pair is a separate operation).

Failure modes: Deadlock when both ranks do isend before irecv, buffer size mismatch, rank pair not in same NCCL communicator.

NCCL Debugging Workflow

Step 1: Enable NCCL Debug Logging

The first thing you do when a distributed job hangs is re-run with verbose NCCL logging enabled. NCCL provides several environment variables:

# Basic: show NCCL warnings and errors
export NCCL_DEBUG=WARN

# Info: show initialization and operation start/end
export NCCL_DEBUG=INFO

# Verbose: show every chunk transfer (very noisy - only use with 2 GPUs)
export NCCL_DEBUG=TRACE

# Filter to specific subsystem
# Options: INIT, COLL, P2P, SHM, NET, GRAPH, TUNING, ENV, ALL
export NCCL_DEBUG_SUBSYS=INIT,NET

# Write logs to file instead of stderr (essential for multi-node)
export NCCL_DEBUG_FILE=/tmp/nccl-debug-rank%d.log

With NCCL_DEBUG=INFO, a successful AllReduce looks like this in the log:

NCCL INFO AllReduce: opCount 1e size 4194304 datatype 7 op 0 root 0 comm 0x7f8a4c000b80 [nranks=8] stream 0x7f8a4c001000
NCCL INFO AllReduce: opCount 1e DONE

A hanging AllReduce produces the first line and then silence. The critical field is comm - the communicator address - which tells you which communicator group is stuck. The opCount field increments with each collective operation, so if you see opCount 1e repeatedly across ranks with different values, you have a rank-ordering problem - ranks are entering different operations at different points.

Step 2: Identify the Stuck Operation

When a job hangs, the NCCL log will show the last operation that started but never completed. Look for asymmetry:

# Extract last NCCL operation from each rank's log
for i in $(seq 0 7); do
  echo "=== Rank $i ==="
  tail -20 /tmp/nccl-debug-rank${i}.log | grep "opCount"
done

If rank 0 shows opCount 1e and rank 7 shows opCount 1d, rank 7 is one operation behind - it never entered the collective that ranks 0-6 are waiting in. This is the most common cause of NCCL hangs: a conditional collective where one rank took a different code path.

Step 3: The NCCL Timeout

By default, PyTorch DDP sets a 30-minute timeout on collective operations. When exceeded, all ranks throw:

RuntimeError: [rank0]: Watchdog caught collective operation timeout
ncclRemoteError: Error while communicating with remote rank 5

The timeout is configurable:

import torch.distributed as dist
import datetime

dist.init_process_group(
    backend="nccl",
    timeout=datetime.timedelta(minutes=60)  # Increase for long operations
)

But increasing the timeout just delays the inevitable. The real fix requires identifying why rank 5 is not participating.

Step 4: TORCH_DISTRIBUTED_DEBUG

PyTorch adds its own layer of distributed debugging on top of NCCL:

# Show warnings about unused parameters, etc.
export TORCH_DISTRIBUTED_DEBUG=DETAIL

# C++ log level for torch.distributed internals
export TORCH_CPP_LOG_LEVEL=INFO

With TORCH_DISTRIBUTED_DEBUG=DETAIL, PyTorch will print a report when a timeout occurs showing exactly which parameters were involved in the stuck AllReduce. This is invaluable for identifying unused-parameter bugs in DDP:

DDP Logging Data:
  Rank: 0
  World Size: 8
  ...
  Param names (with gradients): ["layer1.weight", "layer1.bias", ...]
  Param names (without gradients / unused): ["aux_head.weight"]  # <-- the bug

If a parameter is in the model but never receives a gradient (because it was never used in the forward pass on some ranks), DDP will hang waiting for a gradient that will never arrive.

Code: Distributed Training Debug Wrapper

"""
debug_distributed.py - Wrapper for distributed training jobs that adds
systematic NCCL and network debugging. Run instead of your normal launcher
when investigating hangs or performance issues.

Usage:
  torchrun --nproc_per_node=8 debug_distributed.py --script your_train.py
"""

import os
import sys
import time
import socket
import argparse
import datetime
import logging
import threading
import subprocess
from pathlib import Path

import torch
import torch.distributed as dist


def configure_nccl_debug(level: str = "INFO", log_dir: str = "/tmp/nccl_logs"):
    """Configure NCCL debug environment variables before dist.init_process_group."""
    Path(log_dir).mkdir(parents=True, exist_ok=True)

    os.environ["NCCL_DEBUG"] = level
    os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,NET,GRAPH"
    os.environ["NCCL_DEBUG_FILE"] = str(Path(log_dir) / "nccl-rank%d.log")

    # Force NCCL to use a specific network interface (useful when multi-homed)
    # os.environ["NCCL_SOCKET_IFNAME"] = "ib0"  # InfiniBand interface

    # Disable NCCL shared memory transport (useful when /dev/shm is small)
    # os.environ["NCCL_SHM_DISABLE"] = "1"

    # Set P2P level - controls NVLink vs PCIe vs network preference
    # os.environ["NCCL_P2P_LEVEL"] = "NVL"  # NVLink only

    print(f"[debug] NCCL logs will be written to {log_dir}/nccl-rank*.log")


def configure_torch_debug():
    """Configure PyTorch distributed debug settings."""
    os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"
    os.environ["TORCH_CPP_LOG_LEVEL"] = "WARNING"

    # Enable anomaly detection for NaN/inf (expensive but catches gradient issues)
    torch.autograd.set_detect_anomaly(True)


def get_rank_info() -> dict:
    """Collect node and rank information for debugging."""
    rank = int(os.environ.get("RANK", 0))
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    world_size = int(os.environ.get("WORLD_SIZE", 1))
    hostname = socket.gethostname()

    # Get GPU info
    if torch.cuda.is_available():
        gpu_name = torch.cuda.get_device_name(local_rank)
        gpu_mem_total = torch.cuda.get_device_properties(local_rank).total_memory
        gpu_mem_gb = gpu_mem_total / (1024 ** 3)
    else:
        gpu_name = "CPU"
        gpu_mem_gb = 0

    return {
        "rank": rank,
        "local_rank": local_rank,
        "world_size": world_size,
        "hostname": hostname,
        "gpu_name": gpu_name,
        "gpu_mem_gb": gpu_mem_gb,
    }


def barrier_with_timeout(timeout_seconds: int = 300) -> bool:
    """
    Attempt a barrier with a shorter timeout to detect straggler nodes.
    Returns True if barrier succeeded, False if timed out.
    """
    try:
        dist.barrier(timeout=datetime.timedelta(seconds=timeout_seconds))
        return True
    except RuntimeError as e:
        rank = dist.get_rank()
        logging.error(f"[rank{rank}] Barrier timed out after {timeout_seconds}s: {e}")
        return False


def test_allreduce_bandwidth(tensor_sizes_mb: list = None) -> dict:
    """
    Benchmark AllReduce bandwidth at multiple tensor sizes.
    Run this after dist.init_process_group to measure actual training communication.
    """
    if tensor_sizes_mb is None:
        tensor_sizes_mb = [1, 10, 100, 500]

    rank = dist.get_rank()
    results = {}

    for size_mb in tensor_sizes_mb:
        n_elements = (size_mb * 1024 * 1024) // 4  # float32 = 4 bytes
        tensor = torch.randn(n_elements, device=f"cuda:{dist.get_rank() % torch.cuda.device_count()}")

        # Warmup
        for _ in range(3):
            dist.all_reduce(tensor)
        torch.cuda.synchronize()

        # Benchmark
        n_iters = 10
        start = time.perf_counter()
        for _ in range(n_iters):
            dist.all_reduce(tensor)
        torch.cuda.synchronize()
        elapsed = time.perf_counter() - start

        # AllReduce bandwidth formula: 2*(N-1)/N * size / time
        # (ring algorithm sends each byte twice)
        world_size = dist.get_world_size()
        algo_bw_gbps = (
            2 * (world_size - 1) / world_size * size_mb / 1024
        ) / (elapsed / n_iters)

        results[f"{size_mb}MB"] = {
            "algo_bandwidth_gbps": round(algo_bw_gbps, 2),
            "avg_time_ms": round(elapsed / n_iters * 1000, 2),
        }

        if rank == 0:
            print(f"  AllReduce {size_mb}MB: {algo_bw_gbps:.2f} GB/s, {elapsed/n_iters*1000:.2f}ms avg")

    return results


def detect_straggler_ranks(n_rounds: int = 5) -> list:
    """
    Measure per-rank timing variance to identify slow/straggler ranks.
    Returns list of ranks that are consistently slower than the median.
    """
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    device = f"cuda:{rank % torch.cuda.device_count()}"

    timings = []
    for _ in range(n_rounds):
        start = time.perf_counter()
        dist.barrier()
        elapsed = time.perf_counter() - start
        timings.append(elapsed)

    # Gather all timings to rank 0
    timing_tensor = torch.tensor(timings, device=device)
    all_timings = [torch.zeros(n_rounds, device=device) for _ in range(world_size)]
    dist.all_gather(all_timings, timing_tensor)

    stragglers = []
    if rank == 0:
        all_timings_np = torch.stack(all_timings).cpu().numpy()
        mean_times = all_timings_np.mean(axis=1)
        overall_median = float(torch.tensor(mean_times).median())

        print("\n[straggler detection]")
        for r, t in enumerate(mean_times):
            ratio = t / overall_median
            status = "SLOW" if ratio > 1.5 else "ok"
            print(f"  rank {r}: avg barrier time {t*1000:.2f}ms ({ratio:.2f}x median) [{status}]")
            if ratio > 1.5:
                stragglers.append(r)

        if stragglers:
            print(f"  WARNING: Straggler ranks detected: {stragglers}")
        else:
            print("  No stragglers detected")

    return stragglers


def log_nccl_timeout_detection():
    """
    Install a watchdog thread that logs rank state every 60 seconds.
    Useful for diagnosing which operation caused a hang.
    """
    rank = int(os.environ.get("RANK", 0))
    log_path = f"/tmp/nccl_watchdog_rank{rank}.log"

    def watchdog():
        with open(log_path, "w") as f:
            while True:
                timestamp = datetime.datetime.now().isoformat()
                gpu_util = "N/A"
                if torch.cuda.is_available():
                    # This will be stuck if CUDA is in a NCCL wait
                    # Use nvidia-smi for non-blocking GPU utilization
                    try:
                        result = subprocess.run(
                            ["nvidia-smi", "--query-gpu=utilization.gpu,memory.used",
                             "--format=csv,noheader,nounits",
                             f"--id={rank % torch.cuda.device_count()}"],
                            capture_output=True, text=True, timeout=5
                        )
                        gpu_util = result.stdout.strip()
                    except Exception:
                        pass

                f.write(f"{timestamp} rank={rank} gpu_util={gpu_util}\n")
                f.flush()
                time.sleep(60)

    t = threading.Thread(target=watchdog, daemon=True)
    t.start()
    return log_path


def main():
    parser = argparse.ArgumentParser(description="Debug wrapper for distributed training")
    parser.add_argument("--nccl-debug-level", default="INFO",
                        choices=["WARN", "INFO", "TRACE"])
    parser.add_argument("--test-bandwidth", action="store_true",
                        help="Run AllReduce bandwidth test before training")
    parser.add_argument("--detect-stragglers", action="store_true",
                        help="Run straggler detection before training")
    args, remaining = parser.parse_known_args()

    # Configure debug environments BEFORE init_process_group
    configure_nccl_debug(level=args.nccl_debug_level)
    configure_torch_debug()

    # Initialize distributed
    dist.init_process_group(
        backend="nccl",
        timeout=datetime.timedelta(minutes=30)
    )

    rank = dist.get_rank()
    info = get_rank_info()

    # All ranks log their identity
    print(f"[rank{rank}] hostname={info['hostname']} "
          f"gpu={info['gpu_name']} mem={info['gpu_mem_gb']:.1f}GB")

    # Ensure all ranks initialized before proceeding
    if not barrier_with_timeout(timeout_seconds=120):
        print(f"[rank{rank}] FATAL: Initial barrier failed - not all ranks initialized")
        dist.destroy_process_group()
        sys.exit(1)

    if rank == 0:
        print(f"\n[debug] All {info['world_size']} ranks initialized successfully")

    # Optional pre-flight checks
    if args.test_bandwidth:
        if rank == 0:
            print("\n[bandwidth test]")
        test_allreduce_bandwidth()

    if args.detect_stragglers:
        detect_straggler_ranks()

    # Start watchdog
    watchdog_log = log_nccl_timeout_detection()
    if rank == 0:
        print(f"[debug] Watchdog logs: /tmp/nccl_watchdog_rank*.log")

    if rank == 0:
        print("\n[debug] Pre-flight checks passed. Starting training...\n")


if __name__ == "__main__":
    main()

Network Bandwidth Testing

Before debugging NCCL, verify the underlying network is healthy. A misconfigured MTU or a flapping InfiniBand port will cause NCCL failures that look like software bugs.

iperf3 for TCP/Ethernet Bandwidth

# On node 1 (server)
iperf3 -s -p 5201

# On node 2 (client) - test TCP bandwidth
iperf3 -c node1 -p 5201 -t 30 -P 4  # 4 parallel streams, 30 seconds

# Expected output on 100Gbps network:
# [SUM] 0.00-30.00 sec  330 GBytes  94.3 Gbits/sec  # ~94% of line rate

# Test with jumbo frames (MTU 9000) - must be enabled on both ends
iperf3 -c node1 -p 5201 -t 30 -P 4 --set-mss 8960

# UDP test for packet loss measurement
iperf3 -c node1 -p 5201 -u -b 10G -t 30

nccl-tests for GPU-to-GPU Bandwidth

nccl-tests is NVIDIA's official tool for benchmarking NCCL collective performance:

# Build nccl-tests (usually pre-installed in ML Docker images)
git clone https://github.com/NCCL/nccl-tests.git
cd nccl-tests && make MPI=1 MPI_HOME=/usr/local/mpi NCCL_HOME=/usr/local/cuda

# AllReduce bandwidth test across all GPUs on 2 nodes
mpirun --hostfile hostfile -n 16 \
  ./build/all_reduce_perf \
  -b 1M -e 1G -f 2 \  # test sizes from 1MB to 1GB, doubling each step
  -g 1                  # 1 GPU per MPI rank

# Expected output (A100s on InfiniBand HDR 200Gbps):
# Size(B)   Algo BW(GB/s)   Bus BW(GB/s)
# 1048576   12.5             22.0
# 134217728 85.3             149.3
# 1073741824 88.1            154.2

# AllGather test
mpirun -n 16 ./build/all_gather_perf -b 1M -e 1G -f 2 -g 1

# ReduceScatter test
mpirun -n 16 ./build/reduce_scatter_perf -b 1M -e 1G -f 2 -g 1

# nccl_bandwidth_automation.py
# Automates nccl-tests across different GPU counts and reports results

import subprocess
import re
import json
from pathlib import Path


def run_nccl_test(
    test_binary: str,
    hostfile: str,
    n_ranks: int,
    min_size: str = "1M",
    max_size: str = "1G",
) -> dict:
    """Run an NCCL test and parse the output into structured results."""

    cmd = [
        "mpirun",
        "--hostfile", hostfile,
        "-n", str(n_ranks),
        test_binary,
        "-b", min_size,
        "-e", max_size,
        "-f", "2",
        "-g", "1",
        "-c", "1",  # check correctness
    ]

    result = subprocess.run(cmd, capture_output=True, text=True, timeout=300)

    if result.returncode != 0:
        return {"error": result.stderr, "command": " ".join(cmd)}

    # Parse NCCL test output
    # Format: size count type redop root time algbw busbw #wrong latency
    lines = result.stdout.split("\n")
    data_lines = [l for l in lines if re.match(r"^\s+\d+", l)]

    results = []
    for line in data_lines:
        parts = line.split()
        if len(parts) >= 8:
            results.append({
                "size_bytes": int(parts[0]),
                "size_human": _bytes_to_human(int(parts[0])),
                "avg_time_us": float(parts[5]),
                "algo_bw_gbps": float(parts[6]),
                "bus_bw_gbps": float(parts[7]),
                "errors": int(parts[8]) if len(parts) > 8 else 0,
            })

    return {
        "n_ranks": n_ranks,
        "test": Path(test_binary).stem,
        "results": results,
        "peak_algo_bw_gbps": max(r["algo_bw_gbps"] for r in results) if results else 0,
        "peak_bus_bw_gbps": max(r["bus_bw_gbps"] for r in results) if results else 0,
    }


def _bytes_to_human(n: int) -> str:
    for unit in ["B", "KB", "MB", "GB"]:
        if n < 1024:
            return f"{n:.1f}{unit}"
        n //= 1024
    return f"{n:.1f}TB"


def run_full_bandwidth_suite(hostfile: str, nccl_tests_dir: str) -> dict:
    """Run the complete NCCL bandwidth test suite and report results."""
    tests = {
        "allreduce": f"{nccl_tests_dir}/all_reduce_perf",
        "allgather": f"{nccl_tests_dir}/all_gather_perf",
        "reducescatter": f"{nccl_tests_dir}/reduce_scatter_perf",
        "broadcast": f"{nccl_tests_dir}/broadcast_perf",
    }

    # Count GPUs from hostfile
    with open(hostfile) as f:
        n_nodes = sum(1 for line in f if line.strip())

    # Assuming 8 GPUs per node (standard for A100/H100 nodes)
    n_ranks = n_nodes * 8

    all_results = {}
    for test_name, binary in tests.items():
        print(f"Running {test_name} with {n_ranks} ranks...")
        result = run_nccl_test(binary, hostfile, n_ranks)
        all_results[test_name] = result
        if "error" not in result:
            print(f"  Peak: {result['peak_algo_bw_gbps']:.1f} GB/s algo, "
                  f"{result['peak_bus_bw_gbps']:.1f} GB/s bus")
        else:
            print(f"  FAILED: {result['error'][:100]}")

    return all_results


if __name__ == "__main__":
    results = run_full_bandwidth_suite(
        hostfile="/etc/training/hostfile",
        nccl_tests_dir="/opt/nccl-tests/build"
    )
    print(json.dumps(results, indent=2))

RDMA and InfiniBand Diagnostics

When training runs over InfiniBand, a set of specialized tools can diagnose fabric-level problems that are invisible to iperf3.

# Show InfiniBand port status for all HCAs (Host Channel Adapters)
ibstat
# Look for: State: Active, Physical state: Polling means link is down

# Show detailed port info including speed and error counters
ibstatus

# Test InfiniBand connectivity between two nodes (like ping over IB)
ibping -G 0x0002c903000000a0  # Use GUID from ibstat output
# Expected: min/max/avg in microseconds

# Measure raw RDMA bandwidth
# On receiver:
ib_write_bw -d mlx5_0 -i 1

# On sender:
ib_write_bw -d mlx5_0 -i 1 <receiver_hostname>
# Expected: ~200 Gb/s on HDR InfiniBand

# Check for InfiniBand errors (flapping, CRC errors, etc.)
perfquery  # or: perfquery -x (extended counters)

# Monitor error counters in real time
watch -n 2 'perfquery | grep -E "Errors|Discards"'

# Check fabric topology
ibtracert lid1 lid2  # trace path between two LIDs

# Verify MTU settings on IB ports (should be 4096 for best performance)
ibv_devinfo | grep -A5 "mlx5_0"

Common InfiniBand Error Signatures

Error	Meaning	Action
`SymbolErrors` increasing	Physical layer signal issues	Check cables, replace SFP
`PortRcvErrors` increasing	Received packets with errors	Check far-end port
`PortXmitDiscards` increasing	Transmit queue overflow	Network congestion, check switch
`LinkDowned` > 0	Link went down and came back	Check cable integrity
`ExcessiveBufferOverrunErrors`	Credit-based flow control failure	MTU mismatch or firmware issue

torch.profiler for Communication Timeline Analysis

Identifying whether your training job is compute-bound or communication-bound requires a profiling timeline that shows both simultaneously.

"""
profile_distributed.py - Profile distributed training with torch.profiler
to visualize compute-communication overlap and identify bottlenecks.
"""

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch import profiler


class SimpleMLP(nn.Module):
    def __init__(self, hidden_size: int = 4096):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(hidden_size, hidden_size * 4),
            nn.ReLU(),
            nn.Linear(hidden_size * 4, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, hidden_size // 4),
        )

    def forward(self, x):
        return self.layers(x)


def profile_training_step(rank: int, world_size: int, output_dir: str = "/tmp/traces"):
    """
    Run a few training steps under torch.profiler and save Chrome trace.
    Open in Chrome at chrome://tracing or in Perfetto UI.
    """
    device = torch.device(f"cuda:{rank}")
    torch.cuda.set_device(device)

    model = SimpleMLP(hidden_size=4096).to(device)
    model = DDP(model, device_ids=[rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    # Profile with CUDA activity + communication ops
    with profiler.profile(
        activities=[
            profiler.ProfilerActivity.CPU,
            profiler.ProfilerActivity.CUDA,
        ],
        schedule=profiler.schedule(
            wait=2,      # Skip first 2 steps (warmup)
            warmup=1,    # Profile during warmup
            active=5,    # Capture 5 steps
            repeat=1,
        ),
        on_trace_ready=profiler.tensorboard_trace_handler(output_dir),
        record_shapes=True,
        with_stack=True,
        profile_memory=True,
    ) as prof:
        for step in range(10):
            batch = torch.randn(64, 4096, device=device)
            labels = torch.randint(0, 1024, (64,), device=device)

            optimizer.zero_grad()
            with profiler.record_function("forward"):
                out = model(batch)
                loss = nn.functional.cross_entropy(out, labels)

            with profiler.record_function("backward"):
                loss.backward()

            with profiler.record_function("optimizer_step"):
                optimizer.step()

            prof.step()

    # Print key stats
    if rank == 0:
        print(prof.key_averages().table(
            sort_by="cuda_time_total",
            row_limit=20
        ))

        # Look for NCCL operations in the trace
        nccl_events = [
            e for e in prof.key_averages()
            if "nccl" in e.key.lower() or "allreduce" in e.key.lower()
        ]
        print("\n=== NCCL Communication Events ===")
        for event in nccl_events:
            print(f"  {event.key}: {event.cuda_time_total / 1000:.2f}ms total, "
                  f"{event.count} calls, "
                  f"{event.cuda_time_total / event.count / 1000:.2f}ms avg")


def analyze_compute_communication_ratio(trace_dir: str) -> dict:
    """
    Post-process a saved profiler trace to compute the compute:communication ratio.
    A healthy training job should be >80% compute.
    """
    import json
    import glob

    trace_files = glob.glob(f"{trace_dir}/**/*.json", recursive=True)
    if not trace_files:
        return {}

    with open(trace_files[0]) as f:
        trace = json.load(f)

    events = trace.get("traceEvents", [])
    compute_us = 0
    comm_us = 0

    for event in events:
        if event.get("ph") != "X":  # Duration event
            continue
        name = event.get("name", "").lower()
        dur = event.get("dur", 0)

        if any(kw in name for kw in ["nccl", "allreduce", "allgather", "broadcast"]):
            comm_us += dur
        elif any(kw in name for kw in ["volta_gemm", "ampere_gemm", "cudnn", "cutlass"]):
            compute_us += dur

    total = compute_us + comm_us
    if total == 0:
        return {}

    return {
        "compute_ms": round(compute_us / 1000, 2),
        "communication_ms": round(comm_us / 1000, 2),
        "compute_fraction": round(compute_us / total, 3),
        "communication_fraction": round(comm_us / total, 3),
        "is_compute_bound": compute_us > comm_us,
    }

Communication vs Compute Overlap

The most important performance characteristic of a distributed training system is how well it overlaps communication with computation. When a GPU is sending gradient chunks via AllReduce while simultaneously computing the next layer's gradients, it is doing useful work during what would otherwise be idle time.

PyTorch DDP enables this automatically via "bucketing" - it groups small gradients into buckets and starts the AllReduce for a bucket as soon as all parameters in that bucket have gradients, before the full backward pass completes.

# Configure DDP bucketing for better overlap
model = DDP(
    model,
    device_ids=[local_rank],
    bucket_cap_mb=25,        # Default 25MB - tune for your model/network
    find_unused_parameters=False,  # Disable if all params used (faster)
    gradient_as_bucket_view=True,  # Reduce memory by 1 gradient copy
)

# Check actual bucket assignments
if rank == 0:
    for i, bucket in enumerate(model.parameters_to_ignore_for_sync):
        print(f"Bucket {i}: {bucket}")

The dashed "overlap" arrows indicate that communication starts while computation is still in progress. On a well-tuned system, the AllReduce for bucket 1 completes by the time the backward pass finishes bucket 3, making communication "free" from the perspective of end-to-end step time.

tcpdump and Wireshark for Training Traffic

When NCCL fails with a network error (as opposed to a software bug), capturing raw packets can identify the failure mode:

# Capture NCCL traffic on the InfiniBand interface
# Note: NCCL over RDMA bypasses the kernel network stack,
# so you need to capture on the Ethernet/RoCE interface instead
sudo tcpdump -i eth0 -w /tmp/nccl_capture.pcap \
  "port 50000 or port 50001"  # NCCL default port range

# Limit capture to 100MB
sudo tcpdump -i eth0 -w /tmp/nccl_capture.pcap \
  -C 100 -W 5 \  # 5 files of 100MB each
  "host 10.0.0.2"  # Filter to specific remote node

# Analyze in Wireshark (look for):
# - Retransmissions (red rows in Wireshark IO graph)
# - Zero window size (receiver buffer full - congestion)
# - RST packets (abrupt connection close)
# - Duplicate ACKs (packet loss indicator)

# Automated retransmission rate check
tcpdump -r /tmp/nccl_capture.pcap -q 2>/dev/null | \
  awk '/TCP/{total++} /retransmit/{retrans++} END {
    print "Total:", total, "Retransmits:", retrans,
    "Rate:", (retrans/total*100) "%"
  }'

Production Engineering Notes

MTU Mismatch is the Silent Killer

The single most common cause of "NCCL works on 2 nodes but fails on 8 nodes" is MTU mismatch. Every hop between your training nodes - NICs, switches, routers - must have the same MTU. If one switch is set to 1500 and your NICs are set to 9000 (jumbo frames), packets get fragmented or silently dropped.

# Check MTU on all interfaces
ip link show | grep mtu

# Set MTU on all training interfaces (must be done on all nodes)
sudo ip link set eth0 mtu 9000

# Verify MTU end-to-end with ping (DF bit = don't fragment)
ping -M do -s 8972 node2  # 8972 + 28 header = 9000
# If this fails but -s 1472 succeeds, you have an MTU mismatch somewhere

NCCL Environment Variable Quick Reference

Variable	Values	Purpose
`NCCL_DEBUG`	WARN/INFO/TRACE	Log verbosity
`NCCL_DEBUG_SUBSYS`	INIT,NET,COLL,P2P,ALL	Subsystem filter
`NCCL_SOCKET_IFNAME`	eth0,ib0,^lo	Force network interface
`NCCL_IB_DISABLE`	0/1	Disable InfiniBand
`NCCL_P2P_LEVEL`	LOC,PIX,PXB,PHB,SYS,NVL	P2P transport level
`NCCL_ALGO`	Ring,Tree,CollNet	Force collective algorithm
`NCCL_PROTO`	Simple,LL,LL128	Protocol (LL=low latency)
`NCCL_TIMEOUT`	seconds	Operation timeout
`NCCL_BUFFSIZE`	bytes	Communication buffer
`NCCL_TOPO_DUMP_FILE`	path	Dump detected topology

Systematic Debugging Flowchart

:::danger NCCL Ring Init Failure A NCCL error: Unhandled system error during ring initialization usually means NCCL cannot establish connections between ranks. Check:

Firewall rules between nodes (NCCL uses TCP/IP for initial handshake, then switches to RDMA)
NCCL_SOCKET_IFNAME - NCCL may be trying to use the wrong interface
/etc/hosts or DNS resolution for hostnames used in the process group

This is distinct from a timeout - ring init failure happens before any collective operation starts. :::

:::warning Conditional Collectives Cause Silent Hangs If any collective operation (AllReduce, AllGather, Broadcast) is inside an if rank == 0: block or inside a conditional that evaluates differently on different ranks, you will get a silent hang - not an error. PyTorch DDP cannot detect this condition. Always ensure every collective is called unconditionally on every rank.

# WRONG - rank 0 does AllReduce, others skip it
if rank == 0:
    dist.all_reduce(tensor)

# CORRECT - all ranks participate
dist.all_reduce(tensor)  # result only used on rank 0

:::

:::warning find_unused_parameters=True is Expensive DDP(model, find_unused_parameters=True) enables a traversal of the computation graph after each backward pass to identify parameters that did not receive gradients. This adds ~5-10% overhead per step. Only use it if your model actually has conditionally-used parameters. For static architectures, always set it to False. :::

Interview Q&A

Q1: A distributed training job with 32 GPUs hangs after step 1000 with no Python exception. Walk me through how you would diagnose this.

Start by checking if all 32 processes are still alive with ps aux | grep python. If alive, they are waiting - not crashed. Enable NCCL_DEBUG=INFO and NCCL_DEBUG_FILE=/tmp/nccl-rank%d.log, then reproduce the hang. Grep the log files for the last opCount value on each rank - if they differ, the ranks are out of sync (different collectives entered). Check for conditional operations in the code. If all ranks show the same opCount but no "DONE", the hang is in the collective itself - run nccl-tests to verify the network is healthy. Check ibstat for InfiniBand port errors. Finally, check TORCH_DISTRIBUTED_DEBUG=DETAIL output for unused parameters.

Q2: What is the difference between algorithmic bandwidth and bus bandwidth in nccl-tests output, and which one should you compare against your network spec?

Algorithmic bandwidth (algbw) measures the effective data rate from the application's perspective - how fast the collective delivers the result. For AllReduce, this is 2 * size / latency (factor 2 because data is both reduced and broadcast). Bus bandwidth (busbw) accounts for the actual bytes on the wire, including the multiple passes the ring algorithm requires. For AllReduce with $N$ ranks, busbw = algbw * 2(N-1)/N. Compare busbw against your network spec (e.g., 200Gbps InfiniBand HDR) because it represents actual network utilization.

Q3: Explain why ZeRO optimizer uses ReduceScatter instead of AllReduce, and what communication patterns it requires.

Standard AllReduce produces identical full gradients on all ranks - then all $N$ optimizer copies make identical updates. ZeRO stage 2 recognizes this is wasteful: if optimizer states are sharded, each rank only needs its gradient shard. ReduceScatter reduces gradients across all ranks but distributes the result - rank $i$ gets shard $i$ . Each rank then performs an optimizer step on only its shard. ZeRO stage 3 extends this to parameters: before the forward pass, an AllGather collects the full parameter; after the backward pass, a ReduceScatter distributes the gradient shard; after the optimizer step, the updated shard is available for the next AllGather. The communication volume is the same as AllReduce, but peak memory is $1/N$ of full optimizer state.

Q4: What is communication-computation overlap in DDP, and how does bucket_cap_mb affect it?

DDP overlaps gradient synchronization with the backward pass by starting AllReduce on completed gradient buckets before the full backward pass finishes. As each layer computes its gradients, if enough layers have completed to fill a bucket (default 25MB), that bucket's AllReduce starts immediately. The backward pass continues computing later layers' gradients while the AllReduce runs on the network. A larger bucket_cap_mb means fewer AllReduce operations (lower latency overhead) but longer delay before the first bucket starts transmitting (less overlap). The optimal value depends on the backward pass time vs. network bandwidth ratio. For large models on fast InfiniBand, larger buckets (100MB+) are often better.

Q5: When would you use NCCL P2P operations instead of collectives, and what are the debugging challenges specific to P2P?

P2P send/receive is used in pipeline parallelism - stage N's forward activation must be sent to stage N+1 before stage N+1 can compute. Unlike collectives, P2P operations involve exactly two ranks, so they cannot benefit from ring or tree algorithms. The primary debugging challenge is deadlock: if stage N does isend and stage N+1 also does isend before either does irecv, both block waiting for the other to receive. The pattern that prevents this is alternating: even-numbered stages send first, odd-numbered stages receive first. Debugging P2P hangs requires checking whether the send and receive are correctly paired using NCCL_DEBUG=INFO with NCCL_DEBUG_SUBSYS=P2P.

Q6: What does TORCH_DISTRIBUTED_DEBUG=DETAIL output when a DDP job times out, and how do you use it?

When a DDP collective times out, PyTorch prints a detailed report per rank showing: the world size, local rank, the names of all parameters that have gradients, and critically, the names of parameters that do NOT have gradients (unused parameters). If you see a parameter in the "unused" list, it means that parameter was created in __init__ but never used in forward() on this rank - or it was conditionally used and this rank took the skip path. The fix is either to remove the parameter if it is truly unused, set find_unused_parameters=True to handle it (with performance cost), or restructure the code so the collective is always called on all ranks. This output is only generated after the timeout expires, so you must wait for the full timeout duration to see it.

Why This Exists​

Historical Context​

Core Concepts: Communication Patterns​

AllReduce​

AllGather​

ReduceScatter​

Broadcast​

Point-to-Point (P2P)​

NCCL Debugging Workflow​

Step 1: Enable NCCL Debug Logging​

Step 2: Identify the Stuck Operation​

Step 3: The NCCL Timeout​

Step 4: TORCH_DISTRIBUTED_DEBUG​

Code: Distributed Training Debug Wrapper​

Network Bandwidth Testing​

iperf3 for TCP/Ethernet Bandwidth​

nccl-tests for GPU-to-GPU Bandwidth​

RDMA and InfiniBand Diagnostics​

Common InfiniBand Error Signatures​

torch.profiler for Communication Timeline Analysis​

Communication vs Compute Overlap​

tcpdump and Wireshark for Training Traffic​

Production Engineering Notes​

MTU Mismatch is the Silent Killer​

NCCL Environment Variable Quick Reference​

Systematic Debugging Flowchart​

Interview Q&A​