
PCIe and NVLink Interconnects

Reading time: ~40 min - Interview relevance: High - Target roles: ML Infrastructure Engineer, Systems Engineer, Distributed Training

PCIe 5.0 gives you 64 GB/s between CPU and GPU. NVLink 4.0 gives you 900 GB/s between GPUs. For any all-reduce operation in multi-GPU training, you are using one of these two paths - and choosing which one determines whether your gradient synchronization takes 5 milliseconds or 50.


The Day 8 GPUs Were Slower Than 2

The distributed training team at a research institute had just expanded from a 2-GPU training rig to an 8-GPU server. The expectation: roughly 4x throughput. The reality: 1.6x throughput, and the utilization charts showed each GPU sitting at 23% during what should have been pure forward/backward compute time.

The bottleneck was the all-reduce. During gradient synchronization between the 8 GPUs, every gradient tensor had to travel over the system interconnect. The team had provisioned a server where not all 8 GPUs were directly connected to each other via NVLink. Four GPUs were on one PCIe switch, four on another. The two PCIe switches connected through the CPU's PCIe root complex. Every cross-group gradient transfer was routed through CPU memory - 32 GB/s of PCIe 3.0 x16 bandwidth, shared among all 8 GPUs competing to synchronize.

The fix was simple once diagnosed: move the training workload to a DGX server with NVSwitch, which provides full all-to-all NVLink connectivity between all 8 GPUs. All-reduce bandwidth went from ~30 GB/s effective to ~600 GB/s. Training throughput went from 1.6x to 6.8x vs single GPU.

The root cause was a misunderstanding of interconnect topology. The specs on the GPU card looked fine - NVIDIA A100, 2 TB/s HBM bandwidth. But the path between GPUs was not HBM; it was PCIe. And PCIe bandwidth is two orders of magnitude lower than HBM bandwidth. Once data moves off the GPU chip, it is at the mercy of whatever interconnect exists between chips. Understanding that interconnect - its bandwidth, its latency, its topology - is as important as understanding the GPU itself.


Why This Exists - The Fundamental Problem of Moving Data Between Chips

Modern neural network training requires coordination between multiple processors. At minimum, parameters must flow from CPU host memory to GPU device memory at the start of training. During multi-GPU training, gradients must be synchronized across GPUs after every backward pass. In large distributed training runs, gradients and activations must cross network links between servers.

Every one of these transfers is bounded by the bandwidth of the interconnect it travels over. And in every case, that interconnect bandwidth is dramatically lower than the on-chip bandwidth (HBM, SRAM) of the GPU itself.

The problem has existed as long as multi-chip computing has existed. Early multi-CPU servers in the 1990s used the front-side bus (FSB) as their interconnect - a shared bus that all CPUs competed for. The FSB became a famous bottleneck as CPU count increased. The industry response was point-to-point links: Intel QPI, AMD HyperTransport, and eventually the PCIe ecosystem.

For GPUs, the problem arrived later but hit harder. A single A100 GPU has 2 TB/s of internal HBM bandwidth. PCIe 4.0 x16 provides 32 GB/s - 62x lower. Even NVLink 3.0 at 600 GB/s is 3.3x lower than HBM. This means that any operation requiring data to cross chip boundaries is entering a bandwidth desert relative to what the GPU can do internally.

The entire history of GPU interconnect development - from PCIe 1.0's 4 GB/s through NVLink 4.0's 900 GB/s - is the story of the industry chasing the bandwidth demands of increasingly powerful processors.


Historical Context - From Shared Bus to NVSwitch Mesh

PCIe: The Industry Standard (2003 - Present)

PCI Express was introduced in 2003 as a replacement for the shared PCI bus. Instead of a shared medium where all devices competed for the same bandwidth, PCIe uses point-to-point serial links where each device has its own dedicated connection to the root complex (typically the CPU).

Each PCIe link consists of one or more "lanes." A lane is a differential serial pair - one pair for transmit, one for receive. A full x16 slot (16 lanes) is the standard for graphics cards and high-performance compute cards.

PCIe bandwidth has roughly doubled with each generation:

| Generation | Release | Bandwidth (x16 slot, each direction) | Total (bidirectional) |
|---|---|---|---|
| PCIe 1.0 | 2003 | 4 GB/s | 8 GB/s |
| PCIe 2.0 | 2007 | 8 GB/s | 16 GB/s |
| PCIe 3.0 | 2010 | 16 GB/s | 32 GB/s |
| PCIe 4.0 | 2017 | 32 GB/s | 64 GB/s |
| PCIe 5.0 | 2019 | 64 GB/s | 128 GB/s |
| PCIe 6.0 | 2022 | 128 GB/s | 256 GB/s |

Most production ML servers today use PCIe 4.0 or 5.0. The H100 NVL board connects to the CPU via PCIe 5.0 x16.
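
The x16 figures in this table follow directly from each generation's per-lane transfer rate and line encoding. Below is a small sketch of that arithmetic (PCIe 6.0 is omitted because its PAM4 signaling and FLIT encoding do not fit this simple formula):

# Where the x16 numbers come from: per-lane rate x lanes x encoding efficiency.
# Rates and encodings below are from the PCIe specifications.
GENERATIONS = {
    # name: (GT/s per lane, encoding efficiency)
    "PCIe 1.0": (2.5, 8 / 10),       # 8b/10b encoding
    "PCIe 2.0": (5.0, 8 / 10),
    "PCIe 3.0": (8.0, 128 / 130),    # 128b/130b encoding
    "PCIe 4.0": (16.0, 128 / 130),
    "PCIe 5.0": (32.0, 128 / 130),
}

def x16_bandwidth_gb_s(gt_per_s, efficiency, lanes=16):
    """Per-direction payload bandwidth in GB/s for a PCIe link."""
    bits_per_second = gt_per_s * 1e9 * efficiency * lanes
    return bits_per_second / 8 / 1e9    # bits -> bytes -> GB

for name, (rate, eff) in GENERATIONS.items():
    print(f"{name}: ~{x16_bandwidth_gb_s(rate, eff):.1f} GB/s per direction (x16)")
# PCIe 4.0 works out to ~31.5 GB/s, which the table rounds to 32 GB/s.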

PCIe was designed as a general-purpose interconnect for attaching devices - GPUs, SSDs, network cards, FPGAs - to a CPU. It was never designed for GPU-to-GPU communication. But for many years, it was the only option.

NVLink: NVIDIA's GPU-to-GPU Interconnect (2016 - Present)

In 2016, NVIDIA introduced NVLink with the Pascal P100 GPU. NVLink is a proprietary, high-bandwidth, point-to-point interconnect designed specifically for GPU-to-GPU and GPU-to-CPU communication. Its bandwidth dwarfs PCIe:

| NVLink Generation | GPU | Links per GPU | Bandwidth per Link (bidirectional) | Total per GPU (bidirectional) |
|---|---|---|---|---|
| NVLink 1.0 | P100 | 4 | 40 GB/s | 160 GB/s |
| NVLink 2.0 | V100 | 6 | 50 GB/s | 300 GB/s |
| NVLink 3.0 | A100 | 12 | 50 GB/s | 600 GB/s |
| NVLink 4.0 | H100 | 18 | 50 GB/s | 900 GB/s |

NVLink 4.0 on H100 provides 900 GB/s of total bidirectional bandwidth between directly connected GPUs - roughly 7x the 128 GB/s bidirectional of PCIe 5.0 x16.

NVSwitch: From Pairs to Full Mesh (2018 - Present)

NVLink 1.0 and 2.0 allowed direct GPU-to-GPU connections in pairs. The DGX-1 used NVLink to create a "hybrid cube mesh" topology among its 8 GPUs - some GPU pairs were directly connected, others required hops. This limited all-reduce performance because collective operations had to route around the partial connectivity.

In 2018, NVIDIA introduced NVSwitch with the DGX-2. NVSwitch is a dedicated switching chip that allows all-to-all NVLink connectivity - every GPU can communicate with every other GPU at full NVLink bandwidth simultaneously. In DGX H100, four third-generation NVSwitch chips provide 900 GB/s all-to-all bandwidth between all 8 H100 GPUs.

The impact on distributed training was enormous. Ring all-reduce - the standard algorithm - requires each GPU to communicate with two neighbors. With NVSwitch, all communication happens at full NVLink speed. Without it, some GPU pairs must route through intermediate GPUs or through PCIe, creating bottlenecks proportional to the worst link in the ring.


Core Concepts - Intuition First

PCIe Topology in a Server

In a typical server, PCIe is organized as a tree. At the root is the CPU's PCIe root complex. The CPU's PCIe controller provides a certain number of "lanes" - physical PCIe signal pairs. These lanes are divided among PCIe slots.

A dual-socket server has two CPUs. Each CPU has its own PCIe lanes. GPUs connected to CPU 0 must communicate through the CPU 0 - CPU 1 interconnect (Intel UPI or AMD Infinity Fabric) to reach GPUs connected to CPU 1. This adds latency and reduces effective bandwidth.

The PCIe tree structure matters enormously for GPU communication:

                 CPU 0
                /     \
        PCIe SW 0     PCIe SW 1
         /     \       /     \
      GPU0    GPU1  GPU2    GPU3

  • GPU0 to GPU1: one PCIe switch hop - full PCIe bandwidth
  • GPU0 to GPU2: two switch hops plus the CPU root complex - roughly half of PCIe bandwidth (shared)
  • GPU0 to GPU3: same as GPU0 to GPU2

In practice, the effective bandwidth for cross-switch GPU communication is often 8-12 GB/s rather than the 32 GB/s theoretical for PCIe 4.0.

NVLink Topologies: Ring and Mesh

NVLink enables different connectivity patterns depending on how the links are routed or how the switch is configured:

NVLink Ring: Each GPU is connected to two neighbors, forming a ring. All-reduce works naturally on a ring: data flows in one direction, with each GPU accumulating partial sums. All-reduce bandwidth = NVLink bandwidth / 2 (bidirectional ring).

NVLink Mesh (NVSwitch): Every GPU is connected to every other GPU via NVSwitch. There is no "ring" - the all-reduce can be done in two phases (reduce-scatter + all-gather) with full-bandwidth parallel communication. All-reduce bandwidth = NVLink bandwidth * (n-1)/n (approaches NVLink bandwidth for large n).

For 8 GPUs, these formulas give the NVSwitch mesh roughly 1.75x higher effective all-reduce bandwidth than a bidirectional ring at the same per-link bandwidth, and the advantage approaches 2x as GPU count grows.
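
A short sketch of the two formulas above, under the simplifying assumption that every link runs at the same nominal bandwidth, makes the comparison concrete:

def ring_allreduce_bw(link_bw_gb_s):
    """Effective all-reduce bandwidth on a bidirectional NVLink ring (formula above)."""
    return link_bw_gb_s / 2

def mesh_allreduce_bw(link_bw_gb_s, n_gpus):
    """Effective all-reduce bandwidth with full all-to-all connectivity (NVSwitch)."""
    return link_bw_gb_s * (n_gpus - 1) / n_gpus

link_bw = 900.0   # NVLink 4.0 total per GPU, GB/s
n = 8
print(f"Ring:     {ring_allreduce_bw(link_bw):.0f} GB/s effective")      # 450 GB/s
print(f"NVSwitch: {mesh_allreduce_bw(link_bw, n):.0f} GB/s effective")   # ~788 GB/s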

What Limits PCIe Bandwidth in Practice

Theoretical PCIe bandwidth (32 GB/s per direction for Gen 4 x16) is rarely achieved in practice for GPU transfers:

  1. Protocol overhead: PCIe frames include headers and sequence numbers. Effective payload bandwidth is ~90-95% of raw link rate.

  2. DMA alignment and size: Small transfers (under ~1 MB) have high overhead relative to payload. The per-transfer setup cost is fixed regardless of size.

  3. Pinned vs pageable memory: CPU memory is normally pageable - it can be swapped to disk by the OS. CUDA's DMA engine cannot directly access pageable memory. For transfers involving pageable memory, CUDA silently performs a two-stage copy: first it copies to a pinned staging buffer (on CPU), then DMA-transfers from there to GPU. This halves effective bandwidth.

  4. Shared bandwidth: Multiple devices whose traffic converges on the same PCIe root complex compete for a single bandwidth pool. Eight GPUs funneling through a 64 GB/s root-complex path average only 8 GB/s each.

Pinned Memory - Why It Matters

Pinned memory (also called page-locked memory) is CPU memory that has been locked in physical RAM - it cannot be paged out or moved. When memory is pinned:

  1. Its physical address is guaranteed to remain stable - the DMA engine can use it directly
  2. CUDA's DMA engine can transfer directly between GPU HBM and pinned CPU memory without a staging copy
  3. Transfers can be overlapped with GPU computation (async transfers)

The bandwidth difference is significant:

  • Pageable memory to GPU: ~8-12 GB/s (effective, due to staging copy)
  • Pinned memory to GPU: ~25-30 GB/s (near theoretical PCIe 4.0 limit)

The cost: pinned memory cannot be swapped. Pinning too much memory reduces the available RAM for the OS and other processes. The general recommendation is to pin only the memory you are actively transferring - dataloaders, for example, should pin the batch tensor just before the transfer, not entire datasets.
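
A minimal illustration of that recommendation - pin only the tensor that is about to be transferred, and make the copy asynchronous (tensor shapes here are arbitrary):

import torch

device = torch.device("cuda:0")

# Pageable: a plain CPU tensor; copying it to the GPU goes through a staging buffer
batch = torch.randn(64, 3, 224, 224)

# Pinned: page-locked copy of the same data; the DMA engine can read it directly,
# and the copy can overlap with GPU work because non_blocking=True is honored
batch_pinned = batch.pin_memory()
batch_gpu = batch_pinned.to(device, non_blocking=True)

# The copy above is asynchronous: synchronize before timing it or before
# overwriting batch_pinned on the CPU side
torch.cuda.synchronize()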

Peer-to-Peer (P2P) Transfers

P2P transfers allow one GPU to read from or write to another GPU's memory directly, without staging through CPU memory. For GPUs connected via NVLink, P2P uses the NVLink path. For GPUs connected only via PCIe, P2P uses the PCIe path and goes through the CPU root complex.

P2P must be explicitly enabled:

import torch

# Check if P2P is available between two GPUs
can_access = torch.cuda.can_device_access_peer(0, 1) # GPU 0 accessing GPU 1
print(f"GPU 0 can P2P access GPU 1: {can_access}")

# Enable P2P
torch.cuda.set_device(0)
# Access is enabled automatically when using distributed PyTorch
# or can be enabled manually in CUDA:
# cudaDeviceEnablePeerAccess(1, 0) # GPU 0 enables access to GPU 1

Without P2P, a GPU-to-GPU transfer stages through CPU memory: GPU0 -> CPU -> GPU1, using PCIe bandwidth twice. With P2P, it goes directly: GPU0 -> GPU1, using NVLink or PCIe bandwidth once.


The Math - Bandwidth, Latency, and All-Reduce Time

Modeling GPU-to-GPU Transfer Time

For a tensor of $B$ bytes, the transfer time over a link with bandwidth $W$ (bytes/second) and latency $\alpha$ (seconds) is:

$$t = \alpha + \frac{B}{W}$$

For large tensors (where $B/W \gg \alpha$), latency is negligible and transfer time is dominated by bandwidth. For small tensors (where $B/W \ll \alpha$), latency dominates.

For an NVLink 4.0 transfer:

  • $W$ = 900 GB/s $= 9 \times 10^{11}$ bytes/s
  • $\alpha \approx$ 1-2 microseconds (measured)

For a 1 GB gradient tensor:

  • Transfer time $= 2\,\mu\text{s} + 10^9 / (9 \times 10^{11}) \approx 1.1$ ms (bandwidth-dominated)

For a 1 KB gradient tensor (very small all-reduce):

  • Transfer time $= 2\,\mu\text{s} + 10^3 / (9 \times 10^{11}) \approx 2\,\mu\text{s}$ (latency-dominated)
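
The same model is easy to evaluate numerically. The sketch below plugs in the NVLink 4.0 figures used above and also solves for the crossover size at which the latency and bandwidth terms are equal:

def transfer_time_s(n_bytes, bandwidth_bytes_per_s, latency_s):
    """t = alpha + B / W, the latency + bandwidth model above."""
    return latency_s + n_bytes / bandwidth_bytes_per_s

W = 900e9      # NVLink 4.0, bytes/s
alpha = 2e-6   # ~2 microseconds of link/launch latency

for size in (1e3, 1e6, 1e9):   # 1 KB, 1 MB, 1 GB
    t = transfer_time_s(size, W, alpha)
    print(f"{size:>10.0e} bytes -> {t * 1e3:.4f} ms")

# Crossover: the message size where alpha == B / W
print(f"Latency/bandwidth crossover: {alpha * W / 1e6:.1f} MB")   # ~1.8 MB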

Ring All-Reduce on N GPUs

The ring all-reduce algorithm is the standard for synchronizing gradients:

  1. Reduce-scatter phase: Each GPU sends a chunk to the next GPU and receives a chunk from the previous. After $N-1$ steps, each GPU holds one chunk that is the sum across all GPUs.

  2. All-gather phase: Each GPU broadcasts its summed chunk around the ring. After $N-1$ steps, every GPU has the full reduced tensor.

Total data transferred per GPU: $2 \times \frac{N-1}{N} \times B$ bytes, where $B$ is the full gradient tensor size.

For large $N$, this approaches $2B$ bytes per GPU regardless of $N$ - ring all-reduce scales well because each GPU always sends/receives roughly its share of the total data.

Time for ring all-reduce on $N$ GPUs with link bandwidth $W$:

$$t_\text{all-reduce} = 2 \times \frac{N-1}{N} \times \frac{B}{W}$$

For 8 GPUs, 1 GB gradients, NVLink 4.0 (900 GB/s per link):

$$t = 2 \times \frac{7}{8} \times \frac{10^9}{9 \times 10^{11}} \approx 1.94 \text{ ms}$$

For 8 GPUs, 1 GB gradients, PCIe 4.0 (32 GB/s):

$$t = 2 \times \frac{7}{8} \times \frac{10^9}{3.2 \times 10^{10}} \approx 54.7 \text{ ms}$$

NVLink is 28x faster for this all-reduce. A 70B-parameter transformer has 280 GB of FP32 gradients. At NVLink speeds, the full all-reduce takes about 546 ms; at PCIe speeds, 15.3 seconds. For a training step that takes 1-2 seconds of compute, the PCIe all-reduce dominates the step, while the NVLink all-reduce is a modest fraction of it.
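
These numbers are easy to reproduce; here is the ring formula in a few lines of Python, using the nominal link bandwidths from this section:

def ring_allreduce_time_s(n_gpus, grad_bytes, link_bw_bytes_per_s):
    """t = 2 * (N-1)/N * B / W, ignoring latency (fine for large tensors)."""
    return 2 * (n_gpus - 1) / n_gpus * grad_bytes / link_bw_bytes_per_s

N = 8
B = 1e9    # 1 GB of gradients
for name, W in [("NVLink 4.0", 900e9), ("PCIe 4.0 x16", 32e9)]:
    t_ms = ring_allreduce_time_s(N, B, W) * 1e3
    print(f"{name}: {t_ms:.1f} ms per all-reduce")
# NVLink 4.0: ~1.9 ms, PCIe 4.0 x16: ~54.7 ms - the ~28x gap discussed above.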


Code - Measuring Interconnect Bandwidth

Benchmarking PCIe Bandwidth

import torch
import time
import numpy as np

def benchmark_pcie_bandwidth(
    size_gb=1.0,
    direction='h2d',   # 'h2d' = host to device, 'd2h' = device to host
    use_pinned=True,
    n_trials=20,
    device_id=0
):
    """
    Benchmark PCIe bandwidth between CPU memory and GPU memory.

    Args:
        size_gb: transfer size in GB
        direction: 'h2d' (host to device) or 'd2h' (device to host)
        use_pinned: whether to use pinned (page-locked) CPU memory
        n_trials: number of timed trials
        device_id: CUDA device to test

    Returns:
        peak bandwidth in GB/s
    """
    torch.cuda.set_device(device_id)
    n_elements = int(size_gb * 1e9 / 4)  # FP32 elements

    if use_pinned:
        # Allocate pinned (page-locked) CPU memory
        cpu_tensor = torch.empty(n_elements, dtype=torch.float32, pin_memory=True)
    else:
        # Regular pageable CPU memory
        cpu_tensor = torch.empty(n_elements, dtype=torch.float32)

    gpu_tensor = torch.empty(n_elements, dtype=torch.float32, device=f'cuda:{device_id}')

    # Fill with some data
    cpu_tensor.fill_(1.0)
    gpu_tensor.fill_(2.0)

    # Warm up
    for _ in range(3):
        if direction == 'h2d':
            gpu_tensor.copy_(cpu_tensor)
        else:
            cpu_tensor.copy_(gpu_tensor)
    torch.cuda.synchronize()

    bandwidths = []
    for _ in range(n_trials):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)

        start.record()
        if direction == 'h2d':
            gpu_tensor.copy_(cpu_tensor)
        else:
            cpu_tensor.copy_(gpu_tensor)
        end.record()

        torch.cuda.synchronize()
        elapsed_ms = start.elapsed_time(end)
        bw = size_gb / (elapsed_ms / 1000.0)
        bandwidths.append(bw)

    peak_bw = max(bandwidths)
    median_bw = float(np.median(bandwidths))

    memory_type = "Pinned" if use_pinned else "Pageable"
    print(f"{direction.upper()} {memory_type} ({size_gb:.1f} GB):")
    print(f"  Peak:   {peak_bw:.1f} GB/s")
    print(f"  Median: {median_bw:.1f} GB/s")
    return peak_bw


# Run the benchmark
print("=== PCIe Bandwidth Benchmark ===")
benchmark_pcie_bandwidth(size_gb=1.0, direction='h2d', use_pinned=True)
benchmark_pcie_bandwidth(size_gb=1.0, direction='h2d', use_pinned=False)
benchmark_pcie_bandwidth(size_gb=1.0, direction='d2h', use_pinned=True)
benchmark_pcie_bandwidth(size_gb=1.0, direction='d2h', use_pinned=False)

Expected output on a PCIe 4.0 system:

=== PCIe Bandwidth Benchmark ===
H2D Pinned (1.0 GB):
Peak: 28.4 GB/s
Median: 27.8 GB/s
H2D Pageable (1.0 GB):
Peak: 12.1 GB/s
Median: 11.6 GB/s
D2H Pinned (1.0 GB):
Peak: 26.9 GB/s
Median: 26.1 GB/s
D2H Pageable (1.0 GB):
Peak: 11.4 GB/s
Median: 10.9 GB/s
Benchmarking GPU-to-GPU Bandwidth

import torch
import numpy as np

def benchmark_gpu_to_gpu_bandwidth(
    src_device=0,
    dst_device=1,
    size_gb=1.0,
    n_trials=20
):
    """
    Benchmark GPU-to-GPU transfer bandwidth.
    On NVLink systems, this uses NVLink.
    On PCIe-only systems, this routes through CPU memory.
    """
    n_elements = int(size_gb * 1e9 / 4)  # FP32

    # Check P2P availability
    can_p2p = torch.cuda.can_device_access_peer(src_device, dst_device)
    print(f"GPU {src_device} -> GPU {dst_device}")
    print(f"  P2P access available: {can_p2p}")

    src_tensor = torch.ones(n_elements, dtype=torch.float32, device=f'cuda:{src_device}')
    dst_tensor = torch.empty(n_elements, dtype=torch.float32, device=f'cuda:{dst_device}')

    # Warm up
    for _ in range(3):
        dst_tensor.copy_(src_tensor)
    torch.cuda.synchronize()

    bandwidths = []
    for _ in range(n_trials):
        start_event = torch.cuda.Event(enable_timing=True)
        end_event = torch.cuda.Event(enable_timing=True)

        # Record on the destination device's current stream
        torch.cuda.set_device(dst_device)
        start_event.record()
        dst_tensor.copy_(src_tensor)
        end_event.record()

        torch.cuda.synchronize()
        elapsed_ms = start_event.elapsed_time(end_event)
        bw = size_gb / (elapsed_ms / 1000.0)
        bandwidths.append(bw)

    peak_bw = max(bandwidths)
    print(f"  Peak bandwidth: {peak_bw:.1f} GB/s")
    print(f"  (NVLink 4.0 max: 900 GB/s)")
    print(f"  (PCIe 4.0 max: ~28 GB/s via CPU)")
    return peak_bw

# Run across all GPU pairs
n_gpus = torch.cuda.device_count()
print(f"Testing {n_gpus} GPUs\n")

for src in range(n_gpus):
    for dst in range(n_gpus):
        if src != dst:
            benchmark_gpu_to_gpu_bandwidth(src, dst, size_gb=0.5)
            print()

Overlapping Compute and Transfers with CUDA Streams

import torch
from torch.utils.data import DataLoader, TensorDataset

def overlapped_data_pipeline(model, dataset, batch_size=64, n_batches=20):
    """
    Overlap CPU-to-GPU data transfer with GPU computation using CUDA streams.

    Without overlap:
        [Transfer batch 1] [Compute batch 1] [Transfer batch 2] [Compute batch 2] ...

    With overlap:
        [Transfer batch 1] [Compute batch 1 + Transfer batch 2] [Compute batch 2 + Transfer batch 3] ...
    """
    device = torch.device('cuda:0')

    # Stream for compute
    compute_stream = torch.cuda.Stream()
    # Stream for data transfer
    transfer_stream = torch.cuda.Stream()

    dataloader = DataLoader(dataset, batch_size=batch_size, num_workers=4, pin_memory=True)
    iterator = iter(dataloader)

    # Pre-load first batch (TensorDataset batches arrive as a list; take the tensor)
    batch_cpu = next(iterator)[0]
    batch_gpu = batch_cpu.to(device, non_blocking=True)

    outputs = []
    total_compute_ms = 0.0

    for i in range(n_batches - 1):
        current_batch_gpu = batch_gpu

        # Start loading the next batch asynchronously (on transfer stream)
        with torch.cuda.stream(transfer_stream):
            try:
                next_batch_cpu = next(iterator)[0]
                # non_blocking=True means the transfer happens asynchronously
                # pin_memory=True (set in DataLoader) is required for non_blocking to work
                next_batch_gpu = next_batch_cpu.to(device, non_blocking=True)
            except StopIteration:
                next_batch_gpu = None

        # Run compute on current batch (on compute stream)
        with torch.cuda.stream(compute_stream):
            start = torch.cuda.Event(enable_timing=True)
            end = torch.cuda.Event(enable_timing=True)
            start.record()

            output = model(current_batch_gpu)
            outputs.append(output)

            end.record()

        torch.cuda.synchronize()
        total_compute_ms += start.elapsed_time(end)

        # Transfer stream must be done before we use next_batch_gpu in compute stream
        # The synchronize above ensures both streams are done
        batch_gpu = next_batch_gpu
        if batch_gpu is None:
            break

    return outputs


# Example: DataLoader with pinned memory (required for async transfers)
def create_pinned_dataloader(data_tensor, batch_size=128):
    """
    Create a DataLoader that uses pinned memory for fast CPU-to-GPU transfers.

    Key settings:
    - pin_memory=True: allocates batches in pinned CPU memory
    - num_workers > 0: load data in background CPU worker processes
    - persistent_workers=True: keep worker processes alive between epochs
    """
    dataset = TensorDataset(data_tensor)
    loader = DataLoader(
        dataset,
        batch_size=batch_size,
        pin_memory=True,           # Required for non_blocking GPU transfers
        num_workers=4,             # Parallel CPU preprocessing
        persistent_workers=True,   # Avoid respawning workers each epoch
        prefetch_factor=2,         # Pre-fetch 2 batches ahead per worker
    )
    return loader
Benchmarking NCCL All-Reduce Bandwidth

import torch
import torch.distributed as dist
import os

def benchmark_nccl_allreduce(
    tensor_size_gb=0.5,
    n_trials=20
):
    """
    Benchmark NCCL all-reduce bandwidth on available GPUs.
    Run this with: torchrun --nproc_per_node=8 this_script.py
    """
    rank = int(os.environ.get('RANK', 0))
    local_rank = int(os.environ.get('LOCAL_RANK', 0))

    torch.cuda.set_device(local_rank)

    dist.init_process_group(
        backend='nccl',      # Use NCCL for GPU collectives
        init_method='env://'
    )
    world_size = dist.get_world_size()  # set by torchrun

    n_elements = int(tensor_size_gb * 1e9 / 4)  # FP32
    tensor = torch.ones(n_elements, dtype=torch.float32, device=f'cuda:{local_rank}')

    # Warm up
    for _ in range(5):
        dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
    torch.cuda.synchronize()

    times = []
    for _ in range(n_trials):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)

        start.record()
        dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
        end.record()

        torch.cuda.synchronize()
        times.append(start.elapsed_time(end))

    if rank == 0:
        best_ms = min(times)
        # Ring all-reduce transfers 2*(N-1)/N * B bytes per GPU
        bytes_per_gpu = 2 * (world_size - 1) / world_size * tensor_size_gb * 1e9
        achieved_bw = bytes_per_gpu / (best_ms / 1000.0) / 1e9

        print(f"NCCL All-reduce ({world_size} GPUs, {tensor_size_gb:.1f} GB):")
        print(f"  Best time: {best_ms:.2f} ms")
        print(f"  Achieved bandwidth: {achieved_bw:.1f} GB/s per GPU")
        print(f"  (NVLink 4.0 max: 900 GB/s bidirectional)")

    dist.destroy_process_group()


if __name__ == "__main__":
    benchmark_nccl_allreduce()

Checking GPU Topology

import subprocess

def print_gpu_topology():
    """
    Print the NVLink/PCIe topology between all GPUs.
    Equivalent to running: nvidia-smi topo -m
    """
    result = subprocess.run(
        ['nvidia-smi', 'topo', '-m'],
        capture_output=True, text=True
    )
    print(result.stdout)

# Also check CUDA device properties for each GPU
import torch
n_gpus = torch.cuda.device_count()
for i in range(n_gpus):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}")
    print(f"  NVLink status:  check `nvidia-smi nvlink -s`")
    print(f"  PCIe gen/width: check `nvidia-smi -q -d PCIE`")

print_gpu_topology()

Architecture Diagrams

  • PCIe topology in a dual-socket server
  • DGX H100 NVSwitch mesh
  • PCIe transfer path: pinned vs pageable memory
  • Bandwidth hierarchy


Production Engineering Notes

RDMA and GPUDirect for Multi-Node Training

When training spans multiple servers (nodes), gradient synchronization must cross the network. Standard network transfers go through CPU memory: GPU memory -> PCIe -> CPU memory -> NIC -> wire -> NIC -> CPU memory -> PCIe -> GPU memory. This path crosses PCIe twice and involves CPU memory copies.

RDMA (Remote Direct Memory Access) eliminates the CPU copies: the NIC's DMA engine reads and writes buffers in CPU memory directly, bypassing the operating system and intermediate copies. GPUDirect RDMA goes further: data moves directly between GPU memory and the NIC, bypassing CPU memory entirely:

GPU memory --PCIe--> NIC -> wire -> NIC --PCIe--> GPU memory

GPUDirect RDMA requires:

  1. A compatible NIC (Mellanox/NVIDIA ConnectX-6 or newer)
  2. A compatible GPU (A100, H100 support it natively)
  3. The GPU and NIC must be on the same PCIe root complex (same CPU socket, ideally the same PCIe switch) for full-speed operation
  4. The nv_peer_mem or nvidia-peermem kernel module loaded

In practice, GPUDirect RDMA can double the effective inter-node bandwidth by eliminating the two PCIe crossings.
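
A quick pre-flight check - is the peer-memory module loaded, and are the GPU and NIC on the same PCIe root complex - can be scripted. This is a rough sketch: it verifies the prerequisites are present, not that NCCL will actually choose the GDR path.

import subprocess
from pathlib import Path

def gdr_preflight_check():
    # 1. Is the NVIDIA peer-memory kernel module loaded? (nv_peer_mem or nvidia_peermem)
    modules = Path("/proc/modules").read_text()
    loaded = any(name in modules for name in ("nv_peer_mem", "nvidia_peermem"))
    print(f"Peer-memory kernel module loaded: {loaded}")

    # 2. Show GPU <-> NIC PCIe affinity. For full-speed GPUDirect RDMA you want
    #    the GPU and its NIC to share a PCIe switch (PIX/PXB in the matrix, not SYS).
    topo = subprocess.run(["nvidia-smi", "topo", "-m"], capture_output=True, text=True)
    print(topo.stdout)

gdr_preflight_check()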

Optimizing Multi-GPU Placement

When provisioning multi-GPU servers for training, the PCIe topology matters as much as GPU count:

Ideal topology for 8 GPUs (DGX-style):

  • All 8 GPUs connected to NVSwitch fabric
  • NVSwitch provides all-to-all NVLink connectivity
  • PCIe is only used for CPU-GPU control path, not for gradient synchronization
  • Result: gradient all-reduce at 600-900 GB/s effective

Poor topology for 8 GPUs (bare-metal server with no NVSwitch):

  • 4 GPUs per CPU socket, connected via PCIe switches
  • Cross-socket GPU communication routes through CPU interconnect
  • Effective bandwidth for cross-socket transfers: 8-15 GB/s
  • All-reduce with this topology: 10-15x slower than NVSwitch

If you cannot use NVLink/NVSwitch hardware, the NCCL_P2P_DISABLE=1 environment variable forces NCCL to stage transfers through CPU (host) memory instead of using PCIe P2P, which is sometimes more stable on PCIe-only systems than partially-working P2P.

NCCL Environment Variables for Interconnect Control

NCCL (NVIDIA Collective Communications Library) respects several environment variables that control how it uses the available interconnects:

# Force NCCL to use NVLink (disable PCIe P2P)
export NCCL_P2P_LEVEL=NVL

# Force all communication through host (CPU) memory - useful for debugging
export NCCL_P2P_DISABLE=1

# Enable NCCL debug output to understand which paths are being used
export NCCL_DEBUG=INFO

# Set the network interface NCCL uses for inter-node communication
export NCCL_SOCKET_IFNAME=ens5 # or eth0, ib0 for InfiniBand

# Enable GPUDirect RDMA (requires compatible NIC + nv_peer_mem module)
export NCCL_NET_GDR_LEVEL=SYS

# Set NCCL ring/tree algorithm (default is auto)
# RING is better for large tensors; TREE for small tensors
export NCCL_ALGO=Ring

# Sockets per helper thread for the socket (plain TCP) transport -
# more sockets can raise throughput for inter-node traffic over Ethernet
export NCCL_NSOCKS_PERTHREAD=4

For a DGX H100 with all-NVLink topology:

export NCCL_P2P_LEVEL=NVL
export NCCL_ALGO=Ring

For a multi-node InfiniBand cluster:

export NCCL_SOCKET_IFNAME=ib0
export NCCL_NET_GDR_LEVEL=SYS # Enable GPUDirect RDMA
export NCCL_IB_DISABLE=0

Diagnosing Interconnect Bottlenecks in Training

Use the following profiling approach to determine whether your training is interconnect-bound:

  1. Profile with PyTorch Profiler:
from torch.profiler import profile, record_function, ProfilerActivity

# model, criterion, optimizer, and dataloader are assumed to exist already
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    with_stack=True,
    record_shapes=True
) as prof:
    for i, (inputs, labels) in enumerate(dataloader):
        if i >= 5:
            break
        with record_function("data_transfer"):
            inputs = inputs.cuda(non_blocking=True)
            labels = labels.cuda(non_blocking=True)
        with record_function("forward"):
            outputs = model(inputs)
        with record_function("backward"):
            loss = criterion(outputs, labels)
            loss.backward()
        with record_function("optimizer"):
            optimizer.step()
            optimizer.zero_grad()

# Print timeline sorted by total CUDA time
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))

# Export Chrome trace for visual inspection
prof.export_chrome_trace("training_trace.json")

Look for: nccl:all_reduce, cudaMemcpy, ncclAllReduce in the timeline. If these account for more than 10-15% of step time on large models, interconnect is a bottleneck.

  2. Quick bandwidth test with nvidia-smi:
# Show NVLink status and bandwidth counters
nvidia-smi nvlink -s
nvidia-smi nvlink -c # clear counters

# Wait 10 seconds during training, then read
nvidia-smi nvlink -s # shows TX/RX in MB/s

# Show PCIe bandwidth
nvidia-smi dmon -s t -d 1 -c 20 # Show PCIE TX/RX for all GPUs, 20 samples

Gradient Compression to Reduce Interconnect Traffic

When interconnect bandwidth is genuinely the bottleneck, gradient compression reduces the amount of data that must traverse the slow link:

PowerSGD: Low-rank approximation of gradient matrices. Instead of communicating the full gradient matrix $G \in \mathbb{R}^{m \times n}$, communicate two low-rank factors $P \in \mathbb{R}^{m \times r}$ and $Q \in \mathbb{R}^{n \times r}$ where $r \ll \min(m, n)$.

Communication reduction: from $mn$ to $r(m + n)$ values, a factor of $\frac{mn}{r(m+n)}$ savings.

For a typical weight matrix with $m = n = 4096$ and $r = 4$:

  • Original: $4096 \times 4096 = 16.8\text{M}$ values
  • Compressed: $4 \times (4096 + 4096) = 32{,}768$ values
  • Compression ratio: 512x

Top-K sparsification: Send only the K largest gradient values (by absolute value), along with their indices. 99% sparsification sends 1% of gradients.

Trade-off: gradient compression introduces approximation error and can slow convergence. It is a last resort when interconnect genuinely limits scaling, not a default optimization.
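
PyTorch ships PowerSGD as a DDP communication hook, so trying it does not require custom collectives. A minimal sketch - MyModel and local_rank are placeholders, and the rank and warm-up settings are illustrative, not tuned:

import torch
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.algorithms.ddp_comm_hooks import powerSGD_hook as powerSGD

# Assumes torch.distributed.init_process_group() has already run (e.g. via torchrun)
model = DDP(MyModel().cuda(), device_ids=[local_rank])

state = powerSGD.PowerSGDState(
    process_group=None,            # default process group
    matrix_approximation_rank=4,   # the rank r in the math above
    start_powerSGD_iter=1000,      # run uncompressed all-reduce for the first steps
)
model.register_comm_hook(state, powerSGD.powerSGD_hook)
# Training proceeds as usual; gradients are compressed inside the DDP all-reduce.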


Common Mistakes

:::danger Running All-Reduce Over PCIe in a Multi-GPU Setup If your training server has multiple GPUs but is not NVLink-connected (either no NVLink hardware, or P2P is disabled, or GPUs are on different PCIe switches without a common NVSwitch), every all-reduce goes through CPU memory over PCIe. For large models, this can make gradient synchronization take 10-50x longer than necessary and completely dominate training step time. Always verify your GPU topology with nvidia-smi topo -m before designing a multi-GPU training setup. :::

:::danger Using Pageable Memory for Large Data Transfers If you load training batches into regular CPU RAM and then call .cuda(), CUDA performs a two-stage copy: pageable -> pinned staging buffer -> GPU. This halves effective transfer bandwidth (from ~28 GB/s to ~12 GB/s on PCIe 4.0). Always use pin_memory=True in DataLoader and non_blocking=True in the .to(device) call. The cost is that pinned memory cannot be swapped, so do not pin enormous datasets - only the active batch pipeline. :::

:::warning Assuming All 8 GPUs Are Equally Connected On a consumer or workstation server with 8 GPUs, not all GPU pairs are created equal. nvidia-smi topo -m shows the topology matrix: "NV1", "NV2" means 1 or 2 NVLink paths; "PIX" means connected via PCIe switch; "PXB" means connected via PCIe but through a bridge; "SYS" means connected via CPU QPI/UPI. A training run that places data-parallel replicas on GPU 0, 1, 2, 3 (on CPU 0's PCIe tree) and GPU 4, 5, 6, 7 (on CPU 1's PCIe tree) will have severely imbalanced communication costs between groups. :::

:::warning Enabling P2P When GPUs Are on Different IOMMU Domains On some systems, enabling peer-to-peer CUDA memory access requires disabling IOMMU (Input-Output Memory Management Unit) in BIOS, or configuring it to allow P2P. Without proper IOMMU configuration, cudaDeviceEnablePeerAccess may return cudaSuccess but actually fall back to a staging copy through CPU memory. Verify that P2P is actually working by running a bandwidth benchmark and comparing to expected NVLink speeds. :::

:::tip Overlapping Data Loading with Compute One of the highest-impact low-effort optimizations: use multiple DataLoader workers (num_workers=4 or more), pin_memory=True, and non_blocking=True in .to(device). This allows the CPU to prepare the next batch and transfer it to GPU while the current batch is being computed. The effective PCIe bandwidth is "free" when it is completely overlapped with GPU compute. Many teams run with num_workers=0 and pin_memory=False by default and leave significant PCIe overlap opportunities on the table. :::


Interview Q&A

Q1: What is the bandwidth difference between PCIe 4.0 and NVLink 4.0, and when does this difference actually matter for ML training?

PCIe 4.0 x16 provides 32 GB/s in each direction (64 GB/s bidirectional). NVLink 4.0 (H100) provides 900 GB/s bidirectional per GPU pair.

The difference - roughly 14x - matters specifically during gradient synchronization in data-parallel training. After each backward pass, all GPUs must exchange gradients (all-reduce). If this happens over PCIe (because GPUs are only PCIe-connected, or because NVLink is disabled), the time is:

$$t_\text{PCIe} = 2 \times \frac{N-1}{N} \times \frac{B}{32\ \text{GB/s}}$$

For 8 GPUs with 1 GB of gradients: about 54 ms.

Over NVLink 4.0 with NVSwitch: about 2 ms.

For a training step that takes 100-200 ms of compute, PCIe all-reduce is a 25-35% overhead that is completely invisible on NVSwitch hardware. This is why DGX pricing is justified for serious training workloads.

The difference does NOT matter for:

  • Inference with batch size 1 (no gradient sync needed)
  • Training with very small models where gradients are small
  • Pipeline parallelism where gradient sync happens across pipeline stages (different communication pattern)

Q2: Explain how pinned memory works and why it doubles transfer bandwidth compared to regular CPU memory.

CPU memory is normally "pageable" - the OS can move physical pages around or swap them to disk. CUDA's DMA engine needs to know the physical address of memory to perform a direct transfer, and it cannot do that if the OS might move the pages while the transfer is in progress.

For pageable memory, CUDA performs a two-stage transfer:

  1. CPU memcpy from pageable memory to a temporary pinned staging buffer (in physical RAM, locked by CUDA): ~10-15 GB/s (limited by CPU memory bandwidth)
  2. DMA from staging buffer to GPU over PCIe: ~28-30 GB/s

The effective bandwidth is limited by the slower of these two stages - the CPU memcpy to staging, at ~10-15 GB/s.

For pinned (page-locked) memory:

  • cudaMallocHost() or torch.empty(..., pin_memory=True) allocates memory that is locked in physical RAM
  • Physical address is stable; the DMA engine can use it directly
  • Only the PCIe DMA transfer happens: ~28-30 GB/s

Pinned memory also enables asynchronous transfers - the GPU can DMA data while running kernels concurrently (on different CUDA streams). Pageable memory transfers require synchronization.

Cost: pinned memory consumes physical RAM permanently (cannot be swapped). Over-pinning reduces OS memory pressure tolerance.

Q3: What is NVSwitch and why is it necessary for all-to-all GPU communication?

NVLink provides point-to-point GPU-to-GPU connections. A single GPU has a fixed number of NVLink ports (18 on H100). Without a switch, each port connects to exactly one other GPU. With 8 GPUs and 18 ports each, you could connect each GPU to at most 7 others (one per other GPU) - forming a complete graph. This works for 8 GPUs.

The problem arises with higher GPU counts: for 16 or more GPUs, no single GPU has enough NVLink ports to connect to all others. More importantly, even with 8 GPUs you want multiple links between GPU pairs for higher bandwidth - not just one link.

NVSwitch is a dedicated switch chip that allows many NVLink connections to be multiplexed. In DGX H100:

  • 4 NVSwitch chips on the baseboard
  • Each H100 connects to each NVSwitch chip via multiple NVLink ports
  • Any GPU can communicate with any other GPU at full bandwidth simultaneously through the NVSwitch fabric

Without NVSwitch, an 8-GPU ring all-reduce requires data to traverse intermediate GPUs. GPU 0 communicates with GPU 1 and GPU 7 (its ring neighbors). GPU 0's gradients must hop through GPU 1, GPU 2, GPU 3 ... to reach GPU 4 (opposite side of ring). NVSwitch eliminates hops: GPU 0 can send directly to GPU 4 at full NVLink bandwidth.

Q4: How do you measure whether a training run is interconnect-bound vs compute-bound?

Three complementary approaches:

  1. Theoretical analysis: Calculate the expected all-reduce time: $t = 2 \times \frac{N-1}{N} \times \frac{B_\text{grad}}{W_\text{interconnect}}$. For A100 with NVLink (600 GB/s), 1 GB gradients, 8 GPUs: ~3 ms. For PCIe (32 GB/s): ~55 ms. Compare to step time. If all-reduce is less than 5% of step time, you are compute-bound.

  2. Profiler measurement: Use torch.profiler and look for nccl:all_reduce or ncclAllReduceRing in the trace. Sum up these events as a fraction of total step time. Above 15-20% indicates interconnect is a significant bottleneck.

  3. Scaling efficiency experiment: Run training on 1 GPU, 2 GPUs, 4 GPUs, 8 GPUs. Plot throughput (samples/second) vs GPU count. If throughput scales linearly, you are compute-bound. If it sub-scales (e.g., 2 GPUs gives less than 2x, and 8 GPUs gives less than 6x), the gap is communication overhead - likely interconnect.

A quick empirical test: set NCCL_P2P_DISABLE=1 (forces all communication through CPU memory) and run one step. Then set it back and run one step. If the step time difference is small (say, under 5%), interconnect is not your bottleneck. If the difference is large, interconnect topology matters.

Q5: What is GPUDirect RDMA and when should you use it?

Normally, data moves from GPU to the network in two separate hops, both over PCIe:

  1. GPU memory -> PCIe -> CPU memory (a cudaMemcpy)
  2. CPU memory -> PCIe -> NIC (a DMA by the NIC)

Standard RDMA removes the operating system and extra memory copies from CPU-to-CPU communication, but GPU data still stages through CPU memory.

GPUDirect RDMA (GDR) eliminates both CPU memory stages. The NIC DMA engine maps GPU memory directly and transfers it over the network without ever touching CPU memory:

GPU memory -> NIC -> wire -> NIC -> GPU memory

When to use it:

  • Multi-node training with InfiniBand or high-speed Ethernet
  • When inter-node bandwidth is a bottleneck (gradient all-reduce across nodes)
  • When GPU and NIC are on the same PCIe root complex (same CPU socket, ideally same PCIe switch)

Requirements: NVIDIA GPU (A100/H100), Mellanox/NVIDIA NIC (ConnectX-5 or newer), nv_peer_mem or nvidia-peermem kernel module, NCCL_NET_GDR_LEVEL=SYS environment variable.

Expected benefit: 2x improvement in inter-node bandwidth (eliminate two PCIe hops). For a 400 Gbps InfiniBand link (50 GB/s), this is the difference between 25 GB/s (staging through CPU) and 50 GB/s (GDR direct). For large model training where inter-node all-reduce is a bottleneck, this can improve multi-node scaling efficiency by 20-40%.

Q6: A team runs 8-GPU training on a bare-metal server (no DGX). After profiling, they find all-reduce takes 40% of step time. What are their options?

Diagnose first: run nvidia-smi topo -m to see the actual interconnect topology. If any GPU pairs show SYS (routes through CPU interconnect), you have suboptimal placement.

Options in order of impact:

  1. Verify NVLink is enabled and in use. Some servers have NVLink hardware but it is disabled or not configured. Check: nvidia-smi nvlink -s. If links show "Inactive," investigate NVLink configuration (may require specific BIOS settings or nvidia-persistenced).

  2. Enable peer-to-peer access explicitly. Ensure CUDA P2P is working: check that cudaDeviceCanAccessPeer returns true for the appropriate GPU pairs. If not, investigate IOMMU settings.

  3. Use gradient accumulation. Accumulate gradients over multiple micro-batches before doing the all-reduce. This reduces communication frequency by a factor of the accumulation steps. If each all-reduce takes 40% of step time but you accumulate 4 micro-batches, communication drops to roughly 10% of total time (4 compute steps per communication). A minimal sketch appears after this list.

  4. Use gradient compression. PowerSGD or Top-K sparsification reduces bytes transferred. Trade-off: approximation error may require slightly more steps to converge.

  5. Switch hardware. If training is a sustained workload, a DGX-style NVSwitch server dramatically changes the economics: renting a DGX H100 for the run may cost less than the GPU-hours wasted on PCIe all-reduce at scale.

  6. Switch to pipeline or tensor parallelism. Data parallelism + all-reduce is not the only option. Pipeline parallelism sends activations (not all gradients) between GPU stages. Tensor parallelism splits individual operations. Both have different communication patterns that may be less sensitive to PCIe bandwidth.
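
A minimal sketch of option 3, gradient accumulation under DDP. DDP's no_sync() context suppresses the all-reduce on intermediate micro-batches; model (DDP-wrapped), criterion, optimizer, and dataloader are placeholders:

import contextlib

accum_steps = 4

for step, (inputs, labels) in enumerate(dataloader):
    inputs = inputs.cuda(non_blocking=True)
    labels = labels.cuda(non_blocking=True)

    is_sync_step = (step + 1) % accum_steps == 0
    # Skip the gradient all-reduce on non-sync steps
    ctx = contextlib.nullcontext() if is_sync_step else model.no_sync()

    with ctx:
        loss = criterion(model(inputs), labels) / accum_steps
        loss.backward()

    if is_sync_step:
        optimizer.step()        # gradients were all-reduced during this backward pass
        optimizer.zero_grad()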

