
DGX and HGX System Design

The Day We Realized the Server Was the Bottleneck

The model was GPT-4 scale. The team had spent eight months on architecture decisions, tokenization, data pipelines, and loss curves. They had 128 A100-80GB GPUs on paper - enough to train the model in theory. The cluster was assembled from OEM servers, each with 8 GPUs connected via PCIe. NVLink between GPUs was not available; the budget hadn't covered DGX nodes.

The first training run came in 23% slower than projected. The utilization graphs were the tell: GPU compute was sitting at 58% average even during the forward pass, not the backward pass where communication overhead is expected. Something was stealing compute cycles from inside the node, not between nodes. The culprit was PCIe contention. With 8 A100s connected via PCIe Gen4 x16 through a single CPU complex, the aggregate GPU-to-GPU bandwidth across all 8 GPUs was around 128 GB/s. A DGX A100 with NVLink achieves 600 GB/s all-to-all between the same 8 GPUs. They had built a cluster with 4.7x less intra-node bandwidth than the reference design, and were paying for it in training throughput.

The fix was expensive: replace 16 servers with 16 DGX A100 nodes. The DGX nodes cost 3.4x more per box. But the training throughput improvement paid back the cost difference in 4.2 months of compute savings at the scale they were running.

This is the economic argument for DGX: it is not just a faster server, it is a server designed specifically around the communication requirements of large model training. Every architectural decision - the NVSwitch topology, the HGX board, the dual CPUs, the eight per-GPU InfiniBand NICs - exists to eliminate the bottlenecks that appear at scale. Understanding why each component exists, and what happens when it isn't there, is what turns a competent ML engineer into someone who can make real infrastructure decisions.

The DGX is NVIDIA's reference implementation of what a training server should look like. The HGX is the OEM building block that lets other vendors replicate that reference. Together they form the de facto standard architecture for serious large model training. If you are going to build, buy, or evaluate GPU infrastructure at any scale, you need to understand this design from the inside out.


Why This Exists

The PCIe Architecture Problem

Before NVLink and NVSwitch, multi-GPU servers used PCIe as the interconnect. PCIe is a general-purpose bus designed for CPU-to-peripheral communication. It was never designed for GPU-to-GPU high-bandwidth transfers, and it shows at scale.

PCIe Gen4 x16 provides 32 GB/s per direction (full duplex), but that bandwidth is shared. In a server with 8 GPUs connected through two PCIe switches (4 GPUs per switch) under one or two CPUs, the actual GPU-to-GPU bandwidth depends heavily on which two GPUs are communicating:

  • Two GPUs on the same PCIe switch: up to 32 GB/s (one PCIe link)
  • Two GPUs on different PCIe switches: must traverse the CPU memory controller, drops to 10-16 GB/s depending on the NUMA configuration
  • Any GPU pair under sustained all-reduce: each link carries bidirectional traffic, effective throughput 4-8 GB/s per pair

For tensor-parallel attention layers where every GPU needs data from every other GPU multiple times per forward pass, PCIe architecture is fundamentally too slow. The bandwidth deficit directly translates to GPU idle time.
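A rough back-of-envelope model makes that deficit concrete. The sketch below uses a simple ring all-reduce cost model with illustrative effective bandwidths (the 10 GB/s and 450 GB/s figures are assumptions, not measurements) to estimate how long one gradient all-reduce takes over PCIe versus NVLink.

# Rough ring all-reduce cost model: bytes moved per GPU ~ 2 * (N-1)/N * tensor size.
# The bandwidth figures are illustrative effective rates, not measured values.
def allreduce_time_s(tensor_gb: float, n_gpus: int, per_gpu_bw_gb_s: float) -> float:
    """Estimate ring all-reduce time (seconds) for one tensor."""
    tensor_bytes = tensor_gb * 1e9
    bytes_moved = 2 * (n_gpus - 1) / n_gpus * tensor_bytes  # sent + received per GPU
    return bytes_moved / (per_gpu_bw_gb_s * 1e9)

grad_gb = 14.0  # bf16 gradients of a hypothetical 7B-parameter model
for name, bw in [("PCIe Gen4 (effective)", 10.0), ("NVLink via NVSwitch", 450.0)]:
    t_ms = allreduce_time_s(grad_gb, n_gpus=8, per_gpu_bw_gb_s=bw) * 1e3
    print(f"{name:>22}: {t_ms:8.1f} ms per gradient all-reduce")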

NVLink is NVIDIA's proprietary high-speed interconnect designed specifically for GPU-to-GPU communication. Where PCIe is a hierarchical tree (everything routes through the CPU), NVLink is a point-to-point mesh. Each NVLink connection is a dedicated bidirectional path between exactly two GPUs with no shared switching fabric between them.

NVLink 4.0 (H100) operates at 900 GB/s aggregate bidirectional bandwidth for a single GPU connected to the NVSwitch fabric. The individual NVLink connection speed is 25 GB/s per direction (50 GB/s bidirectional per link), and each H100 SXM5 GPU has 18 NVLink connections. Those 18 connections connect to NVSwitch chips rather than directly to other GPUs - this is what enables the all-to-all full bisection bandwidth topology.

NVSwitch: The All-to-All Enabler

Direct NVLink GPU-to-GPU pairing works fine for 2-4 GPUs but breaks down at 8. Full connectivity across 8 GPUs would require $\binom{8}{2} = 28$ unique links, with every GPU splitting its links across 7 peers. Each H100 SXM5 has 18 NVLink connections, so a direct mesh would leave at most 2-3 links (roughly 100-150 GB/s) per pair - far short of the 900 GB/s each GPU can drive. Full bandwidth to every peer is achievable only by routing all 18 links through switch chips.
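The counting argument in code form, using the NVLink 4.0 figures quoted above (18 links per GPU, 50 GB/s bidirectional per link):

# Why a direct 8-GPU NVLink mesh cannot deliver full per-pair bandwidth.
from math import comb

n_gpus, links_per_gpu, gb_s_per_link = 8, 18, 50   # NVLink 4.0 figures from above

pairs = comb(n_gpus, 2)                            # 28 unique GPU pairs in a full mesh
links_per_peer = links_per_gpu // (n_gpus - 1)     # at most 2 full links per peer
per_pair_gb_s = links_per_peer * gb_s_per_link     # ~100 GB/s per pair in a direct mesh

print(f"{pairs} pairwise links needed; a direct mesh leaves ~{per_pair_gb_s} GB/s per pair "
      f"vs 900 GB/s per GPU through NVSwitch")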

NVSwitch is NVIDIA's switching chip for NVLink. It behaves like a network switch but at GPU interconnect speeds. In the DGX H100, four NVSwitch 3.0 chips provide the switching fabric. Each GPU connects to all four NVSwitch chips, and the NVSwitch chips collectively provide full all-to-all connectivity between all 8 GPUs with full bisection bandwidth.

The practical result: any GPU can send data to any other GPU at full NVLink speed simultaneously. All 8 GPUs can be running all-to-all at the same time with no contention. This is qualitatively different from PCIe, where simultaneous all-to-all forces all $N$ GPUs to share the same upstream links and per-pair bandwidth collapses accordingly.


Historical Context

DGX-1: The First Reference Design (2016)

NVIDIA launched the DGX-1 in April 2016, hand-delivering the first unit to OpenAI. It contained 8 Pascal P100 GPUs with NVLink 1.0, dual Intel Xeon CPUs, and 512 GB of RAM. The headline number was 170 TFLOPS of half-precision compute. More important than the compute was the 160 GB/s NVLink bandwidth per GPU (bidirectional, between paired GPUs) - roughly 5x the 32 GB/s of a PCIe Gen3 x16 link.

The "aha moment" for the DGX team was recognizing that the fastest path to making deep learning faster was not making individual GPUs faster - it was making the communication between GPUs fast enough that the GPUs could be kept busy. The first DGX sold for $129,000. Within a year, every serious AI lab was running them.

DGX A100 and the NVSwitch Revolution (2020)

The DGX A100 (2020) adopted NVSwitch 2.0 and a fully switched all-to-all topology (NVSwitch itself debuted in the 2018 DGX-2). Where the DGX-1 had NVLink connections between specific GPU pairs, the DGX A100 connected every GPU to shared NVSwitch chips, providing full bisection bandwidth at 600 GB/s per GPU. This made tensor parallelism across all 8 GPUs of a single node practical.

DGX H100 and the 900 GB/s Wall (2022-2023)

The DGX H100 pushed intra-node bandwidth to 900 GB/s per GPU using NVLink 4.0 and NVSwitch 3.0. The same generation also introduced NVLink-C2C, the coherent chip-to-chip link underpinning the Grace Hopper unified memory architecture. The practical achievement: an 8-GPU DGX H100 has 3.6 TB/s of NVSwitch bisection bandwidth (7.2 TB/s of aggregate GPU-side NVLink bandwidth). That is larger than the memory bandwidth of most single GPUs from five years prior.


Core Concepts

DGX H100 Full Architecture

The DGX H100 is not a single product but an integrated system where every component is spec'd around the GPU communication requirements. Understanding the full BOM (bill of materials) and why each component exists:

GPU Complex: 8x H100 SXM5 GPUs
Each H100 SXM5 has 80 GB of HBM3 memory delivering 3.35 TB/s of bandwidth, plus 18 NVLink 4.0 connections. The SXM5 form factor (as opposed to PCIe) is required for NVLink - the additional physical connections cannot fit in a PCIe card slot. Total GPU memory: 640 GB. Total node compute: roughly 32 PFLOPS FP8 and 16 PFLOPS FP16/BF16 (with sparsity).

NVSwitch 3.0 Fabric: 4 NVSwitch chips
Each NVSwitch 3.0 chip provides 7.2 TB/s of all-to-all switching bandwidth; four chips provide 28.8 TB/s of total fabric bandwidth. Each GPU connects to all four NVSwitch chips via its 18 NVLink connections (4-5 connections per NVSwitch chip). The result: any-to-any GPU communication at the full 900 GB/s bidirectional per GPU, with all 8 GPUs driving their NVLink bandwidth simultaneously and no contention.

CPU Complex: 2x Intel Xeon Platinum 8480C
Two CPUs at 56 cores each = 112 physical cores total. The dual-CPU configuration is driven less by compute than by PCIe lane count: each CPU provides its own PCIe Gen5 root complex, and four GPUs hang off each CPU's PCIe fabric over dedicated Gen5 x16 links (64 GB/s per direction per GPU). This matters for data loading and checkpoint saving, which still travel over PCIe.

The CPU selection is aggressive: Sapphire Rapids with AMX (Advanced Matrix Extensions) instructions, which accelerate CPU-side preprocessing and tokenization. The 8480C also supports DDR5-4800, providing the memory bandwidth needed to feed training data to 8 GPUs.

System Memory: 2 TB DDR5
2 TB (16x 128 GB DIMMs) running at DDR5-4800. This might seem excessive for a system where all the real work happens in GPU memory, but the 2 TB serves three purposes: (1) data staging - a batch of training data can be loaded into system RAM before being copied to the GPUs, (2) activation offload - with gradient checkpointing and offload enabled, activations can spill to CPU RAM, and (3) model sharding for CPU offload (ZeRO-Infinity can use NVMe and CPU RAM for parameter storage).
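A minimal sketch of the data-staging pattern that the large system RAM enables: batches are staged in pinned (page-locked) host memory so the PCIe copy to the GPU can overlap with compute. The tensor shapes, stream usage, and the helper name are illustrative assumptions, not a DGX-specific API.

import torch

# Stage a batch in pinned host RAM, then copy it to the GPU asynchronously.
# Pinned memory is what allows the host-to-device PCIe transfer to overlap with compute.
copy_stream = torch.cuda.Stream()

def stage_and_upload(batch_cpu: torch.Tensor, device: torch.device) -> torch.Tensor:
    pinned = batch_cpu.pin_memory()  # staged in system RAM, page-locked
    with torch.cuda.stream(copy_stream):
        return pinned.to(device, non_blocking=True)  # async copy over PCIe

# Usage sketch with a hypothetical image batch
batch = torch.randn(32, 3, 224, 224)
gpu_batch = stage_and_upload(batch, torch.device("cuda:0"))
torch.cuda.current_stream().wait_stream(copy_stream)  # wait before consuming gpu_batch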

Network: 8x 400 Gb/s ConnectX-7 + 1x Mgmt ConnectX-7
Each GPU has its own dedicated ConnectX-7 NIC providing 400 Gbps of InfiniBand NDR bandwidth. The 1:1 GPU-to-NIC ratio is critical: inter-node NCCL traffic from each GPU has a dedicated 400 Gbps path to the network, with no NIC sharing. The ConnectX-7 supports both InfiniBand and RoCEv2 (RDMA over Converged Ethernet), giving the DGX flexibility to connect to either IB or Ethernet fabrics.

Eight NICs at 400 Gbps = 3.2 Tbps (400 GB/s) of inter-node bandwidth per DGX node. That is roughly 18x less than the aggregate intra-node NVLink fabric (900 GB/s per GPU x 8 = 7.2 TB/s), which reflects the fundamental design principle: intra-node communication should always be faster than inter-node.

Storage: 8x 3.84 TB NVMe SSDs
30.7 TB of local NVMe storage (U.2 drives) for checkpoint caching and training data staging. The 8-drive configuration provides roughly 50 GB/s of sequential read bandwidth - enough to keep 8 GPUs fed even for image or video datasets where data loading is the bottleneck.

Power and Cooling
The DGX H100 consumes up to 10.2 kW under full load. For comparison, a typical office building circuit is 2-4 kW, and a residential home's entire electrical load is 5-10 kW. This is not a server you plug into standard datacenter racks without planning. It requires 240V, 3-phase power delivery, and the cooling must handle 10.2 kW of heat - roughly equivalent to 10 electric space heaters running simultaneously. NVIDIA recommends rear-door heat exchangers or direct liquid cooling for dense DGX deployments.

HGX H100: The OEM Building Block

The HGX H100 is the GPU board assembly that contains the 8 H100 SXM5 GPUs, 4 NVSwitch chips, and all interconnects - everything except the CPUs, system RAM, NICs, and chassis. NVIDIA sells this board to server OEMs (Dell, HPE, Lenovo, Supermicro) who build their own server chassis around it.

The key advantage of HGX over DGX for large-scale deployment: OEMs can integrate the HGX board into their existing chassis designs, datacenter management software, and support contracts. A hyperscaler like Meta or Google doesn't want to be locked into NVIDIA's specific CPU, NIC, and storage choices. HGX gives them the GPU fabric (the part they can't build themselves) while allowing customization of everything else.

The HGX H100 is electrically and physically identical to what's inside the DGX H100 from the GPU interconnect perspective. Same NVLink bandwidth, same NVSwitch topology, same 900 GB/s all-to-all. The difference is integration: DGX is a validated, tested, optimized complete system. HGX is a component that requires OEM integration work to achieve DGX-equivalent performance.

An HGX H100 4-GPU variant also exists for budget-constrained deployments. It uses 4 H100 SXM5 GPUs connected point-to-point over NVLink (the 4-GPU baseboard omits NVSwitch), so each GPU splits its 18 links across three peers rather than getting switched 900 GB/s all-to-all. Half the GPU capacity at roughly 55% of the full 8-GPU HGX price.

DGX SuperPOD: Rack-Level Architecture

A single DGX node is a training server. A DGX SuperPOD is a complete training cluster - the reference architecture for 100-billion to 1-trillion parameter model training. The current (2024) DGX H100 SuperPOD design is built from scalable units of 32 DGX H100 nodes.

Compute: 32 DGX H100 nodes x 8 H100 GPUs = 256 H100 GPUs total

Network Fabric: NVIDIA Quantum-2 InfiniBand NDR (400 Gbps)
Every DGX node's 8 ConnectX-7 NICs connect to a two-level fat-tree InfiniBand fabric built from 64-port Quantum-2 NDR switches at both the leaf and spine layers. The full two-level fat-tree provides full bisection bandwidth - any two GPUs in the SuperPOD can communicate at the full 400 Gbps simultaneously, with no blocking. This non-blocking fabric is what distinguishes a DGX SuperPOD from a commodity cluster where inter-rack bandwidth is oversubscribed.
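A hedged sketch of what oversubscription does to effective inter-node bandwidth. The node, GPU, and NIC figures are the SuperPOD numbers quoted above; the cost model (effective bandwidth = NIC bandwidth / oversubscription ratio) is deliberately simple.

# Effective per-GPU inter-node bandwidth under fabric oversubscription.
nodes, gpus_per_node, nic_gbps = 32, 8, 400  # SuperPOD figures quoted above
nic_gb_s = nic_gbps / 8                      # 400 Gbps = 50 GB/s per GPU

for oversub in (1, 2, 4):
    eff_gb_s = nic_gb_s / oversub
    bisection_tb_s = nodes * gpus_per_node * eff_gb_s / 2 / 1000
    print(f"{oversub}:1 fabric -> {eff_gb_s:5.1f} GB/s per GPU, "
          f"{bisection_tb_s:4.1f} TB/s cluster bisection bandwidth")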

SHARP Aggregation: in-network compute
NVIDIA's Quantum-2 InfiniBand switches include a feature called SHARP (Scalable Hierarchical Aggregation and Reduction Protocol). SHARP offloads the all-reduce reduction computation into the switch fabric itself. Instead of routing all gradient data through GPU memory for reduction, the switch performs the reduction as data flows through it. For all-reduce over small-to-medium tensors, SHARP reduces latency by 2-4x because it eliminates the round-trip to GPU memory.

SHARP is provisioned as part of DGX SuperPOD deployments. To enable it from NCCL:

export NCCL_ALGO=CollNet # SHARP-enabled collective algorithm
export SHARP_COLL_ENABLE_SAT=1

Storage: NVIDIA Base Command / BeeGFS
The storage layer is typically a BeeGFS parallel filesystem distributed across storage nodes in the SuperPOD. It is sized for aggregate read bandwidth on the order of hundreds of GB/s across the 32-node scalable unit - enough to keep all 256 GPUs fed for most workloads. The storage nodes use NVMe SSDs and connect to the same InfiniBand fabric as the compute nodes.

Management: NVIDIA Base Command Manager
BCM handles cluster provisioning, job scheduling, health monitoring, and telemetry. Every GPU, NIC, NVSwitch, and storage device is monitored. The telemetry feeds into dashboards that show per-GPU utilization, NVLink health, IB link state, power consumption, and thermal data in real time.

Comparing DGX A100 vs H100 vs GB200 NVL72

Understanding how the architecture has evolved helps predict where it's going and which generation is right for your workload:

| Spec | DGX A100 | DGX H100 | GB200 NVL72 |
| --- | --- | --- | --- |
| GPUs | 8x A100 80GB | 8x H100 SXM5 80GB | 72x B200 (Blackwell) + 36x Grace CPUs |
| Intra-node bandwidth | 600 GB/s per GPU | 900 GB/s per GPU | 1.8 TB/s per GPU |
| GPU memory | 640 GB HBM2e | 640 GB HBM3 | 72x 192 GB HBM3e = 13.8 TB |
| Compute (FP8, sparse) | n/a (A100 predates FP8) | ~32 PFLOPS | ~720 PFLOPS (entire NVL72) |
| Inter-node NIC | 8x 200 Gbps IB HDR | 8x 400 Gbps IB NDR | NVLink 5 within the rack; InfiniBand/Ethernet between racks |
| Power | 6.5 kW per node | 10.2 kW per node | ~120 kW (entire NVL72 rack) |
| NVSwitch version | NVSwitch 2.0 | NVSwitch 3.0 | NVSwitch 4.0 |

The GB200 NVL72 represents a fundamental shift: rather than 8 GPUs in a node, it is 72 Blackwell GPUs in a rack, all connected via NVLink 5.0 and NVSwitch 4.0, effectively making the entire rack appear as a single logical compute unit. The NVL72 has 13.8 TB of unified GPU memory addressable by any GPU in the rack. This changes the architecture of tensor parallelism - you no longer need to think about node boundaries for intra-rack parallelism.


Code Examples

Checking Your System Topology

# Verify NVLink connectivity and bandwidth
nvidia-smi nvlink --status -i 0 # show all NVLink connections for GPU 0
nvidia-smi nvlink --capabilities -i 0 # show NVLink capabilities

# Check NVSwitch presence (DGX nodes will show NVSwitch as a device)
nvidia-smi topo --matrix # full topology matrix
# Example output for DGX H100:
# GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0...
# GPU0 X NV18 NV18 NV18 NV18 NV18 NV18 NV18 SYS ...
# GPU1 NV18 X NV18 NV18 NV18 NV18 NV18 NV18 SYS ...
# NV18 = 18 NVLink connections via NVSwitch - this is the DGX topology

# Compare with a PCIe-only server (bad topology for training):
# GPU0 X PIX PHB PHB SYS SYS SYS SYS
# PIX = same PCIe switch, PHB = through host bridge, SYS = through QPI/UPI

# Measure actual NVLink bandwidth
# p2pBandwidthLatencyTest is part of CUDA samples
./p2pBandwidthLatencyTest
# Look for: "Bidirectional P2P=Enabled Bandwidth"
# DGX H100 target: 400-450 GB/s per GPU pair (NVLink 4.0)
# PCIe Gen4 x16: 32 GB/s per pair

# Test all-to-all bandwidth
./simpleP2P # basic P2P test

Topology-Aware NCCL Configuration for DGX H100

# Optimal NCCL environment for DGX H100 single-node
# These settings match the physical topology of DGX H100

# Tell NCCL to use all NVLink channels
export NCCL_MIN_NCHANNELS=16
export NCCL_MAX_NCHANNELS=32

# LL128 protocol is best for NVLink (high bandwidth, low latency)
export NCCL_PROTO=LL128

# Larger buffer improves pipeline utilization for large gradients
export NCCL_BUFFSIZE=33554432 # 32 MB

# Enable NVLink topology detection (should be auto-detected, but be explicit)
export NCCL_P2P_LEVEL=NVL # use NVLink for intra-node P2P

# Use all 8 InfiniBand NICs for inter-node traffic
export NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7

# Enable GPUDirect RDMA
export NCCL_NET_GDR_LEVEL=2

# Keep InfiniBand transport enabled (0 = do not disable IB); NCCL only falls back to sockets if IB init fails
export NCCL_IB_DISABLE=0

# For DGX SuperPOD with SHARP:
# export NCCL_ALGO=CollNet # enable SHARP offload for small collectives
# export NCCL_COLLNET_ENABLE=1

Measuring P2P Bandwidth in PyTorch

import torch
import time
from typing import Optional

def nvlink_bandwidth_test(
    src_gpu: int = 0,
    dst_gpu: int = 1,
    tensor_size_gb: float = 4.0,
    n_iters: int = 20,
) -> dict[str, float]:
    """
    Measures GPU-to-GPU P2P bandwidth.
    On DGX H100, expect 400-450 GB/s per GPU pair.
    On PCIe-only server, expect 25-32 GB/s per pair.
    """
    n_elements = int(tensor_size_gb * 1e9 / 4)  # float32

    # Create source tensor on src_gpu
    with torch.cuda.device(src_gpu):
        src = torch.randn(n_elements, dtype=torch.float32, device=f"cuda:{src_gpu}")

    # Create destination tensor on dst_gpu
    with torch.cuda.device(dst_gpu):
        dst = torch.empty(n_elements, dtype=torch.float32, device=f"cuda:{dst_gpu}")

    # Check if P2P access is enabled between these GPUs
    can_p2p = torch.cuda.can_device_access_peer(src_gpu, dst_gpu)
    if not can_p2p:
        print(f"WARNING: P2P access not available between GPU {src_gpu} and GPU {dst_gpu}")
        print("This is a PCIe-only path or NVLink is disabled")

    # Warm up
    for _ in range(5):
        dst.copy_(src, non_blocking=True)
    torch.cuda.synchronize(src_gpu)
    torch.cuda.synchronize(dst_gpu)

    # Benchmark
    times = []
    for _ in range(n_iters):
        torch.cuda.synchronize(src_gpu)
        t0 = time.perf_counter()
        dst.copy_(src, non_blocking=True)
        torch.cuda.synchronize(dst_gpu)
        times.append(time.perf_counter() - t0)

    tensor_bytes = n_elements * 4
    times_sorted = sorted(times)
    median_time = times_sorted[n_iters // 2]
    min_time = times_sorted[0]

    median_bw = tensor_bytes / median_time / 1e9
    peak_bw = tensor_bytes / min_time / 1e9

    result = {
        "src_gpu": src_gpu,
        "dst_gpu": dst_gpu,
        "tensor_gb": tensor_size_gb,
        "median_bw_gbps": median_bw,
        "peak_bw_gbps": peak_bw,
        "p2p_enabled": can_p2p,
    }

    print(f"GPU {src_gpu} -> GPU {dst_gpu}: "
          f"median={median_bw:.1f} GB/s, peak={peak_bw:.1f} GB/s "
          f"({'NVLink' if can_p2p else 'PCIe'})")

    return result


def run_all_pairs_bandwidth_test() -> None:
    """Run P2P bandwidth test for all GPU pairs."""
    n_gpus = torch.cuda.device_count()
    print(f"Testing all {n_gpus*(n_gpus-1)//2} GPU pairs on {n_gpus} GPUs")
    print("-" * 60)

    results = {}
    for i in range(n_gpus):
        for j in range(i + 1, n_gpus):
            results[(i, j)] = nvlink_bandwidth_test(i, j)

    # Summary
    bandwidths = [r["median_bw_gbps"] for r in results.values()]
    print(f"\nSummary:")
    print(f"  Min pair bandwidth: {min(bandwidths):.1f} GB/s")
    print(f"  Max pair bandwidth: {max(bandwidths):.1f} GB/s")
    print(f"  Mean pair bandwidth: {sum(bandwidths)/len(bandwidths):.1f} GB/s")

    if min(bandwidths) < 50.0:
        print("\nWARNING: Some GPU pairs have very low bandwidth (< 50 GB/s)")
        print("These are likely PCIe paths. Check topology with: nvidia-smi topo --matrix")

    if min(bandwidths) > 300.0:
        print("\nAll pairs have NVLink bandwidth. DGX/HGX topology confirmed.")


if __name__ == "__main__":
    run_all_pairs_bandwidth_test()

Setting Up Distributed Training Optimally on DGX H100

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def configure_for_dgx_h100() -> None:
    """
    Set optimal environment variables for DGX H100 before initializing NCCL.
    Call this before dist.init_process_group().
    """
    # NVLink configuration
    os.environ.setdefault("NCCL_MIN_NCHANNELS", "16")
    os.environ.setdefault("NCCL_BUFFSIZE", "33554432")  # 32 MB

    # All 8 InfiniBand NICs (DGX H100 has mlx5_0 through mlx5_7)
    os.environ.setdefault(
        "NCCL_IB_HCA",
        "mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7"
    )

    # GPUDirect RDMA for inter-node
    os.environ.setdefault("NCCL_NET_GDR_LEVEL", "2")

    # Generous timeout so long collectives are not aborted;
    # PyTorch's process group timeout still governs hang detection
    os.environ.setdefault("NCCL_TIMEOUT", "3600")

    # For SuperPOD with SHARP switches:
    # os.environ.setdefault("NCCL_ALGO", "CollNet")
    # os.environ.setdefault("NCCL_COLLNET_ENABLE", "1")


def setup_dgx_training(rank: int, world_size: int, local_rank: int) -> torch.device:
    """
    Initialize distributed training on DGX H100.

    Args:
        rank: global rank (0 to world_size-1)
        world_size: total number of GPUs across all nodes
        local_rank: GPU index on this node (0-7)
    """
    configure_for_dgx_h100()

    # Set GPU before init_process_group to ensure NCCL uses correct device
    torch.cuda.set_device(local_rank)
    device = torch.device(f"cuda:{local_rank}")

    dist.init_process_group(
        backend="nccl",
        init_method="env://",
        rank=rank,
        world_size=world_size,
    )

    # Verify topology is as expected
    if rank == 0:
        n_local = torch.cuda.device_count()
        print(f"Initialized {world_size} ranks, {n_local} GPUs per node")

    # Check that NCCL sees NVLink topology
    # This will be visible in NCCL_DEBUG=INFO output
    # Look for: "Channel 00 : 0[...] -> 1[...] via NVL/IPC"

    return device


def build_model_for_dgx(model: torch.nn.Module, rank: int, local_rank: int) -> DDP:
    """
    Wrap model in DDP with DGX H100 optimal settings.
    """
    model = model.cuda(local_rank)

    # gradient_as_bucket_view: saves a memory copy by using the gradient tensor directly
    # find_unused_parameters: MUST be False for performance (verify no unused params first)
    # bucket_cap_mb: larger buckets for large models, smaller for small models
    model = DDP(
        model,
        device_ids=[local_rank],
        output_device=local_rank,
        gradient_as_bucket_view=True,
        find_unused_parameters=False,
        bucket_cap_mb=256,  # 256 MB buckets for large model training
    )

    return model

Monitoring NVLink and InfiniBand Health

# NVLink health monitoring
# Check for NVLink errors (should all be 0 in healthy operation)
nvidia-smi nvlink --errorcounters -i 0

# Example healthy output:
# GPU 00000000:03:00.0 NVLink Error Counters:
# Link 0: Replay Errors: 0
# Link 0: Recovery Errors: 0
# Link 0: CRC FLIT Errors: 0
# Link 0: CRC Data Errors: 0
# Non-zero CRC errors indicate cable/connector issues - replace before training

# Monitor NVLink utilization during training
watch -n1 'nvidia-smi nvlink --utilization -i 0 | head -20'

# InfiniBand health check
ibstat # show all IB ports and link state (should be "Active")
ibstatus # more detailed status
perfquery # show performance counters

# Check for IB errors (should be 0)
perfquery -x # extended performance counters including error counts
# Non-zero "PortXmitDiscards" or "PortRcvErrors" indicate network issues

# Monitor IB bandwidth during training
ibdump -d mlx5_0 -i 1 # capture traffic on port 1 of mlx5_0

# Full cluster InfiniBand health check (run from management node)
ibnetdiscover | grep -c "^Ca" # should equal num_nodes * NICs_per_node

Architecture Diagrams

DGX H100 Internal Topology

DGX SuperPOD Fabric Architecture

DGX vs HGX vs OEM PCIe Bandwidth Comparison


Production Engineering Notes

Rack Layout and Power Phasing

A DGX H100 is an 8U chassis that draws 10.2 kW at peak. A standard 42U datacenter rack can physically hold four to five nodes, but even four nodes is roughly 41 kW of power density - well above the 10-15 kW that most standard datacenter floor tiles are provisioned for. When deploying DGX at scale, work with your datacenter provider on power phasing before any hardware is racked. A DGX SuperPOD (32 nodes) requires roughly 326 kW of power - on the order of the consumption of 300 homes.
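A small planning calculation, using the per-node figures above (8U chassis, 10.2 kW peak). The per-rack power budgets are illustrative values you would replace with your facility's numbers.

NODE_U, NODE_KW, RACK_U = 8, 10.2, 42  # DGX H100 chassis height and peak draw

def dgx_nodes_per_rack(rack_power_budget_kw: float) -> int:
    """How many DGX H100 nodes fit, limited by rack space and by the power budget."""
    by_space = RACK_U // NODE_U
    by_power = int(rack_power_budget_kw // NODE_KW)
    return min(by_space, by_power)

for budget_kw in (15, 30, 45):  # illustrative per-rack power budgets
    n = dgx_nodes_per_rack(budget_kw)
    print(f"{budget_kw:>3} kW/rack budget -> {n} node(s), {n * NODE_KW:.1f} kW drawn")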

Three-phase power distribution at the rack is effectively mandatory for DGX deployments. Each DGX H100 has multiple redundant power supplies that accept 200-240V input, fed from rack PDUs that balance the load across the three phases. Undersized single-phase circuits will either fail to power the system or trip overcurrent protection under full training load.

For cooling, NVIDIA recommends 27 degrees Celsius or lower inlet air temperature at the front of the rack. At 10.2 kW per 8U node, standard front-to-back air cooling struggles at densities of 4-5 DGX nodes per rack. Consider rear-door heat exchangers (passive liquid cooling attached to the back of each rack) or direct liquid cooling (DLC) options for dense DGX H100 deployments.

Firmware and Driver Alignment

DGX performance is sensitive to the exact combination of GPU firmware, NVSwitch firmware, NIC firmware, and CUDA driver version. NVIDIA releases DGX Software Stack (DSS) bundles that specify the validated combination. Running a mismatched stack - for example, a new CUDA driver with old NVSwitch firmware - can cause mysterious 10-20% performance regressions that are nearly impossible to root-cause without knowing the version dependency.

# Check current DGX component versions
nvidia-smi -q | grep -E "CUDA Version|Driver Version"
cat /proc/driver/nvidia/version

# Check NVSwitch firmware
nvidia-smi -q | grep -i "nvswitch" | head -20

# For DGX OS (NVIDIA's validated Ubuntu stack), update via:
sudo apt-get update && sudo apt-get install -y dgx-software-stack

# Verify NVSwitch fabric manager is running
systemctl status nvidia-fabricmanager
# If stopped, NVLink between GPUs won't work:
sudo systemctl start nvidia-fabricmanager
sudo systemctl enable nvidia-fabricmanager

The NVIDIA Fabric Manager is a daemon that manages NVSwitch routing tables. Without it running, NVLink is not initialized and GPUs cannot communicate via NVSwitch. This is a common "gotcha" when setting up a fresh DGX system - everything else looks correct but GPU-to-GPU P2P fails because the Fabric Manager was not started or enabled.

Topology-Aware Process Launch

When launching multi-node training on DGX SuperPOD, the process-to-GPU mapping matters for NCCL topology optimization. NCCL performs better when local ranks 0-7 correspond to GPUs 0-7 on the same physical node. Incorrect process placement - for example, two processes from different nodes sharing local rank 0 - causes NCCL to misidentify the topology and use slower inter-node paths for what should be intra-node communication.

# Correct: use --map-by ppr:8:node to ensure 8 processes per node
mpirun \
-np 256 \
--hostfile /etc/dgx_hostfile \
--map-by ppr:8:node \
--bind-to numa \
-x CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python train.py
# Open MPI exports OMPI_COMM_WORLD_LOCAL_RANK to each process;
# read it inside train.py to derive LOCAL_RANK (0-7 on each node)

# Or with torchrun (preferred for PyTorch):
torchrun \
--nnodes=32 \
--nproc_per_node=8 \
--rdzv_id=job123 \
--rdzv_backend=c10d \
--rdzv_endpoint=dgx-node-00:29500 \
train.py

Cost Analysis: DGX vs Cloud vs PCIe OEM

At scale, the right infrastructure choice depends on workload duration and bandwidth requirements. A useful mental model:

For a 70B parameter model training run with data parallelism only (no tensor parallelism), PCIe OEM servers may be acceptable - the all-reduce pattern doesn't stress intra-node bandwidth nearly as much as tensor parallel. The training will be slower, but not catastrophically so.

For a 175B+ model with tensor parallelism degree 8 (required to fit the model in memory even with activation checkpointing), intra-node bandwidth is on the critical path of every forward pass. A PCIe OEM server at 128 GB/s aggregate vs DGX at 7.2 TB/s means tensor parallel matmuls take 56x longer to communicate. This makes tensor parallelism on PCIe servers impractical - you'd need to use pipeline parallelism instead, which introduces pipeline bubbles and requires larger batch sizes that hurt convergence.
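A rough sketch of that communication gap for one training step. The layer dimensions, all-reduce count, and effective bandwidths below are illustrative assumptions, and the ring all-reduce cost model is deliberately simple.

# Intra-node tensor-parallel all-reduce time per forward pass, PCIe vs NVLink (rough model).
hidden, seq, batch, bytes_per_elem = 12288, 4096, 1, 2  # assumed 175B-class layer, bf16
tp_degree, allreduces_per_layer, layers = 8, 4, 96

activation_bytes = batch * seq * hidden * bytes_per_elem
per_ar_bytes = 2 * (tp_degree - 1) / tp_degree * activation_bytes  # ring all-reduce traffic per GPU

for name, bw_gb_s in [("PCIe (effective)", 8.0), ("NVLink 4.0", 450.0)]:
    t_s = per_ar_bytes / (bw_gb_s * 1e9) * allreduces_per_layer * layers
    print(f"{name:>17}: ~{t_s * 1e3:8.0f} ms of intra-node comm per forward pass")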

The DGX total cost of ownership (TCO) over a 3-year depreciation horizon is roughly 2-2.5x higher than PCIe OEM servers at the same GPU count. But if your workload requires tensor parallelism and you're using PCIe servers, you're effectively running at 30-40% MFU (model FLOP utilization) compared to 55-65% on DGX. The TCO per effective FLOP is actually lower on DGX for these workloads.


Common Mistakes

:::danger Running Training Without nvidia-fabricmanager

On DGX and HGX systems, nvidia-fabricmanager must be running for NVSwitch to initialize the NVLink routing fabric. If the Fabric Manager is not running, GPUs will fall back to PCIe for all communication, and you will see NVLink bandwidth of 0 in nccl-tests even though the hardware is present. This is silent - NCCL will not error, it will just use PCIe paths and run 20-30x slower than expected for intra-node all-reduce. Always verify Fabric Manager status before launching training jobs on DGX systems.

# Check - must show "active (running)"
systemctl status nvidia-fabricmanager

# If not running:
sudo systemctl start nvidia-fabricmanager
# Wait 30 seconds for fabric initialization, then retry NCCL

:::

:::danger Mismatched World Size and GPU Count Leading to NVSwitch Deadlock

On DGX systems, the 8 NVSwitch-connected GPUs are designed to be scheduled together as a unit. If you run a training job using only 4 of the 8 GPUs on a DGX node (e.g., CUDA_VISIBLE_DEVICES=0,1,2,3), the NVSwitch routing tables are configured for 8 GPUs while only 4 are active. Depending on the NCCL version and NVSwitch firmware, this can cause hangs or reduced bandwidth. When using a DGX node, allocate all 8 GPUs, or use an explicitly supported partitioning scheme (such as MIG) rather than an ad hoc subset.

:::

:::warning InfiniBand NIC Assignment Must Match GPU NUMA Affinity

On DGX H100, GPUs 0-3 connect to CPU 0's PCIe complex, and GPUs 4-7 connect to CPU 1's PCIe complex. The 8 ConnectX-7 NICs are similarly split: NICs 0-3 attach to CPU 0, NICs 4-7 to CPU 1. For optimal performance, NCCL should use NICs 0-3 for GPUs 0-3 and NICs 4-7 for GPUs 4-7 to avoid NUMA crossings. If you set NCCL_IB_HCA without respecting this affinity, inter-node traffic from GPUs 0-3 may route through CPU 1's memory bus (a NUMA crossing), adding 20-30% latency to every inter-node collective.

The correct way to handle this is to let NCCL's auto-detection handle NIC-to-GPU affinity (the default in NCCL 2.16+). Only override NCCL_IB_HCA if you have a specific reason, and if you do, list all 8 NICs so NCCL can choose the appropriate one per GPU.

:::
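If you do need to pin NICs by hand, the sketch below shows an affinity-respecting assignment. The mlx5_N device names and the GPU/NIC-to-CPU split are the DGX H100 layout described above, and the helper name is illustrative; verify the actual layout on your system with nvidia-smi topo -m before relying on it.

import os

# Manual NIC pinning - only if you must override NCCL's auto-detection. On DGX H100,
# GPUs/NICs 0-3 sit under CPU 0 and GPUs/NICs 4-7 under CPU 1; verify with `nvidia-smi topo -m`.
def pin_local_nic(local_rank: int) -> None:
    nic = f"mlx5_{local_rank}"        # 1:1 GPU-to-NIC mapping described above
    os.environ["NCCL_IB_HCA"] = nic   # restrict this rank's inter-node traffic to its local NIC

# Usage: call before dist.init_process_group(), e.g.
# pin_local_nic(int(os.environ["LOCAL_RANK"]))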

:::warning PCIe OEM Servers Cannot Effectively Run Tensor Parallelism

Engineers frequently attempt to configure tensor parallel training (using Megatron-LM or DeepSpeed tensor parallelism) on PCIe-based servers to fit large models in memory. The communication pattern for tensor parallelism requires all-reduce inside each transformer layer - typically 4-8 times per layer. On a 128-layer model with tensor parallel degree 8, this is 512-1024 intra-node all-reduces per training step. At PCIe speeds (5-10 GB/s effective per pair), these all-reduces dominate training time and GPU utilization collapses to under 30%. Use pipeline parallelism on PCIe servers instead, which has far less intra-node communication. Save tensor parallelism for NVLink hardware.

:::


Interview Q&A

Q1: What is the difference between a DGX H100 and an HGX H100, and when would you choose each?

The DGX H100 is a complete, validated server system: chassis, power supplies, CPUs, RAM, NICs, storage, GPU board, and NVSwitch fabric all integrated and tested by NVIDIA. The HGX H100 is only the GPU board assembly - the 8 H100 SXM5 GPUs, 4 NVSwitch chips, and interconnect substrate, sold to OEM manufacturers who integrate it into their own server chassis.

Choose DGX when: you want a single vendor to be responsible for the entire system, you need NVIDIA's validated DGX Software Stack with guaranteed compatibility, or your procurement process is simpler with a complete system. DGX is also the better choice if your team lacks the expertise to validate and optimize an OEM integration.

Choose HGX when: you are a hyperscaler or large OEM who wants control over the CPU choice, NIC selection, chassis design, cooling approach, and management software. HGX lets you build a custom server that matches your datacenter's specific infrastructure (e.g., liquid cooling systems, custom rack dimensions, specific power delivery). HGX also allows using ARM-based CPUs (like NVIDIA's own Grace CPU) as the host processor, which can reduce cost and improve PCIe bandwidth in some configurations.

From a GPU compute and NVLink fabric perspective, DGX and HGX are identical. Any performance difference comes from the OEM integration quality, not the GPU hardware itself.

Q2: Why does NVSwitch enable full all-to-all bandwidth at 900 GB/s per GPU, but direct NVLink connections could not achieve the same thing at 8 GPUs?

With direct NVLink connections (no switch), achieving full all-to-all bandwidth between 8 GPUs would require a dedicated high-bandwidth path for every pair - $\binom{8}{2} = 28$ unique links in total. Each H100 has 18 NVLink connections; split across 7 peers, that leaves at most 2-3 links (roughly 100-150 GB/s) per pair, nowhere near the 900 GB/s each GPU can drive.

NVSwitch solves this with shared switching fabric. Each GPU connects its 18 NVLink connections to 4 NVSwitch chips (roughly 4-5 connections per switch chip). The NVSwitch chips internally route traffic between any two GPU connections, providing full connectivity. The key is that NVSwitch operates like a non-blocking crossbar - it can route all 8 GPU-to-switch paths simultaneously without any contention, as long as the total bandwidth through the switch doesn't exceed the switch's capacity.

The 4 NVSwitch 3.0 chips in the DGX H100 each provide 7.2 TB/s of switching bandwidth, totaling 28.8 TB/s. With 8 GPUs each driving up to 900 GB/s of bidirectional traffic, the worst-case all-to-all demand is $8 \times 900$ GB/s = 7.2 TB/s. The 4-switch topology is sized to handle this with capacity to spare - roughly a 4x overprovisioned fabric from a worst-case all-to-all perspective.

Q3: Describe the DGX SuperPOD fat-tree topology and explain why full bisection bandwidth matters for large model training.

The DGX SuperPOD uses a two-level fat-tree InfiniBand topology. At the bottom (leaf) level, individual DGX nodes connect to leaf switches. At the top (spine) level, leaf switches connect to spine switches. Every leaf switch connects to every spine switch, providing redundant paths between any two nodes.

Full bisection bandwidth means that if you divide the cluster in half arbitrarily, the total bandwidth crossing that cut equals the total bandwidth available to one half of the cluster. Formally, for $N$ nodes each with bandwidth $B$, a full-bisection-bandwidth fabric provides $\frac{N}{2} \times B$ of cross-section bandwidth. In practice, this means that a training job distributed across all 32 DGX nodes in a SuperPOD can be running simultaneous all-to-all communication between all 256 GPUs, and no GPU pair is bandwidth-limited by network oversubscription.

Why it matters for training: data-parallel all-reduce during gradient synchronization sends data between all GPU pairs simultaneously. If the fabric is oversubscribed by 4:1 (common in cost-optimized enterprise clusters), then during a 256-GPU all-reduce, effectively only 64 GPUs worth of bandwidth is available for the inter-node communication - the other 192 GPUs are waiting for network access. This slows gradient synchronization proportionally to the oversubscription ratio. For a training job where communication is already 20-30% of step time on full-bisection hardware, 4:1 oversubscription pushes communication to 60-80% of step time, collapsing MFU.

Q4: What is NVIDIA SHARP, how does it work, and what types of workloads benefit most from it?

SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) offloads collective compute operations into the InfiniBand switch fabric itself. In a conventional all-reduce, data flows from GPU memory through the NIC, through the switch, back through NICs, into GPU memory on each participating node, where the reduction (summing) happens, and then the result travels back through the network to all nodes. SHARP inserts the reduction computation directly into the switch data path - data flows in, gets summed as it passes through, and the result flows out, without any round-trip to GPU memory.

The performance benefit is significant for small-to-medium tensors: SHARP reduces all-reduce latency by eliminating the round-trip to GPU memory (saving two network traversals per all-reduce). For a 32-node cluster with a 10 microsecond InfiniBand round trip, eliminating the return trip saves roughly 10 microseconds per all-reduce. In a transformer model with 128 layers and 4 all-reduces per layer, that is about 5 milliseconds of communication latency removed per forward-backward pass.
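The arithmetic behind that figure, using the illustrative numbers from the example above:

# Illustrative SHARP latency saving per step, using the figures above.
rtt_saved_us = 10                      # one avoided network round-trip per all-reduce
layers, allreduces_per_layer = 128, 4

saved_ms = layers * allreduces_per_layer * rtt_saved_us / 1000
print(f"~{saved_ms:.1f} ms of latency removed per forward-backward pass")  # ~5.1 ms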

SHARP is most beneficial for: pipeline-parallel training where many small gradient tensors are synchronized across pipeline stages, optimizer state synchronization in ZeRO-1 where optimizer statistics are small tensors, and any workload with high collective frequency (many small all-reduces) rather than few large ones. SHARP is less beneficial for large tensor all-reduces (where bandwidth dominates, not latency) or for intra-node NVLink communication (SHARP only applies to InfiniBand).

Q5: A training job on a DGX H100 cluster is achieving 45% MFU when theoretical analysis suggests 65% should be achievable. What is your systematic debugging approach?

This is a 20-point MFU gap, which is significant. Systematic approach:

Step 1: Profile compute vs communication breakdown. Use PyTorch Profiler with record_shapes=True and with_stack=True for 5 training steps. Look at the Trace View in TensorBoard - are the GPUs idle during backward pass (communication stall) or during forward pass (compute issue)?

Step 2: If communication is the bottleneck (GPU idle during backward), run nccl-tests all_reduce_perf and compare measured bus bandwidth to theoretical peak. If bandwidth is 60-70% of expected, check: (a) NCCL_IB_HCA is listing all 8 NICs, (b) GPUDirect RDMA is active (NCCL_NET_GDR_LEVEL=2), (c) NVLink channels are maximized (NCCL_MIN_NCHANNELS=16).

Step 3: If compute is the bottleneck (GPU busy but efficiency is low), check memory bandwidth utilization with ncu --metrics l1tex__t_bytes_pipe_lsu_mem_global_op_ld.sum.per_second. Low memory bandwidth relative to HBM3 peak (3.35 TB/s) suggests arithmetic intensity is too low - possibly from fp32 operations that should be fp16/bf16, or small batch sizes causing underutilization of tensor cores.

Step 4: Check for thermal throttling. nvidia-smi -q -d PERFORMANCE | grep -E "Perf|Power|Throttle". A DGX H100 at full thermal load with inadequate cooling will drop GPU clocks by 10-20%, directly impacting MFU.

Step 5: Verify bucket size alignment in DDP. If bucket_cap_mb is too small (default 25 MB), you may be firing hundreds of all-reduces per backward pass instead of a few large ones. The per-collective overhead adds up. Increase to 256 MB and re-measure.

Step 6: Check that find_unused_parameters=False is set. Even if all parameters are used, setting this to True forces DDP to do an extra traversal of the computation graph to find unused parameters, which adds overhead proportional to model size.

Q6: How does the GB200 NVL72 change the parallelism strategy for 1-trillion parameter models compared to a DGX H100 SuperPOD?

On a DGX H100 SuperPOD, a 1-trillion parameter model in bf16 requires 2 TB of GPU memory for the weights alone. A single DGX H100 node has 640 GB, so you need at least 4 nodes (2,560 GB) to hold the weights before accounting for activations and optimizer states. This forces 4-way pipeline parallelism at minimum, plus 8-way tensor parallelism within each node for operator-level parallelism. The pipeline parallel bubble overhead (proportional to the number of pipeline stages) and the tensor parallel communication overhead (all-reduces across 8 GPUs) both eat into MFU.

The GB200 NVL72 has 72 B200 GPUs in one rack, each with 192 GB HBM3e = 13.8 TB total. A 1-trillion parameter bf16 model fits entirely within a single NVL72 rack. All 72 GPUs can participate in tensor parallelism without any pipeline parallelism - eliminating pipeline bubbles entirely. The NVLink 5.0 fabric provides full all-to-all at 1.8 TB/s per GPU, making 72-way tensor parallelism feasible (NVLink bandwidth is sufficient to run tensor-parallel all-reduces across 72 GPUs without dominating step time).

The practical implication for parallelism strategy: on DGX H100 SuperPOD, the optimal config for a 1T model is roughly (data parallel = 32, pipeline parallel = 4, tensor parallel = 8). On GB200 NVL72, the same model can run (data parallel = N_racks, pipeline parallel = 1, tensor parallel = 72) - no pipeline parallelism, maximum tensor parallelism, no pipeline bubbles. Research from NVIDIA suggests this can improve MFU from roughly 45% (DGX H100, pipeline+tensor) to 60-70% (NVL72, tensor-only) for 1T parameter models.
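A rough way to quantify the pipeline-bubble part of this argument, using the standard GPipe/Megatron bubble-fraction formula; the microbatch count of 32 is an assumption for illustration.

# Pipeline bubble fraction ~ (p - 1) / (m + p - 1): p pipeline stages, m microbatches.
def bubble_fraction(stages: int, microbatches: int) -> float:
    return (stages - 1) / (microbatches + stages - 1)

for p in (1, 4, 8):
    print(f"{p} pipeline stage(s), 32 microbatches -> "
          f"{bubble_fraction(p, 32):.1%} of step time lost to bubbles")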


Appendix: DGX Configuration Reference

| DGX Generation | NVLink Version | Per-GPU Bandwidth | Total Intra-Node |
| --- | --- | --- | --- |
| DGX-1 (P100) | NVLink 1.0 | 160 GB/s (4 links) | 640 GB/s |
| DGX-1 (V100) | NVLink 2.0 | 300 GB/s (6 links) | 1.2 TB/s |
| DGX A100 | NVLink 3.0 | 600 GB/s (12 links) | 4.8 TB/s |
| DGX H100 | NVLink 4.0 | 900 GB/s (18 links) | 7.2 TB/s |
| GB200 NVL72 | NVLink 5.0 | 1.8 TB/s (18 links) | 130 TB/s (rack) |

NVSwitch Chip Generations

| NVSwitch Version | System Generation | Bandwidth per Chip | Chips per System |
| --- | --- | --- | --- |
| NVSwitch 1.0 | DGX-2 (V100) | 900 GB/s | 12 |
| NVSwitch 2.0 | DGX A100 | 3.6 TB/s | 6 |
| NVSwitch 3.0 | DGX H100 | 7.2 TB/s | 4 |
| NVSwitch 4.0 | GB200 NVL72 | 57.6 TB/s | 18 (9 switch trays per rack) |

Quick Topology Verification Checklist

Run these checks before every large training job on DGX systems:

#!/bin/bash
# dgx_preflight.sh - Run before training jobs

echo "=== DGX Pre-flight Check ==="

# 1. Verify Fabric Manager
echo -n "Fabric Manager status: "
systemctl is-active nvidia-fabricmanager || { echo "FAIL - start fabric manager"; exit 1; }

# 2. Verify all 8 GPUs are visible
GPU_COUNT=$(nvidia-smi -L | wc -l)
echo "GPU count: $GPU_COUNT (expected 8)"
[ "$GPU_COUNT" -eq 8 ] || echo "WARNING: expected 8 GPUs"

# 3. Check for NVLink errors
echo "NVLink error check:"
for i in $(seq 0 7); do
ERRORS=$(nvidia-smi nvlink --errorcounters -i $i 2>/dev/null | grep -v "^$" | grep -v "GPU" | awk '{sum+=$NF} END {print sum}')
echo " GPU $i: total NVLink errors = ${ERRORS:-0}"
done

# 4. Check NVLink utilization baseline (should be 0 before job starts)
echo "NVLink utilization (should be ~0 at idle):"
nvidia-smi nvlink --utilization -i 0 2>/dev/null | grep "Rx\|Tx" | head -4

# 5. Verify InfiniBand links are active
echo "InfiniBand link states:"
for port in $(ibstat 2>/dev/null | grep "Port " | awk '{print $2}' | tr -d ':'); do
STATE=$(ibstat 2>/dev/null | grep -A5 "Port $port" | grep "State:" | awk '{print $2}')
echo " Port $port: $STATE"
done

# 6. Check GPU temperatures (thermal throttling risk above 80 C)
echo "GPU temperatures:"
nvidia-smi --query-gpu=index,temperature.gpu --format=csv,noheader | \
awk -F',' '{printf " GPU %s: %s C%s\n", $1, $2, ($2+0 > 80) ? " WARNING: HIGH" : ""}'

# 7. Verify NCCL can initialize (run 1-step nccl-test)
echo "NCCL initialization test:"
if [ -f "./nccl-tests/build/all_reduce_perf" ]; then
./nccl-tests/build/all_reduce_perf -b 1G -e 1G -f 2 -g 8 -c 0 2>&1 | \
grep -E "busbw|FAILED|WARNING" | tail -3
else
echo " nccl-tests not found, skipping"
fi

echo "=== Pre-flight complete ==="

GPU Memory Capacity Planning

For distributed training, GPU memory usage breaks down into four components. Understanding each helps you set the right parallelism configuration before you start:

Model weights: $\text{params} \times \text{bytes per param}$. A 70B parameter model in bf16 = $70 \times 10^9 \times 2$ bytes = 140 GB. In fp32 = 280 GB.

Optimizer states: With AdamW in mixed precision, each parameter carries two optimizer moments plus an fp32 master copy of the weight, all stored in fp32 regardless of model precision: $\text{params} \times (4 + 4 + 4) = \text{params} \times 12$ bytes. For a 70B model: $70 \times 10^9 \times 12$ bytes = 840 GB - this dominates, and is why ZeRO optimizer-state sharding is required at scale.

Gradients: Same dtype as the model weights during backward, so $\text{params} \times 2$ bytes in bf16. For 70B: 140 GB.

Activations: Depend on batch size and sequence length. For a transformer with hidden size $H$, sequence length $S$, batch size $B$, and $L$ layers: roughly $12 \times B \times S \times H \times L$ bytes. For a 70B model ($H = 8192$, $L = 80$) with $B = 1$, $S = 4096$: about 32 GB. With activation checkpointing this drops to roughly $\sqrt{L} \times$ the per-layer activation memory.

The implication for DGX sizing: a single DGX H100 node with 640 GB of GPU memory can hold the weights of a 70B model (140 GB) but cannot hold the optimizer states (840 GB) without ZeRO sharding. With ZeRO-1 across 8 GPUs, the optimizer states shard down to 105 GB per GPU - now it fits. With ZeRO-3 across 32 nodes (256 GPUs), weights, gradients, and optimizer states all shard, and each GPU holds only about 4.4 GB of the 70B model's state. The sketch below turns these formulas into a quick calculator.
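A minimal sketch wrapping the formulas above into a calculator. It assumes bf16 weights with mixed-precision AdamW accounting as described, uses the rough 12·B·S·H·L activation estimate without checkpointing, and the function name and ZeRO-sharding simplification are illustrative.

def training_memory_gb(params_b: float, hidden: int, layers: int,
                       batch: int, seq: int, zero_stage: int = 0,
                       shards: int = 1) -> dict[str, float]:
    """Rough per-GPU memory estimate (GB) for bf16 weights + AdamW mixed precision."""
    p = params_b * 1e9
    weights = 2 * p / 1e9                            # bf16 weights
    grads = 2 * p / 1e9                              # bf16 gradients
    optimizer = 12 * p / 1e9                         # fp32 master copy + two Adam moments
    acts = 12 * batch * seq * hidden * layers / 1e9  # rough, without checkpointing

    # ZeRO: stage 1 shards optimizer states, stage 2 adds gradients, stage 3 adds weights
    if zero_stage >= 1: optimizer /= shards
    if zero_stage >= 2: grads /= shards
    if zero_stage >= 3: weights /= shards
    return {"weights": weights, "grads": grads, "optimizer": optimizer, "activations": acts}

# 70B model on one DGX H100 node with ZeRO-1 across 8 GPUs (matches the figures above)
print(training_memory_gb(70, hidden=8192, layers=80, batch=1, seq=4096,
                         zero_stage=1, shards=8))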

© 2026 EngineersOfAI. All rights reserved.