Kernel Bypass and DPDK

Reading time: ~40 min · Interview relevance: Medium-High · Target roles: AI Infrastructure Engineer, ML Systems Engineer, Distributed Training Specialist


Your 256-GPU training job is doing AllReduce every step. With a 7B parameter model at float32, each AllReduce moves approximately 28 GB of gradient data across the network - 28 GB that every GPU must send and receive every single step. At batch_size=2048 and step time of 800 milliseconds, the AllReduce budget is roughly 120 milliseconds. If it takes longer than that, communication is your bottleneck and GPUs sit idle.
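
To make that budget concrete, here is a small back-of-the-envelope helper. It only mirrors the numbers in the paragraph above; the function name and the 15% communication budget are illustrative choices, not NCCL behavior:

def allreduce_budget(params: float = 7e9, bytes_per_param: int = 4,
                     num_gpus: int = 256, step_time_ms: float = 800.0,
                     comm_fraction: float = 0.15) -> dict:
    """Rough AllReduce sizing for one data-parallel step."""
    grad_bytes = params * bytes_per_param                        # ~28 GB for 7B fp32
    per_gpu_bytes = 2 * (num_gpus - 1) / num_gpus * grad_bytes   # ring AllReduce traffic per GPU
    return {
        "grad_gb": grad_bytes / 1e9,
        "per_gpu_traffic_gb": per_gpu_bytes / 1e9,
        "comm_budget_ms": step_time_ms * comm_fraction,          # ~120 ms of the 800 ms step (assumed 15%)
    }

print(allreduce_budget())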

The naive path for GPU-to-GPU communication goes: GPU VRAM - PCIe bus - CPU RAM - kernel TCP stack - NIC - wire - NIC - kernel TCP stack - CPU RAM - PCIe bus - GPU VRAM. Every arrow is a copy or a bus transfer. The kernel TCP stack alone adds 50-200 microseconds of latency per message and requires the CPU to process every packet. For AllReduce over 256 GPUs, you cannot afford this path.

The production path at any serious ML training cluster uses RDMA (Remote Direct Memory Access) over InfiniBand or RoCE (RDMA over Converged Ethernet). With RDMA, the GPU gradient buffer is transferred directly from one machine's GPU VRAM to another machine's GPU VRAM via the NIC, without any CPU involvement, without any kernel involvement, and without any copies through CPU RAM. Latency drops from tens or hundreds of microseconds to roughly a microsecond, and throughput reaches the full wire speed of the interconnect (200 Gbps HDR or 400 Gbps NDR on modern InfiniBand).

NCCL (NVIDIA Collective Communications Library) implements this path. When NCCL detects InfiniBand with GPUDirect RDMA support, it bypasses the entire kernel networking stack and the CPU memory path. The communication is literally: GPU A writes gradient tensor address to InfiniBand verb queue - NIC DMA engine transfers bytes from GPU VRAM to peer's GPU VRAM - peer's NIC signals completion. The CPU does nothing except initiate the operation.

This lesson explains the kernel networking stack overhead, why bypassing it matters for ML, the mechanisms used (DPDK, RDMA, io_uring, eBPF), and how NCCL uses these mechanisms in practice. You do not need to implement DPDK from scratch. You do need to understand why NCCL requires specific environment variable configuration, what InfiniBand and RoCE are, and when these mechanisms actually matter for your workloads.


Why This Exists

The Linux kernel's TCP/IP stack was designed for correctness and flexibility, not for minimum latency. Every packet that enters the kernel's networking stack goes through: interrupt handling (CPU switches context to handle the NIC interrupt), soft IRQ processing (the packet is moved from the NIC's DMA buffer to a socket buffer), protocol processing (TCP/IP header parsing, checksum verification, congestion control state machine), socket layer (packet is placed in the socket's receive queue), and finally syscall (application calls recv() to copy data from the socket buffer to user space).

Each of these steps takes CPU time. Interrupt handling causes cache misses. Context switching adds latency. On a modern Intel Xeon at 3 GHz, the kernel TCP stack adds approximately 5-15 microseconds per packet for small messages. For large transfers with interrupt coalescing, latency can reach 50-200 microseconds. For a training job doing AllReduce every step, these microseconds per packet multiply across thousands of packets per AllReduce and thousands of steps per training run.

The mathematical reality: at 100 Gbps NIC speed, you can send 12.5 GB/s. A gradient tensor of 28 GB should transfer in 2.24 seconds. But with kernel TCP overhead on 256-node AllReduce, you might observe 8-12 seconds - 4-5x slower than wire speed, purely due to kernel overhead. RDMA on InfiniBand achieves 95%+ of wire speed at sub-microsecond latency.


Historical Context

DPDK (Data Plane Development Kit) was originally created by Intel in 2010 for telecom and high-frequency trading use cases. It was open-sourced in 2013 under a BSD license and contributed to the Linux Foundation. DPDK enables userspace programs to access NIC hardware directly, completely bypassing the kernel networking stack.

RDMA dates to the InfiniBand architecture specification in 2000. InfiniBand emerged from the merger of the Future I/O and Next Generation I/O efforts as a high-bandwidth, low-latency interconnect for HPC clusters. Mellanox (part of NVIDIA since a 2020 acquisition) became the dominant InfiniBand vendor. RDMA over Converged Ethernet (RoCE) was standardized in 2010 as a way to get RDMA semantics over standard Ethernet infrastructure.

GPUDirect RDMA was introduced by NVIDIA in 2012. It allows the NIC to DMA directly to and from GPU VRAM, eliminating the GPU-to-CPU memory copy that was previously required for any network transfer of GPU data. NCCL was first released in 2016 and incorporated GPUDirect RDMA from early versions. The NCCL 2.0 release in 2017 added multi-node support with InfiniBand backends.

io_uring was introduced in Linux 5.1 (2019) by Jens Axboe at Facebook. It provides an interface for asynchronous I/O that avoids the syscall overhead of the older aio_* family, using shared memory ring buffers between the kernel and userspace to submit and complete I/O operations with zero syscalls in the fast path.

eBPF (extended Berkeley Packet Filter) became programmable from userspace in Linux 3.18 (2014) and gained XDP (eXpress Data Path) support in Linux 4.8 (2016). XDP allows eBPF programs to intercept packets at the NIC driver level, before they enter the kernel's networking stack - enabling packet filtering and forwarding at near-wire speed.


Core Concepts

The Linux Kernel Networking Overhead - Where Time Goes

Understanding what DPDK and RDMA bypass requires understanding what they are bypassing. The Linux kernel packet receive path for a standard TCP socket:

NIC receives packet
-> NIC DMA to kernel ring buffer (in CPU-accessible RAM)
-> NIC raises hardware interrupt
-> CPU interrupts current task (context switch, cache invalidation)
-> interrupt handler: minimal processing, schedules softirq
-> softirq (ksoftirqd): TCP/IP stack processing
- Ethernet header parsing
- IP routing table lookup
- TCP sequence number verification
- TCP congestion control (cubic, BBR): update state machine
- Place packet in socket's sk_buff receive queue
-> Application calls recv() syscall
- User-to-kernel context switch
- Copy data from sk_buff to user buffer
- Kernel-to-user context switch
-> Data available in user buffer

Each step has a cost:

  • Hardware interrupt: 1-2 microseconds (cache miss, CPU pipeline flush)
  • Softirq processing: 2-5 microseconds per packet for TCP
  • TCP state machine: 1-3 microseconds per packet (congestion control math)
  • syscall boundary crossing: 0.1-0.3 microseconds (kernel/user mode switch)
  • Memory copy (sk_buff to user buffer): 0.5-2 microseconds per 1 KB

For a 4 KB packet, total overhead: 5-12 microseconds. For 1 MB of data in 250 packets: 1.25-3 milliseconds of pure overhead, independent of network speed.

With interrupt coalescing (the NIC waits and fires one interrupt for N packets), per-packet latency drops, but the waiting period adds 50-200 microseconds to the first packet's latency. This trade-off is why kernel networking struggles to deliver high throughput and low latency at the same time.
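
The per-packet numbers above can be turned into a rough overhead estimate for a whole transfer. A minimal sketch using the midpoints quoted in this section (the constants are the article's estimates, not measurements):

def kernel_overhead_ms(transfer_bytes: int, packet_bytes: int = 4096,
                       per_packet_overhead_us: float = 8.0) -> float:
    """Estimate kernel TCP stack overhead for a transfer, ignoring wire time."""
    num_packets = -(-transfer_bytes // packet_bytes)  # ceil division
    return num_packets * per_packet_overhead_us / 1000.0

# 1 MB in 4 KB packets at ~8 us of stack overhead each -> about 2 ms,
# inside the 1.25-3 ms range quoted above
print(kernel_overhead_ms(1024 * 1024))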

DPDK Architecture

DPDK eliminates kernel involvement by running the NIC driver in userspace. The application polls the NIC's hardware receive ring directly (Polling Mode Driver, PMD), without waiting for interrupts. Packets go directly from NIC DMA buffer to DPDK's userspace memory pool, bypassing the kernel completely.

Core DPDK components:

Poll Mode Drivers (PMDs): Userspace implementations of NIC drivers for specific hardware (Intel ixgbe, Mellanox mlx5, etc.). The PMD owns the NIC's DMA rings and reads from them in a tight polling loop.

Huge Pages: DPDK requires memory allocated as 2 MB or 1 GB huge pages (vs the standard 4 KB page size). This reduces TLB pressure for the NIC's DMA operations. With 4 KB pages, a 2 MB buffer requires 512 TLB entries. With 2 MB huge pages, it requires 1. TLB misses at line rate (millions of packets per second) add up quickly.
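
The TLB arithmetic is simple enough to sanity-check directly (a throwaway snippet, not part of DPDK):

def tlb_entries(buffer_bytes: int, page_bytes: int) -> int:
    """Number of TLB entries needed to map a buffer at a given page size."""
    return -(-buffer_bytes // page_bytes)  # ceil division

two_mb = 2 * 1024 * 1024
print(tlb_entries(two_mb, 4 * 1024))         # 512 entries with 4 KB pages
print(tlb_entries(two_mb, 2 * 1024 * 1024))  # 1 entry with a 2 MB huge page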

CPU Pinning: DPDK dedicates entire CPU cores to packet processing. These cores run in a tight polling loop - they never sleep, never yield, never handle interrupts. This burns a full CPU core but delivers consistent sub-microsecond latency. For ML clusters, you typically dedicate 2-4 CPU cores for DPDK and leave the rest for training.

Memory Pools (rte_mempool): DPDK uses fixed-size memory pools for packet buffers. No dynamic allocation during packet processing - every buffer is pre-allocated. This eliminates malloc latency and heap fragmentation under load.

# Production DPDK data-path code is written in C. The Python below is only
# for host configuration and management (huge pages, NIC binding, checks).

import subprocess


def configure_huge_pages(
    num_2mb_pages: int = 4096,
    num_1gb_pages: int = 0,
) -> bool:
    """Configure huge pages for DPDK use.

    DPDK requires huge pages to minimize TLB pressure during high-speed
    packet processing. Must be called before DPDK initialization.
    2 MB pages: good for most use cases (8 GB total with 4096 pages)
    1 GB pages: for DPDK applications processing >40 Gbps sustained
    """
    # Set 2 MB hugepages (NUMA-aware setups should write the per-node sysfs entries instead)
    result_2mb = subprocess.run(
        ["sh", "-c", f"echo {num_2mb_pages} > /proc/sys/vm/nr_hugepages"],
        capture_output=True
    )

    # For 1 GB pages: must be set at boot via grub (hugepagesz=1G hugepages=N)
    # Cannot be allocated dynamically after boot

    # Verify allocation
    with open("/proc/meminfo") as f:
        meminfo = f.read()

    lines = {k.strip(): v.strip()
             for line in meminfo.splitlines()
             if ":" in line
             for k, v in [line.split(":", 1)]}

    huge_total = int(lines.get("HugePages_Total", "0"))
    huge_free = int(lines.get("HugePages_Free", "0"))
    huge_size_kb = int(lines.get("Hugepagesize", "2048 kB").split()[0])

    print("Huge page configuration:")
    print(f"  Page size: {huge_size_kb} KB")
    print(f"  Total pages: {huge_total} ({huge_total * huge_size_kb // 1024} MB)")
    print(f"  Free pages: {huge_free} ({huge_free * huge_size_kb // 1024} MB)")

    return result_2mb.returncode == 0


def check_dpdk_capable_nics() -> list[dict]:
    """Find NICs that support DPDK by checking for DPDK-compatible PCI devices."""
    result = subprocess.run(
        ["dpdk-devbind.py", "--status"],
        capture_output=True, text=True
    )
    if result.returncode != 0:
        return []

    nics = []
    current_section = None
    for line in result.stdout.splitlines():
        if "Network devices using DPDK-compatible driver" in line:
            current_section = "dpdk"
        elif "Network devices using kernel driver" in line:
            current_section = "kernel"
        elif current_section and line.strip().startswith("0000:"):
            parts = line.split()
            nics.append({
                "pci_addr": parts[0],
                "driver": current_section,
                "description": " ".join(parts[1:]),
            })
    return nics


def bind_nic_to_dpdk(pci_address: str) -> bool:
    """Bind a NIC to DPDK's vfio-pci or uio_pci_generic driver.

    WARNING: This unbinds the NIC from the kernel driver.
    Any IP configuration on that interface is lost.
    For ML clusters: only bind the RDMA/InfiniBand NIC to DPDK,
    keep the management NIC on the kernel driver.
    """
    # Load vfio-pci module if not loaded
    subprocess.run(["modprobe", "vfio-pci"], check=True)

    result = subprocess.run(
        ["dpdk-devbind.py", "--bind=vfio-pci", pci_address],
        capture_output=True, text=True
    )
    if result.returncode == 0:
        print(f"NIC {pci_address} bound to vfio-pci for DPDK use")
    else:
        print(f"Failed to bind NIC: {result.stderr}")
    return result.returncode == 0


def benchmark_kernel_vs_dpdk_latency() -> dict:
    """Benchmark comparison between kernel TCP and DPDK networking.

    Returns estimated latency values based on documented measurements.
    Actual values depend on hardware, CPU frequency, and load.
    """
    return {
        "kernel_tcp_latency_us": {
            "small_msg_64b": 15,
            "medium_msg_1kb": 20,
            "large_msg_64kb": 50,
            "description": "Typical kernel TCP socket latency on modern server"
        },
        "dpdk_udp_latency_us": {
            "small_msg_64b": 1.5,
            "medium_msg_1kb": 2.0,
            "large_msg_64kb": 8,
            "description": "DPDK poll-mode driver, no kernel involvement"
        },
        "rdma_rc_latency_us": {
            "small_msg_64b": 0.8,
            "medium_msg_1kb": 1.0,
            "large_msg_64kb": 3.5,
            "description": "RDMA Reliable Connection, InfiniBand HDR"
        },
        "rdma_gpudirect_latency_us": {
            "gpu_to_gpu_1mb": 12,
            "gpu_to_gpu_28gb": 2_240_000,  # ~2.24 s: bandwidth-bound at 100 Gbps wire speed
            "description": "GPUDirect RDMA, GPU VRAM to GPU VRAM, direct NIC DMA"
        }
    }
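
A short usage example of the comparison above; it simply derives speedup ratios from the estimates returned by the function:

if __name__ == "__main__":
    est = benchmark_kernel_vs_dpdk_latency()
    for size in ("small_msg_64b", "medium_msg_1kb", "large_msg_64kb"):
        tcp = est["kernel_tcp_latency_us"][size]
        rdma = est["rdma_rc_latency_us"][size]
        print(f"{size}: kernel TCP {tcp} us vs RDMA RC {rdma} us "
              f"({tcp / rdma:.0f}x faster)")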

RDMA and InfiniBand for ML Clusters

RDMA (Remote Direct Memory Access) is a technology that allows one machine to directly read from or write to the memory of another machine via the NIC's DMA engine, without involving either machine's CPU. The CPU only initiates the operation (posts a work request to the NIC's queue pair); the NIC hardware executes the transfer autonomously.

The programming model uses "verbs" - operations like ibv_post_send() (post a send work request), ibv_poll_cq() (poll the completion queue for finished operations). When a send completes, the NIC places a completion entry in the completion queue, and the application polls for it. There are no syscalls in the data path after the initial queue setup.

Queue Pairs (QPs): RDMA communication happens through queue pairs. Each QP has a Send Queue and a Receive Queue. To send data, you post a work request (WR) to the Send Queue specifying the local memory buffer address, its RDMA key (a handle that permits the NIC to DMA from this buffer), the remote address and key, and the operation type (RDMA Write, RDMA Read, Send, etc.). The NIC executes the transfer and posts a completion entry when done.

Memory Registration: Before the NIC can DMA to/from a memory buffer, the buffer must be registered with the RDMA device. Registration pins the pages in physical memory (prevents OS from swapping or remapping them) and creates an RDMA memory key that the remote side can use to address the buffer. For GPU buffers with GPUDirect RDMA, the GPU VRAM pages are registered with the RDMA device instead.

Transport Types:

  • RC (Reliable Connected): end-to-end reliability, in-order delivery, per-connection state. Used by NCCL for AllReduce.
  • UD (Unreliable Datagram): connectionless, no reliability guarantees. Used for control messages.
  • UC (Unreliable Connected): connected but no reliability. Rarely used.
# pyverbs: Python bindings for the ibverbs RDMA API.
# Ships with rdma-core (e.g. the python3-pyverbs package on Debian/Ubuntu).

import subprocess


def setup_rdma_connection(
    local_ib_device: str = "mlx5_0",
    local_ib_port: int = 1,
) -> dict:
    """Set up an RDMA connection context.

    This is the setup phase - actual data transfer uses ibv_post_send/recv.
    For NCCL, this is handled internally. For custom RDMA code,
    use pyverbs directly.
    """
    try:
        import pyverbs.device as d
        import pyverbs.enums as e
        from pyverbs.pd import PD
        from pyverbs.cq import CQ
        from pyverbs.qp import QP, QPCap, QPInitAttr
        from pyverbs.mr import MR

        # Open RDMA device
        ctx = d.Context(name=local_ib_device)
        device_attrs = ctx.query_device()

        print(f"RDMA device: {local_ib_device}")
        print(f"  Max QPs: {device_attrs.max_qp}")
        print(f"  Max CQ entries: {device_attrs.max_cqe}")
        print(f"  Max MR size: {device_attrs.max_mr_size} bytes")
        print(f"  Phys port count: {device_attrs.phys_port_cnt}")

        # Create Protection Domain
        protection_domain = PD(ctx)

        # Create Completion Queue (shared by send + receive completions)
        comp_queue = CQ(ctx, 100)  # 100 completion entries

        # Create Queue Pair (RC = Reliable Connected)
        qp_init_attr = QPInitAttr(
            qp_type=e.IBV_QPT_RC,
            scq=comp_queue,  # Send completion queue
            rcq=comp_queue,  # Receive completion queue
            cap=QPCap(
                max_send_wr=64,     # Max outstanding send work requests
                max_recv_wr=64,     # Max outstanding receive work requests
                max_send_sge=1,     # Scatter-gather elements per send WR
                max_recv_sge=1,
                max_inline_data=64  # Inline small messages in WR descriptor
            )
        )
        queue_pair = QP(protection_domain, qp_init_attr)

        # Register a 4 MB memory buffer for RDMA operations
        # (registration pins the pages and yields the local/remote keys)
        memory_region = MR(
            protection_domain,
            4 * 1024 * 1024,
            e.IBV_ACCESS_LOCAL_WRITE | e.IBV_ACCESS_REMOTE_WRITE | e.IBV_ACCESS_REMOTE_READ
        )

        return {
            "context": ctx,
            "pd": protection_domain,
            "cq": comp_queue,
            "qp": queue_pair,
            "mr": memory_region,
            "qp_num": queue_pair.qp_num,
            "lid": ctx.query_port(local_ib_port).lid,
        }

    except ImportError:
        print("pyverbs not available. Install rdma-core's Python bindings (python3-pyverbs).")
        return {}
    except Exception as exc:
        print(f"RDMA setup failed: {exc}")
        print("Ensure InfiniBand/RoCE hardware is present and libibverbs is installed.")
        return {}


def check_rdma_devices() -> list[dict]:
    """List available RDMA devices and their capabilities."""
    result = subprocess.run(
        ["ibv_devinfo"],
        capture_output=True, text=True
    )
    devices = []
    if result.returncode == 0:
        current_device = {}
        for line in result.stdout.splitlines():
            line = line.strip()
            if line.startswith("hca_id:"):
                if current_device:
                    devices.append(current_device)
                current_device = {"name": line.split(":", 1)[1].strip()}
            elif "fw_ver:" in line:
                current_device["fw_version"] = line.split(":", 1)[1].strip()
            elif "node_guid:" in line:
                # GUIDs contain ':' separators, so split only on the first colon
                current_device["guid"] = line.split(":", 1)[1].strip()
            elif "active_speed:" in line:
                current_device["speed"] = line.split(":", 1)[1].strip()
            elif "active_width:" in line:
                current_device["width"] = line.split(":", 1)[1].strip()
        if current_device:
            devices.append(current_device)
    return devices
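
A usage sketch tying the two helpers together (the device-name check is the only logic added here):

if __name__ == "__main__":
    devices = check_rdma_devices()
    if not devices:
        print("No RDMA devices found - check ibv_devinfo and driver installation")
    else:
        print(f"Found RDMA devices: {[dev['name'] for dev in devices]}")
        # Open the first device and create the QP/CQ/MR objects
        conn = setup_rdma_connection(local_ib_device=devices[0]["name"])
        if conn:
            print(f"QP number {conn['qp_num']}, LID {conn['lid']} - "
                  "exchange these with the peer out-of-band to connect the QPs")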

GPUDirect RDMA and NCCL

GPUDirect RDMA is the technology that makes GPU-to-GPU communication at scale practical. Without it, the GPU AllReduce path would be:

Without GPUDirect RDMA:
GPU A VRAM -> PCIe bus -> CPU RAM (copy 1)
CPU RAM -> NIC DMA -> wire -> remote NIC
remote NIC -> CPU RAM (copy 2)
CPU RAM -> PCIe bus -> GPU B VRAM (copy 3)
CPU is involved at every step. 3 copies. 2 PCIe bus traversals.

With GPUDirect RDMA:
GPU A VRAM -> PCIe bus -> InfiniBand NIC
InfiniBand NIC -> wire -> remote InfiniBand NIC
remote InfiniBand NIC -> PCIe bus -> GPU B VRAM
The CPU posts the RDMA work request and is done. Zero copies through CPU RAM. No CPU involvement in the data path.
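
The cost of the non-GPUDirect path can be estimated directly. A rough sketch; the ~25 GB/s figure is an assumed effective PCIe Gen4 x16 rate, not a measured value:

def staging_penalty_s(transfer_gb: float, pcie_gb_s: float = 25.0) -> float:
    """Extra time spent bouncing data through CPU RAM (2 extra PCIe copies per transfer)."""
    return 2 * transfer_gb / pcie_gb_s

# 28 GB of gradients: roughly 2.2 s of added PCIe/copy time per AllReduce
print(f"{staging_penalty_s(28):.1f} s extra without GPUDirect RDMA")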

NCCL automatically uses GPUDirect RDMA when:

  1. The NVIDIA GPUs support it (datacenter GPUs since Kepler; V100, A100, H100 all qualify)
  2. InfiniBand NICs support peer-to-peer DMA (Mellanox ConnectX-5 or later)
  3. The RDMA peer memory kernel module is loaded (nvidia-peermem or nv_peer_mem)
  4. Topology allows direct PCIe peer access (GPU and NIC under the same PCIe root complex, ideally behind the same PCIe switch)
import subprocess
import os


def configure_nccl_environment(
    use_infiniband: bool = True,
    ib_interface: str = "mlx5_0:1",
    debug_level: str = "WARN",
) -> dict:
    """Configure NCCL environment variables for optimal InfiniBand performance.

    These environment variables control NCCL's transport selection,
    GPU-to-NIC topology detection, and debugging output.

    Returns the environment dict to be merged into the process environment.
    """
    env = {}

    if use_infiniband:
        # Tell NCCL which InfiniBand device and port to use
        env["NCCL_IB_HCA"] = ib_interface  # e.g., "mlx5_0:1" or "mlx5_0,mlx5_1"
        env["NCCL_IB_DISABLE"] = "0"
        env["NCCL_IB_GID_INDEX"] = "3"   # GID index for RoCE v2 (0 for InfiniBand)
        env["NCCL_IB_TIMEOUT"] = "22"    # IB timeout exponent: actual timeout = 4.096 us * 2^value
        env["NCCL_IB_RETRY_CNT"] = "7"   # RDMA retry count for failed operations

        # Enable GPUDirect RDMA if GPU and NIC are on the same PCIe switch
        # (newer NCCL releases steer this via NCCL_NET_GDR_LEVEL)
        env["NCCL_IB_CUDA_SUPPORT"] = "1"

        # Socket fallback interface (for control traffic, not data)
        env["NCCL_SOCKET_IFNAME"] = "eth0"
    else:
        # Fallback: disable InfiniBand, use TCP sockets
        env["NCCL_IB_DISABLE"] = "1"
        env["NCCL_SOCKET_IFNAME"] = "eth0"

    # P2P transport: direct GPU-to-GPU copy within a node
    # NVL = NVLink only, PIX/PHB = allow PCIe paths (slower than NVLink)
    env["NCCL_P2P_LEVEL"] = "NVL"

    # Algorithm selection: leave NCCL_ALGO unset so NCCL picks ring vs tree
    # based on message size and topology.

    # Protocol: Simple (fewer messages), LL (lower latency), LL128 (128-byte chunks)
    env["NCCL_PROTO"] = "Simple"  # Best for large gradients

    # Debug output level
    env["NCCL_DEBUG"] = debug_level  # WARN, INFO, TRACE (TRACE is very verbose)

    return env


def detect_nccl_topology() -> dict:
    """Detect GPUDirect RDMA support and NCCL topology."""
    info = {}

    # Check if nvidia-peermem module is loaded (required for GPUDirect RDMA)
    with open("/proc/modules") as f:
        modules = f.read()
    info["nvidia_peermem_loaded"] = "nvidia_peermem" in modules
    info["nv_peer_mem_loaded"] = "nv_peer_mem" in modules
    info["gpudirect_rdma_available"] = (
        info["nvidia_peermem_loaded"] or info["nv_peer_mem_loaded"]
    )

    # Check for InfiniBand devices
    ib_devices_path = "/sys/class/infiniband"
    if os.path.exists(ib_devices_path):
        ib_devices = os.listdir(ib_devices_path)
        info["ib_devices"] = ib_devices
        info["infiniband_present"] = len(ib_devices) > 0
    else:
        info["ib_devices"] = []
        info["infiniband_present"] = False

    # Check NUMA topology: GPU and NIC should be on the same NUMA node
    # for optimal GPUDirect RDMA performance
    if info["infiniband_present"] and info["ib_devices"]:
        ib_device = info["ib_devices"][0]
        numa_path = f"/sys/class/infiniband/{ib_device}/device/numa_node"
        if os.path.exists(numa_path):
            with open(numa_path) as f:
                info["ib_numa_node"] = int(f.read().strip())

    # Check GPU NUMA nodes via nvidia-smi
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,numa_affinity", "--format=csv,noheader"],
        capture_output=True, text=True
    )
    if result.returncode == 0:
        gpu_numa = {}
        for line in result.stdout.strip().splitlines():
            parts = [p.strip() for p in line.split(",")]
            if len(parts) == 2 and parts[0].isdigit():
                gpu_numa[int(parts[0])] = parts[1]
        info["gpu_numa_nodes"] = gpu_numa

    return info


def benchmark_nccl_allreduce(
    tensor_size_mb: int = 100,
    num_iterations: int = 50,
) -> dict:
    """Benchmark NCCL AllReduce performance.
    Requires at least 2 GPUs and PyTorch with NCCL support.
    """
    try:
        import time
        import statistics
        import torch
        import torch.distributed as dist

        if not torch.cuda.is_available():
            return {"error": "CUDA not available"}
        if torch.cuda.device_count() < 2:
            return {"error": "Need at least 2 GPUs for AllReduce benchmark"}

        # Initialize the process group. In real distributed training the
        # launcher (torchrun) sets RANK/WORLD_SIZE/MASTER_ADDR per process;
        # rank=0 here is only illustrative for a single-process sketch.
        dist.init_process_group(
            backend="nccl",
            init_method="env://",
            world_size=torch.cuda.device_count(),
            rank=0,
        )

        device = torch.device("cuda:0")
        # Create tensor of specified size
        num_floats = (tensor_size_mb * 1024 * 1024) // 4
        tensor = torch.randn(num_floats, device=device)

        # Warmup
        for _ in range(5):
            dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
        torch.cuda.synchronize()

        # Benchmark
        times = []
        for _ in range(num_iterations):
            start = time.perf_counter()
            dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
            torch.cuda.synchronize()  # Wait for GPU to finish
            end = time.perf_counter()
            times.append((end - start) * 1000)  # ms

        mean_ms = statistics.mean(times)
        return {
            "tensor_size_mb": tensor_size_mb,
            "num_gpus": torch.cuda.device_count(),
            "mean_latency_ms": round(mean_ms, 2),
            "p50_latency_ms": round(sorted(times)[len(times) // 2], 2),
            "p99_latency_ms": round(sorted(times)[int(len(times) * 0.99)], 2),
            # Ring AllReduce moves ~2x the tensor size per GPU; convert to gigabits/s
            "effective_bandwidth_gbps": round(
                (2 * tensor_size_mb * 8 / 1024) / (mean_ms / 1000),
                1
            ),
        }
    except Exception as e:
        return {"error": str(e)}
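
Putting the helpers above together, a launch-time sketch (initialization itself still happens per-rank under torchrun as usual):

if __name__ == "__main__":
    topo = detect_nccl_topology()
    nccl_env = configure_nccl_environment(
        use_infiniband=topo["infiniband_present"],
        debug_level="INFO",  # always INFO on a new cluster to confirm the IB transport
    )
    os.environ.update(nccl_env)
    print("GPUDirect RDMA available:", topo["gpudirect_rdma_available"])
    # benchmark_nccl_allreduce() is then run per rank under torchrun,
    # which supplies RANK / WORLD_SIZE / MASTER_ADDR for init_process_group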

io_uring for Storage Bypass

io_uring (introduced Linux 5.1, 2019) addresses the same fundamental problem as DPDK but for storage I/O instead of network I/O. Traditional read() and write() syscalls require two context switches per operation (user-to-kernel and back). For NVMe SSDs that can handle 1 million IOPS, the syscall overhead alone becomes a bottleneck.

io_uring uses two ring buffers shared between the kernel and userspace:

  • Submission Queue (SQ): userspace writes I/O requests here
  • Completion Queue (CQ): kernel writes completion events here

In the fast path, submitting 100 I/O requests requires only 1 syscall (io_uring_enter()) instead of 100 syscalls. In the "sqpoll" mode, a kernel thread polls the SQ continuously - zero syscalls in the critical path for both submission and completion.

For ML data loading, io_uring is relevant when:

  • Loading training data from NVMe at rates above 1M IOPS
  • Streaming large checkpoint files (multiple concurrent reads for parallel loading)
  • Preprocessing pipeline that does high-frequency small file reads
import os
import mmap


def demonstrate_io_uring_concept() -> None:
    """Demonstrate the io_uring ring buffer setup conceptually.

    In practice, use liburing from C/C++ or one of the third-party
    Python bindings; the standard library does not expose io_uring.
    """
    print("io_uring fundamentals:")
    print("1. io_uring_setup(N, params) -> fd")
    print("   Creates submission ring (SQ) and completion ring (CQ)")
    print("   Both rings are mapped into userspace via mmap")
    print("")
    print("2. Submit I/O: write SQE to SQ ring, call io_uring_enter(1)")
    print("   In sqpoll mode: kernel thread polls SQ, zero syscall needed")
    print("")
    print("3. Check completions: poll CQ ring directly in userspace")
    print("   No syscall needed to read completions")
    print("")
    print("Net effect: 100 reads = 1 syscall (batch submit)")
    print("vs pread64: 100 reads = 100 syscalls")


def benchmark_read_methods(
    file_path: str,
    chunk_size: int = 1024 * 1024,  # 1 MB chunks
    num_reads: int = 100,
) -> dict:
    """Compare read performance with different I/O methods.

    Methods compared:
    - read(): standard blocking read (one seek + read syscall pair per chunk)
    - mmap: memory-mapped read (page faults instead of read syscalls)

    io_uring would show its benefit at very high IOPS, where the
    per-read syscall overhead measured here starts to dominate
    (>100K IOPS per process).
    """
    import time

    results = {}
    file_size = os.path.getsize(file_path)
    offsets = [
        (i * chunk_size) % (file_size - chunk_size)
        for i in range(num_reads)
    ]

    # Method 1: standard read
    start = time.perf_counter()
    with open(file_path, "rb") as f:
        for offset in offsets:
            f.seek(offset)
            data = f.read(chunk_size)
    elapsed = time.perf_counter() - start
    results["standard_read_ms"] = round(elapsed * 1000, 1)
    results["standard_read_iops"] = round(num_reads / elapsed)

    # Method 2: mmap (OS page cache, no syscall per read)
    start = time.perf_counter()
    with open(file_path, "rb") as f:
        mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        for offset in offsets:
            data = mapped[offset:offset + chunk_size]
        mapped.close()
    elapsed = time.perf_counter() - start
    results["mmap_read_ms"] = round(elapsed * 1000, 1)
    results["mmap_read_iops"] = round(num_reads / elapsed)

    return results

eBPF and XDP for ML Traffic Prioritization

eBPF (extended Berkeley Packet Filter) allows user-written programs to run in the Linux kernel without loading a kernel module. eBPF programs are verified by the kernel's verifier (ensuring they terminate and access only allowed memory), then JIT-compiled to native machine code and attached to various kernel hooks.

XDP (eXpress Data Path) is an eBPF hook at the earliest point in the NIC driver where the kernel first sees a received packet - before sk_buff allocation, before protocol processing. At this point the program can:

  • XDP_PASS: pass the packet to the normal kernel networking stack
  • XDP_DROP: discard the packet (line-rate filtering, ~100 ns per packet)
  • XDP_TX: retransmit the packet (useful for a fast L2 forwarder)
  • XDP_REDIRECT: redirect to another interface or to a userspace AF_XDP socket

For ML clusters, XDP and eBPF are useful for:

  • Prioritizing NCCL AllReduce traffic over background management traffic at the NIC level
  • Rate-limiting gossip protocol traffic from distributed training coordinators
  • Implementing fast path routing for inference requests that bypasses conntrack
  • Collecting per-flow latency metrics without copying packets to userspace
from bcc import BPF  # BCC Python bindings (distro package, e.g. python3-bpfcc)


XDP_PROGRAM = """
#include <uapi/linux/if_ether.h>
#include <uapi/linux/ip.h>
#include <uapi/linux/tcp.h>
#include <uapi/linux/udp.h>

// Maps to count packets and bytes per destination port
BPF_HASH(port_packet_count, u16, u64);
BPF_HASH(port_byte_count, u16, u64);

// Port range assumed for NCCL AllReduce traffic in this example
#define NCCL_PORT_LOW 42000
#define NCCL_PORT_HIGH 43000

int xdp_nccl_monitor(struct xdp_md *ctx) {
    void *data_end = (void *)(long)ctx->data_end;
    void *data = (void *)(long)ctx->data;

    // Parse Ethernet header
    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)
        return XDP_PASS;
    if (eth->h_proto != __constant_htons(ETH_P_IP))
        return XDP_PASS;

    // Parse IP header
    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end)
        return XDP_PASS;
    if (ip->protocol != IPPROTO_TCP)
        return XDP_PASS;

    // Parse TCP header
    int ip_header_len = ip->ihl * 4;
    struct tcphdr *tcp = (void *)ip + ip_header_len;
    if ((void *)(tcp + 1) > data_end)
        return XDP_PASS;

    u16 dport = __constant_ntohs(tcp->dest);

    // Track packet counts per destination port
    u64 *count = port_packet_count.lookup(&dport);
    if (count) {
        (*count)++;
    } else {
        u64 one = 1;
        port_packet_count.update(&dport, &one);
    }

    // All packets pass - this is monitoring only, not filtering
    return XDP_PASS;
}
"""


def attach_xdp_monitor(interface: str = "eth0") -> None:
    """Attach XDP program to monitor NCCL traffic on a network interface.

    The XDP program runs at line rate in the NIC driver, adding
    essentially zero overhead to packet processing while collecting
    per-port statistics useful for NCCL debugging.
    """
    try:
        b = BPF(text=XDP_PROGRAM)
        fn = b.load_func("xdp_nccl_monitor", BPF.XDP)
        b.attach_xdp(interface, fn, 0)

        print(f"XDP monitor attached to {interface}")
        print("Monitoring port traffic (press Ctrl+C to stop)...")

        import time
        try:
            while True:
                time.sleep(2)
                port_counts = b.get_table("port_packet_count")
                if port_counts:
                    print("\nTop ports by packet count:")
                    sorted_ports = sorted(
                        [(k.value, v.value) for k, v in port_counts.items()],
                        key=lambda x: x[1], reverse=True
                    )[:10]
                    for port, count in sorted_ports:
                        tag = " <- NCCL" if 42000 <= port <= 43000 else ""
                        print(f"  Port {port:5d}: {count:10,} packets{tag}")

        except KeyboardInterrupt:
            pass
        finally:
            b.remove_xdp(interface, 0)
            print(f"\nXDP monitor removed from {interface}")
    except Exception as e:
        print(f"eBPF/XDP requires root privileges and an XDP-capable NIC driver: {e}")

SR-IOV for Dedicated NIC Access per Container

SR-IOV (Single Root I/O Virtualization) is a PCIe hardware feature that allows a single physical NIC to present itself as multiple virtual NICs (called Virtual Functions, VFs) to the OS. Each VF can be assigned to a different VM or container, giving it dedicated NIC hardware access without going through a software bridge.

For ML containers: instead of a container using a virtual ethernet (veth) device that routes through a software bridge and the host's kernel networking stack, the container receives a dedicated VF. Traffic to and from that container goes directly to the physical NIC hardware without kernel software routing overhead.

import os
import subprocess


def configure_sriov_for_ml_containers(
    pf_interface: str,
    num_vfs: int,
) -> bool:
    """Configure SR-IOV Virtual Functions on a physical NIC.

    Creates num_vfs VFs that can be assigned to containers via
    Kubernetes device plugin or docker --device flags.

    Requires SR-IOV capable NIC (Mellanox ConnectX-5+, Intel X710+).
    Must be run as root on the host.
    """
    sriov_path = f"/sys/class/net/{pf_interface}/device/sriov_numvfs"

    if not os.path.exists(sriov_path):
        print(f"SR-IOV not supported on {pf_interface}")
        print("Requires SR-IOV capable NIC and BIOS with SR-IOV enabled")
        return False

    # Create VFs
    with open(sriov_path, "w") as f:
        f.write(str(num_vfs))

    print(f"Created {num_vfs} VFs on {pf_interface}")

    # List created VFs
    result = subprocess.run(
        ["ip", "link", "show", pf_interface],
        capture_output=True, text=True
    )
    print(result.stdout)
    return True
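
A small companion check, useful before calling the function above (the sysfs paths are standard; the helper itself is illustrative):

def list_sriov_capability(pf_interface: str) -> dict:
    """Report how many VFs the NIC supports and which VFs already exist."""
    dev_path = f"/sys/class/net/{pf_interface}/device"
    total_path = os.path.join(dev_path, "sriov_totalvfs")
    if not os.path.exists(total_path):
        return {"sriov_capable": False}
    with open(total_path) as f:
        total_vfs = int(f.read().strip())
    # Each existing VF shows up as a virtfn<N> symlink to its PCI device
    existing = sorted(
        entry for entry in os.listdir(dev_path) if entry.startswith("virtfn")
    )
    return {"sriov_capable": True, "max_vfs": total_vfs, "existing_vfs": existing}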


Production Engineering Notes

When kernel bypass actually matters for ML: Kernel bypass (DPDK, RDMA) matters when: (1) you are doing AllReduce across more than 8 nodes and gradient tensors exceed 1 GB per step, (2) your inference SLA requires sub-5ms latency at the 99th percentile, or (3) you are training at scale where 10-20% of wall-clock time is network-bound. For single-node multi-GPU training (NVLink within a node), network bypass is irrelevant - NVLink already bypasses the kernel at the hardware level. For batch inference at moderate scale (less than 100 req/s), kernel TCP is fine.

NCCL environment variables that actually matter: NCCL_IB_HCA specifies which InfiniBand HCA to use - if your machine has multiple IB ports, this determines which one NCCL uses. NCCL_DEBUG=INFO or NCCL_DEBUG=TRACE logs which transport NCCL selects (IB RDMA vs TCP fallback). NCCL_IB_DISABLE=0 ensures RDMA is active. NCCL_P2P_LEVEL=NVL tells NCCL to only use P2P for NVLink-connected GPUs (not the slower PCIe P2P path). Always run NCCL_DEBUG=INFO when debugging slow AllReduce to confirm RDMA is actually being used.

RoCE vs InfiniBand trade-offs: RoCE (RDMA over Converged Ethernet) uses standard Ethernet switches, which are cheaper and more familiar than InfiniBand switches. RoCEv2 (the modern version) requires Priority Flow Control (PFC) on switches to prevent packet drops that break RDMA semantics. InfiniBand has built-in flow control and credit-based flow. For new ML cluster buildouts with tight latency requirements, InfiniBand is still preferred. For clusters where Ethernet infrastructure already exists and you want RDMA at lower cost, RoCEv2 with proper switch configuration works.

GPUDirect RDMA debugging: The most common failure mode is missing the nvidia-peermem kernel module. Verify: lsmod | grep nvidia_peermem. If not loaded: modprobe nvidia-peermem. If the module fails to load (common after kernel updates), you need to recompile it: apt-get install -y linux-headers-$(uname -r) then dkms install nvidia-peermem/<version>. Without this module, NCCL falls back to copying gradients through CPU RAM - your AllReduce might still work but at 3-5x higher latency.


Common Mistakes

:::danger Using TCP Fallback in Production NCCL Without Knowing It NCCL silently falls back to TCP sockets if InfiniBand is unavailable or misconfigured. With TCP, AllReduce latency can be 5-10x higher than with RDMA. The failure is silent - training runs, just slowly. Always run NCCL_DEBUG=INFO on a new cluster and verify the transport log lines mention NET/IB (ideally with GDRDMA) rather than NET/Socket. A training job that runs at 40% GPU utilization instead of 80% is often suffering from silent NCCL TCP fallback. :::

:::danger Deploying DPDK Without Dedicated CPU Cores DPDK's polling mode driver runs in a tight while(1) loop that 100% utilizes a CPU core. If you run DPDK on a shared CPU core, it will preempt your ML training compute and degrade throughput unpredictably. DPDK must be configured with rte_eal_init() specifying isolated CPU cores that are reserved exclusively for packet processing. These cores should be isolated from the kernel scheduler via isolcpus= and nohz_full= kernel boot parameters. Running DPDK on non-isolated cores causes both degraded packet processing performance and degraded training performance simultaneously. :::

:::warning io_uring Does Not Improve Latency for Large Sequential Reads io_uring's primary benefit is batching many small I/Os to reduce syscall overhead. For training data loading that does large sequential reads (reading 256 MB files sequentially), io_uring provides no significant benefit over a single read() call - the I/O time dominates, not the syscall overhead. io_uring matters when your data loading pattern involves thousands of small random reads per second (e.g., loading individual 10 KB image files), where each file would otherwise require its own open() + read() + close() = 3 syscalls. With io_uring, 1000 such operations can be batched into a single io_uring_enter() call. :::

:::warning RoCE Requires PFC on Switches RoCEv2 (RDMA over Converged Ethernet v2) requires lossless Ethernet behavior. Standard Ethernet drops packets under congestion. When an RDMA packet is dropped, the RDMA transport layer retransmits after a timeout - this adds 100-1000 microseconds of latency for every dropped packet and can cause cascade failures in AllReduce. Priority Flow Control (PFC) is an Ethernet pause mechanism that prevents packet drops by throttling the sender when buffers fill. Without PFC configured correctly on every switch in the path, RoCEv2 will perform much worse than advertised in ML cluster benchmarks. :::


Interview Questions

Q1: Why does NCCL AllReduce perform better on InfiniBand than on Ethernet TCP, even at the same physical bandwidth?

Three compounding factors. First, latency: InfiniBand RDMA has 0.5-1 microsecond latency per message vs 15-50 microseconds for TCP on Ethernet. For AllReduce, the ring algorithm runs 2(N-1) sequential communication steps. With 256 nodes, that is 510 sequential hops. At 1 us per hop: about 0.5 ms of latency overhead. At 30 us per hop: about 15 ms. Second, zero CPU involvement: GPUDirect RDMA means the CPU does not participate in data transfer. With TCP, the CPU must run the TCP state machine for every packet, consuming cores that could otherwise run the training computation. Third, no copies: with GPUDirect RDMA, gradient data moves directly from GPU VRAM to peer GPU VRAM via NIC DMA. With TCP, it goes GPU VRAM -> CPU RAM -> TCP socket buffer -> wire -> TCP socket buffer -> CPU RAM -> GPU VRAM (3 copies + 2 PCIe bus traversals).

Q2: What is GPUDirect RDMA and what kernel module is required for it to work?

GPUDirect RDMA is an NVIDIA technology that allows InfiniBand or RoCE NICs to DMA directly to and from GPU VRAM, bypassing CPU RAM entirely. Normally, PCIe peer-to-peer DMA is restricted to devices on the same PCIe root complex or connected via NVSwitch. GPUDirect RDMA extends this to allow a NIC (which may be on a different PCIe root) to DMA to a GPU's VRAM using the NVIDIA Peer Memory Access mechanism. The required kernel module is nvidia-peermem (previously called nv_peer_mem). This module registers the GPU's physical memory regions with the RDMA subsystem's peer memory framework, allowing the InfiniBand driver to create DMA mappings that point directly to GPU VRAM. Without this module, NCCL falls back to copying gradient data through CPU RAM, which adds two PCIe bus traversals and two memory copies per AllReduce message.

Q3: When does io_uring provide meaningful benefit for ML workloads?

io_uring benefits ML workloads in two scenarios. First, high-IOPS random data loading: when training on a dataset of millions of small files (ImageNet with 1.28M JPEG images, average 100 KB each), loading a batch of 256 images requires 256 separate file opens and reads. With pread64, that is 256 syscalls in sequence. With io_uring, those 256 reads can be submitted in a single io_uring_enter() call and completions polled without any additional syscalls. On NVMe SSDs capable of 1M+ IOPS, syscall overhead can be 20-30% of total I/O time for this pattern. Second, checkpoint loading at startup: reading a 7B model checkpoint involves streaming gigabytes from NVMe. io_uring's sqpoll mode (kernel thread polls for submissions, zero syscall latency in critical path) can reduce checkpoint load time by 10-20% on very fast NVMe arrays. io_uring does NOT help for large sequential reads (reading a 1 GB file sequentially) where the I/O bandwidth dominates.

Q4: What is the difference between DPDK and kernel bypass via RDMA verbs? When would you use each?

DPDK completely removes the kernel from the network I/O path for all protocols. Your application writes the protocol implementation (TCP/UDP/custom) in userspace and drives the NIC directly via the Poll Mode Driver. DPDK is appropriate when you are implementing custom network protocols (not TCP/UDP), need to process millions of small packets per second, or are building a software router/switch. It is NOT the right tool for ML gradient communication.

RDMA verbs are a specific protocol over InfiniBand or RoCE that provides zero-copy, CPU-bypass transfers for registered memory regions. RDMA does not bypass the kernel for control operations (QP setup, memory registration) but completely bypasses it for data transfers. The RDMA driver still runs in the kernel; the data transfer is offloaded to the NIC hardware. RDMA is the right tool for ML AllReduce communication because it works with the existing NCCL library and requires no custom protocol implementation.

For ML: use RDMA (via NCCL) for gradient communication. Use DPDK only if building custom ML communication middleware or a specialized inference routing layer.

Q5: Your 64-GPU training job's AllReduce takes 8 seconds per step for 14 GB of gradients. Your InfiniBand links are 200 Gbps. What is the theoretical minimum and what could be causing the gap?

Theoretical minimum: ring AllReduce over 64 GPUs moves 2 × (N-1)/N × D bytes per GPU, where N = 64 and D = 14 GB: 2 × 63/64 × 14 ≈ 27.6 GB. At 200 Gbps (25 GB/s), the minimum transfer time is 27.6 / 25 ≈ 1.1 seconds. At 8 seconds, you are at roughly 14% of wire speed. Likely causes: (1) NCCL fell back to TCP - verify with NCCL_DEBUG=INFO. (2) nvidia-peermem module not loaded - gradients are going through CPU RAM, adding 2 PCIe traversals per node. (3) Ring algorithm with 64 nodes requires 2 × 63 = 126 sequential steps; if individual step latency is high (RoCE without PFC causing retransmits), the latency multiplies. (4) Incorrect NCCL_IB_HCA pointing to the wrong NIC port. (5) NUMA mismatch - GPU and NIC on different NUMA nodes, causing PCIe traffic to route through the inter-socket link. Diagnostic sequence: NCCL debug -> check nvidia-peermem -> run perftest (ib_send_bw, ib_read_bw) to verify InfiniBand bandwidth independently -> check NUMA topology with nvidia-smi topo -m.
