Linux Performance Tuning
Reading time: ~40 min · Interview relevance: High · Target roles: MLOps Engineer, AI Infrastructure Engineer, Systems Engineer
A freshly provisioned Ubuntu 22.04 server is optimized for general-purpose workloads. General-purpose means: conservative memory management (swap aggressively to avoid OOM kills), balanced CPU scheduling (no process gets to monopolize a core), conservative NIC interrupt rates (coalesce interrupts to reduce CPU load). These defaults make sense for a web server running hundreds of small processes. They are wrong for an ML training server running two processes that need exclusive access to every CPU core, terabytes of RAM, and hundreds of gigabits per second of network throughput.
The performance gap between a default Ubuntu install and a tuned ML server is not marginal. In a Google internal study of TPU v4 pods, OS-level tuning accounted for a 12-18% throughput improvement over baseline Linux defaults. On NVIDIA A100 clusters, enabling performance CPU governors and disabling transparent huge page compaction reduced training step time variance by 25-30%, which matters enormously for distributed training where the slowest node determines cluster throughput.
The changes are not complex. They are specific, measurable, and in most cases reversible. The challenge is knowing which knobs exist, what they control, and what the correct value is for your specific workload. This lesson is a systematic treatment of every major Linux tuning area that matters for ML: memory management, CPU scheduling, NUMA topology, network stack, I/O scheduling, and kernel boot parameters. For each area, you will understand the mechanism, the problem it causes when misconfigured, the correct value for ML workloads, and how to verify the change is in effect.
The goal at the end is a reproducible tuning script you can apply to every new ML server, with before-and-after benchmarks to validate the improvement.
Why This Exists
Linux kernel defaults optimize for safety and breadth of use cases. When vm.swappiness=60, the kernel is willing to swap out pages that haven't been accessed recently, even if there is plenty of free RAM. This prevents OOM kills on a general-purpose server where 50 processes share memory. On a training server where one process is deliberately using 95% of RAM for model weights and activations, that swapping causes 100x slowdowns when the kernel moves model pages to swap and then faults them back in during the forward pass.
When transparent huge pages are set to always, the khugepaged kernel thread tries to compact memory into 2 MB pages continuously. This is beneficial for memory-intensive applications in steady state. But during PyTorch model initialization or checkpoint loading - when large amounts of memory are being allocated and initialized - khugepaged compaction causes multi-second pauses that appear as unexplained hangs in training logs.
These are not edge cases. They are the default behavior of Linux on untuned hardware. Understanding why the defaults exist and what the correct alternative is separates engineers who say "this cluster is slow sometimes" from engineers who find and fix the root cause.
Historical Context
Linux's virtual memory tuning parameters accumulated over decades. vm.swappiness was introduced in Linux 2.6 and has had the same default (60) since. It was designed when RAM was expensive (256 MB was generous) and the workload was a mixed-use server. vm.dirty_ratio and vm.dirty_background_ratio were added for systems with large write workloads (database servers). The defaults have not changed significantly in 20 years.
Transparent Huge Pages (THP) was introduced in Linux 2.6.38 (2011). The default always mode was appropriate for 2011 HPC workloads. The madvise mode (which lets applications opt in) was added later but never became the default. Almost every database (PostgreSQL, MySQL, MongoDB), JVM, and ML framework recommends madvise or never mode for THP.
CPU frequency governors have existed since Linux 2.6 (ondemand governor, 2006). The performance governor pins the CPU at maximum frequency. The powersave governor pins it at minimum. The ondemand and schedutil governors adjust dynamically. Cloud instances almost universally ship with ondemand or schedutil. Training on a powersave governor because the provisioning script forgot to set it is a common source of unexplained 30-40% throughput degradation.
NUMA awareness in the Linux scheduler was added incrementally from Linux 2.6 through 3.x. NUMA auto-balancing (kernel 3.8, 2013) attempts to migrate process pages to the NUMA node where the process is running. For many workloads this is beneficial. For distributed training where PyTorch intentionally allocates tensors on specific NUMA nodes for RDMA performance, auto-balancing interferes.
Core Concepts
Memory Management Tuning
The most impactful memory-related parameters for ML workloads:
vm.swappiness: Controls the kernel's tendency to swap pages to disk. Range 0-200 (Linux 5.8+) or 0-100 (older). Default: 60. For ML training servers: set to 0 or 1.
At swappiness=60, the kernel begins swapping pages to disk when free memory drops below a threshold, even though plenty of RAM may still be free. Model weights that are not accessed during a particular micro-batch may be swapped out. When the next micro-batch accesses them, a page fault brings them back from disk - adding 1-10 ms per page fault, multiplied by millions of pages, multiplied by thousands of steps. Set to 1 on training servers (swap only under severe memory pressure), 0 on inference servers (never swap - latency predictability is critical).
vm.overcommit_memory: Controls whether the kernel allows processes to allocate more virtual memory than physically available. Default: 0 (heuristic). For ML: set to 1 (always allow overcommit).
PyTorch's CUDA memory allocator, Python's memory allocator, and the OS must all participate in memory management. With overcommit_memory=0, large mmap() calls can fail even when there is plenty of physical RAM, because the kernel's heuristic calculation says "you might not have enough memory to back this mapping." Training jobs that fork workers (DataLoader) can fail at startup with confusing "Cannot allocate memory" errors even on a machine with 512 GB RAM and 400 GB free. Set to 1 so virtual memory allocation always succeeds; the OOM killer handles the rare case where physical memory is actually exhausted.
vm.dirty_ratio and vm.dirty_background_ratio: Control when the kernel writes dirty (modified) pages to disk. dirty_background_ratio (default 10%) is the threshold at which background writeback starts. dirty_ratio (default 20%) is the threshold at which application writes stall waiting for writeback. For checkpoint-heavy ML training: increase dirty_ratio to 40-50% to allow large checkpoint files to buffer entirely in RAM before flushing, preventing training loop stalls during checkpoint writes.
Transparent Huge Pages (THP): The most misunderstood tuning parameter for ML.
THP allows the kernel to use 2 MB pages instead of 4 KB pages for anonymous memory. Large pages reduce TLB pressure (one TLB entry covers 2 MB instead of 4 KB, so a 4 GB model needs 2048 TLB entries instead of 1 million). Benefits for ML: reduced TLB miss rate during matrix multiplication, faster memory allocation.
The problem: the always mode runs khugepaged which continuously scans memory looking for 4 KB pages to consolidate into 2 MB pages. This compaction process pauses the process that owns the memory while pages are being moved. For a training job with 300 GB of tensors, khugepaged can cause pauses of 2-10 seconds at unpredictable intervals. The madvise mode lets the application control which regions use huge pages using madvise(addr, len, MADV_HUGEPAGE). PyTorch can then opt specific allocator regions into THP without suffering system-wide compaction pressure.
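The helpers below read and write sysctl parameters through /proc/sys, audit the full set of ML-relevant settings against recommended values, and configure the THP policy through sysfs: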
import subprocess
import pathlib
import os
import json
from typing import Any
def read_sysctl(param: str) -> str:
"""Read a sysctl parameter value."""
path = pathlib.Path("/proc/sys") / param.replace(".", "/")
try:
return path.read_text().strip()
except FileNotFoundError:
return f"NOT FOUND: {param}"
except PermissionError:
return f"PERMISSION DENIED: {param}"
def write_sysctl(param: str, value: str, persist: bool = False) -> bool:
"""Write a sysctl parameter value.
persist=True: add to /etc/sysctl.d/99-ml-tuning.conf
This survives reboots. Without persist, changes are lost on reboot.
"""
path = pathlib.Path("/proc/sys") / param.replace(".", "/")
try:
path.write_text(value + "\n")
if persist:
conf_path = pathlib.Path("/etc/sysctl.d/99-ml-tuning.conf")
# Read existing content to avoid duplicates
existing = conf_path.read_text() if conf_path.exists() else ""
param_line = f"{param} = {value}\n"
# Remove old setting for this param if present
lines = [l for l in existing.splitlines()
if not l.startswith(f"{param} =") and l.strip()]
lines.append(param_line.strip())
conf_path.write_text("\n".join(lines) + "\n")
return True
    except (FileNotFoundError, PermissionError) as e:
        print(f"Could not write {param}: {e} (run as root; check the parameter name)")
        return False
class MLServerTuner:
"""Systematic Linux performance tuner for ML workloads.
Checks current values, reports deviations from ML-optimal settings,
and optionally applies corrections.
WARNING: Apply on dedicated ML servers only.
These settings are NOT appropriate for general-purpose servers.
"""
# ML-optimal sysctl settings with explanations
OPTIMAL_SETTINGS = {
# Memory management
"vm.swappiness": ("1", "Minimize swap usage; model weights must stay in RAM"),
"vm.overcommit_memory": ("1", "Allow virtual overcommit; prevents false OOM on fork"),
"vm.overcommit_ratio": ("50", "Overcommit ratio when overcommit_memory=2"),
"vm.dirty_ratio": ("40", "Allow 40% dirty pages; reduces checkpoint write stalls"),
"vm.dirty_background_ratio": ("10", "Start background writeback at 10%"),
"vm.min_free_kbytes": ("1048576", "Keep 1 GB free; prevents allocation failures"),
"vm.zone_reclaim_mode": ("0", "Disable NUMA zone reclaim; causes latency spikes"),
# Network stack - critical for NCCL and distributed training
"net.core.rmem_max": ("134217728", "128 MB socket receive buffer max"),
"net.core.wmem_max": ("134217728", "128 MB socket send buffer max"),
"net.core.rmem_default": ("67108864", "64 MB default receive buffer"),
"net.core.wmem_default": ("67108864", "64 MB default send buffer"),
"net.ipv4.tcp_rmem": ("4096 87380 134217728", "TCP receive buffer: min/default/max"),
"net.ipv4.tcp_wmem": ("4096 65536 134217728", "TCP send buffer: min/default/max"),
"net.ipv4.tcp_congestion_control": ("bbr", "BBR congestion control for distributed training"),
"net.core.netdev_max_backlog": ("300000", "NIC receive queue depth"),
"net.core.somaxconn": ("65535", "Max TCP listen backlog for model server"),
"net.ipv4.tcp_max_syn_backlog": ("65535", "Max SYN queue for incoming connections"),
# Kernel scheduler
"kernel.sched_migration_cost_ns": ("5000000", "Reduce task migration; keep ML threads on NUMA node"),
"kernel.sched_autogroup_enabled": ("0", "Disable autogroup; can degrade ML job scheduling"),
"kernel.numa_balancing": ("0", "Disable NUMA auto-balancing for training workloads"),
# File system
"fs.file-max": ("2097152", "Max open file descriptors system-wide"),
"fs.inotify.max_user_watches": ("1048576", "For file-watching frameworks in training pipelines"),
}
def audit(self) -> list[dict]:
"""Check all settings and return list of deviations."""
issues = []
for param, (optimal_value, description) in self.OPTIMAL_SETTINGS.items():
current = read_sysctl(param)
if "NOT FOUND" in current or "PERMISSION" in current:
issues.append({
"param": param,
"status": "unavailable",
"current": current,
"optimal": optimal_value,
"description": description,
})
elif current != optimal_value:
issues.append({
"param": param,
"status": "suboptimal",
"current": current,
"optimal": optimal_value,
"description": description,
})
return issues
def print_audit_report(self) -> None:
"""Print a human-readable audit report."""
issues = self.audit()
optimal_count = len(self.OPTIMAL_SETTINGS) - len(issues)
print(f"\n=== ML Server Tuning Audit ===")
print(f"Settings checked: {len(self.OPTIMAL_SETTINGS)}")
print(f"Optimal: {optimal_count}")
print(f"Needs attention: {len(issues)}\n")
if issues:
print("Parameters needing adjustment:")
print("-" * 80)
for issue in issues:
if issue["status"] == "suboptimal":
print(f" {issue['param']}")
print(f" Current: {issue['current']}")
print(f" Optimal: {issue['optimal']}")
print(f" Reason: {issue['description']}")
print()
def apply_all(self, persist: bool = True) -> dict:
"""Apply all ML-optimal settings.
persist=True writes to /etc/sysctl.d/99-ml-tuning.conf
so settings survive reboots.
"""
results = {"applied": [], "failed": []}
for param, (value, description) in self.OPTIMAL_SETTINGS.items():
success = write_sysctl(param, value, persist=persist)
if success:
results["applied"].append(param)
else:
results["failed"].append(param)
print(f"Applied: {len(results['applied'])} settings")
if results["failed"]:
print(f"Failed (run as root): {results['failed']}")
return results
def configure_thp_policy(mode: str = "madvise") -> bool:
"""Configure Transparent Huge Pages policy.
Modes:
- 'always': Always use huge pages for anonymous memory.
khugepaged runs constantly. Causes compaction pauses.
NOT recommended for ML.
- 'madvise': Only use huge pages when madvise(MADV_HUGEPAGE) is called.
Applications opt in. No system-wide compaction pressure.
RECOMMENDED for ML.
- 'never': Disable THP entirely. Slightly higher TLB pressure.
Use only if madvise causes issues with specific libraries.
"""
valid_modes = ("always", "madvise", "never")
if mode not in valid_modes:
raise ValueError(f"mode must be one of {valid_modes}")
thp_path = pathlib.Path("/sys/kernel/mm/transparent_hugepage/enabled")
try:
thp_path.write_text(mode)
current = thp_path.read_text().strip()
print(f"THP policy set to: {mode}")
print(f"Kernel confirms: {current}")
# Also configure khugepaged scan rate (slower scan = less compaction pressure)
khuge_path = pathlib.Path("/sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs")
if khuge_path.exists() and mode == "madvise":
khuge_path.write_text("10000") # Scan every 10 seconds instead of 10ms
print("khugepaged scan rate: 10000ms (reduced compaction overhead)")
return True
    except FileNotFoundError:
        print("THP sysfs interface not found (kernel built without THP support?)")
        return False
    except PermissionError:
        print("Permission denied. Run as root.")
        return False
CPU Governor and Frequency Scaling
CPU frequency governors control how the kernel scales CPU frequency in response to load. This has a direct impact on training throughput: CPU-side work such as DataLoader preprocessing scales with clock frequency, and PCIe transactions that involve the CPU for GPU-to-host copies are also sensitive to it.
Governor | Behavior | ML Impact
------------|------------------------|------------------------------------------
performance | Always max frequency | RECOMMENDED for training + inference
powersave | Always min frequency | 30-50% throughput loss vs performance
ondemand | Reactive scaling | Frequency ramp-up latency causes step jitter
schedutil | CFS-guided scaling | Better than ondemand, worse than performance
conservative| Very slow ramp-up | Never use for ML
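The helpers below report the active governor for every CPU and switch all CPUs to a chosen governor: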
import glob
def get_cpu_governor_info() -> dict:
"""Read CPU frequency governor status for all CPUs."""
info = {}
governor_paths = sorted(glob.glob(
"/sys/devices/system/cpu/cpu*/cpufreq/scaling_governor"
))
if not governor_paths:
info["error"] = "cpufreq not available (may be inside a VM or container)"
return info
governors = {}
for path in governor_paths:
cpu_id = path.split("/")[5] # "cpu0", "cpu1", etc.
with open(path) as f:
governor = f.read().strip()
governors[cpu_id] = governor
# Summary
unique_governors = set(governors.values())
info["governors"] = governors
info["unique_governors"] = list(unique_governors)
info["all_same"] = len(unique_governors) == 1
# Read current + max frequency for CPU 0
cpu0_dir = "/sys/devices/system/cpu/cpu0/cpufreq"
for freq_file in ["scaling_cur_freq", "scaling_max_freq", "scaling_min_freq",
"cpuinfo_max_freq", "cpuinfo_min_freq"]:
freq_path = f"{cpu0_dir}/{freq_file}"
if os.path.exists(freq_path):
with open(freq_path) as f:
freq_khz = int(f.read().strip())
info[freq_file + "_ghz"] = round(freq_khz / 1_000_000, 2)
return info
def set_cpu_governor(governor: str = "performance") -> dict:
"""Set CPU frequency governor for all CPUs.
Requires root. On cloud instances, this may require disabling
the cloud provider's CPU frequency management service first:
- AWS: disable cpupower-mgr.service
- GCP: disable the default performance optimization (already 'performance' on some instances)
- Azure: may require BIOS-level P-state control
For ML training: always use 'performance'.
For ML inference: 'performance' for consistent latency.
"""
valid_governors = ["performance", "powersave", "ondemand", "schedutil", "conservative"]
if governor not in valid_governors:
raise ValueError(f"Governor must be one of {valid_governors}")
results = {"set": [], "failed": []}
governor_paths = sorted(glob.glob(
"/sys/devices/system/cpu/cpu*/cpufreq/scaling_governor"
))
for path in governor_paths:
try:
with open(path, "w") as f:
f.write(governor)
cpu_id = path.split("/")[5]
results["set"].append(cpu_id)
except PermissionError:
results["failed"].append(path)
except OSError as e:
results["failed"].append(f"{path}: {e}")
if results["set"]:
print(f"Governor '{governor}' set on {len(results['set'])} CPUs")
if results["failed"]:
print(f"Failed on {len(results['failed'])} CPUs. Run as root.")
# Also disable Intel P-state driver's energy efficiency bias
epb_path = "/sys/devices/system/cpu/cpu0/power/energy_perf_bias"
if os.path.exists(epb_path):
# 0 = maximum performance, 15 = maximum power saving
try:
for epb_path in glob.glob("/sys/devices/system/cpu/cpu*/power/energy_perf_bias"):
with open(epb_path, "w") as f:
f.write("0")
print("Energy performance bias set to maximum performance on all CPUs")
except PermissionError:
print("Could not set energy_perf_bias (run as root)")
return results
IRQ Affinity and Interrupt Balancing
Network interface cards raise hardware interrupts when packets arrive. The irqbalance daemon (a userspace service, not part of the kernel) distributes these interrupts across CPUs so that no single CPU is saturated by interrupt processing. For general-purpose servers this is beneficial. For ML training servers, irqbalance can interfere with CPU pinning for DPDK or NCCL threads by moving IRQs onto CPUs that are dedicated to GPU compute.
For production ML inference clusters where latency consistency matters, you want to pin NIC interrupts to specific CPUs that are not used for model inference. This ensures the interrupt processing CPU never preempts the inference thread.
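The functions below enumerate a NIC's IRQs with their current CPU affinities, pin them to a chosen CPU set, and configure RPS/RFS packet steering: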
def get_nic_irq_mapping(interface: str) -> list[dict]:
"""Get IRQ-to-CPU affinity mapping for a network interface.
Returns list of IRQs associated with the interface and their
current CPU affinities (as CPU masks).
"""
irq_mappings = []
# Find IRQs for this interface
try:
with open("/proc/interrupts") as f:
lines = f.readlines()
except PermissionError:
print("Cannot read /proc/interrupts")
return []
    # Parse header to get CPU count (the header row lists only CPU0, CPU1, ...)
    cpu_count = len(lines[0].split())
for line in lines[1:]:
parts = line.split()
if not parts:
continue
irq_str = parts[0].rstrip(":")
if not irq_str.isdigit():
continue
irq_num = int(irq_str)
        # Everything after the per-CPU counters: chip name, hwirq, type, device name
        irq_name = " ".join(parts[cpu_count + 1:]) if len(parts) > cpu_count + 1 else ""
if interface in irq_name:
            # Read current CPU affinity (human-readable list form)
            affinity_list_path = f"/proc/irq/{irq_num}/smp_affinity_list"
            affinity = "unknown"
            if os.path.exists(affinity_list_path):
                with open(affinity_list_path) as f:
                    affinity = f.read().strip()
irq_mappings.append({
"irq": irq_num,
"name": irq_name.strip(),
"cpu_affinity": affinity,
})
return irq_mappings
def pin_nic_irqs_to_cpus(
interface: str,
cpu_list: list[int],
) -> dict:
"""Pin NIC interrupts to specific CPUs.
For ML inference servers:
- Pin NIC IRQs to CPUs 0-3 (management CPUs)
- Reserve CPUs 4-N for model inference (no interrupt preemption)
Requires root. Also disable irqbalance or it will override these settings.
"""
# First, stop irqbalance from overriding our settings
subprocess.run(["systemctl", "stop", "irqbalance"], capture_output=True)
irq_mappings = get_nic_irq_mapping(interface)
results = {"pinned": [], "failed": []}
# Create CPU mask from cpu_list
# For cpu_list=[0, 1]: mask is 0b11 = 3 = "3" in hex = "00000003"
cpu_mask = 0
for cpu in cpu_list:
cpu_mask |= (1 << cpu)
hex_mask = format(cpu_mask, "08x")
for irq_info in irq_mappings:
irq_num = irq_info["irq"]
affinity_path = f"/proc/irq/{irq_num}/smp_affinity"
try:
with open(affinity_path, "w") as f:
f.write(hex_mask)
results["pinned"].append({
"irq": irq_num,
"cpus": cpu_list,
"mask": hex_mask,
})
except (PermissionError, FileNotFoundError) as e:
results["failed"].append({"irq": irq_num, "error": str(e)})
print(f"NIC IRQ pinning for {interface}:")
print(f" Pinned {len(results['pinned'])} IRQs to CPUs {cpu_list}")
if results["failed"]:
print(f" Failed: {results['failed']}")
return results
def configure_rps_rfs(
interface: str,
cpu_mask_hex: str = "ff", # Default: spread across CPUs 0-7
rfs_flow_entries: int = 65536,
) -> None:
"""Configure Receive Packet Steering (RPS) and Receive Flow Steering (RFS).
RPS: software analog of RSS. Distributes packet processing across
CPUs in software (for NICs without hardware RSS support).
RFS: extends RPS by steering packets to the CPU where the application
thread is running, improving cache locality.
For ML inference: set RFS to steer packets to the inference worker CPUs.
"""
# RPS: configure which CPUs handle rx queue processing
rps_paths = sorted(glob.glob(
f"/sys/class/net/{interface}/queues/rx-*/rps_cpus"
))
for rps_path in rps_paths:
try:
with open(rps_path, "w") as f:
f.write(cpu_mask_hex)
except (PermissionError, FileNotFoundError):
pass
    # RFS: global flow count (writing requires root)
    rfs_global = pathlib.Path("/proc/sys/net/core/rps_sock_flow_entries")
    try:
        if rfs_global.exists():
            rfs_global.write_text(str(rfs_flow_entries))
        # Per-queue RFS flow entries: split the global count across rx queues
        per_queue_entries = rfs_flow_entries // max(1, len(rps_paths))
        for flow_path in glob.glob(f"/sys/class/net/{interface}/queues/rx-*/rps_flow_cnt"):
            with open(flow_path, "w") as f:
                f.write(str(per_queue_entries))
    except PermissionError:
        print("Permission denied configuring RFS. Run as root.")
print(f"RPS/RFS configured on {interface}:")
print(f" RPS CPU mask: {cpu_mask_hex}")
print(f" RFS flow entries: {rfs_flow_entries}")
NUMA Tuning for Multi-Socket Servers
NUMA (Non-Uniform Memory Access) is the architecture of modern multi-socket servers. A 2-socket AMD EPYC or Intel Xeon server has two NUMA nodes. CPUs on socket 0 can access their local RAM at full speed, but accessing RAM on socket 1 goes through the inter-socket QPI/xGMI link, adding 30-80 ns latency and halving the effective bandwidth.
For ML training, NUMA matters because:
- GPU-to-CPU DMA transfers for tensor operations should go through the GPU's local NUMA node
- DataLoader workers should run on the same NUMA node as the GPU they feed
- Model weights allocated on the wrong NUMA node suffer up to 50% bandwidth degradation
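The helpers below read the NUMA topology (including GPU affinity where nvidia-smi reports it), flag misalignment, and build a numactl command that pins training to the GPU-local node: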
def get_numa_topology() -> dict:
"""Get NUMA topology: nodes, CPUs per node, GPU NUMA affinity."""
topology = {}
# Read NUMA nodes from sysfs
numa_nodes = sorted(glob.glob("/sys/devices/system/node/node*/"))
topology["num_numa_nodes"] = len(numa_nodes)
topology["nodes"] = {}
for node_path in numa_nodes:
node_id = int(node_path.rstrip("/").split("node")[-1])
node_info = {}
# CPU list for this node
cpu_list_path = f"{node_path}/cpulist"
if os.path.exists(cpu_list_path):
with open(cpu_list_path) as f:
node_info["cpu_list"] = f.read().strip()
# Memory info
meminfo_path = f"{node_path}/meminfo"
if os.path.exists(meminfo_path):
with open(meminfo_path) as f:
meminfo = f.read()
for line in meminfo.splitlines():
if "MemTotal" in line:
node_info["mem_total_mb"] = int(line.split(":")[1].strip().split()[0]) // 1024
elif "MemFree" in line:
node_info["mem_free_mb"] = int(line.split(":")[1].strip().split()[0]) // 1024
topology["nodes"][node_id] = node_info
    # Get GPU NUMA affinity via nvidia-smi (skip if no NVIDIA driver installed;
    # support for the numa_affinity query field varies by driver version)
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=index,name,numa_affinity", "--format=csv,noheader"],
            capture_output=True, text=True
        )
    except FileNotFoundError:
        return topology
if result.returncode == 0:
topology["gpus"] = {}
for line in result.stdout.strip().splitlines():
parts = [p.strip() for p in line.split(",")]
if len(parts) >= 3:
gpu_id = int(parts[0])
topology["gpus"][gpu_id] = {
"name": parts[1],
"numa_node": parts[2],
}
return topology
def check_numa_misalignment() -> list[str]:
"""Check for NUMA misalignment that could hurt ML performance."""
warnings = []
topology = get_numa_topology()
if topology.get("num_numa_nodes", 1) <= 1:
return [] # Single NUMA node, no issue
gpus = topology.get("gpus", {})
nodes = topology.get("nodes", {})
# Check: are GPUs distributed across NUMA nodes?
gpu_numa_nodes = set(v.get("numa_node") for v in gpus.values())
if len(gpu_numa_nodes) > 1:
warnings.append(
"GPUs are spread across multiple NUMA nodes. "
"Pin each training process to the NUMA node local to its GPUs."
)
return warnings
def run_numactl_for_training(
training_script: str,
gpu_id: int,
numa_node: int | None = None,
) -> list[str]:
"""Generate numactl command to pin training to the correct NUMA node.
numactl --cpunodebind=N --membind=N ensures that all CPU threads
and memory allocations for this process come from NUMA node N,
which should match the NUMA node of the GPU being used.
Returns the full command list for subprocess.run().
"""
    if numa_node is None:
        # Detect the GPU's NUMA node; fall back to node 0 if detection fails
        numa_node = 0
        try:
            result = subprocess.run(
                ["nvidia-smi", "--query-gpu=numa_affinity",
                 f"--id={gpu_id}", "--format=csv,noheader"],
                capture_output=True, text=True
            )
            if result.returncode == 0 and result.stdout.strip().isdigit():
                numa_node = int(result.stdout.strip())
        except FileNotFoundError:
            pass
cmd = [
"numactl",
f"--cpunodebind={numa_node}",
f"--membind={numa_node}",
"python", training_script,
]
print(f"NUMA-pinned training command:")
print(f" GPU {gpu_id} on NUMA node {numa_node}")
print(f" Command: {' '.join(cmd)}")
return cmd
NIC Tuning for Distributed Training
High-performance ML training clusters require specific NIC tuning to achieve consistent throughput for NCCL communications.
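The function below applies that tuning with ethtool: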
def tune_nic_for_ml(interface: str) -> dict:
"""Apply NIC tuning for ML distributed training workloads.
Key settings:
- Increase ring buffer sizes (more packets buffered before drop)
- Disable interrupt coalescing for latency-sensitive inference
- Enable interrupt coalescing for throughput-sensitive training
- Configure RSS (Receive Side Scaling) to distribute load across CPUs
"""
results = {}
# Check current ring buffer sizes
result = subprocess.run(
["ethtool", "-g", interface],
capture_output=True, text=True
)
results["ring_buffer_current"] = result.stdout
# Maximize ring buffers (prevents packet drops under NCCL burst traffic)
result = subprocess.run(
["ethtool", "-G", interface, "rx", "4096", "tx", "4096"],
capture_output=True, text=True
)
results["ring_buffer_set"] = result.returncode == 0
# For training: enable interrupt coalescing (higher throughput, more latency)
# For inference: disable interrupt coalescing (lower latency, more CPU)
result = subprocess.run(
["ethtool", "-C", interface,
"rx-usecs", "50", # 50us coalescing for training
"tx-usecs", "50",
"rx-frames", "64", # Coalesce up to 64 frames
],
capture_output=True, text=True
)
results["coalescing_set"] = result.returncode == 0
# Check and set number of hardware RSS queues
result = subprocess.run(
["ethtool", "-l", interface],
capture_output=True, text=True
)
results["queue_info"] = result.stdout
    # Set combined queues to match CPU count (capped at 16 here; many NICs max out lower)
cpu_count = os.cpu_count() or 16
result = subprocess.run(
["ethtool", "-L", interface, "combined", str(min(cpu_count, 16))],
capture_output=True, text=True
)
results["queues_set"] = result.returncode == 0
# Enable TCP segmentation offload and checksum offload
# These offload CPU work to the NIC hardware
for feature in ["tso", "gso", "gro", "tx-checksumming", "rx-checksumming"]:
subprocess.run(
["ethtool", "-K", interface, feature, "on"],
capture_output=True
)
print(f"NIC tuning applied to {interface}")
return results
Kernel Boot Parameters
Some tuning parameters can only be set at boot time via grub. For dedicated ML servers, these are worth configuring in /etc/default/grub.
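The reference dictionary below documents each parameter, and the helper after it assembles a recommended command line: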
GRUB_ML_PARAMS = {
"isolcpus": {
"example": "isolcpus=2-7",
"effect": "Removes CPUs 2-7 from general kernel scheduler. "
"Only explicitly pinned processes (via taskset/numactl) run there. "
"Use for DPDK polling cores or high-priority inference threads.",
"ml_use_case": "DPDK packet processing, latency-critical inference",
},
"nohz_full": {
"example": "nohz_full=2-7",
"effect": "Disables scheduler tick on listed CPUs when only one task is runnable. "
"Eliminates 1ms timer interrupts that can disrupt latency-sensitive inference. "
"Must be used with isolcpus for effectiveness.",
"ml_use_case": "Sub-millisecond inference latency, timing-sensitive training benchmarks",
},
"rcu_nocbs": {
"example": "rcu_nocbs=2-7",
"effect": "Moves RCU callbacks off the listed CPUs to dedicated kthreads. "
"Prevents RCU callbacks from preempting ML compute threads. "
"Usually paired with nohz_full.",
"ml_use_case": "Same as nohz_full - maximum compute isolation",
},
"transparent_hugepage": {
"example": "transparent_hugepage=madvise",
"effect": "Sets THP policy at boot. Same as writing to sysfs but permanent.",
"ml_use_case": "All ML workloads",
},
"numa_balancing": {
"example": "numa_balancing=disable",
"effect": "Disables NUMA automatic balancing. Prevents kernel from migrating "
"pages between NUMA nodes mid-training. Same as setting kernel.numa_balancing=0 "
"via sysctl but takes effect before userspace starts.",
"ml_use_case": "Multi-socket servers with large training jobs",
},
"intel_idle.max_cstate": {
"example": "intel_idle.max_cstate=1",
"effect": "Limits Intel CPU to C1 sleep state (shallowest). "
"Deeper C-states (C6, C10) save power but have 100-300us wake latency. "
"Prevents inference latency spikes when CPU wakes from deep sleep.",
"ml_use_case": "Low-latency inference servers",
},
"processor.max_cstate": {
"example": "processor.max_cstate=1",
"effect": "AMD equivalent of intel_idle.max_cstate. "
"Same effect: limit sleep depth to minimize wake latency.",
"ml_use_case": "Low-latency inference on AMD servers",
},
}
def generate_grub_cmdline(
isolate_cpus: list[int] | None = None,
workload: str = "training", # "training" or "inference"
) -> str:
"""Generate recommended GRUB_CMDLINE_LINUX_DEFAULT for ML servers.
Returns the command line string to add to /etc/default/grub.
After editing grub, run: update-grub && reboot
"""
params = []
# Always: disable THP compaction at boot level
params.append("transparent_hugepage=madvise")
# Always: disable NUMA auto-balancing
params.append("numa_balancing=disable")
if isolate_cpus:
cpu_range = f"{min(isolate_cpus)}-{max(isolate_cpus)}"
params.append(f"isolcpus={cpu_range}")
params.append(f"nohz_full={cpu_range}")
params.append(f"rcu_nocbs={cpu_range}")
if workload == "inference":
# For inference: minimize C-state wake latency
params.append("intel_idle.max_cstate=1") # Intel
params.append("processor.max_cstate=1") # AMD
return " ".join(params)
Comprehensive Performance Audit Script
import time
class MLPerformanceAudit:
"""Complete ML server performance audit.
Checks all major tuning areas and produces a prioritized
list of improvements with estimated impact.
"""
def run_full_audit(self) -> dict:
"""Run comprehensive performance audit."""
report = {
"timestamp": time.strftime("%Y-%m-%d %H:%M:%S"),
"sections": {},
}
# 1. Memory management
report["sections"]["memory"] = self._audit_memory()
# 2. CPU governor
report["sections"]["cpu"] = self._audit_cpu()
# 3. NUMA topology
report["sections"]["numa"] = self._audit_numa()
# 4. THP policy
report["sections"]["thp"] = self._audit_thp()
# 5. Network stack
report["sections"]["network"] = self._audit_network()
# 6. File limits
report["sections"]["limits"] = self._audit_limits()
# 7. Kernel version
report["sections"]["kernel"] = self._audit_kernel()
return report
def _audit_memory(self) -> dict:
issues = []
        swappiness = read_sysctl("vm.swappiness")
        if swappiness.isdigit() and int(swappiness) > 10:
            issues.append({
                "severity": "HIGH",
                "param": "vm.swappiness",
                "current": swappiness,
                "recommended": "1",
                "impact": "Swappiness >10 may cause model weights to be paged out, "
                          "adding 10-1000ms latency per page fault during training.",
            })
        overcommit = read_sysctl("vm.overcommit_memory")
        if overcommit == "0":
issues.append({
"severity": "MEDIUM",
"param": "vm.overcommit_memory",
"current": "0",
"recommended": "1",
"impact": "Heuristic overcommit can cause fork() to fail even with "
"abundant free RAM. DataLoader workers may fail to start.",
})
        zone_reclaim = read_sysctl("vm.zone_reclaim_mode")
        if zone_reclaim.isdigit() and int(zone_reclaim) != 0:
            issues.append({
                "severity": "MEDIUM",
                "param": "vm.zone_reclaim_mode",
                "current": zone_reclaim,
"recommended": "0",
"impact": "Zone reclaim causes latency spikes when NUMA zones are reclaimed "
"under memory pressure. Can cause 100-500ms training step pauses.",
})
return {"issues": issues, "total_issues": len(issues)}
def _audit_cpu(self) -> dict:
issues = []
governor_info = get_cpu_governor_info()
if "error" not in governor_info:
governors = governor_info.get("unique_governors", [])
if any(g != "performance" for g in governors):
issues.append({
"severity": "HIGH",
"param": "cpu_governor",
"current": str(governors),
"recommended": "performance",
"impact": "Non-performance governor reduces CPU frequency during "
"DataLoader preprocessing and PCIe operations. "
"Typical impact: 15-40% throughput reduction.",
})
return {"issues": issues, "governor_info": governor_info}
    def _audit_numa(self) -> dict:
        """Surface NUMA misalignment warnings from check_numa_misalignment()."""
        issues = [
            {
                "severity": "MEDIUM",
                "param": "numa_alignment",
                "current": "misaligned",
                "recommended": "pin each process to its GPU-local NUMA node",
                "impact": warning,
            }
            for warning in check_numa_misalignment()
        ]
        return {"issues": issues}
    def _audit_thp(self) -> dict:
issues = []
thp_path = pathlib.Path("/sys/kernel/mm/transparent_hugepage/enabled")
if thp_path.exists():
content = thp_path.read_text().strip()
# Active mode is enclosed in [brackets]
active_mode = "unknown"
for part in content.split():
if part.startswith("[") and part.endswith("]"):
active_mode = part[1:-1]
if active_mode == "always":
issues.append({
"severity": "HIGH",
"param": "transparent_hugepage",
"current": "always",
"recommended": "madvise",
"impact": "THP=always causes khugepaged compaction during model "
"initialization and checkpoint loading. Typical impact: "
"2-10 second unexplained pauses during training startup.",
})
return {"issues": issues}
def _audit_network(self) -> dict:
issues = []
        rmem_max = read_sysctl("net.core.rmem_max")
        if rmem_max.isdigit() and int(rmem_max) < 67_108_864:  # 64 MB
            issues.append({
                "severity": "HIGH",
                "param": "net.core.rmem_max",
                "current": rmem_max,
                "recommended": "134217728",
"impact": "Small socket buffers cause TCP window scaling to limit "
"bandwidth on high-latency paths. NCCL fallback to TCP "
"will be significantly underperforming its potential.",
})
        numa_balancing = read_sysctl("kernel.numa_balancing")
        if numa_balancing.isdigit() and int(numa_balancing) != 0:
            issues.append({
                "severity": "MEDIUM",
                "param": "kernel.numa_balancing",
                "current": numa_balancing,
                "recommended": "0",
"impact": "NUMA auto-balancing moves pages between NUMA nodes during "
"training. Can cause 1-5% throughput variance per step, "
"which matters for step time consistency in distributed training.",
})
return {"issues": issues}
def _audit_limits(self) -> dict:
issues = []
        # Check the soft open-file limit that a training process would inherit
        import resource
        nofile, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
        if nofile != resource.RLIM_INFINITY:
            if nofile < 65536:
issues.append({
"severity": "MEDIUM",
"param": "ulimit -n (nofile)",
"current": str(nofile),
"recommended": "1048576",
"impact": "Low open file limit causes 'Too many open files' errors "
"in distributed training with many TCP connections.",
})
return {"issues": issues}
def _audit_kernel(self) -> dict:
result = subprocess.run(["uname", "-r"], capture_output=True, text=True)
kernel_version = result.stdout.strip() if result.returncode == 0 else "unknown"
# Parse major.minor version
try:
major, minor = [int(x) for x in kernel_version.split(".")[:2]]
kernel_ok = major > 5 or (major == 5 and minor >= 15)
except ValueError:
kernel_ok = None
return {
"kernel_version": kernel_version,
"meets_min_version": kernel_ok,
"min_recommended": "5.15 LTS",
"notes": [
"5.15 LTS: io_uring stable, cgroup v2 default",
"5.19+: improved io_uring for asyncio (Python asyncio file I/O)",
"6.1 LTS: improved NUMA scheduler, better huge page handling",
]
}
def print_summary(self) -> None:
"""Run audit and print prioritized summary."""
report = self.run_full_audit()
all_issues = []
for section_name, section in report["sections"].items():
for issue in section.get("issues", []):
issue["section"] = section_name
all_issues.append(issue)
high = [i for i in all_issues if i["severity"] == "HIGH"]
medium = [i for i in all_issues if i["severity"] == "MEDIUM"]
print(f"\n{'='*60}")
print(f"ML Server Performance Audit - {report['timestamp']}")
print(f"{'='*60}")
print(f"HIGH severity issues: {len(high)}")
print(f"MEDIUM severity issues: {len(medium)}")
print()
if high:
print("HIGH PRIORITY (fix these first):")
for issue in high:
print(f"\n [{issue['section'].upper()}] {issue['param']}")
print(f" Current: {issue['current']}")
print(f" Optimal: {issue['recommended']}")
print(f" Impact: {issue['impact']}")
if medium:
print("\nMEDIUM PRIORITY:")
for issue in medium:
print(f"\n [{issue['section'].upper()}] {issue['param']}")
print(f" Current: {issue['current']}")
print(f" Optimal: {issue['recommended']}")
# Usage
if __name__ == "__main__":
audit = MLPerformanceAudit()
audit.print_summary()
print("\n\nTo apply all recommended settings (requires root):")
print(" tuner = MLServerTuner()")
print(" tuner.apply_all(persist=True)")
print(" set_cpu_governor('performance')")
print(" configure_thp_policy('madvise')")
Production Engineering Notes
Apply tuning at instance launch, not post-hoc: All tuning changes should be applied by your instance initialization script (cloud-init, Ansible, Packer AMI) so every machine in the cluster is identically tuned. Ad-hoc tuning applied to some machines but not others makes distributed training debugging a nightmare - a slow node might simply have the wrong governor rather than a hardware fault.
Verify with benchmarks, not just sysctl reads: After applying tuning, run a standardized benchmark: iperf3 -c <peer> for network throughput, mlperf for training throughput, lat_mem_rd from lmbench for memory latency. Document baseline and post-tuning numbers. This creates a corpus of expected performance that makes future regressions detectable.
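As a sketch of the network leg of that benchmark corpus (assumes iperf3 is installed and that you supply a reachable peer address; the JSON field path matches iperf3's -J output for TCP tests):
import json
import subprocess
def iperf3_throughput_gbps(server: str, seconds: int = 10) -> float:
    """Run iperf3 against a peer node and return received throughput in Gbps."""
    result = subprocess.run(
        ["iperf3", "-c", server, "-t", str(seconds), "-J"],
        capture_output=True, text=True, check=True,
    )
    report = json.loads(result.stdout)
    return report["end"]["sum_received"]["bits_per_second"] / 1e9
# Record before and after tuning, e.g. baseline = iperf3_throughput_gbps("10.0.0.2")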
The socket buffer size and NCCL interaction: NCCL's TCP transport uses large buffers for gradient transfers. If net.core.rmem_max is 212,992 bytes (the Ubuntu default), the kernel limits every TCP socket's receive buffer to 208 KB. At 200 Gbps InfiniBand bandwidth, a 208 KB buffer is filled in 8 microseconds, causing TCP to advertise a zero window and stall the sender while the receiver drains the buffer. With 128 MB buffers, the window never closes during normal NCCL operation, and TCP can sustain wire-speed transfers.
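The buffer sizing follows from the bandwidth-delay product; a minimal sketch with illustrative numbers:
def min_socket_buffer_bytes(bandwidth_gbps: float, rtt_ms: float) -> int:
    """Bandwidth-delay product: the smallest buffer that keeps the pipe full."""
    bytes_per_second = bandwidth_gbps * 1e9 / 8
    return int(bytes_per_second * rtt_ms / 1e3)
# A 100 Gbps path with 1 ms RTT needs ~12.5 MB just to cover the BDP;
# the 128 MB rmem_max recommended earlier leaves headroom for bursts.
print(min_socket_buffer_bytes(100, 1.0))  # 12500000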
Hugepage reservation for inference services: For low-latency inference servers, consider pre-reserving 2 MB huge pages for the model server process. Add vm.nr_hugepages = 16384 (32 GB) to sysctl to reserve an explicit hugetlb pool, and map it in the model loading code via mmap with MAP_HUGETLB (or through hugetlbfs). Note that madvise(addr, len, MADV_HUGEPAGE) is the THP opt-in path, which does not draw from this reserved pool. Explicit reservation eliminates huge-page allocation latency and guarantees huge pages remain available even after the system has been running for hours and physical memory is fragmented.
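A minimal sketch of the mapping side, assuming Linux and a Python build that exposes mmap.MAP_HUGETLB (the getattr guard falls back to a regular mapping plus the THP opt-in when it does not):
import mmap
# Map 1 GB backed by the reserved hugetlb pool; raises OSError if
# vm.nr_hugepages has not reserved enough 2 MB pages.
flags = mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS | getattr(mmap, "MAP_HUGETLB", 0)
buf = mmap.mmap(-1, 1 << 30, flags=flags)
if not hasattr(mmap, "MAP_HUGETLB") and hasattr(mmap, "MADV_HUGEPAGE"):
    buf.madvise(mmap.MADV_HUGEPAGE)  # THP opt-in path instead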
The vm.zone_reclaim_mode footgun: When vm.zone_reclaim_mode=1, the kernel reclaims memory from a NUMA node's local zone before allocating from a remote NUMA zone. This sounds reasonable but causes severe latency spikes when a training process fills a NUMA node's RAM and the kernel starts reclaiming its own pages instead of allocating from the adjacent NUMA node. For ML workloads where the entire dataset fits in NUMA-local RAM, this is catastrophic. Always set to 0.
Common Mistakes
:::danger THP=always Causing Training Stalls
If your training logs show unexplained 3-30 second pauses early in a run (during model initialization or the first few batches), check: cat /sys/kernel/mm/transparent_hugepage/enabled. If it shows [always], this is almost certainly khugepaged compacting 300 GB of model tensors into huge pages. Fix: echo madvise > /sys/kernel/mm/transparent_hugepage/enabled. The pauses will disappear immediately. This is the single most common cause of mysterious training slowdowns on new cloud instances.
:::
:::danger vm.swappiness=60 on a Training Server
On a server with 1 TB RAM and a 100 GB model in memory, swappiness=60 causes the kernel to swap model weight pages to disk whenever they haven't been accessed in the last few minutes. When the training loop accesses those pages, each page fault reads from an NVMe SSD (100 microseconds) or network storage (10 milliseconds). For a model with 10 billion 4-byte parameters, paging out even 1% of weights causes 400 million bytes of page faults - that is seconds of added latency per step. Set vm.swappiness=1 on dedicated training servers.
:::
:::warning CPU Governor Not Set After Reboot
Sysctl changes survive reboots if placed in /etc/sysctl.d/. But cpufreq governor changes made via echo performance > /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor do NOT survive reboots. Use cpupower frequency-set -g performance and enable cpupower.service to make this persistent, or add the echo commands to /etc/rc.local. Cloud providers sometimes reset the governor on instance resize or maintenance events - add monitoring that alerts if the governor switches away from performance during active training runs; a minimal check is sketched after this callout.
:::
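Following the warning above, a minimal governor check you can run from cron or a monitoring agent (alert when it exits nonzero):
import glob
import sys
def governors_all_performance() -> bool:
    """True when every CPU reports the 'performance' scaling governor."""
    paths = glob.glob("/sys/devices/system/cpu/cpu*/cpufreq/scaling_governor")
    return bool(paths) and all(
        open(p).read().strip() == "performance" for p in paths
    )
if __name__ == "__main__":
    sys.exit(0 if governors_all_performance() else 1)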
:::warning NUMA Balancing Causing Step Time Variance
If your distributed training shows consistent variance in per-step time (some steps consistently 20-30% slower than others in a predictable cycle), check if NUMA auto-balancing is enabled: cat /proc/sys/kernel/numa_balancing. A value of 1 means the kernel is running NUMA scanning threads that periodically migrate pages. During migration, the process that owns the pages experiences additional page faults. Fix: sysctl -w kernel.numa_balancing=0. Expect step time variance to drop by 5-20% and worst-case step times to improve significantly. Add to /etc/sysctl.d/99-ml-tuning.conf for persistence.
:::
Interview Questions
Q1: A new ML training server is running at 60% of expected GPU utilization. The GPUs are not compute-bound. What Linux-level issues would you investigate first?
Systematic investigation order: (1) CPU governor - run cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor and verify all CPUs show performance. A powersave or ondemand governor reduces DataLoader preprocessing speed and PCIe transfer efficiency by 30-50%. (2) THP policy - check cat /sys/kernel/mm/transparent_hugepage/enabled. If [always], khugepaged may be causing intermittent pauses. (3) NUMA alignment - run nvidia-smi topo -m and verify GPUs and the NIC are on the same NUMA node. If training data is being allocated on NUMA node 1 while GPU is on NUMA node 0, you get 50% effective memory bandwidth. (4) Network buffer sizes - check sysctl net.core.rmem_max. If 212,992 (default Ubuntu), NCCL TCP fallback will stall due to small buffers. (5) IRQ distribution - cat /proc/interrupts | grep eth and verify NIC interrupts are not all landing on the same CPU that the training process uses.
Q2: Explain the difference between vm.swappiness=0 and vm.swappiness=1 on modern Linux.
On Linux kernel 3.5+, vm.swappiness=0 changed semantics: it disables swapping entirely for anonymous memory as long as any free pages exist. The OOM killer fires before any swapping occurs. This is aggressive: on a server where a runaway process suddenly allocates 90% of RAM, the kernel OOM-kills processes instead of swapping. For dedicated ML servers where you want zero swap, this is correct. vm.swappiness=1 allows swapping but strongly prefers reclaiming file-backed pages (page cache) over swapping anonymous memory (model weights). This is safer for systems where OOM kills would be catastrophic (inference servers that must not die). The practical difference: with swappiness=0, a memory-leaking process causes OOM kill immediately. With swappiness=1, the kernel first evicts page cache (training data prefetch buffers), then reclaims slab cache, and only swaps anonymous memory as a last resort before OOM killing.
Q3: What does nohz_full do and when would you use it for ML inference?
nohz_full=<cpu_list> disables the scheduler's periodic tick on the listed CPUs when exactly one task is runnable there. Normally the kernel fires a timer interrupt on every CPU at CONFIG_HZ frequency (commonly 250 or 1000 Hz) to check whether the running process should be preempted, update usage statistics, and run callbacks. At HZ=1000 that is up to 1000 interruptions per second hitting ML inference computations. Each interruption adds 1-5 microseconds of latency (cache invalidation, context switch overhead). For a model server targeting 500 microsecond P99 latency, this jitter is a significant percentage of the budget. With nohz_full, the listed CPUs run with no periodic tick, and the inference thread runs completely uninterrupted until it voluntarily yields. Must be combined with isolcpus (to prevent the scheduler from placing other tasks on those CPUs) and rcu_nocbs (to keep RCU callbacks off them). This combination can reduce P99 inference latency by 10-20% and P99.9 by 30-40% on latency-sensitive serving infrastructure.
Q4: Your distributed training job on a 2-socket machine runs 25% slower than on a single-socket machine with half the cores. What is likely happening and how do you fix it?
Classic NUMA-related issue: PyTorch DataLoader workers and CUDA operations are running on CPUs from both NUMA sockets, causing cross-socket memory accesses. On a 2-socket system, cross-NUMA memory bandwidth is typically 40-60% of local bandwidth and adds 50-100 ns additional latency per memory access. If the GPU is on NUMA node 0 but DataLoader workers run on CPUs from NUMA node 1, all tensor data must cross the inter-socket QPI/xGMI link before reaching the GPU. This effectively halves the memory bandwidth available to the training job. Diagnosis: numastat -c -p <training_pid> shows cross-NUMA memory accesses. lstopo shows the hardware topology. Fix: numactl --cpunodebind=0 --membind=0 python train.py pins the training process to NUMA node 0 (local to the GPU). Set export OMP_NUM_THREADS=<cpus_on_node_0>. Set kernel.numa_balancing=0 to prevent the kernel from undoing your manual binding. Expected result: training throughput matches or exceeds the single-socket baseline because all memory accesses are NUMA-local.
Q5: What is the vm.zone_reclaim_mode=0 setting and why is it specifically important for ML training servers?
vm.zone_reclaim_mode controls how the kernel handles memory pressure within a NUMA zone. When set to 1 (enabled), if a process on NUMA node 0 exhausts node 0's local free memory, the kernel tries to reclaim pages from node 0's own cache before allocating from node 1's free memory. The reclaim process involves page writeback, LRU scanning, and potentially page migrations - all of which cause the allocating process to stall. When set to 0 (disabled), the kernel simply allocates from the nearest available free memory (node 1) without any reclaim on node 0. For ML training servers where model weights and activations fill RAM deliberately (not because of leaks), zone reclaim triggers constantly as allocation patterns cycle through training batches. With zone_reclaim_mode=1, training step times can vary by 5-15% with periodic spikes of 30-100% due to reclaim stalls. With zone_reclaim_mode=0, allocation always succeeds immediately from remote-node memory, trading slightly higher average memory latency for completely predictable allocation time. The slightly higher average latency (remote NUMA vs local NUMA) is a much smaller penalty than the periodic multi-millisecond reclaim stalls.
