
Linux Process Scheduling

The Kubernetes Cluster That Throttled Itself to Death

A team running distributed PyTorch training on Kubernetes noticed something troubling. Their training throughput was 40% lower than the same job run on bare metal. The GPU utilization was fine - 95%+. The network bandwidth was fine. The CPU utilization showed 70% across the cluster. By every naive metric, the cluster had headroom. So why was training 40% slower?

A senior engineer ran kubectl describe pod on a few training pods and spotted the configuration: resources.limits.cpu: "4". Four CPU cores per pod. The pods were requesting 4 CPUs and being granted 4 CPUs. But the actual sustained CPU usage was hitting 3.8 cores, very close to the limit. The cluster CPU utilization number of 70% was a lie - it was averaged across all pods including idle ones.

The real problem: Kubernetes CPU limits are implemented via CFS (Completely Fair Scheduler) bandwidth control. A pod with limits.cpu: "4" gets a quota of 400ms of CPU time every 100ms scheduling period. If the pod uses 400ms before the period ends, it is throttled - blocked from running until the next 100ms window. When a GPU kernel finishes and needs to launch the next operation immediately, the CPU thread is throttled and sits blocked for up to 100ms. 100ms is an eternity when you are trying to chain thousands of small GPU kernel launches.

The fix was counterintuitive: remove the CPU limit entirely (keep the CPU request, remove the limit). Without a hard limit, Kubernetes uses the request for scheduling decisions but does not throttle burst usage. Training throughput went from 60% of bare metal to 92% of bare metal immediately.

This is not a Kubernetes quirk. It is how Linux CFS bandwidth control works. Understanding it requires understanding the CFS scheduler from first principles: what it optimizes for, how it implements fairness, and exactly where it introduces the latency that destroys GPU kernel launch throughput in distributed training.


Why This Exists - The Scheduling Problem

A computer with 8 CPU cores and 200 running processes must make a decision thousands of times per second: which process gets to run next, on which core, for how long? The scheduler is the kernel subsystem that answers this question.

The challenge is that different workloads have radically different requirements. A video game needs smooth, regular CPU time to maintain 60 FPS - any 16ms hiccup causes a dropped frame. A file compression job wants maximum CPU throughput but does not care about latency. A real-time audio DSP needs a hard guarantee of CPU access within microseconds, or the audio buffer underruns and you hear crackling. A training job needs to keep 8 GPUs busy continuously - which means the CPU overhead threads must respond quickly enough that no GPU ever sits idle.

No single scheduling strategy satisfies all of these at once. Linux handles this through multiple scheduling classes: the CFS (Completely Fair Scheduler) for normal processes, real-time classes (SCHED_FIFO, SCHED_RR) for latency-critical work, and deadline scheduling (SCHED_DEADLINE) for periodic real-time tasks.


Historical Context

1993 - The O(N) scheduler. Early Linux schedulers iterated through all runnable tasks to find the next one to run - O(N) complexity. On a machine with 1000 runnable processes, every scheduling decision was slow.

2001 - The O(1) scheduler. Ingo Molnar's O(1) scheduler used bitmask priority arrays to select the next task in constant time. But it used heuristics to distinguish interactive (I/O-bound) from batch (CPU-bound) processes that were often wrong.

2007 - CFS. Ingo Molnar introduced the Completely Fair Scheduler in Linux 2.6.23. CFS abandoned the concept of fixed time slices. Instead, it tracks accumulated CPU time per process (called vruntime) and always runs the process with the smallest vruntime - the one that has received the least CPU time relative to what it deserves. This provides fairness without heuristics.

2008 - cgroups. The control groups subsystem allowed grouping processes and limiting their resource consumption collectively. Combined with CFS bandwidth control (added in 2012), this became the foundation for Kubernetes CPU resource management.

2019 - EEVDF scheduler. Work began on the Earliest Eligible Virtual Deadline First (EEVDF) scheduler, merged into Linux 6.6 (2023), which improves latency and fairness for mixed interactive/batch workloads. PyTorch training on modern kernels benefits from EEVDF's improved handling of wake-up latency.


Core Concepts

CFS: The Completely Fair Scheduler

CFS maintains a red-black tree (self-balancing binary search tree) of all runnable tasks, keyed by vruntime. The task with the smallest vruntime is always at the leftmost node and runs next.

vruntime is the amount of CPU time a task has received, weighted by its priority:

\text{vruntime} = \text{physical CPU time} \times \frac{\text{NICE\_0\_WEIGHT}}{\text{task weight}}

Where NICE_0_WEIGHT is the weight for a niceness of 0 (default). Higher-priority tasks (lower nice value, higher weight) accumulate vruntime more slowly - they can run longer before being preempted. Lower-priority tasks accumulate vruntime faster and are preempted sooner.
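To make the weighting concrete, here is a rough sketch - not the kernel's implementation, which uses the fixed-point sched_prio_to_weight table - of how fast vruntime advances at different nice levels:

NICE_0_WEIGHT = 1024

def weight_for_nice(nice: int) -> float:
    """Approximate CFS weight: nice 0 = 1024, each nice step is ~1.25x."""
    return NICE_0_WEIGHT / (1.25 ** nice)

def vruntime_delta_ms(physical_runtime_ms: float, nice: int) -> float:
    """How much a task's vruntime advances after running physical_runtime_ms."""
    return physical_runtime_ms * NICE_0_WEIGHT / weight_for_nice(nice)

for nice in (-5, 0, 5):
    print(f"nice {nice:+d}: 10ms of CPU advances vruntime by "
          f"{vruntime_delta_ms(10, nice):.1f}ms")
# nice -5 advances vruntime slowly (~3.3ms per 10ms of CPU), so it stays near
# the left of the tree and is picked again sooner; nice +5 advances ~30.5ms
# and falls behind.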

CFS Run Queue (Red-Black Tree)
Ordered by vruntime (smallest = leftmost = runs next)

                 [proc_B vruntime=100]
                /                     \
  [proc_A vruntime=50]         [proc_C vruntime=150]
                                          \
                                  [proc_D vruntime=200]

Next to run: proc_A (smallest vruntime = 50)
After A runs for ~10ms, its vruntime increases to ~110, tree rebalances

The scheduling latency (target time to give every process at least one slice) is configurable:

cat /proc/sys/kernel/sched_latency_ns           # default: 24000000 (24ms)
cat /proc/sys/kernel/sched_min_granularity_ns   # default: 3000000 (3ms)
# On newer kernels (5.13+) these tunables live under /sys/kernel/debug/sched/
# instead; EEVDF kernels (6.6+) replace them with base_slice_ns.

With 8 runnable processes and 24ms target latency, each process gets one 3ms slice per 24ms cycle. A GPU kernel launch handler that is sleeping wakes up, gets into the run queue, and waits up to 21ms before running. This is scheduling jitter, and it kills GPU utilization.
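A rough model of how the per-task slice falls out of these two knobs - the real CFS also scales slices by task weight and stretches the period when there are many tasks, so treat this as an approximation:

def cfs_timeslice_ms(nr_running: int,
                     sched_latency_ms: float = 24.0,
                     min_granularity_ms: float = 3.0) -> float:
    """Approximate CFS slice: split the latency target across runnable tasks,
    but never go below the minimum granularity."""
    if nr_running * min_granularity_ms > sched_latency_ms:
        return min_granularity_ms   # too many tasks: the period stretches instead
    return sched_latency_ms / nr_running

for n in (2, 8, 32):
    slice_ms = cfs_timeslice_ms(n)
    print(f"{n:2d} runnable tasks -> ~{slice_ms:.0f}ms slice, "
          f"worst-case wait ~{slice_ms * (n - 1):.0f}ms")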

Nice Values and Priority

Every Linux process has a nice value from -20 (highest priority) to +19 (lowest priority). Nice 0 is the default. The name comes from "being nice to other processes" - a higher nice value means you are more willing to yield.

The translation from nice value to CFS weight follows an exponential scale where each level is roughly 1.25x:

Nice value    Weight    Relative CPU share vs nice 0
-20           88761     ~87x
-10           9548      ~9.3x
  0           1024      baseline (1x)
+10           110       ~0.11x
+19           15        ~0.015x

Two processes with nice 0 and nice 5 compete: the nice 0 process gets 1024/(1024+335) ≈ 75% of CPU time, the nice 5 process gets 335/(1024+335) ≈ 25%.
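A quick helper for estimating how the CPU splits between competing CPU-bound processes, using the same approximate 1.25x-per-step weights (the kernel's exact table differs slightly):

def cpu_shares(nice_values):
    """Estimate the CPU split between CPU-bound processes sharing one core."""
    weights = [1024 / (1.25 ** n) for n in nice_values]
    total = sum(weights)
    return [w / total for w in weights]

for nice, share in zip([0, 5], cpu_shares([0, 5])):
    print(f"nice {nice:+d}: ~{share:.0%} of the CPU")   # ~75% and ~25%

The helpers below cover setting and querying nice values from Python and via renice.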

import os
import subprocess
from typing import Optional

def set_process_priority(pid: Optional[int] = None, nice_value: int = 0) -> None:
    """
    Set the nice value for a process.
    Negative values require root (CAP_SYS_NICE).
    Range: -20 (highest) to +19 (lowest).
    """
    if pid is None:
        pid = os.getpid()

    try:
        os.setpriority(os.PRIO_PROCESS, pid, nice_value)
        print(f"Set PID {pid} nice value to {nice_value}")
    except PermissionError:
        print(f"Need root to set nice value below current: {os.getpriority(os.PRIO_PROCESS, pid)}")

def get_process_nice(pid: Optional[int] = None) -> int:
    """Get the nice value for a process."""
    if pid is None:
        pid = os.getpid()
    return os.getpriority(os.PRIO_PROCESS, pid)

# Pattern: lower priority for background preprocessing
def run_preprocessing_with_nice(nice_value: int = 10):
    """
    Run CPU preprocessing at lower priority so it does not compete
    with the main training loop's CPU threads.
    """
    os.nice(nice_value)  # os.nice() adjusts relative to the current nice value
    # Now do the preprocessing work...
    import numpy as np
    data = np.random.randn(1000, 1000)
    result = data @ data.T
    return result

# Pattern: raise priority for latency-sensitive inference
def set_high_priority_for_inference():
    """
    Increase priority for inference serving process.
    Requires running as root or having CAP_SYS_NICE.
    """
    current_pid = os.getpid()
    try:
        os.setpriority(os.PRIO_PROCESS, current_pid, -10)
        print("Set inference process to nice -10")
    except PermissionError:
        print("Cannot set negative nice without root. Using nice 0.")

# Using subprocess renice for already-running processes
def renice_process(pid: int, new_nice: int) -> bool:
    result = subprocess.run(
        ["renice", "-n", str(new_nice), "-p", str(pid)],
        capture_output=True, text=True
    )
    if result.returncode == 0:
        print(f"Renice'd PID {pid} to {new_nice}: {result.stdout.strip()}")
        return True
    print(f"Renice failed: {result.stderr.strip()}")
    return False

[Figure: CFS scheduling flow]


Real-Time Schedulers: SCHED_FIFO and SCHED_RR

CFS is designed for fairness. Real-time tasks need guarantees. Linux provides two RT scheduling policies:

SCHED_FIFO (First In, First Out): A FIFO task runs until it voluntarily yields, blocks on I/O, or is preempted by a higher-priority RT task. It is never preempted by CFS tasks regardless of their nice value. If a SCHED_FIFO task has a bug and enters an infinite loop, it will lock up the CPU core completely (no other process can run on that core).

SCHED_RR (Round Robin): Like SCHED_FIFO but with a time quantum. After exhausting the quantum, the task goes to the back of its priority level's queue. Multiple SCHED_RR tasks at the same priority share the CPU fairly.

RT priorities range from 1 (lowest RT) to 99 (highest RT). Any RT task preempts any CFS task, regardless of nice value.

import os
import ctypes
import ctypes.util
import subprocess

SCHED_NORMAL = 0   # CFS (default)
SCHED_FIFO = 1     # Real-time FIFO
SCHED_RR = 2       # Real-time Round Robin
SCHED_BATCH = 3    # CFS but never preempt for interactivity
SCHED_IDLE = 5     # Lower priority than nice +19

def set_realtime_priority(priority: int = 50, policy: int = SCHED_FIFO) -> bool:
    """
    Set a real-time scheduling policy for the current process.
    Requires root or CAP_SYS_NICE.
    priority: 1-99 for RT, 0 for CFS policies.

    WARNING: A SCHED_FIFO process at priority 99 can lock up the machine.
    Always keep a watchdog or set a low priority (10-30) for safety.
    """
    libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)

    # struct sched_param { int sched_priority; }
    sched_param = ctypes.c_int(priority)

    ret = libc.sched_setscheduler(
        ctypes.c_int(0),          # 0 = current process
        ctypes.c_int(policy),
        ctypes.byref(sched_param)
    )

    if ret != 0:
        import errno as errno_module
        err = ctypes.get_errno()
        print(f"sched_setscheduler failed: {errno_module.errorcode.get(err, err)}")
        return False

    policy_names = {SCHED_NORMAL: "SCHED_NORMAL", SCHED_FIFO: "SCHED_FIFO",
                    SCHED_RR: "SCHED_RR", SCHED_BATCH: "SCHED_BATCH"}
    print(f"Set scheduling policy to {policy_names.get(policy)} priority={priority}")
    return True

# Using chrt command (simpler for subprocess scenarios)
def set_rt_priority_via_chrt(pid: int, priority: int = 50) -> bool:
    """Set SCHED_FIFO priority for an existing process via the chrt command."""
    result = subprocess.run(
        ["chrt", "-f", "-p", str(priority), str(pid)],
        capture_output=True, text=True
    )
    return result.returncode == 0

# When to use RT scheduling for ML:
# 1. GPU completion interrupt handler thread (poll mode in CUDA driver)
# 2. Audio ML inference (real-time DSP pipeline)
# 3. Robot control inference (must respond within N microseconds)
# 4. Financial trading ML (latency measured in microseconds)
#
# NOT recommended for:
# 1. Regular training loops - RT priority with a bug = locked machine
# 2. DataLoader workers - they do disk I/O which blocks anyway
# 3. Any code that could spin in an unbounded loop

chrt Usage from Command Line

# Check current scheduling policy of a process
chrt -p $(pgrep python)

# Set SCHED_FIFO priority 50 for an existing process
sudo chrt -f -p 50 $(pgrep inference_server)

# Launch a new process with SCHED_RR priority 30
sudo chrt -r 30 python inference_server.py

# Check priority limits
cat /proc/sys/kernel/sched_rt_runtime_us # default: 950000 (95% of each second)
cat /proc/sys/kernel/sched_rt_period_us # default: 1000000 (1 second)
# RT processes are limited to 95% of CPU time to prevent starvation of CFS tasks

CPU Affinity: taskset and os.sched_setaffinity

CPU affinity binds a process or thread to specific CPU cores. This prevents the scheduler from migrating the task between cores, which improves cache locality and reduces NUMA cross-socket overhead.

import os
from typing import Optional, Set, List

def get_cpu_affinity(pid: Optional[int] = None) -> Set[int]:
    """Get the set of CPUs a process is allowed to run on."""
    if pid is None:
        pid = os.getpid()
    return os.sched_getaffinity(pid)

def set_cpu_affinity(cpus: Set[int], pid: Optional[int] = None) -> None:
    """
    Restrict a process to specific CPU cores.
    After this call, the scheduler will only place this process
    on the specified cores.
    """
    if pid is None:
        pid = os.getpid()
    os.sched_setaffinity(pid, cpus)
    print(f"PID {pid} restricted to CPUs: {sorted(cpus)}")

# Pattern: Pin inference server to one NUMA node
def pin_to_numa_node(node: int = 0) -> None:
    """
    Pin the current process to all CPUs on a specific NUMA node.
    Reduces cross-socket memory access latency.
    """
    # "numactl --hardware" prints the node/CPU/memory layout if you want to
    # inspect the topology first.
    cpus_on_node = get_cpus_for_numa_node(node)
    set_cpu_affinity(cpus_on_node)
    print(f"Pinned to NUMA node {node}, CPUs: {sorted(cpus_on_node)}")

def get_cpus_for_numa_node(node: int) -> Set[int]:
    """Read CPU list for a NUMA node from sysfs."""
    path = f"/sys/devices/system/node/node{node}/cpulist"
    with open(path) as f:
        cpu_list_str = f.read().strip()

    cpus = set()
    for part in cpu_list_str.split(","):
        if "-" in part:
            start, end = part.split("-")
            cpus.update(range(int(start), int(end) + 1))
        else:
            cpus.add(int(part))
    return cpus

# Pattern: Distribute training workers across cores
def distribute_workers_to_cores(n_workers: int) -> List[Set[int]]:
    """
    Assign each worker to a disjoint set of CPU cores.
    Avoids contention between workers on the same physical core.
    """
    import multiprocessing
    total_cpus = multiprocessing.cpu_count()
    cpus_per_worker = max(1, total_cpus // n_workers)

    assignments = []
    for i in range(n_workers):
        start = i * cpus_per_worker
        end = min(start + cpus_per_worker, total_cpus)
        assignments.append(set(range(start, end)))
    return assignments

# taskset equivalent from command line
# taskset -c 0-7 python train.py              # restrict to CPUs 0-7
# taskset -c 8-15 python eval.py              # restrict to CPUs 8-15
# taskset -cp 0-7 $(pgrep -f train.py)        # apply to a running process

cgroups for CPU Resource Control

cgroups (control groups) allow you to group processes and set resource limits on the group. Kubernetes uses cgroups to implement CPU requests and limits.
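As a minimal illustration of the underlying interface - assuming cgroup v2 mounted at /sys/fs/cgroup, the cpu controller enabled in the parent's cgroup.subtree_control, and root privileges; the group name ml_batch is just an example:

import os

def create_cpu_limited_cgroup(name: str = "ml_batch", cpus: float = 2.0,
                              period_us: int = 100_000) -> str:
    """Create a cgroup v2 group capped at `cpus` CPUs and move this process into it."""
    group = f"/sys/fs/cgroup/{name}"
    os.makedirs(group, exist_ok=True)

    # cpu.max format: "<quota_us> <period_us>"; "max <period_us>" means unlimited
    quota_us = int(cpus * period_us)
    with open(f"{group}/cpu.max", "w") as f:
        f.write(f"{quota_us} {period_us}")

    # Move the current process into the new group
    with open(f"{group}/cgroup.procs", "w") as f:
        f.write(str(os.getpid()))

    return group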

CFS Bandwidth Control - The Kubernetes CPU Throttling Problem

Kubernetes CPU limits are implemented via CFS bandwidth control parameters:

# Kubernetes CPU limit of "4" translates to:
cat /sys/fs/cgroup/kubepods/pod<id>/cpu.cfs_quota_us # 400000 (400ms)
cat /sys/fs/cgroup/kubepods/pod<id>/cpu.cfs_period_us # 100000 (100ms)

# The pod can use 400ms of CPU time every 100ms period
# If it uses 400ms before the period ends, it is throttled until next period

# Check if a container is being throttled
cat /sys/fs/cgroup/kubepods/pod<id>/cpu.stat
# nr_periods 1000 <- number of scheduling periods elapsed
# nr_throttled 247 <- periods where pod was throttled (24.7%!)
# throttled_time 24700000 <- nanoseconds spent throttled

The throttling problem for training: PyTorch uses CPU threads for CUDA stream management, gradient communication, and kernel launch queuing. When a GPU kernel completes, a CPU thread processes the completion and launches the next kernel. If that thread is throttled, the GPU sits idle waiting for the CPU to wake up. On a fast GPU, 100ms of CPU throttling can cause significant idle time.

import os
from typing import Optional

def check_cgroup_throttling(cgroup_path: Optional[str] = None) -> dict:
    """
    Check if the current process's cgroup is being throttled.
    Returns throttling statistics.
    """
    if cgroup_path is None:
        # Find our cgroup from /proc/self/cgroup (cgroup v1 cpu controller)
        with open("/proc/self/cgroup") as f:
            for line in f:
                if "cpu," in line or "cpuacct" in line:
                    parts = line.strip().split(":")
                    cgroup_rel = parts[2]
                    cgroup_path = f"/sys/fs/cgroup/cpu{cgroup_rel}"
                    break

    if cgroup_path is None:
        return {"error": "Could not determine cgroup"}

    stats_path = os.path.join(cgroup_path, "cpu.stat")
    if not os.path.exists(stats_path):
        return {"error": f"No cpu.stat at {stats_path}"}

    stats = {}
    with open(stats_path) as f:
        for line in f:
            key, _, value = line.strip().partition(" ")
            stats[key] = int(value)

    if "nr_periods" in stats and stats["nr_periods"] > 0:
        throttle_pct = 100 * stats.get("nr_throttled", 0) / stats["nr_periods"]
        stats["throttle_percent"] = throttle_pct
        if throttle_pct > 5:
            print(f"WARNING: {throttle_pct:.1f}% of scheduling periods throttled!")
            print("Consider removing cpu limit or increasing it")

    return stats

def get_cfs_quota() -> dict:
    """Check CFS quota settings for the current cgroup (v2 first, then v1)."""
    # cgroup v2: a single cpu.max file containing "<quota_us> <period_us>",
    # or "max <period_us>" when there is no limit
    v2_path = "/sys/fs/cgroup/cpu.max"
    if os.path.exists(v2_path):
        with open(v2_path) as f:
            quota, period = f.read().strip().split()
        if quota == "max":
            return {"quota": "unlimited", "period_us": int(period)}
        return {"quota_us": int(quota), "period_us": int(period),
                "cpu_limit": int(quota) / int(period)}

    # cgroup v1: separate quota and period files
    quota_path = "/sys/fs/cgroup/cpu/cpu.cfs_quota_us"
    period_path = "/sys/fs/cgroup/cpu/cpu.cfs_period_us"
    if os.path.exists(quota_path):
        with open(quota_path) as f:
            quota = int(f.read().strip())
        with open(period_path) as f:
            period = int(f.read().strip())
        if quota < 0:
            return {"quota": "unlimited", "period_us": period}
        return {"quota_us": quota, "period_us": period,
                "cpu_limit": quota / period}

    return {"quota": "unknown"}

Kubernetes Resource Spec for ML Training

# BAD: CPU limit causes throttling during GPU kernel launches
apiVersion: v1
kind: Pod
spec:
  containers:
  - name: training
    resources:
      requests:
        cpu: "4"
        memory: "32Gi"
      limits:
        cpu: "4"          # <- causes CFS throttling!
        memory: "32Gi"

---
# BETTER: Set request (for scheduling) but no CPU limit
apiVersion: v1
kind: Pod
spec:
  containers:
  - name: training
    resources:
      requests:
        cpu: "4"          # scheduler uses this to place the pod
        memory: "32Gi"
      limits:
        # No cpu limit! Pod can burst above 4 CPUs when available
        memory: "32Gi"    # Keep memory limit to prevent OOM

---
# BEST for dedicated training nodes: Guaranteed QoS + cpuset
# Guaranteed QoS = requests == limits
# With kubelet static CPU policy, pods get exclusive cpuset binding
apiVersion: v1
kind: Pod
spec:
  containers:
  - name: training
    resources:
      requests:
        cpu: "16"         # integer CPU count for exclusive cpuset
        memory: "128Gi"
      limits:
        cpu: "16"         # requests == limits -> Guaranteed QoS
        memory: "128Gi"

CPU Isolation with isolcpus and cpuset

For the most demanding latency-sensitive ML inference, you can isolate entire CPU cores from the general scheduler. The kernel will not place ordinary tasks on those cores, and with the nohz_full and rcu_nocbs flags below, most kernel housekeeping and timer ticks move off them as well.

# Kernel boot parameter to isolate CPUs 4-7 from general scheduling
# Add to /etc/default/grub:
GRUB_CMDLINE_LINUX="isolcpus=4-7 nohz_full=4-7 rcu_nocbs=4-7"

# After reboot:
# CPUs 4-7 will only run tasks explicitly assigned to them
# No kernel housekeeping tasks run on these cores
# Dramatically reduces scheduling jitter

# Assign your inference server to isolated cores
taskset -c 4-7 python inference_server.py

# Check that CPUs are indeed isolated
cat /sys/devices/system/cpu/isolated # should show "4-7"

# For containerized workloads, use cpuset cgroup directly
mkdir /sys/fs/cgroup/cpuset/ml_inference
echo "4-7" > /sys/fs/cgroup/cpuset/ml_inference/cpuset.cpus
echo "0" > /sys/fs/cgroup/cpuset/ml_inference/cpuset.mems # NUMA node 0
echo $(pgrep inference_server) > /sys/fs/cgroup/cpuset/ml_inference/tasks

NUMA-Aware Scheduling

On multi-socket machines (common for training servers), memory is Non-Uniform Memory Access (NUMA). Each socket has local RAM that is fast to access, and remote RAM (the other socket's memory) that is slower.

NUMA Topology Example (2-socket, 96-core machine):

+------------------+      +------------------+
|     Socket 0     |      |     Socket 1     |
|    CPUs 0-47     |      |    CPUs 48-95    |
|   RAM: 384 GB    |      |   RAM: 384 GB    |
|  Local latency:  |      |  Local latency:  |
|      ~80 ns      |      |      ~80 ns      |
| Remote latency:  |      | Remote latency:  |
|     ~150 ns      |      |     ~150 ns      |
+------------------+      +------------------+
          \                      /
           \    Interconnect    /
            \   (UPI / QPI)    /
             ------------------

Cross-NUMA access is roughly 1.5-2x slower than local access. For a training job that allocates tensors on socket 0's memory but then runs compute on socket 1's CPUs, every tensor access pays the remote penalty.

import os
import subprocess
from typing import List

def get_numa_topology() -> dict:
    """Parse NUMA topology from /sys."""
    topology = {}
    node_path = "/sys/devices/system/node"

    if not os.path.exists(node_path):
        return {"error": "NUMA not available (single-socket or not Linux)"}

    for node_dir in sorted(os.listdir(node_path)):
        if not node_dir.startswith("node"):
            continue
        node_id = int(node_dir[4:])
        cpu_list_path = f"{node_path}/{node_dir}/cpulist"
        mem_info_path = f"{node_path}/{node_dir}/meminfo"

        cpus = set()
        if os.path.exists(cpu_list_path):
            with open(cpu_list_path) as f:
                cpus = parse_cpu_list(f.read().strip())

        mem_gb = 0
        if os.path.exists(mem_info_path):
            with open(mem_info_path) as f:
                for line in f:
                    if "MemTotal" in line:
                        mem_kb = int(line.split()[3])
                        mem_gb = mem_kb / 1024 / 1024

        topology[node_id] = {"cpus": cpus, "mem_gb": mem_gb}

    return topology

def parse_cpu_list(cpu_list_str: str) -> set:
    """Parse CPU list string like '0-23,48-71' into a set of CPU ids."""
    cpus = set()
    for part in cpu_list_str.split(","):
        part = part.strip()
        if "-" in part:
            start, end = part.split("-")
            cpus.update(range(int(start), int(end) + 1))
        elif part:
            cpus.add(int(part))
    return cpus

def run_with_numa_binding(
    script: str,
    numa_node: int = 0,
    bind_memory: bool = True
) -> subprocess.Popen:
    """
    Launch a Python script with NUMA binding.
    --cpunodebind: run threads on specified NUMA node's CPUs
    --membind: allocate memory from specified NUMA node
    """
    cmd = ["numactl"]
    if bind_memory:
        cmd.extend(["--membind", str(numa_node)])
    cmd.extend(["--cpunodebind", str(numa_node), "python", script])

    print(f"Launching: {' '.join(cmd)}")
    return subprocess.Popen(cmd)

# Example: For multi-GPU training on a 2-socket machine
# GPU 0 is typically on NUMA node 0
# GPU 1 is typically on NUMA node 1
# Bind each training process to the NUMA node closest to its GPU
def launch_numa_aware_training(n_gpus: int = 2) -> List[subprocess.Popen]:
    processes = []
    topology = get_numa_topology()
    n_nodes = max(1, len([k for k in topology if isinstance(k, int)]))

    for gpu_id in range(n_gpus):
        # On most 2-socket systems, GPU i corresponds to NUMA node i
        numa_node = gpu_id % n_nodes

        cmd = [
            "numactl",
            "--cpunodebind", str(numa_node),
            "--membind", str(numa_node),
            "python", "train_worker.py",
            "--gpu-id", str(gpu_id),
        ]
        env = os.environ.copy()
        env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
        proc = subprocess.Popen(cmd, env=env)
        processes.append(proc)
        print(f"GPU {gpu_id} bound to NUMA node {numa_node}")

    return processes

Measuring Scheduling Latency and Jitter

Scheduling jitter - the variability in when a sleeping task actually wakes up after its timer fires - directly impacts inference latency. A task that expects to wake up after 1ms might actually wake up after 3ms if the scheduler was busy with other work.

import time
import statistics
import subprocess
from typing import List

def measure_scheduling_jitter(
    sleep_duration_ms: float = 1.0,
    iterations: int = 1000
) -> dict:
    """
    Measure how accurately the scheduler wakes this process.
    Low jitter = scheduler is giving us CPU on time.
    High jitter = other processes or throttling are delaying us.
    """
    target_sleep = sleep_duration_ms / 1000.0
    overrun_ms_list: List[float] = []

    for _ in range(iterations):
        start = time.perf_counter()
        time.sleep(target_sleep)
        actual = time.perf_counter() - start
        overrun_ms = (actual - target_sleep) * 1000
        overrun_ms_list.append(overrun_ms)

    return {
        "target_ms": sleep_duration_ms,
        "mean_overrun_ms": statistics.mean(overrun_ms_list),
        "p50_overrun_ms": statistics.median(overrun_ms_list),
        "p95_overrun_ms": sorted(overrun_ms_list)[int(0.95 * len(overrun_ms_list))],
        "p99_overrun_ms": sorted(overrun_ms_list)[int(0.99 * len(overrun_ms_list))],
        "max_overrun_ms": max(overrun_ms_list),
        "std_overrun_ms": statistics.stdev(overrun_ms_list),
    }

def detect_cpu_throttling() -> bool:
    """
    Quick check: if scheduling jitter is consistently > 5ms,
    the process is likely being throttled by cgroup quotas.
    """
    jitter = measure_scheduling_jitter(sleep_duration_ms=0.5, iterations=100)
    p99 = jitter["p99_overrun_ms"]

    if p99 > 5.0:
        print(f"THROTTLING DETECTED: p99 wakeup jitter = {p99:.1f}ms")
        print("Check: cat /sys/fs/cgroup/cpu.stat (or cpu/cpu.stat for v1)")
        return True
    else:
        print(f"Scheduling looks OK: p99 wakeup jitter = {p99:.2f}ms")
        return False

# Using perf for system-level jitter measurement
# sudo perf stat -e context-switches,cpu-migrations python benchmark.py
#   cpu-migrations: how many times tasks migrated between CPUs
#   high cpu-migrations = poor affinity, wasted cache warming

def measure_with_perf(script_path: str) -> str:
    """Run a script under perf stat to measure scheduling events."""
    result = subprocess.run(
        [
            "perf", "stat",
            "-e", "context-switches,cpu-migrations,page-faults,cycles,instructions",
            "python", script_path,
        ],
        capture_output=True, text=True
    )
    return result.stderr  # perf stat writes to stderr

SCHED_BATCH for Background ML Workloads

For offline training jobs that should not interfere with production serving on the same machine:

import os
import ctypes
import ctypes.util

SCHED_BATCH = 3  # Like SCHED_NORMAL but never preempts interactive tasks

def set_batch_scheduling() -> bool:
    """
    SCHED_BATCH: designed for batch workloads.
    - Never preempts interactive/latency-sensitive tasks
    - Gets full CPU when idle, yields immediately when latency-sensitive tasks wake
    - Better than just using nice +19 because it also affects wakeup behavior
    """
    libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
    sched_param = ctypes.c_int(0)  # priority must be 0 for non-RT policies

    ret = libc.sched_setscheduler(
        ctypes.c_int(0),
        ctypes.c_int(SCHED_BATCH),
        ctypes.byref(sched_param)
    )

    if ret == 0:
        print("Set scheduling policy to SCHED_BATCH")
        return True
    print("Failed to set SCHED_BATCH")
    return False

# Full recommended setup for a background training job on a serving machine:
def configure_background_training_job():
    """
    Configure a training job to minimize interference with serving.
    """
    # 1. Lower CPU priority
    os.nice(10)

    # 2. Use SCHED_BATCH policy
    set_batch_scheduling()

    # 3. Set IO priority to idle (discussed in File Systems lesson)
    #    shell: ionice -c 3 -p <pid>

    # 4. Use memory cgroup to limit RSS if needed
    #    shell: echo $$ >> /sys/fs/cgroup/memory/training/tasks

    # 5. Set CPU affinity to non-critical cores
    n_cpus = os.cpu_count()
    # Reserve first 8 CPUs for serving, use the rest for training
    training_cpus = set(range(8, n_cpus))
    if training_cpus:
        os.sched_setaffinity(0, training_cpus)
        print(f"Training restricted to CPUs: {sorted(training_cpus)}")

Production Engineering Notes

Diagnosing Scheduling Problems

# Check if any processes are running real-time and might starve others
for pid in $(ps -e -o pid=); do chrt -p "$pid" 2>/dev/null; done | grep -v "SCHED_OTHER"

# Identify processes with high involuntary context switches (preempted frequently)
# High voluntary_ctxt_switches = lots of blocking I/O (normal)
# High nonvoluntary_ctxt_switches = getting preempted (scheduling pressure)
for pid in $(ps -e -o pid= | head -20); do
  if [ -f "/proc/$pid/status" ]; then
    name=$(awk '/Name:/{print $2}' /proc/$pid/status)
    vol=$(awk '/voluntary_ctxt_switches:/{print $2}' /proc/$pid/status)
    nonvol=$(awk '/nonvoluntary_ctxt_switches:/{print $2}' /proc/$pid/status)
    echo "$name (pid=$pid): voluntary=$vol nonvoluntary=$nonvol"
  fi
done

# Watch for CPU steal time (running in a VM/container getting CPU stolen)
# st% in top/htop = CPU time stolen by hypervisor
# High steal% = your VM is oversubscribed on the host

# Check which CPUs are most loaded
mpstat -P ALL 1 5 # per-CPU utilization, 5 samples, 1 second each

# Find runqueue depth (how many tasks waiting to run per CPU)
cat /proc/schedstat # or use 'sar -q'

Scheduler Tuning for ML Training

# Reduce CFS target latency for more responsive scheduling
# Default: 24ms target latency (each task gets one slice per 24ms period)
# For training loops: reduce to 6ms to reduce GPU idle time
sudo sysctl -w kernel.sched_latency_ns=6000000
sudo sysctl -w kernel.sched_min_granularity_ns=750000

# For containers: check if cgroup v1 or v2
ls /sys/fs/cgroup/unified # if exists, v2 is active
ls /sys/fs/cgroup/cpu # if exists, v1 is active (or hybrid)

# Enable cgroup v2 (better for modern Kubernetes)
# Add to kernel boot: systemd.unified_cgroup_hierarchy=1

:::danger Scheduling Mistakes That Kill Training Throughput

Setting a CPU limit in Kubernetes without understanding CFS throttling. If your training pod has limits.cpu: "4" and your GPU kernel launches are dense, the CPU overhead threads will be throttled. The GPU sits idle waiting for the CPU to wake up. The only correct fix is to either remove the CPU limit entirely, increase it to give headroom (e.g., 2x the request), or use Guaranteed QoS (requests == limits as integers) with kubelet static CPU policy for exclusive cpuset binding.

Running training on hyperthreaded cores expecting linear scaling. Two hyper-threads on the same physical core share execution units (ALUs, FPUs). A training worker on one hyperthread and a DataLoader worker on its sibling (same physical core) will compete for the same execution units. For CPU-bound workloads, disable HT or pin workers to one logical CPU per physical core - the sibling numbering varies by system, so check lscpu -e or /sys/devices/system/cpu/cpu*/topology/thread_siblings_list rather than assuming a fixed layout (a sketch for selecting one logical CPU per physical core follows this callout). Check whether HT is enabled with: lscpu | grep "Thread(s) per core".

Using SCHED_FIFO at high priority without a watchdog. A SCHED_FIFO task at priority 90 that enters an infinite loop (common during debugging) will monopolize its CPU core - every other task bound to that core is frozen, and if critical threads land there the machine can appear hung. Recovery requires either a higher-priority SCHED_FIFO watchdog or SSH from another machine to kill the process. Always use SCHED_RR instead of SCHED_FIFO for ML workloads - the time quantum provides a safety valve.

:::
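As referenced in the callout above, a sketch for picking one logical CPU per physical core by reading the standard sysfs topology files (core_id, physical_package_id):

import os

def one_cpu_per_physical_core() -> set:
    """Return one logical CPU id per (package, core) pair visible to this process."""
    chosen, seen = set(), set()
    for cpu in sorted(os.sched_getaffinity(0)):
        topo = f"/sys/devices/system/cpu/cpu{cpu}/topology"
        try:
            with open(f"{topo}/core_id") as f:
                core = int(f.read())
            with open(f"{topo}/physical_package_id") as f:
                package = int(f.read())
        except FileNotFoundError:
            chosen.add(cpu)          # no topology info; keep the CPU as-is
            continue
        if (package, core) not in seen:
            seen.add((package, core))
            chosen.add(cpu)          # first hyperthread sibling of this core
    return chosen

# os.sched_setaffinity(0, one_cpu_per_physical_core())  # pin to one thread per core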

:::warning NUMA Blindness in Multi-GPU Training

Running torchrun or torch.distributed.launch without NUMA awareness on a 2-socket machine will allocate memory from whichever socket is first available, then run compute threads on whichever CPUs the OS picks. For a model that does not fit in one socket's memory bandwidth, 50% of memory accesses will cross the NUMA interconnect, adding 70ns latency per access. For attention mechanisms that do many small, scattered memory reads, this compounds significantly.

Before launching multi-GPU training on a multi-socket machine:

# Check your GPU-to-NUMA mapping
nvidia-smi topo -m

# Check which NUMA node each GPU is closest to
for i in $(seq 0 3); do
echo -n "GPU $i -> NUMA node: "
cat /sys/class/drm/card$i/device/numa_node 2>/dev/null || echo "unknown"
done

# Launch with NUMA binding
CUDA_VISIBLE_DEVICES=0,1 numactl --cpunodebind=0 --membind=0 torchrun --nproc_per_node=2 train.py
CUDA_VISIBLE_DEVICES=2,3 numactl --cpunodebind=1 --membind=1 torchrun --nproc_per_node=2 train.py

:::


Interview Questions and Answers

Q1: Explain CFS. How does it implement fairness? What data structure does it use and why?

CFS (Completely Fair Scheduler) implements fairness by tracking a virtual runtime (vruntime) for each runnable task - the total CPU time the task has received, weighted by its priority. CFS always runs the task with the smallest vruntime (the one that has received the least CPU relative to its entitlement). After each clock tick, the running task's vruntime increases. When its vruntime surpasses another task's, it is preempted and the other task runs.

CFS uses a red-black tree (a self-balancing binary search tree) keyed by vruntime. This gives O(log N) insertion/removal and O(1) "pick next task" (always the leftmost node). The alternatives - an array or a list - would give O(N) pick-next or O(N) insert. With thousands of runnable tasks on a large server, O(N) would make the scheduler itself a significant CPU consumer.

The key insight over the old O(1) scheduler: CFS does not need heuristics to classify tasks as interactive or batch. The vruntime naturally captures the task's needs: a task that sleeps often (interactive) accumulates less vruntime and wakes up to find it has a small vruntime relative to CPU-bound tasks, so it gets priority automatically.

Q2: A Kubernetes training job shows 40% lower throughput than bare metal with the same hardware. CPU utilization is 70% and GPU utilization is 95%. What are the likely causes and how do you diagnose them?

The most common cause is CFS bandwidth throttling from CPU limits. With limits.cpu: "4", the pod gets 400ms of CPU time per 100ms window. If GPU kernel completion events and the next launch request come in a burst (common in training), the CPU threads handling them can exhaust the quota and be throttled for the remainder of the period.

Diagnosis steps:

# Check throttling statistics
cat /sys/fs/cgroup/cpu.stat
# Look for nr_throttled and throttled_time

# Check GPU idle time
nvidia-smi dmon -s u -d 1 # utilization per second
# If GPU utilization fluctuates rapidly (95%, 10%, 95%, 10%), CPU is the bottleneck

# Check scheduling jitter
chrt -p $(pgrep -f train.py) # verify scheduling policy

Fix: remove the CPU limit entirely (keep CPU request for scheduling). Alternatively, set an integer request with limits == request to enable Guaranteed QoS and kubelet static CPU policy for exclusive cpuset binding.

Secondary causes to check: NUMA imbalance (training threads on socket 0, GPU on socket 1), hyperthreading contention (training and monitoring sharing physical cores), container overhead from veth/iptables in the network path.

Q3: What is the difference between taskset and numactl? When would you use each for ML workloads?

taskset sets CPU affinity - it controls which CPU cores a process can run on. It does not control memory allocation. Use taskset when you want to prevent the OS from migrating a process between cores (improves cache locality) or when you need to pin workers to specific physical cores to avoid hyperthreading contention.

numactl controls both CPU affinity (which NUMA node's CPUs to use) and memory policy (which NUMA node to allocate memory from). numactl --cpunodebind=0 --membind=0 pins the process to socket 0's CPUs AND ensures all memory allocations come from socket 0's RAM. This is the correct tool for multi-socket machines where cache-local memory access is important.

For ML workloads: use numactl when launching training workers on a multi-socket machine, especially when each worker corresponds to a GPU that has a known NUMA affinity (check nvidia-smi topo -m). Use taskset when working on a single-socket machine or when you just need core pinning without NUMA memory control.

Q4: Explain the trade-off between isolcpus and cpuset cgroups for ML inference isolation. When would you use each?

isolcpus is a kernel boot parameter that removes specified CPUs from the general scheduler. No kernel threads, no interrupt handlers, no system tasks run on those CPUs. A process must be explicitly assigned to them via taskset or sched_setaffinity. This gives maximum isolation - zero scheduling noise - but requires a reboot and permanently reserves those CPUs. It is the right choice for dedicated inference hardware with strict latency SLAs (trading systems, real-time robotics).

cpuset cgroups (the cpuset controller in cgroups) allow restricting a group of processes to a set of CPUs at runtime, without a reboot. Kubernetes uses cpuset via kubelet's static CPU policy (set cpuManagerPolicy: static in kubelet config). This is less complete isolation than isolcpus (kernel threads can still run on those CPUs), but it is practical for containerized workloads and provides significant latency improvement over unbound scheduling.

For production ML inference: use cpuset cgroups (via Kubernetes Guaranteed QoS) for most workloads. Reserve isolcpus for the final 20% of latency improvement when you have dedicated hardware and sub-millisecond requirements.

Q5: A training job on a multi-GPU machine is underperforming. nvidia-smi shows all 8 GPUs at 95%+ utilization, but the per-step time is 30% slower than the theoretical optimum based on GPU FLOPS. What scheduler-related causes would you investigate?

Several scheduler-related causes can add overhead between GPU kernels without affecting GPU utilization metrics:

First, CPU-GPU synchronization gaps. Between backward pass and optimizer step, NCCL AllReduce for gradients requires CPU threads to post ring-all-reduce operations. If those threads are throttled (cgroup quota) or delayed (high runqueue depth), the GPU idles between kernels even though it was "active" during the previous kernel. Check nvprof or nsight for gaps between CUDA kernels.

Second, NUMA cross-socket memory access for gradient buffers. If gradient tensors are allocated in socket 0's memory but NCCL communication threads run on socket 1's CPUs (or vice versa), every gradient copy pays the ~70ns NUMA penalty. Across a model's many gradient tensors, this accumulates. Check numastat for numa_miss counts (a sysfs-based check is sketched at the end of this answer).

Third, NCCL threads competing with DataLoader workers on the same physical cores. NCCL uses OS threads internally for allreduce. If DataLoader workers are unaffected by taskset or cpuset, they may share physical cores with NCCL threads, causing the communication to stall. Pin DataLoader workers to one cpuset and NCCL/training threads to another.

Fourth, sched_latency_ns too large. With the default 24ms target latency and many training-related threads, some threads wait up to 24ms for their next scheduling slice. For dense GPU kernel chaining, even 1ms of scheduler delay causes measurable throughput loss. Tune sched_latency_ns down to 4-6ms.
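For the NUMA point above, the per-node counters that numastat reports can also be read directly from sysfs; a rising numa_miss count means allocations are landing on a remote node:

import glob

def read_numastat() -> dict:
    """Read per-node NUMA hit/miss counters from /sys/devices/system/node/."""
    stats = {}
    for path in sorted(glob.glob("/sys/devices/system/node/node*/numastat")):
        node = path.split("/")[-2]
        with open(path) as f:
            stats[node] = {k: int(v) for k, v in
                           (line.split() for line in f if line.strip())}
    return stats

for node, counters in read_numastat().items():
    print(f"{node}: numa_hit={counters.get('numa_hit')} "
          f"numa_miss={counters.get('numa_miss')}")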

Q6: What is scheduling jitter and how does it affect inference latency SLAs?

Scheduling jitter is the variance between when a sleeping task is due to wake up (its timer fires) and when it actually receives CPU time. A task that calls time.sleep(0.001) (1ms) might not actually run again for 1.5ms, 2ms, or even 5ms if the CPU is busy with other tasks when the timer fires.

For inference serving with a p99 latency SLA of 10ms: if scheduling jitter alone contributes 5ms at p99, you have only 5ms for actual ML computation. Jitter accumulates across every network receive, queue dequeue, GPU launch, GPU completion, and response send - each of which involves a wakeup.

Causes of high jitter: a large CFS latency target (sched_latency_ns), CPU throttling via cgroup quota, competing RT tasks, interrupt affinity problems (all network IRQs landing on one core), and wake-up placement delays when the chosen CPU is already busy.

Measurement and fixing:

# Measure jitter
jitter = measure_scheduling_jitter(sleep_duration_ms=0.5, iterations=1000)
print(f"p99 wakeup jitter: {jitter['p99_overrun_ms']:.2f}ms")

# Fix options in order of impact:
# 1. Remove CPU limit in Kubernetes (biggest win, usually)
# 2. Set cpuset to isolate inference workers
# 3. Reduce sched_latency_ns to 4ms
# 4. Set SCHED_RR priority 50 for inference server threads
# 5. Move network IRQ affinity to non-inference CPUs (irqbalance or manual)