System Calls and Linux API

The Checkpoint That Took Eight Minutes

A production training job was checkpointing a 7B parameter model to NFS every 30 minutes. The checkpoint itself - serializing weights and optimizer state - took about 90 seconds in Python. But the total training pause was 8 minutes. The missing 6.5 minutes were invisible from Python's perspective: they were inside the kernel, doing page-cache writeback, NFS RPC calls, and fsync.

The engineers were calling torch.save(state, path) followed by os.sync() to ensure durability. What they did not realize was that os.sync() issues a single sync(2) system call that flushes all dirty pages on the entire system - including unrelated processes - to disk. On a machine with 500GB of RAM and 300GB of dirty page cache, that takes a long time.

The fix was os.fsync(fd) on the specific checkpoint file descriptor, followed by a directory fsync to ensure the directory entry was also durable. Total pause dropped from 8 minutes to 110 seconds. Knowing one system call's semantics saved 6+ minutes per checkpoint.

This is the payoff for understanding system calls. Not to write kernel modules or replace the OS, but to know what your Python code is actually doing when it crosses the user-kernel boundary, understand what the kernel is doing in response, and make informed decisions about which syscall achieves the outcome you want - and which one silently does something much more expensive.

This lesson covers the Linux system calls and kernel interfaces that matter most for ML engineers: file I/O (open, read, write, mmap), process management (fork, clone, execve), synchronization (futex), high-performance I/O (epoll, io_uring), introspection (/proc, /sys), and security (seccomp, capabilities).

Why This Exists

Every program running on Linux is running in user space. The CPU enforces a privilege boundary: user-space code cannot directly access hardware (disks, network cards, GPUs), manipulate process memory outside its own address space, or change kernel data structures. To do any of these things, a program makes a system call - a controlled transition into kernel space.

The kernel executes the system call on behalf of the process and returns the result. The transition has a cost: saving CPU registers, switching to the kernel stack, executing the syscall handler, switching back. On a modern x86_64 machine with Spectre mitigations, a simple syscall round-trip costs roughly 100-500 nanoseconds. For a program that makes millions of syscalls per second (a high-throughput I/O server), this matters. For a training job where each step takes hundreds of milliseconds, individual syscall overhead is irrelevant - but the semantics of which syscall you use still matter enormously.
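A rough way to feel this cost is to time a trivial syscall in a loop. A minimal sketch (os.getuid wraps getuid(2); the measurement includes Python's own per-call overhead, so treat it as an upper bound):

import os
import time

def syscall_cost_ns(n: int = 1_000_000) -> float:
    """Average wall time per os.getuid() call, in nanoseconds."""
    t0 = time.perf_counter()
    for _ in range(n):
        os.getuid()  # thin wrapper over getuid(2), a trivial syscall
    return (time.perf_counter() - t0) / n * 1e9

print(f"~{syscall_cost_ns():.0f} ns per getuid(2) round-trip")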

Python abstracts almost all direct syscall interaction. open() ends in openat(2). file.read() calls read(2). mmap.mmap() calls mmap(2). The abstraction is clean but lossy: you lose visibility into what the kernel is doing, how much time it is spending, and which code path you triggered. strace gives that visibility back.

The specific syscalls that matter most for ML are in three categories. Data I/O: open, read, write, mmap, fsync for dataset loading and checkpoint saving. Concurrency: clone (creates threads), futex (the kernel primitive under every mutex and condition variable). High-performance I/O: epoll for event-driven inference servers, io_uring for the next generation of async I/O. Each of these has design decisions and tradeoffs that are invisible from Python but critical for systems work.

Historical Context

The Unix system call interface was designed at Bell Labs in the early 1970s with a philosophy of simplicity: a small number of orthogonal primitives that combine to build complex behavior. The original Unix had about 30 system calls. Modern Linux has over 350.

The key design decision that shaped everything was "everything is a file." Files, directories, devices, pipes, sockets, and process-related objects all use the same open/read/write/close interface. This uniformity is why /proc (a virtual filesystem exposing kernel data structures) and /sys (exposing device and driver state) work: they are filesystems, so existing tools like cat, ls, and Python's open() work on them without modification.
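That uniformity is easy to demonstrate: reading kernel state from Python needs nothing beyond the ordinary file API.

# /proc entries are files, so the normal file API reads kernel state:
with open("/proc/self/status") as f:
    print(f.readline())  # e.g. "Name:\tpython3"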

mmap (memory-mapped files, which entered the Unix mainstream through BSD in the early 1980s) was a landmark addition: it lets a process map a file into its address space so that reads and writes to memory are automatically backed by the file. The kernel manages the page cache, and multiple processes mapping the same file share pages. This is how Hugging Face's safetensors format loads weights without reading them sequentially into Python objects.

epoll was added to Linux 2.5.44 (2002) specifically to address the select/poll O(N) scaling problem: epoll is O(1) per event regardless of how many file descriptors are watched. It is the foundation of every production Python web server (Gunicorn, uvicorn), every ML inference framework's network layer, and every high-performance message broker.

io_uring (Linux 5.1, 2019) is the most significant I/O interface addition since epoll. It uses shared memory ring buffers to submit and complete I/O requests without any syscalls in the common case - the amortized syscall cost approaches zero. For ML workloads reading large numbers of small samples from NFS or object storage, io_uring can double I/O throughput versus traditional async I/O.

Core Concepts

The User-Kernel Boundary

Every time your Python code does something that touches hardware or the kernel, it crosses the user-kernel boundary via a system call:

User Space                            | Kernel Space
--------------------------------------|------------------------------
Python: open("data.bin")              |
  -> CPython io layer                 |
  -> libc open()                      |
  -> syscall openat(2) -------------->| VFS layer
                                      |   -> ext4/NFS driver
                                      |   -> block device I/O
Python: file descriptor = 5 <---------| returns file descriptor (int)

The syscall instruction switches the CPU to ring 0 (kernel mode). The kernel looks up the syscall number in a table, calls the handler, and then sysret switches back to ring 3 (user mode). On x86_64 with Spectre mitigations, this round-trip costs roughly 100-500 ns. Syscalls that touch only memory (like a futex wake with no waiters) sit at the cheap end; syscalls that trigger I/O take as long as the I/O takes.

# You can see exactly which syscalls Python makes
# Run: strace -c python3 -c "open('/dev/null').read()"
# Output shows:
# read: 1 (reads from the file)
# open/openat: 2 (opens /dev/null and maybe the Python runtime)
# close: 2
# ... etc

The Core ML Syscalls: File I/O

import os

# open(2): returns a file descriptor (small integer)
fd = os.open("dataset.bin", os.O_RDONLY | os.O_CLOEXEC)
# O_CLOEXEC: automatically close fd in child processes (important for multiprocessing)

# read(2): reads up to n bytes - may return fewer (partial read)
buf = os.read(fd, 4096) # up to 4096 bytes

# write(2): writes data - may write fewer bytes than requested
n_written = os.write(fd_out, b"checkpoint data")  # fd_out: a descriptor opened for writing

# pread(2) / pwrite(2): positional read/write without moving the file offset
# Thread-safe: multiple threads can pread the same file at different offsets
buf = os.pread(fd, 4096, 1024 * 1024)  # read 4 KB at offset 1 MB (args are positional-only)

# fsync(2): flush this file's data AND metadata to storage
os.fsync(fd)

# fdatasync(2): flush data only (faster - skip metadata like atime)
os.fdatasync(fd)

# close(2): release the file descriptor
os.close(fd)

The difference between fsync and fdatasync matters for checkpointing: fdatasync is faster because it skips file metadata that is not needed to read the data back (timestamps, for example). For a checkpoint, only the data matters. On NFS mounted with the sync option, either call can take seconds. On local NVMe with a battery-backed write cache, both complete in under a millisecond.
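A quick way to see the gap on your own storage is to time both calls against a throwaway file. A minimal sketch (path and size are arbitrary; results depend entirely on the filesystem and device):

import os
import time

def timed_flush(path: str, data: bytes, use_fdatasync: bool) -> float:
    """Write data, then measure only the durability flush."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        t0 = time.perf_counter()
        (os.fdatasync if use_fdatasync else os.fsync)(fd)
        return time.perf_counter() - t0
    finally:
        os.close(fd)

payload = b"\0" * (64 * 1024 * 1024)  # 64 MB stand-in for a checkpoint
print(f"fsync:     {timed_flush('/tmp/ckpt_a.bin', payload, False):.4f}s")
print(f"fdatasync: {timed_flush('/tmp/ckpt_b.bin', payload, True):.4f}s")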

mmap: Memory-Mapped Files

mmap maps a file (or anonymous memory) into the process's virtual address space. Reads and writes to the mapped region are handled by the page cache - no explicit read/write calls needed. Multiple processes mapping the same file share the same physical pages.

import mmap
import struct
import numpy as np

# Memory-map a tensor file for zero-copy access
def mmap_tensor(path: str, dtype=np.float32) -> np.ndarray:
    """
    Load a tensor from a binary file using mmap.
    The tensor data is not copied into Python - the OS pages it in on demand.
    Excellent for large embedding tables that are partially accessed per request.
    """
    with open(path, "rb") as f:
        # Read header: 8 bytes = ndim (int32) + dtype code (int32)
        header = f.read(8)
        ndim, dtype_code = struct.unpack("ii", header)
        shape = struct.unpack("i" * ndim, f.read(4 * ndim))

        # Memory-map the rest of the file (mapping outlives the file object)
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

    # Create a NumPy array backed by the mmap - no copy
    offset = 8 + 4 * ndim  # past the header
    arr = np.frombuffer(mm, dtype=dtype, offset=offset)
    arr = arr.reshape(shape)
    return arr  # arr holds a reference to mm, keeping it alive


# mmap for shared memory between processes
def create_shared_buffer(size_bytes: int) -> mmap.mmap:
    """
    Create an anonymous mmap shared between parent and child processes.
    Used for passing large numpy arrays to DataLoader workers without pickle.
    """
    return mmap.mmap(-1, size_bytes)  # -1 = anonymous (not backed by a file)


# Practical: zero-copy embedding lookup
def load_embedding_table(path: str) -> np.ndarray:
    """
    A 100M x 128 float32 embedding table is ~51 GB.
    mmap lets you access any row without loading all 51 GB.
    The OS pages in only the rows you actually access.
    """
    with open(path, "r+b") as f:
        mm = mmap.mmap(f.fileno(), 0)
    arr = np.frombuffer(mm, dtype=np.float32)
    # arr.shape would be (100_000_000 * 128,) - reshape as needed
    return arr  # OS pages in rows on access, evicts cold rows

strace: Watching Syscalls in Action

strace intercepts system calls in real time and prints them. It is the most useful debugging tool for "what is my program actually doing":

# Trace all syscalls made by a Python training script
strace -tt python3 train.py 2>&1 | head -100

# Count syscall frequency (find bottlenecks)
strace -c python3 train.py
# Example output for data loading:
# % time     seconds  usecs/call     calls  syscall
# ------ ----------- ----------- ---------  --------
#  45.21    0.452123         112       4032  pread64   <- reading dataset
#  28.33    0.283301          88       3217  futex     <- lock waits
#  14.12    0.141200          44       3209  mmap      <- memory allocation
#   8.01    0.080100         200        400  openat    <- file opens
#   4.33    0.043300         108        401  close     <- file closes

# Trace only file I/O syscalls
strace -e trace=read,write,open,openat,close,pread64 python3 train.py

# Trace a specific process ID (attach to running training job)
strace -p $(pgrep -f train.py) -e trace=futex 2>&1 | head -50

# Full trace with timing: find slow syscalls
strace -T python3 checkpoint.py 2>&1 | grep "fsync\|write"
# Output: fsync(5) = 0 <8.412341> <- 8.4 seconds in fsync!

/proc: The Process Information Filesystem

/proc is a virtual filesystem that exposes kernel data structures as files. Every running process has a directory /proc/PID/. The ML-relevant entries:

import os
import re

def get_process_memory_mb(pid: int = None) -> dict:
    """Parse /proc/PID/status for memory usage."""
    if pid is None:
        pid = os.getpid()

    status_path = f"/proc/{pid}/status"
    metrics = {}

    with open(status_path) as f:
        for line in f:
            if line.startswith("VmRSS:"):
                # VmRSS: Resident Set Size - physical memory in use
                metrics["rss_mb"] = int(line.split()[1]) / 1024
            elif line.startswith("VmPeak:"):
                # Peak virtual memory
                metrics["vm_peak_mb"] = int(line.split()[1]) / 1024
            elif line.startswith("VmSwap:"):
                # How much of this process is swapped to disk
                metrics["swap_mb"] = int(line.split()[1]) / 1024
            elif line.startswith("Threads:"):
                metrics["threads"] = int(line.split()[1])
            elif line.startswith("voluntary_ctxt_switches:"):
                metrics["voluntary_ctx_switches"] = int(line.split()[1])
            elif line.startswith("nonvoluntary_ctxt_switches:"):
                metrics["nonvoluntary_ctx_switches"] = int(line.split()[1])

    return metrics


def get_system_memory() -> dict:
    """Parse /proc/meminfo for system-wide memory statistics."""
    meminfo = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, val_unit = line.split(":", 1)
            val_str = val_unit.strip().split()[0]
            meminfo[key.strip()] = int(val_str)  # values are in kB

    return {
        "total_gb": meminfo["MemTotal"] / (1024 * 1024),
        "available_gb": meminfo["MemAvailable"] / (1024 * 1024),
        "free_gb": meminfo["MemFree"] / (1024 * 1024),
        "cached_gb": meminfo["Cached"] / (1024 * 1024),
        "swap_used_gb": (meminfo["SwapTotal"] - meminfo["SwapFree"]) / (1024 * 1024),
    }


def get_open_file_descriptors(pid: int = None) -> list:
    """List open file descriptors for a process via /proc/PID/fd/."""
    if pid is None:
        pid = os.getpid()

    fd_dir = f"/proc/{pid}/fd"
    fds = []
    for entry in os.scandir(fd_dir):
        try:
            target = os.readlink(entry.path)
            fds.append({"fd": int(entry.name), "target": target})
        except (PermissionError, OSError):
            pass  # fd closed between scandir and readlink
    return fds


def get_cpu_affinity_info() -> dict:
    """Read CPU set and scheduling info from /proc/self/status."""
    with open("/proc/self/status") as f:
        content = f.read()

    cpus_allowed = re.search(r"Cpus_allowed:\s+(\S+)", content)
    return {
        "cpus_allowed_mask": cpus_allowed.group(1) if cpus_allowed else "unknown",
        "pid": os.getpid(),
    }


# Print a training job's memory profile
info = get_process_memory_mb()
print(f"RSS: {info['rss_mb']:.0f} MB")
print(f"Swap: {info['swap_mb']:.0f} MB")
print(f"Threads: {info['threads']}")
print(f"Context switches (voluntary): {info['voluntary_ctx_switches']}")

sys_mem = get_system_memory()
print(f"System memory available: {sys_mem['available_gb']:.1f} GB")

/proc/self/maps: Memory Mapping Inspection

def get_memory_maps(pid: int = None) -> list:
    """
    Parse /proc/PID/maps to see all virtual memory regions.
    Useful for understanding what is mapped into a PyTorch process:
    - Tensor storage
    - mmap'd files (datasets, model weights)
    - Shared libraries (torch, cudart, etc.)
    - Stack and heap regions
    """
    if pid is None:
        pid = os.getpid()

    maps = []
    with open(f"/proc/{pid}/maps") as f:
        for line in f:
            parts = line.strip().split(None, 5)
            if len(parts) < 5:
                continue

            addr_range = parts[0]
            perms = parts[1]
            offset = parts[2]
            pathname = parts[5] if len(parts) > 5 else ""

            start, end = addr_range.split("-")
            size_kb = (int(end, 16) - int(start, 16)) // 1024

            maps.append({
                "start": start,
                "end": end,
                "size_kb": size_kb,
                "perms": perms,
                "path": pathname.strip(),
            })

    return maps


def summarize_memory_maps():
    maps = get_memory_maps()

    heap_kb = sum(m["size_kb"] for m in maps if "[heap]" in m["path"])
    stack_kb = sum(m["size_kb"] for m in maps if "[stack]" in m["path"])
    torch_kb = sum(m["size_kb"] for m in maps if "torch" in m["path"])
    anon_kb = sum(m["size_kb"] for m in maps if not m["path"])

    print(f"Heap: {heap_kb / 1024:.1f} MB")
    print(f"Stack: {stack_kb / 1024:.1f} MB")
    print(f"PyTorch libs: {torch_kb / 1024:.1f} MB")
    print(f"Anonymous: {anon_kb / 1024:.1f} MB (likely tensor storage)")

summarize_memory_maps()

epoll: High-Performance Event-Driven I/O

epoll monitors many file descriptors simultaneously and returns only those that are ready, in O(1) time. It is the foundation of asyncio and every Python web server:

import select
import socket

def create_epoll_inference_server(host: str, port: int, n_requests: int):
    """
    Edge-triggered epoll server for ML inference requests.
    Handles hundreds of concurrent connections without a thread per connection.
    """
    # Create listening socket
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen(1024)
    server.setblocking(False)

    # Create epoll instance
    ep = select.epoll()

    # Register server socket for read events (new connections)
    ep.register(server.fileno(), select.EPOLLIN)

    connections = {}  # fd -> socket
    requests = {}     # fd -> accumulated request bytes
    completed = 0

    while completed < n_requests:
        # Wait for events - blocks until at least one fd is ready
        events = ep.poll(timeout=1.0)  # 1 second timeout

        for fd, event in events:
            if fd == server.fileno():
                # New connection
                conn, addr = server.accept()
                conn.setblocking(False)
                ep.register(conn.fileno(),
                            select.EPOLLIN | select.EPOLLET)  # edge-triggered
                connections[conn.fileno()] = conn
                requests[conn.fileno()] = b""

            elif event & select.EPOLLIN:
                # Existing connection has data
                conn = connections[fd]
                try:
                    while True:
                        data = conn.recv(4096)
                        if not data:
                            break
                        requests[fd] += data
                except BlockingIOError:
                    pass  # EAGAIN: no more data right now (edge-triggered)

                # Parse and respond (simplified)
                request_data = requests[fd]
                if b"\r\n\r\n" in request_data:
                    # Run inference (would call PyTorch here)
                    response = b"HTTP/1.1 200 OK\r\nContent-Length: 4\r\n\r\nOK\r\n"
                    conn.sendall(response)
                    ep.unregister(fd)
                    conn.close()
                    del connections[fd]
                    del requests[fd]
                    completed += 1

    ep.close()
    server.close()
    print(f"Served {completed} inference requests")

eventfd and timerfd: Kernel-to-User Signaling

eventfd creates a file descriptor that can be used for inter-thread/inter-process event notification. It is lighter than a pipe and works with epoll:

import ctypes
import os
import struct
import select

# eventfd via ctypes (portable; Python 3.10+ also has os.eventfd - see below)
libc = ctypes.CDLL("libc.so.6", use_errno=True)
EFD_NONBLOCK = 0x800
EFD_SEMAPHORE = 0x1

def eventfd(initval: int = 0, flags: int = 0) -> int:
    """Create an eventfd - returns a file descriptor."""
    fd = libc.eventfd(ctypes.c_uint(initval), ctypes.c_int(flags))
    if fd < 0:
        errno = ctypes.get_errno()
        raise OSError(errno, os.strerror(errno))
    return fd

def eventfd_write(fd: int, value: int = 1):
    """Signal an event by writing a uint64 to the fd."""
    data = struct.pack("Q", value)  # uint64
    os.write(fd, data)

def eventfd_read(fd: int) -> int:
    """Read (and reset) the event counter."""
    data = os.read(fd, 8)
    return struct.unpack("Q", data)[0]


# Use eventfd to signal training completion across processes
import multiprocessing

def training_worker(done_efd: int, n_steps: int):
    """Worker process: run training, signal completion via eventfd."""
    import time
    for step in range(n_steps):
        time.sleep(0.01)  # simulate training step
    print(f"Worker: {n_steps} steps done")
    eventfd_write(done_efd, 1)  # signal the controller

def controller_with_eventfd():
    """Main process: watch for worker completion using epoll + eventfd."""
    efd = eventfd()

    worker = multiprocessing.Process(
        target=training_worker, args=(efd, 50))
    worker.start()

    # Use epoll to wait for the eventfd
    ep = select.epoll()
    ep.register(efd, select.EPOLLIN)

    print("Controller: waiting for training to complete...")
    events = ep.poll(timeout=30.0)  # wait up to 30s

    if events:
        count = eventfd_read(efd)
        print(f"Controller: training complete (count={count})")
    else:
        print("Controller: timeout!")

    worker.join()
    ep.close()
    os.close(efd)

controller_with_eventfd()
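On Python 3.10 and newer, the stdlib exposes eventfd directly, so the ctypes shim above is only needed on older interpreters. A minimal sketch:

import os

efd = os.eventfd(0)
os.eventfd_write(efd, 1)       # signal
print(os.eventfd_read(efd))    # -> 1: reads and resets the counter
os.close(efd)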

io_uring: The Modern Async I/O Interface

io_uring (Linux 5.1+) uses shared memory ring buffers between user space and kernel to submit and complete I/O without syscalls in the fast path. Python bindings are available via liburing:

# io_uring via liburing Python bindings (pip install liburing)
# This demonstrates the concept - install liburing for production use

import os

def demonstrate_io_uring_concept():
    """
    io_uring uses two ring buffers in shared memory:
      - Submission Queue (SQ): user pushes I/O requests
      - Completion Queue (CQ): kernel pushes results

    No syscall is needed to push to the SQ or read from the CQ in the
    fast path. One io_uring_enter() syscall can submit N requests.

    read(2):   one syscall per read operation
    io_uring:  one syscall per BATCH of operations
    """
    # Pseudocode (actual io_uring requires liburing or cffi):
    #
    # ring = io_uring_setup(queue_depth=64)
    #
    # # Submit 64 reads in one syscall
    # for i, (fd, offset, buf) in enumerate(pending_reads):
    #     sqe = ring.get_sqe()
    #     sqe.opcode = IORING_OP_READ
    #     sqe.fd = fd
    #     sqe.off = offset
    #     sqe.addr = buf
    #     sqe.len = BLOCK_SIZE
    #     sqe.user_data = i  # tag for completion matching
    #
    # io_uring_submit(ring)  # ONE syscall for all 64 reads
    #
    # # Wait for completions (also one syscall per batch)
    # for cqe in ring.wait_cqes(n=64):
    #     idx = cqe.user_data
    #     handle_completion(idx, cqe.res)
    #
    # Performance vs traditional async:
    #   Traditional: 64 read() syscalls = 64 * 300 ns = ~19,200 ns overhead
    #   io_uring:    2 syscalls (submit + wait) = 2 * 300 ns = ~600 ns overhead
    #   Speedup: 32x in syscall overhead alone
    print("io_uring: batch I/O with minimal syscall overhead")
    print("Key advantage: amortize syscall cost across many I/O operations")
    print("Use case: reading many small files for ML dataset loading")


# Practical: benchmark read() vs pread() vs mmap for dataset loading
import time
import tempfile
import mmap
import numpy as np

def benchmark_read_strategies(n_samples: int = 10000,
                              sample_size: int = 4096):
    """Compare traditional read() vs mmap for random access."""

    # Create test file
    with tempfile.NamedTemporaryFile(delete=False) as f:
        path = f.name
        data = np.random.rand(n_samples * sample_size // 4).astype(np.float32)
        f.write(data.tobytes())

    file_size = os.path.getsize(path)
    print(f"Test file: {file_size / 1024 / 1024:.1f} MB")

    # Strategy 1: sequential read with read()
    t0 = time.perf_counter()
    with open(path, "rb") as f:
        for _ in range(n_samples):
            chunk = f.read(sample_size)
    seq_read_time = time.perf_counter() - t0

    # Strategy 2: random access with pread()
    indices = np.random.randint(0, n_samples, n_samples)
    fd = os.open(path, os.O_RDONLY)
    t0 = time.perf_counter()
    for idx in indices:
        os.pread(fd, sample_size, idx * sample_size)
    os.close(fd)
    pread_time = time.perf_counter() - t0

    # Strategy 3: mmap for random access
    t0 = time.perf_counter()
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        for idx in indices:
            offset = idx * sample_size
            mm[offset:offset + sample_size]  # zero-copy page access
        mm.close()
    mmap_time = time.perf_counter() - t0

    print(f"Sequential read(): {seq_read_time:.3f}s")
    print(f"Random pread():    {pread_time:.3f}s")
    print(f"Random mmap:       {mmap_time:.3f}s")

    os.unlink(path)

benchmark_read_strategies()
demonstrate_io_uring_concept()

futex: The Kernel Primitive Behind Every Lock

futex (Fast User-space muTEX) is the syscall that implements every mutex and condition variable on Linux. In the uncontended case, it is entirely in user space (just an atomic compare-and-swap). Only when there is contention does it enter the kernel:

import ctypes
import ctypes.util

# futex is rarely called directly in Python - this demonstrates what
# Python's threading.Lock() does under the hood

libc_name = ctypes.util.find_library("c")
libc = ctypes.CDLL(libc_name, use_errno=True)

FUTEX_WAIT = 0
FUTEX_WAKE = 1
SYS_FUTEX = 202  # __NR_futex on x86_64

def futex_wait(addr: ctypes.c_int, expected_val: int, timeout=None):
    """
    Sleep until *addr != expected_val (or timeout).
    Atomically: check if *addr == expected_val; if so, sleep.
    This prevents the "lost wakeup" problem.
    """
    return libc.syscall(SYS_FUTEX,
                        ctypes.byref(addr),  # address of futex word
                        FUTEX_WAIT,          # operation
                        expected_val,        # expected value
                        timeout,             # timeout (None = infinite)
                        None,                # addr2 (unused)
                        0)                   # val3 (unused)

def futex_wake(addr: ctypes.c_int, n_threads: int = 1):
    """Wake up to n_threads threads waiting on this futex."""
    return libc.syscall(SYS_FUTEX,
                        ctypes.byref(addr),
                        FUTEX_WAKE,
                        n_threads,
                        None, None, 0)

# The actual mutex algorithm (what pthread_mutex_lock does internally):
#
# UNCONTENDED (fast path, no syscall):
#   CAS(futex_word, 0 -> 1)   // if lock is free, grab it atomically
#   -> success: lock acquired, no syscall made
#
# CONTENDED (slow path, syscall needed):
#   CAS(futex_word, 0 -> 1)   // fails - lock is held
#   futex(FUTEX_WAIT, 1)      // sleep in kernel until word changes
#   -> woken by futex(FUTEX_WAKE)

Understanding futex explains why lock contention shows up in strace as futex calls with FUTEX_WAIT: the thread has already failed to acquire the lock in user space and is now sleeping in the kernel. High futex wait counts = high lock contention.
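To watch this live, generate deliberate contention and run the script under strace -c -f: the futex row climbs with the number of waiting threads. A toy sketch (the sleep inside the critical section exists only to force threads onto the slow path):

import threading
import time

lock = threading.Lock()

def worker(n: int = 5_000):
    for _ in range(n):
        with lock:         # CAS fails under contention -> futex(FUTEX_WAIT)
            time.sleep(0)  # hold the lock across a scheduler yield

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()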

seccomp: Syscall Filtering for ML Inference Sandboxes

seccomp (secure computing mode) allows a process to restrict which system calls it can make. In ML inference, this is used to sandbox a model server so that even if it is compromised by a malicious model, it cannot exfiltrate data or spawn processes:

# seccomp is typically used via the libseccomp Python bindings
# (packaged as python3-seccomp on most distributions)

def demonstrate_seccomp_sandbox():
    """
    A minimal ML inference server needs only a small set of syscalls:
      read, write, recvfrom, sendto   (for I/O)
      mmap, munmap, mprotect          (for memory)
      futex, nanosleep                (for threading)
      exit, exit_group                (for shutdown)
      brk                             (for heap)
      close, fstat                    (for file operations)

    Block everything else:
      - execve: cannot execute new programs
      - fork/clone: cannot create processes or threads
      - ptrace: cannot debug/trace other processes
      - socket (optionally): cannot open new network connections
    """
    try:
        import seccomp

        # Create a filter that kills the process on unexpected syscall
        f = seccomp.SyscallFilter(defaction=seccomp.KILL_PROCESS)

        # Allow the minimum needed for inference
        # (note: x86_64 has no recv/send syscalls, only recvfrom/sendto
        #  and recvmsg/sendmsg)
        allowed_syscalls = [
            "read", "write", "pread64", "pwrite64",
            "mmap", "munmap", "mprotect", "brk",
            "futex", "nanosleep", "clock_nanosleep",
            "close", "fstat", "lseek",
            "recvfrom", "sendto", "recvmsg", "sendmsg",
            "exit", "exit_group",
            "rt_sigaction", "rt_sigreturn",
            "getpid", "gettid",
        ]

        for syscall in allowed_syscalls:
            try:
                f.add_rule(seccomp.ALLOW, syscall)
            except (ValueError, RuntimeError):
                pass  # syscall name not recognized on this architecture

        f.load()  # apply the filter to this process
        print("seccomp filter loaded - inference sandbox active")

        # From here: any syscall not in allowed_syscalls kills the process
        # os.system("ls")  # would be killed - execve is not allowed

    except ImportError:
        print("libseccomp Python bindings not installed (python3-seccomp)")
        print("Demonstrating the concept - install libseccomp for production use")


demonstrate_seccomp_sandbox()

Production Engineering Notes

Context Switch Cost Is Not Uniform

A "system call" is cheap (200-500 ns) when it returns quickly (e.g., getpid). It is expensive when it blocks. read on an NFS file can block for 10ms+ if the page is not cached. futex(FUTEX_WAIT) blocks until another thread releases a lock. The number of system calls is less important than the time spent inside them. Use strace -T (timing) to find slow syscalls.

/proc/PID/syscall: What Is a Hung Process Doing

When a process is stuck, cat /proc/PID/syscall shows the exact system call it is currently executing (if in a syscall) along with the arguments. This is often faster than attaching GDB for initial diagnosis:

# Check what a hung training process is blocked on
cat /proc/$(pgrep -f train.py)/syscall
# Example output: 7 0x3 0x7f... 0x1000 0x0 0x0 0x7f... 0x... 0x...
# syscall 7 = poll - the process is waiting for a file descriptor
# This tells you it is in a select/poll/epoll wait

File Descriptor Leaks

Every ML framework opens file descriptors (dataset files, checkpoint files, network sockets, GPU devices). Not closing them causes leaks that exhaust the per-process limit (default 1024 on many systems, configurable via ulimit -n). Monitor with:

# Check current FD count for training process
ls /proc/$(pgrep -f train.py)/fd | wc -l

# List what is open
ls -la /proc/$(pgrep -f train.py)/fd/ | head -30

In Python, always use context managers (with open(...)) for file handles. For os.open(), use try/finally or a small context-manager wrapper, as sketched below.
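A minimal sketch of such a wrapper (the file name is illustrative):

import os
from contextlib import contextmanager

@contextmanager
def raw_fd(path: str, flags: int):
    """Guarantee that an os.open() descriptor is closed, even on error."""
    fd = os.open(path, flags)
    try:
        yield fd
    finally:
        os.close(fd)

with raw_fd("dataset.bin", os.O_RDONLY) as fd:
    header = os.pread(fd, 8, 0)  # fd is closed when the block exits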

mmap vs read for Large Dataset Files

For sequential access (reading a dataset file front-to-back), read() with a large buffer (128KB+) matches or beats mmap because the OS prefetches sequentially. For random access (e.g., accessing random samples by index in an HDF5 or binary dataset), mmap wins: the OS only pages in the pages that are actually accessed, and repeated accesses hit the page cache at memory speed.

For embedding tables in recommendation systems - large, sparse, random access - mmap is the right choice. The OS naturally implements LRU eviction of cold rows.
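When the access pattern is known up front, you can tell the kernel explicitly: Python 3.8+ exposes madvise(2) on mmap objects. A sketch (embeddings.bin is a placeholder path; the constants are Linux-specific):

import mmap

with open("embeddings.bin", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

mm.madvise(mmap.MADV_RANDOM)        # random lookups: turn off readahead
# mm.madvise(mmap.MADV_SEQUENTIAL)  # full scans: more aggressive readahead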

Common Mistakes

:::danger Using os.sync() Instead of os.fsync()

os.sync() flushes dirty pages for the entire system - all processes, all files. On a machine with heavy write traffic (checkpointing, log rotation), this can take tens of seconds. os.fsync(fd) flushes only the specific file. After checkpointing:

# WRONG: affects entire system, can take minutes
os.sync()

# RIGHT: flush only your checkpoint file
with open(checkpoint_path, "wb") as f:
torch.save(state_dict, f)
f.flush() # flush Python's userspace buffer
os.fsync(f.fileno()) # flush kernel page cache to storage

# ALSO: fsync the directory to make the file visible after crash
dir_fd = os.open(os.path.dirname(checkpoint_path), os.O_RDONLY)
os.fsync(dir_fd)
os.close(dir_fd)

:::

:::danger File Descriptor Exhaustion in DataLoader Workers

PyTorch's DataLoader with num_workers > 0 forks worker processes. Each worker inherits the parent's file descriptors. If the parent has a large dataset opened via mmap (e.g., LMDB, HDF5), each worker gets a copy of that file descriptor. With 8 workers and 1000 open fds each, you hit the system limit quickly.

# WRONG: open dataset in parent, fork workers
dataset = LMDBDataset("train.lmdb") # opens fds in parent
loader = DataLoader(dataset, num_workers=8) # workers inherit fds

# RIGHT: open dataset lazily in worker_init_fn
def worker_init_fn(worker_id):
    # Each worker opens its own connection
    dataset = torch.utils.data.get_worker_info().dataset
    dataset.open()  # open the LMDB handle here, not in the parent

loader = DataLoader(dataset_stub, num_workers=8,
                    worker_init_fn=worker_init_fn)

:::

:::warning strace Has Significant Overhead

strace intercepts every syscall via ptrace, which adds a full context switch per syscall. A program that normally makes 100k syscalls/sec may run 10-20x slower under strace. Use it for debugging and profiling, not for production monitoring. For production, use eBPF/BPF tools (bpftrace, opensnoop) which have near-zero overhead.

# Production-safe syscall monitoring with bpftrace
bpftrace -e 'tracepoint:syscalls:sys_enter_read
  /pid == $1/ { @[comm] = count(); }' \
  $(pgrep -f train.py)

:::

:::warning epoll Edge-Triggered Mode Requires Non-Blocking Sockets

Edge-triggered epoll (EPOLLET) fires only when a file descriptor transitions from not-ready to ready. If you do not drain all available data in one go, the fd will not fire again even though data remains. This is a common source of "silent hangs" in inference servers. Either drain in a loop until EAGAIN, or use level-triggered mode (EPOLLIN without EPOLLET) which fires as long as data is available.

# WRONG: edge-triggered, partial read
conn.recv(1024)  # reads 1024 bytes, but 5000 bytes are available
# epoll won't fire again until more data arrives - request hangs!

# RIGHT: drain until EAGAIN
buffer = b""
try:
    while True:
        data = conn.recv(65536)  # large buffer
        if not data:
            break
        buffer += data
except BlockingIOError:
    pass  # EAGAIN - all data consumed

:::

Interview Questions and Answers

Q1: What is a system call and what does it cost?

A system call is a controlled transition from user space to kernel space. The CPU switches from ring 3 (user mode) to ring 0 (kernel mode), saves user registers, executes the kernel handler, then returns. On x86_64 with Spectre mitigations, a simple syscall round-trip costs 100-500 ns for the mode switch alone. Syscalls that block (read on a cache-cold file, futex wait on a contended lock) cost as long as the blocking operation takes. The key insight is that the number of syscalls matters less than the time spent in them - one blocked read on NFS costs more than a million fast getpid calls.

Q2: What is mmap and why is it useful for ML dataset loading?

mmap(2) maps a file or anonymous memory into the process's virtual address space. Reading from the mapped region triggers a page fault on the first access; the kernel pages in the relevant 4KB block from the file and caches it. Subsequent accesses hit the page cache at memory bandwidth speeds. For ML dataset loading, mmap is useful when you access data randomly (e.g., random sampling from a binary dataset): only the pages actually accessed are loaded, and they are cached in the kernel's page cache so multiple processes (multiple DataLoader workers) share the same physical pages without redundant copies. For sequential access, buffered read() with large buffers is equally fast or faster.

Q3: What does strace show and how would you use it to diagnose a slow training job?

strace intercepts and logs every system call a process makes, including arguments and return values. strace -c gives a summary of call counts and total time per syscall - this is the first tool to run. If it shows excessive time in pread64 calls, the bottleneck is data loading. If it shows time in futex with FUTEX_WAIT, there is high lock contention. If it shows many mmap calls, there may be excessive memory allocation. strace -T logs the duration of each individual syscall, so you can find specific slow calls like a single fsync that takes 8 seconds. The limitation is that strace has 10-20x overhead, so only run it in diagnosis, not production.

Q4: What is epoll and why is it better than select/poll for an ML inference server?

select and poll require passing all watched file descriptors to the kernel on every call. They scan all FDs to find ready ones. This is O(N) per call: with 10,000 open connections, each poll call copies and scans 10,000 FDs. epoll maintains a kernel-side data structure. epoll_ctl registers FDs once. epoll_wait returns only the FDs that are actually ready, in O(1) time. For an ML inference server with hundreds of concurrent clients, epoll scales to 100,000+ connections where select/poll would bottleneck on FD scanning. Python's asyncio uses epoll on Linux as its event loop backend.

Q5: What is seccomp and how would you use it to sandbox an ML model server?

seccomp (secure computing mode) filters system calls at the kernel level. A process can install a BPF filter that allows, denies, or kills on specific syscalls. For an ML model server, you would identify the minimal syscall set needed (read, write, mmap, futex, socket operations on already-open connections) and block everything else, especially execve (cannot run new programs), fork/clone (cannot spawn processes), and new socket(2) calls (cannot open outbound connections to exfiltrate data). If a malicious model triggers a code path that tries to make a blocked syscall, the kernel kills the process immediately. This limits the blast radius of model supply chain attacks.

Q6: What is futex and how does it relate to Python's threading.Lock?

A futex (Fast User-space muTEX) is a kernel primitive that supports the fast path of mutex implementation entirely in user space. The futex word is a shared integer. To lock: atomically compare-and-swap it from 0 to 1. If the CAS succeeds (lock was free), no syscall is needed - the lock is held. If the CAS fails (lock is held), call futex(FUTEX_WAIT) to sleep in the kernel until woken. To unlock: atomically write 0, then if any threads were sleeping call futex(FUTEX_WAKE). Python's threading.Lock is implemented using pthreads which use futex internally. High futex counts in strace mean high lock contention: the fast path is failing and threads are going into the kernel to sleep.

Q7: What is the /proc filesystem and name three files that are useful for ML system debugging?

/proc is a virtual filesystem that exposes kernel data structures as files. It is read-only (mostly) and does not correspond to real disk storage - the kernel generates content on the fly when you read it. Three files useful for ML: (1) /proc/PID/status - shows the process's RSS (physical memory), virtual memory, swap usage, thread count, and voluntary/involuntary context switch count. High involuntary context switches indicate preemption by the scheduler. (2) /proc/PID/syscall - shows which system call the process is currently executing, useful for diagnosing hung processes without GDB. (3) /proc/meminfo - shows system-wide memory statistics including MemAvailable (how much memory is free without swapping), useful for detecting when a training job is about to OOM before it actually does.
