
Virtual Memory and Page Faults

The Model That Would Not Load

A production ML team was deploying a 70-billion-parameter language model for real-time inference. The model checkpoint was 140 GB. The inference server had 256 GB of RAM. By every account, this should have worked. Instead, the first inference request after deployment took 47 seconds - 47 seconds before a single token was generated. Subsequent requests were fast. But that first request was catastrophically slow, and their SLA required a response in under 3 seconds.

The team initially blamed slow disk I/O. They switched from HDD to NVMe SSD, and 47 seconds became 31 seconds - still nowhere near acceptable. They then loaded the model with torch.load(..., mmap=True), and the first request dropped to 4.1 seconds. Interesting, but still too slow.

A deeper investigation revealed what was actually happening. When PyTorch loaded the model weights from disk, it allocated a large contiguous block of virtual memory - but Linux uses demand paging, so no physical memory was populated until each page was first accessed. That first inference request triggered 34 million page faults as the kernel scrambled to load 140 GB of model weights from disk, one 4 KB page at a time. The kernel handled each fault efficiently, but 34 million traps into the kernel still take time.

The final fix used three techniques together: mmap for lazy loading, hugepages (2 MB pages instead of 4 KB pages), and mlock to pin the model weights in RAM after the first warmup request. The combination reduced the first-request latency from 47 seconds to 1.2 seconds, within their SLA. The team had shipped to production without understanding what virtual memory actually does to large model loading - and paid for it with 6 weeks of debugging.

Understanding virtual memory is not academic knowledge for ML engineers. It is the difference between shipping a working inference service and spending weeks debugging latency spikes that your monitoring tools will not explain. Memory-mapped datasets, CUDA pinned memory, huge page allocations, OOM killer tuning - all of these are virtual memory concepts in disguise.

This lesson explains virtual memory from first principles, walks through every mechanism that matters for ML workloads, and gives you the tools to diagnose and fix memory-related performance problems in production systems.


Why This Exists - The Problem Virtual Memory Solves

Before virtual memory, programs used physical addresses directly. This created three catastrophic problems. First, you could not safely run two programs at once - a bug in one would corrupt the other's memory. Second, programs were limited to the amount of physical RAM installed. Third, loading a program required it to fit entirely in RAM before any of it could execute.

Virtual memory solves all three problems at once. The OS gives each process the illusion of owning a large, contiguous address space. The kernel transparently maps virtual addresses to physical addresses, keeping pages on disk when RAM is full, and enforcing boundaries between processes. Programs can be larger than physical RAM. Bugs cannot corrupt other processes. And the OS can start executing a program before it is fully loaded from disk.

For ML workloads, this means you can memory-map a 500 GB dataset and access it as if it were all in RAM - even on a machine with only 64 GB of physical memory. The OS handles the movement of data between disk and RAM automatically, in 4 KB chunks called pages.


Historical Context

1961 - Atlas Supervisor. The first virtual memory system was implemented by Tom Kilburn and colleagues on the Atlas computer at the University of Manchester. The system used a "one-level store" that presented programmers with a 1-million-word address space, far larger than the physical memory available. Pages not resident in core memory were automatically fetched on demand from drum storage.

1979 - Unix demand paging. BSD Unix on the VAX brought demand paging to Unix, using the VAX's hardware page tables for virtual-to-physical translation. This hardware support made virtual memory practical for general-purpose computing.

1985 - The 386 and the 32-bit era. The Intel 80386 added paging with hardware-enforced page tables to x86, making virtual memory ubiquitous in personal computers. Each process got a 4 GB address space.

2003 - x86-64 and 64-bit addressing. AMD's x86-64 architecture extended virtual addresses to 64 bits (48 bits implemented initially, 57 bits with 5-level paging), giving each process a user virtual address space of 128 TB, or 64 PB with 5-level paging. This made it practical to memory-map terabyte-scale datasets.

2011 - Transparent Huge Pages (THP). Linux 2.6.38 introduced THP, which automatically promotes standard 4 KB page mappings to 2 MB huge pages when possible. This reduces TLB pressure for large allocations - critical for ML workloads that access large, contiguous memory regions.


Core Concepts

Virtual Address Space Layout

Every 64-bit Linux process has a virtual address space of 128 TB (with 4-level paging). This space is divided into regions:

Virtual Address Space (64-bit Linux process, 4-level paging)

0xFFFFFFFFFFFFFFFF  +-----------------------------+
                    |  Kernel space (128 TB)      |   <- not accessible to userspace
0xFFFF800000000000  +-----------------------------+
                    |  (non-canonical hole)       |
0x00007FFFFFFFFFFF  +-----------------------------+
                    |  Stack (grows down)         |   <- default 8 MB soft limit
                    |  ...                        |
                    |  mmap region                |   <- shared libs, mmap() calls
                    |  ...                        |
                    |  Heap (grows up)            |   <- malloc(), torch.Tensor alloc
                    |  BSS (zero-init globals)    |
                    |  Data (init globals)        |
0x0000000000400000  |  Text (code)                |   <- read-only executable code
                    +-----------------------------+

The key regions for ML workloads:

  • Text segment: your Python interpreter and C extensions (PyTorch, NumPy). Read-only. Shared between all Python processes via copy-on-write.
  • Heap: where malloc() and Python's memory allocator live. PyTorch tensors on CPU are allocated here via c10::Allocator.
  • mmap region: memory-mapped files (datasets, model weights), anonymous mmap (large allocations above MMAP_THRESHOLD, default 128 KB), shared memory segments.
  • Stack: each thread has its own stack. The main thread gets an 8 MB soft limit by default; worker threads created via threading.Thread also get 8 MB each (you can verify both with the snippet below).
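
To confirm the stack numbers on a given box, you can query the limits directly. This is a small sketch using only the standard library:

import resource
import threading

# Soft/hard limits for the main thread's stack (typically 8 MB soft on Linux)
soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
if soft == resource.RLIM_INFINITY:
    print("main thread stack: unlimited")
else:
    print(f"main thread stack soft limit: {soft / 2**20:.0f} MB")   # typically 8 MB

# threading.stack_size() returns 0 when worker threads use the platform default
print(f"threading.stack_size(): {threading.stack_size()}  (0 = platform default)")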

Page Tables - The Translation Machinery

Every virtual address must be translated to a physical address. Modern x86-64 CPUs use a 4-level page table hierarchy:

$$\text{Virtual Address} \rightarrow \text{PGD} \rightarrow \text{PUD} \rightarrow \text{PMD} \rightarrow \text{PTE} \rightarrow \text{Physical Frame}$$

Where:

  • PGD (Page Global Directory): bits [47:39] of the virtual address (9 bits, 512 entries)
  • PUD (Page Upper Directory): bits [38:30] (9 bits)
  • PMD (Page Middle Directory): bits [29:21] (9 bits)
  • PTE (Page Table Entry): bits [20:12] (9 bits, points to 4 KB physical page)
  • Page offset: bits [11:0] (12 bits, offset within the 4 KB page)

Each page table walk requires 4 memory accesses. For a tight loop accessing random memory locations, this overhead dominates. The TLB exists to avoid this cost.
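
To make the bit layout concrete, here is a minimal sketch (illustrative only, not part of the kernel or any library) that splits a 48-bit virtual address into the four table indices and the page offset described above:

PAGE_OFFSET_BITS = 12   # 4 KB pages
INDEX_BITS = 9          # 512 entries per table level

def split_virtual_address(vaddr: int) -> dict:
    """Return the PGD/PUD/PMD/PTE indices and page offset for a virtual address."""
    offset = vaddr & ((1 << PAGE_OFFSET_BITS) - 1)      # bits [11:0]
    pte    = (vaddr >> 12) & ((1 << INDEX_BITS) - 1)    # bits [20:12]
    pmd    = (vaddr >> 21) & ((1 << INDEX_BITS) - 1)    # bits [29:21]
    pud    = (vaddr >> 30) & ((1 << INDEX_BITS) - 1)    # bits [38:30]
    pgd    = (vaddr >> 39) & ((1 << INDEX_BITS) - 1)    # bits [47:39]
    return {"pgd": pgd, "pud": pud, "pmd": pmd, "pte": pte, "offset": offset}

# Example: inspect the address of a NumPy array's data buffer
# import numpy as np
# a = np.zeros(1024)
# print(split_virtual_address(a.ctypes.data))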

The TLB and TLB Shootdown

The Translation Lookaside Buffer (TLB) is a small, fast cache inside the CPU that stores recent virtual-to-physical address translations. On a modern CPU, the L1 TLB holds 64-128 entries for 4 KB pages. A TLB hit costs 1-2 cycles. A TLB miss requires the 4-level page table walk - 4 memory accesses, potentially 200-400 cycles if the page table entries are not in L1 cache.

TLB coverage with 4 KB pages: $64 \text{ entries} \times 4 \text{ KB} = 256 \text{ KB}$ of address space. A PyTorch model with 1 billion parameters uses $4 \text{ GB}$ of RAM - covering this with 4 KB pages requires $\frac{4 \text{ GB}}{4 \text{ KB}} = 1{,}048{,}576$ TLB entries. With only 64 entries in the TLB, almost every access to a new page misses.

With 2 MB huge pages: $64 \text{ entries} \times 2 \text{ MB} = 128 \text{ MB}$ of address space covered. A 4 GB model needs only 2,048 huge page TLB entries - still more than 64, but the miss rate drops dramatically.

TLB Shootdown occurs in multi-core systems when one CPU modifies a page table entry. All other CPUs that might have cached the old translation must invalidate their TLBs. This is done via an inter-processor interrupt (IPI) - every core stops what it is doing to flush its TLB. On a 96-core machine, a single TLB shootdown interrupts 95 other cores. Heavy mmap/munmap usage (like PyTorch's allocator resizing tensors) generates TLB shootdowns that degrade throughput across all training workers.
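
On Linux, TLB shootdowns are visible in /proc/interrupts: the row labeled TLB counts shootdown IPIs delivered to each core. Here is a small sketch (assuming a Linux host; run_training_step is a placeholder for your own workload) that sums that row so you can sample it before and after a piece of code:

def tlb_shootdown_count() -> int:
    """Sum TLB shootdown interrupts across all CPUs from /proc/interrupts."""
    with open("/proc/interrupts") as f:
        for line in f:
            parts = line.split()
            if parts and parts[0] == "TLB:":
                # Numeric columns are per-CPU counts; the trailing text is the description
                return sum(int(p) for p in parts[1:] if p.isdigit())
    return 0

# Sample before/after a workload to estimate the shootdowns it caused:
# before = tlb_shootdown_count()
# run_training_step()          # placeholder for the code under test
# print(tlb_shootdown_count() - before)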

Minor vs Major Page Faults

When a program accesses a virtual address and the page is not present in physical memory, the CPU triggers a page fault - a hardware exception that transfers control to the kernel's page fault handler.

Minor page fault: The page exists in memory but is not mapped in this process's page table. Causes: accessing a newly mmap'd region for the first time (demand paging), copy-on-write after fork, stack growth. Resolution: the kernel updates the page table entry to point to the existing physical page. No disk I/O. Cost: ~1-5 microseconds.

Major page fault: The page is not in physical memory at all - it must be loaded from disk. Causes: first access to a memory-mapped file page, accessing swapped-out pages. Resolution: the kernel issues a disk read, waits for the I/O to complete, then updates the page table. Cost: 1-10 milliseconds (SSD) or 5-20 milliseconds (HDD). This is the fault that caused the 47-second first inference request.

$$\text{Major fault cost} \approx \frac{\text{model size}}{\text{page size}} \times \text{disk latency} = \frac{140 \text{ GB}}{4 \text{ KB}} \times 100\,\mu\text{s} \approx 3{,}500 \text{ seconds (sequential)}$$

Linux parallelizes page fault handling, but there is still significant overhead from 34 million kernel entry/exits.

import os

def get_page_fault_stats(pid: int = None) -> dict:
    """
    Read page fault counts from /proc/PID/stat.
    minflt = minor page faults since process start
    majflt = major page faults since process start
    """
    if pid is None:
        pid = os.getpid()

    with open(f"/proc/{pid}/stat", "r") as f:
        fields = f.read().split()

    # Field indices from the proc(5) man page (0-indexed after splitting)
    minflt = int(fields[9])    # minor faults
    majflt = int(fields[11])   # major faults

    return {
        "pid": pid,
        "minor_faults": minflt,
        "major_faults": majflt,
    }

def measure_page_faults_during(func, *args, **kwargs):
    """Measure page faults caused by a function call."""
    before = get_page_fault_stats()
    result = func(*args, **kwargs)
    after = get_page_fault_stats()
    return result, {
        "minor_faults": after["minor_faults"] - before["minor_faults"],
        "major_faults": after["major_faults"] - before["major_faults"],
    }

# Example: measuring page faults when loading a large tensor
import torch

def load_large_tensor(path: str) -> torch.Tensor:
    return torch.load(path)

# tensor, faults = measure_page_faults_during(load_large_tensor, "model.pt")
# print(f"Minor: {faults['minor_faults']}, Major: {faults['major_faults']}")

Demand Paging

Linux does not load a program's pages into RAM until they are actually accessed. When torch.load() allocates memory for model weights, the kernel creates virtual address mappings but does not assign physical pages. Only when code reads or writes to those addresses does the CPU trigger a page fault and the kernel assign physical memory.

This is why a process can allocate 100 GB of virtual memory on a machine with 16 GB of RAM without immediately crashing - as long as it never actually touches more than 16 GB worth of pages. It is also why /proc/PID/status shows VmRSS (resident set size - the physical pages actually in RAM) as much smaller than VmSize (the virtual address space size).

import os

def memory_stats(pid: int = None) -> dict:
    """Parse /proc/PID/status for memory information."""
    if pid is None:
        pid = os.getpid()

    stats = {}
    with open(f"/proc/{pid}/status", "r") as f:
        for line in f:
            if line.startswith(("VmPeak", "VmSize", "VmRSS",
                                "VmSwap", "RssAnon",
                                "RssFile", "RssShmem")):
                key, value = line.strip().split(":", 1)
                stats[key.strip()] = value.strip()
    return stats

# Typical output for a PyTorch inference process:
# VmPeak:  85000000 kB   <- peak virtual memory ever used
# VmSize:  72000000 kB   <- current virtual address space
# VmRSS:   12000000 kB   <- actual physical pages in RAM
# VmSwap:    200000 kB   <- pages currently swapped to disk
# RssAnon: 10000000 kB   <- anonymous pages (heap, stack)
# RssFile:  2000000 kB   <- file-backed pages (mmap'd model)
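
A quick way to see demand paging in action, using the memory_stats() helper above (a sketch assuming a Linux host): reserve a large anonymous mapping, confirm that VmSize jumps while VmRSS barely moves, then touch the pages and watch VmRSS grow.

import mmap

def demand_paging_demo(size_gb: int = 2) -> None:
    """Reserve virtual memory without touching it, then fault pages in by writing."""
    size = size_gb * 1024**3
    buf = mmap.mmap(-1, size)                    # anonymous mapping: virtual only
    print("after mmap:    ", memory_stats())    # VmSize grows, VmRSS does not
    for offset in range(0, size, 4096):          # write one byte per 4 KB page
        buf[offset] = 1                          # each write triggers a minor fault
    print("after touching:", memory_stats())    # VmRSS now reflects the touched pages
    buf.close()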

Memory-Mapped Files for ML Datasets

mmap maps a file directly into the process's virtual address space. Reads are demand-paged from disk (or the page cache). This is fundamentally different from read(): with read(), the kernel copies data from page cache into your buffer (one extra copy). With mmap, you access the page cache directly - zero copy.

import mmap
import struct
import numpy as np

def create_mmapped_dataset(data: np.ndarray, path: str) -> None:
    """Write a binary dataset file that can be efficiently mmap'd."""
    with open(path, "wb") as f:
        # Write header: dtype, shape, then raw bytes
        header = struct.pack(
            "!4sII",
            data.dtype.str.encode(),   # dtype string e.g. '<f4'
            *data.shape                # height, width (for 2D)
        )
        f.write(header)
        f.write(data.tobytes())
    print(f"Wrote {data.nbytes / 1e9:.2f} GB dataset to {path}")

class MmappedDataset:
    """
    Dataset backed by a memory-mapped file.
    The OS page cache handles which pages are in RAM at any time.
    Perfect for datasets larger than RAM.
    """

    def __init__(self, path: str):
        self.path = path
        self.file = open(path, "rb")

        # Read the header to get dtype and shape
        header_size = 12   # 4 bytes dtype + 4 bytes height + 4 bytes width
        raw_header = self.file.read(header_size)
        dtype_str, h, w = struct.unpack("!4sII", raw_header)
        self.dtype = np.dtype(dtype_str.decode().strip('\x00'))
        self.shape = (h, w)
        self.item_size = self.dtype.itemsize * w
        self.n_items = h

        # Memory-map the data region (everything after the header)
        self.mm = mmap.mmap(
            self.file.fileno(),
            length=0,                   # 0 = map the whole file
            access=mmap.ACCESS_READ,    # read-only
            offset=0
        )
        self.data_offset = header_size

    def __len__(self) -> int:
        return self.n_items

    def __getitem__(self, idx: int) -> np.ndarray:
        offset = self.data_offset + idx * self.item_size
        # This may trigger a minor or major page fault on first access
        raw = self.mm[offset:offset + self.item_size]
        return np.frombuffer(raw, dtype=self.dtype).copy()

    def __del__(self):
        self.mm.close()
        self.file.close()

# HuggingFace datasets uses mmap internally via Apache Arrow.
# The same principle: map the dataset file into virtual memory,
# let the OS page cache manage what's in RAM.
from datasets import load_dataset

def load_with_mmap(dataset_name: str):
    """
    HuggingFace datasets automatically uses mmap for large datasets.
    Access patterns determine which pages stay in RAM (LRU eviction).
    """
    dataset = load_dataset(
        dataset_name,
        keep_in_memory=False,   # use mmap, not a RAM copy
        streaming=False
    )
    return dataset

Analyzing Virtual Memory with /proc/PID/maps

import os
import re
from dataclasses import dataclass
from typing import List

@dataclass
class VMARegion:
    """Represents one row from /proc/PID/maps"""
    start: int
    end: int
    perms: str      # e.g. 'rwxp' or 'r--s'
    offset: int
    dev: str
    inode: int
    pathname: str
    size_kb: int    # computed

    def __repr__(self):
        size_mb = self.size_kb / 1024
        return (f"[{self.start:#018x}-{self.end:#018x}] "
                f"{self.perms} {size_mb:8.1f} MB {self.pathname}")

def parse_proc_maps(pid: int = None) -> List[VMARegion]:
    """Parse /proc/PID/maps to understand memory layout."""
    if pid is None:
        pid = os.getpid()

    regions = []
    pattern = re.compile(
        r"([0-9a-f]+)-([0-9a-f]+)\s+"   # address range
        r"([rwxps-]{4})\s+"             # permissions
        r"([0-9a-f]+)\s+"               # offset
        r"([0-9a-f:]+)\s+"              # device
        r"(\d+)\s*"                     # inode
        r"(.*)"                         # pathname (optional)
    )

    with open(f"/proc/{pid}/maps", "r") as f:
        for line in f:
            m = pattern.match(line.strip())
            if m:
                start = int(m.group(1), 16)
                end = int(m.group(2), 16)
                regions.append(VMARegion(
                    start=start,
                    end=end,
                    perms=m.group(3),
                    offset=int(m.group(4), 16),
                    dev=m.group(5),
                    inode=int(m.group(6)),
                    pathname=m.group(7).strip(),
                    size_kb=(end - start) // 1024
                ))
    return regions

def summarize_memory_by_type(pid: int = None) -> dict:
    """Summarize memory usage by region type."""
    regions = parse_proc_maps(pid)
    summary = {
        "heap": 0, "stack": 0, "text": 0, "mmap_file": 0,
        "mmap_anon": 0, "shared_lib": 0, "other": 0
    }
    for r in regions:
        kb = r.size_kb
        if r.pathname == "[heap]":
            summary["heap"] += kb
        elif r.pathname == "[stack]":
            summary["stack"] += kb
        elif r.pathname.endswith(".so") or ".so." in r.pathname:
            summary["shared_lib"] += kb
        elif r.pathname and not r.pathname.startswith("["):
            if "r-x" in r.perms:
                summary["text"] += kb
            else:
                summary["mmap_file"] += kb
        elif not r.pathname:
            summary["mmap_anon"] += kb
        else:
            summary["other"] += kb
    return {k: f"{v/1024:.1f} MB" for k, v in summary.items()}
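
For example, running the summary on the current Python process (the output values below are illustrative, not measured):

if __name__ == "__main__":
    for region_type, size in summarize_memory_by_type().items():
        print(f"{region_type:12s} {size}")
    # e.g. shared_lib    350.2 MB   <- PyTorch/NumPy .so files
    #      mmap_file    2048.0 MB   <- memory-mapped dataset or model
    #      heap          512.4 MB   <- Python objects and CPU tensors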

Huge Pages for ML Workloads

Standard 4 KB pages cause severe TLB pressure for large ML allocations. Huge pages (2 MB) reduce the number of TLB entries needed by a factor of 512.

import ctypes
import ctypes.util
import mmap
import numpy as np

# Huge page allocation via mmap with MAP_HUGETLB
MAP_HUGETLB = 0x40000              # Linux-specific flag
HUGEPAGE_SIZE = 2 * 1024 * 1024    # 2 MB
PAGE_SIZE = mmap.PAGESIZE          # usually 4 KB

def allocate_with_hugepages(size_bytes: int) -> memoryview:
    """
    Allocate memory backed by huge pages.
    Requires /proc/sys/vm/nr_hugepages > 0 or transparent huge pages enabled.

    sudo sysctl -w vm.nr_hugepages=512   # pre-allocate 512 x 2 MB = 1 GB
    """
    # Round up to a huge page boundary
    size_aligned = ((size_bytes + HUGEPAGE_SIZE - 1) // HUGEPAGE_SIZE) * HUGEPAGE_SIZE

    try:
        buf = mmap.mmap(
            -1,               # anonymous (no file backing)
            size_aligned,
            mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS | MAP_HUGETLB,
            mmap.PROT_READ | mmap.PROT_WRITE,
        )
        return memoryview(buf)
    except OSError as e:
        print(f"Huge page allocation failed: {e}")
        print("Falling back to standard pages")
        return memoryview(bytearray(size_bytes))

def check_thp_status() -> str:
    """Check if Transparent Huge Pages are enabled."""
    # Output looks like: "always [madvise] never"
    # [madvise] = only for regions that explicitly request THP via madvise()
    # [always]  = automatically promote eligible regions
    try:
        with open("/sys/kernel/mm/transparent_hugepage/enabled") as f:
            return f.read().strip()
    except FileNotFoundError:
        return "THP not available (not Linux)"

def advise_hugepages_for_tensor(data: np.ndarray) -> None:
    """
    Ask the kernel to prefer huge pages for this array via madvise(MADV_HUGEPAGE).
    Only has an effect when THP is in 'madvise' (or 'always') mode.
    """
    libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
    MADV_HUGEPAGE = 14   # Linux constant

    addr = data.ctypes.data
    length = data.nbytes
    # madvise() requires a page-aligned start address: align down to the 4 KB page
    # containing the buffer and extend the length to cover the buffer fully
    aligned_addr = addr & ~(PAGE_SIZE - 1)
    aligned_length = (length + (addr - aligned_addr) + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1)

    ret = libc.madvise(ctypes.c_void_p(aligned_addr), ctypes.c_size_t(aligned_length), MADV_HUGEPAGE)
    if ret != 0:
        errno = ctypes.get_errno()
        print(f"madvise(MADV_HUGEPAGE) failed: errno={errno}")
    else:
        print(f"Requested huge pages for {length / 1e6:.1f} MB tensor")

Transparent Huge Pages (THP) and ML Training

THP automatically promotes 4 KB page ranges to 2 MB huge pages. For training workloads, the interaction with THP is complex:

# Check current THP settings
cat /sys/kernel/mm/transparent_hugepage/enabled
# [always] madvise never

# For ML training on dedicated hardware, 'always' is usually best
echo always > /sys/kernel/mm/transparent_hugepage/enabled

# For latency-sensitive inference, 'madvise' gives more control
echo madvise > /sys/kernel/mm/transparent_hugepage/enabled

# Disable THP compaction which can cause latency spikes
echo defer+madvise > /sys/kernel/mm/transparent_hugepage/defrag

THP can cause latency spikes during defragmentation (when the kernel tries to consolidate 512 contiguous 4 KB pages into one 2 MB page). For inference servers with strict latency SLAs, disable THP defragmentation or use explicit huge page pre-allocation via nr_hugepages.
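
To check how much anonymous memory THP has actually promoted, the kernel exposes AnonHugePages counters both system-wide and per mapping. A small sketch (assuming a Linux host):

import os

def thp_usage(pid: int = None) -> dict:
    """Report system-wide and per-process anonymous huge page usage in MB."""
    result = {}
    with open("/proc/meminfo") as f:                 # system-wide counter
        for line in f:
            if line.startswith("AnonHugePages:"):
                result["system_anon_huge_mb"] = int(line.split()[1]) / 1024
    pid = pid or os.getpid()
    total_kb = 0
    with open(f"/proc/{pid}/smaps") as f:            # per-mapping counters
        for line in f:
            if line.startswith("AnonHugePages:"):
                total_kb += int(line.split()[1])
    result["process_anon_huge_mb"] = total_kb / 1024
    return result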


mlock and Page Pinning

mlock() prevents the kernel from swapping pages to disk. Critical for latency-sensitive inference: without mlock, the OS can swap model weights to disk under memory pressure, causing 100ms+ latency spikes on the next request.

import ctypes
import ctypes.util
import numpy as np

def mlock_array(arr: np.ndarray) -> bool:
    """
    Lock a NumPy array in physical memory - prevent swap.
    Requires either the CAP_IPC_LOCK capability or ulimit -l unlimited.

    ulimit -l unlimited   # in the shell, or set via /etc/security/limits.conf
    """
    libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)

    addr = arr.ctypes.data
    length = arr.nbytes

    ret = libc.mlock(ctypes.c_void_p(addr), ctypes.c_size_t(length))
    if ret != 0:
        errno = ctypes.get_errno()
        import errno as errno_module
        print(f"mlock failed: {errno_module.errorcode.get(errno, errno)}")
        print("Try: ulimit -l unlimited")
        return False

    print(f"Locked {length / 1e6:.1f} MB in physical memory")
    return True

def munlock_array(arr: np.ndarray) -> None:
    """Unlock previously locked memory."""
    libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
    libc.munlock(ctypes.c_void_p(arr.ctypes.data), ctypes.c_size_t(arr.nbytes))

# Pattern for latency-sensitive inference: load and lock model weights
def load_and_pin_model_weights(model_path: str) -> dict:
    """
    Load model weights into RAM and lock them.
    Subsequent inference requests will never trigger page faults
    on the model weights.
    """
    import torch

    model = torch.load(model_path, map_location="cpu")
    weights = {}
    for name, param in model.named_parameters():
        arr = param.detach().numpy()
        weights[name] = arr
        if not mlock_array(arr):
            print(f"Warning: {name} is not locked, may be swapped")

    return weights

Copy-on-Write After Fork

When fork() creates a child process, Linux does not copy the parent's physical pages immediately. Instead, both parent and child share the same physical pages, marked read-only. When either process writes to a page, the kernel detects the write (via a page fault), copies the physical page, and remaps one process to the new copy. This is copy-on-write (CoW).

For Python multiprocessing: when you fork a Python process that has loaded a large dataset or model into memory, the child inherits all those pages via CoW. If the child only reads the data (common for workers that process batches without modifying the model), no physical memory is copied. The child uses the same physical pages as the parent - significant memory savings.

import multiprocessing as mp
import numpy as np
import os

def child_worker_read_only(shared_array_info: tuple) -> float:
    """
    Worker that only reads from shared data (no CoW copy triggered).
    Memory usage of child = ~0 additional RAM for the shared data.
    """
    shm_name, shape, dtype = shared_array_info
    from multiprocessing import shared_memory
    shm = shared_memory.SharedMemory(name=shm_name)
    arr = np.ndarray(shape, dtype=dtype, buffer=shm.buf)

    # Read-only access - no CoW copy
    result = float(arr.mean())
    shm.close()
    return result

def child_worker_write(shared_array_info: tuple, worker_id: int) -> None:
    """
    Worker that modifies data.
    With fork: triggers CoW copies for modified pages (memory cost)
    With spawn: no inherited data to begin with
    """
    shm_name, shape, dtype = shared_array_info
    from multiprocessing import shared_memory
    shm = shared_memory.SharedMemory(name=shm_name)
    arr = np.ndarray(shape, dtype=dtype, buffer=shm.buf)

    # Writes via shared_memory do NOT trigger CoW because shared_memory
    # uses MAP_SHARED. Both processes see each other's writes.
    arr[worker_id::4] *= 2.0   # modify every 4th row
    shm.close()

def get_rss_mb() -> float:
    """Get resident set size in MB for the current process."""
    with open(f"/proc/{os.getpid()}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1]) / 1024
    return 0.0

# Demo: fork-based CoW with a large dataset
def cow_demo():
    # Create a large array in the parent
    data = np.random.randn(10_000_000).astype(np.float32)
    # 40 MB - parent and child share these pages initially

    print(f"Parent PID: {os.getpid()}")
    print(f"Parent memory before fork: {get_rss_mb():.0f} MB")

    pid = os.fork()
    if pid == 0:
        # Child: read-only access triggers no CoW copies
        _ = data.sum()   # read - no copy
        print(f"Child memory (read-only): {get_rss_mb():.0f} MB")
        os._exit(0)
    else:
        os.waitpid(pid, 0)

OOM Killer and Memory Overcommit

Linux's default memory model allows overcommit: processes can allocate more virtual memory than physical RAM + swap combined. The kernel optimistically grants allocations, expecting that not all memory will be used simultaneously (due to demand paging and CoW).

When actual memory usage exceeds available RAM + swap, the OOM (Out Of Memory) killer activates. It selects a process to kill based on a score computed from memory usage, priority, and other factors.

import os

def get_oom_score(pid: int = None) -> int:
    """
    Read the OOM killer score for a process.
    Higher score = more likely to be killed.
    The score is based on memory usage, adjusted by oom_score_adj.
    """
    if pid is None:
        pid = os.getpid()
    with open(f"/proc/{pid}/oom_score") as f:
        return int(f.read().strip())

def get_oom_adj(pid: int = None) -> int:
    """
    Read the OOM score adjustment for a process.
    Range: -1000 (never kill) to +1000 (kill first).
    """
    if pid is None:
        pid = os.getpid()
    with open(f"/proc/{pid}/oom_score_adj") as f:
        return int(f.read().strip())

def protect_from_oom_killer(pid: int = None) -> None:
    """
    Adjust the OOM score to protect a critical process (e.g., a training job).
    -1000 = completely immune from the OOM killer.
    Use carefully - if this process leaks memory, the system will hang.
    """
    if pid is None:
        pid = os.getpid()
    # Requires root or CAP_SYS_RESOURCE
    try:
        with open(f"/proc/{pid}/oom_score_adj", "w") as f:
            f.write("-500\n")   # significantly reduce kill probability
        print(f"Set oom_score_adj to -500 for PID {pid}")
    except PermissionError:
        print("Need root to set oom_score_adj")

def check_overcommit_policy() -> dict:
    """Check the system's memory overcommit policy."""
    with open("/proc/sys/vm/overcommit_memory") as f:
        policy = int(f.read().strip())

    with open("/proc/sys/vm/overcommit_ratio") as f:
        ratio = int(f.read().strip())

    policies = {
        0: "heuristic (default) - allows most overcommit",
        1: "always overcommit - never fail allocations",
        2: "never overcommit - strict limit at overcommit_ratio% of RAM",
    }

    return {
        "policy": policy,
        "policy_name": policies.get(policy, "unknown"),
        "overcommit_ratio": ratio,
    }

Overcommit Policy Recommendations for ML

# Check current settings
cat /proc/sys/vm/overcommit_memory # 0 = heuristic, 1 = always, 2 = strict
cat /proc/sys/vm/overcommit_ratio # percentage of RAM to allow

# For training on dedicated GPU machines:
# Use policy 1 (always) to allow PyTorch to make large reservations
# without allocation failures from aggressive heuristics
sudo sysctl -w vm.overcommit_memory=1

# For production inference serving where OOM = service outage:
# Use policy 2 with a ratio that leaves headroom for OS
sudo sysctl -w vm.overcommit_memory=2
sudo sysctl -w vm.overcommit_ratio=80 # allow up to 80% of RAM

# Monitor if OOM killer has triggered
dmesg | grep -i "oom\|killed process" | tail -20
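
Under policy 2 it is worth watching how close the system sits to its commit limit; the kernel exposes CommitLimit and Committed_AS in /proc/meminfo. A small sketch (assuming a Linux host):

def commit_headroom_gb() -> dict:
    """Compare committed virtual memory against the kernel's commit limit."""
    meminfo = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, _, value = line.partition(":")
            meminfo[key] = int(value.split()[0])   # values are reported in kB
    limit_gb = meminfo["CommitLimit"] / 1024**2
    committed_gb = meminfo["Committed_AS"] / 1024**2
    return {
        "commit_limit_gb": round(limit_gb, 1),
        "committed_gb": round(committed_gb, 1),
        "headroom_gb": round(limit_gb - committed_gb, 1),   # negative = overcommitted
    }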

PyTorch Memory Allocation and CUDA Pinned Memory

PyTorch uses custom caching allocators rather than plain malloc/free on both CPU and GPU (the CUDA side is the CUDACachingAllocator). These maintain free lists of previously allocated blocks instead of returning memory to the OS on every free(), which avoids repeated mmap/munmap system calls and TLB shootdowns.

import torch

def demo_cuda_pinned_memory():
    """
    Pinned (page-locked) memory on the CPU enables faster H2D/D2H transfers
    because the CUDA DMA engine can transfer directly without copying
    through a staging buffer.
    """
    batch_size = 256
    feature_dim = 1024

    # Regular CPU memory (pageable) - slower GPU transfer
    regular_tensor = torch.randn(batch_size, feature_dim)

    # Pinned (page-locked) CPU memory - faster GPU transfer
    # .pin_memory() allocates page-locked host memory via the CUDA driver
    pinned_tensor = torch.randn(batch_size, feature_dim).pin_memory()

    if torch.cuda.is_available():
        # non_blocking=True: initiates the DMA transfer without waiting
        # Only effective with pinned memory - with pageable memory the copy must block
        gpu_tensor = pinned_tensor.to("cuda", non_blocking=True)

        # Do CPU work while the transfer happens asynchronously
        cpu_result = regular_tensor.mean()   # runs concurrently with the H2D copy

        torch.cuda.synchronize()   # wait for the transfer to complete
        print(f"GPU tensor device: {gpu_tensor.device}")

def pytorch_memory_stats():
    """Monitor PyTorch's CUDA memory allocator."""
    if not torch.cuda.is_available():
        return

    # Allocate some tensors
    a = torch.randn(1000, 1000, device="cuda")
    b = torch.randn(1000, 1000, device="cuda")

    allocated = torch.cuda.memory_allocated() / 1e9
    reserved = torch.cuda.memory_reserved() / 1e9   # includes the allocator cache
    max_allocated = torch.cuda.max_memory_allocated() / 1e9

    print(f"Allocated: {allocated:.2f} GB")
    print(f"Reserved (allocator cache): {reserved:.2f} GB")
    print(f"Max allocated: {max_allocated:.2f} GB")

    # Clear the allocator cache (returns memory to the CUDA runtime)
    del a, b
    torch.cuda.empty_cache()

    print(f"After empty_cache - Reserved: {torch.cuda.memory_reserved()/1e9:.2f} GB")
    # Note: empty_cache() does NOT reduce memory_allocated()
    # It only reduces memory_reserved() by returning cached free blocks

Monitoring Page Faults with perf

# Count page faults for a Python training job
perf stat -e page-faults,minor-faults,major-faults python train.py

# Sample page faults to find which code path causes them
perf record -e page-faults:u -g python train.py
perf report --stdio

# Watch page fault rate in real time
watch -n 1 "cat /proc/$(pgrep -f train.py)/stat | awk '{print \"minor:\", \$10, \"major:\", \$12}'"

# Find processes with highest page fault rates
for pid in $(ls /proc | grep '^[0-9]'); do
if [ -f "/proc/$pid/stat" ]; then
awk -v pid=$pid '{print pid, "minor:", $10, "major:", $12}' /proc/$pid/stat
fi
done | sort -k4 -rn | head -10

Virtual Memory Architecture for ML


Production Engineering Notes

Memory Tuning for Training Jobs

# Disable swap for training on dedicated machines
# (swap during training = 10-100x slowdown)
sudo swapoff -a

# Tune page cache writeback (dirty pages)
# dirty_ratio: start writing when dirty pages exceed X% of RAM
# dirty_background_ratio: background writes start at X% (gentler)
sudo sysctl -w vm.dirty_ratio=5
sudo sysctl -w vm.dirty_background_ratio=2

# Tune vm.swappiness (0 = avoid swap, 100 = swap aggressively)
# For training jobs: keep at 0 or 1
sudo sysctl -w vm.swappiness=0

# Pre-allocate huge pages for stable performance
# Each huge page = 2 MB. For 16 GB of huge pages:
sudo sysctl -w vm.nr_hugepages=8192

# Check huge page usage
grep -i hugepages /proc/meminfo

/proc/PID/smaps for Detailed Memory Analysis

import os
import re
from typing import Dict

def parse_smaps(pid: int = None) -> Dict[str, dict]:
    """
    Parse /proc/PID/smaps for detailed per-mapping memory info.
    More detailed than /proc/PID/maps - includes RSS, PSS, and swap per region.
    """
    if pid is None:
        pid = os.getpid()

    regions = {}
    current_region = None
    current_data = {}

    with open(f"/proc/{pid}/smaps", "r") as f:
        for line in f:
            # New region header
            if re.match(r'^[0-9a-f]+-[0-9a-f]+', line):
                if current_region:
                    regions[current_region] = current_data
                current_region = line.strip()
                current_data = {}
            elif ':' in line:
                key, _, value = line.partition(':')
                value = value.strip()
                if value.endswith(' kB'):
                    try:
                        current_data[key.strip()] = int(value[:-3])
                    except ValueError:
                        current_data[key.strip()] = value
                else:
                    current_data[key.strip()] = value

    if current_region:
        regions[current_region] = current_data

    return regions

def total_pss_mb(pid: int = None) -> float:
    """
    PSS (Proportional Set Size) is the fairest measure of a process's
    actual memory cost: shared pages are divided among all processes
    that share them.

    RSS counts shared pages fully for each process (double-counts).
    PSS counts shared pages / number_of_sharers.
    """
    smaps = parse_smaps(pid)
    total_pss = sum(
        region.get("Pss", 0)
        for region in smaps.values()
    )
    return total_pss / 1024   # convert kB to MB
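
Comparing RSS and PSS side by side makes the double-counting visible, for example across a pool of fork'd DataLoader workers. A small usage sketch (the PIDs here are placeholders):

# Compare RSS and PSS for a set of worker processes (PIDs are illustrative)
worker_pids = [4321, 4322, 4323, 4324]   # hypothetical DataLoader worker PIDs
for pid in worker_pids:
    smaps = parse_smaps(pid)
    rss_mb = sum(r.get("Rss", 0) for r in smaps.values()) / 1024
    pss_mb = total_pss_mb(pid)
    # RSS >> PSS indicates heavy sharing (CoW pages still shared with the parent)
    print(f"PID {pid}: RSS={rss_mb:.0f} MB  PSS={pss_mb:.0f} MB")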

:::danger Fatal Memory Mistakes in ML Systems

Loading a 70B model without warming up page faults before serving. When you deploy an inference server, the first request must traverse every page of the model weights (major page faults from disk). This can cause 10-60 second latency spikes. Always run a warmup request after loading the model and before marking the server ready in your load balancer. For extra protection, call mlock() on the model weights after the warmup to prevent future swap.

Using the fork() start method with PyTorch and expecting memory efficiency. After a fork(), the Python process's entire heap - including all loaded tensor data - is theoretically shared via CoW. But Python's reference counting means the interpreter constantly increments and decrements ob_refcnt on shared objects, and each reference count change dirties the page and triggers a CoW copy. A 32 GB model shared between 8 fork'd workers ends up using 32 GB × 8 = 256 GB of RAM, not the 32 GB you expected. Use spawn or shared_memory for true sharing.

Setting vm.overcommit_memory=1 on shared inference serving infrastructure. This setting allows processes to allocate unlimited virtual memory. A single runaway process can allocate all virtual memory and trigger OOM for every other service on the machine. On training-only dedicated machines, policy 1 is acceptable. On shared serving infrastructure, use policy 2 with a conservative overcommit ratio.

:::

:::warning Transparent Huge Pages and Latency Spikes

THP defragmentation can cause multi-millisecond latency spikes in inference servers. The kernel pauses the process while it compacts memory to create 2 MB contiguous regions. For latency-sensitive serving:

# Defer THP defragmentation to background (reduces latency spikes)
echo defer+madvise > /sys/kernel/mm/transparent_hugepage/defrag

# Or disable THP entirely and use explicit huge pages
echo never > /sys/kernel/mm/transparent_hugepage/enabled

# Then allocate huge pages explicitly for the model weights
sudo sysctl -w vm.nr_hugepages=8192 # 8192 * 2MB = 16 GB

Pre-allocated huge pages (nr_hugepages) are never defragmented - they are reserved up front. These are the safest option for latency-sensitive ML serving.

:::


Interview Questions and Answers

Q1: Explain the difference between minor and major page faults. When does each occur in a PyTorch training job?

A minor page fault occurs when the page is already in physical memory but not yet mapped in the current process's page table. No disk I/O is required. Examples: first access to a newly allocated tensor (demand paging), a fork'd process accessing a CoW page for the first time (kernel maps the existing page), or stack growth. Cost: 1-5 microseconds.

A major page fault occurs when the page is not in physical memory at all and must be fetched from disk. Examples: first access to a memory-mapped dataset file, accessing swapped-out pages, or loading a model checkpoint via mmap. Cost: 0.1-10 milliseconds depending on storage.

In a PyTorch training job: model loading (first pass) generates major faults if reading from disk via mmap. Dataset prefetching generates major faults if the dataset exceeds page cache size. Subsequent epochs generate minor or no faults if the data fits in page cache. This is why epoch 1 is always slower - it pays the major fault tax. Monitoring with perf stat -e major-faults quantifies this cost.

Q2: What is the TLB, and why do huge pages matter for large model inference?

The TLB (Translation Lookaside Buffer) is a CPU-internal cache for virtual-to-physical address translations. It holds 64-128 entries for 4 KB pages. A TLB miss requires a full 4-level page table walk (4 memory accesses, 200-400 cycles). A TLB hit costs 1-2 cycles.

For large model inference: a 70B parameter model in fp16 uses $70 \times 10^9 \times 2 \text{ bytes} = 140 \text{ GB}$ of RAM. With 4 KB pages, covering the full model requires $\frac{140 \text{ GB}}{4 \text{ KB}} = 35 \text{ million}$ page table entries. The TLB can only hold 64-128 of these at once. During a forward pass that accesses weights non-sequentially, TLB misses dominate memory access latency.

With 2 MB huge pages, the same 140 GB model needs only $\frac{140 \text{ GB}}{2 \text{ MB}} \approx 70{,}000$ TLB entries - still more than the TLB holds, but the TLB miss rate drops proportionally to the page size ratio (512x fewer misses per unit of memory accessed in a sequential scan). For transformer attention patterns that access different parts of the weight matrix, the effective TLB miss reduction is 5-50x depending on the access pattern.

Q3: A training job is being killed by the OOM killer but your GPU memory monitor shows only 20 GB used on a 40 GB GPU. What is probably happening and how do you investigate?

The OOM killer operates on CPU RAM, not GPU VRAM. The 20/40 GB reported by nvidia-smi is irrelevant here; the problem is that CPU RAM is exhausted.

Common causes: (1) DataLoader workers with num_workers=8 each loaded a copy of the dataset into CPU memory (8x RAM multiplication). (2) PyTorch's CUDA caching allocator reserved CPU pinned memory buffers. (3) Python's garbage collector has not freed tensors that were moved to GPU (the Python object wrapper stays in CPU RAM even after .cuda()). (4) Gradient accumulation created large intermediate tensors.

Investigation steps:

# Watch RSS over time
watch -n 1 "ps aux | grep python | awk '{sum += \$6} END {print sum/1024 \" MB\"}'"

# Check DataLoader worker memory
ls /proc | xargs -I{} sh -c "[ -f /proc/{}/cmdline ] && grep -l DataLoader /proc/{}/cmdline" 2>/dev/null | while read f; do pid=$(echo $f | cut -d/ -f3); awk '/VmRSS/{print pid, $2}' pid=$pid /proc/$pid/status; done

# Python memory profiling
# pip install memory_profiler
# @profile decorator on training loop

Fix: reduce num_workers, use pin_memory=False to avoid pinned memory buffers, explicitly delete intermediate tensors and call torch.cuda.empty_cache().

Q4: You have a Python multiprocessing DataLoader with 16 workers using fork start method. Memory usage is 16x higher than expected. Explain why and how to fix it.

With fork, each worker process is a copy of the parent's address space, sharing pages via copy-on-write. In theory, read-only data (like the dataset) should share physical pages with zero duplication. In practice, Python's reference counting destroys this.

When Python accesses any object, it increments that object's ob_refcnt field. This is a write to the page containing the Python object header. This write triggers a CoW copy of that 4 KB page - the physical page is no longer shared between parent and child. For a dataset stored as a list of Python strings or dicts, every element access writes to a page and triggers a copy. The entire dataset ends up duplicated 16 times.

Fix options:

  1. Use spawn start method - workers start fresh, only import what they need, no CoW footprint.
  2. Store the dataset in numpy arrays and use multiprocessing.shared_memory for true zero-copy sharing (numpy does not use Python reference counting for array data itself).
  3. Use memory-mapped files via mmap or HuggingFace datasets (backed by Apache Arrow, which uses flat memory layouts without Python object overhead).
  4. For the dataset index/metadata, store in a multiprocessing.Array (backed by shared memory, not CoW).
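
As a concrete illustration of options 1 and 2, here is a sketch of a DataLoader setup that avoids the CoW blow-up; the dataset class, array sizes, and worker count are hypothetical:

import numpy as np
import torch
from multiprocessing import shared_memory
from torch.utils.data import DataLoader, Dataset

class SharedArrayDataset(Dataset):
    """Dataset whose samples live in a shared_memory block (no per-worker copy)."""
    def __init__(self, shm_name: str, shape: tuple, dtype=np.float32):
        self.shm_name, self.shape, self.dtype = shm_name, shape, dtype
        self._shm = None   # attached lazily, once per worker process

    def _array(self) -> np.ndarray:
        if self._shm is None:
            self._shm = shared_memory.SharedMemory(name=self.shm_name)
        return np.ndarray(self.shape, dtype=self.dtype, buffer=self._shm.buf)

    def __len__(self):
        return self.shape[0]

    def __getitem__(self, idx):
        return torch.from_numpy(self._array()[idx].copy())

if __name__ == "__main__":
    # Parent: place the dataset in shared memory once
    source = np.random.randn(100_000, 512).astype(np.float32)
    shm = shared_memory.SharedMemory(create=True, size=source.nbytes)
    np.ndarray(source.shape, dtype=source.dtype, buffer=shm.buf)[:] = source

    loader = DataLoader(
        SharedArrayDataset(shm.name, source.shape),
        batch_size=64,
        num_workers=4,
        multiprocessing_context="spawn",   # workers start fresh: no CoW footprint
        persistent_workers=True,           # avoid re-spawning workers every epoch
    )
    # ... train ...
    # Remember to shm.close() and shm.unlink() when the job is done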

Q5: What does mlock() do and when should you use it for ML inference?

mlock() is a syscall that marks virtual memory pages as un-swappable. The kernel guarantees those pages will stay in physical RAM regardless of memory pressure. Once a page is mlock'd, it will never generate a major page fault.

Use mlock() for ML inference when:

  1. The model weights are too large to fit in L3 cache but small enough to fit in RAM, and your SLA requires consistent sub-10ms latency. Without mlock, a spike in system memory usage (another process, log rotation, etc.) can cause the OS to swap model pages to disk, leading to 100ms+ latency spikes on the next request.

  2. You have a real-time inference component (e.g., fraud detection that must respond in under 5ms). The first inference after any memory pressure event would fail the SLA without mlock.

The trade-off: mlock'd pages cannot be reclaimed. Over-locking memory on a shared server starves other processes. Requires the CAP_IPC_LOCK capability or RLIMIT_MEMLOCK set high enough. Always mlock only what you need: the model weights, not the entire process address space.

Q6: Explain TLB shootdown. When does it occur in a PyTorch training job, and how can it be minimized?

TLB shootdown occurs when one CPU core modifies a page table entry (e.g., via munmap or permission change) and must invalidate the cached translations on all other CPU cores. The OS sends an inter-processor interrupt (IPI) to every other core, which must stop executing, flush its TLB for the affected range, and signal back. This is a synchronized pause across all cores.

In PyTorch training, TLB shootdowns occur during: (1) tensor allocation and deallocation, because PyTorch's CPU allocator uses mmap(MAP_ANONYMOUS) for large allocations and munmap on free; (2) model gradient accumulation loops that allocate intermediate tensors each step; (3) DataLoader worker spawning and teardown which creates and destroys large process address spaces.

Minimization strategies: (1) Use huge pages - each TLB entry covers 512x more memory, so far fewer munmap calls are needed to free a given amount of memory. (2) PyTorch's caching allocator already reduces allocator churn by reusing freed blocks; avoid calling torch.cuda.empty_cache() in the training loop. (3) Use persistent_workers=True in DataLoader to avoid repeatedly creating and destroying worker processes. (4) Profile with perf stat -e dTLB-load-misses,dTLB-store-misses to quantify the actual cost.
