Module 2: Operating Systems for ML
Every Python process you start, every file you read, and every network call you make goes through the operating system kernel. The kernel manages memory, schedules CPU time, handles I/O, and enforces process isolation. Most ML engineers ignore this layer until a production server starts swapping to disk, a data loader becomes the bottleneck, or a multi-process training job mysteriously slows down because of NUMA topology.
Understanding the OS layer turns these from mysterious problems into solvable ones.
Where the OS Shows Up in ML
Data loading bottlenecks. PyTorch's DataLoader uses multiple worker processes to prefetch data. These workers are OS processes, scheduled by the Linux CFS scheduler. When you have 8 GPU workers competing for CPU time with 16 data loader workers, understanding process scheduling helps you configure worker counts correctly.
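As a rough illustration, here is one way to size `num_workers` from the CPUs actually available to the process (a Linux-only heuristic, not the official PyTorch recommendation). `MyDataset` and the `my_project.data` import are hypothetical placeholders for your own dataset class:

```python
import os
import torch
from torch.utils.data import DataLoader

from my_project.data import MyDataset  # hypothetical dataset module

# CPUs actually available to this process (respects cgroup/affinity limits),
# not just the machine's total core count. Linux-only.
available_cpus = len(os.sched_getaffinity(0))

# Leave one CPU per GPU process free for the training loop itself;
# the remainder can be spent on data-loading worker processes.
num_gpus = torch.cuda.device_count() or 1
num_workers = max(1, (available_cpus - num_gpus) // num_gpus)

loader = DataLoader(
    MyDataset(),
    batch_size=64,
    num_workers=num_workers,   # each worker is a separate OS process
    pin_memory=True,           # page-locked host memory for faster host-to-device copies
    persistent_workers=True,   # keep worker processes alive between epochs
    prefetch_factor=2,         # batches prefetched per worker
)
```

The point of the heuristic is simply to avoid oversubscribing the CPU: once worker processes outnumber free cores, the scheduler time-slices them and prefetching stops hiding I/O latency.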
Memory-mapped datasets. Training on a 500GB text dataset does not require loading it all into RAM. Memory-mapped files (mmap) let the OS page data in on demand. The Hugging Face datasets library memory-maps its on-disk data by default. Understanding how virtual memory and page faults work explains why dataset loading speed depends on the working set size, not the total dataset size.
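A minimal sketch of the idea using `numpy.memmap`, assuming a pre-tokenized dataset stored as a flat binary file of uint16 token IDs at the hypothetical path `train_tokens.bin`:

```python
import numpy as np

# np.memmap maps the file into the process's virtual address space; no data is
# read until a page is actually touched, at which point the kernel services a
# page fault and pulls that page in from disk (and keeps it in the page cache).
tokens = np.memmap("train_tokens.bin", dtype=np.uint16, mode="r")

print(f"dataset size: {tokens.nbytes / 1e9:.1f} GB")  # cheap: only file metadata

# Slicing touches only the pages backing this range, so a 500GB file works
# fine on a 32GB machine as long as the *working set* fits in RAM.
seq_len = 2048
batch = np.asarray(tokens[0:seq_len], dtype=np.int64)
```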
Huge pages. By default, Linux uses 4KB memory pages. For ML workloads that allocate large tensors, this means large page tables: a single 1GB tensor needs roughly 260,000 page table entries. Huge pages (2MB or 1GB) reduce page table overhead and TLB pressure. Enabling transparent huge pages (THP) can improve training throughput by 5-15% on CPU-heavy workloads.
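A small sketch of how a process can check the system-wide THP setting and opt a single anonymous mapping into huge pages. This is Linux-only and needs Python 3.8+ for `mmap.madvise`; the 1GiB size is an arbitrary example:

```python
import mmap

# Check the system-wide transparent huge page policy
# (typical values: "always", "madvise", or "never").
with open("/sys/kernel/mm/transparent_hugepage/enabled") as f:
    print(f.read().strip())   # e.g. "always [madvise] never"

# When THP is in "madvise" mode, a process can opt individual mappings in.
# Allocate a 1GiB anonymous mapping and ask the kernel to back it with 2MB pages.
size = 1 << 30
buf = mmap.mmap(-1, size)                  # anonymous mapping
buf.madvise(mmap.MADV_HUGEPAGE)            # hint only; the kernel may still use 4KB pages
```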
OS Abstraction Layers
Your training script sits on several layers: Python and its native libraries, the system call interface, the kernel, and finally the hardware. Each lesson in this module ties a common ML performance problem back to one of these layers.
Lessons in This Module
| # | Lesson | Key Concept |
|---|---|---|
| 1 | Process vs Thread vs Coroutine | OS process model, GIL implications, when to use each |
| 2 | Virtual Memory and Page Tables | Address translation, page faults, virtual address space |
| 3 | Memory-Mapped Files for Large Datasets | mmap system call, page faults as lazy loading |
| 4 | Linux Scheduler and CPU Affinity | CFS scheduler, taskset, NUMA-aware process placement |
| 5 | Huge Pages and Transparent Huge Pages | 2MB vs 4KB pages, THP, enabling for training workloads |
| 6 | Kernel Bypass and DPDK | RDMA, kernel bypass networking, high-throughput data ingestion |
| 7 | System Calls and Python Overhead | strace, syscall cost, minimizing kernel transitions |
| 8 | OS Tuning for Training Servers | /proc/sys tuning, vm.swappiness, io scheduler settings |
Key Concepts You Will Master
- Virtual memory layout - understanding what the kernel does when your process runs out of memory and gets OOM-killed
- mmap for datasets - how to load arbitrarily large datasets without running out of RAM
- Process vs thread tradeoffs - why PyTorch's DataLoader uses worker processes rather than threads (the GIL)
- Huge page configuration - enabling transparent huge pages to reduce TLB pressure
- OS tuning parameters - the specific kernel settings that matter when configuring an ML training server (a short read-only sketch follows this list)
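As a starting point for the last item, this sketch reads a few of the /proc/sys knobs that lesson 8 covers. The selection is illustrative rather than exhaustive, and changing any of them requires root:

```python
import pathlib

# A few /proc/sys knobs that commonly matter on training servers.
# Reading them is harmless; changing them (e.g. `sysctl -w vm.swappiness=10`
# as root) should be validated against your own workload first.
KNOBS = {
    "vm.swappiness": "/proc/sys/vm/swappiness",                # how eagerly the kernel swaps (lower = less swapping)
    "vm.dirty_ratio": "/proc/sys/vm/dirty_ratio",              # % of RAM dirty pages may reach before writers block on writeback
    "vm.overcommit_memory": "/proc/sys/vm/overcommit_memory",  # memory overcommit policy
}

for name, path in KNOBS.items():
    value = pathlib.Path(path).read_text().strip()
    print(f"{name} = {value}")
```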
Prerequisites
- Computer Architecture
- Basic Linux command line
