Module 2: Operating Systems for ML
Every Python process you start, every file you read, and every network call you make goes through the operating system kernel. The kernel manages memory, schedules CPU time, handles I/O, and enforces process isolation. Most ML engineers ignore this layer until a production server starts swapping to disk, a data loader becomes the bottleneck, or a multi-process training job mysteriously slows down because of NUMA topology.
Understanding the OS layer turns these from mysterious problems into solvable ones.
Where the OS Shows Up in ML
Data loading bottlenecks. PyTorch's DataLoader uses multiple worker processes to prefetch data. These workers are OS processes, scheduled by the Linux CFS scheduler. When you have 8 GPU workers competing for CPU time with 16 data loader workers, understanding process scheduling helps you configure worker counts correctly.
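As a rough illustration, here is one way to size `num_workers` from the CPUs actually available to the process (a Linux-only heuristic, not the official PyTorch recommendation). `MyDataset` and the `my_project.data` import are hypothetical placeholders for your own dataset class:

```python
import os
import torch
from torch.utils.data import DataLoader

from my_project.data import MyDataset  # hypothetical dataset module

# CPUs actually available to this process (respects cgroup/affinity limits),
# not just the machine's total core count. Linux-only.
available_cpus = len(os.sched_getaffinity(0))

# Leave one CPU per GPU process free for the training loop itself;
# the remainder can be spent on data-loading worker processes.
num_gpus = torch.cuda.device_count() or 1
num_workers = max(1, (available_cpus - num_gpus) // num_gpus)

loader = DataLoader(
    MyDataset(),
    batch_size=64,
    num_workers=num_workers,   # each worker is a separate OS process
    pin_memory=True,           # page-locked host memory for faster host-to-device copies
    persistent_workers=True,   # keep worker processes alive between epochs
    prefetch_factor=2,         # batches prefetched per worker
)
```

The point of the heuristic is simply to avoid oversubscribing the CPU: once worker processes outnumber free cores, the scheduler time-slices them and prefetching stops hiding I/O latency.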
Memory-mapped datasets. Training on a 500GB text dataset does not require loading it all into RAM. Memory-mapped files (mmap) let the OS page data in on demand. The Hugging Face datasets library memory-maps its on-disk data by default. Understanding how virtual memory and page faults work explains why dataset loading speed depends on the working set size, not the total dataset size.
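A minimal sketch of the idea using `numpy.memmap`, assuming a pre-tokenized dataset stored as a flat binary file of uint16 token IDs at the hypothetical path `train_tokens.bin`:

```python
import numpy as np

# np.memmap maps the file into the process's virtual address space; no data is
# read until a page is actually touched, at which point the kernel services a
# page fault and pulls that page in from disk (and keeps it in the page cache).
tokens = np.memmap("train_tokens.bin", dtype=np.uint16, mode="r")

print(f"dataset size: {tokens.nbytes / 1e9:.1f} GB")  # cheap: only file metadata

# Slicing touches only the pages backing this range, so a 500GB file works
# fine on a 32GB machine as long as the *working set* fits in RAM.
seq_len = 2048
batch = np.asarray(tokens[0:seq_len], dtype=np.int64)
```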
Huge pages. By default, Linux uses 4KB memory pages. For ML workloads that allocate large tensors, this means large page tables: a single 1GB tensor needs roughly 260,000 page table entries. Huge pages (2MB or 1GB) reduce page table overhead and TLB pressure. Enabling transparent huge pages (THP) can improve training throughput by 5-15% on CPU-heavy workloads.
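A small sketch of how a process can check the system-wide THP setting and opt a single anonymous mapping into huge pages. This is Linux-only and needs Python 3.8+ for `mmap.madvise`; the 1GiB size is an arbitrary example:

```python
import mmap

# Check the system-wide transparent huge page policy
# (typical values: "always", "madvise", or "never").
with open("/sys/kernel/mm/transparent_hugepage/enabled") as f:
    print(f.read().strip())   # e.g. "always [madvise] never"

# When THP is in "madvise" mode, a process can opt individual mappings in.
# Allocate a 1GiB anonymous mapping and ask the kernel to back it with 2MB pages.
size = 1 << 30
buf = mmap.mmap(-1, size)                  # anonymous mapping
buf.madvise(mmap.MADV_HUGEPAGE)            # hint only; the kernel may still use 4KB pages
```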
OS Abstraction Layers
Your training script sits on several layers: Python and its native libraries, the system call interface, the kernel, and finally the hardware. Each lesson in this module ties a common ML performance problem back to one of these layers.
Lessons in This Module
| # | Lesson | Key Concept |
|---|---|---|
| 1 | Process vs Thread vs Coroutine | OS process model, GIL implications, when to use each |
| 2 | Virtual Memory and Page Tables | Address translation, page faults, virtual address space |
| 3 | Memory-Mapped Files for Large Datasets | mmap system call, page faults as lazy loading |
| 4 | Linux Scheduler and CPU Affinity | CFS scheduler, taskset, NUMA-aware process placement |
| 5 | Huge Pages and Transparent Huge Pages | 2MB vs 4KB pages, THP, enabling for training workloads |
| 6 | Kernel Bypass and DPDK | RDMA, kernel bypass networking, high-throughput data ingestion |
| 7 | System Calls and Python Overhead | strace, syscall cost, minimizing kernel transitions |
| 8 | OS Tuning for Training Servers | /proc/sys tuning, vm.swappiness, io scheduler settings |
Key Concepts You Will Master
- Virtual memory layout - understanding what the kernel does when your process runs out of memory and gets OOM-killed
- mmap for datasets - how to load arbitrarily large datasets without running out of RAM
- Process vs thread tradeoffs - why PyTorch's DataLoader uses worker processes rather than threads (the GIL)
- Huge page configuration - enabling transparent huge pages to reduce TLB pressure
- OS tuning parameters - the specific kernel settings that matter when configuring an ML training server (a short read-only sketch follows this list)
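As a starting point for the last item, this sketch reads a few of the /proc/sys knobs that lesson 8 covers. The selection is illustrative rather than exhaustive, and changing any of them requires root:

```python
import pathlib

# A few /proc/sys knobs that commonly matter on training servers.
# Reading them is harmless; changing them (e.g. `sysctl -w vm.swappiness=10`
# as root) should be validated against your own workload first.
KNOBS = {
    "vm.swappiness": "/proc/sys/vm/swappiness",                # how eagerly the kernel swaps (lower = less swapping)
    "vm.dirty_ratio": "/proc/sys/vm/dirty_ratio",              # % of RAM dirty pages may reach before writers block on writeback
    "vm.overcommit_memory": "/proc/sys/vm/overcommit_memory",  # memory overcommit policy
}

for name, path in KNOBS.items():
    value = pathlib.Path(path).read_text().strip()
    print(f"{name} = {value}")
```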
Prerequisites
- Computer Architecture
- Basic Linux command line
