Module 2: Operating Systems for ML

Every Python process you start, every file you read, every network call you make goes through the operating system kernel. The kernel manages memory, schedules CPU time, handles I/O, and enforces process isolation. Most ML engineers ignore this layer until a production server starts swapping to disk, a data loader becomes the bottleneck, or a multi-process training job suffers mysterious performance degradation from its NUMA topology.

Understanding the OS layer turns these from mysterious problems into solvable ones.

Where the OS Shows Up in ML

Data loading bottlenecks. PyTorch's DataLoader uses multiple worker processes to prefetch data. These workers are OS processes, scheduled by the Linux CFS scheduler. When you have 8 GPU workers competing for CPU time with 16 data loader workers, understanding process scheduling helps you configure worker counts correctly.
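
One way to make "configure worker counts correctly" concrete is a small sizing heuristic. The sketch below is my own rule of thumb, not an official PyTorch recommendation, and `suggest_num_workers` is a hypothetical helper: it sizes workers from the CPUs the scheduler actually grants the process, reserving one core per GPU trainer.

```python
import os

def suggest_num_workers(num_gpus: int) -> int:
    """Heuristic (an assumption, not a PyTorch rule): split the CPUs the
    kernel lets this process use across GPU trainer processes, keeping
    one core per trainer free for the training loop itself."""
    if hasattr(os, "sched_getaffinity"):       # honors taskset/cgroup limits
        cpus = len(os.sched_getaffinity(0))
    else:                                      # portable fallback off Linux
        cpus = os.cpu_count() or 1
    return max(0, cpus // max(1, num_gpus) - 1)

# Usage: DataLoader(dataset, num_workers=suggest_num_workers(num_gpus=8))
print(suggest_num_workers(8))
```

Using `sched_getaffinity` rather than `cpu_count` matters in containers, where cgroup limits often grant far fewer cores than the machine physically has.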

Memory-mapped datasets. Training on a 500GB text dataset does not require loading it all into RAM. Memory-mapped files (mmap) let the OS page data in as needed. HuggingFace datasets use this by default. Understanding how virtual memory and page faults work explains why dataset loading speed depends on working set size, not dataset size.
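
A minimal stdlib sketch of mmap-as-lazy-loading: map a file, slice into the middle, and only the pages that slice touches are faulted in from disk. The file here is a small stand-in for a real dataset.

```python
import mmap
import os
import tempfile

# Create a stand-in "dataset" file (16 MiB of filler bytes).
path = os.path.join(tempfile.mkdtemp(), "dataset.bin")
with open(path, "wb") as f:
    f.write(b"x" * (16 * 1024 * 1024))

# Map it read-only: no data is read into RAM yet.
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    sample = mm[8_000_000 : 8_000_128]  # page faults pull in ~one 4KB page
    mm.close()

print(len(sample))  # 128
```

This is why loading speed tracks the working set: slicing 128 bytes out of a mapped file costs a page fault or two, regardless of whether the file is 16 MB or 500 GB.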

Huge pages. By default, Linux uses 4KB memory pages. For ML workloads that allocate large tensors, this means thousands of page table entries per allocation. Huge pages (2MB or 1GB) reduce page table overhead and TLB pressure. Enabling transparent huge pages can improve training throughput by 5-15% on CPU-heavy workloads.
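
The huge-page hint can be issued from Python. A sketch, assuming Linux and Python 3.8+ (where `mmap.madvise` and the `MADV_HUGEPAGE` flag exist): allocate a large anonymous mapping and ask the kernel to back it with 2MB transparent huge pages.

```python
import mmap

# 64 MiB anonymous mapping - think of it as a large tensor buffer.
buf = mmap.mmap(-1, 64 * 1024 * 1024)

# madvise flags are Linux-only; guard so the sketch degrades gracefully.
if hasattr(mmap, "MADV_HUGEPAGE"):
    buf.madvise(mmap.MADV_HUGEPAGE)  # hint: back this region with THP

buf[:8] = b"tensor!!"                # touching pages faults them in
```

With 4KB pages this region needs 16,384 page table entries; with 2MB huge pages, 32.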

OS Abstraction Layers

Lessons in This Module

  1. Process vs Thread vs Coroutine - OS process model, GIL implications, when to use each
  2. Virtual Memory and Page Tables - address translation, page faults, virtual address space
  3. Memory-Mapped Files for Large Datasets - the mmap system call, page faults as lazy loading
  4. Linux Scheduler and CPU Affinity - CFS scheduler, taskset, NUMA-aware process placement
  5. Huge Pages and Transparent Huge Pages - 2MB vs 4KB pages, THP, enabling for training workloads
  6. Kernel Bypass and DPDK - RDMA, kernel-bypass networking, high-throughput data ingestion
  7. System Calls and Python Overhead - strace, syscall cost, minimizing kernel transitions
  8. OS Tuning for Training Servers - /proc/sys tuning, vm.swappiness, I/O scheduler settings
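
As a preview of the scheduler and affinity material, taskset-style pinning is available from the Python stdlib. A sketch, assuming Linux (these `os` calls do not exist elsewhere): restrict a process to a subset of cores, the way you would pin data-loader workers to one NUMA node's cores so they stop migrating across sockets.

```python
import os

if hasattr(os, "sched_setaffinity"):              # Linux-only APIs
    allowed = sorted(os.sched_getaffinity(0))     # cores we may use now
    pinned = allowed[: max(1, len(allowed) // 2)] # e.g. one socket's cores
    os.sched_setaffinity(0, pinned)               # pin, like `taskset -cp`
    print(os.sched_getaffinity(0) == set(pinned))
    os.sched_setaffinity(0, allowed)              # restore the original mask
```

Starting from `sched_getaffinity(0)` rather than `range(os.cpu_count())` matters: the current mask already reflects any taskset or cgroup restriction the process inherited.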

Key Concepts You Will Master

  • Virtual memory layout - understanding what the kernel does when your process OOMs
  • mmap for datasets - how to load arbitrarily large datasets without running out of RAM
  • Process vs thread tradeoffs - why PyTorch DataLoader uses processes (GIL), not threads
  • Huge page configuration - enabling transparent huge pages to reduce TLB pressure
  • OS tuning parameters - the specific kernel settings that matter for ML training server configuration
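
The tuning parameters in the last bullet are ordinary files under /proc/sys, so they can be inspected without any tooling. A sketch, assuming Linux (`read_sysctl` is a hypothetical helper; each tunable is a small text file):

```python
import os

def read_sysctl(name: str) -> str:
    """Read a kernel tunable, e.g. 'vm.swappiness' -> /proc/sys/vm/swappiness."""
    path = "/proc/sys/" + name.replace(".", "/")
    with open(path) as f:
        return f.read().strip()

# Guard so the sketch is a no-op on non-Linux systems.
if os.path.exists("/proc/sys/vm/swappiness"):
    print("vm.swappiness =", read_sysctl("vm.swappiness"))
```

Writing these files (as root) changes the setting live, which is what `sysctl -w` does under the hood.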

Prerequisites

© 2026 EngineersOfAI. All rights reserved.