# Module 1: Computer Architecture for ML Engineers
You write Python. Python compiles to bytecode. The bytecode interpreter calls into C libraries. The C libraries issue SIMD instructions. The instructions hit the cache hierarchy. Cache misses go to DRAM. Every one of these layers affects your model's training and inference performance, and most ML engineers have never thought about any of them below the PyTorch level.
This module closes that gap. The goal is not to make you a hardware engineer, but to give you a mental model that makes performance behavior predictable. Once you understand why strided memory access can be roughly 10x slower than sequential access, you make better decisions about data layout in your training pipelines.
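To make the strided-vs-sequential claim concrete, here is a small, illustrative sketch (not from the module itself): both loops below sum the same 16M-element array, but the strided walk touches one element per 32 KB jump, so each access pulls in a fresh cache line that is mostly wasted. The array size and stride are arbitrary choices for illustration.

```python
import time
import numpy as np

N = 1 << 24                        # 16M float64s, ~128 MB: far larger than any L3
a = np.arange(N, dtype=np.float64)

t0 = time.perf_counter()
seq = a.sum()                      # sequential: every byte of each 64-byte line is used
t_seq = time.perf_counter() - t0

stride = 4096                      # jump 32 KB per element: one line fetched per access
t0 = time.perf_counter()
strided = sum(a[i::stride].sum() for i in range(stride))
t_strided = time.perf_counter() - t0

print(f"sequential: {t_seq:.3f}s  strided: {t_strided:.3f}s")
```

Both versions compute the same result; only the order in which memory is touched differs, and on typical hardware that alone produces a large gap.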
## The Memory Hierarchy
The most important concept in computer architecture for ML engineers is the memory hierarchy. Ballpark latencies on a modern server CPU: registers, under 1 nanosecond; L1 cache, ~1 ns; L2, ~4 ns; L3, ~20-40 ns; DRAM, ~100 ns; SSD, ~100 microseconds.
A cache miss that goes all the way to DRAM costs roughly 100x more than an L1 hit. A training loop whose working set fits in L3 can therefore run 10-100x faster than one whose working set does not. This explains why batch size matters, why data layout matters, and why seemingly minor changes to a training loop can cause dramatic performance differences.
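One way to see the hierarchy from Python is to time random accesses into working sets of increasing size; once the working set outgrows a cache level, the cost per access jumps. This is a rough sketch with arbitrarily chosen sizes, and Python-level overhead blurs the absolute numbers, but the trend is usually visible.

```python
import time
import numpy as np

rng = np.random.default_rng(0)

def ns_per_access(n_bytes, accesses=1_000_000):
    """Approximate nanoseconds per random read into an n_bytes working set."""
    n = n_bytes // 8
    data = np.arange(n, dtype=np.float64)   # arange (not zeros) commits real pages
    idx = rng.integers(0, n, size=accesses)
    t0 = time.perf_counter()
    data[idx].sum()                         # gather: dominated by cache/DRAM latency
    return (time.perf_counter() - t0) / accesses * 1e9

for size in (32 * 1024, 1024 * 1024, 64 * 1024 * 1024):   # ~L1, ~L2, beyond L3
    print(f"{size >> 10:>6} KiB working set: {ns_per_access(size):.1f} ns/access")
```

Expect the last line to be noticeably slower than the first; the exact ratios depend on the machine's cache sizes.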
## CPU Architecture Overview
### Lessons in This Module
| # | Lesson | Key Concept |
|---|---|---|
| 1 | CPU Architecture for ML Engineers | Pipeline stages, out-of-order execution, branch prediction |
| 2 | Memory Hierarchy, Caches, and Locality | L1/L2/L3, spatial and temporal locality, prefetching |
| 3 | Instruction Level Parallelism | Superscalar execution, pipelining, dependency stalls |
| 4 | SIMD and Vectorization | AVX/AVX-512, auto-vectorization, NumPy's secret |
| 5 | Branch Prediction and Pipeline Stalls | Misprediction cost, branch-free code patterns |
| 6 | NUMA Architecture | Non-uniform memory access, NUMA-aware allocation |
| 7 | CPU vs GPU Design Philosophy | Latency vs throughput optimization, when CPU wins |
| 8 | Hardware Counters and Performance Analysis | perf, VTune, cache miss rates, IPC measurement |
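Lesson 4's point about "NumPy's secret" can be previewed in a few lines: a NumPy array expression dispatches to compiled, SIMD-vectorized loops, while a Python-level loop pays interpreter overhead per element. The numbers below are illustrative, not a benchmark.

```python
import time
import numpy as np

x = np.linspace(0.0, 1.0, 1_000_000)

t0 = time.perf_counter()
slow = [3.0 * v + 1.0 for v in x]      # interpreted: one element at a time
t_loop = time.perf_counter() - t0

t0 = time.perf_counter()
fast = 3.0 * x + 1.0                    # vectorized: SIMD over the whole buffer
t_vec = time.perf_counter() - t0

print(f"python loop: {t_loop:.4f}s  numpy: {t_vec:.4f}s  "
      f"speedup: {t_loop / t_vec:.0f}x")
```

The speedup you see combines SIMD with the removal of per-element interpreter dispatch; the lesson separates those two effects.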
## Key Concepts You Will Master
- **Cache locality**: writing memory access patterns that stay in L1/L2 and avoid trips to DRAM
- **SIMD vectorization**: how NumPy and PyTorch achieve 4-16x speedups on CPU with vector instructions
- **NUMA topology**: why memory allocation on multi-socket servers affects ML training throughput
- **Hardware performance counters**: measuring what your code is actually doing at the hardware level
- **Branch-free algorithms**: writing prediction-friendly code for data preprocessing loops
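The branch-free idea can be sketched with a common preprocessing task: clipping outliers. This is a hypothetical example, not code from the module. The loop version takes a data-dependent branch per element, which the branch predictor cannot guess on random data; `np.where` evaluates both arms and selects with a mask, so there is no per-element branch at all.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(size=100_000)

def clip_branchy(a, lo=-1.0, hi=1.0):
    """Per-element if/else: one unpredictable branch per value."""
    out = np.empty_like(a)
    for i, v in enumerate(a):
        if v < lo:
            out[i] = lo
        elif v > hi:
            out[i] = hi
        else:
            out[i] = v
    return out

def clip_branchless(a, lo=-1.0, hi=1.0):
    """Mask-select: computes both arms, picks with a mask, no branches."""
    return np.where(a < lo, lo, np.where(a > hi, hi, a))

assert np.allclose(clip_branchy(data), clip_branchless(data))
```

The same trade-off (compute both sides, select without branching) appears at the hardware level as conditional moves and SIMD blend instructions, which lesson 5 covers.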
## Prerequisites
- Basic C/Python programming
- No prior architecture knowledge required
