# Module 1: Computer Architecture for ML Engineers
You write Python. Python compiles to bytecode. The bytecode interpreter calls into C libraries. The C libraries issue SIMD instructions. The instructions hit the cache hierarchy. Cache misses go to DRAM. Every one of these layers affects your model's training and inference performance, and most ML engineers have never thought about any of them below the PyTorch level.
This module closes that gap. The goal is not to make you a hardware engineer, but to give you a mental model that makes performance behavior predictable. Once you understand why strided memory access can be roughly 10x slower than sequential access, you make better decisions about data layout in your training pipelines.
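To make the strided-vs-sequential claim concrete, here is a small, illustrative sketch (not from the module itself): both loops below sum the same 16M-element array, but the strided walk touches one element per 32 KB jump, so each access pulls in a fresh cache line that is mostly wasted. The array size and stride are arbitrary choices for illustration.

```python
import time
import numpy as np

N = 1 << 24                        # 16M float64s, ~128 MB: far larger than any L3
a = np.arange(N, dtype=np.float64)

t0 = time.perf_counter()
seq = a.sum()                      # sequential: every byte of each 64-byte line is used
t_seq = time.perf_counter() - t0

stride = 4096                      # jump 32 KB per element: one line fetched per access
t0 = time.perf_counter()
strided = sum(a[i::stride].sum() for i in range(stride))
t_strided = time.perf_counter() - t0

print(f"sequential: {t_seq:.3f}s  strided: {t_strided:.3f}s")
```

Both versions compute the same result; only the order in which memory is touched differs, and on typical hardware that alone produces a large gap.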
## The Memory Hierarchy
The most important concept in computer architecture for ML engineers is the memory hierarchy. Ballpark latencies on a modern server CPU: registers, under 1 nanosecond; L1 cache, ~1 ns; L2, ~4 ns; L3, ~20-40 ns; DRAM, ~100 ns; SSD, ~100 microseconds.
A cache miss that goes all the way to DRAM costs roughly 100x more than an L1 hit. A training loop whose working set fits in L3 can therefore run 10-100x faster than one whose working set does not. This explains why batch size matters, why data layout matters, and why seemingly minor changes to a training loop can cause dramatic performance differences.
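One way to see the hierarchy from Python is to time random accesses into working sets of increasing size; once the working set outgrows a cache level, the cost per access jumps. This is a rough sketch with arbitrarily chosen sizes, and Python-level overhead blurs the absolute numbers, but the trend is usually visible.

```python
import time
import numpy as np

rng = np.random.default_rng(0)

def ns_per_access(n_bytes, accesses=1_000_000):
    """Approximate nanoseconds per random read into an n_bytes working set."""
    n = n_bytes // 8
    data = np.arange(n, dtype=np.float64)   # arange (not zeros) commits real pages
    idx = rng.integers(0, n, size=accesses)
    t0 = time.perf_counter()
    data[idx].sum()                         # gather: dominated by cache/DRAM latency
    return (time.perf_counter() - t0) / accesses * 1e9

for size in (32 * 1024, 1024 * 1024, 64 * 1024 * 1024):   # ~L1, ~L2, beyond L3
    print(f"{size >> 10:>6} KiB working set: {ns_per_access(size):.1f} ns/access")
```

Expect the last line to be noticeably slower than the first; the exact ratios depend on the machine's cache sizes.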
## CPU Architecture Overview
### Lessons in This Module
| # | Lesson | Key Concept |
|---|---|---|
| 1 | CPU Architecture for ML Engineers | Pipeline stages, out-of-order execution, branch prediction |
| 2 | Memory Hierarchy, Caches, and Locality | L1/L2/L3, spatial and temporal locality, prefetching |
| 3 | Instruction Level Parallelism | Superscalar execution, pipelining, dependency stalls |
| 4 | SIMD and Vectorization | AVX/AVX-512, auto-vectorization, NumPy's secret |
| 5 | Branch Prediction and Pipeline Stalls | Misprediction cost, branch-free code patterns |
| 6 | NUMA Architecture | Non-uniform memory access, NUMA-aware allocation |
| 7 | CPU vs GPU Design Philosophy | Latency vs throughput optimization, when CPU wins |
| 8 | Hardware Counters and Performance Analysis | perf, VTune, cache miss rates, IPC measurement |
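Lesson 4's point about "NumPy's secret" can be previewed in a few lines: a NumPy array expression dispatches to compiled, SIMD-vectorized loops, while a Python-level loop pays interpreter overhead per element. The numbers below are illustrative, not a benchmark.

```python
import time
import numpy as np

x = np.linspace(0.0, 1.0, 1_000_000)

t0 = time.perf_counter()
slow = [3.0 * v + 1.0 for v in x]      # interpreted: one element at a time
t_loop = time.perf_counter() - t0

t0 = time.perf_counter()
fast = 3.0 * x + 1.0                    # vectorized: SIMD over the whole buffer
t_vec = time.perf_counter() - t0

print(f"python loop: {t_loop:.4f}s  numpy: {t_vec:.4f}s  "
      f"speedup: {t_loop / t_vec:.0f}x")
```

The speedup you see combines SIMD with the removal of per-element interpreter dispatch; the lesson separates those two effects.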
## Key Concepts You Will Master
- **Cache locality**: writing memory access patterns that stay in L1/L2 and avoid trips to DRAM
- **SIMD vectorization**: how NumPy and PyTorch achieve 4-16x speedups on CPU with vector instructions
- **NUMA topology**: why memory allocation on multi-socket servers affects ML training throughput
- **Hardware performance counters**: measuring what your code is actually doing at the hardware level
- **Branch-free algorithms**: writing prediction-friendly code for data preprocessing loops
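The branch-free idea can be sketched with a common preprocessing task: clipping outliers. This is a hypothetical example, not code from the module. The loop version takes a data-dependent branch per element, which the branch predictor cannot guess on random data; `np.where` evaluates both arms and selects with a mask, so there is no per-element branch at all.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(size=100_000)

def clip_branchy(a, lo=-1.0, hi=1.0):
    """Per-element if/else: one unpredictable branch per value."""
    out = np.empty_like(a)
    for i, v in enumerate(a):
        if v < lo:
            out[i] = lo
        elif v > hi:
            out[i] = hi
        else:
            out[i] = v
    return out

def clip_branchless(a, lo=-1.0, hi=1.0):
    """Mask-select: computes both arms, picks with a mask, no branches."""
    return np.where(a < lo, lo, np.where(a > hi, hi, a))

assert np.allclose(clip_branchy(data), clip_branchless(data))
```

The same trade-off (compute both sides, select without branching) appears at the hardware level as conditional moves and SIMD blend instructions, which lesson 5 covers.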
## Prerequisites
- Basic C/Python programming
- No prior architecture knowledge required
