
Module 1: Computer Architecture for ML Engineers

You write Python. Python compiles to bytecode. The bytecode interpreter calls into C libraries. The C libraries issue SIMD instructions. The instructions hit the cache hierarchy. Cache misses go to DRAM. Every one of these layers affects your model's training and inference performance, and most ML engineers have never thought about any of them below the PyTorch level.

This module closes that gap. The goal is not to make you a hardware engineer, but to give you a mental model that explains performance behavior. When you understand why a strided memory access can be 10x slower than a sequential one, you make better decisions about data layout in your training pipelines.
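You can observe the strided-vs-sequential gap directly from Python. The sketch below (array sizes and stride are illustrative choices, not from the original text) times a full pass over a contiguous array against a pass over a strided view of the same data; with a stride of 16 float32 elements, each read touches a fresh 64-byte cache line, so the hardware prefetcher's streaming advantage disappears:

```python
import time
import numpy as np

n = 1 << 26  # 64M float32 elements (~256 MB), far larger than any L3 cache
a = np.ones(n, dtype=np.float32)

def ns_per_element(view):
    """Time one full reduction over `view`, normalized per element read."""
    t0 = time.perf_counter()
    view.sum()
    return (time.perf_counter() - t0) / view.size * 1e9

seq = ns_per_element(a)           # contiguous: prefetcher streams cache lines ahead
strided = ns_per_element(a[::16]) # one float read per 64-byte cache line fetched
print(f"sequential: {seq:.2f} ns/elem, strided: {strided:.2f} ns/elem")
```

The exact ratio depends on the CPU and memory system, but on typical hardware the strided pass costs several times more per element, even though it reads 16x fewer values.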

The Memory Hierarchy

The most important concept in computer architecture for ML engineers is the memory hierarchy. Accessing data from CPU registers is ~1 nanosecond. From L1 cache: ~5ns. From L2: ~12ns. From L3: ~40ns. From RAM: ~100ns. From SSD: ~100 microseconds.

Using those numbers, a miss that falls through L3 to RAM (~100ns) costs roughly 20x more than an L1 hit (~5ns), and about 100x more than a register access. A training loop whose working set fits in L3 can run 10-100x faster than one that does not. This explains why batch size matters, why data layout matters, and why seemingly minor changes to a training loop can cause dramatic performance differences.
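A rough way to see the working-set effect is to measure effective bandwidth for repeated passes over a small buffer (cache-resident) versus a large one (DRAM-bound). This is a sketch; the 128 KB and 256 MB sizes are assumptions chosen to sit comfortably inside and outside typical L2/L3 capacities:

```python
import time
import numpy as np

def bandwidth_gbps(size_bytes, repeats):
    """Effective read bandwidth of repeated sums over a buffer of size_bytes."""
    a = np.ones(size_bytes // 4, dtype=np.float32)
    a.sum()  # warm-up pass: populate the caches before timing
    t0 = time.perf_counter()
    for _ in range(repeats):
        a.sum()
    dt = time.perf_counter() - t0
    return size_bytes * repeats / dt / 1e9

small = bandwidth_gbps(128 * 1024, 5000)     # 128 KB: fits in L2 on most CPUs
large = bandwidth_gbps(256 * 1024 * 1024, 4) # 256 MB: every pass goes to DRAM
print(f"cache-resident: {small:.1f} GB/s, DRAM-bound: {large:.1f} GB/s")
```

On typical hardware the cache-resident case sustains several times the bandwidth of the DRAM-bound case, which is the same effect that makes a cache-fitting training loop so much faster.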

CPU Architecture Overview

Lessons in This Module

#   Lesson                                       Key Concept
1   CPU Architecture for ML Engineers            Pipeline stages, out-of-order execution, branch prediction
2   Memory Hierarchy, Caches, and Locality       L1/L2/L3, spatial and temporal locality, prefetching
3   Instruction-Level Parallelism                Superscalar execution, pipelining, dependency stalls
4   SIMD and Vectorization                       AVX/AVX-512, auto-vectorization, NumPy's secret
5   Branch Prediction and Pipeline Stalls        Misprediction cost, branch-free code patterns
6   NUMA Architecture                            Non-uniform memory access, NUMA-aware allocation
7   CPU vs GPU Design Philosophy                 Latency vs throughput optimization, when CPU wins
8   Hardware Counters and Performance Analysis   perf, VTune, cache miss rates, IPC measurement

Key Concepts You Will Master

  • Cache locality - writing memory access patterns that stay in L1/L2 and avoid DRAM
  • SIMD vectorization - how NumPy and PyTorch achieve 4-16x speedups on CPU with vector instructions
  • NUMA topology - why memory allocation on multi-socket servers affects ML training throughput
  • Hardware performance counters - measuring what your code is actually doing at the hardware level
  • Branch-free algorithms - writing prediction-friendly code for data preprocessing loops
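To preview the last two bullets, here is a minimal sketch of a branch-free preprocessing kernel. The function names are illustrative, not from the lessons. The scalar version takes a data-dependent branch per element, which a branch predictor gets wrong about half the time on random data; the vectorized version uses `np.maximum`, which lowers to branchless SIMD max instructions:

```python
import numpy as np

x = np.random.default_rng(0).standard_normal(1_000_000).astype(np.float32)

def relu_branchy(arr):
    """Scalar ReLU: one unpredictable branch per element on random input."""
    out = np.empty_like(arr)
    for i, v in enumerate(arr):
        out[i] = v if v > 0 else 0.0
    return out

def relu_branchfree(arr):
    """Vectorized ReLU: SIMD max instructions, no branches at all."""
    return np.maximum(arr, 0.0)

# Both compute the same result; only the hardware behavior differs.
assert np.allclose(relu_branchy(x[:1000]), relu_branchfree(x[:1000]))
```

Timing the two on the full array shows the combined payoff of SIMD and branch elimination, typically two or more orders of magnitude for a loop like this in Python.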

Prerequisites

  • Basic C/Python programming
  • No prior architecture knowledge required
© 2026 EngineersOfAI. All rights reserved.