9 docs tagged with "computer-architecture"

ARM vs x86 for AI Workloads

Comprehensive comparison of ARM and x86 architectures for ML workloads - ISA design, power efficiency, Apple Silicon unified memory, AWS Graviton3 inference, and performance-per-watt analysis for production AI systems.

CPU Pipeline and Instruction Execution

Learn how modern CPUs execute billions of instructions per second through pipelining, out-of-order execution, branch prediction, and superscalar design - and why these details matter for every ML engineer.

Hardware Acceleration Beyond GPU

FPGA, ASIC, TPU systolic arrays, neuromorphic chips, photonic computing, and processing-in-memory for ML - when to use each, economic analysis, and the emerging hardware landscape beyond NVIDIA GPUs.

Hardware Performance Counters

Master hardware performance counters, the PMU, and Linux perf to diagnose CPU bottlenecks, optimize cache behavior, and profile ML workloads with surgical precision.

Memory Hierarchy and Cache Design

Learn how the CPU cache hierarchy works - L1/L2/L3 structure, associativity, eviction policies, MESI coherence, NUMA topology, and how to write cache-friendly code that can run 10x to 100x faster for ML workloads.

Module 1: Computer Architecture for ML Engineers

CPU architecture, memory hierarchy, SIMD vectorization, NUMA, and hardware performance analysis - understanding the machine your ML code runs on.

Multicore and NUMA Architecture

Learn how multicore CPUs and NUMA topology affect ML workload performance - cache coherence overhead, CPU affinity, NUMA-aware memory allocation, hyperthreading (SMT), and configuring PyTorch DataLoader for optimal hardware utilization.

SIMD and Vectorization

Learn how SIMD instruction sets (SSE, AVX2, AVX-512) let CPUs process 4 to 16 single-precision floating-point values per instruction, why NumPy and PyTorch use them by default, and how to write code that compilers can auto-vectorize.

Storage Hierarchy: SSD and NVMe

Deep dive into SSD and NVMe storage architecture for ML workloads - NAND flash physics, NVMe protocol, io_uring async I/O, memory-mapped datasets, and designing storage systems for large-scale training.