Module 3: Custom Silicon for AI
GPUs were not designed for deep learning. They were designed for graphics. The fact that they turned out to be excellent for matrix multiplication - the core operation of neural networks - was fortunate, but it means GPUs carry many design decisions that make sense for graphics but are merely acceptable for ML.
Custom AI silicon makes different tradeoffs. A TPU is designed from the ground up around matrix multiplication with a fixed dataflow. A Groq LPU eliminates the cache hierarchy entirely in favor of deterministic, compiler-scheduled memory access. A Cerebras WSE puts wafer-scale compute and SRAM on a single chip, eliminating inter-chip communication for models that fit on the wafer. Each of these represents a genuinely different answer to the question: what should AI hardware optimize for?
The Tradeoff Space
Understanding custom silicon requires understanding the axes of the tradeoff:
Flexibility vs efficiency. A GPU is flexible - you can run any kernel on it. A TPU is less flexible but more efficient per watt for the specific operations it supports. FPGAs are maximally flexible (you define the hardware) but require significant engineering.
Training vs inference. The optimal hardware for training (large batches, backward pass, gradient accumulation) is different from the optimal hardware for inference (small batches, low latency, high throughput). Most custom silicon picks a point on this spectrum.
Memory bandwidth vs compute. The fundamental constraint for autoregressive LLM inference at small batch sizes is memory bandwidth, not compute: every generated token requires reading essentially all of the model's weights from memory, so the arithmetic units spend most of their time waiting on loads. Groq's LPU and the Cerebras WSE attack this differently, but both do it by moving weights into fast on-chip SRAM rather than off-chip DRAM.
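A back-of-the-envelope calculation makes the bandwidth argument concrete. The sketch below is plain Python with illustrative numbers - a hypothetical 70B-parameter model in FP16 and rough bandwidth figures for HBM versus on-chip SRAM, all assumptions rather than measurements. At batch size 1, every weight must be read once per generated token, so bandwidth divided by model size bounds tokens per second.

```python
# Back-of-the-envelope: memory-bandwidth ceiling for single-batch LLM decoding.
# All numbers are illustrative assumptions, not measurements of any specific chip.

def max_tokens_per_sec(params_billion: float,
                       bytes_per_param: float,
                       bandwidth_gb_per_sec: float) -> float:
    """Upper bound on decode tokens/sec when every weight is read once per token."""
    bytes_per_token = params_billion * 1e9 * bytes_per_param
    return bandwidth_gb_per_sec * 1e9 / bytes_per_token

# Hypothetical 70B-parameter model stored in FP16 (2 bytes per weight).
model_b = 70
for name, bw_gb in [("HBM-class GPU (~3.3 TB/s)", 3_300),
                    ("On-chip SRAM (~80 TB/s)", 80_000)]:
    print(f"{name}: ~{max_tokens_per_sec(model_b, 2, bw_gb):.0f} tokens/s ceiling")
```

The exact figures matter less than the ratio: with the same raw FLOPs, moving weights closer to the arithmetic units raises the single-batch throughput ceiling by more than an order of magnitude, which is precisely the lever these architectures pull.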
Custom Silicon Landscape
Lessons in This Module
| # | Lesson | Key Concept |
|---|---|---|
| 1 | Google TPU Architecture | Systolic arrays, XLA compilation, TPU pods |
| 2 | AWS Trainium and Inferentia | NeuronSDK, compilation workflow, EC2 Trn/Inf instances |
| 3 | Groq LPU Architecture | Deterministic execution, SRAM-only design, inference throughput |
| 4 | Cerebras Wafer Scale Engine | On-chip memory, weight streaming, distributed compute on one chip |
| 5 | Intel Gaudi (Habana Labs) | SynapseAI software stack, training and inference use cases |
| 6 | Apple Silicon for AI | Neural Engine, unified memory architecture, Core ML |
| 7 | FPGAs for AI Inference | Xilinx/AMD, fixed-function inference, ultra-low latency |
| 8 | Choosing Custom Silicon vs GPUs | Decision framework: workload type, scale, team capability |
Key Concepts You Will Master
- Systolic array architecture - how TPUs execute matrix multiply in a fundamentally different way than CUDA cores (a small simulation follows this list)
- XLA compilation - how Google's compiler maps JAX/TF operations to TPU hardware (see the jax.jit sketch below)
- Deterministic execution - why eliminating caches enables predictable latency at high throughput (Groq)
- Weight streaming - the Cerebras approach to handling models larger than on-chip memory (a conceptual sketch appears below)
- Unified memory - why Apple Silicon's shared CPU/GPU memory changes the economics of inference on consumer hardware
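To make the systolic-array idea concrete, here is a minimal NumPy simulation of an output-stationary array: each processing element (PE) holds one output element and performs one multiply-accumulate per cycle on operands that arrive, skewed in time, from its left and top neighbors. Real TPU matrix units are weight-stationary and far larger (on the order of 128x128), so treat this purely as an illustration of the dataflow, not a model of the MXU.

```python
import numpy as np

def systolic_matmul(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Cycle-by-cycle simulation of an output-stationary systolic array.

    PE (i, j) accumulates C[i, j]. Rows of A stream in from the left and
    columns of B stream in from the top, each skewed by one cycle per
    row/column so the right operand pair meets at the right PE each cycle.
    """
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)

    total_cycles = M + N + K - 2          # cycles for the skewed wavefront to drain
    for t in range(total_cycles):
        for i in range(M):
            for j in range(N):
                k = t - i - j             # which operand pair reaches PE (i, j) now
                if 0 <= k < K:
                    C[i, j] += A[i, k] * B[k, j]   # one MAC per PE per cycle
    return C

A = np.arange(12, dtype=np.float32).reshape(3, 4)
B = np.arange(8, dtype=np.float32).reshape(4, 2)
assert np.allclose(systolic_matmul(A, B), A @ B)
```

The key contrast with CUDA cores is that no PE ever fetches an operand from a shared memory hierarchy; data marches through the grid and every element is reused by an entire row or column of PEs on its way through.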
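The XLA path is easiest to see from JAX: `jax.jit` traces a Python function into an XLA computation, and XLA then optimizes and lowers it (fusing ops, choosing layouts) for whatever backend is attached - TPU, GPU, or CPU. A minimal sketch, runnable on any backend:

```python
import jax
import jax.numpy as jnp

def mlp_layer(x, w, b):
    # A single dense layer; XLA will fuse the matmul, bias add, and ReLU.
    return jax.nn.relu(x @ w + b)

# jit traces the Python function once into an XLA computation, then
# compiles it for the current backend (TPU, GPU, or CPU).
layer = jax.jit(mlp_layer)

key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (8, 512))
w = jax.random.normal(key, (512, 256))
b = jnp.zeros(256)

out = layer(x, w, b)
print(out.shape, jax.devices()[0].platform)   # e.g. (8, 256) cpu / gpu / tpu

# In recent JAX versions you can inspect what XLA receives before
# backend-specific lowering:
print(jax.jit(mlp_layer).lower(x, w, b).as_text()[:400])
```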
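The weight-streaming idea can also be sketched in a few lines. The loop below is a hypothetical illustration in NumPy, not Cerebras's API: activations stay resident in fast on-chip memory while each layer's weights are streamed in from external memory just long enough to be applied. The `layer_weight_files` argument is an invented placeholder for wherever the weights live off chip.

```python
import numpy as np

def forward_weight_streaming(x: np.ndarray, layer_weight_files: list[str]) -> np.ndarray:
    """Hypothetical sketch of weight streaming: activations stay 'on chip';
    each layer's weights visit only briefly and are never all resident at once."""
    activations = x
    for path in layer_weight_files:
        w = np.load(path)                               # stream in this layer's weights
        activations = np.maximum(activations @ w, 0.0)  # apply the layer (matmul + ReLU)
        del w                                           # weights do not accumulate on chip
    return activations
```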
Prerequisites
- GPU Architecture
- Basic neural network training knowledge
