Module 3: Custom Silicon for AI
GPUs were not designed for deep learning. They were designed for graphics. The fact that they turned out to be excellent for matrix multiplication - the core operation of neural networks - was fortunate, but it means GPUs carry many design decisions that make sense for graphics but are merely acceptable for ML.
Custom AI silicon makes different tradeoffs. A TPU is designed from the ground up around matrix multiplication with a fixed dataflow. A Groq LPU eliminates the cache hierarchy entirely in favor of deterministic, compiler-scheduled memory access. A Cerebras WSE puts wafer-scale compute and SRAM on a single chip, eliminating inter-chip communication for models that fit on the wafer. Each of these represents a genuinely different answer to the question: what should AI hardware optimize for?
The Tradeoff Space
Understanding custom silicon requires understanding the axes of the tradeoff:
Flexibility vs efficiency. A GPU is flexible - you can run any kernel on it. A TPU is less flexible but more efficient per watt for the specific operations it supports. FPGAs are maximally flexible (you define the hardware) but require significant engineering.
Training vs inference. The optimal hardware for training (large batches, backward pass, gradient accumulation) is different from the optimal hardware for inference (small batches, low latency, high throughput). Most custom silicon picks a point on this spectrum.
Memory bandwidth vs compute. The fundamental constraint for autoregressive LLM inference at small batch sizes is memory bandwidth, not compute: every generated token requires reading essentially all of the model's weights from memory, so the arithmetic units spend most of their time waiting on loads. Groq's LPU and the Cerebras WSE attack this differently, but both do it by moving weights into fast on-chip SRAM rather than off-chip DRAM.
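A back-of-the-envelope calculation makes the bandwidth argument concrete. The sketch below is plain Python with illustrative numbers - a hypothetical 70B-parameter model in FP16 and rough bandwidth figures for HBM versus on-chip SRAM, all assumptions rather than measurements. At batch size 1, every weight must be read once per generated token, so bandwidth divided by model size bounds tokens per second.

```python
# Back-of-the-envelope: memory-bandwidth ceiling for single-batch LLM decoding.
# All numbers are illustrative assumptions, not measurements of any specific chip.

def max_tokens_per_sec(params_billion: float,
                       bytes_per_param: float,
                       bandwidth_gb_per_sec: float) -> float:
    """Upper bound on decode tokens/sec when every weight is read once per token."""
    bytes_per_token = params_billion * 1e9 * bytes_per_param
    return bandwidth_gb_per_sec * 1e9 / bytes_per_token

# Hypothetical 70B-parameter model stored in FP16 (2 bytes per weight).
model_b = 70
for name, bw_gb in [("HBM-class GPU (~3.3 TB/s)", 3_300),
                    ("On-chip SRAM (~80 TB/s)", 80_000)]:
    print(f"{name}: ~{max_tokens_per_sec(model_b, 2, bw_gb):.0f} tokens/s ceiling")
```

The exact figures matter less than the ratio: with the same raw FLOPs, moving weights closer to the arithmetic units raises the single-batch throughput ceiling by more than an order of magnitude, which is precisely the lever these architectures pull.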
Custom Silicon Landscape
Lessons in This Module
| # | Lesson | Key Concept |
|---|---|---|
| 1 | Google TPU Architecture | Systolic arrays, XLA compilation, TPU pods |
| 2 | AWS Trainium and Inferentia | NeuronSDK, compilation workflow, EC2 Trn/Inf instances |
| 3 | Groq LPU Architecture | Deterministic execution, SRAM-only design, inference throughput |
| 4 | Cerebras Wafer Scale Engine | On-chip memory, weight streaming, distributed compute on one chip |
| 5 | Intel Gaudi (Habana Labs) | SynapseAI software stack, training and inference use cases |
| 6 | Apple Silicon for AI | Neural Engine, unified memory architecture, Core ML |
| 7 | FPGAs for AI Inference | Xilinx/AMD, fixed-function inference, ultra-low latency |
| 8 | Choosing Custom Silicon vs GPUs | Decision framework: workload type, scale, team capability |
Key Concepts You Will Master
- Systolic array architecture - how TPUs execute matrix multiply in a fundamentally different way than CUDA cores (a small simulation follows this list)
- XLA compilation - how Google's compiler maps JAX/TF operations to TPU hardware (see the jax.jit sketch below)
- Deterministic execution - why eliminating caches enables predictable latency at high throughput (Groq)
- Weight streaming - the Cerebras approach to handling models larger than on-chip memory (a conceptual sketch appears below)
- Unified memory - why Apple Silicon's shared CPU/GPU memory changes the economics of inference on consumer hardware
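To make the systolic-array idea concrete, here is a minimal NumPy simulation of an output-stationary array: each processing element (PE) holds one output element and performs one multiply-accumulate per cycle on operands that arrive, skewed in time, from its left and top neighbors. Real TPU matrix units are weight-stationary and far larger (on the order of 128x128), so treat this purely as an illustration of the dataflow, not a model of the MXU.

```python
import numpy as np

def systolic_matmul(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Cycle-by-cycle simulation of an output-stationary systolic array.

    PE (i, j) accumulates C[i, j]. Rows of A stream in from the left and
    columns of B stream in from the top, each skewed by one cycle per
    row/column so the right operand pair meets at the right PE each cycle.
    """
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)

    total_cycles = M + N + K - 2          # cycles for the skewed wavefront to drain
    for t in range(total_cycles):
        for i in range(M):
            for j in range(N):
                k = t - i - j             # which operand pair reaches PE (i, j) now
                if 0 <= k < K:
                    C[i, j] += A[i, k] * B[k, j]   # one MAC per PE per cycle
    return C

A = np.arange(12, dtype=np.float32).reshape(3, 4)
B = np.arange(8, dtype=np.float32).reshape(4, 2)
assert np.allclose(systolic_matmul(A, B), A @ B)
```

The key contrast with CUDA cores is that no PE ever fetches an operand from a shared memory hierarchy; data marches through the grid and every element is reused by an entire row or column of PEs on its way through.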
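The XLA path is easiest to see from JAX: `jax.jit` traces a Python function into an XLA computation, and XLA then optimizes and lowers it (fusing ops, choosing layouts) for whatever backend is attached - TPU, GPU, or CPU. A minimal sketch, runnable on any backend:

```python
import jax
import jax.numpy as jnp

def mlp_layer(x, w, b):
    # A single dense layer; XLA will fuse the matmul, bias add, and ReLU.
    return jax.nn.relu(x @ w + b)

# jit traces the Python function once into an XLA computation, then
# compiles it for the current backend (TPU, GPU, or CPU).
layer = jax.jit(mlp_layer)

key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (8, 512))
w = jax.random.normal(key, (512, 256))
b = jnp.zeros(256)

out = layer(x, w, b)
print(out.shape, jax.devices()[0].platform)   # e.g. (8, 256) cpu / gpu / tpu

# In recent JAX versions you can inspect what XLA receives before
# backend-specific lowering:
print(jax.jit(mlp_layer).lower(x, w, b).as_text()[:400])
```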
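The weight-streaming idea can also be sketched in a few lines. The loop below is a hypothetical illustration in NumPy, not Cerebras's API: activations stay resident in fast on-chip memory while each layer's weights are streamed in from external memory just long enough to be applied. The `layer_weight_files` argument is an invented placeholder for wherever the weights live off chip.

```python
import numpy as np

def forward_weight_streaming(x: np.ndarray, layer_weight_files: list[str]) -> np.ndarray:
    """Hypothetical sketch of weight streaming: activations stay 'on chip';
    each layer's weights visit only briefly and are never all resident at once."""
    activations = x
    for path in layer_weight_files:
        w = np.load(path)                               # stream in this layer's weights
        activations = np.maximum(activations @ w, 0.0)  # apply the layer (matmul + ReLU)
        del w                                           # weights do not accumulate on chip
    return activations
```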
Prerequisites
- GPU Architecture
- Basic neural network training knowledge
