Hardware and Silicon for AI
Most ML engineers treat hardware as a black box. They pick a GPU instance type, watch their training script run, and accept whatever throughput they get. The engineers who treat hardware as a first-class concern achieve 2-10x better utilization, catch bottlenecks before they become blockers, and make better architectural decisions because they understand the physical constraints their code runs under.
This track closes that gap.
Why Hardware Knowledge Matters
When your training run is slow, the explanation lives somewhere in the hardware stack - compute, memory, or interconnect. When your inference costs are too high, the answer is almost always in memory - how you're using it, how much you need, how efficiently you're moving it. When you're choosing between H100s and TPUs for a workload, the right answer depends on understanding the memory bandwidth, interconnect topology, and compute characteristics of both.
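The roofline model mentioned in Module 1 makes this concrete: divide a kernel's FLOPs by the bytes it moves, and compare that arithmetic intensity to the hardware's compute-to-bandwidth ratio. A back-of-envelope sketch, using rough published H100 SXM specs as illustrative assumptions (not measured values):

```python
# Roofline check: is a kernel memory-bound or compute-bound?
# Peak numbers are rough published specs for an NVIDIA H100 SXM,
# used here as assumptions for illustration.
PEAK_FLOPS = 989e12  # ~989 TFLOP/s dense BF16 tensor-core throughput
PEAK_BW = 3.35e12    # ~3.35 TB/s HBM3 memory bandwidth

def roofline(flops: float, bytes_moved: float) -> str:
    """Classify a kernel by arithmetic intensity (FLOPs per byte moved)."""
    intensity = flops / bytes_moved
    ridge = PEAK_FLOPS / PEAK_BW  # ~295 FLOPs/byte on this part
    bound = "compute-bound" if intensity > ridge else "memory-bound"
    return f"{intensity:.1f} FLOPs/byte -> {bound} (ridge ~{ridge:.0f})"

N = 4096
# Square BF16 matmul: 2*N^3 FLOPs, reads/writes 3 matrices of N^2 * 2 bytes
print("GEMM:", roofline(2 * N**3, 3 * N**2 * 2))
# BF16 elementwise add over the same data: N^2 FLOPs, same bytes moved
print("add :", roofline(N**2, 3 * N**2 * 2))
```

The same data volume flips from compute-bound to memory-bound depending on how much work you do per byte - which is why fusing elementwise ops into a matmul is free, but speeding up the matmul itself requires more FLOPs per byte or more bandwidth.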
The engineers who understand hardware make better decisions at every level:
- They write model architectures that fit the hardware (not the other way around)
- They choose the right quantization strategy because they understand what memory bandwidth actually costs
- They know why FlashAttention is fast (it's not the math - it's the memory access pattern)
- They debug OOM errors without guessing
- They design distributed training topologies that do not bottleneck on the network
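The quantization point above reduces to napkin arithmetic: at batch size 1, autoregressive decode streams the entire weight set through HBM once per token, so weight bytes divided by memory bandwidth bounds tokens per second. A sketch, using a 70B-parameter model and H100-class bandwidth as illustrative assumptions:

```python
# Why quantization pays: when decode is weight-bandwidth-bound,
# tokens/s <= bandwidth / model_bytes. All numbers are illustrative
# assumptions, not benchmarks.
HBM_BW = 3.35e12  # bytes/s, roughly an H100's HBM3 bandwidth

def max_decode_tps(n_params: float, bytes_per_param: float) -> float:
    """Upper bound on decode tokens/s at batch size 1."""
    model_bytes = n_params * bytes_per_param
    return HBM_BW / model_bytes

for name, bpp in [("fp16", 2.0), ("int8", 1.0), ("int4", 0.5)]:
    print(f"70B @ {name}: <= {max_decode_tps(70e9, bpp):.0f} tokens/s")
```

Halving bytes per parameter doubles the decode ceiling before any kernel work happens - that is what "memory bandwidth actually costs" means in practice.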
What This Track Covers
Seven modules covering the full hardware stack for AI:
| Module | Topic | Key Skills |
|---|---|---|
| 1 | GPU Architecture | SMs, tensor cores, memory hierarchy, roofline model |
| 2 | CUDA Programming | Kernels, thread blocks, memory coalescing, profiling |
| 3 | Custom Silicon | TPUs, Trainium, Groq LPU, Cerebras, Gaudi |
| 4 | Kernel Optimization | Triton, FlashAttention, operator fusion, torch.compile |
| 5 | Memory Systems | HBM, KV cache, quantization, memory profiling |
| 6 | Distributed Training Hardware | NVLink, InfiniBand, AllReduce, fault tolerance |
| 7 | Inference Hardware | Cost-per-token, batching, edge hardware, serving stack |
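As a taste of the memory-systems material in Module 5, KV cache size is pure multiplication: two tensors (K and V) per layer, per KV head, per head dimension, per token. A sketch using Llama-2-7B-like shapes as an assumption for illustration (check your model's actual config):

```python
# KV cache sizing: bytes per cached token =
#   2 (K and V) * n_layers * n_kv_heads * head_dim * dtype_bytes.
# Shapes follow a Llama-2-7B-like config (illustrative assumption).
def kv_cache_bytes(n_layers, n_kv_heads, head_dim,
                   seq_len, batch, dtype_bytes=2):
    per_token = 2 * n_layers * n_kv_heads * head_dim * dtype_bytes
    return per_token * seq_len * batch

gb = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128,
                    seq_len=4096, batch=8, dtype_bytes=2) / 2**30
print(f"KV cache: {gb:.1f} GiB")  # grows linearly with batch and seq_len
```

At these shapes the cache reaches 16 GiB at batch 8 - often larger than the quantized weights themselves, which is why serving stacks obsess over cache layout, eviction, and quantization.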
Who This Track Is For
- ML Engineers who want to stop treating hardware as a black box and start using it as a lever.
- AI Infrastructure Engineers building training clusters or inference serving systems.
- Research Engineers implementing custom kernels and optimizing model architectures for specific hardware.
- Senior Engineers making hardware procurement and architecture decisions.
Prerequisites
- Comfortable with Python and PyTorch
- Basic understanding of neural network training
- Math for AI Track helpful but not required
The Payoff
Understanding hardware does not make you a hardware engineer. It makes you a better ML engineer - one who writes code that respects the machine it runs on, and understands why certain optimizations work when others do not.
Start with GPU Architecture if you are new to the hardware stack.
Start with CUDA Programming if you want to write your own kernels.
Start with Kernel Optimization if you are optimizing an existing model.
