Module 7: Inference Hardware
Training hardware and inference hardware optimize for completely different things. Training needs sustained throughput over hours or days. Inference needs low latency for individual requests, high throughput across many concurrent requests, and a cost per token that makes the business model work.
An H100 is excellent for training. For inference, depending on your latency requirements and batch characteristics, an L40S, A10G, or even a consumer RTX 4090 might give you better cost efficiency. This module teaches you how to make that decision quantitatively.
Training vs Inference Requirements
| Requirement | Training | Inference |
|---|---|---|
| Latency | Hours per run - batch latency is acceptable | Sub-second time-to-first-token (TTFT) for user-facing requests |
| Memory | Must fit weights + gradients + optimizer states + activations | Must fit weights + KV cache |
| Utilization | Want sustained 80%+ GPU utilization | Want low latency even at low utilization |
| Primary bottleneck | Compute (matrix multiply) | Memory bandwidth (loading weights) |
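The bottleneck row deserves a quick sanity check. During decode, every generated token has to stream the full set of weights out of GPU memory, so memory bandwidth sets a hard ceiling on single-request speed. Here is a minimal back-of-envelope sketch in Python; the model size and bandwidth figures are illustrative assumptions (roughly a 7B model in FP16 and approximate published bandwidth specs), not measurements:

```python
# Back-of-envelope: batch-1 decode is bound by how fast weights stream from
# GPU memory, not by FLOPs. All figures below are illustrative assumptions.

def decode_tokens_per_sec_upper_bound(params_billion: float,
                                      bytes_per_param: float,
                                      mem_bandwidth_gb_s: float) -> float:
    """Upper bound on batch-1 decode speed: each generated token must read
    all weights from memory once, so rate <= bandwidth / weight bytes."""
    weight_gb = params_billion * bytes_per_param  # e.g. 7B * 2 bytes ~= 14 GB
    return mem_bandwidth_gb_s / weight_gb

# Assumed: 7B model in FP16, H100 SXM-class bandwidth of ~3350 GB/s.
print(decode_tokens_per_sec_upper_bound(7, 2.0, 3350))  # ~240 tok/s ceiling
# Same model on an L40S-class card (~864 GB/s assumed):
print(decode_tokens_per_sec_upper_bound(7, 2.0, 864))   # ~60 tok/s ceiling
```

Batching amortizes those weight reads across many requests, which is why the batching lesson below matters so much for cost.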
Inference Hardware Decision Tree
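The decision tree itself is developed across the lessons below. As a rough preview of its shape, here is a toy sketch - the thresholds and GPU picks are placeholder assumptions for illustration, not the module's actual recommendations:

```python
# Illustrative only: a toy version of the decision logic this module builds
# up. Thresholds and GPU choices are placeholder assumptions.

def pick_inference_gpu(model_vram_gb: float,
                       p95_latency_ms: float,
                       requests_per_sec: float) -> str:
    if model_vram_gb > 80:
        return "multi-GPU serving (tensor/pipeline parallel) on H100/A100-class"
    if p95_latency_ms < 200 and requests_per_sec > 50:
        return "H100/A100-class: need the bandwidth plus batching headroom"
    if model_vram_gb <= 20 and requests_per_sec < 5:
        return "L40S / A10G / RTX 4090-class: cheaper cost per token at low load"
    return "L40S or A100 depending on batch size - run the cost-per-token math"

print(pick_inference_gpu(model_vram_gb=14, p95_latency_ms=500, requests_per_sec=2))
```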
Lessons in This Module
| # | Lesson | Key Concept |
|---|---|---|
| 1 | Inference vs Training Hardware Requirements | Memory, latency, cost - the different optimization targets |
| 2 | Batching Strategies and Throughput | Static vs dynamic batching, continuous batching in vLLM |
| 3 | Edge AI Hardware | NVIDIA Jetson, Apple M-series, Snapdragon NPU |
| 4 | Cost-Per-Token Analysis | GPU rental cost, tokens/sec, calculating break-even |
| 5 | Hardware for Long Context Inference | KV cache memory scaling with context length (see the sizing sketch after this table) |
| 6 | Speculative Decoding Hardware Implications | Draft model requirements, memory for two models |
| 7 | On-Premise vs Cloud for Inference | TCO analysis, break-even calculation, when each wins |
| 8 | Building an Inference Stack | GPU selection, serving framework, autoscaling architecture |
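Lesson 5's scaling claim has a simple closed form worth previewing. A minimal sketch, assuming a Llama-2-7B-style attention layout (32 layers, 32 KV heads, head dimension 128, FP16 cache) - swap in your own model's config:

```python
# KV cache memory per sequence = 2 (K and V) * layers * kv_heads * head_dim
# * bytes/elem * context_length. The defaults assume a Llama-2-7B-like
# config purely for illustration.

def kv_cache_gb(context_len: int, layers: int = 32, kv_heads: int = 32,
                head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # ~0.5 MB here
    return context_len * per_token / 1e9

print(kv_cache_gb(4_096))    # ~2 GB per sequence
print(kv_cache_gb(128_000))  # ~67 GB - long context dwarfs the ~14 GB of weights
```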
Key Concepts You Will Master
- Cost-per-token calculation - calculating the real inference cost for any model on any hardware (see the sketch after this list)
- Continuous batching - how vLLM serves variable-length requests with high utilization
- TTFT vs TPOT - time-to-first-token vs time-per-output-token and which matters when
- Speculative decoding - draft model requirements and when the speedup is worth the memory cost
- Cloud vs on-premise TCO - the break-even analysis that determines whether to own hardware
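As a preview of the cost-per-token calculation from lesson 4, here is a minimal sketch; the rental prices and throughput numbers are placeholder assumptions, not benchmarks:

```python
# Cost per token = (GPU rental $/hour) / (tokens served per hour).
# Rental prices and throughput below are placeholder assumptions; plug in
# your own quotes and benchmarked tokens/sec.

def cost_per_million_tokens(gpu_dollars_per_hour: float,
                            tokens_per_sec: float) -> float:
    tokens_per_hour = tokens_per_sec * 3600
    return gpu_dollars_per_hour / tokens_per_hour * 1e6

# Assumed: an H100 rented at ~$3/hr serving 3,000 tok/s with continuous
# batching, vs an L40S at ~$1/hr serving 800 tok/s.
print(cost_per_million_tokens(3.0, 3000))  # ~$0.28 per million tokens
print(cost_per_million_tokens(1.0, 800))   # ~$0.35 per million tokens
```

The cloud vs on-premise break-even in lesson 7 is the same arithmetic run against purchase price, power, and depreciation instead of an hourly rental rate.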
Prerequisites
- GPU Architecture
- Memory Systems
- Basic LLM inference understanding
