Module 7: Inference Hardware
Training hardware and inference hardware optimize for completely different things. Training needs sustained throughput over hours or days. Inference needs low latency for individual requests, high throughput across many concurrent requests, and a cost per token that makes the business model work.
An H100 is excellent for training. For inference, depending on your latency requirements and batch characteristics, an L40S, A10G, or even a consumer RTX 4090 might give you better cost efficiency. This module teaches you how to make that decision quantitatively.
Training vs Inference Requirements
| Requirement | Training | Inference |
|---|---|---|
| Latency | Hours per run - batch latency is acceptable | Sub-second time-to-first-token (TTFT) for user-facing requests |
| Memory | Must fit weights + gradients + optimizer states + activations | Must fit weights + KV cache |
| Utilization | Want sustained 80%+ GPU utilization | Want low latency even at low utilization |
| Primary bottleneck | Compute (matrix multiply) | Memory bandwidth (loading weights) |
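The bottleneck row deserves a quick sanity check. During decode, every generated token has to stream the full set of weights out of GPU memory, so memory bandwidth sets a hard ceiling on single-request speed. Here is a minimal back-of-envelope sketch in Python; the model size and bandwidth figures are illustrative assumptions (roughly a 7B model in FP16 and approximate published bandwidth specs), not measurements:

```python
# Back-of-envelope: batch-1 decode is bound by how fast weights stream from
# GPU memory, not by FLOPs. All figures below are illustrative assumptions.

def decode_tokens_per_sec_upper_bound(params_billion: float,
                                      bytes_per_param: float,
                                      mem_bandwidth_gb_s: float) -> float:
    """Upper bound on batch-1 decode speed: each generated token must read
    all weights from memory once, so rate <= bandwidth / weight bytes."""
    weight_gb = params_billion * bytes_per_param  # e.g. 7B * 2 bytes ~= 14 GB
    return mem_bandwidth_gb_s / weight_gb

# Assumed: 7B model in FP16, H100 SXM-class bandwidth of ~3350 GB/s.
print(decode_tokens_per_sec_upper_bound(7, 2.0, 3350))  # ~240 tok/s ceiling
# Same model on an L40S-class card (~864 GB/s assumed):
print(decode_tokens_per_sec_upper_bound(7, 2.0, 864))   # ~60 tok/s ceiling
```

Batching amortizes those weight reads across many requests, which is why the batching lesson below matters so much for cost.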
Inference Hardware Decision Tree
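The decision tree itself is developed across the lessons below. As a rough preview of its shape, here is a toy sketch - the thresholds and GPU picks are placeholder assumptions for illustration, not the module's actual recommendations:

```python
# Illustrative only: a toy version of the decision logic this module builds
# up. Thresholds and GPU choices are placeholder assumptions.

def pick_inference_gpu(model_vram_gb: float,
                       p95_latency_ms: float,
                       requests_per_sec: float) -> str:
    if model_vram_gb > 80:
        return "multi-GPU serving (tensor/pipeline parallel) on H100/A100-class"
    if p95_latency_ms < 200 and requests_per_sec > 50:
        return "H100/A100-class: need the bandwidth plus batching headroom"
    if model_vram_gb <= 20 and requests_per_sec < 5:
        return "L40S / A10G / RTX 4090-class: cheaper cost per token at low load"
    return "L40S or A100 depending on batch size - run the cost-per-token math"

print(pick_inference_gpu(model_vram_gb=14, p95_latency_ms=500, requests_per_sec=2))
```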
Lessons in This Module
| # | Lesson | Key Concept |
|---|---|---|
| 1 | Inference vs Training Hardware Requirements | Memory, latency, cost - the different optimization targets |
| 2 | Batching Strategies and Throughput | Static vs dynamic batching, continuous batching in vLLM |
| 3 | Edge AI Hardware | NVIDIA Jetson, Apple M-series, Snapdragon NPU |
| 4 | Cost-Per-Token Analysis | GPU rental cost, tokens/sec, calculating break-even |
| 5 | Hardware for Long Context Inference | KV cache memory scaling with context length (see the sizing sketch after this table) |
| 6 | Speculative Decoding Hardware Implications | Draft model requirements, memory for two models |
| 7 | On-Premise vs Cloud for Inference | TCO analysis, break-even calculation, when each wins |
| 8 | Building an Inference Stack | GPU selection, serving framework, autoscaling architecture |
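Lesson 5's scaling claim has a simple closed form worth previewing. A minimal sketch, assuming a Llama-2-7B-style attention layout (32 layers, 32 KV heads, head dimension 128, FP16 cache) - swap in your own model's config:

```python
# KV cache memory per sequence = 2 (K and V) * layers * kv_heads * head_dim
# * bytes/elem * context_length. The defaults assume a Llama-2-7B-like
# config purely for illustration.

def kv_cache_gb(context_len: int, layers: int = 32, kv_heads: int = 32,
                head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # ~0.5 MB here
    return context_len * per_token / 1e9

print(kv_cache_gb(4_096))    # ~2 GB per sequence
print(kv_cache_gb(128_000))  # ~67 GB - long context dwarfs the ~14 GB of weights
```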
Key Concepts You Will Master
- Cost-per-token calculation - calculating the real inference cost for any model on any hardware (see the sketch after this list)
- Continuous batching - how vLLM serves variable-length requests with high utilization
- TTFT vs TPOT - time-to-first-token vs time-per-output-token and which matters when
- Speculative decoding - draft model requirements and when the speedup is worth the memory cost
- Cloud vs on-premise TCO - the break-even analysis that determines whether to own hardware
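As a preview of the cost-per-token calculation from lesson 4, here is a minimal sketch; the rental prices and throughput numbers are placeholder assumptions, not benchmarks:

```python
# Cost per token = (GPU rental $/hour) / (tokens served per hour).
# Rental prices and throughput below are placeholder assumptions; plug in
# your own quotes and benchmarked tokens/sec.

def cost_per_million_tokens(gpu_dollars_per_hour: float,
                            tokens_per_sec: float) -> float:
    tokens_per_hour = tokens_per_sec * 3600
    return gpu_dollars_per_hour / tokens_per_hour * 1e6

# Assumed: an H100 rented at ~$3/hr serving 3,000 tok/s with continuous
# batching, vs an L40S at ~$1/hr serving 800 tok/s.
print(cost_per_million_tokens(3.0, 3000))  # ~$0.28 per million tokens
print(cost_per_million_tokens(1.0, 800))   # ~$0.35 per million tokens
```

The cloud vs on-premise break-even in lesson 7 is the same arithmetic run against purchase price, power, and depreciation instead of an hourly rental rate.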
Prerequisites
- GPU Architecture
- Memory Systems
- Basic LLM inference understanding
