
Module 7: Inference Hardware

Training hardware and inference hardware optimize for completely different things. Training needs sustained throughput over hours or days. Inference needs low latency for individual requests, high throughput across many concurrent requests, and a cost per token that makes the business model work.

An H100 is excellent for training. For inference, depending on your latency requirements and batch characteristics, an L40S, A10G, or even a consumer RTX 4090 might give you better cost efficiency. This module teaches you how to make that decision quantitatively.
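The core of that decision is simple arithmetic: dollars per hour divided by tokens per second gives dollars per token. Here is a minimal sketch of the comparison; the hourly rates and throughput figures are illustrative placeholders, not measured benchmarks.

```python
# Back-of-envelope cost-per-token comparison across GPUs.
# Rates and throughputs are illustrative placeholders -- substitute
# your own rental prices and measured tokens/sec.

def cost_per_million_tokens(hourly_rate_usd: float, tokens_per_sec: float) -> float:
    """Dollars per 1M generated tokens at a given rental rate and throughput."""
    tokens_per_hour = tokens_per_sec * 3600
    return hourly_rate_usd / tokens_per_hour * 1_000_000

# Hypothetical numbers for one model at a fixed batch size:
gpus = {
    "H100":     (3.50, 2400.0),   # ($/hr, tokens/sec) -- placeholders
    "L40S":     (1.00,  900.0),
    "A10G":     (0.75,  450.0),
    "RTX 4090": (0.40,  500.0),
}

for name, (rate, tps) in gpus.items():
    print(f"{name:>8}: ${cost_per_million_tokens(rate, tps):.2f} per 1M tokens")
```

Note that the cheapest GPU per hour is not necessarily the cheapest per token; throughput at your batch size is what decides it.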

Training vs Inference Requirements

| Requirement | Training | Inference |
|---|---|---|
| Latency | Hours per run; batch latency acceptable | Sub-second TTFT for user-facing workloads |
| Memory | Must fit model + optimizer states + activations | Must fit model weights + KV cache (estimated below) |
| Utilization | Sustained 80%+ GPU utilization | Low latency even at low utilization |
| Primary bottleneck | Compute (matrix multiplies) | Memory bandwidth (loading weights) |
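To make the Memory row concrete, here is a rough estimate of inference memory, assuming FP16 weights (2 bytes per parameter) and a standard multi-head-attention KV cache. The model dimensions below are illustrative, roughly the shape of a 7B model.

```python
# Rough inference memory estimate: weights + KV cache.
# KV cache bytes = 2 (K and V) * layers * kv_heads * head_dim
#                  * bytes_per_elem * context_len * batch_size
# Dimensions are illustrative placeholders for a 7B-class model.

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_len: int, batch: int, bytes_per_elem: int = 2) -> float:
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return per_token * context_len * batch / 1e9

params_b = 7                  # parameters, in billions
weights_gb = params_b * 2     # FP16: 2 bytes per parameter

cache = kv_cache_gb(layers=32, kv_heads=32, head_dim=128,
                    context_len=4096, batch=8)
print(f"weights ~{weights_gb:.0f} GB, KV cache ~{cache:.1f} GB, "
      f"total ~{weights_gb + cache:.1f} GB")
```

Under these assumptions the KV cache alone is about 0.5 MB per token, so context length and batch size, not just model size, determine which GPUs can serve the workload.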

Inference Hardware Decision Tree

Lessons in This Module

| # | Lesson | Key Concept |
|---|---|---|
| 1 | Inference vs Training Hardware Requirements | Memory, latency, cost - the different optimization targets |
| 2 | Batching Strategies and Throughput | Static vs dynamic batching; continuous batching in vLLM (sketched below) |
| 3 | Edge AI Hardware | NVIDIA Jetson, Apple M-series, Snapdragon NPU |
| 4 | Cost-Per-Token Analysis | GPU rental cost, tokens/sec, calculating break-even |
| 5 | Hardware for Long-Context Inference | KV cache memory scaling with context length |
| 6 | Speculative Decoding Hardware Implications | Draft model requirements; memory for two models |
| 7 | On-Premise vs Cloud for Inference | TCO analysis, break-even calculation, when each wins |
| 8 | Building an Inference Stack | GPU selection, serving framework, autoscaling architecture |
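As a preview of Lesson 2, the sketch below uses vLLM's offline generation API. Continuous batching happens inside the engine: you submit prompts and it schedules variable-length requests to keep the GPU busy. The model name is a placeholder; substitute one you have access to.

```python
# Minimal vLLM sketch: the engine applies continuous batching
# internally when serving these prompts.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf")  # placeholder model name
params = SamplingParams(max_tokens=128, temperature=0.8)

prompts = [
    "Explain the KV cache in one sentence.",
    "List three GPUs commonly used for inference.",
    "What does TTFT measure?",
]
outputs = llm.generate(prompts, params)  # batched and scheduled by the engine
for out in outputs:
    print(out.outputs[0].text)
```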

Key Concepts You Will Master

  • Cost-per-token calculation - calculating real inference cost for any model on any hardware
  • Continuous batching - how vLLM serves variable-length requests with high utilization
  • TTFT vs TPOT - time-to-first-token vs time-per-output-token and which matters when
  • Speculative decoding - draft model requirements and when the speedup is worth the memory cost
  • Cloud vs on-premise TCO - the break-even analysis that determines whether to own hardware (sketched below)
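The break-even analysis in the last bullet reduces to one line of arithmetic: purchase price divided by the monthly savings over cloud rental. A minimal sketch follows; every dollar figure is an illustrative placeholder.

```python
# On-premise vs cloud break-even. All dollar figures are placeholders.

def breakeven_months(purchase_usd: float, monthly_opex_usd: float,
                     cloud_monthly_usd: float) -> float:
    """Months until cumulative cloud rental exceeds owning.

    Solves purchase + opex * m = cloud * m for m.
    """
    savings = cloud_monthly_usd - monthly_opex_usd
    if savings <= 0:
        return float("inf")  # cloud never loses at this utilization
    return purchase_usd / savings

# Example: one GPU server vs renting the equivalent around the clock.
m = breakeven_months(purchase_usd=35_000,      # hardware + setup
                     monthly_opex_usd=600,     # power, cooling, hosting
                     cloud_monthly_usd=2_500)  # ~ $3.40/hr * 730 hr
print(f"break-even after ~{m:.1f} months")
```

The result is highly sensitive to utilization: if the rented GPU sits idle half the time, the effective cloud bill halves and the break-even point moves out accordingly. Lesson 7 works through the full TCO version.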

Prerequisites
