Selecting GPUs for Training vs Inference
Reading time: ~35 min - Interview relevance: High - Target roles: ML Engineer, AI Infrastructure Engineer, MLOps
It is a Tuesday afternoon at a mid-size AI company. The CTO has just approved a $2M hardware budget. The team needs to train a 13 billion parameter foundation model from scratch, then serve it at scale to 50,000 daily active users. You have been handed the spreadsheet and the task: figure out what to buy.
You open a browser and land on NVIDIA's product page. A single H100 SXM5 80GB runs $25k-35k, so an 8-GPU HGX node costs on the order of $400,000. The A100 80GB SXM works out to roughly $96,000 for 8. Someone in Slack suggests the RTX 4090 at $1,600, and AWS quotes $98/hour for a p5.48xlarge. The options multiply and none of them come with a decision framework.
This is the real problem the industry faces. GPU marketing emphasizes peak TFLOPS and vague "AI performance" scores. But the number that matters for training is BF16 tensor core throughput plus HBM capacity. The number that matters for inference is memory bandwidth divided by model size in bytes. These are completely different bottlenecks, and they point to completely different hardware choices. Buying H100s for inference of 7B models is like buying a Formula 1 engine to commute to work - technically capable but economically indefensible.
The disconnect runs deep. A team that just finished training a 7B parameter model on rented A100s will instinctively reach for the same hardware when deploying it to production. But serving that same model requires different arithmetic entirely. In training, you are doing a full forward pass plus backward pass across the entire network, updating millions of gradients, maintaining optimizer states - the GPU is compute-bound and every TFLOP counts. In inference, you are running a forward pass on a single request (or a small batch), and the GPU sits idle while it fetches the next layer's weights from memory - the bottleneck shifts entirely to memory bandwidth.
Across four years of building ML infrastructure at scale, engineers have paid for this confusion in two ways: wasting $500k on H100 clusters for inference workloads that would run fine on RTX 4090s, and buying cheap consumer GPUs for fine-tuning runs that hit OOM at batch size 2. This lesson gives you the first-principles framework to never make either mistake.
By the end, you will be able to look at a model size, a training recipe, a serving requirement, and a budget - and identify the optimal GPU in under five minutes. More importantly, you will understand why, so when a new GPU launches next year, you can evaluate it yourself without waiting for someone else's benchmark post.
Why This Exists
Before the deep learning era, compute hardware was general-purpose. You ran database queries, video rendering, and scientific simulations on largely the same server hardware, just with different amounts of RAM and disk. The GPU was a specialized co-processor for graphics, occasionally repurposed for scientific workloads via CUDA after 2007.
As deep learning exploded between 2012 and 2018, the industry discovered something uncomfortable: training neural networks at scale requires fundamentally different hardware characteristics than running them in production. Training is an iterative optimization problem - you need to store the model, its gradients, its optimizer states, and all intermediate activations. You need to do this billions of times, as fast as possible, ideally in parallel across many GPUs. The bottleneck is raw arithmetic throughput and inter-GPU communication bandwidth.
Inference is a retrieval and computation problem - given an input, produce an output as quickly as possible, as cheaply as possible, as many times per second as possible. You only need the model weights. You need to fetch those weights from memory and multiply them by activations. The bottleneck is memory bandwidth: how fast can you stream weights from HBM (High Bandwidth Memory) to the arithmetic units?
NVIDIA's data center GPU lineup evolved to address training first - the A100 in 2020, the H100 in 2022, both optimized for the matrix multiply throughput needed in training. But the LLM serving boom of 2023 exposed a gap: engineers needed GPUs that prioritized memory bandwidth efficiency over raw TFLOPS. This is why cards like the L40S (48GB GDDR6) and even the consumer RTX 4090 became legitimate production inference options, and why the right answer to "which GPU should I buy" is always: "it depends entirely on whether you are training or serving."
Historical Context
The GPU selection problem as we know it today traces back to the introduction of the A100 in May 2020. It was the first GPU to support BF16 (Brain Float 16) natively in hardware - the number format that became standard for LLM training because it has the same exponent range as FP32 (avoiding the overflow problems that plagued FP16 training) while cutting memory in half. The A100 also introduced the third generation Tensor Core, capable of 312 TFLOPS in BF16 - roughly 20x the performance of the best GPU available five years earlier.
Teams immediately began training larger models. GPT-3 (175B parameters, released June 2020) was trained on thousands of V100s; contemporary estimates put the equivalent A100 requirement at roughly 1,024 GPUs running for over a month. This was training at unprecedented scale. But the serving story was different: GPT-3 inference ran on smaller clusters because you only needed to store the weights, not the training scaffolding.
The H100 arrived in 2022 with a key addition: FP8 support. Dense FP8 reaches 1,979 TFLOPS on a single H100 SXM - a 2x improvement over the chip's 989 TFLOPS of dense BF16, and a 6x jump over the A100's 312. This matters for training because you can use FP8 for forward passes with scale factors to maintain accuracy. But for inference, HBM3 bandwidth increased from 2 TB/s (A100) to 3.35 TB/s (H100) - a 67% improvement. Both specs matter, but they matter for different reasons.
The RTX 4090 arrived as a consumer card in late 2022 and surprised the ML community: 82.6 TFLOPS FP16, 1 TB/s GDDR6X bandwidth, 24GB VRAM, for $1,600. Against a $12,000 A100, the price-to-performance ratio for fine-tuning small models was dramatically better. Teams running QLoRA fine-tuning on 7B models discovered the RTX 4090 was often the optimal choice - fast enough, cheap enough, and the 24GB VRAM fits a 4-bit quantized 7B model comfortably.
The lesson the industry slowly learned: a single "best GPU" for AI does not exist. The optimal choice is a function of model size, training recipe, serving latency requirements, throughput requirements, and budget. Getting this right is a core competency of ML infrastructure engineers.
Core Concepts
Training Memory: The Full Accounting
When you train a neural network, you need to store far more than just the model weights. Understanding exactly what lives in GPU memory during training is the foundation of every GPU selection decision.
Consider a model with N parameters. In full FP32 training, every parameter is a 32-bit float - 4 bytes each. But you also need:
- Gradients: one gradient per parameter, same dtype as the weight - 4 bytes/param
- Adam optimizer states: two 32-bit floats per parameter (first and second moment estimates) - 8 bytes/param
- Activations: intermediate values saved during the forward pass for use in the backward pass - roughly 4-8 bytes/param depending on architecture and sequence length
Total: roughly 16 bytes per parameter before activations - 20-24 bytes including them - in FP32 training.
The industry standard today is mixed precision training - weights stored in BF16 (2 bytes), but gradients accumulated and optimizer states maintained in FP32 (4 bytes each). This gives:
- BF16 master weights: 2 bytes/param
- FP32 optimizer copy of weights: 4 bytes/param
- FP32 gradients: 4 bytes/param
- FP32 Adam first moment (m): 4 bytes/param
- FP32 Adam second moment (v): 4 bytes/param
Total: 18 bytes per parameter for mixed precision with AdamW.
For a 7B parameter model: 7e9 params x 18 bytes = 126 GB before activations, roughly 140-150 GB with them.
A single A100 80GB cannot hold this. You need at least two A100 80GB GPUs with tensor parallelism or ZeRO-3 sharding.
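This arithmetic is worth scripting once. A minimal sketch mirroring the byte counts listed above (activations excluded):

```python
def training_bytes_per_param(weight_bytes: float = 2.0) -> float:
    """Mixed-precision AdamW: low-precision weights plus an FP32 master
    copy, FP32 gradients, and two FP32 Adam moment buffers."""
    fp32_master = 4.0
    fp32_grads = 4.0
    adam_moments = 2 * 4.0  # m and v
    return weight_bytes + fp32_master + fp32_grads + adam_moments

per_param = training_bytes_per_param()   # 18 bytes/param for BF16 weights
total_7b_gb = 7e9 * per_param / 1e9      # 126 GB of state before activations
```

At 18 bytes per parameter, a 7B model already needs 126 GB of state before a single activation is stored - which is why one 80GB card is not enough.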
For QLoRA fine-tuning (4-bit base weights, BF16 adapter parameters):
- Base weights: 0.5 bytes/param (4-bit)
- Adapter weights (small fraction, maybe 1% of params): 2 bytes/param
- Optimizer states on adapter only: 8 bytes/adapter-param
Effective total: roughly 1.5-2 bytes per base parameter once activations and adapter states are included. A 7B model requires about 10-12 GB - fits comfortably on a single RTX 4090, with headroom left for longer sequences.
Inference Memory: Weights Plus KV Cache
Inference is simpler but has its own gotcha: the KV cache.
During autoregressive generation (the token-by-token output of a language model), each new token attends to all previous tokens. Rather than recomputing the Key and Value matrices for every previous token at every step, you cache them. This KV cache grows with sequence length and batch size.
KV cache memory per token, per layer, in FP16: 2 (K and V) x n_kv_heads x head_dim x 2 bytes.
For a 70B model with grouped-query attention (8 KV heads, 8192 context, batch size 32):
- KV cache = 32 (batch) x 8192 (context) x 80 (layers) x 8 (KV heads) x 128 (head dim) x 2 (K+V) x 2 bytes = ~86 GB
This means even for inference, VRAM planning must account for KV cache on top of model weights.
Model weights for 70B in FP16: 140 GB. Two A100 80GBs cover the weights alone, but once you add the large KV cache computed above, a practical deployment needs four.
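The KV cache figure above can be reproduced with a small helper. A sketch assuming grouped-query attention with 8 KV heads and head dimension 128 (Llama-2-70B-style values; swap in your model's actual config):

```python
def kv_cache_gb(batch: int, seq_len: int, n_layers: int,
                n_kv_heads: int, head_dim: int, bytes_per_val: int = 2) -> float:
    """KV cache size in GB: one K and one V vector per token, per layer,
    per KV head, at the given element width (2 bytes for FP16)."""
    kv_pair = 2  # one K and one V entry
    total = batch * seq_len * n_layers * n_kv_heads * head_dim * kv_pair * bytes_per_val
    return total / 1e9

kv_cache_gb(batch=32, seq_len=8192, n_layers=80, n_kv_heads=8, head_dim=128)
# roughly 86 GB, matching the 70B serving example
```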
Training Is Compute-Bound; Inference Is Bandwidth-Bound
This is the central insight that drives GPU selection.
Training compute-bound reasoning: During a training step, every parameter participates in both the forward pass and the backward pass. The ratio of compute (FLOPs) to data movement (bytes) is high - you do a lot of arithmetic for each byte you load from memory. Modern GPUs hit their arithmetic ceiling before they hit their memory bandwidth ceiling during training. More TFLOPS = faster training.
Inference memory-bandwidth-bound reasoning: During a single autoregressive decoding step, each layer's weight matrix (hundreds of millions of values) is loaded from HBM, multiplied by a single token's activation vector, and then the next layer starts. For batch size 1 on a 70B FP16 model, you load 140 GB of weights but perform only about 140 billion floating-point operations (two per weight - a multiply and an add - because the input is a single vector). This is an arithmetic intensity of 1 FLOP/byte - far below what modern GPU arithmetic units can sustain.
The result: for batch size 1 LLM inference, the GPU is sitting 90%+ idle on its arithmetic units, waiting for HBM to stream the next layer's weights. You are paying for TFLOPS you cannot use. What actually determines tokens/second is memory bandwidth.
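A quick roofline check makes the idleness concrete. A sketch, using NVIDIA's H100 SXM datasheet figures of ~989 TFLOPS dense BF16 and 3.35 TB/s:

```python
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs performed per byte loaded from memory."""
    return flops / bytes_moved

# Batch-1 decode of a 70B FP16 model: ~2 FLOPs per weight, 2 bytes per weight
decode_ai = arithmetic_intensity(2 * 70e9, 140e9)  # 1.0 FLOP/byte
# Intensity needed to saturate an H100's ALUs at full bandwidth:
h100_breakeven = 989e12 / 3.35e12                  # ~295 FLOPs/byte
```

At 1 FLOP/byte against a ~295 FLOP/byte break-even, the arithmetic units sit idle the overwhelming majority of the time during batch-1 decoding - increasing batch size is the only way to climb the roofline.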
This is why the RTX 4090 is genuinely competitive with the A100 for 7B inference: it has 1 TB/s bandwidth vs the A100's 2 TB/s, but the A100 costs 7x more. Per dollar of bandwidth, the RTX 4090 wins.
The Bandwidth-per-Dollar Analysis for Inference
Theoretical token throughput for single-request (batch=1) inference: tokens/sec ≈ memory bandwidth (bytes/s) / model size (bytes), because each generated token requires one full pass over the weights.
For a 7B FP16 model: model size = 14 GB = 1.4 x 10^10 bytes.
| GPU | Bandwidth | 7B FP16 throughput |
|---|---|---|
| H100 SXM | 3.35 TB/s | ~240 tokens/s |
| A100 SXM | 2.0 TB/s | ~143 tokens/s |
| RTX 4090 | 1.0 TB/s | ~71 tokens/s |
| RTX 3090 | 936 GB/s | ~67 tokens/s |
For a 70B FP16 model: model size = 140 GB = 1.4 x 10^11 bytes.
| GPU | Bandwidth | 70B FP16 throughput (per GPU) |
|---|---|---|
| H100 SXM | 3.35 TB/s | ~24 tokens/s |
| A100 SXM | 2.0 TB/s | ~14 tokens/s |
| RTX 4090 | 1.0 TB/s | ~7 tokens/s (model does not fit) |
The 70B model does not fit on a single RTX 4090 (24GB) or even a single A100 80GB (140GB > 80GB). You need multiple GPUs in tensor parallel mode.
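A helper for the GPU-count question, treating headroom as a tunable (the 10% default is an assumption based on the fragmentation guidance later in this lesson):

```python
import math

def min_gpus(weights_gb: float, kv_cache_gb: float, vram_gb: float,
             headroom_frac: float = 0.10) -> int:
    """Minimum GPUs to hold weights + KV cache under tensor parallelism,
    reserving a fraction of each card's VRAM for runtime overhead."""
    usable_per_gpu = vram_gb * (1 - headroom_frac)
    return math.ceil((weights_gb + kv_cache_gb) / usable_per_gpu)

min_gpus(140, 0, 80)    # 70B FP16, weights only: 2 x A100 80GB
min_gpus(140, 85, 80)   # with a large KV cache:  4 x A100 80GB
```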
GPU Comparison Tables
Training
| GPU | BF16 TFLOPS | FP8 TFLOPS | HBM | Bandwidth | NVLink | Price (approx) | Best For |
|---|---|---|---|---|---|---|---|
| H100 SXM5 80GB | 989 (dense) / 1979 (w/ sparsity) | 1979 | HBM3 80GB | 3.35 TB/s | 900 GB/s | $25k-35k | 70B+ training |
| A100 SXM 80GB | 312 | N/A | HBM2e 80GB | 2.0 TB/s | 600 GB/s | $10k-15k | 13B-70B training |
| A100 SXM 40GB | 312 | N/A | HBM2e 40GB | 1.6 TB/s | 600 GB/s | $7k-10k | 7B-13B training |
| RTX 4090 24GB | 82.6 | N/A | GDDR6X 24GB | 1.0 TB/s | None | $1.6k | 7B fine-tuning |
| RTX 3090 24GB | 35.6 | N/A | GDDR6X 24GB | 936 GB/s | None | $700 used | Experimentation |
| L40S 48GB | 91.6 | 183 | GDDR6 48GB | 864 GB/s | None | $8k-10k | Inference / light fine-tune |
Inference (ordered by cost efficiency for LLM serving)
| GPU | Bandwidth | VRAM | Max Model (FP16) | Price | Bandwidth/$k |
|---|---|---|---|---|---|
| RTX 4090 24GB | 1.0 TB/s | 24GB | 12B (FP16) / 24B (INT4) | $1.6k | 625 GB/s/$k |
| RTX 3090 24GB | 936 GB/s | 24GB | 12B (FP16) / 24B (INT4) | $700 | 1337 GB/s/$k |
| A100 40GB SXM | 1.6 TB/s | 40GB | 20B (FP16) | $10k | 160 GB/s/$k |
| H100 SXM 80GB | 3.35 TB/s | 80GB | 40B (FP16) | $30k | 112 GB/s/$k |
| L40S 48GB | 864 GB/s | 48GB | 24B (FP16) | $9k | 96 GB/s/$k |
A used RTX 3090 offers extraordinary bandwidth-per-dollar for small model inference. The H100 dominates raw throughput but is the worst value for inference per dollar of bandwidth.
Code Examples
1. Model Memory Estimator
from enum import Enum
from dataclasses import dataclass
class Precision(Enum):
FP32 = "fp32"
FP16 = "fp16"
BF16 = "bf16"
INT8 = "int8"
INT4 = "int4"
FP8 = "fp8"
class Optimizer(Enum):
NONE = "none" # inference only
SGD = "sgd" # 1x momentum
ADAM = "adam" # 2x moments, FP32
ADAMW = "adamw" # 2x moments, FP32 (standard)
QLORA = "qlora" # 4-bit base, BF16 adapter, Adam on adapter only
BYTES_PER_PARAM = {
Precision.FP32: 4,
Precision.FP16: 2,
Precision.BF16: 2,
Precision.INT8: 1,
Precision.INT4: 0.5,
Precision.FP8: 1,
}
@dataclass
class MemoryEstimate:
weights_gb: float
gradients_gb: float
optimizer_states_gb: float
activations_gb: float
total_gb: float
note: str
def model_memory_estimate(
n_params: int,
precision: Precision = Precision.BF16,
optimizer: Optimizer = Optimizer.ADAMW,
activation_multiplier: float = 2.0,
adapter_fraction: float = 0.01, # for QLoRA: fraction of params in adapter
) -> MemoryEstimate:
"""
Estimate GPU memory required for a model.
Args:
n_params: Total number of model parameters (e.g., 7_000_000_000 for 7B)
precision: Weight storage precision
optimizer: Optimizer type (affects optimizer state memory)
activation_multiplier: Activations as multiple of weight bytes (typically 2x)
adapter_fraction: For QLoRA, what fraction of params are in the trainable adapter
Returns:
MemoryEstimate with breakdown in GB
"""
bytes_per_param = BYTES_PER_PARAM[precision]
# Weight memory
weights_bytes = n_params * bytes_per_param
# Gradient memory (only needed during training)
# Mixed precision training: gradients in FP32 regardless of weight precision
if optimizer == Optimizer.NONE:
gradients_bytes = 0.0
elif optimizer == Optimizer.QLORA:
# Only adapter params have gradients
gradients_bytes = n_params * adapter_fraction * 4 # FP32 gradients
else:
gradients_bytes = n_params * 4 # FP32 gradients for mixed precision
# Optimizer state memory
if optimizer == Optimizer.NONE:
optimizer_bytes = 0.0
elif optimizer == Optimizer.SGD:
# 1 momentum buffer per param in FP32
optimizer_bytes = n_params * 4
elif optimizer in (Optimizer.ADAM, Optimizer.ADAMW):
# FP32 master weights + 2x FP32 moments
# In mixed precision: FP32 master copy + m + v = 3 * 4 bytes
optimizer_bytes = n_params * 4 * 3 # master weights + m + v
elif optimizer == Optimizer.QLORA:
# Only adapter params have optimizer states
n_adapter = int(n_params * adapter_fraction)
optimizer_bytes = n_adapter * 4 * 3 # FP32 master + m + v for adapter
else:
optimizer_bytes = 0.0
# Activation memory (rough estimate - depends heavily on sequence length and architecture)
if optimizer == Optimizer.NONE:
# Inference: no activations saved for backward, but KV cache grows
activations_bytes = weights_bytes * 0.2 # rough KV cache estimate
else:
activations_bytes = weights_bytes * activation_multiplier
total_bytes = (
weights_bytes + gradients_bytes + optimizer_bytes + activations_bytes
)
def to_gb(b: float) -> float:
return b / (1024 ** 3)
# Generate a human-readable note
if optimizer == Optimizer.QLORA:
note = (
f"QLoRA: {precision.value} base weights + BF16 adapter "
f"({adapter_fraction*100:.1f}% of params)"
)
elif optimizer == Optimizer.NONE:
note = f"Inference only: {precision.value} weights + KV cache estimate"
else:
note = (
f"Mixed precision training: {precision.value} weights + "
f"FP32 gradients + FP32 {optimizer.value} states"
)
return MemoryEstimate(
weights_gb=to_gb(weights_bytes),
gradients_gb=to_gb(gradients_bytes),
optimizer_states_gb=to_gb(optimizer_bytes),
activations_gb=to_gb(activations_bytes),
total_gb=to_gb(total_bytes),
note=note,
)
# --- Usage examples ---
models = {
"7B": 7_000_000_000,
"13B": 13_000_000_000,
"70B": 70_000_000_000,
"175B": 175_000_000_000,
}
print("=" * 70)
print("TRAINING MEMORY ESTIMATES (Mixed Precision BF16 + FP32 AdamW)")
print("=" * 70)
for name, params in models.items():
est = model_memory_estimate(params, Precision.BF16, Optimizer.ADAMW)
print(f"\n{name} model:")
print(f" Weights: {est.weights_gb:7.1f} GB")
print(f" Gradients: {est.gradients_gb:7.1f} GB")
print(f" Optimizer states: {est.optimizer_states_gb:7.1f} GB")
print(f" Activations: {est.activations_gb:7.1f} GB")
print(f" TOTAL: {est.total_gb:7.1f} GB")
# How many A100 80GB or H100 80GB GPUs needed?
a100_needed = max(1, -(-est.total_gb // 75)) # ceiling, leaving 5GB headroom
print(f" A100/H100 80GB GPUs needed: {int(a100_needed)}")
print("\n" + "=" * 70)
print("QLORA FINE-TUNING MEMORY ESTIMATES (4-bit base, BF16 adapter)")
print("=" * 70)
for name, params in models.items():
est = model_memory_estimate(params, Precision.INT4, Optimizer.QLORA)
print(f"\n{name} model (QLoRA):")
print(f" Total: {est.total_gb:7.1f} GB | {est.note}")
print("\n" + "=" * 70)
print("INFERENCE MEMORY ESTIMATES (FP16 weights, no optimizer)")
print("=" * 70)
for name, params in models.items():
est = model_memory_estimate(params, Precision.FP16, Optimizer.NONE)
print(f"\n{name} model (FP16 inference):")
print(f" Total: {est.total_gb:7.1f} GB (weights + KV cache estimate)")
Expected output:
======================================================================
TRAINING MEMORY ESTIMATES (Mixed Precision BF16 + FP32 AdamW)
======================================================================
7B model:
Weights: 13.0 GB
Gradients: 26.1 GB
Optimizer states: 78.2 GB
Activations: 26.1 GB
TOTAL: 143.4 GB
A100/H100 80GB GPUs needed: 2
13B model:
TOTAL: 266.4 GB
A100/H100 80GB GPUs needed: 4
70B model:
TOTAL: 1434.3 GB
A100/H100 80GB GPUs needed: 20
======================================================================
QLORA FINE-TUNING MEMORY ESTIMATES (4-bit base, BF16 adapter)
======================================================================
7B model (QLoRA):
Total:    10.8 GB | QLoRA: int4 base weights + BF16 adapter (1.0% of params)
2. Inference Throughput Estimator
def inference_throughput_estimate(
bandwidth_tbs: float,
model_size_gb: float,
batch_size: int = 1,
efficiency: float = 0.7,
) -> dict:
"""
Estimate tokens/second for LLM inference.
The key insight: for small batch sizes, LLM inference is memory-bandwidth-bound.
Each decoding step loads the full model from HBM to compute one (or batch_size) tokens.
At large batch sizes, the workload becomes compute-bound (many tokens share weight loads).
The crossover point is called the "arithmetic intensity break-even."
Args:
bandwidth_tbs: Memory bandwidth in TB/s (e.g., 2.0 for A100)
model_size_gb: Total model size in GB (e.g., 14.0 for 7B FP16)
batch_size: Number of sequences decoded in parallel
efficiency: Fraction of peak bandwidth achieved in practice (0.6-0.85)
Returns:
dict with tokens/sec and roofline analysis
"""
bandwidth_bytes_per_sec = bandwidth_tbs * 1e12
model_size_bytes = model_size_gb * 1e9
# Time to load entire model weights once from HBM
weight_load_time_sec = model_size_bytes / (bandwidth_bytes_per_sec * efficiency)
# Each weight load produces `batch_size` tokens (one per sequence in batch)
tokens_per_load = batch_size
# Theoretical throughput
tokens_per_sec = tokens_per_load / weight_load_time_sec
    # Latency for a single token: each decoding step takes one full weight
    # load regardless of batch size (all sequences in the batch share it)
    latency_ms = weight_load_time_sec * 1000
return {
"bandwidth_tbs": bandwidth_tbs,
"model_size_gb": model_size_gb,
"batch_size": batch_size,
"tokens_per_sec": round(tokens_per_sec, 1),
"latency_per_token_ms": round(latency_ms, 2),
"note": (
f"Memory-bandwidth-bound estimate. "
f"Peak at batch={batch_size} assuming {efficiency*100:.0f}% bandwidth efficiency."
),
}
# Compare GPUs for inference of different model sizes
gpu_specs = {
"H100 SXM 80GB": {"bandwidth_tbs": 3.35, "vram_gb": 80, "price_k": 30},
"A100 SXM 80GB": {"bandwidth_tbs": 2.0, "vram_gb": 80, "price_k": 12},
"A100 SXM 40GB": {"bandwidth_tbs": 1.6, "vram_gb": 40, "price_k": 8},
"RTX 4090 24GB": {"bandwidth_tbs": 1.0, "vram_gb": 24, "price_k": 1.6},
"RTX 3090 24GB": {"bandwidth_tbs": 0.936,"vram_gb": 24, "price_k": 0.7},
"L40S 48GB": {"bandwidth_tbs": 0.864,"vram_gb": 48, "price_k": 9},
}
model_specs = {
"7B FP16": {"size_gb": 14, "min_vram_gb": 14},
"13B FP16": {"size_gb": 26, "min_vram_gb": 26},
"70B FP16": {"size_gb": 140, "min_vram_gb": 140}, # needs multiple GPUs
}
print("=" * 80)
print("INFERENCE THROUGHPUT COMPARISON (batch=1, single GPU)")
print("=" * 80)
print(f"{'GPU':<22} {'Model':<12} {'tok/s':>8} {'lat (ms)':>10} {'fit?':>6}")
print("-" * 80)
for gpu_name, gpu in gpu_specs.items():
for model_name, model in model_specs.items():
fits = "YES" if gpu["vram_gb"] >= model["min_vram_gb"] else "NO"
if fits == "YES":
result = inference_throughput_estimate(
bandwidth_tbs=gpu["bandwidth_tbs"],
model_size_gb=model["size_gb"],
batch_size=1,
)
tok_s = result["tokens_per_sec"]
lat = result["latency_per_token_ms"]
else:
tok_s = 0
lat = 0
tok_s_str = f"{tok_s:.1f}" if fits == "YES" else "N/A"
lat_str = f"{lat:.1f}" if fits == "YES" else "N/A"
print(
f"{gpu_name:<22} {model_name:<12} {tok_s_str:>8} {lat_str:>10} {fits:>6}"
)
print()
3. Cost Per 1M Tokens Analysis
def cost_per_1m_tokens(
hardware_cost_usd: float,
utilization: float,
throughput_tokens_per_sec: float,
hardware_lifetime_years: float = 3.0,
power_cost_per_kwh: float = 0.12,
gpu_tdp_watts: float = 400,
num_gpus: int = 1,
) -> dict:
"""
Calculate total cost per 1M tokens served for owned hardware.
Includes: hardware amortization + electricity + rough maintenance estimate.
Does NOT include: networking, storage, ops labor, cooling overhead.
Args:
hardware_cost_usd: Total hardware purchase price
utilization: Fraction of time the GPU is actively serving (0.0-1.0)
throughput_tokens_per_sec: Tokens/second when actively serving
hardware_lifetime_years: Depreciation period
power_cost_per_kwh: Electricity rate
gpu_tdp_watts: GPU thermal design power (e.g., 400W for H100 SXM)
num_gpus: Number of GPUs in the serving unit
Returns:
dict with cost breakdown
"""
seconds_per_year = 365 * 24 * 3600
lifetime_seconds = hardware_lifetime_years * seconds_per_year
# Hardware amortization cost per second of runtime
hardware_per_sec = hardware_cost_usd / lifetime_seconds
# Power cost per second
total_watts = gpu_tdp_watts * num_gpus
kwh_per_sec = total_watts / (1000 * 3600)
power_per_sec = kwh_per_sec * power_cost_per_kwh
# Total cost per second of wall-clock time
total_per_sec = hardware_per_sec + power_per_sec
# Tokens produced per second of wall-clock time (accounts for idle time)
effective_tokens_per_sec = throughput_tokens_per_sec * utilization
if effective_tokens_per_sec <= 0:
return {"error": "Cannot compute cost - zero effective throughput"}
# Cost per token
cost_per_token = total_per_sec / effective_tokens_per_sec
cost_per_1m = cost_per_token * 1_000_000
# Annual token capacity
annual_tokens = effective_tokens_per_sec * seconds_per_year
return {
"hardware_cost_usd": hardware_cost_usd,
"utilization_pct": utilization * 100,
"throughput_tokens_per_sec": throughput_tokens_per_sec,
"effective_throughput": round(effective_tokens_per_sec, 1),
"hardware_cost_per_1m_tokens": round(hardware_per_sec / effective_tokens_per_sec * 1e6, 4),
"power_cost_per_1m_tokens": round(power_per_sec / effective_tokens_per_sec * 1e6, 4),
"total_cost_per_1m_tokens_usd": round(cost_per_1m, 4),
"annual_capacity_billions": round(annual_tokens / 1e9, 1),
}
# 7B model inference cost comparison
print("=" * 72)
print("COST PER 1M TOKENS - 7B FP16 INFERENCE (70% utilization, 3-yr life)")
print("=" * 72)
scenarios_7b = [
{
"name": "RTX 3090 (used)",
"hw_cost": 700,
"throughput": 65,
"tdp": 350,
"num_gpus": 1,
},
{
"name": "RTX 4090",
"hw_cost": 1600,
"throughput": 70,
"tdp": 450,
"num_gpus": 1,
},
{
"name": "A100 40GB SXM",
"hw_cost": 8000,
"throughput": 114,
"tdp": 400,
"num_gpus": 1,
},
{
"name": "H100 SXM (overkill)",
"hw_cost": 30000,
"throughput": 239,
"tdp": 700,
"num_gpus": 1,
},
]
print(f"{'GPU':<22} {'tok/s':>8} {'$/1M tok':>12} {'Annual cap (B)':>16}")
print("-" * 72)
for s in scenarios_7b:
result = cost_per_1m_tokens(
hardware_cost_usd=s["hw_cost"],
utilization=0.70,
throughput_tokens_per_sec=s["throughput"],
gpu_tdp_watts=s["tdp"],
num_gpus=s["num_gpus"],
)
print(
f"{s['name']:<22} "
f"{s['throughput']:>8} "
f"${result['total_cost_per_1m_tokens_usd']:>10.4f} "
f"{result['annual_capacity_billions']:>14.1f}B"
)
# Cloud comparison
print("\n" + "=" * 72)
print("CLOUD COST COMPARISON - AWS spot/on-demand for LLM training")
print("=" * 72)
cloud_instances = [
{
"name": "p4d.24xlarge",
"description": "8x A100 40GB",
"on_demand_hr": 32.77,
"spot_hr": 9.83,
"bf16_tflops": 8 * 312, # 8 GPUs
},
{
"name": "p4de.24xlarge",
"description": "8x A100 80GB",
"on_demand_hr": 40.96,
"spot_hr": 12.29,
"bf16_tflops": 8 * 312,
},
{
"name": "p5.48xlarge",
"description": "8x H100 SXM",
"on_demand_hr": 98.32,
"spot_hr": 29.50,
"bf16_tflops": 8 * 312, # dense BF16
},
{
"name": "g5.48xlarge",
"description": "8x A10G 24GB",
"on_demand_hr": 16.29,
"spot_hr": 4.89,
"bf16_tflops": 8 * 31.2,
},
]
print(f"{'Instance':<18} {'Config':<18} {'On-demand/hr':>14} {'Spot/hr':>10}")
print("-" * 72)
for inst in cloud_instances:
print(
f"{inst['name']:<18} "
f"{inst['description']:<18} "
f"${inst['on_demand_hr']:>12.2f} "
f"${inst['spot_hr']:>8.2f}"
)
4. Full GPU Comparison DataFrame
import pandas as pd
def build_gpu_comparison_dataframe() -> pd.DataFrame:
"""
Build a comprehensive GPU comparison table for ML workloads.
All throughput estimates are theoretical; real-world performance is 60-85%.
"""
gpus = [
{
"GPU": "H100 SXM5 80GB",
"Tier": "Data Center",
"VRAM_GB": 80,
"BF16_TFLOPS": 312,
"FP8_TFLOPS": 989,
"Bandwidth_TBs": 3.35,
"NVLink_GBs": 900,
"TDP_W": 700,
"Price_USD": 30000,
"Best_for": "70B+ training, 40B inference",
},
{
"GPU": "A100 SXM 80GB",
"Tier": "Data Center",
"VRAM_GB": 80,
"BF16_TFLOPS": 312,
"FP8_TFLOPS": None,
"Bandwidth_TBs": 2.0,
"NVLink_GBs": 600,
"TDP_W": 400,
"Price_USD": 12000,
"Best_for": "13B-70B training",
},
{
"GPU": "A100 SXM 40GB",
"Tier": "Data Center",
"VRAM_GB": 40,
"BF16_TFLOPS": 312,
"FP8_TFLOPS": None,
"Bandwidth_TBs": 1.6,
"NVLink_GBs": 600,
"TDP_W": 400,
"Price_USD": 8000,
"Best_for": "7B-13B training",
},
{
"GPU": "L40S 48GB",
"Tier": "Data Center",
"VRAM_GB": 48,
"BF16_TFLOPS": 91.6,
"FP8_TFLOPS": 183,
"Bandwidth_TBs": 0.864,
"NVLink_GBs": None,
"TDP_W": 350,
"Price_USD": 9000,
"Best_for": "Inference, light fine-tune",
},
{
"GPU": "RTX 4090 24GB",
"Tier": "Consumer",
"VRAM_GB": 24,
"BF16_TFLOPS": 82.6,
"FP8_TFLOPS": None,
"Bandwidth_TBs": 1.0,
"NVLink_GBs": None,
"TDP_W": 450,
"Price_USD": 1600,
"Best_for": "7B fine-tuning, 7B inference",
},
{
"GPU": "RTX 3090 24GB",
"Tier": "Consumer",
"VRAM_GB": 24,
"BF16_TFLOPS": 35.6,
"FP8_TFLOPS": None,
"Bandwidth_TBs": 0.936,
"NVLink_GBs": None,
"TDP_W": 350,
"Price_USD": 700,
"Best_for": "Experimentation, cheap inference",
},
{
"GPU": "A10G 24GB",
"Tier": "Data Center",
"VRAM_GB": 24,
"BF16_TFLOPS": 31.2,
"FP8_TFLOPS": None,
"Bandwidth_TBs": 0.6,
"NVLink_GBs": None,
"TDP_W": 150,
"Price_USD": 2500,
"Best_for": "Inference, AWS g5",
},
]
df = pd.DataFrame(gpus)
# Derived metrics
df["BW_per_kUSD"] = (df["Bandwidth_TBs"] * 1000) / (df["Price_USD"] / 1000) # GB/s per $1k
df["TFLOPS_per_kUSD"] = df["BF16_TFLOPS"] / (df["Price_USD"] / 1000)
# Throughput estimate for 7B FP16 inference (batch=1, 70% efficiency)
model_7b_gb = 14.0
df["7B_FP16_tokens_per_sec"] = (
df["Bandwidth_TBs"] * 1e12 * 0.7 / (model_7b_gb * 1e9)
).round(1)
# Mark GPUs that can't fit 7B FP16
df.loc[df["VRAM_GB"] < 14, "7B_FP16_tokens_per_sec"] = float("nan")
return df
df = build_gpu_comparison_dataframe()
print("GPU COMPARISON - TRAINING PERSPECTIVE")
print("(higher BF16 TFLOPS + NVLink bandwidth = better training)")
training_cols = ["GPU", "VRAM_GB", "BF16_TFLOPS", "Bandwidth_TBs", "NVLink_GBs", "Price_USD"]
print(df[training_cols].to_string(index=False))
print("\n\nGPU COMPARISON - INFERENCE PERSPECTIVE")
print("(higher Bandwidth_TBs/$ = better inference value)")
inference_cols = ["GPU", "VRAM_GB", "Bandwidth_TBs", "Price_USD", "BW_per_kUSD", "7B_FP16_tokens_per_sec"]
print(df[inference_cols].sort_values("BW_per_kUSD", ascending=False).to_string(index=False))
Architecture Diagrams
Decision Framework Flowchart
Training Memory Breakdown
Inference Bottleneck: Compute vs Bandwidth
Cloud Instance Mapping
Production Engineering Notes
Always Benchmark Before Buying
Theoretical specs are starting points. Real-world performance depends on:
- Memory fragmentation: PyTorch's CUDA allocator can fragment VRAM, causing OOM at 85-90% theoretical capacity. Always leave 10-15% headroom.
- PCIe vs NVLink: Two A100s connected via PCIe 4.0 (64 GB/s bidirectional) will be significantly slower for tensor parallel workloads than NVLink (600 GB/s). If you need multi-GPU tensor parallelism, PCIe-only setups (like two RTX 4090s) will have severe communication bottlenecks.
- Thermal throttling: Consumer GPUs (RTX 4090) thermal throttle under sustained load in dense server configurations. Data center GPUs (A100, H100) are designed for 100% sustained utilization in rack environments with active cooling.
- Driver and CUDA version compatibility: Not all quantization libraries support all GPU architectures. bitsandbytes INT4 requires CUDA compute capability >= 7.5. FlashAttention-2 requires >= 8.0 (A100, H100, RTX 30xx/40xx). Always verify library support before committing to hardware.
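A compute-capability gate can be encoded before you commit to hardware. A sketch using the thresholds above (the FP8 threshold of 8.9, covering Ada and Hopper, is an added assumption not stated in the text):

```python
def supported_features(major: int, minor: int) -> dict:
    """Map a CUDA compute capability to library support, per the
    thresholds discussed above."""
    cc = major + minor / 10
    return {
        "bitsandbytes_int4": cc >= 7.5,   # Turing and newer
        "flash_attention_2": cc >= 8.0,   # Ampere and newer
        "fp8_tensor_cores": cc >= 8.9,    # Ada/Hopper (assumption)
    }

supported_features(8, 6)  # RTX 3090-class Ampere: INT4 yes, FA2 yes, FP8 no
```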
vLLM PagedAttention Changes the Inference Calculus
vLLM (released by researchers at UC Berkeley in 2023) introduced PagedAttention, which manages KV cache memory the way an OS manages virtual memory - in non-contiguous pages, reducing fragmentation and enabling much higher GPU utilization. The practical effect: with vLLM, a single A100 80GB serving a 30B INT4 model can handle significantly more concurrent requests than naive implementations suggested.
This matters for GPU selection: vLLM effectively increases throughput by 2-4x over naive HuggingFace generation, which changes the cost-per-token calculation significantly. When comparing hardware options, always benchmark with vLLM (or TGI, which uses similar techniques) rather than naive generation loops.
ZeRO Stages Change the Multi-GPU Training Story
For training across multiple GPUs, DeepSpeed's ZeRO (Zero Redundancy Optimizer) shards optimizer states, gradients, and parameters across GPUs:
- ZeRO Stage 1: Shard optimizer states only. Each GPU still stores full model + gradients.
- ZeRO Stage 2: Shard optimizer states + gradients. Each GPU stores full model.
- ZeRO Stage 3: Shard optimizer states + gradients + model parameters. Memory per GPU scales as 1/N.
ZeRO Stage 3 on 8x RTX 4090 (8 x 24GB = 192GB effective) can train a 13B model in mixed precision, provided you add activation checkpointing and offload some optimizer state to CPU - the 234 GB of weights, gradients, and optimizer states alone slightly exceed 192 GB. Without ZeRO, you would need 8x A100 80GB (640GB effective). The hardware cost difference: ~$13k vs ~$96k. ZeRO Stage 3 over PCIe is slower than NVLink, but for training where you care about cost more than speed, it can be the right trade-off.
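The per-GPU arithmetic behind the three stages, as a sketch using the 18-bytes-per-parameter mixed-precision accounting from earlier (activations excluded):

```python
def zero_per_gpu_gb(n_params: float, n_gpus: int, stage: int) -> float:
    """Per-GPU training-state memory (GB) under ZeRO stages 0-3,
    for BF16 weights + FP32 gradients + FP32 AdamW states."""
    weights = 2.0 * n_params
    grads = 4.0 * n_params
    opt = 12.0 * n_params  # FP32 master copy + m + v
    if stage >= 1:
        opt /= n_gpus      # Stage 1: shard optimizer states
    if stage >= 2:
        grads /= n_gpus    # Stage 2: also shard gradients
    if stage >= 3:
        weights /= n_gpus  # Stage 3: also shard parameters
    return (weights + grads + opt) / 1e9

zero_per_gpu_gb(13e9, 8, 0)  # 234 GB per GPU - impossible on any single card
zero_per_gpu_gb(13e9, 8, 3)  # ~29 GB per GPU - why 24GB cards still need offload
```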
The Quantization Escape Hatch
Quantization changes the memory equation dramatically:
| Format | Bytes/param | 70B model size | Min GPUs (A100 80GB) |
|---|---|---|---|
| FP32 | 4 | 280 GB | 4 |
| FP16/BF16 | 2 | 140 GB | 2 |
| INT8 | 1 | 70 GB | 1 |
| INT4 (GPTQ/AWQ) | 0.5 | 35 GB | 1 |
| INT4 + INT4 KV | ~0.5 | ~35 GB + reduced KV | 1 |
A 70B INT4 quantized model fits on a single A100 80GB with enough room for a reasonable KV cache. Quality degradation is measurable but often acceptable for instruction-following tasks. This changes the GPU selection for serving: instead of 2x H100 80GB (~$60k) for FP16, a single A100 80GB (~$12k) at INT4.
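The table reduces to one line of arithmetic:

```python
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def quantized_weights_gb(n_params: float, fmt: str) -> float:
    """Weight-only memory footprint; KV cache comes on top."""
    return n_params * BYTES_PER_PARAM[fmt] / 1e9

quantized_weights_gb(70e9, "int4")  # 35.0 GB - one A100 80GB with KV headroom
```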
NVLink vs PCIe for Multi-GPU Training
This is the most commonly overlooked hardware spec when building training clusters.
For tensor parallelism (splitting a single model across GPUs), inter-GPU bandwidth is on the critical path of every forward and backward pass. A 70B model in tensor parallel mode across 4 GPUs does a collective all-reduce after every attention layer. With NVLink at 600 GB/s bidirectional (A100), this all-reduce completes in milliseconds. With PCIe 4.0 x16 at 32 GB/s unidirectional, the same operation takes 10-20x longer and the GPU sits idle.
The rule: if you are doing tensor parallelism across GPUs on the same node, NVLink (data center GPUs only) is essentially mandatory for performance. PCIe-only setups (consumer GPUs) are practical for data parallelism (each GPU has its own model copy, gradient sync happens less frequently) but not for tensor parallel.
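The NVLink-vs-PCIe gap can be put in numbers with the standard ring all-reduce cost model, where each GPU moves 2(N-1)/N of the tensor over the slowest link. This ignores per-message latency and compute/communication overlap, so treat it as a lower bound:

```python
def allreduce_ms(tensor_gb, n_gpus, link_gb_s):
    """Rough ring all-reduce time in ms: each GPU sends and receives
    2*(N-1)/N of the tensor over a link of the given bandwidth."""
    bytes_moved = tensor_gb * 2 * (n_gpus - 1) / n_gpus
    return bytes_moved / link_gb_s * 1000

# 1 GB of activations/gradients reduced across 4 GPUs:
print(allreduce_ms(1, 4, 600))  # NVLink (A100, 600 GB/s): 2.5 ms
print(allreduce_ms(1, 4, 32))   # PCIe 4.0 x16 (~32 GB/s): ~47 ms
```

The ~19x gap is exactly the "10-20x longer" penalty cited above; in tensor parallelism this cost is paid after every layer, which is why it dominates.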
Common Mistakes
:::danger Buying H100s for Small Model Inference
The H100's headline specs make it tempting to buy for all workloads. But for serving a 7B model to end users:
- H100 SXM at $30k delivers ~239 tokens/sec for 7B FP16 (bandwidth-bound ceiling: 3.35 TB/s / 14 GB)
- RTX 4090 at $1.6k delivers ~70 tokens/sec for the same model (1 TB/s / 14 GB)
The H100 is 3.4x faster but 18.75x more expensive. Per dollar, the RTX 4090 is 5.5x better value for this specific workload.
Rule: H100 wins for training 70B+ models (compute-bound, benefits from FP8 and NVLink) and for inference with large batch sizes (>32) where compute-bound operation benefits from raw TFLOPS. For batch=1 interactive serving of models under 30B parameters, prioritize bandwidth-per-dollar instead. :::
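The value comparison above generalizes to any pair of GPUs. A sketch using the bandwidth-bound decode ceiling (one full weight read per generated token); prices and bandwidths are the figures from this lesson:

```python
def decode_tokens_per_sec(bandwidth_gb_s, model_gb):
    """Bandwidth-bound decode ceiling: one full weight read per token."""
    return bandwidth_gb_s / model_gb

def value_ratio(bw_a, price_a, bw_b, price_b, model_gb=14):
    """How many times more tokens-per-second-per-dollar GPU A delivers
    than GPU B for batch-1 decode of a model of the given size."""
    a = decode_tokens_per_sec(bw_a, model_gb) / price_a
    b = decode_tokens_per_sec(bw_b, model_gb) / price_b
    return a / b

# RTX 4090 (1008 GB/s, ~$1.6k) vs H100 SXM (3350 GB/s, ~$30k), 7B FP16:
print(round(value_ratio(1008, 1600, 3350, 30000), 1))  # ~5.6x
```

The ~5.6x result matches the ~5.5x figure in the text (the small difference comes from rounding the 4090's bandwidth to 1 TB/s).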
:::danger Forgetting Optimizer State Memory in Training Plans
Engineers often plan training GPU requirements based on model size alone:
- "7B model at BF16 = 14GB, fits on one RTX 4090 with 24GB to spare."
This is catastrophically wrong. The full memory picture for 7B mixed-precision AdamW training:
- BF16 weights: 13 GB
- FP32 gradients: 26 GB
- FP32 optimizer states (master weights + m + v): 78 GB
- Activations: 26 GB
- Total: ~143 GB
You need at minimum 2x A100 80GB with ZeRO Stage 2/3, or 8x RTX 4090 with ZeRO Stage 3 over PCIe (which will be slow).
Always run the memory calculation before provisioning hardware. Use the model_memory_estimate() function above or Hugging Face's model memory calculator.
:::
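For reference, a minimal sketch of the kind of `model_memory_estimate()` helper referenced above might look like the following (the lesson's actual function may differ; note the text's "7B" figures correspond to roughly 6.5B actual parameters, which is why 13 GB of BF16 weights appears rather than 14):

```python
def model_memory_estimate(params_b, activation_factor=2.0):
    """Full training-memory estimate (GB) for mixed-precision AdamW:
    BF16 weights (2 B/param), FP32 gradients (4 B/param), FP32 master
    weights + Adam m and v (12 B/param), and activations approximated
    as a multiple of the weight memory (a rough heuristic)."""
    weights = 2 * params_b
    grads = 4 * params_b
    optimizer = 12 * params_b
    activations = activation_factor * weights
    return {"weights": weights, "gradients": grads,
            "optimizer": optimizer, "activations": activations,
            "total": weights + grads + optimizer + activations}

print(model_memory_estimate(6.5)["total"])  # the "7B" example: 143.0 GB
print(model_memory_estimate(13)["total"])   # 13B: 286.0 GB
```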
:::warning Using Consumer GPUs for NVLink-Dependent Workloads
RTX 4090 and RTX 3090 do not have NVLink. They connect via PCIe 4.0 (peak ~64 GB/s bidirectional for x16 slots). Data center A100 and H100 have NVLink at 600-900 GB/s.
If your training job uses tensor parallelism (splitting a single model layer across GPUs), consumer GPUs will serialize at the PCIe bus and you will see 10-20x slower inter-GPU communication vs NVLink. A 4x A100 NVLink setup will outperform 8x RTX 4090 PCIe for tensor-parallel 30B training even though the RTX 4090 array has more total VRAM.
Use consumer GPUs for: ZeRO data parallelism, QLoRA fine-tuning, single-GPU inference. Use data center GPUs for: tensor parallelism, pipeline parallelism, models too large for data-parallel training. :::
:::warning Ignoring Thermal Constraints in Dense Deployments
Consumer GPUs (RTX 4090, RTX 3090) are designed for desktop tower cases with open airflow. In 1U/2U server chassis or dense rack deployments, sustained full-load operation causes throttling because the GPU cannot dissipate heat fast enough.
A server rack with 8x RTX 4090 requires careful airflow engineering, often specialized chassis (Lambda Scalar, Puget Systems, or custom builds), and even then may throttle. The effective throughput in a dense deployment may be 15-25% lower than benchmarks run in open-air workstation cases.
Data center GPUs (A100, H100) are specified for 24/7 rack operation at full TDP. Their cooling solution and power delivery are designed for server environments. When building production inference infrastructure that must sustain 100% utilization around the clock, factor in the real sustained throughput, not peak benchmark numbers. :::
:::warning Underestimating KV Cache Growth in Production Serving
In production LLM serving, the KV cache can exceed model weight memory at large batch sizes and long contexts.
For a 70B model at FP16, serving 32 concurrent requests with 4096-token context:
- Model weights: 140 GB
- KV cache: 32 (batch) x 4096 (context) x 80 (layers) x 8 (KV heads, with grouped-query attention) x 128 (head_dim) x 2 (K+V) x 2 bytes = ~42 GB
This is manageable on 2x H100 80GB. But at 8192-token context with the same batch:
- KV cache: 84 GB
Total memory requirement: 224 GB, requiring 3x H100 80GB or KV cache compression (INT8/FP8 KV cache quantization, or eviction techniques like StreamingLLM).
Plan for peak concurrent load and max context length, not average. A serving cluster that fits the model comfortably but OOMs under peak KV cache load will degrade exactly when you need it most. :::
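The KV cache arithmetic above is worth encoding once and reusing. A sketch, with the 70B-style defaults assumed here (80 layers, 8 KV heads under grouped-query attention, head_dim 128, FP16 cache):

```python
def kv_cache_gb(batch, context, layers, kv_heads, head_dim, bytes_per_elem=2):
    """KV cache size in GB: batch x context x layers x kv_heads x head_dim
    x 2 (K and V) x bytes per element. For Llama-2-70B-style models with
    grouped-query attention, kv_heads is 8, not the 64 query heads."""
    elems = batch * context * layers * kv_heads * head_dim * 2
    return elems * bytes_per_elem / 1e9

print(round(kv_cache_gb(32, 4096, 80, 8, 128), 1))  # ~42.9 GB
print(round(kv_cache_gb(32, 8192, 80, 8, 128), 1))  # ~85.9 GB at 8192 context
```

Doubling context doubles the cache linearly - which is why "plan for max context" is the operative rule.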
Interview Q&A
Q1: A colleague says "just buy H100s for everything." How would you respond?
The H100 is the optimal choice for training large models (70B+) where FP8 arithmetic throughput and NVLink bandwidth are on the critical path. But for most other workloads, it is not the optimal choice per dollar.
For inference of models up to 30B parameters at batch sizes of 1-8 (interactive serving), the bottleneck is memory bandwidth per dollar, not TFLOPS. The H100 delivers 3.35 TB/s at ~$30k (~0.11 GB/s per dollar). A used RTX 3090 delivers 936 GB/s at ~$1k (~0.94 GB/s per dollar) - roughly 8x better bandwidth per dollar.
For QLoRA fine-tuning of 7B models, an RTX 4090 at $1.6k does the job; paying $30k for an H100 for the same run is wasteful.
The right answer is always: characterize the workload (model size, training vs inference, batch size, latency requirement, throughput requirement), run the memory and throughput arithmetic, then select the hardware that meets requirements at minimum cost. H100 wins on raw performance; it rarely wins on performance per dollar except at the largest scale.
Q2: What memory is required to train a 13B parameter model with AdamW and mixed precision? Which GPU setup would you use?
Mixed precision AdamW training requires:
- BF16 weights: 2 bytes x 13B = 26 GB
- FP32 gradients: 4 bytes x 13B = 52 GB
- FP32 optimizer states (master weights + m + v): 12 bytes x 13B = 156 GB
- Activations (roughly 2x weights): 52 GB
- Total: ~286 GB
Options:
- 2x A100 80GB SXM with ZeRO Stage 2: Total capacity 160 GB. Under ZeRO Stage 2 each GPU still holds the full 26 GB of weights plus its own ~52 GB of activations, and half of the gradients (26 GB) and optimizer states (78 GB): roughly 182 GB per GPU. That far exceeds 80 GB, so ZeRO Stage 2 on two GPUs does not fit - Stage 3 (or more GPUs) is required.
- 4x A100 80GB SXM with ZeRO Stage 3: Total capacity 320 GB. Optimizer states + gradients + weights all sharded (~58.5 GB per GPU), plus ~13 GB of checkpointed activations: ~72 GB per GPU, comfortable. NVLink ensures fast gradient sync.
- 8x RTX 4090 with ZeRO Stage 3 over PCIe: Total capacity 192 GB, less than the ~234 GB of sharded weights, gradients, and optimizer states, so CPU optimizer offload (ZeRO-Offload) is also needed, and PCIe communication makes the all-reduce slow. Acceptable only if training time is not critical and cost matters.
Recommended for production training: 4x A100 80GB SXM in a single DGX-compatible node with NVLink for fast gradient sync, using ZeRO Stage 3 with activation checkpointing.
Q3: Explain why a 7B model on a single RTX 4090 is faster for inference than a 7B model on 2x RTX 4090.
For autoregressive LLM inference at batch size 1, the bottleneck is memory bandwidth - how fast you can stream model weights from HBM to the arithmetic units.
On a single RTX 4090 (24GB, 1 TB/s bandwidth): all weights are on one GPU. Each decoding step loads all 14GB of FP16 weights once. Time per step: ~14 GB / (1 TB/s x 0.7) = 20 ms. Throughput: ~50 tokens/sec.
On 2x RTX 4090 in tensor parallel: each GPU holds half the model (7 GB). Each decoding step: both GPUs stream their 7 GB in parallel (~10 ms), but must synchronize after each layer via PCIe (64 GB/s bidirectional). A 7B model has 32 transformer layers, and each PCIe all-reduce adds latency. In practice, the communication overhead can exceed the computation time for small models, making 2x RTX 4090 tensor parallel slower than a single card for 7B inference.
The lesson: tensor parallelism requires high inter-GPU bandwidth (NVLink) to be beneficial. For models that fit on a single GPU, a single GPU is faster for inference due to zero communication overhead.
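A toy latency model makes this concrete. The 0.5 ms per-layer sync cost below is an assumed PCIe all-reduce overhead for illustration, not a measured value; the 0.7 factor is the achievable fraction of peak bandwidth:

```python
def decode_step_ms(model_gb, n_gpus, bw_gb_s, layers=32, sync_ms=0.5):
    """Per-token decode time: weights stream in parallel across GPUs,
    but tensor parallelism pays one synchronization per layer.
    sync_ms is an assumed per-layer all-reduce cost over PCIe."""
    load = model_gb / n_gpus / (bw_gb_s * 0.7) * 1000  # ~70% bandwidth efficiency
    sync = 0 if n_gpus == 1 else layers * sync_ms
    return load + sync

# 7B FP16 (14 GB) on RTX 4090-class GPUs (1 TB/s):
print(round(decode_step_ms(14, 1, 1000), 1))  # ~20 ms/token -> ~50 tok/s
print(round(decode_step_ms(14, 2, 1000), 1))  # ~26 ms/token -> slower on 2 GPUs
```

Halving the load time (20 ms to 10 ms) buys less than the 16 ms of accumulated sync overhead costs - communication eats the parallelism gain.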
Q4: A startup wants to serve a 70B model to 1000 requests per minute with average 200-token responses and p95 latency under 5 seconds. Design their inference infrastructure.
First, calculate the throughput requirement:
- 1000 requests/min = 16.7 req/sec
- 200 tokens per request = 3,333 tokens/sec total throughput needed
- With 20% burst headroom: plan for 4,000 tokens/sec
Model memory for 70B:
- FP16: 140 GB - requires 2x H100 80GB or 2x A100 80GB per serving instance
- INT4 (GPTQ): 35 GB - fits on 1x A100 80GB
Per-GPU throughput for 70B FP16 (A100 80GB, bandwidth-bound, batch=32):
- At batch=32, compute-bound crossover may be reached; realistically ~300-500 tokens/sec per GPU pair
- With 2x A100 80GB (160 GB combined VRAM, ~4 TB/s combined bandwidth) and vLLM: ~400 tokens/sec per serving instance
Number of 2x A100 80GB instances needed: 4000 / 400 = 10 instances = 20 A100 80GB GPUs
Alternative with INT4:
- 70B INT4 on 1x A100 80GB: ~200 tokens/sec per GPU with vLLM batching (a single GPU's 2 TB/s streaming 35 GB of weights)
- Need 4000 / 200 = 20 single-GPU instances = 20 A100 80GB GPUs at ~$240k
For p95 latency under 5 seconds at 200 tokens:
- Required: 200 tokens / 5 sec = 40 tokens/sec per request
- With vLLM batching, this is achievable at moderate load
Recommended architecture: 20x A100 80GB (10 paired FP16 instances) serving with vLLM behind a load balancer, scaling horizontally as traffic grows. Total hardware: ~$240k owned, or ~$102/hr on-demand in the cloud - for 1000 req/min sustained that is roughly $75k/month rented, making owned hardware break-even in under 4 months.
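The sizing arithmetic in this answer can be wrapped in a back-of-envelope helper (a sketch; the 1.2x headroom and per-instance throughput are the assumptions stated above):

```python
import math

def serving_plan(req_per_min, tokens_per_req, tok_per_sec_per_instance,
                 gpus_per_instance, headroom=1.2):
    """Back-of-envelope fleet size for an LLM serving deployment."""
    tok_per_sec = req_per_min / 60 * tokens_per_req * headroom
    # round before ceil to dodge float noise at exact boundaries
    instances = math.ceil(round(tok_per_sec / tok_per_sec_per_instance, 9))
    return {"tokens_per_sec": round(tok_per_sec),
            "instances": instances,
            "gpus": instances * gpus_per_instance}

# 1000 req/min, 200-token responses, ~400 tok/s per 2x A100 vLLM instance:
print(serving_plan(1000, 200, 400, 2))
# -> {'tokens_per_sec': 4000, 'instances': 10, 'gpus': 20}
```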
Q5: What is arithmetic intensity and why does it determine whether training or inference is the bottleneck on your GPU?
Arithmetic intensity is the ratio of floating point operations to memory bytes accessed, measured in FLOPs/byte. It determines whether a workload is compute-bound (limited by peak TFLOPS) or memory-bandwidth-bound (limited by peak HBM bandwidth).
A GPU has a "hardware ridge point" - the arithmetic intensity at which it transitions from memory-bound to compute-bound. For the A100 SXM:
- Peak compute: 312 TFLOPS (BF16) = 312e12 FLOP/sec
- Peak bandwidth: 2e12 bytes/sec
- Ridge point: 312e12 / 2e12 = 156 FLOPs/byte
Any workload with arithmetic intensity above 156 FLOPs/byte is compute-bound on an A100. Below 156, it is bandwidth-bound.
Training: A matrix multiply in a 7B model involves a weight matrix of size [4096, 16384] and an activation batch of shape [batch=32, seq=2048, 4096] - 65,536 tokens. The compute is 2 x 65,536 x 4096 x 16384 ≈ 8.8e12 FLOPs. The memory accessed (loading the BF16 weights once) is roughly 4096 x 16384 x 2 ≈ 1.3e8 bytes. Arithmetic intensity: ~65,000 FLOPs/byte - far above the ridge point. Compute-bound. More TFLOPS = faster training.
Inference at batch=1: Same weight matrix, but the activation batch is [1, 1, 4096] (a single token). Compute: 2 x 4096 x 16384 ≈ 1.3e8 FLOPs. Memory accessed: the same ~1.3e8 weight bytes. Arithmetic intensity: ~1 FLOP/byte - far below the ridge point. Bandwidth-bound. More bandwidth = faster inference.
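Counting weight traffic only, the intensity of a dense layer collapses to 2 x tokens / bytes_per_weight, which makes the training-vs-inference contrast easy to check:

```python
def matmul_intensity(tokens, d_in, d_out, bytes_per_weight=2):
    """Arithmetic intensity (FLOPs/byte) of a dense layer, counting
    weight traffic only: 2*T*d_in*d_out FLOPs over d_in*d_out*bytes
    of weight reads - which simplifies to 2*T/bytes_per_weight."""
    flops = 2 * tokens * d_in * d_out
    weight_bytes = d_in * d_out * bytes_per_weight
    return flops / weight_bytes

A100_RIDGE = 312e12 / 2e12  # 156 FLOPs/byte

# Training: batch 32 x seq 2048 = 65,536 tokens through [4096, 16384]:
print(matmul_intensity(32 * 2048, 4096, 16384))  # 65536.0 -> compute-bound
# Inference at batch=1: a single token:
print(matmul_intensity(1, 4096, 16384))          # 1.0 -> bandwidth-bound
```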
This is the fundamental physics that drives every GPU selection decision in production AI.
Q6: When does renting cloud GPUs make more financial sense than owning hardware?
Buying hardware wins when:
- Utilization is high (>60%) and sustained over 2+ years
- You have predictable, steady-state workloads (fine-tune runs on schedule, serving traffic is consistent)
- Your team can manage hardware operations (cooling, maintenance, networking)
- The amortized hardware cost per GPU-hour is lower than cloud spot prices
Renting wins when:
- Workload is bursty or unpredictable (research teams with irregular GPU needs)
- You need to experiment with different GPU types before committing to architecture
- You need H100 clusters for one-time large model training runs (buying H100 hardware for a 2-week run is wasteful)
- Cloud spot pricing is available and your workload tolerates interruption (spot can be 60-70% cheaper than on-demand)
- You want to avoid capital expenditure and prefer OpEx
Break-even analysis for A100 80GB SXM:
- Hardware cost: $12,000 per GPU
- AWS on-demand (p4de.24xlarge): ~$5.12/GPU-hr
- AWS spot: ~$1.53/GPU-hr
- If running 24/7 at 70% utilization for 3 years: 3 x 365 x 24 x 0.7 = 18,396 GPU-hours
- On-demand cost: 18,396 x $5.12 ≈ $94,187 vs buying for $12,000
At 70% sustained utilization over 3 years, buying an A100 is nearly 8x cheaper than on-demand cloud. But at 20% utilization (common in research), spot pricing (~5,256 GPU-hours x $1.53 ≈ $8k over three years) undercuts the $12,000 purchase price - and cloud is far more flexible.
The rule: sustained utilization above 30-40% typically justifies hardware purchase for multi-year deployments. Below that, cloud is usually cheaper when factoring in operational overhead.
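The break-even rule above can be derived rather than memorized: find the utilization at which cloud spend over the horizon equals the purchase price. A sketch that ignores power, cooling, and ops staff (all of which push the real threshold higher):

```python
def breakeven_utilization(hw_cost, cloud_rate_per_hr, years=3):
    """Sustained utilization fraction above which buying beats renting
    over the given horizon. Excludes power/cooling/ops overhead."""
    wall_clock_hours = years * 365 * 24
    return hw_cost / (cloud_rate_per_hr * wall_clock_hours)

# A100 80GB: $12k to buy, $5.12/GPU-hr on-demand, $1.53/GPU-hr spot
print(round(breakeven_utilization(12000, 5.12), 3))  # ~0.089 -> ~9% vs on-demand
print(round(breakeven_utilization(12000, 1.53), 3))  # ~0.298 -> ~30% vs spot
```

Against spot pricing the threshold lands right at the ~30% figure quoted in the rule; against on-demand, buying wins at much lower utilization, which is why the 30-40% guideline assumes you would otherwise be renting spot.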
Summary
GPU selection for ML workloads is not about buying the latest flagship. It is about matching hardware characteristics to workload bottlenecks:
- Training is compute-bound: BF16/FP8 TFLOPS and NVLink bandwidth are the primary specs. H100 SXM or A100 SXM with NVLink is the right choice for serious training workloads. RTX 4090 clusters work for QLoRA and small model training where NVLink is not needed.
- Inference is memory-bandwidth-bound (at small batch sizes): HBM bandwidth per dollar is the primary metric. RTX 3090 (used) and RTX 4090 offer excellent bandwidth-per-dollar for serving models up to 12-13B parameters. A100 and H100 become necessary for larger models that do not fit in consumer VRAM.
- Memory math before hardware decisions: Run model_memory_estimate() before any hardware commitment. The most common expensive mistake in ML infrastructure is under-provisioning because engineers forget optimizer states and activations.
- Quantization changes the game: INT4 GPTQ/AWQ cuts model memory by 4x vs FP16. A 70B model that requires 2x H100 80GB at FP16 fits comfortably on 1x A100 80GB at INT4, with acceptable quality loss for most tasks.
- Cloud vs owned: Break-even is typically around 30-40% sustained utilization over a 3-year horizon. Below that, cloud spot pricing is usually more economical and more flexible.
The engineers who make the best hardware decisions are the ones who understand the first-principles physics: arithmetic intensity, memory hierarchy, and the roofline model. Everything else - benchmark numbers, marketing specs, and peer recommendations - flows from that foundation.
