
GPU Inference vs Training Requirements

The Production Alert at 3 AM

It is 3:07 AM on a Tuesday. Your company just launched a ChatGPT-style product. Fifty thousand users hit it in the first hour. The Slack alerts are firing. The GPU cluster you provisioned - eight H100 SXM5 nodes, the most expensive hardware available - is at 94% GPU utilization, but your median response latency is 4.2 seconds per token. Users are abandoning sessions. Your infrastructure cost is $180,000 per month, and the product still feels slow.

Your CTO pulls up the GPU utilization dashboard. Everything looks fine on paper: compute utilization is high, GPUs are busy. But when your ML infrastructure lead finally digs into the profiling data at 4 AM, she finds something uncomfortable. The GPUs are not bottlenecked on compute. They are waiting for data. Specifically, they are spending most of their time loading the model's 175 billion parameters from high-bandwidth memory every single time they generate one token. The HBM bandwidth - not the Tensor Cores - is the constraint.

You bought a Ferrari for a road trip where the speed limit is 30 mph.

This scenario plays out repeatedly at companies that make the same conceptual error: they treat inference hardware selection as a minor variation of training hardware selection. They are not the same problem. They are not even close to the same problem. Training and inference have fundamentally different computational profiles, different memory access patterns, different batch size requirements, and different cost structures. Understanding this distinction at a deep level is one of the highest-leverage skills an ML infrastructure engineer can develop.

The engineers who truly understand why inference is memory-bandwidth-bound and training is compute-bound will make GPU procurement decisions that save their companies millions of dollars per year. They will also build serving systems that feel fast to users instead of systems that merely look impressive on a hardware spec sheet.

This lesson will give you the complete mental model. By the end, you will be able to look at any LLM serving requirement and make a defensible GPU selection decision - not by following someone else's benchmark, but by reasoning from first principles about where your bottleneck actually lives.


Why This Exists - The Problem with Treating Inference Like Training

Before the modern LLM serving era (pre-2022), most ML inference happened in relatively simple settings. You had a small model, fixed-size batches, and you wanted predictions as fast as possible. The GPU market was driven almost entirely by training workloads. Hardware companies optimized for FLOP/s.

The problem emerged when large language models arrived. LLMs have a fundamentally different computational structure from convolutional networks or small transformer encoders. The autoregressive decode loop - generating one token at a time, each token requiring a full forward pass through the entire model - creates a workload that looks almost nothing like training.

In training, you process large batches. You load each weight once and use it to compute gradients for hundreds or thousands of examples. The ratio of compute operations to memory reads is high. GPUs love this. Their thousands of cores can all be busy simultaneously on different elements of the batch.

In inference decode, you often process one or a few requests at a time (each at a different position in its generation). You load every single weight in the model for each forward pass. But you only do one or a few matrix multiplications with those weights. The ratio of compute to memory reads collapses. Now the GPU is waiting for its high-bandwidth memory (HBM) to deliver bytes faster than the Tensor Cores can process them.

Cloud providers discovered this painfully. A100 clusters purchased for training were pressed into inference service. They worked - technically - but the cost-per-token was far higher than it needed to be. The H100's improvements over the A100 (more FLOP/s, better sparsity, faster NVLink) were largely irrelevant for single-user decode workloads. What mattered was HBM bandwidth, and the A100 80GB SXM had 2 TB/s of that. The H100 SXM5 has 3.35 TB/s. The improvement is meaningful but not proportional to the 3x price difference.

Meanwhile, the L40S - a card designed for graphics and inference, with GDDR6 memory that gives only 864 GB/s bandwidth - turned out to beat the H100 on cost-per-token in many real serving deployments, simply because it is much cheaper and the bandwidth gap becomes less relevant when you batch requests intelligently.

This disconnect between "best training GPU" and "best inference GPU" was not obvious until operators started measuring carefully. This lesson explains the underlying physics so you can reason through it yourself.


Historical Context - How We Learned the Hard Way

The distinction between compute-bound and memory-bandwidth-bound workloads is not new. It was formalized in the Roofline model by Williams, Waterman, and Patterson in their 2009 paper "Roofline: An Insightful Visual Performance Model for Multicore Architectures." The model predicts achievable performance based on two hardware ceilings: peak compute (FLOP/s) and peak memory bandwidth (bytes/second).

The insight: every algorithm has an arithmetic intensity measured in FLOP per byte. If your algorithm's arithmetic intensity is below the machine's "ridge point" (where compute ceiling and bandwidth ceiling cross), you are memory-bandwidth-bound. Above the ridge point, you are compute-bound.

For a decade this was mainly relevant to high-performance computing researchers. GPU vendors designed for compute-bound workloads because that is what deep learning training looked like.

The LLM era changed the calculus. Noam Shazeer's 2017 paper "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer" and the subsequent scaling of dense transformers created models where serving a single user required loading billions of parameters from memory per generated token. The autoregressive generation pattern meant batch sizes were often 1 during decode.

By 2022, engineers at Google, Meta, and the major cloud providers began publishing detailed analyses of LLM inference costs. Philippe Tillet at OpenAI developed Triton, which made it practical to write custom, memory-efficient GPU kernels in Python. The vLLM team at UC Berkeley (Kwon et al., 2023) published the PagedAttention paper, which explicitly characterized the memory-bandwidth constraints of KV cache management. Aakanksha Chowdhery et al.'s PaLM paper (2022) included careful analysis of TPU inference efficiency. By 2023, the field had converged on a clear understanding: LLM decode is memory-bandwidth-bound, prefill is compute-bound, and hardware selection must account for both.

The "aha moment" for most practitioners was calculating the arithmetic intensity of a single decode step with batch size 1 and seeing that it sits far below the ridge point of any modern GPU. The math is simple and illuminating - we will work through it fully below.


Core Concepts - Arithmetic Intensity and the Roofline Model

What the Roofline Model Says

Every GPU has two fundamental limits:

  1. Peak compute throughput - measured in FLOP/s (floating-point operations per second)
  2. Peak memory bandwidth - measured in bytes/second

The roofline model predicts achievable performance $P$ as:

$$P = \min\left(\text{Peak FLOP/s},\ \text{Bandwidth} \times I\right)$$

where $I$ is the arithmetic intensity of the workload in FLOP/byte.

The ridge point is the intensity at which the two ceilings are equal:

$$I_{\text{ridge}} = \frac{\text{Peak FLOP/s}}{\text{Bandwidth (bytes/s)}}$$

For an H100 SXM5:

  • Peak FP16 Tensor Core FLOP/s: $\approx 989 \times 10^{12}$ FLOP/s
  • HBM3 bandwidth: $\approx 3.35 \times 10^{12}$ bytes/s
  • Ridge point: $I_{\text{ridge}} \approx 989 / 3.35 \approx 295$ FLOP/byte

If your workload has arithmetic intensity below 295 FLOP/byte on an H100, you are memory-bandwidth-bound. The Tensor Cores are waiting for data.
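To make this concrete, here is a minimal sketch that computes the ridge point from the approximate peak specs quoted in this lesson (real kernels rarely reach either ceiling, so treat the results as upper bounds):

# Minimal sketch: ridge points from the approximate peak specs quoted in this
# lesson. Real kernels rarely reach either ceiling, so treat these as bounds.
GPU_SPECS = {
    "H100 SXM5": (989.0, 3.35),   # (peak FP16 TFLOP/s, memory bandwidth TB/s)
    "A100 SXM4": (312.0, 2.00),
    "L40S":      (362.0, 0.864),
}

for name, (tflops, bw_tbs) in GPU_SPECS.items():
    ridge = (tflops * 1e12) / (bw_tbs * 1e12)   # FLOP per byte
    print(f"{name:10s}  ridge point ~ {ridge:6.0f} FLOP/byte")

# Prints roughly: H100 SXM5 ~ 295, A100 SXM4 ~ 156, L40S ~ 419 FLOP/byte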

Arithmetic Intensity of a Matrix Multiply

A matrix multiply $C = A \times B$ where $A \in \mathbb{R}^{M \times K}$ and $B \in \mathbb{R}^{K \times N}$:

  • Operations: $2MKN$ FLOP (counting each multiply-add as 2 FLOP)
  • Memory traffic: $(MK + KN + MN)$ elements $\times$ bytes per element

For FP16 (2 bytes per element):

$$I = \frac{2MKN}{2(MK + KN + MN)}$$

When $M$ is large (big batch), the $MK$ and $MN$ terms dominate the denominator and the intensity approaches $\frac{KN}{K + N}$ - on the order of $K$, i.e. thousands of FLOP/byte for transformer-sized matrices. Compute-bound.

When $M = 1$ (batch size 1, single decode step):

$$I = \frac{2KN}{2(K + KN + N)} \approx \frac{2KN}{2KN} = 1 \text{ FLOP/byte}$$

One FLOP per byte. The H100's ridge point is 295 FLOP/byte. You are getting less than 1% of available compute throughput. The other 99% of your expensive GPU is idle, waiting for HBM to deliver the weight matrix.

This single calculation explains everything about why LLM decode is slow at small batch sizes - and why throwing more FLOP/s at it does not help.
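The collapse is easy to verify numerically. Here is a minimal sketch of the same formula, evaluated for an illustrative 70B-class FFN projection (K = 8192, N = 28672 are assumed dimensions):

# Sketch: arithmetic intensity of an FP16 matmul [M, K] x [K, N] as M grows.
def matmul_intensity(M: int, K: int, N: int, bytes_per_elem: int = 2) -> float:
    flops = 2 * M * K * N
    bytes_moved = (M * K + K * N + M * N) * bytes_per_elem
    return flops / bytes_moved

K, N = 8192, 28672          # illustrative 70B-class FFN projection
for M in [1, 8, 64, 512, 4096]:
    print(f"M = {M:4d}:  I ~ {matmul_intensity(M, K, N):8.1f} FLOP/byte")

# M=1 gives roughly 1 FLOP/byte; only when M reaches the hundreds does the
# intensity approach the H100 ridge point of ~295 FLOP/byte.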

The Two Phases of LLM Inference

LLM inference is not one workload. It is two distinct phases with completely different computational profiles:

Prefill Phase:

  • All input tokens processed simultaneously
  • Matrix multiplies have shape [batch_size, seq_len, d_model] x [d_model, d_ff]
  • Sequence length acts like batch size - can be hundreds or thousands
  • Arithmetic intensity is high: compute-bound
  • Looks like training (forward pass only, no gradient, but same compute profile)

Decode Phase:

  • One token generated per step
  • Matrix multiplies have shape [batch_size, 1, d_model] x [d_model, d_ff]
  • The "1" collapses the intensity
  • Memory-bandwidth-bound unless batch sizes are very large (typically > 100-200 for H100)
  • This is where latency is spent for long outputs

For a typical user request with 200 input tokens and 500 output tokens, the prefill takes maybe 50 ms (fast) and the decode takes 5 seconds (slow). The bottleneck is entirely in decode.

Prefill: [############################] 50ms - compute bound
Decode: [############################][############################][...x500] - each step is 10ms
Total decode: 5,000ms
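Those timings can be roughed out from the roofline ceilings alone. The sketch below assumes FP16 weights, 8-way tensor-parallel H100s, roughly half of peak actually achieved, and ignores attention FLOPs, KV-cache traffic, and communication - it is an illustration, not a benchmark:

# Sketch: rough prefill vs decode time for a single request from roofline limits.
# Assumptions: FP16 weights, 8-way tensor parallel H100s, ~50% of peak achieved,
# no attention FLOPs, no KV-cache traffic, no TP communication cost.
PARAMS      = 70e9            # parameters
BYTES_PER_P = 2               # FP16
PEAK_FLOPS  = 989e12          # per-GPU FP16 Tensor peak (H100 SXM5)
HBM_BW      = 3.35e12         # per-GPU HBM bytes/s (H100 SXM5)
TP_GPUS     = 8               # assumed tensor-parallel width
EFF         = 0.5             # assumed fraction of peak actually achieved

prompt_tokens, output_tokens = 200, 500

# Prefill: ~2 FLOP per parameter per prompt token, compute-bound.
prefill_s = (2 * PARAMS * prompt_tokens) / (TP_GPUS * PEAK_FLOPS * EFF)

# Decode: each step streams every weight byte from HBM once, bandwidth-bound.
step_s = (PARAMS * BYTES_PER_P) / (TP_GPUS * HBM_BW * EFF)

print(f"prefill ~ {prefill_s * 1e3:.0f} ms total")
print(f"decode  ~ {step_s * 1e3:.1f} ms/token, "
      f"{step_s * output_tokens:.1f} s for {output_tokens} tokens")

Even with generous assumptions for prefill, the decode loop dominates end-to-end latency for any output of meaningful length.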

Batch Size as the Critical Variable

Batch size is the lever that converts a memory-bandwidth-bound workload into a compute-bound workload.

At batch size 1: $I \approx 1$ FLOP/byte - deep in memory-bound territory.

At batch size $B$, for a weight matrix of shape $[K, N]$:

$$I = \frac{2BKN}{2(BK + KN + BN)} \approx \frac{2BKN}{2KN} = B \text{ FLOP/byte}$$

(when $BK \ll KN$ and $BN \ll KN$, which holds when $B$ is much smaller than both $K$ and $N$)

So arithmetic intensity scales approximately linearly with batch size. For the H100 with ridge point at 295 FLOP/byte, you need roughly $B \approx 295$ to saturate the Tensor Cores. In practice, memory for KV caches limits you before you reach this - but the direction is clear.

This is why continuous batching is such a powerful technique for LLM serving: it aggregates many users' decode steps into a single GPU kernel call, driving up the effective batch size and utilization.

GPU Memory Capacity vs Bandwidth

For inference serving, you need to fit the model in GPU memory. For a 70B parameter model in FP16:

$$\text{Model size} = 70 \times 10^9 \times 2 \text{ bytes} = 140 \text{ GB}$$

An A100 80GB cannot fit this. A single H100 80GB cannot fit this. You need multi-GPU serving. But KV cache on top of this can easily add another 20-40 GB per batch at sequence lengths of 4096+. Memory capacity constrains your maximum batch size, which constrains your utilization.
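A quick capacity check along these lines is worth scripting before any procurement conversation. The sketch below uses an assumed KV-cache budget and a 90% usable-VRAM factor - replace both with your own numbers:

import math

# Sketch: minimum GPU count from memory capacity alone (hypothetical helper).
# The KV-cache budget and usable-VRAM fraction are assumptions, not constants.
def min_gpus_for_model(params_billions: float, precision_bytes: int,
                       kv_cache_gb: float, vram_per_gpu_gb: float,
                       usable_fraction: float = 0.9) -> int:
    weights_gb = params_billions * precision_bytes   # e.g. 70 * 2 = 140 GB
    total_gb = weights_gb + kv_cache_gb
    return math.ceil(total_gb / (vram_per_gpu_gb * usable_fraction))

# 70B in FP16 with a ~30 GB KV-cache budget on 80 GB cards:
print(min_gpus_for_model(70, 2, kv_cache_gb=30, vram_per_gpu_gb=80))   # -> 3
# The same model quantized to INT8:
print(min_gpus_for_model(70, 1, kv_cache_gb=30, vram_per_gpu_gb=80))   # -> 2

In practice you round up to an even tensor-parallel width (2, 4, 8), but the capacity floor is the starting point.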

This is where the tradeoff between different GPU SKUs becomes concrete.


GPU Selection Framework - Comparing Real Hardware

Let us build a concrete comparison using real hardware specs.

| GPU | Memory type | Bandwidth | VRAM | FP16 FLOP/s | TDP | List price (est.) |
|---|---|---|---|---|---|---|
| H100 SXM5 | HBM3 | 3.35 TB/s | 80 GB | 989 TFLOP/s | 700 W | $30,000+ |
| H100 PCIe | HBM2e | 2.0 TB/s | 80 GB | 756 TFLOP/s | 350 W | $25,000+ |
| A100 SXM4 | HBM2e | 2.0 TB/s | 80 GB | 312 TFLOP/s | 400 W | $10,000+ |
| L40S | GDDR6 | 864 GB/s | 48 GB | 362 TFLOP/s | 350 W | $8,000+ |
| A10G | GDDR6 | 600 GB/s | 24 GB | 125 TFLOP/s | 150 W | $3,500+ |
| RTX 4090 | GDDR6X | 1.0 TB/s | 24 GB | 165 TFLOP/s | 450 W | $1,600 |

For memory-bandwidth-bound decode, the relevant column is bandwidth per dollar:

  • H100 SXM5: 3.35 TB/s / $30k = 0.112 GB/s per dollar
  • A100 SXM4: 2.0 TB/s / $10k = 0.200 GB/s per dollar
  • L40S: 0.864 TB/s / $8k = 0.108 GB/s per dollar
  • RTX 4090: 1.0 TB/s / $1.6k = 0.625 GB/s per dollar

The RTX 4090 has the best bandwidth per dollar by a wide margin. But it lacks HBM, has only 24 GB VRAM, and cannot fit large models without aggressive quantization. For small models (7B-13B), it is genuinely competitive for throughput-optimized serving.

The A100 80GB SXM4 offers the best bandwidth-per-dollar among data center cards for decode workloads. The H100 SXM5's premium is largely justified by prefill throughput (where its higher FLOP/s matters) and NVLink bandwidth (for multi-GPU workloads).
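The ratios above are easy to recompute as street prices move - this sketch simply divides the table's bandwidth column by its estimated price column:

# Sketch: bandwidth per dollar from the table above (prices are rough estimates).
cards = [
    # (name, bandwidth GB/s, estimated price USD)
    ("H100 SXM5", 3350, 30_000),
    ("A100 SXM4", 2000, 10_000),
    ("L40S",       864,  8_000),
    ("RTX 4090",  1000,  1_600),
]

for name, bw_gbs, price in cards:
    print(f"{name:10s}  {bw_gbs / price:.3f} GB/s per dollar")
# H100 ~ 0.112, A100 ~ 0.200, L40S ~ 0.108, RTX 4090 ~ 0.625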


Code Examples - Measuring Arithmetic Intensity and Bottlenecks

Calculating Arithmetic Intensity for Your Model

"""
arithmetic_intensity.py - Calculate roofline position for your LLM.
"""
from dataclasses import dataclass
from typing import Optional

@dataclass
class GPUSpec:
    name: str
    peak_flops_fp16: float   # TFLOP/s
    memory_bandwidth: float  # TB/s
    vram_gb: float

    @property
    def ridge_point(self) -> float:
        """
        Ridge point in FLOP/byte. Above this = compute bound.
        Below this = memory bandwidth bound.
        """
        return (self.peak_flops_fp16 * 1e12) / (self.memory_bandwidth * 1e12)


@dataclass
class TransformerConfig:
    name: str
    num_layers: int
    hidden_dim: int           # d_model
    ffn_dim: int              # d_ff (usually 4 * hidden_dim)
    num_heads: int
    head_dim: int             # usually hidden_dim / num_heads
    vocab_size: int
    precision_bytes: int = 2  # 2 = FP16, 1 = INT8, 4 = FP32

    @property
    def total_params_billions(self) -> float:
        # Approximation: 4*L*d^2 attention + 2*L*d*d_ff FFN + embeddings
        # (about 12*L*d^2 when d_ff = 4d); gated FFNs and GQA models differ somewhat.
        attn_params = self.num_layers * 4 * self.hidden_dim * self.hidden_dim
        ffn_params = self.num_layers * 2 * self.hidden_dim * self.ffn_dim
        embed_params = self.vocab_size * self.hidden_dim
        return (attn_params + ffn_params + embed_params) / 1e9

    @property
    def model_size_gb(self) -> float:
        return self.total_params_billions * 1e9 * self.precision_bytes / 1e9


def compute_decode_intensity(model: TransformerConfig, batch_size: int) -> float:
    """
    Arithmetic intensity for one decode step (single token generation).
    Returns FLOP per byte.
    """
    B = batch_size
    d = model.hidden_dim
    L = model.num_layers
    d_ff = model.ffn_dim
    bytes_per_elem = model.precision_bytes

    # Per layer: attention projections (4 matmuls) + FFN (2 matmuls)
    # Each is [B, d] x [d, d] or [B, d] x [d, d_ff]
    # FLOP per layer:
    flops_attn = 4 * 2 * B * d * d       # Q, K, V, O projections
    flops_ffn = 2 * 2 * B * d * d_ff     # up, down projections

    # Memory reads per layer (weights only, dominant term):
    bytes_attn = 4 * d * d * bytes_per_elem
    bytes_ffn = 2 * d * d_ff * bytes_per_elem

    # Total across all layers:
    total_flops = L * (flops_attn + flops_ffn)
    total_bytes = L * (bytes_attn + bytes_ffn)

    return total_flops / total_bytes


def estimate_decode_throughput(
    model: TransformerConfig,
    gpu: GPUSpec,
    batch_size: int,
) -> dict:
    """
    Estimate decode throughput given hardware and workload.
    """
    intensity = compute_decode_intensity(model, batch_size)
    ridge = gpu.ridge_point

    # Achievable FLOP/s (roofline model)
    if intensity >= ridge:
        # Compute bound
        achievable_flops = gpu.peak_flops_fp16 * 1e12
        bottleneck = "compute"
    else:
        # Memory bandwidth bound
        achievable_flops = gpu.memory_bandwidth * 1e12 * intensity
        bottleneck = "memory_bandwidth"

    # Total FLOP for one decode step across all layers:
    B = batch_size
    d = model.hidden_dim
    L = model.num_layers
    d_ff = model.ffn_dim
    total_flops_per_step = L * (
        4 * 2 * B * d * d +
        2 * 2 * B * d * d_ff
    )

    # Time per step in seconds:
    time_per_step_s = total_flops_per_step / achievable_flops

    # Tokens per second:
    tokens_per_second = batch_size / time_per_step_s

    return {
        "arithmetic_intensity": round(intensity, 2),
        "ridge_point": round(ridge, 2),
        "bottleneck": bottleneck,
        "utilization_pct": round(min(intensity / ridge, 1.0) * 100, 1),
        "time_per_step_ms": round(time_per_step_s * 1000, 2),
        "tokens_per_second": round(tokens_per_second, 1),
    }


# --- Example Usage ---

llama3_70b = TransformerConfig(
    name="LLaMA-3 70B",
    num_layers=80,
    hidden_dim=8192,
    ffn_dim=28672,
    num_heads=64,
    head_dim=128,
    vocab_size=128256,
    precision_bytes=2,  # FP16
)

h100_sxm = GPUSpec(
    name="H100 SXM5",
    peak_flops_fp16=989.4,  # TFLOP/s
    memory_bandwidth=3.35,  # TB/s
    vram_gb=80,
)

a100_sxm = GPUSpec(
    name="A100 SXM4",
    peak_flops_fp16=312.0,
    memory_bandwidth=2.0,
    vram_gb=80,
)

l40s = GPUSpec(
    name="L40S",
    peak_flops_fp16=362.0,
    memory_bandwidth=0.864,
    vram_gb=48,
)

print(f"Model: {llama3_70b.name}")
print(f"  Params: {llama3_70b.total_params_billions:.1f}B")
print(f"  Model size (FP16): {llama3_70b.model_size_gb:.1f} GB")
print()

for gpu in [h100_sxm, a100_sxm, l40s]:
    print(f"GPU: {gpu.name} (ridge point: {gpu.ridge_point:.0f} FLOP/byte)")
    for bs in [1, 8, 32, 128]:
        result = estimate_decode_throughput(llama3_70b, gpu, bs)
        print(
            f"  batch={bs:3d}: "
            f"{result['arithmetic_intensity']:5.1f} FLOP/byte | "
            f"{result['bottleneck']:20s} | "
            f"{result['utilization_pct']:5.1f}% util | "
            f"{result['tokens_per_second']:7.1f} tok/s"
        )
    print()

Profiling Actual GPU Bottlenecks with PyTorch

"""
profile_bottleneck.py - Use PyTorch profiler to identify decode bottleneck.
"""
import torch
import time
from contextlib import contextmanager

def matmul_benchmark(M: int, K: int, N: int, dtype=torch.float16,
                     num_warmup=10, num_iters=100):
    """
    Benchmark a single matrix multiply and measure achieved throughput.
    """
    device = "cuda"
    A = torch.randn(M, K, dtype=dtype, device=device)
    B = torch.randn(K, N, dtype=dtype, device=device)

    # Warmup
    for _ in range(num_warmup):
        _ = torch.mm(A, B)
    torch.cuda.synchronize()

    # Timed run
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    start.record()
    for _ in range(num_iters):
        _ = torch.mm(A, B)
    end.record()
    torch.cuda.synchronize()

    elapsed_ms = start.elapsed_time(end) / num_iters
    elapsed_s = elapsed_ms / 1000.0

    # FLOP and bytes
    flops = 2 * M * K * N
    bytes_accessed = (M * K + K * N + M * N) * 2  # FP16 = 2 bytes

    achieved_tflops = flops / elapsed_s / 1e12
    achieved_bandwidth_tbs = bytes_accessed / elapsed_s / 1e12
    arithmetic_intensity = flops / bytes_accessed

    return {
        "M": M, "K": K, "N": N,
        "elapsed_ms": round(elapsed_ms, 3),
        "arithmetic_intensity": round(arithmetic_intensity, 2),
        "achieved_tflops": round(achieved_tflops, 2),
        "achieved_bw_tbs": round(achieved_bandwidth_tbs, 3),
    }


# Simulate different batch sizes for a 7B model
# d_model=4096, d_ff=11008 (LLaMA-2 7B)
K, N = 4096, 11008

print("Batch | AI (FLOP/byte) | Achieved TF/s | Achieved BW (TB/s) | Time (ms)")
print("-" * 80)
for bs in [1, 2, 4, 8, 16, 32, 64, 128, 256]:
    result = matmul_benchmark(bs, K, N)
    print(
        f" {bs:4d} | {result['arithmetic_intensity']:14.2f} | "
        f"{result['achieved_tflops']:13.2f} | "
        f"{result['achieved_bw_tbs']:18.3f} | "
        f"{result['elapsed_ms']:9.3f}"
    )

Estimating Serving Cost Per Token

"""
cost_per_token.py - Compare serving costs across GPU options.
"""

def tokens_per_second_estimate(
    model_params_b: float,           # billions
    precision_bytes: int,            # 2=FP16, 1=INT8
    gpu_bandwidth_tbs: float,        # TB/s
    batch_size: int,
    efficiency_factor: float = 0.7,  # real-world vs theoretical
) -> float:
    """
    Estimate tokens per second using the bandwidth-bound model.
    Assumes memory-bound regime (valid for typical serving batch sizes).
    """
    model_bytes = model_params_b * 1e9 * precision_bytes
    # At batch size B, we load the model once per step but produce B tokens.
    # Time per step = model_bytes / effective_bandwidth
    effective_bw = gpu_bandwidth_tbs * 1e12 * efficiency_factor
    time_per_step_s = model_bytes / effective_bw
    return batch_size / time_per_step_s


def cost_per_million_tokens(
    gpu_name: str,
    model_params_b: float,
    precision_bytes: int,
    gpu_bandwidth_tbs: float,
    gpu_hourly_cost_usd: float,
    batch_size: int,
    num_gpus_required: int,
) -> dict:
    # Tensor parallelism: each GPU streams only 1/N of the weights per step, so
    # step time shrinks roughly linearly with N for bandwidth-bound decode
    # (all-reduce communication overhead is ignored in this simplified model).
    cluster_tps = tokens_per_second_estimate(
        model_params_b / num_gpus_required,
        precision_bytes,
        gpu_bandwidth_tbs,
        batch_size,
    )

    hourly_tokens = cluster_tps * 3600
    million_tokens_per_hour = hourly_tokens / 1e6
    cluster_hourly_cost = gpu_hourly_cost_usd * num_gpus_required

    cost_per_million = cluster_hourly_cost / million_tokens_per_hour

    return {
        "gpu": gpu_name,
        "num_gpus": num_gpus_required,
        "tokens_per_second": round(cluster_tps, 1),
        "cost_per_million_tokens_usd": round(cost_per_million, 2),
        "cluster_hourly_cost_usd": round(cluster_hourly_cost, 2),
    }


configs = [
    # (name, BW TB/s, hourly $, num_gpus for 70B FP16)
    ("H100 SXM5", 3.35, 8.00, 2),
    ("A100 SXM4", 2.00, 3.50, 2),
    ("L40S", 0.864, 2.20, 4),   # 48 GB x4 to fit 70B FP16
    ("A10G", 0.600, 1.50, 8),   # 24 GB x8 to fit 70B FP16
]

model_params_b = 70.0
precision_bytes = 2  # FP16
batch_size = 32

print(f"Model: LLaMA-3 70B FP16, batch size {batch_size}")
print(f"{'GPU':<15} {'# GPUs':<8} {'Tok/s':<12} {'$/M tokens':<15} {'$/hr cluster'}")
print("-" * 65)

for name, bw, hourly, num_gpus in configs:
    result = cost_per_million_tokens(
        name, model_params_b, precision_bytes,
        bw, hourly, batch_size, num_gpus
    )
    print(
        f"{result['gpu']:<15} {result['num_gpus']:<8} "
        f"{result['tokens_per_second']:<12.1f} "
        f"${result['cost_per_million_tokens_usd']:<14.2f} "
        f"${result['cluster_hourly_cost_usd']:.2f}"
    )

Architecture Diagrams

[Diagram: Prefill vs Decode Computational Profile]

[Diagram: GPU Hardware Selection Decision Tree]


Production Engineering Notes

Measuring Real Bottlenecks Before Buying Hardware

Never make GPU procurement decisions based on spec sheets alone. Always run your actual workload profile through nsys or ncu before committing.

# Profile a vLLM serving request with Nsight Systems
nsys profile \
--trace=cuda,nvtx,cudnn \
--output=inference_profile \
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3-70b-instruct \
--tensor-parallel-size 2 \
--max-model-len 4096

# Profile memory bandwidth utilization with ncu
ncu --metrics \
sm__throughput.avg.pct_of_peak_sustained_elapsed,\
dram__throughput.avg.pct_of_peak_sustained_elapsed,\
gpu__time_duration.sum \
--target-processes all \
python your_inference_script.py

Look for dram__throughput near 100% and sm__throughput well below 100%. That confirms memory-bandwidth-bound operation.

KV Cache Memory Math

KV cache memory per token, per layer, for a model with $H$ attention heads of dimension $d_{head}$, stored in FP16:

$$\text{KV bytes per token per layer} = 2 \times H \times d_{head} \times 2 \text{ bytes} = 4 H d_{head}$$

For LLaMA-3 70B: $H = 64$ heads, $d_{head} = 128$, $L = 80$ layers:

$$\text{KV bytes per token} = 80 \times 4 \times 64 \times 128 = 2{,}621{,}440 \text{ bytes} \approx 2.5 \text{ MB per token}$$

For a batch of 32 requests each generating 2048 tokens:

$$32 \times 2048 \times 2.5 \text{ MB} = 163{,}840 \text{ MB} \approx 160 \text{ GB}$$

This is why 70B inference typically needs 4x 80GB GPUs in production: 140 GB for weights + 160 GB for KV cache at moderate batch sizes. (LLaMA-3 70B actually uses grouped-query attention with 8 KV heads, which shrinks the cache by roughly 8x; the full multi-head figure above illustrates the worst case.) Memory capacity constrains your batch size ceiling, which constrains your GPU utilization ceiling.
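The same arithmetic is easy to script. The sketch below reproduces the full multi-head numbers above and, for comparison, the grouped-query configuration (8 KV heads) that LLaMA-3 70B ships with:

# Sketch: KV cache sizing. 2 (K and V) * kv_heads * head_dim * 2 bytes (FP16),
# per token, per layer.
def kv_cache_gb(num_layers: int, num_kv_heads: int, head_dim: int,
                batch_size: int, seq_len: int, bytes_per_elem: int = 2) -> float:
    per_token = num_layers * 2 * num_kv_heads * head_dim * bytes_per_elem
    return per_token * batch_size * seq_len / 1e9

# Full multi-head attention (64 KV heads), as in the worked example above:
print(kv_cache_gb(80, 64, 128, batch_size=32, seq_len=2048))   # ~172 GB (~160 GB in the rounded math above)
# Grouped-query attention with 8 KV heads (LLaMA-3 70B's actual configuration):
print(kv_cache_gb(80, 8, 128, batch_size=32, seq_len=2048))    # ~21 GB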

Continuous Batching vs Static Batching

Static batching (naive approach): wait until you have a full batch, run it, wait for all requests to finish. Fast requests are blocked waiting for slow ones. GPU is underutilized between batches.

Continuous batching (vLLM, TGI): new requests join the batch at any step. Finished requests are immediately replaced. GPU is always running. Effective batch size stays high.

The throughput difference can be 3-5x for real traffic distributions where request lengths vary widely. This is the primary reason to use vLLM or TGI rather than rolling your own serving stack.
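A toy simulation makes the effect visible. This is a sketch with a made-up request-length distribution, not a model of any particular scheduler:

import random

# Toy simulation: static vs continuous batching, measured in GPU decode steps.
# Request lengths are drawn from an assumed long-tailed distribution.
random.seed(0)
lengths = [int(random.expovariate(1 / 200)) + 10 for _ in range(256)]
MAX_BATCH = 32

def static_batching_steps(lengths, max_batch):
    # Fixed groups: every group runs until its longest request finishes.
    return sum(max(lengths[i:i + max_batch])
               for i in range(0, len(lengths), max_batch))

def continuous_batching_steps(lengths, max_batch):
    # Finished requests are replaced immediately, so slots stay full
    # while any work remains.
    pending, active, steps = list(lengths), [], 0
    while pending or active:
        while pending and len(active) < max_batch:
            active.append(pending.pop())
        steps += 1
        active = [t - 1 for t in active if t > 1]
    return steps

total_tokens = sum(lengths)
for name, fn in [("static", static_batching_steps),
                 ("continuous", continuous_batching_steps)]:
    steps = fn(lengths, MAX_BATCH)
    print(f"{name:10s}: {steps:5d} steps, "
          f"mean effective batch ~ {total_tokens / steps:.1f}")

Because a single long request pins an entire static batch, the gap between the two schedulers grows with the tail of the length distribution.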

Multi-GPU Tensor Parallelism vs Pipeline Parallelism

Tensor Parallelism (TP): Split weight matrices across GPUs. Each GPU holds a column shard of the weight matrix. For a $[d, 4d]$ FFN weight split across 4 GPUs, each GPU holds $[d, d]$. Every GPU participates in every layer. Reduces per-step latency but requires high-bandwidth inter-GPU communication (NVLink ideal).

Pipeline Parallelism (PP): Split layers across GPUs. GPU 0 holds layers 1-20, GPU 1 holds layers 21-40, etc. Reduces communication bandwidth needed but introduces pipeline bubbles. Better for throughput than latency.

For single-server multi-GPU with NVLink, TP is generally preferred for latency-sensitive serving. For multi-node, PP is used to avoid network bottlenecks.

NVLink bandwidth matters significantly for TP: the H100 SXM5's 900 GB/s NVLink bandwidth vs PCIe 4.0's 64 GB/s bandwidth means TP communication overhead is much lower in SXM configurations.
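A rough per-layer estimate of that communication cost (a sketch using a ring all-reduce cost model; latency terms, NCCL implementation details, and overlap with compute are ignored):

# Sketch: per-layer tensor-parallel all-reduce time, NVLink vs PCIe, using a
# simple ring all-reduce cost model (latency terms and compute overlap ignored).
def allreduce_ms(batch_tokens: int, hidden_dim: int, tp_gpus: int,
                 link_gb_per_s: float) -> float:
    payload_bytes = batch_tokens * hidden_dim * 2          # FP16 activations
    moved_bytes = 2 * (tp_gpus - 1) / tp_gpus * payload_bytes
    return moved_bytes / (link_gb_per_s * 1e9) * 1e3       # milliseconds

for link, gb_s in [("NVLink (900 GB/s)", 900), ("PCIe 4.0 (64 GB/s)", 64)]:
    t = allreduce_ms(batch_tokens=64, hidden_dim=8192, tp_gpus=4, link_gb_per_s=gb_s)
    print(f"{link:18s}: ~{t:.3f} ms per all-reduce")

# With two all-reduces per transformer layer and 80 layers, the slower link's
# overhead recurs on every single decode step.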


Common Mistakes

:::danger Confusing GPU Utilization with Efficiency High GPU utilization reported by nvidia-smi does not mean your GPU is efficiently used. It means the GPU is not idle. A GPU spending 90% of its time waiting for HBM to deliver data will show high utilization. Use ncu to measure arithmetic intensity and memory bandwidth utilization - these are the real efficiency metrics. :::

:::danger Buying H100s for Single-User Interactive Serving For a serving system where requests arrive one at a time with small batches (B < 10), an H100 SXM5 will perform nearly identically to an A100 80GB at 3x the price. The H100's 3x higher FLOP/s is irrelevant when the workload is memory-bandwidth-bound. Measure before you commit to a procurement decision. :::

:::warning Ignoring KV Cache Memory in Capacity Planning Engineers commonly plan GPU capacity by fitting the model weights, then discovering in production that KV cache for real traffic patterns needs 2-3x the weight memory. Always calculate maximum KV cache size at your target batch size and sequence length before finalizing hardware selection. :::

:::warning Treating Prefill and Decode as One Workload Prefill is compute-bound and fast. Decode is memory-bandwidth-bound and slow per step. A serving system that optimizes only for one phase will leave performance on the table. Systems like disaggregated prefill/decode (DistServe, Splitwise) physically separate the two phases onto different hardware to maximize utilization of both. :::

:::warning Underestimating the Impact of Batch Size The difference between batch size 1 and batch size 64 is the difference between 0.3% GPU utilization and 20% GPU utilization for a 70B model. Continuous batching, request queuing, and even artificial batching delays are all worthwhile when they increase effective batch size. Design your serving stack to maximize batch size before optimizing anything else. :::


Interview Q&A

Q1: Why is LLM inference during the decode phase memory-bandwidth-bound rather than compute-bound?

Answer: During decode, the model generates one token at a time. Each step requires a full forward pass through all model layers. For each layer, you load the full weight matrix (e.g., $[d_{model}, d_{ff}]$, which might be $[8192, 28672]$ for a 70B model) from GPU HBM and multiply it with the current token's hidden state vector $[1, 8192]$. The FLOP count for this multiply is $2 \times 1 \times 8192 \times 28672 \approx 470\text{M}$ FLOP. The bytes loaded are $8192 \times 28672 \times 2 \approx 470\text{M}$ bytes. Arithmetic intensity: $470\text{M} / 470\text{M} = 1$ FLOP/byte.

The H100's ridge point is approximately 295 FLOP/byte. At 1 FLOP/byte, you are using less than 0.4% of available compute throughput. The Tensor Cores sit idle while HBM delivers data. The only way to move out of memory-bound territory is to increase batch size, which amortizes the weight loading cost across more output tokens per step.

Q2: How does batch size affect the compute vs memory-bandwidth bound transition for LLM inference?

Answer: Batch size $B$ scales arithmetic intensity approximately linearly. A weight matrix $W \in \mathbb{R}^{K \times N}$ multiplied by activations $X \in \mathbb{R}^{B \times K}$ has:

  • FLOP: $2BKN$
  • Bytes: $2(BK + KN + BN)$
  • Intensity $\approx B$ FLOP/byte when $B \ll K$

To reach the H100's ridge point of ~295 FLOP/byte, you need batch size ~295. In practice, KV cache memory limits batch size well before this for large models. For a 70B model with 4k sequence length, you might be KV-cache-limited at batch size 32-64, giving intensity of 32-64 FLOP/byte - still solidly memory-bandwidth-bound. This is why continuous batching, multi-query attention (reducing KV cache size), and quantization (reducing weight size) are all important: they let you reach higher effective batch sizes before hitting memory limits.

Q3: A company is choosing between H100 SXM5 and A100 SXM4 for LLM inference serving of a 13B parameter model. What is your recommendation and how do you arrive at it?

Answer: For a 13B model, the analysis starts with memory footprint: 13B parameters x 2 bytes (FP16) = 26 GB, which fits comfortably on a single 80GB card with plenty of room for KV cache.

Next, analyze the workload regime. At typical serving batch sizes (8-32), arithmetic intensity is approximately 8-32 FLOP/byte, well below the A100's ridge point of ~156 FLOP/byte. Both GPUs are memory-bandwidth-bound for this workload.

Bandwidth comparison: H100 SXM5 has 3.35 TB/s, A100 SXM4 has 2.0 TB/s. The H100 provides about 1.68x more tokens per second for the same batch size in the memory-bound regime.

Cost comparison: H100 SXM5 typically costs 2.5-3x more than A100 SXM4. If the bandwidth improvement is 1.68x but the cost is 2.7x, the A100 delivers about 1.6x better tokens-per-dollar.

Recommendation: A100 80GB SXM4 unless (a) request volumes are high enough that prefill latency (compute-bound, where H100 wins more decisively) is a user-visible bottleneck, or (b) you can fill batches large enough to approach compute-bound territory consistently.

Q4: What is the difference between NVLink and PCIe for multi-GPU inference, and when does interconnect bandwidth matter?

Answer: NVLink (H100 SXM5: 900 GB/s bidirectional) is a direct GPU-to-GPU interconnect, used for tensor parallel communication - the all-reduce operations that synchronize partial results across GPUs after each layer. PCIe 4.0 (64 GB/s) connects GPUs to the CPU and to each other in PCIe-connected configurations.

For single-GPU inference: neither matters.

For tensor-parallel multi-GPU inference on one server: NVLink bandwidth is critical. In tensor parallelism, after every attention and FFN layer you need an all-reduce across all TP GPUs. For a batch of B tokens with hidden dim D, this all-reduce transfers $2BD$ bytes in FP16. At high batch sizes, this communication can become a significant fraction of total step time. NVLink's 900 GB/s vs PCIe's 64 GB/s is a 14x bandwidth difference - the difference between TP being nearly free and TP being a serious bottleneck.

H100 SXM (NVLink) vs H100 PCIe is therefore a meaningful distinction for multi-GPU inference. For 2-GPU tensor parallel serving, PCIe H100 cards are often adequate. For 4-8 GPU TP with large batch sizes, SXM NVLink cards can be justified by their lower communication overhead.

Q5: Explain disaggregated prefill/decode serving and when it is beneficial.

Answer: Disaggregated prefill/decode separates the two LLM inference phases onto physically different hardware:

  • Prefill nodes: Optimized for compute throughput. H100 SXM cards justified here. Handle the initial prompt processing (compute-bound, benefits from high FLOP/s). Transfer KV cache to decode nodes when done.
  • Decode nodes: Optimized for memory bandwidth per dollar. A100 or even L40S clusters. Handle the autoregressive generation loop (memory-bandwidth-bound).

This is beneficial when:

  1. Your traffic has high prefill/decode asymmetry (long prompts, short outputs, or vice versa).
  2. You want to avoid prefill spikes from long inputs disrupting ongoing decode (head-of-line blocking).
  3. Your cost model shows different hardware being optimal for each phase.

Papers: Splitwise (Microsoft, 2024) and DistServe (Peking University and UC San Diego, 2024) both demonstrate 2-4x better cost efficiency for specific workload types.

The main cost is engineering complexity: KV cache transfer between nodes adds latency and requires high-bandwidth interconnects between prefill and decode server pools.

Q6: How would you design a GPU selection process for a new LLM serving deployment from scratch?

Answer: A rigorous GPU selection process has five stages:

Stage 1 - Workload characterization. Measure your actual traffic: distribution of input token lengths, output token lengths, requests per second, p50/p99 latency SLAs, daily token volume target.

Stage 2 - Memory capacity sizing. Calculate model weight footprint (params x precision_bytes). Add KV cache at your target batch size and max sequence length. This determines minimum VRAM per server and whether you need multi-GPU.

Stage 3 - Throughput estimation. Use the roofline model to estimate tokens/second as a function of batch size for candidate GPUs. Determine which GPUs are bandwidth-bound vs compute-bound at your target batch sizes.

Stage 4 - Cost modeling. Calculate cost per million tokens for each candidate: (cluster_hourly_cost / million_tokens_per_hour). Include all costs: GPU hardware amortized, memory, networking, power.

Stage 5 - Empirical validation. Run actual benchmarks with your exact model, quantization level, and request distribution using vLLM or TGI. Measure p50/p99 TTFT (time-to-first-token, prefill-dominated) and TPOT (time-per-output-token, decode-dominated) separately.

Never skip Stage 5. Theory predicts the direction but not the exact magnitude. Real systems have overheads that only benchmarking reveals.
