
HBM and GDDR Memory Technologies

Reading time: ~40 min · Interview relevance: High · Target roles: ML Infrastructure Engineer, Hardware Engineer, Performance Engineer

HBM3 delivers 3.35 TB/s to the H100. GDDR6X delivers 1.0 TB/s to the RTX 4090. Both are state-of-the-art DRAM. The difference is entirely physical: one stacks dies vertically and uses a 1024-bit bus through a silicon interposer, the other sits on a PCB with a 384-bit bus. This physical difference is why one GPU costs $35,000 and the other costs $1,600.

The Bandwidth Crisis That Didn't Have to Happen

In 2013, NVIDIA's top GPU was the GK110 (GeForce GTX 780 Ti). It had 384 GFLOP/s (FP64) and 336 GB/s memory bandwidth. The bandwidth-to-compute ratio was approximately 0.88 bytes per FLOP. Performance was balanced - for most workloads, you could get close to theoretical peak utilization.

By 2016, compute had grown much faster than bandwidth. The P100 delivered 5,300 GFLOP/s (FP64) with 732 GB/s bandwidth - a 0.14 bytes-per-FLOP ratio. Compute had grown 14x since GK110 but bandwidth had grown only 2.2x. The GPU was increasingly likely to sit idle waiting for data. For the GEMM-dominated workloads HPC cared about, this was survivable because GEMM has high arithmetic intensity. But as neural networks grew, as batch sizes stayed small for inference, as attention mechanisms created O(S^2) memory-access patterns - the memory bandwidth wall started killing efficiency.

AMD Radeon Instinct MI250X shipped with 3.2 TB/s HBM2e bandwidth. NVIDIA H100 shipped with 3.35 TB/s HBM3 bandwidth. At the same time, consumer RTX 4090 shipped with 1.0 TB/s GDDR6X. The gap between datacenter and consumer GPU memory bandwidth is now 3.35x - but the price gap is 20x. Understanding why requires understanding the physics of how DRAM is manufactured, packaged, and connected.

This lesson explains the physics. Not because it is interesting (though it is), but because it directly determines what optimizations are available to you as a systems engineer. When you understand that HBM's bandwidth comes from its 1024-bit bus width, you immediately understand why quantization works so well for LLM inference (fewer bits means more useful data per bus transfer). When you understand that GDDR6X uses PAM4 signaling to pack two bits per signal transition, you understand why it runs hotter than HBM despite lower bandwidth. The physics predicts the constraints.

Why This Exists

By 2010, the bottleneck in memory systems was no longer the DRAM cell itself. DRAM cells could refresh and deliver data fast enough. The bottleneck was the number of signal wires running from the DRAM chip to the processor.

Every signal wire is a physical trace on a PCB. Each trace takes space. Each trace adds capacitance that limits switching frequency. PCB routing between a memory chip and a GPU must pass through board layers, vias, connectors, and often across centimeters of board length. By the time you reach 5 GHz switching speeds (GDDR5 era), every centimeter of trace is a liability. Signal integrity degrades. Cross-talk between adjacent traces becomes severe. Power consumption from driving these high-speed traces dominates the memory system's power budget.

The naive solution is more traces. If GDDR5 with a 256-bit bus gives you 256 GB/s, use a 1024-bit bus and get 1 TB/s. But 1024 PCB traces require a much larger package, longer average trace length, and the signal integrity problems scale super-linearly. By the time you have a 1024-bit bus running at GDDR5 speeds on a PCB, the signal integrity is so poor that you cannot actually achieve rated bandwidth without heroic board design measures.

The insight behind HBM is to eliminate the PCB entirely. What if the DRAM chip was not on a separate board but was sitting directly adjacent to the GPU, with connections measured in micrometers rather than centimeters? The switching frequency needed to achieve 1 TB/s drops dramatically when trace lengths go from 5 cm to 5 micrometers. You can now have thousands of signal lines, each running at relatively low frequency, collectively delivering bandwidth that PCB-connected DRAM cannot approach.

This is HBM in one sentence: move the DRAM from the PCB onto an interposer sitting directly next to the GPU, use thousands of short connections instead of hundreds of long ones, and build bandwidth from wire count rather than switching speed.

Historical Context

The GDDR Lineage

GDDR (Graphics Double Data Rate) memory evolved directly from consumer SDRAM. GDDR1 in 2001 was essentially DDR1 adapted for graphics use - higher bandwidth, more forgiving latency requirements, specialized for sequential access patterns.

The progression was steady: GDDR3 (2004) at 900 MT/s, GDDR5 (2008) at 4-5 GT/s per pin, GDDR6 (2018) at 12-16 GT/s per pin. Each generation pushed signaling frequency higher, used better equalization and error correction on the signal path, and improved power efficiency per transferred bit.

GDDR6X (2020) introduced PAM4 (Pulse Amplitude Modulation 4-level) signaling, effectively doubling the data per clock transition. Standard NRZ (Non-Return-to-Zero) encoding uses two signal levels: high = 1, low = 0. PAM4 uses four signal levels: 00, 01, 10, 11. Two bits per transition instead of one. This lets GDDR6X at 21 GT/s per pin deliver equivalent bandwidth to hypothetical 42 GT/s NRZ - but at the cost of tighter voltage margins (harder to distinguish four levels reliably) and higher power consumption from the more complex transmitter/receiver circuits.
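
A quick calculation makes the PAM4 arithmetic concrete. The sketch below is an illustration using the nominal per-pin rates quoted above (not vendor measurements): data rate per pin is symbol rate times bits per symbol, and a GDDR6X device exposes a 32-bit interface.

def pin_data_rate_gbps(symbol_rate_gbaud: float, bits_per_symbol: int) -> float:
    # Data rate per pin = symbol (transition) rate x bits encoded per symbol
    return symbol_rate_gbaud * bits_per_symbol

# GDDR6 (NRZ): 1 bit per symbol
gddr6_pin = pin_data_rate_gbps(16.0, 1)      # 16 Gbps/pin
# GDDR6X (PAM4): 2 bits per symbol, so 21 Gbps/pin needs only a 10.5 Gbaud line
gddr6x_pin = pin_data_rate_gbps(10.5, 2)     # 21 Gbps/pin

# A single GDDR6X device has a 32-bit interface
device_gbs = gddr6x_pin * 32 / 8             # bytes per second per device
print(f"GDDR6 NRZ pin rate:   {gddr6_pin:.1f} Gbps")
print(f"GDDR6X PAM4 pin rate: {gddr6x_pin:.1f} Gbps (10.5 Gbaud symbol rate)")
print(f"Per GDDR6X device:    {device_gbs:.0f} GB/s")
print(f"12 devices (RTX 4090): {12 * device_gbs:.0f} GB/s total")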

The HBM Revolution

HBM was co-developed by AMD and SK Hynix, first appearing in AMD's Fiji GPU (Radeon R9 Fury X) in June 2015. The specification was finalized by JEDEC as JESD235 in 2013.

The key innovation was not new DRAM technology but new packaging technology: Through-Silicon Vias (TSVs). A TSV is a vertical copper connection etched through a silicon wafer. Instead of connecting chips side-by-side on a PCB, you stack them vertically and route signals straight through. The via diameter is typically 5-10 micrometers and the pitch (center-to-center distance) is around 40 micrometers - far denser than any PCB routing.

HBM1 (2015) stacked 4 DRAM dies vertically on a base logic die, with the stack mounted on a silicon interposer alongside the GPU. Bandwidth: 128 GB/s per stack via a 1024-bit interface. By comparison, GDDR5 at the time was delivering about 256 GB/s total with a 512-bit bus running at 4 GT/s per pin.

HBM2 (2016) increased the stack to 8 dies, doubled capacity, and raised per-stack bandwidth to 256 GB/s. The Tesla P100 (2016) used four HBM2 stacks for 720 GB/s total (the stacks ran below the 256 GB/s per-stack peak) - the first product to ship with HBM2.

HBM2e (2019-2020) pushed per-stack bandwidth to 460 GB/s and capacity to 16GB per stack. The A100 (2020) uses five HBM2e stacks for 2.0 TB/s total.

HBM3 (2022), used in H100, raises per-stack bandwidth to 819 GB/s via a 1024-bit interface running at 6.4 Gbps per pin, with capacities up to 24GB per stack. H100 SXM5 carries six HBM3 stack sites, five of which are active, for 3.35 TB/s total and 80GB of capacity.

The Physical Architecture

GDDR6X Physical Layout

A GPU with GDDR6X (RTX 4090) looks like this:

  • GPU die sits in the center of the PCB
  • GDDR6X chips are arrayed around the GPU, mounted directly on the PCB
  • Signal traces route on the PCB from each GDDR6X chip to the GPU's memory controller pads
  • Typical trace lengths: 10-50mm
  • Bus width: 384 bits total (split across 12 GDDR6X devices x 32 bits each)
  • GDDR6X speed: 21 GT/s per pin (PAM4)
  • RTX 4090 bandwidth: 384 bits x 21 GT/s / 8 = 1,008 GB/s = approximately 1.0 TB/s

The thermal picture: GDDR6X chips run hot (up to 95-100 degrees C under load) because PAM4 signaling requires high-power drivers to maintain signal integrity at 21 GT/s over PCB traces. The RTX 4090's memory subsystem alone consumes approximately 50-60W of the GPU's 450W TDP.

HBM Physical Layout

An H100 SXM5 GPU packaged with HBM3 uses an entirely different physical structure:

Silicon interposer: A large piece of silicon (not PCB fiberglass) serves as the substrate. Silicon has much better electrical properties than PCB material for high-density, high-frequency routing. The interposer is typically 30-60mm on a side and contains thousands of fine copper traces connecting the GPU die to the HBM stacks.

HBM stack construction: Each HBM3 stack consists of:

  • Base logic die (bottom): memory controller, PHY, command/address logic
  • 8-12 DRAM dies stacked on top using TSVs
  • Micro-bumps connecting each die to the next (pitch: ~40-55 micrometers)
  • The stack total height: approximately 720 micrometers (less than 1mm)

Interposer mounting: GPU die and HBM stacks are placed side-by-side on the silicon interposer. The connections between GPU and HBM are through the interposer's fine metal traces - trace length of 5-20mm on silicon vs 10-50mm on PCB, with much finer trace pitch possible on silicon.

HBM3 bandwidth math:

  • 1024-bit bus (8 channels x 128 bits each)
  • 6.4 Gbps per pin
  • Per-stack bandwidth: 1024 bits x 6.4 Gbps / 8 = 819.2 GB/s
  • H100 SXM5 has 6 stack sites but only 5 active stacks: 5 x 819.2 = 4,096 GB/s... yet the H100 spec says 3.35 TB/s

The remaining discrepancy is also physical: H100 runs its HBM3 interface below the specification's 6.4 Gbps/pin maximum - roughly 5.2 Gbps/pin - so 5 stacks x 1024 bits x 5.2 Gbps / 8 ≈ 3.35 TB/s. The quoted 3.35 TB/s is therefore still a peak pin-rate figure for the shipping configuration; sustained achievable bandwidth is lower and depends on access patterns and memory-controller efficiency.
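
The same arithmetic fits in a few lines of Python. This is a sketch of the peak pin-bandwidth calculation only; the 5-active-stack count and the ~5.2 Gbps/pin operating rate for H100 SXM5 are the assumptions discussed above, not measured values.

def peak_bandwidth_gbs(bus_bits: int, gbps_per_pin: float, n_devices: int = 1) -> float:
    # Peak bandwidth = bus width (bits) x per-pin data rate / 8 bits per byte, summed over devices
    return bus_bits * gbps_per_pin / 8 * n_devices

# RTX 4090: 12 GDDR6X devices, 32 bits each, 21 Gbps/pin (PAM4)
rtx4090 = peak_bandwidth_gbs(32, 21.0, n_devices=12)

# HBM3 stack at full spec: 1024-bit interface, 6.4 Gbps/pin
hbm3_stack_spec = peak_bandwidth_gbs(1024, 6.4)

# H100 SXM5 as shipped (assumption: 5 active stacks at ~5.2 Gbps/pin)
h100_total = peak_bandwidth_gbs(1024, 5.2, n_devices=5)

print(f"RTX 4090 GDDR6X total:     {rtx4090:7.1f} GB/s")
print(f"HBM3 per stack (spec):     {hbm3_stack_spec:7.1f} GB/s")
print(f"H100 SXM5 total (shipped): {h100_total:7.1f} GB/s")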

Technology Deep Dive: Through-Silicon Vias

TSVs are the enabling technology for HBM. Understanding them explains both why HBM works and why it is expensive.

A TSV is created by drilling a hole through a silicon wafer, depositing an insulating liner (silicon oxide), and filling with copper. The process:

  1. Via etching: Deep Reactive Ion Etching (DRIE) creates vertical holes through 50-100 micrometers of silicon with aspect ratios of 10:1 or higher
  2. Oxide liner: Chemical Vapor Deposition deposits SiO2 to insulate the via from the silicon
  3. Barrier and seed layers: Tantalum + TaN barrier layer prevents copper diffusion into silicon; thin copper seed layer enables electroplating
  4. Copper fill: Electrochemical deposition fills the via
  5. Planarization: Chemical-Mechanical Polishing (CMP) flattens the surface

The result is a conductive via, typically 5-10 micrometers in diameter, passing straight through the die. A single HBM DRAM die contains approximately 1,000 TSVs (for the 1024-bit data bus plus address, command, and power pins). A stack of 8 dies contains 8,000 TSVs in the lower dies.

Why TSVs enable high bandwidth: The TSV allows the HBM stack to have a 1024-bit wide interface without requiring 1024 long PCB signal traces. The entire 1024-bit bus is implemented as TSVs and interposer traces - all sub-20mm paths. The shorter path enables lower power per bit transferred and higher reliability at moderate speeds (6.4 Gbps/pin vs GDDR6X's 21 Gbps/pin).

TSV cost impact: TSV processing adds approximately 3-5 extra lithography and deposition steps to the wafer manufacturing process. Combined with the need for a silicon interposer (also manufactured like a silicon wafer), the package cost for HBM-equipped GPUs is significantly higher than PCB-mounted GDDR. The H100 package (GPU die + interposer + HBM stacks) is estimated to contribute $8,000-12,000 to the GPU's manufacturing cost.

ECC Memory in Datacenter GPUs

Enterprise/datacenter GPUs include ECC (Error-Correcting Code) memory. Consumer GPUs typically do not (or have reduced ECC capability).

DRAM bits can flip spontaneously due to cosmic rays, alpha particles from packaging material, and electrical noise. In an 80GB memory space at typical DRAM error rates, you can expect roughly one soft error per day of continuous operation - not a problem for a gaming GPU that is restarted frequently, but unacceptable for a training job that runs for weeks.

SECDED ECC: Single Error Correction, Double Error Detection. For every 64 bits of data, 8 additional parity bits are stored. The ECC logic can correct any single-bit error and detect (but not correct) any double-bit error. The storage overhead is 8/64 = 12.5%. On HBM GPUs the parity lives in dedicated on-stack storage, so an 80GB H100 still exposes the full 80GB to software; on GPUs where ECC is carved out of the normal data array (older GDDR-based datacenter cards, or "soft ECC" on consumer cards), usable capacity shrinks accordingly.
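
A quick sketch of the SECDED overhead arithmetic (illustrative; the 80GB figure is the H100's usable capacity):

def secded_overhead(data_bits_per_word: int = 64, check_bits_per_word: int = 8) -> float:
    # SECDED stores 8 check bits alongside every 64 data bits
    return check_bits_per_word / data_bits_per_word

usable_gb = 80.0
overhead = secded_overhead()
print(f"SECDED overhead: {overhead:.1%}")
print(f"Raw storage backing {usable_gb:.0f} GB usable: {usable_gb * (1 + overhead):.0f} GB")
# -> 12.5% overhead; roughly 90 GB of physical storage behind an 80 GB HBM GPU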

HBM3 ECC implementation: Each HBM3 die has on-die ECC correction. The base logic die includes ECC engines that correct errors before data reaches the GPU. This means ECC adds no measurable bandwidth overhead (unlike older implementations where ECC checking was done by the GPU's memory controller, consuming read bandwidth).

Comparing ECC across generations:

GPU      | Memory | ECC Type            | Bandwidth Penalty
RTX 4090 | GDDR6X | Optional (soft ECC) | ~5% when enabled
A100     | HBM2e  | Always on (on-die)  | Negligible
H100     | HBM3   | Always on (on-die)  | Negligible
RTX 3090 | GDDR6X | None native         | N/A

# Check ECC status on NVIDIA GPUs via nvidia-smi
import subprocess

def check_ecc_status():
    result = subprocess.run(
        ['nvidia-smi',
         '--query-gpu=gpu_name,ecc.mode.current,ecc.errors.corrected.volatile.total,ecc.errors.uncorrected.volatile.total',
         '--format=csv,noheader'],
        capture_output=True, text=True
    )
    lines = result.stdout.strip().split('\n')
    for line in lines:
        parts = [p.strip() for p in line.split(',')]
        print(f"GPU: {parts[0]}")
        print(f"  ECC mode: {parts[1]}")
        print(f"  Corrected errors (volatile): {parts[2]}")
        print(f"  Uncorrected errors (volatile): {parts[3]}")

check_ecc_status()

Bandwidth Numbers That Matter for AI

The relationship between memory bandwidth and AI workload performance is direct and calculable. Here are the numbers that appear repeatedly in production ML engineering:

LLM Inference Bandwidth Requirements

For a transformer model with hidden dimension $H$ and FFN expansion ratio $r$ (typically 4), the memory traffic required per decode step per token is:

$$\text{BW per step} = \text{model parameters} \times \text{bytes per parameter}$$

For a 70B parameter model in fp16 (2 bytes/parameter): $70 \times 10^9 \times 2 = 140$ GB per forward pass. At H100's 3.35 TB/s, the minimum time for one decode step (ignoring compute) is:

$$t_{\min} = \frac{140\ \text{GB}}{3.35\ \text{TB/s}} = 41.8\ \text{ms}$$

This means even a theoretically perfect H100 cannot decode faster than ~24 tokens/second for a 70B model at batch size 1. This is the hard floor set by physics.

With RTX 4090 (1.0 TB/s): minimum time = 140ms = ~7 tokens/second at batch 1.

This analysis explains the market: for LLM inference at small batch sizes, you are paying for bandwidth, not compute. At 3.35x the GDDR6X bandwidth, the H100 can theoretically serve 3.35x more tokens per second per GPU at batch 1; whether that wins on cost per token depends on how the 20x price gap is amortized across batching and utilization.
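
The decode-rate floor above is just model bytes divided by bandwidth. A minimal sketch of that calculation (illustrative parameters, not measured throughput):

def decode_floor_tokens_per_sec(params: float, bytes_per_param: float, bandwidth_tbs: float) -> float:
    # Every decode step must stream all weights from DRAM at least once (batch size 1)
    model_gb = params * bytes_per_param / 1e9
    step_time_s = model_gb / (bandwidth_tbs * 1e3)   # GB / (GB/s)
    return 1.0 / step_time_s

for name, bw in [("H100 (HBM3, 3.35 TB/s)", 3.35),
                 ("RTX 4090 (GDDR6X, 1.0 TB/s)", 1.008)]:
    floor = decode_floor_tokens_per_sec(70e9, 2.0, bw)
    print(f"{name}: <= {floor:.1f} tokens/s for a 70B fp16 model at batch 1")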

The KV Cache Bandwidth Problem

The KV (Key-Value) cache stores past attention states for each token in context. For a model with $L$ layers, $H_{kv}$ KV heads, head dimension $d$, and sequence length $S$:

$$\text{KV cache size} = 2 \times L \times H_{kv} \times d \times S \times \text{bytes per element}$$

For Llama 3 70B: $L = 80$, $H_{kv} = 8$ (GQA), $d = 128$, fp16 (2 bytes per element): $2 \times 80 \times 8 \times 128 \times 2 = 327{,}680$ bytes per token, i.e. $327{,}680 \times S$ bytes.

At context length 4096: ~1.34 GB. At 32K context: ~10.5 GB.

During decode, the full KV cache must be read from HBM at each step. For 10.5 GB at 3.35 TB/s, that is 3.1ms just for KV cache reads per step - comparable to the cost of reading the weights in a small model.

import torch

def estimate_kv_cache_bandwidth_cost(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    seq_len: int,
    hbm_bandwidth_tbs: float = 3.35,
    dtype=torch.float16
):
    """
    Estimate time to load KV cache from HBM per decode step.
    This is the irreducible bandwidth cost for autoregressive generation.
    """
    bytes_per_element = 2 if dtype == torch.float16 else 4
    # 2x for K and V
    kv_cache_bytes = 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_element
    kv_cache_gb = kv_cache_bytes / 1e9

    # Time to read from HBM (minimum theoretical)
    time_ms = (kv_cache_gb / 1e3) / hbm_bandwidth_tbs * 1000

    print(f"KV cache size: {kv_cache_gb:.2f} GB")
    print(f"Minimum KV load time per step: {time_ms:.2f} ms")
    print(f"Theoretical max decode speed (KV only): {1000/time_ms:.1f} tokens/sec")
    return kv_cache_bytes

# Llama 3 70B with GQA
estimate_kv_cache_bandwidth_cost(
    n_layers=80,
    n_kv_heads=8,
    head_dim=128,
    seq_len=4096
)

# Same model at 32K context
print("\nAt 32K context:")
estimate_kv_cache_bandwidth_cost(
    n_layers=80,
    n_kv_heads=8,
    head_dim=128,
    seq_len=32768
)

Measuring Actual Bandwidth

Theoretical bandwidth numbers are peaks under ideal conditions. Actual achieved bandwidth depends on access patterns, row buffer efficiency, bank conflicts, and refresh overhead. Here is how to measure what you actually have.

import torch

def comprehensive_bandwidth_benchmark(device='cuda'):
    """
    Measure HBM bandwidth at multiple transfer sizes.
    Small transfers: see latency effects (bandwidth appears lower)
    Large transfers: approach peak bandwidth
    """
    results = []

    for size_mb in [1, 4, 16, 64, 256, 1024, 4096, 8192]:
        n_bytes = size_mb * 1024 * 1024
        n_floats = n_bytes // 4

        try:
            src = torch.ones(n_floats, dtype=torch.float32, device=device)
            dst = torch.empty_like(src)

            # Warmup
            for _ in range(3):
                dst.copy_(src)
            torch.cuda.synchronize()

            # Time the copy (read + write = 2x size_mb transferred)
            n_trials = max(3, min(50, int(10000 / size_mb)))
            start = torch.cuda.Event(enable_timing=True)
            end = torch.cuda.Event(enable_timing=True)

            start.record()
            for _ in range(n_trials):
                dst.copy_(src)
            end.record()
            torch.cuda.synchronize()

            elapsed_ms = start.elapsed_time(end) / n_trials
            # dst.copy_(src): reads src, writes dst = 2 * size_mb transferred
            bandwidth_gbs = (2 * size_mb) / elapsed_ms  # MB/ms is approximately GB/s
            bandwidth_tbs = bandwidth_gbs / 1000

            results.append({
                'size_mb': size_mb,
                'time_ms': elapsed_ms,
                'bandwidth_gbs': bandwidth_gbs
            })
            print(f"Size: {size_mb:6d} MB | Time: {elapsed_ms:8.3f} ms | "
                  f"BW: {bandwidth_gbs:8.1f} GB/s ({bandwidth_tbs:.3f} TB/s)")

            del src, dst
            torch.cuda.empty_cache()

        except torch.cuda.OutOfMemoryError:
            print(f"Size: {size_mb:6d} MB | OOM")
            break

    if results:
        peak = max(r['bandwidth_gbs'] for r in results)
        print(f"\nPeak achieved bandwidth: {peak:.1f} GB/s ({peak/1000:.3f} TB/s)")

    return results

print("HBM Bandwidth Benchmark:")
print("-" * 70)
results = comprehensive_bandwidth_benchmark()

Benchmarking Read vs Write vs Copy Bandwidth

Read and write bandwidth are often asymmetric on HBM. Understanding which direction is your bottleneck matters for kernel design.

import torch

def directional_bandwidth_test(size_gb=2.0, n_trials=20):
    """
    Test read-only, write-only, and copy (read+write) bandwidth separately.
    HBM systems often have higher read bandwidth than write bandwidth.
    """
    n_floats = int(size_gb * 1e9 / 4)

    src = torch.ones(n_floats, dtype=torch.float32, device='cuda')
    dst = torch.empty_like(src)

    def time_kernel(kernel_fn, label, bytes_factor=1.0):
        # bytes_factor: how many times the tensor is traversed
        # (1 = read-only or write-only, 2 = read + write)
        # Warmup
        for _ in range(5):
            kernel_fn()
        torch.cuda.synchronize()

        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(n_trials):
            kernel_fn()
        end.record()
        torch.cuda.synchronize()

        elapsed_ms = start.elapsed_time(end) / n_trials
        bytes_moved = bytes_factor * size_gb * 1e9
        bw = bytes_moved / (elapsed_ms / 1000) / 1e9
        print(f"{label:20s}: {elapsed_ms:.3f}ms, {bw:.1f} GB/s")
        return bw

    # Read: sum reduction reads every element but writes only one scalar
    time_kernel(lambda: src.sum(), "Read (reduction)", bytes_factor=1.0)

    # Write: fill writes every element without reading
    time_kernel(lambda: dst.fill_(1.0), "Write (fill)", bytes_factor=1.0)

    # Copy: reads all of src and writes all of dst
    time_kernel(lambda: dst.copy_(src), "Copy (read+write)", bytes_factor=2.0)

    # In-place scale: reads and writes the entire tensor
    time_kernel(lambda: dst.mul_(2.0), "In-place mul (rw)", bytes_factor=2.0)

directional_bandwidth_test()

HBM vs GDDR - Head to Head Comparison

Property                 | HBM3 (H100)            | GDDR6X (RTX 4090)
Bus width                | 1024 bits per stack    | 384 bits total
Signaling speed          | 6.4 Gbps/pin (NRZ)     | 21 Gbps/pin (PAM4)
Total bandwidth          | 3.35 TB/s              | ~1.0 TB/s
Capacity                 | 80 GB                  | 24 GB
Power (memory subsystem) | ~100W                  | ~50-60W
Bandwidth/Watt           | 33.5 GB/s/W            | ~16-20 GB/s/W
Latency                  | ~100 ns                | ~90-100 ns
Package type             | Interposer + TSV stack | Standard PCB BGA
Manufacturing cost       | Very high              | Moderate
GPU price (2024)         | $25,000-35,000         | $1,600-2,000

Note: HBM has higher bandwidth per watt despite higher total power, because it achieves bandwidth through wider buses at lower signaling frequencies rather than through extremely high-frequency narrow buses.

Thermal Characteristics

HBM Thermal Profile

HBM stacks are thermally challenging because the heat-generating DRAM dies are buried under each other. The top die in an 8-die stack has heat transfer blocked by 7 dies below it. The base logic die at the bottom of the stack (the hottest component from a switching perspective) must transfer heat upward through the entire stack.

On H100, the HBM stacks are thermally coupled to the GPU's heat spreader. The stacks themselves are rated to approximately 95 degrees C junction temperature. In a well-cooled server, HBM junction temperature typically stays under 80 degrees C. In poorly ventilated scenarios, thermal throttling occurs - HBM bandwidth is reduced, and refresh rates must increase (hot DRAM cells leak charge faster and need more frequent refresh), which consumes even more of the available bandwidth.

# Monitor HBM temperature via nvidia-smi
import subprocess
import time

def monitor_hbm_temperature(duration_s=30, interval_s=1):
    """
    Monitor GPU and HBM temperature during workload.
    HBM temperature is reported separately from GPU die temperature.
    """
    print(f"{'Time':>5s} | {'GPU Temp':>10s} | {'HBM Temp':>10s} | {'Power':>8s}")
    print("-" * 45)

    start = time.time()
    while time.time() - start < duration_s:
        result = subprocess.run(
            ['nvidia-smi',
             '--query-gpu=temperature.gpu,temperature.memory,power.draw',
             '--format=csv,noheader,nounits'],
            capture_output=True, text=True
        )
        parts = result.stdout.strip().split(',')
        elapsed = time.time() - start
        gpu_temp = parts[0].strip() if len(parts) > 0 else 'N/A'
        mem_temp = parts[1].strip() if len(parts) > 1 else 'N/A'
        power = parts[2].strip() if len(parts) > 2 else 'N/A'
        print(f"{elapsed:5.1f}s | {gpu_temp:>9s}C | {mem_temp:>9s}C | {power:>7s}W")
        time.sleep(interval_s)

# Run during a bandwidth-intensive workload to see thermal behavior
# monitor_hbm_temperature(duration_s=60)

GDDR6X Thermal Profile

GDDR6X chips run significantly hotter than HBM under load. PAM4 signaling at 21 Gbps requires high-drive-strength output stages that consume substantial power even when not transferring data (biasing the multi-level signal requires constant current). RTX 4090 GDDR6X temperatures routinely reach 95-104 degrees C under sustained workloads.

This thermal limitation is one reason GDDR6X speeds have not continued to scale - you cannot push much higher than 21 Gbps/pin on PCB traces without unmanageable power dissipation and signal integrity problems at high temperatures.

Bandwidth as the Binding Constraint for LLM Inference

The industry has recognized HBM bandwidth as the key bottleneck for LLM serving. This recognition has driven several important optimizations:

Quantization as Bandwidth Optimization

Weight-only quantization (W4A16, W8A16) directly reduces HBM bandwidth consumption by reducing bytes per weight element.

import torch

def demonstrate_quantization_bandwidth_impact():
    """
    Show how quantization reduces effective bandwidth consumption
    by reducing bytes transferred per useful weight element.
    """
    H, W = 4096, 4096   # Weight matrix dimensions
    batch_size = 1      # Decode mode - single token

    x = torch.randn(batch_size, H, device='cuda', dtype=torch.float16)

    # FP16 weights (2 bytes/element)
    w_fp16 = torch.randn(W, H, device='cuda', dtype=torch.float16)

    # INT8 weights (1 byte/element) with per-row fp16 scales
    w_int8 = torch.randint(-128, 127, (W, H), device='cuda', dtype=torch.int8)
    scale_fp16 = torch.ones(W, device='cuda', dtype=torch.float16)

    def time_gemm(fn, n_trials=200):
        for _ in range(20):
            fn()
        torch.cuda.synchronize()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(n_trials):
            fn()
        end.record()
        torch.cuda.synchronize()
        return start.elapsed_time(end) / n_trials

    # FP16 GEMV: memory-bound at batch 1, so time ~ weight bytes / HBM bandwidth
    t_fp16 = time_gemm(lambda: x @ w_fp16.T)
    bytes_fp16 = W * H * 2          # 2 bytes per fp16 element
    bw_fp16 = bytes_fp16 / (t_fp16 / 1000) / 1e9

    print(f"FP16 GEMM (batch=1): {t_fp16:.3f}ms, "
          f"effective BW: {bw_fp16:.1f} GB/s")

    # Naive INT8 path: dequantize to fp16 up front, then run the same GEMV.
    # This does NOT save bandwidth - the GEMV still loads fp16 weights from HBM.
    w_dequant = w_int8.to(torch.float16) * scale_fp16.unsqueeze(1)
    t_int8 = time_gemm(lambda: x @ w_dequant.T)
    bytes_int8 = W * H * 1          # 1 byte per int8 element (what a fused kernel would load)

    print(f"Pre-dequantized INT8 (batch=1): {t_int8:.3f}ms "
          f"(~same as FP16 - bandwidth unchanged)")
    print(f"Measured FP16 / naive-INT8 time ratio: {t_fp16/t_int8:.2f}x")
    print(f"A fused W8A16 kernel loads {bytes_int8/1e6:.0f} MB instead of "
          f"{bytes_fp16/1e6:.0f} MB per step - up to ~2x faster when memory-bound")

demonstrate_quantization_bandwidth_impact()

Continuous Batching and Bandwidth Amortization

When batch size increases, the arithmetic intensity of each forward pass increases but the weight loading bandwidth stays the same (weights are shared across all batch elements). This is why serving systems try to maximize batch size.

$$\text{time per token} = \frac{\text{model size in bytes}}{\text{HBM bandwidth} \times \text{batch size}} + \frac{\text{FLOP per token}}{\text{compute throughput}}$$

For batch size 1: bandwidth term dominates. For large batch sizes: compute term grows proportionally, eventually becoming the limit.

The crossover batch size (where compute equals bandwidth constraint) is:

$$\text{crossover batch} = \frac{\text{compute peak}}{\text{bandwidth peak}} \times \frac{\text{model bytes}}{\text{FLOP per token per batch element}}$$

For H100 (312 TFLOP/s FP16, 3.35 TB/s): ratio = 93 FLOP/byte. For a 70B parameter model at fp16:

  • FLOP per token per layer (illustrative FFN with H = 4096, FFN_dim = 16384): 2 FLOP per multiply-add x 4096 x 16384 x 2 projections (up and down) = 268M FLOP/layer
  • Bytes per layer in fp16: (4096 x 16384 + 16384 x 4096) x 2 bytes = 268M bytes/layer
  • Arithmetic intensity per token = 268M / 268M = 1 FLOP/byte at batch 1
  • Crossover: batch = 93 x (1/1) = ~93

At batch size 93 or above, a 70B model inference on H100 becomes compute-bound rather than bandwidth-bound.

def find_crossover_batch_size(
    model_params: float,            # total parameters
    bytes_per_param: float,         # 2 for fp16, 4 for fp32
    flop_per_param: float,          # typically ~2 for forward pass
    hbm_bandwidth_tbs: float = 3.35,
    compute_tflops: float = 312.0
):
    """
    Compute the batch size at which inference transitions from
    memory-bandwidth-bound to compute-bound.
    """
    model_bytes = model_params * bytes_per_param
    flop_per_token = model_params * flop_per_param

    # Roofline balance point
    ridge_flop_per_byte = compute_tflops * 1e12 / (hbm_bandwidth_tbs * 1e12)

    # At batch B: arithmetic intensity = B * flop_per_token / model_bytes
    # Crossover when: B * flop_per_token / model_bytes = ridge
    crossover_batch = ridge_flop_per_byte * model_bytes / flop_per_token

    print(f"Model: {model_params/1e9:.0f}B params, {bytes_per_param} bytes/param")
    print(f"Model size: {model_bytes/1e9:.1f} GB")
    print(f"Hardware ridge point: {ridge_flop_per_byte:.1f} FLOP/byte")
    print(f"Crossover batch size: {crossover_batch:.0f}")
    print(f"At batch < {crossover_batch:.0f}: memory-bandwidth limited")
    print(f"At batch >= {crossover_batch:.0f}: compute limited")
    return crossover_batch

print("H100 crossover analysis:")
find_crossover_batch_size(70e9, 2.0, 2.0)   # 70B fp16

print("\nA100 crossover analysis:")
find_crossover_batch_size(70e9, 2.0, 2.0,
                          hbm_bandwidth_tbs=2.0,
                          compute_tflops=312.0)

print("\nRTX 4090 crossover analysis:")
find_crossover_batch_size(7e9, 2.0, 2.0,    # 7B model, more reasonable for 24GB
                          hbm_bandwidth_tbs=1.008,
                          compute_tflops=82.6)

HBM Generations - Spec Comparison

Generation | Year | Bandwidth/Stack | Capacity/Stack | Process Node | Products
HBM1       | 2015 | 128 GB/s        | 1-4 GB         | 20nm         | AMD Fury X
HBM2       | 2016 | 256 GB/s        | 4-8 GB         | 20nm         | P100, Vega
HBM2e      | 2019 | 307-460 GB/s    | 8-16 GB        | 10nm-class   | A100, MI250X
HBM3       | 2022 | 665-819 GB/s    | 16-24 GB       | 10nm-class   | H100, MI300X
HBM3e      | 2024 | 819-1170 GB/s   | 24-36 GB       | 10nm-class   | H200

HBM3e (as used in NVIDIA H200) reaches 4.8 TB/s total with 141GB of capacity - both well above H100's 3.35 TB/s and 80GB. The H200's bandwidth advantage over H100 directly translates to higher throughput for LLM inference at small batch sizes.
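
Plugging the H200's figures into the same decode-floor arithmetic used earlier shows why the bandwidth bump matters. This is an illustration only; the 3.35 TB/s and 4.8 TB/s values are the spec figures quoted above, not measured throughput.

def batch1_decode_floor(params_b: float, bytes_per_param: float, bandwidth_tbs: float) -> float:
    # Tokens/sec ceiling when every decode step must stream all weights once
    model_gb = params_b * 1e9 * bytes_per_param / 1e9
    return bandwidth_tbs * 1e3 / model_gb

for gpu, bw in [("H100 (3.35 TB/s)", 3.35), ("H200 (4.8 TB/s)", 4.8)]:
    print(f"{gpu}: 70B fp16 decode floor ~ {batch1_decode_floor(70, 2.0, bw):.1f} tokens/s at batch 1")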

Production Engineering Notes

Memory Capacity Planning for LLM Serving

Planning how many GPUs you need to serve a model involves both bandwidth and capacity constraints:

import math

def gpu_capacity_planning(
    model_params: float,            # parameter count
    bytes_per_param: float,         # 2=fp16, 4=fp32, 0.5=int4
    max_seq_len: int,               # maximum context length
    max_batch_size: int,            # concurrent requests
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    gpu_memory_gb: float = 80.0,
    memory_overhead_fraction: float = 0.15   # OS + framework overhead
):
    """
    Estimate minimum GPU memory for LLM serving.
    """
    # Model weights
    weights_gb = model_params * bytes_per_param / 1e9
    print(f"Model weights: {weights_gb:.1f} GB")

    # KV cache (fp16 always, even for quantized weights)
    kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * 2   # 2 for K+V, 2 for fp16
    kv_cache_gb = (kv_bytes_per_token * max_seq_len * max_batch_size) / 1e9
    print(f"KV cache ({max_seq_len} tokens x batch {max_batch_size}): {kv_cache_gb:.1f} GB")

    # Framework overhead
    overhead_gb = (weights_gb + kv_cache_gb) * memory_overhead_fraction
    print(f"Framework overhead (~{memory_overhead_fraction:.0%}): {overhead_gb:.1f} GB")

    total_gb = weights_gb + kv_cache_gb + overhead_gb
    print(f"Total required: {total_gb:.1f} GB")

    n_gpus = math.ceil(total_gb / gpu_memory_gb)
    print(f"GPUs required ({gpu_memory_gb:.0f}GB each): {n_gpus}")
    print(f"Utilization if {n_gpus} GPUs used: {total_gb / (n_gpus * gpu_memory_gb):.1%}")
    return n_gpus

print("=== Llama 3 70B fp16 serving ===")
gpu_capacity_planning(
    model_params=70e9,
    bytes_per_param=2.0,
    max_seq_len=4096,
    max_batch_size=8,
    n_layers=80,
    n_kv_heads=8,
    head_dim=128
)

print("\n=== Llama 3 70B int4 (quantized) serving ===")
gpu_capacity_planning(
    model_params=70e9,
    bytes_per_param=0.5,    # 4-bit quantization
    max_seq_len=4096,
    max_batch_size=8,
    n_layers=80,
    n_kv_heads=8,
    head_dim=128
)

When you cannot fit a model in single-GPU HBM, you need multiple GPUs. The interconnect bandwidth between GPUs then becomes critical. On H100 SXM5:

  • NVLink 4.0: 900 GB/s total bidirectional bandwidth per GPU (18 links x 50 GB/s per link)
  • PCIe 5.0 x16: 128 GB/s bidirectional (64 GB/s per direction)

For tensor parallelism (where activations must be all-reduced at each layer), NVLink offers roughly 7x the bandwidth of PCIe Gen5. The per-layer all-reduce payload during decode is small (on the order of hidden dimension x 2 bytes per token), so per-operation latency and overhead matter as much as raw bandwidth - and those collectives happen at every one of the model's 80 layers on every decode step. The per-layer costs compound into a substantial end-to-end decode latency difference between PCIe-connected and NVLink-connected GPUs.
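
A back-of-the-envelope model for this is simply time ≈ per-operation latency + bytes / link bandwidth. The sketch below uses assumed per-collective latencies (the 10 us and 30 us figures are illustrative placeholders, not measured values) to show how small decode payloads make the fixed per-operation cost, not bandwidth, the dominant term.

def allreduce_time_us(payload_bytes: float, link_gbs: float, per_op_latency_us: float) -> float:
    # Simple alpha-beta model: fixed per-operation latency plus transfer time
    transfer_us = payload_bytes / (link_gbs * 1e9) * 1e6
    return per_op_latency_us + transfer_us

hidden_dim = 8192
payload = hidden_dim * 2            # fp16 activations for one decoded token
n_layers = 80

# Assumed per-collective latencies (illustrative only)
nvlink = allreduce_time_us(payload, 900.0, per_op_latency_us=10.0)
pcie   = allreduce_time_us(payload, 128.0, per_op_latency_us=30.0)

print(f"Per-layer all-reduce (NVLink, assumed): {nvlink:.1f} us")
print(f"Per-layer all-reduce (PCIe, assumed):   {pcie:.1f} us")
print(f"Over {n_layers} layers: {n_layers*nvlink/1000:.2f} ms vs {n_layers*pcie/1000:.2f} ms per decode step")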

Common Mistakes

:::danger Confusing Peak Bandwidth with Achieved Bandwidth Spec sheets say H100 delivers 3.35 TB/s. Real workloads achieve 70-90% of that for optimal access patterns, and as low as 5-10% for pathological ones. Always benchmark your actual workload. The spec is an upper bound, not a guarantee. Small tensor operations in particular have terrible achieved bandwidth due to kernel launch overhead and the time to ramp up memory controller utilization. :::

:::danger Assuming GDDR6X is "Good Enough" for Production LLM Serving RTX 4090 with GDDR6X is excellent for experimentation, fine-tuning, and small model inference. For production serving of large models (>13B parameters) at meaningful throughput, the 1 TB/s bandwidth and 24GB capacity create hard floors that no software optimization can overcome. A single A100 (80GB, 2 TB/s) will outperform a single RTX 4090 for LLM serving by 2-3x in both throughput and maximum model size, despite costing 10x more. At scale, the cost-per-token math favors datacenter GPUs. :::

:::warning HBM Temperature Throttling Is Silent and Hard to Detect HBM stacks throttle at 95+ degrees C. When throttled, HBM bandwidth drops by 10-30% and the GPU stays within power limits - but performance degrades silently. There is no "OOM error" or obvious crash. You just see inexplicably lower throughput on warm hardware compared to cold hardware, or mysterious performance variance across servers in the same cluster. Always monitor temperature.memory via nvidia-smi in production serving. If any instance consistently exceeds 85 degrees C, investigate cooling before assuming it is a software issue. :::

:::warning Quantization Reduces Bandwidth Cost But Not KV Cache Size A common misconception: "I quantized my model from fp16 to int4, so I need 4x less memory and 4x less bandwidth." This is true for the weights, but the KV cache is almost always kept in fp16 regardless of weight quantization (to preserve attention score precision). For long-context applications where the KV cache dominates memory usage, quantizing weights does not help with the memory or bandwidth bottleneck. Separate techniques (KV cache quantization, GQA/MQA, sliding window attention) are needed for the KV cache. :::

:::warning ECC Has a Real Cost on Older GPU Generations On older GDDR5-based datacenter GPUs (Kepler/Maxwell-era Tesla cards), enabling ECC reduced both usable capacity and effective bandwidth by several percent, because ECC bits were stored in the normal data array and checked by the GPU's memory controller, consuming read bandwidth. On HBM-based GPUs such as A100 and H100, ECC is implemented on-die within each HBM stack and adds negligible bandwidth overhead. If you are benchmarking on older hardware and comparing results with and without ECC, factor in this bandwidth difference; an ECC-disabled run on old hardware is not a fair comparison point for A100/H100 with ECC always enabled. :::

Interview Q&A

Q1: Why does HBM have so much more bandwidth than GDDR6X even though GDDR6X runs at higher signaling frequency?

HBM uses a 1024-bit bus per stack (8 channels x 128 bits each), while GDDR6X uses a 384-bit total bus across all 12 chips on an RTX 4090. Even though GDDR6X runs at 21 Gbps/pin vs HBM3's 6.4 Gbps/pin, the bandwidth math is:

HBM3 per stack at full spec: 1024 bits x 6.4 Gbps / 8 = 819 GB/s. H100 SXM5's five active stacks run at roughly 5.2 Gbps/pin, giving the quoted 3.35 TB/s.

GDDR6X RTX 4090: 384 bits x 21 Gbps / 8 = ~1.0 TB/s.

HBM achieves bandwidth through extreme bus width enabled by TSVs and silicon interposer routing. GDDR6X achieves bandwidth through high signaling frequency on a narrow bus. The HBM approach is 3.35x better in bandwidth but 20x more expensive because TSV fabrication and silicon interposer packaging add cost that PCB-mounted GDDR does not have.

Q2: Explain Through-Silicon Vias and why they are the key enabling technology for HBM.

TSVs are vertical copper conductors etched through silicon wafers. They replace the traditional bond wire connections used in 2D chip packaging - where signals had to leave a chip's top surface, loop through a bond wire to a PCB, and then travel to the next chip. Bond wires and PCB traces can be millimeters to centimeters long, which limits signaling frequency and increases power consumption per bit.

TSVs allow stacking DRAM dies vertically with connections measured in micrometers. An HBM stack connects dies at ~40 micrometer pitch. With thousands of short TSV connections, you can implement the 1024-bit HBM bus entirely within the stack height of under 1 millimeter. No PCB traces needed for the data path.

The physics benefit: a 10 micrometer TSV has 10x-100x lower capacitance than a PCB trace of comparable function. Lower capacitance means less energy to switch the signal and ability to use lower drive strength (lower power) while still achieving correct signal timing. This is why HBM achieves 2x better bandwidth per watt compared to GDDR6X.

Q3: How does memory bandwidth limit LLM inference throughput, and what techniques address this?

For autoregressive LLM decode (token generation), each step requires streaming all of the model's weights from HBM to perform the matrix-vector multiplications for that token. For a 70B parameter model in fp16, that is 140GB of data loaded per decode step. At H100's 3.35 TB/s, the minimum possible decode time is 140GB / 3.35 TB/s = 42ms per token, regardless of compute speed. This is the memory bandwidth wall.

Techniques that address this:

  1. Weight quantization (W4A16, W8A16): reduces bytes per weight by 2x-4x relative to fp16, proportionally reducing load time
  2. Larger batch sizes: amortize the weight loading cost across multiple tokens generated simultaneously
  3. Continuous batching: fill batch slots dynamically to maximize GPU utilization without waiting for fixed batch sizes
  4. Speculative decoding: use a small draft model to predict multiple tokens, verify with the large model in one forward pass - effectively increases tokens generated per weight load
  5. Model parallelism with NVLink: spread weights across GPUs with fast interconnect, each GPU loads a fraction of weights

Q4: What is the crossover batch size for memory-bound vs compute-bound operation, and how do you calculate it?

The crossover batch size is the batch at which the workload's arithmetic intensity reaches the hardware's ridge point, so the kernel shifts from bandwidth-bound to compute-bound. The roofline ridge point (balance between compute and bandwidth) is compute_peak / bandwidth_peak in FLOP/byte. For H100: 312 TFLOPS / 3.35 TB/s = 93 FLOP/byte.

For a model with arithmetic intensity A(B) = FLOP(B) / bytes_loaded, where FLOP(B) scales linearly with batch size but bytes_loaded is constant (weights are loaded once regardless of batch size), the crossover is:

batch_crossover = ridge_point x (bytes per forward pass / FLOP per token per batch element)

For a 70B model: bytes per pass = 140GB. FLOP per token = 140 GFLOP (2 x params as a rough estimate). So batch_crossover = 93 x (140GB / 140GFLOP) = 93.

Below batch 93, adding more compute capacity does not help. Above batch 93, adding more HBM bandwidth does not help. This is the fundamental tradeoff that explains why H100 (3.35 TB/s, 312 TFLOPS) is better for small-batch LLM inference than V100 (900 GB/s, 112 TFLOPS) even though V100's compute/bandwidth ratio is similar - H100's absolute bandwidth is 3.7x higher, which directly raises the throughput floor.

Q5: What is PAM4 signaling and why does GDDR6X use it? What are the tradeoffs?

PAM4 (Pulse Amplitude Modulation 4-level) encodes 2 bits per signal transition instead of the 1 bit per transition used by standard NRZ (Non-Return-to-Zero) signaling. Instead of two voltage levels (0V for bit 0, V_dd for bit 1), PAM4 uses four voltage levels: 0V (00), 1/3 V_dd (01), 2/3 V_dd (10), V_dd (11).

GDDR6X uses PAM4 because the physical interconnect speed had hit limits with NRZ. PCB traces at 21 Gbps NRZ would require extremely tight impedance control and would have unacceptable bit error rates due to signal integrity problems. PAM4 delivering 21 Gbps of data per pin needs only a 10.5 Gbaud symbol rate (each symbol carries 2 bits), and a 10.5 Gbaud signal has much better signal integrity on PCB traces than 21 Gbaud NRZ would.

The tradeoffs: PAM4 has smaller noise margins because you are distinguishing four voltage levels instead of two. Adjacent levels are separated by only 1/3 of the full signal swing, versus the full swing separating the two NRZ levels. This requires higher-quality transmitter/receiver circuits that consume more power. GDDR6X memory chips draw significantly more power per transferred bit than GDDR6 NRZ - which is why RTX 4090 memory temperatures reach 100 degrees C under load. HBM avoids PAM4 entirely because TSV connections are so short that even 6.4 Gbps NRZ achieves the required bandwidth with excellent signal integrity.

Q6: How does ECC work in HBM3, and what is the impact on usable capacity and bandwidth?

HBM3 implements SECDED (Single Error Correction, Double Error Detection) ECC on-die within each HBM stack. The base logic die includes dedicated ECC engines that process every memory access. For every 64-bit data word, 8 bits of ECC parity are stored alongside. This provides the ability to correct any single-bit flip and detect any two-bit error.

The capacity overhead is 8/64 = 12.5%. An "80GB" H100 actually has 80GB of usable data space, with HBM internally storing approximately 90GB when you include ECC bits - but the ECC overhead is invisible to the programmer. You see 80GB in nvidia-smi.

The bandwidth impact on H100 is negligible. Because ECC is implemented on-die within each HBM stack (ECC engines are part of the HBM3 die design), the correction happens locally without consuming any of the HBM3-to-GPU interconnect bandwidth. The ECC bits never traverse the interposer. Contrast this with older GDDR5-based datacenter GPUs, where ECC bits shared the normal memory array and bus and the GPU's memory controller performed correction - consuming several percent of both capacity and read bandwidth.

Summary and What Comes Next

HBM and GDDR6X represent two fundamentally different approaches to the same problem: how do you get data from DRAM to a processor fast enough to keep compute units busy?

GDDR6X's approach - very high signaling frequencies on a moderate-width bus mounted on a PCB - is cost-effective, needs no exotic packaging, and delivers 1 TB/s. It is right for consumer GPUs where cost and form factor matter more than peak bandwidth.

HBM's approach - moderate signaling frequency on an extremely wide bus connected via silicon interposer and TSVs - is expensive, requires specialized packaging, and delivers 3.35 TB/s. It is right for datacenter workloads where throughput per dollar at scale justifies the hardware premium.

For AI systems engineers, the practical takeaway is: memory bandwidth is not a hardware property you accept passively. It is a constraint that determines your optimization strategy. When you know your bandwidth budget, you can calculate your arithmetic intensity requirement, choose appropriate quantization strategies, size your batches to cross the roofline boundary, and design serving systems that amortize weight loading across as many useful operations as possible.

The next lesson covers the Roofline Model in detail - the quantitative framework that ties bandwidth, compute, and arithmetic intensity together into a single analysis tool for characterizing and improving kernel performance.
