Roofline Model and Bottleneck Analysis
Reading time: ~40 min · Interview relevance: Very High · Target roles: ML Engineer, CUDA Developer, Performance Engineer
The roofline model answers the most important question in GPU optimization: is this operation limited by compute or by memory bandwidth? The answer determines which optimizations are worth pursuing - and which ones are a complete waste of time.
Opening: A Performance Debug at 2 AM
It is 2:47 AM and your training job is running 40% slower than it should be. You just upgraded from an A100 to an H100 - a GPU that costs roughly three times as much per hour and promises nearly 3x the FP16 tensor core throughput. On paper, your transformer training run should have screamed. Instead, it is crawling. GPU utilization shows 98% across all eight cards. MFU (model FLOP utilization) is stuck at 28%. Your manager will ask for an explanation in six hours.
You stare at the Nsight Systems timeline. The kernels are launching. Memory is moving. Everything looks busy. But something is wrong at the level that profiling tools show you only if you know exactly what to look for. You start dumping Nsight Compute reports for individual kernels. The data is overwhelming - hundreds of metrics per kernel. You have no framework to decide which number matters.
This is the experience that breaks most ML engineers who try to go deep on performance. They know the GPU is slow. They do not know why it is slow. They try random optimizations - switch to BF16, add torch.compile, increase batch size - and see marginal gains that do not add up to the speedup they need. Each change takes hours to validate. They are guessing, not engineering.
The root cause in this particular incident: the model was bottlenecked on memory bandwidth, not compute. Adding tensor cores did nothing because tensor cores were not the bottleneck to begin with. The H100 has 3.35 TB/s of HBM3 bandwidth - meaningfully higher than the A100's 2 TB/s - but the team's custom attention kernels were reading and writing the full K and V matrices to HBM on every layer. They were paying memory bandwidth tax on every forward pass. Switching to FlashAttention 2 cut training time by 38% on the same hardware within 48 hours. No hardware upgrade required.
The tool that would have pointed directly to that root cause in ten minutes is the roofline model. It is a simple performance model - a two-line graph with mathematics that fits on a napkin. But it encodes a complete theory of GPU bottleneck analysis. Once you understand how to construct one, read one, and reason from one, debugging GPU performance stops being guesswork and becomes engineering.
This lesson teaches you to build the roofline model from first principles, understand what it says about every common ML operation, measure real arithmetic intensity in PyTorch code, and use the roofline to make concrete optimization decisions. By the end, you will know exactly which operations in a transformer are memory-bound versus compute-bound - and what to do about each one.
Why This Exists - The Problem of Not Knowing What to Optimize
Before the roofline model existed, performance engineers had two numbers to chase: peak FLOPS and peak memory bandwidth. But these two numbers are independent hardware limits. A GPU can be hitting peak bandwidth while barely touching its compute units. It can be maxing out tensor cores while the memory bus sits idle. What was missing was a unified framework that placed any given operation in context relative to both limits simultaneously.
The naive approach was trial and error. Profile a kernel, try different optimizations, measure again. This works eventually but it wastes enormous effort optimizing the wrong bottleneck. If your kernel is memory-bandwidth bound, no amount of algorithmic improvement to the compute path helps. You can implement the most elegant Winograd convolution algorithm in the world and gain nothing, because the GPU was never waiting on the multiply-accumulate units to begin with.
The deeper problem is that compute and memory bandwidth scale independently across hardware generations. The A100 had 312 TFLOPS of BF16 tensor core throughput and 2 TB/s of HBM bandwidth - a ratio of 156 FLOPs/byte. The H100 SXM5 has 989 TFLOPS of BF16 tensor core throughput and 3.35 TB/s of HBM bandwidth - a ratio of 295 FLOPs/byte. Compute improved by 3.2x while bandwidth improved by 1.7x. An operation that was borderline on the A100 (just barely compute-bound) might be strongly memory-bound on the H100, because the "ridge point" shifted dramatically. Without a model that accounts for both limits simultaneously and tracks how they change across hardware, you cannot reason about portability of optimization work.
The roofline model solves this. It gives every kernel a single number - arithmetic intensity - that tells you where it sits relative to the hardware limits. Operations with low arithmetic intensity are memory-bound regardless of GPU. Operations with high arithmetic intensity are compute-bound and will benefit from better compute hardware. The ridge point divides the space cleanly. Once you place an operation on the roofline, the optimization path is obvious.
Historical Context - Williams, Waterman, and Patterson (2009)
The roofline model was introduced in a 2009 paper titled "Roofline: An Insightful Visual Performance Model for Multicore Architectures" by Samuel Williams, Andrew Waterman, and David Patterson at UC Berkeley. Patterson is the same David Patterson who co-invented RISC architecture and later joined Google to advocate for domain-specific accelerators - he has spent much of his career building conceptual tools for reasoning about computer architecture.
The original paper targeted multicore CPUs and the memory wall problem: as CPUs got faster, DRAM bandwidth did not keep pace. Optimizing for computation speed became less useful when data could not be fed fast enough to keep cores busy. Williams, Waterman, and Patterson formalized this with the roofline model, giving engineers a principled way to determine whether a given workload would benefit more from better compute or better memory bandwidth.
The insight was deceptively simple. Every computation has two costs: arithmetic work (measured in FLOPs) and data movement (measured in bytes). The ratio - FLOPs per byte - is the arithmetic intensity. Hardware provides two limits: peak compute (FLOPS) and peak memory bandwidth (bytes per second). The performance of any computation is bounded by both limits simultaneously. The tighter bound determines the actual bottleneck.
When the ML community moved to GPUs and the GPU performance gap with theoretical peak became a serious engineering problem, the roofline model migrated naturally from CPU performance analysis to GPU performance analysis. NVIDIA incorporated roofline analysis directly into Nsight Compute in 2019, making it a first-class tool for GPU kernel optimization. Today, the roofline model is the standard entry point for any serious GPU performance investigation.
Core Concepts
Arithmetic Intensity - The Single Most Important Number
Arithmetic intensity is defined as the ratio of floating-point operations performed to bytes of memory traffic generated:

$$I = \frac{\text{FLOPs performed}}{\text{bytes accessed}}$$
The units are FLOPs per byte. The "bytes accessed" in the denominator refers specifically to bytes moved between DRAM (HBM on modern GPUs) and the compute units - not cache-to-register traffic. This distinction matters enormously. If you reuse data from L2 cache or shared memory, you pay for the DRAM transfer only once, raising the effective arithmetic intensity.
A pure elementwise operation like ReLU - read a float, apply max(0, x), write the result - performs 1 FLOP on 8 bytes (4 bytes read + 4 bytes written in FP32). Its arithmetic intensity is $1/8 = 0.125$ FLOPs/byte. It is aggressively memory-bound on every GPU ever built.
A large square matrix multiplication of two $N \times N$ matrices performs $2N^3$ FLOPs and moves $3 \times 4N^2 = 12N^2$ bytes of data (two input matrices plus one output in FP32). Its arithmetic intensity is:

$$I = \frac{2N^3}{12N^2} = \frac{N}{6}$$

For $N = 4096$, that is roughly 683 FLOPs/byte. This is why large matrix multiplications are the workhorse of deep learning - they have high arithmetic intensity and are compute-bound on virtually every modern GPU, meaning the tensor cores are the limiting factor, not memory bandwidth.
The Two Performance Ceilings
The roofline model defines two hardware ceilings:
Memory bandwidth ceiling: Performance is bounded by how fast data can be moved from DRAM. If a kernel has arithmetic intensity $I$ (FLOPs/byte) and the hardware provides bandwidth $B$ (bytes/second), the maximum achievable performance is:

$$P_{\text{mem}} = I \times B$$

This is a sloped line on a log-log plot. Every increase in arithmetic intensity buys a proportional increase in achievable performance - as long as you are on the memory-bound side.
Peak compute ceiling: Performance is bounded by how fast the compute units can execute floating-point operations. This ceiling is a horizontal line at $P_{\text{peak}}$ (the peak FLOPS rating of the hardware).
The roofline model says that actual achievable performance for a kernel with arithmetic intensity $I$ is:

$$P = \min(I \times B,\; P_{\text{peak}})$$
On a log-log plot, the memory bandwidth ceiling is a diagonal line with slope 1 (since both axes are log scale). The peak compute ceiling is a horizontal line. The two lines intersect at the ridge point.
The Ridge Point - Where Memory-Bound Becomes Compute-Bound
The ridge point is the arithmetic intensity at which the memory bandwidth ceiling equals the peak compute ceiling:

$$I_{\text{ridge}} = \frac{P_{\text{peak}}}{B}$$

Operations with $I < I_{\text{ridge}}$ are memory-bound: adding more compute units does not help. To improve performance, you must reduce bytes accessed (kernel fusion, reuse data in shared memory) or increase bandwidth (use faster memory tiers like L2 or shared memory).
Operations with $I > I_{\text{ridge}}$ are compute-bound: reducing DRAM traffic does not help. To improve performance, you must reduce FLOPs (more efficient algorithms) or increase peak compute (use Tensor Cores, FP8 instead of FP32).
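In code, the whole model reduces to a single min(). A minimal sketch (function and constant names here are illustrative; a fuller calculator appears in the code examples below):

```python
def roofline_bound(intensity: float, peak_flops: float, bandwidth: float) -> tuple[float, str]:
    """Attainable FLOP/s and the limiting resource for a kernel with the given intensity."""
    ridge = peak_flops / bandwidth                        # FLOPs/byte
    attainable = min(intensity * bandwidth, peak_flops)   # the roofline itself
    return attainable, ("memory-bound" if intensity < ridge else "compute-bound")

# H100 SXM5, BF16 tensor cores: 989 TFLOP/s compute, 3.35 TB/s HBM bandwidth
print(roofline_bound(0.25, 989e12, 3.35e12))    # (~0.84e12, 'memory-bound')  - elementwise op
print(roofline_bound(1365.0, 989e12, 3.35e12))  # (989e12, 'compute-bound')   - 4096^3 BF16 GEMM
```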
H100 SXM5 Numbers - Real Hardware Ceilings
Let's put real numbers on this for the most common training GPU as of 2025:
| Metric | Value |
|---|---|
| Peak FP32 compute | 67 TFLOPS |
| Peak BF16 Tensor Core | 989 TFLOPS |
| Peak FP8 Tensor Core | 1,979 TFLOPS |
| HBM3 bandwidth | 3.35 TB/s |
| L2 cache bandwidth | ~12 TB/s |
| L2 cache size | 50 MB |
Now compute the ridge points:
FP32 ridge point: $67 \text{ TFLOPS} / 3.35 \text{ TB/s} = 20$ FLOPs/byte
BF16 Tensor Core ridge point: $989 \text{ TFLOPS} / 3.35 \text{ TB/s} \approx 295$ FLOPs/byte
FP8 Tensor Core ridge point: $1979 \text{ TFLOPS} / 3.35 \text{ TB/s} \approx 591$ FLOPs/byte
These numbers are striking. For BF16 tensor cores, you need 295 FLOPs of compute for every byte you access from HBM before you saturate the compute units. Most naive implementations of even moderately complex operations never reach this. The H100's extraordinary compute throughput means an enormous fraction of operations that look like they should be compute-heavy are actually memory-bound when implemented naively.
For comparison, the A100 SXM4 BF16 Tensor Core ridge point was approximately $312 / 2.0 = 156$ FLOPs/byte. The H100's ridge point is nearly double. Operations that were just barely compute-bound on the A100 are memory-bound on the H100.
Arithmetic Intensity of Common ML Operations
Understanding where common operations fall is essential knowledge:
Elementwise operations (ReLU, GELU, sigmoid, add, multiply):
- FLOPs: 1 per element (roughly)
- Bytes: 4 bytes read + 4 bytes written (FP32) = 8 bytes
- Arithmetic intensity: $1/8 = 0.125$ FLOPs/byte
- Classification: Strongly memory-bound on every GPU
Layer normalization (over hidden dim $H$):
- FLOPs: roughly $7H$ per token (mean, variance, normalize, scale, shift)
- Bytes: $4H$ bytes read + $4H$ bytes written $= 8H$ bytes in FP32 (plus small parameter reads)
- Arithmetic intensity: $7H / 8H \approx 0.9$ FLOPs/byte
- Classification: Memory-bound
Vector-vector dot product (length $N$):
- FLOPs: $2N$ (N multiplies + N additions)
- Bytes: $2 \times 4N = 8N$ bytes in FP32
- Arithmetic intensity: $2N / 8N = 0.25$ FLOPs/byte
- Classification: Strongly memory-bound
Matrix-vector multiply (GEMV) - matrix is $M \times N$, vector is $N \times 1$:
- FLOPs: $2MN$
- Bytes: $4MN$ (matrix) + $4N$ (vector) + $4M$ (output) $\approx 4MN$ bytes for large $M, N$ in FP32
- Arithmetic intensity: $2MN / 4MN \approx 0.5$ FLOPs/byte
- Classification: Strongly memory-bound (this is why inference at batch size 1 is bandwidth-limited)
Matrix-matrix multiply (GEMM) - square matrices of size $N \times N$:
- FLOPs: $2N^3$
- Bytes: $3 \times 4N^2 = 12N^2$ bytes in FP32
- Arithmetic intensity: $2N^3 / 12N^2 = N/6$ FLOPs/byte in FP32 ($N/3$ in BF16)
- For $N = 128$: 21 FLOPs/byte (barely above the FP32 ridge of 20, far below the BF16 Tensor Core ridge)
- For $N = 1024$: 171 FLOPs/byte (compute-bound in FP32; even the BF16 value of ~341 is only marginally above the H100 Tensor Core ridge of 295)
- For $N = 4096$: 683 FLOPs/byte in FP32, ~1365 in BF16 (comfortably compute-bound for both on the H100)
Softmax (over sequence length $L$):
- FLOPs: roughly $5L$ (exp + sum + divide per element, plus max subtraction)
- Bytes: $4L$ read + $4L$ written $= 8L$ bytes in FP32 (minimum 2 passes)
- Arithmetic intensity: $5L / 8L \approx 0.6$ FLOPs/byte
- Classification: Memory-bound
Naive attention (sequence length $L$, head dim $d$):
- FLOPs: $\approx 4L^2 d$ (two GEMMs, $QK^\top$ and $AV$, plus softmax)
- Bytes: reads Q, K, V, writes and re-reads the $L \times L$ attention matrix in HBM, and writes the output: $\approx 2(4Ld + 2L^2)$ bytes in BF16
- For large $L$: arithmetic intensity $\approx 4L^2 d / 4L^2 = d$ FLOPs/byte (e.g., $d = 64$ gives roughly 64 FLOPs/byte)
- Classification: Memory-bound on H100 (below 295 FLOPs/byte ridge)
- The attention matrix write/read is the killer - it costs $O(L^2)$ bytes of memory traffic
FlashAttention (same operation, tiled implementation):
- FLOPs: Same (same math, different execution)
- Bytes: No $L \times L$ matrix written to HBM. Only Q, K, V, and O: $\approx 2 \times 4Ld = 8Ld$ bytes in BF16
- Arithmetic intensity: $\approx 4L^2 d / 8Ld = L/2$ FLOPs/byte - it grows with sequence length, just as a GEMM's intensity grows with $N$
- For $L = 2048$: on the order of 1,000 FLOPs/byte - well past the 295 FLOPs/byte ridge for BF16
- Classification: Compute-bound - the key algorithmic win of FlashAttention
This is the concrete payoff of the roofline model. Naive attention and FlashAttention compute exactly the same mathematical result. But their roofline positions are completely different - and that difference is the 3-5x speedup FlashAttention delivers. The roofline model explains it precisely.
Roofline Diagram for the H100
Optimization Decision Tree
Code Examples
1. Computing Arithmetic Intensity Analytically
"""
Analytical arithmetic intensity calculator for common ML operations.
These are theoretical minimums - actual memory traffic may be higher.
"""
def arithmetic_intensity_elementwise(dtype_bytes: int = 4) -> float:
"""
Elementwise ops: ReLU, sigmoid, exp, add, multiply.
1 FLOP per element, 1 read + 1 write per element.
"""
flops_per_element = 1.0
bytes_per_element = 2 * dtype_bytes # read + write
return flops_per_element / bytes_per_element
def arithmetic_intensity_gemm(M: int, N: int, K: int, dtype_bytes: int = 2) -> float:
"""
General matrix multiply: C[M,N] = A[M,K] @ B[K,N]
FLOPs: 2*M*N*K (K multiplies + K-1 adds, simplified to 2K per output element)
Bytes: read A (M*K) + read B (K*N) + write C (M*N), all in dtype_bytes
"""
flops = 2 * M * N * K
bytes_accessed = dtype_bytes * (M * K + K * N + M * N)
return flops / bytes_accessed
def arithmetic_intensity_layernorm(hidden_dim: int, dtype_bytes: int = 2) -> float:
"""
LayerNorm over hidden_dim.
Approximate FLOPs: 7 * hidden_dim per token
- mean: hidden_dim adds + 1 divide
- variance: hidden_dim subtracts + squares + adds + 1 divide
- normalize: hidden_dim subtracts + divides
- scale + shift: 2 * hidden_dim multiply-adds
Bytes: read input + read gamma + read beta + write output
"""
flops = 7 * hidden_dim
bytes_accessed = dtype_bytes * (hidden_dim + hidden_dim + hidden_dim + hidden_dim)
return flops / bytes_accessed
def arithmetic_intensity_attention_naive(
seq_len: int, head_dim: int, dtype_bytes: int = 2
) -> float:
"""
Naive attention: Q @ K^T -> softmax -> @ V
The bottleneck: must write the [seq_len, seq_len] attention matrix to HBM.
FLOPs: 2 * seq_len^2 * head_dim (QK matmul) + 2 * seq_len^2 * head_dim (AV matmul)
+ ~5 * seq_len^2 (softmax)
Bytes: Read Q, K, V (3 * seq_len * head_dim) + Write/Read attention matrix (seq_len^2)
+ Write output (seq_len * head_dim)
"""
flops = (
2 * seq_len ** 2 * head_dim # QK^T
+ 5 * seq_len ** 2 # softmax approx
+ 2 * seq_len ** 2 * head_dim # AV
)
bytes_accessed = dtype_bytes * (
3 * seq_len * head_dim # Q, K, V reads
        + seq_len ** 2  # attention matrix write
+ seq_len ** 2 # attention matrix read after softmax
+ seq_len * head_dim # output write
)
return flops / bytes_accessed
def arithmetic_intensity_flash_attention(
seq_len: int, head_dim: int, dtype_bytes: int = 2
) -> float:
"""
FlashAttention: same FLOPs, but no L x L matrix written to HBM.
Tiles K and V in SRAM/shared memory and accumulates directly.
FLOPs: same as naive attention
    Bytes: only Q, K, V reads + output write (no attention matrix in HBM).
    This is the idealized lower bound on traffic; a real kernel re-reads K/V tiles.
"""
flops = (
2 * seq_len ** 2 * head_dim
+ 5 * seq_len ** 2
+ 2 * seq_len ** 2 * head_dim
)
bytes_accessed = dtype_bytes * (
3 * seq_len * head_dim # Q, K, V reads
+ seq_len * head_dim # output write
)
return flops / bytes_accessed
# H100 SXM5 hardware specs (compute values in FLOP/s, bandwidth in bytes/s)
H100_SPECS = {
"peak_fp32_tflops": 67e12,
"peak_bf16_tensor_tflops": 989e12,
"peak_fp8_tensor_tflops": 1979e12,
"hbm_bandwidth_bytes_per_sec": 3.35e12,
"l2_bandwidth_bytes_per_sec": 12e12,
}
def ridge_point(peak_compute: float, bandwidth: float) -> float:
"""Arithmetic intensity at which memory-bound becomes compute-bound."""
return peak_compute / bandwidth
def roofline_performance(
arithmetic_intensity: float,
peak_compute: float,
bandwidth: float,
) -> tuple[float, str]:
"""
Returns (achievable_performance_flops_per_sec, bottleneck)
"""
memory_limit = arithmetic_intensity * bandwidth
compute_limit = peak_compute
if memory_limit < compute_limit:
return memory_limit, "memory-bound"
else:
return compute_limit, "compute-bound"
if __name__ == "__main__":
B = H100_SPECS["hbm_bandwidth_bytes_per_sec"]
P_bf16 = H100_SPECS["peak_bf16_tensor_tflops"]
P_fp32 = H100_SPECS["peak_fp32_tflops"]
print("=== H100 Ridge Points ===")
print(f"FP32 ridge: {ridge_point(P_fp32, B):.1f} FLOPs/byte")
print(f"BF16 Tensor: {ridge_point(P_bf16, B):.1f} FLOPs/byte")
print()
print("=== Arithmetic Intensities (BF16, dtype_bytes=2) ===")
ops = [
("Elementwise (ReLU)", arithmetic_intensity_elementwise(dtype_bytes=2)),
("LayerNorm (H=4096)", arithmetic_intensity_layernorm(4096, dtype_bytes=2)),
("GEMM 128x128x128", arithmetic_intensity_gemm(128, 128, 128, dtype_bytes=2)),
("GEMM 1024x1024x1024", arithmetic_intensity_gemm(1024, 1024, 1024, dtype_bytes=2)),
("GEMM 4096x4096x4096", arithmetic_intensity_gemm(4096, 4096, 4096, dtype_bytes=2)),
("Naive Attention L=512 d=64",
arithmetic_intensity_attention_naive(512, 64, dtype_bytes=2)),
("Naive Attention L=2048 d=64",
arithmetic_intensity_attention_naive(2048, 64, dtype_bytes=2)),
("FlashAttention L=2048 d=64",
arithmetic_intensity_flash_attention(2048, 64, dtype_bytes=2)),
]
ridge = ridge_point(P_bf16, B)
for name, intensity in ops:
perf, bound = roofline_performance(intensity, P_bf16, B)
pct_of_peak = (perf / P_bf16) * 100
marker = "COMPUTE" if bound == "compute-bound" else "MEMORY "
print(
f" [{marker}] {name:<35} "
f"I={intensity:7.1f} FLOPs/byte "
f"Peak={pct_of_peak:5.1f}% of BF16 ceiling"
)
Expected output:
=== H100 Ridge Points ===
FP32 ridge: 20.0 FLOPs/byte
BF16 Tensor: 295.2 FLOPs/byte
=== Arithmetic Intensities (BF16, dtype_bytes=2) ===
[MEMORY ] Elementwise (ReLU)                  I=    0.2 FLOPs/byte Peak=  0.1% of BF16 ceiling
[MEMORY ] LayerNorm (H=4096)                  I=    0.9 FLOPs/byte Peak=  0.3% of BF16 ceiling
[MEMORY ] GEMM 128x128x128                    I=   42.7 FLOPs/byte Peak= 14.5% of BF16 ceiling
[COMPUTE] GEMM 1024x1024x1024                 I=  341.3 FLOPs/byte Peak=100.0% of BF16 ceiling
[COMPUTE] GEMM 4096x4096x4096                 I= 1365.3 FLOPs/byte Peak=100.0% of BF16 ceiling
[MEMORY ] Naive Attention L=512 d=64          I=   52.2 FLOPs/byte Peak= 17.7% of BF16 ceiling
[MEMORY ] Naive Attention L=2048 d=64         I=   61.4 FLOPs/byte Peak= 20.8% of BF16 ceiling
[COMPUTE] FlashAttention L=2048 d=64          I= 1044.0 FLOPs/byte Peak=100.0% of BF16 ceiling
2. Measuring Arithmetic Intensity Empirically with PyTorch Profiler
"""
Empirically measure arithmetic intensity using PyTorch's CUDA profiler.
This measures actual memory traffic, not just the theoretical minimum.
Requires: torch >= 2.0, CUDA GPU
"""
import torch
import torch.profiler
from typing import Callable
def measure_kernel_stats(
fn: Callable,
*args,
warmup_iters: int = 5,
measure_iters: int = 10,
label: str = "operation",
) -> dict:
"""
Run a function and measure:
- Wall time (ms)
- DRAM reads (bytes)
- DRAM writes (bytes)
- FLOPs (if available from profiler)
- Derived arithmetic intensity
NOTE: PyTorch profiler gives DRAM traffic via kineto_results.
For accurate byte counts, use Nsight Compute. This is an approximation.
"""
device = "cuda"
# Warmup
for _ in range(warmup_iters):
result = fn(*args)
torch.cuda.synchronize()
activities = [
torch.profiler.ProfilerActivity.CPU,
torch.profiler.ProfilerActivity.CUDA,
]
with torch.profiler.profile(
activities=activities,
record_shapes=True,
profile_memory=True,
with_flops=True,
) as prof:
for _ in range(measure_iters):
result = fn(*args)
torch.cuda.synchronize()
# Aggregate across iterations
total_flops = 0
total_self_cuda_time_us = 0
key_avgs = prof.key_averages()
for event in key_avgs:
if event.device_type == torch.autograd.DeviceType.CUDA:
total_flops += event.flops
total_self_cuda_time_us += event.self_cuda_time_total
avg_time_ms = total_self_cuda_time_us / measure_iters / 1000.0
avg_flops = total_flops / measure_iters
# Achieved TFLOPS
achieved_tflops = avg_flops / (avg_time_ms * 1e-3) / 1e12 if avg_time_ms > 0 else 0
return {
"label": label,
"avg_time_ms": avg_time_ms,
"flops": avg_flops,
"achieved_tflops": achieved_tflops,
}
def classify_operation(achieved_tflops: float, theoretical_intensity: float) -> dict:
"""
Given measured throughput and theoretical arithmetic intensity,
classify whether the operation is memory-bound or compute-bound
on the current GPU.
"""
# Query current GPU specs (approximate for common GPUs)
gpu_name = torch.cuda.get_device_name(0)
# Conservative estimates - you should set these for your exact GPU
gpu_specs = {
"H100": {"peak_bf16_tflops": 989.0, "hbm_bandwidth_tbs": 3.35},
"A100": {"peak_bf16_tflops": 312.0, "hbm_bandwidth_tbs": 2.0},
"A10G": {"peak_bf16_tflops": 125.0, "hbm_bandwidth_tbs": 0.6},
"RTX 4090": {"peak_bf16_tflops": 330.0, "hbm_bandwidth_tbs": 1.008},
}
# Default to a mid-range GPU if not recognized
specs = {"peak_bf16_tflops": 312.0, "hbm_bandwidth_tbs": 2.0}
for key, val in gpu_specs.items():
if key.lower() in gpu_name.lower():
specs = val
break
peak_compute = specs["peak_bf16_tflops"]
bandwidth = specs["hbm_bandwidth_tbs"]
ridge = peak_compute / bandwidth
efficiency = achieved_tflops / peak_compute * 100
bound = "compute-bound" if theoretical_intensity > ridge else "memory-bound"
if bound == "memory-bound":
# What fraction of bandwidth ceiling are we hitting?
memory_ceiling = theoretical_intensity * bandwidth # TFLOPS equivalent
memory_efficiency = achieved_tflops / memory_ceiling * 100 if memory_ceiling > 0 else 0
return {
"bound": bound,
"ridge_point": ridge,
"compute_efficiency_pct": efficiency,
"memory_ceiling_efficiency_pct": memory_efficiency,
"recommendation": "Fuse kernels to reduce HBM round-trips. "
"Consider shared memory tiling.",
}
else:
return {
"bound": bound,
"ridge_point": ridge,
"compute_efficiency_pct": efficiency,
"recommendation": "Enable Tensor Cores (BF16/FP8). "
"Check for warp divergence or uncoalesced access.",
}
def benchmark_matmul_sizes():
"""
Benchmark matrix multiplications of different sizes to see
how arithmetic intensity changes with problem size.
"""
device = "cuda"
dtype = torch.bfloat16
sizes = [128, 256, 512, 1024, 2048, 4096]
print(f"\n{'Size':>6} | {'Theory I':>10} | {'Time (ms)':>10} | {'TFLOPS':>8} | {'Bound'}")
print("-" * 65)
for N in sizes:
A = torch.randn(N, N, dtype=dtype, device=device)
B = torch.randn(N, N, dtype=dtype, device=device)
def fn():
return torch.matmul(A, B)
# Warmup
for _ in range(3):
_ = fn()
torch.cuda.synchronize()
# Time using CUDA events (more accurate than profiler for single ops)
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
for _ in range(20):
out = fn()
end.record()
torch.cuda.synchronize()
avg_time_ms = start.elapsed_time(end) / 20.0
flops = 2 * N ** 3
achieved_tflops = flops / (avg_time_ms * 1e-3) / 1e12
        theoretical_intensity = N / 3.0  # BF16 GEMM: bytes = 2 * 3 * N^2, so I = 2N^3 / (6N^2) = N/3
# H100 ridge = 295 FLOPs/byte
bound = "COMPUTE" if theoretical_intensity > 295 else "MEMORY "
print(
f"{N:>6} | {theoretical_intensity:>9.1f} | {avg_time_ms:>9.3f} | "
f"{achieved_tflops:>7.1f} | {bound}"
)
if __name__ == "__main__":
if not torch.cuda.is_available():
print("No CUDA GPU available. This script requires a GPU.")
else:
benchmark_matmul_sizes()
3. Measuring the Performance Gap - Roofline Distance
"""
Roofline distance: how far below the roofline ceiling is the actual performance?
A large gap means the kernel is poorly optimized relative to its theoretical limit.
A small gap means the kernel is near-optimal.
"""
import torch
import time
class RooflineAnalyzer:
"""
Analyze an operation's position on the roofline and compute the efficiency gap.
"""
# H100 SXM5 specs (update for your GPU)
GPU_SPECS = {
"peak_bf16_tflops": 989.0,
"hbm_bandwidth_tbs": 3.35,
}
def __init__(self, gpu_specs: dict = None):
self.specs = gpu_specs or self.GPU_SPECS
self.peak = self.specs["peak_bf16_tflops"]
self.bandwidth = self.specs["hbm_bandwidth_tbs"]
self.ridge = self.peak / self.bandwidth
def roofline_limit(self, arithmetic_intensity: float) -> float:
"""The theoretical max performance (TFLOPS) at this arithmetic intensity."""
memory_ceiling = arithmetic_intensity * self.bandwidth
return min(memory_ceiling, self.peak)
def efficiency(self, achieved_tflops: float, arithmetic_intensity: float) -> dict:
"""
Returns a dict with:
- roofline_limit: theoretical max TFLOPS
- achieved_tflops: actual measured TFLOPS
- efficiency_pct: achieved / limit * 100
- bottleneck: 'memory' or 'compute'
- gap_factor: how many times below the ceiling we are
"""
limit = self.roofline_limit(arithmetic_intensity)
eff_pct = achieved_tflops / limit * 100 if limit > 0 else 0
gap = limit / achieved_tflops if achieved_tflops > 0 else float("inf")
bottleneck = "memory" if arithmetic_intensity < self.ridge else "compute"
return {
"roofline_limit_tflops": limit,
"achieved_tflops": achieved_tflops,
"efficiency_pct": eff_pct,
"gap_factor": gap,
"bottleneck": bottleneck,
"ridge_point": self.ridge,
}
def print_report(self, label: str, achieved_tflops: float, arithmetic_intensity: float):
r = self.efficiency(achieved_tflops, arithmetic_intensity)
print(f"\n=== Roofline Report: {label} ===")
print(f" Arithmetic intensity : {arithmetic_intensity:.1f} FLOPs/byte")
print(f" Ridge point (BF16) : {r['ridge_point']:.1f} FLOPs/byte")
print(f" Bottleneck : {r['bottleneck'].upper()}-BOUND")
print(f" Roofline ceiling : {r['roofline_limit_tflops']:.1f} TFLOPS")
print(f" Achieved : {r['achieved_tflops']:.1f} TFLOPS")
print(f" Efficiency : {r['efficiency_pct']:.1f}% of ceiling")
print(f" Gap factor : {r['gap_factor']:.1f}x below ceiling")
if r["efficiency_pct"] < 50:
print(f" WARNING: {r['gap_factor']:.1f}x below ceiling - significant optimization opportunity")
elif r["efficiency_pct"] < 80:
print(f" NOTE: Some optimization headroom remains")
else:
print(f" GOOD: Near-optimal for this kernel type")
def demo_attention_comparison():
"""
Compare naive attention vs FlashAttention on the roofline.
Uses analytical intensity values since flash_attn may not be installed.
"""
analyzer = RooflineAnalyzer()
seq_len = 2048
head_dim = 64
dtype = torch.bfloat16
device = "cuda"
# Analytical intensities (from our earlier functions)
    naive_intensity = 61.4    # L=2048, d=64, BF16
    flash_intensity = 1044.0  # L=2048, d=64, BF16 (idealized Q/K/V/O traffic only)
# Simulate measured throughput (replace with actual profiler numbers)
# Naive attention on H100: typically ~50-80 TFLOPS (memory-bound)
# FlashAttention on H100: typically ~400-600 TFLOPS (compute-bound)
naive_achieved_tflops = 65.0 # representative measured value
flash_achieved_tflops = 450.0 # representative measured value
analyzer.print_report(
"Naive Attention (L=2048, d=64)",
naive_achieved_tflops,
naive_intensity,
)
analyzer.print_report(
"FlashAttention (L=2048, d=64)",
flash_achieved_tflops,
flash_intensity,
)
print("\n=== Summary ===")
print(f"FlashAttention speedup: {flash_achieved_tflops / naive_achieved_tflops:.1f}x")
print(
"Root cause: FlashAttention eliminates O(L^2) HBM writes, "
"raising arithmetic intensity from 61 to 341 FLOPs/byte."
)
print(
"On H100 with BF16 ridge at 295 FLOPs/byte: "
"naive is memory-bound, FlashAttention is compute-bound."
)
if __name__ == "__main__":
demo_attention_comparison()
Reading NVIDIA Nsight Compute's Roofline Chart
Nsight Compute (NCU) has a built-in roofline chart under the "Roofline" section of any kernel analysis report. Here is how to read it:
Running Nsight Compute from the command line:
# Profile a single kernel with roofline metrics
ncu --metrics gpu__time_duration,\
dram__bytes_read,dram__bytes_write,\
sm__sass_thread_inst_executed_op_fadd_pred_on,\
sm__sass_thread_inst_executed_op_fmul_pred_on,\
sm__sass_thread_inst_executed_op_ffma_pred_on \
--target-processes all \
python your_script.py
# Or use the full roofline preset
ncu --set roofline python your_script.py
What the chart shows:
- Each kernel is plotted as a dot at position (arithmetic intensity, achieved GFLOPS/s)
- The roofline is the two-line ceiling: sloped line (memory bandwidth) meets horizontal line (peak compute)
- Dots below the roofline by a large margin indicate poorly optimized kernels
- Dots near the roofline (achieving 80% or more of their ceiling) indicate near-optimal kernels
- The horizontal distance from a dot to the ridge point shows whether the bottleneck is compute or memory
Interpreting the gap:
- A kernel plotted well below the memory bandwidth line: the kernel is not even achieving bandwidth utilization. Check for uncoalesced memory access, cache thrashing, or excessive synchronization.
- A kernel plotted at the bandwidth line but far from the compute ceiling: it is correctly bandwidth-bound. Optimizations should focus on reducing data movement.
- A kernel near the compute ceiling: it is well-optimized. Gains require algorithmic changes or better hardware.
Common Nsight Compute metrics to track:
# Memory throughput metrics
l1tex__t_bytes_pipe_lsu_mem_global_op_ld.sum # L1 load bytes
dram__bytes_read.sum # HBM read bytes
dram__bytes_write.sum # HBM write bytes
# Compute metrics
sm__inst_executed_pipe_tensor_op_hmma.sum # Tensor core instructions
sm__sass_thread_inst_executed_op_ffma.sum # FFMA instructions
# Derived
achieved_occupancy # Active warps / max warps
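As a sketch of how these counters combine into a roofline point - counter names, `.sum` suffixes, and FLOP weighting conventions vary across architectures and Nsight Compute versions, so treat this as an approximation for non-tensor-core FP32 work:

```python
def roofline_point_from_counters(fadd: float, fmul: float, ffma: float,
                                 dram_read_bytes: float, dram_write_bytes: float,
                                 duration_s: float) -> tuple[float, float]:
    """Turn raw counter values into (arithmetic intensity, achieved FLOP/s).

    FFMA counts as 2 FLOPs (multiply + add); tensor core work would need the
    hmma/imma pipe counters instead.
    """
    flops = fadd + fmul + 2.0 * ffma
    dram_bytes = dram_read_bytes + dram_write_bytes
    intensity = flops / dram_bytes if dram_bytes else float("inf")
    achieved = flops / duration_s if duration_s else 0.0
    return intensity, achieved

# Hypothetical counter values copied out of an Nsight Compute report
i, p = roofline_point_from_counters(fadd=1.2e9, fmul=1.1e9, ffma=8.0e9,
                                    dram_read_bytes=6.0e8, dram_write_bytes=2.0e8,
                                    duration_s=1.5e-3)
print(f"I = {i:.1f} FLOPs/byte, achieved = {p / 1e12:.2f} TFLOP/s")
```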
Production Engineering Notes
Practical Rules for Transformer Models
Most transformer components have well-known roofline positions. Knowing these lets you prioritize optimization work without profiling every run:
Always memory-bound (regardless of model size or GPU):
- Embedding table lookups (tiny arithmetic intensity, massive random access)
- LayerNorm and RMSNorm
- Dropout
- All elementwise activations (GELU, SiLU, ReLU)
- Bias additions
Memory-bound at typical inference batch sizes, compute-bound at large training batches:
- Attention projections (Q, K, V, O projections) at batch size 1 are GEMV (0.5 FLOPs/byte), at large batch are GEMM (compute-bound)
- FFN layers follow the same pattern
Compute-bound when properly implemented:
- Large GEMM with adequate batch/sequence dimensions
- FlashAttention at sequence lengths above ~512
- Convolutions with large spatial dimensions and channel counts
Kernel Fusion and the Roofline
Kernel fusion is the primary tool for improving memory-bound operations. The idea: instead of running five separate elementwise kernels (each reading and writing to HBM), fuse them into one kernel that reads once, applies all operations in registers or shared memory, and writes once.
Consider the FFN sublayer in a transformer: output = dropout(linear2(gelu(linear1(x))))
Naively, this is: matmul + gelu + matmul + dropout = 4 separate CUDA kernel launches, with 3 intermediate tensors written to and read from HBM. After fusion:
- The two GEMMs are still separate (they are compute-bound and benefit from cuBLAS optimization)
- But gelu fuses into the epilogue of the first GEMM (or into one elementwise kernel with its neighbors), and dropout fuses into the epilogue of the second
- 2 HBM round-trips eliminated
This does not change the roofline position of the GEMMs (they are already compute-bound). But it dramatically improves the effective arithmetic intensity of the activation functions from 0.125 to something much higher, because now one pass over the data computes multiple operations.
torch.compile performs this fusion automatically using Triton-generated fused kernels. The roofline model tells you exactly how much benefit to expect before you even run the optimization.
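A quick way to see the effect is to time a chain of elementwise operations in eager mode against a torch.compile'd version. This is a sketch - the exact kernel count and speedup depend on the GPU, the PyTorch version, and whether Triton fusion actually triggers for this pattern:

```python
import torch

def activation_chain(x: torch.Tensor) -> torch.Tensor:
    # Three elementwise passes -> three HBM round-trips in eager mode
    y = torch.nn.functional.gelu(x)
    y = y * 1.1
    return torch.nn.functional.dropout(y, p=0.1, training=True)

def time_ms(fn, x, iters: int = 50) -> float:
    for _ in range(5):                      # warmup (also triggers compilation)
        fn(x)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(x)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # milliseconds per call

if torch.cuda.is_available():
    x = torch.randn(8192, 8192, device="cuda", dtype=torch.bfloat16)
    fused = torch.compile(activation_chain)
    print(f"eager                : {time_ms(activation_chain, x):.3f} ms")
    print(f"torch.compile (fused): {time_ms(fused, x):.3f} ms")
```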
The L2 Cache Roofline
For operations whose working set fits in L2 cache (50 MB on H100), the effective bandwidth is closer to 12 TB/s than 3.35 TB/s. This shifts the ridge point: $989 \text{ TFLOPS} / 12 \text{ TB/s} \approx 82$ FLOPs/byte instead of 295.
Naive attention at small sequence lengths (L=256, working set = 256^2 * 2 bytes = 128 KB, well within L2) effectively runs against the L2 roofline, not the HBM roofline. Its effective arithmetic intensity of ~50 FLOPs/byte is then actually close to the L2 ridge point of 82 FLOPs/byte. This is why small-batch attention is not as catastrophically slow as the HBM roofline analysis would suggest.
FlashAttention's design specifically targets keeping K and V tiles in L2/shared memory so that repeated access hits the fast cache path rather than HBM.
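A small sketch of how the attainable ceiling moves when the working set stays resident in L2, using the spec-sheet numbers quoted above (the ~12 TB/s L2 figure is an approximation and varies with access pattern):

```python
PEAK_BF16 = 989e12   # FLOP/s, H100 BF16 tensor cores
HBM_BW = 3.35e12     # bytes/s
L2_BW = 12e12        # bytes/s (approximate)

def ceiling(intensity: float, bandwidth: float) -> float:
    """Roofline ceiling in FLOP/s for a given arithmetic intensity and bandwidth."""
    return min(intensity * bandwidth, PEAK_BF16)

attn_small = 50.0  # ~naive attention at L=256, d=64 (see the analysis above)
print(f"HBM ridge: {PEAK_BF16 / HBM_BW:.0f} FLOPs/byte, L2 ridge: {PEAK_BF16 / L2_BW:.0f} FLOPs/byte")
print(f"Ceiling against HBM: {ceiling(attn_small, HBM_BW) / 1e12:.0f} TFLOP/s")
print(f"Ceiling against L2 : {ceiling(attn_small, L2_BW) / 1e12:.0f} TFLOP/s")
```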
Arithmetic Intensity Depends on Precision
Switching from FP32 to BF16 affects arithmetic intensity in two ways:
- Compute ceiling rises sharply (about 2x for plain vector units, far more once tensor cores engage): the ridge point shifts right dramatically
- Bytes per element halves: arithmetic intensity doubles for the same algorithm
For a GEMM of size N with BF16 (2 bytes/element) vs FP32 (4 bytes/element):
- BF16: $I = 2N^3 / (2 \times 3N^2) = N/3$ FLOPs/byte
- FP32: $I = 2N^3 / (4 \times 3N^2) = N/6$ FLOPs/byte
BF16 has twice the arithmetic intensity, and roughly 3-15x higher compute ceiling (depending on whether tensor cores engage). Both effects push the operation deeper into the compute-bound regime. For memory-bound operations, only the bytes-per-element reduction matters - you get at most a 2x speedup from switching precision for an elementwise operation.
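The two effects are easy to separate numerically. A sketch using the H100 spec-sheet numbers from the table above (67 TFLOPS vector FP32, 989 TFLOPS BF16 tensor cores, 3.35 TB/s HBM):

```python
HBM_BW = 3.35e12  # bytes/s

def gemm_intensity(n: int, dtype_bytes: int) -> float:
    """Square-GEMM arithmetic intensity: 2N^3 FLOPs over 3N^2 elements of traffic."""
    return (2 * n**3) / (dtype_bytes * 3 * n**2)

for n in (1024, 4096):
    i_fp32, i_bf16 = gemm_intensity(n, 4), gemm_intensity(n, 2)
    print(f"N={n}:  FP32 I={i_fp32:6.0f} vs ridge {67e12 / HBM_BW:.0f}   |   "
          f"BF16 I={i_bf16:6.0f} vs ridge {989e12 / HBM_BW:.0f}")
```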
Common Mistakes
:::danger Optimizing the Wrong Bottleneck Adding tensor cores, enabling BF16, and tuning CUDA launch parameters on a memory-bound kernel gives near-zero improvement. The kernel is waiting for data from HBM. Better compute units sit idle. Always measure arithmetic intensity and compare to the ridge point before choosing an optimization strategy. Ten minutes with Nsight Compute saves weeks of wasted engineering. :::
:::danger Assuming Large = Compute-Bound A matrix multiply is not automatically compute-bound just because the matrices are large. For batch size 1 inference with a large square weight matrix, the operation is a GEMV (a single vector times the matrix). Arithmetic intensity is approximately 0.5 FLOPs/byte in FP32 - strongly memory-bound. Switching to FP8 tensor cores saves nothing. The fix is batching tokens or speculative decoding to increase effective batch size, converting the GEMV back into a GEMM. :::
:::warning Confusing Bandwidth-Bound and Latency-Bound Memory-bound operations can fail to reach even the bandwidth ceiling for different reasons. A kernel that performs scattered random reads (e.g., gather operations for embedding lookups) may not saturate bandwidth because memory latency - not throughput - is the bottleneck. Thousands of independent small reads that cannot be pipelined together leave bandwidth headroom unused. Nsight Compute's "Memory Workload Analysis" section distinguishes these cases. The roofline model assumes you can stream data; latency-bound kernels live below the roofline for different reasons. :::
:::warning Using Theoretical FLOPs vs Actual FLOPs Softmax looks like it has roughly 5 FLOPs per element analytically. But its actual FLOP count depends on the implementation: is the max subtraction done in a separate pass? Is the division implemented with a reciprocal multiply? Different implementations have different counts. When comparing analytical and measured arithmetic intensity, use Nsight Compute's hardware counters for FLOPs rather than hand-calculated estimates for any operation involving transcendentals or reductions. :::
:::warning The Roofline Assumes Perfect Memory Access Patterns The roofline model gives an optimistic upper bound. Real kernels are further below the ceiling due to: bank conflicts in shared memory, uncoalesced global memory access, warp divergence, pipeline stalls, launch overhead. The roofline tells you the direction and rough magnitude of the gap to close. Nsight Compute's individual performance limiters (like the "Compute Workload Analysis" and "Memory Workload Analysis" sections) explain the remaining gap between the roofline and your actual performance. :::
Interview Q&A
Q1: Calculate the arithmetic intensity of a BF16 matrix multiply for M=N=K=4096. Is it compute-bound or memory-bound on an H100?
Answer:
FLOPs: $2 \times 4096^3 \approx 1.37 \times 10^{11}$
Bytes accessed (BF16 = 2 bytes/element):
- Read A: $2 \times 4096^2 \approx 33.6$ MB
- Read B: $2 \times 4096^2 \approx 33.6$ MB
- Write C: $2 \times 4096^2 \approx 33.6$ MB
- Total: $\approx 1.01 \times 10^8$ bytes
Arithmetic intensity: $1.37 \times 10^{11} / 1.01 \times 10^8 \approx 1365$ FLOPs/byte
Equivalently, the general formula for a square GEMM in BF16 (dtype_bytes = 2) is:

$$I = \frac{2N^3}{2 \times 3N^2} = \frac{N}{3}$$

For $N = 4096$: $I = 4096 / 3 \approx 1365$ FLOPs/byte.
H100 BF16 Tensor Core ridge point: $989 / 3.35 \approx 295$ FLOPs/byte.
Since $1365 \gg 295$, this operation is strongly compute-bound on the H100. The compute units are the limiting factor. To improve performance, you would need higher compute throughput (e.g., FP8 quantization, which doubles tensor core throughput) or a reduced-FLOP algorithm. Reducing memory bandwidth use does not help.
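A short sanity check of the arithmetic (a sketch):

```python
M = N = K = 4096
dtype_bytes = 2  # BF16
flops = 2 * M * N * K
bytes_moved = dtype_bytes * (M * K + K * N + M * N)
intensity = flops / bytes_moved
ridge_bf16 = 989e12 / 3.35e12
print(f"{flops:.3e} FLOPs over {bytes_moved / 1e6:.1f} MB -> I = {intensity:.0f} FLOPs/byte")
print("compute-bound" if intensity > ridge_bf16 else "memory-bound")
```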
Q2: Why does FlashAttention move the attention operation to the compute-bound region of the roofline?
Answer:
Naive attention for sequence length $L$ and head dimension $d$ must write the $L \times L$ attention matrix (after the $QK^\top$ matmul and before the softmax-weighted $V$ matmul) to HBM. This intermediate write is $O(L^2)$ bytes.
Bytes accessed in naive attention: approximately $2L^2$ elements for the attention-matrix write and re-read, plus $4Ld$ elements for Q, K, V, and the output (the $L^2$ term dominates for long sequences).
FLOPs are $\approx 4L^2 d$ (two GEMMs plus softmax overhead).
So arithmetic intensity is approximately $4L^2 d / (2L^2 \times 4) = d/2$ with the attention matrix held in FP32. For $d = 64$, that is 32 FLOPs/byte - well below the H100 BF16 ridge of 295 FLOPs/byte. Memory-bound.
FlashAttention tiles the computation so that K and V blocks fit in shared memory. The attention matrix is never materialized in HBM. Bytes accessed drop to roughly $4Ld$ elements - only Q, K, V, and output.
New arithmetic intensity: $\approx 4L^2 d / (4Ld \times 2) = L/2$ in BF16. For $L = 2048$, that is on the order of 1,000 FLOPs/byte. The algorithm crossed the ridge point - it is now compute-bound. The 3-5x speedup of FlashAttention comes entirely from this roofline shift, not from reducing FLOPs (which are identical).
Q3: What is the difference between being memory-bandwidth bound and memory-latency bound?
Answer:
Bandwidth-bound means the GPU can pipeline memory requests efficiently but is limited by the total bytes per second the HBM bus can transfer. The memory controller is fully busy, all row buffers are open, and DRAM transfers are happening back-to-back. Adding more DRAM bandwidth would help. Adding cache hits or reusing data (kernel fusion) would help.
Latency-bound means the GPU is issuing many small, independent, non-sequential memory requests, and the execution units are stalling waiting for each request to complete before the next can proceed. The memory bus is not saturated - far fewer bytes per second are actually moving than the bandwidth ceiling would allow. The bottleneck is the round-trip time (HBM latency is ~200-600ns) multiplied by the number of sequential requests.
Embedding table lookups are a classic example: each lookup reads 4-64 bytes from an essentially random address. These cannot be pipelined because the next lookup address depends on the input token. Even with a large embedding table, the GPU hits maybe 10-15% of peak HBM bandwidth because requests queue up behind latency rather than flooding the bandwidth.
The fix for bandwidth-bound: reduce bytes accessed (fusion, tiling, quantization). The fix for latency-bound: group lookups, prefetch, pad to aligned access, or restructure the computation to enable coalescing.
Nsight Compute's "Memory Workload Analysis" page distinguishes the two cases via the "Memory Throughput" vs "L1/L2 Hit Rate" metrics. If bandwidth utilization is low but there are many outstanding requests, you are latency-bound.
Q4: You profile a custom attention kernel on an H100 and find it achieves 120 TFLOPS in BF16. The theoretical arithmetic intensity is 64 FLOPs/byte. Is this kernel well-optimized?
Answer:
First, determine the roofline ceiling at $I = 64$ FLOPs/byte:

$$P = \min(64 \times 3.35 \text{ TB/s},\; 989 \text{ TFLOPS}) = \min(214.4,\; 989) = 214.4 \text{ TFLOPS}$$

The ceiling is 214.4 TFLOPS (the kernel is memory-bound - 64 < 295 ridge point).
The kernel achieves 120 TFLOPS against a ceiling of 214.4 TFLOPS: efficiency $= 120 / 214.4 \approx 56\%$ of the roofline limit.
This is mediocre. A well-optimized memory-bound kernel typically achieves 70-90% of the bandwidth ceiling. The gap indicates secondary bottlenecks: likely uncoalesced memory accesses (the attention matrix is accessed in a pattern that may not be coalesced), shared memory bank conflicts in the softmax reduction, or warp synchronization overhead in the multi-pass algorithm.
The right action is not to switch to FlashAttention immediately (though that would help). First, understand why there is a 44% gap from the bandwidth ceiling. Nsight Compute's "Memory Access Pattern" analysis will show whether the accesses are coalesced. If you close this gap first (getting to ~190 TFLOPS), then deciding whether to implement FlashAttention becomes an informed tradeoff between engineering cost and remaining gain.
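The same calculation in a few lines (a sketch):

```python
intensity = 64.0          # FLOPs/byte
achieved_tflops = 120.0
hbm_tbs, peak_bf16_tflops = 3.35, 989.0

ceiling = min(intensity * hbm_tbs, peak_bf16_tflops)  # 214.4 TFLOPS -> memory-bound side
print(f"ceiling = {ceiling:.1f} TFLOPS, "
      f"efficiency = {achieved_tflops / ceiling:.0%}, "
      f"gap = {ceiling / achieved_tflops:.2f}x")
```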
Q5: A team switches their training cluster from A100 to H100. Their large GEMM operations get roughly 3x faster as expected. But their overall training throughput only improves by 1.4x. What is the likely explanation, and how would you diagnose it with the roofline model?
Answer:
The A100 BF16 Tensor Core ridge point is $312 / 2.0 = 156$ FLOPs/byte. The H100 BF16 Tensor Core ridge point is $989 / 3.35 \approx 295$ FLOPs/byte. The H100 has nearly double the ridge point.
Operations that were just above the A100 ridge (e.g., small-to-medium GEMMs with intensities in the 160-290 FLOPs/byte range) are now in the memory-bound region on the H100. They do not benefit from the 3x compute improvement. They only benefit from the 1.7x bandwidth improvement (3.35 vs 2.0 TB/s).
If the model has a significant fraction of time spent in medium-sized GEMMs (typical for short sequences, small batch sizes, or models with many small projections), those GEMMs went from compute-bound on A100 to memory-bound on H100. The expected 3x speedup collapsed to 1.7x for those layers.
Additionally, all the non-GEMM operations (LayerNorm, activations, embedding lookups) only see a 1.7x improvement from the bandwidth upgrade - these were memory-bound on A100 and remain memory-bound on H100.
Diagnosis using the roofline model:
- Run Nsight Compute on a representative training step on both GPUs
- Export the arithmetic intensity for each kernel
- Plot both roofline charts (A100 and H100) and overlay the kernel positions
- Kernels that shifted from above the ridge to below the ridge are the culprits
Fix: For medium-sized GEMMs that are now memory-bound on H100, increase effective batch size (more tokens per gradient step), use gradient accumulation with larger micro-batches, or fuse operations around the GEMMs to amortize the memory traffic. FlashAttention also helps here by converting the attention bottleneck from memory-bound to compute-bound.
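A short script makes the "which kernels flipped sides" check concrete. This is a sketch - the per-kernel intensities below are hypothetical placeholders; in practice they would come from Nsight Compute exports of a representative training step:

```python
GPUS = {
    "A100": {"peak_bf16_tflops": 312.0, "hbm_tbs": 2.0},
    "H100": {"peak_bf16_tflops": 989.0, "hbm_tbs": 3.35},
}

# Hypothetical per-kernel arithmetic intensities (FLOPs/byte) from a profile
kernels = {"layernorm": 0.9, "naive_attention": 61.0, "gemm_medium": 200.0, "gemm_large": 683.0}

for name, intensity in kernels.items():
    verdicts = {}
    for gpu, spec in GPUS.items():
        ridge = spec["peak_bf16_tflops"] / spec["hbm_tbs"]
        verdicts[gpu] = "compute-bound" if intensity > ridge else "memory-bound"
    flipped = "  <-- flipped between GPUs" if len(set(verdicts.values())) > 1 else ""
    summary = " | ".join(f"{gpu}: {v}" for gpu, v in verdicts.items())
    print(f"{name:<16} I={intensity:6.1f}  {summary}{flipped}")
```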
Q6: Explain how increasing batch size affects the roofline position of a linear layer during inference.
Answer:
A linear layer computes $Y = XW^\top$ where $X \in \mathbb{R}^{B \times d_{in}}$, $W \in \mathbb{R}^{d_{out} \times d_{in}}$, $Y \in \mathbb{R}^{B \times d_{out}}$.
At batch size B=1 (single-token inference): the operation is $y = xW^\top$ where $x$ is a single vector. This is a GEMV.
- FLOPs: $2 d_{in} d_{out}$
- Bytes: Read $W$ ($2 d_{in} d_{out}$ bytes in BF16) + Read $x$ ($2 d_{in}$ bytes) + Write $y$ ($2 d_{out}$ bytes)
- For a large weight matrix ($d_{in}, d_{out} \gg 1$): $\approx 2 d_{in} d_{out}$ bytes total
- Arithmetic intensity: $2 d_{in} d_{out} / 2 d_{in} d_{out} \approx 1$ FLOPs/byte
1 FLOPs/byte is catastrophically memory-bound on every GPU. Single-token autoregressive generation is fundamentally memory-bandwidth limited, not compute limited.
At batch size B=32: now processing 32 tokens simultaneously. Operation is a true GEMM.
- FLOPs: $2 \times 32 \times d_{in} d_{out}$
- Bytes: Same $W$ read, plus $X$ and $Y$ that are 32x larger: $2(d_{in} d_{out} + 32 d_{in} + 32 d_{out})$
- For large weight matrices the $W$ read still dominates: $\approx 2 d_{in} d_{out}$ bytes
- Arithmetic intensity: $\approx 32$ FLOPs/byte (simplifying away the activation traffic)
Still memory-bound for the H100 ridge at 295 FLOPs/byte.
At batch size B=512: $\approx 512$ FLOPs/byte by the same approximation (somewhat lower once activation traffic is counted, but still above the ridge). Now compute-bound on H100.
This is why LLM serving is obsessed with batching. Every token added to a batch amortizes the cost of reading the weight matrix. The "continuous batching" technique in systems like vLLM increases effective batch size by packing tokens from multiple users into a single GPU forward pass, moving the linear layers up the roofline from memory-bound toward the compute ceiling.
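A sweep over batch size shows the transition directly. A sketch assuming $d_{in} = d_{out} = 4096$ in BF16, this time counting the activation traffic as well as the weight read:

```python
def linear_intensity(batch: int, d_in: int = 4096, d_out: int = 4096,
                     dtype_bytes: int = 2) -> float:
    """Arithmetic intensity of Y = X @ W.T, including weight and activation traffic."""
    flops = 2 * batch * d_in * d_out
    bytes_moved = dtype_bytes * (batch * d_in + d_in * d_out + batch * d_out)
    return flops / bytes_moved

H100_RIDGE_BF16 = 989e12 / 3.35e12  # ~295 FLOPs/byte
for b in (1, 8, 32, 128, 512, 2048):
    i = linear_intensity(b)
    bound = "compute-bound" if i > H100_RIDGE_BF16 else "memory-bound"
    print(f"batch={b:5d}  I={i:7.1f} FLOPs/byte  {bound}")
```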
Summary
The roofline model is a two-line performance framework that divides GPU optimization space into exactly two regimes:
- Memory-bound ($I < I_{\text{ridge}}$): performance scales with arithmetic intensity times bandwidth. Fix by fusing kernels, tiling into fast memory, or reducing bytes accessed.
- Compute-bound ($I > I_{\text{ridge}}$): performance is capped at peak FLOPS. Fix by enabling Tensor Cores, using lower precision (FP8), or reducing algorithmic FLOP count.
The ridge point is the dividing line. For the H100 in BF16 Tensor Core mode, that ridge is at 295 FLOPs/byte. Most transformer operations - elementwise activations, LayerNorm, naive attention, small GEMMs - fall well below this ridge. Only large matrix multiplications and FlashAttention-style tiled algorithms reliably cross it.
Key numbers to know cold:
- H100 HBM bandwidth: 3.35 TB/s
- H100 BF16 Tensor Core peak: 989 TFLOPS
- H100 BF16 ridge: 295 FLOPs/byte
- Elementwise op intensity: ~0.125 FLOPs/byte
- Large GEMM intensity: $\approx N/3$ FLOPs/byte in BF16 ($N/6$ in FP32)
- Naive attention intensity: $\approx d$ FLOPs/byte (about 60 for $d = 64$)
- FlashAttention intensity: $\approx L/2$ FLOPs/byte in BF16 (roughly 1,000 at $L = 2048$)
The roofline model is the first tool to reach for when a kernel is slow. Measure arithmetic intensity. Compare to the ridge. The optimization path follows directly.
