
Memory Bandwidth Roofline Analysis

Reading time: ~40 min - Interview relevance: Very High - Target roles: Performance Engineer, CUDA Developer, ML Infrastructure

The roofline model answers the most important question in GPU optimization: is this kernel limited by compute or by memory bandwidth? Without this answer, you might spend days optimizing arithmetic in a memory-bound kernel - achieving nothing.


The Day the Optimization Sprint Wasted Two Weeks

The ML infrastructure team at a mid-size AI lab had a problem. Their transformer training runs were slower than expected - about 40% below the theoretical FLOP/s ceiling of their A100 cluster. The obvious conclusion: the matrix multiplications must be inefficient. They spent two weeks exploring custom CUDA kernels for the attention score computation, experimenting with different tile sizes, register allocations, and instruction scheduling. They achieved a 12% improvement in the attention FLOP/s rate.

The training run got 2% faster.

The postmortem revealed what had actually been happening. The attention kernel - the one they had just spent two weeks on - was not the bottleneck. It was memory-bound, not compute-bound. All that work on arithmetic throughput had been optimizing the wrong ceiling. The kernel was spending most of its time waiting for memory transfers from HBM, not performing arithmetic. No amount of instruction-level optimization was going to change that.

What they needed was a roofline analysis. A five-minute measurement and a simple plot would have told them immediately: this kernel's arithmetic intensity is 0.3 FLOPs/byte, far below the ridge point of the A100. The ceiling is memory bandwidth. The fix is kernel fusion to reduce memory traffic - not better instruction scheduling.

This scenario repeats across GPU engineering teams constantly. Teams optimize compute when they are memory-bound. They optimize memory access when they are compute-bound. They chase the wrong ceiling because they never measured which ceiling they were actually hitting. The roofline model is the tool that ends this guessing game. It is a performance analysis framework that takes two hardware measurements and one kernel measurement, and gives you a precise diagnosis: here is your bottleneck, here is how far you are from the limit, here is what to do.


Why This Exists - The Problem Before Roofline

Before the roofline model, GPU performance analysis was largely empirical and intuition-driven. Engineers would profile a kernel, see that it was "slow," and begin trying optimizations based on experience and gut feel. The process was expensive, slow, and often wrong.

The deeper problem was conceptual: people thought about GPU performance as a single number - achieved FLOP/s as a fraction of peak FLOP/s. If a kernel was achieving 30% of peak FLOP/s, that looked like poor utilization. But this framing ignores the fundamental constraint: every computation requires data, and data has to travel from somewhere. A kernel that performs one floating-point operation per 8 bytes of data loaded cannot possibly achieve high FLOP/s utilization, even with perfect arithmetic throughput, because the data loading is the bottleneck.

The conventional metric - "percentage of peak FLOP/s" - was measuring the wrong thing for memory-bound kernels. It compared achieved throughput against a compute ceiling that the kernel was not actually constrained by. The result was a meaningless number that gave no guidance about where to optimize.

The roofline model, introduced by Williams, Waterman, and Patterson at Berkeley in 2009, reframed the problem correctly. Instead of asking "what fraction of peak FLOP/s is this kernel achieving," it asks "what is the maximum performance this kernel can possibly achieve, given its arithmetic intensity and the hardware limits?" This reframing immediately makes it clear whether the limit is compute or memory, and how far the kernel is from that limit.


Historical Context - Berkeley, 2009

Samuel Williams was a postdoctoral researcher at Lawrence Berkeley National Laboratory in 2008. He was working on performance analysis of HPC kernels - stencil computations, sparse matrix operations, FFTs - the kinds of workloads that drove scientific computing. He and his colleagues Andrew Waterman and David Patterson were frustrated by the same problem: kernels that performed well below peak FLOP/s but could not obviously be improved further.

The insight came from a simple observation: a computation's performance is bounded by two independent hardware limits - peak arithmetic throughput (measured in FLOP/s) and peak memory bandwidth (measured in bytes/second). A computation that needs very little data per FLOP is constrained by arithmetic throughput. A computation that needs a lot of data per FLOP is constrained by memory bandwidth. The crossover point - the ridge point - is the ratio of these two limits.

Williams, Waterman, and Patterson published "Roofline: An Insightful Visual Performance Model for Multicore Architectures" in Communications of the ACM in 2009. The paper was influential immediately because it gave practitioners a simple visual tool - a plot with two axes and two lines - that encoded a complete performance diagnosis.

The model transferred naturally to GPUs, where the gap between compute peak and memory bandwidth peak is even larger than on CPUs. A modern A100 GPU can perform 312 TFLOP/s of FP16 arithmetic but has only 2 TB/s of memory bandwidth. The ridge point - the arithmetic intensity above which you become compute-bound - is 312 / 2 = 156 FLOPs/byte. Most operations in neural networks are far below this ridge point. This is why memory bandwidth optimization is so central to GPU ML performance.


Core Concepts - Intuition First

What Is Arithmetic Intensity?

Imagine you are moving furniture. Your truck can carry 10 pieces per trip, and you can lift and place furniture at a rate of 50 pieces per hour. If each trip delivers only a few small pieces, you spend most of your time driving back and forth, not lifting - the trips are the bottleneck. If each trip delivers 100 tiny decorative items, the bottleneck is your lifting speed.

Arithmetic intensity is the ratio of computation to memory traffic. It answers: for every byte of data loaded from memory, how many floating-point operations does this kernel perform?

$$\text{Arithmetic Intensity (AI)} = \frac{\text{FLOPs executed}}{\text{Bytes transferred to/from memory}}$$

A kernel with low arithmetic intensity (like an elementwise multiply) does very little work per byte loaded. It spends most of its time waiting for data. A kernel with high arithmetic intensity (like a large matrix multiplication) does a lot of work per byte loaded. It can keep the arithmetic units busy.

The Two Ceilings

Any kernel's performance is bounded by two hardware limits:

Ceiling 1 - Compute Bound: The kernel cannot execute more FLOPs per second than the hardware's peak arithmetic throughput.

$$\text{Performance} \leq P_\text{peak}$$

where $P_\text{peak}$ is measured in FLOP/s (e.g., 312 TFLOP/s for A100 FP16).

Ceiling 2 - Memory Bound: The kernel cannot load/store data faster than the hardware's peak memory bandwidth. If it performs $I$ FLOPs per byte and the hardware can move $B_\text{peak}$ bytes/second, the maximum performance is:

$$\text{Performance} \leq I \times B_\text{peak}$$

where $B_\text{peak}$ is measured in bytes/second (e.g., 2 TB/s for A100 HBM2e) and $I$ is the arithmetic intensity.

The actual performance ceiling is the minimum of these two:

$$\text{Performance}_\text{ceiling} = \min\left(P_\text{peak},\ I \times B_\text{peak}\right)$$

The Ridge Point

The ridge point is the arithmetic intensity at which the two ceilings are equal:

$$I_\text{ridge} = \frac{P_\text{peak}}{B_\text{peak}}$$

For the A100 SXM4 GPU:

  • $P_\text{peak}$ (FP16 Tensor Core) = 312 TFLOP/s = $3.12 \times 10^{14}$ FLOP/s
  • $B_\text{peak}$ (HBM2e) = 2.0 TB/s = $2.0 \times 10^{12}$ bytes/s
  • $I_\text{ridge} = (3.12 \times 10^{14}) / (2.0 \times 10^{12}) = 156$ FLOPs/byte

Any kernel with arithmetic intensity below 156 FLOPs/byte on the A100 is memory-bound. Any kernel above 156 is compute-bound.
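
To make the two-ceiling rule concrete, here is a minimal sketch (not tied to any profiler or library) that computes the ridge point and the attainable ceiling at a given arithmetic intensity, using the A100 numbers above:

def roofline_ceiling(ai_flops_per_byte, peak_flops, peak_bandwidth_bytes_s):
    """Return the attainable performance (FLOP/s) at a given arithmetic intensity."""
    return min(peak_flops, ai_flops_per_byte * peak_bandwidth_bytes_s)

# A100 SXM4 example values from the text
P_PEAK = 312e12   # FP16 Tensor Core FLOP/s
B_PEAK = 2.0e12   # HBM2e bytes/s

ridge = P_PEAK / B_PEAK   # 156 FLOPs/byte
for ai in [0.25, 10.7, 42.7, 682.0]:
    ceiling = roofline_ceiling(ai, P_PEAK, B_PEAK)
    bound = "memory-bound" if ai < ridge else "compute-bound"
    print(f"AI = {ai:7.2f} FLOPs/byte -> ceiling {ceiling / 1e12:6.1f} TFLOP/s ({bound})")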

The Roofline Plot

The roofline model is visualized on a log-log plot:

  • X-axis: arithmetic intensity (FLOPs/byte), log scale
  • Y-axis: performance (FLOP/s), log scale

The "roofline" is formed by two line segments:

  1. A diagonal line with slope 1 (in log-log space): $\text{Performance} = I \times B_\text{peak}$, representing the memory-bound region
  2. A horizontal line: $\text{Performance} = P_\text{peak}$, representing the compute-bound region

A kernel is plotted as a point. If it falls on the diagonal, it is fully utilizing memory bandwidth. If it falls on the horizontal line, it is fully utilizing arithmetic throughput. If it falls below either line, it is underutilizing the relevant resource and there is room to optimize.


The Math - Computing Arithmetic Intensity for Common Operations

Elementwise Operations

Consider an elementwise ReLU: $y_i = \max(0, x_i)$ over a tensor of $N$ FP32 elements.

FLOPs: 1 comparison (or max operation) per element = $N$ FLOPs.

Memory traffic: Load $x$ ($4N$ bytes) + store $y$ ($4N$ bytes) = $8N$ bytes.

$$I_\text{ReLU} = \frac{N}{8N} = 0.125 \text{ FLOPs/byte}$$

For FP16: load $2N$ bytes, store $2N$ bytes = $4N$ bytes, giving $I = 0.25$ FLOPs/byte.

This is extremely memory-bound. Even at full memory bandwidth, an FP32 ReLU tops out at $0.125 \times 2$ TB/s $= 0.25$ TFLOP/s - about 0.125 / 156 ≈ 0.08% of the A100's FP16 Tensor Core peak. It will never be compute-bound; this is always a pure memory bandwidth problem.

Elementwise Operations with Fusion

If you fuse ReLU with a preceding addition ($y_i = \max(0, x_i + b_i)$):

FLOPs: 1 addition + 1 max = 2 FLOPs per element = $2N$ FLOPs.

Memory traffic: Load $x$ ($4N$ bytes) + load $b$ ($4N$ bytes) + store $y$ ($4N$ bytes) = $12N$ bytes.

$$I_\text{fused} = \frac{2N}{12N} \approx 0.167 \text{ FLOPs/byte}$$

Barely higher. The key insight: fusing elementwise operations reduces memory traffic (fewer intermediate writes/reads) without changing the fundamental constraint. The gains from fusion are real but modest for elementwise operations - the arithmetic intensity stays low.

Matrix Multiplication

For a matrix multiplication $C = A \cdot B$ where $A$ is $(M \times K)$ and $B$ is $(K \times N)$:

FLOPs: Each of the $M \times N$ output elements requires $K$ multiply-accumulate operations = $2MNK$ FLOPs (the factor of 2 accounts for both the multiply and the add, though some analyses use $MNK$).

Memory traffic (naive, FP32): Load $A$ ($4MK$ bytes) + load $B$ ($4KN$ bytes) + store $C$ ($4MN$ bytes) = $4(MK + KN + MN)$ bytes.

For a square matmul ($M = N = K = n$):

$$I_\text{matmul} = \frac{2n^3}{4(n^2 + n^2 + n^2)} = \frac{2n^3}{12n^2} = \frac{n}{6} \text{ FLOPs/byte}$$

For $n = 4096$ (a typical large matmul): $I \approx 682$ FLOPs/byte. This is well above the ridge point of 156 on the A100 - large matmuls are compute-bound, which is why cuBLAS can achieve 70-80% of peak TFLOP/s on large tiles.

For small $n$ (e.g., $n = 64$): $I \approx 10.7$ FLOPs/byte - memory-bound.

This is the fundamental reason why small batch sizes hurt GPU efficiency: small matmuls have low arithmetic intensity and are memory-bound.
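
As a quick check of these numbers, here is a small sketch that evaluates the naive-traffic AI formula above for a few matrix sizes (FP32, 4 bytes per element, matching the derivation):

def matmul_ai(m, k, n, bytes_per_elem=4):
    """Arithmetic intensity of a naive (M,K)x(K,N) matmul: 2MNK FLOPs over A + B + C traffic."""
    flops = 2 * m * n * k
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved

RIDGE_A100 = 156  # FLOPs/byte: FP16 Tensor Core peak / HBM bandwidth
for n in [64, 256, 1024, 4096]:
    ai = matmul_ai(n, n, n)
    bound = "compute-bound" if ai > RIDGE_A100 else "memory-bound"
    print(f"n={n:5d}: AI = {ai:7.1f} FLOPs/byte ({bound} on A100)")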

Attention - The Problematic Case

The attention computation $\text{Attention}(Q, K, V) = \text{softmax}(QK^T / \sqrt{d_k})\, V$ for sequence length $L$ and head dimension $d$:

Naive implementation FLOPs:

  • $QK^T$: $2L^2 d$ FLOPs
  • Softmax: $O(L^2)$ FLOPs (negligible)
  • $(\cdot)V$: $2L^2 d$ FLOPs
  • Total: $\approx 4L^2 d$ FLOPs

Naive implementation memory traffic:

  • Load $Q$, $K$, $V$: $3 \times 2Ld$ bytes (FP16) = $6Ld$ bytes
  • Write/read $QK^T$ intermediate: $2L^2$ bytes (write) + $2L^2$ bytes (read) = $4L^2$ bytes
  • Read attention weights after softmax: $2L^2$ bytes
  • Write output: $2Ld$ bytes
  • Total: $6Ld + 6L^2 + 2Ld \approx 6L^2$ bytes (for large $L$, where $L \gg d$)

$$I_\text{attention} = \frac{4L^2 d}{6L^2} = \frac{2d}{3} \text{ FLOPs/byte}$$

For $d = 64$ (typical): $I \approx 42.7$ FLOPs/byte. Below the ridge point of 156 - memory-bound.

This is exactly why FlashAttention was transformative: it avoids materializing the $L \times L$ attention matrix, reducing memory traffic from $O(L^2)$ to $O(L \cdot d)$, which directly increases arithmetic intensity. FlashAttention achieves 2-4x speedups not by changing the FLOP count but by reducing memory traffic.


Code - Measuring and Plotting the Roofline

Step 1 - Measuring Peak Memory Bandwidth (STREAM Equivalent)

import torch
import numpy as np

def measure_peak_memory_bandwidth(device='cuda', size_gb=1.0, n_trials=20):
    """
    Measure peak HBM bandwidth using a simple copy kernel.
    Analogous to the STREAM benchmark for CPUs.
    The copy kernel performs no arithmetic, so it is purely memory-bound --
    achieved GB/s approximates peak bandwidth.
    """
    n_elements = int(size_gb * 1e9 / 4)  # FP32 = 4 bytes each

    src = torch.ones(n_elements, dtype=torch.float32, device=device)
    dst = torch.empty(n_elements, dtype=torch.float32, device=device)

    # Warm up
    for _ in range(5):
        dst.copy_(src)
    torch.cuda.synchronize()

    bandwidths = []
    for _ in range(n_trials):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)

        start.record()
        dst.copy_(src)
        end.record()

        torch.cuda.synchronize()
        elapsed_ms = start.elapsed_time(end)
        elapsed_s = elapsed_ms / 1000.0

        # Read src + write dst = 2 * n_elements * 4 bytes
        bytes_transferred = 2 * n_elements * 4
        bw_gbs = bytes_transferred / elapsed_s / 1e9
        bandwidths.append(bw_gbs)

    peak_bw = max(bandwidths)
    median_bw = np.median(bandwidths)

    print(f"Peak measured bandwidth: {peak_bw:.1f} GB/s")
    print(f"Median measured bandwidth: {median_bw:.1f} GB/s")
    print(f"(A100 theoretical: 2000 GB/s)")
    return peak_bw

peak_bandwidth_gbs = measure_peak_memory_bandwidth()

Step 2 - Measuring Peak Compute Throughput

import torch

def measure_peak_flops(device='cuda', n=4096, n_trials=20):
    """
    Measure peak FP16 Tensor Core TFLOP/s using a large matrix multiplication.
    A large square matmul has high arithmetic intensity and is compute-bound.
    """
    A = torch.randn(n, n, dtype=torch.float16, device=device)
    B = torch.randn(n, n, dtype=torch.float16, device=device)

    # Warm up
    for _ in range(5):
        C = torch.mm(A, B)
    torch.cuda.synchronize()

    flops_per_matmul = 2 * n * n * n  # 2*N^3 for an NxN square matmul

    times = []
    for _ in range(n_trials):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)

        start.record()
        C = torch.mm(A, B)
        end.record()

        torch.cuda.synchronize()
        elapsed_ms = start.elapsed_time(end)
        times.append(elapsed_ms)

    best_ms = min(times)
    best_tflops = flops_per_matmul / (best_ms / 1000.0) / 1e12

    print(f"Best matmul time: {best_ms:.2f} ms")
    print(f"Peak measured TFLOP/s: {best_tflops:.1f}")
    print(f"(A100 FP16 theoretical: 312 TFLOP/s)")
    return best_tflops * 1e12  # return in FLOP/s

peak_flops = measure_peak_flops()

Step 3 - Computing Arithmetic Intensity for a Kernel

import torch

def compute_roofline_point(kernel_fn, inputs, flops, dtype=torch.float16):
    """
    Measure achieved FLOP/s for a kernel and compute its arithmetic intensity.

    Args:
        kernel_fn: callable that takes inputs and returns output
        inputs: list of input tensors
        flops: known FLOPs count for this operation
        dtype: data type for estimating memory traffic

    Returns:
        (arithmetic_intensity, achieved_tflops)
    """
    bytes_per_element = 2 if dtype == torch.float16 else 4

    # Compute bytes transferred: sum of all input + output sizes
    total_input_bytes = sum(t.numel() * bytes_per_element for t in inputs)
    output = kernel_fn(*inputs)
    total_output_bytes = output.numel() * bytes_per_element
    total_bytes = total_input_bytes + total_output_bytes

    arithmetic_intensity = flops / total_bytes

    # Measure time
    n_trials = 50
    # Warm up
    for _ in range(10):
        out = kernel_fn(*inputs)
    torch.cuda.synchronize()

    times = []
    for _ in range(n_trials):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        out = kernel_fn(*inputs)
        end.record()
        torch.cuda.synchronize()
        times.append(start.elapsed_time(end))

    best_ms = min(times)
    achieved_flops = flops / (best_ms / 1000.0)
    achieved_tflops = achieved_flops / 1e12

    return arithmetic_intensity, achieved_tflops


# Example: measure elementwise ReLU
N = 64 * 1024 * 1024  # 64M elements
x = torch.randn(N, dtype=torch.float16, device='cuda')

relu_ai, relu_tflops = compute_roofline_point(
    kernel_fn=torch.relu,
    inputs=[x],
    flops=N,  # 1 FLOP per element
    dtype=torch.float16
)
print(f"ReLU -- AI: {relu_ai:.3f} FLOPs/byte, Achieved: {relu_tflops:.2f} TFLOP/s")

# Example: measure large matmul
M, K, N_mat = 4096, 4096, 4096
A = torch.randn(M, K, dtype=torch.float16, device='cuda')
B = torch.randn(K, N_mat, dtype=torch.float16, device='cuda')

matmul_flops = 2 * M * K * N_mat

matmul_ai, matmul_tflops = compute_roofline_point(
    kernel_fn=torch.mm,
    inputs=[A, B],
    flops=matmul_flops,
    dtype=torch.float16
)
print(f"MatMul 4096^3 -- AI: {matmul_ai:.1f} FLOPs/byte, Achieved: {matmul_tflops:.1f} TFLOP/s")

# Example: attention score computation (QK^T)
L, d = 2048, 64
Q = torch.randn(L, d, dtype=torch.float16, device='cuda')
K_mat = torch.randn(L, d, dtype=torch.float16, device='cuda')

attn_flops = 2 * L * L * d  # QK^T is (L,d) x (d,L) = 2*L^2*d FLOPs

attn_ai, attn_tflops = compute_roofline_point(
    kernel_fn=torch.mm,
    inputs=[Q, K_mat.t().contiguous()],
    flops=attn_flops,
    dtype=torch.float16
)
print(f"Attn QK^T L={L} -- AI: {attn_ai:.1f} FLOPs/byte, Achieved: {attn_tflops:.2f} TFLOP/s")

Step 4 - Plotting the Roofline

import numpy as np
import matplotlib.pyplot as plt

def plot_roofline(
    peak_bandwidth_gbs,   # measured peak bandwidth in GB/s
    peak_tflops,          # measured peak compute in TFLOP/s
    kernels,              # list of (name, ai, tflops) tuples
    title="GPU Roofline Plot"
):
    """
    Plot the roofline model with kernel operating points.

    Args:
        peak_bandwidth_gbs: peak memory bandwidth in GB/s
        peak_tflops: peak compute throughput in TFLOP/s
        kernels: list of (label, arithmetic_intensity, achieved_tflops)
    """
    peak_bw_tbs = peak_bandwidth_gbs / 1000.0  # TB/s
    peak_flops_per_s = peak_tflops * 1e12      # FLOP/s

    ridge_point = peak_flops_per_s / (peak_bw_tbs * 1e12)

    # Create AI range for plotting
    ai_range = np.logspace(-2, 4, 1000)  # 0.01 to 10000 FLOPs/byte

    # Roofline ceiling
    memory_roof = ai_range * (peak_bw_tbs * 1e12)  # FLOP/s
    compute_roof = np.full_like(ai_range, peak_flops_per_s)
    roofline = np.minimum(memory_roof, compute_roof)
    roofline_tflops = roofline / 1e12

    fig, ax = plt.subplots(figsize=(12, 7))
    ax.set_facecolor('#f8fafc')
    fig.patch.set_facecolor('#f8fafc')

    # Plot roofline
    ax.loglog(ai_range, roofline_tflops, 'b-', linewidth=2.5, label='Roofline ceiling', zorder=3)

    # Shade regions
    ax.fill_between(
        ai_range[ai_range <= ridge_point],
        roofline_tflops[ai_range <= ridge_point],
        alpha=0.08, color='#2563eb', label='Memory-bound region'
    )
    ax.fill_between(
        ai_range[ai_range >= ridge_point],
        roofline_tflops[ai_range >= ridge_point],
        alpha=0.08, color='#16a34a', label='Compute-bound region'
    )

    # Vertical ridge point line
    ax.axvline(
        x=ridge_point, color='#dc2626', linestyle='--',
        linewidth=1.5, alpha=0.7, label=f'Ridge point ({ridge_point:.0f} FLOPs/byte)'
    )

    # Color map for different kernels
    colors = ['#dc2626', '#d97706', '#7c3aed', '#0284c7', '#15803d']
    markers = ['o', 's', '^', 'D', 'v']

    for i, (name, ai, tflops) in enumerate(kernels):
        color = colors[i % len(colors)]
        marker = markers[i % len(markers)]

        # Compute ceiling at this AI
        ceiling_tflops = min(ai * peak_bw_tbs, peak_tflops)
        efficiency = tflops / ceiling_tflops * 100

        ax.scatter(
            ai, tflops,
            c=color, s=120, zorder=5, marker=marker,
            edgecolors='white', linewidths=1.5,
            label=f'{name} ({tflops:.1f} TFLOP/s, {efficiency:.0f}% of ceiling)'
        )
        # Arrow pointing from the kernel's operating point up to its ceiling
        ax.annotate(
            '', xy=(ai, ceiling_tflops), xytext=(ai, tflops),
            arrowprops=dict(arrowstyle='->', color=color, lw=1.5),
            zorder=4
        )

    ax.set_xlabel('Arithmetic Intensity (FLOPs / byte)', fontsize=13)
    ax.set_ylabel('Performance (TFLOP/s)', fontsize=13)
    ax.set_title(title, fontsize=14, fontweight='bold')
    ax.legend(loc='upper left', fontsize=9, framealpha=0.9)
    ax.grid(True, which='both', alpha=0.3, color='#94a3b8')
    ax.set_xlim(0.01, 10000)

    # Annotations
    ax.text(0.05, peak_tflops * 1.1, f' Compute Ceiling\n {peak_tflops:.0f} TFLOP/s',
            color='#2563eb', fontsize=9, va='bottom')
    ax.text(0.02, 0.1, 'Memory-Bound\n(optimize bandwidth)', color='#2563eb',
            fontsize=9, transform=ax.transAxes, va='bottom')
    ax.text(0.75, 0.1, 'Compute-Bound\n(optimize FLOPs)', color='#16a34a',
            fontsize=9, transform=ax.transAxes, va='bottom')

    plt.tight_layout()
    plt.savefig('roofline_analysis.png', dpi=150, bbox_inches='tight')
    plt.show()
    print("Roofline saved to roofline_analysis.png")


# Example usage - A100 SXM4 values
plot_roofline(
    peak_bandwidth_gbs=1935,  # A100 measured HBM bandwidth
    peak_tflops=290,          # A100 measured FP16 Tensor Core TFLOP/s (realistic)
    kernels=[
        ("ReLU (FP16)", relu_ai, relu_tflops),
        ("Attn QK^T L=2048", attn_ai, attn_tflops),
        ("MatMul 4096^3", matmul_ai, matmul_tflops),
    ],
    title="A100 SXM4 Roofline - Transformer Kernels"
)

Step 5 - Diagnosing and Fixing a Memory-Bound Kernel

import torch
import torch.nn.functional as F

# ---- BEFORE: unfused operations (high memory traffic) ----

def attention_naive(Q, K, V, scale):
    """
    Naive attention - writes the intermediate S matrix to HBM.
    Memory traffic: ~6*L^2 bytes (for FP16, large L)
    """
    S = torch.mm(Q, K.t()) * scale  # (L, L) -- written to HBM
    P = F.softmax(S, dim=-1)        # (L, L) -- read from HBM, written back
    O = torch.mm(P, V)              # (L, d) -- P read from HBM
    return O

# ---- AFTER: measure arithmetic intensity difference ----

def benchmark_attention_variants(L=2048, d=64, device='cuda'):
    Q = torch.randn(L, d, dtype=torch.float16, device=device)
    K = torch.randn(L, d, dtype=torch.float16, device=device)
    V = torch.randn(L, d, dtype=torch.float16, device=device)
    scale = d ** -0.5

    # Naive: count the actual bytes moved
    # Q: L*d*2 bytes, K: L*d*2 bytes, V: L*d*2 bytes (loaded once each)
    # S = QK^T: L*L*2 bytes (written) + L*L*2 bytes (read for softmax) + L*L*2 bytes (read for matmul)
    # O: L*d*2 bytes (written)
    naive_bytes = (3 * L * d * 2) + (3 * L * L * 2) + (L * d * 2)
    naive_flops = 4 * L * L * d  # QK^T + AV, each 2*L^2*d

    naive_ai = naive_flops / naive_bytes

    # FlashAttention: no L*L intermediate stored in HBM
    # Bytes: load Q, K, V + store O = 4 * L * d * 2
    flash_bytes = 4 * L * d * 2
    flash_flops = naive_flops  # same FLOPs count

    flash_ai = flash_flops / flash_bytes

    print(f"Sequence length L={L}, head dim d={d}")
    print(f"")
    print(f"Naive attention:")
    print(f"  Arithmetic intensity: {naive_ai:.1f} FLOPs/byte")
    print(f"  Memory traffic: {naive_bytes / 1e6:.1f} MB")
    print(f"")
    print(f"FlashAttention (kernel-fused):")
    print(f"  Arithmetic intensity: {flash_ai:.1f} FLOPs/byte")
    print(f"  Memory traffic: {flash_bytes / 1e6:.1f} MB")
    print(f"  Traffic reduction: {naive_bytes / flash_bytes:.1f}x fewer bytes")
    print(f"  AI improvement: {flash_ai / naive_ai:.1f}x higher arithmetic intensity")

benchmark_attention_variants(L=2048, d=64)
benchmark_attention_variants(L=8192, d=64)

Output for L=2048:

Sequence length L=2048, head dim d=64

Naive attention:
  Arithmetic intensity: 41.0 FLOPs/byte
  Memory traffic: 26.2 MB

FlashAttention (kernel-fused):
  Arithmetic intensity: 1024.0 FLOPs/byte
  Memory traffic: 1.0 MB
  Traffic reduction: 25.0x fewer bytes
  AI improvement: 25.0x higher arithmetic intensity

[Diagram: the roofline model - the memory-bandwidth slope and the compute ceiling meeting at the ridge point]

[Diagram: arithmetic intensity by operation type]


GPU Roofline Parameters by Generation

GPU        | FP16 Peak (TFLOP/s) | Bandwidth (GB/s) | Ridge Point (FLOPs/byte)
V100 SXM2  | 125                 | 900              | 139
A100 SXM4  | 312                 | 2000             | 156
H100 SXM5  | 1979                | 3350             | 591
RTX 4090   | 330                 | 1008             | 327
RTX 3090   | 142                 | 936              | 152

Note: H100 has a dramatically higher ridge point (591 vs 156 for A100) because its Tensor Core throughput scaled much faster than its HBM bandwidth. The ridge point sits further right, so a kernel needs higher arithmetic intensity to be compute-bound - operations that were near or above the ridge on A100 can fall back into the memory-bound region on H100.
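
The ridge points in this table follow directly from dividing the two peaks; a quick sketch to reproduce them from the spec values above:

gpus = {
    "V100 SXM2": (125, 900),    # (FP16 peak TFLOP/s, bandwidth GB/s)
    "A100 SXM4": (312, 2000),
    "H100 SXM5": (1979, 3350),
    "RTX 4090":  (330, 1008),
    "RTX 3090":  (142, 936),
}
for name, (tflops, gbs) in gpus.items():
    ridge = (tflops * 1e12) / (gbs * 1e9)   # FLOPs per byte
    print(f"{name:10s}: ridge point = {ridge:5.0f} FLOPs/byte")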


Production Engineering Notes

Using NVIDIA Nsight Compute for Roofline

Nsight Compute provides an automated roofline view. Run:

ncu --set full --export profile.ncu-rep python train.py
# Then open profile.ncu-rep in Nsight Compute GUI
# Navigate to: Details -> Roofline tab

The Nsight roofline chart shows:

  • The L1, L2, and DRAM rooflines (each cache level has its own bandwidth ceiling)
  • Your kernel's operating point plotted automatically
  • The gap to each ceiling as a percentage

Key difference from naive roofline: Nsight uses a "Hierarchical Roofline" that accounts for L1 and L2 cache. If your kernel has good data reuse and fits in L1, the effective bandwidth ceiling is much higher than DRAM bandwidth. A kernel might appear below the DRAM roofline but above the L1 roofline - meaning it is L1 bandwidth bound, not DRAM bandwidth bound.

Roofline for Mixed-Precision Training

In FP16/BF16 training, the roofline analysis becomes more nuanced:

  1. Forward pass activations - stored in FP16: use 2 bytes/element for traffic calculation
  2. Gradient computation - also FP16: 2 bytes/element
  3. Optimizer state - stored in FP32 for numerical stability: 4 bytes/element
  4. Weight updates - FP32 master weights: 4 bytes/element

For Adam optimizer with mixed-precision, the memory traffic per parameter update is:

  • Load weights (FP32): 4 bytes
  • Load gradients (FP16 or FP32 depending on implementation): 2-4 bytes
  • Load/store m, v states (FP32): 8 bytes read + 8 bytes write
  • Store updated weights: 4 bytes
  • Total: ~26-28 bytes per parameter, with ~7 FLOPs (3 multiplies, 2 adds, 1 sqrt, 1 divide)
  • AI for Adam: ~0.25 FLOPs/byte - extremely memory-bound

This is why optimizer steps are almost never the compute bottleneck - they are irreducibly memory-bound.
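
The same accounting as a quick sketch. The per-parameter byte and FLOP counts mirror the list above and are rough assumptions (FP32 gradients assumed; real optimizer implementations differ in detail):

def adam_step_roofline(n_params, peak_bandwidth_bytes_s=2.0e12):
    """Estimate arithmetic intensity and bandwidth-limited time for one Adam step."""
    bytes_per_param = 4 + 4 + 16 + 4   # weight load + grad load + m,v read/write + weight store (FP32)
    flops_per_param = 7                # rough count: multiplies, adds, sqrt, divide
    ai = flops_per_param / bytes_per_param
    # Memory-bound, so time is set by bytes moved, not FLOPs
    step_time_s = n_params * bytes_per_param / peak_bandwidth_bytes_s
    return ai, step_time_s

ai, t = adam_step_roofline(7e9)  # e.g. a 7B-parameter model on ~2 TB/s HBM
print(f"Adam AI ~{ai:.2f} FLOPs/byte, bandwidth-limited step time ~{t * 1e3:.0f} ms")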

Batch Size and Arithmetic Intensity

One of the most impactful levers for arithmetic intensity is batch size. Consider a linear layer $Y = XW$ where $X$ is $(B \times d_{in})$ and $W$ is $(d_{in} \times d_{out})$, stored in FP32 (4 bytes per element):

For a single token (batch = 1, $d_{in} = d_{out} = 4096$):

$$I = \frac{2 \times 1 \times 4096 \times 4096}{4 \times (1 \times 4096 + 4096 \times 4096 + 1 \times 4096)} \approx \frac{33.6\text{M}}{67.1\text{M}} \approx 0.5 \text{ FLOPs/byte}$$

For batch = 512:

$$I \approx \frac{2 \times 512 \times 4096 \times 4096}{4 \times (512 \times 4096 + 4096 \times 4096 + 512 \times 4096)} \approx \frac{17.2\text{G}}{83.9\text{M}} \approx 205 \text{ FLOPs/byte}$$

Batch size 512 crosses the ridge point of 156. Batch size 1 is roughly 300x below it. This is the quantitative explanation for why inference at batch size 1 is so wasteful of GPU compute.
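
The same linear-layer calculation as a short sketch, sweeping batch size (FP32 storage assumed, matching the numbers above):

def linear_layer_ai(batch, d_in, d_out, bytes_per_elem=4):
    """AI of Y = X @ W: 2*B*d_in*d_out FLOPs over X + W + Y traffic."""
    flops = 2 * batch * d_in * d_out
    bytes_moved = bytes_per_elem * (batch * d_in + d_in * d_out + batch * d_out)
    return flops / bytes_moved

for batch in [1, 8, 64, 512, 4096]:
    ai = linear_layer_ai(batch, 4096, 4096)
    print(f"batch={batch:5d}: AI = {ai:6.1f} FLOPs/byte")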

Kernel Fusion Strategy

When optimizing memory-bound kernels, the goal is to increase arithmetic intensity by reducing total memory traffic. The strategy:

  1. Identify the data flow - which tensors are written and then immediately read by the next operation?
  2. Fuse those operations - compute both operations in a single kernel pass, keeping intermediate results in registers or shared memory rather than writing to HBM
  3. Calculate the new AI - verify that fusion actually increases arithmetic intensity

Common fusion opportunities:

  • Linear + Bias + Activation: fuse into one kernel (saves reading/writing bias and activation tensor separately)
  • LayerNorm + residual add: fuse (saves reading the residual tensor separately)
  • Attention: QK^T + scale + softmax + AV all in one (FlashAttention)
  • MLP: in LLaMA-style SwiGLU MLPs, the output is SiLU(gate_proj(x)) * up_proj(x) - the elementwise SiLU and multiply can be fused with the projection epilogues (a traffic-accounting sketch follows below)
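
To make step 3 concrete, here is a back-of-the-envelope sketch for the first fusion in the list (Linear + Bias + Activation). The byte accounting assumes FP16 storage and the naive traffic model used throughout this page; the helper and the exact epilogue FLOP counts are illustrative assumptions, not measurements:

def bias_act_fusion_ai(batch, d_in, d_out, bytes_per_elem=2):
    """Compare estimated AI of unfused vs fused Linear + Bias + Activation."""
    flops_matmul = 2 * batch * d_in * d_out
    flops_epilogue = 2 * batch * d_out        # bias add + activation, ~1 FLOP each
    y_bytes = batch * d_out * bytes_per_elem  # size of the intermediate/output tensor

    # Unfused: matmul writes Y, a bias kernel re-reads and re-writes Y,
    # an activation kernel re-reads and re-writes Y -> 5 trips of y_bytes
    # (the small bias vector's traffic is ignored here)
    unfused_bytes = bytes_per_elem * (batch * d_in + d_in * d_out) + 5 * y_bytes
    # Fused: Y stays in registers through bias + activation, written to HBM once
    fused_bytes = bytes_per_elem * (batch * d_in + d_in * d_out) + 1 * y_bytes

    total_flops = flops_matmul + flops_epilogue
    return total_flops / unfused_bytes, total_flops / fused_bytes

unfused_ai, fused_ai = bias_act_fusion_ai(batch=512, d_in=4096, d_out=4096)
print(f"Unfused AI: {unfused_ai:.1f} FLOPs/byte  Fused AI: {fused_ai:.1f} FLOPs/byte")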

Common Mistakes

:::danger Optimizing Arithmetic in a Memory-Bound Kernel The most expensive mistake in GPU performance work. If a kernel's arithmetic intensity is below the ridge point, the compute units are already idle most of the time - they are waiting for data. Making them execute more efficiently does not help. First, diagnose with the roofline model. Only optimize arithmetic throughput for compute-bound kernels. :::

:::danger Using Theoretical Peak as the Baseline Theoretical peak TFLOP/s (from the spec sheet) is never achievable in practice. Tensor Cores require specific input sizes aligned to 16x8x16 tiles. Non-aligned sizes fall back to slower compute paths. Always measure your actual peak with a correctly-sized large matmul benchmark, not the spec-sheet number. A100's theoretical FP16 is 312 TFLOP/s; realistic is ~260-280 TFLOP/s with cuBLAS on large aligned matmuls. :::

:::warning Ignoring Cache Effects in the Roofline The simple roofline model assumes all memory traffic goes to DRAM (HBM). In practice, L1 and L2 caches on the GPU can serve a significant fraction of traffic. A kernel with high data reuse (e.g., a convolution with small input feature maps that fit in L2) may be effectively at a much higher bandwidth ceiling than the HBM roofline shows. Always check whether your kernel's working set fits in cache before concluding it is DRAM-bandwidth bound. :::

:::warning Not Accounting for All Memory Traffic A common error in manual AI calculation: counting only the primary input/output tensors and forgetting auxiliary reads and writes. Examples:

  • Layer norm reads the input, computes mean and variance (writes to register, but may write to memory for backward pass), then reads the input again to normalize
  • FlashAttention still writes per-row softmax statistics (the normalization values needed for the backward pass) to memory
  • Workspace buffers used by cuDNN/cuBLAS for scratch space count as memory traffic

When in doubt, use Nsight Compute's "Memory Workload Analysis" section for actual measured bytes, not estimates. :::

:::tip When Roofline Says You're at the Ceiling But Still Slow If your roofline analysis shows a kernel is near the memory bandwidth ceiling (say, 90% of peak bandwidth), but the overall model is still slow, the bottleneck may be latency rather than throughput. Launch overhead from many small kernels, synchronization between layers, and pipeline bubbles in the operator scheduler can all consume wall time without showing up in per-kernel bandwidth utilization. Profile the full timeline with Nsight Systems (not just per-kernel with Nsight Compute) to see the gaps between kernels. :::


Interview Q&A

Q1: What is arithmetic intensity and how do you calculate it for a matrix multiplication of size (M, K) x (K, N)?

Arithmetic intensity is the ratio of floating-point operations performed to bytes of data transferred from memory. For a matrix multiplication $C = AB$ where $A$ is $(M \times K)$ and $B$ is $(K \times N)$:

FLOPs = $2MNK$ (each of the $MN$ output elements requires $K$ multiply-accumulate operations, and each multiply-accumulate counts as 2 FLOPs).

Memory traffic = $b(MK + KN + MN)$ bytes, where $b$ is the bytes per element (2 for FP16, 4 for FP32): load $A$, load $B$, store $C$.

$$AI = \frac{2MNK}{b\,(MK + KN + MN)}$$

For a square $n \times n$ matmul this gives $AI = n/6$ in FP32 and $n/3$ in FP16 - halving the bytes per element doubles the arithmetic intensity, which is one reason lower-precision storage helps memory-bound workloads.

For $n = 4096$ in FP32: AI ≈ 682 FLOPs/byte, well above the A100 ridge point of 156 - compute-bound. For $n = 64$ (small-batch inference): AI ≈ 10.7 FLOPs/byte - memory-bound.

Q2: Why is attention memory-bound for long sequences, and what does FlashAttention do to fix it?

In naive attention, computing $\text{softmax}(QK^T/\sqrt{d}) \cdot V$ requires materializing the $L \times L$ attention score matrix. For sequence length $L$ and head dimension $d$:

  • FLOPs: $4L^2 d$ ($QK^T$ plus $AV$)
  • Memory traffic: dominated by $6L^2$ bytes for the attention matrix (write, read for softmax, read for the $AV$ multiply)
  • AI: $4L^2 d / 6L^2 = 2d/3 \approx 42$ FLOPs/byte for $d = 64$

This is below the ridge point (156 for A100). Attention is memory-bound because of the $O(L^2)$ intermediate.

FlashAttention implements a tiling algorithm that never writes the full attention matrix to HBM. It processes tiles of Q, K, V entirely within SRAM (shared memory), computing the softmax and AV product incrementally. The memory traffic drops from $O(L^2)$ to $O(L \cdot d)$:

  • AI with FlashAttention: $4L^2 d / (8Ld) = L/2$ FLOPs/byte
  • For $L = 2048$: AI = 1024 FLOPs/byte - now compute-bound

FlashAttention achieves 2-4x speedup in wall time not by reducing FLOPs (same count) but by eliminating memory traffic.

Q3: On an H100, the ridge point is ~591 FLOPs/byte. On an A100 it is 156. What does this mean for how you write kernels?

The H100's dramatically higher ridge point (Tensor Core peak scaled much faster than bandwidth) means more operations are memory-bound on H100 than on A100: a kernel needs higher arithmetic intensity to be compute-bound. Operations with AI between 156 and 591 that were compute-bound on A100 become memory-bound on H100.

Practical implications:

  1. Kernel fusion matters more on H100 - you need to raise AI above 591 to see compute-bound behavior
  2. FlashAttention's benefit is even larger on H100 because baseline attention's AI of 42 is further below the ridge
  3. Batch sizes need to be larger for matmuls to cross the ridge point and become compute-bound
  4. FP8 training (supported on H100) halves bytes per element, doubling AI for the same computation - this specifically helps move operations from memory-bound to compute-bound

Q4: A colleague claims their custom elementwise activation function is slower than expected. How do you diagnose and fix this using the roofline model?

Step 1 - Calculate the theoretical AI. For an elementwise operation on $N$ FP16 elements: $\text{AI} \approx 0.25$ FLOPs/byte (1 FLOP per element, 4 bytes of traffic per element for an FP16 read + write). This is far below any GPU's ridge point.

Step 2 - Measure achieved bandwidth. Use CUDA events to time the kernel. Compute achieved_bytes_per_second = (2 * N * 2) / elapsed_time. Compare to peak HBM bandwidth.

Step 3 - Diagnose the gap. If achieved bandwidth is 50% of peak, the kernel has poor memory access efficiency (likely uncoalesced reads, or a launch too small to saturate bandwidth).

Step 4 - Optimization options. For a fundamentally memory-bound kernel, the options are:

  • Ensure memory access is coalesced (consecutive threads read consecutive memory addresses)
  • Increase arithmetic intensity by fusing with adjacent operations (e.g., if this activation is followed by a layer norm, fuse them)
  • Vectorize loads (use float4 loads instead of float loads to process 4 elements per memory transaction)

What does NOT help: optimizing the arithmetic of the activation function itself. Replacing a slow GELU approximation with a faster one barely changes wall time because the kernel spends 99% of its time waiting for memory, not computing.
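
A compact version of steps 2 and 3, shown as a sketch. It uses torch.nn.functional.gelu as a stand-in for the colleague's custom activation and assumes the ~1935 GB/s measured peak from Step 1:

import torch

def bandwidth_utilization(fn, x, peak_gbs, n_trials=50):
    """Fraction of peak HBM bandwidth achieved by an elementwise kernel on tensor x."""
    for _ in range(10):          # warm up
        fn(x)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(n_trials):
        fn(x)
    end.record()
    torch.cuda.synchronize()
    seconds = start.elapsed_time(end) / 1000.0 / n_trials
    bytes_moved = 2 * x.numel() * x.element_size()   # one read + one write per element
    achieved_gbs = bytes_moved / seconds / 1e9
    return achieved_gbs / peak_gbs

x = torch.randn(64 * 1024 * 1024, dtype=torch.float16, device='cuda')
util = bandwidth_utilization(torch.nn.functional.gelu, x, peak_gbs=1935)
print(f"Achieved {util:.0%} of peak HBM bandwidth")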

Q5: How do you use NVIDIA Nsight Compute to get roofline data, and what does the "Roofline" tab show?

Launch your application with ncu --set full --export my_profile.ncu-rep python script.py, then open the report in the Nsight Compute GUI. Navigate to the "Details" page and select the "Roofline" section.

The Nsight roofline shows a "Hierarchical Roofline" with three bandwidth ceilings:

  1. L1 cache bandwidth ceiling (highest - aggregate L1/shared memory bandwidth is well above L2)
  2. L2 cache bandwidth ceiling (intermediate, roughly 5-10 TB/s aggregate on A100)
  3. DRAM (HBM) bandwidth ceiling (lowest, ~2 TB/s on A100)

Your kernel's operating point is placed based on actual measured FLOPs and bytes. A key insight: if your kernel falls above the DRAM roofline but below the L2 roofline, it is L2 bandwidth bound - meaning it has good data reuse (data is cached in L2) but L2 is the bottleneck, not HBM. The fix in that case is different from HBM-bound kernels: you need to reduce the working set size or increase register use to exploit L1 instead.

The "Memory Workload Analysis" section gives actual byte counts for each cache level, which is more accurate than manual AI estimation.

Q6: A training run achieves 40% of peak TFLOP/s. Your manager says "we need to optimize the matmuls." How do you respond?

"40% of peak TFLOP/s by itself does not tell us what to optimize. We need to know the arithmetic intensity of the operations consuming the most time."

Run a roofline analysis on the top 5 time-consuming kernels. Common findings:

  • If the slow kernels are attention, layer norm, or elementwise operations with AI well below the ridge point, they are memory-bound. Optimizing their matmul arithmetic will not help. Fix: kernel fusion, FlashAttention for attention, larger batches to raise AI.
  • If the slow kernels are large matmuls at AI above the ridge point, they are compute-bound but below the compute ceiling. Fix: check alignment (matmul dimensions must be multiples of 16 for Tensor Core efficiency), check data types (FP8 vs FP16 doubles compute capacity), check if cuBLAS/cuDNN versions are current.
  • If overall utilization is low because of pipeline bubbles between kernels, fixing individual kernels does not help. Fix: async execution, CUDA streams, operator graph reordering.

The roofline model is the entry point - without it, you are guessing which ceiling to push toward.

