Roofline Model and Bottleneck Analysis

Reading time: ~40 min · Interview relevance: Very High · Target roles: ML Engineer, CUDA Developer, Performance Engineer

The roofline model answers the most important question in GPU optimization: is this operation limited by compute or by memory bandwidth? The answer determines which optimizations are worth pursuing - and which ones are a complete waste of time.

Opening: A Performance Debug at 2 AM

It is 2:47 AM and your training job is running 40% slower than it should be. You just upgraded from an A100 to an H100 - a GPU that costs roughly three times as much per hour and promises nearly 3x the FP16 tensor core throughput. On paper, your transformer training run should have screamed. Instead, it is crawling. GPU utilization shows 98% across all eight cards. MFU (model FLOP utilization) is stuck at 28%. Your manager will ask for an explanation in six hours.

You stare at the Nsight Systems timeline. The kernels are launching. Memory is moving. Everything looks busy. But something is wrong at the level that profiling tools show you only if you know exactly what to look for. You start dumping Nsight Compute reports for individual kernels. The data is overwhelming - hundreds of metrics per kernel. You have no framework to decide which number matters.

This is the experience that breaks most ML engineers who try to go deep on performance. They know the GPU is slow. They do not know why it is slow. They try random optimizations - switch to BF16, add torch.compile, increase batch size - and see marginal gains that do not add up to the speedup they need. Each change takes hours to validate. They are guessing, not engineering.

The root cause in this particular incident: the model was bottlenecked on memory bandwidth, not compute. Adding tensor cores did nothing because tensor cores were not the bottleneck to begin with. The H100 has 3.35 TB/s of HBM3 bandwidth - meaningfully higher than the A100's 2 TB/s - but the team's custom attention kernels were reading and writing the full K and V matrices to HBM on every layer. They were paying memory bandwidth tax on every forward pass. Switching to FlashAttention 2 cut training time by 38% on the same hardware within 48 hours. No hardware upgrade required.

The tool that would have pointed directly to that root cause in ten minutes is the roofline model. It is a simple performance model - a two-line graph with mathematics that fits on a napkin. But it encodes a complete theory of GPU bottleneck analysis. Once you understand how to construct one, read one, and reason from one, debugging GPU performance stops being guesswork and becomes engineering.

This lesson teaches you to build the roofline model from first principles, understand what it says about every common ML operation, measure real arithmetic intensity in PyTorch code, and use the roofline to make concrete optimization decisions. By the end, you will know exactly which operations in a transformer are memory-bound versus compute-bound - and what to do about each one.

Why This Exists - The Problem of Not Knowing What to Optimize

Before the roofline model existed, performance engineers had two numbers to chase: peak FLOPS and peak memory bandwidth. But these two numbers are independent hardware limits. A GPU can be hitting peak bandwidth while barely touching its compute units. It can be maxing out tensor cores while the memory bus sits idle. What was missing was a unified framework that placed any given operation in context relative to both limits simultaneously.

The naive approach was trial and error. Profile a kernel, try different optimizations, measure again. This works eventually but it wastes enormous effort optimizing the wrong bottleneck. If your kernel is memory-bandwidth bound, no amount of algorithmic improvement to the compute path helps. You can implement the most elegant Winograd convolution algorithm in the world and gain nothing, because the GPU was never waiting on the multiply-accumulate units to begin with.

The deeper problem is that compute and memory bandwidth scale independently across hardware generations. The A100 had 312 TFLOPS of BF16 tensor core throughput and 2 TB/s of HBM bandwidth - a ratio of 156 FLOPs/byte. The H100 SXM5 has 989 TFLOPS of BF16 tensor core throughput and 3.35 TB/s of HBM bandwidth - a ratio of 295 FLOPs/byte. Compute improved by 3.2x while bandwidth improved by 1.7x. An operation that was borderline on the A100 (just barely compute-bound) might be strongly memory-bound on the H100, because the "ridge point" shifted dramatically. Without a model that accounts for both limits simultaneously and tracks how they change across hardware, you cannot reason about portability of optimization work.

The roofline model solves this. It gives every kernel a single number - arithmetic intensity - that tells you where it sits relative to the hardware limits. Operations with low arithmetic intensity are memory-bound regardless of GPU. Operations with high arithmetic intensity are compute-bound and will benefit from better compute hardware. The ridge point divides the space cleanly. Once you place an operation on the roofline, the optimization path is obvious.

Historical Context - Williams, Waterman, and Patterson (2009)

The roofline model was introduced in a 2009 paper titled "Roofline: An Insightful Visual Performance Model for Multicore Architectures" by Samuel Williams, Andrew Waterman, and David Patterson at UC Berkeley. Patterson is the same David Patterson who co-invented RISC architecture and later joined Google to advocate for domain-specific accelerators - he has spent much of his career building conceptual tools for reasoning about computer architecture.

The original paper targeted multicore CPUs and the memory wall problem: as CPUs got faster, DRAM bandwidth did not keep pace. Optimizing for computation speed became less useful when data could not be fed fast enough to keep cores busy. Williams, Waterman, and Patterson formalized this with the roofline model, giving engineers a principled way to determine whether a given workload would benefit more from better compute or better memory bandwidth.

The insight was deceptively simple. Every computation has two costs: arithmetic work (measured in FLOPs) and data movement (measured in bytes). The ratio - FLOPs per byte - is the arithmetic intensity. Hardware provides two limits: peak compute (FLOPS) and peak memory bandwidth (bytes per second). The performance of any computation is bounded by both limits simultaneously. The tighter bound determines the actual bottleneck.

When the ML community moved to GPUs and the GPU performance gap with theoretical peak became a serious engineering problem, the roofline model migrated naturally from CPU performance analysis to GPU performance analysis. NVIDIA incorporated roofline analysis directly into Nsight Compute in 2019, making it a first-class tool for GPU kernel optimization. Today, the roofline model is the standard entry point for any serious GPU performance investigation.

Core Concepts

Arithmetic Intensity - The Single Most Important Number

Arithmetic intensity $I$ is defined as the ratio of floating-point operations performed to bytes of memory traffic generated:

$I = \frac{\text{FLOPs}}{\text{Bytes accessed (DRAM)}}$

The units are FLOPs per byte. The "bytes accessed" in the denominator refers specifically to bytes moved between DRAM (HBM on modern GPUs) and the compute units - not cache-to-register traffic. This distinction matters enormously. If you reuse data from L2 cache or shared memory, you pay for the DRAM transfer only once, raising the effective arithmetic intensity.

A pure elementwise operation like ReLU - read a float, apply max(0, x), write the result - performs 1 FLOP on 8 bytes (4 bytes read + 4 bytes written in FP32). Its arithmetic intensity is $I = 1/8 = 0.125$ FLOPs/byte. It is aggressively memory-bound on every GPU ever built.

A large square matrix multiplication of two N x N matrices performs $2N^3$ FLOPs and reads $3N^2 \times 4$ bytes of data (two input matrices plus one output in FP32). Its arithmetic intensity is:

$I_{matmul} = \frac{2N^3}{12N^2} = \frac{N}{6}$

For $N = 4096$ , that is $4096/6 \approx 683$ FLOPs/byte. This is why large matrix multiplications are the workhorse of deep learning - they have high arithmetic intensity and are compute-bound on virtually every modern GPU, meaning the tensor cores are the limiting factor, not memory bandwidth.

The Two Performance Ceilings

The roofline model defines two hardware ceilings:

Memory bandwidth ceiling: Performance is bounded by how fast data can be moved from DRAM. If a kernel has arithmetic intensity $I$ and the hardware provides bandwidth $B$ (bytes/second), the maximum achievable performance is:

$P_{memory} = I \times B$

This is a sloped line on a log-log plot. For every FLOPs/byte increase in arithmetic intensity, you get a proportional performance increase - as long as you are on the memory-bound side.

Peak compute ceiling: Performance is bounded by how fast the compute units can execute floating-point operations. This ceiling is a horizontal line at $P_{compute}$ (the peak FLOPS rating of the hardware).

The roofline model says that actual achievable performance for a kernel with arithmetic intensity $I$ is:

$P_{achievable} = \min(P_{compute}, \; I \times B)$

On a log-log plot, the memory bandwidth ceiling is a diagonal line with slope 1 (since both axes are log scale). The peak compute ceiling is a horizontal line. The two lines intersect at the ridge point.

The Ridge Point - Where Memory-Bound Becomes Compute-Bound

The ridge point $I_{ridge}$ is the arithmetic intensity at which the memory bandwidth ceiling equals the peak compute ceiling:

$I_{ridge} = \frac{P_{compute}}{B}$

Operations with $I < I_{ridge}$ are memory-bound: adding more compute units does not help. To improve performance, you must reduce bytes accessed (kernel fusion, reuse data in shared memory) or increase bandwidth (use faster memory tiers like L2 or shared memory).

Operations with $I > I_{ridge}$ are compute-bound: reducing DRAM traffic does not help. To improve performance, you must reduce FLOPs (more efficient algorithms) or increase peak compute (use Tensor Cores, FP8 instead of FP32).

H100 SXM5 Numbers - Real Hardware Ceilings

Let's put real numbers on this for the most common training GPU as of 2025:

Metric	Value
Peak FP32 compute	67 TFLOPS
Peak BF16 Tensor Core	989 TFLOPS
Peak FP8 Tensor Core	1,979 TFLOPS
HBM3 bandwidth	3.35 TB/s
L2 cache bandwidth	~12 TB/s
L2 cache size	50 MB

Now compute the ridge points:

FP32 ridge point: $I_{ridge,fp32} = \frac{67 \times 10^{12}}{3.35 \times 10^{12}} \approx 20 \text{ FLOPs/byte}$

BF16 Tensor Core ridge point: $I_{ridge,bf16} = \frac{989 \times 10^{12}}{3.35 \times 10^{12}} \approx 295 \text{ FLOPs/byte}$

FP8 Tensor Core ridge point: $I_{ridge,fp8} = \frac{1979 \times 10^{12}}{3.35 \times 10^{12}} \approx 591 \text{ FLOPs/byte}$

These numbers are striking. For BF16 tensor cores, you need 295 FLOPs of compute for every byte you access from HBM before you saturate the compute units. Most naive implementations of even moderately complex operations never reach this. The H100's extraordinary compute throughput means an enormous fraction of operations that look like they should be compute-heavy are actually memory-bound when implemented naively.

For comparison, the A100 SXM4 BF16 Tensor Core ridge point was approximately $312 \times 10^{12} / 2 \times 10^{12} = 156$ FLOPs/byte. The H100's ridge point is nearly double. Operations that were just barely compute-bound on the A100 are memory-bound on the H100.

Arithmetic Intensity of Common ML Operations

Understanding where common operations fall is essential knowledge:

Elementwise operations (ReLU, GELU, sigmoid, add, multiply):

FLOPs: 1 per element (roughly)
Bytes: 4 bytes read + 4 bytes written (FP32) = 8 bytes
Arithmetic intensity: $\approx 0.125$ FLOPs/byte
Classification: Strongly memory-bound on every GPU

Layer normalization (over hidden dim $H$ ):

FLOPs: $\approx 7H$ per token (mean, variance, normalize, scale, shift)
Bytes: $H \times 4$ bytes read + $H \times 4$ bytes written = $8H$ bytes (plus small parameter reads)
Arithmetic intensity: $\approx 0.875$ FLOPs/byte
Classification: Memory-bound

Vector-vector dot product (length $N$ ):

FLOPs: $2N$ (N multiplies + N additions)
Bytes: $2N \times 4 = 8N$ bytes
Arithmetic intensity: $\approx 0.25$ FLOPs/byte
Classification: Strongly memory-bound

Matrix-vector multiply (GEMV) - matrix is $M \times N$ , vector is $N$ :

FLOPs: $2MN$
Bytes: $MN \times 4$ (matrix) + $N \times 4$ (vector) + $M \times 4$ (output) $\approx 4MN$ bytes for large $M,N$
Arithmetic intensity: $\approx 0.5$ FLOPs/byte
Classification: Strongly memory-bound (this is why inference at batch size 1 is bandwidth-limited)

Matrix-matrix multiply (GEMM) - square matrices of size $N$ :

FLOPs: $2N^3$
Bytes: $3N^2 \times 4 = 12N^2$ bytes
Arithmetic intensity: $N/6$ FLOPs/byte
For $N=128$ : 21 FLOPs/byte (barely above FP32 ridge, below BF16 Tensor Core ridge)
For $N=1024$ : 171 FLOPs/byte (compute-bound in FP32, memory-bound for BF16 on H100)
For $N=4096$ : 683 FLOPs/byte (compute-bound for both FP32 and BF16 on H100)

Softmax (over sequence length $L$ ):

FLOPs: $\approx 5L$ (exp + sum + divide per element, plus max subtraction)
Bytes: $L \times 4$ read + $L \times 4$ written = $8L$ bytes (minimum 2 passes)
Arithmetic intensity: $\approx 0.6$ FLOPs/byte
Classification: Memory-bound

Naive attention (sequence length $L$ , head dim $d$ ):

FLOPs: $\approx 4L^2 d$ (two GEMMs for QK and OV, plus softmax)
Bytes: Reads Q, K, V plus writes the $L \times L$ attention matrix to HBM: $\approx 4L^2 \times 4 + 4Ld \times 4$ bytes
For large $L$ : arithmetic intensity $\approx d$ FLOPs/byte (e.g., $d=64$ gives 64 FLOPs/byte)
Classification: Memory-bound on H100 (below 295 FLOPs/byte ridge)
The $L \times L$ attention matrix write/read is the killer - it costs $O(L^2)$ memory traffic

FlashAttention (same operation, tiled implementation):

FLOPs: Same $\approx 4L^2 d$ (same math, different execution)
Bytes: No $L \times L$ matrix written to HBM. Only Q, K, V, and O: $4Ld \times 4$ bytes
Arithmetic intensity: $\approx L/6$ FLOPs/byte (same as GEMM!)
For $L=2048, d=64$ : $\approx 341$ FLOPs/byte - crosses into compute-bound territory for BF16
Classification: Compute-bound - the key algorithmic win of FlashAttention

This is the concrete payoff of the roofline model. Naive attention and FlashAttention compute exactly the same mathematical result. But their roofline positions are completely different - and that difference is the 3-5x speedup FlashAttention delivers. The roofline model explains it precisely.

Roofline Diagram for the H100

Optimization Decision Tree

Code Examples

1. Computing Arithmetic Intensity Analytically

"""
Analytical arithmetic intensity calculator for common ML operations.
These are theoretical minimums - actual memory traffic may be higher.
"""

def arithmetic_intensity_elementwise(dtype_bytes: int = 4) -> float:
    """
    Elementwise ops: ReLU, sigmoid, exp, add, multiply.
    1 FLOP per element, 1 read + 1 write per element.
    """
    flops_per_element = 1.0
    bytes_per_element = 2 * dtype_bytes  # read + write
    return flops_per_element / bytes_per_element


def arithmetic_intensity_gemm(M: int, N: int, K: int, dtype_bytes: int = 2) -> float:
    """
    General matrix multiply: C[M,N] = A[M,K] @ B[K,N]
    FLOPs: 2*M*N*K (K multiplies + K-1 adds, simplified to 2K per output element)
    Bytes: read A (M*K) + read B (K*N) + write C (M*N), all in dtype_bytes
    """
    flops = 2 * M * N * K
    bytes_accessed = dtype_bytes * (M * K + K * N + M * N)
    return flops / bytes_accessed


def arithmetic_intensity_layernorm(hidden_dim: int, dtype_bytes: int = 2) -> float:
    """
    LayerNorm over hidden_dim.
    Approximate FLOPs: 7 * hidden_dim per token
      - mean: hidden_dim adds + 1 divide
      - variance: hidden_dim subtracts + squares + adds + 1 divide
      - normalize: hidden_dim subtracts + divides
      - scale + shift: 2 * hidden_dim multiply-adds
    Bytes: read input + read gamma + read beta + write output
    """
    flops = 7 * hidden_dim
    bytes_accessed = dtype_bytes * (hidden_dim + hidden_dim + hidden_dim + hidden_dim)
    return flops / bytes_accessed


def arithmetic_intensity_attention_naive(
    seq_len: int, head_dim: int, dtype_bytes: int = 2
) -> float:
    """
    Naive attention: Q @ K^T -> softmax -> @ V
    The bottleneck: must write the [seq_len, seq_len] attention matrix to HBM.

    FLOPs: 2 * seq_len^2 * head_dim (QK matmul) + 2 * seq_len^2 * head_dim (AV matmul)
           + ~5 * seq_len^2 (softmax)
    Bytes: Read Q, K, V (3 * seq_len * head_dim) + Write/Read attention matrix (seq_len^2)
           + Write output (seq_len * head_dim)
    """
    flops = (
        2 * seq_len ** 2 * head_dim  # QK^T
        + 5 * seq_len ** 2           # softmax approx
        + 2 * seq_len ** 2 * head_dim  # AV
    )
    bytes_accessed = dtype_bytes * (
        3 * seq_len * head_dim   # Q, K, V reads
        + seq_len ** 2           # attention matrix write (float32 for stability)
        + seq_len ** 2           # attention matrix read after softmax
        + seq_len * head_dim     # output write
    )
    return flops / bytes_accessed


def arithmetic_intensity_flash_attention(
    seq_len: int, head_dim: int, dtype_bytes: int = 2
) -> float:
    """
    FlashAttention: same FLOPs, but no L x L matrix written to HBM.
    Tiles K and V in SRAM/shared memory and accumulates directly.

    FLOPs: same as naive attention
    Bytes: only Q, K, V reads + output write (no attention matrix in HBM)
    """
    flops = (
        2 * seq_len ** 2 * head_dim
        + 5 * seq_len ** 2
        + 2 * seq_len ** 2 * head_dim
    )
    bytes_accessed = dtype_bytes * (
        3 * seq_len * head_dim   # Q, K, V reads
        + seq_len * head_dim     # output write
    )
    return flops / bytes_accessed


# H100 SXM5 hardware specs
H100_SPECS = {
    "peak_fp32_tflops": 67e12,
    "peak_bf16_tensor_tflops": 989e12,
    "peak_fp8_tensor_tflops": 1979e12,
    "hbm_bandwidth_bytes_per_sec": 3.35e12,
    "l2_bandwidth_bytes_per_sec": 12e12,
}


def ridge_point(peak_compute: float, bandwidth: float) -> float:
    """Arithmetic intensity at which memory-bound becomes compute-bound."""
    return peak_compute / bandwidth


def roofline_performance(
    arithmetic_intensity: float,
    peak_compute: float,
    bandwidth: float,
) -> tuple[float, str]:
    """
    Returns (achievable_performance_flops_per_sec, bottleneck)
    """
    memory_limit = arithmetic_intensity * bandwidth
    compute_limit = peak_compute
    if memory_limit < compute_limit:
        return memory_limit, "memory-bound"
    else:
        return compute_limit, "compute-bound"


if __name__ == "__main__":
    B = H100_SPECS["hbm_bandwidth_bytes_per_sec"]
    P_bf16 = H100_SPECS["peak_bf16_tensor_tflops"]
    P_fp32 = H100_SPECS["peak_fp32_tflops"]

    print("=== H100 Ridge Points ===")
    print(f"FP32 ridge:  {ridge_point(P_fp32, B):.1f} FLOPs/byte")
    print(f"BF16 Tensor: {ridge_point(P_bf16, B):.1f} FLOPs/byte")
    print()

    print("=== Arithmetic Intensities (BF16, dtype_bytes=2) ===")
    ops = [
        ("Elementwise (ReLU)", arithmetic_intensity_elementwise(dtype_bytes=2)),
        ("LayerNorm (H=4096)",  arithmetic_intensity_layernorm(4096, dtype_bytes=2)),
        ("GEMM 128x128x128",    arithmetic_intensity_gemm(128, 128, 128, dtype_bytes=2)),
        ("GEMM 1024x1024x1024", arithmetic_intensity_gemm(1024, 1024, 1024, dtype_bytes=2)),
        ("GEMM 4096x4096x4096", arithmetic_intensity_gemm(4096, 4096, 4096, dtype_bytes=2)),
        ("Naive Attention L=512 d=64",
         arithmetic_intensity_attention_naive(512, 64, dtype_bytes=2)),
        ("Naive Attention L=2048 d=64",
         arithmetic_intensity_attention_naive(2048, 64, dtype_bytes=2)),
        ("FlashAttention L=2048 d=64",
         arithmetic_intensity_flash_attention(2048, 64, dtype_bytes=2)),
    ]

    ridge = ridge_point(P_bf16, B)
    for name, intensity in ops:
        perf, bound = roofline_performance(intensity, P_bf16, B)
        pct_of_peak = (perf / P_bf16) * 100
        marker = "COMPUTE" if bound == "compute-bound" else "MEMORY "
        print(
            f"  [{marker}] {name:<35} "
            f"I={intensity:7.1f} FLOPs/byte  "
            f"Peak={pct_of_peak:5.1f}% of BF16 ceiling"
        )

Expected output:

=== H100 Ridge Points ===
FP32 ridge:  20.0 FLOPs/byte
BF16 Tensor: 295.2 FLOPs/byte

=== Arithmetic Intensities (BF16, dtype_bytes=2) ===
  [MEMORY ] Elementwise (ReLU)                  I=    0.1 FLOPs/byte  Peak=  0.0% of BF16 ceiling
  [MEMORY ] LayerNorm (H=4096)                  I=    0.9 FLOPs/byte  Peak=  0.0% of BF16 ceiling
  [MEMORY ] GEMM 128x128x128                    I=   21.3 FLOPs/byte  Peak=  0.6% of BF16 ceiling
  [MEMORY ] GEMM 1024x1024x1024                 I=  170.7 FLOPs/byte  Peak= 17.3% of BF16 ceiling
  [COMPUTE] GEMM 4096x4096x4096                 I=  682.7 FLOPs/byte  Peak=100.0% of BF16 ceiling
  [MEMORY ] Naive Attention L=512 d=64          I=   52.1 FLOPs/byte  Peak=  5.3% of BF16 ceiling
  [MEMORY ] Naive Attention L=2048 d=64         I=   61.1 FLOPs/byte  Peak=  6.2% of BF16 ceiling
  [COMPUTE] FlashAttention L=2048 d=64          I=  340.7 FLOPs/byte  Peak=100.0% of BF16 ceiling

2. Measuring Arithmetic Intensity Empirically with PyTorch Profiler

"""
Empirically measure arithmetic intensity using PyTorch's CUDA profiler.
This measures actual memory traffic, not just the theoretical minimum.

Requires: torch >= 2.0, CUDA GPU
"""

import torch
import torch.profiler
from typing import Callable


def measure_kernel_stats(
    fn: Callable,
    *args,
    warmup_iters: int = 5,
    measure_iters: int = 10,
    label: str = "operation",
) -> dict:
    """
    Run a function and measure:
    - Wall time (ms)
    - DRAM reads (bytes)
    - DRAM writes (bytes)
    - FLOPs (if available from profiler)
    - Derived arithmetic intensity

    NOTE: PyTorch profiler gives DRAM traffic via kineto_results.
    For accurate byte counts, use Nsight Compute. This is an approximation.
    """
    device = "cuda"

    # Warmup
    for _ in range(warmup_iters):
        result = fn(*args)
    torch.cuda.synchronize()

    activities = [
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ]

    with torch.profiler.profile(
        activities=activities,
        record_shapes=True,
        profile_memory=True,
        with_flops=True,
    ) as prof:
        for _ in range(measure_iters):
            result = fn(*args)
        torch.cuda.synchronize()

    # Aggregate across iterations
    total_flops = 0
    total_self_cuda_time_us = 0

    key_avgs = prof.key_averages()
    for event in key_avgs:
        if event.device_type == torch.autograd.DeviceType.CUDA:
            total_flops += event.flops
            total_self_cuda_time_us += event.self_cuda_time_total

    avg_time_ms = total_self_cuda_time_us / measure_iters / 1000.0
    avg_flops = total_flops / measure_iters

    # Achieved TFLOPS
    achieved_tflops = avg_flops / (avg_time_ms * 1e-3) / 1e12 if avg_time_ms > 0 else 0

    return {
        "label": label,
        "avg_time_ms": avg_time_ms,
        "flops": avg_flops,
        "achieved_tflops": achieved_tflops,
    }


def classify_operation(achieved_tflops: float, theoretical_intensity: float) -> dict:
    """
    Given measured throughput and theoretical arithmetic intensity,
    classify whether the operation is memory-bound or compute-bound
    on the current GPU.
    """
    # Query current GPU specs (approximate for common GPUs)
    gpu_name = torch.cuda.get_device_name(0)

    # Conservative estimates - you should set these for your exact GPU
    gpu_specs = {
        "H100": {"peak_bf16_tflops": 989.0, "hbm_bandwidth_tbs": 3.35},
        "A100": {"peak_bf16_tflops": 312.0, "hbm_bandwidth_tbs": 2.0},
        "A10G": {"peak_bf16_tflops": 125.0, "hbm_bandwidth_tbs": 0.6},
        "RTX 4090": {"peak_bf16_tflops": 330.0, "hbm_bandwidth_tbs": 1.008},
    }

    # Default to a mid-range GPU if not recognized
    specs = {"peak_bf16_tflops": 312.0, "hbm_bandwidth_tbs": 2.0}
    for key, val in gpu_specs.items():
        if key.lower() in gpu_name.lower():
            specs = val
            break

    peak_compute = specs["peak_bf16_tflops"]
    bandwidth = specs["hbm_bandwidth_tbs"]
    ridge = peak_compute / bandwidth

    efficiency = achieved_tflops / peak_compute * 100
    bound = "compute-bound" if theoretical_intensity > ridge else "memory-bound"

    if bound == "memory-bound":
        # What fraction of bandwidth ceiling are we hitting?
        memory_ceiling = theoretical_intensity * bandwidth  # TFLOPS equivalent
        memory_efficiency = achieved_tflops / memory_ceiling * 100 if memory_ceiling > 0 else 0
        return {
            "bound": bound,
            "ridge_point": ridge,
            "compute_efficiency_pct": efficiency,
            "memory_ceiling_efficiency_pct": memory_efficiency,
            "recommendation": "Fuse kernels to reduce HBM round-trips. "
                              "Consider shared memory tiling.",
        }
    else:
        return {
            "bound": bound,
            "ridge_point": ridge,
            "compute_efficiency_pct": efficiency,
            "recommendation": "Enable Tensor Cores (BF16/FP8). "
                              "Check for warp divergence or uncoalesced access.",
        }


def benchmark_matmul_sizes():
    """
    Benchmark matrix multiplications of different sizes to see
    how arithmetic intensity changes with problem size.
    """
    device = "cuda"
    dtype = torch.bfloat16

    sizes = [128, 256, 512, 1024, 2048, 4096]
    print(f"\n{'Size':>6} | {'Theory I':>10} | {'Time (ms)':>10} | {'TFLOPS':>8} | {'Bound'}")
    print("-" * 65)

    for N in sizes:
        A = torch.randn(N, N, dtype=dtype, device=device)
        B = torch.randn(N, N, dtype=dtype, device=device)

        def fn():
            return torch.matmul(A, B)

        # Warmup
        for _ in range(3):
            _ = fn()
        torch.cuda.synchronize()

        # Time using CUDA events (more accurate than profiler for single ops)
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)

        start.record()
        for _ in range(20):
            out = fn()
        end.record()
        torch.cuda.synchronize()

        avg_time_ms = start.elapsed_time(end) / 20.0
        flops = 2 * N ** 3
        achieved_tflops = flops / (avg_time_ms * 1e-3) / 1e12
        theoretical_intensity = N / 6.0  # BF16: dtype_bytes=2, formula = 2N^3 / (12N^2 * 2/4)
        # Corrected: BF16 is 2 bytes, bytes = 2*(3*N^2) = 6*N^2, so I = 2N^3/(6N^2) = N/3
        theoretical_intensity = N / 3.0  # BF16 (2 bytes)

        # H100 ridge = 295 FLOPs/byte
        bound = "COMPUTE" if theoretical_intensity > 295 else "MEMORY  "
        print(
            f"{N:>6} | {theoretical_intensity:>9.1f}  | {avg_time_ms:>9.3f}  | "
            f"{achieved_tflops:>7.1f}  | {bound}"
        )


if __name__ == "__main__":
    if not torch.cuda.is_available():
        print("No CUDA GPU available. This script requires a GPU.")
    else:
        benchmark_matmul_sizes()

3. Measuring the Performance Gap - Roofline Distance

"""
Roofline distance: how far below the roofline ceiling is the actual performance?
A large gap means the kernel is poorly optimized relative to its theoretical limit.
A small gap means the kernel is near-optimal.
"""

import torch
import time


class RooflineAnalyzer:
    """
    Analyze an operation's position on the roofline and compute the efficiency gap.
    """

    # H100 SXM5 specs (update for your GPU)
    GPU_SPECS = {
        "peak_bf16_tflops": 989.0,
        "hbm_bandwidth_tbs": 3.35,
    }

    def __init__(self, gpu_specs: dict = None):
        self.specs = gpu_specs or self.GPU_SPECS
        self.peak = self.specs["peak_bf16_tflops"]
        self.bandwidth = self.specs["hbm_bandwidth_tbs"]
        self.ridge = self.peak / self.bandwidth

    def roofline_limit(self, arithmetic_intensity: float) -> float:
        """The theoretical max performance (TFLOPS) at this arithmetic intensity."""
        memory_ceiling = arithmetic_intensity * self.bandwidth
        return min(memory_ceiling, self.peak)

    def efficiency(self, achieved_tflops: float, arithmetic_intensity: float) -> dict:
        """
        Returns a dict with:
        - roofline_limit: theoretical max TFLOPS
        - achieved_tflops: actual measured TFLOPS
        - efficiency_pct: achieved / limit * 100
        - bottleneck: 'memory' or 'compute'
        - gap_factor: how many times below the ceiling we are
        """
        limit = self.roofline_limit(arithmetic_intensity)
        eff_pct = achieved_tflops / limit * 100 if limit > 0 else 0
        gap = limit / achieved_tflops if achieved_tflops > 0 else float("inf")
        bottleneck = "memory" if arithmetic_intensity < self.ridge else "compute"

        return {
            "roofline_limit_tflops": limit,
            "achieved_tflops": achieved_tflops,
            "efficiency_pct": eff_pct,
            "gap_factor": gap,
            "bottleneck": bottleneck,
            "ridge_point": self.ridge,
        }

    def print_report(self, label: str, achieved_tflops: float, arithmetic_intensity: float):
        r = self.efficiency(achieved_tflops, arithmetic_intensity)
        print(f"\n=== Roofline Report: {label} ===")
        print(f"  Arithmetic intensity : {arithmetic_intensity:.1f} FLOPs/byte")
        print(f"  Ridge point (BF16)   : {r['ridge_point']:.1f} FLOPs/byte")
        print(f"  Bottleneck           : {r['bottleneck'].upper()}-BOUND")
        print(f"  Roofline ceiling     : {r['roofline_limit_tflops']:.1f} TFLOPS")
        print(f"  Achieved             : {r['achieved_tflops']:.1f} TFLOPS")
        print(f"  Efficiency           : {r['efficiency_pct']:.1f}% of ceiling")
        print(f"  Gap factor           : {r['gap_factor']:.1f}x below ceiling")

        if r["efficiency_pct"] < 50:
            print(f"  WARNING: {r['gap_factor']:.1f}x below ceiling - significant optimization opportunity")
        elif r["efficiency_pct"] < 80:
            print(f"  NOTE: Some optimization headroom remains")
        else:
            print(f"  GOOD: Near-optimal for this kernel type")


def demo_attention_comparison():
    """
    Compare naive attention vs FlashAttention on the roofline.
    Uses analytical intensity values since flash_attn may not be installed.
    """
    analyzer = RooflineAnalyzer()

    seq_len = 2048
    head_dim = 64
    dtype = torch.bfloat16
    device = "cuda"

    # Analytical intensities (from our earlier functions)
    naive_intensity = 61.1    # L=2048, d=64, BF16
    flash_intensity = 340.7   # L=2048, d=64, BF16

    # Simulate measured throughput (replace with actual profiler numbers)
    # Naive attention on H100: typically ~50-80 TFLOPS (memory-bound)
    # FlashAttention on H100: typically ~400-600 TFLOPS (compute-bound)
    naive_achieved_tflops = 65.0    # representative measured value
    flash_achieved_tflops = 450.0   # representative measured value

    analyzer.print_report(
        "Naive Attention (L=2048, d=64)",
        naive_achieved_tflops,
        naive_intensity,
    )
    analyzer.print_report(
        "FlashAttention (L=2048, d=64)",
        flash_achieved_tflops,
        flash_intensity,
    )

    print("\n=== Summary ===")
    print(f"FlashAttention speedup: {flash_achieved_tflops / naive_achieved_tflops:.1f}x")
    print(
        "Root cause: FlashAttention eliminates O(L^2) HBM writes, "
        "raising arithmetic intensity from 61 to 341 FLOPs/byte."
    )
    print(
        "On H100 with BF16 ridge at 295 FLOPs/byte: "
        "naive is memory-bound, FlashAttention is compute-bound."
    )


if __name__ == "__main__":
    demo_attention_comparison()

Reading NVIDIA Nsight Compute's Roofline Chart

Nsight Compute (NCU) has a built-in roofline chart under the "Roofline" section of any kernel analysis report. Here is how to read it:

Running Nsight Compute from the command line:

# Profile a single kernel with roofline metrics
ncu --metrics gpu__time_duration,\
dram__bytes_read,dram__bytes_write,\
sm__sass_thread_inst_executed_op_fadd_pred_on,\
sm__sass_thread_inst_executed_op_fmul_pred_on,\
sm__sass_thread_inst_executed_op_ffma_pred_on \
--target-processes all \
python your_script.py

# Or use the full roofline preset
ncu --set roofline python your_script.py

What the chart shows:

Each kernel is plotted as a dot at position (arithmetic intensity, achieved GFLOPS/s)
The roofline is the two-line ceiling: sloped line (memory bandwidth) meets horizontal line (peak compute)
Dots below the roofline by a large margin indicate poorly optimized kernels
Dots near the roofline (within 80%) indicate near-optimal kernels
The horizontal distance from a dot to the ridge point shows whether the bottleneck is compute or memory

Interpreting the gap:

A kernel plotted well below the memory bandwidth line: the kernel is not even achieving bandwidth utilization. Check for uncoalesced memory access, cache thrashing, or excessive synchronization.
A kernel plotted at the bandwidth line but far from the compute ceiling: it is correctly bandwidth-bound. Optimizations should focus on reducing data movement.
A kernel near the compute ceiling: it is well-optimized. Gains require algorithmic changes or better hardware.

Common Nsight Compute metrics to track:

# Memory throughput metrics
l1tex__t_bytes_pipe_lsu_mem_global_op_ld.sum  # L1 load bytes
dram__bytes_read.sum                           # HBM read bytes
dram__bytes_write.sum                          # HBM write bytes

# Compute metrics
sm__inst_executed_pipe_tensor_op_hmma.sum      # Tensor core instructions
sm__sass_thread_inst_executed_op_ffma.sum      # FFMA instructions

# Derived
achieved_occupancy                             # Active warps / max warps

Production Engineering Notes

Practical Rules for Transformer Models

Most transformer components have well-known roofline positions. Knowing these lets you prioritize optimization work without profiling every run:

Always memory-bound (regardless of model size or GPU):

Embedding table lookups (tiny arithmetic intensity, massive random access)
LayerNorm and RMSNorm
Dropout
All elementwise activations (GELU, SiLU, ReLU)
Bias additions

Memory-bound at typical inference batch sizes, compute-bound at large training batches:

Attention projections (Q, K, V, O projections) at batch size 1 are GEMV (0.5 FLOPs/byte), at large batch are GEMM (compute-bound)
FFN layers follow the same pattern

Compute-bound when properly implemented:

Large GEMM with adequate batch/sequence dimensions
FlashAttention at sequence lengths above ~512
Convolutions with large spatial dimensions and channel counts

Kernel Fusion and the Roofline

Kernel fusion is the primary tool for improving memory-bound operations. The idea: instead of running five separate elementwise kernels (each reading and writing to HBM), fuse them into one kernel that reads once, applies all operations in registers or shared memory, and writes once.

Consider the FFN sublayer in a transformer: output = dropout(gelu(linear2(gelu(linear1(x)))))

Naively, this is: matmul + gelu + matmul + gelu + dropout = 5 separate CUDA kernel launches, with 4 intermediate tensors written to and read from HBM. After fusion:

The two GEMMs are still separate (they are compute-bound and benefit from cuBLAS optimization)
But gelu + gelu + dropout fuse into a single elementwise kernel
2 HBM round-trips eliminated

This does not change the roofline position of the GEMMs (they are already compute-bound). But it dramatically improves the effective arithmetic intensity of the activation functions from 0.125 to something much higher, because now one pass over the data computes multiple operations.

torch.compile performs this fusion automatically using Triton-generated fused kernels. The roofline model tells you exactly how much benefit to expect before you even run the optimization.

The L2 Cache Roofline

For operations whose working set fits in L2 cache (50 MB on H100), the effective bandwidth is closer to 12 TB/s than 3.35 TB/s. This shifts the ridge point:

$I_{ridge,L2,bf16} = \frac{989}{12} \approx 82 \text{ FLOPs/byte}$

Naive attention at small sequence lengths (L=256, working set = 256^2 * 2 bytes = 128 KB, well within L2) effectively runs against the L2 roofline, not the HBM roofline. Its effective arithmetic intensity of ~50 FLOPs/byte is then actually close to the L2 ridge point of 82 FLOPs/byte. This is why small-batch attention is not as catastrophically slow as the HBM roofline analysis would suggest.

FlashAttention's design specifically targets keeping K and V tiles in L2/shared memory so that repeated access hits the fast cache path rather than HBM.

Arithmetic Intensity Depends on Precision

Switching from FP32 to BF16 affects arithmetic intensity in two ways:

Compute ceiling doubles (or triples with tensor cores): the ridge point shifts right dramatically
Bytes per element halves: arithmetic intensity doubles for the same algorithm

For a GEMM of size N with BF16 (2 bytes/element) vs FP32 (4 bytes/element):

BF16: $I = 2N^3 / (2 \times 3N^2) = N/3$ FLOPs/byte
FP32: $I = 2N^3 / (4 \times 3N^2) = N/6$ FLOPs/byte

BF16 has twice the arithmetic intensity, and roughly 3-15x higher compute ceiling (depending on whether tensor cores engage). Both effects push the operation deeper into the compute-bound regime. For memory-bound operations, only the bytes-per-element reduction matters - you get at most a 2x speedup from switching precision for an elementwise operation.

Common Mistakes

:::danger Optimizing the Wrong Bottleneck Adding tensor cores, enabling BF16, and tuning CUDA launch parameters on a memory-bound kernel gives near-zero improvement. The kernel is waiting for data from HBM. Better compute units sit idle. Always measure arithmetic intensity and compare to the ridge point before choosing an optimization strategy. Ten minutes with Nsight Compute saves weeks of wasted engineering. :::

:::danger Assuming Large = Compute-Bound A matrix multiply is not automatically compute-bound just because the matrices are large. For batch size 1 inference with a weight matrix $W \in \mathbb{R}^{4096 \times 16384}$ , the operation is a GEMV (single vector times matrix). Arithmetic intensity is approximately 0.5 FLOPs/byte - strongly memory-bound. Switching to FP8 tensor cores saves nothing. The fix is batching tokens or speculative decoding to increase effective batch size, converting the GEMV back into a GEMM. :::

:::warning Confusing Bandwidth-Bound and Latency-Bound Memory-bound operations can fail to reach even the bandwidth ceiling for different reasons. A kernel that performs scattered random reads (e.g., gather operations for embedding lookups) may not saturate bandwidth because memory latency - not throughput - is the bottleneck. Thousands of independent small reads that cannot be pipelined together leave bandwidth headroom unused. Nsight Compute's "Memory Workload Analysis" section distinguishes these cases. The roofline model assumes you can stream data; latency-bound kernels live below the roofline for different reasons. :::

:::warning Using Theoretical FLOPs vs Actual FLOPs Softmax looks like it has roughly 5 FLOPs per element analytically. But its actual FLOP count depends on the implementation: is the max subtraction done in a separate pass? Is the division implemented with a reciprocal multiply? Different implementations have different counts. When comparing analytical and measured arithmetic intensity, use Nsight Compute's hardware counters for FLOPs rather than hand-calculated estimates for any operation involving transcendentals or reductions. :::

:::warning The Roofline Assumes Perfect Memory Access Patterns The roofline model gives an optimistic upper bound. Real kernels are further below the ceiling due to: bank conflicts in shared memory, uncoalesced global memory access, warp divergence, pipeline stalls, launch overhead. The roofline tells you the direction and rough magnitude of the gap to close. Nsight Compute's individual performance limiters (like the "Compute Workload Analysis" and "Memory Workload Analysis" sections) explain the remaining gap between the roofline and your actual performance. :::

Interview Q&A

Q1: Calculate the arithmetic intensity of a BF16 matrix multiply for M=N=K=4096. Is it compute-bound or memory-bound on an H100?

Answer:

FLOPs: $2 \times 4096^3 = 2 \times 68,719,476,736 \approx 137.4 \text{ GFLOPs}$

Bytes accessed (BF16 = 2 bytes/element):

Read A: $4096 \times 4096 \times 2 = 33.5 \text{ MB}$
Read B: $4096 \times 4096 \times 2 = 33.5 \text{ MB}$
Write C: $4096 \times 4096 \times 2 = 33.5 \text{ MB}$
Total: $100.5 \text{ MB} \approx 100.5 \times 10^6$ bytes

Arithmetic intensity: $I = 137.4 \times 10^9 / 100.5 \times 10^6 \approx 1367 \text{ FLOPs/byte}$

Wait - let us recalculate more carefully. The general formula for square GEMM in BF16 (dtype_bytes = 2) is:

$I = \frac{2N^3}{2 \times 3N^2} = \frac{N}{3}$

For $N = 4096$ : $I = 4096 / 3 \approx 1365$ FLOPs/byte.

H100 BF16 Tensor Core ridge point: $989 \text{ TFLOPS} / 3.35 \text{ TB/s} \approx 295$ FLOPs/byte.

Since $1365 \gg 295$ , this operation is strongly compute-bound on the H100. The compute units are the limiting factor. To improve performance, you would need higher compute throughput (e.g., FP8 quantization, which doubles tensor core throughput) or a reduced-FLOP algorithm. Reducing memory bandwidth use does not help.

Q2: Why does FlashAttention move the attention operation to the compute-bound region of the roofline?

Answer:

Naive attention for sequence length $L$ and head dimension $d$ must write the $L \times L$ attention matrix (after the $QK^T$ matmul and before the softmax $\times V$ matmul) to HBM. This intermediate write is $O(L^2)$ bytes.

Bytes accessed in naive attention: approximately $4L^2 \times 2 + 4Ld \times 2$ bytes (the $L^2$ term dominates for long sequences).

FLOPs are $4L^2 d$ (two GEMMs plus softmax overhead).

So arithmetic intensity is approximately $4L^2 d / (8L^2) = d/2$ . For $d = 64$ , that is 32 FLOPs/byte - well below the H100 BF16 ridge of 295 FLOPs/byte. Memory-bound.

FlashAttention tiles the computation so that K and V blocks fit in shared memory. The $L \times L$ attention matrix is never materialized in HBM. Bytes accessed drop to $O(Ld)$ - only Q, K, V, and output.

New arithmetic intensity: $4L^2 d / (4Ld \times 2) = L/2$ . For $L = 2048, d = 64$ : $I = 1024$ FLOPs/byte. The algorithm crossed the ridge point - it is now compute-bound. The 3-5x speedup of FlashAttention comes entirely from this roofline shift, not from reducing FLOPs (which are identical).

Q3: What is the difference between being memory-bandwidth bound and memory-latency bound?

Answer:

Bandwidth-bound means the GPU can pipeline memory requests efficiently but is limited by the total bytes per second the HBM bus can transfer. The memory controller is fully busy, all row buffers are open, and DRAM transfers are happening back-to-back. Adding more DRAM bandwidth would help. Adding cache hits or reusing data (kernel fusion) would help.

Latency-bound means the GPU is issuing many small, independent, non-sequential memory requests, and the execution units are stalling waiting for each request to complete before the next can proceed. The memory bus is not saturated - far fewer bytes per second are actually moving than the bandwidth ceiling would allow. The bottleneck is the round-trip time (HBM latency is ~200-600ns) multiplied by the number of sequential requests.

Embedding table lookups are a classic example: each lookup reads 4-64 bytes from an essentially random address. These cannot be pipelined because the next lookup address depends on the input token. Even with a large embedding table, the GPU hits maybe 10-15% of peak HBM bandwidth because requests queue up behind latency rather than flooding the bandwidth.

The fix for bandwidth-bound: reduce bytes accessed (fusion, tiling, quantization). The fix for latency-bound: group lookups, prefetch, pad to aligned access, or restructure the computation to enable coalescing.

Nsight Compute's "Memory Workload Analysis" page distinguishes the two cases via the "Memory Throughput" vs "L1/L2 Hit Rate" metrics. If bandwidth utilization is low but there are many outstanding requests, you are latency-bound.

Q4: You profile a custom attention kernel on an H100 and find it achieves 120 TFLOPS in BF16. The theoretical arithmetic intensity is 64 FLOPs/byte. Is this kernel well-optimized?

Answer:

First, determine the roofline ceiling at $I = 64$ FLOPs/byte:

$P_{achievable} = \min(989 \text{ TFLOPS}, \; 64 \times 3.35 \text{ TB/s}) = \min(989, 214.4) = 214.4 \text{ TFLOPS}$

The ceiling is 214.4 TFLOPS (the kernel is memory-bound - 64 < 295 ridge point).

The kernel achieves 120 TFLOPS against a ceiling of 214.4 TFLOPS: efficiency = $120/214.4 = 56\%$ of the roofline limit.

This is mediocre. A well-optimized memory-bound kernel typically achieves 70-90% of the bandwidth ceiling. The gap indicates secondary bottlenecks: likely uncoalesced memory accesses (the attention matrix is accessed in a pattern that may not be coalesced), shared memory bank conflicts in the softmax reduction, or warp synchronization overhead in the multi-pass algorithm.

The right action is not to switch to FlashAttention immediately (though that would help). First, understand why there is a 44% gap from the bandwidth ceiling. Nsight Compute's "Memory Access Pattern" analysis will show whether the accesses are coalesced. If you close this gap first (getting to ~190 TFLOPS), then deciding whether to implement FlashAttention becomes an informed tradeoff between engineering cost and remaining gain.

Q5: A team switches their training cluster from A100 to H100. Their large GEMM operations get roughly 3x faster as expected. But their overall training throughput only improves by 1.4x. What is the likely explanation, and how would you diagnose it with the roofline model?

Answer:

The A100 BF16 Tensor Core ridge point is $312/2.0 = 156$ FLOPs/byte. The H100 BF16 Tensor Core ridge point is $989/3.35 = 295$ FLOPs/byte. The H100 has nearly double the ridge point.

Operations that were just above the A100 ridge (e.g., small-to-medium GEMMs with $I \approx 170-280$ FLOPs/byte) are now in the memory-bound region on the H100. They do not benefit from the 3x compute improvement. They only benefit from the 1.7x bandwidth improvement (3.35 vs 2.0 TB/s).

If the model has a significant fraction of time spent in medium-sized GEMMs (typical for short sequences, small batch sizes, or models with many small projections), those GEMMs went from compute-bound on A100 to memory-bound on H100. The expected 3x speedup collapsed to 1.7x for those layers.

Additionally, all the non-GEMM operations (LayerNorm, activations, embedding lookups) only see a 1.7x improvement from the bandwidth upgrade - these were memory-bound on A100 and remain memory-bound on H100.

Diagnosis using the roofline model:

Run Nsight Compute on a representative training step on both GPUs
Export the arithmetic intensity for each kernel
Plot both roofline charts (A100 and H100) and overlay the kernel positions
Kernels that shifted from above the ridge to below the ridge are the culprits

Fix: For medium-sized GEMMs that are now memory-bound on H100, increase effective batch size (more tokens per gradient step), use gradient accumulation with larger micro-batches, or fuse operations around the GEMMs to amortize the memory traffic. FlashAttention also helps here by converting the attention bottleneck from memory-bound to compute-bound.

Q6: Explain how increasing batch size affects the roofline position of a linear layer during inference.

Answer:

A linear layer computes $Y = XW^T$ where $X \in \mathbb{R}^{B \times d_{in}}$ , $W \in \mathbb{R}^{d_{out} \times d_{in}}$ , $Y \in \mathbb{R}^{B \times d_{out}}$ .

At batch size B=1 (single-token inference): the operation is $y = Wx$ where $x$ is a vector. This is a GEMV.

FLOPs: $2 \times d_{out} \times d_{in}$
Bytes: Read $W$ ( $d_{out} \times d_{in} \times 2$ bytes in BF16) + Read $x$ ( $d_{in} \times 2$ bytes) + Write $y$ ( $d_{out} \times 2$ bytes)
For large weight matrix ( $d_{out} \times d_{in} \gg d_{in} + d_{out}$ ): bytes $\approx 2 \times d_{out} \times d_{in}$
Arithmetic intensity: $I = 2 d_{out} d_{in} / (2 d_{out} d_{in}) = 1$ FLOPs/byte

1 FLOPs/byte is catastrophically memory-bound on every GPU. Single-token autoregressive generation is fundamentally memory-bandwidth limited, not compute limited.

At batch size B=32: now processing 32 tokens simultaneously. Operation is a true GEMM.

FLOPs: $2 \times 32 \times d_{out} \times d_{in}$
Bytes: Same $W$ read, plus 32x larger $X$ and $Y$ : $\approx 2 \times d_{out} \times d_{in} + 2 \times 32 \times (d_{in} + d_{out})$
For large weight matrices the $W$ read still dominates: bytes $\approx 2 d_{out} d_{in}$
Arithmetic intensity: $I \approx B = 32$ FLOPs/byte (simplifying)

Still memory-bound for the H100 ridge at 295 FLOPs/byte.

At batch size B=512: $I \approx 512$ FLOPs/byte. Now compute-bound on H100.

This is why LLM serving is obsessed with batching. Every token added to a batch amortizes the cost of reading the weight matrix. The "continuous batching" technique in systems like vLLM increases effective batch size by packing tokens from multiple users into a single GPU forward pass, moving the linear layers up the roofline from memory-bound toward the compute ceiling.

Summary

The roofline model is a two-line performance framework that divides GPU optimization space into exactly two regimes:

Memory-bound ( $I < I_{ridge}$ ): performance scales with arithmetic intensity times bandwidth. Fix by fusing kernels, tiling into fast memory, or reducing bytes accessed.
Compute-bound ( $I > I_{ridge}$ ): performance is capped at peak FLOPS. Fix by enabling Tensor Cores, using lower precision (FP8), or reducing algorithmic FLOP count.

The ridge point $I_{ridge} = P_{compute} / B$ is the dividing line. For the H100 in BF16 Tensor Core mode, that ridge is at 295 FLOPs/byte. Most transformer operations - elementwise activations, LayerNorm, naive attention, small GEMMs - fall well below this ridge. Only large matrix multiplications and FlashAttention-style tiled algorithms reliably cross it.

Key numbers to know cold:

H100 HBM bandwidth: 3.35 TB/s
H100 BF16 Tensor Core peak: 989 TFLOPS
H100 BF16 ridge: 295 FLOPs/byte
Elementwise op intensity: ~0.125 FLOPs/byte
Large GEMM intensity: $N/3$ FLOPs/byte (BF16)
Naive attention intensity: ~ $d/2$ FLOPs/byte
FlashAttention intensity: ~ $L/2$ FLOPs/byte

The roofline model is the first tool to reach for when a kernel is slow. Measure arithmetic intensity. Compare to the ridge. The optimization path follows directly.

Opening: A Performance Debug at 2 AM​

Why This Exists - The Problem of Not Knowing What to Optimize​

Historical Context - Williams, Waterman, and Patterson (2009)​

Core Concepts​

Arithmetic Intensity - The Single Most Important Number​

The Two Performance Ceilings​

The Ridge Point - Where Memory-Bound Becomes Compute-Bound​

H100 SXM5 Numbers - Real Hardware Ceilings​

Arithmetic Intensity of Common ML Operations​

Roofline Diagram for the H100​

Optimization Decision Tree​

Code Examples​

1. Computing Arithmetic Intensity Analytically​

2. Measuring Arithmetic Intensity Empirically with PyTorch Profiler​

3. Measuring the Performance Gap - Roofline Distance​

Reading NVIDIA Nsight Compute's Roofline Chart​

Production Engineering Notes​

Practical Rules for Transformer Models​

Kernel Fusion and the Roofline​

The L2 Cache Roofline​

Arithmetic Intensity Depends on Precision​

Common Mistakes​

Interview Q&A​

Q1: Calculate the arithmetic intensity of a BF16 matrix multiply for M=N=K=4096. Is it compute-bound or memory-bound on an H100?​

Q2: Why does FlashAttention move the attention operation to the compute-bound region of the roofline?​

Q3: What is the difference between being memory-bandwidth bound and memory-latency bound?​

Q4: You profile a custom attention kernel on an H100 and find it achieves 120 TFLOPS in BF16. The theoretical arithmetic intensity is 64 FLOPs/byte. Is this kernel well-optimized?​

Q5: A team switches their training cluster from A100 to H100. Their large GEMM operations get roughly 3x faster as expected. But their overall training throughput only improves by 1.4x. What is the likely explanation, and how would you diagnose it with the roofline model?​

Q6: Explain how increasing batch size affects the roofline position of a linear layer during inference.​

Summary​

Opening: A Performance Debug at 2 AM

Why This Exists - The Problem of Not Knowing What to Optimize

Historical Context - Williams, Waterman, and Patterson (2009)

Core Concepts

Arithmetic Intensity - The Single Most Important Number

The Two Performance Ceilings

The Ridge Point - Where Memory-Bound Becomes Compute-Bound

H100 SXM5 Numbers - Real Hardware Ceilings

Arithmetic Intensity of Common ML Operations

Roofline Diagram for the H100

Optimization Decision Tree

Code Examples

1. Computing Arithmetic Intensity Analytically

2. Measuring Arithmetic Intensity Empirically with PyTorch Profiler

3. Measuring the Performance Gap - Roofline Distance

Reading NVIDIA Nsight Compute's Roofline Chart

Production Engineering Notes

Practical Rules for Transformer Models

Kernel Fusion and the Roofline

The L2 Cache Roofline

Arithmetic Intensity Depends on Precision

Common Mistakes

Interview Q&A

Q1: Calculate the arithmetic intensity of a BF16 matrix multiply for M=N=K=4096. Is it compute-bound or memory-bound on an H100?

Q2: Why does FlashAttention move the attention operation to the compute-bound region of the roofline?

Q3: What is the difference between being memory-bandwidth bound and memory-latency bound?

Q4: You profile a custom attention kernel on an H100 and find it achieves 120 TFLOPS in BF16. The theoretical arithmetic intensity is 64 FLOPs/byte. Is this kernel well-optimized?

Q5: A team switches their training cluster from A100 to H100. Their large GEMM operations get roughly 3x faster as expected. But their overall training throughput only improves by 1.4x. What is the likely explanation, and how would you diagnose it with the roofline model?

Q6: Explain how increasing batch size affects the roofline position of a linear layer during inference.

Summary