Occupancy and Thread Block Tuning
Reading time: ~40 min · Interview relevance: Very High · Target roles: CUDA Developer, Performance Engineer, ML Systems Engineer
Occupancy is not a dial you turn to maximum. It is a constraint surface defined by register usage, shared memory, and block size simultaneously. The job of a performance engineer is to find the point on that surface where latency hiding is satisfied without starving compute.
The 3 AM Attention Kernel
The model had been in production for six weeks. Inference was fast enough - marginally. Then the platform team asked for a 40% throughput improvement before the next billing cycle. No new hardware. Just the existing A100 cluster.
The attention kernel was the obvious target. Multi-head attention at sequence length 2048 was burning 60% of total inference time. A senior engineer pulled up Nsight Compute and saw the number immediately: 31% occupancy. Thirty-one percent. The SM was sitting idle more than it was working.
The first instinct was correct but incomplete. Change the thread block size from 128 to 256 and see what happens. The recompile took 90 seconds. The benchmark ran. Throughput jumped 38%. The occupancy reading in Nsight climbed from 31% to 62%. The team celebrated for approximately four minutes, at which point someone suggested pushing it further - maybe 512 threads per block would hit 90% or higher.
It did not. At 512 threads per block, throughput dropped back to nearly where it had started. Occupancy read 41%. Worse than 256, and not much better than the original 128. Three configuration choices, three completely different outcomes, and no obvious intuition to explain why.
This is the occupancy problem. It is not a single number you optimize. It is a system with three independent resource limiters - registers, shared memory, and block size - that interact in non-linear ways. Increasing one configuration parameter often hits a different limiter and produces an outcome opposite to what you expected. The 512-thread experiment failed because at that block size the per-block register footprint doubled, and the register file could hold only two such blocks per SM - capping residency at half the warp slots.
The engineers who consistently write fast kernels are not guessing at these configurations. They are calculating. They know the exact register count per thread, the exact shared memory per block, and they run the arithmetic to find which limiter is binding before they touch a single line of code. This lesson teaches that calculation, from the hardware limits down to the formula, with a worked example you can apply to any kernel on any NVIDIA architecture.
Why This Exists
Before GPUs were programmable, they were fixed-function pipelines. Vertex shaders, then pixel shaders, then unified shaders - each generation added more generality. But the fundamental tension was always the same: how do you keep hundreds or thousands of execution units busy when memory latency is 300-600 clock cycles and arithmetic takes only 4?
CPUs answer this question with deep out-of-order execution, large branch predictors, and multi-level caches. A CPU core can have 200+ instructions in flight, reordered to hide latency from any individual instruction. This works, but it costs enormous die area and power budget. A 16-core CPU spends maybe 30% of its transistors on computation and 70% on the machinery of latency hiding.
GPUs took a radically different approach. Instead of making one thread go fast, make thousands of threads go simultaneously. When thread group A stalls waiting for memory, instantly switch to thread group B. Zero switching overhead because all groups' state is resident in the register file simultaneously. This is barrel processing - the GPU is a barrel processor at scale.
The critical insight is that this only works if there are enough thread groups resident. If you have only two groups and both stall on the same memory bank at the same time, the execution units go idle regardless. The minimum number of resident groups to statistically guarantee that some group is always ready is the fundamental question occupancy tries to answer.
NVIDIA formalized this as occupancy: the ratio of active warps (groups of 32 threads) to the maximum possible warps on a streaming multiprocessor. An occupancy of 100% means every warp slot is filled; 50% means half are empty. The hypothesis is that higher occupancy means better latency hiding and therefore better throughput. This hypothesis is mostly correct, with important exceptions.
Historical Context
The occupancy concept emerged with the first CUDA GPUs in 2006-2007. The G80 architecture, which shipped with the GeForce 8800 GTX, was NVIDIA's first fully programmable GPU. It introduced streaming multiprocessors (SMs) with a fixed number of warp slots, a shared register file, and configurable shared memory.
The CUDA Occupancy Calculator was initially a spreadsheet published by NVIDIA developer relations. Engineers would manually enter their kernel's register count, shared memory usage, and block size, and the spreadsheet returned theoretical occupancy. This spreadsheet became essential - developers passed it around, kept it open in a separate tab, and referenced it every time they were tuning a kernel.
Mark Harris at NVIDIA wrote the foundational blog posts on occupancy optimization in 2010-2012 that most CUDA performance guides still reference. His key observation was that occupancy is necessary but not sufficient for performance - a point that confused engineers who expected a simple linear relationship between occupancy and throughput.
The hardware has evolved substantially since then. H100 SMs have 65,536 registers per SM (versus 8,192 on G80), 228KB shared memory per SM (versus 16KB), and support up to 2048 threads per SM. But the occupancy calculation structure - three independent limiters, take the minimum - has remained unchanged for nearly 20 years.
Core Concepts
The Three Resource Limiters
An SM on the H100 has finite resources shared among all resident thread blocks:
- Register file: 65,536 32-bit registers per SM
- Shared memory: 228KB total, configurable per block (up to 227KB with cudaFuncSetAttribute)
- Thread slots: 2,048 threads maximum per SM
- Block slots: 32 blocks maximum per SM
- Warp slots: 64 warps maximum per SM (since 2048 / 32 = 64)
Occupancy is defined as:

occupancy = active_warps_per_SM / max_warps_per_SM
Every kernel launch has three independent constraints, each of which limits how many blocks can be simultaneously resident on an SM. Theoretical occupancy is determined by whichever constraint is tightest.
Limiter 1 - The Register Constraint
The register file has 65,536 registers per SM. Each thread claims some number of registers; registers are allocated in chunks of 256 per warp (NVIDIA aligns register allocation to warp granularity to simplify hardware).
regs_per_warp = regs_per_thread x 32

Wait - why multiply by 32? Because a warp is 32 threads, and the register file allocates per-warp, not per-thread. If each thread needs 32 registers, each warp needs 32 x 32 = 1,024 registers. The SM can hold 65,536 / 1,024 = 64 warps - no limit from registers.
But if each thread needs 64 registers (common in complex kernels), each warp needs 64 x 32 = 2,048 registers. The SM can hold 65,536 / 2,048 = 32 warps - exactly half occupancy.
There is a critical non-linearity here: register allocation happens in chunks. On Ampere and Hopper, the allocation granularity is 256 registers per warp. So if a thread uses 33 registers, the raw per-warp need of 33 x 32 = 1,056 registers is rounded up to 1,280 - the equivalent of 40 registers per thread. This creates occupancy cliffs.
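The rounding arithmetic can be sketched in a few lines (a sketch assuming Hopper-class limits; the helper name is ours):

```python
import math

REGFILE = 65536      # registers per SM (H100)
ALLOC_UNIT = 256     # per-warp register allocation granularity
WARP_SIZE = 32
MAX_WARPS = 64

def warps_from_registers(regs_per_thread: int) -> int:
    """Warps per SM permitted by the register file, after per-warp rounding."""
    raw = regs_per_thread * WARP_SIZE                   # 33 regs -> 1056 per warp
    rounded = math.ceil(raw / ALLOC_UNIT) * ALLOC_UNIT  # 1056 -> 1280
    return min(REGFILE // rounded, MAX_WARPS)

print(warps_from_registers(32))  # 64 - full occupancy
print(warps_from_registers(33))  # 51 - the cliff
```

One extra register costs 13 warp slots - the cliff effect in its purest form.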
Limiter 2 - The Shared Memory Constraint
Each block uses some amount of shared memory. The SM has a fixed pool. The number of blocks that can fit determines the shared memory warp limit.
Convert blocks to warps:

blocks_from_smem = floor(smem_per_SM / smem_per_block)
warps_from_smem = blocks_from_smem x warps_per_block

On H100, smem_per_SM defaults to 100KB but can be raised to 228KB per SM if you configure the L1/shared split. If a kernel uses 16KB of shared memory per block and the SM has 100KB available, you can fit 6 blocks. With 256 threads per block (8 warps), that gives 6 x 8 = 48 warps, or 48 / 64 = 75% occupancy from shared memory alone.
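The same arithmetic as a short sketch (H100 default 100KB pool assumed; the helper name is ours):

```python
def warps_from_shared_memory(smem_per_block: int, threads_per_block: int,
                             smem_per_sm: int = 100 * 1024) -> int:
    """Warps per SM permitted by shared memory - whole blocks only."""
    blocks = smem_per_sm // smem_per_block     # 102400 // 16384 = 6 blocks
    warps_per_block = threads_per_block // 32  # 256 threads -> 8 warps
    return blocks * warps_per_block            # 6 * 8 = 48 warps

print(warps_from_shared_memory(16 * 1024, 256))  # 48 -> 75% of 64 warp slots
```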
Limiter 3 - The Block Size Constraint
This one is simpler. There are two sub-constraints: maximum blocks per SM (32) and maximum threads per SM (2048):

blocks_from_block_limit = min(32, floor(2048 / threads_per_block), floor(64 / warps_per_block))
warps_from_block_limit = blocks_from_block_limit x warps_per_block
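A sketch of this limiter (H100 limits assumed; the helper name is ours):

```python
def blocks_from_block_limits(threads_per_block: int,
                             max_threads: int = 2048,
                             max_warps: int = 64,
                             max_blocks: int = 32) -> int:
    """Resident blocks permitted by the block-size constraints alone."""
    warps_per_block = threads_per_block // 32
    return min(max_warps // warps_per_block,      # warp slots
               max_threads // threads_per_block,  # thread slots
               max_blocks)                        # hardware block slots

print(blocks_from_block_limits(256))   # 8 blocks (64 warps)
print(blocks_from_block_limits(1024))  # 2 blocks (64 warps)
```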
The Occupancy Formula
The active warps per SM is the minimum of all three limiters:

active_warps = min(warps_from_regs, warps_from_smem, warps_from_block_limit)
occupancy = active_warps / 64
Worked Example
Let's calculate occupancy for a real attention kernel configuration:
- 48 registers per thread (typical for a fused softmax attention kernel)
- 16KB shared memory per block (for KV tile storage)
- 256 threads per block
Register limiter: 48 x 32 = 1,536 registers per warp (already a multiple of the 256-register granularity). Each 8-warp block needs 1,536 x 8 = 12,288 registers, and floor(65,536 / 12,288) = 5 blocks fit - 40 warps.

Shared memory limiter (assuming 100KB SM budget): floor(102,400 / 16,384) = 6 blocks - 48 warps.

Block size limiter: min(32, 2,048 / 256, 64 / 8) = 8 blocks - 64 warps.

Final occupancy: min(40, 48, 64) = 40 active warps, and 40 / 64 = 62.5%.
The register limiter is binding. To improve occupancy, we need to either reduce register usage or change block size.
If we change block size to 128 threads (4 warps per block):
- Blocks from warp slots: 64 / 4 = 16 blocks
- Blocks from thread limit: 2,048 / 128 = 16 blocks
- Blocks from smem: floor(100KB / 16KB) = 6 blocks (smem becomes binding)
- Warps from smem: 6 x 4 = 24 warps
- Occupancy: 24 / 64 = 37.5%
Going down to 128 threads moved the binding limiter from registers to shared memory and cut occupancy nearly in half. This is exactly where the original 128-thread configuration in the opening scenario was stuck: 37.5% theoretical occupancy, with Nsight reporting 31% achieved.
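Both configurations can be double-checked with a compact calculator (a sketch assuming H100 default limits and whole-block residency; the `occupancy` helper name is ours):

```python
import math

def occupancy(regs_per_thread: int, smem_per_block: int, threads_per_block: int,
              regfile: int = 65536, smem_sm: int = 100 * 1024,
              max_threads: int = 2048, max_warps: int = 64,
              max_blocks: int = 32, alloc_unit: int = 256) -> float:
    """Theoretical occupancy with residency rounded to whole blocks."""
    wpb = threads_per_block // 32
    regs_warp = math.ceil(regs_per_thread * 32 / alloc_unit) * alloc_unit
    blocks = min(regfile // (regs_warp * wpb),      # register limiter
                 smem_sm // smem_per_block,         # shared memory limiter
                 max_warps // wpb,                  # warp slots
                 max_threads // threads_per_block,  # thread slots
                 max_blocks)                        # block slots
    return blocks * wpb / max_warps

print(occupancy(48, 16 * 1024, 256))  # 0.625
print(occupancy(48, 16 * 1024, 128))  # 0.375
```

Note the `min` runs over whole blocks, since the hardware schedules residency block by block, never in fractional blocks.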
The Occupancy Cliff
The register allocation granularity creates a cliff effect. On Hopper, registers are allocated in units of 256 per warp. This means:
| Regs/thread | Regs/warp | Rounded to | Warps on SM |
|---|---|---|---|
| 32 | 1024 | 1024 | 64 |
| 33 | 1056 | 1280 | 51 |
| 40 | 1280 | 1280 | 51 |
| 41 | 1312 | 1536 | 42 |
| 48 | 1536 | 1536 | 42 |
| 49 | 1568 | 1792 | 36 |
| 64 | 2048 | 2048 | 32 |
Notice: going from 32 to 33 registers per thread drops warp count from 64 to 51 - a 20% occupancy loss for one extra register. This is the occupancy cliff. A kernel that uses 33 registers when the compiler could have used 32 is operating at 80% of achievable occupancy due to a single-register alignment issue.
__launch_bounds__ - Telling the Compiler Your Intent
The CUDA compiler decides how many registers to allocate per thread based on its register pressure analysis. It does not know your occupancy target. You can hint it with __launch_bounds__:
__global__
__launch_bounds__(256, 4) // max 256 threads per block, min 4 blocks per SM
void attention_kernel(float* Q, float* K, float* V, float* out, int seq_len) {
// kernel body
}
The second argument (minimum blocks per SM) is the key occupancy lever. By specifying MIN_BLOCKS = 4, you are telling the compiler: "I need at least 4 blocks resident on each SM simultaneously. Please allocate registers such that this is possible."
The compiler then calculates: 4 blocks minimum x 256 threads per block = 1,024 resident threads, and the H100 has 65,536 registers per SM. Maximum registers per thread = 65,536 / 1,024 = 64. If the natural register pressure is 70, the compiler will spill some registers to local memory (which maps to L2 cache and HBM). If the natural pressure is 55, no spilling occurs and the constraint is not binding.
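The budget arithmetic is just integer division (a sketch; the helper name is ours, and the real compiler additionally respects allocation granularity, so the effective ceiling can be slightly lower):

```python
REGFILE = 65536  # registers per SM (H100)

def max_regs_per_thread(threads_per_block: int, min_blocks_per_sm: int) -> int:
    """Per-thread register ceiling implied by __launch_bounds__(threads, min_blocks)."""
    return REGFILE // (threads_per_block * min_blocks_per_sm)

print(max_regs_per_thread(256, 4))  # 64 - the example above
print(max_regs_per_thread(128, 6))  # 85
print(max_regs_per_thread(256, 6))  # 42
```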
The tradeoff: register spilling adds memory traffic. If the compiler has to spill 10 registers per thread, every time those values are needed, the kernel reads from L2 instead of the register file. Whether this trade is worthwhile depends entirely on whether occupancy was the bottleneck.
cudaOccupancyMaxPotentialBlockSize - Automated Tuning
CUDA provides a runtime function that runs the occupancy calculation for you:
int minGridSize, blockSize;
cudaOccupancyMaxPotentialBlockSize(
    &minGridSize,
    &blockSize,
    attention_kernel,
    0,  // dynamic shared memory per block
    0   // block size limit (0 = no limit)
);
This returns the block size that maximizes occupancy. It does not account for algorithm-specific constraints (e.g., your tile size must be a perfect square), so treat it as a starting point, not a final answer.
For kernels with dynamic shared memory, pass the expected shared memory per block as the third argument. The function factors this into the shared memory limiter calculation.
Why 100% Occupancy Is NOT Always Optimal
This is the most commonly misunderstood point about occupancy. The goal is not maximum occupancy. The goal is sufficient occupancy to hide latency, plus maximum compute throughput.
Consider two kernels:
Kernel A: 100% occupancy, poor memory coalescing. Every global memory access touches non-contiguous addresses. Each warp issues 32 separate cache line requests instead of 1. Memory bandwidth utilization: 15%.
Kernel B: 50% occupancy, perfect memory coalescing. Every global memory access is fully coalesced. Memory bandwidth utilization: 75%.
Kernel B is dramatically faster despite half the occupancy. The SM in Kernel B is doing meaningful work with the threads it has. The SM in Kernel A has lots of threads, but they are all stalled waiting for scattered memory requests.
The latency hiding requirement is what matters. You need enough warps that when some stall on memory, others can execute. A rough rule of thumb: 4-8 warps per SM in flight at any time is sufficient to hide typical HBM latency of 400-600 cycles at GPU frequencies. On H100, 8 warps = 12.5% occupancy. Most real kernels with reasonable memory access patterns need 25-50% occupancy to fully saturate the execution units.
The cases where high occupancy matters most are arithmetic-bound kernels with light memory traffic. For memory-bound kernels (most attention variants, embedding lookups, reduction operations), occupancy past 50% rarely produces measurable improvement.
Python Occupancy Calculator
import math
from dataclasses import dataclass


@dataclass
class SMConfig:
    """Hardware limits for a specific GPU SM."""
    max_registers_per_sm: int
    max_shared_memory_per_sm: int  # bytes
    max_threads_per_sm: int
    max_warps_per_sm: int
    max_blocks_per_sm: int
    warp_size: int = 32
    register_alloc_unit: int = 256  # registers per warp, allocation granularity


# Hardware profiles
H100_SM = SMConfig(
    max_registers_per_sm=65536,
    max_shared_memory_per_sm=100 * 1024,  # default config; up to 228KB
    max_threads_per_sm=2048,
    max_warps_per_sm=64,
    max_blocks_per_sm=32,
    register_alloc_unit=256,
)

A100_SM = SMConfig(
    max_registers_per_sm=65536,
    max_shared_memory_per_sm=96 * 1024,
    max_threads_per_sm=2048,
    max_warps_per_sm=64,
    max_blocks_per_sm=32,
    register_alloc_unit=256,
)


def round_up_to_granularity(value: int, granularity: int) -> int:
    return math.ceil(value / granularity) * granularity


def calculate_occupancy(
    sm: SMConfig,
    regs_per_thread: int,
    smem_per_block_bytes: int,
    threads_per_block: int,
    verbose: bool = True,
) -> float:
    """
    Calculate theoretical occupancy given kernel resource usage.
    Returns occupancy as a fraction in [0, 1].
    """
    warps_per_block = math.ceil(threads_per_block / sm.warp_size)

    # --- Register limiter ---
    # Registers are allocated per warp, rounded to allocation granularity.
    # Residency is granted in whole blocks, so round the register-limited
    # warp count down to block granularity - this is what produces the
    # 512-thread surprise in the opening scenario.
    regs_per_warp_raw = regs_per_thread * sm.warp_size
    regs_per_warp = round_up_to_granularity(regs_per_warp_raw, sm.register_alloc_unit)
    if regs_per_warp == 0:
        blocks_from_regs = sm.max_blocks_per_sm
    else:
        blocks_from_regs = sm.max_registers_per_sm // (regs_per_warp * warps_per_block)
    warps_from_regs = blocks_from_regs * warps_per_block

    # --- Shared memory limiter ---
    if smem_per_block_bytes == 0:
        blocks_from_smem = sm.max_blocks_per_sm
    else:
        blocks_from_smem = sm.max_shared_memory_per_sm // smem_per_block_bytes
    warps_from_smem = blocks_from_smem * warps_per_block

    # --- Block size limiter ---
    blocks_from_block_limit = min(
        sm.max_warps_per_sm // warps_per_block,       # warp slots
        sm.max_threads_per_sm // threads_per_block,   # thread slots
        sm.max_blocks_per_sm,                         # hardware block slots
    )
    warps_from_block_limit = blocks_from_block_limit * warps_per_block

    # --- Final occupancy ---
    active_warps = min(warps_from_regs, warps_from_smem, warps_from_block_limit)
    occupancy = active_warps / sm.max_warps_per_sm

    if verbose:
        print("=== Occupancy Analysis ===")
        print(f"Threads per block    : {threads_per_block} ({warps_per_block} warps)")
        print(f"Registers per thread : {regs_per_thread} (rounded warp alloc: {regs_per_warp})")
        print(f"Shared memory/block  : {smem_per_block_bytes / 1024:.1f} KB")
        print()
        print(f"Register limiter     : {warps_from_regs} warps ({blocks_from_regs} blocks)")
        print(f"Shared mem limiter   : {warps_from_smem} warps ({blocks_from_smem} blocks)")
        print(f"Block size limiter   : {warps_from_block_limit} warps ({blocks_from_block_limit} blocks)")
        print()
        binding = min(
            ("registers", warps_from_regs),
            ("shared_memory", warps_from_smem),
            ("block_limit", warps_from_block_limit),
            key=lambda x: x[1],
        )
        print(f"Binding limiter      : {binding[0]} ({binding[1]} warps)")
        print(f"Active warps         : {active_warps} / {sm.max_warps_per_sm}")
        print(f"Theoretical occupancy: {occupancy:.1%}")
    return occupancy


def sweep_block_sizes(
    sm: SMConfig,
    regs_per_thread: int,
    smem_per_block_bytes: int,
    block_sizes: list[int] | None = None,
) -> None:
    """Print occupancy for a range of block sizes to find the optimal configuration."""
    if block_sizes is None:
        block_sizes = [32, 64, 96, 128, 192, 256, 320, 384, 512, 1024]
    print("Block Size | Warps/Block | Active Warps | Occupancy")
    print("-" * 55)
    for bs in block_sizes:
        if bs > sm.max_threads_per_sm:
            continue
        occ = calculate_occupancy(sm, regs_per_thread, smem_per_block_bytes, bs, verbose=False)
        warps = math.ceil(bs / sm.warp_size)
        active = round(occ * sm.max_warps_per_sm)
        print(f"{bs:>10} | {warps:>11} | {active:>12} | {occ:>8.1%}")


# --- Example usage ---
if __name__ == "__main__":
    print("=== Worked example: attention kernel (48 regs, 16KB smem, 256 threads) ===\n")
    calculate_occupancy(H100_SM, regs_per_thread=48, smem_per_block_bytes=16384, threads_per_block=256)

    print("\n=== Block size sweep: 48 regs, 16KB smem ===\n")
    sweep_block_sizes(H100_SM, regs_per_thread=48, smem_per_block_bytes=16384)

    print("\n=== Occupancy cliff demo: register count sensitivity (256-thread blocks) ===\n")
    print(f"{'Regs/thread':>12} | {'Warps/SM':>9} | {'Occupancy':>10}")
    print("-" * 38)
    for r in [32, 33, 40, 41, 48, 49, 56, 57, 64, 65, 80, 128]:
        occ = calculate_occupancy(H100_SM, r, 0, 256, verbose=False)
        warps = round(occ * H100_SM.max_warps_per_sm)
        print(f"{r:>12} | {warps:>9} | {occ:>9.1%}")
Running this produces (showing the sweep rows for the block sizes from the opening scenario):

=== Block size sweep: 48 regs, 16KB smem ===
Block Size | Warps/Block | Active Warps | Occupancy
-------------------------------------------------------
       128 |           4 |           24 |    37.5%
       256 |           8 |           40 |    62.5%
       512 |          16 |           32 |    50.0%
      1024 |          32 |           32 |    50.0%
This matches the opening scenario. 128 is limited by shared memory (6 blocks x 4 warps = 24 warps). 256 is limited by registers (5 blocks x 8 warps = 40 warps). 512 is limited by registers too: only 2 blocks of 16 warps fit in the register file, for 32 warps.
CUDA Kernel with __launch_bounds__
#include <cuda_runtime.h>
#include <stdio.h>
// Kernel WITHOUT launch_bounds - compiler picks register count freely
__global__ void attention_no_bounds(
    const float* __restrict__ Q,
    const float* __restrict__ K,
    const float* __restrict__ V,
    float* __restrict__ out,
    int seq_len,
    int head_dim,
    float scale
) {
    int tid = threadIdx.x;
    int row = blockIdx.x;  // one block per attention row

    extern __shared__ float smem[];
    float* s_scores = smem;               // seq_len floats
    float* s_V_row = smem + seq_len;      // head_dim floats (reserved; unused here)
    (void)s_V_row;

    __shared__ float s_max, s_sum;        // row-wide softmax statistics

    // Compute scaled attention scores for this row
    for (int col = tid; col < seq_len; col += blockDim.x) {
        float score = 0.0f;
        for (int d = 0; d < head_dim; d++) {
            score += Q[row * head_dim + d] * K[col * head_dim + d];
        }
        s_scores[col] = score * scale;
    }
    __syncthreads();

    // Row-wide max for a numerically stable softmax. A serial reduction by
    // thread 0 keeps the example simple; a tree or warp-shuffle reduction
    // would be faster.
    if (tid == 0) {
        float m = -1e30f;
        for (int col = 0; col < seq_len; col++) m = fmaxf(m, s_scores[col]);
        s_max = m;
    }
    __syncthreads();

    // Softmax: exponentiate, then row-wide sum
    for (int col = tid; col < seq_len; col += blockDim.x) {
        s_scores[col] = expf(s_scores[col] - s_max);
    }
    __syncthreads();
    if (tid == 0) {
        float s = 0.0f;
        for (int col = 0; col < seq_len; col++) s += s_scores[col];
        s_sum = s;
    }
    __syncthreads();

    // Weighted sum of V
    for (int d = tid; d < head_dim; d += blockDim.x) {
        float val = 0.0f;
        for (int col = 0; col < seq_len; col++) {
            val += s_scores[col] * V[col * head_dim + d];
        }
        out[row * head_dim + d] = val / s_sum;
    }
}
// Same kernel WITH launch_bounds - compiler limits registers to hit occupancy target
// Telling the compiler: max 256 threads per block, target at least 4 blocks per SM
// At 4 blocks * 256 threads/block = 1024 threads, register budget = 65536/1024 = 64 regs/thread
__global__
__launch_bounds__(256, 4)
void attention_with_bounds(
    const float* __restrict__ Q,
    const float* __restrict__ K,
    const float* __restrict__ V,
    float* __restrict__ out,
    int seq_len,
    int head_dim,
    float scale
) {
    // Identical body - only the compiler's register allocation strategy differs
    int tid = threadIdx.x;
    int row = blockIdx.x;

    extern __shared__ float smem[];
    float* s_scores = smem;
    float* s_V_row = smem + seq_len;  // reserved; unused here
    (void)s_V_row;

    __shared__ float s_max, s_sum;

    for (int col = tid; col < seq_len; col += blockDim.x) {
        float score = 0.0f;
        for (int d = 0; d < head_dim; d++) {
            score += Q[row * head_dim + d] * K[col * head_dim + d];
        }
        s_scores[col] = score * scale;
    }
    __syncthreads();

    if (tid == 0) {
        float m = -1e30f;
        for (int col = 0; col < seq_len; col++) m = fmaxf(m, s_scores[col]);
        s_max = m;
    }
    __syncthreads();

    for (int col = tid; col < seq_len; col += blockDim.x) {
        s_scores[col] = expf(s_scores[col] - s_max);
    }
    __syncthreads();
    if (tid == 0) {
        float s = 0.0f;
        for (int col = 0; col < seq_len; col++) s += s_scores[col];
        s_sum = s;
    }
    __syncthreads();

    for (int d = tid; d < head_dim; d += blockDim.x) {
        float val = 0.0f;
        for (int col = 0; col < seq_len; col++) {
            val += s_scores[col] * V[col * head_dim + d];
        }
        out[row * head_dim + d] = val / s_sum;
    }
}
Programmatic Block Size Optimization
#include <cuda_runtime.h>
#include <stdio.h>
// Wrapper that finds a good block size at runtime
void launch_with_optimal_blocksize(
    const float* Q, const float* K, const float* V,
    float* out, int seq_len, int head_dim, float scale,
    int num_rows
) {
    int optimalBlockSize;
    int minGridSize;

    // Dynamic shared memory needed per block:
    // (seq_len + head_dim) floats for this kernel
    size_t dynamicSmemBytes = (seq_len + head_dim) * sizeof(float);

    cudaError_t err = cudaOccupancyMaxPotentialBlockSize(
        &minGridSize,
        &optimalBlockSize,
        attention_with_bounds,  // kernel function pointer
        dynamicSmemBytes,       // dynamic shared memory per block
        0                       // max block size (0 = use hardware max)
    );
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaOccupancyMaxPotentialBlockSize failed: %s\n",
                cudaGetErrorString(err));
        return;
    }
    printf("Optimal block size: %d threads\n", optimalBlockSize);
    printf("Min grid size for full occupancy: %d blocks\n", minGridSize);

    // Query how many blocks of that size actually fit per SM
    int maxActiveBlocks;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &maxActiveBlocks,
        attention_with_bounds,
        optimalBlockSize,
        dynamicSmemBytes
    );
    printf("Max active blocks per SM: %d\n", maxActiveBlocks);

    // Convert to theoretical occupancy using the device's thread capacity
    int deviceId;
    cudaGetDevice(&deviceId);
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, deviceId);
    float occupancy = (float)(maxActiveBlocks * optimalBlockSize)
                    / (float)prop.maxThreadsPerMultiProcessor;
    printf("Theoretical occupancy: %.1f%%\n", occupancy * 100.0f);

    // Launch: the kernel maps blockIdx.x to an attention row,
    // so the grid needs one block per row
    attention_with_bounds<<<num_rows, optimalBlockSize, dynamicSmemBytes>>>(
        Q, K, V, out, seq_len, head_dim, scale
    );
}
Checking Register Usage
To see how many registers your kernel actually uses, compile with verbose output:
nvcc -arch=sm_90 -lineinfo --ptxas-options=-v kernel.cu -o kernel 2>&1 | grep "registers"
# Output example:
# ptxas info: 0 bytes gmem, 16384 bytes smem
# ptxas info: Compiling entry function 'attention_with_bounds' for 'sm_90'
# ptxas info: Function properties for attention_with_bounds
# Used 48 registers, 16384 bytes smem, 336 bytes cmem[0]
With --maxrregcount to force a register ceiling:
nvcc -arch=sm_90 --maxrregcount=32 kernel.cu -o kernel
This is a global override versus __launch_bounds__ which is per-kernel. For kernels in libraries where you do not control the compilation flags, __launch_bounds__ is the correct approach.
Occupancy Tuning Decision Flowchart
Production Engineering Notes
Profile First, Calculate Second
The occupancy calculator tells you theoretical occupancy. Nsight Compute tells you achieved occupancy - the fraction of cycles where the SM had active warps. These can differ significantly:
- Theoretical: what the hardware can hold
- Achieved: what was actually resident during profiling
If achieved occupancy is significantly below theoretical, the issue is usually kernel launch overhead (grid too small) or serialization within the kernel itself.
The Grid Size Trap
Occupancy analysis only matters when the grid is large enough to fill the GPU. If you launch 10 blocks on a 132-SM H100, most SMs are idle regardless of occupancy within those 10 blocks. The minimum grid size for full SM utilization is:

min_grid_size = num_SMs x target_blocks_per_SM
For H100 (132 SMs) targeting 4 blocks per SM (50% occupancy with 8 warps each): 528 blocks minimum.
Register Spilling vs Lower Occupancy
When __launch_bounds__ forces the compiler to reduce registers, it spills to local memory. Local memory accesses hit L1 then L2 then HBM - the same latency hierarchy as global memory. The question becomes:
- Cost of lower occupancy: more cycles stalled on memory per warp
- Cost of register spilling: extra L2/HBM accesses per warp when spilled values are accessed
If the kernel is already memory-bound, register spilling makes things significantly worse. If the kernel is arithmetic-bound, the extra memory traffic from spilling may be negligible. This is why occupancy tuning requires profiling alongside calculation.
Nsight Compute Key Metrics
When profiling occupancy in Nsight Compute, look at:
- sm__warps_active.avg.pct_of_peak_sustained_active - achieved occupancy
- launch__registers_per_thread - actual registers allocated
- launch__shared_mem_per_block_static - static shared memory per block
- sm__warps_eligible.avg.per_cycle_active - warps that could be scheduled each cycle
- smsp__issue_active.avg.pct_of_peak_sustained_active - warp issue occupancy
The difference between warps_active and warps_eligible tells you whether warps are stalling or just not resident.
Multi-Kernel Considerations
On multi-tenant deployments where multiple kernels run on the same GPU (common in inference servers), occupancy analysis becomes cooperative. A small high-priority kernel launched with a small grid may not fill the GPU. In MIG (Multi-Instance GPU) configurations on A100/H100, each instance has its own SM allocation and the occupancy calculation applies per instance.
Common Mistakes
:::danger Assuming Occupancy Scales Linearly with Performance The most common mistake is writing a benchmark that shows occupancy improvement and concluding the kernel is faster. Always measure actual FLOP/s or bandwidth, not occupancy. A 20% occupancy increase with no throughput change means a different bottleneck is limiting performance - and you have made your kernel harder to maintain for no gain. :::
:::danger Using --maxrregcount Globally
Applying --maxrregcount=32 to an entire compilation unit affects every kernel in that file, including kernels that might benefit from high register counts. Use __launch_bounds__ per-kernel so each kernel only pays the cost it needs to.
:::
:::warning Ignoring the Block Size Limiter Engineers often focus on the register and shared memory limiters and forget the block size constraint. Choosing a block size of 1024 threads (32 warps) means residency is quantized in steps of 32 warps: if registers or shared memory allow anything less than two full blocks, only one block fits and you are capped at 50% occupancy no matter how small the shortfall. Always include all three limiters in your analysis. :::
:::warning Optimizing Occupancy When Compute Bound If your kernel is bottlenecked by arithmetic throughput (FMA throughput, tensor core throughput) and the SM execution units are already saturated, increasing occupancy adds no value. Use Nsight Compute's roofline analysis to determine whether you are memory-bound or compute-bound before spending time on occupancy optimization. :::
:::warning Not Accounting for Static vs Dynamic Shared Memory
The CUDA runtime charges both static shared memory (declared with __shared__ in the kernel) and dynamic shared memory (passed as the third launch parameter) against the SM shared memory pool. Engineers often measure one and forget the other. Check launch__shared_mem_per_block_static and launch__shared_mem_per_block_dynamic separately in Nsight Compute.
:::
Interview Questions and Answers
Q1: Walk me through the calculation of theoretical occupancy for a kernel with 56 registers per thread, 24KB shared memory per block, and 128 threads per block on an H100.
The H100 SM has 65,536 registers, 100KB shared memory (default), 2048 max threads, 64 max warps, 32 max blocks.
Warps per block: 128 / 32 = 4 warps.

Register limiter: each warp needs 56 x 32 = 1,792 registers. Round up to the allocation granularity of 256: 1,792 (already aligned). Warps from registers: floor(65,536 / 1,792) = 36 warps (9 blocks of 4 warps).

Shared memory limiter: blocks per SM = floor(102,400 / 24,576) = 4 blocks. Warps: 4 x 4 = 16 warps.

Block size limiter: blocks from warp slots = 64 / 4 = 16. Blocks from thread limit = 2,048 / 128 = 16. Both give 16 blocks, warps = 16 x 4 = 64.

Active warps = min(36, 16, 64) = 16. Occupancy = 16 / 64 = 25%. The binding limiter is shared memory.
Q2: What does __launch_bounds__(128, 6) tell the CUDA compiler and what tradeoff does it impose?
It tells the compiler two things: (1) this kernel will never be launched with more than 128 threads per block, so the compiler can assume warp count is at most 4 per block, and (2) the application requires at least 6 blocks resident per SM simultaneously.
The register budget the compiler derives from the second argument: floor(65,536 / (6 x 128)) = 85 registers per thread maximum. If the kernel's natural register pressure is 90, the compiler will spill 5 registers to local memory. The tradeoff is increased memory traffic (spill/fill operations) in exchange for higher occupancy. Whether this is worthwhile depends entirely on whether occupancy was limiting performance.
Q3: Your kernel runs at 50% occupancy. How do you determine if this is causing a performance problem?
First, check the roofline. In Nsight Compute, look at the memory throughput percentage of peak. If you are at 80% of HBM bandwidth, the kernel is memory-bound and occupancy is almost certainly not the limiter - you are already extracting most of what the memory system can provide.
If memory throughput is low (say 20%), check if the issue is coalescing (see l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum vs l1tex__t_requests_pipe_lsu_mem_global_op_ld.sum - the ratio reveals coalescing efficiency). Poor coalescing at high occupancy is worse than good coalescing at low occupancy.
If coalescing is good and memory throughput is still low, then occupancy may genuinely be the limiter. The smoking gun is smsp__issue_active.avg.pct_of_peak_sustained_active below 50% - meaning the warp schedulers have no eligible warps to issue to the execution units over half the cycles.
Q4: Explain the occupancy cliff and give a concrete example.
Register allocation on NVIDIA GPUs happens in chunks per warp. On Hopper, the allocation granularity is 256 registers per warp. This means that going from 32 to 33 registers per thread bumps the raw per-warp allocation from 32 x 32 = 1,024 to 33 x 32 = 1,056, which rounds up to 1,280. The SM can fit 65,536 / 1,024 = 64 warps with 32 registers, but only 65,536 / 1,280 = 51 warps with 33 registers. One extra register costs 13 warp slots - a 20% occupancy drop.
This is why compiler version changes can silently regress kernel performance: a compiler optimization that saves one arithmetic instruction might use one extra register, crossing a cliff and tanking occupancy. Always check register counts when changing CUDA toolkit versions.
Q5: When should you NOT optimize for higher occupancy?
When the kernel is compute-bound. If tensor core throughput is saturated at 100% of peak FLOP/s, adding more warps just means more warps waiting to access the same execution units. The queuing latency increases with no throughput benefit.
Also when the kernel has a small grid. If you launch fewer blocks than SMs on the device, the SMs running no blocks are idle regardless of occupancy within the active blocks. Fix the grid size problem first.
And when the cost of achieving higher occupancy is register spilling. If hitting 75% occupancy requires spilling 15 registers per thread, and the memory-bound kernel already has HBM traffic as its bottleneck, you are adding more memory traffic to a memory-bottlenecked kernel. The break-even analysis almost always favors lower occupancy with no spilling in this case.
Q6: How does occupancy interact with the warp scheduler? Why do we need multiple active warps even if one warp has no data dependencies?
Each SM has 4 warp schedulers (on Ampere/Hopper), each capable of issuing one instruction per clock to one warp. A warp with no data dependencies and no stalls can issue one instruction per clock. If you have only 4 active warps and each is compute-bound, you theoretically achieve peak instruction throughput.
The problem is memory. Any global memory access stalls the warp for 200-600 cycles (L2 miss to HBM). During those cycles, the warp scheduler needs other warps to issue to. With only 4 warps and any memory traffic, most cycles have no issuable warp. The warp scheduler sits idle.
The latency hiding rule of thumb: the number of warps needed to hide N cycles of latency, when each warp can issue one instruction every M cycles, is N / M. For HBM at 400 cycles latency and 4-cycle instruction throughput: 400 / 4 = 100 warps needed per scheduler, or 400 warps total per SM - more than the maximum of 64. This is why HBM latency cannot be fully hidden by thread-level parallelism alone; it is one reason memory bandwidth (not latency) dominates performance modeling.
Q7: A kernel is register-limited at 42% occupancy. You add __launch_bounds__(256, 6) and occupancy rises to 62%, but throughput drops 8%. What happened?
The __launch_bounds__ forced the compiler to reduce registers per thread to meet the 6-block minimum. The register budget constraint: 65,536 / (6 x 256) = 42 registers per thread (rounded down). If the natural register pressure was 55, the compiler spilled 13 registers to local memory.
Every spilled register becomes a local memory read or write. Local memory accesses map to L1 cache first, then L2, then HBM - the same latency hierarchy as global memory, just with thread-local addressing. If the kernel accesses spilled registers in a hot loop, the extra memory traffic outweighs the occupancy gain.
The fix: quantify spill traffic with Nsight Compute (ncu --metrics l1tex__t_sectors_pipe_lsu_mem_local_op_ld.sum,l1tex__t_sectors_pipe_lsu_mem_local_op_st.sum), since the legacy nvprof tool does not support Ampere-class GPUs. If spill traffic is significant, try a compromise - __launch_bounds__(256, 5) gives a looser register budget and may achieve 50% occupancy without triggering spilling. Or restructure the kernel's hot loop to reduce live register count by breaking it into smaller scopes.
Q8: How do you empirically find the optimal block size when the occupancy calculation is inconclusive?
The occupancy calculation gives you a ranked list of block sizes by theoretical occupancy. When multiple sizes have identical or similar theoretical occupancy, or when you suspect occupancy is not the binding constraint, empirical search is the answer.
Write a sweep benchmark:
```cpp
int block_sizes[] = {32, 64, 96, 128, 192, 256, 320, 384, 512};
int n_sizes = sizeof(block_sizes) / sizeof(block_sizes[0]);
for (int i = 0; i < n_sizes; i++) {
    int bs = block_sizes[i];
    int grid = (N + bs - 1) / bs;
    // Warmup: absorb clock ramp-up and cache-cold effects
    for (int w = 0; w < 3; w++)
        my_kernel<<<grid, bs>>>(args...);
    cudaDeviceSynchronize();
    // Timed run: 50 launches bracketed by CUDA events
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    cudaEventRecord(t0);
    for (int r = 0; r < 50; r++)
        my_kernel<<<grid, bs>>>(args...);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    float ms;
    cudaEventElapsedTime(&ms, t0, t1);
    printf("block_size=%d: %.3f ms/iter\n", bs, ms / 50.0f);
    cudaEventDestroy(t0); cudaEventDestroy(t1);
}
```
Run this sweep on the target hardware and pick the best empirical result. Cache effects, instruction pipeline alignment, and warp scheduler behavior can all produce non-monotonic throughput curves that purely theoretical analysis misses.
Architecture Comparison - H100 vs A100 vs V100
Understanding how the resource limits changed across generations helps calibrate your intuition:
| Resource | V100 (Volta) | A100 (Ampere) | H100 (Hopper) |
|---|---|---|---|
| Registers/SM | 65,536 | 65,536 | 65,536 |
| Max threads/SM | 2,048 | 2,048 | 2,048 |
| Max warps/SM | 64 | 64 | 64 |
| Max blocks/SM | 32 | 32 | 32 |
| Shared mem/SM (default) | 96KB | 96KB | 100KB |
| Shared mem/SM (max) | 96KB | 164KB | 228KB |
| SMs total | 80 | 108 | 132 |
| Register alloc granularity | 256 | 256 | 256 |
The occupancy calculation formula is identical across all three generations. The hardware limits are nearly the same for registers, warps, threads, and blocks. The main difference is the expanded shared memory on Hopper (228KB configurable vs 96KB on Volta) - this changes the shared memory limiter calculation for large-tile kernels.
One practical consequence: a kernel tuned for V100 occupancy will have nearly identical theoretical occupancy on A100 and H100, because the SM resource limits are the same. The throughput difference across generations comes from SM count, clock speed, tensor core capabilities, and memory bandwidth - not from occupancy calculation changes.
Summary
Occupancy is the ratio of active warps to maximum warps on an SM. It is determined by three independent limiters - registers, shared memory, and block size - and the binding limiter sets the ceiling. The calculation is deterministic: run the formulas, find the minimum, and you know your theoretical occupancy before running a single profiling session.
The three key actions for occupancy optimization are: (1) use __launch_bounds__ to guide register allocation toward your target occupancy, (2) balance tile size against shared memory occupancy cost, and (3) choose block sizes that align well with the warp and block slot counts. The occupancy calculator code in this lesson runs the full analysis in under 1ms for any configuration.
Most importantly: 100% occupancy is not the goal. The goal is enough occupancy to hide latency while maximizing compute throughput. For memory-bound kernels, 25-50% occupancy is typically sufficient. Always verify with Nsight Compute that occupancy improvement translates to throughput improvement before treating it as a solved problem.
