Warp Divergence and Control Flow
Reading time: ~40 min - Interview relevance: Very High - Target roles: CUDA Developer, Kernel Engineer, ML Systems Engineer
Warp divergence does not fail loudly. It does not produce wrong answers. It simply makes your kernel slower in ways that are invisible until you look at the right profiler counter. A sparse attention kernel running at 12% efficiency is not broken - it is diverging.
The Sparse Attention Disaster
A team was implementing sparse attention for long-context inference. The standard attention mask checks were straightforward:
// Sparse attention: each query only attends to certain keys
__global__ void sparse_attention_kernel(
const float* Q, const float* K, const float* V,
const int* mask, // 1 = attend, 0 = skip
float* output,
int seq_len
) {
int q_idx = blockIdx.x;
int k_idx = threadIdx.x;
float score = 0.0f;
if (mask[q_idx * seq_len + k_idx] == 1) {
// Active thread: compute attention score
score = dot_product(Q + q_idx * HEAD_DIM,
K + k_idx * HEAD_DIM);
score = expf(score);
} else {
// Masked thread: skip computation
score = 0.0f;
}
// ... reduction and output
}
The kernel looked correct. It produced the right outputs. But Nsight Compute showed 12% warp execution efficiency - meaning for every clock cycle where the GPU was theoretically computing, 88% of the work capacity was sitting idle.
The problem was the mask. In sparse attention with 10% density, 90% of the threads in any given warp were taking the else branch while 10% were taking the if branch. The GPU had to execute both paths sequentially, masking off inactive threads each time. A kernel that could have been 10x faster than dense attention was slower than dense attention.
This is warp divergence. It is one of the most common reasons GPU kernels underperform, and it is completely invisible without profiler counters.
Why This Exists: The SIMT Contract
What SIMT Promises
SIMT stands for Single Instruction, Multiple Threads. It is the execution model that makes GPUs fast for parallel workloads. The fundamental promise is: at each clock cycle, one instruction is broadcast to all 32 threads in a warp, and all 32 threads execute it simultaneously.
This is different from SIMD (Single Instruction, Multiple Data) in CPUs. In SIMD, you explicitly vectorize your code: __m256 result = _mm256_add_ps(a, b) adds 8 floats in parallel. In SIMT, you write scalar code and the hardware executes it in parallel across 32 threads, each with its own register file.
The hardware abstraction is elegant: you write:
__global__ void add_kernel(float* a, float* b, float* c, int N) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < N) c[idx] = a[idx] + b[idx];
}
And the GPU executes this for all 32 threads in a warp simultaneously. Each thread has its own idx, its own registers, its own memory load, but they all execute the same sequence of instructions at the same cycle.
What SIMT Cannot Promise
The SIMT model breaks down when threads in a warp take different paths through the code. The hardware cannot execute two different instructions simultaneously on the same 32 execution units. When threads diverge, the hardware must serialize.
flowchart TD
A["Warp: 32 threads<br/>start together"]:::blue
B["if (condition)"]:::blue
C["TRUE path<br/>Threads 0-7 active<br/>Threads 8-31 MASKED"]:::orange
D["FALSE path<br/>Threads 8-31 active<br/>Threads 0-7 MASKED"]:::orange
E["Reconvergence point<br/>All 32 threads active again"]:::green
A --> B
B --> C
C --> D
D --> E
classDef blue fill:#dbeafe,color:#1e293b,stroke:#2563eb
classDef green fill:#dcfce7,color:#14532d,stroke:#16a34a
classDef orange fill:#ffedd5,color:#7c2d12,stroke:#ea580c
During the TRUE path execution, threads 8-31 are masked off - their results are discarded. During the FALSE path, threads 0-7 are masked. The total time is the sum of both paths, not the maximum. This is the serialization that kills performance.
Historical Context
The warp execution model has been part of NVIDIA GPUs since Tesla (2006). Early GPU programmers in CUDA's first years came from CPU backgrounds and instinctively wrote code with branches. Performance was often 5-10x below theoretical peak and nobody could explain why.
NVIDIA added the first divergence profiling counters in the Fermi architecture (2010). For the first time, engineers could see Warp Execution Efficiency in the profiler - a number below 100% meant divergence was happening. Suddenly, "the kernel is slow" became "this specific branch is causing 40% divergence."
By the Maxwell and Pascal eras (2014-2016), deep learning researchers started writing custom attention kernels for NLP. These kernels were full of masking logic (padding masks, causal masks, attention masks) and divergence was rampant. The community developed a vocabulary around it: "warp-uniform branches," "predicated execution," "sorting for divergence reduction."
The introduction of warp-level primitives (__ballot_sync, __any_sync, __shfl_sync in CUDA 9, 2017) gave programmers hardware-level tools to work with warp state directly. These primitives enabled efficient reductions and broadcasts without shared memory, and made it possible to make decisions based on what all threads in a warp are doing.
Today, warp divergence awareness is table stakes for any kernel that touches masking, sparsity, variable-length inputs, or conditional computation.
The Mechanics of Divergence
Warp Execution Under Divergence
Modern NVIDIA GPUs track per-thread branch state with predicate registers and an active mask for each warp. When a warp encounters a branch:
- The warp scheduler evaluates the branch condition for all 32 threads simultaneously
- It records which threads take the TRUE path and which take the FALSE path in a divergence mask
- It executes the TRUE path for all threads, with FALSE-path threads masked off (their write operations are suppressed)
- It then executes the FALSE path for all threads, with TRUE-path threads masked off
- At the reconvergence point (the join after the if-else), all 32 threads are active again
The cost is additive, not parallel. A branch where 16 threads go TRUE and 16 go FALSE takes twice as long as if all threads agreed.
Measuring Divergence Cost
For a branch with $k$ distinct paths taken by different threads in a warp, the execution time of that section is:

$$T_{\text{section}} = \sum_{i=1}^{k} T_i$$

where $T_i$ is the execution time of path $i$. For $k = 2$ (a standard if-else with a 50/50 split), you pay the cost of both paths even though only one applies to each thread.
Warp Execution Efficiency (WEE) in Nsight Compute is defined as:

$$\text{WEE} = \frac{\text{average active threads per executed warp instruction}}{32} \times 100\%$$

A WEE of 50% means half your warp slots are being wasted to divergence (and/or predicated-off lanes). For the sparse attention case with 10% density, you approach 10% WEE in the worst case.
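That worst case is easy to derive: with a uniformly random 10% mask, the expected number of active lanes per warp is

$$\text{avg active lanes} = 0.10 \times 32 = 3.2 \qquad\Rightarrow\qquad \text{WEE} = \frac{3.2}{32} \times 100\% = 10\%$$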
The Three Divergence Patterns
Understanding divergence requires distinguishing between cases where it matters and cases where it does not.
Pattern 1: Divergence Within Every Warp (Worst Case)
// threadIdx.x ranges 0-255 (8 warps of 32 threads each)
// Warp 0: threads 0-31. ALL warps have mixed even/odd threads.
if (threadIdx.x % 2 == 0) {
// Even threads: 16 threads per warp take this path
result = compute_A();
} else {
// Odd threads: 16 threads per warp take this path
result = compute_B();
}
Every single warp contains both even and odd threads. Every single warp diverges. Both paths execute sequentially for every warp. This is maximum divergence.
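To see this cost directly, a minimal microbenchmark can time a deliberately divergent kernel against a uniform one doing the same per-thread work. This is a sketch (all names here are ours, and exact numbers vary by GPU); on most hardware the divergent version should take roughly the sum of both paths, i.e., close to 2x:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Two distinct, long-running paths. __noinline__ plus different arithmetic
// keeps the compiler from merging or predicating them away.
__device__ __noinline__ float path_a(float x) {
    for (int i = 0; i < 256; ++i) x = x * 1.0001f + 0.1f;
    return x;
}

__device__ __noinline__ float path_b(float x) {
    for (int i = 0; i < 256; ++i) x = x * 0.9999f - 0.1f;
    return x;
}

// Every warp splits 16/16: maximum divergence
__global__ void divergent_kernel(float* out) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float x = (float)idx;
    if (threadIdx.x % 2 == 0) x = path_a(x);
    else                      x = path_b(x);
    out[idx] = x;
}

// Every warp agrees: zero divergence, same per-thread work
__global__ void uniform_kernel(float* out) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    out[idx] = path_a((float)idx);
}

int main() {
    const int N = 1 << 22;
    float* d_out;
    cudaMalloc(&d_out, N * sizeof(float));

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    float ms;

    divergent_kernel<<<N / 256, 256>>>(d_out); // warmup
    cudaEventRecord(t0);
    divergent_kernel<<<N / 256, 256>>>(d_out);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    cudaEventElapsedTime(&ms, t0, t1);
    printf("divergent: %.3f ms\n", ms);

    uniform_kernel<<<N / 256, 256>>>(d_out); // warmup
    cudaEventRecord(t0);
    uniform_kernel<<<N / 256, 256>>>(d_out);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    cudaEventElapsedTime(&ms, t0, t1);
    printf("uniform:   %.3f ms\n", ms);

    cudaFree(d_out);
    return 0;
}
```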
Pattern 2: Divergence Only at the Boundary (Acceptable)
// threadIdx.x ranges 0-255 (8 warps)
// Warps 0-3: threads 0-127, ALL take the if-branch (no divergence)
// Warp 4: threads 128-159, threads 128-143 take if, 144-159 take else (diverges)
// Warps 5-7: threads 160-255, ALL take else (no divergence)
if (threadIdx.x < 144) {
result = compute_A();
} else {
result = compute_B();
}
Only warp 4 (the boundary) diverges. 7 out of 8 warps execute without divergence. This is the pattern you want: minimize the number of warps that diverge by aligning branch boundaries to warp boundaries.
Pattern 3: No Divergence (Ideal)
// All threads in all warps make the same decision
if (blockIdx.x == 0) {
// Entire block 0 takes this path
result = compute_special_case();
} else {
// All other blocks take this path
result = compute_general_case();
}
blockIdx.x is the same for all threads in a block, which means it is the same for all threads in every warp within that block. All threads in a warp always agree. Zero divergence.
The key insight: divergence only matters within a warp, never between warps. Different warps can take completely different paths with zero performance penalty. Divergence is a warp-level concept.
Predicated Execution: When Branches Disappear
Not all branches cause divergence in the traditional sense. The NVIDIA compiler is smart enough to recognize simple branches and convert them to predicated instructions - instructions that execute but conditionally write their results.
// This branch looks like it could cause divergence
float result;
if (x > 0.0f) {
result = x;
} else {
result = 0.0f;
}
The compiler may generate this as:
// PTX pseudocode
FSETP.GT.AND P0, PT, x, 0.0, PT; // P0 = (x > 0.0)
SEL result, x, 0.0, P0; // result = P0 ? x : 0.0
SEL (select) is a single instruction that conditionally selects between two values. There is no branch. There is no divergence. Both inputs are computed (or available), and the instruction picks one based on the predicate. This is called predicated execution or branch predication.
The compiler applies predication when:
- Both branches are short (typically under 4 instructions each)
- Neither branch has side effects (no memory stores)
- The compiler determines that executing both paths is cheaper than a branch
You can force the compiler toward predication:
// Explicit conditional move - compiler almost always predicates this
float relu_result = fmaxf(x, 0.0f); // No branch, no divergence
float abs_result = fabsf(x); // Same
For longer branches, the compiler emits an actual branch instruction and divergence applies.
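To check which form the compiler actually chose, inspect the SASS. A minimal check (architecture flag and file names are placeholders for your setup):

```bash
# Compile to a cubin, then disassemble and look at the control flow
nvcc -arch=sm_80 -cubin -o kernel.cubin kernel.cu
cuobjdump --dump-sass kernel.cubin | grep -E "SEL|BRA|@P"
# SEL / @P0-predicated instructions -> the branch was predicated away
# BRA to a label                    -> a real branch; divergence is possible
```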
Warp Voting: Making Collective Decisions
CUDA provides intrinsics that let you query the state of all threads in a warp simultaneously. These are the warp voting functions.
// All warp voting functions require a mask argument (which threads participate)
// Use 0xffffffff for all 32 threads in the warp
// __all_sync: returns true if ALL active threads have condition == true
bool all_positive = __all_sync(0xffffffff, x > 0.0f);
// __any_sync: returns true if ANY active thread has condition == true
bool any_negative = __any_sync(0xffffffff, x < 0.0f);
// __ballot_sync: returns a 32-bit mask where bit i is set if thread i's condition is true
unsigned int active_mask = __ballot_sync(0xffffffff, should_compute);
int active_count = __popc(active_mask); // popcount = number of active threads
These intrinsics compile to single hardware instructions (VOTE.SYNC). They are extremely cheap - far faster than any shared-memory round trip.
Using Warp Voting to Skip Divergent Work
__global__ void conditional_kernel(float* data, int* flags, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // No early return here: every lane must reach the __any_sync below,
    // or the 0xffffffff mask would lie about who is participating
    bool in_bounds = (idx < N);
    // Check if any thread in this warp has work to do
    bool has_work = in_bounds && (flags[idx] == 1);
    bool any_has_work = __any_sync(0xffffffff, has_work);
    // If NO thread in this warp has work, skip the expensive computation
    // This avoids divergence when most warps are entirely idle
    if (!any_has_work) return;
    // Some threads in this warp have work - proceed with computation
    if (has_work) {
        data[idx] = expensive_compute(data[idx]);
    }
}
The outer if (!any_has_work) return is warp-uniform: all 32 threads in the warp agree on the answer (since __any_sync returns the same value to all threads). No divergence for warps where everyone is idle. The inner if (has_work) may diverge, but only in warps where at least some threads are active.
Warp Shuffle: Communication Without Shared Memory
Warp shuffle intrinsics allow threads within a warp to read each other's registers directly. This is faster than shared memory for warp-level operations: no bank conflicts, no synchronization barrier needed within the warp.
// __shfl_sync: thread reads a value from any other thread in the warp
// __shfl_up_sync: thread reads from thread with lower threadIdx
// __shfl_down_sync: thread reads from thread with higher threadIdx
// __shfl_xor_sync: thread reads from thread with XOR'd threadIdx
// All variants:
// T __shfl_*_sync(unsigned mask, T var, int src_lane, int width=32)
// mask: which threads participate (0xffffffff for all)
// var: the value THIS thread contributes
// return: the value from the SOURCE thread
Full Warp Reduction Using Shuffle
This is the canonical use case for warp shuffle. Summing 32 values in parallel using a shuffle-based tree reduction:
__device__ float warp_reduce_sum(float val) {
// Unroll the tree reduction manually
// Each iteration halves the active stride
// After 5 iterations (log2(32) = 5), thread 0 has the total sum
val += __shfl_down_sync(0xffffffff, val, 16); // stride 16
val += __shfl_down_sync(0xffffffff, val, 8); // stride 8
val += __shfl_down_sync(0xffffffff, val, 4); // stride 4
val += __shfl_down_sync(0xffffffff, val, 2); // stride 2
val += __shfl_down_sync(0xffffffff, val, 1); // stride 1
// Thread 0 now holds the sum of all 32 threads' values
return val;
}
// __shfl_down_sync(mask, val, delta):
// Each thread with lane_id < (32 - delta) reads from thread (lane_id + delta)
// Threads with lane_id >= (32 - delta) receive their own val unchanged
Visualizing one step (__shfl_down_sync(0xffffffff, val, 16)):
Thread: 0 1 2 ... 15 16 17 18 ... 31
Before: v0 v1 v2 ... v15 v16 v17 v18 ... v31
After shfl_down(16):
Thread 0 reads thread 16's value: val[0] += val[16]
Thread 1 reads thread 17's value: val[1] += val[17]
...
Thread 15 reads thread 31's value: val[15] += val[31]
Threads 16-31: no source, keep their own value (but don't matter)
Result: val[0] = v0+v16, val[1] = v1+v17, ..., val[15] = v15+v31
After 5 such steps, thread 0 holds the sum of all 32 values.
Complete Parallel Reduction Kernel
#include <cuda_runtime.h>
// Warp-level sum reduction (single warp's worth of data)
__device__ float warp_reduce_sum(float val) {
val += __shfl_down_sync(0xffffffff, val, 16);
val += __shfl_down_sync(0xffffffff, val, 8);
val += __shfl_down_sync(0xffffffff, val, 4);
val += __shfl_down_sync(0xffffffff, val, 2);
val += __shfl_down_sync(0xffffffff, val, 1);
return val; // Only thread 0 of each warp has the correct sum
}
// Block-level sum reduction (handles multiple warps)
__device__ float block_reduce_sum(float val) {
// Shared memory: one slot per warp (max 32 warps per block)
__shared__ float warp_sums[32];
int lane = threadIdx.x % 32; // Lane within this warp
int warp_id = threadIdx.x / 32; // Which warp in this block
// Step 1: reduce within each warp
val = warp_reduce_sum(val);
// Step 2: warp 0's lane 0 gets each warp's partial sum
if (lane == 0) warp_sums[warp_id] = val;
__syncthreads(); // All warps must finish before warp 0 reads
// Step 3: warp 0 reduces the warp sums
int num_warps = blockDim.x / 32;
if (warp_id == 0) {
// Load this warp's assigned partial sum (or 0 if out of range)
val = (lane < num_warps) ? warp_sums[lane] : 0.0f;
val = warp_reduce_sum(val); // Final reduction
}
return val; // Only thread 0 has the correct total sum
}
// Full reduction kernel
__global__ void reduce_sum_kernel(
const float* __restrict__ input,
float* __restrict__ output,
int N
) {
float partial_sum = 0.0f;
// Grid-stride loop: each thread accumulates multiple elements
for (int idx = blockIdx.x * blockDim.x + threadIdx.x;
idx < N;
idx += gridDim.x * blockDim.x) {
partial_sum += input[idx];
}
// Reduce within block
float block_sum = block_reduce_sum(partial_sum);
// Thread 0 writes this block's result
if (threadIdx.x == 0) {
atomicAdd(output, block_sum);
}
}
This kernel has zero divergence in the main computation loop (the grid-stride loop). The only potential divergence is in the final if (threadIdx.x == 0) - a single instruction, negligible cost.
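A minimal host-side driver for this kernel might look like the following (a sketch; grid size and error handling are deliberately simple, and the output scalar must be zeroed before launch because the kernel accumulates with atomicAdd):

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Assumes reduce_sum_kernel and its helpers from above are in this file.
int main() {
    const int N = 1 << 20;
    std::vector<float> h_in(N, 1.0f);    // all ones, so the expected sum is N

    float *d_in, *d_out;
    cudaMalloc(&d_in, N * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in.data(), N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, sizeof(float)); // kernel accumulates via atomicAdd

    // Modest grid; the grid-stride loop inside the kernel covers all of N
    reduce_sum_kernel<<<1024, 256>>>(d_in, d_out, N);

    float h_out = 0.0f;
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("sum = %.0f (expected %d)\n", h_out, N);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```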
Why No Divergence in This Reduction?
Compare to the naive shared-memory reduction with divergence:
// NAIVE - diverges on every iteration
__global__ void naive_reduce(float* data, int N) {
__shared__ float sdata[256];
int tid = threadIdx.x;
sdata[tid] = data[blockIdx.x * 256 + tid];
__syncthreads();
for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
if (tid < stride) { // <-- diverges only once stride < 32 (see below)
sdata[tid] += sdata[tid + stride];
}
__syncthreads();
}
}
For strides of 32 and above, the branch boundary is warp-aligned: at stride=128, warps 0-3 are fully active and warps 4-7 fully idle, so no warp actually diverges (though the idle warps still waste scheduler slots). The divergence arrives in the last five iterations, when stride drops below 32 and the single remaining active warp splits on every step.
In the shuffle-based reduction, there are no if-statements in the hot path. The __shfl_down_sync instruction itself handles the masking internally without creating warp divergence.
Restructuring Code to Eliminate Divergence
Technique 1: Sort Inputs by Branch Path
If your kernel processes elements with different code paths depending on element type, sort the inputs first so that elements taking the same path are contiguous. Then different warps take entirely different branches (warp-uniform) with no divergence.
// Before: elements interleaved, divergence in every warp
// [typeA, typeB, typeA, typeB, typeA, typeB, ...]
// After: sorted by type, warps are uniform
// [typeA, typeA, typeA, ..., typeB, typeB, typeB, ...]
// Warp 0-N: all typeA, takes if-branch uniformly (no divergence)
// Warp N+1-M: all typeB, takes else-branch uniformly (no divergence)
__global__ void process_sorted(Element* elements, int N_A, int N) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx >= N) return;
if (idx < N_A) {
// All threads in this warp have idx < N_A (if sorted)
process_type_A(elements[idx]);
} else {
process_type_B(elements[idx]);
}
}
This technique is used in MoE (Mixture of Experts) routing: tokens are sorted by which expert they are assigned to before being dispatched, so each expert's computation runs on contiguous, warp-uniform data.
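The sort itself rarely needs a custom kernel; in CUDA it is commonly done with Thrust. A sketch assuming each element carries an integer type key (type_keys and payloads are hypothetical names for your data):

```cuda
#include <thrust/device_vector.h>
#include <thrust/sort.h>

// Reorder payloads so same-type elements land contiguously in memory.
// Afterwards, per-type branches in the processing kernel are warp-uniform
// everywhere except at the (at most one) boundary warp per type.
void sort_by_type(thrust::device_vector<int>& type_keys,
                  thrust::device_vector<float>& payloads) {
    thrust::sort_by_key(type_keys.begin(), type_keys.end(), payloads.begin());
}
```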
Technique 2: Use Masking Instead of Branching
Instead of branching to skip computation, compute unconditionally and zero out masked results:
// Divergent version
if (mask[idx]) {
result = expensive_compute(data[idx]);
output[idx] = result;
}
// else: output[idx] unchanged (or implicitly zero from initialization)
// Potentially better: predicated write
float result = expensive_compute(data[idx]); // Compute for everyone
if (mask[idx]) {
output[idx] = result; // Only active threads write
}
Whether this helps depends on the cost of expensive_compute. If it is cheap (a few arithmetic ops), predicated execution may eliminate divergence entirely. If it is expensive (many ops), you are wasting work on masked threads.
Technique 3: Warp-Level Early Exit
__global__ void kernel_with_early_exit(float* data, int* flags, int N) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
bool active = (idx < N) && (flags[idx] != 0);
// Check if any thread in this warp is active
// __any_sync returns the same value to ALL threads in the warp
// So this branch is WARP-UNIFORM - zero divergence
if (!__any_sync(0xffffffff, active)) {
return; // Entire warp skips - no divergence
}
// Some threads are active - process them
if (active) {
// This inner branch may diverge, but only for
// warps that have at least one active thread
data[idx] = process(data[idx]);
}
}
Technique 4: Separate Kernels for Separate Paths
If two code paths are long and complex, sometimes the right answer is two kernel launches:
// Instead of one kernel with a big if-else
// Launch two specialized kernels
// Kernel 1: all type-A elements
process_type_A_kernel<<<gridA, block>>>(type_A_elements, N_A);
// Kernel 2: all type-B elements
process_type_B_kernel<<<gridB, block>>>(type_B_elements, N_B);
Two smaller kernel launches are often faster than one large divergent kernel. The overhead of an extra kernel launch is typically under 10 microseconds - a bargain if it eliminates 40% warp divergence.
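If the two kernels are independent, they can also be issued on separate streams so the launches overlap (a sketch, reusing the kernel and variable names from the snippet above):

```cuda
cudaStream_t sA, sB;
cudaStreamCreate(&sA);
cudaStreamCreate(&sB);

// Both kernels are warp-uniform internally; overlapping the launches
// hides most of the extra launch overhead.
process_type_A_kernel<<<gridA, block, 0, sA>>>(type_A_elements, N_A);
process_type_B_kernel<<<gridB, block, 0, sB>>>(type_B_elements, N_B);

cudaStreamSynchronize(sA);
cudaStreamSynchronize(sB);
cudaStreamDestroy(sA);
cudaStreamDestroy(sB);
```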
The Full Warp Primitives Reference
// ---- WARP VOTING ----
// Returns true if ALL participating threads have condition == true
bool all = __all_sync(unsigned mask, int predicate);
// Returns true if ANY participating thread has condition == true
bool any = __any_sync(unsigned mask, int predicate);
// Returns 32-bit mask where bit i is set if thread i's predicate is true
unsigned ballot = __ballot_sync(unsigned mask, int predicate);
// ---- WARP SHUFFLE ----
// Read from arbitrary lane
T val = __shfl_sync(unsigned mask, T var, int src_lane, int width=32);
// Read from lane (lane_id - delta); the lowest delta lanes have no source
// and keep their own value (no wraparound)
T val = __shfl_up_sync(unsigned mask, T var, unsigned delta, int width=32);
// Read from lane (lane_id + delta); the highest delta lanes have no source
// and keep their own value (no wraparound)
T val = __shfl_down_sync(unsigned mask, T var, unsigned delta, int width=32);
// Read from lane (lane_id XOR lane_mask)
T val = __shfl_xor_sync(unsigned mask, T var, int lane_mask, int width=32);
// ---- MATCH OPERATIONS (Volta+) ----
// Returns mask of threads with same value
unsigned match_any = __match_any_sync(unsigned mask, T value);
unsigned match_all = __match_all_sync(unsigned mask, T value, int *pred);
The mask parameter specifies which threads participate. Use 0xffffffff for full warp participation. For partial warps (e.g., boundary conditions), use __ballot_sync(0xffffffff, idx < N) to compute the active mask first.
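An alternative that sidesteps partial-mask shuffle subtleties entirely is to keep all 32 lanes alive and have out-of-range lanes contribute a neutral value. A boundary-safe warp sum in that style (a sketch; the kernel name is ours):

```cuda
// Every lane stays alive and out-of-range lanes contribute 0.0f,
// so the full-warp mask 0xffffffff remains valid for every shuffle step.
__global__ void boundary_safe_sum(const float* in, float* out, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float val = (idx < N) ? in[idx] : 0.0f;  // neutral element, no early return
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);
    if ((threadIdx.x & 31) == 0)             // lane 0 of each warp
        atomicAdd(out, val);
}
```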
Profiling Divergence with Nsight
The key counters to watch in Nsight Compute:
| Counter | Meaning | Target |
|---|---|---|
| `sm__warps_active.avg.pct_of_peak_sustained_active` | Overall warp occupancy | >50% |
| `smsp__thread_inst_executed_pred_on_sum` | Instructions executed with active predicate | Close to total |
| `smsp__inst_executed_per_warp.avg` | Avg instructions per warp launch | Should be stable |
| `l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum` | Global memory load sectors | Compare to theoretical |
The most direct divergence indicator is comparing thread_inst_executed (sum of instructions executed by all threads) to inst_executed * 32 (instructions executed assuming all 32 threads were active every cycle). A large gap indicates divergence.
Production Engineering Notes
The Boundary Divergence Trap
Almost every kernel has one unavoidable divergence point: the bounds check.
if (idx >= N) return;
For the last block, some threads are in-bounds and some are out-of-bounds. This creates divergence. Should you eliminate it?
Usually no. The last block diverges once. All other blocks have zero divergence from this check. The cost is one divergent warp per kernel launch - microseconds, not milliseconds. Spend your optimization effort on divergence in the main computation loop, not at the boundary.
The exception is if your block size is very small or N is small (many blocks with boundary divergence). In that case, pad your input to the next multiple of block size to eliminate all boundary divergence.
Use the _sync Warp Intrinsics (CUDA 9+, Mandatory on Volta and Later)
The synchronized shuffle variants (__shfl_down_sync etc.) arrived in CUDA 9 (2017) alongside Volta. The older unsynchronized variants (__shfl_down) still compile but are deprecated, and on Volta and later (compute capability 7.0+) they can silently produce incorrect results: independent thread scheduling means threads in a warp are no longer guaranteed to execute in lockstep, so a shuffle without an explicit sync point may read stale values. Modern ML training runs on Volta or later, so always use the _sync forms.
Cooperative Groups for More Flexible Synchronization
CUDA 9 introduced Cooperative Groups, which allow you to create flexible synchronization scopes smaller than a block:
#include <cooperative_groups.h>
namespace cg = cooperative_groups;
__global__ void cg_reduce(float* data, float* result, int N) {
// Create a warp-level group
auto warp = cg::tiled_partition<32>(cg::this_thread_block());
int idx = blockIdx.x * blockDim.x + threadIdx.x;
float val = (idx < N) ? data[idx] : 0.0f;
// Reduce within warp using cooperative groups
for (int delta = 16; delta > 0; delta /= 2) {
val += warp.shfl_down(val, delta);
}
if (warp.thread_rank() == 0) {
atomicAdd(result, val);
}
}
Cooperative groups are more readable than raw shuffle intrinsics and the compiler can reason about them more effectively for optimization.
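CUDA 11 added cooperative-group reductions that collapse the shuffle loop into a single call. A sketch assuming CUDA 11+ headers are available:

```cuda
#include <cooperative_groups.h>
#include <cooperative_groups/reduce.h>
namespace cg = cooperative_groups;

__device__ float warp_sum_cg(float val) {
    auto warp = cg::tiled_partition<32>(cg::this_thread_block());
    // cg::reduce returns the reduction result to every thread in the tile
    return cg::reduce(warp, val, cg::plus<float>());
}
```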
Common Mistakes
:::danger Divergence from threadIdx.x % 2
The classic divergence trap. Any condition based on threadIdx.x % N with N between 2 and 31 varies within every warp, so every single warp diverges. This is the worst possible case.
// WRONG: every warp diverges
if (threadIdx.x % 2 == 0) {
result = compute_even(x);
} else {
result = compute_odd(x);
}
// RIGHT: if you need even/odd processing, use different blocks
// or restructure the computation to avoid the branch entirely
:::
:::danger Forgetting the Mask in Warp Intrinsics
All modern CUDA warp intrinsics require an explicit participation mask. Using 0xffffffff is correct for full warps, but using it in a context where some threads are inactive (e.g., after a return or after divergence) produces undefined behavior.
// WRONG: some lanes may have returned early, but 0xffffffff claims all 32 participate
if (idx >= N) return;
float sum = __shfl_down_sync(0xffffffff, val, 16); // Undefined if any lane exited
// RIGHT: compute the mask of active lanes BEFORE any lane can exit
unsigned active_mask = __ballot_sync(0xffffffff, idx < N);
if (idx >= N) return;
float sum = __shfl_down_sync(active_mask, val, 16); // Only surviving lanes participate
Even the corrected version has a subtlety: a lane whose shuffle source has exited reads an undefined value. When possible, prefer keeping all 32 lanes alive and feeding a neutral value to out-of-range lanes, as in the boundary-safe reduction shown earlier.
:::
:::warning Divergence You Cannot Avoid vs. Divergence You Can
Not all divergence is equally worth fixing. The boundary check if (idx >= N) return affects one warp per kernel launch. A conditional in the main compute loop affects every warp on every iteration. Profile to find where your divergence is actually occurring before spending time restructuring.
:::
:::warning Sorting Has Its Own Cost
Sorting inputs to eliminate divergence is a valid technique, but sorting is not free. For small inputs (under 1M elements), the sorting cost may exceed the divergence cost. Profile both approaches. For large inputs with high divergence (like MoE routing), sorting almost always wins.
:::
Interview Questions and Answers
Q1: What happens at the hardware level when threads in a warp take different branches?
At the hardware level, the warp scheduler detects that threads disagree on the branch condition. It maintains a divergence mask for the warp. It executes the TRUE path with threads on that path unmasked and threads on the FALSE path masked (their writes are suppressed). Then it executes the FALSE path with the masks inverted. Execution reconverges at the first post-dominator instruction (the point in the control flow graph that all paths must pass through). The cost is the sum of both path execution times rather than the maximum. For a 50/50 split where both paths have the same cost, divergence doubles that section's execution time.
Q2: What is predicated execution and when does the compiler apply it?
Predicated execution converts a branch into a predicate register plus conditional instructions. Instead of branching, both paths are executed for all threads, but only one path's results are committed. The compiler applies predication when both branch bodies are short (typically under 4-8 instructions), neither contains memory stores (or the stores can be predicated), and the overhead of executing both paths is less than the overhead of a branch instruction itself. Predicated execution eliminates divergence entirely because all threads execute the same instruction sequence. You can encourage predication by using intrinsics like fmaxf, fminf, fabsf, and by writing simple conditional assignments rather than complex branched code.
Q3: Calculate the warp execution efficiency for a kernel where each warp has 8 active threads and 24 inactive due to masking.
WEE = Active threads / Warp size = 8 / 32 = 25%. In the profiler, this would show as 25% warp execution efficiency for that section. This means 75% of the GPU's execution capacity is being wasted. In terms of actual throughput, you are getting roughly 25% of what you would get if all 32 threads were active. This is the situation in the sparse attention example with 10% density - slightly worse than 25% because even the "active" warps may have fewer than 8 threads attending.
Q4: Implement a warp-level maximum reduction using shuffle intrinsics.
__device__ float warp_reduce_max(float val) {
// Same tree reduction, with max instead of sum
val = fmaxf(val, __shfl_down_sync(0xffffffff, val, 16));
val = fmaxf(val, __shfl_down_sync(0xffffffff, val, 8));
val = fmaxf(val, __shfl_down_sync(0xffffffff, val, 4));
val = fmaxf(val, __shfl_down_sync(0xffffffff, val, 2));
val = fmaxf(val, __shfl_down_sync(0xffffffff, val, 1));
return val; // Thread 0 holds the maximum of all 32 values
}
The pattern is identical to sum reduction - just replace + with fmaxf. This generalizes to any associative, commutative operation: min, max, sum, product, bitwise AND/OR/XOR.
Q5: You have a kernel with an if-else where 50% of threads take each branch and both paths are expensive. How do you fix the divergence?
Several approaches depending on the situation. First, check if the branch can be predicated - if both paths are short (under 8 instructions), the compiler may handle it automatically. Second, if the branch condition depends on input data, sort the input so all type-A elements come before type-B elements. Then warps are uniform (all type-A or all type-B) with zero divergence. Third, split into two separate kernel launches: one for type-A elements, one for type-B. The kernel launch overhead (~5 microseconds each) is usually far less than the divergence cost on millions of elements. Fourth, examine if the two paths can be unified by computing both results and selecting: result = mask ? compute_A(x) : compute_B(x) - if both computations are cheap this can be faster than diverging, using predicated execution.
Q6: What is the difference between __all_sync, __any_sync, and __ballot_sync?
All three query the state of a predicate across all participating threads in a warp and return the result to all threads (so the result is warp-uniform). __all_sync(mask, pred) returns true only if ALL participating threads have pred == true. __any_sync(mask, pred) returns true if AT LEAST ONE participating thread has pred == true. __ballot_sync(mask, pred) returns a 32-bit integer where bit i is 1 if thread i has pred == true and thread i is in the participation mask. __ballot_sync is the most general - you can derive all and any from it. Use __all_sync for checking convergence (all threads agree they are done). Use __any_sync for early-exit optimization (skip work if no thread has anything to do). Use __ballot_sync when you need to know exactly which threads are active.
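The "ballot is the most general" claim is easy to verify: both votes can be derived from it. A minimal sketch (function names are ours):

```cuda
__device__ bool my_all(unsigned mask, int pred) {
    // __ballot_sync only sets bits for participating lanes, so "all" means
    // the ballot covers the entire participation mask
    return __ballot_sync(mask, pred) == mask;
}

__device__ bool my_any(unsigned mask, int pred) {
    // At least one participating lane voted true
    return __ballot_sync(mask, pred) != 0;
}
```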
Measuring and Diagnosing Divergence
The Profiler Workflow
Before you can fix divergence, you need to find it. A kernel that feels slow might be slow for a dozen reasons. Divergence is one of them. Here is the workflow to confirm.
Step 1: Get a baseline profile with Nsight Compute.
# Profile a single kernel invocation
ncu --metrics smsp__thread_inst_executed_pred_on_sum,\
smsp__inst_executed.sum,\
smsp__warps_active.avg.pct_of_peak_sustained_active \
python my_training_script.py
Step 2: Calculate the actual divergence overhead.
The key formula:

$$\text{divergence loss} = 1 - \frac{\text{thread\_inst\_executed\_pred\_on}}{\text{inst\_executed} \times 32}$$

If inst_executed is 1000 and thread_inst_executed_pred_on is 18000, then:

$$1 - \frac{18000}{1000 \times 32} = 1 - 0.5625 \approx 44\%$$

This kernel is losing 44% of its throughput to divergence.
Step 3: Find the divergent code.
Nsight Compute's Source page shows per-line execution counts when compiled with -lineinfo. Lines with significantly lower thread counts than adjacent lines indicate divergence hot spots. Look for lines where the ratio drops suddenly.
# Compile with line info for source-level profiling
nvcc -O2 -lineinfo -o my_kernel my_kernel.cu
A Concrete Benchmark: Sorting vs. Unsorted
To make the cost of divergence viscerally clear, here is a benchmark comparing an unsorted sparse computation against a sorted one:
import torch
import time
def benchmark_divergence(N=1_000_000, sparsity=0.1, iters=200):
"""Compare divergent vs. non-divergent sparse processing."""
# Create data: 10% of elements are active
data = torch.randn(N, device='cuda')
mask = (torch.rand(N, device='cuda') < sparsity).int()
# Unsorted: active elements randomly scattered
# Every warp will have ~3 active threads and ~29 inactive -> high divergence
def unsorted_sparse(data, mask):
result = torch.zeros_like(data)
result[mask.bool()] = torch.tanh(data[mask.bool()])
return result
# Sorted: gather active elements first, process, scatter back
# Most warps are entirely inactive or entirely active -> low divergence
def sorted_sparse(data, mask):
active_idx = mask.nonzero(as_tuple=True)[0]
active_data = data[active_idx]
processed = torch.tanh(active_data)
result = torch.zeros_like(data)
result[active_idx] = processed
return result
# Warmup
for _ in range(20):
unsorted_sparse(data, mask)
sorted_sparse(data, mask)
torch.cuda.synchronize()
# Benchmark unsorted
start = time.perf_counter()
for _ in range(iters):
unsorted_sparse(data, mask)
torch.cuda.synchronize()
t_unsorted = (time.perf_counter() - start) / iters * 1000
# Benchmark sorted
start = time.perf_counter()
for _ in range(iters):
sorted_sparse(data, mask)
torch.cuda.synchronize()
t_sorted = (time.perf_counter() - start) / iters * 1000
print(f"Unsorted sparse (divergent): {t_unsorted:.3f} ms")
print(f"Sorted sparse (low divergence): {t_sorted:.3f} ms")
print(f"Speedup from sorting: {t_unsorted / t_sorted:.2f}x")
Typical results at 10% sparsity on an A100:
| Approach | Time (ms) | Notes |
|---|---|---|
| Unsorted (divergent) | 2.8 ms | ~3 active threads per warp |
| Sorted (compact) | 0.7 ms | Full warps in the active region |
| Speedup | 4.0x | Below the theoretical 1/sparsity = 10x bound |
At 10% sparsity, the theoretical maximum speedup from compaction is 10x. We get 4x because gather/scatter operations add overhead. But 4x is far better than 1x, and closing the gap between 4x and 10x is a matter of optimizing the gather/scatter rather than the main computation.
Real-World Application: MoE Routing Without Divergence
Mixture of Experts (MoE) models route each token to a subset of expert FFN layers. A naive implementation assigns each token its expert ID, then branches:
// NAIVE - massive divergence in each warp
// Each token in the batch routes to one of 8 experts
// Within a warp of 32 tokens, you might have 4 routing to expert 0,
// 5 to expert 1, 3 to expert 2, etc.
if (expert_id == 0) {
output = expert_0_ffn(hidden);
} else if (expert_id == 1) {
output = expert_1_ffn(hidden);
} // ... up to 8-way divergence
With 8 experts and uniform routing, each warp has roughly 4 tokens per expert. Every branch path executes for every warp. Warp execution efficiency is roughly 12.5% (4/32). The kernel does 8x more work than necessary.
The production solution, used in Switch Transformer (Fedus et al., 2021) and all subsequent MoE models, is the sort-and-dispatch approach:
flowchart LR
A["Token batch<br/>mixed experts"]:::blue
B["Compute<br/>expert assignments"]:::blue
C["Sort tokens<br/>by expert ID"]:::purple
D["Dispatch to<br/>expert N kernels"]:::green
E["Scatter results<br/>back to original order"]:::teal
A --> B --> C --> D --> E
classDef blue fill:#dbeafe,color:#1e293b,stroke:#2563eb
classDef green fill:#dcfce7,color:#14532d,stroke:#16a34a
classDef purple fill:#ede9fe,color:#4c1d95,stroke:#7c3aed
classDef teal fill:#ccfbf1,color:#134e4a,stroke:#14b8a6
After sorting, all tokens routed to expert 0 are contiguous, all tokens for expert 1 are contiguous, and so on. Expert 0's kernel runs on a compact tensor where every thread processes a real token. No masking. No divergence. Warp execution efficiency is close to 100%.
The sort adds overhead (roughly O(N log N) for N tokens), but it is a one-time cost amortized across all 8 expert computations. For expert FFN layers that dominate MoE runtime, the divergence elimination far outweighs the sort cost.
This is the general principle: spend O(N log N) to sort once, save O(N * divergence_factor) in every subsequent operation.
Summary
Warp divergence is the gap between the SIMT promise and the SIMT reality. The promise is 32 threads executing in parallel. The reality is that branches which split a warp into groups force sequential execution of each group, multiplying the time cost.
The patterns that cause divergence are predictable: conditions based on threadIdx.x % N for small N, conditions based on data values with random distribution, and masking logic in sparse operations. The patterns that avoid divergence are equally clear: conditions based on blockIdx.x (warp-uniform), conditions that align to warp boundaries, and pre-sorting inputs so each warp sees a uniform data type.
Warp shuffle intrinsics eliminate the need for shared memory in many reduction and broadcast patterns, and produce zero divergence in the process. The warp-level reduction using __shfl_down_sync is a fundamental building block you will use in almost every custom kernel that accumulates values.
The sparse attention kernel that opened this lesson required one fix: sorting tokens by their sparsity pattern before dispatch, so dense warps process dense regions and sparse warps process sparse regions. Warp execution efficiency went from 12% to 78%. No algorithmic changes. Just an understanding of how the hardware actually executes the code.
