Warp Divergence and Control Flow
Reading time: ~40 min - Interview relevance: Very High - Target roles: CUDA Developer, Kernel Engineer, ML Systems Engineer
Warp divergence does not fail loudly. It does not produce wrong answers. It simply makes your kernel slower in ways that are invisible until you look at the right profiler counter. A sparse attention kernel running at 12% efficiency is not broken - it is diverging.
The Sparse Attention Disaster
A team was implementing sparse attention for long-context inference. The standard attention mask checks were straightforward:
// Sparse attention: each query only attends to certain keys
__global__ void sparse_attention_kernel(
const float* Q, const float* K, const float* V,
const int* mask, // 1 = attend, 0 = skip
float* output,
int seq_len
) {
int q_idx = blockIdx.x;
int k_idx = threadIdx.x;
float score = 0.0f;
if (mask[q_idx * seq_len + k_idx] == 1) {
// Active thread: compute attention score
score = dot_product(Q + q_idx * HEAD_DIM,
K + k_idx * HEAD_DIM);
score = expf(score);
} else {
// Masked thread: skip computation
score = 0.0f;
}
// ... reduction and output
}
The kernel looked correct. It produced the right outputs. But Nsight Compute showed 12% warp execution efficiency - meaning for every clock cycle where the GPU was theoretically computing, 88% of the work capacity was sitting idle.
The problem was the mask. In sparse attention with 10% density, 90% of the threads in any given warp were taking the else branch while 10% were taking the if branch. The GPU had to execute both paths sequentially, masking off inactive threads each time. A kernel that could have been 10x faster than dense attention was slower than dense attention.
This is warp divergence. It is one of the most common reasons GPU kernels underperform, and it is completely invisible without profiler counters.
Why This Exists: The SIMT Contract
What SIMT Promises
SIMT stands for Single Instruction, Multiple Threads. It is the execution model that makes GPUs fast for parallel workloads. The fundamental promise is: at each clock cycle, one instruction is broadcast to all 32 threads in a warp, and all 32 threads execute it simultaneously.
This is different from SIMD (Single Instruction, Multiple Data) in CPUs. In SIMD, you explicitly vectorize your code: __m256 result = _mm256_add_ps(a, b) adds 8 floats in parallel. In SIMT, you write scalar code and the hardware executes it in parallel across 32 threads, each with its own register file.
The hardware abstraction is elegant: you write:
__global__ void add_kernel(float* a, float* b, float* c, int N) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < N) c[idx] = a[idx] + b[idx];
}
And the GPU executes this for all 32 threads in a warp simultaneously. Each thread has its own idx, its own registers, its own memory load, but they all execute the same sequence of instructions at the same cycle.
What SIMT Cannot Promise
The SIMT model breaks down when threads in a warp take different paths through the code. The hardware cannot execute two different instructions simultaneously on the same 32 execution units. When threads diverge, the hardware must serialize.
flowchart TD
A["Warp: 32 threads<br/>start together"]:::blue
B["if (condition)"]:::blue
C["TRUE path<br/>Threads 0-7 active<br/>Threads 8-31 MASKED"]:::orange
D["FALSE path<br/>Threads 8-31 active<br/>Threads 0-7 MASKED"]:::orange
E["Reconvergence point<br/>All 32 threads active again"]:::green
A --> B
B --> C
C --> D
D --> E
classDef blue fill:#dbeafe,color:#1e293b,stroke:#2563eb
classDef green fill:#dcfce7,color:#14532d,stroke:#16a34a
classDef orange fill:#ffedd5,color:#7c2d12,stroke:#ea580c
During the TRUE path execution, threads 8-31 are masked off - their results are discarded. During the FALSE path, threads 0-7 are masked. The total time is the sum of both paths, not the maximum. This is the serialization that kills performance.
Historical Context
The warp execution model has been part of NVIDIA GPUs since Tesla (2006). Early GPU programmers in CUDA's first years came from CPU backgrounds and instinctively wrote code with branches. Performance was often 5-10x below theoretical peak and nobody could explain why.
NVIDIA added the first divergence profiling counters in the Fermi architecture (2010). For the first time, engineers could see Warp Execution Efficiency in the profiler - a number below 100% meant divergence was happening. Suddenly, "the kernel is slow" became "this specific branch is causing 40% divergence."
By the Maxwell and Pascal eras (2014-2016), deep learning researchers started writing custom attention kernels for NLP. These kernels were full of masking logic (padding masks, causal masks, attention masks) and divergence was rampant. The community developed a vocabulary around it: "warp-uniform branches," "predicated execution," "sorting for divergence reduction."
The introduction of warp-level primitives (__ballot_sync, __any_sync, __shfl_sync in CUDA 9, 2017) gave programmers hardware-level tools to work with warp state directly. These primitives enabled efficient reductions and broadcasts without shared memory, and made it possible to make decisions based on what all threads in a warp are doing.
Today, warp divergence awareness is table stakes for any kernel that touches masking, sparsity, variable-length inputs, or conditional computation.
The Mechanics of Divergence
Warp Execution Under Divergence
Modern NVIDIA GPUs track per-thread branch state with predicate registers and an active mask for each warp. When a warp encounters a branch:
- The warp scheduler evaluates the branch condition for all 32 threads simultaneously
- It records which threads take the TRUE path and which take the FALSE path in a divergence mask
- It executes the TRUE path for all threads, with FALSE-path threads masked off (their write operations are suppressed)
- It then executes the FALSE path for all threads, with TRUE-path threads masked off
- At the reconvergence point (the join after the if-else), all 32 threads are active again
The cost is additive, not parallel. A branch where 16 threads go TRUE and 16 go FALSE takes twice as long as if all threads agreed.
Measuring Divergence Cost
For a branch with $k$ distinct paths taken by different threads in a warp, the execution time of that section is:

$$T_{\text{section}} = \sum_{i=1}^{k} T_i$$

where $T_i$ is the execution time of path $i$. For $k = 2$ (a standard if-else with a 50/50 split), you pay the cost of both paths even though only one applies to each thread.
Warp Execution Efficiency (WEE) in Nsight Compute is defined as:

$$\text{WEE} = \frac{\text{average active threads per executed warp instruction}}{32} \times 100\%$$

A WEE of 50% means half your warp slots are being wasted to divergence (and/or predicated-off lanes). For the sparse attention case with 10% density, you approach 10% WEE in the worst case.
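That worst case is easy to derive: with a uniformly random 10% mask, the expected number of active lanes per warp is

$$\text{avg active lanes} = 0.10 \times 32 = 3.2 \qquad\Rightarrow\qquad \text{WEE} = \frac{3.2}{32} \times 100\% = 10\%$$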
The Three Divergence Patterns
Understanding divergence requires distinguishing between cases where it matters and cases where it does not.
Pattern 1: Divergence Within Every Warp (Worst Case)
// threadIdx.x ranges 0-255 (8 warps of 32 threads each)
// Warp 0: threads 0-31. ALL warps have mixed even/odd threads.
if (threadIdx.x % 2 == 0) {
// Even threads: 16 threads per warp take this path
result = compute_A();
} else {
// Odd threads: 16 threads per warp take this path
result = compute_B();
}
Every single warp contains both even and odd threads. Every single warp diverges. Both paths execute sequentially for every warp. This is maximum divergence.
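To see this cost directly, a minimal microbenchmark can time a deliberately divergent kernel against a uniform one doing the same per-thread work. This is a sketch (all names here are ours, and exact numbers vary by GPU); on most hardware the divergent version should take roughly the sum of both paths, i.e., close to 2x:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Two distinct, long-running paths. __noinline__ plus different arithmetic
// keeps the compiler from merging or predicating them away.
__device__ __noinline__ float path_a(float x) {
    for (int i = 0; i < 256; ++i) x = x * 1.0001f + 0.1f;
    return x;
}

__device__ __noinline__ float path_b(float x) {
    for (int i = 0; i < 256; ++i) x = x * 0.9999f - 0.1f;
    return x;
}

// Every warp splits 16/16: maximum divergence
__global__ void divergent_kernel(float* out) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float x = (float)idx;
    if (threadIdx.x % 2 == 0) x = path_a(x);
    else                      x = path_b(x);
    out[idx] = x;
}

// Every warp agrees: zero divergence, same per-thread work
__global__ void uniform_kernel(float* out) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    out[idx] = path_a((float)idx);
}

int main() {
    const int N = 1 << 22;
    float* d_out;
    cudaMalloc(&d_out, N * sizeof(float));

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    float ms;

    divergent_kernel<<<N / 256, 256>>>(d_out); // warmup
    cudaEventRecord(t0);
    divergent_kernel<<<N / 256, 256>>>(d_out);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    cudaEventElapsedTime(&ms, t0, t1);
    printf("divergent: %.3f ms\n", ms);

    uniform_kernel<<<N / 256, 256>>>(d_out); // warmup
    cudaEventRecord(t0);
    uniform_kernel<<<N / 256, 256>>>(d_out);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    cudaEventElapsedTime(&ms, t0, t1);
    printf("uniform:   %.3f ms\n", ms);

    cudaFree(d_out);
    return 0;
}
```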
Pattern 2: Divergence Only at the Boundary (Acceptable)
// threadIdx.x ranges 0-255 (8 warps)
// Warps 0-3: threads 0-127, ALL take the if-branch (no divergence)
// Warp 4: threads 128-159, threads 128-143 take if, 144-159 take else (diverges)
// Warps 5-7: threads 160-255, ALL take else (no divergence)
if (threadIdx.x < 144) {
result = compute_A();
} else {
result = compute_B();
}
Only warp 4 (the boundary) diverges. 7 out of 8 warps execute without divergence. This is the pattern you want: minimize the number of warps that diverge by aligning branch boundaries to warp boundaries.
Pattern 3: No Divergence (Ideal)
// All threads in all warps make the same decision
if (blockIdx.x == 0) {
// Entire block 0 takes this path
result = compute_special_case();
} else {
// All other blocks take this path
result = compute_general_case();
}
blockIdx.x is the same for all threads in a block, which means it is the same for all threads in every warp within that block. All threads in a warp always agree. Zero divergence.
The key insight: divergence only matters within a warp, never between warps. Different warps can take completely different paths with zero performance penalty. Divergence is a warp-level concept.
Predicated Execution: When Branches Disappear
Not all branches cause divergence in the traditional sense. The NVIDIA compiler is smart enough to recognize simple branches and convert them to predicated instructions - instructions that execute but conditionally write their results.
// This branch looks like it could cause divergence
float result;
if (x > 0.0f) {
result = x;
} else {
result = 0.0f;
}
The compiler may generate this as:
// PTX pseudocode
FSETP.GT.AND P0, PT, x, 0.0, PT; // P0 = (x > 0.0)
SEL result, x, 0.0, P0; // result = P0 ? x : 0.0
SEL (select) is a single instruction that conditionally selects between two values. There is no branch. There is no divergence. Both inputs are computed (or available), and the instruction picks one based on the predicate. This is called predicated execution or branch predication.
The compiler applies predication when:
- Both branches are short (typically under 4 instructions each)
- Neither branch has side effects (no memory stores)
- The compiler determines that executing both paths is cheaper than a branch
You can force the compiler toward predication:
// Explicit conditional move - compiler almost always predicates this
float relu_result = fmaxf(x, 0.0f); // No branch, no divergence
float abs_result = fabsf(x); // Same
For longer branches, the compiler emits an actual branch instruction and divergence applies.
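To check which form the compiler actually chose, inspect the SASS. A minimal check (architecture flag and file names are placeholders for your setup):

```bash
# Compile to a cubin, then disassemble and look at the control flow
nvcc -arch=sm_80 -cubin -o kernel.cubin kernel.cu
cuobjdump --dump-sass kernel.cubin | grep -E "SEL|BRA|@P"
# SEL / @P0-predicated instructions -> the branch was predicated away
# BRA to a label                    -> a real branch; divergence is possible
```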
Warp Voting: Making Collective Decisions
CUDA provides intrinsics that let you query the state of all threads in a warp simultaneously. These are the warp voting functions.
// All warp voting functions require a mask argument (which threads participate)
// Use 0xffffffff for all 32 threads in the warp
// __all_sync: returns true if ALL active threads have condition == true
bool all_positive = __all_sync(0xffffffff, x > 0.0f);
// __any_sync: returns true if ANY active thread has condition == true
bool any_negative = __any_sync(0xffffffff, x < 0.0f);
// __ballot_sync: returns a 32-bit mask where bit i is set if thread i's condition is true
unsigned int active_mask = __ballot_sync(0xffffffff, should_compute);
int active_count = __popc(active_mask); // popcount = number of active threads
These intrinsics compile to single hardware instructions (VOTE.SYNC). They are extremely cheap - far faster than any shared-memory round trip.
Using Warp Voting to Skip Divergent Work
__global__ void conditional_kernel(float* data, int* flags, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // No early return here: every lane must reach the __any_sync below,
    // or the 0xffffffff mask would lie about who is participating
    bool in_bounds = (idx < N);
    // Check if any thread in this warp has work to do
    bool has_work = in_bounds && (flags[idx] == 1);
    bool any_has_work = __any_sync(0xffffffff, has_work);
    // If NO thread in this warp has work, skip the expensive computation
    // This avoids divergence when most warps are entirely idle
    if (!any_has_work) return;
    // Some threads in this warp have work - proceed with computation
    if (has_work) {
        data[idx] = expensive_compute(data[idx]);
    }
}
The outer if (!any_has_work) return is warp-uniform: all 32 threads in the warp agree on the answer (since __any_sync returns the same value to all threads). No divergence for warps where everyone is idle. The inner if (has_work) may diverge, but only in warps where at least some threads are active.
Warp Shuffle: Communication Without Shared Memory
Warp shuffle intrinsics allow threads within a warp to read each other's registers directly. This is faster than shared memory for warp-level operations: no bank conflicts, no synchronization barrier needed within the warp.
// __shfl_sync: thread reads a value from any other thread in the warp
// __shfl_up_sync: thread reads from thread with lower threadIdx
// __shfl_down_sync: thread reads from thread with higher threadIdx
// __shfl_xor_sync: thread reads from thread with XOR'd threadIdx
// All variants:
// T __shfl_*_sync(unsigned mask, T var, int src_lane, int width=32)
// mask: which threads participate (0xffffffff for all)
// var: the value THIS thread contributes
// return: the value from the SOURCE thread
Full Warp Reduction Using Shuffle
This is the canonical use case for warp shuffle. Summing 32 values in parallel using a shuffle-based tree reduction:
__device__ float warp_reduce_sum(float val) {
// Unroll the tree reduction manually
// Each iteration halves the active stride
// After 5 iterations (log2(32) = 5), thread 0 has the total sum
val += __shfl_down_sync(0xffffffff, val, 16); // stride 16
val += __shfl_down_sync(0xffffffff, val, 8); // stride 8
val += __shfl_down_sync(0xffffffff, val, 4); // stride 4
val += __shfl_down_sync(0xffffffff, val, 2); // stride 2
val += __shfl_down_sync(0xffffffff, val, 1); // stride 1
// Thread 0 now holds the sum of all 32 threads' values
return val;
}
// __shfl_down_sync(mask, val, delta):
// Each thread with lane_id < (32 - delta) reads from thread (lane_id + delta)
// Threads with lane_id >= (32 - delta) receive their own val unchanged
Visualizing one step (__shfl_down_sync(0xffffffff, val, 16)):
Thread: 0 1 2 ... 15 16 17 18 ... 31
Before: v0 v1 v2 ... v15 v16 v17 v18 ... v31
After shfl_down(16):
Thread 0 reads thread 16's value: val[0] += val[16]
Thread 1 reads thread 17's value: val[1] += val[17]
...
Thread 15 reads thread 31's value: val[15] += val[31]
Threads 16-31: no source, keep their own value (but don't matter)
Result: val[0] = v0+v16, val[1] = v1+v17, ..., val[15] = v15+v31
After 5 such steps, thread 0 holds the sum of all 32 values.
Complete Parallel Reduction Kernel
#include <cuda_runtime.h>
// Warp-level sum reduction (single warp's worth of data)
__device__ float warp_reduce_sum(float val) {
val += __shfl_down_sync(0xffffffff, val, 16);
val += __shfl_down_sync(0xffffffff, val, 8);
val += __shfl_down_sync(0xffffffff, val, 4);
val += __shfl_down_sync(0xffffffff, val, 2);
val += __shfl_down_sync(0xffffffff, val, 1);
return val; // Only thread 0 of each warp has the correct sum
}
// Block-level sum reduction (handles multiple warps)
__device__ float block_reduce_sum(float val) {
// Shared memory: one slot per warp (max 32 warps per block)
__shared__ float warp_sums[32];
int lane = threadIdx.x % 32; // Lane within this warp
int warp_id = threadIdx.x / 32; // Which warp in this block
// Step 1: reduce within each warp
val = warp_reduce_sum(val);
// Step 2: warp 0's lane 0 gets each warp's partial sum
if (lane == 0) warp_sums[warp_id] = val;
__syncthreads(); // All warps must finish before warp 0 reads
// Step 3: warp 0 reduces the warp sums
int num_warps = blockDim.x / 32;
if (warp_id == 0) {
// Load this warp's assigned partial sum (or 0 if out of range)
val = (lane < num_warps) ? warp_sums[lane] : 0.0f;
val = warp_reduce_sum(val); // Final reduction
}
return val; // Only thread 0 has the correct total sum
}
// Full reduction kernel
__global__ void reduce_sum_kernel(
const float* __restrict__ input,
float* __restrict__ output,
int N
) {
float partial_sum = 0.0f;
// Grid-stride loop: each thread accumulates multiple elements
for (int idx = blockIdx.x * blockDim.x + threadIdx.x;
idx < N;
idx += gridDim.x * blockDim.x) {
partial_sum += input[idx];
}
// Reduce within block
float block_sum = block_reduce_sum(partial_sum);
// Thread 0 writes this block's result
if (threadIdx.x == 0) {
atomicAdd(output, block_sum);
}
}
This kernel has zero divergence in the main computation loop (the grid-stride loop). The only potential divergence is in the final if (threadIdx.x == 0) - a single instruction, negligible cost.
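A minimal host-side driver for this kernel might look like the following (a sketch; grid size and error handling are deliberately simple, and the output scalar must be zeroed before launch because the kernel accumulates with atomicAdd):

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Assumes reduce_sum_kernel and its helpers from above are in this file.
int main() {
    const int N = 1 << 20;
    std::vector<float> h_in(N, 1.0f);    // all ones, so the expected sum is N

    float *d_in, *d_out;
    cudaMalloc(&d_in, N * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in.data(), N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, sizeof(float)); // kernel accumulates via atomicAdd

    // Modest grid; the grid-stride loop inside the kernel covers all of N
    reduce_sum_kernel<<<1024, 256>>>(d_in, d_out, N);

    float h_out = 0.0f;
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("sum = %.0f (expected %d)\n", h_out, N);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```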
Why No Divergence in This Reduction?
Compare to the naive shared-memory reduction with divergence:
// NAIVE - diverges on every iteration
__global__ void naive_reduce(float* data, int N) {
__shared__ float sdata[256];
int tid = threadIdx.x;
sdata[tid] = data[blockIdx.x * 256 + tid];
__syncthreads();
for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
if (tid < stride) { // <-- diverges only once stride < 32 (see below)
sdata[tid] += sdata[tid + stride];
}
__syncthreads();
}
}
For strides of 32 and above, the branch boundary is warp-aligned: at stride=128, warps 0-3 are fully active and warps 4-7 fully idle, so no warp actually diverges (though the idle warps still waste scheduler slots). The divergence arrives in the last five iterations, when stride drops below 32 and the single remaining active warp splits on every step.
In the shuffle-based reduction, there are no if-statements in the hot path. The __shfl_down_sync instruction itself handles the masking internally without creating warp divergence.
Restructuring Code to Eliminate Divergence
Technique 1: Sort Inputs by Branch Path
If your kernel processes elements with different code paths depending on element type, sort the inputs first so that elements taking the same path are contiguous. Then different warps take entirely different branches (warp-uniform) with no divergence.
// Before: elements interleaved, divergence in every warp
// [typeA, typeB, typeA, typeB, typeA, typeB, ...]
// After: sorted by type, warps are uniform
// [typeA, typeA, typeA, ..., typeB, typeB, typeB, ...]
// Warp 0-N: all typeA, takes if-branch uniformly (no divergence)
// Warp N+1-M: all typeB, takes else-branch uniformly (no divergence)
__global__ void process_sorted(Element* elements, int N_A, int N) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx >= N) return;
if (idx < N_A) {
// All threads in this warp have idx < N_A (if sorted)
process_type_A(elements[idx]);
} else {
process_type_B(elements[idx]);
}
}
This technique is used in MoE (Mixture of Experts) routing: tokens are sorted by which expert they are assigned to before being dispatched, so each expert's computation runs on contiguous, warp-uniform data.
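The sort itself rarely needs a custom kernel; in CUDA it is commonly done with Thrust. A sketch assuming each element carries an integer type key (type_keys and payloads are hypothetical names for your data):

```cuda
#include <thrust/device_vector.h>
#include <thrust/sort.h>

// Reorder payloads so same-type elements land contiguously in memory.
// Afterwards, per-type branches in the processing kernel are warp-uniform
// everywhere except at the (at most one) boundary warp per type.
void sort_by_type(thrust::device_vector<int>& type_keys,
                  thrust::device_vector<float>& payloads) {
    thrust::sort_by_key(type_keys.begin(), type_keys.end(), payloads.begin());
}
```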
Technique 2: Use Masking Instead of Branching
Instead of branching to skip computation, compute unconditionally and zero out masked results:
// Divergent version
if (mask[idx]) {
result = expensive_compute(data[idx]);
output[idx] = result;
}
// else: output[idx] unchanged (or implicitly zero from initialization)
// Potentially better: predicated write
float result = expensive_compute(data[idx]); // Compute for everyone
if (mask[idx]) {
output[idx] = result; // Only active threads write
}
Whether this helps depends on the cost of expensive_compute. If it is cheap (a few arithmetic ops), predicated execution may eliminate divergence entirely. If it is expensive (many ops), you are wasting work on masked threads.
Technique 3: Warp-Level Early Exit
__global__ void kernel_with_early_exit(float* data, int* flags, int N) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
bool active = (idx < N) && (flags[idx] != 0);
// Check if any thread in this warp is active
// __any_sync returns the same value to ALL threads in the warp
// So this branch is WARP-UNIFORM - zero divergence
if (!__any_sync(0xffffffff, active)) {
return; // Entire warp skips - no divergence
}
// Some threads are active - process them
if (active) {
// This inner branch may diverge, but only for
// warps that have at least one active thread
data[idx] = process(data[idx]);
}
}
Technique 4: Separate Kernels for Separate Paths
If two code paths are long and complex, sometimes the right answer is two kernel launches:
// Instead of one kernel with a big if-else
// Launch two specialized kernels
// Kernel 1: all type-A elements
process_type_A_kernel<<<gridA, block>>>(type_A_elements, N_A);
// Kernel 2: all type-B elements
process_type_B_kernel<<<gridB, block>>>(type_B_elements, N_B);
Two smaller kernel launches are often faster than one large divergent kernel. The overhead of an extra kernel launch is typically under 10 microseconds - a bargain if it eliminates 40% warp divergence.
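If the two kernels are independent, they can also be issued on separate streams so the launches overlap (a sketch, reusing the kernel and variable names from the snippet above):

```cuda
cudaStream_t sA, sB;
cudaStreamCreate(&sA);
cudaStreamCreate(&sB);

// Both kernels are warp-uniform internally; overlapping the launches
// hides most of the extra launch overhead.
process_type_A_kernel<<<gridA, block, 0, sA>>>(type_A_elements, N_A);
process_type_B_kernel<<<gridB, block, 0, sB>>>(type_B_elements, N_B);

cudaStreamSynchronize(sA);
cudaStreamSynchronize(sB);
cudaStreamDestroy(sA);
cudaStreamDestroy(sB);
```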
The Full Warp Primitives Reference
// ---- WARP VOTING ----
// Returns true if ALL participating threads have condition == true
bool all = __all_sync(unsigned mask, int predicate);
// Returns true if ANY participating thread has condition == true
bool any = __any_sync(unsigned mask, int predicate);
// Returns 32-bit mask where bit i is set if thread i's predicate is true
unsigned ballot = __ballot_sync(unsigned mask, int predicate);
// ---- WARP SHUFFLE ----
// Read from arbitrary lane
T val = __shfl_sync(unsigned mask, T var, int src_lane, int width=32);
// Read from lane (lane_id - delta); the lowest delta lanes have no source
// and keep their own value (no wraparound)
T val = __shfl_up_sync(unsigned mask, T var, unsigned delta, int width=32);
// Read from lane (lane_id + delta); the highest delta lanes have no source
// and keep their own value (no wraparound)
T val = __shfl_down_sync(unsigned mask, T var, unsigned delta, int width=32);
// Read from lane (lane_id XOR lane_mask)
T val = __shfl_xor_sync(unsigned mask, T var, int lane_mask, int width=32);
// ---- MATCH OPERATIONS (Volta+) ----
// Returns mask of threads with same value
unsigned match_any = __match_any_sync(unsigned mask, T value);
unsigned match_all = __match_all_sync(unsigned mask, T value, int *pred);
The mask parameter specifies which threads participate. Use 0xffffffff for full warp participation. For partial warps (e.g., boundary conditions), use __ballot_sync(0xffffffff, idx < N) to compute the active mask first.
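An alternative that sidesteps partial-mask shuffle subtleties entirely is to keep all 32 lanes alive and have out-of-range lanes contribute a neutral value. A boundary-safe warp sum in that style (a sketch; the kernel name is ours):

```cuda
// Every lane stays alive and out-of-range lanes contribute 0.0f,
// so the full-warp mask 0xffffffff remains valid for every shuffle step.
__global__ void boundary_safe_sum(const float* in, float* out, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float val = (idx < N) ? in[idx] : 0.0f;  // neutral element, no early return
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);
    if ((threadIdx.x & 31) == 0)             // lane 0 of each warp
        atomicAdd(out, val);
}
```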
Profiling Divergence with Nsight
The key counters to watch in Nsight Compute:
| Counter | Meaning | Target |
|---|---|---|
| `sm__warps_active.avg.pct_of_peak_sustained_active` | Overall warp occupancy | >50% |
| `smsp__thread_inst_executed_pred_on_sum` | Instructions executed with active predicate | Close to total |
| `smsp__inst_executed_per_warp.avg` | Avg instructions per warp launch | Should be stable |
| `l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum` | Global memory load sectors | Compare to theoretical |
The most direct divergence indicator is comparing thread_inst_executed (sum of instructions executed by all threads) to inst_executed * 32 (instructions executed assuming all 32 threads were active every cycle). A large gap indicates divergence.
Production Engineering Notes
The Boundary Divergence Trap
Almost every kernel has one unavoidable divergence point: the bounds check.
if (idx >= N) return;
For the last block, some threads are in-bounds and some are out-of-bounds. This creates divergence. Should you eliminate it?
Usually no. The last block diverges once. All other blocks have zero divergence from this check. The cost is one divergent warp per kernel launch - microseconds, not milliseconds. Spend your optimization effort on divergence in the main computation loop, not at the boundary.
The exception is if your block size is very small or N is small (many blocks with boundary divergence). In that case, pad your input to the next multiple of block size to eliminate all boundary divergence.
Use the _sync Warp Intrinsics (CUDA 9+, Mandatory on Volta and Later)
The synchronized shuffle variants (__shfl_down_sync etc.) arrived in CUDA 9 (2017) alongside Volta. The older unsynchronized variants (__shfl_down) still compile but are deprecated, and on Volta and later (compute capability 7.0+) they can silently produce incorrect results: independent thread scheduling means threads in a warp are no longer guaranteed to execute in lockstep, so a shuffle without an explicit sync point may read stale values. Modern ML training runs on Volta or later, so always use the _sync forms.
Cooperative Groups for More Flexible Synchronization
CUDA 9 introduced Cooperative Groups, which allow you to create flexible synchronization scopes smaller than a block:
#include <cooperative_groups.h>
namespace cg = cooperative_groups;
__global__ void cg_reduce(float* data, float* result, int N) {
// Create a warp-level group
auto warp = cg::tiled_partition<32>(cg::this_thread_block());
int idx = blockIdx.x * blockDim.x + threadIdx.x;
float val = (idx < N) ? data[idx] : 0.0f;
// Reduce within warp using cooperative groups
for (int delta = 16; delta > 0; delta /= 2) {
val += warp.shfl_down(val, delta);
}
if (warp.thread_rank() == 0) {
atomicAdd(result, val);
}
}
Cooperative groups are more readable than raw shuffle intrinsics and the compiler can reason about them more effectively for optimization.
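CUDA 11 added cooperative-group reductions that collapse the shuffle loop into a single call. A sketch assuming CUDA 11+ headers are available:

```cuda
#include <cooperative_groups.h>
#include <cooperative_groups/reduce.h>
namespace cg = cooperative_groups;

__device__ float warp_sum_cg(float val) {
    auto warp = cg::tiled_partition<32>(cg::this_thread_block());
    // cg::reduce returns the reduction result to every thread in the tile
    return cg::reduce(warp, val, cg::plus<float>());
}
```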
Common Mistakes
:::danger Divergence from threadIdx.x % 2
The classic divergence trap. Any condition based on threadIdx.x % N with N between 2 and 31 varies within every warp, so every single warp diverges. This is the worst possible case.
// WRONG: every warp diverges
if (threadIdx.x % 2 == 0) {
result = compute_even(x);
} else {
result = compute_odd(x);
}
// RIGHT: if you need even/odd processing, use different blocks
// or restructure the computation to avoid the branch entirely
:::
:::danger Forgetting the Mask in Warp Intrinsics
All modern CUDA warp intrinsics require an explicit participation mask. Using 0xffffffff is correct for full warps, but using it in a context where some threads are inactive (e.g., after a return or after divergence) produces undefined behavior.
// WRONG: some lanes may have returned early, but 0xffffffff claims all 32 participate
if (idx >= N) return;
float sum = __shfl_down_sync(0xffffffff, val, 16); // Undefined if any lane exited
// RIGHT: compute the mask of active lanes BEFORE any lane can exit
unsigned active_mask = __ballot_sync(0xffffffff, idx < N);
if (idx >= N) return;
float sum = __shfl_down_sync(active_mask, val, 16); // Only surviving lanes participate
Even the corrected version has a subtlety: a lane whose shuffle source has exited reads an undefined value. When possible, prefer keeping all 32 lanes alive and feeding a neutral value to out-of-range lanes, as in the boundary-safe reduction shown earlier.
:::
:::warning Divergence You Cannot Avoid vs. Divergence You Can
Not all divergence is equally worth fixing. The boundary check if (idx >= N) return affects one warp per kernel launch. A conditional in the main compute loop affects every warp on every iteration. Profile to find where your divergence is actually occurring before spending time restructuring.
:::
:::warning Sorting Has Its Own Cost
Sorting inputs to eliminate divergence is a valid technique, but sorting is not free. For small inputs (under 1M elements), the sorting cost may exceed the divergence cost. Profile both approaches. For large inputs with high divergence (like MoE routing), sorting almost always wins.
:::
Interview Questions and Answers
Q1: What happens at the hardware level when threads in a warp take different branches?
At the hardware level, the warp scheduler detects that threads disagree on the branch condition. It maintains a divergence mask for the warp. It executes the TRUE path with threads on that path unmasked and threads on the FALSE path masked (their writes are suppressed). Then it executes the FALSE path with the masks inverted. Execution reconverges at the first post-dominator instruction (the point in the control flow graph that all paths must pass through). The cost is the sum of both path execution times rather than the maximum. For a 50/50 split where both paths have the same cost, divergence doubles that section's execution time.
Q2: What is predicated execution and when does the compiler apply it?
Predicated execution converts a branch into a predicate register plus conditional instructions. Instead of branching, both paths are executed for all threads, but only one path's results are committed. The compiler applies predication when both branch bodies are short (typically under 4-8 instructions), neither contains memory stores (or the stores can be predicated), and the overhead of executing both paths is less than the overhead of a branch instruction itself. Predicated execution eliminates divergence entirely because all threads execute the same instruction sequence. You can encourage predication by using intrinsics like fmaxf, fminf, fabsf, and by writing simple conditional assignments rather than complex branched code.
Q3: Calculate the warp execution efficiency for a kernel where each warp has 8 active threads and 24 inactive due to masking.
WEE = Active threads / Warp size = 8 / 32 = 25%. In the profiler, this would show as 25% warp execution efficiency for that section. This means 75% of the GPU's execution capacity is being wasted. In terms of actual throughput, you are getting roughly 25% of what you would get if all 32 threads were active. This is the situation in the sparse attention example with 10% density - slightly worse than 25% because even the "active" warps may have fewer than 8 threads attending.
Q4: Implement a warp-level maximum reduction using shuffle intrinsics.
__device__ float warp_reduce_max(float val) {
// Same tree reduction, with max instead of sum
val = fmaxf(val, __shfl_down_sync(0xffffffff, val, 16));
val = fmaxf(val, __shfl_down_sync(0xffffffff, val, 8));
val = fmaxf(val, __shfl_down_sync(0xffffffff, val, 4));
val = fmaxf(val, __shfl_down_sync(0xffffffff, val, 2));
val = fmaxf(val, __shfl_down_sync(0xffffffff, val, 1));
return val; // Thread 0 holds the maximum of all 32 values
}
The pattern is identical to sum reduction - just replace + with fmaxf. This generalizes to any associative, commutative operation: min, max, sum, product, bitwise AND/OR/XOR.
Q5: You have a kernel with an if-else where 50% of threads take each branch and both paths are expensive. How do you fix the divergence?
Several approaches depending on the situation. First, check if the branch can be predicated - if both paths are short (under 8 instructions), the compiler may handle it automatically. Second, if the branch condition depends on input data, sort the input so all type-A elements come before type-B elements. Then warps are uniform (all type-A or all type-B) with zero divergence. Third, split into two separate kernel launches: one for type-A elements, one for type-B. The kernel launch overhead (~5 microseconds each) is usually far less than the divergence cost on millions of elements. Fourth, examine if the two paths can be unified by computing both results and selecting: result = mask ? compute_A(x) : compute_B(x) - if both computations are cheap this can be faster than diverging, using predicated execution.
Q6: What is the difference between __all_sync, __any_sync, and __ballot_sync?
All three query the state of a predicate across all participating threads in a warp and return the result to all threads (so the result is warp-uniform). __all_sync(mask, pred) returns true only if ALL participating threads have pred == true. __any_sync(mask, pred) returns true if AT LEAST ONE participating thread has pred == true. __ballot_sync(mask, pred) returns a 32-bit integer where bit i is 1 if thread i has pred == true and thread i is in the participation mask. __ballot_sync is the most general - you can derive all and any from it. Use __all_sync for checking convergence (all threads agree they are done). Use __any_sync for early-exit optimization (skip work if no thread has anything to do). Use __ballot_sync when you need to know exactly which threads are active.
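The "ballot is the most general" claim is easy to verify: both votes can be derived from it. A minimal sketch (function names are ours):

```cuda
__device__ bool my_all(unsigned mask, int pred) {
    // __ballot_sync only sets bits for participating lanes, so "all" means
    // the ballot covers the entire participation mask
    return __ballot_sync(mask, pred) == mask;
}

__device__ bool my_any(unsigned mask, int pred) {
    // At least one participating lane voted true
    return __ballot_sync(mask, pred) != 0;
}
```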
Measuring and Diagnosing Divergence
The Profiler Workflow
Before you can fix divergence, you need to find it. A kernel that feels slow might be slow for a dozen reasons. Divergence is one of them. Here is the workflow to confirm.
Step 1: Get a baseline profile with Nsight Compute.
# Profile a single kernel invocation
ncu --metrics smsp__thread_inst_executed_pred_on_sum,\
smsp__inst_executed.sum,\
smsp__warps_active.avg.pct_of_peak_sustained_active \
python my_training_script.py
Step 2: Calculate the actual divergence overhead.
The key formula:

$$\text{divergence loss} = 1 - \frac{\text{thread\_inst\_executed\_pred\_on}}{\text{inst\_executed} \times 32}$$

If inst_executed is 1000 and thread_inst_executed_pred_on is 18000, then:

$$1 - \frac{18000}{1000 \times 32} = 1 - 0.5625 \approx 44\%$$

This kernel is losing 44% of its throughput to divergence.
Step 3: Find the divergent code.
Nsight Compute's Source page shows per-line execution counts when compiled with -lineinfo. Lines with significantly lower thread counts than adjacent lines indicate divergence hot spots. Look for lines where the ratio drops suddenly.
# Compile with line info for source-level profiling
nvcc -O2 -lineinfo -o my_kernel my_kernel.cu
A Concrete Benchmark: Sorting vs. Unsorted
To make the cost of divergence viscerally clear, here is a benchmark comparing an unsorted sparse computation against a sorted one:
import torch
import time
def benchmark_divergence(N=1_000_000, sparsity=0.1, iters=200):
"""Compare divergent vs. non-divergent sparse processing."""
# Create data: 10% of elements are active
data = torch.randn(N, device='cuda')
mask = (torch.rand(N, device='cuda') < sparsity).int()
# Unsorted: active elements randomly scattered
# Every warp will have ~3 active threads and ~29 inactive -> high divergence
def unsorted_sparse(data, mask):
result = torch.zeros_like(data)
result[mask.bool()] = torch.tanh(data[mask.bool()])
return result
# Sorted: gather active elements first, process, scatter back
# Most warps are entirely inactive or entirely active -> low divergence
def sorted_sparse(data, mask):
active_idx = mask.nonzero(as_tuple=True)[0]
active_data = data[active_idx]
processed = torch.tanh(active_data)
result = torch.zeros_like(data)
result[active_idx] = processed
return result
# Warmup
for _ in range(20):
unsorted_sparse(data, mask)
sorted_sparse(data, mask)
torch.cuda.synchronize()
# Benchmark unsorted
start = time.perf_counter()
for _ in range(iters):
unsorted_sparse(data, mask)
torch.cuda.synchronize()
t_unsorted = (time.perf_counter() - start) / iters * 1000
# Benchmark sorted
start = time.perf_counter()
for _ in range(iters):
sorted_sparse(data, mask)
torch.cuda.synchronize()
t_sorted = (time.perf_counter() - start) / iters * 1000
print(f"Unsorted sparse (divergent): {t_unsorted:.3f} ms")
print(f"Sorted sparse (low divergence): {t_sorted:.3f} ms")
print(f"Speedup from sorting: {t_unsorted / t_sorted:.2f}x")
Typical results at 10% sparsity on an A100:
| Approach | Time (ms) | Notes |
|---|---|---|
| Unsorted (divergent) | 2.8 ms | ~3 active threads per warp |
| Sorted (compact) | 0.7 ms | Full warps in the active region |
| Speedup | 4.0x | Below the theoretical 1/sparsity = 10x bound |
At 10% sparsity, the theoretical maximum speedup from compaction is 10x. We get 4x because gather/scatter operations add overhead. But 4x is far better than 1x, and closing the gap between 4x and 10x is a matter of optimizing the gather/scatter rather than the main computation.
Real-World Application: MoE Routing Without Divergence
Mixture of Experts (MoE) models route each token to a subset of expert FFN layers. A naive implementation assigns each token its expert ID, then branches:
// NAIVE - massive divergence in each warp
// Each token in the batch routes to one of 8 experts
// Within a warp of 32 tokens, you might have 4 routing to expert 0,
// 5 to expert 1, 3 to expert 2, etc.
if (expert_id == 0) {
output = expert_0_ffn(hidden);
} else if (expert_id == 1) {
output = expert_1_ffn(hidden);
} // ... up to 8-way divergence
With 8 experts and uniform routing, each warp has roughly 4 tokens per expert. Every branch path executes for every warp. Warp execution efficiency is roughly 12.5% (4/32). The kernel does 8x more work than necessary.
The production solution, used in Switch Transformer (Fedus et al., 2021) and all subsequent MoE models, is the sort-and-dispatch approach:
flowchart LR
A["Token batch<br/>mixed experts"]:::blue
B["Compute<br/>expert assignments"]:::blue
C["Sort tokens<br/>by expert ID"]:::purple
D["Dispatch to<br/>expert N kernels"]:::green
E["Scatter results<br/>back to original order"]:::teal
A --> B --> C --> D --> E
classDef blue fill:#dbeafe,color:#1e293b,stroke:#2563eb
classDef green fill:#dcfce7,color:#14532d,stroke:#16a34a
classDef purple fill:#ede9fe,color:#4c1d95,stroke:#7c3aed
classDef teal fill:#ccfbf1,color:#134e4a,stroke:#14b8a6
After sorting, all tokens routed to expert 0 are contiguous, all tokens for expert 1 are contiguous, and so on. Expert 0's kernel runs on a compact tensor where every thread processes a real token. No masking. No divergence. Warp execution efficiency is close to 100%.
The sort adds overhead (roughly O(N log N) for N tokens), but it is a one-time cost amortized across all 8 expert computations. For expert FFN layers that dominate MoE runtime, the divergence elimination far outweighs the sort cost.
This is the general principle: spend O(N log N) to sort once, save O(N * divergence_factor) in every subsequent operation.
Summary
Warp divergence is the gap between the SIMT promise and the SIMT reality. The promise is 32 threads executing in parallel. The reality is that branches which split a warp into groups force sequential execution of each group, multiplying the time cost.
The patterns that cause divergence are predictable: conditions based on threadIdx.x % N for small N, conditions based on data values with random distribution, and masking logic in sparse operations. The patterns that avoid divergence are equally clear: conditions based on blockIdx.x (warp-uniform), conditions that align to warp boundaries, and pre-sorting inputs so each warp sees a uniform data type.
Warp shuffle intrinsics eliminate the need for shared memory in many reduction and broadcast patterns, and produce zero divergence in the process. The warp-level reduction using __shfl_down_sync is a fundamental building block you will use in almost every custom kernel that accumulates values.
The sparse attention kernel that opened this lesson required one fix: sorting tokens by their sparsity pattern before dispatch, so dense warps process dense regions and sparse warps process sparse regions. Warp execution efficiency went from 12% to 78%. No algorithmic changes. Just an understanding of how the hardware actually executes the code.
