
Instruction-Level Optimization

Reading time: ~38 min · Interview relevance: High · Target roles: CUDA Developer, Kernel Engineer, ML Systems Engineer

The GPU hides memory latency by switching between warps. It hides arithmetic latency within a warp by issuing independent instructions from that warp back-to-back. If your warp's next instruction always depends on the result of the previous one, the warp issues once per instruction latency instead of once per cycle, and most of its issue slots go empty.

The Benchmark That Changed How I Think About GPUs

Two kernels. Same input size. Same memory access pattern - perfectly coalesced, 100% L2 hit rate. Same arithmetic intensity. Same number of floating-point operations.

Kernel A runs at 78% of peak FP32 throughput on an H100.

Kernel B runs at 31%.

The memory profiler shows nothing wrong. Occupancy is identical - 64 active warps per SM on both. L1 cache hit rate is the same. You stare at the code for ten minutes and cannot see the difference.

Then you open Nsight Compute and look at the warp stall reasons. Kernel B has a massive bar in "Stall Wait" - not memory stalls, not sync stalls, but wait stalls. The instruction scheduler has warps ready to execute but their next instruction is waiting on the result of the previous one. A register dependency chain, three instructions deep, serializing every computation.

Kernel A has that same chain - but it processes four independent elements simultaneously. Four accumulator registers. Four parallel dependency chains. The scheduler issues an instruction from a different chain each cycle, keeping the FMA pipeline full, and the result is 2.5x higher throughput from identical hardware.

This is Instruction-Level Parallelism (ILP). It is not about how many warps you have. It is about how many independent instructions a single warp can present to the scheduler in a single cycle. Getting this right is the difference between a kernel that looks optimized and one that actually is.

By the end of this lesson you will understand:

  • what the SM instruction scheduler actually does each cycle
  • why four independent operations in one warp can beat four separate warps running one operation each
  • how vectorized float4 loads cut instruction count while improving effective bandwidth
  • how loop unrolling exposes the independent instruction streams the compiler cannot find automatically
  • how to read PTX to verify that the instructions you wrote are the instructions the hardware will execute

Why This Exists - The Instruction Latency Problem

The Root Issue: Instructions Take Time

Every GPU instruction has a latency - the number of cycles between when an instruction is issued and when its result is available for the next instruction. On modern NVIDIA GPUs, a typical FP32 FMA instruction has a latency of around 4 cycles. An FP64 FMA is around 8 cycles. A global memory load is 400-800 cycles.

The classic GPU answer to memory latency is warp switching. The SM holds up to 64 active warps. When warp 0 issues a memory load and stalls waiting for the result, the scheduler switches to another warp. By the time it has cycled through enough other warps, the memory result is ready and warp 0 can resume. This is latency hiding through warp-level parallelism.

But this only solves inter-warp latency. What about the arithmetic dependency chain inside a single warp?

// This code has a 4-cycle dependency chain on every iteration
float acc = 0.0f;
for (int i = 0; i < N; i++) {
    acc = acc * alpha + data[i];  // acc depends on previous acc
}

Every iteration, acc depends on the result of the previous iteration. The scheduler cannot start iteration i+1 until iteration i completes. Even with 64 warps available, this warp cannot progress faster than one FMA every 4 cycles. The latency is intra-warp and warp switching does not help.

The H100 SM has four warp schedulers, one per SM sub-partition, and each resident warp belongs to exactly one of them. A scheduler issues at most one instruction per warp per cycle. If your warp's next instruction always waits on a 4-cycle FMA result, the warp issues once every four cycles - 25% of its available issue bandwidth.

ILP is the technique of restructuring computations so that a warp always has an independent instruction ready, letting it issue every cycle instead of stalling on register dependencies.

Historical Context - How NVIDIA Exposed ILP to Programmers

The Fermi architecture (2010) was the first NVIDIA GPU with dual instruction dispatch - two schedulers per SM, two instructions per cycle. The intent was to double throughput on independent workloads without increasing clock speed.

But the early CUDA programming model made ILP invisible. You wrote one operation per line. The compiler could sometimes find independent operations and reorder them, but loop-carried dependencies (the most common case in reduction and matrix multiply inner loops) were invisible to the auto-vectorizer.

The #pragma unroll directive, carried over from CPU compiler tradition, became the primary tool. Unrolling an inner loop four times creates four independent accumulation variables, four parallel dependency chains, four instructions simultaneously ready for the scheduler. NVIDIA documented this pattern in their Kepler optimization guide (2012) and it has remained critical through every architecture since.

The float4 vector load came from a different direction. CPU SIMD (SSE/AVX) vectorizes across multiple data lanes. GPU vectorization is different - it is about issuing a single 128-bit load instruction instead of four 32-bit loads. The benefit is instruction count reduction, not parallelism per se. Fewer instructions mean fewer issue slots consumed, more issue slots available for computation, and less pressure on the instruction decoder.

PTX (Parallel Thread Execution) - NVIDIA's virtual ISA - was introduced with CUDA 1.0 but became a serious optimization tool around CUDA 4.0 when NVIDIA published the PTX ISA reference and began exposing it through inline PTX in CUDA kernels. Understanding PTX is now table stakes for kernel engineers who need to verify that compiler output matches intent.

Core Concept 1 - Instruction-Level Parallelism

What the Scheduler Actually Does

The SM warp scheduler works like this each cycle:

  1. Look at all active warps
  2. For each warp, look at its next instruction
  3. An instruction is "ready" if all its source operands are available (no pending register dependency)
  4. Issue up to 4 ready instructions from eligible warps (one per scheduler)

The key insight: a single warp can have multiple ready instructions if those instructions operate on independent registers.

// Low ILP - sequential dependency chain
// Scheduler sees: inst0 ready, inst1 NOT ready (depends on inst0 result)
float a = x[i]; // inst0: load
float b = a * 2.0f; // inst1: wait for inst0 (4 cycle dependency)
float c = b + 1.0f; // inst2: wait for inst1

// High ILP - four independent chains
// Scheduler sees: inst0, inst1, inst2, inst3 all ready simultaneously
float a0 = x[i+0] * 2.0f + 1.0f; // chain 0: independent
float a1 = x[i+1] * 2.0f + 1.0f; // chain 1: independent
float a2 = x[i+2] * 2.0f + 1.0f; // chain 2: independent
float a3 = x[i+3] * 2.0f + 1.0f; // chain 3: independent

In the second version, the four FMA instructions operate on completely different registers. All four are ready at once, so the scheduler issues one per cycle back-to-back with no dependency stalls in between. Throughput approaches 4x the single-chain version.

Measuring ILP - Issued IPC vs Peak IPC

The metric to watch in Nsight Compute is sm__inst_executed_pipe_fma.avg.pct_of_peak_sustained_active. This tells you what fraction of peak FMA throughput your kernel achieves. On an H100, peak FP32 FMA throughput is 128 operations per SM per cycle. If you are getting 40%, your instruction scheduling is leaving 60% of compute on the table.

The warp stall reason smsp__warp_issue_stalled_wait_lns_per_warp measures cycles where a warp had a pending instruction but the instruction was waiting on a register dependency. High values here - above 10-15% of total cycles - indicate an ILP problem.
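To pull just these two counters for a single kernel, an invocation along these lines works (the kernel and binary names here are placeholders):

# FMA utilization + wait-stall counter for one kernel
ncu --kernel-name myKernel \
    --metrics sm__inst_executed_pipe_fma.avg.pct_of_peak_sustained_active,\
smsp__warp_issue_stalled_wait_lns_per_warp \
    ./myprogram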


Core Concept 2 - Vectorized Loads

The float4 Load

A regular float load from global memory compiles to a 32-bit load instruction - ld.global.f32 in PTX, LDG.32 in SASS. Loading four floats sequentially generates four such loads, consuming four instruction issue slots.

A float4 load generates a single 128-bit load - ld.global.v4.f32 in PTX, LDG.128 in SASS. One instruction. One issue slot. Same data transferred.

The immediate benefits:

  • 4x reduction in load instruction count - the instruction decoder is a bottleneck on instruction-dense kernels
  • 128-bit memory transaction - a single 128-bit aligned transaction instead of four 32-bit transactions, which is more efficient for memory system coalescing
  • Freed issue slots - the three saved instruction slots can be used for computation
// Scalar loads - 4 instruction issue slots for loads
__global__ void elementwise_scalar(float* __restrict__ out,
                                   const float* __restrict__ in,
                                   float scale, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        out[idx] = in[idx] * scale;  // LDG.32 + FMUL + STG.32
    }
}

// Vectorized loads - 1 instruction issue slot for loads
__global__ void elementwise_float4(float4* __restrict__ out,
                                   const float4* __restrict__ in,
                                   float scale, int n4) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n4) {
        float4 val = in[idx];  // LDG.128 - ONE instruction
        val.x *= scale;
        val.y *= scale;
        val.z *= scale;
        val.w *= scale;
        out[idx] = val;  // STG.128 - ONE instruction
    }
}

The vectorized version has 8 total instructions (1 load + 4 multiply + 1 store + 2 address compute) vs 12 for the scalar version across 4 elements. The load instruction count drops from 4 to 1.

Alignment Requirements for Vector Loads

float4 loads require 16-byte alignment: the byte address of every float4 access must be divisible by 16.

// WRONG - this can generate misaligned accesses
float* data = ...; // not guaranteed to be 16-byte aligned for float4
float4* data4 = (float4*)data; // cast without alignment check

// CORRECT - use cudaMalloc which always returns 256-byte aligned memory
float4* data4;
cudaMalloc(&data4, n * sizeof(float4)); // 256-byte aligned, always valid for float4

When working with subarrays, pad your data to multiples of 4 elements:

// Pad n to a multiple of 4 before allocation
int n_padded = (n + 3) & ~3;  // round up to next multiple of 4
float* data;
cudaMalloc(&data, n_padded * sizeof(float));  // n_padded floats = n_padded/4 float4 groups
float4* data4 = (float4*)data;  // safe: the cudaMalloc base pointer is 256-byte aligned
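When n is not a multiple of 4 and padding is not an option, one workable pattern - a host-side sketch reusing the two elementwise kernels above, with d_in, d_out, and scale assumed to be set up already - is to cover the bulk with the vectorized kernel and the 0-3 leftover elements with the scalar one, so no single kernel mixes float and float4 accesses:

// Bulk: vectorized kernel over the n/4 whole float4 groups
int n4 = n / 4;
if (n4 > 0) {
    dim3 block(256), grid((n4 + block.x - 1) / block.x);
    elementwise_float4<<<grid, block>>>((float4*)d_out, (float4*)d_in, scale, n4);
}

// Tail: scalar kernel over the remaining n % 4 elements
int tail = n - n4 * 4;
if (tail > 0) {
    elementwise_scalar<<<1, tail>>>(d_out + n4 * 4, d_in + n4 * 4, scale, tail);
}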

__ldg() - The Read-Only Cache Load

The __ldg() intrinsic loads data through the read-only texture cache (L1 texture path), bypassing the regular L1 data cache. This is beneficial when:

  1. Data is read-only within the kernel (no writes to same addresses)
  2. Access pattern is irregular enough to pollute the L1 data cache
  3. Multiple thread blocks need the same data - texture cache is shared across thread blocks on the same SM
// Regular load - goes through L1 data cache
float val = data[idx];

// __ldg load - goes through read-only texture cache
float val = __ldg(&data[idx]);

// Combined with float4 vectorization
float4 val = __ldg((const float4*)&data[idx*4]);

The __restrict__ keyword on pointer parameters tells the compiler that the pointer is not aliased with any other pointer in the function. On a parameter that is also const, this lets the compiler prove the data is read-only for the entire kernel and automatically route its loads through the read-only cache - the same path as __ldg. This is the preferred approach over manually inserting __ldg everywhere.

// const + __restrict__ enables automatic read-only cache loads
__global__ void scale_kernel(float* __restrict__ out,
                             const float* __restrict__ in,
                             float alpha, int n) {
    // the compiler knows in[] is not aliased with out[],
    // so it can emit LDG.E.CONSTANT (read-only cache) loads for in[]
}

Core Concept 3 - Loop Unrolling

What Unrolling Does to the Instruction Stream

A tight loop has overhead: increment counter, compare to bound, branch. More importantly, a loop with a single accumulation variable creates a sequential dependency chain that the scheduler cannot parallelize.

#pragma unroll N tells the compiler to replicate the loop body N times, eliminating most loop overhead and letting it batch loads and reorder independent operations. Note what it does not do: unrolling a single-accumulator loop leaves the accumulation chain intact, because reassociating floating-point additions changes the result and the compiler will not do it by default. Truly independent chains require separate accumulator variables, as in the manual unroll below.

// BEFORE unrolling - one accumulator, sequential dependency chain
float sum = 0.0f;
for (int k = 0; k < K; k++) {
    sum += A[row * K + k] * B[k * N + col];
    // every iteration: sum depends on previous sum result
    // scheduler cannot pipeline across iterations
}

// AFTER manual 4x unrolling - four independent accumulators
float sum0 = 0.0f, sum1 = 0.0f, sum2 = 0.0f, sum3 = 0.0f;
for (int k = 0; k < K; k += 4) {
    sum0 += A[row * K + k+0] * B[(k+0) * N + col];
    sum1 += A[row * K + k+1] * B[(k+1) * N + col];
    sum2 += A[row * K + k+2] * B[(k+2) * N + col];
    sum3 += A[row * K + k+3] * B[(k+3) * N + col];
    // all four FMAs are independent of each other
    // the scheduler can issue one per cycle, back to back
}
float sum = sum0 + sum1 + sum2 + sum3;

The unrolled version presents four independent FMA chains per warp. With a 4-cycle FMA latency and an issue rate of one instruction per cycle per warp, four independent FMAs let the warp issue on every cycle, completely hiding the latency. This is near-optimal ILP for the FMA pipe.

#pragma unroll

The compiler directive #pragma unroll N controls the replication factor. Without specifying N, the compiler unrolls as much as it deems beneficial (which is often conservative):

// Unroll 4x - compiler generates 4 copies of the loop body
#pragma unroll 4
for (int k = 0; k < K; k++) {
    acc += A[k] * B[k];
}

// Unroll completely (dangerous for large trip counts - see warning below)
#pragma unroll
for (int k = 0; k < 16; k++) {  // compile-time constant trip count
    acc += A[k] * B[k];
}

// Disable unrolling (useful for debugging register pressure)
#pragma unroll 1
for (int k = 0; k < K; k++) {
    acc += A[k] * B[k];
}

The Register Pressure Tradeoff

Every unrolled copy creates new live variables. Four accumulators instead of one. Four sets of loaded values instead of one. Registers are consumed.

On an H100 SM, the register file holds 65,536 32-bit registers, and a single thread can use at most 255 of them. More registers per thread means fewer concurrent threads (lower occupancy), which means fewer warps available for latency hiding.

The optimal unroll factor is not always 4. Profile with --metrics sm__warps_active.avg.pct_of_peak_sustained_active to see occupancy. If unrolling drops occupancy from 75% to 37%, the ILP gain is likely not worth the occupancy loss.
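If profiling shows unrolling has pushed register counts too high, __launch_bounds__ gives ptxas an explicit budget to compile against. A sketch - the 256-thread / 2-blocks-per-SM values are illustrative, not a recommendation:

// Ask ptxas to support 256-thread blocks with at least 2 resident per SM;
// it will cap registers per thread (spilling if necessary) to comply
__global__ void __launch_bounds__(256, 2)
strided_sum4(const float* __restrict__ in, float* __restrict__ out, int n) {
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    int stride = blockDim.x * gridDim.x;
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    for (int i = tid; i < n - 3 * stride; i += 4 * stride) {
        acc0 += in[i];
        acc1 += in[i + stride];
        acc2 += in[i + 2 * stride];
        acc3 += in[i + 3 * stride];
    }
    out[tid] = acc0 + acc1 + acc2 + acc3;  // per-thread partials; tail handling omitted
}

Compare the nvcc --ptxas-options=-v register report with and without the bounds to see what the cap costs in spills.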

Core Concept 4 - FMA and Instruction Throughput

Fused Multiply-Add

The FMA (Fused Multiply-Add) instruction computes a * b + c in a single instruction with a single rounding operation. This is architecturally significant:

  • Throughput: FMA costs the same as a plain FMUL or FADD - one instruction issue slot
  • Accuracy: one rounding instead of two, 0.5 ULP error bound instead of 1.0
  • Usage: replaces two instructions (FMUL + FADD) with one, freeing issue slots

On H100, FP32 FMA throughput is 128 operations per SM per cycle; FMUL alone and FADD alone also run at 128 per cycle. But an unfused multiply-add is two instructions consuming two issue slots, yielding only 64 multiply-add pairs per cycle. FMA does the same work in one instruction: 128 fused ops per cycle.

The compiler will automatically generate FMA when it can prove the intermediate result is not used. But sometimes it cannot prove this, especially when the intermediate value is stored to a variable that is later read:

// Compiler generates FMA automatically
float result = a * b + c; // -> FFMA result, a, b, c

// Compiler may NOT generate FMA - intermediate stored
float tmp = a * b; // -> FMUL tmp, a, b
float result = tmp + c; // -> FADD result, tmp, c (no FMA!)

// Force FMA with intrinsic
float result = __fmaf_rn(a, b, c); // -> FFMA always, round to nearest

The __fmaf_rn intrinsic guarantees FMA generation. Use it in inner loops where you want to eliminate any ambiguity about compiler output.

Instruction Throughput Table (H100)

| Instruction | FP32/SM/cycle | FP64/SM/cycle | INT32/SM/cycle |
|---|---|---|---|
| FMA | 128 | 64 | - |
| FMUL/FADD | 128 | 64 | - |
| INT MUL 32-bit | - | - | 32 |
| INT ADD | - | - | 128 |
| SFU (sin, cos, exp, rcp) | 16 | - | - |
| IMAD | - | - | 64 |

The SFU (Special Function Unit) throughput - 16 operations per SM per cycle - is 8x lower than FMA throughput. Any kernel with __sinf, __cosf, __expf, or reciprocal/rsqrt intrinsics in its hot path can be SFU-bound rather than FMA-bound. Use __expf (the fast hardware approximation) instead of expf (an accurate multi-instruction software sequence) for neural network activation functions when the precision loss is acceptable.
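As a concrete illustration, here is a sigmoid written both ways - a sketch; whether the fast variant is acceptable depends on your numeric tolerance:

// Accurate but slower: expf expands to a multi-instruction software sequence
__global__ void sigmoid_accurate(float* __restrict__ out,
                                 const float* __restrict__ in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 1.0f / (1.0f + expf(-in[i]));
}

// Fast: __expf maps to a short SFU-based approximation
__global__ void sigmoid_fast(float* __restrict__ out,
                             const float* __restrict__ in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 1.0f / (1.0f + __expf(-in[i]));
}

Compiling with nvcc --use_fast_math performs this substitution globally, which is often too blunt; the intrinsic lets you opt in per call site.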

Integer Arithmetic in Address Calculation

Address calculations inside loops often involve integer arithmetic. On a GPU, integer division and modulo are expensive - each compiles to a multi-instruction subroutine costing on the order of tens of cycles. Array indexing that requires division or modulo in the hot path will serialize the pipeline.

// SLOW - integer division in hot path
int row = idx / N;  // expensive integer division
int col = idx % N;  // expensive integer modulo

// FAST - precomputed strides, no division
int row = blockIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;
// Access: A[row * stride_A + col] -- stride_A precomputed, stored in a register

For 2D array indexing, pass the stride as a kernel parameter and compute addresses using multiplication and addition only.

Core Concept 5 - Reading PTX and SASS

Why Look at PTX?

PTX (Parallel Thread Execution) is NVIDIA's virtual ISA. It is the intermediate representation between CUDA C++ and the hardware SASS (Shader Assembly) instructions. Looking at PTX tells you:

  • Whether the compiler generated FMA or separate FMUL+FADD
  • Whether vector loads are being generated as LDG.128 or LDG.32
  • Whether loop unrolling produced the expected instruction replication
  • Whether __ldg or regular cache path is being used

To generate PTX from a CUDA file:

# Compile to PTX
nvcc -ptx -arch=sm_90 kernel.cu -o kernel.ptx

# Or inspect PTX in a compiled binary
cuobjdump --dump-ptx myprogram > kernel.ptx

What to Look For in PTX

// PTX for scalar float load
ld.global.f32 %f1, [%rd1]; // LDG.32 - 32-bit load

// PTX for float4 load (what you want)
ld.global.v4.f32 {%f1,%f2,%f3,%f4}, [%rd1]; // LDG.128 - 128-bit load

// PTX for FMA (what you want)
fma.rn.f32 %f5, %f1, %f2, %f3; // FFMA - fused

// PTX for separate multiply+add (bad in hot loops)
mul.f32 %f5, %f1, %f2; // FMUL
add.f32 %f6, %f5, %f3; // FADD

// PTX for __ldg load (read-only cache)
ld.global.nc.f32 %f1, [%rd1]; // nc = "non-coherent" = texture cache path

The key patterns to verify:

  • ld.global.v4.f32 confirms float4 loads are generated
  • fma.rn.f32 confirms FMA (not separate mul+add)
  • ld.global.nc confirms __ldg / read-only cache path
  • Loop body should appear N times after #pragma unroll N
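A quick way to spot-check these patterns without reading the whole file is to grep the generated PTX (kernel.ptx from the nvcc command above):

# Count fused multiply-adds vs separate multiplies
grep -c "fma.rn.f32" kernel.ptx
grep -c "mul.f32" kernel.ptx

# Confirm vectorized and read-only loads appear
grep "ld.global.v4.f32" kernel.ptx
grep "ld.global.nc" kernel.ptx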

SASS vs PTX

PTX is virtual and architecture-independent. SASS (Shader Assembly) is the actual hardware instructions that run. Use cuobjdump --dump-sass to see SASS:

cuobjdump --dump-sass myprogram | grep -A 5 "Function : myKernel"

Key SASS instructions to recognize:

// 128-bit global load (what float4 produces)
LDG.E.128 R4, [R2] ;

// FP32 FMA
FFMA R8, R4, R5, R6 ;

// Read-only cache load (what __ldg produces)
LDG.E.CONSTANT R4, [R2] ;

// 128-bit global store (what a float4 store produces)
STG.E.128 [R10], R4 ;

Full Code Example - Scalar vs Vectorized Kernel

#include <cuda_runtime.h>
#include <stdio.h>

// Kernel 1: Scalar elementwise scale - baseline
__global__ void scale_scalar(float* __restrict__ out,
                             const float* __restrict__ in,
                             float alpha, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        out[idx] = in[idx] * alpha;
    }
}

// Kernel 2: float4 vectorized - 4x fewer load/store instructions
__global__ void scale_float4(float4* __restrict__ out,
                             const float4* __restrict__ in,
                             float alpha, int n4) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n4) {
        float4 val = in[idx];  // LDG.128 - one 128-bit load
        val.x *= alpha;
        val.y *= alpha;
        val.z *= alpha;
        val.w *= alpha;
        out[idx] = val;  // STG.128 - one 128-bit store
    }
}

// Kernel 3: float4 with FMA intrinsics - maximum ILP
__global__ void scale_float4_fma(float4* __restrict__ out,
                                 const float4* __restrict__ in,
                                 float alpha, float bias, int n4) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n4) {
        float4 val = __ldg(&in[idx]);           // LDG.128 through read-only cache
        val.x = __fmaf_rn(val.x, alpha, bias);  // FFMA - guaranteed FMA
        val.y = __fmaf_rn(val.y, alpha, bias);
        val.z = __fmaf_rn(val.z, alpha, bias);
        val.w = __fmaf_rn(val.w, alpha, bias);
        out[idx] = val;
    }
}

int main() {
    const int N = 1 << 24;  // 16M elements
    const float alpha = 2.5f;
    const float bias = 1.0f;

    float *d_in, *d_out;
    cudaMalloc(&d_in, N * sizeof(float));
    cudaMalloc(&d_out, N * sizeof(float));

    // Initialize input
    // ... (cudaMemset or kernel init)

    dim3 block(256);

    // Kernel 1: scalar
    dim3 grid_scalar((N + block.x - 1) / block.x);
    scale_scalar<<<grid_scalar, block>>>(d_out, d_in, alpha, N);

    // Kernel 2: float4 (N must be divisible by 4)
    int N4 = N / 4;
    dim3 grid_vec((N4 + block.x - 1) / block.x);
    scale_float4<<<grid_vec, block>>>((float4*)d_out, (float4*)d_in, alpha, N4);

    // Kernel 3: float4 with FMA and __ldg
    scale_float4_fma<<<grid_vec, block>>>((float4*)d_out, (float4*)d_in,
                                          alpha, bias, N4);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
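The host code above launches the kernels but never times them. A minimal cudaEvent timing harness to wrap around any of the three launches - a sketch, with error checking omitted:

// Time one launch with CUDA events
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
scale_float4<<<grid_vec, block>>>((float4*)d_out, (float4*)d_in, alpha, N4);
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
// Effective bandwidth: N floats read + N floats written over the elapsed time
double gbps = (2.0 * N * sizeof(float)) / (ms * 1.0e6);
printf("scale_float4: %.3f ms, %.1f GB/s\n", ms, gbps);

cudaEventDestroy(start);
cudaEventDestroy(stop);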

Full Code Example - ILP-Optimized Reduction

This kernel shows the full ILP pattern for a parallel sum reduction. The key is processing 4 elements per thread with independent accumulators:

// High-ILP parallel sum reduction
// Each thread accumulates 4 independent sums, then reduces them
__global__ void sum_reduction_ilp4(const float* __restrict__ data,
                                   float* __restrict__ partial_sums,
                                   int n) {
    // 4 independent accumulators - 4 parallel dependency chains
    float acc0 = 0.0f;
    float acc1 = 0.0f;
    float acc2 = 0.0f;
    float acc3 = 0.0f;

    // Each thread steps by stride * 4 to handle 4 elements per iteration
    int stride = blockDim.x * gridDim.x;
    int base = blockIdx.x * blockDim.x + threadIdx.x;

    // Process 4 elements per iteration with 4 independent adds
    // The compiler sees acc0..acc3 are independent and can
    // schedule the four additions without dependency stalls
    #pragma unroll 4
    for (int i = base; i < n - stride * 3; i += stride * 4) {
        acc0 += data[i + stride * 0];  // independent
        acc1 += data[i + stride * 1];  // independent
        acc2 += data[i + stride * 2];  // independent
        acc3 += data[i + stride * 3];  // independent
    }

    // Combine accumulators
    float acc = acc0 + acc1 + acc2 + acc3;

    // Handle remainder elements
    for (int i = base + (n / (stride * 4)) * stride * 4; i < n; i += stride) {
        acc += data[i];
    }

    // Warp-level reduction using shuffle intrinsics
    #pragma unroll
    for (int offset = 16; offset > 0; offset >>= 1) {
        acc += __shfl_down_sync(0xffffffff, acc, offset);
    }

    // First thread in each warp writes to shared memory
    __shared__ float warp_sums[32];  // max 32 warps per block
    int lane = threadIdx.x & 31;
    int warp_id = threadIdx.x >> 5;
    if (lane == 0) warp_sums[warp_id] = acc;
    __syncthreads();

    // Final reduction in first warp
    if (warp_id == 0) {
        acc = (lane < (blockDim.x >> 5)) ? warp_sums[lane] : 0.0f;
        #pragma unroll
        for (int offset = 16; offset > 0; offset >>= 1) {
            acc += __shfl_down_sync(0xffffffff, acc, offset);
        }
        if (lane == 0) partial_sums[blockIdx.x] = acc;
    }
}
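Host-side usage is a two-pass pattern - a sketch, where d_data and n are assumed to be set up already and the 256-block grid is an arbitrary choice that fills the SMs while keeping the partials array small:

// Pass 1: each block writes one partial sum
int blocks = 256, threads = 256;
float *d_partials, *d_result;
cudaMalloc(&d_partials, blocks * sizeof(float));
cudaMalloc(&d_result, sizeof(float));
sum_reduction_ilp4<<<blocks, threads>>>(d_data, d_partials, n);

// Pass 2: a single block reduces the 256 partials to one value
sum_reduction_ilp4<<<1, threads>>>(d_partials, d_result, blocks);

float h_sum;
cudaMemcpy(&h_sum, d_result, sizeof(float), cudaMemcpyDeviceToHost);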

Full Code Example - Loop Unrolling in Matrix Multiply Inner Loop

// Naive matmul inner loop - sequential dependency on acc
__device__ float matmul_naive_inner(const float* __restrict__ A_tile,
                                    const float* __restrict__ B_tile,
                                    int row, int col, int K_tile) {
    float acc = 0.0f;
    for (int k = 0; k < K_tile; k++) {
        acc += A_tile[row * K_tile + k] * B_tile[k * 16 + col];  // 16 = tile width
        // acc has a 4-cycle dependency on the previous iteration
    }
    return acc;
}

// Unrolled matmul inner loop - 4 independent accumulators
__device__ float matmul_unrolled_inner(const float* __restrict__ A_tile,
                                       const float* __restrict__ B_tile,
                                       int row, int col, int K_tile) {
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;

    // K_tile should be a multiple of 4 for clean unrolling
    #pragma unroll 4
    for (int k = 0; k < K_tile; k += 4) {
        // All four FMAs are independent - one can issue every cycle
        acc0 = __fmaf_rn(A_tile[row * K_tile + k+0], B_tile[(k+0) * 16 + col], acc0);
        acc1 = __fmaf_rn(A_tile[row * K_tile + k+1], B_tile[(k+1) * 16 + col], acc1);
        acc2 = __fmaf_rn(A_tile[row * K_tile + k+2], B_tile[(k+2) * 16 + col], acc2);
        acc3 = __fmaf_rn(A_tile[row * K_tile + k+3], B_tile[(k+3) * 16 + col], acc3);
    }

    return acc0 + acc1 + acc2 + acc3;
}

Core Concept 6 - Instruction Scheduling and the Dependency Graph

Visualizing the Dependency Graph

Every kernel has an instruction dependency graph. An edge from instruction A to instruction B means B reads a register written by A - B cannot start until A completes. ILP optimization is the art of widening this graph: replacing a single long chain with multiple parallel chains.

Consider a dot product computation. The naive version has a single chain:

load a[0] --> mul a[0]*b[0] --> add to acc --> load a[1] --> mul --> add to acc --> ...
                                    ^
                                    acc depends on the previous add:
                                    a 4-cycle stall every iteration

The unrolled version creates four independent chains:

chain 0: load a[0] --> mul a[0]*b[0] --> add to acc0
chain 1: load a[1] --> mul a[1]*b[1] --> add to acc1 (independent of chain 0)
chain 2: load a[2] --> mul a[2]*b[2] --> add to acc2 (independent of chain 0,1)
chain 3: load a[3] --> mul a[3]*b[3] --> add to acc3 (independent of chain 0,1,2)

final: acc0 + acc1 + acc2 + acc3 (3 adds, but only once at the end)

The scheduler now has an independent instruction available every cycle. It rotates through the four chains, issuing one per cycle and hiding the 4-cycle FMA latency completely.

The Math: Optimal Unroll Factor

For an instruction with latency L cycles and a loop body with a single accumulator, the scheduler stalls for L-1 cycles between iterations. The effective throughput is 1 iteration per L cycles instead of 1 per cycle.

With N independent accumulators (unroll factor N), the scheduler needs only N >= L to eliminate stalls. For FP32 FMA on H100 with L=4, unroll factor 4 eliminates all stalls. Beyond N=4, there are no additional stall cycles to eliminate - the benefit plateaus while register usage keeps rising.

$$\text{required unroll factor} = \lceil L \times \text{IPC}_{\text{issue}} \rceil$$

Where L is the instruction latency in cycles and IPC_issue is the warp's issue rate in instructions per cycle. For H100, with one instruction per cycle per warp and 4-cycle FMA latency: required unroll = 4 × 1 = 4.

This formula explains why:

  • FP64 FMA (8-cycle latency) benefits from unroll factor 8
  • SFU ops (16-cycle effective latency) benefit from unroll factor 16
  • Simple integer adds (1-2 cycle latency) barely benefit from unrolling

Identifying ILP Opportunities in Real Code

The patterns that signal ILP potential:

// Pattern 1: Reduction with single accumulator - classic ILP opportunity
float sum = 0.0f;
for (int i = 0; i < N; i++) {
    sum += data[i];  // sum depends on previous sum - serialized
}
// Fix: 4 accumulators + 4x unroll

// Pattern 2: Transformation loop - already has ILP if iterations are independent
for (int i = 0; i < N; i++) {
    out[i] = in[i] * alpha + beta;  // each iteration independent
}
// Good ILP already - but float4 loads can improve instruction efficiency

// Pattern 3: Stencil reads - no loop-carried dependency
for (int i = 1; i < N-1; i++) {
    out[i] = in[i-1] + in[i] + in[i+1];  // reads only, no accumulation dep
}
// Independent iterations - good ILP, use vector loads where alignment allows

// Pattern 4: Scan/prefix sum - hard sequential dependency
float prefix = 0.0f;
for (int i = 0; i < N; i++) {
    prefix += data[i];
    out[i] = prefix;  // each output depends on all previous
}
// Cannot naively unroll - use warp-level shuffle primitives instead (see below)
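For reference, the warp-level shuffle primitive mentioned for Pattern 4 looks like this - a 32-lane inclusive prefix sum sketch, the building block of larger scans rather than a full device-wide scan:

// Inclusive prefix sum within one warp using __shfl_up_sync.
// Afterwards, lane i holds the sum of the values from lanes 0..i.
__device__ float warp_inclusive_scan(float val) {
    int lane = threadIdx.x & 31;
    #pragma unroll
    for (int offset = 1; offset < 32; offset <<= 1) {
        float up = __shfl_up_sync(0xffffffff, val, offset);
        if (lane >= offset) val += up;  // lanes below offset keep their value
    }
    return val;
}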

Production Engineering Notes

Profiling ILP with Nsight Compute

The metrics to collect for ILP analysis:

# Collect the core ILP metrics
ncu --metrics \
sm__inst_executed_pipe_fma.avg.pct_of_peak_sustained_active,\
smsp__warp_issue_stalled_wait_lns_per_warp,\
sm__warps_active.avg.pct_of_peak_sustained_active,\
l1tex__t_bytes_pipe_lsu_mem_global_op_ld.sum.pct_of_peak_sustained_elapsed \
--target-processes all \
./myprogram

Interpretation guide:

  • FMA utilization < 50%: likely ILP problem or memory bound
  • Stall wait > 15%: instruction dependency chain problem - add unrolling
  • Warp occupancy > 80% but FMA < 50%: definitely ILP problem, not occupancy
  • Warp occupancy < 30%: occupancy problem, reduce register usage before adding unrolling

Benchmarking ILP Impact: A Minimal Reproducer

Use this template to measure ILP improvement on your hardware:

#include <cuda_runtime.h>
#include <stdio.h>

// ILP-1: single accumulator (low ILP)
__global__ void dot_ilp1(const float* __restrict__ a,
                         const float* __restrict__ b,
                         float* __restrict__ out, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    float acc = 0.0f;
    for (int i = tid; i < n; i += stride) {
        acc = __fmaf_rn(a[i], b[i], acc);  // sequential dependency on acc
    }
    // warp reduction
    for (int offset = 16; offset > 0; offset >>= 1)
        acc += __shfl_down_sync(0xffffffff, acc, offset);
    if ((threadIdx.x & 31) == 0) atomicAdd(out, acc);
}

// ILP-4: four independent accumulators (high ILP)
__global__ void dot_ilp4(const float* __restrict__ a,
                         const float* __restrict__ b,
                         float* __restrict__ out, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    // Four independent accumulators = four parallel dependency chains
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    #pragma unroll 4
    for (int i = tid; i < n - stride*3; i += stride*4) {
        acc0 = __fmaf_rn(a[i],          b[i],          acc0);
        acc1 = __fmaf_rn(a[i+stride],   b[i+stride],   acc1);
        acc2 = __fmaf_rn(a[i+stride*2], b[i+stride*2], acc2);
        acc3 = __fmaf_rn(a[i+stride*3], b[i+stride*3], acc3);
    }
    float acc = acc0 + acc1 + acc2 + acc3;
    // handle remainder
    for (int i = tid + (n/(stride*4))*(stride*4); i < n; i += stride)
        acc = __fmaf_rn(a[i], b[i], acc);
    for (int offset = 16; offset > 0; offset >>= 1)
        acc += __shfl_down_sync(0xffffffff, acc, offset);
    if ((threadIdx.x & 31) == 0) atomicAdd(out, acc);
}

// Expected result on H100:
// dot_ilp1: ~25-35% of peak FMA throughput (limited by 4-cycle dependency)
// dot_ilp4: ~85-95% of peak FMA throughput (4 independent chains fill scheduler)

When NOT to Optimize for ILP

ILP optimization is not always the answer. It is the wrong tool when:

  1. Memory bound kernels: if the kernel is waiting on memory 80% of the time, fixing ILP only improves the remaining 20%. Fix memory access patterns first.

  2. Very low register budget: if you are already at 255 registers per thread (maximum), any unrolling that adds registers is impossible.

  3. Divergent warps: if warp divergence means 50% of threads are inactive, ILP in the remaining threads cannot recover the lost throughput.

  4. Short kernels: for kernels under 100 microseconds on modern hardware, launch overhead and L2 warm-up dominate. ILP optimization yields single-digit percentage improvements that are not worth the code complexity.

Precomputing Strides to Eliminate Hot-Path Division

// SLOW - division inside hot loop
__global__ void slow_kernel(float* data, int rows, int cols) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < rows * cols) {
        int row = i / cols;  // expensive integer division
        int col = i % cols;  // expensive integer modulo
        data[i] = compute(row, col);  // compute() is a placeholder
    }
}

// FAST - no division in hot loop
__global__ void fast_kernel(float* data, int rows, int cols, int stride) {
    int row = blockIdx.y;                             // from grid dimension - no arithmetic
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // from thread index - no arithmetic
    if (row < rows && col < cols) {
        data[row * stride + col] = compute(row, col); // only mul+add, no div
    }
}

Common Mistakes

:::danger Over-unrolling kills occupancy #pragma unroll 16 on a loop that loads 3 floats per iteration creates 48 live register variables. Combined with the accumulator registers, you can easily hit 128+ registers per thread. At 128 registers per thread on an SM with 65536 total registers, only 512 threads can run simultaneously - that is one thread block of 512, or two blocks of 256. Occupancy drops from 75% to 25%. Profile register usage with nvcc --ptxas-options=-v and check occupancy with Nsight before committing to high unroll factors. :::

:::danger float4 aliasing causes undefined behavior Casting a float* to float4* and then accessing the same memory through both pointer types in the same kernel is undefined behavior in C++ (a strict-aliasing violation). The compiler is allowed to assume the two differently-typed pointers do not alias and may reorder their loads and stores incorrectly. Always use __restrict__ on all pointer parameters, and never access the same buffer through both a float* and a float4* pointer in the same kernel. :::

:::warning FMA contraction is not guaranteed without intrinsics The compiler can combine a * b + c into FMA only if the intermediate result a * b is not observable. If you write float tmp = a * b; tmp += c;, the assignment to tmp makes the intermediate result observable and the compiler may not fuse it. Use __fmaf_rn(a, b, c) in inner loops where FMA generation is critical for performance. :::

:::warning Loop trip count must be compile-time constant for full unrolling #pragma unroll with no argument only fully unrolls loops with compile-time constant trip counts. If K is a runtime variable, #pragma unroll generates a partially unrolled loop with a residual scalar loop for the remainder. Use #pragma unroll 4 with a specific factor, or template the kernel on the trip count to guarantee complete unrolling. :::
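Templating on the trip count, as the warning suggests, might look like this sketch - dot_fixed and the dispatch values are illustrative, not an established API:

// K is a compile-time constant, so '#pragma unroll' fully unrolls the loop
template <int K>
__global__ void dot_fixed(const float* __restrict__ a,
                          const float* __restrict__ b,
                          float* __restrict__ out) {
    float acc = 0.0f;
    #pragma unroll
    for (int k = 0; k < K; k++) {
        acc = __fmaf_rn(a[k], b[k], acc);
    }
    out[threadIdx.x] = acc;  // illustrative write; real kernels index per thread
}

// Host side: dispatch to the instantiation matching the runtime size
// if (k == 16)      dot_fixed<16><<<grid, block>>>(d_a, d_b, d_out);
// else if (k == 32) dot_fixed<32><<<grid, block>>>(d_a, d_b, d_out);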

:::warning __ldg is only beneficial for truly read-only data The __ldg intrinsic and the const __restrict__ combination route loads through the read-only (non-coherent) cache. This is beneficial when the data is not written for the duration of the kernel. If the same addresses are written during the launch - by other threads or through another pointer - __ldg may return stale cached values, because the non-coherent path is not kept coherent with those writes. Only use __ldg for data that is read-only across the entire kernel launch. :::

Interview Questions and Answers

Q1: What is Instruction-Level Parallelism (ILP) on a GPU and why does it matter?

ILP on a GPU refers to the ability of a single warp to have multiple independent instructions ready for execution. Each warp is resident on one of the SM's four warp schedulers, and a scheduler issues at most one instruction per warp per cycle. If a warp's queued operations are independent - none depends on the result of another - the warp can issue one instruction every cycle. Without ILP, a warp with a sequential dependency chain issues only once per 4+ cycles (the FMA latency), leaving up to 75% of its issue bandwidth unused. ILP is distinct from occupancy: occupancy adds more warps to switch between, while ILP determines how well a single warp keeps its own pipeline full - which is what saves you when occupancy is capped by registers or shared memory. The fix is restructuring computations to use multiple independent accumulators, typically by manually unrolling loops 4x with separate register variables.

Q2: What is the advantage of a float4 load over four separate float loads?

A float4 load generates a single LDG.128 (128-bit load) instruction, consuming one instruction issue slot and one memory transaction. Four separate float loads generate four LDG.32 instructions, consuming four instruction issue slots and four memory transactions. The instruction count reduction frees three issue slots per load site for computation instructions, improving the compute-to-memory instruction ratio. The single 128-bit memory transaction is also more efficient than four 32-bit transactions because it requires only one address calculation and one memory system arbitration. The primary requirement is 16-byte alignment of the base address, which cudaMalloc guarantees (256-byte aligned). For input arrays not allocated by cudaMalloc, verify alignment with assert(((uintptr_t)ptr % 16) == 0).

Q3: How does #pragma unroll N improve performance and what are its risks?

#pragma unroll N replicates the loop body N times in the compiled code, creating N independent copies of the loop body computation. When those N copies use different register variables (different accumulators, different loaded values), they create N parallel dependency chains that the scheduler can interleave. On a 4-cycle FMA latency, 4 independent accumulators means the scheduler can issue one instruction per cycle (filling the pipeline) instead of one per 4 cycles. The risks are: (1) register pressure - N copies means N times the register usage, potentially reducing occupancy; (2) code bloat - very large unroll factors increase instruction cache pressure; (3) diminishing returns - beyond 4-8x unrolling, the marginal ILP gain falls while register pressure keeps rising. Always verify with nvcc --ptxas-options=-v that register count is within acceptable bounds and check occupancy with Nsight Compute after applying unrolling.

Q4: How do you diagnose an ILP bottleneck using Nsight Compute?

The primary diagnostic metric is smsp__warp_issue_stalled_wait_lns_per_warp - this measures cycles per warp spent stalled because the next instruction is waiting on a register dependency. If this metric is above 10-15% of total warp cycles, ILP is the bottleneck. Cross-reference with sm__inst_executed_pipe_fma.avg.pct_of_peak_sustained_active - if FMA utilization is low but warp occupancy is normal, the warps are present but not issuing FMAs every cycle. Also check smsp__issue_active.avg.pct_of_peak_sustained_active - this measures what fraction of cycles the schedulers are actively issuing instructions. Low issue activity with normal occupancy confirms that warps have instructions pending but they are not ready due to dependencies.

Q5: What is the difference between ILP and occupancy, and when does each matter more?

Occupancy is the ratio of active warps to the maximum possible warps on an SM, determined by register count, shared memory usage, and thread block size. High occupancy provides more warps for the scheduler to switch between, which is critical for hiding memory latency (400-800 cycles). ILP is about how many independent instructions a single warp can keep in flight, which hides arithmetic instruction latency (4-8 cycles) and keeps each warp's issue slot productive. For memory-bound kernels (arithmetic intensity < 10 FLOP/byte), maximizing occupancy matters most because memory latency dominates and you need warps to switch to. For compute-bound kernels with serial dependency chains, ILP matters more - a warp that feeds only one instruction per 4 cycles uses a quarter of its issue bandwidth, and when there are not enough eligible warps to fill the gaps, SM throughput drops accordingly. The Roofline model tells you which regime you are in: plot your kernel's arithmetic intensity against the hardware's compute-to-memory bandwidth ratio.

Q6: Explain the FMA instruction and when the compiler will and will not generate it automatically.

FMA (Fused Multiply-Add) computes a * b + c in a single hardware instruction with a single rounding operation, versus the two-instruction sequence FMUL followed by FADD, which rounds twice and uses two issue slots. The compiler generates FMA automatically when it can prove the intermediate product a * b is not separately observable - meaning the product is not stored to a named variable that is read elsewhere. It will generate FMA for float r = a * b + c; but may not for float tmp = a * b; float r = tmp + c; because tmp is observable. To guarantee FMA in performance-critical inner loops, use the intrinsic __fmaf_rn(a, b, c), which maps directly to a single FFMA SASS instruction. For double precision, use __fma_rn. The advantage is most pronounced in tight inner loops where the difference between 64 and 128 multiply-add results per SM per cycle is meaningful - for example, the K-dimension accumulation loop in a matrix multiply kernel.

Q7: How would you optimize an elementwise kernel that currently runs at 40% of memory bandwidth?

First, verify the bottleneck is instruction overhead, not a memory access pattern issue. If coalescing is perfect (sequential threads access sequential addresses), the culprit is likely instruction throughput. Step 1: vectorize loads to float4 - this cuts the load instruction count by 4x, freeing issue slots for computation. Step 2: check whether the computation per loaded element is sufficient to overlap with memory latency - for a pure scale operation (1 FLOP per element), even perfect vectorization may leave the kernel memory-bound; consider fusing with another operation (scale + bias + relu). Step 3: if instruction issue is still the bottleneck, apply 4x loop unrolling to create 4 independent computation chains per warp. Step 4: add __restrict__ (with const) to pointer parameters to enable compiler-generated read-only cache loads. Step 5: profile with ncu --set full and check l1tex__t_bytes_pipe_lsu_mem_global_op_ld.sum.pct_of_peak_sustained_elapsed - if memory bandwidth is at 95%+ of peak after vectorization, you have reached the memory bandwidth limit and further instruction optimization cannot help.

Summary

Instruction-level optimization extracts throughput from work that memory bandwidth allows but instruction scheduling limits. The four core techniques are:

  • ILP via unrolling: four independent accumulators give four parallel dependency chains, keeping the warp issuing every cycle instead of once per FMA latency
  • float4 loads: one 128-bit instruction instead of four 32-bit instructions, 4x reduction in load instruction count
  • FMA: guaranteed single-instruction multiply-add via __fmaf_rn, no rounding overhead from separate FMUL+FADD
  • PTX verification: read the PTX to confirm the compiler generated the instructions you intended

The diagnostic workflow: Nsight Compute, stall-wait metric, FMA utilization, then occupancy. Fix memory first (coalescing, cache hit rate), then ILP, then occupancy. Never apply ILP blindly - profile before and after to confirm the improvement is real.

The gap between 31% and 78% of peak throughput is real and reproducible. The fix is not algorithmic - it is understanding what the instruction scheduler needs and giving it an independent instruction every cycle instead of one every four.
