Thread Blocks, Warps, and Grids
Reading time: ~40 min - Interview relevance: High - Target roles: CUDA Developer, ML Researcher, Systems Engineer
Thread indexing is not a detail you can skim. It is the single most common source of CUDA bugs: silent wrong answers, out-of-bounds writes, and correctness that disappears at scale. Get this exactly right and everything else is straightforward.
The Crash at a Million Elements
You wrote a CUDA kernel. You tested it on a vector of 1,000 elements and it worked perfectly. You ran it on 10,000 elements - still fine. You submit your benchmark job with 1,000,000 elements and it crashes with a memory access violation, or worse, it silently returns wrong answers for the last chunk of data.
This scenario plays out constantly for engineers learning CUDA. The kernel logic is correct. The indexing formula was correct - for the small test cases. The bug is in how you calculated your grid dimensions, and it only manifests when the problem size exceeds a certain threshold.
Here is what went wrong. Suppose your kernel has int blockSize = 256 and you computed the grid as:
int gridSize = N / blockSize; // integer division, truncates
For N = 1,000: 1000 / 256 = 3 blocks. You launch 768 threads. Elements 768 through 999 are never processed. Silent data corruption.
For N = 1,000,000: 1000000 / 256 = 3906 blocks. You launch 999,936 threads. The last 64 elements are silently skipped, and if your output buffer was initialized to zero you might not even notice.
The fix is one line:
int gridSize = (N + blockSize - 1) / blockSize; // ceiling division
But to understand why this fix works, and to understand all the other indexing bugs that are waiting for you, you need to build a complete mental model of the CUDA thread hierarchy from the ground up.
Why This Abstraction Exists
Before CUDA, parallel programming on GPUs required mapping your computation onto graphics primitives. If you wanted to process 1,000,000 data points you had to figure out what texture size to use, how to lay out your data as pixels, and how to write a shader that would be invoked once per pixel.
CUDA introduced a clean abstraction: instead of pixels and shaders, you have threads and kernels. A thread is the basic unit of execution - it runs your kernel function once. Multiple threads are grouped into a block. Multiple blocks form a grid. You specify the shape of this hierarchy when you launch a kernel.
The reason for this three-level structure (thread, block, grid) is hardware reality:
- Threads within a block share fast on-chip shared memory and can synchronize with barriers (__syncthreads()).
- Blocks are independently scheduled onto Streaming Multiprocessors (SMs). Blocks do not share on-chip memory and cannot communicate with each other except through global memory writes.
- Grids cover your entire problem - as many blocks as you need to process all your data.
This design means that the hardware can run any number of blocks in any order, mapping them to available SMs as they become free. Your kernel just needs to compute the correct global index from the thread/block coordinates.
The Three-Level Hierarchy
Each level has its own built-in variables that are available inside any __global__ or __device__ function:
| Variable | Type | Meaning |
|---|---|---|
| threadIdx.x/y/z | uint3 | Thread's index within its block |
| blockIdx.x/y/z | uint3 | Block's index within the grid |
| blockDim.x/y/z | dim3 | Number of threads per block in each dimension |
| gridDim.x/y/z | dim3 | Number of blocks in the grid in each dimension |
These are not global variables you set. They are hardware-injected constants. The GPU instantiates each thread with its own unique copy of threadIdx and blockIdx, which is how thousands of thread instances of the same function all compute different results.
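To see this in action, here is a minimal sketch (the kernel name and launch shape are illustrative, not part of the later examples) that prints the coordinates each thread receives. Launched as 2 blocks of 4 threads, every line shows a different (blockIdx.x, threadIdx.x) pair even though all eight threads run the same function.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void show_coordinates() {
    // Each thread reads its own hardware-supplied blockIdx and threadIdx
    int global_id = blockIdx.x * blockDim.x + threadIdx.x;
    printf("block %d, thread %d -> global id %d\n",
           blockIdx.x, threadIdx.x, global_id);
}

int main() {
    show_coordinates<<<2, 4>>>();   // 2 blocks x 4 threads = 8 lines of output
    cudaDeviceSynchronize();        // wait so device printf output is flushed
    return 0;
}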
The Hardware Limit Table
Before writing a single kernel, memorize these limits:
| Parameter | Limit | Notes |
|---|---|---|
| Threads per block (total) | 1024 | blockDim.x * blockDim.y * blockDim.z <= 1024 |
| blockDim.x | 1024 | Maximum in x dimension |
| blockDim.y | 1024 | Maximum in y dimension |
| blockDim.z | 64 | Maximum in z dimension |
| gridDim.x | ~2.1 billion | 2^31 - 1 = 2,147,483,647 |
| gridDim.y | 65535 | 2^16 - 1 |
| gridDim.z | 65535 | 2^16 - 1 |
These limits come from the hardware: the SM can track at most 1024 live threads per block due to register file constraints, and the grid dimension registers are fixed width. Exceeding these limits causes cudaErrorInvalidConfiguration at launch time - or silent launch failure if you do not check errors.
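Because an invalid launch fails without throwing anything, the check has to be explicit. Here is a minimal error-checking sketch; the helper name and the no-op kernel are illustrative, not part of the later examples. cudaGetLastError() surfaces launch-configuration failures immediately, while errors raised during kernel execution only appear after a synchronization.

#include <cstdio>
#include <cuda_runtime.h>

// Report any pending CUDA error with a short label (helper name is illustrative)
void check_cuda(const char* label) {
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "[%s] CUDA error: %s\n", label, cudaGetErrorString(err));
    }
}

__global__ void noop_kernel() {}

void launch_with_checks() {
    noop_kernel<<<1, 2048>>>();    // 2048 > 1024 threads: invalid configuration
    check_cuda("launch");          // launch-configuration errors show up here
    cudaDeviceSynchronize();       // execution errors surface at synchronization
    check_cuda("sync");
}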
Warps: The Real Unit of Execution
The thread abstraction says: each thread runs independently. The hardware reality is different.
The GPU does not actually execute individual threads. It executes warps - groups of 32 consecutive threads that execute in lockstep on a single SM's execution units. When an SM is assigned a block, it divides the block's threads into warps:
- Threads 0-31 form warp 0
- Threads 32-63 form warp 1
- Threads 64-95 form warp 2
- And so on
All 32 threads in a warp execute the same instruction at the same time. This is SIMT (Single Instruction Multiple Thread) - the GPU equivalent of CPU SIMD, but at a higher abstraction level.
Why Block Size Must Be a Multiple of 32
If your block size is not a multiple of 32, the last warp is partially filled. Suppose your block has 100 threads:
- Warp 0: threads 0-31 (full)
- Warp 1: threads 32-63 (full)
- Warp 2: threads 64-95 (full)
- Warp 3: threads 96-99 (4 active threads, 28 idle)
The hardware still issues instructions across all 32 lanes of warp 3. The 28 inactive lanes are masked out: they do no useful work and have no side effects, yet the warp still consumes a full scheduler slot and a register allocation sized for 32 threads while only 4 contribute. This is not branch divergence; it is wasted warp occupancy caused by a block size that does not align to the warp size - subtle and costly.
Always use block sizes that are multiples of 32: 32, 64, 128, 256, 512, 1024.
Warp ID Within a Block
You can compute the warp ID of any thread within its block:
// For a 1D block
int warpId = threadIdx.x / 32;
int laneId = threadIdx.x % 32; // thread's position within its warp (0-31)
laneId is important for warp-level primitives like __shfl_sync() (warp shuffle) and __ballot_sync() (warp voting), which we cover in later lessons.
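The warp primitives themselves come in a later lesson, but as a brief preview of where laneId shows up, here is a minimal sketch of a warp-wide sum using __shfl_down_sync(); warp_sums is assumed to hold one float per warp (ceiling of n/32 entries).

__global__ void warp_sum_demo(const float* input, float* warp_sums, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int laneId = threadIdx.x % 32;
    float val = (idx < n) ? input[idx] : 0.0f;   // out-of-range lanes contribute 0

    // Each step folds the upper half of the warp onto the lower half
    for (int offset = 16; offset > 0; offset /= 2) {
        val += __shfl_down_sync(0xffffffff, val, offset);
    }

    // After the loop, lane 0 of each warp holds the sum of its 32 elements
    if (laneId == 0 && idx < n) {
        warp_sums[idx / 32] = val;
    }
}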
1D Thread Indexing
The most common case: your data is a flat array of elements. You want each thread to process one element.
The Global Index Formula
int idx = blockIdx.x * blockDim.x + threadIdx.x;
Let us trace through this for a concrete example: N = 1000, blockSize = 256, gridSize = 4 (ceiling of 1000/256).
| Thread | blockIdx.x | blockDim.x | threadIdx.x | idx |
|---|---|---|---|---|
| First thread of block 0 | 0 | 256 | 0 | 0 |
| Last thread of block 0 | 0 | 256 | 255 | 255 |
| First thread of block 1 | 1 | 256 | 0 | 256 |
| Last thread of block 1 | 1 | 256 | 255 | 511 |
| First thread of block 3 | 3 | 256 | 0 | 768 |
| Thread at element 999 | 3 | 256 | 231 | 999 |
| Thread at element 1000 | 3 | 256 | 232 | 1000 - OUT OF BOUNDS |
| Last thread of block 3 | 3 | 256 | 255 | 1023 - OUT OF BOUNDS |
Block 3 has threads with idx values 768 through 1023. But we only have elements 0 through 999. Threads 1000-1023 in block 3 would access memory past the end of our array. This is why the boundary guard is mandatory.
Complete 1D Kernel Template
__global__ void kernel_1d(const float* input, float* output, int n) {
// Step 1: compute global index
int idx = blockIdx.x * blockDim.x + threadIdx.x;
// Step 2: boundary guard - ALWAYS REQUIRED
if (idx >= n) return;
// Step 3: actual computation using idx to index into data
output[idx] = input[idx] * 2.0f; // example: double each element
}
// Launch from host:
void launch_1d(const float* d_input, float* d_output, int n) {
int block_size = 256;
int grid_size = (n + block_size - 1) / block_size; // ceiling division
kernel_1d<<<grid_size, block_size>>>(d_input, d_output, n);
}
Why Ceiling Division Works
The formula (n + block_size - 1) / block_size is integer ceiling division. It ensures you always have at least enough blocks to cover all elements, even when n is not a multiple of block_size.
For n = 1000, block_size = 256:
4 blocks × 256 threads = 1024 threads launched, but only 1000 are valid. The boundary guard if (idx >= n) return filters out threads 1000-1023 harmlessly.
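A small host-side helper keeps this idiom in one place; the name ceil_div is my own shorthand, not something the later examples rely on.

// Integer ceiling division: smallest g such that g * block_size >= n
inline int ceil_div(int n, int block_size) {
    return (n + block_size - 1) / block_size;
}

// ceil_div(1000, 256) == 4, ceil_div(1024, 256) == 4, ceil_div(1025, 256) == 5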
2D Thread Indexing
Many ML operations are naturally 2D: matrix multiplication, convolution, attention score computation, image processing. For these, you want each thread to own one element of a 2D output.
The 2D Index Formulas
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;
The convention in CUDA (matching linear algebra) is:
- The x dimension handles columns
- The y dimension handles rows
Using dim3 for 2D Blocks and Grids
dim3 is CUDA's three-component integer type for specifying block and grid dimensions. For 2D:
// Block of 16 columns x 16 rows = 256 threads
dim3 blockDim(16, 16); // blockDim.x=16, blockDim.y=16, blockDim.z=1
// Grid covers an M x N matrix
// Ceiling division in each dimension
dim3 gridDim(
(N + blockDim.x - 1) / blockDim.x, // blocks needed in x (columns)
(M + blockDim.y - 1) / blockDim.y // blocks needed in y (rows)
);
// Launch
matrix_kernel<<<gridDim, blockDim>>>(args);
Complete 2D Kernel: Matrix Scale
// Scale every element of a matrix by a scalar
__global__ void matrix_scale_2d(
const float* __restrict__ input,
float* __restrict__ output,
float scale,
int rows,
int cols
) {
// Compute 2D position
int col = blockIdx.x * blockDim.x + threadIdx.x;
int row = blockIdx.y * blockDim.y + threadIdx.y;
// Boundary guards for both dimensions
if (row >= rows || col >= cols) return;
// Row-major indexing: element [row][col] is at offset row*cols + col
int idx = row * cols + col;
output[idx] = input[idx] * scale;
}
// Launch for a 768 x 1024 matrix
void launch_matrix_scale(
const float* d_input,
float* d_output,
float scale,
int rows, // 768
int cols // 1024
) {
dim3 blockDim(16, 16); // 256 threads per block, 16x16 tile
dim3 gridDim(
(cols + blockDim.x - 1) / blockDim.x, // (1024+15)/16 = 64
(rows + blockDim.y - 1) / blockDim.y // (768+15)/16 = 48
);
// Total: 64*48 = 3072 blocks, each 16x16 = 256 threads
// Total threads: 786432, matrix elements: 786432 (exact fit in this case)
matrix_scale_2d<<<gridDim, blockDim>>>(d_input, d_output, scale, rows, cols);
}
Why 16x16 and Not 32x1 for 2D?
A block of 32x1 threads has 32 threads in a single warp, which is the minimum. A block of 16x16 threads has 256 threads, forming 8 warps. More warps per block gives the SM scheduler more options to hide latency by switching between warps while one is waiting on memory. The 16x16 tile also maps naturally to shared memory tiling patterns used in matrix multiplication (covered in a later lesson).
For pure elementwise 2D operations (no shared memory), 32x4 or 16x8 or 16x16 are all reasonable. The key constraint is that blockDim.x * blockDim.y <= 1024.
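One more detail worth knowing when choosing 2D block shapes: warps are carved out of the block's threads in linearized order, with x varying fastest. A minimal sketch of that mapping:

// Inside a kernel with a multi-dimensional block, warps are formed from the
// linearized thread ID (x fastest, then y, then z):
int linear_tid = threadIdx.z * (blockDim.y * blockDim.x)
               + threadIdx.y * blockDim.x
               + threadIdx.x;
int warpId = linear_tid / 32;   // in a 16x16 block, rows y=0 and y=1 form warp 0
int laneId = linear_tid % 32;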
3D Thread Indexing
3D indexing is used for volumetric operations: 3D convolutions, volumetric medical imaging, physics simulations on voxel grids, and batched 2D operations where the batch dimension becomes the z-axis.
The 3D Index Formulas
int x = blockIdx.x * blockDim.x + threadIdx.x; // column
int y = blockIdx.y * blockDim.y + threadIdx.y; // row
int z = blockIdx.z * blockDim.z + threadIdx.z; // depth / batch
Complete 3D Kernel: Batched Matrix Scale
// Apply a different scale to each matrix in a batch
// batch_scales[b] is the scale for the b-th matrix
__global__ void batched_matrix_scale_3d(
const float* __restrict__ input, // [batch, rows, cols]
float* __restrict__ output,
const float* __restrict__ batch_scales,
int batch,
int rows,
int cols
) {
int col = blockIdx.x * blockDim.x + threadIdx.x;
int row = blockIdx.y * blockDim.y + threadIdx.y;
int b = blockIdx.z * blockDim.z + threadIdx.z;
if (col >= cols || row >= rows || b >= batch) return;
// 3D row-major indexing
int idx = b * (rows * cols) + row * cols + col;
output[idx] = input[idx] * batch_scales[b];
}
void launch_batched_scale(
const float* d_input,
float* d_output,
const float* d_scales,
int batch, // e.g. 32
int rows, // e.g. 64
int cols // e.g. 64
) {
// z dimension handles batch; limit is 64 for blockDim.z
dim3 blockDim(16, 16, 1); // 256 threads per block
dim3 gridDim(
(cols + blockDim.x - 1) / blockDim.x,
(rows + blockDim.y - 1) / blockDim.y,
(batch + blockDim.z - 1) / blockDim.z
);
batched_matrix_scale_3d<<<gridDim, blockDim>>>(
d_input, d_output, d_scales, batch, rows, cols
);
}
For 3D problems, blockDim.z = 1 is common when the z dimension is a batch index and you want the x/y tiles to maximize occupancy. Using blockDim.z > 1 makes sense when the z dimension has many elements and you want locality within the block in all three dimensions (e.g., a convolution over a 3D volume).
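For that volumetric case, a launch configuration along these lines (the kernel name and dim_x/dim_y/dim_z extents are placeholders) gives each block a small 3D neighborhood at the cost of a smaller x/y tile:

// 8 x 8 x 4 = 256 threads per block, each block covering a 3D sub-volume
dim3 blockDim3d(8, 8, 4);   // blockDim.z = 4 is well under the limit of 64
dim3 gridDim3d(
    (dim_x + blockDim3d.x - 1) / blockDim3d.x,
    (dim_y + blockDim3d.y - 1) / blockDim3d.y,
    (dim_z + blockDim3d.z - 1) / blockDim3d.z
);
volume_kernel<<<gridDim3d, blockDim3d>>>(/* ... */);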
Stride Loops: Handling Very Large Inputs
There is an upper bound on how many threads you can launch in a single grid. gridDim.x maxes out at roughly 2.1 billion (2^31 - 1), which at 256 threads per block means you can address about 550 billion elements in a single launch - larger than any realistic problem. But gridDim.y and gridDim.z max out at 65535. For very large 2D or 3D problems, you may hit this limit.
The solution is the grid-stride loop pattern:
__global__ void grid_stride_1d(
const float* __restrict__ input,
float* __restrict__ output,
int n
) {
// Each thread processes multiple elements, striding by the total grid width
int stride = blockDim.x * gridDim.x; // total threads in the grid
int idx = blockIdx.x * blockDim.x + threadIdx.x;
for (; idx < n; idx += stride) {
output[idx] = input[idx] * 2.0f;
}
}
// Launch with a fixed number of blocks (not proportional to N)
void launch_grid_stride(const float* d_in, float* d_out, int n) {
int block_size = 256;
// Use a fixed number of blocks sized to saturate the device - e.g. a small
// multiple of the SM count (an A100 has 108 SMs).
// This avoids the overhead of launching millions of blocks for huge n.
int grid_size = 1024; // or derive from cudaGetDeviceProperties / the occupancy API
grid_stride_1d<<<grid_size, block_size>>>(d_in, d_out, n);
}
Grid-stride loops have an additional advantage: the kernel can be tested with a single block and single thread (<<<1, 1>>>) for debugging, and the loop logic is identical. Correctness does not depend on a specific launch configuration.
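A minimal sketch of that debugging property, reusing grid_stride_1d and the d_in/d_out/n arguments from the launcher above: the single-thread launch is slow, but it produces the same output, which makes it useful for isolating indexing bugs from launch-configuration bugs.

// Same kernel, three launch configurations - all produce identical results
grid_stride_1d<<<1, 1>>>(d_in, d_out, n);        // one thread walks the whole array
grid_stride_1d<<<1, 256>>>(d_in, d_out, n);      // one block
grid_stride_1d<<<1024, 256>>>(d_in, d_out, n);   // full-size launch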
Choosing Block Size: A Practical Guide
The question "what block size should I use?" has a real answer based on SM occupancy theory.
What Is Occupancy?
Occupancy is the ratio of warps resident on an SM to the maximum number of warps the SM can support. A100 SMs support up to 64 resident warps. If your block contains 16 warps (512 threads) and each SM fits 4 such blocks, you get 64 resident warps - 100% occupancy. But if register or shared memory usage per block is high, fewer blocks fit, and occupancy drops.
Higher occupancy generally helps hide memory latency (the SM can switch to a ready warp while another is waiting for DRAM). But occupancy is not a monotone function of block size - there are trade-offs.
The Practical Decision Tree
The canonical rule of thumb: start with 256 threads per block. Then measure.
CUDA Occupancy API
CUDA provides a runtime API to compute optimal block size automatically:
#include <cuda_runtime.h>
#include <cstdio>
// Assumes my_kernel is a __global__ kernel defined elsewhere in this file
__global__ void my_kernel(float* data, int n);
void query_optimal_block_size() {
int blockSize; // optimal block size chosen
int minGridSize; // minimum grid size for full occupancy
cudaOccupancyMaxPotentialBlockSize(
&minGridSize,
&blockSize,
my_kernel, // the kernel function pointer
0, // dynamic shared memory per block
0 // block size limit (0 = use hardware max)
);
printf("Optimal block size: %d\n", blockSize);
printf("Min grid size for full occupancy: %d\n", minGridSize);
}
This does not profile your kernel. It uses register count (extracted from the compiled binary) to compute the block size that maximizes theoretical occupancy. It is a good starting point but not a substitute for actual profiling.
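A companion API answers the inverse question: for a block size you have already chosen, how many blocks can be resident per SM? A sketch, reusing the my_kernel declaration and includes from the snippet above, that turns the answer into a theoretical occupancy figure:

void report_occupancy(int blockSize) {
    int numBlocks = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &numBlocks, my_kernel, blockSize, 0 /* dynamic shared memory */);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Resident warps per SM vs. the hardware maximum
    int activeWarps = numBlocks * blockSize / prop.warpSize;
    int maxWarps = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    printf("Theoretical occupancy at block size %d: %.0f%%\n",
           blockSize, 100.0 * activeWarps / maxWarps);
}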
Full Python Example via torch.utils.cpp_extension
Here is a complete working example with all three indexing patterns - 1D, 2D, and batched 2D (3D) - wrapped for use from Python.
The CUDA C++ Kernels
// indexing_demo.cu
#include <cuda_runtime.h>
#include <stdint.h>
// ============================================================
// 1D: elementwise ReLU
// ============================================================
__global__ void relu_1d_kernel(
const float* __restrict__ input,
float* __restrict__ output,
int n
) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx >= n) return;
output[idx] = fmaxf(input[idx], 0.0f);
}
extern "C" void launch_relu_1d(
const float* input, float* output, int n, cudaStream_t stream
) {
int block_size = 256;
int grid_size = (n + block_size - 1) / block_size;
relu_1d_kernel<<<grid_size, block_size, 0, stream>>>(input, output, n);
}
// ============================================================
// 2D: add bias to a [rows x cols] matrix
// bias has shape [cols] - broadcast across rows
// ============================================================
__global__ void add_bias_2d_kernel(
const float* __restrict__ input,
const float* __restrict__ bias,
float* __restrict__ output,
int rows,
int cols
) {
int col = blockIdx.x * blockDim.x + threadIdx.x;
int row = blockIdx.y * blockDim.y + threadIdx.y;
if (row >= rows || col >= cols) return;
int idx = row * cols + col;
output[idx] = input[idx] + bias[col];
}
extern "C" void launch_add_bias_2d(
const float* input,
const float* bias,
float* output,
int rows,
int cols,
cudaStream_t stream
) {
dim3 block_dim(16, 16);
dim3 grid_dim(
(cols + block_dim.x - 1) / block_dim.x,
(rows + block_dim.y - 1) / block_dim.y
);
add_bias_2d_kernel<<<grid_dim, block_dim, 0, stream>>>(
input, bias, output, rows, cols
);
}
// ============================================================
// 3D: batched matrix ReLU with per-batch scaling
// input: [batch, rows, cols]
// scales: [batch]
// ============================================================
__global__ void batched_scale_relu_kernel(
const float* __restrict__ input,
const float* __restrict__ scales,
float* __restrict__ output,
int batch,
int rows,
int cols
) {
int col = blockIdx.x * blockDim.x + threadIdx.x;
int row = blockIdx.y * blockDim.y + threadIdx.y;
int b = blockIdx.z; // one block per batch item in z
if (col >= cols || row >= rows || b >= batch) return;
int idx = b * (rows * cols) + row * cols + col;
float val = input[idx] * scales[b];
output[idx] = fmaxf(val, 0.0f); // scale then ReLU
}
extern "C" void launch_batched_scale_relu(
const float* input,
const float* scales,
float* output,
int batch,
int rows,
int cols,
cudaStream_t stream
) {
dim3 block_dim(16, 16, 1);
dim3 grid_dim(
(cols + block_dim.x - 1) / block_dim.x,
(rows + block_dim.y - 1) / block_dim.y,
batch // one block-z per batch item (assumes batch <= 65535)
);
batched_scale_relu_kernel<<<grid_dim, block_dim, 0, stream>>>(
input, scales, output, batch, rows, cols
);
}
The C++ PyTorch Binding
// indexing_ext.cpp
#include <torch/extension.h>
#include <ATen/cuda/CUDAContext.h>
// Declare launchers
void launch_relu_1d(const float* in, float* out, int n, cudaStream_t s);
void launch_add_bias_2d(const float* in, const float* bias, float* out,
int rows, int cols, cudaStream_t s);
void launch_batched_scale_relu(const float* in, const float* scales,
float* out, int batch, int rows, int cols,
cudaStream_t s);
torch::Tensor relu_1d(torch::Tensor input) {
TORCH_CHECK(input.is_cuda() && input.is_contiguous());
TORCH_CHECK(input.dtype() == torch::kFloat32);
auto output = torch::empty_like(input);
cudaStream_t stream = at::cuda::getCurrentCUDAStream();
launch_relu_1d(
input.data_ptr<float>(),
output.data_ptr<float>(),
input.numel(),
stream
);
return output;
}
torch::Tensor add_bias_2d(torch::Tensor input, torch::Tensor bias) {
TORCH_CHECK(input.is_cuda() && bias.is_cuda());
TORCH_CHECK(input.is_contiguous() && bias.is_contiguous());
TORCH_CHECK(input.dim() == 2 && bias.dim() == 1);
TORCH_CHECK(bias.size(0) == input.size(1), "bias size must match cols");
auto output = torch::empty_like(input);
cudaStream_t stream = at::cuda::getCurrentCUDAStream();
launch_add_bias_2d(
input.data_ptr<float>(),
bias.data_ptr<float>(),
output.data_ptr<float>(),
input.size(0), // rows
input.size(1), // cols
stream
);
return output;
}
torch::Tensor batched_scale_relu(
torch::Tensor input,
torch::Tensor scales
) {
TORCH_CHECK(input.is_cuda() && scales.is_cuda());
TORCH_CHECK(input.is_contiguous() && scales.is_contiguous());
TORCH_CHECK(input.dim() == 3); // [batch, rows, cols]
auto output = torch::empty_like(input);
cudaStream_t stream = at::cuda::getCurrentCUDAStream();
launch_batched_scale_relu(
input.data_ptr<float>(),
scales.data_ptr<float>(),
output.data_ptr<float>(),
input.size(0), // batch
input.size(1), // rows
input.size(2), // cols
stream
);
return output;
}
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
m.def("relu_1d", &relu_1d, "1D ReLU (CUDA)");
m.def("add_bias_2d", &add_bias_2d, "2D bias add (CUDA)");
m.def("batched_scale_relu", &batched_scale_relu, "3D batched scale+ReLU (CUDA)");
}
Build and Test Script
# build_and_test.py
import torch
import torch.nn.functional as F
from torch.utils.cpp_extension import load
# JIT compile on first run, uses cache on subsequent runs
ext = load(
name="indexing_ext",
sources=["indexing_ext.cpp", "indexing_demo.cu"],
extra_cuda_cflags=["-O3", "--use_fast_math", "-arch=sm_80"],
verbose=False
)
device = torch.device("cuda")
# ---- Test 1D ReLU ----
x1d = torch.randn(1_000_003, device=device) # odd size to test boundary guard
ref1d = F.relu(x1d)
out1d = ext.relu_1d(x1d)
print(f"1D ReLU max diff: {(out1d - ref1d).abs().max().item():.2e}")
# ---- Test 2D add_bias ----
rows, cols = 769, 1025 # non-multiples of 16 to stress boundary guards
x2d = torch.randn(rows, cols, device=device)
bias = torch.randn(cols, device=device)
ref2d = x2d + bias.unsqueeze(0)
out2d = ext.add_bias_2d(x2d, bias)
print(f"2D bias add max diff: {(out2d - ref2d).abs().max().item():.2e}")
# ---- Test 3D batched scale + ReLU ----
batch, h, w = 31, 65, 65 # non-multiples to stress boundary guards
x3d = torch.randn(batch, h, w, device=device)
scales = torch.rand(batch, device=device) # positive scales
ref3d = F.relu(x3d * scales.view(batch, 1, 1))
out3d = ext.batched_scale_relu(x3d, scales)
print(f"3D batched max diff: {(out3d - ref3d).abs().max().item():.2e}")
print("All tests passed.")
Expected output:
1D ReLU max diff: 0.00e+00
2D bias add max diff: 0.00e+00
3D batched max diff: 0.00e+00
All tests passed.
Warp Divergence: The Hidden Performance Killer
When threads in the same warp take different execution paths through an if/else, the warp must execute both branches sequentially with some threads masked. This is warp divergence.
// Divergent code: half the warp goes one way, half the other
__global__ void divergent_kernel(float* data, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx >= n) return;
// BUG: threads in the same warp diverge based on idx value
// If a warp spans indices 128-159, threads 128-143 go one branch
// and threads 144-159 go the other - full serialization
if (data[idx] > 0.0f) {
data[idx] = sqrtf(data[idx]); // expensive
} else {
data[idx] = data[idx] * data[idx]; // cheap
}
}
// Better: branchless equivalent using per-lane selects (predication)
__global__ void non_divergent_kernel(float* data, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx >= n) return;
float x = data[idx];
float pos = (x > 0.0f) ? sqrtf(x) : 0.0f;
float neg = (x <= 0.0f) ? x * x : 0.0f;
data[idx] = pos + neg;
// Every lane evaluates both expressions and selects one result - no branch serialization
}
The boundary guard if (idx >= n) return is technically divergent - the last warp in the grid will have some threads that return early. But this is unavoidable and only affects the last partial warp, so the cost is minimal.
Divergence you must avoid: conditional branches on input data values within the kernel body, especially around expensive operations. The compiler will sometimes turn simple branches into predicated instructions (no divergence), but it cannot do so in general.
Common Mistakes
:::danger Forgetting the Boundary Guard
__global__ void broken_kernel(float* data, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
// BUG: no boundary check
data[idx] = data[idx] * 2.0f; // writes past array end for last block
}
This writes to memory past the end of your allocation. If that memory happens to be mapped (e.g., you over-allocated), you get silently corrupted data. If not, you get an illegal memory access error (cudaErrorIllegalAddress) reported at the next synchronization. Always add if (idx >= n) return;.
:::
:::danger Integer Division Truncation in Grid Size
// BUG: truncates - misses the last partial block
int gridSize = N / blockSize;
// CORRECT: ceiling division
int gridSize = (N + blockSize - 1) / blockSize;
// Also correct (slightly different style):
int gridSize = (N - 1) / blockSize + 1; // assumes N > 0
// Or using standard library:
#include <cmath>
int gridSize = (int)std::ceil((double)N / blockSize); // slower, fp rounding risk
The (N + blockSize - 1) / blockSize pattern is idiomatic and correct. Use it.
:::
:::warning Block Size Not a Multiple of 32
// BAD: 100 threads per block = 4 warps, last warp only 4/32 active
dim3 blockDim(100);
// BAD: 5 x 5 = 25 threads, less than one full warp
dim3 blockDim(5, 5);
// GOOD: multiples of 32
dim3 blockDim(128); // 4 warps
dim3 blockDim(256); // 8 warps
dim3 blockDim(16, 16); // 256 threads = 8 warps
dim3 blockDim(32, 8); // 256 threads = 8 warps
Non-multiples of 32 waste warp lanes and reduce SM utilization.
:::
:::danger Exceeding 1024 Threads Per Block
// BUG: 1025 threads per block exceeds hardware limit
dim3 blockDim(1025); // fails: cudaErrorInvalidConfiguration
dim3 blockDim(32, 32); // 1024: OK (exactly at limit)
dim3 blockDim(33, 32); // 1056: FAILS silently or errors
dim3 blockDim(16, 16, 5); // 1280: FAILS
// CORRECT:
dim3 blockDim(16, 16, 4); // 1024: OK but check blockDim.z <= 64
dim3 blockDim(32, 32, 1); // 1024: OK
Always verify blockDim.x * blockDim.y * blockDim.z <= 1024. If you forget to check CUDA errors, an oversized block silently does nothing; a validation sketch follows this callout.
:::
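One way to catch this before launching is to validate the requested block shape against the device's reported limits. A minimal sketch; the helper name is mine, the device-property fields are standard.

#include <cuda_runtime.h>

bool block_dims_are_valid(dim3 block) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    unsigned int total = block.x * block.y * block.z;
    return total <= (unsigned int)prop.maxThreadsPerBlock    // 1024 on current GPUs
        && block.x <= (unsigned int)prop.maxThreadsDim[0]
        && block.y <= (unsigned int)prop.maxThreadsDim[1]
        && block.z <= (unsigned int)prop.maxThreadsDim[2];   // z limit is 64
}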
:::warning Wrong Index Formula for 2D Kernels
// BUG: x and y roles swapped - wrong coverage for non-square matrices
__global__ void wrong_2d(float* data, int rows, int cols) {
// Swapped: row uses blockDim.x, col uses blockDim.y
int row = blockIdx.x * blockDim.x + threadIdx.x; // wrong
int col = blockIdx.y * blockDim.y + threadIdx.y; // wrong
// This happens to work when rows == cols, but for non-square matrices
// the grid covers the wrong ranges and produces silent wrong answers
}
// CORRECT: x -> columns, y -> rows (matches linear algebra convention)
__global__ void correct_2d(float* data, int rows, int cols) {
int col = blockIdx.x * blockDim.x + threadIdx.x; // x = column
int row = blockIdx.y * blockDim.y + threadIdx.y; // y = row
if (row >= rows || col >= cols) return;
int idx = row * cols + col; // row-major
data[idx] *= 2.0f;
}
This is a consistency rule: CUDA x maps to columns, y maps to rows. Violating it causes silent wrong answers on non-square matrices.
:::
Interview Q&A
Q1: Why must thread block sizes be multiples of 32?
The GPU's execution hardware does not execute individual threads - it executes warps of 32 threads in lockstep. When a block's thread count is not a multiple of 32, the last warp is partially filled. The hardware still allocates a full warp slot for it (including register file space and scheduling overhead) but executes useful work on only the active lanes. For example, a block of 100 threads wastes 28 thread slots in the last warp. In production, this means lower SM occupancy and wasted compute units. The practical impact is usually small for the last block's last warp (since there is only one such warp), but setting block sizes to 128, 256, or 512 is a zero-cost best practice that avoids the issue entirely.
Q2: How do you calculate the global thread index for a 1D problem?
The global index is idx = blockIdx.x * blockDim.x + threadIdx.x. The intuition: blockIdx.x * blockDim.x gives the starting index of this block's thread range (e.g., block 3 with 256 threads/block starts at index 768). Adding threadIdx.x (0-255 within the block) gives each thread its unique offset within that range. For the 2D case: row = blockIdx.y * blockDim.y + threadIdx.y and col = blockIdx.x * blockDim.x + threadIdx.x, then the flat index is row * cols + col for row-major storage.
Q3: What happens to threads in the last block when the problem size is not a multiple of block size?
The last block contains threads whose computed idx exceeds N - 1. Without a boundary guard, those threads execute the kernel body with out-of-bounds indices, causing undefined behavior (usually an illegal memory access at the next cudaDeviceSynchronize). With a boundary guard if (idx >= n) return;, the out-of-bounds threads exit immediately after the check. The warp they belong to still executes (the warp was launched), but the out-of-bounds lanes are masked and do no work. This is correct behavior and is the mandatory pattern for any kernel where problem size may not be a multiple of block size.
Q4: What is warp divergence and when is it a performance problem?
Warp divergence occurs when threads in the same warp take different paths through a conditional branch (if/else, switch, loop with data-dependent iterations). Since a warp executes in lockstep, the hardware must execute both branches sequentially - first with threads that took the true path active and false threads masked, then vice versa. Execution time becomes the sum of both branches instead of the maximum. Divergence is a performance problem when: (1) the branches contain expensive code (large inner loops, transcendentals), (2) many warps diverge (the branch condition is not aligned to warp boundaries), and (3) the branches have significantly different costs. Divergence on the boundary guard if (idx >= n) return is acceptable because it only affects the last warp in the grid. Divergence on the actual computation logic (e.g., different processing paths based on input values) can halve throughput.
Q5: What is the maximum number of thread blocks per dimension, and what limits it?
gridDim.x can go up to roughly 2.1 billion (2^31 - 1). gridDim.y and gridDim.z are limited to 65535. These limits come from the width of the hardware fields that store the block coordinate. gridDim.x was widened with compute capability 3.0 (Kepler) to support large 1D problem sizes, while y and z remained 16-bit. For problems that would require more than 65535 blocks in y or z, the solution is to use the grid-stride loop pattern (have each thread process multiple elements) or to restructure the problem so the large dimension maps to x.
Q6: What is the __launch_bounds__ qualifier and when should you use it?
__launch_bounds__(maxThreadsPerBlock, minBlocksPerSM) is a hint to the NVCC compiler that your kernel will be launched with at most maxThreadsPerBlock threads per block, and that you want at least minBlocksPerSM blocks to fit simultaneously on one SM. NVCC uses this to cap register allocation: with fewer possible live threads per SM, each thread can use more registers, which can improve performance by reducing register spilling to local memory. Use it when you always launch with a specific block size (e.g., always 256) and you know your kernel can benefit from more registers. Example:
__global__ __launch_bounds__(256, 4)
void my_kernel(float* data, int n) {
// The compiler caps register usage so that at least 4 blocks of 256 threads
// can be resident on one SM - trading registers per thread for occupancy
}
Q7: How does the SM decide which warp to execute next, and what does this mean for kernel design?
Each SM maintains a pool of resident warps (up to 64 on A100). On every cycle, the warp scheduler selects a warp that is "ready" - not stalled waiting for memory or other operations. When a warp issues a global memory load (which has ~400-cycle latency on A100 HBM), it becomes stalled and the scheduler immediately picks another ready warp to run. This latency hiding is called "zero-overhead context switching" and is the fundamental mechanism that makes GPUs fast despite high memory latency. The implication for kernel design: you want enough resident warps that the SM always has a ready warp to run even when many warps are stalled on memory. This is why occupancy matters - higher occupancy means more warps to hide latency behind. A kernel with 100% theoretical occupancy and high memory pressure will typically outperform the same kernel at 25% occupancy.
Summary
The CUDA thread hierarchy has three levels: threads (finest), blocks (intermediate), and grids (coarsest). Below the thread abstraction, the hardware executes in warps of 32 threads. Getting indexing right requires applying idx = blockIdx.x * blockDim.x + threadIdx.x correctly for each dimension, always using ceiling division for grid size computation, and always protecting array accesses with a boundary guard.
For 1D problems use a 1D grid of 1D blocks with blockDim.x = 256 as a default. For matrix operations use a 2D grid with dim3 blockDim(16, 16). For batched problems or volumetric data use 3D grids with blockDim.z = 1 when the third dimension is a batch index.
Block size must be a multiple of 32. Total threads per block must not exceed 1024. When problem sizes are very large or non-standard, use grid-stride loops. Profile with Nsight to confirm occupancy matches expectations.
The next lesson covers shared memory - the on-chip memory that threads within a block share, and how it enables fast tile-based matrix multiplication.
