Writing Your First CUDA Kernel
Reading time: ~45 min - Interview relevance: Very High - Target roles: ML Systems Engineer, CUDA Developer, Kernel Engineer
A fused kernel is not just an optimization. It is a statement about how deeply you understand what your hardware is actually doing. Most engineers treat PyTorch as a black box and leave 2-3x performance on the table whenever they chain memory-bound operations together.
The Day a Kernel Changed Everything
It was 2 AM in a San Francisco office. A team was training a 7B parameter language model and their training throughput had plateaued at 31% of theoretical hardware utilization. The profiler showed the culprit clearly: the FFN block was spending 58% of its time on memory transfers, not computation. Three innocent-looking lines of Python were the problem.
# This looks like three operations. On the GPU, it is eight memory transactions.
x = x + bias # read x, read bias, write result
x = torch.nn.functional.gelu(x) # read result, write result
output = x * gate # read result, read gate, write output
Each PyTorch operation is a separate kernel launch. Each kernel launch reads its inputs from HBM (the GPU's main memory), does some work, and writes its outputs back to HBM. For a bias addition on a 4096-wide hidden layer, the arithmetic intensity is essentially zero - you read two numbers, add them, write one number. The actual addition takes one clock cycle. The HBM transfer takes hundreds.
A senior engineer fused those three operations into a single CUDA kernel. The inputs were read once, the bias addition, GELU, and gate multiplication happened entirely in registers, and the outputs were written once. Training throughput jumped from 31% to 67% of theoretical peak. No algorithmic changes. No new hardware. Just understanding what the machine was actually doing.
This lesson teaches you how to write that kernel - and more importantly, how to think about when and why to write it.
Why Custom Kernels Exist
The Problem with Operator Chaining
Every deep learning framework, including PyTorch, compiles each operation to a separate GPU kernel. This design is clean from a software engineering perspective: each operation is independent, testable, and composable. But from a hardware perspective, it creates a brutal pattern.
Consider what happens when PyTorch executes output = gelu(x + bias):
- Kernel 1 (bias add): read `x` from HBM, read `bias` from HBM, compute `x + bias`, write the result to HBM
- Kernel 2 (GELU): read the result from HBM, compute `gelu(result)`, write the output to HBM
The actual computation - one addition and one GELU evaluation - is essentially free; the HBM transfers dominate the runtime. A100 HBM bandwidth is about 2 TB/s, but for a batch of 512 sequences each with 4096 FP32 features, every full pass over the activation tensor moves roughly 8 MB, and the unfused chain makes several such passes.
The fused kernel moves half the data, so it runs roughly twice as fast - and that gap widens as model width increases.
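To make that concrete, here is a back-of-envelope traffic estimate - a sketch, assuming the transfer pattern described above and the ~2 TB/s bandwidth figure; `hbm_traffic_bytes` is a hypothetical helper, not a profiler:

```python
# Rough HBM traffic for unfused vs. fused bias+GELU, FP32.
# Assumes the transfer pattern described above; real kernels may differ.
def hbm_traffic_bytes(batch: int, hidden: int, dtype_bytes: int = 4):
    tensor = batch * hidden * dtype_bytes  # one full activation tensor
    bias = hidden * dtype_bytes            # tiny by comparison
    unfused = 4 * tensor + bias  # read x, write tmp, read tmp, write out (+ bias)
    fused = 2 * tensor + bias    # read x, write out (+ bias)
    return unfused, fused

unfused, fused = hbm_traffic_bytes(512, 4096)
bandwidth = 2e12  # ~2 TB/s A100 HBM
print(f"unfused: {unfused / 1e6:.1f} MB -> {unfused / bandwidth * 1e6:.1f} us")
print(f"fused:   {fused / 1e6:.1f} MB -> {fused / bandwidth * 1e6:.1f} us")
```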
The FlashAttention Moment
The most famous example of this principle is FlashAttention (Dao et al., 2022). Standard attention requires materializing the full $n \times n$ attention matrix in HBM for sequences of length $n$ - $O(n^2)$ memory. For a sequence of 8192 tokens, that is 268 MB per head per layer - read and written multiple times during the forward pass.
FlashAttention tiles the computation so that the full attention matrix is never materialized in HBM - each tile lives only in SRAM. The result was a 5-7x speedup over standard attention with mathematically identical output. No approximation. No accuracy loss. Pure kernel engineering.
FlashAttention made custom kernels mainstream. Before it, custom kernels were a research curiosity. After it, every serious ML infrastructure team started writing them.
Historical Context
CUDA launched in 2006. For the first five years, almost nobody outside GPU computing research wrote custom kernels. Deep learning ran on CPUs.
AlexNet (2012) changed the trajectory. Krizhevsky, Sutskever, and Hinton trained on two GTX 580s and beat the next-best ImageNet entry by 10.8 percentage points. The GPU was not incidental to this result - it was the reason. Training would have taken weeks on CPU. It took days on GPU.
But AlexNet used cuDNN primitives written by NVIDIA engineers. Researchers did not write their own kernels; they called library functions. This worked because the operations - convolution, pooling, activation - were fixed and well-understood.
The Transformer era broke this model. Attention variants, new activation functions, MoE routing, RoPE embeddings, custom normalization schemes - none of these were in cuDNN. The frameworks added them as pure PyTorch operations, which meant each sub-operation became a separate kernel. Performance suffered.
By 2020, teams at Google, Meta, and NVIDIA were routinely writing custom CUDA kernels for production models. The expertise became a hiring differentiator. By 2022, Triton (Tillet et al., 2019) had democratized kernel writing enough that ML engineers without systems backgrounds could contribute.
Today, knowing how to write and reason about custom kernels is a tier-1 skill for any engineer working on training infrastructure or serving optimization.
Core Concepts Before Writing Code
Arithmetic Intensity
Before writing a kernel, you must answer one question: is this operation memory-bound or compute-bound?
Arithmetic intensity is the ratio of floating-point operations to bytes of memory accessed:

$$I = \frac{\text{FLOPs}}{\text{bytes moved to and from HBM}}$$
The roofline model tells you the maximum achievable performance given intensity $I$:

$$P_{\text{attainable}} = \min\left(P_{\text{peak}},\ B_{\text{HBM}} \times I\right)$$
For an A100:
- Peak FP32 performance: 19.5 TFLOP/s
- HBM bandwidth: 2,000 GB/s
- Ridge point: 19,500 / 2,000 = 9.75 FLOPs/byte
Any operation below 9.75 FLOPs/byte is memory-bound on an A100. The kernel spends most of its time waiting for data, not computing.
| Operation | FLOPs/element | Bytes/element | Intensity | Bound |
|---|---|---|---|---|
| Vector copy | 0 | 8 | 0 | Memory |
| Bias add | 1 | 12 | 0.08 | Memory |
| GELU | ~20 | 8 | 2.5 | Memory |
| Fused bias+GELU | ~21 | 8 | 2.6 | Memory |
| Matrix multiply (large) | 2N | 4 | N/2 | Compute |
The bias+GELU fusion does not change whether the operation is memory-bound. It changes how many times you go to memory. Unfused, you make five HBM trips: read x, read bias, write the intermediate, read the intermediate back, write the output. Fused, you make three: read x, read bias, write the output - and the bias read is negligible next to the full-size tensors. The arithmetic intensity per element is roughly the same; the memory traffic is nearly halved.
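The ridge-point comparison is mechanical enough to script. A minimal sketch of the roofline formula, using the A100 figures quoted above and intensity values from the table:

```python
# Roofline: attainable FLOP/s = min(peak compute, bandwidth * intensity).
# A100 figures from above: 19.5 TFLOP/s FP32, ~2 TB/s HBM.
def roofline_flops(intensity: float,
                   peak_flops: float = 19.5e12,
                   bandwidth: float = 2.0e12) -> float:
    return min(peak_flops, bandwidth * intensity)

for name, intensity in [("bias add", 0.08),
                        ("fused bias+GELU", 2.6),
                        ("large matmul", 512.0)]:
    attainable = roofline_flops(intensity)
    bound = "memory" if attainable < 19.5e12 else "compute"
    print(f"{name:>16}: {attainable / 1e12:5.2f} TFLOP/s attainable ({bound}-bound)")
```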
The Five Steps to Writing a Production Kernel
flowchart TD
A["1. Identify Operation<br/>What computation + memory pattern?"]:::blue
B["2. Calculate Intensity<br/>Memory-bound or compute-bound?"]:::blue
C["3. Write Kernel Body<br/>Thread indexing, computation, bounds check"]:::green
D["4. Choose Launch Config<br/>Block size, grid size, shared memory"]:::purple
E["5. Benchmark<br/>Compare against baseline, check correctness"]:::orange
A --> B --> C --> D --> E
classDef blue fill:#dbeafe,color:#1e293b,stroke:#2563eb
classDef green fill:#dcfce7,color:#14532d,stroke:#16a34a
classDef purple fill:#ede9fe,color:#4c1d95,stroke:#7c3aed
classDef orange fill:#ffedd5,color:#7c2d12,stroke:#ea580c
Step 1: Identify the Operation
We will build a fused bias + GELU kernel. This appears in virtually every Transformer FFN layer:
# In every transformer FFN block, somewhere:
hidden = linear(x) # shape: [batch, seq, hidden]
hidden = hidden + bias # bias: [hidden]
hidden = gelu(hidden) # element-wise nonlinearity
The bias add and GELU are pure element-wise operations. Every output element depends only on the corresponding input element and the corresponding bias value. No data dependencies between elements. This is the ideal candidate for a fused kernel.
Memory access pattern: Each thread reads one element of x, reads the corresponding bias (indexed by column), computes GELU, writes one element of output. This is a streaming pattern - no reuse, pure throughput.
Step 2: Calculate Arithmetic Intensity
For a tensor of shape [B, H] (batch size B, hidden dimension H):
- Reads: B * H floats (input) + H floats (bias) = (B+1) * H floats
- Writes: B * H floats (output)
- Total bytes: (2B+1) * H * 4 bytes
- FLOPs: B * H * ~20 (GELU approximation requires ~20 FP ops)
For B=512, H=4096: total bytes = (2·512+1) · 4096 · 4 ≈ 16.8 MB, FLOPs = 512 · 4096 · 20 ≈ 41.9 MFLOP, giving an intensity of about 2.5 FLOPs/byte.
2.5 FLOPs/byte is well below the A100 ridge point of 9.75. This operation is memory-bound. The optimization is entirely about reducing memory traffic, which fusion provides.
Step 3: Write the Kernel Body
Understanding the GELU Approximation
The exact GELU is:

$$\text{GELU}(x) = x \cdot \Phi(x) = \frac{x}{2}\left(1 + \operatorname{erf}\!\left(\frac{x}{\sqrt{2}}\right)\right)$$

The fast approximation (used by most frameworks) is:

$$\text{GELU}(x) \approx \frac{x}{2}\left(1 + \tanh\!\left(\sqrt{\tfrac{2}{\pi}}\left(x + 0.044715\,x^3\right)\right)\right)$$

This approximation avoids the erf call and is accurate to within 0.1% for typical activation values.
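You can check the approximation error yourself with PyTorch, which exposes both variants - a quick sanity check, not part of the kernel:

```python
import torch

# Compare the exact (erf-based) GELU against the tanh approximation
# that the kernel below implements.
x = torch.linspace(-6, 6, 10001)
exact = torch.nn.functional.gelu(x)                       # erf-based
approx = torch.nn.functional.gelu(x, approximate="tanh")  # tanh-based
print(f"max abs error: {(exact - approx).abs().max().item():.2e}")
```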
The Complete Kernel
#include <cuda_runtime.h>
#include <math.h>
// GELU approximation constants
#define GELU_COEFF 0.7978845608f // sqrt(2/pi)
#define GELU_CUBIC 0.044715f
// Inline device function for GELU
__device__ __forceinline__ float gelu_approx(float x) {
float cube = GELU_CUBIC * x * x * x;
float inner = GELU_COEFF * (x + cube);
return 0.5f * x * (1.0f + tanhf(inner));
}
// Fused bias + GELU kernel
// input: [N] flattened tensor (B * H elements)
// bias: [H] bias vector
// output: [N] result tensor
// N: total elements (B * H)
// H: hidden dimension (used to index bias)
__global__ void fused_bias_gelu_kernel(
const float* __restrict__ input,
const float* __restrict__ bias,
float* __restrict__ output,
int N,
int H
) {
// Global thread index
int idx = blockIdx.x * blockDim.x + threadIdx.x;
// Bounds check - last block may be partially full
if (idx >= N) return;
// Bias index: which column this thread handles
int bias_idx = idx % H;
// Load from HBM - the only HBM read in the fused version
float val = input[idx] + bias[bias_idx];
// GELU computed entirely in registers - no memory access
float result = gelu_approx(val);
// Write result to HBM - the only HBM write
output[idx] = result;
}
Key Design Decisions in the Kernel Body
__restrict__: Tells the compiler that the pointers do not alias each other. This enables more aggressive optimization, including better instruction scheduling. Always use it on kernel arguments when you guarantee no aliasing.
__forceinline__: Prevents the compiler from generating a function call for gelu_approx. The compiler will inline it, eliminating call overhead and enabling the GELU computation to be register-allocated alongside the surrounding arithmetic.
Bias indexing with % H: The input is laid out as a 2D tensor flattened to 1D. Element idx belongs to row idx / H and column idx % H. The bias is indexed by column. The modulo is a few integer instructions - negligible next to the memory latency this kernel is bound by.
Bounds check before any memory access: The if (idx >= N) return must come before input[idx]. If threads in the last block are out-of-bounds and you access memory anyway, you get undefined behavior - sometimes a crash, sometimes silent corruption.
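The flat-index arithmetic is easy to sanity-check in Python - a toy illustration with H=4:

```python
# How the kernel maps a flat index to (row, column) of the logical [B, H] tensor:
# row = idx // H picks the batch element, col = idx % H picks the bias entry.
H = 4
for idx in range(8):
    row, col = divmod(idx, H)
    print(f"idx={idx} -> row={row}, col={col}, adds bias[{col}]")
```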
Step 4: Choose Launch Configuration
Block Size
For element-wise kernels, block size 256 is the standard starting point. Here is why:
- Minimum occupancy requirement: each SM needs at least 2 warps (64 threads) active to hide latency
- Practical sweet spot: 128-256 threads per block gives good occupancy without excessive register pressure
- Must be a multiple of 32 (warp size)
- 256 = 8 warps per block; even register-heavy kernels typically fit 2-4 blocks per SM (16-32 active warps), and a light kernel like this one fits the full 8
// Launch configuration
int block_size = 256;
int grid_size = (N + block_size - 1) / block_size; // Ceiling division
fused_bias_gelu_kernel<<<grid_size, block_size>>>(
d_input, d_bias, d_output, N, H
);
The ceiling division ensures we always launch enough blocks to cover all N elements. The last block will have some idle threads, which is fine - the bounds check handles them.
Verifying Occupancy
A100 limits per SM:
- Max threads per SM: 2048
- Max blocks per SM: 32
- Max registers per thread: 255
With block size 256 and our kernel using roughly 16 registers per thread:
- Threads per block: 256 (8 warps)
- Blocks per SM: min(32, 2048/256) = 8
- Active warps per SM: 8 * 8 = 64 (out of max 64)
- Theoretical occupancy: 100%
For memory-bound kernels, high occupancy is critical - you need many in-flight memory requests to keep HBM busy.
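The occupancy arithmetic above can be reproduced in a few lines - a simplified sketch that ignores register allocation granularity and shared memory; the 65,536 registers-per-SM figure is the A100 limit, an assumption not listed in the table above:

```python
# Simplified occupancy estimate from A100 per-SM limits.
def occupancy(block_size: int, regs_per_thread: int,
              max_threads_sm: int = 2048, max_blocks_sm: int = 32,
              regs_per_sm: int = 65536):
    blocks = min(max_blocks_sm,
                 max_threads_sm // block_size,
                 regs_per_sm // (regs_per_thread * block_size))
    active_warps = blocks * (block_size // 32)
    return blocks, active_warps, active_warps / (max_threads_sm // 32)

blocks, warps, occ = occupancy(block_size=256, regs_per_thread=16)
print(f"{blocks} blocks/SM, {warps} active warps, {occ:.0%} theoretical occupancy")
```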
Step 5: Production Error Checking
Undetected CUDA errors are worse than crashes. They produce silently wrong results that can corrupt a training run for hours before anyone notices.
// Error checking macro - use after every CUDA call
#define CUDA_CHECK(call) \
do { \
cudaError_t err = (call); \
if (err != cudaSuccess) { \
fprintf(stderr, "CUDA error at %s:%d - %s\n", \
__FILE__, __LINE__, cudaGetErrorString(err)); \
exit(EXIT_FAILURE); \
} \
} while(0)
// Usage:
CUDA_CHECK(cudaMalloc(&d_input, N * sizeof(float)));
CUDA_CHECK(cudaMemcpy(d_input, h_input, N * sizeof(float), cudaMemcpyHostToDevice));
fused_bias_gelu_kernel<<<grid_size, block_size>>>(d_input, d_bias, d_output, N, H);
// Kernel launches are asynchronous - check for launch errors
CUDA_CHECK(cudaGetLastError());
// Synchronize and check for execution errors
CUDA_CHECK(cudaDeviceSynchronize());
In production, you would replace exit(EXIT_FAILURE) with a proper exception or error code. PyTorch's C++ extension infrastructure has its own error handling via AT_CUDA_CHECK.
The Complete Production Implementation
CUDA Extension Header
// fused_bias_gelu.h
#pragma once
#include <torch/extension.h>
torch::Tensor fused_bias_gelu_forward(
torch::Tensor input,
torch::Tensor bias
);
CUDA Kernel File
// fused_bias_gelu_kernel.cu
#include <cuda_runtime.h>
#include <torch/extension.h>
#include <ATen/cuda/CUDAContext.h>
#define CUDA_CHECK(call) \
do { \
cudaError_t err = (call); \
if (err != cudaSuccess) { \
throw std::runtime_error( \
std::string("CUDA error: ") + cudaGetErrorString(err)); \
} \
} while(0)
#define GELU_COEFF 0.7978845608f
#define GELU_CUBIC 0.044715f
__device__ __forceinline__ float gelu_approx(float x) {
float cube = GELU_CUBIC * x * x * x;
return 0.5f * x * (1.0f + tanhf(GELU_COEFF * (x + cube)));
}
__global__ void fused_bias_gelu_kernel(
const float* __restrict__ input,
const float* __restrict__ bias,
float* __restrict__ output,
int N,
int H
) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx >= N) return;
float val = input[idx] + bias[idx % H];
output[idx] = gelu_approx(val);
}
// Vectorized version: processes 4 floats per thread
// float4 loads are a single 128-bit memory transaction
__global__ void fused_bias_gelu_kernel_vec4(
const float4* __restrict__ input,
const float* __restrict__ bias,
float4* __restrict__ output,
int N, // total elements
int H // hidden dim (must be divisible by 4)
) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
int N4 = N / 4;
if (idx >= N4) return;
// Each thread processes 4 consecutive elements
float4 in4 = input[idx];
int base_col = (idx * 4) % H;
// Load 4 bias values
float b0 = bias[base_col];
float b1 = bias[base_col + 1];
float b2 = bias[base_col + 2];
float b3 = bias[base_col + 3];
// Fused bias + GELU on 4 elements
float4 out4;
out4.x = gelu_approx(in4.x + b0);
out4.y = gelu_approx(in4.y + b1);
out4.z = gelu_approx(in4.z + b2);
out4.w = gelu_approx(in4.w + b3);
output[idx] = out4;
}
torch::Tensor fused_bias_gelu_forward(
torch::Tensor input, // [B, H] or [B, S, H]
torch::Tensor bias // [H]
) {
// Input validation
TORCH_CHECK(input.is_cuda(), "input must be a CUDA tensor");
TORCH_CHECK(bias.is_cuda(), "bias must be a CUDA tensor");
TORCH_CHECK(input.scalar_type() == torch::kFloat32, "Only float32 supported");
TORCH_CHECK(input.is_contiguous(), "input must be contiguous");
TORCH_CHECK(bias.is_contiguous(), "bias must be contiguous");
int N = input.numel();
int H = bias.size(0);
auto output = torch::empty_like(input);
// Use the vectorized kernel when both N and H are divisible by 4
if (H % 4 == 0 && N % 4 == 0) {
int block_size = 256;
int grid_size = (N / 4 + block_size - 1) / block_size;
// Launch on PyTorch's current stream (via ATen/cuda/CUDAContext.h)
fused_bias_gelu_kernel_vec4<<<grid_size, block_size, 0, at::cuda::getCurrentCUDAStream()>>>(
reinterpret_cast<const float4*>(input.data_ptr<float>()),
bias.data_ptr<float>(),
reinterpret_cast<float4*>(output.data_ptr<float>()),
N, H
);
} else {
int block_size = 256;
int grid_size = (N + block_size - 1) / block_size;
fused_bias_gelu_kernel<<<grid_size, block_size, 0, at::cuda::getCurrentCUDAStream()>>>(
input.data_ptr<float>(),
bias.data_ptr<float>(),
output.data_ptr<float>(),
N, H
);
}
// Check for kernel launch errors
AT_CUDA_CHECK(cudaGetLastError());
return output;
}
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
m.def("forward", &fused_bias_gelu_forward, "Fused Bias+GELU forward");
}
Python Wrapper and Setup
# setup.py
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension
setup(
name='fused_bias_gelu',
ext_modules=[
CUDAExtension(
name='fused_bias_gelu',
sources=['fused_bias_gelu_kernel.cu'],
),
],
cmdclass={
'build_ext': BuildExtension
}
)
# fused_bias_gelu_ops.py - named differently from the compiled extension
# so that `import fused_bias_gelu` resolves to the extension, not this file
import torch
import fused_bias_gelu as _fused_ext # compiled extension
class FusedBiasGELU(torch.nn.Module):
"""Drop-in replacement for nn.Linear bias + GELU activation.
Usage:
# Instead of:
x = linear(x) + bias
x = F.gelu(x)
# Use:
x = fused_bias_gelu(linear(x), bias)
"""
def __init__(self, hidden_size: int):
super().__init__()
self.bias = torch.nn.Parameter(torch.zeros(hidden_size))
def forward(self, x: torch.Tensor) -> torch.Tensor:
return _fused_ext.forward(x, self.bias)
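Once the extension is built from the setup.py above, a minimal smoke test against the PyTorch reference might look like this - a sketch assuming the module names used above:

```python
import torch
import torch.nn.functional as F
import fused_bias_gelu as _ext  # the compiled extension from setup.py

x = torch.randn(512, 4096, device="cuda")
bias = torch.randn(4096, device="cuda")

out = _ext.forward(x, bias)
# Compare against the tanh variant, since that is what the kernel implements.
ref = F.gelu(x + bias, approximate="tanh")
assert torch.allclose(out, ref, rtol=1e-4, atol=1e-5)
print("fused CUDA kernel matches the PyTorch reference")
```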
Triton Alternative: Same Kernel, Much Simpler
Triton is a Python-based kernel language that compiles to efficient GPU code. For most ML engineers, Triton is the right choice before reaching for raw CUDA.
import triton
import triton.language as tl
import torch
@triton.jit
def fused_bias_gelu_triton(
input_ptr, # pointer to input tensor
bias_ptr, # pointer to bias tensor
output_ptr, # pointer to output tensor
N, # total elements
H, # hidden dimension
BLOCK_SIZE: tl.constexpr, # compile-time constant
):
# Each program handles BLOCK_SIZE elements
pid = tl.program_id(axis=0)
block_start = pid * BLOCK_SIZE
# Create index array for this block
offsets = block_start + tl.arange(0, BLOCK_SIZE)
mask = offsets < N
# Load input (masked for out-of-bounds safety)
x = tl.load(input_ptr + offsets, mask=mask)
# Load bias - index into H-dimensional bias
bias_offsets = offsets % H
b = tl.load(bias_ptr + bias_offsets, mask=mask)
# Fused bias + GELU
val = x + b
# GELU approximation
# GELU(x) = 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
cube = 0.044715 * val * val * val
inner = 0.7978845608 * (val + cube)
result = 0.5 * val * (1.0 + tl.math.tanh(inner))  # tl.libdevice.tanh on older Triton
# Write output
tl.store(output_ptr + offsets, result, mask=mask)
def triton_fused_bias_gelu(x: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
output = torch.empty_like(x)
N = x.numel()
H = bias.shape[0]
BLOCK_SIZE = 1024 # Triton block size
grid = (triton.cdiv(N, BLOCK_SIZE),)
fused_bias_gelu_triton[grid](
x, bias, output, N, H,
BLOCK_SIZE=BLOCK_SIZE,
)
return output
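The same correctness check applies to the Triton path, again comparing against the tanh variant the kernel implements:

```python
x = torch.randn(512, 4096, device="cuda")
bias = torch.randn(4096, device="cuda")
ref = torch.nn.functional.gelu(x + bias, approximate="tanh")
assert torch.allclose(triton_fused_bias_gelu(x, bias), ref, rtol=1e-4, atol=1e-5)
```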
The Triton version is roughly 40 lines versus 80 lines of CUDA. Performance is within 5-10% of the hand-written CUDA version for this workload because:
- The operation is memory-bound, so the compiler has little room to help on the compute side
- Triton compiles to PTX just like CUDA - the generated assembly is similar
- The vectorization and memory access patterns are handled by the Triton compiler
Benchmarking: The Reality Check
import torch
import time
def benchmark(fn, *args, warmup=20, iters=100):
# Warmup - JIT compilation, cache warmup
for _ in range(warmup):
fn(*args)
torch.cuda.synchronize()
# Timed iterations
start = time.perf_counter()
for _ in range(iters):
fn(*args)
torch.cuda.synchronize()
end = time.perf_counter()
return (end - start) / iters * 1000 # ms
# Setup
B, H = 512, 4096
x = torch.randn(B, H, device='cuda', dtype=torch.float32)
bias = torch.randn(H, device='cuda', dtype=torch.float32)
# Baseline: separate PyTorch ops
def pytorch_baseline(x, bias):
return torch.nn.functional.gelu(x + bias)
# torch.compile version
compiled_baseline = torch.compile(pytorch_baseline)
# Our custom kernel
import fused_bias_gelu as _ext
def cuda_kernel(x, bias):
return _ext.forward(x, bias)
# Run benchmarks
t_pytorch = benchmark(pytorch_baseline, x, bias)
t_compiled = benchmark(compiled_baseline, x, bias)
t_cuda = benchmark(cuda_kernel, x, bias)
t_triton = benchmark(triton_fused_bias_gelu, x, bias)
print(f"PyTorch baseline: {t_pytorch:.3f} ms")
print(f"torch.compile: {t_compiled:.3f} ms")
print(f"Custom CUDA: {t_cuda:.3f} ms")
print(f"Triton: {t_triton:.3f} ms")
Typical results on A100 (B=512, H=4096):
| Implementation | Time (ms) | Speedup | Notes |
|---|---|---|---|
| PyTorch baseline | 0.21 | 1.0x | 3 kernel launches |
| torch.compile | 0.12 | 1.75x | Fusion via inductor |
| Custom CUDA (scalar) | 0.10 | 2.1x | Single kernel |
| Custom CUDA (vec4) | 0.08 | 2.6x | 128-bit loads |
| Triton | 0.09 | 2.3x | Single kernel |
Note that torch.compile fuses the operations too - it is not as fast as hand-written CUDA because the inductor backend adds overhead, but it gets most of the benefit for free.
The Kernel Fusion Decision Tree
flowchart TD
A["Need better performance<br/>for this operation?"]:::blue
B["Does torch.compile<br/>already fuse it?"]:::blue
C["Profile confirms<br/>memory-bound?"]:::green
D["Is the operation<br/>well-defined / stable?"]:::green
E["Use torch.compile<br/>or inductor backend"]:::teal
F["Write Triton kernel<br/>Faster iteration, Python"]:::purple
G["Triton meets<br/>performance target?"]:::green
H["Write CUDA kernel<br/>Maximum control"]:::orange
I["Not a kernel problem<br/>Check algorithm first"]:::red
A -->|Yes| B
A -->|No| I
B -->|Yes, fast enough| E
B -->|No or not fast enough| C
C -->|No - compute bound| I
C -->|Yes - memory bound| D
D -->|Yes - worth the investment| F
G -->|Yes| F
G -->|No - need more control| H
classDef blue fill:#dbeafe,color:#1e293b,stroke:#2563eb
classDef green fill:#dcfce7,color:#14532d,stroke:#16a34a
classDef purple fill:#ede9fe,color:#4c1d95,stroke:#7c3aed
classDef orange fill:#ffedd5,color:#7c2d12,stroke:#ea580c
classDef teal fill:#ccfbf1,color:#134e4a,stroke:#14b8a6
classDef red fill:#fee2e2,color:#7f1d1d,stroke:#dc2626
When NOT to Write a Custom Kernel
This is as important as knowing how to write one.
Do not write a custom kernel when:
- **`torch.compile` already handles it.** The inductor backend fuses element-wise ops, converts to Triton, and can often match hand-written performance. Check this first with `TORCH_COMPILE_DEBUG=1`.
- **The operation is compute-bound.** Matrix multiplication is compute-bound. Writing a custom GEMM kernel is a research project requiring weeks of work - cuBLAS/cuBLASLt already has highly tuned implementations. You will not beat NVIDIA's team.
- **Your model changes frequently.** Custom kernels require maintenance. If the operation changes every sprint, the kernel maintenance cost exceeds the performance benefit.
- **You have not profiled.** Many "obvious" bottlenecks are not the actual bottleneck. Profile first with `torch.profiler` or Nsight Systems. Write kernels for demonstrated bottlenecks, not intuited ones.
- **Triton meets the target.** Triton kernels are debuggable, Python-based, and compile to efficient PTX. If Triton gets you 90% of the way there, the remaining 10% rarely justifies full CUDA.
Production Engineering Notes
Half Precision Support
Production kernels must support both FP32 and FP16/BF16. Add a template:
template <typename scalar_t>
__global__ void fused_bias_gelu_kernel_typed(
const scalar_t* __restrict__ input,
const scalar_t* __restrict__ bias,
scalar_t* __restrict__ output,
int N, int H
) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx >= N) return;
// Convert to float for computation, back to scalar_t for output
float val = static_cast<float>(input[idx])
+ static_cast<float>(bias[idx % H]);
output[idx] = static_cast<scalar_t>(gelu_approx(val));
}
Dispatch the correct instantiation based on the input dtype using `AT_DISPATCH_FLOATING_TYPES_AND_HALF` (or `AT_DISPATCH_FLOATING_TYPES_AND2` with `at::ScalarType::Half` and `at::ScalarType::BFloat16` to cover BF16 as well).
Stream Management
For inference servers running multiple requests concurrently:
// Use a non-default stream to overlap with other operations
cudaStream_t stream;
cudaStreamCreate(&stream);
fused_bias_gelu_kernel<<<grid, block, 0, stream>>>(
input, bias, output, N, H
);
PyTorch's CUDA extension infrastructure automatically uses the current stream when you call at::cuda::getCurrentCUDAStream().
Common Mistakes
:::danger Missing Bounds Check
The single most common CUDA bug. If your N is not a multiple of your block size, the last block contains threads with no corresponding element. Without a bounds check, those threads access memory out-of-range.
// WRONG - no bounds check
__global__ void kernel(float* data, int N) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
data[idx] = data[idx] * 2.0f; // Out-of-bounds if idx >= N
}
// CORRECT
__global__ void kernel(float* data, int N) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx >= N) return; // Must come before any memory access
data[idx] = data[idx] * 2.0f;
}
:::
:::danger Optimizing Before Profiling
It is tempting to jump straight to vectorization and shared memory tricks. But these optimizations can actually hurt if the bottleneck is elsewhere. Always establish a baseline, profile it, identify the actual bottleneck, then optimize.
:::
:::warning Forgetting cudaGetLastError() After Kernel Launch
Kernel launches are asynchronous and do not return errors directly. You must explicitly call cudaGetLastError() after a launch to catch errors like invalid launch configuration.
myKernel<<<grid, block>>>(args);
// Without this, errors are silent:
CUDA_CHECK(cudaGetLastError());
:::
:::warning Non-Contiguous Tensors
PyTorch tensors may not be contiguous after slicing, transposing, or permuting. Always check tensor.is_contiguous() in your extension or call .contiguous() before passing to a kernel that assumes contiguous layout.
:::
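A short demonstration of how easily this bites - transposes are views, so no copy happens until you ask for one:

```python
import torch

x = torch.randn(512, 4096, device="cuda")
y = x.t()                 # a view: strides change, data does not move
print(y.is_contiguous())  # False - unsafe for kernels assuming flat layout
y = y.contiguous()        # materializes a row-major copy
print(y.is_contiguous())  # True
```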
Interview Questions and Answers
Q1: What is kernel fusion and why does it improve performance?
Kernel fusion combines multiple operations into a single GPU kernel, reducing the number of round-trips to HBM. Each unfused operation reads its inputs from HBM and writes its outputs to HBM. For memory-bound operations like element-wise activations, this memory traffic dominates runtime. Fusing two operations reduces memory traffic by nearly 50% because intermediate results stay in registers rather than being written to and read back from HBM. The classic example is fusing bias addition and GELU - unfused requires five HBM trips, fused requires three (two full-tensor transfers plus a negligible bias read).
Q2: How do you choose block size for a new kernel?
Start with 256 threads per block for memory-bound element-wise kernels. This gives 8 warps per block, which typically fills multiple SMs with high occupancy. For kernels using shared memory, calculate the block size that maximizes occupancy given the shared memory requirement: max_blocks_per_sm = min(max_blocks, shared_mem_per_sm / shared_mem_per_block). Use the CUDA Occupancy Calculator or cudaOccupancyMaxActiveBlocksPerMultiprocessor to validate. For compute-bound kernels, occupancy matters less - focus on register usage and instruction-level parallelism.
Q3: What is the roofline model and how do you apply it?
The roofline model predicts maximum achievable performance given a kernel's arithmetic intensity. Plot peak compute (horizontal ceiling) and bandwidth * intensity (diagonal slope) on a log-log graph. The minimum of the two is your achievable performance. If your kernel is below this ceiling, the issue is not the operation itself but inefficiency in your implementation (poor memory access patterns, divergence, underutilization). Calculate intensity as FLOPs/byte for your kernel and compare to the hardware ridge point (peak FLOPs / peak bandwidth). Below the ridge point, optimize for memory access. Above it, optimize for compute utilization.
Q4: When would you use Triton instead of CUDA?
Triton is appropriate for most new kernel development because it is faster to write, easier to debug, and produces competitive performance for memory-bound kernels. Use Triton when: the kernel is element-wise or has simple access patterns, you need to iterate quickly, the team has Python expertise but limited C++/PTX knowledge, or you need auto-tuning across hardware targets. Use raw CUDA when: you need precise control over shared memory layout, the kernel has complex synchronization patterns, you need warp shuffle operations not exposed by Triton, or you are targeting specific hardware with hand-tuned PTX. FlashAttention-3 uses raw CUDA because, at the time, Triton could not express the Hopper wgmma and TMA instructions its inner loop relies on.
Q5: How do you verify correctness of a custom kernel?
Three-step approach: First, unit test with small inputs where you can compute the expected output by hand. Second, numerical comparison against the PyTorch reference implementation using torch.allclose with appropriate tolerances (rtol=1e-4, atol=1e-5 for FP32). Third, test boundary conditions: N not a multiple of block size, N=1, N=0, very large N, tensors with non-power-of-2 sizes. For gradients, use torch.autograd.gradcheck if you implement a custom backward pass. Run with cuda-memcheck or compute-sanitizer to catch memory errors.
Q6: What is the difference between __device__, __global__, and __host__ qualifiers?
__global__ marks a kernel function callable from host (CPU) code and executed on the GPU. It must return void. __device__ marks a function callable only from GPU code (other device or global functions). __host__ marks a function callable only from CPU code (the default for any C++ function). __host__ __device__ allows a function to compile for both. The __forceinline__ qualifier on __device__ functions prevents function call overhead - important for small helper functions called in hot loops.
Summary
Writing a production CUDA kernel follows a clear sequence: identify the operation and its memory pattern, calculate arithmetic intensity to determine the bound, write the kernel body with proper thread indexing and bounds checks, choose launch configuration for high occupancy, and benchmark against baseline.
The fused bias+GELU kernel demonstrates the full workflow. It eliminates 2 HBM round-trips, achieves 2-3x speedup over unfused PyTorch, and the entire implementation is under 100 lines of CUDA. The Triton alternative achieves comparable performance in half the code.
The most important skill is knowing when not to write a custom kernel. Check torch.compile first. Profile before optimizing. Write Triton before CUDA. Reach for raw CUDA only when nothing else meets the performance target.
The engineers who build the fastest training systems are not the ones who write the most custom kernels. They are the ones who know exactly when a custom kernel is necessary and write precisely the right one.
