Writing Your First CUDA Kernel
Reading time: ~45 min - Interview relevance: Very High - Target roles: ML Systems Engineer, CUDA Developer, Kernel Engineer
A fused kernel is not just an optimization. It is a statement about how deeply you understand what your hardware is actually doing. Most engineers treat PyTorch as a black box and leave 2-3x performance on the table whenever they chain memory-bound operations together.
The Day a Kernel Changed Everything
It was 2 AM in a San Francisco office. A team was training a 7B parameter language model and their training throughput had plateaued at 31% of theoretical hardware utilization. The profiler showed the culprit clearly: the FFN block was spending 58% of its time on memory transfers, not computation. Three innocent-looking lines of Python were the problem.
# This looks like three operations. On the GPU, it is eight memory transactions.
x = x + bias # read x, read bias, write result
x = torch.nn.functional.gelu(x) # read result, write result
output = x * gate # read result, read gate, write output
Each PyTorch operation is a separate kernel launch. Each kernel launch reads its inputs from HBM (the GPU's main memory), does some work, and writes its outputs back to HBM. For a bias addition on a 4096-wide hidden layer, the arithmetic intensity is essentially zero - you read two numbers, add them, write one number. The actual addition takes one clock cycle. The HBM transfer takes hundreds.
A senior engineer fused those three operations into a single CUDA kernel. The inputs were read once, the bias addition, GELU, and gate multiplication happened entirely in registers, and the outputs were written once. Training throughput jumped from 31% to 67% of theoretical peak. No algorithmic changes. No new hardware. Just understanding what the machine was actually doing.
This lesson teaches you how to write that kernel - and more importantly, how to think about when and why to write it.
Why Custom Kernels Exist
The Problem with Operator Chaining
Every deep learning framework, including PyTorch, compiles each operation to a separate GPU kernel. This design is clean from a software engineering perspective: each operation is independent, testable, and composable. But from a hardware perspective, it creates a brutal pattern.
Consider what happens when PyTorch executes output = gelu(x + bias):
- Kernel 1 (bias add): read `x` from HBM, read `bias` from HBM, compute `x + bias`, write the result to HBM
- Kernel 2 (GELU): read the result from HBM, compute `gelu(result)`, write the output to HBM
The actual computation - one addition and one GELU evaluation - is essentially free; the HBM transfers dominate the runtime. A100 HBM bandwidth is about 2 TB/s, but for a batch of 512 sequences each with 4096 FP32 features, every full pass over the activation tensor moves roughly 8 MB, and the unfused chain makes several such passes.
The fused kernel moves half the data, so it runs roughly twice as fast - and that gap widens as model width increases.
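To make that concrete, here is a back-of-envelope traffic estimate - a sketch, assuming the transfer pattern described above and the ~2 TB/s bandwidth figure; `hbm_traffic_bytes` is a hypothetical helper, not a profiler:

```python
# Rough HBM traffic for unfused vs. fused bias+GELU, FP32.
# Assumes the transfer pattern described above; real kernels may differ.
def hbm_traffic_bytes(batch: int, hidden: int, dtype_bytes: int = 4):
    tensor = batch * hidden * dtype_bytes  # one full activation tensor
    bias = hidden * dtype_bytes            # tiny by comparison
    unfused = 4 * tensor + bias  # read x, write tmp, read tmp, write out (+ bias)
    fused = 2 * tensor + bias    # read x, write out (+ bias)
    return unfused, fused

unfused, fused = hbm_traffic_bytes(512, 4096)
bandwidth = 2e12  # ~2 TB/s A100 HBM
print(f"unfused: {unfused / 1e6:.1f} MB -> {unfused / bandwidth * 1e6:.1f} us")
print(f"fused:   {fused / 1e6:.1f} MB -> {fused / bandwidth * 1e6:.1f} us")
```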
The FlashAttention Moment
The most famous example of this principle is FlashAttention (Dao et al., 2022). Standard attention requires materializing the full $n \times n$ attention matrix in HBM for sequences of length $n$ - $O(n^2)$ memory. For a sequence of 8192 tokens, that is 268 MB per head per layer - read and written multiple times during the forward pass.
FlashAttention tiles the computation so that the full attention matrix is never materialized in HBM - each tile lives only in SRAM. The result was a 5-7x speedup over standard attention with mathematically identical output. No approximation. No accuracy loss. Pure kernel engineering.
FlashAttention made custom kernels mainstream. Before it, custom kernels were a research curiosity. After it, every serious ML infrastructure team started writing them.
Historical Context
CUDA launched in 2006. For the first five years, almost nobody outside GPU computing research wrote custom kernels. Deep learning ran on CPUs.
AlexNet (2012) changed the trajectory. Krizhevsky, Sutskever, and Hinton trained on two GTX 580s and beat the next-best ImageNet entry by 10.8 percentage points. The GPU was not incidental to this result - it was the reason. Training would have taken weeks on CPU. It took days on GPU.
But AlexNet used cuDNN primitives written by NVIDIA engineers. Researchers did not write their own kernels; they called library functions. This worked because the operations - convolution, pooling, activation - were fixed and well-understood.
The Transformer era broke this model. Attention variants, new activation functions, MoE routing, RoPE embeddings, custom normalization schemes - none of these were in cuDNN. The frameworks added them as pure PyTorch operations, which meant each sub-operation became a separate kernel. Performance suffered.
By 2020, teams at Google, Meta, and NVIDIA were routinely writing custom CUDA kernels for production models. The expertise became a hiring differentiator. By 2022, Triton (Tillet et al., 2019) had democratized kernel writing enough that ML engineers without systems backgrounds could contribute.
Today, knowing how to write and reason about custom kernels is a tier-1 skill for any engineer working on training infrastructure or serving optimization.
Core Concepts Before Writing Code
Arithmetic Intensity
Before writing a kernel, you must answer one question: is this operation memory-bound or compute-bound?
Arithmetic intensity is the ratio of floating-point operations to bytes of memory accessed:

$$I = \frac{\text{FLOPs}}{\text{bytes moved to and from HBM}}$$
The roofline model tells you the maximum achievable performance given intensity $I$:

$$P_{\text{attainable}} = \min\left(P_{\text{peak}},\ B_{\text{HBM}} \times I\right)$$
For an A100:
- Peak FP32 performance: 19.5 TFLOP/s
- HBM bandwidth: 2,000 GB/s
- Ridge point: 19,500 / 2,000 = 9.75 FLOPs/byte
Any operation below 9.75 FLOPs/byte is memory-bound on an A100. The kernel spends most of its time waiting for data, not computing.
| Operation | FLOPs/element | Bytes/element | Intensity | Bound |
|---|---|---|---|---|
| Vector copy | 0 | 8 | 0 | Memory |
| Bias add | 1 | 12 | 0.08 | Memory |
| GELU | ~20 | 8 | 2.5 | Memory |
| Fused bias+GELU | ~21 | 8 | 2.6 | Memory |
| Matrix multiply (large) | 2N | 4 | N/2 | Compute |
The bias+GELU fusion does not change whether the operation is memory-bound. It changes how many times you go to memory. Unfused, you make five HBM trips: read x, read bias, write the intermediate, read the intermediate back, write the output. Fused, you make three: read x, read bias, write the output - and the bias read is negligible next to the full-size tensors. The arithmetic intensity per element is roughly the same; the memory traffic is nearly halved.
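The ridge-point comparison is mechanical enough to script. A minimal sketch of the roofline formula, using the A100 figures quoted above and intensity values from the table:

```python
# Roofline: attainable FLOP/s = min(peak compute, bandwidth * intensity).
# A100 figures from above: 19.5 TFLOP/s FP32, ~2 TB/s HBM.
def roofline_flops(intensity: float,
                   peak_flops: float = 19.5e12,
                   bandwidth: float = 2.0e12) -> float:
    return min(peak_flops, bandwidth * intensity)

for name, intensity in [("bias add", 0.08),
                        ("fused bias+GELU", 2.6),
                        ("large matmul", 512.0)]:
    attainable = roofline_flops(intensity)
    bound = "memory" if attainable < 19.5e12 else "compute"
    print(f"{name:>16}: {attainable / 1e12:5.2f} TFLOP/s attainable ({bound}-bound)")
```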
The Five Steps to Writing a Production Kernel
flowchart TD
A["1. Identify Operation<br/>What computation + memory pattern?"]:::blue
B["2. Calculate Intensity<br/>Memory-bound or compute-bound?"]:::blue
C["3. Write Kernel Body<br/>Thread indexing, computation, bounds check"]:::green
D["4. Choose Launch Config<br/>Block size, grid size, shared memory"]:::purple
E["5. Benchmark<br/>Compare against baseline, check correctness"]:::orange
A --> B --> C --> D --> E
classDef blue fill:#dbeafe,color:#1e293b,stroke:#2563eb
classDef green fill:#dcfce7,color:#14532d,stroke:#16a34a
classDef purple fill:#ede9fe,color:#4c1d95,stroke:#7c3aed
classDef orange fill:#ffedd5,color:#7c2d12,stroke:#ea580c
Step 1: Identify the Operation
We will build a fused bias + GELU kernel. This appears in virtually every Transformer FFN layer:
# In every transformer FFN block, somewhere:
hidden = linear(x) # shape: [batch, seq, hidden]
hidden = hidden + bias # bias: [hidden]
hidden = gelu(hidden) # element-wise nonlinearity
The bias add and GELU are pure element-wise operations. Every output element depends only on the corresponding input element and the corresponding bias value. No data dependencies between elements. This is the ideal candidate for a fused kernel.
Memory access pattern: Each thread reads one element of x, reads the corresponding bias (indexed by column), computes GELU, writes one element of output. This is a streaming pattern - no reuse, pure throughput.
Step 2: Calculate Arithmetic Intensity
For a tensor of shape [B, H] (batch size B, hidden dimension H):
- Reads: B * H floats (input) + H floats (bias) = (B+1) * H floats
- Writes: B * H floats (output)
- Total bytes: (2B+1) * H * 4 bytes
- FLOPs: B * H * ~20 (GELU approximation requires ~20 FP ops)
For B=512, H=4096: total bytes = (2·512+1) · 4096 · 4 ≈ 16.8 MB, FLOPs = 512 · 4096 · 20 ≈ 41.9 MFLOP, giving an intensity of about 2.5 FLOPs/byte.
2.5 FLOPs/byte is well below the A100 ridge point of 9.75. This operation is memory-bound. The optimization is entirely about reducing memory traffic, which fusion provides.
Step 3: Write the Kernel Body
Understanding the GELU Approximation
The exact GELU is:

$$\text{GELU}(x) = x \cdot \Phi(x) = \frac{x}{2}\left(1 + \operatorname{erf}\!\left(\frac{x}{\sqrt{2}}\right)\right)$$

The fast approximation (used by most frameworks) is:

$$\text{GELU}(x) \approx \frac{x}{2}\left(1 + \tanh\!\left(\sqrt{\tfrac{2}{\pi}}\left(x + 0.044715\,x^3\right)\right)\right)$$

This approximation avoids the erf call and is accurate to within 0.1% for typical activation values.
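You can check the approximation error yourself with PyTorch, which exposes both variants - a quick sanity check, not part of the kernel:

```python
import torch

# Compare the exact (erf-based) GELU against the tanh approximation
# that the kernel below implements.
x = torch.linspace(-6, 6, 10001)
exact = torch.nn.functional.gelu(x)                       # erf-based
approx = torch.nn.functional.gelu(x, approximate="tanh")  # tanh-based
print(f"max abs error: {(exact - approx).abs().max().item():.2e}")
```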
The Complete Kernel
#include <cuda_runtime.h>
#include <math.h>
// GELU approximation constants
#define GELU_COEFF 0.7978845608f // sqrt(2/pi)
#define GELU_CUBIC 0.044715f
// Inline device function for GELU
__device__ __forceinline__ float gelu_approx(float x) {
float cube = GELU_CUBIC * x * x * x;
float inner = GELU_COEFF * (x + cube);
return 0.5f * x * (1.0f + tanhf(inner));
}
// Fused bias + GELU kernel
// input: [N] flattened tensor (B * H elements)
// bias: [H] bias vector
// output: [N] result tensor
// N: total elements (B * H)
// H: hidden dimension (used to index bias)
__global__ void fused_bias_gelu_kernel(
const float* __restrict__ input,
const float* __restrict__ bias,
float* __restrict__ output,
int N,
int H
) {
// Global thread index
int idx = blockIdx.x * blockDim.x + threadIdx.x;
// Bounds check - last block may be partially full
if (idx >= N) return;
// Bias index: which column this thread handles
int bias_idx = idx % H;
// Load from HBM - the only HBM read in the fused version
float val = input[idx] + bias[bias_idx];
// GELU computed entirely in registers - no memory access
float result = gelu_approx(val);
// Write result to HBM - the only HBM write
output[idx] = result;
}
Key Design Decisions in the Kernel Body
__restrict__: Tells the compiler that the pointers do not alias each other. This enables more aggressive optimization, including better instruction scheduling. Always use it on kernel arguments when you guarantee no aliasing.
__forceinline__: Prevents the compiler from generating a function call for gelu_approx. The compiler will inline it, eliminating call overhead and enabling the GELU computation to be register-allocated alongside the surrounding arithmetic.
Bias indexing with % H: The input is laid out as a 2D tensor flattened to 1D. Element idx belongs to row idx / H and column idx % H. The bias is indexed by column. The modulo is a few integer instructions - negligible next to the memory latency this kernel is bound by.
Bounds check before any memory access: The if (idx >= N) return must come before input[idx]. If threads in the last block are out-of-bounds and you access memory anyway, you get undefined behavior - sometimes a crash, sometimes silent corruption.
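The flat-index arithmetic is easy to sanity-check in Python - a toy illustration with H=4:

```python
# How the kernel maps a flat index to (row, column) of the logical [B, H] tensor:
# row = idx // H picks the batch element, col = idx % H picks the bias entry.
H = 4
for idx in range(8):
    row, col = divmod(idx, H)
    print(f"idx={idx} -> row={row}, col={col}, adds bias[{col}]")
```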
Step 4: Choose Launch Configuration
Block Size
For element-wise kernels, block size 256 is the standard starting point. Here is why:
- Minimum occupancy requirement: each SM needs at least 2 warps (64 threads) active to hide latency
- Practical sweet spot: 128-256 threads per block gives good occupancy without excessive register pressure
- Must be a multiple of 32 (warp size)
- 256 = 8 warps per block; even register-heavy kernels typically fit 2-4 blocks per SM (16-32 active warps), and a light kernel like this one fits the full 8
// Launch configuration
int block_size = 256;
int grid_size = (N + block_size - 1) / block_size; // Ceiling division
fused_bias_gelu_kernel<<<grid_size, block_size>>>(
d_input, d_bias, d_output, N, H
);
The ceiling division ensures we always launch enough blocks to cover all N elements. The last block will have some idle threads, which is fine - the bounds check handles them.
Verifying Occupancy
A100 limits per SM:
- Max threads per SM: 2048
- Max blocks per SM: 32
- Max registers per thread: 255
With block size 256 and our kernel using roughly 16 registers per thread:
- Threads per block: 256 (8 warps)
- Blocks per SM: min(32, 2048/256) = 8
- Active warps per SM: 8 * 8 = 64 (out of max 64)
- Theoretical occupancy: 100%
For memory-bound kernels, high occupancy is critical - you need many in-flight memory requests to keep HBM busy.
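The occupancy arithmetic above can be reproduced in a few lines - a simplified sketch that ignores register allocation granularity and shared memory; the 65,536 registers-per-SM figure is the A100 limit, an assumption not listed in the table above:

```python
# Simplified occupancy estimate from A100 per-SM limits.
def occupancy(block_size: int, regs_per_thread: int,
              max_threads_sm: int = 2048, max_blocks_sm: int = 32,
              regs_per_sm: int = 65536):
    blocks = min(max_blocks_sm,
                 max_threads_sm // block_size,
                 regs_per_sm // (regs_per_thread * block_size))
    active_warps = blocks * (block_size // 32)
    return blocks, active_warps, active_warps / (max_threads_sm // 32)

blocks, warps, occ = occupancy(block_size=256, regs_per_thread=16)
print(f"{blocks} blocks/SM, {warps} active warps, {occ:.0%} theoretical occupancy")
```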
Step 5: Production Error Checking
Undetected CUDA errors are worse than crashes. They produce silently wrong results that can corrupt a training run for hours before anyone notices.
// Error checking macro - use after every CUDA call
#define CUDA_CHECK(call) \
do { \
cudaError_t err = (call); \
if (err != cudaSuccess) { \
fprintf(stderr, "CUDA error at %s:%d - %s\n", \
__FILE__, __LINE__, cudaGetErrorString(err)); \
exit(EXIT_FAILURE); \
} \
} while(0)
// Usage:
CUDA_CHECK(cudaMalloc(&d_input, N * sizeof(float)));
CUDA_CHECK(cudaMemcpy(d_input, h_input, N * sizeof(float), cudaMemcpyHostToDevice));
fused_bias_gelu_kernel<<<grid_size, block_size>>>(d_input, d_bias, d_output, N, H);
// Kernel launches are asynchronous - check for launch errors
CUDA_CHECK(cudaGetLastError());
// Synchronize and check for execution errors
CUDA_CHECK(cudaDeviceSynchronize());
In production, you would replace exit(EXIT_FAILURE) with a proper exception or error code. PyTorch's C++ extension infrastructure has its own error handling via AT_CUDA_CHECK.
The Complete Production Implementation
CUDA Extension Header
// fused_bias_gelu.h
#pragma once
#include <torch/extension.h>
torch::Tensor fused_bias_gelu_forward(
torch::Tensor input,
torch::Tensor bias
);
CUDA Kernel File
// fused_bias_gelu_kernel.cu
#include <cuda_runtime.h>
#include <torch/extension.h>
#include <ATen/cuda/CUDAContext.h>
#define CUDA_CHECK(call) \
do { \
cudaError_t err = (call); \
if (err != cudaSuccess) { \
throw std::runtime_error( \
std::string("CUDA error: ") + cudaGetErrorString(err)); \
} \
} while(0)
#define GELU_COEFF 0.7978845608f
#define GELU_CUBIC 0.044715f
__device__ __forceinline__ float gelu_approx(float x) {
float cube = GELU_CUBIC * x * x * x;
return 0.5f * x * (1.0f + tanhf(GELU_COEFF * (x + cube)));
}
__global__ void fused_bias_gelu_kernel(
const float* __restrict__ input,
const float* __restrict__ bias,
float* __restrict__ output,
int N,
int H
) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx >= N) return;
float val = input[idx] + bias[idx % H];
output[idx] = gelu_approx(val);
}
// Vectorized version: processes 4 floats per thread
// float4 loads are a single 128-bit memory transaction
__global__ void fused_bias_gelu_kernel_vec4(
const float4* __restrict__ input,
const float* __restrict__ bias,
float4* __restrict__ output,
int N, // total elements
int H // hidden dim (must be divisible by 4)
) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
int N4 = N / 4;
if (idx >= N4) return;
// Each thread processes 4 consecutive elements
float4 in4 = input[idx];
int base_col = (idx * 4) % H;
// Load 4 bias values
float b0 = bias[base_col];
float b1 = bias[base_col + 1];
float b2 = bias[base_col + 2];
float b3 = bias[base_col + 3];
// Fused bias + GELU on 4 elements
float4 out4;
out4.x = gelu_approx(in4.x + b0);
out4.y = gelu_approx(in4.y + b1);
out4.z = gelu_approx(in4.z + b2);
out4.w = gelu_approx(in4.w + b3);
output[idx] = out4;
}
torch::Tensor fused_bias_gelu_forward(
torch::Tensor input, // [B, H] or [B, S, H]
torch::Tensor bias // [H]
) {
// Input validation
TORCH_CHECK(input.is_cuda(), "input must be a CUDA tensor");
TORCH_CHECK(bias.is_cuda(), "bias must be a CUDA tensor");
TORCH_CHECK(input.scalar_type() == torch::kFloat32, "Only float32 supported");
TORCH_CHECK(input.is_contiguous(), "input must be contiguous");
TORCH_CHECK(bias.is_contiguous(), "bias must be contiguous");
int N = input.numel();
int H = bias.size(0);
auto output = torch::empty_like(input);
// Use the vectorized kernel when both N and H are divisible by 4
if (H % 4 == 0 && N % 4 == 0) {
int block_size = 256;
int grid_size = (N / 4 + block_size - 1) / block_size;
// Launch on PyTorch's current stream (via ATen/cuda/CUDAContext.h)
fused_bias_gelu_kernel_vec4<<<grid_size, block_size, 0, at::cuda::getCurrentCUDAStream()>>>(
reinterpret_cast<const float4*>(input.data_ptr<float>()),
bias.data_ptr<float>(),
reinterpret_cast<float4*>(output.data_ptr<float>()),
N, H
);
} else {
int block_size = 256;
int grid_size = (N + block_size - 1) / block_size;
fused_bias_gelu_kernel<<<grid_size, block_size, 0, at::cuda::getCurrentCUDAStream()>>>(
input.data_ptr<float>(),
bias.data_ptr<float>(),
output.data_ptr<float>(),
N, H
);
}
// Check for kernel launch errors
AT_CUDA_CHECK(cudaGetLastError());
return output;
}
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
m.def("forward", &fused_bias_gelu_forward, "Fused Bias+GELU forward");
}
Python Wrapper and Setup
# setup.py
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension
setup(
name='fused_bias_gelu',
ext_modules=[
CUDAExtension(
name='fused_bias_gelu',
sources=['fused_bias_gelu_kernel.cu'],
),
],
cmdclass={
'build_ext': BuildExtension
}
)
# fused_bias_gelu_ops.py - named differently from the compiled extension
# so that `import fused_bias_gelu` resolves to the extension, not this file
import torch
import fused_bias_gelu as _fused_ext # compiled extension
class FusedBiasGELU(torch.nn.Module):
"""Drop-in replacement for nn.Linear bias + GELU activation.
Usage:
# Instead of:
x = linear(x) + bias
x = F.gelu(x)
# Use:
x = fused_bias_gelu(linear(x), bias)
"""
def __init__(self, hidden_size: int):
super().__init__()
self.bias = torch.nn.Parameter(torch.zeros(hidden_size))
def forward(self, x: torch.Tensor) -> torch.Tensor:
return _fused_ext.forward(x, self.bias)
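Once the extension is built from the setup.py above, a minimal smoke test against the PyTorch reference might look like this - a sketch assuming the module names used above:

```python
import torch
import torch.nn.functional as F
import fused_bias_gelu as _ext  # the compiled extension from setup.py

x = torch.randn(512, 4096, device="cuda")
bias = torch.randn(4096, device="cuda")

out = _ext.forward(x, bias)
# Compare against the tanh variant, since that is what the kernel implements.
ref = F.gelu(x + bias, approximate="tanh")
assert torch.allclose(out, ref, rtol=1e-4, atol=1e-5)
print("fused CUDA kernel matches the PyTorch reference")
```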
Triton Alternative: Same Kernel, Much Simpler
Triton is a Python-based kernel language that compiles to efficient GPU code. For most ML engineers, Triton is the right choice before reaching for raw CUDA.
import triton
import triton.language as tl
import torch
@triton.jit
def fused_bias_gelu_triton(
input_ptr, # pointer to input tensor
bias_ptr, # pointer to bias tensor
output_ptr, # pointer to output tensor
N, # total elements
H, # hidden dimension
BLOCK_SIZE: tl.constexpr, # compile-time constant
):
# Each program handles BLOCK_SIZE elements
pid = tl.program_id(axis=0)
block_start = pid * BLOCK_SIZE
# Create index array for this block
offsets = block_start + tl.arange(0, BLOCK_SIZE)
mask = offsets < N
# Load input (masked for out-of-bounds safety)
x = tl.load(input_ptr + offsets, mask=mask)
# Load bias - index into H-dimensional bias
bias_offsets = offsets % H
b = tl.load(bias_ptr + bias_offsets, mask=mask)
# Fused bias + GELU
val = x + b
# GELU approximation
# GELU(x) = 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
cube = 0.044715 * val * val * val
inner = 0.7978845608 * (val + cube)
result = 0.5 * val * (1.0 + tl.math.tanh(inner))  # tl.libdevice.tanh on older Triton
# Write output
tl.store(output_ptr + offsets, result, mask=mask)
def triton_fused_bias_gelu(x: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
output = torch.empty_like(x)
N = x.numel()
H = bias.shape[0]
BLOCK_SIZE = 1024 # Triton block size
grid = (triton.cdiv(N, BLOCK_SIZE),)
fused_bias_gelu_triton[grid](
x, bias, output, N, H,
BLOCK_SIZE=BLOCK_SIZE,
)
return output
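The same correctness check applies to the Triton path, again comparing against the tanh variant the kernel implements:

```python
x = torch.randn(512, 4096, device="cuda")
bias = torch.randn(4096, device="cuda")
ref = torch.nn.functional.gelu(x + bias, approximate="tanh")
assert torch.allclose(triton_fused_bias_gelu(x, bias), ref, rtol=1e-4, atol=1e-5)
```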
The Triton version is roughly 40 lines versus 80 lines of CUDA. Performance is within 5-10% of the hand-written CUDA version for this workload because:
- The operation is memory-bound, so the compiler has little room to help on the compute side
- Triton compiles to PTX just like CUDA - the generated assembly is similar
- The vectorization and memory access patterns are handled by the Triton compiler
Benchmarking: The Reality Check
import torch
import time
def benchmark(fn, *args, warmup=20, iters=100):
# Warmup - JIT compilation, cache warmup
for _ in range(warmup):
fn(*args)
torch.cuda.synchronize()
# Timed iterations
start = time.perf_counter()
for _ in range(iters):
fn(*args)
torch.cuda.synchronize()
end = time.perf_counter()
return (end - start) / iters * 1000 # ms
# Setup
B, H = 512, 4096
x = torch.randn(B, H, device='cuda', dtype=torch.float32)
bias = torch.randn(H, device='cuda', dtype=torch.float32)
# Baseline: separate PyTorch ops
def pytorch_baseline(x, bias):
return torch.nn.functional.gelu(x + bias)
# torch.compile version
compiled_baseline = torch.compile(pytorch_baseline)
# Our custom kernel
import fused_bias_gelu as _ext
def cuda_kernel(x, bias):
return _ext.forward(x, bias)
# Run benchmarks
t_pytorch = benchmark(pytorch_baseline, x, bias)
t_compiled = benchmark(compiled_baseline, x, bias)
t_cuda = benchmark(cuda_kernel, x, bias)
t_triton = benchmark(triton_fused_bias_gelu, x, bias)
print(f"PyTorch baseline: {t_pytorch:.3f} ms")
print(f"torch.compile: {t_compiled:.3f} ms")
print(f"Custom CUDA: {t_cuda:.3f} ms")
print(f"Triton: {t_triton:.3f} ms")
Typical results on A100 (B=512, H=4096):
| Implementation | Time (ms) | Speedup | Notes |
|---|---|---|---|
| PyTorch baseline | 0.21 | 1.0x | 3 kernel launches |
| torch.compile | 0.12 | 1.75x | Fusion via inductor |
| Custom CUDA (scalar) | 0.10 | 2.1x | Single kernel |
| Custom CUDA (vec4) | 0.08 | 2.6x | 128-bit loads |
| Triton | 0.09 | 2.3x | Single kernel |
Note that torch.compile fuses the operations too - it is not as fast as hand-written CUDA because the inductor backend adds overhead, but it gets most of the benefit for free.
The Kernel Fusion Decision Tree
flowchart TD
A["Need better performance<br/>for this operation?"]:::blue
B["Does torch.compile<br/>already fuse it?"]:::blue
C["Profile confirms<br/>memory-bound?"]:::green
D["Is the operation<br/>well-defined / stable?"]:::green
E["Use torch.compile<br/>or inductor backend"]:::teal
F["Write Triton kernel<br/>Faster iteration, Python"]:::purple
G["Triton meets<br/>performance target?"]:::green
H["Write CUDA kernel<br/>Maximum control"]:::orange
I["Not a kernel problem<br/>Check algorithm first"]:::red
A -->|Yes| B
A -->|No| I
B -->|Yes, fast enough| E
B -->|No or not fast enough| C
C -->|No - compute bound| I
C -->|Yes - memory bound| D
D -->|Yes - worth the investment| F
G -->|Yes| F
G -->|No - need more control| H
classDef blue fill:#dbeafe,color:#1e293b,stroke:#2563eb
classDef green fill:#dcfce7,color:#14532d,stroke:#16a34a
classDef purple fill:#ede9fe,color:#4c1d95,stroke:#7c3aed
classDef orange fill:#ffedd5,color:#7c2d12,stroke:#ea580c
classDef teal fill:#ccfbf1,color:#134e4a,stroke:#14b8a6
classDef red fill:#fee2e2,color:#7f1d1d,stroke:#dc2626
When NOT to Write a Custom Kernel
This is as important as knowing how to write one.
Do not write a custom kernel when:
- **`torch.compile` already handles it.** The inductor backend fuses element-wise ops, converts to Triton, and can often match hand-written performance. Check this first with `TORCH_COMPILE_DEBUG=1`.
- **The operation is compute-bound.** Matrix multiplication is compute-bound. Writing a custom GEMM kernel is a research project requiring weeks of work - cuBLAS/cuBLASLt already has highly tuned implementations. You will not beat NVIDIA's team.
- **Your model changes frequently.** Custom kernels require maintenance. If the operation changes every sprint, the kernel maintenance cost exceeds the performance benefit.
- **You have not profiled.** Many "obvious" bottlenecks are not the actual bottleneck. Profile first with `torch.profiler` or Nsight Systems. Write kernels for demonstrated bottlenecks, not intuited ones.
- **Triton meets the target.** Triton kernels are debuggable, Python-based, and compile to efficient PTX. If Triton gets you 90% of the way there, the remaining 10% rarely justifies full CUDA.
Production Engineering Notes
Half Precision Support
Production kernels must support both FP32 and FP16/BF16. Add a template:
template <typename scalar_t>
__global__ void fused_bias_gelu_kernel_typed(
const scalar_t* __restrict__ input,
const scalar_t* __restrict__ bias,
scalar_t* __restrict__ output,
int N, int H
) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx >= N) return;
// Convert to float for computation, back to scalar_t for output
float val = static_cast<float>(input[idx])
+ static_cast<float>(bias[idx % H]);
output[idx] = static_cast<scalar_t>(gelu_approx(val));
}
Dispatch the correct instantiation based on the input dtype using `AT_DISPATCH_FLOATING_TYPES_AND_HALF` (or `AT_DISPATCH_FLOATING_TYPES_AND2` with `at::ScalarType::Half` and `at::ScalarType::BFloat16` to cover BF16 as well).
Stream Management
For inference servers running multiple requests concurrently:
// Use a non-default stream to overlap with other operations
cudaStream_t stream;
cudaStreamCreate(&stream);
fused_bias_gelu_kernel<<<grid, block, 0, stream>>>(
input, bias, output, N, H
);
PyTorch's CUDA extension infrastructure automatically uses the current stream when you call at::cuda::getCurrentCUDAStream().
Common Mistakes
:::danger Missing Bounds Check
The single most common CUDA bug. If your N is not a multiple of your block size, the last block contains threads with no corresponding element. Without a bounds check, those threads access memory out-of-range.
// WRONG - no bounds check
__global__ void kernel(float* data, int N) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
data[idx] = data[idx] * 2.0f; // Out-of-bounds if idx >= N
}
// CORRECT
__global__ void kernel(float* data, int N) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx >= N) return; // Must come before any memory access
data[idx] = data[idx] * 2.0f;
}
:::
:::danger Optimizing Before Profiling
It is tempting to jump straight to vectorization and shared memory tricks. But these optimizations can actually hurt if the bottleneck is elsewhere. Always establish a baseline, profile it, identify the actual bottleneck, then optimize.
:::
:::warning Forgetting cudaGetLastError() After Kernel Launch
Kernel launches are asynchronous and do not return errors directly. You must explicitly call cudaGetLastError() after a launch to catch errors like invalid launch configuration.
myKernel<<<grid, block>>>(args);
// Without this, errors are silent:
CUDA_CHECK(cudaGetLastError());
:::
:::warning Non-Contiguous Tensors
PyTorch tensors may not be contiguous after slicing, transposing, or permuting. Always check tensor.is_contiguous() in your extension or call .contiguous() before passing to a kernel that assumes contiguous layout.
:::
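A short demonstration of how easily this bites - transposes are views, so no copy happens until you ask for one:

```python
import torch

x = torch.randn(512, 4096, device="cuda")
y = x.t()                 # a view: strides change, data does not move
print(y.is_contiguous())  # False - unsafe for kernels assuming flat layout
y = y.contiguous()        # materializes a row-major copy
print(y.is_contiguous())  # True
```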
Interview Questions and Answers
Q1: What is kernel fusion and why does it improve performance?
Kernel fusion combines multiple operations into a single GPU kernel, reducing the number of round-trips to HBM. Each unfused operation reads its inputs from HBM and writes its outputs to HBM. For memory-bound operations like element-wise activations, this memory traffic dominates runtime. Fusing two operations reduces memory traffic by nearly 50% because intermediate results stay in registers rather than being written to and read back from HBM. The classic example is fusing bias addition and GELU - unfused requires five HBM trips, fused requires three (two full-tensor transfers plus a negligible bias read).
Q2: How do you choose block size for a new kernel?
Start with 256 threads per block for memory-bound element-wise kernels. This gives 8 warps per block, which typically fills multiple SMs with high occupancy. For kernels using shared memory, calculate the block size that maximizes occupancy given the shared memory requirement: max_blocks_per_sm = min(max_blocks, shared_mem_per_sm / shared_mem_per_block). Use the CUDA Occupancy Calculator or cudaOccupancyMaxActiveBlocksPerMultiprocessor to validate. For compute-bound kernels, occupancy matters less - focus on register usage and instruction-level parallelism.
Q3: What is the roofline model and how do you apply it?
The roofline model predicts maximum achievable performance given a kernel's arithmetic intensity. Plot peak compute (horizontal ceiling) and bandwidth * intensity (diagonal slope) on a log-log graph. The minimum of the two is your achievable performance. If your kernel is below this ceiling, the issue is not the operation itself but inefficiency in your implementation (poor memory access patterns, divergence, underutilization). Calculate intensity as FLOPs/byte for your kernel and compare to the hardware ridge point (peak FLOPs / peak bandwidth). Below the ridge point, optimize for memory access. Above it, optimize for compute utilization.
Q4: When would you use Triton instead of CUDA?
Triton is appropriate for most new kernel development because it is faster to write, easier to debug, and produces competitive performance for memory-bound kernels. Use Triton when: the kernel is element-wise or has simple access patterns, you need to iterate quickly, the team has Python expertise but limited C++/PTX knowledge, or you need auto-tuning across hardware targets. Use raw CUDA when: you need precise control over shared memory layout, the kernel has complex synchronization patterns, you need warp shuffle operations not exposed by Triton, or you are targeting specific hardware with hand-tuned PTX. FlashAttention-3 uses raw CUDA because, at the time, Triton could not express the Hopper wgmma and TMA instructions its inner loop relies on.
Q5: How do you verify correctness of a custom kernel?
Three-step approach: First, unit test with small inputs where you can compute the expected output by hand. Second, numerical comparison against the PyTorch reference implementation using torch.allclose with appropriate tolerances (rtol=1e-4, atol=1e-5 for FP32). Third, test boundary conditions: N not a multiple of block size, N=1, N=0, very large N, tensors with non-power-of-2 sizes. For gradients, use torch.autograd.gradcheck if you implement a custom backward pass. Run with cuda-memcheck or compute-sanitizer to catch memory errors.
Q6: What is the difference between __device__, __global__, and __host__ qualifiers?
__global__ marks a kernel function callable from host (CPU) code and executed on the GPU. It must return void. __device__ marks a function callable only from GPU code (other device or global functions). __host__ marks a function callable only from CPU code (the default for any C++ function). __host__ __device__ allows a function to compile for both. The __forceinline__ qualifier on __device__ functions prevents function call overhead - important for small helper functions called in hot loops.
Summary
Writing a production CUDA kernel follows a clear sequence: identify the operation and its memory pattern, calculate arithmetic intensity to determine the bound, write the kernel body with proper thread indexing and bounds checks, choose launch configuration for high occupancy, and benchmark against baseline.
The fused bias+GELU kernel demonstrates the full workflow. It eliminates 2 HBM round-trips, achieves 2-3x speedup over unfused PyTorch, and the entire implementation is under 100 lines of CUDA. The Triton alternative achieves comparable performance in half the code.
The most important skill is knowing when not to write a custom kernel. Check torch.compile first. Profile before optimizing. Write Triton before CUDA. Reach for raw CUDA only when nothing else meets the performance target.
The engineers who build the fastest training systems are not the ones who write the most custom kernels. They are the ones who know exactly when a custom kernel is necessary and write precisely the right one.
