
CUDA Programming Model

Reading time: ~40 min - Interview relevance: High - Target roles: CUDA Developer, ML Researcher, Systems Engineer

CUDA is not just a library. It is a programming model that exposes GPU parallelism through a single abstraction: you write a function that runs once, and the hardware executes it millions of times in parallel across thousands of threads.


The Moment You Hit the Wall

You have been writing PyTorch code for two years. You know how to build models, write training loops, use torch.compile, even profile with torch.profiler. But today your research lead drops a paper on your desk about a novel activation function - a smooth, piecewise approximation with a learnable sharpness parameter - and says: "implement this and benchmark it."

You open PyTorch. You look for F.my_new_activation. It does not exist. You try to approximate it with existing ops. You can - it takes six lines of PyTorch and a lot of intermediate tensors. But when you profile it you see three kernel launches, two memory round-trips through global DRAM, and terrible throughput for a function that conceptually requires one pass through the data.

Your research lead has a CUDA background. She opens a .cu file, writes forty lines, compiles it with torch.utils.cpp_extension, and drops it in as a torch.autograd.Function. One kernel. One pass. Four times faster.

That gap - between knowing PyTorch and knowing how to write the underlying kernels - is what this module closes.

This lesson is not about theory. It is about understanding exactly how CUDA works so that you can write, compile, and invoke a real GPU kernel from Python. We start with why CUDA exists, build up the full programming model, and end with a working custom kernel that you can run on your machine.

By the end you will understand what happens - at the instruction level - when you call model(x).


Why CUDA Exists: The Parallelism Gap

CPUs Were Not Built for This

Through the 1990s and early 2000s, CPUs got faster every 18 months because transistors got smaller and clock speeds went up. Dennard scaling held: smaller transistors ran cooler, so you could push them faster. By 2004 this broke. Intel's Tejas processor was cancelled because it would have drawn 150W and needed exotic cooling. Clock speeds hit a wall around 3-4 GHz where they still sit today.

CPU architects responded by adding cores: 2, then 4, then 8, then 16. But CPU cores are designed for latency - they have deep pipelines, large out-of-order execution windows, branch predictors, and multiple levels of cache - all to make one thread run as fast as possible. Even a 16-core CPU is designed to run 16 independent tasks fast, not to run a single computation across thousands of workers simultaneously.

Graphics cards faced a different problem. Rendering a 1080p frame means computing color for 2 million pixels. Each pixel is mostly independent. A GPU does not need to run 2 million sequential instructions - it needs to run the same instruction (shade this pixel) across 2 million independent data points at once. This is data parallelism, and it maps naturally to wide SIMD architectures with thousands of simpler cores.

By 2005, researchers noticed something: the same pixel-shading parallelism maps directly to numerical computing. A vector addition of 1 million elements is exactly the same shape as shading 1 million pixels. If you could get general-purpose code to run on a GPU, you would have a scientific supercomputer on a PCI card.

The problem: GPUs in 2005 were programmed in graphics APIs. To run a numerical computation you had to frame it as a shader, load your data as a texture, and pretend you were doing graphics. It worked, but it was absurd. Every GPGPU (General-Purpose GPU) computation involved fake triangles and rasterization just to get your floating-point data onto the chip.

CUDA 2007: The Language That Changed Everything

In 2006, NVIDIA released the G80 GPU - the first designed from the ground up for general-purpose computation alongside graphics. The G80 introduced a unified shader architecture (no distinction between vertex and pixel shaders) and, critically, hardware that could execute general-purpose C code directly.

In February 2007, NVIDIA released CUDA (Compute Unified Device Architecture) 1.0. For the first time, you could write a .cu file - ordinary C with a few extensions - compile it, and run it on GPU hardware without pretending you were doing graphics. The abstraction was revolutionary: write a function, tell the GPU to run it across N threads, and the hardware figures out how to schedule those threads across its thousands of execution units.

The "aha moment" of CUDA was not a new algorithm. It was a model: treat the GPU as a massively parallel co-processor with its own memory, expose the parallelism through a simple thread hierarchy, and let the programmer express computation without knowing anything about graphics APIs.

This democratized GPU computing. By 2012, Alex Krizhevsky trained AlexNet on two GTX 580 cards using CUDA. Every major deep learning framework today - PyTorch, TensorFlow, JAX - compiles to CUDA kernels under the hood. Every competing AI accelerator - Google TPU, AWS Trainium - is benchmarked against what CUDA on NVIDIA GPUs like the H100 can do. Understanding CUDA is not optional for serious ML engineers.


How CUDA Programs Are Structured

A CUDA program consists of two distinct parts running on two distinct processors with two distinct memory spaces. This is the first and most important mental model to build.

┌────────────────────────────────────────────────────────┐
│  HOST (CPU)                                            │
│  - Runs sequentially (or with CPU threads)             │
│  - Manages program logic, I/O, data transfer           │
│  - Allocates/frees both CPU and GPU memory             │
│  - Launches kernels and checks errors                  │
│                                                        │
│  Memory: System DRAM (e.g. 64 GB DDR5)                 │
└──────────────────────────┬─────────────────────────────┘
                           │  PCI-Express Bus
                           │  (H2D / D2H memory copies)
┌──────────────────────────▼─────────────────────────────┐
│  DEVICE (GPU)                                          │
│  - Runs massively parallel kernel code                 │
│  - Thousands of threads executing the same function    │
│  - Each thread has its own registers + stack           │
│  - Shared memory within a block                        │
│                                                        │
│  Memory: HBM / GDDR (e.g. 80 GB HBM3 on H100)          │
└────────────────────────────────────────────────────────┘

The CPU and GPU do not share memory. Data that lives on the CPU must be explicitly copied to the GPU before a kernel can use it. Results computed on the GPU must be explicitly copied back to the CPU before the CPU can read them. This is not a convenience - it is a fundamental constraint of the hardware.

The CUDA Compilation Pipeline

Before we write code, we need to understand where .cu files go.

NVCC (NVIDIA CUDA Compiler) is a wrapper compiler that does two things: it separates host code from device code, then compiles each with the appropriate tool. Host code goes to your system C++ compiler (typically g++ or clang). Device code goes through NVIDIA's own compiler.

PTX (Parallel Thread eXecution) is the key intermediate step. PTX is a virtual instruction set architecture - think of it like JVM bytecode, but for GPUs. PTX is not machine code for any specific GPU. It describes computation at a high level. When you deploy a CUDA binary, the PTX is included alongside (or instead of) actual machine code. At runtime, the CUDA driver performs JIT (just-in-time) compilation from PTX to SASS for the exact GPU in the machine.

SASS (Shader ASSembly) is the actual binary machine code that executes on a specific GPU microarchitecture. SASS is specific to a GPU generation: SASS for an A100 will not run on a V100. This is why PTX is valuable - you ship PTX, and the driver compiles to SASS for whatever GPU the customer has.

The practical implication: if you compile only for a compute capability newer than the GPU you have, there is nothing the driver can JIT from, and the kernel fails to load with a "no kernel image is available" error. If you compile for an older capability than your GPU, PTX JIT will produce working code, but it may not use newer hardware features.
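If you are unsure what the driver will be compiling for, you can query the device at runtime. Below is a minimal standalone sketch (the file name query_cc.cu is illustrative, not part of this lesson's build) that prints each visible GPU's compute capability using cudaGetDeviceProperties:

// query_cc.cu - print each visible GPU's compute capability (illustrative sketch)
// Build: nvcc query_cc.cu -o query_cc
#include <stdio.h>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);

    for (int dev = 0; dev < count; dev++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        // prop.major and prop.minor form the compute capability, e.g. 8.0 -> sm_80
        printf("Device %d: %s (compute capability %d.%d)\n",
               dev, prop.name, prop.major, prop.minor);
    }
    return 0;
}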


The Three Function Qualifiers

CUDA C++ adds three function qualifiers to standard C++ that determine where a function lives and who can call it.

// Runs on GPU, callable only from CPU (or another __global__ kernel in CUDA 5+)
// This is what you "launch" with <<<>>>
__global__ void my_kernel(float* data, int n) {
    // This body runs in parallel across N threads
}

// Runs on GPU, callable only from GPU (__global__ or other __device__ functions)
// Cannot be called from host code
__device__ float my_helper(float x) {
    return x * x + 1.0f;
}

// Runs on CPU, callable only from CPU
// This is just normal C++ - the qualifier is optional but makes intent clear
__host__ void prepare_data(float* h_data, int n) {
    for (int i = 0; i < n; i++) h_data[i] = (float)i;
}

// Can run on BOTH CPU and GPU - useful for math utility functions
// NVCC compiles two versions
__host__ __device__ float sigmoid(float x) {
    return 1.0f / (1.0f + expf(-x));
}

The most important rule: __global__ functions are the entry points. They are called from CPU code and execute on the GPU. They must return void. They cannot be virtual, cannot have default arguments (in older CUDA), and cannot be called recursively (in most cases).

__device__ functions are like inline helper functions that only exist on the GPU side. They are inlined by default and have no additional overhead compared to just duplicating the code.

__host__ __device__ is the most useful combination for utility math - the compiler generates two versions and each is used in its respective context.
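To make the "two versions" point concrete, here is a small sketch (a hypothetical check_sigmoid.cu, not part of the lesson's build) that calls the same __host__ __device__ function once from the CPU and once from a kernel:

// check_sigmoid.cu - one function, compiled once for the host and once for the device
#include <stdio.h>
#include <math.h>
#include <cuda_runtime.h>

__host__ __device__ float sigmoid(float x) {
    return 1.0f / (1.0f + expf(-x));
}

__global__ void sigmoid_kernel(float* out) {
    out[0] = sigmoid(2.0f);              // device version executes here
}

int main() {
    float host_val = sigmoid(2.0f);      // host version executes here

    float* d_out;
    cudaMalloc(&d_out, sizeof(float));
    sigmoid_kernel<<<1, 1>>>(d_out);

    float device_val;
    cudaMemcpy(&device_val, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_out);

    printf("host: %f  device: %f\n", host_val, device_val);  // both ~0.880797
    return 0;
}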


Kernel Launch Syntax

The triple-angle-bracket syntax is CUDA's extension to C++ for launching kernels:

kernel_name<<<gridDim, blockDim, sharedMem, stream>>>(arg1, arg2, ...);

Let's break down each parameter:

| Parameter | Type | Required | Meaning |
|-----------|------|----------|---------|
| gridDim | dim3 or int | Yes | How many blocks in the grid |
| blockDim | dim3 or int | Yes | How many threads in each block |
| sharedMem | size_t | No (default 0) | Dynamic shared memory in bytes per block |
| stream | cudaStream_t | No (default 0) | Which stream to use for async execution |
The total number of threads launched is gridDim.x * gridDim.y * gridDim.z * blockDim.x * blockDim.y * blockDim.z. If you launch <<<4, 256>>> you get 4 blocks of 256 threads each = 1024 threads total.
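For 2D data such as images or matrices, the same arithmetic is usually expressed with dim3. A sketch under illustrative assumptions (the kernel, the 1920x1080 size, and the d_pixels buffer are made up for the example):

// Covering a hypothetical 1920x1080 image with 16x16 thread blocks
__global__ void scale_pixels(float* pixels, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;    // column
    int y = blockIdx.y * blockDim.y + threadIdx.y;    // row
    if (x >= width || y >= height) return;            // guard the ragged edges
    pixels[y * width + x] *= 2.0f;
}

// Host side (d_pixels assumed already allocated and filled)
void launch_scale_pixels(float* d_pixels) {
    dim3 block(16, 16);                          // 256 threads per block
    dim3 grid((1920 + block.x - 1) / block.x,    // 120 blocks in x
              (1080 + block.y - 1) / block.y);   //  68 blocks in y (ceiling division)
    scale_pixels<<<grid, block>>>(d_pixels, 1920, 1080);
}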

A Simple First Kernel

Here is the simplest possible CUDA program: vector addition.

#include <stdio.h>
#include <cuda_runtime.h>

// Kernel: runs on GPU, called from CPU
__global__ void vector_add(const float* a, const float* b, float* c, int n) {
    // Each thread computes one element
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Boundary guard - MANDATORY when n is not a multiple of blockDim.x
    if (idx >= n) return;

    c[idx] = a[idx] + b[idx];
}

int main() {
    const int N = 1024;
    const int bytes = N * sizeof(float);

    // Allocate host memory
    float* h_a = (float*)malloc(bytes);
    float* h_b = (float*)malloc(bytes);
    float* h_c = (float*)malloc(bytes);

    // Fill with data
    for (int i = 0; i < N; i++) {
        h_a[i] = (float)i;
        h_b[i] = (float)(i * 2);
    }

    // Allocate device memory
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);

    // Copy from host to device (H2D)
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Launch kernel: 4 blocks, 256 threads per block = 1024 threads
    int blockSize = 256;
    int gridSize = (N + blockSize - 1) / blockSize;  // ceiling division
    vector_add<<<gridSize, blockSize>>>(d_a, d_b, d_c, N);

    // Wait for GPU to finish
    cudaDeviceSynchronize();

    // Copy result from device to host (D2H)
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

    // Verify
    printf("c[0] = %f (expected %f)\n", h_c[0], h_a[0] + h_b[0]);
    printf("c[511] = %f (expected %f)\n", h_c[511], h_a[511] + h_b[511]);

    // Free device memory
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);

    // Free host memory
    free(h_a);
    free(h_b);
    free(h_c);

    return 0;
}

This main function contains the complete CUDA workflow that every GPU program follows:

  1. Allocate host memory (malloc)
  2. Allocate device memory (cudaMalloc)
  3. Copy data to device (cudaMemcpy with cudaMemcpyHostToDevice)
  4. Launch kernel (<<<>>>)
  5. Synchronize (cudaDeviceSynchronize)
  6. Copy results back (cudaMemcpy with cudaMemcpyDeviceToHost)
  7. Free device memory (cudaFree)
  8. Free host memory (free)

You will follow this pattern in every CUDA program you ever write.


The Memory API in Detail

cudaMalloc

cudaError_t cudaMalloc(void** devPtr, size_t size);

cudaMalloc allocates size bytes of linear memory on the device. The pointer stored in devPtr points into GPU memory - you cannot dereference it from CPU code. Attempting to do so causes a segfault or undefined behavior.

The common pattern is:

float* d_data;
cudaError_t err = cudaMalloc(&d_data, N * sizeof(float));
if (err != cudaSuccess) {
    fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
    exit(1);
}

cudaMemcpy

cudaError_t cudaMemcpy(void* dst, const void* src, size_t count, cudaMemcpyKind kind);

The kind parameter tells CUDA which direction the copy goes:

| Kind | Meaning |
|------|---------|
| cudaMemcpyHostToDevice | CPU RAM -> GPU VRAM |
| cudaMemcpyDeviceToHost | GPU VRAM -> CPU RAM |
| cudaMemcpyDeviceToDevice | GPU VRAM -> GPU VRAM (same GPU) |
| cudaMemcpyHostToHost | CPU RAM -> CPU RAM (rarely used) |
| cudaMemcpyDefault | Inferred from pointer attributes (requires unified virtual addressing) |

cudaMemcpy is synchronous by default - it blocks the CPU thread until the copy completes. For asynchronous copies (which overlap with kernel execution), use cudaMemcpyAsync with a stream.

cudaFree

cudaError_t cudaFree(void* devPtr);

Frees memory previously allocated with cudaMalloc. Calling free() on a device pointer (instead of cudaFree) is a bug and will corrupt memory. Calling cudaFree on a host pointer is also a bug. The two memory spaces are completely separate.

cudaDeviceSynchronize

cudaError_t cudaDeviceSynchronize();

This call blocks the CPU thread until all previously issued CUDA operations - kernels, memory copies, etc. - on the device have completed. This is mandatory before you read results back from device memory, because kernel launches are asynchronous. Without it, your cudaMemcpy D2H might execute before the kernel finishes.


Error Checking: The CUDA_CHECK Macro

Every CUDA API function returns a cudaError_t. In production code, you must check every single one. The standard way is a macro:

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Production error-check macro
#define CUDA_CHECK(call) \
    do { \
        cudaError_t err = (call); \
        if (err != cudaSuccess) { \
            fprintf(stderr, "CUDA error at %s:%d - %s\n", \
                    __FILE__, __LINE__, cudaGetErrorString(err)); \
            exit(EXIT_FAILURE); \
        } \
    } while (0)

// Usage
float* d_data;
CUDA_CHECK(cudaMalloc(&d_data, N * sizeof(float)));
CUDA_CHECK(cudaMemcpy(d_data, h_data, N * sizeof(float), cudaMemcpyHostToDevice));

For kernels, the launch itself does not return an error code - errors are deferred. You check them immediately after the launch:

my_kernel<<<gridSize, blockSize>>>(args...);

// Check for launch errors (wrong arguments, invalid grid size, etc.)
cudaError_t launch_err = cudaGetLastError();
if (launch_err != cudaSuccess) {
    fprintf(stderr, "Kernel launch failed: %s\n", cudaGetErrorString(launch_err));
    exit(EXIT_FAILURE);
}

// Check for execution errors (requires sync)
CUDA_CHECK(cudaDeviceSynchronize());

cudaGetLastError() clears the error state and returns the last error. cudaPeekAtLastError() reads it without clearing.

:::danger Always Check CUDA Errors in Production

The single most common cause of mysterious GPU bugs is silent error propagation. A cudaMalloc fails because the device is out of memory. The pointer is never valid. The kernel silently writes nothing. The D2H copy silently copies garbage. The loss is NaN. You spend three hours debugging your model when the real issue was a one-line error check. Use CUDA_CHECK everywhere. :::


A Complete Custom Kernel: The SiLU Activation

SiLU (Sigmoid Linear Unit) is defined as $f(x) = x \cdot \sigma(x)$, where $\sigma(x) = \frac{1}{1 + e^{-x}}$.

PyTorch has F.silu() built in, but writing it as a custom kernel illustrates the full workflow.

The CUDA C++ Kernel

// silu_kernel.cu
#include <cuda_runtime.h>
#include <math.h>

// __device__ helper - only callable from GPU
__device__ inline float sigmoid_device(float x) {
    return 1.0f / (1.0f + expf(-x));
}

// Forward kernel: computes SiLU elementwise
__global__ void silu_forward_kernel(
    const float* __restrict__ input,
    float* __restrict__ output,
    int n
) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;

    float x = input[idx];
    output[idx] = x * sigmoid_device(x);
}

// Backward kernel: computes d(SiLU)/dx for backprop
// d/dx [x * sigmoid(x)] = sigmoid(x) + x * sigmoid(x) * (1 - sigmoid(x))
//                       = sigmoid(x) * (1 + x * (1 - sigmoid(x)))
__global__ void silu_backward_kernel(
    const float* __restrict__ input,
    const float* __restrict__ grad_output,
    float* __restrict__ grad_input,
    int n
) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;

    float x = input[idx];
    float sig = sigmoid_device(x);
    float dsilu_dx = sig * (1.0f + x * (1.0f - sig));
    grad_input[idx] = grad_output[idx] * dsilu_dx;
}

// C-linkage launcher functions callable from Python via ctypes or torch extension
extern "C" {

void launch_silu_forward(
    const float* input,
    float* output,
    int n,
    cudaStream_t stream
) {
    int block_size = 256;
    int grid_size = (n + block_size - 1) / block_size;
    silu_forward_kernel<<<grid_size, block_size, 0, stream>>>(input, output, n);
}

void launch_silu_backward(
    const float* input,
    const float* grad_output,
    float* grad_input,
    int n,
    cudaStream_t stream
) {
    int block_size = 256;
    int grid_size = (n + block_size - 1) / block_size;
    silu_backward_kernel<<<grid_size, block_size, 0, stream>>>(
        input, grad_output, grad_input, n
    );
}

} // extern "C"

PyTorch Extension Wrapper

The production way to use custom CUDA kernels from Python is torch.utils.cpp_extension. It handles compilation, linking, and Python binding automatically.

Create the following file structure:

my_silu/
├── silu_kernel.cu   (the CUDA kernel above)
├── silu_ext.cpp     (PyTorch C++ binding)
└── setup.py         (build script)

The C++ binding file:

// silu_ext.cpp
#include <torch/extension.h>
#include <ATen/cuda/CUDAContext.h>   // at::cuda::getCurrentCUDAStream
#include <cuda_runtime.h>            // cudaStream_t

// Declare the launcher functions defined in silu_kernel.cu
// (extern "C" so the linkage matches the definitions in the .cu file)
extern "C" {
void launch_silu_forward(
    const float* input, float* output, int n, cudaStream_t stream);
void launch_silu_backward(
    const float* input, const float* grad_output, float* grad_input,
    int n, cudaStream_t stream);
}

// PyTorch-facing functions that unwrap tensors and call our launchers
torch::Tensor silu_forward(torch::Tensor input) {
    TORCH_CHECK(input.is_cuda(), "Input must be a CUDA tensor");
    TORCH_CHECK(input.is_contiguous(), "Input must be contiguous");
    TORCH_CHECK(input.dtype() == torch::kFloat32, "Input must be float32");

    auto output = torch::empty_like(input);
    int n = input.numel();
    cudaStream_t stream = at::cuda::getCurrentCUDAStream();

    launch_silu_forward(
        input.data_ptr<float>(),
        output.data_ptr<float>(),
        n,
        stream
    );
    return output;
}

torch::Tensor silu_backward(
    torch::Tensor input,
    torch::Tensor grad_output
) {
    TORCH_CHECK(input.is_cuda() && grad_output.is_cuda());
    TORCH_CHECK(input.is_contiguous() && grad_output.is_contiguous());

    auto grad_input = torch::empty_like(input);
    int n = input.numel();
    cudaStream_t stream = at::cuda::getCurrentCUDAStream();

    launch_silu_backward(
        input.data_ptr<float>(),
        grad_output.data_ptr<float>(),
        grad_input.data_ptr<float>(),
        n,
        stream
    );
    return grad_input;
}

// Bind to Python module named "silu_ext"
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def("forward", &silu_forward, "SiLU forward (CUDA)");
    m.def("backward", &silu_backward, "SiLU backward (CUDA)");
}

The setup.py build script:

# setup.py
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

setup(
    name="silu_ext",
    ext_modules=[
        CUDAExtension(
            name="silu_ext",
            sources=["silu_ext.cpp", "silu_kernel.cu"],
            extra_compile_args={
                "cxx": ["-O3"],
                "nvcc": [
                    "-O3",
                    "--use_fast_math",           # fused multiply-add, fast transcendentals
                    "-arch=sm_80",               # target Ampere (A100); change for your GPU
                    "--expt-relaxed-constexpr",  # needed for some PyTorch headers
                ],
            },
        )
    ],
    cmdclass={"build_ext": BuildExtension},
)

Build it:

cd my_silu
python setup.py build_ext --inplace
# or faster, for development:
pip install -e . --no-build-isolation

Using the Extension from Python

# test_silu.py
import torch
import torch.autograd as autograd
import torch.nn.functional as F

import silu_ext  # the compiled extension


class SiLUFunction(autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return silu_ext.forward(x)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return silu_ext.backward(x, grad_output.contiguous())


class SiLU(torch.nn.Module):
    def forward(self, x):
        return SiLUFunction.apply(x)


# Test it
device = torch.device("cuda")
x = torch.randn(1_000_000, device=device, requires_grad=True)

# Our kernel
model = SiLU()
y = model(x)
loss = y.sum()
loss.backward()
print("Custom SiLU output mean:", y.mean().item())

# Reference: PyTorch's built-in SiLU
x2 = x.detach().clone().requires_grad_(True)
y2 = F.silu(x2)
loss2 = y2.sum()
loss2.backward()

# Compare
print("Max forward diff:", (y - y2).abs().max().item())
print("Max grad diff:", (x.grad - x2.grad).abs().max().item())

Expected output:

Custom SiLU output mean: ~0.21 (exact value depends on the random draw)
Max forward diff: 0.0 (or < 1e-6 due to float precision)
Max grad diff: 0.0

JIT Compilation with torch.utils.cpp_extension.load

For quick iteration during development, you do not need a setup.py. You can JIT-compile directly from Python:

import torch
from torch.utils.cpp_extension import load

silu_ext = load(
    name="silu_ext",
    sources=["silu_ext.cpp", "silu_kernel.cu"],
    extra_cuda_cflags=["-O3", "--use_fast_math", "-arch=sm_80"],
    verbose=True,  # shows compilation output
)

# Now use silu_ext.forward() and silu_ext.backward() as before

The first call compiles and caches in ~/.cache/torch_extensions/. Subsequent calls load the cached .so directly.


The NVCC Compilation Pipeline: Under the Hood

When you run nvcc silu_kernel.cu -o silu, here is what happens, step by step:

  1. NVCC splits the source into host code and device code.
  2. Host code is handed to the system C++ compiler (g++ or clang).
  3. Device code is compiled by NVIDIA's compiler frontend into PTX, the virtual ISA.
  4. ptxas assembles the PTX into SASS for the target architecture(s).
  5. The SASS (and optionally the PTX) is embedded into a fat binary and linked with the host object code into a single executable.

PTX is key. You can inspect it by adding -ptx to your NVCC command:

nvcc -ptx -arch=sm_80 silu_kernel.cu -o silu_kernel.ptx

A fragment of the output looks like:

.visible .entry silu_forward_kernel(
    .param .u64 input,
    .param .u64 output,
    .param .u32 n
)
{
    .reg .f32 %f<5>;
    .reg .b32 %r<6>;
    .reg .b64 %rd<10>;

    ld.param.u64  %rd1, [input];       // load input pointer
    mov.u32       %r1, %tid.x;         // threadIdx.x
    mov.u32       %r2, %ctaid.x;       // blockIdx.x
    mov.u32       %r3, %ntid.x;        // blockDim.x
    mad.lo.s32    %r4, %r2, %r3, %r1;  // idx = blockIdx.x * blockDim.x + threadIdx.x
    // ... boundary check and actual computation
}

This is valuable for understanding exactly what instruction sequence your kernel compiles to, and for finding missed optimizations.

Compute Capability Targets

The -arch=sm_80 flag specifies the target compute capability. Common values:

| GPU | Compute Capability | -arch flag |
|-----|--------------------|------------|
| V100 | 7.0 | sm_70 |
| T4 | 7.5 | sm_75 |
| A100 | 8.0 | sm_80 |
| A10, RTX 30-series | 8.6 | sm_86 |
| RTX 4090 | 8.9 | sm_89 |
| H100 | 9.0 | sm_90 |

To compile for multiple targets at once (common in libraries that ship to multiple GPU types):

nvcc -gencode arch=compute_70,code=sm_70 \
     -gencode arch=compute_80,code=sm_80 \
     -gencode arch=compute_90,code=sm_90 \
     -gencode arch=compute_90,code=compute_90 \
     -o my_kernel my_kernel.cu

This embeds SASS for each listed architecture; the final code=compute_90 entry additionally embeds PTX, which the driver can JIT-compile for GPUs newer than anything listed (forward compatibility).


Production Engineering Notes

Pinned (Page-Locked) Memory for Faster Transfers

By default, malloc allocates pageable memory. The CUDA driver cannot DMA directly from pageable memory to the GPU - it first copies to a staging buffer. Pinned memory avoids this:

// Allocate pinned memory on host
float* h_data;
cudaMallocHost(&h_data, N * sizeof(float)); // page-locked: cannot be swapped out by the OS

// Use it for H2D/D2H copies - significantly faster
cudaMemcpyAsync(d_data, h_data, N * sizeof(float), cudaMemcpyHostToDevice, stream);

// Always free with cudaFreeHost
cudaFreeHost(h_data);

Pinned memory transfers run 2-4x faster than pageable transfers, but pinned memory cannot be swapped by the OS. Use it only for buffers you know you will be transferring frequently.
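To see the difference on your own hardware, the sketch below times one pageable and one pinned host-to-device copy with CUDA events (the file name, buffer size, and helper are arbitrary choices for the example; error checks omitted for brevity):

// bandwidth_compare.cu - rough pageable vs pinned H2D comparison (illustrative sketch)
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <cuda_runtime.h>

static float time_h2d_ms(const float* h_src, float* d_dst, size_t bytes) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(d_dst, h_src, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    const size_t N = 64ULL * 1024 * 1024;   // 64M floats = 256 MB
    const size_t bytes = N * sizeof(float);

    float* d_buf;
    cudaMalloc(&d_buf, bytes);

    float* h_pageable = (float*)malloc(bytes);
    memset(h_pageable, 0, bytes);            // touch the pages so they are resident

    float* h_pinned;
    cudaMallocHost(&h_pinned, bytes);
    memset(h_pinned, 0, bytes);

    printf("pageable H2D: %.2f ms\n", time_h2d_ms(h_pageable, d_buf, bytes));
    printf("pinned   H2D: %.2f ms\n", time_h2d_ms(h_pinned, d_buf, bytes));

    cudaFreeHost(h_pinned);
    free(h_pageable);
    cudaFree(d_buf);
    return 0;
}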

Streams for Overlapping Computation and Data Transfer

CUDA streams allow you to pipeline computation with data transfer:

cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);

// Chunk the work across two streams: while stream1 is still copying or
// computing chunk 0, stream2's H2D copy for chunk 1 can overlap with it.
// (Host buffers must be pinned for cudaMemcpyAsync to actually overlap.)
cudaMemcpyAsync(d_chunk0, h_chunk0, chunk_bytes, cudaMemcpyHostToDevice, stream1);
my_kernel<<<grid, block, 0, stream1>>>(d_chunk0, d_out0, chunk_n);
cudaMemcpyAsync(h_out0, d_out0, chunk_bytes, cudaMemcpyDeviceToHost, stream1);

cudaMemcpyAsync(d_chunk1, h_chunk1, chunk_bytes, cudaMemcpyHostToDevice, stream2);
my_kernel<<<grid, block, 0, stream2>>>(d_chunk1, d_out1, chunk_n);
cudaMemcpyAsync(h_out1, d_out1, chunk_bytes, cudaMemcpyDeviceToHost, stream2);

cudaStreamSynchronize(stream1);
cudaStreamSynchronize(stream2);

cudaStreamDestroy(stream1);
cudaStreamDestroy(stream2);

The __restrict__ Keyword

In the SiLU kernel above, the pointers have __restrict__. This tells the compiler that no two pointers alias each other - they do not point to overlapping memory regions. Without this hint, the compiler must assume input and output might overlap and generates more conservative (slower) code.

// Without __restrict__: the compiler must assume input and output may alias,
// so values have to be re-loaded from memory after every store
__global__ void slow_kernel(float* input, float* output, int n) {
    // ...
}

// With __restrict__: compiler can cache loads in registers and reorder freely
__global__ void fast_kernel(
    const float* __restrict__ input,
    float* __restrict__ output,
    int n
) {
    // ...
}

This is one of those zero-cost optimizations that is worth making a habit.


Common Mistakes

:::danger Forgetting cudaDeviceSynchronize Before Reading Results

my_kernel<<<grid, block>>>(d_input, d_output, n);
// BUG: kernel may not have finished yet
cudaMemcpy(h_output, d_output, bytes, cudaMemcpyDeviceToHost);
// You will get garbage or zeros

Always synchronize first, or use a synchronous cudaMemcpy which implies a sync. :::

:::danger Dereferencing Device Pointers on the Host

float* d_data;
cudaMalloc(&d_data, 1024 * sizeof(float));
// BUG: d_data points into GPU memory
float val = d_data[0]; // segfault or undefined behavior
printf("%f\n", *d_data); // same - crash

Device pointers are just addresses. On the CPU side they are opaque handles. You cannot read through them. :::

:::warning Not Checking cudaMalloc Return Values

float* d_large;
cudaMalloc(&d_large, 200ULL * 1024 * 1024 * 1024); // request 200 GB
// If this fails, d_large never receives a valid device address
// A kernel launched with it writes through a bad pointer
// Results are garbage, and without error checks you get no message

Use CUDA_CHECK on every allocation. :::

:::warning Wrong Grid Size Calculation (Off by One)

int gridSize = N / blockSize; // BUG: if N=1000, blockSize=256, gridSize=3
// Only 768 elements processed, last 232 missed
int gridSize = (N + blockSize - 1) / blockSize; // CORRECT: ceiling division

Always use ceiling division for grid size calculation. :::

:::danger Kernel Launch with Invalid Configuration

// BUG: requesting more than 1024 threads per block
my_kernel<<<1, 2048>>>(args); // silent failure or cudaErrorInvalidConfiguration
// Always check gridDim/blockDim limits before launch
// blockDim.x * blockDim.y * blockDim.z <= 1024

:::


Interview Q&A

Q1: What is the difference between host code and device code in CUDA, and how does NVCC handle each?

Host code is standard C++ that runs on the CPU. Device code is annotated with __global__, __device__, or __host__ __device__ qualifiers and runs on the GPU. NVCC is a driver compiler that splits the source at these annotations. Host code is compiled by the system C++ compiler (g++/clang). Device code goes through NVIDIA's own compiler frontend, is lowered to PTX (virtual ISA), then assembled by PTXAS to SASS (actual machine code). The two halves are linked into a single binary. The .cu extension tells NVCC to treat the file as potentially containing both host and device code.

Q2: What does the <<<gridDim, blockDim>>> launch syntax specify, and what are the hardware limits?

gridDim specifies how many thread blocks are in the grid; on current hardware the x dimension can hold up to $2^{31}-1$ blocks, while the y and z dimensions are limited to 65535. blockDim specifies how many threads are in each block, with the constraint that blockDim.x * blockDim.y * blockDim.z <= 1024. The total thread count is gridDim * blockDim. The launch also takes optional parameters for dynamic shared memory size per block and a CUDA stream for asynchronous execution.
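These limits do not have to be memorized; they can be read from the device properties at runtime. A minimal sketch (assumes device 0):

#include <stdio.h>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // device 0

    printf("maxThreadsPerBlock: %d\n", prop.maxThreadsPerBlock);   // typically 1024
    printf("maxThreadsDim:      %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("maxGridSize:        %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    return 0;
}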

Q3: Why is cudaDeviceSynchronize() necessary, and when is it safe to skip?

Kernel launches are asynchronous - the CPU continues executing immediately after <<<>>> returns. cudaDeviceSynchronize() blocks the CPU until all pending GPU work completes. It is necessary before any cudaMemcpy D2H or before reading results from mapped pinned memory. It is safe to skip if you use a subsequent synchronous operation that implies it (like a synchronous cudaMemcpy), or if you are building a pipeline where you only care about results at the very end. In benchmarking, skipping sync gives artificially fast times because the CPU timer ends before the GPU work finishes.
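A common follow-up is how to time GPU work correctly. The usual answer is CUDA events, which are timestamped on the GPU itself rather than by a CPU clock. A sketch, assuming a kernel my_kernel and buffers/launch parameters that already exist (all names here are placeholders):

// Timing a kernel with CUDA events (my_kernel, d_in, d_out, n, grid, block are placeholders)
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);                        // enqueued on the default stream
my_kernel<<<grid, block>>>(d_in, d_out, n);
cudaEventRecord(stop);
cudaEventSynchronize(stop);                    // block the CPU until stop has been reached

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);        // elapsed GPU time in milliseconds
printf("kernel time: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);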

Q4: What is PTX and why does NVIDIA use it as an intermediate representation?

PTX (Parallel Thread eXecution) is a virtual instruction set that sits between CUDA C++ and actual GPU machine code (SASS). It serves two purposes. First, it decouples the compiler from specific GPU microarchitectures - the same PTX works across GPU generations. Second, it enables JIT compilation at runtime: when a CUDA binary is deployed on a newer GPU, the driver JIT-compiles the embedded PTX to the new GPU's SASS, providing forward compatibility without recompilation. This is how code compiled for sm_80 (A100) can still run on sm_90 (H100) via JIT, though without sm_90-specific optimizations.

Q5: What does __restrict__ do in a CUDA kernel parameter, and when should you use it?

__restrict__ is a hint to the compiler that no two pointers in the function signature alias each other - they point to non-overlapping memory regions. Without this hint, the compiler must assume any write through one pointer could change the value read through another, which prevents certain optimizations like reordering loads and stores, keeping values in registers, or vectorizing memory operations. You should use it whenever your kernel takes multiple pointer parameters that you know are non-overlapping. The cost is zero (it is a compiler hint, not a runtime check), but incorrect use (when pointers do alias) produces wrong results. A common pattern is marking input pointers const __restrict__ and output pointers __restrict__.

Q6: What happens at the GPU instruction level between a kernel launch and the first thread executing your kernel body?

After <<<>>>, the CUDA runtime submits a kernel descriptor (function pointer, grid/block dimensions, arguments) to a work queue in the GPU driver. The GPU's hardware scheduler (GigaThread engine) receives the work and begins assigning blocks to available SMs. For each assigned block, the SM allocates a set of 32-thread warps, assigns register file space, and allocates shared memory. Each warp is initialized with the appropriate threadIdx and blockIdx values hard-coded into the warp context. The SM then begins issuing instructions from the first warp into its execution pipelines. All of this setup - called the kernel launch overhead - takes roughly 5-20 microseconds, which is why small, fast kernels should be batched rather than launched individually.


Summary

The CUDA programming model is built on three ideas that work together. First, host and device are separate processors with separate memory - you own all data movement. Second, a kernel is a single function that the GPU executes across thousands of threads simultaneously - you express parallelism by writing what one thread does, and the hardware multiplies it. Third, NVCC compiles both halves of your program, lowering device code through PTX to SASS while the host side becomes ordinary binary.

The practical workflow is: allocate device memory with cudaMalloc, copy data with cudaMemcpy H2D, launch your kernel, synchronize with cudaDeviceSynchronize, copy results back with cudaMemcpy D2H, and free with cudaFree. Check every call with CUDA_CHECK. Use torch.utils.cpp_extension to integrate custom CUDA kernels into PyTorch with minimal boilerplate.

In the next lesson we go deeper into the thread hierarchy - warps, blocks, and grids - which is where most performance bugs live.

© 2026 EngineersOfAI. All rights reserved.