Module 2: CUDA Programming

PyTorch and JAX hide CUDA from you. That is intentional and usually correct - you should not be writing raw CUDA for things frameworks already do well. But there are two situations where knowing CUDA is essential: debugging performance issues that trace to kernel-level behavior, and implementing operations that frameworks do not provide or do not implement efficiently for your specific use case.

This module teaches you enough CUDA to do both. You will write kernels, understand the execution model, debug memory access patterns, and use Nsight to understand what your code is actually doing at the hardware level.

The CUDA Execution Model

CUDA expresses GPU parallelism through a hierarchy: threads group into thread blocks, and thread blocks form a grid. You specify the grid and block dimensions when you launch a kernel, and the GPU scheduler assigns thread blocks to available SMs. Within each block, the hardware executes threads in fixed groups of 32 called warps.
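
To make the hierarchy concrete, here is a minimal sketch (not from the course materials; the kernel name, the 256-thread block size, and the device pointers are illustrative): each thread combines its block and thread indices into a global element index, and the host picks the grid and block dimensions at launch.

```cuda
// Minimal sketch: one thread per output element.
__global__ void vec_add(const float* a, const float* b, float* c, int n) {
    // Global index = block offset within the grid + thread offset within the block.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                    // the last block may be only partially full
        c[i] = a[i] + b[i];
}

// Host-side launch: a 1D grid of 1D blocks sized to cover all n elements.
void launch_vec_add(const float* d_a, const float* d_b, float* d_c, int n) {
    int threads = 256;                          // block dimension (a common choice)
    int blocks  = (n + threads - 1) / threads;  // grid dimension, rounded up
    vec_add<<<blocks, threads>>>(d_a, d_b, d_c, n);
}
```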

The key insight: all threads in a warp execute the same instruction simultaneously (SIMT - Single Instruction Multiple Thread). If threads in the same warp take different branches, the GPU serializes those branches, executing each path with the other path's threads masked off. This is warp divergence, and it is one of the main ways to accidentally halve your GPU throughput.
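
As a hedged illustration (the even/odd split and kernel names are invented for this sketch): in the first kernel, lanes of the same warp disagree on the branch, so both paths execute serially; the second computes both results and selects one arithmetically, which the compiler typically lowers to a predicated select rather than a branch.

```cuda
// Divergent: threads within one warp take different branches, so the warp
// executes both paths, each time with the other path's lanes masked off.
__global__ void divergent(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (i % 2 == 0) x[i] = x[i] * 2.0f;   // even lanes
    else            x[i] = x[i] + 1.0f;   // odd lanes: serialized second pass
}

// Branch-free alternative: every lane executes the same instructions and
// selects its result arithmetically.
__global__ void uniform(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float even = x[i] * 2.0f;
    float odd  = x[i] + 1.0f;
    x[i] = (i % 2 == 0) ? even : odd;     // typically compiled to a select, not a branch
}
```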

Memory is equally critical. A thread has access to: its own registers (fastest, ~1 cycle), shared memory within its thread block (~5-10 cycles), and global device memory (HBM, ~200-600 cycles). The difference between a fast kernel and a slow one often comes down to whether you are hitting registers and shared memory or thrashing global memory with non-coalesced accesses.
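
A sketch of the coalescing point, with an illustrative stride parameter (not from the course): in the first kernel a warp's 32 loads fall in one contiguous 128-byte region and coalesce into a few wide transactions; in the second, adjacent threads read addresses far apart, so the same warp touches many separate cache lines.

```cuda
// Coalesced: thread i reads element i, so adjacent threads in a warp read
// adjacent addresses and the loads merge into a few memory transactions.
__global__ void copy_coalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Non-coalesced: adjacent threads read addresses 'stride' elements apart,
// so each warp touches many separate cache lines and wastes bandwidth.
__global__ void copy_strided(const float* in, float* out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int j = (i * stride) % n;   // illustrative scattered index
        out[i] = in[j];
    }
}
```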

[Figure: CUDA memory hierarchy]

Lessons in This Module

#  | Lesson                                | Key Concept
1  | CUDA Programming Model                | Host vs device, kernel launch syntax, compilation
2  | Thread Blocks, Warps, and Grids       | Thread hierarchy, choosing block dimensions
3  | Global, Shared, and Register Memory   | When to use each memory space
4  | Memory Coalescing and Bank Conflicts  | Access patterns that maximize memory bandwidth
5  | Writing Your First CUDA Kernel        | Vector addition, matrix multiply from scratch
6  | Warp Divergence and Control Flow      | Avoiding branch divergence, predication
7  | Profiling with Nsight                 | Nsight Compute, Nsight Systems, reading the reports
8  | CUDA Streams and Async Execution      | Overlapping compute and data transfer

Key Concepts You Will Master

  • Kernel launch configuration - choosing grid and block dimensions that maximize SM occupancy
  • Shared memory tiling - the classic technique for reducing global memory accesses in matrix multiply (see the sketch after this list)
  • Coalesced memory access - writing access patterns that let adjacent threads read adjacent memory addresses
  • Nsight Compute - reading the profiler output to identify the actual bottleneck in a kernel
  • Asynchronous execution - using CUDA streams to overlap data transfer with kernel execution
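
Below is a minimal sketch of the shared memory tiling idea referenced above; the TILE size of 16 and the assumption that N is a multiple of TILE are illustrative simplifications, not the course's reference implementation.

```cuda
// Tiled matrix multiply: each block stages a TILE x TILE tile of A and of B in
// shared memory, so each global element is read once per tile instead of once
// per output element. Assumes square N x N matrices with N a multiple of TILE.
#define TILE 16

__global__ void matmul_tiled(const float* A, const float* B, float* C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;   // output row this thread owns
    int col = blockIdx.x * TILE + threadIdx.x;   // output column this thread owns
    float acc = 0.0f;

    for (int t = 0; t < N / TILE; ++t) {
        // Each thread loads one element of each tile; adjacent threads load
        // adjacent addresses, so both loads are coalesced.
        As[threadIdx.y][threadIdx.x] = A[row * N + (t * TILE + threadIdx.x)];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();                         // tile fully loaded before use

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                         // finish with the tile before overwriting it
    }
    C[row * N + col] = acc;
}
```

Launch it with dim3 block(TILE, TILE) and dim3 grid(N / TILE, N / TILE); a production kernel would also handle ragged edges and rectangular shapes.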

Prerequisites

  • GPU Architecture
  • C or C++ basics (pointers, arrays, functions)
  • Optional: Python/CUDA interop via torch.utils.cpp_extension