Module 2: CUDA Programming
PyTorch and JAX hide CUDA from you. That is intentional and usually correct - you should not be writing raw CUDA for things frameworks already do well. But there are two situations where knowing CUDA is essential: debugging performance issues that trace to kernel-level behavior, and implementing operations that frameworks do not provide or do not implement efficiently for your specific use case.
This module teaches you enough CUDA to do both. You will write kernels, understand the execution model, debug memory access patterns, and use Nsight to understand what your code is actually doing at the hardware level.
The CUDA Execution Model
CUDA expresses GPU parallelism through a hierarchy: threads group into thread blocks, and thread blocks form a grid; the hardware further partitions each block into warps of 32 threads. You specify the grid and block dimensions when you launch a kernel, and the GPU scheduler assigns thread blocks to available SMs.
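A minimal sketch of how this hierarchy appears in launch syntax, assuming a 1D problem of `N` work items (the kernel name and sizes are illustrative, not from any particular lesson):

```cuda
#include <cstdio>

// Each thread derives a unique global index from its block and thread coordinates.
__global__ void print_identity() {
    int global_id = blockIdx.x * blockDim.x + threadIdx.x;
    int warp_id   = threadIdx.x / 32;  // warp index within this block
    if (global_id == 0)
        printf("block %d, thread %d, warp %d\n", blockIdx.x, threadIdx.x, warp_id);
}

int main() {
    int N = 1 << 20;                     // one million logical work items
    int block = 256;                     // threads per block (8 warps of 32)
    int grid = (N + block - 1) / block;  // enough blocks to cover all N items
    print_identity<<<grid, block>>>();   // <<<grid dims, block dims>>>
    cudaDeviceSynchronize();             // wait for the kernel to finish
    return 0;
}
```

The ceiling-division idiom `(N + block - 1) / block` is the standard way to guarantee the grid covers every element when `N` is not a multiple of the block size.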
The key insight: all threads in a warp execute the same instruction simultaneously (SIMT - Single Instruction Multiple Thread). If threads in the same warp take different branches, the GPU serializes those branches, executing each path with the other path's threads masked off. This is warp divergence, and it is one of the main ways to accidentally halve your GPU throughput.
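Divergence can be sketched with two versions of the same branchy kernel; the only difference is whether the condition varies within a warp (both kernels are illustrative sketches):

```cuda
#include <cmath>

// Divergent: even/odd alternates between adjacent threads, so every warp
// contains threads on both paths and the hardware serializes the two branches.
__global__ void divergent(float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (i % 2 == 0)
        out[i] = expf(out[i]);
    else
        out[i] = logf(out[i]);
}

// Warp-uniform: the condition is constant across each group of 32 threads,
// so every warp takes exactly one path and nothing is serialized.
__global__ void warp_uniform(float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if ((i / 32) % 2 == 0)
        out[i] = expf(out[i]);
    else
        out[i] = logf(out[i]);
}
```

Both kernels compute valid results; only the second keeps every warp on a single instruction path.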
Memory is equally critical. A thread has access to: its own registers (fastest, ~1 cycle), shared memory within its thread block (~5-10 cycles), and global device memory (HBM, ~200-600 cycles). The difference between a fast kernel and a slow one often comes down to whether you are hitting registers and shared memory or thrashing global memory with non-coalesced accesses.
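All three memory spaces show up in a block-wide sum reduction, a minimal sketch (the kernel name and fixed block size of 256 are assumptions for illustration): each thread reads one value from global memory into a register, stages it in shared memory, and the block reduces there so global memory is touched only at the edges.

```cuda
// Block-wide sum: one global read per thread, one global write per block;
// all intermediate traffic stays in registers and shared memory.
__global__ void block_sum(const float *in, float *block_out, int n) {
    __shared__ float tile[256];           // shared: visible to the whole block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float x = (i < n) ? in[i] : 0.0f;     // x lives in a register
    tile[threadIdx.x] = x;
    __syncthreads();                      // wait until every thread has stored

    // Tree reduction entirely within shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        block_out[blockIdx.x] = tile[0];  // single global write per block
}
```

Launched with `blockDim.x == 256`, this replaces hundreds of global-memory round trips per block with shared-memory traffic at roughly 5-10 cycles per access.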
CUDA Memory Hierarchy
Lessons in This Module
| # | Lesson | Key Concept |
|---|---|---|
| 1 | CUDA Programming Model | Host vs device, kernel launch syntax, compilation |
| 2 | Thread Blocks, Warps, and Grids | Thread hierarchy, choosing block dimensions |
| 3 | Global, Shared, and Register Memory | When to use each memory space |
| 4 | Memory Coalescing and Bank Conflicts | Access patterns that maximize memory bandwidth |
| 5 | Writing Your First CUDA Kernel | Vector addition, matrix multiply from scratch |
| 6 | Warp Divergence and Control Flow | Avoiding branch divergence, predication |
| 7 | Profiling with Nsight | Nsight Compute, Nsight Systems, reading the reports |
| 8 | CUDA Streams and Async Execution | Overlapping compute and data transfer |
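Lesson 5 starts from vector addition; as a preview, here is a minimal end-to-end sketch of the host/device workflow (allocate, copy, launch, copy back):

```cuda
#include <cstdio>
#include <cstdlib>

// One thread per element; the bounds check handles N not divisible by block size.
__global__ void vec_add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int N = 1 << 16;
    size_t bytes = N * sizeof(float);

    // Host buffers.
    float *ha = (float *)malloc(bytes), *hb = (float *)malloc(bytes), *hc = (float *)malloc(bytes);
    for (int i = 0; i < N; i++) { ha[i] = (float)i; hb[i] = 2.0f * i; }

    // Device buffers and host-to-device transfer.
    float *da, *db, *dc;
    cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    int block = 256, grid = (N + block - 1) / block;
    vec_add<<<grid, block>>>(da, db, dc, N);

    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);  // implicit sync with the kernel
    printf("c[100] = %f\n", hc[100]);                    // 100 + 200 = 300

    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(ha); free(hb); free(hc);
    return 0;
}
```

Compile with `nvcc vec_add.cu -o vec_add`. Error checking on the CUDA calls is omitted here for brevity; the lessons add it.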
Key Concepts You Will Master
- Kernel launch configuration - choosing grid and block dimensions that maximize SM occupancy
- Shared memory tiling - the classic technique for reducing global memory accesses in matrix multiply
- Coalesced memory access - writing access patterns that let adjacent threads read adjacent memory addresses
- Nsight Compute - reading the profiler output to identify the actual bottleneck in a kernel
- Asynchronous execution - using CUDA streams to overlap data transfer with kernel execution
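The last bullet can be sketched with two streams pipelining chunks of a buffer, a sketch assuming pinned host memory and a chunk count that divides `n` evenly (function and kernel names are hypothetical):

```cuda
// Pipeline host<->device copies and compute across two streams. Pinned
// (page-locked) host memory, from cudaMallocHost, is required for
// cudaMemcpyAsync to actually overlap with kernel execution.
__global__ void scale(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

void pipelined(float *h_pinned, float *d_buf, int n, int chunks) {
    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);
    int chunk = n / chunks;                 // assumes chunks divides n
    for (int c = 0; c < chunks; c++) {
        cudaStream_t st = s[c % 2];         // alternate streams per chunk
        float *h = h_pinned + c * chunk;
        float *d = d_buf    + c * chunk;
        size_t bytes = chunk * sizeof(float);
        // Chunk c's copy can overlap with chunk c-1's kernel in the other stream.
        cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, st);
        scale<<<(chunk + 255) / 256, 256, 0, st>>>(d, chunk);
        cudaMemcpyAsync(h, d, bytes, cudaMemcpyDeviceToHost, st);
    }
    cudaStreamSynchronize(s[0]);
    cudaStreamSynchronize(s[1]);
    cudaStreamDestroy(s[0]);
    cudaStreamDestroy(s[1]);
}
```

Nsight Systems makes this overlap visible: the timeline shows copy engine and compute engine activity running concurrently instead of back to back.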
Prerequisites
- GPU Architecture
- C or C++ basics (pointers, arrays, functions)
- Optional: Python/CUDA interop via torch.utils.cpp_extension
