Module 2: CUDA Programming

PyTorch and JAX hide CUDA from you. That is intentional and usually correct - you should not be writing raw CUDA for things frameworks already do well. But there are two situations where knowing CUDA is essential: debugging performance issues that trace to kernel-level behavior, and implementing operations that frameworks do not provide or do not implement efficiently for your specific use case.

This module teaches you enough CUDA to do both. You will write kernels, understand the execution model, debug memory access patterns, and use Nsight to understand what your code is actually doing at the hardware level.

The CUDA Execution Model

CUDA expresses GPU parallelism through a hierarchy: threads group into thread blocks, and thread blocks form a grid. You specify the grid and block dimensions when you launch a kernel, and the GPU scheduler assigns thread blocks to available SMs. Within each block, the hardware executes threads in fixed groups of 32 called warps.
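
To make the hierarchy concrete, here is a minimal sketch (not from the course materials; the kernel name, the 256-thread block size, and the device pointers are illustrative): each thread combines its block and thread indices into a global element index, and the host picks the grid and block dimensions at launch.

```cuda
// Minimal sketch: one thread per output element.
__global__ void vec_add(const float* a, const float* b, float* c, int n) {
    // Global index = block offset within the grid + thread offset within the block.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                    // the last block may be only partially full
        c[i] = a[i] + b[i];
}

// Host-side launch: a 1D grid of 1D blocks sized to cover all n elements.
void launch_vec_add(const float* d_a, const float* d_b, float* d_c, int n) {
    int threads = 256;                          // block dimension (a common choice)
    int blocks  = (n + threads - 1) / threads;  // grid dimension, rounded up
    vec_add<<<blocks, threads>>>(d_a, d_b, d_c, n);
}
```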

The key insight: all threads in a warp execute the same instruction simultaneously (SIMT - Single Instruction Multiple Thread). If threads in the same warp take different branches, the GPU serializes those branches, executing each path with the other path's threads masked off. This is warp divergence, and it is one of the main ways to accidentally halve your GPU throughput.
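
As a hedged illustration (the even/odd split and kernel names are invented for this sketch): in the first kernel, lanes of the same warp disagree on the branch, so both paths execute serially; the second computes both results and selects one arithmetically, which the compiler typically lowers to a predicated select rather than a branch.

```cuda
// Divergent: threads within one warp take different branches, so the warp
// executes both paths, each time with the other path's lanes masked off.
__global__ void divergent(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (i % 2 == 0) x[i] = x[i] * 2.0f;   // even lanes
    else            x[i] = x[i] + 1.0f;   // odd lanes: serialized second pass
}

// Branch-free alternative: every lane executes the same instructions and
// selects its result arithmetically.
__global__ void uniform(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float even = x[i] * 2.0f;
    float odd  = x[i] + 1.0f;
    x[i] = (i % 2 == 0) ? even : odd;     // typically compiled to a select, not a branch
}
```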

Memory is equally critical. A thread has access to: its own registers (fastest, ~1 cycle), shared memory within its thread block (~5-10 cycles), and global device memory (HBM, ~200-600 cycles). The difference between a fast kernel and a slow one often comes down to whether you are hitting registers and shared memory or thrashing global memory with non-coalesced accesses.
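
A sketch of the coalescing point, with an illustrative stride parameter (not from the course): in the first kernel a warp's 32 loads fall in one contiguous 128-byte region and coalesce into a few wide transactions; in the second, adjacent threads read addresses far apart, so the same warp touches many separate cache lines.

```cuda
// Coalesced: thread i reads element i, so adjacent threads in a warp read
// adjacent addresses and the loads merge into a few memory transactions.
__global__ void copy_coalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Non-coalesced: adjacent threads read addresses 'stride' elements apart,
// so each warp touches many separate cache lines and wastes bandwidth.
__global__ void copy_strided(const float* in, float* out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int j = (i * stride) % n;   // illustrative scattered index
        out[i] = in[j];
    }
}
```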

[Figure: CUDA memory hierarchy]

Lessons in This Module

#  | Lesson                                | Key Concept
1  | CUDA Programming Model                | Host vs device, kernel launch syntax, compilation
2  | Thread Blocks, Warps, and Grids       | Thread hierarchy, choosing block dimensions
3  | Global, Shared, and Register Memory   | When to use each memory space
4  | Memory Coalescing and Bank Conflicts  | Access patterns that maximize memory bandwidth
5  | Writing Your First CUDA Kernel        | Vector addition, matrix multiply from scratch
6  | Warp Divergence and Control Flow      | Avoiding branch divergence, predication
7  | Profiling with Nsight                 | Nsight Compute, Nsight Systems, reading the reports
8  | CUDA Streams and Async Execution      | Overlapping compute and data transfer

Key Concepts You Will Master

  • Kernel launch configuration - choosing grid and block dimensions that maximize SM occupancy
  • Shared memory tiling - the classic technique for reducing global memory accesses in matrix multiply (see the sketch after this list)
  • Coalesced memory access - writing access patterns that let adjacent threads read adjacent memory addresses
  • Nsight Compute - reading the profiler output to identify the actual bottleneck in a kernel
  • Asynchronous execution - using CUDA streams to overlap data transfer with kernel execution
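
Below is a minimal sketch of the shared memory tiling idea referenced above; the TILE size of 16 and the assumption that N is a multiple of TILE are illustrative simplifications, not the course's reference implementation.

```cuda
// Tiled matrix multiply: each block stages a TILE x TILE tile of A and of B in
// shared memory, so each global element is read once per tile instead of once
// per output element. Assumes square N x N matrices with N a multiple of TILE.
#define TILE 16

__global__ void matmul_tiled(const float* A, const float* B, float* C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;   // output row this thread owns
    int col = blockIdx.x * TILE + threadIdx.x;   // output column this thread owns
    float acc = 0.0f;

    for (int t = 0; t < N / TILE; ++t) {
        // Each thread loads one element of each tile; adjacent threads load
        // adjacent addresses, so both loads are coalesced.
        As[threadIdx.y][threadIdx.x] = A[row * N + (t * TILE + threadIdx.x)];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();                         // tile fully loaded before use

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                         // finish with the tile before overwriting it
    }
    C[row * N + col] = acc;
}
```

Launch it with dim3 block(TILE, TILE) and dim3 grid(N / TILE, N / TILE); a production kernel would also handle ragged edges and rectangular shapes.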

Prerequisites

  • GPU Architecture
  • C or C++ basics (pointers, arrays, functions)
  • Optional: Python/CUDA interop via torch.utils.cpp_extension