CUDA Programming Model
Learn the CUDA programming model from first principles - host vs device execution, kernel launch syntax, the NVCC compilation pipeline, and how to write and compile your first GPU kernel from Python using torch.utils.cpp_extension.
CUDA Streams and Async Execution
Learn how CUDA streams enable concurrent GPU execution, how to overlap data transfers with computation using double buffering, how CUDA events work for synchronization and timing, and how PyTorch streams integrate with training pipelines for maximum throughput.
Global, Shared, and Register Memory
Master the five CUDA memory spaces - registers, shared memory, L1 cache, L2 cache, and global memory - with real latency numbers, tiled matrix multiply, and the patterns that separate 8% bandwidth utilization from 85%.
Memory Coalescing and Bank Conflicts
Master the two most impactful memory access patterns in CUDA - global memory coalescing and shared memory bank conflicts. Understand why identical computation with transposed access can be 8x slower, and how to fix both problems with layout changes and padding.
Module 2: CUDA Programming
Write GPU kernels from scratch - thread hierarchy, memory spaces, coalescing, warp divergence, and profiling with Nsight - the foundation for understanding every ML framework under the hood.
Profiling with Nsight
Learn how to use Nsight Systems and Nsight Compute to find GPU performance bottlenecks, read roofline charts, interpret warp stall reasons, and use the PyTorch profiler to guide real optimization decisions.
Thread Blocks, Warps, and Grids
Master the CUDA thread hierarchy - threads, warps, blocks, and grids - how they map to physical hardware, how to calculate global thread indices for 1D, 2D, and 3D problems, and how to choose block dimensions for maximum SM occupancy.
Warp Divergence and Control Flow
Learn how branch divergence serializes GPU warp execution, what divergence costs, how warp shuffle intrinsics work, and concrete techniques for restructuring kernels to minimize divergence.
Writing Your First CUDA Kernel
An end-to-end walkthrough of writing a production-grade fused bias+GELU CUDA kernel, covering kernel fusion principles, launch configuration, error checking, a Triton alternative, and full benchmarks.