9 docs tagged with "cuda-programming"

CUDA Programming Model

Learn the CUDA programming model from first principles - host vs device execution, kernel launch syntax, the NVCC compilation pipeline, and how to write and compile your first GPU kernel from Python using torch.utils.cpp_extension.

CUDA Streams and Async Execution

Learn how CUDA streams enable concurrent GPU execution, how to overlap data transfers with computation using double buffering, how CUDA events work for synchronization and timing, and how PyTorch streams integrate with training pipelines for maximum throughput.

Global, Shared, and Register Memory

Master the five CUDA memory spaces - registers, shared memory, L1 cache, L2 cache, and global memory - with real latency numbers, tiled matrix multiply, and the patterns that separate 8% bandwidth utilization from 85%.

Memory Coalescing and Bank Conflicts

Master the two most impactful memory access patterns in CUDA - global memory coalescing and shared memory bank conflicts. Understand why identical computation with transposed access can be 8x slower, and how to fix both problems with layout changes and padding.

Module 2: CUDA Programming

Profiling with Nsight

Thread Blocks, Warps, and Grids

Warp Divergence and Control Flow

Writing Your First CUDA Kernel