Flash Attention Kernel Deep Dive
How FlashAttention rewrites the attention mechanism to never materialize the N x N matrix in HBM, the online softmax tiling algorithm, IO complexity analysis, and FlashAttention 2 and 3 improvements.
Master instruction-level parallelism (ILP), vectorized loads, loop unrolling, and instruction scheduling to extract maximum throughput from CUDA kernels - the techniques that separate 31% from 78% of peak utilization.
How kernel fusion eliminates HBM round-trips between chained GPU operations, how torch.compile and TorchInductor identify fusible patterns, and how to write manual fused kernels with Triton for maximum throughput.
Learn how to write correct and fast kernels for FP16, BF16, FP8, INT8, and INT4 quantized models - including the pipeline mistakes that make INT8 slower than FP16.
FlashAttention, Triton, operator fusion, torch.compile, and XLA - making neural network operations faster by understanding what the hardware actually does with your compute.
How GPU occupancy works, what limits it, and how to tune thread block size and register usage to maximize SM utilization without falling into the 100% occupancy trap.
Program NVIDIA Tensor Cores directly using the WMMA API, MMA PTX instructions, Triton tl.dot(), and CUTLASS - understand the requirements for engaging Tensor Cores, their shape constraints, and how to diagnose zero Tensor Core utilization.
How tiled matrix multiply reduces HBM traffic by reusing data in shared memory, optimal tile size selection, double buffering with cp.async, and applying the tiling pattern to attention and convolution.
Write production GPU kernels in Python with OpenAI Triton - learn the tile-based programming model, core primitives, and how to implement softmax, layer norm, GEMM, and custom attention kernels that match CUDA performance.