9 docs tagged with "kernel-optimization"

Flash Attention Kernel Deep Dive

How FlashAttention rewrites the attention mechanism to never materialize the N x N matrix in HBM, the online softmax tiling algorithm, IO complexity analysis, and FlashAttention 2 and 3 improvements.

Instruction-Level Optimization

Master ILP, vectorized loads, loop unrolling, and instruction scheduling to extract maximum throughput from CUDA kernels - the techniques separating 31% from 78% peak utilization.

Kernel Fusion Strategies

How kernel fusion eliminates HBM round-trips between chained GPU operations, how torch.compile and TorchInductor identify fusible patterns, and how to write manual fused kernels with Triton for maximum throughput.

Mixed Precision and Quantization Kernels

Learn how to write correct and fast kernels for FP16, BF16, FP8, INT8, and INT4 quantized models - including the pipeline mistakes that make INT8 slower than FP16.

Module 4: Kernel Optimization

FlashAttention, Triton, operator fusion, torch.compile, and XLA - making neural network operations faster by understanding what the hardware actually does with your compute.

Occupancy and Thread Block Tuning

How GPU occupancy works, what limits it, and how to tune thread block size and register usage to maximize SM utilization without falling into the 100% occupancy trap.

Tensor Core Programming

Program NVIDIA Tensor Cores directly using the WMMA API, MMA PTX instructions, Triton tl.dot(), and CUTLASS - understand activation requirements, shape constraints, and how to diagnose zero Tensor Core utilization.

Tiling and Shared Memory Optimization

How tiled matrix multiply reduces HBM traffic by reusing data in shared memory, optimal tile size selection, double buffering with cp.async, and applying the tiling pattern to attention and convolution.

Triton for Custom Kernels

Write production GPU kernels in Python with OpenAI Triton - learn the tile-based programming model, core primitives, and how to implement softmax, layer norm, GEMM, and custom attention kernels that match CUDA performance.