Ampere, Hopper, and Ada Architectures
What changed across GPU generations for AI - A100 vs H100 vs H200 vs RTX 4090, NVLink bandwidth, transformer engine, FP8 support, and architecture selection for training and inference.
Apple M-series unified memory architecture for ML inference - how the ANE, GPU, and CPU share one memory pool, why this matters for local LLMs, and how to run models with MLX and llama.cpp on Apple Silicon.
Deep dive into AWS custom AI chips - Trainium for training and Inferentia for inference, NeuronCore-v2 architecture, the Neuron SDK compilation pipeline, and real-world cost-performance tradeoffs versus GPU instances.
How Cerebras builds the world's largest chip by using the entire silicon wafer as one device, eliminating inter-chip communication overhead for large model training and delivering linear scaling without distributed training frameworks.
A complete decision framework for AI accelerator selection - how to evaluate NVIDIA GPUs, TPUs, Trainium, Gaudi, Groq, and custom ASICs across workload fit, TCO, ecosystem maturity, and team capability.
Learn the CUDA programming model from first principles - host vs device execution, kernel launch syntax, the NVCC compilation pipeline, and how to write and compile your first GPU kernel from Python using torch.utils.cpp_extension.
Learn how CUDA streams enable concurrent GPU execution, how to overlap data transfers with computation using double buffering, how CUDA events work for synchronization and timing, and how PyTorch streams integrate with training pipelines for maximum throughput.
How FlashAttention rewrites the attention mechanism to never materialize the N x N matrix in HBM, the online softmax tiling algorithm, IO complexity analysis, and FlashAttention 2 and 3 improvements.
How FPGAs enable sub-microsecond AI inference - reconfigurable logic, HLS programming, Xilinx Vitis AI, quantization strategies, and when FPGAs beat GPUs for latency-critical deployments.
Master the five CUDA memory spaces - registers, shared memory, L1/L2 cache, and global memory - with real latency numbers, tiled matrix multiply, and the patterns that separate 8% bandwidth utilization from 85%.
Deep dive into Google's Tensor Processing Units - systolic array design, XLA compilation, TPU pod topology, and how to write high-performance JAX programs that avoid recompilation traps.
Why GPUs dominate deep learning - SIMT execution model, throughput vs latency optimization, and the fundamental design tradeoffs between CPU and GPU silicon.
How Groq's Language Processing Unit eliminates the memory bottleneck for LLM inference by keeping model weights in on-chip SRAM and using deterministic compiler-scheduled execution.
GPU architecture, CUDA programming, custom silicon, kernel optimization, memory systems, and distributed training hardware - the layer below the framework that determines what is actually possible.
Master ILP, vectorized loads, loop unrolling, and instruction scheduling to extract maximum throughput from CUDA kernels - the techniques separating 31% from 78% peak utilization.
Intel Gaudi AI accelerator architecture - Tensor Processor Cores, built-in RoCE scale-out networking, SynapseAI SDK, and price-performance positioning against NVIDIA H100 for LLM training.
How kernel fusion eliminates HBM round-trips between chained GPU operations, how torch.compile and TorchInductor identify fusible patterns, and how to write manual fused kernels with Triton for maximum throughput.
Master the two most impactful memory access patterns in CUDA - global memory coalescing and shared memory bank conflicts. Understand why identical computation with transposed access can be 8x slower, and how to fix both problems with layout changes and padding.
Registers, L1/L2 cache, shared memory, and HBM - GPU memory hierarchy latency numbers, bandwidth characteristics, and how to write code that uses each level effectively.
Learn how to write correct and fast kernels for FP16, BF16, FP8, INT8, and INT4 quantized models - including the pipeline mistakes that make INT8 slower than FP16.
How GPUs work at the silicon level - streaming multiprocessors, tensor cores, memory hierarchy, and the roofline model that explains every ML performance optimization.
Write GPU kernels from scratch - thread hierarchy, memory spaces, coalescing, warp divergence, and profiling with Nsight - the foundation for understanding every ML framework under the hood.
TPUs, Trainium, Groq LPU, Cerebras WSE, Intel Gaudi, and Apple Silicon - how each architecture differs from GPUs and what workloads each wins on.
FlashAttention, Triton, operator fusion, torch.compile, and XLA - making neural network operations faster by understanding what the hardware actually does with your compute.
HBM, DRAM, cache hierarchies, KV cache management, PagedAttention, and quantization as memory compression - understanding memory is understanding why LLM inference costs what it costs.
NVLink, InfiniBand, AllReduce algorithms, network topology, fault tolerance, and the hardware that makes training across thousands of GPUs possible.
Hardware selection for inference workloads - cost-per-token analysis, batching tradeoffs, edge hardware, speculative decoding implications, and building a complete inference stack.
How GPU occupancy works, what limits it, and how to tune thread block size and register usage to maximize SM utilization without falling into the 100% occupancy trap.
Host-to-device PCIe bandwidth, GPU-to-GPU NVLink and NVSwitch, the interconnect hierarchy in multi-GPU systems, and how interconnect bandwidth shapes model parallelism strategies.
Learn how to use Nsight Systems and Nsight Compute to find GPU performance bottlenecks, read roofline charts, interpret warp stall reasons, and use the PyTorch profiler to guide real optimization decisions.
Arithmetic intensity, roofline model construction, identifying compute vs memory-bound operations, and using the roofline to guide optimization decisions.
H100 vs A100 vs L40S vs RTX 4090 vs A10G - a practical decision framework for matching GPU specifications to training and inference workload requirements.
The SM is the fundamental execution unit of every NVIDIA GPU - warp schedulers, register files, shared memory, occupancy, and how thread block configuration determines performance.
Program NVIDIA Tensor Cores directly using the WMMA API, MMA PTX instructions, Triton tl.dot(), and CUTLASS - understand activation requirements, shape constraints, and how to diagnose zero Tensor Core utilization.
How tensor cores accelerate matrix multiply, BF16 vs FP16 vs FP8 vs TF32, mixed precision training implementation, and the performance impact of precision choices.
Master the CUDA thread hierarchy - threads, warps, blocks, and grids - how they map to physical hardware, how to calculate global thread indices for 1D, 2D, and 3D problems, and how to choose block dimensions for maximum SM occupancy.
How tiled matrix multiply reduces HBM traffic by reusing data in shared memory, optimal tile size selection, double buffering with cp.async, and applying the tiling pattern to attention and convolution.
Write production GPU kernels in Python with OpenAI Triton - learn the tile-based programming model, core primitives, and how to implement softmax, layer norm, GEMM, and custom attention kernels that match CUDA performance.
How branch divergence serializes GPU warp execution, the cost of divergence, warp shuffle intrinsics, and concrete techniques for restructuring kernels to minimize divergence.
End-to-end walkthrough of writing a production-grade fused bias+GELU CUDA kernel, including kernel fusion principles, launch configuration, error checking, Triton alternative, and full benchmarks.