40 docs tagged with "hardware-ai"

Ampere, Hopper, and Ada Architectures

What changed across GPU generations for AI - A100 vs H100 vs H200 vs RTX 4090, NVLink bandwidth, transformer engine, FP8 support, and architecture selection for training and inference.

Apple Silicon for AI

Apple M-series unified memory architecture for ML inference - how the ANE, GPU, and CPU share one memory pool, why this matters for local LLMs, and how to run models with MLX and llama.cpp on Apple Silicon.

AWS Trainium and Inferentia

Deep dive into AWS custom AI chips - Trainium for training and Inferentia for inference, NeuronCore-v2 architecture, the Neuron SDK compilation pipeline, and real-world cost-performance tradeoffs versus GPU instances.

Cerebras Wafer Scale Engine

How Cerebras builds the world's largest chip by using the entire silicon wafer as one device, eliminating inter-chip communication overhead for large model training and delivering linear scaling without distributed training frameworks.

Choosing Custom Silicon vs GPUs

A complete decision framework for AI accelerator selection - how to evaluate NVIDIA GPUs, TPUs, Trainium, Gaudi, Groq, and custom ASICs across workload fit, TCO, ecosystem maturity, and team capability.

CUDA Programming Model

Learn the CUDA programming model from first principles - host vs device execution, kernel launch syntax, the NVCC compilation pipeline, and how to write and compile your first GPU kernel from Python using torch.utils.cpp_extension.

CUDA Streams and Async Execution

Learn how CUDA streams enable concurrent GPU execution, how to overlap data transfers with computation using double buffering, how CUDA events work for synchronization and timing, and how PyTorch streams integrate with training pipelines for maximum throughput.

Flash Attention Kernel Deep Dive

How FlashAttention rewrites the attention mechanism to never materialize the N x N matrix in HBM, the online softmax tiling algorithm, IO complexity analysis, and FlashAttention 2 and 3 improvements.

FPGAs for AI Inference

How FPGAs enable sub-microsecond AI inference - reconfigurable logic, HLS programming, Xilinx Vitis AI, quantization strategies, and when FPGAs beat GPUs for latency-critical deployments.

Global, Shared, and Register Memory

Master the five CUDA memory spaces - registers, shared memory, L1/L2 cache, and global memory - with real latency numbers, tiled matrix multiply, and the patterns that separate 8% bandwidth utilization from 85%.

Google TPU Architecture

Deep dive into Google's Tensor Processing Units - systolic array design, XLA compilation, TPU pod topology, and how to write high-performance JAX programs that avoid recompilation traps.

GPU vs CPU Architecture

Why GPUs dominate deep learning - the SIMT execution model, throughput vs latency optimization, and the fundamental design tradeoffs between CPU and GPU silicon.

Groq LPU Architecture

How Groq's Language Processing Unit eliminates the memory bottleneck for LLM inference by keeping model weights in on-chip SRAM and using deterministic compiler-scheduled execution.

Hardware and Silicon for AI

GPU architecture, CUDA programming, custom silicon, kernel optimization, memory systems, and distributed training hardware - the layer below the framework that determines what is actually possible.

Instruction-Level Optimization

Master ILP, vectorized loads, loop unrolling, and instruction scheduling to extract maximum throughput from CUDA kernels - the techniques separating 31% from 78% peak utilization.

Intel Gaudi and Habana Labs

Intel Gaudi AI accelerator architecture - Tensor Processor Cores, built-in RoCE scale-out networking, SynapseAI SDK, and price-performance positioning against NVIDIA H100 for LLM training.

Kernel Fusion Strategies

How kernel fusion eliminates HBM round-trips between chained GPU operations, how torch.compile and TorchInductor identify fusible patterns, and how to write manual fused kernels with Triton for maximum throughput.

Memory Coalescing and Bank Conflicts

Master the two most impactful memory access patterns in CUDA - global memory coalescing and shared memory bank conflicts. Understand why identical computation with transposed access can be 8x slower, and how to fix both problems with layout changes and padding.

Memory Hierarchy in GPUs

Registers, L1/L2 cache, shared memory, and HBM - GPU memory hierarchy latency numbers, bandwidth characteristics, and how to write code that uses each level effectively.

Mixed Precision and Quantization Kernels

Learn how to write correct and fast kernels for FP16, BF16, FP8, INT8, and INT4 quantized models - including the pipeline mistakes that make INT8 slower than FP16.

Module 1: GPU Architecture

How GPUs work at the silicon level - streaming multiprocessors, tensor cores, memory hierarchy, and the roofline model that explains every ML performance optimization.

Module 2: CUDA Programming

Write GPU kernels from scratch - thread hierarchy, memory spaces, coalescing, warp divergence, and profiling with Nsight - the foundation for understanding every ML framework under the hood.

Module 3: Custom Silicon for AI

TPUs, Trainium, Groq LPU, Cerebras WSE, Intel Gaudi, and Apple Silicon - how each architecture differs from GPUs and what workloads each wins on.

Module 4: Kernel Optimization

FlashAttention, Triton, operator fusion, torch.compile, and XLA - making neural network operations faster by understanding what the hardware actually does with your compute.

Module 5: Memory Systems for AI

HBM, DRAM, cache hierarchies, KV cache management, PagedAttention, and quantization as memory compression - understanding memory is understanding why LLM inference costs what it costs.

Module 6: Distributed Training Hardware

NVLink, InfiniBand, AllReduce algorithms, network topology, fault tolerance, and the hardware that makes training across thousands of GPUs possible.

Module 7: Inference Hardware

Hardware selection for inference workloads - cost-per-token analysis, batching tradeoffs, edge hardware, speculative decoding implications, and building a complete inference stack.

Occupancy and Thread Block Tuning

How GPU occupancy works, what limits it, and how to tune thread block size and register usage to maximize SM utilization without falling into the 100% occupancy trap.

PCIe and NVLink Interconnects

Host-to-device PCIe bandwidth, GPU-to-GPU NVLink and NVSwitch, the interconnect hierarchy in multi-GPU systems, and how interconnect bandwidth shapes model parallelism strategies.

Profiling with Nsight

Learn how to use Nsight Systems and Nsight Compute to find GPU performance bottlenecks, read roofline charts, interpret warp stall reasons, and use the PyTorch profiler to guide real optimization decisions.

Roofline Model and Bottleneck Analysis

Arithmetic intensity, roofline model construction, identifying compute vs memory-bound operations, and using the roofline to guide optimization decisions.

Selecting GPUs for Training vs Inference

H100 vs A100 vs L40S vs RTX 4090 vs A10G - a practical decision framework for matching GPU specifications to training and inference workload requirements.

Streaming Multiprocessors

The SM is the fundamental execution unit of every NVIDIA GPU - warp schedulers, register files, shared memory, occupancy, and how thread block configuration determines performance.

Tensor Core Programming

Program NVIDIA Tensor Cores directly using the WMMA API, MMA PTX instructions, Triton tl.dot(), and CUTLASS - understand activation requirements, shape constraints, and how to diagnose zero Tensor Core utilization.

Tensor Cores and Mixed Precision

How tensor cores accelerate matrix multiply, BF16 vs FP16 vs FP8 vs TF32, mixed precision training implementation, and the performance impact of precision choices.

Thread Blocks, Warps, and Grids

Master the CUDA thread hierarchy - threads, warps, blocks, and grids - how they map to physical hardware, how to calculate global thread indices for 1D, 2D, and 3D problems, and how to choose block dimensions for maximum SM occupancy.

Tiling and Shared Memory Optimization

How tiled matrix multiply reduces HBM traffic by reusing data in shared memory, optimal tile size selection, double buffering with cp.async, and applying the tiling pattern to attention and convolution.

Triton for Custom Kernels

Write production GPU kernels in Python with OpenAI Triton - learn the tile-based programming model, core primitives, and how to implement softmax, layer norm, GEMM, and custom attention kernels that match CUDA performance.

Warp Divergence and Control Flow

How branch divergence serializes GPU warp execution, the cost of divergence, warp shuffle intrinsics, and concrete techniques for restructuring kernels to minimize divergence.

Writing Your First CUDA Kernel

End-to-end walkthrough of writing a production-grade fused bias+GELU CUDA kernel, including kernel fusion principles, launch configuration, error checking, a Triton alternative, and full benchmarks.