24 docs tagged with "hardware"

Batching Strategies for LLM Serving

Static batching, dynamic batching, continuous batching, chunked prefill, and prefill-decode disaggregation for LLM inference throughput and latency optimization.

Cloud vs On-Prem GPU Infrastructure

Total cost of ownership analysis for cloud GPU instances vs on-premises clusters, break-even analysis, spot instance economics, Kubernetes GPU scheduling, and FinOps strategies for GPU compute at scale.

CPU Memory Architecture for ML

How CPU memory hierarchy - L1/L2/L3 caches, DRAM, and NUMA topology - shapes ML data pipelines, DataLoader performance, and large model loading strategies on multi-socket servers.

DGX and HGX System Design

NVIDIA DGX H100 and HGX reference designs - 8-GPU NVLink mesh, NVSwitch fabric, PCIe host bridge, ConnectX InfiniBand, power and cooling requirements, DGX SuperPOD scale-out, and topology-aware NCCL configuration for maximum distributed training throughput.

Edge and Mobile Inference

Running neural networks on devices with 5-15W power budgets - mobile NPUs, Apple Neural Engine, Qualcomm Hexagon, deployment frameworks, and LLMs on-device with llama.cpp and MLX.

Fault Tolerance in Large Cluster Training

Why fault tolerance is critical at scale, how to design checkpointing strategies, detect stragglers, handle spot preemptions, and recover from failures without restarting multi-week training runs.

GPU Cluster Networking

InfiniBand vs RoCE vs Ethernet for GPU cluster communication, fat-tree and rail-optimized topologies, GPUDirect RDMA, SHARP in-network aggregation, and diagnosing collective communication bottlenecks in production ML clusters.

GPU Inference vs Training Requirements

Why inference and training have fundamentally different GPU hardware requirements, covering compute vs memory-bandwidth bottlenecks, the prefill/decode split, and how to select the right GPU for serving.

GPU Memory Hierarchy Deep Dive

Complete GPU memory hierarchy - registers, L1/shared memory, L2 cache, and HBM - capacity, bandwidth, latency at each level, and how data flows through the hierarchy during kernel execution.

Gradient Checkpointing and Rematerialization

Activation checkpointing to reduce training memory usage, the sublinear memory algorithm, selective checkpointing strategies, and implementation in PyTorch and JAX.

HBM and GDDR Memory Technologies

High Bandwidth Memory vs GDDR6X - how 3D stacking with Through-Silicon Vias enables HBM3 to deliver 3.35 TB/s on H100, why GDDR6X tops out at about 1 TB/s, the economics of each, and how memory bandwidth constrains LLM inference throughput.

Inference Cost Optimization

The economics of LLM inference serving - cost per million tokens, GPU utilization, continuous batching, speculative decoding, KV cache management, and building production systems that serve for under $1 per million tokens.

KV Cache Management and PagedAttention

How the KV cache works in transformer inference, why naive memory allocation wastes 60-80% of KV cache memory, and how PagedAttention from vLLM addresses fragmentation using virtual memory techniques from operating systems.

Memory Bandwidth Roofline Analysis

Learn to apply the Roofline model to diagnose whether GPU kernels are memory-bound or compute-bound, calculate arithmetic intensity, and use roofline plots to guide real optimization decisions.

Memory Capacity Planning for LLMs

How to compute exact GPU memory requirements for LLM training and inference - model weights, optimizer states, activations, KV cache - and how to plan GPU cluster configurations for target models.

Multi-GPU Training Architectures

Master data parallelism, tensor parallelism, pipeline parallelism, and 3D parallelism for large-scale model training - with communication volume math, PyTorch DDP vs FSDP, and Megatron-LM weight splitting strategies.

NCCL and Collective Communication

Deep dive into NCCL internals - the five collective operations, ring-allreduce algorithm, tree-reduce for small tensors, algorithm selection heuristics, tuning environment variables, and diagnosing collective hangs in production GPU clusters.

PCIe and NVLink Interconnects

Understand PCIe bandwidth limitations for CPU-GPU data transfer, NVLink for high-speed GPU-to-GPU communication, NVSwitch topology in DGX systems, and how to design systems that avoid interconnect bottlenecks in multi-GPU AI training.

Quantization Hardware Tradeoffs

How INT8, INT4, FP8, and NF4 quantization change memory bandwidth utilization, Tensor Core throughput, and inference latency on real GPUs, including hardware support matrices and production calibration strategies.

Speculative Decoding

How speculative decoding uses a small draft model to generate candidate tokens that the large target model verifies in a single forward pass, achieving inference speedups of roughly 2-3x without changing the output distribution.

Storage IO for Training Pipelines

How storage IO bottlenecks GPU utilization in ML training, NVMe and distributed filesystem characteristics, data loading patterns with WebDataset and DALI, prefetching strategies, and designing checkpointing that does not stall your cluster.

TensorRT and Inference Optimization

NVIDIA TensorRT compilation pipeline, layer fusion, precision calibration, kernel auto-tuning, and deploying optimized inference engines for production LLM and computer vision workloads.

Unified Memory and Memory Pooling

How CUDA Unified Memory works under the hood, when it helps versus hurts performance, and how PyTorch's caching allocator and memory pools eliminate allocation overhead in production ML systems.

ZeRO and Memory Efficiency

DeepSpeed ZeRO stages 1/2/3 - sharding optimizer states, gradients, and parameters across data-parallel workers to enable training of models too large for single-GPU memory.