Autoregressive Decoding
Understand how LLMs generate tokens one at a time, why decoding is memory-bandwidth bound, and how to reason about inference latency with the roofline model.
How Activation-aware Weight Quantization protects salient weights to achieve near-lossless INT4 compression, and how to deploy AWQ models with AutoAWQ and vLLM.
Static batching, dynamic batching, continuous batching, chunked prefill, and prefill-decode disaggregation for LLM inference throughput and latency optimization.
Measuring local LLM inference speed - tokens per second, time to first token, memory usage, and systematic comparison across quantization levels, models, and hardware configurations.
Learn how continuous batching eliminates GPU idle time by replacing finished sequences immediately rather than waiting for the longest request in a batch to complete.
End-to-end guide for production deployment of quantized LLMs - format selection, serving stack configuration, latency SLAs, A/B testing, quality monitoring, and rollback strategy.
Running LLMs in Docker containers for reproducibility and deployment portability. NVIDIA Container Toolkit, Ollama and vLLM Docker images, multi-stage builds, and Docker Compose for a full local AI stack.
Running neural networks on devices with 5-15W power budgets - mobile NPUs, Apple Neural Engine, Qualcomm Hexagon, deployment frameworks, and LLMs on-device with llama.cpp and MLX.
A deep technical walkthrough of the GPTQ algorithm - Optimal Brain Surgeon derivation, layer-by-layer quantization, group quantization, actorder, and practical deployment with AutoGPTQ and vLLM.
Why inference and training have fundamentally different GPU hardware requirements, covering compute vs memory-bandwidth bottlenecks, the prefill/decode split, and how to select the right GPU for serving.
How to select hardware for running LLMs locally - VRAM and RAM requirements by model size, GPU tier comparison, Apple Silicon analysis, CPU-only inference feasibility, and a practical hardware selection matrix.
The economics of LLM inference serving - cost per million tokens, GPU utilization, continuous batching, speculative decoding, KV cache management, and building production systems under $1 per million tokens.
Learn how to systematically reduce LLM inference costs using model selection, quantization, caching, request routing, prompt compression, and infrastructure strategies.
Learn how the key-value cache eliminates redundant attention computation in LLM inference, and how PagedAttention solves the memory fragmentation problem.
How the KV cache works in transformer inference, why naive memory allocation can waste roughly 60-80% of KV cache memory, and how PagedAttention from vLLM addresses fragmentation using virtual memory techniques from operating systems.
llama.cpp - Georgi Gerganov's C++ inference engine that runs quantized LLMs on CPUs and consumer GPUs. GGUF binary format, quantization types, performance tuning, and practical local inference.
LM Studio, Jan.ai, GPT4All, and Open WebUI for running LLMs locally - model discovery, hardware acceleration, local server mode, OpenAI-compatible APIs, and building a complete local AI development workspace.
Apple's MLX framework for running and fine-tuning LLMs on M-series chips - unified memory architecture, lazy evaluation, mlx-lm for inference, LoRA fine-tuning, and benchmarking against llama.cpp.
Master the systems and techniques that make large language model inference fast, efficient, and cost-effective at production scale.
Ollama - Docker-like CLI for running and managing local LLMs. Modelfile format, REST API, OpenAI-compatible endpoints, Python integration, and building a complete local AI stack.
A practical guide to PTQ methods for LLMs - GPTQ, AWQ, SmoothQuant, bitsandbytes, GGUF, and HQQ compared by accuracy, speed, memory, and production use case.
Deploying LLMs in air-gapped environments without internet access - pre-downloading models, offline HuggingFace usage, regulatory compliance, and architecture for privacy-critical AI.
How to rigorously evaluate quantization quality using perplexity, downstream task accuracy, latency, and memory metrics - and build a complete benchmarking pipeline comparing FP16 vs GPTQ vs AWQ vs NF4.
How to diagnose and fix quantization quality degradation - symptoms, root causes, diagnostic tools, and systematic fixes for INT4/INT8 quantized LLMs.
How to quantize CNN and ViT vision models and vision-language models - handling batch norm sensitivity, attention outliers, and the strategy of quantizing the LLM backbone while keeping the vision encoder in FP16.
How INT8, INT4, FP8, and NF4 quantization change memory bandwidth utilization, Tensor Core throughput, and inference latency on real GPUs, including hardware support matrices and production calibration strategies.
When post-training quantization is not enough - how QAT simulates quantization noise during training so models learn to be robust to it, covering the straight-through estimator, QLoRA, and BitNet.
Master LLM quantization techniques - from LLM.int8() to GPTQ and AWQ - to run large models on commodity hardware without unacceptable quality loss.
Master the sampling algorithms that control LLM output diversity - from greedy decoding to nucleus sampling - and learn when to use each in production.
How speculative decoding uses a small draft model to generate candidate tokens that the large target model verifies in a single forward pass, achieving 2-3x inference speedups without changing the output distribution.
Learn how speculative decoding uses a small draft model to generate tokens that a large target model verifies in parallel, achieving 2-3x speedup with no quality loss.
Learn how tensor parallelism splits weight matrices across GPUs and pipeline parallelism splits model layers, enabling inference and training of models too large for a single GPU.
NVIDIA TensorRT compilation pipeline, layer fusion, precision calibration, kernel auto-tuning, and deploying optimized inference engines for production LLM and computer vision workloads.
Learn how production inference servers like vLLM, TGI, TensorRT-LLM, and Ollama combine PagedAttention, continuous batching, and optimized kernels to serve LLMs at scale.