34 docs tagged with "inference"

Autoregressive Decoding

Understand how LLMs generate tokens one at a time, why decoding is memory-bandwidth bound, and how to reason about inference latency with the roofline model.

AWQ In-Depth

How Activation-aware Weight Quantization protects salient weights to achieve near-lossless INT4 compression, and how to deploy AWQ models with AutoAWQ and vLLM.

Batching Strategies for LLM Serving

Static batching, dynamic batching, continuous batching, chunked prefill, and prefill-decode disaggregation for LLM inference throughput and latency optimization.

Benchmarking Local Model Performance

Measuring local LLM inference speed - tokens per second, time to first token, memory usage, and systematic comparison across quantization levels, models, and hardware configurations.

Continuous Batching

Learn how continuous batching eliminates GPU idle time by replacing finished sequences immediately rather than waiting for the longest request in a batch to complete.

Deploying Quantized Models in Production

End-to-end guide for production deployment of quantized LLMs - format selection, serving stack configuration, latency SLAs, A/B testing, quality monitoring, and rollback strategy.

Docker and Containerized Local Inference

Running LLMs in Docker containers for reproducibility and deployment portability. NVIDIA Container Toolkit, Ollama and vLLM Docker images, multi-stage builds, and Docker Compose for a full local AI stack.

Edge and Mobile Inference

Running neural networks on devices with 5-15W power budgets - mobile NPUs, Apple Neural Engine, Qualcomm Hexagon, deployment frameworks, and on-device LLMs with llama.cpp and MLX.

GPTQ In Depth

A deep technical walkthrough of the GPTQ algorithm - Optimal Brain Surgeon derivation, layer-by-layer quantization, group quantization, actorder, and practical deployment with AutoGPTQ and vLLM.

GPU Inference vs Training Requirements

Why inference and training have fundamentally different GPU hardware requirements, covering compute vs memory-bandwidth bottlenecks, the prefill/decode split, and how to select the right GPU for serving.

Hardware Requirements and Selection

How to select hardware for running LLMs locally - VRAM and RAM requirements by model size, GPU tier comparison, Apple Silicon analysis, CPU-only inference feasibility, and a practical hardware selection matrix.

Inference Cost Optimization

The economics of LLM inference serving - cost per million tokens, GPU utilization, continuous batching, speculative decoding, KV cache management, and building production systems that serve tokens for under $1 per million.

Inference Cost Optimization

Learn how to systematically reduce LLM inference costs using model selection, quantization, caching, request routing, prompt compression, and infrastructure strategies.

KV Cache

Learn how the key-value cache eliminates redundant attention computation in LLM inference, and how PagedAttention solves the memory fragmentation problem.

KV Cache Management and PagedAttention

How the KV cache works in transformer inference, why naive memory allocation can waste 60-70% of the memory reserved for the KV cache, and how PagedAttention from vLLM addresses fragmentation using virtual memory techniques from operating systems.

llama.cpp and GGUF Format

llama.cpp - Georgi Gerganov's C++ inference engine that runs quantized LLMs on CPUs and consumer GPUs. GGUF binary format, quantization types, performance tuning, and practical local inference.

LM Studio and GUI Tools

LM Studio, Jan.ai, GPT4All, and Open WebUI for running LLMs locally - model discovery, hardware acceleration, local server mode, OpenAI-compatible APIs, and building a complete local AI development workspace.

MLX for Apple Silicon

Apple's MLX framework for running and fine-tuning LLMs on M-series chips - unified memory architecture, lazy evaluation, mlx-lm for inference, LoRA fine-tuning, and benchmarking against llama.cpp.

Module 07: LLM Inference & Optimization

Master the systems and techniques that make large language model inference fast, efficient, and cost-effective at production scale.

Ollama and Local Model Management

Ollama - a Docker-like CLI for running and managing local LLMs. Modelfile format, REST API, OpenAI-compatible endpoints, Python integration, and building a complete local AI stack.

Post-Training Quantization Methods

A practical guide to PTQ methods for LLMs - GPTQ, AWQ, SmoothQuant, bitsandbytes, GGUF, and HQQ compared by accuracy, speed, memory, and production use case.

Privacy and Air-Gapped Deployment

Deploying LLMs in air-gapped environments without internet access - pre-downloading models, offline HuggingFace usage, regulatory compliance, and architecture for privacy-critical AI.

Quantization Benchmarking

How to rigorously evaluate quantization quality using perplexity, downstream task accuracy, latency, and memory metrics - and build a complete benchmarking pipeline comparing FP16 vs GPTQ vs AWQ vs NF4.

Quantization Error Debugging

How to diagnose and fix quantization quality degradation - symptoms, root causes, diagnostic tools, and systematic fixes for INT4/INT8 quantized LLMs.

Quantization for Vision Models

How to quantize CNN and ViT vision models and vision-language models - handling batch norm sensitivity, attention outliers, and the strategy of quantizing the LLM backbone while keeping the vision encoder in FP16.

Quantization Hardware Tradeoffs

How INT8, INT4, FP8, and NF4 quantization change memory bandwidth utilization, Tensor Core throughput, and inference latency on real GPUs, including hardware support matrices and production calibration strategies.

Quantization-Aware Training

When post-training quantization is not enough - how QAT simulates quantization noise during training so models learn to be robust to it, covering the straight-through estimator, QLoRA, and BitNet.

Quantization: INT8 and INT4

Master LLM quantization techniques - from LLM.int8() to GPTQ and AWQ - to run large models on commodity hardware without unacceptable quality loss.

Sampling Strategies: Temperature, Top-K, Top-P

Master the sampling algorithms that control LLM output diversity - from greedy decoding to nucleus sampling - and learn when to use each in production.

Speculative Decoding

How speculative decoding uses a small draft model to generate candidate tokens that the large target model verifies in a single forward pass, typically achieving 2-3x inference speedups without changing the output distribution.

Speculative Decoding

Learn how speculative decoding uses a small draft model to generate tokens that a large target model verifies in parallel, typically achieving a 2-3x speedup with no quality loss.

Tensor and Pipeline Parallelism

Learn how tensor parallelism splits weight matrices across GPUs and pipeline parallelism splits model layers, enabling inference and training of models too large for a single GPU.

TensorRT and Inference Optimization

NVIDIA TensorRT compilation pipeline, layer fusion, precision calibration, kernel auto-tuning, and deploying optimized inference engines for production LLM and computer vision workloads.

vLLM and Inference Servers

Learn how production inference servers like vLLM, TGI, TensorRT-LLM, and Ollama combine PagedAttention, continuous batching, and optimized kernels to serve LLMs at scale.