10 docs tagged with "model-compression"

AWQ: Activation-Aware Weight Quantization

AWQ protects the 1% of weights that matter most - how activation statistics reveal salient weights, how scaling preserves them without extra memory, why AWQ outperforms GPTQ at INT4 for production inference, and how to configure Marlin kernels for maximum throughput.

Benchmarking Compressed Models

How to systematically evaluate accuracy-efficiency tradeoffs in quantized, pruned, and distilled models - perplexity, task-specific capabilities, latency, throughput, and automated regression detection.

GPTQ: Post-Training Quantization

GPTQ explained from first principles - how Hessian-based error compensation quantizes 175B models to 4-bit in hours, the role of calibration data, group size, activation reordering, and how to deploy GPTQ models in production with vLLM and AutoGPTQ.

Knowledge Distillation for LLMs

Training smaller student models to match larger teacher models - soft labels, temperature scaling, intermediate representation matching, API-based distillation, and a complete production pipeline for task-specific deployment.

LoRA for Efficient Fine-Tuning

LoRA and QLoRA: fine-tune 70B models on a single GPU by freezing the base model and training only small low-rank adapter matrices - the technique that democratized LLM customization.
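As a rough illustration of why low-rank adapters are small, the sketch below compares the parameter count of one full weight matrix with that of a rank-r adapter pair (B of shape d_out x r, A of shape r x d_in). The layer size 4096 x 4096 and rank 8 are assumed values chosen for the arithmetic, not figures from the linked doc.

```python
def lora_param_counts(d_out: int, d_in: int, rank: int) -> tuple[int, int]:
    """Return (full_matrix_params, adapter_params) for one linear layer.

    The frozen weight W has d_out * d_in parameters. A LoRA adapter adds
    two trainable matrices, B (d_out x rank) and A (rank x d_in), so the
    trainable count is rank * (d_in + d_out).
    """
    full = d_out * d_in
    adapter = rank * (d_in + d_out)
    return full, adapter


# Hypothetical layer shape and rank, for illustration only.
full, adapter = lora_param_counts(d_out=4096, d_in=4096, rank=8)
ratio = adapter / full

print(f"frozen weights:   {full:,}")     # 16,777,216
print(f"trainable (LoRA): {adapter:,}")  # 65,536
print(f"trainable share:  {ratio:.4%}")  # 0.3906%
```

Under these assumptions the adapter trains well under 1% of the layer's parameters; the frozen base weights still have to fit in memory, which is the part QLoRA addresses by quantizing them.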

Module 05: Model Compression

Master the full spectrum of model compression techniques - quantization, pruning, distillation, and LoRA - to deploy large language models efficiently in production.

Quantization Deep Dive

INT8, INT4, NF4, FP8, and block-wise quantization explained from first principles - how floating point becomes integer, what accuracy you lose, and how to tune quantization for production LLM inference.
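To make "how floating point becomes integer" concrete, here is a minimal sketch of symmetric absmax INT8 quantization in plain Python. It is one simple scheme assumed for illustration (a single scale for the whole list, no zero point, no blocks); the function names and sample weights are hypothetical and not taken from the linked doc.

```python
def quantize_absmax_int8(values: list[float]) -> tuple[list[int], float]:
    """Map floats to integer codes in [-127, 127] using a single scale."""
    max_abs = max(abs(v) for v in values)
    # The largest magnitude maps to 127; guard against an all-zero input.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes: list[int], scale: float) -> list[float]:
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


# Hypothetical weights, for illustration only.
weights = [0.02, -0.51, 0.33, 1.27, -0.08]
codes, scale = quantize_absmax_int8(weights)
restored = dequantize(codes, scale)

# Rounding to the nearest code bounds the per-value error by scale / 2.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```

Because one outlier sets the scale for every value, a single large weight coarsens the grid for all the others; block-wise schemes reduce this by computing a separate scale per small group of values.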

Structured Pruning

Remove entire attention heads, MLP neurons, and transformer layers to achieve real hardware latency improvements - with production-grade code for Taylor importance, angular distance layer scoring, iterative recovery, and combined compression pipelines.

Unstructured Pruning

Weight-level sparsity, the Lottery Ticket Hypothesis, SparseGPT, Wanda, and 2:4 structured sparsity - why unstructured pruning is theoretically elegant but practically limited for LLMs.

Why Model Compression Matters

The memory wall, inference costs, edge deployment, and latency requirements that make model compression essential for production AI systems - with real cost math, a full compression taxonomy, and decision frameworks for choosing the right technique.