9 docs tagged with "quantization"

AWQ In-Depth

How Activation-aware Weight Quantization protects salient weights to achieve near-lossless INT4 compression, and how to deploy AWQ models with AutoAWQ and vLLM.

Deploying Quantized Models in Production

End-to-end guide for production deployment of quantized LLMs - format selection, serving stack configuration, latency SLAs, A/B testing, quality monitoring, and rollback strategy.

GPTQ In Depth

A deep technical walkthrough of the GPTQ algorithm - Optimal Brain Surgeon derivation, layer-by-layer quantization, group quantization, actorder, and practical deployment with AutoGPTQ and vLLM.

Module 4: Quantization in Practice

GGUF, GPTQ, AWQ, and bitsandbytes - compress models to fit your hardware budget while understanding exactly what quality you are trading away and why.

Post-Training Quantization Methods

A practical guide to PTQ methods for LLMs - GPTQ, AWQ, SmoothQuant, bitsandbytes, GGUF, and HQQ compared by accuracy, speed, memory, and production use case.

How to rigorously evaluate quantization quality using perplexity, downstream task accuracy, latency, and memory metrics - and build a complete benchmarking pipeline comparing FP16 vs GPTQ vs AWQ vs NF4.

Quantization Error Debugging

How to diagnose and fix quantization quality degradation - symptoms, root causes, diagnostic tools, and systematic fixes for INT4/INT8 quantized LLMs.

Quantization for Vision Models

How to quantize vision models (CNNs and ViTs) and vision-language models - handling batch norm sensitivity, attention outliers, and the strategy of quantizing the LLM backbone while keeping the vision encoder in FP16.

Quantization-Aware Training

When post-training quantization is not enough - how QAT simulates quantization noise during training so models learn to be robust to it, covering the straight-through estimator, QLoRA, and BitNet.