AWQ In-Depth
How Activation-aware Weight Quantization protects salient weights to achieve near-lossless INT4 compression, and how to deploy AWQ models with AutoAWQ and vLLM.
End-to-end guide for production deployment of quantized LLMs - format selection, serving stack configuration, latency SLAs, A/B testing, quality monitoring, and rollback strategy.
A deep technical walkthrough of the GPTQ algorithm - Optimal Brain Surgeon derivation, layer-by-layer quantization, group quantization, actorder, and practical deployment with AutoGPTQ and vLLM.
GGUF, GPTQ, AWQ, and bitsandbytes - compress models to fit your hardware budget while understanding exactly what quality you are trading away and why.
A practical guide to PTQ methods for LLMs - GPTQ, AWQ, SmoothQuant, bitsandbytes, GGUF, and HQQ compared by accuracy, speed, memory, and production use case.
How to rigorously evaluate quantization quality using perplexity, downstream task accuracy, latency, and memory metrics - and build a complete benchmarking pipeline comparing FP16 vs GPTQ vs AWQ vs NF4.
How to diagnose and fix quantization quality degradation - symptoms, root causes, diagnostic tools, and systematic fixes for INT4/INT8 quantized LLMs.
How to quantize CNN and ViT vision models and vision-language models - handling batch norm sensitivity, attention outliers, and the strategy of quantizing the LLM backbone while keeping the vision encoder in FP16.
When post-training quantization is not enough - how QAT simulates quantization noise during training so models learn to be robust to it, covering the straight-through estimator, QLoRA, and BitNet.