Module 4: Quantization in Practice
A Llama 3 70B model in BF16 is 140GB of weights. In Q4_K_M GGUF quantization, it is roughly 42GB. On a single A100 80GB, that is the difference between not fitting at all and having nearly 40GB to spare for KV cache. Quantization is not merely an optimization; for most practical deployments, it is a prerequisite.
The confusing part is that "quantization" covers at least four distinct techniques: post-training quantization (PTQ), quantization-aware training (QAT), weight-only quantization (GPTQ, AWQ, GGUF), and activation quantization (INT8, FP8). Each makes a different quality-speed-memory tradeoff, and choosing the wrong one for your use case is a common mistake.
What Quantization Actually Does
Quantization reduces the number of bits used to represent each weight. BF16 uses 16 bits per weight. INT8 uses 8 bits. INT4 uses 4 bits. The memory reduction is proportional: INT4 is 4x smaller than BF16.
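To make the arithmetic concrete, here is a minimal sketch of the weights-only memory math. The ~4.8 effective bits per weight for Q4_K_M is an approximation: K-quants store per-group scales alongside the 4-bit values, so the effective rate is above 4.

```python
# Weights-only memory math: parameter count x bits per weight.
# Ignores KV cache, activations, and framework overhead.

PARAMS = 70e9  # Llama 3 70B, approximate parameter count

def weights_gb(bits_per_weight: float) -> float:
    return PARAMS * bits_per_weight / 8 / 1e9

for name, bits in [("BF16", 16), ("INT8", 8), ("INT4", 4), ("Q4_K_M", 4.8)]:
    print(f"{name}: {weights_gb(bits):.0f} GB")
# BF16: 140 GB, INT8: 70 GB, INT4: 35 GB, Q4_K_M: 42 GB
```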
The quality loss depends on how the quantization is done. Naive rounding to INT4 loses significant quality. GPTQ uses second-order information from a calibration dataset to choose roundings that minimize each layer's output error. AWQ identifies the roughly 1% of weight channels that see the largest activation magnitudes (hence activation-aware) and protects them with per-channel scaling before quantizing, rather than storing them at higher precision. The result is INT4 quality that approaches INT8 on most benchmarks.
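The gap between naive rounding and calibrated methods is easy to demonstrate. Below is a minimal numpy sketch (not any library's actual algorithm) of per-tensor absmax INT4 rounding. A single outlier weight inflates the scale and forces everything else to round coarsely, which is exactly the failure mode that per-group scales (GGUF K-quants) and activation-aware scaling (AWQ) are designed to limit.

```python
import numpy as np

# Naive per-tensor absmax INT4: one scale for the whole matrix.
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(4096, 4096)).astype(np.float32)  # toy weights
w[0, 0] = 0.5  # one outlier, as commonly seen in real LLM layers

scale = np.abs(w).max() / 7               # map [-max, max] onto signed 4-bit [-7, 7]
q = np.clip(np.round(w / scale), -8, 7)   # quantize
w_hat = q * scale                         # dequantize

rms_err = float(np.sqrt(np.mean((w - w_hat) ** 2)))
print(f"RMS weight: {w.std():.4f}   RMS error: {rms_err:.4f}")
# The outlier pushes the scale to ~0.071, so typical weights (~0.02) round to
# zero: the reconstruction error is nearly as large as the weights themselves.
```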
Quantization Format Landscape
Lessons in This Module
| # | Lesson | Key Concept |
|---|---|---|
| 1 | Why Quantization Matters | Memory math, the economics of fitting models |
| 2 | GGUF and llama.cpp Quantization | Format internals, K-quant types, CPU inference |
| 3 | GPTQ Post-Training Quantization | Second-order weight optimization, calibration data |
| 4 | AWQ Activation-Aware Quantization | Salient weight identification, per-channel scaling |
| 5 | bitsandbytes and NF4 | 4-bit NF4, double quantization, QLoRA training (see the sketch after this table) |
| 6 | Quality Tradeoffs by Quantization Level | Benchmark comparison across quant levels |
| 7 | Quantizing Your Own Model | AutoGPTQ, llama.cpp quantize, autoawq workflows |
| 8 | Quantization for Inference Speed | INT8/FP8 for serving, hardware requirements |
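As a taste of where Lesson 5 is headed, here is a minimal sketch of loading a model in 4-bit NF4 with double quantization via bitsandbytes and transformers. The model id is a placeholder; substitute any causal LM you have access to. Requires a CUDA GPU and the bitsandbytes package.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",               # NormalFloat4, tuned for normal-ish weights
    bnb_4bit_use_double_quant=True,          # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,   # matmuls run in BF16; weights stay 4-bit
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-70B",  # placeholder; pick a model you can access
    quantization_config=bnb_config,
    device_map="auto",
)
```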
Key Concepts You Will Master
- Quantization error analysis - understanding why some layers tolerate quantization and others do not
- GGUF K-quants - what Q4_K_M, Q5_K_S mean and which to choose for your VRAM budget
- AWQ vs GPTQ - the practical quality difference and when to use each
- FP8 for H100 - the near-lossless 2x memory savings available on Hopper architecture
- Quantization-accuracy benchmarks - how to measure quality degradation on your specific task (see the sketch after this list)
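For the last bullet, a minimal sketch of the core idea: compare the perplexity of a full-precision checkpoint and its quantized counterpart on samples from your own workload. Both repo ids and the sample texts are placeholders; the delta between the two numbers is what matters, not the absolute values.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

texts = ["samples drawn from your real workload go here"]  # placeholder data

@torch.no_grad()
def mean_nll(model_id: str, texts: list[str]) -> float:
    """Average per-token negative log-likelihood over the given texts."""
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
    model.eval()
    losses = []
    for t in texts:
        batch = tok(t, return_tensors="pt").to(model.device)
        out = model(**batch, labels=batch["input_ids"])  # HF shifts labels internally
        losses.append(out.loss.item())
    return sum(losses) / len(losses)

# Placeholder repo ids: your full-precision baseline vs. its quantized version.
for model_id in ["your-org/model-bf16", "your-org/model-gptq-4bit"]:
    print(f"{model_id}: perplexity = {math.exp(mean_nll(model_id, texts)):.2f}")
```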
Prerequisites
- Model Ecosystem
- Running Locally
- Basic GPU memory understanding
