Module 4: Quantization in Practice
A Llama 3 70B model in BF16 is 140GB of weights. In Q4_K_M GGUF quantization, it is roughly 42GB. On a single A100 80GB, that is the difference between not fitting at all and having nearly 40GB to spare for KV cache. Quantization is not merely an optimization; for most practical deployments, it is a prerequisite.
The confusing part is that "quantization" covers at least four distinct techniques: post-training quantization (PTQ), quantization-aware training (QAT), weight-only quantization (GPTQ, AWQ, GGUF), and activation quantization (INT8, FP8). Each makes a different quality-speed-memory tradeoff, and choosing the wrong one for your use case is a common mistake.
What Quantization Actually Does
Quantization reduces the number of bits used to represent each weight. BF16 uses 16 bits per weight. INT8 uses 8 bits. INT4 uses 4 bits. The memory reduction is proportional: INT4 is 4x smaller than BF16.
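To make the arithmetic concrete, here is a minimal sketch of the weights-only memory math. The ~4.8 effective bits per weight for Q4_K_M is an approximation: K-quants store per-group scales alongside the 4-bit values, so the effective rate is above 4.

```python
# Weights-only memory math: parameter count x bits per weight.
# Ignores KV cache, activations, and framework overhead.

PARAMS = 70e9  # Llama 3 70B, approximate parameter count

def weights_gb(bits_per_weight: float) -> float:
    return PARAMS * bits_per_weight / 8 / 1e9

for name, bits in [("BF16", 16), ("INT8", 8), ("INT4", 4), ("Q4_K_M", 4.8)]:
    print(f"{name}: {weights_gb(bits):.0f} GB")
# BF16: 140 GB, INT8: 70 GB, INT4: 35 GB, Q4_K_M: 42 GB
```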
The quality loss depends on how the quantization is done. Naive rounding to INT4 loses significant quality. GPTQ uses second-order information from a calibration dataset to choose roundings that minimize each layer's output error. AWQ identifies the roughly 1% of weight channels that see the largest activation magnitudes (hence activation-aware) and protects them with per-channel scaling before quantizing, rather than storing them at higher precision. The result is INT4 quality that approaches INT8 on most benchmarks.
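The gap between naive rounding and calibrated methods is easy to demonstrate. Below is a minimal numpy sketch (not any library's actual algorithm) of per-tensor absmax INT4 rounding. A single outlier weight inflates the scale and forces everything else to round coarsely, which is exactly the failure mode that per-group scales (GGUF K-quants) and activation-aware scaling (AWQ) are designed to limit.

```python
import numpy as np

# Naive per-tensor absmax INT4: one scale for the whole matrix.
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(4096, 4096)).astype(np.float32)  # toy weights
w[0, 0] = 0.5  # one outlier, as commonly seen in real LLM layers

scale = np.abs(w).max() / 7               # map [-max, max] onto signed 4-bit [-7, 7]
q = np.clip(np.round(w / scale), -8, 7)   # quantize
w_hat = q * scale                         # dequantize

rms_err = float(np.sqrt(np.mean((w - w_hat) ** 2)))
print(f"RMS weight: {w.std():.4f}   RMS error: {rms_err:.4f}")
# The outlier pushes the scale to ~0.071, so typical weights (~0.02) round to
# zero: the reconstruction error is nearly as large as the weights themselves.
```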
Quantization Format Landscape
Lessons in This Module
| # | Lesson | Key Concept |
|---|---|---|
| 1 | Why Quantization Matters | Memory math, the economics of fitting models |
| 2 | GGUF and llama.cpp Quantization | Format internals, K-quant types, CPU inference |
| 3 | GPTQ Post-Training Quantization | Second-order weight optimization, calibration data |
| 4 | AWQ Activation-Aware Quantization | Salient weight identification, per-channel scaling |
| 5 | bitsandbytes and NF4 | 4-bit NF4, double quantization, QLoRA training (see the sketch after this table) |
| 6 | Quality Tradeoffs by Quantization Level | Benchmark comparison across quant levels |
| 7 | Quantizing Your Own Model | AutoGPTQ, llama.cpp quantize, autoawq workflows |
| 8 | Quantization for Inference Speed | INT8/FP8 for serving, hardware requirements |
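As a taste of where Lesson 5 is headed, here is a minimal sketch of loading a model in 4-bit NF4 with double quantization via bitsandbytes and transformers. The model id is a placeholder; substitute any causal LM you have access to. Requires a CUDA GPU and the bitsandbytes package.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",               # NormalFloat4, tuned for normal-ish weights
    bnb_4bit_use_double_quant=True,          # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,   # matmuls run in BF16; weights stay 4-bit
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-70B",  # placeholder; pick a model you can access
    quantization_config=bnb_config,
    device_map="auto",
)
```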
Key Concepts You Will Master
- Quantization error analysis - understanding why some layers tolerate quantization and others do not
- GGUF K-quants - what Q4_K_M, Q5_K_S mean and which to choose for your VRAM budget
- AWQ vs GPTQ - the practical quality difference and when to use each
- FP8 for H100 - the near-lossless 2x memory savings available on Hopper architecture
- Quantization-accuracy benchmarks - how to measure quality degradation on your specific task (see the sketch after this list)
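For the last bullet, a minimal sketch of the core idea: compare the perplexity of a full-precision checkpoint and its quantized counterpart on samples from your own workload. Both repo ids and the sample texts are placeholders; the delta between the two numbers is what matters, not the absolute values.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

texts = ["samples drawn from your real workload go here"]  # placeholder data

@torch.no_grad()
def mean_nll(model_id: str, texts: list[str]) -> float:
    """Average per-token negative log-likelihood over the given texts."""
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
    model.eval()
    losses = []
    for t in texts:
        batch = tok(t, return_tensors="pt").to(model.device)
        out = model(**batch, labels=batch["input_ids"])  # HF shifts labels internally
        losses.append(out.loss.item())
    return sum(losses) / len(losses)

# Placeholder repo ids: your full-precision baseline vs. its quantized version.
for model_id in ["your-org/model-bf16", "your-org/model-gptq-4bit"]:
    print(f"{model_id}: perplexity = {math.exp(mean_nll(model_id, texts)):.2f}")
```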
Prerequisites
- Model Ecosystem
- Running Locally
- Basic GPU memory understanding
