Module 4: Quantization in Practice

A Llama 3 70B model in BF16 is 140GB of weights. In Q4_K_M GGUF quantization, it is roughly 40GB. On a single 80GB A100, that is the difference between not fitting at all and having nearly 40GB to spare for KV cache. Quantization is not an optional optimization; for most practical deployments, it is a prerequisite.

Much of the confusion comes from the fact that "quantization" covers at least four distinct techniques: post-training quantization (PTQ), quantization-aware training (QAT), weight-only quantization (GPTQ, AWQ, GGUF), and activation quantization (INT8, FP8). Each makes a different quality-speed-memory tradeoff, and choosing the wrong one for your use case is a common mistake.

What Quantization Actually Does

Quantization reduces the number of bits used to represent each weight. BF16 uses 16 bits per weight, INT8 uses 8, INT4 uses 4. The memory reduction is roughly proportional: INT4 weights are 4x smaller than BF16. In practice, quantized formats also store per-group scales (and sometimes zero points) alongside the weights, so effective sizes land slightly above the ideal ratio; Q4_K_M averages about 4.85 bits per weight, which is why the 70B model comes out near 40GB rather than 35GB.
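The arithmetic is worth doing once explicitly. Below is a minimal sketch in plain Python; the 4.85 bits-per-weight figure for Q4_K_M is the average reported by llama.cpp including quantization metadata, and the parameter count is rounded to 70B.

```python
PARAMS = 70e9  # Llama 3 70B, rounded parameter count

def weight_gb(bits_per_weight: float) -> float:
    """Weight memory in GB at a given average bits per weight."""
    return PARAMS * bits_per_weight / 8 / 1e9

for name, bpw in [
    ("BF16", 16.0),
    ("INT8", 8.0),
    ("Q4_K_M", 4.85),       # includes per-group scales and mins
    ("INT4 (ideal)", 4.0),
]:
    print(f"{name:>12}: {weight_gb(bpw):6.1f} GB")

# BF16        : 140.0 GB  -> does not fit on an 80GB A100
# Q4_K_M      :  42.4 GB  -> fits, with roughly 37GB left for KV cache
```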

The quality loss depends on how the quantization is done. Naive round-to-nearest quantization to INT4 loses significant quality. GPTQ uses second-order (approximate Hessian) information from a calibration dataset to adjust the not-yet-quantized weights as it goes, minimizing each layer's reconstruction error. AWQ observes activation magnitudes to identify the roughly 1% of weight channels that matter most (hence activation-aware) and protects them with per-channel scaling before quantization, rather than storing them at higher precision. The result is INT4 quality that approaches INT8 on most benchmarks.
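To make the baseline concrete, here is a minimal round-to-nearest (RTN) INT4 quantizer in PyTorch, the naive approach that GPTQ and AWQ improve on. The layer shape and seed are arbitrary illustrations.

```python
import torch

torch.manual_seed(0)
W = torch.randn(4096, 4096)  # stand-in for one linear layer's weights

def rtn_quantize(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Round-to-nearest quantization with per-output-channel absmax scaling."""
    qmax = 2 ** (bits - 1) - 1                        # 7 for signed INT4
    scale = w.abs().amax(dim=1, keepdim=True) / qmax  # one scale per row
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q * scale                                  # dequantized weights

W_q = rtn_quantize(W)
mse = (W - W_q).pow(2).mean().item()
print(f"mean squared quantization error: {mse:.2e}")
```

GPTQ lowers this error by compensating with the weights it has not yet quantized; AWQ lowers it where it matters most by rescaling salient channels before rounding.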

Quantization Format Landscape

The formats covered in this module split along two axes: where the model runs, and whether activations are quantized alongside the weights. GGUF targets llama.cpp and CPU or consumer-GPU inference; GPTQ and AWQ are weight-only INT4 formats aimed at GPU serving; bitsandbytes NF4 quantizes on the fly and underpins QLoRA training; INT8 and FP8 quantize activations as well and target serving throughput on datacenter hardware (FP8 requires Hopper-class GPUs).

Lessons in This Module

| # | Lesson | Key Concept |
|---|--------|-------------|
| 1 | Why Quantization Matters | Memory math, the economics of fitting models |
| 2 | GGUF and llama.cpp Quantization | Format internals, K-quant types, CPU inference |
| 3 | GPTQ Post-Training Quantization | Second-order weight optimization, calibration data |
| 4 | AWQ Activation-Aware Quantization | Salient weight identification, per-channel scaling |
| 5 | bitsandbytes and NF4 | 4-bit NF4, double quantization, QLoRA training (see the sketch after this table) |
| 6 | Quality Tradeoffs by Quantization Level | Benchmark comparison across quant levels |
| 7 | Quantizing Your Own Model | AutoGPTQ, llama.cpp quantize, autoawq workflows |
| 8 | Quantization for Inference Speed | INT8/FP8 for serving, hardware requirements |
Key Concepts You Will Master

  • Quantization error analysis - understanding why some layers tolerate quantization and others do not
  • GGUF K-quants - what Q4_K_M, Q5_K_S mean and which to choose for your VRAM budget
  • AWQ vs GPTQ - the practical quality difference and when to use each
  • FP8 for H100 - the near-lossless 2x memory savings available on Hopper architecture
  • Quantization-accuracy benchmarks - how to measure quality degradation on your specific task (see the sketch after this list)
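A minimal version of that measurement, assuming the quantized checkpoint loads through transformers (true for GPTQ and AWQ checkpoints when the matching extras are installed); the model IDs and texts are placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model_id: str, texts: list[str]) -> float:
    """Token-level perplexity of a causal LM over a list of raw texts."""
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
    model.eval()
    total_nll, total_tokens = 0.0, 0
    for text in texts:
        ids = tok(text, return_tensors="pt").input_ids.to(model.device)
        with torch.no_grad():
            loss = model(ids, labels=ids).loss  # mean NLL over shifted tokens
        n = ids.numel() - 1                     # tokens actually scored
        total_nll += loss.item() * n
        total_tokens += n
    return float(torch.exp(torch.tensor(total_nll / total_tokens)))

# Run the same texts through both checkpoints; the gap is your quality cost.
# ppl_bf16 = perplexity("my-org/model-bf16", texts)      # hypothetical IDs
# ppl_int4 = perplexity("my-org/model-awq-int4", texts)
```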

Prerequisites

© 2026 EngineersOfAI. All rights reserved.