
Post-Training Quantization Methods

The 3 AM Pager Alert

It is 3:07 AM. Your phone buzzes. The alert reads: "GPU cluster cost exceeded $48,000 this month. Finance wants a 60% reduction by end of quarter or the entire inference budget gets cut." You are the ML infrastructure lead at a mid-size fintech, running three LLaMA 3 70B models across eight A100 GPUs, serving real-time fraud detection and customer support. The models are trained. They are good. Retraining is not an option - it would take six weeks and cost another $200,000 in compute. You need to cut memory and cost without touching model quality.

You open your laptop at 3:12 AM and start reading about quantization.

By morning you have a decision to make. There are at least six distinct quantization methods available for this model, each with different tradeoffs, different tooling, different hardware requirements, and different quality impacts. GPTQ needs calibration data and takes two hours to run. AWQ is faster and reportedly better quality. bitsandbytes can be applied with a single argument in your existing code. GGUF lets you run on CPU but your serving stack is GPU-native. SmoothQuant changes how activations are handled. HQQ needs no calibration at all.

By 9 AM, you have deployed an AWQ INT4 model on four A100 GPUs instead of eight, at 98.7% of original model quality measured by your internal evaluation suite. The alert clears. Finance is satisfied.

This lesson is the knowledge that got you through that morning. Not just what each method is, but how they differ mechanically, where each breaks down, and exactly which one to reach for in which situation. We will cover the full quantization zoo - GPTQ, AWQ, SmoothQuant, bitsandbytes (NF4 and INT8), GGUF, HQQ, and QuIP# - and give you a decision matrix you can use in production.

The deeper skill is not memorizing which method is "best." It is understanding the design space well enough to reason about tradeoffs you have never seen before - new models, new hardware, new constraints. That reasoning starts with understanding what every PTQ method is actually doing at its core.


Why Post-Training Quantization Exists

The Problem It Replaced

Before PTQ methods existed for LLMs, you had two choices if you needed a smaller model: train a smaller model from scratch, or use knowledge distillation to compress a large model into a smaller one. Both required significant compute, both required access to the original training pipeline, and both produced a fundamentally different model that needed re-evaluation and re-alignment work.

The alternative that engineers actually used was simple weight rounding - take float32 weights and round them to INT8 by computing a scale factor per tensor. This works well for CNNs used in computer vision. It fails catastrophically for LLMs.

Why does naive quantization fail for LLMs specifically? The answer is activations. Transformer models, especially the attention mechanism and feed-forward layers, produce activations with extreme outliers. A single activation value might be 100x larger than the median activation in its tensor. When you pick a quantization scale based on the max value to cover that outlier, every other value gets crammed into a tiny fraction of the available integer range. Information is destroyed. Perplexity spikes. The model stops making sense.
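
To make this concrete, here is a minimal sketch (illustrative numbers, not from any real model) of what happens when per-tensor INT8 quantization meets a single outlier:

import torch

activations = torch.randn(4096)        # typical values, roughly N(0, 1)
activations[0] = 100.0                 # one outlier channel, ~100x the median

scale = activations.abs().max() / 127  # per-tensor scale must cover the outlier
quantized = torch.clamp(torch.round(activations / scale), -127, 127)
dequantized = quantized * scale

# With scale near 0.8, the 99.9% of values that live in [-3, 3] are mapped
# onto fewer than ten of the 255 available integer levels.
error = (activations - dequantized).abs().mean()
print(f"scale={scale:.3f}, mean abs error={error:.4f}")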

This was the core discovery that motivated modern PTQ methods: the challenge is not the weights, it is the interaction between weights and activations during inference. Weights are relatively well-behaved. Activations are not. Every major PTQ method published between 2022 and 2024 is essentially a different answer to the question "how do we handle activation outliers without retraining?"

The Research Inflection Point

In 2022, two papers changed the trajectory of LLM deployment. ZeroQuant (Yao et al., 2022) showed that with careful layer-by-layer distillation, INT8 quantization of transformer models was achievable without catastrophic quality loss. More importantly, SmoothQuant (Xiao et al., 2022) showed that activation outliers could be mathematically migrated into the weights, making both activations and weights quantization-friendly simultaneously.

Then GPTQ (Frantar et al., 2022) showed that INT4 weight-only quantization of a 175B parameter GPT model was possible with negligible perplexity increase, using a second-order optimization approach derived from the Optimal Brain Surgeon framework. This was the paper that convinced the field that PTQ was not a research curiosity - it was a production tool.

From 2022 to 2024, the field moved fast. AWQ, HQQ, QuIP#, and AQLM each introduced new ideas. But GPTQ and bitsandbytes remain the most widely deployed methods in practice, largely due to tooling maturity and ecosystem support.


The Quantization Design Space

Before comparing methods, it helps to understand the axes along which they differ. There are four fundamental design decisions in any PTQ method.

Weight-only vs. weight-and-activation quantization. Weight-only quantization (W4A16, W8A16) keeps activations in float16 during inference. Only the stored weights are quantized. The weights are dequantized back to float16 before the matrix multiplication. This does not give you the full compute benefit of INT8 matrix multiply instructions on hardware, but it dramatically reduces memory bandwidth and GPU memory footprint. Weight-and-activation quantization (W8A8) quantizes both, enabling use of INT8 Tensor Core operations and delivering both memory and compute speedups.
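
As a sketch of the weight-only pattern (a hypothetical helper, not a real kernel - production implementations fuse the dequantization into the GEMM rather than materializing the FP16 weight):

import torch

def w4a16_linear(x_fp16, q_codes, scales, zeros):
    """x_fp16: [batch, in]. q_codes: [out, in] INT4 codes stored in int8.
    scales, zeros: [out, 1] per-output-channel parameters (toy granularity)."""
    w_fp16 = (q_codes.to(torch.float16) - zeros) * scales  # dequantize to FP16
    return x_fp16 @ w_fp16.t()                             # GEMM still runs in FP16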

Calibration-based vs. calibration-free. Calibration-based methods (GPTQ, AWQ, SmoothQuant) run a small dataset - typically 512 to 1024 samples - through the model to gather statistics about activation distributions before applying quantization. This adds a one-time preprocessing cost (minutes to hours) but allows the quantization to be adapted to the actual data distribution. Calibration-free methods (bitsandbytes NF4, HQQ) apply quantization without any data, making them faster to apply and applicable without access to representative data.

Granularity of quantization. Per-tensor quantization uses one scale factor for an entire weight matrix. Per-channel (per-row or per-column) quantization uses one scale per row or column. Group quantization (groupsize=128 is standard) uses one scale per group of 128 consecutive values. Finer granularity means more scale factors to store and load, but much better accuracy because the scale can adapt to local weight distributions.
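
A short sketch of group-wise symmetric INT4 quantization, assuming a group size of 128 and per-group absmax scaling:

import torch

def quantize_grouped(w, group_size=128, bits=4):
    qmax = 2 ** (bits - 1) - 1                        # symmetric INT4: [-8, 7]
    groups = w.reshape(-1, group_size)                # [n_groups, 128]
    scales = groups.abs().max(dim=1, keepdim=True).values / qmax
    q = torch.clamp(torch.round(groups / scales), -qmax - 1, qmax)
    return q.to(torch.int8), scales                   # codes + one scale per group

w = torch.randn(4096 * 4096)
codes, scales = quantize_grouped(w)
print(codes.shape, scales.shape)   # 131,072 scales for a 4096x4096 matrix

For a 4096x4096 weight matrix, group size 128 adds 131,072 scale factors - the storage and bandwidth overhead that finer granularity trades for accuracy.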

Online vs. offline quantization. Offline quantization produces a quantized model checkpoint that can be loaded directly. The quantization math is done once and saved. Online quantization (bitsandbytes) applies quantization dynamically during the forward pass. Online methods are more flexible and composable with training frameworks, but add overhead on every forward pass.


The Methods: What Each One Actually Does

GPTQ - Reconstruction-Based Weight Quantization

GPTQ (Frantar, Ashkboos, Hoefler, Alistarh - ETH Zurich, October 2022) takes weight-only quantization and treats it as a second-order optimization problem. The core idea: when you quantize a weight, you introduce a quantization error. GPTQ compensates for that error by making small updates to all remaining unquantized weights in the same layer, using the Hessian of the layer's output loss to figure out how to spread the correction optimally.

The Hessian for a linear layer is $H = 2XX^T$, where $X$ is the matrix of calibration inputs to that layer. This gives GPTQ information about which weights matter most (high curvature in the loss) and which can absorb correction without hurting the output. The method quantizes weights column by column, and after each column is quantized, it updates the remaining columns to absorb the error.
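
The following toy sketch shows the quantize-then-compensate loop under simplifying assumptions (a single global scale, full matrix inversion, no blocking, no Cholesky tricks, no actorder); the real algorithm is heavily optimized but follows this shape:

import torch

def gptq_toy(W, X, bits=4, damp=0.01):
    """W: [out, in] weights; X: [n_samples, in] calibration inputs."""
    H = 2 * X.t() @ X                                   # layer-loss Hessian
    H += damp * H.diag().mean() * torch.eye(H.shape[0]) # damping for stability
    Hinv = torch.linalg.inv(H)
    W = W.clone()
    Q = torch.zeros_like(W)
    scale = W.abs().max() / (2 ** (bits - 1) - 1)       # toy: one global scale
    for j in range(W.shape[1]):                         # column by column
        q = torch.clamp(torch.round(W[:, j] / scale), -8, 7) * scale
        Q[:, j] = q
        err = (W[:, j] - q) / Hinv[j, j]
        # spread the error onto the not-yet-quantized columns
        W[:, j + 1:] -= err.unsqueeze(1) * Hinv[j, j + 1:].unsqueeze(0)
    return Q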

The result is INT4 quantization with perplexity degradation typically under 0.3 on standard benchmarks (WikiText-2) for models larger than 7B parameters. For 70B+ models, INT4 GPTQ is often indistinguishable from FP16 on most tasks.

GPTQ runs offline, produces a quantized checkpoint, and requires a GPU to quantize (the Hessian computation needs GPU memory). Quantizing a 70B model takes roughly 1-4 hours on a single A100 80GB, depending on group size.

The actorder variant reorders weight columns by decreasing activation magnitude before quantizing. This ensures the most important weights (those multiplied by large activations) are quantized last, when the error compensation mechanism is most precise. actorder=True consistently improves quality but slightly slows inference due to the permutation overhead.

AWQ - Activation-Aware Weight Quantization

AWQ (Lin, Tang, Tang, Yang, Dang, Han - MIT HAN Lab, June 2023) starts from a critical observation about GPTQ: the quality of quantization depends heavily on which weights are most important, and importance is determined by activation magnitude, not weight magnitude.

The insight is that 1% of weights are salient - they correspond to large activation channels - and protecting just those 1% from quantization error accounts for most of the quality recovery. Instead of protecting them by keeping them in FP16 (which breaks the uniform quantization format), AWQ scales those channels up before quantization so their quantization grid is finer, then scales them back down during inference. The scaling is absorbed into the adjacent layer's weights, keeping the model architecture unchanged.

Formally, for a weight matrix $W$ and input activation $x$, AWQ finds a per-channel scale $s$ and applies it as:

$$y = (W \cdot \text{diag}(s)) \cdot (\text{diag}(s)^{-1} \cdot x)$$

The scaled weight $W \cdot \text{diag}(s)$ is what gets quantized: salient channels receive $s_j > 1$, so their values occupy more of the quantization grid. The scaling factor is chosen to minimize quantization error for the salient channels. The $\text{diag}(s)^{-1} \cdot x$ term can be fused into the preceding layer's output scaling.
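
A minimal sketch of the scaling step, assuming per-channel activation magnitudes collected from calibration (the real implementation grid-searches the exponent per layer and evaluates reconstruction error on calibration data):

import torch

def awq_scales(act_magnitude, alpha=0.5):
    """act_magnitude: [in] mean |activation| per input channel."""
    return act_magnitude.clamp(min=1e-5) ** alpha   # bigger s for salient channels

def apply_awq(W, x, s, bits=4):
    qmax = 2 ** (bits - 1) - 1
    W_s = W * s                                     # scale up salient columns
    scale = W_s.abs().max() / qmax                  # toy: one scale for clarity
    W_q = torch.clamp(torch.round(W_s / scale), -qmax - 1, qmax) * scale
    return (x / s) @ W_q.t()                        # closely matches x @ W.t()

# In the real method, alpha is grid-searched (e.g. over [0, 1] in steps of
# 0.05) to minimize the layer's output reconstruction error.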

AWQ typically beats GPTQ on quality at INT4, especially for smaller models (7B-13B) where GPTQ's error compensation is less effective. AWQ is also faster to apply - it completes in minutes rather than hours because it does not need to compute and invert large Hessian matrices. The tradeoff is that AWQ's quality ceiling is lower for very large models where GPTQ's second-order corrections provide more value.

SmoothQuant - Migration-Based W8A8 Quantization

SmoothQuant (Xiao, Lin, Seznec, Wu, Demouth, Han - MIT and NVIDIA, November 2022) targets a different part of the design space: W8A8 quantization, where both weights and activations are INT8 during the matrix multiply. This is the only way to use hardware INT8 GEMM instructions (like NVIDIA's IMMA instructions on A100) and get both memory bandwidth and compute throughput benefits.

The problem with W8A8 is those activation outliers. SmoothQuant's solution is mathematically elegant: migrate the quantization difficulty from activations to weights by applying a per-channel smoothing factor.

For a linear layer computing $y = Wx$, SmoothQuant rewrites this as:

$$y = (W \cdot \text{diag}(s)) \cdot (\text{diag}(s)^{-1} \cdot x)$$

where $s_j = \max(|x_j|)^{\alpha} / \max(|W_j|)^{1-\alpha}$ and $\alpha$ is a migration strength hyperparameter (typically 0.5). The factor $s$ makes activations smoother (easier to quantize) while making weights slightly less smooth (still easy to quantize because weights are inherently well-behaved). The net result is that both $W \cdot \text{diag}(s)$ and $\text{diag}(s)^{-1} \cdot x$ are quantization-friendly.
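
A sketch of computing the smoothing factors, assuming per-channel activation maxima collected during calibration:

import torch

def smooth_scales(act_max, w_max, alpha=0.5):
    """act_max[j]: max |activation| on input channel j from calibration;
    w_max[j]: max |W[:, j]| over the weight column."""
    return act_max.clamp(min=1e-5) ** alpha / w_max.clamp(min=1e-5) ** (1 - alpha)

# Applied once, offline:
#   W_smooth = W * s   (weights absorb some difficulty - they can take it)
#   x_smooth = x / s   (activation outliers damped; in practice the division
#                       is folded into the preceding LayerNorm's scale)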

SmoothQuant enables 2x memory reduction and up to 1.56x throughput improvement over FP16 on A100 hardware, because it unlocks the full INT8 compute pipeline. GPTQ and AWQ (W4A16) can match or beat SmoothQuant on memory reduction, but they cannot access INT8 GEMM hardware because activations remain in FP16.

bitsandbytes - Online Quantization for Training and Inference

bitsandbytes (Tim Dettmers, 2022-2023) takes a fundamentally different approach. Instead of a preprocessing step that produces a quantized checkpoint, bitsandbytes applies quantization dynamically during the forward pass. This makes it uniquely suited for fine-tuning workflows (QLoRA uses bitsandbytes NF4) and for situations where you need quantization without a calibration step.

INT8 (LLM.int8()): Uses mixed-precision decomposition. During inference, it detects activation outliers and keeps those specific rows/columns in FP16 while quantizing the rest to INT8. This is done per-batch, per-forward-pass. The overhead is about 15-20% in latency compared to FP16, but memory footprint is cut roughly in half.

NF4 (4-bit Normal Float): Introduced for QLoRA. NF4 is an information-theoretically optimal 4-bit quantization format for normally distributed weights - which transformer weights approximately are. Instead of using evenly-spaced quantization bins, NF4 places quantization levels at the quantiles of a standard normal distribution. This means more precision where weight values are dense (near zero) and coarser precision at the tails. NF4 with double quantization (the quantization scale factors are themselves quantized) achieves near-GPTQ quality without any calibration data.
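
A sketch of the NF4 idea. The exact bitsandbytes codebook construction differs in details (it pins an exact zero and uses asymmetric quantile spacing), but quantile placement plus blockwise absmax scaling is the core:

import torch

def nf4_levels(k=16):
    """Place k levels at quantiles of N(0, 1), normalized to [-1, 1]."""
    probs = (torch.arange(k, dtype=torch.float64) + 0.5) / k
    levels = torch.distributions.Normal(0.0, 1.0).icdf(probs)
    return (levels / levels.abs().max()).float()

def quantize_nf4(w, levels, block_size=64):
    """Blockwise absmax scaling, then nearest-level lookup."""
    blocks = w.reshape(-1, block_size)
    absmax = blocks.abs().max(dim=1, keepdim=True).values  # one FP scale per block
    idx = (blocks / absmax).unsqueeze(-1).sub(levels).abs().argmin(dim=-1)
    return idx.to(torch.uint8), absmax   # 4-bit codes + scales (double-quantizable)

w = torch.randn(4096 * 64)
codes, scales = quantize_nf4(w, nf4_levels())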

The limitation of bitsandbytes for inference: its dynamic quantization adds Python-level overhead on every forward pass, its kernels are not optimized for batched inference, and it is primarily designed for single-GPU usage. For production serving at scale, GPTQ or AWQ with vLLM gives significantly better throughput.

GGUF - CPU-First Quantization Format

GGUF (Georgi Gerganov, successor to GGML, 2023) is not primarily a quantization algorithm - it is a file format designed for the llama.cpp ecosystem. GGUF models can run on CPU, Apple Silicon (via Metal), consumer GPUs, and mixed CPU+GPU configurations. The format supports many quantization levels from Q2_K (2-bit, extreme compression) to Q8_0 (8-bit, near-lossless), each using different quantization strategies.

The "K" variants (Q4_K_M, Q5_K_M, Q6_K) use k-quants - a method that applies mixed precision within the model, quantizing attention layers more conservatively (higher bit depth) than feed-forward layers. This reflects the empirical finding that attention weights are more sensitive to quantization than FFN weights.

GGUF is the right choice when the deployment target is CPU-only or mixed CPU/GPU (consumer hardware, edge deployment, local inference with llama.cpp or Ollama). For pure GPU serving at scale, GGUF's throughput is significantly lower than GPTQ or AWQ served via vLLM or TensorRT-LLM.

HQQ - Half-Quadratic Quantization

HQQ (Badri and Shaji, 2023) approaches weight quantization as a pure optimization problem without any calibration data. It minimizes a half-quadratic objective that is robust to outliers in the weight distribution:

$$\min_{W_q} \| W - \text{dequant}(W_q) \|_2^2 + \lambda \cdot \phi(W_q)$$

where $\phi$ is a regularization term. The half-quadratic solver alternates between two closed-form updates and converges in 10-20 iterations. HQQ runs faster than GPTQ (no calibration pass needed) and achieves comparable quality at 4-bit.
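
A heavily simplified sketch of the alternating structure. The real HQQ objective uses a robust lp-norm penalty with a soft-thresholding step; this version uses plain least squares just to show the shape of the solver:

import torch

def hqq_like(W, bits=4, iters=20):
    """W: [rows, cols]; per-row asymmetric quantization parameters."""
    qmax = 2 ** bits - 1
    w_min = W.min(dim=1, keepdim=True).values
    w_max = W.max(dim=1, keepdim=True).values
    scale = ((w_max - w_min) / qmax).clamp(min=1e-8)
    zero = -w_min / scale
    for _ in range(iters):
        q = torch.clamp(torch.round(W / scale + zero), 0, qmax)  # rounding step
        zero = (q - W / scale).mean(dim=1, keepdim=True)         # closed-form update
    return q, scale, zero   # dequantize as (q - zero) * scale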

HQQ's main advantage is speed of quantization: a 70B model quantizes in minutes rather than hours. It is useful when you need to rapidly quantize many model variants (hyperparameter sweeps, fine-tuned checkpoints) or when calibration data is unavailable.

QuIP# - Incoherence Processing for 2-Bit Quantization

QuIP# (Tseng et al., 2024) targets the extreme compression regime: 2-bit quantization where the model is 8x smaller than FP16. The core idea is incoherence processing - multiplying the weight matrix by a random orthogonal matrix before quantization, then the inverse after. This spreads any outlier weight values across many dimensions, making the distribution more uniform and quantization-friendly.

At 2 bits, most methods produce models that are noticeably degraded. QuIP# achieves surprisingly competitive results at 2-bit by combining incoherence processing with a lattice codebook (E8 lattice) that optimally covers the quantized weight space. It is primarily a research-grade method - the tooling is less mature than GPTQ or AWQ - but it represents the frontier of ultra-low-bit quantization.
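
A sketch of the incoherence idea using a dense random orthogonal rotation (QuIP# itself uses structured transforms that are much cheaper to apply):

import torch

def random_orthogonal(n):
    q, r = torch.linalg.qr(torch.randn(n, n))
    return q * torch.sign(torch.diagonal(r))   # fix signs for a uniform rotation

W = torch.randn(512, 512)
W[0, 0] = 50.0                                 # one outlier weight
U, V = random_orthogonal(512), random_orthogonal(512)
W_rot = U @ W @ V.t()                          # outlier energy spread across dims
print(W.abs().max().item(), W_rot.abs().max().item())
# quantize W_rot, then reconstruct with U.t() @ dequant(W_rot_q) @ V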


Comparative Analysis

The Decision Matrix

| Use Case | Recommended Method | Why |
| --- | --- | --- |
| Production GPU serving (scale) | AWQ or GPTQ + vLLM | Best throughput, mature serving stack |
| Highest possible INT4 quality | AWQ (7B-13B) or GPTQ actorder (70B+) | Quality advantage at respective scales |
| W8A8 for max throughput | SmoothQuant + TensorRT-LLM | Only method enabling INT8 GEMM |
| Fine-tuning (QLoRA) | bitsandbytes NF4 | Built into HuggingFace PEFT, composable |
| CPU-only deployment | GGUF Q4_K_M or Q5_K_M | llama.cpp ecosystem, cross-platform |
| Local inference (consumer GPU) | GGUF or AWQ | Easy setup, good quality |
| Many model variants, fast iteration | HQQ | No calibration, minutes not hours |
| Ultra-low memory (2-bit) | QuIP# | Best 2-bit quality |
| Apple Silicon (M1/M2/M3) | GGUF (Metal backend) | Only mature option with Metal support |

Code Examples

Comparing Methods: Loading and Benchmarking

"""
Compare PTQ methods for LLaMA 3 8B on quality and speed.
Requires: transformers, auto-gptq, autoawq, bitsandbytes
"""

import torch
import time
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from auto_gptq import AutoGPTQForCausalLM
from awq import AutoAWQForCausalLM

MODEL_ID = "meta-llama/Meta-Llama-3-8B"

def load_fp16_baseline(model_id: str):
"""Load the FP16 baseline for comparison."""
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
model_id,
torch_dtype=torch.float16,
device_map="auto",
)
return model, tokenizer

def load_bitsandbytes_int8(model_id: str):
"""Load with bitsandbytes INT8 - no calibration needed."""
config = BitsAndBytesConfig(load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
model_id,
quantization_config=config,
device_map="auto",
)
return model, tokenizer

def load_bitsandbytes_nf4(model_id: str):
"""Load with bitsandbytes NF4 (QLoRA-style) - no calibration needed."""
config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=True, # quantize the scale factors too
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
model_id,
quantization_config=config,
device_map="auto",
)
return model, tokenizer

def load_gptq_model(gptq_model_id: str):
"""
Load a pre-quantized GPTQ model.
Many GPTQ models are available on HuggingFace Hub:
e.g. 'TheBloke/Meta-Llama-3-8B-GPTQ'
"""
tokenizer = AutoTokenizer.from_pretrained(gptq_model_id)
model = AutoGPTQForCausalLM.from_quantized(
gptq_model_id,
device="cuda:0",
use_triton=False, # Triton kernels are faster but need extra install
inject_fused_attention=True,
inject_fused_mlp=True,
)
return model, tokenizer

def load_awq_model(awq_model_id: str):
"""
Load a pre-quantized AWQ model.
e.g. 'hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4'
"""
tokenizer = AutoTokenizer.from_pretrained(awq_model_id)
model = AutoAWQForCausalLM.from_quantized(
awq_model_id,
fuse_layers=True, # Fuse attention + MLP for speed
)
return model, tokenizer

def measure_memory_gb() -> float:
"""Return current GPU memory usage in GB."""
if torch.cuda.is_available():
return torch.cuda.memory_allocated() / 1e9
return 0.0

def benchmark_throughput(model, tokenizer, prompt: str, n_tokens: int = 100) -> float:
"""Measure tokens per second for generation."""
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Warmup
with torch.no_grad():
_ = model.generate(**inputs, max_new_tokens=10, do_sample=False)

torch.cuda.synchronize()
start = time.perf_counter()

with torch.no_grad():
outputs = model.generate(
**inputs,
max_new_tokens=n_tokens,
do_sample=False,
pad_token_id=tokenizer.eos_token_id,
)

torch.cuda.synchronize()
elapsed = time.perf_counter() - start

generated = outputs.shape[1] - inputs.input_ids.shape[1]
return generated / elapsed # tokens per second

def compute_perplexity(model, tokenizer, text: str) -> float:
"""
Compute perplexity on a text sample.
Lower is better. FP16 baseline is the reference.
"""
encodings = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
outputs = model(**encodings, labels=encodings.input_ids)

return torch.exp(outputs.loss).item()

# Evaluation text (use WikiText-2 test set in practice)
EVAL_TEXT = """
The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France.
It is named after the engineer Gustave Eiffel, whose company designed and built the tower
from 1887 to 1889 as the centerpiece of the 1889 World's Fair.
"""

PROMPT = "Explain the key differences between supervised and unsupervised learning:"

# Run benchmarks
results = {}

# FP16 baseline
model, tokenizer = load_fp16_baseline(MODEL_ID)
mem_before = measure_memory_gb()
results["fp16"] = {
"memory_gb": measure_memory_gb(),
"perplexity": compute_perplexity(model, tokenizer, EVAL_TEXT),
"tokens_per_sec": benchmark_throughput(model, tokenizer, PROMPT),
}
del model
torch.cuda.empty_cache()

# bitsandbytes NF4
model, tokenizer = load_bitsandbytes_nf4(MODEL_ID)
results["bnb_nf4"] = {
"memory_gb": measure_memory_gb(),
"perplexity": compute_perplexity(model, tokenizer, EVAL_TEXT),
"tokens_per_sec": benchmark_throughput(model, tokenizer, PROMPT),
}
del model
torch.cuda.empty_cache()

for method, stats in results.items():
print(f"\n{method.upper()}")
print(f" Memory: {stats['memory_gb']:.2f} GB")
print(f" Perplexity: {stats['perplexity']:.3f}")
print(f" Throughput: {stats['tokens_per_sec']:.1f} tok/s")

Applying GPTQ Quantization from Scratch

"""
Quantize a model to GPTQ INT4 from scratch.
Requires: auto-gptq >= 0.6.0, GPU with 24GB+ VRAM for 7B models
"""

from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer
from datasets import load_dataset
import random

MODEL_ID = "meta-llama/Meta-Llama-3-8B"
OUTPUT_DIR = "./llama3-8b-gptq-int4"

def get_calibration_data(tokenizer, n_samples: int = 512, seq_len: int = 2048):
"""
Load calibration data. The choice of calibration dataset affects quality.
WikiText-2 is the standard, but domain-specific data gives better results
for domain-specific deployments.
"""
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

# Concatenate all text and chunk into seq_len blocks
text = "\n\n".join(dataset["text"])
tokens = tokenizer.encode(text, add_special_tokens=False)

samples = []
for i in range(n_samples):
start = random.randint(0, len(tokens) - seq_len - 1)
chunk = tokens[start : start + seq_len]
samples.append({"input_ids": chunk})

return samples

# Configure quantization
quantize_config = BaseQuantizeConfig(
bits=4, # INT4 quantization
group_size=128, # Group size 128 is the sweet spot for quality vs overhead
desc_act=True, # actorder - reorder by activation magnitude (better quality)
damp_percent=0.01, # Hessian damping for numerical stability
sym=False, # Asymmetric quantization (better range coverage)
)

# Load base model
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoGPTQForCausalLM.from_pretrained(
MODEL_ID,
quantize_config=quantize_config,
)

# Prepare calibration data
print("Preparing calibration data...")
calibration_data = get_calibration_data(tokenizer)

# Run quantization
# This takes 1-4 hours for a 70B model on a single A100
# For 8B models, expect 20-40 minutes
print("Running GPTQ quantization...")
model.quantize(calibration_data)

# Save the quantized model
model.save_quantized(OUTPUT_DIR, use_safetensors=True)
tokenizer.save_pretrained(OUTPUT_DIR)
print(f"Quantized model saved to {OUTPUT_DIR}")

# Load and verify
model_loaded = AutoGPTQForCausalLM.from_quantized(
OUTPUT_DIR,
device="cuda:0",
inject_fused_attention=True,
)

inputs = tokenizer("The capital of France is", return_tensors="pt").to("cuda:0")
outputs = model_loaded.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))

Applying AWQ Quantization

"""
Quantize a model to AWQ INT4 from scratch.
AWQ is faster than GPTQ and often better quality for 7B-13B models.
Requires: autoawq >= 0.1.8
"""

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

MODEL_ID = "meta-llama/Meta-Llama-3-8B"
OUTPUT_DIR = "./llama3-8b-awq-int4"

quant_config = {
    "zero_point": True,   # Asymmetric quantization
    "q_group_size": 128,  # Group size - same as GPTQ standard
    "w_bit": 4,           # INT4
    "version": "GEMM",    # GEMM kernel (vs GEMV for single-token inference)
}

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoAWQForCausalLM.from_pretrained(MODEL_ID, safetensors=True)

# AWQ quantization - much faster than GPTQ
# Uses WikiText-2 by default for calibration (512 samples)
# You can pass custom calibration data via calib_data parameter
model.quantize(tokenizer, quant_config=quant_config)

model.save_quantized(OUTPUT_DIR, safetensors=True)
tokenizer.save_pretrained(OUTPUT_DIR)
print(f"AWQ quantized model saved to {OUTPUT_DIR}")

Loading GGUF Models with llama-cpp-python

"""
Load and run GGUF quantized models via llama-cpp-python.
Runs on CPU, Apple Silicon, or mixed CPU+GPU.
"""

from llama_cpp import Llama

# Q4_K_M is the recommended default - good quality/speed tradeoff
# Q5_K_M for higher quality at cost of more memory
# Q8_0 for near-lossless at 8-bit
# Q2_K for extreme compression (noticeable quality loss)

model = Llama(
    model_path="./Meta-Llama-3-8B-Q4_K_M.gguf",
    n_ctx=4096,       # Context length
    n_threads=8,      # CPU threads
    n_gpu_layers=35,  # Number of layers to offload to GPU (0 = CPU only)
                      # For 8B models, 35 layers puts most of the model on GPU
    verbose=False,
)

output = model(
    "Explain quantization in one paragraph:",
    max_tokens=200,
    temperature=0.7,
    echo=False,
)

print(output["choices"][0]["text"])

# For OpenAI-compatible API usage:
# llama-cpp-python includes a server:
# python -m llama_cpp.server --model ./model.gguf --n_gpu_layers 35

Accuracy vs. Compression Tradeoffs

A critical point about perplexity degradation: the numbers are model-size-dependent. For a 7B model, INT4 GPTQ typically adds 0.4-0.8 perplexity on WikiText-2. For a 70B model, the same method adds 0.1-0.2. Larger models are more robust to quantization because the redundancy in their weights gives the error compensation more room to work. This means INT4 for 70B is production-safe in ways that INT4 for 7B may not be, depending on the task.


Production Engineering Notes

Serving Quantized Models at Scale

vLLM is the standard serving framework for GPTQ and AWQ models in production. As of vLLM 0.4+, both formats are natively supported with optimized kernels. Key configuration:

from vllm import LLM

# Serve an AWQ model
llm = LLM(
    model="hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",
    quantization="awq",
    max_model_len=8192,
    gpu_memory_utilization=0.90,  # Leave 10% headroom for KV cache
    tensor_parallel_size=2,       # Split across 2 GPUs
)

Memory budgeting: A 70B parameter FP16 model requires 140GB. INT4 GPTQ/AWQ brings this to approximately 35-40GB (accounting for scale factors, KV cache in FP16, and activation memory). Two A100 80GB cards handle this comfortably with room for batch processing.

KV cache is not quantized by default: Weight quantization reduces model weight memory, but the KV cache grows with batch size and sequence length and remains in FP16. For long-context workloads, KV cache can dominate memory usage. vLLM supports INT8 KV cache quantization separately.
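
A back-of-envelope sketch of both budgets, using illustrative LLaMA-3-70B-class dimensions (check your model's actual config before relying on these numbers):

n_params = 70e9
weight_gb_fp16 = n_params * 2 / 1e9    # ~140 GB at 2 bytes per weight
weight_gb_int4 = n_params * 0.5 / 1e9  # ~35 GB before scale-factor overhead

# KV cache per token = 2 (K and V) * layers * kv_heads * head_dim * bytes
layers, kv_heads, head_dim = 80, 8, 128  # LLaMA-3 70B uses GQA with 8 KV heads
bytes_per_token = 2 * layers * kv_heads * head_dim * 2  # FP16 -> ~320 KB/token
batch, seq_len = 16, 8192
kv_gb = bytes_per_token * batch * seq_len / 1e9         # ~43 GB at this load
print(f"weights INT4 ~ {weight_gb_int4:.0f} GB, KV cache ~ {kv_gb:.1f} GB")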

Calibration data selection matters more than you think: For domain-specific deployments (medical, legal, code), using domain-relevant calibration data for GPTQ or AWQ reduces task-specific perplexity degradation. A model quantized with WikiText-2 calibration data and deployed for code generation will show higher degradation than one calibrated on code.

Choosing Batch Size

Quantized models have different optimal batch sizes than FP16. Because the memory footprint is smaller, you can fit more sequences in memory, which improves GPU utilization. However, INT4 weight dequantization adds a small overhead per matrix multiplication that does not scale with batch size as well as INT8 does. In practice:

  • For latency-sensitive serving (batch size 1-4): GGUF with llama.cpp is competitive
  • For throughput-optimized serving (batch size 16+): AWQ/GPTQ with vLLM wins
  • For mixed workloads: AWQ with vLLM's continuous batching

Common Mistakes

:::danger Calibration Data Distribution Mismatch

Using WikiText-2 as calibration data for a model you will deploy on Python code is one of the most common quantization mistakes. Calibration data tells GPTQ and AWQ which activations are large and should be protected. If your calibration data has different activation patterns than your inference data, the quantization will be suboptimal for your actual use case.

Always use calibration data that matches your inference distribution. For code models, use a sample from your code dataset. For multilingual models, include samples from all relevant languages. A few hundred representative samples is sufficient. :::

:::danger Applying bitsandbytes for Production Throughput

bitsandbytes is optimized for convenience and composability with training workflows, not for production inference throughput. If you are serving to real users, bitsandbytes INT8 or NF4 will deliver 30-50% lower throughput than GPTQ or AWQ with vLLM, due to Python-level overhead and lack of batched inference kernels.

Use bitsandbytes for QLoRA fine-tuning and rapid prototyping. Switch to GPTQ or AWQ for production serving. :::

:::warning Group Size 128 is Not Always Optimal

Group size 128 is the community default and a reasonable starting point, but it is not universally optimal. Smaller group sizes (64, 32) give better quality at the cost of more scale factor overhead (larger model size, more memory bandwidth). Larger group sizes (256, 512) save memory but increase quantization error.

For embedding-heavy workloads or models with high sensitivity to quantization (small models, multilingual models), test group sizes 32 and 64 before committing to 128. :::

:::warning Perplexity is Not the Whole Story

WikiText-2 perplexity is the standard benchmark for quantization quality, but it does not capture task-specific degradation. A model with 0.2 perplexity increase might show 2-3% accuracy drop on reasoning tasks and 8-10% on math tasks, because those tasks are harder and more sensitive to the precision of specific weights.

Always evaluate your quantized model on the specific tasks you care about, not just perplexity. Use your actual production evaluation suite if possible. :::

:::warning INT4 for Small Models Requires More Care

The rule of thumb "INT4 is production-safe" applies primarily to models with 30B+ parameters. For 7B models, INT4 quantization introduces meaningful quality degradation on complex reasoning tasks. Test thoroughly at 7B. Consider INT8 (SmoothQuant or bitsandbytes) as a more conservative option that still achieves significant memory savings. :::


Interview Q&A

Q1: What is the core challenge that makes LLM quantization harder than CNN quantization?

The core challenge is activation outliers. LLMs, specifically transformer models, produce activation distributions with extreme outlier values - individual neurons that fire with magnitudes 10-100x larger than the median activation. CNNs trained with BatchNorm have well-normalized, relatively uniform activation distributions that quantize cleanly with simple per-tensor or per-channel scaling.

For LLMs, if you pick a quantization scale that covers the outlier values, all normal values get compressed into a tiny fraction of the integer range, losing precision. If you pick a scale that fits the normal values, the outliers overflow and produce completely wrong results. This problem does not exist for CNNs in the same way because BatchNorm suppresses outlier formation during training.

The three main approaches to this problem are: (1) mixed-precision decomposition - keep outlier channels in FP16 and quantize the rest, which is what bitsandbytes LLM.int8() does; (2) migrate the difficulty from activations to weights - which is SmoothQuant's approach; (3) use weight-only quantization - where activations stay in FP16 and only weights are quantized, avoiding the outlier problem entirely for the matrix multiply, which is what GPTQ and AWQ do.

Q2: Why does GPTQ use the Hessian matrix, and what does it actually tell you?

The Hessian of the layer's output loss with respect to the weights tells you the curvature - how much the output changes for a given change in a weight. A high second derivative for weight $w_{ij}$ means that small errors in $w_{ij}$ cause large changes in the layer output. A low second derivative means the layer output is relatively insensitive to that weight.

For a linear layer computing $y = Wx$ with calibration inputs $X$, the Hessian is $H = 2XX^T$. The $XX^T$ term is the autocorrelation matrix of the input activations. Weights corresponding to frequently-activated input dimensions have high Hessian entries, meaning they matter more and should be quantized more carefully.

GPTQ uses this to decide how to redistribute quantization error: when weight $w_{ij}$ is quantized and introduces error $\delta_{ij}$, the error is absorbed by updating remaining weights $w_{ik}$ (for $k > j$) in proportion to $H^{-1}$. Weights with high curvature absorb less of the error correction because changing them is "expensive" in terms of output quality. The Cholesky decomposition of $H^{-1}$ makes this computation numerically stable.

Q3: In what scenario would you choose SmoothQuant over GPTQ or AWQ?

SmoothQuant is the right choice when you need W8A8 quantization - that is, when you need both the weights AND the activations to be INT8 during the matrix multiply, not just the weights. This matters for compute throughput, not just memory.

GPTQ and AWQ are W4A16 or W8A16 methods: weights are quantized, but activations stay in FP16. The matrix multiply therefore runs in FP16 (or BF16). You get memory reduction but not a compute speedup in the matrix multiply itself.

SmoothQuant enables INT8 GEMM by making activations quantization-friendly. On NVIDIA A100 and H100 hardware, INT8 Tensor Core operations are 2x faster than FP16 Tensor Core operations. For throughput-constrained deployments where you are compute-bound rather than memory-bandwidth-bound, this is significant.

The tradeoff: SmoothQuant at W8A8 achieves less memory compression than INT4 methods (2x vs. 4x for weights). If you are memory-bound and need to fit more model on fewer GPUs, INT4 methods win. If you are compute-bound and already have enough GPU memory, SmoothQuant's INT8 throughput advantage may be larger than INT4's memory advantage.

Q4: What is the NF4 data type and why is it better than INT4 for normally distributed weights?

NF4 (Normal Float 4-bit) places the 16 quantization levels at the quantiles of a standard normal distribution rather than at evenly-spaced intervals. This is information-theoretically optimal for data drawn from a normal distribution, because it minimizes the expected quantization error when the data matches that distribution.

Transformer weight distributions are approximately normally distributed with zero mean - a consequence of weight initialization and the dynamics of Adam optimization. Standard INT4 places levels evenly across the value range, wasting precision in the low-density tails. NF4 places more levels near zero (where weight values are dense) and fewer levels in the tails (where they are sparse).

The practical impact: for weight-only quantization of fine-tuned transformer models, NF4 achieves slightly lower average quantization error than symmetric INT4, especially for the very common small-weight values near zero. Combined with double quantization (quantizing the FP32 scale factors themselves to 8-bit), NF4 achieves significant memory savings with minimal quality loss.

The limitation: NF4 is only optimal for weights that actually follow a normal distribution. If a layer's weights have a different distribution (bimodal, heavy-tailed), NF4's advantage disappears. In practice, this is uncommon for well-trained transformer models.

Q5: How would you measure the quality impact of quantization beyond perplexity, and why does this matter in production?

Perplexity measures how well the model predicts held-out text, but it is a coarse aggregate metric. A 0.2 increase in WikiText-2 perplexity might mask very different patterns: some tasks may be completely unaffected while others degrade substantially.

In production, the right approach is task-specific evaluation on your actual workload:

For reasoning and question answering: run MMLU, HellaSwag, ARC-Challenge, and TruthfulQA on both FP16 and quantized models. Compare accuracy scores. These benchmarks stress-test the model's precision on multi-step reasoning where accumulated quantization error is most damaging.

For code generation: run HumanEval or MBPP and compare pass@k scores. Code generation is particularly sensitive to quantization because small probability differences between tokens (semicolon vs. newline, indented vs. not) determine whether the generated code is syntactically valid.

For your production tasks specifically: build an evaluation set from your actual production queries (with human-labeled or gold-standard outputs) and compare the quantized model's outputs. This is the only evaluation that truly tells you whether quantization is acceptable for your use case.

The reason this matters: there have been documented cases where INT4 quantization of 7B models reduces pass@1 on HumanEval by 8-12% even when WikiText-2 perplexity increases by less than 0.5. Those are two qualitatively different production outcomes. Perplexity would not have predicted the problem.

Q6: What is the "actorder" trick in GPTQ and when should you use it?

The actorder (activation order) trick, also called desc_act=True in AutoGPTQ, reorders the weight columns of each layer before applying GPTQ quantization. The columns are sorted in decreasing order of their associated activation magnitudes (the diagonal of the Hessian, $H_{ii}$, which represents how much that input dimension "fires").

The intuition: GPTQ's error compensation is most effective early in the quantization process, when there are many unquantized weights that can absorb the correction. By quantizing low-importance columns first (when error compensation is cheapest) and high-importance columns last (when error compensation is most precise), actorder improves overall quality.

The tradeoff: reordering columns changes the memory access pattern during inference. The weight matrix is no longer laid out in the natural order, requiring a permutation step during the dequantization pass. This adds a small latency overhead (typically 5-10% compared to non-actorder GPTQ).

When to use it: always use actorder=True when quantizing models where quality matters and you can tolerate the latency overhead. The quality improvement is consistent - typically 0.1-0.3 lower perplexity than non-actorder - with minimal downside. The exception is latency-critical serving where every millisecond counts; in that case, benchmark both variants on your hardware before committing.


Full Comparison Benchmark Reference

The numbers below are representative of what you will see on standard models (LLaMA-3 8B and 70B). Treat them as ballpark figures - exact numbers vary by model family, calibration data, and hardware.

LLaMA-3 8B - WikiText-2 Perplexity

| Method | Format | PPL (WikiText-2) | VRAM (inference) | Quantization time | Serving throughput |
| --- | --- | --- | --- | --- | --- |
| FP16 baseline | - | 7.1 | 16 GB | N/A | 100% (reference) |
| bitsandbytes INT8 | W8A16 | 7.2 | 9 GB | None (online) | 70-75% |
| bitsandbytes NF4 | W4A16 | 7.4 | 5.5 GB | None (online) | 55-65% |
| GPTQ INT4 gs128 | W4A16 | 7.5 | 5.0 GB | 25 min | 90-95% |
| GPTQ INT4 gs128 actorder | W4A16 | 7.3 | 5.0 GB | 30 min | 85-90% |
| AWQ INT4 gs128 | W4A16 | 7.3 | 5.0 GB | 8 min | 92-97% |
| GGUF Q4_K_M | W4A16 | 7.4 | 5.2 GB | Prebuilt | CPU-dependent |
| HQQ INT4 gs128 | W4A16 | 7.4 | 5.0 GB | 5 min | 88-93% |

Notes: throughput measured on A100 80GB with vLLM, batch=16, 512 output tokens. GGUF throughput measured on CPU (Intel Xeon) is roughly 15-30 tokens/sec for Q4_K_M.

LLaMA-3 70B - WikiText-2 Perplexity

| Method | Format | PPL (WikiText-2) | VRAM (inference) | GPU requirement |
| --- | --- | --- | --- | --- |
| FP16 baseline | - | 5.7 | 140 GB | 2x A100 80GB |
| GPTQ INT4 gs128 | W4A16 | 5.9 | 38 GB | 1x A100 80GB |
| GPTQ INT4 gs128 actorder | W4A16 | 5.85 | 38 GB | 1x A100 80GB |
| AWQ INT4 gs128 | W4A16 | 5.85 | 37 GB | 1x A100 80GB |
| GGUF Q4_K_M | W4A16 | 5.9 | 40 GB | 1x A100 40GB |
| GGUF Q5_K_M | W5A16 | 5.75 | 48 GB | 1x A100 80GB |

The 70B numbers show why INT4 quantization was transformative: the model that required two A100 80GB GPUs in FP16 now runs comfortably on a single one with minimal quality loss.


Checklist for Choosing a PTQ Method

Before selecting a quantization method, answer these questions in order:

1. What is the target hardware?

  • CPU-only or Apple Silicon: GGUF is your only mature option
  • NVIDIA GPU (data center): GPTQ or AWQ with vLLM
  • Consumer NVIDIA GPU (RTX 4090 etc): AWQ or GGUF depending on whether you need llama.cpp vs Python stack

2. Do you need to fine-tune after quantization?

  • Yes: bitsandbytes NF4 (QLoRA) - the only method composable with gradient computation
  • No: proceed to step 3

3. How fast does quantization need to happen?

  • Immediate (no calibration step): bitsandbytes, HQQ
  • Can wait minutes: AWQ
  • Can wait hours: GPTQ

4. Is this W8A8 (INT8 GEMM) required for compute throughput?

  • Yes: SmoothQuant + TensorRT-LLM
  • No: proceed to step 5

5. Which bit width?

  • 8-bit (conservative): GPTQ INT8 or bitsandbytes INT8
  • 4-bit (standard production): AWQ or GPTQ
  • 3-bit or 2-bit (extreme compression): QuIP# or GPTQ with 3-bit (lower quality, use carefully)

6. Does calibration data match your inference domain?

  • Yes: standard WikiText-2 calibration is fine for general models
  • No: prepare domain-specific calibration data for GPTQ or AWQ

This decision tree covers 95% of real production scenarios. The remaining 5% - novel hardware, multi-modal models, extremely long contexts - requires deeper investigation case by case.


What Comes Next

The methods described in this lesson are the stable, production-proven PTQ tools as of 2024-2025. The field continues to evolve rapidly:

Speculative decoding with quantized models: Running a small draft model (1B-3B, INT4) to propose tokens and a larger verifier (70B, INT4) to accept or reject them. The quantized sizes make this economically viable in ways it was not with FP16 models.

KV cache quantization: INT8 and INT4 KV cache reduces memory pressure for long-context workloads. vLLM added KV8 cache support in 0.4.0. This is orthogonal to weight quantization and can be combined with any of the methods above.

FP8 quantization: NVIDIA H100 and H200 hardware natively support FP8 computation. FP8 (8-bit floating point) gives better quality than INT8 at the same bit width because the floating-point representation handles dynamic range better. vLLM and TensorRT-LLM both support FP8 serving on H100+. FP8 is increasingly the preferred W8A8 method over SmoothQuant INT8 on H100 hardware because it requires no migration math.

Quantization at 1-bit: BitNet (Wang et al., 2023) and its follow-up BitNet b1.58 (Ma et al., 2024) train models from scratch with binary or ternary weights (-1, 0, +1). This is not PTQ - it requires training - but if 1-bit trained models achieve quality parity with FP16 at comparable scale, it would make all the PTQ methods in this lesson obsolete for future models. As of 2025, 1-bit models have not yet matched the quality of FP16 models at the same parameter count, but the research direction is active.

The core skills from this lesson - understanding the weight/activation tradeoff, calibration data selection, granularity decisions, and hardware matching - will remain relevant regardless of which new methods emerge. The design space does not change; only the specific points within it that are worth visiting.
