Quantization: INT8 and INT4
The Production Scenario
It is mid-2023. Your team wants to run LLaMA-2 70B for internal tooling. You do the math: 70B parameters × 2 bytes per parameter (FP16) = 140 GB for the weights alone. Your company has one A100 80GB. You would need two of them, connected with NVLink, and renting two A100s costs roughly $10/hour. For a tool used by 20 engineers a few times per day, this is absurd.
Then you discover INT4 quantization. You quantize the model to 4 bits per parameter. The model size drops from 140 GB to 35 GB. It fits on a single A100 80GB with room to spare, and output quality is nearly identical for the tasks your team uses it for. Weight memory drops by 4×, and you need one GPU instead of two.
This is what quantization unlocks at the model serving level. It is not about theoretical elegance - it is about the practical difference between a model that requires $60K of hardware and one that runs on what you already have. Understanding the techniques, their quality-performance trade-offs, and their failure modes is one of the most practically valuable skills in LLM deployment.
The central tension in quantization: floating point numbers encode a real number as a sign bit, exponent, and mantissa. Integers just encode a fixed-range integer value. You are replacing rich, expressive encodings with coarser ones. The trick is doing this in a way that preserves model behavior. Some methods fail badly. Others achieve remarkable fidelity. The difference is in the details.
Why This Exists: The Memory Problem
Model Size Math
| Model | FP32 | FP16/BF16 | INT8 | INT4 |
|---|---|---|---|---|
| 7B | 28 GB | 14 GB | 7 GB | 3.5 GB |
| 13B | 52 GB | 26 GB | 13 GB | 6.5 GB |
| 34B | 136 GB | 68 GB | 34 GB | 17 GB |
| 70B | 280 GB | 140 GB | 70 GB | 35 GB |
| 180B | 720 GB | 360 GB | 180 GB | 90 GB |
The transition from FP16 to INT4 is a 4× reduction. For a 70B model, that is the difference between needing two A100 80GB GPUs and needing one - that is the practical impact.
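The table follows directly from parameter count × bits per parameter. A minimal sketch of that arithmetic, assuming 1 GB = 10^9 bytes and counting weights only (no KV cache, activations, or quantization metadata); the helper name `weight_memory_gb` is illustrative, not from any library:

```python
def weight_memory_gb(n_params_billion: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB (10^9 bytes), weights only."""
    total_bytes = n_params_billion * 1e9 * bits_per_param / 8
    return total_bytes / 1e9


FORMATS = (("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4))

for size in (7, 13, 34, 70, 180):
    row = ", ".join(f"{name}: {weight_memory_gb(size, bits):g} GB" for name, bits in FORMATS)
    print(f"{size}B -> {row}")
```

Real deployments need headroom on top of these numbers for the KV cache and activations, so treat them as a lower bound.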
But quantization is not free. You are trading precision for efficiency. The question is: how much precision can you lose before model quality degrades unacceptably? The answer depends critically on which weights you quantize, how you compute the quantization mapping, and whether you calibrate on representative data.
Historical Context
Quantization has been used in neural networks since the late 1990s for edge deployment (mobile, embedded systems). For LLMs specifically, the story begins around 2022 when models became too large for single-GPU inference:
- LLM.int8() (Dettmers et al., 2022): First practical INT8 quantization for LLMs at scale. Discovered the "outlier" problem - a small fraction of extreme-magnitude activations that break naive quantization - and solved it with mixed-precision decomposition.
- GPTQ (Frantar et al., 2022): Weight-only INT4 quantization using second-order information. Achieved state-of-the-art INT4 quality by minimizing layer-wise quantization error.
- AWQ (Lin et al., 2023): Activation-aware Weight Quantization. Identified that only ~1% of weights are "salient" (determined by activation magnitude), and that protecting these weights by scaling them before quantization preserves quality, often better than GPTQ at INT4.
- GGUF (Georgi Gerganov, 2023): File and quantization format in llama.cpp enabling CPU inference with a mix of precision levels (Q4_K_M, Q5_K_M, Q8_0).
Quantization Fundamentals
The Quantization Mapping
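Quantization maps a range of floating-point values to a set of integer values:

$$q = \text{clamp}\left(\text{round}\left(\frac{x}{s}\right) + z,\; q_{\min},\; q_{\max}\right)$$

Where:
- $s$ is the scale factor (float) - size of each quantization step
- $z$ is the zero point (integer) - the integer value representing 0.0 in the float range

Dequantization reconstructs the float:

$$\hat{x} = s \cdot (q - z)$$

The reconstruction error is called quantization error: $\epsilon = x - \hat{x}$, bounded by $|\epsilon| \le s/2$ per element (for values inside the representable range; clamped values can have larger error).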
Symmetric vs Asymmetric Quantization
Symmetric: Zero point $z = 0$. The quantized range is centered at zero.

$$s = \frac{\max(|x|)}{2^{b-1} - 1}$$

For INT8: range is [-127, 127] (excluding -128 keeps the range symmetric about zero). Simpler hardware implementation but wastes range for asymmetric distributions.
Asymmetric: Zero point $z$ is non-zero. Can cover any range $[x_{\min}, x_{\max}]$.

$$s = \frac{x_{\max} - x_{\min}}{2^{b} - 1}, \qquad z = \text{round}\left(q_{\min} - \frac{x_{\min}}{s}\right)$$

For INT8: range is [0, 255] or [-128, 127]. Better for ReLU outputs (non-negative) or weights with asymmetric distributions.
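The asymmetric case can be sketched in a few lines of NumPy. This is an illustrative sketch, not a library API: the function names are made up here, the integer range is assumed to be unsigned `[0, 2^bits - 1]`, and the input is a ReLU-like (non-negative) array so the zero point comes out as 0.

```python
import numpy as np


def quantize_asymmetric(x: np.ndarray, bits: int = 8):
    """Affine quantization to unsigned integers in [0, 2^bits - 1]."""
    q_min, q_max = 0, 2 ** bits - 1
    x_min, x_max = float(x.min()), float(x.max())
    scale = max((x_max - x_min) / (q_max - q_min), 1e-8)
    zero_point = int(round(q_min - x_min / scale))
    q = np.clip(np.round(x / scale) + zero_point, q_min, q_max).astype(np.int32)
    return q, scale, zero_point


def dequantize_asymmetric(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    return scale * (q.astype(np.float64) - zero_point)


rng = np.random.default_rng(0)
relu_like = np.maximum(rng.normal(size=1000), 0.0)  # non-negative, like a ReLU output
q, s, z = quantize_asymmetric(relu_like)
err = float(np.abs(relu_like - dequantize_asymmetric(q, s, z)).max())
print(f"scale={s:.5f} zero_point={z} max_error={err:.5f} (s/2={s / 2:.5f})")
```

Because the whole unsigned range is spent on `[0, x_max]`, the step size is half of what a symmetric scheme would use for the same data.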
Granularity
The granularity of quantization parameters (scale, zero point) determines quality:
| Granularity | Description | Quality | Overhead |
|---|---|---|---|
| Per-tensor | One scale per weight matrix | Lowest | Negligible |
| Per-channel | One scale per output channel | Medium | Small |
| Per-token | One scale per token (for activations) | High | Moderate |
| Per-group | One scale per group of N weights | Highest | Largest |
Per-group quantization (group size 64 or 128) is used in GPTQ and AWQ - it balances quality and memory overhead.
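To make the granularity table concrete, here is a small NumPy sketch comparing per-tensor and per-group INT4 on a synthetic weight matrix. The helper `fake_quant_symmetric` and the synthetic data (a few large-magnitude columns) are assumptions for illustration; it quantizes and immediately dequantizes so the error can be measured directly.

```python
import numpy as np


def fake_quant_symmetric(w: np.ndarray, bits: int, group_size=None) -> np.ndarray:
    """Quantize then dequantize.

    group_size=None uses one scale for the whole tensor; otherwise one scale
    per group of `group_size` consecutive weights within each row.
    """
    n_levels = 2 ** (bits - 1) - 1
    if group_size is None:
        groups = w.reshape(1, -1)
    else:
        groups = w.reshape(-1, group_size)  # assumes row length divisible by group_size
    scale = np.maximum(np.abs(groups).max(axis=1, keepdims=True) / n_levels, 1e-8)
    q = np.clip(np.round(groups / scale), -n_levels, n_levels)
    return (q * scale).reshape(w.shape)


rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(256, 1024))
w[:, :4] *= 20.0  # a few large-magnitude columns dominate the global maximum

mse = {}
for name, g in (("per-tensor", None), ("per-group (128)", 128)):
    mse[name] = float(np.mean((w - fake_quant_symmetric(w, bits=4, group_size=g)) ** 2))
    print(f"{name:>16}: MSE = {mse[name]:.3e}")
```

With a single scale, the large columns set the step size for every weight; with groups, only the groups that contain those columns pay for them.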
import torch
from typing import Tuple


def quantize_symmetric(
    x: torch.Tensor,
    bits: int = 8,
    per_channel: bool = False
) -> Tuple[torch.Tensor, torch.Tensor]:
    """
    Symmetric quantization.
    Returns:
        quantized: INT tensor
        scale: Float scale factors
    """
    n_levels = 2 ** (bits - 1) - 1  # 127 for INT8, 7 for INT4
    if per_channel:
        # Per-channel: one scale per output channel (each row of an [out, in] weight)
        abs_max = x.abs().max(dim=1, keepdim=True).values
    else:
        abs_max = x.abs().max()
    scale = abs_max / n_levels
    scale = scale.clamp(min=1e-8)
    quantized = (x / scale).round().clamp(-n_levels, n_levels)
    # PyTorch has no native int4 dtype, so INT4 values are also held in int8
    # here (one value per byte, not packed).
    quantized = quantized.to(torch.int8)
    return quantized, scale


def dequantize_symmetric(
    quantized: torch.Tensor,
    scale: torch.Tensor
) -> torch.Tensor:
    return quantized.float() * scale


def measure_quantization_error(
    weights: torch.Tensor,
    bits: int = 8
) -> dict:
    """Measure quantization error for a weight tensor."""
    quant, scale = quantize_symmetric(weights, bits=bits)
    reconstructed = dequantize_symmetric(quant, scale)
    error = weights - reconstructed
    relative_error = error.abs() / (weights.abs() + 1e-8)
    original_bytes = weights.numel() * weights.element_size()
    # Ideal packed size at `bits` bits per value (scale metadata not counted)
    quantized_bytes = quant.numel() * bits / 8
    return {
        "mse": error.pow(2).mean().item(),
        "max_error": error.abs().max().item(),
        "mean_relative_error": relative_error.mean().item(),
        "original_size_bytes": original_bytes,
        "quantized_size_bytes": quantized_bytes,
        "compression_ratio": original_bytes / quantized_bytes
    }
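Since there is no native 4-bit dtype, real INT4 kernels pack two values into each byte to get the 0.5 bytes/weight figure. The sketch below shows one possible packing (low nibble first, two's-complement nibbles); the layout and the function names are assumptions for illustration, and actual libraries use their own layouts tuned to their kernels.

```python
import numpy as np


def pack_int4(values: np.ndarray) -> np.ndarray:
    """Pack a 1-D array of signed 4-bit integers in [-8, 7], two per byte."""
    assert values.size % 2 == 0, "need an even number of values"
    nibbles = (values.astype(np.int16) & 0x0F).astype(np.uint8)  # two's-complement nibble
    return nibbles[0::2] | (nibbles[1::2] << 4)


def unpack_int4(packed: np.ndarray) -> np.ndarray:
    """Inverse of pack_int4: recover the signed 4-bit integers as int8."""
    out = np.empty(packed.size * 2, dtype=np.int8)
    out[0::2] = (packed & 0x0F).astype(np.int8)
    out[1::2] = (packed >> 4).astype(np.int8)
    return np.where(out > 7, out - 16, out).astype(np.int8)  # sign-extend the nibble


vals = np.array([-7, 7, 0, -1, 3, -8], dtype=np.int8)
packed = pack_int4(vals)
print(f"{vals.nbytes} bytes unpacked -> {packed.nbytes} bytes packed")
print("round trip ok:", np.array_equal(unpack_int4(packed), vals))
```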
The Outlier Problem in LLMs
Why Naive Quantization Fails for LLMs
Dettmers et al. (2022) made a crucial empirical discovery: in transformer models with more than 6.7B parameters, systematic outlier features emerge in the hidden states (activations).
These outliers are:
- Extreme in magnitude: up to 1000× larger than the average activation value
- Systematic: they always appear in the same feature dimensions (columns of the activation matrix)
- Sparse: only about 0.1% of all activation values are outliers
The problem: if you compute a per-tensor scale based on max(|activations|), the scale is dominated by the outliers. All normal-magnitude values get quantized to the same few integer values, destroying almost all information in the non-outlier features.
import torch


def demonstrate_outlier_problem():
    """
    Show how outliers break naive INT8 quantization of activations.
    """
    torch.manual_seed(42)
    # Simulate activation tensor with systematic outliers
    # Shape: [batch, seq_len, hidden_dim]
    batch, seq, hidden = 1, 32, 4096
    activations = torch.randn(batch, seq, hidden) * 0.5  # Normal range
    # Insert outliers in specific feature dimensions (mimicking real LLMs)
    outlier_dims = [42, 314, 1024, 2048, 3200]
    for dim in outlier_dims:
        activations[:, :, dim] *= 200.0  # 200x larger
    # Compute per-tensor scale
    abs_max = activations.abs().max()
    scale_pertensor = abs_max / 127.0
    print(f"Max activation: {abs_max:.2f}")
    print(f"Per-tensor scale: {scale_pertensor:.4f}")
    # Quantize with per-tensor scale
    quant_coarse = (activations / scale_pertensor).round().clamp(-127, 127).to(torch.int8)
    recon_coarse = quant_coarse.float() * scale_pertensor
    # Only care about non-outlier values
    non_outlier_mask = torch.ones(hidden, dtype=torch.bool)
    for dim in outlier_dims:
        non_outlier_mask[dim] = False
    normal_error = (activations[:, :, non_outlier_mask] -
                    recon_coarse[:, :, non_outlier_mask]).abs().mean()
    print(f"\nPer-tensor INT8 error on non-outlier dims: {normal_error:.4f}")
    # Compare: per-token scale (one scale per token)
    abs_max_per_token = activations.abs().max(dim=-1, keepdim=True).values
    scale_per_token = abs_max_per_token / 127.0
    quant_fine = (activations / scale_per_token).round().clamp(-127, 127).to(torch.int8)
    recon_fine = quant_fine.float() * scale_per_token
    normal_error_fine = (activations[:, :, non_outlier_mask] -
                         recon_fine[:, :, non_outlier_mask]).abs().mean()
    print(f"Per-token INT8 error on non-outlier dims: {normal_error_fine:.4f}")
    print(f"\nOutliers exist in {len(outlier_dims)} of {hidden} dimensions ({len(outlier_dims)/hidden:.2%})")
    print("Yet they dominate the per-tensor scale and destroy quantization of all other dims.")
LLM.int8() - Mixed Precision Decomposition
Dettmers et al. (2022) solved the outlier problem with a decomposition approach:
- Identify outlier dimensions: feature dimensions where any activation exceeds a threshold (default: 6.0)
- Process outliers in FP16: multiply the outlier columns of the activation matrix with the corresponding rows of the weight matrix in full FP16
- Process non-outliers in INT8: quantize remaining weights and activations to INT8, multiply in INT8 (faster, lower memory)
- Combine results: add the FP16 and INT8 partial products
The key insight: only ~0.1% of values are outliers, so ~99.9% of the computation uses INT8. The FP16 path handles the few critical outliers without degrading quality.
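The four steps can be sketched in NumPy as follows. This is a simplified simulation, not the bitsandbytes implementation: the INT8 path uses one scale per activation row and one per weight column, the "FP16" path is plain float math, and the function names and synthetic data are assumptions for illustration.

```python
import numpy as np


def int8_matmul(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Simulated INT8 matmul: per-row scales for x, per-column scales for w."""
    sx = np.maximum(np.abs(x).max(axis=1, keepdims=True) / 127.0, 1e-8)
    sw = np.maximum(np.abs(w).max(axis=0, keepdims=True) / 127.0, 1e-8)
    xq = np.round(x / sx).astype(np.int32)
    wq = np.round(w / sw).astype(np.int32)
    return (xq @ wq).astype(np.float64) * sx * sw  # integer accumulate, then rescale


def mixed_precision_matmul(x: np.ndarray, w: np.ndarray, threshold: float = 6.0):
    """x: [tokens, hidden], w: [hidden, out]. Outlier feature dims stay in float."""
    outlier = (np.abs(x) > threshold).any(axis=0)           # one flag per hidden dim
    y_float = x[:, outlier] @ w[outlier, :]                  # high-precision path
    y_int8 = int8_matmul(x[:, ~outlier], w[~outlier, :])     # INT8 path
    return y_float + y_int8, int(outlier.sum())


rng = np.random.default_rng(0)
x = rng.normal(scale=0.5, size=(32, 512))
x[:, [3, 100, 400]] *= 40.0  # three systematic outlier dimensions
w = rng.normal(scale=0.05, size=(512, 64))

y_ref = x @ w
y_mixed, n_outliers = mixed_precision_matmul(x, w)
err_naive = float(np.abs(int8_matmul(x, w) - y_ref).mean())
err_mixed = float(np.abs(y_mixed - y_ref).mean())
print(f"outlier dims routed to float path: {n_outliers}")
print(f"mean abs error, all-INT8: {err_naive:.5f}  mixed: {err_mixed:.5f}")
```

Removing the outlier columns from the INT8 path shrinks the activation scales, which is where most of the error reduction comes from.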
# Using bitsandbytes for LLM.int8()
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch


def load_model_int8(model_name: str):
    """
    Load model in INT8 with LLM.int8() mixed precision decomposition.
    Requires: pip install bitsandbytes transformers accelerate
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        # LLM.int8() - requires bitsandbytes. Recent transformers releases expect a
        # BitsAndBytesConfig; older ones also accepted load_in_8bit=True directly.
        quantization_config=BitsAndBytesConfig(load_in_8bit=True),
        device_map="auto"
    )
    # Model is now in mixed precision:
    # - Linear layers: mostly INT8 weights, mixed FP16/INT8 compute
    # - LayerNorm, embeddings: remain in FP16
    print("Model loaded in INT8")
    print_model_memory(model)
    return model, tokenizer


def print_model_memory(model):
    """Print memory usage broken down by dtype."""
    from collections import defaultdict
    dtype_bytes = defaultdict(int)
    for name, param in model.named_parameters():
        dtype_bytes[str(param.dtype)] += param.numel() * param.element_size()
    total = sum(dtype_bytes.values())
    print("\nMemory by dtype:")
    for dtype, bytes_count in sorted(dtype_bytes.items()):
        print(f"  {dtype}: {bytes_count/1e9:.2f} GB ({bytes_count/total:.1%})")
    print(f"  Total: {total/1e9:.2f} GB")
LLM.int8() characteristics:
- Memory reduction: ~50% vs FP16 (slightly less than the ideal 2×, because LayerNorm, embeddings, and quantization metadata are not stored in INT8)
- Throughput: ~20% slower than FP16 on A100 (mixed precision has overhead)
- Quality: near-lossless; perplexity typically stays very close to FP16
- Best use case: fitting large models on fewer GPUs when speed is secondary to correctness
GPTQ - Weight-Only INT4 Quantization
GPTQ (Frantar et al., 2022) achieves 4× memory reduction by quantizing weights to INT4 using second-order optimization.
The Core Idea
Instead of simply rounding weights to the nearest INT4 value, GPTQ compensates for the quantization error of each weight by adjusting the remaining unquantized weights in the same layer:
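- Process weights in columns (one column at a time)
- For each weight $w_q$: quantize to INT4, compute the error $w_q - \text{quant}(w_q)$
- Compensate: adjust the remaining unquantized weights $F$ in the same row to counteract the error using the inverse Hessian

$$\delta_F = -\frac{w_q - \text{quant}(w_q)}{[H_F^{-1}]_{qq}} \cdot (H_F^{-1})_{:,q}$$

Where $H_F = 2 X_F X_F^\top$ is the Hessian of the layer's output error with respect to the remaining weights, computed on a calibration dataset.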
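The update rule can be sketched on a small layer with NumPy. This is a simplified illustration under stated assumptions, not the AutoGPTQ implementation: it uses one fixed scale per row instead of per-group scales, explicitly downdates the inverse Hessian after each column (GPTQ itself uses a Cholesky-based reformulation and block updates for speed), and the names `quant_grid` and `gptq_like_quantize` are hypothetical.

```python
import numpy as np


def quant_grid(w, scale, n_levels=7):
    """Round to the nearest point of a symmetric INT4 grid."""
    return np.clip(np.round(w / scale), -n_levels, n_levels) * scale


def gptq_like_quantize(W: np.ndarray, X: np.ndarray, damp: float = 0.01) -> np.ndarray:
    """W: [rows, cols] weights, X: [cols, n_samples] calibration inputs.

    Quantizes one column at a time and pushes each column's rounding error
    onto the columns that are still unquantized.
    """
    W = W.astype(np.float64).copy()
    cols = W.shape[1]
    scale = np.maximum(np.abs(W).max(axis=1, keepdims=True) / 7.0, 1e-8)
    H = 2.0 * X @ X.T
    H += damp * np.mean(np.diag(H)) * np.eye(cols)  # damping for numerical stability
    Hinv = np.linalg.inv(H)
    Q = np.zeros_like(W)
    for q in range(cols):
        Q[:, q] = quant_grid(W[:, q], scale[:, 0])
        err = (W[:, q] - Q[:, q]) / Hinv[q, q]
        # delta_F = -err * Hinv[q, F] for the remaining columns F = q+1..end
        W[:, q + 1:] -= np.outer(err, Hinv[q, q + 1:])
        # Remove column q from the inverse Hessian (rank-1 downdate)
        Hinv -= np.outer(Hinv[:, q], Hinv[q, :]) / Hinv[q, q]
    return Q


rng = np.random.default_rng(0)
rows, cols, n = 16, 32, 512
mix = np.eye(cols) + 0.3 * rng.normal(size=(cols, cols))  # correlate the input features
X = mix @ rng.normal(size=(cols, n))
W = rng.normal(scale=0.05, size=(rows, cols))

scale = np.maximum(np.abs(W).max(axis=1, keepdims=True) / 7.0, 1e-8)
Q_rtn = quant_grid(W, scale)
Q_gptq = gptq_like_quantize(W, X)


def output_error(Q):
    return float(np.sum(((W - Q) @ X) ** 2))


print(f"round-to-nearest output error: {output_error(Q_rtn):.4f}")
print(f"with error compensation:       {output_error(Q_gptq):.4f}")
```

The compensated weights are individually further from the originals than round-to-nearest, but the layer output on the calibration inputs is typically closer, which is the quantity GPTQ optimizes.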
# Using AutoGPTQ for GPTQ quantization
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig


def quantize_with_gptq(
    model_name: str,
    output_dir: str,
    bits: int = 4,
    group_size: int = 128,
    desc_act: bool = False
):
    """
    Quantize a model using GPTQ.
    Requires: pip install auto-gptq optimum
    Args:
        model_name: HuggingFace model ID or local path
        output_dir: Where to save the quantized model
        bits: Quantization bits (2, 3, 4, or 8)
        group_size: Group size for per-group quantization
        desc_act: Use activation order (slower but better quality)
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
    # Calibration data - the model is quantized to minimize error on these examples
    calibration_data = [
        tokenizer(
            "Auto-regressive language models learn to predict the next token.",
            return_tensors="pt"
        ),
        # Add 128-512 representative examples for best quality
    ]
    quantize_config = BaseQuantizeConfig(
        bits=bits,
        group_size=group_size,
        desc_act=desc_act,
    )
    model = AutoGPTQForCausalLM.from_pretrained(
        model_name,
        quantize_config=quantize_config
    )
    # This step does the actual quantization - takes 30-120 minutes for 70B
    model.quantize(calibration_data)
    # Save quantized model
    model.save_quantized(output_dir, use_safetensors=True)
    tokenizer.save_pretrained(output_dir)
    print(f"Quantized model saved to {output_dir}")


def load_gptq_model(model_dir: str):
    """Load a previously quantized GPTQ model."""
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoGPTQForCausalLM.from_quantized(
        model_dir,
        device_map="auto",
        use_triton=True  # Triton kernels for faster INT4 matmul
    )
    return model, tokenizer
GPTQ characteristics:
- Memory: close to 4× reduction from FP16 at INT4 (with group_size=128, per-group scales and zero points add roughly 0.13–0.16 bits/weight)
- Quality: perplexity within 5–15% of FP16 depending on task and model size
- Quantization time: 30–120 minutes for 70B model (one-time cost)
- Inference speed: similar to FP16 (weight-only quantization - compute in FP16, weights stored in INT4)
- Best for: offline model serving where you do the quantization once and serve indefinitely
AWQ - Activation-Aware Weight Quantization
AWQ (Lin et al., 2023) improves on GPTQ with a key insight: not all weights are equally important. Weights connected to large-magnitude activations have outsized impact on model output.
The Salient Weight Observation
By analyzing activation statistics over a calibration dataset, AWQ identifies "salient" weights - those connected to feature dimensions with large activation magnitudes. These salient weights represent about 1% of all weights but disproportionately affect model output.
AWQ's approach:
- Identify salient weight channels using activation statistics
- Scale up these weights before quantization (making them larger, reducing relative error)
- Scale down the corresponding input activations to keep the product unchanged
This is mathematically equivalent to per-group quantization with optimized scale factors, but the scales are chosen based on activation statistics rather than weight statistics alone.
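The scale-up/scale-down trick can be sketched in NumPy. This is a toy illustration under stated assumptions, not the AutoAWQ code: scales are taken as mean activation magnitude raised to a power `alpha`, `alpha` is chosen by a small grid search on output error (`alpha = 0` means no scaling), quantization uses one scale per output column rather than per group, and the function names are hypothetical.

```python
import numpy as np


def fake_quant_cols(w: np.ndarray, bits: int = 4) -> np.ndarray:
    """Symmetric quantize-dequantize with one scale per output column of w [in, out]."""
    n_levels = 2 ** (bits - 1) - 1
    scale = np.maximum(np.abs(w).max(axis=0, keepdims=True) / n_levels, 1e-8)
    return np.clip(np.round(w / scale), -n_levels, n_levels) * scale


def awq_like_search(x: np.ndarray, w: np.ndarray, alphas=np.linspace(0.0, 1.0, 11)):
    """Return (best_error, best_alpha, best_scales) for per-input-channel scaling."""
    act_mag = np.abs(x).mean(axis=0)  # activation statistic per input channel
    y_ref = x @ w
    best = None
    for alpha in alphas:
        s = act_mag ** alpha
        # Scale weights up by s, scale activations down by s: product is unchanged
        y = (x / s) @ fake_quant_cols(w * s[:, None])
        err = float(np.mean((y - y_ref) ** 2))
        if best is None or err < best[0]:
            best = (err, float(alpha), s)
    return best


rng = np.random.default_rng(0)
x = rng.normal(size=(64, 256))
x[:, :2] *= 30.0  # two input channels with large activations ("salient")
w = rng.normal(scale=0.02, size=(256, 128))

plain_err = float(np.mean((x @ fake_quant_cols(w) - x @ w) ** 2))
best_err, best_alpha, s = awq_like_search(x, w)
print(f"no scaling: {plain_err:.5f}  best alpha={best_alpha:.1f}: {best_err:.5f}")
```

Before quantization the rescaling is exact, `(x / s) @ (w * s[:, None])` equals `x @ w`, so any change in output error comes purely from how the rounding grid lands on the rescaled weights.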
# Using AutoAWQ
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer


def quantize_with_awq(
    model_name: str,
    output_dir: str,
    bits: int = 4,
    group_size: int = 128,
    zero_point: bool = True
):
    """
    Quantize a model with AWQ (Activation-aware Weight Quantization).
    Requires: pip install autoawq
    AWQ is faster to quantize than GPTQ (~10-15 min for 7B vs 30+ min)
    and generally achieves better quality at INT4.
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    model = AutoAWQForCausalLM.from_pretrained(
        model_name,
        low_cpu_mem_usage=True,
        use_cache=False
    )
    quant_config = {
        "zero_point": zero_point,
        "q_group_size": group_size,
        "w_bit": bits,
        "version": "GEMM"  # GEMM kernel (faster) vs GEMV (lower memory)
    }
    # AWQ searches for optimal scales using ~128 calibration samples
    # Takes ~10-15 minutes for 7B model, ~45 min for 70B
    model.quantize(tokenizer, quant_config=quant_config)
    model.save_quantized(output_dir)
    tokenizer.save_pretrained(output_dir)
    print(f"AWQ quantized model saved to {output_dir}")
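def benchmark_quantization_methods(model_name: str = "meta-llama/Meta-Llama-3-8B"):
    """
    Compare quality and memory across quantization methods.
    Uses perplexity on WikiText-2 as quality proxy.
    """
    import gc
    import math
    import time
    import torch
    from datasets import load_dataset
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
    test_texts = [t for t in dataset["text"] if len(t.strip()) > 100][:50]
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    results = {}
    # Configurations to test
    configs = [
        ("FP16", {"torch_dtype": torch.float16}),
        ("INT8 (LLM.int8())", {"quantization_config": BitsAndBytesConfig(load_in_8bit=True)}),
        ("INT4 NF4 (QLoRA)", {"quantization_config": BitsAndBytesConfig(
            load_in_4bit=True, bnb_4bit_quant_type="nf4")}),
    ]
    for name, kwargs in configs:
        print(f"\nLoading {name}...")
        if torch.cuda.is_available():
            torch.cuda.reset_peak_memory_stats()  # measure each config separately
        t0 = time.time()
        model = AutoModelForCausalLM.from_pretrained(
            model_name,
            device_map="auto",
            **kwargs
        )
        load_time = time.time() - t0
        # Rough perplexity: token-weighted average loss over short passages
        nll, n_tokens = 0.0, 0
        for text in test_texts:
            enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
            enc = enc.to(model.device)
            with torch.no_grad():
                out = model(**enc, labels=enc["input_ids"])
            n = enc["input_ids"].shape[1] - 1  # number of predicted tokens
            nll += out.loss.item() * n
            n_tokens += n
        ppl = math.exp(nll / n_tokens)
        # Peak allocated memory on the current device (load + evaluation)
        if torch.cuda.is_available():
            mem_gb = torch.cuda.max_memory_allocated() / 1e9
        else:
            mem_gb = -1
        results[name] = {
            "load_time_s": round(load_time, 1),
            "memory_gb": round(mem_gb, 1),
            "perplexity": round(ppl, 2),
        }
        print(f"  Load time: {load_time:.1f}s, Memory: {mem_gb:.1f} GB, PPL: {ppl:.2f}")
        # Free the model before loading the next configuration
        del model
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
    return results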
AWQ vs GPTQ comparison:
| Aspect | GPTQ | AWQ |
|---|---|---|
| Quantization time (7B) | 30–60 min | 10–15 min |
| Quality at INT4 | Good | Better on most benchmarks |
| Kernel support | Triton/CUDA | Optimized GEMM/GEMV |
| vLLM support | Yes | Yes |
| CPU/llama.cpp | Limited | Limited |
NF4 - NormalFloat4 (QLoRA)
Dettmers et al. (2023) introduced NF4 as part of QLoRA. NF4 is designed to be an information-theoretically optimal quantization data type for normally distributed weights.
Instead of linear spacing between quantization levels, NF4 uses quantile-based spacing. If weights are normally distributed, quantiles of the normal distribution space the levels optimally - more levels near zero where density is highest.
import torch
from typing import Tuple


def create_nf4_levels() -> torch.Tensor:
    """
    Compute 16 NF4-style quantization levels.
    These are quantiles of the standard normal distribution,
    normalized to [-1, 1].
    Note: this is a simplified, symmetric construction. The NF4 table used by
    bitsandbytes is built asymmetrically so that 0 is represented exactly.
    """
    from scipy.stats import norm
    # 16 levels for 4-bit (2^4 = 16 distinct values)
    # Spaced at quantiles of standard normal distribution
    quantiles = [(i + 0.5) / 16 for i in range(16)]
    levels = [norm.ppf(q) for q in quantiles]
    # Normalize to [-1, 1]
    abs_max = max(abs(l) for l in levels)
    levels = [l / abs_max for l in levels]
    return torch.tensor(levels, dtype=torch.float32)


def quantize_nf4(weight: torch.Tensor, block_size: int = 64) -> Tuple[torch.Tensor, torch.Tensor]:
    """
    Quantize weight tensor to NF4-style indices (block-wise absmax scaling,
    then nearest level). Returns (indices, per-block scales).
    Double quantization (quantizing the scale factors themselves) is not
    implemented here; scales are kept in FP32.
    """
    nf4_levels = create_nf4_levels()
    # Reshape to blocks
    weight_flat = weight.flatten()
    n_blocks = (weight_flat.numel() + block_size - 1) // block_size
    scales = []
    quantized_blocks = []
    for i in range(n_blocks):
        block = weight_flat[i * block_size : (i + 1) * block_size]
        # Normalize block to [-1, 1]
        abs_max = block.abs().max().item()
        if abs_max == 0:
            abs_max = 1.0
        scales.append(abs_max)
        normalized = block / abs_max
        # Find nearest NF4 level for each value
        distances = (normalized.unsqueeze(-1) - nf4_levels.unsqueeze(0)).abs()
        quant_indices = distances.argmin(dim=-1)
        quantized_blocks.append(quant_indices.to(torch.uint8))
    scales_tensor = torch.tensor(scales, dtype=torch.float32)
    quantized_tensor = torch.cat(quantized_blocks)
    return quantized_tensor, scales_tensor
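Double quantization is about the memory cost of the per-block scales rather than the 4-bit values themselves. The arithmetic can be sketched as below; the defaults (block size 64, FP32 scales, 8-bit quantized scales with a second-level block of 256) follow the description in the QLoRA paper, and the helper names are illustrative.

```python
def scale_overhead_bits(block_size: int = 64, scale_bits: int = 32) -> float:
    """Bits of scale metadata per weight with one scale per block."""
    return scale_bits / block_size


def double_quant_overhead_bits(
    block_size: int = 64,
    quantized_scale_bits: int = 8,
    second_block: int = 256,
    second_scale_bits: int = 32,
) -> float:
    """Overhead when first-level scales are themselves quantized in blocks."""
    first_level = quantized_scale_bits / block_size
    second_level = second_scale_bits / (block_size * second_block)
    return first_level + second_level


plain = scale_overhead_bits()
double = double_quant_overhead_bits()
print(f"FP32 scales:         {plain:.3f} bits/weight")
print(f"double quantization: {double:.3f} bits/weight")
print(f"saved on a 70B model: {70e9 * (plain - double) / 8 / 1e9:.2f} GB")
```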
GGUF - Quantization for CPU Inference
GGUF (llama.cpp format, successor to GGML) provides a range of quantization formats designed for CPU inference on consumer hardware:
| Format | Bits/weight | Quality | Notes |
|---|---|---|---|
| Q8_0 | 8 | Best | Near-lossless, still 2× smaller than FP16 |
| Q6_K | 6.14 | Excellent | Recommended if memory allows |
| Q5_K_M | 5.34 | Very good | Good balance quality/size |
| Q4_K_M | 4.58 | Good | Most popular for 7B models on consumer hardware |
| Q3_K_M | 3.35 | Acceptable | Noticeable quality loss |
| Q2_K | 2.63 | Poor | Emergency use only |
The "_K" suffix means "k-quant" - uses per-block quantization with mixed precision (some blocks quantized more aggressively, some less, based on importance).
# Using llama.cpp Python bindings (llama-cpp-python)
from llama_cpp import Llama


def load_gguf_model(model_path: str, n_gpu_layers: int = -1):
    """
    Load a GGUF quantized model for CPU/GPU hybrid inference.
    Args:
        model_path: Path to .gguf file
        n_gpu_layers: Number of layers to offload to GPU
                      (-1 = all layers on GPU if available)
    """
    model = Llama(
        model_path=model_path,
        n_gpu_layers=n_gpu_layers,  # Offload layers to GPU
        n_ctx=4096,                 # Context window
        n_batch=512,                # Batch size for prompt processing
        verbose=False
    )
    return model


def compare_gguf_formats():
    """
    Compare different GGUF quantization formats.
    Reports: file size and approximate perplexity.
    """
    # Example: LLaMA-3 8B in different GGUF formats
    # File sizes (GB) for 8B model:
    sizes = {
        "F16 (baseline)": 15.0,
        "Q8_0": 7.7,
        "Q6_K": 6.1,
        "Q5_K_M": 5.3,
        "Q4_K_M": 4.4,
        "Q3_K_M": 3.3,
        "Q2_K": 2.7,
    }
    # Approximate perplexity on WikiText-2 (lower = better)
    # These are representative numbers from llama.cpp benchmarks
    perplexity = {
        "F16 (baseline)": 6.45,
        "Q8_0": 6.46,
        "Q6_K": 6.47,
        "Q5_K_M": 6.50,
        "Q4_K_M": 6.56,
        "Q3_K_M": 6.79,
        "Q2_K": 7.84,
    }
    print(f"{'Format':<20} {'Size (GB)':>10} {'PPL':>8} {'PPL vs F16':>12}")
    print("-" * 55)
    baseline_ppl = perplexity["F16 (baseline)"]
    for fmt in sizes:
        ppl = perplexity[fmt]
        ppl_increase = (ppl - baseline_ppl) / baseline_ppl * 100
        print(f"{fmt:<20} {sizes[fmt]:>10.1f} {ppl:>8.2f} {ppl_increase:>11.1f}%")
Quality vs Size Trade-offs
Practical Quantization Guide
def choose_quantization_method(
    model_size_b: float,
    use_case: str,
    available_gpu_gb: float,
    latency_sensitive: bool,
    quality_critical: bool
) -> list:
    """
    Recommend quantization strategy based on requirements.
    Returns a list of candidate options, each a dict.
    Leaves ~30% of GPU memory as headroom for KV cache and activations.
    """
    fp16_gb = model_size_b * 2
    recommendations = []
    if fp16_gb <= available_gpu_gb * 0.7:
        recommendations.append({
            "method": "FP16/BF16",
            "memory_gb": fp16_gb,
            "quality": "Best",
            "setup": "No quantization needed",
            "command": "torch_dtype=torch.bfloat16"
        })
    if fp16_gb / 2 <= available_gpu_gb * 0.7:
        recommendations.append({
            "method": "INT8 (bitsandbytes)",
            "memory_gb": fp16_gb / 2,
            "quality": "Near-lossless",
            "setup": "load_in_8bit=True",
            "command": "load_in_8bit=True"
        })
    if fp16_gb / 4 <= available_gpu_gb * 0.7:
        if quality_critical:
            method = "AWQ INT4"
            cmd = "AutoAWQForCausalLM.from_quantized()"
        else:
            method = "GPTQ INT4"
            cmd = "AutoGPTQForCausalLM.from_quantized()"
        recommendations.append({
            "method": method,
            "memory_gb": fp16_gb / 4,
            "quality": "Good (1-5% degradation)",
            "setup": "Requires pre-quantization step",
            "command": cmd
        })
    if not recommendations or available_gpu_gb < fp16_gb / 4:
        recommendations.append({
            "method": "GGUF Q4_K_M (llama.cpp)",
            "memory_gb": model_size_b * 0.55,
            "quality": "Acceptable",
            "setup": "CPU inference or CPU+GPU hybrid",
            "command": "Llama(model_path='model.Q4_K_M.gguf')"
        })
    return recommendations


# Example decisions
print("LLaMA-3 70B (140 GB FP16) on 80 GB GPU:")
recs = choose_quantization_method(70, "chat", 80, True, False)
for r in recs:
    print(f"  → {r['method']}: {r['memory_gb']:.1f} GB - {r['quality']}")
print("\nLLaMA-3 8B (16 GB FP16) on 24 GB GPU:")
recs = choose_quantization_method(8, "chat", 24, True, True)
for r in recs:
    print(f"  → {r['method']}: {r['memory_gb']:.1f} GB - {r['quality']}")
Common Mistakes
:::danger Quantizing embedding and output layers Embedding tables and the language modeling head (logit projection) are particularly sensitive to quantization. They map discrete token indices to continuous vectors - small errors here propagate to every generated token. LLM.int8(), GPTQ, and AWQ all leave embeddings and the LM head in FP16/FP32 by default. Never quantize these layers to INT4 without extensive evaluation. :::
:::danger Using INT4 for medical, legal, or financial applications without evaluation INT4 quantization reduces model parameters to 16 discrete levels per weight group. On structured knowledge tasks (medical diagnosis, contract analysis, financial calculations), this precision loss can cause the model to confuse numbers, miss critical distinctions, or generate plausible but incorrect statements. Run comprehensive domain-specific benchmarks before deploying INT4 in high-stakes applications. Use INT8 as the minimum for these use cases. :::
:::warning Calibration dataset matters for GPTQ and AWQ GPTQ and AWQ both use a calibration dataset to compute quantization parameters. The calibration data should match your deployment distribution. Quantizing a coding model on news articles produces worse results than calibrating on code. Use 128–512 representative examples from your target domain. The default calibration data (often WikiText or C4) is fine for general-purpose models but suboptimal for domain-specific deployments. :::
:::warning Throughput vs latency trade-off of weight-only quantization GPTQ and AWQ are weight-only quantization: weights are stored in INT4 but computation happens in FP16 (weights are dequantized before the matmul). This reduces memory, so more requests fit in GPU memory simultaneously, increasing throughput. It does not reduce the arithmetic per token, so any single-request speedup comes only from reading fewer weight bytes and depends heavily on the kernel. LLM.int8() does run its non-outlier matmuls in INT8, but the mixed-precision decomposition adds overhead, and in practice it is typically slower than FP16. The confusion: "INT4 must be 4× faster" is wrong. Weight-only INT4 mainly improves throughput through better batching; per-token speed gains, if any, depend on the kernels and on how memory-bound the workload is. :::
Interview Questions
Q1: What is the "outlier problem" in LLM quantization and how does LLM.int8() solve it?
In transformer models larger than ~6.7B parameters, a small fraction (~0.1%) of activation values are orders of magnitude larger than the rest. These "outlier" values dominate the per-tensor scale used in naive INT8 quantization - the scale is set to accommodate the outlier range, causing all normal-magnitude values to collapse to just a few discrete levels, destroying information. LLM.int8() (Dettmers et al., 2022) identifies these outlier feature dimensions using a threshold (default 6.0) and processes them separately in FP16, while quantizing all non-outlier values to INT8. This "mixed-precision decomposition" gives near-lossless quality with ~50% memory reduction.
Q2: What is the difference between post-training quantization (PTQ) and quantization-aware training (QAT)?
PTQ quantizes an already-trained model without further training - fast but potentially lossy. GPTQ and AWQ are PTQ methods. QAT trains (or fine-tunes) the model with quantization simulated in the forward pass, so gradients flow through the quantization step and the model learns to be robust to it. QAT produces better quality than PTQ at the same bit width but requires compute proportional to training. For most LLM use cases, PTQ with calibration data (GPTQ, AWQ) achieves acceptable quality much faster than QAT.
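The "quantization simulated in the forward pass" idea can be sketched on a toy linear model with NumPy. This is a minimal illustration under stated assumptions: a fixed quantization step of 0.1, hand-written gradients instead of an autograd framework, and the straight-through estimator (the rounding step is treated as identity in the backward pass). The names and data are hypothetical.

```python
import numpy as np


def fake_quant(w: np.ndarray, scale: float = 0.1, n_levels: int = 7) -> np.ndarray:
    """Quantize-dequantize onto a symmetric INT4 grid with a fixed step."""
    return np.clip(np.round(w / scale), -n_levels, n_levels) * scale


rng = np.random.default_rng(0)
x = rng.normal(size=(256, 4))
w_true = np.array([0.3, -0.5, 0.2, 0.6])
target = x @ w_true


def loss_of(w_latent: np.ndarray) -> float:
    return float(np.mean((x @ fake_quant(w_latent) - target) ** 2))


w = np.zeros(4)  # latent full-precision weights kept during training
initial_loss = loss_of(w)
for _ in range(300):
    residual = x @ fake_quant(w) - target       # forward pass uses quantized weights
    grad_wq = 2.0 * x.T @ residual / len(x)     # gradient w.r.t. the quantized weights
    w -= 0.05 * grad_wq                         # STE: pass the gradient straight to w
final_loss = loss_of(w)

print(f"loss before: {initial_loss:.4f}  after: {final_loss:.6f}")
print("quantized weights:", fake_quant(w))
```

The latent weights stay in full precision; only the forward pass sees the quantized values, which is what lets training steer them toward grid points that work well.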
Q3: What is per-group quantization and why does it improve quality over per-tensor?
Per-group quantization computes a separate scale factor for every group of consecutive weights (typically $g = 64$ or $g = 128$). This allows the quantization to adapt to local weight distributions - a group of weights centered around 0.01 gets a different scale than a group centered around 5.0. Per-tensor uses one scale for the entire weight matrix, which is dominated by the maximum value and leaves small-valued weights with poor precision. Per-group adds memory overhead for the scale factors (16-bit floats, one per group) but significantly improves quality, especially at INT4.
Q4: When would you choose AWQ over GPTQ, and when would you choose the reverse?
AWQ generally achieves better quality at INT4 than GPTQ on most benchmarks (coding, reasoning, chat). AWQ also quantizes 3–6× faster. Choose AWQ as the default for INT4 deployment. Choose GPTQ when: (1) you need very specific bit-group combinations that AWQ does not support, (2) compatibility with specific inference frameworks (some still prefer GPTQ kernels), or (3) your task-specific benchmarks show GPTQ performing better (rare but happens for some math-heavy tasks). For CPU inference via llama.cpp, use GGUF formats (Q4_K_M) directly rather than GPTQ or AWQ.
Q5: A 70B model in FP16 is 140 GB. Why does INT4 quantization give you 35 GB, not 17.5 GB?
INT4 stores weights in 4 bits - half a byte - so 70B × 0.5 bytes = 35 GB for weights; 17.5 GB would correspond to 2 bits per weight. Even 35 GB is a slight underestimate for methods like GPTQ and AWQ. These use per-group quantization with group size 128, meaning every 128 weights share one 16-bit scale factor. That adds 16/128 = 0.125 bits per weight, and a per-group zero point (when used) adds a little more. Effective bits per weight is ~4.13–4.5 depending on group size, not exactly 4. The headline "35 GB" is the approximation; actual files are ~37–42 GB for 70B models with typical group sizes, partly because embeddings and the LM head usually stay in higher precision.
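The overhead arithmetic is easy to check in code. A minimal sketch, assuming one scale (and optionally one zero point) per group and counting only the quantized linear weights; the helper names are illustrative:

```python
def effective_bits_per_weight(
    bits: int = 4,
    group_size: int = 128,
    scale_bits: int = 16,
    zero_point_bits: int = 0,
) -> float:
    """Weight bits plus per-group metadata amortized over the group."""
    return bits + (scale_bits + zero_point_bits) / group_size


def quantized_weights_gb(n_params_billion: float, **kwargs) -> float:
    """Approximate size in GB (10^9 bytes) of the quantized weights."""
    return n_params_billion * effective_bits_per_weight(**kwargs) / 8


for g in (128, 64, 32):
    b = effective_bits_per_weight(group_size=g)
    print(f"group_size={g:>3}: {b:.3f} bits/weight, 70B -> {quantized_weights_gb(70, group_size=g):.1f} GB")
print("with 4-bit zero points, g=128:",
      effective_bits_per_weight(zero_point_bits=4), "bits/weight")
```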
Q6: How does NF4 differ from standard INT4, and why is it better for normally distributed weights?
Standard INT4 uses 16 linearly spaced quantization levels in a range $[x_{\min}, x_{\max}]$. NF4 (NormalFloat4, Dettmers et al. 2023) uses quantile-based spacing: the 16 levels are placed at the quantiles of the standard normal distribution. For normally distributed weights (which transformer weights approximately are), this means more quantization levels near zero (where density is highest) and fewer near the extremes. This is intended to reduce expected quantization error for normally distributed data compared to linear spacing. In practice, NF4 achieves lower perplexity than INT4 at the same compression ratio. Double quantization (quantizing the scale factors themselves) is a separate step that further reduces the memory overhead of the scales.
