What is LLM quantization?

Master LLM quantization techniques - from LLM.int8() to GPTQ and AWQ - to run large models on commodity hardware without unacceptable quality loss.

How does INT8 quantization work in practice?

Quantization: INT8 and INT4 covers LLM quantization, INT8 quantization, and INT4 quantization from first principles with code examples. Free lesson at https://engineersofai.com/docs/llms/llm-inference/quantization-int8-int4

Quantization: INT8 and INT4

The Production Scenario

It is mid-2023. Your team wants to run LLaMA-2 70B for internal tooling. You do the math: 70B parameters × 2 bytes per parameter (FP16) = 140 GB. Your company has one A100 80GB. You would need two of them, connected with NVLink, at $30,000 each. The cloud alternative is $10/hour for two A100s. For a tool used by 20 engineers a few times per day, this is absurd.

Then you discover INT4 quantization. You quantize the model to 4 bits per parameter. The model size drops from 140 GB to 35 GB. It fits on a single A100 80GB with room to spare, and output quality is nearly identical for the tasks your team uses it for. The weight footprint drops by 4×, and you need one GPU instead of two.

This is what quantization unlocks at the model serving level. It is not about theoretical elegance - it is about the practical difference between a model that requires $60K of hardware and one that runs on what you already have. Understanding the techniques, their quality-performance trade-offs, and their failure modes is one of the most practically valuable skills in LLM deployment.

The central tension in quantization: floating point numbers encode a real number as a sign bit, exponent, and mantissa. Integers just encode a fixed-range integer value. You are replacing rich, expressive encodings with coarser ones. The trick is doing this in a way that preserves model behavior. Some methods fail badly. Others achieve remarkable fidelity. The difference is in the details.

Why This Exists: The Memory Problem

Model Size Math

Model	FP32	FP16/BF16	INT8	INT4
7B	28 GB	14 GB	7 GB	3.5 GB
13B	52 GB	26 GB	13 GB	6.5 GB
34B	136 GB	68 GB	34 GB	17 GB
70B	280 GB	140 GB	70 GB	35 GB
180B	720 GB	360 GB	180 GB	90 GB

The transition from FP16 to INT4 is a 4× reduction. For a 70B model, that is the difference between needing two A100 80GB GPUs and needing one - that is the practical impact.
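The table entries follow from one line of arithmetic: parameters × bits per parameter ÷ 8. A minimal sketch of that calculation is below; the helper names `model_size_gb` and `gpus_needed` are illustrative, and the estimate covers weights only (no KV cache or activation memory).

```python
import math


def model_size_gb(params_billion: float, bits_per_param: float) -> float:
    """Weight memory in GB (1 GB = 1e9 bytes), weights only."""
    return params_billion * bits_per_param / 8


def gpus_needed(size_gb: float, gpu_gb: float = 80.0) -> int:
    """Minimum GPU count by capacity alone, with no headroom for KV cache."""
    return math.ceil(size_gb / gpu_gb)


for bits, label in [(32, "FP32"), (16, "FP16"), (8, "INT8"), (4, "INT4")]:
    size = model_size_gb(70, bits)
    print(f"70B {label}: {size:.1f} GB -> {gpus_needed(size)} x 80 GB GPU(s)")
```

In a real deployment the usable capacity per GPU is lower than 80 GB, so treat the GPU count as a lower bound.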

But quantization is not free. You are trading precision for efficiency. The question is: how much precision can you lose before model quality degrades unacceptably? The answer depends critically on which weights you quantize, how you compute the quantization mapping, and whether you calibrate on representative data.

Historical Context

Quantization has been used in neural networks since the late 1990s for edge deployment (mobile, embedded systems). For LLMs specifically, the story begins around 2022 when models became too large for single-GPU inference:

LLM.int8() (Dettmers et al., 2022): First practical INT8 quantization for LLMs at scale. Discovered the "outlier" problem - a small fraction of extreme-magnitude activations that break naive quantization - and solved it with mixed-precision decomposition.
GPTQ (Frantar et al., 2022): Weight-only INT4 quantization using second-order information. Achieved state-of-the-art INT4 quality by minimizing layer-wise quantization error.
AWQ (Lin et al., 2023): Activation-aware Weight Quantization. Identified that only ~1% of weights are "salient" (determined by activation magnitude), and that protecting these weights with per-channel scaling before quantization preserves INT4 quality, often better than GPTQ.
GGUF (Georgi Gerganov, 2023): Quantization format in llama.cpp enabling CPU inference with mix of precision levels (Q4_K_M, Q5_K_M, Q8_0).

Quantization Fundamentals

The Quantization Mapping

Quantization maps a range of floating-point values to a set of integer values:

$Q(x) = \text{round}\left(\frac{x}{s}\right) + z$

Where:

$s$ is the scale factor (float) - size of each quantization step
$z$ is the zero point (integer) - the integer value representing 0.0 in the float range

Dequantization reconstructs the float:

$\hat{x} = s \times (Q(x) - z)$

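The reconstruction error is called quantization error: $\epsilon = x - \hat{x}$, bounded by $|s|/2$ per element for values inside the representable range (values outside it are clipped and can incur larger error).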
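To make the mapping concrete, here is a small pure-Python sketch of the quantize and dequantize formulas with an arbitrary, illustrative scale and zero point (`s = 0.05`, `z = 10` are assumptions chosen for readability, not values from any library). It skips clipping, so every input is assumed to lie inside the representable range.

```python
def quantize(x: float, s: float, z: int) -> int:
    """Q(x) = round(x / s) + z"""
    return round(x / s) + z


def dequantize(q: int, s: float, z: int) -> float:
    """x_hat = s * (Q(x) - z)"""
    return s * (q - z)


s, z = 0.05, 10  # illustrative scale and zero point
values = [-0.5, -0.12, 0.0, 0.33, 0.49]

errors = []
for x in values:
    q = quantize(x, s, z)
    x_hat = dequantize(q, s, z)
    errors.append(abs(x - x_hat))
    print(f"x={x:+.2f} -> q={q:2d} -> x_hat={x_hat:+.2f}")

# Each reconstruction error stays within half a quantization step
print(f"max error {max(errors):.3f} <= s/2 = {s / 2:.3f}")
```

Note that 0.0 maps to the integer `z` and back to exactly 0.0, which is the purpose of the zero point.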

Symmetric vs Asymmetric Quantization

Symmetric: Zero point $z = 0$. The quantized range is centered at zero.

$s = \frac{\max(|x|)}{2^{b-1} - 1}$

For INT8: range is [-127, 127] (excluding -128 to avoid overflow issues). Simpler hardware implementation but wastes range for asymmetric distributions.

Asymmetric: Zero point is non-zero. Can cover any range $[x_{\min}, x_{\max}]$.

$s = \frac{x_{\max} - x_{\min}}{2^b - 1}, \quad z = -\text{round}(x_{\min} / s)$

For INT8: range is [0, 255] or [-128, 127]. Better for ReLU outputs (non-negative) or weights with asymmetric distributions.
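The difference is easiest to see on data that is not centered on zero. The sketch below applies both formulas at 4 bits to a handful of positive values and compares the worst-case reconstruction error. The function names and the sample values are illustrative assumptions, and the code assumes the inputs are not all identical (otherwise the asymmetric scale would be zero).

```python
def symmetric_roundtrip(values, bits):
    """Quantize/dequantize with z = 0 and s = max|x| / (2^(b-1) - 1)."""
    qmax = 2 ** (bits - 1) - 1
    s = max(abs(v) for v in values) / qmax
    return [s * max(-qmax, min(qmax, round(v / s))) for v in values]


def asymmetric_roundtrip(values, bits):
    """Quantize/dequantize with s = (max - min) / (2^b - 1), z = -round(min / s)."""
    levels = 2 ** bits - 1
    x_min, x_max = min(values), max(values)
    s = (x_max - x_min) / levels
    z = -round(x_min / s)
    out = []
    for v in values:
        q = max(0, min(levels, round(v / s) + z))  # clamp to [0, 2^b - 1]
        out.append(s * (q - z))
    return out


def max_abs_error(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


values = [0.5, 0.9, 1.5, 2.0, 2.5]  # all positive, far from centered on zero
sym_err = max_abs_error(values, symmetric_roundtrip(values, bits=4))
asym_err = max_abs_error(values, asymmetric_roundtrip(values, bits=4))
print(f"symmetric max error:  {sym_err:.4f}")
print(f"asymmetric max error: {asym_err:.4f}")
```

The symmetric version spends half of its integer range on negative values that never occur, so its step size is larger and its error is higher on this input.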

Granularity

The granularity of quantization parameters (scale, zero point) determines quality:

Granularity	Description	Quality	Overhead
Per-tensor	One scale per weight matrix	Lowest	Negligible
Per-channel	One scale per output channel	Medium	Small
Per-token	One scale per token (for activations)	High	Moderate
Per-group	One scale per group of N weights	Highest	Largest

Per-group quantization (group size 64 or 128) is used in GPTQ and AWQ - it balances quality and memory overhead.
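As a rough illustration of why granularity matters, the sketch below quantizes a toy weight row to 4 bits twice: once with a single scale and once with one scale per group. The group size of 4 and the weight values are assumptions chosen to keep the example small; real per-group schemes use 64 or 128 weights per group.

```python
def roundtrip(values, bits=4):
    """Symmetric quantize + dequantize with a single scale for `values`."""
    qmax = 2 ** (bits - 1) - 1
    abs_max = max(abs(v) for v in values)
    s = abs_max / qmax if abs_max > 0 else 1.0
    return [s * max(-qmax, min(qmax, round(v / s))) for v in values]


def roundtrip_per_group(values, group_size, bits=4):
    """Apply `roundtrip` independently to each consecutive group."""
    out = []
    for start in range(0, len(values), group_size):
        out.extend(roundtrip(values[start:start + group_size], bits))
    return out


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# First four weights are tiny, last four are large
weights = [0.012, -0.02, 0.015, 0.005, 5.0, -3.0, 4.0, 2.0]

mse_tensor = mse(weights, roundtrip(weights))
mse_group = mse(weights, roundtrip_per_group(weights, group_size=4))
print(f"one scale for all weights: MSE = {mse_tensor:.6f}")
print(f"one scale per group of 4:  MSE = {mse_group:.6f}")
```

With a single scale, the large weights set the step size and every small weight rounds to zero. Per-group scaling recovers them, at the cost of storing one extra scale per group.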

import torch
from typing import Tuple


def quantize_symmetric(
    x: torch.Tensor,
    bits: int = 8,
    per_channel: bool = False
) -> Tuple[torch.Tensor, torch.Tensor]:
    """
    Symmetric quantization.

    Returns:
        quantized: INT tensor
        scale: Float scale factors
    """
    n_levels = 2 ** (bits - 1) - 1  # 127 for INT8, 7 for INT4

    if per_channel:
        # Per-channel: one scale per output channel (row of a 2-D [out, in] weight),
        # so reduce over the input dimension (dim=1)
        abs_max = x.abs().max(dim=1, keepdim=True).values
    else:
        abs_max = x.abs().max()

    scale = abs_max / n_levels
    scale = scale.clamp(min=1e-8)

    quantized = (x / scale).round().clamp(-n_levels, n_levels)

    # INT8 and INT4 values are both stored in int8 here: PyTorch has no native
    # int4 dtype, so real INT4 kernels pack two 4-bit values into each byte.
    quantized = quantized.to(torch.int8)

    return quantized, scale


def dequantize_symmetric(
    quantized: torch.Tensor,
    scale: torch.Tensor
) -> torch.Tensor:
    return quantized.float() * scale


def measure_quantization_error(
    weights: torch.Tensor,
    bits: int = 8
) -> dict:
    """Measure quantization error for a weight tensor."""
    quant, scale = quantize_symmetric(weights, bits=bits)
    reconstructed = dequantize_symmetric(quant, scale)

    error = weights - reconstructed
    relative_error = error.abs() / (weights.abs() + 1e-8)

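    # Ideal packed size: bits / 8 bytes per weight (INT4 packs two values per byte).
    # Scale factors are ignored; they are negligible for per-tensor scaling.
    original_bytes = weights.numel() * weights.element_size()
    quantized_bytes = weights.numel() * bits / 8

    return {
        "mse": error.pow(2).mean().item(),
        "max_error": error.abs().max().item(),
        "mean_relative_error": relative_error.mean().item(),
        "original_size_bytes": original_bytes,
        "quantized_size_bytes": quantized_bytes,
        "compression_ratio": original_bytes / quantized_bytes
    }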

The Outlier Problem in LLMs

Why Naive Quantization Fails for LLMs

Dettmers et al. (2022) made a crucial empirical discovery: in transformer models with more than 6.7B parameters, systematic outlier features emerge in the hidden states (activations).

These outliers are:

Extreme in magnitude: up to 1000× larger than the average activation value
Systematic: they always appear in the same feature dimensions (columns of the activation matrix)
Sparse: only about 0.1% of all activation values are outliers

The problem: if you compute a per-tensor scale based on max(|activations|), the scale is dominated by the outliers. All normal-magnitude values get quantized to the same few integer values, destroying almost all information in the non-outlier features.

def demonstrate_outlier_problem():
    """
    Show how outliers break naive INT8 quantization of activations.
    """
    torch.manual_seed(42)

    # Simulate activation tensor with systematic outliers
    # Shape: [batch, seq_len, hidden_dim]
    batch, seq, hidden = 1, 32, 4096
    activations = torch.randn(batch, seq, hidden) * 0.5  # Normal range

    # Insert outliers in specific feature dimensions (mimicking real LLMs)
    outlier_dims = [42, 314, 1024, 2048, 3200]
    for dim in outlier_dims:
        activations[:, :, dim] *= 200.0  # 200x larger

    # Compute per-tensor scale
    abs_max = activations.abs().max()
    scale_pertensor = abs_max / 127.0
    print(f"Max activation: {abs_max:.2f}")
    print(f"Per-tensor scale: {scale_pertensor:.4f}")

    # Quantize with per-tensor scale
    quant_coarse = (activations / scale_pertensor).round().clamp(-127, 127).to(torch.int8)
    recon_coarse = quant_coarse.float() * scale_pertensor

    # Only care about non-outlier values
    non_outlier_mask = torch.ones(hidden, dtype=torch.bool)
    for dim in outlier_dims:
        non_outlier_mask[dim] = False

    normal_error = (activations[:, :, non_outlier_mask] -
                    recon_coarse[:, :, non_outlier_mask]).abs().mean()
    print(f"\nPer-tensor INT8 error on non-outlier dims: {normal_error:.4f}")

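    # Compare: per-token scale (one scale per token).
    # The outlier dimensions appear in every token, so each per-token scale is
    # still set by an outlier and the gain on normal dimensions is modest.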
    abs_max_per_token = activations.abs().max(dim=-1, keepdim=True).values
    scale_per_token = abs_max_per_token / 127.0
    quant_fine = (activations / scale_per_token).round().clamp(-127, 127).to(torch.int8)
    recon_fine = quant_fine.float() * scale_per_token

    normal_error_fine = (activations[:, :, non_outlier_mask] -
                         recon_fine[:, :, non_outlier_mask]).abs().mean()
    print(f"Per-token INT8 error on non-outlier dims: {normal_error_fine:.4f}")
    print(f"\nOutliers exist in {len(outlier_dims)} of {hidden} dimensions ({len(outlier_dims)/hidden:.2%})")
    print("Yet they dominate the per-tensor scale and destroy quantization of all other dims.")

LLM.int8() - Mixed Precision Decomposition

Dettmers et al. (2022) solved the outlier problem with a decomposition approach:

Identify outlier dimensions: feature dimensions where any activation exceeds a threshold (default: 6.0)
Process outliers in FP16: multiply the outlier columns of the activation matrix with the corresponding columns of the weight matrix (rows of $W^T$) in full FP16
Process non-outliers in INT8: quantize remaining weights and activations to INT8, multiply in INT8 (faster, lower memory)
Combine results: add the FP16 and INT8 partial products

$Y = X_{\text{outlier}} W_{\text{outlier}}^T + X_{\text{non-outlier}} W_{\text{non-outlier}}^T$

The key insight: only ~0.1% of values are outliers, so ~99.9% of the computation uses INT8. The FP16 path handles the few critical outliers without degrading quality.
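The decomposition can be sketched in pure Python on a tiny example. This is an illustration of the idea under simplifying assumptions, not the bitsandbytes implementation: matrices are plain lists, `mixed_int8_matmul` and the other names are hypothetical, and "FP16" is stood in for by ordinary Python floats. Passing an infinite threshold disables the outlier path, which gives a naive all-INT8 baseline for comparison.

```python
def quantize_rows(mat, cols):
    """Symmetric INT8 quantization with one scale per row, restricted to `cols`."""
    q_rows, scales = [], []
    for row in mat:
        abs_max = max((abs(row[j]) for j in cols), default=0.0)
        scale = abs_max / 127 if abs_max > 0 else 1.0
        q_rows.append([round(row[j] / scale) for j in cols])
        scales.append(scale)
    return q_rows, scales


def mixed_int8_matmul(X, W, threshold=6.0):
    """Y = X @ W^T with outlier feature dimensions kept in floating point."""
    hidden = len(X[0])
    outlier = [j for j in range(hidden) if any(abs(row[j]) >= threshold for row in X)]
    normal = [j for j in range(hidden) if j not in outlier]
    Xq, sx = quantize_rows(X, normal)  # one scale per token (row of X)
    Wq, sw = quantize_rows(W, normal)  # one scale per output channel (row of W)
    Y = []
    for t, x_row in enumerate(X):
        y_row = []
        for o, w_row in enumerate(W):
            fp_part = sum(x_row[j] * w_row[j] for j in outlier)  # float path
            int_acc = sum(a * b for a, b in zip(Xq[t], Wq[o]))   # integer accumulate
            y_row.append(fp_part + int_acc * sx[t] * sw[o])      # dequantize and add
        Y.append(y_row)
    return Y


def exact_matmul(X, W):
    return [[sum(a * b for a, b in zip(x, w)) for w in W] for x in X]


def max_abs_diff(A, B):
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))


# Feature dimension 2 is a systematic outlier in both tokens
X = [[0.5, -0.3, 40.0, 0.2],
     [0.1, 0.4, -35.0, -0.6]]
W = [[0.2, -0.1, 0.05, 0.3],
     [-0.4, 0.25, 0.1, -0.15]]

reference = exact_matmul(X, W)
err_mixed = max_abs_diff(reference, mixed_int8_matmul(X, W, threshold=6.0))
err_naive = max_abs_diff(reference, mixed_int8_matmul(X, W, threshold=float("inf")))
print(f"max error, mixed precision: {err_mixed:.5f}")
print(f"max error, naive INT8:      {err_naive:.5f}")
```

Removing the single outlier column from the INT8 path lets the per-row scales shrink to fit the normal values, which is where the accuracy gain comes from.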

# Using bitsandbytes for LLM.int8()
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

def load_model_int8(model_name: str):
    """
    Load model in INT8 with LLM.int8() mixed precision decomposition.
    Requires: pip install bitsandbytes transformers accelerate
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        # LLM.int8() - requires bitsandbytes. Recent transformers releases expect
        # a BitsAndBytesConfig rather than a bare load_in_8bit=True argument.
        quantization_config=BitsAndBytesConfig(
            load_in_8bit=True,
            llm_int8_threshold=6.0,  # outlier threshold (library default)
        ),
        device_map="auto"
    )

    # Model is now in mixed precision:
    # - Linear layers: mostly INT8 weights, mixed FP16/INT8 compute
    # - LayerNorm, embeddings: remain in FP16

    print("Model loaded in INT8")
    print_model_memory(model)
    return model, tokenizer


def print_model_memory(model):
    """Print memory usage broken down by dtype."""
    from collections import defaultdict
    dtype_bytes = defaultdict(int)

    for name, param in model.named_parameters():
        dtype_bytes[str(param.dtype)] += param.numel() * param.element_size()

    total = sum(dtype_bytes.values())
    print(f"\nMemory by dtype:")
    for dtype, bytes_count in sorted(dtype_bytes.items()):
        print(f"  {dtype}: {bytes_count/1e9:.2f} GB ({bytes_count/total:.1%})")
    print(f"  Total: {total/1e9:.2f} GB")

LLM.int8() characteristics:

Memory reduction: ~50% vs FP16 (slightly less than the full 2× because LayerNorm, embeddings, and the LM head stay in FP16)
Throughput: ~20% slower than FP16 on A100 (the mixed-precision decomposition adds overhead)
Quality: near-lossless; Dettmers et al. (2022) report no measurable perplexity degradation relative to FP16
Best use case: fitting large models on fewer GPUs when speed is secondary to correctness

GPTQ - Weight-Only INT4 Quantization

GPTQ (Frantar et al., 2022) achieves 4× memory reduction by quantizing weights to INT4 using second-order optimization.

The Core Idea

Instead of simply rounding weights to the nearest INT4 value, GPTQ compensates for the quantization error of each weight by adjusting the remaining unquantized weights in the same layer:

Process weights in columns (one column at a time)
For each weight $w_{ij}$ : quantize to INT4, compute the error $\delta = w_{ij} - Q(w_{ij})$
Compensate: adjust the remaining unquantized weights in row $i$ (column set $F$, with $H_F$ the Hessian restricted to $F$) to counteract the error using the inverse Hessian

$w_{i,F}^{\text{new}} = w_{i,F}^{\text{old}} - \frac{\delta}{[H_F^{-1}]_{jj}} \, [H_F^{-1}]_{j,F}$

Where $H$ is the Hessian of the layer's output error with respect to weights, computed on a calibration dataset.
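The quantize-then-compensate loop can be sketched for a single weight row in pure Python. This follows the simpler one-weight-at-a-time formulation rather than the blocked, Cholesky-based procedure that GPTQ uses in practice, and everything here is an illustrative assumption: the function names, the 2×2 Hessian, and its hand-computed inverse.

```python
def quantize_to_grid(w: float, scale: float, qmax: int) -> float:
    """Round to the nearest point on a symmetric integer grid, then dequantize."""
    return scale * max(-qmax, min(qmax, round(w / scale)))


def gptq_quantize_row(weights, h_inv, scale, qmax=7):
    """
    Quantize one weight row left to right with error compensation.
    `h_inv` is the inverse Hessian as a symmetric list of lists.
    """
    w = list(weights)
    h = [row[:] for row in h_inv]  # working copy, shrunk as columns are fixed
    n = len(w)
    quantized = []
    for j in range(n):
        q = quantize_to_grid(w[j], scale, qmax)
        quantized.append(q)
        err = (w[j] - q) / h[j][j]
        # Push the error onto the not-yet-quantized weights k > j
        for k in range(j + 1, n):
            w[k] -= err * h[j][k]
        # Remove column j from the inverse Hessian (one elimination step)
        pivot, row_j = h[j][j], h[j][:]
        for a in range(j + 1, n):
            for b in range(j + 1, n):
                h[a][b] -= row_j[a] * row_j[b] / pivot
    return quantized


def output_error(w, q, H):
    """Squared output error e^T H e, where e = w - q."""
    e = [a - b for a, b in zip(w, q)]
    return sum(e[i] * H[i][j] * e[j] for i in range(len(e)) for j in range(len(e)))


H = [[2.0, 1.0], [1.0, 2.0]]              # toy Hessian: correlated inputs
H_inv = [[2 / 3, -1 / 3], [-1 / 3, 2 / 3]]  # its inverse, computed by hand
weights = [0.4, 0.4]

naive = [quantize_to_grid(w, scale=1.0, qmax=7) for w in weights]
compensated = gptq_quantize_row(weights, H_inv, scale=1.0)
print("round-to-nearest:", naive, "error:", output_error(weights, naive, H))
print("compensated:     ", compensated, "error:", output_error(weights, compensated, H))
```

Rounding the first weight down leaves a deficit; because the two inputs are correlated, the update nudges the second weight up far enough that it rounds to 1 instead of 0, and the layer output error drops.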

# Using AutoGPTQ for GPTQ quantization
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

def quantize_with_gptq(
    model_name: str,
    output_dir: str,
    bits: int = 4,
    group_size: int = 128,
    desc_act: bool = False
):
    """
    Quantize a model using GPTQ.
    Requires: pip install auto-gptq optimum

    Args:
        model_name: HuggingFace model ID or local path
        output_dir: Where to save the quantized model
        bits: Quantization bits (2, 3, 4, or 8)
        group_size: Group size for per-group quantization
        desc_act: Use activation order (slower but better quality)
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

    # Calibration data - the model is quantized to minimize error on these examples
    calibration_data = [
        tokenizer(
            "Auto-regressive language models learn to predict the next token.",
            return_tensors="pt"
        ),
        # Add 128-512 representative examples for best quality
    ]

    quantize_config = BaseQuantizeConfig(
        bits=bits,
        group_size=group_size,
        desc_act=desc_act,
    )

    model = AutoGPTQForCausalLM.from_pretrained(
        model_name,
        quantize_config=quantize_config
    )

    # This step does the actual quantization - takes 30-120 minutes for 70B
    model.quantize(calibration_data)

    # Save quantized model
    model.save_quantized(output_dir, use_safetensors=True)
    tokenizer.save_pretrained(output_dir)
    print(f"Quantized model saved to {output_dir}")


def load_gptq_model(model_dir: str):
    """Load a previously quantized GPTQ model."""
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoGPTQForCausalLM.from_quantized(
        model_dir,
        device_map="auto",
        use_triton=True  # Triton kernels for faster INT4 matmul
    )
    return model, tokenizer

GPTQ characteristics:

Memory: just under 4× reduction from FP16 at INT4 (with group_size=128, scales and zero points add roughly 0.13–0.16 bits/weight)
Quality: perplexity within 5–15% of FP16 depending on task and model size
Quantization time: 30–120 minutes for 70B model (one-time cost)
Inference speed: similar to FP16 (weight-only quantization - compute in FP16, weights stored in INT4)
Best for: offline model serving where you do the quantization once and serve indefinitely

AWQ - Activation-Aware Weight Quantization

AWQ (Lin et al., 2023) improves on GPTQ with a key insight: not all weights are equally important. Weights connected to large-magnitude activations have outsized impact on model output.

The Salient Weight Observation

By analyzing activation statistics over a calibration dataset, AWQ identifies "salient" weights - those connected to feature dimensions with large activation magnitudes. These salient weights represent about 1% of all weights but disproportionately affect model output.

AWQ's approach:

Identify salient weight channels using activation statistics
Scale up these weights before quantization (making them larger, reducing relative error)
Scale down the corresponding input activations to keep the product unchanged

In full precision this rescaling leaves the layer output unchanged, since $(W \cdot s)(x / s) = Wx$. After quantization it reduces the relative error on the salient channels. The per-channel scales are chosen based on activation statistics rather than weight statistics alone.
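A toy example of the scale-up, scale-down trick for one output neuron is sketched below. The scale of 2.0 on the salient channel is hard-coded as an assumption; AWQ itself searches for scales using activation statistics. The function names and numbers are illustrative only.

```python
def quantize_dequantize(values, bits=4):
    """Symmetric round-to-nearest with one scale for the whole group."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax
    return [scale * max(-qmax, min(qmax, round(v / scale))) for v in values]


def quantized_output(weights, activations, channel_scales):
    """Compute sum_i Q(w_i * s_i) * (x_i / s_i) for one output neuron."""
    scaled_w = [w * s for w, s in zip(weights, channel_scales)]
    w_hat = quantize_dequantize(scaled_w)
    return sum(wq * (x / s) for wq, x, s in zip(w_hat, activations, channel_scales))


weights = [1.4, 7.0, -4.6, 2.2]
activations = [10.0, 0.1, 0.1, 0.1]  # channel 0 carries a large activation

exact = sum(w * x for w, x in zip(weights, activations))
plain = quantized_output(weights, activations, [1.0, 1.0, 1.0, 1.0])
awq_like = quantized_output(weights, activations, [2.0, 1.0, 1.0, 1.0])

print(f"exact output:           {exact:.2f}")
print(f"plain INT4 error:       {abs(exact - plain):.2f}")
print(f"scaled-channel error:   {abs(exact - awq_like):.2f}")
```

The trick works here because scaling channel 0 does not change the group maximum, so the quantization step stays the same while the rounding error on the salient weight is divided by its scale.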

# Using AutoAWQ
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

def quantize_with_awq(
    model_name: str,
    output_dir: str,
    bits: int = 4,
    group_size: int = 128,
    zero_point: bool = True
):
    """
    Quantize a model with AWQ (Activation-aware Weight Quantization).
    Requires: pip install autoawq

    AWQ is faster to quantize than GPTQ (~10-15 min for 7B vs 30+ min)
    and generally achieves better quality at INT4.
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    model = AutoAWQForCausalLM.from_pretrained(
        model_name,
        low_cpu_mem_usage=True,
        use_cache=False
    )

    quant_config = {
        "zero_point": zero_point,
        "q_group_size": group_size,
        "w_bit": bits,
        "version": "GEMM"  # GEMM kernel (faster) vs GEMV (lower memory)
    }

    # AWQ searches for optimal scales using ~128 calibration samples
    # Takes ~10-15 minutes for 7B model, ~45 min for 70B
    model.quantize(tokenizer, quant_config=quant_config)

    model.save_quantized(output_dir)
    tokenizer.save_pretrained(output_dir)
    print(f"AWQ quantized model saved to {output_dir}")


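def benchmark_quantization_methods(model_name: str = "meta-llama/Meta-Llama-3-8B"):
    """
    Compare load time, peak GPU memory, and approximate quality across
    quantization methods. Quality proxy: perplexity on a WikiText-2 sample.
    """
    import gc
    import math
    import time
    import torch
    from datasets import load_dataset
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
    test_texts = [t for t in dataset["text"] if len(t.strip()) > 100][:50]
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    results = {}

    # Configurations to test
    configs = [
        ("FP16", {"torch_dtype": torch.float16}),
        ("INT8 (LLM.int8())", {"quantization_config": BitsAndBytesConfig(load_in_8bit=True)}),
        ("INT4 NF4 (QLoRA)", {"quantization_config": BitsAndBytesConfig(
            load_in_4bit=True, bnb_4bit_quant_type="nf4")}),
    ]

    for name, kwargs in configs:
        print(f"\nLoading {name}...")
        if torch.cuda.is_available():
            torch.cuda.reset_peak_memory_stats()  # measure each config separately
        t0 = time.time()
        model = AutoModelForCausalLM.from_pretrained(
            model_name,
            device_map="auto",
            **kwargs
        )
        load_time = time.time() - t0

        # Approximate perplexity: exp of the mean per-text loss (not token-weighted)
        losses = []
        for text in test_texts:
            enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
            enc = enc.to(model.device)
            with torch.no_grad():
                losses.append(model(**enc, labels=enc["input_ids"]).loss.item())
        ppl = math.exp(sum(losses) / len(losses))

        # Peak allocated GPU memory since the reset above
        if torch.cuda.is_available():
            mem_gb = torch.cuda.max_memory_allocated() / 1e9
        else:
            mem_gb = -1

        results[name] = {
            "load_time_s": round(load_time, 1),
            "memory_gb": round(mem_gb, 1),
            "perplexity": round(ppl, 2),
        }

        print(f"  Load time: {load_time:.1f}s, Memory: {mem_gb:.1f} GB, PPL: {ppl:.2f}")

        # Free the model before loading the next configuration
        del model
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()

    return results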

AWQ vs GPTQ comparison:

Aspect	GPTQ	AWQ
Quantization time (7B)	30–60 min	10–15 min
Quality at INT4	Good	Better on most benchmarks
Kernel support	Triton/CUDA	Optimized GEMM/GEMV
vLLM support	Yes	Yes
CPU/llama.cpp	Limited	Limited

NF4 - NormalFloat4 (QLoRA)

Dettmers et al. (2023) introduced NF4 as part of QLoRA. NF4 is an information-theoretically optimal quantization format for normally distributed weights:

Instead of linear spacing between quantization levels, NF4 uses quantile-based spacing. If weights are normally distributed, quantiles of the normal distribution space the levels optimally - more levels near zero where density is highest.

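def create_nf4_levels() -> torch.Tensor:
    """
    Approximate the 16 NF4 quantization levels.
    These are evenly spaced quantiles of the standard normal distribution,
    normalized to [-1, 1]. Note: the actual NF4 table used by QLoRA is built
    asymmetrically so that 0.0 is exactly representable; this symmetric
    version has no exact zero and is for illustration only.
    """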
    from scipy.stats import norm

    # 16 levels for 4-bit (2^4 = 16 distinct values)
    # Spaced at quantiles of standard normal distribution
    quantiles = [(i + 0.5) / 16 for i in range(16)]
    levels = [norm.ppf(q) for q in quantiles]

    # Normalize to [-1, 1]
    abs_max = max(abs(l) for l in levels)
    levels = [l / abs_max for l in levels]

    return torch.tensor(levels, dtype=torch.float32)


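def quantize_nf4(weight: torch.Tensor, block_size: int = 64) -> Tuple[torch.Tensor, torch.Tensor]:
    """
    Quantize weight tensor to NF4-style level indices, one absmax scale per block.
    Double quantization (quantizing the scale factors themselves, as in QLoRA)
    is not implemented in this sketch; scales are returned as float32.
    """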
    nf4_levels = create_nf4_levels()
    orig_shape = weight.shape

    # Reshape to blocks
    weight_flat = weight.flatten()
    n_blocks = (weight_flat.numel() + block_size - 1) // block_size

    scales = []
    quantized_blocks = []

    for i in range(n_blocks):
        block = weight_flat[i * block_size : (i + 1) * block_size]

        # Normalize block to [-1, 1]
        abs_max = block.abs().max().item()
        if abs_max == 0:
            abs_max = 1.0
        scales.append(abs_max)
        normalized = block / abs_max

        # Find nearest NF4 level for each value
        distances = (normalized.unsqueeze(-1) - nf4_levels.unsqueeze(0)).abs()
        quant_indices = distances.argmin(dim=-1)
        quantized_blocks.append(quant_indices.to(torch.uint8))

    scales_tensor = torch.tensor(scales, dtype=torch.float32)
    quantized_tensor = torch.cat(quantized_blocks)

    return quantized_tensor, scales_tensor

GGUF - Quantization for CPU Inference

GGUF (llama.cpp format, successor to GGML) provides a range of quantization formats designed for CPU inference on consumer hardware:

Format	Bits/weight	Quality	Notes
Q8_0	8	Best	Near-lossless, still 2× smaller than FP16
Q6_K	6.14	Excellent	Recommended if memory allows
Q5_K_M	5.34	Very good	Good balance quality/size
Q4_K_M	4.58	Good	Most popular for 7B models on consumer hardware
Q3_K_M	3.35	Acceptable	Noticeable quality loss
Q2_K	2.63	Poor	Emergency use only

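The "_K" suffix means "k-quant": weights are quantized in blocks grouped into super-blocks, with the block scales themselves quantized. The trailing _S/_M/_L selects a mix of precisions, keeping the tensors that matter more for quality at higher bit widths.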

# Using llama.cpp Python bindings (llama-cpp-python)
from llama_cpp import Llama

def load_gguf_model(model_path: str, n_gpu_layers: int = -1):
    """
    Load a GGUF quantized model for CPU/GPU hybrid inference.

    Args:
        model_path: Path to .gguf file
        n_gpu_layers: Number of layers to offload to GPU
                     (-1 = all layers on GPU if available)
    """
    model = Llama(
        model_path=model_path,
        n_gpu_layers=n_gpu_layers,  # Offload layers to GPU
        n_ctx=4096,                 # Context window
        n_batch=512,                # Batch size for prompt processing
        verbose=False
    )
    return model


def compare_gguf_formats():
    """
    Print a comparison table of GGUF quantization formats.
    Uses hard-coded, representative file sizes and perplexities;
    no benchmark is actually run here.
    """
    # Example: LLaMA-3 8B in different GGUF formats
    # File sizes for 8B model:
    sizes = {
        "F16 (baseline)": 15.0,
        "Q8_0": 7.7,
        "Q6_K": 6.1,
        "Q5_K_M": 5.3,
        "Q4_K_M": 4.4,
        "Q3_K_M": 3.3,
        "Q2_K": 2.7,
    }

    # Approximate perplexity on WikiText-2 (lower = better)
    # These are representative numbers from llama.cpp benchmarks
    perplexity = {
        "F16 (baseline)": 6.45,
        "Q8_0": 6.46,
        "Q6_K": 6.47,
        "Q5_K_M": 6.50,
        "Q4_K_M": 6.56,
        "Q3_K_M": 6.79,
        "Q2_K": 7.84,
    }

    print(f"{'Format':<20} {'Size (GB)':>10} {'PPL':>8} {'PPL vs F16':>12}")
    print("-" * 55)
    baseline_ppl = perplexity["F16 (baseline)"]
    for fmt in sizes:
        ppl = perplexity[fmt]
        ppl_increase = (ppl - baseline_ppl) / baseline_ppl * 100
        print(f"{fmt:<20} {sizes[fmt]:>10.1f} {ppl:>8.2f} {ppl_increase:>11.1f}%")

Quality vs Size Trade-offs

Practical Quantization Guide

def choose_quantization_method(
    model_size_b: float,
    use_case: str,
    available_gpu_gb: float,
    latency_sensitive: bool,
    quality_critical: bool
) -> list:
    """
    Recommend quantization strategy based on requirements.
    """
    fp16_gb = model_size_b * 2

    recommendations = []

    if fp16_gb <= available_gpu_gb * 0.7:
        recommendations.append({
            "method": "FP16/BF16",
            "memory_gb": fp16_gb,
            "quality": "Best",
            "setup": "No quantization needed",
            "command": "torch_dtype=torch.bfloat16"
        })

    if fp16_gb / 2 <= available_gpu_gb * 0.7:
        recommendations.append({
            "method": "INT8 (bitsandbytes)",
            "memory_gb": fp16_gb / 2,
            "quality": "Near-lossless",
            "setup": "load_in_8bit=True",
            "command": "load_in_8bit=True"
        })

    if fp16_gb / 4 <= available_gpu_gb * 0.7:
        if quality_critical:
            method = "AWQ INT4"
            cmd = "AutoAWQForCausalLM.from_quantized()"
        else:
            method = "GPTQ INT4"
            cmd = "AutoGPTQForCausalLM.from_quantized()"

        recommendations.append({
            "method": method,
            "memory_gb": fp16_gb / 4,
            "quality": "Good (1-5% degradation)",
            "setup": "Requires pre-quantization step",
            "command": cmd
        })

    if not recommendations or available_gpu_gb < fp16_gb / 4:
        recommendations.append({
            "method": "GGUF Q4_K_M (llama.cpp)",
            "memory_gb": model_size_b * 0.55,
            "quality": "Acceptable",
            "setup": "CPU inference or CPU+GPU hybrid",
            "command": "Llama(model_path='model.Q4_K_M.gguf')"
        })

    return recommendations


# Example decisions
print("LLaMA-3 70B (140 GB FP16) on 80 GB GPU:")
recs = choose_quantization_method(70, "chat", 80, True, False)
for r in recs:
    print(f"  → {r['method']}: {r['memory_gb']:.1f} GB - {r['quality']}")

print("\nLLaMA-3 8B (16 GB FP16) on 24 GB GPU:")
recs = choose_quantization_method(8, "chat", 24, True, True)
for r in recs:
    print(f"  → {r['method']}: {r['memory_gb']:.1f} GB - {r['quality']}")

Common Mistakes

:::danger Quantizing embedding and output layers

Embedding tables and the language modeling head (logit projection) are particularly sensitive to quantization. They map discrete token indices to continuous vectors - small errors here propagate to every generated token. LLM.int8(), GPTQ, and AWQ all leave embeddings and the LM head in FP16/FP32 by default. Never quantize these layers to INT4 without extensive evaluation.

:::

:::danger Using INT4 for medical, legal, or financial applications without evaluation

INT4 quantization reduces model parameters to 16 discrete levels per weight group. On structured knowledge tasks (medical diagnosis, contract analysis, financial calculations), this precision loss can cause the model to confuse numbers, miss critical distinctions, or generate plausible but incorrect statements. Run comprehensive domain-specific benchmarks before deploying INT4 in high-stakes applications. Use INT8 as the minimum for these use cases.

:::

:::warning Calibration dataset matters for GPTQ and AWQ

GPTQ and AWQ both use a calibration dataset to compute quantization parameters. The calibration data should match your deployment distribution. Quantizing a coding model on news articles produces worse results than calibrating on code. Use 128–512 representative examples from your target domain. The default calibration data (often WikiText or C4) is fine for general-purpose models but suboptimal for domain-specific deployments.

:::

:::warning Throughput vs latency trade-off of weight-only quantization

GPTQ and AWQ are weight-only quantization: weights are stored in INT4 but computation happens in FP16 (weights are dequantized before the matmul). This reduces memory, so more requests fit in GPU memory simultaneously, increasing throughput. It does not by itself reduce the amount of compute per token. LLM.int8() does run most of its matmuls in INT8, but the mixed-precision decomposition adds overhead, so it is typically slower than FP16 rather than faster. The common assumption that "INT4 must be 4× faster" is wrong. Weight-only INT4 improves throughput mainly through better batching; any per-token speedup depends on the kernels used.

:::

Interview Questions

Q1: What is the "outlier problem" in LLM quantization and how does LLM.int8() solve it?

In transformer models larger than ~6.7B parameters, a small fraction (~0.1%) of activation values are orders of magnitude larger than the rest. These "outlier" values dominate the per-tensor scale used in naive INT8 quantization - the scale is set to accommodate the outlier range, causing all normal-magnitude values to collapse to just a few discrete levels, destroying information. LLM.int8() (Dettmers et al., 2022) identifies these outlier feature dimensions using a threshold (default 6.0) and processes them separately in FP16, while quantizing all non-outlier values to INT8. This "mixed-precision decomposition" gives near-lossless quality with ~50% memory reduction.

Q2: What is the difference between post-training quantization (PTQ) and quantization-aware training (QAT)?

PTQ quantizes an already-trained model without further training - fast but potentially lossy. GPTQ and AWQ are PTQ methods. QAT trains (or fine-tunes) the model with quantization simulated in the forward pass, so gradients flow through the quantization step and the model learns to be robust to it. QAT produces better quality than PTQ at the same bit width but requires compute proportional to training. For most LLM use cases, PTQ with calibration data (GPTQ, AWQ) achieves acceptable quality much faster than QAT.

Q3: What is per-group quantization and why does it improve quality over per-tensor?

Per-group quantization computes a separate scale factor for every group of $G$ consecutive weights (typically $G = 64$ or $G = 128$). This allows the quantization to adapt to local weight distributions - a group of weights centered around 0.01 gets a different scale than a group centered around 5.0. Per-tensor uses one scale for the entire weight matrix, which is dominated by the maximum value and leaves small-valued weights with poor precision. Per-group adds memory overhead for the scale factors (16-bit floats, one per group) but significantly improves quality, especially at INT4.

Q4: When would you choose AWQ over GPTQ, and when would you choose the reverse?

AWQ generally achieves better quality at INT4 than GPTQ on most benchmarks (coding, reasoning, chat). AWQ also quantizes 3–6× faster. Choose AWQ as the default for INT4 deployment. Choose GPTQ when: (1) you need very specific bit-group combinations that AWQ does not support, (2) compatibility with specific inference frameworks (some still prefer GPTQ kernels), or (3) your task-specific benchmarks show GPTQ performing better (rare but happens for some math-heavy tasks). For CPU inference via llama.cpp, use GGUF formats (Q4_K_M) directly rather than GPTQ or AWQ.

Q5: A 70B model in FP16 is 140 GB. Why does INT4 quantization give you 35 GB, not 17.5 GB?

INT4 stores weights in 4 bits - half a byte - so 70B × 0.5 bytes = 35 GB for weights. FP16 uses 16 bits per weight, so 4 bits is a 4× reduction; 17.5 GB would require 2 bits per weight. In practice "INT4" is a bit of a misnomer for methods like GPTQ and AWQ. These use per-group quantization, typically with group size 128, meaning every 128 weights share one 16-bit scale factor. That adds 16 / 128 = 0.125 bits per weight, plus a little more if a zero point is stored per group. Smaller groups cost more: group size 64 adds 0.25 bits and group size 32 adds 0.5 bits. Effective bits per weight is therefore ~4.13–4.5 bits, not exactly 4. The headline "35 GB" is an approximation; actual files are ~37–42 GB for 70B models with typical group sizes, partly because embeddings and the LM head usually stay in FP16.
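The overhead arithmetic can be checked with a few lines of Python. The helper names are illustrative, and the calculation assumes one FP16 scale per group and, optionally, a stored zero point; it ignores any layers kept in higher precision.

```python
def effective_bits(weight_bits: int, group_size: int,
                   scale_bits: int = 16, zero_point_bits: int = 0) -> float:
    """Bits per weight once the per-group scale (and zero point) are amortized."""
    return weight_bits + (scale_bits + zero_point_bits) / group_size


def weight_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weight storage in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


for group_size in (128, 64, 32):
    bpw = effective_bits(4, group_size)
    print(f"group size {group_size:>3}: {bpw:.3f} bits/weight, "
          f"70B model = {weight_size_gb(70, bpw):.1f} GB")
```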

Q6: How does NF4 differ from standard INT4, and why is it better for normally distributed weights?

Standard INT4 uses 16 linearly spaced quantization levels in a range $[-s, s]$. NF4 (NormalFloat4, Dettmers et al. 2023) uses quantile-based spacing: the 16 levels are placed at the quantiles of the standard normal distribution. For normally distributed weights (which transformer weights typically are), this means more quantization levels near zero (where density is highest) and fewer near the extremes. This minimizes expected quantization error for normally distributed data compared to linear spacing. In practice, NF4 achieves lower perplexity than INT4 at the same compression ratio, especially when combined with double quantization (quantizing the scale factors themselves).

