GPTQ explained from first principles - how Hessian-based error compensation quantizes 175B models to 4-bit in hours, the role of calibration data, group size, activation reordering, and how to deploy GPTQ models in production with vLLM and AutoGPTQ.



GPTQ: Post-Training Quantization That Actually Works

The Night That Changed LLM Deployment

It is late 2022. A researcher downloads OPT-66B - one of the largest open-weights language models Meta has released at this point. The plan: quantize it to 4-bit to fit on a two-GPU consumer setup. Simple round-to-nearest INT4 quantization cuts the model from about 132 GB to about 33 GB in memory. But the model is unrecognizable. On reasoning tasks, accuracy drops 25-40%. On open-ended generation, it produces gibberish. The quantized model is worthless.

The problem is fundamental. Round-to-nearest INT4 treats every weight independently: "What is the closest INT4 value to this FP16 number?" It ignores that weights are interconnected - quantization error in weight $w_i$ changes the layer's output, which propagates errors through all subsequent computations. By the time you reach the final layer, small individual errors have compounded into catastrophic collective failure.

Three weeks later, Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh (IST Austria and ETH Zurich) publish a paper called "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers." It quantizes OPT-175B to 4-bit in approximately four GPU hours on a single GPU, with negligible accuracy degradation relative to the uncompressed baseline. The trick: instead of treating each weight independently, GPTQ uses a 30-year-old mathematical result about second-order sensitivity to compensate for quantization errors as they accumulate. When you quantize weight $w_i$ , you adjust all remaining weights in the row to compensate for the error you just introduced. The errors do not compound - they are corrected continuously.

This lesson explains how that mechanism works from mathematical foundations through production implementation, including all the configuration choices that determine whether your GPTQ model performs well or poorly in practice.

What Round-to-Nearest Gets Wrong

Before understanding GPTQ, you must understand exactly why naive quantization fails at INT4. Consider a single linear layer computing $Y = XW$ where:

$X$ is the input matrix (batch × sequence × d_in)
$W$ is the weight matrix (d_in × d_out)
$Y$ is the output (batch × sequence × d_out)

Naive INT4 quantization replaces $W$ with $\hat{W}$ where each element is rounded to the nearest representable INT4 value:

$\hat{w}_{ij} = \text{round}\left(\frac{w_{ij}}{s}\right) \cdot s$

where $s$ is a scale factor (typically $\max(|W|) / 7$ for symmetric 4-bit). The quantization error is:

$\Delta W = W - \hat{W}$

The resulting output error is:

$\Delta Y = X \cdot \Delta W$
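To make the notation concrete, here is a minimal NumPy sketch of round-to-nearest quantization and the output error it produces. The matrix sizes, the random data, and the helper name `rtn_quantize` are assumptions chosen for illustration; a single scale for the whole matrix is used to match the formula above.

```python
import numpy as np

def rtn_quantize(W: np.ndarray, n_bits: int = 4) -> np.ndarray:
    """Round-to-nearest with one symmetric scale for the whole matrix."""
    max_int = 2 ** (n_bits - 1) - 1          # 7 for 4-bit
    s = np.max(np.abs(W)) / max_int          # scale factor s
    return np.clip(np.round(W / s), -max_int, max_int) * s

rng = np.random.default_rng(0)
X = rng.standard_normal((16, 64))            # (tokens, d_in)
W = rng.standard_normal((64, 32))            # (d_in, d_out)

W_hat = rtn_quantize(W)
delta_W = W - W_hat                          # quantization error
delta_Y = X @ delta_W                        # resulting output error

s = np.max(np.abs(W)) / 7
# The output error is exactly the gap between the two layer outputs
assert np.allclose(delta_Y, X @ W - X @ W_hat)
# No single weight is off by more than half a quantization step
assert np.max(np.abs(delta_W)) <= s / 2 + 1e-12

rel = np.linalg.norm(delta_Y) / np.linalg.norm(X @ W)
print(f"relative output error: {rel:.3f}")
```

Each weight is individually close to its original value, yet the output error sums those small errors over all `d_in` inputs, which is the effect the rest of this section builds on.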

Now consider what happens in a 70B transformer. The model has roughly 80 layers. Each layer has multiple linear projections (Q, K, V, O, gate, up, down). Each weight matrix has quantization error $\Delta W_l$ for layer $l$ . The errors compound through the network: the output error from layer 1 becomes additional input noise to layer 2, which amplifies its own quantization error, which compounds into layer 3, and so on.

Error accumulation in a 70B model with naive INT4
(Xₗ = clean input to layer l, nonlinearities ignored for simplicity):

Layer 1:  Input error: 0         Output error: δ₁ = X₁·ΔW₁
Layer 2:  Input error: δ₁        Output error: δ₂ ≈ δ₁·W₂ + X₂·ΔW₂
...
Layer 80: Input error: δ₇₉       Output error: δ₈₀ ≈ δ₇₉·W₈₀ + X₈₀·ΔW₈₀

Every layer passes on the error it inherits and adds its own, so nothing
removes error once it has been introduced. In the worst case the inherited
term is amplified by each layer rather than simply summed.
Even 0.1% error per layer → 1-(0.999)^80 ≈ 8% final error
if errors happen to compound coherently (which they often do for common patterns)
The insight that makes GPTQ possible: you do not need to prevent all quantization error. You need to prevent quantization errors from accumulating uncompensated. If you can adjust the un-quantized weights to absorb the error from each freshly quantized weight, you can maintain near-perfect layer outputs.

The Mathematical Foundation: Optimal Brain Surgeon

GPTQ is a direct application of the Optimal Brain Surgeon (OBS) framework from Hassibi and Stork (1993), adapted from weight pruning to weight quantization. The OBS framework addresses: "After we change one weight (quantize it), how should we update the remaining weights to minimize the change in the loss?"

The second-order Taylor expansion of the loss $\mathcal{L}$ around the current weights gives:

$\delta \mathcal{L} = \left(\frac{\partial \mathcal{L}}{\partial \mathbf{w}}\right)^T \delta\mathbf{w} + \frac{1}{2} \delta\mathbf{w}^T \mathbf{H} \, \delta\mathbf{w} + O(|\delta\mathbf{w}|^3)$

where $\mathbf{H}$ is the Hessian: $H_{ij} = \frac{\partial^2 \mathcal{L}}{\partial w_i \partial w_j}$ .

At a local minimum of the loss (which trained weights approximately are), the gradient term vanishes: $\frac{\partial \mathcal{L}}{\partial \mathbf{w}} \approx 0$ . So:

$\delta \mathcal{L} \approx \frac{1}{2} \delta\mathbf{w}^T \mathbf{H} \, \delta\mathbf{w}$

When we quantize weight $w_q$ , we force it onto the quantized value $\hat{w}_q$ . Write the quantization error as $\delta w_q = w_q - \hat{w}_q$ (the same sign convention as $\Delta W = W - \hat{W}$ above), so the $q$-th component of the update must equal $-\delta w_q$ . We want to find the update to the remaining weights $\delta \mathbf{w}_{-q}$ that minimizes $\delta \mathcal{L}$ subject to that constraint.

The OBS solution is:

$\delta \mathbf{w}^* = -\frac{\delta w_q}{[\mathbf{H}^{-1}]_{qq}} \cdot [\mathbf{H}^{-1}]_{:,q}$

In words: the optimal correction to remaining weights is proportional to the column of the inverse Hessian corresponding to the quantized weight, scaled by the quantization error divided by the diagonal inverse Hessian entry.

The resulting increase in loss from quantizing weight $q$ (even after optimal correction) is:

$\delta \mathcal{L}_q = \frac{1}{2} \frac{(\delta w_q)^2}{[\mathbf{H}^{-1}]_{qq}}$

This gives you a way to measure how "costly" quantizing each weight is - weights where $[\mathbf{H}^{-1}]_{qq}$ is small are expensive (quantization loss is amplified), while weights where $[\mathbf{H}^{-1}]_{qq}$ is large are cheap (quantization loss is absorbed well).
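The two OBS formulas can be checked numerically on a toy problem. The sketch below assumes a small random positive-definite Hessian and an arbitrary quantization error of 0.3, taken as $w_q - \hat{w}_q$; none of these values come from a real model.

```python
import numpy as np

rng = np.random.default_rng(0)
d, q = 6, 2                                  # number of weights, index being quantized

A = rng.standard_normal((50, d))
H = A.T @ A + 0.1 * np.eye(d)                # symmetric positive-definite Hessian
H_inv = np.linalg.inv(H)

def loss_increase(dw: np.ndarray) -> float:
    """Second-order loss change: 0.5 * dw^T H dw."""
    return float(0.5 * dw @ H @ dw)

e_q = 0.3                                    # assumed error w_q - w_hat_q

# OBS update: moves w_q by -e_q (onto the quantized value) and
# shifts every other weight to compensate
dw_obs = -(e_q / H_inv[q, q]) * H_inv[:, q]

# Baseline: quantize w_q and leave all other weights untouched
dw_naive = np.zeros(d)
dw_naive[q] = -e_q

assert np.isclose(dw_obs[q], -e_q)
assert np.isclose(loss_increase(dw_obs), 0.5 * e_q**2 / H_inv[q, q])
assert loss_increase(dw_obs) <= loss_increase(dw_naive)

print(f"loss without compensation: {loss_increase(dw_naive):.4f}")
print(f"loss with OBS compensation: {loss_increase(dw_obs):.4f}")
```

The first assertion confirms the constraint is met, the second reproduces the closed-form cost, and the third shows that compensation never does worse than leaving the other weights alone.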

From OBS to GPTQ: The Key Simplifications

Applying OBS directly to a 70B model is computationally intractable - the full Hessian for a weight matrix with $d^2$ parameters has $d^4$ entries. Frantar et al. made three key observations that make GPTQ practical:

Observation 1: Layer-wise quantization. The Hessian of the total loss over all weights is huge and complex. But GPTQ applies the OBS framework independently to each weight matrix (each linear layer). This is justified because the layer-wise output error (from quantizing that layer's weights) is what matters for downstream layers - not the global loss. The Hessian for a single weight matrix is tractable.

Observation 2: The layer Hessian from activations. For a linear layer $Y = XW$ , the Hessian of the squared output error $\|XW - X\hat{W}\|^2$ with respect to $W$ is:

$\mathbf{H}_W = 2 X^T X$

This is the (unnormalized) second-moment matrix of the input activations, a sum of per-token outer products $x x^T$ . It is computable from calibration data with forward passes only, without computing gradients or backpropagating through the full model. Run the model forward on a small calibration set (128 samples is a common choice), collect the input activations at each layer, and you have the Hessian.

Observation 3: Row-wise independence. Store $W$ as $d_{out} \times d_{in}$ , as in the algorithm below, so that the layer computes $Y = XW^T$ and each row of $W$ produces one output feature. The rows are then independent: the squared output error is a sum of per-row terms, so quantizing row $i$ does not interact with row $j$ through the Hessian. This means we can process each row independently, reducing the problem from a $d_{in} \cdot d_{out} \times d_{in} \cdot d_{out}$ Hessian to a $d_{in} \times d_{in}$ Hessian (one per row, but all rows share the same Hessian since $H = 2X^TX$ is identical for all rows of the same weight matrix).
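Observations 2 and 3 can be verified together with a few lines of NumPy. This is a sketch on random data with assumed toy dimensions; it stores the weight matrix as (d_out, d_in), so the layer output is `X @ W.T`, and uses a small random perturbation `dW` as a stand-in for the quantization error.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_in, d_out = 32, 8, 4

X = rng.standard_normal((n_tokens, d_in))       # calibration activations
W = rng.standard_normal((d_out, d_in))          # one weight row per output feature
dW = 0.01 * rng.standard_normal((d_out, d_in))  # stand-in for W - W_hat

H = 2.0 * X.T @ X                               # shared (d_in, d_in) Hessian

# Total squared output error of the layer, computed directly
total_error = np.sum((X @ W.T - X @ (W - dW).T) ** 2)

# The same error as a sum of independent per-row quadratic forms
per_row = np.array([0.5 * dW[r] @ H @ dW[r] for r in range(d_out)])

assert H.shape == (d_in, d_in)
assert np.isclose(total_error, per_row.sum())
print(f"direct: {total_error:.6f}   per-row sum: {per_row.sum():.6f}")
```

Because the total is an exact sum of per-row terms that all use the same `H`, one inverse Hessian can be computed once per layer and reused for every row.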

The GPTQ algorithm for one weight matrix then becomes:

GPTQ Algorithm for one weight matrix W (shape: d_out × d_in):

1. Collect calibration activations X (shape: n_cal × d_in)
2. Compute H = 2 * X.T @ X  (shape: d_in × d_in)
3. Add damping: H += λ * I   (prevents singular matrix issues)
4. Compute H⁻¹, then its upper Cholesky factor U (H⁻¹ = Uᵀ U).
   Row q of U is the row of H⁻¹ that remains after columns 0..q-1
   have been eliminated, divided by U[q,q].
5. For each column index q = 0, 1, ..., d_in-1 (process left to right):
   For each row r = 0, 1, ..., d_out-1:
     a. Quantize W[r, q]: w_q = round(W[r, q] / scale) * scale
        (scale is fixed per row at the start of each group of columns)
     b. Compute quantization error: δw_q = W[r, q] - w_q
     c. Update remaining weights:
        W[r, q+1:] -= (δw_q / U[q,q]) * U[q, q+1:]
        (adjust future columns to compensate for this column's error)
     d. Store quantized value: W_quant[r, q] = w_q

The critical insight: step (c) propagates the quantization error to future columns in the calibration-statistic-weighted direction that minimizes the output error. Earlier columns affect later ones through the Cholesky factor - the mathematical structure that makes the sequential updates efficient.
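One way to see why a Cholesky factor appears: after a column is quantized, it has to be eliminated from the inverse Hessian before the next column is compensated. The sketch below checks numerically, on an assumed small random Hessian, that the rows of the upper Cholesky factor of $H^{-1}$ already point in the same directions as those successively eliminated rows, so no explicit elimination is needed during quantization.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6
A = rng.standard_normal((40, d))
H = 2.0 * A.T @ A + 0.01 * np.eye(d)          # damped toy Hessian
H_inv = np.linalg.inv(H)

# Upper-triangular U with H_inv = U.T @ U
U = np.linalg.cholesky(H_inv).T

# Explicit route: eliminate one index at a time from the inverse Hessian
H_inv_q = H_inv.copy()
for q in range(d):
    # Compensation direction for column q from the eliminated inverse Hessian
    row = H_inv_q[q, q:] / H_inv_q[q, q]
    # Row q of U, normalized by its diagonal, is the same direction
    assert np.allclose(row, U[q, q:] / U[q, q])
    # Remove index q (one Gaussian-elimination step on the inverse)
    H_inv_q = H_inv_q - np.outer(H_inv_q[:, q], H_inv_q[q, :]) / H_inv_q[q, q]

print("Cholesky rows match the step-by-step eliminated inverse Hessian")
```

Computing `U` once costs a single factorization, whereas the explicit loop performs a rank-one update of the full matrix for every column.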

GPTQ Implementation From First Principles

Here is a clean NumPy implementation of the core GPTQ algorithm for a single layer. This is pedagogical - the production version uses CUDA and handles the full model - but demonstrates the exact mathematical operations:

import numpy as np
from typing import Tuple


def quantize_symmetric(
    w: np.ndarray,
    n_bits: int = 4,
    group_size: int = 128,
) -> Tuple[np.ndarray, np.ndarray]:
    """
    Symmetric per-group quantization of a weight matrix row.

    Args:
        w: 1D weight vector, shape (d_in,)
        n_bits: Quantization bits (4 or 8)
        group_size: Number of weights per quantization group
                    Smaller → more scale parameters, higher accuracy
                    Larger → fewer scale parameters, lower overhead

    Returns:
        w_quantized: Dequantized weights (float, same shape as w)
        scales: Scale factor per group
    """
    d_in = w.shape[0]
    max_int = 2 ** (n_bits - 1) - 1   # 7 for INT4, 127 for INT8
    n_groups = (d_in + group_size - 1) // group_size

    scales = np.zeros(n_groups)
    w_quantized = np.zeros_like(w)

    for g in range(n_groups):
        start = g * group_size
        end = min(start + group_size, d_in)
        group = w[start:end]

        # Scale: map [-max_val, max_val] → [-max_int, max_int]
        max_val = np.max(np.abs(group))
        scale = max_val / max_int if max_val > 0 else 1.0
        scales[g] = scale

        # Quantize and dequantize
        q = np.round(group / scale).clip(-max_int, max_int)
        w_quantized[start:end] = q * scale  # Store dequantized for GPTQ updates

    return w_quantized, scales


def gptq_quantize_layer(
    W: np.ndarray,
    X_calibration: np.ndarray,
    n_bits: int = 4,
    group_size: int = 128,
    damping_factor: float = 0.01,
    block_size: int = 128,
) -> Tuple[np.ndarray, np.ndarray]:
    """
    GPTQ quantization for a single weight matrix.

    This implements the core GPTQ algorithm from Frantar et al. (2022):
    - Compute Hessian from calibration activations
    - Process columns sequentially with error compensation

    Args:
        W: Weight matrix, shape (d_out, d_in)
        X_calibration: Calibration activations, shape (n_tokens, d_in)
        n_bits: Target quantization bits (4 is standard)
        group_size: Weights per quantization group (128 standard)
        damping_factor: Hessian regularization - prevents singularity.
                        Larger → more stable but less accurate compensation.
                        Typical range: 0.001 to 0.1
        block_size: Process this many columns at a time (memory efficiency)

    Returns:
        W_quantized: Dequantized quantized weight matrix
        all_scales: Scale factors, shape (d_out, n_groups)
    """
    d_out, d_in = W.shape
    n_tokens, d_in_cal = X_calibration.shape
    assert d_in == d_in_cal, f"Weight d_in={d_in} != calibration d_in={d_in_cal}"

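    # Step 1: Compute Hessian H = 2 * X^T * X, averaged over tokens
    # This is the second derivative of squared output error w.r.t. one weight row.
    # Dividing by n_tokens only rescales H; the compensation below uses ratios
    # of inverse-Hessian entries, so the result is unchanged.
    # Shape: (d_in, d_in)
    H = 2.0 * (X_calibration.T @ X_calibration) / n_tokens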

    # Step 2: Add diagonal damping for numerical stability
    # Without this, H may be singular if some input dimensions are never activated
    avg_diag = np.mean(np.diag(H))
    H += damping_factor * avg_diag * np.eye(d_in)

    # Step 3: Compute H⁻¹ via Cholesky for numerical stability
    # Cholesky is more stable than direct inversion and exploits H's positive definiteness
    try:
        L = np.linalg.cholesky(H)
        L_inv = np.linalg.solve(L, np.eye(d_in))
        H_inv = L_inv.T @ L_inv  # H⁻¹ = (L^{-T}) * L^{-1}
    except np.linalg.LinAlgError:
        # H is not positive definite (e.g., severely undersampled)
        # Fall back to pseudo-inverse - less accurate but doesn't crash
        print("Warning: H is not positive definite, using pseudo-inverse")
        H_inv = np.linalg.pinv(H)

    # Step 4: Column-by-column quantization with error compensation.
    # Once a column is quantized it must be eliminated from the inverse Hessian
    # before the next column is processed. The upper Cholesky factor U of H⁻¹
    # (H⁻¹ = Uᵀ U) already holds those eliminated rows, so row q of U is used
    # in place of H⁻¹[q, q:].
    U = np.linalg.cholesky((H_inv + H_inv.T) / 2.0).T
    max_int = 2 ** (n_bits - 1) - 1
    W_work = W.copy().astype(np.float64)      # weights that keep receiving updates
    W_quantized = np.zeros_like(W_work)       # final dequantized values
    n_groups = (d_in + group_size - 1) // group_size
    all_scales = np.ones((d_out, n_groups))
    scale = np.ones(d_out)

    for col_start in range(0, d_in, block_size):
        col_end = min(col_start + block_size, d_in)

        for q in range(col_start, col_end):
            if q % group_size == 0:
                # Start of a new group: fix one scale per row from the
                # current (already compensated) weights of this group
                max_val = np.max(np.abs(W_work[:, q:q + group_size]), axis=1)
                scale = np.where(max_val > 0, max_val / max_int, 1.0)
                all_scales[:, q // group_size] = scale

            # Quantize column q (all rows at once) and record the error
            col_weights = W_work[:, q]
            w_q = np.round(col_weights / scale).clip(-max_int, max_int) * scale
            W_quantized[:, q] = w_q
            delta_w = col_weights - w_q

            # Core GPTQ update: push the error onto the remaining columns
            # W[:, q+1:] -= (delta_w / U[q, q]) * U[q, q+1:]
            if q + 1 < d_in:
                W_work[:, q + 1:] -= np.outer(delta_w / U[q, q], U[q, q + 1:])

    return W_quantized, all_scales


def compare_gptq_vs_naive(
    W: np.ndarray,
    X: np.ndarray,
    n_bits: int = 4,
    group_size: int = 128,
) -> None:
    """Compare GPTQ vs naive round-to-nearest quantization."""

    # Naive quantization
    W_naive = np.zeros_like(W)
    d_out, d_in = W.shape
    n_groups = (d_in + group_size - 1) // group_size
    max_int = 2 ** (n_bits - 1) - 1

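    for g in range(n_groups):
        start = g * group_size
        end = min(start + group_size, d_in)
        group = W[:, start:end]
        # One scale per row and group (standard group-wise quantization)
        max_val = np.max(np.abs(group), axis=1, keepdims=True)
        scale = np.where(max_val > 0, max_val / max_int, 1.0)
        W_naive[:, start:end] = np.round(group / scale).clip(-max_int, max_int) * scale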

    # GPTQ quantization
    W_gptq, _ = gptq_quantize_layer(W, X, n_bits=n_bits, group_size=group_size)

    # Compute output errors on test inputs (same calibration for simplicity)
    Y_original = X @ W.T
    Y_naive = X @ W_naive.T
    Y_gptq = X @ W_gptq.T

    naive_error = np.mean((Y_original - Y_naive) ** 2)
    gptq_error = np.mean((Y_original - Y_gptq) ** 2)

    print(f"Quantization comparison ({n_bits}-bit, group_size={group_size}):")
    print(f"  Naive round-to-nearest MSE: {naive_error:.6f}")
    print(f"  GPTQ MSE:                   {gptq_error:.6f}")
    print(f"  GPTQ improvement:           {naive_error / gptq_error:.2f}x lower error")


# Demo with a random layer
np.random.seed(42)
W_demo = np.random.randn(256, 512).astype(np.float64)  # (d_out=256, d_in=512)
X_demo = np.random.randn(128, 512).astype(np.float64)  # (n_tokens=128, d_in=512)

compare_gptq_vs_naive(W_demo, X_demo, n_bits=4, group_size=128)
# Prints the naive MSE, the GPTQ MSE, and their ratio. The exact values depend
# on the random seed and NumPy version, so no numbers are quoted here.

Using AutoGPTQ: The Production Library

The auto-gptq library implements the full GPTQ pipeline with CUDA-accelerated Hessian computation and quantization. This is what you use in production:

# pip install auto-gptq transformers accelerate optimum
import torch
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from datasets import load_dataset


def build_calibration_dataset(
    tokenizer,
    dataset_name: str = "allenai/c4",
    n_samples: int = 128,
    seq_length: int = 2048,
    text_column: str = "text",
    seed: int = 42,
) -> list:
    """
    Build a calibration dataset for GPTQ quantization.

    The calibration data determines which weight patterns GPTQ
    optimizes for. Mismatch with deployment domain causes accuracy drops.

    Args:
        n_samples: 128 is standard; 256 provides marginally better Hessians
        seq_length: Should match your typical inference sequence length.
                    Longer calibration → better long-context accuracy.
                    2048 is a good default for most models.

    Returns:
        List of dicts with "input_ids" and "attention_mask" tensors,
        each of shape (1, n_tokens) with n_tokens <= seq_length
    """
    print(f"Building calibration dataset: {n_samples} samples of up to {seq_length} tokens")

    # Load a streaming dataset to avoid downloading everything
    dataset = load_dataset(
        dataset_name,
        "en",
        split="train",
        streaming=True,
    )

    calibration_data = []

    for item in dataset:
        if len(calibration_data) >= n_samples:
            break

        text = item.get(text_column, "")
        if len(text) < 200:   # Skip very short documents
            continue

        encoded = tokenizer(
            text,
            return_tensors="pt",
            truncation=True,
            max_length=seq_length,
        )
        if encoded["input_ids"].shape[1] < 64:  # Skip extremely short after tokenization
            continue

        # Keep the attention mask and do not pad: padding tokens would
        # otherwise contribute activations to the Hessian. AutoGPTQ's
        # quantize() expects dict examples with both keys.
        calibration_data.append({
            "input_ids": encoded["input_ids"],
            "attention_mask": encoded["attention_mask"],
        })

    print(f"  Collected {len(calibration_data)} calibration samples")
    return calibration_data


def quantize_with_gptq(
    model_name: str,
    output_path: str,
    n_bits: int = 4,
    group_size: int = 128,
    desc_act: bool = False,
    n_calibration_samples: int = 128,
    seq_length: int = 2048,
) -> None:
    """
    Full GPTQ quantization pipeline.

    Key configuration parameters and their tradeoffs:

    n_bits:
        4 - standard, best memory/accuracy tradeoff for LLMs
        3 - more aggressive, ~5% accuracy drop, useful when memory is very tight
        8 - minimal accuracy drop, useful when 4-bit is too lossy

    group_size:
        128 - standard default, balanced overhead and accuracy
        64 - better accuracy (perplexity roughly 0.3-0.5 lower), 2x scale overhead
        32 - best accuracy, 4x scale overhead (the scale parameters start to matter)
        -1 - no grouping: one scale per output row (per-channel),
             worst accuracy but minimal overhead

    desc_act (activation reordering):
        False - faster quantization (3-4x), slightly lower accuracy
        True - quantize in order of activation importance (largest activations first),
               better accuracy but non-sequential access hurts inference speed.
               Generally NOT recommended for inference - use AWQ instead if you
               want activation-aware accuracy improvement.
    """
    print(f"GPTQ quantization: {model_name}")
    print(f"Config: {n_bits}-bit, group_size={group_size}, desc_act={desc_act}")

    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    # Build calibration data
    calibration_data = build_calibration_dataset(
        tokenizer,
        n_samples=n_calibration_samples,
        seq_length=seq_length,
    )

    # Configure GPTQ
    quantize_config = BaseQuantizeConfig(
        bits=n_bits,
        group_size=group_size,
        desc_act=desc_act,
        # damp_percent: Hessian damping, controls numerical stability
        # Lower = more accurate compensation, higher = more stable
        # 0.01 (1%) is the standard default
        damp_percent=0.01,
    )

    # Load model for quantization (FP16 to fit in GPU memory)
    print("Loading model for quantization...")
    model = AutoGPTQForCausalLM.from_pretrained(
        model_name,
        quantize_config=quantize_config,
        torch_dtype=torch.float16,
        device_map="auto",
        low_cpu_mem_usage=True,
    )

    # Run GPTQ quantization
    print("Running GPTQ quantization (this may take 30min–4hrs depending on model size)...")
    model.quantize(
        calibration_data,
        cache_examples_on_gpu=True,  # Cache activations on GPU for speed
        batch_size=1,                # Process one sample at a time to manage memory
    )

    # Save quantized model
    model.save_quantized(output_path, use_safetensors=True)
    tokenizer.save_pretrained(output_path)

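    # Print size statistics from the saved files. The packed INT4 weights are
    # held as buffers in the quantized layers rather than as parameters, so
    # summing model.parameters() would undercount.
    from pathlib import Path
    weight_size_gb = sum(
        f.stat().st_size for f in Path(output_path).glob("*.safetensors")
    ) / 1e9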
    print(f"\nQuantization complete!")
    print(f"  Quantized model saved to: {output_path}")
    print(f"  Approximate model size: {weight_size_gb:.2f} GB")
    print(f"  Load with: AutoGPTQForCausalLM.from_quantized('{output_path}')")

Group Size: The Most Important Configuration Choice

Group size controls how many weights share a single quantization scale factor. It is the most impactful tuning parameter in GPTQ configuration.

Why group size matters mathematically: Within a group, all weights share the same quantization scale $s$ . The scale is set to accommodate the largest weight in the group: $s = \max(|w_{group}|) / (2^{b-1} - 1)$ . If one weight is 10x larger than others in the group, the scale is set for that outlier, and all smaller weights are quantized at 1/10th the effective precision they could have had if groups were smaller.

Smaller groups = more scales = more precision per weight. Larger groups = fewer scales = less memory overhead but more quantization error from outlier weights.
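The outlier effect is easy to reproduce. The following sketch uses an assumed synthetic weight vector (small Gaussian weights plus one outlier about 50x larger) and a hypothetical helper `group_quant_mse`; it applies plain round-to-nearest per group, not GPTQ, to isolate the effect of group size alone.

```python
import numpy as np

def group_quant_mse(w: np.ndarray, group_size: int, n_bits: int = 4) -> float:
    """Mean squared error of symmetric round-to-nearest with one scale per group."""
    max_int = 2 ** (n_bits - 1) - 1
    groups = w.reshape(-1, group_size)                    # one row per group
    max_val = np.max(np.abs(groups), axis=1, keepdims=True)
    scale = np.where(max_val > 0, max_val / max_int, 1.0)
    w_hat = np.clip(np.round(groups / scale), -max_int, max_int) * scale
    return float(np.mean((groups - w_hat) ** 2))

rng = np.random.default_rng(0)
w = 0.02 * rng.standard_normal(1024)   # typical small weights
w[5] = 1.0                             # a single outlier, about 50x larger

mse_by_group = {g: group_quant_mse(w, g) for g in (1024, 128, 32)}
for g, mse in mse_by_group.items():
    print(f"group_size={g:>4}: MSE = {mse:.2e}")
```

With one group covering all 1024 weights, the outlier sets the scale for everything and most small weights round to zero. Smaller groups confine that damage to the few weights that share a group with the outlier.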

The scale overhead for a 7B model:

Weight matrix typically: d_out × d_in, stored as INT4 (0.5 bytes/weight)

Scale overhead per linear layer:
  group_size=32:  n_params / 32 scales × 2 bytes/scale = n_params × 0.0625 bytes
  group_size=128: n_params / 128 × 2 = n_params × 0.015625 bytes
  group_size=1024: n_params / 1024 × 2 = n_params × 0.00195 bytes

For 7B model (7×10⁹ parameters):
  group_size=32:  7B × (0.5 + 0.0625) = ~3.94 GB total
  group_size=128: 7B × (0.5 + 0.015) = ~3.61 GB total
  group_size=1024: 7B × (0.5 + 0.002) = ~3.51 GB total

The difference is 0.43 GB - small for a 7B model.
For a 70B model, this scales to 4.3 GB - meaningful.
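The same arithmetic as a small function, so other model sizes and group sizes can be tried. The helper name `gptq_weight_bytes` is hypothetical, and the estimate assumes one FP16 scale per group and ignores zero-points and any non-quantized layers, as the figures above do.

```python
def gptq_weight_bytes(n_params: float, group_size: int, n_bits: int = 4,
                      scale_bytes: int = 2) -> float:
    """Bytes for packed weights plus one FP16 scale per group (zero-points ignored)."""
    weight_bytes = n_params * n_bits / 8
    scale_overhead = n_params / group_size * scale_bytes
    return weight_bytes + scale_overhead

for n_params in (7e9, 70e9):
    for group_size in (32, 128, 1024):
        gb = gptq_weight_bytes(n_params, group_size) / 1e9
        print(f"{n_params / 1e9:>4.0f}B params, group_size={group_size:>4}: {gb:6.2f} GB")
```

Computed without intermediate rounding, the gap between group sizes 32 and 1024 comes out at about 0.42 GB for 7B parameters and about 4.2 GB for 70B, in line with the rounded figures above.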

:::tip The Default group_size=128 Is Usually Right For 4-bit INT4 production deployment, group_size=128 is the community standard and the right choice for most models. Only consider smaller groups if: (a) you are doing 3-bit quantization where accuracy is more fragile, (b) your model has extreme activation variance suggesting many outlier weights, or (c) benchmarking shows meaningful perplexity improvement that justifies the overhead. :::

Loading and Running GPTQ Models

Once a model is quantized, loading and running it is straightforward:

import torch
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer, TextStreamer


def load_gptq_model(
    model_path: str,
    use_triton: bool = False,
    inject_fused_attention: bool = True,
    inject_fused_mlp: bool = True,
) -> tuple:
    """
    Load a GPTQ-quantized model for inference.

    Performance options:
        use_triton: Uses Triton INT4 kernels instead of ExLlama kernels.
                    ExLlama (default) is generally faster for batch_size=1.
                    Triton can be faster at higher batch sizes.
        inject_fused_attention: Fuse attention computations - 10-15% speedup.
        inject_fused_mlp: Fuse MLP gate+up+down projections - 15-20% speedup.

    Both fused options are safe and recommended for inference.
    They do not affect output quality, only execution speed.
    """
    tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)
    tokenizer.pad_token = tokenizer.eos_token

    model = AutoGPTQForCausalLM.from_quantized(
        model_path,
        use_safetensors=True,
        device_map="auto",
        use_triton=use_triton,
        inject_fused_attention=inject_fused_attention,
        inject_fused_mlp=inject_fused_mlp,
        trust_remote_code=False,
    )
    model.eval()  # Ensure inference mode (disables dropout etc.)

    return model, tokenizer


def generate_with_gptq(
    model,
    tokenizer,
    prompt: str,
    max_new_tokens: int = 512,
    temperature: float = 0.0,
    top_p: float = 0.9,
    stream: bool = False,
) -> str:
    """
    Generate text with a GPTQ model.

    Args:
        temperature: 0.0 = greedy (deterministic, best for evaluation)
                     0.1-0.7 = creative but coherent
                     >1.0 = very creative / chaotic
        stream: Print tokens as they are generated
    """
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    input_length = inputs["input_ids"].shape[1]

    gen_kwargs = {
        "max_new_tokens": max_new_tokens,
        "do_sample": temperature > 0,
        "pad_token_id": tokenizer.eos_token_id,
    }
    if temperature > 0:
        gen_kwargs["temperature"] = temperature
        gen_kwargs["top_p"] = top_p

    if stream:
        streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
        gen_kwargs["streamer"] = streamer

    with torch.no_grad():
        output_ids = model.generate(**inputs, **gen_kwargs)

    generated_ids = output_ids[0][input_length:]
    return tokenizer.decode(generated_ids, skip_special_tokens=True)


def benchmark_gptq_throughput(
    model_path: str,
    prompt: str = "Write a detailed technical explanation of how transformers work:",
    batch_sizes: tuple = (1, 4, 8),  # tuple avoids a shared mutable default
    n_new_tokens: int = 200,
    n_warmup: int = 3,
    n_runs: int = 10,
) -> None:
    """
    Benchmark GPTQ model throughput across batch sizes.
    Prints a summary table of latency and tokens/second.
    """
    import time

    model, tokenizer = load_gptq_model(model_path)

    print(f"\n{'Batch':>6} {'Latency(ms)':>12} {'P90(ms)':>10} {'Tok/s':>8}")
    print("-" * 42)

    for bs in batch_sizes:
        prompts = [prompt] * bs
        inputs = tokenizer(
            prompts, return_tensors="pt", padding=True
        )
        inputs = {k: v.to(model.device) for k, v in inputs.items()}

        # Warmup
        for _ in range(n_warmup):
            with torch.no_grad():
                model.generate(**inputs, max_new_tokens=20, do_sample=False)

        # Timed runs
        latencies = []
        for _ in range(n_runs):
            torch.cuda.synchronize()
            t0 = time.perf_counter()
            with torch.no_grad():
                model.generate(**inputs, max_new_tokens=n_new_tokens, do_sample=False)
            torch.cuda.synchronize()
            latencies.append((time.perf_counter() - t0) * 1000)

        latencies.sort()
        mean_ms = sum(latencies) / len(latencies)
        p90_ms = latencies[int(len(latencies) * 0.9)]
        tps = bs * n_new_tokens / (mean_ms / 1000)

        print(f"{bs:>6} {mean_ms:>12.1f} {p90_ms:>10.1f} {tps:>8.1f}")

Deploying GPTQ with vLLM

For production serving with high throughput requirements, vLLM provides the best GPTQ integration - continuous batching, paged attention, and native GPTQ kernel support:

from vllm import LLM, SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
import asyncio
import uuid


def serve_gptq_with_vllm_batch(
    model_path: str,
    prompts: list[str],
    max_tokens: int = 512,
    temperature: float = 0.0,
    tensor_parallel_size: int = 1,
) -> list[str]:
    """
    Batch inference with GPTQ model through vLLM.

    vLLM provides three key advantages over direct AutoGPTQ inference:
    1. Continuous batching: new requests start as soon as GPU capacity frees,
       rather than waiting for a fixed batch to complete
    2. Paged attention: KV cache stored in non-contiguous pages, enabling
       more concurrent sequences without memory fragmentation
    3. Optimized GPTQ kernels: integrated ExLlama v2 and Marlin kernels

    Args:
        tensor_parallel_size: Number of GPUs for tensor parallelism.
                               Use when single GPU can't fit the model.
    """
    llm = LLM(
        model=model_path,
        quantization="gptq",           # Tell vLLM this is a GPTQ model
        dtype="float16",
        tensor_parallel_size=tensor_parallel_size,
        gpu_memory_utilization=0.90,   # Reserve 10% for overhead
        max_model_len=4096,
    )

    sampling_params = SamplingParams(
        temperature=temperature,
        max_tokens=max_tokens,
        top_p=0.9 if temperature > 0 else 1.0,
    )

    outputs = llm.generate(prompts, sampling_params)
    return [output.outputs[0].text for output in outputs]


async def serve_gptq_async_vllm(
    model_path: str,
    max_model_len: int = 4096,
) -> AsyncLLMEngine:
    """
    Initialize vLLM async engine for streaming production serving.

    The async engine handles request queuing, continuous batching,
    and streaming responses - the production pattern for serving APIs.

    Usage:
        engine = await serve_gptq_async_vllm("path/to/model")
        # In your API handler:
        async for output in engine.generate(prompt, params, request_id):
            text_so_far = output.outputs[0].text  # cumulative text, not one token
            yield text_so_far  # Send only the new suffix if the client expects deltas
    """
    engine_args = AsyncEngineArgs(
        model=model_path,
        quantization="gptq",
        dtype="float16",
        max_model_len=max_model_len,
        gpu_memory_utilization=0.90,
        enable_prefix_caching=True,   # Cache KV for shared prefixes (system prompts)
        max_num_seqs=256,             # Max concurrent sequences
    )
    return AsyncLLMEngine.from_engine_args(engine_args)


async def stream_gptq_response(
    engine: AsyncLLMEngine,
    prompt: str,
    max_tokens: int = 512,
) -> str:
    """Stream tokens from the async vLLM engine."""
    import uuid

    from vllm import SamplingParams

    request_id = str(uuid.uuid4())
    sampling_params = SamplingParams(max_tokens=max_tokens, temperature=0.0)

    full_text = ""
    async for output in engine.generate(prompt, sampling_params, request_id):
        if output.outputs:
            full_text = output.outputs[0].text

    return full_text

GPTQ Configuration Reference

GPTQ vs. AWQ vs. bitsandbytes: When to Choose Each

Criterion	GPTQ	AWQ	bitsandbytes NF4
Mechanism	Hessian error compensation	Activation-aware scaling	NormalFloat4 quantization
Accuracy at INT4	Good	Slightly better	Similar to GPTQ
Quantization speed (7B)	30-60 min	30-60 min	Minutes (no calibration step)
Inference speed	Good	Better (Marlin kernel)	Slowest (runtime dequant)
3-bit support	Yes (good)	Marginal	No
LoRA fine-tuning support	No	No	Yes (QLoRA)
CPU inference	Via GGUF	Limited	Limited
vLLM integration	Full	Full	Partial
Calibration data needed	Yes	Yes	No
Best use case	General INT4 deployment, 3-bit, GGUF export	NVIDIA GPU inference with Marlin	Training with QLoRA

Domain-Specific Calibration: A Complete Example

def build_domain_calibration_data(
    tokenizer,
    domain: str = "code",
    n_samples: int = 128,
    seq_length: int = 2048,
) -> list:
    """
    Build domain-specific calibration data for better GPTQ accuracy.

    Why this matters: GPTQ's Hessian reflects which weight dimensions
    are activated by calibration inputs. Mismatch between calibration
    and deployment distributions means GPTQ optimizes the wrong weights.

    domain options:
        "code"    - Python, SQL, shell code from The Stack
        "math"    - LaTeX math problems, proofs
        "medical" - Clinical notes, medical literature
        "legal"   - Legal documents, contracts
        "general" - Pile / C4 (fine for general-purpose models)
    """
    from datasets import load_dataset

    domain_dataset_map = {
        "code": ("codeparrot/github-code", "Python", "code"),
        "math": ("hendrycks/competition_math", None, "problem"),
        "medical": ("medalpaca/medical_meadow_medqa", None, "input"),
        "legal": ("nguyen-brat/legal-dataset", None, "text"),
        "general": ("allenai/c4", "en", "text"),
    }

    if domain not in domain_dataset_map:
        raise ValueError(f"Unknown domain: {domain}. Choose from: {list(domain_dataset_map.keys())}")

    dataset_name, subset, text_col = domain_dataset_map[domain]

    try:
        if subset:
            dataset = load_dataset(dataset_name, subset, split="train", streaming=True)
        else:
            dataset = load_dataset(dataset_name, split="train", streaming=True)
    except Exception as e:
        print(f"Warning: Could not load {dataset_name}: {e}")
        print("Falling back to C4 general data")
        dataset = load_dataset("allenai/c4", "en", split="train", streaming=True)
        text_col = "text"

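    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    calibration_data = []

    for item in dataset:
        if len(calibration_data) >= n_samples:
            break

        text = item.get(text_col, "")
        if len(text) < 100:
            continue

        # No padding: pad tokens would leak into the Hessian statistics and
        # would make the minimum-length check below always pass.
        encoded = tokenizer(
            text,
            return_tensors="pt",
            truncation=True,
            max_length=seq_length,
        )

        # Skip very short samples. Keep the attention mask next to the ids,
        # since AutoGPTQ's quantize() consumes dict-style examples.
        if encoded["input_ids"].shape[1] >= 64:
            calibration_data.append({
                "input_ids": encoded["input_ids"],
                "attention_mask": encoded["attention_mask"],
            })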

    print(f"Built {len(calibration_data)} {domain}-domain calibration samples")
    return calibration_data

Evaluating GPTQ Accuracy

After quantizing, always evaluate on your target task before deploying:

import torch
import math
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer
from datasets import load_dataset


def compute_perplexity(
    model_path: str,
    dataset_name: str = "wikitext",
    split: str = "test",
    stride: int = 512,
    max_length: int = 2048,
    n_samples: int = 50,
) -> float:
    """
    Compute perplexity of a GPTQ model on a text dataset.

    Perplexity is the standard measure of language model quality.
    Lower = better. Typical values:
      - FP16 Llama-3.1-8B on WikiText-2: ~6.2
      - GPTQ INT4 group128: ~6.5-6.7 (~5% higher perplexity)
      - Naive INT4: ~9-15 (catastrophic degradation visible here)

    stride: Controls context overlap. stride=512 with max_length=2048
            means 75% overlap between windows - expensive but accurate.
            stride=max_length means no overlap - faster but less accurate PPL.
    """
    tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)
    model = AutoGPTQForCausalLM.from_quantized(
        model_path,
        device_map="auto",
        use_safetensors=True,
    )
    model.eval()

    dataset = load_dataset(dataset_name, "wikitext-2-raw-v1", split=split)
    text = "\n\n".join(dataset["text"])
    encodings = tokenizer(text, return_tensors="pt")
    input_ids = encodings.input_ids.to(model.device)

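    total_len = input_ids.shape[1]
    nll_sum = 0.0
    n_scored = 0

    for begin_loc in range(0, min(total_len - max_length, n_samples * stride), stride):
        end_loc = begin_loc + max_length
        # The first window scores all of its tokens; each later window scores
        # only its last `stride` tokens, so no token is counted twice.
        target_len = max_length if begin_loc == 0 else stride

        input_chunk = input_ids[:, begin_loc:end_loc]
        target_ids = input_chunk.clone()
        target_ids[:, :-target_len] = -100  # Mask the overlapping context

        with torch.no_grad():
            outputs = model(input_chunk, labels=target_ids)

        # outputs.loss is a mean over scored tokens; weight it by the window's
        # target count so windows of different size average correctly.
        nll_sum += outputs.loss.float().item() * target_len
        n_scored += target_len

    ppl = math.exp(nll_sum / n_scored)
    return ppl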

GPTQ for GGUF and CPU Inference

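While AWQ and GPTQ both target GPU inference, the same models can also run on CPU through GGUF, the format used by llama.cpp. GGUF uses its own quantization schemes (the K-quants listed below) rather than GPTQ's packed weights, so the conversion goes through an FP16 GGUF intermediate and re-quantizes with llama.cpp's own tool. This enables running large models on consumer hardware without any GPU at all.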

import subprocess
import os
from pathlib import Path


def convert_gptq_to_gguf(
    gptq_model_path: str,
    output_dir: str,
    quantization_type: str = "Q4_K_M",
    llama_cpp_dir: str = "./llama.cpp",
) -> str:
    """
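    Convert a HuggingFace model directory to GGUF format for llama.cpp.

    Note: GGUF re-quantizes from an FP16 intermediate; it does not reuse GPTQ's
    packed INT4 weights. The original FP16 checkpoint is the safest input.
    Whether the convert script can read GPTQ-packed tensors depends on the
    llama.cpp version.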

    GGUF format enables:
    - CPU inference with SIMD-accelerated kernels
    - Apple Silicon GPU acceleration via Metal
    - Quantization types optimized for CPU memory access patterns
    - Streaming from disk for models that don't fit in RAM

    Quantization types (GGUF uses different format than AWQ/GPTQ):
        Q4_K_M: 4-bit with K-quants (mixed 4+6 bit) - best accuracy/speed balance
        Q4_K_S: 4-bit K-quants, smaller than Q4_K_M - save ~0.5GB
        Q5_K_M: 5-bit K-quants - better accuracy, larger file
        Q8_0: 8-bit - closest to FP16 quality, largest file

    Speed on Apple M3 Pro (unified memory, 150 GB/s):
        Llama-3.1-8B Q4_K_M: ~55 tok/s
        Llama-3.1-70B Q4_K_M: ~8 tok/s

    Args:
        gptq_model_path: Path to HuggingFace GPTQ model directory
        output_dir: Where to write the .gguf file
        quantization_type: GGUF quantization type (Q4_K_M recommended)
        llama_cpp_dir: Path to cloned llama.cpp repository

    Returns:
        Path to the created .gguf file
    """
    os.makedirs(output_dir, exist_ok=True)
    model_name = Path(gptq_model_path).name
    gguf_path = os.path.join(output_dir, f"{model_name}-{quantization_type}.gguf")

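    # Newer llama.cpp checkouts name the script convert_hf_to_gguf.py;
    # older ones use convert-hf-to-gguf.py. Accept either.
    convert_script = os.path.join(llama_cpp_dir, "convert_hf_to_gguf.py")
    if not os.path.exists(convert_script):
        convert_script = os.path.join(llama_cpp_dir, "convert-hf-to-gguf.py")
    if not os.path.exists(convert_script):
        raise FileNotFoundError(
            f"llama.cpp convert script not found in {llama_cpp_dir}. "
            f"Clone llama.cpp: git clone https://github.com/ggerganov/llama.cpp"
        )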

    # Step 1: Convert HuggingFace model to FP16 GGUF
    fp16_gguf = os.path.join(output_dir, f"{model_name}-fp16.gguf")
    convert_cmd = [
        "python3", convert_script,
        gptq_model_path,
        "--outfile", fp16_gguf,
        "--outtype", "f16",
    ]

    print(f"Converting to GGUF format...")
    result = subprocess.run(convert_cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(f"Conversion failed:\n{result.stderr}")
    print(f"  FP16 GGUF created: {fp16_gguf}")

    # Step 2: Quantize to target format using llama-quantize
    quantize_binary = os.path.join(llama_cpp_dir, "llama-quantize")
    if not os.path.exists(quantize_binary):
        # Try build directory
        quantize_binary = os.path.join(llama_cpp_dir, "build", "bin", "llama-quantize")

    quantize_cmd = [quantize_binary, fp16_gguf, gguf_path, quantization_type]

    print(f"Quantizing to {quantization_type}...")
    result = subprocess.run(quantize_cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(f"Quantization failed:\n{result.stderr}")

    file_size_gb = os.path.getsize(gguf_path) / 1e9
    print(f"  GGUF model created: {gguf_path} ({file_size_gb:.2f} GB)")

    # Clean up FP16 intermediate
    if os.path.exists(fp16_gguf):
        os.remove(fp16_gguf)

    return gguf_path


def run_gguf_inference_python(
    gguf_path: str,
    prompt: str,
    max_tokens: int = 512,
    n_gpu_layers: int = 0,  # 0 = CPU only, -1 = all layers on GPU
    context_length: int = 4096,
) -> str:
    """
    Run inference on a GGUF model using llama-cpp-python.

    Install: pip install llama-cpp-python
    For Metal (Apple Silicon): CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python
    For CUDA: CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python
    (Older releases used -DLLAMA_METAL=on and -DLLAMA_CUBLAS=on.)

    Args:
        n_gpu_layers: Number of model layers to offload to GPU.
            0 = full CPU inference (works on any hardware)
            -1 = all layers on GPU (fastest, requires enough VRAM)
            N = offload N layers (useful for hybrid CPU/GPU when model is too large)
    """
    from llama_cpp import Llama

    llm = Llama(
        model_path=gguf_path,
        n_ctx=context_length,
        n_gpu_layers=n_gpu_layers,
        verbose=False,
    )

    output = llm(
        prompt,
        max_tokens=max_tokens,
        stop=["</s>", "<|eot_id|>"],  # Common stop tokens for Llama-family models
        echo=False,
    )

    return output["choices"][0]["text"]

Mixed-Precision GPTQ: Protecting Sensitive Layers

Not all layers in a transformer tolerate quantization equally well. Research has identified systematic patterns:

First and last transformer layers: The first few and last few layers often have different weight distributions and higher sensitivity to quantization. Quantizing them to 8-bit while using 4-bit for middle layers can improve accuracy with minimal memory overhead.
Attention vs. MLP layers: In some architectures, attention projection layers are more sensitive than MLP layers - the attention mechanism depends on precise relative magnitudes between Q, K, V projections.
Norm-adjacent layers: Layers immediately preceding or following layer normalization often have outlier activations that make 4-bit quantization more lossy.
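
One way to check which layers deserve higher precision is to measure how far each layer's output moves under plain round-to-nearest quantization. The sketch below uses random stand-in weights and activations, and the helper names `rtn_quantize` and `relative_output_error` are illustrative, not part of any library:

```python
import numpy as np


def rtn_quantize(w: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric per-row round-to-nearest (quantize, then dequantize)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero on all-zero rows
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale


def relative_output_error(w: np.ndarray, x: np.ndarray, bits: int) -> float:
    """||X W^T - X Wq^T|| / ||X W^T|| for one linear layer."""
    ref = x @ w.T
    quantized = x @ rtn_quantize(w, bits).T
    return float(np.linalg.norm(ref - quantized) / np.linalg.norm(ref))


rng = np.random.default_rng(0)
x = rng.normal(size=(256, 64))  # stand-in calibration activations
layers = {
    "well_behaved": rng.normal(size=(32, 64)),
    "with_outliers": rng.normal(size=(32, 64)),
}
layers["with_outliers"][:, 0] *= 20.0  # one outlier column stretches every row's scale

for name, w in layers.items():
    err4 = relative_output_error(w, x, bits=4)
    err8 = relative_output_error(w, x, bits=8)
    print(f"{name:>14}: 4-bit err={err4:.3f}  8-bit err={err8:.3f}")
```

On a real model you would replace the random arrays with each layer's weights and captured input activations, then promote the layers with the largest 4-bit error to 8-bit.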

from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from typing import Set


def quantize_mixed_precision_gptq(
    model_name: str,
    output_path: str,
    sensitive_layer_names: Set[str] = None,
    default_bits: int = 4,
    sensitive_bits: int = 8,
    group_size: int = 128,
    n_calibration_samples: int = 128,
) -> None:
    """
    Mixed-precision GPTQ: use 8-bit for sensitive layers, 4-bit for the rest.

    Typical accuracy improvement: 0.3-0.8 percentage points on reasoning tasks
    Memory overhead: ~5-15% more than uniform 4-bit (depends on how many sensitive layers)

    Common sensitive layer patterns (if none are provided, only the first one
    is applied, using hard-coded indices for a 32-layer model):
    - First 2 and last 2 transformer blocks
    - MLP down-projection layers (often larger variance than gate/up)
    - Attention output projection layers

    Args:
        sensitive_layer_names: Set of layer name substrings to use 8-bit.
                               Example: {"model.layers.0", "model.layers.31", "lm_head"}
    """
    from transformers import AutoTokenizer
    from datasets import load_dataset

    tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    # Build calibration data
    dataset = load_dataset("allenai/c4", "en", split="train", streaming=True)
    calibration_data = []
    for item in dataset:
        if len(calibration_data) >= n_calibration_samples:
            break
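        encoded = tokenizer(item["text"], return_tensors="pt",
                           max_length=2048, truncation=True)
        if encoded["input_ids"].shape[1] >= 64:
            # AutoGPTQ's quantize() consumes dict-style examples
            calibration_data.append({
                "input_ids": encoded["input_ids"],
                "attention_mask": encoded["attention_mask"],
            })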

    # Default sensitive layers: first and last 2 blocks
    if sensitive_layer_names is None:
        sensitive_layer_names = {
            "model.layers.0.",
            "model.layers.1.",
            "model.layers.30.",  # Adjust index to n_layers-2
            "model.layers.31.",  # Adjust index to n_layers-1
        }

    # Base config: BaseQuantizeConfig takes a single `bits` value for the whole
    # model, so the per-layer override is attempted on the modules further down.
    quantize_config = BaseQuantizeConfig(
        bits=default_bits,
        group_size=group_size,
        desc_act=False,
        damp_percent=0.01,
    )

    model = AutoGPTQForCausalLM.from_pretrained(
        model_name,
        quantize_config=quantize_config,
        torch_dtype="auto",
        device_map="auto",
    )

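    # Override bits for sensitive layers.
    # Caveat: this only takes effect if the library exposes a `bits` attribute
    # on modules before quantization. Plain nn.Linear modules have none, so
    # check the count below against your auto-gptq version.
    n_overridden = 0
    for name, module in model.named_modules():
        if hasattr(module, "bits"):
            for sensitive_pattern in sensitive_layer_names:
                if sensitive_pattern in name:
                    module.bits = sensitive_bits
                    n_overridden += 1
                    print(f"  {sensitive_bits}-bit: {name}")
                    break
    if n_overridden == 0:
        print(f"Warning: no per-layer overrides applied; result will be uniform {default_bits}-bit")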

    model.quantize(calibration_data)
    model.save_quantized(output_path, use_safetensors=True)
    tokenizer.save_pretrained(output_path)
    print(f"\nMixed-precision GPTQ model saved to {output_path}")

Common Mistakes and Production Pitfalls

:::danger Do Not Skip Calibration Data Validation
The most common GPTQ failure in production is calibration data mismatch. A model fine-tuned on medical question-answering, calibrated with general web text (C4), will lose 5-10% accuracy on medical tasks versus the same model calibrated on medical text - even though both produce similar WikiText-2 perplexity. Always: (1) build calibration data that matches your deployment domain, (2) evaluate on your deployment task, not just standard benchmarks, (3) compare perplexity AND task accuracy, not perplexity alone.
:::

:::danger Never Quantize the LM Head or Embedding Layers
GPTQ by default skips the input embedding table and the language model head (output projection). This is correct behavior - do not override it. The embedding table maps discrete token IDs to continuous vectors; quantizing it introduces token confusion (similar tokens map to similar-but-wrong vectors). The LM head must produce precise logits to correctly rank the next token; quantization noise here directly degrades the output quality for every single generated token. If implementing GPTQ from scratch, explicitly exclude embed_tokens, lm_head, all LayerNorm layers, and all RMSNorm layers from quantization.
:::

:::warning desc_act=True Hurts Inference Speed - Use It Carefully
The desc_act (descending activation) option reorders columns by activation magnitude before quantization, so the highest-impact columns are quantized first. This can improve accuracy slightly, because the most critical weights are quantized while the largest number of remaining columns is still available to absorb their error. However, it requires storing a column permutation table and performing non-sequential memory access during inference, which hurts cache locality and can reduce throughput by 10-20% on most hardware. Set desc_act=False explicitly (library defaults differ) unless you specifically need the accuracy improvement and have benchmarked the throughput impact.
:::

:::tip Use group_size=64 for 3-Bit Quantization
At 3-bit, the quantization grid is extremely coarse (only 8 distinct values). The commonly used group_size=128 is often insufficient: quantization error per group is high enough that accuracy drops significantly. With group_size=64, you get twice as many scales per row and typically recover 0.5-1.0 perplexity points at the cost of ~6% more scale storage overhead. For 3-bit deployment where accuracy is paramount, group_size=64 or even group_size=32 is worth the overhead.
:::

Interview Questions

Q1: Explain the GPTQ algorithm from mathematical foundations. What problem does it solve and how?

Naive INT4 quantization fails because it quantizes each weight independently, ignoring that weights are interconnected. Quantization error in one weight causes output errors that propagate and compound through subsequent layers. GPTQ solves this using the Optimal Brain Surgeon framework from Hassibi and Stork (1993). The key insight is that for a linear layer $Y = XW$ , the sensitivity of the output to weight perturbations is captured by the Hessian $H = 2X^TX$ (the outer product of input activations). After quantizing weight $w_q$ , introducing error $\delta w_q$ , the optimal update to remaining weights that minimizes the resulting output error is $\delta w^* = -(\delta w_q / H^{-1}_{qq}) \cdot H^{-1}_{:,q}$ . GPTQ applies this update sequentially, column by column, ensuring each weight's quantization error is compensated before the next weight is quantized. The result: errors do not accumulate, and the layer output remains close to the FP16 output even after full INT4 quantization. This requires only a forward pass on calibration data to compute H (no backpropagation, no labels), and Cholesky decomposition for efficient H inversion.
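
The column-by-column update can be sketched on a toy layer. This is a simplified illustration, not the reference implementation: it downdates the inverse Hessian explicitly instead of using the Cholesky form, uses one fixed symmetric scale per row, and takes the error as $w_q - \text{quant}(w_q)$. The names `rtn` and `gptq_toy` and the synthetic data are assumptions made for the example.

```python
import numpy as np


def rtn(w: np.ndarray, scale: np.ndarray, qmax: int = 7) -> np.ndarray:
    """Round-to-nearest onto a symmetric INT4 grid, returned dequantized."""
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale


def gptq_toy(W: np.ndarray, X: np.ndarray, damp: float = 0.01) -> np.ndarray:
    """Quantize W (d_out, d_in) column by column with OBS error compensation.

    X is (n_samples, d_in) calibration activations.
    """
    W = W.astype(np.float64).copy()
    d_in = W.shape[1]
    scale = np.abs(W).max(axis=1, keepdims=True) / 7.0  # one fixed scale per row
    H = 2.0 * X.T @ X
    H += damp * np.mean(np.diag(H)) * np.eye(d_in)      # damping for stability
    Hinv = np.linalg.inv(H)
    Q = np.zeros_like(W)

    for q in range(d_in):
        Q[:, q] = rtn(W[:, q:q + 1], scale)[:, 0]
        err = (W[:, q] - Q[:, q]) / Hinv[q, q]
        # Push column q's quantization error onto the not-yet-quantized columns.
        W[:, q + 1:] -= np.outer(err, Hinv[q, q + 1:])
        # Eliminate column q from the inverse Hessian of the remaining weights.
        Hinv -= np.outer(Hinv[:, q], Hinv[q, :]) / Hinv[q, q]
    return Q


rng = np.random.default_rng(0)
# Correlated activations: a low-rank signal plus a little noise.
X = rng.normal(size=(512, 8)) @ rng.normal(size=(8, 32)) + 0.1 * rng.normal(size=(512, 32))
W = rng.normal(size=(16, 32))
scale = np.abs(W).max(axis=1, keepdims=True) / 7.0

Q_gptq = gptq_toy(W, X)
err_rtn = np.linalg.norm(X @ (W - rtn(W, scale)).T)
err_gptq = np.linalg.norm(X @ (W - Q_gptq).T)
print(f"RTN output error:  {err_rtn:.2f}")
print(f"GPTQ output error: {err_gptq:.2f}")
```

Both results live on the same INT4 grid; the only difference is that the compensated version chooses grid points with the layer output in mind. The gap is widest when input dimensions are correlated, since that is when later columns can absorb earlier errors.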

Q2: What is the role of calibration data in GPTQ, and what happens if it is mismatched?

Calibration data serves two functions in GPTQ. First, it provides the input activations needed to compute the Hessian $H = 2X^TX$ for each layer. The Hessian tells GPTQ which input dimensions are consistently activated and at what magnitudes - this determines how much each weight's quantization error is amplified into output error. Second, the Hessian drives the error compensation updates: after quantizing each weight, remaining weights are adjusted in the direction that minimizes the output error on the calibration distribution. If calibration data does not match the deployment distribution, both functions fail: the Hessian reflects the wrong activation patterns, and the compensation updates optimize for the wrong inputs. In practice, a code model calibrated on Wikipedia text will have worse INT4 accuracy on code tasks than the same model calibrated on code - by 5-10% on code-specific benchmarks, even if perplexity on WikiText-2 looks similar. Always match calibration data to deployment domain.

Q3: What is group size in GPTQ, and how does it affect the accuracy-memory tradeoff?

Group size controls the granularity of quantization scales. Within each group, all weights share one scale factor and one zero-point (for asymmetric quantization). With group_size=128, a weight matrix of size 4096×4096 has 4096×(4096/128) = 131,072 scale parameters. With group_size=32, it has 4096×128 = 524,288 scale parameters - 4x more. More scales = finer quantization resolution = less error from outlier weights skewing the scale for an entire group. The tradeoff: smaller groups require storing more scale parameters in FP16, adding memory overhead. For a 70B model at group_size=32 versus group_size=128, the additional scale overhead is roughly 4 GB. In practice: group_size=128 is the correct default for INT4, providing good accuracy with manageable overhead. group_size=64 is worth considering for 3-bit quantization where accuracy is more fragile. group_size=-1 (per-row scaling) is too coarse for 4-bit and should be avoided.
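
The group counts and the overhead estimate can be reproduced with a few lines of arithmetic. The storage assumptions here (FP16 scales at 2 bytes, 4-bit packed zero-points at 0.5 byte per group) are illustrative; actual checkpoint layouts vary by library.

```python
def n_groups(rows: int, cols: int, group_size: int) -> int:
    """Number of quantization groups when groups run along the input dimension."""
    return rows * (cols // group_size)


def metadata_bytes(n_params: float, group_size: int,
                   scale_bytes: float = 2.0, zero_bytes: float = 0.5) -> float:
    """Bytes spent on per-group scales and zero-points for n_params weights."""
    return (n_params / group_size) * (scale_bytes + zero_bytes)


print(n_groups(4096, 4096, 128))  # 131072
print(n_groups(4096, 4096, 32))   # 524288

extra = metadata_bytes(70e9, 32) - metadata_bytes(70e9, 128)
print(f"extra metadata for 70B at g=32 vs g=128: {extra / 1e9:.1f} GB")  # 4.1 GB
```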

Q4: Why is GPTQ typically applied layer by layer rather than globally across the whole model?

Global GPTQ would require inverting the Hessian of the full loss with respect to all weights simultaneously - a matrix of size $(N_{params})^2$ where $N_{params}$ is in the billions for modern LLMs. For a 7B model, this would require storing and inverting a $7B \times 7B$ matrix: about $4.9 \times 10^{19}$ entries, or roughly $10^{20}$ bytes at FP16, far beyond any existing memory or storage system. Layer-wise quantization makes the problem tractable: for each linear layer with $d_{in}$ input dimensions, the Hessian is $d_{in} \times d_{in}$. For $d_{in} = 4096$ that is about 67 MB at FP32, and even a wider MLP down-projection ($d_{in} = 11008$ in a 7B Llama-style model) needs under 500 MB, which is easily invertible. The approximation is justified empirically: the layer-wise output error (not the global loss) is what matters for downstream layers, and minimizing it layer by layer in the forward direction produces near-optimal results in practice.
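
The sizes involved are easy to check directly. The layer widths below (4096 and 11008) are assumed values for a 7B Llama-style model, and `hessian_bytes` is a throwaway helper for this example:

```python
def hessian_bytes(d: float, bytes_per_entry: int = 4) -> float:
    """Memory for a dense d x d Hessian."""
    return d * d * bytes_per_entry


layer_attn = hessian_bytes(4096)                  # attention projection, FP32
layer_mlp = hessian_bytes(11008)                  # MLP down-projection, FP32
global_h = hessian_bytes(7e9, bytes_per_entry=2)  # whole 7B model, FP16

print(f"d_in=4096:  {layer_attn / 1e6:.0f} MB")
print(f"d_in=11008: {layer_mlp / 1e6:.0f} MB")
print(f"7B global:  {global_h:.1e} bytes")
```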

Q5: How does activation reordering (desc_act) affect GPTQ, and when should you use it?

Activation reordering sorts the weight columns by decreasing activation magnitude before quantization. Columns corresponding to frequently large activations are quantized first. The benefit: in the sequential GPTQ algorithm, earlier columns receive compensation from later ones' updates. By quantizing the most critical columns first, you ensure the highest-impact weights are quantized when the full remaining weight budget is still available for compensation. The cost: the permutation must be stored with the model, and inference must perform non-sequential weight access (following the permutation) rather than contiguous reads. This destroys memory access locality - caches work on contiguous memory regions - and typically reduces throughput by 10-20%. The practical conclusion: desc_act provides marginal accuracy improvement (0.1-0.3 perplexity points) at significant throughput cost. AWQ achieves similar activation-importance benefits without the inference overhead, by embedding the importance information into the weight scales at quantization time rather than reordering at runtime. Prefer AWQ over GPTQ with desc_act=True for deployment that cares about both accuracy and throughput.
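
The reordering itself is just a permutation derived from the Hessian diagonal. A minimal sketch with synthetic activations (the per-column scales are made up for illustration) shows how the permutation is built and why it must be stored to undo the reordering later:

```python
import numpy as np

rng = np.random.default_rng(0)
# Six input dimensions with deliberately different activation magnitudes.
col_scale = np.array([0.1, 3.0, 1.0, 0.5, 2.0, 0.2])
X = rng.normal(size=(128, 6)) * col_scale
W = rng.normal(size=(4, 6))

H_diag = 2.0 * np.sum(X * X, axis=0)  # diagonal of H = 2 X^T X
perm = np.argsort(-H_diag)            # most strongly activated columns first
inv_perm = np.argsort(perm)           # undoes the permutation

W_perm = W[:, perm]                   # column order the quantizer would visit
X_perm = X[:, perm]                   # inputs must be gathered the same way

# Permuting weights and inputs together leaves the layer output unchanged.
assert np.allclose(X @ W.T, X_perm @ W_perm.T)
# The stored permutation recovers the original column order.
assert np.array_equal(W_perm[:, inv_perm], W)
print("quantization order:", perm.tolist())
```

The gather `X[:, perm]` is the extra indexed access that inference pays for when the permutation cannot be folded away ahead of time.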

Q6: How would you debug a GPTQ model that shows good perplexity but poor task performance?

This is a calibration-distribution mismatch or task-sensitivity issue. The debugging process: First, compute perplexity on domain-matched text (not WikiText-2 if the model is domain-specific). If domain PPL is significantly worse than general PPL, calibration data was likely mismatched - rebuild calibration set from deployment domain. Second, benchmark task accuracy directly: run the quantized and FP16 models on 200-500 examples from your task. If quantized accuracy is more than 2-3% below FP16, you have a quantization issue. Third, identify which capabilities are most degraded: arithmetic reasoning, multi-step inference, and long-context tasks degrade more from quantization than factual recall or classification. If arithmetic is heavily degraded, try group_size=64 to improve quantization precision for the dense weight clusters that arithmetic relies on. Fourth, check if particular layers are problematic - some models have layers with unusual activation distributions (large outliers) that benefit from mixed-precision: keep those layers at INT8 while quantizing the rest at INT4.
