
GPTQ In Depth

A Model Too Big to Serve

It is September 2022. Meta has just released OPT-175B as an open model. Your team has been waiting for it - a publicly available 175 billion parameter language model that you can fine-tune and deploy without paying OpenAI per token. You download the checkpoint. It is 350 gigabytes in float16. You check your server capacity: four A100 80GB GPUs. You do the math: 350GB of weights alone, plus KV cache, plus activation memory. You need at least five A100s just to load the weights, and realistically six or seven to run inference with reasonable batch sizes.

Your budget covers four GPUs. The model will not fit. And the OPT-175B weights are not going anywhere - retraining is out of the question. You need a way to make the model smaller without making it worse.

This was not a unique problem in autumn 2022. It was the defining infrastructure challenge of the LLM era. Training costs had come down enough that 10B+ parameter models were within reach of well-funded teams. Serving them was a different story. The memory footprint of large transformers was a hard wall that blocked deployment for everyone who was not Google or Microsoft. Naive quantization to INT8 helped somewhat, but beyond roughly 13B parameters the perplexity hit was unacceptable, driven by the outlier activation problem that made LLM quantization uniquely hard.

Then, in late October 2022, Frantar, Ashkboos, Hoefler, and Alistarh posted a preprint titled "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers." The result was startling: they had quantized OPT-175B and BLOOM-176B to INT4 with average WikiText-2 perplexity increases of 0.25 and 0.73 respectively. At INT4, OPT-175B's weights would require roughly 87 gigabytes - four A100 GPUs, with room to spare. The wall had a door.

Within three months, the LLM community had adopted GPTQ as the default quantization method. AutoGPTQ became the standard implementation. Repositories of pre-quantized models flooded HuggingFace Hub. The GPTQ paper is one of the most consequential systems papers in recent ML history, not because of theoretical novelty but because it solved a real deployment problem at exactly the right moment.

Understanding GPTQ is not optional for ML engineers working with LLMs. It is the foundation that most production deployments are built on. This lesson goes from the mathematical core - the Optimal Brain Surgeon framework it is derived from - all the way to the practical details of calibration dataset selection, group size tuning, and serving quantized models at scale with vLLM. No hand-waving. No skipping the math.


Why GPTQ Exists: The Failure of Naive Quantization

Rounding Does Not Work for LLMs

The simplest possible approach to quantization is: take each float16 weight, find the nearest INT4 value, store that instead. This is called Round-to-Nearest (RTN) quantization. For image classification CNNs it works acceptably. For LLMs it fails.

The failure is not primarily about individual weight errors being too large - it is about the cumulative effect of weight rounding errors on the model's output. In a 70-layer transformer, each layer's output is the input to the next. Small errors compound. By the time you reach layer 70, the accumulated effect of rounding errors in all 70 layers produces outputs that diverge significantly from the float16 model.

RTN also ignores which weights matter. In any weight matrix, some weights have much larger impact on the layer output than others, depending on which input features activate frequently. RTN treats all weights equally, applying the same rounding grid regardless of importance. Important weights get the same crude treatment as unimportant ones.
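To make RTN concrete, here is a minimal sketch of per-row symmetric INT4 rounding (a generic illustration of the idea, not any particular library's code):

```python
import torch

def rtn_quantize_int4(W: torch.Tensor) -> torch.Tensor:
    """Round-to-Nearest: per-row symmetric INT4. Every weight is rounded to
    the closest grid point independently; nothing compensates for the error."""
    scale = W.abs().amax(dim=1, keepdim=True) / 7   # map the row's max |w| to level 7
    q = torch.clamp(torch.round(W / scale), -8, 7)  # 16 signed INT4 levels
    return q * scale                                # dequantized approximation of W

W = torch.randn(4096, 4096)
W_hat = rtn_quantize_int4(W)
print("mean |w - w_hat|:", (W - W_hat).abs().mean().item())
```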

The question GPTQ's authors asked was: can we do better than rounding, using only information available after training (no access to gradients, no retraining), and can we do it fast enough to run on a 175B model in hours?

The Lineage: OBS and OBQ

GPTQ is not invented from scratch. It is built on two earlier frameworks from the 1990s: Optimal Brain Damage (LeCun et al., 1990) and Optimal Brain Surgeon (Hassibi and Stork, 1993). These frameworks were designed for network pruning - removing weights to make models smaller - but the mathematical machinery they developed is directly applicable to quantization.

The core idea of Optimal Brain Surgeon (OBS): given a trained network, what is the least damaging way to set some weight to zero? The answer requires computing the change in the loss function $L$ when weight $w_q$ is removed (set to zero). Using a second-order Taylor expansion:

$$\delta L = \frac{1}{2} \delta w^T H \delta w$$

where $H = \frac{\partial^2 L}{\partial w^2}$ is the Hessian of the loss with respect to the weights and $\delta w$ is the weight change. When we set $w_q = 0$, we want to minimize the resulting loss increase by adjusting all other weights. Solving this constrained optimization (set $w_q$ to exactly zero, minimize the total loss increase over all weights) gives the Optimal Brain Surgeon update.

The optimal error for setting $w_q = 0$ is:

$$L_q = \frac{w_q^2}{2[H^{-1}]_{qq}}$$

and the optimal compensating update to the remaining weights is:

$$\delta w = -\frac{w_q}{[H^{-1}]_{qq}} H^{-1}_{:,q}$$

where $H^{-1}_{:,q}$ is the $q$-th column of $H^{-1}$.

This says: when you remove weight $w_q$, you should update all other weights proportionally to the $q$-th column of the inverse Hessian. The inverse Hessian captures the curvature of the loss - how sensitive other weights are to changes, and therefore how they should compensate for the removed weight.

Optimal Brain Quantization (OBQ), introduced by Frantar and Alistarh (2022) just before GPTQ, applied this same framework to quantization instead of pruning. Instead of setting $w_q = 0$, OBQ rounds $w_q$ to its nearest quantized value $\hat{w}_q$. The quantization error is $\delta w_q = w_q - \hat{w}_q$. The compensating update to the remaining weights uses exactly the same formula as OBS:

$$\delta w = -\frac{\delta w_q}{[H^{-1}]_{qq}} H^{-1}_{:,q}$$

OBQ works, but its runtime grows roughly cubically with the layer dimension - far too slow for production models with billions of parameters.
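The compensation formula is easy to state in code. Below is a minimal sketch for a single row of weights, assuming the inverse Hessian $H^{-1}$ is already computed and using a small fixed grid as a stand-in for the real quantizer:

```python
import torch

def obq_step(w: torch.Tensor, H_inv: torch.Tensor, q: int, grid: torch.Tensor) -> torch.Tensor:
    """Quantize w[q] to its nearest grid value and apply the OBS/OBQ update
    delta_w = -(err / H_inv[q, q]) * H_inv[:, q] to compensate the other weights."""
    w = w.clone()
    w_hat_q = grid[(grid - w[q]).abs().argmin()]   # nearest quantized value
    err = w[q] - w_hat_q                           # quantization error delta_w_q
    w -= (err / H_inv[q, q]) * H_inv[:, q]         # the q-th entry lands on w_hat_q
    w[q] = w_hat_q                                 # pin it against float drift
    return w
```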


The GPTQ Algorithm

The Key Insight: Layer Independence

GPTQ's first crucial observation is that we do not need to minimize the global loss $L$ over the entire model. We can minimize the layer-wise reconstruction error independently for each linear layer:

$$\min_{\hat{W}} \|WX - \hat{W}X\|_F^2$$

where $W$ is the original float16 weight matrix, $\hat{W}$ is the quantized weight matrix, $X$ is the matrix of input activations to this layer (from the calibration data), and $\|\cdot\|_F$ is the Frobenius norm.

This decomposition is possible because the quantization error in one layer affects subsequent layers only through the change in that layer's output. If we can minimize the change in each layer's output independently, the overall model output will be well-preserved. This layer-wise approach makes the problem tractable.

For a single linear layer with weight matrix $W \in \mathbb{R}^{d_{out} \times d_{in}}$ and input $X \in \mathbb{R}^{d_{in} \times n}$ (where $n$ is the number of calibration tokens), the objective becomes:

$$\min_{\hat{W}} \|WX - \hat{W}X\|_F^2 = \min_{\hat{W}} \sum_i \|(W - \hat{W}) x_i\|_2^2$$

The Hessian for a Linear Layer

For this layer-wise objective, the Hessian with respect to the weights $W$ has a remarkably clean form. Since the objective is quadratic in $W$:

$$H = 2XX^T$$

This is just twice the autocorrelation matrix of the input activations. Computing it requires one forward pass of the calibration data through the model (recording the inputs to each linear layer). The matrix $H \in \mathbb{R}^{d_{in} \times d_{in}}$ tells you the curvature of the layer reconstruction loss with respect to each weight.

The diagonal element $H_{ii}$ represents how frequently (and how strongly) the $i$-th input dimension is activated across the calibration data. A large $H_{ii}$ means the $i$-th weight column has high importance - its quantization error will be amplified by frequent activations.

In practice, $H$ is accumulated with a $1/n$ normalization, which does not change the GPTQ updates (they depend only on ratios of $H^{-1}$ entries):

$$H = \frac{2}{n} \sum_{i=1}^n x_i x_i^T$$

where $n$ is the number of calibration tokens (roughly one million tokens for the standard GPTQ configuration of 512 samples of length 2048).
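As a sketch of how $H$ can be accumulated in practice: hook the layer's input during calibration forward passes and sum the outer products. (AutoGPTQ does this internally while running calibration batches through the full model; here a single layer is driven directly for brevity.)

```python
import torch

def collect_layer_hessian(layer: torch.nn.Linear, calibration_inputs) -> torch.Tensor:
    """Accumulate H = (2/n) * sum_i x_i x_i^T from the inputs seen by one linear layer."""
    d_in = layer.in_features
    H = torch.zeros(d_in, d_in)
    n_seen = 0

    def hook(_module, inputs, _output):
        nonlocal n_seen
        x = inputs[0].reshape(-1, d_in).float()   # (tokens, d_in)
        H.add_(2.0 * x.T @ x)
        n_seen += x.shape[0]

    handle = layer.register_forward_hook(hook)
    with torch.no_grad():
        for batch in calibration_inputs:          # tensors with last dim d_in
            layer(batch)                          # in real use: run the full model forward
    handle.remove()
    return H / max(n_seen, 1)
```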

Column-by-Column Quantization with Error Compensation

GPTQ quantizes the weight matrix $W$ column by column, left to right. For each column $j$:

  1. Quantize column $j$: compute $\hat{W}_{:,j} = \text{quantize}(W_{:,j})$ using the chosen quantization grid.
  2. Compute the quantization error: $\epsilon_j = W_{:,j} - \hat{W}_{:,j}$
  3. Compensate the remaining columns: for all columns $k > j$, update:

$$W_{:,k} \leftarrow W_{:,k} - \epsilon_j \cdot \frac{[H^{-1}]_{jk}}{[H^{-1}]_{jj}}$$

The factors $\frac{[H^{-1}]_{jk}}{[H^{-1}]_{jj}}$ come from the $j$-th row of $H^{-1}$, normalized by its diagonal entry. They tell each remaining column how much of column $j$'s quantization error it should absorb.

The update has a clear intuition: if quantizing column $j$ would increase the layer output error, compensate by making small adjustments to correlated columns (those with large $[H^{-1}]_{jk}$) that can absorb the error without hurting the output.

Cholesky Decomposition for Numerical Stability

A critical engineering detail: directly computing $H^{-1}$ and reading off its entries is numerically unstable. The matrix $H = 2XX^T$ can be rank-deficient (if $n < d_{in}$, or if input features are linearly dependent), its condition number can make direct inversion unreliable, and floating-point errors accumulated column by column lead to drift.

GPTQ solves this by computing the Cholesky decomposition of $H^{-1}$ at the start of the quantization process and reading off the needed values during the column-by-column loop:

$$H^{-1} = L L^T \quad (\text{Cholesky decomposition of } H^{-1})$$

The $j$-th step of the quantization loop reads from the $j$-th row of $L^T$, which is numerically stable because the Cholesky factorization is well-conditioned. A small damping term $\lambda I$ (typically $\lambda = 0.01 \cdot \text{mean}(\text{diag}(H))$) is added to $H$ before inversion to handle the rank-deficiency issue:

$$H_{\text{damped}} = H + \lambda I$$

This is the damp_percent parameter in AutoGPTQ (default 0.01 = 1%).
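Putting the pieces together - damping, inversion via Cholesky, and the column loop - here is a heavily simplified sketch. The real implementation processes columns in blocks and operates on the Cholesky factor of $H^{-1}$ directly rather than materializing the inverse; `quantize_column` is a stand-in for the per-group scale/zero-point logic.

```python
import torch

def gptq_quantize(W: torch.Tensor, H: torch.Tensor, quantize_column, damp_percent: float = 0.01):
    """Simplified GPTQ: quantize W (d_out x d_in) column by column, compensating
    not-yet-quantized columns with the inverse-Hessian update."""
    d_in = W.shape[1]
    W = W.clone().float()

    # Damping for numerical stability: H <- H + lambda * I
    lam = damp_percent * torch.diag(H).mean()
    H_damped = H + lam * torch.eye(d_in, dtype=H.dtype)

    # Invert via Cholesky (more stable than a plain inverse)
    H_inv = torch.cholesky_inverse(torch.linalg.cholesky(H_damped))

    for j in range(d_in):
        q_col = quantize_column(W[:, j])                       # round column j to its grid
        err = (W[:, j] - q_col) / H_inv[j, j]                  # scaled quantization error
        W[:, j] = q_col
        if j + 1 < d_in:
            W[:, j + 1:] -= torch.outer(err, H_inv[j, j + 1:])  # spread error over remaining columns
    return W
```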

Group Quantization: Local Scale Factors

Standard INT4 quantization uses one scale factor per row (per-channel quantization). GPTQ with group size 128 uses one scale factor per group of 128 consecutive weights in a row. For a weight matrix with $d_{in} = 4096$, this means 32 groups per row instead of 1, giving 32x more granular quantization.

The benefit is significant: within a group of 128 weights, all weights share a common scale factor. If those 128 weights have a consistent magnitude range, the quantization grid fits them well. If they do not (some large, some small), a single group-level scale absorbs the heterogeneity. Group quantization typically reduces perplexity degradation by 0.1-0.3 points compared to per-channel quantization.

The cost: scale factors add overhead. For each group, you store one FP16 scale and optionally one FP16 zero-point. For a 4096x4096 weight matrix with group size 128, the scale factors alone require $(4096 / 128) \times 4096 \times 2$ bytes $= 32 \times 4096 \times 2$ bytes $\approx 256$ KB, compared to the quantized weights at $4096 \times 4096 / 2$ bytes $\approx 8$ MB. Scale overhead is about 3% of weight storage at group size 128.
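A quick way to reproduce these overhead numbers for any matrix shape and group size (plain arithmetic, counting only the FP16 scales as in the example above):

```python
def scale_overhead_pct(d_in: int, d_out: int, group_size: int, bits: int = 4,
                       scale_bytes: int = 2) -> float:
    """Per-group scale storage as a percentage of packed INT-weight storage."""
    weight_bytes = d_out * d_in * bits / 8
    scale_count = d_out * (d_in // group_size)
    return 100 * scale_count * scale_bytes / weight_bytes

print(scale_overhead_pct(4096, 4096, 128))   # ~3.1%
print(scale_overhead_pct(4096, 4096, 32))    # ~12.5%
```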

The relationship between group size and quality is approximately:

| Group size | Relative quality | Scale overhead |
|---|---|---|
| Per-tensor (whole matrix) | Worst | Minimal |
| Per-channel (per-row) | Baseline | ~0.1% |
| 256 | Better | ~1.5% |
| 128 (standard) | Standard | ~3% |
| 64 | Better still | ~6% |
| 32 | Best | ~12% |

The actorder Trick

The default GPTQ algorithm quantizes columns in natural order: column 0, column 1, ... column $d_{in}-1$. But columns are not equally important. The diagonal of the Hessian, $H_{ii}$, tells you how strongly the $i$-th input dimension activates and therefore how much quantization error in column $i$ affects the layer output.

The actorder trick (also called desc_act - descending activation order) reorders columns in decreasing order of $H_{ii}$ before applying GPTQ. The most important columns (largest $H_{ii}$) are quantized first, while all the remaining columns are still at (or close to) full precision and can absorb the compensation, and before rounding error has accumulated. The least important columns are quantized last, when imperfect compensation matters least.

Mathematically: let $\pi$ be the permutation that sorts columns by decreasing $H_{ii}$. Apply GPTQ to the columns in order $\pi$. After quantization, apply the inverse permutation to restore the original column order.
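The reordering itself is a one-liner on the Hessian diagonal; a sketch of just that step:

```python
import torch

def actorder_permutation(H: torch.Tensor):
    """Column ordering used by desc_act: sort by descending Hessian diagonal.
    Returns the permutation and its inverse (to restore the original order
    after quantization)."""
    perm = torch.argsort(torch.diag(H), descending=True)
    inv_perm = torch.argsort(perm)
    return perm, inv_perm

# Usage sketch: run GPTQ on W[:, perm] column by column, then either store perm
# with the checkpoint or undo the reordering with W_q = W_q[:, inv_perm].
```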

The quality improvement from actorder is consistent: 0.1-0.3 perplexity points at 4-bit. For 7B models, this difference is practically significant. For 70B models, it is smaller but still measurable.

The latency cost: inference requires an extra permutation step during weight dequantization. The permutation index must be stored with the model. In practice this adds 5-10% to inference latency. Whether that tradeoff is worth it depends on your quality requirements.


The Math of Quantization Grids

Symmetric vs. Asymmetric Quantization

For INT4 quantization, you are mapping float16 values to 16 discrete levels ($2^4 = 16$). There are two conventions:

Symmetric: levels are $\{-7, -6, ..., 0, ..., 6, 7\}$ (or $\{-8, ..., 7\}$ using the full signed INT4 range). The scale is $s = \max(|W|) / 7$. Quantization: $W_q = \text{round}(W / s)$. Dequantization: $\hat{W} = s \cdot W_q$.

Asymmetric: levels are $\{0, 1, ..., 15\}$ with a separate zero-point. The formula: $W_q = \text{round}(W / s) + z$ where $z$ is the zero-point. This allows the quantization range to be asymmetric, covering the actual range of $W$ more efficiently when the weights are skewed (not centered at zero).

AutoGPTQ uses sym=False (asymmetric) by default because transformer weight distributions, while approximately normal, are rarely perfectly centered. Asymmetric quantization recovers a few extra bits of effective precision by fitting the range exactly.

The full per-group quantization with asymmetry, for a group $g$ of weights $W_g$:

$$s_g = \frac{\max(W_g) - \min(W_g)}{2^{\text{bits}} - 1}$$

$$z_g = -\text{round}\left(\frac{\min(W_g)}{s_g}\right)$$

$$W_{q,g} = \text{clamp}\left(\text{round}\left(\frac{W_g}{s_g}\right) + z_g,\ 0,\ 2^{\text{bits}}-1\right)$$

Dequantization: $\hat{W}_g = s_g \cdot (W_{q,g} - z_g)$
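A direct translation of these formulas into code, as a sketch (ignoring edge cases such as a group whose weights are all identical):

```python
import torch

def quantize_group_asym(w_g: torch.Tensor, bits: int = 4):
    """Asymmetric per-group quantization of a 1-D group of weights,
    following the scale / zero-point formulas above."""
    qmax = 2 ** bits - 1
    w_min, w_max = w_g.min(), w_g.max()
    scale = (w_max - w_min) / qmax
    zero_point = torch.round(-w_min / scale)
    q = torch.clamp(torch.round(w_g / scale) + zero_point, 0, qmax)
    return q, scale, zero_point

def dequantize_group(q, scale, zero_point):
    return scale * (q - zero_point)

# Round-trip example on one group of 128 weights
w = torch.randn(128) * 0.02 + 0.003
q, s, z = quantize_group_asym(w)
w_hat = dequantize_group(q, s, z)
print("max abs error:", (w - w_hat).abs().max().item())
```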

Expected Quantization Error

For a group of weights with range $R = \max(W_g) - \min(W_g)$ and $k$ quantization bits, the expected round-trip quantization error (mean squared error) is approximately:

$$\mathbb{E}[(W - \hat{W})^2] \approx \frac{R^2}{12 \cdot (2^k - 1)^2}$$

This is the standard quantization noise formula for uniform quantization. At 4-bit ($k=4$), the denominator factor is $12 \times 225 = 2700$. At 8-bit ($k=8$), it is $12 \times 65025 \approx 780{,}000$. Moving from INT8 to INT4 increases the expected quantization error by roughly 289x for the same range, which is why group size matters so much: smaller groups have a smaller $R$, directly reducing the numerator.
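A quick sanity check of the formula against measured round-trip error on uniformly distributed values (a synthetic check, not a model measurement):

```python
import torch

def empirical_vs_predicted_mse(n: int = 100_000, bits: int = 4):
    """Compare the uniform-quantization noise formula R^2 / (12 * (2^bits - 1)^2)
    against measured round-trip error on uniform random values in [0, 1]."""
    w = torch.rand(n)                      # range R = 1
    qmax = 2 ** bits - 1
    scale = 1.0 / qmax
    w_hat = torch.round(w / scale) * scale
    measured = (w - w_hat).pow(2).mean().item()
    predicted = 1.0 / (12 * qmax ** 2)
    return measured, predicted

print(empirical_vs_predicted_mse(bits=4))   # both ~3.7e-4
print(empirical_vs_predicted_mse(bits=8))   # both ~1.3e-6
```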


Practical: Quantizing LLaMA 3 70B with AutoGPTQ

Environment Setup

# Install required packages (quote the version specifiers so the shell
# does not treat ">" as a redirect)
pip install "auto-gptq>=0.7.1" "transformers>=4.40.0" accelerate
pip install optimum  # For some GPTQ loading utilities

# For Triton-accelerated inference (faster on A100/H100)
pip install "triton>=2.1.0"

# Verify GPU memory before starting
python -c "import torch; print(f'GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB')"

Calibration Data Preparation

"""
Prepare calibration data for GPTQ quantization.
The choice of calibration data has measurable impact on quantization quality.
"""

import random
import torch
from transformers import AutoTokenizer
from datasets import load_dataset

def prepare_wikitext_calibration(
model_id: str,
n_samples: int = 512,
seq_len: int = 2048,
seed: int = 42,
) -> list[dict]:
"""
Standard WikiText-2 calibration data.
This is the academic standard - use it for benchmark comparisons.
"""
random.seed(seed)
tokenizer = AutoTokenizer.from_pretrained(model_id)

dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

# Remove empty lines and join
text = "\n\n".join(
[line for line in dataset["text"] if len(line.strip()) > 0]
)

tokens = tokenizer.encode(text, add_special_tokens=False)
print(f"Total calibration tokens available: {len(tokens):,}")

samples = []
for _ in range(n_samples):
start = random.randint(0, len(tokens) - seq_len - 1)
chunk = tokens[start : start + seq_len]
samples.append({"input_ids": chunk, "attention_mask": [1] * seq_len})

return samples

def prepare_domain_calibration(
model_id: str,
domain: str = "code", # "code", "medical", "legal", "finance"
n_samples: int = 512,
seq_len: int = 2048,
) -> list[dict]:
"""
Domain-specific calibration data for better in-domain quantization quality.

For code models: use code datasets (The Stack, CodeParrot)
For medical: use PubMed abstracts
For general chat: use ShareGPT or OpenAssistant conversations
"""
tokenizer = AutoTokenizer.from_pretrained(model_id)

dataset_map = {
"code": ("codeparrot/github-code", "train", "code"),
"medical": ("pubmed_qa", "pqa_labeled", "long_answer"),
"finance": ("financial_phrasebank", "train", "sentence"),
}

dataset_name, split, text_column = dataset_map.get(
domain, ("wikitext", "wikitext-2-raw-v1", "text")
)

dataset = load_dataset(dataset_name, split=split)
text = "\n\n".join(
[str(item[text_column]) for item in dataset.select(range(min(10000, len(dataset))))]
)

tokens = tokenizer.encode(text, add_special_tokens=False)

samples = []
for _ in range(n_samples):
start = random.randint(0, max(0, len(tokens) - seq_len - 1))
chunk = tokens[start : start + seq_len]
samples.append({"input_ids": chunk, "attention_mask": [1] * len(chunk)})

return samples

Full GPTQ Quantization Script

"""
Complete GPTQ quantization of LLaMA 3 70B.
Hardware requirements: 80GB+ VRAM for 70B (one A100 80GB or two 40GB)
Expected time: 1-4 hours for 70B, 20-40 minutes for 8B
"""

import torch
import logging
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Configuration
MODEL_ID = "meta-llama/Meta-Llama-3-70B-Instruct"
OUTPUT_DIR = "./llama3-70b-gptq-int4-gs128"

def quantize_model(
model_id: str,
output_dir: str,
bits: int = 4,
group_size: int = 128,
desc_act: bool = True, # actorder - better quality, slight latency cost
sym: bool = False, # asymmetric quantization (better range coverage)
damp_percent: float = 0.01, # Hessian damping for numerical stability
n_calibration_samples: int = 512,
calibration_seq_len: int = 2048,
):
"""
Quantize a model to GPTQ INT4 with best-practice settings.

Parameter notes:
- bits=4: INT4, sweet spot for memory vs quality
- group_size=128: standard, good quality/overhead tradeoff
- desc_act=True: actorder, recommended for best quality
- sym=False: asymmetric, handles non-symmetric weight distributions
- damp_percent=0.01: 1% damping, prevents numerical instability
"""

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token

# Configure quantization
quantize_config = BaseQuantizeConfig(
bits=bits,
group_size=group_size,
desc_act=desc_act,
sym=sym,
damp_percent=damp_percent,
model_seqlen=calibration_seq_len,
)

logger.info(f"Loading model {model_id} for quantization...")
logger.info(f"Config: {bits}-bit, group_size={group_size}, actorder={desc_act}")

# Load with CPU offloading for models that don't fit in a single GPU
model = AutoGPTQForCausalLM.from_pretrained(
model_id,
quantize_config=quantize_config,
low_cpu_mem_usage=True,
torch_dtype=torch.float16,
)

# Prepare calibration data
logger.info("Preparing calibration data...")
calibration_data = prepare_wikitext_calibration(
model_id,
n_samples=n_calibration_samples,
seq_len=calibration_seq_len,
)
logger.info(f"Calibration set: {len(calibration_data)} samples x {calibration_seq_len} tokens")

# Run quantization
logger.info("Starting GPTQ quantization...")
logger.info("This will take 1-4 hours for 70B models. Progress logged per layer.")

model.quantize(
calibration_data,
cache_examples_on_gpu=True, # Cache calibration activations on GPU
# Set to False if GPU OOM during quantization
)

logger.info(f"Quantization complete. Saving to {output_dir}")

# Save in safetensors format (more secure than pickle)
model.save_quantized(output_dir, use_safetensors=True)
tokenizer.save_pretrained(output_dir)

# Save a config note about the quantization
with open(f"{output_dir}/quantization_info.txt", "w") as f:
f.write(f"Source model: {model_id}\n")
f.write(f"Bits: {bits}\n")
f.write(f"Group size: {group_size}\n")
f.write(f"Actorder: {desc_act}\n")
f.write(f"Symmetric: {sym}\n")
f.write(f"Calibration samples: {n_calibration_samples}\n")
f.write(f"Calibration seq len: {calibration_seq_len}\n")

logger.info("Done.")
return output_dir

if __name__ == "__main__":
output = quantize_model(
MODEL_ID,
OUTPUT_DIR,
bits=4,
group_size=128,
desc_act=True,
)
print(f"Model saved to: {output}")

Loading and Serving GPTQ Models

"""
Load a GPTQ model for inference and measure quality metrics.
"""

import torch
import time
import math
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer
from datasets import load_dataset

GPTQ_MODEL_DIR = "./llama3-70b-gptq-int4-gs128"
# Or use a pre-quantized HuggingFace model:
# GPTQ_MODEL_DIR = "TheBloke/LLaMA-3-70B-Instruct-GPTQ"

def load_gptq_for_inference(
model_dir: str,
device: str = "cuda:0",
use_triton: bool = False,
):
"""
Load a GPTQ model optimized for inference.

use_triton=True: Faster on A100/H100 but requires triton installation.
Recommended for production, requires testing.
use_triton=False: Uses CUDA kernels, more compatible, slightly slower.
"""
tokenizer = AutoTokenizer.from_pretrained(model_dir)

model = AutoGPTQForCausalLM.from_quantized(
model_dir,
device=device,
use_triton=use_triton,
inject_fused_attention=True, # Fused attention kernel (speedup)
inject_fused_mlp=True, # Fused MLP kernel (speedup)
use_safetensors=True,
trust_remote_code=False,
)

model.eval()
return model, tokenizer

def measure_perplexity_wikitext2(model, tokenizer, stride: int = 512) -> float:
"""
Compute WikiText-2 test perplexity using sliding window evaluation.
This is the standard benchmark for quantization quality.

stride: How much to advance the window each step.
Smaller = more accurate but slower.
"""
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
text = "\n\n".join(dataset["text"])

encodings = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
seq_len = 2048 # Standard evaluation context length
n_tokens = encodings.size(1)

nlls = []
prev_end = 0

for begin_loc in range(0, n_tokens, stride):
end_loc = min(begin_loc + seq_len, n_tokens)
trg_len = end_loc - prev_end # Tokens to actually evaluate in this window

input_ids = encodings[:, begin_loc:end_loc]
target_ids = input_ids.clone()
target_ids[:, :-trg_len] = -100 # Mask prefix tokens

with torch.no_grad():
outputs = model(input_ids, labels=target_ids)
neg_log_likelihood = outputs.loss * trg_len

nlls.append(neg_log_likelihood)
prev_end = end_loc

if end_loc >= n_tokens:
break

perplexity = torch.exp(torch.stack(nlls).sum() / n_tokens)
return perplexity.item()

def benchmark_generation_speed(
model,
tokenizer,
prompt: str,
n_new_tokens: int = 100,
n_warmup: int = 3,
) -> dict:
"""Measure generation throughput in tokens per second."""
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Warmup runs
for _ in range(n_warmup):
with torch.no_grad():
_ = model.generate(**inputs, max_new_tokens=20, do_sample=False)

torch.cuda.synchronize()
start = time.perf_counter()

with torch.no_grad():
outputs = model.generate(
**inputs,
max_new_tokens=n_new_tokens,
do_sample=False,
temperature=1.0,
pad_token_id=tokenizer.eos_token_id,
)

torch.cuda.synchronize()
elapsed = time.perf_counter() - start

n_generated = outputs.shape[1] - inputs.input_ids.shape[1]
memory_gb = torch.cuda.memory_allocated() / 1e9

return {
"tokens_per_sec": n_generated / elapsed,
"latency_ms": elapsed * 1000,
"n_tokens": n_generated,
"memory_gb": memory_gb,
}

# Load model and run benchmarks
model, tokenizer = load_gptq_for_inference(GPTQ_MODEL_DIR)

print("Computing WikiText-2 perplexity...")
ppl = measure_perplexity_wikitext2(model, tokenizer)
print(f"Perplexity: {ppl:.3f}")

prompt = "Explain the transformer attention mechanism step by step:"
stats = benchmark_generation_speed(model, tokenizer, prompt)
print(f"\nGeneration speed: {stats['tokens_per_sec']:.1f} tokens/sec")
print(f"Memory footprint: {stats['memory_gb']:.2f} GB")

Serving GPTQ with vLLM

"""
Production serving of GPTQ models with vLLM.
vLLM provides continuous batching, PagedAttention, and CUDA graph optimization.
"""

from vllm import LLM, SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
import asyncio

# Single-process serving (testing and small scale)
def serve_gptq_simple(model_path: str, tensor_parallel_size: int = 1):
llm = LLM(
model=model_path,
quantization="gptq",
dtype="float16",
max_model_len=8192,
gpu_memory_utilization=0.90,
tensor_parallel_size=tensor_parallel_size,
# For actorder models, vLLM handles the permutation automatically
)
return llm

# Usage
llm = serve_gptq_simple(
"TheBloke/LLaMA-3-70B-Instruct-GPTQ",
tensor_parallel_size=2, # Split across 2 GPUs for 70B INT4
)

sampling_params = SamplingParams(
temperature=0.7,
top_p=0.95,
max_tokens=512,
)

prompts = [
"Explain gradient descent in plain English:",
"What is the difference between L1 and L2 regularization?",
"Write a Python function to compute cosine similarity:",
]

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
prompt = output.prompt
generated = output.outputs[0].text
print(f"\nPrompt: {prompt[:60]}...")
print(f"Response: {generated[:200]}...")

# Async serving for production (handles concurrent requests)
async def serve_gptq_async(model_path: str):
engine_args = AsyncEngineArgs(
model=model_path,
quantization="gptq",
dtype="float16",
max_model_len=8192,
gpu_memory_utilization=0.88,
tensor_parallel_size=2,
max_num_seqs=256, # Max concurrent sequences
max_num_batched_tokens=8192,
)

engine = AsyncLLMEngine.from_engine_args(engine_args)
return engine

Understanding Quality Degradation

Where GPTQ Loses Quality

Not all layers lose quality equally. GPTQ's error compensation works best when the Hessian has high variance - some weights clearly matter more than others. When weights are roughly equally important (uniform Hessian diagonal), there is less to gain from selective compensation.

Empirically, GPTQ shows higher perplexity degradation on:

  • Early transformer layers (layers 0-4 in large models): These have high sensitivity because their outputs propagate through all subsequent layers. Errors here amplify.
  • Attention query and key projections: These are sensitive because small errors in Q and K matrices change attention patterns nonlinearly.
  • Models below 7B parameters: Smaller models have less redundancy. Each weight is more essential, leaving less room for error compensation.
  • Embedding layers: Often excluded from quantization entirely because vocabulary embeddings are used across all positions and errors accumulate over the sequence.

Measuring Degradation Beyond Perplexity

"""
Task-specific evaluation to measure GPTQ quality degradation beyond perplexity.
Uses lm-evaluation-harness for standardized benchmarks.
"""

# Install: pip install lm-eval
# Run from command line:

# Evaluate FP16 baseline on MMLU (academic knowledge test)
# python -m lm_eval --model hf \
# --model_args pretrained=meta-llama/Meta-Llama-3-8B \
# --tasks mmlu \
# --num_fewshot 5 \
# --output_path ./results/fp16_mmlu.json

# Evaluate GPTQ INT4 on the same benchmark
# python -m lm_eval --model hf \
# --model_args pretrained=./llama3-8b-gptq-int4,load_in_4bit=False \
# --tasks mmlu \
# --num_fewshot 5 \
# --output_path ./results/gptq_mmlu.json

# Python comparison script
import json

def compare_benchmark_results(fp16_path: str, quantized_path: str):
    with open(fp16_path) as f:
        fp16 = json.load(f)
    with open(quantized_path) as f:
        quant = json.load(f)

    tasks = list(fp16["results"].keys())

    print(f"{'Task':<40} {'FP16':>8} {'GPTQ INT4':>10} {'Delta':>8}")
    print("-" * 70)

    total_delta = 0
    n_tasks = 0

    for task in sorted(tasks):
        # Extract primary metric (acc_norm or acc)
        fp16_metric = fp16["results"][task].get("acc_norm", fp16["results"][task].get("acc", 0))
        quant_metric = quant["results"][task].get("acc_norm", quant["results"][task].get("acc", 0))

        delta = quant_metric - fp16_metric
        total_delta += delta
        n_tasks += 1

        flag = " [!]" if abs(delta) > 0.02 else ""  # Flag >2% degradation
        print(f"{task:<40} {fp16_metric:>8.3f} {quant_metric:>10.3f} {delta:>+8.3f}{flag}")

    avg_delta = total_delta / n_tasks
    print("-" * 70)
    print(f"{'Average':<40} {'':>8} {'':>10} {avg_delta:>+8.3f}")
    print(f"\nOverall degradation: {avg_delta:+.3f} accuracy points")
    if abs(avg_delta) < 0.01:
        print("Assessment: Production-safe (< 1% average degradation)")
    elif abs(avg_delta) < 0.02:
        print("Assessment: Acceptable (1-2% average degradation)")
    else:
        print("Assessment: Investigate before deploying (> 2% average degradation)")



Production Engineering Notes

Memory Requirements for Quantization

Quantizing a model requires more GPU memory than loading the quantized result. During the GPTQ process, each layer's original float16 weights must be loaded, the Hessian computed, and the quantized weights written. For a 70B model:

  • Loading the FP16 model: 140GB
  • Hessian computation buffer (largest layer, $H \in \mathbb{R}^{d_{in} \times d_{in}}$): for $d_{in} = 8192$, $H$ is $8192 \times 8192 \times 4$ bytes $\approx 256$ MB
  • Calibration activations cache: $512 \times 2048 \times 8192 \times 2$ bytes $\approx 16$ GB peak

In practice, quantizing a 70B model with the standard configuration needs roughly 160-180GB of memory in total, but most of it can sit in CPU RAM: AutoGPTQ keeps the FP16 checkpoint on the CPU and moves one decoder block at a time to the GPU during quantization. A single A100 80GB is therefore workable on the GPU side, provided the host has enough CPU RAM to hold the full FP16 model; without that CPU offloading you need two A100 80GB cards.

For 8B models: approximately 30-40GB peak, fits on a single A100 40GB.
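For planning, here is a rough back-of-the-envelope estimator of the resulting checkpoint size. It counts only packed weights plus per-group FP16 scales and zero-points, ignoring embeddings, norms, and metadata, so treat it as approximate:

```python
def estimate_gptq_checkpoint_gb(n_params: float, bits: int = 4, group_size: int = 128,
                                scale_bytes: int = 2, zero_bytes: int = 2) -> float:
    """Approximate on-disk size of a GPTQ checkpoint in GB."""
    weight_bytes = n_params * bits / 8
    overhead_bytes = (n_params / group_size) * (scale_bytes + zero_bytes)
    return (weight_bytes + overhead_bytes) / 1e9

print(f"70B @ INT4, g=128: ~{estimate_gptq_checkpoint_gb(70e9):.0f} GB")  # ~37 GB
print(f"8B  @ INT4, g=128: ~{estimate_gptq_checkpoint_gb(8e9):.0f} GB")   # ~4 GB
```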

The cache_examples_on_gpu Tradeoff

AutoGPTQ's cache_examples_on_gpu=True (default) keeps the calibration activations on GPU throughout quantization. This is faster but requires more VRAM. Set cache_examples_on_gpu=False if you hit OOM during quantization - it will move activations to CPU between layers at a significant speed cost.

Pre-quantized Models on HuggingFace Hub

For common base models, pre-quantized GPTQ checkpoints are available on HuggingFace Hub. Using these saves significant time and compute. As of 2024-2025, the primary sources are:

  • TheBloke repository: Extensive collection of GPTQ models, typically quantized at 4-bit with both actorder and non-actorder variants
  • hugging-quants: Official quantizations for major open models
  • Model-specific organization pages: LLaMA 3, Mistral, Qwen, and others often publish official INT4 checkpoints

Always verify the quantization parameters match your requirements (group size, actorder, bits) before deploying a pre-quantized model.

GPTQ vs. AWQ: When to Choose Which

GPTQ and AWQ are both production-grade methods for GPU serving. The practical guidance:

For 70B+ models: GPTQ with actorder typically shows slightly better perplexity than AWQ at INT4 because the second-order error compensation is more valuable when there are more weights to compensate across.

For 7B-13B models: AWQ typically equals or beats GPTQ quality at INT4, with much faster quantization time (minutes vs. hours).

If a pre-quantized checkpoint is available for your model and its settings (bits, group size, actorder, calibration domain) match your requirements, use it rather than quantizing from scratch - the result is effectively the same and you save hours of compute.


Common Mistakes

:::danger Using Too Few Calibration Samples

The minimum effective calibration set for GPTQ is 128 samples of 2048 tokens (262,144 total calibration tokens). Using fewer samples makes the Hessian estimate noisy and the error compensation less effective. The standard recommendation is 512 samples.

More is not always better: beyond 1024 samples, returns diminish and quantization time increases linearly. The sweet spot is 256-512 samples for most models.

Never use a calibration set smaller than 128 samples for models larger than 7B - the Hessian estimate will not be stable enough for reliable error compensation. :::

:::danger Forgetting to Evaluate After Quantization

It is easy to quantize a model, see "quantization complete" in the logs, load it, and assume it is fine. GPTQ can fail silently in several ways: numerical instability in the Hessian computation (especially with damp_percent too low), calibration data that is too short or too repetitive leading to a degenerate Hessian, and precision issues with certain model architectures.

Always run a perplexity check on a held-out set after quantization. A jump of more than 1.5-2.0 perplexity points (compared to expected 0.1-0.5 for 70B models) indicates something went wrong in the quantization process. :::

:::warning actorder=True Requires vLLM Support

If you quantize with desc_act=True (actorder) but serve with a framework that does not properly handle the weight permutation, the model will produce garbage outputs - all outputs will be wrong, not just degraded. Before deploying an actorder-quantized model, verify that your serving framework version explicitly supports actorder GPTQ.

As of vLLM 0.4.0+, actorder is supported. Older versions of AutoGPTQ's built-in inference also support it. Some older deployment stacks (llama.cpp GPTQ support pre-2024) did not handle the permutation correctly. :::

:::warning Calibration Data Should Match Inference Distribution

The academic standard is to calibrate with WikiText-2 and evaluate on WikiText-2 - which gives the clean benchmark numbers you see in papers. In production, this is the wrong approach if your inference data is different.

If you are deploying a code assistant, calibrate with code. If you are deploying a multilingual model, calibrate with text in your target languages. The Hessian estimate is a proxy for "which weights matter for your use case" - and a mismatch between calibration and inference distribution wastes that signal.

The improvement from domain-matched calibration is typically 0.1-0.3 perplexity points on domain-specific tasks, but can be larger (0.5-1.0 PPL) for highly specialized domains. :::

:::warning Group Size 128 is Not Free

GPTQ with group_size=128 stores one FP16 scale and one FP16 zero-point per 128 weights. This adds approximately 3% to the model's storage size and, more importantly, requires loading these scale factors during inference - adding memory bandwidth pressure.

For models where you are tightly memory-bandwidth-limited (serving on older GPUs like A10, T4), the bandwidth overhead from small group sizes can partially offset the memory savings from INT4. Benchmark group_size=128 vs group_size=256 on your target hardware before committing. :::


Interview Q&A

Q1: Explain the GPTQ algorithm step by step from the Optimal Brain Surgeon framework.

The Optimal Brain Surgeon (OBS) framework asks: given a trained model, what is the minimum-damage way to set a weight to zero? It answers this by computing the Hessian $H$ of the loss with respect to the weights, and showing that the optimal compensation for removing weight $w_q$ is to update all remaining weights by $\delta w = -(w_q / [H^{-1}]_{qq}) \cdot H^{-1}_{:,q}$.

Optimal Brain Quantization (OBQ) extends this to quantization: instead of setting $w_q$ to zero, you round it to its nearest quantized value, producing error $\delta w_q = w_q - \text{quantize}(w_q)$. The same compensation formula applies. OBQ works but is slow for large models.

GPTQ makes two key modifications that make OBQ scalable. First, it decomposes the problem layer by layer, using the fact that minimizing each layer's reconstruction error independently is a good proxy for minimizing the global loss. For a layer with input activations $X$ on the calibration data, the Hessian simplifies to $H = 2XX^T$. Second, it quantizes columns in a fixed left-to-right order (rather than greedily picking the individually optimal weight, as OBQ does), applying the same compensation to the remaining columns after each step. Because every row then shares the same column order, the expensive inverse-Hessian bookkeeping is done once per layer instead of once per row, cutting the cost by roughly a factor of the matrix dimension and making the method tractable for 175B models.

The Cholesky decomposition of $H^{-1}$ provides numerical stability for the column-by-column updates, and a small damping term $\lambda I$ handles rank deficiency in $H$.

Q2: Why does GPTQ work better for larger models, and what is the scaling relationship?

Larger models have more redundancy. When GPTQ quantizes column $j$ and compensates by updating columns $j+1, ..., d_{in}$, the quality of that compensation depends on how many columns are available to absorb the error and how well the Hessian captures the substitutability between columns.

In a 70B model, each linear layer has $d_{in}$ typically between 8192 and 16384. The compensation has thousands of columns to work with. In a 7B model, $d_{in}$ is typically 4096 - about half. More importantly, larger models have more overparameterization: multiple weight directions can express similar linear transformations, meaning the compensation can redistribute error into directions that do not hurt the output.

The empirical scaling is roughly: INT4 perplexity degradation decreases as you increase model size. For LLaMA-family models: 7B shows 0.4-0.8 PPL increase, 13B shows 0.3-0.5, 30B shows 0.2-0.3, 70B shows 0.1-0.2. This is roughly consistent with a logarithmic relationship with model size, though the exact scaling depends on architecture.

This has a practical implication: INT4 quantization of a 70B model is production-safe for almost all tasks. INT4 quantization of a 7B model requires task-specific validation and may not be acceptable for precision-sensitive applications.

Q3: What happens when the Hessian matrix is rank-deficient, and how does GPTQ handle it?

The Hessian is $H = 2XX^T$ where $X \in \mathbb{R}^{d_{in} \times n}$. When $n < d_{in}$ (fewer calibration tokens than input dimensions), $H$ is rank-deficient - it has zero eigenvalues. This creates two problems: the matrix is not invertible, so $H^{-1}$ does not exist; and even if you approximate the inverse, near-zero eigenvalues produce exploding compensation factors for those directions.

GPTQ handles rank deficiency by adding a damping term: $H_{\text{damped}} = H + \lambda I$ where $\lambda = \text{damp\_percent} \times \text{mean}(\text{diag}(H))$. This ensures all eigenvalues are at least $\lambda$, making the matrix invertible. The default damp_percent=0.01 (1%) means the damping term is 1% of the average diagonal value.

The tradeoff: too much damping makes $H^{-1}$ closer to a scaled identity matrix, losing the weight-importance information that makes GPTQ better than RTN. Too little damping causes numerical instability. For well-conditioned models with sufficient calibration data (512+ samples of 2048 tokens), 1% damping is sufficient. For poorly conditioned layers or insufficient calibration data, you may need to increase damp_percent to 0.05 or 0.1.

In practice, with standard settings (512 samples, seq_len 2048, giving 1M+ calibration tokens, much larger than typical $d_{in}$ values of 4096-16384), rank deficiency is not a significant problem. The damping is a safeguard, not a crutch.

Q4: How should you select calibration data, and what is the impact of getting it wrong?

Calibration data selection affects GPTQ quality through the Hessian estimate: $H = 2XX^T$ depends on which activations fire, how strongly, and which input patterns are common. Calibration data that is unrepresentative of your inference distribution produces a Hessian that does not reflect which weights actually matter during deployment.

For general-purpose models deployed on general text: WikiText-2 is fine. It is diverse, high-quality English text that covers a broad range of vocabulary and sentence structures. The Hessian estimated on WikiText-2 will protect weights that are important for general language modeling.

For specialized deployments: use domain-matched data. A code model calibrated on WikiText-2 will have a Hessian that emphasizes weights important for English prose. The weights most critical for Python syntax (specific attention patterns for indentation, specific feed-forward activations for common code patterns) may not be protected. In practice this typically costs 0.1-0.5 perplexity points on coding benchmarks and can cost more on specific tasks.

The calibration data should be: diverse (avoid too much repetition, which leads to degenerate Hessians), representative of your inference domain, and of sufficient length (2048 tokens per sample is standard). Short samples under-estimate the importance of weights that activate primarily in the middle of long contexts.

Q5: How does GPTQ integrate with vLLM and what should you verify before production deployment?

vLLM supports GPTQ models natively via the quantization="gptq" parameter. Under the hood, vLLM uses exllama or exllamav2 kernels for GPTQ inference - these are custom CUDA implementations that handle the INT4 weight packing, scale factor loading, and dequantization on-GPU. They are significantly faster than the naive PyTorch dequantization path.

Before deploying a GPTQ model with vLLM in production, verify:

  1. Kernel compatibility: confirm whether your model uses actorder=True and that your vLLM version supports it. Check vLLM's release notes - actorder support was added in v0.3.0 with exllamav2 and improved in v0.4.0.

  2. Group size compatibility: vLLM supports group sizes 32, 64, 128, 256. Verify your model's group size is in this list.

  3. Perplexity regression test: run your perplexity measurement after loading via vLLM, not just after loading via AutoGPTQ. The two codepaths use different inference kernels and very occasionally show differences.

  4. Batch throughput benchmark: measure tokens/sec at your expected batch sizes (latency optimization requires small batch testing; throughput optimization requires testing at batch 16, 32, 64). GPTQ's optimal batch size differs from FP16 because the weight memory savings allow larger batches per GPU.

  5. Output correctness check: run a set of reference prompts with known expected outputs and compare quantized model outputs to FP16. Focus on outputs that require precise multi-step reasoning, as these are most sensitive to quantization degradation.

Q6: Compare GPTQ's approach to quantization error minimization with a simpler method like Round-to-Nearest. What is the theoretical justification for GPTQ's advantage?

Round-to-Nearest (RTN) minimizes quantization error independently for each weight, ignoring the effect on the layer output. Formally, RTN minimizes:

$$\sum_{ij} (w_{ij} - \hat{w}_{ij})^2$$

This is the sum of squared weight errors. It does not account for the fact that the layer computes $y = Wx$, not just stores $W$.

GPTQ minimizes the layer output error:

$$\|WX - \hat{W}X\|_F^2 = \|(W - \hat{W})X\|_F^2$$

The difference: when you multiply by $X$, errors in weights corresponding to frequently-activated input dimensions (large $X$ values) get amplified. RTN treats all weight errors equally. GPTQ's objective treats them proportionally to how much they affect the output.
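A tiny synthetic illustration of this point - two perturbations with identical weight-space MSE, but very different output error depending on which input dimension they hit:

```python
import torch

torch.manual_seed(0)
d_out, d_in, n = 4, 8, 1000
W = torch.randn(d_out, d_in)

# Calibration inputs where dimension 0 is activated much more strongly
X = torch.randn(d_in, n)
X[0] *= 10.0

err_hot = torch.zeros_like(W); err_hot[:, 0] = 0.1    # error on the "hot" dimension
err_cold = torch.zeros_like(W); err_cold[:, -1] = 0.1  # same-size error on a "cold" one

for name, e in [("hot dim", err_hot), ("cold dim", err_cold)]:
    print(f"{name}: weight MSE = {e.pow(2).mean().item():.4f}, "
          f"output error = {(e @ X).norm().item():.2f}")
```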

The GPTQ compensation update makes this concrete: when weight $w_{ij}$ is quantized, the error is not just absorbed by that weight - it is spread across the not-yet-quantized columns of the same row, guided by the inverse Hessian, which encodes the sensitivity of the output to each weight. Weights in input dimensions that fire rarely can absorb more error than weights in frequently-activated dimensions, because their contribution to the output is smaller.

The theoretical grounding (from the OBS framework) is that each GPTQ compensation step is the loss-minimizing adjustment of the remaining full-precision weights for the column just quantized - a greedy, locally optimal choice. RTN has no such property: it can produce large output errors for individual layers even when the weight errors are small, if those weight errors happen to align with large activations.

In practice, the gap between GPTQ and RTN at INT4 for 7B models is 0.5-1.5 perplexity points - meaningful for production use. For 70B models, the gap is smaller (0.2-0.5 PPL) because the model's redundancy limits RTN's worst-case behavior.

© 2026 EngineersOfAI. All rights reserved.