

GPTQ: Post-Training Quantization That Actually Works

The Night That Changed LLM Deployment

It is late 2022. A researcher downloads OPT-66B - one of Meta's openly released language models, with 66 billion parameters, and among the largest open-weights models available at the time. The plan: quantize it to 4-bit to fit on a two-GPU consumer setup. Simple round-to-nearest INT4 quantization cuts the model from about 130 GB to about 33 GB in memory. But the model is unrecognizable. On reasoning tasks, accuracy drops 25-40%. On code generation, it produces gibberish. The quantized model is worthless.

The problem is fundamental. Round-to-nearest INT4 treats every weight independently: "What is the closest INT4 value to this FP16 number?" It ignores that weights are interconnected - quantization error in weight $w_i$ changes the layer's output, which propagates errors through all subsequent computations. By the time you reach the final layer, small individual errors have compounded into catastrophic collective failure.

Three weeks later, Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh (IST Austria and ETH Zurich) publish a paper called "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers." It quantizes OPT-175B to 4-bit in about 4 GPU hours on a single GPU with less than 1% accuracy degradation on most benchmarks. The trick: instead of treating each weight independently, GPTQ uses a 30-year-old mathematical result about second-order sensitivity to compensate for quantization errors as they accumulate. When you quantize weight $w_i$, you adjust all remaining weights in the row to compensate for the error you just introduced. The errors do not compound - they are corrected continuously.

This lesson explains how that mechanism works from mathematical foundations through production implementation, including all the configuration choices that determine whether your GPTQ model performs well or poorly in practice.

What Round-to-Nearest Gets Wrong

Before understanding GPTQ, you must understand exactly why naive quantization fails at INT4. Consider a single linear layer computing $Y = XW$ where:

  • $X$ is the input matrix (batch × sequence × d_in)
  • $W$ is the weight matrix (d_in × d_out)
  • $Y$ is the output (batch × sequence × d_out)

Naive INT4 quantization replaces $W$ with $\hat{W}$ where each element is rounded to the nearest representable INT4 value:

$$\hat{w}_{ij} = \text{round}\left(\frac{w_{ij}}{s}\right) \cdot s$$

where $s$ is a scale factor (typically $\max(|W|) / 7$ for symmetric 4-bit). The quantization error is:

$$\Delta W = W - \hat{W}$$

The resulting output error is:

$$\Delta Y = X \cdot \Delta W$$
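As a small numerical sketch of the three formulas above, the snippet below applies symmetric round-to-nearest INT4 to a tiny weight matrix and measures the weight and output errors. The weight and input values are made up for illustration, and `rtn_quantize` is a hypothetical helper name, not part of any library.

```python
import numpy as np

def rtn_quantize(W: np.ndarray, n_bits: int = 4) -> np.ndarray:
    """Symmetric round-to-nearest with a single scale for the whole matrix."""
    max_int = 2 ** (n_bits - 1) - 1          # 7 for INT4
    scale = np.max(np.abs(W)) / max_int      # s = max(|W|) / 7
    return np.round(W / scale).clip(-max_int, max_int) * scale

# Illustrative values only (not taken from a real model).
W = np.array([[0.70, -0.12],
              [0.06,  0.33],
              [-0.41, 0.02]])                # shape (d_in=3, d_out=2)
X = np.array([[1.0, 2.0, -1.0]])             # one input row, shape (1, d_in)

W_hat = rtn_quantize(W)
delta_W = W - W_hat                          # quantization error
delta_Y = X @ delta_W                        # resulting output error

print("scale    :", np.max(np.abs(W)) / 7)
print("max |dW| :", np.abs(delta_W).max())
print("dY       :", delta_Y)
```

Each weight is off by at most half a quantization step, but the output error sums those per-weight errors weighted by the inputs, which is why large activations make naive rounding more damaging.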

Now consider what happens in a 70B transformer. The model has roughly 80 layers. Each layer has multiple linear projections (Q, K, V, O, gate, up, down). Each weight matrix has quantization error $\Delta W_l$ for layer $l$. The errors compound through the network: the output error from layer 1 becomes additional input noise to layer 2, which amplifies its own quantization error, which compounds into layer 3, and so on.

Error accumulation in a 70B model with naive INT4
(simplified: nonlinearities and residual connections omitted):

Layer 1:  Input error: 0     Output error: δ₁ = X₁·ΔW₁
Layer 2:  Input error: δ₁    Output error: δ₂ = δ₁·Ŵ₂ + X₂·ΔW₂
...
Layer 80: Input error: δ₇₉   Output error: δ₈₀ = δ₇₉·Ŵ₈₀ + X₈₀·ΔW₈₀

By layer 80: each layer adds its own error and re-scales everything it inherited.
Even 0.1% error per layer → 1-(0.999)^80 ≈ 8% final error
if errors add up coherently (which they often do for common patterns),
and more if later layers amplify the errors they inherit.
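The arithmetic behind that estimate can be checked in a few lines. This is a toy model, not a measurement: `gain` is a hypothetical per-layer amplification factor introduced here only to show how quickly the total grows once layers amplify inherited error.

```python
def retained_fraction(per_layer_error: float, n_layers: int) -> float:
    """Fraction of the signal left if every layer loses the same relative amount."""
    return (1.0 - per_layer_error) ** n_layers

def amplified_error(per_layer_error: float, n_layers: int, gain: float) -> float:
    """Accumulated error when each layer adds its own error and scales the
    error it inherits by `gain` (gain > 1 models amplification)."""
    total = 0.0
    for _ in range(n_layers):
        total = total * gain + per_layer_error
    return total

eps, layers = 0.001, 80
coherent = 1.0 - retained_fraction(eps, layers)

print(f"coherent accumulation: {coherent:.4f}")
print(f"gain = 1.00          : {amplified_error(eps, layers, 1.00):.4f}")
print(f"gain = 1.05          : {amplified_error(eps, layers, 1.05):.4f}")
```

With no amplification the total stays near 80 times the per-layer error; even a modest gain above 1 makes the accumulated error grow geometrically with depth.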

The insight that makes GPTQ possible: you do not need to prevent all quantization error. You need to prevent quantization errors from accumulating uncompensated. If you can adjust the un-quantized weights to absorb the error from each freshly quantized weight, you can maintain near-perfect layer outputs.

The Mathematical Foundation: Optimal Brain Surgeon

GPTQ is a direct application of the Optimal Brain Surgeon (OBS) framework from Hassibi and Stork (1993), adapted from weight pruning to weight quantization. The OBS framework addresses: "After we change one weight (quantize it), how should we update the remaining weights to minimize the change in the loss?"

The second-order Taylor expansion of the loss $\mathcal{L}$ around the current weights gives:

$$\delta \mathcal{L} = \left(\frac{\partial \mathcal{L}}{\partial \mathbf{w}}\right)^T \delta\mathbf{w} + \frac{1}{2} \delta\mathbf{w}^T \mathbf{H} \, \delta\mathbf{w} + O(|\delta\mathbf{w}|^3)$$

where $\mathbf{H}$ is the Hessian: $H_{ij} = \frac{\partial^2 \mathcal{L}}{\partial w_i \partial w_j}$.

At a local minimum of the loss (which trained weights approximately are), the gradient term vanishes: $\frac{\partial \mathcal{L}}{\partial \mathbf{w}} \approx 0$. So:

$$\delta \mathcal{L} \approx \frac{1}{2} \delta\mathbf{w}^T \mathbf{H} \, \delta\mathbf{w}$$

When we quantize weight $w_q$, we force it to move from $w_q$ to its quantized value $\hat{w}_q$. Write the quantization error as $\delta w_q = w_q - \hat{w}_q$, the same sign convention as $\Delta W = W - \hat{W}$ above. We want to find the update $\delta \mathbf{w}$ to the weight vector that minimizes $\delta \mathcal{L}$ subject to the constraint that $w_q$ takes on the quantized value, i.e. that the $q$-th component of $\delta \mathbf{w}$ equals $-\delta w_q$.

The OBS solution is:

$$\delta \mathbf{w}^* = -\frac{\delta w_q}{[\mathbf{H}^{-1}]_{qq}} \cdot [\mathbf{H}^{-1}]_{:,q}$$

In words: the optimal correction to remaining weights is proportional to the column of the inverse Hessian corresponding to the quantized weight, scaled by the quantization error divided by the diagonal inverse Hessian entry.

The resulting increase in loss from quantizing weight $q$ (even after optimal correction) is:

$$\delta \mathcal{L}_q = \frac{1}{2} \frac{(\delta w_q)^2}{[\mathbf{H}^{-1}]_{qq}}$$

This gives you a way to measure how "costly" quantizing each weight is - weights where $[\mathbf{H}^{-1}]_{qq}$ is small are expensive (quantization loss is amplified), while weights where $[\mathbf{H}^{-1}]_{qq}$ is large are cheap (quantization loss is absorbed well).
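A worked example may make the OBS update concrete. The sketch below uses a hypothetical 3-weight problem with a hand-picked positive-definite Hessian; `err` denotes the quantization error $w_q - \hat{w}_q$, and the grid point 0.5 is an assumption chosen for illustration. It compares rounding one weight in isolation against rounding it and applying the OBS correction.

```python
import numpy as np

# Hypothetical 3-weight problem: H is any symmetric positive-definite matrix.
H = np.array([[4.0, 1.0, 0.5],
              [1.0, 3.0, 0.2],
              [0.5, 0.2, 2.0]])
H_inv = np.linalg.inv(H)

w = np.array([0.37, -0.81, 0.12])
q = 0                                   # index of the weight being quantized
w_hat_q = 0.5                           # pretend the nearest grid point is 0.5
err = w[q] - w_hat_q                    # quantization error, w_q - w_hat_q

def loss_increase(delta: np.ndarray) -> float:
    """Second-order loss change 0.5 * delta^T H delta."""
    return 0.5 * float(delta @ H @ delta)

# Option 1: round weight q and leave the other weights alone.
naive = np.zeros(3)
naive[q] = -err

# Option 2: OBS update, which also moves the other weights.
obs = -(err / H_inv[q, q]) * H_inv[:, q]

print("naive loss increase:", loss_increase(naive))
print("OBS loss increase  :", loss_increase(obs))
print("closed form        :", 0.5 * err**2 / H_inv[q, q])
```

The OBS update still lands weight `q` exactly on the grid point, and its loss increase matches the closed-form expression while staying at or below the cost of uncompensated rounding.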

From OBS to GPTQ: The Key Simplifications

Applying OBS directly to a 70B model is computationally intractable - the full Hessian for a weight matrix with $d^2$ parameters has $d^4$ entries. Frantar et al. made three key observations that make GPTQ practical:

Observation 1: Layer-wise quantization. The Hessian of the total loss over all weights is huge and complex. But GPTQ applies the OBS framework independently to each weight matrix (each linear layer). This is justified because the layer-wise output error (from quantizing that layer's weights) is what matters for downstream layers - not the global loss. The Hessian for a single weight matrix is tractable.

Observation 2: The layer Hessian from activations. For a linear layer $Y = XW$, the Hessian of the squared output error $\|XW - X\hat{W}\|^2$ with respect to $W$ is:

$$\mathbf{H}_W = 2 X^T X$$

This is (twice) the Gram matrix of the input activations, a sum of outer products $x x^T$ over the calibration tokens - computable from calibration data without ever computing gradients or doing backpropagation through the full model. Run the model forward on 128 sample inputs, collect the activations at each layer, and you have the Hessian.
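The claim that the layer Hessian is $2X^TX$ can be checked numerically. The sketch below uses random stand-in activations (an assumption; real calibration activations are not Gaussian) and compares the squared output error computed directly with the quadratic form $\frac{1}{2}\delta^T H \delta$ for one output unit.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_in = 64, 8                        # illustrative sizes
X = rng.normal(size=(n_tokens, d_in))         # stand-in for calibration activations
w = rng.normal(size=d_in)                     # weights of one output unit
delta = rng.normal(scale=0.05, size=d_in)     # a small perturbation of w

H = 2.0 * X.T @ X                             # layer Hessian, shape (d_in, d_in)

# Squared output error caused by the perturbation, computed directly ...
direct = np.sum((X @ w - X @ (w - delta)) ** 2)
# ... and via the quadratic form 0.5 * delta^T H delta.
quadratic = 0.5 * delta @ H @ delta

print(direct, quadratic)
```

Because the layer is linear, the quadratic form is exact rather than an approximation, so the two numbers agree up to floating-point error.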

Observation 3: Row-wise independence. From here on, store the weight matrix the way PyTorch's linear layers do, as $W$ of shape d_out × d_in, so that each row holds the weights of one output unit and $Y = XW^T$. The rows of $W$ are independent in the sense that quantizing row $i$ does not interact with row $j$ through the Hessian. This means we can process each row independently, reducing the problem from a $d_{in} \cdot d_{out} \times d_{in} \cdot d_{out}$ Hessian to a $d_{in} \times d_{in}$ Hessian (one per row, but all rows share the same Hessian since $H = 2X^TX$ is identical for all rows of the same weight matrix).
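A quick check of the independence claim, assuming the weight matrix is stored with one row per output unit, shape (d_out, d_in): the total squared output error splits into one quadratic form per row, and every row uses the same `H`. Sizes and random inputs are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)
n_tokens, d_in, d_out = 32, 6, 4              # illustrative sizes
X = rng.normal(size=(n_tokens, d_in))
W = rng.normal(size=(d_out, d_in))            # one row per output unit
dW = rng.normal(scale=0.1, size=(d_out, d_in))

H = 2.0 * X.T @ X                             # shared by every row

# Total squared output error for the perturbed layer ...
total = np.sum((X @ W.T - X @ (W - dW).T) ** 2)
# ... equals the sum of independent per-row quadratic forms.
per_row = [0.5 * dW[r] @ H @ dW[r] for r in range(d_out)]

print(total, sum(per_row))
```

No cross terms between rows appear, which is what allows all rows to be quantized in parallel against a single inverse Hessian.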

The GPTQ algorithm for one weight matrix then becomes:

GPTQ Algorithm for one weight matrix W (shape: d_out × d_in):

1. Collect calibration activations X (shape: n_cal × d_in)
2. Compute H = 2 * X.T @ X (shape: d_in × d_in)
3. Add damping: H += λ * I (prevents singular matrix issues)
4. Compute H⁻¹, then its upper-triangular Cholesky factor U (H⁻¹ = Uᵀ U)
   (row q of U is, up to scaling, row q of H⁻¹ after columns 0..q-1
   have been eliminated, so H⁻¹ never has to be updated explicitly)
5. For each column index q = 0, 1, ..., d_in-1 (process left to right):
     For each row r = 0, 1, ..., d_out-1:
       a. Quantize W[r, q]: w_q = round(W[r, q] / scale) * scale
       b. Compute quantization error: δw_q = W[r, q] - w_q
       c. Update remaining weights:
            W[r, q+1:] -= (δw_q / U[q, q]) * U[q, q+1:]
          (adjust future columns to compensate for this column's error)
       d. Store quantized value: W_quant[r, q] = w_q

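The critical insight: step (c) propagates the quantization error to future columns in the calibration-statistic-weighted direction that minimizes the output error. Once a column is frozen, the inverse Hessian for the remaining columns has to exclude it; GPTQ obtains all of these reduced inverses at once from a Cholesky factorization of H⁻¹, whose rows hold exactly the coefficients each sequential update needs. That structure is what makes the sequential updates both efficient and numerically stable.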
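To see why a Cholesky factor is enough, the following sketch compares two routes on a small damped Hessian built from random stand-in activations (an assumption for illustration). The explicit route removes each processed column from the inverse Hessian with a rank-one update; the Cholesky route reads the same coefficients from the rows of the upper factor `U`, where `H_inv = U.T @ U`.

```python
import numpy as np

rng = np.random.default_rng(2)
d_in = 6
X = rng.normal(size=(40, d_in))
H = 2.0 * X.T @ X + 0.01 * np.eye(d_in)       # damped layer Hessian
H_inv = np.linalg.inv(H)

# Upper-triangular factor U with H_inv = U.T @ U.
U = np.linalg.cholesky(H_inv).T

# Explicit route: after handling column q, remove it from the inverse Hessian
# with a rank-one (Gaussian elimination) update.
H_inv_q = H_inv.copy()
for q in range(d_in - 1):
    explicit = H_inv_q[q, q + 1:] / H_inv_q[q, q]
    from_cholesky = U[q, q + 1:] / U[q, q]
    assert np.allclose(explicit, from_cholesky)
    H_inv_q -= np.outer(H_inv_q[:, q], H_inv_q[q, :]) / H_inv_q[q, q]

print("Cholesky rows match the sequentially updated inverse Hessian")
```

The compensation coefficients for column `q` are therefore `U[q, q+1:] / U[q, q]`, computed once up front instead of re-deriving an inverse after every column.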

GPTQ Implementation From First Principles

Here is a clean NumPy implementation of the core GPTQ algorithm for a single layer. This is pedagogical - the production version uses CUDA and handles the full model - but demonstrates the exact mathematical operations:

import numpy as np
from typing import Tuple


def quantize_symmetric(
    w: np.ndarray,
    n_bits: int = 4,
    group_size: int = 128,
) -> Tuple[np.ndarray, np.ndarray]:
    """
    Symmetric per-group quantization of a weight matrix row.

    Args:
        w: 1D weight vector, shape (d_in,)
        n_bits: Quantization bits (4 or 8)
        group_size: Number of weights per quantization group
            Smaller → more scale parameters, higher accuracy
            Larger → fewer scale parameters, lower overhead

    Returns:
        w_quantized: Dequantized weights (float, same shape as w)
        scales: Scale factor per group
    """
    d_in = w.shape[0]
    max_int = 2 ** (n_bits - 1) - 1  # 7 for INT4, 127 for INT8
    n_groups = (d_in + group_size - 1) // group_size

    scales = np.zeros(n_groups)
    w_quantized = np.zeros_like(w)

    for g in range(n_groups):
        start = g * group_size
        end = min(start + group_size, d_in)
        group = w[start:end]

        # Scale: map [-max_val, max_val] → [-max_int, max_int]
        max_val = np.max(np.abs(group))
        scale = max_val / max_int if max_val > 0 else 1.0
        scales[g] = scale

        # Quantize and dequantize
        q = np.round(group / scale).clip(-max_int, max_int)
        w_quantized[start:end] = q * scale  # Store dequantized for GPTQ updates

    return w_quantized, scales


def gptq_quantize_layer(
    W: np.ndarray,
    X_calibration: np.ndarray,
    n_bits: int = 4,
    group_size: int = 128,
    damping_factor: float = 0.01,
    block_size: int = 128,
) -> Tuple[np.ndarray, np.ndarray]:
    """
    GPTQ quantization for a single weight matrix.

    This implements the core GPTQ algorithm from Frantar et al. (2022):
    - Compute Hessian from calibration activations
    - Process columns sequentially with error compensation

    Args:
        W: Weight matrix, shape (d_out, d_in)
        X_calibration: Calibration activations, shape (n_tokens, d_in)
        n_bits: Target quantization bits (4 is standard)
        group_size: Weights per quantization group (128 standard)
        damping_factor: Hessian regularization - prevents singularity.
            Larger → more stable but less accurate compensation.
            Typical range: 0.001 to 0.1
        block_size: Columns handled per outer-loop block. The reference
            implementation batches updates per block for speed; here it
            only structures the loop and does not change the result.

    Returns:
        W_quantized: Dequantized quantized weight matrix
        all_scales: Scale factors, shape (d_out, n_groups)
    """
    d_out, d_in = W.shape
    n_tokens, d_in_cal = X_calibration.shape
    assert d_in == d_in_cal, f"Weight d_in={d_in} != calibration d_in={d_in_cal}"
    max_int = 2 ** (n_bits - 1) - 1  # 7 for INT4

    # Step 1: Compute Hessian H = 2 * X^T * X (averaged over tokens; the
    # constant factor cancels in the update below)
    # Shape: (d_in, d_in)
    H = 2.0 * (X_calibration.T @ X_calibration) / n_tokens

    # Step 2: Add diagonal damping for numerical stability
    # Without this, H may be singular if some input dimensions are never activated
    avg_diag = np.mean(np.diag(H))
    H += damping_factor * avg_diag * np.eye(d_in)

    # Step 3: Compute H⁻¹ via Cholesky, then the upper Cholesky factor U of H⁻¹
    # (H⁻¹ = U^T U). Row q of U equals, up to scaling, row q of the inverse
    # Hessian after columns 0..q-1 have been eliminated, which is exactly
    # what the sequential update needs.
    try:
        L = np.linalg.cholesky(H)
        L_inv = np.linalg.solve(L, np.eye(d_in))
        H_inv = L_inv.T @ L_inv  # H⁻¹ = L^{-T} L^{-1}
        U = np.linalg.cholesky(H_inv).T
    except np.linalg.LinAlgError:
        # H is not positive definite (e.g., all-zero activations).
        # Fall back to U = I, which disables compensation (round-to-nearest).
        print("Warning: H is not positive definite, falling back to round-to-nearest")
        U = np.eye(d_in)

    # Step 4: Column-by-column quantization with error compensation
    W_quantized = W.copy().astype(np.float64)
    n_groups = (d_in + group_size - 1) // group_size
    all_scales = np.ones((d_out, n_groups))

    for col_start in range(0, d_in, block_size):
        col_end = min(col_start + block_size, d_in)

        for q in range(col_start, col_end):
            group_idx = q // group_size

            # At the start of each group, fix one scale per row from the
            # current (already compensated) weights of that group
            if q % group_size == 0:
                group = W_quantized[:, q:q + group_size]
                max_val = np.max(np.abs(group), axis=1)
                all_scales[:, group_idx] = np.where(max_val > 0, max_val / max_int, 1.0)
            scale = all_scales[:, group_idx]  # shape: (d_out,)

            # Quantize column q (all rows simultaneously) and compute the error
            col_weights = W_quantized[:, q].copy()
            w_q = np.round(col_weights / scale).clip(-max_int, max_int) * scale
            delta_w = col_weights - w_q  # Quantization error for this column

            # Store quantized values
            W_quantized[:, q] = w_q

            # Compensate remaining columns - the core GPTQ update:
            #   W[:, q+1:] -= (delta_w / U[q, q]) * U[q, q+1:]
            if q + 1 < d_in:
                W_quantized[:, q + 1:] -= np.outer(delta_w / U[q, q], U[q, q + 1:])

    return W_quantized, all_scales


def compare_gptq_vs_naive(
    W: np.ndarray,
    X: np.ndarray,
    n_bits: int = 4,
    group_size: int = 128,
) -> None:
    """Compare GPTQ vs naive round-to-nearest quantization."""

    # Naive quantization: one scale per (row, group), the same grid GPTQ uses
    W_naive = np.zeros_like(W)
    d_out, d_in = W.shape
    n_groups = (d_in + group_size - 1) // group_size
    max_int = 2 ** (n_bits - 1) - 1

    for g in range(n_groups):
        start = g * group_size
        end = min(start + group_size, d_in)
        group = W[:, start:end]
        max_val = np.max(np.abs(group), axis=1, keepdims=True)
        scale = np.where(max_val > 0, max_val / max_int, 1.0)
        W_naive[:, start:end] = np.round(group / scale).clip(-max_int, max_int) * scale

    # GPTQ quantization
    W_gptq, _ = gptq_quantize_layer(W, X, n_bits=n_bits, group_size=group_size)

    # Compute output errors on test inputs (same calibration for simplicity)
    Y_original = X @ W.T
    Y_naive = X @ W_naive.T
    Y_gptq = X @ W_gptq.T

    naive_error = np.mean((Y_original - Y_naive) ** 2)
    gptq_error = np.mean((Y_original - Y_gptq) ** 2)

    print(f"Quantization comparison ({n_bits}-bit, group_size={group_size}):")
    print(f" Naive round-to-nearest MSE: {naive_error:.6f}")
    print(f" GPTQ MSE: {gptq_error:.6f}")
    print(f" GPTQ improvement: {naive_error / gptq_error:.2f}x lower error")


# Demo with a random layer
np.random.seed(42)
W_demo = np.random.randn(256, 512).astype(np.float64)  # (d_out=256, d_in=512)
X_demo = np.random.randn(128, 512).astype(np.float64)  # (n_tokens=128, d_in=512)

compare_gptq_vs_naive(W_demo, X_demo, n_bits=4, group_size=128)
# The exact numbers depend on the seed and sizes. Because the error is measured
# on the calibration inputs themselves, the GPTQ MSE should come out well below
# the naive MSE; held-out inputs give a smaller but still visible gap.

Using AutoGPTQ: The Production Library

The auto-gptq library implements the full GPTQ pipeline with CUDA-accelerated Hessian computation and quantization. This is what you use in production:

# pip install auto-gptq transformers accelerate optimum datasets
import torch
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from datasets import load_dataset


def build_calibration_dataset(
    tokenizer,
    dataset_name: str = "allenai/c4",
    n_samples: int = 128,
    seq_length: int = 2048,
    text_column: str = "text",
    seed: int = 42,
) -> list:
    """
    Build a calibration dataset for GPTQ quantization.

    The calibration data determines which weight patterns GPTQ
    optimizes for. Mismatch with deployment domain causes accuracy drops.

    Args:
        n_samples: 128 is standard; 256 provides marginally better Hessians
        seq_length: Should match your typical inference sequence length.
            Longer calibration → better long-context accuracy.
            2048 is a good default for most models.

    Returns:
        List of dicts with "input_ids" and "attention_mask" tensors,
        each of shape (1, seq_length) - the example format AutoGPTQ expects
    """
    print(f"Building calibration dataset: {n_samples} samples of length {seq_length}")

    # Load a streaming dataset to avoid downloading everything
    dataset = load_dataset(
        dataset_name,
        "en",
        split="train",
        streaming=True,
    )
    # Shuffle within a buffer so that `seed` actually controls the sample choice
    dataset = dataset.shuffle(seed=seed, buffer_size=1000)

    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    calibration_data = []
    collected = 0

    for item in dataset:
        if collected >= n_samples:
            break

        text = item.get(text_column, "")
        if len(text) < 200:  # Skip very short documents
            continue

        encoded = tokenizer(
            text,
            return_tensors="pt",
            truncation=True,
            max_length=seq_length,
        )
        input_ids = encoded["input_ids"]
        attention_mask = encoded["attention_mask"]
        if input_ids.shape[1] < 64:  # Skip extremely short after tokenization
            continue

        # Pad to seq_length for consistent calibration; padded positions get
        # attention_mask = 0 so they can be told apart from real tokens
        pad_length = seq_length - input_ids.shape[1]
        if pad_length > 0:
            pad_ids = torch.full((1, pad_length), tokenizer.pad_token_id, dtype=input_ids.dtype)
            pad_mask = torch.zeros((1, pad_length), dtype=attention_mask.dtype)
            input_ids = torch.cat([input_ids, pad_ids], dim=1)
            attention_mask = torch.cat([attention_mask, pad_mask], dim=1)

        calibration_data.append({"input_ids": input_ids, "attention_mask": attention_mask})
        collected += 1

    print(f" Collected {len(calibration_data)} calibration samples")
    return calibration_data


def quantize_with_gptq(
    model_name: str,
    output_path: str,
    n_bits: int = 4,
    group_size: int = 128,
    desc_act: bool = False,
    n_calibration_samples: int = 128,
    seq_length: int = 2048,
) -> None:
    """
    Full GPTQ quantization pipeline.

    Key configuration parameters and their tradeoffs:

    n_bits:
        4 - standard, best memory/accuracy tradeoff for LLMs
        3 - more aggressive, ~5% accuracy drop, useful when memory is very tight
        8 - minimal accuracy drop, useful when 4-bit is too lossy

    group_size:
        128 - standard default, balanced overhead and accuracy
        64 - better accuracy (roughly 0.3-0.5 lower perplexity), 2x scale overhead
        32 - best accuracy, 4x scale overhead (the scale parameters start to matter)
        -1 - no grouping: a single scale per output row (per-channel),
             worst accuracy but minimal overhead

    desc_act (activation reordering):
        False - faster quantization (3-4x), slightly lower accuracy
        True - quantize in order of activation importance (largest activations first),
            better accuracy but non-sequential access hurts inference speed.
            Generally NOT recommended for inference - use AWQ instead if you
            want activation-aware accuracy improvement.
    """
    print(f"GPTQ quantization: {model_name}")
    print(f"Config: {n_bits}-bit, group_size={group_size}, desc_act={desc_act}")

    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    # Build calibration data
    calibration_data = build_calibration_dataset(
        tokenizer,
        n_samples=n_calibration_samples,
        seq_length=seq_length,
    )

    # Configure GPTQ
    quantize_config = BaseQuantizeConfig(
        bits=n_bits,
        group_size=group_size,
        desc_act=desc_act,
        # damp_percent: Hessian damping, controls numerical stability
        # Lower = more accurate compensation, higher = more stable
        # 0.01 (1%) is the standard default
        damp_percent=0.01,
    )

    # Load model for quantization (FP16 to fit in GPU memory)
    print("Loading model for quantization...")
    model = AutoGPTQForCausalLM.from_pretrained(
        model_name,
        quantize_config=quantize_config,
        torch_dtype=torch.float16,
        device_map="auto",
        low_cpu_mem_usage=True,
    )

    # Run GPTQ quantization
    print("Running GPTQ quantization (this may take 30min-4hrs depending on model size)...")
    model.quantize(
        calibration_data,
        cache_examples_on_gpu=True,  # Cache activations on GPU for speed
        batch_size=1,  # Process one sample at a time to manage memory
    )

    # Save quantized model
    model.save_quantized(output_path, use_safetensors=True)
    tokenizer.save_pretrained(output_path)

    # Print memory statistics. Packed INT4 weights are stored as buffers
    # (qweight, scales, ...), so count buffers as well as parameters.
    weight_size_gb = sum(
        t.numel() * t.element_size()
        for t in list(model.parameters()) + list(model.buffers())
    ) / 1e9
    print("\nQuantization complete!")
    print(f" Quantized model saved to: {output_path}")
    print(f" Approximate model size: {weight_size_gb:.2f} GB")
    print(f" Load with: AutoGPTQForCausalLM.from_quantized('{output_path}')")

Group Size: The Most Important Configuration Choice

Group size controls how many weights share a single quantization scale factor. It is the most impactful tuning parameter in GPTQ configuration.

Why group size matters mathematically: Within a group, all weights share the same quantization scale $s$. The scale is set to accommodate the largest weight in the group: $s = \max(|w_{group}|) / (2^{b-1} - 1)$. If one weight is 10x larger than others in the group, the scale is set for that outlier, and all smaller weights are quantized at 1/10th the effective precision they could have had if groups were smaller.

Smaller groups = more scales = more precision per weight. Larger groups = fewer scales = less memory overhead but more quantization error from outlier weights.
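The outlier effect is easy to reproduce on synthetic data. In the sketch below the weights are drawn from a narrow Gaussian and a single large value is planted by hand (both are assumptions for illustration); `group_rtn_error` is a hypothetical helper that measures round-to-nearest error with one scale per group.

```python
import numpy as np

def group_rtn_error(w: np.ndarray, group_size: int, n_bits: int = 4) -> float:
    """Mean squared error of symmetric round-to-nearest with one scale per group."""
    max_int = 2 ** (n_bits - 1) - 1
    w_hat = np.empty_like(w)
    for start in range(0, len(w), group_size):
        group = w[start:start + group_size]
        max_val = np.max(np.abs(group))
        scale = max_val / max_int if max_val > 0 else 1.0
        w_hat[start:start + group_size] = (
            np.round(group / scale).clip(-max_int, max_int) * scale
        )
    return float(np.mean((w - w_hat) ** 2))

rng = np.random.default_rng(3)
w = rng.normal(scale=0.02, size=1024)     # synthetic "typical" weights
w[100] = 0.5                              # one synthetic outlier

for gs in (1024, 128, 32):
    print(f"group_size={gs:>4}: MSE = {group_rtn_error(w, gs):.3e}")
```

With a single group the outlier dictates the scale for all 1024 weights; smaller groups confine its damage to the one group that contains it.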

The scale overhead for a 7B model:

Weight matrix typically: d_out × d_in, stored as INT4 (0.5 bytes/weight)

Scale overhead per linear layer:
group_size=32: n_params / 32 scales × 2 bytes/scale = n_params × 0.0625 bytes
group_size=128: n_params / 128 × 2 = n_params × 0.015625 bytes
group_size=1024: n_params / 1024 × 2 = n_params × 0.00195 bytes

For 7B model (7×10⁹ parameters):
group_size=32: 7B × (0.5 + 0.0625) = ~3.94 GB total
group_size=128: 7B × (0.5 + 0.015) = ~3.61 GB total
group_size=1024: 7B × (0.5 + 0.002) = ~3.51 GB total

The difference is 0.43 GB - small for a 7B model.
For a 70B model, this scales to 4.3 GB - meaningful.
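The same arithmetic as a small calculator, under the assumptions used in the table above: packed weights plus one 2-byte scale per group, ignoring zero points and non-quantized layers such as embeddings. `gptq_weight_gb` is a hypothetical helper name.

```python
def gptq_weight_gb(n_params: float, n_bits: int = 4, group_size: int = 128,
                   scale_bytes: int = 2) -> float:
    """Approximate weight storage in GB: packed weights plus one scale per group.
    Zero points and non-quantized layers (embeddings, norms) are ignored."""
    weight_bytes = n_params * n_bits / 8
    scale_overhead = n_params / group_size * scale_bytes
    return (weight_bytes + scale_overhead) / 1e9

for n_params in (7e9, 70e9):
    for gs in (32, 128, 1024):
        size = gptq_weight_gb(n_params, group_size=gs)
        print(f"{n_params / 1e9:>4.0f}B  group_size={gs:>4}: {size:6.2f} GB")
```

Changing `n_bits` to 3 or `scale_bytes` to 4 shows how the relative weight of the scale overhead shifts as the packed weights shrink.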

:::tip The Default group_size=128 Is Usually Right

For 4-bit (INT4) production deployment, group_size=128 is the community standard and the right choice for most models. Only consider smaller groups if: (a) you are doing 3-bit quantization where accuracy is more fragile, (b) your model has extreme activation variance suggesting many outlier weights, or (c) benchmarking shows meaningful perplexity improvement that justifies the overhead.

:::

Loading and Running GPTQ Models

Once a model is quantized, loading and running it is straightforward:

import torch
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer, TextStreamer


def load_gptq_model(
    model_path: str,
    use_triton: bool = False,
    inject_fused_attention: bool = True,
    inject_fused_mlp: bool = True,
) -> tuple:
    """
    Load a GPTQ-quantized model for inference.

    Performance options:
        use_triton: Uses Triton INT4 kernels instead of ExLlama kernels.
            ExLlama (default) is generally faster for batch_size=1.
            Triton can be faster at higher batch sizes.
        inject_fused_attention: Fuse attention computations - 10-15% speedup.
        inject_fused_mlp: Fuse MLP gate+up+down projections - 15-20% speedup.

    Both fused options are safe and recommended for inference.
    They do not affect output quality, only execution speed.
    """
    tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)
    tokenizer.pad_token = tokenizer.eos_token

    model = AutoGPTQForCausalLM.from_quantized(
        model_path,
        use_safetensors=True,
        device_map="auto",
        use_triton=use_triton,
        inject_fused_attention=inject_fused_attention,
        inject_fused_mlp=inject_fused_mlp,
        trust_remote_code=False,
    )
    model.eval()  # Ensure inference mode (disables dropout etc.)

    return model, tokenizer


def generate_with_gptq(
    model,
    tokenizer,
    prompt: str,
    max_new_tokens: int = 512,
    temperature: float = 0.0,
    top_p: float = 0.9,
    stream: bool = False,
) -> str:
    """
    Generate text with a GPTQ model.

    Args:
        temperature: 0.0 = greedy (deterministic, best for evaluation)
            0.1-0.7 = creative but coherent
            >1.0 = very creative / chaotic
        stream: Print tokens as they are generated
    """
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    input_length = inputs["input_ids"].shape[1]

    gen_kwargs = {
        "max_new_tokens": max_new_tokens,
        "do_sample": temperature > 0,
        "pad_token_id": tokenizer.eos_token_id,
    }
    if temperature > 0:
        gen_kwargs["temperature"] = temperature
        gen_kwargs["top_p"] = top_p

    if stream:
        streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
        gen_kwargs["streamer"] = streamer

    with torch.no_grad():
        output_ids = model.generate(**inputs, **gen_kwargs)

    generated_ids = output_ids[0][input_length:]
    return tokenizer.decode(generated_ids, skip_special_tokens=True)


def benchmark_gptq_throughput(
    model_path: str,
    prompt: str = "Write a detailed technical explanation of how transformers work:",
    batch_sizes: tuple = (1, 4, 8),
    n_new_tokens: int = 200,
    n_warmup: int = 3,
    n_runs: int = 10,
) -> None:
    """
    Benchmark GPTQ model throughput across batch sizes.
    Prints a summary table of latency and tokens/second.
    """
    import time

    model, tokenizer = load_gptq_model(model_path)

    print(f"\n{'Batch':>6} {'Latency(ms)':>12} {'P90(ms)':>10} {'Tok/s':>8}")
    print("-" * 42)

    for bs in batch_sizes:
        prompts = [prompt] * bs
        inputs = tokenizer(
            prompts, return_tensors="pt", padding=True
        )
        inputs = {k: v.to(model.device) for k, v in inputs.items()}

        # Warmup
        for _ in range(n_warmup):
            with torch.no_grad():
                model.generate(**inputs, max_new_tokens=20, do_sample=False)

        # Timed runs
        latencies = []
        for _ in range(n_runs):
            torch.cuda.synchronize()
            t0 = time.perf_counter()
            with torch.no_grad():
                model.generate(**inputs, max_new_tokens=n_new_tokens, do_sample=False)
            torch.cuda.synchronize()
            latencies.append((time.perf_counter() - t0) * 1000)

        latencies.sort()
        mean_ms = sum(latencies) / len(latencies)
        p90_ms = latencies[int(len(latencies) * 0.9)]
        # Upper bound: assumes every sequence generated all n_new_tokens
        tps = bs * n_new_tokens / (mean_ms / 1000)

        print(f"{bs:>6} {mean_ms:>12.1f} {p90_ms:>10.1f} {tps:>8.1f}")

Deploying GPTQ with vLLM

For production serving with high throughput requirements, vLLM provides the best GPTQ integration - continuous batching, paged attention, and native GPTQ kernel support:

from vllm import LLM, SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
import asyncio
import uuid


def serve_gptq_with_vllm_batch(
    model_path: str,
    prompts: list[str],
    max_tokens: int = 512,
    temperature: float = 0.0,
    tensor_parallel_size: int = 1,
) -> list[str]:
    """
    Batch inference with GPTQ model through vLLM.

    vLLM provides three key advantages over direct AutoGPTQ inference:
    1. Continuous batching: new requests start as soon as GPU capacity frees,
       rather than waiting for a fixed batch to complete
    2. Paged attention: KV cache stored in non-contiguous pages, enabling
       more concurrent sequences without memory fragmentation
    3. Optimized GPTQ kernels: integrated ExLlama v2 and Marlin kernels

    Args:
        tensor_parallel_size: Number of GPUs for tensor parallelism.
            Use when single GPU can't fit the model.
    """
    llm = LLM(
        model=model_path,
        quantization="gptq",  # Tell vLLM this is a GPTQ model
        dtype="float16",
        tensor_parallel_size=tensor_parallel_size,
        gpu_memory_utilization=0.90,  # Reserve 10% for overhead
        max_model_len=4096,
    )

    sampling_params = SamplingParams(
        temperature=temperature,
        max_tokens=max_tokens,
        top_p=0.9 if temperature > 0 else 1.0,
    )

    outputs = llm.generate(prompts, sampling_params)
    return [output.outputs[0].text for output in outputs]


async def serve_gptq_async_vllm(
    model_path: str,
    max_model_len: int = 4096,
) -> AsyncLLMEngine:
    """
    Initialize vLLM async engine for streaming production serving.

    The async engine handles request queuing, continuous batching,
    and streaming responses - the production pattern for serving APIs.

    Usage:
        engine = await serve_gptq_async_vllm("path/to/model")
        # In your API handler:
        async for output in engine.generate(prompt, params, request_id):
            text_so_far = output.outputs[0].text  # cumulative text, not one token
            yield text_so_far  # Stream to client
    """
    engine_args = AsyncEngineArgs(
        model=model_path,
        quantization="gptq",
        dtype="float16",
        max_model_len=max_model_len,
        gpu_memory_utilization=0.90,
        enable_prefix_caching=True,  # Cache KV for shared prefixes (system prompts)
        max_num_seqs=256,  # Max concurrent sequences
    )
    return AsyncLLMEngine.from_engine_args(engine_args)


async def stream_gptq_response(
    engine: AsyncLLMEngine,
    prompt: str,
    max_tokens: int = 512,
) -> str:
    """Stream tokens from the async vLLM engine."""
    request_id = str(uuid.uuid4())
    sampling_params = SamplingParams(max_tokens=max_tokens, temperature=0.0)

    full_text = ""
    async for output in engine.generate(prompt, sampling_params, request_id):
        if output.outputs:
            # Each update carries the full text generated so far
            full_text = output.outputs[0].text

    return full_text

GPTQ Configuration Reference

GPTQ vs. AWQ vs. bitsandbytes: When to Choose Each

| Criterion | GPTQ | AWQ | bitsandbytes NF4 |
|---|---|---|---|
| Mechanism | Hessian error compensation | Activation-aware scaling | NormalFloat4 quantization |
| Accuracy at INT4 | Good | Slightly better | Similar to GPTQ |
| Quantization speed (7B) | 30-60 min | 30-60 min | Minutes (no calibration step) |
| Inference speed | Good | Better (Marlin kernel) | Slowest (runtime dequant) |
| 3-bit support | Yes (good) | Marginal | No |
| LoRA fine-tuning support | No | No | Yes (QLoRA) |
| CPU inference | Via GGUF | Limited | Limited |
| vLLM integration | Full | Full | Partial |
| Calibration data needed | Yes | Yes | No |
| Best use case | General INT4 deployment, 3-bit, GGUF export | NVIDIA GPU inference with Marlin | Training with QLoRA |

Domain-Specific Calibration: A Complete Example

def build_domain_calibration_data(
    tokenizer,
    domain: str = "code",
    n_samples: int = 128,
    seq_length: int = 2048,
) -> list:
    """
    Build domain-specific calibration data for better GPTQ accuracy.

    Why this matters: GPTQ's Hessian reflects which weight dimensions
    are activated by calibration inputs. Mismatch between calibration
    and deployment distributions means GPTQ optimizes the wrong weights.

    domain options:
        "code" - Python code from GitHub (codeparrot/github-code)
        "math" - LaTeX math problems, proofs
        "medical" - Clinical notes, medical literature
        "legal" - Legal documents, contracts
        "general" - Pile / C4 (fine for general-purpose models)

    Returns:
        List of dicts with "input_ids" and "attention_mask" tensors,
        each of shape (1, seq_length)
    """
    from datasets import load_dataset

    # Dataset IDs are examples; check that each one exists and is licensed
    # for your use before relying on it
    domain_dataset_map = {
        "code": ("codeparrot/github-code", "Python", "code"),
        "math": ("hendrycks/competition_math", None, "problem"),
        "medical": ("medalpaca/medical_meadow_medqa", None, "input"),
        "legal": ("nguyen-brat/legal-dataset", None, "text"),
        "general": ("allenai/c4", "en", "text"),
    }

    if domain not in domain_dataset_map:
        raise ValueError(f"Unknown domain: {domain}. Choose from: {list(domain_dataset_map.keys())}")

    dataset_name, subset, text_col = domain_dataset_map[domain]

    try:
        if subset:
            dataset = load_dataset(dataset_name, subset, split="train", streaming=True)
        else:
            dataset = load_dataset(dataset_name, split="train", streaming=True)
    except Exception as e:
        print(f"Warning: Could not load {dataset_name}: {e}")
        print("Falling back to C4 general data")
        dataset = load_dataset("allenai/c4", "en", split="train", streaming=True)
        text_col = "text"

    tokenizer.pad_token = tokenizer.eos_token
    calibration_data = []

    for item in dataset:
        if len(calibration_data) >= n_samples:
            break

        text = item.get(text_col, "")
        if len(text) < 100:
            continue

        encoded = tokenizer(
            text,
            return_tensors="pt",
            truncation=True,
            max_length=seq_length,
            padding="max_length",
        )

        # padding="max_length" makes every sample seq_length long, so count
        # real tokens via the attention mask instead of the tensor shape
        if int(encoded["attention_mask"].sum()) >= 64:
            calibration_data.append({
                "input_ids": encoded["input_ids"],
                "attention_mask": encoded["attention_mask"],
            })

    print(f"Built {len(calibration_data)} {domain}-domain calibration samples")
    return calibration_data

Evaluating GPTQ Accuracy

After quantizing, always evaluate on your target task before deploying:

import math

import torch
from auto_gptq import AutoGPTQForCausalLM
from datasets import load_dataset
from transformers import AutoTokenizer


def compute_perplexity(
    model_path: str,
    dataset_name: str = "wikitext",
    split: str = "test",
    stride: int = 512,
    max_length: int = 2048,
    n_samples: int = 50,
) -> float:
    """
    Compute perplexity of a GPTQ model on a text dataset.

    Perplexity is the standard measure of language model quality.
    Lower = better. Typical values:
    - FP16 Llama-3.1-8B on WikiText-2: ~6.2
    - GPTQ INT4 group128: ~6.5-6.7 (~5-8% higher perplexity)
    - Naive INT4: ~9-15 (catastrophic degradation visible here)

    stride: Controls context overlap. stride=512 with max_length=2048
    means 75% overlap between windows - expensive but accurate.
    stride=max_length means no overlap - faster but less accurate PPL.
    """
    tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)
    model = AutoGPTQForCausalLM.from_quantized(
        model_path,
        device_map="auto",
        use_safetensors=True,
    )
    model.eval()

    dataset = load_dataset(dataset_name, "wikitext-2-raw-v1", split=split)
    text = "\n\n".join(dataset["text"])
    encodings = tokenizer(text, return_tensors="pt")
    input_ids = encodings.input_ids.to(model.device)

    total_len = input_ids.shape[1]
    last_start = min(total_len - max_length, n_samples * stride)
    if last_start <= 0:
        raise ValueError(f"Dataset has {total_len} tokens; need more than max_length={max_length}")

    nlls = []
    for begin_loc in range(0, last_start, stride):
        end_loc = begin_loc + max_length

        input_chunk = input_ids[:, begin_loc:end_loc]
        target_ids = input_chunk.clone()
        # Score only the last `stride` tokens of each window; the earlier
        # max_length - stride tokens serve as context. Since windows advance by
        # `stride`, each token is scored exactly once.
        target_ids[:, :-stride] = -100

        with torch.no_grad():
            outputs = model(input_chunk, labels=target_ids)
            neg_log_likelihood = outputs.loss

        nlls.append(neg_log_likelihood.float())

    # Every window scores the same number of tokens, so a plain mean is valid
    ppl = math.exp(torch.stack(nlls).mean().item())
    return ppl
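
The window arithmetic behind strided perplexity is easy to get wrong, so it can help to check it without loading a model. The sketch below is illustrative only: `plan_windows` is a hypothetical helper, not part of any library, and it assumes the convention named in the code comment above, where each window scores only its final `stride` tokens and treats the rest as context.

```python
def plan_windows(total_len: int, max_length: int, stride: int):
    """Return (begin, end, first_scored) for each evaluation window.

    Assumption: every window scores only its final `stride` tokens and uses
    the earlier `max_length - stride` tokens purely as context.
    """
    windows = []
    for begin in range(0, total_len - max_length + 1, stride):
        end = begin + max_length
        windows.append((begin, end, end - stride))
    return windows


windows = plan_windows(total_len=4096, max_length=2048, stride=512)

# Collect every token position that contributes to the loss
scored = [pos for _, end, first in windows for pos in range(first, end)]

print(len(windows))                     # 5 forward passes
print(len(scored) == len(set(scored)))  # True: no token is scored twice
```

If the masked region and the step size disagree (for example, scoring `max_length - stride` tokens while stepping by `stride`), the second check turns `False` and some tokens are counted several times in the average.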

GPTQ for GGUF and CPU Inference

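GPTQ and AWQ both target GPU inference. For CPU inference, the usual route is GGUF, the format used by llama.cpp, which makes it possible to run large models on consumer hardware without any GPU at all. GGUF uses its own block-quantization schemes (such as the K-quants listed below) rather than GPTQ's packed format, so the pipeline below first converts to an FP16 GGUF file and then re-quantizes with llama.cpp's own tooling. If the converter rejects a GPTQ-packed checkpoint, run it on the original FP16/BF16 HuggingFace checkpoint instead.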

import os
import subprocess
from pathlib import Path


def convert_gptq_to_gguf(
    gptq_model_path: str,
    output_dir: str,
    quantization_type: str = "Q4_K_M",
    llama_cpp_dir: str = "./llama.cpp",
) -> str:
    """
    Convert a HuggingFace model directory to GGUF format for llama.cpp.

    Note: the llama.cpp converter is written for standard (unquantized)
    HuggingFace checkpoints. If it fails on GPTQ-packed weights, pass the
    original FP16/BF16 model directory instead.

    GGUF format enables:
    - CPU inference with SIMD-accelerated kernels
    - Apple Silicon GPU acceleration via Metal
    - Quantization types optimized for CPU memory access patterns
    - Memory-mapped loading from disk

    Quantization types (GGUF uses a different format than AWQ/GPTQ):
        Q4_K_M: 4-bit with K-quants (mixed 4+6 bit) - best accuracy/speed balance
        Q4_K_S: 4-bit K-quants - slightly smaller file than Q4_K_M, slightly lower accuracy
        Q5_K_M: 5-bit K-quants - better accuracy, larger file
        Q8_0: 8-bit - closest to FP16 quality, largest file

    Decode speed is roughly bounded by memory bandwidth / model size.
    On an Apple M3 Pro (unified memory, 150 GB/s):
        Llama-3.1-8B Q4_K_M (~5 GB): a ceiling of roughly 30 tok/s
        Llama-3.1-70B Q4_K_M (~40 GB): exceeds the M3 Pro's 36 GB maximum memory
    Measure on your own hardware before relying on any throughput figure.

    Args:
        gptq_model_path: Path to HuggingFace model directory
        output_dir: Where to write the .gguf file
        quantization_type: GGUF quantization type (Q4_K_M recommended)
        llama_cpp_dir: Path to cloned llama.cpp repository

    Returns:
        Path to the created .gguf file
    """
    os.makedirs(output_dir, exist_ok=True)
    model_name = Path(gptq_model_path).name
    gguf_path = os.path.join(output_dir, f"{model_name}-{quantization_type}.gguf")

    # Newer llama.cpp checkouts name the script convert_hf_to_gguf.py;
    # older ones use convert-hf-to-gguf.py. Accept either.
    convert_script = None
    for script_name in ("convert_hf_to_gguf.py", "convert-hf-to-gguf.py"):
        candidate = os.path.join(llama_cpp_dir, script_name)
        if os.path.exists(candidate):
            convert_script = candidate
            break
    if convert_script is None:
        raise FileNotFoundError(
            f"llama.cpp convert script not found in {llama_cpp_dir}. "
            f"Clone llama.cpp: git clone https://github.com/ggerganov/llama.cpp"
        )

    # Step 1: Convert HuggingFace model to FP16 GGUF
    fp16_gguf = os.path.join(output_dir, f"{model_name}-fp16.gguf")
    convert_cmd = [
        "python3", convert_script,
        gptq_model_path,
        "--outfile", fp16_gguf,
        "--outtype", "f16",
    ]

    print("Converting to GGUF format...")
    result = subprocess.run(convert_cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(f"Conversion failed:\n{result.stderr}")
    print(f"  FP16 GGUF created: {fp16_gguf}")

    # Step 2: Quantize to target format using llama-quantize
    quantize_binary = os.path.join(llama_cpp_dir, "llama-quantize")
    if not os.path.exists(quantize_binary):
        # Try build directory
        quantize_binary = os.path.join(llama_cpp_dir, "build", "bin", "llama-quantize")
    if not os.path.exists(quantize_binary):
        raise FileNotFoundError(
            f"llama-quantize binary not found under {llama_cpp_dir}. Build llama.cpp first."
        )

    quantize_cmd = [quantize_binary, fp16_gguf, gguf_path, quantization_type]

    print(f"Quantizing to {quantization_type}...")
    result = subprocess.run(quantize_cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(f"Quantization failed:\n{result.stderr}")

    file_size_gb = os.path.getsize(gguf_path) / 1e9
    print(f"  GGUF model created: {gguf_path} ({file_size_gb:.2f} GB)")

    # Clean up FP16 intermediate
    if os.path.exists(fp16_gguf):
        os.remove(fp16_gguf)

    return gguf_path


def run_gguf_inference_python(
    gguf_path: str,
    prompt: str,
    max_tokens: int = 512,
    n_gpu_layers: int = 0,  # 0 = CPU only, -1 = all layers on GPU
    context_length: int = 4096,
) -> str:
    """
    Run inference on a GGUF model using llama-cpp-python.

    Install: pip install llama-cpp-python
    For Metal (Apple Silicon): CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python
    For CUDA: CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python
    (Older releases used -DLLAMA_METAL=on and -DLLAMA_CUBLAS=on.)

    Args:
        n_gpu_layers: Number of model layers to offload to GPU.
            0 = full CPU inference (works on any hardware)
            -1 = all layers on GPU (fastest, requires enough VRAM)
            N = offload N layers (useful for hybrid CPU/GPU when model is too large)
    """
    from llama_cpp import Llama

    llm = Llama(
        model_path=gguf_path,
        n_ctx=context_length,
        n_gpu_layers=n_gpu_layers,
        verbose=False,
    )

    output = llm(
        prompt,
        max_tokens=max_tokens,
        stop=["</s>", "<|eot_id|>"],  # Common stop tokens for Llama-family models
        echo=False,
    )

    return output["choices"][0]["text"]

Mixed-Precision GPTQ: Protecting Sensitive Layers

Not all layers in a transformer tolerate quantization equally well. Research has identified systematic patterns:

  • First and last transformer layers: The first few and last few layers often have different weight distributions and higher sensitivity to quantization. Quantizing them to 8-bit while using 4-bit for middle layers can improve accuracy with minimal memory overhead.
  • Attention vs. MLP layers: In some architectures, attention projection layers are more sensitive than MLP layers - the attention mechanism depends on precise relative magnitudes between Q, K, V projections.
  • Norm-adjacent layers: Layers immediately preceding or following layer normalization often have outlier activations that make 4-bit quantization more lossy.

from typing import Optional, Set

from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig


def quantize_mixed_precision_gptq(
    model_name: str,
    output_path: str,
    sensitive_layer_names: Optional[Set[str]] = None,
    default_bits: int = 4,
    sensitive_bits: int = 8,
    group_size: int = 128,
    n_calibration_samples: int = 128,
) -> None:
    """
    Mixed-precision GPTQ: use 8-bit for sensitive layers, 4-bit for the rest.

    Typical accuracy improvement: 0.3-0.8 percentage points on reasoning tasks
    Memory overhead: ~5-15% more than uniform 4-bit (depends on how many sensitive layers)

    Common sensitive layer patterns:
    - First 2 and last 2 transformer blocks (the default if none are provided)
    - MLP down-projection layers (often larger variance than gate/up)
    - Attention output projection layers

    Args:
        sensitive_layer_names: Set of layer name substrings to use 8-bit.
            Example: {"model.layers.0.", "model.layers.31.", "lm_head"}
    """
    from datasets import load_dataset
    from transformers import AutoConfig, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    # Build calibration data. AutoGPTQ's quantize() expects a list of dicts
    # holding input_ids and attention_mask, not bare tensors.
    dataset = load_dataset("allenai/c4", "en", split="train", streaming=True)
    calibration_data = []
    for item in dataset:
        if len(calibration_data) >= n_calibration_samples:
            break
        encoded = tokenizer(item["text"], return_tensors="pt",
                            max_length=2048, truncation=True)
        if encoded["input_ids"].shape[1] >= 64:
            calibration_data.append({
                "input_ids": encoded["input_ids"],
                "attention_mask": encoded["attention_mask"],
            })

    # Default sensitive layers: first and last 2 blocks, derived from the model
    # config. The trailing dot stops "model.layers.1." from matching layers 10-19.
    if sensitive_layer_names is None:
        n_layers = AutoConfig.from_pretrained(model_name).num_hidden_layers
        sensitive_layer_names = {
            f"model.layers.{i}." for i in (0, 1, n_layers - 2, n_layers - 1)
        }

    quantize_config = BaseQuantizeConfig(
        bits=default_bits,
        group_size=group_size,
        desc_act=False,
        damp_percent=0.01,
    )

    model = AutoGPTQForCausalLM.from_pretrained(
        model_name,
        quantize_config=quantize_config,
        torch_dtype="auto",
        device_map="auto",
    )

    # BaseQuantizeConfig carries one bit-width for the whole model, so this
    # override is a best-effort sketch: it only touches modules that already
    # expose a `bits` attribute. Plain nn.Linear layers do not, so check the
    # count below. If it is zero, the model is quantized uniformly at
    # default_bits, and per-layer precision needs a library that supports
    # per-module quantization configs.
    n_overridden = 0
    for name, module in model.named_modules():
        if hasattr(module, "bits") and any(p in name for p in sensitive_layer_names):
            module.bits = sensitive_bits
            n_overridden += 1
            print(f"  {sensitive_bits}-bit: {name}")
    if n_overridden == 0:
        print("Warning: no modules exposed a `bits` attribute; quantizing uniformly.")

    model.quantize(calibration_data)
    model.save_quantized(output_path, use_safetensors=True)
    tokenizer.save_pretrained(output_path)
    print(f"\nMixed-precision GPTQ model saved to {output_path}")
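
The 5-15% memory overhead quoted in the docstring can be sanity-checked with a few lines of arithmetic. This is a rough sketch under simplifying assumptions: `mixed_precision_overhead` is a hypothetical helper, every transformer block is assumed to hold the same number of weights, and scales, zero-points, embeddings, and the LM head are ignored.

```python
def mixed_precision_overhead(
    n_layers: int,
    n_sensitive: int,
    default_bits: int = 4,
    sensitive_bits: int = 8,
) -> float:
    """Fractional size increase over uniform `default_bits` quantization.

    Assumes equal-sized blocks and ignores quantization metadata.
    """
    mixed = n_sensitive * sensitive_bits + (n_layers - n_sensitive) * default_bits
    uniform = n_layers * default_bits
    return mixed / uniform - 1.0


# First 2 + last 2 blocks kept at 8-bit
print(f"{mixed_precision_overhead(32, 4):.1%}")  # 12.5% for a 32-block model
print(f"{mixed_precision_overhead(80, 4):.1%}")  # 5.0% for an 80-block model
```

The same four protected blocks cost proportionally less on deeper models, which is why the overhead range is wide.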

Common Mistakes and Production Pitfalls

:::danger Do Not Skip Calibration Data Validation
The most common GPTQ failure in production is calibration data mismatch. A model fine-tuned on medical question-answering, calibrated with general web text (C4), will lose 5-10% accuracy on medical tasks versus the same model calibrated on medical text - even though both produce similar WikiText-2 perplexity. Always: (1) build calibration data that matches your deployment domain, (2) evaluate on your deployment task, not just standard benchmarks, (3) compare perplexity AND task accuracy, not perplexity alone.
:::

:::danger Never Quantize the LM Head or Embedding Layers
GPTQ by default skips the input embedding table and the language model head (output projection). This is correct behavior - do not override it. The embedding table maps discrete token IDs to continuous vectors; quantizing it introduces token confusion (similar tokens map to similar-but-wrong vectors). The LM head must produce precise logits to correctly rank the next token; quantization noise here directly degrades the output quality for every single generated token. If implementing GPTQ from scratch, explicitly exclude embed_tokens, lm_head, all LayerNorm layers, and all RMSNorm layers from quantization.
:::

:::warning desc_act=True Hurts Inference Speed - Use It Carefully
The desc_act (descending activation order) option reorders columns by activation magnitude before quantization, so the highest-impact columns are quantized first. This can improve accuracy slightly, because the most critical weights are quantized while the largest number of remaining columns is still available to absorb their error. However, it requires storing a column permutation table and performing non-sequential memory access during inference - which destroys cache locality and reduces throughput by 10-20% on most hardware. Set desc_act=False explicitly (library defaults differ, so do not rely on the default) unless you specifically need the accuracy improvement and have benchmarked the throughput impact.
:::

:::tip Use group_size=64 for 3-Bit Quantization
At 3-bit, the quantization grid is extremely coarse (only 8 distinct values). The usual group_size=128 is often insufficient - quantization error per group is high enough that accuracy drops significantly. With group_size=64, each scale covers half as many weights, which typically recovers 0.5-1.0 perplexity points. The cost is double the scale and zero-point storage, which adds roughly 4-5% to the total size of a 3-bit model. For 3-bit deployment where accuracy is paramount, group_size=64 or even group_size=32 is worth the overhead.
:::

Interview Questions

Q1: Explain the GPTQ algorithm from mathematical foundations. What problem does it solve and how?

Naive INT4 quantization fails because it quantizes each weight independently, ignoring that weights are interconnected. Quantization error in one weight causes output errors that propagate and compound through subsequent layers. GPTQ solves this using the Optimal Brain Surgeon framework from Hassibi and Stork (1993). The key insight is that for a linear layer $Y = XW$, the sensitivity of the output to weight perturbations is captured by the Hessian $H = 2X^TX$ (the outer product of input activations). After quantizing weight $w_q$, introducing error $\delta w_q$, the optimal update to remaining weights that minimizes the resulting output error is $\delta w^* = -(\delta w_q / H^{-1}_{qq}) \cdot H^{-1}_{:,q}$. GPTQ applies this update sequentially, column by column, ensuring each weight's quantization error is compensated before the next weight is quantized. The result: errors do not accumulate, and the layer output remains close to the FP16 output even after full INT4 quantization. This requires only a forward pass on calibration data to compute $H$ (no backpropagation, no labels), and Cholesky decomposition for efficient $H$ inversion.
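
The column-by-column update can be sketched in a few lines of NumPy. This is a simplified illustration, not the optimized GPTQ implementation: it uses a plain matrix inverse instead of Cholesky, a single shared scale instead of groups, and no blocking. The names `round_to_grid` and `gptq_quantize` are hypothetical, and the sketch assumes the weight matrix is stored as (d_out, d_in), so the layer computes `X @ W.T` and the error is $w_q - \text{quant}(w_q)$.

```python
import numpy as np


def round_to_grid(w, scale, bits=4):
    """Symmetric round-to-nearest onto a signed integer grid."""
    qmax = 2 ** (bits - 1) - 1
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale


def gptq_quantize(W, X, scale, damp=0.01):
    """Quantize W (d_out x d_in) one input column at a time with compensation.

    X holds calibration inputs of shape (n_samples, d_in).
    """
    W = W.astype(np.float64).copy()
    d_in = W.shape[1]
    H = 2.0 * X.T @ X
    H += damp * np.mean(np.diag(H)) * np.eye(d_in)  # dampening for stability
    Hinv = np.linalg.inv(H)
    Q = np.zeros_like(W)
    for q in range(d_in):
        Q[:, q] = round_to_grid(W[:, q], scale)
        err = (W[:, q] - Q[:, q]) / Hinv[q, q]
        # Push this column's error onto the columns not yet quantized
        W[:, q + 1:] -= np.outer(err, Hinv[q, q + 1:])
        # Eliminate column q from the inverse Hessian before the next step
        Hinv -= np.outer(Hinv[:, q], Hinv[q, :]) / Hinv[q, q]
    return Q


rng = np.random.default_rng(0)
base = rng.normal(size=(256, 4))
# Correlated inputs: 16 features driven by 4 latent factors plus noise
X = base @ rng.normal(size=(4, 16)) + 0.1 * rng.normal(size=(256, 16))
W = rng.normal(size=(8, 16))
scale = np.abs(W).max() / 7

Q_rtn = round_to_grid(W, scale)
Q_gptq = gptq_quantize(W, X, scale)
err_rtn = np.linalg.norm(X @ W.T - X @ Q_rtn.T)
err_gptq = np.linalg.norm(X @ W.T - X @ Q_gptq.T)
print(f"RTN output error:  {err_rtn:.3f}")
print(f"GPTQ output error: {err_gptq:.3f}")
```

On correlated inputs like these, the compensated version typically shows a lower output error than round-to-nearest, because later columns can absorb earlier rounding errors along directions the inputs rarely excite. With uncorrelated inputs the Hessian is close to diagonal and the two methods behave almost identically.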

Q2: What is the role of calibration data in GPTQ, and what happens if it is mismatched?

Calibration data serves two functions in GPTQ. First, it provides the input activations needed to compute the Hessian $H = 2X^TX$ for each layer. The Hessian tells GPTQ which input dimensions are consistently activated and at what magnitudes - this determines how much each weight's quantization error is amplified into output error. Second, the Hessian drives the error compensation updates: after quantizing each weight, remaining weights are adjusted in the direction that minimizes the output error on the calibration distribution. If calibration data does not match the deployment distribution, both functions fail: the Hessian reflects the wrong activation patterns, and the compensation updates optimize for the wrong inputs. In practice, a code model calibrated on Wikipedia text will have worse INT4 accuracy on code tasks than the same model calibrated on code - by 5-10% on code-specific benchmarks, even if perplexity on WikiText-2 looks similar. Always match calibration data to deployment domain.
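
A toy example makes the mismatch concrete. The activations below are synthetic and purely illustrative (the "general" and "code" labels are assumptions, not measurements from a real model): each calibration set excites a different input dimension, and the Hessian diagonal follows whichever set was used.

```python
import numpy as np


def layer_hessian(X):
    """H = 2 X^T X for calibration activations X of shape (n_tokens, d_in)."""
    return 2.0 * X.T @ X


rng = np.random.default_rng(0)
d_in = 6

# Synthetic activations: each "domain" drives a different input dimension hard
X_general = rng.normal(size=(512, d_in))
X_general[:, 0] *= 10.0
X_code = rng.normal(size=(512, d_in))
X_code[:, 3] *= 10.0

H_general = layer_hessian(X_general)
H_code = layer_hessian(X_code)

# The largest diagonal entry marks the input dimension GPTQ treats as most sensitive
print(np.argmax(np.diag(H_general)))  # 0
print(np.argmax(np.diag(H_code)))     # 3
```

A model calibrated on the first set would spend its compensation budget protecting dimension 0, even if deployment traffic mostly exercises dimension 3.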

Q3: What is group size in GPTQ, and how does it affect the accuracy-memory tradeoff?

Group size controls the granularity of quantization scales. Within each group, all weights share one scale factor and one zero-point (for asymmetric quantization). With group_size=128, a weight matrix of size 4096×4096 has 4096×(4096/128) = 131,072 scale parameters. With group_size=32, it has 4096×128 = 524,288 scale parameters - 4x more. More scales = finer quantization resolution = less error from outlier weights skewing the scale for an entire group. The tradeoff: smaller groups require storing more scale parameters in FP16, adding memory overhead. For a 70B model at group_size=32 versus group_size=128, the additional scale overhead is roughly 4 GB. In practice: group_size=128 is the correct default for INT4, providing good accuracy with manageable overhead. group_size=64 is worth considering for 3-bit quantization where accuracy is more fragile. group_size=-1 (per-row scaling) is too coarse for 4-bit and should be avoided.
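
The counts above are simple to reproduce. The helpers below are hypothetical and count FP16 scales only; 4-bit zero-points would add about a quarter more on top.

```python
def scale_count(rows: int, cols: int, group_size: int) -> int:
    """Number of scale factors when each row is split into groups of `group_size`."""
    return rows * (cols // group_size)


def scale_overhead_gb(n_params: float, group_size: int, bytes_per_scale: int = 2) -> float:
    """Approximate FP16 scale storage for a model with `n_params` quantized weights."""
    return n_params / group_size * bytes_per_scale / 1e9


print(scale_count(4096, 4096, 128))  # 131072
print(scale_count(4096, 4096, 32))   # 524288

extra = scale_overhead_gb(70e9, 32) - scale_overhead_gb(70e9, 128)
print(f"{extra:.2f} GB")             # 3.28 GB of extra scales for a 70B model
```

Adding the zero-points brings the difference to roughly 4 GB, in line with the figure quoted in the answer.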

Q4: Why is GPTQ typically applied layer by layer rather than globally across the whole model?

Global GPTQ would require inverting the Hessian of the full loss with respect to all weights simultaneously - a matrix of size $(N_{params})^2$, where $N_{params}$ is in the billions for modern LLMs. For a 7B model, this would mean storing and inverting a $7\text{B} \times 7\text{B}$ matrix - roughly $5 \times 10^{19}$ entries, or about $10^{20}$ bytes even at FP16, far beyond any existing hardware. Layer-wise quantization makes the problem tractable: for each linear layer with $d_{in}$ input dimensions, the Hessian is $d_{in} \times d_{in}$. In a 7B LLaMA-style model that is 4096×4096 for the attention projections (about 67 MB at FP32) and 11008×11008 for the MLP down-projection (about 485 MB at FP32), both easily invertible on a single GPU. The approximation is justified empirically: the layer-wise output error (not the global loss) is what matters for downstream layers, and minimizing it layer by layer in the forward direction produces near-optimal results in practice.
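
The size gap is easy to verify with back-of-the-envelope arithmetic. The helper name is hypothetical, and the layer dimensions assume a LLaMA-7B-style architecture (hidden size 4096, MLP intermediate size 11008).

```python
def hessian_bytes(dim: float, bytes_per_entry: int = 4) -> float:
    """Storage for a dense dim x dim Hessian at the given precision."""
    return dim * dim * bytes_per_entry


global_h = hessian_bytes(7e9)   # all 7B weights at once, FP32
attn_h = hessian_bytes(4096)    # one attention projection, d_in = 4096
mlp_h = hessian_bytes(11008)    # MLP down-projection, d_in = 11008

print(f"global:    {global_h:.1e} bytes")  # 2.0e+20 bytes
print(f"attention: {attn_h / 1e6:.0f} MB")  # 67 MB
print(f"mlp down:  {mlp_h / 1e6:.0f} MB")   # 485 MB
```

The layer-wise Hessians are eleven to twelve orders of magnitude smaller than the global one, which is the whole reason GPTQ can run on a single GPU.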

Q5: How does activation reordering (desc_act) affect GPTQ, and when should you use it?

Activation reordering sorts the weight columns by decreasing activation magnitude before quantization. Columns corresponding to frequently large activations are quantized first. The benefit: in the sequential GPTQ algorithm, each column's quantization error is absorbed by the columns that have not been quantized yet, while already-quantized columns stay frozen. By quantizing the most critical columns first, you ensure the highest-impact weights are quantized when the full remaining weight budget is still available for compensation. The cost: the permutation must be stored with the model, and inference must perform non-sequential weight access (following the permutation) rather than contiguous reads. This destroys memory access locality - caches work on contiguous memory regions - and typically reduces throughput by 10-20%. The practical conclusion: desc_act provides marginal accuracy improvement (0.1-0.3 perplexity points) at significant throughput cost. AWQ achieves similar activation-importance benefits without the inference overhead, by embedding the importance information into the weight scales at quantization time rather than reordering at runtime. Prefer AWQ over GPTQ with desc_act=True for deployment that cares about both accuracy and throughput.
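
The reordering itself is just an argsort on the Hessian diagonal plus its inverse permutation. The sketch below uses synthetic activations with hand-picked per-column scales (an assumption for illustration) and shows why inference needs an extra gather: the inputs must be permuted the same way as the stored weights.

```python
import numpy as np

rng = np.random.default_rng(0)
col_scale = np.array([1, 5, 1, 1, 9, 1, 1, 3])  # synthetic per-column activation scale
X = rng.normal(size=(128, 8)) * col_scale        # calibration inputs
W = rng.normal(size=(4, 8))                      # weights stored as (d_out, d_in)

H = 2.0 * X.T @ X
perm = np.argsort(-np.diag(H))  # highest activation energy first
inv_perm = np.argsort(perm)     # undoes the reordering

W_perm = W[:, perm]             # column order used during quantization

# Reordering is lossless: applying the inverse permutation restores W
print(np.array_equal(W_perm[:, inv_perm], W))  # True

# At inference time the input columns must follow the same permutation
x = rng.normal(size=(2, 8))
print(np.allclose(x[:, perm] @ W_perm.T, x @ W.T))  # True

print(perm[:3])  # the three highest-energy input columns
```

The `x[:, perm]` gather is the non-contiguous access that costs throughput when the weights are stored in permuted order.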

Q6: How would you debug a GPTQ model that shows good perplexity but poor task performance?

This is a calibration-distribution mismatch or task-sensitivity issue. The debugging process: First, compute perplexity on domain-matched text (not WikiText-2 if the model is domain-specific). If domain PPL is significantly worse than general PPL, calibration data was likely mismatched - rebuild the calibration set from the deployment domain. Second, benchmark task accuracy directly: run the quantized and FP16 models on 200-500 examples from your task. If quantized accuracy is more than 2-3 percentage points below FP16, you have a quantization issue. Third, identify which capabilities are most degraded: arithmetic reasoning, multi-step inference, and long-context tasks degrade more from quantization than factual recall or classification. If arithmetic is heavily degraded, try group_size=64 so that each scale covers fewer weights and rounding error per weight shrinks. Fourth, check if particular layers are problematic - some models have layers with unusual activation distributions (large outliers) that benefit from mixed-precision: keep those layers at INT8 while quantizing the rest at INT4.
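
The second step, comparing task accuracy between the two models, can be as simple as the sketch below. The function names, the 3-point threshold, and the toy predictions are all illustrative assumptions; in practice the prediction lists would come from running both models on the same held-out task examples.

```python
def accuracy(preds, labels):
    """Fraction of predictions that match the reference labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def quantization_gap(fp16_preds, quant_preds, labels, max_drop=0.03):
    """Compare FP16 and quantized predictions on the same examples."""
    fp16_acc = accuracy(fp16_preds, labels)
    quant_acc = accuracy(quant_preds, labels)
    drop = fp16_acc - quant_acc
    # Indices where the two models disagree are the first place to look
    disagreements = [
        i for i, (a, b) in enumerate(zip(fp16_preds, quant_preds)) if a != b
    ]
    return {
        "fp16_acc": fp16_acc,
        "quant_acc": quant_acc,
        "drop": drop,
        "flagged": drop > max_drop,
        "disagreements": disagreements,
    }


# Toy multiple-choice example with 10 questions
labels = ["A", "B", "C", "D", "A", "B", "C", "D", "A", "B"]
fp16_preds = ["A", "B", "C", "D", "A", "B", "C", "D", "A", "C"]
quant_preds = ["A", "B", "C", "A", "A", "B", "D", "D", "A", "C"]

report = quantization_gap(fp16_preds, quant_preds, labels)
print(report["flagged"])        # True: the drop exceeds 3 points
print(report["disagreements"])  # [3, 6]
```

Inspecting the disagreement indices by category (arithmetic, long-context, recall) is what turns an aggregate accuracy drop into a concrete hypothesis about which capability the quantization damaged.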

© 2026 EngineersOfAI. All rights reserved.