
Quantization Error Debugging

The Model That Broke on Tuesdays

Your team spent three weeks quantizing a 70B model to INT4. You ran the benchmarks - MMLU held up, HellaSwag looked fine, GSM8K was within two points. You deployed. For four days, everything was quiet.

Then the support tickets started coming in.

Users were reporting that the model would sometimes repeat the last sentence of its response over and over, looping until it hit the max token limit. Others said it refused to answer coding questions it had handled perfectly in staging. One user reported getting what looked like garbled text when asking about certain financial instruments - specific enough that it felt like a data issue, but your data team found nothing wrong.

You rolled back to the FP16 checkpoint. The problems disappeared. So you had proof: quantization was the culprit. But that was the easy part. The hard part was figuring out exactly what broke, why it broke for those specific inputs, and what you could do about it short of abandoning INT4 entirely. You needed a systematic debugging methodology - not guesswork.

This lesson is that methodology. We will walk through every major failure mode that quantization introduces, the tools that expose them, and the concrete fixes that work in practice. By the end, you will be able to take a degraded quantized model and diagnose exactly what is wrong with it - down to the layer.

Quantization failures are not random. They follow patterns. Once you learn to recognize those patterns, debugging goes from "try things until it works" to "measure, identify, fix."


Why This Exists - The Hidden Costs of Compression

Before quantization became routine, deploying a 70B model meant either serving it on 8x A100 machines (expensive) or not serving it at all. The math was brutal: 70 billion parameters at 2 bytes each (BF16) = 140 GB. A single A100 has 80 GB. You needed at minimum two GPUs just to load the model, before a single token was generated.

INT4 quantization changed this arithmetic fundamentally. Four bits per parameter means 70B parameters fit in roughly 35 GB. A single A100. A single H100. Even consumer-grade hardware with enough VRAM.

The problem is that the quality guarantees you get from quantization papers are average-case guarantees. They report perplexity on WikiText-2, accuracy on MMLU, maybe a few other benchmarks. But production models do not serve average-case inputs. They serve the long tail - unusual questions, domain-specific terminology, rare topics, adversarial prompts. That long tail is exactly where quantization failures cluster.

The reason failures cluster in the long tail comes down to how quantization works. When you compress a weight matrix from FP16 to INT4, you are replacing a continuous range of values with a small set of discrete levels. For most weights, this approximation is good enough - the errors are small and they average out. But some weights are different. Some activation channels produce values that are dramatically larger than the others - outliers. When your quantization range is calibrated around the typical values, these outliers get clipped. Information is lost. And that information loss tends to matter for exactly the rare, specific inputs that activate those channels strongly.

This is the core insight behind quantization debugging: errors are not uniform. They concentrate in specific layers, specific activation channels, and specific input patterns. Finding where they concentrate is the entire job.


Historical Context - How the Field Learned This the Hard Way

Early quantization work for deep neural networks (e.g., Binarized Neural Networks - Hubara et al., 2016) treated the problem as uniform compression. Compress everything to the same bit width, measure average accuracy, report results. For small models on image classification, this worked reasonably well.

The trouble started when the field moved to large language models and INT8. In 2022, researchers at the University of Washington, Meta AI, and Hugging Face (Dettmers et al., 2022 - "LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale") ran into something nobody had fully quantified before: as models scale beyond roughly 6.7 billion parameters, a small number of activation dimensions start producing values that are 10x to 100x larger than the median. They called these "emergent large magnitude features."

The aha moment came when Tim Dettmers was debugging why INT8 quantized versions of large models degraded so much more than expected. He plotted activation histograms and saw it immediately: a handful of channels had values in the range of 200-400, while 99% of channels had values under 10. Standard INT8 quantization was calibrating its range to fit those 200-400 values, which meant the other 99% of channels got compressed into just a few discrete levels. Quality collapsed.

His solution - vector-wise quantization that pulls outlier dimensions out into FP16 and quantizes the rest as INT8 - became LLM.int8(). It demonstrated something crucial: the path forward was not uniform compression but selective, non-uniform compression that respected the structure of the activations.

GPTQ (Frantar et al., 2022) extended this reasoning to INT4 by using second-order information (the Hessian) to figure out which weights mattered most and compensating for quantization error during the compression process. AWQ (Lin et al., 2023) approached the same problem differently - instead of post-hoc error compensation, it searched for per-channel scaling factors that made the weights easier to quantize in the first place.

Both approaches implicitly acknowledged the same truth: quantization failure is not uniform, and the fix requires understanding the structure of that failure.


Core Concepts - Reading the Symptoms

Symptom 1: Perplexity Spike

Perplexity is the canonical quantization quality metric. If your quantized model shows perplexity more than 0.2-0.3 points above the FP16 baseline on a standard corpus (WikiText-2 or C4), something is wrong beyond normal quantization noise.

The formula for perplexity is:

$$\text{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log P(x_i \mid x_{<i})\right)$$

A spike in perplexity usually means one of three things: the calibration data was mismatched, the group size is too large (not enough granularity in the quantization), or specific layers are being quantized too aggressively.

Symptom 2: Task-Specific Accuracy Drop

You see normal perplexity but a specific benchmark tanks. For example, mathematical reasoning drops 8 points while language understanding holds. This is a red flag for layer-specific sensitivity - the layers that matter most for the degraded task are the ones where quantization error concentrated.

This happens because different capabilities in a transformer are not uniformly distributed. Attention heads that implement specific reasoning patterns, MLP layers that store factual knowledge - these can be disproportionately sensitive to quantization noise.

Symptom 3: Repetition Loops

The model starts repeating phrases or sentences. This is one of the most recognizable failure modes of aggressive quantization and it traces to degraded attention pattern quality.

The attention mechanism relies on the model building up a coherent representation of what it has already generated, so it can decide what to generate next. When the attention weights are corrupted by quantization error, the model loses track of what it has said and falls into an attractor state where it keeps generating the same token sequences.
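
A cheap way to triage this symptom from serving logs is to flag completions whose tail is a single n-gram repeated back to back. The sketch below is a heuristic, not part of any library; the thresholds and the example `logged_responses` list are illustrative assumptions.

def has_repetition_loop(text: str, ngram: int = 8, min_repeats: int = 3) -> bool:
    """
    Heuristic: flag a completion whose tail is the same n-gram repeated
    back to back. The thresholds are arbitrary starting points; tune them
    on your own traffic.
    """
    tokens = text.split()
    window = ngram * min_repeats
    if len(tokens) < window:
        return False
    tail = tokens[-window:]
    first = tail[:ngram]
    return all(tail[i * ngram:(i + 1) * ngram] == first for i in range(min_repeats))


# Toy examples - in practice, read completions from your serving logs
logged_responses = [
    "The answer is 42. The answer is 42. The answer is 42. The answer is 42.",
    "Paris is the capital of France.",
]
suspicious = [r for r in logged_responses if has_repetition_loop(r, ngram=4, min_repeats=3)]
print(f"{len(suspicious)} of {len(logged_responses)} completions look like loops")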

Symptom 4: Nonsense Outputs on Specific Topics

The model produces fluent but semantically wrong text - or actual garbled output - for specific domains. This symptom points to domain-specific knowledge stored in MLP layers that was disrupted by quantization. The model still has the fluency (language model components are robust) but lost the factual content.

Symptom 5: Refusal on Valid Inputs

The model refuses to engage with requests it should handle. This one is subtle because it can look like a safety issue rather than a quality issue. The mechanism: quantization noise in the embedding layer or early attention layers corrupts the representation of the prompt enough that the model misclassifies it as something it should refuse. You can verify this by checking whether the same prompt reformulated slightly gets a normal response.
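
A minimal way to run that check is to greedy-decode the refused prompt alongside a few hand-written rewordings and compare the outputs. The sketch below assumes `model_quant` and `tokenizer` are the quantized model and tokenizer loaded as in Step 1 later in this lesson; the prompt variants are placeholders.

import torch

def compare_prompt_variants(model, tokenizer, prompts, max_new_tokens=128):
    """
    Greedy-decode each variant of the same request. If the original wording
    is refused but trivial rewordings are answered normally, the refusal is
    likely a quantization artifact rather than a deliberate policy decision.
    """
    model.eval()
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        with torch.no_grad():
            out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
        completion = tokenizer.decode(
            out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True
        )
        print(f"PROMPT:   {prompt}")
        print(f"RESPONSE: {completion[:200]}\n")


# Hand-written rewordings of a request that triggered a refusal in production
variants = [
    "Write a Python function that parses ISO-8601 timestamps.",
    "Can you write a Python function to parse ISO-8601 timestamps?",
    "I need a small Python helper that parses ISO-8601 timestamps.",
]
compare_prompt_variants(model_quant, tokenizer, variants)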


The Root Causes

Root Cause 1: Outlier Activations

This is the most important root cause to understand. In transformers larger than about 6.7B parameters, certain activation channels consistently produce values that are 10-100x larger than the typical channel. These are the "outlier features" that Dettmers et al. identified.

The problem: when you quantize the weight matrix, the quantization range is determined by the weights. But the actual error in the output depends on both the weight error and the activation values. A small weight error multiplied by a large activation value produces a large output error.

More precisely, if the weight $W$ has quantization error $\delta W$ and the activation is $x$, the output error is:

$$\delta y = \delta W \cdot x$$

For outlier channels where $x$ is large, even a small $\delta W$ produces a large $\delta y$.

LLM.int8() handles this with decomposition: identify the outlier dimensions at runtime, compute their contribution in FP16, and compute the non-outlier contribution in INT8. This adds some overhead but eliminates the catastrophic error from outlier channels.
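
The amplification is easy to reproduce in isolation. The sketch below applies naive symmetric per-tensor INT4 rounding to a random weight matrix (the dimensions and the outlier magnitude are made up for illustration) and shows both the error blow-up on an outlier input and the LLM.int8()-style remedy of computing the outlier column in full precision.

import torch

def fake_int4_quantize(w: torch.Tensor) -> torch.Tensor:
    """Naive symmetric per-tensor INT4 rounding (illustration only)."""
    scale = w.abs().max() / 7
    return torch.round(w / scale).clamp(-8, 7) * scale

torch.manual_seed(0)
w = torch.randn(1024, 1024) * 0.02
delta_w = fake_int4_quantize(w) - w          # per-weight quantization error

x = torch.randn(1024)                        # typical activations
x_out = x.clone()
x_out[123] = 300.0                           # one outlier channel (made-up magnitude)

print(f"output error norm, typical input : {(delta_w @ x).norm():.4f}")
print(f"output error norm, outlier input : {(delta_w @ x_out).norm():.4f}")

# LLM.int8()-style remedy: compute the outlier column in full precision,
# i.e. remove its contribution from the quantization error entirely.
delta_w_decomposed = delta_w.clone()
delta_w_decomposed[:, 123] = 0.0
print(f"output error norm, outlier input + FP16 outlier column: "
      f"{(delta_w_decomposed @ x_out).norm():.4f}")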

Root Cause 2: Calibration Data Mismatch

Calibration data determines the quantization ranges. If your calibration data does not match your deployment distribution, the quantization ranges are set wrong and quality degrades for inputs that look nothing like the calibration set.

A common mistake: calibrating on WikiText-2 (general English text) and deploying for code generation. Code activations have a very different distribution than natural language activations. The quantization ranges calibrated on natural language text clip code-related activations, and quality drops.

Root Cause 3: Group Size Too Large

In group-wise quantization (used by GPTQ, AWQ, and GGUF), each group of consecutive weights shares a single scale and zero-point. Larger groups mean fewer parameters to store but also mean the quantization has less granularity.

Group size 128 is the standard. Group size 256 starts to show quality degradation on some models. Group size 64 is higher quality, and group size 32 is higher still, but smaller groups increase model size (roughly 10% more weight storage at group size 32 versus 128) because you are storing more scale/zero-point values.

The key insight: if a weight row has a bimodal distribution (values clustered around two different values), a single scale for the whole row is a poor fit. Smaller groups let the quantization adapt to local weight distributions.
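
You can see the granularity effect directly by quantizing the same weight row with different group sizes and measuring reconstruction error. The sketch below uses naive asymmetric INT4 with one scale/zero-point per group; the bimodal row is synthetic and real GPTQ/AWQ kernels do better, but the trend with group size is the same.

import torch

def groupwise_int4_error(w_row: torch.Tensor, group_size: int) -> float:
    """Mean absolute reconstruction error of naive asymmetric INT4
    quantization with one scale/zero-point per group."""
    groups = w_row.reshape(-1, group_size)
    w_min = groups.min(dim=1, keepdim=True).values
    w_max = groups.max(dim=1, keepdim=True).values
    scale = (w_max - w_min).clamp(min=1e-8) / 15   # 16 levels for INT4
    q = torch.round((groups - w_min) / scale).clamp(0, 15)
    recon = q * scale + w_min
    return (recon - groups).abs().mean().item()

torch.manual_seed(0)
# A bimodal row: most weights near 0, a small cluster near 0.5 (illustrative)
row = torch.cat([torch.randn(3968) * 0.02, torch.randn(128) * 0.02 + 0.5])
row = row[torch.randperm(row.numel())]

for g in (256, 128, 64, 32):
    print(f"group_size={g:4d}  mean abs error = {groupwise_int4_error(row, g):.5f}")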

Root Cause 4: Wrong Layer Type Treatment

Not all layers in a transformer are equally sensitive to quantization. The general ordering from most to least sensitive:

  1. Embedding layers (very sensitive - quantization error here affects every token)
  2. First and last few transformer layers (tend to be more sensitive than middle layers)
  3. Attention output projection (sensitive in large models)
  4. MLP down projection (moderately sensitive)
  5. MLP up/gate projections (least sensitive)

If your quantization scheme treats all layers identically, you are over-quantizing the sensitive ones.
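
Translating this ordering into a config requires concrete module names. The helper below assumes a Llama-style checkpoint in Hugging Face Transformers (`model.embed_tokens`, `model.layers.{i}.self_attn.o_proj`, `mlp.down_proj`, and so on); other architectures use different names, and the grouping itself is a heuristic, not a measurement.

def modules_by_sensitivity(num_layers: int = 32) -> dict:
    """
    Group Llama-style module names by the rough sensitivity ordering above,
    so you can decide what to skip or keep at higher precision.
    """
    first_last = list(range(2)) + list(range(num_layers - 2, num_layers))
    return {
        "embeddings": ["model.embed_tokens", "lm_head"],
        "first_last_layers": [f"model.layers.{i}" for i in first_last],
        "attn_out_proj": [f"model.layers.{i}.self_attn.o_proj" for i in range(num_layers)],
        "mlp_down_proj": [f"model.layers.{i}.mlp.down_proj" for i in range(num_layers)],
        "mlp_up_gate": [f"model.layers.{i}.mlp.{m}" for i in range(num_layers)
                        for m in ("up_proj", "gate_proj")],
    }

sensitive = modules_by_sensitivity(num_layers=32)
print(sensitive["embeddings"], sensitive["first_last_layers"])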


Diagnostic Tools and Workflow

Step 1: Establish Baseline Metrics

Before you start changing anything, measure everything:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
import numpy as np

def compute_perplexity(model, tokenizer, text_samples, device="cuda", max_length=2048):
"""
Compute perplexity on a list of text samples.
Lower is better. FP16 baseline is the reference.
"""
model.eval()
total_nll = 0.0
total_tokens = 0

with torch.no_grad():
for text in text_samples:
encodings = tokenizer(
text,
return_tensors="pt",
truncation=True,
max_length=max_length
).to(device)

input_ids = encodings.input_ids
# Shift for next-token prediction
labels = input_ids.clone()

outputs = model(input_ids, labels=labels)
# outputs.loss is mean NLL over the sequence
nll = outputs.loss.item()
n_tokens = input_ids.shape[1] - 1 # exclude first token

total_nll += nll * n_tokens
total_tokens += n_tokens

avg_nll = total_nll / total_tokens
perplexity = np.exp(avg_nll)
return perplexity


# Load a small evaluation set - use 128 samples from WikiText-2
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
# Take only longer samples to avoid edge effects
samples = [
row["text"] for row in dataset
if len(row["text"].split()) > 100
][:128]

# Measure FP16 baseline
model_fp16 = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
torch_dtype=torch.float16,
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
ppl_fp16 = compute_perplexity(model_fp16, tokenizer, samples)
print(f"FP16 Perplexity: {ppl_fp16:.4f}")

# Measure quantized model
model_quant = AutoModelForCausalLM.from_pretrained(
"path/to/your/quantized/model",
device_map="auto"
)
ppl_quant = compute_perplexity(model_quant, tokenizer, samples)
print(f"Quantized Perplexity: {ppl_quant:.4f}")
print(f"Perplexity increase: {ppl_quant - ppl_fp16:.4f}")
print(f"Relative increase: {(ppl_quant / ppl_fp16 - 1) * 100:.2f}%")

Step 2: Per-Layer Perplexity Analysis

The key diagnostic tool. You want to find which layers contribute the most to quality degradation. The approach: replace quantized layers one at a time with their FP16 counterparts and measure how much perplexity improves.

import copy

def per_layer_sensitivity_analysis(
fp16_model,
quantized_model,
tokenizer,
eval_samples,
device="cuda"
):
"""
For each transformer layer, swap the quantized version with FP16
and measure perplexity improvement. Layers with large improvement
are the most sensitive to quantization.
"""
# Start from the quantized model
test_model = copy.deepcopy(quantized_model)
baseline_ppl = compute_perplexity(quantized_model, tokenizer, eval_samples)
print(f"All-quantized baseline PPL: {baseline_ppl:.4f}")

sensitivity_scores = {}

# Get the list of transformer layers
# For Llama-style models, layers are in model.layers
num_layers = len(fp16_model.model.layers)

for layer_idx in range(num_layers):
# Swap this layer to FP16
fp16_layer = fp16_model.model.layers[layer_idx]
quant_layer = test_model.model.layers[layer_idx]

# Replace quantized layer with fp16 version
test_model.model.layers[layer_idx] = copy.deepcopy(fp16_layer).to(device)

# Measure perplexity with this layer in FP16
ppl_with_fp16_layer = compute_perplexity(test_model, tokenizer, eval_samples)
improvement = baseline_ppl - ppl_with_fp16_layer
sensitivity_scores[layer_idx] = improvement

print(f"Layer {layer_idx:3d}: PPL with FP16 = {ppl_with_fp16_layer:.4f}, "
f"improvement = {improvement:.4f}")

# Restore quantized layer for next iteration
test_model.model.layers[layer_idx] = quant_layer

# Sort by sensitivity (most sensitive first)
sorted_layers = sorted(sensitivity_scores.items(), key=lambda x: x[1], reverse=True)
print("\nTop 10 most sensitive layers:")
for layer_idx, score in sorted_layers[:10]:
print(f" Layer {layer_idx}: {score:.4f} PPL improvement when kept in FP16")

return sensitivity_scores

Step 3: Activation Histogram Analysis

After identifying sensitive layers, you need to understand why they are sensitive. Activation histograms reveal outlier channels:

import numpy as np
import torch
import matplotlib.pyplot as plt
from collections import defaultdict

class ActivationCapturer:
"""
Hook-based activation capturer.
Attach to specific layers to record activation statistics.
"""

def __init__(self):
self.hooks = []
self.activation_stats = defaultdict(list)

def add_hook(self, module, name):
def hook_fn(module, input, output):
if isinstance(output, tuple):
act = output[0]
else:
act = output
# Record per-channel statistics
# act shape: [batch, seq_len, hidden_dim]
act_flat = act.detach().float().reshape(-1, act.shape[-1])
channel_max = act_flat.abs().max(dim=0).values.cpu().numpy()
self.activation_stats[name].append(channel_max)

handle = module.register_forward_hook(hook_fn)
self.hooks.append(handle)

def remove_all_hooks(self):
for h in self.hooks:
h.remove()
self.hooks.clear()

def get_channel_max_values(self, name):
"""Returns max activation value per channel, averaged over all captured batches."""
import numpy as np
stacked = np.stack(self.activation_stats[name], axis=0)
return stacked.max(axis=0) # shape: [hidden_dim]


def analyze_outlier_channels(model, tokenizer, eval_samples, layer_idx, device="cuda"):
"""
Analyze activation distributions in a specific layer.
Returns information about which channels are outliers.
"""
capturer = ActivationCapturer()

# Hook into the target layer's attention output
target_layer = model.model.layers[layer_idx]
capturer.add_hook(target_layer.self_attn, f"layer_{layer_idx}_attn")
capturer.add_hook(target_layer.mlp, f"layer_{layer_idx}_mlp")

model.eval()
with torch.no_grad():
for sample in eval_samples[:32]: # Use subset for speed
inputs = tokenizer(
sample,
return_tensors="pt",
truncation=True,
max_length=512
).to(device)
model(**inputs)

capturer.remove_all_hooks()

# Analyze attention activations
attn_max_vals = capturer.get_channel_max_values(f"layer_{layer_idx}_attn")
median_val = np.median(attn_max_vals)
p99_val = np.percentile(attn_max_vals, 99)

# A channel is an "outlier" if its max value is > 6x the median
outlier_threshold = 6.0 * median_val
outlier_channels = np.where(attn_max_vals > outlier_threshold)[0]

print(f"Layer {layer_idx} attention activations:")
print(f" Median channel max: {median_val:.2f}")
print(f" 99th percentile: {p99_val:.2f}")
print(f" Max channel max: {attn_max_vals.max():.2f}")
print(f" Outlier channels (>{outlier_threshold:.1f}): {len(outlier_channels)}")
print(f" Outlier fraction: {len(outlier_channels)/len(attn_max_vals)*100:.2f}%")

return attn_max_vals, outlier_channels


def plot_activation_histogram(channel_max_values, layer_idx, title_suffix=""):
"""
Plot activation value distribution.
Outlier channels will appear as a long right tail.
"""
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Linear scale
axes[0].hist(channel_max_values, bins=100, color="#818cf8", edgecolor="white")
axes[0].set_xlabel("Max activation value")
axes[0].set_ylabel("Number of channels")
axes[0].set_title(f"Layer {layer_idx} Activation Distribution {title_suffix}")

# Log scale to see outliers more clearly
axes[1].hist(channel_max_values, bins=100, color="#34d399", edgecolor="white")
axes[1].set_yscale("log")
axes[1].set_xlabel("Max activation value")
axes[1].set_ylabel("Number of channels (log scale)")
axes[1].set_title(f"Layer {layer_idx} Activation Distribution (log scale)")

plt.tight_layout()
plt.savefig(f"activation_hist_layer_{layer_idx}.png", dpi=150)
plt.show()
print(f"Saved to activation_hist_layer_{layer_idx}.png")

Step 4: Quantization Range Analysis

After you know which channels are problematic, check whether the quantization is clipping them:

def analyze_quantization_clipping(fp16_model, quantized_model, tokenizer, eval_samples, layer_idx):
"""
Compare what the FP16 model produces versus what the quantized model produces
at a specific layer. Significant divergence = quantization error at this layer.
"""
fp16_outputs = {}
quant_outputs = {}

def make_capture_hook(storage, name):
def hook(module, input, output):
if isinstance(output, tuple):
storage[name] = output[0].detach().cpu().float()
else:
storage[name] = output.detach().cpu().float()
return hook

# Attach hooks
fp16_hook = fp16_model.model.layers[layer_idx].register_forward_hook(
make_capture_hook(fp16_outputs, "layer_out")
)
quant_hook = quantized_model.model.layers[layer_idx].register_forward_hook(
make_capture_hook(quant_outputs, "layer_out")
)

fp16_model.eval()
quantized_model.eval()
sample = eval_samples[0]
inputs_fp16 = tokenizer(sample, return_tensors="pt", max_length=256, truncation=True)
inputs_quant = tokenizer(sample, return_tensors="pt", max_length=256, truncation=True)

with torch.no_grad():
fp16_model(**{k: v.to(fp16_model.device) for k, v in inputs_fp16.items()})
quantized_model(**{k: v.to(quantized_model.device) for k, v in inputs_quant.items()})

fp16_hook.remove()
quant_hook.remove()

fp16_out = fp16_outputs["layer_out"]
quant_out = quant_outputs["layer_out"]

# Compute per-channel relative error
channel_error = (fp16_out - quant_out).abs().mean(dim=[0, 1])
channel_scale = fp16_out.abs().mean(dim=[0, 1]).clamp(min=1e-8)
relative_error = (channel_error / channel_scale).numpy()

print(f"Layer {layer_idx} quantization error:")
print(f" Mean relative error: {relative_error.mean():.4f}")
print(f" Max relative error: {relative_error.max():.4f}")
print(f" Channels with >10% error: {(relative_error > 0.1).sum()}")
print(f" Channels with >50% error: {(relative_error > 0.5).sum()}")

return relative_error

[Diagrams: Diagnostic Decision Tree; LLM.int8() Decomposition Flow; Mixed Precision Layer Assignment]


Fixes in Practice

Fix 1: Switch from GPTQ to AWQ for Better Calibration Robustness

AWQ is generally more robust to calibration data mismatch than GPTQ because it searches for per-channel scales that protect the most activation-salient weights, rather than relying on a calibration-derived Hessian to compensate quantization error weight by weight. If your model degrades with GPTQ and you suspect calibration issues, try AWQ first.

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-2-7b-hf"
quant_path = "llama-2-7b-awq-int4"
quant_config = {
"zero_point": True,
"q_group_size": 128,
"w_bit": 4,
"version": "GEMM" # Use GEMM for inference, GEMV for batch size 1
}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Use domain-specific calibration data, NOT just WikiText-2
# For code: use The Stack samples
# For finance: use financial news corpus
# For medical: use PubMed abstracts
calibration_data = [
"Your domain-specific calibration text sample 1...",
"Your domain-specific calibration text sample 2...",
# Use 128-512 samples, diverse, representative of deployment distribution
]

model.quantize(tokenizer, quant_config=quant_config, calib_data=calibration_data)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
print(f"AWQ quantized model saved to {quant_path}")

Fix 2: Mixed Precision Quantization with AutoGPTQ

When per-layer sensitivity analysis reveals specific layers that account for most of the quality degradation, the most cost-effective fix is to keep those layers in higher precision:

from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

# Layers to keep in FP16, taken from the sensitivity analysis results:
# the layers whose FP16 swap gave the biggest perplexity improvement.
layers_to_skip = [0, 1, 2, 30, 31]  # Example: first 3 and last 2 layers

# NOTE: the exact config field for excluding modules from quantization
# varies across tools and versions (some configs call it modules_to_not_convert,
# Transformers' GPTQConfig exposes modules_in_block_to_quantize instead).
# Check the documentation of the version you are using.
quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    desc_act=False,
    modules_to_not_convert=[
        f"model.layers.{i}" for i in layers_to_skip
    ],
)

model = AutoGPTQForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantize_config
)

calibration_data = [...]  # Your tokenized calibration examples
model.quantize(calibration_data)
model.save_quantized("llama-2-7b-mixed-precision-gptq")

Fix 3: Reduce Group Size for Better Quality

# Standard quality (default)
quantize_config_standard = BaseQuantizeConfig(
bits=4,
group_size=128, # 1 scale per 128 weights
desc_act=False,
)

# Higher quality, ~5% larger model size
quantize_config_higher = BaseQuantizeConfig(
bits=4,
group_size=64, # 1 scale per 64 weights - 2x more scales stored
desc_act=False,
)

# Best quality, ~10% larger model size
quantize_config_best = BaseQuantizeConfig(
bits=4,
group_size=32, # 1 scale per 32 weights - 4x more scales stored
desc_act=False,
)

# Note: reducing group_size increases model size because you store more
# scale/zero-point values. With one FP16 scale per group, the overhead
# relative to the 4-bit weights is roughly:
#   group_size=128: ~3% extra
#   group_size=64:  ~6% extra
#   group_size=32:  ~12% extra
# (zero-points add a little more on top)

Fix 4: LLM.int8() for Outlier-Heavy Models

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# Standard INT8 with outlier decomposition via bitsandbytes
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,  # Threshold for outlier detection
    # Hidden-state values whose absolute magnitude exceeds the threshold are
    # routed through the FP16 outlier path; everything else stays INT8.
    # Lower threshold = more columns treated as outliers = more FP16 computation
    # Higher threshold = fewer columns treated as outliers = more potential error
    llm_int8_has_fp16_weight=False,
    llm_int8_skip_modules=["lm_head"],  # Keep LM head in FP16
)

model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-13b-hf",
quantization_config=bnb_config,
device_map="auto",
torch_dtype=torch.float16,
)

# Check how many parameters ended up in INT8 vs FP16
total_params = sum(p.numel() for p in model.parameters())
int8_params = sum(
p.numel() for p in model.parameters()
if p.dtype == torch.int8
)
print(f"INT8 parameters: {int8_params/total_params*100:.1f}%")
print(f"FP16 parameters: {(total_params-int8_params)/total_params*100:.1f}%")

Fix 5: Better Calibration Data Selection

from datasets import load_dataset
import random

def build_domain_calibration_data(domain: str, n_samples: int = 256):
"""
Build calibration data appropriate for the deployment domain.
Using the wrong distribution is one of the most common quantization bugs.
"""
if domain == "code":
# Use actual code from The Stack or CodeParrot
ds = load_dataset("codeparrot/github-code", split="train", streaming=True)
samples = []
for row in ds:
if len(row["code"]) > 200:
samples.append(row["code"][:1000])
if len(samples) >= n_samples:
break

elif domain == "medical":
# Use PubMed abstracts
ds = load_dataset("pubmed", split="train", streaming=True)
samples = []
for row in ds:
if row["MedlineCitation"]["Article"]["Abstract"]:
text = row["MedlineCitation"]["Article"]["Abstract"]["AbstractText"]
if isinstance(text, str) and len(text) > 100:
samples.append(text[:1000])
if len(samples) >= n_samples:
break

elif domain == "general":
# Mix of WikiText-2 and C4 for general-purpose models
wiki = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
c4 = load_dataset("c4", "en", split="train", streaming=True)

wiki_samples = [
row["text"] for row in wiki
if len(row["text"].split()) > 100
][:n_samples // 2]

c4_samples = []
for row in c4:
if len(row["text"].split()) > 100:
c4_samples.append(row["text"][:1000])
if len(c4_samples) >= n_samples // 2:
break

samples = wiki_samples + c4_samples
random.shuffle(samples)

else:
raise ValueError(f"Unknown domain: {domain}")

return samples[:n_samples]


# Usage
calibration_data = build_domain_calibration_data("code", n_samples=256)
print(f"Built {len(calibration_data)} calibration samples for code domain")
print(f"Average sample length: {sum(len(s) for s in calibration_data) / len(calibration_data):.0f} chars")

Complete Debugging Workflow: Step by Step

Here is the end-to-end workflow for a 7B model that degrades at INT4:

"""
Complete quantization debugging workflow.
Run this when a quantized model shows quality issues.
"""

import torch
import numpy as np
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
import json
from pathlib import Path

class QuantizationDebugger:
def __init__(self, fp16_model_path: str, quant_model_path: str, device: str = "cuda"):
print("Loading models...")
self.tokenizer = AutoTokenizer.from_pretrained(fp16_model_path)

self.fp16_model = AutoModelForCausalLM.from_pretrained(
fp16_model_path,
torch_dtype=torch.float16,
device_map="auto"
)

self.quant_model = AutoModelForCausalLM.from_pretrained(
quant_model_path,
device_map="auto"
)

self.device = device
print("Models loaded.")

def run_full_diagnostic(self, output_dir: str = "debug_results"):
Path(output_dir).mkdir(exist_ok=True)
results = {}

# Step 1: Overall perplexity comparison
print("\n=== Step 1: Perplexity Comparison ===")
eval_samples = self._load_eval_samples()
ppl_fp16 = compute_perplexity(self.fp16_model, self.tokenizer, eval_samples)
ppl_quant = compute_perplexity(self.quant_model, self.tokenizer, eval_samples)
results["ppl_fp16"] = ppl_fp16
results["ppl_quant"] = ppl_quant
results["ppl_delta"] = ppl_quant - ppl_fp16
print(f"FP16 PPL: {ppl_fp16:.4f}")
print(f"Quant PPL: {ppl_quant:.4f}")
print(f"Delta: +{ppl_quant - ppl_fp16:.4f} ({(ppl_quant/ppl_fp16-1)*100:.2f}%)")

# Step 2: Per-layer sensitivity
print("\n=== Step 2: Per-Layer Sensitivity Analysis ===")
sensitivity = per_layer_sensitivity_analysis(
self.fp16_model, self.quant_model, self.tokenizer, eval_samples
)
results["layer_sensitivity"] = {
str(k): float(v) for k, v in sensitivity.items()
}

# Step 3: Identify top problematic layers
top_layers = sorted(sensitivity.items(), key=lambda x: x[1], reverse=True)[:5]
print(f"\nTop 5 most sensitive layers: {[l[0] for l in top_layers]}")

# Step 4: Check outliers in top layers
print("\n=== Step 4: Outlier Analysis on Top Layers ===")
outlier_results = {}
for layer_idx, sensitivity_score in top_layers[:3]:
channel_maxes, outlier_channels = analyze_outlier_channels(
self.fp16_model, self.tokenizer, eval_samples, layer_idx
)
outlier_results[layer_idx] = {
"n_outlier_channels": len(outlier_channels),
"outlier_fraction": len(outlier_channels) / len(channel_maxes),
"max_activation": float(channel_maxes.max()),
"median_activation": float(np.median(channel_maxes))
}
results["outlier_analysis"] = {str(k): v for k, v in outlier_results.items()}

# Step 5: Generate recommendations
print("\n=== Step 5: Recommendations ===")
recommendations = self._generate_recommendations(results)
results["recommendations"] = recommendations
for rec in recommendations:
print(f" - {rec}")

# Save full results
with open(f"{output_dir}/debug_report.json", "w") as f:
json.dump(results, f, indent=2)
print(f"\nFull report saved to {output_dir}/debug_report.json")

return results

def _load_eval_samples(self, n=64):
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
return [
row["text"] for row in dataset
if len(row["text"].split()) > 100
][:n]

def _generate_recommendations(self, results):
recs = []
ppl_delta = results["ppl_delta"]

if ppl_delta > 1.0:
recs.append("CRITICAL: PPL increase > 1.0 - consider switching to INT8 or using FP16 for this model")

elif ppl_delta > 0.3:
recs.append("Significant PPL degradation - apply mixed precision to top sensitive layers")

# Check for outlier presence
outlier_info = results.get("outlier_analysis", {})
has_severe_outliers = any(
v["outlier_fraction"] > 0.01 # More than 1% outlier channels
for v in outlier_info.values()
)
if has_severe_outliers:
recs.append("Outlier activations detected - use LLM.int8() or AWQ (better outlier handling)")

# Check sensitivity concentration
sensitivity = results.get("layer_sensitivity", {})
if sensitivity:
values = list(sensitivity.values())
top5_sum = sum(sorted(values, reverse=True)[:5])
total_sum = sum(values)
if total_sum > 0 and top5_sum / total_sum > 0.5:
top5_layers = sorted(sensitivity.items(), key=lambda x: x[1], reverse=True)[:5]
layer_ids = [k for k, v in top5_layers]
recs.append(
f"50%+ of degradation from 5 layers: {layer_ids} - "
f"keep these in FP16 for mixed precision"
)

if ppl_delta <= 0.3:
recs.append("PPL within acceptable range - check task-specific benchmarks for targeted degradation")

return recs


# Run the debugger
debugger = QuantizationDebugger(
fp16_model_path="meta-llama/Llama-2-7b-hf",
quant_model_path="path/to/quantized/model"
)
results = debugger.run_full_diagnostic(output_dir="debug_output")

Production Engineering Notes

Do Not Trust Perplexity Alone

Perplexity measures how well the model predicts the next token on a specific corpus. It is a proxy for quality, not a direct measure of it. A model can have acceptable perplexity but produce terrible outputs for specific use cases.

Always complement perplexity with task-specific benchmarks that match your deployment use case. For a coding assistant: HumanEval pass@1. For a math assistant: GSM8K and MATH. For general chat: MT-Bench or AlpacaEval.
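
If a full benchmark harness is not wired up yet, a side-by-side spot check on a handful of task-specific prompts already catches gross regressions. A minimal sketch; the prompts are placeholders and `model_fp16` / `model_quant` are the models loaded in Step 1.

import torch

def side_by_side(prompts, models, tokenizer, max_new_tokens=256):
    """Greedy-decode each prompt with every model and print the outputs
    next to each other for manual comparison."""
    for prompt in prompts:
        print(f"\n=== {prompt[:80]} ===")
        for name, model in models.items():
            inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
            with torch.no_grad():
                out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
            text = tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
            print(f"[{name}] {text[:300]}")

# Placeholder math prompts - swap in prompts from your actual use case
math_prompts = [
    "Q: A store sells pencils at 3 for $0.75. How much do 12 pencils cost? A:",
    "Q: If a train travels 60 km in 45 minutes, what is its average speed in km/h? A:",
]
side_by_side(math_prompts, {"fp16": model_fp16, "int4": model_quant}, tokenizer)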

Calibration Data Is Not Optional

The choice of calibration data has as much impact on quantization quality as the choice of algorithm. A mismatch between calibration distribution and deployment distribution will degrade quality in ways that are hard to detect on standard benchmarks but obvious in production.

Rule of thumb: your calibration data should be a representative sample of what users will actually send to the model. If you do not have production data yet, use the closest public dataset to your expected deployment domain.

Mixed Precision Increases Model Size

Keeping specific layers in FP16 while the rest is INT4 means those layers are roughly 4x larger (16 bits vs 4 bits). For a 32-layer 7B model where you keep 5 layers in FP16, the weights end up roughly 40-50% larger than the all-INT4 model - still far below the FP16 footprint, but enough to matter when planning memory budgets, especially for edge deployment.
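
It is worth doing this arithmetic explicitly for your own model before committing to a memory budget. A rough estimate, assuming Llama-2-7B-like parameter counts and ignoring quantization scales and activation memory:

def mixed_precision_size_gb(total_layer_params: float, num_layers: int,
                            fp16_layers: int, other_params: float = 0.26e9) -> float:
    """Rough weight-memory estimate: INT4 = 0.5 bytes/param, FP16 = 2 bytes/param.
    other_params covers embeddings + LM head, assumed to stay in FP16."""
    per_layer = total_layer_params / num_layers
    int4_bytes = (num_layers - fp16_layers) * per_layer * 0.5
    fp16_bytes = fp16_layers * per_layer * 2.0 + other_params * 2.0
    return (int4_bytes + fp16_bytes) / 1e9

# Llama-2-7B-ish numbers: ~6.5B params in the 32 transformer layers
for k in (0, 5, 8):
    print(f"{k} layers in FP16: ~{mixed_precision_size_gb(6.5e9, 32, k):.1f} GB of weights")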

Sensitivity Analysis is Expensive - Cache the Results

Per-layer sensitivity analysis requires loading both the FP16 and quantized model and running inference passes for each layer. On a 70B model, this can take several hours. Cache the results:

import json
from pathlib import Path

def run_sensitivity_with_cache(fp16_model, quant_model, tokenizer, eval_samples, cache_path):
if Path(cache_path).exists():
print(f"Loading cached sensitivity from {cache_path}")
with open(cache_path) as f:
return {int(k): v for k, v in json.load(f).items()}

sensitivity = per_layer_sensitivity_analysis(
fp16_model, quant_model, tokenizer, eval_samples
)

with open(cache_path, "w") as f:
json.dump({str(k): float(v) for k, v in sensitivity.items()}, f)

return sensitivity

Common Mistakes

:::danger Calibrating on the Wrong Data Distribution

The single most common quantization debugging mistake is assuming calibration data does not matter much. It matters enormously.

If you calibrate a coding model on WikiText-2, the quantization ranges will be set for English prose activations, not code activations. The resulting quantized model will degrade significantly on code tasks even if its perplexity on WikiText-2 looks fine.

Always calibrate on data that matches your deployment distribution.

:::

:::danger Stopping at Perplexity

Perplexity on WikiText-2 is a widely reported metric but a poor quality proxy for production performance. A model can have a perplexity increase of only 0.1 and still fail badly on math or code tasks.

Always run task-specific benchmarks that match your use case before declaring a quantized model production-ready.

:::

:::warning Assuming All Layers Are Equally Sensitive

A common mistake when first encountering quantization quality issues is to apply the same fix uniformly (e.g., reduce group_size for all layers). This is wasteful.

Run sensitivity analysis first. Most of the quality degradation typically comes from a small fraction of layers (the first few, the last few, and occasionally a middle layer with unusual activation patterns). Target your fixes at those layers specifically.

:::

:::warning Ignoring the LM Head

The language model head (the final projection from hidden dimension to vocabulary) is frequently left out of sensitivity analysis because it looks like "just another linear layer." It is not.

The LM head maps from hidden space to the vocabulary logits that determine token probabilities. Small errors here affect every single token prediction. Always either skip the LM head from quantization (keep in FP16) or verify explicitly that quantizing it does not degrade quality.

:::

:::tip Group Size vs Quality Tradeoff

Group size 128 is the default for good reason - it is a reasonable quality/size tradeoff for most models. But if you are seeing quality degradation and cannot identify a specific layer issue, try reducing group_size to 64 or 32 before resorting to mixed precision. It is a simpler fix and often sufficient.

:::


Interview Q&A

Q1: What are emergent large magnitude features in LLMs and why do they cause quantization problems?

A: In transformers larger than approximately 6.7 billion parameters, a small number of activation dimensions - typically less than 0.5% - consistently produce absolute values that are 10x to 100x larger than the typical dimension. These were called "emergent large magnitude features" by Dettmers et al. in the LLM.int8() paper (2022).

They cause quantization problems because quantization works by dividing a range of values into discrete buckets. The scale factor (which defines the range) is chosen to cover the maximum value in the tensor. If 0.1% of channels have values of 300 while 99.9% have values of 3, the scale factor is set to cover the range up to 300. This means the 99.9% of channels with values around 3 get compressed into only a few discrete levels (because the quantization range is dominated by the outliers), resulting in large quantization error for most of the computation.
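
To make the arithmetic concrete, using the illustrative numbers above (a symmetric INT8 scale set by an outlier of 300):

scale = 300 / 127                 # symmetric INT8 step size set by the outlier
q = round(3 / scale)              # a typical value of 3 quantizes to this level
dequant = q * scale
print(f"step size ≈ {scale:.2f}, 3 -> level {q} -> {dequant:.2f} "
      f"({abs(dequant - 3) / 3 * 100:.0f}% error)")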

LLM.int8() solves this by identifying outlier dimensions at runtime and computing their contribution in FP16 while quantizing the rest to INT8. The key insight is that the outlier dimensions can be detected cheaply and their FP16 computation is a small fraction of the total work because there are so few of them.

Q2: You run per-layer sensitivity analysis and find that layer 0 and layers 28-31 account for 70% of the perplexity degradation. What is your debugging strategy?

A: This pattern is typical and expected - early and late layers tend to be more sensitive to quantization than middle layers. My strategy would be:

First, keep those layers in FP16 (mixed precision). For a 32-layer model, keeping 6 layers in FP16 while the rest is INT4 makes the weights roughly 50-60% larger than the all-INT4 model (each FP16 layer is about 4x its INT4 size), but still well under half the size of the FP16 checkpoint, with dramatically better quality. This is usually the best cost-quality tradeoff.

Second, investigate whether the sensitivity in those layers traces to outlier activations or to something else. Run activation histograms on those specific layers. If there are outlier channels, consider using INT8 with LLM.int8() for those layers rather than FP16 (INT8 is still a significant compression vs FP16).

Third, re-run the benchmarks (both perplexity and task-specific) after applying the mixed precision fix to verify the improvement. If quality is now acceptable, ship the mixed precision model. If it is still not acceptable, reduce group_size from 128 to 64 for the remaining quantized layers and measure again.

Q3: What is the difference between calibration data quantity and calibration data quality for quantization?

A: Both matter but in different ways.

Quantity: You need enough samples for the activation statistics to be stable. With too few samples, the measured activation distributions are noisy and the calibration-derived quantization ranges may be poorly estimated. In practice, 128-512 samples is usually sufficient for most models. More than 512 samples provides diminishing returns on calibration quality.

Quality (distribution): This is more important than quantity. The calibration samples must represent the distribution of inputs the model will see at deployment time. If your calibration data comes from a different domain (e.g., general English text but the model will be used for code), the activation distributions during calibration will be different from deployment. The quantization ranges will be wrong for the deployment domain, and quality will degrade.

The practical recommendation: use a small but representative sample of actual deployment inputs for calibration. If you do not have real deployment data yet, use the closest public dataset to your expected input distribution, mixed with a small amount of general text to maintain breadth.

Q4: A quantized model produces repetition loops on specific inputs but works correctly on most inputs. What is the most likely root cause and how would you investigate?

A: Repetition loops in quantized models almost always trace to degraded attention pattern quality. The attention mechanism needs to build a coherent representation of the token history to decide what to generate next. When attention weights are corrupted by quantization error, the model can lose track of what it has said and fall into attractor states that repeat the same sequence.

The "specific inputs" aspect is the key clue. This suggests the issue is not uniform but concentrated around specific activation patterns that those inputs trigger. My investigation would be:

First, collect 10-20 examples of inputs that cause the loop and 10-20 that do not. Look for patterns in the looping inputs - are they longer? Do they share specific vocabulary? Are they from a particular domain?

Second, run activation analysis on a looping input vs a non-looping input. Focus on the attention layers of the most sensitive layers (from sensitivity analysis). Look for channels that produce unusually large values specifically for the looping inputs.

Third, if you find outlier channels that activate more strongly for looping inputs, this confirms LLM.int8() or mixed precision as the fix.

Fourth, also check whether the issue is in the attention weights or the attention output projection. Both can produce repetition when corrupted, but for different reasons: corrupted attention weights mean the model cannot track position properly; corrupted output projection means the attended information is garbled before being passed to the MLP.

Q5: How do you decide between reducing group_size, switching algorithms (GPTQ vs AWQ), and using mixed precision when fixing quantization quality issues?

A: These are not mutually exclusive - you can combine them - but they address different root causes and have different costs:

Reducing group_size is the cheapest fix to try first. It increases model size only modestly (each halving of group_size roughly doubles the scale/zero-point storage, adding a few percent to the quantized weight size) and requires re-quantization, but it does not require any architectural changes. It helps most when the weight distributions within rows are heterogeneous (bimodal or otherwise non-uniform), which makes a single scale per 128 weights a poor fit.

Switching algorithms (GPTQ to AWQ) is appropriate when the degradation appears to come from calibration data sensitivity. AWQ explicitly optimizes for minimal activation error rather than just weight error, which makes it more robust to distributional shift between calibration and deployment. If your GPTQ model degrades on domain-specific inputs but your calibration data was general, AWQ is a good first try.

Mixed precision is the right fix when sensitivity analysis shows that a small number of layers account for most of the degradation. Instead of improving the quantization of those layers (which may not be possible at INT4), you simply keep them in higher precision. The cost is a modest increase in model size and memory, but the quality improvement is often dramatic.

In practice: run sensitivity analysis first. If degradation is concentrated in a few layers, use mixed precision. If it is spread uniformly across layers, try reducing group_size. If you suspect calibration mismatch, switch to AWQ with domain-appropriate calibration data.

Q6: What is the role of the Hessian in GPTQ-based quantization, and how does Hessian approximation quality affect debugging?

A: In GPTQ (Frantar et al., 2022), an approximate Hessian of the layer's reconstruction error with respect to the weights is used to compensate for quantization error. GPTQ quantizes weight columns sequentially - optionally in order of decreasing Hessian diagonal (the desc_act / act-order setting) - and after each column is quantized, it uses the inverse Hessian to adjust the remaining unquantized weights so they absorb the error just introduced.

The Hessian is computed from the calibration data - it measures how much the layer's output changes when each weight is perturbed, given the activations seen during calibration. If the calibration data is mismatched with the deployment distribution, the Hessian is measuring sensitivity on the wrong activations. The result is that GPTQ might keep "important" weights (important for calibration data activations) in higher precision while over-compressing weights that are actually critical for the deployment distribution.

This is one reason why AWQ can outperform GPTQ in domain-specific settings even though GPTQ has more sophisticated error compensation: AWQ's optimization is simpler and less dependent on getting the calibration distribution exactly right.

For debugging: if you see that GPTQ-quantized models degrade more on domain-specific inputs than AWQ-quantized versions of the same model at the same bit width, Hessian mismatch from calibration data is a likely explanation.
