Quantization Benchmarking

The Production Alert That Woke Everyone Up

It was 2:47 AM when the PagerDuty alert fired. A fintech company had just deployed their quantized LLM-based document analysis system to production two days earlier - the one that had looked so promising in staging. The model was a GPTQ 4-bit quantized Mistral 7B, and the benchmarks had been solid: perplexity of 5.82 on WikiText-2 versus 5.61 for the FP16 baseline. That 3.7% perplexity increase seemed totally acceptable. Memory dropped from 14 GB to 4.1 GB. Throughput doubled. The team had celebrated.

The alert was not about a crash. The system was running fine from an infrastructure perspective. The alert came from their business monitoring layer: the fraud detection accuracy had dropped from 94.1% to 88.3% overnight. Thousands of transactions were being miscategorized. A junior analyst had been watching the numbers and flagged it.

The on-call engineer - let us call her Priya - pulled up the logs. The model was responding. Latency was good. But something was systematically wrong. She started running the model on held-out test cases from their fraud detection domain, cases that looked nothing like Wikipedia articles. The model was failing on exactly the cases that required multi-step reasoning about transaction sequences. Cases like: "Customer withdrew $500 at an ATM in Chicago at 9 AM, then purchased $2,000 of electronics in Miami at 10:30 AM - flag as suspicious?" The FP16 model got these right reliably. The 4-bit GPTQ model was getting them wrong roughly 15% of the time.

The root cause took three days to fully diagnose. The GPTQ calibration data had been WikiText-2 - general English text with no financial domain content. The quantization had preserved the model's language modeling capability reasonably well, as the perplexity number showed, but had selectively damaged the numerical reasoning circuits and the temporal sequence analysis capabilities. These capabilities were distributed across specific attention heads in specific layers, and those layers happened to be among the most sensitive to quantization error. Perplexity - averaged over millions of tokens of general text - had completely masked this degradation.

This scenario plays out repeatedly across the industry. Teams benchmark their quantized models on the wrong metrics, deploy to production, and discover the failures downstream. The lesson is not that quantization is dangerous - it is that quantization evaluation requires a multi-dimensional approach that matches your deployment domain. Perplexity is necessary but not sufficient. Downstream task accuracy on representative benchmarks is essential. And when your use case is specialized, you need domain-specific evaluation on held-out data that actually looks like your production traffic.

This lesson teaches you how to build that multi-dimensional evaluation pipeline from scratch. You will learn which metrics to measure, which benchmarks to run, how to correctly set up latency measurement, and how to build a complete comparison across FP16, GPTQ, AWQ, and NF4 quantization methods. By the end, you will be able to make quantization decisions with engineering rigor rather than gut feel.


Why This Exists

The Problem Before Rigorous Benchmarking

When quantization techniques started becoming practical around 2022-2023, teams adopted a naive evaluation approach: check that the model still produces coherent text, measure the memory reduction and speed improvement, and ship. Perplexity on WikiText-2 became the de facto quality check because it was fast to compute and had well-established baseline numbers for comparison.

This approach failed for three interconnected reasons.

First, perplexity is an aggregate metric. It averages the log-likelihood of next-token prediction across an entire test corpus. A model can have near-perfect prediction on 95% of tokens - the common, predictable words - while completely failing on the 5% of tokens that represent the difficult reasoning steps. Since those difficult tokens are rare, their contribution to the aggregate perplexity is small. But they are exactly the tokens that matter for reasoning-intensive tasks.

Second, the distribution mismatch problem. WikiText-2 is a corpus of Wikipedia articles. C4 is a cleaned version of Common Crawl. Neither looks much like medical records, financial documents, legal contracts, or code. A model quantized with WikiText-2 as its calibration dataset will have its quantization parameters optimized for Wikipedia-style text. The resulting quantization error pattern is unpredictable on out-of-distribution inputs.

Third, different capabilities degrade at different rates. Language fluency is highly robust to quantization - even aggressively quantized models produce grammatically correct text. Factual recall shows moderate sensitivity. Multi-step reasoning, mathematical computation, and long-range dependency tracking are significantly more sensitive. A perplexity number cannot tell you which of these you have degraded.

What Rigorous Benchmarking Solves

The solution is a layered evaluation strategy. You measure perplexity as a sanity check for gross degradation, but you treat it as a floor, not a ceiling. On top of that floor, you run downstream task evaluations on standardized benchmarks that probe different capability dimensions. Then you run domain-specific evaluation on data that matches your production distribution. Finally, you measure the engineering tradeoffs - latency, throughput, and memory - under realistic serving conditions.

This gives you a complete picture: how much quality did you lose, where did you lose it, is the loss acceptable for your use case, and what did you gain in efficiency?


Historical Context

The systematic evaluation of quantized language models developed alongside the quantization methods themselves. Before 2022, quantization of neural networks was mostly studied in the context of computer vision, where established benchmarks like ImageNet top-1 accuracy gave a clean single-number quality signal. The community knew what a 1% drop in ImageNet accuracy meant in practice.

For language models, the evaluation landscape was more fragmented. The field had inherited a collection of NLP benchmarks developed independently over years: MMLU (Hendrycks et al., 2021) for knowledge assessment, HellaSwag (Zellers et al., 2019) for commonsense reasoning, ARC (Clark et al., 2018) for science questions, WinoGrande (Sakaguchi et al., 2019) for coreference reasoning. These were built to measure the capabilities of full-precision models, not to detect quantization-induced degradation.

Elias Frantar and colleagues at IST Austria, while developing GPTQ in 2022, used perplexity on WikiText-2 and Penn Treebank as their primary evaluation metrics. This was reasonable for a methods paper focused on demonstrating that their quantization approach worked, but it established a precedent that teams then followed uncritically in production deployments.

The "aha moment" in the community came through a series of failure reports shared on forums and in technical blog posts through late 2023 and 2024. Teams deploying quantized models to production started noticing the disconnect: models with good perplexity degradation numbers were failing on specific task categories at unacceptably high rates. Ji Lin et al., while developing AWQ in 2023, explicitly addressed this by evaluating on downstream tasks (MMLU, WinoGrande, HellaSwag, ARC) in addition to perplexity, and they noted that the correlation between perplexity improvement and downstream task improvement was imperfect.

EleutherAI's lm-evaluation-harness (Gao et al., 2021, continuously updated) became the standard tooling for systematic downstream evaluation. By 2024, serious quantization work was expected to include lm-eval-harness results on a standard battery of benchmarks alongside perplexity numbers.


Core Concepts

Understanding Perplexity

Perplexity measures how surprised a language model is by a test corpus. Lower perplexity means the model predicted the text more confidently. Formally, for a sequence of tokens $w_1, w_2, \ldots, w_N$:

$$\text{PPL} = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log P(w_i \mid w_1, \ldots, w_{i-1})\right)$$

A perplexity of 5.0 means the model is, on average, as uncertain as if it were choosing uniformly among 5 equally likely options at each token position.
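
As a quick sanity check on the formula, here is a minimal worked example (the per-token probabilities are invented for illustration):

import math

# Hypothetical probabilities the model assigned to the actual next token
# at each of four positions.
token_probs = [0.25, 0.10, 0.50, 0.20]

# Mean negative log-likelihood, then PPL = exp(mean NLL).
mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
ppl = math.exp(mean_nll)
print(f"PPL = {ppl:.2f}")  # 4.47: as uncertain as a uniform choice among ~4.5 options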

For quantization evaluation, you compute perplexity on WikiText-2 or C4 for both the FP16 baseline and the quantized model, then report the delta. An increase of 0.1-0.3 PPL on WikiText-2 is generally considered acceptable for 4-bit quantization. An increase of more than 1.0 PPL is a red flag that something went wrong - often a misconfigured quantization group size, incorrect calibration, or a layer that should have been kept in FP16.

The key insight: perplexity measures the average case. Downstream tasks measure the tail - the hard cases where the model needs to get things exactly right.

Downstream Task Benchmarks

The standard battery for language model evaluation consists of four benchmark families:

MMLU (Massive Multitask Language Understanding): 57 academic subjects, multiple choice. Tests factual knowledge and reasoning across domains. 5-shot by convention. A 4-bit quantized model typically loses 1-3 points on MMLU compared to FP16.

HellaSwag: Sentence completion requiring commonsense reasoning. 10-shot. Generally robust to quantization - models rarely lose more than 1 point.

ARC Challenge: Science questions from standardized tests, with adversarial filtering to remove easy items. 25-shot. Moderate sensitivity to quantization.

WinoGrande: Coreference resolution with controlled variable substitution. 5-shot. Can show significant degradation in heavily quantized models.

Averaging the normalized accuracy over the task set $T$ gives you an aggregate quality score:

$$\text{Quality Score} = \frac{1}{|T|} \sum_{t \in T} \frac{\text{Acc}_t^{\text{quantized}}}{\text{Acc}_t^{\text{FP16}}}$$

A quality score of 0.98 means you retained 98% of the original capability on average. But remember: averages hide task-specific degradation. Always report per-task numbers alongside the aggregate.
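
Translated directly into code, the quality score is a few lines. This is a minimal sketch; the accuracy numbers below are placeholders, not measured results:

def quality_score(fp16_acc: dict, quant_acc: dict) -> float:
    """Mean fraction of FP16 accuracy retained across tasks."""
    ratios = [quant_acc[task] / fp16_acc[task] for task in fp16_acc]
    return sum(ratios) / len(ratios)

# Placeholder accuracies (percent), for illustration only
fp16 = {"mmlu": 62.5, "hellaswag": 83.0, "arc_challenge": 54.1, "winogrande": 74.0}
quant = {"mmlu": 60.9, "hellaswag": 82.6, "arc_challenge": 52.8, "winogrande": 72.9}

print(f"Quality score: {quality_score(fp16, quant):.3f}")  # ~0.983 with these numbers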

The Perplexity-Accuracy Disconnect

This is the most important concept in quantization evaluation. Perplexity and downstream task accuracy are correlated but not identical, and the gap widens with aggressive quantization and domain mismatch.

Here is the intuition: language modeling (what perplexity measures) and task solving (what benchmarks measure) use overlapping but distinct capabilities. Language modeling requires accurate prediction of common words, grammatical structures, and typical phrase continuations. Task solving requires reliable activation of specific reasoning circuits - multi-hop inference chains, numerical reasoning, logical entailment. These reasoning circuits involve specific attention patterns across many layers. Quantization error is not uniform across the model; it concentrates in layers with high weight variance and activations that contain outliers. If those layers happen to be critical for your task type, accuracy drops while perplexity remains acceptable.

Concretely: research by Dettmers et al. (2022, LLM.int8()) showed that large language models contain "emergent features" - a small fraction of hidden dimensions (0.1-1%) that carry disproportionately large magnitudes. These outlier dimensions are critical for downstream task performance. Naive 8-bit quantization that truncates these outliers causes massive accuracy degradation. Their solution - keeping outlier dimensions in FP16 - restored accuracy while maintaining most of the compression benefit. The perplexity impact of outlier truncation was moderate; the accuracy impact was catastrophic.

The practical lesson: if your perplexity looks fine but your downstream tasks are degrading more than expected, suspect outlier handling.
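
If you want to check this hypothesis empirically, a rough sketch in the spirit of the LLM.int8() analysis is to capture a layer's input activations (for example with a forward hook) and count the hidden dimensions that ever exceed the outlier threshold. The 6.0 cutoff follows Dettmers et al.; the function itself is illustrative, not from any library:

import torch

def outlier_dimension_fraction(hidden_states: torch.Tensor, threshold: float = 6.0) -> float:
    """
    Fraction of hidden dimensions that carry at least one outlier
    activation (|value| > threshold). `hidden_states` is [tokens, hidden_dim],
    e.g. captured with a forward hook on a transformer layer.
    """
    outlier_dims = (hidden_states.abs() > threshold).any(dim=0)
    return outlier_dims.float().mean().item()

# Demo on random activations; real LLMs show a small, structured set of
# outlier dimensions rather than this uniform noise.
acts = torch.randn(512, 4096) * 1.5
print(f"Outlier dims: {outlier_dimension_fraction(acts):.4%}")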

Latency Benchmarking Methodology

Measuring inference latency correctly is harder than it looks. Three common mistakes invalidate most latency numbers:

Mistake 1 - No warmup. The first inference call is always slow due to CUDA kernel compilation, cache warming, and PyTorch JIT compilation. Always run at least 5-10 warmup iterations before recording measurements.

Mistake 2 - Single-run measurement. GPU execution has variance. A single timing measurement is unreliable. Run 50-100 iterations and report median and 95th percentile (p50 and p95), not mean.

Mistake 3 - Wrong batch size. Memory savings from quantization primarily benefit batch throughput, not single-sequence latency. Measure at multiple batch sizes: 1 (for latency-sensitive use cases), 8, 32, and 64 (for throughput-oriented serving).

The key metrics to report:

  • Time to First Token (TTFT): latency from request to first token generated. Measures prefill performance (a measurement sketch follows this list).
  • Tokens per second (TPS): generation throughput. Report per-sequence (batch=1) and total (batch=N).
  • Peak VRAM: maximum GPU memory during inference, measured at your serving batch size.
  • KV cache size: for long-context deployments, the KV cache memory scales with sequence length and can dominate.
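
For TTFT specifically, the full latency harness later in this lesson times complete generations, so TTFT needs token streaming. Here is a minimal sketch using transformers' TextIteratorStreamer. Treat it as an approximation: warmup still applies, and the streamer yields text on decodable boundaries rather than strictly per token:

import time
import threading
import torch
from transformers import TextIteratorStreamer

def measure_ttft_ms(model, tokenizer, prompt: str, max_new_tokens: int = 64) -> float:
    """Approximate time to first token, in milliseconds."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True)

    torch.cuda.synchronize()
    t0 = time.perf_counter()

    # generate() blocks, so run it on a thread and time the first streamed chunk
    thread = threading.Thread(
        target=model.generate,
        kwargs=dict(**inputs, max_new_tokens=max_new_tokens, streamer=streamer),
    )
    thread.start()
    next(iter(streamer))  # blocks until the first decoded chunk arrives
    ttft = (time.perf_counter() - t0) * 1000
    thread.join()
    return ttft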

Calibration Dataset Impact

GPTQ and AWQ both require calibration data to compute quantization parameters. The choice of calibration data has measurable impact on the resulting model quality.

The effect is asymmetric: calibrating on data similar to your deployment domain improves performance on that domain but can hurt performance on other domains. Calibrating on general text (WikiText-2, C4) gives a balanced baseline.

The typical finding from empirical comparisons:

  • C4 calibration produces slightly better average perplexity than WikiText-2 calibration
  • Domain-specific calibration produces better domain accuracy but slightly worse general accuracy
  • Using 128-512 calibration samples is generally sufficient; more samples yield diminishing returns
  • Sequence length for calibration should match your deployment context length

For production deployments: if you have domain-specific data, create a mixed calibration set - 70% general text (C4 or WikiText-2) and 30% domain samples. This preserves general capability while adapting to your domain distribution.
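
A minimal sketch of that mixing step, assuming `domain_texts` is a list of raw strings from your own corpus (the function name and defaults are illustrative):

import random
from datasets import load_dataset

def build_mixed_calibration_set(
    domain_texts: list,
    n_total: int = 256,
    domain_frac: float = 0.3,
    seed: int = 42,
) -> list:
    """Mix general C4 text with domain samples for GPTQ/AWQ calibration."""
    rng = random.Random(seed)
    n_domain = int(n_total * domain_frac)
    n_general = n_total - n_domain

    # Stream C4 so we only download what we use
    general = load_dataset("allenai/c4", "en", split="train", streaming=True)
    general_texts = [row["text"] for _, row in zip(range(n_general), general)]

    # Requires len(domain_texts) >= n_domain
    calib = general_texts + rng.sample(domain_texts, n_domain)
    rng.shuffle(calib)
    return calib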


Diagrams

[Not reproduced: the lesson's Mermaid diagrams - "Evaluation Pipeline Architecture", "Task Sensitivity by Quantization Level", and "Benchmarking Latency Protocol".]


Code Examples

Setting Up Perplexity Evaluation

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
import numpy as np
from typing import Optional

def compute_perplexity(
    model_name_or_path: str,
    dataset_name: str = "wikitext",
    dataset_config: str = "wikitext-2-raw-v1",
    split: str = "test",
    stride: int = 512,
    max_length: Optional[int] = None,
    device: str = "cuda",
    load_in_4bit: bool = False,
    load_in_8bit: bool = False,
) -> dict:
    """
    Compute perplexity of a model on a standard dataset.

    Uses a sliding-window approach to handle long sequences,
    which is the standard method used in published benchmarks.
    """
    print(f"Loading tokenizer from {model_name_or_path}")
    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

    # Load model with optional quantization
    load_kwargs = {
        "torch_dtype": torch.float16,
        "device_map": "auto",
    }
    if load_in_4bit:
        from transformers import BitsAndBytesConfig
        load_kwargs["quantization_config"] = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.float16,
        )
    elif load_in_8bit:
        load_kwargs["load_in_8bit"] = True

    print("Loading model...")
    model = AutoModelForCausalLM.from_pretrained(model_name_or_path, **load_kwargs)
    model.eval()

    # Load dataset
    print(f"Loading dataset {dataset_name}/{dataset_config}")
    dataset = load_dataset(dataset_name, dataset_config, split=split)

    # Concatenate all text
    text = "\n\n".join(dataset["text"])
    encodings = tokenizer(text, return_tensors="pt")

    seq_len = encodings.input_ids.size(1)
    if max_length is None:
        max_length = model.config.max_position_embeddings
        # Cap at 2048 for reasonable compute time
        max_length = min(max_length, 2048)

    print(f"Total tokens: {seq_len}, context window: {max_length}, stride: {stride}")

    nlls = []
    prev_end_loc = 0

    for begin_loc in range(0, seq_len, stride):
        end_loc = min(begin_loc + max_length, seq_len)
        trg_len = end_loc - prev_end_loc

        input_ids = encodings.input_ids[:, begin_loc:end_loc].to(device)
        target_ids = input_ids.clone()
        # Mask out the context tokens we are not evaluating
        target_ids[:, :-trg_len] = -100

        with torch.no_grad():
            outputs = model(input_ids, labels=target_ids)
            neg_log_likelihood = outputs.loss

        nlls.append(neg_log_likelihood)
        prev_end_loc = end_loc

        if end_loc == seq_len:
            break

    # Perplexity = exp(mean NLL)
    ppl = torch.exp(torch.stack(nlls).mean()).item()

    return {
        "perplexity": ppl,
        "num_tokens": seq_len,
        "dataset": f"{dataset_name}/{dataset_config}",
    }


# Example usage
if __name__ == "__main__":
model_id = "mistralai/Mistral-7B-v0.1"

# FP16 baseline
fp16_result = compute_perplexity(model_id)
print(f"FP16 Perplexity: {fp16_result['perplexity']:.4f}")

# NF4 quantized
nf4_result = compute_perplexity(model_id, load_in_4bit=True)
print(f"NF4 Perplexity: {nf4_result['perplexity']:.4f}")

delta = nf4_result["perplexity"] - fp16_result["perplexity"]
pct = (delta / fp16_result["perplexity"]) * 100
print(f"Perplexity increase: +{delta:.4f} ({pct:.2f}%)")

Downstream Task Evaluation with lm-evaluation-harness

# First install: pip install lm-eval
# Then run from command line or wrap in Python

import subprocess
import json
import os
from pathlib import Path

def run_lm_eval(
    model_name: str,
    tasks: list,
    output_dir: str = "./eval_results",
    batch_size: int = 8,
    device: str = "cuda",
    load_in_4bit: bool = False,
    load_in_8bit: bool = False,
    gptq_model: bool = False,
) -> dict:
    """
    Run lm-evaluation-harness on specified tasks.
    Returns parsed results dict.
    """
    os.makedirs(output_dir, exist_ok=True)
    output_path = Path(output_dir) / f"{model_name.replace('/', '_')}_results.json"

    # Build the model_args string for the HF backend
    if gptq_model:
        model_args = f"pretrained={model_name},gptq=True,dtype=float16"
    elif load_in_4bit:
        model_args = f"pretrained={model_name},load_in_4bit=True,dtype=float16"
    elif load_in_8bit:
        model_args = f"pretrained={model_name},load_in_8bit=True"
    else:
        model_args = f"pretrained={model_name},dtype=float16"

    cmd = [
        "lm_eval",
        "--model", "hf",
        "--model_args", model_args,
        "--tasks", ",".join(tasks),
        "--batch_size", str(batch_size),
        "--device", device,
        "--output_path", str(output_path),
        "--log_samples",
    ]

    print(f"Running: {' '.join(cmd)}")
    result = subprocess.run(cmd, capture_output=True, text=True)

    if result.returncode != 0:
        print(f"Error: {result.stderr}")
        raise RuntimeError(f"lm_eval failed: {result.stderr}")

    with open(output_path) as f:
        results = json.load(f)

    return results


def parse_eval_results(results: dict) -> dict:
    """Extract key accuracy numbers from lm-eval output."""
    parsed = {}
    for task_name, task_results in results.get("results", {}).items():
        # lm-eval stores accuracy under different keys per task
        acc = (
            task_results.get("acc,none")
            or task_results.get("acc_norm,none")
            or task_results.get("acc")
            or 0.0
        )
        parsed[task_name] = {
            "accuracy": acc * 100,  # convert to percentage
            "stderr": task_results.get("acc_stderr,none", 0.0) * 100,
        }
    return parsed


# Standard benchmark tasks for LLM evaluation
STANDARD_TASKS = [
    "mmlu",           # 57-subject knowledge benchmark, 5-shot
    "hellaswag",      # Commonsense completion, 10-shot
    "arc_challenge",  # Science questions, 25-shot
    "winogrande",     # Coreference resolution, 5-shot
]

def compare_models(fp16_name: str, quant_name: str, gptq: bool = False):
    """Compare FP16 vs quantized model on standard tasks."""
    print("=" * 60)
    print("Evaluating FP16 baseline...")
    fp16_raw = run_lm_eval(fp16_name, STANDARD_TASKS)
    fp16_results = parse_eval_results(fp16_raw)

    print("\nEvaluating quantized model...")
    quant_raw = run_lm_eval(quant_name, STANDARD_TASKS, gptq_model=gptq)
    quant_results = parse_eval_results(quant_raw)

    print("\n" + "=" * 60)
    print(f"{'Task':<20} {'FP16':>8} {'Quant':>8} {'Delta':>8} {'Retained':>10}")
    print("-" * 60)

    total_fp16 = 0.0
    total_quant = 0.0

    for task in STANDARD_TASKS:
        task_short = task.replace("arc_challenge", "ARC-C")
        fp16_acc = fp16_results.get(task, {}).get("accuracy", 0)
        quant_acc = quant_results.get(task, {}).get("accuracy", 0)
        delta = quant_acc - fp16_acc
        retained = (quant_acc / fp16_acc * 100) if fp16_acc > 0 else 0

        print(f"{task_short:<20} {fp16_acc:>7.2f}% {quant_acc:>7.2f}% "
              f"{delta:>+7.2f}% {retained:>9.1f}%")
        total_fp16 += fp16_acc
        total_quant += quant_acc

    n_tasks = len(STANDARD_TASKS)
    avg_retained = (total_quant / total_fp16 * 100) if total_fp16 > 0 else 0
    print("-" * 60)
    print(f"{'AVERAGE':<20} {total_fp16/n_tasks:>7.2f}% {total_quant/n_tasks:>7.2f}% "
          f"{'':>8} {avg_retained:>9.1f}%")

    return {"fp16": fp16_results, "quantized": quant_results}

Latency and Throughput Benchmarking

import torch
import time
import numpy as np
from transformers import AutoModelForCausalLM, AutoTokenizer
from typing import List, Dict
import gc

def measure_inference_latency(
    model,
    tokenizer,
    prompts: List[str],
    max_new_tokens: int = 100,
    num_warmup: int = 10,
    num_runs: int = 100,
    device: str = "cuda",
) -> Dict:
    """
    Rigorous latency measurement with warmup and percentile reporting.

    Returns p50, p95, and mean for total generation time, plus throughput
    and peak VRAM. (TTFT requires token streaming and is measured
    separately - see the streamer sketch earlier in this lesson.)
    """
    model.eval()

    # Tokenize all prompts
    encodings = tokenizer(
        prompts,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=512,
    ).to(device)

    print(f"Input shape: {encodings.input_ids.shape}")
    print(f"Running {num_warmup} warmup iterations...")

    # Warmup - these results are discarded
    for _ in range(num_warmup):
        with torch.no_grad():
            _ = model.generate(
                **encodings,
                max_new_tokens=max_new_tokens,
                do_sample=False,
                pad_token_id=tokenizer.eos_token_id,
            )

    # Synchronize GPU before timing
    torch.cuda.synchronize()

    print(f"Running {num_runs} benchmark iterations...")

    total_times = []       # Total generation time (ms)
    tokens_generated = []  # Actual tokens produced

    for i in range(num_runs):
        torch.cuda.synchronize()
        t_start = time.perf_counter()

        with torch.no_grad():
            outputs = model.generate(
                **encodings,
                max_new_tokens=max_new_tokens,
                do_sample=False,
                pad_token_id=tokenizer.eos_token_id,
                return_dict_in_generate=True,
            )

        torch.cuda.synchronize()
        t_end = time.perf_counter()

        elapsed = (t_end - t_start) * 1000  # ms
        n_tokens = outputs.sequences.shape[1] - encodings.input_ids.shape[1]

        total_times.append(elapsed)
        tokens_generated.append(n_tokens)

        if (i + 1) % 20 == 0:
            print(f"  Iteration {i+1}/{num_runs}: {elapsed:.1f}ms, {n_tokens} tokens")

    total_times = np.array(total_times)
    tokens_generated = np.array(tokens_generated)

    # Tokens per second
    tps_values = tokens_generated / (total_times / 1000)

    # Peak VRAM
    peak_vram_mb = torch.cuda.max_memory_allocated(device) / (1024 ** 2)

    results = {
        "batch_size": encodings.input_ids.shape[0],
        "input_tokens": encodings.input_ids.shape[1],
        "avg_output_tokens": float(np.mean(tokens_generated)),
        "total_time_ms": {
            "p50": float(np.percentile(total_times, 50)),
            "p95": float(np.percentile(total_times, 95)),
            "mean": float(np.mean(total_times)),
            "std": float(np.std(total_times)),
        },
        "tokens_per_second": {
            "p50": float(np.percentile(tps_values, 50)),
            "p5": float(np.percentile(tps_values, 5)),  # low percentile = worst-case throughput
            "mean": float(np.mean(tps_values)),
        },
        "peak_vram_mb": peak_vram_mb,
        "peak_vram_gb": peak_vram_mb / 1024,
    }

    return results


def benchmark_multiple_batch_sizes(
    model_name: str,
    batch_sizes: List[int] = [1, 4, 8, 16, 32],
    prompt: str = "Explain the concept of machine learning in simple terms:",
    max_new_tokens: int = 100,
    **model_kwargs,
) -> Dict:
    """
    Benchmark a model across multiple batch sizes to capture
    both latency profile (bs=1) and throughput profile (bs>1).
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    model = AutoModelForCausalLM.from_pretrained(model_name, **model_kwargs)
    model.eval()

    all_results = {}

    for bs in batch_sizes:
        print(f"\nBenchmarking batch_size={bs}...")
        torch.cuda.reset_peak_memory_stats()

        prompts = [prompt] * bs

        try:
            results = measure_inference_latency(
                model, tokenizer, prompts,
                max_new_tokens=max_new_tokens,
                num_warmup=5,
                num_runs=50,
            )
            all_results[bs] = results

            # Print summary
            print(f"  bs={bs}: p50={results['total_time_ms']['p50']:.0f}ms, "
                  f"p95={results['total_time_ms']['p95']:.0f}ms, "
                  f"TPS={results['tokens_per_second']['mean']:.1f}, "
                  f"VRAM={results['peak_vram_gb']:.2f}GB")
        except torch.cuda.OutOfMemoryError:
            print(f"  bs={bs}: OOM - skipping")
            break

        # Clean up between batch sizes
        torch.cuda.empty_cache()
        gc.collect()

    return all_results


def print_benchmark_report(results_by_method: Dict[str, Dict]) -> None:
    """
    Print a formatted comparison table of benchmark results
    across different quantization methods.
    """
    methods = list(results_by_method.keys())
    batch_sizes = list(next(iter(results_by_method.values())).keys())

    print("\n" + "=" * 80)
    print("QUANTIZATION BENCHMARK REPORT")
    print("=" * 80)

    for bs in batch_sizes:
        print(f"\nBatch Size: {bs}")
        print(f"{'Method':<15} {'p50 (ms)':>10} {'p95 (ms)':>10} "
              f"{'TPS':>8} {'VRAM (GB)':>12}")
        print("-" * 60)

        for method in methods:
            if bs in results_by_method[method]:
                r = results_by_method[method][bs]
                print(f"{method:<15} "
                      f"{r['total_time_ms']['p50']:>10.0f} "
                      f"{r['total_time_ms']['p95']:>10.0f} "
                      f"{r['tokens_per_second']['mean']:>8.1f} "
                      f"{r['peak_vram_gb']:>12.2f}")

Complete Benchmarking Pipeline

import json
import torch
from pathlib import Path
from datetime import datetime

def run_complete_benchmark(
    model_configs: dict,
    output_dir: str = "./benchmark_results",
    run_perplexity: bool = True,
    run_downstream: bool = True,
    run_latency: bool = True,
) -> dict:
    """
    Full benchmark pipeline comparing multiple quantization configs.

    model_configs = {
        "FP16": {"model_name": "mistralai/Mistral-7B-v0.1"},
        "NF4": {
            "model_name": "mistralai/Mistral-7B-v0.1",
            "load_in_4bit": True,
        },
        "GPTQ-4bit": {
            "model_name": "TheBloke/Mistral-7B-v0.1-GPTQ",
            "gptq": True,
        },
        "AWQ-4bit": {
            "model_name": "TheBloke/Mistral-7B-v0.1-AWQ",
            "awq": True,
        },
    }
    """
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")

    all_results = {
        "timestamp": timestamp,
        "models": {},
    }

    for method_name, config in model_configs.items():
        print(f"\n{'='*60}")
        print(f"Benchmarking: {method_name}")
        print(f"Config: {config}")
        print(f"{'='*60}")

        method_results = {"config": config}
        model_name = config["model_name"]

        # 1. Perplexity
        if run_perplexity:
            print("\n[1/3] Computing perplexity...")
            ppl_wikitext = compute_perplexity(
                model_name,
                load_in_4bit=config.get("load_in_4bit", False),
                load_in_8bit=config.get("load_in_8bit", False),
            )
            ppl_c4 = compute_perplexity(
                model_name,
                dataset_name="c4",
                dataset_config="en",
                split="validation",
                load_in_4bit=config.get("load_in_4bit", False),
            )
            method_results["perplexity"] = {
                "wikitext2": ppl_wikitext["perplexity"],
                "c4": ppl_c4["perplexity"],
            }

        # 2. Downstream tasks
        if run_downstream:
            print("\n[2/3] Running downstream task evaluation...")
            tasks = ["mmlu", "hellaswag", "arc_challenge", "winogrande"]
            lm_eval_results = run_lm_eval(
                model_name,
                tasks=tasks,
                gptq_model=config.get("gptq", False),
                load_in_4bit=config.get("load_in_4bit", False),
            )
            method_results["downstream_tasks"] = parse_eval_results(lm_eval_results)

        # 3. Latency and memory
        if run_latency:
            print("\n[3/3] Running latency benchmark...")
            from transformers import BitsAndBytesConfig

            load_kwargs = {"torch_dtype": torch.float16, "device_map": "auto"}
            if config.get("load_in_4bit"):
                load_kwargs["quantization_config"] = BitsAndBytesConfig(
                    load_in_4bit=True,
                    bnb_4bit_quant_type="nf4",
                    bnb_4bit_compute_dtype=torch.float16,
                )

            latency_results = benchmark_multiple_batch_sizes(
                model_name,
                batch_sizes=[1, 4, 8],
                **load_kwargs,
            )
            method_results["latency"] = latency_results

        all_results["models"][method_name] = method_results

        # Save intermediate results after each model
        output_path = Path(output_dir) / f"benchmark_{timestamp}.json"
        with open(output_path, "w") as f:
            json.dump(all_results, f, indent=2)
        print(f"\nResults saved to {output_path}")

    return all_results

Domain-Specific Evaluation

def evaluate_domain_accuracy(
    model,
    tokenizer,
    eval_examples: list,
    prompt_template: str,
    device: str = "cuda",
) -> dict:
    """
    Evaluate model on domain-specific examples.

    eval_examples: list of dicts with 'input', 'expected_output', 'category'
    prompt_template: f-string with {input} placeholder

    This is the evaluation that actually matters for your production use case.
    The standardized benchmarks tell you about general capability retention,
    but this tells you about YOUR capability retention.
    """
    model.eval()

    correct = 0
    total = 0
    results_by_category = {}

    for example in eval_examples:
        category = example.get("category", "general")
        if category not in results_by_category:
            results_by_category[category] = {"correct": 0, "total": 0}

        prompt = prompt_template.format(input=example["input"])
        inputs = tokenizer(prompt, return_tensors="pt").to(device)

        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=50,
                do_sample=False,  # greedy decoding; temperature is ignored here
                pad_token_id=tokenizer.eos_token_id,
            )

        response = tokenizer.decode(
            outputs[0][inputs.input_ids.shape[1]:],
            skip_special_tokens=True,
        ).strip()

        # Simple exact/contains matching - customize for your domain
        expected = example["expected_output"].lower()
        is_correct = expected in response.lower()

        correct += is_correct
        total += 1
        results_by_category[category]["correct"] += is_correct
        results_by_category[category]["total"] += 1

    overall_accuracy = (correct / total * 100) if total > 0 else 0

    category_accuracies = {}
    for cat, counts in results_by_category.items():
        category_accuracies[cat] = (
            counts["correct"] / counts["total"] * 100
            if counts["total"] > 0 else 0
        )

    return {
        "overall_accuracy": overall_accuracy,
        "total_examples": total,
        "correct": correct,
        "by_category": category_accuracies,
    }

Production Engineering Notes

Building a Regression Gate

In production, you need automated benchmarking that runs on every quantization configuration before deployment. Build a benchmark gate with hard thresholds:

ACCEPTANCE_THRESHOLDS = {
    "perplexity_delta_pct": 5.0,           # max 5% perplexity increase
    "mmlu_retained_pct": 97.0,             # must retain 97% of MMLU accuracy
    "hellaswag_retained_pct": 98.0,        # hellaswag is robust, high bar
    "arc_challenge_retained_pct": 96.0,
    "domain_accuracy_retained_pct": 95.0,  # domain is most important
}

def check_acceptance(fp16_results: dict, quant_results: dict) -> bool:
    """Returns True if quantized model passes all thresholds."""
    fp16_ppl = fp16_results["perplexity"]["wikitext2"]
    quant_ppl = quant_results["perplexity"]["wikitext2"]
    ppl_delta_pct = ((quant_ppl - fp16_ppl) / fp16_ppl) * 100

    if ppl_delta_pct > ACCEPTANCE_THRESHOLDS["perplexity_delta_pct"]:
        print(f"FAIL: Perplexity delta {ppl_delta_pct:.2f}% exceeds threshold")
        return False

    for task in ["mmlu", "hellaswag", "arc_challenge"]:
        fp16_acc = fp16_results["downstream_tasks"][task]["accuracy"]
        quant_acc = quant_results["downstream_tasks"][task]["accuracy"]
        retained = (quant_acc / fp16_acc * 100) if fp16_acc > 0 else 0
        threshold_key = f"{task}_retained_pct"

        if threshold_key in ACCEPTANCE_THRESHOLDS:
            if retained < ACCEPTANCE_THRESHOLDS[threshold_key]:
                print(f"FAIL: {task} retained {retained:.1f}% below threshold")
                return False

    print("PASS: All thresholds met")
    return True

Calibration Dataset Strategy

For domain-specific deployments, build a mixed calibration dataset:

  1. Take roughly 90 samples from WikiText-2 or C4 (general capability preservation)
  2. Take roughly 38 samples from your domain data (domain adaptation - a ~70/30 mix, 128 samples total)
  3. Shuffle and use as calibration for GPTQ/AWQ

Keep your calibration samples separate from your evaluation set. Using the same data for calibration and evaluation will give falsely optimistic results.

Memory Measurement in Practice

VRAM measurement is subtle. torch.cuda.memory_allocated() only measures PyTorch tensors. For a complete picture:

import torch

def get_memory_stats(device="cuda:0"):
    stats = {
        "allocated_mb": torch.cuda.memory_allocated(device) / 1024**2,
        "reserved_mb": torch.cuda.memory_reserved(device) / 1024**2,
        "peak_allocated_mb": torch.cuda.max_memory_allocated(device) / 1024**2,
    }
    # nvidia-smi reports total process VRAM including CUDA context overhead,
    # typically 200-400 MB higher than PyTorch reports
    return stats

Always measure peak allocated memory during actual inference with your target sequence length and batch size. Static model loading memory is not the number that matters - inference peak memory is.
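
A small sketch of that measurement, wrapping a single generation at representative lengths (the helper function is hypothetical, not a library API):

import torch

def inference_peak_vram_gb(model, tokenizer, prompt: str, max_new_tokens: int = 256, device: str = "cuda") -> float:
    """Peak VRAM (GB) during one generation at your target sequence lengths."""
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats(device)

    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    with torch.no_grad():
        model.generate(**inputs, max_new_tokens=max_new_tokens)

    return torch.cuda.max_memory_allocated(device) / 1024**3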


Common Mistakes

:::danger Skipping Warmup in Latency Benchmarks
The single most common benchmarking mistake. The first inference call on a GPU is 5-50x slower than subsequent calls due to CUDA kernel compilation. If you measure only one call, your latency numbers are completely wrong. Always run at least 5-10 warmup iterations before recording any measurements. This is non-negotiable.
:::

:::danger Using Perplexity as the Only Quality Metric
Perplexity is a necessary sanity check but not a sufficient quality signal. A model with acceptable perplexity can still show severe degradation on reasoning tasks, numerical tasks, or domain-specific tasks. Always run downstream task benchmarks. Always run domain evaluation if you have domain-specific data. Ships have sunk because teams looked only at perplexity and called it good.
:::

:::warning Calibrating on Your Evaluation Data
If you use your test set or evaluation data as the calibration dataset for GPTQ or AWQ, you will get artificially optimistic benchmark results. The calibration data should come from a separate distribution (typically general text like WikiText-2 or C4). Keep your domain evaluation data strictly separate from calibration data.
:::

:::warning Benchmarking at the Wrong Batch Size
Memory savings from quantization primarily help throughput at batch sizes greater than 1. If you benchmark only at batch=1 and report the speedup, you are misleading yourself. Single-sequence latency often improves very little from quantization because the bottleneck is memory bandwidth, not compute. The real gains appear at batch=8 or higher, where compute utilization improves. Always benchmark across multiple batch sizes.
:::

:::warning Ignoring Statistical Significance
A 0.3-point drop on MMLU might be within the noise of the evaluation. Report standard errors from lm-evaluation-harness. The stderr for a 57-subject benchmark like MMLU, evaluated on the full test set, is typically around 0.3-0.5 percentage points. Differences smaller than 2x the stderr are not statistically meaningful.
:::
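
The two-stderr rule above reduces to a few lines. This is a crude check assuming independent errors; a proper treatment would use a paired test over the shared questions:

import math

def is_significant(acc_a: float, acc_b: float, se_a: float, se_b: float) -> bool:
    """Is the accuracy delta larger than ~2x the combined standard error?"""
    return abs(acc_a - acc_b) > 2 * math.sqrt(se_a**2 + se_b**2)

# A 0.3-point drop with 0.4-point stderrs on each side is noise
print(is_significant(62.5, 62.2, 0.4, 0.4))  # False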


Interview Q&A

Q1: You quantize a model to 4 bits and the perplexity on WikiText-2 increases by only 0.15, which looks great. But in production the model performs noticeably worse on your use case. What went wrong and how do you diagnose it?

A: The perplexity-accuracy disconnect. Perplexity is an average over millions of tokens of general text, so it masks degradation on specific capability types. The production use case likely requires capabilities that are disproportionately sensitive to quantization - most commonly multi-step reasoning, numerical computation, or domain-specific knowledge recall. To diagnose: (1) run lm-evaluation-harness on MMLU, ARC-Challenge, and WinoGrande to identify which capability category degraded; (2) inspect which layers have highest quantization error using the GPTQ/AWQ layer-wise error metrics; (3) try keeping the most sensitive layers in FP16 using quantization skip lists. Also check whether the calibration data matched your production domain - if you calibrated on WikiText-2 but deploy on financial text, the quantization parameters may be suboptimal for your distribution.

Q2: How do you correctly measure the latency improvement from quantization for a production serving system?

A: Several things matter. First, run 5-10 warmup iterations before recording any measurements - the first forward pass includes CUDA kernel compilation and is unrepresentative. Second, run at least 50-100 iterations and report p50 and p95, not mean - GPU execution has variance and means hide tail latency. Third, benchmark at your production batch size, not just batch=1. Quantization's primary benefit at batch=1 is memory reduction enabling serving on smaller GPUs; the throughput benefit appears at batch sizes of 4-16+ where compute utilization increases. Fourth, measure end-to-end latency including tokenization, KV cache allocation, and response serialization, not just model forward pass time. Finally, test at your production sequence lengths - KV cache memory scales with sequence length and changes the effective memory savings.

Q3: What is the impact of calibration dataset choice on GPTQ quantization quality, and what would you use for a medical question-answering deployment?

A: GPTQ uses calibration data to compute the Hessian (second-order weight importance matrix) used in layer-wise reconstruction. The calibration data shapes which weight patterns are most carefully preserved. Calibrating on general text (C4, WikiText-2) produces a balanced model that degrades gracefully across domains. Calibrating on domain-specific text improves performance on that domain but can hurt general capability. For medical QA, I would use a mixed calibration set: ~70% from C4 or WikiText-2 (preserves general reasoning and language capability), ~30% from medical text (PubMed abstracts, clinical notes from MIMIC if available). I would use 512 samples total at 2048 token context length. I would also use a smaller group size (32 rather than 128) to reduce quantization error in exchange for slightly less compression - for medical applications, accuracy matters more than squeezing out the last bit of compression.

Q4: How do you build a quantization benchmarking pipeline that can serve as a deployment gate in CI/CD?

A: The pipeline has four stages with defined pass/fail thresholds. Stage 1 is a fast sanity check - compute perplexity on 1000 tokens of WikiText-2 (takes ~1 minute). If perplexity delta exceeds 5%, fail fast. Stage 2 is downstream task evaluation with lm-eval-harness on a reduced task set - just MMLU 5-shot on a 20% sample of the test set (takes ~10 minutes). If any task retains less than 96% of FP16 accuracy, fail. Stage 3 is domain evaluation on your held-out domain test set (takes ~5 minutes). If domain accuracy drops more than 3 percentage points, fail. Stage 4 is performance profiling - measure VRAM and tokens/sec at batch=1 and batch=8 (takes ~3 minutes). If VRAM exceeds your deployment budget, fail. Total runtime under 20 minutes, which is feasible for CI. Store all results as JSON artifacts in your CI system. Track them over time to detect if model updates or quantization config changes cause regressions.

Q5: MMLU drops 2.1 points when you go from 8-bit to 4-bit quantization, but your memory budget requires 4-bit. What options do you have to recover the accuracy?

A: Several targeted interventions. First, try a finer quantization granularity - reducing group_size from 128 to 64 or 32 increases accuracy at the cost of slightly more overhead and memory. Second, identify the most sensitive layers by examining per-layer quantization error (GPTQ provides these statistics). Keep the top 10-20% most sensitive layers in FP8 or FP16 - this is called "mixed precision quantization" and typically recovers 60-80% of the accuracy gap with minimal memory overhead. Third, try AWQ instead of GPTQ at the same bit width - AWQ's weight scaling approach often produces better results on reasoning tasks. Fourth, if using BitsAndBytes NF4, enable double quantization and try a 64-group-size variant. Fifth, check whether specific MMLU subjects are driving the drop - if it is concentrated in a few subjects (often mathematics, physics, or formal reasoning), those domains may need a specialized fine-tuning step after quantization. Finally, if none of these work, revisit the model choice: a larger model quantized to 4-bit often beats a smaller model at higher precision for the same memory footprint.

Q6: How do you interpret a situation where quantized model latency is actually slower than FP16 at batch size 1?

A: This is more common than people expect, especially with software-based quantization (BitsAndBytes NF4) on hardware without fast low-bit kernels. The issue is dequantization overhead. With NF4, weights are stored in 4-bit but computation still happens in FP16. Before each matrix multiply, the 4-bit weights are dequantized to FP16, which takes time. At batch=1, the model is memory-bandwidth-bound, and the dequantization adds overhead on top of the already-fast memory reads. The breakeven point depends on your GPU - on an A100, the dequantization overhead typically starts paying off at batch=4 to batch=8. On an RTX 3090, it may not pay off until batch=16. For true latency reduction at batch=1, you need kernels optimized for quantized weights - for example, GPTQ served through the Marlin kernel, which has a highly optimized 4-bit GEMM. AWQ with the vLLM backend also has optimized CUDA kernels that make 4-bit generation faster than FP16 at batch=1 on supported hardware. If your production use case is latency-sensitive at batch=1, test the actual serving framework (vLLM, TGI, llama.cpp) rather than just the Transformers library - kernel optimization matters enormously for single-sequence latency.

© 2026 EngineersOfAI. All rights reserved.