What is model compression?

The memory wall, inference costs, edge deployment, and latency requirements that make model compression essential for production AI systems - with real cost math, a full compression taxonomy, and decision frameworks for choosing the right technique.

How does memory wall work in practice?

Why Model Compression Matters covers model compression, memory wall, inference cost from first principles with code examples. Free lesson at https://engineersofai.com/docs/ai-engineering/model-compression/why-compression


Why Model Compression Matters

The $100K Per Month Wake-Up Call

It is 2 AM on a Tuesday. Your team just shipped the new AI-powered legal document review feature. Users love it. The Slack channel is lighting up with praise. Then the product manager sends a message: "Can we roll this out to enterprise tier - 500 simultaneous users?" You do the math at your desk, stomach sinking.

Your current setup handles 12 concurrent requests. Each inference call ties up a full A100 GPU for about 800 milliseconds. To serve 500 users with acceptable latency, you need roughly 40 A100s running continuously. At $3.50/hour per GPU on AWS, that is $140/hour, $3,360/day, $100,800/month - for a single feature. You go back to the product manager: "We can do it, but it will cost us about $100K per month just in GPU compute." The feature gets shelved. The enterprise contract goes unsigned.

This is not a hypothetical. It is a pattern repeated thousands of times across the industry every year. The gap between what LLMs can do and what organizations can afford to run is the central engineering problem of the current AI era. Model compression is the discipline that closes this gap. Done well, it transforms a model that requires 4 A100 GPUs into one that runs on a single RTX 4090. It turns a model that costs $0.10 per thousand tokens into one that costs $0.01. It makes the difference between a product that ships and one that sits in a design document.

This lesson explains why compression is necessary - the physical, mathematical, and economic forces that make it unavoidable - and gives you the full taxonomy of techniques so you can make informed choices about which to apply and when. The rest of this module goes deep on each technique. Here, you are building the mental model that makes those details meaningful.

The Physics of the Problem

Parameters Have Weight - Literally

Every parameter in a neural network is a number that must be stored in physical memory and moved across physical buses during inference. This is not a software abstraction - it is a constraint imposed by the laws of physics and the economics of silicon fabrication.

In FP32 (32-bit floating point), a single parameter occupies 4 bytes. A 7B parameter model requires a minimum of 28 GB of memory just to hold the weights - before you account for the KV cache and activations at inference time, or the gradients and optimizer states needed for training. That 28 GB does not fit on a single consumer GPU (24 GB at most). It fits on an A100 80GB, but the weights alone consume over a third of the card before any request is served. Serving multiple requests simultaneously multiplies the KV cache and activation requirements further.

The situation becomes more stark as models grow. A 70B model in FP32 requires 280 GB of memory - more than three A100s just for the weights. A 175B model requires 700 GB. The hardware is not keeping up. GPU memory has grown modestly while model sizes have grown exponentially. This is the memory wall: the gap between what models require and what hardware provides.
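The weight-memory figures above are just parameter count times bytes per parameter. A minimal sketch of that arithmetic follows; the function name and the dtype table are illustrative, and 1 GB is taken as 10^9 bytes to match the numbers in the text. KV cache and activations are not included.

```python
# Bytes per parameter for common storage formats (weights only).
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(n_params: float, dtype: str) -> float:
    """Memory needed to hold the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


for name, n_params in [("7B", 7e9), ("70B", 70e9), ("175B", 175e9)]:
    row = ", ".join(
        f"{dtype}: {weight_memory_gb(n_params, dtype):.1f} GB"
        for dtype in BYTES_PER_PARAM
    )
    print(f"{name:>5}  {row}")
```

Real deployments need headroom on top of these figures, so treat them as lower bounds.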

The numbers make the asymmetry visceral. From 2019 to 2022, model parameter count grew from 1.5B (GPT-2) to 540B (PaLM) - a 360x increase. Over the same period, flagship GPU memory grew from 32 GB (V100) to 80 GB (H100) - a 2.5x increase. The gap is not closing. If anything, it widens as organizations push for larger models while hardware roadmaps face physical limits on memory density.

The Bandwidth Bottleneck: Why Reading Weights Is the Real Problem

Storage is only half the problem. The other half is memory bandwidth - how fast you can move weights from DRAM into the GPU's compute cores. Modern transformers are often memory bandwidth-bound, not compute-bound. The bottleneck is not the number of FLOPS your GPU can execute, but how fast it can read weight data from memory.

An A100 GPU has 312 TFLOPS of FP16 compute but only 2 TB/s of memory bandwidth. A 7B parameter model in FP16 has 14 GB of weights. Reading these weights once takes 7 milliseconds just in pure bandwidth time. For autoregressive generation - where you must read every weight for every new token - this fundamentally limits throughput.

The arithmetic is brutal and inescapable:

Per-token weight reads at batch_size=1 (autoregressive decoding):

  7B model in FP16:
    14 GB to read per token
    At 2 TB/s bandwidth: 7ms minimum per token
    Max throughput: ~143 tokens/second (bandwidth-limited)
    GPU compute utilization: under 1% (~14 GFLOP per token × 143 tok/s ≈ 2 TFLOPS of 312 available)

  7B model in INT4 (AWQ/GPTQ):
    3.5 GB to read per token
    At 2 TB/s bandwidth: 1.75ms minimum per token
    Max throughput: ~570 tokens/second (bandwidth-limited)
    GPU compute utilization: ~2.6% (still mostly waiting, but each token arrives 4x sooner)

  Speedup from INT4 quantization: 4x
  Source of speedup: reduced memory transfer, not faster compute
  The GPU's arithmetic units did not change at all.
  We simply reduced the volume of data it must wait for.

This is why quantization is often the first and most impactful tool in the compression toolkit. It directly attacks the bandwidth bottleneck that governs autoregressive inference speed. You are not making the GPU compute faster - you are making it wait less.
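The bandwidth bound above can be reproduced in a few lines. This is a back-of-the-envelope sketch, not a performance model: it assumes every weight is read exactly once per generated token and ignores KV cache reads, kernel launch overhead, and dequantization cost. The function names are illustrative.

```python
def min_ms_per_token(weight_gb: float, bandwidth_gb_per_s: float) -> float:
    """Lower bound on per-token latency when decoding is bandwidth-bound."""
    return weight_gb / bandwidth_gb_per_s * 1000.0


def max_tokens_per_second(weight_gb: float, bandwidth_gb_per_s: float) -> float:
    """Upper bound on batch-size-1 decoding throughput."""
    return bandwidth_gb_per_s / weight_gb


A100_BANDWIDTH_GB_PER_S = 2000.0  # ~2 TB/s, as quoted in the text

for label, weight_gb in [("7B FP16", 14.0), ("7B INT4", 3.5)]:
    ms = min_ms_per_token(weight_gb, A100_BANDWIDTH_GB_PER_S)
    tps = max_tokens_per_second(weight_gb, A100_BANDWIDTH_GB_PER_S)
    print(f"{label}: >= {ms:.2f} ms/token, <= {tps:.0f} tok/s")
```

Measured throughput will land below these ceilings; the point of the sketch is that the ceiling moves with bytes read, not with FLOPS.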

:::info The Arithmetic Intensity Problem Arithmetic intensity measures FLOPs performed per byte of memory transferred. Transformer inference at batch size 1 has very low arithmetic intensity - you move 14 GB of weights to perform roughly 14 GFLOP of work (matrix-vector products), about 1 FLOP per byte. The GPU can do 312,000 GFLOPS but can only transfer 2,000 GB/s, so it needs about 156 FLOPs per byte (312,000 / 2,000) to keep its arithmetic units busy. The workload is memory-bound, not compute-bound: the compute units sit starved for data. Quantization improves this by reducing bytes transferred without reducing useful computation. :::

The Economic Reality of Cloud GPU Pricing

Cloud GPU pricing as of 2025 makes the economics stark. The table below shows which model sizes fit on which GPUs and what compression enables:

GPU	VRAM	Cost (on-demand)	Models at FP16	Models at INT4
RTX 4090	24 GB	$0.74/hr (Lambda)	7B	34B
A10G	24 GB	$1.50/hr (AWS)	7B	34B
A100 40GB	40 GB	$3.50/hr	13B	70B
A100 80GB	80 GB	$5.00/hr	34B	70B (comfortable)
H100 80GB	80 GB	$7.00/hr	34B	70B + headroom
H100 NVL	188 GB	$12.00/hr	70B	405B

By quantizing a 70B model from FP16 to INT4, you cut its memory from 140 GB to 35 GB - fitting it on a single A100 80GB instead of two. That is a 2x cost reduction in hardware, translating directly to 2x lower serving costs. For a production system handling millions of requests per day, this difference is millions of dollars per year.

Even more dramatic: a 34B model that needs an A100 80GB at FP16 fits on an RTX 4090 at INT4, at $0.74/hour instead of $5.00/hour. The hourly hardware cost drops by roughly 6.8x just from compression; the cost-per-token improvement depends on the throughput each card actually sustains. This is not a marginal optimization - it is a business model shift.

For an API where each serving instance generates 10 million tokens per day:

FP16 serving (70B model, 2× A100 80GB):
  GPUs per serving instance: 2
  Cost per hour (2 GPUs): $10/hr
  Tokens per second (2× A100): ~150 tok/s
  Time to generate 10M tokens: 10M / 150 = 66,667 s ≈ 18.5 hours of instance time
  Cost: 18.5 × $10 = $185 per day, $5,550/month per serving instance

INT4 serving (70B model AWQ, 1× A100 80GB):
  GPUs per serving instance: 1
  Cost per hour (1 GPU): $5/hr
  Tokens per second (1× A100 AWQ): ~280 tok/s
  Time to generate 10M tokens: 10M / 280 = 35,714 s ≈ 9.9 hours of instance time
  Cost: 9.9 × $5 = $49.5 per day, $1,485/month per serving instance

Savings from compression: $4,065/month per serving instance
At 10 serving instances: $40,650/month saved
Annual savings: ~$488,000
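The cost comparison above can be reproduced with a short script. This is a sketch under the stated prices and throughputs; `TOKENS_PER_DAY` is set to 10 million, which is the daily volume that yields the 18.5 and 9.9 hour figures. It assumes you pay only for hours actually used, and the small difference from the $4,065 figure comes from rounding the hours before multiplying.

```python
def daily_cost(tokens_per_day: float, tokens_per_second: float,
               dollars_per_hour: float) -> tuple[float, float]:
    """Return (instance-hours per day, dollars per day) for one serving instance."""
    hours = tokens_per_day / tokens_per_second / 3600.0
    return hours, hours * dollars_per_hour


TOKENS_PER_DAY = 10_000_000

fp16_hours, fp16_cost = daily_cost(TOKENS_PER_DAY, 150, 10.0)  # 2x A100 80GB
int4_hours, int4_cost = daily_cost(TOKENS_PER_DAY, 280, 5.0)   # 1x A100 80GB

monthly_savings = (fp16_cost - int4_cost) * 30
print(f"FP16: {fp16_hours:.1f} h/day, ${fp16_cost:.0f}/day")
print(f"INT4: {int4_hours:.1f} h/day, ${int4_cost:.0f}/day")
print(f"Savings per instance: ${monthly_savings:,.0f}/month")
```

Swapping in your own price, throughput, and volume is usually more informative than the illustrative figures here.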

The Full Compression Taxonomy

Model compression is not a single technique - it is a family of four distinct approaches, each with different tradeoffs and use cases. Understanding the taxonomy before diving into any one technique prevents the common mistake of reaching for a hammer when you need a scalpel.

Quantization: The First Tool to Reach For

Quantization reduces the numeric precision of model weights (and sometimes activations) from floating-point formats (FP32, FP16, BF16) to lower-precision integers (INT8, INT4) or custom formats (NF4, FP8). A weight that previously occupied 4 bytes at FP32 now occupies 1 byte at INT8 - a 4x memory reduction.

The key insight is that most model weights are not uniformly important and do not require the full dynamic range of FP32. Research has shown that models can tolerate significant precision reduction in weights while maintaining near-original accuracy, as long as the quantization is done carefully. The critical challenge is outlier weights - a small fraction of values that are much larger than the rest. Modern methods like GPTQ and AWQ handle these outliers explicitly and elegantly.

When to use it: Quantization is the first tool to reach for in virtually every deployment scenario. It provides the best accuracy-to-compression ratio and requires no retraining. Post-training quantization (PTQ) can be applied in hours to any pretrained model.

Memory reduction achieved:

FP32 → FP16/BF16: 2x reduction (virtually free, use always)
FP16 → INT8: 2x further reduction (4x vs FP32)
FP16 → INT4: 4x further reduction (8x vs FP32)
FP16 → INT2: 8x further reduction (16x vs FP32, accuracy not production-ready)

The quantization formula:

$Q(w) = \text{clamp}\!\left(\text{round}\!\left(\frac{w}{s}\right) + z,\; 0,\; 2^b - 1\right)$

where $s$ is the scale factor, $z$ is the zero-point (for asymmetric quantization), and $b$ is the number of bits. Dequantization recovers an approximation: $\hat{w} = s \cdot (Q(w) - z)$. The error $w - \hat{w}$ is the quantization noise - minimizing it is the central problem that GPTQ and AWQ solve differently.
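A minimal round-trip through the formula above, in plain Python. One assumption is added here: the scale and zero-point are derived from the minimum and maximum of the tensor, which is the simplest possible calibration and not what GPTQ or AWQ do. The sample weights are made up for illustration.

```python
def quantize(weights, bits):
    """Asymmetric uniform quantization: Q(w) = clamp(round(w/s) + z, 0, 2^b - 1)."""
    q_max = 2 ** bits - 1
    w_min, w_max = min(weights), max(weights)
    # Assumption: min/max calibration over the whole tensor.
    scale = (w_max - w_min) / q_max
    if scale == 0.0:
        scale = 1.0  # constant tensor: avoid division by zero
    zero_point = round(-w_min / scale)
    q = [min(max(round(w / scale) + zero_point, 0), q_max) for w in weights]
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Recover the approximation w_hat = s * (Q(w) - z)."""
    return [scale * (v - zero_point) for v in q]


def max_abs_error(weights, bits):
    q, scale, zero_point = quantize(weights, bits)
    w_hat = dequantize(q, scale, zero_point)
    return max(abs(w - wh) for w, wh in zip(weights, w_hat))


weights = [-0.62, -0.11, 0.0, 0.07, 0.35, 0.91]  # illustrative values
for bits in (8, 4):
    print(f"{bits}-bit max error: {max_abs_error(weights, bits):.4f}")
```

As long as no value is clamped, the error per weight is at most half the scale, which is why halving the bit width from 8 to 4 increases the worst-case error by roughly 17x for the same range.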

Pruning: Reducing the Number of Operations

Pruning removes parts of a model that contribute little to the output. Unstructured pruning zeroes out individual weights, creating a sparse matrix. Structured pruning removes entire architectural units - attention heads, MLP neurons, or whole layers - resulting in a smaller, denser model.

The fundamental question pruning answers is: "Which parts of this model are actually doing useful work?" Empirically, a large fraction of attention heads in transformers can be removed with minimal accuracy loss. Whole layers can sometimes be dropped. The challenge is identifying which ones without catastrophically harming accuracy.

The critical practical distinction:

Unstructured pruning: Creates sparse weights but typically does not speed up inference on standard GPUs - zeros are still stored and processed by the GPU's matrix multiply hardware. Speedup requires dedicated sparse hardware (e.g., NVIDIA A100 structured sparsity support for 2:4 sparsity patterns) or custom sparse kernels.
Structured pruning: Changes tensor shapes, directly reducing computation and memory. Removing 25% of attention heads means the key/query/value matrices shrink proportionally - the GPU literally does less work.

When to use it: Structured pruning delivers actual latency improvements, not just memory savings. Prune for architecture reduction, then quantize for precision reduction. The techniques compose well.
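To make the structured-versus-unstructured distinction concrete, the sketch below counts the projection weights in one attention block before and after removing heads. It assumes standard multi-head attention with separate Q, K, V, and output projections, no biases, and no grouped-query attention; the dimensions are illustrative.

```python
def attention_params(d_model: int, n_heads: int, head_dim: int) -> int:
    """Weights in the Q, K, V and output projections of one attention block."""
    inner = n_heads * head_dim
    return 3 * d_model * inner + inner * d_model


d_model, head_dim = 4096, 128

full = attention_params(d_model, n_heads=32, head_dim=head_dim)
# Structured pruning: drop 8 of 32 heads, so the matrices physically shrink.
pruned = attention_params(d_model, n_heads=24, head_dim=head_dim)

print(f"Full block:   {full:,} weights")
print(f"Pruned block: {pruned:,} weights ({pruned / full:.0%} of original)")

# Unstructured pruning at the same 25% sparsity keeps the original shapes:
# a dense kernel still stores and multiplies all `full` entries, zeros included.
dense_entries_after_unstructured = full
```

The pruned block does 25% less work on any hardware, while the unstructured case needs sparse kernels or 2:4 hardware support before the zeros pay off.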

Knowledge Distillation: Train Small, Think Big

Distillation trains a small student model to imitate the behavior of a large teacher model. Instead of training the student on hard labels (0 or 1), it trains to match the teacher's soft probability distributions. These soft distributions contain more information than hard labels - the teacher's uncertainty and second-best guesses encode a form of knowledge that the student learns to replicate.

Geoffrey Hinton and colleagues popularized this approach in 2015. DistilBERT achieved 97% of BERT's performance at 60% of the size through distillation. TinyBERT achieved 96.8% of BERT-base performance with 7.5x fewer parameters. For production use cases where a stable, high-volume task is being served, task-specific distillation often produces the best cost-to-performance ratio of any technique.

The distillation loss combines cross-entropy on true labels with KL divergence between student and teacher distributions:

$\mathcal{L}_{\text{distill}} = \alpha \cdot \mathcal{L}_{\text{CE}}(y, p_S) + (1-\alpha) \cdot T^2 \cdot \text{KL}(p_T^{(T)} \| p_S^{(T)})$

where $T$ is the temperature parameter that softens the distributions, $p_T^{(T)}$ and $p_S^{(T)}$ are teacher and student softmax outputs at temperature $T$, and $\alpha$ balances the two loss terms.
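The loss above can be evaluated on a single example in plain Python. This sketch works on raw logit lists rather than tensors so the arithmetic is visible; the logits, `alpha`, and `temperature` values are made up, and a real training loop would use a framework's batched, differentiable equivalents.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax at a given temperature, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(student_logits, teacher_logits, label,
                      alpha=0.5, temperature=2.0):
    """alpha * CE(y, p_S) + (1 - alpha) * T^2 * KL(p_T^(T) || p_S^(T))."""
    ce = -math.log(softmax(student_logits)[label])
    kl = kl_divergence(
        softmax(teacher_logits, temperature),
        softmax(student_logits, temperature),
    )
    return alpha * ce + (1 - alpha) * temperature ** 2 * kl


teacher = [4.0, 1.5, 0.2]   # illustrative logits
student = [2.0, 2.0, 0.5]
print(f"loss (student differs): {distillation_loss(student, teacher, label=0):.4f}")
print(f"loss (student matches): {distillation_loss(teacher, teacher, label=0):.4f}")
```

When the student reproduces the teacher's logits the KL term vanishes and only the cross-entropy term remains, which is a quick sanity check for any implementation.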

When to use it: Distillation is the right choice when you have significant compute budget for a one-time training run and want a permanently smaller architecture. It produces models that are fast and accurate, but requires training data and a working teacher model. It is not a deployment-time technique - it is a training-time investment with long-term payoff.

Low-Rank Methods: LoRA and SVD Approximation

Large weight matrices in transformers are often low-rank in practice: most of the information content of a 4096×4096 weight matrix can be captured by two smaller matrices of rank $r$, where $r$ is much less than 4096. LoRA (Low-Rank Adaptation) exploits this for fine-tuning - instead of updating the full weight matrix, it learns a low-rank update that is orders of magnitude smaller.

For a 7B model, LoRA with rank=16 reduces the number of trainable parameters from 7B to roughly 10M - a 700x reduction. Combined with 4-bit quantization of the frozen base model (QLoRA), this makes fine-tuning 70B models possible on a single A100.
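The trainable-parameter count is easy to check by hand. The sketch below assumes a 7B-class shape (32 layers, hidden size 4096) with rank-16 adapters on two square projections per layer; which modules receive adapters is a configuration choice, so the total shifts with that assumption but stays in the same order of magnitude as the figure above.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable weights in one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


# Assumed shape, not read from any specific checkpoint.
n_layers, d_model, rank = 32, 4096, 16
adapted_matrices_per_layer = 2  # e.g. the query and value projections

per_matrix_full = d_model * d_model
per_matrix_lora = lora_params(d_model, d_model, rank)
total_lora = n_layers * adapted_matrices_per_layer * per_matrix_lora

print(f"Full 4096x4096 matrix: {per_matrix_full:,} weights")
print(f"Rank-16 adapter:       {per_matrix_lora:,} weights "
      f"({per_matrix_full // per_matrix_lora}x fewer)")
print(f"Total trainable:       {total_lora / 1e6:.1f}M")
```

Adding adapters to more projections, or raising the rank, scales the total linearly.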

When to use it: LoRA is primarily for efficient fine-tuning, not inference compression. For inference, the LoRA weights can be merged back into the base model with no runtime overhead. SVD-based compression can apply the low-rank idea to compress existing weights, but this typically underperforms quantization for pure inference optimization.

The Concrete Numbers: A Benchmark Comparison

Here is a working example you can run to see compression impact in concrete terms. This demonstrates the memory and throughput difference before any further theory:

import torch
import time
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.1-8B-Instruct"


def get_peak_memory_gb() -> float:
    """Return peak GPU memory usage in GB since last reset."""
    if torch.cuda.is_available():
        return torch.cuda.max_memory_allocated() / 1e9
    return 0.0


def load_and_benchmark(
    quantization_config=None,
    label: str = "FP16",
    prompt: str = "Explain the difference between machine learning and deep learning in detail:",
    n_generate_tokens: int = 50,
) -> dict:
    """
    Load a model with the given quantization config and benchmark it.
    Returns memory, latency, and throughput measurements.
    """
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()

    print(f"\n{'='*60}")
    print(f"Loading {label} ...")

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    tokenizer.pad_token = tokenizer.eos_token

    load_kwargs = {"device_map": "auto"}
    if quantization_config is not None:
        load_kwargs["quantization_config"] = quantization_config
    else:
        load_kwargs["torch_dtype"] = torch.float16

    t_load_start = time.time()
    model = AutoModelForCausalLM.from_pretrained(model_id, **load_kwargs)
    load_time = time.time() - t_load_start

    peak_mem_gb = get_peak_memory_gb()
    n_params = sum(p.numel() for p in model.parameters()) / 1e9

    print(f"  Load time:        {load_time:.1f}s")
    print(f"  Peak GPU memory:  {peak_mem_gb:.2f} GB")
    print(f"  Parameters:       {n_params:.2f}B")

    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

    # Warmup run - first generation is slower due to JIT and caching
    with torch.no_grad():
        _ = model.generate(**inputs, max_new_tokens=10, do_sample=False)

    # Timed generation run
    torch.cuda.synchronize()
    t_gen_start = time.time()
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=n_generate_tokens,
            do_sample=False,
        )
    torch.cuda.synchronize()
    gen_time = time.time() - t_gen_start

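    # Count tokens actually produced: generation can stop early at EOS,
    # which would inflate throughput if we divided the requested count by time.
    n_prompt_tokens = inputs.input_ids.shape[1]
    n_new_tokens = outputs.shape[1] - n_prompt_tokens
    tokens_per_second = n_new_tokens / gen_time
    response = tokenizer.decode(
        outputs[0][n_prompt_tokens:],
        skip_special_tokens=True
    )

    print(f"  Generation ({n_new_tokens} tokens): {gen_time*1000:.0f}ms")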
    print(f"  Throughput: {tokens_per_second:.1f} tok/s")
    print(f"  Sample: {response[:80]}...")

    result = {
        "label": label,
        "memory_gb": peak_mem_gb,
        "tokens_per_second": tokens_per_second,
        "gen_time_ms": gen_time * 1000,
    }

    del model
    torch.cuda.empty_cache()
    return result


# --- Run comparisons across three precision levels ---

# 1. FP16 baseline
results_fp16 = load_and_benchmark(label="FP16 (baseline)")

# 2. INT8 via bitsandbytes LLM.int8() - outlier-aware mixed precision
int8_config = BitsAndBytesConfig(load_in_8bit=True)
results_int8 = load_and_benchmark(
    quantization_config=int8_config,
    label="INT8 (LLM.int8)",
)

# 3. INT4 NF4 with double quantization - the QLoRA-style baseline
int4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # NormalFloat4: optimal for normally distributed weights
    bnb_4bit_use_double_quant=True,       # Quantize the scale constants themselves
    bnb_4bit_compute_dtype=torch.bfloat16, # Dequantize to BF16 for computation
)
results_int4 = load_and_benchmark(
    quantization_config=int4_config,
    label="INT4 NF4 (QLoRA-style)",
)

# --- Summary table ---
all_results = [results_fp16, results_int8, results_int4]
fp16_mem = results_fp16["memory_gb"]
fp16_tps = results_fp16["tokens_per_second"]

print(f"\n{'='*70}")
print(f"{'Configuration':<28} {'Memory':>8} {'Compression':>13} {'Tok/s':>8} {'Speedup':>8}")
print("-" * 70)
for r in all_results:
    compression = fp16_mem / r["memory_gb"] if r["memory_gb"] > 0 else 0
    speedup = r["tokens_per_second"] / fp16_tps
    print(
        f"{r['label']:<28} {r['memory_gb']:>7.1f}G "
        f"{compression:>12.1f}x {r['tokens_per_second']:>8.1f} {speedup:>7.1f}x"
    )

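Running this on a Llama 3.1 8B model produces results roughly like the following. Treat them as illustrative: absolute throughput, and even the size of the speedups, vary with GPU, driver, and bitsandbytes version, so measure on your own hardware: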

Configuration	Memory	Compression	Tok/s	Speedup
FP16 (baseline)	16.1 GB	1.0x	41.3	1.0x
INT8 (LLM.int8)	8.7 GB	1.9x	52.8	1.3x
INT4 NF4	4.8 GB	3.4x	98.2	2.4x

This is the compression opportunity in concrete numbers - before you learn how any of these techniques work internally. The rest of this module explains the internals, the failure modes, and how to tune each technique for production.

The Accuracy-Compression Tradeoff

No compression is free. Every technique trades some accuracy for efficiency gain. Understanding this tradeoff is essential for making production decisions.

The figures below are rough guides - the actual accuracy drop depends heavily on the model, the quantization method, and the task. Key empirical findings from the research literature:

FP16/BF16: Virtually free - nearly identical to FP32 for inference. Every production system should use this. Using FP32 in production is an antipattern.
INT8: Typically 0.5-1% accuracy drop on standard benchmarks. For most applications, this is fully acceptable.
INT4 (with GPTQ/AWQ): Typically 1-3% accuracy drop on standard benchmarks. The sweet spot for LLM deployment - 4x memory reduction for approximately 2% accuracy cost.
INT2/INT1: Significant accuracy degradation for most models. Research-only for now.

Why Benchmark Numbers Lie in Production

Reporting "1% accuracy drop on MMLU" is a seductive but dangerous summary. Here is what accuracy metrics commonly hide:

Task sensitivity varies dramatically. A model that loses 1% on MMLU might lose 8% on arithmetic reasoning tasks, because quantization disproportionately harms precision-sensitive computations. Arithmetic requires maintaining precise intermediate values - quantization noise compounds in exactly the kinds of multi-step computations that math requires. Summarization and conversational tasks are much more robust to quantization.

Outlier tokens matter more than average accuracy. LLMs produce catastrophic outputs ("degeneration") when quantization breaks the model's ability to assign very high or very low probabilities to specific tokens. Average accuracy can look fine while the tail of outputs is broken. A model might answer 99% of questions well and give completely wrong answers to 1% of questions that happen to rely on the specific weights that were poorly quantized.

Context length dependency is real. Quantization errors compound over long context windows. A model that performs fine at 512 tokens may degrade noticeably at 8K tokens because each layer's quantization error adds to the accumulated error from previous layers, and long contexts amplify this accumulation.

Calibration dataset mismatch is common. Post-training quantization requires a calibration dataset. If your deployment domain differs from the calibration data, you will see larger accuracy drops than published papers report. A code model calibrated with Wikipedia text will be poorly calibrated for code generation tasks.

:::warning Benchmark Mismatch is a Common Production Failure Always benchmark your compressed model on your specific task and data distribution, not just on standard benchmarks like MMLU or HellaSwag. A 1% drop on MMLU can mask a 10-15% drop on your actual use case. Cost pressure like the $100K per month scenario at the opening of this lesson pushes teams to compress aggressively; the failure mode to avoid is shipping a compressed model that looked fine on benchmarks but fails on the actual task. :::
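One way to act on this is to gate a compressed model on per-task results rather than on a single average. The sketch below flags any task whose relative drop exceeds a tolerance. The task names, scores, and the 2% threshold are placeholders invented for illustration, not measurements.

```python
def flag_regressions(baseline: dict, compressed: dict,
                     max_relative_drop: float = 0.02) -> dict:
    """Return {task: relative_drop} for tasks that regress beyond the tolerance."""
    flagged = {}
    for task, base_score in baseline.items():
        drop = (base_score - compressed[task]) / base_score
        if drop > max_relative_drop:
            flagged[task] = round(drop, 3)
    return flagged


# Placeholder scores (fraction correct); substitute your own evaluation results.
baseline = {"mmlu": 0.680, "arithmetic": 0.500, "summarization": 0.800}
compressed = {"mmlu": 0.673, "arithmetic": 0.460, "summarization": 0.796}

regressions = flag_regressions(baseline, compressed)
print(regressions or "no task regressed beyond tolerance")
```

With these placeholder numbers the general-knowledge score barely moves while the arithmetic task is flagged, which is the pattern the paragraphs above warn about.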

Edge Deployment: A Different Class of Problem

Cloud deployment is one use case. Edge deployment - running models on phones, embedded devices, IoT hardware - is another, with much tighter constraints that change the problem fundamentally.

A modern smartphone (iPhone 16, Pixel 9) has 8-12 GB of DRAM, shared between the OS, all running apps, and any ML model. A 7B model in INT4 still requires 3.5 GB just for weights - over a third of available memory on a typical device. Running 7B models locally on phones requires careful memory management and often still requires INT4 quantization combined with architectural changes.

The llama.cpp project demonstrated in 2023 that 7B models could run on MacBook CPUs using aggressive quantization (Q4_K_M format). This sparked an entire ecosystem of local AI - applications running models entirely offline, on consumer hardware, without API calls. The key enabler was not new algorithms but careful implementation of INT4 quantization with custom CPU kernels that exploit SIMD instructions.

Constraint	Cloud GPU	Edge (phone)	Edge (MCU)
Memory	40-80 GB VRAM	4-12 GB shared	256 KB - 4 MB
Compute	1000+ TFLOPS	5-15 TOPS	0.1-1 TOPS
Power	300-700W	5-15W	0.01-1W
Batch size	1-512	1	1
Latency budget	50-500ms	500ms-2s	100ms-10s
Typical target	7B-70B	1B-7B	Under 10M params
Compression needed	INT4-INT8	INT4+ distillation	Binary or tiny arch

The Apple Silicon Case Study

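Apple's M-series chips demonstrate what hardware-software co-design achieves. The M3 Max offers up to 128 GB of unified memory shared between the CPU and GPU, with up to 400 GB/s of memory bandwidth - enough capacity to hold a 4-bit 70B model and enough bandwidth for the speeds below. llama.cpp running on an M3 Max achieves roughly: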

Llama 3.1 8B (Q4_K_M): ~55 tokens/second
Llama 3.1 70B (Q4_K_M): ~8 tokens/second
Mistral 7B (Q4_K_M): ~60 tokens/second

These are production-usable speeds for an offline, private, zero-cost-per-token setup. The Q4_K_M format uses a mix of 4-bit and 6-bit quantization to preserve accuracy at higher bit-depths for sensitive layers while using 4-bit for most of the model. This is compression as a product strategy: the compression technique directly enables a fundamentally different go-to-market (offline, private, no subscription).

Latency vs. Throughput: Two Different Problems

Engineers new to ML systems often conflate two distinct objectives that require different optimization strategies:

Latency is the time to produce a single response. It matters for interactive applications - chatbots, real-time code completion, voice assistants. Users perceive latency above 200ms as "slow" and above 1 second as broken for interactive tools. Latency is primarily determined by:

Model size (more weights to read per token)
Number of sequential operations (layers, token count)
Memory bandwidth (how fast you can stream weights into compute cores)
Prefill time (processing the input prompt before generation begins)

Throughput is how many tokens (or responses) you can produce per unit time. It matters for batch processing - document analysis, data enrichment, offline generation pipelines. Throughput is primarily determined by:

GPU compute utilization (are you using all available TFLOPS?)
Batch size (running more requests in parallel amortizes weight loads)
Memory bandwidth (shared constraint with latency)
KV cache capacity (how many concurrent sequences can be held)

Compression helps both objectives, but through different mechanisms:

Latency improvement: Quantization reduces per-token weight streaming time. Structured pruning reduces layer count (fewer sequential operations per token).
Throughput improvement: Quantization allows larger batch sizes to fit in memory. Distillation creates permanently smaller architectures that are faster regardless of batch size.

:::tip The Batch Size Insight Weights and KV caches compete for the same VRAM. Suppose a model with 40 GB of weights leaves room for a batch of 8 requests (8 concurrent KV caches). Compressing the weights to 10 GB frees 30 GB for additional KV caches, so the same GPU might hold a batch of 32. At equal GPU utilization, the compressed model then serves 4x more requests per hour - even if per-token latency is identical. For a high-traffic API, a 4x throughput improvement translates directly into a 4x reduction in the number of GPUs required to serve the same load. This is the quantization win that often goes unnoticed: not just faster per request, but more capacity per dollar. :::
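The batch-size effect can be estimated from the KV cache size per sequence. The sketch below assumes a Llama-3.1-8B-like shape (32 layers, 8 KV heads, head dimension 128), an FP16 cache, an 8K context, a 24 GB card, and a 2 GB allowance for activations and runtime overhead; all of these are assumptions to replace with your own values.

```python
def kv_cache_gb_per_sequence(n_layers: int, n_kv_heads: int, head_dim: int,
                             context_len: int, bytes_per_elem: int = 2) -> float:
    """KV cache for one sequence: keys and values for every layer and position."""
    n_bytes = 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem
    return n_bytes / 1e9


def max_batch_size(vram_gb: float, weight_gb: float, kv_gb_per_seq: float,
                   overhead_gb: float = 2.0) -> int:
    """How many full-length sequences fit next to the weights."""
    free_gb = vram_gb - weight_gb - overhead_gb
    return max(int(free_gb // kv_gb_per_seq), 0)


kv_gb = kv_cache_gb_per_sequence(n_layers=32, n_kv_heads=8, head_dim=128,
                                 context_len=8192)

batch_fp16 = max_batch_size(vram_gb=24, weight_gb=16, kv_gb_per_seq=kv_gb)
batch_int4 = max_batch_size(vram_gb=24, weight_gb=5, kv_gb_per_seq=kv_gb)

print(f"KV cache per 8K sequence: {kv_gb:.2f} GB")
print(f"FP16 weights: batch of {batch_fp16}")
print(f"INT4 weights: batch of {batch_int4}")
```

Under these assumptions the quantized model holds three times as many concurrent sequences on the same card; the exact multiple depends on how much of the VRAM the weights occupied to begin with.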

A Decision Framework: Which Technique to Apply First

Use this framework when you need to compress a model for production deployment:

Step 1: What is your hardware budget?

Adequate GPU VRAM for FP16 baseline: start with FP16 or BF16. Already done.
Need to cut memory 2x: use INT8 quantization (bitsandbytes, TensorRT-LLM).
Need to cut memory 4x: use INT4 quantization (GPTQ, AWQ, GGUF).
Need to cut memory 8x or more: combine INT4 with structured pruning or use a distilled smaller model.

Step 2: Do you have a budget for retraining?

No retraining budget: post-training quantization (GPTQ, AWQ, bitsandbytes) or unstructured pruning (SparseGPT). Hours of compute.
Small training budget (GPU-hours to a few GPU-days): QLoRA fine-tuning after PTQ.
Moderate training budget (days): knowledge distillation or structured pruning with fine-tuning.
Large training budget (weeks): quantization-aware training or full distillation from scratch.

Step 3: What is your accuracy tolerance?

Near-lossless required: FP16 or INT8 only. Do not use INT4.
1-3% degradation acceptable: INT4 with GPTQ or AWQ. The industry standard.
5-10% degradation acceptable: aggressive structured pruning or distillation to a smaller architecture.

Step 4: What is your deployment target?

Cloud GPU serving: GPTQ, AWQ, bitsandbytes INT4, vLLM with quantization.
Cloud CPU serving: GGUF (llama.cpp formats), ONNX with INT8 quantization.
Edge mobile: ExecuTorch (Meta), Core ML (Apple), TFLite (Google) with INT8/INT4.
Embedded MCU: Binary networks, custom architectures, TinyML frameworks.
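The first three steps of the framework can be condensed into a small lookup function. This is one possible encoding, not a definitive rule set: the thresholds mirror the lists above, the function and parameter names are hypothetical, and step 4 (deployment target) is left out because it selects tooling rather than technique.

```python
def recommend(memory_reduction: float, can_retrain: bool,
              max_drop_pct: float) -> str:
    """Suggest a starting technique from memory target, retraining budget, tolerance."""
    if memory_reduction <= 1:
        return "FP16/BF16"
    if memory_reduction <= 2:
        return "INT8 post-training quantization"
    if max_drop_pct < 1:
        # Near-lossless requirement rules out INT4 in the framework above.
        return "INT8 post-training quantization, plus more hardware"
    if memory_reduction <= 4:
        return "INT4 post-training quantization (GPTQ/AWQ)"
    if can_retrain:
        return "INT4 plus structured pruning or a distilled smaller model"
    return "INT4 post-training quantization (GPTQ/AWQ), plus more hardware"


print(recommend(memory_reduction=4, can_retrain=False, max_drop_pct=2.0))
print(recommend(memory_reduction=8, can_retrain=True, max_drop_pct=5.0))
```

The value of writing the rules down is less the function itself than forcing the three inputs to be stated explicitly before any compression work starts.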

Production Compression: A Worked Example

The following script shows how to compare three compression techniques on the same model and produce a deployment decision:

import torch
import json
from dataclasses import dataclass, field
from typing import List, Optional
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig


@dataclass
class CompressionConfig:
    """Configuration for a compression experiment."""
    label: str
    bits: int
    memory_gb_expected: float
    quant_type: Optional[str] = None   # "nf4", "fp4", or None for no 4-bit
    load_in_8bit: bool = False
    load_in_4bit: bool = False
    double_quant: bool = False
    compute_dtype: torch.dtype = torch.bfloat16
    notes: str = ""


@dataclass
class CompressionResult:
    """Measured results for a compression configuration."""
    config: CompressionConfig
    actual_memory_gb: float = 0.0
    tokens_per_second: float = 0.0
    perplexity: float = 0.0
    load_time_s: float = 0.0
    error: Optional[str] = None
    benchmark_scores: dict = field(default_factory=dict)


class CompressionBenchmark:
    """
    Systematically benchmark multiple compression configurations on the same model.

    Usage:
        bench = CompressionBenchmark("meta-llama/Llama-3.1-8B-Instruct")
        results = bench.run_all()
        bench.print_summary(results)
        bench.export_report(results, "compression_report.json")
    """

    STANDARD_CONFIGS = [
        CompressionConfig(
            label="FP16 baseline",
            bits=16,
            memory_gb_expected=16.0,
            notes="Standard inference precision - always test this first",
        ),
        CompressionConfig(
            label="INT8 (LLM.int8)",
            bits=8,
            memory_gb_expected=8.5,
            load_in_8bit=True,
            notes="2x memory saving, ~0.5% accuracy loss, no calibration needed",
        ),
        CompressionConfig(
            label="INT4 NF4",
            bits=4,
            memory_gb_expected=5.0,
            load_in_4bit=True,
            quant_type="nf4",
            double_quant=True,
            notes="4x memory saving, ~2% accuracy loss on standard benchmarks",
        ),
        CompressionConfig(
            label="INT4 FP4",
            bits=4,
            memory_gb_expected=5.0,
            load_in_4bit=True,
            quant_type="fp4",
            notes="Alternative 4-bit format - sometimes better for non-Gaussian distributions",
        ),
    ]

    def __init__(self, model_id: str, calibration_prompt: Optional[str] = None):
        self.model_id = model_id
        self.calibration_prompt = calibration_prompt or (
            "The transformer architecture consists of encoder and decoder stacks. "
            "Each layer contains multi-head self-attention mechanisms and feed-forward networks. "
        )

    def _build_bnb_config(self, config: CompressionConfig) -> Optional[BitsAndBytesConfig]:
        if config.load_in_8bit:
            return BitsAndBytesConfig(load_in_8bit=True)
        if config.load_in_4bit:
            return BitsAndBytesConfig(
                load_in_4bit=True,
                bnb_4bit_quant_type=config.quant_type or "nf4",
                bnb_4bit_use_double_quant=config.double_quant,
                bnb_4bit_compute_dtype=config.compute_dtype,
            )
        return None

    def _measure_throughput(self, model, tokenizer, n_tokens: int = 100) -> float:
        """Measure tokens per second for autoregressive generation."""
        import time

        inputs = tokenizer(self.calibration_prompt, return_tensors="pt").to(model.device)

        # Warmup
        with torch.no_grad():
            model.generate(**inputs, max_new_tokens=20, do_sample=False)

        torch.cuda.synchronize()
        t0 = time.perf_counter()
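        with torch.no_grad():
            output = model.generate(**inputs, max_new_tokens=n_tokens, do_sample=False)
        torch.cuda.synchronize()
        elapsed = time.perf_counter() - t0

        # Generation can stop early at EOS, so count the tokens actually produced
        # rather than assuming all n_tokens were generated.
        n_generated = output.shape[1] - inputs["input_ids"].shape[1]
        return n_generated / elapsed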

    def run_config(self, config: CompressionConfig) -> CompressionResult:
        """Load and benchmark one compression configuration."""
        import time

        result = CompressionResult(config=config)
        torch.cuda.empty_cache()
        torch.cuda.reset_peak_memory_stats()

        try:
            bnb_config = self._build_bnb_config(config)
            load_kwargs = {"device_map": "auto"}

            if bnb_config:
                load_kwargs["quantization_config"] = bnb_config
            else:
                load_kwargs["torch_dtype"] = torch.float16

            t_load = time.time()
            model = AutoModelForCausalLM.from_pretrained(self.model_id, **load_kwargs)
            tokenizer = AutoTokenizer.from_pretrained(self.model_id)
            tokenizer.pad_token = tokenizer.eos_token
            result.load_time_s = time.time() - t_load

            result.actual_memory_gb = torch.cuda.max_memory_allocated() / 1e9
            result.tokens_per_second = self._measure_throughput(model, tokenizer)

            del model
            torch.cuda.empty_cache()

        except Exception as e:
            result.error = str(e)

        return result

    def run_all(self) -> List[CompressionResult]:
        results = []
        for config in self.STANDARD_CONFIGS:
            print(f"\nRunning: {config.label}")
            result = self.run_config(config)
            if result.error:
                print(f"  ERROR: {result.error}")
            else:
                print(f"  Memory: {result.actual_memory_gb:.1f} GB")
                print(f"  Speed: {result.tokens_per_second:.1f} tok/s")
                print(f"  Load time: {result.load_time_s:.1f}s")
            results.append(result)
        return results

    def print_summary(self, results: List[CompressionResult]) -> None:
        baseline = next((r for r in results if not r.config.load_in_8bit
                        and not r.config.load_in_4bit and not r.error), None)
        if not baseline:
            print("No baseline result found")
            return

        print(f"\n{'='*80}")
        print(f"{'Configuration':<22} {'Memory':>8} {'vs FP16':>8} {'Tok/s':>8} {'Speedup':>8}")
        print("-" * 80)

        for r in results:
            if r.error:
                print(f"{r.config.label:<22} ERROR: {r.error[:40]}")
                continue
            mem_ratio = baseline.actual_memory_gb / r.actual_memory_gb
            tps_ratio = r.tokens_per_second / baseline.tokens_per_second
            print(
                f"{r.config.label:<22} "
                f"{r.actual_memory_gb:>7.1f}G "
                f"{mem_ratio:>7.1f}x "
                f"{r.tokens_per_second:>7.1f} "
                f"{tps_ratio:>7.1f}x"
            )

    def export_report(self, results: List[CompressionResult], path: str) -> None:
        report = []
        for r in results:
            report.append({
                "label": r.config.label,
                "bits": r.config.bits,
                "memory_gb": r.actual_memory_gb,
                "tokens_per_second": r.tokens_per_second,
                "load_time_s": r.load_time_s,
                "notes": r.config.notes,
                "error": r.error,
            })
        with open(path, "w") as f:
            json.dump(report, f, indent=2)
        print(f"Report saved to {path}")


# Usage
bench = CompressionBenchmark("meta-llama/Llama-3.1-8B-Instruct")
results = bench.run_all()
bench.print_summary(results)
bench.export_report(results, "compression_report.json")

Combining Techniques: The Compression Stack

The most powerful production compression strategies combine multiple techniques. They are not mutually exclusive - they operate on orthogonal dimensions:

Quantization attacks the bits-per-weight dimension (FP16 → INT4 = 4x)
Pruning attacks the number-of-weights dimension (remove heads/layers)
Distillation attacks the model-scale dimension (70B → 7B = 10x)
Architecture choices (GQA, sliding window) attack the KV cache dimension

A modern production deployment of an LLM typically combines at minimum quantization and a modern efficient architecture. The best systems stack all four.

Stacking Quantization and Distillation

The most impactful combination for high-volume task-specific APIs:

Step 1: Start with a large capable teacher (e.g., Llama 3.1 70B)
Step 2: Distill a 7B student on your task data
        → 10x parameter reduction
        → Student achieves ~93-95% of teacher accuracy on the task
Step 3: Quantize the 7B student to INT4 with AWQ
        → Additional 4x memory reduction
        → Total: 40x compression vs FP16 70B baseline
        → ~1-2% additional accuracy cost from quantization
Step 4: Serve with vLLM + continuous batching
        → 4x higher throughput vs naive serving

Final result:
  Memory: 140 GB (70B FP16) → 3.5 GB (7B INT4)  = 40x reduction
  Cost:   ~$10/hr (2× A100 80GB, since 140 GB does not fit on one) → $0.09/hr (RTX 4090) = ~110x cost reduction
  Accuracy on task: ~90-93% of 70B FP16 (distillation + quantization cost)

This pattern - distill to a task-specific small model, then quantize - is the highest ROI compression strategy when you have a stable, high-volume task and a one-time training budget. The distillation is expensive (GPU-days) but pays off across millions of inference calls.
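The memory arithmetic behind this stack can be checked in a few lines. The sketch below is a back-of-the-envelope estimate that counts weight bytes only (no KV cache, activations, or quantization metadata such as scales and zero points). The helper name `weight_memory_gb` is illustrative and does not come from any library.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weight-only memory in GB (1 GB = 1e9 bytes); ignores scales and zero points."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


teacher_gb = weight_memory_gb(70, 16)   # 70B teacher in FP16
student_gb = weight_memory_gb(7, 4)     # 7B student in INT4

distill_factor = 70 / 7                 # fewer weights
quant_factor = 16 / 4                   # fewer bits per weight

print(f"Teacher FP16: {teacher_gb:.1f} GB")
print(f"Student INT4: {student_gb:.1f} GB")
print(f"Combined reduction: {teacher_gb / student_gb:.0f}x "
      f"(= {distill_factor:.0f}x distillation × {quant_factor:.0f}x quantization)")
```

The two factors multiply because they act on different terms of the same product: parameter count times bytes per parameter.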

The QLoRA Pattern: Fine-Tune + Compress in One Pass

For teams that need both domain adaptation and compression:

# QLoRA pattern: fine-tune a compressed model, then deploy compressed
# Step 1: Load base model in INT4 NF4
# Step 2: Train LoRA adapters in BF16
# Step 3: Merge adapters → FP16 merged model
# Step 4: Re-quantize merged model with AWQ → INT4 deployment model

from peft import LoraConfig, get_peft_model, TaskType
from transformers import BitsAndBytesConfig, AutoModelForCausalLM
import torch

def qlora_finetune_then_quantize(
    base_model_name: str,
    training_data: list,
    output_path: str,
    lora_rank: int = 16,
    lora_alpha: int = 32,
    target_modules: list = None,
) -> None:
    """
    Full QLoRA fine-tuning pipeline followed by AWQ deployment quantization.

    Phase 1: QLoRA fine-tuning
        - Load base model in INT4 NF4 (huge memory saving for training)
        - Apply LoRA adapters in BF16 - only adapters are trained
        - Backprop only through LoRA layers, not frozen INT4 base
        - Train on domain-specific data

    Phase 2: Merge and re-quantize
        - Merge LoRA adapters into base model → FP16 merged model
        - Run AWQ quantization on FP16 merged model
        - Deploy AWQ model with Marlin kernels for maximum throughput

    Why re-quantize instead of keeping NF4?
    NF4 inference is slower than AWQ - NF4 dequantizes at runtime in a
    general way that doesn't use optimized GEMM kernels. AWQ with Marlin
    is 20-40% faster for the same INT4 bit width.
    """
    # Phase 1: QLoRA fine-tuning
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    model = AutoModelForCausalLM.from_pretrained(
        base_model_name,
        quantization_config=bnb_config,
        device_map="auto",
    )

    if target_modules is None:
        target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                         "gate_proj", "up_proj", "down_proj"]

    lora_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        r=lora_rank,
        lora_alpha=lora_alpha,
        target_modules=target_modules,
        lora_dropout=0.05,
        bias="none",
    )
    model = get_peft_model(model, lora_config)

    n_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    n_total = sum(p.numel() for p in model.parameters())
    print(f"Trainable params: {n_trainable:,} / {n_total:,} ({100*n_trainable/n_total:.2f}%)")

    # ... training loop (standard SFT training with LoRA) ...

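    # Phase 2: Merge LoRA into an FP16 copy of the base model, then run AWQ.
    # Calling merge_and_unload() on the 4-bit model loaded above does not
    # produce FP16 weights, so save the adapters and reload the base in FP16.
    from peft import PeftModel  # local import so this phase reads on its own

    model.save_pretrained(f"{output_path}/lora_adapters")  # saves adapter weights only
    del model
    torch.cuda.empty_cache()

    base_fp16 = AutoModelForCausalLM.from_pretrained(
        base_model_name, torch_dtype=torch.float16, device_map="auto"
    )
    merged_model = PeftModel.from_pretrained(
        base_fp16, f"{output_path}/lora_adapters"
    ).merge_and_unload()  # Returns FP16 merged model
    merged_model.save_pretrained(f"{output_path}/merged_fp16")

    # Now run AWQ on the merged FP16 model for deployment
    print("Running AWQ quantization on merged model for deployment...")
    # ... call quantize_model_awq(f"{output_path}/merged_fp16", f"{output_path}/awq_int4")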

The Speculative Decoding Integration

Compression is not the only technique for improving inference efficiency. Speculative decoding is complementary: a small "draft" model quickly proposes multiple tokens, and the large "target" model verifies them in parallel. If the draft tokens are accepted, you get multiple tokens for the cost of one target model forward pass.

Speculative decoding composes with quantization: the draft model is often quantized to INT4, and the target model is also INT4-quantized. The result:

Standard INT4 quantized serving:
  Target model: 4 tok/s (bottlenecked by memory bandwidth at batch=1)

INT4 + speculative decoding (~0.7 extra draft tokens accepted per target pass on average):
  Effective throughput: 4 × (1 + 0.7) = 6.8 tok/s
  Without changing the target model or its outputs (the draft adds a small memory cost)

For models with a good draft (same family, 5-10x smaller):
  Acceptance rate: 0.75-0.85 per drafted token
  With 5 speculative tokens at 0.80: (1 - 0.80^6) / (1 - 0.80) ≈ 3.7 tokens per target pass
  Effective throughput: 4 × 3.7 ≈ 15 tok/s
  ~3.7x speedup from speculative decoding alone, before draft-model overhead

This is the 2025 production stack for LLM inference: INT4 AWQ quantization + continuous batching (vLLM) + speculative decoding with a small draft model + prefix caching for shared system prompts. Each technique multiplies with the others.
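One common way to estimate the gain from speculative decoding is to assume each drafted token is accepted independently with probability `alpha` and that verification stops at the first rejection. Under that simplifying assumption, the expected number of tokens per target forward pass is a geometric sum. The sketch below uses hypothetical function names and ignores the time spent running the draft model, so it gives an upper bound rather than a measured speedup.

```python
def expected_tokens_per_target_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per target forward pass.

    Assumes each of the k drafted tokens is accepted independently with
    probability alpha, and that verification stops at the first rejection.
    The target model always contributes one token (a correction or a bonus),
    which gives the geometric sum 1 + alpha + alpha^2 + ... + alpha^k.
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be in [0, 1)")
    return (1 - alpha ** (k + 1)) / (1 - alpha)


def speculative_throughput(base_tok_s: float, alpha: float, k: int) -> float:
    """Upper-bound throughput: ignores the time spent running the draft model."""
    return base_tok_s * expected_tokens_per_target_pass(alpha, k)


for alpha in (0.70, 0.80, 0.85):
    tokens = expected_tokens_per_target_pass(alpha, k=5)
    print(f"alpha={alpha:.2f}, k=5: {tokens:.2f} tokens/pass, "
          f"{speculative_throughput(4.0, alpha, 5):.1f} tok/s from a 4 tok/s baseline")
```

Because acceptance compounds, adding more draft tokens gives diminishing returns: the sum can never exceed `1 / (1 - alpha)` no matter how large `k` gets.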

The Historical Arc: How We Got Here

The modern model compression field was not invented for LLMs. Its roots go back to the early days of neural networks, and the mathematical foundations predate deep learning:

1989: Yann LeCun introduces "Optimal Brain Damage" - removing weights based on second-order sensitivity (the conceptual ancestor of GPTQ). Core idea: use the Hessian to identify which weights contribute least to the loss.
1993: Hassibi and Stork extend this to "Optimal Brain Surgeon" - the direct mathematical ancestor of GPTQ's Hessian-based approach. They show that error compensation across remaining weights can recover accuracy after pruning.
2015: Geoffrey Hinton, Oriol Vinyals, and Jeff Dean publish "Distilling the Knowledge in a Neural Network." The teacher-student framework is born.
2015: Han et al. publish "Deep Compression" - combining pruning, quantization, and Huffman coding to compress AlexNet 35x.
2019: Hugging Face releases DistilBERT - 97% of BERT's quality at 60% of its size. The most widely deployed knowledge distillation result in production.
2022: Tim Dettmers releases bitsandbytes, making 8-bit quantization accessible with a one-line config change.
2022: GPTQ paper (Frantar et al.) - first practical method to quantize 175B models to INT4 in hours on a single GPU.
2023: AWQ paper (Lin et al., MIT Han Lab) - activation-aware quantization achieves better INT4 accuracy than GPTQ with faster inference throughput.
2023: llama.cpp by Georgi Gerganov - pure C++ inference with quantized LLMs on CPU, igniting the local AI movement.
2024: GGUF K-quants, ExecuTorch, FP8 with native hardware support on H100. Compression becomes a mature engineering discipline.
2025: Production systems routinely combine INT4 quantization + structured pruning + speculative decoding. The default deployment configuration for open-source LLMs is compressed - not full precision.

Compression Across the Model Lifecycle

It is tempting to think of compression as a deployment-time activity - something you do to a model after it is trained. In practice, compression decisions affect every stage of the ML lifecycle:

At Architecture Design Time

The most impactful compression happens before training, when you choose the architecture. Several design decisions have massive downstream effects on compressibility:

Grouped Query Attention (GQA): Standard multi-head attention has one key and value head per query head. GQA shares key-value heads across multiple query heads. Llama 3.1 8B uses 8 key-value heads for 32 query heads - a 4x reduction in KV cache memory with minimal accuracy cost. This is compression at the architecture level, before any PTQ is applied. When KV cache is the limiting factor, a model with GQA can serve roughly 4x more concurrent requests at the same memory budget.
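To make the KV cache saving concrete, the sketch below counts the bytes one token occupies in the cache with and without shared key-value heads. The layer count and head dimension are an assumed Llama-3.1-8B-style shape used for illustration, and `kv_cache_bytes_per_token` is a hypothetical helper, not a library function.

```python
def kv_cache_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                             bytes_per_value: int = 2) -> int:
    """Bytes of KV cache one token occupies: keys + values across all layers."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


# Assumed Llama-3.1-8B-style shape: 32 layers, 32 query heads, head_dim 128.
N_LAYERS, N_QUERY_HEADS, HEAD_DIM = 32, 32, 128

mha = kv_cache_bytes_per_token(N_LAYERS, n_kv_heads=N_QUERY_HEADS, head_dim=HEAD_DIM)
gqa = kv_cache_bytes_per_token(N_LAYERS, n_kv_heads=8, head_dim=HEAD_DIM)

context_len = 8192
print(f"MHA: {mha / 1024:.0f} KiB/token, {mha * context_len / 2**30:.2f} GiB at {context_len} tokens")
print(f"GQA: {gqa / 1024:.0f} KiB/token, {gqa * context_len / 2**30:.2f} GiB at {context_len} tokens")
print(f"KV cache reduction: {mha / gqa:.0f}x")
```

The reduction equals the ratio of query heads to key-value heads, and it applies per sequence, so it scales directly with the number of concurrent requests.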

Sliding Window Attention: Instead of attending to all previous tokens, each token attends only to the last W tokens (window size). This caps KV cache growth regardless of sequence length - critical for long-context serving. Mistral 7B uses W=4096, keeping KV cache constant for sequences longer than 4096 tokens. Combined with INT4 quantization, Mistral 7B in INT4 with sliding window attention can handle very long sequences on a single consumer GPU.
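The capping effect is easy to see numerically. The sketch below compares KV cache size with and without a window; the per-token byte count assumes a 32-layer model with 8 KV heads, head dimension 128, and an FP16 cache, and both helper names are hypothetical.

```python
from typing import Optional


def cached_positions(seq_len: int, window: Optional[int] = None) -> int:
    """Number of token positions held in the KV cache for one sequence."""
    return seq_len if window is None else min(seq_len, window)


def kv_cache_gib(seq_len: int, bytes_per_token: int, window: Optional[int] = None) -> float:
    """KV cache size in GiB for one sequence."""
    return cached_positions(seq_len, window) * bytes_per_token / 2**30


# Assumed shape for illustration: 32 layers, 8 KV heads, head_dim 128, FP16 cache.
BYTES_PER_TOKEN = 2 * 32 * 8 * 128 * 2  # keys + values

for seq_len in (2_048, 4_096, 16_384, 65_536):
    full = kv_cache_gib(seq_len, BYTES_PER_TOKEN)
    windowed = kv_cache_gib(seq_len, BYTES_PER_TOKEN, window=4_096)
    print(f"{seq_len:>6} tokens: full attention {full:5.2f} GiB, W=4096 {windowed:5.2f} GiB")
```

Below the window size the two columns match; above it, the windowed cache stays flat while the full-attention cache keeps growing linearly.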

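MoE (Mixture of Experts): Architectures like Mixtral 8×7B place 8 expert FFNs in each layer but only activate 2 per token. The "model" is 47B parameters but each forward pass uses only 13B. All experts still have to be stored somewhere, so the memory footprint follows the total count while compute follows the active count. For memory-constrained serving, you can quantize the experts to INT4 and offload idle experts to CPU RAM, keeping only the frequently used ones on the GPU - a form of dynamic compression that adapts to available hardware.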
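The gap between total and active parameters can be reproduced with a rough parameter count. The sketch below assumes a Mixtral-8x7B-style configuration (the specific dimensions are assumptions for illustration), counts gated FFN experts, GQA attention, the router, and untied embeddings, and ignores normalization weights. `moe_param_counts` is a hypothetical helper.

```python
def moe_param_counts(hidden: int, ffn_dim: int, n_layers: int, n_experts: int,
                     experts_per_token: int, n_heads: int, n_kv_heads: int,
                     vocab: int) -> tuple:
    """Return (total_params, active_params_per_token) for a decoder-only MoE.

    Rough count: gated FFN experts (3 matrices each), GQA attention, router,
    and untied input/output embeddings. Norm weights are ignored.
    """
    head_dim = hidden // n_heads
    expert = 3 * hidden * ffn_dim                        # gate, up, down projections
    attention = 2 * hidden * hidden + 2 * hidden * n_kv_heads * head_dim  # q,o + k,v
    router = hidden * n_experts
    embeddings = 2 * vocab * hidden

    shared = n_layers * (attention + router) + embeddings
    total = shared + n_layers * n_experts * expert
    active = shared + n_layers * experts_per_token * expert
    return total, active


# Assumed Mixtral-8x7B-style configuration (values are assumptions for illustration).
total, active = moe_param_counts(
    hidden=4096, ffn_dim=14336, n_layers=32, n_experts=8,
    experts_per_token=2, n_heads=32, n_kv_heads=8, vocab=32000,
)
print(f"Total parameters:  {total / 1e9:.1f}B")
print(f"Active per token:  {active / 1e9:.1f}B")
```

Almost all of the parameters sit in the experts, which is why expert-level quantization and offloading dominate the memory picture for MoE models.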

At Training Time

Quantization-Aware Training (QAT): During training, simulate quantization noise in the forward pass by rounding weights to their quantized values and back (fake quantization), while gradients update the underlying full-precision weights. The model learns to be robust to the quantization noise it will experience at inference. QAT typically produces 0.3-0.8 percentage points better accuracy than post-training quantization at the same bit width - worth the additional training complexity for high-accuracy requirements.

import torch
import torch.nn as nn


class QATLinear(nn.Module):
    """
    Linear layer with simulated quantization for QAT.
    During training, weights are quantized-then-dequantized to
    expose the model to quantization noise, building robustness.
    """

    def __init__(self, in_features: int, out_features: int, n_bits: int = 4):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features, bias=False)
        self.n_bits = n_bits
        self.q_max = 2 ** (n_bits - 1) - 1

    def _fake_quantize(self, weight: torch.Tensor) -> torch.Tensor:
        """
        Simulate INT4 quantization with straight-through estimator (STE).

        The round() operation has zero gradient almost everywhere.
        STE passes gradients through as if round() were the identity function.
        This is the standard trick for training with discrete operations.
        """
        max_val = weight.abs().max().clamp(min=1e-8)
        scale = max_val / self.q_max

        # Quantize
        w_q = (weight / scale).round().clamp(-self.q_max, self.q_max)

        # Straight-Through Estimator: backward passes through as if no rounding
        w_q_ste = weight + (w_q * scale - weight).detach()
        return w_q_ste

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training:
            # Apply fake quantization only during training
            weight_fq = self._fake_quantize(self.linear.weight)
            return nn.functional.linear(x, weight_fq)
        else:
            return self.linear(x)

At Fine-Tuning Time

Fine-tuning with LoRA (Low-Rank Adaptation) is itself a form of compression-aware training. LoRA adds low-rank matrices to frozen base weights:

$W' = W_0 + \Delta W = W_0 + B \cdot A$

where $W_0$ is the frozen pre-trained weight (a $d \times d$ matrix), and $B \cdot A$ is the low-rank update (rank $r \ll d$). For a 4096×4096 weight matrix at rank $r=16$:

Full fine-tuning: $4096 \times 4096 = 16.7M$ trainable parameters
LoRA: $2 \times (4096 \times 16) = 131K$ trainable parameters - 128x fewer

The LoRA parameters can be merged back into the base weights before deployment: $W_{deployed} = W_0 + BA$. The result is a standard weight matrix with no inference overhead.
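Both claims, the parameter count and the zero-overhead merge, can be checked with plain Python. The sketch below uses tiny hand-written matrices and hypothetical helper names; it omits the `lora_alpha / r` scaling factor that real LoRA implementations apply to the update.

```python
def lora_param_counts(d_out: int, d_in: int, rank: int) -> tuple:
    """Return (full_finetune_params, lora_params) for one weight matrix."""
    full = d_out * d_in
    lora = d_out * rank + rank * d_in   # B is d_out x r, A is r x d_in
    return full, lora


def matmul(a, b):
    """Plain-Python matrix product, enough for a tiny demonstration."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


full, lora = lora_param_counts(4096, 4096, 16)
print(f"Full: {full:,}  LoRA: {lora:,}  ratio: {full // lora}x")

# Tiny merge check with d=3, r=1: applying W0 and BA separately
# gives the same output as applying the merged matrix W0 + BA.
W0 = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [3.0, 0.0, 1.0]]
B = [[1.0], [2.0], [0.0]]          # 3 x 1
A = [[0.5, 0.0, 1.0]]              # 1 x 3
x = [[1.0], [2.0], [3.0]]          # column vector

separate = add(matmul(W0, x), matmul(B, matmul(A, x)))
merged = matmul(add(W0, matmul(B, A)), x)
print("separate:", separate)
print("merged:  ", merged)
```

The merged path performs a single matrix-vector product, which is why a merged LoRA model has the same inference cost as the original base model.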

Tooling Ecosystem: What Engineers Actually Use

Understanding the tooling ecosystem prevents you from building what already exists:

Tool	Primary Use	Key Feature	When to Use
bitsandbytes	INT8/INT4 PTQ	One-line config, QLoRA support	Fine-tuning + quick INT4 inference
auto-gptq	GPTQ quantization	Hessian-based, 3-bit support	Producing and serving GPTQ-format checkpoints
autoawq	AWQ quantization	Marlin kernels, fast inference	Production GPU inference
llama.cpp	CPU inference	GGUF format, SIMD kernels	CPU serving, edge, local AI
vLLM	Serving engine	Continuous batching, paged attn	High-throughput GPU serving
TensorRT-LLM	NVIDIA-optimized	Maximum NVIDIA GPU throughput	Enterprise, NVIDIA-only fleet
ExecuTorch	Mobile inference	iOS/Android, optimized kernels	Mobile edge deployment
ONNX Runtime	Cross-platform	CPU/GPU/NPU, many backends	Cross-platform, non-NVIDIA
lm-evaluation-harness	Benchmarking	Standard eval across tasks	Pre-deploy accuracy validation

The production stack for most teams in 2025 is: autoawq for quantization + vLLM for serving + lm-evaluation-harness for accuracy validation. This combination provides the best throughput-accuracy tradeoff with the least custom engineering.

Cost Modeling: Making the Business Case for Compression

Before committing engineering time to a compression project, build a cost model. Compression requires engineering effort and has accuracy risk. The business case must justify both.

from dataclasses import dataclass
from typing import Optional


@dataclass
class ServingCostModel:
    """Model the cost savings from compression for a given serving workload."""
    daily_requests: int
    avg_tokens_per_request: int        # Input + output tokens
    current_model_params_b: float      # e.g., 70.0 for 70B
    current_precision: str             # "fp16", "int8", "int4"
    target_precision: str              # "int8", "int4"
    current_gpus_per_request: int      # GPUs needed for current setup
    gpu_cost_per_hour_usd: float       # e.g., 5.0 for A100 80GB
    engineering_days_to_compress: int  # Estimated effort for compression project
    engineer_daily_cost_usd: float = 800.0  # Fully-loaded cost per engineer-day
    months_to_amortize: int = 12       # How long to amortize engineering cost

    def current_tokens_per_second(self) -> float:
        """Approximate current throughput (tokens/sec per GPU at given precision)."""
        base_tps = {
            "fp16": {"7": 41, "13": 28, "34": 18, "70": 12},
            "int8": {"7": 55, "13": 35, "34": 22, "70": 15},
            "int4": {"7": 98, "13": 65, "34": 40, "70": 28},
        }
        size_key = min(["7", "13", "34", "70"],
                       key=lambda k: abs(float(k) - self.current_model_params_b))
        return base_tps.get(self.current_precision, base_tps["fp16"]).get(size_key, 30)

    def compressed_tokens_per_second(self) -> float:
        base_tps = {
            "int8": {"7": 55, "13": 35, "34": 22, "70": 15},
            "int4": {"7": 98, "13": 65, "34": 40, "70": 28},
        }
        size_key = min(["7", "13", "34", "70"],
                       key=lambda k: abs(float(k) - self.current_model_params_b))
        return base_tps.get(self.target_precision, base_tps["int4"]).get(size_key, 50)

    def compute_savings(self) -> dict:
        """Compute monthly and annual savings from compression."""
        daily_tokens = self.daily_requests * self.avg_tokens_per_request

        current_tps = self.current_tokens_per_second()
        compressed_tps = self.compressed_tokens_per_second()

        # GPU-hours needed per day
        current_gpu_hours = (daily_tokens / current_tps) / 3600 * self.current_gpus_per_request

        # After compression: fewer GPUs needed (memory reduction), faster throughput
        precision_memory_factor = {"fp16": 2, "int8": 1, "int4": 0.5}
        current_mem_factor = precision_memory_factor.get(self.current_precision, 2)
        target_mem_factor = precision_memory_factor.get(self.target_precision, 0.5)
        gpu_count_reduction = current_mem_factor / target_mem_factor

        compressed_gpus_per_request = max(1, int(
            self.current_gpus_per_request / gpu_count_reduction
        ))
        compressed_gpu_hours = (daily_tokens / compressed_tps) / 3600 * compressed_gpus_per_request

        current_daily_cost = current_gpu_hours * self.gpu_cost_per_hour_usd
        compressed_daily_cost = compressed_gpu_hours * self.gpu_cost_per_hour_usd
        daily_savings = current_daily_cost - compressed_daily_cost
        monthly_savings = daily_savings * 30
        annual_savings = daily_savings * 365

        engineering_cost = (self.engineering_days_to_compress
                           * self.engineer_daily_cost_usd)
        monthly_amortized_cost = engineering_cost / self.months_to_amortize
        net_monthly_savings = monthly_savings - monthly_amortized_cost
        payback_months = (engineering_cost / monthly_savings
                         if monthly_savings > 0 else float("inf"))

        return {
            "current_daily_cost_usd": round(current_daily_cost, 2),
            "compressed_daily_cost_usd": round(compressed_daily_cost, 2),
            "daily_savings_usd": round(daily_savings, 2),
            "monthly_savings_usd": round(monthly_savings, 2),
            "annual_savings_usd": round(annual_savings, 2),
            "engineering_cost_usd": round(engineering_cost, 2),
            "payback_months": round(payback_months, 1),
            "net_monthly_savings_usd": round(net_monthly_savings, 2),
            "throughput_improvement": round(compressed_tps / current_tps, 2),
        }


# Example: 70B model serving 1M requests/day
model = ServingCostModel(
    daily_requests=1_000_000,
    avg_tokens_per_request=500,
    current_model_params_b=70.0,
    current_precision="fp16",
    target_precision="int4",
    current_gpus_per_request=2,     # 2× A100 80GB for 70B FP16
    gpu_cost_per_hour_usd=5.0,      # A100 80GB on-demand
    engineering_days_to_compress=10, # ~2 weeks for GPTQ/AWQ + eval
    months_to_amortize=12,
)
savings = model.compute_savings()
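print(f"Monthly savings: ${savings['monthly_savings_usd']:,.0f}")
print(f"Payback period:  {savings['payback_months']} months")
print(f"Annual savings:  ${savings['annual_savings_usd']:,.0f}")
# With the single-stream throughput table above, this scenario prints roughly:
# Monthly savings: ~$2.7M
# Payback period:  0.0 months
# Annual savings:  ~$33M
# The table assumes batch-size-1 decoding, so the absolute dollar figures are
# far higher than a batched deployment would pay. Replace base_tps with
# throughput measured on your own serving stack before quoting savings.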

The cost model makes the business case concrete. For high-traffic systems, the payback period for a compression project is typically measured in weeks, not months. This is why compression is not a nice-to-have optimization - it is core infrastructure investment that pays for itself rapidly.

Common Mistakes to Avoid

:::danger Serving Models in FP32 in Production Using FP32 for inference is almost always wrong. On modern hardware, FP16 and BF16 inference produces output quality that is practically indistinguishable from FP32 while using 50% less memory. Many frameworks load weights in FP32 unless told otherwise, so always set torch_dtype=torch.float16 or torch_dtype=torch.bfloat16 explicitly when loading for inference. This single change is essentially free compression with negligible accuracy cost. :::

:::danger Benchmarking Only on Standard Datasets A model that achieves "only 1% degradation on MMLU" may still be broken for your task. MMLU is a knowledge-recall benchmark. If your application requires arithmetic, code generation, or multi-step reasoning, you must benchmark those capabilities specifically. Ship a compressed model without task-specific evaluation and you will encounter production failures. :::

:::warning The Calibration Dataset Is Not Optional For calibration-based post-training quantization methods (GPTQ, AWQ), the calibration dataset significantly affects quality. (bitsandbytes is the exception: it quantizes weights at load time without calibration data.) Using a mismatched calibration set - e.g., a C4 text corpus when your task is code generation - produces a quantized model that is worse than one calibrated on domain-matched data. The calibration data shows the quantizer which weights and channels see large activations and at what scales. Mismatch means the quantizer optimizes for the wrong distribution. :::

:::warning The Memory vs. Compute Tradeoff at High Batch Sizes Quantization saves memory and improves bandwidth-bound throughput at small batch sizes. But on compute-bound workloads - large batch sizes where the GPU compute cores are fully utilized - the dequantization overhead can actually reduce throughput compared to native FP16. Profile your actual batch size distribution before assuming quantization always helps. At batch size 128+, the crossover point is near and benchmarking is essential. :::

:::tip Always Test at Your Production Sequence Lengths Quantization errors compound over long context windows in ways that short-context benchmarks miss. A model that looks fine at 512-token prompts may show degradation at 4K or 8K tokens. Always benchmark at the sequence lengths you will actually use in production before declaring a quantized model production-ready. :::

Interview Questions

Q1: Explain the memory wall problem in LLM deployment and how compression addresses it.

The memory wall refers to the growing gap between what large language models require in terms of memory and what current GPU hardware provides. A 70B parameter model in FP16 requires roughly 140 GB of memory for weights alone, plus memory for KV cache and activations during inference. A single A100 80GB holds only 80 GB - insufficient even for the weights. The problem is compounded by memory bandwidth: even when weights fit, reading them from DRAM for each inference step is slow. An A100 has 2 TB/s bandwidth, but a 7B model's 14 GB of weights take 7ms just to stream - fundamentally limiting per-token generation speed. Compression directly addresses both issues: quantization reduces the bytes that must be stored (INT4 = 4x fewer bytes than FP16) and reduces the bytes that must be transferred per token (4x less bandwidth used, 4x faster generation for bandwidth-bound workloads). Structured pruning reduces the number of operations, attacking the compute side rather than the memory side.
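The streaming-time argument in this answer is simple division, sketched below. It uses the 2 TB/s figure from the text, counts weight traffic only, and ignores whether the weights fit on a single GPU, so the results are lower bounds on latency rather than predictions. `min_ms_per_token` is a hypothetical helper name.

```python
def min_ms_per_token(weight_gb: float, bandwidth_tb_s: float) -> float:
    """Lower bound on per-token latency at batch size 1: time to stream all weights once."""
    return weight_gb / (bandwidth_tb_s * 1000) * 1000  # GB / (GB/s) -> s -> ms


BANDWIDTH_TB_S = 2.0  # A100 figure used in the text

for label, params_b, bits in [("7B FP16", 7, 16), ("7B INT4", 7, 4),
                              ("70B FP16", 70, 16), ("70B INT4", 70, 4)]:
    weight_gb = params_b * bits / 8          # billions of params × bytes per param
    ms = min_ms_per_token(weight_gb, BANDWIDTH_TB_S)
    print(f"{label:>8}: {weight_gb:6.1f} GB weights, >= {ms:5.2f} ms/token, "
          f"<= {1000 / ms:6.1f} tok/s")
```

Halving the bytes per weight halves the lower bound, which is the mechanism behind the "4x faster generation" claim for bandwidth-bound decoding.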

Q2: What is the difference between quantization, pruning, and distillation? When would you choose each?

They address the same goal - a smaller, faster model - through fundamentally different mechanisms. Quantization reduces numeric precision of existing weights without changing model structure. It requires no retraining, can be applied in hours, and works best as a first-pass optimization for any deployment. Choose quantization first. Pruning removes parameters or structures entirely. Unstructured pruning creates sparse weights but rarely speeds up inference on standard GPUs. Structured pruning removes entire attention heads, neurons, or layers - this directly reduces computation and produces real latency improvements. Choose structured pruning when you need lower latency and can afford fine-tuning. Distillation trains a new, smaller model from scratch to mimic a larger teacher. It is the most expensive (requires a full training run) but produces the highest-quality small model. Choose distillation when you need a specific target architecture size and have the compute budget.

Q3: Why does INT4 quantization not cause catastrophic accuracy loss, given that INT4 has only 16 distinct values?

Two reasons work together. First, weight distributions in transformer models are typically near-Gaussian - most weights cluster near zero, with few values at the extremes. A uniform INT4 grid with per-group scales covers this range reasonably well, and non-uniform formats go further: NF4 (NormalFloat 4-bit) places its 16 quantization points at the quantiles of a standard normal distribution, so more points fall near the dense center. Second, neural networks exhibit "error tolerance" - small, roughly zero-mean quantization errors in individual weights tend to average out across the billions of multiply-accumulate operations in a forward pass, so the output distribution shifts only slightly. The breakthrough insight from GPTQ and AWQ research is that not all weights matter equally: AWQ protects the high-sensitivity weights - those multiplied by large input activations - and GPTQ adjusts the remaining weights to compensate for each rounding error, which allows INT4 to reach near-FP16 accuracy on most benchmarks.

Q4: Explain the bandwidth-bound vs. compute-bound distinction and how it affects compression strategy.

A GPU operation is bandwidth-bound when the bottleneck is moving data between memory and compute units. It is compute-bound when the compute cores are fully occupied and memory delivery is not the limiting factor. For autoregressive LLM inference at batch size 1, the model is almost always bandwidth-bound: generating each token requires reading all weights once, but the actual matrix-vector computation is trivial for modern GPUs. The GPU is waiting for data, not computation. Quantization directly helps bandwidth-bound workloads by reducing data movement - INT4 reduces weight transfer by 4x. For large-batch inference (batch size 32+), models become compute-bound as the matrix multiplications become larger and the GPU compute cores are fully utilized. Here, quantization's benefit is smaller or may involve overhead from dequantization. The correct compression strategy differs: for latency-critical single-request serving, quantization wins. For throughput-critical batch processing, structured pruning or distillation often wins.
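A rough roofline estimate shows where the crossover sits. The sketch below assumes weight-only quantization with FP16 compute, counts weight traffic only (no KV cache reads), and uses assumed A100-class peak figures; real kernels reach only a fraction of peak, so treat the batch sizes as order-of-magnitude guides. Both function names are hypothetical.

```python
def decode_arithmetic_intensity(batch_size: int, bits_per_weight: float) -> float:
    """FLOPs per byte of weight traffic for one decode step.

    Each weight is read once per step and used for one multiply-add
    (2 FLOPs) per sequence in the batch. KV-cache traffic is ignored.
    """
    bytes_per_weight = bits_per_weight / 8
    return 2 * batch_size / bytes_per_weight


def crossover_batch_size(peak_tflops: float, bandwidth_tb_s: float,
                         bits_per_weight: float) -> float:
    """Batch size at which a decode step stops being bandwidth-bound."""
    ridge = peak_tflops / bandwidth_tb_s          # FLOPs per byte the GPU can sustain
    return ridge * (bits_per_weight / 8) / 2


# Assumed A100-class figures: 312 TFLOPS dense FP16 tensor-core peak, 2 TB/s HBM.
PEAK_TFLOPS, BANDWIDTH_TB_S = 312.0, 2.0

for label, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    b = crossover_batch_size(PEAK_TFLOPS, BANDWIDTH_TB_S, bits)
    print(f"{label}: bandwidth-bound below batch ~{b:.0f}")
```

Quantized weights move the crossover to smaller batches: the same compute is spread over fewer bytes, so the workload becomes compute-bound sooner and the bandwidth benefit of quantization fades earlier.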

Q5: What is the compression-accuracy Pareto frontier and how do you find it for a given model?

The Pareto frontier is the set of (compression, accuracy) operating points where you cannot improve one metric without sacrificing the other. Any point on the frontier represents an efficient tradeoff. To find the frontier empirically: benchmark five to seven compression configurations (FP16, INT8, INT4-GPTQ, INT4-AWQ, INT4-AWQ with structured pruning, and a distilled smaller model) on your target task. Plot accuracy vs. memory or latency. The frontier shows achievable tradeoffs. Points significantly inside the frontier - such as naive INT4 without calibration - represent approaches that are worse on both dimensions simultaneously. The typical LLM frontier in 2025: FP16 (best accuracy, most memory), INT8 (marginal accuracy drop, 2x memory savings), INT4-GPTQ or INT4-AWQ (1-3% accuracy drop, 4x memory savings). Below INT4, the frontier drops steeply for most architectures.
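Extracting the frontier from benchmark results is a small filtering step. The sketch below keeps every configuration that no other configuration beats on both memory and accuracy. The measurements are made-up placeholder values for illustration only, and the `Candidate` and `pareto_frontier` names are hypothetical.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    label: str
    memory_gb: float   # lower is better
    accuracy: float    # higher is better


def pareto_frontier(candidates: List[Candidate]) -> List[Candidate]:
    """Keep candidates that no other candidate beats on both memory and accuracy."""
    frontier = []
    for c in candidates:
        dominated = any(
            o.memory_gb <= c.memory_gb and o.accuracy >= c.accuracy
            and (o.memory_gb < c.memory_gb or o.accuracy > c.accuracy)
            for o in candidates
        )
        if not dominated:
            frontier.append(c)
    return sorted(frontier, key=lambda c: c.memory_gb)


# Made-up measurements for illustration only; replace with your own benchmark results.
measurements = [
    Candidate("FP16", 16.0, 0.700),
    Candidate("INT8", 8.5, 0.697),
    Candidate("INT4-AWQ", 5.0, 0.688),
    Candidate("INT4-GPTQ", 5.0, 0.684),
    Candidate("INT4-naive", 5.0, 0.610),
]

for c in pareto_frontier(measurements):
    print(f"{c.label:<10} {c.memory_gb:5.1f} GB  accuracy {c.accuracy:.3f}")
```

Configurations that drop out of the list are dominated: some other option uses no more memory and is at least as accurate, so there is no reason to deploy them.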

Q6: How does edge deployment change the compression problem compared to cloud deployment?

Edge deployment introduces constraints that fundamentally don't exist in cloud: fixed, non-expandable memory (can't add GPUs), strict power budgets (affects thermal throttling and battery life), no network connectivity (model must be fully self-contained on-device), and heterogeneous hardware (ARM CPUs, NPUs, Apple Silicon instead of NVIDIA GPUs). These constraints require more aggressive compression - INT4 is often the minimum for any useful model, and smaller architectures (1-3B parameters rather than 7-70B) are mandatory. Format choices must match the execution engine: GGUF for llama.cpp on CPU, ExecuTorch for mobile (Meta), Core ML for Apple Silicon, TFLite for Android NPU. Edge deployment also changes the accuracy-compression tradeoff: for offline-capable local tools, users accept longer response times and slightly lower quality in exchange for privacy and no API costs. This broader tolerance for latency allows more aggressive compression without harming the user experience.

Q7: Walk me through a compression decision for a 70B model that needs to run on a single A100 80GB in production. What would you do?

Start by computing whether the target is even feasible. A 70B model at FP16 requires 140 GB - 1.75x the A100's 80 GB. At INT8, it requires 70 GB - under the limit, but leaving only 10 GB for KV cache and activations, which is tight for realistic sequence lengths. At INT4, the weights require ~35 GB, leaving 45 GB for KV cache - this is workable. So INT4 is the required compression level. Between GPTQ and AWQ at INT4, I would choose AWQ for a greenfield deployment: the quantization pipeline is faster, average accuracy at standard group sizes is better, and the Marlin kernel provides better throughput at the batch sizes relevant to a single-GPU serving setup. I would use group_size=128 and zero_point=True for asymmetric quantization. Before deployment, I would benchmark perplexity on the task domain, check accuracy on the specific capabilities the product uses (not just MMLU), and profile latency at the expected concurrent request count. If accuracy is insufficient at INT4, the fallback is to consider multi-GPU INT8 or to distill a smaller 34B model that can run on the A100 at INT8.
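The feasibility arithmetic above can be sketched as a few lines of Python. The helper names `weight_memory_gb` and `headroom_gb` are hypothetical, GB here means 10^9 bytes to match the numbers in the answer, and the estimate covers weights only: it ignores quantization metadata (group scales and zero points), activations, and framework overhead, so real headroom will be somewhat smaller.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return params_billion * bits_per_weight / 8


def headroom_gb(gpu_gb: float, params_billion: float, bits_per_weight: int) -> float:
    """Memory left for KV cache and activations; negative means it does not fit."""
    return gpu_gb - weight_memory_gb(params_billion, bits_per_weight)


A100_GB = 80
PARAMS_B = 70

for label, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    weights = weight_memory_gb(PARAMS_B, bits)
    spare = headroom_gb(A100_GB, PARAMS_B, bits)
    verdict = "fits" if spare > 0 else "does not fit"
    print(f"{label}: weights {weights:.0f} GB, headroom {spare:.0f} GB ({verdict})")
```

A positive headroom is necessary but not sufficient: whether 10 GB or 45 GB is enough depends on the KV cache size at your sequence length and concurrency, which should be measured on the actual serving stack.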
