Quantization Hardware Tradeoffs
The Model That Was Too Big to Ship
Your team has spent three months fine-tuning a 70B parameter LLaMA-3 model on proprietary data. The quality is excellent - internal evaluations show it outperforms GPT-4o-mini on your domain-specific tasks by a meaningful margin. Leadership is excited. The demo is scheduled for next Thursday.
Then the infrastructure lead runs the numbers. FP16 inference on a single A100 80GB is impossible - the model alone takes 140 GB. You need at least two A100s in tensor parallel just to fit it. At your projected traffic of 10 million tokens per day, the serving cost comes out to $2,400 per month on two A100s - expensive but manageable. Except the latency is 180ms per output token at your target batch size. Your product requires under 50ms. The physics do not cooperate.
Someone suggests quantization. INT8 would cut the model to 70 GB - fitting on a single A100 80GB with room for KV cache. Two times better memory bandwidth utilization means two times better latency. The cost drops to under $1,200 per month. But immediately the questions start: Will quality degrade? How much? Which quantization method? Does the A100 even support INT8 Tensor Cores natively? What about INT4 - could you go even further? The product launch is six days away.
This scenario plays out constantly in production ML. Quantization is not just a model compression trick. It is fundamentally a hardware optimization that changes which parts of the GPU you are utilizing, what your arithmetic intensity looks like, and what your cost-per-token ends up being. Getting the decision wrong means either shipping a product that is too slow (picked wrong precision) or a product that gives wrong answers (picked wrong quantization method).
Understanding quantization at the hardware level - not just as a training regularizer or accuracy-loss tradeoff but as a change in computational substrate - is essential for anyone building LLM serving infrastructure. This lesson gives you the complete picture, from first principles through production deployment.
Why This Exists - The Memory Wall and Precision Inflation
The fundamental problem is simple: modern LLMs are trained in FP32 or BF16, with parameters stored in FP16 or BF16 for inference. Each parameter takes 2 or 4 bytes. A 70B parameter model in FP16 needs 140 GB of memory just for weights, before any KV cache or activations.
This creates three compounding problems:
Problem 1 - Hardware capacity. The largest commercially available GPUs have 80 GB of HBM. A single 70B FP16 model cannot fit. You need multi-GPU serving, which adds NVLink/PCIe communication overhead, increases infrastructure cost, and complicates deployment.
Problem 2 - Memory bandwidth bottleneck. As shown in Lesson 1, decode-phase inference is memory-bandwidth-bound. Every generated token requires loading the entire model from HBM. At 140 GB per forward pass and 2 TB/s HBM bandwidth (A100), you spend roughly 70 ms just transferring weights, before any compute. Reducing weight size directly reduces this time.
Problem 3 - Cost. Fitting 140 GB of weights on 80 GB cards requires 2+ GPUs. Cloud cost scales linearly with GPU count. For high-traffic applications, a 2x reduction in model size is a 2x reduction in serving cost.
Quantization attacks all three problems simultaneously: it reduces memory footprint, increases effective memory bandwidth (fewer bytes per parameter means more parameters delivered per second), and lowers cost by reducing GPU count or allowing faster serving with fewer GPUs.
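To make the three problems concrete, here is a back-of-the-envelope calculator. It is a sketch with illustrative A100 spec numbers (80 GB HBM, 2 TB/s) that ignores KV cache, activations, and kernel overheads; the constants and function names are my own.

```python
# precision_planner.py - rough serving math for a dense LLM (illustrative sketch).
# Ignores KV cache, activations, kernel overheads, and interconnect costs.

GPU_HBM_GB = 80        # A100 80GB capacity
GPU_BW_TBPS = 2.0      # A100 HBM bandwidth, TB/s

BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def plan(n_params_b: float = 70.0) -> None:
    for fmt, bytes_per in BYTES_PER_PARAM.items():
        weights_gb = n_params_b * bytes_per          # 1e9 params * bytes/param = GB
        gpus_needed = int(-(-weights_gb // GPU_HBM_GB))  # ceil division, weights only
        ms_per_token = weights_gb / GPU_BW_TBPS      # GB / (TB/s) comes out in ms
        print(f"{fmt:>5}: {weights_gb:6.1f} GB weights, "
              f">= {gpus_needed} GPU(s), "
              f"~{ms_per_token:.0f} ms/token bandwidth floor (single GPU)")

if __name__ == "__main__":
    plan(70.0)
```

Running it for 70B parameters reproduces the figures used in the scenario above: 140 GB and a 70 ms/token bandwidth floor in FP16, 70 GB and 35 ms in INT8.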
The catch - always - is accuracy. Every bit you drop from a floating point number removes information. The central challenge of quantization is determining how much precision you actually need, which parameters are sensitive to precision loss, and how to recover accuracy where you lose it.
Historical Context - From Fixed-Point DSP to LLM Quantization
Quantization is not new. Digital signal processors (DSPs) in the 1980s ran in fixed-point integer arithmetic because floating-point hardware was expensive and slow. The entire field of digital signal processing was built around making integer arithmetic work for applications that were naturally continuous.
For neural networks, the first serious quantization work appeared in the late 1980s. Peter Auer, John Hertz, and others experimented with low-precision weights in early backpropagation experiments. The motivation was memory - early networks had to fit in kilobytes of RAM.
The modern era begins with three pivotal developments:
2015 - BinaryConnect and subsequent work (Courbariaux, Bengio, et al.): showed that binary weights (1 bit) could train reasonably well on simple benchmarks. Demonstrated the theoretical possibility that neural networks have significant redundancy.
2017-2018 - Google's INT8 quantization work for inference: The TensorFlow Lite team developed post-training quantization workflows that made INT8 inference practical on ARM CPUs and later on TPUs. This proved that 8-bit weights could achieve near-FP32 accuracy on classification tasks with careful calibration.
2022-2023 - LLM quantization research explosion: Tim Dettmers (University of Washington) released bitsandbytes with LLM.int8() quantization (2022) - the first method that could quantize 175B parameter models to INT8 without significant degradation by handling outlier activations separately. This was the "aha moment" for the field: you could run a 175B model on consumer hardware.
Following work came rapidly: GPTQ (Frantar et al., 2022), AWQ (Lin et al., 2023), SmoothQuant (Xiao et al., 2022), QuIP (Chee et al., 2023), and NVIDIA's FP8 training/inference support in the H100. By 2024, quantization had become a standard production tool rather than a research curiosity.
Core Concepts - Precision, Hardware, and the Quantization Arithmetic
Floating Point Representation Review
A floating-point number has three components: sign (1 bit), exponent (determines range), and mantissa (determines precision).
| Format | Total bits | Exponent bits | Mantissa bits | Dynamic range | Relative precision |
|---|---|---|---|---|---|
| FP32 | 32 | 8 | 23 | ~±3.4e38 | ~7 decimal digits (2^-23) |
| BF16 | 16 | 8 | 7 | ~±3.4e38 | ~2-3 decimal digits (2^-7) |
| FP16 | 16 | 5 | 10 | ~±65,504 | ~3 decimal digits (2^-10) |
| FP8 E4M3 | 8 | 4 | 3 | ~±448 | ~1 decimal digit (2^-3) |
| FP8 E5M2 | 8 | 5 | 2 | ~±57,344 | <1 decimal digit (2^-2) |
| INT8 | 8 | - | - | -128 to 127 | uniform |
| INT4 | 4 | - | - | -8 to 7 | uniform |
| NF4 | 4 | - | - | non-uniform | information-optimal |
BF16 is preferred over FP16 for training because it has the same exponent range as FP32 (less overflow/underflow during gradient computation). FP16 is more common for inference because it has better mantissa precision for forward-pass activations.
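You can check these ranges directly with `torch.finfo`. A quick sketch; the FP8 dtypes require a reasonably recent PyTorch build:

```python
import torch

# Print dynamic range, smallest normal value, and machine epsilon per format.
for dtype in [torch.float32, torch.bfloat16, torch.float16,
              torch.float8_e4m3fn, torch.float8_e5m2]:
    info = torch.finfo(dtype)
    print(f"{str(dtype):22s} max={info.max:<12.4g} "
          f"min_normal={info.tiny:<12.4g} eps={info.eps:.4g}")
```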
What Quantization Actually Does
Post-training quantization (PTQ) maps a floating-point weight tensor to a lower-precision representation. The linear quantization mapping for INT8 is:
$$q = \mathrm{clamp}\!\left(\mathrm{round}\!\left(\frac{x}{s}\right) + z,\; q_{\min},\; q_{\max}\right)$$

where $x$ is the original FP32/FP16 value, $s$ is the scale factor, $z$ is the zero-point (for asymmetric quantization), and $q_{\min}, q_{\max}$ are the integer range bounds (-128 to 127 for INT8).

Dequantization (reconstructing approximate float from integer):

$$\hat{x} = s \cdot (q - z)$$

The quantization error is:

$$\epsilon = x - \hat{x}$$

For uniform quantization, the maximum error is $s/2$. The scale is chosen to minimize this error by spanning exactly the observed value range:

$$s = \frac{\max(x) - \min(x)}{2^b - 1}$$

where $b$ is the number of bits. For INT8 with symmetric quantization around zero:

$$s = \frac{\max\lvert x \rvert}{127}$$
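As a sanity check, here is a minimal NumPy sketch of symmetric per-tensor INT8 quantization using exactly these formulas; the observed maximum round-trip error stays within the $s/2$ bound:

```python
import numpy as np

def quantize_int8_symmetric(x: np.ndarray):
    """Symmetric per-tensor INT8: s = max|x| / 127, zero-point fixed at 0."""
    s = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / s), -128, 127).astype(np.int8)
    return q, s

def dequantize(q: np.ndarray, s: float) -> np.ndarray:
    return q.astype(np.float32) * s

rng = np.random.default_rng(0)
# Weight-like values: zero-mean Gaussian with a typical LLM weight scale
w = rng.normal(0, 0.02, size=(4096, 4096)).astype(np.float32)

q, s = quantize_int8_symmetric(w)
w_hat = dequantize(q, s)
err = np.abs(w - w_hat)
print(f"scale s = {s:.3e}, max error = {err.max():.3e} (bound s/2 = {s/2:.3e})")
```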
The Memory Bandwidth Benefit of Quantization
This is the most direct hardware benefit. For the full weight set of a 70B-parameter model:
| Precision | Bytes per weight | Bytes for 70B params | Bandwidth time (A100 2 TB/s) |
|---|---|---|---|
| FP32 | 4 | 280 GB | 140 ms |
| FP16/BF16 | 2 | 140 GB | 70 ms |
| INT8 | 1 | 70 GB | 35 ms |
| INT4/NF4 | 0.5 | 35 GB | 17.5 ms |
For memory-bandwidth-bound decode, this is a direct latency multiplier: INT8 gives 2x throughput vs FP16, INT4 gives 4x throughput vs FP16 - purely from bandwidth savings.
When Quantization Helps vs Hurts
This is the critical nuance most engineers miss: quantization's benefits depend entirely on your arithmetic intensity regime.
Memory-bandwidth-bound regime (small batch decode):
- Bottleneck: bytes loaded from HBM per token
- INT8 loads 2x fewer bytes per token - 2x faster
- INT4 loads 4x fewer bytes - 4x faster
- Compute savings are irrelevant because compute is not the bottleneck
- Quantization helps significantly
Compute-bound regime (large batch or long prefill):
- Bottleneck: FLOP/s of Tensor Cores
- INT8 Tensor Cores on H100/A100 run at 2x the FLOP/s of FP16 Tensor Cores
- FP8 Tensor Cores on H100 run at 2x FP16 TFLOP/s
- Both bandwidth AND compute improve, so quantization helps even more in this regime
Edge case - pure compute bound at very large batches:
- If you have FP8 hardware and are already compute-bound in FP16
- Moving to FP8 doubles available FLOP/s
- Memory is no longer the bottleneck; compute utilization increases
- Quantization still helps, via compute path
The only case where quantization clearly hurts performance is when it introduces significant accuracy degradation - which depends on the method and model.
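One way to check which regime a layer is in is to compare its arithmetic intensity (FLOPs per byte of weight traffic) against the GPU's compute-to-bandwidth ratio. The sketch below uses illustrative A100 dense Tensor Core numbers and counts only weight bytes, which dominate traffic at small batch sizes; the constants and function are my own simplification, not a profiler.

```python
# Rough regime check for a single linear layer y = x @ W during decode.
# Illustrative A100 spec numbers; real kernels have extra traffic and overheads.

PEAK_TFLOPS_FP16 = 312.0   # A100 dense FP16 Tensor Core
PEAK_TFLOPS_INT8 = 624.0   # A100 dense INT8 Tensor Core
HBM_TBPS = 2.0             # A100 HBM bandwidth

def regime(batch: int, d_in: int, d_out: int,
           bytes_per_weight: float, peak_tflops: float) -> str:
    flops = 2 * batch * d_in * d_out                 # multiply-accumulates
    weight_bytes = d_in * d_out * bytes_per_weight   # dominant traffic at small batch
    intensity = flops / weight_bytes                 # FLOPs per byte
    ridge = peak_tflops / HBM_TBPS                   # TFLOP/s / (TB/s) = FLOP per byte
    label = "compute-bound" if intensity >= ridge else "bandwidth-bound"
    return f"{label} (intensity {intensity:.0f} vs ridge {ridge:.0f} FLOP/byte)"

for batch in [1, 8, 64, 512]:
    print(f"batch {batch:4d} FP16: {regime(batch, 8192, 8192, 2.0, PEAK_TFLOPS_FP16)}")
    print(f"batch {batch:4d} INT8: {regime(batch, 8192, 8192, 1.0, PEAK_TFLOPS_INT8)}")
```

With these illustrative numbers the FP16 crossover sits around batch 156, which is why small-batch decode is essentially always bandwidth-bound.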
Tensor Core Hardware Support Matrix
Different quantization formats require different hardware support. Without native Tensor Core support, the GPU must dequantize to FP16 before multiplying - losing most of the computational benefit (though still gaining the memory bandwidth benefit).
| Format | H100 SXM5 | A100 | A10G | L40S | RTX 4090 | RTX 3090 |
|---|---|---|---|---|---|---|
| FP32 | Native | Native | Native | Native | Native | Native |
| FP16 | Native TC | Native TC | Native TC | Native TC | Native TC | Native TC |
| BF16 | Native TC | Native TC | Native TC | Native TC | Native TC | No |
| FP8 (E4M3/E5M2) | Native TC | No | No | No | No | No |
| INT8 | Native TC | Native TC | Native TC | Native TC | Native TC | Native TC |
| INT4 | Native TC | Native TC | No | Native TC | Native TC | No |
"Native TC" = the operation runs directly in Tensor Cores without dequantization. "No" = must dequantize to FP16 first.
Key implications:
- FP8 is H100-only (and H200). If you are on A100, FP8 quantization only helps via the bandwidth path, not the compute path.
- INT4 has no native TC on A10G. On A10G, INT4 quantization gives bandwidth savings (2x over INT8) but Tensor Core compute does not improve. The CUDA dequantize-then-multiply kernel means you lose some of the compute benefit.
- INT8 has the broadest hardware support - available on all modern data center and consumer GPUs.
The INT4 Paradox - No Native INT4 Tensor Cores on Most Hardware
Here is something that surprises many engineers: despite INT4 being common in production quantization (GPTQ, AWQ, bitsandbytes NF4 all use 4-bit weights), most of the CUDA kernels that run INT4 models actually dequantize the INT4 weights to FP16 before running the Tensor Core multiply.
The reason: hardware INT4 Tensor Core operations have significant restrictions on matrix sizes and alignment that make them difficult to use efficiently in practice for general LLM layer shapes. The cutlass library and vendor-specific kernels often find it easier to dequantize weights from INT4 to FP16 at load time, then run the FP16 GEMM.
So why bother with INT4? The memory bandwidth benefit is real regardless. You still load 4x fewer bytes from HBM. The dequantization happens in registers (fast SRAM) at very low cost. The net result for memory-bandwidth-bound workloads is close to the theoretical 4x throughput improvement - even without native INT4 Tensor Core operations.
The practical implication: on A10G (no native INT4 TC), INT4 quantization still delivers approximately 3.5-4x better decode throughput than FP16 for memory-bandwidth-bound workloads. On H100 (native INT4 TC available but rarely used efficiently), the improvement can be slightly higher.
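To make the dequantize-then-multiply pattern concrete, here is a simplified PyTorch sketch of the logical flow: pack two signed 4-bit values per byte, unpack and dequantize to FP16, then run an ordinary matmul. Production kernels (AutoGPTQ, AutoAWQ) fuse the unpack/dequantize into the GEMM and keep it in registers; this sketch only illustrates the data movement, and the helper names are my own.

```python
import torch

def pack_int4(q: torch.Tensor) -> torch.Tensor:
    """Pack signed 4-bit values (-8..7), two per byte, into a uint8 tensor."""
    q = (q + 8).to(torch.uint8)                       # shift to 0..15
    return q[..., 0::2] | (q[..., 1::2] << 4)

def unpack_dequant_fp16(packed: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Unpack to int4 codes and dequantize to FP16 (what 'INT4' kernels multiply in)."""
    lo = (packed & 0x0F).to(torch.int16) - 8
    hi = (packed >> 4).to(torch.int16) - 8
    q = torch.stack([lo, hi], dim=-1).flatten(-2)     # restore original element order
    return q.to(torch.float16) * scale

# Toy example: quantize one weight row, pack it, then run the dequantized GEMM.
w = torch.randn(1, 128, dtype=torch.float16)
scale = w.abs().max() / 7.0
q = torch.clamp(torch.round(w / scale), -8, 7)
packed = pack_int4(q)                                 # 64 bytes instead of 256
w_hat = unpack_dequant_fp16(packed, scale)            # back to FP16 just before the GEMM

x = torch.randn(4, 128, dtype=torch.float16)
y = x.float() @ w_hat.float().T   # on a GPU this GEMM runs in FP16 on Tensor Cores
print(packed.shape, w_hat.shape, y.shape)
```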
NF4 - Information-Theoretically Optimal 4-Bit Quantization
Normal Float 4 (NF4), introduced in the QLoRA paper (Dettmers et al., 2023), is based on a key observation: neural network weights follow an approximately normal (Gaussian) distribution after training. Uniform quantization allocates equal intervals to all regions of the value range, wasting precision in the tails where values are rare.
NF4 instead assigns quantization levels with equal probability under a standard normal distribution: level $i$ sits at the $\frac{i + 1/2}{16}$-th quantile, for $i = 0, 1, \dots, 15$.

For 4-bit NF4, the 16 quantization levels are placed at:

$$q_i = \Phi^{-1}\!\left(\frac{i + 1/2}{16}\right), \quad i = 0, 1, \dots, 15$$

where $\Phi^{-1}$ is the inverse CDF of the standard normal (the levels are then rescaled so the largest magnitude maps to $\pm 1$).
This means NF4 allocates more quantization levels near zero (where most weights cluster) and fewer at the extremes. It is information-theoretically optimal for Gaussian-distributed data: it minimizes quantization error in expectation.
The accuracy benefit over INT4 (which uses uniform spacing) is typically 0.5-1.5 perplexity points on language modeling benchmarks for 7B-70B models.
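A small sketch shows why the quantile spacing helps: build 16 NF4-style levels from normal quantiles and compare round-trip error against a uniform 16-level grid on synthetic Gaussian weights. This follows the quantile construction described above, not the exact bitsandbytes level table, and requires scipy.

```python
import numpy as np
from scipy.stats import norm

# 16 levels with equal probability mass under N(0,1), rescaled to [-1, 1]
nf4_levels = norm.ppf((np.arange(16) + 0.5) / 16)
nf4_levels /= np.abs(nf4_levels).max()

# 16 uniformly spaced levels on [-1, 1] (uniform INT4-style grid)
int4_levels = np.linspace(-1, 1, 16)

def quantize_to_levels(x: np.ndarray, levels: np.ndarray, scale: float) -> np.ndarray:
    idx = np.abs(x[:, None] / scale - levels[None, :]).argmin(axis=1)
    return levels[idx] * scale

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=200_000)     # Gaussian weight-like values
scale = np.abs(w).max()                   # absmax scaling, as in 4-bit weight quantization

for name, levels in [("NF4", nf4_levels), ("uniform INT4", int4_levels)]:
    w_hat = quantize_to_levels(w, levels, scale)
    print(f"{name:13s} RMS error: {np.sqrt(np.mean((w - w_hat) ** 2)):.3e}")
```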
NF4 is used by:
- bitsandbytes QLoRA (4-bit fine-tuning)
- bitsandbytes `load_in_4bit=True` inference
- HuggingFace Transformers BitsAndBytes integration
FP8 on H100 - The New Standard
The H100 is the first data center GPU with native FP8 Tensor Core support. NVIDIA defined two FP8 formats:
- E4M3: 4 exponent bits, 3 mantissa bits. Better precision, smaller range. Max value: 448. Preferred for activations and weights.
- E5M2: 5 exponent bits, 2 mantissa bits. Larger range, less precision. Max value: 57,344. Preferred for gradients during training.
FP8 E4M3 inference on H100:
- 2x the FLOP/s of FP16 Tensor Cores (1979 vs 989 TFLOP/s)
- 2x memory bandwidth savings vs FP16
- Much better numerical stability than INT8 (preserves non-uniform spacing of floating-point)
- Quantization error is lower than INT8 for the same number of bits because of the floating-point representation
FP8 is increasingly used for both training and inference at NVIDIA shops. TensorRT-LLM, vLLM (H100 path), and NVIDIA's internal serving stack all support FP8 natively on H100. For A100-based deployments, INT8 remains the practical choice.
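A quick experiment illustrates why the floating-point spacing matters once outliers appear: cast the same activation-like tensor through FP8 E4M3 and through symmetric absmax INT8 and compare typical relative error. This assumes a PyTorch build with the `torch.float8_e4m3fn` dtype; only casts are used, so it runs on CPU.

```python
import torch

x = torch.randn(1_000_000) * 0.1      # activation-like values
x[:1000] *= 50                        # a few outlier channels, as seen in LLM activations

# FP8 E4M3: round-trip through the 8-bit float format (non-uniform spacing)
x_fp8 = x.to(torch.float8_e4m3fn).to(torch.float32)

# INT8: symmetric absmax quantization (uniform spacing, scale set by the outliers)
s = x.abs().max() / 127.0
x_int8 = torch.clamp(torch.round(x / s), -128, 127) * s

for name, x_hat in [("FP8 E4M3", x_fp8), ("INT8", x_int8)]:
    rel = ((x - x_hat).abs() / x.abs().clamp_min(1e-8)).median().item()
    print(f"{name:9s} median relative error: {rel:.4f}")
```

With outliers present, the INT8 scale is dominated by the largest channel and typical values lose most of their precision, while FP8's relative error stays roughly constant across magnitudes.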
Quantization Methods - From Simple to Production-Grade
Per-Tensor vs Per-Channel vs Per-Token vs Per-Group
The granularity at which you compute scale factors determines accuracy:
Per-tensor quantization: One scale factor for the entire weight matrix. Fastest and simplest. But if the weight matrix has columns with very different value ranges, one global scale gives poor precision for columns with small values. Typically loses 1-3 perplexity points on LLMs.
Per-channel quantization: One scale factor per output channel (each output column of the weight matrix gets its own scale). Much better accuracy. Supported natively by INT8 Tensor Core operations in CUTLASS/cuBLAS. This is the standard for INT8 post-training quantization.
Per-token quantization (activations): For quantizing activations, use one scale per token (per row of the activation matrix). This handles the token-to-token variation in activation magnitudes that causes per-tensor activation quantization to fail badly.
Per-group quantization: For INT4/NF4, divide each weight row into groups of $g$ elements (typically $g = 128$) and use a separate FP16 scale per group. This dramatically reduces quantization error at the cost of storing one FP16 scale per 128 weights (small overhead: 16 extra bits per 128 weights, about 0.125 bits per parameter). Used by GPTQ, AWQ, and bitsandbytes NF4.
The math for per-group INT4 quantization: a weight row of length $n$ is divided into $n/g$ groups, each with its own scale $s_j$ and zero-point $z_j$. For group $j$:

$$q_i = \mathrm{clamp}\!\left(\mathrm{round}\!\left(\frac{w_i}{s_j}\right) + z_j,\; -8,\; 7\right), \qquad \hat{w}_i = s_j\,(q_i - z_j)$$

Reconstruction error within a group is bounded by $s_j/2$, which is much smaller than with per-tensor or per-channel scales because $s_j$ is calibrated on just 128 values instead of millions.
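A minimal PyTorch sketch of per-group quantization for a single weight row (symmetric here for brevity, so $z_j = 0$) shows how the per-group scales tighten the error compared to a single per-row scale; the function names are my own.

```python
import torch

def quantize_per_group_int4(w_row: torch.Tensor, group_size: int = 128):
    """Symmetric per-group INT4: one FP16 scale per `group_size` weights."""
    groups = w_row.view(-1, group_size)                    # (n/g, g)
    scales = groups.abs().amax(dim=1, keepdim=True) / 7.0
    q = torch.clamp(torch.round(groups / scales), -8, 7)   # signed 4-bit codes
    return q.to(torch.int8), scales.to(torch.float16)

def dequantize_per_group(q: torch.Tensor, scales: torch.Tensor, n: int) -> torch.Tensor:
    return (q.to(torch.float16) * scales).view(n)

n = 4096
w = torch.randn(n) * 0.02
q, scales = quantize_per_group_int4(w)
w_hat = dequantize_per_group(q, scales, n)

# Compare against a single per-row scale at the same bit width
s_row = w.abs().max() / 7.0
w_hat_row = torch.clamp(torch.round(w / s_row), -8, 7) * s_row

rms = lambda e: e.pow(2).mean().sqrt().item()
print(f"per-group RMS error: {rms(w - w_hat):.6f}  "
      f"per-row RMS error: {rms(w - w_hat_row):.6f}")
```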
GPTQ - One-Shot Weight Quantization
GPTQ (Frantar, Ashkboos, Hoefler, Alistarh, 2022) is a post-training quantization method based on the Optimal Brain Quantization (OBQ) framework. The core insight: when you quantize a weight, you can compensate for the error by updating remaining (not yet quantized) weights in the same row.
The algorithm quantizes one weight at a time, minimizing the second-order approximation of the change in layer output:

$$\arg\min_{\hat{W}} \;\lVert WX - \hat{W}X \rVert_2^2$$

Using the inverse Hessian of this objective with respect to the weights ($H = 2XX^\top$), GPTQ can select per-weight quantization points and error-compensating updates that minimize output perturbation. The result is 3-4 bit quantization with near-FP16 accuracy for 7B-70B models.
GPTQ requires:
- A calibration dataset (typically 128 random samples from C4 or WikiText-2, sequence length 2048)
- One forward pass per layer during calibration to compute the Hessian
- No fine-tuning or gradient computation needed
Runtime: typically 1-4 hours for a 70B model on a single A100. One-time cost.
AWQ - Activation-Aware Weight Quantization
AWQ (Lin, Tang, Tang, Yang, Dang, Han, 2023) observes that not all weights are equally important. Weights that correspond to large activation channels (channels where input activations consistently have large magnitudes) cause proportionally more error when quantized.
AWQ's solution: identify the 1% of weights connected to the largest activation channels and apply per-channel scaling to make those weights less sensitive to quantization. The scaling is "absorbed" into adjacent normalization layers so inference cost is unchanged.
Compared to GPTQ:
- AWQ is faster to apply (minutes vs hours for 70B models)
- AWQ often achieves slightly better accuracy at the same bit width
- AWQ with GEMM kernels (via AutoAWQ) is slightly faster at inference time due to better memory access patterns
- AWQ is preferred for production deployment in most benchmarks as of 2024
SmoothQuant - Handling Activation Outliers
INT8 quantization of activations fails dramatically for LLMs because activations have outliers - specific channels where values are consistently 10-100x larger than the typical channel magnitude. A single outlier can inflate the per-tensor scale and waste precision on all other values.
SmoothQuant (Xiao, Lin, Seznec, Wu, Demouth, Han, 2022) "migrates" the quantization difficulty from activations to weights. For a linear layer $Y = XW$:

$$Y = \left(X\,\mathrm{diag}(s)^{-1}\right)\left(\mathrm{diag}(s)\,W\right)$$

The per-channel scaling vector $s$ is chosen to equalize the per-channel magnitudes of activations and weights (typically $s_j = \max\lvert X_j\rvert^{\alpha} / \max\lvert W_j\rvert^{1-\alpha}$ with $\alpha = 0.5$). After smoothing, both activations and weights can be quantized to INT8 with a single per-tensor or per-channel scale without outlier degradation.
SmoothQuant is the foundation of most production INT8 serving stacks. TensorRT-LLM, vLLM's INT8 path, and NVIDIA's Triton-based serving all use SmoothQuant-style per-channel smoothing.
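The smoothing identity is easy to verify in a few lines. This toy sketch (not the production TensorRT-LLM/vLLM implementation) applies the per-channel scaling with $\alpha = 0.5$ and checks both that the output is unchanged and that the outlier activation channel shrinks:

```python
import torch

torch.manual_seed(0)
X = torch.randn(16, 512)
X[:, 7] *= 50                       # one outlier activation channel
W = torch.randn(512, 512) * 0.02

alpha = 0.5
s = X.abs().amax(dim=0).pow(alpha) / W.abs().amax(dim=1).pow(1 - alpha)
s = s.clamp_min(1e-5)

X_smooth = X / s                    # divide each input channel by s_j
W_smooth = W * s[:, None]           # fold s_j into the matching weight rows

# Mathematically equivalent output...
print(torch.allclose(X @ W, X_smooth @ W_smooth, atol=1e-3))
# ...but the worst activation channel is now far closer to the others:
print(f"max |X| per channel before: {X.abs().amax(dim=0).max().item():.1f}, "
      f"after: {X_smooth.abs().amax(dim=0).max().item():.1f}")
```

In a real model the division by $s$ is folded into the preceding LayerNorm/RMSNorm, which is why the transformation is free at inference time.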
Code Examples - Quantization in Practice
Measuring Quantization Impact on Throughput
"""
quantization_benchmark.py - Compare throughput across quantization formats.
Requires: transformers, bitsandbytes, auto-gptq or autoawq, torch
"""
import torch
import time
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
def measure_decode_throughput(
model,
tokenizer,
prompt: str = "The future of artificial intelligence is",
max_new_tokens: int = 100,
num_runs: int = 5,
device: str = "cuda",
) -> dict:
"""
Measure decode throughput in tokens per second.
Returns mean and std across runs.
"""
inputs = tokenizer(prompt, return_tensors="pt").to(device)
prompt_len = inputs["input_ids"].shape[1]
# Warmup
with torch.no_grad():
_ = model.generate(
**inputs,
max_new_tokens=20,
do_sample=False,
)
torch.cuda.synchronize()
times = []
for _ in range(num_runs):
start = time.perf_counter()
with torch.no_grad():
output = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
do_sample=False,
)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start
tokens_generated = output.shape[1] - prompt_len
times.append(tokens_generated / elapsed)
mean_tps = sum(times) / len(times)
std_tps = (sum((t - mean_tps)**2 for t in times) / len(times)) ** 0.5
return {
"mean_tokens_per_second": round(mean_tps, 2),
"std_tokens_per_second": round(std_tps, 2),
}
def load_fp16_model(model_name: str):
return AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.float16,
device_map="auto",
)
def load_int8_model(model_name: str):
"""INT8 via bitsandbytes LLM.int8()"""
bnb_config = BitsAndBytesConfig(
load_in_8bit=True,
# llm_int8_threshold controls outlier detection sensitivity
# higher = more outliers handled in FP16, slightly less speedup
llm_int8_threshold=6.0,
)
return AutoModelForCausalLM.from_pretrained(
model_name,
quantization_config=bnb_config,
device_map="auto",
)
def load_nf4_model(model_name: str):
"""NF4 (4-bit) via bitsandbytes"""
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.float16, # dequantize to FP16 for compute
bnb_4bit_use_double_quant=True, # quantize the scales too (saves ~0.4 bits/param)
)
return AutoModelForCausalLM.from_pretrained(
model_name,
quantization_config=bnb_config,
device_map="auto",
)
# Usage example (swap in actual model_name):
# model_name = "meta-llama/Llama-3.1-8B-Instruct"
# tokenizer = AutoTokenizer.from_pretrained(model_name)
#
# for name, loader in [("FP16", load_fp16_model), ("INT8", load_int8_model), ("NF4", load_nf4_model)]:
# model = loader(model_name)
# result = measure_decode_throughput(model, tokenizer)
# print(f"{name}: {result['mean_tokens_per_second']:.1f} ± {result['std_tokens_per_second']:.1f} tok/s")
# del model
# torch.cuda.empty_cache()
GPTQ Quantization with AutoGPTQ
"""
gptq_quantize.py - Apply GPTQ 4-bit quantization to any HuggingFace model.
Requires: auto-gptq, transformers, datasets
"""
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer
from datasets import load_dataset
import torch
import random
def prepare_calibration_dataset(
tokenizer,
n_samples: int = 128,
seq_len: int = 2048,
dataset_name: str = "allenai/c4",
seed: int = 42,
) -> list:
"""
Load calibration data for GPTQ.
Uses random samples from C4 (standard practice).
"""
random.seed(seed)
dataset = load_dataset(
dataset_name,
"en",
split="train",
streaming=True,
)
samples = []
for i, sample in enumerate(dataset):
if len(samples) >= n_samples * 2:
break
text = sample["text"]
if len(text.split()) > 100: # filter very short texts
samples.append(text)
# Tokenize and pack into seq_len chunks
calibration_data = []
random.shuffle(samples)
for text in samples[:n_samples]:
enc = tokenizer(
text,
return_tensors="pt",
truncation=True,
max_length=seq_len,
padding="max_length",
)
if enc["input_ids"].shape[1] >= seq_len:
calibration_data.append(enc)
if len(calibration_data) >= n_samples:
break
return calibration_data
def quantize_model_gptq(
model_name: str,
output_dir: str,
bits: int = 4,
group_size: int = 128,
desc_act: bool = False, # activation reordering - better quality but slower
n_calibration_samples: int = 128,
):
"""
Apply GPTQ post-training quantization.
Args:
model_name: HuggingFace model ID or local path
output_dir: where to save quantized model
bits: 2, 3, 4, or 8
group_size: 32, 64, 128 (smaller = better accuracy, more overhead)
desc_act: True improves accuracy but adds ~15% latency overhead
n_calibration_samples: more samples = better calibration, longer runtime
"""
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token
quantize_config = BaseQuantizeConfig(
bits=bits,
group_size=group_size,
desc_act=desc_act,
sym=True, # symmetric quantization (zero-point = 0)
)
print(f"Loading model {model_name}...")
model = AutoGPTQForCausalLM.from_pretrained(
model_name,
quantize_config=quantize_config,
torch_dtype=torch.float16,
device_map="auto",
)
print(f"Preparing calibration data ({n_calibration_samples} samples)...")
calibration_data = prepare_calibration_dataset(
tokenizer,
n_samples=n_calibration_samples,
)
print("Quantizing... (this takes 30 min to 4 hours depending on model size)")
model.quantize(calibration_data)
print(f"Saving quantized model to {output_dir}")
model.save_quantized(output_dir, use_safetensors=True)
tokenizer.save_pretrained(output_dir)
print(f"Done. Model saved to {output_dir}")
return model
# Example:
# quantize_model_gptq(
# model_name="meta-llama/Llama-3.1-8B-Instruct",
# output_dir="./llama3-8b-gptq-4bit",
# bits=4,
# group_size=128,
# )
Evaluating Quantization Quality with LM-Eval
#!/bin/bash
# evaluate_quantization.sh
# Compare perplexity and few-shot accuracy across quantization methods
MODEL_NAME="meta-llama/Llama-3.1-8B-Instruct"
GPTQ_MODEL="./llama3-8b-gptq-4bit"
# Install lm-evaluation-harness
pip install lm-eval
# Evaluate FP16 baseline
echo "=== FP16 Baseline ==="
lm_eval --model hf \
--model_args "pretrained=${MODEL_NAME},dtype=float16" \
--tasks winogrande,hellaswag,arc_challenge,arc_easy,mmlu \
--num_fewshot 5 \
--output_path ./results_fp16/ \
--log_samples
# Evaluate INT8 (bitsandbytes)
echo "=== INT8 bitsandbytes ==="
lm_eval --model hf \
--model_args "pretrained=${MODEL_NAME},load_in_8bit=True" \
--tasks winogrande,hellaswag,arc_challenge,arc_easy,mmlu \
--num_fewshot 5 \
--output_path ./results_int8/ \
--log_samples
# Evaluate NF4 (bitsandbytes 4-bit)
echo "=== NF4 bitsandbytes ==="
lm_eval --model hf \
--model_args "pretrained=${MODEL_NAME},load_in_4bit=True,bnb_4bit_quant_type=nf4,bnb_4bit_compute_dtype=float16" \
--tasks winogrande,hellaswag,arc_challenge,arc_easy,mmlu \
--num_fewshot 5 \
--output_path ./results_nf4/ \
--log_samples
# Evaluate GPTQ 4-bit
echo "=== GPTQ 4-bit ==="
lm_eval --model hf \
--model_args "pretrained=${GPTQ_MODEL},gptq=True,dtype=float16" \
--tasks winogrande,hellaswag,arc_challenge,arc_easy,mmlu \
--num_fewshot 5 \
--output_path ./results_gptq4/ \
--log_samples
echo "Evaluation complete. Results in ./results_*/"
# Compare perplexity on WikiText-2
declare -A PPL_MODEL_ARGS=(
  [fp16]="pretrained=${MODEL_NAME},dtype=float16"
  [int8]="pretrained=${MODEL_NAME},load_in_8bit=True"
  [nf4]="pretrained=${MODEL_NAME},load_in_4bit=True,bnb_4bit_quant_type=nf4,bnb_4bit_compute_dtype=float16"
  [gptq4]="pretrained=${GPTQ_MODEL},gptq=True,dtype=float16"
)
for method in fp16 int8 nf4 gptq4; do
  echo "=== Perplexity ${method} ==="
  lm_eval --model hf \
    --model_args "${PPL_MODEL_ARGS[$method]}" \
    --tasks wikitext \
    --output_path "./results_ppl_${method}/"
done
Measuring Memory Footprint Reduction
"""
memory_analysis.py - Quantify memory savings from quantization.
"""
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
def get_model_memory_gb(model) -> float:
"""Return total parameter memory in GB."""
total_bytes = sum(
p.nelement() * p.element_size()
for p in model.parameters()
)
return total_bytes / (1024**3)
def get_gpu_memory_gb() -> float:
"""Return currently allocated GPU memory in GB."""
return torch.cuda.memory_allocated() / (1024**3)
def analyze_layer_dtypes(model) -> dict:
"""Show which layers are in which dtype (relevant for mixed-precision)."""
dtype_counts = {}
dtype_params = {}
for name, param in model.named_parameters():
dtype = str(param.dtype)
dtype_counts[dtype] = dtype_counts.get(dtype, 0) + 1
dtype_params[dtype] = dtype_params.get(dtype, 0) + param.nelement()
return {
"layer_counts": dtype_counts,
"param_counts": dtype_params,
}
def quantization_memory_report(model_name: str = "facebook/opt-6.7b"):
"""
Run memory analysis for FP16, INT8, and NF4 configs.
Uses a smaller model by default so it can run on consumer hardware.
"""
configs = {
"FP16": {
"torch_dtype": torch.float16,
},
"INT8 (LLM.int8)": {
"quantization_config": BitsAndBytesConfig(load_in_8bit=True),
},
"NF4 (QLoRA)": {
"quantization_config": BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
),
},
}
results = {}
for config_name, kwargs in configs.items():
torch.cuda.empty_cache()
torch.cuda.reset_peak_memory_stats()
model = AutoModelForCausalLM.from_pretrained(
model_name,
device_map="auto",
**kwargs,
)
param_memory = get_model_memory_gb(model)
gpu_memory = get_gpu_memory_gb()
layer_info = analyze_layer_dtypes(model)
results[config_name] = {
"parameter_memory_gb": round(param_memory, 2),
"total_gpu_memory_gb": round(gpu_memory, 2),
"layer_dtypes": layer_info,
}
del model
torch.cuda.empty_cache()
print(f"\nMemory Analysis: {model_name}")
print("=" * 60)
for name, r in results.items():
print(f"\n{name}:")
print(f" Parameter memory: {r['parameter_memory_gb']:.2f} GB")
print(f" Total GPU memory: {r['total_gpu_memory_gb']:.2f} GB")
dtypes = r["layer_dtypes"]["param_counts"]
for dtype, count in sorted(dtypes.items()):
print(f" {dtype}: {count/1e6:.1f}M params")
return results
Architecture Diagrams
(Diagrams: quantization format comparison by hardware path; quantization granularity vs. accuracy tradeoff; production quantization decision flow.)
Production Engineering Notes
Calibration Dataset Selection Matters
The calibration dataset for GPTQ and AWQ determines which weight distributions get preserved with high fidelity. Using a generic dataset (C4, WikiText-2) when your model is used for a specialized domain (medical, legal, code) can cause accuracy degradation on your actual use case even when generic benchmarks look fine.
Best practice: include 50-100 samples from your actual production distribution in the calibration set. Mix them with generic text (50% domain, 50% C4) to avoid overfitting the quantization to a narrow distribution.
# Example: mixed calibration dataset
def build_mixed_calibration_set(
tokenizer,
domain_texts: list, # your actual production examples
n_generic: int = 64,
n_domain: int = 64,
seq_len: int = 2048,
):
from datasets import load_dataset
import random
# Generic samples from C4
c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)
generic_texts = [s["text"] for s, _ in zip(c4, range(n_generic * 5)) if len(s["text"].split()) > 100]
random.shuffle(generic_texts)
# Combine and tokenize
all_texts = domain_texts[:n_domain] + generic_texts[:n_generic]
random.shuffle(all_texts)
calibration_data = []
for text in all_texts:
enc = tokenizer(text, return_tensors="pt", truncation=True,
max_length=seq_len, padding="max_length")
if enc["input_ids"].shape[1] == seq_len:
calibration_data.append(enc)
return calibration_data[:n_generic + n_domain]
Sensitivity Analysis Before Choosing Bit Width
Not all layers are equally sensitive to quantization. The embedding layer, the LM head (final linear projection to vocabulary), and the first and last few transformer layers tend to be more sensitive than middle layers.
A sensitivity analysis runs quantization at different bit widths per layer and measures output metric change:
"""
sensitivity_scan.py - Identify which layers are most sensitive to quantization.
Quick approximation: quantize one layer at a time and measure perplexity change.
"""
import torch
import copy
from transformers import AutoModelForCausalLM, AutoTokenizer
def layer_sensitivity_scan(model_name: str, n_eval_tokens: int = 1000):
"""
For each linear layer: quantize to INT8, measure perplexity change.
High change = sensitive layer (keep in FP16 or use higher precision).
"""
from datasets import load_dataset
tokenizer = AutoTokenizer.from_pretrained(model_name)
base_model = AutoModelForCausalLM.from_pretrained(
model_name, torch_dtype=torch.float16, device_map="auto"
)
# Get evaluation data
wikitext = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
eval_text = " ".join(wikitext["text"][:100])
eval_tokens = tokenizer(eval_text, return_tensors="pt")["input_ids"][:, :n_eval_tokens].cuda()
def compute_perplexity(model, tokens):
with torch.no_grad():
outputs = model(tokens, labels=tokens)
return torch.exp(outputs.loss).item()
baseline_ppl = compute_perplexity(base_model, eval_tokens)
print(f"Baseline FP16 perplexity: {baseline_ppl:.3f}")
sensitivities = {}
for name, module in base_model.named_modules():
if not isinstance(module, torch.nn.Linear):
continue
# Temporarily quantize this layer to INT8
original_weight = module.weight.data.clone()
scale = module.weight.data.abs().max() / 127.0
quantized = (module.weight.data / scale).round().clamp(-128, 127)
module.weight.data = (quantized * scale).to(torch.float16)
ppl_with_quant = compute_perplexity(base_model, eval_tokens)
ppl_increase = ppl_with_quant - baseline_ppl
sensitivities[name] = round(ppl_increase, 4)
# Restore
module.weight.data = original_weight
# Sort by sensitivity
sorted_layers = sorted(sensitivities.items(), key=lambda x: x[1], reverse=True)
print("\nMost sensitive layers:")
for layer_name, ppl_increase in sorted_layers[:10]:
print(f" {layer_name}: +{ppl_increase:.4f} ppl")
return sensitivities
vLLM FP8 Inference on H100
"""
vllm_fp8_serving.py - Launch FP8 inference with vLLM on H100.
"""
from vllm import LLM, SamplingParams
# FP8 quantization in vLLM uses NVIDIA's TRT-LLM FP8 calibration under the hood
# For supported models, this is the simplest production path
llm = LLM(
model="meta-llama/Llama-3.1-70B-Instruct",
quantization="fp8", # H100 only
tensor_parallel_size=2, # 2x H100 for 70B
gpu_memory_utilization=0.90,
max_model_len=8192,
# FP8 KV cache further reduces memory footprint:
kv_cache_dtype="fp8", # experimental in vLLM >= 0.5
)
sampling_params = SamplingParams(
temperature=0.7,
top_p=0.9,
max_tokens=512,
)
prompts = [
"Explain the difference between INT8 and FP8 quantization.",
"What are the hardware requirements for running FP8 inference?",
]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
print(f"Prompt: {output.prompt[:50]}...")
print(f"Output: {output.outputs[0].text[:200]}")
print()
AWQ Quantization (Faster Than GPTQ for Large Models)
"""
awq_quantize.py - AWQ post-training quantization.
Requires: autoawq
"""
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
model_name = "meta-llama/Llama-3.1-8B-Instruct"
output_dir = "./llama3-8b-awq-4bit"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoAWQForCausalLM.from_pretrained(
model_name,
low_cpu_mem_usage=True,
use_cache=False,
)
quant_config = {
"zero_point": True, # asymmetric quantization
"q_group_size": 128, # per-group size
"w_bit": 4, # 4-bit weights
"version": "GEMM", # GEMM kernel (faster) vs GEMV (better for batch=1)
}
# AWQ calibration - runs in ~5 minutes for 8B model (vs 30-60 min for GPTQ)
model.quantize(
tokenizer,
quant_config=quant_config,
# Default uses Pile dataset for calibration
# To use custom data: pass calib_data=["your", "texts", "here"]
)
model.save_quantized(output_dir, safetensors=True)
tokenizer.save_pretrained(output_dir)
print(f"AWQ quantized model saved to {output_dir}")
Common Mistakes
:::danger Applying Quantization Without Calibration Data Many engineers apply naive round-to-nearest quantization (no calibration) or use calibration data that does not match their use case. Without proper calibration, INT8 quantization can degrade perplexity by 5-15 points on LLMs due to activation outliers. Always use SmoothQuant-style smoothing or a calibrated quantization method (GPTQ, AWQ) for LLM inference - never naive rounding. :::
:::danger Assuming INT4 Uses INT4 Tensor Cores Engineers often assume that since they are running INT4 quantization, their GPU is running INT4 matrix multiplications. On almost all current production deployments (AutoGPTQ, AutoAWQ, bitsandbytes), the actual GPU kernel dequantizes INT4 weights to FP16 and runs FP16 matrix multiplications. The speedup comes from memory bandwidth reduction, not from INT4 compute. Do not expect INT4 to give you 4x the Tensor Core throughput - it gives you 4x the memory bandwidth and roughly 3-4x the decode throughput on memory-bandwidth-bound workloads. :::
:::danger Using FP8 on A100 and Expecting Compute Gains FP8 Tensor Core operations require the H100 (Hopper architecture) or newer. On an A100, loading FP8 weights requires dequantization to FP16 before any compute. You will get the 2x memory bandwidth benefit but no compute benefit. A model reported as "FP8 quantized" running on an A100 cluster will deliver approximately the same throughput as INT8 on A100. If you see claims of "FP8 gives 2x speedup", verify they are benchmarking on H100 or later hardware. :::
:::warning Calibrating Once and Never Updating Quantization scales are computed on a calibration dataset at a point in time. If your model is updated (fine-tuned, RLHF-trained, merged with adapters), the weight distributions change and the old calibration scales are no longer optimal. Re-run quantization calibration whenever you update the base model weights. Even fine-tuning on a small dataset can shift layer distributions enough to cause meaningful accuracy degradation with old calibration scales. :::
:::warning Ignoring Per-Layer Sensitivity in Production Not all quantization methods handle all model architectures equally. Some models have attention layers with extreme activation outliers (older BERT-family models, some fine-tuned LLaMA variants) where per-tensor INT8 quantization fails badly while per-channel is fine. Always run a sensitivity analysis and lm-eval benchmark on your specific model+quantization combination before deploying. Do not rely on benchmark numbers from a different model or a different fine-tune of the same base model. :::
:::warning Double-Counting Memory Savings at Small Batch Sizes At very small batch sizes (1-4), the KV cache memory is negligible. The dominant memory consumer is model weights. Here, quantization memory savings directly translate to fitting larger models or more KV cache. But at large batch sizes (128+), KV cache can exceed weight memory. INT8 weight quantization reduces weight memory by 2x but does not help KV cache size (unless you also quantize the KV cache separately, which vLLM and TRT-LLM both support). Always account for KV cache when planning capacity. :::
Interview Q&A
Q1: Why does INT8 quantization give approximately 2x throughput improvement for LLM decode, but only on memory-bandwidth-bound workloads?
Answer: For memory-bandwidth-bound workloads (decode phase at small-to-medium batch sizes), the bottleneck is how fast GPU HBM can deliver bytes to the Tensor Cores. The Tensor Cores sit waiting for data.
INT8 weights are 1 byte per parameter vs 2 bytes for FP16. For a 70B model: 70 GB INT8 vs 140 GB FP16. At A100's 2 TB/s bandwidth, loading the model takes 35 ms (INT8) vs 70 ms (FP16) per forward pass. The decode step is 2x faster purely from bandwidth.
The Tensor Cores themselves on A100 can do INT8 operations at roughly 2x the TFLOP/s of FP16 operations (1248 vs 624 TFLOP/s sparse, or 624 vs 312 TFLOP/s dense). But this compute improvement is irrelevant if you are memory-bandwidth-bound - the Tensor Cores were already idle. So you get 2x bandwidth improvement, not 4x (2x bandwidth x 2x compute).
If you were compute-bound (which happens at very large batches of 200+), you would see the 2x compute improvement on top of the 2x bandwidth improvement, giving up to 4x total throughput. But reaching compute-bound territory for LLM decode requires batch sizes that are typically impractical due to KV cache memory limits.
Q2: Explain the difference between GPTQ and AWQ quantization, and when you would choose each.
Answer: Both are post-training weight quantization methods for LLMs, targeting 4-bit weights with group-based scales. The key differences:
GPTQ (Frantar et al., 2022) uses a second-order optimization approach. It quantizes weights one at a time, using the inverse Hessian of the output with respect to weights to compensate for each quantization error by adjusting remaining weights in the same row. This is optimal but slow: 30 minutes to 4 hours for a 70B model. GPTQ with desc_act=True (activation reordering) gives the best accuracy at the cost of ~15% inference overhead.
AWQ (Lin et al., 2023) uses a first-order approach: identify which weights are most important (those connected to activation channels with consistently large magnitudes), apply per-channel scaling to make those weights more quantization-resistant, and absorb the scaling into adjacent normalization layers. AWQ is 10-100x faster to apply than GPTQ (minutes vs hours) and typically achieves slightly better accuracy on benchmarks.
When to choose each:
- For one-time offline quantization where quality is paramount: GPTQ with `desc_act=True`
- For fast turnaround (model updates, many model variants to quantize): AWQ
- For deployment: AWQ with GEMM kernels typically gives slightly better inference speed due to more regular memory access patterns
In practice, the accuracy difference between well-calibrated GPTQ and AWQ is small (< 0.5 perplexity points) and model-dependent. Run both on your specific model and eval set before deciding.
Q3: A team is running INT8 quantized LLM inference on A100s. They want to migrate to H100s and are considering FP8. What are the hardware considerations and expected gains?
Answer: The migration from A100 INT8 to H100 FP8 has several distinct components:
Memory bandwidth gain: H100 SXM5 has 3.35 TB/s vs A100's 2.0 TB/s, a 1.68x improvement regardless of precision format. Switching from INT8 (1 byte/param) to FP8 (1 byte/param) does not change the bytes moved per parameter, so the bandwidth gain comes entirely from the hardware upgrade, not the precision change.
Tensor Core compute gain: H100 FP8 Tensor Cores deliver 3958 TFLOP/s (sparse) or 1979 TFLOP/s (dense) vs H100 FP16 at 989 TFLOP/s (dense). This is 2x more TFLOP/s for FP8 vs FP16. Compared to A100 INT8 (624 TFLOP/s dense), H100 FP8 is approximately 3.2x higher compute throughput.
Expected inference speedup (memory-bandwidth-bound regime): primarily driven by bandwidth, so H100 FP8 vs A100 INT8 delivers roughly the 1.68x bandwidth improvement. If you are instead in the compute-bound regime (large batches), the roughly 3.2x compute improvement becomes the relevant gain.
Accuracy consideration: FP8 (E4M3) has floating-point representation, meaning it handles values with non-uniform precision across the range. INT8 is uniform precision. For LLM weights and activations, FP8 E4M3 typically loses less accuracy than INT8 because the non-uniform spacing better fits the Gaussian-like distribution of neural network values.
Implementation requirement: FP8 inference requires TensorRT-LLM or a vLLM build with H100 FP8 support. Not all models have FP8 calibration recipes yet. Verify your model is supported before committing.
Q4: What is NF4 quantization and why does it have better accuracy than INT4 for language model weights?
Answer: NF4 (Normal Float 4) is a 4-bit quantization format where the 16 quantization levels are spaced according to the quantile function of the standard normal distribution rather than uniformly.
The motivation: neural network weights, after training with SGD or Adam, follow an approximately normal (zero-mean Gaussian) distribution. Uniform INT4 quantization allocates equal bin widths across the entire value range. This wastes quantization levels on the rare large-magnitude tails while providing too few levels near zero where most weights are concentrated.
NF4 places the $i$-th quantization level at the $\frac{i+1/2}{16}$-th quantile of the standard normal, giving equal probability mass to each bin:

$$q_i = \Phi^{-1}\!\left(\frac{i + 1/2}{16}\right), \quad i = 0, 1, \dots, 15$$
This is information-theoretically optimal for normally-distributed data: it minimizes the expected quantization error for a Gaussian source.
In practice, LLM weight distributions are not perfectly Gaussian but are close enough that NF4 outperforms INT4 by approximately 0.5-1.5 perplexity points across 7B-70B model sizes. The improvement is more pronounced for smaller models where each bit of representation matters more.
The implementation requires a lookup table (16 entries) for dequantization rather than simple arithmetic, but the overhead is negligible on modern GPUs.
Q5: How does SmoothQuant solve the activation outlier problem in INT8 quantization?
Answer: The core problem: LLM activations (intermediate layer outputs) have "outlier channels" - specific feature dimensions where activations are consistently 10-100x larger than the average channel. A single outlier inflates the per-tensor scale factor, leaving most channels with only 2-3 effective bits of precision.
Example: if 99% of activation values lie in $[-1, 1]$ but one outlier channel has values in $[-60, 60]$, the symmetric INT8 scale becomes $s = 60/127 \approx 0.47$, so the typical channels only span $-1$ to $1$, which maps to INT8 values $-2$ to $2$ - only about 5 of the 255 available levels.
SmoothQuant's solution is a mathematically equivalent transformation. For a linear layer $Y = XW$:

$$Y = \left(X\,\mathrm{diag}(s)^{-1}\right)\left(\mathrm{diag}(s)\,W\right)$$

This is identical to $Y = XW$ in exact arithmetic. But by choosing $s_j = \max\lvert X_j\rvert^{\alpha} / \max\lvert W_j\rvert^{1-\alpha}$ (where $\alpha$ controls how much difficulty is migrated from activations to weights, typically $\alpha = 0.5$), you equalize the per-channel magnitudes of both activations and weights.
After smoothing:
- Activations are smooth enough to quantize well with per-tensor or per-token INT8
- Weights absorb some of the activation channel magnitudes, making them slightly harder to quantize but still well within INT8 range
The critical advantage: the $\mathrm{diag}(s)^{-1}$ and $\mathrm{diag}(s)$ operations can be merged into adjacent normalization layers (LayerNorm or RMSNorm) at zero inference cost. This makes SmoothQuant a zero-overhead transformation - you only pay for it once during quantization, not during inference.
SmoothQuant is now standard in all production INT8 LLM serving stacks: TensorRT-LLM, vLLM INT8 mode, and NVIDIA's Triton inference server all implement it.
Q6: Given a team choosing between INT8 on 2x A100 80GB and NF4 on 1x A100 80GB for serving a 70B model, what are the tradeoffs?
Answer: This is a real cost-optimization decision that comes up constantly in production.
Memory footprint:
- 70B INT8: 70 GB - just fits on single A100 80GB (with careful KV cache management)
- In practice, 70 GB of weights plus KV cache often overflows a single card, so you need 2x A100 for any meaningful batch size
- 70B NF4: 35 GB - fits on single A100 with 45 GB left for KV cache
Throughput (single-user, small batch):
- 2x A100 INT8 with tensor parallel: 2x bandwidth = 4 TB/s effective - roughly 2x vs single A100 FP16
- 1x A100 NF4: weights need 4x fewer bytes than FP16, so the weight-loading portion of each decode step behaves as if the card had 4x the bandwidth (roughly 8 TB/s equivalent)
- But NF4 requires dequantization to FP16 before compute, adding ~5-10% overhead
- Net: 1x A100 NF4 decode throughput is roughly similar to 2x A100 INT8 for small batches
Throughput (large batch):
- 2x A100 INT8 has 2x the KV cache capacity (2x80GB = 160GB, minus 70GB weights = 90GB KV)
- 1x A100 NF4 has only 45GB for KV cache
- Larger batch = better GPU utilization = higher throughput
- 2x A100 INT8 wins at large batch sizes due to more KV cache headroom
Accuracy:
- INT8 with SmoothQuant: typically within 0.2-0.5 perplexity points of FP16
- NF4 (AWQ/GPTQ 4-bit): typically 0.3-1.0 perplexity points above FP16
- INT8 is slightly better quality in most benchmarks
Cost:
- 2x A100: 2x the hardware cost and power consumption
- 1x A100 NF4: 1x hardware, lower power
Recommendation: For latency-sensitive low-traffic serving, 1x A100 NF4 is more cost-effective and provides comparable throughput. For high-traffic serving where batch sizes can reach 32-64, 2x A100 INT8 is better - more KV cache capacity enables larger batches which drives higher utilization and throughput per dollar. Benchmark both with your actual traffic distribution before deciding.
