Deploying Quantized Models in Production
The Traffic Spike That Rewrote the Budget
It was 2:17 AM when the on-call engineer got paged. The LLM inference cluster had run out of GPU memory. Not a gradual degradation - a hard OOM that killed all serving processes simultaneously, because the traffic spike had pushed the queue depth past the threshold where each new request forced an additional model copy into memory, and the system had no graceful degradation path.
The model was a 13B parameter assistant deployed in BF16. The serving team had sized the cluster for P95 traffic. The actual traffic that night was 4x P95, triggered by a viral social post. Eight H100s went dark. A hundred thousand requests queued and then timed out. The incident report was grim.
The post-mortem surfaced two decisions that should have been made before launch. First: at BF16, the 13B model occupied 26 GB per replica. An H100's 80 GB could fit exactly three replicas with minimal headroom. At INT4 with AWQ, the same model occupied 7 GB - eleven replicas per H100. The cluster could have absorbed the traffic spike on the same hardware, at a fraction of the memory cost. Second: there was no tested rollback path. When the quantized version was finally deployed six weeks after the incident, nobody had answered the question of what "rollback" meant operationally when your deployment artifact is a 7 GB AWQ checkpoint rather than a 26 GB BF16 checkpoint, and your serving stack is vLLM rather than a plain Hugging Face inference endpoint.
Quantized model deployment is not harder than FP16 deployment. But it has specific decision points that FP16 deployment does not have: format selection, calibration artifact management, quality drift detection, and format-to-serving-stack compatibility. Get these decisions right upfront and the operational picture is actually simpler than FP16. Get them wrong and you are debugging at 2 AM.
This lesson covers every decision in that pipeline - from choosing the right quantization format for your serving stack, through latency SLA planning, to production monitoring and rollback strategy. We will build a complete, production-ready deployment pipeline by the end.
Why This Exists - The Gap Between Research and Production
The quantization research literature is excellent at answering the question "how much quality do we lose at N bits?" The papers publish perplexity numbers, benchmark accuracy, and model sizes. What they rarely address is the operational question: "how do I actually run this in production at scale, with SLAs, monitoring, and a rollback plan?"
This gap exists for a real reason. Researchers optimize for demonstrating a technique works. Practitioners need to know it works reliably, at scale, under adversarial conditions, with predictable failure modes. These are different problems.
The practical challenges that production quantization deployment adds on top of research quantization work:
The format fragmentation problem. At least five quantization formats are in common use today (GPTQ, AWQ, GGUF, NF4/bitsandbytes, FP8), and each has different compatibility with different serving stacks. A GPTQ checkpoint cannot be loaded directly by llama.cpp. A GGUF file cannot be loaded by vLLM. Choosing the wrong format for your serving stack means starting over.
The performance regression detection problem. Quantization degrades quality on a continuum - it is not a binary pass/fail. In production, small quality regressions accumulate and become visible only over time, through user-facing metrics like task completion rate or user rating scores. You need a monitoring strategy that can detect these regressions before they become serious.
The serving configuration problem. Quantized models have different memory footprints, different compute characteristics, and different batching behaviors than their FP16 counterparts. Serving configurations optimized for FP16 (batch sizes, memory fractions, tensor parallelism configs) are not automatically correct for quantized models.
Historical Context - How the Serving Ecosystem Evolved
The modern quantized model serving ecosystem is roughly three years old. Before late 2022, most LLM serving used FP16 models on fairly standard deep learning serving infrastructure: Triton Inference Server or simple PyTorch HTTP servers.
The first practical quantization format for LLM serving was GPTQ (Frantar et al., 2022), which shipped with custom CUDA kernels for efficient INT4 weight dequantization on GPU; the later ExllamaV2 kernel made this path genuinely fast. Before those kernels existed, INT4 quantization was largely impractical - the dequantization overhead swamped any memory bandwidth savings.
llama.cpp (Gerganov, early 2023) brought quantization to CPU inference. The GGUF format (successor to GGML) became the dominant format for edge and consumer-device deployment because it handled mixed-precision quantization gracefully and was designed for memory-mapped loading, which made it fast to start even on systems without dedicated ML hardware.
vLLM (Kwon et al., 2023) from UC Berkeley addressed the serving efficiency problem separately, with PagedAttention for KV cache management. When vLLM added native GPTQ and AWQ support in mid-2023, it became the standard serving stack for GPU-based quantized model deployment because it combined efficient memory management with quantized model support.
FP8 quantization support arrived with the H100 hardware generation. The H100 has native FP8 tensor core support, and in 2023-2024, both NVIDIA's TensorRT-LLM and vLLM added FP8 quantization support that runs faster than INT4 on H100s while maintaining quality close to BF16.
The current landscape is a format zoo, but it has a practical structure: each format has a clear primary use case, and matching format to use case is the first deployment decision.
Core Concepts - Choosing the Right Format
The Format Decision Matrix
Every production deployment starts with this decision. The key variables are: serving hardware (GPU vs CPU vs Apple Silicon), serving stack, required throughput, and acceptable quality loss.
| Format | Serving Stack | Hardware | Best For |
|---|---|---|---|
| GPTQ | vLLM, text-generation-inference | NVIDIA GPU | High-throughput GPU serving |
| AWQ | vLLM, text-generation-inference | NVIDIA GPU | High-throughput, better quality |
| GGUF | llama.cpp, Ollama, LMStudio | CPU, Apple Silicon, any GPU | Edge, local, CPU inference |
| NF4 (bitsandbytes) | Hugging Face transformers | NVIDIA GPU | Research, fine-tuning, low throughput |
| FP8 | vLLM, TensorRT-LLM | H100 (Hopper and newer) | Maximum throughput on modern GPU |
The critical constraint: vLLM does not natively support GGUF, and llama.cpp does not natively support GPTQ. Mixing formats and stacks requires conversion steps that add operational complexity.
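If you want the matrix in code form, a first-pass selector is easy to sketch. This is a condensation of the table above, not an exhaustive compatibility check - the hardware and stack labels are illustrative, so verify against your serving stack's documentation before committing:
def pick_format(hardware: str, stack: str) -> str:
    """First-pass format selection from the decision matrix above (sketch)."""
    if stack in ("llama.cpp", "ollama") or hardware in ("cpu", "apple_silicon"):
        return "GGUF"
    if stack == "transformers":  # research / fine-tuning / low throughput
        return "NF4 (bitsandbytes)"
    if hardware == "h100":       # native FP8 tensor cores
        return "FP8"
    # NVIDIA GPU + vLLM/TGI: GPTQ for raw speed, AWQ for quality headroom
    return "GPTQ or AWQ"
assert pick_format("h100", "vllm") == "FP8"
assert pick_format("apple_silicon", "llama.cpp") == "GGUF"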
Format Deep Dive: GPTQ + vLLM
GPTQ remains the most widely deployed format for GPU-based serving because it has been supported in vLLM the longest, the quality-size tradeoff at INT4 is well-understood, and the ExllamaV2 kernel is mature and reliable.
The quantization config that matters most for production GPTQ:
from auto_gptq import BaseQuantizeConfig
# Production-grade GPTQ config for a 7B model
quantize_config = BaseQuantizeConfig(
bits=4, # INT4 - standard production choice
group_size=128, # Standard. Use 64 for better quality at ~5% size cost.
    desc_act=False,        # False = quantize columns in default order (faster at inference)
                           # True = activation-order ("act-order") quantization (better quality, slower inference)
sym=True, # Symmetric quantization - slightly faster kernel
true_sequential=True, # Quantize layers sequentially to minimize error propagation
)
# Note: desc_act=True improves quality by ~0.1 PPL for most models
# but costs 15-20% throughput. Only use it if quality headroom is tight.
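To turn this config into a deployable checkpoint, the flow below is a minimal sketch against auto_gptq's quantize/save API, reusing the quantize_config defined above. The two calibration prompts are stand-ins - in practice you want a few hundred samples drawn from real traffic:
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer
model_id = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Calibration examples - replace with a few hundred real traffic samples
calib_texts = [
    "Explain the water cycle in three sentences.",
    "Summarize the key points of a rental agreement.",
]
examples = [tokenizer(t, return_tensors="pt") for t in calib_texts]
model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)
model.quantize(examples)  # runs GPTQ layer by layer against the calibration set
model.save_quantized("llama-2-7b-gptq", use_safetensors=True)
tokenizer.save_pretrained("llama-2-7b-gptq")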
Format Deep Dive: AWQ + vLLM
AWQ is the better choice when quality is more important than inference speed (AWQ kernels are slightly slower than ExllamaV2 for GPTQ), or when calibration data may be mismatched (AWQ is more robust to this).
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
model = AutoAWQForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM",  # GEMM = batch inference (vLLM, most serving)
                        # GEMV = single-token generation (interactive chat)
}
# Run AWQ calibration + quantization, then save the checkpoint
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized("llama-2-7b-awq")
tokenizer.save_pretrained("llama-2-7b-awq")
# AWQ GEMM kernel is optimized for batch size > 1
# AWQ GEMV kernel is optimized for batch size = 1 (lower latency)
Format Deep Dive: GGUF + llama.cpp
GGUF is the correct choice for any deployment that is not on a high-end NVIDIA GPU, or for edge/local deployment scenarios. The format supports mixed quantization types within a single file - you can store attention layers in Q5_K and MLP layers in Q4_K, letting you trade off memory vs quality at a granular level.
Common GGUF quantization types ranked by quality (highest to lowest):
- Q8_0: 8-bit, minimal quality loss, largest size
- Q6_K: 6-bit k-quant, excellent quality
- Q5_K_M: 5-bit medium k-quant, recommended default
- Q4_K_M: 4-bit medium k-quant, good balance
- Q4_K_S: 4-bit small k-quant, more compression
- Q3_K_M: 3-bit, noticeable quality loss
- Q2_K: 2-bit, significant quality loss, very small
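The effective bits-per-weight for these types runs slightly above the nominal bit width because of per-block scales and zero points. The values below are rough approximations (assumed, not measured), but they make file-size planning concrete:
# Approximate effective bits-per-weight for common GGUF types (assumed values)
bpw = {"Q8_0": 8.5, "Q6_K": 6.6, "Q5_K_M": 5.7, "Q4_K_M": 4.8, "Q3_K_M": 3.9, "Q2_K": 2.6}
n_params = 7e9  # 7B model
for qtype, bits in bpw.items():
    print(f"{qtype:<7} ~{n_params * bits / 8 / 1e9:.1f} GB")
# Q4_K_M for a 7B lands around 4 GB, matching the files seen in the wild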
Format Deep Dive: NF4 + bitsandbytes
NF4 (4-bit Normal Float) via bitsandbytes is optimized for cases where the model weights follow an approximately normal distribution, which is true for most trained transformer weights. It is the format used by QLoRA for fine-tuning, and it is the simplest format to use with Hugging Face transformers.
NF4 is not recommended for high-throughput production serving because bitsandbytes lacks the optimized batching kernels that GPTQ/AWQ have. It is best for research, development, and low-throughput inference where developer ergonomics matter more than throughput.
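For completeness, loading a model in NF4 is a one-config change with transformers plus bitsandbytes - the standard BitsAndBytesConfig pattern:
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # normal-float 4-bit
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls run in BF16
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)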
Diagrams
The lesson's diagrams (not reproduced here) cover the format selection decision flow, the complete production deployment pipeline, and tensor parallelism with quantized models.
Serving Stack Configuration
vLLM with GPTQ
vLLM is the standard serving stack for high-throughput GPU inference. Loading a GPTQ model requires specifying the quantization type:
from vllm import LLM, SamplingParams
# Load GPTQ quantized model
llm = LLM(
model="TheBloke/Llama-2-7B-GPTQ", # HuggingFace GPTQ checkpoint
quantization="gptq",
# Key serving parameters for quantized models:
max_model_len=4096, # Reduce if GPU memory is tight
    gpu_memory_utilization=0.90,  # vLLM's default; INT4 weights leave more of this budget for KV cache
tensor_parallel_size=1, # Set to 2+ for 70B models or multi-GPU serving
# Performance tuning:
max_num_batched_tokens=8192, # Increase for higher throughput (at cost of latency variance)
max_num_seqs=256, # Max concurrent sequences in flight
enforce_eager=False, # Keep False - CUDA graph capture needed for speed
)
sampling_params = SamplingParams(
temperature=0.7,
top_p=0.95,
max_tokens=512,
)
outputs = llm.generate(["Explain quantum entanglement in simple terms."], sampling_params)
print(outputs[0].outputs[0].text)
vLLM with AWQ
AWQ loading is almost identical to GPTQ - just change the quantization argument:
from vllm import LLM, SamplingParams
llm = LLM(
model="casperhansen/llama-2-7b-chat-awq",
quantization="awq",
gpu_memory_utilization=0.90,
max_model_len=4096,
tensor_parallel_size=1,
)
# AWQ is generally ~5-10% slower than GPTQ at the same bit width
# due to the AWQ dequantization kernel being slightly less optimized than ExllamaV2
# But AWQ quality is usually 0.1-0.3 PPL better than GPTQ at the same settings
vLLM as an OpenAI-compatible API Server
For production deployments, you almost always want the OpenAI-compatible server mode rather than the Python API. This decouples the serving infrastructure from the application code:
# Start a GPTQ model as an OpenAI-compatible endpoint
python -m vllm.entrypoints.openai.api_server \
--model TheBloke/Llama-2-7B-GPTQ \
--quantization gptq \
--host 0.0.0.0 \
--port 8000 \
--max-model-len 4096 \
--gpu-memory-utilization 0.90 \
--tensor-parallel-size 1 \
--max-num-seqs 256 \
--served-model-name llama-2-7b-gptq # Client-facing model name
# Application code uses standard OpenAI client - no vLLM dependency
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8000/v1",
api_key="not-needed", # vLLM does not require an API key by default
)
response = client.chat.completions.create(
model="llama-2-7b-gptq",
messages=[{"role": "user", "content": "What is the capital of France?"}],
max_tokens=100,
temperature=0.0,
)
print(response.choices[0].message.content)
llama.cpp for CPU/Edge Deployment
For CPU, Apple Silicon, or edge deployments using GGUF:
# Build llama.cpp with Metal (Apple Silicon) or CPU support
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# For Apple Silicon with Metal GPU acceleration:
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release -j 8  # set -j to your core count (nproc is unavailable on macOS)
# Run the server
# --n-gpu-layers: layers to offload to GPU (if available); --threads: CPU threads for the rest
./build/bin/llama-server \
    --model /path/to/llama-2-7b.Q4_K_M.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    --ctx-size 4096 \
    --n-gpu-layers 32 \
    --threads 8 \
    --batch-size 512 \
    --n-predict -1  # -1 = no limit on output tokens
# llama.cpp server is also OpenAI-compatible
import requests
response = requests.post(
"http://localhost:8080/v1/chat/completions",
json={
"model": "llama-2-7b-q4km",
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 100,
}
)
print(response.json()["choices"][0]["message"]["content"])
Converting Between Formats
A common production scenario: you have a GPTQ checkpoint (from HuggingFace) but need a GGUF file for an edge deployment:
"""
Format conversion workflow: GPTQ -> FP16 -> GGUF
There is no direct GPTQ -> GGUF converter.
The path is: GPTQ -> dequantize to FP16 -> convert to GGUF with llama.cpp
"""
# Step 1: Dequantize GPTQ back to FP16
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer
import torch
# Load the GPTQ quantized model and dequantize it
gptq_model = AutoGPTQForCausalLM.from_quantized(
"TheBloke/Llama-2-7B-GPTQ",
use_safetensors=True,
device_map="auto",
)
# Save as FP16 safetensors (this will be the full 13 GB FP16 checkpoint)
fp16_save_path = "/tmp/llama-2-7b-fp16-for-gguf"
gptq_model.save_pretrained(fp16_save_path)
tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-7B-GPTQ")
tokenizer.save_pretrained(fp16_save_path)
print(f"FP16 model saved to {fp16_save_path}")
# Step 2: Convert FP16 to GGUF using llama.cpp conversion script
cd /path/to/llama.cpp
python convert_hf_to_gguf.py /tmp/llama-2-7b-fp16-for-gguf \
--outtype f16 \
--outfile /tmp/llama-2-7b-f16.gguf
# Step 3: Quantize the GGUF file to Q4_K_M
./build/bin/llama-quantize \
/tmp/llama-2-7b-f16.gguf \
/tmp/llama-2-7b-Q4_K_M.gguf \
Q4_K_M
echo "Conversion complete. Final GGUF: /tmp/llama-2-7b-Q4_K_M.gguf"
ls -lh /tmp/llama-2-7b-Q4_K_M.gguf
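Before shipping the artifact, a quick smoke test with llama.cpp's CLI is cheap insurance (the binary is named llama-cli in recent builds; older builds call it main):
./build/bin/llama-cli \
    -m /tmp/llama-2-7b-Q4_K_M.gguf \
    -p "The capital of France is" \
    -n 16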
Latency SLAs and Throughput Planning
Expected Performance at Different Quantization Levels
Representative numbers for a 7B Llama-2 model served with vLLM at batch size 32, measured on an A100 80GB (the FP8 row is from an H100, since A100 lacks FP8 tensor cores):
| Format | Memory | TTFT (ms) | Tokens/sec (single req) | Throughput (req/min) |
|---|---|---|---|---|
| BF16 | 14 GB | 45 ms | 85 tok/s | ~180 req/min |
| GPTQ INT4 | 4 GB | 55 ms | 115 tok/s | ~280 req/min |
| AWQ INT4 | 4 GB | 58 ms | 108 tok/s | ~265 req/min |
| FP8 (H100) | 7 GB | 35 ms | 145 tok/s | ~360 req/min |
Key observations: INT4 is faster than BF16 in throughput because more sequences fit in the KV cache simultaneously. But TTFT (time to first token) is slightly higher because dequantization adds overhead during the prefill phase. For interactive applications where TTFT matters, this tradeoff is important to measure on your specific model and hardware.
"""
Latency measurement harness for quantized model deployment.
Run this before setting SLAs.
"""
import time
import statistics
import concurrent.futures
from openai import OpenAI
def measure_latency_profile(
base_url: str,
model_name: str,
n_requests: int = 100,
prompt: str = "Explain the water cycle in three sentences.",
max_tokens: int = 150,
concurrency: int = 1,
) -> dict:
"""
Measure TTFT and end-to-end latency for a serving endpoint.
Returns p50, p95, p99, mean metrics.
"""
client = OpenAI(base_url=base_url, api_key="unused")
ttft_samples = []
e2e_samples = []
def single_request():
start = time.perf_counter()
first_token_time = None
        # Use streaming to measure TTFT accurately
        stream = client.chat.completions.create(
            model=model_name,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=max_tokens,
            stream=True,
        )
        for chunk in stream:
            # Some chunks (e.g. role-only or usage chunks) carry no content
            if first_token_time is None and chunk.choices and chunk.choices[0].delta.content:
                first_token_time = time.perf_counter()
end = time.perf_counter()
ttft = (first_token_time - start) * 1000 if first_token_time else None
e2e = (end - start) * 1000
return ttft, e2e
with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as executor:
futures = [executor.submit(single_request) for _ in range(n_requests)]
for future in concurrent.futures.as_completed(futures):
ttft, e2e = future.result()
if ttft is not None:
ttft_samples.append(ttft)
e2e_samples.append(e2e)
def percentile(data, p):
return statistics.quantiles(data, n=100)[p-1]
results = {
"ttft_p50_ms": percentile(ttft_samples, 50),
"ttft_p95_ms": percentile(ttft_samples, 95),
"ttft_p99_ms": percentile(ttft_samples, 99),
"ttft_mean_ms": statistics.mean(ttft_samples),
"e2e_p50_ms": percentile(e2e_samples, 50),
"e2e_p95_ms": percentile(e2e_samples, 95),
"e2e_p99_ms": percentile(e2e_samples, 99),
"e2e_mean_ms": statistics.mean(e2e_samples),
"n_requests": n_requests,
"concurrency": concurrency,
}
return results
# Run latency profile before setting SLAs
# Compare FP16 and quantized endpoints side by side
fp16_metrics = measure_latency_profile(
base_url="http://fp16-endpoint:8000/v1",
model_name="llama-2-7b-fp16",
n_requests=200,
concurrency=10,
)
quant_metrics = measure_latency_profile(
base_url="http://quant-endpoint:8000/v1",
model_name="llama-2-7b-gptq",
n_requests=200,
concurrency=10,
)
print("=== Latency Comparison ===")
print(f"{'Metric':<25} {'FP16':>12} {'INT4 GPTQ':>12} {'Ratio':>10}")
print("-" * 65)
for key in ["ttft_p50_ms", "ttft_p95_ms", "ttft_p99_ms", "e2e_p50_ms", "e2e_p95_ms"]:
fp16_val = fp16_metrics[key]
quant_val = quant_metrics[key]
ratio = quant_val / fp16_val
print(f"{key:<25} {fp16_val:>12.1f} {quant_val:>12.1f} {ratio:>10.2f}x")
A/B Testing Quantized vs FP16 in Production
Never go directly from staging to 100% traffic with a quantized model. Always gate through an A/B test that lets you measure quality impact on real user traffic before full rollout.
"""
A/B test controller for quantized model rollout.
Uses consistent hashing so each user always gets the same model version.
"""
import hashlib
import time
from enum import Enum
from dataclasses import dataclass
from openai import OpenAI
class ModelVariant(Enum):
CONTROL = "fp16"
TREATMENT = "int4_gptq"
@dataclass
class ABTestConfig:
experiment_name: str
treatment_fraction: float # 0.0 to 1.0 - fraction getting treatment
control_endpoint: str
treatment_endpoint: str
control_model_name: str
treatment_model_name: str
class ABTestRouter:
def __init__(self, config: ABTestConfig):
self.config = config
def get_variant(self, user_id: str) -> ModelVariant:
"""
Consistent assignment: same user always gets same variant.
Uses SHA-256 hash of experiment_name + user_id for stability.
"""
hash_input = f"{self.config.experiment_name}:{user_id}".encode()
hash_value = int(hashlib.sha256(hash_input).hexdigest(), 16)
# Normalize to [0, 1)
normalized = (hash_value % 10000) / 10000.0
if normalized < self.config.treatment_fraction:
return ModelVariant.TREATMENT
return ModelVariant.CONTROL
def get_endpoint_and_model(self, user_id: str) -> tuple:
variant = self.get_variant(user_id)
if variant == ModelVariant.TREATMENT:
return self.config.treatment_endpoint, self.config.treatment_model_name, variant
return self.config.control_endpoint, self.config.control_model_name, variant
# Usage in your API layer
ab_config = ABTestConfig(
experiment_name="int4_gptq_rollout_v1",
treatment_fraction=0.10, # Start at 10%, increase if metrics hold
control_endpoint="http://fp16-endpoint:8000/v1",
treatment_endpoint="http://gptq-endpoint:8000/v1",
control_model_name="llama-2-7b-fp16",
treatment_model_name="llama-2-7b-gptq",
)
router = ABTestRouter(ab_config)
def handle_user_request(user_id: str, prompt: str) -> dict:
    endpoint, model_name, variant = router.get_endpoint_and_model(user_id)
    client = OpenAI(base_url=endpoint, api_key="unused")
    start = time.perf_counter()
    response = client.chat.completions.create(
        model=model_name,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=512,
    )
    latency_ms = (time.perf_counter() - start) * 1000
    # Log which variant served this request for analysis
    log_inference_event(
        user_id=user_id,
        variant=variant.value,
        model_name=model_name,
        latency_ms=latency_ms,
    )
return {
"content": response.choices[0].message.content,
"model_variant": variant.value,
}
def log_inference_event(user_id, variant, model_name, latency_ms):
"""Replace with your actual logging infrastructure (Datadog, CloudWatch, etc.)"""
print(f"[AB_TEST] user={user_id} variant={variant} model={model_name} latency={latency_ms:.1f}ms")
Monitoring Quantization Quality Drift in Production
Quality drift is one of the hardest problems in quantized model production deployment. The quantized model's behavior may degrade gradually over time as the input distribution shifts, in ways that are hard to detect without explicit monitoring.
"""
Production quality monitoring for quantized LLMs.
Tracks output quality metrics that do not require human judgment.
"""
import time
import re
from typing import Optional
from dataclasses import dataclass, field
@dataclass
class QualityMetrics:
repetition_rate: float = 0.0 # Fraction of outputs with detected repetition
refusal_rate: float = 0.0 # Fraction of outputs with unexpected refusals
avg_output_length: float = 0.0 # Average output token count
coherence_score: float = 0.0 # Simple local coherence proxy
timestamp: float = field(default_factory=time.time)
class QuantizationQualityMonitor:
"""
Lightweight production quality monitor for quantized model outputs.
Does not require ground truth - uses heuristics that are reliable indicators
of quantization degradation.
"""
def __init__(self, window_size: int = 1000, alert_threshold: float = 0.05):
self.window_size = window_size
self.alert_threshold = alert_threshold
self.recent_outputs = []
self.baseline_metrics: Optional[QualityMetrics] = None
def record_output(self, output_text: str, prompt: str) -> dict:
"""
Analyze a single output and record quality signals.
Returns per-output quality signals.
"""
signals = {
"has_repetition": self._detect_repetition(output_text),
"has_refusal": self._detect_unexpected_refusal(output_text, prompt),
"output_length": len(output_text.split()),
"coherence_score": self._compute_coherence(output_text),
"timestamp": time.time(),
}
self.recent_outputs.append(signals)
# Keep only the most recent window_size outputs
if len(self.recent_outputs) > self.window_size:
self.recent_outputs.pop(0)
return signals
def get_current_metrics(self) -> QualityMetrics:
if not self.recent_outputs:
return QualityMetrics()
n = len(self.recent_outputs)
return QualityMetrics(
repetition_rate=sum(1 for o in self.recent_outputs if o["has_repetition"]) / n,
refusal_rate=sum(1 for o in self.recent_outputs if o["has_refusal"]) / n,
avg_output_length=sum(o["output_length"] for o in self.recent_outputs) / n,
coherence_score=sum(o["coherence_score"] for o in self.recent_outputs) / n,
)
def check_for_drift(self) -> list:
"""
Compare current metrics against baseline.
Returns list of alert messages (empty if no issues detected).
"""
if self.baseline_metrics is None:
# First call sets the baseline
self.baseline_metrics = self.get_current_metrics()
return []
current = self.get_current_metrics()
alerts = []
# Check repetition rate increase
rep_increase = current.repetition_rate - self.baseline_metrics.repetition_rate
if rep_increase > self.alert_threshold:
alerts.append(
f"ALERT: Repetition rate increased by {rep_increase:.3f} "
f"(baseline: {self.baseline_metrics.repetition_rate:.3f}, "
f"current: {current.repetition_rate:.3f})"
)
# Check unexpected refusal increase
refusal_increase = current.refusal_rate - self.baseline_metrics.refusal_rate
if refusal_increase > self.alert_threshold:
alerts.append(
f"ALERT: Refusal rate increased by {refusal_increase:.3f} "
f"(baseline: {self.baseline_metrics.refusal_rate:.3f}, "
f"current: {current.refusal_rate:.3f})"
)
# Check significant output length change (> 30% decrease suggests truncation issues)
length_ratio = current.avg_output_length / max(self.baseline_metrics.avg_output_length, 1)
if length_ratio < 0.70:
alerts.append(
f"ALERT: Average output length dropped by {(1-length_ratio)*100:.1f}% "
f"(baseline: {self.baseline_metrics.avg_output_length:.0f} tokens, "
f"current: {current.avg_output_length:.0f} tokens)"
)
return alerts
def _detect_repetition(self, text: str) -> bool:
"""
Detect n-gram repetition loops.
Quantization-induced repetition typically involves long repeated sequences.
"""
if len(text) < 100:
return False
# Check if any 4-gram appears more than 4 times
words = text.lower().split()
if len(words) < 8:
return False
ngrams = {}
n = 4
for i in range(len(words) - n + 1):
ngram = tuple(words[i:i+n])
ngrams[ngram] = ngrams.get(ngram, 0) + 1
max_count = max(ngrams.values()) if ngrams else 0
return max_count > 4 # 4-gram repeating more than 4 times is a loop
def _detect_unexpected_refusal(self, output_text: str, prompt: str) -> bool:
"""
Detect outputs that look like refusals for prompts that should not be refused.
This is a heuristic - refine the pattern list for your use case.
"""
refusal_patterns = [
r"i (cannot|can't|am unable to|won't) (help|assist|answer|respond|provide)",
r"i (don't|do not) (feel comfortable|think it's appropriate)",
r"this (request|question|topic) (is|seems) (inappropriate|harmful|dangerous)",
r"i must (decline|refuse)",
]
output_lower = output_text.lower()
for pattern in refusal_patterns:
if re.search(pattern, output_lower):
return True
return False
def _compute_coherence(self, text: str) -> float:
"""
Simple coherence proxy: ratio of recognizable words to total words.
A drop in this metric can indicate that the model is generating garbled output.
Very simple heuristic - for production use a proper language model scorer.
"""
words = text.split()
if not words:
return 1.0
# Count words that contain at least 50% alphabetic characters
coherent = sum(1 for w in words if sum(c.isalpha() for c in w) / max(len(w), 1) > 0.5)
return coherent / len(words)
# Integration example - instrument your serving layer
monitor = QuantizationQualityMonitor(window_size=500, alert_threshold=0.05)
def monitored_generate(prompt: str, user_id: str) -> str:
"""Wrapper around your serving call that adds quality monitoring."""
response_text = call_your_model_endpoint(prompt)
signals = monitor.record_output(response_text, prompt)
# Check for drift every 100 requests
if len(monitor.recent_outputs) % 100 == 0:
alerts = monitor.check_for_drift()
for alert in alerts:
# Send to your alerting system (PagerDuty, Slack, etc.)
send_alert(alert)
print(f"[QUALITY MONITOR] {alert}")
return response_text
def call_your_model_endpoint(prompt: str) -> str:
"""Placeholder - replace with your actual serving call."""
return "Model response here."
def send_alert(message: str):
"""Placeholder - replace with your alerting integration."""
print(f"[ALERT] {message}")
Rollback Strategy
A rollback strategy must be designed before deployment, not after an incident. The key questions to answer upfront:
- What is the rollback artifact? (The FP16 checkpoint, not the quantized checkpoint)
- How long does rollback take? (Time to load a ~14 GB FP16 checkpoint vs a ~4 GB INT4 checkpoint)
- What metric triggers rollback? (Define the threshold before deployment)
- Who can authorize rollback? (On-call engineer, or does it require manager approval?)
"""
Automated rollback system for quantized model deployments.
Monitors quality metrics and triggers rollback when thresholds are breached.
"""
import threading
import time
from enum import Enum
class DeploymentState(Enum):
FP16_ONLY = "fp16_only"
AB_TEST = "ab_test" # 10% on quantized
GRADUAL_ROLLOUT = "gradual" # 10-90% on quantized
FULL_QUANTIZED = "full_quant" # 100% on quantized
ROLLBACK = "rollback" # Problem detected, reverting to FP16
class AutomaticRollbackController:
"""
Monitors quality metrics and automatically triggers rollback
if pre-defined thresholds are breached.
"""
def __init__(
self,
monitor: QuantizationQualityMonitor,
rollback_fn,
# Thresholds for automatic rollback
max_repetition_rate: float = 0.15, # 15% of outputs have repetition loops
max_refusal_rate_increase: float = 0.10, # 10% absolute increase in refusals
min_output_length_ratio: float = 0.60, # Output length cannot drop below 60% of baseline
check_interval_seconds: float = 60.0,
):
self.monitor = monitor
self.rollback_fn = rollback_fn
self.max_repetition_rate = max_repetition_rate
self.max_refusal_rate_increase = max_refusal_rate_increase
self.min_output_length_ratio = min_output_length_ratio
self.check_interval = check_interval_seconds
self.state = DeploymentState.AB_TEST
self._stop_event = threading.Event()
self._monitor_thread = None
def start_monitoring(self):
self._monitor_thread = threading.Thread(target=self._monitoring_loop, daemon=True)
self._monitor_thread.start()
print(f"Rollback controller started. Checking every {self.check_interval}s.")
def stop_monitoring(self):
self._stop_event.set()
if self._monitor_thread:
self._monitor_thread.join()
def _monitoring_loop(self):
while not self._stop_event.wait(self.check_interval):
self._check_and_maybe_rollback()
def _check_and_maybe_rollback(self):
current = self.monitor.get_current_metrics()
baseline = self.monitor.baseline_metrics
if baseline is None or len(self.monitor.recent_outputs) < 50:
return # Not enough data to make a rollback decision
rollback_reasons = []
if current.repetition_rate > self.max_repetition_rate:
rollback_reasons.append(
f"Repetition rate {current.repetition_rate:.3f} > threshold {self.max_repetition_rate}"
)
refusal_increase = current.refusal_rate - baseline.refusal_rate
if refusal_increase > self.max_refusal_rate_increase:
rollback_reasons.append(
f"Refusal rate increase {refusal_increase:.3f} > threshold {self.max_refusal_rate_increase}"
)
length_ratio = current.avg_output_length / max(baseline.avg_output_length, 1)
if length_ratio < self.min_output_length_ratio:
rollback_reasons.append(
f"Output length ratio {length_ratio:.2f} < threshold {self.min_output_length_ratio}"
)
if rollback_reasons:
print(f"[ROLLBACK TRIGGERED] Reasons:")
for reason in rollback_reasons:
print(f" - {reason}")
self.state = DeploymentState.ROLLBACK
self.rollback_fn(reasons=rollback_reasons)
def record_gradual_rollout_progress(self, current_fraction: float):
self.state = (
DeploymentState.FULL_QUANTIZED
if current_fraction >= 1.0
else DeploymentState.GRADUAL_ROLLOUT
)
def execute_rollback(reasons: list):
"""
Replace this with your actual rollback implementation.
For vLLM: update the load balancer to stop routing to the quantized endpoint.
For Kubernetes: update the deployment to point to the FP16 image.
"""
print(f"[ROLLBACK] Executing rollback. Sending 100% traffic to FP16 endpoint.")
print(f"[ROLLBACK] Reasons: {reasons}")
# Example: update a Redis flag that the load balancer checks
# redis_client.set("active_model_variant", "fp16")
# Example: update Kubernetes deployment
# subprocess.run(["kubectl", "set", "image", "deployment/llm-serving",
# "server=fp16-image:latest"])
send_alert(f"ROLLBACK EXECUTED: {reasons}")
# Wire it all together
monitor = QuantizationQualityMonitor(window_size=500)
rollback_controller = AutomaticRollbackController(
monitor=monitor,
rollback_fn=execute_rollback,
max_repetition_rate=0.15,
max_refusal_rate_increase=0.10,
min_output_length_ratio=0.60,
check_interval_seconds=60.0,
)
rollback_controller.start_monitoring()
Multi-GPU Quantized Model Serving with Tensor Parallelism
For 70B models, even INT4 weights (~35 GB) exceed a single 24-48 GB GPU, and on an 80 GB card the headroom left for KV cache limits concurrency. You need tensor parallelism. vLLM handles this transparently when you set tensor_parallel_size:
from vllm import LLM, SamplingParams
# 70B GPTQ model on 2x A100 80GB
# INT4 70B = ~35 GB, so 2x A100 has comfortable headroom
llm = LLM(
model="TheBloke/Llama-2-70B-GPTQ",
quantization="gptq",
tensor_parallel_size=2, # Shard across 2 GPUs
gpu_memory_utilization=0.90,
max_model_len=4096,
# For 70B GPTQ, adjust these for your traffic pattern:
max_num_seqs=128, # Lower than 7B because the model is larger
max_num_batched_tokens=4096, # Balance latency vs throughput
enforce_eager=False,
)
# Tensor parallelism splits each weight matrix across GPUs.
# For a weight matrix W of shape [hidden, hidden]:
# GPU 0 gets W[:, :hidden//2] - the first half of output dimensions
# GPU 1 gets W[:, hidden//2:] - the second half of output dimensions
# After each matrix multiply, GPUs all-reduce to sum the partial results.
# NVLink bandwidth is critical - NVLINK-connected A100s achieve ~600 GB/s.
# PCIe-only connections (~32 GB/s) will severely bottleneck multi-GPU serving.
print(f"Model loaded. Testing throughput...")
sampling_params = SamplingParams(temperature=0.0, max_tokens=200)
test_prompts = ["What are the key principles of software engineering?"] * 10
outputs = llm.generate(test_prompts, sampling_params)
print(f"Generated {len(outputs)} responses successfully.")
#!/bin/bash
# vLLM server with tensor parallelism - production startup script
MODEL="TheBloke/Llama-2-70B-GPTQ"
TP_SIZE=2 # Set to number of GPUs
python -m vllm.entrypoints.openai.api_server \
--model $MODEL \
--quantization gptq \
--tensor-parallel-size $TP_SIZE \
--host 0.0.0.0 \
--port 8000 \
--max-model-len 4096 \
--gpu-memory-utilization 0.90 \
--max-num-seqs 128 \
--served-model-name llama-70b-int4 \
--trust-remote-code \
2>&1 | tee /var/log/vllm-server.log
Production Engineering Notes
GPU Memory Budget for Quantized Models
When planning your GPU memory allocation, account for three components:
- Model weights: (parameters x bits) / 8 bytes. For a 7B INT4 model: (7e9 x 4) / 8 = 3.5 GB
- KV cache: (n_layers x n_heads x head_dim x 2 x max_seq_len x batch_size x 2 bytes for FP16). For a 7B model with 4096 context and batch 32: roughly 8-12 GB depending on architecture
- Activations and framework overhead: typically 2-4 GB
The key insight: quantized models free up GPU memory primarily in the weight component. The KV cache cost is identical to FP16 (it stores attention keys and values at FP16 regardless of weight quantization). This is why quantization gives you more replicas (more concurrent conversations) but not necessarily more context length per conversation.
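A back-of-envelope calculator makes the three components concrete. The architecture numbers below are assumptions for Llama-2-7B (32 layers, 32 KV heads, head_dim 128) - read the real values from your model's config.json:
n_params = 7e9
weight_bits = 4  # INT4
n_layers, n_kv_heads, head_dim = 32, 32, 128  # Llama-2-7B (assumed)
weights_gb = n_params * weight_bits / 8 / 1e9
# K and V, per layer, per head, stored in FP16 (2 bytes)
kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * 2
# With PagedAttention, KV memory scales with *live* tokens, not max context.
# Assume 32 concurrent sequences averaging ~600 live tokens each:
live_tokens = 32 * 600
kv_gb = live_tokens * kv_bytes_per_token / 1e9
print(f"Weights: {weights_gb:.1f} GB")  # ~3.5 GB
print(f"KV cache: {kv_bytes_per_token / 1e6:.2f} MB/token, ~{kv_gb:.1f} GB live")
print(f"Total with 2-4 GB overhead: ~{weights_gb + kv_gb + 3:.0f} GB")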
Do Not Use NF4 for High-Throughput Serving
bitsandbytes NF4 is excellent for research and fine-tuning but it is not designed for high-throughput batch serving. The bitsandbytes CUDA kernels do not have the same level of batching optimization as the ExllamaV2 (GPTQ) or AWQ GEMM kernels. At batch size 1, the difference is small. At batch size 32+, GPTQ/AWQ will outperform NF4 by 2-3x in throughput.
If you are using NF4 for development and need to move to production serving, convert the model to GPTQ or AWQ first.
Warmup Before Serving Traffic
vLLM with CUDA graph capture (enforce_eager=False) requires a warmup pass after loading to capture the CUDA graphs. The first few requests after startup will be slower. Always include a warmup step in your deployment health check:
import requests
import time
def wait_for_server_ready(base_url: str, max_wait_seconds: int = 120):
"""Wait for vLLM server to finish loading and warmup."""
start = time.time()
while time.time() - start < max_wait_seconds:
try:
response = requests.get(f"{base_url}/health", timeout=5)
if response.status_code == 200:
# Server is up - send a warmup request
warmup_response = requests.post(
f"{base_url}/v1/completions",
json={"model": "your-model", "prompt": "Hello", "max_tokens": 10},
timeout=30,
)
if warmup_response.status_code == 200:
print(f"Server ready after {time.time()-start:.1f}s")
return True
except requests.exceptions.RequestException:
pass
time.sleep(2)
raise TimeoutError(f"Server did not become ready within {max_wait_seconds}s")
Common Mistakes
:::danger Loading GGUF with vLLM or GPTQ with llama.cpp
The format-to-serving-stack compatibility matrix is non-negotiable. GGUF is not natively supported by vLLM. GPTQ is not natively supported by llama.cpp. Trying to load the wrong format in the wrong stack gives you cryptic errors that look like model corruption but are actually format mismatch.
Always confirm format compatibility before starting the deployment process. If you need to switch serving stacks, budget time for format conversion.
:::
:::danger Setting gpu_memory_utilization Too High
vLLM uses gpu_memory_utilization to allocate memory for the KV cache. If you set it to 0.98 thinking "more is better," you leave no headroom for CUDA working memory, temporary buffers, and driver overhead. Under high concurrency, this causes out-of-memory errors.
The recommended value is 0.90 for quantized models. If you are running multiple vLLM instances on the same GPU (rare but possible for very small models), lower this further.
:::
:::warning Skipping the A/B Test Phase
It is tempting to skip directly from staging validation to full production rollout, especially when staging benchmarks look good. Do not do this.
Staging benchmarks cover the test distribution. Production traffic is always different. The only way to know whether your quantized model behaves correctly on real user traffic is to actually serve a fraction of that traffic and measure the quality signals in production.
Always run at least 24 hours of A/B testing at 10% traffic before increasing the rollout fraction.
:::
:::warning Ignoring KV Cache Costs in Memory Planning
A common mistake when estimating whether a quantized model will fit on a GPU: calculating only the weight memory and forgetting about the KV cache.
For a 7B INT4 model, the weights are ~3.5 GB. But a production deployment with batch size 32 and 4096 context length will use 8-12 GB for the KV cache. The total is 12-16 GB, not 3.5 GB.
Use vLLM's --max-model-len parameter to control KV cache memory usage when memory is tight.
:::
:::tip Use Paged KV Cache (vLLM default) for Quantized Models
One of vLLM's key contributions is PagedAttention - a paged memory management system for the KV cache that eliminates memory fragmentation. This matters even more for quantized models because you can fit more replicas per GPU, which increases the benefit of good KV cache management.
PagedAttention is on by default in vLLM. Do not disable it.
:::
Interview Q&A
Q1: Your team needs to deploy a 70B model on 4x A100 80GB GPUs. Compare the viable deployment options (FP16, INT8, INT4) in terms of memory, throughput, and quality.
A: Let me work through the numbers concretely.
FP16: 70B parameters x 2 bytes = 140 GB. With 4x A100 (320 GB total), you have 180 GB for KV cache and overhead. In tensor parallel mode (TP=4), each GPU holds 35 GB of weights and shares memory for KV cache. At standard context lengths (4096) and batch size 64, this is viable with comfortable headroom. But TP=4 means 4 all-reduce operations per forward pass, which adds communication overhead (roughly 20-25% throughput penalty versus TP=1).
INT8 (LLM.int8()): 70B parameters x 1 byte = 70 GB. Each GPU holds 17.5 GB of weights with TP=4, leaving roughly 62 GB per GPU for KV cache and overhead versus ~45 GB at FP16. The throughput improvement is moderate - INT8 matrix multiplies are not dramatically faster than FP16 on A100s because A100s have very fast BF16 tensor cores. The main benefit is fitting more concurrent conversations. Quality loss is minimal with LLM.int8() (typically < 0.1 PPL increase).
INT4 (GPTQ or AWQ): 70B parameters x 0.5 bytes = 35 GB. Each GPU holds 8.75 GB with TP=4. Alternatively, you could use TP=2 (17.5 GB weights per GPU) and have two serving replicas (TP=2 each) on the 4 GPUs, which doubles throughput compared to a single TP=4 deployment. INT4 throughput is significantly higher than FP16 because dequantization is cheap compared to the memory bandwidth savings. Quality loss is 0.3-0.8 PPL increase typically.
For a 70B production deployment on 4x A100, I would recommend INT4 with two TP=2 replicas. You get roughly double the throughput compared to FP16 TP=4, with acceptable quality loss that you measure and validate before deployment.
Q2: What is the difference between tensor parallelism and pipeline parallelism for quantized model serving, and when would you use each?
A: Tensor parallelism (TP) splits individual weight matrices across GPUs. For each transformer layer, every GPU holds a fraction of the weight matrix and computes its partial contribution to the output. After each layer, GPUs synchronize via all-reduce to combine the partial results. TP requires fast interconnects (NVLink) because of the synchronization overhead, and it minimizes latency because all GPUs participate in every layer simultaneously. vLLM uses tensor parallelism.
Pipeline parallelism (PP) assigns different layers to different GPUs. GPU 0 runs layers 0-10, GPU 1 runs layers 11-20, and so on. Each request flows sequentially through the pipeline. PP has lower communication volume (only activations are passed between stages, not synchronized gradients or partial results), but it introduces pipeline latency - the first GPU sits idle while subsequent GPUs process, and vice versa. PP is commonly used in training (Megatron-LM, DeepSpeed) but less common in inference serving because of the latency impact.
For quantized models specifically: TP is almost always preferred for inference serving because it reduces latency (critical for interactive applications) and vLLM's PagedAttention is designed around TP. PP becomes relevant when you need to serve very large models (400B+) where the activation sizes exchanged between TP ranks would saturate NVLink bandwidth.
Q3: Describe the complete rollback procedure when a quantized model deployed to 30% traffic shows a sudden spike in repetition loops.
A: This is an incident, not a standard rollback - the "sudden spike" part means something changed, not that the model was quietly degrading. Here is the procedure:
Immediate actions (first 5 minutes): Stop routing new traffic to the quantized model by setting the load balancer to 0% for the treatment variant. This stops the bleeding while you investigate. Do not wait to collect more data - a spike in repetition loops is a clear user-facing quality issue.
Investigation (next 15-30 minutes): Pull the monitoring data to establish the timeline. When did the spike start? Was there a deployment event (new model version, config change, serving restart) at that time? Check whether the spike correlates with any specific type of input (length, topic, user segment) or is uniform across all traffic.
Root cause candidates: (1) A serving restart that loaded a different model checkpoint than expected. (2) A batch size or concurrency change that triggered OOM conditions and caused partial model loading. (3) A change in input distribution (e.g., a new feature launched that sends different prompt formats).
Verification: Run the problematic inputs against the FP16 model. If FP16 handles them correctly, the issue is quantization-related. If FP16 also shows issues, the problem is upstream of quantization.
Resolution: If it is a quantization issue, keep traffic at 0% for the quantized model and investigate whether the quantized model checkpoint is corrupted or whether the serving config is wrong. Do not re-route traffic to the quantized model until you understand the root cause. If it is a serving config issue (wrong batch size, wrong tensor parallel config), fix the config and run a fresh A/B test from 5% traffic before scaling back up.
Q4: How does FP8 quantization on H100s differ from INT4 GPTQ, and when would you choose one over the other?
A: FP8 and INT4 are solving different problems with different tradeoffs.
INT4 GPTQ is a post-training quantization method that requires a calibration step and produces 4-bit integer weights. It uses specialized dequantization kernels (ExllamaV2) that reconstruct approximate FP16 values before the matrix multiply. The 4-bit format gives very compact model storage (4x smaller than FP16) and very fast memory-bandwidth-bound operations. Quality loss is typically 0.3-0.8 PPL points for 7-70B models.
FP8 on H100 uses the H100's native FP8 tensor core support. FP8 has higher precision than INT4 (8 bits instead of 4) but lower precision than BF16 (16 bits). The quality loss versus BF16 is typically under 0.1 PPL points - much less than INT4. The model is 2x smaller than BF16 (8 bits vs 16 bits). FP8 inference does not require post-training quantization - many frameworks can quantize weights to FP8 on the fly without a calibration step. On H100s, FP8 matrix multiplies are native hardware operations with no dequantization overhead.
Choose INT4 when: you need maximum memory compression (4x vs FP16), you are deploying on non-H100 hardware (A100, A10, consumer GPUs), or your quality budget allows the slightly higher degradation of INT4 versus FP8.
Choose FP8 when: you have H100s, you need quality close to BF16 with 2x memory reduction, you want to avoid the calibration complexity of GPTQ/AWQ, or you need maximum throughput (FP8 tensor core throughput on H100 is higher than INT4 with dequantization overhead).
For a new H100 deployment starting from scratch in 2024+, FP8 with vLLM is the cleaner choice. For deployments on older hardware or where 4x compression is required, INT4 GPTQ or AWQ remains the answer.
Q5: What quality metrics should you monitor in production for a quantized model, and how do you set the thresholds for automatic rollback?
A: The challenge with production quality monitoring for LLMs is that you almost never have ground truth labels - users do not tag whether a response was good or bad. So you rely on proxy metrics that are reliable indicators of quality degradation.
The most reliable proxy metrics are:
Repetition rate: count the fraction of outputs that contain n-gram repetition loops. For a healthy model, this should be under 2-3%. A quantized model that is degrading typically shows repetition rates of 10%+. The rollback threshold depends on your use case - for a customer-facing product, I would rollback at 8-10% repetition rate.
Refusal rate delta: track the absolute change in refusal rate compared to baseline. Quantization noise can cause the model to misclassify benign inputs as harmful. A 5-10 percentage point increase in refusal rate is a signal worth investigating and potentially rolling back.
Output length distribution: the mean and distribution of output token counts should be stable over time. A significant drop in mean output length (>20-30% below baseline) often indicates the model is truncating or producing degenerate outputs.
User-facing signals if available: thumbs up/down ratings, task completion rates, session continuation rates. These are the gold standard but require instrumented UIs.
For threshold setting: run the A/B test for at least 24 hours before setting thresholds. Use the FP16 baseline during A/B test to measure the "normal" values for all metrics. Set rollback thresholds at 2-3x the normal day-to-day variance you observe during the A/B test, rather than at fixed absolute values. This accounts for natural variability in input distribution.
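A minimal sketch of that rule: sample each metric hourly during the A/B window, then set the alert line at the baseline mean plus a few standard deviations (the k=3.0 multiplier below is an assumption to tune per metric):
import statistics

def rollback_threshold(hourly_samples: list[float], k: float = 3.0) -> float:
    """Alert threshold = baseline mean + k * observed hour-to-hour variation."""
    return statistics.mean(hourly_samples) + k * statistics.stdev(hourly_samples)

# 24 hourly repetition-rate samples from the A/B window (illustrative numbers)
baseline = [0.021, 0.019, 0.025, 0.022, 0.018, 0.024] * 4
print(f"Roll back when repetition rate exceeds {rollback_threshold(baseline):.3f}")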
Q6: A production GPTQ model is performing well on average but shows quality degradation specifically for long context inputs (> 2000 tokens). What is the likely root cause and how do you address it?
A: Long context degradation with quantized models is a specific failure pattern that traces to attention layer sensitivity. Here is the mechanism:
When generating tokens at position 2000+, the attention mechanism needs to attend to all previous 2000 tokens. The attention weight computation involves the query-key dot product across all 2000 positions, which means the attention pattern is more sensitive to small errors in the key and value representations than it would be at position 100.
In a quantized model, the key and value projections have quantization error. At short sequence lengths, these errors are small and do not significantly affect which tokens are attended to. At long sequences, the errors accumulate - a small per-token error in the key representation, multiplied across 2000 tokens of attention computation, leads to attention patterns that are meaningfully different from the FP16 baseline.
The fix has two approaches. The first is to keep the K and V projection layers (key and value projections in the attention blocks) in higher precision. Run per-layer sensitivity analysis specifically on long-context inputs and you will typically find that K/V projections are among the most sensitive layers for this failure mode. Keeping them in FP16 while everything else stays INT4 adds minimal memory overhead (the K/V projections are a small fraction of total parameters) but significantly improves long-context quality.
The second approach is to use a quantization format with better support for long contexts. Some GGUF k-quant variants (like Q5_K_M) use a 5-bit quantization that reduces the per-token key/value error, which helps at long context. For a vLLM deployment, AWQ with smaller group size (64 instead of 128) also helps because it gives the quantization more flexibility to match the activation distributions in the attention layers.
Regardless of which fix you apply, add long-context inputs explicitly to your benchmark suite before and after the fix to verify the improvement and prevent regression.
