Quantization for Vision Models
The Robotics Team's Dilemma
The autonomous inspection drone was impressive on paper. Equipped with a LLaVA-based vision-language model, it could analyze infrastructure photos and produce detailed natural language reports describing detected damage, estimated severity, and recommended repair priorities. In the lab, the FP16 model running on a laptop-grade GPU produced excellent reports. The problem was the deployment target: an NVIDIA Jetson Orin NX embedded module with 16 GB of unified memory, shared between CPU and GPU. The FP16 LLaVA-1.5 13B model required 26 GB of VRAM. It could not run at all.
The team's first instinct was to quantize the entire model to 4 bits using BitsAndBytes NF4. Memory dropped to 7.8 GB. The model ran. But the image analysis reports immediately degraded in a way that was hard to quantify but obvious to read. The model was still producing grammatically correct, confident-sounding reports. It was describing damage categories that were not visible in the photos. It was systematically misidentifying corrosion as "shadow artifacts" and missing hairline cracks. The structured language outputs from the LLM backbone were fluent. The visual grounding was broken.
The engineering lead pulled the model apart and started profiling. What he found was that the vision encoder - a CLIP ViT-L/14 in this case - was suffering disproportionately from aggressive quantization. The visual token representations that the encoder produced were degraded in ways that the LLM backbone could not compensate for. The LLM was receiving corrupted visual features and confabulating plausible-sounding descriptions based on incomplete information. The natural language output was fluent because the LLM backbone quantization was manageable. The visual understanding was broken because the vision encoder quantization was not.
The solution was asymmetric quantization: keep the vision encoder in FP16, quantize only the LLM backbone to 4 bits. This gave them 11.4 GB total memory - within budget for the Jetson Orin NX. And crucially, the image analysis quality was nearly indistinguishable from the FP16 baseline. The visual features entering the LLM were preserved, the LLM could reason correctly about them, and the reports were accurate.
This experience generalizes across the entire space of vision and vision-language model quantization. Vision models differ from pure language models in ways that make naive quantization more dangerous. Batch normalization layers are exquisitely sensitive to quantization. Convolutional layers have different outlier patterns than transformer attention layers. Skip connections in ResNets concentrate quantization error at addition points. And in multimodal models, the visual encoder and the language decoder have fundamentally different robustness profiles that demand a differentiated strategy.
This lesson gives you the full picture: how to quantize CNNs, how ViT quantization differs, why VLMs need special treatment, and how to implement the asymmetric strategy that preserves visual fidelity while achieving aggressive memory compression on the language backbone.
Why This Exists
The Problem: Vision Models Are Not Just Smaller LLMs
When quantization tooling matured for LLMs in 2022-2024, many practitioners assumed they could apply the same approaches directly to vision models and vision-language models. This assumption fails in several important ways.
Pure language models - GPT, LLaMA, Mistral - consist almost entirely of transformer blocks with attention layers and feed-forward networks. All parameters are in linear projection matrices. The quantization techniques developed for LLMs (GPTQ, AWQ, NF4) are optimized specifically for linear layers with the weight distributions typical of transformer weight matrices.
Vision models bring new architectural components that each have their own sensitivity profile:
Batch Normalization (BatchNorm): CNNs use BatchNorm layers that normalize activations using running mean and variance statistics. These statistics are computed from the training data distribution. When you quantize the weights of a BatchNorm layer, you distort the normalization computation. Small quantization errors in the BatchNorm scale/shift parameters ($\gamma$, $\beta$) can cause systematic activation distribution shifts downstream - a problem that compounds across many layers.
Convolutional layers: Conv2d weight tensors have a different shape than Linear weight tensors. They are 4D: (out_channels, in_channels, kernel_h, kernel_w). The weight distribution within a convolutional kernel can be very different from attention projection matrices. Outlier patterns in vision model weights tend to be structured along the channel dimension rather than the token/embedding dimension.
Skip connections: ResNet-style skip connections add residuals from earlier layers to later layers. Each addition point mixes features from paths that have accumulated different amounts of quantization error. In extreme cases, quantization error from one path can completely dominate the signal from the other path after summation.
Attention in ViTs: Vision Transformer attention layers are similar to LLM attention, but they operate on image patch tokens rather than text tokens. The patch token representations - especially in early and middle layers - contain spatial structure (neighboring patches have correlated activations) that creates different outlier patterns than text token representations.
What a Principled Approach Solves
A principled approach to vision model quantization identifies which components are sensitive and which are robust, then applies different precision to different components. For CNNs: static INT8 calibration with per-channel quantization of Conv2d layers and special handling of BatchNorm. For ViTs: attention layers get higher precision or finer-grained quantization; FFN layers tolerate more aggressive quantization. For VLMs: asymmetric precision where the vision encoder is preserved and the LLM backbone is compressed.
Historical Context
Quantization for vision models preceded LLM quantization by several years. The computer vision community was compressing ResNets and MobileNets to INT8 for edge deployment from around 2017-2018. Google's work on quantization-aware training for MobileNetV2 (Sandler et al., 2018) established that 8-bit quantization of CNNs was highly practical with the right methodology. NVIDIA's TensorRT had INT8 calibration for CNNs available by TensorRT 3.0 (2017).
The "aha moment" for CNN quantization came from the observation that post-training quantization to INT8 was feasible if you had a representative calibration dataset. Jacob et al. (Google Brain, 2018) showed that symmetric per-channel quantization of weights combined with per-tensor quantization of activations gave near-lossless results on ImageNet classification. Their framework became the foundation for TensorRT's INT8 calibration workflow.
Vision Transformers introduced new challenges. When Dosovitskiy et al. released ViT in 2020, practitioners discovered that naive INT8 quantization caused larger accuracy drops than with CNNs. Yuan et al. (2022) investigated this with PTQ4ViT, finding that ViT softmax outputs and GELU activations have twin-peaked distributions (two distinct ranges of values) that violate the assumptions of standard uniform quantization. They introduced twin uniform quantization with Hessian-guided parameter selection to address this.
For vision-language models, the quantization question became prominent in late 2023 when LLaVA (Liu et al., 2023) and similar models achieved strong performance. The community quickly found that the asymmetric approach - FP16 vision encoder, quantized LLM backbone - was the practical sweet spot. This was implemented in the BitsAndBytes library and later natively supported in Hugging Face Transformers through the quantization_config combined with explicit module skipping.
Core Concepts
CNN Quantization: The BatchNorm Problem
In a standard ResNet or EfficientNet, each convolutional block follows the pattern:

$$y = \mathrm{ReLU}(\mathrm{BN}(\mathrm{Conv}(x)))$$

where BatchNorm is:

$$\mathrm{BN}(x) = \gamma \cdot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

The $\gamma$ (scale) and $\beta$ (bias) parameters are learned. The running statistics $\mu$ and $\sigma^2$ are estimated from training data. When you quantize $\gamma$ and $\beta$ to INT8, you introduce quantization errors $\Delta\gamma$ and $\Delta\beta$. Because these parameters directly scale and shift every activation in a channel, even small errors propagate:

$$\widehat{\mathrm{BN}}(x) = (\gamma + \Delta\gamma) \cdot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + (\beta + \Delta\beta)$$

This shifts the output distribution, which then flows into the next layer's input distribution. If that next layer is also quantized, its quantization parameters (computed during calibration on the unshifted distribution) are now wrong.

The standard solution is BatchNorm folding before quantization: merge the BatchNorm parameters into the preceding convolutional layer's weights and bias:

$$W' = \frac{\gamma}{\sqrt{\sigma^2 + \epsilon}} \cdot W, \qquad b' = \frac{\gamma}{\sqrt{\sigma^2 + \epsilon}} \cdot (b - \mu) + \beta$$

After folding, the BatchNorm layer disappears entirely. The convolutional weight now incorporates the normalization. This eliminates the BatchNorm quantization sensitivity problem and is standard practice in TensorRT and ONNX Runtime deployment pipelines.
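To make the folding arithmetic concrete, here is a minimal sketch of Conv+BN folding in PyTorch. The helper name `fold_batchnorm` is ours, not a library API; it assumes a `Conv2d` immediately followed by a `BatchNorm2d` running in eval mode.

```python
import torch
import torch.nn as nn

def fold_batchnorm(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Return a new Conv2d whose weight/bias absorb the BatchNorm parameters."""
    fused = nn.Conv2d(
        conv.in_channels, conv.out_channels, conv.kernel_size,
        stride=conv.stride, padding=conv.padding,
        dilation=conv.dilation, groups=conv.groups, bias=True,
    )
    # Per-output-channel scale: gamma / sqrt(running_var + eps)
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
    fused.weight.data = conv.weight.data * scale.reshape(-1, 1, 1, 1)
    conv_bias = conv.bias.data if conv.bias is not None else torch.zeros_like(bn.running_mean)
    # b' = (b - mu) * scale + beta
    fused.bias.data = (conv_bias - bn.running_mean) * scale + bn.bias
    return fused

# Sanity check: the folded conv matches Conv -> BN in eval mode
conv, bn = nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8)
bn.eval()
x = torch.randn(1, 3, 16, 16)
assert torch.allclose(fold_batchnorm(conv, bn)(x), bn(conv(x)), atol=1e-4)
```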
Per-Channel vs Per-Tensor Quantization
For CNN weights, the choice between per-tensor and per-channel quantization has a large accuracy impact.
Per-tensor quantization uses a single scale factor for the entire weight tensor. The quantization function is:

$$s = \frac{\max |W|}{127}, \qquad W_q = \mathrm{round}\!\left(\frac{W}{s}\right)$$

Per-channel quantization uses a separate scale factor $s_c$ for each output channel $c$:

$$s_c = \frac{\max |W_c|}{127}, \qquad W_{q,c} = \mathrm{round}\!\left(\frac{W_c}{s_c}\right)$$
In convolutional networks, the weight magnitude can vary dramatically across output channels - sometimes by factors of 10-100x. Per-tensor quantization forces all channels to use the scale of the largest channel, which means small-magnitude channels lose precision entirely. Per-channel quantization allocates the full INT8 range to each channel independently, typically recovering 0.5-2% ImageNet top-1 accuracy compared to per-tensor.
For ViTs and LLMs, per-channel quantization corresponds to the standard approach of quantizing along the output dimension of each linear layer.
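A minimal sketch of the difference, assuming symmetric INT8 quantization and illustrative helper names:

```python
import torch

def quantize_per_tensor(w: torch.Tensor):
    """One scale for the whole tensor: dominated by the largest channel."""
    scale = w.abs().max() / 127.0
    return torch.clamp(torch.round(w / scale), -127, 127), scale

def quantize_per_channel(w: torch.Tensor):
    """One scale per output channel (dim 0 for Conv2d/Linear weights)."""
    flat = w.reshape(w.shape[0], -1)
    scale = flat.abs().amax(dim=1) / 127.0  # shape: (out_channels,)
    shape = (-1,) + (1,) * (w.dim() - 1)
    q = torch.clamp(torch.round(w / scale.reshape(shape)), -127, 127)
    return q, scale

# Channels with very different magnitudes: per-tensor crushes the small one
w = torch.randn(2, 4, 3, 3)
w[0] *= 100.0  # simulate a large-magnitude output channel
for fn in (quantize_per_tensor, quantize_per_channel):
    q, s = fn(w)
    deq = q * (s.reshape(-1, 1, 1, 1) if s.dim() else s)
    print(f"{fn.__name__}: mean abs error = {(deq - w).abs().mean().item():.4f}")
```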
ViT Quantization: Attention Sensitivity
Vision Transformer attention outputs have a distinct statistical property: after the softmax, the attention distribution is often very sharp (one or two patches receive most of the attention weight). This means the post-softmax values live in two distinct regimes: values near 0 (almost all patches) and values near 1 (the attended patches). Standard uniform INT8 quantization wastes most of its range on the near-zero region.
The GELU activation in ViT FFN layers has a similar issue: it is smooth and near-zero for large negative inputs, but the exact shape of the near-zero region matters for gradient flow and downstream computation.
The practical solutions:

- Log-scale quantization for attention weights: quantize the softmax outputs on a log scale rather than a linear scale. This gives more precision to the small-value region; a minimal sketch follows this list.
- Higher-bit quantization for attention layers: keep attention layers at INT8 while quantizing FFN layers to INT4. Attention layers are more sensitive.
- Activation-aware scaling (AWQ-style): multiply salient channels by a scale factor before quantization to reduce their quantization error, then divide by the same factor after. The same approach that works for LLM quantization applies to ViTs.
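Here is a minimal sketch of the first option, log2 quantization of post-softmax attention maps, in the spirit of log-domain quantizers such as FQ-ViT. The `bits` setting and clamping floor are illustrative choices:

```python
import torch

def log2_quantize(p: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Fake-quantize values in (0, 1] to 2**bits log-spaced levels.
    Small attention weights keep relative precision instead of
    collapsing to zero as they do under uniform quantization."""
    levels = 2 ** bits - 1
    p = p.clamp(min=2.0 ** (-levels))           # floor at the smallest level
    exponent = torch.round(-torch.log2(p))      # integer exponent in [0, levels]
    return 2.0 ** (-exponent.clamp(0, levels))  # dequantized value

# Sharp attention maps: compare uniform 4-bit vs log2 4-bit error
attn = torch.softmax(torch.randn(1, 8, 16, 16) * 4, dim=-1)
uniform = torch.round(attn * 15) / 15
print("uniform err:", (uniform - attn).abs().mean().item())
print("log2 err:   ", (log2_quantize(attn) - attn).abs().mean().item())
```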
The quantization sensitivity ordering in a ViT is approximately:
- Most sensitive: attention projection layers, especially Q/K projections (determine what to attend to)
- Moderate: V projections and output projections
- Least sensitive: FFN layers (especially second FC layer)
VLM Architecture and the Asymmetric Strategy
A vision-language model like LLaVA or LLaMA 3.2 Vision has three main components:
- Vision encoder: typically a CLIP ViT-L/14 or SigLIP ViT. Takes an image as input, outputs patch embeddings.
- Projection layer (MLP connector): maps vision encoder output dimensions to LLM input dimensions.
- LLM backbone: LLaMA, Mistral, or similar. Takes a sequence of (text tokens + visual tokens) and generates text.
The asymmetric quantization strategy recognizes that these components have different sensitivity profiles and different size contributions:
| Component | Typical Parameters | Sensitivity | Recommended Precision |
|---|---|---|---|
| CLIP ViT-L/14 | 307M | High | FP16 |
| MLP connector | 10-50M | Medium | FP16 |
| LLaMA 7B backbone | 7B | Low-Medium | 4-bit |
The LLM backbone dominates memory consumption (95%+ of total parameters in most VLMs) and is well-studied for quantization. The vision encoder is 2-5% of total parameters but disproportionately sensitive because its outputs directly determine what visual information the LLM can access. A corrupted visual representation is unfixable - no amount of good reasoning by the LLM backbone can recover information that was destroyed in the encoder.
The memory math for LLaMA 3.2 Vision 11B with asymmetric quantization:
- Vision encoder (ViT-L/14): 307M params * 2 bytes/param (FP16) = 614 MB
- LLM backbone (11B): 11B params * 0.5 bytes/param (NF4) = 5.5 GB
- Total: ~6.1 GB vs ~22 GB for full FP16
That is a 3.6x compression ratio while preserving the most sensitive component at full precision.
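The same arithmetic generalizes to any encoder/backbone pairing. A small helper, assuming roughly 2 bytes/param for FP16 and 0.5 bytes/param for NF4, and ignoring quantization constants, activations, and the KV cache:

```python
def vlm_weight_memory_gb(vision_params_m: float, llm_params_b: float,
                         vision_bytes: float = 2.0, llm_bytes: float = 0.5) -> float:
    """Weight memory for asymmetric quantization: FP16 vision + NF4 LLM."""
    vision = vision_params_m * 1e6 * vision_bytes   # vision params in millions
    llm = llm_params_b * 1e9 * llm_bytes            # LLM params in billions
    return (vision + llm) / 1e9

# CLIP ViT-L/14 (307M) + 11B backbone -> ~6.1 GB, matching the math above
print(f"{vlm_weight_memory_gb(307, 11):.1f} GB")
```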
[Diagrams referenced in this lesson: VLM Component Architecture and Precision Strategy; CNN Quantization Pipeline with BatchNorm Folding; ViT Layer Sensitivity Map]
Code Examples
INT8 CNN Quantization with PyTorch Static Quantization and Calibration
import torch
import torchvision.models as models
from torchvision.models import quantization as qmodels
from torch.quantization import get_default_qconfig, prepare, convert
from torch.utils.data import DataLoader
from torchvision import transforms, datasets
from typing import Optional
def prepare_cnn_for_quantization(model_name: str = "resnet50") -> torch.nn.Module:
    """
    Load a quantization-ready CNN and fuse BatchNorm into the Conv layers.
    This is the critical first step: fusing Conv+BN eliminates BatchNorm
    sensitivity by absorbing its parameters into the Conv weights.
    We use torchvision's quantizable model variants because they insert the
    QuantStub/DeQuantStub pair that static quantization requires, and their
    fuse_model() method handles the Conv-BN-ReLU patterns correctly. (The
    vanilla ResNet blocks reuse a single `relu` module, which breaks naive
    fuse_modules() calls by replacing the shared ReLU with an Identity.)
    """
    model = getattr(qmodels, model_name)(pretrained=True, quantize=False)
    model.eval()
    model.fuse_model()  # fuses conv+bn(+relu) in place throughout the network
    return model
def calibrate_model(
model: torch.nn.Module,
calibration_loader: DataLoader,
num_batches: int = 100,
device: str = "cpu", # static quant calibration runs on CPU
) -> None:
"""
Run calibration data through the model to collect activation statistics.
This determines the quantization scale factors for activations.
More calibration data = better scale estimates.
100-200 batches of 32 images each is typically sufficient.
"""
model.eval()
model.to(device)
with torch.no_grad():
for batch_idx, (images, _) in enumerate(calibration_loader):
if batch_idx >= num_batches:
break
images = images.to(device)
model(images)
if (batch_idx + 1) % 20 == 0:
print(f"Calibration: {batch_idx + 1}/{num_batches} batches")
def quantize_resnet_int8(
model_name: str = "resnet50",
calibration_data_path: str = "/data/imagenet/train",
num_calibration_batches: int = 100,
backend: str = "fbgemm", # 'fbgemm' for x86, 'qnnpack' for ARM
) -> torch.nn.Module:
"""
Full pipeline: load pretrained CNN, fuse BatchNorm, calibrate, quantize.
"""
    # Step 1: Load a quantizable model and fuse BatchNorm into Conv layers
    print(f"Loading pretrained {model_name} and fusing BatchNorm...")
    model_fused = prepare_cnn_for_quantization(model_name)
# Step 2: Set quantization configuration
# per-channel for weights (better accuracy), per-tensor for activations (standard)
torch.backends.quantized.engine = backend
qconfig = get_default_qconfig(backend)
model_fused.qconfig = qconfig
# Step 3: Insert quantization observers
model_prepared = prepare(model_fused, inplace=False)
# Step 4: Calibration
print("Setting up calibration data...")
transform = transforms.Compose([
transforms.Resize(256),
transforms.CenterCrop(224),
transforms.ToTensor(),
transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
calibration_dataset = datasets.ImageFolder(calibration_data_path, transform=transform)
calibration_loader = DataLoader(
calibration_dataset,
batch_size=32,
shuffle=True,
num_workers=4,
)
print("Running calibration...")
calibrate_model(model_prepared, calibration_loader, num_calibration_batches)
# Step 5: Convert to quantized model
print("Converting to INT8...")
model_quantized = convert(model_prepared, inplace=False)
return model_quantized
def evaluate_imagenet_accuracy(
model: torch.nn.Module,
val_data_path: str,
device: str = "cpu",
max_batches: Optional[int] = None,
) -> float:
"""Evaluate top-1 accuracy on ImageNet validation set."""
transform = transforms.Compose([
transforms.Resize(256),
transforms.CenterCrop(224),
transforms.ToTensor(),
transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
val_dataset = datasets.ImageFolder(val_data_path, transform=transform)
val_loader = DataLoader(val_dataset, batch_size=256, num_workers=4)
model.eval()
model.to(device)
correct = 0
total = 0
with torch.no_grad():
for batch_idx, (images, labels) in enumerate(val_loader):
if max_batches and batch_idx >= max_batches:
break
images, labels = images.to(device), labels.to(device)
outputs = model(images)
_, predicted = outputs.max(1)
correct += predicted.eq(labels).sum().item()
total += labels.size(0)
return (correct / total) * 100
# Compare FP32 vs INT8
if __name__ == "__main__":
VAL_PATH = "/data/imagenet/val"
TRAIN_PATH = "/data/imagenet/train"
# FP32 baseline
fp32_model = models.resnet50(pretrained=True)
fp32_acc = evaluate_imagenet_accuracy(fp32_model, VAL_PATH)
print(f"FP32 ResNet50 Top-1: {fp32_acc:.2f}%")
# INT8 quantized
int8_model = quantize_resnet_int8(
model_name="resnet50",
calibration_data_path=TRAIN_PATH,
num_calibration_batches=100,
)
int8_acc = evaluate_imagenet_accuracy(int8_model, VAL_PATH)
print(f"INT8 ResNet50 Top-1: {int8_acc:.2f}%")
print(f"Accuracy drop: {fp32_acc - int8_acc:.3f}%")
ViT Quantization with Selective Layer Precision
import torch
import torch.nn as nn
from transformers import ViTForImageClassification, BitsAndBytesConfig
import re
def identify_sensitive_vit_layers(model: nn.Module) -> list:
"""
Identify attention Q/K projection layers in a ViT model.
These are the most sensitive to quantization and should be
kept in higher precision.
"""
sensitive_patterns = [
r".*attention\.query.*",
r".*attention\.key.*",
r".*layernorm.*",
r".*layer_norm.*",
]
sensitive_modules = []
for name, module in model.named_modules():
for pattern in sensitive_patterns:
if re.match(pattern, name, re.IGNORECASE):
sensitive_modules.append(name)
break
return sensitive_modules
def load_vit_with_selective_quantization(
model_name: str = "google/vit-large-patch14-224",
quantize_attention: bool = False, # Whether to quantize attention layers
) -> nn.Module:
"""
Load a ViT model with selective layer precision.
Keeps attention Q/K projections and LayerNorm in FP16
while quantizing FFN layers to INT8.
Note: for full INT8 deployment, use torch.quantization or TensorRT.
This example demonstrates the selective precision concept with
PyTorch's native quantization hooks.
"""
if not quantize_attention:
# Most conservative: only quantize FFN layers
# Attention layers stay in FP16
print(f"Loading {model_name} with FFN-only quantization...")
# Load in FP16, then selectively apply quantization
model = ViTForImageClassification.from_pretrained(
model_name,
torch_dtype=torch.float16,
)
model.eval()
# Apply dynamic INT8 quantization only to FFN linear layers
# (those NOT in attention modules)
def is_attention_layer(name: str) -> bool:
attention_keywords = ["query", "key", "value", "attention"]
return any(kw in name.lower() for kw in attention_keywords)
for name, module in model.named_modules():
if isinstance(module, nn.Linear) and not is_attention_layer(name):
# We would quantize here in a full implementation
# For demonstration, mark the layer
module._quantize_target = True
return model
    else:
        # More aggressive: quantize all linear layers with bitsandbytes INT8
        bnb_config = BitsAndBytesConfig(load_in_8bit=True)
model = ViTForImageClassification.from_pretrained(
model_name,
quantization_config=bnb_config,
device_map="auto",
)
return model
def apply_ptq4vit_style_quantization(
model: nn.Module,
calibration_images: torch.Tensor,
bits: int = 8,
) -> nn.Module:
"""
Apply PTQ4ViT-inspired quantization that handles twin-peaked
distributions in softmax and GELU activations.
The key insight from PTQ4ViT: standard uniform quantization
wastes representational capacity on the near-zero region of
softmax outputs. Use separate scaling for each peak.
This is a simplified demonstration of the concept.
For production, use the full PTQ4ViT implementation.
"""
    model.eval()
    # HF ViT applies softmax functionally inside its attention modules,
    # so there is no nn.Softmax module to hook. Instead, ask the model to
    # return the post-softmax attention maps directly.
    with torch.no_grad():
        outputs = model(calibration_images, output_attentions=True)
    # outputs.attentions: one (batch, heads, tokens, tokens) tensor per layer
    for layer_idx, attn in enumerate(outputs.attentions):
        acts = attn.detach().float().cpu()
        # Check for bimodality: probability mass near 0 vs near 1
        hist = torch.histc(acts, bins=50, min=0, max=1)
        near_zero_mass = hist[:5].sum() / hist.sum()
        near_one_mass = hist[-5:].sum() / hist.sum()
        print(f"layer {layer_idx}: near-zero mass={near_zero_mass:.3f}, "
              f"near-one mass={near_one_mass:.3f}")
        if near_zero_mass > 0.7 and near_one_mass > 0.05:
            print("  -> Twin-peaked distribution detected. "
                  "Consider log-scale or asymmetric quantization.")
    return model
VLM Quantization - LLaVA / LLaMA 3.2 Vision with Asymmetric Precision
import torch
from transformers import (
LlavaNextProcessor,
LlavaNextForConditionalGeneration,
MllamaForConditionalGeneration,
AutoProcessor,
BitsAndBytesConfig,
)
from PIL import Image
import requests
from io import BytesIO
from typing import Union, List
def load_llava_asymmetric_quantization(
model_id: str = "llava-hf/llava-v1.6-mistral-7b-hf",
) -> tuple:
"""
Load LLaVA with asymmetric quantization:
- Vision encoder (CLIP ViT): kept in FP16
- LLM backbone (Mistral 7B): quantized to NF4
This is the recommended production configuration for LLaVA
on memory-constrained hardware.
"""
print(f"Loading processor from {model_id}...")
processor = LlavaNextProcessor.from_pretrained(model_id)
    # Configure NF4 quantization for the LLM backbone only.
    # llm_int8_skip_modules keeps the listed modules unquantized; the names
    # below match LlavaNextForConditionalGeneration - verify them with
    # model.named_modules() if you use a different checkpoint.
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_use_double_quant=True,  # nested quantization for extra compression
        llm_int8_skip_modules=["vision_tower", "multi_modal_projector"],
    )
print("Loading model with asymmetric quantization...")
print(" - Vision encoder: FP16 (preserved)")
print(" - LLM backbone: NF4 (4-bit compressed)")
model = LlavaNextForConditionalGeneration.from_pretrained(
model_id,
quantization_config=bnb_config,
device_map="auto",
torch_dtype=torch.float16,
)
    # Verify which modules were quantized
    import bitsandbytes as bnb

    total_params = 0
    quantized_params = 0
    fp16_params = 0
    for name, module in model.named_modules():
        if isinstance(module, bnb.nn.Linear4bit):
            # NF4 weights are stored packed, so numel() undercounts the
            # logical parameter count; fine for a rough ratio estimate
            param_count = sum(p.numel() for p in module.parameters())
            quantized_params += param_count
            total_params += param_count
        elif isinstance(module, torch.nn.Linear):
            param_count = sum(p.numel() for p in module.parameters())
            fp16_params += param_count
            total_params += param_count
print(f"\nQuantization summary:")
print(f" Quantized (NF4): {quantized_params / 1e9:.2f}B params")
print(f" FP16 (vision): {fp16_params / 1e9:.2f}B params")
print(f" Ratio quantized: {quantized_params / total_params * 100:.1f}%")
return model, processor
def load_llama32_vision_quantized(
model_id: str = "meta-llama/Llama-3.2-11B-Vision-Instruct",
) -> tuple:
"""
Load LLaMA 3.2 Vision with the recommended asymmetric quantization.
LLaMA 3.2 Vision uses a cross-attention architecture:
- Image encoder: separate ViT, outputs to cross-attention layers
- Language model: LLaMA transformer with cross-attention injected
    For quantization: keep the vision tower and connector in full
    precision and quantize the language model to NF4.
"""
processor = AutoProcessor.from_pretrained(model_id)
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,
        # Keep the vision tower and connector unquantized; the module names
        # follow MllamaForConditionalGeneration - verify for your checkpoint
        llm_int8_skip_modules=["vision_model", "multi_modal_projector"],
    )
model = MllamaForConditionalGeneration.from_pretrained(
model_id,
quantization_config=bnb_config,
device_map="auto",
torch_dtype=torch.bfloat16,
)
print(f"Model loaded. Memory: {get_model_memory_gb(model):.2f} GB")
return model, processor
def get_model_memory_gb(model: torch.nn.Module) -> float:
"""Calculate approximate model memory footprint in GB."""
total_bytes = 0
for param in model.parameters():
# For quantized models, bitsandbytes stores in compressed format
total_bytes += param.numel() * param.element_size()
return total_bytes / (1024 ** 3)
def run_visual_inference(
model,
processor,
image: Union[str, Image.Image],
prompt: str,
max_new_tokens: int = 256,
) -> str:
"""
Run inference on a VLM with an image and text prompt.
Works with both LLaVA and LLaMA 3.2 Vision models.
"""
# Load image if URL/path provided
if isinstance(image, str):
if image.startswith("http"):
response = requests.get(image)
image = Image.open(BytesIO(response.content)).convert("RGB")
else:
image = Image.open(image).convert("RGB")
# Format the prompt
conversation = [
{
"role": "user",
"content": [
{"type": "image"},
{"type": "text", "text": prompt},
],
},
]
text_prompt = processor.apply_chat_template(
conversation, add_generation_prompt=True
)
inputs = processor(
images=image,
text=text_prompt,
return_tensors="pt",
).to(model.device)
    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,  # greedy decoding for reproducible comparisons
        )
# Decode only the generated tokens (not the prompt)
generated_ids = output[0][inputs.input_ids.shape[1]:]
response = processor.decode(generated_ids, skip_special_tokens=True)
return response.strip()
def benchmark_vlm_image_quality(
model_fp16,
model_quantized,
processor,
test_images: List[Image.Image],
questions: List[str],
reference_answers: List[str],
) -> dict:
"""
Compare FP16 vs quantized VLM on visual understanding quality.
For each test case: run both models, compare responses to
reference answers using simple string matching.
In production, use an LLM judge for semantic similarity scoring.
"""
assert len(test_images) == len(questions) == len(reference_answers)
fp16_scores = []
quant_scores = []
for i, (image, question, reference) in enumerate(
zip(test_images, questions, reference_answers)
):
print(f"\nTest {i+1}/{len(test_images)}: {question[:50]}...")
fp16_response = run_visual_inference(model_fp16, processor, image, question)
quant_response = run_visual_inference(model_quantized, processor, image, question)
# Simple scoring: check if key terms from reference appear in response
reference_terms = reference.lower().split()
fp16_score = sum(
1 for term in reference_terms if term in fp16_response.lower()
) / len(reference_terms)
quant_score = sum(
1 for term in reference_terms if term in quant_response.lower()
) / len(reference_terms)
fp16_scores.append(fp16_score)
quant_scores.append(quant_score)
print(f" FP16: {fp16_response[:100]}...")
print(f" Quant: {quant_response[:100]}...")
print(f" Scores - FP16: {fp16_score:.3f}, Quant: {quant_score:.3f}")
import numpy as np
return {
"fp16_avg_score": float(np.mean(fp16_scores)),
"quant_avg_score": float(np.mean(quant_scores)),
"retention": float(np.mean(quant_scores) / np.mean(fp16_scores) * 100),
"per_example": list(zip(fp16_scores, quant_scores)),
}
Quantizing CLIP / SigLIP for Retrieval
import torch
from transformers import CLIPModel, CLIPProcessor, BitsAndBytesConfig
import torch.nn.functional as F
def load_clip_quantized(
model_id: str = "openai/clip-vit-large-patch14",
quantize_vision: bool = False,
quantize_text: bool = True,
) -> tuple:
"""
CLIP quantization strategy.
CLIP has two encoders: vision and text.
For image-text retrieval:
- Image embeddings are often pre-computed and cached - vision encoder
only runs at index time. Keep vision encoder in FP16 for quality.
- Text encoder runs at query time (real-time). INT8 is acceptable here
since text query latency matters more than marginal quality.
If you are running CLIP in real-time on images (no caching),
keep both encoders in FP16.
"""
processor = CLIPProcessor.from_pretrained(model_id)
if not quantize_vision and not quantize_text:
# Full FP16 baseline
model = CLIPModel.from_pretrained(
model_id,
torch_dtype=torch.float16,
device_map="auto",
)
return model, processor
    # For CLIP we use INT8 rather than NF4: CLIP is an embedding model,
    # so we care about vector quality, not text generation.
    # llm_int8_skip_modules keeps the listed submodules in full precision;
    # CLIPModel exposes them as "vision_model" and "text_model".
    skip_modules = []
    if not quantize_vision:
        skip_modules.append("vision_model")
    if not quantize_text:
        skip_modules.append("text_model")
    bnb_config = BitsAndBytesConfig(
        load_in_8bit=True,
        llm_int8_skip_modules=skip_modules or None,
    )
    model = CLIPModel.from_pretrained(
        model_id,
        quantization_config=bnb_config,
        device_map="auto",
    )
    return model, processor
def compute_embeddings_quantized(
model,
processor,
images: list,
texts: list,
batch_size: int = 32,
device: str = "cuda",
) -> dict:
"""
Compute CLIP embeddings with a quantized model.
Measure embedding quality via cosine similarity.
"""
all_image_embeddings = []
all_text_embeddings = []
model.eval()
# Image embeddings
for i in range(0, len(images), batch_size):
batch_images = images[i:i + batch_size]
inputs = processor(images=batch_images, return_tensors="pt").to(device)
with torch.no_grad():
image_features = model.get_image_features(**inputs)
# Normalize embeddings
image_features = F.normalize(image_features.float(), dim=-1)
all_image_embeddings.append(image_features.cpu())
# Text embeddings
for i in range(0, len(texts), batch_size):
batch_texts = texts[i:i + batch_size]
inputs = processor(text=batch_texts, return_tensors="pt",
padding=True, truncation=True).to(device)
with torch.no_grad():
text_features = model.get_text_features(**inputs)
text_features = F.normalize(text_features.float(), dim=-1)
all_text_embeddings.append(text_features.cpu())
image_embeddings = torch.cat(all_image_embeddings, dim=0)
text_embeddings = torch.cat(all_text_embeddings, dim=0)
# Compute similarity matrix
similarity = image_embeddings @ text_embeddings.T
return {
"image_embeddings": image_embeddings,
"text_embeddings": text_embeddings,
"similarity_matrix": similarity,
"embedding_dim": image_embeddings.shape[1],
}
def measure_retrieval_accuracy(
fp16_similarities: torch.Tensor,
quant_similarities: torch.Tensor,
) -> dict:
"""
Measure retrieval quality degradation.
For image-text pairs (diagonal = correct match),
compute Recall@1 and Recall@5.
"""
n = fp16_similarities.shape[0]
def recall_at_k(sim_matrix, k=1):
# For each image, rank text candidates by similarity
# Correct match is on the diagonal
correct = 0
for i in range(n):
scores = sim_matrix[i]
top_k_indices = scores.topk(k).indices
if i in top_k_indices:
correct += 1
return correct / n * 100
fp16_r1 = recall_at_k(fp16_similarities, k=1)
fp16_r5 = recall_at_k(fp16_similarities, k=5)
quant_r1 = recall_at_k(quant_similarities, k=1)
quant_r5 = recall_at_k(quant_similarities, k=5)
return {
"fp16_r1": fp16_r1,
"fp16_r5": fp16_r5,
"quant_r1": quant_r1,
"quant_r5": quant_r5,
"r1_drop": fp16_r1 - quant_r1,
"r5_drop": fp16_r5 - quant_r5,
}
Production Engineering Notes
When to Quantize the Vision Encoder
The rule of thumb: keep the vision encoder in FP16 unless you are severely memory-constrained and have empirically verified acceptable quality degradation on your specific task.
The cases where you might quantize the vision encoder to INT8 (not INT4):
- You are running image classification, not complex visual QA or visual reasoning. Classification tasks are more robust to small feature degradation than open-ended generation.
- You are using a very large vision encoder (e.g., ViT-H/14 or ViT-G) and need the memory savings.
- You are using TensorRT INT8 calibration with a large representative calibration set from your deployment domain.
Never quantize a vision encoder used in a VLM to 4 bits. The information loss at 4 bits in the visual feature space is too severe for reliable visual grounding.
TensorRT INT8 for Production CNN Deployment
For production CNN deployment, TensorRT with INT8 calibration is the gold standard. The workflow:
- Export your PyTorch model to ONNX with BatchNorm folding.
- Use TensorRT's `IInt8EntropyCalibrator2` to run calibration on 500-1000 representative images.
- Build the TensorRT engine with the `INT8` flag and explicit precision layers.
- Keep the final classification head in FP32 (it is tiny, and quantization sensitivity at the final layer is high).
TensorRT's INT8 calibration uses an entropy-based algorithm that minimizes the information loss (KL divergence) between the FP32 and INT8 activation distributions; this typically gives better results than range-based (MinMax) calibration.
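For reference, a hedged sketch of what a Python calibrator looks like. The batch source, cache filename, and fixed batch size are placeholders, and it assumes `pycuda` for device buffers:

```python
import numpy as np
import tensorrt as trt
import pycuda.autoinit  # noqa: F401  (creates a CUDA context)
import pycuda.driver as cuda

class EntropyCalibrator(trt.IInt8EntropyCalibrator2):
    """Feeds preprocessed batches to TensorRT's entropy calibrator."""

    def __init__(self, batches, cache_file="resnet50_int8.cache"):
        super().__init__()
        self.batches = iter(batches)  # iterable of float32 NCHW numpy arrays
        self.cache_file = cache_file
        self.device_input = None

    def get_batch_size(self):
        return 32  # must match the batch dimension of the arrays you feed

    def get_batch(self, names):
        batch = next(self.batches, None)
        if batch is None:
            return None  # signals the end of calibration
        if self.device_input is None:
            self.device_input = cuda.mem_alloc(batch.nbytes)
        cuda.memcpy_htod(self.device_input, np.ascontiguousarray(batch))
        return [int(self.device_input)]

    def read_calibration_cache(self):
        try:
            with open(self.cache_file, "rb") as f:
                return f.read()
        except FileNotFoundError:
            return None

    def write_calibration_cache(self, cache):
        with open(self.cache_file, "wb") as f:
            f.write(cache)
```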
Memory Budget Planning for VLMs
For planning memory budgets across different VLM configurations:
- LLaVA-1.5 7B (Vicuna backbone): FP16 total 14.2 GB; NF4 (LLM only) 4.8 GB (0.6 GB vision encoder + 4.2 GB LLM); INT8 (LLM only) 8.1 GB.
- LLaMA 3.2 Vision 11B: FP16 total 22.0 GB; NF4 (LLM only) 6.8 GB; NF4 (both) 5.5 GB (NOT recommended - see the danger note below).
- LLaMA 3.2 Vision 90B: FP16 total 180.0 GB; NF4 (LLM only) 48.0 GB; multi-GPU NF4 48.0 GB across 2x H100 80 GB.
For the 11B model, the 1.3 GB difference between asymmetric NF4 and full NF4 is rarely worth the quality degradation. Keep the vision encoder in FP16 unless you have no choice.
Serving VLMs in Production
For production serving:
- Use vLLM (version 0.4+) for LLaVA and LLaMA Vision models. It supports asymmetric quantization natively.
- Enable continuous batching - VLM requests with different image sizes have different prefill costs, which makes static batching inefficient.
- Pre-compute and cache image embeddings when serving the same images repeatedly (product catalog, document templates).
- Monitor GPU memory carefully during KV cache growth - long vision conversations grow the KV cache significantly.
Common Mistakes
:::danger Quantizing the Vision Encoder to 4 Bits in a VLM The vision encoder produces the visual tokens that the LLM backbone uses for all visual reasoning. Quantizing it to 4 bits destroys fine-grained visual features - texture details, subtle spatial relationships, color nuances. The LLM backbone cannot reconstruct what was destroyed. The model will produce fluent, confident, factually wrong descriptions. Always keep vision encoders in FP16 or at most INT8. The 600 MB savings from quantizing a CLIP ViT-L is not worth the capability loss. :::
:::danger Skipping BatchNorm Folding Before CNN Quantization If you attempt to quantize a CNN that still has separate BatchNorm layers, the quantization of the BN scale and bias parameters will cause systematic activation distribution shifts. These shifts compound across layers and can cause catastrophic accuracy loss even at INT8. Always fold BatchNorm into the preceding Conv layer before quantization. Most export pipelines (TorchScript, ONNX) do this automatically, but verify explicitly before you trust the output. :::
:::warning Using Per-Tensor Quantization for CNN Weights Per-tensor quantization for CNN weights causes 1-3% unnecessary accuracy loss on ImageNet compared to per-channel quantization. CNN weight magnitudes vary dramatically across output channels. Per-tensor uses the range of the largest-magnitude channel as the scale, which leaves small-magnitude channels with very coarse quantization. Always use per-channel quantization for convolutional layer weights. Per-tensor is acceptable for activations but not for weights. :::
:::warning Applying LLM Quantization Tools Directly to ViTs Without Modification GPTQ and AWQ were designed and tuned for autoregressive LLMs with causal attention. The Hessian computation in GPTQ and the activation analysis in AWQ assume specific patterns in how activations look. ViT attention uses bidirectional attention, patch-structured inputs, and has the twin-peaked softmax distribution. If you run GPTQ on a ViT using default LLM settings, you may get acceptable perplexity-equivalent metrics but worse-than-expected classification accuracy. Use vision-specific quantization tools (PTQ4ViT, RepQ-ViT) or validate carefully with per-task accuracy. :::
:::warning Ignoring the MLP Connector in VLM Quantization The MLP connector (or linear projection) that maps vision encoder outputs to LLM input dimensions is small (typically 10-50M parameters) but important. It is the bridge between visual and language representations. Some quantization frameworks quantize it by default along with the LLM backbone. Keep it in FP16. The memory savings from quantizing a 50M parameter connector are less than 25 MB - completely negligible. Always explicitly exclude the connector from quantization. :::
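A quick sanity check after loading (the helper and the module keyword are ours, following LLaVA-style naming; adjust the keyword to your model):

```python
import bitsandbytes as bnb

def assert_module_not_quantized(model, keyword: str = "multi_modal_projector") -> None:
    """Raise if any matching submodule was converted to a 4-bit linear."""
    for name, module in model.named_modules():
        if keyword in name and isinstance(module, bnb.nn.Linear4bit):
            raise RuntimeError(
                f"{name} was quantized; add it to llm_int8_skip_modules"
            )
    print(f"OK: no '{keyword}' module is quantized")
```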
Interview Q&A
Q1: Why can you quantize the LLM backbone of a VLM to 4 bits but you should keep the vision encoder in FP16? What is the fundamental difference?
A: The LLM backbone generates text autoregressively, where each token's quality is somewhat self-correcting - the model attends to its own outputs and can implicitly smooth over quantization noise in the representation space. The LLM backbone is also the component that LLM quantization techniques (GPTQ, AWQ, NF4) were specifically designed and tuned for - they know the weight distribution patterns and handle outliers appropriately.
The vision encoder is different in two critical ways. First, it is an encoder, not a generative model - its output visual tokens are the only source of visual information for the entire generation process. If those tokens are corrupted, no reasoning by the LLM can recover the lost information. Second, CLIP ViT and similar architectures have not been the design target of 4-bit quantization methods. The activation distributions, weight scales, and layer sensitivity patterns differ from LLM backbones. Applying NF4 to a ViT damages the visual feature manifold in ways that degrade fine-grained visual understanding disproportionately. The 600 MB or so you save by quantizing the vision encoder is not worth the capability loss. Keep it in FP16.
Q2: A CNN model achieves 76.2% ImageNet top-1 in FP32. After INT8 post-training quantization, it drops to 73.1%, a 3.1 point loss which is unacceptable. What are the likely causes and how do you diagnose and fix them?
A: Three likely causes. First, check whether BatchNorm was properly folded before quantization. If BatchNorm layers are being quantized separately, the BN scale and bias quantization error creates systematic activation distribution shifts across all layers. This is the single most common cause of large accuracy drops in CNN quantization. Verify the exported model graph - there should be no separate BatchNorm nodes after folding.
Second, check whether per-channel quantization is being used for weights. Per-tensor weight quantization typically causes 1-2% accuracy loss on its own. Switch to per-channel.
Third, check the calibration data quality. If the calibration images are not representative of the validation distribution (wrong augmentations, wrong color normalization, too few samples), the activation scale factors will be wrong. Use at least 100 batches of 32 images from the training set with the same preprocessing pipeline as validation.
To diagnose: use a layer-wise sensitivity analysis - quantize one layer at a time and measure the accuracy impact. Plot accuracy vs layer index. You will typically see one or two "sensitive" layers that drive most of the degradation. For those specific layers, use INT8 for activations but keep weights in INT16 or FP16, or keep the layer in FP32 entirely.
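A minimal sketch of that layer-wise scan using per-tensor weight fake-quantization; `eval_fn` is an assumed callable you supply that returns top-1 accuracy:

```python
import torch

@torch.no_grad()
def fake_quantize_weight(w: torch.Tensor, bits: int = 8) -> torch.Tensor:
    """Simulate symmetric per-tensor weight quantization (quantize-dequantize)."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().amax() / qmax
    return torch.round(w / scale).clamp(-qmax, qmax) * scale

@torch.no_grad()
def sensitivity_scan(model, eval_fn, layer_types=(torch.nn.Conv2d, torch.nn.Linear)):
    """Quantize one layer at a time and record the accuracy drop vs baseline."""
    baseline = eval_fn(model)
    results = {}
    for name, module in model.named_modules():
        if isinstance(module, layer_types):
            original = module.weight.data.clone()
            module.weight.data = fake_quantize_weight(original)
            results[name] = baseline - eval_fn(model)
            module.weight.data = original  # restore before the next layer
    # Largest drops first: these are the layers to keep in higher precision
    return sorted(results.items(), key=lambda kv: -kv[1])
```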
Q3: What are twin-peaked distributions in ViT attention, why do they break standard INT8 quantization, and how do you handle them?
A: After the softmax operation in ViT attention, the attention weights have a bimodal distribution: a large fraction of values are very close to 0 (patches that are not attended to) and a small fraction are close to 1 (the attended patches). Standard INT8 uniform quantization distributes its 256 integer levels evenly across the observed range, say 0 to 1. But since almost all values are near 0, the quantization is allocating most of its precision to a range where nothing lives, and using very coarse precision for the near-1 region where the important values live.
The practical consequence: the high-attention values (the ones that actually matter for determining where the model is looking) are quantized with only a few integer levels, introducing large quantization error into the attention computation. This can cause the model to "look" at the wrong patches.
The solutions: (1) Log-scale quantization, which gives more levels to small values and fewer to large values. (2) Asymmetric clipping, where you choose different scales for the two peaks. (3) Keep attention Q/K/V and output projections in INT8 while being especially careful with the post-softmax values. (4) The PTQ4ViT approach uses Hessian-guided search to find the optimal quantization parameters for ViT activations, which implicitly handles twin-peaked distributions. (5) Simply keep attention layers in INT16 or FP16 and only quantize FFN layers - the FFN layers are larger and their quantization gives you most of the compression benefit.
Q4: How would you quantize CLIP ViT-L/14 for a real-time image search system that needs to index 10 million product images and serve 1000 embedding queries per second?
A: This use case has two distinct operating modes that justify different precision strategies. For indexing (offline), you are encoding 10 million images once. Quality matters more than speed here - use FP16 for the vision encoder, batch the images at 256-512 per batch, and run on multiple GPUs. The index is computed once and stored.
For query serving (online, 1000 QPS), you have two paths: image queries and text queries. Text queries use the CLIP text encoder. Text encoder INT8 quantization is acceptable here - text embeddings are slightly less sensitive to quantization error than image embeddings, and the latency reduction from INT8 is meaningful at high QPS. Image queries at serving time can use INT8 with TensorRT calibrated on a representative sample of your product images.
The key insight is that you can use different precision for indexing vs serving. Index at FP16 for maximum quality. Serve at INT8 for maximum throughput. If the query-time image encoder (INT8) and the index-time image encoder (FP16) produce embeddings with the same normalization structure (which they will, since they are the same architecture), the similarity scores will be compatible. Measure recall@1 and recall@10 on a held-out set comparing FP16-vs-FP16 to INT8-vs-FP16 to verify the similarity is preserved. You should see less than 0.5% recall drop at INT8.
Q5: LLaMA 3.2 Vision 11B in NF4 (asymmetric - vision encoder FP16) uses 6.8 GB. But when you actually run inference with a high-resolution image and a 2048-token context, you get OOM. Why?
A: Model weights at 6.8 GB is not the only memory consumer. There are three additional sources of memory that can push you over budget.
First, the KV cache. During generation, the KV cache stores key-value pairs for every token in the context across all layers. For LLaMA 3.2 11B with 2048 context and typical batch size, the KV cache can be 2-4 GB. At longer contexts, it grows linearly.
Second, the image encoding. High-resolution images in LLaMA 3.2 Vision are tiled and processed at multiple resolutions. A 1024x1024 image might produce 1000+ visual tokens, each contributing to the KV cache. The image encoding process itself is memory-intensive before compression into visual tokens.
Third, activation memory during the forward pass. The intermediate activations during the LLM backbone forward pass scale with sequence length and batch size. At long contexts, this can be several GB.
The fix: (1) Reduce max_new_tokens to limit KV cache growth during generation. (2) Use paged attention (vLLM's PagedAttention or similar) which manages KV cache memory more efficiently. (3) Reduce image resolution at inference time if your use case allows it. (4) Use torch.cuda.memory_stats() to profile exactly where the memory is going before trying to fix it.
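A back-of-the-envelope KV cache estimate makes the growth visible; the layer and head counts below are hypothetical placeholders, so substitute the values from your model's config:

```python
def kv_cache_gb(num_layers: int, num_kv_heads: int, head_dim: int,
                seq_len: int, batch_size: int = 1, bytes_per_elem: int = 2) -> float:
    """K and V tensors (factor of 2) per layer, FP16 elements by default."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem) / 1e9

# Hypothetical GQA config: 40 layers, 8 KV heads, head_dim 128.
# 3000 tokens (2048 text + ~1000 visual) at batch size 4:
print(f"{kv_cache_gb(40, 8, 128, 3000, batch_size=4):.2f} GB")
```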
Q6: You are deploying a quantized LLaVA model and you notice it performs well on most images but consistently fails on images with fine text (receipts, documents, whiteboards). What is likely happening and how do you fix it?
A: Fine text in images is a high-frequency visual signal. The CLIP ViT-L/14 architecture encodes images into 14x14 pixel patches at typical input resolution (224x224). Fine text may span only one or two pixels in a patch, making it a very subtle signal in the patch embedding. When the vision encoder is at FP16, it preserves this subtle signal reasonably well. But if you have quantized the vision encoder (or are using it at a resolution lower than its training resolution), the fine texture information that distinguishes text from background is exactly the kind of signal that gets rounded away.
The solutions: (1) First, verify you are keeping the vision encoder at FP16. If you mistakenly quantized it, that is the primary fix. (2) Use a higher-resolution vision encoder. LLaVA-HR or LLaVA-1.6 uses dynamic resolution processing that tiles the image into sub-images and processes each at higher resolution. This is the standard fix for OCR-in-the-wild tasks. (3) If using LLaMA 3.2 Vision, ensure you are using it at the full supported resolution (up to 1120x1120) rather than downscaling inputs. (4) Consider using a vision encoder with dedicated high-res support - models like InternVL or CogVLM handle document images better. For document-heavy use cases, you may need a model with a PaliGemma or DocOwl-style encoder that was specifically trained on high-resolution text recognition.
