QLoRA: Quantized Low-Rank Adaptation
The Single-GPU Dream
The year is 2023. Tim Dettmers, a PhD student at the University of Washington, is working on what seems like an impossible goal: fine-tune a 65 billion parameter language model on a single consumer GPU.
A 65B model in FP16 (2 bytes per parameter) requires 130GB of memory for the weights alone. The most powerful consumer GPU available - the RTX 4090 - has 24GB. Even if you quantize to 8-bit (halving the weight memory to 65GB), you are 2.7x over what fits on a single GPU. The problem seems fundamentally intractable.
Dettmers and his co-authors take three techniques (NF4 quantization, double quantization, and paged optimizers - each independently interesting) and combine them into QLoRA, published in May 2023. The result: fine-tune a 65B model on a single NVIDIA A100 80GB GPU. Or a 33B model on a consumer 24GB RTX 4090. Or a 7B model on a 12GB RTX 3080 laptop GPU.
Fine-tuning that previously required a $100K+ cluster became possible on a laptop. The paper's code was open-sourced through the bitsandbytes library. Within weeks, the open-source community was training custom 65B models in dorm rooms.
Why This Exists: The Remaining Memory Problem After LoRA
LoRA dramatically reduced the number of trainable parameters, but it did not solve the full memory problem. Here is what LoRA achieves for a 7B model:
| Memory component | Full fine-tuning | LoRA (r=16) |
|---|---|---|
| Model weights (BF16) | 14GB | 14GB |
| Gradients | 14GB | ~0.03GB (LoRA only) |
| Optimizer states (Adam) | 28GB | ~0.06GB (LoRA only) |
| Activations | ~4GB | ~4GB |
| Total | ~60GB | ~18GB |
LoRA cut total training memory from ~60GB to ~18GB. But the base model weights (14GB) must still be loaded in full. For a 7B model, this fits on a 24GB GPU. For a 65B model (130GB in BF16), no consumer GPU can hold it.
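The arithmetic behind the table can be reproduced with a few lines of Python. This is a minimal sketch that mirrors the table's bookkeeping; the function name `training_memory_gb`, the flat 4GB activation figure, and the ~15M adapter parameter count are assumptions chosen to match the rows above, not measured values.

```python
def training_memory_gb(params_b, trainable_params_b, bytes_per_weight=2.0, activations_gb=4.0):
    """Rough training-memory budget in GB (parameters given in billions).

    Assumes 2 bytes of gradient and 4 bytes of Adam state (two 2-byte moments)
    per trainable parameter, which is the accounting used in the table.
    """
    weights = params_b * bytes_per_weight
    gradients = trainable_params_b * 2.0
    optimizer = trainable_params_b * 4.0
    total = weights + gradients + optimizer + activations_gb
    return {
        "weights": weights,
        "gradients": gradients,
        "optimizer": optimizer,
        "activations": activations_gb,
        "total": total,
    }


full = training_memory_gb(7.0, 7.0)    # every parameter is trainable
lora = training_memory_gb(7.0, 0.015)  # assumed ~15M adapter parameters

print(f"Full fine-tuning: {full['total']:.1f} GB")
print(f"LoRA:             {lora['total']:.1f} GB")
print(f"65B weights alone in BF16: {training_memory_gb(65.0, 0.0)['weights']:.0f} GB")
```

The weights term is untouched by LoRA, which is why it dominates once gradients and optimizer states shrink.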
The remaining bottleneck: the base model weights themselves. Can you quantize the base model weights aggressively enough that they fit on one GPU, without destroying model quality?
QLoRA's answer: yes, but it requires careful engineering.
NF4: NormalFloat 4-bit Quantization
Standard 4-bit quantization (INT4) uses 4-bit integers to represent weights. But 4 bits gives you only 16 distinct values to represent the entire range of a weight matrix. The question is: how do you choose those 16 values to minimize quantization error?
INT4 uses evenly spaced values: $\{-8, -7, \dots, 6, 7\}$ (scaled to the range of the weight matrix). This is suboptimal because neural network weights are not uniformly distributed - they follow an approximately normal (Gaussian) distribution. Most weights cluster near zero; few weights have large magnitudes. Evenly spaced quantization wastes representational capacity on large values that rarely occur.
NF4 (NormalFloat 4-bit) chooses the 16 quantization bins such that each bin has equal probability mass under a standard normal distribution $\mathcal{N}(0, 1)$. The QLoRA paper describes this as information-theoretically optimal for normally distributed data: the levels are matched to the actual distribution of neural network weights.
Formally, the NF4 quantization values are 16 quantiles of the standard normal distribution, symmetrically placed and rescaled to $[-1, 1]$:
$$q_i = \frac{\Phi^{-1}\left(\frac{i + 0.5}{16}\right)}{\max_j \left|\Phi^{-1}\left(\frac{j + 0.5}{16}\right)\right|}, \quad i = 0, \dots, 15$$
where $\Phi^{-1}$ is the quantile function of the normal distribution.
The result: NF4 quantization introduces significantly less quantization error than INT4 for normally distributed weights, which describes most neural network weight matrices after training.
Double Quantization: Quantizing the Constants
4-bit quantization uses block quantization: weights are grouped into blocks of size 64, and each block stores one quantization constant (the maximum absolute value of the block, absmax). This absmax value is stored as FP32 (4 bytes) per block.
Memory overhead of quantization constants: for a 7B parameter model with blocks of size 64:
- Number of blocks: 7B / 64 = 109 million blocks
- Memory for constants at FP32: 109M × 4 bytes = ~437MB
This overhead is significant. Double quantization quantizes the quantization constants themselves: instead of storing each absmax in FP32, quantize them to 8-bit with block size 256. This reduces the per-parameter overhead of quantization constants from 0.5 bits to 0.127 bits - a saving of about 0.373 bits per parameter, or approximately 326MB for a 7B model.
For a 7B model, ~326MB sounds minor. For a 65B model, this becomes ~3GB - meaningful when every gigabyte counts.
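The overhead arithmetic is easy to check directly. The sketch below assumes the layout described above (one FP32 constant per 64 weights, or with double quantization one 8-bit constant per 64 weights plus one FP32 second-level constant per 256 first-level constants); the function name `quant_constant_overhead` is hypothetical.

```python
def quant_constant_overhead(n_params, block=64, double_quant=False, dq_block=256):
    """Return (bits_per_param, total_bytes) spent on quantization constants."""
    if double_quant:
        # 8-bit first-level constant per block, plus one FP32 second-level
        # constant shared by dq_block first-level constants
        bits = 8 / block + 32 / (block * dq_block)
    else:
        # one FP32 absmax per block
        bits = 32 / block
    return bits, n_params * bits / 8


for n_params in (7e9, 65e9):
    bits_plain, bytes_plain = quant_constant_overhead(n_params)
    bits_dq, bytes_dq = quant_constant_overhead(n_params, double_quant=True)
    saved_mb = (bytes_plain - bytes_dq) / 1e6
    print(f"{n_params / 1e9:.0f}B: {bits_plain:.3f} -> {bits_dq:.3f} bits/param, "
          f"saves {saved_mb:.0f} MB")
```

The saving scales linearly with parameter count, so it only becomes noticeable at the 33B-65B scale.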
Paged Optimizers: Handling Memory Spikes
Even with 4-bit base model weights and LoRA, there is one remaining memory problem: transient memory spikes.
During normal training, the quantized weights, LoRA parameters, and optimizer states fit in GPU memory. But a mini-batch with unusually long sequences causes a brief spike in activation memory (the QLoRA paper ties these spikes to gradient checkpointing), and that spike can trigger out-of-memory (OOM) errors.
This is frustrating: your model fits in memory 99% of the time, but crashes on 1% of batches due to momentary memory pressure. Restarting the training run costs you hours.
Paged optimizers solve this by allocating the optimizer states (Adam's first and second moments for the LoRA parameters) with NVIDIA's unified memory feature. When GPU memory is full, optimizer states are automatically "paged out" to CPU RAM, freeing room for the spike. When needed for an update step, they are paged back in.
The overhead: CPU-GPU memory transfer is slow (~32 GB/s PCIe vs ~2 TB/s HBM). But optimizer states are only accessed once per update step, not every forward pass - so the latency impact is small. A training run with paged optimizers might be 5-10% slower, but it is the difference between running at all and crashing.
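A back-of-the-envelope estimate shows why the transfer cost is tolerable. This is a rough sketch: the function name `paging_round_trip_ms` is hypothetical, the ~67M adapter parameter count is an assumed example, and the 32 GB/s figure is the PCIe bandwidth quoted above rather than a measured rate.

```python
def paging_round_trip_ms(n_trainable_params, state_bytes_per_param, pcie_gb_per_s=32.0):
    """Time to page all optimizer state out to CPU RAM and back in, in milliseconds."""
    state_bytes = n_trainable_params * state_bytes_per_param
    seconds = 2 * state_bytes / (pcie_gb_per_s * 1e9)  # out + back in
    return seconds * 1e3


lora_params = 67_000_000  # assumed adapter size, for illustration only

# 32-bit Adam: two FP32 moments = 8 bytes per trainable parameter
print(f"32-bit Adam states: {paging_round_trip_ms(lora_params, 8):.1f} ms per round trip")
# 8-bit Adam: two 1-byte moments = 2 bytes per trainable parameter
print(f"8-bit Adam states:  {paging_round_trip_ms(lora_params, 2):.1f} ms per round trip")
```

Because only the adapter's optimizer states are paged, the worst case is tens of milliseconds, and it is paid only on steps where a spike actually forces eviction.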
The QLoRA Recipe
Put it all together:
- Load base model in NF4: weights stored as 4-bit NF4, reducing 65B weights from 130GB (BF16) to ~32.5GB.
- Apply LoRA adapters in BF16: the small LoRA matrices ($A$ and $B$) are stored and trained in BF16. Gradients are backpropagated through the frozen base weights, but only these matrices are updated.
- Dequantize on the fly: during the forward pass, NF4 weights are dequantized to BF16 for the matrix multiplication, then the result is used. The dequantized weights are not stored - they are computed on the fly for each layer as needed. This is computationally more expensive than BF16 training but uses much less memory.
- Paged optimizer: Adam optimizer states for LoRA parameters live in unified memory and are paged to CPU RAM automatically when the GPU runs short of memory.
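The data flow of the recipe can be sketched in NumPy. This is a toy illustration, not the bitsandbytes implementation: it uses a uniform 16-level codebook as a stand-in for the real NF4 levels, FP32 instead of BF16, and hypothetical helper names (`quantize_blockwise`, `dequantize_blockwise`, `qlora_linear`).

```python
import numpy as np


def quantize_blockwise(w, levels, block=64):
    """Store, per block of `block` weights, one absmax and 4-bit level indices."""
    flat = w.reshape(-1, block)
    absmax = np.abs(flat).max(axis=1, keepdims=True)
    absmax[absmax == 0] = 1.0  # avoid division by zero for all-zero blocks
    normalized = flat / absmax  # each block now lies in [-1, 1]
    idx = np.abs(normalized[:, :, None] - levels[None, None, :]).argmin(axis=2)
    return idx.astype(np.uint8), absmax.astype(np.float32)


def dequantize_blockwise(idx, absmax, levels, shape):
    """Rebuild an approximate weight matrix from indices and block constants."""
    return (levels[idx] * absmax).reshape(shape)


rng = np.random.default_rng(0)
d_in, d_out, r, lora_alpha = 128, 64, 8, 16

# Stand-in codebook: 16 evenly spaced levels (the real NF4 levels are non-uniform)
levels = np.linspace(-1.0, 1.0, 16, dtype=np.float32)

W = rng.normal(0.0, 0.02, size=(d_in, d_out)).astype(np.float32)
idx, absmax = quantize_blockwise(W, levels)  # only these are kept for the frozen base

A = rng.normal(0.0, 0.01, size=(d_in, r)).astype(np.float32)  # trainable
B = np.zeros((r, d_out), dtype=np.float32)                    # trainable, zero-init
scaling = lora_alpha / r


def qlora_linear(x):
    # The dequantized weights are a temporary: rebuilt per call, never stored
    W_hat = dequantize_blockwise(idx, absmax, levels, W.shape)
    return x @ W_hat + scaling * (x @ A @ B)


x = rng.normal(size=(4, d_in)).astype(np.float32)
y = qlora_linear(x)
print(y.shape, float(np.abs(dequantize_blockwise(idx, absmax, levels, W.shape) - W).max()))
```

Only `idx`, `absmax`, `A`, and `B` persist between steps; in real QLoRA the optimizer touches `A` and `B` alone.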
Memory comparison for 65B model:
| Configuration | Memory for weights | Trainable params | Total approx |
|---|---|---|---|
| Full BF16 fine-tuning | 130GB | 65B | 520GB+ |
| LoRA in BF16 | 130GB | ~130M | ~140GB |
| QLoRA (NF4 base) | 32.5GB | ~130M | ~40GB |
QLoRA enables fine-tuning a 65B model on a single A100 80GB GPU with 40GB of memory for the model, leaving 40GB for activations, optimizer states, and batch data.
Quantization Error and Quality
Does 4-bit quantization hurt quality? QLoRA trains with:
- 4-bit NF4 storage for the frozen base weights (they are never updated)
- BF16 compute for the matrix multiplications (dequantization happens on the fly, in both the forward and backward pass)
- BF16 LoRA adapters (the only trained parameters)
The key finding from Dettmers et al. (2023): across their benchmarks, models fine-tuned with 4-bit QLoRA matched the performance of 16-bit LoRA and 16-bit full fine-tuning, at a fraction of the memory cost. The NF4 quantization error is small enough that the LoRA training compensates for it.
However, QLoRA is not strictly equivalent to 16-bit LoRA. For tasks requiring high mathematical precision or very fine-grained numerical reasoning, the quantization error can compound. In production, if you can afford the memory, BF16 LoRA is preferred. QLoRA is the choice when memory is the binding constraint.
Code: QLoRA Fine-tuning
```python
"""
QLoRA fine-tuning with bitsandbytes + PEFT.
Fine-tune a 7B or larger model with 4-bit quantization.
"""
from typing import Optional

import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
)
from peft import (
    LoraConfig,
    get_peft_model,
    TaskType,
    prepare_model_for_kbit_training,
)
from trl import SFTTrainer, SFTConfig
from datasets import Dataset


def load_model_in_4bit(model_name: str):
    """
    Load model in 4-bit NF4 with double quantization.
    This is the QLoRA base configuration.
    """
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,                      # Use 4-bit quantization
        bnb_4bit_quant_type="nf4",              # NormalFloat 4-bit (better than int4)
        bnb_4bit_compute_dtype=torch.bfloat16,  # Compute dtype for dequantized ops
        bnb_4bit_use_double_quant=True,         # Double quantization for quantization constants
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        device_map="auto",   # Automatically distribute across available GPUs
        use_cache=False,     # KV cache is incompatible with gradient checkpointing
    )
    # Prepare for k-bit training:
    # 1. Casts layernorm to float32 for stability
    # 2. Upcasts output embedding to float32
    # 3. Enables gradient checkpointing
    model = prepare_model_for_kbit_training(model)
    return model


def create_qlora_model(
    base_model_name: str,
    lora_r: int = 64,  # Dettmers et al. used r=64 in original QLoRA paper
    lora_alpha: int = 16,
    lora_dropout: float = 0.1,
):
    """Create a QLoRA-ready model."""
    model = load_model_in_4bit(base_model_name)
    lora_config = LoraConfig(
        r=lora_r,
        lora_alpha=lora_alpha,
        lora_dropout=lora_dropout,
        target_modules=[
            "q_proj", "v_proj", "k_proj", "o_proj",
            "gate_proj", "up_proj", "down_proj",
        ],
        bias="none",
        task_type=TaskType.CAUSAL_LM,
        inference_mode=False,
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()
    return model


def qlora_training_pipeline(
    base_model_name: str = "meta-llama/Llama-2-7b-hf",
    output_dir: str = "./qlora-adapter",
    dataset: Optional[Dataset] = None,
):
    """
    Complete QLoRA training pipeline.
    Memory footprint for 7B model: ~12GB (fits on RTX 3080)
    """
    tokenizer = AutoTokenizer.from_pretrained(base_model_name)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "right"

    model = create_qlora_model(base_model_name)

    if dataset is None:
        # Example data
        dataset = Dataset.from_list([
            {"text": "### Instruction:\nWhat is gradient descent?\n\n### Response:\nGradient descent is an optimization algorithm that iteratively updates model parameters in the direction of steepest descent of the loss function."}
        ])

    sft_config = SFTConfig(
        output_dir=output_dir,
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        # Learning rate for the small adapter matrices; tune per model size
        learning_rate=2e-4,
        lr_scheduler_type="constant_with_warmup",
        warmup_steps=100,
        weight_decay=0.001,
        # Precision settings
        bf16=True,   # BF16 compute (base model is NF4, compute is BF16)
        fp16=False,
        # Argument names here vary across TRL versions
        max_seq_length=2048,
        packing=False,
        # Optimizer: use paged_adamw for memory spike handling
        optim="paged_adamw_8bit",  # Paged optimizer with 8-bit Adam
        logging_steps=10,
        save_steps=200,
        report_to="none",
        max_grad_norm=0.3,  # Tighter gradient clipping for 4-bit training
    )
    trainer = SFTTrainer(
        model=model,
        args=sft_config,
        train_dataset=dataset,
        processing_class=tokenizer,
    )
    trainer.train()

    # Save LoRA adapter (base model stays as NF4 quantized)
    model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)
    print(f"QLoRA adapter saved to {output_dir}")
    return trainer
```
```python
# ---- Memory estimation utility ----
def estimate_qlora_memory(
    model_params_billions: float,
    lora_r: int = 64,
    num_layers: int = 32,
    hidden_dim: int = 4096,
    batch_size: int = 4,
    sequence_length: int = 2048,
) -> float:
    """
    Estimate GPU memory requirements for QLoRA training.
    All estimates are approximate. The num_layers and hidden_dim defaults
    describe a 7B-class model; pass the real values for larger models.
    """
    # Base model: 4 bits (0.5 bytes) per parameter
    base_model_memory_gb = (model_params_billions * 1e9 * 0.5) / 1e9
    # LoRA adapter: BF16 (2 bytes)
    # Approximate LoRA params: A and B (2 matrices) for each of the
    # Q, K, V, O projections (4 modules) in every layer
    lora_params = 2 * num_layers * 4 * hidden_dim * lora_r
    lora_memory_gb = (lora_params * 2) / 1e9
    # Optimizer states (8-bit Adam for LoRA params): 1 byte each x 2 moments
    optimizer_gb = (lora_params * 2) / 1e9
    # Activations: rough estimate (one 2-byte value per token, hidden unit, and layer)
    activation_gb = (batch_size * sequence_length * hidden_dim * num_layers * 2) / 1e9
    total_gb = base_model_memory_gb + lora_memory_gb + optimizer_gb + activation_gb
    print(f"QLoRA Memory Estimate for {model_params_billions}B model:")
    print(f"  Base model (NF4):     {base_model_memory_gb:.1f} GB")
    print(f"  LoRA adapters (BF16): {lora_memory_gb:.2f} GB")
    print(f"  Optimizer states:     {optimizer_gb:.2f} GB")
    print(f"  Activations:          {activation_gb:.1f} GB")
    print(f"  Total estimate:       {total_gb:.1f} GB")
    return total_gb


# With the defaults, estimate_qlora_memory(7) comes to roughly 5.9 GB.
# Real runs need headroom on top of this (dequantization buffers, CUDA context,
# fragmentation, larger true activation footprints), so plan for roughly:
#   7B  -> ~12GB (fits on RTX 3080 12GB)
#   13B -> ~18GB (fits on RTX 3090 24GB)
#   33B -> ~24GB (fits on RTX 4090 24GB with care)
#   65B -> ~40GB (fits on A100 80GB)
```
Comparing Quantization Formats
```python
"""
Demonstrate the difference between INT4 and NF4 quantization
for normally distributed weight values.
"""
import numpy as np
from scipy.stats import norm


def quantize_int4(values: np.ndarray) -> tuple:
    """Standard INT4 symmetric quantization. Returns (dequantized, mse)."""
    abs_max = np.max(np.abs(values))
    scale = abs_max / 7.0  # 4-bit symmetric: range -8 to 7
    # Quantize and dequantize
    quantized = np.clip(np.round(values / scale), -8, 7).astype(np.int8)
    dequantized = quantized * scale
    return dequantized, np.mean((values - dequantized) ** 2)  # MSE


def get_nf4_quantization_points() -> np.ndarray:
    """
    Compute 16 NF4-style quantization points as normal distribution quantiles.
    This is a simplified construction; the table shipped in bitsandbytes differs slightly.
    """
    # 16 evenly spaced probabilities, mapped through the N(0,1) quantile function
    quantiles = [(i + 0.5) / 16 for i in range(16)]
    nf4_points = norm.ppf(quantiles)
    # Normalize to [-1, 1]
    nf4_points = nf4_points / np.max(np.abs(nf4_points))
    return nf4_points


def quantize_nf4(values: np.ndarray) -> tuple:
    """NF4 quantization using normal distribution quantile points. Returns (dequantized, mse)."""
    nf4_points = get_nf4_quantization_points()
    abs_max = np.max(np.abs(values))
    normalized = values / abs_max  # Normalize to [-1, 1]
    # For each value, find nearest NF4 quantile point
    quantized_indices = np.argmin(
        np.abs(normalized[:, np.newaxis] - nf4_points[np.newaxis, :]),
        axis=1,
    )
    dequantized_normalized = nf4_points[quantized_indices]
    dequantized = dequantized_normalized * abs_max
    return dequantized, np.mean((values - dequantized) ** 2)  # MSE


if __name__ == "__main__":
    np.random.seed(42)
    # Simulate a weight vector with normal distribution (typical for neural networks)
    weights = np.random.randn(10000) * 0.02  # Small normal weights
    _, int4_mse = quantize_int4(weights)
    _, nf4_mse = quantize_nf4(weights)
    print(f"INT4 quantization MSE: {int4_mse:.8f}")
    print(f"NF4 quantization MSE:  {nf4_mse:.8f}")
    print(f"NF4 improvement: {int4_mse / nf4_mse:.2f}x lower error")
    # The exact ratio depends on the sample and on the block size used for absmax
```
Production Engineering Notes
When to Use QLoRA vs Regular LoRA
| Scenario | Recommendation |
|---|---|
| 7B model, 24GB GPU | Regular LoRA in BF16 (fits fine) |
| 13B model, 24GB GPU | QLoRA with NF4 |
| 33B model, 80GB GPU | Either - BF16 LoRA preferred for quality |
| 65B model, 80GB GPU | QLoRA required |
| 7B model, 12GB laptop GPU | QLoRA required |
| Production serving | Merge, then re-quantize with GPTQ or AWQ |
After QLoRA Training: Handling the Final Model
After QLoRA training, your model has NF4 weights (from the base model) and BF16 LoRA adapters. For production deployment:
Option 1 (simplest): Load NF4 base + LoRA adapter at inference time. The NF4 model is roughly 4x smaller than BF16, but on-the-fly dequantization typically makes it somewhat slower, so this option suits memory-constrained or low-traffic inference.
Option 2 (best quality): Merge LoRA into the dequantized BF16 model, then re-quantize the merged model using GPTQ or AWQ (higher quality quantization methods designed for inference). This produces the best inference quality and speed.
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM
import torch

# Load full-precision base (for clean merge)
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.bfloat16,  # 16-bit weights for the merge
)
# Load QLoRA adapter trained on NF4 base
model_with_lora = PeftModel.from_pretrained(base_model, "./qlora-adapter/")
# Merge and save as clean BF16 model
merged_model = model_with_lora.merge_and_unload()
merged_model.save_pretrained("./merged-bf16-model/")
# Now re-quantize with GPTQ/AWQ for production
```
The Guanaco result from the QLoRA paper
Dettmers et al. fine-tuned LLaMA-65B on roughly 9,000 examples from the OASST1 dataset using QLoRA in approximately 24 hours on a single GPU. The paper reports that the resulting Guanaco-65B reached 99.3% of ChatGPT's performance level on the Vicuna benchmark. This was a landmark result for open-source LLM fine-tuning.
Common Mistakes
Loading a 4-bit quantized model without prepare_model_for_kbit_training
When fine-tuning a 4-bit quantized model, you must call prepare_model_for_kbit_training(model) before adding LoRA adapters. This function: (1) casts LayerNorm layers to FP32 (critical for gradient stability); (2) upcasts the output embedding to FP32; (3) prepares the model for gradient checkpointing. Skipping this step can cause unstable training, including NaN losses.
Using fp16=True instead of bf16=True with 4-bit training
The configuration above sets the 4-bit compute dtype explicitly with bnb_4bit_compute_dtype=torch.bfloat16. If you then set fp16=True in the training arguments, there is a mismatch between the compute dtype the quantized layers use and the precision the Trainer is using. Always use bf16=True, fp16=False (in TrainingArguments or SFTConfig) when the compute dtype is BF16.
Underestimating sequence length memory impact
In QLoRA, the base model weights are 4x smaller than normal, but activations remain in BF16. For a very long sequence (4096+ tokens), the activation memory can dominate. The activation memory scales as O(batch_size * seq_len * hidden_dim * num_layers). If you are OOMing even with QLoRA, try: (1) reducing sequence length; (2) reducing batch size to 1; (3) increasing gradient accumulation to compensate; (4) making sure gradient checkpointing is enabled.
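The scaling rule above can be turned into a quick what-if calculator. This is a rough proxy only: it counts one 2-byte value per token, hidden unit, and layer, whereas real activation footprints are a multiple of that and depend on gradient checkpointing. The function name `activation_gb` and the 7B-style defaults are assumptions.

```python
def activation_gb(batch_size, seq_len, hidden_dim=4096, num_layers=32, bytes_per_value=2):
    """Proxy for activation memory in GB: batch * seq * hidden * layers * bytes."""
    return batch_size * seq_len * hidden_dim * num_layers * bytes_per_value / 1e9


configs = [
    # (batch_size, grad_accum, seq_len): the first two share an effective batch of 16
    (4, 4, 2048),
    (1, 16, 2048),
    (4, 4, 8192),
]
for batch_size, grad_accum, seq_len in configs:
    print(f"batch={batch_size} accum={grad_accum} seq={seq_len}: "
          f"effective batch {batch_size * grad_accum}, "
          f"~{activation_gb(batch_size, seq_len):.2f} GB activations (proxy)")
```

Dropping the per-device batch to 1 and raising gradient accumulation keeps the effective batch size while cutting the activation term by 4x; quadrupling sequence length undoes that saving.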
Use paged_adamw_32bit for the most stable QLoRA training
The bitsandbytes library provides paged_adamw_8bit (memory-efficient) and paged_adamw_32bit (more numerically stable). For most cases, paged_adamw_8bit is fine. If you observe training instability (loss oscillations, gradient spikes), switch to paged_adamw_32bit or even a standard non-paged AdamW (for example adamw_torch). The paged optimizer's primary benefit is preventing OOM during memory spikes, not improving optimization quality.
Interview Q&A
Q1: What are the three innovations in QLoRA and what problem does each solve?
QLoRA combines three innovations. First, NF4 (NormalFloat 4-bit) quantization: stores base model weights in 4 bits using quantization points optimized for the normal distribution of neural network weights, reducing weight memory by 4x (130GB to 32.5GB for a 65B model) with less quantization error than standard INT4. Second, double quantization: quantizes the per-block quantization constants (stored as FP32 in standard 4-bit quantization) to 8-bit, saving roughly 0.3GB for a 7B model and roughly 3GB for a 65B model. Third, paged optimizers: keeps LoRA adapter optimizer states (Adam moments) in unified memory so they can be paged to CPU RAM automatically, preventing out-of-memory crashes during memory spikes from long-sequence batches. Together, these make 65B model fine-tuning possible on a single 80GB GPU.
Q2: Why is NF4 quantization better than standard INT4 for neural network weights?
Standard INT4 uses uniformly spaced quantization bins: 16 equally-spaced values across the weight range. But neural network weights follow an approximately normal (Gaussian) distribution - most values are near zero, few are large. Uniform spacing wastes bins on large-magnitude values that rarely occur. NF4 spaces the 16 quantization bins such that each bin covers an equal probability mass under the standard normal distribution. This concentrates representational capacity where the weights actually are: near zero. The result is lower quantization error for normally distributed weights at the same bit width, which translates to better model quality after fine-tuning.
Q3: During QLoRA training, what precision is used for what?
QLoRA uses multiple precisions simultaneously: base model weights stored in NF4 (4-bit), dequantized to BF16 on the fly for actual computation in each layer's forward pass (the dequantized tensor is not stored, just computed temporarily), LoRA adapter weights $A$ and $B$ stored and trained in BF16, gradient computations done in BF16, and optimizer states for LoRA parameters kept in 8-bit when using paged_adamw_8bit (pageable to CPU RAM under memory pressure). The key insight: only 4 bits are stored for base model weights, but all arithmetic is done in BF16 to maintain numerical stability. This is why bnb_4bit_compute_dtype=torch.bfloat16 is critical - it specifies the computation dtype, not the storage dtype.
Q4: What is double quantization and is the memory savings worth the added complexity?
Double quantization quantizes the quantization constants themselves. Standard 4-bit block quantization stores one FP32 absmax value per block of 64 weights: (total_params / 64) * 4 bytes. For a 7B model: 109M blocks × 4 bytes = 437MB overhead. Double quantization stores these constants in 8-bit, with one FP32 second-level constant per 256 first-level constants: (total_params / 64) * 1 byte = 109MB, plus about 2MB for the second-level constants. The saving is ~326MB for 7B (minor) and ~3GB for 65B (more significant). The "complexity" is handled entirely by bitsandbytes - from a user perspective, it is a single boolean flag: bnb_4bit_use_double_quant=True. So yes, always enable it - the memory savings are free from a usability standpoint.
Q5: After QLoRA training, should you merge the LoRA adapters? What is the recommended production workflow?
For production, the recommended workflow is: (1) train with QLoRA (NF4 base + BF16 LoRA adapters); (2) load the full-precision (BF16) base model; (3) load the LoRA adapters; (4) call merge_and_unload() to get a clean BF16 model; (5) re-quantize the merged BF16 model using GPTQ or AWQ (inference-optimized quantization methods) for production serving. This produces better quality than keeping the NF4 base at inference time (GPTQ/AWQ are more carefully calibrated for inference quality) while maintaining the memory savings of quantization. The QLoRA training was just a means to an end - the resulting merged model can be deployed like any other quantized model.
Advanced: GPTQ and AWQ for Inference Quantization
QLoRA uses NF4 quantization optimized for training. For inference, two superior quantization methods exist: GPTQ and AWQ.
GPTQ (Frantar et al., 2022): Post-training quantization using approximate second-order information (a Hessian of each layer's reconstruction error, estimated from calibration data) to minimize quantization error. GPTQ quantizes layer by layer, using calibration data to find quantized weights that minimize reconstruction error for each weight matrix. Achieves near-FP16 quality at 4-bit on most benchmarks. The standard for 4-bit inference in the open-source community.
AWQ (Lin et al., 2023): Activation-aware Weight Quantization. Observes that not all weights are equally important - weights corresponding to large activation values have a bigger impact on output quality when quantized. AWQ protects the most important 1% of weights by scaling them before quantization. Achieves slightly better quality than GPTQ at 4-bit, with faster quantization time.
```python
"""
Post-training quantization with GPTQ and AWQ for inference.
These are applied AFTER training (or after merging QLoRA adapters).
"""
from itertools import islice

# ---- GPTQ quantization ----
# Requires: pip install auto-gptq optimum
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig
from datasets import load_dataset


def quantize_with_gptq(
    model_name: str,
    output_dir: str,
    bits: int = 4,
    group_size: int = 128,             # Smaller group = better quality, more memory
    dataset_name: str = "allenai/c4",  # Calibration dataset
    num_calibration_samples: int = 128,
):
    """
    Quantize a model using GPTQ.
    Uses calibration data to find optimal quantization points.
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # Load calibration data
    calibration_data = load_dataset(dataset_name, "en", split="train", streaming=True)
    # islice advances one iterator; calling next(iter(...)) in a loop would
    # return the first sample every time
    calibration_texts = [
        example["text"]
        for example in islice(calibration_data, num_calibration_samples)
    ]
    gptq_config = GPTQConfig(
        bits=bits,
        dataset=calibration_texts,
        tokenizer=tokenizer,  # Needed to tokenize the calibration texts
        group_size=group_size,
        desc_act=False,       # Activation ordering - improves quality, slower
        damp_percent=0.01,    # Damping for Hessian computation stability
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=gptq_config,
        device_map="auto",
    )
    model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)
    print(f"GPTQ {bits}-bit model saved to {output_dir}")
    return model


# ---- AWQ quantization ----
# Requires: pip install autoawq
def quantize_with_awq(
    model_name: str,
    output_dir: str,
    bits: int = 4,
    group_size: int = 128,
):
    """
    Quantize a model using AWQ.
    Protects the most sensitive weights from quantization error.
    """
    from awq import AutoAWQForCausalLM

    model = AutoAWQForCausalLM.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    quant_config = {
        "zero_point": True,  # Enable zero-point quantization
        "q_group_size": group_size,
        "w_bit": bits,
        "version": "GEMM",   # Kernel version for inference speed
    }
    # AWQ calibration - finds optimal scaling factors
    model.quantize(tokenizer, quant_config=quant_config)
    model.save_quantized(output_dir)
    tokenizer.save_pretrained(output_dir)
    print(f"AWQ {bits}-bit model saved to {output_dir}")
    return model


# ---- Comparing quantization formats: quality and speed ----
QUANTIZATION_COMPARISON = """
Quantization Method Comparison (7B model on A100, illustrative figures):
| Method      | Bits | Memory | Perplexity | Inference Speed |
|-------------|------|--------|------------|-----------------|
| BF16        | 16   | 14 GB  | Baseline   | 1.0x (baseline) |
| NF4 (QLoRA) | 4    | 3.5 GB | +0.3 PPL   | 0.85x (slower)  |
| GPTQ-4bit   | 4    | 3.5 GB | +0.2 PPL   | 1.3x (faster)   |
| AWQ-4bit    | 4    | 3.5 GB | +0.15 PPL  | 1.4x (faster)   |
| GPTQ-8bit   | 8    | 7.0 GB | +0.05 PPL  | 1.1x (faster)   |
Notes:
- NF4 is slower than BF16 because dequantization adds overhead per layer
- GPTQ and AWQ use GPU kernels optimized for inference (not training)
- AWQ slightly outperforms GPTQ on quality at same bit-width
- For production serving: prefer AWQ-4bit for the best speed/quality tradeoff
"""
print(QUANTIZATION_COMPARISON)
```
The QLoRA Ecosystem in 2025
QLoRA catalyzed a wave of tooling and techniques. The ecosystem in 2025 includes:
Training libraries:
- bitsandbytes: the original NF4 quantization implementation
- PEFT (HuggingFace): LoRA and other adapter methods
- TRL (HuggingFace): SFT, DPO, RLHF training with QLoRA support
- Unsloth: roughly 2x faster QLoRA training through custom Triton kernels
- LLaMA-Factory: all-in-one fine-tuning framework with QLoRA support
Inference libraries:
- llama.cpp: CPU inference with 2-8 bit quantization
- vLLM: high-throughput serving with PagedAttention and quantization support
- TGI (HuggingFace): production inference server with GPTQ/AWQ support
- ollama: local model serving with automatic quantization
Training a 7B QLoRA model: practical timeline
With unsloth + trl on a single A100 80GB:
- Data preparation: 1-3 hours for 10K-100K examples
- Training at 2048 tokens: approximately 1-3 hours per epoch (batch size 4, grad accum 4)
- Evaluation: 30 minutes for standard benchmarks
- Total for production-quality 7B SFT+DPO pipeline: approximately 1 day
On consumer hardware (RTX 4090 24GB):
- 7B model: 2-4 hours per epoch (reduced batch size)
- 13B model: 4-8 hours per epoch (QLoRA required)
- 33B model: 8-16 hours per epoch (QLoRA required, tight on 24GB)
Use Unsloth for 2x QLoRA training speedup
Unsloth (Daniel Han, 2023) rewrites the core QLoRA training kernels in Triton for better GPU utilization. It is a simple way to get a training speedup (roughly 1.5-2x) without changing the training math. The speedup comes from fused kernels for RoPE, attention, and the LoRA forward/backward passes. Install with pip install unsloth; it works alongside HuggingFace models and TRL.
Post-Training Quantization for Inference
QLoRA is a training technique. For inference, different quantization approaches exist optimized for serving speed rather than training memory:
GPTQ - Post-Training Quantization
GPTQ (Frantar et al., 2022) is the most widely used method for quantizing LLMs to INT4 or INT3 for fast inference. It uses approximate second-order information (a Hessian of each layer's reconstruction error, estimated from calibration data) to minimize quantization error per layer.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset


def quantize_model_gptq(
    model_name: str,
    output_dir: str,
    bits: int = 4,
    calibration_dataset_size: int = 128,
) -> str:
    """
    Quantize a model to INT4 using GPTQ.
    Requires: pip install auto-gptq optimum
    """
    from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # GPTQ requires calibration data - a small set of real examples
    # Used to compute layer-wise Hessians for optimal quantization
    calibration_data = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
    # Filter out very short examples first, then take the requested number,
    # so the calibration set is not silently smaller than asked for
    calibration_texts = [
        example["text"]
        for example in calibration_data
        if len(example["text"]) > 100
    ][:calibration_dataset_size]
    calibration_examples = [
        tokenizer(
            text,
            return_tensors="pt",
            max_length=2048,
            truncation=True,
        )
        for text in calibration_texts
    ]
    quantize_config = BaseQuantizeConfig(
        bits=bits,       # 4-bit quantization
        group_size=128,  # Quantize in groups of 128 weights
        desc_act=False,  # Skip activation-order quantization (faster inference)
    )
    model = AutoGPTQForCausalLM.from_pretrained(
        model_name,
        quantize_config=quantize_config,
    )
    # Quantize using calibration data
    model.quantize(calibration_examples)
    model.save_quantized(output_dir, use_safetensors=True)
    tokenizer.save_pretrained(output_dir)
    print(f"GPTQ model saved to {output_dir}")
    return output_dir


def load_and_run_gptq(model_dir: str, prompt: str) -> str:
    """Load GPTQ quantized model for inference."""
    from auto_gptq import AutoGPTQForCausalLM

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoGPTQForCausalLM.from_quantized(
        model_dir,
        use_safetensors=True,
        device="cuda:0",
    )
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=200)
    return tokenizer.decode(output[0], skip_special_tokens=True)
```
AWQ - Activation-Aware Weight Quantization
AWQ (Lin et al., 2023) observes that not all weights are equally important - weights corresponding to high-activation channels should be quantized more carefully. AWQ selects a per-channel scale that minimizes quantization error for the most important weights, without needing the expensive Hessian computation required by GPTQ.
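The per-channel scaling works because it is mathematically free before quantization: multiplying a weight row by a scale and dividing the matching activation channel by the same scale leaves the layer output unchanged. The NumPy sketch below checks only that identity; it is not AWQ's scale search, and the choice of channel 3 and scale 4.0 is an arbitrary illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 16))   # activations: (tokens, input channels)
W = rng.normal(size=(16, 4))   # weights: (input channels, output features)

# Hypothetical per-input-channel scales: pretend channel 3 is "salient"
s = np.ones(16)
s[3] = 4.0

Y_ref = X @ W
# Scale weight rows up and the matching activation channels down
Y_scaled = (X / s) @ (W * s[:, None])

print(np.allclose(Y_ref, Y_scaled))
```

Because the output is unchanged in exact arithmetic, the scales can be chosen purely to make the scaled weights lose less when they are rounded to 4 bits, which is the quantity AWQ's calibration step optimizes.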
```python
def quantize_model_awq(
    model_name: str,
    output_dir: str,
    bits: int = 4,
    group_size: int = 128,
) -> str:
    """
    Quantize a model using AWQ.
    Requires: pip install autoawq
    """
    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoAWQForCausalLM.from_pretrained(
        model_name,
        safetensors=True,
    )
    quant_config = {
        "zero_point": True,  # Use zero-point quantization
        "q_group_size": group_size,
        "w_bit": bits,
        "version": "GEMM",   # GEMM kernel - faster than GEMV for batch inference
    }
    # With no calib_data argument, AutoAWQ falls back to its built-in default
    # calibration set (pileval in the releases this was written against)
    model.quantize(tokenizer, quant_config=quant_config)
    model.save_quantized(output_dir)
    tokenizer.save_pretrained(output_dir)
    print(f"AWQ model saved to {output_dir}")
    return output_dir
```
Quantization Format Comparison
| Method | When to Use | Speed (tokens/s) | Quality vs FP16 |
|---|---|---|---|
| FP16 | Training, highest quality inference | 1x baseline | 100% |
| BF16 | Training on Ampere+ GPUs | ~1x | ~100% |
| INT8 (bitsandbytes) | When memory-constrained, minimal quality loss | 0.9x | 99% |
| GPTQ INT4 | Production inference, balanced quality | 1.5-2x | 97-99% |
| AWQ INT4 | Production inference, fastest | 1.5-2.5x | 97-99% |
| QLoRA NF4 | Training on consumer GPUs | N/A (training) | ~99% |
| INT2 | Edge/mobile deployment where quality loss is acceptable | 3-4x | 85-92% |
GPTQ vs AWQ in 2025
Both achieve similar quality at INT4. The key differences: GPTQ is more widely supported (vLLM, TGI, and text-generation-webui all load it). AWQ is often faster at generation time, because its GEMM kernel maps efficiently onto GPU tensor cores. For many production deployments serving 7B-70B models, AWQ at INT4 is a strong default, though which format is faster depends on the serving stack, batch size, and kernel, so benchmark both on your own workload. Pre-quantized AWQ checkpoints for most popular LLMs are available on the HuggingFace Hub (for example, TheBloke's quantized model collections).
Interview Q&A
Q1: What is the difference between NF4 and INT4 quantization?
INT4 quantization divides the weight range into 16 equally spaced levels, which is only optimal if weights are uniformly distributed. NF4 (4-bit NormalFloat) is designed to be information-theoretically optimal for weights that follow a zero-centered normal (Gaussian) distribution. Because neural network weights are approximately normally distributed around zero, NF4 places more quantization levels near zero (where most weights cluster) and fewer in the tails (where few weights exist). For such weights, NF4 generally gives lower quantization error than INT4 at the same bit width. The 16 NF4 levels are derived from quantiles of the standard normal distribution, so that each level covers roughly equal probability mass, and are then rescaled to [-1, 1]; the construction is slightly asymmetric so that zero is represented exactly. Each block of weights is normalized by its absolute maximum before being mapped to the nearest level.
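The difference in level spacing is easy to see numerically. The sketch below builds 16 NF4-style levels from standard normal quantiles using only the Python standard library and compares their spacing with a uniform 16-level grid. The helper names and the `offset` value of 0.9677083 are assumptions chosen for illustration (the offset is meant to mirror the quantile range used by the reference implementation), so treat the result as an approximation of the NF4 codebook and not as the library's exact table.

```python
from statistics import NormalDist


def linspace(start: float, stop: float, num: int) -> list[float]:
    """Evenly spaced values from start to stop, inclusive."""
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]


def nf4_style_levels(offset: float = 0.9677083) -> list[float]:
    """16 levels from normal quantiles: 7 negative, an exact zero, 8 positive.

    `offset` (an assumed value) is the outermost quantile probability.
    """
    normal = NormalDist()
    positive = [normal.inv_cdf(p) for p in linspace(offset, 0.5, 9)[:-1]]
    negative = [-normal.inv_cdf(p) for p in linspace(offset, 0.5, 8)[:-1]]
    levels = sorted(negative + [0.0] + positive)
    scale = max(levels)
    return [v / scale for v in levels]  # rescale into [-1, 1]


def uniform_levels(num: int = 16) -> list[float]:
    """Equally spaced levels on [-1, 1], as in plain INT4-style quantization."""
    return linspace(-1.0, 1.0, num)


def gaps(levels: list[float]) -> list[float]:
    """Distance between each pair of neighbouring levels."""
    return [b - a for a, b in zip(levels, levels[1:])]


def quantize(x: float, levels: list[float]) -> float:
    """Map a normalized weight in [-1, 1] to its nearest level."""
    return min(levels, key=lambda level: abs(level - x))


nf4 = nf4_style_levels()
int4 = uniform_levels()
print("NF4-style levels:", [round(v, 3) for v in nf4])
print("NF4-style gap next to zero:", round(gaps(nf4)[7], 3))
print("NF4-style outermost gap:", round(gaps(nf4)[-1], 3))
print("uniform gap:", round(gaps(int4)[0], 3))
```

The NF4-style gaps are narrow around zero and widen toward ±1, while the uniform grid spends the same resolution everywhere, including in the tails where few weights fall.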
Q2: What is double quantization in QLoRA and why does it matter?
In standard 4-bit quantization, each weight is stored in 4 bits, but each block of 64 weights also has a quantization constant (scale factor) stored in FP32 (32 bits). These constants add 32/64 = 0.5 bits per weight of overhead. Double quantization (DQ) quantizes the constants themselves to 8 bits, in blocks of 256, with one FP32 constant per block for the second level. The overhead becomes 8/64 + 32/(64 × 256) ≈ 0.127 bits per weight, a saving of about 0.373 bits per weight. For a 7B model, that is 0.373 bits × 7B weights ≈ 2.6 gigabits, or roughly 0.33 GB of GPU memory; for a 65B model it is about 3 GB. With DQ, NF4 storage costs approximately 4.13 bits per parameter in total (versus 4.5 bits without DQ), while QLoRA still recovers nearly all the quality of BF16 fine-tuning.
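The overhead arithmetic can be checked directly. The following sketch assumes the block sizes described in the QLoRA paper (64 weights per first-level block, 256 constants per second-level block); the function names are hypothetical, and the calculation counts only quantization-constant overhead, not activations or adapter weights.

```python
def constant_overhead_bits(
    block_size: int = 64,
    double_quant: bool = False,
    second_block_size: int = 256,
) -> float:
    """Bits of quantization-constant overhead per weight."""
    if not double_quant:
        return 32 / block_size  # one FP32 constant per block of weights
    # One 8-bit constant per block of weights, plus one FP32 constant
    # shared by every `second_block_size` first-level constants.
    return 8 / block_size + 32 / (block_size * second_block_size)


def saved_gigabytes(num_params: float) -> float:
    """Memory saved by double quantization, in GB (1 GB = 1e9 bytes)."""
    saved_bits = constant_overhead_bits() - constant_overhead_bits(double_quant=True)
    return saved_bits * num_params / 8 / 1e9


print(f"without DQ: {4 + constant_overhead_bits():.3f} bits/param")
print(f"with DQ:    {4 + constant_overhead_bits(double_quant=True):.3f} bits/param")
print(f"7B model saves  ~{saved_gigabytes(7e9):.2f} GB")
print(f"65B model saves ~{saved_gigabytes(65e9):.2f} GB")
```

Note the division by 8 when converting bits to bytes: the saving for a 7B model is about a third of a gigabyte, and it only reaches the multi-gigabyte range at around 65B parameters.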
Q3: What are paged optimizers and when are they needed?
Paged optimizers allocate optimizer state (the AdamW first and second moment estimates) in NVIDIA unified memory, which can transparently page data between GPU memory and CPU DRAM. Without them, a mini-batch with an unusually long sequence can push activation memory past the available VRAM and trigger an out-of-memory (OOM) error. With paged optimizers, the optimizer state is evicted to CPU DRAM (slower but much larger) when the GPU runs short, then paged back in when the optimizer step needs it. The cost is a latency spike on those specific batches. In practice, paged optimizers are most valuable when inputs vary in length and you cannot reliably predict peak VRAM usage per batch.
Q4: How does QLoRA training quality compare to full fine-tuning?
Dettmers et al. (2023) showed that QLoRA (NF4 + LoRA) approaches or matches 16-bit fine-tuning quality on the MMLU benchmark for models up to 65B. As a rough quality ordering: full FP16 fine-tuning > LoRA FP16 > QLoRA NF4 > LoRA INT8. The gap between full fine-tuning and QLoRA is typically 0.5–2% on most benchmarks, which for most practical applications is smaller than the variance introduced by different training data or hyperparameters. QLoRA also becomes more competitive as model size increases: larger models (65B-70B) tend to be more robust to 4-bit quantization, so the relative quality loss shrinks.
Q5: What should you do if QLoRA training produces worse results than expected?
Work through this diagnostic checklist:
1. Check the learning rate. QLoRA is more sensitive to learning rate than FP16 LoRA; try 2e-4 for r=16 and halve it if the loss diverges.
2. Verify the LoRA target modules. For domain adaptation tasks, include the MLP layers (gate_proj, up_proj, down_proj) in addition to attention.
3. Check the rank. r=8 is often too low; try r=32 or r=64 for complex tasks.
4. Verify data quality. NF4 adds quantization noise to the base weights, so if your task requires very precise representations, low-quality data can hurt more than it would in FP16 training.
5. Try lora_alpha = 2*r instead of lora_alpha = r for stronger LoRA scaling.
6. If you use Flash Attention 2, verify that it is compatible with your bitsandbytes version; version mismatches can silently degrade quality.
