Learn how QLoRA combines 4-bit quantization with LoRA to fine-tune 65B parameter models on a single consumer GPU, using NF4 quantization, double quantization, and paged optimizers.

QLoRA: Quantized Low-Rank Adaptation

The Single-GPU Dream

The year is 2023. Tim Dettmers, a PhD student at the University of Washington, is working on what seems like an impossible goal: fine-tune a 65 billion parameter language model on a single consumer GPU.

A 65B model in FP16 (2 bytes per parameter) requires 130GB of memory for the weights alone. The most powerful consumer GPU available - the RTX 4090 - has 24GB. Even if you quantize to 8-bit (halving the weight memory to 65GB), you are 2.7x over what fits on a single GPU. The problem seems fundamentally intractable.

Dettmers and his co-authors develop three techniques (NF4 quantization, double quantization, and paged optimizers), each independently interesting, and combine them into QLoRA, published in May 2023. The result: fine-tune a 65B model on a single NVIDIA A100 80GB GPU (the paper reports that a single 48GB GPU is enough). Or a 33B model on a consumer 24GB RTX 4090. Or a 7B model on a 12GB RTX 3080.

Fine-tuning that previously required a $100K+ cluster became possible on a laptop. The paper's code was open-sourced through the bitsandbytes library. Within weeks, the open-source community was training custom 65B models in dorm rooms.

Why This Exists: The Remaining Memory Problem After LoRA

LoRA dramatically reduced the number of trainable parameters, but it did not solve the full memory problem. Here is what LoRA achieves for a 7B model:

Memory component	Full fine-tuning	LoRA (r=16)
Model weights (BF16)	14GB	14GB
Gradients	14GB	~0.03GB (LoRA only)
Optimizer states (Adam)	28GB	~0.06GB (LoRA only)
Activations	~4GB	~4GB
Total	~60GB	~18GB

LoRA reduced the total training memory from ~60GB to ~18GB. But the base model weights (14GB) must still be loaded in full, and they now dominate the budget. For a 7B model, this fits on a 24GB GPU. For a 65B model (130GB in BF16), no consumer GPU can hold it.
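The arithmetic behind the table can be reproduced with a short sketch. The helper name `training_memory_gb` and the figure of 15 million trainable LoRA parameters are illustrative assumptions chosen to match the table; the 4GB activation figure is taken from the table as-is.

```python
def training_memory_gb(total_params, trainable_params, activations_gb=4.0):
    """Rough BF16 training memory in GB (hypothetical helper, not a library API)."""
    bytes_per_value = 2                                       # BF16
    weights = total_params * bytes_per_value / 1e9            # all weights are loaded
    gradients = trainable_params * bytes_per_value / 1e9      # one gradient per trainable param
    optimizer = 2 * trainable_params * bytes_per_value / 1e9  # Adam: two moments per trainable param
    return {
        "weights": weights,
        "gradients": gradients,
        "optimizer": optimizer,
        "activations": activations_gb,
        "total": weights + gradients + optimizer + activations_gb,
    }

full = training_memory_gb(total_params=7e9, trainable_params=7e9)
lora = training_memory_gb(total_params=7e9, trainable_params=15e6)  # assumed adapter size

for name, result in [("full fine-tuning", full), ("LoRA", lora)]:
    print(f"{name}: {result['total']:.1f} GB total, {result['weights']:.0f} GB of it weights")
```

In both configurations the weights term is identical, which is exactly the part QLoRA goes after.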

The remaining bottleneck: the base model weights themselves. Can you quantize the base model weights aggressively enough that they fit on one GPU, without destroying model quality?

QLoRA's answer: yes, but it requires careful engineering.

NF4: NormalFloat 4-bit Quantization

Standard 4-bit quantization (INT4) uses 4-bit integers to represent weights. But 4 bits gives you only 16 distinct values to represent the entire range of a weight matrix. The question is: how do you choose those 16 values to minimize quantization error?

INT4 uses evenly spaced values: $\{-8, -7, -6, \ldots, 6, 7\}$ (scaled to the range of the weight matrix). This is suboptimal because neural network weights are not uniformly distributed - they follow an approximately normal (Gaussian) distribution. Most weights cluster near zero; few weights have large magnitudes. Evenly spaced quantization wastes representational capacity on large values that rarely occur.

NF4 (NormalFloat 4-bit) chooses the 16 quantization bins such that each bin has equal probability mass under a standard normal distribution $\mathcal{N}(0, 1)$ . The QLoRA paper describes this as information-theoretically optimal for normally distributed data: every one of the 16 codes is used equally often, so no code is wasted on values that rarely occur.

To a first approximation, the NF4 quantization values are 16 evenly spaced quantiles of the standard normal distribution, rescaled to $[-1, 1]$:

$q_i = Q_\mathcal{N}\left(\frac{i + 0.5}{16}\right), \quad i = 0, 1, \ldots, 15$

where $Q_\mathcal{N}$ is the quantile function of the normal distribution. The actual NF4 data type differs slightly: it averages adjacent quantiles and uses an asymmetric construction so that zero is represented exactly.

The result: NF4 quantization introduces significantly less quantization error than INT4 for normally distributed weights, which describes most neural network weight matrices after training.

Double Quantization: Quantizing the Constants

4-bit quantization uses block quantization: weights are grouped into blocks of size 64, and each block stores one quantization constant (the maximum absolute value of the block, absmax). This absmax value is stored as FP32 (4 bytes) per block.
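A minimal sketch of block-wise absmax quantization in pure Python. It assumes the simplified quantile levels from the formula given earlier rather than the exact bitsandbytes NF4 table, and the helper names (`nf4_levels`, `quantize_block`, `dequantize_block`) are hypothetical.

```python
import random
from statistics import NormalDist

BLOCK_SIZE = 64

def nf4_levels():
    """16 normal quantiles, rescaled to [-1, 1] (simplified, not the exact NF4 table)."""
    raw = [NormalDist().inv_cdf((i + 0.5) / 16) for i in range(16)]
    largest = max(abs(v) for v in raw)
    return [v / largest for v in raw]

LEVELS = nf4_levels()

def quantize_block(block):
    """Return (absmax, codes): one float constant plus one 4-bit code per weight."""
    absmax = max(abs(w) for w in block) or 1.0   # avoid dividing by zero for an all-zero block
    codes = [
        min(range(16), key=lambda i: abs(w / absmax - LEVELS[i]))
        for w in block
    ]
    return absmax, codes

def dequantize_block(absmax, codes):
    """Look up each code and rescale by the block's absmax."""
    return [LEVELS[c] * absmax for c in codes]

random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(4 * BLOCK_SIZE)]

blocks = [weights[i:i + BLOCK_SIZE] for i in range(0, len(weights), BLOCK_SIZE)]
quantized = [quantize_block(b) for b in blocks]
restored = [w for absmax, codes in quantized for w in dequantize_block(absmax, codes)]

max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(f"{len(blocks)} blocks, max reconstruction error {max_error:.5f}")
```

Each block carries its own absmax, so a single outlier only coarsens the grid for the 64 weights that share its block rather than for the whole matrix.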

Memory overhead of quantization constants: for a 7B parameter model with blocks of size 64:

Number of blocks: 7B / 64 = 109 million blocks
Memory for constants at FP32: 109M × 4 bytes = ~437MB

This overhead is significant. Double quantization quantizes the quantization constants themselves: instead of storing each absmax in FP32, quantize them to 8-bit with block size 256. This reduces the per-parameter overhead of quantization constants from 0.5 bits to 0.127 bits - a saving of 0.373 bits per parameter, or roughly 0.33GB for a 7B model.

For a 7B model, a saving of ~0.33GB sounds minor. For a 65B model, it becomes ~3GB - meaningful when every gigabyte counts.
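Working through the per-parameter figures (0.5 bits and 0.127 bits) directly gives the savings for each model size. This is a back-of-the-envelope sketch; the function names are hypothetical, and the block sizes are the ones stated in the text.

```python
BLOCK_SIZE = 64          # weights per first-level block
SECOND_BLOCK_SIZE = 256  # first-level constants per second-level block

def constant_overhead_bits(double_quant):
    """Bits of quantization-constant storage per model parameter."""
    if not double_quant:
        return 32 / BLOCK_SIZE                              # one FP32 absmax per block
    first_level = 8 / BLOCK_SIZE                            # absmax stored in 8 bits
    second_level = 32 / (BLOCK_SIZE * SECOND_BLOCK_SIZE)    # FP32 constant per 256 absmax values
    return first_level + second_level

def overhead_gb(num_params, double_quant):
    """Total constant storage in GB for a model with num_params weights."""
    return num_params * constant_overhead_bits(double_quant) / 8 / 1e9

for params in (7e9, 65e9):
    saved = overhead_gb(params, False) - overhead_gb(params, True)
    print(f"{params / 1e9:.0f}B params: saves {saved:.2f} GB")
```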

Paged Optimizers: Handling Memory Spikes

Even with 4-bit base model weights and LoRA, there is one remaining memory problem: transient memory spikes.

During normal training, the LoRA parameters and their optimizer states (Adam's first and second moments) are small enough to fit in GPU memory. But a mini-batch with unusually long sequences inflates activation memory (including the recomputation done by gradient checkpointing), and that brief spike can trigger an out-of-memory (OOM) error.

This is frustrating: your model fits in memory 99% of the time, but crashes on 1% of batches due to momentary memory pressure. Restarting the training run costs you hours.

Paged optimizers solve this by allocating optimizer states in NVIDIA unified memory. When the GPU runs out of memory, optimizer state pages are automatically evicted ("paged out") to CPU RAM, freeing room for the spike. When needed for an update step, they are paged back in.

The overhead: CPU-GPU memory transfer is slow (~32 GB/s PCIe vs ~2 TB/s HBM). But optimizer states are only accessed once per update step, not every forward pass, and paging only happens under memory pressure - so the latency impact is usually small. A slightly slower run is the difference between running at all and crashing.
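To get a feel for why the transfer cost is tolerable, here is a rough estimate of how long it takes to move the full Adam state across PCIe once. The 67 million trainable parameters are an assumed adapter size, the bandwidth is the ~32 GB/s figure quoted above, and `paging_transfer_ms` is a hypothetical helper.

```python
def paging_transfer_ms(trainable_params, bytes_per_state, bandwidth_gb_s=32.0):
    """Time to move all Adam state across PCIe once, in milliseconds (rough estimate)."""
    state_bytes = trainable_params * 2 * bytes_per_state   # two moments per parameter
    return state_bytes / (bandwidth_gb_s * 1e9) * 1e3

lora_params = 67_000_000   # assumed adapter size

print(f"32-bit Adam state: {paging_transfer_ms(lora_params, 4):.1f} ms per full transfer")
print(f" 8-bit Adam state: {paging_transfer_ms(lora_params, 1):.1f} ms per full transfer")
```

Because only the adapter has optimizer state, the amount of data that can be paged is small compared with the base model, and a full round trip costs milliseconds rather than seconds.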

The QLoRA Recipe

Put it all together:

Load base model in NF4: weights stored as 4-bit NF4, reducing 65B weights from 130GB (BF16) to ~32.5GB
Apply LoRA adapters in BF16: the small LoRA matrices ( $A$ and $B$ ) are stored and trained in BF16. Gradients are backpropagated through the frozen base weights, but only these matrices are updated.
Dequantize on the fly: during the forward pass, NF4 weights are dequantized to BF16 for the matrix multiplication and then discarded. The dequantized weights are not stored - they are computed on the fly for each layer as needed. This is computationally more expensive than BF16 training but uses much less memory.
Paged optimizer: Adam optimizer states for LoRA parameters live in paged memory that can spill to CPU RAM under memory pressure.
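
The recipe above can be summarized as a single forward-pass expression for one linear layer. The notation beyond $A$ and $B$ is an assumption added for illustration: $x$ is the layer input, $W_{\text{NF4}}$ the stored 4-bit weight, $\operatorname{dequant}$ the on-the-fly conversion to BF16, and $\alpha / r$ the usual LoRA scaling factor.

$h = \operatorname{dequant}\left(W_{\text{NF4}}\right) x + \frac{\alpha}{r} B A x$

Only $A$ and $B$ receive updates; $W_{\text{NF4}}$ stays frozen, and its BF16 form exists only for the duration of the matrix multiplication.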

Memory comparison for 65B model:

Configuration	Memory for weights	Trainable params	Total approx
Full BF16 fine-tuning	130GB	65B	520GB+
LoRA in BF16	130GB	~130M	~140GB
QLoRA (NF4 base)	32.5GB	~130M	~40GB

QLoRA enables fine-tuning a 65B model on a single A100 80GB GPU with 40GB of memory for the model, leaving 40GB for activations, optimizer states, and batch data.

Quantization Error and Quality

Does 4-bit quantization hurt quality? QLoRA is trained with:

4-bit NF4 storage for the frozen base weights (the same quantized weights are used during training and at inference time)
BF16 compute for the matrix multiplications (dequantization happens on the fly)
BF16 LoRA adapters (trained in 16-bit precision)

The key finding from Dettmers et al. (2023): across their benchmarks, models fine-tuned with QLoRA (NF4 plus double quantization) matched the performance of 16-bit LoRA and 16-bit full fine-tuning, at a fraction of the memory cost. The NF4 quantization error is small enough that the LoRA training compensates for it.

However, QLoRA is not numerically identical to 16-bit LoRA. For tasks requiring high mathematical precision or very fine-grained numerical reasoning, the quantization error can compound. In production, if you can afford the memory, BF16 LoRA is preferred. QLoRA is the choice when memory is the binding constraint.

Code: QLoRA Fine-tuning

"""
QLoRA fine-tuning with bitsandbytes + PEFT.
Fine-tune a 7B or larger model with 4-bit quantization.
"""

import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
)
from peft import (
    LoraConfig,
    get_peft_model,
    TaskType,
    prepare_model_for_kbit_training,
)
from trl import SFTTrainer, SFTConfig
from datasets import Dataset


def load_model_in_4bit(model_name: str):
    """
    Load model in 4-bit NF4 with double quantization.
    This is the QLoRA base configuration.
    """
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,              # Use 4-bit quantization
        bnb_4bit_quant_type="nf4",      # NormalFloat 4-bit (better than int4)
        bnb_4bit_compute_dtype=torch.bfloat16,   # Compute dtype for dequantized ops
        bnb_4bit_use_double_quant=True, # Double quantization for quantization constants
    )

    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        device_map="auto",              # Automatically distribute across available GPUs
        use_cache=False,                # Required for gradient checkpointing
    )

    # Prepare for k-bit training:
    # 1. Casts layernorm to float32 for stability
    # 2. Upcast output embedding to float32
    # 3. Enables gradient checkpointing
    model = prepare_model_for_kbit_training(model)

    return model


def create_qlora_model(
    base_model_name: str,
    lora_r: int = 64,          # Dettmers et al. used r=64 in original QLoRA paper
    lora_alpha: int = 16,
    lora_dropout: float = 0.1,
):
    """Create a QLoRA-ready model."""
    model = load_model_in_4bit(base_model_name)

    lora_config = LoraConfig(
        r=lora_r,
        lora_alpha=lora_alpha,
        lora_dropout=lora_dropout,
        target_modules=[
            "q_proj", "v_proj", "k_proj", "o_proj",
            "gate_proj", "up_proj", "down_proj"
        ],
        bias="none",
        task_type=TaskType.CAUSAL_LM,
        inference_mode=False,
    )

    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()

    return model


def qlora_training_pipeline(
    base_model_name: str = "meta-llama/Llama-2-7b-hf",
    output_dir: str = "./qlora-adapter",
    dataset: Dataset = None,
):
    """
    Complete QLoRA training pipeline.
    Memory footprint for 7B model: ~12GB (fits on RTX 3080)
    """
    tokenizer = AutoTokenizer.from_pretrained(base_model_name)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "right"

    model = create_qlora_model(base_model_name)

    if dataset is None:
        # Example data
        dataset = Dataset.from_list([
            {"text": "### Instruction:\nWhat is gradient descent?\n\n### Response:\nGradient descent is an optimization algorithm that iteratively updates model parameters in the direction of steepest descent of the loss function."}
        ])

    sft_config = SFTConfig(
        output_dir=output_dir,
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,

        # QLoRA typically uses a higher rank (64); the paper used a learning rate
        # of 2e-4 for 7B/13B models and 1e-4 for 33B/65B
        learning_rate=2e-4,
        lr_scheduler_type="constant_with_warmup",
        warmup_steps=100,
        weight_decay=0.001,

        # Precision settings
        bf16=True,                     # BF16 compute (base model is NF4, compute is BF16)
        fp16=False,

        max_seq_length=2048,           # Renamed to max_length in newer TRL releases
        packing=False,

        # Optimizer: use paged_adamw for memory spike handling
        optim="paged_adamw_8bit",     # Paged optimizer with 8-bit Adam

        logging_steps=10,
        save_steps=200,
        report_to="none",
        max_grad_norm=0.3,            # Tighter gradient clipping for 4-bit training
    )

    trainer = SFTTrainer(
        model=model,
        args=sft_config,
        train_dataset=dataset,
        processing_class=tokenizer,
    )

    trainer.train()

    # Save LoRA adapter (base model stays as NF4 quantized)
    model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)
    print(f"QLoRA adapter saved to {output_dir}")
    return trainer


# ---- Memory estimation utility ----

def estimate_qlora_memory(
    model_params_billions: float,
    lora_r: int = 64,
    num_layers: int = 32,
    hidden_dim: int = 4096,
    batch_size: int = 4,
    sequence_length: int = 2048,
):
    """
    Estimate GPU memory requirements for QLoRA training.
    All estimates are approximate.
    """
    # Base model: 4 bits per parameter
    base_model_memory_gb = (model_params_billions * 1e9 * 0.5) / 1e9

    # LoRA adapter: BF16 (2 bytes)
    # Approximate LoRA params: 2 * num_layers * 4 * hidden_dim * r (Q, K, V, O projections)
    lora_params = 2 * num_layers * 4 * hidden_dim * lora_r
    lora_memory_gb = (lora_params * 2) / 1e9

    # Optimizer states (8-bit Adam for LoRA params): 1 byte per moment x 2 moments
    optimizer_gb = (lora_params * 2) / 1e9

    # Activations: rough estimate
    activation_gb = (batch_size * sequence_length * hidden_dim * num_layers * 2) / 1e9

    total_gb = base_model_memory_gb + lora_memory_gb + optimizer_gb + activation_gb

    print(f"QLoRA Memory Estimate for {model_params_billions}B model:")
    print(f"  Base model (NF4):     {base_model_memory_gb:.1f} GB")
    print(f"  LoRA adapters (BF16): {lora_memory_gb:.2f} GB")
    print(f"  Optimizer states:     {optimizer_gb:.2f} GB")
    print(f"  Activations:          {activation_gb:.1f} GB")
    print(f"  Total estimate:       {total_gb:.1f} GB")
    return total_gb


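# Example (the defaults describe a 7B-class model: 32 layers, hidden_dim 4096):
# estimate_qlora_memory(7)   -> total of roughly 6 GB
#
# Treat the result as a lower bound: it ignores quantization constants, the CUDA
# context, attention/MLP intermediates, and allocator fragmentation. For larger
# models, pass their actual num_layers and hidden_dim. Rough practical guidance:
#   7B  -> 12GB-class GPU (e.g. RTX 3080 12GB)
#   13B -> 24GB-class GPU (e.g. RTX 3090)
#   33B -> 24GB GPU with care (e.g. RTX 4090)
#   65B -> 48GB-80GB GPU (e.g. A100 80GB)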

Comparing Quantization Formats

"""
Demonstrate the difference between INT4 and NF4 quantization
for normally distributed weight values.
"""

import numpy as np


def quantize_int4(values: np.ndarray) -> tuple:
    """Standard INT4 symmetric quantization."""
    abs_max = np.max(np.abs(values))
    scale = abs_max / 7.0  # 4-bit symmetric: range -8 to 7

    # Quantize and dequantize
    quantized = np.clip(np.round(values / scale), -8, 7).astype(np.int8)
    dequantized = quantized * scale
    return dequantized, np.mean((values - dequantized) ** 2)  # MSE


def get_nf4_quantization_points() -> np.ndarray:
    """
    Compute 16 NF4-style quantization points as normal distribution quantiles.
    This is the simplified equal-probability construction, not the exact
    NF4 table shipped with bitsandbytes.
    """
    from scipy.stats import norm
    # 16 evenly spaced quantiles of N(0,1)
    quantiles = [(i + 0.5) / 16 for i in range(16)]
    nf4_points = norm.ppf(quantiles)
    # Normalize to [-1, 1]
    nf4_points = nf4_points / np.max(np.abs(nf4_points))
    return nf4_points


def quantize_nf4(values: np.ndarray) -> tuple:
    """NF4 quantization using normal distribution quantile points."""
    nf4_points = get_nf4_quantization_points()

    abs_max = np.max(np.abs(values))
    normalized = values / abs_max  # Normalize to [-1, 1]

    # For each value, find nearest NF4 quantile point
    quantized_indices = np.argmin(
        np.abs(normalized[:, np.newaxis] - nf4_points[np.newaxis, :]),
        axis=1
    )
    dequantized_normalized = nf4_points[quantized_indices]
    dequantized = dequantized_normalized * abs_max

    return dequantized, np.mean((values - dequantized) ** 2)  # MSE


if __name__ == "__main__":
    np.random.seed(42)

    # Simulate a weight vector with normal distribution (typical for neural networks)
    weights = np.random.randn(10000) * 0.02  # Small normal weights

    _, int4_mse = quantize_int4(weights)
    _, nf4_mse = quantize_nf4(weights)

    print(f"INT4 quantization MSE: {int4_mse:.8f}")
    print(f"NF4 quantization MSE:  {nf4_mse:.8f}")
    print(f"NF4 improvement: {int4_mse / nf4_mse:.2f}x lower error")
    # NF4 should show clearly lower MSE for normally distributed weights;
    # the exact ratio depends on the sample and on how the scale is chosen

Production Engineering Notes

When to Use QLoRA vs Regular LoRA

Scenario	Recommendation
7B model, 24GB GPU	Regular LoRA in BF16 (fits fine)
13B model, 24GB GPU	QLoRA with NF4
33B model, 80GB GPU	Either - BF16 LoRA preferred for quality
65B model, 80GB GPU	QLoRA required
7B model, 12GB laptop GPU	QLoRA required
Production serving	Merge, then re-quantize with GPTQ or AWQ

After QLoRA Training: Handling the Final Model

After QLoRA training, your model has NF4 weights (from the base model) and BF16 LoRA adapters. For production deployment:

Option 1 (simplest): Load NF4 base + LoRA adapter at inference time. The NF4 model is about 4x smaller than BF16, but it is typically somewhat slower, because weights must be dequantized on the fly - suitable when memory matters more than latency.
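
A minimal sketch of Option 1, assuming the adapter and tokenizer were saved to the same directory as in the training pipeline earlier. The function names `load_qlora_for_inference` and `generate` are hypothetical; the library calls are the same ones used elsewhere in this lesson.

```python
def load_qlora_for_inference(base_model_name: str, adapter_dir: str):
    """Load the NF4 base model and attach a trained LoRA adapter (sketch)."""
    # Imports are local so the function can be defined without the libraries installed.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    # Same quantization settings as during training, so the adapter sees
    # the same base weights it was trained against.
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,
    )
    base_model = AutoModelForCausalLM.from_pretrained(
        base_model_name,
        quantization_config=bnb_config,
        device_map="auto",
    )
    model = PeftModel.from_pretrained(base_model, adapter_dir)
    model.eval()  # disable dropout in the adapter

    tokenizer = AutoTokenizer.from_pretrained(adapter_dir)
    return model, tokenizer


def generate(model, tokenizer, prompt: str, max_new_tokens: int = 200) -> str:
    """Greedy generation with the adapter attached."""
    import torch

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output[0], skip_special_tokens=True)
```

Usage would look like `model, tokenizer = load_qlora_for_inference("meta-llama/Llama-2-7b-hf", "./qlora-adapter")` followed by `generate(model, tokenizer, "### Instruction:\n...")`.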

Option 2 (best quality): Merge LoRA into the dequantized BF16 model, then re-quantize the merged model using GPTQ or AWQ (higher quality quantization methods designed for inference). This produces the best inference quality and speed.

from peft import PeftModel
from transformers import AutoModelForCausalLM
import torch

# Load full-precision base (for clean merge)
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.bfloat16,  # Full precision for the merge
)

# Load QLoRA adapter trained on NF4 base
model_with_lora = PeftModel.from_pretrained(base_model, "./qlora-adapter/")

# Merge and save as clean BF16 model
merged_model = model_with_lora.merge_and_unload()
merged_model.save_pretrained("./merged-bf16-model/")
# Now re-quantize with GPTQ/AWQ for production

note

The Guanaco result from the QLoRA paper: Dettmers et al. fine-tuned LLaMA-65B on a roughly 9,000-example subset of the OpenAssistant (OASST1) dataset using QLoRA in approximately 24 hours on a single GPU. The paper reports that the resulting Guanaco-65B outperformed all previously released open models on the Vicuna benchmark, reaching 99.3% of ChatGPT's performance level. This was a landmark result for open-source LLM fine-tuning.

Common Mistakes

danger

Loading a 4-bit quantized model without prepare_model_for_kbit_training: When fine-tuning a 4-bit quantized model, you must call prepare_model_for_kbit_training(model) before adding LoRA adapters. This function: (1) casts LayerNorm layers to FP32 (critical for gradient stability); (2) handles the output embedding; (3) prepares the model for gradient checkpointing. Skipping this step can cause NaN losses during training.

danger

Using fp16=True instead of bf16=True with 4-bit training: In this setup, bitsandbytes is configured to compute in BF16 (bnb_4bit_compute_dtype=torch.bfloat16). If you also set fp16=True in the training arguments, there is a mismatch between the compute dtype expected by the model and the precision the Trainer is using. Always use bf16=True, fp16=False when the compute dtype is BF16.

warning

Underestimating sequence length memory impact: In QLoRA, the base model weights are 4x smaller than normal, but activations remain in BF16. For a very long sequence (4096+ tokens), the activation memory can dominate. The activation memory scales as O(batch_size * seq_len * hidden_dim * num_layers). If you are OOMing even with QLoRA, try: (1) reducing sequence length; (2) reducing batch size to 1; (3) increasing gradient accumulation to compensate; (4) enabling gradient checkpointing more aggressively.
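
The scaling rule and the "smaller batch, more accumulation" remedy can be sketched as follows. The formula is the same rough one-hidden-state-per-token-per-layer estimate used in the memory utility earlier, so absolute numbers are only indicative; both helper names are hypothetical, and the halving loop assumes a power-of-two batch size.

```python
def activation_gb(batch_size, seq_len, hidden_dim=4096, num_layers=32, bytes_per_value=2):
    """Rough activation estimate: one BF16 hidden state per token per layer."""
    return batch_size * seq_len * hidden_dim * num_layers * bytes_per_value / 1e9

def shrink_micro_batch(batch_size, grad_accum, seq_len, budget_gb):
    """Halve the micro-batch and double accumulation until activations fit the budget."""
    while batch_size > 1 and activation_gb(batch_size, seq_len) > budget_gb:
        batch_size //= 2
        grad_accum *= 2      # keeps batch_size * grad_accum (the effective batch) constant
    return batch_size, grad_accum

print(f"2048 tokens: {activation_gb(4, 2048):.2f} GB")
print(f"8192 tokens: {activation_gb(4, 8192):.2f} GB")
print(shrink_micro_batch(batch_size=4, grad_accum=4, seq_len=8192, budget_gb=3.0))
```

Quadrupling the sequence length quadruples the estimate, while trading micro-batch size for accumulation steps brings it back down without changing the effective batch size of 16.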

tip

Use paged_adamw_32bit for the most stable QLoRA training: The bitsandbytes library provides paged_adamw_8bit (memory-efficient) and paged_adamw_32bit (more numerically stable). For most cases, paged_adamw_8bit is fine. If you observe training instability (loss oscillations, gradient spikes), switch to paged_adamw_32bit or even the standard adamw_torch optimizer with gradient checkpointing. The paged optimizer's primary benefit is preventing OOM during memory spikes, not improving optimization quality.

Interview Q&A

Q1: What are the three innovations in QLoRA and what problem does each solve?

QLoRA combines three innovations. First, NF4 (NormalFloat 4-bit) quantization: stores base model weights in 4 bits using quantization points optimized for the normal distribution of neural network weights, reducing weight memory by 4x (130GB to 32.5GB for a 65B model) with less quantization error than standard INT4. Second, double quantization: quantizes the per-block quantization constants (stored as FP32 in standard 4-bit quantization) to 8-bit, saving roughly 0.33GB for a 7B model and about 3GB for a 65B model. Third, paged optimizers: keep LoRA adapter optimizer states (Adam moments) in unified memory that is automatically paged to CPU RAM, preventing out-of-memory crashes during memory spikes from long-sequence batches. Together, these make 65B model fine-tuning possible on a single 80GB GPU.

Q2: Why is NF4 quantization better than standard INT4 for neural network weights?

Standard INT4 uses uniformly spaced quantization bins: 16 equally-spaced values across the weight range. But neural network weights follow an approximately normal (Gaussian) distribution - most values are near zero, few are large. Uniform spacing wastes bins on large-magnitude values that rarely occur. NF4 spaces the 16 quantization bins such that each bin covers an equal probability mass under the standard normal distribution. This concentrates representational capacity where the weights actually are: near zero. The result is lower quantization error for normally distributed weights, which translates to better model quality after fine-tuning.

Q3: During QLoRA training, what precision is used for what?

QLoRA uses multiple precisions simultaneously: base model weights stored in NF4 (4-bit), dequantized to BF16 on the fly for actual computation in each layer's forward pass (the dequantized tensor is not stored, just computed temporarily), LoRA adapter weights $A$ and $B$ stored and trained in BF16, gradient computations done in BF16, and optimizer states for LoRA parameters held in paged memory that can spill to CPU RAM (stored in 8-bit when using paged_adamw_8bit). The key insight: only 4 bits are stored for base model weights, but all arithmetic is done in BF16 to maintain numerical stability. This is why bnb_4bit_compute_dtype=torch.bfloat16 is critical - it specifies the computation dtype, not the storage dtype.

Q4: What is double quantization and is the memory savings worth the added complexity?

Double quantization quantizes the quantization constants themselves. Standard 4-bit block quantization stores one FP32 absmax value per block of 64 weights: (total_params / 64) * 4 bytes. For a 7B model: 109M blocks × 4 bytes = 437MB overhead. Double quantization quantizes these constants to 8-bit with block size 256: (total_params / 64) * 1 byte = 109MB, plus one FP32 second-level constant per 256 blocks (~2MB). The saving is ~0.33GB for 7B (minor) and ~3GB for 65B (more significant). The "complexity" is handled entirely by bitsandbytes - from a user perspective, it is a single boolean flag: bnb_4bit_use_double_quant=True. So yes, always enable it - the memory savings are free from a usability standpoint.

Q5: After QLoRA training, should you merge the LoRA adapters? What is the recommended production workflow?

For production, the recommended workflow is: (1) train with QLoRA (NF4 base + BF16 LoRA adapters); (2) load the full-precision (BF16) base model; (3) load the LoRA adapters; (4) call merge_and_unload() to get a clean BF16 model; (5) re-quantize the merged BF16 model using GPTQ or AWQ (inference-optimized quantization methods) for production serving. This produces better quality than keeping the NF4 base at inference time (GPTQ/AWQ are more carefully calibrated for inference quality) while maintaining the memory savings of quantization. The QLoRA training was just a means to an end - the resulting merged model can be deployed like any other quantized model.

Advanced: GPTQ and AWQ for Inference Quantization

QLoRA uses NF4 quantization optimized for training. For inference, two quantization methods designed for serving are more common: GPTQ and AWQ.

GPTQ (Frantar et al., 2022): Post-training quantization using approximate second-order information (the Hessian of each layer's reconstruction error) to minimize quantization error. GPTQ quantizes layer by layer, using calibration data to choose quantized weights that minimize reconstruction error for each weight matrix. Achieves near-FP16 quality at 4-bit on most benchmarks. A widely used standard for 4-bit inference in the open-source community.

AWQ (Lin et al., 2023): Activation-aware Weight Quantization. Observes that not all weights are equally important - weights corresponding to large activation values have a bigger impact on output quality when quantized. AWQ protects the most important 1% of weights by scaling them before quantization. The AWQ paper reports slightly better quality than GPTQ at 4-bit, with faster quantization time.
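
The scaling idea behind AWQ can be illustrated with a toy experiment. This is not the AWQ algorithm itself (which searches for per-channel scales using calibration activations); it only shows why scaling a salient channel up before quantization, and back down afterwards, reduces its error as long as the group's quantization grid stays the same. The scale of 2.0 and the weight range are assumptions for the demo.

```python
import random

def quantize_int4(value, step):
    """Round to the nearest INT4 grid point (symmetric, 4-bit range -8..7)."""
    return max(-8, min(7, round(value / step))) * step

random.seed(0)
group_absmax = 1.0            # largest weight in the group sets the grid
step = group_absmax / 7       # INT4 step size for this group
scale = 2.0                   # assumed per-channel scale for the salient channel

# Weights of one salient input channel, kept small enough that scaling
# them does not change the group's absmax (and therefore the grid).
salient_weights = [random.uniform(-0.4, 0.4) for _ in range(2000)]

plain_error = 0.0
scaled_error = 0.0
for w in salient_weights:
    plain_error += abs(quantize_int4(w, step) - w)
    # Scale up before quantization, scale back down afterwards
    # (in AWQ the inverse scale is folded into the activations).
    scaled_error += abs(quantize_int4(w * scale, step) / scale - w)

plain_error /= len(salient_weights)
scaled_error /= len(salient_weights)
print(f"mean |error| without scaling: {plain_error:.4f}")
print(f"mean |error| with scaling:    {scaled_error:.4f}")
```

The rounding error on the scaled weight has the same magnitude as before, but dividing by the scale afterwards shrinks it, so the effective error on the salient channel drops by roughly the scale factor.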

"""
Post-training quantization with GPTQ and AWQ for inference.
These are applied AFTER training (or after merging QLoRA adapters).
"""

# ---- GPTQ quantization ----
# Requires: pip install auto-gptq optimum

from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig
from datasets import load_dataset

def quantize_with_gptq(
    model_name: str,
    output_dir: str,
    bits: int = 4,
    group_size: int = 128,     # Smaller group = better quality, more memory
    dataset_name: str = "allenai/c4",  # Calibration dataset (C4 on the Hugging Face Hub)
    num_calibration_samples: int = 128,
):
    """
    Quantize a model using GPTQ.
    Uses calibration data to find optimal quantization points.
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name)

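    # Load calibration data: take the first N streamed examples.
    # (Calling next(iter(...)) repeatedly would return the same first example.)
    from itertools import islice

    calibration_data = load_dataset(dataset_name, "en", split="train", streaming=True)
    calibration_texts = [
        example["text"]
        for example in islice(calibration_data, num_calibration_samples)
    ]

    gptq_config = GPTQConfig(
        bits=bits,
        dataset=calibration_texts,
        tokenizer=tokenizer,      # Needed to tokenize the calibration texts
        group_size=group_size,
        desc_act=False,           # Activation ordering off: faster inference, slightly lower quality
        damp_percent=0.01,        # Damping for Hessian computation stability
    )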

    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=gptq_config,
        device_map="auto",
    )

    model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)
    print(f"GPTQ {bits}-bit model saved to {output_dir}")
    return model


# ---- AWQ quantization ----
# Requires: pip install autoawq

def quantize_with_awq(
    model_name: str,
    output_dir: str,
    bits: int = 4,
    group_size: int = 128,
):
    """
    Quantize a model using AWQ.
    Protects the most sensitive weights from quantization error.
    """
    from awq import AutoAWQForCausalLM

    model = AutoAWQForCausalLM.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

    quant_config = {
        "zero_point": True,      # Enable zero-point quantization
        "q_group_size": group_size,
        "w_bit": bits,
        "version": "GEMM",       # Kernel version for inference speed
    }

    # AWQ calibration - finds optimal scaling factors
    model.quantize(tokenizer, quant_config=quant_config)

    model.save_quantized(output_dir)
    tokenizer.save_pretrained(output_dir)
    print(f"AWQ {bits}-bit model saved to {output_dir}")
    return model


# ---- Comparing quantization formats: quality and speed ----

QUANTIZATION_COMPARISON = """
Quantization Method Comparison (7B model on A100; approximate, illustrative figures):

| Method      | Bits | Memory | Perplexity | Inference Speed |
|-------------|------|--------|------------|-----------------|
| BF16        | 16   | 14 GB  | Baseline   | 1.0x (baseline) |
| NF4 (QLoRA) | 4    | 3.5 GB | +0.3 PPL   | 0.85x (slower)  |
| GPTQ-4bit   | 4    | 3.5 GB | +0.2 PPL   | 1.3x (faster)   |
| AWQ-4bit    | 4    | 3.5 GB | +0.15 PPL  | 1.4x (faster)   |
| GPTQ-8bit   | 8    | 7.0 GB | +0.05 PPL  | 1.1x (faster)   |

Notes:
- Figures vary with model, kernels, batch size, and hardware; benchmark your own setup
- NF4 is slower than BF16 because dequantization adds overhead per layer
- GPTQ and AWQ use GPU kernels optimized for inference (not training)
- AWQ slightly outperforms GPTQ on quality at same bit-width
- For production serving: prefer AWQ-4bit for the best speed/quality tradeoff
"""

print(QUANTIZATION_COMPARISON)

The QLoRA Ecosystem in 2025

QLoRA catalyzed a wave of tooling and techniques. The ecosystem in 2025 includes:

Training libraries:

bitsandbytes: the original NF4 quantization implementation
PEFT (HuggingFace): LoRA and other adapter methods
TRL (HuggingFace): SFT, DPO, RLHF training with QLoRA support
Unsloth: faster QLoRA training (roughly 2x) through custom Triton kernels
LLaMA-Factory: all-in-one fine-tuning framework with QLoRA support

Inference libraries:

llama.cpp: CPU inference with 2-8 bit quantization
vLLM: high-throughput serving with PagedAttention and quantization support
TGI (HuggingFace): production inference server with GPTQ/AWQ support
ollama: local model serving with automatic quantization

Training a 7B QLoRA model: practical timeline

With unsloth + trl on a single A100 80GB:

Data preparation: 1-3 hours for 10K-100K examples
Training at 2048 tokens: approximately 1-3 hours per epoch (batch size 4, grad accum 4)
Evaluation: 30 minutes for standard benchmarks
Total for production-quality 7B SFT+DPO pipeline: approximately 1 day

On consumer hardware (RTX 4090 24GB):

7B model: 2-4 hours per epoch (reduced batch size)
13B model: 4-8 hours per epoch (QLoRA required)
33B model: 8-16 hours per epoch (QLoRA required, tight on 24GB)

tip

Use Unsloth for 2x QLoRA training speedup: Unsloth (Daniel Han, 2023) rewrites the core QLoRA training kernels in Triton for better GPU utilization. It is a simple way to get a roughly 1.5-2x training speedup without changing the training math. The speedup comes from fused kernels for RoPE, attention, and the LoRA forward/backward passes. Install with pip install unsloth; it works with HuggingFace models and TRL trainers.

Post-Training Quantization for Inference

QLoRA is a training technique. For inference, different quantization approaches exist optimized for serving speed rather than training memory:

GPTQ - Post-Training Quantization

GPTQ (Frantar et al., 2022) is one of the most widely used methods for quantizing LLMs to INT4 or INT3 for fast inference. It uses approximate second-order information (the Hessian of each layer's reconstruction error) to minimize quantization error per layer.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset

def quantize_model_gptq(
    model_name: str,
    output_dir: str,
    bits: int = 4,
    calibration_dataset_size: int = 128,
) -> str:
    """
    Quantize a model to INT4 using GPTQ.
    Requires: pip install auto-gptq optimum
    """
    from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # GPTQ requires calibration data - a small set of real examples
    # Used to compute layer-wise Hessians for optimal quantization
    calibration_data = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
    calibration_examples = [
        tokenizer(
            example["text"],
            return_tensors="pt",
            max_length=2048,
            truncation=True,
        )
        for example in list(calibration_data)[:calibration_dataset_size]
        if len(example["text"]) > 100  # Skip very short examples
    ]

    quantize_config = BaseQuantizeConfig(
        bits=bits,           # 4-bit quantization
        group_size=128,      # Quantize in groups of 128 weights
        desc_act=False,      # Skip activation-order quantization (faster inference, slightly lower accuracy)
    )

    model = AutoGPTQForCausalLM.from_pretrained(
        model_name,
        quantize_config=quantize_config,
    )

    # Quantize using calibration data
    model.quantize(calibration_examples)
    model.save_quantized(output_dir, use_safetensors=True)
    tokenizer.save_pretrained(output_dir)

    print(f"GPTQ model saved to {output_dir}")
    return output_dir


def load_and_run_gptq(model_dir: str, prompt: str) -> str:
    """Load GPTQ quantized model for inference."""
    from auto_gptq import AutoGPTQForCausalLM

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoGPTQForCausalLM.from_quantized(
        model_dir,
        use_safetensors=True,
        device="cuda:0",
    )

    inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")  # same device as the model
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=200)
    return tokenizer.decode(output[0], skip_special_tokens=True)

AWQ - Activation-Aware Weight Quantization

AWQ (Lin et al., 2023) observes that not all weights are equally important - weights corresponding to high-activation channels should be quantized more carefully. AWQ selects a per-channel scale that minimizes quantization error for the most important weights, without needing the expensive Hessian computation required by GPTQ.

def quantize_model_awq(
    model_name: str,
    output_dir: str,
    bits: int = 4,
    group_size: int = 128,
) -> str:
    """
    Quantize a model using AWQ.
    Requires: pip install autoawq
    """
    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoAWQForCausalLM.from_pretrained(
        model_name,
        safetensors=True,
    )

    quant_config = {
        "zero_point": True,     # Use zero-point quantization
        "q_group_size": group_size,
        "w_bit": bits,
        "version": "GEMM",      # GEMM kernel - faster than GEMV for batch inference
    }

    # AutoAWQ falls back to its default calibration dataset when calib_data is not passed
    model.quantize(tokenizer, quant_config=quant_config)
    model.save_quantized(output_dir)
    tokenizer.save_pretrained(output_dir)

    print(f"AWQ model saved to {output_dir}")
    return output_dir

Quantization Format Comparison

Method	When to Use	Relative speed (vs FP16)	Quality vs FP16
FP16	Training, highest quality inference	1x baseline	100%
BF16	Training on Ampere+ GPUs	~1x	~100%
INT8 (bitsandbytes)	When memory-constrained, minimal quality loss	0.9x	99%
GPTQ INT4	Production inference, balanced quality	1.5-2x	97-99%
AWQ INT4	Production inference, fastest	1.5-2.5x	97-99%
QLoRA NF4	Training on consumer GPUs	N/A (training)	~99%
INT2	Edge/mobile deployment, quality-constrained	3-4x	85-92%
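The memory side of this comparison can be estimated directly from the bit width. The following is a minimal sketch, not a measurement: the helper name `weight_memory_gb` is hypothetical, and the estimate covers weights only (it ignores quantization constants, activations, and the KV cache).

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight-only memory in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Illustrative bit widths for the formats in the table above
formats = {"FP16/BF16": 16, "INT8": 8, "INT4 / NF4": 4}

for n_params in (7e9, 65e9):
    for name, bits in formats.items():
        gb = weight_memory_gb(n_params, bits)
        print(f"{n_params / 1e9:.0f}B params, {name}: {gb:.1f} GB")
```

For a 65B model this reproduces the 130GB FP16 figure quoted at the start of the lesson; real deployments need additional headroom on top of the weight estimate.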

note

GPTQ vs AWQ in 2025

Both achieve similar quality at INT4. Key differences: GPTQ is more widely supported (included in vLLM, TGI, text-generation-webui). AWQ is faster at generation time because the quantization is more inference-friendly (the GEMM kernel maps more efficiently to GPU tensor cores). For most production use cases, AWQ at INT4 is the current best practice for serving 7B-70B models. Pre-quantized AWQ models for most popular LLMs are available on HuggingFace Hub (TheBloke, quantized model collections).

Interview Q&A

Q1: What is the difference between NF4 and INT4 quantization?

INT4 quantization uses 16 equally spaced levels across the weight range, which is only optimal if weights are uniformly distributed. NF4 (4-bit NormalFloat) is designed to be information-theoretically optimal for weights that follow a zero-centered normal (Gaussian) distribution. Because neural network weights are approximately normally distributed around zero, NF4 places more quantization levels near zero (where most weights cluster) and fewer in the tails (where few weights exist). This makes NF4 more accurate than INT4 for the same number of bits. The NF4 levels are derived from quantiles of the standard normal distribution, so that each of the 16 levels is used about equally often, and are then rescaled to the range [-1, 1].
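The difference in level placement can be seen with a small standard-library sketch. This is a simplified construction, not the exact NF4 codebook shipped in bitsandbytes (the real codebook is asymmetric so that it contains an exact zero); the function names are hypothetical.

```python
from statistics import NormalDist


def uniform_levels(n_levels=16):
    """INT4-style grid: equally spaced levels on [-1, 1]."""
    step = 2.0 / (n_levels - 1)
    return [-1.0 + i * step for i in range(n_levels)]


def normal_quantile_levels(n_levels=16):
    """Levels at the midpoints of equal-probability slices of N(0, 1),
    rescaled so the largest magnitude is 1."""
    nd = NormalDist()
    raw = [nd.inv_cdf((i + 0.5) / n_levels) for i in range(n_levels)]
    scale = max(abs(v) for v in raw)
    return [v / scale for v in raw]


def count_within(levels, radius=0.5):
    """Number of levels whose magnitude is below `radius`."""
    return sum(1 for v in levels if abs(v) < radius)


uniform = uniform_levels()
quantile = normal_quantile_levels()
print("uniform :", [round(v, 3) for v in uniform])
print("quantile:", [round(v, 3) for v in quantile])
print("levels with |x| < 0.5:", count_within(uniform), "vs", count_within(quantile))
```

The quantile-based grid spends more of its 16 levels in the central region, which is where most normally distributed weights fall after absmax normalization.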

Q2: What is double quantization in QLoRA and why does it matter?

In standard 4-bit quantization, each weight is stored in 4 bits, but each block of 64 weights also has a quantization constant (scale factor) stored in FP32 (32 bits). These constants add 32/64 = 0.5 bits per weight of overhead. Double quantization (DQ) quantizes these constants themselves to 8 bits, reducing their overhead to approximately 0.127 bits per weight. The saving is roughly 0.37 bits per weight: for a 7B model that is 0.37 × 7B / 8 ≈ 0.33 GB of GPU memory, and for a 65B model about 3 GB. With NF4 plus DQ, QLoRA stores the base model at approximately 4.13 bits per parameter (vs 4.5 without DQ), while recovering nearly all the quality of BF16 training.
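The overhead arithmetic for quantization constants can be checked with a short calculation. This sketch assumes the settings described in the QLoRA paper: a block size of 64 for the weights, 8-bit quantized constants, and a second-level block size of 256 with FP32 second-level constants. The function names are hypothetical.

```python
def plain_overhead_bits(block_size=64, const_bits=32):
    """Bits per weight spent on one FP32 constant per block."""
    return const_bits / block_size


def double_quant_overhead_bits(block_size=64, second_block=256,
                               first_bits=8, second_bits=32):
    """Bits per weight after quantizing the constants to 8 bits,
    plus one FP32 second-level constant per `second_block` constants."""
    return first_bits / block_size + second_bits / (block_size * second_block)


def bits_to_gb(bits_per_weight, n_params):
    """Convert a per-weight bit cost into GB (1 GB = 1e9 bytes)."""
    return bits_per_weight * n_params / 8 / 1e9


plain = plain_overhead_bits()
dq = double_quant_overhead_bits()
saved = plain - dq

print(f"overhead without DQ: {plain:.3f} bits/weight")
print(f"overhead with DQ:    {dq:.3f} bits/weight")
print(f"total with NF4 + DQ: {4 + dq:.3f} bits/weight")
for n_params in (7e9, 65e9):
    print(f"saved for {n_params / 1e9:.0f}B params: {bits_to_gb(saved, n_params):.2f} GB")
```

Note the division by 8 when converting bits to bytes; skipping it overstates the saving by a factor of eight.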

Q3: What are paged optimizers and when are they needed?

Paged optimizers manage optimizer state (AdamW first and second moment estimates) using NVIDIA's unified memory, which can page memory between GPU VRAM and CPU RAM automatically. Without them, a batch containing unusually long sequences can cause an out-of-memory (OOM) error when the resulting memory spike pushes total usage past GPU VRAM. With paged optimizers, the spike is absorbed by temporarily moving optimizer state to CPU RAM (slower but much larger) and paging it back when it is needed for the optimizer update. The cost is extra latency on those specific batches. In practice, paged optimizers are most valuable when inputs have variable length and you cannot reliably predict peak VRAM usage per batch.
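With the HuggingFace Trainer, a paged optimizer is selected through the `optim` training argument. The fragment below is a minimal sketch of keyword arguments intended to be passed as `transformers.TrainingArguments(**qlora_training_kwargs)`; the output path, batch size, and accumulation steps are illustrative assumptions, not recommendations from the lesson.

```python
# Keyword arguments for transformers.TrainingArguments (values are illustrative).
qlora_training_kwargs = {
    "output_dir": "./qlora-out",          # hypothetical output path
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 16,
    "gradient_checkpointing": True,       # trades compute for activation memory
    "optim": "paged_adamw_32bit",         # paged AdamW; "paged_adamw_8bit" also exists
    "learning_rate": 2e-4,
    "bf16": True,                         # requires a GPU with BF16 support
}

for key, value in qlora_training_kwargs.items():
    print(f"{key} = {value!r}")
```

The paged variants require bitsandbytes to be installed, since that library provides the paged optimizer implementations.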

Q4: How does QLoRA training quality compare to full fine-tuning?

Dettmers et al. (2023) showed that QLoRA (NF4 + LoRA) on a 65B model approaches or matches full fine-tuning quality on the MMLU benchmark. A rough quality ordering is: Full FP16 fine-tuning > LoRA FP16 > QLoRA NF4 > LoRA INT8. The gap between full fine-tuning and QLoRA is typically 0.5–2% on most benchmarks. For most practical applications, this gap is smaller than the variance from different training data or hyperparameters. QLoRA becomes particularly competitive relative to full fine-tuning as model size increases - for 70B models, the 4-bit quantization quality loss is smaller as a percentage of total representational capacity.

Q5: What should you do if QLoRA training produces worse results than expected?

Diagnostic checklist:

(1) Check learning rate - QLoRA is more sensitive to LR than FP16 LoRA; try 2e-4 for r=16, and reduce by 2x if the loss diverges.
(2) Verify LoRA target modules - include the MLP layers (gate_proj, up_proj, down_proj) in addition to attention for domain adaptation tasks.
(3) Check rank - r=8 is often too low; try r=32 or r=64 for complex tasks.
(4) Verify data quality - NF4 quantization introduces quantization noise, so tasks that need very precise representations are more sensitive to low-quality data than in FP16 training.
(5) Try lora_alpha = 2*r instead of lora_alpha = r for stronger LoRA scaling.
(6) If using Flash Attention 2, verify compatibility with your bitsandbytes version - version mismatches can silently degrade quality.

