QLoRA: 4-Bit Fine-Tuning
The Night Someone Fine-Tuned a 65B Model on a Single GPU
In May 2023, Tim Dettmers and his collaborators at the University of Washington published a result that most ML engineers read twice before believing: they had fine-tuned a LLaMA 65B model on a single 48 GB GPU. Not inference - fine-tuning. Full gradient computation, weight updates, the whole process. On one GPU.
To appreciate why that was remarkable, consider what fine-tuning a 65B model normally requires. In FP16, the model weights alone consume 130 GB. Add optimizer states (Adam keeps two extra values per parameter - another 260 GB even in 16-bit) and you are at 390 GB just for weights and optimizer. Activations for a typical sequence length push the total past 500 GB. The standard hardware answer is a cluster - eight A100 80GB cards connected via NVLink, or a larger TPU pod. That is a $200K+ hardware investment or a substantial cloud spend per training run.
The paper was called "QLoRA: Efficient Finetuning of Quantized LLMs." Its central claim was that you could compress the base model to 4 bits, keep the LoRA adapters in full 16-bit precision, and lose almost nothing in final model quality. The flagship result, a 65B model they called Guanaco, reached 99.3% of ChatGPT's performance on the Vicuna benchmark after roughly 24 hours of training on a single GPU. The community had found a way to democratize large model fine-tuning overnight.
This was not just a research curiosity. Within weeks, teams that had been locked out of 13B+ model fine-tuning started producing competitive results. A fine-tuning run on LLaMA 2 70B that would previously have required a rented multi-GPU cluster could now run on a single rented A100. Teams running on-premise could use a single gaming GPU (RTX 4090) to fine-tune 7B models that previously required four datacenter GPUs. The hardware ceiling for serious ML work dropped by an order of magnitude.
This lesson dissects how QLoRA achieves this. There are three innovations working together: NF4 quantization (a new data type optimal for normally distributed weights), double quantization (quantize the quantization constants themselves), and paged optimizers (use CPU RAM as overflow storage for GPU optimizer states). Understanding each one individually is straightforward; understanding how they combine to make 65B-on-one-GPU possible is what this lesson is about.
Why This Exists - The Memory Wall at Scale
The previous lesson on LoRA solved the trainable parameter problem. With LoRA, you can reduce the number of parameters you actually update from 70 billion to 70 million. But there is a second problem that LoRA alone does not solve: the base model itself still needs to live in GPU memory.
Here is the memory breakdown for fine-tuning LLaMA 2 70B with LoRA in BF16:
| Component | Size |
|---|---|
| Model weights (70B x 2 bytes BF16) | 140 GB |
| LoRA adapter (0.1% of params x 2 bytes) | 0.14 GB |
| LoRA gradients | 0.14 GB |
| Adam optimizer states for LoRA (2x) | 0.28 GB |
| Activations (batch=1, seq=2048) | ~8 GB |
| Total | ~149 GB |
You need approximately 150 GB of GPU memory. The largest single-GPU system widely available is the A100 80GB at $10K-12K, and even that falls short. You need two of them at minimum, with the associated NVLink interconnect and engineering overhead.
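To make these numbers concrete, here is a small sketch that reproduces the budget above (a hypothetical helper; the activation figure is the same empirical ballpark used in the table):

```python
# Rough memory budget for BF16 LoRA fine-tuning of a 70B model (illustrative)
def bf16_lora_memory_gb(params_b: float = 70.0, lora_frac: float = 0.001,
                        activations_gb: float = 8.0) -> float:
    weights = params_b * 2.0                # BF16 base weights: 2 bytes/param
    adapters = params_b * lora_frac * 2.0   # LoRA A and B matrices in BF16
    grads = adapters                        # gradients exist only for the adapters
    adam_states = 2 * adapters              # Adam first and second moments
    return weights + adapters + grads + adam_states + activations_gb

print(f"{bf16_lora_memory_gb():.0f} GB")    # ~149 GB
```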
LoRA made the optimizer states for trainable parameters tiny. But the frozen base model weights are still enormous. To fine-tune large models on accessible hardware, you need to compress the base model itself.
The naive approach is standard INT8 quantization: represent each weight as an 8-bit integer instead of a 16-bit float. This halves the memory, but the quantization error degrades training stability - the model produces incorrect gradient magnitudes, training diverges, or final quality suffers noticeably. Going further to INT4 naive quantization loses even more precision and becomes unreliable for fine-tuning.
QLoRA's contribution was showing that you could go to 4 bits reliably, with near-zero quality loss, if you used the right data type, the right quantization scheme, and the right memory management strategy.
Historical Context - The Path to 4-Bit Training
The history of neural network quantization stretches back to model compression work in the mid-2010s. Courbariaux et al. (2015) showed binary networks could work. Han et al. (2015) combined pruning with quantization. Jacob et al. (2018) from Google established the standard INT8 quantization approach used in TensorFlow Lite.
For inference, 4-bit quantization became practical around 2022-2023. GPTQ (Frantar et al., 2022) introduced a principled approach that uses second-order information from calibration data to minimize quantization error, and the llama.cpp project by Georgi Gerganov showed that heavily quantized models (Q4_0, Q4_1) could run on CPUs and consumer GPUs with acceptable quality loss.
The unique challenge for training (as opposed to inference) is that quantization-induced errors interact with gradient computation in complex ways. A small error in the forward pass translates through the chain rule to potentially large errors in parameter gradients. This is why 4-bit quantization worked for inference first - you could measure output quality directly - but took longer to make work for training.
Tim Dettmers had been working on quantization for ML throughout his PhD, including 8-bit optimizers and the LLM.int8() inference work. His insight for QLoRA came from a combination of three observations: that weight distributions in transformers are close to normal (enabling near-optimal 4-bit discretization), that you could decouple compute precision from storage precision (running compute in BF16 while storing weights in NF4), and that the optimizer state memory for frozen parameters is zero by definition in LoRA - you only need optimizer states for the tiny LoRA adapters, not the 70B frozen base.
Those three insights, combined into one system, produced QLoRA.
Innovation 1 - NF4: The Optimal 4-Bit Data Type
The problem with standard INT4 quantization
Standard 4-bit integer quantization maps weight values to 16 evenly-spaced levels from -8 to 7 (or 0 to 15 for unsigned). This uniform spacing is efficient in hardware but wasteful for neural network weights.
Neural network weights, after pretraining, have distributions that are approximately zero-centered and roughly Gaussian (normal). This means most weights cluster near zero, with relatively few large-magnitude weights. A uniform quantization grid wastes half its precision on outlier values that appear rarely, while under-representing the dense central region where most weights live.
Optimal quantization should allocate grid points proportional to where data actually lives - more points near the center, fewer at the extremes.
NF4: Quantile-based discretization
NF4 (4-bit NormalFloat) places quantization levels at the quantiles of a standard normal distribution $N(0, 1)$. With 4 bits, you have 16 possible values. NF4 places these 16 values at (approximately) the 1/32, 3/32, 5/32, ..., 31/32 quantiles of the normal distribution.
The quantile values satisfy

$$q_i = \Phi^{-1}\!\left(\frac{2i + 1}{32}\right), \quad i = 0, 1, \dots, 15$$

where $\Phi^{-1}$ is the inverse normal CDF (probit function). After computing these quantiles, they are rescaled so that the values span $[-1, 1]$.
The NF4 quantization levels are approximately:
[-1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911,
0.0, 0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0]
Notice the dense packing near zero and the sparse coverage near the extremes. This directly matches where transformer weights actually concentrate.
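To see where these values come from, the sketch below illustrates the quantile idea: compute evenly spaced quantiles of $N(0, 1)$ and rescale them to $[-1, 1]$. This is an illustration, not the paper's exact recipe - the official NF4 table is built slightly asymmetrically so that zero is exactly representable, so these values only roughly match the table above.

```python
import torch

# Illustrative quantile-based construction of a 4-bit code for N(0, 1) weights
probs = (2 * torch.arange(16) + 1) / 32.0    # 1/32, 3/32, ..., 31/32
levels = torch.special.ndtri(probs)          # inverse normal CDF (probit)
levels = levels / levels.abs().max()         # rescale so the levels span [-1, 1]
print(levels)
```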
The quantization process
To store a weight matrix in NF4:
- Compute the absolute maximum value of each block of weights: $c = \max_i |w_i|$
- Normalize: $\hat{w}_i = w_i / c$ (now all values lie in $[-1, 1]$)
- For each normalized weight, find the nearest NF4 level and store its 4-bit index
- Store $c$ as a 32-bit float (the "quantization constant")
To dequantize during the forward pass:
- Look up the NF4 level for each 4-bit index
- Multiply by the stored constant: $w_i \approx c \cdot \text{NF4}[\text{idx}_i]$
The dequantization happens on the fly on the GPU at each forward pass. Modern GPUs can do this fast enough that the overhead is small compared to the matrix multiply itself.
import torch
# The 16 NF4 quantization levels (from the paper)
NF4_LEVELS = torch.tensor([
-1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
-0.28444138169288635, -0.18477343022823334, -0.09105003625154495,
0.0, 0.07958029955625534, 0.16093020141124725, 0.24611230194568634,
0.33791524171829224, 0.44070982933044434, 0.5626170039176941,
0.7229568362236023, 1.0
])
def quantize_to_nf4(weight: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """
    Quantize a weight tensor to NF4 format (reference implementation).
    Returns (quantized_indices, absmax_per_block).

    For clarity this stores one uint8 index per weight; the real bitsandbytes
    kernels pack two 4-bit indices into each byte.
    """
    block_size = 64
    weight_flat = weight.reshape(-1)
    assert weight_flat.numel() % block_size == 0, "size must be a multiple of block_size"
    n_blocks = weight_flat.numel() // block_size

    # Normalize each block to [-1, 1] using its own absmax
    # (clamp avoids division by zero for all-zero blocks)
    blocks = weight_flat.reshape(n_blocks, block_size)
    absmax = blocks.abs().max(dim=1).values
    normalized = (blocks / absmax.clamp(min=1e-12).unsqueeze(1)).reshape(-1)

    # Find nearest NF4 level for each weight
    distances = (normalized.unsqueeze(1) - NF4_LEVELS.to(normalized.device)).abs()
    indices = distances.argmin(dim=1).to(torch.uint8)
    return indices, absmax
def dequantize_from_nf4(indices: torch.Tensor, absmax: torch.Tensor,
original_shape: tuple) -> torch.Tensor:
"""Reconstruct float weights from NF4 quantized representation."""
block_size = 64
levels = NF4_LEVELS.to(indices.device)
weights_flat = levels[indices.long()]
n_blocks = len(absmax)
weights_flat = weights_flat.reshape(n_blocks, block_size)
weights_dequant = weights_flat * absmax.unsqueeze(1)
return weights_dequant.reshape(original_shape)
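A quick round-trip check on the reference functions above shows how the scheme behaves on Gaussian-distributed weights (a usage sketch; the exact error you see will vary with the weight scale):

```python
# Round-trip test: quantize a Gaussian weight matrix, reconstruct, measure error
torch.manual_seed(0)
w = torch.randn(1024, 1024) * 0.02    # roughly normal, like pretrained weights

indices, absmax = quantize_to_nf4(w)
w_hat = dequantize_from_nf4(indices, absmax, w.shape)

rel_err = (w - w_hat).norm() / w.norm()
print(f"Relative reconstruction error: {rel_err:.4f}")
print(f"Packed 4-bit storage would be {w.numel() * 0.5 / 1e6:.2f} MB "
      f"plus {absmax.numel() * 4 / 1e6:.3f} MB of FP32 constants")
```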
Innovation 2 - Double Quantization
The quantization constant overhead
Every block of 64 weights needs one 32-bit quantization constant (the absmax value). For 70B parameters with block size 64:

$$\frac{70 \times 10^9}{64} \text{ blocks} \times 4 \text{ bytes} \approx 4.375 \text{ GB}$$

That is 4.375 GB just for the quantization constants. On a 48 GB GPU, this is not trivial - it is roughly 9% of your available memory.
The solution: quantize the quantization constants
The quantization constants themselves are floating-point values. They also have a distribution. Dettmers et al. observed that the absmax values across blocks follow their own distribution, which is again approximately normal. You can apply a second round of quantization to these constants.
Double quantization applies 8-bit quantization (the paper uses 8-bit floats) to the quantization constants, with their own second-level quantization constants. The second-level constants are stored in 32-bit float.
Memory accounting with block size 64 for NF4, and 256 for the second-level quantization:
The first-level constants drop from 4.375 GB to roughly 1.09 GB (8 bits instead of 32), and the second-level constants add only about 0.017 GB (one 32-bit constant per 256 first-level constants). Net savings: approximately 3.3 GB. Averaged over a 70B model, double quantization saves roughly 0.37 bits per parameter.
The total memory for a double-quantized 70B model:
| Component | Memory |
|---|---|
| NF4 weights (70B x 0.5 bytes) | 35.0 GB |
| First-level quantization constants (8-bit) | 1.09 GB |
| Second-level quantization constants (FP32) | 0.02 GB |
| Total for model | ~36.1 GB |
Add LoRA adapters, gradients, and optimizer states for the adapters (small), and activations for a batch of 1 - the entire fine-tuning setup fits in 40-48 GB.
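The accounting is easy to script; this sketch (using the block sizes assumed above) reproduces the table's numbers:

```python
# Memory accounting for NF4 weights plus double-quantized constants (illustrative)
def quantized_model_memory_gb(params: float = 70e9,
                              block_size: int = 64,
                              dq_block_size: int = 256) -> dict:
    weights = params * 0.5                          # 4 bits per weight
    n_blocks = params / block_size
    first_level = n_blocks * 1                      # 8-bit constants after double quant
    second_level = (n_blocks / dq_block_size) * 4   # FP32 constants for the constants
    first_level_fp32 = n_blocks * 4                 # what plain FP32 constants would cost
    return {
        "weights_gb": weights / 1e9,
        "first_level_gb": first_level / 1e9,
        "second_level_gb": second_level / 1e9,
        "double_quant_savings_gb": (first_level_fp32 - first_level - second_level) / 1e9,
    }

print(quantized_model_memory_gb())
# ~35.0 GB weights, ~1.09 GB first-level, ~0.017 GB second-level, ~3.3 GB saved
```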
Innovation 3 - Paged Optimizers
The spike problem in GPU memory
Even with NF4 and double quantization, there is a remaining issue: memory usage is not constant during training. Long sequences or unlucky batches trigger memory spikes that exceed GPU capacity. When the GPU runs out of memory, the training job crashes - not gracefully, just OOM (out of memory) and dead.
The traditional fix is to use smaller batches or shorter sequences. Both options either hurt throughput (smaller batches) or restrict what data you can train on (shorter sequences).
Unified memory to the rescue
Modern GPU systems (CUDA with Pascal architecture and later) support unified memory - a mechanism where the GPU and CPU RAM are part of the same virtual address space. When the GPU tries to access memory that is not currently on-device, a page fault triggers an automatic transfer from CPU RAM to GPU memory.
Paged optimizers use this mechanism specifically for optimizer states (the first and second moment estimates in Adam). During normal operation, optimizer states live on the GPU as usual. When a memory spike occurs and the GPU is about to OOM, optimizer state pages are automatically evicted to CPU RAM. The GPU continues running. When those optimizer states are needed again (at the next Adam update step), they are paged back in.
The key insight: optimizer states are accessed infrequently relative to activations (once per optimizer step, not once per forward/backward pass). The CPU-GPU transfer latency for an occasional page fault has minimal impact on throughput, but prevents OOM crashes.
For LoRA fine-tuning, this matters primarily for long sequences and large batch sizes. The LoRA adapter optimizer states are tiny (a few hundred MB). The value of paged optimizers is in handling the activation memory spikes from long contexts.
import torch
from bitsandbytes.optim import PagedAdamW32bit, PagedAdamW8bit
# Standard AdamW for comparison
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)
# Paged Adam - uses CPU RAM as overflow for optimizer states
optimizer = PagedAdamW32bit(model.parameters(), lr=2e-4)
# Paged 8-bit Adam - further reduces memory for optimizer states
optimizer = PagedAdamW8bit(model.parameters(), lr=2e-4)
Memory Math: Full Picture
Here is the complete memory accounting for QLoRA fine-tuning of LLaMA 2 70B on a single A100 80GB:
| Component | Calculation | Memory |
|---|---|---|
| NF4 weights | 70B × 0.5 bytes | 35.0 GB |
| Quantization constants (double quant) | 70B/64 × 1 byte + second-level overhead | ~1.1 GB |
| LoRA adapters (r=64, attention projections, all layers) | ~300M params × 2 bytes | 0.60 GB |
| LoRA gradients | same as adapters | 0.60 GB |
| LoRA optimizer states (Adam) | 2× adapter size | 1.20 GB |
| Activations (batch=1, seq=1024) | empirical | ~4.0 GB |
| CUDA overhead and fragmentation | empirical | ~2.0 GB |
| Total | | ~45 GB |
This fits on an 80 GB A100 with plenty of headroom, and just squeezes onto a 48 GB card (A40 or RTX A6000) with a few gigabytes to spare. For a 65B model (slightly smaller than 70B), the fit is more comfortable.
For comparison, the same job without quantization (BF16 LoRA):
- 70B × 2 bytes = 140 GB for the base weights alone - does not fit on any single GPU currently available.
Implementation: BitsAndBytesConfig
The entire QLoRA stack in HuggingFace is controlled by BitsAndBytesConfig. Here is a production-ready configuration:
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
BitsAndBytesConfig,
)
from peft import LoraConfig, get_peft_model, TaskType
import torch
# QLoRA quantization configuration
bnb_config = BitsAndBytesConfig(
load_in_4bit=True, # Enable 4-bit quantization
bnb_4bit_quant_type="nf4", # Use NormalFloat4 data type
bnb_4bit_compute_dtype=torch.bfloat16, # Computation in bfloat16
bnb_4bit_use_double_quant=True, # Enable double quantization
)
# Note on compute_dtype:
# Weights are stored in NF4 (4-bit)
# Before any matrix multiply, they are dequantized to bfloat16
# The actual computation runs in bfloat16
# This is why QLoRA quality is close to BF16 fine-tuning:
# the math precision is the same; only storage is compressed
MODEL_ID = "meta-llama/Meta-Llama-3-8B"
# Load with quantization
model = AutoModelForCausalLM.from_pretrained(
MODEL_ID,
quantization_config=bnb_config,
device_map="auto", # Automatically place layers on available GPUs
trust_remote_code=False,
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right" # Pad on right for causal LM
# Critical: prepare model for kbit training
# This handles the embedding layer dtype and gradient checkpointing setup
from peft import prepare_model_for_kbit_training
model = prepare_model_for_kbit_training(
model,
use_gradient_checkpointing=True,
)
# Apply LoRA on top of the quantized base
lora_config = LoraConfig(
r=64, # Higher rank since memory is freed by quantization
lora_alpha=16,
target_modules=[
"q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",
],
lora_dropout=0.05,
bias="none",
task_type=TaskType.CAUSAL_LM,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 167,772,160 || all params: 8,198,094,848 || trainable%: 2.05%
Complete Training Pipeline: LLaMA 3 8B on RTX 4090
This is a complete, tested QLoRA training script for a single consumer GPU. The RTX 4090 has 24 GB VRAM, sufficient for LLaMA 3 8B with QLoRA.
from transformers import (
AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig,
TrainingArguments,
)
from peft import (
LoraConfig, get_peft_model, prepare_model_for_kbit_training, TaskType
)
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset
import torch
import os
# --- Configuration ---
MODEL_ID = "meta-llama/Meta-Llama-3-8B"
OUTPUT_DIR = "./qlora-llama3-8b-output"
DATASET_PATH = "your_dataset.jsonl"
MAX_SEQ_LENGTH = 2048
# --- Step 1: Quantization config ---
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=True,
)
# --- Step 2: Load model ---
model = AutoModelForCausalLM.from_pretrained(
MODEL_ID,
quantization_config=bnb_config,
device_map="auto",
attn_implementation="flash_attention_2", # Requires flash-attn package
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
# --- Step 3: Prepare for kbit training ---
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)
model.config.use_cache = False # Incompatible with gradient checkpointing
# --- Step 4: LoRA config ---
lora_config = LoraConfig(
r=16,
lora_alpha=32,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"],
lora_dropout=0.05,
bias="none",
task_type=TaskType.CAUSAL_LM,
)
model = get_peft_model(model, lora_config)
# --- Step 5: Dataset ---
dataset = load_dataset("json", data_files=DATASET_PATH, split="train")
dataset = dataset.shuffle(seed=42)
def format_instruction(example):
"""Format as Alpaca instruction template."""
if example.get("input"):
return {
"text": (
f"### Instruction:\n{example['instruction']}\n\n"
f"### Input:\n{example['input']}\n\n"
f"### Response:\n{example['output']}"
)
}
return {
"text": (
f"### Instruction:\n{example['instruction']}\n\n"
f"### Response:\n{example['output']}"
)
}
dataset = dataset.map(format_instruction, remove_columns=dataset.column_names)
# Hold out a small eval split - eval_strategy="steps" below requires one
dataset = dataset.train_test_split(test_size=0.05, seed=42)
# --- Step 6: Training arguments ---
training_args = SFTConfig(
output_dir=OUTPUT_DIR,
num_train_epochs=3,
per_device_train_batch_size=2,
gradient_accumulation_steps=8, # Effective batch size = 16
learning_rate=2e-4,
lr_scheduler_type="cosine",
warmup_ratio=0.05,
weight_decay=0.01,
fp16=False,
bf16=True, # BF16 computation (matches compute_dtype)
max_grad_norm=0.3, # Gradient clipping - important for stability
logging_steps=10,
eval_strategy="steps",
eval_steps=100,
save_strategy="steps",
save_steps=100,
save_total_limit=3,
load_best_model_at_end=True,
report_to="wandb",
max_seq_length=MAX_SEQ_LENGTH,
dataset_text_field="text",
packing=False, # Set True to pack multiple short examples
optim="paged_adamw_32bit", # Paged optimizer for OOM protection
)
# --- Step 7: Trainer ---
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    args=training_args,
)
# --- Step 8: Train ---
trainer.train()
# --- Step 9: Save adapter ---
model.save_pretrained(f"{OUTPUT_DIR}/final-adapter")
tokenizer.save_pretrained(f"{OUTPUT_DIR}/final-adapter")
print(f"Training complete. Adapter saved to {OUTPUT_DIR}/final-adapter")
Compute Dtype vs Storage Dtype - A Critical Distinction
One of the most important things to understand about QLoRA is that there are two separate dtypes in play, and confusing them leads to bugs and poor training results.
Storage dtype (NF4): This is how the base model weights are stored in GPU memory between operations. 4 bits per weight. Never used directly in arithmetic. The entire point is to compress memory footprint.
Compute dtype (bfloat16): When a weight is needed for a matrix multiply, it is dequantized from NF4 to bfloat16 on-the-fly, the computation runs in bfloat16, and the result is in bfloat16. The dequantized weights are not stored - they are computed, used, and discarded.
# Storage vs compute dtype, shown schematically (not runnable as-is):
#
#   weight_storage                                 # packed uint8 buffer, 4 bits per value (NF4)
#       memory: d * k / 2 bytes, persists in GPU memory between forward passes
#
#   weight_compute = dequantize(weight_storage)    # torch.bfloat16, created just-in-time
#       memory: d * k * 2 bytes, temporary, freed after the matmul
#
#   output = input @ weight_compute.T              # the matrix multiply runs in bfloat16
#
#   weight_compute is discarded; only weight_storage persists
This matters for one key reason: the quantization error from NF4 is present at every forward pass, but it is bounded and consistent. Because the compute happens in bfloat16, gradient computation is also in bfloat16. The LoRA adapters see the same model behavior during training as they will during inference (since inference also dequantizes on-the-fly). There is no train-vs-inference mismatch.
Comparing QLoRA to Alternatives
Understanding where QLoRA fits in the broader landscape of fine-tuning approaches helps you make the right architectural choice for your use case.
# Memory estimates for fine-tuning LLaMA 2 7B (for comparison)
approaches = {
"Full fine-tune (FP32)": {"memory_gb": 112, "params_trainable": "100%"},
"Full fine-tune (BF16)": {"memory_gb": 60, "params_trainable": "100%"},
"LoRA (BF16, r=64)": {"memory_gb": 18, "params_trainable": "~1%"},
"QLoRA (NF4 + LoRA, r=64)": {"memory_gb": 9, "params_trainable": "~1%"},
"GPTQ inference only": {"memory_gb": 4, "params_trainable": "0%"},
}
# Key insight: QLoRA cuts memory by ~2x vs BF16 LoRA
# while maintaining the same adapter quality
The practical decision tree:
- Have a GPU cluster (8x A100+)? Use full fine-tuning or BF16 LoRA. Maximum quality.
- Have 2-4 A100 80GB GPUs? Use BF16 LoRA for models up to 70B. No quantization noise.
- Have 1 A100 40-80GB GPU? Use QLoRA for 30-70B models. Slight quality tradeoff, acceptable.
- Have 1 RTX 3090 or 4090 (24GB)? QLoRA for 7-13B models. The sweet spot for consumer hardware.
- Have a 16 GB card (RTX 4080, 4060 Ti 16GB)? QLoRA for 7B models only.
- Running on CPU only? Use GGUF quantized models with llama.cpp. Not suitable for training.
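If you want this decision tree as a first-pass heuristic in code, a rough sketch might look like the following (a hypothetical helper with coarse estimates, not a substitute for measuring actual usage):

```python
def recommend_finetuning_approach(gpu_memory_gb: float, model_params_b: float) -> str:
    """Very rough heuristic mirroring the decision tree above."""
    # Coarse estimates: base weights + adapters/optimizer + activation headroom
    bf16_lora_gb = model_params_b * 2.0 + 8.0    # BF16 base, adapters, activations
    qlora_gb = model_params_b * 0.55 + 8.0       # NF4 base + constants, adapters, activations

    if gpu_memory_gb >= bf16_lora_gb:
        return "BF16 LoRA - no quantization noise"
    if gpu_memory_gb >= qlora_gb:
        return "QLoRA - NF4 base with BF16 LoRA adapters"
    return "Does not fit on one GPU - shard across GPUs or pick a smaller model"

print(recommend_finetuning_approach(24, 13))   # QLoRA: 13B on an RTX 3090/4090
print(recommend_finetuning_approach(80, 70))   # QLoRA: 70B on a single A100 80GB
```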
Production Engineering Notes
Flash Attention Integration
Flash Attention 2 significantly reduces activation memory during the forward pass, enabling longer sequence lengths and larger batch sizes at the same memory budget. Always enable it when training on Ampere (A100) or newer hardware:
model = AutoModelForCausalLM.from_pretrained(
MODEL_ID,
quantization_config=bnb_config,
attn_implementation="flash_attention_2", # Requires pip install flash-attn
device_map="auto",
)
Flash Attention 2 typically reduces memory by 30-40% for long sequences (2048+) and increases throughput by 2-4x.
Gradient Clipping for Stability
QLoRA training with high learning rates and small batches can produce large gradient spikes. Gradient clipping prevents these from destabilizing training:
# In SFTConfig or TrainingArguments:
max_grad_norm=0.3 # Clip gradients with L2 norm > 0.3
Values of 0.3-1.0 work well for QLoRA. Lower values (0.3) are safer; higher values (1.0) allow larger updates but risk instability.
Checkpoint Validation
Before committing to a full training run, validate that your setup loads correctly and the first forward pass runs without errors:
# Validation script - run before starting long training jobs
import torch
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
test_input = tokenizer(
"### Instruction:\nWhat is the capital of France?\n\n### Response:\n",
return_tensors="pt"
).to(model.device)
# Verify forward pass
with torch.no_grad():
    output = model(**test_input)
print(f"Logits shape: {output.logits.shape}")
print(f"No NaN in logits: {not torch.isnan(output.logits).any()}")
# Verify backward pass (check gradients flow to LoRA params).
# Re-run the forward pass with gradients enabled - the no_grad output above
# cannot be backpropagated through.
output = model(**test_input)
loss = output.logits.sum()
loss.backward()
for name, param in model.named_parameters():
    if param.requires_grad and param.grad is None:
        print(f"WARNING: No gradient for {name}")
        break
else:
    print("All trainable parameters received gradients")
model.zero_grad()  # Clear the test gradients before real training
Handling Long Sequences with Packing
For short instruction-following examples (< 512 tokens), packing multiple examples into one training sequence increases GPU utilization dramatically:
training_args = SFTConfig(
max_seq_length=2048,
packing=True, # Pack multiple short examples into full-length sequences
# This can 4-5x your effective throughput on short examples
)
Important: packing concatenates unrelated examples into a single sequence, so context can bleed between them. The TRL SFTTrainer inserts EOS tokens between packed examples to mitigate this, but validate the behavior with your specific data format before committing to a long run.
Merging QLoRA Adapters
Unlike standard LoRA merging, QLoRA adapters are trained on top of a quantized base. To get a deployable merged model, you need to dequantize first:
from peft import PeftModel
from transformers import AutoModelForCausalLM
import torch
# Step 1: Load base model in BF16 (NOT quantized)
base_model = AutoModelForCausalLM.from_pretrained(
MODEL_ID,
torch_dtype=torch.bfloat16,
device_map="auto",
)
# Step 2: Load the QLoRA adapter on the non-quantized base
model = PeftModel.from_pretrained(base_model, "./qlora-adapter-final")
# Step 3: Merge
merged = model.merge_and_unload()
# Step 4: Save merged model (standard BF16 model, no quantization)
merged.save_pretrained("./merged-qlora-model")
# Optional Step 5: Re-quantize the merged model for inference
# Use llama.cpp or GPTQ for deployment quantization
Note the pattern: train with QLoRA (quantized base + LoRA adapters), merge onto unquantized base for a clean merged checkpoint, then optionally re-quantize for efficient inference.
Choosing Rank for QLoRA
Because QLoRA frees up significant memory compared to BF16 LoRA, you have headroom to use higher ranks without running OOM. The guidance from the QLoRA paper and follow-up community research:
- r=64 is the sweet spot for QLoRA on instruction-following tasks. The freed memory allows this without the OOM risk you would face with BF16.
- r=16 is a good default when you are unsure and want faster iteration. Lower memory, faster training.
- r=128 for tasks requiring major domain shift or knowledge injection. Use only with large datasets (>100K examples) to avoid overfitting.
The interaction between rank and quantization noise: higher rank means the LoRA adapters can correct for more of the quantization error introduced by NF4. At very low ranks (r=4, r=8), the LoRA capacity may be insufficient to compensate for the 4-bit quantization in domains with significant distribution shift from the training data.
# Rank ablation for QLoRA - compare r=16, 32, 64
for rank in [16, 32, 64]:
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=True,
)
lora_config = LoraConfig(
r=rank,
lora_alpha=16, # Keep alpha fixed across ranks to isolate rank effect
target_modules=["q_proj", "v_proj"],
task_type=TaskType.CAUSAL_LM,
)
# train for 1 epoch, evaluate, compare
QLoRA Quality Analysis - What You Actually Lose
The honest answer to "how much quality do you lose with QLoRA?" depends on the task:
Tasks with minimal quality loss (< 1% drop vs BF16 LoRA):
- Instruction following (Alpaca-style fine-tuning)
- Conversational style adaptation
- Output format enforcement (JSON, markdown, structured text)
- Tone and persona alignment
Tasks with moderate quality loss (1-3% drop):
- Code generation, especially less common languages
- Mathematical reasoning tasks
- Tasks requiring precise factual recall from fine-tuning data
Tasks where QLoRA may be insufficient:
- Teaching completely new languages (the base never saw the language)
- Tasks requiring precise multi-step numerical computation
- Safety-critical applications where marginal quality differences matter
The paper's Guanaco models demonstrated that a QLoRA-fine-tuned 65B model could reach roughly 99% of ChatGPT's performance on the Vicuna benchmark (with GPT-4 as the judge). This held because the base model (LLaMA 65B) was strong enough that the quantization noise was overwhelmed by the domain-specific signal from fine-tuning. With weaker base models, the quality gap is more pronounced.
Monitoring QLoRA Training
Key metrics to track and what they tell you:
# Custom callback for QLoRA health monitoring
import torch
import wandb
from transformers import TrainerCallback

class QLoRAMonitorCallback(TrainerCallback):
    def __init__(self):
        self.last_train_loss = None

    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs is None:
            return
        metrics = {}
        # Loss tracking (train and eval losses arrive in separate on_log calls)
        if "loss" in logs:
            metrics["train/loss"] = logs["loss"]
            self.last_train_loss = logs["loss"]
        if "eval_loss" in logs:
            metrics["eval/loss"] = logs["eval_loss"]
            # Overfit check: eval_loss > train_loss * 1.5 signals overfitting
            if self.last_train_loss is not None:
                ratio = logs["eval_loss"] / max(self.last_train_loss, 1e-8)
                metrics["eval/overfit_ratio"] = ratio
                if ratio > 1.5:
                    print(f"WARNING: Overfitting detected. eval/train ratio: {ratio:.2f}")
# Learning rate
if "learning_rate" in logs:
metrics["train/learning_rate"] = logs["learning_rate"]
# GPU memory - important to track for QLoRA
if torch.cuda.is_available():
metrics["gpu/memory_allocated_gb"] = (
torch.cuda.memory_allocated() / 1e9
)
metrics["gpu/memory_reserved_gb"] = (
torch.cuda.memory_reserved() / 1e9
)
metrics["gpu/max_memory_gb"] = (
torch.cuda.max_memory_allocated() / 1e9
)
if wandb.run is not None:
wandb.log(metrics, step=state.global_step)
def on_epoch_end(self, args, state, control, **kwargs):
# Reset peak memory stats each epoch
if torch.cuda.is_available():
torch.cuda.reset_peak_memory_stats()
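To use the callback, register it with the trainer before launching the run (Trainer.add_callback works for SFTTrainer as well, since it subclasses Trainer):

```python
# Register the monitoring callback, then train as usual
trainer.add_callback(QLoRAMonitorCallback())
trainer.train()
```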
Common Mistakes
:::danger Using load_in_4bit without prepare_model_for_kbit_training
A model loaded with load_in_4bit=True has some layers in non-standard dtypes. Without calling prepare_model_for_kbit_training(), gradient checkpointing fails silently or produces incorrect gradients. Always follow this sequence:
model = AutoModelForCausalLM.from_pretrained(..., quantization_config=bnb_config)
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)
model = get_peft_model(model, lora_config) # Apply LoRA AFTER prepare
The order matters: quantize first, prepare for training second, apply LoRA third.
:::
:::danger Mismatch between compute_dtype and training precision
If bnb_4bit_compute_dtype=torch.float16 but your TrainingArguments uses bf16=True, you get dtype mismatches that cause either NaN loss or silent precision degradation. Always match:
# Correct: both use bfloat16
bnb_config = BitsAndBytesConfig(
bnb_4bit_compute_dtype=torch.bfloat16, # <-- bfloat16
...
)
training_args = TrainingArguments(
bf16=True, # <-- bfloat16
fp16=False, # <-- never True when using bfloat16
...
)
# Wrong: mixing float16 and bfloat16
bnb_config = BitsAndBytesConfig(
bnb_4bit_compute_dtype=torch.float16, # float16
...
)
training_args = TrainingArguments(
bf16=True, # bfloat16 - mismatch!
)
:::
:::warning Attempting to merge QLoRA adapter onto quantized base
A common mistake is trying to call merge_and_unload() on a model that is still in NF4 quantized form. The merge operation tries to add a bfloat16 matrix (the LoRA delta) to a 4-bit quantized matrix, which produces incorrect results silently.
Always load a fresh BF16 base model for merging:
# WRONG: merging onto quantized base
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, quantization_config=bnb_config)
peft_model = PeftModel.from_pretrained(model, "./adapter")
merged = peft_model.merge_and_unload() # Produces wrong results
# CORRECT: load clean BF16 base first
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)
peft_model = PeftModel.from_pretrained(model, "./adapter")
merged = peft_model.merge_and_unload() # Correct
:::
:::warning Using FP16 instead of BF16 for compute_dtype
Float16 has a narrower dynamic range than bfloat16 (max value ~65,504 vs ~3.4 × 10^38). QLoRA training can produce gradient values that overflow FP16, causing NaN loss. BF16 maintains the same exponent range as FP32, making it far more stable for training.
On Ampere and Hopper GPUs (A100, H100, RTX 30/40 series), BF16 arithmetic is hardware-accelerated and performs identically to FP16. There is no performance reason to use FP16 for QLoRA training on modern hardware.
:::
Interview Q&A
Q1: What are the three innovations in QLoRA and how does each one contribute to memory reduction?
QLoRA combines three distinct techniques:
NF4 quantization reduces the base model's storage from 2 bytes per parameter (BF16) to 0.5 bytes per parameter - a 4x reduction. For a 70B model, this goes from 140 GB to 35 GB. NF4 is information-theoretically optimal for normally distributed weights because it places quantization levels at the quantiles of the normal distribution, minimizing expected quantization error.
Double quantization further compresses the quantization constants that NF4 requires. Each block of 64 weights needs a 32-bit absmax constant. Without double quantization, these constants consume ~4.4 GB for 70B parameters. Double quantization encodes these constants in 8 bits with their own (second-level) quantization constants, reducing the overhead to ~1.1 GB. Total savings: approximately 3.3 GB, or roughly 0.37 bits per parameter.
Paged optimizers do not reduce steady-state memory but prevent OOM crashes during memory spikes. Long sequences or unlucky batches cause temporary memory spikes. Without paged optimizers, these spikes kill the training job. With paged optimizers, the CUDA unified memory mechanism automatically evicts optimizer state pages to CPU RAM during spikes and pages them back when needed. The cost is occasional PCIe transfer latency, which is negligible compared to preventing job failure.
Q2: Why does QLoRA use bfloat16 as the compute dtype rather than float16, given that float16 is also 16 bits?
Float16 and bfloat16 have the same total bit count (16 bits) but different internal layouts. Float16 uses 5 bits for the exponent and 10 bits for the mantissa, giving a maximum representable value of approximately 65,504. Bfloat16 uses 8 bits for the exponent (same as float32) and 7 bits for the mantissa, giving a maximum representable value of approximately 3.4 × 10^38.
In neural network training, gradient values can span many orders of magnitude. When gradients exceed float16's dynamic range, they overflow to infinity or NaN, causing loss spikes or divergence. Bfloat16's wider dynamic range (matching float32) prevents this. The tradeoff is lower mantissa precision - bfloat16 rounds more aggressively than float16 for values close in magnitude - but this is generally acceptable for gradient-based optimization.
On modern GPUs (Ampere and Hopper), bfloat16 matrix multiplications are hardware-accelerated using tensor cores at the same speed as float16. There is no performance penalty, making bfloat16 strictly preferable for training.
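A two-line check with torch.finfo makes the dynamic-range difference concrete (illustrative values; the exact rounding depends on the input):

```python
import torch

# Dynamic range: float16 overflows where bfloat16 does not
print(torch.finfo(torch.float16).max)     # 65504.0
print(torch.finfo(torch.bfloat16).max)    # ~3.39e+38

x = torch.tensor([70000.0])
print(x.to(torch.float16))    # tensor([inf]) - overflows float16's range
print(x.to(torch.bfloat16))   # tensor([70144.]) - coarsely rounded, but finite
```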
Q3: What is the difference between storage dtype and compute dtype in QLoRA, and why does this design work?
Storage dtype (NF4) is how weights are stored in GPU memory between uses. The 4-bit encoding persists in GPU memory and is never directly used in arithmetic.
Compute dtype (bfloat16) is the precision used when weights are actually needed for a matrix multiply. Before each matrix multiplication involving a quantized layer, the layer's weights are dequantized from NF4 to bfloat16 on-the-fly. The dequantized copy exists only temporarily during the computation and is then freed.
This design works because modern GPU memory bandwidth is often the bottleneck, not compute. Storing weights in 4 bits means the GPU can load 4x as many weight values per memory transaction, reducing the effective bandwidth requirement. The dequantization itself is a fast lookup operation (find NF4 level, multiply by constant) that runs much faster than the matrix multiply it feeds.
The quality preservation comes from the compute dtype: even though weights are stored with 4-bit precision, they are expanded to bfloat16 before any arithmetic. Gradient computation therefore operates on bfloat16 values. The quantization error affects only the representation of the base model's weights, not the gradient computation precision.
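You can observe the split directly on a model loaded with load_in_4bit=True (a sketch; the exact parameter names and packed shapes are implementation details of bitsandbytes and PEFT, so treat the printed values as illustrative):

```python
# Inspect the storage dtype of a quantized projection weight
for name, param in model.named_parameters():
    if "q_proj" in name and "lora" not in name and name.endswith("weight"):
        # bitsandbytes keeps the packed 4-bit weights in a uint8 buffer
        print(name, param.dtype, tuple(param.shape))
        break

# Rough on-device footprint of the quantized model
print(f"Footprint: {model.get_memory_footprint() / 1e9:.1f} GB")
```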
Q4: How does QLoRA handle the fact that you can't backpropagate through a quantization operation?
This is a subtle but important point. You cannot compute meaningful gradients through the quantization step (the argmin operation that maps a weight to a NF4 index). The discrete nature of quantization makes it non-differentiable.
QLoRA sidesteps this entirely by freezing the base model weights. There are no gradients flowing to the quantized base weights - they are constants, not parameters. Backpropagation flows only through the LoRA adapter matrices A and B, which are stored and computed in bfloat16 and are fully differentiable.
From the optimizer's perspective, the quantized base model is just a function that maps inputs to activations. The LoRA adapters are the only parameters being optimized. The gradient of the loss with respect to A and B passes through the standard bfloat16 operations in the forward pass (including the dequantized base model computation), never needing to differentiate through the quantization step itself.
Q5: When would you choose QLoRA over standard BF16 LoRA, and when is BF16 LoRA the better choice?
Choose QLoRA when:
- Your target model does not fit on your GPU(s) in BF16. For a 70B model on 2x A100 80GB, BF16 LoRA is at the memory limit; QLoRA fits comfortably on one A100 80GB.
- You have consumer hardware (RTX 4090, 24 GB) and need to fine-tune anything above 7B parameters. QLoRA makes 13B viable; BF16 LoRA on 13B needs 28+ GB.
- Cost is a primary concern. Cloud GPU rental costs scale with hours. Fitting on fewer GPUs means faster iteration and lower bills.
Choose BF16 LoRA when:
- You have sufficient GPU memory. BF16 LoRA has no quantization noise and gives marginally better quality.
- Your task is quality-critical and you can measure small differences in output quality that might correlate with the NF4 quantization error.
- You need to fine-tune multiple adapters and re-run frequently. The setup overhead of QLoRA (prepare_model_for_kbit_training, careful dtype handling) makes BF16 LoRA simpler to iterate on.
- You are training on older hardware (pre-Ampere) that does not support bfloat16. QLoRA on FP16 is less stable and not recommended.
Q6: Explain paged optimizers. What problem do they solve and when does that problem actually occur?
The problem paged optimizers solve is GPU OOM crashes from memory spikes during training. Memory usage during a training step is not constant - it peaks during the backward pass when activations and gradients are simultaneously in memory. The spike can be 20-30% above average memory usage for typical sequence lengths, and larger for long sequences.
Without paged optimizers, this spike causes a CUDA OOM error, which kills the training process. You lose the current epoch's progress and need to restart with a smaller batch or sequence length, both of which hurt throughput.
Paged optimizers use CUDA's unified memory (UM) feature, which creates a virtual address space shared between GPU and CPU memory. Optimizer states - the first and second moment estimates maintained by Adam for each parameter - are allocated in this unified memory space. During normal operation, they reside on GPU as usual. When a memory spike occurs and the GPU is about to OOM, CUDA's UM driver automatically identifies the least-recently-used memory pages (typically optimizer states, since they are accessed once per step) and transfers them to CPU RAM. The GPU computation continues uninterrupted.
The cost is the latency of CPU-GPU transfer when optimizer states are paged out and back in. For Adam, the states are touched once per optimizer step (not per forward/backward pass), so the per-step overhead is small. In practice, the overhead is negligible when no paging occurs and modest when pages do spill to CPU RAM - far cheaper than the alternative, which is a crashed job.
Q7: The QLoRA paper claimed that Guanaco, a 65B QLoRA-fine-tuned model, reached 99.3% of ChatGPT's performance on the Vicuna benchmark. How should you interpret this claim, and what are its limitations?
The Guanaco result used the Vicuna evaluation methodology: ask both models a set of open-ended questions, then have GPT-4 score the responses. On this benchmark, Guanaco-65B scored 99.3% of ChatGPT's level.
Interpreting this carefully:
First, the Vicuna benchmark uses GPT-4 as the judge, which introduces a bias toward responses GPT-4 itself would produce. Models that mimic GPT-4's style score well regardless of actual quality.
Second, the benchmark was designed for open-ended conversation quality, not for technical tasks. Code generation, mathematical reasoning, and factual recall were not primary evaluation axes.
Third, LLaMA 65B is a strong base model. QLoRA fine-tuning on a high-quality instruction dataset (OASST1) produces impressive results when the base model is strong. The same approach on a weaker 7B base model shows a much larger gap.
The legitimate takeaway from the result: QLoRA fine-tuning quality loss is small enough to be invisible on conversational benchmarks. You are not making a significant trade when you choose QLoRA over BF16 for instruction-following tasks. The claim that it "nearly matches ChatGPT" is benchmark-specific and should not be generalized to all task types.
Summary
QLoRA is three innovations stacked on one insight: you can compress the base model's storage without affecting compute precision.
The insight: NF4 stores weights in 4 bits, but the forward pass dequantizes to bfloat16. Gradients flow in bfloat16. The quantization error is fixed and bounded, not accumulated. This means the LoRA adapters see a consistent (slightly noisy) base model throughout training and can correct for the noise.
The innovations:
- NF4 quantization: 4x storage reduction with minimal quality loss for normally distributed weights
- Double quantization: quantize the quantization constants, saving another 0.37 bits/param
- Paged optimizers: prevent OOM crashes by using CPU RAM as overflow storage for optimizer states
The result: fine-tune a 65B model on one GPU. Fine-tune a 7B model on a gaming GPU. Produce competitive models for hundreds of dollars instead of thousands.
Combined with LoRA from the previous lesson, this represents the complete stack for practical large model fine-tuning on accessible hardware. The remaining lessons in this module build on this foundation - selecting optimal target modules, preparing training data, using production training frameworks, and evaluating the results.
