Module 3: LoRA and QLoRA Fine-Tuning
Full fine-tuning a 70B model requires a multi-GPU cluster (8+ H100s) and hundreds of GB of GPU memory for weights, gradients, and optimizer state. LoRA shrinks the trainable parameter count by orders of magnitude, so gradients and optimizer state become negligible - a 7B model fits on a single consumer GPU with 24GB VRAM. QLoRA reduces memory further by quantizing the frozen base weights to 4-bit, making a 70B fine-tune achievable on a single A100 80GB or two consumer GPUs.
This is not magic. LoRA (Low-Rank Adaptation) works by observing that the weight updates during fine-tuning tend to have low intrinsic rank - they live in a much lower-dimensional subspace than the full weight matrix. Instead of updating all 70 billion parameters, you train two small matrices whose product approximates the update. QLoRA adds 4-bit quantization for the frozen base model weights, dramatically reducing memory requirements.
Why LoRA Works
A weight matrix W of shape (d_model, d_model) has d_model^2 parameters. For d_model=4096 (Llama 7B), that is about 16.8 million parameters per projection matrix - and each transformer layer contains several such matrices. Fine-tuning all of them is expensive.
LoRA's insight: the update delta_W during fine-tuning has low rank. You can approximate delta_W as A × B where A is (d_model, r) and B is (r, d_model), with r << d_model (typically r=8 to 64). For r=16, you have 2 × 4096 × 16 = 131,072 parameters per layer instead of 16,777,216 - a 128x reduction.
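The decomposition and the parameter arithmetic above can be sketched in a few lines. This is a pure-Python toy with tiny stand-in dimensions (real implementations use torch tensors on the attention projections); the `alpha/r` scaling factor is the standard LoRA convention:

```python
# Toy LoRA forward pass: y = x.W + (alpha/r) * (x.A).B
# Pure-Python sketch; d, r are tiny stand-ins for d_model=4096, r=16.
def matmul(X, Y):
    """Naive matrix multiply over nested lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

d, r, alpha = 4, 2, 4
W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # frozen base (identity here)
A = [[0.1] * r for _ in range(d)]   # (d, r), trainable
B = [[0.1] * d for _ in range(r)]   # (r, d), trainable

x = [[1.0, 2.0, 3.0, 4.0]]          # one token embedding
lora_out = matmul(matmul(x, A), B)  # low-rank path
y = [[base + (alpha / r) * low for base, low in zip(rb, rl)]
     for rb, rl in zip(matmul(x, W), lora_out)]

# Parameter arithmetic from the text, at real Llama-7B scale:
full_params = 4096 * 4096           # 16,777,216 per projection matrix
lora_params = 2 * 4096 * 16         # 131,072 for r=16
reduction = full_params // lora_params  # 128x fewer trainable parameters
```

Note that A and B are the only trainable tensors: the optimizer only ever sees `lora_params` weights, which is what collapses the memory bill.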
QLoRA further quantizes the frozen base model to 4-bit NF4 (Normal Float 4), cutting base-weight memory to roughly a quarter of fp16 with minimal quality loss: weights are dequantized on the fly for each forward pass, and gradients update only the LoRA adapters, which stay in higher precision (typically bf16).
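The memory consequences are easy to estimate with back-of-envelope arithmetic (the kind of calculation Lesson 3 works through). This is a rough sketch: it ignores activations, KV cache, and quantization constants, and the 0.5% trainable fraction is an assumed typical value, not a fixed property of LoRA:

```python
# Back-of-envelope QLoRA memory math for a 70B model.
GB = 1024 ** 3
n_params = 70e9                      # 70B base model

fp16_base = n_params * 2 / GB        # ~130 GB at 2 bytes/weight
nf4_base = n_params * 0.5 / GB       # ~33 GB at 4 bits/weight

trainable = n_params * 0.005         # assumed ~0.5% of params in LoRA adapters
adapters = trainable * 2 / GB        # bf16 adapter weights
adam_state = trainable * 8 / GB      # fp32 Adam moments (4 + 4 bytes/param)

# Frozen 4-bit base + small bf16 adapters + their optimizer state:
qlora_total = nf4_base + adapters + adam_state  # ~36 GB, fits one A100 80GB
```

The striking part is that the optimizer state, which dominates full fine-tuning, becomes a rounding error once only the adapters are trainable.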
[Diagram: LoRA Parameter Landscape]
Lessons in This Module
| # | Lesson | Key Concept |
|---|---|---|
| 1 | Why Full Fine-Tuning Does Not Scale | Memory requirements, catastrophic forgetting |
| 2 | LoRA Theory and Math | Low-rank decomposition, rank vs expressiveness |
| 3 | QLoRA and 4-bit Training | NF4 quantization, double quantization, paged optimizers |
| 4 | Rank, Alpha, and Target Modules | Hyperparameter selection, which modules to adapt |
| 5 | Instruction Tuning Dataset Prep | Format, size requirements, quality filtering |
| 6 | Training Loop with Unsloth | 2x faster LoRA training, memory-efficient implementation |
| 7 | Evaluating Fine-Tuned Models | Task-specific metrics, comparison to base model |
| 8 | Merging LoRA Adapters | merge_and_unload, GGUF export, deployment prep |
Key Concepts You Will Master
- LoRA rank selection - how to choose r and alpha for your specific task
- Target module selection - q_proj, v_proj, k_proj, o_proj - which to adapt and why
- QLoRA memory math - calculating exact memory requirements before starting training
- Training data requirements - how much data you need for different fine-tuning objectives
- Adapter merging - how to merge LoRA weights back into the base model for inference
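The adapter-merging step in the last bullet has a simple numerical core: fold the low-rank product into the frozen weight, W_merged = W + (alpha/r) * A.B, so inference pays no extra adapter matmuls. A pure-Python toy of that fold (PEFT's `merge_and_unload` performs the same operation on torch tensors):

```python
# Fold a LoRA update into the frozen base weight.
def matmul(X, Y):
    """Naive matrix multiply over nested lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

d, r, alpha = 3, 1, 2
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]                # frozen base weight (identity toy)
A = [[1.0], [2.0], [3.0]]            # (d, r) adapter factor
B = [[0.5, 0.5, 0.5]]                # (r, d) adapter factor

scaling = alpha / r
delta = matmul(A, B)                 # rank-1 update, shape (d, d)
W_merged = [[w + scaling * dw for w, dw in zip(row_w, row_d)]
            for row_w, row_d in zip(W, delta)]
```

After merging, the adapters can be discarded and the model serves at the base model's exact inference cost, which is why merging precedes GGUF export in Lesson 8.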
Prerequisites
- Model Ecosystem
- Running Locally
- PyTorch basics, at least one GPU with 8GB+ VRAM
