Module 3: LoRA and QLoRA Fine-Tuning
Full fine-tuning a 70B model requires a multi-GPU cluster (8+ H100s) and hundreds of GB of GPU memory for weights, gradients, and optimizer state. LoRA shrinks the trainable parameter count by orders of magnitude, so gradients and optimizer state become negligible - a 7B model fits on a single consumer GPU with 24GB VRAM. QLoRA reduces memory further by quantizing the frozen base weights to 4-bit, making a 70B fine-tune achievable on a single A100 80GB or two consumer GPUs.
This is not magic. LoRA (Low-Rank Adaptation) works by observing that the weight updates during fine-tuning tend to have low intrinsic rank - they live in a much lower-dimensional subspace than the full weight matrix. Instead of updating all 70 billion parameters, you train two small matrices whose product approximates the update. QLoRA adds 4-bit quantization for the frozen base model weights, dramatically reducing memory requirements.
Why LoRA Works
A weight matrix W of shape (d_model, d_model) has d_model^2 parameters. For d_model=4096 (Llama 7B), that is about 16.8 million parameters per projection matrix - and each transformer layer contains several such matrices. Fine-tuning all of them is expensive.
LoRA's insight: the update delta_W during fine-tuning has low rank. You can approximate delta_W as A × B where A is (d_model, r) and B is (r, d_model), with r << d_model (typically r=8 to 64). For r=16, you have 2 × 4096 × 16 = 131,072 parameters per layer instead of 16,777,216 - a 128x reduction.
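The decomposition and the parameter arithmetic above can be sketched in a few lines. This is a pure-Python toy with tiny stand-in dimensions (real implementations use torch tensors on the attention projections); the `alpha/r` scaling factor is the standard LoRA convention:

```python
# Toy LoRA forward pass: y = x.W + (alpha/r) * (x.A).B
# Pure-Python sketch; d, r are tiny stand-ins for d_model=4096, r=16.
def matmul(X, Y):
    """Naive matrix multiply over nested lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

d, r, alpha = 4, 2, 4
W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # frozen base (identity here)
A = [[0.1] * r for _ in range(d)]   # (d, r), trainable
B = [[0.1] * d for _ in range(r)]   # (r, d), trainable

x = [[1.0, 2.0, 3.0, 4.0]]          # one token embedding
lora_out = matmul(matmul(x, A), B)  # low-rank path
y = [[base + (alpha / r) * low for base, low in zip(rb, rl)]
     for rb, rl in zip(matmul(x, W), lora_out)]

# Parameter arithmetic from the text, at real Llama-7B scale:
full_params = 4096 * 4096           # 16,777,216 per projection matrix
lora_params = 2 * 4096 * 16         # 131,072 for r=16
reduction = full_params // lora_params  # 128x fewer trainable parameters
```

Note that A and B are the only trainable tensors: the optimizer only ever sees `lora_params` weights, which is what collapses the memory bill.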
QLoRA further quantizes the frozen base model to 4-bit NF4 (Normal Float 4), cutting base-weight memory to roughly a quarter of fp16 with minimal quality loss: weights are dequantized on the fly for each forward pass, and gradients update only the LoRA adapters, which stay in higher precision (typically bf16).
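The memory consequences are easy to estimate with back-of-envelope arithmetic (the kind of calculation Lesson 3 works through). This is a rough sketch: it ignores activations, KV cache, and quantization constants, and the 0.5% trainable fraction is an assumed typical value, not a fixed property of LoRA:

```python
# Back-of-envelope QLoRA memory math for a 70B model.
GB = 1024 ** 3
n_params = 70e9                      # 70B base model

fp16_base = n_params * 2 / GB        # ~130 GB at 2 bytes/weight
nf4_base = n_params * 0.5 / GB       # ~33 GB at 4 bits/weight

trainable = n_params * 0.005         # assumed ~0.5% of params in LoRA adapters
adapters = trainable * 2 / GB        # bf16 adapter weights
adam_state = trainable * 8 / GB      # fp32 Adam moments (4 + 4 bytes/param)

# Frozen 4-bit base + small bf16 adapters + their optimizer state:
qlora_total = nf4_base + adapters + adam_state  # ~36 GB, fits one A100 80GB
```

The striking part is that the optimizer state, which dominates full fine-tuning, becomes a rounding error once only the adapters are trainable.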
[Diagram: LoRA Parameter Landscape]
Lessons in This Module
| # | Lesson | Key Concept |
|---|---|---|
| 1 | Why Full Fine-Tuning Does Not Scale | Memory requirements, catastrophic forgetting |
| 2 | LoRA Theory and Math | Low-rank decomposition, rank vs expressiveness |
| 3 | QLoRA and 4-bit Training | NF4 quantization, double quantization, paged optimizers |
| 4 | Rank, Alpha, and Target Modules | Hyperparameter selection, which modules to adapt |
| 5 | Instruction Tuning Dataset Prep | Format, size requirements, quality filtering |
| 6 | Training Loop with Unsloth | 2x faster LoRA training, memory-efficient implementation |
| 7 | Evaluating Fine-Tuned Models | Task-specific metrics, comparison to base model |
| 8 | Merging LoRA Adapters | merge_and_unload, GGUF export, deployment prep |
Key Concepts You Will Master
- LoRA rank selection - how to choose r and alpha for your specific task
- Target module selection - q_proj, v_proj, k_proj, o_proj - which to adapt and why
- QLoRA memory math - calculating exact memory requirements before starting training
- Training data requirements - how much data you need for different fine-tuning objectives
- Adapter merging - how to merge LoRA weights back into the base model for inference
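The adapter-merging step in the last bullet has a simple numerical core: fold the low-rank product into the frozen weight, W_merged = W + (alpha/r) * A.B, so inference pays no extra adapter matmuls. A pure-Python toy of that fold (PEFT's `merge_and_unload` performs the same operation on torch tensors):

```python
# Fold a LoRA update into the frozen base weight.
def matmul(X, Y):
    """Naive matrix multiply over nested lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

d, r, alpha = 3, 1, 2
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]                # frozen base weight (identity toy)
A = [[1.0], [2.0], [3.0]]            # (d, r) adapter factor
B = [[0.5, 0.5, 0.5]]                # (r, d) adapter factor

scaling = alpha / r
delta = matmul(A, B)                 # rank-1 update, shape (d, d)
W_merged = [[w + scaling * dw for w, dw in zip(row_w, row_d)]
            for row_w, row_d in zip(W, delta)]
```

After merging, the adapters can be discarded and the model serves at the base model's exact inference cost, which is why merging precedes GGUF export in Lesson 8.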
Prerequisites
- Model Ecosystem
- Running Locally
- PyTorch basics, at least one GPU with 8GB+ VRAM
