Module 3: LoRA and QLoRA Fine-Tuning

Fully fine-tuning a 70B model requires on the order of eight H100s and hundreds of gigabytes of GPU memory. LoRA shrinks the trainable parameters to a fraction of a percent, eliminating most gradient and optimizer memory - a 7B model fine-tunes on a single consumer GPU with 24GB VRAM - but the frozen 16-bit base weights of a 70B model still occupy roughly 140GB. QLoRA quantizes those frozen weights to 4 bits, making a 70B fine-tune achievable on a single A100 80GB or two consumer GPUs.
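To make these numbers concrete, here is a back-of-the-envelope estimate in Python. The bytes-per-parameter figures are approximations and ignore activations, KV caches, and framework overhead; treat it as a sketch, not a sizing tool.

```python
def training_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Rough GPU memory for weights plus training state, ignoring activations."""
    return n_params * bytes_per_param / 1e9

N = 70e9  # 70B parameters

# Full fine-tuning, naive mixed-precision AdamW:
# bf16 weights (2) + bf16 grads (2) + fp32 moments (8) + fp32 master copy (4).
# ZeRO sharding and 8-bit optimizers shrink this, but it stays multi-GPU.
print(training_memory_gb(N, 16))   # ~1120 GB

# LoRA: frozen bf16 base; adapter gradients and optimizer state are negligible.
print(training_memory_gb(N, 2))    # ~140 GB

# QLoRA: frozen NF4 base, ~0.5 bytes/param plus small quantization constants.
print(training_memory_gb(N, 0.5))  # ~35 GB -> fits a single A100 80GB
```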

This is not magic. LoRA (Low-Rank Adaptation) rests on the observation that the weight updates during fine-tuning tend to have low intrinsic rank - they live in a much lower-dimensional subspace than the full weight matrix. Instead of updating all 70 billion parameters, you train two small matrices whose product approximates the update. QLoRA adds 4-bit quantization of the frozen base model weights, dramatically reducing memory requirements.

Why LoRA Works

A weight matrix W of shape (d_model, d_model) has d_model^2 parameters. For d_model=4096 (Llama 7B), that is about 16.8 million parameters per projection matrix - and each transformer layer contains several such matrices. Fine-tuning all of them is expensive.

LoRA's insight: the update delta_W during fine-tuning has low rank. You can approximate delta_W as A × B where A is (d_model, r) and B is (r, d_model), with r << d_model (typically r=8 to 64). For r=16, you have 2 × 4096 × 16 = 131,072 parameters per layer instead of 16,777,216 - a 128x reduction.
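Here is a minimal PyTorch sketch of that decomposition, using the A (d_model, r) and B (r, d_model) shapes above. Real implementations such as Hugging Face PEFT add dropout, dtype handling, and fused kernels, but the core idea is this:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update delta_W = A x B."""

    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)  # freeze W
        d_out, d_in = base.weight.shape
        self.B = nn.Parameter(torch.randn(r, d_in) * 0.01)  # (r, d_model) down-projection
        self.A = nn.Parameter(torch.zeros(d_out, r))  # (d_model, r), zero init so delta_W starts at 0
        self.scale = alpha / r                   # standard LoRA scaling

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path plus scaled low-rank path; only A and B receive gradients.
        return self.base(x) + self.scale * (x @ self.B.T @ self.A.T)

layer = LoRALinear(nn.Linear(4096, 4096, bias=False), r=16)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)                 # 131072 = 2 * 4096 * 16
print(4096 * 4096 // trainable)  # 128x fewer trainable parameters than W
```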

QLoRA further quantizes the frozen base model to 4-bit NF4 (Normal Float 4), cutting base-weight memory to roughly a quarter of 16-bit with minimal quality loss: the quantized weights are dequantized on the fly for each matrix multiply, and gradients flow through them into the LoRA adapters, which remain in higher precision.
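In code, this is a load-time flag rather than a separate training step. A minimal sketch using Hugging Face transformers with bitsandbytes; the model id is illustrative:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # Normal Float 4
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to bf16 for matmuls
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # illustrative model id
    quantization_config=bnb_config,
    device_map="auto",
)
```

Double quantization and paged optimizers, covered in Lesson 3, shave off a few more gigabytes at 70B scale.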

[Figure: LoRA parameter landscape]

Lessons in This Module

1. Why Full Fine-Tuning Does Not Scale - memory requirements, catastrophic forgetting
2. LoRA Theory and Math - low-rank decomposition, rank vs. expressiveness
3. QLoRA and 4-bit Training - NF4 quantization, double quantization, paged optimizers
4. Rank, Alpha, and Target Modules - hyperparameter selection, which modules to adapt
5. Instruction Tuning Dataset Prep - format, size requirements, quality filtering
6. Training Loop with Unsloth - 2x faster LoRA training, memory-efficient implementation
7. Evaluating Fine-Tuned Models - task-specific metrics, comparison to base model
8. Merging LoRA Adapters - merge_and_unload, GGUF export, deployment prep

Key Concepts You Will Master

  • LoRA rank selection - how to choose r and alpha for your specific task (see the PEFT sketch after this list)
  • Target module selection - q_proj, v_proj, k_proj, o_proj - which to adapt and why
  • QLoRA memory math - calculating exact memory requirements before starting training
  • Training data requirements - how much data you need for different fine-tuning objectives
  • Adapter merging - how to merge LoRA weights back into the base model for inference
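As a concrete starting point for the rank, alpha, target-module, and merging items above, here is a sketch using Hugging Face PEFT. The hyperparameter values are common defaults, not recommendations for every task:

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,           # rank of the low-rank update
    lora_alpha=32,  # effective update is scaled by alpha / r
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)  # `model` from the 4-bit loading sketch
model.print_trainable_parameters()          # typically well under 1% of all parameters

# ... training loop (Lesson 6) ...

# Fold the adapters back into the base weights for deployment (Lesson 8).
# For a 4-bit base, it is common to reload the base model in 16-bit first
# so the merge happens at full fidelity.
merged_model = model.merge_and_unload()
```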

Prerequisites
