
LoRA: Low-Rank Adaptation

The Memory Wall

It is late 2021. A researcher at a startup needs to fine-tune GPT-3 on their company's proprietary data. GPT-3 has 175 billion parameters. Each parameter stored as FP32 needs 4 bytes. The weights alone: 700GB. Fine-tuning with Adam requires also storing the gradient (700GB) and two optimizer states - first moment (700GB) and second moment (700GB). Total memory for full fine-tuning: approximately 2.8 terabytes.

The largest GPU on the market (A100 80GB) has 80 gigabytes of memory. You would need 35 A100 GPUs just for memory, plus interconnect, plus cluster management. At the time, a cluster of that scale cost $50-100K per month in cloud costs. Renting it long enough for meaningful fine-tuning: several hundred thousand dollars.

The entire idea of fine-tuning a frontier model was economically out of reach for anyone who was not OpenAI, Google, or Microsoft.

Then Edward Hu and colleagues at Microsoft Research published a paper in October 2021 that changed the equation: "LoRA: Low-Rank Adaptation of Large Language Models."

The insight: weight updates during fine-tuning are low-rank. You do not need to update 175 billion parameters. You need to update the directions in which the model's representations need to change - and those directions, empirically, span a much lower-dimensional space than the full parameter count suggests.

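LoRA with rank r=8 on the query and value projections of GPT-3 trains about 37.7M parameters, roughly 0.02% of the model, instead of all 175B. The frozen weights still have to sit in memory, so GPT-3 itself does not fit on one GPU, but Hu et al. report training memory falling from 1.2TB to 350GB. For a 7B-class model, the same technique brings fine-tuning within reach of a single A100.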

Why This Exists: The Fundamental Memory Problem

To understand why LoRA matters, you need to understand exactly where the memory goes during fine-tuning.

Consider a single weight matrix $W \in \mathbb{R}^{4096 \times 4096}$ (the size of the Q projection in a 7B-class model; GPT-3's own projections are $12288 \times 12288$, but the per-parameter cost is the same):

| Component | Bytes per parameter | Total for W |
| --- | --- | --- |
| Model weights (FP32) | 4 | 67MB |
| Gradient (FP32) | 4 | 67MB |
| Adam first moment | 4 | 67MB |
| Adam second moment | 4 | 67MB |
| Total per matrix | 16 | 268MB |

The same 16 bytes per parameter apply to every weight in the model, so full fine-tuning needs approximately 16GB per billion parameters for weights, gradients, and optimizer states (activations come on top). For GPT-3's 175B parameters across 96 layers: ~2.8TB.
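The arithmetic above is easy to reproduce. The following is a minimal sketch, assuming FP32 everywhere, 1 GB = 10^9 bytes, and no activation memory; the function name `full_finetune_memory_gb` is made up for this illustration.

```python
def full_finetune_memory_gb(num_params: float, bytes_per_value: int = 4) -> dict:
    """Rough memory (GB, 1 GB = 1e9 bytes) for full fine-tuning with Adam.

    Counts one copy each of weights, gradients, and the two Adam moments.
    Activations and framework overhead are deliberately ignored.
    """
    per_copy = num_params * bytes_per_value / 1e9
    return {
        "weights": per_copy,
        "gradients": per_copy,
        "adam_first_moment": per_copy,
        "adam_second_moment": per_copy,
        "total": 4 * per_copy,
    }


single_matrix = full_finetune_memory_gb(4096 * 4096)
gpt3 = full_finetune_memory_gb(175e9)

print(f"4096x4096 matrix: {single_matrix['total'] * 1000:.0f} MB")
print(f"GPT-3 (175B):     {gpt3['total']:.0f} GB")
print(f"Per billion params: {full_finetune_memory_gb(1e9)['total']:.0f} GB")
```

The three printed totals correspond to the 268MB, ~2.8TB, and 16GB-per-billion figures in the text.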

LoRA's solution: keep the pretrained weights $W_0$ frozen and instead learn a low-rank decomposition $\Delta W = BA$ where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$, with $r \ll \min(d, k)$.

For the same $4096 \times 4096$ matrix with rank $r=8$:

  • A matrix: $8 \times 4096 = 32,768$ parameters
  • B matrix: $4096 \times 8 = 32,768$ parameters
  • Total trainable: 65,536 vs 16,777,216 - a 256x reduction

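You still need to load $W_0$ into GPU memory (it is needed for the forward pass), but you only compute gradients and optimizer states for the small LoRA matrices. Memory for weights, gradients, and optimizer states drops by approximately 3-4x compared with full fine-tuning with Adam: the 12 bytes per parameter of gradient and Adam state all but disappear, while the 4 bytes per parameter of frozen weights remain.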
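To make the 3-4x figure concrete, here is a rough estimate for a 7B model. It is a sketch under stated assumptions: FP32 storage, activations ignored, and a hypothetical adapter of 40M trainable parameters (roughly what r=16 on every linear layer of a 7B model would add).

```python
def training_memory_gb(frozen_params: float, trainable_params: float,
                       bytes_per_value: int = 4) -> float:
    """Weights for all parameters, plus gradient + two Adam moments for trainable ones."""
    weights = (frozen_params + trainable_params) * bytes_per_value
    grads_and_adam = 3 * trainable_params * bytes_per_value
    return (weights + grads_and_adam) / 1e9


BASE_PARAMS = 7e9      # assumed model size
LORA_PARAMS = 40e6     # assumed adapter size (illustrative)

full_ft = training_memory_gb(frozen_params=0, trainable_params=BASE_PARAMS)
lora_ft = training_memory_gb(frozen_params=BASE_PARAMS, trainable_params=LORA_PARAMS)
ratio = full_ft / lora_ft

print(f"Full fine-tuning: {full_ft:.1f} GB")
print(f"LoRA:             {lora_ft:.1f} GB")
print(f"Reduction:        {ratio:.1f}x")
```

The ratio can never exceed 4x in this FP32 accounting, because the frozen weights account for a quarter of the full fine-tuning footprint.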

The Core Mathematics

Normally during fine-tuning, a weight update modifies the weight matrix:

$$W_{\text{new}} = W_0 + \Delta W$$

LoRA constrains the update to be low-rank:

$$\Delta W = BA$$

where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and $r \ll \min(d, k)$.

The forward pass becomes:

$$h = W_0 x + \Delta W x = W_0 x + B A x$$

During forward propagation, you compute both $W_0 x$ and $BAx$ and add them. Since $W_0$ is frozen, no gradient is computed or stored for it; only $B$ and $A$ receive updates. (Backpropagation still passes through $W_0$ to reach the adapters in earlier layers.)

Initialization: $A$ is initialized with random Gaussian noise (small values), and $B$ is initialized to zero. This ensures $\Delta W = BA = 0$ at the start of training - the model begins fine-tuning from the pretrained weights without any perturbation.

Scaling: the output of LoRA is scaled by $\alpha/r$:

$$h = W_0 x + \frac{\alpha}{r} B A x$$

This scaling factor controls the effective magnitude of the LoRA update. When $\alpha = r$, the scale is 1.0 (no scaling). Setting $\alpha = 2r$ doubles the LoRA term, which acts roughly like doubling the learning rate for the LoRA update. In practice, a common choice is $\alpha = r$ or $\alpha = 2r$.
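The forward pass, the zero initialization of $B$, and the $\alpha/r$ scaling can be checked on a toy example. This is a minimal NumPy sketch with made-up sizes; `B_trained` is random noise standing in for a trained matrix, not the result of any actual training.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 64, 48, 4                        # toy sizes, chosen for illustration

W0 = rng.normal(size=(d, k))               # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r, k))    # small Gaussian init
B = np.zeros((d, r))                       # zero init, so B A = 0 at the start


def lora_forward(x, W0, A, B, alpha, r):
    """h = W0 x + (alpha / r) * B (A x); never forms the full d x k product BA."""
    return W0 @ x + (alpha / r) * (B @ (A @ x))


x = rng.normal(size=k)

# At initialization the adapted layer reproduces the pretrained layer exactly.
h_start = lora_forward(x, W0, A, B, alpha=8, r=r)
print("unchanged at init:", np.allclose(h_start, W0 @ x))

# With a non-zero B, doubling alpha doubles the LoRA contribution.
B_trained = rng.normal(scale=0.01, size=(d, r))
delta_1x = lora_forward(x, W0, A, B_trained, alpha=r, r=r) - W0 @ x
delta_2x = lora_forward(x, W0, A, B_trained, alpha=2 * r, r=r) - W0 @ x
print("alpha = 2r doubles the update:", np.allclose(delta_2x, 2 * delta_1x))
```

Computing `B @ (A @ x)` rather than `(B @ A) @ x` keeps the extra cost proportional to r, which is why the adapter path adds little compute.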

Rank Selection: The Key Hyperparameter

The rank $r$ controls the expressive capacity of the LoRA adapter. Common choices:

| Rank | Trainable Parameters (7B model, approx.) | When to Use |
| --- | --- | --- |
| r=4 | ~8M | Style/format adaptation, very small data |
| r=8 | ~16M | Standard instruction following |
| r=16 | ~32M | Complex task adaptation |
| r=32 | ~64M | Significant distribution shift |
| r=64 | ~128M | When approaching full fine-tuning quality |
| r=128+ | ~256M+ | Diminishing returns vs full fine-tuning |

Hu et al. (2021) showed that for most NLP tasks, r=4 and r=8 produce results comparable to larger ranks. The empirical finding: weight update matrices during fine-tuning have a low intrinsic rank - typically far below the matrix dimensions. Even for large distribution shifts, rank above 64 rarely helps.
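Exact trainable-parameter counts depend on which matrices are adapted, so the round figures in the table are best read as rough guides. The sketch below counts them for a 7B-class model, assuming LLaMA-7B-style shapes (hidden size 4096, MLP size 11008, 32 layers); these dimensions are assumptions for the example and are not given in the text above.

```python
HIDDEN, INTERMEDIATE, LAYERS = 4096, 11008, 32   # assumed LLaMA-7B-style shapes

# (in_features, out_features) of each linear layer in one transformer block
ATTENTION = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
}
MLP = {
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(shapes: dict, r: int, layers: int = LAYERS) -> int:
    """Each adapted matrix adds r * (in_features + out_features) parameters."""
    per_layer = sum(r * (fan_in + fan_out) for fan_in, fan_out in shapes.values())
    return per_layer * layers


for r in (4, 8, 16, 32, 64):
    attn_only = lora_param_count(ATTENTION, r)
    all_linear = lora_param_count({**ATTENTION, **MLP}, r)
    print(f"r={r:2d}: attention only {attn_only / 1e6:6.1f}M, "
          f"all linear {all_linear / 1e6:6.1f}M")
```

Under these assumptions the table's figures fall between the attention-only and all-linear counts, and the count grows linearly with r in both cases.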

Which Matrices to Apply LoRA To

In a transformer, the main weight matrices are:

  • Attention: $W_Q$, $W_K$, $W_V$, $W_O$ (query, key, value, output projections)
  • MLP: $W_{\text{up}}$, $W_{\text{gate}}$, $W_{\text{down}}$ (feed-forward layers)
  • Embedding and LM head (usually skipped)

Original LoRA paper: applied only to $W_Q$ and $W_V$ in attention - this was sufficient for competitive results on natural language understanding.

Modern practice: apply to all attention matrices AND the MLP matrices. This is especially important for tasks requiring significant knowledge or style adaptation. In HuggingFace PEFT, this is controlled by the target_modules parameter.

Typical target_modules for LLaMA-based models: ["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]

Alpha Scaling and Effective Learning Rate

The lora_alpha parameter controls how strongly the LoRA update is scaled relative to the frozen base model's output. The formula:

$$\text{effective LoRA contribution} = \frac{\alpha}{r} \times \Delta W x$$

With lora_alpha = r (common default), the scale is 1.0. The LoRA update has the same scale as if $\Delta W$ were a normal weight update.

The choice of lora_alpha interacts with the learning rate:

  • Higher lora_alpha/r ratio → effectively higher learning rate for LoRA matrices
  • A common heuristic: set lora_alpha = 2 * r (scale = 2.0) which slightly amplifies the LoRA signal
  • Another common choice: lora_alpha = r always (simpler to reason about)

In practice, the interaction between lora_alpha and learning_rate means you should tune one or the other but not both independently. Set lora_alpha = r and tune only learning_rate.
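A small helper makes the ratio explicit when comparing configurations. This is only an illustration of the lora_alpha / r arithmetic; the function names are invented here and are not part of PEFT.

```python
def lora_scale(lora_alpha: float, r: int) -> float:
    """Multiplier applied to the LoRA term B A x."""
    if r <= 0:
        raise ValueError("r must be a positive integer")
    return lora_alpha / r


def alpha_for_scale(target_scale: float, r: int) -> float:
    """lora_alpha that keeps the multiplier fixed when the rank changes."""
    return target_scale * r


for alpha, r in [(8, 8), (16, 8), (128, 8), (1, 8)]:
    print(f"r={r}, lora_alpha={alpha:3d} -> scale {lora_scale(alpha, r):g}")

# Keeping scale = 2.0 while moving from r=8 to r=32 requires raising lora_alpha.
print(alpha_for_scale(2.0, 8), alpha_for_scale(2.0, 32))
```

Because the multiplier depends on both values, changing r while leaving lora_alpha untouched silently changes the scale, which is one reason to fix lora_alpha = r and tune only the learning rate.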

Merging LoRA Weights for Inference

After training, you have two sets of weights:

  1. The frozen base model weights $W_0$
  2. The trained LoRA matrices $A$ and $B$

For inference, you have two options:

Option 1: Inference with adapters - load base model, load LoRA adapters, compute $W_0 x + \frac{\alpha}{r} BAx$ at every layer. Slightly slower than baseline due to extra computation.

Option 2: Merge and unload - compute $W_{\text{merged}} = W_0 + \frac{\alpha}{r} BA$ once and save. Inference is identical to the base model with no overhead.

Merging produces a single model with no LoRA overhead. The downside: you cannot easily swap adapters. The advantage: zero inference overhead - the merged model runs exactly as fast as the original.

# Merge LoRA into base model weights
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model_with_lora = PeftModel.from_pretrained(base_model, "./lora-adapter/")

# Merge LoRA weights into base model
merged_model = model_with_lora.merge_and_unload()

# Save as a standalone model (no PEFT dependency needed at inference)
merged_model.save_pretrained("./merged-model/")
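Why merging costs nothing in accuracy can be seen directly from the algebra: applying the adapter at run time and folding it into the weight give the same output. The following NumPy sketch checks this on random toy matrices (sizes and values are arbitrary and only for illustration).

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, r, alpha = 32, 32, 4, 8              # toy sizes
scale = alpha / r

W0 = rng.normal(size=(d, k))               # frozen base weight
A = rng.normal(size=(r, k))                # stand-ins for trained LoRA matrices
B = rng.normal(size=(d, r))
x = rng.normal(size=k)

# Option 1: keep the adapter separate and add its output at run time.
h_adapter = W0 @ x + scale * (B @ (A @ x))

# Option 2: fold the adapter into the weight once, then use a plain matmul.
W_merged = W0 + scale * (B @ A)
h_merged = W_merged @ x
print("same output:", np.allclose(h_adapter, h_merged))

# Merging is reversible as long as A, B, and the scale are kept around.
W_restored = W_merged - scale * (B @ A)
print("base weight recovered:", np.allclose(W_restored, W0))
```

In low-precision formats such as BF16 the merged and unmerged paths can differ by rounding error, so the equality holds only approximately there.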

Code: Training LoRA with PEFT

"""
LoRA fine-tuning with HuggingFace PEFT library.
Demonstrates:
1. Configuring LoRA for a transformer model
2. Training loop
3. Saving and loading adapters
4. Merging for inference
"""

import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
)
from peft import (
    LoraConfig,
    get_peft_model,
    TaskType,
    PeftModel,
)
from datasets import Dataset
from trl import SFTTrainer, SFTConfig


def create_lora_model(
    base_model_name: str,
    lora_r: int = 16,
    lora_alpha: int = 32,
    lora_dropout: float = 0.05,
    target_modules: list = None,
):
    """
    Load a base model and wrap it with LoRA adapters.
    """
    if target_modules is None:
        # For LLaMA/Mistral-based models - apply to all linear layers
        target_modules = [
            "q_proj", "v_proj", "k_proj", "o_proj",
            "gate_proj", "up_proj", "down_proj"
        ]

    # Load base model (use_cache=False disables the KV cache during training)
    model = AutoModelForCausalLM.from_pretrained(
        base_model_name,
        torch_dtype=torch.bfloat16,
        use_cache=False,
    )

    # LoRA configuration
    lora_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        r=lora_r,                       # Rank
        lora_alpha=lora_alpha,          # Alpha scaling (use 2*r as a good default)
        lora_dropout=lora_dropout,      # Dropout on LoRA layers for regularization
        target_modules=target_modules,  # Which weight matrices to adapt
        bias="none",                    # Don't adapt bias terms
        inference_mode=False,
    )

    # Apply LoRA to model
    model = get_peft_model(model, lora_config)

    # Print parameter count comparison
    model.print_trainable_parameters()
    # Example output for Llama-2-7B with r=16 on all seven projections:
    # "trainable params: 39,976,960 || all params: 6,778,392,576
    # || trainable%: 0.5898"

    return model


def train_with_lora(
    base_model_name: str = "meta-llama/Llama-2-7b-hf",
    output_dir: str = "./lora-adapter",
    train_data: list = None,
    lora_r: int = 16,
    lora_alpha: int = 32,
):
    """Full LoRA training pipeline."""

    tokenizer = AutoTokenizer.from_pretrained(base_model_name)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "right"

    model = create_lora_model(
        base_model_name,
        lora_r=lora_r,
        lora_alpha=lora_alpha,
    )

    # Example training data
    if train_data is None:
        train_data = [
            {
                "text": (
                    "Below is an instruction. Write a response.\n\n"
                    "### Instruction:\nExplain what a transformer is.\n\n"
                    "### Response:\nA transformer is a neural network architecture "
                    "that uses self-attention mechanisms to process sequential data."
                )
            }
        ]

    dataset = Dataset.from_list(train_data)

    # Note: SFTConfig/SFTTrainer argument names vary between TRL releases
    # (for example, some newer releases use max_length instead of
    # max_seq_length). Check the version you have installed.
    sft_config = SFTConfig(
        output_dir=output_dir,
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,  # LoRA can use higher LR than full fine-tuning
        lr_scheduler_type="cosine",
        warmup_ratio=0.05,
        bf16=True,
        max_seq_length=2048,
        logging_steps=10,
        save_steps=100,
        report_to="none",
    )

    trainer = SFTTrainer(
        model=model,
        args=sft_config,
        train_dataset=dataset,
        processing_class=tokenizer,
    )

    trainer.train()

    # Save only the LoRA adapter weights (much smaller than full model)
    model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)
    print(f"LoRA adapter saved to {output_dir}")

    return model


def load_and_use_lora(
    base_model_name: str,
    lora_adapter_path: str,
    prompt: str,
    merge: bool = False,
):
    """
    Load a base model + LoRA adapter and generate text.

    merge=True: merge LoRA into weights for faster inference (no adapter overhead)
    merge=False: keep adapters separate (allows easy adapter swapping)
    """
    tokenizer = AutoTokenizer.from_pretrained(base_model_name)

    base_model = AutoModelForCausalLM.from_pretrained(
        base_model_name,
        torch_dtype=torch.bfloat16,
    )

    if merge:
        # Option 1: Merge adapters into weights (recommended for production)
        model = PeftModel.from_pretrained(base_model, lora_adapter_path)
        model = model.merge_and_unload()
        print("LoRA merged into base model weights")
    else:
        # Option 2: Keep adapters loaded (allows hot-swapping)
        model = PeftModel.from_pretrained(base_model, lora_adapter_path)
        model.eval()
        print("LoRA adapter loaded separately")

    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=200,
            temperature=0.7,
            top_p=0.9,
            do_sample=True,
        )

    return tokenizer.decode(outputs[0], skip_special_tokens=True)


# ---- Demonstrate the parameter count math ----

def show_lora_parameter_counts():
    """
    Show how LoRA dramatically reduces trainable parameters.
    """
    import torch.nn as nn

    # Simulate a weight matrix (like Q_proj in LLaMA-7B)
    d, k = 4096, 4096  # Typical transformer dimension

    # Full matrix
    W_full = nn.Parameter(torch.zeros(d, k))
    full_params = d * k
    print(f"Full matrix ({d}x{k}): {full_params:,} trainable parameters")

    # LoRA decomposition
    for r in [4, 8, 16, 32, 64]:
        lora_params = d * r + r * k
        reduction = full_params / lora_params
        print(f" LoRA r={r}: {lora_params:,} params, {reduction:.0f}x fewer")


# show_lora_parameter_counts() output:
# Full matrix (4096x4096): 16,777,216 trainable parameters
# LoRA r=4: 32,768 params, 512x fewer
# LoRA r=8: 65,536 params, 256x fewer
# LoRA r=16: 131,072 params, 128x fewer
# LoRA r=32: 262,144 params, 64x fewer
# LoRA r=64: 524,288 params, 32x fewer

Production Engineering Notes

Multiple LoRA Adapters

One of LoRA's underrated advantages: you can maintain multiple adapters for one base model. The base model (e.g., LLaMA-2-7B) is loaded once. Different LoRA adapters encode different tasks, styles, or domains. Switching adapters at inference time is fast - just load the small adapter weights.

This enables "model routing" without multiple full model deployments: one base model, 10 different adapters for 10 different customer use cases. The adapter files are typically 10-50MB, vs 14GB for the full 7B model.
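Adapter file sizes follow directly from the parameter count. The sketch below estimates them for a 7B-class model, assuming square 4096 x 4096 attention projections, 32 layers, and 2 bytes per value (BF16/FP16); these shapes are assumptions for illustration, and real files also carry a small amount of metadata.

```python
def adapter_size_mb(hidden: int, layers: int, r: int,
                    matrices_per_layer: int, bytes_per_value: int = 2) -> float:
    """Size of LoRA weights for square hidden x hidden projections, in MB (1e6 bytes)."""
    params = layers * matrices_per_layer * r * (hidden + hidden)
    return params * bytes_per_value / 1e6


HIDDEN, LAYERS = 4096, 32                  # assumed 7B-class dimensions

qv_r8 = adapter_size_mb(HIDDEN, LAYERS, r=8, matrices_per_layer=2)
attn_r16 = adapter_size_mb(HIDDEN, LAYERS, r=16, matrices_per_layer=4)
base_model_gb = 7e9 * 2 / 1e9              # 7B parameters at 2 bytes each

print(f"r=8 on q_proj/v_proj:    {qv_r8:.1f} MB")
print(f"r=16 on all attention:   {attn_r16:.1f} MB")
print(f"Base model (BF16):       {base_model_gb:.0f} GB")
```

Adapters that also cover the MLP layers, use a higher rank, or are saved in FP32 will be correspondingly larger than these estimates.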

LoRA for Continual Learning

Because the base model weights are frozen, LoRA provides a natural framework for continual learning. Train adapter V1 for task A. Train adapter V2 for task B. Neither adapter modifies the base model, so there is no interference between them. Compare this to full fine-tuning where task B training overwrites the representations from task A.

LoRA Rank and Downstream Task Complexity

Empirically:

  • Style/format transfer (tone, length, format): r=4 is usually sufficient
  • Domain adaptation (medical, legal, financial): r=16 to r=32
  • Task-specific fine-tuning (code generation, math): r=32 to r=64
  • Significant knowledge update: r=64+, consider full fine-tuning instead

If you are trying to teach the model new factual knowledge (facts not in the pretraining data), LoRA is generally not effective - the model's factual knowledge is stored in its weights, not in the low-rank directions. Use retrieval-augmented generation (RAG) for knowledge updates.

note

LoRA trained about 0.003% of GPT-3's parameters

For GPT-3 (175B parameters), Hu et al. applied LoRA to the query and value projection matrices across all 96 layers. At the smallest setting, r=1, that is 2 matrices × 96 layers × 2 × 12,288 ≈ 4.7M trainable parameters out of 175B, or 0.0027% (r=4 gives about 18.9M and r=8 about 37.7M). Yet the fine-tuned model matched or exceeded full fine-tuning on the benchmarks reported in the paper. This is the paper's most striking result: you do not need to update all parameters, because the space of useful updates is much lower-dimensional than the full parameter space.
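The GPT-3 count can be verified with a few lines of arithmetic. This sketch assumes GPT-3's hidden size of 12,288 (taken from the GPT-3 paper, not stated in the text above) and adapters on the query and value projections only.

```python
GPT3_HIDDEN = 12288        # assumed hidden size of GPT-3 175B
GPT3_LAYERS = 96
GPT3_TOTAL = 175e9


def qv_lora_params(r: int) -> int:
    """Trainable parameters when LoRA is applied to W_Q and W_V in every layer."""
    per_matrix = r * (GPT3_HIDDEN + GPT3_HIDDEN)   # A is r x d, B is d x r
    return GPT3_LAYERS * 2 * per_matrix


for r in (1, 4, 8):
    n = qv_lora_params(r)
    print(f"r={r}: {n:,} trainable parameters "
          f"({n / GPT3_TOTAL * 100:.4f}% of 175B)")
```

Under these assumptions, a count of about 4.7M corresponds to rank 1, and rank 8 lands near 37.7M.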

Common Mistakes

danger

Setting lora_alpha too high or too low

With lora_alpha much larger than r (e.g., r=8, alpha=128), the LoRA updates are scaled 16x, effectively multiplying the learning rate for LoRA by 16. This can cause training instability and overshooting. Conversely, lora_alpha = 1 with r = 8 means every LoRA update is scaled down to 0.125 - the model barely learns. A safe default: lora_alpha = r (scale = 1.0) or lora_alpha = 2*r (scale = 2.0). Tune learning_rate for fine-grained control.

danger

Forgetting to set use_cache=False with gradient checkpointing

LoRA training is often combined with gradient checkpointing for memory efficiency, which is incompatible with the KV cache. If you forget to set use_cache=False, you will typically get a warning, and behavior can differ between library versions. Set use_cache=False in the model config (model.config.use_cache = False) during training, and re-enable it at inference time.

warning

Applying LoRA only to Q and V projections for complex tasks

The original LoRA paper showed that applying to Q and V projections was sufficient for GLUE-style tasks. For instruction following and complex generation tasks, applying LoRA to all attention projections AND MLP layers often produces noticeably better results. Check the model's architecture to identify all linear layers, or use target_modules="all-linear" in PEFT to automatically target all linear layers.

tip

Check trainable parameter count before starting training

Always call model.print_trainable_parameters() after wrapping with PEFT/LoRA. This confirms you have configured LoRA correctly. A common mistake is misconfiguring target_modules and accidentally having 0 trainable parameters (LoRA not applied to any layers) or having too many (LoRA accidentally applied to embedding tables).

Interview Q&A

Q1: Explain the LoRA math. Why does constraining weight updates to be low-rank work?

LoRA represents weight updates as $\Delta W = BA$ where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$, with $r \ll \min(d, k)$. The forward pass computes $h = W_0 x + \frac{\alpha}{r} BAx$. This works because of an empirical observation: during fine-tuning, the weight updates have a low intrinsic rank. The space of meaningful adaptations for a given task is low-dimensional. Hu et al. support this by comparing adapters trained at different ranks: the top singular directions learned at r=8 and r=64 overlap heavily, which suggests that a very small rank already captures most of the useful update. The intuition: you are not relearning the model from scratch, you are nudging it in specific semantic directions, and the set of useful "nudges" is much smaller than the full parameter space.

Q2: How do you choose between r=8, r=16, and r=64 for a LoRA adapter?

Start with r=8 for simple style/format adaptation. Use r=16 as a good default for instruction tuning on moderate-sized datasets. Use r=32 to r=64 for significant domain shift or complex tasks (code generation, reasoning). Rule of thumb: if r=64 does not improve over r=16, your bottleneck is data quality or the task truly requires full fine-tuning. Compute-wise, doubling the rank roughly doubles the trainable parameters and adds minimal compute overhead - the base model forward pass dominates. Memory for LoRA matrices is negligible (a rank-64 adapter for a 7B model is about 100MB).

Q3: What happens when you merge LoRA weights into the base model? Is the result identical to full fine-tuning?

Merging computes $W_{\text{new}} = W_0 + \frac{\alpha}{r} BA$ and saves the result as the new weight matrix. The merged model is computationally identical to a regular transformer - no adapter overhead. The merged model is NOT identical to what you would get from full fine-tuning, because LoRA constrains the update to be low-rank. Full fine-tuning can make arbitrary weight updates; LoRA can only make rank-$r$ updates. In practice, for most tasks at r=16+, the quality difference is negligible. But for tasks requiring large distribution shift or updating factual knowledge baked into specific neurons, full fine-tuning may be strictly better.

Q4: What are the advantages of not merging the LoRA adapter (keeping it separate from the base model)?

Keeping adapters separate enables: (1) adapter hot-swapping - load the base model once, then swap adapters for different tasks without reloading the 14GB base model; (2) multi-tenancy - in a serving setup, 100 customers can each have their own 50MB adapter while sharing one 14GB base model; (3) adapter composition - combine multiple adapters (for example, LoraHub-style composition) without merging conflicts; (4) easy rollback - keep the base model as the fallback, just unload the adapter. The downside: a small inference overhead from the extra matrix multiplication, typically less than 5% latency increase.

Q5: How does LoRA prevent catastrophic forgetting?

LoRA limits catastrophic forgetting by keeping the pretrained weights $W_0$ completely frozen - they are never updated. Only the small $A$ and $B$ matrices are trained. LoRA can still cause some "representation drift" at the layers where it is applied, so with the adapter active the model may lose some ability on unrelated tasks, but the base model's core knowledge (stored in the frozen weights) is preserved. This is a stronger guarantee than full fine-tuning with a small learning rate - with full fine-tuning, you might still inadvertently overwrite representations with enough training steps. With LoRA, the pretrained knowledge stays in $W_0$, which never changes, so disabling or removing the adapter restores the original model exactly.

Advanced: LoRA Variants and Extensions

The original LoRA paper spawned a family of variants that address specific limitations:

DoRA: Weight-Decomposed Low-Rank Adaptation (Liu et al., 2024)

DoRA decomposes the pretrained weight into a magnitude component and a direction component, then applies LoRA only to the direction component:

$$W = m \cdot \frac{V}{\|V\|_c}, \quad V = W_0 + BA$$

Where $m$ is the magnitude vector (one trainable scalar per output dimension) and $V/\|V\|_c$ is the normalized direction. This decomposition more closely mimics how full fine-tuning updates weights - which tend to update magnitude and direction somewhat independently. DoRA typically outperforms LoRA by 1-3% on downstream tasks with the same parameter count.

AdaLoRA: Adaptive Budget Allocation (Zhang et al., 2023)

Instead of using the same rank for all weight matrices, AdaLoRA allocates the rank budget based on importance. Less important matrices get lower rank; critical matrices get higher rank. The total parameter count is fixed, but the rank distribution is learned. This typically outperforms LoRA with uniform rank on tasks where different weight matrices have very different importance.

VeRA: Vector-Based Random Matrix Adaptation (Kopiczko et al., 2024)

VeRA uses a single set of frozen random matrices shared across all layers, and only learns small per-layer scaling vectors. Even more parameter-efficient than LoRA (0.01% of parameters for a 7B model vs 0.1-0.5% for LoRA). Works surprisingly well for style/format adaptation.
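To give a feel for the difference in trainable-parameter counts, the sketch below compares LoRA and VeRA on the query and value projections of a 7B-class model. It assumes the VeRA formulation in which each adapted matrix trains one scaling vector of length r and one of length d_out, while the random matrices are frozen and shared; the dimensions and ranks are illustrative choices, not values from the text.

```python
def lora_trainable(d_in: int, d_out: int, r: int) -> int:
    """LoRA trains A (r x d_in) and B (d_out x r) for each adapted matrix."""
    return r * (d_in + d_out)


def vera_trainable(d_out: int, r: int) -> int:
    """VeRA (as assumed here) trains only two scaling vectors per adapted matrix."""
    return r + d_out


HIDDEN, LAYERS, MATRICES_PER_LAYER = 4096, 32, 2   # assumed: q_proj and v_proj

lora_total = LAYERS * MATRICES_PER_LAYER * lora_trainable(HIDDEN, HIDDEN, r=16)
vera_total = LAYERS * MATRICES_PER_LAYER * vera_trainable(HIDDEN, r=256)

print(f"LoRA (r=16):  {lora_total:,} trainable parameters")
print(f"VeRA (r=256): {vera_total:,} trainable parameters")
print(f"Ratio: {lora_total / vera_total:.0f}x fewer for VeRA")
```

VeRA can afford a much larger r because the rank only adds r scalars per matrix rather than r full rows and columns.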

"""
LoRA variant implementations and comparisons.
"""

import torch
import torch.nn as nn
import math


class StandardLoRA(nn.Module):
    """Standard LoRA implementation from Hu et al. (2021)."""

    def __init__(self, in_features: int, out_features: int, r: int = 8, alpha: float = 16):
        super().__init__()
        self.r = r
        self.scaling = alpha / r

        self.lora_A = nn.Parameter(torch.empty(r, in_features))
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))

        # A: Kaiming-uniform init (the paper describes a Gaussian init),
        # B: zero init (so delta_W = 0 at start)
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Returns only the LoRA term, x @ A.T @ B.T * scaling.
        # The caller adds it to the frozen layer's output W0 x.
        return (x @ self.lora_A.T @ self.lora_B.T) * self.scaling


class DoRALayer(nn.Module):
    """
    DoRA: Decomposes weight into magnitude + direction,
    applies LoRA to direction component.
    """

    def __init__(
        self,
        base_weight: torch.Tensor,  # The frozen pretrained weight W0, shape (out, in)
        r: int = 8,
        alpha: float = 16,
    ):
        super().__init__()
        out_features, in_features = base_weight.shape
        self.r = r
        self.scaling = alpha / r

        # Frozen pretrained weight W0 (a buffer, so it is never trained)
        self.register_buffer("base_weight", base_weight.detach().clone())

        # Magnitude: initialized to the norm of each output row of W0
        # Shape: (out_features, 1)
        self.magnitude = nn.Parameter(
            base_weight.detach().norm(p=2, dim=1, keepdim=True)
        )

        # LoRA for the direction update
        self.lora_A = nn.Parameter(torch.empty(r, in_features))
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # V = W0 + scaling * B A, built in weight space so shapes match W0
        delta_w = (self.lora_B @ self.lora_A) * self.scaling
        v = self.base_weight + delta_w

        # Direction = V / ||V||, normalized per output row
        direction = v / v.norm(p=2, dim=1, keepdim=True)

        # W = m * direction. At init B = 0, so W equals W0 exactly.
        return x @ (self.magnitude * direction).T


# LoRA rank ablation study
def lora_rank_ablation():
    """
    Measure how well rank-r truncations approximate a simulated weight update.
    In practice, measure downstream task performance.
    """
    d = 4096  # Hidden dimension (typical for 7B model)

    # Stand-in for a full fine-tuning update.
    # NOTE: an i.i.d. Gaussian matrix has a nearly flat spectrum, so it is a
    # worst case for low-rank approximation. Replace it with a real
    # (W_finetuned - W_pretrained) difference to measure an actual update.
    torch.manual_seed(42)
    true_update = torch.randn(d, d) * 0.01  # Small updates, as in fine-tuning

    # torch.linalg.svd returns U, S, and V^H (rows are right singular vectors)
    U, S, Vh = torch.linalg.svd(true_update, full_matrices=False)
    total_energy = (S ** 2).sum()

    print("Singular value energy distribution (cumulative):")
    for r in [1, 4, 8, 16, 32, 64, 128]:
        energy_r = (S[:r] ** 2).sum() / total_energy * 100
        print(f" r={r:3d}: captures {energy_r:.1f}% of update energy")

    print("\nReconstructed update quality (Frobenius norm error):")
    for r in [4, 8, 16, 32, 64]:
        # Reconstruct the update using the best rank-r approximation
        W_approx = (U[:, :r] * S[:r]) @ Vh[:r, :]
        error = (true_update - W_approx).norm() / true_update.norm() * 100
        print(f" r={r:3d}: {error:.1f}% reconstruction error")


lora_rank_ablation()

LoRA in Multi-Adapter Serving

One of LoRA's most powerful production use cases is serving multiple fine-tuned models efficiently:

"""
Multi-adapter serving with LoRA.
Load one base model, hot-swap multiple LoRA adapters.
"""

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel


class MultiAdapterServer:
    """
    Efficient serving of multiple LoRA-adapted models.
    Loads base model once, swaps adapters per request.
    """

    def __init__(self, base_model_name: str):
        self.tokenizer = AutoTokenizer.from_pretrained(base_model_name)
        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token

        # Load base model once
        self.base_model = AutoModelForCausalLM.from_pretrained(
            base_model_name,
            torch_dtype=torch.bfloat16,
            device_map="auto",
        )

        # PeftModel wrapper, created when the first adapter is loaded
        self.model = None

        # Track loaded adapters
        self.loaded_adapters = {}
        self.current_adapter = None

    def load_adapter(self, adapter_name: str, adapter_path: str):
        """Load a LoRA adapter without loading a new base model."""
        if adapter_name in self.loaded_adapters:
            return
        if self.model is None:
            # First adapter: wrap the base model once; it becomes the active adapter
            self.model = PeftModel.from_pretrained(
                self.base_model,
                adapter_path,
                adapter_name=adapter_name,
            )
            self.current_adapter = adapter_name
        else:
            # Later adapters: attach to the existing wrapper
            self.model.load_adapter(adapter_path, adapter_name=adapter_name)
        self.loaded_adapters[adapter_name] = adapter_path
        print(f"Loaded adapter '{adapter_name}' from {adapter_path}")

    def set_adapter(self, adapter_name: str):
        """Switch to a specific LoRA adapter."""
        if adapter_name not in self.loaded_adapters:
            raise ValueError(f"Adapter '{adapter_name}' not loaded")
        self.model.set_adapter(adapter_name)
        self.current_adapter = adapter_name

    def generate(self, prompt: str, adapter_name: str = None, max_new_tokens: int = 256):
        """Generate text using the specified adapter."""
        if adapter_name and adapter_name != self.current_adapter:
            self.set_adapter(adapter_name)

        # Fall back to the plain base model if no adapter has been loaded
        model = self.model if self.model is not None else self.base_model
        inputs = self.tokenizer(prompt, return_tensors="pt").to(model.device)

        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                temperature=0.7,
                top_p=0.9,
                do_sample=True,
                pad_token_id=self.tokenizer.eos_token_id,
            )

        return self.tokenizer.decode(outputs[0], skip_special_tokens=True)


# Memory comparison: multi-adapter vs multiple full models
def memory_comparison(
    base_model_gb: float = 14.0,    # 7B model in BF16
    adapter_size_gb: float = 0.05,  # 50MB per adapter (r=16)
    num_adapters: int = 10,
):
    multi_adapter_memory = base_model_gb + (num_adapters * adapter_size_gb)
    multiple_models_memory = num_adapters * base_model_gb

    print(f"Serving {num_adapters} fine-tuned 7B models:")
    print(f" Multiple full models: {multiple_models_memory:.1f} GB")
    print(f" Single base + {num_adapters} LoRA adapters: {multi_adapter_memory:.1f} GB")
    print(f" Memory savings: {multiple_models_memory / multi_adapter_memory:.1f}x")

    # Output: ~140GB vs ~14.5GB - a 9.7x memory saving


memory_comparison()

note

LoRA is now the standard fine-tuning method

As of 2025, virtually every open-source fine-tuned LLM uses LoRA or a variant. The HuggingFace Hub hosts hundreds of thousands of LoRA adapters for models from LLaMA-2 to Mistral to Phi-3. The cost of fine-tuning a 7B model with LoRA on commodity hardware has dropped to under $10 for most use cases. This democratization - from a $100,000+ GPU cluster requirement to a personal GPU - is LoRA's most important contribution to the field.


LoRA Hyperparameter Selection Guide

Choosing the right LoRA hyperparameters is more art than science, but empirical patterns have emerged:

def recommend_lora_config(
    model_size_b: float,
    task_type: str,  # "classification", "instruction", "code", "domain_adapt"
    data_size_k: int,  # Training examples in thousands
    quality_vs_speed: str = "balanced",  # "quality", "speed", "balanced"
) -> dict:
    """
    Recommend LoRA configuration based on task and constraints.
    Based on empirical observations from the open-source community (2023-2025).
    """
    config = {}

    # Rank selection
    if task_type == "classification":
        # Classification needs less rank - mapping is relatively simple
        config["r"] = 8 if model_size_b <= 7 else 16
    elif task_type == "instruction":
        # Instruction following benefits from higher rank for style/format learning
        config["r"] = 16 if quality_vs_speed == "speed" else 32
    elif task_type == "code":
        # Code requires precise token patterns - benefits from higher rank
        config["r"] = 32 if quality_vs_speed != "speed" else 16
    elif task_type == "domain_adapt":
        # Domain adaptation may need to learn new vocabulary patterns
        config["r"] = 64 if data_size_k > 50 else 32
    else:
        # Fail early instead of hitting a KeyError on config["r"] below
        raise ValueError(f"Unknown task_type: {task_type!r}")

    # Alpha = r typically works; 2*r gives more aggressive scaling
    config["lora_alpha"] = config["r"] * 2 if quality_vs_speed == "quality" else config["r"]

    # Target modules
    if model_size_b <= 3:
        # Small models - target all linear layers
        config["target_modules"] = ["q_proj", "k_proj", "v_proj", "o_proj",
                                    "gate_proj", "up_proj", "down_proj"]
    else:
        # Standard - attention layers usually sufficient
        config["target_modules"] = ["q_proj", "k_proj", "v_proj", "o_proj"]
        if task_type in ["domain_adapt", "code"] and quality_vs_speed != "speed":
            config["target_modules"].extend(["gate_proj", "up_proj", "down_proj"])

    # Learning rate
    if config["r"] <= 8:
        config["learning_rate"] = 3e-4
    elif config["r"] <= 32:
        config["learning_rate"] = 2e-4
    else:
        config["learning_rate"] = 1e-4  # Higher rank → lower LR to avoid instability

    # Dropout
    config["lora_dropout"] = 0.0 if data_size_k >= 10 else 0.05

    # Epochs
    if data_size_k < 5:
        config["num_epochs"] = 5
    elif data_size_k < 50:
        config["num_epochs"] = 3
    else:
        config["num_epochs"] = 1

    return config


# Examples:
print(recommend_lora_config(7, "code", 50, "quality"))
# → r=32, alpha=64, target all linear layers, lr=2e-4, 1 epoch

print(recommend_lora_config(1, "classification", 2, "speed"))
# → r=8, alpha=8, all linear, lr=3e-4, 5 epochs

LoRA for Vision-Language Models

LoRA is not limited to text-only LLMs. Vision-language models (VLMs) like LLaVA, InternVL, and Qwen-VL are increasingly fine-tuned with LoRA applied to both the language backbone and the visual projection layers.

import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForVision2Seq, AutoProcessor


def create_vlm_lora_model(
    model_name: str = "Salesforce/blip2-opt-2.7b",
    r: int = 16,
    target_language_modules: bool = True,
    target_vision_projector: bool = True,
) -> tuple:
    """Apply LoRA to a vision-language model."""
    model = AutoModelForVision2Seq.from_pretrained(
        model_name,
        torch_dtype=torch.bfloat16,
    )
    processor = AutoProcessor.from_pretrained(model_name)

    # Identify target modules - depends on VLM architecture.
    # The names below fit BLIP-2 with an OPT language model; inspect
    # model.named_modules() before reusing them with another VLM.
    target_modules = []
    if target_language_modules:
        # Language model component. OPT names its attention output
        # projection "out_proj"; LLaMA-style backbones use "o_proj".
        target_modules.extend(["q_proj", "k_proj", "v_proj", "out_proj"])
    if target_vision_projector:
        # The vision-language projector (QFormer or linear projection)
        # This teaches the model to interpret visual features differently
        target_modules.extend(["query", "key", "value", "dense"])  # QFormer layers
    if not target_modules:
        raise ValueError("Enable at least one of the two target groups.")

    lora_config = LoraConfig(
        r=r,
        lora_alpha=r * 2,
        target_modules=target_modules,
        lora_dropout=0.0,
        bias="none",
        task_type="CAUSAL_LM",
    )

    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()

    return model, processor


def format_vlm_instruction(image, question: str, processor) -> dict:
    """Format image + text instruction for VLM fine-tuning."""
    prompt = f"Question: {question}\nAnswer:"
    inputs = processor(images=image, text=prompt, return_tensors="pt")
    return inputs
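
Module names differ between VLM architectures, so it can help to check which names a target list would actually select before training. The sketch below mimics suffix matching for a list of target names (a module is selected when its name equals a target or ends with "." plus the target). The helper `match_targets` and the module names are hypothetical, written in the style of a BLIP-2 checkpoint rather than read from a real model.

```python
def match_targets(module_names, target_modules):
    """Map each target suffix to the module names it would select."""
    hits = {target: [] for target in target_modules}
    for name in module_names:
        for target in target_modules:
            if name == target or name.endswith("." + target):
                hits[target].append(name)
    return hits


# Hypothetical module names, for illustration only
module_names = [
    "vision_model.encoder.layers.0.self_attn.qkv",
    "qformer.encoder.layer.0.attention.attention.query",
    "qformer.encoder.layer.0.attention.output.dense",
    "language_model.model.decoder.layers.0.self_attn.q_proj",
    "language_model.model.decoder.layers.0.self_attn.out_proj",
]

hits = match_targets(module_names, ["q_proj", "o_proj", "query", "dense"])
unmatched = [target for target, names in hits.items() if not names]
print(unmatched)  # ['o_proj']
```

With a loaded model, the list of names would come from `[name for name, _ in model.named_modules()]`. A target that matches nothing is worth catching early, because the run can still start as long as some other target matches.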

Debugging LoRA Training

Common issues and how to diagnose them:

import numpy as np
import torch


def diagnose_lora_training(
    model,
    train_loss_history: list[float],
    eval_loss_history: list[float],
    step_history: list[int],
) -> list[str]:
    """
    Diagnose common LoRA training issues from loss curves.
    Returns list of detected issues and suggestions.
    """
    issues = []

    if len(train_loss_history) < 10:
        return ["Not enough data to diagnose. Need at least 10 training loss points."]

    # Issue 1: Loss not decreasing (compare first and last quarter of training)
    early_loss = np.mean(train_loss_history[:len(train_loss_history) // 4])
    late_loss = np.mean(train_loss_history[3 * len(train_loss_history) // 4:])
    if late_loss > early_loss * 0.95:
        issues.append(
            "ISSUE: Training loss not decreasing significantly. "
            "Suggestions: (1) increase learning rate by 3-5x, "
            "(2) verify that labels are not all -100 (masked), "
            "(3) check that LoRA parameters are actually in optimizer."
        )

    # Issue 2: Validation loss rising over the last 5 evals
    # (overfitting if training loss is still falling)
    if len(eval_loss_history) >= 5:
        eval_trend = eval_loss_history[-5:]
        if eval_trend[-1] > eval_trend[0] * 1.03:
            issues.append(
                "ISSUE: Validation loss increasing (overfitting). "
                "Suggestions: (1) reduce num_epochs, "
                "(2) add lora_dropout=0.05-0.1, "
                "(3) add 10% general data to training mix."
            )

    # Issue 3: Loss spikes (more than 3 standard deviations above the mean)
    loss_std = np.std(train_loss_history)
    loss_mean = np.mean(train_loss_history)
    spikes = [loss for loss in train_loss_history if loss > loss_mean + 3 * loss_std]
    if len(spikes) > 3:
        issues.append(
            f"ISSUE: {len(spikes)} loss spikes detected. "
            "Suggestions: (1) reduce learning rate by 2x, "
            "(2) check for corrupted examples in training data, "
            "(3) add gradient clipping (max_norm=0.3 for LoRA)."
        )

    # Issue 4: NaN in LoRA parameters (the base weights are frozen,
    # so only the adapter weights can be corrupted by training)
    nan_params = []
    for name, param in model.named_parameters():
        if "lora_" in name and torch.isnan(param).any():
            nan_params.append(name)
    if nan_params:
        issues.append(
            f"ISSUE: NaN detected in LoRA parameters: {nan_params[:3]}. "
            "Suggestions: (1) reduce learning rate significantly (try 1e-5), "
            "(2) check for FP16 overflow (switch to BF16), "
            "(3) verify input data doesn't contain NaN."
        )

    # Issue 5: Rank too low
    # Check effective rank of learned delta weights
    for name, module in model.named_modules():
        if hasattr(module, "lora_A") and hasattr(module, "lora_B"):
            lora_A, lora_B = module.lora_A, module.lora_B
            # PEFT keeps adapters in a ModuleDict keyed by adapter name
            if isinstance(lora_A, torch.nn.ModuleDict):
                if len(lora_A) == 0:
                    continue
                adapter_name = next(iter(lora_A.keys()))
                lora_A, lora_B = lora_A[adapter_name], lora_B[adapter_name]
            r = lora_A.weight.shape[0]
            with torch.no_grad():
                # Cast to FP32 first: SVD is not reliable in half precision
                delta = lora_B.weight.float() @ lora_A.weight.float()
                sv = torch.linalg.svdvals(delta)
            # Count singular values above 1% of the largest one
            effective_rank = (sv > sv[0] * 0.01).sum().item()
            if effective_rank == r:
                # All singular values are significant - rank might be too low
                issues.append(
                    f"POSSIBLE ISSUE: {name} is using full rank r={r}. "
                    "Consider doubling rank - the task may require more expressivity."
                )
            break  # Check just first LoRA module as sample

    if not issues:
        issues.append("No obvious issues detected. Training appears healthy.")

    return issues
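
To make the spike heuristic concrete, here is a minimal standalone sketch of the same mean-plus-three-standard-deviations rule, run on a synthetic loss curve. The helper name `find_loss_spikes` and the curve itself are made up for illustration; `statistics.pstdev` is used because it computes the population standard deviation, which is also what `np.std` returns by default.

```python
import statistics


def find_loss_spikes(losses, n_sigma=3.0):
    """Return (index, loss) pairs that exceed mean + n_sigma * std."""
    mean = statistics.fmean(losses)
    std = statistics.pstdev(losses)  # population std, like np.std's default
    threshold = mean + n_sigma * std
    return [(i, loss) for i, loss in enumerate(losses) if loss > threshold]


# Synthetic curve: smooth decay from 2.0 toward 1.0, with one injected spike
losses = [1.0 + 0.99 ** step for step in range(200)]
losses[120] = 6.0

spikes = find_loss_spikes(losses)
print(spikes)  # [(120, 6.0)]
```

Because the threshold is computed over the whole history, the high losses at the start of training raise both the mean and the standard deviation. On short or steeply falling curves this makes the rule conservative, so it is better at catching isolated late spikes than general noisiness.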

Key Takeaways

LoRA solved one of the most practically important problems in applied ML: how do you adapt a very large model to a specific task without the compute, memory, and storage requirements of full fine-tuning?

The core insight - that weight updates during fine-tuning have low intrinsic rank - turned out to be empirically well-supported. In practice, LoRA with rank 16–64 recovers 95–98% of full fine-tuning performance on most tasks while using 3–10x less memory and producing adapters that are megabytes rather than gigabytes.
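
The "megabytes rather than gigabytes" claim follows from simple arithmetic: a LoRA pair on a d_out × d_in matrix adds r × (d_in + d_out) parameters. The sketch below works this through for an assumed model shape (32 layers, hidden size 4096, LoRA on four square attention projections per layer, rank 16). These dimensions are illustrative assumptions, not figures from this lesson, and a real adapter file carries a small amount of extra metadata.

```python
def lora_adapter_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA pair: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


# Assumed model shape, for illustration only
hidden, layers, matrices_per_layer, r = 4096, 32, 4, 16

per_matrix = lora_adapter_params(hidden, hidden, r)
total = per_matrix * matrices_per_layer * layers
size_mib = total * 2 / 2**20  # 2 bytes per parameter in BF16/FP16
fraction = per_matrix / (hidden * hidden)  # adapter size relative to one full matrix

print(f"{total:,} trainable parameters, about {size_mib:.0f} MiB in BF16")
# 16,777,216 trainable parameters, about 32 MiB in BF16
print(f"Each adapter is {fraction:.2%} the size of the matrix it adapts")
# Each adapter is 0.78% the size of the matrix it adapts
```

Doubling the rank doubles the adapter size, so even r=64 under these assumptions stays near 128 MiB.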

For practitioners, LoRA is now the default starting point for any fine-tuning task. The question is no longer "LoRA or full fine-tuning?" but rather "what rank, which modules, and what learning rate?" The answer to those questions depends on your task's complexity, dataset size, and quality requirements - but the framework in this lesson gives you a principled way to make those decisions.

