LoRA and PEFT - Fine-Tuning at a Fraction of the Cost
Reading time: ~35 min | Interview relevance: Critical | Roles: MLE, AI Engineer, Research Engineer
The Real Interview Moment
You're in a system design interview at an AI startup. The interviewer describes their product: "We have a 7B parameter base model and need to create specialized versions for 50 different enterprise customers - each with their own domain data, tone, and compliance requirements. Full fine-tuning each variant would cost us $50K in compute per customer and require 50 separate model copies in production. Walk me through how you'd make this economically viable."
The answer they're looking for is LoRA - but not just "use LoRA." They want you to explain why low-rank adaptation works theoretically, how to choose the rank, which layers to target, how to serve multiple adapters efficiently, and what trade-offs you're making. They want to know whether you understand the math behind it or just copied a tutorial.
This is the PEFT interview. Every AI company building on foundation models needs engineers who can fine-tune efficiently. If you can't explain LoRA, you're missing a core competency for modern AI engineering.
What You Will Master
After reading this page, you will be able to:
- Explain the low-rank hypothesis and why it makes efficient fine-tuning possible
- Derive the LoRA factorization and explain each component
- Choose appropriate rank, target layers, and hyperparameters for LoRA
- Compare LoRA with other PEFT methods and explain when to use each
- Explain QLoRA and why quantization + LoRA is so effective
- Implement LoRA fine-tuning using the PEFT library
- Design multi-tenant serving architectures with LoRA adapters
- Answer every common interview question about parameter-efficient fine-tuning
Part 1 - The Problem: Full Fine-Tuning Doesn't Scale
The Cost of Full Fine-Tuning
When you fine-tune a model, you update all parameters, so the weights, their gradients, and the optimizer states must all fit in memory. For a 7B parameter model in float16:
| Resource | Full Fine-Tuning | Why |
|---|---|---|
| Model weights | 14 GB | 7B x 2 bytes (fp16) |
| Gradients | 14 GB | Same size as weights |
| Optimizer states (AdamW) | 56 GB | 2 moments in fp32 = 4x weights |
| Activations | 10-50 GB | Depends on batch size, sequence length |
| Total GPU memory | ~100+ GB | Exceeds a single A100 80GB; requires multi-GPU or offloading |
| Per-customer copy | 14 GB storage | Full separate model per variant |
For 50 customers: 50 x 14 GB = 700 GB just for model storage, plus 50 separate fine-tuning runs.
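The arithmetic behind the table can be reproduced in a few lines. This is a rough sketch under the same assumptions as the table (fp16 weights and gradients, two fp32 AdamW moments, activations excluded); the function name is illustrative, not part of any library.

```python
def full_finetune_memory_gb(num_params: float) -> dict:
    """Rough memory estimate in GB (1 GB = 1e9 bytes) for full fine-tuning with AdamW."""
    bytes_fp16, bytes_fp32 = 2, 4
    weights = num_params * bytes_fp16 / 1e9
    gradients = num_params * bytes_fp16 / 1e9
    optimizer = num_params * bytes_fp32 * 2 / 1e9  # two moments per parameter
    return {
        "weights": weights,
        "gradients": gradients,
        "optimizer": optimizer,
        "total_without_activations": weights + gradients + optimizer,
    }

estimate = full_finetune_memory_gb(7e9)
for name, gb in estimate.items():
    print(f"{name}: {gb:.0f} GB")
```

Activations come on top of the 84 GB subtotal, which is how the table arrives at roughly 100 GB or more.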
Never say "LoRA trains fewer parameters so it's faster per step." LoRA's forward pass is actually slightly slower than the base model (it has extra computation from the adapter). The savings come from (1) dramatically less memory for optimizer states and gradients, (2) much smaller adapter checkpoints (~10 MB vs 14 GB), and (3) ability to share the base model across customers.
The Low-Rank Hypothesis
The key insight from the LoRA paper (Hu et al., 2021):
The weight updates during fine-tuning have low intrinsic rank.
When you fine-tune a pre-trained model, you're not changing the weights arbitrarily. The updates tend to lie in a low-dimensional subspace.
Why? Pre-training already learns a powerful general-purpose representation. Fine-tuning only needs to "steer" this representation toward a specific task - a much simpler transformation than learning from scratch.
Empirical evidence: Aghajanyan et al. (2021) showed that fine-tuning succeeds even when the update is restricted to a very low-dimensional subspace, and the LoRA paper itself found that a rank of 1-4 often suffices for many tasks.
Part 2 - The LoRA Math
Core Formulation
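For a pre-trained weight matrix $W_0 \in \mathbb{R}^{d \times k}$, LoRA represents the update as:
$$W = W_0 + \Delta W = W_0 + BA$$
where:
- $B \in \mathbb{R}^{d \times r}$ - the "up-projection"
- $A \in \mathbb{R}^{r \times k}$ - the "down-projection"
- $r \ll \min(d, k)$ - the rank
Forward pass:
$$h = W_0 x + \frac{\alpha}{r} BAx$$
where $\alpha$ is a scaling factor (the "LoRA alpha").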
Parameter Savings
For a single weight matrix $W \in \mathbb{R}^{d \times k}$:
| Method | Parameters | Example ($d = 4096$, $k = 4096$, $r = 16$) |
|---|---|---|
| Full fine-tuning | $d \times k$ | 16,777,216 |
| LoRA | $r \times (d + k)$ | 131,072 |
| Savings | | 0.78% of full |
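The numbers in this table follow directly from the shapes of the two factors. A minimal check, assuming the example dimensions from the table (the helper name is illustrative):

```python
def lora_param_count(d: int, k: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d x k matrix: B is d x r, A is r x k."""
    return r * (d + k)

d, k, r = 4096, 4096, 16
full = d * k
lora = lora_param_count(d, k, r)
print(full, lora, f"{100 * lora / full:.2f}%")  # 16777216 131072 0.78%
```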
"LoRA factorizes the weight update into two small matrices - a down-projection A that compresses the input to a low-rank space, and an up-projection B that maps back to the original dimension. Since the update has low intrinsic rank, this captures most of the fine-tuning effect with fewer than 1% of the parameters. The key trick is that A is initialized randomly and B is initialized to zero, so the adapter starts as an identity (no change) and gradually learns the adaptation."
Initialization
This is a critical detail that interviewers love to ask about:
- $A$ is initialized with small random values (random Gaussian in the original paper; the PEFT library defaults to Kaiming uniform)
- $B$ is initialized to zero
Why? At the start of training, $\Delta W = BA = 0$, so the model begins as the exact pre-trained model. This ensures training stability - you don't corrupt the pre-trained representations from step 1.
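The effect of this initialization is easy to see numerically. The following is a toy NumPy sketch, not the PEFT implementation: the matrix sizes, the Gaussian scale for $A$, and the `lora_forward` helper are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, alpha = 8, 6, 2, 4   # toy sizes, far smaller than a real model

W0 = rng.normal(size=(d, k))              # frozen pre-trained weight
A = rng.normal(scale=0.01, size=(r, k))   # down-projection, random init
B = np.zeros((d, r))                      # up-projection, zero init

def lora_forward(x, W0, A, B, alpha, r):
    """h = W0 x + (alpha / r) * B (A x); only A and B would receive gradients."""
    return W0 @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=(k,))
h_base = W0 @ x
h_lora = lora_forward(x, W0, A, B, alpha, r)
print(np.allclose(h_base, h_lora))  # True: B = 0 makes the update vanish
```

Note that the adapter path is computed as `B @ (A @ x)`, so the full $d \times k$ product $BA$ is never materialized during training.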
The Scaling Factor
The update is scaled by $\frac{\alpha}{r}$:
$$h = W_0 x + \frac{\alpha}{r} BAx$$
Why divide by $r$? When you increase the rank, the magnitude of $BA$ increases (more terms contribute). Dividing by $r$ keeps the update magnitude roughly constant regardless of rank, making it easier to transfer hyperparameters across different rank settings.
Common settings:
- $\alpha = r$ (effective scaling = 1) - simplest, works well
- $\alpha = 2r$ (effective scaling = 2) - more aggressive adaptation
- $\alpha = 32$, $r = 16$ (the setting used in the code on this page) is a very common default
Many candidates confuse LoRA alpha ($\alpha$) with the learning rate. They serve different purposes: $\alpha$ scales the magnitude of the LoRA update relative to the pretrained weights, while the learning rate controls the optimizer step size. You typically use a higher learning rate for LoRA parameters (e.g., 1e-4 to 3e-4) compared to full fine-tuning (e.g., 2e-5) because you're only updating a small number of parameters.
Part 3 - Which Layers to Adapt
Layer Selection Strategy
Not all layers benefit equally from LoRA adaptation. The original paper experimented only with the attention projections of Transformers; combined with later practice on MLP layers, the rough picture is:
| Layer | LoRA Effectiveness | Reasoning |
|---|---|---|
| $W_q$ (query) | High | Modifies attention patterns |
| $W_k$ (key) | Medium-High | Affects what tokens attend to |
| $W_v$ (value) | High | Changes information extraction |
| $W_o$ (output projection) | Medium | Post-attention transformation |
| MLP up-projection | Medium-High | Modifies feature computation |
| MLP down-projection | Medium | Feature combination |
| Embedding layers | Low | Base vocabulary is sufficient |
| LayerNorm | Very Low | Already tiny, just tune directly |
Typical configurations:
# Conservative - attention only (original LoRA paper)
target_modules = ["q_proj", "v_proj"]
# Standard - all attention projections
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"]
# Aggressive - attention + MLP
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                  "up_proj", "gate_proj", "down_proj"]
Choosing the Rank
| Rank ($r$) | Parameters (7B model, QV only) | Quality | Use Case |
|---|---|---|---|
| 4 | ~2M (0.03%) | Good for simple tasks | Classification, sentiment |
| 8 | ~4M (0.06%) | Good general default | Most fine-tuning tasks |
| 16 | ~8M (0.11%) | Strong | Complex reasoning, code |
| 32 | ~16M (0.23%) | Near full fine-tuning quality | Domain-heavy adaptation |
| 64 | ~32M (0.46%) | Diminishing returns | Rarely needed |
| 256 | ~130M (1.9%) | Approaching full fine-tuning | Almost never used |
Reported industry practice (largely anecdotal; treat these as starting points, not official recommendations):
- OpenAI: Widely assumed to use LoRA-style adapters behind its user-facing fine-tuning API. The rank is proprietary; r=8-16 is speculation.
- Meta: LLaMA models are a common base for published LoRA experiments. r=8 is a frequent choice for most tasks, r=64 for complex instruction following.
- Google: Gemma models are commonly tested with LoRA rank 4-16. Layer selection often matters more than rank.
- Startups: Often use r=16 with all linear layers as a robust default.
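The parameter column in the rank table can be reproduced with a short calculation. This sketch assumes a LLaMA-7B-like shape (hidden size 4096, 32 layers) and LoRA on `q_proj` and `v_proj` only; the function name and defaults are illustrative.

```python
def qv_only_lora_params(rank: int, hidden: int = 4096, layers: int = 32) -> int:
    """LoRA parameters when adapting only q_proj and v_proj (both hidden x hidden)."""
    per_matrix = rank * (hidden + hidden)   # A is rank x hidden, B is hidden x rank
    return 2 * per_matrix * layers          # two adapted matrices per layer

for rank in (4, 8, 16, 32, 64, 256):
    n = qv_only_lora_params(rank)
    print(f"r={rank:>3}: {n / 1e6:6.1f}M ({100 * n / 7e9:.2f}% of 7B)")
```

Parameter count grows linearly with rank, so doubling the rank doubles adapter size and optimizer state.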
Part 4 - Comparison with Other PEFT Methods
The PEFT Landscape
Detailed Comparison
| Method | How It Works | Trainable Params | Inference Latency | Merging | Quality |
|---|---|---|---|---|---|
| Full Fine-Tuning | Update all parameters | 100% | Baseline | N/A | Best |
| LoRA | Low-rank weight updates | 0.1-1% | No overhead (merge) | Yes ✅ | Near full FT |
| Adapters | Insert small bottleneck layers | 0.5-5% | +5-10% overhead | No ❌ | Good |
| Prefix Tuning | Prepend learnable tokens to keys/values | 0.1-1% | +2-5% overhead | No ❌ | Moderate |
| Prompt Tuning | Prepend learnable embeddings to input | <0.1% | Minimal | No ❌ | Lower (small models) |
| BitFit | Only tune bias terms | 0.1% | No overhead | Yes ✅ | Lower |
| QLoRA | LoRA + 4-bit quantization | 0.1-1% | Minimal | Partial | Near LoRA |
| DoRA | Decompose weight into magnitude + direction | 0.1-1% | No overhead (merge) | Yes ✅ | Slightly > LoRA |
Why LoRA Won
LoRA has three critical advantages that made it the dominant PEFT method:
- No inference latency: After training, merge $BA$ into $W_0$, producing a standard model with zero additional computation
- Composability: Multiple LoRA adapters can be combined, swapped at serving time, or interpolated
- No architecture changes: Unlike adapters, LoRA doesn't add new layers, simplifying deployment
"LoRA dominates PEFT because of three properties: First, you can merge the adapter into the base weights at inference, so there's zero latency cost. Second, adapters are small (~10MB) and composable - you can swap adapters for different tasks with a single base model. Third, it doesn't change the model architecture, so all existing inference infrastructure works unchanged. Adapters add latency, prefix tuning reduces context length, and prompt tuning doesn't work well for smaller models."
Part 5 - QLoRA: The Memory Breakthrough
The Problem QLoRA Solves
Even with LoRA, you still need to load the base model in fp16 - 14 GB for a 7B model. QLoRA (Dettmers et al., 2023) solves this by quantizing the base model to 4-bit:
| Setup | Memory for 7B Model |
|---|---|
| Full fine-tuning (fp16) | ~100 GB |
| LoRA (fp16 base) | ~16 GB |
| QLoRA (4-bit base + LoRA) | ~6 GB |
How QLoRA Works
- NF4 quantization: Quantize base model to 4-bit using NormalFloat4 - a data type designed for normally distributed weights
- Double quantization: Quantize the quantization constants themselves (saves ~0.37 bits/param)
- Paged optimizers: Use CPU memory as overflow for optimizer states using unified memory
- LoRA in fp16/bf16: The LoRA adapters (B, A) are kept in higher precision
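The double-quantization saving can be checked with simple arithmetic. The sketch below assumes the setup described in the QLoRA paper (one fp32 constant per block of 64 weights, then 8-bit constants with a second-level block size of 256); the function names are illustrative.

```python
def constant_overhead_bits(block_size: int = 64) -> float:
    """Per-parameter overhead of storing one fp32 scale per block of weights."""
    return 32 / block_size

def double_quant_overhead_bits(block_size: int = 64, second_block: int = 256) -> float:
    """8-bit first-level constants plus fp32 second-level constants."""
    return 8 / block_size + 32 / (block_size * second_block)

single = constant_overhead_bits()
double = double_quant_overhead_bits()
params = 7e9
weights_gb = params * 4 / 8 / 1e9   # 4 bits per weight

print(f"overhead without double quantization: {single:.3f} bits/param")
print(f"overhead with double quantization:    {double:.3f} bits/param")
print(f"saved: {single - double:.3f} bits/param")
print(f"4-bit weights for 7B: {weights_gb:.1f} GB")
```

The 4-bit weights alone come to about 3.5 GB for a 7B model; the rest of the ~6 GB figure is adapters, optimizer state, and activations.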
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
# QLoRA configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4
    bnb_4bit_compute_dtype=torch.bfloat16,  # Compute in bf16
    bnb_4bit_use_double_quant=True,         # Double quantization
)
# Load quantized base model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
# Add LoRA adapters
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "up_proj", "gate_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# For this configuration, expect roughly 40M trainable parameters (about 0.6% of the total)
The Forward Pass with QLoRA
QLoRA does NOT train the quantized weights. The 4-bit weights are frozen. Only the LoRA adapters (in higher precision) receive gradients. The backward pass computes gradients through the dequantized weights and updates only A and B. This is a common confusion point - quantization is for memory efficiency of the base model, not for the trainable parameters.
Part 6 - Advanced LoRA Techniques
LoRA Merging
After training, merge the adapter into the base weights for zero-overhead inference:
# Merge LoRA weights into base model
model = model.merge_and_unload()
# Now model is a standard model with no LoRA overhead
# Save as regular model
model.save_pretrained("merged_model")
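Why merging costs nothing at inference can be verified on toy matrices. This is a NumPy sketch with made-up sizes and random "learned" factors, not what `merge_and_unload()` does internally.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, r, alpha = 8, 6, 2, 4   # toy sizes for illustration

W0 = rng.normal(size=(d, k))  # frozen base weight
A = rng.normal(size=(r, k))   # pretend these two factors were learned
B = rng.normal(size=(d, r))
scale = alpha / r

# Merge: fold the scaled low-rank update into a single dense matrix
W_merged = W0 + scale * (B @ A)

x = rng.normal(size=(k,))
h_adapter = W0 @ x + scale * (B @ (A @ x))   # unmerged: base path + adapter path
h_merged = W_merged @ x                      # merged: one ordinary matmul
print(np.allclose(h_adapter, h_merged))      # True
```

The merged matrix has the same shape as `W0`, so the merged model runs on any inference stack that serves the base model.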
Multi-Adapter Serving
Serve multiple customers with a single base model:
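As a toy picture of the idea, the sketch below keeps one shared base weight and a dictionary of per-customer adapter factors, then applies the right adapter to each request in a mixed batch. The customer ids, sizes, and `forward_batch` helper are hypothetical; production systems use specialized kernels rather than a Python loop.

```python
import numpy as np

rng = np.random.default_rng(2)
d, k, r = 8, 6, 2
scale = 2.0                      # alpha / r, assumed equal for all adapters

W0 = rng.normal(size=(d, k))     # shared base weight, loaded once
adapters = {                     # hypothetical customer ids -> (B, A)
    "customer_a": (rng.normal(size=(d, r)), rng.normal(size=(r, k))),
    "customer_b": (rng.normal(size=(d, r)), rng.normal(size=(r, k))),
}

def forward_batch(xs, customer_ids):
    """Base matmul runs once for the whole batch; adapter deltas are added per request."""
    out = xs @ W0.T                                  # shape (batch, d)
    for i, cid in enumerate(customer_ids):
        B, A = adapters[cid]
        out[i] += scale * (B @ (A @ xs[i]))
    return out

xs = rng.normal(size=(3, k))
out = forward_batch(xs, ["customer_a", "customer_b", "customer_a"])
print(out.shape)  # (3, 8)
```

Only the small `(B, A)` pairs differ per customer, which is why adding a tenant means storing megabytes rather than another copy of the model.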
LoRA Composition and Arithmetic
You can combine multiple LoRA adapters:
$$W = W_0 + \sum_i \lambda_i B_i A_i$$
This enables:
- Task arithmetic: Add writing style adapter + domain knowledge adapter
- Interpolation: $\lambda B_1 A_1 + (1 - \lambda) B_2 A_2$
- Negation: Subtract an adapter to remove unwanted behavior
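These three operations are all weighted sums of the adapters' weight updates. A minimal NumPy sketch with random stand-in adapters (the names `style` and `domain` and the helper functions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
d, k, r = 8, 6, 2

def random_adapter():
    """Stand-in for a trained adapter: returns (B, A)."""
    return rng.normal(size=(d, r)), rng.normal(size=(r, k))

def combine(adapters, weights):
    """Weighted sum of full-size updates: sum_i w_i * B_i @ A_i."""
    return sum(w * (B @ A) for (B, A), w in zip(adapters, weights))

style, domain = random_adapter(), random_adapter()

added = combine([style, domain], [1.0, 1.0])      # task arithmetic
blended = combine([style, domain], [0.3, 0.7])    # interpolation
negated = combine([style, domain], [1.0, -1.0])   # negation

print(added.shape, np.linalg.matrix_rank(added) <= 2 * r)
```

The combined update can have rank up to the sum of the individual ranks, so it generally cannot be stored as a single rank-$r$ adapter without approximation.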
DoRA: Weight-Decomposed LoRA
DoRA (2024) decomposes the adapted weight into magnitude and direction:
$$W' = m \cdot \frac{W_0 + BA}{\lVert W_0 + BA \rVert_c}$$
where $m$ is a learnable magnitude vector and $\lVert \cdot \rVert_c$ is the column-wise norm. This achieves slightly better performance than LoRA at the same rank.
Part 7 - Practical Implementation Guide
Complete Fine-Tuning Script
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    BitsAndBytesConfig,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer
# 1. Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer.pad_token = tokenizer.eos_token  # Llama 2 ships without a pad token
# 2. Quantization config (for QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
# 3. Load model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
model = prepare_model_for_kbit_training(model)
# 4. LoRA config
lora_config = LoraConfig(
    r=16,              # Rank - start here, increase if quality is insufficient
    lora_alpha=32,     # Alpha = 2 * r is a good default
    target_modules=[   # All linear layers for best quality
        "q_proj", "k_proj", "v_proj", "o_proj",
        "up_proj", "gate_proj", "down_proj",
    ],
    lora_dropout=0.05, # Small dropout for regularization
    bias="none",       # Don't train biases
    task_type="CAUSAL_LM",
)
# 5. Apply LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# 6. Training arguments
training_args = TrainingArguments(
    output_dir="./lora-output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # Effective batch size = 16
    learning_rate=2e-4,             # Higher LR for LoRA
    warmup_steps=100,
    logging_steps=10,
    save_strategy="epoch",
    bf16=True,
    optim="paged_adamw_8bit",       # Memory-efficient optimizer
    gradient_checkpointing=True,    # Trade compute for memory
)
# 7. Load dataset and train
dataset = load_dataset("your_dataset")  # Placeholder name; substitute your own dataset
# Keyword names below follow older TRL releases. Newer releases configure the
# sequence length through SFTConfig and rename `tokenizer` to `processing_class`,
# so check the signature of your installed version.
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    tokenizer=tokenizer,
    max_seq_length=2048,
)
trainer.train()
# 8. Save adapter only (~40M parameters at this rank: tens to low hundreds of MB
#    depending on saved precision, versus ~14 GB for the full model)
model.save_pretrained("./lora-adapter")
Hyperparameter Tuning Guide
| Hyperparameter | Start | Range | Impact |
|---|---|---|---|
| Rank (r) | 16 | 4-64 | Higher = more capacity, more memory |
| Alpha (α) | 2 * r | r to 4*r | Higher = stronger adaptation signal |
| Learning rate | 2e-4 | 1e-4 to 5e-4 | Higher than full FT because fewer params |
| Dropout | 0.05 | 0-0.1 | Higher if overfitting on small datasets |
| Target modules | All linear | QV only to all linear | More modules = better quality, more params |
| Epochs | 3 | 1-5 | LoRA overfits faster than full FT |
Part 8 - When LoRA vs Fine-Tuning vs Prompting
Decision Framework
Comparison Table
| Factor | Prompt Engineering | LoRA | Full Fine-Tuning |
|---|---|---|---|
| Training cost | $0 | $10-100 | $1K-100K+ |
| Training time | 0 | Hours | Days-weeks |
| Data needed | 0-10 examples | 100-10K examples | 1K-100K+ examples |
| Quality ceiling | Limited by model | Near full FT | Highest |
| New knowledge | No | Limited | Yes |
| Serving cost | Higher (long prompts) | Same as base | Same as base |
| Iteration speed | Minutes | Hours | Days |
| Multi-tenant | Easy (different prompts) | Easy (swap adapters) | Hard (separate models) |
Part 9 - Practice Problems
Problem 1: LoRA Parameter Calculation
A LLaMA-7B model has 32 transformer layers. Each layer has Q, K, V, O projections (all 4096 x 4096) and an MLP with up (4096 x 11008), gate (4096 x 11008), and down (11008 x 4096) projections. Calculate the total trainable parameters when applying LoRA with rank 16 to all linear layers.
Hint 1 - Direction
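For each weight matrix $W \in \mathbb{R}^{d \times k}$, LoRA adds $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$, so the parameter count is $r \times (d + k)$.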
Full Answer
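Per layer, attention projections (Q, K, V, O):
- Each: $16 \times (4096 + 4096) = 131{,}072$
- 4 projections: $4 \times 131{,}072 = 524{,}288$
Per layer, MLP projections (up, gate, down):
- Up/Gate: $16 \times (4096 + 11008) = 241{,}664$ each
- Down: $16 \times (11008 + 4096) = 241{,}664$
- 3 projections: $3 \times 241{,}664 = 724{,}992$
Per layer total: $524{,}288 + 724{,}992 = 1{,}249{,}280$
All 32 layers: $32 \times 1{,}249{,}280 = 39{,}976{,}960 \approx 40\text{M}$ parameters
As percentage of 7B: $39{,}976{,}960 / (7 \times 10^9) \approx 0.57\%$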
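The arithmetic for this problem can be double-checked in code. The shapes come from the problem statement; the dictionary keys reuse the module names from the configuration examples, and the variable names are otherwise illustrative.

```python
RANK = 16
LAYERS = 32
SHAPES = {  # (input dim, output dim) per projection, from the problem statement
    "q_proj": (4096, 4096),
    "k_proj": (4096, 4096),
    "v_proj": (4096, 4096),
    "o_proj": (4096, 4096),
    "up_proj": (4096, 11008),
    "gate_proj": (4096, 11008),
    "down_proj": (11008, 4096),
}

# Each adapted matrix contributes r * (d_in + d_out) trainable parameters
per_layer = sum(RANK * (d_in + d_out) for d_in, d_out in SHAPES.values())
total = per_layer * LAYERS
print(per_layer, total, f"{100 * total / 7e9:.2f}%")  # 1249280 39976960 0.57%
```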
Problem 2: Rank Selection
Your team fine-tuned a 13B model with LoRA rank 4 on a medical QA dataset (5000 examples). The model performs well on general medical questions but fails on rare diseases. Should you increase the rank, add more data, or try a different approach? Justify with theory.
Hint 1 - Direction
Think about what the rank limits: the expressiveness of the weight update. Rare diseases represent tail knowledge that may not be in the pre-training data at all.
Full Answer + Rubric
Analysis: The issue is likely not rank. Rank controls the expressiveness of the adaptation, not the knowledge of the model. If the base model doesn't know about rare diseases, no amount of LoRA adaptation will surface that knowledge - you can't extract information that isn't there.
Recommended approach (in order):
- Add more data - Curate examples specifically about rare diseases. LoRA with rank 4 on more targeted data will outperform rank 64 on the same 5000 generic examples.
- Continued pre-training + LoRA - First, do continued pre-training on medical literature (including rare disease content) to inject knowledge, then apply LoRA for task adaptation.
- RAG - Retrieve rare disease information from a medical knowledge base at inference time. This doesn't require fine-tuning at all.
- Only then increase rank - If the model has the knowledge but can't express it, increase rank to 16-32.
Scoring:
- Strong Hire: Distinguishes between knowledge (data) and expressiveness (rank), suggests data-first and RAG approaches
- Lean Hire: Suggests increasing rank but also mentions data quality
- No Hire: Just says "increase the rank to 64"
Problem 3: Multi-Tenant Architecture
Design a serving system for a single 70B base model with 100 LoRA adapters (one per customer). Requirements: <100ms latency, ability to add new customers in minutes, and cost-effective GPU usage.
Hint 1 - Direction
Think about how LoRA adapters interact with batched inference. Can you batch requests across different adapters?
Full Answer + Rubric
Architecture:
- Base model: Load once in GPU memory (~140 GB in fp16, which must be sharded across several GPUs, or ~35 GB in 4-bit). Shared across all requests.
- Adapter storage: Store all 100 adapters on fast storage (SSD). Each is ~50 MB. Total: 5 GB - trivially small.
- Hot adapter cache: Keep the 20 most active adapters in GPU memory. Each adapter adds ~50 MB, so 20 adapters = 1 GB.
- Request routing: Route incoming requests by customer_id to the appropriate adapter. Use an adapter cache with LRU eviction.
- Batching strategy: Requests with the same adapter can be batched normally. For mixed-adapter batching, use frameworks like S-LoRA or Punica that support batched LoRA inference by:
  - Computing $W_0 x$ for the full batch (shared)
  - Computing adapter-specific $B_i A_i x$ with custom CUDA kernels that handle different adapters per sequence
- Scaling: To meet the <100ms latency target, use tensor parallelism across 2-4 GPUs. For more traffic, replicate the base model.
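The hot adapter cache with LRU eviction can be sketched with the standard library. This is a minimal illustration, not a production design: `AdapterCache`, `fake_load`, and the string "adapters" are hypothetical stand-ins for real GPU tensors and a real storage loader.

```python
from collections import OrderedDict

class AdapterCache:
    """Keeps at most `capacity` adapters resident; evicts the least recently used."""

    def __init__(self, capacity, load_fn):
        self.capacity = capacity
        self.load_fn = load_fn          # hypothetical loader: customer_id -> adapter
        self._cache = OrderedDict()

    def get(self, customer_id):
        if customer_id in self._cache:
            self._cache.move_to_end(customer_id)   # mark as most recently used
            return self._cache[customer_id]
        adapter = self.load_fn(customer_id)        # cache miss: load from storage
        self._cache[customer_id] = adapter
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)        # evict least recently used
        return adapter

    def resident(self):
        return list(self._cache)

loads = []
def fake_load(customer_id):
    loads.append(customer_id)                      # record every storage read
    return f"adapter-for-{customer_id}"

cache = AdapterCache(capacity=2, load_fn=fake_load)
for cid in ["a", "b", "a", "c", "b"]:
    cache.get(cid)
print(cache.resident())  # ['c', 'b']
print(loads)             # ['a', 'b', 'c', 'b']
```

With a capacity of 20 and roughly 50 MB per adapter, the cache stays near 1 GB, and a cold customer pays one storage read on their first request.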
Scoring:
- Strong Hire: Describes shared base model, adapter caching, mixed-adapter batching (S-LoRA/Punica), and concrete memory calculations
- Lean Hire: Understands shared base model with swappable adapters but doesn't address batching across adapters
- No Hire: Suggests loading separate model copies per customer
Part 10 - The Paper in Context
Key Contributions of the LoRA Paper
- Low-rank hypothesis for fine-tuning updates - theoretically motivated and empirically validated
- Zero initialization via $B = 0$ - ensures training stability
- No inference latency - adapter merging is unique among PEFT methods
- Composability - multiple adapters on one base model
Impact on the Field
- Made fine-tuning accessible to individuals and small teams (QLoRA on a single GPU)
- Enabled multi-tenant LLM serving for enterprise
- Spawned a family of methods: QLoRA, LoRA+, DoRA, AdaLoRA, LoRA-FA
- Became the standard fine-tuning approach for the open-source LLM ecosystem
Timeline
Interview Cheat Sheet
| Question Pattern | Framework | Key Phrases |
|---|---|---|
| "Explain LoRA" | Full FT cost → low-rank hypothesis → $\Delta W = BA$ → init → merging | "LoRA exploits the fact that fine-tuning updates have low intrinsic rank, so we factorize the update into two small matrices" |
| "Why initialize B to zero?" | Start as identity → training stability → gradual adaptation | "B=0 means the model starts as the exact pretrained model. We don't corrupt learned representations from step 1." |
| "LoRA vs full fine-tuning?" | Parameter count → memory → quality comparison → when each wins | "LoRA achieves 95-99% of full FT quality with <1% of trainable params. Full FT wins only when you have unlimited compute and need maximum quality." |
| "What is QLoRA?" | 4-bit base model → NF4 quantization → LoRA in higher precision | "QLoRA quantizes the frozen base model to 4-bit, reducing memory 4x, while keeping LoRA adapters in bf16 for training quality." |
| "How to choose rank?" | Start r=16 → tune based on task complexity → diminishing returns | "I start with r=16 as a default. For simple classification tasks, r=4-8 suffices. Complex reasoning might need r=32-64." |
| "Multi-tenant serving?" | Shared base model → adapter cache → S-LoRA batching | "One base model in memory, swap LoRA adapters per request. Use S-LoRA for mixed-adapter batching." |
Spaced Repetition Checkpoints
- Day 0: Read this page. Write the LoRA equation from memory. Explain why B is initialized to zero.
- Day 3: Calculate LoRA parameters for a given model. Explain QLoRA's three innovations without looking.
- Day 7: Compare LoRA vs adapters vs prefix tuning. Give three reasons LoRA won.
- Day 14: Design a multi-tenant LoRA serving architecture from memory. Include memory calculations.
- Day 21: Solve all three practice problems. Time yourself - 5-8 minutes each.
Next Steps
- Continue to RLHF Papers to understand how LoRA-adapted models are aligned using RLHF
- Review Adam Optimizer for understanding the optimizer used during LoRA training
- For system design with LoRA, see ML System Design
