
LoRA and PEFT - Fine-Tuning at a Fraction of the Cost

Reading time: ~35 min | Interview relevance: Critical | Roles: MLE, AI Engineer, Research Engineer

The Real Interview Moment

You're in a system design interview at an AI startup. The interviewer describes their product: "We have a 7B parameter base model and need to create specialized versions for 50 different enterprise customers - each with their own domain data, tone, and compliance requirements. Full fine-tuning each variant would cost us $50K in compute per customer and require 50 separate model copies in production. Walk me through how you'd make this economically viable."

The answer they're looking for is LoRA - but not just "use LoRA." They want you to explain why low-rank adaptation works theoretically, how to choose the rank, which layers to target, how to serve multiple adapters efficiently, and what trade-offs you're making. They want to know if you understand the math behind $W = W_0 + BA$ or if you just copied a tutorial.

This is the PEFT interview. Every AI company building on foundation models needs engineers who can fine-tune efficiently. If you can't explain LoRA, you're missing a core competency for modern AI engineering.

What You Will Master

After reading this page, you will be able to:

  • Explain the low-rank hypothesis and why it makes efficient fine-tuning possible
  • Derive the LoRA factorization and explain each component
  • Choose appropriate rank, target layers, and hyperparameters for LoRA
  • Compare LoRA with other PEFT methods and explain when to use each
  • Explain QLoRA and why quantization + LoRA is so effective
  • Implement LoRA fine-tuning using the PEFT library
  • Design multi-tenant serving architectures with LoRA adapters
  • Answer every common interview question about parameter-efficient fine-tuning

Part 1 - The Problem: Full Fine-Tuning Doesn't Scale

The Cost of Full Fine-Tuning

When you fine-tune a model, you update all parameters:

$$\theta_\text{new} = \theta_\text{pretrained} - \eta \sum_t \nabla_\theta \mathcal{L}(\theta, x_t, y_t)$$

For a 7B parameter model in float16:

| Resource | Full Fine-Tuning | Why |
| --- | --- | --- |
| Model weights | 14 GB | 7B x 2 bytes (fp16) |
| Gradients | 14 GB | Same size as weights |
| Optimizer states (AdamW) | 56 GB | 2 moments x 4 bytes (fp32) = 4x the fp16 weights |
| Activations | 10-50 GB | Depends on batch size, sequence length |
| Total GPU memory | ~100+ GB | Exceeds a single A100 80GB; requires multi-GPU |
| Per-customer copy | 14 GB storage | Full separate model per variant |

For 50 customers: 50 x 14 GB = 700 GB just for model storage, plus 50 separate fine-tuning runs.
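The arithmetic in the table can be reproduced in a few lines. This is a rough sketch rather than a profiler: the function name `full_finetune_memory_gb` is made up for illustration, it assumes fp16 weights and gradients with fp32 AdamW moments as in the table, and it leaves activations out because they depend on batch size and sequence length.

```python
def full_finetune_memory_gb(n_params: float) -> dict:
    """Rough per-component memory in GB (1 GB = 1e9 bytes), excluding activations."""
    bytes_fp16, bytes_fp32 = 2, 4
    weights = n_params * bytes_fp16 / 1e9
    gradients = n_params * bytes_fp16 / 1e9
    # AdamW keeps two moment estimates per parameter, assumed to be fp32 here.
    optimizer = n_params * 2 * bytes_fp32 / 1e9
    return {
        "weights": weights,
        "gradients": gradients,
        "optimizer": optimizer,
        "total_without_activations": weights + gradients + optimizer,
    }


budget = full_finetune_memory_gb(7e9)
print(budget)

# Storage if every customer gets a full copy of the fp16 weights.
storage_for_50 = 50 * budget["weights"]
print(f"{storage_for_50:.0f} GB for 50 full copies")
```

Adding the 10-50 GB of activations to the 84 GB computed here gives the "~100+ GB" figure in the table.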

Instant Rejection

Never say "LoRA trains fewer parameters so it's faster per step." LoRA's forward pass is actually slightly slower than the base model (it has extra computation from the adapter). The savings come from (1) dramatically less memory for optimizer states and gradients, (2) much smaller adapter checkpoints (~10 MB vs 14 GB), and (3) ability to share the base model across customers.

The Low-Rank Hypothesis

The key insight from the LoRA paper (Hu et al., 2021):

The weight updates during fine-tuning have low intrinsic rank.

When you fine-tune a pre-trained model, you're not changing the weights arbitrarily. The updates $\Delta W = W_\text{fine-tuned} - W_\text{pretrained}$ tend to lie in a low-dimensional subspace.

Why? Pre-training already learns a powerful general-purpose representation. Fine-tuning only needs to "steer" this representation toward a specific task - a much simpler transformation than learning from scratch.

Empirical evidence: Aghajanyan et al. (2021) showed that fine-tuning still succeeds when the optimization is restricted to a very low-dimensional subspace of the parameters, and Hu et al. found that a rank as low as 1-4 often suffices for many tasks.

[Figure: Full Fine-Tuning vs LoRA]

Part 2 - The LoRA Math

Core Formulation

For a pre-trained weight matrix $W_0 \in \mathbb{R}^{d \times k}$, LoRA represents the update as:

$$W = W_0 + \Delta W = W_0 + BA$$

where:

  • $B \in \mathbb{R}^{d \times r}$ - the "up-projection"
  • $A \in \mathbb{R}^{r \times k}$ - the "down-projection"
  • $r \ll \min(d, k)$ - the rank

Forward pass:

$$h = W_0 x + \frac{\alpha}{r} B A x$$

where $\alpha$ is a scaling factor (the "LoRA alpha").
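The forward pass above can be sketched with plain Python lists. This is a toy illustration under stated assumptions: `matvec` and `lora_forward` are hypothetical helper names, the matrices are tiny made-up values, and a real implementation would use a tensor library. The point it shows is that the adapter path computes $B(Ax)$ in two small steps and never forms the full $d \times k$ product $BA$.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def lora_forward(W0, A, B, x, alpha, r):
    """h = W0 x + (alpha / r) * B (A x), with W0 frozen."""
    base = matvec(W0, x)      # frozen path: W0 x
    low = matvec(A, x)        # down-project to r dimensions: A x
    delta = matvec(B, low)    # up-project back to d dimensions: B (A x)
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, delta)]


# Toy shapes: d=2, k=3, r=1 (illustrative values only).
W0 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
A = [[1.0, 1.0, 1.0]]
B = [[0.5], [-0.5]]
x = [1.0, 2.0, 3.0]

h = lora_forward(W0, A, B, x, alpha=2, r=1)
print(h)  # [7.0, -4.0]
```

Here $W_0 x = [1, 2]$, $Ax = [6]$, $B(Ax) = [3, -3]$, and the scale is 2, which gives $[1 + 6, 2 - 6]$.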

Parameter Savings

For a single weight matrix $W_0 \in \mathbb{R}^{d \times k}$:

| Method | Parameters | Example ($d=4096$, $k=4096$, $r=16$) |
| --- | --- | --- |
| Full fine-tuning | $d \times k$ | 16,777,216 |
| LoRA | $d \times r + r \times k$ | 131,072 |
| Savings | $\frac{d \times r + r \times k}{d \times k} = \frac{2r}{d}$ (when $d = k$) | 0.78% of full |

60-Second Answer

"LoRA factorizes the weight update into two small matrices - a down-projection A that compresses the input to a low-rank space, and an up-projection B that maps back to the original dimension. Since the update has low intrinsic rank, this captures most of the fine-tuning effect with fewer than 1% of the parameters. The key trick is that A is initialized randomly and B is initialized to zero, so the adapter starts as an identity (no change) and gradually learns the adaptation."

Initialization

This is a critical detail that interviewers love to ask about:

  • $A$ is initialized randomly (Gaussian in the paper; the PEFT library defaults to Kaiming uniform)
  • $B$ is initialized to zero

Why? At the start of training, $\Delta W = BA = 0$, so the model begins as the exact pre-trained model. This ensures training stability - you don't corrupt the pre-trained representations from step 1.
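A common follow-up is why only one of the two matrices is zero. Applying the chain rule to $h = W_0 x + s\,BAx$ with upstream gradient $g = \partial \mathcal{L} / \partial h$ gives $\partial \mathcal{L} / \partial B = s\, g (Ax)^\top$ and $\partial \mathcal{L} / \partial A = s\, B^\top g\, x^\top$. The sketch below evaluates both at initialization; the helper names and the numbers are made up for illustration.

```python
def matvec(M, v):
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def outer(u, v):
    return [[u_i * v_j for v_j in v] for u_i in u]


def transpose(M):
    return [list(col) for col in zip(*M)]


def lora_grads(A, B, x, g, scale):
    """Gradients of the loss w.r.t. A and B, given g = dL/dh."""
    Ax = matvec(A, x)
    grad_B = [[scale * e for e in row] for row in outer(g, Ax)]
    Bt_g = matvec(transpose(B), g)
    grad_A = [[scale * e for e in row] for row in outer(Bt_g, x)]
    return grad_A, grad_B


A = [[0.3, -0.2, 0.5]]   # random init, r=1, k=3
B = [[0.0], [0.0]]       # zero init, d=2
x = [1.0, 2.0, 3.0]
g = [1.0, -1.0]          # pretend upstream gradient

grad_A, grad_B = lora_grads(A, B, x, g, scale=1.0)
print("grad_A:", grad_A)  # all zeros, because B is zero
print("grad_B:", grad_B)  # non-zero, because A x is non-zero
```

So on the first step only $B$ moves, and $A$ starts receiving gradient once $B$ is non-zero. If both matrices were initialized to zero, both gradients would be zero and the adapter would never start learning.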

The Scaling Factor $\alpha / r$

The update is scaled by $\frac{\alpha}{r}$:

$$h = W_0 x + \frac{\alpha}{r} B A x$$

Why divide by $r$? When you increase the rank, the magnitude of $BA$ increases (more terms contribute). Dividing by $r$ keeps the update magnitude roughly constant regardless of rank, making it easier to transfer hyperparameters across different rank settings.
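A quick way to internalize the scaling is to tabulate the multiplier for a few settings. The helper name `lora_scale` is hypothetical, and the settings are just examples.

```python
def lora_scale(alpha: float, r: int) -> float:
    """Multiplier applied to B A x in the forward pass."""
    return alpha / r


for alpha, r in [(16, 16), (32, 16), (16, 64)]:
    print(f"alpha={alpha:>2}, r={r:>2} -> scale={lora_scale(alpha, r)}")
```

The last row shows the practical consequence: if you raise the rank from 16 to 64 while leaving alpha at 16, the multiplier drops from 1.0 to 0.25, so alpha is usually adjusted together with the rank.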

Common settings:

  • $\alpha = r$ (effective scaling = 1) - simplest, works well
  • $\alpha = 2r$ (effective scaling = 2) - more aggressive adaptation
  • $\alpha = 16$, $r = 16$ is a very common default

Common Trap

Many candidates confuse LoRA alpha ($\alpha$) with the learning rate. They serve different purposes: $\alpha$ scales the magnitude of the LoRA update relative to the pretrained weights, while the learning rate controls the optimizer step size. You typically use a higher learning rate for LoRA parameters (e.g., 1e-4 to 3e-4) compared to full fine-tuning (e.g., 2e-5) because you're only updating a small number of parameters.

Part 3 - Which Layers to Adapt

Layer Selection Strategy

Not all layers benefit equally from LoRA adaptation. The original paper adapted only the attention projections; combining its findings with later practice gives this rough picture:

| Layer | LoRA Effectiveness | Reasoning |
| --- | --- | --- |
| $W_Q$ (query) | High | Modifies attention patterns |
| $W_K$ (key) | Medium-High | Affects what tokens attend to |
| $W_V$ (value) | High | Changes information extraction |
| $W_O$ (output projection) | Medium | Post-attention transformation |
| MLP up-projection | Medium-High | Modifies feature computation |
| MLP down-projection | Medium | Feature combination |
| Embedding layers | Low | Base vocabulary is sufficient |
| LayerNorm | Very Low | Already tiny, just tune directly |

Typical configurations:

# Conservative - attention only (original LoRA paper)
target_modules = ["q_proj", "v_proj"]

# Standard - all attention projections
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"]

# Aggressive - attention + MLP
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                  "up_proj", "gate_proj", "down_proj"]

[Figure: Transformer Block LoRA Target Layers]
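To compare the three target-module configurations listed above, it helps to count what each one trains. The sketch below assumes LLaMA-7B-style shapes (hidden size 4096, MLP size 11008, 32 layers); the `SHAPES` table and the function name are illustrative assumptions, not values read from a real checkpoint.

```python
# Assumed (out_features, in_features) per projection, LLaMA-7B style.
SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 4096),
    "v_proj": (4096, 4096),
    "o_proj": (4096, 4096),
    "up_proj": (11008, 4096),
    "gate_proj": (11008, 4096),
    "down_proj": (4096, 11008),
}


def lora_trainable_params(target_modules, r, num_layers=32):
    """Each adapted matrix adds r * (d_in + d_out) parameters."""
    per_layer = sum(r * (SHAPES[name][0] + SHAPES[name][1]) for name in target_modules)
    return num_layers * per_layer


conservative = lora_trainable_params(["q_proj", "v_proj"], r=16)
standard = lora_trainable_params(["q_proj", "k_proj", "v_proj", "o_proj"], r=16)
aggressive = lora_trainable_params(list(SHAPES), r=16)

for name, count in [("conservative", conservative),
                    ("standard", standard),
                    ("aggressive", aggressive)]:
    print(f"{name:>12}: {count:,} trainable parameters")
```

At rank 16 this gives roughly 8.4M, 16.8M, and 40M trainable parameters, so moving from attention-only to all linear layers costs about five times the adapter size of the conservative setting.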

Choosing the Rank

| Rank ($r$) | Parameters (7B model, QV only) | Quality | Use Case |
| --- | --- | --- | --- |
| 4 | ~2M (0.03%) | Good for simple tasks | Classification, sentiment |
| 8 | ~4M (0.06%) | Good general default | Most fine-tuning tasks |
| 16 | ~8M (0.11%) | Strong | Complex reasoning, code |
| 32 | ~16M (0.23%) | Near full fine-tuning quality | Domain-heavy adaptation |
| 64 | ~32M (0.46%) | Diminishing returns | Rarely needed |
| 256 | ~130M (1.9%) | Approaching full fine-tuning | Almost never used |

Company Variation
  • OpenAI: Has not published how its fine-tuning API is implemented. LoRA with r=8-16 is widely speculated but unconfirmed.
  • Meta: Published LLaMA with extensive LoRA benchmarks. Recommends r=8 for most tasks, r=64 for complex instruction following.
  • Google: Gemma models tested with LoRA rank 4-16. Layer selection matters more than rank.
  • Startups: Often use r=16 with all linear layers as a robust default.

Part 4 - Comparison with Other PEFT Methods

The PEFT Landscape

[Figure: PEFT Landscape]

Detailed Comparison

| Method | How It Works | Trainable Params | Inference Latency | Merging | Quality |
| --- | --- | --- | --- | --- | --- |
| Full Fine-Tuning | Update all parameters | 100% | Baseline | N/A | Best |
| LoRA | Low-rank weight updates | 0.1-1% | No overhead (merge) | Yes ✅ | Near full FT |
| Adapters | Insert small bottleneck layers | 0.5-5% | +5-10% overhead | No ❌ | Good |
| Prefix Tuning | Prepend learnable tokens to keys/values | 0.1-1% | +2-5% overhead | No ❌ | Moderate |
| Prompt Tuning | Prepend learnable embeddings to input | <0.1% | Minimal | No ❌ | Lower (small models) |
| BitFit | Only tune bias terms | 0.1% | No overhead | Yes ✅ | Lower |
| QLoRA | LoRA + 4-bit quantization | 0.1-1% | Minimal | Partial | Near LoRA |
| DoRA | Decompose weight into magnitude + direction | 0.1-1% | No overhead (merge) | Yes ✅ | Slightly > LoRA |

Why LoRA Won

LoRA has three critical advantages that made it the dominant PEFT method:

  1. No inference latency: After training, merge $\Delta W = BA$ into $W_0$, producing a standard model with zero additional computation
  2. Composability: Multiple LoRA adapters can be combined, swapped at serving time, or interpolated
  3. No architecture changes: Unlike adapters, LoRA doesn't add new layers, simplifying deployment

60-Second Answer

"LoRA dominates PEFT because of three properties: First, you can merge the adapter into the base weights at inference, so there's zero latency cost. Second, adapters are small (~10MB) and composable - you can swap adapters for different tasks with a single base model. Third, it doesn't change the model architecture, so all existing inference infrastructure works unchanged. Adapters add latency, prefix tuning reduces context length, and prompt tuning doesn't work well for smaller models."

Part 5 - QLoRA: The Memory Breakthrough

The Problem QLoRA Solves

Even with LoRA, you still need to load the base model in fp16 - 14 GB for a 7B model. QLoRA (Dettmers et al., 2023) solves this by quantizing the base model to 4-bit:

| Setup | Memory for 7B Model |
| --- | --- |
| Full fine-tuning (fp16) | ~100 GB |
| LoRA (fp16 base) | ~16 GB |
| QLoRA (4-bit base + LoRA) | ~6 GB |
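Most of the gap between the last two rows comes from the frozen base weights. A back-of-the-envelope sketch, assuming 7e9 parameters and ignoring quantization constants, adapters, optimizer states, and activations (the function name is illustrative):

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory for the frozen base weights only, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


fp16_gb = weight_memory_gb(7e9, 16)
four_bit_gb = weight_memory_gb(7e9, 4)
print(f"fp16 base:  {fp16_gb:.1f} GB")
print(f"4-bit base: {four_bit_gb:.1f} GB")
```

The remaining few GB in each row of the table are the adapter weights, their gradients and optimizer states, and activations, which are not modeled here.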

How QLoRA Works

  1. NF4 quantization: Quantize base model to 4-bit using NormalFloat4 - a data type designed for normally distributed weights
  2. Double quantization: Quantize the quantization constants themselves (saves ~0.37 bits/param)
  3. Paged optimizers: Use CPU memory as overflow for optimizer states using unified memory
  4. LoRA in fp16/bf16: The LoRA adapters (B, A) are kept in higher precision

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# QLoRA configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4
    bnb_4bit_compute_dtype=torch.bfloat16,  # Compute in bf16
    bnb_4bit_use_double_quant=True,         # Double quantization
)

# Load quantized base model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)

# Add LoRA adapters
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "up_proj", "gate_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Reports trainable vs. total parameters
# (about 40M trainable, roughly 0.6% of ~7B, for this configuration)

The Forward Pass with QLoRA

[Figure: QLoRA Forward Pass]

Common Trap

QLoRA does NOT train the quantized weights. The 4-bit weights are frozen. Only the LoRA adapters (in higher precision) receive gradients. The backward pass computes gradients through the dequantized weights and updates only A and B. This is a common confusion point - quantization is for memory efficiency of the base model, not for the trainable parameters.
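To make "frozen codes, dequantized on the fly" concrete, here is a deliberately simplified block-wise quantizer. It is an assumption-heavy sketch: it uses a uniform grid of integer codes in [-7, 7] with one absmax scale per block, whereas NF4 uses non-uniform levels tuned for normally distributed weights, and it skips double quantization entirely. The function names and sample weights are made up.

```python
def quantize_block(weights, levels=7):
    """Uniform absmax quantization: integer codes in [-levels, levels] plus one scale."""
    scale = max(abs(w) for w in weights) or 1.0  # avoid dividing by zero
    codes = [round(w / scale * levels) for w in weights]
    return codes, scale


def dequantize_block(codes, scale, levels=7):
    """Recover approximate weights for use in the forward/backward pass."""
    return [c / levels * scale for c in codes]


frozen = [0.12, -0.50, 0.33, 0.07]     # one small block of base weights
codes, scale = quantize_block(frozen)  # this is what stays in GPU memory
restored = dequantize_block(codes, scale)

max_err = max(abs(a - b) for a, b in zip(frozen, restored))
print("codes:", codes)
print("restored:", [round(v, 4) for v in restored])
print("max abs error:", round(max_err, 4))
```

Only `codes` and `scale` are stored, and neither is ever updated by the optimizer. The dequantized values exist transiently for the matrix multiply, and gradients flow through them to the higher-precision $A$ and $B$.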

Part 6 - Advanced LoRA Techniques

LoRA Merging

After training, merge the adapter into the base weights for zero-overhead inference:

$$W_\text{merged} = W_0 + \frac{\alpha}{r} BA$$

# Merge LoRA weights into base model
model = model.merge_and_unload()

# Now model is a standard model with no LoRA overhead
# Save as regular model
model.save_pretrained("merged_model")
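The claim that merging costs nothing at inference rests on a simple identity: multiplying by the merged matrix gives the same output as running the base path and the adapter path separately. A toy check with plain lists, where the helper names and values are illustrative and this is not how the PEFT library implements `merge_and_unload`:

```python
def matmul(X, Y):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]


def matvec(M, v):
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def merge_lora(W0, A, B, alpha, r):
    """W_merged = W0 + (alpha / r) * B A."""
    scale = alpha / r
    BA = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W0, BA)]


W0 = [[1.0, 2.0], [3.0, 4.0]]
A = [[1.0, -1.0]]     # r=1, k=2
B = [[0.5], [0.25]]   # d=2, r=1
x = [2.0, 1.0]
alpha, r = 2, 1

W_merged = merge_lora(W0, A, B, alpha, r)
h_merged = matvec(W_merged, x)

# Unmerged: base path plus scaled adapter path.
h_adapter = [b + (alpha / r) * d
             for b, d in zip(matvec(W0, x), matvec(B, matvec(A, x)))]

print(h_merged, h_adapter)  # both [5.0, 10.5]
```

The trade-off is that a merged model is tied to one adapter, so multi-tenant serving usually keeps adapters unmerged and pays the small extra compute.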

Multi-Adapter Serving

Serve multiple customers with a single base model:

[Figure: Multi-Adapter Serving Architecture]
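The core idea can be sketched as one frozen weight matrix shared by every request, with a per-request adapter looked up by customer. The names below (`batched_lora_forward`, the customer ids, the tiny matrices) are hypothetical, and the Python loop stands in for what a production system would do as one batched matmul for the base path plus a specialized kernel for the adapter path.

```python
def matvec(M, v):
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def batched_lora_forward(W0, adapters, batch):
    """batch: list of (adapter_id, x); adapters: id -> (A, B, scale)."""
    outputs = []
    for adapter_id, x in batch:
        base = matvec(W0, x)                 # same frozen weights for everyone
        A, B, scale = adapters[adapter_id]   # tiny per-customer matrices
        delta = matvec(B, matvec(A, x))
        outputs.append([b + scale * d for b, d in zip(base, delta)])
    return outputs


W0 = [[1.0, 0.0], [0.0, 1.0]]
adapters = {
    "customer_a": ([[1.0, 0.0]], [[1.0], [0.0]], 1.0),
    "customer_b": ([[0.0, 1.0]], [[0.0], [2.0]], 1.0),
}
batch = [("customer_a", [1.0, 1.0]), ("customer_b", [1.0, 1.0])]

outputs = batched_lora_forward(W0, adapters, batch)
print(outputs)  # [[2.0, 1.0], [1.0, 3.0]]
```

The same input produces different outputs per customer even though `W0` is loaded once, which is what makes the per-customer memory cost the adapter size rather than the model size.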

LoRA Composition and Arithmetic

You can combine multiple LoRA adapters:

$$W = W_0 + \lambda_1 B_1 A_1 + \lambda_2 B_2 A_2$$

This enables:

  • Task arithmetic: Add writing style adapter + domain knowledge adapter
  • Interpolation: $\lambda \cdot \Delta W_\text{task1} + (1-\lambda) \cdot \Delta W_\text{task2}$
  • Negation: Subtract an adapter to remove unwanted behavior
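The three operations above are all weighted sums of adapter deltas added to the same base matrix. A minimal sketch with toy matrices, where the function name `combine_adapters` and the "style" and "domain" adapters are made up for illustration:

```python
def matmul(X, Y):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]


def combine_adapters(W0, weighted_adapters):
    """W = W0 + sum(lam_i * B_i A_i); weighted_adapters is a list of (lam, B, A)."""
    W = [row[:] for row in W0]  # copy so the base weights stay untouched
    for lam, B, A in weighted_adapters:
        BA = matmul(B, A)
        for i, row in enumerate(BA):
            for j, value in enumerate(row):
                W[i][j] += lam * value
    return W


W0 = [[1.0, 0.0], [0.0, 1.0]]
style = ([[1.0], [0.0]], [[1.0, 1.0]])    # (B1, A1)
domain = ([[0.0], [1.0]], [[2.0, 0.0]])   # (B2, A2)

# Interpolation with lambda = 0.5 on each adapter.
blended = combine_adapters(W0, [(0.5, *style), (0.5, *domain)])
# Negation: subtract the style adapter.
negated = combine_adapters(W0, [(-1.0, *style)])

print(blended)  # [[1.5, 0.5], [1.0, 1.0]]
print(negated)  # [[0.0, -1.0], [0.0, 1.0]]
```

Note that the sum of two rank-$r$ updates can have rank up to $2r$, and nothing guarantees that independently trained adapters combine well, so composed adapters should be evaluated rather than assumed to work.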

DoRA: Weight-Decomposed LoRA

DoRA (2024) decomposes the pre-trained weight into a magnitude and a direction, and applies the LoRA update to the direction:

$$W = m \cdot \frac{W_0 + BA}{\|W_0 + BA\|_c}$$

where $m$ is a learnable magnitude vector and $\|\cdot\|_c$ is the column-wise norm. The authors report slightly better performance than LoRA at the same rank.
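A small numerical sketch of the formula, using plain lists and made-up values. It assumes $m$ is initialized to the column norms of $W_0$, which makes the reparameterized weight equal $W_0$ exactly while $B = 0$; the helper names are illustrative.

```python
import math


def matmul(X, Y):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]


def column_norms(W):
    """L2 norm of each column of W."""
    return [math.sqrt(sum(row[j] ** 2 for row in W)) for j in range(len(W[0]))]


def dora_weight(W0, A, B, m):
    """W = m * (W0 + B A) / ||W0 + B A||_c, applied column by column."""
    BA = matmul(B, A)
    V = [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W0, BA)]
    norms = column_norms(V)
    return [[m[j] * row[j] / norms[j] for j in range(len(m))] for row in V]


W0 = [[3.0, 0.0], [4.0, 2.0]]
A = [[1.0, 1.0]]
B_zero = [[0.0], [0.0]]

m = column_norms(W0)                       # [5.0, 2.0]
W_init = dora_weight(W0, A, B_zero, m)
print(W_init)                              # matches W0 at initialization

# Once B is non-zero, direction changes but each column's norm stays at m[j].
W_step = dora_weight(W0, A, [[0.5], [0.0]], m)
print(column_norms(W_step))
```

The second print illustrates the decomposition: the low-rank update can only rotate each column, while its length is controlled separately through $m$.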

Part 7 - Practical Implementation Guide

Complete Fine-Tuning Script

import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    BitsAndBytesConfig,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer

# 1. Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer.pad_token = tokenizer.eos_token  # Llama has no pad token by default

# 2. Quantization config (for QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# 3. Load model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
model = prepare_model_for_kbit_training(model)

# 4. LoRA config
lora_config = LoraConfig(
    r=16,                # Rank - start here, increase if quality is insufficient
    lora_alpha=32,       # Alpha = 2 * r is a good default
    target_modules=[     # All linear layers for best quality
        "q_proj", "k_proj", "v_proj", "o_proj",
        "up_proj", "gate_proj", "down_proj",
    ],
    lora_dropout=0.05,   # Small dropout for regularization
    bias="none",         # Don't train biases
    task_type="CAUSAL_LM",
)

# 5. Apply LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# 6. Training arguments
training_args = TrainingArguments(
    output_dir="./lora-output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,   # Effective batch size = 16
    learning_rate=2e-4,              # Higher LR for LoRA
    warmup_steps=100,
    logging_steps=10,
    save_strategy="epoch",
    bf16=True,
    optim="paged_adamw_8bit",        # Memory-efficient optimizer
    gradient_checkpointing=True,     # Trade compute for memory
)

# 7. Load dataset and train
dataset = load_dataset("your_dataset")  # placeholder - substitute your own dataset

# Note: `tokenizer=` and `max_seq_length=` match older TRL releases. Newer
# releases take `processing_class=` and move sequence-length settings into
# `SFTConfig`, so check the signature for your installed version.
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    tokenizer=tokenizer,
    max_seq_length=2048,
)

trainer.train()

# 8. Save the adapter only (~40M parameters here, so roughly 80-160 MB
#    depending on dtype, versus ~14 GB for the full fp16 model)
model.save_pretrained("./lora-adapter")

Hyperparameter Tuning Guide

| Hyperparameter | Start | Range | Impact |
| --- | --- | --- | --- |
| Rank (r) | 16 | 4-64 | Higher = more capacity, more memory |
| Alpha (α) | 2 * r | r to 4 * r | Higher = stronger adaptation signal |
| Learning rate | 2e-4 | 1e-4 to 5e-4 | Higher than full FT because fewer params |
| Dropout | 0.05 | 0-0.1 | Higher if overfitting on small datasets |
| Target modules | All linear | QV only to all linear | More modules = better quality, more params |
| Epochs | 3 | 1-5 | LoRA overfits faster than full FT |

Part 8 - When LoRA vs Fine-Tuning vs Prompting

Decision Framework

[Figure: LoRA Decision Framework]

Comparison Table

| Factor | Prompt Engineering | LoRA | Full Fine-Tuning |
| --- | --- | --- | --- |
| Training cost | $0 | $10-100 | $1K-100K+ |
| Training time | 0 | Hours | Days-weeks |
| Data needed | 0-10 examples | 100-10K examples | 1K-100K+ examples |
| Quality ceiling | Limited by model | Near full FT | Highest |
| New knowledge | No | Limited | Yes |
| Serving cost | Higher (long prompts) | Same as base | Same as base |
| Iteration speed | Minutes | Hours | Days |
| Multi-tenant | Easy (different prompts) | Easy (swap adapters) | Hard (separate models) |

Part 9 - Practice Problems

Problem 1: LoRA Parameter Calculation

A LLaMA-7B model has 32 transformer layers. Each layer has Q, K, V, O projections (all 4096 x 4096) and an MLP with up (4096 x 11008), gate (4096 x 11008), and down (11008 x 4096) projections. Calculate the total trainable parameters when applying LoRA with rank 16 to all linear layers.

Hint 1 - Direction

For each weight matrix $W \in \mathbb{R}^{d_{out} \times d_{in}}$, LoRA adds $A \in \mathbb{R}^{r \times d_{in}}$ and $B \in \mathbb{R}^{d_{out} \times r}$, so the parameter count is $r \times (d_{in} + d_{out})$.

Full Answer

Per layer, attention projections (Q, K, V, O):

  • Each: $r \times (4096 + 4096) = 16 \times 8192 = 131,072$
  • 4 projections: $4 \times 131,072 = 524,288$

Per layer, MLP projections (up, gate, down):

  • Up/Gate: $r \times (4096 + 11008) = 16 \times 15104 = 241,664$ each
  • Down: $r \times (11008 + 4096) = 16 \times 15104 = 241,664$
  • 3 projections: $3 \times 241,664 = 724,992$

Per layer total: $524,288 + 724,992 = 1,249,280$

All 32 layers: $32 \times 1,249,280 = 39,976,960 \approx 40M$ parameters

As percentage of 7B: $40M / 7000M \approx 0.57\%$

Problem 2: Rank Selection

Your team fine-tuned a 13B model with LoRA rank 4 on a medical QA dataset (5000 examples). The model performs well on general medical questions but fails on rare diseases. Should you increase the rank, add more data, or try a different approach? Justify with theory.

Hint 1 - Direction

Think about what the rank limits: the expressiveness of the weight update. Rare diseases represent tail knowledge that may not be in the pre-training data at all.

Full Answer + Rubric

Analysis: The issue is likely not rank. Rank controls the expressiveness of the adaptation, not the knowledge of the model. If the base model doesn't know about rare diseases, no amount of LoRA adaptation will surface that knowledge - you can't extract information that isn't there.

Recommended approach (in order):

  1. Add more data - Curate examples specifically about rare diseases. LoRA with rank 4 on more targeted data will likely outperform rank 64 on the same 5000 generic examples.

  2. Continued pre-training + LoRA - First, do continued pre-training on medical literature (including rare disease content) to inject knowledge, then apply LoRA for task adaptation.

  3. RAG - Retrieve rare disease information from a medical knowledge base at inference time. This doesn't require fine-tuning at all.

  4. Only then increase rank - If the model has the knowledge but can't express it, increase rank to 16-32.

Scoring:

  • Strong Hire: Distinguishes between knowledge (data) and expressiveness (rank), suggests data-first and RAG approaches
  • Lean Hire: Suggests increasing rank but also mentions data quality
  • No Hire: Just says "increase the rank to 64"

Problem 3: Multi-Tenant Architecture

Design a serving system for a single 70B base model with 100 LoRA adapters (one per customer). Requirements: <100ms latency, ability to add new customers in minutes, and cost-effective GPU usage.

Hint 1 - Direction

Think about how LoRA adapters interact with batched inference. Can you batch requests across different adapters?

Full Answer + Rubric

Architecture:

  1. Base model: Load once in GPU memory (~140 GB in fp16, which must be sharded across multiple GPUs, or ~35 GB in 4-bit). Shared across all requests.

  2. Adapter storage: Store all 100 adapters on fast storage (SSD). Each is ~50 MB. Total: 5 GB - trivially small.

  3. Hot adapter cache: Keep the 20 most active adapters in GPU memory. Each adapter adds ~50 MB, so 20 adapters = 1 GB.

  4. Request routing: Route incoming requests by customer_id to the appropriate adapter. Use an adapter cache with LRU eviction.

  5. Batching strategy: Requests with the same adapter can be batched normally. For mixed-adapter batching, use frameworks like S-LoRA or Punica that support batched LoRA inference by:

    • Computing $W_0 x$ for the full batch (shared)
    • Computing adapter-specific $BAx$ with custom CUDA kernels that handle different adapters per sequence

  6. Scaling: To meet the <100ms latency target, use tensor parallelism across 2-4 GPUs. For more customers or traffic, replicate the base model.
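The hot adapter cache from steps 3 and 4 can be sketched with an `OrderedDict`. This is a minimal illustration, not a serving framework: the class name `AdapterCache`, the `load_fn` callback, and the customer ids are all hypothetical, and a real system would also handle concurrent requests and asynchronous loads from disk.

```python
from collections import OrderedDict


class AdapterCache:
    """Keeps at most `capacity` adapters resident, evicting the least recently used."""

    def __init__(self, capacity, load_fn):
        self.capacity = capacity
        self.load_fn = load_fn          # called on a miss, e.g. read from SSD
        self._cache = OrderedDict()

    def get(self, customer_id):
        if customer_id in self._cache:
            self._cache.move_to_end(customer_id)   # mark as most recently used
            return self._cache[customer_id]
        adapter = self.load_fn(customer_id)
        self._cache[customer_id] = adapter
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)        # drop least recently used
        return adapter

    def resident(self):
        return list(self._cache)


loads = []


def fake_load(customer_id):
    """Stand-in for loading adapter weights from storage."""
    loads.append(customer_id)
    return f"adapter-weights-for-{customer_id}"


cache = AdapterCache(capacity=2, load_fn=fake_load)
for cid in ["acme", "globex", "acme", "initech", "globex"]:
    cache.get(cid)

print("loads from storage:", loads)
print("resident adapters:", cache.resident())
```

With a capacity of 2, the second request for "acme" is a hit, "initech" evicts "globex", and the final "globex" request has to reload from storage. Sizing the cache is then a matter of the arithmetic in step 3: capacity times adapter size must fit in the GPU memory left after the base model and KV cache.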

Scoring:

  • Strong Hire: Describes shared base model, adapter caching, mixed-adapter batching (S-LoRA/Punica), and concrete memory calculations
  • Lean Hire: Understands shared base model with swappable adapters but doesn't address batching across adapters
  • No Hire: Suggests loading separate model copies per customer

Part 10 - The Paper in Context

Key Contributions of the LoRA Paper

  1. Low-rank hypothesis for fine-tuning updates - theoretically motivated and empirically validated
  2. Zero initialization via $B=0$ - ensures training stability
  3. No inference latency - the adapter merges into the base weights, unlike adapter layers or prefix tuning
  4. Composability - multiple adapters on one base model

Impact on the Field

  • Made fine-tuning accessible to individuals and small teams (QLoRA on a single GPU)
  • Enabled multi-tenant LLM serving for enterprise
  • Spawned a family of methods: QLoRA, LoRA+, DoRA, AdaLoRA, LoRA-FA
  • Became the standard fine-tuning approach for the open-source LLM ecosystem

Timeline

[Figure: PEFT Evolution Timeline]

Interview Cheat Sheet

| Question Pattern | Framework | Key Phrases |
| --- | --- | --- |
| "Explain LoRA" | Full FT cost → low-rank hypothesis → $W = W_0 + BA$ → init → merging | "LoRA exploits the fact that fine-tuning updates have low intrinsic rank, so we factorize the update into two small matrices" |
| "Why initialize B to zero?" | Start as identity → training stability → gradual adaptation | "B=0 means the model starts as the exact pretrained model. We don't corrupt learned representations from step 1." |
| "LoRA vs full fine-tuning?" | Parameter count → memory → quality comparison → when each wins | "LoRA achieves 95-99% of full FT quality with <1% of trainable params. Full FT wins only when you have unlimited compute and need maximum quality." |
| "What is QLoRA?" | 4-bit base model → NF4 quantization → LoRA in higher precision | "QLoRA quantizes the frozen base model to 4-bit, reducing memory 4x, while keeping LoRA adapters in bf16 for training quality." |
| "How to choose rank?" | Start r=16 → tune based on task complexity → diminishing returns | "I start with r=16 as a default. For simple classification tasks, r=4-8 suffices. Complex reasoning might need r=32-64." |
| "Multi-tenant serving?" | Shared base model → adapter cache → S-LoRA batching | "One base model in memory, swap LoRA adapters per request. Use S-LoRA for mixed-adapter batching." |

Spaced Repetition Checkpoints

  • Day 0: Read this page. Write the LoRA equation from memory. Explain why B is initialized to zero.
  • Day 3: Calculate LoRA parameters for a given model. Explain QLoRA's three innovations without looking.
  • Day 7: Compare LoRA vs adapters vs prefix tuning. Give three reasons LoRA won.
  • Day 14: Design a multi-tenant LoRA serving architecture from memory. Include memory calculations.
  • Day 21: Solve all three practice problems. Time yourself - 5-8 minutes each.

Next Steps

  • Continue to RLHF Papers to understand how LoRA-adapted models are aligned using RLHF
  • Review Adam Optimizer for understanding the optimizer used during LoRA training
  • For system design with LoRA, see ML System Design
© 2026 EngineersOfAI. All rights reserved.