Full Fine-Tuning vs PEFT
The 3 AM Alert
Your on-call phone buzzes at 3:14 AM. The medical AI assistant your team deployed six months ago - built on a fine-tuned Llama 3 70B - has started recommending drug dosages that are subtly wrong. Not catastrophically wrong. Subtly wrong. Off by enough to matter, not enough to trigger hard validation rules.
The post-mortem reveals the root cause: the base model's medical knowledge was always incomplete. You instruction-tuned it on clinical dialogue examples, and the fine-tuning made the model sound authoritative. But the underlying parametric knowledge was never updated. The model learned to format medical answers confidently. It didn't learn medicine.
The fix requires a real decision: do you go back and do full fine-tuning on a medical corpus, burning through $40,000 in GPU compute and six weeks of engineering time? Or do you stack another PEFT adapter on top, this time trained on medical facts rather than clinical dialogue style?
This is the decision every ML team faces once they get past the tutorial phase. The answer is not "it depends" - that is a non-answer. The answer requires understanding what full fine-tuning and PEFT actually do to a model's weights, what each approach can and cannot learn, and what the total cost picture looks like beyond just training compute.
This lesson gives you the decision framework. By the end, you will be able to look at any fine-tuning problem and make the right call in under five minutes - not by gut feel, but by reasoning through the specific tradeoffs that matter for your use case.
Why This Exists - The Problem With "Just Fine-Tune It"
Before parameter-efficient methods existed, fine-tuning meant one thing: update all the weights. Train on your data, update every parameter, done. For smaller models - BERT at 110M parameters, GPT-2 at 1.5B - this was completely tractable on a single V100.
Then model scale happened. GPT-3 at 175B parameters changed the equation entirely. Even inference was expensive. Fine-tuning was completely out of reach for anyone without a research lab budget. The community needed a way to adapt large models without retraining them from scratch.
The first generation of solutions was prompt engineering and in-context learning: just tell the model what to do in the prompt. This worked surprisingly well for many tasks. But it had a hard ceiling. You couldn't change what the model knew, only how it was being asked to use that knowledge. For specialized domains - legal, medical, code generation in niche languages - in-context learning hit that ceiling fast.
The second generation was adapter methods. Houlsby et al. (2019) showed you could insert small bottleneck modules into transformer layers and train only those modules. The base model weights stayed frozen. Adapter training used a fraction of the compute. Quality on downstream tasks was competitive with full fine-tuning for many classification benchmarks.
Then Hu et al. (2021) published LoRA, and the entire field shifted. Instead of adding new modules, LoRA decomposed weight updates into low-rank matrices. The insight was simple but powerful: the change in weights during fine-tuning has a much lower intrinsic rank than the full weight matrix. You do not need to update all 4096x4096 values in an attention matrix to teach a model new behavior. You need to find the low-dimensional subspace where the important changes live.
This created a real choice: full fine-tuning when you need everything the model can give you, PEFT when you need efficiency. Understanding when each one wins - and why - is what this lesson is about.
Historical Context - From Full Fine-Tuning to LoRA
The distinction between full fine-tuning and parameter-efficient methods emerged from a specific bottleneck in the history of NLP.
2018 - The BERT era: Fine-tuning BERT (110M parameters) on a single GPU was standard practice. Full fine-tuning was the default because it was affordable. The pretrain-then-fine-tune recipe became the dominant paradigm for NLP.
2019 - Adapter layers: Neil Houlsby and colleagues at Google Brain published "Parameter-Efficient Transfer Learning for NLP." They inserted two small feed-forward networks (adapters) into each transformer layer. Training only adapters achieved within 0.4% of full fine-tuning on GLUE benchmarks while updating only 3.6% of parameters. The idea was right but the architecture added inference latency - the adapters had to run at inference time.
2021 - Prefix tuning and prompt tuning: Xiang Lisa Li and Percy Liang introduced prefix tuning, and Brian Lester et al. showed you could prepend trainable soft tokens to the input and train only those tokens. For very large models (T5 11B), prompt tuning matched fine-tuning quality. For smaller models, the gap was significant.
2021 - LoRA: Edward Hu, Yelong Shen, and colleagues at Microsoft published "LoRA: Low-Rank Adaptation of Large Language Models." The key insight came from an observation about neural network training: the weight updates that happen during fine-tuning have a low intrinsic rank. A 4096x4096 weight matrix has 16.7M parameters, but the important variation can often be captured with rank 8 or rank 16 - at rank 8, you only need to learn 8 x (4096 + 4096) = 65,536 parameters instead.
The "aha moment" was that LoRA adapters could be merged into the base weights at inference time - zero added latency compared to the base model. This eliminated the main disadvantage of adapter methods.
2023 - QLoRA: Tim Dettmers et al. combined LoRA with 4-bit quantization of the base model. You could now fine-tune a 65B parameter model on a single 48GB GPU. This democratized fine-tuning to anyone with access to a single high-end GPU.
2023-2024 - The ecosystem explosion: Axolotl, LLaMA-Factory, Unsloth, and dozens of other training frameworks built on LoRA. The community converged on PEFT as the default for most fine-tuning scenarios. Full fine-tuning became the exception, used only when you had strong reasons.
Core Concepts - What Actually Changes in Each Approach
Full Fine-Tuning: Every Weight is a Variable
In full fine-tuning, all model parameters are updated during backpropagation. For a model with $N$ parameters, the optimizer maintains roughly $3N$ additional values: one gradient plus two Adam moment estimates ($m$ and $v$) per parameter.
For a 7B parameter model with Adam optimizer in bf16:
- Model weights: $2N$ bytes (bf16) = 14 GB
- Gradients: $4N$ bytes (float32 for gradient accumulation) = 28 GB
- Adam optimizer states ($m$, $v$): $8N$ bytes (two float32 values per parameter) = 56 GB
- Total: approximately 98 GB
That requires multiple A100 80GB GPUs just for the model state. Add activations for a reasonable batch size and you are looking at 4-8 A100s for a 7B model, 32-64 A100s for a 70B model.
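Those byte counts are easy to sanity-check in code. A back-of-the-envelope sketch using the per-parameter costs above (bf16 weights, fp32 gradients and Adam moments); real peak memory also depends on activations and framework overhead:

def full_ft_memory_gb(n_params_b: float) -> dict:
    # Per-parameter bytes for full fine-tuning with Adam:
    # bf16 weights (2 B), fp32 gradients (4 B), fp32 Adam m and v (8 B)
    weights = n_params_b * 2
    grads = n_params_b * 4
    optim = n_params_b * 8
    return {"weights": weights, "grads": grads, "optimizer": optim,
            "total": weights + grads + optim}

print(full_ft_memory_gb(7))    # {'weights': 14, 'grads': 28, 'optimizer': 56, 'total': 98}
print(full_ft_memory_gb(70))   # scales linearly: ~980 GB of model state before activations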
The update rule for each parameter $\theta$ after seeing a batch of training data:

$$\theta_{t+1} = \theta_t - \eta \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

where $\hat{m}_t$ and $\hat{v}_t$ are bias-corrected first and second moment estimates from Adam. Every single parameter in the network gets updated on every batch. The model's knowledge can shift substantially.
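For concreteness, here is the same update as a minimal NumPy sketch. It also shows why Adam carries two persistent extra values ($m$ and $v$) per parameter:

import numpy as np

def adam_step(theta, grad, m, v, t, lr=2e-5, beta1=0.9, beta2=0.999, eps=1e-8):
    # Update the moment estimates, then apply the bias-corrected step
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)   # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)   # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v             # m and v persist across steps - two extra values per parameter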
PEFT with LoRA: Constrained Updates
LoRA freezes the base model and adds trainable low-rank matrices to specific weight matrices - typically the query and value projections in attention layers.
For a weight matrix $W_0 \in \mathbb{R}^{d \times k}$, the forward pass becomes:

$$h = W_0 x + \frac{\alpha}{r} B A x$$

where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and $r \ll \min(d, k)$ is the rank (typically 8-64). The scaling factor $\alpha / r$ controls how much the adapter contributes. At initialization, $B$ is set to zero so the model starts identical to the base model.
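The same forward pass as a minimal PyTorch module. This is a sketch, not the peft implementation, but it shows the three ingredients: the frozen base weight, the zero-initialized $B$, and the $\alpha/r$ scaling:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False   # W_0 stays frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)  # small random init
        self.B = nn.Parameter(torch.zeros(base.out_features, r))        # zero init: BA = 0 at start
        self.scaling = alpha / r

    def forward(self, x):
        # h = W_0 x + (alpha / r) * B A x
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)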
Memory requirement for training a 7B model with LoRA (rank=16, targeting q_proj and v_proj):
- Frozen base model weights: 14 GB (bf16)
- LoRA trainable parameters: approximately 8-80M depending on rank and target layers
- Gradients for LoRA only: approximately 0.3 GB
- Optimizer states for LoRA only: approximately 0.6 GB
- Total: approximately 15-16 GB - fits on a single A100
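Where those adapter parameter counts come from: each adapted $d \times k$ matrix adds $r(d + k)$ parameters ($A$ is $r \times k$, $B$ is $d \times r$). A small counter, assuming square 4096x4096 projections as in Llama 2 7B:

def lora_params(d: int = 4096, k: int = 4096, r: int = 16,
                n_layers: int = 32, n_targets: int = 2) -> int:
    # A is (r x k) and B is (d x r): r * (d + k) parameters per adapted matrix
    return r * (d + k) * n_targets * n_layers

print(f"{lora_params():,}")                    # 8,388,608 -> ~8.4M (r=16, q_proj + v_proj)
print(f"{lora_params(r=64, n_targets=7):,}")   # 117,440,512 - rank and target count scale it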
The Rank Hypothesis - Why LoRA Works
The fundamental claim LoRA makes is that fine-tuning updates lie in a low-dimensional subspace. Aghajanyan et al. (2020) in "Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning" provided evidence that the effective dimensionality of fine-tuning updates is much lower than the parameter count suggests.
The practical implication: if you are teaching a model a new output format, a new persona, or task-specific instruction following behavior, the weight changes needed are genuinely low-rank. LoRA captures them efficiently.
If you are teaching a model new factual knowledge that requires updating many different neurons across many layers - domain-specific terminology, new entities, new relationships - the changes are less clearly low-rank, and full fine-tuning may use the available parameter budget more effectively.
When Full Fine-Tuning Wins
1. Continual Pre-Training on Domain Text
If your goal is to deeply embed domain knowledge into the model - medical literature, legal documents, proprietary technical documentation - you are doing continual pre-training, not instruction fine-tuning. You are not teaching the model a new behavior; you are updating what it knows.
This requires full fine-tuning. The factual associations you are building span many layers and many weight matrices. LoRA's low-rank constraint limits how much new knowledge can be encoded.
The distinction is critical:
- Teaching a model to answer medical questions in a specific format = instruction fine-tuning = LoRA works
- Teaching a model that metformin works by activating AMPK and suppressing hepatic glucose production = continual pre-training = full fine-tuning needed
2. Task-Specific Models Where General Capability Is Irrelevant
If you are building a model that will only ever do one thing - extract structured data from invoices, classify support tickets into 47 categories, generate SQL from natural language - and you never need the model to do anything else, full fine-tuning gives you a higher quality ceiling.
The base model's general knowledge acts as implicit regularization in LoRA (frozen weights pull toward general capability). In full fine-tuning, the model can specialize completely. For high-throughput production systems doing one narrow task at scale, that extra quality ceiling matters.
3. Very Large Proprietary Datasets
When you have millions of domain-specific examples and the quality of the task distribution genuinely requires capacity across many model components, full fine-tuning learns more efficiently. At very large data scales, LoRA's rank constraint becomes a bottleneck.
Practical threshold: if you have more than 10M high-quality training examples for a specific task, consider full fine-tuning. Below 1M examples, PEFT methods typically match or approach full fine-tuning quality.
4. Changing the Model's Tokenizer or Embedding Layer
If you need to add new tokens to the vocabulary - new domain-specific terms, new programming languages, new special tokens - you must use full fine-tuning. LoRA does not typically target embedding layers, and adding vocabulary requires retraining the head as well.
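The mechanics of vocabulary extension, sketched with the standard transformers calls (the added token strings are hypothetical):

from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

num_added = tokenizer.add_tokens(["<DRUG_NAME>", "<DOSAGE_MG>"])  # hypothetical domain tokens
model.resize_token_embeddings(len(tokenizer))
# The new embedding (and tied LM head) rows start randomly initialized.
# They only become meaningful if those matrices are trained - exactly the
# matrices a standard LoRA config leaves frozen.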
When PEFT Wins
1. Instruction Fine-Tuning and Chat
This is the canonical LoRA use case. Teaching a model to follow instructions, maintain a persona, respond in a specific format, use system prompts correctly - these are all behavioral modifications that are genuinely low-rank. The model already has the knowledge; you are teaching it how to communicate.
Llama 3.1 Instruct was not built by instruction-tuning from scratch on every message format. The behavioral changes that differentiate "base model" from "instruct model" are largely capturable with LoRA.
2. Limited Compute Budget
The compute math is straightforward. If you have one A100 and need to fine-tune a 7B model:
- Full fine-tuning: impossible without model parallelism and gradient checkpointing tricks
- QLoRA (4-bit base + LoRA): fits comfortably, 3-4 hours for 50K examples
If you have a $500 budget for a one-time fine-tuning run:
- Full fine-tuning of 7B: several hours across 8 A100s (tens of A100-hours), approximately $200-400 for training plus storage
- QLoRA of 7B: several hours on a single A100, approximately $20-40
3. Multiple Specialized Variants from the Same Base
If you need 10 different fine-tuned variants of the same base model - one per customer, one per language, one per use case - PEFT is dramatically more efficient for storage and deployment.
- Full fine-tuning: 10 complete model copies at 14 GB each = 140 GB of storage minimum
- LoRA adapters: 1 base model (14 GB) + 10 adapters at 50-200 MB each = approximately 16 GB total
For multi-tenant systems, this difference is the difference between tractable and intractable infrastructure.
4. Rapid Experimentation
When you are still figuring out what your fine-tuning objective should be, PEFT methods let you run 10 experiments in the time full fine-tuning would run 1. The iteration velocity advantage is significant in early stages of a project.
The 3 Hidden Costs of Full Fine-Tuning
The GPU cost is visible. Three other costs are less obvious but often larger in total.
Hidden Cost 1 - Storage and Checkpoint Management
During training, you checkpoint every N steps. A 7B model checkpoint in bf16 is 14 GB. A 70B model checkpoint is 140 GB. If you checkpoint every 500 steps for a 10-epoch run with 100K examples, you generate 20+ checkpoints before you select the best one.
For full fine-tuning of a 7B model:
- 20 checkpoints x 14 GB = 280 GB of S3 storage
- At $0.023/GB/month (S3 standard tier), that is about $6.44/month just to keep your checkpoints around for analysis
- Scale to 70B: 2.8 TB of checkpoints, $64/month in storage
LoRA adapters checkpoint the adapter weights only:
- 20 checkpoints x 150 MB = 3 GB
- Storage cost is effectively zero
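The arithmetic behind both checkpoint lists, as a one-function sketch (the rate is the S3 standard-tier assumption used above):

S3_USD_PER_GB_MONTH = 0.023  # S3 standard tier, as assumed above

def checkpoint_storage(n_checkpoints: int, checkpoint_gb: float) -> tuple:
    total_gb = n_checkpoints * checkpoint_gb
    return total_gb, round(total_gb * S3_USD_PER_GB_MONTH, 2)

print(checkpoint_storage(20, 14))     # full FT 7B:  (280 GB, $6.44/month)
print(checkpoint_storage(20, 140))    # full FT 70B: (2800 GB, $64.4/month)
print(checkpoint_storage(20, 0.15))   # LoRA:        (3.0 GB, $0.07/month)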
Hidden Cost 2 - Inference and Hot-Swapping
Full fine-tuned models cannot be hot-swapped. If you have 10 customers each needing a customized model, you need 10 separate inference servers, each loading a complete 14 GB model. Or you need a serving system that can efficiently multiplex requests across models.
LoRA adapters can be hot-swapped: load the base model once, swap adapter weights between requests. Systems like vLLM and Punica implement this. For multi-tenant serving, this is what makes the infrastructure tractable.
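A sketch of what hot-swapping looks like with vLLM's multi-LoRA support. The adapter names and paths are hypothetical, and the exact API may differ across vLLM versions:

from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Load the base model once, with LoRA support enabled
llm = LLM(model="meta-llama/Meta-Llama-3-8B", enable_lora=True)
params = SamplingParams(max_tokens=256)

# Route each request to a per-customer adapter - no model reload between them
out_a = llm.generate(["Summarize this ticket: ..."], params,
                     lora_request=LoRARequest("customer_a", 1, "/adapters/customer_a"))
out_b = llm.generate(["Summarize this ticket: ..."], params,
                     lora_request=LoRARequest("customer_b", 2, "/adapters/customer_b"))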
Hidden Cost 3 - Experimentation Velocity
When your fine-tuning job takes 8 hours instead of 45 minutes, your engineering team can run 10x fewer experiments per week. The quality ceiling of full fine-tuning is irrelevant if you cannot afford to find the right hyperparameters, the right data mix, and the right training duration.
Most teams underestimate how many iterations it takes to get a production-quality fine-tuned model. Budget for at least 10-20 training runs. With full fine-tuning of a 7B model, that is $2,000-4,000 in compute before you have a production candidate. With QLoRA, $250-500.
Memory and Cost Comparison Tables
Memory Requirements by Model Size and Method
| Model Size | Full Fine-Tuning (Adam, bf16) | LoRA (r=16, bf16 base) | QLoRA (r=16, 4-bit base) |
|---|---|---|---|
| 7B | ~98 GB (8x A100 40GB) | ~18 GB (1x A100 40GB) | ~10 GB (1x RTX 3090) |
| 13B | ~180 GB (16x A100 40GB) | ~32 GB (2x A100 40GB) | ~16 GB (1x A100 40GB) |
| 34B | ~470 GB (6x A100 80GB) | ~80 GB (2x A100 80GB) | ~40 GB (1x A100 80GB) |
| 70B | ~960 GB (12x A100 80GB) | ~160 GB (4x A100 80GB) | ~80 GB (2x A100 80GB) |
Approximate Training Cost for 50K Examples, 3 Epochs
| Model Size | Full Fine-Tuning | LoRA | QLoRA |
|---|---|---|---|
| 7B | $150-300 | $15-30 | $8-15 |
| 13B | $400-700 | $30-60 | $15-25 |
| 70B | $2,000-4,000 | $200-400 | $80-150 |
Costs assume cloud A100 at approximately $3/hour. Actual costs vary by provider and spot pricing.
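The table values reduce to GPUs x hours x hourly rate. A sketch of the estimate - the run profiles below are illustrative assumptions, not measurements:

A100_USD_PER_HOUR = 3.0  # cloud on-demand assumption from the note above

def training_cost(gpus: int, hours: float, rate: float = A100_USD_PER_HOUR) -> float:
    return gpus * hours * rate

# Hypothetical 7B run profiles (50K examples, 3 epochs)
print(training_cost(gpus=8, hours=10))   # full fine-tuning: $240
print(training_cost(gpus=1, hours=8))    # LoRA:  $24
print(training_cost(gpus=1, hours=4))    # QLoRA: $12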
Code Examples
Full Fine-Tuning Setup with DeepSpeed ZeRO-3
# Full fine-tuning of Llama 3 8B with DeepSpeed ZeRO-3
# Requires: 8x A100 40GB or 4x A100 80GB
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
TrainingArguments,
Trainer,
)
from datasets import load_dataset
import torch
MODEL_ID = "meta-llama/Meta-Llama-3-8B"
OUTPUT_DIR = "./llama3-8b-full-finetune"
# Load model without quantization for full fine-tuning
model = AutoModelForCausalLM.from_pretrained(
MODEL_ID,
torch_dtype=torch.bfloat16,
use_cache=False, # disable KV cache during training
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
# All parameters are trainable
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
total_params = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable_params:,} / {total_params:,} = {trainable_params/total_params:.1%}")
# Output: Trainable: 8,030,261,248 / 8,030,261,248 = 100.0%
# DeepSpeed config lives in ds_config.json - ZeRO-3 shards optimizer states,
# gradients, AND parameters across all GPUs
training_args = TrainingArguments(
output_dir=OUTPUT_DIR,
num_train_epochs=3,
per_device_train_batch_size=2,
gradient_accumulation_steps=8, # effective batch = 2 * 8 * num_gpus
learning_rate=2e-5,
lr_scheduler_type="cosine",
warmup_ratio=0.03,
bf16=True,
tf32=True,
logging_steps=10,
save_strategy="steps",
save_steps=500,
deepspeed="ds_config_zero3.json", # critical for 8B model
gradient_checkpointing=True, # trade compute for memory
report_to="wandb",
)
# Assumes the dataset is already tokenized (input_ids / attention_mask / labels)
dataset = load_dataset("your_dataset_here", split="train")
trainer = Trainer(
model=model,
args=training_args,
train_dataset=dataset,
tokenizer=tokenizer,
)
trainer.train()
trainer.save_model(OUTPUT_DIR)
// ds_config_zero3.json
{
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "none"
},
"offload_param": {
"device": "none"
},
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1e9,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_gather_16bit_weights_on_model_save": true
},
"bf16": {
"enabled": true
},
"gradient_clipping": 1.0,
"train_micro_batch_size_per_gpu": 2,
"gradient_accumulation_steps": 8
}
LoRA Fine-Tuning with PEFT
# LoRA fine-tuning of Llama 3 8B
# Requires: 1x A100 40GB
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model, TaskType
from trl import SFTTrainer
from datasets import load_dataset
import torch
MODEL_ID = "meta-llama/Meta-Llama-3-8B"
OUTPUT_DIR = "./llama3-8b-lora"
model = AutoModelForCausalLM.from_pretrained(
MODEL_ID,
torch_dtype=torch.bfloat16,
device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
# Configure LoRA
lora_config = LoraConfig(
r=16, # rank - higher = more capacity, more memory
lora_alpha=32, # scaling factor; effective scale = alpha/r = 2.0
target_modules=[ # which weight matrices to apply LoRA to
"q_proj",
"k_proj",
"v_proj",
"o_proj",
"gate_proj",
"up_proj",
"down_proj",
],
lora_dropout=0.05,
bias="none",
task_type=TaskType.CAUSAL_LM,
)
model = get_peft_model(model, lora_config)
# See the actual trainable parameter count
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
total_params = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable_params:,} / {total_params:,} = {trainable_params/total_params:.3%}")
# Output (approximate, r=16 over all seven projections): ~42M / ~8.07B = ~0.5%
# Only about 42M of the 8B parameters are being updated
training_args = TrainingArguments(
output_dir=OUTPUT_DIR,
num_train_epochs=3,
per_device_train_batch_size=4,
gradient_accumulation_steps=4,
learning_rate=2e-4, # LoRA uses higher LR than full fine-tuning
lr_scheduler_type="cosine",
warmup_ratio=0.03,
bf16=True,
logging_steps=10,
save_strategy="steps",
save_steps=500,
gradient_checkpointing=True,
report_to="wandb",
)
dataset = load_dataset("your_dataset_here", split="train")
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
    dataset_text_field="text",  # raw-text column; SFTTrainer tokenizes internally
    max_seq_length=2048,
    packing=False,
)
trainer.train()
# Save only the adapter weights (~150 MB vs 16 GB for full model)
model.save_pretrained(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)
QLoRA Fine-Tuning (Maximum Memory Efficiency)
# QLoRA: 4-bit base model + LoRA adapters
# Fits a 13B model on a single 24GB GPU
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import torch
MODEL_ID = "meta-llama/Meta-Llama-3-8B"
# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4", # NormalFloat4 - better than int4 for neural networks
bnb_4bit_compute_dtype=torch.bfloat16, # upcast to bf16 for matrix multiplications
bnb_4bit_use_double_quant=True, # quantize the quantization constants too
)
# Load model in 4-bit
model = AutoModelForCausalLM.from_pretrained(
MODEL_ID,
quantization_config=bnb_config,
device_map="auto",
)
# Critical: prepare for k-bit training before adding LoRA
# This adds gradient checkpointing and casts layernorms to float32
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
r=64, # can afford higher rank since base model is quantized
lora_alpha=16,
target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
lora_dropout=0.1,
bias="none",
task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# prints roughly: trainable params: ~55M || all params: ~8.08B || trainable%: ~0.67
# Memory breakdown for 8B model with QLoRA:
# - 4-bit weights: ~4 GB
# - LoRA adapter weights in bf16: ~0.3 GB
# - Optimizer states for LoRA only: ~0.6 GB
# - Activations: ~3-4 GB depending on batch size
# Total: ~8-10 GB - fits on a single RTX 3090
Merging LoRA Adapters (Zero Inference Overhead)
# After training, merge LoRA weights into base model for deployment
# Eliminates the overhead of loading both base model and adapter
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer
import torch
# Load the trained adapter
model = AutoPeftModelForCausalLM.from_pretrained(
"./llama3-8b-lora",
torch_dtype=torch.bfloat16,
device_map="auto",
)
# Merge LoRA weights into base model:
# W_0 + (alpha/r) * BA is computed once and baked into the weights
model = model.merge_and_unload()
# Save merged model (plus tokenizer) - this is your deployment artifact
model.save_pretrained("./llama3-8b-merged")
tokenizer = AutoTokenizer.from_pretrained("./llama3-8b-lora")
tokenizer.save_pretrained("./llama3-8b-merged")
# After merging:
# - No PEFT library needed at inference time
# - Identical inference speed to base model
# - Model size identical to full fine-tuned model
# - Cannot un-merge (keep the adapter + base model separately if you need to)
Deciding Between Methods - Decision Tree in Code
import math

def recommend_fine_tuning_method(
model_size_b: float, # model size in billions
gpu_count: int, # available A100 80GB equivalent GPUs
dataset_size_k: int, # training examples in thousands
objective: str, # "instruction", "domain_knowledge", "task_specific"
need_variants: int, # number of different fine-tuned variants needed
budget_usd: float, # total training budget
) -> dict:
"""
Returns recommended method with reasoning.
"""
    # Compute requirements for full fine-tuning (rough estimates)
    # Full fine-tuning requires ~14 bytes/param for model + gradients + optimizer
    full_ft_gpu_gb = model_size_b * 14
    full_ft_gpus_needed = max(1, math.ceil(full_ft_gpu_gb / 80))
    # LoRA requires ~2.5 bytes/param for the frozen bf16 model + small adapter
    lora_gpu_gb = model_size_b * 2.5
    lora_gpus_needed = max(1, math.ceil(lora_gpu_gb / 80))
    # QLoRA requires ~0.6 bytes/param for the quantized model + small adapter
    qlora_gpu_gb = model_size_b * 0.6
    qlora_gpus_needed = max(1, math.ceil(qlora_gpu_gb / 80))
    # Rough training-hours estimate (50K examples, 3 epochs, batch size 32);
    # illustrative only - the rules below do not use it
    base_hours = model_size_b * dataset_size_k / 50 * 3
recommendations = []
reasons = []
# Rule 1: Objective determines method more than anything else
if objective == "domain_knowledge":
recommendations.append("full_fine_tuning")
reasons.append(
"Encoding new factual knowledge requires updating weights broadly - "
"LoRA's low-rank constraint will limit how much new knowledge is absorbed."
)
elif objective == "instruction":
recommendations.append("lora")
reasons.append(
"Instruction following is a behavioral modification - genuinely low-rank. "
"LoRA matches or approaches full fine-tuning quality at 10-100x lower cost."
)
# Rule 2: Hardware constraint overrides everything
if gpu_count < full_ft_gpus_needed:
if "full_fine_tuning" in recommendations:
recommendations.remove("full_fine_tuning")
recommendations.append("qlora")
reasons.append(
f"Full fine-tuning needs {full_ft_gpus_needed} A100s but you have {gpu_count}. "
f"QLoRA needs only {qlora_gpus_needed} A100s."
)
# Rule 3: Multiple variants strongly favors PEFT
if need_variants > 3:
recommendations = ["lora"]
reasons.append(
f"{need_variants} variants x {model_size_b * 2:.0f} GB per model = "
f"{need_variants * model_size_b * 2:.0f} GB storage for full fine-tuning. "
f"LoRA: {model_size_b * 2:.0f} GB base + {need_variants * 0.15:.1f} GB adapters."
)
return {
"recommendation": recommendations[-1] if recommendations else "lora",
"reasons": reasons,
"full_ft_gpus_needed": full_ft_gpus_needed,
"lora_gpus_needed": lora_gpus_needed,
"qlora_gpus_needed": qlora_gpus_needed,
}
# Example: 13B model, 2 A100s, 200K examples, instruction tuning, 5 variants
result = recommend_fine_tuning_method(
model_size_b=13,
gpu_count=2,
dataset_size_k=200,
objective="instruction",
need_variants=5,
budget_usd=1000,
)
print(result["recommendation"]) # "lora"
print(result["reasons"])
Production Engineering Notes
Checkpoint Strategy
Full fine-tuning checkpoints every weight tensor. At 14 GB per checkpoint, save conservatively: every 10-20% of training, not every 500 steps. Use a checkpoint manager that keeps only the top-k by validation loss plus the most recent N checkpoints.
# Custom checkpoint callback to avoid disk exhaustion (a sketch; assumes
# evaluation runs before each save so eval_loss appears in log_history)
import os
import shutil
from transformers import TrainerCallback

class ConservativeCheckpointCallback(TrainerCallback):
    def __init__(self, keep_top_k: int = 3, keep_last_n: int = 2):
        self.keep_top_k = keep_top_k
        self.keep_last_n = keep_last_n
        self.checkpoints = []  # list of (step, val_loss, path)

    def on_save(self, args, state, control, **kwargs):
        path = os.path.join(args.output_dir, f"checkpoint-{state.global_step}")
        evals = [e for e in state.log_history if "eval_loss" in e]
        val_loss = evals[-1]["eval_loss"] if evals else float("inf")
        self.checkpoints.append((state.global_step, val_loss, path))
        # Sort by val_loss, keep top_k plus the last_n, delete the rest
        keep = {c[2] for c in sorted(self.checkpoints, key=lambda c: c[1])[: self.keep_top_k]}
        keep |= {c[2] for c in self.checkpoints[-self.keep_last_n :]}
        for ckpt in [c for c in self.checkpoints if c[2] not in keep]:
            if os.path.isdir(ckpt[2]):
                shutil.rmtree(ckpt[2])
            self.checkpoints.remove(ckpt)
Gradient Checkpointing Trade-offs
Gradient checkpointing reduces peak memory by recomputing activations during the backward pass instead of storing them. This trades memory for compute - typically 20-40% slower training but enables 2-3x larger batch sizes or smaller GPU requirements.
Always enable gradient checkpointing for full fine-tuning of 7B+ models. For PEFT, it is less critical but still recommended.
model.gradient_checkpointing_enable()
# or in TrainingArguments:
# gradient_checkpointing=True
Learning Rate for LoRA vs Full Fine-Tuning
- Full fine-tuning: use 1e-5 to 5e-5. A larger LR risks catastrophic forgetting.
- LoRA: use 1e-4 to 3e-4. The adapter starts at zero and needs a larger LR to learn quickly.
The LR scheduler matters more for full fine-tuning. Cosine decay with linear warmup (3% of steps) works well for both.
Catastrophic Forgetting in Full Fine-Tuning
Full fine-tuning can overwrite the model's general capabilities when you train on a narrow distribution. Signs of catastrophic forgetting:
- Model loses coherence on general prompts after fine-tuning
- Performance on held-out general benchmarks drops significantly
- Model starts producing repetitive or degenerate outputs
Mitigation strategies:
- Mix 5-10% of general pretraining data into your fine-tuning data
- Lower the learning rate
- Use fewer epochs (1-2 epochs often enough for instruction fine-tuning)
- Evaluate on general benchmarks (MMLU, ARC, HellaSwag) throughout training
PEFT methods are naturally resistant to catastrophic forgetting because the base weights are frozen. This is a major practical advantage for instruction fine-tuning.
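The first mitigation - mixing in general data - is a few lines with Hugging Face datasets. The dataset names below are placeholders:

from datasets import load_dataset, interleave_datasets

task_ds = load_dataset("your_task_dataset", split="train")       # placeholder
general_ds = load_dataset("your_general_corpus", split="train")  # e.g. a SlimPajama subset

# Draw ~5% of examples from general pretraining text to anchor capabilities
mixed = interleave_datasets(
    [task_ds, general_ds],
    probabilities=[0.95, 0.05],
    seed=42,
    stopping_strategy="first_exhausted",
)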
Adapter Rank Selection
Higher rank = more capacity = more trainable parameters = better on complex tasks but risk of overfitting.
Guidelines based on task complexity:
- r=4 to 8: Simple format or persona changes, small datasets (<10K examples)
- r=16 to 32: Standard instruction tuning, moderate datasets (10K-500K examples)
- r=64 to 128: Complex domain adaptation, large datasets (>500K examples)
Always monitor the condition number of the LoRA matrices during training. If the singular values collapse (rank collapse), you are wasting capacity. Regularization or a lower LR can help.
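A sketch of that check for a peft-trained model. Parameter names follow peft's lora_A/lora_B convention, and the 1e-3 cutoff is an arbitrary threshold to tune:

import torch

def lora_effective_ranks(model, threshold: float = 1e-3) -> dict:
    # Count singular values of each learned delta (B @ A) that carry signal
    params = dict(model.named_parameters())
    ranks = {}
    for name, a in params.items():
        if "lora_A" in name:
            b = params[name.replace("lora_A", "lora_B")]
            s = torch.linalg.svdvals(b.detach().float() @ a.detach().float())
            ranks[name] = int((s / s.max() > threshold).sum().item())
    return ranks

# Effective ranks far below r suggest wasted capacity (rank collapse);
# a flat spectrum right up to r suggests the rank may be too low.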
Common Mistakes
:::danger Applying LoRA for Continual Pre-Training
Training LoRA on raw domain text (next-token prediction objective) is significantly less effective than full fine-tuning for embedding domain knowledge. LoRA works for behavioral modifications. Using it for continual pre-training can give you a false sense of having done domain adaptation - the model will generate domain-appropriate text style while its factual knowledge remains largely unchanged. Always benchmark on domain-specific factual QA tasks, not just perplexity on held-out domain text.
:::
:::danger Using Full Fine-Tuning Without Mixing General Data
If you fine-tune on a narrow task distribution without mixing in general pretraining data, you will observe catastrophic forgetting. The model will become very good at your task and progressively worse at everything else. This matters even for task-specific models - degenerate outputs on out-of-distribution inputs can cause silent failures in production. Mix at least 5% general text (The Pile, SlimPajama, or similar) into your fine-tuning data.
:::
:::warning Setting LoRA Alpha Without Understanding the Scaling
The effective learning rate for LoRA adapters is determined by both the optimizer LR and the scaling factor alpha/r. A common mistake is setting lora_alpha = r (so scaling = 1.0) and then wondering why training is slow. The typical recommendation is lora_alpha = 2 * r for a scaling factor of 2.0, which compensates for the fact that LoRA adapters start at zero and need to learn quickly. If you change the rank without changing alpha, you change the effective learning rate.
:::
:::warning Saving the Full Model When You Only Need the Adapter
After LoRA training, calling merge_and_unload() and then save_pretrained() writes the full merged model (14+ GB). Calling save_pretrained() on the PeftModel directly saves only the adapter (~150 MB). Make sure you know which you want before calling save. If you are doing multi-tenant serving with hot-swapping, save only the adapter. If you are deploying a single model and want zero inference overhead, merge first, then save.
:::
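The two save paths side by side, assuming peft_model is your trained PeftModel:

# Adapter-only artifact (~150 MB): save the PeftModel directly
peft_model.save_pretrained("./my-run/adapter")

# Full merged artifact (14+ GB): merge first, then save
merged_model = peft_model.merge_and_unload()
merged_model.save_pretrained("./my-run/merged")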
:::warning Comparing Methods Without Controlling for Training FLOPs
Comparing "full fine-tuning vs LoRA" quality on a fixed dataset without controlling for the number of training steps is misleading. Full fine-tuning updates more parameters per step, which means more learning per forward-backward pass. If you run both for the same number of steps, full fine-tuning will often look better simply because more compute was used. For a fair comparison, control for wall-clock time or total FLOPs, not just epoch count.
:::
Interview Q&A
Q1: When would you choose full fine-tuning over LoRA for a production system?
Full fine-tuning wins in three clear scenarios. First, continual pre-training on domain text: if you need the model to absorb new factual knowledge (new medical guidelines, new legal regulations, proprietary technical documentation), LoRA's low-rank constraint limits how much new information can be encoded - the weight changes for embedding new facts are not well-approximated by low-rank matrices. Second, very large datasets for narrow tasks: when you have 10M+ high-quality examples for a specific task and the quality ceiling matters more than cost, full fine-tuning uses the model's full capacity. Third, when you are changing the tokenizer vocabulary or embedding layers - LoRA does not apply to these.
For everything else (instruction following, format adaptation, persona, style transfer), LoRA matches or approaches full fine-tuning quality at 10-100x lower cost.
Q2: Explain the math behind why LoRA works. What assumption does it make?
LoRA assumes that the weight updates during fine-tuning have low intrinsic rank. Instead of updating the full weight matrix $W_0 \in \mathbb{R}^{d \times k}$, LoRA decomposes the update as $\Delta W = BA$, where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ with $r \ll \min(d, k)$.
The forward pass becomes $h = W_0 x + \frac{\alpha}{r} BAx$. The key insight from Aghajanyan et al. (2020) is that the effective dimensionality of fine-tuning updates is much lower than the number of parameters - you can recover most of the fine-tuning signal with a rank-8 or rank-16 approximation of the full update. At inference time, you merge $W = W_0 + \frac{\alpha}{r} BA$, eliminating any overhead.
The assumption breaks down when the required weight changes are genuinely high-rank - which happens for large domain shifts requiring many independent facts to be encoded across many neurons.
Q3: What is catastrophic forgetting and how does it manifest differently in full fine-tuning vs LoRA?
Catastrophic forgetting is when fine-tuning on a narrow data distribution overwrites the model's existing capabilities. In neural networks, gradient updates during fine-tuning push all weights toward patterns that minimize loss on the training distribution, which can overwrite representations that were important for general capability.
In full fine-tuning, catastrophic forgetting can be severe - after fine-tuning on a narrow task for too long, models can lose coherence on general prompts, produce repetitive outputs, and score significantly lower on general benchmarks like MMLU.
In LoRA, catastrophic forgetting is largely prevented because base weights are frozen. The model's original capabilities are preserved exactly. The adapter learns the new behavior additively. This is a major practical advantage, especially during early experiments when you are still dialing in the training setup.
Q4: You have a 70B model and one A100 80GB. What are your options?
With a single 80GB A100:
- Full fine-tuning: completely impossible even with ZeRO-3 (need 12+ A100s minimum)
- Standard LoRA with bf16 base: impossible (70B model alone is 140 GB in bf16)
- LoRA with 8-bit base (bitsandbytes): 70 GB base + adapter, tight fit, might work with very small batch
- QLoRA (4-bit base + LoRA): 70B * 0.5 GB/B = 35 GB base + adapter, fits comfortably
QLoRA is the only practical option for fine-tuning 70B on a single GPU. Tim Dettmers showed this is possible with NF4 quantization. Quality is somewhat below bf16 full fine-tuning but dramatically better than not fine-tuning at all.
Q5: How do you decide on LoRA rank and which modules to target?
Rank selection depends on task complexity and dataset size. Start with r=16 for standard instruction tuning. Increase to r=64 if you have a large complex dataset and see underfitting (training loss plateaus high). Decrease to r=4 if you have a small simple dataset and see overfitting.
For target modules, the research consensus is that targeting all linear layers (q, k, v, o projections in attention + gate, up, down projections in MLP) gives better results than only q and v. Hu et al.'s original paper targeted only q and v in GPT-2, but for modern Llama-style models, targeting all linear layers with a modest rank (16-32) beats targeting fewer layers with a higher rank.
Validate your rank choice by looking at the singular value spectrum of your trained LoRA matrices. If most singular values are near zero, you have higher rank than needed. If the distribution is flat, you may need higher rank.
Q6: What are the three hidden costs of full fine-tuning that teams typically underestimate?
Storage cost: full fine-tuning checkpoints are 14 GB for a 7B model. With 20 checkpoints over a training run, that is 280 GB per experiment. At $0.023/GB/month on S3, this adds up over multiple experiments.
Inference inflexibility: full fine-tuned models cannot be hot-swapped. Multi-tenant systems where each customer needs a customized model require either separate inference servers per customer (expensive) or complex model multiplexing. LoRA adapters can be swapped between requests on a single loaded base model.
Experimentation velocity: a full fine-tuning run for a 7B model takes 8-12 hours on 8 A100s. A QLoRA run takes 45-90 minutes on 1 A100. If you need 15 experiments to find the right hyperparameters and data mix (which is realistic), full fine-tuning means 5-7 days of iteration time. QLoRA means 1-2 days.
