
Fine-Tuning Strategies

Reshaping the weights: how to teach a foundation model new tricks without breaking the ones it already knows.

Reading time: ~30 min | Interview relevance: Critical | Roles: MLE, AI Eng, Research Eng, Data Scientist

The Real Interview Moment

You are forty minutes into an ML engineer interview at a Series B startup building an AI-powered contract analysis platform. The interviewer, the head of ML, slides her laptop toward you:

"Our base Llama 3 70B model extracts contract clauses reasonably well with few-shot prompting, but the output format is inconsistent and it misses domain-specific terms like 'indemnification carve-out' and 'material adverse change.' We have 8,000 labeled examples of contract clause extraction. We tried full fine-tuning on 4 A100s but the model started hallucinating boilerplate language that was not in the source contract. How would you approach this?"

She is not asking you to recite LoRA's paper. She wants you to diagnose the failure mode (catastrophic forgetting plus overfitting), propose a parameter-efficient strategy that fits their compute budget, discuss data quality, and articulate a production deployment plan. This chapter gives you every piece you need.

Why Fine-Tuning Matters

LLMs arrive pretrained on trillions of tokens, encoding broad world knowledge and language understanding. But they do not know your task. There are three ways to bridge this gap:

| Approach | What It Changes | Best For | Limitations |
| --- | --- | --- | --- |
| Prompt engineering | The input, not the model | Quick iteration, general tasks | Context window limits, fragile, costly per-query |
| RAG | The context available at inference | Knowledge-intensive tasks, frequently updated data | Adds retrieval latency, does not change model behavior |
| Fine-tuning | The model weights | Teaching new behavior, formats, domain reasoning | Requires training data and compute, risk of forgetting |

60-Second Answer

"Fine-tuning modifies the model's weights to internalize a new behavior, domain vocabulary, or output format. Unlike prompting, the knowledge is baked into the parameters, so it does not consume context window tokens at inference. Unlike RAG, it changes how the model reasons, not just what information it can see. The tradeoff is that you need labeled data and compute, and you risk catastrophic forgetting if not careful."

The Decision Hierarchy

Always start simple and escalate only when necessary:

[Figure: Fine-Tuning Decision Tree]

The Fine-Tuning Spectrum

Not all fine-tuning is created equal. The methods exist on a spectrum from modifying every parameter to touching none of the original weights:

| Method | Parameters Trained | Memory Needed | Typical Use Case |
| --- | --- | --- | --- |
| Full fine-tuning | All (billions) | 4-16x model size | Large data + large compute budgets |
| LoRA | 0.1-1% of params | ~1.2-1.5x model size | Production standard for most teams |
| QLoRA | 0.1-1% of params | ~0.3-0.5x model size | When GPU memory is the constraint |
| Prefix tuning | Virtual token embeddings | ~1.01x model size | Multi-tenant serving |
| Prompt tuning | Soft prompt vectors | ~1.001x model size | Lightweight task adaptation |
| Adapters | Bottleneck layers | ~1.05-1.1x model size | Modular, composable skills |

[Figure: Fine-Tuning Spectrum]

Full Fine-Tuning

Full fine-tuning updates every parameter in the model. For a 70B parameter model with FP16 precision, this requires:

  • Model weights: $70 \times 10^9 \times 2$ bytes $= 140$ GB
  • Gradients: Another 140 GB
  • Optimizer states (Adam): $2 \times 140$ GB $= 280$ GB (momentum + variance)
  • Activations: Variable, often 50-200+ GB depending on batch size and sequence length
  • Total: ~600-800 GB across GPUs
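The arithmetic above can be reproduced with a few lines of Python. This is a rough sketch only: the function name `full_finetune_memory_gb` is made up for illustration, it assumes weights, gradients, and both Adam moments are all stored at the same 2 bytes per parameter (as the breakdown above does), and the activation range is simply the 50-200 GB figure quoted above.

```python
def full_finetune_memory_gb(n_params, bytes_per_param=2, activations_gb=(50, 200)):
    """Rough memory estimate for full fine-tuning with Adam.

    Assumption: weights, gradients, and both Adam moments all use
    `bytes_per_param` bytes each, mirroring the breakdown in the text.
    """
    gb = 1e9  # decimal gigabytes, as used in the text
    weights = n_params * bytes_per_param / gb
    gradients = weights
    optimizer = 2 * weights  # first moment + second moment
    fixed = weights + gradients + optimizer
    low, high = activations_gb
    return {
        "weights": weights,
        "gradients": gradients,
        "optimizer": optimizer,
        "total_low": fixed + low,
        "total_high": fixed + high,
    }


estimate = full_finetune_memory_gb(70e9)
for name, value in estimate.items():
    print(f"{name:>10}: {value:7.1f} GB")
```

Mixed-precision setups that keep FP32 optimizer states or master weights would need more than this estimate, so treat the result as a lower bound for planning.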

When Full Fine-Tuning Is Appropriate

  1. You have massive, high-quality data (100K+ examples) and the budget for multi-GPU training
  2. You need deep behavioral changes - not just format, but fundamentally different reasoning patterns
  3. You are building a foundation model for a specific domain (e.g., Bloomberg's BloombergGPT for finance)
  4. The base model is small (under 3B parameters), where full fine-tuning is practical on a single GPU

Catastrophic Forgetting

The central risk of full fine-tuning: the model forgets its pretrained capabilities as it overfits to the fine-tuning distribution.

Symptoms:

  • Model becomes excellent at the fine-tuned task but loses general language ability
  • Generates repetitive or degenerate text outside the fine-tuning domain
  • Fails at tasks it previously handled well (e.g., basic math, common knowledge)

Mitigations:

  • Low learning rate: Use $10^{-5}$ to $5 \times 10^{-5}$ (10-100x smaller than pretraining)
  • Short training: 1-3 epochs is usually sufficient; more epochs increases forgetting
  • Data mixing: Include a percentage of general-purpose data alongside task-specific data
  • Regularization: Weight decay, dropout, or elastic weight consolidation (EWC)
  • Evaluation on held-out general benchmarks: Track MMLU or similar alongside your task metric

Common Trap

Many candidates suggest "just train for more epochs" when fine-tuning performance is poor. Beyond 3-5 epochs on a typical dataset, you are almost certainly overfitting and inducing catastrophic forgetting. The fix is usually better data quality, not more training.

Learning Rate Selection

The learning rate is the single most important hyperparameter for fine-tuning:

| Model Size | Recommended LR Range | Notes |
| --- | --- | --- |
| < 1B | $1 \times 10^{-5}$ to $5 \times 10^{-5}$ | Can tolerate slightly higher LR |
| 1-10B | $5 \times 10^{-6}$ to $2 \times 10^{-5}$ | Standard range |
| 10-70B | $1 \times 10^{-6}$ to $1 \times 10^{-5}$ | Lower is safer |
| > 70B | $5 \times 10^{-7}$ to $5 \times 10^{-6}$ | Very conservative |

Use a cosine learning rate schedule with a linear warmup of 3-10% of total steps. The warmup prevents early gradient spikes that can destabilize the pretrained weights.
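To make the schedule concrete, here is a minimal sketch of linear warmup followed by cosine decay. The function name `lr_at_step` and its parameters are illustrative assumptions, not any framework's API, and library schedulers may differ in details such as whether warmup starts from exactly zero.

```python
import math


def lr_at_step(step, total_steps, peak_lr, warmup_ratio=0.05, min_lr=0.0):
    """Learning rate at `step` for linear warmup followed by cosine decay."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Linear ramp from a small value up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay from peak_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine


total = 1000
for step in (0, 25, 49, 50, 500, 999):
    print(step, f"{lr_at_step(step, total, peak_lr=1e-5):.2e}")
```

The ramp keeps the first updates small while the optimizer statistics settle, which is the stabilizing effect described above.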

Parameter-Efficient Fine-Tuning (PEFT)

PEFT methods train a small number of new or modified parameters while keeping the original model weights frozen. This dramatically reduces memory requirements, training time, and the risk of catastrophic forgetting.

LoRA: Low-Rank Adaptation

LoRA is the dominant PEFT method in production. It was introduced by Hu et al. (2021) and is based on a simple but powerful observation: the weight updates during fine-tuning have low intrinsic rank.

The Core Idea

Instead of updating a weight matrix $W_0 \in \mathbb{R}^{d \times k}$ directly, LoRA freezes $W_0$ and adds a low-rank decomposition:

$$W = W_0 + \Delta W = W_0 + BA$$

where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$, with rank $r \ll \min(d, k)$.

For a typical transformer layer where $d = k = 4096$ and $r = 16$:

  • Original $W_0$: $4096 \times 4096 = 16.7\text{M}$ parameters (frozen)
  • LoRA $B$: $4096 \times 16 = 65\text{K}$ parameters (trainable)
  • LoRA $A$: $16 \times 4096 = 65\text{K}$ parameters (trainable)
  • Total trainable: 131K vs. 16.7M = 0.78% of original
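The parameter count above generalizes to any rank. As a quick sketch (the helper name `lora_param_count` is hypothetical), the trainable parameters for one adapted $d \times k$ matrix are $dr + rk$:

```python
def lora_param_count(d, k, r):
    """Trainable parameters for one adapted d x k matrix: B is d x r, A is r x k."""
    return d * r + r * k


d = k = 4096
full = d * k
for r in (4, 8, 16, 32, 64, 128, 256):
    n = lora_param_count(d, k, r)
    print(f"r={r:>3}: {n:>9,} trainable ({100 * n / full:.2f}% of {full:,})")
```

Because the count grows linearly in $r$, doubling the rank doubles the adapter size while the frozen matrix stays untouched.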

[Figure: LoRA Architecture]

The forward pass becomes:

$$h = W_0 x + \frac{\alpha}{r} B A x$$

where $\alpha$ is a scaling factor (commonly set to $r$ so that $\alpha/r = 1$, or to $2r$; it can also be tuned independently).

Initialization

  • $A$ is initialized with a random Gaussian distribution
  • $B$ is initialized to zero
  • This means $\Delta W = BA = 0$ at the start of training, so the model begins from the pretrained weights
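The initialization scheme can be checked on a toy example. The sketch below uses plain Python lists rather than a tensor library; the helper names (`matvec`, `lora_forward`, `merge`), the tiny dimensions, and the Gaussian standard deviation are all illustrative assumptions. It shows that a zero-initialized $B$ leaves the layer output unchanged, and that folding the scaled update into the base matrix gives the same output as the two-branch forward pass.

```python
import random


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def lora_forward(W0, A, B, x, alpha, r):
    """h = W0 x + (alpha / r) * B (A x), with W0 frozen."""
    base = matvec(W0, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


def merge(W0, A, B, alpha, r):
    """Fold the scaled low-rank update into a single d x k matrix."""
    scale = alpha / r
    return [
        [W0[i][j] + scale * sum(B[i][t] * A[t][j] for t in range(r))
         for j in range(len(W0[0]))]
        for i in range(len(W0))
    ]


random.seed(0)
d, k, r, alpha = 6, 5, 2, 4
W0 = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d)]
A = [[random.gauss(0, 0.02) for _ in range(k)] for _ in range(r)]  # Gaussian init
B = [[0.0] * r for _ in range(d)]                                   # zero init
x = [random.gauss(0, 1) for _ in range(k)]

# At initialization BA = 0, so the adapted layer reproduces the frozen layer.
print(lora_forward(W0, A, B, x, alpha, r) == matvec(W0, x))

# Once B moves away from zero (faked here), merging gives the same output.
B[0][0] = 0.5
merged_out = matvec(merge(W0, A, B, alpha, r), x)
adapted_out = lora_forward(W0, A, B, x, alpha, r)
print(max(abs(a - b) for a, b in zip(merged_out, adapted_out)) < 1e-9)
```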

Rank Selection

The rank $r$ controls the capacity of the adaptation:

| Rank $r$ | Trainable Params (per adapted $4096 \times 4096$ matrix) | Use Case |
| --- | --- | --- |
| 4-8 | ~33-65K | Simple format changes, style transfer |
| 16-32 | ~131-262K | Most tasks (instruction tuning, classification) |
| 64-128 | ~524K-1M | Complex domain adaptation |
| 256+ | ~2M+ | Approaching full fine-tuning territory |

Rule of thumb: Start with $r = 16$. If the model underfits, increase to 32 or 64. If it overfits, decrease to 8. Performance usually plateaus well before $r = 64$ for most tasks.

Which Layers to Target

Not all layers benefit equally from LoRA:

| Target | Common Choice | Reasoning |
| --- | --- | --- |
| Query + Value projections | Default in most frameworks | $W_q$ and $W_v$ capture attention patterns; most efficient |
| All attention projections | $W_q$, $W_k$, $W_v$, $W_o$ | Better for complex tasks |
| Attention + MLP | All linear layers | Highest quality but more parameters |
| Only MLP | Rare | Sometimes useful for knowledge injection |

Research finding (Hu et al., 2021): Under a fixed trainable-parameter budget, adapting $W_q$ and $W_v$ together outperforms adapting either alone and performs on par with adapting all four attention matrices at a lower rank.

60-Second Answer

"LoRA freezes the pretrained weights and injects trainable low-rank matrices alongside each target layer. Instead of updating a $d \times k$ weight matrix, we learn two smaller matrices $B$ ($d \times r$) and $A$ ($r \times k$) where $r$ is typically 8-64. This reduces trainable parameters to under 1% of the original model while achieving comparable performance to full fine-tuning on most tasks. At inference, you can merge $BA$ into $W_0$ for zero additional latency."

LoRA in Code

import torch
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM

# Load base model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Configure LoRA
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                        # rank
    lora_alpha=32,               # scaling factor (alpha / r = 2)
    lora_dropout=0.05,           # dropout on LoRA layers
    target_modules=[             # which layers to adapt
        "q_proj", "v_proj",      # attention projections
        "k_proj", "o_proj",      # optional: all attention
        "gate_proj", "up_proj",  # optional: MLP layers
        "down_proj",
    ],
    bias="none",                 # don't train biases
)

# Apply LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# With only the four attention projections at r=16 this reports about 13.6M
# trainable parameters (~0.17% of ~8B); including the three MLP projections
# listed above raises that to roughly 42M (~0.5%).

QLoRA: Quantized LoRA

QLoRA (Dettmers et al., 2023) makes fine-tuning accessible on consumer GPUs by combining 4-bit quantization of the base model with LoRA adapters.

Three Key Innovations

  1. 4-bit NormalFloat (NF4) quantization: A new data type optimized for normally distributed weights. The quantization levels are placed at the quantiles of a standard normal distribution, minimizing quantization error for neural network weights.

  2. Double quantization: The quantization constants themselves are quantized. Each 64-parameter block has a 32-bit quantization constant, an overhead of 0.5 bits/param. Quantizing those constants to 8 bits in groups of 256 reduces the overhead to about 0.127 bits/param, a saving of roughly 0.37 bits/param.

  3. Paged optimizers: Uses NVIDIA unified memory to automatically page optimizer states between GPU and CPU memory, preventing out-of-memory crashes during gradient checkpointing.
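The double-quantization overhead in item 2 can be worked out directly. This sketch assumes the block sizes reported in the QLoRA paper (64 weights per first-level block, 256 first-level constants per second-level block, 8-bit second-level storage); the function names are made up for illustration.

```python
def constant_overhead_bits(block_size=64, const_bits=32):
    """Bits per parameter spent on one 32-bit constant per block."""
    return const_bits / block_size


def double_quant_overhead_bits(block_size=64, second_block_size=256,
                               first_level_bits=8, second_level_bits=32):
    """Bits per parameter after quantizing the constants themselves.

    Each block keeps an 8-bit constant, and every `second_block_size`
    of those constants share one 32-bit second-level constant.
    """
    first = first_level_bits / block_size
    second = second_level_bits / (block_size * second_block_size)
    return first + second


before = constant_overhead_bits()
after = double_quant_overhead_bits()
saving = before - after
print(f"before: {before:.3f} bits/param")
print(f"after:  {after:.3f} bits/param")
print(f"saving: {saving:.3f} bits/param")

# Translate the per-parameter saving into gigabytes for a 70B model.
print(f"70B model: ~{saving * 70e9 / 8 / 1e9:.1f} GB saved")
```

The saving looks small per parameter, but at tens of billions of parameters it adds up to a few gigabytes, which matters when the goal is fitting on a single GPU.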

Memory Comparison

For a 70B parameter model:

| Configuration | Memory Required | Hardware |
| --- | --- | --- |
| Full fine-tuning (FP16) | ~600-800 GB | 8x A100 80GB |
| LoRA (FP16 base) | ~140 GB for weights, plus activations | 2x A100 80GB (minimum) |
| QLoRA (4-bit base) | ~36-48 GB | 1x A100 80GB or 2x A6000 |
| QLoRA (4-bit base, small batch) | ~24 GB | 1x RTX 4090 (with CPU offloading) |

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# QLoRA quantization config
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",         # NormalFloat4
    bnb_4bit_use_double_quant=True,    # double quantization
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Load model in 4-bit
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B",
    quantization_config=quantization_config,
    device_map="auto",
)

# Prepare for k-bit training (cast layernorm to FP32, etc.)
model = prepare_model_for_kbit_training(model)

# Apply LoRA on top of quantized model
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

Common Trap

QLoRA can introduce a small quality gap compared to full-precision LoRA (typically 1-3% on benchmarks). A common workflow is to iterate with QLoRA for development speed, then do a final run with FP16 LoRA if quality is critical. Also note that QLoRA training is slower per step than FP16 LoRA because of quantization/dequantization overhead: you are trading compute time for memory.

Prefix Tuning

Prefix tuning (Li & Liang, 2021) prepends trainable "virtual tokens" to the key and value matrices of every attention layer.

How It Works

Instead of modifying any model weights, you learn a set of continuous prefix vectors $P_k, P_v \in \mathbb{R}^{l \times d}$ where $l$ is the prefix length (typically 10-100) and $d$ is the hidden dimension:

$$\text{head}_i = \text{Attention}(xW_q, [P_k; xW_k], [P_v; xW_v])$$

The model attends to both the real input tokens and the learned virtual prefix tokens. The prefix acts as a continuous task-specific "instruction."

Trainable parameters: $l \times d \times 2 \times L$ (prefix length $\times$ hidden dim $\times$ key+value $\times$ num layers). For a 7B model with $l=20$, $d=4096$, $L=32$: about 5.2M parameters (~0.07%).

Strengths:

  • Extremely parameter-efficient
  • Multiple prefixes can serve different tasks with the same base model
  • No modification to original weights

Weaknesses:

  • Consumes prefix length tokens from the context window
  • Less expressive than LoRA for complex adaptations
  • Optimization can be unstable; often requires reparameterization through an MLP during training

Prompt Tuning

Prompt tuning (Lester et al., 2021) is even simpler: learn a set of continuous embedding vectors prepended to the input at the embedding layer only (not at every attention layer like prefix tuning).

$$\text{input} = [e_1, e_2, \ldots, e_l, x_1, x_2, \ldots, x_n]$$

where $e_1, \ldots, e_l$ are learned continuous vectors and $x_1, \ldots, x_n$ are the embedded input tokens.

Trainable parameters: $l \times d$ (just the soft prompt embeddings). For $l=20$, $d=4096$: about 82K parameters.
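The gap between prompt tuning and prefix tuning is easy to see by computing both counts side by side. The two helper names below are illustrative, and the 7B figure used for the percentage is an approximation.

```python
def prefix_tuning_params(prefix_len, hidden_dim, num_layers):
    """Key and value prefixes at every layer: l * d * 2 * L."""
    return prefix_len * hidden_dim * 2 * num_layers


def prompt_tuning_params(prompt_len, hidden_dim):
    """Soft prompt at the embedding layer only: l * d."""
    return prompt_len * hidden_dim


prefix = prefix_tuning_params(prefix_len=20, hidden_dim=4096, num_layers=32)
prompt = prompt_tuning_params(prompt_len=20, hidden_dim=4096)

print(f"prefix tuning: {prefix:,} params ({100 * prefix / 7e9:.3f}% of ~7B)")
print(f"prompt tuning: {prompt:,} params")
print(f"ratio: {prefix // prompt}x")
```

With the same length, prefix tuning trains $2L$ times as many parameters as prompt tuning, because it learns separate key and value vectors at each of the $L$ layers.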

Key finding (Lester et al., 2021): Prompt tuning approaches the performance of full fine-tuning as model size increases. For models above 10B parameters, the gap is often negligible for simple classification tasks.

Company Variation

Google originally developed prompt tuning for multi-task serving at scale. When you have hundreds of tasks sharing the same base model, prompt tuning lets you store one model copy with hundreds of tiny task-specific prompt vectors, swapping them at inference. This is the main production use case.

Adapters

Adapter modules (Houlsby et al., 2019) insert small bottleneck layers between the existing transformer layers.

Architecture

Each adapter consists of:

  1. A down-projection: $W_{\text{down}} \in \mathbb{R}^{m \times d}$ (reduces dimension from $d$ to $m$)
  2. A nonlinearity: ReLU or GELU
  3. An up-projection: $W_{\text{up}} \in \mathbb{R}^{d \times m}$ (projects back to $d$)
  4. A residual connection

$$\text{Adapter}(x) = x + W_{\text{up}} \cdot \sigma(W_{\text{down}} \cdot x)$$

where $m \ll d$ is the bottleneck dimension (typically $m = d/64$ to $d/8$).
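A toy version of the bottleneck adapter makes the shapes explicit. This is a sketch in plain Python lists, storing the down-projection as an $m \times d$ matrix so that it maps $d$ dimensions to $m$. The zero-initialized up-projection is an assumption chosen so the adapter starts as an identity function; real implementations typically use a near-identity initialization instead.

```python
import random


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def relu(v):
    return [max(0.0, t) for t in v]


def adapter_forward(x, W_down, W_up):
    """Adapter(x) = x + W_up * relu(W_down * x)."""
    hidden = relu(matvec(W_down, x))              # d -> m (bottleneck)
    up = matvec(W_up, hidden)                     # m -> d
    return [xi + ui for xi, ui in zip(x, up)]     # residual connection


random.seed(0)
d, m = 8, 2
W_down = [[random.gauss(0, 0.1) for _ in range(d)] for _ in range(m)]  # m x d
W_up = [[0.0] * m for _ in range(d)]                                   # d x m, zero init (assumption)
x = [random.gauss(0, 1) for _ in range(d)]

print(adapter_forward(x, W_down, W_up) == x)   # identity at initialization
print("trainable params:", 2 * d * m)          # down + up projections, biases omitted
```

Unlike the LoRA update, the nonlinearity between the two projections means this computation cannot be folded into an existing weight matrix, which is why it stays as extra work at inference.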

Serial adapters (Houlsby): placed after both the attention and FFN sublayers. Parallel adapters (He et al., 2022): placed alongside attention/FFN, which can reduce sequential latency.

Tradeoff vs. LoRA: Adapters add inference latency because they introduce new sequential computation. LoRA adapters can be merged into the base weights ($W = W_0 + BA$) for zero-latency inference. This is why LoRA has largely replaced adapters in practice.

PEFT Method Comparison

| Method | Params Trained | Inference Overhead | Merges into Base? | Multi-Task Serving | Quality (vs Full FT) |
| --- | --- | --- | --- | --- | --- |
| LoRA | 0.1-1% | None (after merge) | Yes | Swap adapters | 95-100% |
| QLoRA | 0.1-1% | Quantization overhead | Yes (after dequant) | Swap adapters | 92-98% |
| Prefix Tuning | 0.05-0.1% | Prefix tokens use context | No | Swap prefixes | 85-95% |
| Prompt Tuning | ~0.001% | Soft prompt tokens | No | Swap prompts | 80-95% (scale dependent) |
| Adapters | 1-5% | Extra forward passes | No | Swap adapters | 93-99% |

Instant Rejection

Saying "LoRA and adapters are the same thing" will immediately signal lack of depth. LoRA modifies existing weight matrices via low-rank additive updates and can be merged for zero inference overhead. Adapters insert entirely new bottleneck layers that add sequential computation. They are architecturally different methods with different tradeoffs.

Instruction Tuning and Alignment Fine-Tuning

Fine-tuning is not just about domain adaptation. Two of the most important applications reshape the model's fundamental behavior:

Instruction Tuning

Trains the model to follow instructions by fine-tuning on (instruction, response) pairs.

Key datasets:

  • FLAN (Google): 1.8K tasks converted to instruction format
  • Alpaca (Stanford): 52K instruction-response pairs generated with OpenAI's text-davinci-003
  • Open Assistant: Human-generated multi-turn conversations
  • ShareGPT: Real user conversations with ChatGPT

Why it matters: A base model predicts the next token. It does not inherently know that when you ask a question, you want an answer (not a continuation of the question). Instruction tuning bridges this gap.

Alignment Fine-Tuning

After instruction tuning, alignment further refines the model to be helpful, harmless, and honest. This typically involves RLHF or DPO (covered in the next chapter) but the supervised fine-tuning (SFT) stage is part of the fine-tuning pipeline:

[Figure: Alignment Pipeline]

Data Preparation

Data quality determines fine-tuning quality. A model trained on 1,000 high-quality examples will often outperform one trained on 50,000 noisy examples.

Data Formatting

Different frameworks expect different formats:

Alpaca format (single-turn):

{
  "instruction": "Summarize the following contract clause.",
  "input": "The Seller hereby represents and warrants...",
  "output": "This clause establishes the seller's representations..."
}

ShareGPT format (multi-turn):

{
  "conversations": [
    {"from": "human", "value": "What is a force majeure clause?"},
    {"from": "gpt", "value": "A force majeure clause excuses..."},
    {"from": "human", "value": "Can you give an example?"},
    {"from": "gpt", "value": "A common example..."}
  ]
}

ChatML format (OpenAI-style):

<|im_start|>system
You are a legal contract analyst.<|im_end|>
<|im_start|>user
Summarize this clause: ...<|im_end|>
<|im_start|>assistant
This clause establishes...<|im_end|>
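Converting between these formats is mostly string assembly. As an illustration, the sketch below turns one Alpaca-style record into a ChatML string; the function name `alpaca_to_chatml` and the choice to join instruction and input with a blank line are assumptions, not a standard.

```python
def alpaca_to_chatml(record, system_prompt=None):
    """Render one Alpaca-style record as a ChatML-formatted training string."""
    user_content = record["instruction"]
    if record.get("input"):
        # Assumption: instruction and input are separated by a blank line.
        user_content += "\n\n" + record["input"]

    parts = []
    if system_prompt:
        parts.append(f"<|im_start|>system\n{system_prompt}<|im_end|>")
    parts.append(f"<|im_start|>user\n{user_content}<|im_end|>")
    parts.append(f"<|im_start|>assistant\n{record['output']}<|im_end|>")
    return "\n".join(parts)


example = {
    "instruction": "Summarize the following contract clause.",
    "input": "The Seller hereby represents and warrants...",
    "output": "This clause establishes the seller's representations...",
}
print(alpaca_to_chatml(example, system_prompt="You are a legal contract analyst."))
```

In practice it is usually safer to rely on the tokenizer's own chat template for the target model, since the special tokens must match what the model was trained with.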

Data Quality Checklist

| Check | Why | How |
| --- | --- | --- |
| Deduplication | Duplicates cause memorization, not generalization | Exact match + near-duplicate detection (MinHash, SimHash) |
| Format consistency | Inconsistent formats confuse the model | Validate all examples against a schema |
| Response quality | Garbage in, garbage out | Human review of a random sample (at least 5-10%) |
| Length distribution | Very long or very short examples skew the model | Plot histogram, trim outliers |
| Label accuracy | Wrong labels teach wrong behavior | Inter-annotator agreement check (Cohen's $\kappa > 0.8$) |
| Diversity | Repetitive data causes mode collapse | Cluster embeddings, ensure coverage |
| Toxicity/PII screening | Legal and safety risk | Run through content filters before training |
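The deduplication check can be sketched in a few lines. Note the substitution: instead of MinHash or SimHash, this example compares exact Jaccard similarity over word trigrams, which is quadratic in the dataset size and only suitable for small sets; MinHash approximates the same similarity at scale. The function names and the 0.7 threshold are illustrative assumptions.

```python
import hashlib
import re


def normalize(text):
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def shingles(text, n=3):
    """Set of word n-grams from the normalized text."""
    words = normalize(text).split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(examples, threshold=0.7):
    """Keep the first occurrence; drop exact and near duplicates."""
    seen_hashes, kept, kept_shingles = set(), [], []
    for text in examples:
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate after normalization
        sh = shingles(text)
        if any(jaccard(sh, other) >= threshold for other in kept_shingles):
            continue  # near duplicate of something already kept
        seen_hashes.add(digest)
        kept.append(text)
        kept_shingles.append(sh)
    return kept


data = [
    "The Seller shall indemnify the Buyer against all losses.",
    "the seller  shall indemnify the buyer against all losses",
    "The Seller shall indemnify the Buyer against all losses and claims.",
    "Either party may terminate this agreement with thirty days notice.",
]
for kept_text in deduplicate(data):
    print(kept_text)
```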

How Much Data Do You Need?

| Task | Typical Data Size | Notes |
| --- | --- | --- |
| Format/style transfer | 100-500 examples | The model already knows the content; you are teaching format |
| Classification | 500-5,000 examples | Per class; balance matters |
| Domain adaptation | 1,000-10,000 examples | Quality over quantity |
| Instruction following | 5,000-50,000 examples | Diverse instructions matter more than volume |
| Full behavior change | 50,000-500,000+ examples | Approaching pretraining-scale fine-tuning |

60-Second Answer

"For most PEFT fine-tuning, I target 1,000-5,000 high-quality examples. I spend more time on data curation than on hyperparameter tuning - deduplication, format validation, human review of a random sample, and ensuring diversity across the input distribution. A common mistake is collecting 50,000 noisy examples when 2,000 clean ones would produce a better model."

Hyperparameter Selection

The Essential Hyperparameters

| Hyperparameter | Typical Range | Impact | Notes |
| --- | --- | --- | --- |
| Learning rate | $1 \times 10^{-5}$ to $2 \times 10^{-4}$ (PEFT) | Critical | Too high = forgetting; too low = no learning |
| Epochs | 1-5 | High | More epochs = more forgetting risk |
| Batch size | 4-128 (effective, with gradient accumulation) | Moderate | Larger = more stable gradients, but diminishing returns |
| Warmup ratio | 0.03-0.1 | Moderate | Prevents early gradient spikes |
| Weight decay | 0.01-0.1 | Low-Moderate | Regularization against overfitting |
| LoRA rank ($r$) | 8-64 | Moderate | Capacity of adaptation |
| LoRA alpha ($\alpha$) | $2r$ (common default) | Low | Scaling factor, usually set to $2 \times r$ |
| Max sequence length | 512-8192 | High (memory) | Longer = more memory, must match data |

Training Configuration Example

from transformers import TrainingArguments
from trl import SFTTrainer

# Assumes `model`, `train_dataset`, and `eval_dataset` are already defined.
training_args = TrainingArguments(
    output_dir="./output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,    # effective batch size = 32 on one GPU
    learning_rate=2e-4,               # higher LR for PEFT
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    weight_decay=0.01,
    bf16=True,                        # bfloat16 mixed precision
    logging_steps=10,
    eval_strategy="steps",
    eval_steps=100,
    save_strategy="steps",
    save_steps=100,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    gradient_checkpointing=True,      # saves memory
    report_to="wandb",
)

# Note: recent TRL releases expect the sequence-length and packing options on
# trl.SFTConfig rather than on SFTTrainer; check your installed version.
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    max_seq_length=2048,
    packing=True,                     # pack short examples together
)
trainer.train()

Evaluation

Training Metrics

Training loss vs. evaluation loss is the most important diagnostic:

| Pattern | Diagnosis | Action |
| --- | --- | --- |
| Both decreasing together | Healthy training | Continue |
| Train loss decreasing, eval loss flat or increasing | Overfitting | Stop training, reduce LR, add regularization, get more data |
| Both flat from the start | Learning rate too low or data too noisy | Increase LR, check data quality |
| Both oscillating | Learning rate too high or batch too small | Decrease LR, increase batch size |
| Train loss spikes then recovers | Normal warmup, or bad data batch | Check if spike is during warmup (ok) or later (investigate) |
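The "stop training" action for the overfitting pattern is usually implemented as early stopping on eval loss. The sketch below is a hand-rolled illustration, not a library API: the function name `best_checkpoint`, the `patience` default, and the example loss values are all assumptions.

```python
def best_checkpoint(eval_losses, patience=2, min_delta=0.0):
    """Return (best_index, stopped_at) for a sequence of eval losses.

    Training stops once `patience` consecutive evaluations fail to
    improve on the best loss by more than `min_delta`.
    """
    best_idx, best = 0, float("inf")
    for i, loss in enumerate(eval_losses):
        if loss < best - min_delta:
            best_idx, best = i, loss
        elif i - best_idx >= patience:
            return best_idx, i  # stop here, keep the earlier checkpoint
    return best_idx, len(eval_losses) - 1


# Eval loss bottoms out at the third evaluation and then rises (overfitting).
losses = [1.20, 0.90, 0.80, 0.82, 0.85, 0.90]
best_idx, stopped_at = best_checkpoint(losses)
print(f"keep checkpoint {best_idx} (loss {losses[best_idx]}), stopped at evaluation {stopped_at}")
```

The Hugging Face `transformers` library provides an `EarlyStoppingCallback` that serves the same purpose inside a `Trainer` run, so a hand-written loop like this is mainly useful for understanding the logic.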

Beyond Loss: Task-Specific Evaluation

Training loss alone is insufficient. You must evaluate on the actual task:

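# Example: evaluating a fine-tuned model on domain-specific QA
# Assumes `model`, `tokenizer`, and a task-specific `task_metric(pred, expected)`
# returning True/False already exist; the dataset name and split are placeholders.
import torch
from datasets import load_dataset

eval_prompts = load_dataset("your_eval_set", split="test")

correct = 0
for example in eval_prompts:
    inputs = tokenizer(example["prompt"], return_tensors="pt").to(model.device)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=256)
    # Decode only the newly generated tokens, not the echoed prompt
    new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
    output = tokenizer.decode(new_tokens, skip_special_tokens=True)
    # Task-specific scoring
    if task_metric(output, example["expected"]):
        correct += 1

accuracy = correct / len(eval_prompts)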
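The `task_metric` call in the loop above is left undefined. Two common choices for extraction and QA tasks are exact match and token-level F1; the sketch below shows one plausible way to write them, with whitespace tokenization and lowercasing as simplifying assumptions.

```python
from collections import Counter


def exact_match(prediction, expected):
    """True when the strings agree after trimming and lowercasing."""
    return prediction.strip().lower() == expected.strip().lower()


def token_f1(prediction, expected):
    """Harmonic mean of token precision and recall (whitespace tokens)."""
    pred_tokens = prediction.lower().split()
    gold_tokens = expected.lower().split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match(" Force Majeure ", "force majeure"))
print(round(token_f1("material adverse change clause", "material adverse change"), 3))
```

Exact match suits strict formats such as clause labels, while token F1 gives partial credit when the extracted span is slightly too long or too short.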

Benchmark suites for general capability regression:

  • MMLU: General knowledge across 57 subjects
  • HellaSwag: Commonsense reasoning
  • GSM8K: Math reasoning
  • HumanEval: Code generation

Monitor these alongside your task metric to detect catastrophic forgetting.

Instant Rejection

Only reporting training loss as your evaluation strategy will end the conversation. Interviewers want to hear about task-specific metrics (accuracy, F1, exact match), held-out test sets, and monitoring for capability regression on general benchmarks. Training loss tells you the optimization is working; it does not tell you the model is useful.

When NOT to Fine-Tune

Fine-tuning is expensive, time-consuming, and risky. Here is a decision framework for when to avoid it:

| Situation | Why Not Fine-Tune | Alternative |
| --- | --- | --- |
| Knowledge updates frequently | You would need to retrain constantly | RAG |
| Need citations/traceability | Fine-tuned knowledge is opaque | RAG |
| Limited labeled data (< 50 examples) | Risk of overfitting; insufficient signal | Few-shot prompting |
| Task is already well-served by prompting | Unnecessary complexity | Prompt engineering |
| Base model capability is the bottleneck | Fine-tuning cannot add capabilities the base model lacks | Use a larger base model |
| You need deterministic outputs | Fine-tuning does not guarantee exact format compliance | Constrained decoding (Outlines, LMQL) |
| Rapid prototyping phase | Fine-tuning slows iteration | Prompting + RAG |

Company Variation

At large tech companies (Google, Meta, OpenAI), fine-tuning is routine because they have the data, compute, and infrastructure. At startups, the default should be prompting + RAG first, fine-tuning only when you have proven that prompt engineering hits a ceiling. The business case for fine-tuning must justify the ongoing maintenance cost (retraining on data updates, tracking base model updates, managing adapter versions).

Multi-Task Fine-Tuning and Continual Learning

Multi-Task Fine-Tuning

Training on multiple tasks simultaneously to create a versatile model:

Approach 1: Data mixing. Combine datasets from different tasks, prefixed with task instructions:

[Summarize] The contract states that...
[Extract clauses] The agreement between...
[Classify risk] Under section 4.2...

Approach 2: Task-specific LoRA adapters. Train separate LoRA adapters per task that share the same base model. At inference, load the adapter for the requested task. This prevents task interference.

Continual Learning

When you need to fine-tune on new data without forgetting previous fine-tuning:

  1. Replay buffer: Keep a small representative sample from previous tasks and mix into new training data
  2. Elastic Weight Consolidation (EWC): Add a regularization term that penalizes changes to parameters that were important for previous tasks:

$$\mathcal{L}(\theta) = \mathcal{L}_{\text{new}}(\theta) + \frac{\lambda}{2} \sum_i F_i (\theta_i - \theta_i^{*})^2$$

     where $F_i$ is the $i$-th diagonal entry of the Fisher information matrix (the importance of parameter $i$) and $\theta^{*}$ are the weights learned on the previous tasks.

  3. Progressive LoRA: Train a new LoRA adapter for each task, stack or merge them. Avoids overwriting previous adapters.
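The EWC penalty from the list above is simple to compute once the Fisher diagonal is available. The sketch below works on flat Python lists with made-up numbers; the function name `ewc_loss` is an assumption, and estimating the Fisher values themselves (usually from squared gradients on the old task) is outside its scope.

```python
def ewc_loss(new_task_loss, theta, theta_star, fisher_diag, lam):
    """L(theta) = L_new(theta) + (lam / 2) * sum_i F_i * (theta_i - theta_star_i)^2."""
    penalty = sum(
        f * (t - t_star) ** 2
        for f, t, t_star in zip(fisher_diag, theta, theta_star)
    )
    return new_task_loss + 0.5 * lam * penalty


theta_star = [1.0, 1.0]   # weights after the previous task
fisher = [10.0, 0.1]      # parameter 0 mattered a lot, parameter 1 barely

# Moving the unimportant parameter is cheap...
print(ewc_loss(0.5, [1.0, 2.0], theta_star, fisher, lam=2.0))
# ...while moving the important one by the same amount is heavily penalized.
print(ewc_loss(0.5, [2.0, 1.0], theta_star, fisher, lam=2.0))
```

The asymmetry between the two calls is the whole point: the penalty lets the model adapt along directions the old task did not depend on.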

Production Considerations

Merging LoRA Weights

For production inference, merge the LoRA weights into the base model to eliminate adapter overhead:

# Merge LoRA into base model
merged_model = model.merge_and_unload()

# Save merged model
merged_model.save_pretrained("./merged_model")
tokenizer.save_pretrained("./merged_model")

# Now serve as a standard model - no PEFT dependency needed

After merging:

  • No additional latency vs. the base model
  • No PEFT library dependency in the inference stack
  • The adapter is permanently baked into the weights

Serving Multiple Adapters

When you need multiple specialized versions of the same model (e.g., one per customer or per task):

[Figure: Multi-Adapter Serving]

Frameworks like LoRAX, S-LoRA, and Punica enable serving hundreds of LoRA adapters from a single base model with batch-level adapter switching. This is how multi-tenant LLM platforms work.

A/B Testing Fine-Tuned Models

  1. Deploy both the current model (control) and the fine-tuned model (treatment)
  2. Route a percentage of traffic (5-20%) to the treatment
  3. Measure task-specific metrics (accuracy, format compliance, user satisfaction)
  4. Also measure regression metrics (latency, error rate, capability on general tasks)
  5. Gradually increase traffic if metrics improve, roll back if not
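Step 2 above is often implemented with deterministic hashing so that each user consistently sees the same variant. This is a minimal sketch under that assumption; the function name `assign_variant`, the salt string, and the user ID format are hypothetical.

```python
import hashlib


def assign_variant(user_id, treatment_fraction=0.1, salt="ft-rollout-v1"):
    """Deterministically map a user to 'treatment' or 'control'."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform value in [0, 1)
    return "treatment" if bucket < treatment_fraction else "control"


users = [f"user-{i}" for i in range(10_000)]
share = sum(assign_variant(u) == "treatment" for u in users) / len(users)
print(f"treatment share: {share:.3f}")
```

Raising `treatment_fraction` only moves additional users into the treatment group while earlier treatment users stay there, which supports the gradual ramp in step 5; changing the salt reshuffles everyone for a fresh experiment.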

Version Management

Track fine-tuning runs like software releases:

| Artifact | What to Store | Why |
| --- | --- | --- |
| Training data snapshot | Versioned dataset (hash + location) | Reproducibility |
| Hyperparameters | Full config file | Reproducibility |
| Adapter weights | LoRA adapter checkpoint | Deployment |
| Base model version | Exact model ID and hash | Compatibility |
| Evaluation results | Metrics on test set + benchmarks | Regression detection |
| Training logs | Loss curves, learning rate schedule | Debugging |
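One lightweight way to capture these artifacts is a manifest written alongside every run. The sketch below is illustrative only: the `FineTuneRun` fields, the `dataset_fingerprint` helper, and all paths and identifiers in the example are hypothetical placeholders.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


def dataset_fingerprint(records):
    """Content hash of a dataset, independent of key order within records."""
    h = hashlib.sha256()
    for rec in records:
        h.update(json.dumps(rec, sort_keys=True).encode("utf-8"))
        h.update(b"\n")
    return h.hexdigest()


@dataclass
class FineTuneRun:
    run_id: str
    base_model_id: str
    base_model_revision: str
    dataset_hash: str
    dataset_location: str
    hyperparameters: dict
    adapter_path: str
    eval_metrics: dict = field(default_factory=dict)

    def to_json(self):
        return json.dumps(asdict(self), indent=2, sort_keys=True)


records = [{"prompt": "Extract the indemnification clause.", "response": "..."}]
run = FineTuneRun(
    run_id="run-001",                                 # placeholder
    base_model_id="meta-llama/Llama-3.1-8B",
    base_model_revision="<commit-hash>",              # placeholder
    dataset_hash=dataset_fingerprint(records),
    dataset_location="s3://example-bucket/data/v1",   # placeholder
    hyperparameters={"r": 16, "lora_alpha": 32, "learning_rate": 2e-4},
    adapter_path="./output/adapter",                  # placeholder
    eval_metrics={"exact_match": 0.0},                # fill in after evaluation
)
print(run.to_json())
```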

Practice Problems

Problem 1: Choosing a Fine-Tuning Strategy

Your company has a customer support chatbot powered by GPT-4. It handles 10,000 queries/day. You want to reduce costs by switching to a fine-tuned Llama 3 8B model. You have 15,000 labeled customer support conversations. You have two A100 40GB GPUs available. Design the fine-tuning strategy.

Hint 1 - Direction

Consider what the model needs to learn: domain vocabulary (product names, policies), output format (structured responses with links), and conversational tone. This is a behavior change, not a knowledge change - fine-tuning is appropriate. Think about which PEFT method fits your hardware.

Hint 2 - Insight

With 2x A100 40GB, you can fit Llama 3 8B in FP16 (about 16 GB of weights, replicated on each GPU under data parallelism) with room for LoRA training. QLoRA is unnecessary: you have enough memory for standard LoRA. The 15K conversations should be filtered for quality. Multi-turn format is important since customer support is conversational.

Hint 3 - Full Solution

Strategy: LoRA fine-tuning on Llama 3 8B

Step 1: Data preparation

  • Filter 15K conversations: remove short/unhelpful ones, keep ~10K high-quality examples
  • Deduplicate (MinHash) before splitting so near-duplicates do not leak between train and test
  • Format in ChatML or ShareGPT format (multi-turn)
  • Validate format consistency with a schema check
  • Split: 9K train, 500 validation, 500 test

Step 2: Training configuration

  • Method: LoRA ($r=32$, $\alpha=64$, target all attention + MLP)
  • Learning rate: $2 \times 10^{-4}$ with cosine schedule
  • Epochs: 3 (monitor eval loss for early stopping)
  • Effective batch size: 32 (per_device=4, gradient_accumulation=4, 2 GPUs)
  • Max sequence length: 2048 (typical customer support conversation length)
  • Gradient checkpointing: enabled

Step 3: Evaluation

  • Task metrics: response quality (human eval on 100 samples), format compliance rate, factual accuracy
  • Regression check: MMLU subset, commonsense reasoning
  • Cost analysis: cost per query vs GPT-4, latency comparison

Step 4: Production deployment

  • Merge LoRA weights into base model
  • Serve with vLLM for optimized inference
  • A/B test: route 10% of traffic to fine-tuned model, compare user satisfaction scores
  • Set up monitoring for response quality degradation

Scoring rubric:

| Grade | Criteria |
| --- | --- |
| Strong Hire | Proposes LoRA with justified rank selection, discusses data quality filtering, has a multi-stage evaluation plan (automated + human), includes production rollout strategy with A/B testing. |
| Lean Hire | Chooses appropriate PEFT method for hardware, mentions data quality, has basic evaluation plan. |
| No Hire | Suggests full fine-tuning without considering memory, ignores data quality, evaluates only training loss. |

Problem 2: Diagnosing Fine-Tuning Failure

After LoRA fine-tuning a Llama 3 70B model on 5,000 legal contract examples, the model correctly extracts contract clauses from examples similar to the training data but produces hallucinated legal terms on out-of-distribution contracts. The training loss converged to 0.3 and evaluation loss to 0.8. What went wrong and how do you fix it?

Hint 1 - Direction

The gap between training loss (0.3) and evaluation loss (0.8) is a clear signal. What does this tell you about the model's generalization?

Hint 2 - Insight

This is textbook overfitting. The model memorized the training data distribution (specific legal terms, contract structures) rather than learning the general skill of clause extraction. The large train/eval loss gap confirms this. Consider what factors contribute: data diversity, training duration, rank selection, regularization.

Hint 3 - Full Solution

Diagnosis: Overfitting to the training distribution

The train-eval loss gap (0.3 vs 0.8) confirms overfitting. The model memorized training-set contract patterns and hallucinates when encountering unfamiliar contracts.

Root cause analysis:

  1. Insufficient data diversity: 5K examples may all be from the same contract types or jurisdictions
  2. Rank too high: If $r$ is large (64+), the adapter has too much capacity and memorizes
  3. Trained too long: May have continued past the optimal epoch
  4. No regularization: Missing dropout or weight decay

Fixes (in order of priority):

  1. Data diversity (highest impact):

    • Audit training data: How many contract types, jurisdictions, and formats are represented?
    • Augment with diverse contracts even if fewer are perfectly labeled
    • Add negative examples (text that is NOT a clause) to prevent over-triggering
  2. Reduce model capacity:

    • Lower LoRA rank from current to $r = 8$ or $r = 16$
    • Target only $W_q$ and $W_v$ (not all linear layers)
    • Add LoRA dropout (0.1)
  3. Training adjustments:

    • Train for fewer steps - use the checkpoint with lowest eval loss, not final checkpoint
    • Reduce learning rate by 2-5x
    • Add weight decay (0.05-0.1)
  4. Evaluation improvements:

    • Create an eval set with intentionally OOD contracts (different types, jurisdictions)
    • Track hallucination rate as a first-class metric, not just loss
    • Use human evaluation on 50 OOD examples
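
The training adjustments above (keep the checkpoint with the lowest eval loss, stop once eval loss stops improving) can be sketched in a few lines of plain Python. This is a minimal illustration, not tied to any training framework: the `Checkpoint` class, the helper names, and the loss curve are hypothetical, chosen only so that the final point matches the 0.3 / 0.8 losses in the problem statement.

```python
from dataclasses import dataclass


@dataclass
class Checkpoint:
    step: int
    train_loss: float
    eval_loss: float


def best_checkpoint(history):
    """Return the checkpoint with the lowest eval loss (not the final one)."""
    return min(history, key=lambda c: c.eval_loss)


def generalization_gap(ckpt):
    """Eval loss minus train loss; a large positive gap suggests overfitting."""
    return ckpt.eval_loss - ckpt.train_loss


def should_stop(history, patience=2):
    """Early stopping: True once the last `patience` evaluations all fail
    to beat the best eval loss seen before them."""
    if len(history) <= patience:
        return False
    best_before = min(c.eval_loss for c in history[:-patience])
    return all(c.eval_loss >= best_before for c in history[-patience:])


# Hypothetical loss curve, ending at the 0.3 / 0.8 values from the problem.
history = [
    Checkpoint(100, 1.20, 1.25),
    Checkpoint(200, 0.80, 0.90),
    Checkpoint(300, 0.55, 0.70),
    Checkpoint(400, 0.40, 0.75),
    Checkpoint(500, 0.30, 0.80),
]

best = best_checkpoint(history)
final = history[-1]
print(f"best step: {best.step}, eval loss {best.eval_loss:.2f}")
print(f"final gap: {generalization_gap(final):.2f}")
print(f"stop now? {should_stop(history)}")
```

On this made-up curve, training loss keeps falling after step 300 while eval loss rises, so the step-300 checkpoint is the one to deploy and training could have stopped two evaluations earlier.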

Scoring rubric:

| Grade | Criteria |
| --- | --- |
| Strong Hire | Identifies overfitting from train/eval gap, proposes data diversity as the primary fix, adjusts model capacity AND training regime, creates OOD evaluation set. |
| Lean Hire | Identifies overfitting, proposes at least two reasonable fixes. |
| No Hire | Does not recognize overfitting, suggests "train for more epochs," or proposes only increasing data volume without considering diversity. |

Problem 3: Multi-Tenant Fine-Tuning Architecture

You are building an LLM platform where each enterprise customer gets a model fine-tuned on their private data. You have 50 customers, each with 1,000-10,000 training examples. Design a system that serves all 50 customers efficiently without maintaining 50 separate model deployments.

Hint 1 - Direction

Think about parameter-efficient methods where the base model is shared and only the adaptation is customer-specific. Consider how LoRA adapters can be swapped at serving time.

Hint 2 - Insight

One base model + 50 LoRA adapters. Each adapter is small (~10-50 MB for an 8B model with $r = 16$). The key challenge is efficient serving: you need to swap adapters per-request without reloading the base model. Look into S-LoRA or LoRAX for batch-level adapter switching.
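
The adapter-size estimate can be checked with a short calculation. The sketch below assumes Llama 3 8B shapes (hidden size 4096, 32 layers, key/value projection width 1024 from grouped-query attention), LoRA applied to the four attention projections, and 16-bit storage; none of these specifics are stated in the hint, and the function names are illustrative.

```python
def lora_params(d_in, d_out, r):
    """Trainable parameters for one adapted matrix: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)


def adapter_size_bytes(r=16, hidden=4096, kv_dim=1024, n_layers=32, bytes_per_param=2):
    # (d_in, d_out) for the four attention projections in one layer.
    # Shapes are assumed Llama 3 8B values, not taken from the text above.
    projections = {
        "q": (hidden, hidden),
        "k": (hidden, kv_dim),
        "v": (hidden, kv_dim),
        "o": (hidden, hidden),
    }
    per_layer = sum(lora_params(d_in, d_out, r) for d_in, d_out in projections.values())
    return per_layer * n_layers * bytes_per_param


size = adapter_size_bytes()
print(f"{size / 1e6:.1f} MB per adapter")
print(f"{50 * size / 1e9:.2f} GB for 50 adapters")
```

Under these assumptions an adapter holds about 13.6M parameters, roughly 27 MB, which sits inside the 10-50 MB range quoted above. Targeting more modules or raising the rank scales the size linearly.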

Hint 3 - Full Solution

Architecture: Shared base model + per-customer LoRA adapters

Figure: Multi-tenant LoRA architecture (shared base model with per-customer adapters)

Training pipeline:

  1. Common base model: Llama 3 8B (or fine-tuned variant for your domain)
  2. Per-customer LoRA training: automated pipeline triggered when customer uploads data
    • Standard config: $r = 16$, $\alpha = 32$, target all attention projections
    • Training: 3 epochs, early stopping on eval loss
    • Automated quality gate: model must pass basic benchmarks before deployment
  3. Store adapter weights in versioned object storage (S3)
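
One way to make the training pipeline concrete is a small job config plus a quality gate, sketched below. Everything named here is hypothetical: the `LoraJobConfig` class, the storage key layout, the benchmark names, and the thresholds are placeholders, and the `q_proj`/`k_proj`/`v_proj`/`o_proj` module names assume a Hugging Face-style Llama checkpoint.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoraJobConfig:
    customer_id: str
    version: int
    rank: int = 16
    alpha: int = 32
    # Assumed Hugging Face-style names for the attention projections.
    target_modules: tuple = ("q_proj", "k_proj", "v_proj", "o_proj")
    max_epochs: int = 3

    def storage_key(self):
        """Versioned object-storage key (hypothetical layout)."""
        return f"adapters/{self.customer_id}/v{self.version}/adapter.safetensors"


def passes_quality_gate(scores, thresholds):
    """Deploy only if every required benchmark meets its minimum score.
    A missing score counts as a failure."""
    return all(
        scores.get(name, float("-inf")) >= minimum
        for name, minimum in thresholds.items()
    )


cfg = LoraJobConfig(customer_id="acme", version=3)
thresholds = {"task_f1": 0.80, "general_benchmark": 0.60}  # hypothetical gates
scores = {"task_f1": 0.86, "general_benchmark": 0.58}      # hypothetical results

print(cfg.storage_key())
print("deploy" if passes_quality_gate(scores, thresholds) else "blocked")
```

Freezing the config and keeping rank and target modules as shared defaults means every customer's adapter has the same shape. Including a general benchmark in the gate is a cheap guard against an adapter that does well on the customer's task but has degraded the base model's broader skills.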

Serving architecture:

  1. Base model: Loaded once per GPU, shared across all customers
  2. Adapter cache: Hot adapters (frequently used) kept in GPU memory; cold adapters loaded on demand from SSD/object storage
  3. Request routing: Each API request includes customer_id; router selects the correct adapter
  4. Batching: Use S-LoRA, Punica, or LoRAX for batch-level adapter switching - requests for different customers can be batched together as long as they share the same base model
  5. Hardware: 2-4 GPUs with the base model replicated; adapter switching adds <5ms per request
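
The adapter cache and request routing steps can be sketched as a least-recently-used cache keyed by `customer_id`. This is a toy model under stated assumptions: `AdapterCache`, `fake_load`, and `route` are hypothetical names, string placeholders stand in for adapter weights, and the heterogeneous batching that S-LoRA or Punica provide is not modeled.

```python
from collections import OrderedDict


class AdapterCache:
    """LRU cache standing in for the set of adapters resident in GPU memory."""

    def __init__(self, capacity, load_fn):
        self.capacity = capacity
        self.load_fn = load_fn      # fetches a cold adapter from SSD/object storage
        self._hot = OrderedDict()   # customer_id -> adapter weights
        self.misses = 0

    def get(self, customer_id):
        if customer_id in self._hot:
            self._hot.move_to_end(customer_id)  # mark as most recently used
            return self._hot[customer_id]
        self.misses += 1
        adapter = self.load_fn(customer_id)
        self._hot[customer_id] = adapter
        if len(self._hot) > self.capacity:
            self._hot.popitem(last=False)       # evict least recently used
        return adapter

    def resident(self):
        return list(self._hot)


def fake_load(customer_id):
    # Placeholder for reading adapter weights from storage.
    return f"weights-for-{customer_id}"


def route(request, cache):
    """Pick the adapter for a request based on its customer_id."""
    return cache.get(request["customer_id"])


cache = AdapterCache(capacity=2, load_fn=fake_load)
for cid in ["acme", "globex", "acme", "initech"]:
    route({"customer_id": cid, "prompt": "..."}, cache)

print(cache.resident())
print(cache.misses)
```

With room for two hot adapters, the fourth request evicts `globex` rather than `acme`, because `acme` was used more recently. In a real deployment the miss count per customer is worth monitoring, since each miss pays the cold-load latency.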

Key design decisions:

  • Use the same LoRA config (rank, target modules) across all customers for serving compatibility
  • Implement adapter versioning: customers can roll back to previous versions
  • Data isolation: each customer's training data and adapter weights are strictly isolated
  • Monitoring: per-customer quality metrics, latency, error rates

Cost analysis:

  • Without multi-tenancy: 50 x 16 GB (one 8B model in 16-bit precision per customer) = 800 GB GPU memory
  • With multi-tenancy: 16 GB (base) + 1.5 GB (all 50 adapters at ~30 MB each) = 17.5 GB GPU memory
  • Roughly 46x memory reduction (800 / 17.5 ≈ 45.7)

Scoring rubric:

| Grade | Criteria |
| --- | --- |
| Strong Hire | Designs shared base + per-customer LoRA, discusses serving frameworks (S-LoRA/LoRAX), includes adapter versioning and data isolation, provides cost/memory analysis. |
| Lean Hire | Proposes shared base + LoRA adapters, mentions efficient serving. |
| No Hire | Suggests deploying 50 separate models or does not consider serving efficiency. |

Interview Cheat Sheet

| Topic | Key Fact | Why It Matters |
| --- | --- | --- |
| Fine-tuning vs. prompting vs. RAG | FT changes weights (behavior), RAG adds context (knowledge), prompting changes input | Most common opening question |
| LoRA math | $W = W_0 + BA$, $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$ | Shows you understand the mechanism, not just the API |
| LoRA rank | $r = 16$ is a good default; 8 for simple, 64 for complex tasks | Demonstrates practical experience |
| QLoRA key ideas | NF4 quantization + double quantization + paged optimizers | Three innovations, not just "4-bit LoRA" |
| Catastrophic forgetting | Model forgets pretrained skills during fine-tuning | Main risk; mitigate with low LR, few epochs, data mixing |
| Learning rate | $10^{-5}$ to $10^{-4}$ for PEFT; 10x lower for full FT | Single most impactful hyperparameter |
| Data quality | 1K clean examples > 50K noisy examples | The answer that separates practitioners from theorists |
| Merging LoRA | $W = W_0 + BA$ can be precomputed for zero inference overhead | Critical production deployment detail |
| Multi-tenant serving | One base model + N adapters via S-LoRA/LoRAX | Shows systems design thinking |
| When NOT to fine-tune | Knowledge updates, < 50 examples, prompting works | Shows judgment, not just technique |
| Epochs | 1-3 for most tasks; more increases forgetting risk | Prevents the "more training = better" trap |
| Evaluation | Task metrics + general benchmarks + human eval | Loss alone is insufficient |
| Data formats | Alpaca (single-turn), ShareGPT (multi-turn), ChatML | Practical framework knowledge |
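
The "LoRA math" and "Merging LoRA" rows can be verified on a toy example. The sketch below uses plain Python lists with small integer matrices (values chosen arbitrarily for illustration) to show that folding $BA$ into $W_0$ once gives the same output as running the base path and the adapter path separately. It also counts parameters for an assumed $d = k = 4096$, $r = 16$ matrix.

```python
def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]


def add(X, Y):
    """Element-wise sum of two same-shaped matrices."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]


# Toy shapes: d = 3, k = 3, r = 1. Integer entries keep the comparison exact.
W0 = [[1, 0, 2], [0, 1, 0], [3, 0, 1]]  # frozen base weight, d x k
B = [[1], [2], [0]]                     # d x r
A = [[0, 1, 1]]                         # r x k
x = [[1], [2], [3]]                     # input column vector, k x 1

# Unmerged: base path plus low-rank adapter path (two extra small matmuls).
unmerged = add(matmul(W0, x), matmul(B, matmul(A, x)))

# Merged: fold the adapter into the weight once, then do a single matmul.
W = add(W0, matmul(B, A))
merged = matmul(W, x)

print(unmerged == merged)

# Parameter count for one d x k matrix versus its rank-r adapter.
d, k, r = 4096, 4096, 16
full_params = d * k
lora_param_count = r * (d + k)
print(full_params, lora_param_count, full_params // lora_param_count)
```

Merging removes the adapter's inference overhead but also removes the ability to swap adapters per request, which is why multi-tenant serving keeps adapters unmerged.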

Spaced Repetition Checkpoints

Day 0 (Today)

  • Write the LoRA equation from memory: $W = W_0 + BA$, sizes of $B$ and $A$, and what $r$ controls
  • Name three differences between LoRA and QLoRA
  • When should you use RAG instead of fine-tuning? Give three scenarios

Day 3

  • Compare all five PEFT methods: LoRA, QLoRA, prefix tuning, prompt tuning, adapters
  • What is catastrophic forgetting? List three mitigation strategies
  • Calculate the trainable parameters for LoRA with $d = 4096$, $r = 16$, applied to $W_q$ and $W_v$

Day 7

  • Design a fine-tuning pipeline for a domain-specific task: data prep, method selection, hyperparameters, evaluation
  • Explain why LoRA targets $W_q$ and $W_v$ by default, and when you should target more layers
  • A model fine-tuned for 5 epochs has train loss 0.2 and eval loss 1.1. Diagnose and propose fixes

Day 14

  • Design a multi-tenant serving architecture with shared base model and per-customer LoRA adapters
  • Explain the NF4 data type and double quantization in QLoRA - why are these innovations important?
  • Walk through a production fine-tuning deployment: merging weights, A/B testing, version management, monitoring

Day 21

  • Teach someone the complete fine-tuning decision framework: prompting to RAG to PEFT to full fine-tuning
  • Given a novel task, select the optimal PEFT method and justify your choice with memory, quality, and serving tradeoffs
  • Design a continual learning strategy for a model that must adapt to new domains without forgetting previous ones
© 2026 EngineersOfAI. All rights reserved.