Fine-Tuning Strategies

Reshaping the weights: how to teach a foundation model new tricks without breaking the ones it already knows.

Reading time: ~30 min | Interview relevance: Critical | Roles: MLE, AI Eng, Research Eng, Data Scientist

The Real Interview Moment

You are forty minutes into an ML engineer interview at a Series B startup building an AI-powered contract analysis platform. The interviewer, the head of ML, slides her laptop toward you:

"Our base Llama 3 70B model extracts contract clauses reasonably well with few-shot prompting, but the output format is inconsistent and it misses domain-specific terms like 'indemnification carve-out' and 'material adverse change.' We have 8,000 labeled examples of contract clause extraction. We tried full fine-tuning on 4 A100s but the model started hallucinating boilerplate language that was not in the source contract. How would you approach this?"

She is not asking you to recite the LoRA paper. She wants you to diagnose the failure mode (catastrophic forgetting plus overfitting), propose a parameter-efficient strategy that fits their compute budget, discuss data quality, and articulate a production deployment plan. This chapter gives you every piece you need.

Why Fine-Tuning Matters

LLMs arrive pretrained on trillions of tokens, encoding broad world knowledge and language understanding. But they do not know your task. There are three ways to bridge this gap:

Approach	What It Changes	Best For	Limitations
Prompt engineering	The input, not the model	Quick iteration, general tasks	Context window limits, fragile, costly per-query
RAG	The context available at inference	Knowledge-intensive tasks, frequently updated data	Adds retrieval latency, does not change model behavior
Fine-tuning	The model weights	Teaching new behavior, formats, domain reasoning	Requires training data and compute, risk of forgetting

60-Second Answer

"Fine-tuning modifies the model's weights to internalize a new behavior, domain vocabulary, or output format. Unlike prompting, the knowledge is baked into the parameters, so it does not consume context window tokens at inference. Unlike RAG, it changes how the model reasons, not just what information it can see. The tradeoff is that you need labeled data and compute, and you risk catastrophic forgetting if not careful."

The Decision Hierarchy

Always start simple and escalate only when necessary:

[Figure: Fine-Tuning Decision Tree]

The Fine-Tuning Spectrum

Not all fine-tuning is created equal. The methods exist on a spectrum from modifying every parameter to touching none of the original weights:

Method	Parameters Trained	Memory Needed	Typical Use Case
Full fine-tuning	All (billions)	4-16x model size	Large data + large compute budgets
LoRA	0.1-1% of params	~1.2-1.5x model size	Production standard for most teams
QLoRA	0.1-1% of params	~0.3-0.5x model size	When GPU memory is the constraint
Prefix tuning	Virtual token embeddings	~1.01x model size	Multi-tenant serving
Prompt tuning	Soft prompt vectors	~1.001x model size	Lightweight task adaptation
Adapters	Bottleneck layers	~1.05-1.1x model size	Modular, composable skills

[Figure: Fine-Tuning Spectrum]

Full Fine-Tuning

Full fine-tuning updates every parameter in the model. For a 70B parameter model with FP16 precision, this requires:

Model weights: $70 \times 10^9 \times 2$ bytes $= 140$ GB
Gradients: Another 140 GB
Optimizer states (Adam): $2 \times 140$ GB $= 280$ GB (momentum + variance, assuming 2-byte states; FP32 states, common in mixed-precision training, would double this)
Activations: Variable, often 50-200+ GB depending on batch size and sequence length
Total: ~600-800 GB across GPUs
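The arithmetic above can be sketched as a small helper. This is a rough estimate under the same simplification the list uses (gradients and Adam states stored at the same 2 bytes per parameter as the weights); the function name is illustrative, not part of any library.

```python
def full_finetune_memory_gb(n_params, bytes_per_param=2):
    """Rough memory estimate (GB) for full fine-tuning with Adam, excluding activations."""
    weights = n_params * bytes_per_param / 1e9
    gradients = weights        # one gradient value per weight
    optimizer = 2 * weights    # Adam keeps momentum and variance per weight
    return {
        "weights": weights,
        "gradients": gradients,
        "optimizer": optimizer,
        "total_without_activations": weights + gradients + optimizer,
    }

est = full_finetune_memory_gb(70e9)
for name, gb in est.items():
    print(f"{name}: {gb:.0f} GB")
```

Adding the 50-200 GB activation range to the 560 GB subtotal gives the ~600-800 GB figure quoted above.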

When Full Fine-Tuning Is Appropriate

You have massive, high-quality data (100K+ examples) and the budget for multi-GPU training
You need deep behavioral changes - not just format, but fundamentally different reasoning patterns
You are building a foundation model for a specific domain (e.g., Bloomberg's BloombergGPT for finance)
The base model is small (under 3B parameters), where full fine-tuning is practical on a single GPU

Catastrophic Forgetting

The central risk of full fine-tuning: the model forgets its pretrained capabilities as it overfits to the fine-tuning distribution.

Symptoms:

Model becomes excellent at the fine-tuned task but loses general language ability
Generates repetitive or degenerate text outside the fine-tuning domain
Fails at tasks it previously handled well (e.g., basic math, common knowledge)

Mitigations:

Low learning rate: Use $10^{-5}$ to $5 \times 10^{-5}$ (10-100x smaller than pretraining)
Short training: 1-3 epochs is usually sufficient; more epochs increases forgetting
Data mixing: Include a percentage of general-purpose data alongside task-specific data
Regularization: Weight decay, dropout, or elastic weight consolidation (EWC)
Evaluation on held-out general benchmarks: Track MMLU or similar alongside your task metric

Common Trap

Many candidates suggest "just train for more epochs" when fine-tuning performance is poor. Beyond 3-5 epochs on a typical dataset, you are almost certainly overfitting and inducing catastrophic forgetting. The fix is usually better data quality, not more training.

Learning Rate Selection

The learning rate is the single most important hyperparameter for fine-tuning:

Model Size	Recommended LR Range	Notes
< 1B	$1 \times 10^{-5}$ to $5 \times 10^{-5}$	Can tolerate slightly higher LR
1-10B	$5 \times 10^{-6}$ to $2 \times 10^{-5}$	Standard range
10-70B	$1 \times 10^{-6}$ to $1 \times 10^{-5}$	Lower is safer
> 70B	$5 \times 10^{-7}$ to $5 \times 10^{-6}$	Very conservative

Use a cosine learning rate schedule with a linear warmup of 3-10% of total steps. The warmup prevents early gradient spikes that can destabilize the pretrained weights.
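To make the schedule concrete, here is a minimal sketch of linear warmup followed by cosine decay. The function name and the `min_lr` parameter are assumptions for illustration; training frameworks ship their own schedulers, which may differ in small details such as how the first step is counted.

```python
import math

def lr_at_step(step, total_steps, peak_lr, warmup_ratio=0.05, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay toward min_lr."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Ramp up linearly so early updates are small
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

for step in (0, 25, 50, 500, 1000):
    print(step, f"{lr_at_step(step, total_steps=1000, peak_lr=1e-5):.2e}")
```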

Parameter-Efficient Fine-Tuning (PEFT)

PEFT methods train a small number of new or modified parameters while keeping the original model weights frozen. This dramatically reduces memory requirements, training time, and the risk of catastrophic forgetting.

LoRA: Low-Rank Adaptation

LoRA is the dominant PEFT method in production. It was introduced by Hu et al. (2021) and is based on a simple but powerful observation: the weight updates during fine-tuning have low intrinsic rank.

The Core Idea

Instead of updating a weight matrix $W_0 \in \mathbb{R}^{d \times k}$ directly, LoRA freezes $W_0$ and adds a low-rank decomposition:

W = W_0 + \Delta W = W_0 + BA

where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ , with rank $r \ll \min(d, k)$ .

For a typical transformer layer where $d = k = 4096$ and $r = 16$ :

Original $W_0$ : $4096 \times 4096 = 16.7M$ parameters (frozen)
LoRA $B$ : $4096 \times 16 = 65K$ parameters (trainable)
LoRA $A$ : $16 \times 4096 = 65K$ parameters (trainable)
Total trainable: 131K vs. 16.7M = 0.78% of original
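The same count can be reproduced for any shape and rank. The helper below is a small illustrative sketch (the name is not from any library) covering a single adapted weight matrix.

```python
def lora_param_count(d, k, r):
    """Trainable parameters for one adapted d x k matrix: B is d x r, A is r x k."""
    return d * r + r * k

d = k = 4096
for r in (4, 8, 16, 64):
    trainable = lora_param_count(d, k, r)
    print(f"r={r}: {trainable:,} trainable ({100 * trainable / (d * k):.2f}% of W_0)")
```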

[Figure: LoRA Architecture]

The forward pass becomes:

h = W_0 x + \frac{\alpha}{r} B A x

where $\alpha$ is a scaling factor. Common choices are $\alpha = r$ (so $\alpha/r = 1$) or $\alpha = 2r$; it can also be tuned independently of $r$.

Initialization

$A$ is initialized with a random Gaussian distribution
$B$ is initialized to zero
This means $\Delta W = BA = 0$ at the start of training, so the model begins from the pretrained weights
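A toy version of the forward pass and initialization, written in plain Python so it runs without a deep learning framework. The class name, the tiny matrix sizes, and the Gaussian standard deviation of 0.01 are assumptions for illustration; real implementations operate on tensors and train $A$ and $B$ by gradient descent.

```python
import random

def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

class LoRALinear:
    """Frozen W0 (d x k) plus a low-rank update B (d x r) times A (r x k)."""

    def __init__(self, W0, r, alpha, seed=0):
        rng = random.Random(seed)
        d, k = len(W0), len(W0[0])
        self.W0 = W0                                                            # frozen
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(k)] for _ in range(r)]  # Gaussian init
        self.B = [[0.0] * r for _ in range(d)]                                  # zero init
        self.scale = alpha / r

    def forward(self, x):
        base = matvec(self.W0, x)
        update = matvec(self.B, matvec(self.A, x))  # B(Ax): BA is never materialized
        return [b + self.scale * u for b, u in zip(base, update)]

W0 = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
layer = LoRALinear(W0, r=2, alpha=4)
x = [1.0, 0.0, -1.0]
# B starts at zero, so the adapted layer initially reproduces the frozen layer
print(layer.forward(x) == matvec(W0, x))
```

Computing $B(Ax)$ instead of $(BA)x$ keeps the extra cost proportional to $r$, which is why unmerged adapters are cheap during training.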

Rank Selection

The rank $r$ controls the capacity of the adaptation:

Rank $r$	Trainable Params (per adapted $4096 \times 4096$ matrix)	Use Case
4-8	~33-65K	Simple format changes, style transfer
16-32	~131-262K	Most tasks (instruction tuning, classification)
64-128	~524K-1M	Complex domain adaptation
256+	~2M+	Approaching full fine-tuning territory

Rule of thumb: Start with $r = 16$ . If the model underfits, increase to 32 or 64. If it overfits, decrease to 8. The performance usually plateaus well before $r = 64$ for most tasks.

Which Layers to Target

Not all layers benefit equally from LoRA:

Target	Common Choice	Reasoning
Query + Value projections	Default in most frameworks	$W_q$ and $W_v$ capture attention patterns; most efficient
All attention projections	$W_q$ , $W_k$ , $W_v$ , $W_o$	Better for complex tasks
Attention + MLP	All linear layers	Highest quality but more parameters
Only MLP	Rare	Sometimes useful for knowledge injection

Research finding (Hu et al., 2021): Under a fixed trainable-parameter budget, adapting $W_q$ and $W_v$ together outperforms adapting either alone and performs about on par with adapting all four attention matrices at a lower rank.

60-Second Answer

"LoRA freezes the pretrained weights and injects trainable low-rank matrices alongside each target layer. Instead of updating a $d \times k$ weight matrix, we learn two smaller matrices $B$ ( $d \times r$ ) and $A$ ( $r \times k$ ) where $r$ is typically 8-64. This reduces trainable parameters to under 1% of the original model while achieving comparable performance to full fine-tuning on most tasks. At inference, you can merge $BA$ into $W_0$ for zero additional latency."

LoRA in Code

import torch
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM

# Load base model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Configure LoRA
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                          # rank
    lora_alpha=32,                 # scaling factor
    lora_dropout=0.05,             # dropout on LoRA layers
    target_modules=[               # which layers to adapt
        "q_proj", "v_proj",        # attention projections
        "k_proj", "o_proj",        # optional: all attention
        "gate_proj", "up_proj",    # optional: MLP layers
        "down_proj",
    ],
    bias="none",                   # don't train biases
)

# Apply LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
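# Prints trainable vs. total parameter counts. With all seven modules above at r=16
# this is roughly 42M trainable parameters (about 0.5% of the 8B model); the figure
# 13,631,488 (about 0.17%) corresponds to adapting only q_proj, k_proj, v_proj, o_proj.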

QLoRA: Quantized LoRA

QLoRA (Dettmers et al., 2023) makes fine-tuning accessible on consumer GPUs by combining 4-bit quantization of the base model with LoRA adapters.

Three Key Innovations

4-bit NormalFloat (NF4) quantization: A new data type optimized for normally distributed weights. The quantization levels are placed at the quantiles of a standard normal distribution, minimizing quantization error for neural network weights.
Double quantization: The quantization constants themselves are quantized. Each 64-parameter block has a 32-bit quantization constant, an overhead of 0.5 bits/param. Quantizing those constants to 8 bits (in blocks of 256) cuts the overhead to about 0.127 bits/param, a saving of roughly 0.37 bits/param.
Paged optimizers: Uses NVIDIA unified memory to automatically page optimizer states between GPU and CPU memory, preventing out-of-memory crashes during gradient checkpointing.
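The bookkeeping behind double quantization is simple enough to check by hand. The sketch below assumes the block sizes described in the QLoRA paper (64 weights per first-level block, 8-bit constants grouped in blocks of 256 with one 32-bit second-level constant); the function names are illustrative.

```python
def single_quant_overhead_bits(block_size=64, const_bits=32):
    """Bits per parameter spent on one 32-bit constant per block."""
    return const_bits / block_size

def double_quant_overhead_bits(block_size=64, first_bits=8,
                               second_block=256, second_bits=32):
    """8-bit constants per block, plus one 32-bit constant per 256 of those constants."""
    return first_bits / block_size + second_bits / (block_size * second_block)

single = single_quant_overhead_bits()
double = double_quant_overhead_bits()
saving = single - double
print(f"single: {single:.3f} bits/param, double: {double:.3f} bits/param")
print(f"saving on a 70B model: {saving * 70e9 / 8 / 1e9:.1f} GB")
```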

Memory Comparison

For a 70B parameter model:

Configuration	Memory Required	Hardware
Full fine-tuning (FP16)	~600-800 GB	8x A100 80GB
LoRA (FP16 base)	~170-210 GB (140 GB of weights plus adapters and activations)	3-4x A100 80GB
QLoRA (4-bit base)	~36-48 GB	1x A100 80GB or 2x A6000
QLoRA (4-bit base, small batch)	~24 GB	1x RTX 4090 (with CPU offloading)

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# QLoRA quantization config
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",          # NormalFloat4
    bnb_4bit_use_double_quant=True,      # double quantization
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Load model in 4-bit
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B",
    quantization_config=quantization_config,
    device_map="auto",
)

# Prepare for k-bit training (cast layernorm to FP32, etc.)
model = prepare_model_for_kbit_training(model)

# Apply LoRA on top of quantized model
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

Common Trap

QLoRA does introduce a small quality gap compared to full-precision LoRA (typically 1-3% on benchmarks). For maximum quality, train with QLoRA for development speed, then do a final run with FP16 LoRA if quality is critical. Also note that QLoRA training is slower per step than FP16 LoRA because of quantization/dequantization overhead - you are trading compute time for memory.

Prefix Tuning

Prefix tuning (Li & Liang, 2021) prepends trainable "virtual tokens" to the key and value matrices of every attention layer.

How It Works

Instead of modifying any model weights, you learn a set of continuous prefix vectors $P_k, P_v \in \mathbb{R}^{l \times d}$ where $l$ is the prefix length (typically 10-100) and $d$ is the hidden dimension:

\text{head}_i = \text{Attention}(xW_q, [P_k; xW_k], [P_v; xW_v])

The model attends to both the real input tokens and the learned virtual prefix tokens. The prefix acts as a continuous task-specific "instruction."

Trainable parameters: $l \times d \times 2 \times L$ (prefix length $\times$ hidden dim $\times$ key+value $\times$ num layers). For a 7B model with $l=20$ , $d=4096$ , $L=32$ : about 5.2M parameters (~0.07%).
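The count is easy to verify. The helper name below is illustrative, and the 7B denominator is a round number rather than an exact parameter count.

```python
def prefix_tuning_params(prefix_len, hidden_dim, num_layers):
    """Key and value prefixes at every layer: l * d * 2 * L."""
    return prefix_len * hidden_dim * 2 * num_layers

n = prefix_tuning_params(prefix_len=20, hidden_dim=4096, num_layers=32)
print(f"{n:,} trainable parameters ({100 * n / 7e9:.3f}% of a 7B model)")
```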

Strengths:

Extremely parameter-efficient
Multiple prefixes can serve different tasks with the same base model
No modification to original weights

Weaknesses:

Consumes prefix length tokens from the context window
Less expressive than LoRA for complex adaptations
Optimization can be unstable; often requires reparameterization through an MLP during training

Prompt Tuning

Prompt tuning (Lester et al., 2021) is even simpler: learn a set of continuous embedding vectors prepended to the input at the embedding layer only (not at every attention layer like prefix tuning).

\text{input} = [e_1, e_2, \ldots, e_l, x_1, x_2, \ldots, x_n]

where $e_1, \ldots, e_l$ are learned continuous vectors and $x_1, \ldots, x_n$ are the embedded input tokens.

Trainable parameters: $l \times d$ (just the soft prompt embeddings). For $l=20$ , $d=4096$ : about 82K parameters.

Key finding (Lester et al., 2021): Prompt tuning approaches the performance of full fine-tuning as model size increases. For models above 10B parameters, the gap is often negligible for simple classification tasks.

Company Variation

Google originally developed prompt tuning for multi-task serving at scale. When you have hundreds of tasks sharing the same base model, prompt tuning lets you store one model copy with hundreds of tiny task-specific prompt vectors, swapping them at inference. This is the main production use case.

Adapters

Adapter modules (Houlsby et al., 2019) insert small bottleneck layers between the existing transformer layers.

Architecture

Each adapter consists of:

A down-projection: $W_{\text{down}} \in \mathbb{R}^{m \times d}$ (reduces dimension from $d$ to $m$)
A nonlinearity: ReLU or GELU
An up-projection: $W_{\text{up}} \in \mathbb{R}^{d \times m}$ (projects back to $d$)
A residual connection

\text{Adapter}(x) = x + W_{\text{up}} \cdot \sigma(W_{\text{down}} \cdot x)

where $m \ll d$ is the bottleneck dimension (typically $m = d/64$ to $d/8$ ).
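A toy adapter in plain Python, assuming $W_{\text{down}}$ maps a $d$-dimensional input to $m$ dimensions (stored as $m$ rows of length $d$) and $W_{\text{up}}$ maps back (stored as $d$ rows of length $m$). The sizes and values are made up for illustration, and ReLU stands in for whichever nonlinearity is used.

```python
def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(a * b for a, b in zip(row, x)) for row in M]

def relu(v):
    return [max(0.0, t) for t in v]

def adapter(x, W_down, W_up):
    """x + W_up * relu(W_down * x), i.e. a bottleneck with a residual connection."""
    hidden = relu(matvec(W_down, x))          # d -> m
    update = matvec(W_up, hidden)             # m -> d
    return [xi + ui for xi, ui in zip(x, update)]

d, m = 4, 1
x = [1.0, -2.0, 0.5, 3.0]
W_down = [[0.5, 0.0, 0.0, 0.5]]               # m x d
W_up_zero = [[0.0] for _ in range(d)]         # d x m
# With a zero up-projection the adapter reduces to the identity
print(adapter(x, W_down, W_up_zero))
```

Unlike the LoRA update, the nonlinearity between the two projections means this block cannot be folded into an existing weight matrix.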

Serial adapters (Houlsby): placed after both the attention and FFN sublayers. Parallel adapters (He et al., 2022): placed alongside attention/FFN, which can reduce sequential latency.

Tradeoff vs. LoRA: Adapters add inference latency because they introduce new sequential computation. LoRA adapters can be merged into the base weights ( $W = W_0 + BA$ ) for zero-latency inference. This is why LoRA has largely replaced adapters in practice.

PEFT Method Comparison

Method	Params Trained	Inference Overhead	Merges into Base?	Multi-Task Serving	Quality (vs Full FT)
LoRA	0.1-1%	None (after merge)	Yes	Swap adapters	95-100%
QLoRA	0.1-1%	Quantization overhead	Yes (after dequant)	Swap adapters	92-98%
Prefix Tuning	0.05-0.1%	Prefix tokens use context	No	Swap prefixes	85-95%
Prompt Tuning	~0.001%	Soft prompt tokens	No	Swap prompts	80-95% (scale dependent)
Adapters	1-5%	Extra sequential layers in every forward pass	No	Swap adapters	93-99%

Instant Rejection

Saying "LoRA and adapters are the same thing" will immediately signal lack of depth. LoRA modifies existing weight matrices via low-rank additive updates and can be merged for zero inference overhead. Adapters insert entirely new bottleneck layers that add sequential computation. They are architecturally different methods with different tradeoffs.

Instruction Tuning and Alignment Fine-Tuning

Fine-tuning is not just about domain adaptation. Two of the most important applications reshape the model's fundamental behavior:

Instruction Tuning

Trains the model to follow instructions by fine-tuning on (instruction, response) pairs.

Key datasets:

FLAN (Google): 1.8K tasks converted to instruction format
Alpaca (Stanford): 52K instruction-response pairs generated with OpenAI's text-davinci-003
Open Assistant: Human-generated multi-turn conversations
ShareGPT: Real user conversations with ChatGPT

Why it matters: A base model predicts the next token. It does not inherently know that when you ask a question, you want an answer (not a continuation of the question). Instruction tuning bridges this gap.

Alignment Fine-Tuning

After instruction tuning, alignment further refines the model to be helpful, harmless, and honest. This typically involves RLHF or DPO (covered in the next chapter) but the supervised fine-tuning (SFT) stage is part of the fine-tuning pipeline:

[Figure: Alignment Pipeline]

Data Preparation

Data quality determines fine-tuning quality. A model trained on 1,000 carefully curated examples often outperforms one trained on 50,000 noisy examples.

Data Formatting

Different frameworks expect different formats:

Alpaca format (single-turn):

{
  "instruction": "Summarize the following contract clause.",
  "input": "The Seller hereby represents and warrants...",
  "output": "This clause establishes the seller's representations..."
}

ShareGPT format (multi-turn):

{
  "conversations": [
    {"from": "human", "value": "What is a force majeure clause?"},
    {"from": "gpt", "value": "A force majeure clause excuses..."},
    {"from": "human", "value": "Can you give an example?"},
    {"from": "gpt", "value": "A common example..."}
  ]
}

ChatML format (OpenAI-style):

<|im_start|>system
You are a legal contract analyst.<|im_end|>
<|im_start|>user
Summarize this clause: ...<|im_end|>
<|im_start|>assistant
This clause establishes...<|im_end|>
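Converting between these formats is mostly string assembly. The sketch below turns a ShareGPT-style record into ChatML text; the function name and the role mapping are assumptions for illustration, and in practice the tokenizer's own chat template (for example via `apply_chat_template` in Hugging Face tokenizers) should be preferred so the special tokens match the model.

```python
def sharegpt_to_chatml(record, system_prompt=None):
    """Render a ShareGPT-style conversation as a ChatML string."""
    role_map = {"human": "user", "gpt": "assistant", "system": "system"}
    parts = []
    if system_prompt:
        parts.append(f"<|im_start|>system\n{system_prompt}<|im_end|>")
    for turn in record["conversations"]:
        role = role_map[turn["from"]]
        parts.append(f"<|im_start|>{role}\n{turn['value']}<|im_end|>")
    return "\n".join(parts)

record = {
    "conversations": [
        {"from": "human", "value": "What is a force majeure clause?"},
        {"from": "gpt", "value": "A force majeure clause excuses..."},
    ]
}
chatml = sharegpt_to_chatml(record, system_prompt="You are a legal contract analyst.")
print(chatml)
```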

Data Quality Checklist

Check	Why	How
Deduplication	Duplicates cause memorization, not generalization	Exact match + near-duplicate detection (MinHash, SimHash)
Format consistency	Inconsistent formats confuse the model	Validate all examples against a schema
Response quality	Garbage in, garbage out	Human review of a random sample (at least 5-10%)
Length distribution	Very long or very short examples skew the model	Plot histogram, trim outliers
Label accuracy	Wrong labels teach wrong behavior	Inter-annotator agreement check (Cohen's $\kappa > 0.8$ )
Diversity	Repetitive data causes mode collapse	Cluster embeddings, ensure coverage
Toxicity/PII screening	Legal and safety risk	Run through content filters before training
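As a sketch of the deduplication row, the snippet below drops exact and near duplicates using exact Jaccard similarity over word 3-grams. Note that this is not MinHash: it compares every pair directly, which is fine for a few thousand examples, whereas MinHash or SimHash approximate the same similarity so it scales to much larger sets. The function names and the 0.8 threshold are assumptions for illustration.

```python
def shingles(text, n=3):
    """Set of lowercase word n-grams for a piece of text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def jaccard(a, b):
    return len(a & b) / len(a | b) if (a or b) else 1.0

def dedupe(examples, threshold=0.8):
    """Keep the first occurrence of each group of near-duplicate texts."""
    kept, kept_shingles = [], []
    for text in examples:
        s = shingles(text)
        if all(jaccard(s, other) < threshold for other in kept_shingles):
            kept.append(text)
            kept_shingles.append(s)
    return kept

examples = [
    "The Seller shall indemnify the Buyer against all losses.",
    "THE SELLER SHALL INDEMNIFY THE BUYER AGAINST ALL LOSSES.",
    "Either party may terminate this agreement upon thirty days notice.",
]
unique = dedupe(examples)
print(len(examples), "->", len(unique))
```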

How Much Data Do You Need?

Task	Typical Data Size	Notes
Format/style transfer	100-500 examples	The model already knows the content; you are teaching format
Classification	500-5,000 examples	Per class; balance matters
Domain adaptation	1,000-10,000 examples	Quality over quantity
Instruction following	5,000-50,000 examples	Diverse instructions matter more than volume
Full behavior change	50,000-500,000+ examples	Approaching pretraining-scale fine-tuning

60-Second Answer

"For most PEFT fine-tuning, I target 1,000-5,000 high-quality examples. I spend more time on data curation than on hyperparameter tuning - deduplication, format validation, human review of a random sample, and ensuring diversity across the input distribution. A common mistake is collecting 50,000 noisy examples when 2,000 clean ones would produce a better model."

Hyperparameter Selection

The Essential Hyperparameters

Hyperparameter	Typical Range	Impact	Notes
Learning rate	$1 \times 10^{-5}$ to $2 \times 10^{-4}$ (PEFT)	Critical	Too high = forgetting; too low = no learning
Epochs	1-5	High	More epochs = more forgetting risk
Batch size	4-128 (effective, with gradient accumulation)	Moderate	Larger = more stable gradients, but diminishing returns
Warmup ratio	0.03-0.1	Moderate	Prevents early gradient spikes
Weight decay	0.01-0.1	Low-Moderate	Regularization against overfitting
LoRA rank ( $r$ )	8-64	Moderate	Capacity of adaptation
LoRA alpha ( $\alpha$ )	$2r$ (common default)	Low	Scaling factor, usually set to $2 \times r$
Max sequence length	512-8192	High (memory)	Longer = more memory, must match data

Training Configuration Example

from trl import SFTConfig, SFTTrainer

# SFTConfig extends transformers.TrainingArguments with SFT-specific options.
# Argument names vary across TRL releases (e.g. newer versions rename
# max_seq_length to max_length), so check the version you have installed.
training_args = SFTConfig(
    output_dir="./output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,       # effective batch size = 32 on one GPU
    learning_rate=2e-4,                  # higher LR for PEFT
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    weight_decay=0.01,
    bf16=True,                           # bfloat16 mixed precision
    logging_steps=10,
    eval_strategy="steps",
    eval_steps=100,
    save_strategy="steps",
    save_steps=100,                      # must be a multiple of eval_steps
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    gradient_checkpointing=True,         # saves memory
    report_to="wandb",
    max_seq_length=2048,
    packing=True,                        # pack short examples together
)

trainer = SFTTrainer(
    model=model,                         # the PEFT-wrapped model from above
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
trainer.train()

Evaluation

Training Metrics

Training loss vs. evaluation loss is the most important diagnostic:

Pattern	Diagnosis	Action
Both decreasing together	Healthy training	Continue
Train loss decreasing, eval loss flat or increasing	Overfitting	Stop training, reduce LR, add regularization, get more data
Both flat from the start	Learning rate too low or data too noisy	Increase LR, check data quality
Both oscillating	Learning rate too high or batch too small	Decrease LR, increase batch size
Train loss spikes then recovers	Normal warmup, or bad data batch	Check if spike is during warmup (ok) or later (investigate)
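The overfitting row is the one most often acted on automatically. Below is a minimal sketch of patience-based early stopping over a list of evaluation losses; the function name, the patience value, and the loss numbers are made up for illustration.

```python
def best_checkpoint(eval_losses, patience=3):
    """Return (index of the lowest eval loss seen, index at which training stops)."""
    best_idx, best = 0, float("inf")
    for i, loss in enumerate(eval_losses):
        if loss < best:
            best_idx, best = i, loss
        elif i - best_idx >= patience:
            return best_idx, i            # no improvement for `patience` evaluations
    return best_idx, len(eval_losses) - 1

# Illustrative curve: eval loss improves, then climbs while training continues
losses = [1.20, 0.95, 0.82, 0.80, 0.83, 0.88, 0.95, 1.02]
best_idx, stop_idx = best_checkpoint(losses)
print(f"keep checkpoint {best_idx}, stop at evaluation {stop_idx}")
```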

Beyond Loss: Task-Specific Evaluation

Training loss alone is insufficient. You must evaluate on the actual task:

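# Example: evaluating a fine-tuned model on domain-specific QA
import torch
from datasets import load_dataset

# "your_eval_set" is a placeholder; rows are assumed to have "prompt" and "expected" fields
eval_prompts = load_dataset("your_eval_set", split="test")

correct = 0
for example in eval_prompts:
    # generate() takes token IDs, not raw strings; tokenizer is the one loaded with the model
    inputs = tokenizer(example["prompt"], return_tensors="pt").to(model.device)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=256)
    # Decode only the newly generated tokens, not the echoed prompt
    new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
    output = tokenizer.decode(new_tokens, skip_special_tokens=True)
    # Task-specific scoring; task_metric is a user-supplied function returning True/False
    if task_metric(output, example["expected"]):
        correct += 1

accuracy = correct / len(eval_prompts)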

Benchmark suites for general capability regression:

MMLU: General knowledge across 57 subjects
HellaSwag: Commonsense reasoning
GSM8K: Math reasoning
HumanEval: Code generation

Monitor these alongside your task metric to detect catastrophic forgetting.

Instant Rejection

Only reporting training loss as your evaluation strategy will end the conversation. Interviewers want to hear about task-specific metrics (accuracy, F1, exact match), held-out test sets, and monitoring for capability regression on general benchmarks. Training loss tells you the optimization is working; it does not tell you the model is useful.

When NOT to Fine-Tune

Fine-tuning is expensive, time-consuming, and risky. Here is a decision framework for when to avoid it:

Situation	Why Not Fine-Tune	Alternative
Knowledge updates frequently	You would need to retrain constantly	RAG
Need citations/traceability	Fine-tuned knowledge is opaque	RAG
Limited labeled data (< 50 examples)	Risk of overfitting; insufficient signal	Few-shot prompting
Task is already well-served by prompting	Unnecessary complexity	Prompt engineering
Base model capability is the bottleneck	Fine-tuning cannot add capabilities the base model lacks	Use a larger base model
You need deterministic outputs	Fine-tuning does not guarantee exact format compliance	Constrained decoding (outlines, LMQL)
Rapid prototyping phase	Fine-tuning slows iteration	Prompting + RAG

Company Variation

At large tech companies (Google, Meta, OpenAI), fine-tuning is routine because they have the data, compute, and infrastructure. At startups, the default should be prompting + RAG first, fine-tuning only when you have proven that prompt engineering hits a ceiling. The business case for fine-tuning must justify the ongoing maintenance cost (retraining on data updates, tracking base model updates, managing adapter versions).

Multi-Task Fine-Tuning and Continual Learning

Multi-Task Fine-Tuning

Training on multiple tasks simultaneously to create a versatile model:

Approach 1: Data mixing. Combine datasets from different tasks, prefixed with task instructions:

[Summarize] The contract states that...
[Extract clauses] The agreement between...
[Classify risk] Under section 4.2...
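A minimal sketch of building such a mixed dataset, assuming each task's examples are held in a plain list of strings. The function name is illustrative, and real pipelines usually also balance the number of examples per task.

```python
import random

def mix_tasks(task_datasets, seed=0):
    """Prefix each example with its task tag, then shuffle the tasks together."""
    mixed = [
        f"[{task}] {text}"
        for task, texts in task_datasets.items()
        for text in texts
    ]
    random.Random(seed).shuffle(mixed)    # avoid long runs of a single task
    return mixed

task_datasets = {
    "Summarize": ["The contract states that..."],
    "Extract clauses": ["The agreement between..."],
    "Classify risk": ["Under section 4.2..."],
}
mixed = mix_tasks(task_datasets)
for line in mixed:
    print(line)
```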

Approach 2: Task-specific LoRA adapters. Train separate LoRA adapters per task, share the same base model. At inference, load the adapter for the requested task. This prevents task interference.

Continual Learning

When you need to fine-tune on new data without forgetting previous fine-tuning:

Replay buffer: Keep a small representative sample from previous tasks and mix into new training data
Elastic Weight Consolidation (EWC): Add a regularization term that penalizes changes to parameters that were important for previous tasks:

\mathcal{L}(\theta) = \mathcal{L}_{\text{new}}(\theta) + \frac{\lambda}{2} \sum_i F_i (\theta_i - \theta_{i}^{*})^2

where $F_i$ is the $i$-th diagonal entry of the Fisher information matrix (the importance of parameter $i$), $\theta^{*}$ are the weights learned on the previous task, and $\lambda$ sets the strength of the penalty.

Progressive LoRA: Train a new LoRA adapter for each task, stack or merge them. Avoids overwriting previous adapters.
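The EWC objective above translates directly into code. This sketch works on flat lists of parameters with made-up numbers; the function name is illustrative, and in a real setup the Fisher diagonal would be estimated from squared gradients on the previous task's data.

```python
def ewc_loss(new_task_loss, theta, theta_star, fisher, lam):
    """New-task loss plus (lambda / 2) * sum_i F_i * (theta_i - theta_star_i)^2."""
    penalty = sum(f * (t - ts) ** 2 for f, t, ts in zip(fisher, theta, theta_star))
    return new_task_loss + 0.5 * lam * penalty

theta_star = [0.0, 2.0]     # weights after the previous task
theta = [1.0, 2.0]          # current weights: only the first one has moved
fisher = [4.0, 10.0]        # the second weight matters more, but it has not changed
total = ewc_loss(new_task_loss=1.0, theta=theta, theta_star=theta_star,
                 fisher=fisher, lam=0.5)
print(total)
```

Only parameters that both moved and carry high Fisher weight contribute to the penalty, which is what lets unimportant weights adapt freely.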

Production Considerations

Merging LoRA Weights

For production inference, merge the LoRA weights into the base model to eliminate adapter overhead:

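# Merge LoRA into base model (model is the trained PeftModel)
# With a QLoRA-trained adapter, it is common to reload the base model in 16-bit and attach the adapter before merging
merged_model = model.merge_and_unload()

# Save merged model, plus the tokenizer loaded alongside the base model
merged_model.save_pretrained("./merged_model")
tokenizer.save_pretrained("./merged_model")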

# Now serve as a standard model - no PEFT dependency needed

After merging:

No additional latency vs. the base model
No PEFT library dependency in the inference stack
The adapter is permanently baked into the weights

Serving Multiple Adapters

When you need multiple specialized versions of the same model (e.g., one per customer or per task):

[Figure: Multi-Adapter Serving]

Frameworks like LoRAX, S-LoRA, and Punica enable serving hundreds of LoRA adapters from a single base model with batch-level adapter switching. This is how multi-tenant LLM platforms work.

A/B Testing Fine-Tuned Models

Deploy both the current model (control) and the fine-tuned model (treatment)
Route a percentage of traffic (5-20%) to the treatment
Measure task-specific metrics (accuracy, format compliance, user satisfaction)
Also measure regression metrics (latency, error rate, capability on general tasks)
Gradually increase traffic if metrics improve, roll back if not
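One common way to implement the traffic split is deterministic hashing, so a given user always sees the same variant. The sketch below assumes string user IDs; the function name and the 10% default are illustrative choices.

```python
import hashlib

def assign_variant(user_id, treatment_fraction=0.10):
    """Deterministically map a user to 'treatment' or 'control'."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 2**32     # uniform value in [0, 1)
    return "treatment" if bucket < treatment_fraction else "control"

assignments = [assign_variant(f"user-{i}") for i in range(10_000)]
share = assignments.count("treatment") / len(assignments)
print(f"treatment share: {share:.3f}")
```

Raising `treatment_fraction` later keeps every user already in the treatment group there, which makes the gradual ramp-up step clean.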

Version Management

Track fine-tuning runs like software releases:

Artifact	What to Store	Why
Training data snapshot	Versioned dataset (hash + location)	Reproducibility
Hyperparameters	Full config file	Reproducibility
Adapter weights	LoRA adapter checkpoint	Deployment
Base model version	Exact model ID and hash	Compatibility
Evaluation results	Metrics on test set + benchmarks	Regression detection
Training logs	Loss curves, learning rate schedule	Debugging
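A lightweight way to capture most of this table is a manifest written next to each adapter checkpoint. The sketch below is one possible shape, not a standard: the field names and helper functions are assumptions, and the metrics are left empty rather than invented.

```python
import hashlib
import json

def dataset_fingerprint(records):
    """SHA-256 over a canonical JSON serialization of each record, in order."""
    digest = hashlib.sha256()
    for record in records:
        digest.update(json.dumps(record, sort_keys=True).encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()

def build_manifest(records, hyperparameters, base_model_id, eval_metrics):
    return {
        "dataset_sha256": dataset_fingerprint(records),
        "num_examples": len(records),
        "hyperparameters": hyperparameters,
        "base_model": base_model_id,
        "eval_metrics": eval_metrics,
    }

records = [{"instruction": "Summarize the clause.", "output": "..."}]
manifest = build_manifest(
    records,
    hyperparameters={"r": 16, "lora_alpha": 32, "learning_rate": 2e-4},
    base_model_id="meta-llama/Llama-3.1-8B",
    eval_metrics={"eval_loss": None},     # fill in after evaluation
)
print(json.dumps(manifest, indent=2))
```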

Practice Problems

Problem 1: Choosing a Fine-Tuning Strategy

Your company has a customer support chatbot powered by GPT-4. It handles 10,000 queries/day. You want to reduce costs by switching to a fine-tuned Llama 3 8B model. You have 15,000 labeled customer support conversations. You have two A100 40GB GPUs available. Design the fine-tuning strategy.

Hint 1 - Direction

Consider what the model needs to learn: domain vocabulary (product names, policies), output format (structured responses with links), and conversational tone. This is a behavior change, not a knowledge change - fine-tuning is appropriate. Think about which PEFT method fits your hardware.

Hint 2 - Insight

With 2x A100 40GB, you can fit Llama 3 8B in FP16 (16 GB per GPU) with room for LoRA training. QLoRA is unnecessary - you have enough memory for standard LoRA. The 15K conversations should be filtered for quality. Multi-turn format is important since customer support is conversational.

Hint 3 - Full Solution

Strategy: LoRA fine-tuning on Llama 3 8B

Step 1: Data preparation

Filter 15K conversations: remove short/unhelpful ones, keep ~10K high-quality examples
Deduplicate (MinHash) before splitting, so near-duplicates cannot leak between train and test
Format in ChatML or ShareGPT format (multi-turn)
Validate format consistency with a schema check
Split: 9K train, 500 validation, 500 test

Step 2: Training configuration

Method: LoRA ( $r=32$ , $\alpha=64$ , target all attention + MLP)
Learning rate: $2 \times 10^{-4}$ with cosine schedule
Epochs: 3 (monitor eval loss for early stopping)
Effective batch size: 32 (per_device=4, gradient_accumulation=4, 2 GPUs)
Max sequence length: 2048 (typical customer support conversation length)
Gradient checkpointing: enabled
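A quick sanity check on this configuration: how many optimizer steps does it imply? The helper below is an illustrative sketch that assumes one example per sequence; with sequence packing enabled, the real step count would be lower.

```python
import math

def training_steps(n_examples, per_device_batch, grad_accum, n_gpus, epochs):
    """Return (effective batch size, total optimizer steps)."""
    effective = per_device_batch * grad_accum * n_gpus
    steps_per_epoch = math.ceil(n_examples / effective)
    return effective, steps_per_epoch * epochs

effective, total_steps = training_steps(
    n_examples=9_000, per_device_batch=4, grad_accum=4, n_gpus=2, epochs=3
)
print(f"effective batch size {effective}, about {total_steps} optimizer steps")
```

Knowing the total step count up front makes it easier to pick sensible warmup, evaluation, and checkpoint intervals.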

Step 3: Evaluation

Task metrics: response quality (human eval on 100 samples), format compliance rate, factual accuracy
Regression check: MMLU subset, commonsense reasoning
Cost analysis: cost per query and tokens per query vs. GPT-4, latency comparison

Step 4: Production deployment

Merge LoRA weights into base model
Serve with vLLM for optimized inference
A/B test: route 10% of traffic to fine-tuned model, compare user satisfaction scores
Set up monitoring for response quality degradation

Scoring rubric:

Grade	Criteria
Strong Hire	Proposes LoRA with justified rank selection, discusses data quality filtering, has a multi-stage evaluation plan (automated + human), includes production rollout strategy with A/B testing.
Lean Hire	Chooses appropriate PEFT method for hardware, mentions data quality, has basic evaluation plan.
No Hire	Suggests full fine-tuning without considering memory, ignores data quality, evaluates only training loss.

Problem 2: Diagnosing Fine-Tuning Failure

After LoRA fine-tuning a Llama 3 70B model on 5,000 legal contract examples, the model correctly extracts contract clauses from examples similar to the training data but produces hallucinated legal terms on out-of-distribution contracts. The training loss converged to 0.3 and evaluation loss to 0.8. What went wrong and how do you fix it?

Hint 1 - Direction

The gap between training loss (0.3) and evaluation loss (0.8) is a clear signal. What does this tell you about the model's generalization?

Hint 2 - Insight

This is textbook overfitting. The model memorized the training data distribution (specific legal terms, contract structures) rather than learning the general skill of clause extraction. The large train/eval loss gap confirms this. Consider what factors contribute: data diversity, training duration, rank selection, regularization.

Hint 3 - Full Solution

Diagnosis: Overfitting to the training distribution

The train-eval loss gap (0.3 vs 0.8) confirms overfitting. The model memorized training-set contract patterns and hallucinates when encountering unfamiliar contracts.

Root cause analysis:

Insufficient data diversity: 5K examples may all be from the same contract types or jurisdictions
Rank too high: If $r$ is large (64+), the adapter has too much capacity and memorizes
Trained too long: May have continued past the optimal epoch
No regularization: Missing dropout or weight decay

Fixes (in order of priority):

1. Data diversity (highest impact):
- Audit training data: How many contract types, jurisdictions, and formats are represented?
- Add more diverse contracts, even if fewer of them are perfectly labeled
- Add negative examples (text that is NOT a clause) to prevent over-triggering
2. Reduce model capacity:
- Lower the LoRA rank from its current value to $r=8$ or $r=16$
- Target only $W_q$ and $W_v$ (not all linear layers)
- Add LoRA dropout (0.1)
3. Training adjustments:
- Train for fewer steps: use the checkpoint with the lowest eval loss, not the final checkpoint
- Reduce learning rate by 2-5x
- Add weight decay (0.05-0.1)
4. Evaluation improvements:
- Create an eval set with intentionally OOD contracts (different types, jurisdictions)
- Track hallucination rate as a first-class metric, not just loss
- Use human evaluation on 50 OOD examples
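
Two of the fixes above are easy to make concrete: picking the checkpoint with the lowest eval loss, and tracking hallucination rate as its own metric. The sketch below is one possible way to do both, not a prescribed implementation. The `Checkpoint` record, the loss history, and the sample contract text are hypothetical. It also assumes that clause extraction is meant to be verbatim, so any extracted span that is absent from the source contract counts as hallucinated.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Checkpoint:
    step: int
    train_loss: float
    eval_loss: float


def best_checkpoint(history: list[Checkpoint]) -> Checkpoint:
    """Return the checkpoint with the lowest eval loss (earliest step on ties)."""
    return min(history, key=lambda c: (c.eval_loss, c.step))


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences are ignored."""
    return " ".join(text.lower().split())


def hallucination_rate(source_text: str, extracted_spans: list[str]) -> float:
    """Fraction of extracted spans that do not appear verbatim in the source.

    Assumption: extraction should copy text from the contract, so a span
    missing from the source is treated as a hallucination.
    """
    if not extracted_spans:
        return 0.0
    source = _normalize(source_text)
    missing = sum(1 for span in extracted_spans if _normalize(span) not in source)
    return missing / len(extracted_spans)


# Hypothetical loss history: train loss keeps falling while eval loss turns up.
history = [
    Checkpoint(step=100, train_loss=1.20, eval_loss=1.10),
    Checkpoint(step=200, train_loss=0.70, eval_loss=0.75),
    Checkpoint(step=300, train_loss=0.40, eval_loss=0.78),
    Checkpoint(step=400, train_loss=0.30, eval_loss=0.80),
]
best = best_checkpoint(history)

# Hypothetical OOD contract and model output.
contract = (
    "The Supplier shall indemnify the Buyer against all third-party claims. "
    "Either party may terminate on 30 days notice."
)
spans = ["Supplier shall indemnify the Buyer", "material adverse change"]
rate = hallucination_rate(contract, spans)

print(f"best step: {best.step}, hallucination rate: {rate:.2f}")
```

A verbatim substring check is deliberately strict. If the task allows paraphrased extractions, a fuzzy match or human review of flagged spans would be a more appropriate test.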

Scoring rubric:

Grade	Criteria
Strong Hire	Identifies overfitting from train/eval gap, proposes data diversity as the primary fix, adjusts model capacity AND training regime, creates OOD evaluation set.
Lean Hire	Identifies overfitting, proposes at least two reasonable fixes.
No Hire	Does not recognize overfitting, suggests "train for more epochs," or proposes only increasing data volume without considering diversity.

Problem 3: Multi-Tenant Fine-Tuning Architecture

You are building an LLM platform where each enterprise customer gets a model fine-tuned on their private data. You have 50 customers, each with 1,000-10,000 training examples. Design a system that serves all 50 customers efficiently without maintaining 50 separate model deployments.

Hint 1 - Direction

Think about parameter-efficient methods where the base model is shared and only the adaptation is customer-specific. Consider how LoRA adapters can be swapped at serving time.

Hint 2 - Insight

One base model + 50 LoRA adapters. Each adapter is small (~10-50 MB for an 8B model with $r=16$). The key challenge is efficient serving: you need to swap adapters per-request without reloading the base model. Look into S-LoRA or LoRAX for batch-level adapter switching.

Hint 3 - Full Solution

Architecture: Shared base model + per-customer LoRA adapters

Figure: Multi-tenant LoRA architecture (shared base model with per-customer adapters)

Training pipeline:

Common base model: Llama 3 8B (or fine-tuned variant for your domain)
Per-customer LoRA training: automated pipeline triggered when customer uploads data
- Standard config: $r=16$, $\alpha=32$, target all attention projections
- Training: 3 epochs, early stopping on eval loss
- Automated quality gate: model must pass basic benchmarks before deployment
Store adapter weights in versioned object storage (S3)

Serving architecture:

Base model: Loaded once per GPU, shared across all customers
Adapter cache: Hot adapters (frequently used) kept in GPU memory; cold adapters loaded on demand from SSD/object storage
Request routing: Each API request includes customer_id; router selects the correct adapter
Batching: Use S-LoRA or Punica for batch-level adapter switching - requests for different customers can be batched together if using the same base model
Hardware: 2-4 GPUs with the base model replicated; adapter switching adds <5ms per request
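
The adapter cache and request routing described above can be sketched as a small LRU cache keyed by `customer_id`. This is an illustrative toy, not how S-LoRA, Punica, or LoRAX implement it. The `AdapterCache` class, the `fake_load` loader, and the customer names are all hypothetical, and plain Python objects stand in for adapter weights in GPU memory.

```python
from collections import OrderedDict
from typing import Callable


class AdapterCache:
    """LRU cache of 'hot' adapters; cold adapters are loaded on demand."""

    def __init__(self, capacity: int, load_fn: Callable[[str], object]):
        self.capacity = capacity
        self.load_fn = load_fn      # hypothetical loader: customer_id -> adapter
        self._hot = OrderedDict()   # customer_id -> adapter, oldest first
        self.cold_loads = 0

    def get(self, customer_id: str) -> object:
        if customer_id in self._hot:
            # Cache hit: mark this adapter as most recently used.
            self._hot.move_to_end(customer_id)
            return self._hot[customer_id]
        # Cache miss: load from SSD / object storage (simulated here).
        adapter = self.load_fn(customer_id)
        self.cold_loads += 1
        self._hot[customer_id] = adapter
        if len(self._hot) > self.capacity:
            # Evict the least recently used adapter to free memory.
            self._hot.popitem(last=False)
        return adapter

    def resident(self) -> list:
        """Customer ids whose adapters are currently hot, oldest first."""
        return list(self._hot)


def fake_load(customer_id: str) -> dict:
    """Stand-in for reading adapter weights from storage."""
    return {"customer_id": customer_id, "weights": f"adapter-for-{customer_id}"}


cache = AdapterCache(capacity=2, load_fn=fake_load)

# Each request carries a customer_id; the router asks the cache for its adapter.
for cid in ["acme", "globex", "acme", "initech"]:
    adapter = cache.get(cid)

print(cache.resident(), cache.cold_loads)
```

In a real deployment the capacity would be set from the GPU memory left after the base model and KV cache, and eviction would have to wait for in-flight requests that still use the adapter.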

Key design decisions:

Use the same LoRA config (rank, target modules) across all customers for serving compatibility
Implement adapter versioning: customers can roll back to previous versions
Data isolation: each customer's training data and adapter weights are strictly isolated
Monitoring: per-customer quality metrics, latency, error rates
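
Adapter versioning with rollback, from the list above, can be modeled as a per-customer list of immutable adapter locations plus a pointer to the active one. The sketch below is a minimal in-memory illustration. The `AdapterRegistry` class and the `s3://adapters/...` paths are hypothetical, and a production system would persist this state and enforce access control per customer.

```python
from __future__ import annotations


class AdapterRegistry:
    """Tracks published adapter versions per customer and which one is active."""

    def __init__(self) -> None:
        self._versions: dict[str, list[str]] = {}  # customer_id -> adapter URIs
        self._active: dict[str, int] = {}          # customer_id -> 1-based version

    def publish(self, customer_id: str, uri: str) -> int:
        """Record a new adapter version and make it the active one."""
        versions = self._versions.setdefault(customer_id, [])
        versions.append(uri)
        self._active[customer_id] = len(versions)
        return len(versions)

    def rollback(self, customer_id: str) -> int:
        """Point the customer back at the previous version."""
        current = self._active[customer_id]
        if current <= 1:
            raise ValueError("no earlier version to roll back to")
        self._active[customer_id] = current - 1
        return current - 1

    def active_uri(self, customer_id: str) -> str:
        """Adapter location to serve; unknown customers raise KeyError."""
        return self._versions[customer_id][self._active[customer_id] - 1]


registry = AdapterRegistry()
registry.publish("acme", "s3://adapters/acme/v1")  # hypothetical storage layout
registry.publish("acme", "s3://adapters/acme/v2")
registry.publish("globex", "s3://adapters/globex/v1")

registry.rollback("acme")  # v2 regressed, so serve v1 again
print(registry.active_uri("acme"), registry.active_uri("globex"))
```

Keying every lookup by `customer_id` keeps one tenant's rollback from touching another tenant's adapter, which is the serving-side half of the data isolation requirement.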

Cost analysis:

Without multi-tenant: 50 x 16 GB (model) = 800 GB GPU memory
With multi-tenant: 16 GB (base) + 1.5 GB (all adapters) = 17.5 GB GPU memory
About 46x reduction in weight memory (800 / 17.5 ≈ 45.7); KV cache and activation memory are extra in both setups
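
The figures in this cost analysis can be checked with a few lines of arithmetic. The sketch below estimates the size of one adapter and then recomputes the memory comparison. It assumes the published Llama 3 8B shapes (hidden size 4096, 32 layers, grouped-query attention with a 1024-wide key/value projection), the standard config from the training pipeline ($r=16$ on all four attention projections), and fp16 storage at 2 bytes per parameter. The 16 GB and 1.5 GB figures are taken from the text as given.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters in one LoRA pair: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


# Assumed Llama 3 8B shapes (labelled assumptions, see lead-in).
HIDDEN = 4096       # model width
KV_DIM = 1024       # key/value projection width under grouped-query attention
LAYERS = 32
RANK = 16
BYTES_PER_PARAM = 2  # fp16 / bf16

per_layer = (
    2 * lora_params(HIDDEN, HIDDEN, RANK)    # W_q and W_o
    + 2 * lora_params(HIDDEN, KV_DIM, RANK)  # W_k and W_v
)
adapter_params = per_layer * LAYERS
adapter_mb = adapter_params * BYTES_PER_PARAM / 1e6

# Memory comparison using the figures from the text.
CUSTOMERS = 50
BASE_GB = 16.0
ALL_ADAPTERS_GB = 1.5

separate_gb = CUSTOMERS * BASE_GB
shared_gb = BASE_GB + ALL_ADAPTERS_GB
reduction = separate_gb / shared_gb

print(f"adapter: {adapter_params:,} params, {adapter_mb:.1f} MB")
print(f"separate: {separate_gb:.0f} GB, shared: {shared_gb:.1f} GB, "
      f"reduction: {reduction:.1f}x")
```

Under these assumptions each adapter lands inside the 10-50 MB range quoted in the hint, and 50 of them total a little under the 1.5 GB budgeted here.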

Scoring rubric:

Grade	Criteria
Strong Hire	Designs shared base + per-customer LoRA, discusses serving frameworks (S-LoRA/LoRAX), includes adapter versioning and data isolation, provides cost/memory analysis.
Lean Hire	Proposes shared base + LoRA adapters, mentions efficient serving.
No Hire	Suggests deploying 50 separate models or does not consider serving efficiency.

Interview Cheat Sheet

Topic	Key Fact	Why It Matters
Fine-tuning vs. prompting vs. RAG	FT changes weights (behavior), RAG adds context (knowledge), prompting changes input	Most common opening question
LoRA math	$W = W_0 + BA$, $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$	Shows you understand the mechanism, not just the API
LoRA rank	$r = 16$ is a good default; 8 for simple, 64 for complex tasks	Demonstrates practical experience
QLoRA key ideas	NF4 quantization + double quantization + paged optimizers	Three innovations, not just "4-bit LoRA"
Catastrophic forgetting	Model forgets pretrained skills during fine-tuning	Main risk; mitigate with low LR, few epochs, data mixing
Learning rate	$10^{-5}$ to $10^{-4}$ for PEFT; 10x lower for full FT	Single most impactful hyperparameter
Data quality	1K clean examples > 50K noisy examples	The answer that separates practitioners from theorists
Merging LoRA	$W = W_0 + BA$ can be precomputed for zero inference overhead	Critical production deployment detail
Multi-tenant serving	One base model + N adapters via S-LoRA/LoRAX	Shows systems design thinking
When NOT to fine-tune	Knowledge updates, < 50 examples, prompting works	Shows judgment, not just technique
Epochs	1-3 for most tasks; more increases forgetting risk	Prevents the "more training = better" trap
Evaluation	Task metrics + general benchmarks + human eval	Loss alone is insufficient
Data formats	Alpaca (single-turn), ShareGPT (multi-turn), ChatML	Practical framework knowledge

Spaced Repetition Checkpoints

Day 0 (Today)

Write the LoRA equation from memory: $W = W_0 + BA$, sizes of $B$ and $A$, and what $r$ controls
Name three differences between LoRA and QLoRA
When should you use RAG instead of fine-tuning? Give three scenarios

Day 3

Compare all five PEFT methods: LoRA, QLoRA, prefix tuning, prompt tuning, adapters
What is catastrophic forgetting? List three mitigation strategies
Calculate the trainable parameters for LoRA with $d=4096$, $r=16$, applied to $W_q$ and $W_v$

Day 7

Design a fine-tuning pipeline for a domain-specific task: data prep, method selection, hyperparameters, evaluation
Explain why LoRA targets $W_q$ and $W_v$ by default, and when you should target more layers
A model fine-tuned for 5 epochs has train loss 0.2 and eval loss 1.1. Diagnose and propose fixes

Day 14

Design a multi-tenant serving architecture with shared base model and per-customer LoRA adapters
Explain the NF4 data type and double quantization in QLoRA - why are these innovations important?
Walk through a production fine-tuning deployment: merging weights, A/B testing, version management, monitoring

Day 21

Teach someone the complete fine-tuning decision framework: prompting to RAG to PEFT to full fine-tuning
Given a novel task, select the optimal PEFT method and justify your choice with memory, quality, and serving tradeoffs
Design a continual learning strategy for a model that must adapt to new domains without forgetting previous ones
