Fine-Tuning Strategies
Reshaping the weights: how to teach a foundation model new tricks without breaking the ones it already knows.
Reading time: ~30 min | Interview relevance: Critical | Roles: MLE, AI Eng, Research Eng, Data Scientist
The Real Interview Moment
You are forty minutes into an ML engineer interview at a Series B startup building an AI-powered contract analysis platform. The interviewer, the head of ML, slides her laptop toward you:
"Our base Llama 3 70B model extracts contract clauses reasonably well with few-shot prompting, but the output format is inconsistent and it misses domain-specific terms like 'indemnification carve-out' and 'material adverse change.' We have 8,000 labeled examples of contract clause extraction. We tried full fine-tuning on 4 A100s but the model started hallucinating boilerplate language that was not in the source contract. How would you approach this?"
She is not asking you to recite LoRA's paper. She wants you to diagnose the failure mode (catastrophic forgetting plus overfitting), propose a parameter-efficient strategy that fits their compute budget, discuss data quality, and articulate a production deployment plan. This chapter gives you every piece you need.
Why Fine-Tuning Matters
LLMs arrive pretrained on trillions of tokens, encoding broad world knowledge and language understanding. But they do not know your task. There are three ways to bridge this gap:
| Approach | What It Changes | Best For | Limitations |
|---|---|---|---|
| Prompt engineering | The input, not the model | Quick iteration, general tasks | Context window limits, fragile, costly per-query |
| RAG | The context available at inference | Knowledge-intensive tasks, frequently updated data | Adds retrieval latency, does not change model behavior |
| Fine-tuning | The model weights | Teaching new behavior, formats, domain reasoning | Requires training data and compute, risk of forgetting |
"Fine-tuning modifies the model's weights to internalize a new behavior, domain vocabulary, or output format. Unlike prompting, the knowledge is baked into the parameters, so it does not consume context window tokens at inference. Unlike RAG, it changes how the model reasons, not just what information it can see. The tradeoff is that you need labeled data and compute, and you risk catastrophic forgetting if not careful."
The Decision Hierarchy
Always start simple and escalate only when necessary: prompt engineering first, then RAG, and fine-tuning only once those hit a ceiling.
The Fine-Tuning Spectrum
Not all fine-tuning is created equal. The methods exist on a spectrum from modifying every parameter to touching none of the original weights:
| Method | Parameters Trained | Memory Needed | Typical Use Case |
|---|---|---|---|
| Full fine-tuning | All (billions) | 4-16x model size | Large data + large compute budgets |
| LoRA | 0.1-1% of params | ~1.2-1.5x model size | Production standard for most teams |
| QLoRA | 0.1-1% of params | ~0.3-0.5x model size | When GPU memory is the constraint |
| Prefix tuning | Virtual token embeddings | ~1.01x model size | Multi-tenant serving |
| Prompt tuning | Soft prompt vectors | ~1.001x model size | Lightweight task adaptation |
| Adapters | Bottleneck layers | ~1.05-1.1x model size | Modular, composable skills |
Full Fine-Tuning
Full fine-tuning updates every parameter in the model. For a 70B parameter model with FP16 precision, this requires:
- Model weights: $70\text{B} \times 2$ bytes $= 140$ GB
- Gradients: Another 140 GB
- Optimizer states (Adam): $2 \times 140$ GB $= 280$ GB (momentum + variance)
- Activations: Variable, often 50-200+ GB depending on batch size and sequence length
- Total: ~600-800 GB across GPUs
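The breakdown above can be sketched as a small calculator. This is a rough estimate only: it assumes gradients and both Adam states are stored at the same precision as the weights (many mixed-precision setups keep FP32 optimizer states, which raises the total), and the activation figure is a number you supply yourself. The function name and its parameters are illustrative, not from any library.

```python
def full_finetune_memory_gb(
    n_params: float,
    bytes_per_param: int = 2,     # FP16/BF16 weights
    optimizer_states: int = 2,    # Adam: momentum + variance
    activations_gb: float = 0.0,  # workload-dependent; supply your own estimate
) -> dict:
    """Rough memory breakdown in GB (1 GB = 1e9 bytes) for full fine-tuning."""
    weights = n_params * bytes_per_param / 1e9
    gradients = weights                      # one gradient per weight, same precision
    optimizer = optimizer_states * weights   # assumes states share the weight precision
    total = weights + gradients + optimizer + activations_gb
    return {
        "weights": weights,
        "gradients": gradients,
        "optimizer": optimizer,
        "activations": activations_gb,
        "total": total,
    }


# 70B parameters with an assumed 100 GB of activations
breakdown = full_finetune_memory_gb(70e9, activations_gb=100.0)
for name, gb in breakdown.items():
    print(f"{name:>12}: {gb:7.1f} GB")
```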
When Full Fine-Tuning Is Appropriate
- You have massive, high-quality data (100K+ examples) and the budget for multi-GPU training
- You need deep behavioral changes - not just format, but fundamentally different reasoning patterns
- You are building a foundation model for a specific domain (e.g., Bloomberg's BloombergGPT for finance)
- The base model is small (under 3B parameters), where full fine-tuning is practical on a single GPU
Catastrophic Forgetting
The central risk of full fine-tuning: the model forgets its pretrained capabilities as it overfits to the fine-tuning distribution.
Symptoms:
- Model becomes excellent at the fine-tuned task but loses general language ability
- Generates repetitive or degenerate text outside the fine-tuning domain
- Fails at tasks it previously handled well (e.g., basic math, common knowledge)
Mitigations:
- Low learning rate: Use a learning rate 10-100x smaller than the one used in pretraining
- Short training: 1-3 epochs is usually sufficient; more epochs increases forgetting
- Data mixing: Include a percentage of general-purpose data alongside task-specific data
- Regularization: Weight decay, dropout, or elastic weight consolidation (EWC)
- Evaluation on held-out general benchmarks: Track MMLU or similar alongside your task metric
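As one illustration of the data-mixing mitigation, the sketch below blends a general-purpose pool into the task data at a target fraction. The 10% default, the function name, and the record layout are assumptions for the example; the right ratio is something to tune against your own regression benchmarks.

```python
import random


def mix_datasets(task_examples, general_examples, general_fraction=0.1, seed=0):
    """Return a shuffled list in which roughly `general_fraction` of the
    examples come from the general-purpose pool."""
    if not 0.0 <= general_fraction < 1.0:
        raise ValueError("general_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Solve n_general / (n_task + n_general) = general_fraction for n_general.
    n_general = round(len(task_examples) * general_fraction / (1.0 - general_fraction))
    n_general = min(n_general, len(general_examples))
    mixed = list(task_examples) + rng.sample(list(general_examples), n_general)
    rng.shuffle(mixed)
    return mixed


# Hypothetical records: 900 task examples plus a larger general pool
task = [{"source": "task", "id": i} for i in range(900)]
general = [{"source": "general", "id": i} for i in range(5000)]
mixed = mix_datasets(task, general, general_fraction=0.1)
```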
Many candidates suggest "just train for more epochs" when fine-tuning performance is poor. Beyond 3-5 epochs on a typical dataset, you are almost certainly overfitting and inducing catastrophic forgetting. The fix is usually better data quality, not more training.
Learning Rate Selection
The learning rate is the single most important hyperparameter for fine-tuning:
| Model Size | Recommended LR Range | Notes |
|---|---|---|
| < 1B | to | Can tolerate slightly higher LR |
| 1-10B | to | Standard range |
| 10-70B | to | Lower is safer |
| > 70B | to | Very conservative |
Use a cosine learning rate schedule with a linear warmup of 3-10% of total steps. The warmup prevents early gradient spikes that can destabilize the pretrained weights.
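A minimal sketch of such a schedule, written from scratch rather than taken from any framework (training libraries ship their own schedulers, whose exact step conventions may differ). The function name and the `min_lr` parameter are assumptions for illustration.

```python
import math


def lr_at_step(step, total_steps, peak_lr, warmup_ratio=0.05, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Ramp linearly so the first updates are small
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Inspect a few points of a 1,000-step schedule with a 5% warmup
for s in (0, 25, 50, 525, 1000):
    print(s, lr_at_step(s, total_steps=1000, peak_lr=1e-5))
```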
Parameter-Efficient Fine-Tuning (PEFT)
PEFT methods train a small number of new or modified parameters while keeping the original model weights frozen. This dramatically reduces memory requirements, training time, and the risk of catastrophic forgetting.
LoRA: Low-Rank Adaptation
LoRA is the dominant PEFT method in production. It was introduced by Hu et al. (2021) and is based on a simple but powerful observation: the weight updates during fine-tuning have low intrinsic rank.
The Core Idea
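Instead of updating a weight matrix $W_0 \in \mathbb{R}^{d \times k}$ directly, LoRA freezes $W_0$ and adds a low-rank decomposition:
$$W = W_0 + \Delta W = W_0 + BA$$
where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$, with rank $r \ll \min(d, k)$.
For a typical transformer layer where $d = 4096$ and $k = 4096$, with $r = 16$:
- Original $W_0$: $4096 \times 4096 \approx 16.7$M parameters (frozen)
- LoRA $B$: $4096 \times 16 = 65{,}536$ parameters (trainable)
- LoRA $A$: $16 \times 4096 = 65{,}536$ parameters (trainable)
- Total trainable: 131K vs. 16.7M = 0.78% of original
The forward pass becomes:
$$h = W_0 x + \frac{\alpha}{r} B A x$$
where $\alpha$ is a scaling factor (typically set equal to $r$ so $\alpha / r = 1$, or tuned independently).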
Initialization
- $A$ is initialized with a random Gaussian distribution
- $B$ is initialized to zero
- This means $\Delta W = BA = 0$ at the start of training, so the model begins from the pretrained weights
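To make the forward pass and the initialization concrete, here is a dependency-free sketch using plain Python lists. The tiny dimensions and the helper names (`matvec`, `lora_forward`) are illustrative assumptions; real implementations use tensor libraries. With `B` set to zero, the LoRA output matches the frozen layer exactly.

```python
import random


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]


def lora_forward(W0, A, B, x, alpha, r):
    """h = W0 x + (alpha / r) * B (A x), with W0 frozen."""
    base = matvec(W0, x)
    update = matvec(B, matvec(A, x))  # never materializes the d x k product BA
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


rng = random.Random(0)
d, k, r, alpha = 6, 5, 2, 4
W0 = [[rng.gauss(0, 1) for _ in range(k)] for _ in range(d)]    # frozen weights
A = [[rng.gauss(0, 0.01) for _ in range(k)] for _ in range(r)]  # Gaussian init
B = [[0.0] * r for _ in range(d)]                               # zero init
x = [rng.gauss(0, 1) for _ in range(k)]

# At initialization the adapter contributes nothing
h_start = lora_forward(W0, A, B, x, alpha, r)
```

Computing `B (A x)` instead of `(B A) x` keeps the cost proportional to `r`, which is the reason the low-rank form is cheap during training.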
Rank Selection
The rank $r$ controls the capacity of the adaptation:
| Rank | Trainable Params (per layer) | Use Case |
|---|---|---|
| 4-8 | ~33-65K | Simple format changes, style transfer |
| 16-32 | ~131-262K | Most tasks (instruction tuning, classification) |
| 64-128 | ~524K-1M | Complex domain adaptation |
| 256+ | ~2M+ | Approaching full fine-tuning territory |
Rule of thumb: Start with $r = 16$. If the model underfits, increase to 32 or 64. If it overfits, decrease to 8. For most tasks, performance plateaus well before the highest ranks in the table above.
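The parameter counts in the table follow directly from the shapes of the two LoRA factors. A small sketch, assuming a single square $4096 \times 4096$ projection per adapted matrix (the function name is hypothetical):

```python
def lora_params_per_matrix(d: int, k: int, r: int) -> int:
    """Trainable parameters for one adapted d x k matrix: B is d x r, A is r x k."""
    return d * r + r * k


d = k = 4096
full = d * k  # parameters in the frozen matrix
table = {r: lora_params_per_matrix(d, k, r) for r in (4, 8, 16, 32, 64, 128, 256)}
fractions = {r: n / full for r, n in table.items()}

for r, n in table.items():
    print(f"r={r:>3}: {n:>9,} trainable ({fractions[r]:.2%} of the frozen matrix)")
```

Because the count grows linearly in `r`, doubling the rank doubles adapter size and optimizer memory for the adapted layers, which is worth weighing against the usually small quality gain.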
Which Layers to Target
Not all layers benefit equally from LoRA:
| Target | Common Choice | Reasoning |
|---|---|---|
| Query + Value projections | Default in most frameworks | $W_q$ and $W_v$ capture attention patterns; most efficient |
| All attention projections | $W_q$, $W_k$, $W_v$, $W_o$ | Better for complex tasks |
| Attention + MLP | All linear layers | Highest quality but more parameters |
| Only MLP | Rare | Sometimes useful for knowledge injection |
Research finding (Hu et al., 2021): Under a fixed parameter budget, adapting $W_q$ and $W_v$ together outperforms adapting either alone and performs comparably to spreading the same budget across all four attention matrices at a lower rank.
"LoRA freezes the pretrained weights and injects trainable low-rank matrices alongside each target layer. Instead of updating a $d \times k$ weight matrix, we learn two smaller matrices $B$ ($d \times r$) and $A$ ($r \times k$) where $r$ is typically 8-64. This reduces trainable parameters to under 1% of the original model while achieving comparable performance to full fine-tuning on most tasks. At inference, you can merge $BA$ into $W_0$ for zero additional latency."
LoRA in Code
import torch
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM

# Load base model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Configure LoRA
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                        # rank
    lora_alpha=32,               # scaling factor
    lora_dropout=0.05,           # dropout on LoRA layers
    target_modules=[             # which layers to adapt
        "q_proj", "v_proj",      # attention projections
        "k_proj", "o_proj",      # optional: all attention
        "gate_proj", "up_proj",  # optional: MLP layers
        "down_proj",
    ],
    bias="none",                 # don't train biases
)

# Apply LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Prints trainable vs. total parameter counts. The figure depends on target_modules:
# at r=16, the four attention projections alone give ~13.6M trainable params (~0.17%);
# adding the three MLP projections as above gives ~41.9M (~0.52%).
QLoRA: Quantized LoRA
QLoRA (Dettmers et al., 2023) makes fine-tuning accessible on consumer GPUs by combining 4-bit quantization of the base model with LoRA adapters.
Three Key Innovations
- 4-bit NormalFloat (NF4) quantization: A new data type optimized for normally distributed weights. The quantization levels are placed at the quantiles of a standard normal distribution, minimizing quantization error for neural network weights.
- Double quantization: The quantization constants themselves are quantized. Each 64-parameter block has a 32-bit quantization constant, an overhead of 0.5 bits/param. Double quantization cuts this to about 0.127 bits/param, a saving of roughly 0.37 bits/param.
- Paged optimizers: Uses NVIDIA unified memory to automatically page optimizer states between GPU and CPU memory, preventing out-of-memory crashes during gradient checkpointing.
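The double-quantization overhead can be checked with a few lines of arithmetic. The sketch assumes the settings usually quoted for QLoRA: 32-bit constants per 64-parameter block before double quantization, then 8-bit first-level constants with one 32-bit second-level constant per 256 blocks. Treat those second-level values as assumptions if your setup differs; the function names are illustrative.

```python
def quant_constant_overhead_bits(block_size=64, const_bits=32):
    """Per-parameter overhead of storing one constant per block."""
    return const_bits / block_size


def double_quant_overhead_bits(
    block_size=64,
    first_level_bits=8,      # quantized block constants (assumed)
    second_level_bits=32,    # constant for each group of block constants (assumed)
    second_block_size=256,   # block constants per second-level group (assumed)
):
    """Per-parameter overhead once the block constants are themselves quantized."""
    first = first_level_bits / block_size
    second = second_level_bits / (block_size * second_block_size)
    return first + second


before = quant_constant_overhead_bits()
after = double_quant_overhead_bits()
saved = before - after
print(f"before: {before:.3f}  after: {after:.3f}  saved: {saved:.3f} bits/param")
```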
Memory Comparison
For a 70B parameter model:
| Configuration | Memory Required | Hardware |
|---|---|---|
| Full fine-tuning (FP16) | ~600-800 GB | 8x A100 80GB |
| LoRA (FP16 base) | ~140 GB | 2x A100 80GB |
| QLoRA (4-bit base) | ~36-48 GB | 1x A100 80GB or 2x A6000 |
| QLoRA (4-bit base, small batch) | ~24 GB | 1x RTX 4090 (with CPU offloading) |
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# QLoRA quantization config
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",          # NormalFloat4
    bnb_4bit_use_double_quant=True,     # double quantization
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Load model in 4-bit
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B",
    quantization_config=quantization_config,
    device_map="auto",
)

# Prepare for k-bit training (cast layernorm to FP32, etc.)
model = prepare_model_for_kbit_training(model)

# Apply LoRA on top of quantized model
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
QLoRA does introduce a small quality gap compared to full-precision LoRA (typically 1-3% on benchmarks). For maximum quality, train with QLoRA for development speed, then do a final run with FP16 LoRA if quality is critical. Also note that QLoRA training is slower per step than FP16 LoRA because of quantization/dequantization overhead - you are trading compute time for memory.
Prefix Tuning
Prefix tuning (Li & Liang, 2021) prepends trainable "virtual tokens" to the key and value matrices of every attention layer.
How It Works
Instead of modifying any model weights, you learn a set of continuous prefix vectors for each layer, $P_K, P_V \in \mathbb{R}^{l \times d}$, where $l$ is the prefix length (typically 10-100) and $d$ is the hidden dimension:
$$\text{Attention}(Q, [P_K; K], [P_V; V])$$
The model attends to both the real input tokens and the learned virtual prefix tokens. The prefix acts as a continuous task-specific "instruction."
Trainable parameters: $l \times d \times 2 \times L$ (prefix length × hidden dim × key+value × num layers). For a 7B model with $l = 20$, $d = 4096$, $L = 32$: about 5.2M parameters (~0.07%).
Strengths:
- Extremely parameter-efficient
- Multiple prefixes can serve different tasks with the same base model
- No modification to original weights
Weaknesses:
- Consumes prefix length tokens from the context window
- Less expressive than LoRA for complex adaptations
- Optimization can be unstable; often requires reparameterization through an MLP during training
Prompt Tuning
Prompt tuning (Lester et al., 2021) is even simpler: learn a set of continuous embedding vectors prepended to the input at the embedding layer only (not at every attention layer like prefix tuning).
$$X = [p_1, \ldots, p_l;\ e_1, \ldots, e_n]$$
where $p_i$ are learned continuous vectors and $e_j$ are the embedded input tokens.
Trainable parameters: $l \times d$ (just the soft prompt embeddings). For $l = 20$, $d = 4096$: about 82K parameters.
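The parameter counts for prefix tuning and prompt tuning are simple products, sketched here side by side. The function names are hypothetical, and the example values (length 20, hidden size 4096, 32 layers) are chosen to match the 7B-scale figures quoted in this chapter.

```python
def prefix_tuning_params(prefix_len: int, hidden_dim: int, num_layers: int) -> int:
    """One key prefix and one value prefix per layer."""
    return prefix_len * hidden_dim * 2 * num_layers


def prompt_tuning_params(prompt_len: int, hidden_dim: int) -> int:
    """Soft prompt vectors at the embedding layer only."""
    return prompt_len * hidden_dim


prefix_count = prefix_tuning_params(prefix_len=20, hidden_dim=4096, num_layers=32)
prompt_count = prompt_tuning_params(prompt_len=20, hidden_dim=4096)
print(f"prefix tuning: {prefix_count:,} params")
print(f"prompt tuning: {prompt_count:,} params")
```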
Key finding (Lester et al., 2021): Prompt tuning approaches the performance of full fine-tuning as model size increases. For models above 10B parameters, the gap is often negligible for simple classification tasks.
Google originally developed prompt tuning for multi-task serving at scale. When you have hundreds of tasks sharing the same base model, prompt tuning lets you store one model copy with hundreds of tiny task-specific prompt vectors, swapping them at inference. This is the main production use case.
Adapters
Adapter modules (Houlsby et al., 2019) insert small bottleneck layers between the existing transformer layers.
Architecture
Each adapter consists of:
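- A down-projection: $W_{\text{down}} \in \mathbb{R}^{d \times m}$ (reduces dimension from $d$ to $m$)
- A nonlinearity: ReLU or GELU
- An up-projection: $W_{\text{up}} \in \mathbb{R}^{m \times d}$ (projects back to $d$)
- A residual connection
$$h \leftarrow h + f(h W_{\text{down}}) W_{\text{up}}$$
where $f$ is the nonlinearity and $m \ll d$ is the bottleneck dimension.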
Serial adapters (Houlsby): placed after both the attention and FFN sublayers. Parallel adapters (He et al., 2022): placed alongside attention/FFN, which can reduce sequential latency.
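A dependency-free sketch of a single bottleneck adapter, to show where the extra sequential computation comes from. The matrices are stored as lists of rows and applied to column vectors, so `W_down` has `m` rows of length `d` here; the names and toy values are illustrative only.

```python
def matvec(M, x):
    """Multiply a matrix (list of rows) by a column vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def relu(v):
    return [max(0.0, t) for t in v]


def adapter_forward(h, W_down, W_up):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual."""
    z = relu(matvec(W_down, h))   # d -> m
    delta = matvec(W_up, z)       # m -> d
    return [h_i + d_i for h_i, d_i in zip(h, delta)]


h = [1.0, -2.0]                  # hidden state with d = 2
W_down = [[1.0, 0.0]]            # m = 1 row of length d
W_up = [[2.0], [3.0]]            # d rows of length m
out = adapter_forward(h, W_down, W_up)
```

Unlike a LoRA update, the nonlinearity between the two projections means this block cannot be folded into an existing weight matrix, so it always runs as an extra step at inference.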
Tradeoff vs. LoRA: Adapters add inference latency because they introduce new sequential computation. LoRA adapters can be merged into the base weights ($W = W_0 + BA$) for zero-latency inference. This is why LoRA has largely replaced adapters in practice.
PEFT Method Comparison
| Method | Params Trained | Inference Overhead | Merges into Base? | Multi-Task Serving | Quality (vs Full FT) |
|---|---|---|---|---|---|
| LoRA | 0.1-1% | None (after merge) | Yes | Swap adapters | 95-100% |
| QLoRA | 0.1-1% | Quantization overhead | Yes (after dequant) | Swap adapters | 92-98% |
| Prefix Tuning | 0.05-0.1% | Prefix tokens use context | No | Swap prefixes | 85-95% |
| Prompt Tuning | ~0.001% | Soft prompt tokens | No | Swap prompts | 80-95% (scale dependent) |
| Adapters | 1-5% | Extra sequential layers | No | Swap adapters | 93-99% |
Saying "LoRA and adapters are the same thing" will immediately signal lack of depth. LoRA modifies existing weight matrices via low-rank additive updates and can be merged for zero inference overhead. Adapters insert entirely new bottleneck layers that add sequential computation. They are architecturally different methods with different tradeoffs.
Instruction Tuning and Alignment Fine-Tuning
Fine-tuning is not just about domain adaptation. Two of the most important applications reshape the model's fundamental behavior:
Instruction Tuning
Trains the model to follow instructions by fine-tuning on (instruction, response) pairs.
Key datasets:
- FLAN (Google): 1.8K tasks converted to instruction format
- Alpaca (Stanford): 52K instruction-response pairs generated with OpenAI's text-davinci-003
- Open Assistant: Human-generated multi-turn conversations
- ShareGPT: Real user conversations with ChatGPT
Why it matters: A base model predicts the next token. It does not inherently know that when you ask a question, you want an answer (not a continuation of the question). Instruction tuning bridges this gap.
Alignment Fine-Tuning
After instruction tuning, alignment further refines the model to be helpful, harmless, and honest. This typically involves RLHF or DPO (covered in the next chapter), but the supervised fine-tuning (SFT) stage that precedes them is part of the fine-tuning pipeline.
Data Preparation
Data quality determines fine-tuning quality. A model trained on 1,000 high-quality examples will often outperform one trained on 50,000 noisy examples.
Data Formatting
Different frameworks expect different formats:
Alpaca format (single-turn):
{
"instruction": "Summarize the following contract clause.",
"input": "The Seller hereby represents and warrants...",
"output": "This clause establishes the seller's representations..."
}
ShareGPT format (multi-turn):
{
"conversations": [
{"from": "human", "value": "What is a force majeure clause?"},
{"from": "gpt", "value": "A force majeure clause excuses..."},
{"from": "human", "value": "Can you give an example?"},
{"from": "gpt", "value": "A common example..."}
]
}
ChatML format (OpenAI-style):
<|im_start|>system
You are a legal contract analyst.<|im_end|>
<|im_start|>user
Summarize this clause: ...<|im_end|>
<|im_start|>assistant
This clause establishes...<|im_end|>
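Converting between these formats is mostly string assembly. The sketch below turns an Alpaca-style record into a ChatML string; the function name and the choice to join `instruction` and `input` with a blank line are assumptions. In practice a tokenizer's own chat template (for example `apply_chat_template` in Transformers) is usually the safer way to produce the exact format a model expects.

```python
def alpaca_to_chatml(record, system_prompt=None):
    """Render one Alpaca-style record as a ChatML conversation string."""
    user_text = record["instruction"]
    if record.get("input"):
        user_text += "\n\n" + record["input"]
    turns = []
    if system_prompt:
        turns.append(("system", system_prompt))
    turns.append(("user", user_text))
    turns.append(("assistant", record["output"]))
    return "\n".join(f"<|im_start|>{role}\n{text}<|im_end|>" for role, text in turns)


example = {
    "instruction": "Summarize the following contract clause.",
    "input": "The Seller hereby represents and warrants...",
    "output": "This clause establishes the seller's representations...",
}
chatml = alpaca_to_chatml(example, system_prompt="You are a legal contract analyst.")
print(chatml)
```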
Data Quality Checklist
| Check | Why | How |
|---|---|---|
| Deduplication | Duplicates cause memorization, not generalization | Exact match + near-duplicate detection (MinHash, SimHash) |
| Format consistency | Inconsistent formats confuse the model | Validate all examples against a schema |
| Response quality | Garbage in, garbage out | Human review of a random sample (at least 5-10%) |
| Length distribution | Very long or very short examples skew the model | Plot histogram, trim outliers |
| Label accuracy | Wrong labels teach wrong behavior | Inter-annotator agreement check (Cohen's $\kappa$) |
| Diversity | Repetitive data causes mode collapse | Cluster embeddings, ensure coverage |
| Toxicity/PII screening | Legal and safety risk | Run through content filters before training |
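Two of the checks above, format validation and deduplication, can be sketched without external libraries. Note the substitution: this uses exact Jaccard similarity over word trigrams, which is quadratic in dataset size, rather than MinHash or SimHash, which approximate the same idea at scale. The field names, the 0.8 threshold, and the helper names are assumptions.

```python
import re

REQUIRED_FIELDS = ("instruction", "input", "output")  # assumed Alpaca-style schema


def is_valid(record):
    """Schema check: all fields are strings and the output is non-empty."""
    fields_ok = all(isinstance(record.get(f), str) for f in REQUIRED_FIELDS)
    return fields_ok and bool(record["output"].strip())


def shingles(text, n=3):
    """Set of word n-grams, ignoring case and punctuation."""
    tokens = re.findall(r"\w+", text.lower())
    if len(tokens) < n:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def deduplicate(records, threshold=0.8):
    """Drop invalid records and near-duplicates of records already kept."""
    kept, kept_shingles = [], []
    for rec in records:
        if not is_valid(rec):
            continue
        sh = shingles(rec["instruction"] + " " + rec["input"])
        if any(jaccard(sh, other) >= threshold for other in kept_shingles):
            continue
        kept.append(rec)
        kept_shingles.append(sh)
    return kept
```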
How Much Data Do You Need?
| Task | Typical Data Size | Notes |
|---|---|---|
| Format/style transfer | 100-500 examples | The model already knows the content; you are teaching format |
| Classification | 500-5,000 examples | Per class; balance matters |
| Domain adaptation | 1,000-10,000 examples | Quality over quantity |
| Instruction following | 5,000-50,000 examples | Diverse instructions matter more than volume |
| Full behavior change | 50,000-500,000+ examples | Approaching pretraining-scale fine-tuning |
"For most PEFT fine-tuning, I target 1,000-5,000 high-quality examples. I spend more time on data curation than on hyperparameter tuning - deduplication, format validation, human review of a random sample, and ensuring diversity across the input distribution. A common mistake is collecting 50,000 noisy examples when 2,000 clean ones would produce a better model."
Hyperparameter Selection
The Essential Hyperparameters
| Hyperparameter | Typical Range | Impact | Notes |
|---|---|---|---|
| Learning rate | e.g., 2e-4 (PEFT) | Critical | Too high = forgetting; too low = no learning |
| Epochs | 1-5 | High | More epochs = more forgetting risk |
| Batch size | 4-128 (effective, with gradient accumulation) | Moderate | Larger = more stable gradients, but diminishing returns |
| Warmup ratio | 0.03-0.1 | Moderate | Prevents early gradient spikes |
| Weight decay | 0.01-0.1 | Low-Moderate | Regularization against overfitting |
| LoRA rank ($r$) | 8-64 | Moderate | Capacity of adaptation |
| LoRA alpha ($\alpha$) | $r$ to $2r$ (common default) | Low | Scaling factor, usually set relative to $r$ |
| Max sequence length | 512-8192 | High (memory) | Longer = more memory, must match data |
Training Configuration Example
from transformers import TrainingArguments
from trl import SFTTrainer

# Assumes `model`, `train_dataset`, and `eval_dataset` are already prepared.
training_args = TrainingArguments(
    output_dir="./output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,   # effective batch size = 32 on one GPU
    learning_rate=2e-4,              # higher LR for PEFT
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    weight_decay=0.01,
    bf16=True,                       # bfloat16 mixed precision
    logging_steps=10,
    eval_strategy="steps",
    eval_steps=100,
    save_strategy="steps",
    save_steps=100,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    gradient_checkpointing=True,     # saves memory
    report_to="wandb",
)

# Note: in recent TRL releases, max_seq_length and packing are set on SFTConfig
# rather than passed to SFTTrainer directly; check your installed version.
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    max_seq_length=2048,
    packing=True,                    # pack short examples together
)
trainer.train()
Evaluation
Training Metrics
Training loss vs. evaluation loss is the most important diagnostic:
| Pattern | Diagnosis | Action |
|---|---|---|
| Both decreasing together | Healthy training | Continue |
| Train loss decreasing, eval loss flat or increasing | Overfitting | Stop training, reduce LR, add regularization, get more data |
| Both flat from the start | Learning rate too low or data too noisy | Increase LR, check data quality |
| Both oscillating | Learning rate too high or batch too small | Decrease LR, increase batch size |
| Train loss spikes then recovers | Normal warmup, or bad data batch | Check if spike is during warmup (ok) or later (investigate) |
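The "stop training" action for the overfitting pattern is usually automated with patience-based early stopping. A minimal sketch, assuming you log one evaluation loss per checkpoint; the function name, the patience of 3, and the loss values are illustrative.

```python
def early_stopping_step(eval_losses, patience=3, min_delta=0.0):
    """Return (best_index, stop_index) for a sequence of evaluation losses.

    Training stops once `patience` evaluations pass without improving on the
    best loss by more than `min_delta`.
    """
    if not eval_losses:
        raise ValueError("eval_losses must not be empty")
    best_idx, best = 0, float("inf")
    for i, loss in enumerate(eval_losses):
        if loss < best - min_delta:
            best_idx, best = i, loss
        elif i - best_idx >= patience:
            return best_idx, i
    return best_idx, len(eval_losses) - 1


# Eval loss bottoms out at index 2, then rises: the overfitting pattern
losses = [1.0, 0.8, 0.7, 0.72, 0.75, 0.9, 1.1]
best_idx, stop_idx = early_stopping_step(losses, patience=3)
```

The checkpoint to deploy is the one at `best_idx`, not the last one written before stopping.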
Beyond Loss: Task-Specific Evaluation
Training loss alone is insufficient. You must evaluate on the actual task:
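# Example: evaluating a fine-tuned model on domain-specific QA
# Assumes `model` and `tokenizer` are already loaded; "your_eval_set" and
# task_metric() are placeholders for your own dataset and scoring function.
from datasets import load_dataset

eval_prompts = load_dataset("your_eval_set", split="test")
correct = 0
for example in eval_prompts:
    inputs = tokenizer(example["prompt"], return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=256)
    # Decode only the newly generated tokens, not the echoed prompt
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    output = tokenizer.decode(new_tokens, skip_special_tokens=True)
    # Task-specific scoring
    if task_metric(output, example["expected"]):
        correct += 1
accuracy = correct / len(eval_prompts)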
Benchmark suites for general capability regression:
- MMLU: General knowledge across 57 subjects
- HellaSwag: Commonsense reasoning
- GSM8K: Math reasoning
- HumanEval: Code generation
Monitor these alongside your task metric to detect catastrophic forgetting.
Only reporting training loss as your evaluation strategy will end the conversation. Interviewers want to hear about task-specific metrics (accuracy, F1, exact match), held-out test sets, and monitoring for capability regression on general benchmarks. Training loss tells you the optimization is working; it does not tell you the model is useful.
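For extraction-style tasks, exact match and token-level F1 are two common task metrics. The sketch below is a simplified version: it normalizes only case and whitespace, whereas standard benchmark scorers often also strip punctuation and articles. Function names are illustrative.

```python
from collections import Counter


def normalize(text):
    """Lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


def exact_match(pred, ref):
    return normalize(pred) == normalize(ref)


def token_f1(pred, ref):
    """Harmonic mean of token precision and recall (multiset overlap)."""
    p, r = normalize(pred).split(), normalize(ref).split()
    if not p or not r:
        return float(p == r)
    overlap = sum((Counter(p) & Counter(r)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(r)
    return 2 * precision * recall / (precision + recall)


em = exact_match("Indemnification  Carve-Out", "indemnification carve-out")
f1 = token_f1("material adverse change clause", "material adverse change")
```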
When NOT to Fine-Tune
Fine-tuning is expensive, time-consuming, and risky. Here is a decision framework for when to avoid it:
| Situation | Why Not Fine-Tune | Alternative |
|---|---|---|
| Knowledge updates frequently | You would need to retrain constantly | RAG |
| Need citations/traceability | Fine-tuned knowledge is opaque | RAG |
| Limited labeled data (< 50 examples) | Risk of overfitting; insufficient signal | Few-shot prompting |
| Task is already well-served by prompting | Unnecessary complexity | Prompt engineering |
| Base model capability is the bottleneck | Fine-tuning cannot add capabilities the base model lacks | Use a larger base model |
| You need deterministic outputs | Fine-tuning does not guarantee exact format compliance | Constrained decoding (outlines, LMQL) |
| Rapid prototyping phase | Fine-tuning slows iteration | Prompting + RAG |
At large tech companies (Google, Meta, OpenAI), fine-tuning is routine because they have the data, compute, and infrastructure. At startups, the default should be prompting + RAG first, fine-tuning only when you have proven that prompt engineering hits a ceiling. The business case for fine-tuning must justify the ongoing maintenance cost (retraining on data updates, tracking base model updates, managing adapter versions).
Multi-Task Fine-Tuning and Continual Learning
Multi-Task Fine-Tuning
Training on multiple tasks simultaneously to create a versatile model:
Approach 1: Data mixing Combine datasets from different tasks, prefixed with task instructions:
[Summarize] The contract states that...
[Extract clauses] The agreement between...
[Classify risk] Under section 4.2...
Approach 2: Task-specific LoRA adapters Train separate LoRA adapters per task, share the same base model. At inference, load the adapter for the requested task. This prevents task interference.
Continual Learning
When you need to fine-tune on new data without forgetting previous fine-tuning:
- Replay buffer: Keep a small representative sample from previous tasks and mix into new training data
- Elastic Weight Consolidation (EWC): Add a regularization term that penalizes changes to parameters that were important for previous tasks:
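$$\mathcal{L}_{\text{EWC}} = \mathcal{L}_{\text{task}} + \frac{\lambda}{2} \sum_i F_i \left(\theta_i - \theta_i^*\right)^2$$
where $F_i$ is the Fisher information matrix diagonal (importance of parameter $i$) and $\theta_i^*$ are the previous task weights.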
- Progressive LoRA: Train a new LoRA adapter for each task, stack or merge them. Avoids overwriting previous adapters.
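The EWC regularizer is a weighted squared distance from the previous weights. Here is a minimal sketch over flat Python lists, assuming the diagonal Fisher values have already been estimated elsewhere; real implementations operate on tensors per parameter group, and the names below are hypothetical.

```python
def ewc_penalty(params, old_params, fisher_diag, lam):
    """(lam / 2) * sum_i F_i * (theta_i - theta_i_star)^2"""
    return 0.5 * lam * sum(
        f * (p - p0) ** 2 for p, p0, f in zip(params, old_params, fisher_diag)
    )


def ewc_loss(task_loss, params, old_params, fisher_diag, lam=1.0):
    """New-task loss plus the penalty for drifting on important parameters."""
    return task_loss + ewc_penalty(params, old_params, fisher_diag, lam)


# The first parameter is important (F = 10) but unchanged; the second moved
# by 1.0 but has low importance (F = 0.5).
penalty = ewc_penalty([1.0, 2.0], [1.0, 1.0], [10.0, 0.5], lam=2.0)
total = ewc_loss(0.3, [1.0, 2.0], [1.0, 1.0], [10.0, 0.5], lam=2.0)
```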
Production Considerations
Merging LoRA Weights
For production inference, merge the LoRA weights into the base model to eliminate adapter overhead:
# Merge LoRA into base model
# (assumes `model` is the trained PEFT model and `tokenizer` is already loaded)
merged_model = model.merge_and_unload()
# Save merged model
merged_model.save_pretrained("./merged_model")
tokenizer.save_pretrained("./merged_model")
# Now serve as a standard model - no PEFT dependency needed
After merging:
- No additional latency vs. the base model
- No PEFT library dependency in the inference stack
- The adapter is permanently baked into the weights
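Why merging costs nothing at inference can be seen in a toy example: folding the scaled low-rank product into the frozen matrix yields the same output as running the adapter alongside it. This is a plain-Python sketch with made-up 2x2 values and hypothetical helper names, not the PEFT implementation.

```python
def matvec(M, x):
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def matmul(X, Y):
    return [
        [sum(X[i][t] * Y[t][j] for t in range(len(Y))) for j in range(len(Y[0]))]
        for i in range(len(X))
    ]


def merge_lora(W0, A, B, alpha, r):
    """Return W0 + (alpha / r) * B A as a single dense matrix."""
    scale = alpha / r
    BA = matmul(B, A)
    return [[w + scale * u for w, u in zip(w_row, u_row)] for w_row, u_row in zip(W0, BA)]


def lora_forward(W0, A, B, x, alpha, r):
    """Unmerged path: base output plus the scaled adapter output."""
    scale = alpha / r
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(matvec(W0, x), update)]


W0 = [[1.0, 2.0], [3.0, 4.0]]
B = [[1.0], [2.0]]       # d x r with r = 1
A = [[0.5, -1.0]]        # r x k
x = [1.0, 1.0]

merged = merge_lora(W0, A, B, alpha=2, r=1)
y_merged = matvec(merged, x)                       # one matrix multiply
y_adapter = lora_forward(W0, A, B, x, alpha=2, r=1)
```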
Serving Multiple Adapters
When you need multiple specialized versions of the same model (e.g., one per customer or per task):
Frameworks like LoRAX, S-LoRA, and Punica enable serving hundreds of LoRA adapters from a single base model with batch-level adapter switching. This is how multi-tenant LLM platforms work.
A/B Testing Fine-Tuned Models
- Deploy both the current model (control) and the fine-tuned model (treatment)
- Route a percentage of traffic (5-20%) to the treatment
- Measure task-specific metrics (accuracy, format compliance, user satisfaction)
- Also measure regression metrics (latency, error rate, capability on general tasks)
- Gradually increase traffic if metrics improve, roll back if not
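One common way to implement the traffic split is deterministic hashing, so a given user always sees the same variant. A sketch under assumptions: the salt string, the 10% default, and the function name are illustrative, and a production router would typically sit in the serving layer rather than application code.

```python
import hashlib


def assign_variant(user_id: str, treatment_fraction: float = 0.1,
                   salt: str = "ft-model-v1") -> str:
    """Map a user to 'treatment' or 'control' deterministically."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform in [0, 1)
    return "treatment" if bucket < treatment_fraction else "control"


# Simulate 10,000 hypothetical users and measure the realized split
user_ids = [f"user-{i}" for i in range(10000)]
share = sum(assign_variant(u) == "treatment" for u in user_ids) / len(user_ids)
print(f"treatment share: {share:.3f}")
```

Changing the salt reshuffles assignments, which is useful when starting a new experiment so earlier exposure does not carry over.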
Version Management
Track fine-tuning runs like software releases:
| Artifact | What to Store | Why |
|---|---|---|
| Training data snapshot | Versioned dataset (hash + location) | Reproducibility |
| Hyperparameters | Full config file | Reproducibility |
| Adapter weights | LoRA adapter checkpoint | Deployment |
| Base model version | Exact model ID and hash | Compatibility |
| Evaluation results | Metrics on test set + benchmarks | Regression detection |
| Training logs | Loss curves, learning rate schedule | Debugging |
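A run manifest that ties these artifacts together can be as simple as a small serializable record. The sketch below is one possible layout, not a standard: the class name, field names, paths, and metric values are all hypothetical placeholders.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


def dataset_fingerprint(records):
    """SHA-256 over canonical JSON lines; stable under key reordering."""
    h = hashlib.sha256()
    for rec in records:
        h.update(json.dumps(rec, sort_keys=True).encode("utf-8"))
        h.update(b"\n")
    return h.hexdigest()


@dataclass(frozen=True)
class RunManifest:
    base_model_id: str
    dataset_sha256: str
    hyperparameters: dict
    adapter_path: str
    eval_metrics: dict = field(default_factory=dict)

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True, indent=2)


records = [{"instruction": "Summarize", "input": "...", "output": "..."}]
manifest = RunManifest(
    base_model_id="meta-llama/Llama-3.1-8B",
    dataset_sha256=dataset_fingerprint(records),
    hyperparameters={"r": 16, "lora_alpha": 32, "learning_rate": 2e-4},
    adapter_path="./output/adapter",        # placeholder path
    eval_metrics={"eval_loss": 0.0},        # placeholder value, fill from your run
)
print(manifest.to_json())
```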
Practice Problems
Problem 1: Choosing a Fine-Tuning Strategy
Your company has a customer support chatbot powered by GPT-4. It handles 10,000 queries/day. You want to reduce costs by switching to a fine-tuned Llama 3 8B model. You have 15,000 labeled customer support conversations. You have two A100 40GB GPUs available. Design the fine-tuning strategy.
Hint 1 - Direction
Consider what the model needs to learn: domain vocabulary (product names, policies), output format (structured responses with links), and conversational tone. This is a behavior change, not a knowledge change - fine-tuning is appropriate. Think about which PEFT method fits your hardware.
Hint 2 - Insight
With 2x A100 40GB, you can fit Llama 3 8B in FP16 (16 GB per GPU) with room for LoRA training. QLoRA is unnecessary - you have enough memory for standard LoRA. The 15K conversations should be filtered for quality. Multi-turn format is important since customer support is conversational.
Hint 3 - Full Solution
Strategy: LoRA fine-tuning on Llama 3 8B
Step 1: Data preparation
- Filter 15K conversations: remove short/unhelpful ones, keep ~10K high-quality examples
- Format in ChatML or ShareGPT format (multi-turn)
- Split: 9K train, 500 validation, 500 test
- Deduplicate (MinHash)
- Validate format consistency with a schema check
Step 2: Training configuration
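- Method: LoRA (moderate rank, chosen and justified against eval results; target all attention + MLP)
- Learning rate: PEFT-range value (e.g., the 2e-4 used in the training configuration example) with cosine schedule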
- Epochs: 3 (monitor eval loss for early stopping)
- Effective batch size: 32 (per_device=4, gradient_accumulation=4, 2 GPUs)
- Max sequence length: 2048 (typical customer support conversation length)
- Gradient checkpointing: enabled
Step 3: Evaluation
- Task metrics: response quality (human eval on 100 samples), format compliance rate, factual accuracy
- Regression check: MMLU subset, commonsense reasoning
- Cost analysis: tokens/query reduction vs GPT-4, latency comparison
Step 4: Production deployment
- Merge LoRA weights into base model
- Serve with vLLM for optimized inference
- A/B test: route 10% of traffic to fine-tuned model, compare user satisfaction scores
- Set up monitoring for response quality degradation
Scoring rubric:
| Grade | Criteria |
|---|---|
| Strong Hire | Proposes LoRA with justified rank selection, discusses data quality filtering, has a multi-stage evaluation plan (automated + human), includes production rollout strategy with A/B testing. |
| Lean Hire | Chooses appropriate PEFT method for hardware, mentions data quality, has basic evaluation plan. |
| No Hire | Suggests full fine-tuning without considering memory, ignores data quality, evaluates only training loss. |
Problem 2: Diagnosing Fine-Tuning Failure
After LoRA fine-tuning a Llama 3 70B model on 5,000 legal contract examples, the model correctly extracts contract clauses from examples similar to the training data but produces hallucinated legal terms on out-of-distribution contracts. The training loss converged to 0.3 and evaluation loss to 0.8. What went wrong and how do you fix it?
Hint 1 - Direction
The gap between training loss (0.3) and evaluation loss (0.8) is a clear signal. What does this tell you about the model's generalization?
Hint 2 - Insight
This is textbook overfitting. The model memorized the training data distribution (specific legal terms, contract structures) rather than learning the general skill of clause extraction. The large train/eval loss gap confirms this. Consider what factors contribute: data diversity, training duration, rank selection, regularization.
Hint 3 - Full Solution
Diagnosis: Overfitting to the training distribution
The train-eval loss gap (0.3 vs 0.8) confirms overfitting. The model memorized training-set contract patterns and hallucinates when encountering unfamiliar contracts.
Root cause analysis:
- Insufficient data diversity: 5K examples may all be from the same contract types or jurisdictions
- Rank too high: If $r$ is large (64+), the adapter has too much capacity and memorizes
- Trained too long: May have continued past the optimal epoch
- No regularization: Missing dropout or weight decay
Fixes (in order of priority):
- Data diversity (highest impact):
  - Audit training data: How many contract types, jurisdictions, and formats are represented?
  - Augment with diverse contracts even if fewer are perfectly labeled
  - Add negative examples (text that is NOT a clause) to prevent over-triggering
- Reduce model capacity:
  - Lower the LoRA rank $r$ (e.g., to 8)
  - Target only $W_q$ and $W_v$ (not all linear layers)
  - Add LoRA dropout (0.1)
- Training adjustments:
  - Train for fewer steps - use the checkpoint with lowest eval loss, not final checkpoint
  - Reduce learning rate by 2-5x
  - Add weight decay (0.05-0.1)
- Evaluation improvements:
  - Create an eval set with intentionally OOD contracts (different types, jurisdictions)
  - Track hallucination rate as a first-class metric, not just loss
  - Use human evaluation on 50 OOD examples
Scoring rubric:
| Grade | Criteria |
|---|---|
| Strong Hire | Identifies overfitting from train/eval gap, proposes data diversity as the primary fix, adjusts model capacity AND training regime, creates OOD evaluation set. |
| Lean Hire | Identifies overfitting, proposes at least two reasonable fixes. |
| No Hire | Does not recognize overfitting, suggests "train for more epochs," or proposes only increasing data volume without considering diversity. |
Problem 3: Multi-Tenant Fine-Tuning Architecture
You are building an LLM platform where each enterprise customer gets a model fine-tuned on their private data. You have 50 customers, each with 1,000-10,000 training examples. Design a system that serves all 50 customers efficiently without maintaining 50 separate model deployments.
Hint 1 - Direction
Think about parameter-efficient methods where the base model is shared and only the adaptation is customer-specific. Consider how LoRA adapters can be swapped at serving time.
Hint 2 - Insight
One base model + 50 LoRA adapters. Each adapter is small (~10-50 MB for an 8B model at a typical LoRA rank). The key challenge is efficient serving: you need to swap adapters per request without reloading the base model. Look into S-LoRA or LoRAX for batch-level adapter switching.
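The adapter-size claim can be sanity-checked with a back-of-the-envelope estimate. The sketch below assumes shapes for an 8B Llama-style model with grouped-query attention (hidden size 4096, key/value width 1024, 32 layers), a rank of 16, and 2 bytes per parameter. None of these values come from the problem statement, and the function names are illustrative.

```python
def lora_params(d_in, d_out, rank):
    # A is (rank x d_in) and B is (d_out x rank), so one adapted matrix
    # adds rank * (d_in + d_out) trainable weights.
    return rank * (d_in + d_out)

def adapter_size_mb(projections, num_layers, rank, bytes_per_param=2):
    per_layer = sum(lora_params(d_in, d_out, rank) for d_in, d_out in projections)
    return per_layer * num_layers * bytes_per_param / 1e6

# Assumed (d_in, d_out) shapes of the four attention projections.
attention_projections = [
    (4096, 4096),  # query
    (4096, 1024),  # key
    (4096, 1024),  # value
    (4096, 4096),  # output
]

size_r16_mb = adapter_size_mb(attention_projections, num_layers=32, rank=16)
total_50_customers_gb = 50 * size_r16_mb / 1e3
```

Under these assumptions one adapter comes to roughly 27 MB and fifty of them to under 1.5 GB, which is why all adapters can plausibly sit next to a single copy of the base model.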
Hint 3 - Full Solution
Architecture: Shared base model + per-customer LoRA adapters
Training pipeline:
- Common base model: Llama 3 8B (or fine-tuned variant for your domain)
- Per-customer LoRA training: automated pipeline triggered when customer uploads data
- Standard config: one fixed rank $r$ and scaling factor $\alpha$ for every customer, targeting all attention projections
- Training: 3 epochs, early stopping on eval loss
- Automated quality gate: model must pass basic benchmarks before deployment
- Store adapter weights in versioned object storage (S3)
Serving architecture:
- Base model: Loaded once per GPU, shared across all customers
- Adapter cache: Hot adapters (frequently used) kept in GPU memory; cold adapters loaded on demand from SSD/object storage
- Request routing: Each API request includes `customer_id`; the router selects the correct adapter
- Batching: Use S-LoRA or Punica for batch-level adapter switching; requests for different customers can be batched together as long as they share the same base model
- Hardware: 2-4 GPUs with the base model replicated; adapter switching adds <5ms per request
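The hot/cold adapter cache can be sketched as a least-recently-used (LRU) cache keyed by `customer_id`. This is a simplified illustration rather than how S-LoRA or Punica manage memory: `AdapterCache`, `load_fn`, and the customer names are hypothetical, and the "adapter" is a placeholder string instead of real weights.

```python
from collections import OrderedDict

class AdapterCache:
    """Keep at most `capacity` adapters resident; evict the least recently used."""

    def __init__(self, capacity, load_fn):
        self.capacity = capacity
        self.load_fn = load_fn      # hypothetical loader: customer_id -> adapter weights
        self._hot = OrderedDict()   # insertion order doubles as recency order

    def get(self, customer_id):
        if customer_id in self._hot:
            self._hot.move_to_end(customer_id)   # mark as most recently used
            return self._hot[customer_id]
        adapter = self.load_fn(customer_id)      # cold path: SSD or object storage
        self._hot[customer_id] = adapter
        if len(self._hot) > self.capacity:
            self._hot.popitem(last=False)        # evict the least recently used adapter
        return adapter

loads = []

def fake_load(customer_id):
    # Stand-in for reading adapter weights from storage; records each cold load.
    loads.append(customer_id)
    return f"adapter-weights-for-{customer_id}"

cache = AdapterCache(capacity=2, load_fn=fake_load)
for customer_id in ["acme", "globex", "acme", "initech", "globex"]:
    cache.get(customer_id)
```

With a capacity of two, the second request for `acme` is a cache hit, while `globex` is evicted when `initech` arrives and has to be reloaded. In practice the capacity would be set from free GPU memory rather than a fixed count.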
Key design decisions:
- Use the same LoRA config (rank, target modules) across all customers for serving compatibility
- Implement adapter versioning: customers can roll back to previous versions
- Data isolation: each customer's training data and adapter weights are strictly isolated
- Monitoring: per-customer quality metrics, latency, error rates
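Adapter versioning with rollback needs very little machinery. The sketch below is one possible shape for it, assuming adapters are stored under per-customer prefixes in object storage. The class name, method names, and `s3://example-adapters/...` URIs are hypothetical.

```python
class AdapterRegistry:
    """Track adapter versions per customer and which one is live."""

    def __init__(self):
        self._versions = {}   # customer_id -> list of storage URIs, oldest first
        self._active = {}     # customer_id -> index of the live version

    def publish(self, customer_id, uri):
        versions = self._versions.setdefault(customer_id, [])
        versions.append(uri)
        self._active[customer_id] = len(versions) - 1
        return len(versions)  # 1-based version number

    def rollback(self, customer_id):
        index = self._active[customer_id]
        if index == 0:
            raise ValueError("no earlier version to roll back to")
        self._active[customer_id] = index - 1

    def active_uri(self, customer_id):
        # Lookups are always scoped to one customer, so one tenant can never
        # resolve another tenant's adapter.
        return self._versions[customer_id][self._active[customer_id]]

registry = AdapterRegistry()
registry.publish("acme", "s3://example-adapters/acme/v1")   # hypothetical bucket layout
registry.publish("acme", "s3://example-adapters/acme/v2")
registry.rollback("acme")
live = registry.active_uri("acme")
```

Keeping old versions in storage and moving only a pointer makes rollback instant and leaves an audit trail of what was served when.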
Cost analysis:
- Without multi-tenant: 50 x 16 GB (model) = 800 GB GPU memory
- With multi-tenant: 16 GB (base) + 1.5 GB (all adapters) = 17.5 GB GPU memory
- 800 GB / 17.5 GB ≈ 46x memory reduction
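The arithmetic behind this comparison is worth reproducing, since interviewers often ask for it on the spot. The figures are the ones from the cost analysis above, and the variable names are illustrative. The estimate counts weights only and ignores KV cache and activation memory.

```python
BASE_MODEL_GB = 16.0     # one copy of the base model, as in the estimate above
ALL_ADAPTERS_GB = 1.5    # combined size of all 50 adapters
NUM_CUSTOMERS = 50

separate_gb = NUM_CUSTOMERS * BASE_MODEL_GB    # one full model per customer
shared_gb = BASE_MODEL_GB + ALL_ADAPTERS_GB    # one base model plus every adapter
reduction = separate_gb / shared_gb

print(f"{separate_gb:.0f} GB vs {shared_gb:.1f} GB -> {reduction:.1f}x")
# prints "800 GB vs 17.5 GB -> 45.7x"
```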
Scoring rubric:
| Grade | Criteria |
|---|---|
| Strong Hire | Designs shared base + per-customer LoRA, discusses serving frameworks (S-LoRA/LoRAX), includes adapter versioning and data isolation, provides cost/memory analysis. |
| Lean Hire | Proposes shared base + LoRA adapters, mentions efficient serving. |
| No Hire | Suggests deploying 50 separate models or does not consider serving efficiency. |
Interview Cheat Sheet
| Topic | Key Fact | Why It Matters |
|---|---|---|
| Fine-tuning vs. prompting vs. RAG | FT changes weights (behavior), RAG adds context (knowledge), prompting changes input | Most common opening question |
| LoRA math | $W = W_0 + BA$, $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$ | Shows you understand the mechanism, not just the API |
| LoRA rank | A moderate rank $r$ is a good default; 8 for simple tasks, 64 for complex ones | Demonstrates practical experience |
| QLoRA key ideas | NF4 quantization + double quantization + paged optimizers | Three innovations, not just "4-bit LoRA" |
| Catastrophic forgetting | Model forgets pretrained skills during fine-tuning | Main risk; mitigate with low LR, few epochs, data mixing |
| Learning rate | On the order of $10^{-4}$ for PEFT; 10x lower for full FT | Single most impactful hyperparameter |
| Data quality | 1K clean examples > 50K noisy examples | The answer that separates practitioners from theorists |
| Merging LoRA | $W_0 + BA$ can be precomputed for zero inference overhead | Critical production deployment detail |
| Multi-tenant serving | One base model + N adapters via S-LoRA/LoRAX | Shows systems design thinking |
| When NOT to fine-tune | Knowledge updates, < 50 examples, prompting works | Shows judgment, not just technique |
| Epochs | 1-3 for most tasks; more increases forgetting risk | Prevents the "more training = better" trap |
| Evaluation | Task metrics + general benchmarks + human eval | Loss alone is insufficient |
| Data formats | Alpaca (single-turn), ShareGPT (multi-turn), ChatML | Practical framework knowledge |
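The LoRA rows of the cheat sheet can be written out as a short derivation. The notation is an assumption: $W_0 \in \mathbb{R}^{d \times k}$ is the frozen pretrained weight, and the update is scaled by $\alpha / r$ as in the original LoRA formulation. With the scale folded into $B$ and $A$, the merged weight reduces to $W_0 + BA$.

```latex
\begin{aligned}
h &= W_0 x + \frac{\alpha}{r} B A x,
  \qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k) \\
  &= \underbrace{\left( W_0 + \frac{\alpha}{r} B A \right)}_{W_{\text{merged}}} x \\[4pt]
\text{trainable parameters per adapted matrix} &= r\,(d + k)
  \quad \text{versus} \quad d\,k \ \text{for full fine-tuning}
\end{aligned}
```

The second line is why merging costs nothing at inference: $W_{\text{merged}}$ has the same shape as $W_0$, so the served model performs exactly one matrix multiply per layer, as before.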
Spaced Repetition Checkpoints
Day 0 (Today)
- Write the LoRA equation from memory: $W = W_0 + BA$, the sizes of $B$ and $A$, and what the rank $r$ controls
- Name three differences between LoRA and QLoRA
- When should you use RAG instead of fine-tuning? Give three scenarios
Day 3
- Compare all five PEFT methods: LoRA, QLoRA, prefix tuning, prompt tuning, adapters
- What is catastrophic forgetting? List three mitigation strategies
- Calculate the trainable parameters for LoRA with a given rank $r$ and hidden size $d$, applied to $W_q$ and $W_v$
Day 7
- Design a fine-tuning pipeline for a domain-specific task: data prep, method selection, hyperparameters, evaluation
- Explain why LoRA targets $W_q$ and $W_v$ by default, and when you should target more layers
- A model fine-tuned for 5 epochs has train loss 0.2 and eval loss 1.1. Diagnose and propose fixes
Day 14
- Design a multi-tenant serving architecture with shared base model and per-customer LoRA adapters
- Explain the NF4 data type and double quantization in QLoRA - why are these innovations important?
- Walk through a production fine-tuning deployment: merging weights, A/B testing, version management, monitoring
Day 21
- Teach someone the complete fine-tuning decision framework: prompting to RAG to PEFT to full fine-tuning
- Given a novel task, select the optimal PEFT method and justify your choice with memory, quality, and serving tradeoffs
- Design a continual learning strategy for a model that must adapt to new domains without forgetting previous ones
