Fine-Tuning Hyperparameter Search
The Run That Cost $4,000 and Taught Nothing
A team at a mid-size enterprise AI startup had done everything right. They had a clean 15,000-example dataset for a contract analysis task. They had GPU credits from their cloud provider. They picked Mistral-7B as their base model, configured LoRA, and launched their first fine-tuning run on a Friday afternoon.
By Monday morning, the run had finished. The model was worse than the base model on their evaluation set. Contract clause extraction accuracy had dropped from 71% (base model with few-shot prompting) to 63% (fine-tuned). The team spent two days debugging data pipelines, chat templates, and evaluation scripts, convinced there was a bug somewhere. There was not. The hyperparameters were simply wrong. The learning rate was 1e-3 - three to ten times higher than the safe LoRA range. The model had overfit badly in the first third of training, and the eval loss had been climbing for the last two-thirds of the run.
The four thousand dollars in compute and the week of engineering time taught them one thing: learning rate matters more than almost any other decision in LLM fine-tuning, and the default settings from library documentation are not safe starting points. This lesson gives you the search strategy that turns that expensive mistake into a systematic process.
The deeper issue is that hyperparameter selection for LLM fine-tuning is not well-covered by general machine learning advice. The learning rates that work for training a ResNet from scratch (1e-3 to 1e-2) are catastrophically high for fine-tuning a 7B parameter model. The batch sizes that work for CIFAR-10 (128-512) are infeasible without gradient accumulation on most GPU configurations. The number of epochs appropriate for ImageNet training (90) would destroy any instruction-tuned model. You need different mental models, different search ranges, and different diagnostics.
There is also the LoRA-specific search space that has no equivalent in traditional machine learning. Rank, alpha, dropout, and target modules are four dimensions that interact in non-obvious ways. A rank-4 LoRA with alpha-32 behaves completely differently from a rank-32 LoRA with alpha-32, even though they have similar total parameter counts. Understanding why requires understanding the mathematics of what these parameters control.
This lesson gives you a principled approach: start with the most impactful hyperparameters, understand what each one controls at a mechanistic level, and build an efficient search strategy that finds good configurations without exhausting your compute budget.
Why This Exists - The Configuration Complexity Problem
What Came Before LoRA Fine-Tuning
Before parameter-efficient fine-tuning (PEFT) methods, full fine-tuning was the only option, and it had a much smaller hyperparameter space. The critical parameters were learning rate, batch size, and epochs. Warm-up steps mattered. Weight decay mattered. But the search space was manageable because you were optimizing a relatively well-understood gradient descent problem on a well-understood architecture.
The transition to LoRA-based fine-tuning added four new dimensions to search (rank, alpha, dropout, target modules) while also changing the optimal values of the existing dimensions. Learning rates appropriate for full fine-tuning are too high for LoRA fine-tuning because LoRA initializes with near-zero updates - you need a gentler initial gradient signal. This created a situation where practitioners were copying hyperparameter configs from full fine-tuning papers and applying them to LoRA runs, then wondering why results were unstable.
Why Default Configurations Are Dangerous
Library documentation tends to show configurations that work on toy examples - small datasets, short training runs, forgiving evaluation metrics. The default learning rate in the Hugging Face PEFT documentation (2e-4 for LoRA) is a reasonable starting point for some tasks, but it is higher than optimal for instruction following on large models and lower than optimal for classification heads on small models. There is no single safe default.
The cost of wrong hyperparameters in LLM fine-tuning is particularly high because: (1) runs take hours to days, so you cannot iterate quickly; (2) overfitting is harder to detect than in traditional ML (the loss curve can look healthy while the model is silently memorizing training examples); and (3) the interaction between hyperparameters is stronger - a high learning rate with a short warmup and small batch size creates instability that no amount of other tuning can fix.
What Systematic Search Provides
The payoff from systematic hyperparameter search is not marginal. Moving from default configurations to optimized ones typically yields 5-15% improvement on domain-specific tasks, with outlier cases showing 20-30% improvements. More importantly, it provides confidence: when you know your hyperparameters are optimized, you can trust that performance differences between model architectures or dataset variants are real, not artifacts of suboptimal training.
Historical Context
The Learning Rate Schedule Wars (2017-2020)
The question of how learning rate should vary during training predates transformer fine-tuning by several years. The 1cycle policy (Leslie Smith, 2018) showed that cycling the learning rate from a minimum to a maximum and back during training could significantly outperform constant learning rate training for vision models. Cosine annealing (Loshchilov and Hutter, 2017) provided a smooth decay schedule that became standard for language model pre-training.
The "aha moment" for transformer fine-tuning came from the original BERT paper (Devlin et al., 2018), which established the now-standard pattern: linear warmup for a small fraction of training steps, followed by linear decay to zero. This warmup was critical because transformers initialized from pre-trained weights have a delicate landscape near the starting point - high gradients immediately after initialization can corrupt the pre-trained representations before they can adapt.
LoRA's New Hyperparameter Space (2021)
When Hu et al. published LoRA (2021), they introduced rank as a new hyperparameter but gave limited guidance on how to select it. The paper showed that rank-4 and rank-8 worked well on specific tasks (natural language understanding benchmarks), but these findings did not generalize cleanly to instruction-following fine-tuning at larger model scales.
The alpha parameter, which controls the scaling of the LoRA updates relative to the original weights, was the most misunderstood. Many practitioners set alpha equal to rank (following the paper's default) without understanding that the ratio alpha/rank is what actually matters for effective learning rate scaling. Setting alpha = 2*rank effectively doubles the LoRA learning rate relative to the base weights, which is often beneficial for instruction tuning where you want the LoRA updates to have strong influence.
QLoRA and the New Default Landscape (2023)
Dettmers et al.'s QLoRA paper (2023) changed the practical defaults again. Training in 4-bit NF4 quantization changes the gradient landscape in ways that require lower learning rates and longer warmup. The paper's recommended learning rate of 2e-4 for QLoRA fine-tuning became widely cited but also widely misapplied to non-quantized LoRA runs.
The current (2024-2025) consensus from practitioners who have run hundreds of fine-tuning experiments: use 1e-4 to 3e-4 for LoRA on 7B models, lower (5e-5 to 1e-4) for 13B and above, with warmup of 3-6% of training steps and cosine decay.
Core Concepts
Learning Rate - The Most Important Decision
The learning rate is the single most impactful hyperparameter in fine-tuning. It controls the step size along the loss gradient:

$$\theta_{t+1} = \theta_t - \eta \, \nabla_\theta \mathcal{L}(\theta_t)$$

For LoRA fine-tuning, the actual update to the combined weight matrix depends on both the optimizer learning rate and the LoRA scaling factor. The effective update magnitude is:

$$\Delta W = \frac{\alpha}{r} \, BA$$

where $\alpha$ is the LoRA alpha parameter and $r$ is the rank. This means the effective learning rate for LoRA updates is $\eta \cdot \alpha / r$, not just $\eta$ alone.
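To make the scaling concrete, here is a short sketch (a hypothetical helper, not a library function) that computes the effective LoRA learning rate for a couple of configurations:

```python
def effective_lora_lr(optimizer_lr: float, alpha: int, rank: int) -> float:
    """Effective learning rate of the merged LoRA update: lr * (alpha / rank)."""
    return optimizer_lr * alpha / rank

# Same optimizer LR, different alpha/rank ratios:
print(effective_lora_lr(2e-4, alpha=16, rank=16))  # 2e-4  (ratio 1.0)
print(effective_lora_lr(2e-4, alpha=32, rank=16))  # 4e-4  (ratio 2.0 - twice the push)
```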
Safe starting ranges:
- LoRA on 7B models: 1e-4 to 3e-4
- LoRA on 13B models: 5e-5 to 2e-4
- LoRA on 70B models: 1e-5 to 5e-5
- Full fine-tuning (rare): 1e-6 to 5e-6
If your training loss oscillates wildly in the first 50 steps, your learning rate is too high. If your training loss barely decreases over hundreds of steps, it is too low. A smooth, steady decline in the first 10% of training steps is the target behavior.
Batch Size and Gradient Accumulation
The effective batch size is what actually matters for training stability and generalization. The relationship is:

$$\text{effective batch size} = \text{per-device batch size} \times \text{gradient accumulation steps} \times \text{number of GPUs}$$
Larger effective batch sizes provide more stable gradient estimates but can hurt generalization (the "sharp minima" problem). For LLM fine-tuning, effective batch sizes of 32-128 are typical for instruction tuning, with 16-64 for small specialized datasets.
On a single A100-80GB, a 7B model with LoRA can typically handle per-device batch sizes of 4-8. To reach an effective batch size of 32 with a per-device batch size of 8, you would set gradient_accumulation_steps=4. Gradient accumulation simulates larger batches by accumulating gradients over multiple forward passes before doing a parameter update - it uses no more memory than a single per-device batch while providing gradient quality similar to the larger effective batch.
Linear scaling rule: if you double the batch size, multiply the learning rate by $\sqrt{2} \approx 1.41$ (not by 2 - the square root rule is empirically better for adaptive optimizers like AdamW). This applies roughly in the range of effective batch sizes 16-256.
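Both rules as a minimal sketch (hypothetical helpers; the square-root scaling is the empirical heuristic described above):

```python
import math

def effective_batch_size(per_device: int, accum_steps: int, num_gpus: int = 1) -> int:
    """Effective batch size = per-device batch * accumulation steps * GPU count."""
    return per_device * accum_steps * num_gpus

def scale_lr_for_batch(base_lr: float, base_batch: int, new_batch: int) -> float:
    """Square-root LR scaling for AdamW when the effective batch size changes."""
    return base_lr * math.sqrt(new_batch / base_batch)

print(effective_batch_size(4, 8))                  # 32
print(f"{scale_lr_for_batch(2e-4, 32, 64):.2e}")   # ~2.83e-04 after doubling the batch
```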
Number of Epochs - When to Stop
For instruction tuning with datasets of 5,000+ examples, 1-3 epochs is almost always sufficient. With 3+ epochs, overfitting becomes likely. The optimal epoch count scales inversely with dataset size:
| Dataset Size | Recommended Epochs |
|---|---|
| < 1,000 | 3-5 |
| 1,000 - 5,000 | 2-3 |
| 5,000 - 50,000 | 1-2 |
| 50,000+ | 1 |
These are starting points, not rules. Always use eval loss as the actual stopping criterion via early stopping. If eval loss starts increasing while train loss continues to decrease, you are overfitting regardless of epoch count.
Warmup Ratio
Warmup prevents the early catastrophic forgetting of pre-trained representations by gradually increasing the learning rate from near-zero to the target value. During warmup, the model adapts slowly, preserving the pre-trained knowledge while beginning to specialize.
The warmup ratio is the fraction of total training steps spent in warmup:

$$\text{warmup steps} = \text{warmup ratio} \times \text{total training steps}$$

With cosine decay, the learning rate schedule after warmup looks like:

$$\eta(t) = \eta_{\max} \cdot \frac{1}{2}\left(1 + \cos\left(\pi \cdot \frac{t - t_{\text{warmup}}}{T - t_{\text{warmup}}}\right)\right)$$

where $T$ is the total number of training steps; during warmup itself, $\eta(t)$ rises linearly from near zero to $\eta_{\max}$.
Empirically, warmup ratio of 0.03 to 0.1 (3-10% of training steps) works well for most fine-tuning scenarios. Smaller datasets need more warmup (higher ratio) because with fewer training steps, the warmup covers more of the early critical phase.
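To see the warmup-plus-cosine shape, here is a self-contained sketch using transformers' built-in scheduler (the dummy parameter exists only to drive the schedule; the step counts are illustrative):

```python
import torch
from transformers import get_cosine_schedule_with_warmup

total_steps = 1000
warmup_steps = int(0.05 * total_steps)  # warmup_ratio = 0.05

param = torch.nn.Parameter(torch.zeros(1))       # dummy parameter
optimizer = torch.optim.AdamW([param], lr=2e-4)  # peak LR
scheduler = get_cosine_schedule_with_warmup(optimizer, warmup_steps, total_steps)

lrs = []
for _ in range(total_steps):
    lrs.append(optimizer.param_groups[0]["lr"])
    optimizer.step()
    scheduler.step()

peak_step = lrs.index(max(lrs))
print(f"Peak LR {max(lrs):.2e} reached at step {peak_step}")  # right after warmup
print(f"Final LR {lrs[-1]:.2e}")                              # decays toward zero
```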
LoRA Rank - The Capacity Control
LoRA rank controls how many dimensions the low-rank decomposition uses. A higher rank allows more expressive updates but uses more memory and parameters:

$$\Delta W = BA, \quad B \in \mathbb{R}^{d \times r}, \quad A \in \mathbb{R}^{r \times k}$$

so each adapted module adds $r \times (d + k)$ trainable parameters.

For a Mistral-7B model with hidden dimension $d = 4096$, targeting q_proj and v_proj (2 modules per layer, 32 layers), rank-8 LoRA adds:

$$32 \times 2 \times 8 \times (4096 + 4096) = 4{,}194{,}304$$

That is approximately 4M parameters on top of 7B base parameters - about 0.06% of total parameters. Rank-64 would add ~34M parameters (~0.5% of total).
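A back-of-the-envelope counter for this arithmetic (hypothetical helper; it treats both projections as square 4096x4096 matrices, which slightly overcounts models that shrink v_proj under grouped-query attention):

```python
def lora_param_count(rank: int, n_layers: int = 32,
                     module_shapes: tuple = ((4096, 4096), (4096, 4096))) -> int:
    """Trainable LoRA parameters: per layer, sum over target modules of rank * (d_in + d_out)."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in module_shapes)
    return n_layers * per_layer

print(f"rank-8:  {lora_param_count(8):,}")   # 4,194,304  (~0.06% of 7B)
print(f"rank-64: {lora_param_count(64):,}")  # 33,554,432 (~0.5% of 7B)
```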
When to use high rank (32-64):
- Complex, multi-domain fine-tuning where the model needs to learn diverse new behaviors
- Fine-tuning on code or math where the update direction is very different from general text
- When you have 10,000+ training examples and are seeing underfitting at lower ranks
When to use low rank (4-16):
- Simple format or style adaptation
- Small datasets (< 2,000 examples) - lower rank prevents overfitting
- When memory is constrained
- When the task is similar to the base model's training distribution
LoRA Alpha - The Scaling Factor
Alpha scales the LoRA updates relative to the original weight matrix. The merged weight is:

$$W' = W + \frac{\alpha}{r} \, BA$$

The ratio $\alpha / r$ is the critical quantity, not $\alpha$ alone. Common patterns:

- `alpha = rank`: scaling factor of 1.0 - the BA update is added at its natural scale
- `alpha = 2 * rank`: scaling factor of 2.0 - updates are doubled; use when you want aggressive adaptation
- `alpha = rank / 2`: scaling factor of 0.5 - conservative, good for slight style adjustments
The practical effect: setting alpha = 2 * rank is approximately equivalent to doubling the effective learning rate for LoRA parameters. When your learning rate search suggests that a higher learning rate would help but causes instability, increasing alpha is a more stable way to increase the effective update magnitude.
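A quick numeric illustration (plain arithmetic, no library calls) of why the ratio, not alpha alone, is what matters:

```python
configs = [
    {"rank": 16, "alpha": 16},  # ratio 1.0 - conservative
    {"rank": 16, "alpha": 32},  # ratio 2.0 - aggressive adaptation
    {"rank": 32, "alpha": 32},  # ratio 1.0 - more capacity, same update scale
]
base_lr = 2e-4
for c in configs:
    scaling = c["alpha"] / c["rank"]
    print(f"r={c['rank']:>2}, alpha={c['alpha']:>2} -> "
          f"update scaling {scaling:.1f}, effective LoRA LR {base_lr * scaling:.1e}")
```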
Weight Decay and Dropout
Weight decay regularizes by penalizing large parameter values:

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{task}} + \frac{\lambda}{2} \lVert \theta \rVert^2$$

(AdamW applies this in decoupled form, shrinking each weight by $\eta \lambda \theta$ per step rather than adding the penalty to the loss.)
For LoRA fine-tuning, weight decay is applied to the LoRA matrices (B and A), not to the frozen base model weights. Values of 0.01 to 0.1 are typical. Higher weight decay (0.1) is beneficial when overfitting is a concern (small dataset, high rank). Lower weight decay (0.01) allows more flexible adaptation.
LoRA dropout adds noise to the LoRA pathway during training, acting as regularization. Values of 0.05 to 0.1 are common. Note: LoRA dropout at inference time should be set to 0 (turned off) - the PEFT library handles this automatically.
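Since weight decay and dropout only act on trainable parameters, it is worth verifying that just the LoRA matrices carry gradients. A sketch, assuming `model` is the PEFT-wrapped model returned by the `apply_lora` helper in the Code Examples section below:

```python
# Assumes `model` is a PEFT-wrapped model (see apply_lora below)
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
frozen = sum(1 for _, p in model.named_parameters() if not p.requires_grad)

print(f"{len(trainable)} trainable tensors, {frozen} frozen tensors")
print(trainable[:2])  # e.g. ['...q_proj.lora_A.default.weight', '...q_proj.lora_B.default.weight']
```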
Code Examples
Baseline Training Run with Diagnostics
import torch
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
TrainingArguments,
Trainer,
DataCollatorForSeq2Seq
)
from peft import LoraConfig, get_peft_model, TaskType
from datasets import Dataset
import wandb
def create_model_and_tokenizer(model_name: str = "mistralai/Mistral-7B-v0.1"):
"""Load model and tokenizer with LoRA configuration."""
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.bfloat16,
device_map="auto",
)
return model, tokenizer
def apply_lora(
model,
rank: int = 16,
alpha: int = 32,
dropout: float = 0.05,
target_modules: list[str] | None = None
):
"""Apply LoRA to the model with specified hyperparameters."""
if target_modules is None:
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"]
lora_config = LoraConfig(
task_type=TaskType.CAUSAL_LM,
r=rank,
lora_alpha=alpha,
lora_dropout=dropout,
target_modules=target_modules,
bias="none",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
return model
def create_training_args(
output_dir: str,
learning_rate: float = 2e-4,
per_device_batch_size: int = 4,
gradient_accumulation_steps: int = 4,
num_epochs: float = 2.0,
warmup_ratio: float = 0.05,
weight_decay: float = 0.01,
lr_scheduler_type: str = "cosine",
run_name: str | None = None
) -> TrainingArguments:
"""Create training arguments with the specified hyperparameters."""
effective_batch_size = per_device_batch_size * gradient_accumulation_steps
print(f"Effective batch size: {effective_batch_size}")
return TrainingArguments(
output_dir=output_dir,
num_train_epochs=num_epochs,
per_device_train_batch_size=per_device_batch_size,
per_device_eval_batch_size=per_device_batch_size,
gradient_accumulation_steps=gradient_accumulation_steps,
learning_rate=learning_rate,
weight_decay=weight_decay,
warmup_ratio=warmup_ratio,
lr_scheduler_type=lr_scheduler_type,
fp16=False,
bf16=True,
# Evaluation and saving
evaluation_strategy="steps",
eval_steps=50,
save_strategy="steps",
save_steps=50,
load_best_model_at_end=True,
metric_for_best_model="eval_loss",
greater_is_better=False,
# Logging
logging_dir=f"{output_dir}/logs",
logging_steps=10,
report_to="wandb" if run_name else "none",
run_name=run_name,
# Early stopping patience (configured via callback separately)
dataloader_num_workers=2,
group_by_length=True, # efficiency: group sequences of similar length
remove_unused_columns=False,
)
Learning Rate Finder
import numpy as np
import matplotlib.pyplot as plt
from torch.optim import AdamW
def find_learning_rate(
model,
train_dataloader,
min_lr: float = 1e-6,
max_lr: float = 1e-2,
n_steps: int = 100,
smoothing: float = 0.05
) -> tuple[list[float], list[float]]:
"""
LR range test: increase LR exponentially and track loss.
The optimal LR is just before the loss starts increasing sharply.
"""
optimizer = AdamW(model.parameters(), lr=min_lr)
# Exponentially increase LR from min to max over n_steps
lr_multiplier = (max_lr / min_lr) ** (1.0 / n_steps)
lrs = []
losses = []
smoothed_loss = None
best_loss = float("inf")
model.train()
for step, batch in enumerate(train_dataloader):
if step >= n_steps:
break
        # Forward pass (move batch tensors to the model's device first)
        batch = {k: v.to(model.device) for k, v in batch.items()}
        outputs = model(**batch)
loss = outputs.loss
# Update smoothed loss
current_loss = loss.item()
if smoothed_loss is None:
smoothed_loss = current_loss
else:
smoothed_loss = smoothing * current_loss + (1 - smoothing) * smoothed_loss
lrs.append(optimizer.param_groups[0]["lr"])
losses.append(smoothed_loss)
# Stop if loss is diverging
if smoothed_loss > 4 * best_loss:
print(f"Loss diverging at lr={optimizer.param_groups[0]['lr']:.2e}, stopping.")
break
if smoothed_loss < best_loss:
best_loss = smoothed_loss
# Backward pass
loss.backward()
optimizer.step()
optimizer.zero_grad()
# Increase learning rate
for param_group in optimizer.param_groups:
param_group["lr"] *= lr_multiplier
return lrs, losses
def plot_lr_finder(lrs: list[float], losses: list[float]):
"""Plot the LR finder curve to identify optimal LR."""
fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(lrs, losses)
ax.set_xscale("log")
ax.set_xlabel("Learning Rate (log scale)")
ax.set_ylabel("Smoothed Loss")
ax.set_title("Learning Rate Range Test")
ax.grid(True, alpha=0.3)
# Suggest LR: steepest descent region
# Find the point with maximum negative gradient
losses_arr = np.array(losses)
lrs_arr = np.array(lrs)
gradients = np.gradient(losses_arr, np.log(lrs_arr))
best_idx = np.argmin(gradients)
suggested_lr = lrs_arr[best_idx] / 10 # one order of magnitude before steepest descent
ax.axvline(x=suggested_lr, color="red", linestyle="--",
label=f"Suggested LR: {suggested_lr:.2e}")
ax.legend()
plt.tight_layout()
plt.savefig("lr_finder.png", dpi=150)
print(f"Suggested learning rate: {suggested_lr:.2e}")
return suggested_lr
Detecting Overfitting
import json
import numpy as np
from pathlib import Path
def analyze_training_logs(log_dir: str) -> dict:
"""
Analyze training logs to detect overfitting and other issues.
Returns a diagnosis with specific recommendations.
"""
# Load trainer state
trainer_state_path = Path(log_dir) / "trainer_state.json"
with open(trainer_state_path) as f:
state = json.load(f)
log_history = state["log_history"]
# Separate train and eval logs
train_logs = [l for l in log_history if "loss" in l and "eval_loss" not in l]
eval_logs = [l for l in log_history if "eval_loss" in l]
if not eval_logs:
return {"error": "No eval logs found. Enable evaluation_strategy in TrainingArguments."}
train_losses = [l["loss"] for l in train_logs]
eval_losses = [l["eval_loss"] for l in eval_logs]
eval_steps = [l["step"] for l in eval_logs]
# Find best eval point
best_eval_idx = eval_losses.index(min(eval_losses))
best_eval_step = eval_steps[best_eval_idx]
best_eval_loss = eval_losses[best_eval_idx]
final_eval_loss = eval_losses[-1]
# Calculate overfitting gap
overfitting_gap = final_eval_loss - best_eval_loss
relative_overfit = overfitting_gap / best_eval_loss * 100
# Diagnose issues
issues = []
recommendations = []
# Check for overfitting
if relative_overfit > 5:
issues.append(f"Overfitting detected: eval loss increased {relative_overfit:.1f}% after step {best_eval_step}")
recommendations.append("Reduce epochs or add early stopping")
recommendations.append("Consider reducing LoRA rank or increasing dropout")
recommendations.append("Check for dataset quality issues")
# Check for underfitting (train loss still decreasing at end)
if len(train_losses) >= 10:
final_train_trend = train_losses[-1] - train_losses[-10]
if final_train_trend < -0.05: # still decreasing
issues.append("Possible underfitting: train loss still decreasing at end of training")
recommendations.append("Increase epochs or learning rate")
# Check for training instability (high variance in train loss)
if len(train_losses) > 20:
recent_losses = train_losses[-20:]
loss_std = np.std(recent_losses)
loss_mean = np.mean(recent_losses)
cv = loss_std / loss_mean
if cv > 0.1:
issues.append(f"Training instability: high loss variance (CV={cv:.2f}) in recent steps")
recommendations.append("Reduce learning rate or increase gradient accumulation steps")
recommendations.append("Check for problematic examples in training data")
# Check if model barely improved from start
if eval_losses[0] > 0 and (eval_losses[0] - best_eval_loss) / eval_losses[0] < 0.05:
issues.append("Minimal improvement: eval loss barely decreased (<5% relative improvement)")
recommendations.append("Increase learning rate - current LR may be too low")
recommendations.append("Check that training data format matches model expectations")
return {
"best_eval_loss": best_eval_loss,
"best_eval_step": best_eval_step,
"final_eval_loss": final_eval_loss,
"overfitting_gap": overfitting_gap,
"relative_overfit_pct": relative_overfit,
"total_training_steps": state.get("global_step", "unknown"),
"issues": issues,
"recommendations": recommendations,
"verdict": "OVERFIT" if relative_overfit > 5 else "HEALTHY" if not issues else "UNSTABLE"
}
Optuna Hyperparameter Search for LoRA
import optuna
import torch
from transformers import EarlyStoppingCallback, Trainer
def objective(trial: optuna.Trial, base_model_name: str, train_dataset, eval_dataset) -> float:
"""
Optuna objective function for LoRA hyperparameter search.
Returns the best eval loss achieved.
"""
# Sample hyperparameters
learning_rate = trial.suggest_float("learning_rate", 5e-5, 5e-4, log=True)
lora_rank = trial.suggest_categorical("lora_rank", [4, 8, 16, 32])
lora_alpha_multiplier = trial.suggest_categorical("lora_alpha_multiplier", [1, 2])
lora_alpha = lora_rank * lora_alpha_multiplier
lora_dropout = trial.suggest_float("lora_dropout", 0.0, 0.15)
warmup_ratio = trial.suggest_float("warmup_ratio", 0.01, 0.1)
weight_decay = trial.suggest_float("weight_decay", 0.0, 0.1)
# Log parameters
print(f"\nTrial {trial.number}:")
print(f" learning_rate: {learning_rate:.2e}")
print(f" lora_rank: {lora_rank}, lora_alpha: {lora_alpha}")
print(f" lora_dropout: {lora_dropout:.3f}")
print(f" warmup_ratio: {warmup_ratio:.3f}")
print(f" weight_decay: {weight_decay:.3f}")
# Load model fresh for each trial
model, tokenizer = create_model_and_tokenizer(base_model_name)
model = apply_lora(model, rank=lora_rank, alpha=lora_alpha, dropout=lora_dropout)
output_dir = f"./hpo_trial_{trial.number}"
training_args = create_training_args(
output_dir=output_dir,
learning_rate=learning_rate,
per_device_batch_size=4,
gradient_accumulation_steps=4,
num_epochs=1, # use 1 epoch for HPO to save cost
warmup_ratio=warmup_ratio,
weight_decay=weight_decay,
lr_scheduler_type="cosine",
run_name=f"hpo_trial_{trial.number}"
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
callbacks=[
EarlyStoppingCallback(early_stopping_patience=3)
]
)
try:
trainer.train()
eval_results = trainer.evaluate()
best_eval_loss = eval_results["eval_loss"]
except Exception as e:
print(f"Trial {trial.number} failed: {e}")
best_eval_loss = float("inf")
finally:
# Clean up to free memory for next trial
del model
torch.cuda.empty_cache()
trial.set_user_attr("eval_loss", best_eval_loss)
return best_eval_loss
def run_optuna_search(
base_model_name: str,
train_dataset,
eval_dataset,
n_trials: int = 20,
study_name: str = "lora_hpo"
) -> optuna.Study:
"""
Run Optuna hyperparameter search with Bayesian optimization.
"""
# Use TPESampler (Tree-structured Parzen Estimator) - Bayesian optimization
sampler = optuna.samplers.TPESampler(
seed=42,
n_startup_trials=5 # first 5 trials are random, then Bayesian
)
study = optuna.create_study(
study_name=study_name,
direction="minimize",
sampler=sampler,
storage=f"sqlite:///{study_name}.db", # persist results
load_if_exists=True
)
    # To prune clearly-worse trials mid-run, pass a pruner (e.g. optuna.pruners.MedianPruner)
    # to create_study and call trial.report / trial.should_prune inside the objective.
    study.optimize(
        lambda trial: objective(trial, base_model_name, train_dataset, eval_dataset),
        n_trials=n_trials,
        gc_after_trial=True  # free GPU memory between trials
    )
print("\n" + "="*60)
print("HPO RESULTS SUMMARY")
print("="*60)
print(f"Best trial: {study.best_trial.number}")
print(f"Best eval loss: {study.best_value:.4f}")
print("\nBest hyperparameters:")
for key, value in study.best_params.items():
print(f" {key}: {value}")
return study
Cost-Efficient HPO: Small-to-Large Transfer
def hpo_on_small_model_transfer_to_large(
small_model_name: str, # e.g., "mistralai/Mistral-7B-v0.1"
large_model_name: str, # e.g., "mistralai/Mistral-7B-Instruct-v0.2" or "meta-llama/Llama-2-13b-hf"
train_dataset,
eval_dataset,
n_hpo_trials: int = 15,
n_verification_trials: int = 3
) -> dict:
"""
Run HPO on a small/cheap model to find good hyperparameters,
then verify that the best ones transfer to the larger model.
Cost savings: 7B model HPO costs ~10% of 70B model HPO.
Transfer effectiveness: learning rate and warmup typically transfer well;
rank and alpha may need slight adjustment.
"""
print(f"Phase 1: Running {n_hpo_trials} trials on {small_model_name}")
small_study = run_optuna_search(
base_model_name=small_model_name,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
n_trials=n_hpo_trials,
study_name=f"hpo_{small_model_name.split('/')[-1]}"
)
# Get top-N configurations from small model search
top_trials = sorted(
[t for t in small_study.trials if t.value is not None],
key=lambda t: t.value
)[:n_verification_trials]
print(f"\nPhase 2: Verifying top {n_verification_trials} configs on {large_model_name}")
large_model_results = []
for i, trial in enumerate(top_trials):
params = trial.params
print(f"\nVerification trial {i+1}/{n_verification_trials}")
print(f"Params from small model trial {trial.number}: {params}")
# Adjust rank for larger model (larger models often benefit from slightly higher rank)
adjusted_params = dict(params)
if "lora_rank" in adjusted_params:
# Optionally scale rank up for larger model
adjusted_params["lora_rank"] = min(adjusted_params["lora_rank"] * 2, 64)
model, tokenizer = create_model_and_tokenizer(large_model_name)
model = apply_lora(
model,
rank=adjusted_params.get("lora_rank", 16),
alpha=adjusted_params.get("lora_rank", 16) * adjusted_params.get("lora_alpha_multiplier", 2),
dropout=adjusted_params.get("lora_dropout", 0.05)
)
training_args = create_training_args(
output_dir=f"./verify_large_{i}",
learning_rate=adjusted_params.get("learning_rate", 2e-4),
warmup_ratio=adjusted_params.get("warmup_ratio", 0.05),
weight_decay=adjusted_params.get("weight_decay", 0.01),
num_epochs=1,
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
callbacks=[EarlyStoppingCallback(early_stopping_patience=3)]
)
trainer.train()
eval_results = trainer.evaluate()
large_model_results.append({
"params": adjusted_params,
"small_model_eval_loss": trial.value,
"large_model_eval_loss": eval_results["eval_loss"]
})
del model
torch.cuda.empty_cache()
# Find best configuration for large model
best_large = min(large_model_results, key=lambda x: x["large_model_eval_loss"])
return {
"best_params": best_large["params"],
"large_model_eval_loss": best_large["large_model_eval_loss"],
"all_results": large_model_results
}
WandB Sweeps Configuration
import wandb
WANDB_SWEEP_CONFIG = {
"method": "bayes", # Bayesian optimization (alternatives: "random", "grid")
"metric": {
"name": "eval/loss",
"goal": "minimize"
},
"parameters": {
"learning_rate": {
"distribution": "log_uniform_values",
"min": 5e-5,
"max": 5e-4
},
"lora_rank": {
"values": [4, 8, 16, 32]
},
"lora_alpha_multiplier": {
"values": [1, 2]
},
"lora_dropout": {
"distribution": "uniform",
"min": 0.0,
"max": 0.15
},
"warmup_ratio": {
"distribution": "uniform",
"min": 0.01,
"max": 0.1
},
"weight_decay": {
"distribution": "uniform",
"min": 0.0,
"max": 0.1
},
"lr_scheduler_type": {
"values": ["cosine", "linear"]
}
},
"early_terminate": {
"type": "hyperband",
"min_iter": 3,
"eta": 3
}
}
def run_sweep_agent(sweep_id: str, base_model_name: str, train_dataset, eval_dataset):
"""Run a WandB sweep agent."""
def train_with_wandb():
with wandb.init() as run:
config = wandb.config
lora_alpha = config.lora_rank * config.lora_alpha_multiplier
model, tokenizer = create_model_and_tokenizer(base_model_name)
model = apply_lora(
model,
rank=config.lora_rank,
alpha=lora_alpha,
dropout=config.lora_dropout
)
training_args = create_training_args(
output_dir=f"./sweep_{run.id}",
learning_rate=config.learning_rate,
warmup_ratio=config.warmup_ratio,
weight_decay=config.weight_decay,
lr_scheduler_type=config.lr_scheduler_type,
num_epochs=1,
run_name=run.name
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
callbacks=[EarlyStoppingCallback(early_stopping_patience=3)]
)
trainer.train()
eval_results = trainer.evaluate()
wandb.log({"eval/loss": eval_results["eval_loss"]})
del model
torch.cuda.empty_cache()
wandb.agent(sweep_id, function=train_with_wandb, count=20)
# To use:
# sweep_id = wandb.sweep(WANDB_SWEEP_CONFIG, project="lora-hpo")
# run_sweep_agent(sweep_id, "mistralai/Mistral-7B-v0.1", train_dataset, eval_dataset)
Production Engineering Notes
The Minimum Viable Hyperparameter Search Protocol
When you cannot afford exhaustive HPO (the common case in production), follow this priority order:
Step 1: Fix learning rate first. Run three quick experiments with learning rates spanning one order of magnitude: your lower bound, your center, and your upper bound (e.g., 5e-5, 1.5e-4, 5e-4; see the sketch after this protocol). Truncate each probe to about 20% of the planned training steps, so all three together cost roughly 60% of one full run. Pick the LR where loss decreases most smoothly over those early steps.
Step 2: Fix rank. With learning rate established, run two runs: your expected rank and double that rank. If doubling rank gives meaningfully lower eval loss (>2% relative improvement), use the higher rank. If not, save the memory.
Step 3: Fix warmup. Only matters if your early training loss shows instability (spiky behavior in the first 10% of steps). Default to 0.05 (5% warmup). If spiky, try 0.1.
Step 4: Everything else is secondary. Weight decay (0.01 default), dropout (0.05 default), and alpha (2x rank default) rarely need tuning if the first three are right.
This protocol costs 5-6 training runs (several of them truncated) instead of a full grid search over all parameters. It identifies 80-90% of the available performance improvement at 20-30% of the search cost.
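A sketch of Step 1, reusing `create_model_and_tokenizer`, `apply_lora`, and `create_training_args` from the Code Examples section. The 120-step cap (~20% of a planned ~600-step run), the fixed rank, and the in-scope `train_dataset`/`eval_dataset` (already tokenized) are assumptions to adjust for your setup:

```python
# Probe three LRs spanning one order of magnitude, each truncated to ~20% of the planned run
probe_lrs = [5e-5, 1.5e-4, 5e-4]
probe_results = {}

for lr in probe_lrs:
    model, tokenizer = create_model_and_tokenizer()
    model = apply_lora(model, rank=16, alpha=32)
    args = create_training_args(output_dir=f"./lr_probe_{lr:.0e}", learning_rate=lr)
    args.max_steps = 120  # ~20% of a planned ~600-step run

    trainer = Trainer(model=model, args=args,
                      train_dataset=train_dataset, eval_dataset=eval_dataset)
    trainer.train()
    probe_results[lr] = trainer.evaluate()["eval_loss"]

    del model, trainer
    torch.cuda.empty_cache()

# Pick the LR with the lowest probe loss, but also eyeball the loss curves for smoothness
best_lr = min(probe_results, key=probe_results.get)
print(f"Best probe LR: {best_lr:.1e}")
```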
Compute Budget Allocation for HPO
A practical budget framework for a 7B model fine-tuning project:
Total budget: 20 GPU-hours
- LR range test: 1 GPU-hour (quick 20% of training run, 3 LR values)
- Rank search: 2 GPU-hours (2 full-epoch runs, 2 rank values)
- Warmup/weight decay: 2 GPU-hours (spot check if instability observed)
- Final full training with best config: 8 GPU-hours
- Held-out evaluation: 1 GPU-hour
- Buffer for reruns: 6 GPU-hours
This allocation ensures that the majority of your compute goes into the final, optimized training run rather than exploratory runs.
Gradient Accumulation and Numerical Stability
Large gradient accumulation steps (>= 16) can cause numerical issues with mixed precision training. The accumulated gradients are summed over many steps, and if individual step gradients have large variance, the accumulated gradient can overflow float16. Symptoms: sudden loss spikes, NaN losses, model weights becoming extreme.
Mitigations:
- Use bfloat16 instead of float16 (bfloat16 has a larger dynamic range)
- Enable gradient clipping: `max_grad_norm=1.0` in TrainingArguments
- If using gradient accumulation > 16, reduce the per-step learning rate proportionally
- Monitor the gradient norm (logged as `grad_norm` in WandB) - it should stay below 5.0 (see the sketch below)
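A minimal sketch of these settings in a standalone TrainingArguments (values mirror the defaults discussed in this section):

```python
from transformers import TrainingArguments

stable_args = TrainingArguments(
    output_dir="./stable_run",
    bf16=True,                      # wider dynamic range than fp16
    fp16=False,
    max_grad_norm=1.0,              # clip the accumulated gradient
    gradient_accumulation_steps=8,  # keep below ~16 where possible
    learning_rate=2e-4,
    logging_steps=10,               # grad_norm appears in the logs at each interval
)
```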
Early Stopping Implementation
The Hugging Face EarlyStoppingCallback monitors eval loss and stops training when it stops improving. Critical parameters:
from transformers import EarlyStoppingCallback
# Patience of 3: stop training if eval loss does not improve
# for 3 consecutive evaluation checkpoints
early_stopping = EarlyStoppingCallback(
early_stopping_patience=3,
early_stopping_threshold=0.001 # must improve by at least 0.001 to count
)
Set load_best_model_at_end=True in TrainingArguments alongside early stopping - otherwise you get the checkpoint from the last step, not the best checkpoint.
The evaluation frequency matters. With eval_steps=50 and per_device_batch_size=4 and gradient_accumulation_steps=4, you evaluate every 50 * 4 * 4 = 800 training examples. For a 5,000-example dataset with 2 epochs (10,000 total examples), this means evaluating about 12-13 times per run. Patience of 3 would allow roughly 3/12 of total training to overfit before stopping - usually acceptable.
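The same arithmetic as a reusable pre-launch check (hypothetical helper):

```python
def evals_per_run(n_examples: int, epochs: float, per_device_bs: int,
                  accum_steps: int, eval_steps: int) -> float:
    """Number of evaluation checkpoints the Trainer will hit over a full run."""
    examples_per_optimizer_step = per_device_bs * accum_steps
    total_optimizer_steps = n_examples * epochs / examples_per_optimizer_step
    return total_optimizer_steps / eval_steps

print(evals_per_run(5_000, epochs=2, per_device_bs=4, accum_steps=4, eval_steps=50))  # 12.5
```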
Multi-GPU Learning Rate Scaling
When scaling from 1 GPU to multiple GPUs, the effective batch size increases. Apply the square root learning rate scaling rule:
import math
def scale_lr_for_gpus(base_lr: float, base_gpus: int, target_gpus: int) -> float:
"""Scale learning rate when changing number of GPUs."""
# Square root scaling (empirically better than linear for AdamW)
scale_factor = math.sqrt(target_gpus / base_gpus)
scaled_lr = base_lr * scale_factor
print(f"LR scaled from {base_lr:.2e} to {scaled_lr:.2e} for {target_gpus} GPUs")
return scaled_lr
# Example: found optimal LR of 1e-4 on 1 GPU, scaling to 4 GPUs
# optimal_lr_4gpu = scale_lr_for_gpus(1e-4, base_gpus=1, target_gpus=4)
# -> 2e-4 (approximately)
Common Mistakes
:::danger Setting Learning Rate Too High (The Most Common Mistake) The default learning rate in many tutorials (1e-3 or even 2e-3) is appropriate for training from scratch but will destroy pre-trained representations during fine-tuning. Symptoms: training loss drops fast in the first 5% of steps, then plateaus or increases; eval loss is higher than the base model's eval loss at the end of training. For LoRA fine-tuning of 7B models, start at 2e-4 maximum. If you are using QLoRA (4-bit), stay below 1e-4. If you are fine-tuning a 70B model, start at 5e-5. :::
:::danger Ignoring Eval Loss and Only Watching Train Loss A model can achieve very low training loss (< 0.3) while producing garbage outputs on unseen prompts. Train loss measures memorization; eval loss measures generalization. Always set up an evaluation dataset (at minimum 5-10% of your total data) and log eval loss at regular intervals. If you do not have an eval set, create one by holding out a random sample before training. Never report a fine-tuned model as ready for production based on train loss alone. :::
:::warning Using the Same Hyperparameters for Different Model Sizes A configuration that works well for Mistral-7B (LR=2e-4, rank=16) will likely overfit a Llama-13B and may be insufficient for a 70B model. Larger models are more sensitive to high learning rates (their pre-trained representations are richer and more fragile), need lower learning rates, and often benefit from higher LoRA rank to capture the richer update directions needed. When switching model sizes, always rerun at least the LR search step. :::
:::warning Forgetting That Effective Batch Size Affects Optimal Learning Rate If you change gradient_accumulation_steps from 4 to 8 (doubling your effective batch size), your current learning rate is now suboptimal. The loss curve will look healthy but you will leave performance on the table. Apply the square root scaling rule: if effective batch doubles, multiply LR by sqrt(2) ~= 1.41. If it halves, divide by sqrt(2). This applies to changes in number of GPUs as well. :::
:::warning Running Full HPO When Budget Is Constrained A 20-trial Optuna search on a 70B model can easily cost $500-1,000 in GPU time. Most of that budget goes to exploring configurations that are not competitive. Use the small-model transfer strategy: run HPO on a smaller proxy model (7B for a 13B target, 13B for a 70B target) to identify the best configuration, then verify with 2-3 confirmation runs on the large model. The hyperparameter transfer rate for LR and warmup is approximately 80-90%; for rank, expect to scale up slightly when moving to larger models. :::
Interview Q&A
Q: Why is learning rate the most important hyperparameter in LLM fine-tuning, and what range should you start with?
Learning rate directly controls how much the pre-trained model's weights change with each gradient update. Pre-trained weights encode billions of examples of language patterns, factual knowledge, and reasoning capabilities accumulated during pre-training. Too high a learning rate destroys these patterns in the first few training steps - the model "forgets" its pre-training before it can learn the new task. Too low a learning rate means the model barely adapts and the fine-tuning adds negligible value. The learning rate is also multiplicative with every other source of gradient signal - if the data is noisy or the batch size is small, a high learning rate amplifies the noise catastrophically. Safe starting ranges: 1e-4 to 3e-4 for LoRA on 7B models, 5e-5 to 1e-4 for 13B+, 1e-5 to 5e-5 for 70B+. Always tune learning rate first before optimizing any other hyperparameter.
Q: What is gradient accumulation and when would you use it?
Gradient accumulation simulates larger batch sizes by accumulating gradients over multiple forward passes before performing a parameter update. Instead of updating weights after every batch of 4 examples, you accumulate gradients over 8 batches of 4 (=32 effective batch size) and then update. Memory usage is identical to a batch size of 4, but the gradient estimate uses 32 examples, providing a smoother, lower-variance gradient signal. Use it when: your GPU cannot fit the batch size you want (memory constraints), you need a specific effective batch size for training stability, or you are running on a single GPU and want to match the behavior of multi-GPU training with a larger batch. The cost is that each "effective step" takes k times longer than a single gradient step (k = accumulation steps), so throughput is unchanged but the training loop is more memory-efficient.
Q: How do you detect overfitting in LLM fine-tuning, and what do you do about it?
The primary signal is divergence between training loss and evaluation loss. In healthy training, both decrease together. Overfitting appears as: training loss continues to decrease while evaluation loss stops decreasing or starts increasing. The relative gap - (eval_loss_final - eval_loss_best) / eval_loss_best - gives you a percentage overfitting measure. Above 5% is a problem; above 10% is significant. Secondary signals: perplexity on the eval set increasing while training perplexity falls, or generation quality on held-out prompts degrading (the model starts "memorizing" training phrasings rather than generalizing). Fixes in order of severity: (1) use early stopping with load_best_model_at_end=True, (2) reduce epochs, (3) reduce LoRA rank, (4) increase dropout, (5) add more regularization (weight decay), (6) augment or increase dataset size.
Q: Explain the relationship between LoRA rank and alpha. What does the alpha/rank ratio control?
LoRA decomposes weight updates as $\Delta W = BA$, where $B$ has shape $(d, r)$ and $A$ has shape $(r, d)$. The final merged weight uses a scaling factor: $W' = W + \frac{\alpha}{r} BA$. The rank $r$ controls how many linearly independent directions the LoRA update can span - higher rank = more expressive updates = more parameters. The alpha parameter scales the magnitude of those updates relative to the frozen base weights. The ratio $\alpha / r$ is the effective scaling multiplier: if alpha = rank (ratio 1), the LoRA update has unit scale; if alpha = 2 * rank (ratio 2), the updates are doubled in magnitude. This is equivalent to doubling the effective learning rate for LoRA parameters without changing the actual optimizer learning rate. In practice: set alpha = rank when you want conservative adaptation, alpha = 2 * rank when you want stronger adaptation, and alpha = rank / 2 for very gentle style adjustments. Never set alpha much larger than 2 * rank - the updates become too large and destabilize training.
Q: How does Bayesian optimization in Optuna differ from random search, and when does it actually outperform random search?
Bayesian optimization (specifically TPE - Tree-structured Parzen Estimators - in Optuna) builds a probabilistic model of the hyperparameter-to-objective mapping and uses that model to select the next configuration to try, favoring regions that are likely to improve on the best result found so far. Random search samples configurations uniformly without using information from previous trials. Bayesian optimization outperforms random search when: (1) the search space is large (5+ dimensions), (2) function evaluations are expensive (each trial costs >30 minutes), and (3) the objective function is smooth (nearby hyperparameter values give similar losses). For LLM fine-tuning HPO, all three conditions hold, making Bayesian optimization the right choice. The crossover point is roughly 15-20 trials: below this, random search and Bayesian optimization are roughly equivalent (Bayesian needs trials to build its model). Above 20 trials, Bayesian optimization typically finds better configurations faster. For very small trial budgets (< 10), a well-designed random search covering the likely optimal region can outperform Bayesian optimization.
Q: You find the optimal hyperparameters for a 7B model. How do you transfer them to a 70B model efficiently?
The key insight is that some hyperparameters transfer well and some do not. Learning rate transfers poorly - 70B models need lower learning rates (roughly 3-5x lower) than 7B models for the same task. Warmup ratio transfers well - if 5% warmup worked for 7B, it will work for 70B. LoRA rank often needs to increase - 70B models have more representational capacity and benefit from higher-rank updates. Alpha (as a multiple of rank) transfers well. Dropout transfers reasonably well. The practical protocol: take your optimal 7B config, halve the learning rate (or reduce by factor 3-5), optionally double the rank, keep warmup/dropout/weight_decay the same. Run 1-2 verification experiments at 20% of planned training steps to confirm the transferred config produces smooth, stable loss curves. Full HPO on 70B is usually not cost-effective - the small-model proxy approach captures 80-90% of the possible improvement at 10-20% of the HPO cost.
