
Monitoring and Debugging Fine-Tuning

Reading time: ~35 min · Interview relevance: Very High · Target roles: ML Engineer, LLM Engineer


The 3 AM Production Incident

It is 3 AM on a Tuesday. Your team has been running a 70B Llama fine-tuning job for 14 hours. The A100 cluster is costing $180 per hour. You are halfway through the planned training when an on-call alert fires: the training script has crashed with RuntimeError: CUDA error: device-side assert triggered. The checkpoint from 6 hours ago was corrupt. The one from 12 hours ago is the last clean save.

You restart from 12 hours ago. Two hours in, the loss spikes to inf and then nan. The job is eating GPU time but producing nothing. You have no idea what changed, no clear metrics trail, and the engineer who wrote the training script is asleep in Berlin.

This is not a hypothetical. It is a Tuesday for some ML team somewhere every single week.

The engineers who handle this well are not smarter. They set up monitoring before the job started. They have a WandB dashboard open that shows them exactly when the gradient norm started climbing three hours before the crash. They have a custom callback that logs learning rate, loss, gradient norm, GPU memory, and tokens per second every 50 steps. They can look at the loss curve and immediately diagnose whether they are looking at a data issue, a learning rate issue, or a mixed precision bug.

The engineers who lose thousands of dollars and 14 hours are the ones who trusted the Trainer's default logging and checked the terminal every few hours.

Monitoring a fine-tuning run is not optional. It is the difference between a successful model and an expensive crash. This lesson teaches you exactly what to track, what healthy looks like, and how to diagnose everything that goes wrong.


Why This Exists - The Problem Before Proper Monitoring

Before modern training frameworks had built-in metric logging, training a neural network was genuinely opaque. You started a job, waited for it to finish (or crash), and looked at the final loss. If something went wrong, you had almost no information to work with.

The first generation of deep learning practitioners ran print(loss) every 100 steps and called it monitoring. When a job crashed, they had a terminal output. When it succeeded but the model performed poorly, they had no idea whether the issue was the data, the learning rate schedule, the architecture, or something else entirely.

The shift came from two directions. First, Google Brain and DeepMind started publishing postmortems on large-scale training failures. Exploding and vanishing gradients were a known problem from Pascanu, Mikolov, and Bengio's 2013 work on training RNNs, which popularized gradient clipping as a fix - but you need to know the gradient norm is climbing before you can decide to clip it. That requires logging.

Second, the emergence of WandB (2018) and the maturation of TensorBoard made comprehensive metric logging practical. Suddenly you could log 50 different metrics every 10 steps and visualize them in real time without writing a single line of custom plotting code. This changed the standard expectation. By 2020, any serious training job that did not have a WandB run was considered under-instrumented.

For LLM fine-tuning specifically, the problem is harder than standard training for three reasons. First, the compute cost of each run is high - a failed 70B job costs real money, not just time. Second, LLMs are sensitive to subtle data quality issues that do not manifest obviously in early loss curves. Third, mixed precision training (bf16/fp16) introduces failure modes - loss scale underflow, overflow - that pure fp32 training does not have.

Proper monitoring solves all of this. You see problems before they become failures. You understand what is normal so you can recognize what is abnormal. You have a systematic debug process instead of guesswork.


Historical Context - From Print Statements to Real-Time Dashboards

The history of training monitoring tracks almost exactly with the history of deep learning itself.

1986 - 2012: Backpropagation-era networks were small enough that researchers could observe loss by hand. A few hundred training examples, a few hundred parameters. Loss curves were plotted in MATLAB after training completed. Debugging meant staring at weight matrices.

2012 - Krizhevsky's AlexNet: The ImageNet breakthrough introduced GPU training at scale for the first time in a competitive context. Krizhevsky famously ran two GPUs in parallel (the hardware limitation of GTX 580 with 3GB VRAM). Monitoring was still basic - training accuracy and loss logged to a file. The paper reports training for 90 epochs over 5-6 days. They checked the log periodically. No real-time dashboards.

2015 - TensorBoard arrives: Google released TensorBoard with TensorFlow 0.6. For the first time, practitioners could watch loss curves update in real time in a browser. This was genuinely transformative. Within a year, it became the default monitoring tool for academic research.

2018 - WandB launches: Weights and Biases launched with a key insight: TensorBoard is great for local development but terrible for team collaboration and experiment comparison. WandB added centralized storage, experiment comparison, hyperparameter tables, and artifact versioning. It became the production standard within 2-3 years.

2020 onwards - Gradient norm monitoring becomes standard: The scaling laws paper (Kaplan et al., 2020) and the GPT-3 training report introduced the idea that large model training has specific failure signatures that need monitoring. Gradient norm histograms, loss scale tracking, and attention entropy logging became part of the standard LLM training toolkit.

2022 - Monitoring becomes a one-flag default: by this point, HuggingFace Trainer's native WandB and TensorBoard integrations (the report_to flag) had made comprehensive monitoring available to anyone using the standard training pipeline.

The "aha moment" for modern LLM monitoring came from the Chinchilla and PaLM training reports (2022). Both papers described monitoring practices in detail - not just loss curves but gradient norm distributions, learning rate warmup behavior, and loss spike recovery. Practitioners realized that the difference between a successful large run and a failed one was often just the monitoring setup.


Core Concepts - What to Track and Why

Training Loss vs Evaluation Loss

These are the two most fundamental metrics. Understanding the relationship between them tells you almost everything about training health.

Training loss is the average loss computed on the training batches. For causal language modeling (next token prediction), this is cross-entropy loss:

$$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \log P(x_i \mid x_1, \ldots, x_{i-1})$$

where $N$ is the number of tokens in the batch. Loss typically starts around $\ln(V)$, where $V$ is the vocabulary size - for LLaMA's 32,000-token vocabulary, initial loss is around $\ln(32000) \approx 10.4$.
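A cheap sanity check before training: compare the very first logged loss against this value. A minimal sketch (the model name is just an example):

import math
from transformers import AutoTokenizer

# ln(V) is the cross-entropy of a uniform prediction over the vocabulary -
# roughly where loss starts from random initialization, and an upper bound
# that a sane first logged loss should not exceed.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
print(f"ln(vocab_size) = {math.log(tokenizer.vocab_size):.2f}")  # ~10.37 for a 32,000-token vocab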

Evaluation loss is the same computation but on a held-out validation set, run with torch.no_grad(). The gap between training loss and eval loss is the single most useful signal about model health.

Healthy training looks like this:

  • Both losses decrease together in the early phase
  • Training loss decreases slightly faster than eval loss (normal)
  • The gap between them stabilizes and stays roughly constant
  • Neither loss plateaus prematurely

The three failure modes to watch:

  1. Overfitting: Training loss keeps decreasing while eval loss flattens or starts increasing. The model is memorizing training examples rather than learning generalizable patterns. Fix: more data, regularization, early stopping, reduce LoRA rank.

  2. Underfitting: Both losses plateau at a high value. The model is not learning. Fix: higher learning rate, more training steps, check for data masking issues.

  3. Training instability: Loss decreases normally then suddenly spikes up, then recovers. This is a gradient spike. Fix: gradient clipping, lower learning rate, check for outlier examples in data.

Gradient Norm

Gradient norm is the L2 norm of all gradient tensors concatenated:

$$\text{grad\_norm} = \sqrt{\sum_{p} \sum_{i} g_{p,i}^2}$$

where $g_{p,i}$ is the gradient of parameter tensor $p$ at index $i$.

This is the single most important diagnostic metric beyond loss. Here is what it tells you:

Normal behavior: Gradient norm starts somewhat high in the first few hundred steps (model parameters are being adjusted rapidly from their pretrained values), then settles into a stable range. For LoRA fine-tuning of a 7B model, typical stable gradient norm is 0.1 - 2.0.

Exploding gradients: Gradient norm climbs steadily or spikes suddenly above 10-100. This usually precedes a loss spike or NaN loss. The standard fix - gradient clipping at 1.0 - caps the gradient norm before the update is applied:

$$g \leftarrow g \cdot \min\left(1, \frac{\text{clip\_threshold}}{\|g\|}\right)$$

HuggingFace Trainer applies this by default when you set max_grad_norm=1.0 in TrainingArguments. Many people leave this at the default and never check whether it is actually being triggered. Inspect trainer.state.log_history and look for grad_norm values frequently at or above 1.0 - the logged norm is the pre-clipping value, so if it sits at or above the threshold step after step, clipping is engaged constantly, which is a signal that your learning rate is too high.
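A minimal sketch of that check, runnable during or after training; it assumes your transformers version records a grad_norm key in the Trainer's log history (recent releases do):

def clipping_engagement(trainer, clip_threshold: float = 1.0) -> float:
    """Fraction of logged steps whose pre-clipping gradient norm met or exceeded
    the clip threshold. A value near 1.0 means clipping is acting as a crutch,
    not a safety valve."""
    norms = [
        entry["grad_norm"]
        for entry in trainer.state.log_history
        if "grad_norm" in entry
    ]
    if not norms:
        return 0.0
    return sum(n >= clip_threshold for n in norms) / len(norms)

# Example: print(f"Clipping engaged on {clipping_engagement(trainer):.0%} of logged steps")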

Vanishing gradients: Gradient norm near zero consistently. The model is not learning because gradients are not reaching early layers. This is rare in transformer fine-tuning (transformers have residual connections specifically to prevent this) but can appear in extremely long fine-tuning runs or when fine-tuning very specific adapter configurations.

Learning Rate Schedule

Learning rate itself should be logged. Not just the initial value - the actual scheduled value at every step.

The standard setup for fine-tuning is:

  • Linear warmup for the first 3-10% of training steps
  • Cosine decay to a minimum learning rate (typically 10% of peak)

During warmup, the learning rate rises linearly from 0 to the peak value. This matters because starting at the full learning rate with freshly initialized LoRA weights often causes early instability.

Log the learning rate at every step. When you see a loss spike, look at where the learning rate was at that point. A loss spike that correlates with the end of warmup (transition from increasing to decreasing LR) often means the peak LR is too high.
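If you want to see the scheduled values before committing GPU time, you can reproduce the warmup-plus-cosine curve offline. A sketch using the scheduler helper from transformers; the step counts are placeholders:

import torch
from transformers import get_cosine_schedule_with_warmup

total_steps = 1000   # placeholder: total optimizer steps for the run
warmup_steps = 50    # placeholder: ~5% warmup

# Dummy optimizer just to drive the scheduler; the peak LR matches the training config.
opt = torch.optim.AdamW([torch.nn.Parameter(torch.zeros(1))], lr=2e-4)
sched = get_cosine_schedule_with_warmup(opt, warmup_steps, total_steps)

lrs = []
for _ in range(total_steps):
    lrs.append(sched.get_last_lr()[0])  # LR applied at this step
    opt.step()
    sched.step()

# Plot or log lrs to see exactly where a loss spike falls relative to the end of warmup.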

GPU Utilization vs MFU

These are two different things that are commonly confused.

GPU utilization is the percentage of time the GPU compute units are doing something - reported by nvidia-smi. 90%+ GPU utilization sounds good. But a GPU can be 100% utilized doing memory copies, not compute. High GPU util with low throughput is a data loading bottleneck.

Model FLOPs Utilization (MFU) is the fraction of peak theoretical GPU FLOPs that your training job is actually using for model computation:

$$\text{MFU} = \frac{\text{achieved\_TFLOPS}}{\text{peak\_TFLOPS}}$$

For an A100 80GB with bf16, peak TFLOPS is 312. For a well-optimized 7B model training job with Flash Attention and gradient checkpointing, you might achieve 40-60 TFLOPS - an MFU of around 13-19%. That sounds low but is actually good for fine-tuning (which has smaller batch sizes than pretraining).

The practical metric most people track is tokens per second (throughput). A 7B LoRA fine-tuning job on a single A100 should process 2,000-4,000 tokens/second with reasonable batch sizes. If you see 500 tokens/second, you have a bottleneck somewhere - data loading, small batch size, or high gradient checkpointing overhead.
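Tokens per second can be converted into a rough MFU figure with the standard approximation that one training token costs about 6 FLOPs per model parameter for a full forward-plus-backward pass. A sketch under that assumption (LoRA skips weight gradients for frozen layers and gradient checkpointing adds recomputation, so the true constant shifts):

def approx_mfu(tokens_per_second: float, n_params: float,
               flops_per_token_per_param: float = 6.0,
               peak_tflops: float = 312.0) -> float:
    """Ballpark MFU from measured throughput; treat as an estimate, not a measurement."""
    achieved_tflops = flops_per_token_per_param * n_params * tokens_per_second / 1e12
    return achieved_tflops / peak_tflops

# Example: 1,000 tokens/s on a 7B model against an A100's 312 bf16 peak TFLOPS -> ~13% MFU
print(f"MFU ~ {approx_mfu(1_000, 7e9):.1%}")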

Memory Usage

Log peak GPU memory at each step. GPU memory in LLM training has three components:

  1. Model weights: 7B parameters in bf16 = 14GB
  2. Optimizer states: AdamW keeps two moment estimates per parameter, typically stored in fp32 - roughly 8 bytes per parameter, so on the order of 56GB for a full 7B fine-tune (about 28GB if the states are kept in half precision). LoRA only keeps optimizer states for the adapter parameters, which is why LoRA is memory-efficient.
  3. Activations: Intermediate values needed for backprop. With gradient checkpointing, you trade compute for memory by recomputing activations during backprop instead of storing them.

When memory starts climbing step over step, you have a memory leak - often from accumulating optimizer states or not properly clearing gradients. Log torch.cuda.memory_allocated() and torch.cuda.max_memory_allocated() periodically.
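A small helper for that periodic snapshot - a minimal sketch you can call from any callback or training loop:

import torch

def gpu_memory_snapshot(device: int = 0) -> dict:
    """Current and peak allocated memory in GB. A slow, steady climb in the
    allocated value across steps with a constant workload points to a leak."""
    return {
        "allocated_GB": torch.cuda.memory_allocated(device) / 1e9,
        "peak_allocated_GB": torch.cuda.max_memory_allocated(device) / 1e9,
        "reserved_GB": torch.cuda.memory_reserved(device) / 1e9,
    }

# Calling torch.cuda.reset_peak_memory_stats(device) between logging intervals turns
# the peak value into a per-interval measurement instead of a per-run one.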


Code Examples

Setting Up WandB with HuggingFace Trainer

import wandb
from transformers import TrainingArguments, Trainer, AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
import torch

# Initialize WandB run before training
wandb.init(
    project="llm-finetuning",
    name="llama-7b-lora-v1",
    config={
        "model": "meta-llama/Llama-2-7b-hf",
        "lora_rank": 16,
        "learning_rate": 2e-4,
        "batch_size": 4,
        "gradient_accumulation_steps": 8,
        "max_seq_length": 2048,
    },
)

training_args = TrainingArguments(
    output_dir="./checkpoints",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=2e-4,
    warmup_ratio=0.05,
    lr_scheduler_type="cosine",
    max_grad_norm=1.0,

    # Logging
    logging_steps=50,
    logging_first_step=True,
    report_to="wandb",  # Enables WandB logging

    # Eval
    evaluation_strategy="steps",
    eval_steps=200,
    save_strategy="steps",
    save_steps=200,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,

    # Mixed precision
    bf16=True,

    # Memory
    gradient_checkpointing=True,
    dataloader_num_workers=4,
    dataloader_pin_memory=True,
)

Custom WandB Callback for Comprehensive Monitoring

The default HuggingFace WandB integration logs training loss, eval loss, and learning rate. That is not enough. Here is a comprehensive callback that logs everything useful:

import time
import wandb
import torch
import subprocess
from transformers import TrainerCallback, TrainerState, TrainerControl

class ComprehensiveMonitoringCallback(TrainerCallback):
    """
    Logs training loss, eval loss, gradient norm, learning rate,
    GPU memory, GPU utilization, and tokens per second.
    """

    def __init__(self, log_every_n_steps: int = 50, max_seq_length: int = 2048):
        self.log_every_n_steps = log_every_n_steps
        self.max_seq_length = max_seq_length  # proxy for tokens per sequence
        self._step_start_time = None
        self._grad_norm_history = []
        self._loss_history = []

    def on_step_begin(self, args, state: TrainerState, control: TrainerControl, **kwargs):
        self._step_start_time = time.time()

    def on_step_end(
        self,
        args,
        state: TrainerState,
        control: TrainerControl,
        model=None,
        **kwargs
    ):
        if state.global_step % self.log_every_n_steps != 0:
            return

        metrics = {}

        # --- Timing and throughput ---
        if self._step_start_time is not None:
            elapsed = time.time() - self._step_start_time
            # Tokens per step = batch_size * seq_len * grad_accum.
            # max_seq_length is a proxy - tracking the actual number of
            # non-padding tokens per batch is more accurate.
            tokens_per_step = (
                args.per_device_train_batch_size
                * self.max_seq_length
                * args.gradient_accumulation_steps
            )
            metrics["tokens_per_second"] = tokens_per_step / max(elapsed, 1e-6)

        # --- GPU memory ---
        if torch.cuda.is_available():
            for i in range(torch.cuda.device_count()):
                allocated = torch.cuda.memory_allocated(i) / 1e9
                reserved = torch.cuda.memory_reserved(i) / 1e9
                metrics[f"gpu_{i}/memory_allocated_GB"] = allocated
                metrics[f"gpu_{i}/memory_reserved_GB"] = reserved
                metrics[f"gpu_{i}/memory_utilization_pct"] = (allocated / reserved * 100) if reserved > 0 else 0

        # --- GPU utilization via nvidia-smi ---
        try:
            result = subprocess.run(
                ["nvidia-smi", "--query-gpu=utilization.gpu,temperature.gpu",
                 "--format=csv,noheader,nounits"],
                capture_output=True, text=True, timeout=2
            )
            for i, line in enumerate(result.stdout.strip().split("\n")):
                util, temp = line.split(", ")
                metrics[f"gpu_{i}/utilization_pct"] = float(util)
                metrics[f"gpu_{i}/temperature_C"] = float(temp)
        except Exception:
            pass  # nvidia-smi not available or timeout

        # --- Gradient norm (from model parameters) ---
        # Note: depending on the Trainer version, gradients may already be cleared
        # by the time on_step_end fires. If every p.grad is None here, fall back to
        # the grad_norm entries the Trainer writes to state.log_history.
        if model is not None:
            total_norm = 0.0
            num_params_with_grad = 0
            for p in model.parameters():
                if p.grad is not None:
                    param_norm = p.grad.data.norm(2)
                    total_norm += param_norm.item() ** 2
                    num_params_with_grad += 1
            if num_params_with_grad > 0:
                total_norm = total_norm ** 0.5
                metrics["grad_norm"] = total_norm
                self._grad_norm_history.append(total_norm)

                # Alert if gradient norm is unusually high
                if total_norm > 10.0:
                    print(f"WARNING: High gradient norm at step {state.global_step}: {total_norm:.2f}")

        # --- Learning rate ---
        # HuggingFace Trainer logs this by default, but let's be explicit
        if len(state.log_history) > 0:
            last_log = state.log_history[-1]
            if "learning_rate" in last_log:
                metrics["learning_rate"] = last_log["learning_rate"]

        # --- Loss tracking and spike detection ---
        if len(state.log_history) > 0:
            last_log = state.log_history[-1]
            if "loss" in last_log:
                current_loss = last_log["loss"]
                self._loss_history.append(current_loss)

                # Detect loss spike: current loss > 2x rolling average of last 10 logged losses
                if len(self._loss_history) > 10:
                    rolling_avg = sum(self._loss_history[-10:]) / 10
                    if current_loss > 2 * rolling_avg:
                        metrics["loss_spike_detected"] = 1
                        print(f"WARNING: Loss spike at step {state.global_step}: "
                              f"{current_loss:.4f} vs rolling avg {rolling_avg:.4f}")

                # NaN/Inf detection
                if not torch.isfinite(torch.tensor(current_loss)):
                    metrics["nan_loss_detected"] = 1
                    print(f"CRITICAL: NaN/Inf loss at step {state.global_step}!")

        if metrics:
            wandb.log(metrics, step=state.global_step)

    def on_evaluate(self, args, state: TrainerState, control: TrainerControl, metrics=None, **kwargs):
        """Log eval metrics with step number for proper x-axis alignment."""
        if metrics and wandb.run is not None:
            eval_metrics = {
                "eval/loss": metrics.get("eval_loss"),
                "eval/perplexity": torch.exp(torch.tensor(metrics["eval_loss"])).item()
                if "eval_loss" in metrics else None,
            }
            eval_metrics = {k: v for k, v in eval_metrics.items() if v is not None}
            wandb.log(eval_metrics, step=state.global_step)

    def on_train_end(self, args, state: TrainerState, control: TrainerControl, **kwargs):
        """Log training summary statistics."""
        if wandb.run is not None and self._grad_norm_history:
            import statistics
            wandb.run.summary.update({
                "grad_norm/mean": statistics.mean(self._grad_norm_history),
                "grad_norm/max": max(self._grad_norm_history),
                "grad_norm/p95": sorted(self._grad_norm_history)[int(0.95 * len(self._grad_norm_history))],
                "total_steps": state.global_step,
            })

Using the Callback in Training

from peft import get_peft_model, LoraConfig, TaskType

# Set up LoRA
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    bias="none",
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer.pad_token = tokenizer.eos_token

# Initialize trainer with monitoring callback
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    callbacks=[
        ComprehensiveMonitoringCallback(log_every_n_steps=50),
    ],
)

trainer.train()
wandb.finish()

Debugging Script for Common Training Failures

import torch
from datasets import Dataset


def diagnose_nan_loss(
    model: torch.nn.Module,
    batch: dict,
    check_inputs: bool = True,
    check_gradients: bool = True,
) -> dict:
    """
    Run a single forward + backward pass and diagnose NaN/Inf failures.
    Returns a diagnostic report.
    """
    report = {
        "has_nan_in_inputs": False,
        "has_inf_in_inputs": False,
        "has_nan_in_outputs": False,
        "has_nan_in_gradients": False,
        "problematic_layers": [],
        "loss_value": None,
    }

    model.train()

    # Check inputs
    if check_inputs:
        for key, val in batch.items():
            if isinstance(val, torch.Tensor):
                if torch.isnan(val).any():
                    report["has_nan_in_inputs"] = True
                    print(f"NaN found in input tensor: {key}")
                if torch.isinf(val).any():
                    report["has_inf_in_inputs"] = True
                    print(f"Inf found in input tensor: {key}")

    # Register hooks to check activations at each layer
    nan_hooks = []
    problematic_layers = []

    def make_hook(layer_name):
        def hook(module, input, output):
            if isinstance(output, torch.Tensor):
                if torch.isnan(output).any():
                    problematic_layers.append(f"{layer_name}: NaN in output")
                if torch.isinf(output).any():
                    problematic_layers.append(f"{layer_name}: Inf in output")
            elif isinstance(output, tuple):
                for i, o in enumerate(output):
                    if isinstance(o, torch.Tensor):
                        if torch.isnan(o).any():
                            problematic_layers.append(f"{layer_name}[{i}]: NaN in output")
        return hook

    for name, module in model.named_modules():
        if len(list(module.children())) == 0:  # leaf modules only
            h = module.register_forward_hook(make_hook(name))
            nan_hooks.append(h)

    # Forward pass
    try:
        with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
            outputs = model(**batch)
            loss = outputs.loss
        report["loss_value"] = loss.item() if not torch.isnan(loss) else float("nan")

        if torch.isnan(loss):
            report["has_nan_in_outputs"] = True
            print(f"NaN loss detected! Value: {loss.item()}")

        # Backward pass
        if check_gradients and not torch.isnan(loss):
            loss.backward()

            for name, param in model.named_parameters():
                if param.grad is not None:
                    if torch.isnan(param.grad).any():
                        report["has_nan_in_gradients"] = True
                        report["problematic_layers"].append(f"grad NaN: {name}")
                    if param.grad.norm() > 100:
                        report["problematic_layers"].append(
                            f"large grad norm: {name} = {param.grad.norm().item():.2f}"
                        )

    finally:
        for h in nan_hooks:
            h.remove()

    report["problematic_layers"].extend(problematic_layers)
    return report


def check_data_quality(dataset: Dataset, tokenizer, sample_size: int = 1000) -> dict:
    """
    Check a dataset for common issues that cause training failures.
    """
    issues = {
        "empty_examples": 0,
        "very_long_examples": 0,
        "duplicate_examples": 0,
        "special_token_issues": 0,
        "label_mask_issues": 0,
    }

    seen = set()
    sample = dataset.select(range(min(sample_size, len(dataset))))

    for example in sample:
        # Check for empty text
        text = example.get("text", example.get("input", ""))
        if not text or len(text.strip()) == 0:
            issues["empty_examples"] += 1
            continue

        # Check length
        tokens = tokenizer.encode(text)
        if len(tokens) > 4096:
            issues["very_long_examples"] += 1

        # Check for duplicates
        text_hash = hash(text[:200])  # Hash first 200 chars as proxy
        if text_hash in seen:
            issues["duplicate_examples"] += 1
        seen.add(text_hash)

        # Check label mask (if present)
        if "labels" in example:
            labels = example["labels"]
            if isinstance(labels, list):
                non_masked = sum(1 for l in labels if l != -100)
                if non_masked == 0:
                    issues["label_mask_issues"] += 1  # All labels masked = no learning signal

    return issues


def estimate_memory_requirements(
    model_name: str,
    lora_rank: int = 16,
    batch_size: int = 4,
    seq_len: int = 2048,
    gradient_checkpointing: bool = True,
) -> dict:
    """
    Estimate GPU memory requirements before running the job.
    Avoids OOM surprises mid-training.
    """
    # Rough parameter counts for common models
    param_counts = {
        "7b": 7e9,
        "13b": 13e9,
        "70b": 70e9,
    }

    model_key = [k for k in param_counts.keys() if k in model_name.lower()]
    params = param_counts.get(model_key[0], 7e9) if model_key else 7e9

    # Bytes per parameter (bf16)
    bytes_per_param = 2
    model_memory_GB = params * bytes_per_param / 1e9

    # LoRA adapter memory (r * 2 * target_modules * hidden_dim * 2 bytes)
    # Approximation: ~0.5-2% of model size for rank 16
    lora_fraction = (lora_rank / 128) * 0.02
    lora_memory_GB = model_memory_GB * lora_fraction

    # Optimizer states for LoRA params only (AdamW: 2x params)
    optimizer_memory_GB = lora_memory_GB * 2

    # Activations - depends heavily on seq_len, batch_size, and checkpointing
    # Without gradient checkpointing: ~seq_len * batch_size * hidden_dim * num_layers * 4 bytes
    # With gradient checkpointing: roughly a sqrt(num_layers) factor reduction
    hidden_dim = 4096 if "7b" in model_name.lower() else 8192
    num_layers = 32 if "7b" in model_name.lower() else 80
    activation_factor = 1 if not gradient_checkpointing else 0.15  # rough estimate
    activation_memory_GB = (
        seq_len * batch_size * hidden_dim * num_layers * 4  # float32 activations
        * activation_factor / 1e9
    )

    total_GB = model_memory_GB + lora_memory_GB + optimizer_memory_GB + activation_memory_GB

    return {
        "model_weights_GB": round(model_memory_GB, 1),
        "lora_adapters_GB": round(lora_memory_GB, 2),
        "optimizer_states_GB": round(optimizer_memory_GB, 2),
        "activations_GB": round(activation_memory_GB, 1),
        "total_estimated_GB": round(total_GB, 1),
        "recommended_vram": f"{int(total_GB * 1.2 + 4)} GB",  # 20% headroom + 4GB system
        "fits_in_80GB_A100": total_GB * 1.2 < 80,
    }
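A quick usage example for the estimator above; the printed numbers are rough by construction and worth checking against a short dry run:

estimate = estimate_memory_requirements(
    model_name="meta-llama/Llama-2-7b-hf",
    lora_rank=16,
    batch_size=4,
    seq_len=2048,
    gradient_checkpointing=True,
)
print(estimate)
# e.g. {'model_weights_GB': 14.0, ..., 'fits_in_80GB_A100': True}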

TensorBoard Integration (Alternative to WandB)

from torch.utils.tensorboard import SummaryWriter
from transformers import TrainerCallback

class TensorBoardMonitoringCallback(TrainerCallback):
    """Minimal TensorBoard callback for environments without WandB access."""

    def __init__(self, log_dir: str = "./tb_logs"):
        self.writer = SummaryWriter(log_dir=log_dir)
        self._step = 0

    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs is None:
            return

        self._step = state.global_step

        for key, value in logs.items():
            if isinstance(value, (int, float)):
                # Organize metrics into groups for cleaner TensorBoard layout
                if key.startswith("eval_"):
                    self.writer.add_scalar(f"eval/{key[5:]}", value, self._step)
                elif key in ("loss", "grad_norm", "learning_rate"):
                    self.writer.add_scalar(f"train/{key}", value, self._step)
                else:
                    self.writer.add_scalar(key, value, self._step)

        self.writer.flush()

    def on_train_end(self, args, state, control, **kwargs):
        self.writer.close()

[Diagrams: loss curve interpretation · gradient norm monitoring flow · complete monitoring architecture]


Production Engineering Notes

Setting Up WandB Alerts

WandB supports automated alerts that fire when a metric crosses a threshold. Set these up before every run:

# After wandb.init(), configure alerts
wandb.alert(
    title="Training failure detected",
    text="NaN loss or critical gradient explosion - check run immediately",
    level=wandb.AlertLevel.ERROR,
    wait_duration=0,  # Alert immediately, no debounce
)

For production, use WandB's UI to set metric-based alerts: go to the project settings, add an alert on eval_loss (trigger when metric increases by more than 10% over 3 evaluations). This catches overfitting early.

Checkpoint Strategy

Save checkpoints more frequently than you think you need to. Storage is cheap. Retraining 6 hours of a 70B job is expensive.

Recommended checkpoint strategy for long runs:

training_args = TrainingArguments(
    save_strategy="steps",
    save_steps=100,          # Save every 100 steps
    save_total_limit=5,      # Keep only the last 5 checkpoints
    load_best_model_at_end=True,
    # This automatically loads the eval_loss-best checkpoint at the end
)

For very long runs (multi-day), also save to a separate "milestone" directory every 1000 steps that is never auto-deleted:

class MilestoneCheckpointCallback(TrainerCallback):
    def on_step_end(self, args, state, control, model=None, tokenizer=None, **kwargs):
        if state.global_step > 0 and state.global_step % 1000 == 0 and model is not None:
            milestone_dir = f"./milestones/step-{state.global_step}"
            # save_pretrained writes outside the Trainer's rotating checkpoint
            # directory, so milestones are never deleted by save_total_limit
            model.save_pretrained(milestone_dir)
            if tokenizer is not None:
                tokenizer.save_pretrained(milestone_dir)

Handling NaN Loss in Production

When you encounter NaN loss in a running job, the priority order is:

  1. Stop the job immediately - letting it continue burns compute and corrupts the loss scale
  2. Load the last clean checkpoint - not the most recent one (it may be corrupt)
  3. Inspect the batch that caused the NaN - log batch indices during training so you can identify it
  4. Check for data issues first - NaN in input embeddings is often a tokenization bug
  5. Then check learning rate - if data is clean, LR is almost always the cause

Logging batch indices costs almost nothing and saves hours of debugging:

class BatchIndexLogger(TrainerCallback):
    def on_step_begin(self, args, state, control, **kwargs):
        # Log which step we're about to process.
        # If training crashes, the last logged step points to the problematic batch.
        if state.global_step % 10 == 0:
            with open("./batch_log.txt", "a") as f:
                f.write(f"step={state.global_step}\n")

Multi-GPU Monitoring

When training across multiple GPUs, naive monitoring only captures the primary process (rank 0). Memory usage on other GPUs can diverge due to uneven data distribution or model parallelism imbalances.

import torch.distributed as dist

def log_all_gpu_memory():
    """Call this from rank 0 after gathering from all ranks."""
    if dist.is_initialized():
        all_memory = [None] * dist.get_world_size()
        local_memory = torch.cuda.memory_allocated() / 1e9
        dist.all_gather_object(all_memory, local_memory)
        return {f"gpu_{i}_memory_GB": m for i, m in enumerate(all_memory)}
    else:
        return {"gpu_0_memory_GB": torch.cuda.memory_allocated() / 1e9}

Reading WandB Loss Curves - Pattern Recognition

After running dozens of fine-tuning experiments, you develop an eye for loss curve patterns. Here are the signatures to recognize:

The "hockey stick" - overfitting:

  • Loss falls steeply for first epoch, then barely moves in epoch 2-3
  • Eval loss bottoms out at epoch 1 and slowly rises
  • Cause: dataset is small relative to model capacity and training duration
  • Fix: early stopping at epoch 1, or more data

The "plateau at high loss" - bad data masking:

  • Loss starts at expected value (~10 for LLaMA), drops quickly for 50-100 steps, then flatlines at 2.5-4.0
  • Model is learning but only on unmasked tokens - if your label mask is wrong and you are masking target tokens, training signal is too sparse
  • Fix: verify your label creation code; unmasked tokens should be the completion, not the prompt

The "sawtooth" - learning rate too high with cosine schedule:

  • Loss drops during warmup, then oscillates up and down in a sawtooth pattern
  • The LR is high enough that each update overshoots the local minimum
  • Fix: reduce LR by 2-5x, or increase warmup ratio

The "cliff" - OOM on batch with long sequence:

  • Stable loss for hours, then sudden crash without NaN (different error pattern)
  • Usually hits when a batch randomly contains several very long sequences
  • Fix: add max_length truncation, use DataCollatorForSeq2Seq with pad_to_multiple_of=8 (sketched below)
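A sketch of that mitigation for the "cliff" pattern: truncate at tokenization time and pad to a fixed multiple so no single batch balloons past the memory budget. The model name and lengths are illustrative:

from transformers import AutoTokenizer, DataCollatorForSeq2Seq

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer.pad_token = tokenizer.eos_token

def tokenize(example):
    # Hard cap on sequence length so one outlier document cannot trigger an OOM
    return tokenizer(example["text"], truncation=True, max_length=2048)

# pad_to_multiple_of=8 keeps tensor shapes tensor-core friendly while bounding
# how much padding any single batch can add
collator = DataCollatorForSeq2Seq(tokenizer, pad_to_multiple_of=8, padding=True)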

Common Mistakes

:::danger NaN Loss From fp16 Mixed Precision - Switch to bf16 Immediately

Using fp16=True instead of bf16=True on Ampere or later GPUs (A100, H100, 4090) is the single most common cause of NaN loss in LLM fine-tuning. fp16 has a limited dynamic range (max value 65504) and a loss scaler that can overflow. bf16 has the same range as float32 but reduced precision - far more numerically stable for large models.

If you see NaN loss with fp16 mixed precision:

# WRONG - prone to NaN on large models
training_args = TrainingArguments(fp16=True)

# CORRECT for Ampere+ GPUs
training_args = TrainingArguments(bf16=True)

Never use fp16 for fine-tuning on A100/H100. The performance difference is negligible and bf16 eliminates an entire class of numerical failures. :::

:::danger All Labels Masked - Model Trains But Learns Nothing

The most insidious fine-tuning bug: your training loss decreases to ~2.0 and then completely plateaus. The model appears to train. In reality, your label construction code masked out most or all of the target tokens, leaving almost no learning signal.

For instruction fine-tuning, you typically mask the prompt tokens so the model only learns to generate the response. If your indexing is off by one, or you apply the mask to the wrong slice, you can accidentally mask everything.

# WRONG - masking includes response tokens
labels = input_ids.clone()
labels[:prompt_length] = -100 # This is correct
# But if prompt_length is wrong or len(input_ids) == prompt_length, everything is masked

# ALWAYS verify:
non_masked_count = (labels != -100).sum().item()
assert non_masked_count > 0, f"All labels masked! Prompt length: {prompt_length}, total: {len(input_ids)}"

Add this assertion to your data preparation pipeline. Run it on 100 random examples before starting training. :::

:::warning Not Logging Eval Loss Frequently Enough

Setting eval_steps=1000 for a job that finishes in 500 steps means you never see eval loss until training ends. You cannot detect overfitting in time to stop the job.

For any fine-tuning job under 1000 steps, evaluate every 50-100 steps. For longer jobs, evaluate every 5-10% of total training steps. Evaluation is cheap (no gradient computation) and the insight is invaluable. :::

:::warning Comparing Runs With Different Tokenizers

When tracking loss across experiments, seemingly identical numbers can mean completely different things if the underlying tokenizers encode differently. A loss of 1.5 on a vocabulary of 32,000 tokens is not the same as 1.5 on a vocabulary of 100,000 tokens.

Always compare runs that use the same base model, tokenizer, and evaluation dataset. When comparing across model families (LLaMA vs Mistral vs Falcon), use task-specific evaluation metrics (ROUGE, exact match, downstream accuracy) rather than raw loss values. :::

:::warning Ignoring Gradient Norm During Warmup

Gradient norm is always high during warmup - that is normal. Many engineers see the high grad norm in the first 50-100 steps and panic, reducing the learning rate or adding extra clipping. This disrupts the warmup schedule and leads to underfitting.

The signal to act on is gradient norm that is still high or climbing after the warmup phase ends. During warmup, high grad norm is expected. After warmup stabilizes, grad norm should settle into a consistent range. Watch the post-warmup behavior, not the warmup itself. :::


Interview Q&A

Q1: What is the difference between GPU utilization and MFU (Model FLOPs Utilization), and why does the distinction matter?

GPU utilization is a coarse availability metric - the percentage of time the GPU had at least one kernel running, as reported by nvidia-smi. It tells you whether the GPU is busy but not what it is doing. A GPU can be 95% utilized doing memory transfers, cache management, or waiting for data from CPU, with almost none of that time spent on actual tensor math.

MFU measures the fraction of the GPU's theoretical peak FLOP capacity that is being used for model computation. For an A100 80GB at bf16, peak is 312 TFLOPS. If your training job achieves 60 TFLOPS on model computation, MFU is 60/312 = 19%.

The distinction matters because optimization strategies differ. If GPU utilization is low (less than 60%), you have a data loading or CPU-GPU transfer bottleneck. Fix: increase dataloader_num_workers, use pin_memory=True, or pre-tokenize and cache your dataset. If GPU utilization is high but tokens/second is low, you have a compute efficiency issue. Fix: increase batch size (more work per kernel launch), use Flash Attention, or reduce overhead from small tensor operations.

For LLM fine-tuning, typical achievable MFU is 15-25%. Pretraining jobs with much larger batch sizes can reach 30-45%. If you are below 10% MFU on a training job, something is seriously wrong with your setup.

Q2: What are the three most common causes of NaN loss in LLM fine-tuning, and how do you diagnose which one you have?

The three most common causes are: (1) learning rate too high causing gradient explosion that overflows floating point range, (2) fp16 loss scale overflow (specific to fp16 mixed precision), and (3) NaN or Inf values in the input data.

Diagnosis procedure: First, check if the NaN appears immediately (first step) or after some training. Immediate NaN points to data issues - inspect your dataset for special tokens, empty sequences, or encoding errors. NaN after stable training points to learning rate or numerical precision.

Second, check gradient norm before the NaN step. If gradient norm was climbing for several steps before the NaN, that is learning rate causing gradient explosion. Reduce LR by 5-10x.

Third, if you are using fp16, switch to bf16. fp16's dynamic loss scaler raises the scale whenever gradients underflow; combined with fp16's narrow range (max ~65504), a later large gradient or activation can then overflow. This creates NaN that appears suddenly, without prior warning signs.

The diagnose_nan_loss() function in this lesson runs forward and backward passes with activation hooks to pinpoint exactly which layer first produces NaN. Use it with the batch from the step before the failure.

Q3: How do you detect and respond to overfitting in an LLM fine-tuning run using real-time monitoring?

Overfitting has a specific signature in the loss curves: training loss continues decreasing while eval loss stops decreasing or begins increasing. The gap between training and eval loss widens over time.

In practice, detection requires evaluating frequently enough to catch it early. Set eval_steps to evaluate every 5-10% of total training steps. Enable load_best_model_at_end=True with metric_for_best_model="eval_loss" so HuggingFace Trainer automatically uses the checkpoint with best eval loss, not the final checkpoint.

For automated response, use the EarlyStoppingCallback from transformers.trainer_callback:

from transformers import EarlyStoppingCallback

trainer = Trainer(
    callbacks=[
        EarlyStoppingCallback(
            early_stopping_patience=3,       # Stop if eval loss doesn't improve for 3 evals
            early_stopping_threshold=0.001,  # Minimum improvement to count
        )
    ]
)

In WandB, set an alert rule: trigger if eval_loss increases by more than 5% relative to the historical minimum. This sends a notification without stopping the run automatically, giving you human oversight over the decision.
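If you prefer the alert to live in code rather than the UI, a callback along these lines works. This is a sketch; the class name, 5% tolerance, and alert text are placeholders:

import wandb
from transformers import TrainerCallback

class EvalLossRegressionAlert(TrainerCallback):
    """Fire a WandB alert when eval loss rises more than `tolerance` above its best value."""

    def __init__(self, tolerance: float = 0.05):
        self.tolerance = tolerance
        self.best_eval_loss = float("inf")

    def on_evaluate(self, args, state, control, metrics=None, **kwargs):
        if not metrics or "eval_loss" not in metrics or wandb.run is None:
            return
        eval_loss = metrics["eval_loss"]
        if eval_loss < self.best_eval_loss:
            self.best_eval_loss = eval_loss
        elif eval_loss > self.best_eval_loss * (1 + self.tolerance):
            wandb.alert(
                title="Possible overfitting",
                text=f"eval_loss {eval_loss:.4f} is more than {self.tolerance:.0%} above "
                     f"the best value {self.best_eval_loss:.4f} at step {state.global_step}",
                level=wandb.AlertLevel.WARN,
            )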

Q4: Walk through how gradient clipping works and when it is not enough to prevent training instability.

Gradient clipping scales the gradient tensor down if its L2 norm exceeds a threshold. If the norm is 5.0 and the clip threshold is 1.0, all gradients are multiplied by 1.0/5.0 = 0.2. The direction is preserved, only the magnitude is reduced.

This prevents a single catastrophic update from destroying the model weights. It is triggered via max_grad_norm=1.0 in TrainingArguments, or explicitly:

torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

Gradient clipping is not enough when: (1) the gradient norm is consistently high (above 10-20), meaning it is being clipped on nearly every step. Consistent clipping means the underlying gradients are unstable, not just occasionally large. The fix is reducing learning rate. (2) The instability comes from bf16/fp16 precision loss in activations rather than large gradients - clipping gradients does not help if the activations themselves are producing NaN before gradients are computed. (3) Data-level issues cause individual batches to produce wildly different gradient magnitudes - clip the batch but the instability returns on the next problematic batch. The fix there is data cleaning, not clipping.

Think of gradient clipping as a safety valve, not a fix. If it is firing constantly, that is a signal to investigate the root cause rather than just rely on the valve.

Q5: Your eval loss is not decreasing at all after 500 training steps. Training loss is decreasing normally. What is your debug process?

This specific pattern - training loss decreasing while eval loss is completely flat - almost always has one of three causes.

First, check that your train and eval datasets are genuinely different. If your eval set were a subset of, or identical to, your training data, its loss would decrease along with training loss. Run a quick check: sample examples from each split and verify they are distinct, as sketched below.
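A minimal version of that overlap check, assuming both splits expose a text field (adjust the key to your schema):

def check_split_overlap(train_dataset, eval_dataset, text_key: str = "text", n: int = 100) -> int:
    """Count how many of n sampled eval examples also appear (by prefix hash) in the train set.

    Coarse by design: hashing the first 200 characters mirrors check_data_quality()
    above and is cheap enough to run over the full training split.
    """
    train_hashes = {hash(ex[text_key][:200]) for ex in train_dataset}
    eval_sample = eval_dataset.select(range(min(n, len(eval_dataset))))
    return sum(hash(ex[text_key][:200]) in train_hashes for ex in eval_sample)

# Anything above zero means the splits leak into each other and eval loss cannot be trusted.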

Second, check your eval data format. The eval dataset must use the same tokenization and formatting as training. A common bug: you apply a chat template to training data but not to eval data (or vice versa). The model is learning to produce responses in chat format but is being evaluated on raw text, so eval loss does not improve because the model's output distribution and eval distribution are misaligned.

Third, check the eval label mask. If you apply special label masking during training (masking prompt tokens) but your eval collator does not apply the same mask, eval loss is computed over all tokens including prompt tokens. Since the model was not trained to predict prompt tokens, eval loss stays high while training loss (on completion tokens only) improves.

After those three checks: ensure eval_steps is not set so large that eval has only run once or twice - the curve may look flat just because of sparse sampling. And verify the eval dataset is representative - if your 500-example eval set has a different distribution than training, improvement on training examples will not show up in eval loss.

Q6: How would you set up a monitoring system for a multi-week fine-tuning job to ensure you catch problems even if you are not watching the dashboard?

A multi-week job needs automated alerting, not manual checking. The monitoring system has three layers.

Layer 1 - WandB alerts. Set metric-based alerts in WandB's UI: alert if eval_loss increases by more than 5% versus the previous evaluation, alert if grad_norm exceeds 10.0 for 3 consecutive steps, alert if tokens_per_second drops below 50% of the baseline (indicates a stall or hardware failure). These send Slack or email notifications within minutes.

Layer 2 - Checkpoint integrity checks. Every 1000 steps, run a brief sanity check on the saved checkpoint: load the model, run 10 inference examples, check the output is not degenerate (no empty strings, no infinite repetition loops). Log a "checkpoint_health" metric to WandB. If the model starts producing obviously wrong outputs, you want to know before you have trained 10 more hours on top of a broken checkpoint.
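A sketch of that layer-2 check. It assumes the checkpoint is a full model directory (adapter-only checkpoints need PEFT loading instead), and the prompts and repetition heuristic are placeholders:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def checkpoint_health_check(checkpoint_dir: str, prompts: list, max_new_tokens: int = 64) -> float:
    """Load a checkpoint, generate from a handful of prompts, and return the
    fraction of healthy outputs (non-empty, not an obvious repetition loop)."""
    tokenizer = AutoTokenizer.from_pretrained(checkpoint_dir)
    model = AutoModelForCausalLM.from_pretrained(
        checkpoint_dir, torch_dtype=torch.bfloat16, device_map="auto"
    )
    model.eval()

    healthy = 0
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        with torch.no_grad():
            out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
        text = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
        # Degenerate-output heuristics: empty text, or one phrase repeated over and over
        words = text.split()
        repetitive = len(words) > 10 and len(set(words)) < 0.2 * len(words)
        if text.strip() and not repetitive:
            healthy += 1
    return healthy / max(len(prompts), 1)

# Log the result as a metric, e.g. wandb.log({"checkpoint_health": score}, step=current_step)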

Layer 3 - Compute cost monitoring. Set a cloud budget alert on your GPU cluster (AWS, GCP, CoreWeave all support this). If the job runs 20% over the projected cost, you want a notification. This catches infinite loops, jobs that fail to stop cleanly, and accidental multi-GPU allocation bugs.

For the WandB piece specifically, the alert setup is through the UI (Alerts tab in project settings) or programmatically:

wandb.alert(
    title="Training stalled",
    text=f"Loss has not improved in 500 steps. Last eval_loss: {current_eval_loss}",
    level=wandb.AlertLevel.WARN,
)

Key Takeaways

Monitoring is not overhead. The 30 minutes you spend setting up a comprehensive callback before a long training run is paid back in the first job that would otherwise have crashed silently and wasted hours.

The metrics that matter most, in order of importance: eval loss (is the model learning generalizable patterns?), gradient norm (is training numerically stable?), tokens per second (is the hardware being used efficiently?), GPU memory (are you close to OOM?).

The two failure modes that cost the most money and time are NaN loss from fp16 (fix: use bf16) and overfitting from over-training a small dataset (fix: eval frequently, use early stopping).

Everything else is downstream of having the data to debug. Log it all. Disk space is cheap.
