Weights & Biases - The ML Experiment Tracking Standard

Reading time: ~30 minutes | Level: ML with Python | Role: MLE, Data Scientist, MLOps Engineer


The OpenAI GPT-2 Team, 2019

Picture the research team at OpenAI in early 2019, training GPT-2. They are running hundreds of experiments - varying the learning rate schedule, the number of layers, the batch size, the warmup steps, the weight decay, the tokenizer vocabulary size. Each experiment takes 8 to 48 hours on a cluster of V100s. The team is 12 people across 3 time zones.

Without experiment tracking: experiment 47 finishes on a Tuesday morning and shows the best perplexity you have seen. You Slack the result to the team. Three weeks later, someone asks to reproduce it. Which learning rate did experiment 47 use? You check your local notes - you wrote "3e-4" but the training script had a typo and used 3e-3. Was experiment 12 trained with dropout or without? You are not sure - the logs say it ran for 72 hours and achieved 15.2 perplexity, but the config file has been overwritten twice since then. Experiment 34 beat experiment 31 by 0.8 points - but why? You cannot diff two experiment configs without manually reading two log files and hoping someone committed the right version of the training script.

With W&B: every run records the hyperparameters you pass in from argparse or a config dict, every metric at every logged step, system metrics (GPU utilization, memory, temperature), the git commit hash of the code that ran it, the Docker image, and a snapshot of the diff. The team dashboard shows all 200 runs in a single table. You filter by eval_loss < 2.5, sort by val_perplexity, click on experiment 47 to see its complete config, and reproduce it from the logged command line in two minutes.

The team that ships fastest in ML is not always the one with the best ideas. It is the one that can find, reproduce, and build on their best results. Experiment tracking is infrastructure for compounding - every good result you can actually reproduce is one you can improve from. Every result you cannot reproduce is a dead end, even if it was good.

This lesson covers W&B end-to-end: the core tracking API, rich media logging, artifacts for data and model versioning, hyperparameter sweeps with Bayesian optimization, the model registry, and integration patterns with PyTorch, HuggingFace, and Lightning.

Why This Exists: The Experiment Tracking Problem

A model takes 8 hours to train. You run 50 experiments over a month. Without tracking, four specific things break:

Reproducibility: which exact combination of hyperparameters, code version, and random seed produced that result? Print statements and log files require manual parsing. Config files get overwritten.

Comparison: comparing two runs means reading two log files side by side, manually diffing configs, hoping both used the same evaluation protocol.

Collaboration: your teammate runs experiment 31, you run experiment 34. Neither of you has a shared view of all running and completed experiments. Progress is communicated through Slack messages and spreadsheets.

Selection bias: when you cannot easily compare all runs, you tend to remember the ones that confirmed your hypothesis. W&B makes all runs equally visible - the one that ran at 3am on Friday is as searchable as the one you manually wrote up.

The standard before W&B: TensorBoard (Google, 2016) - visualizes training curves from a local event file. Good for a single run, on a single machine. No multi-run comparison, no hyperparameter logging, no team sharing, no artifact management.

Neptune.ai and Comet.ml arrived in 2017, providing cloud-hosted alternatives. Weights & Biases (founded by Lukas Biewald, Shawn Lewis, and Chris Van Pelt in 2017) became the dominant standard by 2021 for two reasons: the comparison interface is genuinely excellent, and the sweep (hyperparameter search) system is tightly integrated with the run tracking.

MLflow (Databricks, 2018) is the main alternative - open-source, self-hostable, stronger in enterprise settings with strict data governance requirements. W&B is cloud-first and teams-first.

Quick Start: The Core Three Functions

W&B's API surface is deliberately small. You need three functions to track an experiment.

pip install wandb
wandb login # authenticates with your W&B account, saves token to ~/.netrc

import wandb
import torch
import torch.nn as nn

# 1. Initialize a run - creates a new experiment record on W&B
run = wandb.init(
    project="financial-sentiment", # project groups related runs
    name="bert-lr-3e4-wd-01", # human-readable run name (optional)
    tags=["bert", "finbert", "lora"],
    notes="Testing higher LR with LoRA r=16",
    config={ # hyperparameters logged at init
        "model": "ProsusAI/finbert",
        "learning_rate": 3e-4,
        "epochs": 5,
        "batch_size": 32,
        "lora_r": 16,
        "lora_alpha": 32,
        "weight_decay": 0.01,
        "warmup_ratio": 0.06,
    },
)

# 2. Log metrics - call this inside your training loop
# (model, optimizer, train_one_epoch, and evaluate come from your own training code)
for epoch in range(wandb.config.epochs):
    train_loss = train_one_epoch(model, optimizer)
    val_loss, val_f1 = evaluate(model)

    wandb.log({
        "train/loss": train_loss,
        "val/loss": val_loss,
        "val/f1": val_f1,
        "epoch": epoch,
        "learning_rate": optimizer.param_groups[0]["lr"],
    })

# 3. Finish - marks the run as complete, flushes all data
wandb.finish()

When you call wandb.init(), W&B prints a URL: https://wandb.ai/your-username/financial-sentiment/runs/abc123. Every metric you pass to wandb.log() appears as a live chart at that URL within seconds. Your teammate in a different timezone can watch your run in real time.

wandb.config: Tracking Hyperparameters

The config is the contract of your experiment - the full specification of what was run. W&B stores it as a structured JSON object attached to the run.

# Method 1: dict at init (shown above)
wandb.init(config={"lr": 3e-4, "epochs": 5})

# Method 2: argparse integration - W&B reads from argparse namespace
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--lr", type=float, default=3e-4)
parser.add_argument("--epochs", type=int, default=5)
parser.add_argument("--batch_size", type=int, default=32)
parser.add_argument("--gradient_accumulation_steps", type=int, default=1)
args = parser.parse_args()

wandb.init(config=args) # W&B reads all argparse attributes

# Method 3: update config after init (e.g., after computing derived values)
# train_loader is the DataLoader built elsewhere in your script
wandb.config.update({
    "effective_batch_size": args.batch_size * args.gradient_accumulation_steps,
    "total_steps": len(train_loader) * args.epochs,
})

# Access config values in code (useful inside sweep agents)
lr = wandb.config.lr
epochs = wandb.config.epochs

:::tip Sweep agents override config

When running a hyperparameter sweep, the sweep agent invokes your training function, and the wandb.init() call inside it receives the hyperparameter values chosen for that trial in wandb.config. This is why your training code should always read from wandb.config rather than hard-coded values - the same script works both for individual runs and for sweep agents without modification.

:::
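Because the config is stored as structured data rather than free text, two runs can be compared key by key instead of by reading log files. A minimal sketch of that comparison in plain Python, assuming configs are (possibly nested) dicts; `flatten` and `diff_configs` are hypothetical helper names for illustration, not part of the wandb API, and the two example configs are made up:

```python
from typing import Any


def flatten(config: dict[str, Any], prefix: str = "") -> dict[str, Any]:
    """Flatten nested dicts into dotted keys, e.g. {"opt": {"lr": 1}} -> {"opt.lr": 1}."""
    flat: dict[str, Any] = {}
    for key, value in config.items():
        dotted = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{dotted}."))
        else:
            flat[dotted] = value
    return flat


def diff_configs(a: dict[str, Any], b: dict[str, Any]) -> dict[str, tuple[Any, Any]]:
    """Return {key: (value_in_a, value_in_b)} for every key whose value differs."""
    flat_a, flat_b = flatten(a), flatten(b)
    missing = object()  # sentinel for keys present on only one side
    changed: dict[str, tuple[Any, Any]] = {}
    for key in sorted(flat_a.keys() | flat_b.keys()):
        va, vb = flat_a.get(key, missing), flat_b.get(key, missing)
        if va is missing or vb is missing or va != vb:
            changed[key] = (
                None if va is missing else va,
                None if vb is missing else vb,
            )
    return changed


# Illustrative configs only; the values are not taken from real runs.
run_31 = {"learning_rate": 3e-4, "lora": {"r": 16, "alpha": 32}, "weight_decay": 0.01}
run_34 = {"learning_rate": 3e-4, "lora": {"r": 8, "alpha": 32}, "weight_decay": 0.05}

print(diff_configs(run_31, run_34))
```

Only the keys that differ (`lora.r` and `weight_decay` here) are reported, which is the question "why did run 34 beat run 31?" reduced to a dictionary lookup.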

wandb.log(): Metrics, Media, and Everything Else

wandb.log() accepts any JSON-serializable value, plus W&B's rich media types.

Basic Metrics

# Log scalars - the most common case
wandb.log({
"train/loss": 0.423,
"train/accuracy": 0.874,
"val/loss": 0.391,
"val/f1": 0.881,
"grad_norm": 1.24,
"learning_rate": 2.1e-5,
})

# Use the step parameter to specify x-axis position
# By default, W&B uses an auto-incrementing step counter
wandb.log({"train/loss": loss}, step=global_step)

# Log every N steps (not every step) to reduce overhead
if global_step % 50 == 0:
    wandb.log({"train/loss": loss.item()}, step=global_step)
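Logging only every Nth step drops the values in between, so the logged point reflects a single batch. One common variation is to average over the window and emit the mean. The sketch below shows that pattern under stated assumptions: `ThrottledLogger` is a hypothetical helper, and `log_fn` stands in for `wandb.log` so the example runs without a W&B account.

```python
from typing import Callable


class ThrottledLogger:
    """Average a scalar over a window and emit it once every `every` steps."""

    def __init__(self, log_fn: Callable[..., None], every: int = 50) -> None:
        self.log_fn = log_fn  # stand-in for wandb.log (assumption for this sketch)
        self.every = every
        self._sum = 0.0
        self._count = 0

    def update(self, loss: float, step: int) -> None:
        self._sum += loss
        self._count += 1
        if step % self.every == 0:
            # Emit the window mean, then start a fresh window.
            self.log_fn({"train/loss": self._sum / self._count}, step=step)
            self._sum, self._count = 0.0, 0


emitted = []
logger = ThrottledLogger(lambda data, step: emitted.append((step, data)), every=50)
for global_step in range(1, 201):
    logger.update(loss=1.0 / global_step, step=global_step)  # fake, steadily falling loss

print([step for step, _ in emitted])
```

With 200 steps and `every=50`, the callback fires four times instead of 200, and each point summarizes its whole window rather than one batch.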

Logging Images - Visualizing Predictions

import numpy as np
from PIL import Image

# Log a batch of images with their predicted vs true labels
def log_prediction_examples(model, val_loader, n=8):
    model.eval()
    images, labels, preds = [], [], []
    with torch.no_grad():
        for imgs, lbls in val_loader:
            logits = model(imgs)
            predictions = logits.argmax(dim=1)
            images.extend(imgs[:n].cpu().numpy())
            labels.extend(lbls[:n].cpu().numpy())
            preds.extend(predictions[:n].cpu().numpy())
            break  # only the first batch is needed

    label_names = ["negative", "neutral", "positive"]
    wandb_images = [
        wandb.Image(
            img.transpose(1, 2, 0), # CHW → HWC for PIL
            caption=f"True: {label_names[lbl]} | Pred: {label_names[pred]}"
        )
        for img, lbl, pred in zip(images, labels, preds)
    ]
    wandb.log({"val/predictions": wandb_images})

wandb.Table: Per-Sample Tracking

Tables let you track individual predictions, find patterns in errors, and build confusion matrices.

# Track per-sample predictions for error analysis
def log_prediction_table(texts, true_labels, pred_labels, scores):
    """Log individual predictions to a W&B table for qualitative analysis."""
    table = wandb.Table(
        columns=["text", "true_label", "predicted_label", "confidence", "correct"],
    )
    # Same id order as the image example above; it must match your dataset's label ids
    label_names = {0: "negative", 1: "neutral", 2: "positive"}

    for text, true, pred, score in zip(texts, true_labels, pred_labels, scores):
        table.add_data(
            text[:200], # truncate long texts
            label_names[true],
            label_names[pred],
            round(float(score), 4),
            "yes" if true == pred else "no",
        )

    wandb.log({"val/predictions_table": table})

# Built-in confusion matrix (W&B computes it from the raw labels and predictions)
def log_confusion_matrix(y_true, y_pred, class_names):
    wandb.log({
        "val/confusion_matrix": wandb.plot.confusion_matrix(
            probs=None,
            y_true=y_true,
            preds=y_pred,
            class_names=class_names,
        )
    })

# ROC curve (y_score holds the per-class predicted probabilities for each sample)
wandb.log({
    "val/roc_curve": wandb.plot.roc_curve(
        y_true, y_score, labels=class_names
    )
})

# PR curve
wandb.log({
    "val/pr_curve": wandb.plot.pr_curve(
        y_true, y_score, labels=class_names
    )
})

Weight Histograms - Monitoring Training Health

# Log weight and gradient distributions to detect vanishing/exploding gradients
def log_weight_histograms(model, step):
    histograms = {}
    for name, param in model.named_parameters():
        if param.requires_grad and param.grad is not None:
            histograms[f"weights/{name}"] = wandb.Histogram(param.detach().cpu().numpy())
            histograms[f"gradients/{name}"] = wandb.Histogram(param.grad.detach().cpu().numpy())
    # A single log call per step sends all histograms together
    wandb.log(histograms, step=step)

Artifacts: Versioned Data and Models

Artifacts are the W&B answer to "which version of the dataset trained which version of the model which produced which set of predictions?" They give ML the same lineage tracking that software has with git.

An Artifact is a versioned, named, typed collection of files with metadata. Every time you log an artifact with the same name, W&B creates a new version (v0, v1, v2, ...) and tracks what changed.
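The versioning idea can be pictured as content addressing: hash the files, and bump the version only when the hash changes. The sketch below is a toy model of that idea, not W&B's implementation; `content_digest` and `ToyArtifactStore` are hypothetical names, and the rule "identical contents reuse the latest version" is an assumption made for the illustration.

```python
import hashlib


def content_digest(files: dict[str, bytes]) -> str:
    """Hash file names and contents in a stable (sorted) order."""
    h = hashlib.sha256()
    for name in sorted(files):
        h.update(name.encode("utf-8"))
        h.update(b"\0")
        h.update(hashlib.sha256(files[name]).digest())
    return h.hexdigest()


class ToyArtifactStore:
    """Keeps an ordered list of digests per artifact name; list index = version number."""

    def __init__(self) -> None:
        self.versions: dict[str, list[str]] = {}

    def log(self, name: str, files: dict[str, bytes]) -> str:
        digest = content_digest(files)
        history = self.versions.setdefault(name, [])
        if not history or history[-1] != digest:
            history.append(digest)  # contents changed: record a new version
        return f"{name}:v{len(history) - 1}"


store = ToyArtifactStore()
print(store.log("bloomberg-financial-clean", {"train.csv": b"a,b\n1,2\n"}))
print(store.log("bloomberg-financial-clean", {"train.csv": b"a,b\n1,2\n"}))
print(store.log("bloomberg-financial-clean", {"train.csv": b"a,b\n1,2\n3,4\n"}))
```

The first two calls resolve to the same version because the bytes are identical; the third produces the next version because one file changed.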

Logging Artifacts

import wandb

run = wandb.init(project="financial-sentiment", job_type="data-preprocessing")

# --- Dataset artifact ---
dataset_artifact = wandb.Artifact(
name="bloomberg-financial-clean",
type="dataset",
description="Cleaned Bloomberg financial sentiment corpus, deduplicated and normalized.",
metadata={
"num_rows": 4612,
"num_classes": 3,
"source": "Bloomberg terminal + manual annotation",
"preprocessing": "deduplicated, lowercased, removed HTML",
},
)

# Add files or directories to the artifact (pick one; train.csv is already inside the directory)
dataset_artifact.add_dir("./data/processed/") # adds all files in directory
# dataset_artifact.add_file("./data/processed/train.csv") # or add individual files instead

# Log the artifact - creates version v0 (or increments version if name exists)
run.log_artifact(dataset_artifact)

wandb.finish()


# --- Model artifact (after training) ---
run = wandb.init(project="financial-sentiment", job_type="training")

# Use a specific dataset version as input (creates lineage link)
artifact = run.use_artifact("bloomberg-financial-clean:latest", type="dataset")
artifact_dir = artifact.download() # downloads to local cache
# Now use artifact_dir as your data path

# ... training happens here ...

# Log model artifact
model_artifact = wandb.Artifact(
name="finbert-bloomberg",
type="model",
description="FinBERT fine-tuned on Bloomberg financial sentiment with LoRA.",
metadata={
"architecture": "ProsusAI/finbert + LoRA r=16",
"val_f1": 0.891,
"val_accuracy": 0.887,
"training_epochs": 5,
"training_steps": 1500,
},
)
model_artifact.add_dir("./finbert-bloomberg/") # model weights + config
run.log_artifact(model_artifact)

wandb.finish()

Loading Artifacts

# In a downstream run (evaluation, deployment)
run = wandb.init(project="financial-sentiment", job_type="evaluation")

# Reference a specific version
model_artifact = run.use_artifact("finbert-bloomberg:v3", type="model")

# Or always use latest
model_artifact = run.use_artifact("finbert-bloomberg:latest", type="model")

# Download to local cache (W&B caches - second call is instant)
model_dir = model_artifact.download()
print(model_dir) # something like ./artifacts/finbert-bloomberg:v3/

# Load the model from the downloaded path
from transformers import AutoModelForSequenceClassification, AutoTokenizer
model = AutoModelForSequenceClassification.from_pretrained(model_dir)
tokenizer = AutoTokenizer.from_pretrained(model_dir)

Sweeps: Hyperparameter Search

A sweep is a coordinated search over a hyperparameter space. You define the search space and strategy in a config, then launch multiple agents - each agent runs one or more training jobs, reporting results back to the sweep controller which uses those results to suggest the next configuration.
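The controller/agent split can be sketched without any W&B machinery. The toy loop below uses plain random search: a `sample` function plays the controller by suggesting a configuration, and the loop body plays the agent by "training" and reporting a score. Everything here is an assumption made for illustration: `sample` and `toy_objective` are hypothetical names, the objective is a made-up function rather than a real training run, and the parameter-space format simply mirrors the sweep configs used in this lesson.

```python
import math
import random


def sample(space: dict, rng: random.Random) -> dict:
    """Draw one configuration from a sweep-style parameter space."""
    config = {}
    for name, spec in space.items():
        if "values" in spec:
            config[name] = rng.choice(spec["values"])
        elif spec.get("distribution") == "log_uniform_values":
            # Uniform in log-space between min and max
            lo, hi = math.log(spec["min"]), math.log(spec["max"])
            config[name] = math.exp(rng.uniform(lo, hi))
        else:  # treat anything else as plain uniform for this sketch
            config[name] = rng.uniform(spec["min"], spec["max"])
    return config


def toy_objective(config: dict) -> float:
    """Made-up stand-in for a training run; peaks at learning_rate=1e-4, lora_r=16."""
    lr_term = -abs(math.log10(config["learning_rate"]) + 4.0)
    r_term = -abs(config["lora_r"] - 16) / 16
    return lr_term + r_term


space = {
    "learning_rate": {"distribution": "log_uniform_values", "min": 1e-5, "max": 1e-3},
    "lora_r": {"values": [4, 8, 16, 32]},
}

rng = random.Random(0)
trials = []
for _ in range(20):
    config = sample(space, rng)                       # controller: suggest a config
    trials.append((toy_objective(config), config))    # agent: run it and report back

best_score, best_config = max(trials, key=lambda t: t[0])
print(best_score, best_config)
```

A Bayesian controller differs only in the `sample` step: instead of drawing blindly, it uses the scores already in `trials` to decide where to look next.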

Sweep Configuration

import wandb

sweep_config = {
"name": "finbert-lora-sweep",
"method": "bayes", # "grid", "random", or "bayes"
"metric": {
"name": "val/f1",
"goal": "maximize",
},
"parameters": {
"learning_rate": {
"distribution": "log_uniform_values",
"min": 1e-5,
"max": 1e-3,
},
"lora_r": {
"values": [4, 8, 16, 32],
},
"lora_alpha": {
"values": [8, 16, 32, 64],
},
"weight_decay": {
"distribution": "uniform",
"min": 0.0,
"max": 0.1,
},
"warmup_ratio": {
"values": [0.0, 0.03, 0.06, 0.1],
},
"batch_size": {
"values": [16, 32, 64],
},
"dropout": {
"distribution": "uniform",
"min": 0.0,
"max": 0.3,
},
},
"early_terminate": {
"type": "hyperband", # stop underperforming runs early
"min_iter": 3,
"eta": 2,
},
}

# Create the sweep - returns a sweep_id
sweep_id = wandb.sweep(sweep_config, project="financial-sentiment")
print(f"Sweep ID: {sweep_id}")
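The `early_terminate` block is easier to reason about once the bracket schedule is written out. With `min_iter` and `eta`, the checkpoints fall at `min_iter`, `min_iter * eta`, `min_iter * eta**2`, and so on. The sketch below computes that schedule and shows a successive-halving style cut at one bracket. It is an illustration under assumptions: `hyperband_brackets` and `survivors` are hypothetical helpers, `max_iter` is just a horizon chosen for the example, and the "keep the top 1/eta" rule is the textbook idea rather than a statement of exactly how W&B decides which runs to stop.

```python
def hyperband_brackets(min_iter: int, eta: int, max_iter: int) -> list[int]:
    """Iterations at which runs are compared: min_iter, min_iter*eta, min_iter*eta**2, ..."""
    if min_iter < 1 or eta < 2:
        raise ValueError("min_iter must be >= 1 and eta must be >= 2")
    brackets = []
    step = min_iter
    while step <= max_iter:
        brackets.append(step)
        step *= eta
    return brackets


def survivors(scores: dict[str, float], eta: int) -> list[str]:
    """Keep the top 1/eta of runs (at least one), assuming higher scores are better."""
    keep = max(1, len(scores) // eta)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:keep]


print(hyperband_brackets(min_iter=3, eta=2, max_iter=30))
print(survivors({"run-a": 0.81, "run-b": 0.86, "run-c": 0.79, "run-d": 0.88}, eta=2))
```

For early termination to have anything to compare, the sweep metric has to be logged during training, not only once at the end of the run.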

The Training Function for Sweeps

The training function must read all hyperparameters from wandb.config (not hard-coded), since the sweep agent will inject the values for each trial.

def train():
    # wandb.init() - sweep agent provides the config
    run = wandb.init()

    # All hyperparameters come from wandb.config
    lr = wandb.config.learning_rate
    lora_r = wandb.config.lora_r
    lora_alpha = wandb.config.lora_alpha
    weight_decay = wandb.config.weight_decay
    warmup_ratio = wandb.config.warmup_ratio
    batch_size = wandb.config.batch_size
    dropout = wandb.config.dropout

    # Build model with these hyperparameters
    from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
    from peft import LoraConfig, get_peft_model, TaskType

    base = AutoModelForSequenceClassification.from_pretrained(
        "ProsusAI/finbert", num_labels=3
    )
    lora_cfg = LoraConfig(
        r=lora_r,
        lora_alpha=lora_alpha,
        target_modules=["query", "key", "value"],
        lora_dropout=dropout,
        bias="none",
        task_type=TaskType.SEQ_CLS,
    )
    model = get_peft_model(base, lora_cfg)

    args = TrainingArguments(
        output_dir=f"./sweep-{run.id}",
        num_train_epochs=3,
        per_device_train_batch_size=batch_size,
        learning_rate=lr,
        weight_decay=weight_decay,
        warmup_ratio=warmup_ratio,
        evaluation_strategy="epoch",
        save_strategy="no", # no checkpointing during sweep (saves disk)
        report_to="wandb",
        logging_steps=50,
        fp16=True,
    )

    # train_dataset, val_dataset, and compute_metrics are defined elsewhere in the script
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        eval_dataset=val_dataset,
        compute_metrics=compute_metrics,
    )

    trainer.train()

    # Log final metrics - the sweep controller uses "val/f1" to rank this trial
    metrics = trainer.evaluate()
    wandb.log({"val/f1": metrics["eval_f1"], "val/accuracy": metrics["eval_accuracy"]})
    wandb.finish()


# Launch agents - each agent runs trials until the sweep quota is exhausted
# Run this on multiple machines to parallelize the search
wandb.agent(sweep_id, function=train, count=20) # this agent runs 20 trials

Bayesian Optimization Under the Hood

When method: "bayes", W&B uses Gaussian Process (GP) based Bayesian optimization. The GP models the unknown objective function $f(\mathbf{x})$ (where $\mathbf{x}$ is the hyperparameter vector) as a probability distribution. After each completed trial, the GP posterior is updated.

To select the next trial, W&B maximizes an acquisition function. The Expected Improvement (EI) acquisition is:

$$\text{EI}(\mathbf{x}) = \mathbb{E}[\max(f(\mathbf{x}) - f(\mathbf{x}^*), 0)]$$

where $f(\mathbf{x}^*)$ is the best observed value so far. This naturally balances exploitation (sampling near the current best) and exploration (sampling where the GP is uncertain).
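When the GP posterior at a candidate point is Gaussian with mean $\mu$ and standard deviation $\sigma$, the expectation above has a standard closed form: $\text{EI} = (\mu - f^*)\,\Phi(z) + \sigma\,\phi(z)$ with $z = (\mu - f^*)/\sigma$, where $\Phi$ and $\phi$ are the standard normal CDF and PDF and $f^*$ is the best value so far. The sketch below evaluates it for two hypothetical candidates; the numbers are invented for illustration, and the function names are not part of any W&B API.

```python
import math


def normal_pdf(z: float) -> float:
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z: float) -> float:
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_improvement(mu: float, sigma: float, best: float) -> float:
    """EI for maximization when f(x) ~ Normal(mu, sigma**2) and `best` is f(x*)."""
    if sigma <= 0.0:
        return max(mu - best, 0.0)  # no uncertainty: the improvement is deterministic
    z = (mu - best) / sigma
    return (mu - best) * normal_cdf(z) + sigma * normal_pdf(z)


best_f1 = 0.88
# Candidate A: predicted slightly better than the best, with little uncertainty.
ei_a = expected_improvement(mu=0.89, sigma=0.005, best=best_f1)
# Candidate B: predicted worse on average, but the model is very unsure about it.
ei_b = expected_improvement(mu=0.86, sigma=0.05, best=best_f1)
print(ei_a, ei_b)
```

Candidate B ends up with the larger EI despite its lower predicted mean, which is the exploration side of the trade-off: large uncertainty leaves room for a big improvement.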

In practice, Bayesian optimization finds good hyperparameters in 20-40 trials that would require hundreds of random trials to match. For expensive training runs (hours per trial), this matters significantly.

:::tip Sweep strategy selection

  • Grid search: exhaustive, practical only with 2-3 small discrete parameters. Never use for continuous parameters.
  • Random search: embarrassingly parallel, surprisingly effective (Bergstra & Bengio, 2012 showed it beats grid for high-dimensional spaces where most parameters do not matter). Use as a fast baseline.
  • Bayesian (bayes): best when trials are expensive (30+ minutes each) and you have budget for 20-100 trials. The benefit over random diminishes if trials are cheap.

:::

Model Registry

The Model Registry is where you promote a trained model artifact to a named, versioned alias that represents "what is in production right now."

# After a sweep, identify the best run
api = wandb.Api()
sweep = api.sweep("your-username/financial-sentiment/sweep_id")
best_run = sweep.best_run()
print(f"Best run: {best_run.id}, val_f1: {best_run.summary['val/f1']:.4f}")

# Pick the model artifact logged by the best run
# (assumes you logged a model artifact during training)
model_artifacts = [a for a in best_run.logged_artifacts() if a.type == "model"]
best_artifact = model_artifacts[0]

# Tag the artifact as "production" - this alias can be moved to new versions.
# Updating aliases goes through the public API object, so no active run is needed.
best_artifact.aliases.append("production")
best_artifact.save()


# In deployment / serving code - always load "production" alias


# In deployment / serving code - always load "production" alias
api = wandb.Api()
artifact = api.artifact("your-username/financial-sentiment/finbert-bloomberg:production")
model_dir = artifact.download()

from transformers import AutoModelForSequenceClassification
production_model = AutoModelForSequenceClassification.from_pretrained(model_dir)

The Registry solves the "which model is in production" problem. When you retrain and promote a new version to "production", the alias moves automatically. Old versions remain available by their version number (v3, v4) for rollback.
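The alias mechanics reduce to a small data structure: versions are immutable and append-only, while an alias is a movable pointer to exactly one version. The toy registry below illustrates promotion and rollback under that model. `ToyRegistry` is a hypothetical class for illustration, not the W&B Registry API, and the payload strings stand in for model files.

```python
class ToyRegistry:
    """Versions are immutable; an alias is a name that points at exactly one version."""

    def __init__(self) -> None:
        self.versions: list[str] = []      # index i holds the payload of version v{i}
        self.aliases: dict[str, int] = {}  # alias -> version index

    def add_version(self, payload: str) -> int:
        self.versions.append(payload)
        return len(self.versions) - 1

    def set_alias(self, alias: str, version: int) -> None:
        if not 0 <= version < len(self.versions):
            raise IndexError(f"unknown version v{version}")
        self.aliases[alias] = version  # moving an alias just overwrites the pointer

    def resolve(self, ref: str) -> str:
        """Resolve an alias such as 'production' or an explicit version such as 'v3'."""
        if ref in self.aliases:
            return self.versions[self.aliases[ref]]
        return self.versions[int(ref.lstrip("v"))]


registry = ToyRegistry()
v0 = registry.add_version("weights-from-run-31")
v1 = registry.add_version("weights-from-run-34")

registry.set_alias("production", v0)
print(registry.resolve("production"))  # serving code asks for the alias, not a version

registry.set_alias("production", v1)   # promote the retrained model
print(registry.resolve("production"))
print(registry.resolve("v0"))          # the old version is still addressable for rollback
```

Serving code that always resolves `production` never changes when a new model is promoted; only the pointer moves.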

Reports: Communicating Results

W&B Reports are shareable, interactive dashboards that embed live charts, tables, and markdown text. They are the equivalent of a lab notebook page that automatically stays in sync with your run data.

# Reports themselves are usually created in the W&B UI and embedded in a team wiki or email.
# The public API below pulls the same run data into pandas for offline analysis.
import pandas as pd
import wandb

api = wandb.Api()

# Get runs for a project
runs = api.runs(
    "your-username/financial-sentiment",
    filters={"state": "finished", "config.model": "ProsusAI/finbert"},
)

# Export run summary as a dataframe for offline analysis
summary_data = []
for run in runs:
    summary_data.append({
        "run_id": run.id,
        "run_name": run.name,
        "val_f1": run.summary.get("val/f1"),
        "val_accuracy": run.summary.get("val/accuracy"),
        "learning_rate": run.config.get("learning_rate"),
        "lora_r": run.config.get("lora_r"),
        "epochs": run.config.get("epochs"),
    })

df = pd.DataFrame(summary_data).sort_values("val_f1", ascending=False)
print(df.head(10))

Integration Patterns

PyTorch: Manual Integration

import wandb
import torch
import torch.nn as nn
from sklearn.metrics import f1_score
from torch.utils.data import DataLoader

def train_pytorch(model, train_loader, val_loader, config):
    run = wandb.init(project="financial-sentiment", config=config)

    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=wandb.config.learning_rate,
        weight_decay=wandb.config.weight_decay,
    )
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=len(train_loader) * wandb.config.epochs
    )
    criterion = nn.CrossEntropyLoss()

    global_step = 0
    for epoch in range(wandb.config.epochs):
        # Training
        model.train()
        epoch_loss = 0.0
        for batch_idx, batch in enumerate(train_loader):
            inputs, labels = batch
            optimizer.zero_grad()
            logits = model(**inputs).logits
            loss = criterion(logits, labels)
            loss.backward()

            # Gradient clipping
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimizer.step()
            scheduler.step()

            epoch_loss += loss.item()
            global_step += 1

            # Log every 50 steps - not every step (reduces overhead)
            if global_step % 50 == 0:
                wandb.log({
                    "train/loss": loss.item(),
                    "train/lr": scheduler.get_last_lr()[0],
                    "train/grad_norm": compute_grad_norm(model),  # norm after clipping
                }, step=global_step)

        # Validation
        model.eval()
        val_loss, val_preds, val_labels = [], [], []
        with torch.no_grad():
            for batch in val_loader:
                inputs, labels = batch
                logits = model(**inputs).logits
                loss = criterion(logits, labels)
                val_loss.append(loss.item())
                val_preds.extend(logits.argmax(dim=1).cpu().tolist())
                val_labels.extend(labels.cpu().tolist())

        val_f1 = f1_score(val_labels, val_preds, average="weighted")
        wandb.log({
            "val/loss": sum(val_loss) / len(val_loss),
            "val/f1": val_f1,
            "epoch": epoch,
        }, step=global_step)

        print(f"Epoch {epoch}: val_f1={val_f1:.4f}")

    wandb.finish()
    return model

def compute_grad_norm(model):
    total_norm = 0.0
    for p in model.parameters():
        if p.grad is not None:
            total_norm += p.grad.data.norm(2).item() ** 2
    return total_norm ** 0.5

HuggingFace Trainer: One Line

The HuggingFace Trainer integrates with W&B via a single parameter in TrainingArguments. Every metric computed by compute_metrics and all training losses are logged automatically.

from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
output_dir="./results",
num_train_epochs=5,
per_device_train_batch_size=16,
learning_rate=3e-4,
report_to="wandb", # one line - all logging handled automatically
run_name="finbert-lora-v4", # W&B run name
logging_steps=50,
evaluation_strategy="epoch",
save_strategy="epoch",
load_best_model_at_end=True,
)

trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_dataset,
eval_dataset=val_dataset,
compute_metrics=compute_metrics,
)

trainer.train()
# All metrics are logged to W&B automatically. The Trainer uses its own prefixes:
# train/loss for training, and eval/loss, eval/f1, etc. for evaluation metrics.

PyTorch Lightning: WandbLogger

from pytorch_lightning import Trainer
from pytorch_lightning.loggers import WandbLogger

wandb_logger = WandbLogger(
project="financial-sentiment",
name="finbert-lightning-v1",
log_model=True, # automatically log model checkpoints as artifacts
)

trainer = Trainer(
max_epochs=5,
logger=wandb_logger,
log_every_n_steps=50,
enable_progress_bar=True,
)

trainer.fit(model, train_dataloader, val_dataloader)

Keras: WandbCallback

import wandb
from wandb.keras import WandbCallback

wandb.init(project="financial-sentiment", config={"epochs": 10, "lr": 0.001})

model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

model.fit(
train_dataset,
validation_data=val_dataset,
epochs=wandb.config.epochs,
callbacks=[
WandbCallback(
monitor="val_accuracy",
log_weights=True, # log weight histograms
# log_gradients=True also requires passing training_data=(X, y); omitted here
)
],
)

Complete Production Example

"""
Complete W&B integration for fine-tuning FinBERT with LoRA.
Covers: init, config, per-step logging, artifacts, rich media, finish.
"""

import wandb
import torch
import numpy as np
from transformers import (
AutoTokenizer, AutoModelForSequenceClassification,
TrainingArguments, Trainer,
)
from peft import LoraConfig, get_peft_model, TaskType
from datasets import load_dataset
import evaluate

# ---- Configuration ----
config = {
"model_name": "ProsusAI/finbert",
"lora_r": 16,
"lora_alpha": 32,
"lora_dropout": 0.1,
"learning_rate": 3e-4,
"weight_decay": 0.01,
"warmup_ratio": 0.06,
"num_epochs": 5,
"batch_size": 32,
"max_seq_length": 128,
"fp16": True,
"seed": 42,
}

# ---- Initialize W&B ----
run = wandb.init(
project="financial-sentiment-production",
name=f"finbert-lora-r{config['lora_r']}-lr{config['learning_rate']}",
config=config,
tags=["finbert", "lora", "production"],
notes="Production fine-tuning run with LoRA r=16 on financial_phrasebank",
settings=wandb.Settings(code_dir="."), # saves git diff + code snapshot
)

print(f"Run URL: {run.url}")

# ---- Data ----
raw = load_dataset("financial_phrasebank", "sentences_50agree")
train_test = raw["train"].train_test_split(test_size=0.2, seed=config["seed"])

tokenizer = AutoTokenizer.from_pretrained(config["model_name"])

def preprocess(batch):
    return tokenizer(
        batch["sentence"],
        truncation=True,
        max_length=config["max_seq_length"],
        padding="max_length",
    )

tokenized = train_test.map(preprocess, batched=True, remove_columns=["sentence"])
tokenized = tokenized.rename_column("label", "labels")
tokenized.set_format("torch")

# Log dataset as artifact
ds_artifact = wandb.Artifact(
"financial-phrasebank",
type="dataset",
metadata={
"train_size": len(tokenized["train"]),
"val_size": len(tokenized["test"]),
"num_classes": 3,
},
)
run.log_artifact(ds_artifact)

# ---- Model ----
base = AutoModelForSequenceClassification.from_pretrained(
config["model_name"],
num_labels=3,
# Label ids follow the dataset's class order (financial_phrasebank: negative, neutral, positive)
id2label={0: "negative", 1: "neutral", 2: "positive"},
label2id={"negative": 0, "neutral": 1, "positive": 2},
)

lora_cfg = LoraConfig(
r=config["lora_r"],
lora_alpha=config["lora_alpha"],
target_modules=["query", "key", "value"],
lora_dropout=config["lora_dropout"],
bias="none",
task_type=TaskType.SEQ_CLS,
)
model = get_peft_model(base, lora_cfg)

trainable, total = 0, 0
for p in model.parameters():
    total += p.numel()
    if p.requires_grad:
        trainable += p.numel()

wandb.config.update({
"trainable_params": trainable,
"total_params": total,
"trainable_pct": round(100 * trainable / total, 4),
})

# ---- Metrics ----
accuracy = evaluate.load("accuracy")
f1 = evaluate.load("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy.compute(predictions=preds, references=labels)["accuracy"],
        "f1": f1.compute(predictions=preds, references=labels, average="weighted")["f1"],
    }

# ---- Training ----
args = TrainingArguments(
output_dir=f"./finbert-lora-{run.id}",
num_train_epochs=config["num_epochs"],
per_device_train_batch_size=config["batch_size"],
per_device_eval_batch_size=64,
learning_rate=config["learning_rate"],
weight_decay=config["weight_decay"],
warmup_ratio=config["warmup_ratio"],
fp16=config["fp16"],
evaluation_strategy="epoch",
save_strategy="epoch",
load_best_model_at_end=True,
metric_for_best_model="f1",
greater_is_better=True,
logging_steps=25,
report_to="wandb",
run_name=run.name,
)

trainer = Trainer(
model=model,
args=args,
train_dataset=tokenized["train"],
eval_dataset=tokenized["test"],
compute_metrics=compute_metrics,
)

trainer.train()

# ---- Evaluate and Log Final Results ----
final_metrics = trainer.evaluate()
print(f"Final val F1: {final_metrics['eval_f1']:.4f}")

# ---- Log Model Artifact ----
model.save_pretrained(f"./finbert-lora-{run.id}")
tokenizer.save_pretrained(f"./finbert-lora-{run.id}")

model_artifact = wandb.Artifact(
"finbert-bloomberg",
type="model",
description="FinBERT + LoRA fine-tuned on financial_phrasebank.",
metadata={
"val_f1": final_metrics["eval_f1"],
"val_accuracy": final_metrics["eval_accuracy"],
"lora_r": config["lora_r"],
"lora_alpha": config["lora_alpha"],
"base_model": config["model_name"],
},
)
model_artifact.add_dir(f"./finbert-lora-{run.id}/")
run.log_artifact(model_artifact)

wandb.finish()

The Full W&B Workflow - Diagram

Team Collaboration

W&B projects are shared by default across your team account. Every member can see all runs, filter by tag, hyperparameter, or metric value, and add notes to any run.

# Tag runs for easy filtering
wandb.init(
project="financial-sentiment",
tags=["baseline", "lora", "finbert"], # filter by tag in UI
)

# Add notes to an existing run after the fact
api = wandb.Api()
run = api.run("your-username/financial-sentiment/run_id")
run.notes = "This was the run where we found weight_decay=0.05 matters more than LR."
run.save()

# Group runs - useful for multi-GPU training or cross-validation folds
wandb.init(
project="financial-sentiment",
group="5fold-crossval-v3", # all 5 folds appear together in UI
job_type="fold-3",
)

# Fork a run - start a new run from a specific checkpoint of an existing run
# (W&B feature: creates lineage link between parent and child run)
wandb.init(
project="financial-sentiment",
fork_from="run_id?_step=500", # fork from step 500 of run_id (note the _step key)
)

Access Control

W&B projects support three visibility levels:

  • Private: only you and explicitly added members
  • Team: all members of your W&B organization
  • Public: anyone can view (read-only)

For production ML systems, keep model artifacts private and use the Registry to expose only promoted versions to deployment systems.

Production Notes

Offline Mode

When running on a compute cluster without internet access, W&B can buffer all logs locally and sync when connectivity is available.

# Set environment variable before running
export WANDB_MODE=offline

# Or in Python
import wandb
wandb.init(mode="offline")

# Later, sync the buffered data
wandb sync ./wandb/offline-run-timestamp

Multi-Process Training

For distributed training (multiple GPUs), only the main process should log to W&B. The accelerate library handles this automatically. For manual DDP:

import torch.distributed as dist

is_main_process = not dist.is_initialized() or dist.get_rank() == 0

if is_main_process:
    wandb.init(project="financial-sentiment")
    # ... log metrics ...

# Or use the group parameter to aggregate across processes
wandb.init(
project="financial-sentiment",
group="ddp-run-v1",
job_type=f"rank-{dist.get_rank()}",
)
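When the process group is not initialized yet (for example, before the model is built), the rank can also be read from the environment. The sketch below assumes a launcher such as torchrun, which exports a `RANK` variable for each process; `process_rank` and `make_logger` are hypothetical helpers, and `log_fn` stands in for `wandb.log` so the example runs on its own.

```python
import os
from typing import Callable, Mapping, Optional


def process_rank(env: Optional[Mapping[str, str]] = None) -> int:
    """Read the global rank from RANK; default to 0 for single-process runs."""
    env = os.environ if env is None else env
    return int(env.get("RANK", "0"))


def make_logger(
    log_fn: Callable[[dict], None],
    env: Optional[Mapping[str, str]] = None,
) -> Callable[[dict], None]:
    """Return log_fn on rank 0 and a no-op on every other rank."""
    if process_rank(env) == 0:
        return log_fn
    return lambda metrics: None


seen = []
log = make_logger(seen.append, env={"RANK": "0"})     # main process: logs for real
log({"train/loss": 0.42})
silent = make_logger(seen.append, env={"RANK": "3"})  # worker process: logging is a no-op
silent({"train/loss": 0.40})
print(seen)
```

Training code can then call the returned function unconditionally, which keeps `if is_main_process:` checks out of the inner loop.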

System Metrics

W&B automatically logs GPU utilization, GPU memory, CPU usage, RAM, disk I/O, and network I/O every 10 seconds. No additional code required. This is invaluable for identifying:

  • Memory leaks: RAM usage grows monotonically across epochs
  • GPU underutilization: GPU% stays below 80%, indicating a data loading bottleneck (num_workers too low)
  • Gradient accumulation correctness: loss should decrease smoothly with the right effective batch size
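The first two checks in the list can be turned into simple rules over exported metric samples. The sketch below is one rough way to do that, assuming you already have per-epoch RAM readings and GPU utilization samples as plain lists; both function names and the 1 GB growth threshold are assumptions for illustration, while the 80% utilization cutoff comes from the text above.

```python
def looks_like_memory_leak(ram_gb: list[float], min_growth_gb: float = 1.0) -> bool:
    """True when RAM never drops between samples and grows by at least min_growth_gb."""
    if len(ram_gb) < 2:
        return False
    non_decreasing = all(b >= a for a, b in zip(ram_gb, ram_gb[1:]))
    return non_decreasing and (ram_gb[-1] - ram_gb[0]) >= min_growth_gb


def gpu_underutilized(gpu_pct: list[float], threshold: float = 80.0) -> bool:
    """True when mean GPU utilization stays below the threshold."""
    return bool(gpu_pct) and sum(gpu_pct) / len(gpu_pct) < threshold


# Made-up readings: RAM at the end of each epoch, and GPU utilization samples.
print(looks_like_memory_leak([12.1, 13.4, 14.9, 16.2, 17.8]))
print(looks_like_memory_leak([12.1, 12.3, 12.0, 12.2, 12.1]))
print(gpu_underutilized([55.0, 61.0, 48.0, 59.0]))
```

A monotonic-growth rule like this is deliberately crude; it is a prompt to open the system charts for that run, not a diagnosis.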

Alerts

# Notify via Slack or email when a run finishes or when a metric crosses a threshold
wandb.alert(
title="Training complete",
text=f"Run {wandb.run.name} finished. Val F1: {final_f1:.4f}",
level=wandb.AlertLevel.INFO,
)

# Alert on NaN loss (run failure detection)
if torch.isnan(loss):
    wandb.alert(
        title="NaN loss detected",
        text=f"Run {wandb.run.name} has NaN loss at step {global_step}. Stopping.",
        level=wandb.AlertLevel.ERROR,
    )
    raise ValueError("NaN loss - check learning rate and gradient clipping")

Common Mistakes

:::danger Not saving code with your run

By default, W&B saves the git commit hash but not the full code diff or file contents. If you change code without committing, your run is not reproducible from the W&B record alone. Fix:

wandb.init(
project="financial-sentiment",
settings=wandb.Settings(code_dir="."), # saves all Python files
)

Or always commit before running. Set up a pre-run check in your training script:

import subprocess
result = subprocess.run(["git", "status", "--porcelain"], capture_output=True, text=True)
if result.stdout.strip():
    print("WARNING: Uncommitted changes detected. Run may not be reproducible.")

:::

:::warning Logging too frequently

Calling wandb.log() at every gradient step in large training runs adds meaningful overhead - W&B serializes and buffers each call, and the network upload can lag behind training. For batch sizes above 32 and datasets above 100K samples, log every 50-100 steps:

if global_step % 100 == 0:
    wandb.log({"train/loss": loss.item()}, step=global_step)
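
Logging only every Nth step records a single noisy sample per window. One possible refinement, sketched below, is to buffer the loss at every step and log the window mean instead. `ThrottledLogger` and `fake_log` are hypothetical names invented for this example; `fake_log` stands in for wandb.log so the sketch runs without a W&B account.

```python
from typing import Any, Callable, Dict, List


class ThrottledLogger:
    """Buffer a scalar every step and emit its mean every `every` steps."""

    def __init__(self, log_fn: Callable[..., None], every: int = 100) -> None:
        self.log_fn = log_fn  # e.g. wandb.log in a real training script
        self.every = every
        self._buffer: List[float] = []

    def update(self, loss: float, step: int) -> None:
        self._buffer.append(loss)
        if step % self.every == 0:
            mean_loss = sum(self._buffer) / len(self._buffer)
            self.log_fn({"train/loss": mean_loss}, step=step)
            self._buffer.clear()


# Stand-in for wandb.log: collects what would have been sent.
logged: List[Dict[str, Any]] = []


def fake_log(data: Dict[str, float], step: int) -> None:
    logged.append({"step": step, **data})


logger = ThrottledLogger(fake_log, every=100)
for global_step in range(1, 301):
    logger.update(loss=1.0 / global_step, step=global_step)

print(len(logged))  # 3 log calls instead of 300
```

In a real script you would pass `wandb.log` as `log_fn` and call `update(loss.item(), global_step)` inside the training loop.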

:::

:::warning Not calling wandb.finish()

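If your training script exits abnormally (OOM, keyboard interrupt, exception), the W&B run can be left in the "Running" state unless you call wandb.finish(). Wrap training in a try/finally so the run is closed on every exit path, including KeyboardInterrupt:

exit_code = 1  # assume failure until training completes
try:
    trainer.train()
    exit_code = 0
finally:
    # Runs on success, on exceptions, and on KeyboardInterrupt
    wandb.finish(exit_code=exit_code)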

:::

:::note W&B vs TensorBoard vs MLflow

TensorBoard: best for quick local visualization of a single run. Multi-run comparison is basic (overlaid curves from runs in a shared log directory), and there is no cloud storage or artifact management. Still useful inside Jupyter.

MLflow: open-source, self-hostable, strong in enterprises with data governance requirements (financial services, healthcare). Run tracking is comparable to W&B. Artifact management is more manual. No built-in sweeps.

W&B: best for teams, best sweep interface, best run comparison UI. Cloud-hosted by default (data leaves your infrastructure unless you use a self-managed deployment). Free tier is generous for individual researchers. A common default choice for ML teams as of 2024.

:::

YouTube Resources

| Video | Creator | What You'll Learn |
|---|---|---|
| W&B Quickstart | Weights & Biases | Official W&B intro and core API |
| W&B Sweeps Tutorial | Weights & Biases | Hyperparameter sweeps with Bayesian optimization |
| ML Experiment Tracking | MLOps Community | W&B in production MLOps workflows |
| W&B Artifacts | Weights & Biases | Dataset and model versioning with artifact lineage |

Interview Q&A

Q1: Why is experiment tracking important and what specific problem does it solve?

The core problem is reproducibility at scale. A single ML experiment involves dozens of choices: model architecture, hyperparameters, optimizer, learning rate schedule, data preprocessing, augmentation, random seeds, code version. Without tracking, reproducing a past result requires either a perfect memory or comprehensive logging discipline that most teams do not maintain.

The secondary problem is comparison. When you have run 50 experiments, meaningful comparison requires seeing all configurations and all metrics in a unified view. Log files and spreadsheets do not scale to this.

The tertiary problem is collaboration. In a team, experiment tracking creates a shared record that every member can query. The researcher who ran experiment 31 at 2am is not the only one who knows what it did.

W&B specifically addresses all three: hyperparameters are captured from the config passed to wandb.init(), metrics are recorded through wandb.log(), and the comparison UI lets you filter, sort, and correlate hyperparameters with outcomes across all runs. The artifact system adds model and dataset lineage - you can always answer "which dataset trained which model."

Q2: How would you use W&B sweeps to find the optimal learning rate for a transformer fine-tuning run?

First, choose the right strategy. For learning rate search, Bayesian optimization is appropriate because: trials are expensive (30+ minutes each), the learning rate is a continuous parameter, and the relationship between LR and performance is smooth (unimodal-ish in log space). Random search would work too but is less sample-efficient.

Configure the sweep with a log-uniform distribution over a wide range:

method: bayes
metric:
  name: val/f1
  goal: maximize
parameters:
  learning_rate:
    distribution: log_uniform_values
    min: 1.0e-6
    max: 1.0e-2
  warmup_ratio:
    values: [0.03, 0.06, 0.1]

The log-uniform distribution is critical - learning rates span several orders of magnitude, so you want uniform sampling in log space: a draw is as likely to fall between 1e-6 and 1e-5 as between 1e-3 and 1e-2. Uniform sampling in linear space over the same range would put about 90% of draws between 1e-3 and 1e-2 and almost never sample small values.
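
The difference between the two sampling schemes is easy to check numerically. The sketch below uses only the Python standard library and illustrates the idea; it is not W&B's internal sampler, and the helper name `sample_log_uniform` is an assumption made for this example.

```python
import math
import random


def sample_log_uniform(rng: random.Random, low: float, high: float) -> float:
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)  # fixed seed so the comparison is repeatable
low, high, n = 1e-6, 1e-2, 10_000

log_samples = [sample_log_uniform(rng, low, high) for _ in range(n)]
lin_samples = [rng.uniform(low, high) for _ in range(n)]

# Fraction of draws below 1e-4, i.e. in the lower two of the four decades.
frac_log = sum(x < 1e-4 for x in log_samples) / n
frac_lin = sum(x < 1e-4 for x in lin_samples) / n

print(f"log-uniform: {frac_log:.2f} of samples below 1e-4")
print(f"uniform:     {frac_lin:.2f} of samples below 1e-4")
```

With the log-uniform sampler roughly half of the draws fall below 1e-4, while the linear sampler puts only about 1% there, which is why a linear search effectively never explores small learning rates.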

Run 20-30 trials. After 10-15 trials, the Gaussian process behind W&B's Bayesian search usually has a reasonable model of the LR-performance landscape and concentrates sampling on the promising region. Look at the parallel coordinates plot in the W&B sweep dashboard to visually confirm the LR range and see correlations with warmup ratio.

Q3: What is artifact lineage and why does it matter for ML reproducibility?

Artifact lineage is the directed graph of which artifacts produced which other artifacts, mediated by which runs. A complete lineage record answers: dataset version X → preprocessing run P → clean dataset version Y → training run T → model version M → evaluation run E → prediction set version Q.

This matters for three reasons:

Debugging production issues: when a production model starts making bad predictions, you can trace back to exactly which training data version it used, download that version, and inspect whether the issue is a data problem, a model problem, or both.

Compliance: regulated industries (healthcare, finance) require audit trails. "Which data trained the model that made this credit decision?" needs a precise answer, not "probably version 3 of the dataset."

Experiment reproducibility: if you want to reproduce a result from six months ago, you need the exact model weights, the exact dataset version, and the exact code. W&B artifacts provide the first two. With settings=wandb.Settings(code_dir="."), W&B captures the code too.

In W&B, lineage is created automatically when you use run.use_artifact() to declare an input artifact and run.log_artifact() to declare an output. The lineage graph is visible in the W&B UI as a DAG.
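
To make the graph structure concrete, here is a toy lineage DAG in plain Python. It does not use the W&B API; the node names and the `upstream` helper are hypothetical and simply mirror the dataset → run → model chain described above.

```python
from collections import deque
from typing import Dict, List

# Each node maps to the nodes it was directly derived from (its inputs).
parents: Dict[str, List[str]] = {
    "raw-dataset:v3": [],
    "preprocess-run": ["raw-dataset:v3"],
    "clean-dataset:v1": ["preprocess-run"],
    "train-run": ["clean-dataset:v1"],
    "model:v2": ["train-run"],
    "eval-run": ["model:v2", "clean-dataset:v1"],
    "predictions:v1": ["eval-run"],
}


def upstream(node: str, graph: Dict[str, List[str]]) -> List[str]:
    """Return every ancestor of `node` in breadth-first order, without duplicates."""
    seen: List[str] = []
    queue = deque(graph[node])
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.append(current)
        queue.extend(graph[current])
    return seen


# Which runs and data versions does the model depend on?
print(upstream("model:v2", parents))
```

Walking the graph upstream from a model is the "which data trained this model?" query; walking it downstream from a dataset version would answer "which models are affected if this data is bad?".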

Q4: How do you track per-sample predictions for a multi-class classification model in W&B?

Use wandb.Table to log individual predictions with their ground truth labels, confidence scores, and any other relevant metadata (text content, image path, etc.). This enables qualitative error analysis that aggregate metrics do not.

table = wandb.Table(columns=["id", "text", "true_label", "pred_label", "confidence", "correct"])

for i, (text, true, pred, score) in enumerate(zip(texts, true_labels, pred_labels, scores)):
    table.add_data(
        i, text[:200], label_names[true], label_names[pred],
        round(float(score), 4), true == pred
    )

wandb.log({"val/per_sample_predictions": table})

Once logged as a W&B Table, you can filter interactively in the UI (show only incorrect predictions, sort by confidence to find high-confidence errors, filter by true label to see class-specific error patterns). This is qualitatively different from a confusion matrix - you can read the actual texts that the model gets wrong and form hypotheses about failure modes.
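
The "high-confidence errors" view can also be reproduced offline on the same rows, for example in a notebook before logging. The sketch below is plain Python rather than W&B API; the rows, the 0.8 threshold, and the helper name `high_confidence_errors` are hypothetical.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]

# Hypothetical rows with the same columns as the table above (text omitted).
rows: List[Row] = [
    {"id": 0, "true_label": "positive", "pred_label": "positive", "confidence": 0.97, "correct": True},
    {"id": 1, "true_label": "negative", "pred_label": "positive", "confidence": 0.93, "correct": False},
    {"id": 2, "true_label": "neutral", "pred_label": "negative", "confidence": 0.51, "correct": False},
    {"id": 3, "true_label": "negative", "pred_label": "neutral", "confidence": 0.88, "correct": False},
]


def high_confidence_errors(table_rows: List[Row], min_confidence: float = 0.8) -> List[Row]:
    """Keep wrong predictions at or above min_confidence, most confident first."""
    errors = [r for r in table_rows if not r["correct"] and r["confidence"] >= min_confidence]
    return sorted(errors, key=lambda r: r["confidence"], reverse=True)


for row in high_confidence_errors(rows):
    print(row["id"], row["true_label"], "->", row["pred_label"], row["confidence"])
```

High-confidence errors are usually the most informative rows to read first, since they tend to expose label noise or systematic failure modes rather than borderline cases.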

Q5: When would you choose W&B over TensorBoard or MLflow?

Choose W&B when: you are working in a team where shared visibility matters, you want built-in hyperparameter sweep functionality, you need artifact lineage tracking, or you want the best run comparison UI with minimal setup. W&B's free tier is generous enough for most individual researchers. The setup is pip install wandb, wandb login, and wandb.init() - three minutes from zero to working dashboard.

Choose TensorBoard when: you are working locally and individually, data governance requirements prohibit sending training metadata to a third-party cloud service, or you are already embedded in a TensorFlow/Keras stack where TensorBoard integrates natively. TensorBoard is also useful as a quick local viewer even when using W&B for cloud logging.

Choose MLflow when: you are in an enterprise environment that requires self-hosting (financial services, healthcare, government), you want tight integration with the broader Databricks/Spark ecosystem, or you need an open-source system where you can audit and control every component. MLflow's experiment tracking is comparable to W&B in functionality, but it has no built-in sweep system and its comparison UI is less polished. MLflow's model registry is mature and well-integrated with deployment pipelines.

In practice, many teams use W&B for tracking and sweeps during research and then MLflow for the model registry and deployment pipeline in production - the two are not mutually exclusive.

© 2026 EngineersOfAI. All rights reserved.