
Weights & Biases Deep Dive

Three Time Zones, One Research Deadline

Your team is building a foundation model. The research team is split across San Francisco, London, and Singapore. You have a hard deadline in 10 weeks - a workshop paper submission. Every researcher is running experiments. The SF team finishes a training run at 5pm and leaves a comment in Slack for London: "run 347 looks promising, LR 3e-4 with cosine warmup." London logs in 8 hours later, cannot find "run 347" anywhere because there is no shared system. They start their own experiments. Singapore sees neither team's results.

By week 6, you have 800 experiments, zero shared visibility, and three different "best models" being championed by three different researchers. The paper deadline is in 4 weeks. Nobody can answer the question: what is our current best result, and which run produced it?

Weights & Biases (W&B) was designed for exactly this scenario. It is a collaborative ML platform built around the idea that experiment results should be as shareable and collaborative as documents.



Why W&B Exists

W&B was founded in 2017 and reached widespread adoption in research labs around 2019–2020. OpenAI was among its earliest users, and the founders understood that the pain point for research teams was not just tracking - it was sharing, visualizing, and collaborating on results in real time.

Where MLflow optimizes for self-hosted infrastructure and engineering rigor, W&B optimizes for collaborative velocity. The hosted SaaS model means zero infrastructure setup. The rich UI means results are navigable without writing SQL queries. The sweep engine means HPO is integrated, not bolted on.


W&B Architecture

By default, all metrics and artifacts flow to W&B's hosted infrastructure. For enterprises with data residency requirements, W&B offers a self-hosted deployment (W&B Server, formerly called "Local") that runs on your own infrastructure, typically a Kubernetes cluster.


Getting Started

pip install wandb
wandb login # prompts for API key from wandb.ai/authorize

The basic integration is minimal:

import torch
import wandb

# Initialize a run
run = wandb.init(
    project="foundation_model_v2",
    entity="your-team-name",  # W&B team/org
    name="transformer_3e4_cosine",  # human-readable run name
    tags=["transformer", "nlp", "phase-1"],
    notes="Testing cosine warmup with linear decay vs previous step schedule",
    config={  # all hyperparameters go here
        "learning_rate": 3e-4,
        "batch_size": 512,
        "max_steps": 100_000,
        "warmup_steps": 2_000,
        "d_model": 512,
        "num_layers": 12,
        "num_heads": 8,
        "dropout": 0.1,
        "optimizer": "AdamW",
        "weight_decay": 0.1,
        "gradient_clip": 1.0,
        "scheduler": "cosine_with_warmup",
        "dataset": "openwebtext_v2",
        "tokenizer": "gpt2",
        "random_seed": 42,
    },
)

# Use wandb.config instead of raw values - makes sweeps work automatically
config = wandb.config

# model, optimizer, val_loader, get_batch, train_step, evaluate, get_current_lr,
# and compute_grad_norm are your own objects and helpers, not part of wandb.
for step in range(config.max_steps):
    batch = get_batch()
    loss = train_step(model, batch, config.learning_rate)  # assumed to return a float

    # Log training metrics for this step
    wandb.log({
        "train/loss": loss,
        "train/perplexity": torch.exp(torch.tensor(loss)).item(),
        "train/learning_rate": get_current_lr(optimizer),
        "train/grad_norm": compute_grad_norm(model),
    }, step=step)

    # Log validation metrics every 1000 steps, against the same step counter
    if step % 1000 == 0:
        val_metrics = evaluate(model, val_loader)
        wandb.log({
            "val/loss": val_metrics["loss"],
            "val/perplexity": val_metrics["perplexity"],
            "val/accuracy": val_metrics["accuracy"],
        }, step=step)

# Mark run as complete, upload remaining data
wandb.finish()
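The config above names a "cosine_with_warmup" scheduler but never defines it. As a minimal sketch, assuming linear warmup from zero followed by cosine decay to a floor of zero (the function name, the `min_lr` parameter, and the decay-to-zero choice are assumptions, not something the run config specifies), the value you would log as train/learning_rate could be computed like this:

```python
import math


def cosine_with_warmup(step, base_lr=3e-4, warmup_steps=2_000,
                       max_steps=100_000, min_lr=0.0):
    """Learning rate at `step`: linear warmup, then cosine decay to min_lr."""
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr over the warmup period.
        return base_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / max(1, max_steps - warmup_steps))
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Defaults mirror learning_rate, warmup_steps, and max_steps from the config.
for s in (0, 1_000, 2_000, 51_000, 100_000):
    print(s, f"{cosine_with_warmup(s):.2e}")
```

The learning rate peaks at 3e-4 when warmup ends (step 2,000) and is back at half of that by the midpoint of the decay phase (step 51,000). Logging it next to the loss makes schedule bugs visible in the same chart.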

Rich Logging: Tables, Images, and Custom Charts

W&B goes far beyond scalar metrics. You can log images, audio, video, tables of predictions, histograms, and custom Vega-Lite charts.

Logging Predictions as Tables

# Log a sample of model predictions for qualitative review
val_table = wandb.Table(
    columns=["input_text", "true_label", "predicted_label", "confidence", "correct"]
)

model.eval()
for i, (inputs, labels) in enumerate(val_loader):
    if i >= 10:  # log first 10 batches
        break
    with torch.no_grad():
        logits = model(inputs)
        probs = torch.softmax(logits, dim=-1)
        preds = probs.argmax(dim=-1)

    # One table row per example in the batch
    for j in range(len(inputs)):
        val_table.add_data(
            inputs[j]["text"],
            id2label[labels[j].item()],
            id2label[preds[j].item()],
            probs[j].max().item(),
            (preds[j] == labels[j]).item(),
        )

wandb.log({"val/predictions_sample": val_table})

Logging Images and Confusion Matrices

from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import numpy as np

# Confusion matrix
cm = confusion_matrix(all_labels, all_preds)
fig, ax = plt.subplots(figsize=(10, 8))
im = ax.imshow(cm, cmap="Blues")
plt.colorbar(im, ax=ax)
ax.set_xlabel("Predicted")
ax.set_ylabel("True")
wandb.log({"val/confusion_matrix": wandb.Image(fig)})
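plt.close(fig)  # release the figure once it has been logged

# Log model gradients and parameter distributions (useful for debugging).
# Call this once, before the training loop starts.
wandb.watch(model, log="all", log_freq=100) # logs gradients and weights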

W&B Sweeps: Built-In Hyperparameter Optimization

Sweeps are W&B's integrated HPO system. You define a sweep configuration, and W&B's sweep controller distributes trials across any number of agents (training processes).

Sweep Configuration

# Sweep configuration as a Python dict. The same keys work in a YAML file
# such as sweeps/transformer_sweep.yaml.
sweep_config = {
    "name": "transformer_lr_architecture_sweep",
    "method": "bayes",  # or "grid", "random"
    "metric": {
        "name": "val/perplexity",
        "goal": "minimize",
    },
    "parameters": {
        "learning_rate": {
            "distribution": "log_uniform_values",
            "min": 1e-5,
            "max": 1e-2,
        },
        "batch_size": {
            "values": [256, 512, 1024],
        },
        "d_model": {
            "values": [256, 512, 768],
        },
        "num_layers": {
            "values": [6, 8, 12],
        },
        "dropout": {
            "distribution": "uniform",
            "min": 0.0,
            "max": 0.4,
        },
        "warmup_ratio": {
            "distribution": "uniform",
            "min": 0.01,
            "max": 0.10,
        },
    },
    "early_terminate": {
        "type": "hyperband",
        "min_iter": 3,
        "eta": 3,
    },
}
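Two settings in this config are easier to reason about with numbers in hand. The sketch below is not W&B code. It assumes that log_uniform_values draws a value whose logarithm is uniform between min and max, and that Hyperband compares runs at iterations min_iter * eta^k, which is how the W&B sweep documentation describes both. The helper names and the num_brackets parameter are hypothetical.

```python
import math
import random


def sample_log_uniform(low, high, rng=random):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def hyperband_brackets(min_iter, eta, num_brackets=4):
    """Iterations at which runs are compared: min_iter * eta**k."""
    return [min_iter * eta**k for k in range(num_brackets)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = [sample_log_uniform(1e-5, 1e-2, rng) for _ in range(1000)]

# Each decade (1e-5..1e-4, 1e-4..1e-3, 1e-3..1e-2) gets about a third of the draws.
share_lowest_decade = sum(s < 1e-4 for s in samples) / len(samples)
print(f"share of samples below 1e-4: {share_lowest_decade:.2f}")

print(hyperband_brackets(min_iter=3, eta=3))  # [3, 9, 27, 81]
```

A plain uniform distribution over the same range would put about 90% of the samples above 1e-3, which is why learning rates are searched on a log scale. With min_iter=3 and eta=3, a trial has to log the metric at least three times before it can be stopped early.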

Running a Sweep

import wandb
from torch.optim import AdamW

# Create the sweep - returns a sweep ID
sweep_id = wandb.sweep(
    sweep=sweep_config,
    project="foundation_model_v2",
    entity="your-team-name",
)

# Define the training function (uses wandb.config for hyperparams).
# build_model, train_step, evaluate, get_batch, val_loader, MAX_STEPS, and
# PRUNING_THRESHOLD are your own code and constants, not part of wandb.
def train_sweep():
    with wandb.init() as run:
        config = wandb.config

        model = build_model(
            d_model=config.d_model,
            num_layers=config.num_layers,
            dropout=config.dropout,
        )
        optimizer = AdamW(
            model.parameters(),
            lr=config.learning_rate,
        )

        for step in range(MAX_STEPS):
            loss = train_step(model, get_batch(), optimizer)
            val_metrics = evaluate(model, val_loader)

            wandb.log({
                "train/loss": loss,
                "val/perplexity": val_metrics["perplexity"],
            }, step=step)

            # Hyperband early termination works from the logged val/perplexity
            # and needs no extra code here. This check is a separate, manual
            # guard that abandons trials that have clearly diverged.
            if val_metrics["perplexity"] > PRUNING_THRESHOLD:
                wandb.finish(exit_code=1)  # mark the trial as failed
                return

# Launch an agent - it runs trials until the sweep is complete or `count` is reached.
# Run this on each GPU machine:
wandb.agent(
    sweep_id=sweep_id,
    function=train_sweep,
    project="foundation_model_v2",
    entity="your-team-name",
    count=25,  # number of trials this agent will run
)

Sweep Methods Compared

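| Method | When to Use | Sample Efficiency |
|---|---|---|
| grid | Small, discrete search spaces | Low - exhaustive |
| random | Good baseline, highly parallelizable | Medium |
| bayes | Continuous spaces, fewer trials available | High |
| hyperband (set under early_terminate, combined with any method above) | Early stopping of bad runs | Very high (via pruning) |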
:::tip For most teams: start with random (fast to start, highly parallelizable), switch to bayes when you have fewer than 50 GPU-days of budget and want to be more sample-efficient. :::
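To see why grid only suits small, discrete spaces, count the combinations. The sketch below uses the three discrete parameters from the sweep configuration. The choice of 5 grid points for each of the three continuous parameters is an arbitrary assumption for illustration.

```python
from itertools import product

# Discrete parameters copied from the sweep configuration.
discrete = {
    "batch_size": [256, 512, 1024],
    "d_model": [256, 512, 768],
    "num_layers": [6, 8, 12],
}

# Every combination that a grid search would have to train.
grid = [dict(zip(discrete, combo)) for combo in product(*discrete.values())]
print(len(grid))  # 27

# Assumption: discretize learning_rate, dropout, and warmup_ratio into
# 5 values each before adding them to the grid.
points_per_continuous = 5
num_continuous = 3
full_grid_size = len(grid) * points_per_continuous**num_continuous
print(full_grid_size)  # 3375
```

Twenty-seven trials is manageable. At 3,375 trials, random or bayes with a fixed trial budget is the realistic option.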


W&B Artifacts: Versioned Data and Models

W&B Artifacts are versioned, immutable records of any file or directory. Think of them as git commits for data and models.

# Log a dataset as an artifact
with wandb.init(project="foundation_model_v2", job_type="data_prep") as run:
    dataset_artifact = wandb.Artifact(
        name="openwebtext_tokenized",
        type="dataset",
        description="OpenWebText tokenized with GPT-2 tokenizer, sequence length 1024",
        metadata={
            "source": "huggingface/openwebtext",
            "tokenizer": "gpt2",
            "sequence_length": 1024,
            "num_tokens": 9_035_582_198,
            "preprocessing_date": "2024-10-01",
        },
    )
    dataset_artifact.add_dir("data/tokenized/", name="tokenized")
    run.log_artifact(dataset_artifact)

# Reference the artifact in a training run (creates lineage link)
with wandb.init(project="foundation_model_v2", job_type="training") as run:
    # Use a specific version
    dataset_artifact = run.use_artifact("openwebtext_tokenized:v3")
    dataset_dir = dataset_artifact.download()

    # ... training ...

    # Log the trained model as an artifact
    model_artifact = wandb.Artifact(
        name="transformer_base",
        type="model",
        description="Trained on openwebtext_tokenized:v3. Val perplexity: 18.4",
        metadata={
            "val_perplexity": 18.4,
            "training_steps": 100_000,
            "dataset_version": "v3",
        },
    )
    model_artifact.add_file("checkpoints/best_model.pt")
    model_artifact.add_file("configs/model_config.yaml")
    run.log_artifact(model_artifact)

Artifact Lineage

W&B automatically builds a lineage graph from use_artifact and log_artifact calls:

raw_text:v1 → [preprocessing job] → openwebtext_tokenized:v3
openwebtext_tokenized:v3 → [training job] → transformer_base:v7
transformer_base:v7 → [fine-tuning job] → transformer_finetuned:v2

This lineage is queryable in the UI and API - you can trace any production model back to its raw training data.
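Tracing a model back to its raw data amounts to walking that graph upstream. The sketch below is a toy in-memory model of the three links shown above, not the W&B API. The LINEAGE dict and trace_to_source are hypothetical names, and it assumes each artifact has exactly one input, which real lineage graphs do not guarantee.

```python
# Maps each artifact to (job that produced it, input artifact it consumed).
LINEAGE = {
    "openwebtext_tokenized:v3": ("preprocessing job", "raw_text:v1"),
    "transformer_base:v7": ("training job", "openwebtext_tokenized:v3"),
    "transformer_finetuned:v2": ("fine-tuning job", "transformer_base:v7"),
}


def trace_to_source(artifact, lineage):
    """Follow producer links upstream until reaching an artifact with no producer."""
    chain = [artifact]
    while artifact in lineage:
        _job, artifact = lineage[artifact]
        chain.append(artifact)
    return chain


print(" <- ".join(trace_to_source("transformer_finetuned:v2", LINEAGE)))
# transformer_finetuned:v2 <- transformer_base:v7 <- openwebtext_tokenized:v3 <- raw_text:v1
```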


W&B Reports: Collaborative Documentation

Reports are the W&B feature that most distinguishes it from MLflow. A Report is a live document that embeds W&B plots, run comparisons, and text commentary in a shared, editable page.

Use cases:

  • Weekly experiment summaries: "Here are this week's best runs, why we think they worked, and what we are trying next"
  • Paper appendices: embed the exact run comparison plots that appear in your paper
  • Model card generation: document the training configuration, evaluation results, and intended use of a model before promoting it
  • Team onboarding: "Here is how our best model was found, here is the sweep that produced it"

Reports update automatically when new runs are logged - a plot in a report always shows the latest data from the query it references.


W&B Alerts

Alerts send notifications when training runs hit conditions you define. This is essential for long-running training jobs.

# Alert when validation loss plateaus for 5 epochs
consecutive_no_improvement = 0
best_val_loss = float("inf")

for epoch in range(MAX_EPOCHS):
    val_loss = evaluate(model, val_loader)["loss"]

    if val_loss < best_val_loss * 0.999:  # require at least a 0.1% improvement
        best_val_loss = val_loss
        consecutive_no_improvement = 0
    else:
        consecutive_no_improvement += 1

    if consecutive_no_improvement >= 5:
        wandb.alert(
            title="Training Plateau Detected",
            text=(
                f"Val loss has not improved for {consecutive_no_improvement} epochs. "
                f"Current: {val_loss:.4f}, Best: {best_val_loss:.4f}. "
                "Consider reducing LR or stopping."
            ),
            level=wandb.AlertLevel.WARN,
            wait_duration=3600,  # seconds; avoids a repeat alert on every later epoch
        )

# Alert on training completion with final metrics
wandb.alert(
    title="Training Complete",
    text=f"Run {wandb.run.name} finished. Val perplexity: {final_perplexity:.2f}",
    level=wandb.AlertLevel.INFO,
)

Alerts are delivered via email or Slack. Enable each channel in your W&B user settings before relying on them.


W&B vs MLflow: When to Use Each

| Dimension | MLflow | W&B |
|---|---|---|
| Hosting | Self-hosted (open-source) | SaaS (W&B servers) or self-hosted (enterprise) |
| Cost | Free (infra cost only) | Free tier + paid team plans |
| Setup | Requires server + DB setup | pip install wandb && wandb login |
| UI richness | Basic but functional | Rich, interactive, collaborative |
| Sweeps/HPO | Requires Optuna integration | Built-in sweep engine |
| Artifacts | S3/GCS backed | W&B artifact store |
| Reports | Not built-in | First-class feature |
| Alerts | Requires external integration | Built-in with Slack/email |
| Data residency | Full control | Requires enterprise plan |
| Framework support | MLflow flavor system | wandb.watch() + integrations |

The bottom line: use W&B for research (collaboration, rich viz, fast iteration, papers). Use MLflow for production engineering (self-hosted, regulatory compliance, integration with Spark/Databricks ecosystems).


Production Best Practices

Run Naming

Use a systematic naming convention that encodes the key hypothesis:

run_name = f"{model_family}_{hypothesis}_{key_param}_{date}"
# transformer_cosine_lr3e4_1015
# xgboost_feature_ablation_no_temporal_1018
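A small helper keeps the convention consistent across a team. This is a sketch: the function name, the lowercasing, and the replacement of spaces with underscores are assumptions layered on top of the pattern above, and the year in the example dates is made up.

```python
from datetime import date


def build_run_name(model_family, hypothesis, key_param, run_date=None):
    """Join the parts as {model_family}_{hypothesis}_{key_param}_{MMDD}."""
    run_date = run_date or date.today()
    parts = [model_family, hypothesis, key_param, run_date.strftime("%m%d")]
    # Normalize each part so names sort and filter predictably in the UI.
    return "_".join(p.strip().lower().replace(" ", "_") for p in parts)


print(build_run_name("transformer", "cosine", "lr3e4", date(2024, 10, 15)))
# transformer_cosine_lr3e4_1015
print(build_run_name("xgboost", "feature ablation", "no temporal", date(2024, 10, 18)))
# xgboost_feature_ablation_no_temporal_1018
```

Pass the result as name= to wandb.init().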

Group Runs

Use group to cluster related runs (like all trials in a sweep) in the UI:

wandb.init(
    project="foundation_model_v2",
    group="architecture_sweep_oct15",  # groups all sweep trials
    job_type="train",
)

System Metrics

W&B logs GPU utilization, GPU memory, CPU, network, and disk I/O automatically. Check these before claiming a model is "fast" - sustained high GPU utilization with steady memory usage suggests the GPU is not sitting idle waiting on data loading or other host-side work.


Common Mistakes

:::danger Logging in Every Worker Process During Distributed Training In DDP training, every worker process will call wandb.init() if you are not careful. This creates N duplicate runs for an N-GPU job. Guard with a check on the global rank (RANK, as set by torchrun). LOCAL_RANK is 0 on every node, so on a multi-node job it would still create one run per node:

import os

if int(os.environ.get("RANK", 0)) == 0:
    wandb.init(project="...", name="...")

:::
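Wrapping the rank check in one helper avoids scattering environment lookups through the training script. This sketch assumes a torchrun-style launcher that sets the RANK environment variable to the global rank, and it treats a missing RANK as a single-process run. The function name is hypothetical.

```python
import os


def is_main_process(env=None):
    """True only for the process with global rank 0 (or a non-distributed run)."""
    env = os.environ if env is None else env
    return int(env.get("RANK", "0")) == 0


# Simulated environments for a 2-node x 2-GPU job: only one process qualifies,
# even though LOCAL_RANK is 0 on both nodes.
workers = [
    {"RANK": "0", "LOCAL_RANK": "0"},
    {"RANK": "1", "LOCAL_RANK": "1"},
    {"RANK": "2", "LOCAL_RANK": "0"},
    {"RANK": "3", "LOCAL_RANK": "1"},
]
print([is_main_process(w) for w in workers])  # [True, False, False, False]
```

In the training script, gate wandb.init(), wandb.log(), and wandb.finish() on is_main_process() so that non-zero ranks never touch the run.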

:::danger Using wandb.log with Non-Monotonic Step Values W&B expects the step argument to wandb.log() to be monotonically increasing. Logging with step=global_step inside one loop and step=epoch inside another loop creates broken charts. Pick one step counter and use it consistently. :::
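One way to keep a single counter is to derive a global step from the epoch and batch index and pass it to every wandb.log() call, including end-of-epoch validation. A minimal sketch, assuming a fixed number of batches per epoch (the helper name and the value of steps_per_epoch are hypothetical):

```python
def global_step(epoch, batch_idx, steps_per_epoch):
    """Single monotonically increasing counter shared by all logging calls."""
    return epoch * steps_per_epoch + batch_idx


steps_per_epoch = 500

# Steps used for per-batch training metrics.
train_steps = [
    global_step(epoch, batch_idx, steps_per_epoch)
    for epoch in range(3)
    for batch_idx in range(steps_per_epoch)
]

# Steps used for end-of-epoch validation metrics: the last batch of each epoch.
val_steps = [global_step(epoch, steps_per_epoch - 1, steps_per_epoch) for epoch in range(3)]

print(train_steps[:3], train_steps[-1])  # [0, 1, 2] 1499
print(val_steps)                         # [499, 999, 1499]
```

Because validation reuses the step of the batch it follows, the train and val curves share one x-axis and the counter never moves backwards.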

:::warning Not Finishing Runs After Crashes If a training script crashes without wandb.finish() being called, the run stays in "running" state indefinitely. Configure your training framework to call wandb.finish(exit_code=1) in exception handlers:

try:
    train()
except Exception:
    wandb.finish(exit_code=1)
    raise

:::

:::warning Storing Large Files as Metrics Metrics must be scalar numbers. Never try to log a tensor, numpy array, or PIL image directly to wandb.log() as a metric. Use wandb.Image(), wandb.Table(), or wandb.Artifact() for non-scalar data. :::


Interview Q&A

Q: What is a W&B sweep, and how does the Bayesian search work under the hood?

A: A W&B sweep is a coordinated hyperparameter optimization system. You define a search space and objective in a YAML config. W&B creates a sweep ID and a centralized sweep controller. Each agent (training process) contacts the controller to get a trial configuration, trains with those hyperparameters, reports the metric, and requests the next configuration. The Bayesian method uses a Gaussian Process surrogate model: after each trial, the controller fits a GP to the (hyperparameter configuration, metric) pairs observed so far, then uses Expected Improvement to select the next configuration likely to improve the metric. The Hyperband early termination prunes poorly performing trials before they complete, allowing more trials in the same budget.
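To make the Expected Improvement step concrete, here is a sketch of the standard closed-form EI for a metric being minimized, given a surrogate's predicted mean and standard deviation at a candidate configuration. It illustrates the textbook formula only. It is not W&B's controller code, whose acquisition details may differ, and the candidate predictions are invented numbers.

```python
import math


def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_improvement(mu, sigma, best):
    """Closed-form EI for minimization: E[max(best - Y, 0)] with Y ~ N(mu, sigma^2)."""
    if sigma <= 0.0:
        return max(best - mu, 0.0)  # no uncertainty: improvement is deterministic
    z = (best - mu) / sigma
    return (best - mu) * normal_cdf(z) + sigma * normal_pdf(z)


best_so_far = 18.4  # lowest val/perplexity observed so far

# Hypothetical surrogate predictions: (predicted mean, predicted std).
candidates = {
    "A": (18.0, 0.1),  # slightly better, very certain
    "B": (18.5, 2.0),  # slightly worse on average, very uncertain
    "C": (19.5, 0.2),  # clearly worse, very certain
}

scores = {name: expected_improvement(mu, sigma, best_so_far)
          for name, (mu, sigma) in candidates.items()}
next_trial = max(scores, key=scores.get)
print(next_trial)  # B
```

Candidate B wins even though its predicted mean is worse than the current best, because its large uncertainty leaves room for a big improvement. That trade-off between exploiting known-good regions and exploring uncertain ones is what makes the method sample-efficient.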

Q: How does W&B artifact versioning work, and how is it different from git LFS?

A: W&B Artifacts create immutable, versioned snapshots of files or directories, stored in W&B's artifact store (backed by cloud storage). Each version gets a unique hash, and a lineage graph links artifacts to the runs that produced or consumed them. Unlike git LFS, which tracks file versions in git commits, W&B Artifacts are tracked in the context of ML runs - you can see "this model artifact was produced by run X using dataset artifact v3." W&B also supports artifact metadata, types, and aliases (like "best," "production") that git LFS does not have. For ML workflows, W&B Artifacts provide richer lineage semantics than git LFS at the cost of storing data in W&B's infrastructure.
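The versioning and alias behaviour described here can be modelled in a few lines. The sketch below is a toy in-memory store, not W&B's storage implementation. The class name, the digest scheme, and the rule that re-logging identical contents reuses the existing version are simplifying assumptions made for illustration.

```python
import hashlib


class ToyArtifactStore:
    """In-memory model of content-addressed versions plus movable aliases."""

    def __init__(self):
        self.versions = {}  # artifact name -> list of digests (index = version)
        self.aliases = {}   # (artifact name, alias) -> version index

    def log(self, name, files, aliases=()):
        """`files` maps a path to its bytes. Returns a 'name:vN' reference."""
        h = hashlib.sha256()
        for path in sorted(files):  # sort so the digest ignores insertion order
            h.update(path.encode() + b"\0" + files[path] + b"\0")
        digest = h.hexdigest()

        history = self.versions.setdefault(name, [])
        if digest in history:
            version = history.index(digest)  # unchanged contents: reuse version
        else:
            history.append(digest)
            version = len(history) - 1
            self.aliases[(name, "latest")] = version
        for alias in aliases:
            self.aliases[(name, alias)] = version
        return f"{name}:v{version}"

    def resolve(self, name, alias):
        return f"{name}:v{self.aliases[(name, alias)]}"


store = ToyArtifactStore()
first = store.log("transformer_base", {"best_model.pt": b"weights-1"}, aliases=["best"])
again = store.log("transformer_base", {"best_model.pt": b"weights-1"})
second = store.log("transformer_base", {"best_model.pt": b"weights-2"})
print(first, again, second)                       # transformer_base:v0 transformer_base:v0 transformer_base:v1
print(store.resolve("transformer_base", "best"))    # transformer_base:v0
print(store.resolve("transformer_base", "latest"))  # transformer_base:v1
```

Aliases are the part git LFS has no equivalent for: "best" can keep pointing at v0 while "latest" moves on, and downstream jobs can ask for either by name.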

Q: How would you migrate from W&B to MLflow (or vice versa) without losing experiment history?

A: There is no official tool for moving W&B history into MLflow as of 2024 (W&B does document a beta importer for the opposite direction, MLflow into W&B). The practical approach for W&B to MLflow: export W&B runs using the W&B API (wandb.Api().runs(project_path)), extract params, metrics, and artifact paths, and reimport using mlflow.log_params() and mlflow.log_metrics(). Artifacts need to be re-uploaded. This is lossy - W&B's rich metadata (tables, reports, sweep structures) does not map cleanly to MLflow. In practice, most teams run both systems in parallel for a transition period or accept that old runs stay in the old system. The key lesson: choose your tracking system carefully before you have 10,000 runs.
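Part of that export step is reshaping data. The sketch below covers only that pure-Python part, with no W&B or MLflow calls, and assumes you have already pulled a run's config and summary into plain dicts. The helper names and the example run are hypothetical. It flattens a nested config into the flat key/value shape that mlflow.log_params() takes, and keeps only numeric summary values, since those are what fit mlflow.log_metrics().

```python
def flatten_config(config, parent_key="", sep="."):
    """Flatten nested dicts: {"optim": {"lr": 1e-3}} -> {"optim.lr": 1e-3}."""
    flat = {}
    for key, value in config.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else str(key)
        if isinstance(value, dict):
            flat.update(flatten_config(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


def numeric_summary(summary):
    """Keep only numeric scalars; tables, images, and strings are dropped."""
    return {
        key: float(value)
        for key, value in summary.items()
        if isinstance(value, (int, float)) and not isinstance(value, bool)
    }


# Hypothetical exported run, shaped like a config dict and a summary dict.
exported_run = {
    "config": {"optimizer": {"name": "AdamW", "lr": 3e-4}, "batch_size": 512},
    "summary": {"val/perplexity": 18.4, "val/predictions_sample": {"_type": "table"}},
}

params = flatten_config(exported_run["config"])
metrics = numeric_summary(exported_run["summary"])
print(params)   # {'optimizer.name': 'AdamW', 'optimizer.lr': 0.0003, 'batch_size': 512}
print(metrics)  # {'val/perplexity': 18.4}
```

The dropped table entry shows the lossy part of the migration in miniature: scalars carry over, rich media does not.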

Q: What is the W&B Local (self-hosted) option, and when would you use it?

A: W&B Local is an enterprise deployment that runs all W&B infrastructure in your own Kubernetes cluster or single VM. It provides the same UI and API as the SaaS platform but with full data sovereignty. Use it when: your data governance policy prohibits sending training metrics to external services, your training data contains PHI (Protected Health Information) and you are HIPAA-regulated, or you operate in a country with strict data residency requirements (EU, China). W&B Local requires an enterprise license and adds operational overhead (you manage the infrastructure). Smaller teams should exhaust the SaaS tier before considering Local.

Q: How do you use W&B for distributed hyperparameter optimization across multiple machines?

A: Create a sweep with wandb.sweep() on any machine to get a sweep ID. Then SSH into each machine (or submit to a job scheduler) and run wandb agent <sweep_id>. Each agent independently contacts the W&B sweep controller, gets a trial configuration, runs the training, reports metrics, and loops. The sweep controller coordinates across all agents automatically - you do not need to partition the search space manually. For HPC clusters, submit array jobs where each array element runs wandb agent --count 1 <sweep_id>. The Bayesian controller adapts to however many agents are running concurrently.

© 2026 EngineersOfAI. All rights reserved.