What is weights and biases?

W&B for production ML teams - run tracking, sweeps, artifact versioning, collaborative reports, alerts, and how it compares to MLflow.

Weights & Biases Deep Dive

Three Time Zones, One Research Deadline

Your team is building a foundation model. The research team is split across San Francisco, London, and Singapore. You have a hard deadline in 10 weeks - a workshop paper submission. Every researcher is running experiments. The SF team finishes a training run at 5pm and leaves a comment in Slack for London: "run 347 looks promising, LR 3e-4 with cosine warmup." London logs in 8 hours later, cannot find "run 347" anywhere because there is no shared system. They start their own experiments. Singapore sees neither team's results.

By week 6, you have 800 experiments, zero shared visibility, and three different "best models" being championed by three different researchers. The paper deadline is in 4 weeks. Nobody can answer the question: what is our current best result, and which run produced it?

Weights & Biases (W&B) was designed for exactly this scenario. It is a collaborative ML platform built around the idea that experiment results should be as shareable and collaborative as documents.

Why W&B Exists

W&B launched in 2017 and reached widespread adoption in research labs around 2019–2020. Its founders came from the OpenAI ecosystem and understood that the pain point for research teams was not just tracking - it was sharing, visualizing, and collaborating on results in real time.

Where MLflow optimizes for self-hosted infrastructure and engineering rigor, W&B optimizes for collaborative velocity. The hosted SaaS model means zero infrastructure setup. The rich UI means results are navigable without writing SQL queries. The sweep engine means HPO is integrated, not bolted on.

W&B Architecture

All metrics and artifacts flow to W&B's hosted infrastructure. For enterprises with data residency requirements, W&B offers a self-hosted "Local" deployment on your own Kubernetes cluster.

Getting Started

pip install wandb
wandb login  # prompts for API key from wandb.ai/authorize

The basic integration is minimal:

import torch
import wandb

# Initialize a run
run = wandb.init(
    project="foundation_model_v2",
    entity="your-team-name",         # W&B team/org
    name="transformer_3e4_cosine",   # human-readable run name
    tags=["transformer", "nlp", "phase-1"],
    notes="Testing cosine warmup with linear decay vs previous step schedule",
    config={                          # all hyperparameters go here
        "learning_rate": 3e-4,
        "batch_size": 512,
        "max_steps": 100_000,
        "warmup_steps": 2_000,
        "d_model": 512,
        "num_layers": 12,
        "num_heads": 8,
        "dropout": 0.1,
        "optimizer": "AdamW",
        "weight_decay": 0.1,
        "gradient_clip": 1.0,
        "scheduler": "cosine_with_warmup",
        "dataset": "openwebtext_v2",
        "tokenizer": "gpt2",
        "random_seed": 42,
    }
)

# Use wandb.config instead of raw values - makes sweeps work automatically
config = wandb.config

for step in range(config.max_steps):
    batch = get_batch()  # get_batch, train_step, evaluate, etc. stand in for your own training code
    loss = train_step(model, batch, config.learning_rate)

    # Log metrics
    wandb.log({
        "train/loss": loss,
        "train/perplexity": torch.exp(torch.tensor(loss)).item(),
        "train/learning_rate": get_current_lr(optimizer),
        "train/grad_norm": compute_grad_norm(model),
    }, step=step)

    if step % 1000 == 0:
        val_metrics = evaluate(model, val_loader)
        wandb.log({
            "val/loss": val_metrics["loss"],
            "val/perplexity": val_metrics["perplexity"],
            "val/accuracy": val_metrics["accuracy"],
        }, step=step)

# Mark run as complete, upload remaining data
wandb.finish()

Rich Logging: Tables, Images, and Custom Charts

W&B goes far beyond scalar metrics. You can log images, audio, video, tables of predictions, histograms, and custom Vega-Lite charts.

Logging Predictions as Tables

# Log a sample of model predictions for qualitative review
val_table = wandb.Table(
    columns=["input_text", "true_label", "predicted_label", "confidence", "correct"]
)

model.eval()
for i, (inputs, labels) in enumerate(val_loader):
    if i >= 10:  # log first 10 batches
        break
    with torch.no_grad():
        logits = model(inputs)
        probs = torch.softmax(logits, dim=-1)
        preds = probs.argmax(dim=-1)

    for j in range(len(inputs)):
        val_table.add_data(
            inputs[j]["text"],
            id2label[labels[j].item()],
            id2label[preds[j].item()],
            probs[j].max().item(),
            (preds[j] == labels[j]).item(),
        )

wandb.log({"val/predictions_sample": val_table})

Logging Images and Confusion Matrices

from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import numpy as np

# Confusion matrix
cm = confusion_matrix(all_labels, all_preds)
fig, ax = plt.subplots(figsize=(10, 8))
im = ax.imshow(cm, cmap="Blues")
plt.colorbar(im, ax=ax)
ax.set_xlabel("Predicted")
ax.set_ylabel("True")
wandb.log({"val/confusion_matrix": wandb.Image(fig)})
plt.close()

# Log model gradients and parameter distributions (useful for debugging)
wandb.watch(model, log="all", log_freq=100)  # logs gradients and weights

W&B Sweeps: Built-In Hyperparameter Optimization

Sweeps are W&B's integrated HPO system. You define a sweep configuration, and W&B's sweep controller distributes trials across any number of agents (training processes).

Sweep Configuration

# Sweep configuration as a Python dict (the same structure can be stored in a YAML file)
sweep_config = {
    "name": "transformer_lr_architecture_sweep",
    "method": "bayes",              # or "grid", "random"
    "metric": {
        "name": "val/perplexity",
        "goal": "minimize",
    },
    "parameters": {
        "learning_rate": {
            "distribution": "log_uniform_values",
            "min": 1e-5,
            "max": 1e-2,
        },
        "batch_size": {
            "values": [256, 512, 1024],
        },
        "d_model": {
            "values": [256, 512, 768],
        },
        "num_layers": {
            "values": [6, 8, 12],
        },
        "dropout": {
            "distribution": "uniform",
            "min": 0.0,
            "max": 0.4,
        },
        "warmup_ratio": {
            "distribution": "uniform",
            "min": 0.01,
            "max": 0.10,
        },
    },
    "early_terminate": {
        "type": "hyperband",
        "min_iter": 3,
        "eta": 3,
    },
}
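To make the `log_uniform_values` setting in the configuration above concrete, the sketch below samples learning rates uniformly in log space between the configured bounds. It is a plain-Python illustration of the idea, not W&B's own sampler; the helper name `sample_log_uniform` is invented for this example.

```python
import math
import random


def sample_log_uniform(low: float, high: float, rng: random.Random) -> float:
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)  # fixed seed so the illustration is repeatable
samples = [sample_log_uniform(1e-5, 1e-2, rng) for _ in range(10_000)]

# Fraction of samples in each decade: 1e-5..1e-4, 1e-4..1e-3, 1e-3..1e-2.
# Each decade should receive roughly a third of the draws.
per_decade = [
    sum(lo <= s < lo * 10 for s in samples) / len(samples)
    for lo in (1e-5, 1e-4, 1e-3)
]
print([round(p, 2) for p in per_decade])
```

A plain `uniform` distribution over the same range would place about 90% of its draws above 1e-3, which is why a log-scaled distribution is the usual choice for learning rates.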

Running a Sweep

import wandb
from torch.optim import AdamW

# Create the sweep - returns a sweep ID
sweep_id = wandb.sweep(
    sweep=sweep_config,
    project="foundation_model_v2",
    entity="your-team-name",
)

# Define the training function (uses wandb.config for hyperparams)
def train_sweep():
    with wandb.init() as run:
        config = wandb.config

        model = build_model(
            d_model=config.d_model,
            num_layers=config.num_layers,
            dropout=config.dropout,
        )
        optimizer = AdamW(
            model.parameters(),
            lr=config.learning_rate,
        )

        for step in range(MAX_STEPS):
            loss = train_step(model, get_batch(), optimizer)
            val_metrics = evaluate(model, val_loader)

            wandb.log({
                "train/loss": loss,
                "val/perplexity": val_metrics["perplexity"],
            }, step=step)

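            # Manual cutoff for clearly diverged trials. Hyperband needs no extra
            # code here: W&B stops runs based on the logged "val/perplexity" metric.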
            if val_metrics["perplexity"] > PRUNING_THRESHOLD:
                wandb.finish(exit_code=1)
                return

# Launch agents - each agent runs trials until the sweep is complete
# Run this command on each GPU machine:
wandb.agent(
    sweep_id=sweep_id,
    function=train_sweep,
    count=25,   # number of trials this agent will run
)

Sweep Methods Compared

| Method | When to Use | Sample Efficiency |
| --- | --- | --- |
| `grid` | Small, discrete search spaces | Low - exhaustive |
| `random` | Good baseline, highly parallelizable | Medium |
| `bayes` | Continuous spaces, fewer trials available | High |
| `hyperband` (an `early_terminate` type, combined with any method above) | Early stopping of bad runs | Very high (via pruning) |

:::tip For most teams: start with random (fast to start, highly parallelizable), switch to bayes when you have fewer than 50 GPU-days of budget and want to be more sample-efficient. :::
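The `early_terminate` block in the sweep configuration sets `min_iter: 3` and `eta: 3`. As described in the W&B documentation, Hyperband then checks runs at a geometric series of brackets, where an iteration is one logged value of the sweep metric. The sketch below only reproduces that arithmetic; `hyperband_brackets` and its `num_brackets` argument are hypothetical names for illustration, not part of the W&B API.

```python
def hyperband_brackets(min_iter: int, eta: int, num_brackets: int) -> list[int]:
    """Iterations at which runs are compared: min_iter * eta**k for k = 0, 1, 2, ..."""
    return [min_iter * eta**k for k in range(num_brackets)]


brackets = hyperband_brackets(min_iter=3, eta=3, num_brackets=4)
print(brackets)  # [3, 9, 27, 81]
```

With these settings a weak trial can be stopped after only 3 logged evaluations, which is where the extra trial budget comes from.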

W&B Artifacts: Versioned Data and Models

W&B Artifacts are versioned, immutable records of any file or directory. Think of them as git commits for data and models.

# Log a dataset as an artifact
with wandb.init(project="foundation_model_v2", job_type="data_prep") as run:
    dataset_artifact = wandb.Artifact(
        name="openwebtext_tokenized",
        type="dataset",
        description="OpenWebText tokenized with GPT-2 tokenizer, sequence length 1024",
        metadata={
            "source": "huggingface/openwebtext",
            "tokenizer": "gpt2",
            "sequence_length": 1024,
            "num_tokens": 9_035_582_198,
            "preprocessing_date": "2024-10-01",
        },
    )
    dataset_artifact.add_dir("data/tokenized/", name="tokenized")
    run.log_artifact(dataset_artifact)

# Reference the artifact in a training run (creates lineage link)
with wandb.init(project="foundation_model_v2", job_type="training") as run:
    # Use a specific version
    dataset_artifact = run.use_artifact("openwebtext_tokenized:v3")
    dataset_dir = dataset_artifact.download()

    # ... training ...

    # Log the trained model as an artifact
    model_artifact = wandb.Artifact(
        name="transformer_base",
        type="model",
        description="Trained on openwebtext_tokenized:v3. Val perplexity: 18.4",
        metadata={
            "val_perplexity": 18.4,
            "training_steps": 100_000,
            "dataset_version": "v3",
        },
    )
    model_artifact.add_file("checkpoints/best_model.pt")
    model_artifact.add_file("configs/model_config.yaml")
    run.log_artifact(model_artifact)

Artifact Lineage

W&B automatically builds a lineage graph from use_artifact and log_artifact calls:

raw_text:v1 → [preprocessing job] → openwebtext_tokenized:v3
openwebtext_tokenized:v3 → [training job] → transformer_base:v7
transformer_base:v7 → [fine-tuning job] → transformer_finetuned:v2

This lineage is queryable in the UI and API - you can trace any production model back to its raw training data.
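To illustrate what "tracing back" means, here is a minimal sketch that walks the example lineage above as a plain dictionary. It does not call the W&B API; the `PRODUCED_BY` mapping and `trace_to_source` helper are assumptions made for this example.

```python
# Each artifact maps to (job that produced it, upstream artifact it consumed).
# The entries mirror the three lineage edges listed above.
PRODUCED_BY = {
    "openwebtext_tokenized:v3": ("preprocessing job", "raw_text:v1"),
    "transformer_base:v7": ("training job", "openwebtext_tokenized:v3"),
    "transformer_finetuned:v2": ("fine-tuning job", "transformer_base:v7"),
}


def trace_to_source(artifact: str) -> list[str]:
    """Walk upstream until reaching an artifact with no recorded producer."""
    chain = [artifact]
    while chain[-1] in PRODUCED_BY:
        _job, upstream = PRODUCED_BY[chain[-1]]
        chain.append(upstream)
    return chain


chain = trace_to_source("transformer_finetuned:v2")
print(" <- ".join(chain))
```

The W&B lineage graph can also branch (a run may consume several artifacts); this sketch assumes one input per job to keep the walk linear.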

W&B Reports: Collaborative Documentation

Reports are the W&B feature that most distinguishes it from MLflow. A Report is a live document that embeds W&B plots, run comparisons, and text commentary in a shared, editable page.

Use cases:

Weekly experiment summaries: "Here are this week's best runs, why we think they worked, and what we are trying next"
Paper appendices: embed the exact run comparison plots that appear in your paper
Model card generation: document the training configuration, evaluation results, and intended use of a model before promoting it
Team onboarding: "Here is how our best model was found, here is the sweep that produced it"

Reports update automatically when new runs are logged - a plot in a report always shows the latest data from the query it references.

W&B Alerts

Alerts send notifications when training runs hit conditions you define. This is essential for long-running training jobs.

# Alert when validation loss plateaus for 5 epochs
consecutive_no_improvement = 0
best_val_loss = float("inf")

for epoch in range(MAX_EPOCHS):
    val_loss = evaluate(model, val_loader)["loss"]

    if val_loss < best_val_loss * 0.999:  # improvement threshold
        best_val_loss = val_loss
        consecutive_no_improvement = 0
    else:
        consecutive_no_improvement += 1

    # Fire once when the plateau is first detected, not again on every later epoch
    if consecutive_no_improvement == 5:
        wandb.alert(
            title="Training Plateau Detected",
            text=(
                f"Val loss has not improved for 5 epochs. "
                f"Current: {val_loss:.4f}, Best: {best_val_loss:.4f}. "
                f"Consider reducing LR or stopping."
            ),
            level=wandb.AlertLevel.WARN,
        )

# Alert on training completion with final metrics
wandb.alert(
    title="Training Complete",
    text=f"Run {wandb.run.name} finished. Val perplexity: {final_perplexity:.2f}",
    level=wandb.AlertLevel.INFO,
)

Alerts are delivered via email or Slack, depending on the notification options enabled in your W&B user settings.

W&B vs MLflow: When to Use Each

| Dimension | MLflow | W&B |
| --- | --- | --- |
| Hosting | Self-hosted (open-source) | SaaS (W&B servers) or self-hosted (enterprise) |
| Cost | Free (infra cost only) | Free tier + paid team plans |
| Setup | Requires server + DB setup | `pip install wandb && wandb login` |
| UI richness | Basic but functional | Rich, interactive, collaborative |
| Sweeps/HPO | Requires Optuna integration | Built-in sweep engine |
| Artifacts | S3/GCS backed | W&B artifact store |
| Reports | Not built-in | First-class feature |
| Alerts | Requires external integration | Built-in with Slack/email |
| Data residency | Full control | Requires enterprise plan |
| Framework support | MLflow flavor system | `wandb.watch()` + integrations |

The bottom line: use W&B for research (collaboration, rich viz, fast iteration, papers). Use MLflow for production engineering (self-hosted, regulatory compliance, integration with Spark/Databricks ecosystems).

Production Best Practices

Run Naming

Use a systematic naming convention that encodes the key hypothesis:

run_name = f"{model_family}_{hypothesis}_{key_param}_{date}"
# transformer_cosine_lr3e4_1015
# xgboost_feature_ablation_no_temporal_1018
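One way to keep the convention consistent across a team is to build names through a small helper instead of typing them by hand. The sketch below is an illustration only: `build_run_name` is a hypothetical helper, and the month-day date format and the year used in the example calls are assumptions based on the sample names above.

```python
from datetime import date


def build_run_name(model_family: str, hypothesis: str, key_param: str, day: date) -> str:
    """Join the four name components with underscores, normalizing spaces and dashes."""
    parts = [model_family, hypothesis, key_param, day.strftime("%m%d")]
    return "_".join(
        part.strip().lower().replace(" ", "_").replace("-", "_") for part in parts
    )


name_a = build_run_name("transformer", "cosine", "lr3e4", date(2024, 10, 15))
name_b = build_run_name("xgboost", "feature ablation", "no-temporal", date(2024, 10, 18))
print(name_a)  # transformer_cosine_lr3e4_1015
print(name_b)  # xgboost_feature_ablation_no_temporal_1018
```

The resulting string can be passed as the `name` argument to `wandb.init()`.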

Group Runs

Use group to cluster related runs (like all trials in a sweep) in the UI:

wandb.init(
    project="foundation_model_v2",
    group="architecture_sweep_oct15",  # groups all sweep trials
    job_type="train",
)

System Metrics

W&B logs GPU utilization, GPU memory, CPU, network, and disk I/O automatically. Check these before claiming a training setup is "fast": sustained high GPU utilization suggests the GPUs are not sitting idle waiting on data loading or communication.

Common Mistakes

:::danger Logging in Every Worker Process During Distributed Training In DDP training, every worker process will call wandb.init() if you are not careful. This creates N duplicate runs for an N-GPU job. Guard with rank check:

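import os

# RANK is the global rank set by launchers such as torchrun. LOCAL_RANK is 0 on
# every node, so checking it would still create one run per node in multi-node jobs.
if int(os.environ.get("RANK", 0)) == 0:
    wandb.init(project="...", name="...")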

:::

:::danger Using wandb.log with Non-Monotonic Step Values W&B expects the step argument to wandb.log() to be monotonically increasing. Logging with step=global_step inside one loop and step=epoch inside another loop creates broken charts. Pick one step counter and use it consistently. :::
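A simple way to follow this advice is to derive a single global step from the epoch and batch index and use it for both training and validation logging. The sketch below only demonstrates the arithmetic and checks that the sequence never decreases; `global_step` and `steps_per_epoch` are illustrative names, and in real code the value would be passed as `step=` to `wandb.log()`.

```python
def global_step(epoch: int, batch_idx: int, steps_per_epoch: int) -> int:
    """Single step counter shared by every metric in the run."""
    return epoch * steps_per_epoch + batch_idx


steps_per_epoch = 4
logged_steps = []
for epoch in range(3):
    for batch_idx in range(steps_per_epoch):
        # Training metrics: one log call per batch
        logged_steps.append(global_step(epoch, batch_idx, steps_per_epoch))
    # Validation metrics reuse the step of the last batch instead of `epoch`
    logged_steps.append(global_step(epoch, steps_per_epoch - 1, steps_per_epoch))

is_monotonic = all(a <= b for a, b in zip(logged_steps, logged_steps[1:]))
print(is_monotonic)  # True
```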

:::warning Not Finishing Runs After Crashes If a training script crashes without wandb.finish() being called, the run stays in "running" state indefinitely. Configure your training framework to call wandb.finish(exit_code=1) in exception handlers:

try:
    train()
except Exception as e:
    wandb.finish(exit_code=1)
    raise

:::

:::warning Storing Large Files as Metrics Metrics must be scalar numbers. Never try to log a tensor, numpy array, or PIL image directly to wandb.log() as a metric. Use wandb.Image(), wandb.Table(), or wandb.Artifact() for non-scalar data. :::

Interview Q&A

Q: What is a W&B sweep, and how does the Bayesian search work under the hood?

A: A W&B sweep is a coordinated hyperparameter optimization system. You define a search space and objective in a YAML file or a Python dict. W&B creates a sweep ID and a centralized sweep controller. Each agent (training process) contacts the controller to get a trial configuration, trains with those hyperparameters, reports the metric, and requests the next configuration. The Bayesian method uses a Gaussian Process surrogate model: after each trial, the controller fits a GP to the (hyperparameter configuration, metric) pairs observed so far, then uses Expected Improvement to select the next configuration likely to improve the metric. Hyperband early termination prunes poorly performing trials before they complete, allowing more trials in the same budget.
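For intuition, the Expected Improvement criterion mentioned in this answer can be written in a few lines. This is the textbook closed form for a metric being minimized, given a Gaussian posterior mean and standard deviation at a candidate configuration. It is a sketch for interview discussion, not W&B's actual implementation, and the candidate numbers are made up.

```python
import math


def expected_improvement(mu: float, sigma: float, best: float) -> float:
    """EI for minimization: E[max(best - f, 0)] with f ~ Normal(mu, sigma^2)."""
    if sigma <= 0.0:
        # No uncertainty: the improvement is deterministic
        return max(best - mu, 0.0)
    z = (best - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (best - mu) * cdf + sigma * pdf


best_perplexity = 18.4
# Candidate A: predicted slightly better than the best so far, with high certainty
ei_a = expected_improvement(mu=18.0, sigma=0.1, best=best_perplexity)
# Candidate B: predicted slightly worse, but the surrogate is very uncertain
ei_b = expected_improvement(mu=18.5, sigma=2.0, best=best_perplexity)
print(ei_a < ei_b)  # True
```

Candidate B scores higher even though its predicted mean is worse, which shows how the criterion trades off exploiting known-good regions against exploring uncertain ones.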

Q: How does W&B artifact versioning work, and how is it different from git LFS?

A: W&B Artifacts create immutable, versioned snapshots of files or directories, stored in W&B's artifact store (backed by cloud storage). Each version gets a unique hash, and a lineage graph links artifacts to the runs that produced or consumed them. Unlike git LFS, which tracks file versions in git commits, W&B Artifacts are tracked in the context of ML runs - you can see "this model artifact was produced by run X using dataset artifact v3." W&B also supports artifact metadata, types, and aliases (like "best," "production") that git LFS does not have. For ML workflows, W&B Artifacts provide richer lineage semantics than git LFS at the cost of storing data in W&B's infrastructure.

Q: How would you migrate from W&B to MLflow (or vice versa) without losing experiment history?

A: There is no official migration tool as of 2024. The practical approach: export W&B runs using the W&B API (wandb.Api().runs(project_path)), extract params, metrics, and artifact paths, and reimport using mlflow.log_params() and mlflow.log_metrics(). Artifacts need to be re-uploaded. This is lossy - W&B's rich metadata (tables, reports, sweep structures) does not map cleanly to MLflow. In practice, most teams run both systems in parallel for a transition period or accept that old runs stay in the old system. The key lesson: choose your tracking system carefully before you have 10,000 runs.
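The reshaping step in the middle of such a migration can be sketched without touching either service. The example assumes exported history rows are dictionaries that carry a `_step` key alongside metric values (the shape W&B history exports typically have) and produces `(key, value, step)` tuples suitable for `mlflow.log_metric(key, value, step=step)`. The helper names and sample data are hypothetical.

```python
def flatten_config(config: dict, prefix: str = "") -> dict:
    """Flatten nested hyperparameters into dotted keys for a flat params store."""
    flat = {}
    for key, value in config.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten_config(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def history_to_metric_points(rows: list[dict]) -> list[tuple[str, float, int]]:
    """Turn history rows into (metric_key, value, step) tuples, skipping non-scalars."""
    points = []
    for row in rows:
        step = int(row["_step"])
        for key, value in row.items():
            is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
            if key.startswith("_") or not is_number:
                continue  # internal fields, tables, images, and None are not portable
            points.append((key, float(value), step))
    return points


params = flatten_config({"optimizer": {"name": "AdamW", "lr": 3e-4}, "random_seed": 42})
points = history_to_metric_points([
    {"_step": 0, "train/loss": 4.2, "val/predictions_sample": {"_type": "table-file"}},
    {"_step": 1000, "train/loss": 3.1, "val/loss": None},
])
print(params)
print(points)
```

Anything the filter skips (tables, images, missing values) is exactly the part of the history that the answer above calls lossy.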

Q: What is the W&B Local (self-hosted) option, and when would you use it?

A: W&B Local is an enterprise deployment that runs all W&B infrastructure in your own Kubernetes cluster or single VM. It provides the same UI and API as the SaaS platform but with full data sovereignty. Use it when: your data governance policy prohibits sending training metrics to external services, your training data contains PHI (Protected Health Information) and you are HIPAA-regulated, or you operate in a country with strict data residency requirements (EU, China). W&B Local requires an enterprise license and adds operational overhead (you manage the infrastructure). Smaller teams should exhaust the SaaS tier before considering Local.

Q: How do you use W&B for distributed hyperparameter optimization across multiple machines?

A: Create a sweep with wandb.sweep() on any machine to get a sweep ID. Then SSH into each machine (or submit to a job scheduler) and run wandb agent <sweep_id>. Each agent independently contacts the W&B sweep controller, gets a trial configuration, runs the training, reports metrics, and loops. The sweep controller coordinates across all agents automatically - you do not need to partition the search space manually. For HPC clusters, submit array jobs where each array element runs wandb agent --count 1 <sweep_id>. The Bayesian controller adapts to however many agents are running concurrently.
