Artifact Management and Experiment Organization
2,000 Runs and No Map
It is six months into a production ML project. The team has been diligent about running experiments. They have logged 2,147 runs to MLflow. That is the good news. The bad news: nobody can find anything.
The production model - the one serving real users - was promoted in a Tuesday afternoon Slack message three months ago. The message says "pushing model v7 to prod." Nobody logged which MLflow run ID "model v7" corresponds to. The model file was copied to an S3 bucket called prod-models/ with the filename ctr_model_oct_v7.pkl. The run that trained it is somewhere in MLflow, but searching for "ctr" returns 400 results, sorted by creation time. The engineer who did it is on parental leave.
Now the model is degrading. You need to: (1) find the exact training run that produced it, (2) understand what data it was trained on, (3) determine whether a model from the same period with slightly better offline metrics would be a safe replacement, and (4) know who approved the promotion.
With good artifact management, this is a 30-second database query. Without it, it is a 3-day investigation.
What Is an ML Artifact
An artifact is any file produced or consumed by a training run that is necessary to understand or reproduce the run's result. This is broader than just model weights.
Every artifact needs:
- Content: the actual file
- Identity: a unique, immutable identifier (hash or versioned path)
- Provenance: which run created it, from which input artifacts
- Metadata: human-readable description, size, type, creation date
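The four properties above can be captured in a small record type. This is a minimal sketch rather than part of any tracking library: `ArtifactRecord` and `describe_artifact` are hypothetical names, and it assumes a SHA-256 hash of the file bytes serves as the immutable identity.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from pathlib import Path


@dataclass(frozen=True)
class ArtifactRecord:
    """Hypothetical record holding the four properties listed above."""
    path: str            # content: where the file lives
    sha256: str          # identity: hash of the file bytes (assumed scheme)
    run_id: str          # provenance: run that created the artifact
    input_hashes: tuple  # provenance: hashes of the input artifacts
    description: str     # metadata: human-readable description
    size_bytes: int      # metadata: file size
    created_at: str      # metadata: ISO 8601 timestamp in UTC


def describe_artifact(path, run_id, input_hashes=(), description=""):
    """Hash a file in 1 MiB chunks and build its ArtifactRecord."""
    file_path = Path(path)
    digest = hashlib.sha256()
    with file_path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return ArtifactRecord(
        path=str(file_path),
        sha256=digest.hexdigest(),
        run_id=run_id,
        input_hashes=tuple(input_hashes),
        description=description,
        size_bytes=file_path.stat().st_size,
        created_at=datetime.now(timezone.utc).isoformat(),
    )
```

Because the identity is derived from the bytes, two files with the same name but different contents can never be confused, which is exactly what a filename such as ctr_model_oct_v7.pkl cannot guarantee.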
Naming Conventions: The Foundation of Findability
The most impactful thing you can do before your second month of experiments is establish naming conventions. Conventions imposed after 1,000 nameless runs are close to useless - even where the tracking system lets you rename a run, nobody remembers what each old run was testing.
Experiment Names
An experiment groups related runs. Name experiments after the owning team, the project (the business objective), and the time horizon:
{team}/{project}/{quarter}
recommendations/ctr_model/2024q4
search/query_understanding/2024q4
fraud/transaction_classifier/2024q4
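A convention like this is easiest to keep when a script can check it. The sketch below is one possible validator; the regular expression (lowercase snake_case segments and a quarter written as four digits, `q`, and a digit from 1 to 4) and the name `parse_experiment_name` are assumptions made for illustration.

```python
import re

# Assumed format: lowercase team and project segments, then a quarter such as 2024q4.
EXPERIMENT_NAME_RE = re.compile(
    r"^(?P<team>[a-z][a-z0-9_]*)/(?P<project>[a-z][a-z0-9_]*)/(?P<quarter>\d{4}q[1-4])$"
)


def parse_experiment_name(name):
    """Split '{team}/{project}/{quarter}' into its parts, or raise ValueError."""
    match = EXPERIMENT_NAME_RE.match(name)
    if match is None:
        raise ValueError(
            f"Experiment name {name!r} does not match team/project/quarter"
        )
    return match.groupdict()
```

Calling `parse_experiment_name("recommendations/ctr_model/2024q4")` yields the three parts as a dictionary, while a bare name such as `"ctr"` raises `ValueError`.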
Run Names
A run is a single training job. Name runs after the key hypothesis being tested:
{model_family}_{key_hypothesis}_{variant}_{date}
transformer_cosine_lr_v1_1015
transformer_warmup_ablation_v3_1022
xgboost_feature_selection_no_temporal_1018
bert_large_vs_base_comparison_1101
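The date suffix (MMDD in these examples; use YYYYMMDD if a project spans more than one calendar year) lets you order runs that share a prefix chronologically. The hypothesis in the name allows filtering. The variant (a counter such as v1, or a short label such as no_temporal) distinguishes multiple runs testing the same idea.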
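Building run names from their parts, instead of typing them by hand, keeps the pattern consistent. The following is a sketch under assumptions: `build_run_name` is a hypothetical helper, each segment is required to be lowercase snake_case, and the 80-character cap mirrors the limit used by the wrapper later in this lesson.

```python
import re
from datetime import date

# Assumed rule: each segment is lowercase letters/digits joined by underscores.
_SEGMENT_RE = re.compile(r"^[a-z0-9]+(_[a-z0-9]+)*$")


def build_run_name(model_family, hypothesis, variant, run_date):
    """Compose '{model_family}_{key_hypothesis}_{variant}_{date}' with an MMDD date."""
    segments = (
        ("model_family", model_family),
        ("hypothesis", hypothesis),
        ("variant", variant),
    )
    for label, value in segments:
        if not _SEGMENT_RE.match(value):
            raise ValueError(f"{label} must be lowercase snake_case, got {value!r}")
    name = f"{model_family}_{hypothesis}_{variant}_{run_date:%m%d}"
    if len(name) > 80:
        raise ValueError("Run name too long (max 80 chars)")
    return name
```

For example, `build_run_name("transformer", "cosine_lr", "v1", date(2024, 10, 15))` produces `transformer_cosine_lr_v1_1015`, matching the first example above.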
Artifact Names
Artifacts within a run should have descriptive names that include their content type:
best_model/                  # saved model (best checkpoint)
final_preprocessor/          # sklearn pipeline, tokenizer, etc.
evaluation/                  # all evaluation outputs
    confusion_matrix.png
    classification_report.csv
    roc_curve.png
    shap_summary.png
configs/                     # config files used in this run
    model_config.yaml
    training_config.yaml
checkpoints/                 # intermediate checkpoints if needed
    epoch_10.pt
    epoch_20.pt
Tagging Strategy
Tags are the metadata layer that makes filtering at scale possible. Design your tag schema before your first run and enforce it via a wrapper function.
Core Tag Categories
STANDARD_TAGS = {
    # Team and project
    "team": "recommendations",             # which team owns this run
    "project": "ctr_model_2024q4",         # project within the team
    "hypothesis": "cosine_lr_schedule",    # what idea is being tested
    # Run lifecycle
    "status": "completed",                 # in_progress | completed | failed | archived
    "promoted": "false",                   # was this run promoted to the registry?
    "production_run": "false",             # is this the run that's in production?
    # Data
    "dataset": "clickstream_2024q3_v2",    # dataset used
    "dataset_hash": "a3f9c1d2",            # hash of the dataset
    # Code
    "git_sha": "f7b3a1c9",                 # git commit that produced this run
    "git_branch": "feature/cosine-lr",     # git branch
    # Environment
    "engineer": "sarah_chen",              # who ran this
    "gpu_type": "a100_80gb",               # hardware
    # Review
    "reviewed": "false",                   # has been reviewed by team lead?
    "approved_for_staging": "false",       # approved to go to staging?
}
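A schema only helps if something checks it. The sketch below validates a tag dictionary against the schema above; which tags count as required, and the name `validate_tags`, are assumptions for illustration, so adjust the lists to your own schema.

```python
# Assumed split: these tags must always be present and non-empty.
REQUIRED_TAGS = (
    "team", "project", "hypothesis", "status", "promoted", "production_run",
    "dataset", "dataset_hash", "git_sha", "git_branch", "engineer",
)

# Tags whose values come from a closed set (taken from the schema comments).
ALLOWED_VALUES = {
    "status": {"in_progress", "completed", "failed", "archived"},
    "promoted": {"true", "false"},
    "production_run": {"true", "false"},
    "reviewed": {"true", "false"},
    "approved_for_staging": {"true", "false"},
}


def validate_tags(tags):
    """Return a list of problems; an empty list means the tags pass."""
    problems = []
    for key in REQUIRED_TAGS:
        if not tags.get(key):
            problems.append(f"missing required tag: {key}")
    for key, allowed in ALLOWED_VALUES.items():
        if key in tags and tags[key] not in allowed:
            problems.append(f"tag {key}={tags[key]!r} not in {sorted(allowed)}")
    return problems
```

Returning a list of problems instead of raising on the first one lets a CI job or wrapper report every violation in a single pass.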
Enforcing Tags with a Wrapper
import mlflow
import subprocess
import socket
import os
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunConfig:
    team: str
    project: str
    experiment: str           # full experiment name: "{team}/{project}/{quarter}"
    hypothesis: str
    dataset: str
    dataset_hash: str
    engineer: Optional[str] = None
    notes: str = ""


@contextmanager
def production_run(config: RunConfig, run_name: str):
    """
    Context manager that enforces tagging conventions,
    logs environment metadata, and handles cleanup on failure.
    """
    engineer = config.engineer or os.environ.get("USER", "unknown")

    # Validate naming conventions (explicit raises still run under `python -O`,
    # where assert statements are stripped)
    if "/" in run_name:
        raise ValueError("Run name must not contain slashes")
    if len(run_name) > 80:
        raise ValueError("Run name too long (max 80 chars)")

    # Get git info
    try:
        git_sha = subprocess.check_output(
            ["git", "rev-parse", "--short", "HEAD"], text=True
        ).strip()
        git_branch = subprocess.check_output(
            ["git", "rev-parse", "--abbrev-ref", "HEAD"], text=True
        ).strip()
        # Any output from `git status --porcelain` means uncommitted changes
        git_dirty = bool(subprocess.check_output(
            ["git", "status", "--porcelain"], text=True
        ).strip())
    except Exception:
        git_sha, git_branch, git_dirty = "unknown", "unknown", True

    # Use the full experiment name so it matches the naming convention above
    mlflow.set_experiment(config.experiment)

    with mlflow.start_run(run_name=run_name) as run:
        mlflow.set_tags({
            "team": config.team,
            "project": config.project,
            "hypothesis": config.hypothesis,
            "dataset": config.dataset,
            "dataset_hash": config.dataset_hash,
            "engineer": engineer,
            "git_sha": git_sha,
            "git_branch": git_branch,
            "git_dirty": str(git_dirty),
            "hostname": socket.gethostname(),
            "status": "in_progress",
            "promoted": "false",
            "production_run": "false",
            "reviewed": "false",
            "notes": config.notes,
        })
        try:
            yield run
            mlflow.set_tag("status", "completed")
        except Exception as e:
            mlflow.set_tag("status", "failed")
            mlflow.set_tag("failure_reason", str(e)[:200])
            raise


# Usage
config = RunConfig(
    team="recommendations",
    project="ctr_model_2024q4",
    experiment="recommendations/ctr_model/2024q4",
    hypothesis="cosine_lr_vs_step",
    dataset="clickstream_2024q3_v2",
    dataset_hash="a3f9c1d2e5b8",
    notes="Testing cosine decay after observing step decay plateau at epoch 30",
)

with production_run(config, run_name="transformer_cosine_lr_v1_1015") as run:
    mlflow.log_params({"learning_rate": 3e-4, "scheduler": "cosine"})
    # ... training ...
Parent-Child Run Relationships
For HPO sweeps, ablation studies, and multi-stage pipelines, use nested runs (MLflow) or run groups (W&B) to establish parent-child relationships.
Nested Runs for HPO Sweeps
with mlflow.start_run(run_name="hpo_sweep_transformer_1015") as parent_run:
    mlflow.log_params({
        "sweep_algorithm": "bayesian_tpe",
        "n_trials": 100,
        "search_space": "lr,batch_size,dropout,num_layers",
    })
    mlflow.set_tag("run_type", "hpo_parent")

    best_val_auc = 0.0
    best_child_run_id = None

    for trial_num in range(100):
        with mlflow.start_run(
            run_name=f"trial_{trial_num:04d}",
            nested=True,
        ) as child_run:
            mlflow.set_tag("run_type", "hpo_trial")
            mlflow.set_tag("trial_number", str(trial_num))

            # sample_config and train_and_evaluate are your own helpers
            trial_config = sample_config(trial_num)
            mlflow.log_params(trial_config)
            val_auc = train_and_evaluate(**trial_config)
            mlflow.log_metric("val_auc", val_auc)

            if val_auc > best_val_auc:
                best_val_auc = val_auc
                best_child_run_id = child_run.info.run_id

    # Log best result on parent. Metric values must be numeric, so the
    # winning run ID is stored as a tag rather than a metric.
    mlflow.log_metric("best_val_auc", best_val_auc)
    if best_child_run_id is not None:
        mlflow.set_tag("best_trial_run_id", best_child_run_id)
Multi-Stage Pipeline
def run_pipeline(dataset_version: str):
    """Three-stage pipeline: preprocess → train → evaluate."""
    with mlflow.start_run(run_name=f"pipeline_{dataset_version}") as parent:
        mlflow.set_tag("pipeline_version", "v3.1")
        mlflow.set_tag("run_type", "pipeline_parent")

        # Stage 1: Preprocessing
        with mlflow.start_run(run_name="stage1_preprocess", nested=True) as stage1:
            mlflow.set_tag("stage", "preprocess")
            processed_data_path = preprocess_data(dataset_version)
            mlflow.log_artifact(processed_data_path, "preprocessed_data")
            mlflow.log_metric("num_samples_after_filtering", count_samples(processed_data_path))

        # Stage 2: Training
        with mlflow.start_run(run_name="stage2_train", nested=True) as stage2:
            mlflow.set_tag("stage", "train")
            mlflow.log_param("preprocessed_data_path", processed_data_path)
            model = train_model(processed_data_path)
            mlflow.pytorch.log_model(model, "trained_model")

        # Stage 3: Evaluation
        with mlflow.start_run(run_name="stage3_evaluate", nested=True) as stage3:
            mlflow.set_tag("stage", "evaluate")
            metrics = evaluate_model(model)
            mlflow.log_metrics(metrics)
            mlflow.log_artifact("outputs/confusion_matrix.png")

        # Summary on parent
        mlflow.log_metrics({"final_val_auc": metrics["auc"]})
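MLflow records the link between a nested run and its parent in the system tag `mlflow.parentRunId`, so the hierarchy can be rebuilt from a flat list of runs. The sketch below shows that grouping step on plain dictionaries; the `run_id`/`tags` dictionary shape and the name `group_children_by_parent` are simplifications assumed for illustration, standing in for the run objects a tracking client returns.

```python
from collections import defaultdict

PARENT_TAG = "mlflow.parentRunId"  # system tag MLflow sets on nested runs


def group_children_by_parent(runs):
    """Map each parent run ID to the list of its child run IDs.

    `runs` is an iterable of dicts with 'run_id' and 'tags' keys
    (an assumed, simplified stand-in for tracking-client run objects).
    """
    children = defaultdict(list)
    for run in runs:
        parent_id = run["tags"].get(PARENT_TAG)
        if parent_id is not None:
            children[parent_id].append(run["run_id"])
    return dict(children)
```

The same tag can also be used in a search filter to fetch only the children of one sweep or pipeline instead of scanning the whole experiment.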
Archival Policies
Without archival, your tracking system accumulates failed runs, duplicate runs, and exploratory runs forever. This creates noise and storage cost.
Archival Decision Framework
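The lifecycle policy used elsewhere in this lesson is: archive failed runs after 7 days, archive completed runs outside the top 10% after 30 days, and keep promoted runs indefinitely. One way to express that as a pure, testable function is sketched below; `archival_decision` is a hypothetical name, and treating in-progress or already-archived runs as "keep" is an assumption.

```python
def archival_decision(status, promoted, age_days, in_top_10_pct):
    """Return 'keep' or 'archive' for one run under the assumed policy."""
    if promoted:
        return "keep"  # promoted runs are kept indefinitely
    if status == "failed":
        return "archive" if age_days > 7 else "keep"
    if status == "completed":
        if in_top_10_pct:
            return "keep"  # strong runs stay searchable as baselines
        return "archive" if age_days > 30 else "keep"
    return "keep"  # assumption: in_progress or already archived means no action
```

Keeping the decision separate from the tracking-server calls means the policy can be unit-tested and changed without touching the code that talks to MLflow.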
Automated Archival Script
from mlflow.tracking import MlflowClient
from datetime import datetime, timedelta

client = MlflowClient()


def archive_old_runs(experiment_id: str, days_threshold: int = 30):
    """Archive non-promoted runs older than days_threshold."""
    cutoff_ms = int((datetime.now() - timedelta(days=days_threshold)).timestamp() * 1000)

    # Compute the top-10% threshold once, before any run is re-tagged.
    # search_runs returns at most max_results runs per call (1,000 by default);
    # paginate with page_token for larger experiments.
    all_runs = client.search_runs(
        experiment_ids=[experiment_id],
        filter_string="tags.status = 'completed'",
        order_by=["metrics.`val/auc` DESC"],
    )
    if not all_runs:
        print("No completed runs found")
        return
    aucs = sorted(
        [r.data.metrics.get("val/auc", 0) for r in all_runs], reverse=True
    )
    top_n = max(1, len(aucs) // 10)           # size of the top 10%, at least one run
    top_10_pct_threshold = aucs[top_n - 1]    # AUC of the last run inside the top 10%

    # Find completed, non-promoted runs older than threshold
    old_runs = client.search_runs(
        experiment_ids=[experiment_id],
        filter_string=(
            f"attributes.start_time < {cutoff_ms} "
            "AND tags.status = 'completed' "
            "AND tags.promoted = 'false'"
        ),
    )
    print(f"Found {len(old_runs)} candidate runs")

    for run in old_runs:
        # Keep the run if it is in the top 10% of its experiment
        run_auc = run.data.metrics.get("val/auc", 0)
        if run_auc >= top_10_pct_threshold:
            print(f"  Keeping {run.info.run_name} (top 10%)")
            continue

        # Archive the run (MLflow does not have a native archive status,
        # so we use a tag and optionally delete artifacts)
        client.set_tag(run.info.run_id, "status", "archived")
        client.set_tag(run.info.run_id, "archived_at",
                       datetime.now().isoformat())
        print(f"  Archived: {run.info.run_name}")
Cross-Team Sharing and Governance
When multiple teams share a tracking system, governance becomes essential.
Experiment Ownership
# Register experiments with ownership metadata
client.create_experiment(
    name="recommendations/ctr_model/2024q4",
    artifact_location="s3://ml-artifacts/recommendations/ctr_model/2024q4",
    tags={
        "owner_team": "recommendations",
        "owner_lead": "sarah_chen",
        "slack_channel": "#recommendations-ml",
        "business_metric": "click_through_rate",
        "model_type": "ranking",
        "created_date": "2024-10-01",
        "expected_end_date": "2024-12-31",
    },
)
Shared Model Registry Naming
When multiple teams push models to the same registry, use namespaced names:
{team}_{model_name}
recommendations_ctr_ranker
search_query_classifier
fraud_transaction_scorer
Finding the Production Model: A Case Study
Back to the opening problem: more than 2,000 runs, and you need to find the one that is in production.
If you have the tagging system in place:
# 30-second query
client = MlflowClient()
experiment = client.get_experiment_by_name("recommendations/ctr_model/2024q4")
production_runs = client.search_runs(
    experiment_ids=[experiment.experiment_id],   # expects a list of IDs
    filter_string="tags.production_run = 'true'",
)
if not production_runs:
    raise LookupError("No run is tagged production_run = 'true'")
run = production_runs[0]
print(f"Production run: {run.info.run_name}")
print(f"Run ID: {run.info.run_id}")
print(f"Dataset: {run.data.tags['dataset']}")
print(f"Dataset hash: {run.data.tags['dataset_hash']}")
print(f"Git SHA: {run.data.tags['git_sha']}")
print(f"Trained by: {run.data.tags['engineer']}")
print(f"Val AUC: {run.data.metrics['val/auc']:.4f}")
If you do not have the tagging system (the forensic case), you have to reconstruct the link by hand: cross-reference the model's S3 creation timestamp with MLflow run start times, filter to the runs in an approximate time window, then check the git SHAs in those runs' tags against deployment logs.
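The time-window step of the forensic search can be sketched as a small filter. This is an illustration under assumptions: `candidate_runs` is a hypothetical helper, runs are represented as plain dictionaries with a `start_time_ms` field (MLflow reports run timestamps in epoch milliseconds), and the 24-hour default window is an arbitrary starting point to widen or narrow as needed.

```python
from datetime import datetime, timedelta, timezone


def candidate_runs(runs, artifact_modified, window=timedelta(hours=24)):
    """Return runs that started within `window` before the artifact was written.

    `runs` is an iterable of dicts with 'run_id' and 'start_time_ms'
    (assumed shape). Results are ordered closest-to-the-artifact first.
    """
    artifact_ms = int(artifact_modified.timestamp() * 1000)
    window_ms = int(window.total_seconds() * 1000)
    matches = [
        run for run in runs
        if 0 <= artifact_ms - run["start_time_ms"] <= window_ms
    ]
    return sorted(matches, key=lambda run: artifact_ms - run["start_time_ms"])
```

The shortlist this produces still needs the git SHA check described above; timestamps narrow the search but do not prove which run produced the file.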
Common Mistakes
:::danger Not Tagging the Promoted Run at Promotion Time
The moment a model is promoted to production, tag its originating run with production_run=true and the registry version it became. If you wait to do this later, you will forget. Automate it in the promotion script.
:::
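A promotion script can apply these tags in the same step that promotes the model. The sketch below assumes a client object exposing `set_tag(run_id, key, value)`, which is the signature `MlflowClient.set_tag` uses; the function name `tag_promoted_run` and every tag name other than `production_run` and `promoted` are hypothetical choices for illustration.

```python
from datetime import datetime, timezone


def tag_promoted_run(client, run_id, registry_name, registry_version, approved_by):
    """Record promotion metadata on the run that produced the promoted model.

    `client` is anything with a set_tag(run_id, key, value) method,
    for example mlflow.tracking.MlflowClient.
    """
    tags = {
        "production_run": "true",
        "promoted": "true",
        "registry_model": registry_name,            # assumed tag name
        "registry_version": str(registry_version),  # assumed tag name
        "approved_by": approved_by,                 # assumed tag name
        "promoted_at": datetime.now(timezone.utc).isoformat(),
    }
    for key, value in tags.items():
        client.set_tag(run_id, key, value)
    return tags
```

A complete promotion script would also set `production_run` back to `false` on the previously deployed run, so that a query for the production run returns exactly one result.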
:::danger Storing Model Files Outside the Tracking System
Copying model files to s3://prod-models/my_model_v7.pkl creates an orphaned artifact with no lineage. Always log models as tracked artifacts in MLflow or W&B, then reference the registry version in your deployment scripts.
:::
:::warning No Cleanup Policy for Failed Runs
Failed runs accumulate. After 6 months, they can make up a large share of your tracking system and clutter the UI. Set up a weekly job that archives failed runs older than 7 days and deletes their artifact blobs. The metadata (params, tags) is valuable; the artifact blobs are not.
:::
:::warning Not Including the Dataset in the Run Name
A run named transformer_cosine_v1 tells you nothing about what data was used. A run named transformer_cosine_clickstream_q3_v1 is self-documenting. Include a short dataset identifier in every run name.
:::
Interview Q&A
Q: How would you design an artifact management system for a team training 500 models per week?
A: Four components: (1) Naming convention enforced by a wrapper function that validates all run and artifact names before creation. (2) Tag schema covering team, project, hypothesis, dataset, git SHA, engineer, and status - logged automatically by the wrapper. (3) Lifecycle policy: auto-archive failed runs after 7 days, non-top-10% runs after 30 days, keep promoted runs forever. (4) Model registry as the single source of truth for production models - no model goes to production without a registry entry, and the registry entry links back to the training run.
Q: What is the difference between an experiment, a run, and a model version in MLflow?
A: An experiment is a logical container grouping related runs - typically one per project or business objective. A run is a single execution of a training script - it has parameters, metrics, artifacts, and tags. A model version is a specific trained model that has been registered in the model registry and assigned a lifecycle stage (Staging, Production, Archived); note that MLflow deprecated stages in version 2.9 in favor of model version aliases and tags, so newer deployments mark the production version with an alias instead. The relationship: many runs belong to one experiment; a run can produce a model version; a model registry entry can have many versions across different lifecycle stages.
Q: How do you enforce tagging conventions across a team without constant manual reminders?
A: Replace mlflow.start_run() with a team-internal wrapper that validates and auto-populates required tags. Make the required tags part of the function signature - callers must provide team, project, and hypothesis arguments or the function raises an error. Add the wrapper to the team's shared ML utilities package so every training script imports it. In CI, run a linting check that fails the build if a training script calls mlflow.start_run() directly instead of the wrapper.
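The CI check mentioned in this answer can be sketched with Python's `ast` module. This is a minimal illustration, not a complete linter: `find_direct_start_run_calls` is a hypothetical name, and it only detects the literal form `mlflow.start_run(...)`, so aliased imports such as `import mlflow as mf` or `from mlflow import start_run` would need extra handling.

```python
import ast


def find_direct_start_run_calls(source, filename="<string>"):
    """Return the line numbers where `mlflow.start_run(...)` is called directly."""
    tree = ast.parse(source, filename=filename)
    hits = []
    for node in ast.walk(tree):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        if (
            isinstance(func, ast.Attribute)
            and func.attr == "start_run"
            and isinstance(func.value, ast.Name)
            and func.value.id == "mlflow"
        ):
            hits.append(node.lineno)
    return sorted(hits)
```

In CI, a small script would run this over every training script except the module that defines the wrapper, and exit with a non-zero status if any file returns a non-empty list.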
Q: What storage strategy minimizes artifact storage costs at scale?
A: Four strategies: (1) Log only the best model checkpoint per run, not all epoch checkpoints - reduces storage by 10-50x. (2) Use S3 lifecycle policies to move artifacts older than 90 days to S3 Glacier (roughly 10x cheaper storage). (3) Deduplicate artifacts - if two runs use the same base model weights, log a reference, not a copy. (4) Implement the archival policy described above - delete artifact blobs for failed and low-performing runs after 30 days (keep the metadata). For a team running 500 experiments per week with 500MB models, these strategies can bring storage costs down to roughly $3K/year.
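The effect of the first strategy is easy to check with arithmetic on the numbers in this answer. The sketch below computes yearly artifact volume only; `yearly_artifact_volume_gb` is a hypothetical helper, the figure of 20 checkpoints per run is an assumed example within the 10-50x range, and no storage prices are assumed.

```python
def yearly_artifact_volume_gb(runs_per_week, model_size_mb,
                              checkpoints_per_run=1, weeks_per_year=52):
    """Artifact volume written per year, in decimal gigabytes (1 GB = 1000 MB)."""
    total_mb = runs_per_week * model_size_mb * checkpoints_per_run * weeks_per_year
    return total_mb / 1000


# 500 runs/week at 500 MB each, keeping only the best checkpoint
best_only = yearly_artifact_volume_gb(500, 500)

# Same workload, but keeping 20 epoch checkpoints per run (assumed example)
all_epochs = yearly_artifact_volume_gb(500, 500, checkpoints_per_run=20)

print(f"best checkpoint only: {best_only:,.0f} GB/year")
print(f"all epoch checkpoints: {all_epochs:,.0f} GB/year")
```

With these inputs the best-checkpoint policy writes about 13 TB per year, against 260 TB when every checkpoint is kept; multiply by your provider's per-GB price to turn either figure into a cost.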
Q: How do you handle artifact management when training occurs across multiple clouds or on-premises?
A: Centralize the tracking metadata in a single MLflow or W&B instance accessible from all environments. For artifacts, use a cloud object store (S3 or GCS) accessible from all environments via cross-cloud IAM or VPN. If some artifacts must remain on-premises (data residency requirements), use MLflow's per-experiment artifact location feature to store sensitive artifacts in an on-premises MinIO instance while storing other artifacts in cloud storage. Tag all runs with their origin environment so you know where the artifact physically lives.
