What are ML artifacts?

Managing ML artifacts at scale - naming conventions, tagging, parent-child relationships, archival policies, and finding the model that became production from 2000 runs.

Artifact Management and Experiment Organization

2,000 Runs and No Map

It is six months into a production ML project. The team has been diligent about running experiments. They have logged 2,147 runs to MLflow. That is the good news. The bad news: nobody can find anything.

The production model - the one serving real users - was promoted in a Tuesday afternoon Slack message three months ago. The message says "pushing model v7 to prod." Nobody logged which MLflow run ID "model v7" corresponds to. The model file was copied to an S3 bucket called prod-models/ with the filename ctr_model_oct_v7.pkl. The run that trained it is somewhere in MLflow, but searching for "ctr" returns 400 results, sorted by creation time. The engineer who did it is on parental leave.

Now the model is degrading. You need to: (1) find the exact training run that produced it, (2) understand what data it was trained on, (3) determine whether a model from the same period with slightly better offline metrics would be a safe replacement, and (4) know who approved the promotion.

With good artifact management, this is a 30-second database query. Without it, it is a 3-day investigation.

What Is an ML Artifact

An artifact is any file produced or consumed by a training run that is necessary to understand or reproduce the run's result. This is broader than just model weights.

Every artifact needs:

Content: the actual file
Identity: a unique, immutable identifier (hash or versioned path)
Provenance: which run created it, from which input artifacts
Metadata: human-readable description, size, type, creation date
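
The four properties above map naturally onto a small record type. The following is a minimal sketch rather than an MLflow API: the `ArtifactRecord` class, its field names, and the `describe_artifact` helper are hypothetical, and SHA-256 is assumed as one reasonable choice of immutable identifier.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class ArtifactRecord:
    """Hypothetical record holding identity, provenance, and metadata."""
    sha256: str                 # identity: hash of the content
    produced_by_run: str        # provenance: run that created the artifact
    input_hashes: tuple = ()    # provenance: hashes of the input artifacts
    description: str = ""       # metadata: human-readable description
    size_bytes: int = 0         # metadata: content size
    created_at: str = field(    # metadata: creation time in UTC (ISO 8601)
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def describe_artifact(content: bytes, run_id: str, description: str = "",
                      input_hashes: tuple = ()) -> ArtifactRecord:
    """Build a record for in-memory content (a large file would be hashed in chunks)."""
    return ArtifactRecord(
        sha256=hashlib.sha256(content).hexdigest(),
        produced_by_run=run_id,
        input_hashes=tuple(input_hashes),
        description=description,
        size_bytes=len(content),
    )


weights = describe_artifact(b"fake model weights", run_id="run_123",
                            description="best checkpoint")
print(weights.sha256[:12], weights.size_bytes)
```

Because the identifier is derived from the content, two runs that log byte-identical files get the same identity, which is what makes deduplication and lineage checks possible.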

Naming Conventions: The Foundation of Findability

The most impactful thing you can do before your second month of experiments is establish naming conventions. Conventions imposed after 1,000 nameless runs are far less useful: even where the tracking system lets you rename runs, nobody remembers what each one was testing.

Experiment Names

An experiment groups related runs. Name experiments after the business objective + time horizon:

{team}/{project}/{quarter}
recommendations/ctr_model/2024q4
search/query_understanding/2024q4
fraud/transaction_classifier/2024q4
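
Building the experiment name in code keeps the quarter suffix consistent. This is a small sketch: the `experiment_name` helper and the lowercase snake_case rule for each segment are assumptions for illustration, not part of any tracking library.

```python
import re
from datetime import date

# Assumed rule: each segment is lowercase snake_case.
_SEGMENT = re.compile(r"^[a-z][a-z0-9_]*$")


def experiment_name(team: str, project: str, day: date) -> str:
    """Return '{team}/{project}/{quarter}' for the quarter containing `day`."""
    for label, value in (("team", team), ("project", project)):
        if not _SEGMENT.match(value):
            raise ValueError(f"{label} must be lowercase snake_case: {value!r}")
    quarter = (day.month - 1) // 3 + 1  # months 1-3 -> q1, ..., 10-12 -> q4
    return f"{team}/{project}/{day.year}q{quarter}"


print(experiment_name("recommendations", "ctr_model", date(2024, 10, 15)))
# recommendations/ctr_model/2024q4
```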

Run Names

A run is a single training job. Name runs after the key hypothesis being tested:

{model_family}_{key_hypothesis}_{variant}_{date}
transformer_cosine_lr_v1_1015
transformer_warmup_ablation_v3_1022
xgboost_feature_selection_no_temporal_1018
bert_large_vs_base_comparison_1101

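The date at the end (MMDD) records when the run was launched and lets you order runs that share a prefix chronologically; add the year if the project spans more than one. The hypothesis in the name allows filtering. The variant counter, where present, distinguishes multiple runs testing the same thing.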
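
A convention is easier to keep when names can be parsed back into their parts. The sketch below is one possible regular expression for the pattern above. It treats the variant as optional (two of the examples omit it) and assumes a four-digit MMDD date; the `parse_run_name` helper is hypothetical.

```python
import re
from typing import Optional

RUN_NAME = re.compile(
    r"^(?P<family>[a-z0-9]+)"            # model family, e.g. transformer
    r"_(?P<hypothesis>[a-z0-9_]+?)"      # hypothesis, shortest match
    r"(?:_(?P<variant>v\d+))?"           # optional variant counter, e.g. v3
    r"_(?P<date>(?:0[1-9]|1[0-2])(?:0[1-9]|[12]\d|3[01]))$"  # MMDD
)


def parse_run_name(name: str) -> Optional[dict]:
    """Split a run name into its parts, or return None if it does not conform."""
    match = RUN_NAME.match(name)
    return match.groupdict() if match else None


for name in ("transformer_cosine_lr_v1_1015",
             "xgboost_feature_selection_no_temporal_1018",
             "my experiment final FINAL"):
    print(name, "->", parse_run_name(name))
```

The same function can be called inside a run wrapper to reject non-conforming names before a run is created.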

Artifact Names

Artifacts within a run should have descriptive names that include their content type:

best_model/           # saved model (best checkpoint)
final_preprocessor/   # sklearn pipeline, tokenizer, etc.
evaluation/           # all evaluation outputs
  confusion_matrix.png
  classification_report.csv
  roc_curve.png
  shap_summary.png
configs/              # config files used in this run
  model_config.yaml
  training_config.yaml
checkpoints/          # intermediate checkpoints if needed
  epoch_10.pt
  epoch_20.pt

Tagging Strategy

Tags are the metadata layer that makes filtering at scale possible. Design your tag schema before your first run and enforce it via a wrapper function.

Core Tag Categories

STANDARD_TAGS = {
    # Team and project
    "team": "recommendations",          # which team owns this run
    "project": "ctr_model_2024q4",      # project within the team
    "hypothesis": "cosine_lr_schedule", # what idea is being tested

    # Run lifecycle
    "status": "completed",              # in_progress | completed | failed | archived
    "promoted": "false",                # was this run promoted to the registry?
    "production_run": "false",          # is this the run that's in production?

    # Data
    "dataset": "clickstream_2024q3_v2", # dataset used
    "dataset_hash": "a3f9c1d2",         # hash of the dataset

    # Code
    "git_sha": "f7b3a1c9",              # git commit that produced this run
    "git_branch": "feature/cosine-lr", # git branch

    # Environment
    "engineer": "sarah_chen",          # who ran this
    "gpu_type": "a100_80gb",           # hardware

    # Review
    "reviewed": "false",               # has been reviewed by team lead?
    "approved_for_staging": "false",   # approved to go to staging?
}
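
A schema like this can be checked mechanically before a run starts. The following sketch assumes a particular split between required and optional tags and a fixed vocabulary for the boolean-like tags; `REQUIRED_TAGS`, `ALLOWED_VALUES`, and `validate_tags` are illustrative names, not MLflow features.

```python
# Assumption: these tags must be present on every run.
REQUIRED_TAGS = {"team", "project", "hypothesis", "status",
                 "dataset", "dataset_hash", "git_sha", "engineer"}

# Assumption: tags with a closed vocabulary (values are strings, as above).
ALLOWED_VALUES = {
    "status": {"in_progress", "completed", "failed", "archived"},
    "promoted": {"true", "false"},
    "production_run": {"true", "false"},
    "reviewed": {"true", "false"},
    "approved_for_staging": {"true", "false"},
}


def validate_tags(tags: dict) -> list:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = [f"missing required tag: {key}"
                for key in sorted(REQUIRED_TAGS - set(tags))]
    for key, value in tags.items():
        if not isinstance(value, str):
            problems.append(
                f"{key}: value must be a string, got {type(value).__name__}")
        elif key in ALLOWED_VALUES and value not in ALLOWED_VALUES[key]:
            problems.append(
                f"{key}: {value!r} not in {sorted(ALLOWED_VALUES[key])}")
    return problems


print(validate_tags({"team": "recommendations", "status": "done", "promoted": False}))
```

Returning every problem at once, rather than raising on the first, lets an engineer fix a misconfigured run in a single pass.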

Enforcing Tags with a Wrapper

import mlflow
import subprocess
import socket
import os
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Optional

@dataclass
class RunConfig:
    team: str
    project: str
    hypothesis: str
    dataset: str
    dataset_hash: str
    engineer: Optional[str] = None
    notes: str = ""

@contextmanager
def production_run(config: RunConfig, run_name: str):
    """
    Context manager that enforces tagging conventions,
    logs environment metadata, and handles cleanup on failure.
    """
    engineer = config.engineer or os.environ.get("USER", "unknown")

    # Validate naming conventions (explicit raises survive `python -O`,
    # which strips assert statements)
    if "/" in run_name:
        raise ValueError("Run name must not contain slashes")
    if len(run_name) > 80:
        raise ValueError("Run name too long (max 80 chars)")

    # Get git info
    try:
        git_sha = subprocess.check_output(
            ["git", "rev-parse", "--short", "HEAD"], text=True
        ).strip()
        git_branch = subprocess.check_output(
            ["git", "rev-parse", "--abbrev-ref", "HEAD"], text=True
        ).strip()
        git_dirty = bool(subprocess.check_output(
            ["git", "status", "--porcelain"], text=True
        ).strip())
    except Exception:
        git_sha, git_branch, git_dirty = "unknown", "unknown", True

    experiment_name = f"{config.team}/{config.project}"
    mlflow.set_experiment(experiment_name)

    with mlflow.start_run(run_name=run_name) as run:
        mlflow.set_tags({
            "team": config.team,
            "project": config.project,
            "hypothesis": config.hypothesis,
            "dataset": config.dataset,
            "dataset_hash": config.dataset_hash,
            "engineer": engineer,
            "git_sha": git_sha,
            "git_branch": git_branch,
            "git_dirty": str(git_dirty),
            "hostname": socket.gethostname(),
            "status": "in_progress",
            "promoted": "false",
            "production_run": "false",
            "reviewed": "false",
            "notes": config.notes,
        })

        try:
            yield run
            mlflow.set_tag("status", "completed")
        except Exception as e:
            mlflow.set_tag("status", "failed")
            mlflow.set_tag("failure_reason", str(e)[:200])
            raise

# Usage
config = RunConfig(
    team="recommendations",
    project="ctr_model_2024q4",
    hypothesis="cosine_lr_vs_step",
    dataset="clickstream_2024q3_v2",
    dataset_hash="a3f9c1d2e5b8",
    notes="Testing cosine decay after observing step decay plateau at epoch 30",
)

with production_run(config, run_name="transformer_cosine_lr_v1_1015") as run:
    mlflow.log_params({"learning_rate": 3e-4, "scheduler": "cosine"})
    # ... training ...

Parent-Child Run Relationships

For HPO sweeps, ablation studies, and multi-stage pipelines, use nested runs (MLflow) or run groups (W&B) to establish parent-child relationships.

Nested Runs for HPO Sweeps

with mlflow.start_run(run_name="hpo_sweep_transformer_1015") as parent_run:
    mlflow.log_params({
        "sweep_algorithm": "bayesian_tpe",
        "n_trials": 100,
        "search_space": "lr,batch_size,dropout,num_layers",
    })
    mlflow.set_tag("run_type", "hpo_parent")

    best_val_auc = 0.0
    best_child_run_id = None

    for trial_num in range(100):
        with mlflow.start_run(
            run_name=f"trial_{trial_num:04d}",
            nested=True,
        ) as child_run:
            mlflow.set_tag("run_type", "hpo_trial")
            mlflow.set_tag("trial_number", str(trial_num))

            config = sample_config(trial_num)
            mlflow.log_params(config)

            val_auc = train_and_evaluate(**config)
            mlflow.log_metric("val_auc", val_auc)

            if val_auc > best_val_auc:
                best_val_auc = val_auc
                best_child_run_id = child_run.info.run_id

    # Log best result on parent. Metrics must be numeric, so the best
    # child's run ID is recorded as a tag rather than a metric.
    mlflow.log_metric("best_val_auc", best_val_auc)
    mlflow.set_tag("best_trial_run_id", best_child_run_id)

Multi-Stage Pipeline

def run_pipeline(dataset_version: str):
    """Three-stage pipeline: preprocess → train → evaluate."""

    with mlflow.start_run(run_name=f"pipeline_{dataset_version}") as parent:
        mlflow.set_tag("pipeline_version", "v3.1")
        mlflow.set_tag("run_type", "pipeline_parent")

        # Stage 1: Preprocessing
        with mlflow.start_run(run_name="stage1_preprocess", nested=True) as stage1:
            mlflow.set_tag("stage", "preprocess")
            processed_data_path = preprocess_data(dataset_version)
            mlflow.log_artifact(processed_data_path, "preprocessed_data")
            mlflow.log_metric("num_samples_after_filtering", count_samples(processed_data_path))

        # Stage 2: Training
        with mlflow.start_run(run_name="stage2_train", nested=True) as stage2:
            mlflow.set_tag("stage", "train")
            mlflow.log_param("preprocessed_data_path", processed_data_path)
            model = train_model(processed_data_path)
            mlflow.pytorch.log_model(model, "trained_model")

        # Stage 3: Evaluation
        with mlflow.start_run(run_name="stage3_evaluate", nested=True) as stage3:
            mlflow.set_tag("stage", "evaluate")
            metrics = evaluate_model(model)
            mlflow.log_metrics(metrics)
            mlflow.log_artifact("outputs/confusion_matrix.png")

        # Summary on parent
        mlflow.log_metrics({"final_val_auc": metrics["auc"]})
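
MLflow records the link between a nested run and its parent in the `mlflow.parentRunId` tag on the child. The sketch below rebuilds the parent-to-children mapping from that tag. To stay runnable without a tracking server it works on plain dicts; the `group_children` helper and the dict layout are simplified stand-ins for what a tracking client returns.

```python
from collections import defaultdict

PARENT_TAG = "mlflow.parentRunId"  # tag MLflow sets on nested (child) runs


def group_children(runs: list) -> dict:
    """Map each parent run ID to the IDs of its child runs.

    `runs` is a list of dicts with 'run_id' and 'tags' keys.
    """
    children = defaultdict(list)
    for run in runs:
        parent_id = run["tags"].get(PARENT_TAG)
        if parent_id is not None:
            children[parent_id].append(run["run_id"])
    return dict(children)


runs = [
    {"run_id": "p1", "tags": {"run_type": "pipeline_parent"}},
    {"run_id": "c1", "tags": {PARENT_TAG: "p1", "stage": "preprocess"}},
    {"run_id": "c2", "tags": {PARENT_TAG: "p1", "stage": "train"}},
    {"run_id": "c3", "tags": {PARENT_TAG: "p1", "stage": "evaluate"}},
]
print(group_children(runs))
# {'p1': ['c1', 'c2', 'c3']}
```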

Archival Policies

Without archival, your tracking system accumulates failed runs, duplicate runs, and exploratory runs forever. This creates noise and storage cost.

Archival Decision Framework
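
The lifecycle rules this lesson uses (keep promoted runs indefinitely, archive failed runs after 7 days, archive completed runs outside the top 10% of their experiment after 30 days) can be written as a single pure function. This is a sketch: the `archival_decision` name and signature are hypothetical, and the thresholds are the ones quoted in this lesson, not universal defaults.

```python
def archival_decision(status: str, promoted: bool, age_days: int,
                      in_top_10_pct: bool) -> str:
    """Return 'keep' or 'archive' for one run."""
    if promoted:
        return "keep"  # promoted runs are kept forever
    if status == "failed":
        return "archive" if age_days > 7 else "keep"
    if status == "completed" and not in_top_10_pct:
        return "archive" if age_days > 30 else "keep"
    return "keep"  # in-progress runs and top performers stay


print(archival_decision("failed", promoted=False, age_days=10, in_top_10_pct=False))
print(archival_decision("completed", promoted=True, age_days=400, in_top_10_pct=False))
```

Keeping the decision separate from the tracking-server calls makes the policy easy to unit test and to review when thresholds change.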

Automated Archival Script

from mlflow.tracking import MlflowClient
from datetime import datetime, timedelta

client = MlflowClient()

def archive_old_runs(experiment_id: str, days_threshold: int = 30):
    """Archive non-promoted runs older than days_threshold."""
    cutoff_ms = int((datetime.now() - timedelta(days=days_threshold)).timestamp() * 1000)

    # Find completed, non-promoted runs older than threshold
    old_runs = client.search_runs(
        experiment_ids=[experiment_id],
        filter_string=(
            f"attributes.start_time < {cutoff_ms} "
            "AND tags.status = 'completed' "
            "AND tags.promoted = 'false'"
        ),
    )

    print(f"Found {len(old_runs)} runs to archive")
    if not old_runs:
        return

    # Compute the top-10% AUC threshold once, not once per candidate run.
    # Note: search_runs returns a single page of results; paginate with
    # page_token if the experiment is larger than one page.
    all_runs = client.search_runs(
        experiment_ids=[experiment_id],
        filter_string="tags.status = 'completed'",
        order_by=["metrics.`val/auc` DESC"],
    )
    aucs = sorted(
        (r.data.metrics.get("val/auc", 0) for r in all_runs), reverse=True
    )
    top_10_pct_threshold = aucs[len(aucs) // 10]

    for run in old_runs:
        run_auc = run.data.metrics.get("val/auc", 0)
        if run_auc >= top_10_pct_threshold:
            print(f"  Keeping {run.info.run_name} (top 10%)")
            continue

        # Archive the run (MLflow does not have a native archive status,
        # so we use a tag and optionally delete artifacts)
        client.set_tag(run.info.run_id, "status", "archived")
        client.set_tag(run.info.run_id, "archived_at",
                       datetime.now().isoformat())
        print(f"  Archived: {run.info.run_name}")

Cross-Team Sharing and Governance

When multiple teams share a tracking system, governance becomes essential.

Experiment Ownership

# Register experiments with ownership metadata
client.create_experiment(
    name="recommendations/ctr_model/2024q4",
    artifact_location="s3://ml-artifacts/recommendations/ctr_model/2024q4",
    tags={
        "owner_team": "recommendations",
        "owner_lead": "sarah_chen",
        "slack_channel": "#recommendations-ml",
        "business_metric": "click_through_rate",
        "model_type": "ranking",
        "created_date": "2024-10-01",
        "expected_end_date": "2024-12-31",
    },
)

Shared Model Registry Naming

When multiple teams push models to the same registry, use namespaced names:

{team}_{model_name}
recommendations_ctr_ranker
search_query_classifier
fraud_transaction_scorer

Finding the Production Model: A Case Study

Back to the opening problem: 2,000 runs, and you need to find the one that is in production.

If you have the tagging system in place:

# 30-second query
client = MlflowClient()
experiment = client.get_experiment_by_name("recommendations/ctr_model/2024q4")
production_runs = client.search_runs(
    experiment_ids=[experiment.experiment_id],  # must be a list of IDs
    filter_string="tags.production_run = 'true'",
)

run = production_runs[0]  # assumes exactly one run carries the tag
print(f"Production run: {run.info.run_name}")
print(f"Run ID: {run.info.run_id}")
print(f"Dataset: {run.data.tags['dataset']}")
print(f"Dataset hash: {run.data.tags['dataset_hash']}")
print(f"Git SHA: {run.data.tags['git_sha']}")
print(f"Trained by: {run.data.tags['engineer']}")
print(f"Val AUC: {run.data.metrics['val/auc']:.4f}")

If you do not have the tagging system (the forensic case): cross-reference the model's S3 creation timestamp with MLflow run start times, filter by approximate time window, check git SHAs in run tags against deployment logs.
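
The first forensic step, narrowing runs by time window, can be sketched without any tracking client. The example below assumes you have already exported run IDs and start times into plain dicts and that you know the model file's creation timestamp. The `candidate_runs` helper and the 48-hour default window are illustrative choices.

```python
from datetime import datetime, timedelta, timezone


def candidate_runs(runs: list, artifact_created: datetime,
                   max_lag: timedelta = timedelta(hours=48)) -> list:
    """Runs that started before the artifact timestamp, within `max_lag`.

    Each run is a dict with 'run_id' and 'start_time' (timezone-aware
    datetime). The closest match comes first.
    """
    matches = [
        r for r in runs
        if timedelta(0) <= artifact_created - r["start_time"] <= max_lag
    ]
    return sorted(matches, key=lambda r: artifact_created - r["start_time"])


s3_created = datetime(2024, 10, 15, 12, 0, tzinfo=timezone.utc)
exported = [
    {"run_id": "a", "start_time": s3_created - timedelta(hours=5)},
    {"run_id": "b", "start_time": s3_created - timedelta(hours=30)},
    {"run_id": "c", "start_time": s3_created - timedelta(days=10)},
    {"run_id": "d", "start_time": s3_created + timedelta(hours=1)},
]
print([r["run_id"] for r in candidate_runs(exported, s3_created)])
# ['a', 'b']
```

The shortlist still has to be confirmed against git SHAs and deployment logs; the time window only reduces how many runs need manual inspection.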

Common Mistakes

:::danger Not Tagging the Promoted Run at Promotion Time
The moment a model is promoted to production, tag its originating run with production_run=true and the registry version it became. If you wait to do this later, you will forget. Automate it in the promotion script.
:::

:::danger Storing Model Files Outside the Tracking System
Copying model files to s3://prod-models/my_model_v7.pkl creates an orphaned artifact with no lineage. Always log models as tracked artifacts in MLflow or W&B, then reference the registry version in your deployment scripts.
:::

:::warning No Cleanup Policy for Failed Runs
Failed runs accumulate. After 6 months, they can make up a large share of your tracking system and clutter the UI. Set up a weekly job that archives failed runs older than 7 days and deletes their artifact blobs. The metadata (params, tags) is valuable; the artifact blobs are not.
:::

:::warning Not Including the Dataset in the Run Name
A run named transformer_cosine_v1 tells you nothing about what data was used. A run named transformer_cosine_clickstream_q3_v1 is self-documenting. Include a short dataset identifier in every run name.
:::

Interview Q&A

Q: How would you design an artifact management system for a team training 500 models per week?

A: Four components: (1) Naming convention enforced by a wrapper function that validates all run and artifact names before creation. (2) Tag schema covering team, project, hypothesis, dataset, git SHA, engineer, and status - logged automatically by the wrapper. (3) Lifecycle policy: auto-archive failed runs after 7 days, non-top-10% runs after 30 days, keep promoted runs forever. (4) Model registry as the single source of truth for production models - no model goes to production without a registry entry, and the registry entry links back to the training run.

Q: What is the difference between an experiment, a run, and a model version in MLflow?

A: An experiment is a logical container grouping related runs - typically one per project or business objective. A run is a single execution of a training script - it has parameters, metrics, artifacts, and tags. A model version is a specific trained model that has been registered in the model registry; it can be assigned a lifecycle stage (Staging, Production, Archived), although recent MLflow releases deprecate stages in favor of aliases and tags. The relationship: many runs belong to one experiment; a run can produce a model version; a registered model can have many versions in different lifecycle stages.

Q: How do you enforce tagging conventions across a team without constant manual reminders?

A: Replace mlflow.start_run() with a team-internal wrapper that validates and auto-populates required tags. Make the required tags part of the function signature - callers must provide team, project, and hypothesis arguments or the function raises an error. Add the wrapper to the team's shared ML utilities package so every training script imports it. In CI, run a linting check that fails the build if a training script calls mlflow.start_run() directly instead of the wrapper.
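
The CI check can be a few lines built on Python's standard `ast` module. This is a minimal sketch under one assumption: scripts call the function as `mlflow.start_run(...)`. It does not catch aliased imports such as `import mlflow as mf` or `from mlflow import start_run`. The function name is hypothetical.

```python
import ast


def find_direct_start_run_calls(source: str, filename: str = "<string>") -> list:
    """Return the line numbers of direct `mlflow.start_run(...)` calls."""
    hits = []
    for node in ast.walk(ast.parse(source, filename=filename)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == "start_run"
            and isinstance(node.func.value, ast.Name)
            and node.func.value.id == "mlflow"
        ):
            hits.append(node.lineno)
    return sorted(hits)


script = '''import mlflow
with mlflow.start_run(run_name="adhoc"):
    pass
'''
print(find_direct_start_run_calls(script))
# [2]
```

In CI, a small driver would run this over every training script and exit non-zero if any list is non-empty. The file that defines the wrapper itself needs to be exempted.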

Q: What storage strategy minimizes artifact storage costs at scale?

A: Four strategies: (1) Log only the best model checkpoint per run, not all epoch checkpoints - reduces storage by 10-50x. (2) Use S3 lifecycle policies to move artifacts older than 90 days to S3 Glacier (roughly 10x cheaper storage). (3) Deduplicate artifacts - if two runs use the same base model weights, log a reference, not a copy. (4) Implement the archival policy described above - delete artifact blobs for failed and low-performing runs after 30 days (keep the metadata). For a team running 500 experiments per week with 500MB models, these strategies can reduce storage costs from $50K/year to $3K/year.
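
The effect of strategy (1) can be checked with back-of-the-envelope arithmetic. The sketch below is only a calculator: the per-GB price, the 20 checkpoints per run, and the 52-week retention are hypothetical placeholders, so substitute your provider's current pricing and your own numbers.

```python
def stored_gb(runs_per_week: int, gb_per_checkpoint: float,
              checkpoints_per_run: int, retention_weeks: int) -> float:
    """Steady-state volume when artifacts are deleted after `retention_weeks`."""
    return runs_per_week * gb_per_checkpoint * checkpoints_per_run * retention_weeks


def annual_cost_usd(gb: float, usd_per_gb_month: float) -> float:
    """Yearly bill for holding `gb` at a flat monthly per-GB price."""
    return gb * usd_per_gb_month * 12


PRICE = 0.02  # hypothetical USD per GB-month, not a quoted price

# 500 runs/week with 0.5 GB models, as in the answer above.
keep_everything = stored_gb(500, 0.5, checkpoints_per_run=20, retention_weeks=52)
best_only = stored_gb(500, 0.5, checkpoints_per_run=1, retention_weeks=52)

print(f"all checkpoints: {annual_cost_usd(keep_everything, PRICE):,.0f} USD/year")
print(f"best checkpoint: {annual_cost_usd(best_only, PRICE):,.0f} USD/year")
```

Because cost is linear in every factor, each strategy's saving multiplies with the others: fewer checkpoints, shorter retention, and a cheaper storage tier compound.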

Q: How do you handle artifact management when training occurs across multiple clouds or on-premises?

A: Centralize the tracking metadata in a single MLflow or W&B instance accessible from all environments. For artifacts, use a cloud object store (S3 or GCS) accessible from all environments via cross-cloud IAM or VPN. If some artifacts must remain on-premises (data residency requirements), use MLflow's per-experiment artifact location feature to store sensitive artifacts in an on-premises MinIO instance while storing other artifacts in cloud storage. Tag all runs with their origin environment so you know where the artifact physically lives.
