Model Staging and Promotion

The Incident That Should Never Happen - But Does, Constantly

It was a Tuesday afternoon. A data scientist named Priya had just finished training a credit risk model that outperformed the incumbent by 4.2% on the held-out test set. Excited, she exported the model artifact to a shared S3 bucket, updated the path in a config file, and sent a Slack message to the engineering team: "New model is live. Let me know if you see anything weird."

Nobody reviewed the training data distribution. Nobody checked whether the model's performance held across demographic groups. Nobody ran it through the bias evaluation suite that had been added to the repo three months earlier but never integrated into any promotion workflow. Nobody asked Priya what version of the feature pipeline she had trained against - because there was no system that required her to answer that question. The deployment was a copy-paste operation dressed up as a release.

Seven days later, customer complaints started trickling in. A segment of users - specifically small business owners applying for lines of credit - were being systematically declined at a rate three times higher than the previous quarter. The issue wasn't a bug in the traditional sense. The model had learned a spurious correlation in the training data that only appeared in production under a specific combination of feature values. A proper fairness audit would have caught it. The bias check script would have caught it. But those tools existed on someone's laptop, not in the promotion pipeline.

The rollback took six hours because nobody knew exactly which model version had been running. The S3 path had been overwritten. The config file had no version history. The audit trail was a Slack thread that had since been buried under three hundred other messages. The team spent the next two weeks reconstructing what had happened, producing a post-mortem that essentially said: "we had no process."

The irony is that the fix was not technically complex. A structured promotion pipeline - one that requires automated checks to pass and a human reviewer to approve before any model enters production - would have caught this before it caused harm. The gap was not capability. It was process. And process, in MLOps, is what the model registry and its promotion workflow exist to enforce.

Why This Exists

Before model registries existed, model promotion was an informal process. You trained a model, you evaluated it in a notebook, and you pushed it somewhere. The "somewhere" varied by team: an S3 bucket, a shared filesystem, an ad-hoc model serving endpoint, a Docker image tagged with latest. None of these mechanisms had opinions about what had to be true before a model could move from experimentation to production.

The core problem was that there was no concept of a stage. Software engineering had solved this decades earlier with environments: you don't push directly to production; you push to dev, then staging, then production, with gates between each stage. But ML teams were reinventing the wheel in a worse way every time, because model promotion has additional complexity that software promotion does not. You're not just deploying code - you're deploying a statistical artifact that has opinions baked into it from training data. That artifact needs to be evaluated for accuracy, fairness, latency, and compatibility with the production feature pipeline before it can be trusted.

Model staging and promotion pipelines exist to formalize this process. They make the implicit explicit: here are the stages a model can be in, here are the criteria that must be met to move between stages, here is who has authority to approve the move, and here is the audit trail that proves everything was done correctly. This is not bureaucracy for its own sake. It is the difference between having a responsible ML system and having a time bomb.

Historical Context

The concept of model staging was largely absent from early ML tooling. The first generation of ML deployment tools - circa 2015–2018 - focused almost entirely on training and inference, with promotion treated as a deployment script problem. Teams using TensorFlow Serving or Seldon Core would write custom shell scripts to swap model versions.

MLflow, released by Databricks in 2018, later introduced the first widely adopted formalized model registry with explicit lifecycle stages. The original MLflow Model Registry (introduced in MLflow 1.4, October 2019) defined four stages: None, Staging, Production, and Archived. This was a deliberate analog to software deployment environments and gave teams a shared vocabulary for model lifecycle management.

The concept of automated promotion gates emerged from the DevOps world. Continuous delivery pipelines had long used automated quality gates - test coverage thresholds, performance benchmarks, security scans - as preconditions for promotion. ML teams began applying the same pattern around 2020–2021, driven partly by high-profile failures at companies that had shipped models with inadequate evaluation. The EU AI Act (proposed 2021, enacted 2024) and similar regulatory frameworks then turned documented pre-deployment evaluation from a best practice into a legal requirement for high-risk model categories.

Shadow mode deployment - running a new model in parallel with the production model without serving its predictions to users - was pioneered in recommendation systems at companies like Netflix and Spotify, where the cost of a bad recommendation is low but the signal from real traffic is extremely valuable. The pattern was later generalized and integrated into ML deployment platforms.

The Promotion Lifecycle

Every model in a well-run registry exists in exactly one stage at any given time. The canonical stages map directly to questions about readiness:

| Stage | Meaning | Who interacts |
| --- | --- | --- |
| None | Freshly registered, not evaluated | Data scientist |
| Staging | Under evaluation, passing automated checks | ML engineer |
| Production | Serving live traffic | Automated + on-call |
| Archived | Superseded, preserved for audit | Automated |

The transitions between stages are not automatic promotions - they are gated transitions. Each transition requires a set of criteria to be satisfied before it can proceed.
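One way to make the gating concrete is to encode the permitted transitions as data and reject everything else. The sketch below is an illustration under stated assumptions, not MLflow behavior: MLflow itself accepts any stage transition, and `ALLOWED_TRANSITIONS` and `validate_transition` are hypothetical names. The specific policy (for example, forbidding None to Production, and sending archived versions back through Staging) is an assumption drawn from the lifecycle described above.

```python
# Hypothetical transition policy; MLflow does not enforce this on its own.
ALLOWED_TRANSITIONS = {
    "None": {"Staging", "Archived"},
    "Staging": {"Production", "Archived", "None"},
    "Production": {"Archived"},
    "Archived": {"Staging"},  # assumption: a restored version is re-evaluated first
}


def validate_transition(current: str, target: str) -> None:
    """Raise ValueError unless moving from `current` to `target` is permitted."""
    if current not in ALLOWED_TRANSITIONS:
        raise ValueError(f"Unknown stage: {current!r}")
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"Transition {current} -> {target} is not allowed")


validate_transition("None", "Staging")        # permitted, returns silently
validate_transition("Staging", "Production")  # permitted, returns silently
```

A promotion script would call a check like this before invoking the registry, so that a skipped stage fails loudly instead of depending on convention.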

Automated Promotion Criteria

The power of a formal promotion pipeline is that it can enforce a checklist automatically. Instead of relying on a data scientist to remember to run the bias evaluation, the pipeline runs it for them and refuses to promote the model if it fails.

A production-grade automated gate typically checks five categories:

1. Accuracy Thresholds

The new model must meet a minimum absolute performance threshold AND outperform the current production model by at least a configurable margin.

def check_accuracy(new_metrics: dict, prod_metrics: dict, config: dict) -> tuple[bool, str]:
    """
    Returns (passed, reason) for the accuracy gate.
    Checks:
      - Absolute threshold: new model AUC must be >= min_auc
      - Relative improvement: new model must beat prod by at least min_delta
    """
    new_auc = new_metrics["auc_roc"]
    prod_auc = prod_metrics["auc_roc"]
    min_auc = config["min_absolute_auc"]       # e.g., 0.82
    min_delta = config["min_relative_improvement"]  # e.g., 0.005

    if new_auc < min_auc:
        return False, f"AUC {new_auc:.4f} below minimum threshold {min_auc}"

    improvement = new_auc - prod_auc
    if improvement < min_delta:
        return False, f"Improvement {improvement:.4f} below required delta {min_delta}"

    return True, f"AUC {new_auc:.4f} (+{improvement:.4f} vs production)"

2. Latency Budget

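The model must serve predictions within the agreed SLA, ideally measured under load using a representative request distribution. The check below is a simpler proxy: it times sequential single-row predictions after a short warm-up, so it captures per-request latency but not contention under concurrent traffic.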

import time
import statistics
import mlflow.pyfunc

def check_latency(model_uri: str, sample_inputs, config: dict) -> tuple[bool, str]:
    """
    Load the model and benchmark p50/p95/p99 latency.
    Fail if p95 exceeds the latency budget.
    """
    model = mlflow.pyfunc.load_model(model_uri)
    latencies = []

    # warm up
    for _ in range(10):
        model.predict(sample_inputs[:1])

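    # benchmark one row at a time; iterating a DataFrame directly would yield
    # column names, so slice single-row inputs by position instead
    for i in range(len(sample_inputs)):
        row = sample_inputs[i:i + 1]
        start = time.perf_counter()
        model.predict(row)
        latencies.append((time.perf_counter() - start) * 1000)  # ms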

    p50 = statistics.median(latencies)
    p95 = statistics.quantiles(latencies, n=20)[18]  # 95th percentile
    p99 = statistics.quantiles(latencies, n=100)[98]

    budget_ms = config["latency_budget_p95_ms"]  # e.g., 50ms
    if p95 > budget_ms:
        return False, f"p95 latency {p95:.1f}ms exceeds budget {budget_ms}ms"

    return True, f"Latency OK - p50={p50:.1f}ms, p95={p95:.1f}ms, p99={p99:.1f}ms"
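To approximate latency under concurrent traffic rather than one request at a time, the same measurement can be driven from several worker threads. This is a rough sketch under assumptions: `benchmark_under_load` is a hypothetical helper, `predict_fn` stands in for something like `model.predict`, and thread-based concurrency only approximates real load (for CPU-bound Python models the interpreter lock will serialize much of the work, so a dedicated load-testing tool against the deployed endpoint is more representative).

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Sequence


def benchmark_under_load(
    predict_fn: Callable[[Any], Any],
    requests: Sequence[Any],
    concurrency: int = 8,
) -> dict:
    """Issue requests from `concurrency` threads and report latency percentiles in ms."""

    def timed_call(request: Any) -> float:
        start = time.perf_counter()
        predict_fn(request)
        return (time.perf_counter() - start) * 1000

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(timed_call, requests))

    cuts = statistics.quantiles(latencies, n=100)  # 99 cut points; needs >= 2 samples
    return {
        "count": len(latencies),
        "p50_ms": statistics.median(latencies),
        "p95_ms": cuts[94],
        "p99_ms": cuts[98],
    }
```

Usage would look like `benchmark_under_load(model.predict, single_row_inputs, concurrency=16)`, with the p95 value compared against the same budget as in the sequential check.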

3. Fairness Metrics

The model's performance must not degrade unacceptably across protected demographic groups.

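import mlflow.pyfunc
import pandas as pd

# compute_auc is a helper this lesson does not define; any ROC AUC function with the
# signature (y_true, y_score) works, e.g. sklearn.metrics.roc_auc_score.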

def check_fairness(model_uri: str, eval_df: pd.DataFrame, config: dict) -> tuple[bool, str]:
    """
    Evaluate the model's performance disaggregated by demographic groups.
    Fail if performance gap between any subgroup and the overall metric
    exceeds the allowed disparity threshold.
    """
    model = mlflow.pyfunc.load_model(model_uri)
    predictions = model.predict(eval_df.drop(columns=["label", "demographic_group"]))

    overall_auc = compute_auc(eval_df["label"], predictions)
    max_disparity = config["max_demographic_disparity"]  # e.g., 0.05

    failures = []
    for group in eval_df["demographic_group"].unique():
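        # Positional boolean mask, so it indexes both the DataFrame and the
        # prediction array regardless of eval_df's index labels
        mask = (eval_df["demographic_group"] == group).to_numpy()
        group_auc = compute_auc(eval_df.loc[mask, "label"], predictions[mask])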
        gap = abs(overall_auc - group_auc)
        if gap > max_disparity:
            failures.append(f"  Group '{group}': AUC={group_auc:.4f}, gap={gap:.4f}")

    if failures:
        return False, "Fairness check FAILED:\n" + "\n".join(failures)

    return True, f"Fairness OK - overall AUC={overall_auc:.4f}, max group gap within threshold"
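The fairness gate calls `compute_auc` without showing an implementation. One dependency-free possibility is sketched below, assuming binary 0/1 labels and scalar scores; it uses the pairwise definition of ROC AUC (the probability that a random positive is scored above a random negative, with ties counted as half). It is quadratic in group size, so for large evaluation sets a library routine such as scikit-learn's `roc_auc_score` is the practical choice.

```python
from typing import Sequence


def compute_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC via pairwise comparison of positive and negative scores."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        # A small demographic group may contain only one class
        raise ValueError("AUC is undefined when only one class is present")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


print(compute_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

Note the `ValueError` path: a gate that disaggregates by group has to decide what to do when a group is too small to evaluate, and treating that as a failure or an explicit "insufficient data" result is safer than silently skipping the group.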

4. Data Drift Check

Ensure the model was trained on data that is still representative of the production distribution. A model trained on stale data can be technically accurate on the validation set but dead wrong in production.

import numpy as np

def check_data_drift(
    training_stats: dict,
    current_production_stats: dict,
    config: dict
) -> tuple[bool, str]:
    """
    Compare feature distributions between training time and current production.
    Uses PSI (Population Stability Index) for numerical features.
    PSI < 0.1: no significant change
    PSI 0.1–0.25: moderate change, investigate
    PSI > 0.25: significant change, do not promote
    """
    max_psi = config["max_psi_threshold"]  # e.g., 0.20
    violations = []

    for feature, train_dist in training_stats.items():
        prod_dist = current_production_stats.get(feature)
        if prod_dist is None:
            violations.append(f"  Feature '{feature}' missing from production stats")
            continue

        psi = compute_psi(train_dist["buckets"], train_dist["counts"], prod_dist["counts"])
        if psi > max_psi:
            violations.append(f"  Feature '{feature}': PSI={psi:.3f} (limit={max_psi})")

    if violations:
        return False, "Data drift check FAILED:\n" + "\n".join(violations)

    return True, "Data drift within acceptable bounds"


def compute_psi(buckets, expected_counts, actual_counts) -> float:
    """Population Stability Index calculation."""
    import numpy as np
    expected = np.array(expected_counts) / sum(expected_counts)
    actual = np.array(actual_counts) / sum(actual_counts)
    # Avoid division by zero
    expected = np.where(expected == 0, 1e-6, expected)
    actual = np.where(actual == 0, 1e-6, actual)
    return float(np.sum((actual - expected) * np.log(actual / expected)))
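A small worked example helps calibrate the PSI thresholds quoted in the docstring. The sketch below re-implements the same formula with only the standard library (the function name `psi` and the two-bucket counts are made up for illustration) and evaluates it on an unchanged distribution and on one where a quarter of the mass has moved between buckets.

```python
import math


def psi(expected_counts, actual_counts, floor=1e-6):
    """Population Stability Index over pre-bucketed counts."""
    e_total, a_total = sum(expected_counts), sum(actual_counts)
    total = 0.0
    for e_count, a_count in zip(expected_counts, actual_counts):
        e = max(e_count / e_total, floor)  # floor avoids log(0) and division by zero
        a = max(a_count / a_total, floor)
        total += (a - e) * math.log(a / e)
    return total


stable = psi([50, 50], [50, 50])   # identical distributions
shifted = psi([50, 50], [25, 75])  # 25% of the mass moved to the second bucket
print(f"{stable:.4f} {shifted:.4f}")  # 0.0000 0.2747
```

The shifted case works out to 0.25 * ln(3), about 0.275, which lands just above the 0.25 "significant change" line, so a shift of that size in a single feature would block promotion under the default threshold of 0.20.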

5. Schema Compatibility

The model's expected input schema must be compatible with what the production feature pipeline currently produces.

import mlflow

def check_schema_compatibility(model_uri: str, feature_pipeline_schema: dict) -> tuple[bool, str]:
    """
    Load the model's input schema from MLflow and verify it matches
    the current production feature pipeline output schema.
    """
    model_info = mlflow.models.get_model_info(model_uri)
    model_schema = model_info.signature.inputs.to_dict() if model_info.signature else None

    if model_schema is None:
        return False, "Model has no registered input signature - cannot verify compatibility"

    mismatches = []
    for feature in model_schema:
        name = feature["name"]
        expected_type = feature["type"]
        if name not in feature_pipeline_schema:
            mismatches.append(f"  Feature '{name}' not found in production pipeline")
        elif feature_pipeline_schema[name] != expected_type:
            actual_type = feature_pipeline_schema[name]
            mismatches.append(f"  Feature '{name}': model expects {expected_type}, pipeline produces {actual_type}")

    if mismatches:
        return False, "Schema incompatibility:\n" + "\n".join(mismatches)

    return True, "Schema compatible with production feature pipeline"

The Full Automated Promotion Pipeline

Here is a complete promotion pipeline that orchestrates all five checks and only transitions the model if every gate passes:

import mlflow
from mlflow.tracking import MlflowClient
from dataclasses import dataclass
from typing import Optional
import json
import datetime

@dataclass
class PromotionConfig:
    min_absolute_auc: float = 0.82
    min_relative_improvement: float = 0.005
    latency_budget_p95_ms: float = 50.0
    max_demographic_disparity: float = 0.05
    max_psi_threshold: float = 0.20

@dataclass
class PromotionResult:
    approved: bool
    model_name: str
    model_version: str
    checks: dict
    promoted_to: Optional[str]
    reason: Optional[str]
    timestamp: str

def run_promotion_pipeline(
    model_name: str,
    candidate_version: str,
    eval_dataset_path: str,
    feature_pipeline_schema: dict,
    config: PromotionConfig,
    mlflow_tracking_uri: str,
) -> PromotionResult:
    """
    Full automated promotion pipeline.
    Runs 5 gates. If all pass, transitions model to Staging.
    Records all results as tags on the model version for audit trail.
    """
    mlflow.set_tracking_uri(mlflow_tracking_uri)
    client = MlflowClient()

    model_uri = f"models:/{model_name}/{candidate_version}"
    candidate_run_id = client.get_model_version(model_name, candidate_version).run_id

    # Load evaluation data
    eval_df = load_eval_dataset(eval_dataset_path)
    sample_inputs = eval_df.drop(columns=["label", "demographic_group"]).head(500)

    # Fetch current production model metrics for comparison
    prod_version = get_production_version(client, model_name)
    if prod_version:
        prod_run = client.get_run(prod_version.run_id)
        prod_metrics = prod_run.data.metrics
    else:
        # No production model yet - use minimum thresholds only
        prod_metrics = {"auc_roc": 0.0}

    # Load candidate metrics from training run
    candidate_run = client.get_run(candidate_run_id)
    candidate_metrics = candidate_run.data.metrics

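    # Load training feature statistics for drift check.
    # Caveat: if "feature_stats" was never logged this yields {} and the drift
    # gate passes vacuously; MLflow also caps param length, so large statistics
    # are better stored as a run artifact.
    training_stats = json.loads(
        candidate_run.data.params.get("feature_stats", "{}")
    )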
    current_prod_stats = load_current_production_feature_stats()

    # --- Run all 5 gates ---
    checks = {}

    checks["accuracy"] = check_accuracy(candidate_metrics, prod_metrics, config.__dict__)
    checks["latency"] = check_latency(model_uri, sample_inputs, config.__dict__)
    checks["fairness"] = check_fairness(model_uri, eval_df, config.__dict__)
    checks["data_drift"] = check_data_drift(training_stats, current_prod_stats, config.__dict__)
    checks["schema"] = check_schema_compatibility(model_uri, feature_pipeline_schema)

    all_passed = all(passed for passed, _ in checks.values())
    timestamp = datetime.datetime.now(datetime.timezone.utc).isoformat()  # utcnow() is deprecated since Python 3.12

    # Record all check results as model version tags (permanent audit trail)
    for check_name, (passed, reason) in checks.items():
        client.set_model_version_tag(
            model_name, candidate_version,
            f"gate.{check_name}.passed", str(passed)
        )
        client.set_model_version_tag(
            model_name, candidate_version,
            f"gate.{check_name}.reason", reason[:500]  # MLflow tag value limit
        )

    client.set_model_version_tag(model_name, candidate_version, "promotion.timestamp", timestamp)
    client.set_model_version_tag(model_name, candidate_version, "promotion.pipeline_version", "v2.1")

    if all_passed:
        # Transition to Staging - not yet Production
        client.transition_model_version_stage(
            name=model_name,
            version=candidate_version,
            stage="Staging",
            archive_existing_versions=False,  # Don't auto-archive - let human review decide
        )
        client.set_model_version_tag(
            model_name, candidate_version,
            "promotion.auto_to_staging", "true"
        )

        return PromotionResult(
            approved=True,
            model_name=model_name,
            model_version=candidate_version,
            checks={k: {"passed": v[0], "reason": v[1]} for k, v in checks.items()},
            promoted_to="Staging",
            reason="All automated gates passed. Awaiting human review for Production promotion.",
            timestamp=timestamp,
        )
    else:
        failed_checks = [name for name, (passed, _) in checks.items() if not passed]
        return PromotionResult(
            approved=False,
            model_name=model_name,
            model_version=candidate_version,
            checks={k: {"passed": v[0], "reason": v[1]} for k, v in checks.items()},
            promoted_to=None,
            reason=f"Promotion blocked. Failed gates: {', '.join(failed_checks)}",
            timestamp=timestamp,
        )


def get_production_version(client: MlflowClient, model_name: str):
    """Return the current Production model version, or None if no production model exists."""
    versions = client.get_latest_versions(model_name, stages=["Production"])
    return versions[0] if versions else None

Human-in-the-Loop Gates

Automated gates catch objective failures. But some promotion decisions require human judgment - especially the final promotion from Staging to Production. A human reviewer should ask:

Does the improvement justify the risk? A 0.3% AUC improvement may not be worth the deployment risk.
Are there any known data issues? The automated checks may not have access to all context.
Is the timing appropriate? Don't promote during a holiday weekend when on-call coverage is thin.
Has the model been reviewed for unexpected behavior? Running a few manual predictions with edge cases.

The challenge with human-in-the-loop gates is rubber-stamping - reviewers who approve without actually reviewing. This is one of the most common failure modes in ML governance.

:::warning Preventing Rubber-Stamping To prevent rubber-stamping, require reviewers to answer a short structured questionnaire before they can approve. Each answer should be logged. If a reviewer approves within 30 seconds of opening the review request, flag it for audit. Make the reviewer explicitly acknowledge the model's known limitations. The goal is friction - the right amount to ensure real review without blocking progress. :::

def request_human_approval(
    model_name: str,
    model_version: str,
    reviewer_slack_channel: str,
    promotion_result: PromotionResult,
):
    """
    Send a structured approval request to Slack.
    The reviewer must respond with a specific approval token
    (not just a thumbs up) to confirm they reviewed the checklist.
    """
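    summary_lines = []
    for check, result in promotion_result.checks.items():
        status = "PASS" if result["passed"] else "FAIL"
        summary_lines.append(f"  {status}  {check}: {result['reason'][:80]}")
    # Join with newlines outside the f-string: backslashes inside f-string
    # expressions are a syntax error before Python 3.12
    gate_summary = "\n".join(summary_lines)
    pipeline_status = (
        "All automated gates PASSED" if promotion_result.approved
        else "One or more automated gates FAILED"
    )

    message = f"""
*Model Promotion Request - Human Review Required*

Model: `{model_name}` version `{model_version}`
Pipeline: {pipeline_status}
Next stage: *Production*

*Gate Summary:*

{gate_summary}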

*Before approving, confirm:*
1. You have reviewed the model card
2. You have checked the disaggregated eval results
3. You are aware of any known edge case failures
4. On-call coverage is in place for the deployment window

To approve: `/approve-model {model_name} {model_version} <your-name>`
To reject:  `/reject-model {model_name} {model_version} <reason>`
"""
    send_slack_message(reviewer_slack_channel, message)
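The rubber-stamping safeguards from the warning above (a structured questionnaire, logged answers, and a flag on approvals submitted within 30 seconds) can be sketched as a small validation step that runs when the approval command arrives. Everything here is hypothetical scaffolding: `ApprovalAttempt`, `REQUIRED_ANSWERS`, and `review_approval` are illustrative names, and the questionnaire keys simply mirror the checklist in the Slack message.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical questionnaire keys, mirroring the checklist in the approval request
REQUIRED_ANSWERS = (
    "model_card_reviewed",
    "disaggregated_results_checked",
    "known_limitations_acknowledged",
    "on_call_confirmed",
)


@dataclass
class ApprovalAttempt:
    reviewer: str
    opened_at: datetime     # when the reviewer opened the review request
    submitted_at: datetime  # when the approval was submitted
    answers: dict           # questionnaire key -> free-text answer


def review_approval(
    attempt: ApprovalAttempt,
    min_review: timedelta = timedelta(seconds=30),
) -> dict:
    """Reject incomplete questionnaires and flag suspiciously fast approvals."""
    missing = [k for k in REQUIRED_ANSWERS if not attempt.answers.get(k, "").strip()]
    too_fast = (attempt.submitted_at - attempt.opened_at) < min_review
    return {
        "accepted": not missing,
        "missing_answers": missing,
        "flag_for_audit": too_fast,
    }
```

In this sketch a fast approval is flagged rather than rejected, which keeps the friction low for the reviewer while still leaving a trace for whoever audits the approval log.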

Shadow Mode as a Pre-Promotion Stage

Before promoting a model to full production, routing a small percentage of live traffic through it - without serving its predictions to users - is one of the most powerful validation techniques available.

Shadow mode works as follows: the production model serves the actual response. The shadow model receives the same request, runs inference, and logs its prediction to a shadow log store. Engineers then compare the shadow model's predictions to the production model's predictions (and eventually to the ground truth labels when they arrive) to understand how the models differ.

import datetime

# hash_features and log_shadow_error are helpers this lesson does not define.

class ShadowServingRouter:
    """
    Routes a percentage of requests to a shadow model.
    The shadow model's predictions are logged but NOT returned to the client.
    """
    def __init__(
        self,
        production_model,
        shadow_model,
        shadow_traffic_fraction: float = 0.01,
        shadow_log_store=None,
    ):
        self.production_model = production_model
        self.shadow_model = shadow_model
        self.shadow_fraction = shadow_traffic_fraction
        self.shadow_log = shadow_log_store
        import random
        self._rng = random.Random()

    def predict(self, features: dict, request_id: str) -> dict:
        # Always run production model - this is what the user gets
        production_result = self.production_model.predict(features)

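        # Probabilistically run shadow model. This call is synchronous, so shadow
        # latency is added to the request; production systems usually hand it to
        # a background worker or queue instead.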
        if self._rng.random() < self.shadow_fraction:
            try:
                shadow_result = self.shadow_model.predict(features)
                if self.shadow_log:
                    self.shadow_log.write({
                        "request_id": request_id,
                        "production_prediction": production_result,
                        "shadow_prediction": shadow_result,
                        "features_hash": hash_features(features),
                        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
                    })
            except Exception as e:
                # Shadow model failures MUST NOT affect production
                log_shadow_error(request_id, str(e))

        return production_result

:::tip Shadow Mode Duration Run shadow mode for at least 24–48 hours to capture daily traffic patterns. For models whose traffic has weekly seasonality (e.g., recommendation engines, fraud detection), extend shadow mode to at least one full week so that weekend behavior is covered. :::
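Once shadow records accumulate, the first comparison is usually how far the two models' outputs diverge and how often that divergence would change a decision. The sketch below assumes each logged prediction is a single numeric score and that decisions are made by thresholding it; `summarize_shadow_log` and `decision_threshold` are hypothetical names, and the record keys follow the dictionary written by the router above.

```python
from statistics import mean


def summarize_shadow_log(records, decision_threshold=0.5):
    """Summarize score divergence between production and shadow predictions."""
    if not records:
        raise ValueError("no shadow records to compare")

    diffs = [
        abs(r["shadow_prediction"] - r["production_prediction"]) for r in records
    ]
    # A "flip" is a request where the two models fall on opposite sides of the threshold
    flips = sum(
        (r["shadow_prediction"] >= decision_threshold)
        != (r["production_prediction"] >= decision_threshold)
        for r in records
    )
    return {
        "n": len(records),
        "mean_abs_diff": mean(diffs),
        "max_abs_diff": max(diffs),
        "decision_flip_rate": flips / len(records),
    }
```

A high flip rate is not necessarily bad, since the new model is supposed to behave differently, but it tells reviewers how many users would see a different outcome and which requests deserve a closer look before promotion.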

Champion-Challenger During Promotion

Shadow mode tells you how the models differ in predictions. Champion-challenger tells you which model produces better outcomes. In a champion-challenger setup, the challenger (new model) receives a small percentage of live traffic (e.g., 5–10%) and its outcomes are tracked against the champion (current production model).

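The distinction: in shadow mode the challenger sees real requests, but its predictions are never served, so no user is affected by it. In champion-challenger, the challenger's predictions are served to a limited share of real users.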

import datetime
import hashlib

class ChampionChallengerRouter:
    """
    Routes a configurable percentage of traffic to the challenger model.
    Both models return real predictions to real users.
    Outcomes are tracked per model for statistical comparison.
    """
    def __init__(
        self,
        champion_model,
        challenger_model,
        challenger_fraction: float = 0.05,  # 5% to challenger
        outcome_tracker=None,
    ):
        self.champion = champion_model
        self.challenger = challenger_model
        self.challenger_fraction = challenger_fraction
        self.outcome_tracker = outcome_tracker

    def predict(self, features: dict, user_id: str) -> tuple[dict, str]:
        # Deterministic routing by user_id for consistency
        # (same user always gets same model during the experiment)
        import hashlib
        user_hash = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
        use_challenger = (user_hash % 100) < (self.challenger_fraction * 100)

        model = self.challenger if use_challenger else self.champion
        model_label = "challenger" if use_challenger else "champion"

        result = model.predict(features)

        if self.outcome_tracker:
            self.outcome_tracker.record_exposure(
                user_id=user_id,
                model_label=model_label,
                prediction=result,
                timestamp=datetime.datetime.now(datetime.timezone.utc).isoformat(),
            )

        return result, model_label
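The "statistical comparison" of tracked outcomes can take many forms. As one hedged example, if each exposure ends in a binary outcome (say, a loan that did or did not default), a two-proportion z-test indicates whether the difference between champion and challenger is larger than chance would explain. The function name and the counts in the usage line are made up for illustration, and the normal approximation assumes reasonably large samples in both arms.

```python
import math


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for a difference in proportions; returns (z, p_value)."""
    if n_a <= 0 or n_b <= 0:
        raise ValueError("both groups need at least one observation")

    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # every outcome identical in both arms: no evidence of a difference

    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the standard normal
    return z, p_value


# Hypothetical counts: champion 500/10000 positive outcomes, challenger 80/1000
z, p = two_proportion_z_test(500, 10_000, 80, 1_000)
```

Because the challenger only receives a few percent of traffic, its arm fills slowly; deciding the required sample size before the experiment starts avoids stopping the moment the numbers look favorable.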

Promotion Webhooks and CI/CD Integration

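When a model transitions between stages, that event should trigger downstream actions. Registry webhooks are the usual mechanism. The stage-transition events used below (for example MODEL_VERSION_TRANSITIONED_TO_PRODUCTION) come from the Databricks-managed Model Registry; the open-source MlflowClient has no create_registry_webhook method, so treat the snippet as an illustration of the pattern and use your platform's webhook API for the actual calls (on Databricks, the registry webhooks REST API or the databricks-registry-webhooks package).

from mlflow.tracking import MlflowClient

client = MlflowClient()

# Illustrative only: create_registry_webhook stands in for your platform's webhook API.
# Register a webhook that fires when any model transitions to Production
# This triggers a deployment job in your CI/CD system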
client.create_registry_webhook(
    events=["MODEL_VERSION_TRANSITIONED_TO_PRODUCTION"],
    http_url_spec={
        "url": "https://gitlab.example.com/api/v4/projects/42/trigger/pipeline",
        "authorization": "Bearer <trigger-token>",
        "enable_ssl_verification": True,
    },
    description="Trigger GitLab deployment pipeline when model reaches Production",
)

# Register a webhook for Slack notification on any stage change
client.create_registry_webhook(
    events=[
        "MODEL_VERSION_TRANSITIONED_STAGE",
        "MODEL_VERSION_TAG_SET",
    ],
    http_url_spec={
        "url": "https://hooks.slack.com/services/T00/B00/XXXX",
        "enable_ssl_verification": True,
    },
    description="Slack notification on all registry events",
)

The deployment job triggered by the webhook handles the actual infrastructure changes: updating the Kubernetes deployment to point to the new model version, running a smoke test, and confirming health checks pass.

Multi-Environment Promotion

Large organizations often run separate registries for each environment rather than using stages within a single registry. This provides stronger isolation and clearer RBAC.

Copying a model version between registries while preserving provenance:

import mlflow
from mlflow.tracking import MlflowClient

def copy_model_version_to_registry(
    source_registry_uri: str,
    target_registry_uri: str,
    model_name: str,
    version: str,
    target_stage: str = "Staging",
):
    """
    Copy a model version from one registry to another.
    Preserves tags, run metadata, and lineage information.
    """
    # Load from source
    source_client = MlflowClient(tracking_uri=source_registry_uri)
    source_version = source_client.get_model_version(model_name, version)
    source_run = source_client.get_run(source_version.run_id)

    # Download the model artifact
    local_path = mlflow.artifacts.download_artifacts(
        artifact_uri=f"models:/{model_name}/{version}",
        tracking_uri=source_registry_uri,
    )

    # Register in target
    target_client = MlflowClient(tracking_uri=target_registry_uri)
    mlflow.set_tracking_uri(target_registry_uri)

    with mlflow.start_run(
        run_name=f"promoted-from-dev-{version}",
        tags={
            "source_registry": source_registry_uri,
            "source_version": version,
            "source_run_id": source_version.run_id,
        }
    ) as run:
        # Log all metrics from source run for traceability
        for key, value in source_run.data.metrics.items():
            mlflow.log_metric(key, value)
        for key, value in source_run.data.params.items():
            mlflow.log_param(key, value)

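        # Re-upload the downloaded model directory unchanged (MLmodel file, weights,
        # environment specs) rather than re-wrapping a loaded PyFuncModel, which is
        # not a valid python_model argument
        mlflow.log_artifacts(local_path, artifact_path="model")
        model_uri = f"runs:/{run.info.run_id}/model"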

    new_version = mlflow.register_model(model_uri, model_name)

    # Copy all tags from source version
    for tag_key, tag_value in source_version.tags.items():
        target_client.set_model_version_tag(model_name, new_version.version, tag_key, tag_value)

    # Set provenance tags
    target_client.set_model_version_tag(
        model_name, new_version.version,
        "promoted_from_registry", source_registry_uri
    )
    target_client.set_model_version_tag(
        model_name, new_version.version,
        "promoted_from_version", version
    )

    # Transition to target stage
    target_client.transition_model_version_stage(
        name=model_name,
        version=new_version.version,
        stage=target_stage,
    )

    return new_version.version

The Audit Trail

Every promotion decision must be auditable. When a regulator, a post-mortem investigator, or a new team member asks "why was version 47 promoted to production on March 3rd?", you should be able to answer with:

What automated checks ran and what they returned
What the metric values were at the time of promotion
Who approved the human review gate
What version of the evaluation dataset was used
What shadow mode results were collected before promotion
What the production model's performance was at promotion time (so you can compare)

MLflow tags on model versions provide a lightweight audit trail. For regulated industries, you should additionally write an immutable audit record to a separate store (e.g., an append-only database table or an object storage bucket with object lock enabled).

def write_promotion_audit_record(
    promotion_result: PromotionResult,
    human_reviewer: str,
    human_approval_timestamp: str,
    eval_dataset_version: str,
    audit_store,  # e.g., a PostgreSQL connection or S3 client
):
    """
    Write an immutable audit record for this promotion decision.
    This is separate from MLflow tags - it is the official audit trail.
    """
    record = {
        "event": "MODEL_PROMOTION",
        "model_name": promotion_result.model_name,
        "model_version": promotion_result.model_version,
        "promoted_to": promotion_result.promoted_to,
        "automated_gates": promotion_result.checks,
        "human_reviewer": human_reviewer,
        "human_approval_timestamp": human_approval_timestamp,
        "eval_dataset_version": eval_dataset_version,
        "pipeline_result_timestamp": promotion_result.timestamp,
        "approved": promotion_result.approved,
    }
    audit_store.write_immutable(record)
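The lesson leaves `audit_store.write_immutable` abstract. Where a managed append-only store is not available, one technique that can be swapped in is hash chaining: each entry stores the hash of the previous one, so later edits become detectable. This is a minimal in-memory sketch of that idea with hypothetical function names, not a replacement for object lock or database-level immutability, and it provides tamper evidence rather than tamper prevention.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first entry


def append_audit_record(chain: list, record: dict) -> dict:
    """Append `record` to `chain`, linking it to the previous entry by SHA-256."""
    prev_hash = chain[-1]["entry_hash"] if chain else GENESIS_HASH
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    entry_hash = hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()
    entry = {"prev_hash": prev_hash, "payload": payload, "entry_hash": entry_hash}
    chain.append(entry)
    return entry


def verify_chain(chain: list) -> bool:
    """Recompute every link; an edited, reordered, or mid-chain deleted entry fails.

    Truncation at the tail is not detected by this check alone.
    """
    prev_hash = GENESIS_HASH
    for entry in chain:
        expected = hashlib.sha256(
            (prev_hash + entry["payload"]).encode("utf-8")
        ).hexdigest()
        if entry["prev_hash"] != prev_hash or entry["entry_hash"] != expected:
            return False
        prev_hash = entry["entry_hash"]
    return True
```

In practice the latest `entry_hash` would also be published somewhere the writers cannot modify (for example, posted to the review channel), which closes the tail-truncation gap noted in the docstring.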

Production Engineering Notes

Version pinning in serving infrastructure: Your model server should always load a specific model version by number, never by stage alias. Stages change; version numbers do not. If you load models:/credit-risk-model/Production, the URI is re-resolved every time the model is loaded, so a restart, reload, or newly scaled-up replica after a promotion will silently serve a different model, possibly while older replicas still serve the previous one. Load models:/credit-risk-model/47 instead, and update the version number explicitly as part of the deployment job.

Gradual traffic shifting: Even after formal promotion to Production, don't immediately send 100% of traffic to the new model. Use a traffic splitting mechanism (canary deployment) to route 5% → 25% → 100% over a period of hours, with automated rollback triggers on each step.
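The stepwise rollout with rollback triggers can be sketched as a small control loop. Everything here is an assumption for illustration: `set_traffic_percent` stands in for whatever your load balancer or service mesh exposes, and `is_healthy` is assumed to block for the soak period at each step and then report whether error rate and latency stayed within bounds.

```python
from dataclasses import dataclass, field
from typing import Callable, Sequence


@dataclass
class CanaryResult:
    completed: bool
    final_percent: int
    history: list = field(default_factory=list)  # every traffic setting applied, in order


def run_canary(
    steps: Sequence[int],
    set_traffic_percent: Callable[[int], None],  # hypothetical traffic-control hook
    is_healthy: Callable[[int], bool],           # hypothetical health check per step
) -> CanaryResult:
    history = []
    for percent in steps:
        set_traffic_percent(percent)
        history.append(percent)
        if not is_healthy(percent):
            # Rollback trigger: send all traffic back to the previous model.
            set_traffic_percent(0)
            history.append(0)
            return CanaryResult(completed=False, final_percent=0, history=history)
    final = history[-1] if history else 0
    return CanaryResult(completed=True, final_percent=final, history=history)
```

Calling `run_canary([5, 25, 100], ...)` mirrors the 5% → 25% → 100% schedule described above.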

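Model version aliases: MLflow supports named aliases (@champion, @shadow) as an alternative to stages, and stages have been deprecated since MLflow 2.9. Aliases are more flexible than the fixed stage names and are preferred for new registries. An alias is still a mutable pointer, so the version-pinning rule above applies to it as well: resolve the alias to a version number at deploy time rather than loading by alias in serving code.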
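Because an alias, like a stage, can be re-pointed at any time, a deployment job can resolve it once and deploy the resulting pinned URI. The sketch below assumes `client` is an `mlflow.MlflowClient` (or any object with the same `get_model_version_by_alias(name, alias)` method); the helper name `resolve_alias_to_pinned_uri` is hypothetical.

```python
def resolve_alias_to_pinned_uri(client, model_name: str, alias: str) -> str:
    """Look up the version an alias currently points to and return a version-pinned URI.

    `client` is assumed to behave like mlflow.MlflowClient, whose
    get_model_version_by_alias returns an object with a `.version` attribute.
    """
    model_version = client.get_model_version_by_alias(model_name, alias)
    return f"models:/{model_name}/{model_version.version}"
```

The deployment job would record the returned URI in its release manifest, so the serving fleet keeps loading the same version even if `@champion` moves afterwards.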

Parallel production versions: In some architectures, you may run multiple production model versions simultaneously to serve different user segments (e.g., different versions for EU vs. US users due to regulatory requirements). Your registry and serving infrastructure must support this explicitly.
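Supporting parallel versions explicitly can be as simple as an explicit segment-to-version table that the serving layer consults. This is a minimal sketch; the segment names, the version numbers, and the helper `model_uri_for_segment` are made up for illustration.

```python
# Hypothetical mapping from user segment to the production version serving it.
SEGMENT_VERSIONS = {"EU": 45, "US": 47}


def model_uri_for_segment(model_name: str, segment: str, segment_versions: dict) -> str:
    """Return the pinned model URI for a segment, failing loudly for unknown segments."""
    if segment not in segment_versions:
        # No silent default: an unmapped segment may have regulatory implications.
        raise KeyError(f"no production model version configured for segment {segment!r}")
    return f"models:/{model_name}/{segment_versions[segment]}"
```

Raising on an unknown segment, rather than falling back to a default version, is a deliberate choice here: serving the wrong regional model is worse than a visible error.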

Common Mistakes

:::danger Never Promote Directly to Production from None

Skipping the Staging stage entirely - or treating it as optional - eliminates the buffer between automated checks and live traffic. If your automated check has a bug, you have no safety net. The Staging stage exists precisely to catch issues that automated checks miss, through shadow mode observation and integration testing.

:::

:::danger Do Not Use Stage Aliases in Serving Code

Loading a model with mlflow.pyfunc.load_model("models:/my-model/Production") in serving code means that every time someone transitions a new model to Production, your serving infrastructure immediately starts using it - possibly without a deployment restart, cache invalidation, or health check. Always load by version number in production serving.

:::

:::warning Automated Gate Thresholds Must Be Reviewed Regularly

Setting min_auc = 0.82 once and forgetting it is dangerous. As your production model improves over time, the old threshold may allow regressions that would have been caught by a relative threshold. Review and update gate thresholds every time a major model improvement is released.

:::
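To make the difference concrete, here is a minimal sketch of a gate that combines the absolute floor with a check relative to the current production model. The function name `passes_accuracy_gate` and the `min_relative_gain` parameter are assumptions for illustration; `min_auc = 0.82` is taken from the warning above.

```python
from typing import Optional


def passes_accuracy_gate(
    candidate_auc: float,
    production_auc: Optional[float],
    min_auc: float = 0.82,
    min_relative_gain: float = 0.0,  # 0.0 means "at least as good as production"
) -> bool:
    """Pass only if the candidate clears the floor AND does not regress against production."""
    if candidate_auc < min_auc:
        return False
    if production_auc is None:
        # No incumbent yet: only the absolute floor applies.
        return True
    return candidate_auc >= production_auc * (1 + min_relative_gain)
```

With a production model at 0.88 AUC, a candidate at 0.83 clears the stale absolute floor but is rejected by the relative check, which is exactly the regression the warning describes.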

:::warning Human Approval Is Not a Substitute for Automated Checks

The reverse failure mode: teams that rely entirely on human review to catch issues, skipping automated gates because "a human will catch it." Humans are slow, inconsistent, and subject to cognitive biases. They rubber-stamp under pressure. Automated checks are your first line of defense. Human review catches what automation cannot - context, timing, business logic.

:::

Interview Q&A

Q: How do you safely promote a model to production?

A: Safe model promotion requires a combination of automated gates and human review. First, the model must pass automated checks: accuracy thresholds relative to the current production model, latency benchmarks, fairness evaluations across demographic groups, data drift checks comparing training distribution to current production distribution, and schema compatibility checks with the production feature pipeline. If all gates pass, the model transitions to Staging - not directly to Production. In Staging, it runs in shadow mode: receiving real production traffic but not returning predictions to users, so we can observe its behavior on live data without risk. After shadow mode analysis, a human reviewer approves the final promotion to Production with a structured checklist. All of this is recorded in an audit trail: what checks ran, what values they returned, who approved, and when. The final deployment to Production uses canary traffic shifting - 5%, 25%, 100% - with automated rollback triggers at each step.

Q: What checks gate a model promotion?

A: The canonical five gates are: (1) accuracy threshold - absolute minimum performance AND relative improvement over the current production model; (2) latency budget - p95 serving latency must be within the agreed SLA; (3) fairness check - disaggregated evaluation across protected demographic groups with a maximum allowed disparity; (4) data drift - PSI or KL divergence between the training feature distribution and the current production distribution, to catch models trained on stale data; (5) schema compatibility - the model's expected input schema must match what the production feature pipeline currently produces. Depending on the domain, additional checks might include: dataset card verification, regulatory approval for the model card, adversarial robustness evaluation, or custom business rule checks.
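As an illustration of the data drift gate, PSI for a single numeric feature can be computed from binned proportions as the sum over bins of (actual - expected) * ln(actual / expected). The sketch below is a plain-Python version under stated assumptions: the bin edges are supplied by the caller, and empty bins are floored at a small `eps` to avoid taking the log of zero. The function name and parameters are hypothetical.

```python
import math
from typing import Sequence


def population_stability_index(
    expected: Sequence[float],   # feature values from the training set
    actual: Sequence[float],     # feature values from current production traffic
    bin_edges: Sequence[float],  # interior bin boundaries chosen by the caller
    eps: float = 1e-6,           # floor for empty bins (assumption, avoids log(0))
) -> float:
    if not expected or not actual:
        raise ValueError("both samples must be non-empty")

    def proportions(values: Sequence[float]) -> list:
        counts = [0] * (len(bin_edges) + 1)
        for v in values:
            # Bin index = number of edges at or below the value.
            counts[sum(1 for edge in bin_edges if v >= edge)] += 1
        return [max(c / len(values), eps) for c in counts]

    e = proportions(expected)
    a = proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

Identical distributions yield a PSI of zero, and the value grows as the production distribution moves away from the training distribution; the pass/fail threshold is a policy choice for your gate.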

Q: What is shadow mode and when do you use it?

A: Shadow mode is a deployment pattern where a new model receives real production traffic but does not return its predictions to users. The production model handles the actual response. The shadow model logs its predictions for later analysis. You use shadow mode before promoting a model to Production to observe how it behaves on real traffic without any user impact. Shadow mode is particularly valuable for catching distribution shifts that validation datasets don't capture, for identifying latency issues under real load patterns, and for quantifying prediction divergence between the new model and the incumbent before exposing any real users to the new model.
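The pattern can be sketched in a few lines. In this hypothetical wrapper, `production_predict`, `shadow_predict`, and `log_shadow` are caller-supplied callables (assumptions, not a real library API). The shadow call runs inline for simplicity; a real service would typically run it asynchronously so it cannot add latency to the user-facing response.

```python
import logging
from typing import Any, Callable

logger = logging.getLogger("shadow")


def predict_with_shadow(
    features: Any,
    production_predict: Callable[[Any], Any],
    shadow_predict: Callable[[Any], Any],
    log_shadow: Callable[[dict], None],
) -> Any:
    """Serve the production prediction; record the shadow prediction for later analysis."""
    response = production_predict(features)
    try:
        shadow_output = shadow_predict(features)
        log_shadow({"features": features, "production": response, "shadow": shadow_output})
    except Exception:
        # A failing shadow model must never affect what the user receives.
        logger.exception("shadow model failed; user response unaffected")
    return response
```

The logged pairs are what later feed the prediction-divergence and latency analysis before promotion.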

Q: How do you build an audit trail for model promotions?

A: An audit trail for model promotions should capture, for each promotion event: the model version number, the stage it was promoted from and to, the timestamp, the identity of the human reviewer who approved it, the results of all automated gates (what checks ran, what values they returned, pass/fail), the version of the evaluation dataset used, and the current production model metrics at the time of promotion (for comparison). In MLflow, model version tags provide a lightweight audit trail. For regulated industries, you should additionally write an immutable audit record to a separate store - an append-only PostgreSQL table, an S3 bucket with object lock, or a dedicated audit logging service. The key property is immutability: once written, the audit record must not be modifiable.

Q: What is the champion-challenger pattern and how does it differ from shadow mode?

A: In shadow mode, the challenger model receives real traffic and runs inference, but its predictions are logged - never returned to users. The challenger has zero impact on the user experience. In champion-challenger, the challenger model serves real predictions to a subset of real users (typically 5–10%). Both models are in production simultaneously; outcomes are tracked per-model to determine which produces better business results. Champion-challenger is a live A/B test with model versions as the treatment condition. Shadow mode is pure observation. You use shadow mode first (lower risk, no user exposure) to verify basic correctness; then champion-challenger (higher risk, but real outcome signals) to validate business impact. Champion-challenger requires statistical rigor: you need sufficient sample size and duration to detect meaningful differences with appropriate confidence levels.
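For a binary outcome (for example, approved loans that later default), one common way to apply that statistical rigor is a two-proportion z-test on the per-model outcome rates. The sketch below uses only the standard library; the function name is hypothetical, and the choice of test is an assumption, since other tests or sequential methods may suit your setup better.

```python
import math


def two_proportion_z_test(
    successes_a: int, n_a: int,  # champion: outcome count and sample size
    successes_b: int, n_b: int,  # challenger: outcome count and sample size
) -> tuple:
    """Return (z, two-sided p-value) for the difference in rates, challenger minus champion."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("standard error is zero; rates are degenerate")
    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

The test only answers whether the observed difference is distinguishable from noise; the required sample size and run duration still need to be fixed before the experiment starts.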

Q: How do you prevent rubber-stamping in human review gates?

A: Rubber-stamping - where reviewers approve without genuinely reviewing - is one of the most common failures in ML governance. Preventive measures include: requiring reviewers to fill out a structured questionnaire with specific questions about the model's known limitations, disaggregated performance, and deployment risks; logging the time between the review request being sent and the approval being submitted (flag approvals under 60 seconds for audit); rotating reviewers so no single person becomes a bottleneck who approves under time pressure; requiring a second reviewer for high-stakes models (credit, healthcare, legal decisions); and making rejection easy and low-stigma - the culture should treat a rejection as a normal part of the process, not a personal criticism of the data scientist.
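The timing check is straightforward to automate. This is a minimal sketch, assuming the review system records when the request was sent and when the approval was submitted; the function name is hypothetical and the 60-second default comes from the answer above.

```python
from datetime import datetime, timedelta


def is_suspiciously_fast(
    requested_at: datetime,
    approved_at: datetime,
    min_review: timedelta = timedelta(seconds=60),
) -> bool:
    """Flag approvals submitted sooner than a plausible minimum review time."""
    elapsed = approved_at - requested_at
    if elapsed < timedelta(0):
        raise ValueError("approval timestamp precedes the review request")
    return elapsed < min_review
```

A flagged approval would not be blocked automatically; it would be queued for audit, consistent with the measures listed above.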

Summary

Model staging and promotion is the enforcement mechanism for everything else in ML governance. You can have the best model cards, the best experiment tracking, the best training infrastructure - but if models can be promoted to production informally, through a copy-paste operation, all of that careful work can be bypassed in an afternoon.

The key architectural decisions are: define explicit stages with clear meanings; gate every stage transition with both automated checks and human review; run shadow mode before exposing any users to a new model; record every promotion decision in an immutable audit trail; and never load models by stage alias in production serving infrastructure - always by version number.

These are not complicated ideas. They are operational discipline applied consistently.
