Model Staging and Promotion
The Incident That Should Never Happen - But Does, Constantly
It was a Tuesday afternoon. A data scientist named Priya had just finished training a credit risk model that outperformed the incumbent by 4.2% on the held-out test set. Excited, she exported the model artifact to a shared S3 bucket, updated the path in a config file, and sent a Slack message to the engineering team: "New model is live. Let me know if you see anything weird."
Nobody reviewed the training data distribution. Nobody checked whether the model's performance held across demographic groups. Nobody ran it through the bias evaluation suite that had been added to the repo three months earlier but never integrated into any promotion workflow. Nobody asked Priya what version of the feature pipeline she had trained against - because there was no system that required her to answer that question. The deployment was a copy-paste operation dressed up as a release.
Seven days later, customer complaints started trickling in. A segment of users - specifically small business owners applying for lines of credit - were being systematically declined at a rate three times higher than the previous quarter. The issue wasn't a bug in the traditional sense. The model had learned a spurious correlation in the training data that only appeared in production under a specific combination of feature values. A proper fairness audit would have caught it. The bias check script would have caught it. But those tools existed on someone's laptop, not in the promotion pipeline.
The rollback took six hours because nobody knew exactly which model version had been running. The S3 path had been overwritten. The config file had no version history. The audit trail was a Slack thread that had since been buried under three hundred other messages. The team spent the next two weeks reconstructing what had happened, producing a post-mortem that essentially said: "we had no process."
The irony is that the fix was not technically complex. A structured promotion pipeline - one that requires automated checks to pass and a human reviewer to approve before any model enters production - would have caught this before it caused harm. The gap was not capability. It was process. And process, in MLOps, is what the model registry and its promotion workflow exist to enforce.
Why This Exists
Before model registries existed, model promotion was an informal process. You trained a model, you evaluated it in a notebook, and you pushed it somewhere. The "somewhere" varied by team: an S3 bucket, a shared filesystem, an ad-hoc model serving endpoint, a Docker image tagged with latest. None of these mechanisms had opinions about what had to be true before a model could move from experimentation to production.
The core problem was that there was no concept of a stage. Software engineering had solved this decades earlier with environments: you don't push directly to production; you push to dev, then staging, then production, with gates between each stage. But ML teams were reinventing the wheel in a worse way every time, because model promotion has additional complexity that software promotion does not. You're not just deploying code - you're deploying a statistical artifact that has opinions baked into it from training data. That artifact needs to be evaluated for accuracy, fairness, latency, and compatibility with the production feature pipeline before it can be trusted.
Model staging and promotion pipelines exist to formalize this process. They make the implicit explicit: here are the stages a model can be in, here are the criteria that must be met to move between stages, here is who has authority to approve the move, and here is the audit trail that proves everything was done correctly. This is not bureaucracy for its own sake. It is the difference between having a responsible ML system and having a time bomb.
Historical Context
The concept of model staging was largely absent from early ML tooling. The first generation of ML deployment tools - circa 2015–2018 - focused almost entirely on training and inference, with promotion treated as a deployment script problem. Teams using TensorFlow Serving or Seldon Core would write custom shell scripts to swap model versions.
MLflow, released by Databricks in 2018, introduced the first widely adopted formalized model registry with explicit lifecycle stages. The original MLflow Model Registry (announced in October 2019 and shipped in MLflow 1.4) defined four stages: None, Staging, Production, and Archived. This was a deliberate analog to software deployment environments and gave teams a shared vocabulary for model lifecycle management.
The concept of automated promotion gates emerged from the DevOps world. Continuous delivery pipelines had long used automated quality gates - test coverage thresholds, performance benchmarks, security scans - as preconditions for promotion. ML teams began applying the same pattern around 2020–2021, driven partly by high-profile failures at companies that had shipped models with inadequate evaluation. The EU AI Act (proposed 2021, enacted 2024) and similar regulatory frameworks then began turning documented, repeatable evaluation from a best practice into a legal obligation for high-risk model categories.
Shadow mode deployment - running a new model in parallel with the production model without serving its predictions to users - was pioneered in recommendation systems at companies like Netflix and Spotify, where the cost of a bad recommendation is low but the signal from real traffic is extremely valuable. The pattern was later generalized and integrated into ML deployment platforms.
The Promotion Lifecycle
Every model version in a well-run registry exists in exactly one stage at any given time. The canonical stages map directly to questions about readiness:
| Stage | Meaning | Who interacts |
|---|---|---|
| None | Freshly registered, not evaluated | Data scientist |
| Staging | Under evaluation, passing automated checks | ML engineer |
| Production | Serving live traffic | Automated + on-call |
| Archived | Superseded, preserved for audit | Automated |
The transitions between stages are not automatic promotions - they are gated transitions. Each transition requires a set of criteria to be satisfied before it can proceed.
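One way to picture gated transitions is as a small state machine in which every allowed move names the gate that guards it. The sketch below is a minimal illustration, not MLflow's own behavior: the `ALLOWED_TRANSITIONS` table, the gate names `automated_gates` and `human_approval`, and the `can_transition` helper are all hypothetical.

```python
from enum import Enum


class Stage(str, Enum):
    NONE = "None"
    STAGING = "Staging"
    PRODUCTION = "Production"
    ARCHIVED = "Archived"


# Hypothetical policy: which moves are allowed, and which gate guards each one.
# A value of None means the move needs no gate (e.g., archiving).
ALLOWED_TRANSITIONS = {
    (Stage.NONE, Stage.STAGING): "automated_gates",
    (Stage.STAGING, Stage.PRODUCTION): "human_approval",
    (Stage.STAGING, Stage.ARCHIVED): None,     # rejected candidate
    (Stage.PRODUCTION, Stage.ARCHIVED): None,  # superseded version
}


def can_transition(current: Stage, target: Stage, satisfied_gates: set[str]) -> tuple[bool, str]:
    """Return (allowed, reason) for moving a model version from current to target."""
    key = (current, target)
    if key not in ALLOWED_TRANSITIONS:
        return False, f"{current.value} -> {target.value} is not an allowed transition"
    required = ALLOWED_TRANSITIONS[key]
    if required is not None and required not in satisfied_gates:
        return False, f"{current.value} -> {target.value} requires gate '{required}'"
    return True, "ok"
```

Because None -> Production is simply absent from the table, a direct jump to Production is rejected no matter which gates have passed.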
Automated Promotion Criteria
The power of a formal promotion pipeline is that it can enforce a checklist automatically. Instead of relying on a data scientist to remember to run the bias evaluation, the pipeline runs it for them and refuses to promote the model if it fails.
A production-grade automated gate typically checks five categories:
1. Accuracy Thresholds
The new model must meet a minimum absolute performance threshold AND outperform the current production model by at least a configurable margin.
def check_accuracy(new_metrics: dict, prod_metrics: dict, config: dict) -> tuple[bool, str]:
    """
    Returns (passed, reason) for the accuracy gate.

    Checks:
    - Absolute threshold: new model AUC must be >= min_auc
    - Relative improvement: new model must beat prod by at least min_delta
    """
    new_auc = new_metrics["auc_roc"]
    prod_auc = prod_metrics["auc_roc"]
    min_auc = config["min_absolute_auc"]  # e.g., 0.82
    # Despite the config name, this is an absolute AUC difference (e.g., 0.005 AUC points)
    min_delta = config["min_relative_improvement"]

    if new_auc < min_auc:
        return False, f"AUC {new_auc:.4f} below minimum threshold {min_auc}"

    improvement = new_auc - prod_auc
    if improvement < min_delta:
        return False, f"Improvement {improvement:.4f} below required delta {min_delta}"

    return True, f"AUC {new_auc:.4f} (+{improvement:.4f} vs production)"
2. Latency Budget
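The model must serve predictions within the agreed SLA. Ideally this is measured under load using a representative request distribution; the function below benchmarks sequential single-row requests, which gives a baseline rather than a true load test.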
import statistics
import time

import mlflow.pyfunc
import pandas as pd


def check_latency(model_uri: str, sample_inputs: pd.DataFrame, config: dict) -> tuple[bool, str]:
    """
    Load the model and benchmark p50/p95/p99 latency.
    Fail if p95 exceeds the latency budget.
    """
    model = mlflow.pyfunc.load_model(model_uri)
    latencies = []

    # warm up
    for _ in range(10):
        model.predict(sample_inputs[:1])

    # benchmark one row at a time; iterating a DataFrame directly
    # would yield column names rather than rows
    for i in range(len(sample_inputs)):
        row = sample_inputs.iloc[i : i + 1]
        start = time.perf_counter()
        model.predict(row)
        latencies.append((time.perf_counter() - start) * 1000)  # ms

    p50 = statistics.median(latencies)
    p95 = statistics.quantiles(latencies, n=20)[18]  # 95th percentile
    p99 = statistics.quantiles(latencies, n=100)[98]  # 99th percentile
    budget_ms = config["latency_budget_p95_ms"]  # e.g., 50ms

    if p95 > budget_ms:
        return False, f"p95 latency {p95:.1f}ms exceeds budget {budget_ms}ms"

    return True, f"Latency OK - p50={p50:.1f}ms, p95={p95:.1f}ms, p99={p99:.1f}ms"
3. Fairness Metrics
The model's performance must not degrade unacceptably across protected demographic groups.
import mlflow.pyfunc
import numpy as np
import pandas as pd


def check_fairness(model_uri: str, eval_df: pd.DataFrame, config: dict) -> tuple[bool, str]:
    """
    Evaluate the model's performance disaggregated by demographic groups.
    Fail if performance gap between any subgroup and the overall metric
    exceeds the allowed disparity threshold.
    """
    model = mlflow.pyfunc.load_model(model_uri)
    # Assumes the model returns one score per row; np.asarray makes the
    # boolean-mask indexing below work regardless of the returned container
    predictions = np.asarray(
        model.predict(eval_df.drop(columns=["label", "demographic_group"]))
    )
    # compute_auc(labels, scores) is a helper assumed to be defined elsewhere
    overall_auc = compute_auc(eval_df["label"], predictions)
    max_disparity = config["max_demographic_disparity"]  # e.g., 0.05

    failures = []
    for group in eval_df["demographic_group"].unique():
        mask = (eval_df["demographic_group"] == group).to_numpy()
        group_auc = compute_auc(eval_df.loc[mask, "label"], predictions[mask])
        gap = abs(overall_auc - group_auc)
        if gap > max_disparity:
            failures.append(f"  Group '{group}': AUC={group_auc:.4f}, gap={gap:.4f}")

    if failures:
        return False, "Fairness check FAILED:\n" + "\n".join(failures)

    return True, f"Fairness OK - overall AUC={overall_auc:.4f}, max group gap within threshold"
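The fairness gate calls a `compute_auc` helper that this section never defines. In practice many teams would use scikit-learn's `roc_auc_score`; the dependency-free sketch below is one possible stand-in, assuming binary labels (0/1) and one numeric score per row. It uses the rank-sum (Mann-Whitney U) formulation, with tied scores sharing their average rank.

```python
from typing import Sequence


def compute_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC via the rank-sum formulation; tied scores share their average rank."""
    labels = list(labels)
    scores = list(scores)
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC is undefined when only one class is present")

    # Assign 1-based ranks in ascending score order, averaging over ties.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

The explicit `ValueError` matters for disaggregated evaluation: a small demographic group may contain only one class, and the gate should surface that rather than report a misleading number.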
4. Data Drift Check
Ensure the model was trained on data that is still representative of the production distribution. A model trained on stale data can be technically accurate on the validation set but dead wrong in production.
import numpy as np


def check_data_drift(
    training_stats: dict,
    current_production_stats: dict,
    config: dict,
) -> tuple[bool, str]:
    """
    Compare feature distributions between training time and current production.
    Uses PSI (Population Stability Index) for numerical features.

    PSI < 0.1: no significant change
    PSI 0.1–0.25: moderate change, investigate
    PSI > 0.25: significant change, do not promote
    """
    if not training_stats:
        # Without this guard, an empty stats dict would pass the gate vacuously
        return False, "No training feature statistics recorded - cannot assess drift"

    max_psi = config["max_psi_threshold"]  # e.g., 0.20
    violations = []

    for feature, train_dist in training_stats.items():
        prod_dist = current_production_stats.get(feature)
        if prod_dist is None:
            violations.append(f"  Feature '{feature}' missing from production stats")
            continue
        psi = compute_psi(train_dist["buckets"], train_dist["counts"], prod_dist["counts"])
        if psi > max_psi:
            violations.append(f"  Feature '{feature}': PSI={psi:.3f} (limit={max_psi})")

    if violations:
        return False, "Data drift check FAILED:\n" + "\n".join(violations)

    return True, "Data drift within acceptable bounds"


def compute_psi(buckets, expected_counts, actual_counts) -> float:
    """Population Stability Index calculation (buckets is kept for reference only)."""
    expected = np.array(expected_counts) / sum(expected_counts)
    actual = np.array(actual_counts) / sum(actual_counts)
    # Avoid division by zero and log(0) for empty buckets
    expected = np.where(expected == 0, 1e-6, expected)
    actual = np.where(actual == 0, 1e-6, actual)
    return float(np.sum((actual - expected) * np.log(actual / expected)))
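To make the PSI bands concrete, here is a small worked example in plain Python. The bucket counts are invented for illustration, and `psi` is a dependency-free restatement of the same formula, so treat it as a sketch rather than the pipeline's own helper.

```python
import math


def psi(expected_counts, actual_counts, eps=1e-6):
    """PSI = sum over buckets of (actual - expected) * ln(actual / expected)."""
    e_total = sum(expected_counts)
    a_total = sum(actual_counts)
    total = 0.0
    for e, a in zip(expected_counts, actual_counts):
        e_frac = max(e / e_total, eps)  # floor empty buckets to avoid log(0)
        a_frac = max(a / a_total, eps)
        total += (a_frac - e_frac) * math.log(a_frac / e_frac)
    return total


# Hypothetical bucket counts for one feature, split into quartiles at training time
training = [250, 250, 250, 250]
stable = [240, 260, 255, 245]    # production looks like training
shifted = [100, 150, 250, 500]   # mass has moved into the top bucket

psi_stable = psi(training, stable)    # roughly 0.001: no significant change
psi_shifted = psi(training, shifted)  # roughly 0.36: significant change, do not promote
```

With the example threshold of 0.20, the first production sample would pass the gate and the second would block promotion.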
5. Schema Compatibility
The model's expected input schema must be compatible with what the production feature pipeline currently produces.
import mlflow


def check_schema_compatibility(model_uri: str, feature_pipeline_schema: dict) -> tuple[bool, str]:
    """
    Load the model's input schema from MLflow and verify it matches
    the current production feature pipeline output schema.
    """
    model_info = mlflow.models.get_model_info(model_uri)
    model_schema = model_info.signature.inputs.to_dict() if model_info.signature else None

    if model_schema is None:
        return False, "Model has no registered input signature - cannot verify compatibility"

    mismatches = []
    for feature in model_schema:
        name = feature["name"]
        expected_type = feature["type"]
        if name not in feature_pipeline_schema:
            mismatches.append(f"  Feature '{name}' not found in production pipeline")
        elif feature_pipeline_schema[name] != expected_type:
            actual_type = feature_pipeline_schema[name]
            mismatches.append(
                f"  Feature '{name}': model expects {expected_type}, pipeline produces {actual_type}"
            )

    if mismatches:
        return False, "Schema incompatibility:\n" + "\n".join(mismatches)

    return True, "Schema compatible with production feature pipeline"
The Full Automated Promotion Pipeline
Here is a complete promotion pipeline that orchestrates all five checks and only transitions the model if every gate passes:
import datetime
import json
from dataclasses import asdict, dataclass
from typing import Optional

import mlflow
from mlflow.tracking import MlflowClient


@dataclass
class PromotionConfig:
    min_absolute_auc: float = 0.82
    min_relative_improvement: float = 0.005
    latency_budget_p95_ms: float = 50.0
    max_demographic_disparity: float = 0.05
    max_psi_threshold: float = 0.20


@dataclass
class PromotionResult:
    approved: bool
    model_name: str
    model_version: str
    checks: dict
    promoted_to: Optional[str]
    reason: Optional[str]
    timestamp: str


def run_promotion_pipeline(
    model_name: str,
    candidate_version: str,
    eval_dataset_path: str,
    feature_pipeline_schema: dict,
    config: PromotionConfig,
    mlflow_tracking_uri: str,
) -> PromotionResult:
    """
    Full automated promotion pipeline.
    Runs 5 gates. If all pass, transitions model to Staging.
    Records all results as tags on the model version for audit trail.
    """
    mlflow.set_tracking_uri(mlflow_tracking_uri)
    client = MlflowClient()
    model_uri = f"models:/{model_name}/{candidate_version}"
    candidate_run_id = client.get_model_version(model_name, candidate_version).run_id

    # Load evaluation data (load_eval_dataset is a helper assumed to be defined elsewhere)
    eval_df = load_eval_dataset(eval_dataset_path)
    sample_inputs = eval_df.drop(columns=["label", "demographic_group"]).head(500)

    # Fetch current production model metrics for comparison
    prod_version = get_production_version(client, model_name)
    if prod_version:
        prod_run = client.get_run(prod_version.run_id)
        prod_metrics = prod_run.data.metrics
    else:
        # No production model yet - use minimum thresholds only
        prod_metrics = {"auc_roc": 0.0}

    # Load candidate metrics from training run
    candidate_run = client.get_run(candidate_run_id)
    candidate_metrics = candidate_run.data.metrics

    # Load training feature statistics for drift check (reuses the run fetched above)
    training_stats = json.loads(candidate_run.data.params.get("feature_stats", "{}"))
    # load_current_production_feature_stats is a helper assumed to be defined elsewhere
    current_prod_stats = load_current_production_feature_stats()

    # --- Run all 5 gates ---
    gate_config = asdict(config)
    checks = {}
    checks["accuracy"] = check_accuracy(candidate_metrics, prod_metrics, gate_config)
    checks["latency"] = check_latency(model_uri, sample_inputs, gate_config)
    checks["fairness"] = check_fairness(model_uri, eval_df, gate_config)
    checks["data_drift"] = check_data_drift(training_stats, current_prod_stats, gate_config)
    checks["schema"] = check_schema_compatibility(model_uri, feature_pipeline_schema)

    all_passed = all(passed for passed, _ in checks.values())
    # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated in recent Python versions)
    timestamp = datetime.datetime.now(datetime.timezone.utc).isoformat()

    # Record all check results as model version tags (permanent audit trail)
    for check_name, (passed, reason) in checks.items():
        client.set_model_version_tag(
            model_name, candidate_version,
            f"gate.{check_name}.passed", str(passed)
        )
        client.set_model_version_tag(
            model_name, candidate_version,
            f"gate.{check_name}.reason", reason[:500]  # MLflow tag value limit
        )
    client.set_model_version_tag(model_name, candidate_version, "promotion.timestamp", timestamp)
    client.set_model_version_tag(model_name, candidate_version, "promotion.pipeline_version", "v2.1")

    check_summary = {k: {"passed": v[0], "reason": v[1]} for k, v in checks.items()}

    if all_passed:
        # Transition to Staging - not yet Production.
        # Note: stage transitions are deprecated in MLflow 2.9+ in favor of aliases.
        client.transition_model_version_stage(
            name=model_name,
            version=candidate_version,
            stage="Staging",
            archive_existing_versions=False,  # Don't auto-archive - let human review decide
        )
        client.set_model_version_tag(
            model_name, candidate_version,
            "promotion.auto_to_staging", "true"
        )
        return PromotionResult(
            approved=True,
            model_name=model_name,
            model_version=candidate_version,
            checks=check_summary,
            promoted_to="Staging",
            reason="All automated gates passed. Awaiting human review for Production promotion.",
            timestamp=timestamp,
        )

    failed_checks = [name for name, (passed, _) in checks.items() if not passed]
    return PromotionResult(
        approved=False,
        model_name=model_name,
        model_version=candidate_version,
        checks=check_summary,
        promoted_to=None,
        reason=f"Promotion blocked. Failed gates: {', '.join(failed_checks)}",
        timestamp=timestamp,
    )


def get_production_version(client: MlflowClient, model_name: str):
    """Return the current Production model version, or None if no production model exists."""
    versions = client.get_latest_versions(model_name, stages=["Production"])
    return versions[0] if versions else None
Human-in-the-Loop Gates
Automated gates catch objective failures. But some promotion decisions require human judgment - especially the final promotion from Staging to Production. A human reviewer should ask:
- Does the improvement justify the risk? A 0.3% AUC improvement may not be worth the deployment risk.
- Are there any known data issues? The automated checks may not have access to all context.
- Is the timing appropriate? Don't promote during a holiday weekend when on-call coverage is thin.
- Has the model been reviewed for unexpected behavior? Running a few manual predictions with edge cases.
The challenge with human-in-the-loop gates is rubber-stamping - reviewers who approve without actually reviewing. This is one of the most common failure modes in ML governance.
:::warning Preventing Rubber-Stamping
To prevent rubber-stamping, require reviewers to answer a short structured questionnaire before they can approve. Each answer should be logged. If a reviewer approves within 30 seconds of opening the review request, flag it for audit. Make the reviewer explicitly acknowledge the model's known limitations. The goal is friction - the right amount to ensure real review without blocking progress.
:::
def request_human_approval(
    model_name: str,
    model_version: str,
    reviewer_slack_channel: str,
    promotion_result: PromotionResult,
):
    """
    Send a structured approval request to Slack.
    The reviewer must respond with a specific approval token
    (not just a thumbs up) to confirm they reviewed the checklist.
    """
    if not promotion_result.approved:
        # The message below states that all gates passed, so refuse to send it otherwise
        raise ValueError("Cannot request Production approval: automated gates did not pass")

    summary_lines = []
    for check, result in promotion_result.checks.items():
        status = "PASS" if result["passed"] else "FAIL"
        summary_lines.append(f"  {status} {check}: {result['reason'][:80]}")
    # Join with newlines so each gate appears on its own line
    gate_summary = "\n".join(summary_lines)

    message = f"""
*Model Promotion Request - Human Review Required*
Model: `{model_name}` version `{model_version}`
Pipeline: All automated gates PASSED
Next stage: *Production*
*Gate Summary:*
{gate_summary}
*Before approving, confirm:*
1. You have reviewed the model card
2. You have checked the disaggregated eval results
3. You are aware of any known edge case failures
4. On-call coverage is in place for the deployment window
To approve: `/approve-model {model_name} {model_version} <your-name>`
To reject: `/reject-model {model_name} {model_version} <reason>`
"""
    # send_slack_message is a helper assumed to be defined elsewhere
    send_slack_message(reviewer_slack_channel, message)
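The rubber-stamping safeguards described in the warning above (a structured questionnaire, logged answers, and a flag on approvals made within 30 seconds) could be enforced on the receiving side roughly as follows. This is a sketch under stated assumptions: `ApprovalSubmission`, the question keys in `REQUIRED_ANSWERS`, and `validate_approval` are hypothetical names, not part of Slack or MLflow.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical question keys mirroring the confirmation checklist
REQUIRED_ANSWERS = (
    "model_card_reviewed",
    "disaggregated_results_checked",
    "known_limitations_acknowledged",
    "on_call_confirmed",
)
MIN_REVIEW_TIME = timedelta(seconds=30)


@dataclass
class ApprovalSubmission:
    reviewer: str
    opened_at: datetime
    submitted_at: datetime
    answers: dict  # question key -> free-text answer


def validate_approval(sub: ApprovalSubmission) -> tuple[bool, list[str]]:
    """Return (accepted, flags). Missing answers block the approval;
    a suspiciously fast approval is accepted but flagged for audit."""
    missing = [q for q in REQUIRED_ANSWERS if not sub.answers.get(q, "").strip()]
    if missing:
        return False, [f"missing answer: {q}" for q in missing]

    flags = []
    if sub.submitted_at - sub.opened_at < MIN_REVIEW_TIME:
        flags.append("approved in under 30 seconds: flag for audit")
    return True, flags
```

Blocking on missing answers while only flagging fast approvals keeps the friction proportionate: an incomplete review cannot proceed, and a quick one is recorded for later scrutiny without stalling the release.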
Shadow Mode as a Pre-Promotion Stage
Before promoting a model to full production, routing a small percentage of live traffic through it - without serving its predictions to users - is one of the most powerful validation techniques available.
Shadow mode works as follows: the production model serves the actual response. The shadow model receives the same request, runs inference, and logs its prediction to a shadow log store. Engineers then compare the shadow model's predictions to the production model's predictions (and eventually to the ground truth labels when they arrive) to understand how the models differ.
import datetime
import random


class ShadowServingRouter:
    """
    Routes a percentage of requests to a shadow model.
    The shadow model's predictions are logged but NOT returned to the client.
    """

    def __init__(
        self,
        production_model,
        shadow_model,
        shadow_traffic_fraction: float = 0.01,
        shadow_log_store=None,
    ):
        self.production_model = production_model
        self.shadow_model = shadow_model
        self.shadow_fraction = shadow_traffic_fraction
        self.shadow_log = shadow_log_store
        self._rng = random.Random()

    def predict(self, features: dict, request_id: str) -> dict:
        # Always run production model - this is what the user gets
        production_result = self.production_model.predict(features)

        # Probabilistically run shadow model. This sketch runs it inline, which adds
        # latency to sampled requests; real serving would typically do this asynchronously.
        if self._rng.random() < self.shadow_fraction:
            try:
                shadow_result = self.shadow_model.predict(features)
                if self.shadow_log:
                    self.shadow_log.write({
                        "request_id": request_id,
                        "production_prediction": production_result,
                        "shadow_prediction": shadow_result,
                        # hash_features is a helper assumed to be defined elsewhere
                        "features_hash": hash_features(features),
                        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
                    })
            except Exception as e:
                # Shadow model failures MUST NOT affect production
                # (log_shadow_error is a helper assumed to be defined elsewhere)
                log_shadow_error(request_id, str(e))

        return production_result
:::tip Shadow Mode Duration
Run shadow mode for at least 24–48 hours to capture daily traffic patterns. For models whose traffic varies by day of week (e.g., recommendation engines, fraud detection), shadow mode should cover at least one full weekly cycle to catch weekend behavior differences.
:::
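Once shadow records have accumulated, the comparison step can start as simply as measuring how often the two models would have made a different decision. The sketch below assumes each logged prediction is a single score between 0 and 1 and that decisions are made at a fixed threshold; `summarize_shadow_log`, the 0.5 threshold, and the 2% tolerance are illustrative choices, not values from any platform.

```python
def summarize_shadow_log(records, threshold=0.5, max_disagreement=0.02):
    """Compare production and shadow scores from shadow-log records.

    records: iterable of dicts with 'production_prediction' and
    'shadow_prediction' keys, each holding a scalar score (an assumption).
    """
    n = 0
    disagreements = 0
    abs_diff_sum = 0.0
    for rec in records:
        prod = float(rec["production_prediction"])
        shadow = float(rec["shadow_prediction"])
        n += 1
        abs_diff_sum += abs(prod - shadow)
        # A disagreement is a case where the two models land on opposite sides of the threshold
        if (prod >= threshold) != (shadow >= threshold):
            disagreements += 1
    if n == 0:
        raise ValueError("no shadow records to compare")
    rate = disagreements / n
    return {
        "n": n,
        "decision_disagreement_rate": rate,
        "mean_abs_score_diff": abs_diff_sum / n,
        "within_tolerance": rate <= max_disagreement,
    }
```

A high disagreement rate is not automatically bad, since the new model is supposed to differ, but it tells reviewers which requests to inspect before ground-truth labels arrive.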
Champion-Challenger During Promotion
Shadow mode tells you how the models differ in predictions. Champion-challenger tells you which model produces better outcomes. In a champion-challenger setup, the challenger (new model) receives a small percentage of live traffic (e.g., 5–10%) and its outcomes are tracked against the champion (current production model).
The distinction: in shadow mode the new model sees real traffic, but its predictions are never served to users. In champion-challenger, the challenger's predictions are served to a limited share of real users.
import datetime
import hashlib


class ChampionChallengerRouter:
    """
    Routes a configurable percentage of traffic to the challenger model.
    Both models return real predictions to real users.
    Outcomes are tracked per model for statistical comparison.
    """

    def __init__(
        self,
        champion_model,
        challenger_model,
        challenger_fraction: float = 0.05,  # 5% to challenger
        outcome_tracker=None,
    ):
        self.champion = champion_model
        self.challenger = challenger_model
        self.challenger_fraction = challenger_fraction
        self.outcome_tracker = outcome_tracker

    def predict(self, features: dict, user_id: str) -> tuple[dict, str]:
        # Deterministic routing by user_id for consistency
        # (same user always gets same model during the experiment)
        user_hash = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
        use_challenger = (user_hash % 100) < (self.challenger_fraction * 100)

        model = self.challenger if use_challenger else self.champion
        model_label = "challenger" if use_challenger else "champion"
        result = model.predict(features)

        if self.outcome_tracker:
            self.outcome_tracker.record_exposure(
                user_id=user_id,
                model_label=model_label,
                prediction=result,
                timestamp=datetime.datetime.now(datetime.timezone.utc).isoformat(),
            )

        return result, model_label
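Tracking outcomes per model only helps if the comparison accounts for sample size, which matters a great deal when the challenger sees 5% of traffic. One common choice for binary outcomes (for example, "loan repaid" or "recommendation clicked") is a two-proportion z-test; the sketch below is a minimal version using only the standard library, and the function name is an assumption rather than part of any tracking tool.

```python
import math


def two_proportion_z_test(success_a: int, n_a: int, success_b: int, n_b: int) -> tuple[float, float]:
    """Two-sided z-test for a difference in success rates between arm A and arm B.

    Returns (z, p_value); z > 0 means arm B has the higher observed rate.
    """
    if n_a == 0 or n_b == 0:
        raise ValueError("both arms need at least one observation")
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # no variation at all: nothing to distinguish
    z = (p_b - p_a) / se
    # Two-sided tail probability of the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

Calling it with the champion as arm A and the challenger as arm B gives a quick read on whether an observed lift is distinguishable from noise. With a small challenger arm, expect to need a longer experiment before the p-value becomes informative.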
Promotion Webhooks and CI/CD Integration
When a model transitions between stages, that event should trigger downstream actions. Registry webhooks are the usual mechanism. They have long been available in Databricks-managed MLflow, whose event names are used here; the classic open-source `MlflowClient` has no `create_registry_webhook` method, so treat that call as a placeholder for whatever webhook API your registry exposes.
from mlflow.tracking import MlflowClient

client = MlflowClient()

# NOTE: create_registry_webhook is a placeholder, not a method of the
# open-source MlflowClient; substitute your registry's webhook API.

# Register a webhook that fires when any model transitions to Production
# This triggers a deployment job in your CI/CD system
client.create_registry_webhook(
    events=["MODEL_VERSION_TRANSITIONED_TO_PRODUCTION"],
    http_url_spec={
        "url": "https://gitlab.example.com/api/v4/projects/42/trigger/pipeline",
        "authorization": "Bearer <trigger-token>",
        "enable_ssl_verification": True,
    },
    description="Trigger GitLab deployment pipeline when model reaches Production",
)

# Register a webhook for Slack notification on any stage change
client.create_registry_webhook(
    events=[
        "MODEL_VERSION_TRANSITIONED_STAGE",
        "MODEL_VERSION_TAG_SET",
    ],
    http_url_spec={
        "url": "https://hooks.slack.com/services/T00/B00/XXXX",
        "enable_ssl_verification": True,
    },
    description="Slack notification on all registry events",
)
The deployment job triggered by the webhook handles the actual infrastructure changes: updating the Kubernetes deployment to point to the new model version, running a smoke test, and confirming health checks pass.
Multi-Environment Promotion
Large organizations often run separate registries for each environment rather than using stages within a single registry. This provides stronger isolation and clearer RBAC.
Copying a model version between registries while preserving provenance:
import mlflow
from mlflow.tracking import MlflowClient


def copy_model_version_to_registry(
    source_registry_uri: str,
    target_registry_uri: str,
    model_name: str,
    version: str,
    target_stage: str = "Staging",
):
    """
    Copy a model version from one registry to another.
    Preserves tags, run metadata, and lineage information.
    """
    # Load from source
    source_client = MlflowClient(tracking_uri=source_registry_uri)
    source_version = source_client.get_model_version(model_name, version)
    source_run = source_client.get_run(source_version.run_id)

    # Download the model artifact
    local_path = mlflow.artifacts.download_artifacts(
        artifact_uri=f"models:/{model_name}/{version}",
        tracking_uri=source_registry_uri,
    )

    # Register in target
    target_client = MlflowClient(tracking_uri=target_registry_uri)
    mlflow.set_tracking_uri(target_registry_uri)
    with mlflow.start_run(
        run_name=f"promoted-from-dev-{version}",
        tags={
            "source_registry": source_registry_uri,
            "source_version": version,
            "source_run_id": source_version.run_id,
        },
    ) as run:
        # Log all metrics and params from source run for traceability
        for key, value in source_run.data.metrics.items():
            mlflow.log_metric(key, value)
        for key, value in source_run.data.params.items():
            mlflow.log_param(key, value)

        # Re-log the downloaded model directory unchanged, then register it.
        # (Wrapping a loaded pyfunc model in log_model would not reproduce
        # the original artifact, and there is no top-level mlflow.log_model.)
        mlflow.log_artifacts(local_path, artifact_path="model")
        new_version = mlflow.register_model(f"runs:/{run.info.run_id}/model", model_name)

    # Copy all tags from source version
    for tag_key, tag_value in source_version.tags.items():
        target_client.set_model_version_tag(model_name, new_version.version, tag_key, tag_value)

    # Set provenance tags
    target_client.set_model_version_tag(
        model_name, new_version.version,
        "promoted_from_registry", source_registry_uri
    )
    target_client.set_model_version_tag(
        model_name, new_version.version,
        "promoted_from_version", version
    )

    # Transition to target stage
    target_client.transition_model_version_stage(
        name=model_name,
        version=new_version.version,
        stage=target_stage,
    )
    return new_version.version
The Audit Trail
Every promotion decision must be auditable. When a regulator, a post-mortem investigator, or a new team member asks "why was version 47 promoted to production on March 3rd?", you should be able to answer with:
- What automated checks ran and what they returned
- What the metric values were at the time of promotion
- Who approved the human review gate
- What version of the evaluation dataset was used
- What shadow mode results were collected before promotion
- What the production model's performance was at promotion time (so you can compare)
MLflow tags on model versions provide a lightweight audit trail. For regulated industries, you should additionally write an immutable audit record to a separate store (e.g., an append-only database table or an object storage bucket with object lock enabled).
def write_promotion_audit_record(
    promotion_result: PromotionResult,
    human_reviewer: str,
    human_approval_timestamp: str,
    eval_dataset_version: str,
    audit_store,  # e.g., a PostgreSQL connection or S3 client
):
    """
    Write an immutable audit record for this promotion decision.
    This is separate from MLflow tags - it is the official audit trail.
    """
    record = {
        "event": "MODEL_PROMOTION",
        "model_name": promotion_result.model_name,
        "model_version": promotion_result.model_version,
        "promoted_to": promotion_result.promoted_to,
        "automated_gates": promotion_result.checks,
        "human_reviewer": human_reviewer,
        "human_approval_timestamp": human_approval_timestamp,
        "eval_dataset_version": eval_dataset_version,
        "pipeline_result_timestamp": promotion_result.timestamp,
        "approved": promotion_result.approved,
    }
    audit_store.write_immutable(record)
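The `audit_store.write_immutable` call above leaves the storage mechanism open. One way to make an append-only log tamper-evident, sketched here purely as an illustration, is to chain each entry to the hash of the previous one. `HashChainedAuditLog` is a hypothetical in-memory class; it complements storage-level immutability such as object lock and does not replace it.

```python
import hashlib
import json


class HashChainedAuditLog:
    """In-memory append-only log; each entry commits to the previous entry's hash."""

    GENESIS = "0" * 64

    def __init__(self):
        self._entries = []

    def append(self, record: dict) -> str:
        """Serialize the record, chain it to the previous entry, and return its hash."""
        prev_hash = self._entries[-1]["hash"] if self._entries else self.GENESIS
        payload = json.dumps(record, sort_keys=True)
        entry_hash = hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()
        self._entries.append({"prev_hash": prev_hash, "payload": payload, "hash": entry_hash})
        return entry_hash

    def verify(self) -> bool:
        """Recompute the chain; any edited or reordered entry breaks it."""
        prev_hash = self.GENESIS
        for entry in self._entries:
            expected = hashlib.sha256((prev_hash + entry["payload"]).encode("utf-8")).hexdigest()
            if entry["prev_hash"] != prev_hash or entry["hash"] != expected:
                return False
            prev_hash = entry["hash"]
        return True
```

Because every hash depends on all earlier entries, silently rewriting the record for one promotion invalidates every record written after it, which is what an auditor needs to be able to check.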
Production Engineering Notes
Version pinning in serving infrastructure: Your model server should always load a specific model version by number, never by stage alias. Stages change; version numbers do not. If you load models:/credit-risk-model/Production, the model server will silently serve a different model the next time a promotion happens. Load models:/credit-risk-model/47 instead, and update the version number explicitly as part of the deployment job.
Gradual traffic shifting: Even after formal promotion to Production, don't immediately send 100% of traffic to the new model. Use a traffic splitting mechanism (canary deployment) to route 5% → 25% → 100% over a period of hours, with automated rollback triggers on each step.
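Model version aliases (MLflow 2.9+): Newer MLflow releases support named aliases (@champion, @shadow) as an alternative to stages, and the fixed stages are deprecated as of MLflow 2.9. Aliases are more flexible than the fixed stage names and are preferred for new registries. Like a stage, an alias is a mutable pointer, so the version-pinning advice above applies to aliases as well.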
Parallel production versions: In some architectures, you may run multiple production model versions simultaneously to serve different user segments (e.g., different versions for EU vs. US users due to regulatory requirements). Your registry and serving infrastructure must support this explicitly.
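The gradual traffic shifting note can be sketched as a small control loop. Everything here is illustrative: `set_traffic_fraction` and `is_healthy` are hypothetical callbacks standing in for your traffic router and monitoring checks, and a real rollout would wait for a soak period between steps before evaluating health.

```python
from typing import Callable

CANARY_STEPS = (0.05, 0.25, 1.0)  # 5% -> 25% -> 100%


def run_canary_rollout(
    set_traffic_fraction: Callable[[float], None],
    is_healthy: Callable[[float], bool],
    steps=CANARY_STEPS,
) -> tuple[bool, float]:
    """Walk through the traffic steps; roll back to 0 on the first unhealthy step.

    Returns (completed, final_fraction).
    """
    for fraction in steps:
        set_traffic_fraction(fraction)
        if not is_healthy(fraction):
            set_traffic_fraction(0.0)  # automated rollback trigger
            return False, 0.0
    return True, steps[-1]
```

Keeping the rollback inside the loop means no step can be left half-applied: either the new version reaches the final fraction, or it is taken back to zero traffic.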
Common Mistakes
:::danger Never Promote Directly to Production from None
Skipping the Staging stage entirely - or treating it as optional - eliminates the buffer between automated checks and live traffic. If your automated check has a bug, you have no safety net. The Staging stage exists precisely to catch issues that automated checks miss, through shadow mode observation and integration testing.
:::
:::danger Do Not Use Stage Aliases in Serving Code
Loading a model with mlflow.pyfunc.load_model("models:/my-model/Production") in serving code means that every time someone transitions a new model to Production, your serving infrastructure immediately starts using it - possibly without a deployment restart, cache invalidation, or health check. Always load by version number in production serving.
:::
:::warning Automated Gate Thresholds Must Be Reviewed Regularly
Setting min_auc = 0.82 once and forgetting it is dangerous. As your production model improves over time, the old threshold may allow regressions that would have been caught by a relative threshold. Review and update gate thresholds every time a major model improvement is released.
:::
:::warning Human Approval Is Not a Substitute for Automated Checks
The reverse failure mode: teams that rely entirely on human review to catch issues, skipping automated gates because "a human will catch it." Humans are slow, inconsistent, and subject to cognitive biases. They rubber-stamp under pressure. Automated checks are your first line of defense. Human review catches what automation cannot - context, timing, business logic.
:::
Interview Q&A
Q: How do you safely promote a model to production?
A: Safe model promotion requires a combination of automated gates and human review. First, the model must pass automated checks: accuracy thresholds relative to the current production model, latency benchmarks, fairness evaluations across demographic groups, data drift checks comparing training distribution to current production distribution, and schema compatibility checks with the production feature pipeline. If all gates pass, the model transitions to Staging - not directly to Production. In Staging, it runs in shadow mode: receiving real production traffic but not returning predictions to users, so we can observe its behavior on live data without risk. After shadow mode analysis, a human reviewer approves the final promotion to Production with a structured checklist. All of this is recorded in an audit trail: what checks ran, what values they returned, who approved, and when. The final deployment to Production uses canary traffic shifting - 5%, 25%, 100% - with automated rollback triggers at each step.
Q: What checks gate a model promotion?
A: The canonical five gates are: (1) accuracy threshold - absolute minimum performance AND relative improvement over the current production model; (2) latency budget - p95 serving latency must be within the agreed SLA; (3) fairness check - disaggregated evaluation across protected demographic groups with a maximum allowed disparity; (4) data drift - PSI or KL divergence between the training feature distribution and the current production distribution, to catch models trained on stale data; (5) schema compatibility - the model's expected input schema must match what the production feature pipeline currently produces. Depending on the domain, additional checks might include: dataset card verification, regulatory approval for the model card, adversarial robustness evaluation, or custom business rule checks.
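As an illustration of the data drift gate, here is a minimal sketch of the Population Stability Index computed over pre-binned counts for a single feature. The function name and the `eps` floor are assumptions. Gate thresholds are a team decision; 0.1 and 0.25 are commonly quoted rules of thumb, not fixed standards.

```python
import math
from typing import Sequence


def psi(expected_counts: Sequence[float], actual_counts: Sequence[float],
        eps: float = 1e-6) -> float:
    """PSI = sum over bins of (a - e) * ln(a / e), with a and e as bin proportions.

    expected_counts: per-bin counts from the training distribution.
    actual_counts:   per-bin counts from current production, same bin edges.
    eps:             floor on proportions so empty bins do not produce log(0).
    """
    if len(expected_counts) != len(actual_counts):
        raise ValueError("both histograms must use the same bins")
    e_total, a_total = sum(expected_counts), sum(actual_counts)
    if e_total <= 0 or a_total <= 0:
        raise ValueError("histograms must contain at least one observation")
    score = 0.0
    for e, a in zip(expected_counts, actual_counts):
        e_pct = max(e / e_total, eps)
        a_pct = max(a / a_total, eps)
        score += (a_pct - e_pct) * math.log(a_pct / e_pct)
    return score
```

Identical histograms give a PSI of zero, and the score grows as production mass moves into bins the training data rarely covered.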
Q: What is shadow mode and when do you use it?
A: Shadow mode is a deployment pattern where a new model receives real production traffic but does not return its predictions to users. The production model handles the actual response. The shadow model logs its predictions for later analysis. You use shadow mode before promoting a model to Production to observe how it behaves on real traffic without any user impact. Shadow mode is particularly valuable for catching distribution shifts that validation datasets don't capture, for identifying latency issues under real load patterns, and for quantifying prediction divergence between the new model and the incumbent before exposing any real users to the new model.
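The pattern can be sketched as a request handler that always returns the production model's prediction and only logs the shadow model's output. The names below are hypothetical, and a real service would typically run the shadow call asynchronously so it cannot add latency to the user-facing response.

```python
from typing import Any, Callable, Dict, List

Model = Callable[[Dict[str, Any]], Any]  # hypothetical: features in, prediction out


def serve_with_shadow(
    features: Dict[str, Any],
    production_model: Model,
    shadow_model: Model,
    shadow_log: List[Dict[str, Any]],
) -> Any:
    """Return the production prediction; record the shadow prediction only."""
    response = production_model(features)
    entry: Dict[str, Any] = {"features": features, "production": response}
    try:
        entry["shadow"] = shadow_model(features)
    except Exception as exc:  # a failing shadow model must never affect users
        entry["shadow_error"] = repr(exc)
    shadow_log.append(entry)
    return response
```

The log of paired predictions is what later lets you quantify divergence between the new model and the incumbent.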
Q: How do you build an audit trail for model promotions?
A: An audit trail for model promotions should capture, for each promotion event: the model version number, the stage it was promoted from and to, the timestamp, the identity of the human reviewer who approved it, the results of all automated gates (what checks ran, what values they returned, pass/fail), the version of the evaluation dataset used, and the current production model metrics at the time of promotion (for comparison). In MLflow, model version tags provide a lightweight audit trail. For regulated industries, you should additionally write an immutable audit record to a separate store - an append-only PostgreSQL table, an S3 bucket with object lock, or a dedicated audit logging service. The key property is immutability: once written, the audit record must not be modifiable.
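A sketch of what one such record might look like, with a hash chain that makes later edits detectable. The field names, `GENESIS_HASH`, and the chaining scheme are assumptions for illustration. Chaining gives tamper evidence only, so it complements an append-only store or object lock rather than replacing it.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Dict, List

GENESIS_HASH = "0" * 64  # hypothetical sentinel for the first record in a chain


@dataclass(frozen=True)
class PromotionRecord:
    model_name: str
    model_version: int
    from_stage: str
    to_stage: str
    approved_by: str
    timestamp_utc: str
    gate_results: Dict[str, Dict[str, object]]  # check name -> value and pass/fail
    eval_dataset_version: str
    production_metrics: Dict[str, float]        # incumbent metrics at promotion time
    prev_hash: str                              # hash of the previous record


def record_hash(record: PromotionRecord) -> str:
    """SHA-256 over a canonical JSON serialization of the record."""
    payload = json.dumps(asdict(record), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def chain_is_intact(records: List[PromotionRecord]) -> bool:
    """True if every record's prev_hash matches the hash of its predecessor."""
    expected_prev = GENESIS_HASH
    for record in records:
        if record.prev_hash != expected_prev:
            return False
        expected_prev = record_hash(record)
    return True
```

Editing any earlier record changes its hash and breaks the link from the record after it. To cover the most recent record as well, its hash would need to be stored somewhere the writer cannot modify.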
Q: What is the champion-challenger pattern and how does it differ from shadow mode?
A: In shadow mode, the challenger model receives real traffic and runs inference, but its predictions are logged - never returned to users. The challenger has zero impact on the user experience. In champion-challenger, the challenger model serves real predictions to a subset of real users (typically 5–10%). Both models are in production simultaneously; outcomes are tracked per-model to determine which produces better business results. Champion-challenger is a live A/B test with model versions as the treatment condition. Shadow mode is pure observation. You use shadow mode first (lower risk, no user exposure) to verify basic correctness; then champion-challenger (higher risk, but real outcome signals) to validate business impact. Champion-challenger requires statistical rigor: you need sufficient sample size and duration to detect meaningful differences with appropriate confidence levels.
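For the champion-challenger split, one common approach is deterministic hash-based assignment, so a given user sees the same model on every request. The sketch below assumes that approach. `assign_arm`, the `salt` value, and the 5% default are illustrative choices, not a prescribed implementation.

```python
import hashlib


def assign_arm(user_id: str, challenger_pct: float = 0.05,
               salt: str = "experiment-1") -> str:
    """Deterministically map a user to 'champion' or 'challenger'.

    salt is a hypothetical per-experiment string; changing it reshuffles
    which users land in the challenger arm.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as a uniform number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "challenger" if bucket < challenger_pct else "champion"
```

Log the assigned arm alongside each outcome, since the statistical comparison depends on knowing which model served which user.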
Q: How do you prevent rubber-stamping in human review gates?
A: Rubber-stamping - where reviewers approve without genuinely reviewing - is one of the most common failures in ML governance. Preventive measures include: requiring reviewers to fill out a structured questionnaire with specific questions about the model's known limitations, disaggregated performance, and deployment risks; logging the time between the review request being sent and the approval being submitted (flag approvals under 60 seconds for audit); rotating reviewers so no single person becomes a bottleneck who approves under time pressure; requiring a second reviewer for high-stakes models (credit, healthcare, legal decisions); and making rejection easy and low-stigma - the culture should treat a rejection as a normal part of the process, not a personal criticism of the data scientist.
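The review-duration check is simple to automate. A minimal sketch follows, assuming both timestamps are recorded by the approval tool; the function name is hypothetical and the 60-second default mirrors the figure above.

```python
from datetime import datetime, timedelta


def is_suspiciously_fast(
    requested_at: datetime,
    approved_at: datetime,
    min_review: timedelta = timedelta(seconds=60),
) -> bool:
    """Flag approvals submitted sooner than min_review after the request."""
    if approved_at < requested_at:
        raise ValueError("approval cannot precede the review request")
    return approved_at - requested_at < min_review
```

A flag here is a prompt for audit, not proof of rubber-stamping: a reviewer may have examined the model before the formal request was sent.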
Summary
Model staging and promotion is the enforcement mechanism for everything else in ML governance. You can have the best model cards, the best experiment tracking, the best training infrastructure - but if models can be promoted to production informally, through a copy-paste operation, all of that careful work can be bypassed in an afternoon.
The key architectural decisions are: define explicit stages with clear meanings; gate every stage transition with both automated checks and human review; run shadow mode before exposing any users to a new model; record every promotion decision in an immutable audit trail; and never load models by stage alias in production serving infrastructure - always by version number.
These are not complicated ideas. They are operational discipline applied consistently.
