Model Rollback Strategies
2:17 AM. Revenue Down $180K Per Hour. You Have 8 Minutes.
The PagerDuty alert came through at 2:17 AM. Error rate: 0.3%. That was below the 1% alert threshold - which is why the error-rate alert hadn't fired. The alert that did fire was a revenue anomaly: the homepage recommendation model was returning predictions that caused users to bounce at a rate that translated to approximately $180,000 in lost revenue per hour.
The on-call engineer, half-asleep, pulled up the dashboard. The model had been deployed four hours ago as part of a routine release. The deployment had passed all smoke tests. The error rate in the serving logs was normal. But the predictions - she could see it in the session data - the model was recommending the same five products to essentially every user, regardless of their browsing history. The diversity metric had collapsed from 0.73 to 0.04 sometime around 10 PM.
Her options were: investigate (risky - every minute of investigation was $3,000 in lost revenue), or roll back (fast - but she needed to know exactly how). She opened the runbook. The runbook said "roll back the model to the previous version." The previous version was... what, exactly? The model registry showed three versions in the Production stage - someone had run a migration script two weeks ago that hadn't cleaned up properly. The serving infrastructure documentation pointed to a config file that no longer existed in the current repository structure.
At 2:24 AM - seven minutes in - she made a judgment call. She found the model artifact from a week prior in S3, updated the serving config manually, and restarted the model server. Revenue recovered at 2:31 AM. The rollback had taken fourteen minutes, during which the platform had lost approximately $42,000 in revenue.
The post-mortem identified the root cause (a feature pipeline change that had altered the embeddings the model depended on) and recommended three improvements: a single source of truth for the active model version, an automated rollback trigger that could have detected the diversity metric collapse within five minutes, and a practiced rollback procedure with a documented and tested runbook. None of these were technically complex. All of them should have existed before the incident.
Why Rollback Is Harder for ML Than for Software
When a software engineer deploys a bug in a web service, rollback is conceptually simple: revert the code to the previous version, redeploy, done. The state of the system is mostly in the code. The code is versioned. The rollback is a git operation.
ML models distribute their state across four distinct components, and a failure can originate in any of them:
| Component | What it contains | Rollback mechanism |
|---|---|---|
| Model artifact | Weights, parameters, learned representations | Registry version transition |
| Serving infrastructure | Kubernetes deployment, container image, serving config | kubectl rollout undo / Helm |
| Feature pipeline | Feature computation logic, transformations, encoders | DVC checkout / pipeline revert |
| Training data | The distribution the model learned from | Data version rollback + retrain |
A rollback that only addresses the model artifact may not fix the underlying problem if the failure was caused by a feature pipeline change that altered the inputs the model receives. A rollback that addresses the feature pipeline may not fix the problem if the model artifact was also changed in the same release. Diagnosing which component is the root cause - quickly, under pressure, at 2 AM - is the hard part.
This is why rollback strategy for ML systems cannot be an afterthought. It must be designed in advance, practiced, and automated wherever possible.
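One way to shorten the diagnosis is to record, at deploy time, which of the four components a release touched, and derive the rollback order from that record. The sketch below assumes a hypothetical `ReleaseManifest` written by the deployment pipeline; the class, its fields, and the fastest-first ordering of the two middle components are illustrative assumptions, not part of any tool named above.

```python
from dataclasses import dataclass

# Rollback targets ordered from fastest to slowest to execute (assumed ordering),
# mirroring the four components in the table above.
ROLLBACK_ORDER = [
    "model_artifact",
    "serving_infrastructure",
    "feature_pipeline",
    "training_data",
]


@dataclass(frozen=True)
class ReleaseManifest:
    """Hypothetical record of which components a release changed."""
    release_id: str
    changed_components: frozenset


def plan_rollback(manifest: ReleaseManifest) -> list:
    """Return the components to roll back, fastest first."""
    unknown = manifest.changed_components - set(ROLLBACK_ORDER)
    if unknown:
        raise ValueError(f"Unknown components in manifest: {sorted(unknown)}")
    return [c for c in ROLLBACK_ORDER if c in manifest.changed_components]


manifest = ReleaseManifest(
    release_id="release-example",
    changed_components=frozenset({"feature_pipeline", "model_artifact"}),
)
print(plan_rollback(manifest))
```

A release that changed both the model artifact and the feature pipeline, as in the opening incident, yields a plan that covers both rather than only the artifact.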
Historical Context
The concept of deployment rollback in software engineering predates DevOps - it was formalized in early continuous integration systems and became a first-class feature of deployment tools in the early 2010s (Capistrano, Heroku's slug-based deployments, then Kubernetes rolling updates). The principle was simple: keep the previous working version accessible and make reverting to it fast.
ML rollback inherited this principle but required significant adaptation. The first major ML platforms (TensorFlow Serving, circa 2016–2017) supported model version management primarily for A/B testing, not rollback - the assumption was that newer models were always better. MLflow's Model Registry (2019) introduced the lifecycle stage model, which implicitly supported rollback by allowing model versions to be transitioned back to Production from Archived.
The practice of automated rollback - where a monitoring system detects model degradation and triggers rollback without human intervention - emerged from the reliability engineering traditions at companies like Netflix and Uber, where the cost of a production incident was high enough to justify the engineering investment. Argo Rollouts (released 2019) brought progressive delivery with automated metric-based rollback to the Kubernetes ecosystem and became a standard tool for ML deployments.
The 2020–2022 period saw widespread adoption of canary deployments for ML models, driven partly by high-profile incidents at consumer internet companies where a model regression had gone undetected for hours because the impact was gradual rather than sudden. The canary pattern - routing a small percentage of traffic to the new model - allowed continuous comparison with the incumbent and made automatic rollback triggers practical.
The Four Rollback Targets in Detail
Target 1: Model Artifact Rollback
The fastest rollback to execute. Transition the previous model version back to Production in the registry, update the serving infrastructure to point to it, restart or hot-reload the model server.
import datetime
from typing import Optional

import mlflow
from mlflow.tracking import MlflowClient


def rollback_model_artifact(
    model_name: str,
    target_version: str,
    current_version: str,
    mlflow_tracking_uri: str,
    triggered_by: str,
    reason: str,
) -> dict:
    """
    Roll back the model registry to a previous version.
    Transitions target_version to Production and current_version to Archived.
    Records rollback event as tags on both versions for audit.

    Note: registry stages are deprecated in recent MLflow releases in favor of
    aliases; this code assumes a version where stage transitions are available.
    """
    mlflow.set_tracking_uri(mlflow_tracking_uri)
    client = MlflowClient()
    # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated in Python 3.12+)
    timestamp = datetime.datetime.now(datetime.timezone.utc).isoformat()

    # Transition previous version back to Production
    client.transition_model_version_stage(
        name=model_name,
        version=target_version,
        stage="Production",
        archive_existing_versions=False,  # We handle current_version explicitly
    )

    # Archive the problematic version
    client.transition_model_version_stage(
        name=model_name,
        version=current_version,
        stage="Archived",
        archive_existing_versions=False,
    )

    # Tag the rollback event for audit trail
    rollback_id = f"rollback-{timestamp}"
    for version, role in [(target_version, "rollback_target"), (current_version, "rolled_back_from")]:
        client.set_model_version_tag(model_name, version, "rollback.event_id", rollback_id)
        client.set_model_version_tag(model_name, version, "rollback.timestamp", timestamp)
        client.set_model_version_tag(model_name, version, "rollback.triggered_by", triggered_by)
        client.set_model_version_tag(model_name, version, "rollback.reason", reason[:500])
        client.set_model_version_tag(model_name, version, "rollback.role", role)

    return {
        "rollback_id": rollback_id,
        "model_name": model_name,
        "rolled_back_from": current_version,
        "rolled_back_to": target_version,
        "timestamp": timestamp,
        "triggered_by": triggered_by,
        "reason": reason,
    }

def find_rollback_candidate(
    model_name: str,
    client: MlflowClient,
    exclude_versions: Optional[list] = None,
) -> Optional[str]:
    """
    Find the most recent previously-healthy model version to roll back to.
    Priority: the newest version tagged 'rollback.healthy_baseline' = 'true',
    then the newest Archived version. Returns None if neither exists.
    """
    exclude = set(exclude_versions or [])

    # Check for explicitly tagged healthy baselines
    all_versions = client.search_model_versions(f"name='{model_name}'")
    healthy_baselines = [
        v for v in all_versions
        if v.tags.get("rollback.healthy_baseline") == "true"
        and v.version not in exclude
    ]
    if healthy_baselines:
        # Highest version number = most recent healthy baseline
        return max(healthy_baselines, key=lambda v: int(v.version)).version

    # Fallback: most recent Archived version
    archived = [
        v for v in all_versions
        if v.current_stage == "Archived"
        and v.version not in exclude
    ]
    if archived:
        return max(archived, key=lambda v: int(v.version)).version

    return None
Target 2: Infrastructure Rollback
When the serving infrastructure itself is the problem (bad container image, changed serving config, memory leak in the serving code), a Kubernetes rollout undo is the fastest fix.
# Kubernetes rollback - undoes the most recent deployment
kubectl rollout undo deployment/model-server --namespace=ml-serving
# Check rollback status
kubectl rollout status deployment/model-server --namespace=ml-serving
# Roll back to a specific revision (not just the most recent)
kubectl rollout history deployment/model-server --namespace=ml-serving
kubectl rollout undo deployment/model-server --to-revision=14 --namespace=ml-serving
import subprocess
from typing import Optional


def rollback_kubernetes_deployment(
    deployment_name: str,
    namespace: str,
    to_revision: Optional[int] = None,
) -> dict:
    """
    Execute a Kubernetes deployment rollback.
    Returns the result of the rollback operation.
    """
    cmd = ["kubectl", "rollout", "undo", f"deployment/{deployment_name}", f"--namespace={namespace}"]
    if to_revision is not None:
        cmd.append(f"--to-revision={to_revision}")

    result = subprocess.run(cmd, capture_output=True, text=True, timeout=60)
    if result.returncode != 0:
        raise RuntimeError(f"kubectl rollout undo failed: {result.stderr}")

    # Wait for rollback to complete
    status_cmd = [
        "kubectl", "rollout", "status", f"deployment/{deployment_name}",
        f"--namespace={namespace}", "--timeout=300s",
    ]
    status_result = subprocess.run(status_cmd, capture_output=True, text=True, timeout=330)

    return {
        "success": status_result.returncode == 0,
        "deployment": deployment_name,
        "namespace": namespace,
        "output": result.stdout,
        "status_output": status_result.stdout,
    }


def rollback_helm_release(release_name: str, namespace: str, revision: Optional[int] = None) -> dict:
    """
    Roll back a Helm release.
    If revision is not specified, rolls back to the previous release.
    """
    # Helm expects: helm rollback <RELEASE> [REVISION] [flags]
    cmd = ["helm", "rollback", release_name]
    if revision is not None:
        cmd.append(str(revision))  # revision goes after the release name
    cmd.extend(["--namespace", namespace, "--wait"])

    result = subprocess.run(cmd, capture_output=True, text=True, timeout=120)
    return {
        "success": result.returncode == 0,
        "release": release_name,
        "output": result.stdout,
        "error": result.stderr if result.returncode != 0 else None,
    }
Target 3: Feature Pipeline Rollback
If the root cause is a change in the feature pipeline - a transformation bug, a new feature that is computed incorrectly, a schema change that alters how an existing feature is calculated - you need to roll back the feature computation logic.
import subprocess


def rollback_feature_pipeline_with_dvc(
    repo_path: str,
    target_commit: str,
) -> dict:
    """
    Use DVC to check out the feature pipeline state at a specific git commit.
    This rolls back the feature computation logic AND the data artifacts
    (feature datasets, encoders, scalers) to the state at that commit.

    Assumes the DVC metafiles (.dvc files, dvc.lock) live under feature_pipeline/,
    since only that directory is restored from the target commit.
    """
    # First, restore the pipeline code (and DVC metafiles) from the target commit
    git_result = subprocess.run(
        ["git", "checkout", target_commit, "--", "feature_pipeline/"],
        cwd=repo_path, capture_output=True, text=True,
    )
    if git_result.returncode != 0:
        raise RuntimeError(f"git checkout failed: {git_result.stderr}")

    # Then sync the workspace data to match those metafiles. `dvc checkout`
    # restores from the local cache; run `dvc pull` first if the artifacts
    # for that commit are only in remote storage.
    dvc_result = subprocess.run(
        ["dvc", "checkout", "--force"],
        cwd=repo_path, capture_output=True, text=True,
    )
    if dvc_result.returncode != 0:
        raise RuntimeError(f"dvc checkout failed: {dvc_result.stderr}")

    return {
        "pipeline_rolled_back_to_commit": target_commit,
        "git_output": git_result.stdout,
        "dvc_output": dvc_result.stdout,
    }
Target 4: Training Data Rollback
Data rollback is the most expensive option - it requires retraining the model from a known-good dataset version. This is not a fast rollback path. You do not use this when you need to recover in minutes. You use it when the root cause analysis of an incident reveals that the training data itself was corrupted, mislabeled, or drawn from the wrong distribution.
import requests


def trigger_emergency_retrain(
    model_name: str,
    training_dataset_version: str,
    training_config_commit: str,
    ci_trigger_url: str,
    ci_trigger_token: str,
) -> dict:
    """
    Trigger an emergency retrain using a known-good dataset version and config commit.
    This is a long path - use only when the incident analysis confirms
    that no existing model version can be safely used.
    """
    # Payload shape assumes GitLab's create-pipeline endpoint (suggested by the
    # PRIVATE-TOKEN header), which takes variables as a list of key/value objects
    # and expects `ref` to be a branch or tag name. Adjust for other CI systems.
    variables = {
        "TRAINING_DATASET_VERSION": training_dataset_version,
        "EMERGENCY_RETRAIN": "true",
        "NOTIFY_SLACK": "true",
        "MODEL_NAME": model_name,
    }
    payload = {
        "ref": training_config_commit,
        "variables": [{"key": k, "value": v} for k, v in variables.items()],
    }

    response = requests.post(
        ci_trigger_url,
        headers={"PRIVATE-TOKEN": ci_trigger_token},
        json=payload,
        timeout=30,  # Do not hang the incident response on a slow CI API
    )
    if response.status_code not in (200, 201):
        raise RuntimeError(f"Failed to trigger retrain: {response.text}")

    pipeline_id = response.json()["id"]
    return {
        "triggered": True,
        "pipeline_id": pipeline_id,
        "dataset_version": training_dataset_version,
        "config_commit": training_config_commit,
        "estimated_duration_minutes": 45,  # Configure per your training time
    }
Rollback Strategies
Strategy 1: Instant Switch (Blue-Green Rollback)
In a blue-green deployment, two identical environments run simultaneously - one serving live traffic (blue) and one on standby (green). Rollback means switching the load balancer to send traffic to the green environment, which still runs the previous model version.
This is the fastest rollback strategy, because the switch is a single routing change (often sub-second at the load balancer) rather than a redeploy. It is also the most resource-intensive: you pay for two environments continuously.
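The core of a blue-green rollback is a single pointer flip. The sketch below models that flip in-process, assuming a hypothetical `BlueGreenRouter` class; in a real deployment the pointer would live in the load balancer or service mesh configuration rather than in application memory.

```python
import threading


class BlueGreenRouter:
    """Holds a pointer to the live environment; rollback flips the pointer."""

    def __init__(self, live: str, standby: str):
        self._live = live
        self._standby = standby
        self._lock = threading.Lock()

    @property
    def live(self) -> str:
        with self._lock:
            return self._live

    def switch(self) -> str:
        """Swap live and standby atomically and return the new live environment."""
        with self._lock:
            self._live, self._standby = self._standby, self._live
            return self._live


router = BlueGreenRouter(live="blue", standby="green")
router.switch()  # rollback: green, still running the previous model, takes traffic
print(router.live)
```

Because the standby environment is already running, nothing has to be loaded or restarted at switch time, which is where the speed comes from.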
Strategy 2: Gradual Rollback (Canary Reversal)
In a canary deployment, the new model serves a small percentage of traffic (e.g., 5%). If a problem is detected, you reduce the canary's traffic share to zero before the full rollout completes. This limits the blast radius of the bad model.
class CanaryTrafficController:
    """
    Manages gradual traffic shifting for canary deployments.
    Supports both forward progression (canary increase) and
    rollback (canary decrease to zero).
    """

    def __init__(
        self,
        canary_weight: float = 0.05,  # start at 5%
        step_size: float = 0.05,
        step_interval_minutes: int = 10,
        rollback_step_size: float = 1.0,  # immediate full rollback
    ):
        self.canary_weight = canary_weight
        self.step_size = step_size
        self.step_interval_minutes = step_interval_minutes
        self.rollback_step_size = rollback_step_size

    def advance(self) -> float:
        """Increase canary traffic by one step."""
        self.canary_weight = min(1.0, self.canary_weight + self.step_size)
        self._apply_weight()
        return self.canary_weight

    def rollback(self) -> float:
        """Roll back: set canary weight to zero immediately."""
        self.canary_weight = 0.0
        self._apply_weight()
        return self.canary_weight

    def _apply_weight(self):
        """Apply the current weight to the load balancer / service mesh."""
        # Round rather than truncate: repeated float additions can produce
        # values like 0.39999999999999997, and int() would turn that into 39,
        # leaving the two weights summing to 99 instead of 100.
        canary_pct = round(self.canary_weight * 100)
        stable_pct = 100 - canary_pct
        # Implementation depends on your infrastructure.
        # For Kubernetes + Istio (apply_istio_virtual_service_weights is a
        # placeholder you supply for your own mesh configuration):
        apply_istio_virtual_service_weights(
            stable_weight=stable_pct,
            canary_weight=canary_pct,
        )
        print(f"Traffic weights updated: stable={stable_pct}%, canary={canary_pct}%")
Strategy 3: Shadow Rollback
If a problem is detected in the shadow model (running in shadow mode, not serving real traffic), you simply stop routing shadow traffic. There is nothing to roll back for users - they were never exposed. This is the safest path and the reason shadow mode is worth the infrastructure overhead.
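A minimal sketch of why shadow rollback is trivial: the shadow model's output is compared and logged but never returned to the caller, so disabling it changes nothing for users. The `ShadowRunner` class and its fields are hypothetical, and a production version would usually run the shadow call asynchronously so it cannot add latency.

```python
import logging
from typing import Any, Callable

logger = logging.getLogger("shadow_mode")


class ShadowRunner:
    """Serves the primary model; runs the shadow model only for comparison."""

    def __init__(self, primary: Callable[[Any], Any], shadow: Callable[[Any], Any]):
        self.primary = primary
        self.shadow = shadow
        self.shadow_enabled = True
        self.disagreements = 0

    def predict(self, features: Any) -> Any:
        result = self.primary(features)
        if self.shadow_enabled:
            try:
                if self.shadow(features) != result:
                    self.disagreements += 1
            except Exception as exc:
                # A failing shadow model must never affect the user-facing response
                logger.warning("Shadow model failed: %s", exc)
        return result  # Always the primary model's answer

    def disable_shadow(self) -> None:
        """The entire 'rollback': stop sending traffic to the shadow model."""
        self.shadow_enabled = False


runner = ShadowRunner(primary=lambda x: x * 2, shadow=lambda x: x * 3)
first = runner.predict(2)
runner.disable_shadow()
second = runner.predict(5)
print(first, second, runner.disagreements)
```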
The Rollback Decision Framework
Not every anomaly warrants immediate rollback. A small accuracy regression that is within normal variance does not. A sudden collapse in a critical metric does. The decision must be fast - especially at 2 AM - which means it must be pre-decided, not deliberated under pressure.
The decision rule for immediate rollback versus investigate-first:
from dataclasses import dataclass
from enum import Enum


class RollbackDecision(Enum):
    ROLLBACK_IMMEDIATELY = "rollback_immediately"
    INVESTIGATE_FIRST = "investigate_first"
    MONITOR = "monitor"


@dataclass
class RollbackThresholds:
    """
    Configurable thresholds that determine the rollback decision.
    These should be set per-model during the deployment planning phase.
    """
    # Hard revenue/safety thresholds → immediate rollback
    max_revenue_loss_per_minute: float = 3000.0  # $3K/min
    max_error_rate: float = 0.05                 # 5% error rate
    max_latency_p99_ms: float = 2000.0           # 2 second p99

    # Soft model quality thresholds → investigate first
    min_primary_metric: float = 0.80                   # e.g., AUC-ROC
    max_prediction_diversity_drop: float = 0.20        # 20% drop in diversity metric
    max_metric_regression_from_baseline: float = 0.03  # 3% regression from champion (not enforced below)


def make_rollback_decision(
    current_metrics: dict,
    baseline_metrics: dict,
    thresholds: RollbackThresholds,
) -> tuple[RollbackDecision, str]:
    """
    Make a rollback decision based on current metrics vs. baseline.
    Returns the decision and a human-readable reason.
    """
    # Hard triggers - revenue or safety → act immediately
    if current_metrics.get("revenue_loss_per_minute", 0) > thresholds.max_revenue_loss_per_minute:
        return (
            RollbackDecision.ROLLBACK_IMMEDIATELY,
            f"Revenue loss ${current_metrics['revenue_loss_per_minute']:.0f}/min exceeds threshold ${thresholds.max_revenue_loss_per_minute:.0f}/min"
        )

    if current_metrics.get("error_rate", 0) > thresholds.max_error_rate:
        return (
            RollbackDecision.ROLLBACK_IMMEDIATELY,
            f"Error rate {current_metrics['error_rate']:.1%} exceeds threshold {thresholds.max_error_rate:.1%}"
        )

    if current_metrics.get("latency_p99_ms", 0) > thresholds.max_latency_p99_ms:
        return (
            RollbackDecision.ROLLBACK_IMMEDIATELY,
            f"p99 latency {current_metrics['latency_p99_ms']:.0f}ms exceeds threshold {thresholds.max_latency_p99_ms:.0f}ms"
        )

    # Soft triggers - model quality degradation → investigate
    if current_metrics.get("primary_metric", 1.0) < thresholds.min_primary_metric:
        return (
            RollbackDecision.INVESTIGATE_FIRST,
            f"Primary metric {current_metrics['primary_metric']:.4f} below minimum {thresholds.min_primary_metric}"
        )

    baseline_diversity = baseline_metrics.get("prediction_diversity", 1.0)
    current_diversity = current_metrics.get("prediction_diversity", 1.0)
    # Guard against a zero baseline, which would otherwise raise ZeroDivisionError
    if baseline_diversity > 0:
        diversity_drop = (baseline_diversity - current_diversity) / baseline_diversity
        if diversity_drop > thresholds.max_prediction_diversity_drop:
            return (
                RollbackDecision.INVESTIGATE_FIRST,
                f"Prediction diversity dropped {diversity_drop:.1%} from baseline (threshold: {thresholds.max_prediction_diversity_drop:.1%})"
            )

    return RollbackDecision.MONITOR, "All metrics within acceptable range"
The Automated Rollback Controller
An automated rollback controller is a continuously running process that monitors model serving metrics and executes rollback when thresholds are breached. With the settings used below (a 30-second check interval and three consecutive failures), detection takes roughly 90 seconds, compared with the 14 minutes the manual rollback took in the opening incident.
import logging
import threading
from dataclasses import dataclass
from typing import Callable, Optional

logger = logging.getLogger("rollback_controller")


@dataclass
class ControllerConfig:
    model_name: str
    current_version: str
    rollback_target_version: str
    thresholds: RollbackThresholds
    check_interval_seconds: int = 30
    consecutive_failures_required: int = 3  # Avoid false positives from transient spikes
    mlflow_tracking_uri: str = ""
    slack_webhook_url: str = ""


class AutomatedRollbackController:
    """
    Monitors model metrics and automatically triggers rollback if thresholds are breached.
    Requires consecutive threshold violations to avoid false positives from transient spikes.

    Usage:
        controller = AutomatedRollbackController(
            config=config,
            metrics_fetcher=lambda: fetch_from_prometheus(model_name),
            baseline_metrics={"primary_metric": 0.847, "prediction_diversity": 0.73},
            rollback_fn=lambda: execute_full_rollback(model_name, rollback_target_version),
        )
        controller.start()  # Runs in background thread
    """

    def __init__(
        self,
        config: ControllerConfig,
        metrics_fetcher: Callable[[], dict],
        baseline_metrics: dict,
        rollback_fn: Callable[[], dict],
        alert_fn: Optional[Callable[[str], None]] = None,
    ):
        self.config = config
        self.metrics_fetcher = metrics_fetcher
        self.baseline_metrics = baseline_metrics
        self.rollback_fn = rollback_fn
        self.alert_fn = alert_fn or self._default_alert
        self._stop_event = threading.Event()
        self._thread: Optional[threading.Thread] = None
        self._consecutive_failures = 0
        self._rollback_triggered = False

    def start(self):
        """Start the controller in a background thread."""
        self._stop_event.clear()
        self._thread = threading.Thread(target=self._run_loop, daemon=True)
        self._thread.start()
        logger.info(
            f"Rollback controller started for {self.config.model_name} v{self.config.current_version}. "
            f"Rollback target: v{self.config.rollback_target_version}. "
            f"Check interval: {self.config.check_interval_seconds}s."
        )

    def stop(self):
        """Stop the controller."""
        self._stop_event.set()
        if self._thread:
            self._thread.join(timeout=10)

    def _run_loop(self):
        while not self._stop_event.is_set() and not self._rollback_triggered:
            try:
                current_metrics = self.metrics_fetcher()
                decision, reason = make_rollback_decision(
                    current_metrics, self.baseline_metrics, self.config.thresholds
                )

                if decision == RollbackDecision.ROLLBACK_IMMEDIATELY:
                    self._consecutive_failures += 1
                    logger.warning(
                        f"Rollback trigger: {reason} "
                        f"(failure {self._consecutive_failures}/{self.config.consecutive_failures_required})"
                    )
                    if self._consecutive_failures >= self.config.consecutive_failures_required:
                        self._execute_rollback(reason, current_metrics)
                        return

                elif decision == RollbackDecision.INVESTIGATE_FIRST:
                    logger.warning(f"Soft threshold breached - alerting: {reason}")
                    self.alert_fn(f"SOFT ALERT: {self.config.model_name} - {reason}")
                    self._consecutive_failures = 0  # Reset hard failure counter

                else:
                    # All good - reset failure counter
                    self._consecutive_failures = 0

            except Exception:
                # Metric fetch errors are retried on the next tick. A failed rollback
                # has already alerted and set _rollback_triggered, which ends the loop.
                logger.exception("Controller error")

            # Event.wait returns as soon as stop() is called, so stop() does not
            # have to outlast a full check interval the way it would with time.sleep
            self._stop_event.wait(self.config.check_interval_seconds)

    def _execute_rollback(self, reason: str, triggering_metrics: dict):
        """Execute the rollback and send notifications."""
        self._rollback_triggered = True

        logger.critical(
            f"AUTOMATED ROLLBACK TRIGGERED for {self.config.model_name}. "
            f"Rolling back from v{self.config.current_version} to v{self.config.rollback_target_version}. "
            f"Reason: {reason}"
        )
        self.alert_fn(
            f"AUTOMATED ROLLBACK IN PROGRESS\n"
            f"Model: {self.config.model_name}\n"
            f"Rolling back: v{self.config.current_version} -> v{self.config.rollback_target_version}\n"
            f"Trigger: {reason}\n"
            f"Metrics at trigger: {triggering_metrics}"
        )

        try:
            rollback_result = self.rollback_fn()
            self.alert_fn(
                f"ROLLBACK COMPLETE\n"
                f"Model: {self.config.model_name} now serving v{self.config.rollback_target_version}\n"
                f"Result: {rollback_result}"
            )
            logger.info(f"Rollback complete: {rollback_result}")
        except Exception as e:
            self.alert_fn(
                f"ROLLBACK FAILED - MANUAL INTERVENTION REQUIRED\n"
                f"Model: {self.config.model_name}\n"
                f"Error: {e}"
            )
            logger.critical(f"Rollback FAILED: {e}")
            raise

    def _default_alert(self, message: str):
        logger.info(f"ALERT: {message}")
# --- Example usage ---
import mlflow
import requests
from mlflow.tracking import MlflowClient


def setup_rollback_controller_for_deployment(
    model_name: str,
    current_version: str,
    rollback_target_version: str,
    prometheus_url: str,
    mlflow_tracking_uri: str,
    slack_webhook_url: str,
):
    """
    Set up and start an automated rollback controller after a deployment.
    Call this immediately after a new model version goes live.
    """
    def fetch_prometheus_metrics() -> dict:
        """Fetch current metrics from Prometheus."""
        queries = {
            # Error ratio = errors / requests. A bare rate() of the error counter is
            # errors per second and cannot be compared with a percentage threshold.
            # model_requests_total is an assumed metric name; use your request counter.
            "error_rate": (
                f'sum(rate(model_errors_total{{model="{model_name}"}}[2m])) / '
                f'sum(rate(model_requests_total{{model="{model_name}"}}[2m]))'
            ),
            "latency_p99_ms": f'histogram_quantile(0.99, rate(model_latency_bucket{{model="{model_name}"}}[2m])) * 1000',
            "prediction_diversity": f'model_prediction_diversity{{model="{model_name}"}}',
            "primary_metric": f'model_primary_metric{{model="{model_name}"}}',
            # Note: no revenue_loss_per_minute query is defined here, so the revenue
            # trigger stays inactive until you add one for your own revenue metric.
        }
        metrics = {}
        for metric_name, query in queries.items():
            resp = requests.get(
                f"{prometheus_url}/api/v1/query", params={"query": query}, timeout=10
            )
            resp.raise_for_status()
            result = resp.json()["data"]["result"]
            if result:
                metrics[metric_name] = float(result[0]["value"][1])
        return metrics

    # Fetch baseline metrics from the rollback target version's training run.
    # The diversity check only uses the baseline if that run logged a metric
    # named "prediction_diversity"; otherwise it falls back to 1.0.
    mlflow.set_tracking_uri(mlflow_tracking_uri)
    client = MlflowClient()
    target_version_info = client.get_model_version(model_name, rollback_target_version)
    target_run = client.get_run(target_version_info.run_id)
    baseline_metrics = target_run.data.metrics

    config = ControllerConfig(
        model_name=model_name,
        current_version=current_version,
        rollback_target_version=rollback_target_version,
        thresholds=RollbackThresholds(),
        check_interval_seconds=30,
        consecutive_failures_required=3,
        mlflow_tracking_uri=mlflow_tracking_uri,
        slack_webhook_url=slack_webhook_url,
    )

    def execute_rollback():
        return rollback_model_artifact(
            model_name=model_name,
            target_version=rollback_target_version,
            current_version=current_version,
            mlflow_tracking_uri=mlflow_tracking_uri,
            triggered_by="automated_rollback_controller",
            reason="Automated threshold breach",
        )

    controller = AutomatedRollbackController(
        config=config,
        metrics_fetcher=fetch_prometheus_metrics,
        baseline_metrics=baseline_metrics,
        rollback_fn=execute_rollback,
    )
    controller.start()
    return controller
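The controller's detection delay follows directly from its two settings, and so does the revenue exposure while it waits. A small worked calculation, using the 30-second interval, three consecutive failures, and the $3,000 per minute loss rate from the opening incident; the helper names are illustrative, and the figure ignores the extra lag introduced by the 2-minute Prometheus rate windows and by rollback execution itself.

```python
def detection_delay_seconds(check_interval_seconds: int, consecutive_failures_required: int) -> int:
    """Approximate worst-case time from breach to rollback trigger."""
    return check_interval_seconds * consecutive_failures_required


def revenue_lost(delay_seconds: float, loss_per_minute: float) -> float:
    """Revenue lost while the bad model keeps serving for delay_seconds."""
    return delay_seconds / 60 * loss_per_minute


delay = detection_delay_seconds(check_interval_seconds=30, consecutive_failures_required=3)
automated_cost = revenue_lost(delay, loss_per_minute=3000.0)
manual_cost = revenue_lost(14 * 60, loss_per_minute=3000.0)
print(delay, automated_cost, manual_cost)
```

Lowering `consecutive_failures_required` shortens the delay but raises the chance of rolling back on a transient spike, so the two settings are best tuned together.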
Canary Analysis with Argo Rollouts
Argo Rollouts provides a Kubernetes-native progressive delivery controller with built-in canary analysis. It can automatically roll back based on Prometheus metric evaluations during a canary release.
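# argo-rollout-with-analysis.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: model-server
  namespace: ml-serving
spec:
  replicas: 10
  # selector and pod template omitted for brevity
  strategy:
    canary:
      steps:
      - setWeight: 5              # 5% to canary
      - pause: {duration: 10m}
      - analysis:
          templates:
          - templateName: model-quality-analysis
          args:
          - name: canary-version
            value: CANARY_VERSION_LABEL   # placeholder: the version label your metrics carry
      - setWeight: 25             # 25% to canary
      - pause: {duration: 10m}
      - analysis:
          templates:
          - templateName: model-quality-analysis
          args:
          - name: canary-version
            value: CANARY_VERSION_LABEL   # placeholder, as above
      - setWeight: 100            # Full promotion
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: model-quality-analysis
  namespace: ml-serving
spec:
  args:
  - name: canary-version          # must be declared to be usable as {{args.canary-version}}
  metrics:
  - name: primary-metric
    interval: 1m                  # assumed sampling settings; without interval/count
    count: 5                      # each metric is measured only once
    successCondition: result[0] >= 0.82
    failureLimit: 2
    provider:
      prometheus:
        address: http://prometheus:9090
        query: |
          model_primary_metric{model="recommendation-model", version="{{args.canary-version}}"}
  - name: error-rate
    interval: 1m
    count: 5
    successCondition: result[0] <= 0.01
    failureLimit: 2
    provider:
      prometheus:
        address: http://prometheus:9090
        # Error ratio = errors / requests; model_requests_total is an assumed metric name
        query: |
          sum(rate(model_errors_total{model="recommendation-model", version="{{args.canary-version}}"}[5m]))
          /
          sum(rate(model_requests_total{model="recommendation-model", version="{{args.canary-version}}"}[5m]))
  - name: prediction-diversity
    interval: 1m
    count: 5
    successCondition: result[0] >= 0.60
    failureLimit: 0               # No failures tolerated: the first failure aborts the rollout
    provider:
      prometheus:
        address: http://prometheus:9090
        query: |
          model_prediction_diversity{model="recommendation-model", version="{{args.canary-version}}"}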
If an analysis run fails, Argo Rollouts aborts the rollout and shifts traffic back to the stable version. For teams already running on Kubernetes, this is usually the most operationally reliable approach.
MTTR Optimization: Pre-Warming the Rollback Target
Mean Time to Recovery (MTTR) is the metric that matters during a rollback incident. Every second the bad model is serving live traffic costs real money and real user trust. Pre-warming the rollback target - keeping it loaded in memory on at least one replica - can dramatically reduce MTTR.
import logging
import time

import mlflow

logger = logging.getLogger("rollback_controller")


class PrewarmedRollbackTarget:
    """
    Keeps the previous model version loaded in memory as a hot standby.
    When rollback is needed, the switch is instant - no model loading latency.
    """

    def __init__(self, model_name: str, version: str, mlflow_tracking_uri: str):
        self.model_name = model_name
        self.version = version
        self.mlflow_tracking_uri = mlflow_tracking_uri
        self._model = None
        self._load_time = None

    def prewarm(self):
        """Load the rollback model into memory."""
        mlflow.set_tracking_uri(self.mlflow_tracking_uri)
        start = time.perf_counter()
        self._model = mlflow.pyfunc.load_model(f"models:/{self.model_name}/{self.version}")
        self._load_time = time.perf_counter() - start

        # Run a single prediction to warm the JIT / compilation caches.
        # get_warmup_input() is a placeholder you supply: it should return one
        # representative input in the format the model expects.
        self._model.predict(get_warmup_input())
        logger.info(f"Rollback target v{self.version} loaded in {self._load_time:.1f}s - ready for instant rollback")

    def predict(self, features):
        """Serve a prediction from the prewarmed rollback model."""
        if self._model is None:
            raise RuntimeError("Rollback target not prewarmed. Call prewarm() first.")
        return self._model.predict(features)

    @property
    def is_ready(self) -> bool:
        return self._model is not None
:::tip Pre-Warm on Every Deployment
Every time you deploy a new model version to production, pre-warm the previous version in the rollback target. The cost is the memory footprint of one additional model. The benefit is that the first step of your rollback procedure - "load the rollback model" - takes zero seconds instead of 30–120 seconds.
:::
Post-Rollback: The Incident Report
The rollback is not the end of the incident. The rollback stops the bleeding. The incident report is what prevents recurrence.
def generate_incident_report_template(
    rollback_event: dict,
    root_cause: str,
    contributing_factors: list,
    timeline: list,
    corrective_actions: list,
) -> str:
    """
    Generate a structured incident report from a model rollback event.
    """
    return f"""
# Incident Report: Model Rollback - {rollback_event['model_name']}
**Incident ID:** {rollback_event['rollback_id']}
**Date:** {rollback_event['timestamp']}
**Severity:** P{determine_severity(rollback_event)}
**Status:** Resolved
## Summary
Model `{rollback_event['model_name']}` was rolled back from v{rollback_event['rolled_back_from']} to v{rollback_event['rolled_back_to']} due to:
{rollback_event['reason']}
## Timeline
{chr(10).join(f"- **{item['time']}** - {item['event']}" for item in timeline)}
## Impact
- **Duration of degradation:** [fill in]
- **Users affected:** [fill in]
- **Revenue impact (estimated):** [fill in]
## Root Cause
{root_cause}
## Contributing Factors
{chr(10).join(f"- {f}" for f in contributing_factors)}
## What Went Well
- [fill in - what detection/response mechanisms worked]
## What Did Not Go Well
- [fill in - what gaps allowed the incident to occur or prolonged it]
## Corrective Actions
{chr(10).join(f"- [ ] {action['description']} - Owner: {action['owner']}, Due: {action['due_date']}" for action in corrective_actions)}
## Prevention
[What changes to the deployment process, monitoring, or automated gates would prevent this class of incident?]
---
*Generated by automated incident report system. Review and complete all [fill in] sections before sharing.*
"""
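
def determine_severity(rollback_event: dict) -> int:
    """P1 = highest severity, P4 = lowest."""
    reason = rollback_event.get("reason", "").lower()
    if "revenue" in reason or "safety" in reason:
        return 1
    # Reasons produced by make_rollback_decision read "Error rate ..." (with a
    # space), so match that form as well as the underscore form.
    if "error rate" in reason or "error_rate" in reason or "latency" in reason:
        return 2
    return 3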
Production Engineering Notes
The rollback target must be pre-designated: Do not decide which version to roll back to during an incident. Decide before the deployment. Before promoting the new model, tag the rollback target with rollback.healthy_baseline=true and record the rollback target's version number as a tag on the new model version. This eliminates a decision from the incident response path.
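A minimal sketch of that pre-designation step. It takes any client exposing `set_model_version_tag(name, version, key, value)`, which is the call the registry code earlier in this lesson uses on `MlflowClient`. The tag key `rollback.target_version`, the version numbers, and the `RecordingClient` stand-in (included only so the sketch runs without a tracking server) are assumptions for illustration.

```python
from typing import Any


def designate_rollback_target(client: Any, model_name: str, new_version: str, target_version: str) -> None:
    """Record the rollback target before the new version is promoted."""
    # Mark the known-good version so find-by-tag lookups can locate it
    client.set_model_version_tag(model_name, target_version, "rollback.healthy_baseline", "true")
    # Point the new version at its rollback target (assumed tag key)
    client.set_model_version_tag(model_name, new_version, "rollback.target_version", target_version)


class RecordingClient:
    """Stand-in for MlflowClient that stores tags in a dict."""

    def __init__(self):
        self.tags = {}

    def set_model_version_tag(self, name, version, key, value):
        self.tags[(name, version, key)] = value


client = RecordingClient()
designate_rollback_target(client, "recommendation-model", new_version="13", target_version="12")
print(client.tags)
```

During an incident, the on-call engineer (or the automated controller) reads the target from the tag instead of reasoning about registry history.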
Test your rollback procedure quarterly: A rollback procedure that has never been executed in a non-incident context will fail during an incident. Run a quarterly rollback drill: deploy a known-good model, simulate a threshold breach, execute the rollback, measure MTTR. The goal is to make rollback boring - a procedure so practiced it requires no thought.
Keep rollback artifacts co-located with serving infrastructure: The rollback procedure should not require the on-call engineer to access the model registry, the CI/CD system, and the Kubernetes cluster in sequence. Ideally, a single command or button click triggers the complete rollback sequence. Pre-stage the rollback artifact (the Kubernetes manifest, the Helm values, the model server config) in the serving environment so that activating it is atomic.
Monitor for two hours after rollback: A rollback that successfully restores the previous model version may not immediately restore all metrics to baseline. Feature pipeline caches, CDN caches, and user session state may take time to normalize. Continue monitoring at high frequency for at least two hours after a rollback is declared complete.
Document rollback latency: Track how long each phase of the rollback takes: detection time (until the alert fires), triage time (on-call reviews and decides), execution time (rollback completes), and verification time (metrics confirmed recovered). The sum of these is your MTTR. Each phase has a different optimization lever: detection time is improved by better monitoring, triage time by better runbooks and decision automation, execution time by pre-warming, and verification time by clear success criteria.
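The phase breakdown can be computed directly from incident timestamps. The sketch below assumes five recorded timestamps per incident; the key names (`degradation_started`, `alert_fired`, and so on) and the function name `phase_durations` are hypothetical, and the sample times in the usage note are illustrative only.

```python
from datetime import datetime, timedelta

# Each phase is the gap between two recorded timestamps (key names are assumptions).
PHASES = [
    ("detection", "degradation_started", "alert_fired"),
    ("triage", "alert_fired", "rollback_started"),
    ("execution", "rollback_started", "rollback_completed"),
    ("verification", "rollback_completed", "recovery_confirmed"),
]


def phase_durations(timestamps: dict) -> dict:
    """Return the duration of each rollback phase plus their sum as 'mttr'."""
    durations = {}
    for phase, start_key, end_key in PHASES:
        delta = timestamps[end_key] - timestamps[start_key]
        if delta < timedelta(0):
            raise ValueError(f"{end_key} is earlier than {start_key}")
        durations[phase] = delta
    # MTTR is the sum of the four phases.
    durations["mttr"] = sum((durations[p] for p, _, _ in PHASES), timedelta(0))
    return durations
```

For example, with an alert 5 minutes after degradation starts, rollback starting 3 minutes later, completing 2 minutes after that, and recovery confirmed 10 minutes later, `phase_durations` reports an MTTR of 20 minutes, half of it spent in verification.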
Common Mistakes
:::danger Never Overwrite Model Artifacts in Place
Overwriting the model artifact at the serving path (e.g., s3://models/production/model.pkl) instead of using versioned paths eliminates the ability to roll back. There is no previous version to restore. Always use immutable, versioned artifact paths. The serving config points to a specific path; changing which version is active means changing the config, not overwriting the artifact.
:::
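As an illustration of "change the config, not the artifact", the sketch below builds a per-version artifact URI and rewrites a small JSON serving config to switch versions. The URI layout, the config fields, and the function names are assumptions made for this example, not a prescribed format.

```python
import json
import os
import tempfile


def versioned_artifact_uri(model_name: str, version: int) -> str:
    """Build an immutable, per-version URI (the layout is an assumption)."""
    return f"s3://models/{model_name}/v{version}/model.pkl"


def write_serving_config(config_path: str, model_name: str, version: int) -> None:
    """Atomically point the serving config at one specific version."""
    config = {
        "model_name": model_name,
        "active_version": version,
        "artifact_uri": versioned_artifact_uri(model_name, version),
    }
    directory = os.path.dirname(os.path.abspath(config_path))
    # Write to a temp file in the same directory, then rename over the target,
    # so readers never observe a half-written config.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(config, f)
    os.replace(tmp_path, config_path)


def read_serving_config(config_path: str) -> dict:
    """Load the config that tells the server which artifact is active."""
    with open(config_path) as f:
        return json.load(f)
```

Rolling back is then a second call to `write_serving_config` with the earlier version number; the artifact for the newer version stays where it was and can be inspected afterwards.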
:::danger Do Not Delete Model Registry Versions After Production
Deleting the previous model version from the registry to "clean up" removes your rollback target. Registry versions should be archived, not deleted. The storage cost of keeping old model artifacts is trivial compared to the operational cost of not being able to roll back. Archive old versions; never delete them unless specifically required by data retention policy.
:::
:::warning A Single Consecutive-Check Rule Prevents False Positives
A rollback controller that triggers on a single threshold breach will trigger on transient metric spikes - a 10-second burst of high latency from a downstream service, a brief traffic surge, a momentary metric collection gap. Require a configurable number of consecutive threshold violations (typically 3) before triggering automated rollback. This eliminates most false positives. The cost is added rollback latency: with N consecutive checks required, the trigger fires N - 1 check intervals later than a single-breach trigger would, so 3 checks at a 30–60 second interval add 60–120 seconds.
:::
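The consecutive-check rule amounts to a streak counter that resets on any healthy reading. A minimal sketch, assuming a metric where higher is worse (such as error rate); the class name `ConsecutiveBreachTrigger` is hypothetical:

```python
class ConsecutiveBreachTrigger:
    """Fire only after `required_consecutive` breaches in a row."""

    def __init__(self, threshold: float, required_consecutive: int = 3):
        if required_consecutive < 1:
            raise ValueError("required_consecutive must be at least 1")
        self.threshold = threshold
        self.required_consecutive = required_consecutive
        self.streak = 0

    def observe(self, value: float) -> bool:
        """Record one metric reading; return True when rollback should trigger."""
        if value > self.threshold:
            self.streak += 1
        else:
            # Any healthy reading resets the streak, which filters transient spikes.
            self.streak = 0
        return self.streak >= self.required_consecutive
```

With a threshold of 0.01, the readings 0.02, 0.005, 0.02, 0.02, 0.02 trigger only on the last one: the first spike is cancelled by the healthy reading that follows it.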
:::warning Do Not Mix Model Rollback with Infrastructure Rollback Without Analysis
Rolling back both the model version and the infrastructure simultaneously makes it impossible to determine which one fixed the problem. If time permits (it usually permits 5 minutes), try the model rollback first, assess whether metrics recover, and only proceed to infrastructure rollback if the model rollback did not help. If the incident is severe enough that every second counts, do both simultaneously - but document that you did so and plan a careful post-incident analysis.
:::
:::danger The Runbook Must Be Tested, Not Just Written
A rollback runbook that has never been executed is a hypothesis, not a procedure. Links go stale. Credentials expire. APIs change. Commands that worked six months ago fail because a configuration changed. Runbooks must be executed in a test environment on a regular schedule. If a runbook step fails during a test, fix it immediately - that failure during a real incident would cost real money.
:::
Interview Q&A
Q: Walk me through a model rollback. What are the steps?
A: A model rollback has five phases. First, detection: a monitoring alert fires (or an on-call engineer identifies the problem). Second, triage: determine whether to roll back immediately or investigate first. The decision framework is: if there is confirmed revenue impact, user-safety impact, or hard metric threshold breaches (error rate, latency), roll back immediately and investigate after. If it is a soft quality degradation, investigate briefly before deciding. Third, execution: roll back the model artifact in the registry - transition the previous healthy version back to Production, archive the problematic version. Simultaneously, update the serving infrastructure (Kubernetes deployment, Helm release) to point to the previous model artifact and restart the model server. Fourth, verification: confirm metrics recover to baseline within a reasonable window - typically within 5–15 minutes of rollback completion. Fifth, post-mortem: document the incident timeline, root cause, and corrective actions within 48 hours.
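The triage rule in the second phase is simple enough to encode, which removes deliberation from the on-call path. The sketch below is one possible encoding; the `IncidentSignals` fields and the `TriageDecision` values are hypothetical names that mirror the categories in the answer above.

```python
from dataclasses import dataclass
from enum import Enum


class TriageDecision(Enum):
    ROLLBACK_NOW = "rollback_now"
    INVESTIGATE_BRIEFLY = "investigate_briefly"
    NO_ACTION = "no_action"


@dataclass
class IncidentSignals:
    confirmed_revenue_impact: bool = False
    user_safety_impact: bool = False
    error_rate_breach: bool = False
    latency_breach: bool = False
    soft_quality_degradation: bool = False


def triage(signals: IncidentSignals) -> TriageDecision:
    """Apply the framework: hard signals roll back first, soft signals investigate."""
    hard_signal = (
        signals.confirmed_revenue_impact
        or signals.user_safety_impact
        or signals.error_rate_breach
        or signals.latency_breach
    )
    if hard_signal:
        return TriageDecision.ROLLBACK_NOW
    if signals.soft_quality_degradation:
        return TriageDecision.INVESTIGATE_BRIEFLY
    return TriageDecision.NO_ACTION
```

A hard signal always wins: an incident with both a latency breach and a soft quality degradation returns `ROLLBACK_NOW`.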
Q: How do you automate rollback decisions?
A: An automated rollback controller is a continuously running process - either a daemon in your serving infrastructure or a sidecar container - that fetches model serving metrics on a configurable interval (typically 30–60 seconds) and compares them against pre-defined thresholds. To avoid false positives from transient metric spikes, require a configurable number of consecutive threshold violations (typically 3) before triggering rollback. When the trigger condition is met, the controller executes the rollback: transitions the registry to the previous version, updates the serving config, restarts the model server, and sends notifications. The rollback target version must be pre-designated before the new model goes live, tagged in the registry so the controller knows exactly where to roll back to without making a decision under pressure.
Q: Why is ML rollback harder than software rollback?
A: Software rollback is conceptually simple: revert the code, redeploy. The state of a software service lives mostly in the running code. ML model state is distributed across four components: the model artifact (weights and parameters), the serving infrastructure (container, config), the feature pipeline (transformation logic, encoders, scalers), and the training data distribution. A failure can originate in any of these. A model artifact rollback fixes nothing if the root cause was a feature pipeline change that altered the model's inputs. A serving infrastructure rollback fixes nothing if the problem was that the new model weights learned a bad pattern in the training data. Diagnosing which component is the root cause - quickly, under pressure - is the hard part. This is why you need both a fast default rollback path (always start with model artifact rollback - it is the fastest) and a systematic post-rollback investigation that checks all four components.
Q: What is the difference between blue-green rollback and canary reversal?
A: In a blue-green deployment, two complete environments run simultaneously. The load balancer routes all traffic to one (the active environment). Rollback means switching the load balancer to the standby environment - sub-second switch, zero traffic disruption. The cost is running two full environments continuously (roughly double the infrastructure cost). In a canary deployment, the new model version serves a small percentage of traffic (say, 5%) alongside the stable version. Canary reversal means reducing the canary's traffic share to zero - the new model stops receiving traffic, and 100% flows to the stable version. This is slightly slower than a blue-green switch but much cheaper operationally. The practical choice depends on your traffic volume and cost constraints. For high-traffic systems where every second of bad model exposure costs money, blue-green is worth the cost. For typical ML serving environments, canary deployment with automatic reversal is the right balance.
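Both strategies come down to rewriting a traffic-weight table. The sketch below models the table as a plain dict of integer percentages; the environment names and function names are hypothetical, and a real system would apply the result to its load balancer or service mesh.

```python
def blue_green_rollback(weights: dict, standby: str) -> dict:
    """Send all traffic to the standby environment in one step."""
    if standby not in weights:
        raise KeyError(f"unknown environment: {standby}")
    return {env: (100 if env == standby else 0) for env in weights}


def canary_reversal(weights: dict, canary: str, stable: str) -> dict:
    """Drain the canary and hand its share of traffic back to the stable version."""
    if canary not in weights or stable not in weights:
        raise KeyError("both canary and stable must be present in weights")
    updated = dict(weights)
    updated[stable] += updated[canary]
    updated[canary] = 0
    return updated
```

In the blue-green case the bad model was serving 100% of traffic before the switch; in the canary case it was serving only its small share, which is why a slower reversal is usually acceptable.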
Q: How do you reduce Mean Time to Recovery (MTTR) for model incidents?
A: MTTR has four components: detection time, triage time, execution time, and verification time. Detection time is reduced by tightening alert thresholds, adding composite alerts (a revenue anomaly combined with a model quality metric), and reducing the metric scrape interval in Prometheus. Triage time is reduced by pre-making the rollback decision: designate the rollback target before deploying, write a decision framework (if X then rollback, else investigate), and practice the runbook until it requires no deliberation. Execution time is the most impactful lever: pre-warm the rollback model in memory so the rollback does not wait on model loading, pre-stage all rollback artifacts (Kubernetes manifests, serving configs) so the rollback is a single command, and use blue-green deployment so the switch is a load balancer configuration change with no restart latency. Verification time is reduced by defining explicit success criteria - "metrics recover to within 10% of pre-incident baseline" - so the on-call engineer knows when the incident is over.
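An explicit success criterion such as "within 10% of pre-incident baseline" can be checked mechanically. A minimal sketch, assuming metrics are held in plain dicts keyed by metric name; `unrecovered_metrics` is a hypothetical helper, and the relative-deviation test is one reasonable reading of the criterion.

```python
def unrecovered_metrics(current: dict, baseline: dict, tolerance: float = 0.10) -> list:
    """Return the names of metrics still outside `tolerance` of their baseline."""
    failing = []
    for name, base in baseline.items():
        value = current.get(name)
        if value is None:
            # A missing reading cannot be confirmed as recovered.
            failing.append(name)
            continue
        # Relative deviation from baseline; a zero baseline requires an exact match.
        if abs(value - base) > tolerance * abs(base):
            failing.append(name)
    return failing
```

The incident can be declared over when the returned list is empty. For instance, a diversity metric of 0.70 against a baseline of 0.73 passes, while 0.04 does not.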
Q: How do you handle the case where there is no safe rollback target?
A: This is the worst case: every recent model version in the registry has a known problem - the data drift that caused the incident was present in all recent training runs, or the problematic feature pipeline change predates multiple model versions. In this case, the rollback options are: serve a fallback (a simple heuristic model, a rule-based system, or a static response) while an emergency retrain is conducted; temporarily disable the ML-powered feature and fall back to non-ML behavior (e.g., showing most-popular items instead of personalized recommendations); or serve from a much older model version even if its performance is significantly below current standards. The key is that the fallback must be pre-built and pre-tested before any incident occurs. If you reach the incident and discover you have no safe rollback target and no fallback, you are in a position where the only option is to take the feature offline - which is almost always worse than serving a degraded ML response. The preparation investment is a tested fallback that can be activated in under five minutes.
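A pre-built fallback is often just a guarded call with a non-ML default behind a switch. The sketch below assumes the model is a callable from user ID to a list of items and that `ml_enabled` is a flag the on-call engineer can flip; all names are hypothetical.

```python
from typing import Callable, Sequence, Tuple


def recommend(
    user_id: str,
    model: Callable[[str], Sequence[str]],
    most_popular: Sequence[str],
    ml_enabled: bool = True,
) -> Tuple[str, list]:
    """Return (source, items), falling back to most-popular items when needed."""
    if ml_enabled:
        try:
            items = list(model(user_id))
            if items:
                return "model", items
        except Exception:
            # Any model failure drops through to the non-ML fallback.
            pass
    return "most_popular", list(most_popular)
```

Setting `ml_enabled=False` disables the ML path without a deploy, which is the "temporarily disable the ML-powered feature" option described above. Returning the source alongside the items lets dashboards show how much traffic is being served by the fallback.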
Summary
Model rollback strategy is the last line of defense in your ML deployment pipeline. Everything else - model cards, promotion gates, shadow mode, canary deployments - reduces the probability of a bad model reaching production. Rollback strategy handles the cases where those defenses fail.
There are three operational principles: pre-designate the rollback target before each deployment and tag it explicitly in the registry; pre-warm the rollback artifact so execution is fast; and automate the rollback trigger so the decision and execution happen in seconds, not minutes.
Rollback is not a failure mode. It is an expected part of operating any ML system at scale. The teams that do it well treat it as a first-class operational capability - practiced, automated, and boring. The teams that do it poorly discover at 2 AM that their runbook is out of date and their rollback target was deleted three weeks ago.
