Canary and Blue-Green Deployments for ML Models
The Production Scenario
Your recommendation model just passed shadow evaluation with flying colors - 97% label agreement, no subgroup regressions, latency within budget. Now comes the step that actually matters: letting real users experience the new model's outputs and measuring whether CTR, session duration, and conversion rate improve as expected.
The naive approach is to flip the switch - redirect 100% of traffic to the new model immediately. This was common practice before staged rollouts became the norm. At 100% traffic, a subtle model regression that produces worse recommendations for 5% of users reaches every one of those users at once. By the time you notice the CTR drop (which surfaces slowly, because detecting it requires statistical confidence), you have already shown millions of users degraded recommendations. Rolling back is painful because state has accumulated in the new model's logs (what users saw, which A/B experiment they were in).
Canary deployment solves this by routing only a small fraction of traffic - 1% to start - to the new model. You monitor business metrics on that 1% cohort versus the 99% control cohort. If metrics are good, you incrementally increase traffic: 1% → 5% → 20% → 50% → 100%. If metrics regress at any stage, you route the canary's traffic back to the control model before more than a small fraction of users are affected.
Blue-green deployment solves a different problem: you have a fully validated model and you want to switch to it instantly, with the ability to instantly switch back if something is catastrophically wrong. Blue-green keeps two complete environments running simultaneously and changes which one receives traffic via a DNS or load balancer change.
Both patterns are essential. Blue-green for clean transitions. Canary for gradual validation of business impact.
Why Canary for ML Instead of Classic Software Deploys
For software services, canary deployment catches crashes, errors, and performance regressions - detectable within minutes. For ML models, the regressions you care about are often statistical: a model that returns HTTP 200 for every request but quietly produces worse recommendations for users in a specific demographic cohort. These regressions are invisible to infrastructure monitoring. They only show up when you measure business metrics (CTR, revenue, engagement) and have enough samples for statistical significance - which takes hours or days.
This changes the cadence of canary deployment for ML:
- Software canary: 30 minutes at 1%, then promote if error rate is stable
- ML canary: 24-48 hours at each traffic level, waiting for statistical significance on business metrics
The longer exposure at each stage is deliberate. ML model regressions are often subtle and require statistical power to detect with confidence. Rushing the canary stages is how you end up with a "successful" deployment that only shows degradation after full rollout when it is expensive to roll back.
Historical Context
Blue-green deployment was described by Jez Humble and David Farley in the book Continuous Delivery (2010), and Martin Fowler wrote up the pattern under the same name that year. The concept is simple: maintain two identical production environments ("blue" and "green"), deploy to the inactive environment, test it, then route traffic to it. The previously live environment becomes the new standby.
Canary deployment gets its name from "canary in a coal mine" - miners would bring canaries into mines because the birds were more sensitive to carbon monoxide. If the canary died, the miners knew to evacuate. In software deployment, a small fraction of users are the "canary" - if the deployment harms them, you detect it before it affects everyone.
Both patterns were popularized at scale by Google, which described its canary deployment practices publicly around 2011-2014. For ML specifically, the patterns became standard after high-profile incidents in which model updates (particularly at social networks and e-commerce platforms) produced large-scale business metric regressions that took days to detect and roll back.
Core Concepts
Blue-Green Deployment
In blue-green, you maintain two complete deployment environments. At any given time, one is "live" (receiving production traffic) and one is "idle" (on standby).
The switch is near-instant - a load balancer configuration change takes effect in seconds or less (a DNS-based switch is slower, because clients cache records until the TTL expires). Rollback is equally fast - switch back. The idle environment is not wasted: it runs smoke tests and synthetic traffic to validate the new model before it receives production traffic.
Blue-green works best when: you have already validated the new model thoroughly (shadow mode + offline evaluation), you want a clean instant switch rather than gradual traffic migration, and the operational simplicity of "flip a switch" outweighs the cost of running two full environments.
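The flip-and-rollback mechanics can be sketched in a few lines. This is a toy in-memory illustration, not a load balancer integration: `BlueGreenRouter`, its method names, and the `smoke_test` callable are hypothetical names introduced here.

```python
from typing import Callable, Dict, Optional


class BlueGreenRouter:
    """Toy model of a blue-green switch: exactly one environment is live."""

    def __init__(self, environments: Dict[str, str], live: str):
        if live not in environments:
            raise ValueError(f"unknown environment: {live}")
        self.environments = environments  # environment name -> model version
        self.live = live
        self.previous: Optional[str] = None  # remembered so rollback is one step

    @property
    def idle(self) -> str:
        # With two environments, the idle one is whichever is not live.
        return next(name for name in self.environments if name != self.live)

    def deploy_to_idle(self, model_version: str) -> None:
        self.environments[self.idle] = model_version

    def switch(self, smoke_test: Callable[[str], bool]) -> bool:
        """Flip traffic to the idle environment only if its smoke test passes."""
        target = self.idle
        if not smoke_test(self.environments[target]):
            return False
        self.previous, self.live = self.live, target
        return True

    def rollback(self) -> None:
        if self.previous is None:
            raise RuntimeError("nothing to roll back to")
        self.live, self.previous = self.previous, None


router = BlueGreenRouter({"blue": "v6", "green": "v6"}, live="blue")
router.deploy_to_idle("v7")
router.switch(smoke_test=lambda version: version == "v7")
print(router.live, router.environments[router.live])  # green v7
router.rollback()
print(router.live, router.environments[router.live])  # blue v6
```

In a real system the `live` pointer would be a load balancer target or service selector; the point of the sketch is that both the switch and the rollback are a single pointer change.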
Canary Deployment
In canary, traffic is split between the production (control) and new (canary) model. The split starts small and grows only when metrics confirm the canary is at least as good.
Implementation: Kubernetes Traffic Splitting with Istio
Kubernetes with Istio is a widely used production setup for traffic splitting. The manifests below define the traffic split, the version subsets, and the canary deployment:
# kubernetes/model-virtualservice.yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: model-service
  namespace: ml-serving
spec:
  hosts:
    - model-service
  http:
    - route:
        - destination:
            host: model-service
            subset: v6  # Production (control)
          weight: 99
        - destination:
            host: model-service
            subset: v7  # Canary
          weight: 1
---
# kubernetes/model-destinationrule.yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: model-service
  namespace: ml-serving
spec:
  host: model-service
  subsets:
    - name: v6
      labels:
        model-version: v6
    - name: v7
      labels:
        model-version: v7
---
# kubernetes/deployment-v7.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-v7
  namespace: ml-serving
spec:
  replicas: 2  # Start with 2 replicas for 1% traffic
  selector:
    matchLabels:
      app: model-service
      model-version: v7
  template:
    metadata:
      labels:
        app: model-service
        model-version: v7
    spec:
      containers:
        - name: model-server
          image: registry.company.com/model-service:v7
          resources:
            requests:
              cpu: "2"
              memory: "4Gi"
              nvidia.com/gpu: "1"
            limits:
              nvidia.com/gpu: "1"
          env:
            - name: MODEL_VERSION
              value: "v7"
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
Update traffic split via Kubernetes API or kubectl:
# canary_controller.py
from typing import Tuple

from kubernetes import client, config


class CanaryController:
    """Controls canary traffic percentage via Kubernetes + Istio."""

    def __init__(self, namespace: str = "ml-serving"):
        config.load_incluster_config()  # Use in-cluster credentials
        self.custom_api = client.CustomObjectsApi()
        self.namespace = namespace
        self.virtual_service_name = "model-service"

    def get_current_split(self) -> Tuple[int, int]:
        """Returns (control_weight, canary_weight)."""
        vs = self.custom_api.get_namespaced_custom_object(
            group="networking.istio.io",
            version="v1beta1",
            namespace=self.namespace,
            plural="virtualservices",
            name=self.virtual_service_name,
        )
        routes = vs["spec"]["http"][0]["route"]
        control = next(r for r in routes if r["destination"]["subset"] == "v6")
        canary = next(r for r in routes if r["destination"]["subset"] == "v7")
        return control["weight"], canary["weight"]

    def set_canary_percentage(self, canary_pct: int):
        """Update Istio VirtualService to set canary traffic percentage."""
        # Raise instead of assert: asserts are stripped when Python runs with -O.
        if not 0 <= canary_pct <= 100:
            raise ValueError(f"canary_pct must be in [0, 100], got {canary_pct}")
        control_pct = 100 - canary_pct
        # The patch replaces the whole http route list with the new weights.
        patch = {
            "spec": {
                "http": [{
                    "route": [
                        {
                            "destination": {
                                "host": "model-service",
                                "subset": "v6",
                            },
                            "weight": control_pct,
                        },
                        {
                            "destination": {
                                "host": "model-service",
                                "subset": "v7",
                            },
                            "weight": canary_pct,
                        },
                    ]
                }]
            }
        }
        self.custom_api.patch_namespaced_custom_object(
            group="networking.istio.io",
            version="v1beta1",
            namespace=self.namespace,
            plural="virtualservices",
            name=self.virtual_service_name,
            body=patch,
        )
        print(f"Traffic updated: {control_pct}% v6, {canary_pct}% v7")

    def rollback(self):
        """Immediately route all traffic back to production model."""
        self.set_canary_percentage(0)
        print("ROLLBACK: All traffic routed to v6")
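One way to drive a controller like this through the traffic stages is a small loop that promotes only while each stage stays healthy. The sketch below is an assumption-laden illustration: `run_staged_rollout`, `TrafficController`, and `RecordingController` are hypothetical names, and `stage_is_healthy` stands in for whatever soak-and-evaluate logic you use. The fake controller records calls so the loop can be exercised without a cluster.

```python
from typing import Callable, List, Protocol


class TrafficController(Protocol):
    """Anything exposing the two methods the rollout loop needs."""

    def set_canary_percentage(self, canary_pct: int) -> None: ...
    def rollback(self) -> None: ...


def run_staged_rollout(
    controller: TrafficController,
    stages: List[int],
    stage_is_healthy: Callable[[int], bool],
) -> bool:
    """Walk through traffic stages; roll back on the first unhealthy stage.

    stage_is_healthy(pct) is expected to block for the stage's soak time
    and return True only if the canary metrics are acceptable.
    """
    for pct in stages:
        controller.set_canary_percentage(pct)
        if not stage_is_healthy(pct):
            controller.rollback()
            return False
    return True


class RecordingController:
    """Stand-in that records calls instead of patching Istio."""

    def __init__(self) -> None:
        self.history: List[int] = []

    def set_canary_percentage(self, canary_pct: int) -> None:
        self.history.append(canary_pct)

    def rollback(self) -> None:
        self.history.append(0)


fake = RecordingController()
promoted = run_staged_rollout(fake, [1, 5, 20, 50, 100], lambda pct: pct < 20)
print(promoted, fake.history)  # False [1, 5, 20, 0]
```

Keeping the health check as an injected callable means the same loop can be tested with a stub and run in production with a real metrics evaluation.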
Automated Rollback: Monitoring and Triggers
The most important part of canary deployment is the automated rollback system. You should not rely on a human to detect a regression and manually roll it back at 3:00 AM:
# canary_monitor.py
import time
import logging
from dataclasses import dataclass
from typing import Optional

from prometheus_api_client import PrometheusConnect

logger = logging.getLogger(__name__)


@dataclass
class CanaryMetrics:
    canary_error_rate: float
    control_error_rate: float
    canary_p99_latency: float
    control_p99_latency: float
    canary_ctr: float
    control_ctr: float

    @property
    def error_rate_regression(self) -> bool:
        # Canary error rate > 2x control AND > 1% absolute
        return (
            self.canary_error_rate > self.control_error_rate * 2
            and self.canary_error_rate > 0.01
        )

    @property
    def latency_regression(self) -> bool:
        # Canary p99 > 20% worse than control
        if self.control_p99_latency <= 0:
            # No control data (query_metric returned 0.0): nothing to compare against
            return False
        return self.canary_p99_latency > self.control_p99_latency * 1.2

    @property
    def business_regression(self) -> bool:
        # CTR dropped more than 2% relative
        if self.control_ctr == 0:
            return False
        relative_drop = (self.control_ctr - self.canary_ctr) / self.control_ctr
        return relative_drop > 0.02

    @property
    def should_rollback(self) -> bool:
        return (
            self.error_rate_regression
            or self.latency_regression
            or self.business_regression
        )

    def rollback_reason(self) -> Optional[str]:
        reasons = []
        if self.error_rate_regression:
            reasons.append(
                f"Error rate: canary {self.canary_error_rate:.2%} "
                f"vs control {self.control_error_rate:.2%}"
            )
        if self.latency_regression:
            reasons.append(
                f"p99 latency: canary {self.canary_p99_latency:.0f}ms "
                f"vs control {self.control_p99_latency:.0f}ms"
            )
        if self.business_regression:
            reasons.append(
                f"CTR: canary {self.canary_ctr:.4f} "
                f"vs control {self.control_ctr:.4f}"
            )
        return "; ".join(reasons) if reasons else None


class CanaryMonitor:
    """Monitors canary metrics and triggers rollback if needed."""

    def __init__(
        self,
        prometheus_url: str,
        canary_controller: "CanaryController",
        check_interval_seconds: int = 60,
    ):
        self.prometheus = PrometheusConnect(url=prometheus_url)
        self.canary_controller = canary_controller
        self.check_interval = check_interval_seconds
        self.rollback_triggered = False

    def query_metric(self, promql: str) -> float:
        """Execute a Prometheus query and return the scalar result."""
        result = self.prometheus.custom_query(promql)
        if not result:
            return 0.0  # No data yet; callers treat 0.0 as "missing"
        return float(result[0]["value"][1])

    def fetch_metrics(self) -> CanaryMetrics:
        """Fetch all canary and control metrics from Prometheus."""
        # Error rates - from HTTP 5xx response codes.
        # sum() aggregates across pods and status labels so the division
        # yields one ratio instead of matching series label-by-label.
        canary_errors = self.query_metric(
            'sum(rate(http_requests_total{model_version="v7",status=~"5.."}[5m]))'
            ' / sum(rate(http_requests_total{model_version="v7"}[5m]))'
        )
        control_errors = self.query_metric(
            'sum(rate(http_requests_total{model_version="v6",status=~"5.."}[5m]))'
            ' / sum(rate(http_requests_total{model_version="v6"}[5m]))'
        )
        # Latency from histogram quantiles, aggregated across pods by bucket
        canary_p99 = self.query_metric(
            'histogram_quantile(0.99, sum by (le) (rate('
            'http_request_duration_ms_bucket{model_version="v7"}[5m])))'
        )
        control_p99 = self.query_metric(
            'histogram_quantile(0.99, sum by (le) (rate('
            'http_request_duration_ms_bucket{model_version="v6"}[5m])))'
        )
        # Business metric - click-through rate
        canary_ctr = self.query_metric(
            'sum(rate(recommendation_clicks_total{model_version="v7"}[1h]))'
            ' / sum(rate(recommendation_impressions_total{model_version="v7"}[1h]))'
        )
        control_ctr = self.query_metric(
            'sum(rate(recommendation_clicks_total{model_version="v6"}[1h]))'
            ' / sum(rate(recommendation_impressions_total{model_version="v6"}[1h]))'
        )
        return CanaryMetrics(
            canary_error_rate=canary_errors,
            control_error_rate=control_errors,
            canary_p99_latency=canary_p99,
            control_p99_latency=control_p99,
            canary_ctr=canary_ctr,
            control_ctr=control_ctr,
        )

    def run(self):
        """Continuous monitoring loop - runs until rollback or external stop."""
        logger.info("Canary monitor started")
        while not self.rollback_triggered:
            metrics = self.fetch_metrics()
            logger.info(
                f"Canary metrics: "
                f"error={metrics.canary_error_rate:.2%}, "
                f"p99={metrics.canary_p99_latency:.0f}ms, "
                f"ctr={metrics.canary_ctr:.4f}"
            )
            if metrics.should_rollback:
                reason = metrics.rollback_reason()
                logger.error(f"ROLLBACK TRIGGERED: {reason}")
                self.canary_controller.rollback()
                self.rollback_triggered = True
                # Send alert
                self._send_alert(reason)
                break
            time.sleep(self.check_interval)

    def _send_alert(self, reason: str):
        """Send PagerDuty or Slack alert on rollback."""
        # Implementation depends on your alerting infrastructure
        pass
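The `business_regression` check above compares point estimates of CTR, so at low canary volume it can fire on noise. One possible refinement, sketched here under the assumption that raw click and impression counts are available, is a one-sided two-proportion z-test. The function name `ctr_drop_is_significant` and the default `alpha` are illustrative choices, not part of the monitor above.

```python
import math


def ctr_drop_is_significant(
    control_clicks: int,
    control_impressions: int,
    canary_clicks: int,
    canary_impressions: int,
    alpha: float = 0.05,
) -> bool:
    """One-sided two-proportion z-test: is canary CTR lower than control CTR?"""
    if control_impressions == 0 or canary_impressions == 0:
        return False  # No data, no evidence of a drop
    p_control = control_clicks / control_impressions
    p_canary = canary_clicks / canary_impressions
    # Pooled click rate under the null hypothesis of equal CTR.
    pooled = (control_clicks + canary_clicks) / (
        control_impressions + canary_impressions
    )
    standard_error = math.sqrt(
        pooled * (1 - pooled) * (1 / control_impressions + 1 / canary_impressions)
    )
    if standard_error == 0:
        return False
    z = (p_control - p_canary) / standard_error
    # Upper-tail probability of the standard normal distribution.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return p_value < alpha


# 5.0% control CTR vs 4.5% canary CTR on 10,000 canary impressions
print(ctr_drop_is_significant(50_000, 1_000_000, 450, 10_000))  # True
# 5.0% vs 4.8% on only 1,000 canary impressions: not enough evidence
print(ctr_drop_is_significant(50_000, 1_000_000, 48, 1_000))    # False
```

A test like this complements the relative-drop threshold rather than replacing it: the threshold says how large a drop matters, the test says whether the observed drop is distinguishable from noise.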
Traffic Splitting Strategies Beyond Random
Random traffic splitting (X% of requests, randomly chosen) is the simplest approach. But ML deployments sometimes need more nuanced strategies:
User-stable splitting: Route the same user to the same model version every time. This is important for models that affect user experience over multiple sessions - you do not want a user to see canary recommendations on one session and control recommendations on the next.
import hashlib


def route_user_to_model(user_id: str, canary_pct: int) -> str:
    """
    Stable per-user routing - same user always hits same model.
    Uses a deterministic hash of the user ID, so no state is required.
    """
    user_hash = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
    bucket = user_hash % 100  # 0-99
    return "v7" if bucket < canary_pct else "v6"
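A useful property of this bucket scheme is that raising the percentage only adds users to the canary: anyone with `bucket < 1` also has `bucket < 5`. The check below re-declares the bucketing under the hypothetical names `canary_bucket` and `in_canary` so it runs on its own; the user IDs are synthetic.

```python
import hashlib


def canary_bucket(user_id: str) -> int:
    """Same bucketing as above: MD5 of the user ID reduced to 0-99."""
    return int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100


def in_canary(user_id: str, canary_pct: int) -> bool:
    return canary_bucket(user_id) < canary_pct


users = [f"user-{i}" for i in range(10_000)]
at_1_pct = {u for u in users if in_canary(u, 1)}
at_5_pct = {u for u in users if in_canary(u, 5)}

# Nobody leaves the canary when it grows from 1% to 5%.
print(at_1_pct <= at_5_pct)  # True
# The observed share should land near the configured 5%.
print(len(at_5_pct) / len(users))
```

If you run several rollouts at once, consider mixing an experiment-specific salt into the hashed string so the same users are not always the first to see every canary.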
Feature-based splitting: Route specific user segments to the canary. For example, new users first (lower business risk), or power users last (they drive disproportionate revenue and you want them on the stable model until confidence is high).
Geographic splitting: Route canary traffic to a single region first. Useful when models have language or cultural dependencies that differ by geography.
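Feature-based and geographic splitting can be layered on top of the per-user hash as eligibility gates. The sketch below assumes each request carries a region and a segment label; `CanaryPolicy`, `choose_model`, and the example region and segment names are hypothetical.

```python
import hashlib
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class CanaryPolicy:
    canary_pct: int                    # share of eligible users sent to v7
    allowed_regions: FrozenSet[str]    # canary only runs in these regions
    excluded_segments: FrozenSet[str]  # segments kept on the stable model


def choose_model(user_id: str, region: str, segment: str, policy: CanaryPolicy) -> str:
    # Gate 1: geographic splitting, canary only in selected regions.
    if region not in policy.allowed_regions:
        return "v6"
    # Gate 2: feature-based splitting, protect high-value segments.
    if segment in policy.excluded_segments:
        return "v6"
    # Gate 3: stable per-user bucket among the remaining eligible users.
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
    return "v7" if bucket < policy.canary_pct else "v6"


policy = CanaryPolicy(
    canary_pct=100,
    allowed_regions=frozenset({"eu-west"}),
    excluded_segments=frozenset({"power_user"}),
)
print(choose_model("u1", "eu-west", "new_user", policy))    # v7
print(choose_model("u1", "us-east", "new_user", policy))    # v6
print(choose_model("u1", "eu-west", "power_user", policy))  # v6
```

Note that gating changes the population: metrics from an eligible-only canary should be compared against control users who pass the same gates, not against all control traffic.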
The Canary Schedule for ML Models
A safe canary schedule for a production ML model with meaningful business impact:
| Stage | Traffic % | Duration | What to Monitor |
|---|---|---|---|
| Baseline | 0% | Shadow run | Prediction distribution, latency |
| Initial | 1% | 24 hours | Error rate, p99 latency |
| Small | 5% | 48 hours | CTR, revenue, error rate |
| Medium | 20% | 48 hours | All metrics, statistical significance |
| Large | 50% | 24 hours | Confirm metrics hold at scale |
| Full | 100% | Ongoing | Standard production monitoring |
The four canary stages take 6 days (24 + 48 + 48 + 24 hours); together with the shadow run, expect roughly 8 days from shadow to full production. For high-impact models (fraud, payments), extend each stage. For low-risk models (email subject lines), compress the schedule.
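The schedule is easy to keep as data, which also makes the soak-time arithmetic explicit. The sketch below encodes only the four partial-traffic stages from the table (the shadow run and the final 100% stage have no fixed duration); `Stage`, `CANARY_STAGES`, and the `scale` parameter are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Stage:
    name: str
    traffic_pct: int
    soak_hours: int


CANARY_STAGES: List[Stage] = [
    Stage("Initial", 1, 24),
    Stage("Small", 5, 48),
    Stage("Medium", 20, 48),
    Stage("Large", 50, 24),
]


def total_soak_days(stages: List[Stage], scale: float = 1.0) -> float:
    """Total canary soak time in days; scale > 1 stretches every stage."""
    return sum(stage.soak_hours for stage in stages) * scale / 24


print(total_soak_days(CANARY_STAGES))       # 6.0
print(total_soak_days(CANARY_STAGES, 2.0))  # 12.0, e.g. for a high-impact model
```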
Production Engineering Notes
Version all model artifacts with the same identifier used in traffic routing: If your load balancer routes to model-version: v7, then your model's prediction logs, metrics, and traces must all contain model_version=v7. Mismatched identifiers make post-incident analysis much harder.
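Maintain rollback capacity during canary: During a 50-50 split each version serves only half the traffic, but if you need 10 replicas at full traffic you should still keep 10 replicas of v6 (so a rollback can absorb 100% of traffic immediately) and scale v7 to 10 replicas (so promotion to 100% does not wait on scale-up) - 20 total. Canary deployment temporarily doubles your compute footprint. Plan for this in your scaling policy.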
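The replica arithmetic can be sketched as follows, under two assumptions that are choices of this example rather than rules: the control version always keeps full capacity so it can absorb a rollback, and the canary is sized for the next traffic stage so promotion does not wait on scale-up. `plan_replicas` and its parameters are hypothetical names.

```python
from typing import Dict


def plan_replicas(
    full_traffic_replicas: int,
    next_stage_pct: int,
    min_replicas: int = 2,
) -> Dict[str, int]:
    """Replica counts for control (v6) and canary (v7) during a rollout."""
    if not 0 <= next_stage_pct <= 100:
        raise ValueError("next_stage_pct must be in [0, 100]")
    # Control keeps full capacity so a rollback can take 100% of traffic.
    control = full_traffic_replicas
    # Canary is sized for the stage it is about to enter (integer ceiling).
    canary = (full_traffic_replicas * next_stage_pct + 99) // 100
    canary = max(min_replicas, canary)
    return {"v6": control, "v7": canary, "total": control + canary}


# At the 50% stage, preparing for promotion to 100%:
print(plan_replicas(10, next_stage_pct=100))  # {'v6': 10, 'v7': 10, 'total': 20}
# At the 1% stage, preparing for 5%:
print(plan_replicas(10, next_stage_pct=5))    # {'v6': 10, 'v7': 2, 'total': 12}
```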
Smoke tests before traffic: Before routing any user traffic to the canary, run automated smoke tests against it: send 100 synthetic requests and verify all return 200 and produce output distributions within expected range. This catches the most obvious failures (model fails to load, crashes on specific input shapes) before users are involved.
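A smoke test along these lines might look like the sketch below. The call shape (a request dict in, an HTTP status and a score out), the mean-score range check, and the names `run_smoke_test` and `fake_canary` are all assumptions for illustration; a real test would call the canary endpoint and compare against ranges taken from the production model.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Assumed call shape: request dict in, (HTTP status, prediction score) out.
PredictFn = Callable[[Dict], Tuple[int, float]]


def run_smoke_test(
    predict: PredictFn,
    requests: Sequence[Dict],
    min_mean_score: float,
    max_mean_score: float,
) -> List[str]:
    """Return failure descriptions; an empty list means the canary passed."""
    failures: List[str] = []
    scores: List[float] = []
    for i, request in enumerate(requests):
        status, score = predict(request)
        if status != 200:
            failures.append(f"request {i}: HTTP {status}")
        else:
            scores.append(score)
    # Crude distribution check: the mean score must fall in the expected range.
    if scores:
        mean_score = sum(scores) / len(scores)
        if not (min_mean_score <= mean_score <= max_mean_score):
            failures.append(
                f"mean score {mean_score:.3f} outside "
                f"[{min_mean_score}, {max_mean_score}]"
            )
    return failures


def fake_canary(request: Dict) -> Tuple[int, float]:
    # Stand-in for an HTTP call to the canary endpoint.
    return 200, 0.5


synthetic_requests = [{"user_id": f"synthetic-{i}"} for i in range(100)]
print(run_smoke_test(fake_canary, synthetic_requests, 0.2, 0.8))  # []
```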
:::danger The False Safety of Low Traffic Percentages 1% of traffic is not "safe" - at 100K requests per second, 1% is 1,000 requests per second, or 86.4 million requests per day. A regression in the canary therefore touches up to 86.4 million requests per day. The canary percentage bounds the share of traffic exposed to a regression, not the absolute number of affected requests. At high traffic volumes, even 1% exposure is significant. :::
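Counting requests rather than users, the absolute exposure at a given canary percentage is simple arithmetic. The helper name `canary_requests_per_day` is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def canary_requests_per_day(total_qps: float, canary_pct: float) -> float:
    """Requests served by the canary in one day at a steady request rate."""
    return total_qps * canary_pct / 100 * SECONDS_PER_DAY


print(canary_requests_per_day(100_000, 1))  # 86400000.0
```

Translating requests into affected users requires the number of requests per user per day, which depends on your product.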
:::warning Statistical Significance Takes Time - Do Not Rush The most common canary mistake: declaring "no regression" after 4 hours at 1% traffic and immediately jumping to 100%. A 2% CTR regression on 1% of traffic for 4 hours does not have enough samples for statistical significance - you might have 500 samples when you need 10,000. Use proper power analysis to compute the minimum number of samples required before each promotion decision, and wait until that sample size is reached. :::
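A power analysis for a rate metric such as CTR can be sketched with the standard normal-approximation formula for comparing two proportions. This version assumes equal-sized groups and a two-sided test; in a canary the smaller group is the bottleneck, so read the result as the number of canary samples needed. `samples_per_variant` and its defaults are illustrative.

```python
import math
from statistics import NormalDist


def samples_per_variant(
    baseline_rate: float,
    relative_drop: float,
    alpha: float = 0.05,
    power: float = 0.8,
) -> int:
    """Samples per group needed to detect a relative drop in a rate metric."""
    if not 0 < baseline_rate < 1:
        raise ValueError("baseline_rate must be in (0, 1)")
    if not 0 < relative_drop < 1:
        raise ValueError("relative_drop must be in (0, 1)")
    p1 = baseline_rate
    p2 = baseline_rate * (1 - relative_drop)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / (p1 - p2) ** 2)


# Impressions per variant to detect a 2% relative drop from a 5% baseline CTR
print(samples_per_variant(baseline_rate=0.05, relative_drop=0.02))
```

The required sample size grows with the inverse square of the effect size, so small relative drops on a low base rate can require hundreds of thousands of impressions per variant, far more than a few hours at 1% traffic typically provides.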
:::warning Rollback is Not Free for ML Models When you roll back a model, any A/B experiment results logged during the canary period are confounded - some users saw v6, some saw v7, and the boundary is not clean. If your canary ran for 48 hours, you may need to exclude those 48 hours from any experiment analysis. Log model version on every request and expose it to your experiment analysis framework so confounded periods can be excluded. :::
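Excluding a confounded period is straightforward once every log record carries a timestamp and a model version. The sketch below assumes records are dictionaries with a `timestamp` field; `exclude_confounded` is a hypothetical helper and the dates are made up.

```python
from datetime import datetime
from typing import Dict, List, Tuple

Window = Tuple[datetime, datetime]  # (start inclusive, end exclusive)


def exclude_confounded(records: List[Dict], canary_windows: List[Window]) -> List[Dict]:
    """Keep only records whose timestamp falls outside every canary window."""

    def is_confounded(timestamp: datetime) -> bool:
        return any(start <= timestamp < end for start, end in canary_windows)

    return [r for r in records if not is_confounded(r["timestamp"])]


records = [
    {"timestamp": datetime(2024, 1, 1, 12), "model_version": "v6"},
    {"timestamp": datetime(2024, 1, 2, 12), "model_version": "v7"},
    {"timestamp": datetime(2024, 1, 4, 12), "model_version": "v6"},
]
canary_window = (datetime(2024, 1, 2), datetime(2024, 1, 4))  # a 48-hour canary
clean = exclude_confounded(records, [canary_window])
print(len(clean))  # 2
```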
Interview Q&A
Q: What is the difference between blue-green and canary deployment for ML models?
Blue-green keeps two complete environments (blue = production, green = new) and switches traffic instantly between them. It provides immediate rollback by flipping a switch, but the switch is all-or-nothing. It is best when you have already validated the new model thoroughly and want a clean instantaneous transition. Canary deployment gradually shifts traffic from old to new (1% → 5% → 20% → 50% → 100%), allowing you to measure business metrics at each stage and roll back before the full user base is exposed to a regression. For ML models, canary is generally preferred because model regressions are often statistical and require exposure across a large sample to detect.
Q: What metrics would you monitor during an ML model canary deployment?
Three categories: infrastructure metrics (error rate, p99 latency, GPU utilization - detectable within minutes), ML metrics (prediction score distribution, coverage, model-specific quality signals - detectable within hours), and business metrics (CTR, revenue, conversion rate, engagement - require 24-48 hours for statistical significance). Monitor all three. Infrastructure regressions warrant immediate rollback. Business metric regressions require statistical testing before rollback. The most expensive mistakes in ML deployments involve teams that only monitored infrastructure metrics and missed business regressions.
Q: How do you implement automated rollback in a canary deployment?
Define rollback triggers with thresholds before the canary starts: "error rate more than 2x control AND above 1% absolute," "p99 latency more than 20% worse than control," "business metric more than 2% relative decline." Run a monitoring loop that queries Prometheus every 60 seconds and applies these rules. If any trigger fires, call the Kubernetes API (or Istio VirtualService patch) to set canary weight to 0%, and send an alert. Critically, the rollback trigger logic must handle statistical noise - checking error rates over a 5-minute window rather than instantaneous values, to avoid false positives from transient spikes.
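Beyond windowed queries, a second guard against transient spikes is to require several consecutive breached checks before rolling back. This is a sketch of that idea, not part of the monitor described above; `BreachDebouncer` and the default of three checks are hypothetical.

```python
class BreachDebouncer:
    """Signals a rollback only after N consecutive breached checks."""

    def __init__(self, required_consecutive: int = 3):
        if required_consecutive < 1:
            raise ValueError("required_consecutive must be >= 1")
        self.required_consecutive = required_consecutive
        self.consecutive_breaches = 0

    def observe(self, breached: bool) -> bool:
        """Record one check result; return True when rollback should fire."""
        # A healthy check resets the streak; a breached one extends it.
        self.consecutive_breaches = self.consecutive_breaches + 1 if breached else 0
        return self.consecutive_breaches >= self.required_consecutive


debouncer = BreachDebouncer(required_consecutive=3)
checks = [True, True, False, True, True, True]
print([debouncer.observe(c) for c in checks])
# [False, False, False, False, False, True]
```

With a 60-second check interval, three consecutive breaches delay rollback by about two minutes, which is a reasonable trade for latency and CTR triggers but may be too slow for a hard error-rate spike.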
Q: Why should user assignment to canary vs control be stable across sessions?
If a user is randomly routed to canary or control on each request, some of their requests hit v7 and some hit v6. For models that affect sequential user experiences (recommendations, personalization), this creates a mixed experience that neither version would produce on its own. More importantly, it poisons your experiment analysis: the user's behavior in a v7 session is influenced by their v6 sessions and vice versa. Stable per-user routing ensures each user has a consistent experience and clean experiment data. Use a deterministic hash of the user ID to decide assignment without storing state.
Q: How long should a canary deployment run at each traffic stage before promoting?
Long enough to have statistically sufficient samples for your most sensitive business metric. For example, suppose the metric requires 10,000 samples per variant for 80% power at a 1% effect size, and you have 50,000 daily users with a 1% canary: 500 canary users per day means 20 days to reach 10,000 samples. In practice, increase the canary percentage to 5-10% quickly (after infrastructure metrics clear) to collect samples faster. A minimum of 24-48 hours at any stage is also required to cover full daily traffic variation patterns - a model that works fine for daytime traffic may fail for the different patterns of nighttime traffic.
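The arithmetic in this answer generalizes to a one-line helper. It assumes each day contributes fresh users to the canary, which is the same simplification the worked example makes; `days_to_reach_samples` is a hypothetical name.

```python
import math


def days_to_reach_samples(required_samples: int, daily_users: int, canary_pct: float) -> int:
    """Whole days of canary traffic needed to collect the required samples."""
    canary_users_per_day = daily_users * canary_pct / 100
    if canary_users_per_day <= 0:
        raise ValueError("canary must receive some traffic")
    return math.ceil(required_samples / canary_users_per_day)


print(days_to_reach_samples(10_000, 50_000, 1))   # 20
print(days_to_reach_samples(10_000, 50_000, 5))   # 4
print(days_to_reach_samples(10_000, 50_000, 10))  # 2
```

This is why moving from 1% to 5-10% early matters: the same sample target drops from weeks to days.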
