
Model Versioning and Canary Releases

The 3 AM Incident

It is 3:17 AM on a Tuesday. Your team shipped a new version of your customer-facing LLM three hours ago. At the time, all offline evals looked clean - ROUGE scores up, perplexity down, human preference ratings improved by 11 points in internal testing. You went to bed confident.

Your phone rings. Support is getting flooded. Users on the payment flow are reporting the assistant is refusing to acknowledge order confirmations, defaulting to vague responses that trigger confusion and churn. Your revenue dashboard shows a 23% drop in checkout completions in the last two hours. Someone pings you with a Slack screenshot: "Is the bot broken? It keeps saying it cannot access order details even when I paste them directly."

You pull up your deployment dashboard. The new model - call it llama-3-finetuned-v2.1 - went out as a full cutover. One hundred percent of traffic, instantly. There is no way to quickly route back without a full redeploy, which takes 18 minutes with your current setup. The rollback procedure requires re-pulling a 7B weight checkpoint from object storage, re-initializing the serving engine, and draining the load balancer. By the time it completes, you have lost $140,000 in GMV.

Post-mortem: the fine-tune improved general reasoning but degraded on a specific class of prompts involving structured data pasted inline - a pattern common in payment flows but rare in your eval set. Your offline evals never caught it because your eval distribution did not match production traffic. The new model was not strictly better. It was better on average, worse on a critical tail.

This incident, or something structurally identical to it, has happened at nearly every company that has deployed LLMs at scale. The core mistake is not fine-tuning poorly. The core mistake is treating model updates as binary - old model out, new model in - with no safety net in between. What you needed was a canary release: route 5% of traffic to the new model first, monitor real-user quality signals, and only promote when the data says it is safe.

This lesson covers exactly that: how to version models like software, how to deploy them with controlled traffic splits, how to detect regressions before they become 3 AM incidents, and how to roll back in seconds rather than minutes.


Why This Exists - The Problem With Binary Deployments

Before canary releases existed in software engineering, teams deployed the same way: take the old version down, put the new version up. The practice worked poorly even for stateless web services. For ML models, it is actively dangerous.

The fundamental problem is evaluation distribution mismatch. Your offline eval set is a sample. Production traffic is the real distribution, and it is richer, messier, and more adversarial than anything you can curate. A model that scores higher on your benchmark can still regress on important subpopulations of real users. You do not know which subpopulations matter until you see them fail in production.

The second problem is that model quality is not a single number. A higher average quality score can mask severe degradation on a specific intent class. When you serve 100% of traffic to a new model immediately, you have no comparison baseline running simultaneously. You are flying blind.

The third problem is rollback latency. A traditional deployment has a "big red button" rollback, but for LLMs that button takes 10-20 minutes to work because of checkpoint loading time. If you are losing $10,000 per minute in a payment flow, that is a $100,000-$200,000 mistake per incident.

Canary releases solve all three problems. By sending only a small fraction of traffic to the new model, you get real-production signal with bounded blast radius. You have the old model running simultaneously, giving you a live comparison baseline. And rollback means simply setting the canary weight back to zero - a configuration change that takes seconds.

This pattern was pioneered in software deployments at Google (see the SRE book, 2016) and Netflix, and was later adapted for ML model serving as the field matured. The key insight that transfers from software to ML is the same: never make a change that cannot be quickly reversed, and never make a change without measuring its effect in production.


Historical Context - From Big-Bang Deploys to Progressive Delivery

The term "canary release" comes from coal mining. Miners carried canaries into the mine as an early-warning system for carbon monoxide - if the canary died, miners evacuated before the gas reached lethal concentration for humans. In software, the "canary" is a small percentage of users or servers that receive the new version first. If the canary shows elevated error rates, you pull back before the damage spreads.

The practice entered mainstream software engineering around 2007-2010, pioneered by teams at Google, Amazon, and Facebook who were deploying hundreds of times per day and could not afford to validate every release against the full user base before shipping. It was later formalized in the continuous delivery literature of the early 2010s.

The adaptation to ML models happened later and more slowly. The first serious public writing on canary deploys for ML appeared around 2015-2017 from teams at Uber (Michelangelo), Google (TFX), and Facebook (FBLearner Flow). These teams recognized that model updates carried the same risks as code updates - maybe higher, because models are harder to reason about statically.

The "aha moment" for ML canary releases came when teams realized that a model is not just code - it is a function that maps inputs to outputs, and that function can change in subtle ways that no static analysis can catch. A code change that removes a conditional branch is auditable. A weight update that shifts 7 billion parameters is not. The only reliable oracle is production traffic.

By 2020, canary releases for ML were standard practice at large tech companies. The tooling ecosystem - Seldon Core, BentoML, Ray Serve, AWS SageMaker - had traffic-splitting primitives baked in. By 2024, even small teams deploying open-source models on single-GPU setups could implement basic canary patterns using nginx weighted upstreams or Kubernetes traffic splitting.


Core Concepts

Semantic Versioning for Models

Software uses semantic versioning: MAJOR.MINOR.PATCH. The same logic applies to models, but the meaning of each level shifts:

Major version - architecture change. A new base model (switching from Llama 3 to Mistral), a different context window size (4K to 128K), or a change in tokenizer vocabulary. Major versions are rarely backward-compatible in terms of prompt format or behavior expectations.

Minor version - fine-tune on new data or with a different objective. The architecture is the same, but the weights have shifted significantly. Behavior changes are expected and intentional.

Patch version - post-training modifications that should not change behavior substantially. Quantization (FP16 to INT4), GGUF format conversion, or a small targeted fix to a specific failure mode.

This versioning hierarchy has direct operational implications. A patch deploy (quantization) carries low risk and might not need a canary. A minor deploy (new fine-tune) should always go through canary. A major deploy (new base model) should be treated as a completely new service.

llama-3-8b-instruct-chat:
v1.0.0 - initial Llama 3 8B base fine-tune
v1.1.0 - fine-tune on 50k additional customer support examples
v1.1.1 - INT8 quantization of v1.1.0
v1.2.0 - fine-tune on 20k payment flow examples + DPO alignment
v2.0.0 - switch to Llama 3.1 architecture (128K context)

For weight versioning, Git LFS and DVC are the two main tools. Git LFS stores large binary files in a separate object store while keeping pointers in the git repo. DVC extends this with data pipeline tracking - you can version both the training data and the resulting model weights together, so you can always reconstruct which data produced which model.

The Canary Pattern

The canary pattern routes a configurable percentage of production traffic to a new model version while the remainder continues to hit the stable version. The key parameters are:

  • Canary weight: the fraction of traffic sent to the new version (typically 1-5% initially)
  • Promotion schedule: how you increase the weight over time (manual, automated, or time-gated)
  • Health metrics: what signals you monitor to decide whether to promote or roll back
  • Rollback trigger: the threshold at which automatic rollback fires

The mathematics of canary sizing is straightforward. If your service handles $N$ requests per hour and your canary weight is $w$, you observe $N \cdot w$ canary requests per hour. Detecting a quality regression of size $\delta$ (measured in, say, user satisfaction rate) with statistical confidence requires a sample size of approximately:

$$n \approx \frac{2 \cdot z^2 \cdot p(1-p)}{\delta^2}$$

where $p$ is the baseline quality rate and $z$ is the z-score for your desired confidence level (1.96 for 95%). For a baseline satisfaction rate of 0.85 and a minimum detectable degradation of 0.05, you need roughly 400 canary requests to detect the regression with 95% confidence. At a 5% canary weight on a 10,000 request/hour service, that is under an hour of canary traffic.

The important implication: canary weight and canary duration are a tradeoff between blast radius and detection speed. Lower canary weight means fewer users affected by a bad model, but slower detection. Higher canary weight means faster detection, but more users affected. The right answer depends on your service volume and your tolerance for user impact.
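
To make the sizing concrete, here is a minimal sketch of the calculation in Python. The function names and defaults are illustrative, not part of any canary tooling; it simply evaluates the formula above and converts the result into a wait time for a given traffic volume and canary weight.

import math

def canary_sample_size(baseline_rate: float, min_detectable_drop: float, z: float = 1.96) -> int:
    """Approximate sample size needed to detect a drop of min_detectable_drop
    from baseline_rate at the confidence level implied by z (1.96 ~ 95%)."""
    n = 2 * (z ** 2) * baseline_rate * (1 - baseline_rate) / (min_detectable_drop ** 2)
    return math.ceil(n)

def canary_wait_minutes(requests_per_hour: float, canary_weight: float, needed_samples: int) -> float:
    """How long the canary must run before it has seen needed_samples requests."""
    canary_requests_per_hour = requests_per_hour * canary_weight
    return 60.0 * needed_samples / canary_requests_per_hour

n = canary_sample_size(baseline_rate=0.85, min_detectable_drop=0.05)
minutes = canary_wait_minutes(requests_per_hour=10_000, canary_weight=0.05, needed_samples=n)
print(f"Need ~{n} canary requests; ~{minutes:.0f} minutes at this weight and traffic level")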

Shadow Mode

Shadow mode is a special deployment pattern where the new model receives every request, processes it, and logs the result - but the result is never returned to the user. The user always sees the stable model's response. Shadow mode has zero blast radius because users are never exposed to the new model's outputs.

Shadow mode is particularly valuable when you cannot define quality metrics automatically. If you need human evaluation of model outputs, you can run shadow mode for a week, collect a diverse sample of (stable output, shadow output) pairs, and send them to human raters before making any promotion decision.

The cost of shadow mode is doubled compute: every request is processed twice. For large models, this can be expensive. A common optimization is probabilistic shadow mode - only shadow a random 10% of requests rather than all of them.
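
Here is a minimal sketch of probabilistic shadow mode for an asyncio-based service. call_model, log_shadow_pair, STABLE_URL, and SHADOW_URL are placeholders for your own model client, logging pipeline, and endpoints; the point is the control flow - the user only ever sees the stable response, and a failing shadow call never affects the user path.

import asyncio
import random

SHADOW_SAMPLE_RATE = 0.10  # shadow only 10% of requests to bound GPU cost

async def handle_with_shadow(prompt: str) -> str:
    # The user-facing response always comes from the stable model.
    stable_response = await call_model(STABLE_URL, prompt)

    # Probabilistically mirror the request to the shadow model, fire-and-forget.
    if random.random() < SHADOW_SAMPLE_RATE:
        asyncio.create_task(run_shadow(prompt, stable_response))

    return stable_response

async def run_shadow(prompt: str, stable_response: str) -> None:
    try:
        shadow_response = await call_model(SHADOW_URL, prompt)
        # Log the (stable, shadow) pair for offline comparison or human rating.
        log_shadow_pair(prompt=prompt, stable=stable_response, shadow=shadow_response)
    except Exception:
        # A failing shadow model must never affect the user path.
        pass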

Feature Flags for Model Routing

Feature flags decouple deployment from rollout. You deploy the new model weights to all serving infrastructure, but a feature flag controls which users actually receive the new model's outputs. This lets you target specific user segments for the canary - for example, you might want to canary on users who have explicitly opted into beta features, or on internal employee accounts, before touching real customers.

A model router implementing feature flags evaluates a decision function per request:

route(request) -> model_version

The decision function can incorporate:

  • Random sampling (5% of requests)
  • User segment (employee accounts, beta users)
  • Request type (specific intent classes)
  • Geographic region (one datacenter gets canary traffic)
  • Time of day (only during business hours, so human monitoring is available)
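
A minimal sketch of such a decision function is below. The context fields (is_employee, beta_opt_in, intent) and the "payment" carve-out are illustrative assumptions, not a prescribed schema; the pattern is segment rules first, then a deterministic user-level hash compared against the canary weight.

import hashlib
from dataclasses import dataclass

@dataclass
class RequestContext:
    user_id: str
    is_employee: bool = False
    beta_opt_in: bool = False
    intent: str = "general"

def route(ctx: RequestContext, canary_weight: float) -> str:
    """Return 'canary' or 'stable' for this request."""
    # Internal accounts and opted-in beta users see the canary first.
    if ctx.is_employee or ctx.beta_opt_in:
        return "canary"
    # Keep the highest-risk intent class on the stable model entirely (illustrative rule).
    if ctx.intent == "payment":
        return "stable"
    # Everyone else: deterministic user-level hash into [0, 1), compared to the weight.
    digest = hashlib.md5(ctx.user_id.encode()).hexdigest()
    bucket = (int(digest, 16) % 10_000) / 10_000.0
    return "canary" if bucket < canary_weight else "stable"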

Blue-Green Deployments

Blue-green is a different pattern from canary. In blue-green, you maintain two complete, parallel serving stacks - blue (current) and green (new). All traffic routes to blue. When you are ready to promote, you flip the load balancer to send all traffic to green. Rollback means flipping back to blue.

Blue-green is simpler to reason about than canary but more expensive: you need double the infrastructure running simultaneously. For large model deployments, this can mean running two full GPU clusters. Most teams use blue-green for zero-downtime model updates (no request is dropped during the transition) combined with canary for gradual rollout.

The pattern looks like this: deploy the new model to green, run shadow mode against green for an hour, promote green to a 5% canary, increase to 20%, 50%, 100% over the promotion schedule, then decommission blue.
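
To make the flip concrete, here is a minimal in-process sketch of the blue-green switch. In practice the flip usually happens at the load balancer or Kubernetes Service selector rather than in application code; the names below are illustrative.

from dataclasses import dataclass

@dataclass
class Slot:
    name: str      # "blue" or "green"
    version: str
    url: str

class BlueGreenRouter:
    def __init__(self, blue: Slot, green: Slot):
        self.slots = {"blue": blue, "green": green}
        self.active = "blue"  # all traffic starts on blue

    def endpoint(self) -> Slot:
        return self.slots[self.active]

    def flip(self) -> None:
        """Atomically switch all traffic to the other slot; calling it again rolls back."""
        self.active = "green" if self.active == "blue" else "blue"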

Rollback Procedures

Rollback in a canary setup has three levels:

  1. Immediate canary halt: set canary weight to 0. This takes effect in seconds and stops sending new users to the degraded model. Users already in a session with the canary model may still get one or two more canary responses depending on session affinity, but the damage stops almost immediately.

  2. Full traffic revert: if the canary weight was already at 100% (a full promotion), rollback means deploying the previous version. This is slower (minutes), which is why you should never promote a canary to 100% without first verifying quality at 50%.

  3. Emergency model swap: if neither of the above is available (e.g., you had a bad deploy that overwrote your serving infrastructure), you fall back to your last known-good model artifact from DVC or model registry. This is the slowest path - potentially 10-30 minutes - and should be avoided by maintaining canary discipline.

The golden rule: never fully decommission the previous model version until you have run the new version at 100% canary weight for at least 24 hours and verified quality metrics across a full diurnal traffic cycle.



Code Examples

Model Version Tracking with DVC

# Install DVC with S3 backend
pip install dvc[s3]

# Initialize DVC in your model repo
cd model-artifacts/
dvc init

# Add remote storage
dvc remote add -d modelstore s3://your-bucket/models

# Track a model checkpoint
dvc add checkpoints/llama-3-8b-v1.2.0/

# Commit the .dvc pointer file
git add checkpoints/llama-3-8b-v1.2.0.dvc checkpoints/.gitignore
git commit -m "feat: add model v1.2.0 fine-tuned on payment flow data"
git tag v1.2.0

# Push weights to remote
dvc push

# Later: reproduce any version
git checkout v1.1.0
dvc pull

Nginx Weighted Canary Routing

# /etc/nginx/conf.d/llm-proxy.conf

upstream llm_stable {
    server llm-stable:8000;
    # v1.1.0 - gets 95% of traffic
}

upstream llm_canary {
    server llm-canary:8001;
    # v1.2.0 - gets 5% of traffic
}

# Split map: 95/5 split using $request_id hash
split_clients "${request_id}" $backend {
    95%     llm_stable;
    *       llm_canary;
}

server {
    listen 80;

    location /v1/chat/completions {
        proxy_pass http://$backend;

        # Pass the backend pool name upstream in a header for logging
        proxy_set_header X-Backend-Pool $backend;

        # Preserve original request
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}

Python Canary Router (Feature Flag Approach)

import hashlib
from typing import Optional
from dataclasses import dataclass


@dataclass
class ModelEndpoint:
    name: str
    version: str
    url: str
    weight: float  # 0.0 to 1.0


class CanaryRouter:
    """
    Routes requests to model versions based on configurable weights.
    Deterministic per user_id to avoid session-level flapping.
    """

    def __init__(self, stable: ModelEndpoint, canary: Optional[ModelEndpoint] = None):
        self.stable = stable
        self.canary = canary

    def route(self, user_id: str, request_id: str) -> ModelEndpoint:
        """
        Returns the model endpoint to use for this request.
        Uses user_id for deterministic routing (same user always
        gets same model version within a canary window).
        """
        if self.canary is None or self.canary.weight == 0.0:
            return self.stable

        # Deterministic hash for consistent user experience
        hash_input = f"{user_id}:canary_v2"
        hash_val = int(hashlib.md5(hash_input.encode()).hexdigest(), 16)
        bucket = (hash_val % 10000) / 10000.0  # 0.0 to 1.0

        if bucket < self.canary.weight:
            return self.canary
        return self.stable

    def set_canary_weight(self, weight: float) -> None:
        """Adjust canary traffic percentage (0.0 to 1.0)."""
        if self.canary:
            self.canary.weight = max(0.0, min(1.0, weight))


# Usage
stable = ModelEndpoint(
    name="llm-stable",
    version="v1.1.0",
    url="http://llm-stable:8000",
    weight=1.0,
)

canary = ModelEndpoint(
    name="llm-canary",
    version="v1.2.0",
    url="http://llm-canary:8001",
    weight=0.05,  # 5% canary
)

router = CanaryRouter(stable=stable, canary=canary)


def handle_request(user_id: str, request_id: str, prompt: str) -> dict:
    endpoint = router.route(user_id, request_id)

    response = call_model(endpoint.url, prompt)

    # Always log which model served this request
    log_request(
        user_id=user_id,
        request_id=request_id,
        model_version=endpoint.version,
        prompt_hash=hash_prompt(prompt),
        response=response,
    )

    return {
        "response": response,
        "model_version": endpoint.version,  # Return in header
    }

Automated Rollback Trigger

import time
from collections import defaultdict, deque
from threading import Thread
from typing import Optional


class QualityMonitor:
    """
    Computes rolling quality metrics per model version.
    Triggers automatic rollback if canary degrades below threshold.
    """

    def __init__(self, window_seconds: int = 300, rollback_threshold: float = 0.10):
        self.window_seconds = window_seconds
        self.rollback_threshold = rollback_threshold
        self.events = defaultdict(deque)  # version -> deque of (timestamp, score)

    def record(self, version: str, quality_score: float) -> None:
        """Record a quality event for a model version."""
        now = time.time()
        self.events[version].append((now, quality_score))
        # Evict old events outside the window
        cutoff = now - self.window_seconds
        while self.events[version] and self.events[version][0][0] < cutoff:
            self.events[version].popleft()

    def get_quality_rate(self, version: str) -> Optional[float]:
        """Returns mean quality score in the window, or None if insufficient data."""
        events = self.events[version]
        if len(events) < 50:  # Need minimum sample size
            return None
        scores = [score for _, score in events]
        return sum(scores) / len(scores)

    def check_canary_health(
        self,
        stable_version: str,
        canary_version: str,
        router: CanaryRouter,
    ) -> str:
        """
        Compare stable vs canary quality. Rollback if degradation exceeds threshold.
        Returns: 'healthy', 'insufficient_data', 'rolled_back'
        """
        stable_q = self.get_quality_rate(stable_version)
        canary_q = self.get_quality_rate(canary_version)

        if stable_q is None or canary_q is None:
            return "insufficient_data"

        degradation = stable_q - canary_q

        if degradation > self.rollback_threshold:
            print(
                f"ROLLBACK: canary {canary_version} degraded by {degradation:.3f} "
                f"(stable={stable_q:.3f}, canary={canary_q:.3f})"
            )
            router.set_canary_weight(0.0)
            alert_oncall(
                message=f"Canary rollback triggered: {canary_version}",
                degradation=degradation,
            )
            return "rolled_back"

        return "healthy"


def monitoring_loop(monitor: QualityMonitor, router: CanaryRouter) -> None:
    """Background thread: check canary health every 60 seconds."""
    while True:
        status = monitor.check_canary_health(
            stable_version="v1.1.0",
            canary_version="v1.2.0",
            router=router,
        )
        print(f"Canary health: {status}")
        time.sleep(60)

# Run the monitor alongside the serving process, e.g.:
# Thread(target=monitoring_loop, args=(monitor, router), daemon=True).start()


# Integrate quality scoring into request handling
def score_response(response: str, context: dict) -> float:
    """
    Lightweight quality score computed per response.
    Options: length-based heuristic, guardrail pass/fail,
    thumbs up/down from user, LLM-as-judge (async).
    """
    # Simplified: penalize very short or refusal-pattern responses
    if len(response) < 50:
        return 0.2
    refusal_patterns = ["I cannot", "I am unable", "I don't have access"]
    if any(p in response for p in refusal_patterns):
        return 0.4
    return 1.0

Kubernetes Blue-Green Model Deployment

# stable-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-stable
  labels:
    app: llm
    slot: stable
    version: v1.1.0
spec:
  replicas: 4
  selector:
    matchLabels:
      app: llm
      slot: stable
  template:
    metadata:
      labels:
        app: llm
        slot: stable
        version: v1.1.0
    spec:
      containers:
        - name: llm-server
          image: your-registry/vllm-server:latest
          env:
            - name: MODEL_NAME
              value: "your-org/llama-3-8b-v1.1.0"
            - name: MODEL_VERSION
              value: "v1.1.0"
          resources:
            limits:
              nvidia.com/gpu: "1"

---
# canary-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-canary
  labels:
    app: llm
    slot: canary
    version: v1.2.0
spec:
  replicas: 1  # Only 1 replica for 5% canary (vs 4 stable = ~20% by replica count)
  selector:
    matchLabels:
      app: llm
      slot: canary
  template:
    metadata:
      labels:
        app: llm
        slot: canary
        version: v1.2.0
    spec:
      containers:
        - name: llm-server
          image: your-registry/vllm-server:latest
          env:
            - name: MODEL_NAME
              value: "your-org/llama-3-8b-v1.2.0"
            - name: MODEL_VERSION
              value: "v1.2.0"
          resources:
            limits:
              nvidia.com/gpu: "1"

---
# Argo Rollouts for automated canary (preferred over manual replica math)
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: llm-rollout
spec:
  replicas: 5
  strategy:
    canary:
      steps:
        - setWeight: 5        # 5% for 10 minutes
        - pause: {duration: 10m}
        - setWeight: 25       # 25% for 30 minutes
        - pause: {duration: 30m}
        - setWeight: 50       # 50% for 1 hour
        - pause: {duration: 1h}
        - setWeight: 100      # Full promotion
      analysis:
        templates:
          - templateName: llm-quality-check
        startingStep: 1
  selector:
    matchLabels:
      app: llm
  template:
    metadata:
      labels:
        app: llm
    spec:
      containers:
        - name: llm-server
          image: your-registry/vllm-server:latest

A/B Testing Statistical Significance Check

from scipy import stats
import numpy as np


def check_ab_significance(
    stable_scores: list[float],
    canary_scores: list[float],
    alpha: float = 0.05,
) -> dict:
    """
    Two-sample t-test to determine if canary quality difference is
    statistically significant.

    Returns dict with:
        significant: bool
        p_value: float
        effect_size: float (Cohen's d)
        recommendation: str
    """
    n_stable = len(stable_scores)
    n_canary = len(canary_scores)

    if n_stable < 30 or n_canary < 30:
        return {
            "significant": False,
            "p_value": None,
            "effect_size": None,
            "recommendation": "insufficient_data",
        }

    t_stat, p_value = stats.ttest_ind(stable_scores, canary_scores)

    # Cohen's d for effect size
    pooled_std = np.sqrt(
        (np.std(stable_scores) ** 2 + np.std(canary_scores) ** 2) / 2
    )
    effect_size = (np.mean(canary_scores) - np.mean(stable_scores)) / pooled_std

    is_significant = p_value < alpha
    canary_is_better = np.mean(canary_scores) > np.mean(stable_scores)

    if is_significant and canary_is_better:
        recommendation = "promote"
    elif is_significant and not canary_is_better:
        recommendation = "rollback"
    else:
        recommendation = "continue_monitoring"

    return {
        "significant": is_significant,
        "p_value": p_value,
        "effect_size": effect_size,
        "stable_mean": np.mean(stable_scores),
        "canary_mean": np.mean(canary_scores),
        "recommendation": recommendation,
    }


# Example usage
result = check_ab_significance(
    stable_scores=[0.82, 0.91, 0.78, ...],  # quality scores from stable model
    canary_scores=[0.85, 0.88, 0.81, ...],  # quality scores from canary model
)

print(f"Recommendation: {result['recommendation']}")
print(f"p-value: {result['p_value']:.4f}")
print(f"Effect size (Cohen's d): {result['effect_size']:.3f}")

Production Engineering Notes

Session affinity matters more than you think. If a user is mid-conversation with the stable model and your canary routing is purely random per-request, they may see the stable model for turn 1, the canary for turn 2, and back to stable for turn 3. This creates jarring behavioral inconsistencies that users notice and attribute to bugs. Always route at the session level, not the request level. Hash on user_id or session_id, not request_id.

Log model version on every single request. This sounds obvious, but production systems frequently ship without this, and when something goes wrong you cannot separate stable from canary signals. Add an X-Model-Version response header and log it in your access logs. Store the model version alongside every response in your logging pipeline.

Define quality metrics before you deploy, not after. Your canary rollback trigger needs a metric to watch. If you do not have quality metrics instrumented before the deploy, you are flying blind. Minimum acceptable metrics: response latency (p50, p95, p99), refusal rate, user thumbs-down rate (if you have UI feedback), guardrail failure rate. Ideal: an async LLM-as-judge scoring pipeline running on a 10% sample of traffic.

Canary weight and replica count must be consistent. If you run 1 canary replica and 9 stable replicas and use random routing without proper weight configuration, you will get 10% canary traffic (from replica-based round-robin), not 5%. Use explicit weight configuration in your load balancer or proxy layer rather than relying on replica count math.

Always test rollback before you need it. Before every major canary deploy, run a fire drill: set canary weight to 5%, confirm metrics are flowing, then set canary weight back to 0 and confirm the rollback logs appear. A rollback that you have never exercised will fail at 3 AM.
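
A minimal sketch of such a fire drill, using the CanaryRouter from the code examples above. The sample_user_ids list and the assertions are stand-ins for whatever verification your monitoring stack actually supports (for example, confirming per-version request counters moved).

def canary_fire_drill(router: CanaryRouter, sample_user_ids: list[str]) -> None:
    """Exercise the ramp-up and rollback paths before the real deploy window."""
    # Ramp up: with a few hundred sample users, some should land on the canary at 5%.
    router.set_canary_weight(0.05)
    routed = [router.route(uid, request_id="drill").version for uid in sample_user_ids]
    assert any(v == router.canary.version for v in routed), "canary received no traffic"

    # Roll back: after setting the weight to 0, nobody should hit the canary.
    router.set_canary_weight(0.0)
    routed = [router.route(uid, request_id="drill").version for uid in sample_user_ids]
    assert all(v == router.stable.version for v in routed), "rollback did not take effect"
    print("fire drill passed: ramp-up and rollback both verified")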

Shadow mode requires compute budget planning. Running every request through two models doubles your GPU usage. Budget for this in your capacity planning. If GPU capacity is tight, use probabilistic shadow mode (10-20% of requests) rather than full shadow mode. The statistical power is lower but the cost is proportional.

DVC pull times matter during incidents. If your rollback procedure requires pulling model weights from DVC remote (S3/GCS), you need to know how long that takes. A 14B parameter FP16 model is ~28GB. On a 1 Gbps connection that is 224 seconds - nearly 4 minutes. Cache weights on your serving nodes rather than pulling them during rollback. The previous version's weights should always be on disk.


Common Mistakes

:::danger Deploying Canary Without Quality Metrics Instrumented

The single most common failure: teams implement traffic splitting correctly but never define what "quality" means or instrument it. You have 5% of traffic going to the canary model, but you have no metric to tell you whether it is performing well or not. You watch p99 latency and error rate, neither of which shows the behavioral regression. Three days later you promote to 100% and then get user complaints.

Fix: before any canary deploy, you must have at minimum: (1) a per-model-version quality signal logged for every response (even if it is just a simple heuristic like refusal rate), (2) a dashboard showing stable vs canary quality side by side, (3) an alert configured to fire if canary quality drops below stable by more than N%.

:::

:::danger Full Cutover Without Rollback Plan

Deploying a new model directly to 100% of traffic with no way to quickly revert is not a deployment strategy - it is gambling. The mitigation is not optional. Even if you are very confident in the new model, you must have: (a) the previous version's weights cached on disk, (b) a tested procedure that can revert traffic in under 2 minutes, (c) on-call engineer awareness during the deploy window.

:::

:::warning Routing at Request Level Instead of Session Level

If you hash on request_id for canary routing, a user in a multi-turn conversation will switch between model versions mid-session. The stable model set up context in turn 1. The canary model responds in turn 2 without that context-setting being part of its training. The result is inconsistent behavior that users experience as bugs. Always hash on user_id or session_id so a given user always hits the same model version for the duration of their session.

:::

:::warning Promoting Before Statistical Significance

Monitoring canary quality for 10 minutes at 5% weight and then promoting to 100% is not enough data. At a 5% canary weight on a 1,000 request/hour service, you get only about 8 canary requests in 10 minutes - nowhere near enough to detect a 5% quality regression with statistical confidence. Use the sample size formula, define your minimum detectable effect, and do not promote until you have collected sufficient data.

:::

:::warning Ignoring Tail Metrics

A canary model can have a better mean quality score but a much worse p99 quality score on a specific intent class. Mean metrics hide tail failures. Always monitor per-intent-class quality breakdown when possible, and always look at the distribution of quality scores, not just the mean. A model that is great on 95% of requests but catastrophically bad on 5% can have an acceptable-looking mean.

:::


Interview Q&A

Q: What is the difference between a canary release, a blue-green deployment, and A/B testing in the context of LLM serving?

A: These are three distinct patterns that are often confused. A canary release is about risk management: you gradually increase traffic to a new model version while monitoring quality, ready to roll back if things go wrong. The goal is safe promotion, not comparison. Blue-green is about operational continuity: you maintain two complete serving stacks and flip all traffic from one to the other atomically, giving you zero-downtime deployments and instant rollback. The goal is operational availability. A/B testing is about measurement: you intentionally run two model versions simultaneously, collect quality metrics on both, and use statistical tests to determine which is better. The goal is scientific comparison. In practice, you often combine all three: use blue-green for zero-downtime infrastructure, canary for gradual rollout, and A/B testing to analyze the quality data collected during the canary window.

Q: How do you define "quality" for LLM canary monitoring, and what automated signals can you use when you do not have human raters available in real time?

A: This is the hardest part of LLM canary releases, and there is no single right answer. The most reliable signal is direct user feedback - thumbs up/down, explicit ratings - but this requires UI support and has high latency. For automated signals that work without human raters: (1) refusal rate - if the new model is refusing more requests than the old one, that is an immediate red flag; (2) response length distribution - a model that suddenly generates very short responses may be degenerating; (3) guardrail failure rate - if you have a safety classifier running as a filter, a spike in filtered responses signals something changed; (4) downstream task completion rate - if your assistant helps users complete purchases, monitor purchase completion rate segmented by model version; (5) LLM-as-judge scoring - run a lightweight judge model asynchronously on a sample of responses to score relevance and quality. The key is to define these metrics before the deploy, not scramble for them after a regression.

Q: Walk me through how you would implement automatic canary rollback. What are the failure modes?

A: The basic pattern is a monitoring loop that runs on a fixed interval (say, every 60 seconds), computes the rolling quality metric for both stable and canary versions, and fires a rollback if canary quality has degraded beyond a threshold. Implementation: (1) log model version on every request and response, (2) compute quality score per response (automated heuristic or async judge), (3) maintain a sliding window of (timestamp, model_version, quality_score) tuples, (4) when canary window has at least N samples (minimum for statistical reliability), compare canary mean vs stable mean, (5) if degradation exceeds threshold, call the routing layer to set canary weight to 0 and alert on-call. Failure modes: insufficient canary traffic makes the window too sparse to trigger (fix: lower minimum sample threshold or increase canary weight); quality metric has high variance causing false positives (fix: use longer window, require sustained degradation over multiple check intervals); routing layer update fails silently (fix: verify weight change took effect after setting it); the on-call alert fires but no one acts (fix: automatic action first, alert second, not the other way around).

Q: Why is session-level routing important for canary deployments, and how do you implement it?

A: Session-level routing ensures that a user always sees the same model version for the duration of their conversation. If you route per-request randomly, a user in a multi-turn conversation can get stable model for turn 1, canary for turn 2, stable for turn 3. Each model has different behavioral tendencies, tone, and knowledge - jumping between them mid-conversation creates jarring inconsistencies that users experience as bugs rather than model degradation. It also corrupts your quality metrics, because the session-level quality signal (did the user complete their task?) is being influenced by two different models in an uncontrolled way. Implementation: hash the session_id or user_id deterministically into a bucket (0-9999), and assign canary if bucket < canary_weight * 10000. Because the hash is deterministic, the same user always lands in the same bucket, giving consistent routing across all turns of their session. The tradeoff is that your canary cohort is fixed (same users always in canary), which can introduce selection effects. For long-running canary windows, rotate the hash seed periodically.

Q: How do you version model weights alongside training data, and why does this matter for incident response?

A: Model weights without their training data provenance are like a binary without source code - you can run it but you cannot reason about why it behaves the way it does. DVC (Data Version Control) is the standard tool for this: it tracks both the dataset and the resulting model artifact in the same version commit, so you can always answer "what data produced this model?" For incident response, this matters in two ways. First, if you need to roll back, you want to understand why the canary model is worse - is it a data quality issue, a fine-tuning configuration issue, or something more fundamental? The DVC lineage gives you a starting point. Second, if you need to retrain from a known-good state, you need to know exactly which dataset version produced your last stable model. Without this, a re-train might inadvertently use contaminated or updated data and produce a different model than the one you are trying to recover. The practical implementation: every training run should commit a DVC-tracked artifact that includes both the model checkpoint and a pointer to the exact dataset version used, tagged with a semantic version that matches what your serving infrastructure deploys.

Q: What is shadow mode, and when would you use it instead of a live canary?

A: Shadow mode runs the new model on real production requests but never returns its outputs to users - the user always sees the stable model's response. The new model's outputs are logged for offline analysis. Shadow mode is useful in several situations. When you cannot define automated quality metrics - if you need human evaluation of model outputs, shadow mode lets you collect a diverse production sample before any users see the new model. When the stakes of a bad response are very high - for a medical or legal assistant, even 0.1% of users seeing a degraded response might be unacceptable; shadow mode lets you validate extensively with zero user risk. When you are evaluating a significantly different model (major version change) where the behavioral differences are large enough that you want to review outputs manually before any promotion. The cost of shadow mode is doubled compute for every shadowed request, since you are running both models in parallel. To manage cost, use probabilistic shadow mode - shadow only 10-20% of requests rather than all of them. The statistical power is lower but usually sufficient for detecting large behavioral shifts. Shadow mode does not replace canary; it is the step before canary in high-stakes deployments.


Real-World Canary Runbook

A runbook is a step-by-step procedure document that you execute during a canary deployment. The template below is representative of what production teams use. Having it written down before the deploy is not bureaucracy - it is the difference between calm, methodical execution and improvised decision-making at 2 AM.

Pre-Deploy Checklist

CANARY DEPLOY CHECKLIST - Model Version: ___________
Date: ____________ Engineer: ____________

PRE-DEPLOY
[ ] Model artifact tagged and pushed to model registry (DVC)
[ ] Previous model version weights cached on all serving nodes
[ ] Quality metrics dashboard configured with stable vs canary split
[ ] Rollback trigger threshold defined and configured in monitor
[ ] On-call engineer notified of deploy window
[ ] Rollback procedure tested in staging environment this week

DEPLOY
[ ] Canary deployment applied (verify: kubectl get deployment llm-canary)
[ ] Canary weight set to 5% in router config
[ ] Verify canary is receiving traffic (check logs: grep "v1.2.0" access.log | wc -l)
[ ] Verify stable is still receiving majority traffic
[ ] Confirm X-Model-Version header appears in response samples

MONITOR (30 minutes at 5% weight)
[ ] Canary p99 latency within 20% of stable
[ ] Canary refusal rate not elevated vs stable
[ ] Canary error rate not elevated vs stable
[ ] No spike in user thumbs-down rate on canary traffic
[ ] Quality score comparison updated with at least 200 canary samples

PROMOTE OR ROLLBACK
If all checks pass:
[ ] Increase canary weight to 25%
[ ] Monitor for 2 hours
[ ] Increase to 50%, monitor for 6 hours
[ ] Increase to 100%
[ ] Monitor for 24 hours before decommissioning stable

If any check fails:
[ ] Set canary weight to 0%
[ ] Verify traffic returned to stable (check: grep "v1.2.0" access.log - should stop)
[ ] Page oncall if automated rollback did not fire
[ ] Write incident report within 24 hours

POST-DEPLOY (after 24h at 100%)
[ ] Decommission stable deployment (kubectl delete deployment llm-stable)
[ ] Archive stable weights (do NOT delete - keep for 30 days minimum)
[ ] Update model version in service registry
[ ] Close deploy ticket

Monitoring Queries (Prometheus / Grafana)

# Per-version request rate
sum(rate(http_requests_total{model_version=~"v1.*"}[5m])) by (model_version)

# Per-version latency p99
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket{job="llm-proxy"}[5m]))
  by (le, model_version)
)

# Refusal rate by model version
sum(rate(llm_refusal_total[5m])) by (model_version)
/
sum(rate(http_requests_total[5m])) by (model_version)

# Quality score comparison (requires custom metric from quality monitor)
avg(llm_quality_score) by (model_version)

# Canary traffic fraction
sum(rate(http_requests_total{model_version="v1.2.0"}[5m]))
/
sum(rate(http_requests_total[5m]))

Alerting Rules

# prometheus/alerts/canary.yaml
groups:
  - name: canary_deployment
    rules:

      - alert: CanaryQualityDegradation
        expr: |
          (
            avg(llm_quality_score{model_version=~".*canary.*"}) -
            avg(llm_quality_score{model_version=~".*stable.*"})
          ) < -0.10
        for: 5m
        labels:
          severity: critical
          action: rollback
        annotations:
          summary: "Canary model quality degraded by >10% vs stable"
          description: |
            Canary quality: {{ $value | humanize }}
            Trigger automatic rollback via rollback webhook.

      - alert: CanaryHighRefusalRate
        expr: |
          (
            sum(rate(llm_refusal_total{model_version=~".*canary.*"}[5m]))
            /
            sum(rate(http_requests_total{model_version=~".*canary.*"}[5m]))
          ) > 0.15
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "Canary refusal rate above 15%"

      - alert: CanaryHighLatency
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket{model_version=~".*canary.*"}[5m]))
            by (le)
          ) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Canary p99 latency above 10 seconds"

Summary

Model versioning and canary releases are not nice-to-haves for LLM production deployments - they are the safety infrastructure that separates teams that ship with confidence from teams that dread deploy nights. The core ideas are: version your models semantically so changes have clear expectations, deploy incrementally to bound blast radius, monitor real-production quality signals (not just infra metrics), and build rollback into the deploy process so it is automatic and fast.

The 3 AM incident described at the top of this lesson is preventable. A 5% canary with a refusal-rate monitor and a 10-minute detection window would have caught the regression before it touched the other 95% of users. That is the only thing standing between a confident team and a $140,000 incident - a routing weight and a monitoring loop.

Build the canary infrastructure before you need it. Test the rollback before you need it. Define the quality metrics before you deploy. Then sleep soundly on deploy nights.
