Shadow Deployment for Safe Model Releases
The Production Scenario
Your fraud detection team has trained a new model - GBM v7 - that shows 12% better precision at the same recall in offline evaluation. The business case is clear: if it works in production, it prevents roughly $2M in monthly fraud losses. The risk is also clear: fraud models operate at the boundary of blocking legitimate transactions. A model that is too aggressive blocks real customers. A model that is subtly miscalibrated produces confidence scores that look right but shift in unexpected ways for specific customer segments. Offline evaluation on historical data, no matter how careful, cannot catch every failure mode.
You need to see GBM v7 running on real production traffic - real transactions, real velocity patterns, real device fingerprints - before you let it make actual decisions. You need it to run long enough to observe rare transaction patterns that appear only a few times per week. And you need to do all of this without any real transaction being decided by the new model until you are confident.
This is exactly what shadow deployment solves. GBM v7 runs in shadow mode: it receives the same requests as the production model (GBM v6), it runs inference on every request, and it logs its predictions. But its predictions are never served to users. Actual fraud decisions are still made by GBM v6. Meanwhile, you accumulate a rich dataset of side-by-side comparisons: GBM v6's prediction versus GBM v7's prediction, for every transaction that ran through production.
After 72 hours of shadow running, you have predictions for several hundred thousand transactions. You can compute disagreement rates, distribution shifts, and segmented analysis across customer cohorts, device types, and transaction amounts. You find most of the technical and statistical problems a canary deployment would surface, without a single user-facing decision being made by the new model.
Why This Exists - The Failure of Offline Evaluation
Offline evaluation on a held-out test set is necessary but not sufficient for deploying ML models. It fails to catch several important classes of production bugs:
Distribution shift: Production traffic is never exactly like training data. Users change behavior, seasonal patterns emerge, new device types appear. A model that scores 0.94 AUC on a test set from 3 months ago may perform differently on traffic from today.
Data pipeline bugs: The feature computation in the training pipeline may differ subtly from the feature computation in the serving pipeline. Shadow deployment runs the serving pipeline on real traffic, which catches these discrepancies. If GBM v7 makes unexpected predictions, you can inspect the features it actually received.
Rare input distributions: Fraud patterns for large transactions (over $10,000) may occur only 0.01% of the time. A test set of 100,000 examples contains only ~10 such transactions - too few for statistical confidence. Production shadow traffic keeps accumulating them for as long as the shadow runs: at 0.01%, every million mirrored transactions adds roughly 100 more.
Latency under production load: Offline evaluation does not measure inference latency under production concurrency. Shadow mode runs the new model under real traffic patterns. Latency regressions are caught before they affect the SLA.
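The rare-segment arithmetic above is easy to redo for your own traffic. The following is a minimal sketch, not part of the original pipeline: the helper names `expected_rare_count` and `requests_needed` are hypothetical, and the rate is expressed as "one in N" to keep the arithmetic exact.

```python
def expected_rare_count(n_requests: int, one_in: int) -> float:
    """Expected number of rare-segment requests among n_requests.

    `one_in` expresses the rate as "one in N" (0.01% is one in 10,000).
    """
    return n_requests / one_in


def requests_needed(target_count: int, one_in: int) -> int:
    """Mirrored requests needed before the expected rare count reaches target_count."""
    return target_count * one_in


# A 100,000-example test set at 0.01% holds about 10 large transactions.
test_set_rare = expected_rare_count(100_000, one_in=10_000)

# Reaching 1,000 such transactions in shadow takes about 10 million mirrored requests.
needed_for_1000 = requests_needed(1_000, one_in=10_000)
```

Dividing `needed_for_1000` by your daily request volume gives a rough lower bound on how long the shadow has to run before a rare segment is worth analyzing on its own.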
Historical Context
Shadow deployment, also called "dark launch" or "shadow mode," was popularized by teams at Facebook and Twitter around 2010-2012. The original context was not ML but distributed systems testing: you could run a rewrite of a service in parallel with the production service, send the same requests to both, and compare responses. If the responses diverged, you had a bug in the rewrite.
The technique was later adapted for ML model deployment at companies like Stripe (fraud), Airbnb (pricing), and DoorDash (ETA prediction). The specific requirements for ML shadow deployment are more nuanced than for service testing: you are not comparing exact equality of responses (two well-calibrated models will give different probabilities) but rather statistical properties - prediction distributions, disagreement rates, latency distributions, and performance metrics on the ground-truth labels when they arrive days later.
Core Concepts: How Traffic Mirroring Works
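Shadow deployment requires that every request sent to the production model is also sent to the shadow model. The challenge is to do this without adding latency to the production path. The standard architecture uses an asynchronous mirroring proxy: it forwards each request to the production model, returns that response to the caller, and sends a copy of the request to the shadow model on the side.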
The critical design constraint: the shadow model must not be in the critical path of the production response. The user must never wait for the shadow model to complete. This is achieved by dispatching the shadow request asynchronously (fire and forget) immediately after the production model completes, or in parallel with the production model.
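To make the constraint concrete, here is a small standard-library sketch of the "in parallel" variant. It assumes stand-in models: `fake_model`, `handle`, and the delays are hypothetical placeholders for real HTTP calls, chosen only to show that the caller gets the production answer while the slower shadow call is still running.

```python
import asyncio
import time

shadow_log: list[dict] = []
# Hold strong references so pending shadow tasks are not garbage-collected.
_background: set[asyncio.Task] = set()


async def fake_model(name: str, delay_s: float) -> dict:
    """Stand-in for a model endpoint; sleeps instead of doing inference."""
    await asyncio.sleep(delay_s)
    return {"model": name, "score": 0.5}


async def _run_shadow(payload: dict) -> None:
    try:
        shadow_log.append(await fake_model("shadow", delay_s=0.3))
    except Exception:
        pass  # a shadow failure must never reach the production path


async def handle(payload: dict) -> dict:
    # Start the shadow call first so it runs in parallel, but never await it here.
    task = asyncio.create_task(_run_shadow(payload))
    _background.add(task)
    task.add_done_callback(_background.discard)
    # Only the production call is awaited before responding.
    return await fake_model("production", delay_s=0.02)


async def demo() -> tuple[dict, float, int]:
    start = time.perf_counter()
    response = await handle({"amount": 42})
    elapsed = time.perf_counter() - start
    logged_at_response = len(shadow_log)  # shadow has not finished yet
    await asyncio.gather(*_background)    # demo only: let the shadow call finish
    return response, elapsed, logged_at_response
```

Running `asyncio.run(demo())` returns the production response after roughly the production delay, with nothing yet in `shadow_log` at that moment. The shadow record appears only after the response has been produced.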
The Three Shadow Data Uses
Shadow mode gives you three categories of data:
- Prediction comparison (immediate): For each request, you have GBM v6's prediction and GBM v7's prediction. You can compute disagreement rates, prediction distribution overlap, and segmented disagreement by feature buckets.
- Latency comparison (immediate): For each request, you record the latency of both models. You can detect latency regressions before they hit the SLA.
- Retrospective performance (delayed): When ground truth labels arrive (the fraud outcome for each transaction, typically 24-72 hours later), you can compute the new model's precision, recall, and AUC on production traffic - not historical test data.
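The delayed, retrospective step can be sketched as a join between logged scores and late-arriving labels. This is an illustrative example only: the function name `precision_recall`, the 0.5 threshold, and the toy scores and labels are assumptions, and a real pipeline would read both sides from storage keyed by `request_id`.

```python
def precision_recall(
    scores: dict[str, float],
    labels: dict[str, bool],
    threshold: float = 0.5,
) -> tuple[float, float, int]:
    """Join scores to ground-truth labels by request_id.

    Returns (precision, recall, number of labeled requests that had a score).
    """
    tp = fp = fn = matched = 0
    for request_id, is_fraud in labels.items():
        score = scores.get(request_id)
        if score is None:
            continue  # request was not mirrored, or the shadow call failed
        matched += 1
        flagged = score >= threshold
        if flagged and is_fraud:
            tp += 1
        elif flagged:
            fp += 1
        elif is_fraud:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall, matched


# Toy data: scores logged during shadow, labels that arrived days later.
production_scores = {"a": 0.9, "b": 0.3, "c": 0.6, "d": 0.7, "e": 0.1}
shadow_scores = {"a": 0.9, "b": 0.8, "c": 0.2, "d": 0.7, "e": 0.1}
fraud_labels = {"a": True, "b": True, "c": False, "d": False, "e": False, "f": True}

production_metrics = precision_recall(production_scores, fraud_labels)
shadow_metrics = precision_recall(shadow_scores, fraud_labels)
```

On this toy data the shadow model reaches precision 2/3 and recall 1.0, against 1/3 and 0.5 for production. Request `f` has a label but no logged score, so it is excluded from both. The numbers only illustrate the mechanics; they say nothing about real models.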
Implementation: Shadow Proxy
# shadow_proxy.py
import asyncio
import time
import json
import uuid
import logging
import httpx
from dataclasses import dataclass, asdict
from typing import Optional
from fastapi import FastAPI, Request, Response
import redis.asyncio as aioredis

app = FastAPI()
logger = logging.getLogger(__name__)

PRODUCTION_MODEL_URL = "http://model-v6:8080"
SHADOW_MODEL_URL = "http://model-v7:8080"
SHADOW_LOG_KEY = "shadow:predictions"

# Strong references to in-flight shadow tasks. The event loop keeps only weak
# references, so an unreferenced task can be garbage-collected before it finishes.
_shadow_tasks: set[asyncio.Task] = set()


@dataclass
class ShadowRecord:
    """Stored for each request - production vs shadow comparison."""
    request_id: str
    timestamp: float
    # Production model
    production_prediction: Optional[float]
    production_label: Optional[str]
    production_latency_ms: float
    # Shadow model (prediction and label are None when the shadow call failed)
    shadow_prediction: Optional[float]
    shadow_label: Optional[str]
    shadow_latency_ms: float
    # Computed
    prediction_diff: float
    labels_agree: bool
    # Request context (for segmented analysis)
    feature_snapshot: dict


redis_client: Optional[aioredis.Redis] = None
http_client: Optional[httpx.AsyncClient] = None


@app.on_event("startup")
async def startup():
    global redis_client, http_client
    redis_client = aioredis.from_url("redis://localhost:6379")
    http_client = httpx.AsyncClient(timeout=httpx.Timeout(1.0))


@app.on_event("shutdown")
async def shutdown():
    await http_client.aclose()
    # aclose() requires redis-py 5.0.1+; older versions use close()
    await redis_client.aclose()


async def call_model(
    url: str,
    payload: dict,
) -> tuple[Optional[dict], float]:
    """Call a model endpoint, return (response, latency_ms)."""
    start = time.perf_counter()
    try:
        response = await http_client.post(
            f"{url}/predict",
            json=payload,
            timeout=0.5,  # per-request timeout, overrides the client default
        )
        response.raise_for_status()
        latency_ms = (time.perf_counter() - start) * 1000
        return response.json(), latency_ms
    except Exception as e:
        # On failure, latency_ms is the time until the error, not inference time
        latency_ms = (time.perf_counter() - start) * 1000
        logger.warning(f"Model call failed: {url}: {e}")
        return None, latency_ms


async def log_shadow_record(record: ShadowRecord):
    """Append shadow record to Redis list for async consumption."""
    try:
        await redis_client.rpush(
            SHADOW_LOG_KEY,
            json.dumps(asdict(record)),
        )
        # Keep only last 1M records
        await redis_client.ltrim(SHADOW_LOG_KEY, -1_000_000, -1)
    except Exception as e:
        logger.error(f"Failed to log shadow record: {e}")


@app.post("/predict")
async def predict(request: Request):
    """
    Shadow proxy endpoint.
    Returns production model result immediately.
    Fires shadow model call asynchronously - never in critical path.
    """
    payload = await request.json()
    request_id = str(uuid.uuid4())

    # Call production model - THIS IS IN THE CRITICAL PATH
    production_result, production_latency = await call_model(
        PRODUCTION_MODEL_URL, payload
    )
    if production_result is None:
        return Response(status_code=502, content="Production model unavailable")

    # Fire shadow call asynchronously - NOT in critical path
    # The production response returns to the user before shadow completes
    task = asyncio.create_task(
        _shadow_and_log(
            request_id=request_id,
            payload=payload,
            production_result=production_result,
            production_latency=production_latency,
        )
    )
    _shadow_tasks.add(task)
    task.add_done_callback(_shadow_tasks.discard)

    # Return production result - user never waits for shadow
    return production_result


async def _shadow_and_log(
    request_id: str,
    payload: dict,
    production_result: dict,
    production_latency: float,
):
    """
    Runs asynchronously after production response is sent.
    Calls shadow model and logs comparison.
    """
    try:
        shadow_result, shadow_latency = await call_model(SHADOW_MODEL_URL, payload)

        production_pred = production_result.get("fraud_probability", 0.0)
        production_label = production_result.get("label", "unknown")
        shadow_pred = shadow_result.get("fraud_probability", 0.0) if shadow_result else None
        shadow_label = shadow_result.get("label", "unknown") if shadow_result else None

        record = ShadowRecord(
            request_id=request_id,
            timestamp=time.time(),
            production_prediction=production_pred,
            production_label=production_label,
            production_latency_ms=production_latency,
            shadow_prediction=shadow_pred,
            shadow_label=shadow_label,
            shadow_latency_ms=shadow_latency,
            # Compare against None explicitly: a shadow score of 0.0 is a valid
            # prediction, not a failed call
            prediction_diff=(
                abs(production_pred - shadow_pred) if shadow_pred is not None else 1.0
            ),
            labels_agree=(
                (production_label == shadow_label) if shadow_label is not None else False
            ),
            feature_snapshot={
                # Capture key features for segmentation analysis
                "amount_bucket": _bucket_amount(payload.get("amount", 0)),
                "user_age_days": payload.get("user_age_days"),
                "device_type": payload.get("device_type"),
            },
        )
        await log_shadow_record(record)
    except Exception as e:
        # Nothing awaits this task, so log errors here instead of losing them
        logger.error(f"Shadow comparison failed for {request_id}: {e}")


def _bucket_amount(amount: float) -> str:
    if amount < 10:
        return "micro"
    if amount < 100:
        return "small"
    if amount < 1000:
        return "medium"
    if amount < 10000:
        return "large"
    return "whale"
Shadow Analysis: Evaluating Comparison Results
# shadow_analysis.py
import json
import pandas as pd
from scipy import stats
import redis
from typing import Optional


class ShadowAnalyzer:
    """Analyze shadow deployment data to decide whether to graduate the model."""

    def __init__(self, redis_client: redis.Redis):
        self.redis = redis_client

    def load_shadow_records(
        self,
        start_time: Optional[float] = None,
        limit: int = 100_000,
    ) -> pd.DataFrame:
        """Load shadow records from Redis into a DataFrame."""
        raw_records = self.redis.lrange("shadow:predictions", -limit, -1)
        records = [json.loads(r) for r in raw_records]
        df = pd.DataFrame(records)
        if len(df) and "feature_snapshot" in df.columns:
            # Expand the nested feature_snapshot dict into top-level columns
            # (amount_bucket, user_age_days, device_type) so they can be grouped on
            features = pd.DataFrame(df["feature_snapshot"].tolist(), index=df.index)
            df = pd.concat([df.drop(columns="feature_snapshot"), features], axis=1)
        if start_time and len(df):
            df = df[df["timestamp"] >= start_time]
        return df

    def compute_agreement_metrics(self, df: pd.DataFrame) -> dict:
        """
        Core shadow analysis: how much do production and shadow agree?

        Metrics:
        - Label agreement rate: % of requests where both models predict the same class
        - Mean absolute prediction diff: average |prod_score - shadow_score|
        - Kolmogorov-Smirnov test: are the prediction distributions the same?
        - Agreement by segment: disaggregated by feature buckets
        """
        # Rows with a null shadow_prediction are failed shadow calls
        valid = df.dropna(subset=["shadow_prediction"])
        if valid.empty:
            raise ValueError("No successful shadow predictions to analyze")

        label_agreement = valid["labels_agree"].mean()
        mean_abs_diff = valid["prediction_diff"].mean()
        p95_diff = valid["prediction_diff"].quantile(0.95)

        # KS test: are the two prediction distributions statistically identical?
        # With hundreds of thousands of samples even tiny shifts are significant,
        # so read the statistic (effect size) alongside the p-value.
        ks_stat, ks_pvalue = stats.ks_2samp(
            valid["production_prediction"],
            valid["shadow_prediction"],
        )

        # Segmented analysis - critical for catching model regressions
        # that only affect specific subgroups
        by_amount = valid.groupby("amount_bucket").agg(
            label_agreement=("labels_agree", "mean"),
            mean_diff=("prediction_diff", "mean"),
            count=("request_id", "count"),
        ).to_dict("index")

        return {
            "n_total_requests": len(df),
            "n_shadow_successful": len(valid),
            "shadow_failure_rate": 1 - len(valid) / len(df) if len(df) else 0,
            "label_agreement_rate": label_agreement,
            "mean_absolute_diff": mean_abs_diff,
            "p95_absolute_diff": p95_diff,
            "ks_statistic": ks_stat,
            "ks_pvalue": ks_pvalue,
            "distributions_differ_p01": ks_pvalue < 0.01,
            "by_amount_bucket": by_amount,
        }

    def compute_latency_comparison(self, df: pd.DataFrame) -> dict:
        """Compare latency distributions between production and shadow models."""
        p50_prod = df["production_latency_ms"].quantile(0.50)
        p95_prod = df["production_latency_ms"].quantile(0.95)
        p99_prod = df["production_latency_ms"].quantile(0.99)

        # shadow_latency_ms is recorded even for failed calls, so filter on the
        # prediction to keep only latencies of successful shadow inferences
        valid_shadow = df.dropna(subset=["shadow_prediction"])
        p50_shadow = valid_shadow["shadow_latency_ms"].quantile(0.50)
        p95_shadow = valid_shadow["shadow_latency_ms"].quantile(0.95)
        p99_shadow = valid_shadow["shadow_latency_ms"].quantile(0.99)

        return {
            "production": {"p50": p50_prod, "p95": p95_prod, "p99": p99_prod},
            "shadow": {"p50": p50_shadow, "p95": p95_shadow, "p99": p99_shadow},
            "latency_regression": p99_shadow > p99_prod * 1.1,  # >10% regression
        }

    def graduation_verdict(self, metrics: dict, latency: dict) -> dict:
        """
        Return a clear go/no-go verdict for graduating shadow to canary.

        Graduation criteria:
        - Label agreement > 95% (both models mostly agree)
        - Mean absolute diff < 0.05 (predictions are close)
        - No subgroup with agreement < 90% (no hidden regressions)
        - Shadow latency p99 within 10% of production
        - Shadow failure rate < 1%
        """
        issues = []

        if metrics["label_agreement_rate"] < 0.95:
            issues.append(
                f"Label agreement {metrics['label_agreement_rate']:.1%} < 95%"
            )
        if metrics["mean_absolute_diff"] > 0.05:
            issues.append(
                f"Mean prediction diff {metrics['mean_absolute_diff']:.3f} > 0.05"
            )
        if metrics["shadow_failure_rate"] > 0.01:
            issues.append(
                f"Shadow failure rate {metrics['shadow_failure_rate']:.1%} > 1%"
            )

        for bucket, bucket_metrics in metrics["by_amount_bucket"].items():
            if bucket_metrics["count"] > 100:  # Only flag statistically meaningful
                if bucket_metrics["label_agreement"] < 0.90:
                    issues.append(
                        f"Agreement in '{bucket}' bucket: "
                        f"{bucket_metrics['label_agreement']:.1%} < 90%"
                    )

        if latency["latency_regression"]:
            shadow_p99 = latency["shadow"]["p99"]
            prod_p99 = latency["production"]["p99"]
            issues.append(
                f"Latency regression: shadow p99 {shadow_p99:.1f}ms vs "
                f"production {prod_p99:.1f}ms"
            )

        return {
            "ready_for_canary": len(issues) == 0,
            "issues": issues,
            "recommendation": "Graduate to canary" if not issues else "Fix issues before graduating",
        }
Shadow Deployment for LLMs
Shadow mode for LLMs requires different comparison logic. You are comparing text outputs, not probability scores:
# llm_shadow_analysis.py
from sentence_transformers import SentenceTransformer
import numpy as np


class LLMShadowComparator:
    """
    Compare LLM shadow outputs using semantic similarity.

    Exact string match is meaningless for LLMs - "Paris" and
    "The answer is Paris." are semantically equivalent.
    Use embedding cosine similarity to measure agreement.
    """

    def __init__(self, embedding_model: str = "all-MiniLM-L6-v2"):
        self.encoder = SentenceTransformer(embedding_model)

    def compute_similarity(
        self, production_output: str, shadow_output: str
    ) -> float:
        """Returns cosine similarity clipped to [0, 1]. Values near 1.0 = near-identical meaning."""
        embeddings = self.encoder.encode([production_output, shadow_output])
        cosine_sim = float(
            np.dot(embeddings[0], embeddings[1])
            / (np.linalg.norm(embeddings[0]) * np.linalg.norm(embeddings[1]))
        )
        # Clip to [0, 1] - cosine can be slightly negative for unrelated texts
        return min(1.0, max(0.0, cosine_sim))

    def batch_compare(
        self,
        production_outputs: list[str],
        shadow_outputs: list[str],
    ) -> np.ndarray:
        """Compute similarities for a batch. More efficient than one at a time."""
        if len(production_outputs) != len(shadow_outputs):
            raise ValueError("production_outputs and shadow_outputs must be the same length")

        all_texts = production_outputs + shadow_outputs
        all_embeddings = self.encoder.encode(all_texts)
        n = len(production_outputs)
        prod_embeddings = all_embeddings[:n]
        shadow_embeddings = all_embeddings[n:]

        # Vectorized cosine similarity: normalize each row, then take row-wise dot products
        prod_norms = np.linalg.norm(prod_embeddings, axis=1, keepdims=True)
        shadow_norms = np.linalg.norm(shadow_embeddings, axis=1, keepdims=True)
        similarities = np.sum(
            (prod_embeddings / prod_norms) * (shadow_embeddings / shadow_norms),
            axis=1,
        )
        return np.clip(similarities, 0.0, 1.0)
Production Engineering Notes
Shadow compute cost: Shadow mode roughly doubles your inference compute - you are running two models on every request. This is the intended tradeoff: safety at the cost of compute. To reduce cost, shadow a fraction of traffic (20-50%) rather than 100%. The extra cost scales with the sampled fraction (half at 50%, a fifth at 20%), and you still get statistically meaningful data as long as every segment you care about collects enough samples.
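One way to shadow only a fraction of traffic is to hash a request key into a bucket and compare it with the sampling rate. The sketch below is an assumption, not something the proxy above does: `should_shadow` is a hypothetical helper, and hashing by `request_id` is only one choice of key.

```python
import hashlib


def should_shadow(request_id: str, sample_rate: float) -> bool:
    """Deterministically decide whether to mirror this request to the shadow model.

    The same request_id always gets the same decision, so retries and log
    replays stay consistent.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a float in [0, 1)
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate


# Rough check of the realized sampling fraction on synthetic ids
sampled = sum(should_shadow(f"req-{i}", 0.3) for i in range(10_000))
realized_fraction = sampled / 10_000
```

Hashing a user or account id instead of the request id would keep each sampled user's full transaction sequence in the shadow data, which may matter for velocity-style features. Which key fits depends on how the segments are analyzed.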
Shadow queue depth: Because shadow calls are async (fire and forget), if your shadow model is slow, the background task queue can grow. Monitor the queue depth and set a maximum: if the queue exceeds N tasks, drop new shadow requests rather than accumulating unbounded lag.
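A bounded dispatcher is one way to apply that drop policy. The following is a sketch under assumptions: `BoundedShadowDispatcher` and `max_in_flight` are hypothetical names, and the limit counts in-flight tasks in a single process rather than a shared queue.

```python
import asyncio
from typing import Any, Callable, Coroutine


class BoundedShadowDispatcher:
    """Run shadow calls in the background, dropping new ones when too many are in flight."""

    def __init__(self, max_in_flight: int):
        self.max_in_flight = max_in_flight
        self.dropped = 0
        self._tasks: set[asyncio.Task] = set()

    def submit(self, make_coro: Callable[[], Coroutine[Any, Any, None]]) -> bool:
        """Schedule a shadow call; return False if it was dropped.

        Takes a factory so no coroutine object is created for dropped requests.
        Must be called from inside a running event loop.
        """
        if len(self._tasks) >= self.max_in_flight:
            self.dropped += 1  # export this counter as a metric in a real service
            return False
        task = asyncio.create_task(make_coro())
        self._tasks.add(task)
        task.add_done_callback(self._tasks.discard)
        return True

    async def drain(self) -> None:
        """Wait for in-flight shadow calls (useful at shutdown or in tests)."""
        if self._tasks:
            await asyncio.gather(*self._tasks, return_exceptions=True)


async def demo() -> tuple[list[bool], int]:
    dispatcher = BoundedShadowDispatcher(max_in_flight=2)

    async def slow_shadow_call() -> None:
        await asyncio.sleep(0.05)  # stand-in for a slow shadow model

    accepted = [dispatcher.submit(slow_shadow_call) for _ in range(5)]
    await dispatcher.drain()
    return accepted, dispatcher.dropped
```

In the demo, five submissions arrive before any shadow call finishes, so the first two are accepted and the remaining three are dropped. Dropped requests reduce the shadow sample but never add lag or memory pressure to the proxy.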
Graduation criteria must be decided before shadowing starts: Do not decide "is this good enough?" by looking at the data - that is p-hacking. Define your graduation criteria in advance: "label agreement above 95%, latency regression less than 10%, no subgroup below 90% agreement, minimum 72 hours and 50K samples."
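One lightweight way to enforce this is to commit the criteria as an immutable object before the shadow run starts. This is a sketch using the example thresholds quoted above; `GraduationCriteria` and `unmet_criteria` are hypothetical names, not part of the analyzer shown earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GraduationCriteria:
    """Thresholds fixed before shadowing starts; frozen so they cannot be edited later."""
    min_label_agreement: float = 0.95
    max_latency_regression: float = 0.10
    min_subgroup_agreement: float = 0.90
    min_duration_hours: float = 72.0
    min_samples: int = 50_000


def unmet_criteria(
    criteria: GraduationCriteria,
    *,
    label_agreement: float,
    latency_regression: float,
    worst_subgroup_agreement: float,
    duration_hours: float,
    n_samples: int,
) -> list[str]:
    """Return the names of the criteria that the observed metrics fail."""
    checks = {
        "label_agreement": label_agreement >= criteria.min_label_agreement,
        "latency_regression": latency_regression <= criteria.max_latency_regression,
        "subgroup_agreement": worst_subgroup_agreement >= criteria.min_subgroup_agreement,
        "duration_hours": duration_hours >= criteria.min_duration_hours,
        "n_samples": n_samples >= criteria.min_samples,
    }
    return [name for name, passed in checks.items() if not passed]


CRITERIA = GraduationCriteria()  # commit this to version control before the run
```

Because the dataclass is frozen, any attempt to adjust a threshold after seeing the data raises an error instead of silently changing the bar. Changing the criteria then requires a visible code change.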
:::warning Shadow Data is Not Ground Truth
Shadow predictions are compared to production predictions, not to ground truth. Two models can disagree and both be wrong. Only when ground truth labels arrive (fraud outcomes, actual clicks, etc.) can you compute the shadow model's true metrics. Design your shadow analysis to handle both: immediate comparison metrics and deferred performance metrics.
:::
:::danger Shadow Traffic Mirroring at the Infrastructure Level
Some teams implement shadow mode by having the application code call two models. This is fragile - the shadow call is in the same process as the production call, and a shadow model crash can affect the production response. Better: implement traffic mirroring at the infrastructure level using Istio's traffic mirroring feature or Nginx's mirror module. This isolates the shadow completely from the production path.
:::
Interview Q&A
Q: What is shadow deployment and when would you use it instead of a canary deployment?
Shadow deployment runs a new model on real traffic in parallel with the production model, but never serves the shadow model's predictions to users. Use it when: you want maximum safety before any user exposure, you need to validate model behavior on rare real-world patterns not well-represented in test data, you want to catch data pipeline bugs (training-serving skew), or you need to validate latency under production load. Use canary when you have already validated in shadow mode and want to measure downstream business metrics (CTR, revenue) that require real user responses. Shadow mode catches technical and statistical bugs; canary catches business metric regressions.
Q: How do you compare shadow model outputs without ground truth labels?
Without labels, you can only compare the shadow model to the production model - not to the true outcome. Useful metrics: label agreement rate (do both models make the same decision at your threshold?), prediction score distribution comparison (KS test or histogram overlap), mean absolute difference in predicted probabilities, and segmented comparison across feature slices (age bucket, transaction size, device type). These metrics tell you how different the two models are, not which is better. To know which is better, you must wait for ground truth labels and compute offline metrics retrospectively on the shadow predictions.
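Histogram overlap is simple enough to compute by hand. The sketch below assumes scores in [0, 1] and ten equal-width bins. `histogram_overlap` is a hypothetical helper, and the bin count is an arbitrary choice.

```python
def histogram_overlap(a: list[float], b: list[float], n_bins: int = 10) -> float:
    """Overlap coefficient of two score histograms on [0, 1].

    1.0 means the binned distributions are identical, 0.0 means they share no bin.
    """
    if not a or not b:
        raise ValueError("both score lists must be non-empty")

    def normalized_hist(scores: list[float]) -> list[float]:
        counts = [0] * n_bins
        for s in scores:
            idx = min(int(s * n_bins), n_bins - 1)  # a score of 1.0 goes in the last bin
            counts[idx] += 1
        return [c / len(scores) for c in counts]

    hist_a, hist_b = normalized_hist(a), normalized_hist(b)
    # Sum the shared probability mass bin by bin
    return sum(min(x, y) for x, y in zip(hist_a, hist_b))


# Toy scores: production splits evenly between low and high, shadow skews high
production_scores = [0.15, 0.15, 0.95, 0.95]
shadow_scores = [0.15, 0.95, 0.95, 0.95]
overlap = histogram_overlap(production_scores, shadow_scores)
```

Here the two histograms share 0.25 of their mass in the low bin and 0.5 in the high bin, for an overlap of 0.75. Unlike a KS p-value, this number does not shrink toward "significant" just because the sample is large, so the two are useful to read together.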
Q: What is the minimum shadow duration and sample size before graduating to canary?
Rules of thumb: minimum 72 hours (to cover several full daily traffic cycles; extend to at least a full week if weekday/weekend patterns matter), minimum 50,000 shadow predictions (for statistical significance in segmented analysis), and minimum 1,000 samples in each important subgroup you want to validate separately. For seasonal businesses or fraud models with rare high-value transaction patterns, you may need longer - 1-2 weeks to observe enough rare events. Define these thresholds before running the shadow, not after - looking at data and then choosing thresholds based on what you see is a form of p-hacking.
Q: How would you handle a shadow model that is consistently 50ms slower than production?
A 50ms latency regression in shadow mode does not affect users (shadow is not in the critical path), but it is a strong signal that the new model has a latency problem that will affect the SLA if it is promoted to production. Investigate: Is the model architecture larger? Is it missing quantization or TorchScript compilation? Is it hitting a cold GPU path that production does not? Fix the latency issue before graduating to canary. If the latency cannot be fixed within the required budget, the model cannot be deployed - a model that is more accurate but breaks the SLA is worse than the current model.
Q: How would you use shadow deployment for a new LLM, and how does it differ from shadowing a classifier?
LLM shadow comparison cannot use exact match or score difference - outputs are free-form text. Instead: (1) Use semantic similarity (embedding cosine similarity) to measure how much the shadow and production outputs mean the same thing. (2) Sample outputs manually at a set rate (e.g., 100 per day) for human quality review. (3) Measure format compliance (does the shadow output follow required JSON structure, word count limits, tone guidelines). (4) Measure latency and token count - LLMs that generate more tokens are slower and more expensive. (5) For task-specific LLMs (code generation, SQL), run the outputs through automated evaluation (execute the code, run the SQL) and measure pass rate. Graduation criteria should include human reviewer approval in addition to automated metrics.
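Format compliance, point (3) above, is the easiest of these checks to automate. The sketch below assumes a hypothetical output contract: a JSON object with required keys and a word limit on an `answer` field. The helper names and the contract itself are illustrative, not a standard.

```python
import json


def is_format_compliant(
    output: str,
    required_keys: set[str],
    max_answer_words: int,
) -> bool:
    """Check one LLM output against an assumed JSON contract."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False  # not valid JSON at all
    if not isinstance(parsed, dict) or not required_keys.issubset(parsed.keys()):
        return False
    answer = parsed.get("answer")
    return isinstance(answer, str) and len(answer.split()) <= max_answer_words


def compliance_rate(
    outputs: list[str],
    required_keys: set[str],
    max_answer_words: int,
) -> float:
    """Fraction of outputs that satisfy the contract."""
    if not outputs:
        return 0.0
    passed = sum(
        is_format_compliant(o, required_keys, max_answer_words) for o in outputs
    )
    return passed / len(outputs)


# Toy shadow outputs: one compliant, three failing for different reasons
shadow_outputs = [
    '{"answer": "Paris", "confidence": 0.9}',
    "The answer is Paris.",                              # not JSON
    '{"answer": "Paris"}',                               # missing key
    '{"answer": "one two three four", "confidence": 1}', # too long
]
rate = compliance_rate(shadow_outputs, {"answer", "confidence"}, max_answer_words=3)
```

Computing the same rate for production and shadow outputs on identical requests gives a direct comparison: a shadow model with higher semantic similarity but a lower compliance rate would still break downstream parsers.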
