Shadow Mode Testing
The Model That Crashed in Silence
The fraud detection team had built a new gradient boosting model that outperformed the existing neural network on every offline metric. Precision improved from 0.89 to 0.93. Recall held steady. AUC on the holdout set was 0.97. The model was ready, and the team was confident.
They scheduled a direct 50/50 A/B deployment for Monday morning. The deployment ran. The new model started serving traffic. Within 45 minutes, the on-call engineer received a PagerDuty alert: transaction decline rate had jumped from 2.3% to 11.8%. The new model was flagging legitimate transactions as fraudulent at five times the normal rate.
Rollback took 20 minutes. During those 65 minutes, thousands of legitimate customers had their purchases declined. Several called customer support. Some abandoned their carts permanently. The revenue impact was quantifiable in the hundreds of thousands.
The cause: production traffic had drifted away from the training distribution between when the model was trained and when it was deployed. Specifically, a new payment provider had been integrated two weeks earlier, and transactions from that provider used a slightly different feature encoding. The model had never seen this encoding during training, and the offline holdout set did not include this provider's traffic, so the model misclassified those transactions as fraud.
Shadow mode testing would have caught this. The new model would have run against all production traffic in the shadows for a week, with its predictions logged but not acted upon. The divergence between shadow predictions and live predictions would have been immediately visible. The team would have investigated, found the encoding mismatch, fixed it, and deployed a correct model.
Why Shadow Mode Testing Exists
A/B testing answers the question: "Does this model produce better outcomes for users?" Shadow mode testing answers a different question: "Does this model behave correctly on production traffic without breaking anything?"
These are not the same question. A model can have excellent offline metrics and still:
- Fail silently on certain input patterns that did not appear in training data
- Produce latency spikes on specific query shapes (long documents, rare character sets)
- Generate output format violations that downstream systems cannot parse
- Behave correctly for 99% of users and catastrophically for 1%
- Consume 3x the memory of the existing model under load
Shadow mode catches these production-specific failures before they affect users. It is the safety layer between offline evaluation and live A/B testing.
How Shadow Mode Works
In shadow mode, every incoming request is processed twice: once by the live model (whose response is returned to the user) and once by the shadow model (whose response is logged and discarded). Users see only the live model's output. The shadow model processes real production traffic in parallel, with no user-facing consequences.
The key requirement: the shadow model request must be asynchronous and non-blocking. If the shadow model is slow or throws an exception, it must not affect the user-facing response from the live model. The shadow path is fire-and-forget.
import json
import logging
import time
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Any, Optional

logger = logging.getLogger(__name__)


@dataclass
class ModelPrediction:
    model_id: str
    prediction: Any
    latency_ms: float
    error: Optional[str] = None
    request_id: str = ""
    timestamp: float = field(default_factory=time.time)


class ShadowModeRouter:
    """
    Routes production traffic to both live and shadow models.
    Shadow model runs asynchronously - never blocks the user response.
    """

    def __init__(self, live_model, shadow_model, prediction_logger):
        self.live_model = live_model
        self.shadow_model = shadow_model
        self.logger = prediction_logger
        # The executor's work queue is unbounded: if the shadow model is
        # consistently slower than the live model, the backlog grows in memory.
        self.executor = ThreadPoolExecutor(max_workers=16, thread_name_prefix="shadow")

    def predict(self, request: dict, request_id: str) -> Any:
        """
        Get prediction from live model (synchronous, user-facing).
        Trigger shadow model asynchronously in background.
        """
        # 1. Get live model prediction - this is what the user receives
        start = time.perf_counter()
        try:
            live_result = self.live_model.predict(request)
        except Exception as e:
            # Live model failed - log it and propagate the error to the user.
            # No shadow call is made for a failed live request.
            live_latency = (time.perf_counter() - start) * 1000
            logger.error(
                f"Live model error for request {request_id} "
                f"after {live_latency:.1f} ms: {type(e).__name__}: {e}"
            )
            raise
        live_latency = (time.perf_counter() - start) * 1000
        live_prediction = ModelPrediction(
            model_id="live_v2",
            prediction=live_result,
            latency_ms=live_latency,
            request_id=request_id,
        )

        # 2. Trigger shadow model - fire and forget, never block user response
        self.executor.submit(
            self._run_shadow_model,
            request=request,
            request_id=request_id,
            live_prediction=live_prediction,
        )

        # 3. Return live model result immediately
        return live_result

    def _run_shadow_model(self, request: dict, request_id: str,
                          live_prediction: ModelPrediction) -> None:
        """
        Run shadow model and log comparison. Exceptions are caught and logged,
        never propagated to the caller.
        """
        start = time.perf_counter()
        try:
            shadow_result = self.shadow_model.predict(request)
            shadow_latency = (time.perf_counter() - start) * 1000
            shadow_prediction = ModelPrediction(
                model_id="shadow_v3",
                prediction=shadow_result,
                latency_ms=shadow_latency,
                request_id=request_id,
            )
        except Exception as e:
            shadow_latency = (time.perf_counter() - start) * 1000
            shadow_prediction = ModelPrediction(
                model_id="shadow_v3",
                prediction=None,
                latency_ms=shadow_latency,
                # Keep the exception type: str(e) alone drops it
                # (for a KeyError it is just the missing key).
                error=f"{type(e).__name__}: {e}",
                request_id=request_id,
            )
            logger.warning(f"Shadow model error for request {request_id}: {e}")

        # Log comparison for analysis
        try:
            self.logger.log_comparison(live_prediction, shadow_prediction)
        except Exception as e:
            # Never let logging failures affect anything
            logger.error(f"Failed to log shadow comparison: {e}")


class ShadowComparisonLogger:
    """
    Logs live vs shadow predictions to a message queue or data warehouse.
    """

    def __init__(self, event_sink):
        self.event_sink = event_sink  # Kafka producer, BigQuery client, etc.

    def log_comparison(self, live: ModelPrediction, shadow: ModelPrediction):
        event = {
            "request_id": live.request_id,
            "timestamp": live.timestamp,
            "live": {
                "model_id": live.model_id,
                "prediction": live.prediction,
                "latency_ms": live.latency_ms,
                "error": live.error,
            },
            "shadow": {
                "model_id": shadow.model_id,
                "prediction": shadow.prediction,
                "latency_ms": shadow.latency_ms,
                "error": shadow.error,
            },
            # A failed shadow call has no prediction to compare, so it is
            # counted as an error rather than as a divergence.
            "diverged": shadow.error is None and live.prediction != shadow.prediction,
            "shadow_error": shadow.error is not None,
            "live_error": live.error is not None,
        }
        # default=str keeps non-JSON-native predictions from breaking the log call
        self.event_sink.send(json.dumps(event, default=str))
What to Measure in Shadow Mode
Shadow mode is only useful if you have a disciplined analysis pipeline. The raw comparison logs need to become actionable signals.
Prediction Divergence
The fraction of requests where the shadow model produces a different prediction than the live model. Some divergence is expected (the shadow model is supposed to be different). But unexpected patterns in divergence, such as disagreements concentrated in one feature group or a rate that climbs over time, reveal problems:
from collections import Counter

import pandas as pd


def analyze_shadow_divergence(shadow_logs: pd.DataFrame) -> dict:
    """
    Comprehensive divergence analysis for shadow mode logs.
    shadow_logs columns: request_id, timestamp, live_prediction, shadow_prediction,
                         shadow_error, live_error, feature_group, user_segment
    Here shadow_error and live_error are the boolean flags written by the logger.
    """
    total = len(shadow_logs)
    shadow_errors = shadow_logs["shadow_error"].sum()
    diverged = shadow_logs["live_prediction"] != shadow_logs["shadow_prediction"]
    results = {
        "total_requests": total,
        "shadow_error_rate": shadow_errors / total if total else 0.0,
        "divergence_rate": diverged.mean(),
    }

    # Divergence by feature group - where are the disagreements concentrated?
    if "feature_group" in shadow_logs.columns:
        by_group = (
            diverged.groupby(shadow_logs["feature_group"])
            .mean()
            .sort_values(ascending=False)
        )
        results["divergence_by_feature_group"] = by_group.to_dict()

    # For classification: disagreement matrix
    # live says class A, shadow says class B - how often?
    if shadow_logs["live_prediction"].dtype == object:
        disagreements = shadow_logs.loc[diverged, ["live_prediction", "shadow_prediction"]]
        disagreement_pairs = Counter(zip(
            disagreements["live_prediction"],
            disagreements["shadow_prediction"],
        ))
        results["top_disagreement_patterns"] = dict(
            disagreement_pairs.most_common(10)
        )

    # Time-based drift: is divergence rate changing over time?
    # groupby sorts its keys, so the result is in chronological order.
    hour = pd.to_datetime(shadow_logs["timestamp"], unit="s").dt.floor("h")
    hourly_divergence = diverged.groupby(hour).mean()
    n_hours = len(hourly_divergence)
    results["divergence_trend"] = {
        "first_hour": hourly_divergence.iloc[0] if n_hours > 0 else None,
        "last_hour": hourly_divergence.iloc[-1] if n_hours > 0 else None,
        # Compare the mean of the last three hours against the first three
        "trending_up": bool(
            hourly_divergence.iloc[-3:].mean() > hourly_divergence.iloc[:3].mean()
        ) if n_hours >= 6 else None,
    }
    return results


def analyze_latency_impact(shadow_logs: pd.DataFrame) -> dict:
    """
    Compare shadow model latency against live model.
    Key question: if we ship the shadow model, will latency get worse?
    Requires columns: live_latency_ms, shadow_latency_ms
    """
    live_p50 = shadow_logs["live_latency_ms"].quantile(0.50)
    live_p95 = shadow_logs["live_latency_ms"].quantile(0.95)
    live_p99 = shadow_logs["live_latency_ms"].quantile(0.99)
    shadow_p50 = shadow_logs["shadow_latency_ms"].quantile(0.50)
    shadow_p95 = shadow_logs["shadow_latency_ms"].quantile(0.95)
    shadow_p99 = shadow_logs["shadow_latency_ms"].quantile(0.99)
    return {
        "live": {"p50": live_p50, "p95": live_p95, "p99": live_p99},
        "shadow": {"p50": shadow_p50, "p95": shadow_p95, "p99": shadow_p99},
        "latency_regression": {
            "p50_delta_pct": (shadow_p50 - live_p50) / live_p50 * 100,
            "p95_delta_pct": (shadow_p95 - live_p95) / live_p95 * 100,
            "p99_delta_pct": (shadow_p99 - live_p99) / live_p99 * 100,
        },
    }
Error Rate Analysis
A shadow error rate above 0.1% warrants investigation. Shadow errors often reveal:
- Input validation failures (production data has edge cases offline data lacks)
- Missing-value handling (a feature arrives in production with null values that never appeared in training)
- Library version incompatibilities
- Memory OOM errors on large inputs
- Timeout errors under load
def categorize_shadow_errors(shadow_logs: pd.DataFrame) -> pd.DataFrame:
    """
    Group shadow errors by type and identify patterns.
    Returns DataFrame with error categories and counts.

    Expects shadow_error to hold the error message string (null when the
    shadow call succeeded), not the boolean flag. Matching on exception names
    such as "keyerror" only works if the logged message includes the type name.
    """
    error_logs = shadow_logs[shadow_logs["shadow_error"].notna()].copy()
    if len(error_logs) == 0:
        return pd.DataFrame({"error_category": [], "count": [], "rate": []})

    # Categorize errors by type; the first matching rule wins
    def categorize(error_msg: str) -> str:
        error_msg = str(error_msg).lower()
        if "keyerror" in error_msg or "missing" in error_msg:
            return "missing_feature"
        elif "oom" in error_msg or "memory" in error_msg:
            return "out_of_memory"
        elif "timeout" in error_msg:
            return "timeout"
        elif "valueerror" in error_msg or "invalid" in error_msg:
            return "invalid_input"
        elif "none" in error_msg or "null" in error_msg:
            return "null_input"
        else:
            return "other"

    error_logs["error_category"] = error_logs["shadow_error"].apply(categorize)
    summary = error_logs.groupby("error_category").size().reset_index(name="count")
    # Rate is relative to all shadow requests, not only the failed ones
    summary["rate"] = summary["count"] / len(shadow_logs)
    return summary.sort_values("count", ascending=False)
Shadow Mode vs Canary Deployment vs A/B Testing
These three techniques serve different purposes and are used sequentially, not as alternatives:
| Technique | Users Affected | Purpose | When |
|---|---|---|---|
| Shadow Mode | 0% | Catch silent failures, latency issues, error rates | Before any traffic shift |
| Canary Deploy | 1–5% | Catch major metric regressions with low blast radius | After shadow mode passes |
| A/B Test | 50% | Measure true causal effect on business metrics | After canary passes |
Implementing Shadow Mode in Common Serving Architectures
With a Feature Store and Model Server
# FastAPI shadow routing example
import logging
import time
from typing import Any, Optional

import httpx
from fastapi import BackgroundTasks, FastAPI, Request

app = FastAPI()
logger = logging.getLogger(__name__)

LIVE_MODEL_URL = "http://model-v2:8080/predict"
SHADOW_MODEL_URL = "http://model-v3:8080/predict"


async def log_shadow_result(request_id: str, prediction: Any,
                            latency_ms: Optional[float], error: Optional[str]) -> None:
    """Placeholder sink: replace with a Kafka producer, warehouse client, etc."""
    logger.info("shadow request_id=%s latency_ms=%s error=%s prediction=%s",
                request_id, latency_ms, error, prediction)


async def call_shadow_model_async(payload: dict, request_id: str):
    """
    Async shadow model call - completely isolated from live path.
    Errors here never surface to the user.
    """
    try:
        async with httpx.AsyncClient(timeout=5.0) as client:
            start = time.perf_counter()
            response = await client.post(SHADOW_MODEL_URL, json=payload)
            latency = time.perf_counter() - start
        await log_shadow_result(
            request_id=request_id,
            prediction=response.json(),
            latency_ms=latency * 1000,
            error=None,
        )
    except Exception as e:
        await log_shadow_result(
            request_id=request_id,
            prediction=None,
            latency_ms=None,
            error=f"{type(e).__name__}: {e}",
        )


@app.post("/predict")
async def predict(request: Request, background_tasks: BackgroundTasks):
    payload = await request.json()
    request_id = request.headers.get("x-request-id", "unknown")

    # 1. Call live model - the user waits for this
    async with httpx.AsyncClient(timeout=10.0) as client:
        live_response = await client.post(LIVE_MODEL_URL, json=payload)
    live_result = live_response.json()

    # 2. Schedule shadow model as background task - runs after the response is sent
    background_tasks.add_task(
        call_shadow_model_async,
        payload=payload,
        request_id=request_id,
    )

    # 3. Return live model result immediately
    return live_result
With Envoy / Istio Traffic Mirroring
For microservice architectures, traffic mirroring at the proxy level avoids application-layer changes:
# Istio VirtualService mirror policy (enforced by the Envoy sidecar proxy)
# Mirrors 100% of traffic to the shadow subset.
# The shadow subset receives identical requests but its responses are ignored.
# The live-v2 and shadow-v3 subsets must be defined in a DestinationRule.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: recommendation-service
spec:
  hosts:
    - recommendation-service
  http:
    - route:
        - destination:
            host: recommendation-service
            subset: live-v2
          weight: 100
      mirror:
        host: recommendation-service
        subset: shadow-v3
      mirrorPercentage:
        value: 100.0  # mirror 100% of traffic to shadow
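This is the least invasive approach: the shadow model receives identical copies of all requests, but its responses are discarded at the proxy level. No routing changes are required in application code. The shadow model's latency and errors are recorded in proxy metrics. Prediction-level comparison is not automatic, though: because the proxy throws the shadow response away, both models must log their own outputs with a shared request ID so the two streams can be joined offline.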
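With proxy-level mirroring, the comparison has to happen offline. The following is a minimal sketch of that join, assuming each service writes its own prediction log and that both logs carry the same request identifier. The record layout (`request_id`, `prediction`) and the function name `join_by_request_id` are illustrative assumptions, not part of Istio or Envoy.

```python
from typing import Any


def join_by_request_id(
    live_records: list[dict[str, Any]],
    shadow_records: list[dict[str, Any]],
) -> tuple[list[dict[str, Any]], list[str]]:
    """Pair live and shadow log records that share a request_id.

    Returns (paired, unmatched_ids). Unmatched live requests are worth
    tracking: they indicate dropped mirrors or shadow-side failures.
    """
    # Hypothetical record shape: {"request_id": str, "prediction": Any}
    shadow_by_id = {r["request_id"]: r for r in shadow_records}
    paired: list[dict[str, Any]] = []
    unmatched: list[str] = []
    for live in live_records:
        shadow = shadow_by_id.get(live["request_id"])
        if shadow is None:
            unmatched.append(live["request_id"])
            continue
        paired.append({
            "request_id": live["request_id"],
            "live_prediction": live["prediction"],
            "shadow_prediction": shadow["prediction"],
            "diverged": live["prediction"] != shadow["prediction"],
        })
    return paired, unmatched
```

The unmatched list matters as much as the paired one: a mirrored request that never shows up in the shadow log is a failure the divergence rate alone would hide.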
Shadow Mode for Generative Models
For LLMs and other generative models, prediction divergence is not binary - responses are not equal or unequal, they are more or less similar. Shadow mode for generative models requires semantic comparison:
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer

model_embedder = SentenceTransformer("all-MiniLM-L6-v2")


def compute_response_similarity(live_response: str, shadow_response: str) -> dict:
    """
    Compare live and shadow LLM responses using semantic similarity.
    Returns multiple similarity signals.
    """
    # Semantic similarity via embeddings
    embeddings = model_embedder.encode([live_response, shadow_response])
    norm_product = np.linalg.norm(embeddings[0]) * np.linalg.norm(embeddings[1])
    # Guard against a zero-norm embedding to avoid dividing by zero
    cosine_sim = (
        float(np.dot(embeddings[0], embeddings[1]) / norm_product)
        if norm_product else 0.0
    )

    # Lexical overlap metrics
    live_words = set(live_response.lower().split())
    shadow_words = set(shadow_response.lower().split())
    union = live_words | shadow_words
    jaccard = len(live_words & shadow_words) / len(union) if union else 0.0

    # Length comparison
    len_ratio = len(shadow_response) / max(len(live_response), 1)

    return {
        "semantic_similarity": cosine_sim,
        "lexical_jaccard": jaccard,
        "length_ratio": len_ratio,
        "substantially_different": cosine_sim < 0.7,  # threshold for alerting
    }


def analyze_generative_shadow_logs(shadow_logs: pd.DataFrame) -> dict:
    """
    For LLM shadow mode, compute similarity distributions rather than
    binary divergence rates.
    Requires columns: semantic_similarity, shadow_refused, live_refused
    """
    sims = shadow_logs["semantic_similarity"]
    return {
        "mean_similarity": sims.mean(),
        "pct_substantially_different": (sims < 0.7).mean(),
        "similarity_distribution": {
            "p10": sims.quantile(0.10),
            "p25": sims.quantile(0.25),
            "p50": sims.quantile(0.50),
            "p75": sims.quantile(0.75),
            "p90": sims.quantile(0.90),
        },
        "refusal_rate_shadow": shadow_logs["shadow_refused"].mean(),
        "refusal_rate_live": shadow_logs["live_refused"].mean(),
    }
Production Engineering Notes
Resource isolation: The shadow model should run on separate compute from the live model. If they share resources, a shadow model memory spike can degrade live model performance. Use separate containers, separate CPU quotas, or separate hosts.
Sampling rate: For very high-traffic systems, you do not need 100% shadow traffic. Sampling 10–20% of requests cuts shadow compute cost by 5–10x while still surfacing common errors and divergence patterns quickly. Rare input patterns take proportionally longer to appear in a sample, so extend the shadow period if low-frequency failures matter.
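One way to sample is to hash the request ID, so the decision is deterministic: the same request always gets the same answer on every replica, and raising the rate only adds requests to the sample. This is a sketch under the assumption that request IDs are unique strings; `should_shadow` is a hypothetical helper name.

```python
import hashlib


def should_shadow(request_id: str, sample_rate: float) -> bool:
    """Return True if this request falls inside the shadow sample.

    sample_rate is the fraction of traffic to shadow, between 0.0 and 1.0.
    """
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0.0 and 1.0")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer bucket in [0, 2**64)
    bucket = int.from_bytes(digest[:8], "big")
    # Integer comparison avoids float rounding at the boundaries:
    # rate 0.0 never matches, rate 1.0 always matches.
    return bucket < int(sample_rate * 2**64)
```

In the router, this check would sit just before the shadow call is submitted. Hashing is preferred here over `random.random()` because it makes a sampled request reproducible when debugging.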
Shadow traffic delay: For sequential models (where the live model's output affects the next request), the shadow model's output must never feed back into the pipeline, or it would inject shadow predictions into the next step. Shadow only at the API boundary: mirror the incoming request, log the shadow response, and let only the live model's output propagate downstream.
Shadow mode duration: Run shadow mode for at least 7 days, covering both weekday and weekend traffic patterns. Some failures are time-dependent (e.g., a model that fails on requests with timestamp features near midnight, or on months with 31 days).
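A quick way to check whether a shadow run has covered the weekly cycle is to count which (weekday, hour) cells appear in the logged timestamps. The sketch below assumes Unix timestamps in seconds and buckets them in UTC; `traffic_coverage` is a hypothetical helper, and a local time zone may be more meaningful for your traffic.

```python
from datetime import datetime, timezone


def traffic_coverage(timestamps: list[float]) -> dict:
    """Summarize which weekday/hour cells a set of request timestamps covers."""
    cells = set()
    for ts in timestamps:
        dt = datetime.fromtimestamp(ts, tz=timezone.utc)
        cells.add((dt.weekday(), dt.hour))  # weekday(): Monday=0 ... Sunday=6
    weekdays_seen = {day for day, _ in cells}
    return {
        "cells_covered": len(cells),
        "cells_total": 7 * 24,
        "missing_weekdays": sorted(set(range(7)) - weekdays_seen),
        "full_week": len(weekdays_seen) == 7,
    }
```

A run with any missing weekday, or with large gaps in the hourly cells, has not yet seen the full traffic distribution, regardless of how many requests it processed.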
Go/no-go criteria: Define explicit shadow mode pass criteria before you start. For example:
- Shadow error rate less than 0.1%
- Shadow p99 latency within 20% of live p99
- Shadow divergence rate within expected range (based on offline model comparison)
- No critical errors (OOM, segfault, silent corruption)
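Criteria like these are easier to enforce when they are encoded before the run starts. The following is a minimal sketch: the `ShadowSummary` fields and the `evaluate_go_no_go` function are illustrative names, the 0.1% and 20% defaults mirror the example thresholds above, and the expected divergence range is something you must supply from your own offline comparison.

```python
from dataclasses import dataclass


@dataclass
class ShadowSummary:
    """Aggregated shadow-run metrics (hypothetical structure)."""
    shadow_error_rate: float      # fraction of shadow requests that failed
    live_p99_ms: float
    shadow_p99_ms: float
    divergence_rate: float        # fraction of comparable requests that differ
    critical_error_count: int     # OOM, segfault, silent corruption


def evaluate_go_no_go(
    summary: ShadowSummary,
    expected_divergence: tuple[float, float],
    max_error_rate: float = 0.001,
    max_p99_regression: float = 0.20,
) -> dict:
    """Check a shadow run against pass criteria fixed in advance."""
    low, high = expected_divergence
    checks = {
        "error_rate": summary.shadow_error_rate < max_error_rate,
        "p99_latency": summary.shadow_p99_ms
        <= summary.live_p99_ms * (1 + max_p99_regression),
        "divergence_in_range": low <= summary.divergence_rate <= high,
        "no_critical_errors": summary.critical_error_count == 0,
    }
    return {"go": all(checks.values()), "checks": checks}
```

Returning the individual checks alongside the overall verdict makes it clear which criterion blocked a promotion.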
Common Mistakes
:::danger Shadow Model Sharing Resources with Live Model If your shadow model runs on the same pod or VM as your live model, a shadow model crash or memory spike can take down the live model. The entire point of shadow mode is zero user impact. Use separate resource allocations: separate Kubernetes pods with separate resource limits, separate GPU allocations, separate memory quotas. :::
:::danger Ignoring Shadow Errors A 2% shadow error rate seems small but means 1 in 50 production requests would fail if you shipped. Investigate every shadow error category before promoting the model. Common root causes: null handling in preprocessing, input length limits not enforced, missing features in the production feature store that existed in the training feature store. :::
:::warning Confusing Divergence Rate with Error Rate Divergence (shadow prediction differs from live prediction) is expected and desirable - the whole point is to build a better model. Error rate (shadow model crashes or returns invalid output) is the failure signal. High divergence that is concentrated on specific input types (long text, certain geographies, specific feature combinations) reveals input distribution mismatches worth investigating before deployment. :::
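To keep the two signals apart in practice, compute divergence only over requests where the shadow model actually returned a prediction. This sketch assumes each log record is a dict with `live_prediction`, `shadow_prediction`, and a `shadow_error` message that is `None` on success; both the record shape and the function name are assumptions for illustration.

```python
from typing import Any


def split_error_and_divergence(records: list[dict[str, Any]]) -> dict:
    """Report error rate and divergence rate as separate signals."""
    total = len(records)
    # Requests where the shadow call failed: these count toward error rate only
    errored = [r for r in records if r["shadow_error"] is not None]
    # Requests where both models produced a prediction that can be compared
    comparable = [r for r in records if r["shadow_error"] is None]
    diverged = [
        r for r in comparable
        if r["live_prediction"] != r["shadow_prediction"]
    ]
    return {
        "error_rate": len(errored) / total if total else 0.0,
        "divergence_rate": len(diverged) / len(comparable) if comparable else 0.0,
        "comparable_requests": len(comparable),
    }
```

Without this split, every failed shadow call shows up as a disagreement, which inflates the divergence rate and hides the crash behind a metric that is expected to be non-zero.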
:::warning Running Shadow Mode Too Short Seven days minimum. Traffic patterns vary by hour of day, day of week, and sometimes by season. A model that works fine on Tuesday morning may fail on Friday evening due to different request characteristics. Shadow mode over a full week catches the full distribution of production traffic. :::
Interview Q&A
Q: What is shadow mode testing and how does it differ from A/B testing?
A: Shadow mode testing runs a candidate model against all production traffic in parallel with the live model, but discards the shadow model's responses without showing them to users. It catches technical failures: error rates, latency regressions, format violations, silent crashes on production input patterns. A/B testing, by contrast, assigns users to control or treatment and measures the causal effect on business metrics like conversion rate or engagement. The two techniques answer different questions: shadow mode asks "will this model work correctly on production traffic?", A/B testing asks "will this model produce better outcomes for users?". In practice, shadow mode comes first - you validate correctness before measuring impact.
Q: How would you implement shadow mode in a microservices architecture?
A: Two main approaches. First, application-level: the serving application duplicates each request to both the live and shadow model endpoints, uses the live model's response, and logs the shadow model's response asynchronously. The shadow model call must be fire-and-forget - any error or latency spike in the shadow path must not affect the live response. Second, proxy-level: use a proxy or service mesh such as Envoy or Istio with traffic mirroring configured. The proxy duplicates requests to the shadow cluster and ignores its responses. Proxy-level mirroring is cleaner because it requires no application routing changes and shadow responses and errors never reach the user, though comparing predictions then requires both models to log outputs with a shared request ID. In both cases, the shadow model must run on isolated compute to prevent resource contention with the live model.
Q: What metrics do you monitor during shadow mode?
A: Three main categories. First, technical health: shadow error rate (must be below threshold, typically 0.1%), shadow timeout rate, shadow p50/p95/p99 latency compared to live model. Second, prediction divergence: what fraction of requests get different predictions from the shadow model vs live model, and how are the divergences distributed across request types, feature groups, user segments. Divergence itself is expected; concentrated divergence in unexpected places reveals problems. Third, output validity: does the shadow model produce valid output formats, valid ranges (no negative probabilities), valid types? For generative models, track semantic similarity between live and shadow responses rather than binary divergence. Define explicit go/no-go thresholds for all three categories before starting shadow mode.
Q: How do you handle shadow mode for a generative model like an LLM?
A: Binary divergence comparison (equal vs not-equal) does not apply to generative models - two valid responses will rarely be identical. Instead, you compare along several dimensions: semantic similarity using embedding-based cosine similarity (as a starting heuristic, above 0.7 suggests similar intent, anything below 0.7 is worth flagging for review, and below 0.5 is substantially different; calibrate these thresholds for your embedding model), lexical overlap, response length distribution, refusal rate (are safety filters triggering at different rates?), and output format compliance (valid JSON, correct schema). You also want to sample and manually review responses in the flagged bucket to understand whether the differences are quality improvements or regressions. For LLMs, shadow mode also helps catch prompt injection susceptibility and jailbreaking behavior on real production inputs that may not appear in your red-team dataset.
Q: How long should you run shadow mode before promoting a model?
A: Minimum seven days to cover a full business cycle including both weekday and weekend traffic patterns. Some failures are time-dependent: models that use timestamp features can fail near day/month boundaries, models serving sports content may see very different traffic on game days, e-commerce models see different patterns on promotional days. For models serving time-sensitive predictions (fraud, real-time pricing), also ensure you have coverage across different times of day, since traffic composition (device types, user demographics, query types) varies significantly by hour. For high-stakes domains (payments, healthcare, safety), 14 days is more appropriate. Use the shadow period not just to catch errors but to build confidence in your divergence analysis - you want to understand why the shadow model disagrees with the live model before you ship it.
