
Environment Parity

The Model That Died Crossing the Environment Boundary

The fraud detection model had a 94.2% F1 score in staging. The team had spent six weeks tuning it, running cross-validation, reviewing SHAP explanations. The staging environment had integration tests that verified end-to-end feature pipelines. Performance was excellent. The stakeholders were excited.

On the day of production deployment, within four hours, the support queue began filling with false positives. Legitimate transactions were being blocked. The model that had been so carefully validated in staging was performing significantly worse in production - not because of model drift, but because of environment drift. Three distinct problems had stacked on top of each other.

First, the feature engineering pipeline ran a different version of the pandas library in production: 1.5.3, versus 2.0.1 in staging. A subtle behavior change in how groupby().transform() handled null values produced different feature distributions. Second, the production feature store had a time-to-live setting of 3600 seconds on cached feature values, so peak-hour traffic (when fraud is highest) could be reading features up to an hour old. Staging had TTL disabled. Third, the transaction amount was normalized in staging using statistics computed from the staging dataset (which was a 30-day sample), but production normalized against statistics from the full 3-year history - a different mean and standard deviation.

The model had not failed. The environment had. And the team had no systematic way to detect these discrepancies before deployment.

Environment parity is the discipline of making your non-production environments behave as much like production as possible - not just at the infrastructure level, but at the data, library, configuration, and behavioral level. It is one of the hardest problems in MLOps, and ignoring it is responsible for a disproportionate share of model failures in production.


Why This Exists

The "works on my machine" problem has plagued software engineering for decades. Docker solved much of it for application code. But ML systems have additional sources of environment divergence that containers alone cannot fix: data distributions, external service behavior, feature computation timing, and statistical preprocessing parameters.

The 12-Factor App methodology (published by Heroku in 2011) articulated the dev/prod parity principle for web applications: keep development, staging, and production as similar as possible. For ML, this principle needs to extend to four dimensions simultaneously - compute, data, code, and configuration. Getting any one of these wrong produces a model that looks great until it meets production.

The Four Dimensions of ML Environment Parity

Compute Parity - Containers Are Necessary But Not Sufficient

Containers solve OS and library version parity. But container parity requires discipline.

# Dockerfile.training - NEVER use floating tags
# Bad: FROM pytorch/pytorch:latest
# Good: pin exact digest
FROM pytorch/pytorch:2.2.0-cuda12.1-cudnn8-runtime@sha256:3a9c5c9e7...

# Pin ALL Python packages - no version ranges in production training images
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# requirements.txt - pin everything, including transitive deps
# torch==2.2.0+cu121
# pandas==2.1.4
# scikit-learn==1.4.0
# numpy==1.26.3
# pyarrow==14.0.2
# feast==0.35.0

# Generate pinned requirements from a working environment
pip freeze > requirements.txt

# Or use pip-compile for reproducible resolution
pip-compile requirements.in --output-file requirements.txt --generate-hashes

# Verify your container runs the exact same code in all environments
docker run --rm myimage python -c "import torch; print(torch.__version__)"
# Expected: 2.2.0+cu121 in all environments
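
The single-package check above can be widened into a gate that compares every pinned package. The following is a minimal sketch, assuming the `pip freeze` output of each environment's image has already been captured as text. `parse_freeze` and `diff_pins` are hypothetical helper names, not part of pip, and the sample versions are the ones from the opening incident.

```python
def parse_freeze(text: str) -> dict[str, str]:
    """Parse `pip freeze` output into {normalized_package_name: version}."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and entries that are not simple pins
        # (for example editable installs or direct URL references).
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip().lower().replace("_", "-")] = version.strip()
    return pins


def diff_pins(staging: dict[str, str], prod: dict[str, str]) -> dict[str, tuple]:
    """Return {package: (staging_version, prod_version)} for every mismatch.

    A package present in only one environment shows up with None on the other side.
    """
    mismatches = {}
    for name in sorted(set(staging) | set(prod)):
        staging_version, prod_version = staging.get(name), prod.get(name)
        if staging_version != prod_version:
            mismatches[name] = (staging_version, prod_version)
    return mismatches


# Illustrative inputs only; in practice these would come from each image.
staging_freeze = "pandas==2.0.1\nnumpy==1.26.3\n"
prod_freeze = "pandas==1.5.3\nnumpy==1.26.3\n"

mismatches = diff_pins(parse_freeze(staging_freeze), parse_freeze(prod_freeze))
print(mismatches)  # {'pandas': ('2.0.1', '1.5.3')}
```

A promotion script could exit non-zero whenever `mismatches` is non-empty, which would have surfaced the pandas difference before deployment.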

Hardware Parity

Training on an A100 and serving on a T4 is fine - but be aware of numerical differences. torch.float16 computation on different GPU architectures can produce subtly different results. For models where prediction consistency across hardware matters, test explicitly.

# test_hardware_consistency.py
# Run this on both staging (T4) and prod (A100) GPUs
import numpy as np
import torch

# load_model is assumed to be your project's own model-loading helper
# (for example, a thin wrapper around your model registry client).


def test_numeric_consistency():
    """Verify model outputs are consistent across GPU models."""
    model = load_model("fraud-detector-v2.1")
    test_input = torch.load("consistency_test_inputs.pt")

    with torch.inference_mode():
        output = model(test_input)

    # Save output on staging, compare on prod
    expected = np.load("expected_outputs_staging.npy")
    actual = output.cpu().numpy()

    max_diff = np.max(np.abs(actual - expected))
    assert max_diff < 1e-4, (
        f"Max difference {max_diff} exceeds threshold - "
        "hardware produces inconsistent outputs"
    )

Data Parity - The Hardest Problem

The Feature Skew Problem

The most common production failure in ML is training-serving skew: the features the model sees at serving time are computed differently from the features used at training time. This has three root causes:

  1. Code skew: feature computation logic differs between training pipeline (batch) and serving pipeline (online)
  2. Data skew: training data is a historical snapshot; serving data is live, with different distribution
  3. Temporal skew: features are computed at different points in time (batch vs real-time)

# The danger: computing features differently in training vs serving

# Training pipeline (Python, batch processing):
def compute_user_fraud_score(transactions_df):
    """Compute rolling fraud signal from transaction history."""
    # Note: rolling(30) is a window of 30 rows (transactions), not 30 days
    return transactions_df.groupby("user_id")["amount"].apply(
        lambda x: x.rolling(30, min_periods=1).mean()
    ).reset_index(level=0, drop=True)

# Serving pipeline (Java, real-time):
# ... completely different code, different rolling window implementation
# Different handling of edge cases (first transaction, NULL values)

# Solution: Feature Store with registered transformation logic
# Both training and serving call the same compute function,
# which removes code divergence as a source of skew
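
The "same compute function" idea can be shown without any feature store. The sketch below is illustrative only: `rolling_mean_amount`, `batch_features`, and `OnlineFeatureComputer` are hypothetical names, and it assumes a window counted in transactions (mirroring `rolling(30, min_periods=1)` above) with NULL amounts ignored.

```python
from collections import defaultdict, deque

WINDOW = 30  # most recent transactions per user (assumed to mirror rolling(30))


def rolling_mean_amount(recent_amounts):
    """Single source of truth for the feature, including the NULL policy."""
    amounts = [a for a in recent_amounts if a is not None]
    return sum(amounts) / len(amounts) if amounts else None


def batch_features(transactions):
    """Training path: replay (user_id, amount) pairs in chronological order."""
    windows = defaultdict(lambda: deque(maxlen=WINDOW))
    features = []
    for user_id, amount in transactions:
        windows[user_id].append(amount)
        features.append(rolling_mean_amount(windows[user_id]))
    return features


class OnlineFeatureComputer:
    """Serving path: the same function applied to an incrementally updated window."""

    def __init__(self):
        self._windows = defaultdict(lambda: deque(maxlen=WINDOW))

    def update(self, user_id, amount):
        self._windows[user_id].append(amount)
        return rolling_mean_amount(self._windows[user_id])


history = [("u1", 10.0), ("u2", 5.0), ("u1", None), ("u1", 30.0)]

online = OnlineFeatureComputer()
online_features = [online.update(user_id, amount) for user_id, amount in history]

# Both paths agree because the edge cases live in one function.
print(batch_features(history) == online_features)  # True
```

Because the first-transaction and NULL handling are defined once in `rolling_mean_amount`, a change to either policy reaches training and serving together.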

Feature Store Snapshots for Data Parity

# Using Feast for training-serving parity
from feast import FeatureStore
import pandas as pd

fs = FeatureStore(repo_path="feature_repo/")

# training_labels is assumed to be a DataFrame of labeled transactions
# loaded earlier, with user_id and transaction_time columns.

# Training: point-in-time correct historical features
# This is what the model WOULD have seen at each training example's timestamp
entity_df = pd.DataFrame({
    "user_id": training_labels["user_id"],
    "event_timestamp": training_labels["transaction_time"],  # Past timestamps!
})

training_data = fs.get_historical_features(
    entity_df=entity_df,
    features=[
        "user_features:tx_count_7d",
        "user_features:avg_amount_30d",
        "user_features:fraud_score",
        "merchant_features:fraud_rate",
    ],
).to_df()

# Serving: same features, but computed as of NOW
# Uses the same registered feature definitions as training
online_features = fs.get_online_features(
    features=[
        "user_features:tx_count_7d",
        "user_features:avg_amount_30d",
        "user_features:fraud_score",
        "merchant_features:fraud_rate",
    ],
    entity_rows=[{"user_id": "user_123"}],
).to_dict()
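
"Point-in-time correct" means each training row only sees feature values that existed at or before its own timestamp. The sketch below illustrates that rule on a toy history. It is a conceptual example, not how Feast implements the join, and `point_in_time_lookup` plus the integer timestamps are assumptions made for illustration.

```python
from bisect import bisect_right


def point_in_time_lookup(feature_history, event_time):
    """Return the latest feature value recorded at or before event_time.

    feature_history is a list of (timestamp, value) pairs sorted by timestamp.
    Returns None when no value existed yet, which avoids leaking future data.
    """
    timestamps = [ts for ts, _ in feature_history]
    idx = bisect_right(timestamps, event_time)
    return feature_history[idx - 1][1] if idx > 0 else None


# Toy fraud_score history for one user: (epoch seconds, value)
fraud_score_history = [(100, 0.10), (200, 0.40), (300, 0.90)]

print(point_in_time_lookup(fraud_score_history, 250))  # 0.4
print(point_in_time_lookup(fraud_score_history, 50))   # None
```

A naive join that always takes the latest value would return 0.90 for the transaction at time 250, which is information the model could not have had when that transaction happened.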

Feature Statistics for Parity Checking

# feature_parity_check.py
# Run this as a pre-deployment gate before promoting to production

import numpy as np
from scipy import stats
from feast import FeatureStore


def check_feature_distribution_parity(
    staging_fs: FeatureStore,
    prod_fs: FeatureStore,
    feature_names: list[str],
    sample_entity_ids: list[str],
    p_value_threshold: float = 0.05,
) -> dict:
    """
    Compare feature distributions between staging and production feature stores.
    Uses KS test to detect significant distribution differences.
    """
    results = {}
    entity_rows = [{"user_id": uid} for uid in sample_entity_ids]

    for feature in feature_names:
        feature_key = feature.split(":")[1]

        staging_values = staging_fs.get_online_features(
            features=[feature],
            entity_rows=entity_rows,
        ).to_dict()[feature_key]

        prod_values = prod_fs.get_online_features(
            features=[feature],
            entity_rows=entity_rows,
        ).to_dict()[feature_key]

        # Drop missing values once so the KS test and the means use the same
        # data (np.nanmean cannot handle None entries)
        staging_clean = [v for v in staging_values if v is not None]
        prod_clean = [v for v in prod_values if v is not None]

        # Kolmogorov-Smirnov test for distribution equality
        ks_stat, p_value = stats.ks_2samp(staging_clean, prod_clean)

        results[feature] = {
            "ks_statistic": ks_stat,
            "p_value": p_value,
            "distributions_match": p_value > p_value_threshold,
            "staging_mean": np.mean(staging_clean),
            "prod_mean": np.mean(prod_clean),
        }

    failing_features = [
        f for f, r in results.items() if not r["distributions_match"]
    ]

    if failing_features:
        raise ValueError(
            f"Feature distribution mismatch detected: {failing_features}. "
            f"Promotion to production blocked."
        )

    return results
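
For intuition about what the gate above measures, the KS statistic is the largest vertical gap between the two empirical CDFs. The following pure-Python sketch computes only that statistic (no p-value) and is meant as an illustration. `ks_statistic` is a hypothetical helper, and the check above still relies on `scipy.stats.ks_2samp`.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest gap between the empirical CDFs of two samples (0.0 to 1.0)."""
    a_sorted, b_sorted = sorted(sample_a), sorted(sample_b)
    if not a_sorted or not b_sorted:
        raise ValueError("both samples must be non-empty")

    max_gap = 0.0
    # The empirical CDFs only change at observed values, so checking those suffices.
    for x in set(a_sorted) | set(b_sorted):
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap


print(ks_statistic([1, 2, 3], [1, 2, 3]))        # 0.0 (identical samples)
print(ks_statistic([1, 2, 3, 4], [3, 4, 5, 6]))  # 0.5 (shifted distribution)
print(ks_statistic([1, 2, 3], [10, 11, 12]))     # 1.0 (no overlap)
```

Recording the statistic next to the p-value is useful because, with large samples, a very small gap can still produce a p-value below the threshold.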

Kustomize Overlays - Infrastructure Configuration Parity

Kustomize is the standard Kubernetes tool for environment-specific configuration without code duplication. You define a base configuration and apply environment-specific overlays.

Directory Structure

model-deployments/
├── base/
│   ├── kustomization.yaml
│   ├── rollout.yaml          # Argo Rollout definition
│   ├── service.yaml
│   ├── hpa.yaml              # HorizontalPodAutoscaler
│   └── configmap.yaml
└── overlays/
    ├── dev/
    │   ├── kustomization.yaml
    │   └── resources-patch.yaml
    ├── staging/
    │   ├── kustomization.yaml
    │   ├── resources-patch.yaml
    │   └── replicas-patch.yaml
    └── prod/
        ├── kustomization.yaml
        ├── resources-patch.yaml
        ├── replicas-patch.yaml
        └── hpa-patch.yaml

# model-deployments/base/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
  - rollout.yaml
  - service.yaml
  - hpa.yaml
  - configmap.yaml

commonLabels:
  app: fraud-detector
  managed-by: kustomize

images:
  - name: fraud-detector
    newName: 123456789.dkr.ecr.us-east-1.amazonaws.com/fraud-detector
    newTag: v2.1.0  # Updated by CI pipeline

# model-deployments/base/rollout.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: fraud-detector
spec:
  replicas: 2  # Base value - overridden by overlays
  selector:
    matchLabels:
      app: fraud-detector
  template:
    metadata:
      labels:
        app: fraud-detector  # Must match spec.selector.matchLabels
    spec:
      containers:
        - name: model
          image: fraud-detector  # Kustomize replaces this with the full registry URL
          resources:
            requests:
              cpu: "500m"
              memory: "1Gi"
            limits:
              cpu: "1"
              memory: "2Gi"
          env:
            - name: FEATURE_STORE_ENDPOINT
              valueFrom:
                configMapKeyRef:
                  name: fraud-detector-config
                  key: FEATURE_STORE_ENDPOINT
            - name: MODEL_THRESHOLD
              valueFrom:
                configMapKeyRef:
                  name: fraud-detector-config
                  key: MODEL_THRESHOLD
  strategy:
    canary:
      steps:
        - setWeight: 20
        - pause: {duration: 5m}
        - setWeight: 100

# model-deployments/overlays/dev/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:  # "bases" is deprecated; list the base under resources
  - ../../base

namePrefix: dev-  # All resources prefixed with "dev-"

namespace: ml-dev

patches:
  - path: resources-patch.yaml
    target:
      kind: Rollout
      name: fraud-detector

configMapGenerator:
  - name: fraud-detector-config
    behavior: merge
    literals:
      - FEATURE_STORE_ENDPOINT=http://feature-store.ml-dev:8080
      - MODEL_THRESHOLD=0.80  # Lower threshold in dev for testing
      - LOG_LEVEL=DEBUG
      - SHADOW_MODE=false

# model-deployments/overlays/dev/resources-patch.yaml
# Strategic merge patch - only specify what changes from base
# Note: Rollout is a custom resource, so Kustomize may replace the whole
# containers list instead of merging by name unless it is given the Rollout
# OpenAPI schema. Verify the rendered output with `kustomize build`.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: fraud-detector
spec:
  replicas: 1  # 1 replica in dev (not 2)
  template:
    spec:
      containers:
        - name: model
          resources:
            requests:
              cpu: "200m"  # Smaller CPU in dev
              memory: "512Mi"
            limits:
              cpu: "500m"
              memory: "1Gi"

# model-deployments/overlays/prod/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:  # "bases" is deprecated; list the base under resources
  - ../../base

namespace: ml-prod

patches:
  - path: resources-patch.yaml
    target:
      kind: Rollout
      name: fraud-detector
  - path: hpa-patch.yaml
    target:
      kind: HorizontalPodAutoscaler
      name: fraud-detector

configMapGenerator:
  - name: fraud-detector-config
    behavior: merge
    literals:
      - FEATURE_STORE_ENDPOINT=http://feature-store.ml-prod:8080
      - MODEL_THRESHOLD=0.85
      - LOG_LEVEL=INFO
      - SHADOW_MODE=false
      - CACHE_TTL_SECONDS=60  # Shorter TTL in prod for fresher features

# model-deployments/overlays/prod/resources-patch.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: fraud-detector
spec:
  replicas: 8  # 8 replicas in production
  template:
    spec:
      containers:
        - name: model
          resources:
            requests:
              cpu: "2"
              memory: "4Gi"
            limits:
              cpu: "4"
              memory: "8Gi"
  strategy:
    canary:
      steps:
        - setWeight: 10  # More conservative canary in prod
        - pause: {duration: 10m}
        - analysis:
            templates:
              - templateName: fraud-detector-latency
        - setWeight: 50
        - pause: {duration: 10m}
        - setWeight: 100

Environment Comparison Table

| Dimension | Dev | Staging | Production |
| --- | --- | --- | --- |
| Replicas | 1 | 2-3 | 8-20 (autoscaled) |
| CPU request | 200m | 500m | 2 cores |
| Memory request | 512Mi | 1Gi | 4Gi |
| Instance type | t3.medium | m5.xlarge | m5.4xlarge |
| Feature store | Dev FS (small) | Staging FS (30-day sample) | Prod FS (full history) |
| Data | 1-day sample | 30-day sample | Full production data |
| Canary steps | 100% immediate | 50% → 100% | 10% → 50% → 100% |
| Analysis gates | None | Latency only | Latency + error rate + business |
| Secrets backend | Local k8s secrets | AWS SM (staging prefix) | AWS SM (prod prefix) |
| Log level | DEBUG | INFO | INFO |
| Monitoring | Optional | Basic | Full observability |

The Environment Promotion Pipeline

# .github/workflows/environment-promotion.yml
name: Environment Promotion

on:
  workflow_dispatch:
    inputs:
      model_version:
        description: 'Model version to promote (e.g., v2.1.0)'
        required: true
      target_environment:
        description: 'Target environment'
        required: true
        type: choice
        options: [staging, prod]

jobs:
  validate-staging-parity:
    runs-on: ubuntu-latest
    if: inputs.target_environment == 'prod'

    steps:
      - uses: actions/checkout@v4

      - name: Check feature distribution parity
        run: |
          python scripts/check_feature_parity.py \
            --staging-fs-config configs/feature-store-staging.yaml \
            --prod-fs-config configs/feature-store-prod.yaml \
            --sample-size 1000

      - name: Run shadow mode comparison
        run: |
          python scripts/shadow_comparison.py \
            --model-version ${{ inputs.model_version }} \
            --hours 24 \
            --min-agreement-rate 0.98

      - name: Validate library versions match
        run: |
          python scripts/check_library_parity.py \
            --staging-registry eai-staging \
            --prod-registry eai-prod \
            --image fraud-detector:${{ inputs.model_version }}

  promote-to-environment:
    needs: validate-staging-parity
    # Run when validation succeeded, or was skipped because the target is staging.
    # Without this condition, a skipped dependency skips this job as well.
    if: ${{ always() && (needs.validate-staging-parity.result == 'success' || needs.validate-staging-parity.result == 'skipped') }}
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4

      - name: Update image tag in target environment
        run: |
          OVERLAY="model-deployments/overlays/${{ inputs.target_environment }}"
          cd "$OVERLAY"
          # <name>:<tag> sets the tag; <name>=<value> would change the image name instead
          kustomize edit set image fraud-detector:${{ inputs.model_version }}

      - name: Create promotion PR
        uses: peter-evans/create-pull-request@v6
        with:
          commit-message: "promote: fraud-detector ${{ inputs.model_version }} to ${{ inputs.target_environment }}"
          title: "Promote fraud-detector ${{ inputs.model_version }} to ${{ inputs.target_environment }}"
          body: |
            ## Promotion Request

            **From**: staging (validated)
            **To**: ${{ inputs.target_environment }}
            **Version**: ${{ inputs.model_version }}

            ### Pre-promotion Checks Passed
            - Feature distribution parity: PASSED
            - Shadow mode agreement rate: PASSED
            - Library version consistency: PASSED

            ### Required Reviews
            - [ ] ML engineer sign-off on evaluation metrics
            - [ ] Platform engineer sign-off on infrastructure changes

Shadow Mode Testing - The Safety Net

Shadow mode runs the new model against production traffic, compares outputs to the current model, but never uses the new model's predictions for actual decisions. It is the safest way to validate production behavior before real exposure.

# shadow_mode_middleware.py
import asyncio
import logging
import time
from dataclasses import dataclass
from typing import Any

import httpx

logger = logging.getLogger(__name__)


@dataclass
class PredictionResult:
    model_version: str
    prediction: Any
    confidence: float
    latency_ms: float


class ShadowModeRouter:
    """
    Routes each request to both stable and shadow models.
    Returns stable model response. Logs comparison asynchronously.
    """

    def __init__(
        self,
        stable_endpoint: str,
        shadow_endpoint: str,
        metrics_client,
    ):
        self.stable_endpoint = stable_endpoint
        self.shadow_endpoint = shadow_endpoint
        self.metrics = metrics_client
        # Hold references to in-flight shadow tasks so they are not
        # garbage-collected before they finish
        self._background_tasks: set[asyncio.Task] = set()

    async def predict(self, request_data: dict) -> PredictionResult:
        """Predict using stable model. Shadow compare in background."""

        # Call stable model (blocking - this is what the user gets)
        stable_result = await self._call_model(self.stable_endpoint, request_data)

        # Fire shadow call without waiting (non-blocking)
        task = asyncio.create_task(
            self._shadow_compare(request_data, stable_result)
        )
        self._background_tasks.add(task)
        task.add_done_callback(self._background_tasks.discard)

        return stable_result

    async def _call_model(self, endpoint: str, data: dict) -> PredictionResult:
        start = time.monotonic()
        async with httpx.AsyncClient(timeout=5.0) as client:
            response = await client.post(f"{endpoint}/predict", json=data)
            response.raise_for_status()
        latency_ms = (time.monotonic() - start) * 1000
        result = response.json()
        return PredictionResult(
            model_version=result["model_version"],
            prediction=result["prediction"],
            confidence=result["confidence"],
            latency_ms=latency_ms,
        )

    async def _shadow_compare(self, request_data: dict, stable: PredictionResult):
        """Compare shadow result to stable. Log discrepancies."""
        try:
            shadow = await self._call_model(self.shadow_endpoint, request_data)

            # Track agreement rate in metrics
            agreement = stable.prediction == shadow.prediction
            self.metrics.increment(
                "shadow_comparison",
                tags={
                    "stable_version": stable.model_version,
                    "shadow_version": shadow.model_version,
                    "agreement": str(agreement),
                },
            )

            if not agreement:
                logger.info(
                    "Shadow disagreement",
                    extra={
                        "stable_prediction": stable.prediction,
                        "shadow_prediction": shadow.prediction,
                        "stable_confidence": stable.confidence,
                        "shadow_confidence": shadow.confidence,
                    },
                )

            # Track latency comparison
            self.metrics.histogram(
                "shadow_latency_ms",
                shadow.latency_ms,
                tags={"model": shadow.model_version},
            )

        except Exception as e:
            # Shadow failures never affect the stable response
            logger.warning("Shadow call failed: %s", e)
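
The logged comparisons only become a promotion gate once they are aggregated. Below is a minimal sketch of that aggregation, assuming the comparisons have been exported as simple records. `ShadowComparison`, `agreement_rate`, `passes_shadow_gate`, and the `min_samples` guard are hypothetical; the 0.98 default mirrors the `--min-agreement-rate 0.98` flag used in the promotion workflow.

```python
from dataclasses import dataclass


@dataclass
class ShadowComparison:
    stable_prediction: int
    shadow_prediction: int


def agreement_rate(comparisons):
    """Fraction of requests where stable and shadow predictions matched."""
    if not comparisons:
        raise ValueError("no shadow comparisons recorded")
    agreed = sum(c.stable_prediction == c.shadow_prediction for c in comparisons)
    return agreed / len(comparisons)


def passes_shadow_gate(comparisons, min_agreement_rate=0.98, min_samples=100):
    """Gate promotion on agreement rate, refusing to decide on too little traffic."""
    if len(comparisons) < min_samples:
        return False  # assumption: too few samples is treated as a failed gate
    return agreement_rate(comparisons) >= min_agreement_rate


# 99 agreements and 1 disagreement out of 100 shadowed requests
records = [ShadowComparison(1, 1)] * 99 + [ShadowComparison(0, 1)]

print(agreement_rate(records))      # 0.99
print(passes_shadow_gate(records))  # True
```

For a fraud model, a single aggregate rate can hide disagreement concentrated in the rare positive class, so breaking the rate down by predicted class is a reasonable extension.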

Cost-Efficient Staging Environments

Staging environments do not need to be full production scale. Right-size them for effective testing without burning budget.

# staging cluster - smaller instances, same configuration pattern
# environments/staging/terraform.tfvars

gpu_instance_type = "g4dn.xlarge"    # T4 GPU; prod uses p3.8xlarge (V100), so the GPU architecture differs
gpu_min_count     = 0                # Scale to zero when idle
gpu_max_count     = 4                # Cap to prevent runaway costs
cpu_instance_type = "m5.large"       # vs m5.4xlarge in prod
cpu_desired_count = 2
mlflow_db_class   = "db.t3.medium"   # vs db.r6g.xlarge in prod
redis_node_type   = "cache.t3.micro" # vs cache.r7g.large in prod

# Auto-shutdown staging environments outside business hours
# AWS EventBridge Scheduler schedule to scale down staging at night
resource "aws_scheduler_schedule" "staging_scale_down" {
  name = "staging-scale-down-nights"

  flexible_time_window {
    mode = "OFF"
  }

  schedule_expression = "cron(0 20 ? * MON-FRI *)" # 8 PM weekdays (UTC)

  target {
    arn      = aws_lambda_function.scale_down_staging.arn
    role_arn = aws_iam_role.scheduler.arn
  }
}

Production Engineering Notes

Version everything, not just model code: The conda environment spec, the Docker base image digest, the feature store schema version, and the preprocessing statistics artifact should all be versioned and stored alongside the model. When debugging a production incident, you need to be able to reproduce the exact environment - not just the model weights.

Feature parity checks in CI: Run the feature distribution parity check as a required step in your PR pipeline, not just before production deployments. Catching feature skew when a data engineer changes a feature definition (not a model engineer deploying a model) requires continuous monitoring, not point-in-time checks.

Preprocessing artifact versioning: If your model requires normalization statistics (mean, std, min, max) computed from training data, store those statistics alongside the model artifact (in the same S3 path or MLflow run). Never recompute preprocessing statistics at serving time from the online feature store - the distributions will differ.
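
A small sketch of that pattern follows: normalization statistics are fitted once on training data, written next to the model artifact, and only ever loaded at serving time. The file name `preprocessing_stats.json` and the helper names are assumptions for illustration, and a temporary directory stands in for the model's S3 path or MLflow run.

```python
import json
import statistics
import tempfile
from pathlib import Path

STATS_FILE = "preprocessing_stats.json"  # hypothetical artifact name


def fit_normalizer(amounts):
    """Compute normalization statistics once, from the training data."""
    return {"mean": statistics.fmean(amounts), "std": statistics.pstdev(amounts)}


def save_stats(stats, model_dir):
    """Write the statistics into the same directory as the model artifact."""
    path = Path(model_dir) / STATS_FILE
    path.write_text(json.dumps(stats))
    return path


def load_stats(model_dir):
    """Serving side: load the saved statistics instead of recomputing them."""
    return json.loads((Path(model_dir) / STATS_FILE).read_text())


def normalize(amount, stats):
    if stats["std"] == 0:
        raise ValueError("standard deviation is zero; cannot normalize")
    return (amount - stats["mean"]) / stats["std"]


training_amounts = [10.0, 20.0, 30.0]

with tempfile.TemporaryDirectory() as model_dir:
    save_stats(fit_normalizer(training_amounts), model_dir)  # training job
    serving_stats = load_stats(model_dir)                    # serving container

print(normalize(20.0, serving_stats))  # 0.0
```

Because the serving container never computes its own mean or standard deviation, a staging dataset of a different size cannot change how inputs are scaled.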

Blue-green for data migrations: When migrating the feature store schema (adding new features, renaming columns), use blue-green deployment. Keep the old schema running until all model versions are updated to use the new schema. Feature store schema changes are among the highest-risk events in an ML platform.

Common Mistakes

:::danger Never Use Production Data in Staging Without Anonymization
Using raw production data in staging environments is a compliance and security violation in almost every regulated industry. Use synthetic data, or anonymized/tokenized production data. The data distribution matters for testing - you can approximate it without using real PII.
:::

:::danger Preprocessing Statistics Must Match Between Training and Serving
This is the single most common cause of training-serving skew. If you normalize by mean/std during training, those exact mean and std values must be used at serving time - not recomputed from the online feature store. Store them in your model artifact and load them in your serving container.
:::

:::warning Staging Feature Store TTL Must Match Production
The time-to-live settings on your feature store's online cache affect which features your model sees. If staging has TTL=∞ and production has TTL=3600, your model sees stale features in production that it never saw in staging. This mismatch was one of the three causes of the incident in the opening scenario. Always check TTL settings when comparing environments.
:::

:::warning "Same infrastructure" Is Not Enough - Test the Behavior
Two environments can be running identical Kubernetes manifests with identical container images and still produce different model outputs due to data distribution differences. Structural parity (same code, same config) is necessary but not sufficient. Behavioral parity (model outputs follow the expected distribution) requires active monitoring and shadow mode testing.
:::
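
The TTL warning above is one instance of a more general check: compare the configuration of two environments and flag anything that should match but does not. A minimal sketch, assuming each environment's ConfigMap literals have been loaded into a dict. `config_parity_report`, the `PARITY_CRITICAL_KEYS` set, and the staging endpoint value are hypothetical.

```python
# Keys whose values are assumed to need identical values across environments.
# Endpoints and log levels are expected to differ, so they are not listed.
PARITY_CRITICAL_KEYS = {"CACHE_TTL_SECONDS", "MODEL_THRESHOLD"}


def config_parity_report(staging, prod, critical_keys=PARITY_CRITICAL_KEYS):
    """Report keys missing from staging and critical keys whose values differ."""
    missing_in_staging = sorted(set(prod) - set(staging))
    value_mismatches = {
        key: (staging[key], prod[key])
        for key in sorted(critical_keys)
        if key in staging and key in prod and staging[key] != prod[key]
    }
    return {
        "missing_in_staging": missing_in_staging,
        "value_mismatches": value_mismatches,
    }


staging_config = {
    "FEATURE_STORE_ENDPOINT": "http://feature-store.ml-staging:8080",
    "MODEL_THRESHOLD": "0.85",
    "LOG_LEVEL": "INFO",
}
prod_config = {
    "FEATURE_STORE_ENDPOINT": "http://feature-store.ml-prod:8080",
    "MODEL_THRESHOLD": "0.85",
    "LOG_LEVEL": "INFO",
    "CACHE_TTL_SECONDS": "60",
}

report = config_parity_report(staging_config, prod_config)
print(report)
# {'missing_in_staging': ['CACHE_TTL_SECONDS'], 'value_mismatches': {}}
```

Here the report shows that staging never sets a cache TTL at all, which is the same shape of gap that let stale features go unnoticed in the opening incident.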

Interview Q&A

Q: What is training-serving skew and how do you prevent it?

Training-serving skew is when the features a model sees at inference time differ from the features it was trained on. It has three root causes: (1) Code skew - feature computation logic implemented differently in the training pipeline (Python/batch) vs serving pipeline (Java/real-time). Prevention: use a Feature Store with registered transformation logic that both training and serving call - the same code path computes features in both cases. (2) Data skew - the training data distribution differs from the production distribution (time shift, population shift). Prevention: monitor feature distributions continuously and alert when they diverge from training-time distributions. (3) Statistical skew - preprocessing statistics (normalization mean/std, label encodings) computed from training data but not saved, then recomputed differently at serving time. Prevention: save preprocessing artifacts alongside the model and load them at serving time.

Q: How do you test that staging behaves like production for ML models?

Five-layer testing strategy: (1) Library parity: check that the production and staging container images have identical Python package versions - run pip freeze in both and diff. (2) Feature distribution parity: fetch the same set of entity features from staging and production feature stores for a representative sample of entities, run a KS test, and block promotion if distributions differ significantly. (3) Preprocessing consistency: apply the staging preprocessing pipeline and the production preprocessing pipeline to identical raw inputs and verify identical outputs. (4) Shadow mode comparison: run the new model against a sample of real production traffic without acting on its predictions, compare output distributions to the current model. Target 98%+ agreement rate before full promotion. (5) Latency profiling: verify that staging latency at similar load levels is comparable to production - significant latency divergence often indicates hardware or configuration differences.

Q: Walk me through a Kustomize overlay structure for a fraud detection model across dev, staging, and prod.

The base directory contains the canonical Kubernetes manifests: Argo Rollout, Service, HPA, ConfigMap. Base images use generic names that Kustomize replaces. Base resource requests are conservative (suitable for a medium environment). Overlays live in three directories. Dev overlay: namePrefix dev-, namespace ml-dev, patches that reduce replicas to 1 and lower CPU/memory requests, ConfigMap with dev feature store endpoint and DEBUG logging, immediate rollout strategy (no canary). Staging overlay: namespace ml-staging, 2-3 replicas, medium resource requests, staging feature store endpoint, simple 50%→100% canary. Prod overlay: namespace ml-prod, 8+ replicas matching production traffic, full resource requests, production feature store endpoint, conservative 10%→50%→100% canary with analysis gates. The key principle: identical structure (same manifests, same labels, same configuration shape), environment-specific values only.

Q: What is shadow mode testing and when would you use it over staging validation?

Shadow mode runs a new model version against real production traffic, comparing its outputs to the production model but never acting on them. Use shadow mode when: (1) staging data does not adequately represent production traffic patterns (common for fraud, where staged data lacks the long tail of unusual transactions); (2) you need to validate latency and resource consumption at production scale, which staging cannot replicate; (3) the model change is high-risk and you want production signal before committing to a canary rollout; (4) you want to validate behavior on data from the future (shadow mode runs on real-time data, staging validation uses historical data). Shadow mode adds infrastructure cost (running two model serving pods) and complexity (async comparison logic, metrics tracking), so it is not always necessary - use it when staging validation leaves unacceptable uncertainty.

Q: How do you manage secrets across dev, staging, and prod environments without duplicating sensitive data?

Use a hierarchical secrets management approach. All secrets live in AWS Secrets Manager (or equivalent) with path-based namespacing: eai/dev/mlflow/db-password, eai/staging/mlflow/db-password, eai/prod/mlflow/db-password. The External Secrets Operator in each cluster is configured with an IRSA role that can only read secrets under its environment prefix - the prod cluster cannot read staging secrets and vice versa. ExternalSecret manifests in each Kustomize overlay reference the correct path prefix. For the Terraform/infrastructure side, use separate AWS accounts per environment (the AWS Control Tower model) or at minimum separate IAM roles with strict path-based conditions. Never copy secrets between environments - treat each environment's secrets as independent. Rotate secrets in each environment independently on a schedule.

Environment Parity Checklist

Use this checklist before every production promotion:

PRE-PROMOTION PARITY CHECKLIST
================================

Compute Parity
[ ] Container image digest is identical between staging and prod builds
[ ] Python package versions are identical (pip freeze diff = empty)
[ ] CUDA version and GPU driver version match (if GPU workload)
[ ] Kubernetes version is identical between staging and prod clusters

Data Parity
[ ] Staging feature store TTL configuration matches prod values
[ ] Preprocessing statistics (mean/std/min/max) were saved with the model artifact
[ ] Feature distribution KS test passed (p > 0.05 for all features)
[ ] No new features added to the serving pipeline without retraining

Configuration Parity
[ ] All environment variables present in prod ConfigMap exist in staging
[ ] No hardcoded staging-specific values in application code
[ ] Feature flag states are documented and accounted for
[ ] Timeout and retry values match prod settings

Code Parity
[ ] Training code version matches the model artifact metadata
[ ] Serving code version matches the Dockerfile used to build the image
[ ] Preprocessing pipeline version matches training run record in MLflow

Behavioral Validation
[ ] Shadow mode agreement rate > 98% over 24-hour window
[ ] Latency P99 in staging within 20% of prod baseline
[ ] No unexpected feature value distributions in serving logs
[ ] Integration tests pass against staging feature store with prod-sized data sample

This checklist is not bureaucracy - each item corresponds to a known class of production failure. The fraud model incident in the opening scenario would have been caught by three of them: the pip freeze diff under Compute Parity (library version mismatch), the feature store TTL check under Data Parity (TTL mismatch), and the saved preprocessing statistics check under Data Parity (normalization statistics mismatch). Running the checklist before every promotion is an inexpensive way to catch these failures before they reach production.

© 2026 EngineersOfAI. All rights reserved.