Environment Parity
The Model That Died Crossing the Environment Boundary
The fraud detection model had a 94.2% F1 score in staging. The team had spent six weeks tuning it, running cross-validation, reviewing SHAP explanations. The staging environment had integration tests that verified end-to-end feature pipelines. Performance was excellent. The stakeholders were excited.
Within four hours of the production deployment, the support queue began filling with false positives. Legitimate transactions were being blocked. The model that had been so carefully validated in staging was performing significantly worse in production - not because of model drift, but because of environment drift. Three distinct problems had stacked on top of each other.
First, the feature engineering pipeline ran a different version of the pandas library in each environment: 1.5.3 in production versus 2.0.1 in staging. A subtle behavior change in how groupby().transform() handled null values produced different feature distributions. Second, the production feature store had a time-to-live setting of 3600 seconds on cached feature values, meaning peak-hour traffic (when fraud is highest) could be served features up to an hour old. Staging had TTL disabled. Third, the transaction amount was normalized in staging using statistics computed from the staging dataset (which was a 30-day sample), but production normalized against statistics from the full 3-year history - a different mean and standard deviation.
The model had not failed. The environment had. And the team had no systematic way to detect these discrepancies before deployment.
Environment parity is the discipline of making your non-production environments behave as close to production as possible - not just at the infrastructure level, but at the data, library, configuration, and behavioral level. It is one of the hardest problems in MLOps, and ignoring it is responsible for a disproportionate share of model failures in production.
Why This Exists
The "works on my machine" problem has plagued software engineering for decades. Docker solved much of it for application code. But ML systems have additional sources of environment divergence that containers alone cannot fix: data distributions, external service behavior, feature computation timing, and statistical preprocessing parameters.
The 12-Factor App methodology (published by Heroku in 2011) articulated the dev/prod parity principle for web applications: keep development, staging, and production as similar as possible. For ML, this principle needs to extend to four dimensions simultaneously - compute, data, code, and configuration. Getting any one of these wrong produces a model that looks great until it meets production.
The Four Dimensions of ML Environment Parity
Compute Parity - Containers Are Necessary But Not Sufficient
Containers solve OS and library version parity. But container parity requires discipline.
# Dockerfile.training - NEVER use floating tags
# Bad: FROM pytorch/pytorch:latest
# Good: pin exact digest
FROM pytorch/pytorch:2.2.0-cuda12.1-cudnn8-runtime@sha256:3a9c5c9e7...
# Pin ALL Python packages - no version ranges in production training images
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# requirements.txt - pin everything, including transitive deps
# torch==2.2.0+cu121
# pandas==2.1.4
# scikit-learn==1.4.0
# numpy==1.26.3
# pyarrow==14.0.2
# feast==0.35.0
# Generate pinned requirements from a working environment
pip freeze > requirements.txt
# Or use pip-compile for reproducible resolution
pip-compile requirements.in --output-file requirements.txt --generate-hashes
# Verify your container runs the exact same code in all environments
docker run --rm myimage python -c "import torch; print(torch.__version__)"
# Expected: 2.2.0+cu121 in all environments
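Pinning only helps if someone verifies that the pins actually match across environments. One way to do that is to capture `pip freeze` from each image and diff the results. The sketch below is a minimal, standard-library-only version; the function names `parse_freeze` and `diff_environments` are illustrative rather than part of any tool mentioned here, and it only understands plain `name==version` lines.

```python
def parse_freeze(text: str) -> dict[str, str]:
    """Parse `pip freeze` output into {package_name: version}."""
    packages = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and lines that are not simple pins
        # (editable installs, direct URLs).
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        # Package names are case-insensitive and treat "-" and "_" alike.
        packages[name.lower().replace("_", "-")] = version
    return packages


def diff_environments(staging: str, prod: str) -> dict[str, tuple]:
    """Return {package: (staging_version, prod_version)} for every mismatch."""
    s, p = parse_freeze(staging), parse_freeze(prod)
    return {
        name: (s.get(name), p.get(name))
        for name in sorted(set(s) | set(p))
        if s.get(name) != p.get(name)
    }


if __name__ == "__main__":
    staging_freeze = "pandas==2.0.1\nnumpy==1.26.3\nscikit-learn==1.4.0\n"
    prod_freeze = "pandas==1.5.3\nnumpy==1.26.3\n"
    for pkg, (stg, prd) in diff_environments(staging_freeze, prod_freeze).items():
        print(f"{pkg}: staging={stg} prod={prd}")
```

A package missing from one side shows up with `None` for that environment, which is usually as important as a version mismatch.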
Hardware Parity
Training on an A100 and serving on a T4 is fine - but be aware of numerical differences. torch.float16 computation on different GPU architectures can produce subtly different results. For models where prediction consistency across hardware matters, test explicitly.
# test_hardware_consistency.py
# Run this on both staging (T4) and prod (A100) GPUs
import torch
import numpy as np
def test_numeric_consistency():
    """Verify model outputs are consistent across GPU models."""
    # load_model is the project's own loader (not shown here)
    model = load_model("fraud-detector-v2.1")
    test_input = torch.load("consistency_test_inputs.pt")
    with torch.inference_mode():
        output = model(test_input)
    # Save output on staging, compare on prod
    expected = np.load("expected_outputs_staging.npy")
    actual = output.cpu().numpy()
    max_diff = np.max(np.abs(actual - expected))
    assert max_diff < 1e-4, (
        f"Max difference {max_diff} exceeds threshold - "
        "hardware produces inconsistent outputs"
    )
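A single absolute threshold can be too strict for float16 outputs and too loose for large-magnitude ones. A hedged alternative is to combine a relative and an absolute tolerance, in the spirit of `numpy.allclose`. The sketch below uses only the standard library; `find_mismatches` and its default tolerances are illustrative assumptions, not values taken from the test above.

```python
import math


def find_mismatches(expected, actual, rel_tol=1e-3, abs_tol=1e-5):
    """Return (index, expected, actual) for every pair outside tolerance."""
    if len(expected) != len(actual):
        raise ValueError("output lengths differ between environments")
    return [
        (i, e, a)
        for i, (e, a) in enumerate(zip(expected, actual))
        # isclose passes when |e - a| <= max(rel_tol * max(|e|, |a|), abs_tol)
        if not math.isclose(e, a, rel_tol=rel_tol, abs_tol=abs_tol)
    ]


if __name__ == "__main__":
    staging_scores = [0.5, 100.0, 0.0]
    prod_scores = [0.5002, 100.05, 0.000001]
    print(find_mismatches(staging_scores, prod_scores))
```

Returning the offending indices, rather than one maximum, makes it easier to see whether a few inputs diverge badly or every output drifts a little.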
Data Parity - The Hardest Problem
The Feature Skew Problem
The most common production failure in ML is training-serving skew: the features the model sees at serving time are computed differently from the features used at training time. This has three root causes:
- Code skew: feature computation logic differs between training pipeline (batch) and serving pipeline (online)
- Data skew: training data is a historical snapshot; serving data is live, with different distribution
- Temporal skew: features are computed at different points in time (batch vs real-time)
# The danger: computing features differently in training vs serving
# Training pipeline (Python, batch processing):
def compute_user_fraud_score(transactions_df):
    """Compute rolling fraud signal from transaction history."""
    return transactions_df.groupby("user_id")["amount"].apply(
        lambda x: x.rolling(30, min_periods=1).mean()
    ).reset_index(level=0, drop=True)
# Serving pipeline (Java, real-time):
# ... completely different code, different rolling window implementation
# Different handling of edge cases (first transaction, NULL values)
# Solution: Feature Store with registered transformation logic
# Both training and serving call the same compute function
# No code divergence possible
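The "same compute function" idea does not require a feature store to demonstrate. The sketch below, written with the standard library only, puts the rolling-mean logic in one function and has both a batch path and an online path call it. The names `rolling_mean`, `batch_features`, and `online_feature` are hypothetical, and the window semantics (last 30 rows, at least 1) are an assumption chosen to mirror the pandas call above.

```python
from collections import deque


def rolling_mean(amounts, window=30):
    """Rolling mean over the last `window` values, oldest first (min_periods=1)."""
    recent = deque(maxlen=window)
    means = []
    for amount in amounts:
        recent.append(amount)  # deque drops the oldest value once it is full
        means.append(sum(recent) / len(recent))
    return means


def batch_features(history_by_user):
    """Training path: one feature value per historical transaction."""
    return {user: rolling_mean(amounts) for user, amounts in history_by_user.items()}


def online_feature(recent_amounts):
    """Serving path: feature value for the newest transaction only."""
    return rolling_mean(recent_amounts)[-1]


if __name__ == "__main__":
    history = {"user_123": [10.0, 20.0, 30.0]}
    print(batch_features(history))               # training-time features
    print(online_feature(history["user_123"]))   # serving-time feature
```

Because both paths share `rolling_mean`, an edge-case decision (such as how the first transaction is handled) is made once, and a unit test can assert that the last batch value equals the online value for the same history.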
Feature Store Snapshots for Data Parity
# Using Feast for training-serving parity
from feast import FeatureStore
import pandas as pd

fs = FeatureStore(repo_path="feature_repo/")

# Training: point-in-time correct historical features
# This is what the model WOULD have seen at each training example's timestamp
# (training_labels is assumed to be a DataFrame of labeled transactions)
entity_df = pd.DataFrame({
    "user_id": training_labels["user_id"],
    "event_timestamp": training_labels["transaction_time"],  # Past timestamps!
})
training_data = fs.get_historical_features(
    entity_df=entity_df,
    features=[
        "user_features:tx_count_7d",
        "user_features:avg_amount_30d",
        "user_features:fraud_score",
        "merchant_features:fraud_rate",
    ],
).to_df()

# Serving: same features, but computed as of NOW
# Guaranteed to use the same computation logic
online_features = fs.get_online_features(
    features=[
        "user_features:tx_count_7d",
        "user_features:avg_amount_30d",
        "user_features:fraud_score",
        "merchant_features:fraud_rate",
    ],
    entity_rows=[{"user_id": "user_123"}],
).to_dict()
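"Point-in-time correct" means that each training row only sees feature values that already existed at that row's timestamp. The following is a conceptual sketch of that lookup using the standard library; it is not how Feast implements it (a real offline store performs this as a join and also applies TTLs), and `point_in_time_value` is a hypothetical name.

```python
from bisect import bisect_right
from datetime import datetime


def point_in_time_value(feature_history, as_of):
    """
    Return the latest feature value whose timestamp is <= as_of.

    feature_history: list of (timestamp, value) tuples sorted by timestamp.
    Returns None when no value had been written yet at `as_of`.
    """
    timestamps = [ts for ts, _ in feature_history]
    idx = bisect_right(timestamps, as_of)  # number of entries at or before as_of
    return feature_history[idx - 1][1] if idx else None


if __name__ == "__main__":
    tx_count_7d = [
        (datetime(2024, 1, 1), 3),
        (datetime(2024, 1, 10), 7),
    ]
    # A transaction on Jan 5 must see the Jan 1 value, not the later Jan 10 one.
    print(point_in_time_value(tx_count_7d, datetime(2024, 1, 5)))
```

Using the latest value regardless of timestamp would leak future information into training and make offline metrics look better than serving can reproduce.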
Feature Statistics for Parity Checking
# feature_parity_check.py
# Run this as a pre-deployment gate before promoting to production
import numpy as np
from scipy import stats
from feast import FeatureStore
def check_feature_distribution_parity(
    staging_fs: FeatureStore,
    prod_fs: FeatureStore,
    feature_names: list[str],
    sample_entity_ids: list[str],
    p_value_threshold: float = 0.05,
) -> dict:
    """
    Compare feature distributions between staging and production feature stores.
    Uses KS test to detect significant distribution differences.
    """
    results = {}
    entity_rows = [{"user_id": uid} for uid in sample_entity_ids]
    for feature in feature_names:
        key = feature.split(":")[1]  # "view:feature" -> "feature"
        staging_values = staging_fs.get_online_features(
            features=[feature],
            entity_rows=entity_rows,
        ).to_dict()[key]
        prod_values = prod_fs.get_online_features(
            features=[feature],
            entity_rows=entity_rows,
        ).to_dict()[key]
        # Drop missing values once: None breaks both the KS test and the means
        staging_clean = [v for v in staging_values if v is not None]
        prod_clean = [v for v in prod_values if v is not None]
        # Kolmogorov-Smirnov test for distribution equality
        ks_stat, p_value = stats.ks_2samp(staging_clean, prod_clean)
        results[feature] = {
            "ks_statistic": ks_stat,
            "p_value": p_value,
            "distributions_match": p_value > p_value_threshold,
            "staging_mean": np.mean(staging_clean),
            "prod_mean": np.mean(prod_clean),
        }
    failing_features = [
        f for f, r in results.items() if not r["distributions_match"]
    ]
    if failing_features:
        raise ValueError(
            f"Feature distribution mismatch detected: {failing_features}. "
            f"Promotion to production blocked."
        )
    return results
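To make the gate less of a black box, it helps to see what the KS statistic measures: the largest vertical gap between the two empirical cumulative distribution functions. The sketch below computes only that statistic in pure Python (no p-value); `ks_statistic` is an illustrative name, and in practice `scipy.stats.ks_2samp` remains the tool to use.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: max |ECDF_a(x) - ECDF_b(x)| over all observed x."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    # Both ECDFs are step functions that only change at observed values,
    # so checking every observed value is enough to find the largest gap.
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


if __name__ == "__main__":
    staging_amounts = [1, 2, 3, 4]
    prod_amounts = [3, 4, 5, 6]
    print(ks_statistic(staging_amounts, prod_amounts))
```

A statistic of 0 means the samples have identical empirical distributions, and 1 means they do not overlap at all. Logging the statistic next to the p-value is useful because, with large samples, very small gaps can still fall below the p-value threshold.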
Kustomize Overlays - Infrastructure Configuration Parity
Kustomize is the standard Kubernetes tool for environment-specific configuration without code duplication. You define a base configuration and apply environment-specific overlays.
Directory Structure
model-deployments/
├── base/
│   ├── kustomization.yaml
│   ├── rollout.yaml          # Argo Rollout definition
│   ├── service.yaml
│   ├── hpa.yaml              # HorizontalPodAutoscaler
│   └── configmap.yaml
└── overlays/
    ├── dev/
    │   ├── kustomization.yaml
    │   └── resources-patch.yaml
    ├── staging/
    │   ├── kustomization.yaml
    │   ├── resources-patch.yaml
    │   └── replicas-patch.yaml
    └── prod/
        ├── kustomization.yaml
        ├── resources-patch.yaml
        ├── replicas-patch.yaml
        └── hpa-patch.yaml
# model-deployments/base/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - rollout.yaml
  - service.yaml
  - hpa.yaml
  - configmap.yaml
commonLabels:
  app: fraud-detector
  managed-by: kustomize
images:
  - name: fraud-detector
    newName: 123456789.dkr.ecr.us-east-1.amazonaws.com/fraud-detector
    newTag: v2.1.0  # Updated by CI pipeline
# model-deployments/base/rollout.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: fraud-detector
spec:
  replicas: 2  # Base value - overridden by overlays
  selector:
    matchLabels:
      app: fraud-detector
  template:
    metadata:
      labels:
        app: fraud-detector  # Must match spec.selector.matchLabels
    spec:
      containers:
        - name: model
          image: fraud-detector  # Kustomize replaces this with the full registry URL
          resources:
            requests:
              cpu: "500m"
              memory: "1Gi"
            limits:
              cpu: "1"
              memory: "2Gi"
          env:
            - name: FEATURE_STORE_ENDPOINT
              valueFrom:
                configMapKeyRef:
                  name: fraud-detector-config
                  key: FEATURE_STORE_ENDPOINT
            - name: MODEL_THRESHOLD
              valueFrom:
                configMapKeyRef:
                  name: fraud-detector-config
                  key: MODEL_THRESHOLD
  strategy:
    canary:
      steps:
        - setWeight: 20
        - pause: {duration: 5m}
        - setWeight: 100
# model-deployments/overlays/dev/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
namePrefix: dev-  # All resources prefixed with "dev-"
namespace: ml-dev
patches:
  - path: resources-patch.yaml
    target:
      kind: Rollout
      name: fraud-detector
configMapGenerator:
  - name: fraud-detector-config
    behavior: merge
    literals:
      - FEATURE_STORE_ENDPOINT=http://feature-store.ml-dev:8080
      - MODEL_THRESHOLD=0.80  # Lower threshold in dev for testing
      - LOG_LEVEL=DEBUG
      - SHADOW_MODE=false
# model-deployments/overlays/dev/resources-patch.yaml
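# Strategic merge patch - only specify what changes from base
# Rollout is a custom resource: merging the containers list by name relies on
# Kustomize knowing the Rollout schema (for example via the openapi field).
# Without it, the list in this patch replaces the base list.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: fraud-detector
spec:
  replicas: 1  # 1 replica in dev (not 2)
  template:
    spec:
      containers:
        - name: model
          resources:
            requests:
              cpu: "200m"  # Smaller CPU in dev
              memory: "512Mi"
            limits:
              cpu: "500m"
              memory: "1Gi"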
# model-deployments/overlays/prod/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
namespace: ml-prod
patches:
  - path: resources-patch.yaml
    target:
      kind: Rollout
      name: fraud-detector
  - path: hpa-patch.yaml
    target:
      kind: HorizontalPodAutoscaler
      name: fraud-detector
configMapGenerator:
  - name: fraud-detector-config
    behavior: merge
    literals:
      - FEATURE_STORE_ENDPOINT=http://feature-store.ml-prod:8080
      - MODEL_THRESHOLD=0.85
      - LOG_LEVEL=INFO
      - SHADOW_MODE=false
      - CACHE_TTL_SECONDS=60  # Short TTL for fresher features; set the same value in staging
# model-deployments/overlays/prod/resources-patch.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: fraud-detector
spec:
  replicas: 8  # 8 replicas in production
  template:
    spec:
      containers:
        - name: model
          resources:
            requests:
              cpu: "2"
              memory: "4Gi"
            limits:
              cpu: "4"
              memory: "8Gi"
  strategy:
    canary:
      steps:
        - setWeight: 10  # More conservative canary in prod
        - pause: {duration: 10m}
        - analysis:
            templates:
              - templateName: fraud-detector-latency
        - setWeight: 50
        - pause: {duration: 10m}
        - setWeight: 100
Environment Comparison Table
| Dimension | Dev | Staging | Production |
|---|---|---|---|
| Replicas | 1 | 2-3 | 8-20 (autoscaled) |
| CPU request | 200m | 500m | 2 cores |
| Memory request | 512Mi | 1Gi | 4Gi |
| Instance type | t3.medium | m5.xlarge | m5.4xlarge |
| Feature store | Dev FS (small) | Staging FS (30-day sample) | Prod FS (full history) |
| Data | 1-day sample | 30-day sample | Full production data |
| Canary steps | 100% immediate | 50% → 100% | 10% → 50% → 100% |
| Analysis gates | None | Latency only | Latency + error rate + business |
| Secrets backend | Local k8s secrets | AWS SM (staging prefix) | AWS SM (prod prefix) |
| Log level | DEBUG | INFO | INFO |
| Monitoring | Optional | Basic | Full observability |
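The table shows values that differ by design. What should not differ silently is the set of configuration keys, or the handful of values (such as cache TTLs and timeouts) that change model behavior. A minimal sketch of such a check follows; it assumes each environment's ConfigMap has already been loaded into a plain dict, and the function name and the `must_match` parameter are hypothetical.

```python
def check_config_parity(staging, prod, must_match=()):
    """Return human-readable parity violations between two config dicts."""
    problems = []
    # Every key that prod relies on should also be exercised in staging.
    for key in sorted(set(prod) - set(staging)):
        problems.append(f"{key} is set in prod but missing in staging")
    # Behavior-affecting keys must carry the same value in both environments.
    for key in must_match:
        if staging.get(key) != prod.get(key):
            problems.append(
                f"{key} differs: staging={staging.get(key)!r} prod={prod.get(key)!r}"
            )
    return problems


if __name__ == "__main__":
    staging_cfg = {"MODEL_THRESHOLD": "0.85", "LOG_LEVEL": "INFO"}
    prod_cfg = {
        "MODEL_THRESHOLD": "0.85",
        "LOG_LEVEL": "INFO",
        "CACHE_TTL_SECONDS": "60",
    }
    for problem in check_config_parity(
        staging_cfg, prod_cfg, must_match=("MODEL_THRESHOLD", "CACHE_TTL_SECONDS")
    ):
        print(problem)
```

Keys that are expected to differ, such as endpoints or log levels, are simply left out of `must_match`, so the check encodes which differences are intentional.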
The Environment Promotion Pipeline
# .github/workflows/environment-promotion.yml
name: Environment Promotion
on:
  workflow_dispatch:
    inputs:
      model_version:
        description: 'Model version to promote (e.g., v2.1.0)'
        required: true
      target_environment:
        description: 'Target environment'
        required: true
        type: choice
        options: [staging, prod]
jobs:
  validate-staging-parity:
    runs-on: ubuntu-latest
    if: inputs.target_environment == 'prod'
    steps:
      - uses: actions/checkout@v4
      - name: Check feature distribution parity
        run: |
          python scripts/check_feature_parity.py \
            --staging-fs-config configs/feature-store-staging.yaml \
            --prod-fs-config configs/feature-store-prod.yaml \
            --sample-size 1000
      - name: Run shadow mode comparison
        run: |
          python scripts/shadow_comparison.py \
            --model-version ${{ inputs.model_version }} \
            --hours 24 \
            --min-agreement-rate 0.98
      - name: Validate library versions match
        run: |
          python scripts/check_library_parity.py \
            --staging-registry eai-staging \
            --prod-registry eai-prod \
            --image fraud-detector:${{ inputs.model_version }}
  promote-to-environment:
    needs: validate-staging-parity
    # The parity job is skipped for staging promotions. Without this condition,
    # a skipped dependency would cause this job to be skipped as well.
    if: ${{ always() && (needs.validate-staging-parity.result == 'success' || needs.validate-staging-parity.result == 'skipped') }}
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Update image tag in target environment
        run: |
          OVERLAY="model-deployments/overlays/${{ inputs.target_environment }}"
          cd "$OVERLAY"
          # name:tag sets newTag; name=value would overwrite the image name instead
          kustomize edit set image fraud-detector:${{ inputs.model_version }}
      - name: Create promotion PR
        uses: peter-evans/create-pull-request@v6
        with:
          commit-message: "promote: fraud-detector ${{ inputs.model_version }} to ${{ inputs.target_environment }}"
          title: "Promote fraud-detector ${{ inputs.model_version }} to ${{ inputs.target_environment }}"
          body: |
            ## Promotion Request
            **From**: staging (validated)
            **To**: ${{ inputs.target_environment }}
            **Version**: ${{ inputs.model_version }}
            ### Pre-promotion Checks Passed
            - Feature distribution parity: PASSED
            - Shadow mode agreement rate: PASSED
            - Library version consistency: PASSED
            ### Required Reviews
            - [ ] ML engineer sign-off on evaluation metrics
            - [ ] Platform engineer sign-off on infrastructure changes
Shadow Mode Testing - The Safety Net
Shadow mode runs the new model against production traffic, compares outputs to the current model, but never uses the new model's predictions for actual decisions. It is the safest way to validate production behavior before real exposure.
# shadow_mode_middleware.py
import asyncio
import logging
import time
from dataclasses import dataclass
from typing import Any

import httpx

logger = logging.getLogger(__name__)


@dataclass
class PredictionResult:
    model_version: str
    prediction: Any
    confidence: float
    latency_ms: float


class ShadowModeRouter:
    """
    Routes each request to both stable and shadow models.
    Returns stable model response. Logs comparison asynchronously.
    """

    def __init__(
        self,
        stable_endpoint: str,
        shadow_endpoint: str,
        metrics_client,
    ):
        self.stable_endpoint = stable_endpoint
        self.shadow_endpoint = shadow_endpoint
        self.metrics = metrics_client
        # Strong references to in-flight shadow tasks (see predict)
        self._background_tasks: set[asyncio.Task] = set()

    async def predict(self, request_data: dict) -> PredictionResult:
        """Predict using stable model. Shadow compare in background."""
        # Call stable model (blocking - this is what the user gets)
        stable_result = await self._call_model(self.stable_endpoint, request_data)
        # Fire shadow call without waiting (non-blocking)
        task = asyncio.create_task(
            self._shadow_compare(request_data, stable_result)
        )
        # The event loop only keeps weak references to tasks, so hold one here
        # until the task finishes to keep it from being garbage-collected
        self._background_tasks.add(task)
        task.add_done_callback(self._background_tasks.discard)
        return stable_result

    async def _call_model(self, endpoint: str, data: dict) -> PredictionResult:
        start = time.monotonic()
        async with httpx.AsyncClient(timeout=5.0) as client:
            response = await client.post(f"{endpoint}/predict", json=data)
            response.raise_for_status()
        latency_ms = (time.monotonic() - start) * 1000
        result = response.json()
        return PredictionResult(
            model_version=result["model_version"],
            prediction=result["prediction"],
            confidence=result["confidence"],
            latency_ms=latency_ms,
        )

    async def _shadow_compare(self, request_data: dict, stable: PredictionResult):
        """Compare shadow result to stable. Log discrepancies."""
        try:
            shadow = await self._call_model(self.shadow_endpoint, request_data)
            # Track agreement rate in metrics
            agreement = stable.prediction == shadow.prediction
            self.metrics.increment(
                "shadow_comparison",
                tags={
                    "stable_version": stable.model_version,
                    "shadow_version": shadow.model_version,
                    "agreement": str(agreement),
                },
            )
            if not agreement:
                logger.info(
                    "Shadow disagreement",
                    extra={
                        "stable_prediction": stable.prediction,
                        "shadow_prediction": shadow.prediction,
                        "stable_confidence": stable.confidence,
                        "shadow_confidence": shadow.confidence,
                    },
                )
            # Track latency comparison
            self.metrics.histogram(
                "shadow_latency_ms",
                shadow.latency_ms,
                tags={"model": shadow.model_version},
            )
        except Exception as e:
            # Shadow failures never affect the stable response
            logger.warning(f"Shadow call failed: {e}")
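The router only records comparisons; something still has to turn them into a promotion decision. The sketch below shows one way to compute an agreement rate and apply a threshold like the 0.98 used by the promotion pipeline. It assumes the comparisons have been exported as `(stable_prediction, shadow_prediction)` pairs; the function names and the `min_samples` guard are illustrative assumptions.

```python
def agreement_rate(comparisons):
    """Fraction of (stable_prediction, shadow_prediction) pairs that agree."""
    pairs = list(comparisons)
    if not pairs:
        raise ValueError("no shadow comparisons recorded")
    agreed = sum(1 for stable, shadow in pairs if stable == shadow)
    return agreed / len(pairs)


def passes_shadow_gate(comparisons, min_agreement_rate=0.98, min_samples=1000):
    """True only when there is enough traffic and the models agree often enough."""
    pairs = list(comparisons)
    if len(pairs) < min_samples:
        return False  # too little traffic to judge the shadow model
    return agreement_rate(pairs) >= min_agreement_rate


if __name__ == "__main__":
    logged = [("block", "block")] * 990 + [("allow", "block")] * 10
    print(agreement_rate(logged), passes_shadow_gate(logged))
```

The sample-size guard matters because a high agreement rate over a few dozen requests says little about the long tail of unusual transactions. For fraud models it is also worth inspecting the disagreements themselves, since a small number of them may be exactly the cases the new model was meant to fix.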
Cost-Efficient Staging Environments
Staging environments do not need to be full production scale. Right-size them for effective testing without burning budget.
# staging cluster - smaller instances, same configuration pattern
# environments/staging/terraform.tfvars
gpu_instance_type = "g4dn.xlarge" # vs p3.8xlarge in prod - cheaper, but a different GPU (T4 vs V100), so run the hardware consistency test
gpu_min_count = 0 # Scale to zero when idle
gpu_max_count = 4 # Cap to prevent runaway costs
cpu_instance_type = "m5.large" # vs m5.4xlarge in prod
cpu_desired_count = 2
mlflow_db_class = "db.t3.medium" # vs db.r6g.xlarge in prod
redis_node_type = "cache.t3.micro" # vs cache.r7g.large in prod
# Auto-shutdown staging environments outside business hours
# EventBridge Scheduler schedule that scales staging down at night
resource "aws_scheduler_schedule" "staging_scale_down" {
  name = "staging-scale-down-nights"

  flexible_time_window {
    mode = "OFF"
  }

  schedule_expression = "cron(0 20 ? * MON-FRI *)" # 8 PM weekdays (UTC)

  target {
    arn      = aws_lambda_function.scale_down_staging.arn
    role_arn = aws_iam_role.scheduler.arn
  }
}
Production Engineering Notes
Version everything, not just model code: The conda environment spec, the Docker base image digest, the feature store schema version, and the preprocessing statistics artifact should all be versioned and stored alongside the model. When debugging a production incident, you need to be able to reproduce the exact environment - not just the model weights.
Feature parity checks in CI: Run the feature distribution parity check as a required step in your PR pipeline, not just before production deployments. Catching feature skew when a data engineer changes a feature definition (not a model engineer deploying a model) requires continuous monitoring, not point-in-time checks.
Preprocessing artifact versioning: If your model requires normalization statistics (mean, std, min, max) computed from training data, store those statistics alongside the model artifact (in the same S3 path or MLflow run). Never recompute preprocessing statistics at serving time from the online feature store - the distributions will differ.
Blue-green for data migrations: When migrating the feature store schema (adding new features, renaming columns), use blue-green deployment. Keep the old schema running until all model versions are updated to use the new schema. Feature store schema changes are among the highest-risk events in an ML platform.
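The preprocessing artifact note above can be sketched in a few lines: compute the statistics once at training time, write them next to the model, and have serving load that file instead of recomputing anything. This is a minimal illustration using JSON and the standard library; the file name `preprocessing_stats.json` and the helper names are assumptions, and a real setup would write to the model's S3 path or MLflow run as described.

```python
import json
import statistics
import tempfile
from pathlib import Path

STATS_FILE = "preprocessing_stats.json"  # hypothetical artifact name


def fit_normalizer(values):
    """Training time: compute normalization statistics from training data."""
    return {"mean": statistics.fmean(values), "std": statistics.pstdev(values)}


def save_stats(stats, model_dir):
    """Store the statistics alongside the model artifact."""
    path = Path(model_dir) / STATS_FILE
    path.write_text(json.dumps(stats))
    return path


def load_stats(model_dir):
    """Serving time: load the exact statistics used in training."""
    return json.loads((Path(model_dir) / STATS_FILE).read_text())


def normalize(amount, stats):
    """Apply training-time statistics; assumes std is non-zero."""
    return (amount - stats["mean"]) / stats["std"]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as model_dir:
        save_stats(fit_normalizer([10.0, 20.0, 30.0]), model_dir)
        serving_stats = load_stats(model_dir)
        print(normalize(30.0, serving_stats))
```

Because the serving path never sees raw training data, there is no opportunity for a 30-day sample and a 3-year history to produce two different means for the same model.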
Common Mistakes
:::danger Never Use Production Data in Staging Without Anonymization Using raw production data in staging environments is a compliance and security violation in almost every regulated industry. Use synthetic data, or anonymized/tokenized production data. The data distribution matters for testing - you can approximate it without using real PII. :::
:::danger Preprocessing Statistics Must Match Between Training and Serving This is the single most common cause of training-serving skew. If you normalize by mean/std during training, those exact mean and std values must be used at serving time - not recomputed from the online feature store. Store them in your model artifact and load them in your serving container. :::
:::warning Staging Feature Store TTL Must Match Production The time-to-live settings on your feature store's online cache affect which features your model sees. If staging has TTL=∞ and production has TTL=3600, your model sees stale features in production that it never saw in staging. This mismatch was one of the three failures in the opening scenario. Always check TTL settings when comparing environments. :::
:::warning "Same infrastructure" Is Not Enough - Test the Behavior Two environments can be running identical Kubernetes manifests with identical container images and still produce different model outputs due to data distribution differences. Structural parity (same code, same config) is necessary but not sufficient. Behavioral parity (model outputs follow the expected distribution) requires active monitoring and shadow mode testing. :::
Interview Q&A
Q: What is training-serving skew and how do you prevent it?
Training-serving skew is when the features a model sees at inference time differ from the features it was trained on. It has several root causes: (1) Code skew - feature computation logic implemented differently in the training pipeline (Python/batch) vs serving pipeline (Java/real-time). Prevention: use a Feature Store with registered transformation logic that both training and serving call - the same code path computes features in both cases. (2) Data skew - the training data distribution differs from the production distribution (time shift, population shift). Prevention: monitor feature distributions continuously and alert when they diverge from training-time distributions. (3) Temporal skew - features are computed at different points in time (batch vs real-time), so serving reads values of a different age than training did. Prevention: build training sets with point-in-time correct feature retrieval and keep cache TTL settings aligned across environments. (4) Statistical skew - preprocessing statistics (normalization mean/std, label encodings) computed from training data but not saved, then recomputed differently at serving time. Prevention: save preprocessing artifacts alongside the model and load them at serving time.
Q: How do you test that staging behaves like production for ML models?
Five-layer testing strategy: (1) Library parity: check that the production and staging container images have identical Python package versions - run pip freeze in both and diff. (2) Feature distribution parity: fetch the same set of entity features from staging and production feature stores for a representative sample of entities, run a KS test, and block promotion if distributions differ significantly. (3) Preprocessing consistency: apply the staging preprocessing pipeline and the production preprocessing pipeline to identical raw inputs and verify identical outputs. (4) Shadow mode comparison: run the new model against a sample of real production traffic without acting on its predictions, compare output distributions to the current model. Target 98%+ agreement rate before full promotion. (5) Latency profiling: verify that staging latency at similar load levels is comparable to production - significant latency divergence often indicates hardware or configuration differences.
Q: Walk me through a Kustomize overlay structure for a fraud detection model across dev, staging, and prod.
Base directory contains the canonical Kubernetes manifests: Argo Rollout, Service, HPA, ConfigMap. Base images use generic names that Kustomize replaces. Base resource requests are conservative (suitable for a medium environment). Overlays are in three directories. Dev overlay: namePrefix dev-, namespace ml-dev, patches that reduce replicas to 1 and cut CPU/memory requests (200m CPU, 512Mi memory), ConfigMap with dev feature store endpoint and DEBUG logging, immediate rollout strategy (no canary). Staging overlay: namespace ml-staging, 2-3 replicas, medium resource requests, staging feature store endpoint, simple 50%→100% canary. Prod overlay: namespace ml-prod, 8+ replicas matching production traffic, full resource requests, production feature store endpoint, conservative 10%→50%→100% canary with analysis gates. The key principle: identical structure (same manifests, same labels, same configuration shape), environment-specific values only.
Q: What is shadow mode testing and when would you use it over staging validation?
Shadow mode runs a new model version against real production traffic, comparing its outputs to the production model but never acting on them. Use shadow mode when: (1) staging data does not adequately represent production traffic patterns (common for fraud, where staged data lacks the long tail of unusual transactions); (2) you need to validate latency and resource consumption at production scale, which staging cannot replicate; (3) the model change is high-risk and you want production signal before committing to a canary rollout; (4) you want to validate behavior on data from the future (shadow mode runs on real-time data, staging validation uses historical data). Shadow mode adds infrastructure cost (running two model serving pods) and complexity (async comparison logic, metrics tracking), so it is not always necessary - use it when staging validation leaves unacceptable uncertainty.
Q: How do you manage secrets across dev, staging, and prod environments without duplicating sensitive data?
Use a hierarchical secrets management approach. All secrets live in AWS Secrets Manager (or equivalent) with path-based namespacing: eai/dev/mlflow/db-password, eai/staging/mlflow/db-password, eai/prod/mlflow/db-password. The External Secrets Operator in each cluster is configured with an IRSA role that can only read secrets under its environment prefix - the prod cluster cannot read staging secrets and vice versa. ExternalSecret manifests in each Kustomize overlay reference the correct path prefix. For the Terraform/infrastructure side, use separate AWS accounts per environment (the AWS Control Tower model) or at minimum separate IAM roles with strict path-based conditions. Never copy secrets between environments - treat each environment's secrets as independent. Rotate secrets in each environment independently on a schedule.
Environment Parity Checklist
Use this checklist before every production promotion:
PRE-PROMOTION PARITY CHECKLIST
================================
Compute Parity
[ ] Container image digest is identical between staging and prod builds
[ ] Python package versions are identical (pip freeze diff = empty)
[ ] CUDA version and GPU driver version match (if GPU workload)
[ ] Kubernetes version is identical between staging and prod clusters
Data Parity
[ ] Feature store TTL configuration matches prod values in staging
[ ] Preprocessing statistics (mean/std/min/max) were saved with the model artifact
[ ] Feature distribution KS test passed (p > 0.05 for all features)
[ ] No new features added to the serving pipeline without retraining
Configuration Parity
[ ] All environment variables present in prod ConfigMap exist in staging
[ ] No hardcoded staging-specific values in application code
[ ] Feature flag states are documented and accounted for
[ ] Timeout and retry values match prod settings
Code Parity
[ ] Training code version matches the model artifact metadata
[ ] Serving code version matches the Dockerfile used to build the image
[ ] Preprocessing pipeline version matches training run record in MLflow
Behavioral Validation
[ ] Shadow mode agreement rate > 98% over 24-hour window
[ ] Latency P99 in staging within 20% of prod baseline
[ ] No unexpected feature value distributions in serving logs
[ ] Integration tests pass against staging feature store with prod-sized data sample
This checklist is not bureaucracy - each item corresponds to a known class of production failure. The fraud model incident in the opening scenario would have been caught by three of them: the Python package version check under Compute Parity (library version mismatch), and the feature store TTL check and the preprocessing statistics check under Data Parity (TTL mismatch and normalization statistics mismatch). Running the checklist before every promotion is the most reliable way to cut down on production surprises.
