ML Deployment Patterns - From Jupyter Notebook to Production at Scale
:::note Reading time and relevance 30–35 min read | Interview relevance: extremely high for ML Engineer and MLOps roles. Senior MLE interviews at Stripe, Uber, DoorDash, and Meta routinely ask about deployment strategy, latency optimization, and rollback procedures. :::
The Real Interview Moment
It is 2021. Stripe's ML team has spent three months training a new fraud detection model. Their existing model processes over $10 billion per day in transactions. Every API call to POST /v1/charges goes through the fraud model. The new model shows +8% precision at the same recall level on the holdout set. In dollar terms, that is tens of millions of dollars per year in reduced fraud losses.
Nobody on the team suggests rolling out the new model overnight.
The previous model had been in production for 18 months. Its behavior was understood - the team knew its false positive rate by merchant category, by geography, by transaction amount. They had alerting on every metric that mattered. They knew how it behaved under load at 3 AM on a Sunday (when fraudsters are most active). The new model, despite its better offline metrics, was an unknown in production.
The decision was to use shadow mode first: run the new model in parallel with the production model for two weeks. Every transaction goes through both models. The production model's output drives the actual block/allow decision. The new model's output is logged, but never acted on. After two weeks, the team compared the two models' decisions on 50 million transactions. They found three edge cases where the new model's behavior was unexpected - high false positive rates on a specific type of international transaction that was underrepresented in their training set. They retrained with a fix, ran shadow mode again, and only then began a canary rollout.
This is the difference between ML engineering and ML research. Getting the model right is 30% of the work. Getting the deployment right is the other 70%.
Why This Exists - The Cost of Bad Deployments
The history of ML deployment failures is long and expensive:
- Amazon's recruiting AI (2018): Deployed a hiring model trained on historical resumes. The historical data encoded gender bias. By the time it was caught, it had been screening resumes for over a year.
- Zillow's iBuying model (2021): A home pricing model was deployed with insufficient monitoring. When market conditions shifted rapidly, the model continued predicting pre-shift prices. Zillow took a $569 million write-down and shut down the iBuying business.
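- Knight Capital Group (2012): Not an ML system, but the canonical deployment failure. New trading code reached only some of the production servers, and a repurposed flag reactivated obsolete order-routing logic on a server still running the old code. The system executed about 4 million trades in 45 minutes, losing roughly $440 million. There was no fast, tested way to roll the release back.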
Every one of these failures had a deployment pattern that would have caught the problem: shadow mode, gradual rollout, monitoring, rollback capability. The patterns in this lesson are not bureaucratic overhead - they are learned from disasters.
Deployment Strategies - Comparing the Patterns
Blue-Green Deployment
Blue-green deployment maintains two identical production environments. At any time, one environment (blue) serves all traffic, and the other (green) is idle. When deploying a new model:
- Deploy the new model to the green environment
- Run smoke tests on green
- Switch the load balancer to route all traffic to green (instant cutover)
- If something is wrong, switch back to blue instantly (rollback in seconds)
Advantages: Zero-downtime deployment, instant rollback, simple to implement, clean separation between versions.
Disadvantages: Requires 2x infrastructure cost. The cutover is binary - either 0% or 100% of traffic sees the new model. No gradual validation in production.
Load Balancer
│
├── Blue (old model) ← currently serving
│
└── Green (new model) ← deployed, tested, ready
→ Switch:
Load Balancer
│
├── Blue (old model) ← idle, ready for rollback
│
└── Green (new model) ← now serving
When to use it: Stateless models, short warm-up time, when you need instant rollback capability, when infrastructure cost is acceptable.
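The cutover and rollback steps above amount to flipping a single pointer. A minimal sketch, assuming both models are already loaded in the same process; `BlueGreenRouter` and the stand-in lambda models are hypothetical names for illustration, and a real setup would flip a load balancer target instead:

```python
import threading
from typing import Any, Callable, Dict


class BlueGreenRouter:
    """Routes every request to exactly one of two environments."""

    def __init__(self, blue: Callable[[Any], Any], green: Callable[[Any], Any]):
        self._envs: Dict[str, Callable[[Any], Any]] = {"blue": blue, "green": green}
        self._live = "blue"              # environment currently serving traffic
        self._lock = threading.Lock()

    @property
    def live(self) -> str:
        return self._live

    def predict(self, features: Any) -> Any:
        # Read the pointer once so a single request never mixes environments
        return self._envs[self._live](features)

    def switch(self) -> str:
        """Cut over to the idle environment; calling it again is the rollback."""
        with self._lock:
            self._live = "green" if self._live == "blue" else "blue"
            return self._live


# Stand-in models for illustration only
router = BlueGreenRouter(blue=lambda x: "old", green=lambda x: "new")
before = router.predict({"amount": 10})
router.switch()                          # cutover: all traffic moves to green
after = router.predict({"amount": 10})
router.switch()                          # rollback: all traffic back on blue
rolled_back = router.predict({"amount": 10})
```

Because the old environment stays loaded, rollback costs the same as cutover: one pointer flip.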
Canary Deployment
A canary release gradually shifts traffic from the old model to the new model, monitoring metrics at each stage before committing to the next. Named after the "canary in a coal mine" - a small number of users experience the new model first, acting as an early warning system.
Typical rollout stages:
- 1% of traffic → monitor for 24 hours
- 5% of traffic → monitor for 48 hours
- 20% of traffic → monitor for 48–72 hours
- 50% of traffic → monitor for 1 week
- 100% of traffic → complete rollout
At each stage, monitor: error rates, latency (p50/p95/p99), business metrics (conversion rate, click-through rate, fraud rate). If any metric degrades beyond a threshold, automatically roll back to 0% and alert the team.
Advantages: Catches production issues that offline testing and shadow mode missed. Limits blast radius of a bad deployment. Provides confidence before full rollout.
Disadvantages: Slower than blue-green. Requires traffic-splitting infrastructure (a weighted load balancer, a service mesh, or feature flags). Consistent user experience is harder (some users see old model, some see new).
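The stage-by-stage gate described above can be sketched as a pure function. This is an illustration under stated assumptions: the metric names, the 10% tolerance, and `next_traffic_pct` itself are hypothetical, and every metric is treated as "lower is better".

```python
from typing import Dict, List

STAGES: List[int] = [1, 5, 20, 50, 100]   # percent of traffic, from the stages above


def next_traffic_pct(current_pct: int, canary: Dict[str, float],
                     baseline: Dict[str, float],
                     max_relative_increase: float = 0.10) -> int:
    """Return the next traffic percentage, or 0 to roll back.

    Every metric is assumed to be "lower is better" (error rate, p99 latency).
    The 10% tolerance is an illustrative threshold, not a recommendation.
    """
    for name, base_value in baseline.items():
        if canary[name] > base_value * (1 + max_relative_increase):
            return 0                       # degraded: roll back and alert
    later = [s for s in STAGES if s > current_pct]
    return later[0] if later else 100      # healthy: advance one stage


baseline = {"error_rate": 0.002, "p99_latency_ms": 80.0}
healthy = {"error_rate": 0.002, "p99_latency_ms": 82.0}
degraded = {"error_rate": 0.002, "p99_latency_ms": 140.0}

step_up = next_traffic_pct(1, healthy, baseline)
rollback = next_traffic_pct(5, degraded, baseline)
```

In practice the function would run only after the stage's monitoring window has elapsed, and business metrics where higher is better (conversion rate) need the comparison inverted.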
Shadow Mode
Shadow mode (also called dark launch or mirror testing) is the safest deployment strategy for high-stakes ML systems. The new model runs in parallel with the production model, receives the same inputs, but its outputs are never used for actual decisions - only logged for comparison.
The shadow mode process:
- Deploy the new model alongside the production model
- Every incoming request is processed by both models simultaneously
- The production model's decision is used (block/allow, rank, recommend)
- The new model's decision is logged to a database or stream
- Engineers compare the two models' decisions: agreement rate, divergence patterns, edge cases
- Only after validation does shadow mode end and the canary begin
Shadow mode answers the question: "Does the new model behave the way we expect, on real production traffic, without any risk to users?" This is particularly critical for:
- Fraud detection (false positives affect real customers)
- Medical diagnosis (wrong predictions have patient consequences)
- Content moderation (incorrect removal affects free expression)
- Any financial decision system
Implementation with dual serving:
import asyncio
from typing import Any
class ShadowModePredictor:
    def __init__(self, production_model, shadow_model, logger, shadow_timeout_s: float = 1.0):
        self.production = production_model
        self.shadow = shadow_model
        self.logger = logger
        self.shadow_timeout_s = shadow_timeout_s
        # Hold references so pending shadow tasks are not garbage-collected mid-run
        self._background_tasks = set()

    async def predict(self, features: dict) -> Any:
        # Production prediction is on the critical path
        production_result = await self.production.predict_async(features)
        # Shadow prediction runs in background - never blocks the response
        task = asyncio.create_task(self._shadow_predict_and_log(features, production_result))
        self._background_tasks.add(task)
        task.add_done_callback(self._background_tasks.discard)
        return production_result  # Only return production result

    async def _shadow_predict_and_log(self, features: dict, production_result: Any):
        try:
            # Bound the shadow call so a hung shadow model cannot pile up tasks
            shadow_result = await asyncio.wait_for(
                self.shadow.predict_async(features), timeout=self.shadow_timeout_s
            )
            self.logger.log({
                "features": features,
                "production_decision": production_result,
                "shadow_decision": shadow_result,
                "agreed": production_result == shadow_result,
            })
        except Exception as e:
            # Shadow errors and timeouts MUST NOT surface to users
            self.logger.log_error(f"Shadow model error: {e!r}")
The critical design constraint: shadow model failures must never affect the production response. If the shadow model crashes or times out, the production response proceeds normally. This is enforced by running the shadow prediction asynchronously and catching all exceptions.
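The comparison step of the shadow process (agreement rate and divergence patterns) can be sketched over the logged records. This assumes each record carries the `features`, `production_decision`, and `shadow_decision` fields logged above; the `region` feature and the `shadow_report` helper are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def shadow_report(records: Iterable[dict], segment_key: str) -> Tuple[float, Dict[str, float]]:
    """Return the overall agreement rate and the disagreement rate per segment."""
    total = agreed = 0
    seg_total: Dict[str, int] = defaultdict(int)
    seg_disagree: Dict[str, int] = defaultdict(int)
    for rec in records:
        segment = rec["features"].get(segment_key, "unknown")
        total += 1
        seg_total[segment] += 1
        if rec["production_decision"] == rec["shadow_decision"]:
            agreed += 1
        else:
            seg_disagree[segment] += 1
    by_segment = {s: seg_disagree[s] / n for s, n in seg_total.items()}
    return (agreed / total if total else 0.0), by_segment


# Tiny illustrative log; real analysis runs over millions of records
logs = [
    {"features": {"region": "domestic"}, "production_decision": "allow", "shadow_decision": "allow"},
    {"features": {"region": "domestic"}, "production_decision": "block", "shadow_decision": "block"},
    {"features": {"region": "international"}, "production_decision": "allow", "shadow_decision": "block"},
    {"features": {"region": "international"}, "production_decision": "allow", "shadow_decision": "allow"},
]
agreement, disagreement_by_region = shadow_report(logs, "region")
```

A high overall agreement rate can hide a segment where the models diverge sharply, which is why the per-segment breakdown matters more than the headline number.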
Feature Flags
Feature flags decouple model deployment from code deployment. A feature flag is a configuration toggle that enables or disables a model feature independently of the code release cycle:
from flagsmith import Flagsmith
flagsmith = Flagsmith(environment_key="your-env-key")
def get_recommendation(user_id: str, context: dict):
    flags = flagsmith.get_identity_flags(user_id)
    if flags.is_feature_enabled("use_transformer_ranker"):
        # New transformer-based ranking model
        return transformer_model.rank(context)
    else:
        # Existing XGBoost ranking model
        return xgboost_model.rank(context)
Feature flags allow: percentage rollout (enable for X% of users), user-segment targeting (enable for beta users), instant kill switch (disable immediately without a code deployment), and A/B test integration.
Serving Architectures - Batch vs Real-Time vs Streaming
The right serving architecture depends on when predictions are needed and how fresh they must be.
Batch Inference
Predictions are computed ahead of time for a known set of entities and stored for later retrieval:
Overnight job → Model scores all users → Predictions stored in database
→ User request → Read pre-computed prediction → Serve in <1ms
When to use batch inference:
- Entity set is bounded and known in advance (scoring all products in catalog)
- Prediction freshness requirements are loose (daily or hourly is fine)
- Computational cost is high (the model is too slow to run inside a request's latency budget, but a multi-hour offline job over millions of users is acceptable)
- Features are available in bulk (no need to fetch real-time features for each user)
Examples: Email recommendation engines (run nightly for all users), product catalog scoring (re-score all listings weekly), customer health scores for CRM systems.
Implementation pattern with Spark:
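import numpy as np
import pandas as pd
import mlflow
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("batch_inference").getOrCreate()
# Load production model from registry
model = mlflow.pyfunc.load_model("models:/fraud_model/Production")
# Score all transactions from yesterday
df = spark.read.parquet("s3://data-lake/transactions/2026-03-08/")

def score_batches(batches):
    # mapInPandas passes an iterator of pandas DataFrames and expects an iterator back
    for batch in batches:
        # Assumes the model returns one score per row; 0.5 is an illustrative threshold
        scores = np.asarray(model.predict(batch), dtype=float)
        yield pd.DataFrame({
            "user_id": batch["user_id"],
            "score": scores,
            "prediction": (scores >= 0.5).astype("int32"),
        })

# Apply model in parallel across Spark cluster
predictions_df = df.mapInPandas(
    score_batches,
    schema="user_id string, score double, prediction int",
)
predictions_df.write.parquet("s3://predictions/2026-03-08/", mode="overwrite")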
Real-Time Inference
The model is called at request time, producing a fresh prediction for each incoming request. Requires low latency (typically under 100ms) and auto-scaling to handle traffic spikes.
Architecture:
Client → API Gateway → Feature Service → Model Server → Response
↓
Feature Store (Redis)
(pre-computed features,
served in <5ms)
Latency budget breakdown:
- Network round trip: 10–20ms
- Feature lookup (Redis): 1–5ms
- Model inference: 5–50ms (depending on model complexity)
- Response serialization: 1–2ms
- Total budget: under 100ms (or under 50ms for high-speed systems)
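The breakdown above is simple arithmetic, but writing it down shows which component to attack first. A small sketch using only the ranges listed above; the variable names are illustrative:

```python
BUDGET_MS = 100

# (best case, worst case) in milliseconds, taken from the breakdown above
components = {
    "network_round_trip": (10, 20),
    "feature_lookup": (1, 5),
    "model_inference": (5, 50),
    "serialization": (1, 2),
}

best_case_ms = sum(low for low, _ in components.values())
worst_case_ms = sum(high for _, high in components.values())
headroom_ms = BUDGET_MS - worst_case_ms

# Component with the largest worst-case share of the budget
largest = max(components, key=lambda name: components[name][1])

# Under a stricter 50 ms budget, how much time is left for inference
# if everything else hits its worst case?
fixed_overhead_ms = worst_case_ms - components["model_inference"][1]
max_inference_for_50ms = 50 - fixed_overhead_ms
```

With these numbers the worst case is 77ms, leaving 23ms of headroom under a 100ms budget. Under a 50ms budget the model itself gets at most 23ms, which is why inference is usually the first optimization target.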
Auto-scaling is critical: a 10x traffic spike (flash sale, news event) must not degrade latency. Kubernetes Horizontal Pod Autoscaler (HPA) scales on CPU or memory utilization out of the box, and on custom or external metrics (requests per second, queue depth, GPU utilization) when a metrics adapter exposes them.
Streaming Inference
Predictions are made per event in a streaming pipeline - events flow through Kafka, get enriched with features, and run through the model:
Kafka topic (events) → Flink/Spark Streaming → Feature enrichment
→ Model inference → Output topic (predictions) → Downstream consumers
When to use streaming: Real-time fraud detection (detect before transaction settles), dynamic pricing (reprice as supply changes), content moderation (process new posts immediately), IoT sensor anomaly detection.
Streaming adds complexity: you must handle late-arriving events, out-of-order events, and model version consistency across a distributed fleet. Use Apache Flink or Spark Structured Streaming with Kafka.
Model Serving Stacks - When to Use Each
| Framework | Best for | Strengths | Limitations |
|---|---|---|---|
| TorchServe | PyTorch models | Native PyTorch, easy setup, good batching | PyTorch-only |
| TF Serving | TensorFlow/Keras | Mature, battle-tested at Google scale | TF-only, heavier |
| Triton Inference Server | Multi-framework, GPU serving | Supports TF/PyTorch/ONNX/TensorRT, optimal GPU utilization | Complex configuration |
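| ONNX Runtime | Cross-framework serving, especially on CPU | Universal format, fast CPU inference, GPU support through execution providers (CUDA, TensorRT), easy integration | A runtime, not a server: you supply the API layer and batching |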
| BentoML | Rapid deployment, flexibility | Framework-agnostic, simple API, good DX | Less optimized than Triton |
| vLLM | Large language models | PagedAttention, continuous batching, high-throughput LLM serving | LLM-specific |
General guidance:
- For small teams deploying PyTorch models to CPU: ONNX Runtime + FastAPI
- For GPU fleet serving multiple model types: Triton Inference Server
- For LLM deployments: vLLM
- For simplest possible deployment: BentoML
Optimization for Latency and Throughput
When your model does not meet the latency SLA in its original form, apply optimization techniques in order of complexity:
1. Request Batching
Group multiple requests together and run them through the model as a single batch. GPUs are most efficient when processing large batches. Dynamic batching (Triton) collects requests that arrive within a time window (e.g., 5ms) and processes them together:
# Triton model config (config.pbtxt)
# dynamic_batching {
# preferred_batch_size: [ 4, 8, 16 ]
# max_queue_delay_microseconds: 5000 # 5ms max wait
# }
Batching improves throughput but adds latency for individual requests. The tradeoff is controlled by the queue delay parameter.
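To make the tradeoff concrete, here is a simplified model of how a dynamic batcher groups requests. It is a sketch of the idea, not Triton's actual scheduler: a batch is dispatched when it is full or when its oldest request has waited `max_delay_ms`. The function name and the arrival times are hypothetical.

```python
from typing import List, Tuple


def form_batches(arrivals_ms: List[float], max_batch_size: int,
                 max_delay_ms: float) -> List[Tuple[float, List[float]]]:
    """Group sorted arrival times into (dispatch_time, members) batches."""
    batches: List[Tuple[float, List[float]]] = []
    current: List[float] = []
    for t in arrivals_ms:
        if current and t > current[0] + max_delay_ms:
            # Window expired before this request arrived: dispatch at the deadline
            batches.append((current[0] + max_delay_ms, current))
            current = []
        current.append(t)
        if len(current) == max_batch_size:
            # Batch is full: dispatch immediately, no further waiting
            batches.append((t, current))
            current = []
    if current:
        batches.append((current[0] + max_delay_ms, current))
    return batches


arrivals = [0.0, 1.0, 2.0, 3.0, 9.0, 20.0]
batches = form_batches(arrivals, max_batch_size=4, max_delay_ms=5.0)

# Extra queueing delay each request pays before inference starts
added_wait = [dispatch - t for dispatch, members in batches for t in members]
```

Under bursty traffic the batch fills quickly and requests wait little. Under sparse traffic every request pays the full queue delay and gets a batch of one, so the delay should be tuned against real arrival patterns.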
2. Quantization
Reduce numerical precision from FP32 to FP16 or INT8:
- FP16: ~2x memory reduction, ~1.5–2x speedup on GPUs with Tensor Cores, minimal accuracy loss
- INT8: ~4x memory reduction, ~2–4x speedup on CPU (Intel VNNI) and GPU (Tensor Cores), accuracy loss of 0.5–2%
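import torch
from torch.ao.quantization import quantize_dynamic
from onnxruntime.quantization import QuantType, quantize_dynamic as ort_quantize_dynamic

# Dynamic quantization: INT8 weights for Linear and LSTM layers (no calibration data needed)
model_int8 = quantize_dynamic(
    model,
    {torch.nn.Linear, torch.nn.LSTM},
    dtype=torch.qint8,
)

# For cross-platform serving, export the FP32 model to ONNX and quantize there:
# exporting an already-quantized PyTorch module to ONNX is not reliably supported.
torch.onnx.export(model, dummy_input, "model_fp32.onnx",
                  opset_version=14, do_constant_folding=True)
ort_quantize_dynamic("model_fp32.onnx", "model_int8.onnx", weight_type=QuantType.QInt8)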
For production, use Post-Training Static Quantization (requires calibration data) for better accuracy than dynamic quantization, or Quantization-Aware Training (QAT) for maximum accuracy.
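What quantization does to a tensor can be shown without any framework. The sketch below uses symmetric per-tensor INT8 quantization, one of several schemes; production toolchains add per-channel scales and calibration. The helper names are hypothetical.

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric quantization: q = round(x / scale), with scale = max|x| / 127."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0   # guard against all-zero input
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q: List[int], scale: float) -> List[float]:
    """Map int8 codes back to approximate float values."""
    return [v * scale for v in q]


weights = [-1.0, -0.25, 0.0, 0.5, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Rounding error is bounded by half a quantization step
max_error = max(abs(a - b) for a, b in zip(weights, restored))

# Storage: 4 bytes per FP32 weight versus 1 byte per INT8 weight
size_reduction = 4 / 1
```

The error bound of half a step explains why outlier weights hurt: one large value stretches `scale` and coarsens the grid for every other weight in the tensor.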
3. Knowledge Distillation
Train a smaller "student" model to mimic the outputs of a larger "teacher" model. The student learns from the teacher's soft probability outputs (which carry more information than hard labels) and can achieve 80–90% of the teacher's performance at 10–20% of the computational cost.
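A common form of the distillation objective is:

$$\mathcal{L} = \alpha \, \mathcal{L}_{\text{CE}}\big(y, \sigma(z_s)\big) + (1 - \alpha) \, T^2 \, \mathrm{KL}\big(\sigma(z_t / T) \,\|\, \sigma(z_s / T)\big)$$

where $z_s$ and $z_t$ are the student and teacher logits, $\sigma$ is the softmax, $T$ is the temperature (controls softness of the teacher distribution), and $\alpha$ balances task loss vs distillation loss. DistilBERT (Sanh et al., 2019) retained 97% of BERT's performance while being 40% smaller and 60% faster at inference using this technique.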
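As a sketch of the objective, the loss for a single example can be written in plain Python. It assumes the common formulation: cross-entropy on the hard label weighted by `alpha`, plus a temperature-scaled KL term between teacher and student weighted by `1 - alpha` and multiplied by `T^2`. The default values are illustrative only.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    scaled = [z / temperature for z in logits]
    m = max(scaled)                               # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits: List[float], teacher_logits: List[float],
                      label: int, temperature: float = 2.0, alpha: float = 0.5) -> float:
    """alpha * CE(hard label) + (1 - alpha) * T^2 * KL(teacher || student)."""
    task_loss = -math.log(softmax(student_logits)[label])
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student))
    return alpha * task_loss + (1 - alpha) * temperature ** 2 * kl


teacher = [4.0, 1.0, 0.5]
matching_student = [4.0, 1.0, 0.5]     # reproduces the teacher exactly
poor_student = [0.5, 1.0, 4.0]         # confident in the wrong class

loss_match = distillation_loss(matching_student, teacher, label=0)
loss_poor = distillation_loss(poor_student, teacher, label=0)
```

When the student matches the teacher, the KL term vanishes and only the task loss remains. Raising the temperature flattens both distributions so the student also learns from the teacher's relative ranking of wrong classes.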
4. Model Pruning
Remove weights that contribute minimally to predictions. Unstructured pruning zeroes out individual weights; structured pruning removes entire neurons/filters/attention heads (more hardware-friendly).
import torch
import torch.nn.utils.prune as prune

# Prune 30% of weights in each Linear layer (unstructured, smallest L1 magnitude first)
for module_name, module in model.named_modules():
    if isinstance(module, torch.nn.Linear):
        prune.l1_unstructured(module, name='weight', amount=0.30)
        prune.remove(module, 'weight')  # make pruning permanent
# Fine-tune for 1–2 epochs to recover accuracy
5. TensorRT Optimization
NVIDIA TensorRT compiles a trained model into an optimized engine specifically for the deployment GPU. It fuses operations, selects optimal kernel implementations, and applies automatic precision reduction:
import tensorrt as trt

# Build TensorRT engine from ONNX model
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        # Surface parser errors instead of building from a partial network
        errors = [str(parser.get_error(i)) for i in range(parser.num_errors)]
        raise RuntimeError(f"ONNX parse failed: {errors}")
config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # Enable FP16
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)  # 1GB
engine = builder.build_serialized_network(network, config)
TensorRT typically gives 2–8x speedup over vanilla PyTorch for GPU inference on NVIDIA hardware.
Model Registry - Versioning, Lineage, and Rollback
A model registry is the source of truth for model artifacts in production. It tracks: model version, training data version, performance metrics, deployment history, and approval status.
What a model registry must support:
- Artifact storage: Serialized model files, preprocessing pipeline, feature schema
- Versioning: Every model version is immutable and addressable
- Lineage: Which dataset, code version, and hyperparameters produced this model
- Stage transitions: development → staging → production
- Rollback: Revert to any previous production version instantly
import mlflow
from mlflow.tracking import MlflowClient
# Log model with metadata
with mlflow.start_run() as run:
    mlflow.log_params(best_params)
    mlflow.log_metrics({"val_auc": 0.923, "val_f1": 0.871})
    mlflow.xgboost.log_model(
        model,
        artifact_path="model",
        registered_model_name="fraud_detection_v2",
    )
# Promote to production
client = MlflowClient()
client.transition_model_version_stage(
name="fraud_detection_v2",
version="14",
stage="Production",
archive_existing_versions=True # auto-archive current production version
)
# Load production model in serving
production_model = mlflow.pyfunc.load_model(
"models:/fraud_detection_v2/Production"
)
Rollback procedure:
# If the new model is causing problems, promote the previous version back
client.transition_model_version_stage(
name="fraud_detection_v2",
version="13", # Previous production version
stage="Production",
archive_existing_versions=True
)
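# The registry pointer changes immediately, but load_model() does not refresh itself:
# serving instances pick up the change only when they re-resolve the "Production" URI
# (on restart, or through a periodic reload that you implement).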
:::tip Define rollback criteria before deployment Before promoting any model to production, define explicit rollback criteria: "If error rate exceeds 0.5%, or p99 latency exceeds 200ms, or booking conversion drops more than 2% relative to baseline, trigger automatic rollback." This prevents the "should we roll back or wait?" debate during a production incident. :::
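Criteria like the ones quoted above are easiest to enforce when they live in code rather than in a runbook. A minimal sketch using the example thresholds from the tip; `RollbackCriteria` and `rollback_reasons` are hypothetical names, and the metric values are made up for illustration:

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RollbackCriteria:
    max_error_rate: float = 0.005          # 0.5%
    max_p99_latency_ms: float = 200.0
    max_conversion_drop: float = 0.02      # 2% relative to baseline


def rollback_reasons(error_rate: float, p99_latency_ms: float,
                     conversion: float, baseline_conversion: float,
                     criteria: RollbackCriteria = RollbackCriteria()) -> List[str]:
    """Return every violated criterion; an empty list means keep serving."""
    reasons: List[str] = []
    if error_rate > criteria.max_error_rate:
        reasons.append(f"error rate {error_rate:.3%} above limit")
    if p99_latency_ms > criteria.max_p99_latency_ms:
        reasons.append(f"p99 latency {p99_latency_ms:.0f}ms above limit")
    relative_drop = (baseline_conversion - conversion) / baseline_conversion
    if relative_drop > criteria.max_conversion_drop:
        reasons.append(f"conversion down {relative_drop:.1%} relative to baseline")
    return reasons


healthy = rollback_reasons(0.001, 150.0, conversion=0.099, baseline_conversion=0.100)
conversion_regression = rollback_reasons(0.001, 150.0, conversion=0.095, baseline_conversion=0.100)
overloaded = rollback_reasons(0.010, 250.0, conversion=0.100, baseline_conversion=0.100)
```

Returning the list of reasons rather than a boolean gives the incident record the triggering metrics for free.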
Common Mistakes
:::danger Skipping shadow mode for high-stakes systems Any model that makes decisions with real consequences for users (fraud blocks, content removal, lending decisions) must run in shadow mode before going live. The offline holdout set does not reveal behavior on production traffic patterns, edge cases, or adversarial inputs that exist only in production. :::
:::danger No rollback procedure before deployment Every deployment must have a tested rollback procedure defined before it goes live. "We'll figure it out if something goes wrong" is not a plan. Rollback must be achievable in under 5 minutes and must not require a new code deployment. :::
:::warning Assuming batch inference is always cheaper Batch inference has lower per-prediction compute cost but higher engineering cost (pipeline orchestration, freshness management, storage). For low-volume models, the infrastructure overhead of batch jobs exceeds the cost of simple real-time serving. Choose based on total cost of ownership, not just compute unit cost. :::
:::warning Optimizing for throughput without measuring latency percentiles Average latency hides the tail. A model that serves 95% of requests in 50ms and 5% in 5 seconds has a median of 50ms and a mean near 300ms, yet one request in twenty takes 5 seconds. Always monitor p95 and p99 latency (the 95th and 99th percentile). p99 is the latency that the slowest 1% of requests meet or exceed, and at high request volume that is many users every minute. :::
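The example in the warning can be checked directly. The sketch below uses a nearest-rank percentile over a synthetic sample of 100 requests; the `percentile` helper is illustrative, and monitoring systems usually estimate percentiles from histograms instead.

```python
import math
from typing import List


def percentile(samples: List[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# 95% of requests take 50 ms, 5% take 5 seconds
latencies_ms = [50.0] * 95 + [5000.0] * 5

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
```

Here the mean is 297.5ms and both p50 and p95 are 50ms. Only p99, at 5000ms, exposes the slow tail, which is why a p95-only dashboard would report this service as healthy.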
YouTube Resources
- MLOps Community - "Model Serving at Scale": talks from engineers at Uber, LinkedIn, and DoorDash on their model serving infrastructure
- NVIDIA - "Triton Inference Server Tutorial": hands-on walkthrough of deploying models to Triton for GPU-accelerated serving
- Chip Huyen - "Designing Machine Learning Systems" lecture series: covers deployment patterns and serving architectures in depth
Interview Q&A
Q1: Walk me through how you would deploy a new fraud detection model that will process 100K transactions per second.
I would start with shadow mode. The existing model processes all transactions; the new model runs in parallel and logs its decisions, but never acts on them. After 2 weeks of shadow mode, I analyze divergence patterns - where do the two models disagree, and is the new model's behavior correct? After validating, I begin a canary rollout: 1% of traffic for 24 hours, monitoring p99 latency, false positive rate by transaction category, and overall block rate. If metrics are clean, I proceed to 5%, 20%, 50%, then 100%, with monitoring gates at each stage. For serving at 100K TPS, I would use Triton Inference Server on a GPU fleet behind a load balancer, with dynamic batching and INT8 quantization (roughly 2–4x speedup, validated against an accuracy budget). The model registry (MLflow) holds all versions, and a tested rollback procedure restores the previous version in under 5 minutes.
Q2: What is the difference between canary and blue-green deployment for ML models?
Blue-green is a binary switch: all traffic moves from the old model to the new model at once, with instant rollback capability. It is simple and safe (rollback in seconds) but provides no gradual validation in production. Canary is a gradual shift: traffic moves from old to new in stages (1% → 5% → 20% → 100%), monitoring metrics at each stage. If anything degrades, you roll back from a small blast radius rather than from 100% exposure. For ML models specifically, canary is almost always preferred because online behavior is hard to predict from offline metrics - you want to validate on a small user cohort before committing. Blue-green is appropriate when the model change is minor (a retrain with the same architecture) and you have high confidence from offline evaluation and shadow mode.
Q3: When would you use batch inference vs real-time inference?
Batch inference is appropriate when: (1) the entity set is known in advance (all users, all products), (2) prediction freshness requirements allow hours of staleness, (3) computation is expensive and cannot be done at request time, or (4) features are available in bulk. Examples: nightly email recommendation, weekly credit score refresh, product catalog scoring. Real-time inference is required when: (1) predictions depend on the current request context (search query, current page), (2) freshness is critical (fraud detection must be instantaneous), or (3) the entity is not known in advance (new user, new content). The hybrid pattern - pre-compute user embeddings in batch, combine with item embeddings at request time - is common in recommendation systems to get the best of both worlds.
Q4: How would you optimize a Transformer model that is not meeting a 50ms p99 latency SLA?
I would work through optimizations in order of risk. First, dynamic batching: collect requests within a 2–5ms window and process them as a batch. This improves GPU utilization with no accuracy impact, though the queue delay counts against the 50ms budget. Second, FP16 inference: convert from FP32 to FP16 for roughly 2x speedup on NVIDIA Tensor Core GPUs with minimal accuracy loss. Third, INT8 post-training quantization with calibration: 4x memory reduction and 2–4x speedup, at the cost of a possible 1–2% accuracy drop. Fourth, knowledge distillation: train a smaller student model (DistilBERT from BERT, for example), which can achieve 90%+ of the teacher's accuracy at 2–5x the speed. Fifth, TensorRT compilation: compile the model specifically for the deployment GPU, typically another 2–4x speedup. Sixth, reduce sequence length: if inputs allow, truncate or use a sliding window. Self-attention cost is O(n^2), so halving the sequence length cuts attention compute by about 4x, while the rest of the network scales linearly. I would instrument p99 latency at each step and stop when the SLA is met.
Q5: What should a model rollback procedure look like?
A model rollback procedure must be: fast (under 5 minutes), tested before deployment (not theoretical), automated where possible, and not require a new code deployment. The procedure: (1) Define rollback criteria in advance - specific metric thresholds that trigger rollback (error rate, latency, business metric decline). (2) Store the previous production model version in the registry with a clear version number. (3) At rollback trigger, transition the previous version back to the "Production" stage in the registry. Serving instances pick it up on their next reload or poll, so the reload mechanism must be part of the tested procedure. (4) Document the rollback decision with the triggering metric values and timestamp. (5) Notify the team and begin postmortem. The registry-based approach (MLflow, SageMaker Model Registry) reduces the registry change itself to a 30-second operation. The worst rollback I have seen required a fresh code deployment because the model was hardcoded in the application - that took 45 minutes. Model artifacts must always be decoupled from application code.
Production Infrastructure Patterns
Feature Serving - Online vs Precomputed
The feature serving architecture is a first-class part of deployment design. Features for real-time inference fall into two categories:
Precomputed (batch) features: Computed once, stored in a low-latency feature store (Redis, DynamoDB), retrieved at request time. Examples: user embedding (computed nightly), historical purchase count, credit score. Latency: 1–5ms.
Request-time (streaming) features: Computed at the moment of the request, using the current event as input. Examples: current session duration, time since last action, real-time fraud signals from the last 60 seconds. Latency: varies based on complexity, 5–50ms typical.
import redis
import json
from typing import Any
# Column order the model was trained with. Key order in the stored JSON is not a contract.
FEATURE_ORDER = [
    "age_days", "purchase_count", "avg_order_value",
    "session_duration_s", "items_viewed", "hour_of_day",
]

class FeatureStore:
    """Simple Redis-backed feature store for precomputed user features."""
    def __init__(self, host: str = "localhost", port: int = 6379):
        self.client = redis.Redis(host=host, port=port, decode_responses=True)

    def get_user_features(self, user_id: str) -> dict[str, Any]:
        """Retrieve precomputed user features. Sub-5ms at p99."""
        raw = self.client.get(f"user_features:{user_id}")
        if raw is None:
            return self._get_default_features()
        return json.loads(raw)

    def set_user_features(self, user_id: str, features: dict, ttl_seconds: int = 86400):
        """Write user features (called by nightly batch job)."""
        self.client.setex(
            f"user_features:{user_id}",
            ttl_seconds,
            json.dumps(features)
        )

    def _get_default_features(self) -> dict:
        """Cold-start defaults for new users."""
        return {"age_days": 0, "purchase_count": 0, "avg_order_value": 0.0}

class RealtimePredictor:
    def __init__(self, model, feature_store: FeatureStore):
        self.model = model
        self.feature_store = feature_store

    def predict(self, user_id: str, request_context: dict) -> float:
        # Layer 1: Precomputed features (fast, from Redis)
        user_features = self.feature_store.get_user_features(user_id)
        # Layer 2: Request-time features (computed now)
        request_features = {
            "session_duration_s": request_context.get("session_duration", 0),
            "items_viewed": request_context.get("items_viewed", 0),
            "hour_of_day": request_context.get("hour", 12),
        }
        # Merge, then build the row in the training-time column order
        all_features = {**user_features, **request_features}
        row = [all_features[name] for name in FEATURE_ORDER]
        return float(self.model.predict_proba([row])[0, 1])
Multi-Model Serving - A/B Routing and Ensemble
Production systems often run multiple model versions simultaneously for A/B testing or ensemble prediction:
import hashlib
from enum import Enum
class ModelVariant(Enum):
    CONTROL = "control"
    TREATMENT = "treatment"

class ABModelRouter:
    def __init__(self, control_model, treatment_model, treatment_pct: float = 0.1):
        self.models = {
            ModelVariant.CONTROL: control_model,
            ModelVariant.TREATMENT: treatment_model,
        }
        self.treatment_pct = treatment_pct  # fraction of users, e.g. 0.1 = 10%

    def _assign_variant(self, user_id: str) -> ModelVariant:
        """Deterministically assign user to control or treatment."""
        # Hash-based assignment: same user always sees same variant
        hash_val = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
        bucket = hash_val % 100  # 0–99
        # round() avoids float truncation (int(0.29 * 100) is 28, not 29)
        if bucket < round(self.treatment_pct * 100):
            return ModelVariant.TREATMENT
        return ModelVariant.CONTROL

    def predict(self, user_id: str, features: list) -> dict:
        variant = self._assign_variant(user_id)
        model = self.models[variant]
        score = float(model.predict_proba([features])[0, 1])
        return {
            "score": score,
            "model_variant": variant.value,
            "user_id": user_id,
        }

# Hash-based assignment ensures:
# 1. Each user always sees the same variant (consistent experience)
# 2. Assignment is deterministic - no need to store variant in a database
# 3. Approximately treatment_pct of users get the treatment model (the exact share depends on how IDs hash)
Containerization and Kubernetes Deployment
Every model in production should be containerized. This ensures the model runs identically in development, staging, and production:
# Dockerfile for ML model serving
FROM python:3.11-slim
WORKDIR /app
# Install only inference dependencies (not training libraries)
COPY requirements-serve.txt .
RUN pip install --no-cache-dir -r requirements-serve.txt
# Copy model artifacts from CI/CD pipeline
COPY model/ ./model/
COPY src/serve.py .
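# Container-level health check (Docker only - Kubernetes ignores HEALTHCHECK and uses its own probes)
# python:3.11-slim does not ship curl, so probe with the Python standard library
HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8080/health', timeout=5)" || exit 1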
EXPOSE 8080
CMD ["uvicorn", "serve:app", "--host", "0.0.0.0", "--port", "8080", "--workers", "4"]
# Kubernetes deployment with auto-scaling
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fraud-model-canary
spec:
  replicas: 2
  selector:
    matchLabels:
      app: fraud-model
      variant: canary
  template:
    metadata:
      labels:
        app: fraud-model
        variant: canary
    spec:
      containers:
        - name: model-server
          image: company/fraud-model:v3.2.1
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"
            limits:
              cpu: "2"
              memory: "2Gi"
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
---
# Horizontal Pod Autoscaler
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: fraud-model-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: fraud-model-canary
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
Monitoring and Alerting
Production models require monitoring at three levels: infrastructure, data, and model behavior.
import prometheus_client as prom
import time
# Define metrics
REQUEST_LATENCY = prom.Histogram(
"model_request_duration_seconds",
"Model inference latency",
buckets=[0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0],
labelnames=["model_version", "endpoint"],
)
PREDICTION_SCORE = prom.Histogram(
"model_prediction_score",
"Distribution of model output scores",
buckets=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
labelnames=["model_version"],
)
ERROR_COUNTER = prom.Counter(
"model_prediction_errors_total",
"Total number of prediction errors",
labelnames=["model_version", "error_type"],
)
class InstrumentedPredictor:
    def __init__(self, model, model_version: str):
        self.model = model
        self.version = model_version

    def predict(self, features: list) -> float:
        start = time.perf_counter()
        try:
            score = float(self.model.predict_proba([features])[0, 1])
            REQUEST_LATENCY.labels(
                model_version=self.version, endpoint="predict"
            ).observe(time.perf_counter() - start)
            PREDICTION_SCORE.labels(model_version=self.version).observe(score)
            return score
        except Exception as e:
            ERROR_COUNTER.labels(
                model_version=self.version, error_type=type(e).__name__
            ).inc()
            raise
Critical alerts to set up before any deployment:
- p99 latency exceeds 200ms (sustained for 5 minutes)
- Error rate exceeds 0.5% (sustained for 2 minutes)
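- Prediction score distribution drifts from the deployment-time baseline (Population Stability Index above 0.25, a common rule-of-thumb threshold for a significant shift)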
- Pod restart count exceeds 3 in 10 minutes (model crash loop)
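The PSI alert above compares two histograms of prediction scores over the same buckets. A minimal sketch of the calculation; the bucket counts are made up for illustration and the `psi` helper is a hypothetical name:

```python
import math
from typing import List


def psi(expected_counts: List[int], actual_counts: List[int], eps: float = 1e-6) -> float:
    """Population Stability Index over matching histogram buckets."""
    e_total, a_total = sum(expected_counts), sum(actual_counts)
    total = 0.0
    for e, a in zip(expected_counts, actual_counts):
        e_frac = max(e / e_total, eps)     # eps avoids log(0) for empty buckets
        a_frac = max(a / a_total, eps)
        total += (a_frac - e_frac) * math.log(a_frac / e_frac)
    return total


baseline = [500, 300, 150, 50]     # score histogram at deployment time
same_shape = [50, 30, 15, 5]       # same distribution, smaller sample
shifted = [200, 250, 300, 250]     # mass moved toward high scores

psi_stable = psi(baseline, same_shape)
psi_shifted = psi(baseline, shifted)
should_alert = psi_shifted > 0.25
```

PSI depends only on bucket proportions, so it is insensitive to traffic volume, but it is sensitive to bucket choice: the baseline and live histograms must use identical bucket edges.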
Edge Deployment - Mobile and IoT Patterns
Increasingly, ML models are deployed not to cloud servers but to edge devices: mobile phones, embedded sensors, autonomous vehicles. The constraints are fundamentally different:
- Memory: 1–10 MB (model weights only)
- Latency: under 50ms on device CPU (no GPU, no network round-trip)
- Power: battery drain is a real constraint (float32 math drains more than int8)
- Offline: device may not have internet connectivity
Edge model checklist:
- Quantize to INT8: reduces model size 4x, reduces compute 2–4x, typical accuracy drop less than 2%
- Prune to target sparsity: 50–80% weight sparsity is achievable with minimal accuracy loss
- Export to TFLite or ONNX: framework-agnostic, optimized for edge runtimes
- Benchmark on target hardware: a model that runs at 50ms on your dev laptop may run at 500ms on a low-end Android device
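To see where the 4x size reduction in the checklist comes from, here is a small sketch of affine INT8 quantization in plain Python. It is an illustration of the arithmetic only: the helper names `quantize_int8` and `dequantize_int8` are hypothetical, the weights are random, and real converters such as TFLite apply their own, more elaborate schemes.

```python
import random
import struct


def quantize_int8(weights):
    """Affine quantization of floats to integers in [-128, 127]."""
    # Include 0.0 in the range so that zero is exactly representable.
    w_min = min(min(weights), 0.0)
    w_max = max(max(weights), 0.0)
    scale = (w_max - w_min) / 255.0
    if scale == 0.0:
        scale = 1.0  # guard against an all-zero tensor
    zero_point = round(-128 - w_min / scale)
    q = [max(-128, min(127, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point


def dequantize_int8(q, scale, zero_point):
    """Map int8 values back to approximate float weights."""
    return [(v - zero_point) * scale for v in q]


# Random stand-in for a layer's weights.
rng = random.Random(0)
weights = [rng.gauss(0.0, 0.1) for _ in range(1_000)]

q, scale, zero_point = quantize_int8(weights)
restored = dequantize_int8(q, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

# float32 uses 4 bytes per weight, int8 uses 1 byte per weight.
fp32_bytes = struct.pack(f"<{len(weights)}f", *weights)
int8_bytes = struct.pack(f"<{len(q)}b", *q)
print(f"float32: {len(fp32_bytes)} bytes, int8: {len(int8_bytes)} bytes")
print(f"scale={scale:.6f}, max round-trip error={max_error:.6f}")
```

The round-trip error is bounded by roughly one quantization step (`scale`), which is why accuracy loss tends to be small when the weight range is narrow.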
```python
import tensorflow as tf


# TFLite conversion with INT8 quantization for mobile deployment
def export_tflite_int8(keras_model, representative_dataset):
    converter = tf.lite.TFLiteConverter.from_keras_model(keras_model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]

    # Representative dataset for calibration (100 samples sufficient)
    def representative_data_gen():
        for features in representative_dataset:
            yield [features.astype("float32")]

    converter.representative_dataset = representative_data_gen
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
    converter.inference_input_type = tf.int8
    converter.inference_output_type = tf.int8

    tflite_model = converter.convert()
    with open("model_int8.tflite", "wb") as f:
        f.write(tflite_model)
    print(f"INT8 model size: {len(tflite_model) / 1024:.1f} KB")
    return tflite_model
```
Role-Specific Callouts
:::note Machine Learning Engineer Deployment strategy is your responsibility, not just model training. Know the full pipeline: shadow mode, canary, gradual rollout, monitoring, and rollback. Be able to estimate serving costs (how many GPU instances does a 100K QPS Transformer require?) and latency at each optimization stage. :::
:::note MLOps / Platform Engineer Build the deployment infrastructure that makes the above patterns accessible to every data scientist on the team. A great MLOps platform makes shadow mode trivial to enable, canary rollouts automatic, and rollback a one-click operation. The goal: a data scientist should be able to ship a new model version without involving an infrastructure engineer. :::
:::note AI Engineer For LLM-based systems, deployment patterns extend to: prompt versioning (treat prompts like model versions), response caching (semantic cache with embedding similarity), streaming responses (SSE/WebSockets for long generations), and rate limiting. The same shadow mode and canary principles apply - test a new prompt or model on a small fraction of traffic before full rollout. :::
:::note Data Scientist Deployment is not your primary focus, but understanding it makes you dramatically more effective. Know how to package your model for handoff (ONNX, pickle with explicit dependencies, MLflow logged model), write a serving test that validates the deployed model's output matches your notebook output, and define the monitoring metric that should alert if your model is degrading. :::
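The serving-cost question raised in the Machine Learning Engineer callout reduces to simple arithmetic once per-instance throughput is known. The sketch below is a back-of-the-envelope estimate, not a sizing tool: `instances_needed` is a hypothetical helper, and the 2,000 QPS per GPU instance and 60% target utilization are placeholder assumptions that should be replaced with numbers from your own load tests.

```python
import math


def instances_needed(peak_qps, qps_per_instance, target_utilization=0.6, spare=1):
    """Instances required to serve peak_qps with headroom plus spare capacity."""
    # Only a fraction of each instance's measured throughput is treated as usable,
    # leaving room for traffic spikes and tail-latency protection.
    usable_qps = qps_per_instance * target_utilization
    return math.ceil(peak_qps / usable_qps) + spare


# Placeholder inputs: 2,000 QPS per GPU instance is an assumption, not a benchmark.
n = instances_needed(peak_qps=100_000, qps_per_instance=2_000)
print(f"GPU instances needed: {n}")
```

In an interview, stating the assumptions (measured throughput, utilization target, redundancy) usually matters more than the final number.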
Full End-to-End Deployment Walkthrough
Let us walk through a complete deployment for a fraud detection model at a payments company processing 10,000 transactions per second.
Phase 1 - Packaging the Model
Before deploying anything, the model must be packaged with its full inference pipeline: preprocessing steps, feature names, model weights, and serving code.
```python
import mlflow
import mlflow.sklearn
from mlflow.models import infer_signature
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier

# Assumed to exist from earlier training steps: X_train, X_holdout (DataFrames),
# y_train, y_holdout, best_params (dict), precision_at_5pct_fpr (float).

# Define the complete inference pipeline
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", XGBClassifier(n_estimators=300, max_depth=5)),
])
pipeline.fit(X_train, y_train)

# Validate on holdout before packaging
holdout_auc = roc_auc_score(y_holdout, pipeline.predict_proba(X_holdout)[:, 1])
print(f"Holdout AUC: {holdout_auc:.4f}")  # must exceed production threshold

# Log to MLflow registry with full metadata
with mlflow.start_run(run_name="fraud-v3.2-candidate") as run:
    mlflow.log_params(best_params)
    mlflow.log_metrics({
        "holdout_auc": holdout_auc,
        "holdout_precision_at_5pct_fpr": precision_at_5pct_fpr,
        "training_data_rows": len(X_train),
        "feature_count": X_train.shape[1],
    })
    mlflow.log_param("training_data_version", "2026-03-01")

    # Log feature schema for serving validation
    feature_schema = {
        "features": list(X_train.columns),
        "dtypes": {col: str(dtype) for col, dtype in X_train.dtypes.items()},
    }
    mlflow.log_dict(feature_schema, "feature_schema.json")

    # Package as MLflow model.
    # Note: the pyfunc wrapper calls pipeline.predict by default, which returns
    # class labels. If the serving layer needs probabilities, recent MLflow
    # versions accept pyfunc_predict_fn="predict_proba" here.
    mlflow.sklearn.log_model(
        pipeline,
        artifact_path="model",
        registered_model_name="fraud_detection",
        input_example=X_train.iloc[:5],
        signature=infer_signature(X_train, pipeline.predict(X_train)),
    )
```
The model signature is critical: it records the exact input/output schema. The serving layer validates every incoming request against this schema. Schema mismatches (wrong feature names, wrong types) are caught before they silently corrupt predictions.
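MLflow's pyfunc flavor performs its own schema enforcement when a signature is logged. To make the idea concrete, here is a hand-rolled sketch of the same kind of check against a schema shaped like `feature_schema.json`. The `validate_request` helper, the `DTYPE_TO_PYTHON` mapping, and the example payloads are hypothetical and cover only three dtypes.

```python
# Hypothetical mapping from pandas dtype strings to accepted Python types.
DTYPE_TO_PYTHON = {
    "float64": (int, float),
    "int64": (int,),
    "object": (str,),
}


def validate_request(payload: dict, schema: dict) -> list:
    """Return a list of human-readable schema violations (empty if valid)."""
    errors = []
    expected = schema["features"]

    for name in expected:
        if name not in payload:
            errors.append(f"missing feature: {name}")
    for name in payload:
        if name not in expected:
            errors.append(f"unexpected feature: {name}")

    for name, dtype in schema["dtypes"].items():
        if name not in payload:
            continue
        value = payload[name]
        allowed = DTYPE_TO_PYTHON.get(dtype, ())
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, allowed):
            errors.append(
                f"wrong type for {name}: expected {dtype}, got {type(value).__name__}"
            )
    return errors


# Illustrative schema in the same shape as feature_schema.json.
schema = {
    "features": ["amount", "merchant_category", "hour_of_day"],
    "dtypes": {
        "amount": "float64",
        "merchant_category": "object",
        "hour_of_day": "int64",
    },
}

good = {"amount": 42.5, "merchant_category": "grocery", "hour_of_day": 3}
bad = {"amount": "42.5", "merchant_category": "grocery", "device": "ios"}

print(validate_request(good, schema))
for problem in validate_request(bad, schema):
    print(problem)
```

Rejecting the request with an explicit error is usually preferable to coercing types silently, because a silently coerced feature can shift predictions without triggering any alert.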
Phase 2 - Shadow Mode Implementation
```python
# FastAPI serving endpoint with shadow mode support
import json
import logging
import time

import mlflow.pyfunc
import pandas as pd
from fastapi import BackgroundTasks, FastAPI
from pydantic import BaseModel

app = FastAPI()

# Load models
production_model = mlflow.pyfunc.load_model("models:/fraud_detection/Production")
shadow_model = mlflow.pyfunc.load_model("models:/fraud_detection/Staging")  # candidate

logger = logging.getLogger("shadow_mode")


class TransactionFeatures(BaseModel):
    amount: float
    merchant_category: str
    hour_of_day: int
    device_type: str
    user_tenure_days: int
    # ... (all features)


def log_shadow_prediction(features: dict, prod_score: float):
    """Log the shadow model's prediction after the response has been sent.

    Declared as a plain function (not async) so FastAPI runs it in a worker
    thread and the blocking model call does not stall the event loop.
    """
    try:
        df = pd.DataFrame([features])
        shadow_score = float(shadow_model.predict(df)[0])
        logger.info(json.dumps({
            "event": "shadow_prediction",
            "production_score": prod_score,
            "shadow_score": shadow_score,
            "agreed": (prod_score > 0.5) == (shadow_score > 0.5),
            "timestamp": time.time(),
        }))
    except Exception as e:
        logger.error(f"Shadow model failed: {e}")  # never propagate to caller


# Plain def: the blocking production inference also runs in the threadpool.
@app.post("/predict")
def predict(features: TransactionFeatures, background_tasks: BackgroundTasks):
    feature_dict = features.dict()  # use features.model_dump() on Pydantic v2
    df = pd.DataFrame([feature_dict])

    # Production prediction - on critical path
    start = time.perf_counter()
    score = float(production_model.predict(df)[0])
    latency_ms = (time.perf_counter() - start) * 1000

    # Shadow prediction - background, never blocks
    background_tasks.add_task(log_shadow_prediction, feature_dict, score)

    return {
        "fraud_score": score,
        "decision": "block" if score > 0.85 else "allow",
        "model_version": "v3.1",
        "latency_ms": round(latency_ms, 2),
    }
```
After two weeks of shadow mode, analyze the divergence logs:
```python
import pandas as pd

# Load shadow mode logs
shadow_logs = pd.read_json("shadow_logs.jsonl", lines=True)

print(f"Total predictions: {len(shadow_logs):,}")
print(f"Agreement rate: {shadow_logs['agreed'].mean():.2%}")
# These compare raw scores, not block decisions at the serving threshold.
print(f"Shadow scores higher: {(shadow_logs['shadow_score'] > shadow_logs['production_score']).mean():.2%}")
print(f"Shadow scores lower: {(shadow_logs['shadow_score'] < shadow_logs['production_score']).mean():.2%}")

# Identify high-divergence cases for manual review
divergence = (shadow_logs["shadow_score"] - shadow_logs["production_score"]).abs()
high_divergence = shadow_logs[divergence > 0.3]
print(f"\nHigh divergence cases (delta > 0.3): {len(high_divergence):,} ({len(high_divergence)/len(shadow_logs):.2%})")

# Analyze: are high-divergence cases concentrated in a specific merchant category?
# (requires merchant_category to be included in the shadow log records)
# high_divergence.groupby("merchant_category")["shadow_score"].agg(["count", "mean"])
```
Phase 3 - Canary Rollout Automation
```python
from mlflow.tracking import MlflowClient


class CanaryRolloutManager:
    def __init__(self, model_name: str, client: MlflowClient):
        self.model_name = model_name
        self.client = client
        self.stages = [0.01, 0.05, 0.10, 0.25, 0.50, 1.00]

    def advance_rollout(self, current_pct: float, metrics: dict) -> float | None:
        """
        Advance to the next rollout stage if metrics are clean.
        Returns the new percentage, the current percentage if the rollout
        is already complete, or None if a rollback is needed.
        """
        # Guardrail checks
        if metrics.get("p99_latency_ms", 0) > 200:
            return None  # Rollback: latency SLA violated
        if metrics.get("error_rate", 0) > 0.005:
            return None  # Rollback: error rate exceeded
        # abs(): a sharp drop in block rate is as suspicious as a sharp rise
        if abs(metrics.get("block_rate_relative_change", 0)) > 0.15:
            return None  # Rollback: block rate shifted >15% (unexpected model behavior)

        # Find next stage; stay at the current stage once the rollout is complete
        next_stages = [s for s in self.stages if s > current_pct]
        return next_stages[0] if next_stages else current_pct

    def update_traffic_config(self, treatment_pct: float) -> dict:
        """Update the load balancer traffic split configuration."""
        config = {
            "production_model_pct": 1.0 - treatment_pct,
            "canary_model_pct": treatment_pct,
        }
        # In practice: update feature flag in LaunchDarkly, Flagsmith, etc.
        print(f"Updated traffic: {(1-treatment_pct):.0%} production, {treatment_pct:.0%} canary")
        return config
```
This pattern - shadow mode → automated canary stages with metric gates → automatic rollback - is a well-established standard for ML deployment at companies processing high-stakes transactions.
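The traffic split itself is usually made deterministic, so that the same user or transaction source sees the same model throughout a stage. One common approach, sketched here with the standard library only, is to hash a stable ID into a bucket and compare it with the current canary percentage. The helpers `canary_bucket` and `routes_to_canary` and the salt string are hypothetical; a feature-flag service would normally do this for you.

```python
import hashlib


def canary_bucket(unit_id: str, salt: str = "fraud-v3.2-rollout") -> int:
    """Map an ID to a stable bucket in [0, 10000) using SHA-256."""
    digest = hashlib.sha256(f"{salt}:{unit_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 10_000


def routes_to_canary(unit_id: str, treatment_pct: float) -> bool:
    """True if this ID falls inside the canary share of traffic."""
    return canary_bucket(unit_id) < round(treatment_pct * 10_000)


# Synthetic IDs to check that the observed share tracks the target share.
ids = [f"user-{i}" for i in range(20_000)]
for pct in [0.01, 0.05, 0.10]:
    share = sum(routes_to_canary(u, pct) for u in ids) / len(ids)
    print(f"target {pct:.0%}, observed {share:.2%}")
```

Because the bucket for an ID never changes, every ID routed to the canary at 1% is still routed to it at 5%, which keeps per-user behavior stable as the rollout advances.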
Summary - The Deployment Mental Model
Every ML deployment decision maps to one question: what is the blast radius if this goes wrong, and how do we reduce it to acceptable levels before each step?
Shadow mode: zero blast radius. Canary at 1%: 1% blast radius. Blue-green: 100% blast radius but instant rollback. Choose the pattern based on how confident you are, how costly a failure is, and how fast you need to move.
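The blast-radius comparison can be put into rough numbers. The sketch below uses the 10,000 transactions per second from the walkthrough; the time-to-rollback values (300 seconds for a canary, 60 seconds for blue-green) are assumptions chosen for illustration, and `blast_radius` is a hypothetical helper.

```python
def blast_radius(tps: float, traffic_fraction: float, seconds_exposed: float) -> int:
    """Transactions decided by the new model before a bad rollout is reverted."""
    return round(tps * traffic_fraction * seconds_exposed)


TPS = 10_000  # from the walkthrough

# name: (fraction of live decisions made by the new model, assumed seconds until rollback)
scenarios = {
    "shadow mode": (0.0, 300),
    "canary at 1%": (0.01, 300),
    "blue-green": (1.0, 60),
}
for name, (fraction, seconds) in scenarios.items():
    affected = blast_radius(TPS, fraction, seconds)
    print(f"{name}: {affected:,} transactions affected")
```

The two levers are visible in the formula: reduce the fraction of traffic exposed, or reduce the time it takes to detect a problem and roll back.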
