ML Deployment Patterns - From Jupyter Notebook to Production at Scale

:::note Reading time and relevance 30–35 min read | Interview relevance: extremely high for ML Engineer and MLOps roles. Senior MLE interviews at Stripe, Uber, DoorDash, and Meta routinely ask about deployment strategy, latency optimization, and rollback procedures. :::

The Real Interview Moment

It is 2021. Stripe's ML team has spent three months training a new fraud detection model. Their existing model processes over $10 billion per day in transactions. Every API call to POST /v1/charges goes through the fraud model. The new model shows +8% precision at the same recall level on the holdout set. In dollar terms, that is tens of millions of dollars per year in reduced fraud losses.

Nobody on the team suggests rolling out the new model overnight.

The previous model had been in production for 18 months. Its behavior was understood - the team knew its false positive rate by merchant category, by geography, by transaction amount. They had alerting on every metric that mattered. They knew how it behaved under load at 3 AM on a Sunday (when fraudsters are most active). The new model, despite its better offline metrics, was an unknown in production.

The decision was to use shadow mode first: run the new model in parallel with the production model for two weeks. Every transaction goes through both models. The production model's output drives the actual block/allow decision. The new model's output is logged, but never acted on. After two weeks, the team compared the two models' decisions on 50 million transactions. They found three edge cases where the new model's behavior was unexpected - high false positive rates on a specific type of international transaction that was underrepresented in their training set. They retrained with a fix, ran shadow mode again, and only then began a canary rollout.

This is the difference between ML engineering and ML research. Getting the model right is 30% of the work. Getting the deployment right is the other 70%.

Why This Exists - The Cost of Bad Deployments

The history of ML deployment failures is long and expensive:

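Amazon's recruiting AI (reported 2018): Built an experimental hiring model trained on historical resumes. The historical data encoded gender bias, and the model learned to downgrade resumes associated with women. According to press reports, the bias surfaced only after the tool had been in development and trial use for about a year, and the project was later scrapped.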
Zillow's iBuying model (2021): A home pricing model was deployed with insufficient monitoring. When market conditions shifted rapidly, the model continued predicting pre-shift prices. Zillow took a $569 million write-down and shut down the iBuying business.
Knight Capital Group (2012): Deployed new trading software to only seven of its eight servers, leaving obsolete code active on the eighth. The system executed 4 million transactions in 45 minutes, losing $440 million. The deployment had no automated verification and no working rollback or kill-switch procedure.

Every one of these failures had a deployment pattern that would have caught the problem: shadow mode, gradual rollout, monitoring, rollback capability. The patterns in this lesson are not bureaucratic overhead - they are learned from disasters.

Deployment Strategies - Comparing the Patterns

Blue-Green Deployment

Blue-green deployment maintains two identical production environments. At any time, one environment (blue) serves all traffic, and the other (green) is idle. When deploying a new model:

Deploy the new model to the green environment
Run smoke tests on green
Switch the load balancer to route all traffic to green (instant cutover)
If something is wrong, switch back to blue instantly (rollback in seconds)

Advantages: Zero-downtime deployment, instant rollback, simple to implement, clean separation between versions.

Disadvantages: Requires 2x infrastructure cost. The cutover is binary - either 0% or 100% of traffic sees the new model. No gradual validation in production.

Load Balancer
      │
      ├── Blue (old model) ← currently serving
      │
      └── Green (new model) ← deployed, tested, ready

→ Switch:

Load Balancer
      │
      ├── Blue (old model) ← idle, ready for rollback
      │
      └── Green (new model) ← now serving

When to use it: Stateless models, short warm-up time, when you need instant rollback capability, when infrastructure cost is acceptable.
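
The cutover and rollback steps above can be sketched in a few lines. This is a minimal in-process illustration, not a load balancer configuration: `BlueGreenRouter`, the `Model` callable signature, and the score-range smoke test are all hypothetical names and choices made for the example.

```python
from typing import Callable, Dict

Model = Callable[[dict], float]  # hypothetical: a model is any callable returning a score


class BlueGreenRouter:
    """Routes every request to exactly one live environment."""

    def __init__(self, blue: Model, green: Model) -> None:
        self._envs: Dict[str, Model] = {"blue": blue, "green": green}
        self.live = "blue"  # blue serves until an explicit cutover

    @property
    def idle(self) -> str:
        return "green" if self.live == "blue" else "blue"

    def smoke_test(self, sample: dict) -> bool:
        """Send one request to the idle environment without routing traffic to it."""
        try:
            score = self._envs[self.idle](sample)
        except Exception:
            return False
        return 0.0 <= score <= 1.0  # illustrative sanity check only

    def switch(self) -> None:
        """Cut over 100% of traffic; calling it again is the rollback."""
        self.live = self.idle

    def predict(self, features: dict) -> float:
        return self._envs[self.live](features)


router = BlueGreenRouter(blue=lambda f: 0.2, green=lambda f: 0.7)
before = router.predict({"amount": 10})
if router.smoke_test({"amount": 10}):
    router.switch()  # cutover to green
after = router.predict({"amount": 10})
router.switch()  # rollback to blue
rolled_back = router.predict({"amount": 10})
```

Because the old environment is never torn down during the switch, rollback is the same operation as cutover, which is where the "rollback in seconds" property comes from.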

Canary Deployment

A canary release gradually shifts traffic from the old model to the new model, monitoring metrics at each stage before committing to the next. Named after the "canary in a coal mine" - a small number of users experience the new model first, acting as an early warning system.

Typical rollout stages:

1% of traffic → monitor for 24 hours
5% of traffic → monitor for 48 hours
20% of traffic → monitor for 48–72 hours
50% of traffic → monitor for 1 week
100% of traffic → complete rollout

At each stage, monitor: error rates, latency (p50/p95/p99), business metrics (conversion rate, click-through rate, fraud rate). If any metric degrades beyond a threshold, automatically roll back to 0% and alert the team.

Advantages: Catches production issues that offline testing and shadow mode missed. Limits blast radius of a bad deployment. Provides confidence before full rollout.

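Disadvantages: Slower than blue-green. Requires traffic-splitting infrastructure (weighted routing at the load balancer or service mesh, or feature flags). A consistent user experience is harder to guarantee (some users see the old model, some see the new one).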
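
The staged rollout with automatic rollback can be sketched as a small state machine. This is a hedged illustration: `CanaryRollout` and its `advance` method are hypothetical, and the `healthy` flag stands in for whatever metric gate you evaluate at the end of each monitoring window.

```python
from dataclasses import dataclass
from typing import List

STAGES: List[int] = [1, 5, 20, 50, 100]  # percent of traffic per stage


@dataclass
class CanaryRollout:
    stage_index: int = -1      # -1 means 0% of traffic on the new model
    rolled_back: bool = False

    @property
    def traffic_pct(self) -> int:
        return 0 if self.stage_index < 0 else STAGES[self.stage_index]

    def advance(self, healthy: bool) -> int:
        """Call once per monitoring window; returns the new traffic percentage."""
        if self.rolled_back:
            return 0  # stay at 0% until a human restarts the rollout
        if self.stage_index >= 0 and not healthy:
            self.stage_index = -1
            self.rolled_back = True
            return 0
        if self.stage_index < len(STAGES) - 1:
            self.stage_index += 1
        return self.traffic_pct


rollout = CanaryRollout()
history = [rollout.advance(h) for h in (True, True, False, True)]

clean = CanaryRollout()
clean_history = [clean.advance(True) for _ in range(6)]
```

In the first run the gate fails at the 5% stage, so traffic drops to 0% and stays there even if later windows look healthy; the second run shows the uninterrupted path to 100%.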

Shadow Mode

Shadow mode (also called dark launch or mirror testing) is the safest deployment strategy for high-stakes ML systems. The new model runs in parallel with the production model, receives the same inputs, but its outputs are never used for actual decisions - only logged for comparison.

The shadow mode process:

Deploy the new model alongside the production model
Every incoming request is processed by both models simultaneously
The production model's decision is used (block/allow, rank, recommend)
The new model's decision is logged to a database or stream
Engineers compare the two models' decisions: agreement rate, divergence patterns, edge cases
Only after validation does shadow mode end and the canary begin

Shadow mode answers the question: "Does the new model behave the way we expect, on real production traffic, without any risk to users?" This is particularly critical for:

Fraud detection (false positives affect real customers)
Medical diagnosis (wrong predictions have patient consequences)
Content moderation (incorrect removal affects free expression)
Any financial decision system

Implementation with dual serving:

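import asyncio
from typing import Any

class ShadowModePredictor:
    def __init__(self, production_model, shadow_model, logger, shadow_timeout_s: float = 1.0):
        self.production = production_model
        self.shadow = shadow_model
        self.logger = logger
        self.shadow_timeout_s = shadow_timeout_s
        # Keep strong references: the event loop only holds weak references to tasks,
        # so an unreferenced background task can be garbage-collected mid-flight
        self._background_tasks: set[asyncio.Task] = set()

    async def predict(self, features: dict) -> Any:
        # Production prediction is on the critical path
        production_result = await self.production.predict_async(features)

        # Shadow prediction runs in background - never blocks the response
        task = asyncio.create_task(self._shadow_predict_and_log(features, production_result))
        self._background_tasks.add(task)
        task.add_done_callback(self._background_tasks.discard)

        return production_result  # Only return production result

    async def _shadow_predict_and_log(self, features: dict, production_result: Any):
        try:
            # Bound shadow latency so a slow shadow model cannot pile up pending tasks
            shadow_result = await asyncio.wait_for(
                self.shadow.predict_async(features), timeout=self.shadow_timeout_s
            )
            self.logger.log({
                "features": features,
                "production_decision": production_result,
                "shadow_decision": shadow_result,
                "agreed": production_result == shadow_result,
            })
        except Exception as e:
            # Shadow errors (including timeouts) MUST NOT surface to users
            self.logger.log_error(f"Shadow model error: {e!r}")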

The critical design constraint: shadow model failures must never affect the production response. If the shadow model crashes or times out, the production response proceeds normally. This is enforced by running the shadow prediction asynchronously and catching all exceptions.
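
Once shadow logs accumulate, the comparison step is mostly counting. The sketch below assumes records shaped like the ones logged above (`features`, `production_decision`, `shadow_decision`); the function name `summarize_shadow_logs` and the `country` segment key are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def summarize_shadow_logs(
    records: Iterable[dict], segment_key: str
) -> Tuple[float, Dict[str, float]]:
    """Return (overall agreement rate, disagreement rate per segment)."""
    total = 0
    agreed = 0
    seg_total: Dict[str, int] = defaultdict(int)
    seg_disagree: Dict[str, int] = defaultdict(int)
    for rec in records:
        segment = str(rec["features"].get(segment_key, "unknown"))
        total += 1
        seg_total[segment] += 1
        if rec["production_decision"] == rec["shadow_decision"]:
            agreed += 1
        else:
            seg_disagree[segment] += 1
    if total == 0:
        raise ValueError("no shadow records to summarize")
    disagreement = {s: seg_disagree[s] / seg_total[s] for s in seg_total}
    return agreed / total, disagreement


logs = [
    {"features": {"country": "US"}, "production_decision": "allow", "shadow_decision": "allow"},
    {"features": {"country": "US"}, "production_decision": "block", "shadow_decision": "block"},
    {"features": {"country": "BR"}, "production_decision": "allow", "shadow_decision": "block"},
    {"features": {"country": "BR"}, "production_decision": "allow", "shadow_decision": "allow"},
]
agreement_rate, by_country = summarize_shadow_logs(logs, segment_key="country")
```

A high overall agreement rate can hide a segment where the models diverge sharply, which is why the per-segment breakdown matters more than the headline number.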

Feature Flags

Feature flags decouple model deployment from code deployment. A feature flag is a configuration toggle that enables or disables a model feature independently of the code release cycle:

from flagsmith import Flagsmith

flagsmith = Flagsmith(environment_key="your-env-key")

def get_recommendation(user_id: str, context: dict):
    flags = flagsmith.get_identity_flags(user_id)

    if flags.is_feature_enabled("use_transformer_ranker"):
        # New transformer-based ranking model
        return transformer_model.rank(context)
    else:
        # Existing XGBoost ranking model
        return xgboost_model.rank(context)

Feature flags allow: percentage rollout (enable for X% of users), user-segment targeting (enable for beta users), instant kill switch (disable immediately without a code deployment), and A/B test integration.

Serving Architectures - Batch vs Real-Time vs Streaming

The right serving architecture depends on when predictions are needed and how fresh they must be.

Batch Inference

Predictions are computed ahead of time for a known set of entities and stored for later retrieval:

Overnight job → Model scores all users → Predictions stored in database
→ User request → Read pre-computed prediction → Serve in <1ms

When to use batch inference:

Entity set is bounded and known in advance (scoring all products in catalog)
Prediction freshness requirements are loose (daily or hourly is fine)
Computational cost is high (scoring millions of users takes 4 hours - cannot do it at request time)
Features are available in bulk (no need to fetch real-time features for each user)

Examples: Email recommendation engines (run nightly for all users), product catalog scoring (re-score all listings weekly), customer health scores for CRM systems.

Implementation pattern with Spark:

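from typing import Iterator

import mlflow
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("batch_inference").getOrCreate()

# Load production model from registry
model = mlflow.pyfunc.load_model("models:/fraud_model/Production")

# Score all transactions from yesterday
df = spark.read.parquet("s3://data-lake/transactions/2026-03-08/")

def score_batches(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    # mapInPandas passes an iterator of pandas DataFrames and expects an iterator back
    for batch in batches:
        # Assumption: the model accepts the batch as-is and returns one score per row
        scores = pd.Series(model.predict(batch), index=batch.index, dtype="float64")
        yield pd.DataFrame({
            "user_id": batch["user_id"],
            "score": scores,
            "prediction": (scores >= 0.5).astype("int32"),  # illustrative threshold
        })

# Apply model in parallel across Spark cluster
predictions_df = df.mapInPandas(
    score_batches,
    schema="user_id string, score double, prediction int"
)

predictions_df.write.parquet("s3://predictions/2026-03-08/", mode="overwrite")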

Real-Time Inference

The model is called at request time, producing a fresh prediction for each incoming request. Requires low latency (typically under 100ms) and auto-scaling to handle traffic spikes.

Architecture:

Client → API Gateway → Feature Service → Model Server → Response
                              ↓
                        Feature Store (Redis)
                        (pre-computed features,
                         served in <5ms)

Latency budget breakdown:

Network round trip: 10–20ms
Feature lookup (Redis): 1–5ms
Model inference: 5–50ms (depending on model complexity)
Response serialization: 1–2ms
Total budget: under 100ms (or under 50ms for high-speed systems)
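
The budget arithmetic is worth making explicit, because it tells you how much time the model itself is allowed. The sketch below uses the ranges listed above; the dictionary keys and the helper `max_inference_ms` are names chosen for the example.

```python
# (low, high) latency in milliseconds for each stage of a request
BUDGET_MS = {
    "network_round_trip": (10, 20),
    "feature_lookup_redis": (1, 5),
    "model_inference": (5, 50),
    "response_serialization": (1, 2),
}

best_case_ms = sum(low for low, _ in BUDGET_MS.values())
worst_case_ms = sum(high for _, high in BUDGET_MS.values())


def max_inference_ms(sla_ms: int) -> int:
    """Time left for the model after worst-case non-model overhead."""
    overhead = sum(
        high for name, (_, high) in BUDGET_MS.items() if name != "model_inference"
    )
    return sla_ms - overhead


headroom_100 = 100 - worst_case_ms
inference_cap_50 = max_inference_ms(50)
```

With these numbers the worst case is 77ms, leaving 23ms of headroom under a 100ms budget. Under a 50ms budget the model gets at most 23ms, so the upper end of the 5–50ms inference range no longer fits.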

Auto-scaling is critical: a 10x traffic spike (flash sale, news event) must not degrade latency. Kubernetes Horizontal Pod Autoscaler (HPA) scales on CPU or memory utilization out of the box, and on custom metrics (GPU utilization, requests per second, queue depth) when a metrics adapter exposes them.
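
The core HPA scaling rule is a ratio: desired replicas = ceil(current replicas × current metric / target metric), clamped to the configured bounds. The sketch below reproduces only that rule; the real controller also applies a tolerance band and stabilization windows that are not modeled here, and the default bounds are illustrative.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 2,
    max_replicas: int = 20,
) -> int:
    """Core HPA ratio rule, clamped to [min_replicas, max_replicas]."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


moderate_load = desired_replicas(4, current_metric=90, target_metric=70)
spike = desired_replicas(2, current_metric=700, target_metric=70)
quiet = desired_replicas(4, current_metric=10, target_metric=70)
```

Four pods at 90% CPU against a 70% target scale to 6; a 10x spike on two pods hits the ceiling of 20, which is a reminder to size `maxReplicas` for the spike you actually expect.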

Streaming Inference

Predictions are made per event in a streaming pipeline - events flow through Kafka, get enriched with features, and run through the model:

Kafka topic (events) → Flink/Spark Streaming → Feature enrichment
→ Model inference → Output topic (predictions) → Downstream consumers

When to use streaming: Real-time fraud detection (detect before transaction settles), dynamic pricing (reprice as supply changes), content moderation (process new posts immediately), IoT sensor anomaly detection.

Streaming adds complexity: you must handle late-arriving events, out-of-order events, and model version consistency across a distributed fleet. Use Apache Flink or Spark Structured Streaming with Kafka.
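
Late and out-of-order events are usually handled with a watermark: the largest event time seen so far minus an allowed lateness. The sketch below is a simplified, single-process illustration of that idea, not Flink or Spark code; `split_late_events` and the event tuple shape are hypothetical.

```python
from typing import Any, List, Tuple

Event = Tuple[float, Any]  # (event_time_seconds, payload)


def split_late_events(
    events: List[Event], allowed_lateness_s: float
) -> Tuple[List[Event], List[Event]]:
    """Classify events, in arrival order, as on-time or late."""
    max_seen = float("-inf")
    on_time: List[Event] = []
    late: List[Event] = []
    for event_time, payload in events:
        max_seen = max(max_seen, event_time)
        watermark = max_seen - allowed_lateness_s
        if event_time < watermark:
            late.append((event_time, payload))  # e.g. route to a side output
        else:
            on_time.append((event_time, payload))
    return on_time, late


events = [(100.0, "a"), (105.0, "b"), (103.0, "c"), (120.0, "d"), (101.0, "e")]
on_time, late = split_late_events(events, allowed_lateness_s=10.0)
```

Event `c` arrives out of order but within the 10-second allowance, so it is still scored; event `e` arrives after the watermark has moved to 110 and is treated as late.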

Model Serving Stacks - When to Use Each

Framework	Best for	Strengths	Limitations
TorchServe	PyTorch models	Native PyTorch, easy setup, good batching	PyTorch-only
TF Serving	TensorFlow/Keras	Mature, battle-tested at Google scale	TF-only, heavier
Triton Inference Server	Multi-framework, GPU serving	Supports TF/PyTorch/ONNX/TensorRT, optimal GPU utilization	Complex configuration
ONNX Runtime	Cross-framework, CPU serving	Universal format, fast CPU inference, easy integration	Inference runtime only (you build the serving layer); GPU use requires the CUDA or TensorRT execution providers
BentoML	Rapid deployment, flexibility	Framework-agnostic, simple API, good DX	Less optimized than Triton
vLLM	Large language models	PagedAttention, continuous batching, high-throughput LLM serving	LLM-specific

General guidance:

For small teams deploying PyTorch models to CPU: ONNX Runtime + FastAPI
For GPU fleet serving multiple model types: Triton Inference Server
For LLM deployments: vLLM
For simplest possible deployment: BentoML

Optimization for Latency and Throughput

When your model does not meet the latency SLA in its original form, apply optimization techniques in order of complexity:

1. Request Batching

Group multiple requests together and run them through the model as a single batch. GPUs are most efficient when processing large batches. Dynamic batching (Triton) collects requests that arrive within a time window (e.g., 5ms) and processes them together:

# Triton model config (config.pbtxt)
# dynamic_batching {
#   preferred_batch_size: [ 4, 8, 16 ]
#   max_queue_delay_microseconds: 5000  # 5ms max wait
# }

Batching improves throughput but adds latency for individual requests. The tradeoff is controlled by the queue delay parameter.
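
To see the tradeoff concretely, the sketch below simulates a batching window over request arrival times. It is a simplified model of the idea, not Triton's scheduler: `form_batches` and `queue_delays` are hypothetical helpers, and a batch is dispatched either when it is full or when the delay since its first request expires.

```python
from typing import List


def form_batches(
    arrivals_ms: List[float], max_batch_size: int, max_delay_ms: float
) -> List[List[float]]:
    """Group arrival timestamps into batches by size and time window."""
    batches: List[List[float]] = []
    current: List[float] = []
    for t in sorted(arrivals_ms):
        window_expired = bool(current) and t - current[0] > max_delay_ms
        if current and (window_expired or len(current) >= max_batch_size):
            batches.append(current)
            current = []
        current.append(t)
    if current:
        batches.append(current)
    return batches


def queue_delays(batch: List[float], max_batch_size: int, max_delay_ms: float) -> List[float]:
    """Time each request waits before its batch is dispatched."""
    full = len(batch) >= max_batch_size
    dispatch = batch[-1] if full else batch[0] + max_delay_ms
    return [dispatch - t for t in batch]


arrivals = [0, 1, 2, 3, 4, 12, 13]
batches = form_batches(arrivals, max_batch_size=4, max_delay_ms=5)
delays = [queue_delays(b, max_batch_size=4, max_delay_ms=5) for b in batches]
```

A full batch dispatches immediately, so its requests wait at most 3ms here, while the lone request in the second batch pays the full 5ms window. That worst-case wait is what the queue delay parameter caps.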

2. Quantization

Reduce numerical precision from FP32 to FP16 or INT8:

FP16: ~2x memory reduction, ~1.5–2x speedup on GPUs with Tensor Cores, minimal accuracy loss
INT8: ~4x memory reduction, ~2–4x speedup on CPU (Intel VNNI) and GPU (Tensor Cores), accuracy loss of 0.5–2%

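import torch
from torch.ao.quantization import quantize_dynamic
from onnxruntime.quantization import QuantType
from onnxruntime.quantization import quantize_dynamic as ort_quantize_dynamic

# Dynamic quantization: INT8 for linear layers (no calibration data needed)
model_int8 = quantize_dynamic(
    model,
    {torch.nn.Linear, torch.nn.LSTM},
    dtype=torch.qint8
)

# For cross-platform serving, export the FP32 model and quantize the ONNX graph.
# Exporting an already-quantized PyTorch module to ONNX is not reliably supported.
torch.onnx.export(model, dummy_input, "model_fp32.onnx",
                  opset_version=14, do_constant_folding=True)
ort_quantize_dynamic("model_fp32.onnx", "model_int8.onnx",
                     weight_type=QuantType.QInt8)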

For production, use Post-Training Static Quantization (requires calibration data) for better accuracy than dynamic quantization, or Quantization-Aware Training (QAT) for maximum accuracy.
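
What INT8 quantization does to a weight tensor can be shown without any framework. The sketch below implements symmetric per-tensor quantization in plain Python as an illustration; production toolchains use per-channel scales and calibrated activation ranges, and the function names here are hypothetical.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.003, 1.0]
q_weights, scale = quantize_int8(weights)
restored = dequantize(q_weights, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight now needs 1 byte instead of 4, which is the source of the ~4x memory reduction. The price is rounding error of at most half a quantization step, and small weights such as 0.003 collapse to zero.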

3. Knowledge Distillation

Train a smaller "student" model to mimic the outputs of a larger "teacher" model. The student learns from the teacher's soft probability outputs (which carry more information than hard labels) and can achieve 80–90% of the teacher's performance at 10–20% of the computational cost.

$\mathcal{L}_{\text{distillation}} = \alpha \cdot \mathcal{L}_{\text{CE}}(y, \hat{y}_{\text{student}}) + (1-\alpha) \cdot \mathcal{L}_{\text{KL}}\!\left(\sigma\!\left(\frac{z_{\text{teacher}}}{T}\right) \,\Big\|\, \sigma\!\left(\frac{z_{\text{student}}}{T}\right)\right)$

Where $z$ are the logits, $\sigma$ is the softmax, $T$ is the temperature (controls softness of teacher distribution) and $\alpha$ balances task loss vs distillation loss. Many implementations also multiply the KL term by $T^2$ so that its gradients stay comparable in magnitude as $T$ changes. DistilBERT (Sanh et al., 2019) achieved 97% of BERT's performance with 40% fewer parameters and 60% faster inference using this technique.
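
The loss can be computed for a single example in plain Python. This is a minimal sketch that follows the formula as written, with the KL term taken between temperature-softened softmax distributions; the function names are hypothetical, and real training code would use a tensor library and batch the computation.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(
    student_logits: List[float],
    teacher_logits: List[float],
    label: int,
    alpha: float = 0.5,
    temperature: float = 2.0,
) -> float:
    # Task loss: cross-entropy of the student against the hard label (T = 1)
    ce = -math.log(softmax(student_logits)[label])
    # Distillation loss: KL(teacher || student) on softened distributions
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student) if pt > 0)
    # Some formulations scale kl by temperature**2; omitted to match the formula
    return alpha * ce + (1 - alpha) * kl


matched = distillation_loss([2.0, 0.0], [2.0, 0.0], label=0)
mismatched = distillation_loss([0.0, 2.0], [2.0, 0.0], label=0)
task_only = 0.5 * -math.log(softmax([2.0, 0.0])[0])
```

When the student reproduces the teacher's logits exactly, the KL term vanishes and only the weighted task loss remains; a student that disagrees with the teacher is penalized by both terms.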

4. Model Pruning

Remove weights that contribute minimally to predictions. Unstructured pruning zeroes out individual weights; structured pruning removes entire neurons/filters/attention heads (more hardware-friendly).

import torch
import torch.nn.utils.prune as prune

# Prune 30% of weights in each Linear layer (unstructured)
for module_name, module in model.named_modules():
    if isinstance(module, torch.nn.Linear):
        prune.l1_unstructured(module, name='weight', amount=0.30)
        prune.remove(module, 'weight')  # make pruning permanent

# Fine-tune for 1–2 epochs to recover accuracy

5. TensorRT Optimization

NVIDIA TensorRT compiles a trained model into an optimized engine specifically for the deployment GPU. It fuses operations, selects optimal kernel implementations, and applies automatic precision reduction:

import tensorrt as trt

# Build TensorRT engine from ONNX model
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

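with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        # parse() returns False on failure; report the parser errors and stop
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("Failed to parse model.onnx")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # Enable FP16
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)  # 1GB

# Returns the serialized engine bytes (None if the build fails)
engine = builder.build_serialized_network(network, config)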

TensorRT typically gives 2–8x speedup over vanilla PyTorch for GPU inference on NVIDIA hardware.

Model Registry - Versioning, Lineage, and Rollback

A model registry is the source of truth for model artifacts in production. It tracks: model version, training data version, performance metrics, deployment history, and approval status.

What a model registry must support:

Artifact storage: Serialized model files, preprocessing pipeline, feature schema
Versioning: Every model version is immutable and addressable
Lineage: Which dataset, code version, and hyperparameters produced this model
Stage transitions: development → staging → production
Rollback: Revert to any previous production version instantly

import mlflow
from mlflow.tracking import MlflowClient

# Log model with metadata
with mlflow.start_run() as run:
    mlflow.log_params(best_params)
    mlflow.log_metrics({"val_auc": 0.923, "val_f1": 0.871})
    mlflow.xgboost.log_model(
        model,
        artifact_path="model",
        registered_model_name="fraud_detection_v2",
    )

# Promote to production
client = MlflowClient()
client.transition_model_version_stage(
    name="fraud_detection_v2",
    version="14",
    stage="Production",
    archive_existing_versions=True  # auto-archive current production version
)

# Load production model in serving
production_model = mlflow.pyfunc.load_model(
    "models:/fraud_detection_v2/Production"
)

Rollback procedure:

# If the new model is causing problems, promote the previous version back
client.transition_model_version_stage(
    name="fraud_detection_v2",
    version="13",   # Previous production version
    stage="Production",
    archive_existing_versions=True
)
# The registry change is immediate, but serving processes keep the model they already
# loaded; they must poll the registry or be restarted to pick up the new Production version
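
For a registry change to reach running servers, each serving process has to notice it and reload. A minimal sketch of a polling holder follows; `ReloadingModel` is hypothetical, `resolve_version` stands in for a registry query that returns the version currently marked as production, and `load` stands in for the loader call.

```python
import time
from typing import Any, Callable


class ReloadingModel:
    """Holds a loaded model and swaps it when the registry points elsewhere."""

    def __init__(
        self,
        resolve_version: Callable[[], str],
        load: Callable[[str], Any],
        poll_seconds: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._resolve_version = resolve_version
        self._load = load
        self._poll_seconds = poll_seconds
        self._clock = clock
        self._version = resolve_version()
        self._model = load(self._version)
        self._last_check = clock()

    def get(self) -> Any:
        now = self._clock()
        if now - self._last_check >= self._poll_seconds:
            self._last_check = now
            latest = self._resolve_version()
            if latest != self._version:
                # Load fully before swapping so requests never see a half-loaded model
                self._model = self._load(latest)
                self._version = latest
        return self._model


# Simulated registry and clock so the behavior is easy to follow
state = {"version": "14", "now": 0.0}
holder = ReloadingModel(
    resolve_version=lambda: state["version"],
    load=lambda v: f"model-v{v}",
    poll_seconds=30.0,
    clock=lambda: state["now"],
)
first = holder.get()
state["version"] = "13"  # rollback happens in the registry
state["now"] = 10.0
before_poll = holder.get()
state["now"] = 31.0
after_poll = holder.get()
```

The poll interval is the upper bound on how long a rollback takes to reach a server. This sketch reloads on the request path and is not thread-safe; a production version would typically reload in a background thread.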

:::tip Define rollback criteria before deployment Before promoting any model to production, define explicit rollback criteria: "If error rate exceeds 0.5%, or p99 latency exceeds 200ms, or booking conversion drops more than 2% relative to baseline, trigger automatic rollback." This prevents the "should we roll back or wait?" debate during a production incident. :::
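
Rollback criteria are easiest to enforce when they are written as code rather than prose. The sketch below encodes the three example thresholds from the tip; `RollbackCriteria` and `rollback_reasons` are hypothetical names, and the metric values would come from your monitoring system.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RollbackCriteria:
    max_error_rate: float = 0.005        # 0.5%
    max_p99_latency_ms: float = 200.0
    max_conversion_drop: float = 0.02    # 2% relative to baseline


def rollback_reasons(
    criteria: RollbackCriteria,
    error_rate: float,
    p99_latency_ms: float,
    conversion: float,
    baseline_conversion: float,
) -> List[str]:
    """Return the violated criteria; an empty list means keep serving."""
    reasons: List[str] = []
    if error_rate > criteria.max_error_rate:
        reasons.append("error_rate")
    if p99_latency_ms > criteria.max_p99_latency_ms:
        reasons.append("p99_latency")
    relative_drop = (baseline_conversion - conversion) / baseline_conversion
    if relative_drop > criteria.max_conversion_drop:
        reasons.append("conversion")
    return reasons


criteria = RollbackCriteria()
healthy = rollback_reasons(criteria, 0.001, 180.0, conversion=0.0495, baseline_conversion=0.05)
degraded = rollback_reasons(criteria, 0.001, 250.0, conversion=0.0485, baseline_conversion=0.05)
```

Returning the list of violated criteria, rather than a bare boolean, gives the incident record the triggering metrics for free.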

Common Mistakes

:::danger Skipping shadow mode for high-stakes systems Any model that makes decisions with real consequences for users (fraud blocks, content removal, lending decisions) must run in shadow mode before going live. The offline holdout set does not reveal behavior on production traffic patterns, edge cases, or adversarial inputs that exist only in production. :::

:::danger No rollback procedure before deployment Every deployment must have a tested rollback procedure defined before it goes live. "We'll figure it out if something goes wrong" is not a plan. Rollback must be achievable in under 5 minutes and must not require a new code deployment. :::

:::warning Assuming batch inference is always cheaper Batch inference has lower per-prediction compute cost but higher engineering cost (pipeline orchestration, freshness management, storage). For low-volume models, the infrastructure overhead of batch jobs exceeds the cost of simple real-time serving. Choose based on total cost of ownership, not just compute unit cost. :::

:::warning Optimizing for throughput without measuring latency percentiles Average latency is a lie. A model that serves 95% of requests in 50ms and 5% in 5 seconds has a mean latency of about 300ms, which hides the fact that 1 in 20 requests takes 5 seconds. Always monitor p95 and p99 latency (the 95th and 99th percentile) alongside the mean. p99 is the latency that the slowest 1% of requests exceed, so it is a far better proxy for your worst-served users than the average. :::
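
The example in the warning is easy to check numerically. The sketch below uses a nearest-rank percentile, which is one of several common definitions; monitoring systems that estimate percentiles from histogram buckets will give slightly different values.

```python
import math
from typing import List


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# 95% of requests take 50ms, 5% take 5 seconds
latencies_ms = [50.0] * 95 + [5000.0] * 5

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p95_ms = percentile(latencies_ms, 95)
p99_ms = percentile(latencies_ms, 99)
```

The mean lands at 297.5ms, a latency that no request actually experienced. Note that p95 is still 50ms here because the slow requests are exactly the top 5%; only p99 exposes the 5-second tail, which is why a single percentile is not enough.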

YouTube Resources

MLOps Community - "Model Serving at Scale": talks from engineers at Uber, LinkedIn, and DoorDash on their model serving infrastructure
NVIDIA - "Triton Inference Server Tutorial": hands-on walkthrough of deploying models to Triton for GPU-accelerated serving
Chip Huyen - "Designing Machine Learning Systems" lecture series: covers deployment patterns and serving architectures in depth

Interview Q&A

Q1: Walk me through how you would deploy a new fraud detection model that will process 100K transactions per second.

I would start with shadow mode. The existing model processes all transactions; the new model runs in parallel and logs its decisions, but never acts on them. After 2 weeks of shadow mode, I analyze divergence patterns - where do the two models disagree, and is the new model's behavior correct? After validating, I begin a canary rollout: 1% of traffic for 24 hours, monitoring p99 latency, false positive rate by transaction category, and overall block rate. If metrics are clean, I proceed to 5%, 20%, 50%, then 100%, with monitoring gates at each stage. For serving at 100K TPS, I would use Triton Inference Server on a GPU fleet behind a load balancer, with INT8 quantization (roughly 4x less memory and 2–4x faster inference) and dynamic batching. The model registry (MLflow) holds all versions with rollback in under 5 minutes.

Q2: What is the difference between canary and blue-green deployment for ML models?

Blue-green is a binary switch: all traffic moves from the old model to the new model at once, with instant rollback capability. It is simple and safe (rollback in seconds) but provides no gradual validation in production. Canary is a gradual shift: traffic moves from old to new in stages (1% → 5% → 20% → 100%), monitoring metrics at each stage. If anything degrades, you roll back from a small blast radius rather than from 100% exposure. For ML models specifically, canary is almost always preferred because online behavior is hard to predict from offline metrics - you want to validate on a small user cohort before committing. Blue-green is appropriate when the model change is minor (a retrain with the same architecture) and you have high confidence from offline evaluation and shadow mode.

Q3: When would you use batch inference vs real-time inference?

Batch inference is appropriate when: (1) the entity set is known in advance (all users, all products), (2) prediction freshness requirements allow hours of staleness, (3) computation is expensive and cannot be done at request time, or (4) features are available in bulk. Examples: nightly email recommendation, weekly credit score refresh, product catalog scoring. Real-time inference is required when: (1) predictions depend on the current request context (search query, current page), (2) freshness is critical (fraud detection must be instantaneous), or (3) the entity is not known in advance (new user, new content). The hybrid pattern - pre-compute user embeddings in batch, combine with item embeddings at request time - is common in recommendation systems to get the best of both worlds.

Q4: How would you optimize a Transformer model that is not meeting a 50ms p99 latency SLA?

I would work through optimizations in order of risk. First, dynamic batching: collect requests within a 2–5ms window and process as a batch - improves GPU utilization with no accuracy impact, as long as the queue delay fits inside the latency budget. Second, FP16 quantization: convert from FP32 to FP16 - roughly 2x speedup on NVIDIA Tensor Core GPUs, minimal accuracy loss. Third, INT8 post-training quantization with calibration: 4x memory reduction, 2–4x speedup, may require accepting a 1–2% accuracy drop. Fourth, knowledge distillation: train a smaller student model (DistilBERT from BERT, for example) - can achieve 90%+ of the teacher's accuracy at 2–5x the speed. Fifth, TensorRT compilation: compile the model specifically for the deployment GPU - typically another 2–4x speedup. Sixth, reduce sequence length: if inputs allow, truncate or use a sliding window - self-attention cost grows as O(n^2), so halving sequence length cuts attention compute by about 4x (the end-to-end gain is smaller because the feed-forward layers scale linearly). I would instrument p99 latency at each step and stop when the SLA is met.

Q5: What should a model rollback procedure look like?

A model rollback procedure must be: fast (under 5 minutes), tested before deployment (not theoretical), automated where possible, and not require a new code deployment. The procedure: (1) Define rollback criteria in advance - specific metric thresholds that trigger rollback (error rate, latency, business metric decline). (2) Store previous production model version in the registry with a clear version number. (3) At rollback trigger, transition the previous version back to "Production" stage in the registry - serving instances pick up the change on their next reload (registry polling or a rolling restart). (4) Document the rollback decision with the triggering metric values and timestamp. (5) Notify the team and begin postmortem. The registry-based approach (MLflow, SageMaker Model Registry) makes this a 30-second operation. The worst rollback I have seen required a fresh code deployment because the model was hardcoded in the application - that took 45 minutes. Model artifacts must always be decoupled from application code.

Production Infrastructure Patterns

Feature Serving - Online vs Precomputed

The feature serving architecture is a first-class part of deployment design. Features for real-time inference fall into two categories:

Precomputed (batch) features: Computed once, stored in a low-latency feature store (Redis, DynamoDB), retrieved at request time. Examples: user embedding (computed nightly), historical purchase count, credit score. Latency: 1–5ms.

Request-time (streaming) features: Computed at the moment of the request, using the current event as input. Examples: current session duration, time since last action, real-time fraud signals from the last 60 seconds. Latency: varies based on complexity, 5–50ms typical.

import redis
import json
from typing import Any

class FeatureStore:
    """Simple Redis-backed feature store for precomputed user features."""

    def __init__(self, host: str = "localhost", port: int = 6379):
        self.client = redis.Redis(host=host, port=port, decode_responses=True)

    def get_user_features(self, user_id: str) -> dict[str, Any]:
        """Retrieve precomputed user features. Sub-5ms at p99."""
        raw = self.client.get(f"user_features:{user_id}")
        if raw is None:
            return self._get_default_features()
        return json.loads(raw)

    def set_user_features(self, user_id: str, features: dict, ttl_seconds: int = 86400):
        """Write user features (called by nightly batch job)."""
        self.client.setex(
            f"user_features:{user_id}",
            ttl_seconds,
            json.dumps(features)
        )

    def _get_default_features(self) -> dict:
        """Cold-start defaults for new users."""
        return {"age_days": 0, "purchase_count": 0, "avg_order_value": 0.0}

class RealtimePredictor:
    # Column order the model was trained with. Never rely on dict ordering:
    # JSON read back from Redis may not preserve the training-time order.
    FEATURE_ORDER = [
        "age_days", "purchase_count", "avg_order_value",
        "session_duration_s", "items_viewed", "hour_of_day",
    ]

    def __init__(self, model, feature_store: FeatureStore):
        self.model = model
        self.feature_store = feature_store

    def predict(self, user_id: str, request_context: dict) -> float:
        # Layer 1: Precomputed features (fast, from Redis)
        user_features = self.feature_store.get_user_features(user_id)

        # Layer 2: Request-time features (computed now)
        request_features = {
            "session_duration_s": request_context.get("session_duration", 0),
            "items_viewed": request_context.get("items_viewed", 0),
            "hour_of_day": request_context.get("hour", 12),
        }

        # Merge, then build the input row in the fixed training order
        all_features = {**user_features, **request_features}
        row = [all_features[name] for name in self.FEATURE_ORDER]
        return float(self.model.predict_proba([row])[0, 1])

Multi-Model Serving - A/B Routing and Ensemble

Production systems often run multiple model versions simultaneously for A/B testing or ensemble prediction:

import hashlib
from enum import Enum

class ModelVariant(Enum):
    CONTROL = "control"
    TREATMENT = "treatment"

class ABModelRouter:
    def __init__(self, control_model, treatment_model, treatment_pct: float = 0.1):
        self.models = {
            ModelVariant.CONTROL: control_model,
            ModelVariant.TREATMENT: treatment_model,
        }
        self.treatment_pct = treatment_pct

    def _assign_variant(self, user_id: str) -> ModelVariant:
        """Deterministically assign user to control or treatment."""
        # Hash-based assignment: same user always sees same variant
        hash_val = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
        bucket = hash_val % 100  # 0–99
        # round() avoids float truncation: int(0.29 * 100) evaluates to 28, not 29
        if bucket < round(self.treatment_pct * 100):
            return ModelVariant.TREATMENT
        return ModelVariant.CONTROL

    def predict(self, user_id: str, features: list) -> dict:
        variant = self._assign_variant(user_id)
        model = self.models[variant]
        score = float(model.predict_proba([features])[0, 1])

        return {
            "score": score,
            "model_variant": variant.value,
            "user_id": user_id,
        }

# Hash-based assignment ensures:
# 1. Each user always sees the same variant (consistent experience)
# 2. Assignment is deterministic - no need to store variant in a database
# 3. Approximately treatment_pct of users get the treatment model
#    (the split converges to the target as the number of users grows)

Containerization and Kubernetes Deployment

Every model in production should be containerized. This ensures the model runs identically in development, staging, and production:

# Dockerfile for ML model serving
FROM python:3.11-slim

WORKDIR /app

# Install only inference dependencies (not training libraries)
COPY requirements-serve.txt .
RUN pip install --no-cache-dir -r requirements-serve.txt

# Copy model artifacts from CI/CD pipeline
COPY model/ ./model/
COPY src/serve.py .

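# Container health check (used by Docker; Kubernetes ignores HEALTHCHECK and relies
# on the probes defined in the manifest). python:3.11-slim does not ship curl,
# so probe the endpoint with the standard library instead.
HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8080/health', timeout=5)" || exit 1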

EXPOSE 8080
CMD ["uvicorn", "serve:app", "--host", "0.0.0.0", "--port", "8080", "--workers", "4"]

# Kubernetes deployment with auto-scaling
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fraud-model-canary
spec:
  replicas: 2
  selector:
    matchLabels:
      app: fraud-model
      variant: canary
  template:
    metadata:
      labels:
        app: fraud-model
        variant: canary
    spec:
      containers:
      - name: model-server
        image: company/fraud-model:v3.2.1
        resources:
          requests:
            cpu: "500m"
            memory: "512Mi"
          limits:
            cpu: "2"
            memory: "2Gi"
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
---
# Horizontal Pod Autoscaler
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: fraud-model-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: fraud-model-canary
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
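
To make the autoscaler's behavior concrete, here is a minimal sketch of the replica calculation that the Kubernetes documentation describes for the Horizontal Pod Autoscaler: the desired count is the current count scaled by the ratio of observed to target utilization, rounded up, skipped when the ratio is within a tolerance (0.1 by default), and clamped to the configured bounds. The function name `desired_replicas` is hypothetical, and the real controller also applies stabilization windows and readiness handling that are not modeled here.

```python
import math

def desired_replicas(current_replicas: int, current_util: float, target_util: float = 70,
                     min_replicas: int = 2, max_replicas: int = 20,
                     tolerance: float = 0.1) -> int:
    """Approximate the HPA scaling rule for a single CPU utilization metric."""
    ratio = current_util / target_util
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: no scaling
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))

print(desired_replicas(2, 140))   # pods at 140% of requested CPU
print(desired_replicas(4, 72))    # within tolerance of the 70% target
print(desired_replicas(10, 210))  # capped by maxReplicas
```

Utilization is measured relative to the CPU request, so with the `500m` request above a 70% target corresponds to roughly 350 millicores of average usage per pod.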

Monitoring and Alerting

Production models require monitoring at three levels: infrastructure, data, and model behavior.

import prometheus_client as prom
import time

# Define metrics
REQUEST_LATENCY = prom.Histogram(
    "model_request_duration_seconds",
    "Model inference latency",
    buckets=[0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0],
    labelnames=["model_version", "endpoint"],
)

PREDICTION_SCORE = prom.Histogram(
    "model_prediction_score",
    "Distribution of model output scores",
    buckets=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    labelnames=["model_version"],
)

ERROR_COUNTER = prom.Counter(
    "model_prediction_errors_total",
    "Total number of prediction errors",
    labelnames=["model_version", "error_type"],
)

class InstrumentedPredictor:
    def __init__(self, model, model_version: str):
        self.model = model
        self.version = model_version

    def predict(self, features: list) -> float:
        start = time.perf_counter()
        try:
            score = float(self.model.predict_proba([features])[0, 1])
            REQUEST_LATENCY.labels(
                model_version=self.version, endpoint="predict"
            ).observe(time.perf_counter() - start)
            PREDICTION_SCORE.labels(model_version=self.version).observe(score)
            return score
        except Exception as e:
            ERROR_COUNTER.labels(
                model_version=self.version, error_type=type(e).__name__
            ).inc()
            raise

Critical alerts to set up before any deployment:

p99 latency exceeds 200ms (sustained for 5 minutes)
Error rate exceeds 0.5% (sustained for 2 minutes)
Prediction score distribution shifts materially against the baseline window (Population Stability Index, PSI > 0.25)
Pod restart count exceeds 3 in 10 minutes (model crash loop)
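
The PSI threshold in the list above can be computed from two samples of prediction scores. The sketch below is one common formulation, assuming scores in [0, 1], ten equal-width bins, and a small floor for empty bins; the function name `psi` and the synthetic Beta-distributed scores are illustrative, not production data.

```python
import math
import random

def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index between a baseline and a current score sample."""
    def shares(scores):
        counts = [0] * n_bins
        for s in scores:
            idx = min(int(s * n_bins), n_bins - 1)  # scores assumed in [0, 1]
            counts[idx] += 1
        # Floor empty bins so the logarithm below is always defined.
        return [max(c / len(scores), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

rng = random.Random(0)
baseline = [rng.betavariate(2, 8) for _ in range(20_000)]  # last week's scores
same = [rng.betavariate(2, 8) for _ in range(20_000)]      # same distribution
shifted = [rng.betavariate(4, 6) for _ in range(20_000)]   # scores drifted upward

print(f"PSI, no drift: {psi(baseline, same):.4f}")
print(f"PSI, drifted:  {psi(baseline, shifted):.4f}")
```

A frequently quoted rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift, but the right threshold depends on bin choice and sample size, so it is worth calibrating against historical week-over-week values.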

Edge Deployment - Mobile and IoT Patterns

Increasingly, ML models are deployed not to cloud servers but to edge devices: mobile phones, embedded sensors, autonomous vehicles. The constraints are fundamentally different:

Memory: 1–10 MB (model weights only)
Latency: under 50ms on device CPU (no GPU, no network round-trip)
Power: battery drain is a real constraint (float32 math drains more than int8)
Offline: device may not have internet connectivity

Edge model checklist:

Quantize to INT8: reduces model size 4x, reduces compute 2–4x, typical accuracy drop less than 2%
Prune to target sparsity: 50–80% weight sparsity is achievable with minimal accuracy loss
Export to TFLite or ONNX: portable formats with runtimes optimized for edge hardware
Benchmark on target hardware: a model that runs at 50ms on your dev laptop may run at 500ms on a low-end Android device
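
To show where the 4x size reduction and the small accuracy loss come from, here is a minimal sketch of affine INT8 quantization in plain Python: each float32 value (4 bytes) is mapped to one signed byte using a scale and a zero point. The helper names are hypothetical, and real converters add per-channel scales and calibration on representative data, so treat this as an illustration of the arithmetic only.

```python
def quantize_params(values, qmin=-128, qmax=127):
    """Pick a scale and zero point that cover the value range (range must include 0)."""
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # avoid a zero scale for all-zero input
    zero_point = round(qmin - lo / scale)
    return scale, zero_point

def quantize(values, scale, zero_point, qmin=-128, qmax=127):
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]

def dequantize(q, scale, zero_point):
    return [(qi - zero_point) * scale for qi in q]

weights = [-0.8, -0.1, 0.0, 0.25, 1.2]
scale, zero_point = quantize_params(weights)
q = quantize(weights, scale, zero_point)
restored = dequantize(q, scale, zero_point)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("int8 values:", q)
print(f"scale={scale:.6f}, zero_point={zero_point}, max error={max_error:.6f}")
```

The rounding error per value is bounded by about half the scale, which is why layers with a wide value range lose more precision than layers with a narrow one.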

import tensorflow as tf

# TFLite conversion with INT8 quantization for mobile deployment
def export_tflite_int8(keras_model, representative_dataset):
    converter = tf.lite.TFLiteConverter.from_keras_model(keras_model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]

    # Representative dataset for calibration (100 samples sufficient)
    def representative_data_gen():
        for features in representative_dataset:
            yield [features.astype("float32")]

    converter.representative_dataset = representative_data_gen
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
    converter.inference_input_type = tf.int8
    converter.inference_output_type = tf.int8

    tflite_model = converter.convert()

    with open("model_int8.tflite", "wb") as f:
        f.write(tflite_model)

    print(f"INT8 model size: {len(tflite_model) / 1024:.1f} KB")
    return tflite_model

Role-Specific Callouts

:::note Machine Learning Engineer Deployment strategy is your responsibility, not just model training. Know the full pipeline: shadow mode, canary, gradual rollout, monitoring, and rollback. Be able to estimate serving costs (how many GPU instances does a 100K QPS Transformer require?) and latency at each optimization stage. :::
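
The capacity question in the Machine Learning Engineer callout reduces to simple arithmetic once per-instance throughput has been measured. The sketch below assumes a hypothetical benchmark of 2,500 QPS per GPU instance with batching, a 60% utilization target to leave headroom for spikes, and one spare instance for failover; every number here is a placeholder to be replaced with your own load-test results.

```python
import math

def instances_needed(peak_qps: float, per_instance_qps: float,
                     target_utilization: float = 0.6, spare_instances: int = 1) -> int:
    """Estimate instance count from peak load and benchmarked per-instance throughput."""
    usable_qps = per_instance_qps * target_utilization
    return math.ceil(peak_qps / usable_qps) + spare_instances

# Assumed figures, not measurements.
estimate = instances_needed(peak_qps=100_000, per_instance_qps=2_500)
print(f"GPU instances required: {estimate}")
```

Re-running the estimate after each optimization stage (batching, quantization, distillation) shows how much each one reduces serving cost.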

:::note MLOps / Platform Engineer Build the deployment infrastructure that makes the above patterns accessible to every data scientist on the team. A great MLOps platform makes shadow mode trivial to enable, canary rollouts automatic, and rollback a one-click operation. The goal: a data scientist should be able to ship a new model version without involving an infrastructure engineer. :::

:::note AI Engineer For LLM-based systems, deployment patterns extend to: prompt versioning (treat prompts like model versions), response caching (semantic cache with embedding similarity), streaming responses (SSE/WebSockets for long generations), and rate limiting. The same shadow mode and canary principles apply - test a new prompt or model on a small fraction of traffic before full rollout. :::

:::note Data Scientist Deployment is not your primary focus, but understanding it makes you dramatically more effective. Know how to package your model for handoff (ONNX, pickle with explicit dependencies, MLflow logged model), write a serving test that validates the deployed model's output matches your notebook output, and define the monitoring metric that should alert if your model is degrading. :::

Full End-to-End Deployment Walkthrough

Let us walk through a complete deployment for a fraud detection model at a payments company processing 10,000 transactions per second.

Phase 1 - Packaging the Model

Before deploying anything, the model must be packaged with its full inference pipeline: preprocessing steps, feature names, model weights, and serving code.

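import mlflow
import mlflow.sklearn
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier

# Assumed to be defined upstream: X_train, y_train, X_holdout, y_holdout (pandas objects),
# best_params (dict of tuned hyperparameters), and precision_at_5pct_fpr (float).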

# Define the complete inference pipeline
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", XGBClassifier(n_estimators=300, max_depth=5)),
])

pipeline.fit(X_train, y_train)

# Validate on holdout before packaging
holdout_auc = roc_auc_score(y_holdout, pipeline.predict_proba(X_holdout)[:, 1])
print(f"Holdout AUC: {holdout_auc:.4f}")  # must exceed production threshold

# Log to MLflow registry with full metadata
with mlflow.start_run(run_name="fraud-v3.2-candidate") as run:
    mlflow.log_params(best_params)
    mlflow.log_metrics({
        "holdout_auc": holdout_auc,
        "holdout_precision_at_5pct_fpr": precision_at_5pct_fpr,
        "training_data_rows": len(X_train),
        "feature_count": X_train.shape[1],
    })
    mlflow.log_param("training_data_version", "2026-03-01")

    # Log feature schema for serving validation
    feature_schema = {
        "features": list(X_train.columns),
        "dtypes": {col: str(dtype) for col, dtype in X_train.dtypes.items()}
    }
    mlflow.log_dict(feature_schema, "feature_schema.json")

    # Package as MLflow model
    mlflow.sklearn.log_model(
        pipeline,
        artifact_path="model",
        registered_model_name="fraud_detection",
        input_example=X_train.iloc[:5],
        signature=mlflow.models.infer_signature(X_train, pipeline.predict(X_train)),
    )

The model signature is critical: it records the exact input/output schema. The serving layer validates every incoming request against this schema. Schema mismatches (wrong feature names, wrong types) are caught before they silently corrupt predictions.
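
MLflow's pyfunc flavor applies its own signature enforcement at predict time. The plain-Python sketch below only illustrates the idea of checking a request against the logged `feature_schema.json`. The function name `validate_request`, the dtype-to-type mapping, and the three-feature schema are assumptions made for the example.

```python
def validate_request(payload: dict, schema: dict) -> list[str]:
    """Return a list of schema violations; an empty list means the payload is valid."""
    # Map the pandas dtype strings stored in feature_schema.json to Python types.
    allowed_types = {"float64": (int, float), "int64": (int,), "object": (str,)}
    errors = []
    for name in schema["features"]:
        if name not in payload:
            errors.append(f"missing feature: {name}")
            continue
        expected = allowed_types.get(schema["dtypes"].get(name))
        value = payload[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if expected and (isinstance(value, bool) or not isinstance(value, expected)):
            errors.append(f"wrong type for {name}: got {type(value).__name__}")
    for name in payload:
        if name not in schema["features"]:
            errors.append(f"unexpected feature: {name}")
    return errors

schema = {
    "features": ["amount", "merchant_category", "hour_of_day"],
    "dtypes": {"amount": "float64", "merchant_category": "object", "hour_of_day": "int64"},
}
good = {"amount": 42.5, "merchant_category": "grocery", "hour_of_day": 3}
bad = {"amount": "42.5", "hour_of_day": 3, "device": "ios"}

print(validate_request(good, schema))
print(validate_request(bad, schema))
```

Rejecting the request with an explicit error is usually safer than coercing types, because a silently coerced feature can shift predictions without triggering any alert.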

Phase 2 - Shadow Mode Implementation

# FastAPI serving endpoint with shadow mode support
from fastapi import FastAPI, BackgroundTasks
from pydantic import BaseModel
import mlflow.pyfunc
import logging
import json
import time

app = FastAPI()

# Load models
production_model = mlflow.pyfunc.load_model("models:/fraud_detection/Production")
shadow_model = mlflow.pyfunc.load_model("models:/fraud_detection/Staging")  # candidate

logger = logging.getLogger("shadow_mode")

class TransactionFeatures(BaseModel):
    amount: float
    merchant_category: str
    hour_of_day: int
    device_type: str
    user_tenure_days: int
    # ... (all features)

def log_shadow_prediction(features: dict, prod_score: float):
    """Log the shadow prediction after the response is sent.

    Declared as a plain function so FastAPI runs it in a threadpool
    instead of blocking the event loop.
    """
    try:
        import pandas as pd
        df = pd.DataFrame([features])
        shadow_score = float(shadow_model.predict(df)[0])
        logger.info(json.dumps({
            "event": "shadow_prediction",
            "merchant_category": features.get("merchant_category"),
            "production_score": prod_score,
            "shadow_score": shadow_score,
            # Compare at the same 0.85 threshold used for the block/allow decision
            "agreed": (prod_score > 0.85) == (shadow_score > 0.85),
            "timestamp": time.time(),
        }))
    except Exception as e:
        logger.error(f"Shadow model failed: {e}")  # never propagate to caller

@app.post("/predict")
async def predict(features: TransactionFeatures, background_tasks: BackgroundTasks):
    import pandas as pd
    feature_dict = features.dict()
    df = pd.DataFrame([feature_dict])

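    # Production prediction - on critical path.
    # Assumes the registered pyfunc model returns a fraud probability per row;
    # the default scikit-learn flavor's predict() returns class labels instead.
    start = time.perf_counter()
    score = float(production_model.predict(df)[0])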
    latency_ms = (time.perf_counter() - start) * 1000

    # Shadow prediction - background, never blocks
    background_tasks.add_task(log_shadow_prediction, feature_dict, score)

    return {
        "fraud_score": score,
        "decision": "block" if score > 0.85 else "allow",
        "model_version": "v3.1",
        "latency_ms": round(latency_ms, 2),
    }

After two weeks of shadow mode, analyze the divergence logs:

import pandas as pd
import json

# Load shadow mode logs
shadow_logs = pd.read_json("shadow_logs.jsonl", lines=True)

print(f"Total predictions: {len(shadow_logs):,}")
print(f"Agreement rate: {shadow_logs['agreed'].mean():.2%}")
print(f"Shadow scores higher: {(shadow_logs['shadow_score'] > shadow_logs['production_score']).mean():.2%}")
print(f"Shadow scores lower: {(shadow_logs['shadow_score'] < shadow_logs['production_score']).mean():.2%}")

# Identify high-divergence cases for manual review
divergence = (shadow_logs["shadow_score"] - shadow_logs["production_score"]).abs()
high_divergence = shadow_logs[divergence > 0.3]
print(f"\nHigh divergence cases (delta > 0.3): {len(high_divergence):,} ({len(high_divergence)/len(shadow_logs):.2%})")

# Analyze: are high-divergence cases concentrated in a specific merchant category?
# high_divergence.groupby("merchant_category")["shadow_score"].agg(["count", "mean"])

Phase 3 - Canary Rollout Automation

import mlflow
from mlflow.tracking import MlflowClient
import time

class CanaryRolloutManager:
    def __init__(self, model_name: str, client: MlflowClient):
        self.model_name = model_name
        self.client = client
        self.stages = [0.01, 0.05, 0.10, 0.25, 0.50, 1.00]

    def advance_rollout(self, current_pct: float, metrics: dict) -> float | None:
        """
        Advance to the next rollout stage if metrics are clean.
        Returns the new percentage, the current percentage if the rollout
        is already complete, or None if a rollback is needed.
        """
        # Guardrail checks
        if metrics.get("p99_latency_ms", 0) > 200:
            return None  # Rollback: latency SLA violated
        if metrics.get("error_rate", 0) > 0.005:
            return None  # Rollback: error rate exceeded
        if abs(metrics.get("block_rate_relative_change", 0)) > 0.15:
            return None  # Rollback: block rate shifted >15% in either direction

        # Find next stage; stay at the current stage once the last one is reached
        next_stages = [s for s in self.stages if s > current_pct]
        return next_stages[0] if next_stages else current_pct

    def update_traffic_config(self, treatment_pct: float):
        """Update the load balancer traffic split configuration."""
        config = {
            "production_model_pct": 1.0 - treatment_pct,
            "canary_model_pct": treatment_pct,
        }
        # In practice: update feature flag in LaunchDarkly, Flagsmith, etc.
        print(f"Updated traffic: {(1-treatment_pct):.0%} production, {treatment_pct:.0%} canary")

This pattern - shadow mode → automated canary stages with metric gates → automatic rollback - is a widely used approach for ML deployment at companies processing high-stakes transactions.
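
The metric gates need a `metrics` dictionary as input. As a minimal sketch, assuming raw counters are available from the canary's monitoring window, the three guardrail values could be derived as follows. The helper names `p99` and `canary_metrics` and the sample numbers are hypothetical; the dictionary keys match the ones checked by the guardrails above.

```python
def p99(latencies_ms):
    """Nearest-rank 99th percentile, using integer math to avoid float rounding."""
    ordered = sorted(latencies_ms)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n)
    return ordered[rank - 1]

def canary_metrics(latencies_ms, requests, errors, canary_blocks, baseline_block_rate):
    """Build the guardrail inputs from one monitoring window of canary traffic."""
    block_rate = canary_blocks / requests
    return {
        "p99_latency_ms": p99(latencies_ms),
        "error_rate": errors / requests,
        "block_rate_relative_change": (block_rate - baseline_block_rate) / baseline_block_rate,
    }

# Illustrative window: a latency sample plus request-level counters.
metrics = canary_metrics(
    latencies_ms=list(range(1, 101)),
    requests=1_000,
    errors=2,
    canary_blocks=23,
    baseline_block_rate=0.02,
)
print(metrics)
```

At the 1% stage the canary sees few requests, so short windows give noisy error and block rates; a longer window or a minimum request count before evaluating the gates reduces false rollbacks.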

Summary - The Deployment Mental Model

Every ML deployment decision maps to one question: what is the blast radius if this goes wrong, and how do we reduce it to acceptable levels before each step?

Shadow mode: zero blast radius. Canary at 1%: 1% blast radius. Blue-green: 100% blast radius but instant rollback. Choose the pattern based on how confident you are, how costly a failure is, and how fast you need to move.
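
Blast radius can be put in numbers. The sketch below uses the 10,000 transactions per second from the walkthrough and assumes, purely for illustration, that a bad model is detected and rolled back within five minutes; the function name and the detection window are hypothetical.

```python
def exposed_transactions(tps: int, traffic_fraction: float, seconds_to_detect: int) -> int:
    """Transactions decided by the new model before a problem is caught."""
    return round(tps * traffic_fraction * seconds_to_detect)

TPS = 10_000
DETECTION_WINDOW_S = 300  # assumed time to detect and roll back

for pattern, fraction in [("shadow mode", 0.0), ("canary at 1%", 0.01), ("blue-green", 1.0)]:
    exposed = exposed_transactions(TPS, fraction, DETECTION_WINDOW_S)
    print(f"{pattern:>12}: {exposed:>9,} transactions exposed")
```

Under these assumptions a 1% canary limits exposure to 30,000 transactions, against 3,000,000 for a full cutover, which is the trade-off the summary describes.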

