
The MLOps Lifecycle

The SSH Deployment

Sofia had spent four months building the churn prediction model. Gradient boosted trees, careful feature engineering, 0.87 AUC on the holdout set. Her manager loved the demo. The business expected to save $2M annually from targeted retention campaigns.

The deployment plan was simple: she packaged the model as a Flask API, SSH'd into the production server, copied the files over, and set up a cron job to retrain every Sunday. It worked. The API responded in milliseconds. The first retention campaigns ran. Everyone moved on to the next project.

Six months later, a data analyst noticed something odd: the churn rate in the retained-customer cohort was actually higher than in the control group. The model was doing worse than random. Someone filed a support ticket. Sofia was pulled off her current project to investigate.

It took three days to find the root cause. The upstream data pipeline had changed its encoding for a categorical variable - "mobile" became "MOBILE" - four months prior. The model silently received an unknown category, defaulted to zero, and kept serving predictions that were subtly, then dramatically, wrong. There were no alerts. No monitoring. No way to know. The model had been broken for four months, and nobody knew until the business results proved it.

This story is not unusual. It is, in fact, the default outcome when you treat ML deployment as a software deployment problem. It is the exact problem MLOps was invented to prevent.



Why This Exists

Software engineering has solved deployment. You write code, test it, deploy it. If it breaks, the error is deterministic - the same input produces the same wrong output every time. You can reproduce the failure, fix it, deploy the fix.

Machine learning breaks differently. A model is not just code - it is code plus data plus the statistical relationship the training process extracted from that data. When any of those three things change - and in production, they always eventually change - the model degrades. Silently. Gradually. In ways that require statistical reasoning to detect.

The standard software deployment toolkit - CI/CD, unit tests, integration tests, health checks - catches zero of these failures. A model can pass every software test and still be completely wrong because the world shifted and nobody retrained it.

MLOps emerged from teams who had hit this wall enough times. Google published the canonical paper on the problem in 2015 (Sculley et al., "Hidden Technical Debt in Machine Learning Systems"). The core argument: the ML model is a tiny fraction of a real-world ML system, and the surrounding infrastructure - data pipelines, feature computation, serving code, monitoring, retraining loops - is where most of the complexity and failure lives.


Historical Context

The term "MLOps" gained currency around 2018–2019, roughly when teams at large tech companies started publishing about their internal ML platforms: TFX at Google, FBLearner at Facebook, Michelangelo at Uber. But the underlying problem had been recognized earlier.

The Sculley et al. paper (2015) introduced the concept of CACE - Changing Anything Changes Everything - as the core challenge of production ML. Every component of an ML system interacts with every other: changing the input features changes the model, changing the model changes the monitoring thresholds, changing the monitoring thresholds changes the alerts, and so on. The blast radius of any change is hard to predict.

By 2020, Continuous Delivery for Machine Learning (CD4ML) had formalized a set of practices borrowed from DevOps and adapted for the ML case. MLOps as a discipline was crystallizing around the same set of core ideas that this lesson covers.


The Nine Components of Production ML

A production ML system is not a model. It is a system of systems. Google's paper showed that the model sits inside a much larger set of supporting components. The nine-part decomposition below follows that view, and understanding it is the first step to building a mature ML deployment.

1. Data Ingestion and Validation

Raw data arrives from upstream sources - databases, event streams, third-party APIs. Data validation checks that the schema, types, and statistical distributions match expectations before any training begins. Without this, bad data silently trains bad models.
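A minimal sketch of the kind of check described above, using only the standard library. The column names, the `EXPECTED_SCHEMA` mapping, and the `validate_rows` helper are hypothetical, chosen to mirror Sofia's case; a real pipeline would more likely rely on a dedicated validation library.

```python
# Hypothetical schema: column -> (expected Python type, allowed categories or None)
EXPECTED_SCHEMA = {
    "customer_id": (int, None),
    "device_type": (str, {"mobile", "desktop", "tablet"}),
    "monthly_spend": (float, None),
}


def validate_rows(rows: list, schema: dict = EXPECTED_SCHEMA) -> list:
    """Return human-readable violations; an empty list means the batch passed."""
    errors = []
    for i, row in enumerate(rows):
        for column, (expected_type, vocabulary) in schema.items():
            if column not in row:
                errors.append(f"row {i}: missing column '{column}'")
                continue
            value = row[column]
            if not isinstance(value, expected_type):
                errors.append(
                    f"row {i}: '{column}' is {type(value).__name__}, "
                    f"expected {expected_type.__name__}"
                )
            elif vocabulary is not None and value not in vocabulary:
                # This is the check that would have caught "mobile" -> "MOBILE"
                errors.append(f"row {i}: '{column}' has unknown category {value!r}")
    return errors


batch = [
    {"customer_id": 1, "device_type": "mobile", "monthly_spend": 42.0},
    {"customer_id": 2, "device_type": "MOBILE", "monthly_spend": 17.5},
]
violations = validate_rows(batch)
for message in violations:
    print(message)
```

Failing the batch loudly at this point, before training or scoring, turns a silent degradation into an immediate, debuggable error.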

2. Feature Engineering and Feature Store

Raw data is transformed into model inputs. In production systems, this computation must happen identically at training time and serving time. A feature store centralizes this computation so both paths use the same code. Training-serving skew - the divergence between features computed at training vs serving time - is one of the most common production failure modes.
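The idea of a single feature code path can be sketched without any feature store product. In this hypothetical example, `compute_features` is the only place feature logic lives, and both the offline and online paths call it; the field names are illustrative assumptions.

```python
import math
from datetime import date


def compute_features(raw: dict, as_of: date) -> dict:
    """Single source of truth for feature logic, shared by training and serving."""
    return {
        "days_since_signup": (as_of - raw["signup_date"]).days,
        "log_monthly_spend": math.log1p(raw["monthly_spend"]),
        "device_type": raw["device_type"].strip().lower(),
    }


def build_training_rows(history: list, snapshot_date: date) -> list:
    """Offline path: features computed as of the training snapshot."""
    return [compute_features(record, snapshot_date) for record in history]


def build_serving_row(request: dict, today: date) -> dict:
    """Online path: the same function, evaluated at request time."""
    return compute_features(request, today)


record = {
    "signup_date": date(2024, 1, 1),
    "monthly_spend": 99.0,
    "device_type": " MOBILE ",
}
offline = build_training_rows([record], date(2024, 3, 1))[0]
online = build_serving_row(record, date(2024, 3, 1))
print(offline == online)  # True: both paths ran identical code
```

A feature store adds storage, point-in-time lookups, and low-latency retrieval on top of this, but the core guarantee is the one shown here: one definition, two callers.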

3. Training Pipeline

The code and orchestration that reads features, trains a model, and produces a model artifact. In mature systems, training is automated, parameterized, and reproducible. The same training pipeline runs on every retraining cycle.

4. Model Evaluation and Testing

Beyond accuracy on a holdout set: behavioral tests (does the model degrade on known-hard subgroups?), slice-based evaluation (does performance hold across demographic segments?), invariance tests (does the output change when it shouldn't?). Shadow deployment and A/B testing belong here too.
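Slice-based evaluation can be sketched in a few lines. The helper names, the `max_gap` threshold, and the toy labels below are assumptions for illustration; accuracy stands in for whatever metric the team actually gates on.

```python
from collections import defaultdict


def accuracy_by_slice(y_true, y_pred, slices) -> dict:
    """Accuracy computed separately for each slice label."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for truth, pred, slice_name in zip(y_true, y_pred, slices):
        totals[slice_name] += 1
        hits[slice_name] += int(truth == pred)
    return {name: hits[name] / totals[name] for name in totals}


def failing_slices(per_slice: dict, overall: float, max_gap: float = 0.10) -> list:
    """Slices whose accuracy trails the overall accuracy by more than max_gap."""
    return [name for name, acc in per_slice.items() if overall - acc > max_gap]


y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0, 1, 0, 1]
slices = ["web"] * 4 + ["mobile"] * 4

overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
per_slice = accuracy_by_slice(y_true, y_pred, slices)
print(overall, per_slice, failing_slices(per_slice, overall))
```

Here the overall accuracy looks mediocre but not alarming, while the per-slice view shows that one segment is carrying nearly all of the errors.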

5. Model Registry and Versioning

A central catalog of trained model artifacts with metadata: who trained it, on what data, with what hyperparameters, with what evaluation metrics. The registry is the handoff point between training and serving, and the audit trail for compliance.
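To make the metadata concrete, here is a toy in-memory registry. `ModelRecord` and `InMemoryRegistry` are hypothetical names, and the commit IDs and byte strings are placeholders; a real registry would persist records and store the artifacts themselves.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone


def sha256_of_bytes(payload: bytes) -> str:
    """Content hash used to pin the exact data snapshot and model artifact."""
    return hashlib.sha256(payload).hexdigest()


@dataclass(frozen=True)
class ModelRecord:
    name: str
    version: int
    trained_by: str
    code_commit: str
    data_sha256: str
    artifact_sha256: str
    hyperparameters: dict
    metrics: dict
    registered_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


class InMemoryRegistry:
    """Toy registry: records are immutable once registered."""

    def __init__(self) -> None:
        self._records = {}

    def register(self, record: ModelRecord) -> None:
        key = (record.name, record.version)
        if key in self._records:
            raise ValueError(f"{record.name} v{record.version} already registered")
        self._records[key] = record

    def latest(self, name: str) -> ModelRecord:
        candidates = [r for (n, _), r in self._records.items() if n == name]
        return max(candidates, key=lambda r: r.version)


registry = InMemoryRegistry()
for version, auc in [(1, 0.87), (2, 0.88)]:
    registry.register(ModelRecord(
        name="churn",
        version=version,
        trained_by="sofia",
        code_commit=f"placeholder-commit-{version}",
        data_sha256=sha256_of_bytes(f"training snapshot {version}".encode()),
        artifact_sha256=sha256_of_bytes(f"model bytes {version}".encode()),
        hyperparameters={"max_depth": 6},
        metrics={"auc": auc},
    ))

current = registry.latest("churn")
print(current.version, current.code_commit, current.metrics)
```

Refusing to overwrite an existing version is the property that makes the registry usable as an audit trail.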

6. Serving Infrastructure

How predictions reach users: REST APIs, gRPC services, batch scoring pipelines, edge inference. Serving has its own engineering concerns - latency, throughput, model loading time, versioning for rollback.

7. Monitoring and Alerting

Tracking model health in production: prediction distribution drift, feature distribution drift, business metric correlation, data freshness. Monitoring is the mechanism by which Sofia's encoding failure would have been caught within a day instead of four months later.

8. Retraining Trigger

The logic that decides when to retrain: schedule-based (every Sunday), drift-triggered (when input distribution shifts beyond a threshold), performance-triggered (when business metric drops). Fully automated retraining is the hallmark of high-maturity MLOps.
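The three trigger types can be combined in one small decision function. This is a sketch under assumed thresholds: `RetrainPolicy`, its default values, and the priority order (drift, then performance, then schedule) are illustrative choices, not a standard.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class RetrainPolicy:
    max_model_age: timedelta = timedelta(days=7)  # schedule-based trigger
    psi_threshold: float = 0.2                    # drift-based trigger
    min_auc: float = 0.80                         # performance-based trigger


def retrain_reason(
    policy: RetrainPolicy,
    trained_at: datetime,
    now: datetime,
    max_feature_psi: float,
    live_auc: Optional[float],
) -> Optional[str]:
    """Return why a retrain should start, or None if the model can keep serving."""
    if max_feature_psi >= policy.psi_threshold:
        return "drift"
    # live_auc is None while labels have not yet arrived
    if live_auc is not None and live_auc < policy.min_auc:
        return "performance"
    if now - trained_at >= policy.max_model_age:
        return "schedule"
    return None


policy = RetrainPolicy()
trained_at = datetime(2024, 1, 1)
one_day_later = trained_at + timedelta(days=1)

print(retrain_reason(policy, trained_at, one_day_later, 0.05, 0.86))
print(retrain_reason(policy, trained_at, one_day_later, 0.31, 0.86))
```

Returning the reason, not just a boolean, makes the trigger auditable: the retraining run can record why it started.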

9. Governance and Compliance

Model cards, audit trails, bias evaluation, regulatory documentation. For healthcare, finance, and other regulated domains, this is not optional.


MLOps Maturity Levels

Teams do not build all nine components at once. The MLOps community has converged on a maturity framework - four levels, each representing a different degree of automation and system sophistication.

Level 0 - Manual, Script-Based ML (Most Teams Start Here)

What it looks like: Data scientists work in notebooks. Training is a manual process. Model deployment means copying a pickle file to a server. Retraining happens when someone remembers to run a script. No monitoring. No reproducibility.

Who lives here: Most startups. Early-stage ML projects. Teams where ML is a side project.

The failure mode: Sofia's story. Models degrade silently. Experiments can't be reproduced. Nobody knows which model is running in production.

Level 1 - ML Pipeline Automation

What it looks like: Training is codified as a pipeline that can be triggered on demand. Experiments are tracked (MLflow, W&B). Data is versioned (DVC). Models are registered. Basic monitoring exists.

What changed: Training is no longer a person running a notebook. It is a reproducible, parameterized process.

Who lives here: Mature data science teams. Companies where ML has proven value and is investing in infrastructure.

The failure mode: The pipelines are automated, but deployment is still manual and slow. Serving infrastructure is fragile.

Level 2 - CI/CD for ML

What it looks like: Every change to model code or pipeline code triggers an automated test and evaluation. Model deployment is automated with gates - the new model must beat the current champion on defined metrics before it goes live. Rolling deployments, canary releases.
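A deployment gate of this kind can be sketched as a pure function over evaluation metrics. The metric names, `min_gain`, and the guardrail tolerances are hypothetical, and the sketch assumes every metric is higher-is-better.

```python
from typing import Optional


def promotion_decision(
    champion: dict,
    challenger: dict,
    primary: str = "auc",
    min_gain: float = 0.005,
    max_regression: Optional[dict] = None,
):
    """Return (promote, reasons); reasons lists every gate the challenger failed."""
    reasons = []

    # Gate 1: the challenger must beat the champion on the primary metric
    gain = challenger[primary] - champion[primary]
    if gain < min_gain:
        reasons.append(f"{primary} gain {gain:.4f} is below required {min_gain}")

    # Gate 2: guardrail metrics may not regress by more than their tolerance
    for metric, allowed_drop in (max_regression or {}).items():
        drop = champion[metric] - challenger[metric]
        if drop > allowed_drop:
            reasons.append(f"{metric} dropped {drop:.4f}, allowed {allowed_drop}")

    return (not reasons, reasons)


champion = {"auc": 0.870, "recall_top_decile": 0.62}
challenger = {"auc": 0.881, "recall_top_decile": 0.57}

promote, reasons = promotion_decision(
    champion, challenger, max_regression={"recall_top_decile": 0.02}
)
print(promote, reasons)
```

In this example the challenger wins on AUC but is still blocked, because a guardrail metric regressed more than the tolerance allows.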

What changed: The model lifecycle has the same CI/CD discipline as software. Humans approve deployments but don't execute them manually.

Who lives here: Large tech companies with mature ML platforms. Netflix, Spotify, DoorDash.

Level 3 - Continuous Training and Full Automation

What it looks like: Retraining is triggered automatically by data drift or business metric change. The entire pipeline from data arrival to serving update runs without human intervention. A/B testing and champion/challenger evaluation are automated.

What changed: The system maintains itself. Engineers are building and monitoring the automation, not running it.

Who lives here: Google, Meta, Amazon, a handful of ML-native companies.

:::tip Where to Start

Most teams should target Level 1 first. Getting to Level 2 before you understand why Level 1 matters leads to over-engineered systems that are abandoned. The goal is not the highest level - it is the level appropriate to your team's size, ML maturity, and business stakes.

:::


ML vs Software Deployment: The Real Differences

This is the table every ML engineer should internalize. These differences explain why you cannot simply apply DevOps practices to ML without adaptation.

| Dimension | Software Deployment | ML Deployment |
| --- | --- | --- |
| What you deploy | Code | Code + Data + Model artifact |
| What causes failure | Bugs | Data drift, model decay, schema changes, distribution shift |
| Failure mode | Deterministic error | Gradual statistical degradation |
| Testing | Unit / integration tests catch failures | Tests catch code bugs, not model quality degradation |
| Reproducibility | Same code = same behavior | Same code + same data + same seed = same model (usually) |
| Versioning | Code versioning is sufficient | Need to version code + data + model + environment |
| Rollback | Redeploy previous code version | May require retraining on previous data |
| Monitoring | Uptime, error rate, latency | Plus: prediction drift, feature drift, business metrics |
| Feedback loop | Bug reports are fast | Model degradation may take weeks to surface in business metrics |

The Hidden Technical Debt in ML Systems

Sculley et al. (2015) is the foundational paper of production ML. The key insight is that the actual ML code is a small island in a sea of supporting infrastructure.

The paper identifies several specific forms of technical debt unique to ML:

Entanglement: Changing anything changes everything (CACE). Adding a feature changes the model behavior unpredictably. Removing a feature that seemed unimportant can cause cascading degradation because the model learned unexpected correlations.

Undeclared consumers: Other systems start depending on the model's output format or value distribution without telling you. When you change the model, their systems silently break.

Data dependencies: Data dependencies are harder to detect and remove than code dependencies, which compilers and static analysis can trace. A model trained on a data source that gets deprecated will fail silently.

Feedback loops: Models influence the data they are later trained on. A recommendation model that surfaces content affects what users click, which affects the next training set. This can create runaway feedback loops - filter bubbles, popularity bias - that are invisible from the model's evaluation metrics.

Configuration debt: ML systems have a huge configuration surface - hyperparameters, data paths, evaluation thresholds, feature names. Poorly managed configuration makes systems brittle and unreproducible.


Model Decay: Why Good Models Go Bad

Even if everything else is perfect - pipelines are automated, experiments are tracked, monitoring is in place - your model will eventually be wrong. This is not a bug. It is the nature of learning from data about a world that changes.

Types of Drift

Data drift (covariate shift): The statistical distribution of input features changes. The model was trained on data from one distribution but receives inputs from another. Example: a fraud detection model trained on pre-COVID transaction patterns sees completely different patterns post-COVID.

Concept drift: The relationship between inputs and outputs changes. What constituted a fraudulent transaction in 2019 is different from 2023 because fraudsters adapt. The features are the same but the correct label for a given input has changed. This is distinct from label drift (prior probability shift), where only the overall frequency of each label changes.

Upstream schema change: The format, encoding, or semantics of a data field changes. Sofia's case: "mobile" became "MOBILE". The feature pipeline did not catch it.

Feature staleness: A feature that was a strong signal at training time becomes irrelevant because the underlying business reality changed. A "days since last purchase" feature for a retail model becomes meaningless during a supply chain disruption.

:::warning The Silent Degradation Problem

Model decay is dangerous specifically because it is gradual and silent. Unlike an API returning a 500 error, a model returning subtly wrong predictions does not trigger any alert by default. By the time business metrics reveal the problem, weeks or months of degraded service have occurred. This is the primary motivation for ML-specific monitoring.

:::


Code Example: Detecting Distribution Shift

Here is a simple implementation of detecting feature distribution drift using the Population Stability Index (PSI), a standard metric in production ML monitoring:

```python
import numpy as np
from scipy.stats import ks_2samp


def population_stability_index(
    reference: np.ndarray,
    production: np.ndarray,
    bins: int = 10,
    epsilon: float = 1e-6,
) -> float:
    """
    Compute Population Stability Index between reference and production distributions.

    PSI < 0.1        → No significant change
    0.1 <= PSI < 0.2 → Moderate change, investigate
    PSI >= 0.2       → Significant change, likely model degradation
    """
    # Equal-width bins spanning the combined range of both samples,
    # so no production value falls outside the histogram
    min_val = min(reference.min(), production.min())
    max_val = max(reference.max(), production.max())
    bin_edges = np.linspace(min_val, max_val, bins + 1)

    # Compute proportions in each bin
    ref_counts, _ = np.histogram(reference, bins=bin_edges)
    prod_counts, _ = np.histogram(production, bins=bin_edges)

    # epsilon keeps empty bins from producing log(0) or division by zero
    ref_pct = ref_counts / len(reference) + epsilon
    prod_pct = prod_counts / len(production) + epsilon

    # PSI formula: sum((actual - expected) * ln(actual / expected))
    psi = np.sum((prod_pct - ref_pct) * np.log(prod_pct / ref_pct))
    return float(psi)


def detect_feature_drift(
    training_features: dict[str, np.ndarray],
    serving_features: dict[str, np.ndarray],
    psi_threshold: float = 0.2,
    ks_alpha: float = 0.05,
) -> dict[str, dict]:
    """
    Detect drift across all features using PSI and KS test.
    Returns a dict of feature -> drift report.
    """
    report = {}

    for feature_name in training_features:
        if feature_name not in serving_features:
            # Same keys as the normal report so callers can index uniformly
            report[feature_name] = {
                "status": "MISSING_IN_PRODUCTION",
                "psi": None,
                "ks_stat": None,
                "ks_p_value": None,
            }
            continue

        ref = training_features[feature_name]
        prod = serving_features[feature_name]

        psi = population_stability_index(ref, prod)
        ks_stat, ks_p = ks_2samp(ref, prod)

        # Note: with large samples the KS test can flag very small shifts,
        # so ks_alpha acts as a sensitivity knob rather than a hard rule
        if psi >= psi_threshold or ks_p < ks_alpha:
            status = "DRIFT_DETECTED"
        elif psi >= 0.1:
            status = "MODERATE_DRIFT"
        else:
            status = "OK"

        report[feature_name] = {
            "status": status,
            "psi": round(psi, 4),
            "ks_stat": round(float(ks_stat), 4),
            "ks_p_value": round(float(ks_p), 4),
        }

    return report


# Example usage
if __name__ == "__main__":
    # Simulate training vs production distributions
    rng = np.random.default_rng(42)

    training = {
        "age": rng.normal(35, 10, 10000),
        "transaction_amount": rng.exponential(50, 10000),
        "days_since_login": rng.poisson(7, 10000).astype(float),
    }

    # Production: age shifted (covariate shift), others stable or nearly so
    production = {
        "age": rng.normal(42, 12, 5000),  # Distribution shifted
        "transaction_amount": rng.exponential(52, 5000),  # Nearly stable (small scale change)
        "days_since_login": rng.poisson(7, 5000).astype(float),  # Stable
    }

    drift_report = detect_feature_drift(training, production)

    # Print the full status so MODERATE_DRIFT is not reported as OK
    for feature, result in drift_report.items():
        print(f"[{result['status']}] {feature}: PSI={result['psi']}, KS p={result['ks_p_value']}")
```
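As a worked example of the PSI formula used in the code above, consider a hypothetical feature split into just two bins. The symbols are assumptions for this illustration: $q_i$ is the reference (training) proportion in bin $i$, $p_i$ is the production proportion, and $B$ is the number of bins.

```latex
\mathrm{PSI} = \sum_{i=1}^{B} (p_i - q_i)\,\ln\frac{p_i}{q_i}

% Hypothetical two-bin case: q = (0.5, 0.5), p = (0.7, 0.3)
\mathrm{PSI} = (0.7 - 0.5)\ln\frac{0.7}{0.5} + (0.3 - 0.5)\ln\frac{0.3}{0.5}
             \approx (0.2)(0.3365) + (-0.2)(-0.5108)
             \approx 0.0673 + 0.1022
             = 0.1695
```

Each term is non-negative, because $(p_i - q_i)$ and $\ln(p_i / q_i)$ always share a sign, so PSI can only grow as the two distributions move apart. The result here, about 0.17, falls in the 0.1 to 0.2 band: moderate change, worth investigating.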

Production Engineering Notes

Start with monitoring, not tooling: The most common mistake teams make entering Level 1 MLOps is buying or building a sophisticated training pipeline before setting up even basic prediction monitoring. Monitoring catches failures. Training automation speeds iteration. Monitoring is more urgent.

Version everything together: A model artifact is only meaningful alongside the exact code that produced it, the exact data it was trained on, and the exact environment it was trained in. If any of these are missing, the model artifact is an orphan you cannot reproduce or trust.

Automate the boring, gate the important: Automation should handle the mechanical steps (running training, computing metrics, packaging artifacts). Humans should approve production deployments. The automation makes the human decision informed and fast, not absent.

Model decay timescales vary dramatically: A fraud detection model may need retraining weekly. A demand forecasting model may be stable for months. Understanding your specific model's decay rate is a product of monitoring, not guesswork. Do not assume; measure.


Common Mistakes

:::danger Not Monitoring Predictions in Production

The single most dangerous mistake in production ML is treating model deployment as the finish line. A model that is not monitored is a liability, not an asset. Set up prediction distribution monitoring before you ship. Without it, you will be the last to know when your model breaks.

:::

:::danger Skipping Data Validation

Accepting data inputs without validation means your model silently handles malformed or schema-changed data by making wrong predictions. Data validation at pipeline entry is the first line of defense. Tools like Great Expectations and Pandera make this straightforward.

:::

:::warning Conflating Offline and Online Metrics

A model with 0.91 AUC is not necessarily better in production than one with 0.89 AUC. Offline metrics measure performance on a static holdout set. Online metrics measure actual business impact. The correlation between the two is often weaker than teams assume. Always validate with an online experiment before retiring the previous model.

:::

:::warning Over-Engineering Before Validating ML Value

Teams sometimes build Level 2 MLOps infrastructure for a model that should be a Level 0 proof of concept. If the business case for ML is not validated, elaborate automation is wasted. Match your MLOps maturity to your ML maturity.

:::


Interview Q&A

Q1: Explain the difference between MLOps maturity Level 1 and Level 2. What specifically changes between them?

Level 1 automates the ML training pipeline - training is reproducible, parameterized, and can be triggered on demand. But deployment is still a manual, human-driven process. A data scientist or ML engineer decides to deploy a model, runs the deployment script, and monitors the rollout themselves.

Level 2 adds CI/CD to the ML pipeline itself. Every change to pipeline code triggers automated testing and evaluation. Model deployment is automated with gates - the new model must outperform the current champion on defined metrics before it is automatically promoted. The key distinction is that Level 2 treats the ML pipeline as software that must pass tests before deployment, not just a script that humans run when they feel it's ready.

Q2: What is training-serving skew and why is it one of the most common production failures?

Training-serving skew is when the feature computation at training time and at serving time diverge. The model learned from one distribution of features, but in production it receives slightly different feature values - different enough to degrade performance, similar enough that tests don't catch it.

Common causes: using different code to compute features in the training pipeline versus the serving pipeline, using aggregate statistics (like mean, median) computed at different times, or applying preprocessing steps in a different order. The fix is to use a feature store with a single feature computation function called from both the training and serving path. If you compute a feature twice, you will eventually compute it differently.

Q3: What is the CACE principle from Sculley et al. and why does it make ML systems hard to maintain?

CACE stands for Changing Anything Changes Everything. In an ML system, every component interacts with every other component through the learned model. If you add a new feature, the model learns new correlations and its behavior on existing inputs changes. If you change the preprocessing of an existing feature, the model's weights that depended on that feature become wrong. If you change the training data, all of the above.

This makes ML systems brittle in a way that software systems are not. In software, you can change function A without affecting function B as long as the interface between them is unchanged. In ML, there is no clear interface - the model has learned implicit relationships across all inputs simultaneously. CACE is why version control, experiment tracking, and staged rollouts are more important in ML than in standard software.

Q4: What are three common types of drift, and how would you detect each?

Data drift (covariate shift) is when the input feature distribution changes. Detect it by comparing statistical properties of feature distributions between training and production windows - PSI, KS test, or Jensen-Shannon divergence. Alert when PSI exceeds 0.2 or KS p-value drops below 0.05.

Concept drift is when the relationship between inputs and outputs changes. Harder to detect without labels, which arrive with delay. Detect it by monitoring the correlation between model confidence and actual outcomes once labels arrive. If the model says 90% confidence but is correct only 70% of the time, concept drift is likely.

Upstream schema drift is when the format or semantics of an input field changes. Detect it with strict data validation at pipeline entry - schema checks for types, range checks for values, vocabulary checks for categorical features. This should alert immediately on schema mismatch before the model ever sees the data.
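The confidence-versus-outcome check described for concept drift can be sketched as a per-bin calibration table. `calibration_table` is a hypothetical helper and the data is synthetic. Note that a gap can also mean the model was miscalibrated from the start, so the useful comparison is against the same table computed at deployment time.

```python
import numpy as np


def calibration_table(predicted_prob, outcome, bins: int = 10) -> list:
    """Per-bin mean predicted probability versus observed positive rate."""
    predicted_prob = np.asarray(predicted_prob, dtype=float)
    outcome = np.asarray(outcome, dtype=float)
    edges = np.linspace(0.0, 1.0, bins + 1)
    # Using only the interior edges maps each probability to a bin index 0..bins-1
    bin_index = np.digitize(predicted_prob, edges[1:-1])

    rows = []
    for b in range(bins):
        mask = bin_index == b
        if not mask.any():
            continue  # skip bins that received no predictions
        mean_predicted = float(predicted_prob[mask].mean())
        observed_rate = float(outcome[mask].mean())
        rows.append({
            "n": int(mask.sum()),
            "mean_predicted": mean_predicted,
            "observed_rate": observed_rate,
            "gap": mean_predicted - observed_rate,
        })
    return rows


# Synthetic labelled window: the model says 0.92 but only 7 of 10 are positive
probs = [0.15] * 10 + [0.92] * 10
labels = [1] + [0] * 9 + [1] * 7 + [0] * 3

table = calibration_table(probs, labels)
for row in table:
    print(row)
```

A gap that widens from one labelled window to the next, while feature distributions stay put, is the signature this check is looking for.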

Q5: A team is starting their first production ML project. They have a trained model and want to deploy it. What would you recommend as a minimum viable MLOps setup?

I would recommend four things before anything else. First, set up prediction monitoring - log every prediction the model makes, then compute the prediction distribution daily and alert if it shifts significantly. This is your early warning system. Second, version the model artifact along with the code commit and data snapshot that produced it. Store this in a model registry so you know exactly what is running in production. Third, add data validation at pipeline entry - schema checks, range checks, null checks. Fourth, build a simple retraining script that can be triggered on demand, not just a notebook.

With these four in place you have a defensible baseline: you know what's deployed, you know when it's degrading, you can prevent bad data from reaching it, and you can update it when needed. Everything else - automated CI/CD, feature stores, automated retraining - comes after you have validated the ML value and understand the specific failure modes of your system.
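The first recommendation, logging every prediction and summarizing it daily, can be sketched with JSON lines. The record fields and helper names are assumptions, and an in-memory `io.StringIO` buffer stands in for a real log file or event stream.

```python
import io
import json
from collections import defaultdict
from datetime import datetime, timezone


def log_prediction(stream, model_version: str, features: dict, score: float, now=None) -> None:
    """Append one prediction as a JSON line, tagged with the model version."""
    record = {
        "ts": (now or datetime.now(timezone.utc)).isoformat(),
        "model_version": model_version,
        "features": features,
        "score": score,
    }
    stream.write(json.dumps(record) + "\n")


def daily_mean_scores(stream) -> dict:
    """Mean predicted score per calendar day, a cheap first drift signal."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for line in stream:
        record = json.loads(line)
        day = record["ts"][:10]  # the YYYY-MM-DD prefix of the ISO timestamp
        totals[day] += record["score"]
        counts[day] += 1
    return {day: totals[day] / counts[day] for day in counts}


log = io.StringIO()
log_prediction(log, "churn-v2", {"device_type": "mobile"}, 0.2,
               now=datetime(2024, 5, 1, 9, 0, tzinfo=timezone.utc))
log_prediction(log, "churn-v2", {"device_type": "desktop"}, 0.4,
               now=datetime(2024, 5, 1, 17, 30, tzinfo=timezone.utc))
log_prediction(log, "churn-v2", {"device_type": "MOBILE"}, 0.9,
               now=datetime(2024, 5, 2, 8, 15, tzinfo=timezone.utc))

log.seek(0)  # rewind the in-memory buffer before reading it back
daily = daily_mean_scores(log)
print(daily)
```

Logging the input features alongside the score is what later makes feature drift checks and root-cause analysis possible from the same records.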

Q6: Why is the ML model code typically a small fraction of the total production ML system code?

The Sculley et al. paper depicted the ML model code as a small box surrounded by much larger supporting infrastructure. The surrounding code includes: data ingestion and validation (reading from databases, streaming systems, APIs), feature engineering (potentially hundreds of transformations), training orchestration (pipeline scheduling, resource management, hyperparameter handling), model evaluation (multiple metrics, slice analysis, baseline comparison), model serving (API, batching, caching, versioning), and monitoring (feature drift, prediction drift, business metric tracking). Each of these is a non-trivial engineering effort. The model itself - usually a few hundred lines of PyTorch or scikit-learn - is the smallest piece. This is why ML projects should be scoped as systems projects, not model-building projects.

© 2026 EngineersOfAI. All rights reserved.