MLOps vs DevOps
Reading time: 30–35 min | Relevance: ML Engineer, MLOps Engineer, Platform Engineer, DevOps Engineer moving into ML
The Day the CI/CD Pipeline Lied
It was a Monday morning when the on-call message came through. A senior DevOps engineer at a mid-sized fintech company had been handed his first ML project several months earlier - a fraud detection model. He'd done what he always did: set up a clean CI/CD pipeline, written unit tests, added integration tests, wired up a staging environment, configured Prometheus metrics, and deployed to production with a blue-green rollout. The pipeline was beautiful. Every push to main ran 847 tests. 847 tests, all green.
The model had been flagging fraudulent transactions at 94% precision for the first two months. Then, silently, precision dropped to 71%. No alerts fired. No tests failed. The deployment pipeline showed green across the board. The model kept serving predictions confidently - just increasingly wrong ones. By the time a business analyst noticed the jump in fraud chargebacks, the model had been degraded for six weeks.
The DevOps engineer pulled up every dashboard he had. CPU usage: normal. Memory: normal. Request latency: normal. Error rate: 0.00%. The software was working perfectly. The model was broken. And nothing in his entire observability stack could tell the difference between those two states.
This is the precise moment where DevOps ends and MLOps begins. Not as a philosophical distinction, but as a hard engineering problem. The fraud model hadn't crashed - it had drifted. The patterns in transaction data had shifted as fraudsters adapted their behavior, and the model's learned decision boundary was now tracing the wrong line through feature space. There was no stack trace. No 500 error. No failed health check. Just a slowly worsening probability estimate, invisible to every tool built for deterministic software.
What the engineer needed - and didn't have - was a second class of observability entirely: one that watches not the behavior of the software but the quality of the model's outputs. He needed data drift detectors, prediction distribution monitors, business-metric correlation alerts, and automatic retraining triggers. He needed, in short, an MLOps platform. And none of that comes out of the DevOps toolbox.
Why This Exists
DevOps solved a real problem: the wall between software development and operations. Before DevOps, code was written in isolation, thrown over the wall to ops, and deployed manually. Deployments were rituals of fear. DevOps brought automation, shared ownership, and the idea that "you build it, you run it."
MLOps exists because machine learning introduced a third dimension that DevOps never accounted for: the behavior of a software artifact is not determined only by its code - it is also determined by the data it was trained on and the data it currently sees. A compiled binary does exactly what its code says, forever. A trained model does what its training data implied, until the world changes.
This single fact - that ML models are functions of both code and data - is the root cause of every divergence between MLOps and DevOps. Everything else follows from it.
Historical Context
DevOps as a movement crystallized around 2008–2010, driven by Patrick Debois (who organized the first DevOpsDays in 2009) and the Agile community's frustration with siloed operations. Gene Kim's The Phoenix Project (2013) later popularized the ideas. Over the following years, tools like Jenkins, Docker, and later Kubernetes made CI/CD pipelines a standard part of software engineering.
Machine learning teams existed inside these companies too, but they operated almost entirely outside the DevOps culture. Data scientists worked in notebooks. Models were trained once, pickled, and handed to engineers to deploy manually. There was no versioning, no automated retraining, no monitoring beyond "is the endpoint up."
The term "MLOps" began appearing in earnest around 2017–2018, driven by Google's landmark paper "Hidden Technical Debt in Machine Learning Systems" (Sculley et al., 2015) and the publication of Google's TFX (TensorFlow Extended) paper in 2017. By 2019, analysts such as Gartner were writing about it as a discipline. Kubeflow, MLflow, and Metaflow were all open-sourced between late 2017 and 2019, each trying to bring DevOps-style reproducibility to the ML workflow.
The core insight that drove MLOps as a field: you can version your code perfectly and still have no idea what your model will do next month. A model is not software. It is a compressed statistical summary of a dataset. And datasets change.
The Fundamental Difference: Deterministic vs Probabilistic
In traditional software engineering, correctness is binary. A function either returns the right answer or it doesn't. You write a test:
def test_discount_calculation():
    assert calculate_discount(price=100, pct=20) == 80.0
If this test passes today, it will pass in six months. The logic is deterministic. Code does not degrade over time by itself.
A machine learning model has no such guarantee:
from sklearn.metrics import precision_score

def test_fraud_model_precision():
    # `model` and `january_test_set` are placeholders for a trained model
    # and a labeled test set collected in January
    predictions = model.predict(january_test_set)
    precision = precision_score(january_test_set.labels, predictions)
    assert precision > 0.90  # ✅ passes today

    # But what about the June distribution?
    # The test cannot tell you.
The test passes because it evaluates the model against a historical snapshot of data. But in production, the model sees a live stream of data that evolves continuously. Fraud patterns change. User behavior shifts. Economic conditions alter transaction amounts. The model's training distribution and its serving distribution slowly diverge - and no unit test catches this, because the test dataset doesn't change.
This is the root of the "tests pass, quality degrades" problem. The entire testing philosophy of DevOps assumes a closed, deterministic system. ML breaks that assumption by design.
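A toy illustration of that gap, using invented transactions and a hypothetical one-rule "model" (flag anything above $500). Nothing here comes from a real system; it only shows how a frozen January test can keep passing while precision on June traffic collapses:

```python
# Toy data: (amount, is_fraud). All values are invented for illustration.
january = [(900, True), (750, True), (620, True), (580, True), (40, False),
           (120, False), (300, False), (15, False), (610, False), (820, True)]
# By June, fraud has moved to small amounts; large purchases are mostly legitimate.
june = [(900, False), (750, False), (620, False), (580, True), (40, True),
        (120, True), (300, False), (15, True), (610, False), (820, False)]


def predict(amount: float) -> bool:
    """Hypothetical 'trained model': the rule it learned from January data."""
    return amount > 500


def precision(rows) -> float:
    """Fraction of flagged transactions that were actually fraud."""
    flagged = [is_fraud for amount, is_fraud in rows if predict(amount)]
    return sum(flagged) / len(flagged) if flagged else 0.0


jan_precision = precision(january)
june_precision = precision(june)

# The frozen test set still passes, so CI stays green...
assert jan_precision > 0.80
# ...while precision on the live distribution tells a different story.
print(f"January: {jan_precision:.2f}, June: {june_precision:.2f}")
# prints "January: 0.83, June: 0.17"
```

The model and the test are byte-for-byte identical in both months; only the data changed.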
The Three Extra Loops in MLOps
A DevOps pipeline has one loop: code → test → build → deploy → monitor. MLOps has three additional loops that must run alongside this:
Loop 1: The Data Loop
Before any model training can begin, data must be collected, validated, and versioned. This loop has no equivalent in DevOps.
Data Sources → Ingestion → Validation → Versioning → Feature Store
     ↑                                                 |
     |←←←← drift detected, re-label, augment ←←←←←←←←←←|
The data loop includes:
- Schema validation: does the incoming data match expected types and ranges?
- Statistical validation: has the distribution of feature X drifted significantly from last week?
- Data versioning: DVC, Delta Lake, or Iceberg snapshots that tie a training run to an exact data state
- Labeling pipelines: for supervised learning, labels must be managed, audited, and updated
# Great Expectations - data validation in the data loop
# Note: ge.read_csv belongs to the legacy pandas-dataset API of older releases;
# newer releases use a data-context-based workflow instead.
import great_expectations as ge

df = ge.read_csv("transactions.csv")
result = df.expect_column_values_to_be_between(
    column="transaction_amount",
    min_value=0.01,
    max_value=50000.0
)
print(result["success"])  # False if out-of-range values detected
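Data versioning from the list above does not have to start with a dedicated tool. As a minimal sketch (the function name `dataset_fingerprint` and the sample rows are hypothetical), a content hash stored next to the model artifact already ties a training run to an exact data state:

```python
import hashlib
import json


def dataset_fingerprint(rows: list[dict]) -> str:
    """Content hash of a dataset: identical rows in identical order give an identical hash."""
    digest = hashlib.sha256()
    for row in rows:
        # sort_keys makes the hash independent of dict key order
        digest.update(json.dumps(row, sort_keys=True).encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


# Invented sample rows, for illustration only.
rows_v1 = [{"transaction_amount": 42.5, "merchant": "m1"},
           {"transaction_amount": 980.0, "merchant": "m2"}]
rows_v2 = rows_v1 + [{"transaction_amount": 12.0, "merchant": "m3"}]

# Store this value alongside the model artifact (for example as a run tag).
training_record = {"model": "fraud_model",
                   "data_sha256": dataset_fingerprint(rows_v1)}

print(dataset_fingerprint(rows_v1) == dataset_fingerprint(rows_v2))  # False
```

This hash is order-sensitive by design; sort the rows first if row order should not count as a change. Tools like DVC do the same job at scale, with storage and lineage on top.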
Loop 2: The Model Loop
After a model is trained, it enters a continuous evaluation-and-retraining loop. This is not a one-time event - it is a recurring operational process.
New Data → Retrain Trigger → Training → Evaluation Gate → Shadow Deploy → Promote
                 ↑                             |
                 |←←← fail gate, investigate ←←|
Retraining triggers fall into three categories:
- Scheduled: retrain every Sunday at 2 AM regardless of performance
- Performance-triggered: retrain when precision drops below threshold X
- Data-triggered: retrain when KL divergence of input features exceeds threshold Y
# Example: performance- and drift-triggered retraining logic
def should_retrain(model_metrics: dict, thresholds: dict) -> bool:
    """
    Returns True if current model performance has degraded
    below acceptable thresholds, or if input drift (PSI) is too high.
    """
    if model_metrics["precision"] < thresholds["min_precision"]:
        return True
    if model_metrics["f1"] < thresholds["min_f1"]:
        return True
    if model_metrics["psi"] > thresholds["max_psi"]:
        # Rule of thumb: Population Stability Index > 0.2 indicates significant drift
        return True
    return False

# Scheduled check (e.g., in Airflow or Prefect).
# get_model_metrics_last_7_days and trigger_retraining_pipeline are
# placeholders for your own monitoring and orchestration hooks.
metrics = get_model_metrics_last_7_days()
if should_retrain(metrics, thresholds={"min_precision": 0.88, "min_f1": 0.85, "max_psi": 0.2}):
    trigger_retraining_pipeline()
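The `psi` value used in that check has to come from somewhere. A minimal sketch of one common formulation of the Population Stability Index, assuming hand-picked, sorted bin edges and invented sample values (production code would more likely bin by reference quantiles):

```python
import math


def population_stability_index(reference, production, bin_edges, eps=1e-6):
    """PSI = sum((p_i - r_i) * ln(p_i / r_i)) over shared bins."""

    def bin_fractions(values):
        counts = [0] * (len(bin_edges) + 1)
        for v in values:
            # the number of edges at or below v is the bin index
            idx = sum(v >= edge for edge in bin_edges)
            counts[idx] += 1
        total = len(values)
        # eps avoids log(0) and division by zero for empty bins
        return [max(c / total, eps) for c in counts]

    ref = bin_fractions(reference)
    prod = bin_fractions(production)
    return sum((p - r) * math.log(p / r) for r, p in zip(ref, prod))


# Invented transaction amounts, for illustration only.
edges = [100, 500, 1000]
reference = [50] * 50 + [300] * 30 + [700] * 15 + [2000] * 5
shifted = [50] * 20 + [300] * 30 + [700] * 30 + [2000] * 20

psi_same = population_stability_index(reference, reference, edges)
psi_shifted = population_stability_index(reference, shifted, edges)
print(round(psi_same, 4), round(psi_shifted, 4))  # 0.0 0.5868
```

Identical distributions give a PSI of zero; the shifted sample lands well above the 0.2 threshold used in the retraining check.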
Loop 3: The Experiment Loop
Before a model version is even trained for production, data scientists run experiments: hyperparameter searches, architecture comparisons, feature ablations. This loop is pure research-and-development, but it must feed into the production pipeline in a controlled, reproducible way.
Hypothesis → Experiment → Track → Compare → Best Config → Production Pipeline
    ↑                                |
    |←←←←←←←←←←←←←←←←←←←←←←←←←←←←←←←←| (iterate)
DevOps has no concept of an experiment loop. Software engineers don't run 50 variants of a function to find the best configuration - they write code that does what the spec requires. ML engineers must explore empirically, and the infrastructure must support this.
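A bare-bones sketch of the loop, with a made-up scoring function standing in for a real training run. The names `train_and_evaluate` and `search_space` are hypothetical; an experiment tracker such as MLflow or W&B would replace the plain `runs` list:

```python
import itertools


def train_and_evaluate(learning_rate: float, max_depth: int) -> float:
    """Stand-in for a real training run; returns a made-up validation score."""
    return 0.90 - abs(learning_rate - 0.01) * 2 - abs(max_depth - 6) * 0.01


search_space = {"learning_rate": [0.001, 0.01, 0.1], "max_depth": [4, 6, 8]}

runs = []
for lr, depth in itertools.product(*search_space.values()):
    score = train_and_evaluate(lr, depth)
    # Track every run, not just the winner, so comparisons stay reproducible.
    runs.append({"params": {"learning_rate": lr, "max_depth": depth},
                 "val_score": score})

best = max(runs, key=lambda run: run["val_score"])
print(len(runs), best["params"])
# prints "9 {'learning_rate': 0.01, 'max_depth': 6}"
```

The point is the shape, not the search strategy: every variant is recorded with its parameters and score, and only the winning configuration moves on to the production pipeline.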
Where MLOps and DevOps Overlap
Despite the differences, MLOps inherits substantial machinery from DevOps. These concepts carry over largely intact:
Version Control: Git for code is unchanged. MLOps adds DVC (Data Version Control) for datasets and MLflow for model artifacts, but the underlying versioning philosophy is identical.
Containerization: Docker containers are used identically. A model-serving container is just a container. Kubernetes orchestration applies unchanged.
CI/CD Automation: The idea of automated pipelines triggered by code changes applies directly. The pipeline contents change (you run training jobs, not just builds), but the automation infrastructure - GitLab CI, GitHub Actions, Jenkins - is the same.
Infrastructure as Code: Terraform, Helm charts, and configuration management tools apply without modification.
Logging and Distributed Tracing: Application logs from ML serving endpoints use the same ELK/Loki stacks as any other service.
Where MLOps Diverges - In Detail
Artifacts Are Fundamentally Different
In DevOps, the artifact is a binary or container image. It is deterministic. Given the same source code and the same build environment, you always get the same binary.
In MLOps, the artifact is a trained model. It is a function of:
- Source code (training script)
- Data (which dataset, which version, which split)
- Random seeds (initialization, dropout, data shuffle)
- Compute environment (floating-point precision differences across hardware)
A model artifact is not reproducible unless you explicitly control all four of these variables. This is why MLflow, W&B, and similar tools are built to log far more than the model weights: hyperparameters, data hashes, library versions, and the environment the run executed in.
# Illustrative MLflow run record. Autologging captures params, metrics, and the
# model artifact; tags such as data_version must be set explicitly.
mlflow_run:
  run_id: "8f3a21b4c9d7"
  params:
    learning_rate: 0.001
    batch_size: 256
    n_estimators: 200
    max_depth: 8
    random_seed: 42
  metrics:
    train_accuracy: 0.9612
    val_accuracy: 0.9314
    val_precision: 0.9201
    val_recall: 0.9407
  tags:
    data_version: "transactions_v3.2_2024-01-15"
    git_commit: "a7b3d9f"
    model_type: "XGBoostClassifier"
  artifacts:
    - model.pkl
    - feature_importance.png
    - confusion_matrix.json
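To see why the random seed belongs in that record, here is a deliberately tiny sketch: a one-weight model fitted with SGD, where the seed controls both initialization and shuffle order. The function `train_tiny_model` and the data points are invented for illustration:

```python
import random


def train_tiny_model(data, seed: int, epochs: int = 20, lr: float = 0.05) -> float:
    """Fit y ≈ w * x with SGD; the seed controls init and shuffle order."""
    rng = random.Random(seed)
    w = rng.uniform(-1.0, 1.0)          # random initialization
    samples = list(data)
    for _ in range(epochs):
        rng.shuffle(samples)            # random data order
        for x, y in samples:
            w -= lr * 2 * (w * x - y) * x   # gradient step on squared error
    return w


# Invented (x, y) pairs, roughly y = 2x with noise.
data = [(0.5, 1.0), (1.0, 2.1), (1.5, 2.9), (2.0, 4.2)]

w_a = train_tiny_model(data, seed=42)
w_b = train_tiny_model(data, seed=42)
w_c = train_tiny_model(data, seed=7)

print(w_a == w_b)  # True: same code, data, and seed give the same weight
print(w_a, w_c)    # a different seed generally lands on a slightly different weight
```

On real hardware the seed alone is often not enough (GPU kernels and library versions can introduce their own nondeterminism), which is why the environment is logged too.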
Testing Philosophy Is Completely Different
| DevOps Testing | MLOps Testing |
|---|---|
| Unit test: function returns expected output | Data quality test: feature distributions within expected range |
| Integration test: service A calls service B | Model quality test: precision/recall above threshold |
| Regression test: no new bugs introduced | Bias/fairness test: model performance equal across demographic groups |
| Performance test: latency under 200ms | Inference latency test: p99 latency under 50ms |
| Canary release: error rate and latency on a traffic slice | Canary evaluation: new model vs old model on live traffic slice |
| N/A | Shadow mode evaluation: run both models, compare outputs silently |
Traditional test coverage is a meaningless metric for ML quality. A model could have 100% test coverage of the training script and still produce deeply biased predictions. The "tests" that matter are statistical evaluations on held-out data, and those evaluations must be repeated on fresh data as the world changes.
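As a rough sketch of what such a statistical test can look like, the snippet below computes precision per group and the largest gap between groups. The record format, the function names, and the segment labels "A" and "B" are assumptions made for this example:

```python
from collections import defaultdict


def precision_by_group(records):
    """records: iterable of (group, predicted_positive, actually_positive)."""
    flagged = defaultdict(int)
    correct = defaultdict(int)
    for group, predicted, actual in records:
        if predicted:
            flagged[group] += 1
            correct[group] += int(actual)
    return {g: correct[g] / flagged[g] for g in flagged}


def max_precision_gap(per_group: dict) -> float:
    """Largest difference in precision between any two groups."""
    values = list(per_group.values())
    return max(values) - min(values)


# Invented evaluation records; "A" and "B" are placeholder segment names.
records = (
    [("A", True, True)] * 9 + [("A", True, False)] * 1 +
    [("B", True, True)] * 6 + [("B", True, False)] * 4
)

per_group = precision_by_group(records)
gap = max_precision_gap(per_group)
overall = sum(a for _, p, a in records if p) / sum(1 for _, p, _ in records if p)

print(per_group)   # {'A': 0.9, 'B': 0.6}
print(overall)     # 0.75
```

An aggregate precision of 0.75 hides a 30-point gap between the two segments, which is exactly the kind of failure a code-coverage number can never surface.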
Concept Drift: The Bug With No Stack Trace
Concept drift is when the statistical relationship between input features and the target label changes over time. It is the most dangerous failure mode in ML because it is invisible to all standard monitoring.
January: P(fraud | transaction_amount > $500, new_merchant = True) = 0.72
June: P(fraud | transaction_amount > $500, new_merchant = True) = 0.31
The same feature values now mean something different. The model learned January's distribution. It has no way to know that distribution has changed. It continues making predictions with high confidence - they're just wrong.
Types of drift:
- Data drift (covariate shift): P(X) changes, P(Y|X) stays the same - input distribution shifts
- Concept drift: P(Y|X) changes - the underlying relationship changes
- Label drift: P(Y) changes - the prevalence of the target class shifts
- Upstream data drift: a data pipeline silently changes, corrupting features
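Detecting drift requires statistical tests run continuously in production. A two-sample test on feature values, as below, catches data drift; confirming concept drift also requires ground-truth labels, which often arrive late: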
from scipy import stats
import numpy as np

def detect_data_drift(
    reference_feature: np.ndarray,
    production_feature: np.ndarray,
    threshold: float = 0.05
) -> dict:
    """
    Two-sample Kolmogorov-Smirnov test for data drift detection.
    Flags drift when the p-value falls below `threshold`.
    """
    ks_stat, p_value = stats.ks_2samp(reference_feature, production_feature)
    return {
        "ks_statistic": round(float(ks_stat), 4),
        "p_value": round(float(p_value), 4),
        # bool() converts the NumPy boolean so the result is JSON-serializable
        "drift_detected": bool(p_value < threshold),
        # Severity bands are fixed at 0.01 / 0.05, independent of `threshold`
        "severity": "HIGH" if p_value < 0.01 else "MEDIUM" if p_value < 0.05 else "LOW"
    }

# Example usage in production monitoring.
# load_reference_distribution and get_last_7_days_production are placeholders
# for your own data-access helpers.
ref_amounts = load_reference_distribution("transaction_amount")
prod_amounts = get_last_7_days_production("transaction_amount")
result = detect_data_drift(ref_amounts, prod_amounts)
# Example output (illustrative values):
# {"ks_statistic": 0.2341, "p_value": 0.0012, "drift_detected": True, "severity": "HIGH"}
Deployment Means Something Different
In DevOps, "deployment" means: put the new binary somewhere, route traffic to it, roll back if health checks fail.
In MLOps, "deployment" has multiple modes with different risk profiles:
Shadow deployment: the new model runs in parallel with the old one, receives the same inputs, but its outputs are not served to users. You compare the two silently to build confidence before switching.
Canary deployment: 5–10% of traffic goes to the new model. You monitor business metrics (not just technical metrics) to catch regressions before full rollout.
A/B testing: two model variants serve different user segments for a planned experiment window, with statistical rigor to determine which performs better on business outcomes.
Champion/Challenger: the production model is the "champion." New model candidates are "challengers" that must statistically outperform the champion on defined metrics before promotion.
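Canary and A/B releases do exist in standard DevOps, but there they gate on latency and error rate, not on precision, recall, or revenue impact. Shadow deployment and champion/challenger have no direct DevOps equivalent, and a blue-green deployment, which switches all traffic at once, offers no gradual comparison at all.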
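Shadow deployment is the easiest of these modes to picture in code. A minimal sketch, assuming both models are plain callables and the shadow log is an in-memory list (a real system would write to a log stream or table); `champion` and `challenger` here are hypothetical threshold rules:

```python
def serve_with_shadow(features, champion, challenger, shadow_log):
    """Return the champion's prediction; record the challenger's silently."""
    served = champion(features)
    try:
        shadow = challenger(features)
    except Exception:  # a failing shadow model must never affect users
        shadow = None
    shadow_log.append({"features": features, "served": served, "shadow": shadow})
    return served


def disagreement_rate(shadow_log) -> float:
    """Share of logged requests where the two models disagreed."""
    compared = [e for e in shadow_log if e["shadow"] is not None]
    if not compared:
        return 0.0
    return sum(e["served"] != e["shadow"] for e in compared) / len(compared)


# Hypothetical stand-in models: simple threshold rules on the amount.
def champion(features):
    return features["amount"] > 500


def challenger(features):
    return features["amount"] > 400


log = []
responses = [serve_with_shadow({"amount": a}, champion, challenger, log)
             for a in (100, 450, 600, 900)]

print(responses)               # [False, False, True, True]
print(disagreement_rate(log))  # 0.25
```

Users only ever see the champion's answers. The disagreement rate, and later the labeled outcomes for the disagreeing cases, are what build the confidence to switch.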
Rollback Is More Complicated
In DevOps, rollback means: point traffic at the previous container image. Takes 30 seconds. Done.
In MLOps, rollback has additional complications:
- Data state: the new model may have already written predictions to a database. Rewriting historical predictions is complex.
- Model-data coupling: the rolled-back model was trained on older data and may be worse on current distribution than when originally deployed.
- Feature pipeline state: if the feature engineering pipeline was also updated, rolling back the model without rolling back the pipeline may produce incompatible inputs.
- Online learning systems: if the model has been updating its weights in real-time on production data, there is no clean checkpoint to roll back to.
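One way to make these complications explicit is a pre-rollback check. The sketch below assumes a registry entry that records the features a model expects; the field names (`expected_features`, `online_learning`) and the example schema are hypothetical, not taken from any particular model registry:

```python
def rollback_blockers(target_model: dict, live_feature_schema: dict) -> list[str]:
    """List reasons the target model cannot safely consume today's features."""
    problems = []
    for name, dtype in target_model["expected_features"].items():
        if name not in live_feature_schema:
            problems.append(f"missing feature: {name}")
        elif live_feature_schema[name] != dtype:
            problems.append(
                f"type changed for {name}: {dtype} -> {live_feature_schema[name]}"
            )
    if target_model.get("online_learning"):
        problems.append("weights updated online: no clean checkpoint")
    return problems


# Hypothetical registry entry for the previous model version.
previous_model = {
    "name": "fraud_v1",
    "expected_features": {"transaction_amount": "float", "merchant_age_days": "int"},
    "online_learning": False,
}
# Hypothetical schema produced by the current feature pipeline.
live_schema = {"transaction_amount": "float", "merchant_risk_score": "float"}

blockers = rollback_blockers(previous_model, live_schema)
print(blockers)  # ['missing feature: merchant_age_days']
```

An empty list does not make the rollback safe; it only rules out the incompatibilities that can be checked mechanically. Stale training data and already-written predictions still need a human decision.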
The DataOps Layer
DevOps has no concept of a data layer. Data is an input that comes from somewhere and the software handles it. MLOps requires an entire additional discipline - DataOps - that manages:
- Data pipelines (Airflow, Prefect, dbt)
- Data quality and validation (Great Expectations, Deequ, Soda)
- Data versioning (DVC, Delta Lake, Iceberg)
- Feature stores (Feast, Hopsworks, Tecton)
- Data lineage (OpenLineage, Marquez)
- Schema management and evolution
This is not optional overhead - it is the foundation. A model trained on undocumented, unversioned, unvalidated data is a model you can never reproduce, debug, or safely retrain.
The Full Comparison Table: DevOps vs MLOps
| Dimension | DevOps | MLOps |
|---|---|---|
| Primary artifact | Binary / container image | Trained model + training pipeline |
| Artifact determinism | Deterministic (same code = same binary) | Probabilistic (same code + different data = different model) |
| Version control | Git for source code | Git + DVC for data + MLflow for models |
| Testing | Unit, integration, E2E, performance | All of DevOps + data validation, model evaluation, bias testing |
| Definition of "passing" | All tests green | All tests green + model metrics above threshold |
| Deployment unit | Container image | Model weights + serving code + feature pipeline |
| Deployment modes | Blue-green, canary, rolling | Shadow, canary, A/B test, champion/challenger |
| Monitoring | CPU, memory, latency, error rate | All of DevOps + prediction drift, feature drift, business metrics |
| Rollback | Point to previous image | Complex - model, pipeline, and data state must be considered |
| Failure mode | Crash, exception, 5xx | Silent quality degradation, concept drift |
| What changes behavior | Code changes | Code changes OR data changes OR world changes |
| CI/CD pipeline content | Build, test, push, deploy | Build, test, train, evaluate, validate, push, deploy |
| Reproducibility challenge | Low (same code and build environment give the same result) | High (requires pinning code + data + environment + seeds) |
| Team structure | Dev + Ops | Data scientists + ML engineers + data engineers + MLOps platform |
| Feedback loop | Application logs, error tracking | Prediction monitoring, label feedback, business outcome tracking |
| Governance | Access control, audit logs | All of DevOps + model cards, bias audits, regulatory compliance |
[Figure: MLOps pipeline diagram]
Data Scientists vs MLOps Engineers: Skill Overlaps and Gaps
This divide is a source of real friction on ML teams. Understanding it helps you hire correctly and collaborate better.
Data scientists are strong at: statistical modeling, feature engineering, understanding domain context, interpreting model behavior, research iteration. They are typically weak at: production system design, observability, SLA management, CI/CD, infrastructure.
MLOps engineers are strong at: pipeline automation, container orchestration, system reliability, monitoring, CI/CD, infrastructure as code. They are typically weak at: statistical evaluation, feature engineering, understanding model behavior, research iteration.
The most effective ML teams have both, with clear ownership: data scientists own the model quality and the experiment loop; MLOps engineers own the deployment pipeline, the data loop, and the monitoring infrastructure. Where they must collaborate closely: evaluation metrics (what to monitor), retraining triggers (when to retrain), and model packaging (how to serve).
# Shared vocabulary example: an evaluation gate both teams define together
evaluation_gate = {
    # Data scientist defines what to measure
    "metrics": {
        "precision": {"threshold": 0.88, "direction": "min"},
        "recall": {"threshold": 0.82, "direction": "min"},
        "auc_roc": {"threshold": 0.91, "direction": "min"},
        "max_bias_gap": {"threshold": 0.05, "direction": "max"},  # fairness
    },
    # MLOps engineer defines how to enforce it
    "enforcement": {
        "mode": "blocking",  # pipeline fails if gate not passed
        "comparison": "challenger_must_beat_champion",
        "min_eval_samples": 10000,
        "statistical_test": "bootstrap_ci_95"
    }
}
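A gate definition only matters if something enforces it. The following is a sketch of the threshold part only, with a hypothetical `check_gate` helper and invented candidate metrics; the champion comparison and the bootstrap test named in the configuration are out of scope here:

```python
def check_gate(metrics: dict, gate: dict) -> list[str]:
    """Return the list of failed checks; an empty list means the gate passes."""
    failures = []
    for name, rule in gate["metrics"].items():
        value = metrics[name]
        if rule["direction"] == "min" and value < rule["threshold"]:
            failures.append(f"{name}={value} is below {rule['threshold']}")
        elif rule["direction"] == "max" and value > rule["threshold"]:
            failures.append(f"{name}={value} is above {rule['threshold']}")
    return failures


# Same thresholds as the gate definition above, repeated to keep this runnable.
gate = {
    "metrics": {
        "precision": {"threshold": 0.88, "direction": "min"},
        "recall": {"threshold": 0.82, "direction": "min"},
        "max_bias_gap": {"threshold": 0.05, "direction": "max"},
    }
}

# Invented metrics for a candidate model.
candidate = {"precision": 0.91, "recall": 0.80, "max_bias_gap": 0.03}

failures = check_gate(candidate, gate)
print(failures)  # ['recall=0.8 is below 0.82']
```

In a blocking pipeline, a non-empty `failures` list would end the job with a non-zero exit code, so the model cannot be promoted by accident.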
A/B Testing Infrastructure Requirements
In DevOps, A/B testing infrastructure is optional, typically a product-experimentation concern. MLOps depends on it for safe model rollouts.
An ML A/B testing system requires:
- Traffic splitting: route user X consistently to model A, user Y to model B (sticky assignment)
- Outcome tracking: log business outcomes (conversion, fraud caught, click) against model assignment
- Statistical analysis: compute significance and effect size automatically
- Guardrail metrics: automatically stop the experiment if the challenger harms any critical metric
- Holdout groups: preserve a never-seen population for unbiased evaluation
# Simplified A/B routing for model serving
import hashlib

def get_model_for_request(user_id: str, experiment: dict) -> str:
    """
    Deterministic, sticky assignment: same user always gets same model.
    experiment = {"name": "fraud_v2_test", "challenger_pct": 0.10}
    """
    hash_val = int(hashlib.md5(
        f"{user_id}:{experiment['name']}".encode()
    ).hexdigest(), 16)
    bucket = (hash_val % 100) / 100.0  # 0.0 to 0.99
    if bucket < experiment["challenger_pct"]:
        return "challenger"
    return "champion"
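Routing is only half of the system; the statistical analysis step decides whether an observed difference is real. One simple option, shown here as a sketch rather than a recommendation, is a two-sided two-proportion z-test. The counts in the example are invented:

```python
import math


def two_proportion_z_test(success_a: int, n_a: int, success_b: int, n_b: int):
    """Two-sided z-test for a difference between two proportions."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # no variation at all: nothing to distinguish
    z = (p_b - p_a) / se
    # Standard normal CDF via the error function
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    p_value = 2 * (1 - cdf)
    return z, p_value


# Invented counts: fraud cases caught out of fraud cases seen.
# Champion: 800 of 1000 (80%). Challenger on 10% traffic: 170 of 200 (85%).
z, p_value = two_proportion_z_test(800, 1000, 170, 200)
print(round(z, 2), round(p_value, 2))  # 1.64 0.1
```

A five-point improvement looks convincing on a dashboard, but with only 200 challenger samples it does not reach the conventional 0.05 significance level. This is why sample-size planning belongs in the experiment definition.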
Production Engineering Notes
On evaluation gates in CI/CD: build your model evaluation gate as a proper blocking step in your pipeline. The gate should fail the pipeline - not just warn - when a model doesn't meet quality thresholds. Teams that only log warnings find that warnings are always ignored under deployment pressure.
On monitoring granularity: monitor model quality metrics at multiple granularities. Aggregate precision over all predictions can look fine while precision on a specific user segment or feature value has collapsed. Always slice your monitoring metrics by key feature dimensions.
On retraining pipelines: treat your retraining pipeline exactly like production code. It should have automated tests, be triggered by a CI system, and produce versioned artifacts. Ad-hoc retraining in a notebook that someone runs manually when they remember to is not a retraining pipeline - it is technical debt.
On shadow deployments: always run a shadow deployment before a full rollout for any model that affects revenue, safety, or regulatory compliance. The compute cost of running two models is almost always worth it.
Common Mistakes
:::danger Treating ML Model Tests Like Software Tests
Writing passing unit tests for your training code and concluding the model is "tested" is dangerous. The tests verify the code runs - not that the model is any good. You need statistical evaluation on held-out data, not just execution correctness.
:::
:::danger Skipping Data Versioning "Just for Now"
"We'll add data versioning later" is the most common source of non-reproducible models. Later never comes, and six months down the line you have no idea what data version trained your production model. Start versioning data from day one, even if it's just a hash of the dataset stored alongside the model artifact.
:::
:::warning Monitoring Only Infrastructure Metrics for ML Systems
Setting up CPU/memory/latency monitors and declaring the system "monitored" leaves the most important failure modes - concept drift, prediction distribution shift, silent quality degradation - completely undetected. ML systems need a second tier of monitoring: statistical metrics on the outputs, not just on the infrastructure.
:::
:::warning Using DevOps Rollback Procedures Unchanged for ML
"Roll back to the previous model" sounds simple but has hidden complications: the previous model was trained on older data, the feature pipeline may have changed, and if you have an online database of predictions, you may have already written bad predictions that need correction. Define your ML rollback procedure explicitly before you need it.
:::
:::warning Defining Retraining as a Manual Process
Any retraining that depends on someone remembering to do it will be forgotten. Retraining triggers must be automated: scheduled jobs, performance threshold alerts, or data drift alerts that automatically queue a training run. Manual retraining is not operational.
:::
Interview Q&A
Q1: How is MLOps different from DevOps? What does MLOps add?
Strong answer: MLOps extends DevOps to handle the unique challenges introduced by machine learning. DevOps assumes software artifacts are deterministic - the same code always produces the same behavior. ML breaks this assumption because a trained model's behavior depends not just on code, but on training data and the evolving distribution of production data.
MLOps adds three extra loops to the standard DevOps pipeline:
- The data loop: data ingestion, validation, versioning, and feature engineering infrastructure
- The model loop: automated retraining triggers, evaluation gates, champion/challenger deployment, prediction monitoring
- The experiment loop: hyperparameter search, architecture comparison, experiment tracking
It also adds an entirely new failure mode that DevOps tooling can't detect: concept drift - when the statistical relationship between inputs and the correct output changes over time, degrading model quality silently.
Q2: Why can't you just use standard CI/CD for ML?
Strong answer: Standard CI/CD can manage the code side of ML (training scripts, serving code, pipeline definitions). But it cannot manage what makes ML unique:
First, the artifact problem: a Docker image is reproducible from source code. A trained model is not - it depends on data version, random seeds, and hardware. You need additional tooling (MLflow, DVC) to track these.
Second, the testing problem: traditional tests verify code correctness, not model quality. A training script that executes without errors can produce a model that's 30% worse than the previous one. You need statistical evaluation gates that block promotion unless model quality metrics pass.
Third, the deployment problem: you can't safely deploy an ML model with just a blue-green switch. You need shadow deployments, A/B testing, and champion/challenger frameworks because model quality is probabilistic and only fully measurable on live traffic.
Fourth, the monitoring problem: infrastructure metrics (latency, error rate) tell you the software is running. They don't tell you the model is performing well. You need statistical monitoring of prediction distributions and business outcomes.
Q3: What is concept drift and why is it uniquely dangerous in ML systems?
Strong answer: Concept drift is when the statistical relationship between input features and the correct label changes over time. For example, a fraud detection model trained in January might learn that transaction_amount > $500 AND new_merchant = True has a 72% probability of fraud. Six months later, fraudsters have adapted and that pattern now has only 31% fraud probability. The feature values are the same - their meaning has changed.
It's uniquely dangerous because it is invisible to all standard monitoring. CPU, memory, and latency are fine. Error rate is zero. The model serves predictions with high confidence. But the predictions are increasingly wrong. By the time a business metric alert fires (fraud chargebacks increasing), you've often been degraded for weeks.
Detection requires statistical monitoring of the model's input distribution (data drift) and output distribution (prediction drift), combined with periodic ground truth comparison when labels become available. Tools like Evidently AI, whylogs, and NannyML automate this detection.
Q4: What does "model monitoring" mean, and how is it different from standard application monitoring?
Strong answer: Standard application monitoring watches the behavior of the software: is it running, how fast is it responding, are there errors. Model monitoring watches the quality of the model's outputs, which requires an entirely different set of measurements.
Model monitoring has four layers:
1. Infrastructure monitoring (same as DevOps): CPU, memory, latency, error rate - tells you the software is healthy
2. Input data monitoring: are the features arriving in expected distributions? Are there missing values, out-of-range values, schema violations?
3. Output monitoring: is the distribution of predictions stable? Are you predicting "fraud" 0.1% of the time or 40% of the time? Sudden changes in prediction distribution often signal drift.
4. Business outcome monitoring: is the model actually achieving the business goal? Fraud caught, revenue attributed, click-through rate - these are the ultimate ground truth
Most teams implement layer 1 immediately, layer 4 eventually, and neglect layers 2 and 3 - which is exactly where drift shows up first.
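Output monitoring can start very small. A minimal sketch, assuming a binary classifier and a known baseline positive rate; the class name `PredictionRateMonitor`, the window size, and the tolerance are illustrative choices, not recommendations:

```python
from collections import deque


class PredictionRateMonitor:
    """Tracks the share of positive predictions over a sliding window."""

    def __init__(self, baseline_rate: float, window: int, tolerance: float):
        self.baseline_rate = baseline_rate
        self.tolerance = tolerance
        self.recent = deque(maxlen=window)

    def observe(self, predicted_positive: bool) -> None:
        self.recent.append(bool(predicted_positive))

    def current_rate(self) -> float:
        return sum(self.recent) / len(self.recent) if self.recent else 0.0

    def is_alerting(self) -> bool:
        # Only judge a full window, to avoid alerts on a handful of requests.
        if len(self.recent) < self.recent.maxlen:
            return False
        return abs(self.current_rate() - self.baseline_rate) > self.tolerance


monitor = PredictionRateMonitor(baseline_rate=0.02, window=100, tolerance=0.05)

for i in range(100):
    monitor.observe(i < 2)          # 2% positives: matches the baseline
steady_state_alert = monitor.is_alerting()

for _ in range(40):
    monitor.observe(True)           # sudden burst of positive predictions
burst_alert = monitor.is_alerting()

print(steady_state_alert, burst_alert, monitor.current_rate())  # False True 0.4
```

No labels are needed for this check, which is why it can fire long before business outcome metrics move.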
Q5: How would you explain the difference between a DevOps deployment and an MLOps deployment to a DevOps engineer who's new to ML?
Strong answer: In DevOps, you deploy code and the artifact's behavior is fully determined by the code. A web server that passes tests in staging will behave identically in production. Confidence comes from test coverage.
In MLOps, you deploy a model whose behavior is determined by both code and training data. Confidence can't come from test coverage alone - it has to come from statistical evaluation on held-out data and from carefully controlled exposure to live traffic.
Think of it this way: when you deploy a new web server, you're asking "does this code do what we wrote it to do?" When you deploy a new model, you're asking "is this model still accurate on the kinds of examples we'll actually see in the wild?" Those are fundamentally different questions, and the second one requires live traffic evaluation to answer fully.
That's why MLOps deployments typically run in shadow mode (run both models, compare silently) before a canary (send 5% of real traffic to the new model, watch business metrics). The code tests tell you the model can run. The canary tells you whether it should be running.
Q6: What is the DataOps layer in MLOps, and why doesn't DevOps need it?
Strong answer: DataOps is the discipline of applying DevOps principles - automation, version control, testing, monitoring - to data pipelines. It includes data ingestion automation, schema validation, statistical data quality checks, data versioning, feature stores, and data lineage tracking.
DevOps doesn't need this because traditional software treats data as an input it receives and processes - the software is responsible for its logic, not the data's correctness. If a web app receives bad data, it returns an error or handles it. The software itself is unchanged.
ML systems are fundamentally data-coupled. The model's behavior is a direct function of the data it was trained on. If training data is wrong, the model is wrong - and it may be wrong in ways that are invisible for months. If feature engineering logic changes silently, the model sees a different input distribution than it was trained on and degrades. Data is not just an input - it is half of the artifact. That's why it requires the same engineering discipline as code: version control, automated testing, monitoring for drift, and governance.
Building an MLOps Platform on Top of DevOps Infrastructure
The practical question for any team transitioning from pure DevOps to MLOps is: what do we keep and what do we replace?
The answer: keep almost everything. Add the ML-specific layers on top.
What Carries Over Unchanged
Source control: Git workflows, pull request processes, branch policies - unchanged. Your ML code belongs in Git exactly as your application code does.
Container registry: Docker images for training jobs and serving endpoints use the same registry (ECR, GCR, ACR) as all other services.
Secrets management: API keys, database credentials, service account tokens - same Vault, AWS Secrets Manager, or Kubernetes secrets.
Alerting infrastructure: PagerDuty, Opsgenie, Slack webhooks - same systems. MLOps adds new alert types, not new alert delivery infrastructure.
Log aggregation: model serving logs go to the same ELK or Loki stack. MLOps adds structured logging for prediction inputs and outputs, not a separate log system.
Infrastructure as Code: Terraform for cloud resources, Helm for Kubernetes deployments - identical. An ML serving deployment is still a Kubernetes Deployment.
What MLOps Adds on Top
DevOps Foundation:
├── Git (code versioning)
├── Docker + Kubernetes (containerization + orchestration)
├── CI/CD (GitHub Actions / GitLab CI / Jenkins)
├── Prometheus + Grafana (infrastructure metrics)
└── ELK / Loki (log aggregation)
MLOps Layer (added on top):
├── DVC (data versioning, on top of Git)
├── MLflow / W&B (experiment tracking + model registry)
├── Airflow / Prefect / Kubeflow (pipeline orchestration for training)
├── Feature Store (Feast / Tecton / Hopsworks)
├── Evidently / NannyML (ML-specific monitoring: drift, prediction quality)
└── Model Serving: Seldon, BentoML, or Ray Serve (ML-optimized serving)
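To give a feel for what the ML-specific monitoring layer computes, the sketch below implements the Population Stability Index (PSI), one common drift statistic, with the standard library only. It is not the API of Evidently or NannyML; the function name, binning strategy, and defaults are assumptions chosen for illustration.

```python
import math


def population_stability_index(reference, current, n_bins=10, eps=1e-6):
    """PSI between two samples of one numeric feature.

    Bins are equal-width over the reference sample's range; values in the
    current sample that fall outside that range are clamped into the edge bins.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if the reference is constant

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), n_bins - 1)
            counts[idx] += 1
        # eps keeps empty bins from producing log(0) or division by zero
        return [max(c / len(values), eps) for c in counts]

    ref_frac = bin_fractions(reference)
    cur_frac = bin_fractions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_frac, cur_frac))
```

A frequently quoted rule of thumb treats PSI below 0.1 as stable and above 0.25 as significant drift, but those cut-offs are conventions, not guarantees, and should be tuned per feature.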
The CI/CD Pipeline Before and After
Standard DevOps CI/CD pipeline:
```yaml
# .github/workflows/deploy.yml (DevOps version)
on:
  push:
    branches: [main]

jobs:
  build-and-deploy:
    runs-on: ubuntu-latest  # required key; the runner label is an assumption
    steps:
      - name: Checkout
        uses: actions/checkout@v4
      - name: Run tests
        run: pytest tests/
      - name: Build Docker image
        run: docker build -t myapp:${{ github.sha }} .
      - name: Push to registry
        run: docker push myapp:${{ github.sha }}
      - name: Deploy to Kubernetes
        run: kubectl set image deployment/myapp container=myapp:${{ github.sha }}
```
MLOps CI/CD pipeline (what gets added), shown here in GitLab CI syntax:
```yaml
# .gitlab-ci.yml (MLOps version)
stages:
  - validate       # NEW: data validation
  - test           # same as DevOps
  - train          # NEW: model training
  - evaluate       # NEW: evaluation gate
  - build          # same as DevOps (build serving image)
  - shadow-deploy  # NEW: shadow deployment
  - promote        # replaces simple "deploy"

validate-data:
  stage: validate
  script:
    # Fails pipeline if data quality checks don't pass
    - python scripts/validate_data.py --data-version $DATA_VERSION
    - python scripts/check_schema.py --schema configs/feature_schema.yaml

run-tests:
  stage: test
  script:
    - pytest tests/unit/
    - pytest tests/pipeline/  # NEW: pipeline integration tests
    - pytest tests/data/      # NEW: data validation unit tests

train-model:
  stage: train
  script:
    - python src/train.py --config configs/train.yaml --data-version $DATA_VERSION
    - dvc push  # push trained model artifact to remote storage
  artifacts:
    paths:
      - models/
      - reports/metrics.json

evaluate-model:
  stage: evaluate
  script:
    # Script exits with code 1 if metrics below threshold - blocks pipeline
    - python scripts/evaluate.py --gate configs/eval_gate.yaml
  needs: [train-model]

build-serving-image:
  stage: build
  script:
    - docker build -f Dockerfile.serve -t fraud-model:$CI_COMMIT_SHA .
    - docker push fraud-model:$CI_COMMIT_SHA
  needs: [evaluate-model]

deploy-shadow:
  stage: shadow-deploy
  # The 24-hour wait exceeds the default job timeout; the runner's maximum must also allow this
  timeout: 25 hours
  script:
    # Waits 24 hours and checks shadow metrics before proceeding
    - python scripts/deploy_shadow.py --image fraud-model:$CI_COMMIT_SHA
    - python scripts/run_shadow_validation.py --duration-hours 24
  needs: [build-serving-image]
  when: manual  # human approval required before shadow

promote-to-production:
  stage: promote
  script:
    - python scripts/promote_to_production.py --image fraud-model:$CI_COMMIT_SHA
  needs: [deploy-shadow]
  when: manual  # second human approval for full rollout
```
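The pipeline delegates its evaluation gate to scripts/evaluate.py, which is not shown. The following is a minimal sketch of what such a gate might do, assuming the training job writes a flat JSON object to reports/metrics.json and the gate is a mapping of metric names to minimum values. The function names and the gate layout are hypothetical, and loading the gate from configs/eval_gate.yaml is omitted to keep the sketch standard-library only.

```python
import json


def check_gate(metrics, gate):
    """Compare metrics against per-metric minimums.

    Returns (passed, failures) where failures is a list of readable messages.
    """
    failures = []
    for name, minimum in gate["min"].items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: missing from metrics")
        elif value < minimum:
            failures.append(f"{name}: {value:.3f} < required {minimum:.3f}")
    return (not failures, failures)


def run_gate(metrics_path, gate):
    """Return a process exit code: 0 if the gate passes, 1 otherwise."""
    with open(metrics_path) as f:
        metrics = json.load(f)
    passed, failures = check_gate(metrics, gate)
    for message in failures:
        print("GATE FAILED:", message)
    return 0 if passed else 1
```

A script entry point would pass the result of run_gate to sys.exit, so that a non-zero exit code fails the CI job and blocks the later stages.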
Model Serving vs Application Serving
Standard application serving optimizes for: latency, throughput, and availability. ML serving optimizes for all of these plus a set of concerns unique to model inference:
Batching: many ML models (gradient-boosted trees such as XGBoost, neural networks) run dramatically faster when predictions are batched. Depending on the model and hardware, a batch of 100 requests may be as much as 50x faster per request than 100 individual requests. Serving infrastructure must handle dynamic request batching: assembling requests that arrive within a time window into a batch, running a single inference call, then returning responses individually.
Model warm-up: large models (especially deep learning models) have a "cold start" problem. The first inference on a freshly loaded model is much slower than subsequent ones due to JIT compilation (for example, torch.compile in PyTorch 2.x), kernel caching, and GPU memory transfer. Serving infrastructure must send warm-up requests at startup, before the instance receives live traffic.
Version management: you may need to serve multiple model versions simultaneously (for A/B testing or for different user segments). Application servers rarely need this as a first-class concept.
Hardware-aware serving: GPU-accelerated models require different infrastructure (GPU nodes, CUDA drivers, memory management) than CPU-only serving. DevOps engineers often encounter this for the first time when serving deep learning models.
```python
# BentoML: ML-aware serving with batching built in
# (runner-style API from BentoML 1.0/1.1; newer releases use @bentoml.service instead)
import bentoml
import numpy as np
from bentoml.io import NumpyNdarray, JSON

FRAUD_THRESHOLD = 0.73

# Load the model from the BentoML model store. Assumes it was saved with
# bentoml.sklearn.save_model(..., signatures={"predict_proba": {"batchable": True}})
fraud_model = bentoml.sklearn.get("fraud_detector:latest").to_runner()

svc = bentoml.Service("fraud_detection", runners=[fraud_model])


@svc.api(input=NumpyNdarray(), output=JSON())
async def predict(features: np.ndarray) -> dict:
    # With a batchable signature, BentoML's adaptive batching groups requests
    # that arrive close together into a single predict_proba() call
    probabilities = await fraud_model.predict_proba.async_run(features)
    # predict() would return class labels, so take P(fraud) for the first row instead
    fraud_score = float(probabilities[0][1])
    return {
        "fraud_score": fraud_score,
        "label": "FRAUD" if fraud_score > FRAUD_THRESHOLD else "LEGITIMATE",
    }
```
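To make the dynamic batching idea concrete outside of any serving framework, here is a minimal asyncio sketch. It is not how BentoML implements batching; the MicroBatcher class, its parameters, and the fake model in the demo are all hypothetical. Each caller awaits its own result, while a background worker collects requests for a short window and issues one batched call.

```python
import asyncio


class MicroBatcher:
    """Group requests arriving within max_wait_s (up to max_batch_size) into one call."""

    def __init__(self, predict_batch, max_batch_size=32, max_wait_s=0.01):
        self._predict_batch = predict_batch  # callable: list of inputs -> list of outputs
        self._max_batch_size = max_batch_size
        self._max_wait_s = max_wait_s
        self._queue = asyncio.Queue()

    async def predict(self, features):
        """Called once per request; resolves when the batch containing it has run."""
        future = asyncio.get_running_loop().create_future()
        await self._queue.put((features, future))
        return await future

    async def run(self):
        """Background worker: build batches and fan results back out."""
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self._queue.get()]  # block until the first request arrives
            deadline = loop.time() + self._max_wait_s
            while len(batch) < self._max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self._queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            try:
                outputs = self._predict_batch([features for features, _ in batch])
            except Exception as exc:
                # Propagate a model failure to every caller in the batch
                for _, future in batch:
                    future.set_exception(exc)
                continue
            for (_, future), output in zip(batch, outputs):
                future.set_result(output)


async def demo():
    batch_sizes = []

    def fake_model(batch):  # stand-in for a real batched model.predict()
        batch_sizes.append(len(batch))
        return [x * 2 for x in batch]

    batcher = MicroBatcher(fake_model, max_batch_size=8, max_wait_s=0.2)
    worker = asyncio.create_task(batcher.run())
    results = await asyncio.gather(*(batcher.predict(i) for i in range(5)))
    worker.cancel()
    return results, batch_sizes
```

Running asyncio.run(demo()) sends five concurrent requests through the batcher; the recorded batch sizes show how many model calls were actually made. The trade-off is latency for throughput: a longer window yields larger batches but delays the first request in each batch.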
The Organizational Dimension: Team Structure for MLOps
DevOps broke down the silo between development and operations by merging both into cross-functional teams that own the full lifecycle. MLOps introduces new silos that must be deliberately addressed:
The data scientist / ML engineer silo: data scientists build models; ML engineers productionize them. The handoff point is often a pickle file and a Jupyter notebook. The receiving ML engineer doesn't understand the modeling decisions; the data scientist doesn't understand the production constraints. The result: slow handoffs, lost context, and production models that don't match what was evaluated in notebooks.
The data engineering / ML silo: data engineers build the pipelines that produce features; ML engineers consume those features. When the pipeline changes, the ML team is the last to know. Feature definitions drift silently. Training-serving skew accumulates.
The business / ML silo: business stakeholders define the problem; ML teams solve it. The gap between "business success" (revenue, churn reduction, customer satisfaction) and "ML success" (F1 score, AUC-ROC) often goes unbridged. Models that optimize for the wrong metric get deployed and quietly fail to deliver business value.
Effective MLOps organizations close these silos through:
- Shared metric definitions: data scientists and business stakeholders jointly define evaluation metrics before any model training begins
- Embedded ML engineers: ML engineers participate in model development, not just deployment
- Data contracts: ML teams and data engineering teams maintain versioned schema contracts, with notifications for all breaking changes
- Shared monitoring dashboards: business metrics and model metrics on the same dashboard, visible to all stakeholders
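The data contract item in the list above lends itself to automation. As a minimal sketch, assuming a contract is just a mapping from feature name to a dtype string, a check like the following could run in the data engineering team's CI and notify the ML team when it reports anything. The function name and the schema layout are hypothetical.

```python
def breaking_changes(old_schema, new_schema):
    """List changes between two contract versions that would break consumers.

    Removed columns and changed types count as breaking; added columns do not.
    """
    changes = []
    for column, old_type in old_schema.items():
        if column not in new_schema:
            changes.append(f"removed column: {column}")
        elif new_schema[column] != old_type:
            changes.append(f"type change: {column} {old_type} -> {new_schema[column]}")
    return changes


# Example contract versions (illustrative values only)
v1 = {"amount": "float64", "merchant_id": "string", "txn_hour": "int32"}
v2 = {"amount": "float32", "merchant_id": "string", "country": "string"}
report = breaking_changes(v1, v2)
```

Here the report would flag the changed type of amount and the removal of txn_hour, while the new country column passes silently because existing consumers are unaffected by it.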
Summary: The MLOps Mental Model
Think of DevOps as the foundation - a complete, working system for managing deterministic software artifacts. Think of MLOps as an extension that adds three capabilities DevOps doesn't have:
- Data management: treating data with the same engineering rigor as code - versioning, testing, monitoring, and governance
- Probabilistic quality management: evaluating model quality statistically on held-out data, in shadow deployments, and in production - because test pass/fail is not sufficient
- Temporal stability management: monitoring for concept drift, maintaining retraining infrastructure, and having a plan for the model's eventual retirement
Everything in DevOps that was good - automation, shared ownership, "you build it you run it," infrastructure as code, continuous delivery - remains good in MLOps. MLOps doesn't replace DevOps. It extends it to handle the unique challenges of software whose behavior is a function of the world, not just of its code.
MLOps Maturity Levels
Not every organization needs the same level of MLOps sophistication. Google's guidance on MLOps maturity describes three levels (Microsoft and others have since published similar frameworks, some with more levels). The descriptions below are a simplified adaptation:
Level 0 - Manual process: Data scientists train models manually in notebooks, export a pickle file, and hand it to engineers who deploy it manually. No pipeline automation, no experiment tracking, no monitoring. This is where most teams start and where most teams stay for too long.
Level 1 - ML pipeline automation: The training pipeline is automated and can be triggered on demand. Experiment tracking is in place. Data and model versioning exist. Basic infrastructure monitoring is live. Retraining is still manually triggered. This is the minimum viable MLOps state for any production ML system.
Level 2 - Full CI/CD for ML pipelines: The entire pipeline - data validation, training, evaluation, shadow deployment, promotion - is automated and triggered by code commits or monitoring alerts. Retraining triggers are automated. Champion/challenger evaluation is automated. The ML system self-maintains within defined thresholds. This is the target state for ML platforms at significant scale.
| Capability | Level 0 | Level 1 | Level 2 |
|---|---|---|---|
| Training automation | Manual, notebook | Scripted, on-demand | CI/CD triggered |
| Experiment tracking | None | MLflow / W&B | Full lineage |
| Data versioning | None | DVC / manual hash | Automated, every run |
| Model versioning | None | MLflow Registry | Automated promotion |
| Infrastructure monitoring | None | Basic | Full observability |
| ML monitoring | None | Manual spot checks | Automated drift detection |
| Retraining | Manual | Manual + script | Automated triggers |
| Deployment | Manual | Semi-automated | Shadow → Canary → Auto |
Most teams are at Level 0. Moving to Level 1 delivers the majority of the reliability improvement. Level 2 is appropriate when the number of models in production makes manual management impractical - typically beyond 10–15 production models.
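The "automated triggers" that distinguish Level 2 can be pictured as a small policy function evaluated on a schedule. The sketch below is one possible shape for it, not a prescribed design: the ModelHealth fields, the function name, and all thresholds are hypothetical and would need tuning per model.

```python
from dataclasses import dataclass


@dataclass
class ModelHealth:
    drift_score: float        # e.g. a drift statistic on a key feature
    live_precision: float     # measured on labelled production feedback
    days_since_training: int


def retraining_reasons(health, max_drift=0.25, min_precision=0.90, max_age_days=90):
    """Return the reasons to retrain; an empty list means no trigger fires."""
    reasons = []
    if health.drift_score > max_drift:
        reasons.append("data drift")
    if health.live_precision < min_precision:
        reasons.append("precision below threshold")
    if health.days_since_training > max_age_days:
        reasons.append("model too old")
    return reasons
```

A scheduler would start the training pipeline whenever the list is non-empty and record the reasons alongside the resulting model version, which keeps the trigger auditable.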
