
The ML Lifecycle

Reading time: 35–40 min | Relevance: ML Engineer, MLOps Engineer, Data Scientist, Engineering Manager


The Model Nobody Maintained

The team had built a good model. An e-commerce recommendation system, trained on six months of click and purchase data, that lifted average order value by 8% in its first production month. The project was declared a success. The Jira tickets were closed. The data scientists moved on to the next project.

Six months later, a product manager noticed something: the recommendations had gotten worse. Users were complaining in support tickets that the app was suggesting items they'd already bought, or products in categories they'd never browsed. The 8% AOV lift had evaporated. They were back to baseline, possibly worse.

The team was pulled back in to investigate. What they found was a mess. The original training data was gone from the shared drive - someone had cleaned it up. The model version in production had no documentation: no training run record, no hyperparameters, no evaluation results. The feature engineering code had been modified three times since the original training, but the model was still running against the original feature computation logic because nobody had connected the pipeline. The click data schema had evolved - two new fields, one renamed field - and the feature pipeline was silently substituting zeros for missing values rather than failing loudly.

The model had been deployed as a finished product. But a trained model is not a finished product. It is a living artifact that requires continuous care: monitoring for drift, retraining on fresh data, evaluation against updated ground truth, and eventually, deliberate retirement when it can no longer serve its purpose. The team had shipped the model but had never thought about the post-deployment lifecycle at all. That oversight cost them six months of degraded recommendations and a significant recovery effort.

This lesson is about the lifecycle they missed - and how to engineer it correctly from the start.



Why This Exists

Software has a lifecycle too, but it is fundamentally simpler. You write code, ship it, and it does what it does until you change the code. The behavior is deterministic and stable. You don't need to "retrain" a web server or "monitor it for drift."

Machine learning adds a new dimension: the model's correctness depends on the relationship between input features and outcomes holding stable in the world. When that relationship changes - when user behavior shifts, when external conditions evolve, when the data distribution drifts - the model's performance degrades without any code change at all. This means ML systems require a lifecycle management discipline that traditional software engineering never developed.

The ML lifecycle framework exists to answer four questions that rarely arise in traditional software engineering: When do you retrain? How do you know when the model is failing? Who has to approve a new model version? And when do you kill the model entirely?


Historical Context: CRISP-DM and Its Limitations

The most widely used historical framework for ML project management is CRISP-DM (Cross-Industry Standard Process for Data Mining), conceived in 1996 by a consortium that included Daimler-Benz, SPSS, and NCR. CRISP-DM defines six phases:

  1. Business Understanding
  2. Data Understanding
  3. Data Preparation
  4. Modeling
  5. Evaluation
  6. Deployment

CRISP-DM was a genuine advance in its time - it formalized the idea that data science projects have structure and should be managed with rigor. But it has critical limitations for modern MLOps:

CRISP-DM treats deployment as an endpoint. The sixth phase, "Deployment," is followed by nothing. The model goes into production and the process ends. There is no monitoring phase, no retraining loop, no retirement criteria. CRISP-DM was designed for data mining projects that produced reports and insights - not for production ML systems that serve millions of predictions daily.

CRISP-DM has no concept of data versioning. The "Data Preparation" phase prepares data but has no mechanism for reproducibly versioning what was prepared. You cannot reconstruct the original training dataset from CRISP-DM documentation alone.

CRISP-DM has no governance structure. Who approves the model for deployment? What metrics must it pass? What's the rollback plan? CRISP-DM doesn't address any of this.

The modern ML lifecycle framework keeps CRISP-DM's core insight - that building ML systems is an iterative, multi-phase process - but extends it dramatically to cover the full operational reality of production models.


The Modern ML Lifecycle: All Phases in Depth

The modern ML lifecycle has ten phases arranged in a loop, not a line. The model cycles through production, monitoring, retraining, and back to production continuously.

Phase 1: Problem Definition

What happens: A business problem is translated into a machine learning problem. This is harder than it sounds and is the most underestimated phase.

Key questions to answer:

  • Is ML actually the right tool? (Often the answer is no - a simpler rule-based system may be faster, cheaper, and more maintainable.)
  • What is the target variable? How is it defined? Where does ground truth come from?
  • What is the acceptable latency? (Real-time at 10ms? Batch overnight?)
  • What are the evaluation metrics? (Precision, recall, RMSE - and how do they map to business value?)
  • What data is available, and at what lag? (Ground truth labels often arrive hours or days after the prediction is needed.)

Artifacts produced: Problem statement document, success criteria definition, metric specification, feasibility assessment.

What can go wrong: Poorly defined success criteria are the leading cause of ML projects that "succeed" technically but fail commercially. If you don't specify before training that the model needs F1 > 0.85 on the demographic segment that represents 60% of revenue, you'll discover the problem at demo time.


Phase 2: Data Collection and Acquisition

What happens: Data is sourced, understood at a high level, and loaded into a working environment. Data lineage is established from the start.

Key activities:

  • Identify all data sources (databases, APIs, third-party providers, manual labels)
  • Negotiate data access and understand data licensing constraints
  • Understand label availability and label lag
  • Establish data versioning (DVC, Delta Lake, S3 + hash manifest)
  • Document schema at this point - it will drift, and you want a reference

Artifacts produced: Raw data with version hash, data dictionary, data access agreements, initial quality report.

# DVC: version your dataset from day one
# Initialize DVC in the repo (run inside an existing Git repository)
# $ dvc init
# $ dvc add data/raw/transactions_2024_01.parquet
# $ git add data/raw/transactions_2024_01.parquet.dvc data/raw/.gitignore
# $ git commit -m "Add raw transaction dataset v1.0"
# $ dvc push # uploads the data to the configured DVC remote

# Later, to reproduce any training run with its exact data:
# $ git checkout <commit_hash>
# $ dvc pull # fetches exact dataset for that commit
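A documented schema only helps if the pipeline checks incoming data against it. The following is a minimal sketch of a check that fails loudly instead of silently filling in defaults, assuming records arrive as plain dictionaries. `EXPECTED_SCHEMA`, its field names, `SchemaViolation`, and `validate_record` are hypothetical names chosen for illustration, not part of any library.

```python
from typing import Any

# Hypothetical schema reference captured at data-collection time.
EXPECTED_SCHEMA: dict[str, type] = {
    "user_id": str,
    "item_id": str,
    "clicked": bool,
    "price": float,
}


class SchemaViolation(ValueError):
    """Raised when a record does not match the documented schema."""


def validate_record(record: dict[str, Any], schema: dict[str, type] = EXPECTED_SCHEMA) -> None:
    """Fail loudly on missing, unexpected, or mistyped fields."""
    missing = sorted(set(schema) - set(record))
    unexpected = sorted(set(record) - set(schema))
    mistyped = sorted(
        name
        for name, expected in schema.items()
        if name in record and not isinstance(record[name], expected)
    )
    if missing or unexpected or mistyped:
        raise SchemaViolation(
            f"missing={missing} unexpected={unexpected} mistyped={mistyped}"
        )


# A conforming record passes silently; a renamed or dropped field raises.
validate_record({"user_id": "u1", "item_id": "i9", "clicked": True, "price": 19.99})
```

Raising on unexpected fields as well as missing ones is a deliberate choice here: a renamed upstream field shows up as one missing and one unexpected name, which makes the schema change easy to diagnose.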

Phase 3: Exploratory Data Analysis

What happens: The data is deeply examined to understand distributions, correlations, quality issues, and the feasibility of the target task.

Key activities:

  • Distribution analysis for all features - spot-check for anomalies, outliers, unexpected zeros
  • Target variable analysis - is it balanced? Is the base rate reasonable for the business problem?
  • Correlation analysis - which features are informative? Which are redundant?
  • Data quality audit - missing values, inconsistent encodings, schema violations
  • Temporal analysis - are there seasonality effects? Leakage risks (features that contain information from the future)?

What EDA is for operationally: EDA is where you discover the problems that will haunt you in production. Feature leakage - where a feature accidentally encodes the answer - is usually caught in EDA if done rigorously, and often missed if EDA is rushed.

Artifacts produced: EDA notebook (pinned to specific data version), quality report, feature selection shortlist, potential leakage risk log.


Phase 4: Feature Engineering

What happens: Raw data is transformed into model-ready features. This phase has the highest leverage on model quality and the highest maintenance cost long-term.

Key activities:

  • Missing value imputation strategy (mean, median, forward-fill, model-based)
  • Encoding of categorical variables (one-hot, target encoding, embedding)
  • Normalization and scaling
  • Feature construction (aggregations, interaction terms, time-window statistics)
  • Feature selection (correlation thresholds, recursive feature elimination)
  • Feature store registration (if using a feature store)

The critical engineering discipline here is treating feature engineering as production code from the start:

# Feature engineering as a versioned, tested pipeline
# NOT as notebook cells you paste into a script

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
import pandas as pd

def build_feature_pipeline(
    numeric_features: list,
    categorical_features: list,
) -> ColumnTransformer:
    """
    Returns a sklearn ColumnTransformer that can be fit on training data
    and applied consistently to production inference data.
    Serializable via joblib - version it alongside your model.
    """
    # Numeric columns: fill gaps with the median, then standardize
    numeric_transformer = Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ])
    # Categorical columns: fill gaps with the mode; unseen categories map to -1
    categorical_transformer = Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("encoder", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)),
    ])
    preprocessor = ColumnTransformer(transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ])
    return preprocessor

# The key: this pipeline is fit ONLY on training data,
# then applied identically to validation, test, and production.
# Fitting on all data before splitting leaks validation/test statistics
# into training (data leakage) and inflates offline metrics.

Artifacts produced: Feature engineering pipeline (serialized, versioned), feature documentation, feature importance analysis.
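The fit-on-train-only rule is easy to state and easy to break, so here is a toy illustration using only the standard library. `MedianImputer` is a hypothetical stand-in for a fitted transformer, not the scikit-learn class: it learns one statistic from the training split and reuses that same value everywhere else.

```python
from statistics import median
from typing import Optional


class MedianImputer:
    """Toy transformer: learns a median in fit(), reuses it in transform()."""

    def __init__(self) -> None:
        self.fill_value: Optional[float] = None

    def fit(self, values: list) -> "MedianImputer":
        observed = [v for v in values if v is not None]
        self.fill_value = median(observed)
        return self

    def transform(self, values: list) -> list:
        if self.fill_value is None:
            raise RuntimeError("call fit() on the training split first")
        return [self.fill_value if v is None else v for v in values]


train = [10.0, 12.0, None, 14.0]
test = [None, 1000.0]  # an outlier that must not influence the fill value

imputer = MedianImputer().fit(train)   # statistics come from train only
train_filled = imputer.transform(train)
test_filled = imputer.transform(test)  # same fitted state, no refit
```

Had the imputer been fit on `train + test`, the outlier in the test split would have shifted the fill value, and the training data would carry information it could not have had at training time.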


Phase 5: Model Training and Experimentation

What happens: Models are trained, hyperparameters are tuned, and experiments are tracked. This is the phase data scientists spend the most time in.

Key activities:

  • Train-validation-test split (strict - test set is locked until final evaluation)
  • Baseline model training (simple model as a reference point - if you can't beat logistic regression, the problem is more likely your features than your model choice)
  • Experiment tracking (MLflow, W&B, Neptune)
  • Hyperparameter optimization (grid search, random search, Bayesian optimization with Optuna)
  • Cross-validation for reliable performance estimates

import mlflow
import mlflow.sklearn
import optuna
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# X_train and y_train are assumed to be the prepared training split,
# defined earlier in the training script.

def objective(trial: optuna.Trial) -> float:
    """Optuna objective: maximize CV AUC-ROC."""
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 500),
        "max_depth": trial.suggest_int("max_depth", 3, 10),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "subsample": trial.suggest_float("subsample", 0.6, 1.0),
        "min_samples_split": trial.suggest_int("min_samples_split", 2, 20),
    }

    # Each trial becomes a nested MLflow run under the parent HPO run
    with mlflow.start_run(nested=True):
        model = GradientBoostingClassifier(**params, random_state=42)
        cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring="roc_auc")
        mean_auc = cv_scores.mean()

        mlflow.log_params(params)
        mlflow.log_metric("cv_auc_mean", mean_auc)
        mlflow.log_metric("cv_auc_std", cv_scores.std())

    return mean_auc

# Run HPO - every trial is logged to MLflow by the objective function above
with mlflow.start_run(run_name="hpo_gbm_v3"):
    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=100, timeout=3600)
    best_params = study.best_params
    mlflow.log_params({"best_" + k: v for k, v in best_params.items()})

Artifacts produced: Trained model artifacts (all variants), experiment tracking records, hyperparameter configuration, training performance metrics.
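The strict train-validation-test split listed above can be made concrete. For time-ordered data such as click logs, one common choice (an assumption here, since the lesson does not prescribe a split strategy) is a chronological split, so that the test set is always newer than anything the model trained on. The function name, the `event_date` key, and the default fractions are hypothetical.

```python
from datetime import date


def chronological_split(
    rows: list[dict],
    train_frac: float = 0.7,
    val_frac: float = 0.15,
) -> tuple[list[dict], list[dict], list[dict]]:
    """Split by time order: oldest -> train, middle -> validation, newest -> test."""
    if not 0 < train_frac < 1 or not 0 < val_frac < 1 or train_frac + val_frac >= 1:
        raise ValueError("fractions must be positive and leave room for a test split")
    ordered = sorted(rows, key=lambda r: r["event_date"])
    n = len(ordered)
    train_end = round(n * train_frac)
    val_end = round(n * (train_frac + val_frac))
    return ordered[:train_end], ordered[train_end:val_end], ordered[val_end:]


# Twenty daily events, deliberately supplied newest-first.
rows = [{"event_date": date(2024, 1, d), "id": d} for d in range(20, 0, -1)]
train, val, test = chronological_split(rows)
```

A random split would be simpler, but on temporal data it lets the model train on events that happened after some of its test examples, which is one of the leakage risks flagged during EDA.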


Phase 6: Model Evaluation and Selection

What happens: The best experiment variant is evaluated rigorously on the locked test set, compared against the baseline and (in production systems) against the current champion model.

Key activities:

  • Test set evaluation - this is the one holdout the model has never seen
  • Business metric translation - convert ML metrics to revenue/cost/risk impact
  • Bias and fairness evaluation - does the model perform equally across demographic groups?
  • Error analysis - where does the model fail? Are failures acceptable?
  • Champion/challenger comparison - does the new model beat the current production model?

from sklearn.metrics import (
    precision_score, recall_score, f1_score,
    roc_auc_score, average_precision_score,
    confusion_matrix, classification_report
)
import json

def evaluate_model(model, X_test, y_test, threshold: float = 0.5) -> dict:
    """
    Full evaluation suite for a binary classifier.
    Returns a dict suitable for logging to MLflow or storage.
    """
    y_proba = model.predict_proba(X_test)[:, 1]
    y_pred = (y_proba >= threshold).astype(int)

    metrics = {
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
        "auc_roc": roc_auc_score(y_test, y_proba),
        "avg_precision": average_precision_score(y_test, y_proba),
        "threshold_used": threshold,
        "n_test_samples": len(y_test),
        "positive_rate_test": float(y_test.mean()),
        "positive_rate_predicted": float(y_pred.mean()),
    }
    # Log confusion matrix; cast NumPy integers so the dict stays JSON-serializable
    cm = confusion_matrix(y_test, y_pred)
    tn, fp, fn, tp = (int(v) for v in cm.ravel())
    metrics.update({"tn": tn, "fp": fp, "fn": fn, "tp": tp})

    return metrics

def passes_evaluation_gate(metrics: dict, gate: dict) -> tuple[bool, list]:
    """
    Returns (passed, list_of_failures).
    gate: {"precision": 0.88, "recall": 0.82, "auc_roc": 0.91}
    """
    failures = []
    for metric_name, min_value in gate.items():
        # A metric missing from the report counts as 0 and therefore fails the gate
        actual = metrics.get(metric_name, 0)
        if actual < min_value:
            failures.append(
                f"{metric_name}: {actual:.4f} < required {min_value}"
            )
    return len(failures) == 0, failures

Artifacts produced: Test set evaluation report, model card, bias/fairness report, business impact estimate.

The evaluation gate is a blocking step. A model that fails the evaluation gate does not proceed to deployment - it goes back to Phase 5. This is not optional. Removing the gate "just this once" to meet a deadline is how bad models get into production.
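Passing absolute thresholds is only half of the gate in a running system; the champion/challenger comparison listed above also has to hold. A minimal sketch follows, assuming both models were scored on the same locked test set. The function name, the choice of `auc_roc` as primary metric, and the `min_gain` and `max_regression` tolerances are illustrative assumptions, not recommended values.

```python
def challenger_wins(
    champion: dict[str, float],
    challenger: dict[str, float],
    primary_metric: str = "auc_roc",
    min_gain: float = 0.005,        # assumed minimum improvement worth a rollout
    guardrails: tuple[str, ...] = ("precision", "recall"),
    max_regression: float = 0.01,   # assumed tolerated drop on guardrail metrics
) -> tuple[bool, list[str]]:
    """Return (promote, reasons_against) for a challenger vs. the current champion."""
    reasons = []
    gain = challenger[primary_metric] - champion[primary_metric]
    if gain < min_gain:
        reasons.append(f"{primary_metric} gain {gain:.4f} below required {min_gain}")
    for name in guardrails:
        drop = champion[name] - challenger[name]
        if drop > max_regression:
            reasons.append(f"{name} regressed by {drop:.4f}")
    return not reasons, reasons


champion = {"auc_roc": 0.940, "precision": 0.900, "recall": 0.880}
challenger = {"auc_roc": 0.951, "precision": 0.912, "recall": 0.887}
promote, reasons = challenger_wins(champion, challenger)
```

Requiring a minimum gain, rather than any gain at all, keeps the team from paying rollout risk for differences that are within evaluation noise.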


Phase 7: Deployment

What happens: The approved model is deployed to production, following a staged rollout that minimizes risk.

Shadow deployment (lowest risk, always recommended for new models):

  • The new model runs alongside the champion on real traffic
  • Its outputs are logged but not served to users
  • Compare: do the two models agree? Where do they disagree? What does the business metric look like for each?
  • Collect 7–30 days of shadow data before promoting
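The agree/disagree comparison in shadow mode can be sketched in a few lines. This assumes both models scored the same requests and that decisions come from a shared threshold; `shadow_agreement` and its return keys are hypothetical names.

```python
def shadow_agreement(
    champion_scores: list[float],
    shadow_scores: list[float],
    threshold: float = 0.5,
) -> dict:
    """Compare thresholded decisions from both models on identical requests."""
    if len(champion_scores) != len(shadow_scores):
        raise ValueError("both models must score the same requests")
    if not champion_scores:
        raise ValueError("no shadow traffic collected yet")
    disagreements = [
        i
        for i, (c, s) in enumerate(zip(champion_scores, shadow_scores))
        if (c >= threshold) != (s >= threshold)
    ]
    return {
        "agreement_rate": 1 - len(disagreements) / len(champion_scores),
        "disagreements": disagreements,  # request indices worth manual review
    }


report = shadow_agreement([0.1, 0.9, 0.4, 0.7], [0.2, 0.8, 0.6, 0.3])
```

The disagreement indices are usually more valuable than the rate itself: reviewing those requests by hand shows whether the new model is wrong in new ways or right where the champion was wrong.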

Canary deployment (medium risk):

  • 5–10% of live traffic routes to the new model
  • Monitor business metrics at the segment level
  • Automatic rollback if guardrail metrics breach thresholds
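An automatic rollback needs an explicit rule for what "breach" means. The sketch below compares canary metrics against the champion's over the same window and reports which guardrails were violated. It assumes every metric is higher-is-better; the function name, metric names, and tolerances are hypothetical.

```python
def breached_guardrails(
    baseline: dict[str, float],
    canary: dict[str, float],
    max_relative_drop: dict[str, float],
) -> list[str]:
    """Return guardrail metrics where the canary fell too far below the baseline."""
    breached = []
    for metric, allowed in max_relative_drop.items():
        base = baseline[metric]
        if base <= 0:
            continue  # relative drop is undefined without a positive baseline
        relative_drop = (base - canary[metric]) / base
        if relative_drop > allowed:
            breached.append(metric)
    return breached


baseline = {"conversion_rate": 0.040, "avg_order_value": 52.0}
canary = {"conversion_rate": 0.030, "avg_order_value": 52.5}
limits = {"conversion_rate": 0.10, "avg_order_value": 0.05}  # assumed tolerances

to_rollback = breached_guardrails(baseline, canary, limits)
```

A non-empty result would route canary traffic back to the champion. In practice the comparison should also wait for enough canary traffic to make the metrics meaningful, which this sketch does not check.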

Full deployment:

  • 100% traffic on the new model
  • Champion model remains staged for 7 days before retirement, enabling fast rollback

# Istio VirtualService for weighted canary routing on Kubernetes
# Route 10% of traffic to new model version (canary)
# The "champion" and "canary" subsets must be defined in a matching DestinationRule
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: fraud-model-vs
spec:
  hosts:
    - fraud-model
  http:
    - route:
        - destination:
            host: fraud-model
            subset: champion
          weight: 90
        - destination:
            host: fraud-model
            subset: canary
          weight: 10

Artifacts produced: Deployment manifest, rollout record, shadow/canary evaluation report.


Phase 8: Production Monitoring and Alerting

What happens: The deployed model is continuously monitored across four dimensions: infrastructure health, data quality, prediction quality, and business outcomes.

This is the phase the team in the opening scenario skipped entirely. Without it, you are flying blind.

Monitoring stack:

# Example: prediction distribution monitor
# Run as a scheduled job (every hour, or after every N predictions)
import numpy as np
from scipy.stats import ks_2samp
from datetime import datetime, timezone

class PredictionDriftMonitor:
    def __init__(self, reference_predictions: np.ndarray):
        """Initialize with predictions from the first 2 weeks of deployment."""
        self.reference = reference_predictions
        self.reference_mean = reference_predictions.mean()
        self.reference_std = reference_predictions.std()

    def check_drift(
        self,
        recent_predictions: np.ndarray,
        window_label: str = "last_24h"
    ) -> dict:
        # Two-sample KS test: are recent scores drawn from the reference distribution?
        ks_stat, p_value = ks_2samp(self.reference, recent_predictions)
        current_mean = recent_predictions.mean()
        mean_shift = abs(current_mean - self.reference_mean)

        # Escalate on either a significant KS result or a large shift in the mean
        alert_level = "OK"
        if p_value < 0.01 or mean_shift > 2 * self.reference_std:
            alert_level = "CRITICAL"
        elif p_value < 0.05 or mean_shift > self.reference_std:
            alert_level = "WARNING"

        return {
            # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated)
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "window": window_label,
            "ks_statistic": round(float(ks_stat), 4),
            "p_value": round(float(p_value), 4),
            "current_mean": round(float(current_mean), 4),
            "reference_mean": round(float(self.reference_mean), 4),
            "mean_shift_sigmas": round(float(mean_shift / self.reference_std), 2),
            "alert_level": alert_level,
        }

Monitoring hierarchy:

| Layer | Metrics | Tooling |
|---|---|---|
| Infrastructure | CPU, memory, latency, error rate | Prometheus, Grafana |
| Data quality | Schema violations, missing rates, feature distributions | Great Expectations, Evidently |
| Prediction quality | Prediction distribution, PSI, output drift | Evidently AI, NannyML, WhyLogs |
| Business outcomes | Precision on ground truth labels, revenue impact | Custom dashboards, BI tools |
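PSI (Population Stability Index) appears in the prediction-quality layer, so a worked definition may help. For bin fractions $r_i$ (reference) and $c_i$ (current), PSI is $\sum_i (c_i - r_i)\ln(c_i / r_i)$. The sketch below is one simple way to compute it, assuming equal-width bins over the reference range and a small `eps` floor to avoid dividing by zero; the function name and both defaults are illustrative choices, and monitoring libraries may bin differently.

```python
import math


def population_stability_index(
    reference: list[float],
    current: list[float],
    n_bins: int = 10,
    eps: float = 1e-6,  # floor for empty bins so the log stays finite
) -> float:
    """PSI between two samples, using equal-width bins over the reference range."""
    lo, hi = min(reference), max(reference)
    if hi == lo:
        raise ValueError("reference sample has no spread; cannot build bins")
    width = (hi - lo) / n_bins

    def bin_fractions(values: list[float]) -> list[float]:
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), n_bins - 1)  # clamp out-of-range values to edge bins
            counts[idx] += 1
        return [max(c / len(values), eps) for c in counts]

    ref = bin_fractions(reference)
    cur = bin_fractions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


reference = [i / 100 for i in range(100)]            # roughly uniform scores
shifted = [min(v + 0.3, 0.99) for v in reference]    # scores pushed upward

psi_same = population_stability_index(reference, reference)
psi_shifted = population_stability_index(reference, shifted)
```

Identical samples give a PSI of zero, and the value grows as the distributions diverge. The governance example later in this lesson uses 0.20 as its drift threshold for retraining.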

Phase 9: Retraining and Model Updates

What happens: A new model version is trained on more recent data and, if it passes evaluation, promoted to production. This is not a special event - it is a routine operational process.

Retraining triggers:

from enum import Enum
from dataclasses import dataclass
from typing import Optional

class RetrainingTrigger(Enum):
    SCHEDULED = "scheduled"
    PERFORMANCE_DEGRADED = "performance_degraded"
    DATA_DRIFT = "data_drift"
    MANUAL = "manual"
    NEW_DATA_VOLUME = "new_data_volume"

@dataclass
class RetrainingDecision:
    should_retrain: bool
    trigger: Optional[RetrainingTrigger]
    reason: str
    urgency: str  # "routine" | "urgent" | "emergency" | "none"

def evaluate_retraining_need(
    model_metrics: dict,
    drift_report: dict,
    schedule_due: bool,
    new_data_rows: int,
) -> RetrainingDecision:
    """
    Evaluate whether retraining should be triggered.
    Called by a monitoring job on a regular schedule.
    Checks run in priority order; the first match wins.
    """
    # NOTE: new_data_rows is accepted but not used below; a NEW_DATA_VOLUME
    # check would slot in before the scheduled trigger.

    # Emergency: severe performance drop
    if model_metrics.get("precision", 1.0) < 0.75:
        return RetrainingDecision(
            should_retrain=True,
            trigger=RetrainingTrigger.PERFORMANCE_DEGRADED,
            reason=f"Precision {model_metrics['precision']:.3f} below emergency threshold 0.75",
            urgency="emergency"
        )

    # Urgent: significant drift detected
    if drift_report.get("alert_level") == "CRITICAL":
        return RetrainingDecision(
            should_retrain=True,
            trigger=RetrainingTrigger.DATA_DRIFT,
            reason=f"Critical data drift detected: KS={drift_report['ks_statistic']}",
            urgency="urgent"
        )

    # Routine: scheduled retraining
    if schedule_due:
        return RetrainingDecision(
            should_retrain=True,
            trigger=RetrainingTrigger.SCHEDULED,
            reason="Weekly scheduled retraining window",
            urgency="routine"
        )

    # No retraining needed
    return RetrainingDecision(
        should_retrain=False,
        trigger=None,
        reason="All metrics within acceptable ranges",
        urgency="none"
    )

Types of model updates:

  • Full retraining: train from scratch on the full updated dataset - most expensive, most thorough
  • Fine-tuning: continue training from the existing model checkpoint on recent data - faster, risk of catastrophic forgetting
  • Incremental learning: models that support online updates (e.g., online SGD, river library) - continuous but complex to validate
  • Ensemble refresh: replace one member of an ensemble with a retrained version - reduces risk per update

Phase 10: Model Retirement

What happens: A model is deliberately decommissioned. This is not failure - it is a planned transition. A model that is never retired accumulates technical debt indefinitely.

Retirement triggers:

  1. Performance: the model can no longer be retrained to acceptable quality on current data (concept has shifted too far)
  2. Regulatory: new legal requirements make the model non-compliant (e.g., GDPR, CCPA, fair lending law changes)
  3. Business: the business problem the model solved no longer exists, or a better solution (different model architecture, third-party API) has replaced it
  4. Technical debt: the model's infrastructure is too costly to maintain relative to its business value

Retirement checklist:

  • Document model performance at retirement - what metrics drove the decision?
  • Ensure all downstream consumers are notified 30+ days in advance
  • Archive the final model artifact and training data version for regulatory compliance
  • Remove the model from the serving infrastructure cleanly
  • Document lessons learned
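Retirement triggers are easier to act on when they are written down as checkable criteria rather than left to judgment under pressure. The sketch below covers the performance, regulatory, and technical-debt triggers; the business trigger is left to human decision. All names and thresholds are hypothetical, and monthly cost is used here only as a rough proxy for maintenance burden.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetirementCriteria:
    min_precision_after_retrain: float  # if retraining cannot reach this, retire
    max_monthly_cost_usd: float         # assumed proxy for maintenance burden


def retirement_reasons(
    criteria: RetirementCriteria,
    *,
    precision_after_retrain: float,
    monthly_cost_usd: float,
    compliant: bool,
) -> list[str]:
    """Return the retirement triggers that currently apply (empty list = keep)."""
    reasons = []
    if precision_after_retrain < criteria.min_precision_after_retrain:
        reasons.append("performance")
    if not compliant:
        reasons.append("regulatory")
    if monthly_cost_usd > criteria.max_monthly_cost_usd:
        reasons.append("technical_debt")
    return reasons


criteria = RetirementCriteria(min_precision_after_retrain=0.85, max_monthly_cost_usd=5000.0)
healthy = retirement_reasons(
    criteria, precision_after_retrain=0.90, monthly_cost_usd=1200.0, compliant=True
)
```

An empty list means the model stays; any entry should start the retirement checklist above rather than another round of patching.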

Lifecycle State Machine in Python

A model's status is not a single boolean (deployed/not deployed). It transitions through well-defined states with explicit triggers:

from enum import Enum
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

class ModelStatus(Enum):
    EXPERIMENTING = "experimenting"
    EVALUATED = "evaluated"
    SHADOW = "shadow"
    STAGING = "staging"
    PRODUCTION = "production"
    RETRAINING = "retraining"
    DEPRECATED = "deprecated"
    RETIRED = "retired"

# Allowed state transitions: only these paths are valid
ALLOWED_TRANSITIONS = {
    ModelStatus.EXPERIMENTING: {ModelStatus.EVALUATED},
    ModelStatus.EVALUATED: {ModelStatus.SHADOW, ModelStatus.EXPERIMENTING},  # fail gate -> back to experimenting
    ModelStatus.SHADOW: {ModelStatus.STAGING, ModelStatus.EXPERIMENTING},
    ModelStatus.STAGING: {ModelStatus.PRODUCTION, ModelStatus.EXPERIMENTING},
    ModelStatus.PRODUCTION: {ModelStatus.RETRAINING, ModelStatus.DEPRECATED},
    ModelStatus.RETRAINING: {ModelStatus.EVALUATED},
    ModelStatus.DEPRECATED: {ModelStatus.RETIRED},
    ModelStatus.RETIRED: set(),  # terminal state
}

def _utc_now() -> datetime:
    """Timezone-aware UTC timestamp (datetime.utcnow() is deprecated)."""
    return datetime.now(timezone.utc)

@dataclass
class ModelVersion:
    model_name: str
    version: str
    status: ModelStatus = ModelStatus.EXPERIMENTING
    created_at: datetime = field(default_factory=_utc_now)
    last_updated: datetime = field(default_factory=_utc_now)
    promoted_by: Optional[str] = None
    metrics: dict = field(default_factory=dict)
    transition_log: list = field(default_factory=list)

    def transition_to(self, new_status: ModelStatus, actor: str, reason: str) -> None:
        """
        Transition to a new lifecycle state.
        Raises ValueError if the transition is not allowed.
        """
        allowed = ALLOWED_TRANSITIONS.get(self.status, set())
        if new_status not in allowed:
            raise ValueError(
                f"Cannot transition {self.model_name} v{self.version} "
                f"from {self.status.value} to {new_status.value}. "
                f"Allowed transitions: {[s.value for s in allowed]}"
            )
        old_status = self.status
        self.status = new_status
        self.last_updated = _utc_now()
        self.promoted_by = actor
        # Append-only audit trail of every state change
        self.transition_log.append({
            "from": old_status.value,
            "to": new_status.value,
            "actor": actor,
            "reason": reason,
            "timestamp": self.last_updated.isoformat(),
        })
        print(f"[{self.model_name} v{self.version}] {old_status.value} -> {new_status.value} (by {actor}: {reason})")

# Example usage - a model's journey through the lifecycle
fraud_model_v3 = ModelVersion(
model_name="fraud_detector",
version="3.2.1"
)
fraud_model_v3.metrics = {"precision": 0.912, "recall": 0.887, "auc_roc": 0.951}

fraud_model_v3.transition_to(ModelStatus.EVALUATED, "ci_pipeline", "Passed evaluation gate: all metrics above threshold")
fraud_model_v3.transition_to(ModelStatus.SHADOW, "mlops_team", "Starting 14-day shadow deployment")
fraud_model_v3.transition_to(ModelStatus.STAGING, "mlops_team", "Shadow metrics match champion. Promoting to 10% canary")
fraud_model_v3.transition_to(ModelStatus.PRODUCTION, "engineering_lead", "Canary metrics verified. Full rollout approved")

How MLflow, DVC, and Kubeflow Map to Lifecycle Stages

| Lifecycle Phase | MLflow | DVC | Kubeflow Pipelines |
|---|---|---|---|
| Problem Definition | - | - | - |
| Data Collection | - | dvc add, dvc push | - |
| EDA | MLflow tracking (log plots) | dvc repro | - |
| Feature Engineering | MLflow artifacts | dvc.yaml stages | KFP component |
| Model Training | mlflow.start_run(), autolog | dvc stage add, dvc repro (dvc run in older DVC versions) | KFP training component |
| Evaluation | MLflow compare runs, Model Registry | - | KFP eval component, GCP Model Evaluation |
| Deployment | MLflow Model Registry (Staging/Production) | - | KFP serving component |
| Monitoring | - (use Evidently) | - | KFP recurring runs |
| Retraining | MLflow new run | dvc repro | KFP triggered pipeline |
| Retirement | MLflow Model Registry (Archived) | - | Deregister serving |

Lifecycle Governance: Approval Gates and Audit Trails

Production ML systems in regulated industries (finance, healthcare, insurance) require explicit governance at every lifecycle transition. This is not bureaucracy - it is risk management.

Minimum viable governance checklist:

# model-governance.yaml - stored alongside every model artifact
model:
  name: "fraud_detector"
  version: "3.2.1"
  trained_by: "[email protected]"
  training_date: "2024-06-15"
  data_version: "transactions_v5.1_hash_8f3a21b"

evaluation:
  approved_by: "[email protected]" # model owner sign-off
  approval_date: "2024-06-17"
  test_precision: 0.9120
  test_recall: 0.8870
  test_auc_roc: 0.9510
  bias_audit_completed: true
  bias_audit_report: "s3://artifacts/bias_report_3.2.1.pdf"

deployment:
  approved_by: "[email protected]" # engineering lead sign-off
  deployment_date: "2024-06-20"
  deployment_mode: "canary_10pct"
  rollback_plan: "revert to v3.1.8 via feature flag toggle"

monitoring:
  alert_recipients: ["[email protected]"]
  retraining_threshold_precision: 0.88
  retraining_threshold_drift_psi: 0.20

regulatory:
  model_card_location: "s3://artifacts/model_card_3.2.1.pdf"
  gdpr_dpia_completed: true
  fair_lending_review: "n/a"

Every time a model transitions lifecycle states, the governance document is updated and committed to version control. This creates an auditable record of who approved what and when.
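A governance file is only an audit trail if something refuses to proceed when it is incomplete. The sketch below checks a governance record, assumed to be already parsed from YAML into a dictionary, for the sign-offs a deployment would need. `REQUIRED_SIGNOFFS` and the choice of required fields are hypothetical and would follow each organization's policy.

```python
# Hypothetical policy: fields that must be filled in before deployment.
REQUIRED_SIGNOFFS: dict[str, list[str]] = {
    "evaluation": ["approved_by", "approval_date", "bias_audit_completed"],
    "deployment": ["approved_by", "deployment_date", "rollback_plan"],
}


def missing_governance_fields(record: dict) -> list[str]:
    """Return dotted names of required fields that are absent, empty, or false."""
    missing = []
    for section, fields in REQUIRED_SIGNOFFS.items():
        body = record.get(section) or {}
        for name in fields:
            if body.get(name) in (None, "", False):
                missing.append(f"{section}.{name}")
    return missing


record = {
    "evaluation": {
        "approved_by": "model-owner",
        "approval_date": "2024-06-17",
        "bias_audit_completed": False,   # audit not done yet
    },
    "deployment": {
        "approved_by": "engineering-lead",
        "deployment_date": "2024-06-20",
        # rollback_plan intentionally omitted
    },
}
gaps = missing_governance_fields(record)
```

A CI step could run a check like this before any lifecycle transition and block the promotion while the list is non-empty.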


Production Engineering Notes

On data versioning as a prerequisite: You cannot have a reproducible lifecycle without data versioning. If you cannot reconstruct the exact training dataset for any model version, you cannot reproduce failures, cannot audit model behavior, and cannot safely retrain. Implement DVC or equivalent on day one - retrofitting it after six months of unversioned training runs is painful.

On the locked test set: The test set must be collected before training begins, locked, and never touched during any phase of training or validation. Using the test set for any model selection decision (including hyperparameter tuning) invalidates it as an honest performance estimate. If you use the test set results to choose between models, you now need a fourth split for final evaluation.
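One way to make "locked" enforceable is to record a fingerprint of the test set when it is created and verify it before every final evaluation. This is a sketch under simple assumptions: rows are JSON-serializable dictionaries and row order is part of the dataset's identity. Both function names are hypothetical.

```python
import hashlib
import json


def dataset_fingerprint(rows: list[dict]) -> str:
    """SHA-256 over the rows in order; any edit, reorder, or addition changes it."""
    digest = hashlib.sha256()
    for row in rows:
        digest.update(json.dumps(row, sort_keys=True).encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def assert_test_set_unchanged(rows: list[dict], locked_fingerprint: str) -> None:
    """Raise if the test set no longer matches the fingerprint recorded at lock time."""
    actual = dataset_fingerprint(rows)
    if actual != locked_fingerprint:
        raise RuntimeError(
            f"test set changed since lock: expected {locked_fingerprint[:12]}, "
            f"got {actual[:12]}"
        )


test_rows = [{"id": 1, "label": 0}, {"id": 2, "label": 1}]
locked = dataset_fingerprint(test_rows)       # store this next to the model record
assert_test_set_unchanged(test_rows, locked)  # passes while the data is untouched
```

A fingerprint detects modification of the test set but not peeking at it; keeping selection decisions off the test set still depends on team discipline and access controls.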

On retraining frequency: More frequent retraining is not always better. Each retraining cycle carries risk (the new model might be worse). The right frequency depends on how quickly your data distribution changes. A model serving stable demand forecasting in a mature market might retrain monthly. A fraud detection model fighting active adversaries might retrain weekly or daily.

On lifecycle documentation: Every model in production should have an associated "model card" - a brief document that describes what the model does, what data it was trained on, what its performance metrics are, and what its known limitations are. This is the minimum documentation that makes the model manageable by anyone other than its original author.


Common Mistakes

:::danger Treating Deployment as the End of the Lifecycle
The most common and most costly mistake in ML engineering. After deployment, the work is not done - it has entered its most critical phase. A model without monitoring, retraining triggers, and a retirement plan is a time bomb. Make post-deployment planning a requirement in the project's initial kickoff, not an afterthought after launch.
:::

:::danger Using the Test Set During Model Selection
If you look at test set performance to decide between model A and model B, you have leaked the test set. Your reported test performance is now optimistically biased. The test set must be touched exactly once: for final evaluation of the single selected model. For all selection decisions, use the validation set or cross-validation.
:::

:::warning Retraining Manually "When Someone Remembers"
Manual retraining is not a retraining strategy. It is the absence of one. Any retraining that depends on a human remembering to run a script will be delayed, forgotten, and eventually skipped under time pressure. Automate all retraining triggers from the start, even if the initial trigger is just a weekly cron job.
:::

:::warning Skipping Shadow Deployment "Because We're Confident"
Shadow deployment feels like unnecessary overhead when the team is confident in a new model. It almost always surfaces surprises. Edge cases that don't appear in the test set appear in live traffic. Feature distribution differences between your test set collection period and right now show up only in shadow traffic. The cost of shadow deployment is compute time. The cost of skipping it is potential production incidents.
:::

:::warning Not Defining Retirement Criteria at Deployment Time
If you don't decide upfront what "this model has failed and should be retired" looks like, you will keep patching degraded models indefinitely rather than rebuilding properly. Retirement criteria (minimum acceptable performance, maximum technical debt score, regulatory compliance requirements) should be defined and documented at deployment time, not discovered when things go wrong.
:::


Interview Q&A

Q1: Walk me through the ML lifecycle. What are the key phases and what can go wrong in each?

Strong answer: The modern ML lifecycle has ten phases arranged in a continuous loop:

Problem definition - translate a business problem into an ML problem. The most common failure: vague success criteria. If you don't define precision > 0.88 before training, you'll debate whether 0.82 is "good enough" under deployment pressure.

Data collection - acquire and version data from the start. Most common failure: skipping versioning "until later." Later never comes.

EDA - understand distributions, quality issues, and leakage risks. Most common failure: rushing EDA under deadline pressure and missing feature leakage that degrades the model later.

Feature engineering - build the feature pipeline as production code, not notebook cells. Most common failure: data leakage from fitting transformers on the whole dataset instead of only the training split.

Model training - track every experiment with MLflow or W&B. Most common failure: running experiments without tracking, making results impossible to reproduce.

Evaluation - strict test set evaluation against predefined gates. Most common failure: peeking at the test set during model selection.

Deployment - always shadow before canary, canary before full rollout. Most common failure: full direct rollout with no validation on live traffic.

Monitoring - four layers: infrastructure, data, predictions, business outcomes. Most common failure: only monitoring infrastructure.

Retraining - automated triggers, not manual processes. Most common failure: manual retraining that gets forgotten.

Retirement - planned decommissioning with governance. Most common failure: never retiring models, letting them accumulate as technical debt.


Q2: How do you decide when to retrain a model?

Strong answer: There are three categories of retraining triggers, and a production system should have all three:

Scheduled triggers: retrain on a fixed cadence regardless of performance. Weekly, monthly - depends on how fast the data distribution changes. Fraud detection: weekly. Demand forecasting in a stable market: monthly. This handles slow, gradual drift before it becomes a crisis.

Performance triggers: monitor model metrics on labeled ground truth data as it becomes available. When precision or F1 drops below a predefined threshold, queue a retraining run automatically. The threshold should be set with stakeholders before deployment.

Data drift triggers: monitor statistical properties of input features (using KS test, PSI, or similar). When significant drift is detected even before ground truth labels arrive, trigger an investigation. This is an early warning system that can catch problems before they appear in performance metrics.

Manual triggers are not a category - they're a failure mode. Any team that relies on someone manually deciding to retrain will retrain too infrequently.
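A rough sketch of how the three trigger categories might be combined into one check, including a simple PSI calculation for the drift trigger. Everything here is an assumption for illustration: the function names, the equal-width binning, the seven-day cadence, the 0.88 precision threshold, and the 0.2 PSI threshold (a commonly quoted rule of thumb, not a value from this lesson).

```python
import math
from datetime import date, timedelta


def psi(reference, production, n_bins=10, eps=1e-6):
    """Population Stability Index between two numeric samples.

    Bins are equal-width over the reference range; production values
    outside that range are clamped into the first or last bin.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width for constant features

    def proportions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1
        # Floor empty bins at eps so the log term stays finite.
        return [max(c / len(values), eps) for c in counts]

    ref_p, prod_p = proportions(reference), proportions(production)
    return sum((p - r) * math.log(p / r) for r, p in zip(ref_p, prod_p))


def retraining_triggers(today, last_trained, precision, feature_psi,
                        max_age=timedelta(days=7), min_precision=0.88,
                        psi_threshold=0.2):
    """Return the names of all triggers that fire (empty list = nothing to do)."""
    fired = []
    if today - last_trained >= max_age:
        fired.append("scheduled")
    # Precision is None while ground-truth labels have not arrived yet.
    if precision is not None and precision < min_precision:
        fired.append("performance")
    if feature_psi > psi_threshold:
        fired.append("data_drift")
    return fired


# Toy data: production values are the reference values shifted by 3.0.
reference = [i / 100 for i in range(1000)]
production = [v + 3.0 for v in reference]

drift = psi(reference, production)
fired = retraining_triggers(
    today=date(2025, 1, 15),
    last_trained=date(2025, 1, 10),
    precision=0.91,
    feature_psi=drift,
)
print(round(drift, 3), fired)
```

In this toy run only the drift trigger fires: the model is five days old and labeled precision is still above threshold, which is exactly the early-warning case described above. Whether a drift signal queues a retraining run directly or opens an investigation first is a team policy decision.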


Q3: What is the difference between CRISP-DM and the modern ML lifecycle?

Strong answer: CRISP-DM (Cross-Industry Standard Process for Data Mining, 1996) was designed for data mining projects that produce reports and insights - not for production ML systems that serve continuous predictions.

Its core limitation: it treats deployment as an endpoint. Phase 6 is "Deployment" and then the process ends. There is no monitoring phase, no retraining loop, no notion of concept drift, no retirement criteria. For a data mining project that produces a quarterly report, this is fine. For a fraud detection model that runs 24/7 and faces an adversarial distribution that evolves continuously, it is inadequate.

The modern ML lifecycle adds: continuous monitoring of both infrastructure and model quality, automated retraining triggers, champion/challenger deployment infrastructure, model governance and audit trails, and explicit retirement planning. It treats the model not as a finished artifact but as a living system that requires ongoing operational care - more analogous to a production service than a delivered report.


Q4: What is training-serving skew, and how does the lifecycle design prevent it?

Strong answer: Training-serving skew is when the features a model sees at inference time are computed differently than the features it was trained on. This is one of the most common and most subtle bugs in production ML.

Classic example: during training, you compute a "days since last purchase" feature on historical data with a full record. In production, the feature pipeline has a bug and computes it as "days since database joined" instead. The model sees a fundamentally different feature value than it was trained on, and its predictions degrade - but the infrastructure metrics all look fine.

The lifecycle prevents this through two mechanisms: First, the feature engineering pipeline from Phase 4 must be a single, serializable artifact that is used both during training (fit + transform on training data) and at inference time (transform only). If you have two code paths that compute "the same" feature, you have guaranteed skew. Second, the monitoring layer in Phase 8 includes input data monitoring that compares the distribution of production features against the reference distribution captured during training. Distribution shifts flag potential skew early.
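The first mechanism, one feature artifact shared by training and serving, might look like the sketch below. The `FeaturePipeline` class, the row fields `as_of` and `last_purchase`, and the JSON serialization are hypothetical choices for illustration; a real system would more likely persist a library pipeline object or use a feature store.

```python
import json
from datetime import date
from statistics import mean, pstdev


def days_since_last_purchase(row):
    """The single definition of the raw feature, shared by training and serving."""
    return (row["as_of"] - row["last_purchase"]).days


class FeaturePipeline:
    """Fit once on the training split, then transform everywhere else."""

    def __init__(self, mean_=None, std_=None):
        self.mean_, self.std_ = mean_, std_

    def fit(self, rows):
        # Scaling statistics are learned from the TRAINING rows only.
        raw = [days_since_last_purchase(r) for r in rows]
        self.mean_ = mean(raw)
        self.std_ = pstdev(raw) or 1.0  # guard against a constant feature
        return self

    def transform(self, rows):
        # Same raw computation and same learned statistics at train and serve time.
        return [(days_since_last_purchase(r) - self.mean_) / self.std_ for r in rows]

    def dumps(self):
        """Serialize the learned state so serving can load the same artifact."""
        return json.dumps({"mean": self.mean_, "std": self.std_})

    @classmethod
    def loads(cls, payload):
        state = json.loads(payload)
        return cls(state["mean"], state["std"])


# Training time: fit on the training split and persist the artifact.
train_rows = [
    {"as_of": date(2025, 1, 31), "last_purchase": date(2025, 1, d)}
    for d in (1, 11, 21)
]
trained = FeaturePipeline().fit(train_rows)
artifact = trained.dumps()

# Serving time: load the artifact and call transform only, never fit.
serving = FeaturePipeline.loads(artifact)
request = {"as_of": date(2025, 2, 10), "last_purchase": date(2025, 1, 21)}
print(serving.transform([request]))
```

The design point is that serving never re-implements the feature: it imports the same `days_since_last_purchase` function and loads the same fitted statistics, so there is no second code path to drift out of sync.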


Q5: What is model retirement and when should you retire a model vs retrain it?

Strong answer: Model retirement is the deliberate, planned decommissioning of a model - removing it from production and stopping all associated infrastructure. It is not the same as retraining.

Retrain when: the model's performance has degraded due to data drift, but the underlying problem the model is solving is still well-defined and the model architecture is still appropriate. Retraining on fresher data can restore performance.

Retire when: (1) The model's concept has drifted so severely that no amount of retraining on available data can recover acceptable performance - the world has changed too much. (2) Regulatory requirements have changed and the model's decision-making process is no longer compliant. (3) The business problem the model solved no longer exists. (4) The technical debt of maintaining the model (infrastructure cost, retraining complexity, data pipeline maintenance) exceeds the business value it delivers.

Retirement should be planned from deployment day. Define the retirement criteria upfront: "This model will be retired when precision falls below 0.75 on three consecutive weekly evaluations despite retraining attempts, or when the regulatory environment requires explainability features this architecture cannot support." Retirement criteria defined in a crisis lead to political arguments. Retirement criteria defined upfront lead to clean, professional transitions.
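The example criterion quoted above can be encoded as a small, testable rule at deployment time. This is one possible reading, sketched under assumptions: `WeeklyEval` and `RetirementCriteria` are hypothetical names, and "despite retraining attempts" is interpreted as "at least one evaluation in the failing window followed a retraining attempt".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WeeklyEval:
    precision: float
    retrained: bool  # True if this week's model came from a retraining attempt


@dataclass(frozen=True)
class RetirementCriteria:
    min_precision: float = 0.75   # floor taken from the example in the text
    consecutive_weeks: int = 3

    def is_met(self, history):
        """True if the latest N weekly evals are all below the floor
        and at least one of them followed a retraining attempt."""
        recent = history[-self.consecutive_weeks:]
        if len(recent) < self.consecutive_weeks:
            return False
        all_below = all(e.precision < self.min_precision for e in recent)
        retraining_tried = any(e.retrained for e in recent)
        return all_below and retraining_tried


criteria = RetirementCriteria()

# Toy evaluation history, oldest first (illustrative values only).
history = [
    WeeklyEval(0.81, retrained=False),
    WeeklyEval(0.74, retrained=False),
    WeeklyEval(0.72, retrained=True),
    WeeklyEval(0.70, retrained=True),
]

print(criteria.is_met(history[:3]), criteria.is_met(history))
```

Keeping the rule in version control next to the deployment config means the retirement conversation starts from an agreed definition instead of an argument.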


Q6: How does MLflow support the ML lifecycle?

Strong answer: MLflow provides three capabilities that map to different lifecycle phases:

MLflow Tracking (Phases 5–6): logs experiments with full parameter, metric, and artifact recording. Every training run gets a unique run ID, and all hyperparameters, metrics, and model files are stored and queryable. This enables honest experiment comparison and reproducibility.

MLflow Projects (Phases 4–5): packages ML code with its environment specification so any training run can be reproduced exactly on any machine. The MLproject file defines the entry points and conda/docker environment.

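MLflow Model Registry (Phases 6–9): provides explicit lifecycle state management for models. In the classic stage workflow, a model version is in one of four stages: None (the default for a newly registered version), Staging, Production, or Archived. MLflow does not enforce annotations on a transition, so make it team policy to record who promoted a version, why, and which metrics justified it, using version descriptions and tags. Newer MLflow releases deprecate stages in favor of model version aliases and tags, so check which mechanism your version supports. The Registry is the single source of truth for "what model is running in production right now."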

What MLflow does not provide: data versioning (use DVC), production monitoring (use Evidently or NannyML), retraining orchestration (use Airflow or Kubeflow), or feature stores (use Feast or Hopsworks). MLflow is the experiment tracking and model registry layer. A complete lifecycle platform combines it with these complementary tools.

© 2026 EngineersOfAI. All rights reserved.