The ML Workflow - End to End
Reading time: ~22 minutes | Level: ML Foundations | Role: MLE, ML Engineer, Data Scientist, Research Engineer
A team at a large e-commerce company was given a clear mandate: "Build a product recommendation model that increases clickthrough rate." Three months later they presented results: their neural collaborative filtering model achieved 87% AUC on the offline test set, a 12-point improvement over the existing heuristic baseline.
When they deployed it, clickthrough rate went down 4%.
Post-mortem findings: The offline test set used historical data from the old recommendation system. Every "positive" example was a product that the old system had already surfaced. The new model learned to recommend popular products - exactly what the old system did. The AUC improvement was measuring "does our model agree with the old model," not "does our model improve CTR." The deployment metric (CTR) was never measured offline. The model was evaluated against the wrong objective for three months.
This story repeats in some form at almost every company doing ML. The solution is a disciplined workflow where the objective is validated before training begins, the evaluation is coupled to the deployment metric, and failure modes are identified at each stage - not just at the end.
What You Will Learn
- The full ML workflow from problem framing to production monitoring
- Where projects fail at each stage (with specific failure modes)
- The data flywheel: why data compounds over time
- How to iterate fast: baseline-first, complexity-later
- The critical distinction between offline and online evaluation
The Full ML Workflow
Stage 1 - Problem Framing
Problem framing is the highest-leverage stage of the workflow. It is also the most commonly skipped.
Business objective → ML objective
Most business objectives are not directly optimizable with ML. The translation from business to ML objective is a design decision with major downstream consequences.
| Business Objective | ML Objective | Failure Mode If Not Aligned |
|---|---|---|
| Increase revenue | Maximize CTR, then conversion | High CTR on cheap items → low revenue |
| Reduce fraud | Maximize recall at fixed precision | Optimizing recall alone blocks too many good transactions |
| Improve content moderation | Minimize FN rate | Optimizing FNs alone creates excessive FP removals |
| Improve search quality | Maximize nDCG@10 | nDCG doesn't capture diversity - repetitive results |
| Increase engagement | Maximize session time | Outrage content maximizes time; destroys brand trust |
The discipline: Before writing any code, write out:
- What is the business metric we are trying to move?
- What is the ML metric proxy?
- How do we know these are aligned? (Can we construct a case where the ML metric improves but the business metric doesn't?)
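The third question above can be made concrete with a few lines of arithmetic. The sketch below constructs a hypothetical case where the ML proxy (CTR) improves while the business metric (revenue per impression) falls. The `RankerStats` class and all numbers are invented for illustration and do not come from a real experiment.

```python
from dataclasses import dataclass


@dataclass
class RankerStats:
    """Aggregate outcomes for one ranking policy (hypothetical numbers)."""
    impressions: int
    clicks: int
    purchases: int
    avg_order_value: float

    @property
    def ctr(self) -> float:
        return self.clicks / self.impressions

    @property
    def revenue_per_impression(self) -> float:
        return self.purchases * self.avg_order_value / self.impressions


# Baseline surfaces fewer but more expensive items.
baseline = RankerStats(impressions=100_000, clicks=2_000,
                       purchases=200, avg_order_value=60.0)
# Candidate earns more clicks by surfacing cheap items.
candidate = RankerStats(impressions=100_000, clicks=2_600,
                        purchases=260, avg_order_value=35.0)

for name, s in [("baseline", baseline), ("candidate", candidate)]:
    print(f"{name}: CTR={s.ctr:.3%}, revenue/impression={s.revenue_per_impression:.4f}")
```

If a counterexample like this is easy to construct for your proxy, the proxy probably needs a guardrail metric alongside it.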
Is ML the right tool?
Before committing to an ML project:
- What is the baseline (current system, heuristic, or human performance)?
- What is the minimum improvement needed to justify ML over the baseline?
- Do we have the data, compute, and team to build this system?
- What is the cost of being wrong?
:::warning The most expensive stage to skip
Discovering at deployment that your ML objective doesn't match your business objective costs 3–6 months of wasted work. Discovering it at the problem framing stage costs one hour of whiteboarding.
:::
Stage 2 - Data Collection
After problem framing, the question becomes: do we have the data needed to learn the mapping?
Data requirements checklist
# Framework for estimating data requirements before collection
def estimate_data_needs(
    task_complexity: str,       # 'linear', 'moderate', 'complex'
    n_features: int,
    class_balance: float,       # minority class fraction for classification
    target_performance: float   # target accuracy/AUC, in [0, 1)
) -> dict:
    """
    Rough heuristics for data volume estimation.
    These are starting points, not guarantees.
    """
    # Rule of thumb: 10x features for linear tasks
    # 100x features for moderate tasks (GBT, shallow NN)
    # Much more for complex tasks (deep learning)
    multipliers = {'linear': 10, 'moderate': 100, 'complex': 1000}
    base_estimate = multipliers[task_complexity] * n_features
    # Imbalanced datasets need more data, in inverse proportion to minority frequency
    # To get 1000 minority examples: need 1000 / class_balance total
    if class_balance < 0.1:
        imbalance_factor = 1.0 / class_balance
    else:
        imbalance_factor = 1.0
    # Higher targets need disproportionately more data: this factor grows like
    # 1 / (1 - target) and is scaled so that it is ~1.0 at a target of 0.9
    performance_factor = 1.0 / (1.0 - target_performance + 1e-6) * 0.1
    estimate = int(base_estimate * imbalance_factor * performance_factor)
    return {
        'min_samples': estimate,
        'recommended_samples': estimate * 5,
        'note': 'These are heuristics - validate with learning curves'
    }
# Example: fraud detection
needs = estimate_data_needs(
    task_complexity='moderate',
    n_features=50,
    class_balance=0.02,  # 2% fraud rate
    target_performance=0.85
)
print(needs)
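The heuristic's own note says to validate with learning curves. A minimal sketch of that check follows, assuming scikit-learn's `learning_curve` helper, a synthetic dataset, and a logistic regression as stand-ins for your real data and model. If the validation score is still climbing at the largest training size, more data is likely to help; if it has flattened, it probably will not.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in for a real dataset (assumption for illustration)
X, y = make_classification(n_samples=2000, n_features=20,
                           n_informative=10, random_state=0)

# Train on 10%..100% of the available training folds, 5-fold CV at each size
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
    scoring="roc_auc",
)

val_mean = val_scores.mean(axis=1)
for n, score in zip(sizes, val_mean):
    print(f"n_train={n:5d}  validation AUC={score:.4f}")

# Gain from the last increase in data: near zero suggests a plateau
print(f"AUC gain from last step: {val_mean[-1] - val_mean[-2]:+.4f}")
```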
Labeling strategy
| Strategy | When to use | Cost | Quality |
|---|---|---|---|
| Human annotation | Gold standard required | High | High |
| Crowdsourcing (MTurk) | Non-expert task, volume needed | Medium | Medium (needs QA) |
| Weak supervision (Snorkel) | Expert rules, programmatic labels | Low | Medium |
| Active learning | Limited budget, high uncertainty guidance | Medium | High per label |
| Programmatic / heuristic | Logs, clicks, behavior signals | Very low | Varies (noisy) |
| Self-supervised | Pretraining, no labels needed | Very low | N/A (no labels) |
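The "needs QA" note on crowdsourced labels usually means collecting several votes per item and checking agreement. The sketch below shows one simple way to do that, majority voting with an agreement threshold. The function name `aggregate_labels`, the 0.6 threshold, and the sample votes are illustrative assumptions, not a standard API.

```python
from collections import Counter


def aggregate_labels(votes_per_item: dict, min_agreement: float = 0.6) -> dict:
    """Majority-vote each item; withhold the label when annotators disagree too much."""
    results = {}
    for item_id, votes in votes_per_item.items():
        label, count = Counter(votes).most_common(1)[0]
        agreement = count / len(votes)
        results[item_id] = {
            # None marks items to send back for expert review
            "label": label if agreement >= min_agreement else None,
            "agreement": agreement,
        }
    return results


# Hypothetical annotator votes for three emails
votes = {
    "a": ["spam", "spam", "ham"],   # 2/3 agree -> accepted
    "b": ["ham", "spam"],           # 1/2 agree -> withheld
    "c": ["ham", "ham", "ham"],     # unanimous
}
aggregated = aggregate_labels(votes)
for item_id, result in aggregated.items():
    print(item_id, result)
```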
Stage 3 - Exploratory Data Analysis
EDA is not optional - it is where you find data problems before they become model problems.
Critical EDA steps for ML
import pandas as pd
import numpy as np
from sklearn.datasets import make_classification
# Load your data
np.random.seed(42)
X, y = make_classification(n_samples=10000, n_features=20, n_informative=10,
weights=[0.9, 0.1], random_state=42)
df = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(20)])
df['target'] = y
print("=== Basic info ===")
print(f"Shape: {df.shape}")
print(f"Missing values:\n{df.isnull().sum().sum()} total")
print(f"\nClass balance:")
print(df['target'].value_counts(normalize=True))
# Critical: if imbalanced, accuracy is a misleading metric
print("\n=== Feature distributions ===")
print(df.describe().round(2))
# Check for constant or near-constant features
variances = df.drop('target', axis=1).var()
low_var_features = variances[variances < 0.01].index.tolist()
print(f"\nLow variance features (may be useless): {low_var_features}")
# Check for duplicates
n_dups = df.duplicated().sum()
print(f"\nDuplicate rows: {n_dups}")
# Correlation with target
from scipy import stats
correlations = []
for col in df.columns[:-1]:
    corr, _ = stats.pointbiserialr(df[col], df['target'])
    correlations.append((col, abs(corr)))
correlations.sort(key=lambda x: x[1], reverse=True)
print(f"\nTop 5 features by correlation with target:")
for feat, corr in correlations[:5]:
    print(f"  {feat}: {corr:.4f}")
Leakage audit
Leakage - where information from the future or from the test set bleeds into training - is one of the most common causes of falsely optimistic offline metrics.
# Common leakage patterns to check manually:
leakage_checklist = {
    "Temporal leakage": [
        "Are all features computed from data strictly before the prediction time?",
        "Does the feature pipeline use any data that would not exist at serving time?",
        "Are timestamps correctly ordered in your train/test split?"
    ],
    "Target leakage": [
        "Is any feature derived from the target variable or its future values?",
        "Could any feature be unavailable at prediction time (e.g., only known after outcome)?",
        "Example: predicting at admission with 'days_in_hospital' as a feature - it is only known at discharge"
    ],
    "Pipeline leakage": [
        "Is normalization (mean/std) fitted on the full dataset or only train?",
        "Are feature selection decisions made using the full dataset or only train?",
        "Is the test set ever seen by any preprocessing step?"
    ],
    "Group leakage": [
        "Are the same users/patients/entities in both train and test?",
        "For medical data: same patient in train and test is leakage",
        "For user behavior: same user in train and test rewards memorizing that user"
    ]
}
# CRITICAL: Always fit preprocessors on TRAIN only
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
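# Stratify so the 90/10 class balance is preserved in both splits; fix the seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)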
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) # fit on TRAIN only
X_test_scaled = scaler.transform(X_test) # transform TEST with TRAIN stats
# NEVER: scaler.fit_transform(X_test) -- that's leakage
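The temporal and group items in the checklist come down to how the split is made. Below is a minimal standard-library sketch of both ideas: a deterministic hash assigns every entity to exactly one side, and a cutoff date separates past from future. The helper names `split_for_entity` and `temporal_split` and the synthetic rows are assumptions for illustration; scikit-learn's `GroupShuffleSplit` and `TimeSeriesSplit` cover the same ground for array data.

```python
import hashlib
from datetime import date


def split_for_entity(entity_id, test_fraction: float = 0.2) -> str:
    """Map an entity to 'train' or 'test' by hashing its id (stable across runs)."""
    digest = hashlib.sha256(str(entity_id).encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 2**32  # uniform-ish value in [0, 1)
    return "test" if bucket < test_fraction else "train"


def temporal_split(rows, cutoff):
    """Train on rows strictly before the cutoff, test on the rest."""
    train = [r for r in rows if r["day"] < cutoff]
    test = [r for r in rows if r["day"] >= cutoff]
    return train, test


# Synthetic event rows: 50 users, 28 days (illustrative data only)
rows = [{"user_id": f"user_{i % 50}", "day": date(2024, 1, 1 + i % 28)}
        for i in range(500)]

# Group split: all rows of a user land on the same side
train_rows = [r for r in rows if split_for_entity(r["user_id"]) == "train"]
test_rows = [r for r in rows if split_for_entity(r["user_id"]) == "test"]
train_users = {r["user_id"] for r in train_rows}
test_users = {r["user_id"] for r in test_rows}
print(f"users in both splits: {len(train_users & test_users)}")

# Temporal split: nothing in train is later than anything in test
past, future = temporal_split(rows, cutoff=date(2024, 1, 22))
print(f"train rows: {len(past)}, test rows: {len(future)}")
```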
Stage 4 - Feature Engineering
Feature engineering is the art of representing the input data in a form that makes the learning problem easier. (Covered in depth in Lesson 04 - this section is workflow context.)
Key principle: Features should encode domain knowledge about what predicts the target. A good feature can improve a simple model more than switching to a complex model.
import numpy as np
import pandas as pd
# Example: e-commerce click prediction
# Raw features → engineered features
def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Transform raw event log into ML-ready feature vector."""
    # Reuse the input index so engineered rows stay aligned with df
    features = pd.DataFrame(index=df.index)
    # Temporal features (extract signal from timestamps)
    ts = pd.to_datetime(df['timestamp'])
    features['hour_of_day'] = ts.dt.hour
    features['is_weekend'] = ts.dt.dayofweek >= 5
    # User behavior features (aggregate history → summary statistics)
    features['user_30d_clicks'] = df['user_click_count_30d']
    # +1 in the denominator avoids division by zero for new users
    features['user_avg_order_value'] = df['user_total_spend'] / (df['user_orders'] + 1)
    # Item features
    features['item_popularity_log'] = np.log1p(df['item_view_count'])
    features['item_click_rate'] = df['item_clicks'] / (df['item_impressions'] + 1)
    # User-item interaction (cross features)
    features['category_match'] = (df['user_top_category'] == df['item_category']).astype(int)
    return features
Stage 5 - Baseline Model First
This is the principle that most engineers skip and then regret.
Rule: Before building a complex model, build the simplest possible model that can serve as a baseline. The baseline serves three purposes:
- Sanity check: If your complex model cannot beat a logistic regression, something is wrong - with the data, the features, or the problem framing.
- Reference point: Every improvement from the baseline is measurable and explainable.
- Production alternative: In many cases, the baseline is good enough to ship, saving months of engineering.
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.dummy import DummyClassifier
from sklearn.metrics import roc_auc_score, f1_score
from sklearn.preprocessing import StandardScaler
import numpy as np
def evaluate_baselines(X_train, X_test, y_train, y_test):
    """
    Always evaluate these before building complex models.
    If you can't beat DummyClassifier, check your data.
    If you can't beat LogisticRegression, check your features.
    """
    baselines = {
        "Most frequent class (dummy)": DummyClassifier(strategy='most_frequent'),
        "Random predictions (dummy)": DummyClassifier(strategy='stratified'),
        "Logistic Regression": LogisticRegression(max_iter=1000, C=1.0),
        "Decision Tree (depth=3)": DecisionTreeClassifier(max_depth=3),
    }
    # Fit the scaler on TRAIN only, then apply it to TEST
    scaler = StandardScaler()
    X_tr = scaler.fit_transform(X_train)
    X_te = scaler.transform(X_test)
    results = {}
    for name, model in baselines.items():
        model.fit(X_tr, y_train)
        if hasattr(model, 'predict_proba'):
            y_prob = model.predict_proba(X_te)[:, 1]
            auc = roc_auc_score(y_test, y_prob)
        else:
            auc = 0.5  # no scores available: treat as chance level
        f1 = f1_score(y_test, model.predict(X_te), zero_division=0)
        results[name] = {'AUC': auc, 'F1': f1}
    return results
# Run this before any complex model development
# Document the baseline numbers - you will need them for model comparison
The baseline escalation ladder:
Level 1: Constant prediction (always predict majority class) → sanity check
Level 2: Rules-based heuristic → what the team already knows
Level 3: Logistic regression → linear signal exists?
Level 4: Gradient boosted tree → nonlinear signal?
Level 5: Neural network → when GBT is not enough
Level 6: Pretrained model + fine-tune → when NN is not enough
Move up only when the previous level is insufficient. Each step adds training time, complexity, debugging surface, and serving cost.
Stage 6 - Model Development and Experimentation
Experiment tracking
Every model training run should be logged. This is not optional in a team setting.
# Using MLflow for experiment tracking (or Weights & Biases, Neptune, etc.)
import mlflow
import mlflow.sklearn
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score, f1_score
def train_with_tracking(
    X_train, X_test, y_train, y_test,
    n_estimators: int = 100,
    max_depth: int = 4,
    learning_rate: float = 0.1,
    experiment_name: str = "fraud_detection_v1"
):
    mlflow.set_experiment(experiment_name)
    with mlflow.start_run():
        # Log hyperparameters
        mlflow.log_params({
            'n_estimators': n_estimators,
            'max_depth': max_depth,
            'learning_rate': learning_rate,
            'n_train': len(X_train),
            'n_test': len(X_test)
        })
        # Train
        model = GradientBoostingClassifier(
            n_estimators=n_estimators,
            max_depth=max_depth,
            learning_rate=learning_rate
        )
        model.fit(X_train, y_train)
        # Evaluate
        y_prob = model.predict_proba(X_test)[:, 1]
        auc = roc_auc_score(y_test, y_prob)
        f1 = f1_score(y_test, model.predict(X_test))
        # Log metrics
        mlflow.log_metrics({'test_auc': auc, 'test_f1': f1})
        mlflow.sklearn.log_model(model, "model")
        print(f"AUC: {auc:.4f}, F1: {f1:.4f}")
    return model, auc
# Never make a training run you cannot reproduce or compare to a baseline
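Reproducing a run requires knowing exactly which configuration and which data produced it. A minimal sketch, assuming nothing beyond the standard library: hash the hyperparameters and the training rows, and store both hashes with the run. The function names and the `run_record` layout are hypothetical; in practice these values could be logged as extra parameters in whichever tracker you use.

```python
import hashlib
import json


def fingerprint_config(params: dict) -> str:
    """Order-independent hash of a hyperparameter dict."""
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def fingerprint_rows(rows) -> str:
    """Hash of the training rows, in order; any changed value changes the hash."""
    h = hashlib.sha256()
    for row in rows:
        h.update(json.dumps(row, sort_keys=True).encode("utf-8"))
        h.update(b"\n")
    return h.hexdigest()[:12]


# Illustrative config and data
params = {"n_estimators": 100, "max_depth": 4, "learning_rate": 0.1}
train_rows = [{"amount": 12.5, "label": 0}, {"amount": 990.0, "label": 1}]

run_record = {
    "config_hash": fingerprint_config(params),
    "data_hash": fingerprint_rows(train_rows),
    "seed": 42,
}
print(run_record)
```

Two runs with matching hashes and seed should be directly comparable; a differing `data_hash` tells you the comparison is confounded by a data change.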
Stage 7 - Offline Evaluation
Offline evaluation is the practice of measuring model quality on held-out data before deployment. It must be designed carefully to avoid the recommendation system failure story that opened this lesson.
Slice-based evaluation
Aggregate metrics hide failures in subgroups:
from sklearn.metrics import roc_auc_score
import pandas as pd
import numpy as np
def sliced_evaluation(
    y_true: np.ndarray,
    y_pred_proba: np.ndarray,
    metadata: pd.DataFrame,
    slice_columns: list
) -> pd.DataFrame:
    """
    Evaluate model performance on each slice of the data.
    Why: A model with 92% overall AUC might have 60% AUC on
    a specific demographic or product category.
    You need to know this before deploying.
    """
    results = []
    # Overall performance
    results.append({
        'slice': 'OVERALL',
        'size': len(y_true),
        'auc': roc_auc_score(y_true, y_pred_proba)
    })
    # Per-slice performance
    for col in slice_columns:
        for value in metadata[col].unique():
            # Plain boolean array, so indexing does not depend on the DataFrame index
            mask = (metadata[col] == value).to_numpy()
            if mask.sum() < 30:  # skip tiny slices
                continue
            try:
                auc = roc_auc_score(y_true[mask], y_pred_proba[mask])
            except ValueError:
                auc = float('nan')  # only one class in slice
            results.append({
                'slice': f'{col}={value}',
                'size': int(mask.sum()),
                'auc': auc
            })
    # Worst slices first; slices with undefined AUC sort last
    df = pd.DataFrame(results).sort_values('auc')
    return df
# Usage: always check if any slice has significantly worse performance
# A model that works for majority users but fails for minority groups
# is a deployment risk, not just an ethical issue
Stage 8 - Deployment
Model deployment is an engineering problem, not just a packaging problem.
Deployment modes
| Mode | Description | When to use |
|---|---|---|
| Batch inference | Run predictions on a schedule (nightly, hourly) | When predictions don't need to be real-time |
| Online inference (REST API) | Real-time predictions via HTTP endpoint | When low latency is required (e.g., < 100 ms) |
| Streaming inference | Predictions on a streaming data pipeline (Kafka) | Event-driven systems |
| Edge inference | Model runs on device (mobile, IoT) | When network latency is unacceptable |
Shadow mode (safe deployment)
import logging
import pandas as pd
logger = logging.getLogger(__name__)
# Shadow mode: new model runs in parallel with production model
# Predictions logged but NOT served to users
# Lets you validate real-world performance before switching traffic
class ShadowDeployment:
    def __init__(self, production_model, shadow_model):
        self.production = production_model
        self.shadow = shadow_model
        self.shadow_log = []

    def predict(self, features):
        # Production prediction: this is what the user sees
        prod_pred = self.production.predict_proba(features)
        # Shadow prediction: logged for analysis, never served
        try:
            shadow_pred = self.shadow.predict_proba(features)
            # Logs the first row only (assumes one request per call)
            self.shadow_log.append({
                'prod_score': prod_pred[0, 1],
                'shadow_score': shadow_pred[0, 1],
                'agreement': abs(prod_pred[0, 1] - shadow_pred[0, 1]) < 0.1
            })
        except Exception:
            # Shadow model errors must never affect production,
            # but record them so shadow failures stay visible
            logger.exception("Shadow model prediction failed")
        return prod_pred  # only return production prediction to user

    def analyze_shadow(self):
        """After N samples, analyze shadow vs production."""
        df = pd.DataFrame(self.shadow_log)
        print(f"Shadow agreement rate: {df['agreement'].mean():.3f}")
        print(f"Shadow vs prod correlation: {df['prod_score'].corr(df['shadow_score']):.4f}")
Stage 9 - Online Evaluation: A/B Testing
Offline metrics are necessary but not sufficient. A/B testing measures the actual business metric impact of your model.
import numpy as np
from scipy import stats
def ab_test_significance(
    control_conversions: int,
    control_impressions: int,
    treatment_conversions: int,
    treatment_impressions: int,
    alpha: float = 0.05
) -> dict:
    """
    One-sided two-proportion z-test for A/B test significance.
    Returns whether the treatment (new model) significantly
    outperforms the control (old model) on the conversion metric.
    """
    p_control = control_conversions / control_impressions
    p_treatment = treatment_conversions / treatment_impressions
    # Pooled proportion under H₀ (p_control == p_treatment)
    p_pool = ((control_conversions + treatment_conversions) /
              (control_impressions + treatment_impressions))
    # Standard error
    se = np.sqrt(p_pool * (1 - p_pool) *
                 (1/control_impressions + 1/treatment_impressions))
    # Z-statistic
    z = (p_treatment - p_control) / se
    # One-tailed p-value; sf(z) is 1 - cdf(z) without the precision loss for large z
    p_value = stats.norm.sf(z)
    relative_lift = (p_treatment - p_control) / p_control * 100
    return {
        'control_rate': p_control,
        'treatment_rate': p_treatment,
        'relative_lift_pct': relative_lift,
        'z_statistic': z,
        'p_value': p_value,
        'significant': p_value < alpha,
        'conclusion': 'Launch' if (p_value < alpha and relative_lift > 0) else 'Do not launch'
    }
# Example: CTR experiment
result = ab_test_significance(
control_conversions=12000,
control_impressions=500000, # 2.4% CTR
treatment_conversions=13200,
treatment_impressions=500000 # 2.64% CTR
)
print(f"Lift: {result['relative_lift_pct']:.1f}%")
print(f"p-value: {result['p_value']:.4f}")
print(f"Recommendation: {result['conclusion']}")
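A significance test only means something if the experiment was sized in advance. The sketch below estimates the sample size per arm for a one-sided two-proportion test, using the standard normal-approximation formula n = (z_alpha + z_beta)^2 * (p1(1 - p1) + p2(1 - p2)) / (p2 - p1)^2. The function name `samples_per_arm` and the default alpha of 0.05 and power of 0.8 are illustrative choices, and the formula is an approximation.

```python
import math
from statistics import NormalDist


def samples_per_arm(p_control: float, p_treatment: float,
                    alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate sample size per arm to detect p_treatment vs p_control (one-sided)."""
    if p_control == p_treatment:
        raise ValueError("The two rates must differ to define an effect size")
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # critical value for the test
    z_beta = NormalDist().inv_cdf(power)       # quantile for the desired power
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_treatment - p_control
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


# Same baseline CTR as the example above (2.4%), two candidate lifts
n_10pct_lift = samples_per_arm(0.024, 0.0264)  # 10% relative lift
n_5pct_lift = samples_per_arm(0.024, 0.0252)   # 5% relative lift
print(f"10% lift: {n_10pct_lift:,} impressions per arm")
print(f" 5% lift: {n_5pct_lift:,} impressions per arm")
```

Halving the detectable lift roughly quadruples the required traffic, which is why the minimum effect worth detecting should be agreed before the test starts.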
Stage 10 - Monitoring
Models degrade silently. Without monitoring, you discover degradation from user complaints or revenue drops.
import numpy as np
from scipy import stats
class ModelMonitor:
    """
    Basic production model monitor.
    Tracks feature distributions and prediction distributions.
    """
    def __init__(self, baseline_features: np.ndarray, baseline_predictions: np.ndarray):
        """Initialize with training/validation distribution as baseline."""
        # Keep the raw baseline samples so drift tests compare real
        # distributions rather than a Gaussian approximation of them
        self.baseline_features = np.asarray(baseline_features)
        self.baseline_predictions = np.asarray(baseline_predictions)
        # Summary statistics, useful for dashboards and quick inspection
        self.baseline_feature_stats = {
            'mean': self.baseline_features.mean(axis=0),
            'std': self.baseline_features.std(axis=0),
            'p25': np.percentile(self.baseline_features, 25, axis=0),
            'p75': np.percentile(self.baseline_features, 75, axis=0),
        }
        self.baseline_pred_stats = {
            'mean': self.baseline_predictions.mean(),
            'std': self.baseline_predictions.std(),
        }

    def check_feature_drift(
        self,
        current_features: np.ndarray,
        ks_threshold: float = 0.05
    ) -> list:
        """
        Two-sample Kolmogorov-Smirnov test for feature distribution drift.
        ks_threshold is the p-value cutoff, not a cutoff on the KS statistic.
        Returns list of features that have drifted significantly.
        With many features, expect some false alarms at a fixed cutoff.
        """
        drifted_features = []
        n_features = current_features.shape[1]
        for i in range(n_features):
            # Compare the current feature distribution to the stored baseline
            ks_stat, p_value = stats.ks_2samp(
                self.baseline_features[:, i], current_features[:, i]
            )
            if p_value < ks_threshold:
                drifted_features.append({
                    'feature_index': i,
                    'ks_stat': ks_stat,
                    'p_value': p_value
                })
        return drifted_features

    def check_prediction_drift(
        self,
        current_predictions: np.ndarray,
        psi_threshold: float = 0.2
    ) -> dict:
        """
        Population Stability Index (PSI) for prediction drift.
        PSI < 0.1: no significant change
        PSI 0.1-0.2: moderate change, investigate
        PSI > 0.2: significant change, model likely degraded
        """
        # Bin edges come from the baseline deciles (the usual PSI convention)
        bins = np.unique(np.percentile(self.baseline_predictions, np.arange(0, 110, 10)))
        if len(bins) < 2:
            return {'psi': 0.0, 'status': 'insufficient_data'}
        # Clip so current values outside the baseline range fall in the outer bins
        current = np.clip(current_predictions, bins[0], bins[-1])
        # +1 smoothing avoids log(0) and division by zero for empty bins
        baseline_counts = np.histogram(self.baseline_predictions, bins=bins)[0] + 1
        current_counts = np.histogram(current, bins=bins)[0] + 1
        baseline_pct = baseline_counts / baseline_counts.sum()
        current_pct = current_counts / current_counts.sum()
        psi = float(np.sum((current_pct - baseline_pct) * np.log(current_pct / baseline_pct)))
        return {
            'psi': psi,
            'status': 'stable' if psi < 0.1 else 'warning' if psi < psi_threshold else 'degraded'
        }
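For reference, the PSI computed in `check_prediction_drift` can be written out as follows, with $b_i$ and $c_i$ denoting the fraction of baseline and current predictions falling in bin $i$ (symbols introduced here for illustration):

```latex
\mathrm{PSI} = \sum_{i=1}^{B} \left(c_i - b_i\right)\,\ln\frac{c_i}{b_i}

% Worked example with B = 2 bins, b = (0.5, 0.5), c = (0.6, 0.4):
\mathrm{PSI} = (0.6 - 0.5)\ln\frac{0.6}{0.5} + (0.4 - 0.5)\ln\frac{0.4}{0.5}
             \approx 0.0182 + 0.0223 \approx 0.041
```

Every term is non-negative, because $c_i - b_i$ and $\ln(c_i / b_i)$ always share a sign. PSI is therefore zero only when the two distributions match bin for bin. In the worked example, a 10-point shift between two bins yields about 0.04, which falls in the "no significant change" band.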
The Data Flywheel
The data flywheel is the virtuous cycle that makes ML systems compound in value over time:
Deploy model
↓
Model makes predictions that affect user behavior
↓
User behavior generates new labeled data
(clicks, conversions, complaints, labels from downstream outcomes)
↓
New data improves the next model version
↓
Better model serves more users, generates more data
↓
[Cycle continues - data moat compounds]
Production example: A spam filter deployed in Gmail receives a continuous stream of labeled examples: every email a user marks as spam or "not spam" is a user-provided label. Every day the model is in production, the training set grows. After a few years, the training set can be orders of magnitude larger than it was at launch, and the model can improve accordingly without any additional labeling cost.
Engineering implication: Design your data pipeline from day one to collect feedback from production. This means:
- Logging model inputs (the features at prediction time)
- Logging model predictions
- Capturing ground truth labels when they become available
- Linking predictions to outcomes for delayed labeling (fraud outcome known 30 days later)
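The four logging requirements above can be sketched with an in-memory store. The idea is to give every prediction an id at serving time, then attach the outcome to that id whenever it arrives. The function names, record fields, and the dictionary standing in for a durable log are all assumptions for illustration.

```python
import uuid
from datetime import datetime, timezone

# Stand-in for a durable prediction log (a table or event stream in practice)
prediction_log = {}


def log_prediction(features: dict, score: float, model_version: str) -> str:
    """Record inputs and output at prediction time; return an id for later joins."""
    prediction_id = str(uuid.uuid4())
    prediction_log[prediction_id] = {
        "features": features,
        "score": score,
        "model_version": model_version,
        "predicted_at": datetime.now(timezone.utc),
        "label": None,  # filled in when ground truth arrives
    }
    return prediction_id


def attach_outcome(prediction_id: str, label: int) -> bool:
    """Link a delayed ground-truth label to its prediction."""
    record = prediction_log.get(prediction_id)
    if record is None:
        return False
    record["label"] = label
    record["labeled_at"] = datetime.now(timezone.utc)
    return True


def training_examples() -> list:
    """Only predictions with a resolved outcome become training data."""
    return [(r["features"], r["label"])
            for r in prediction_log.values() if r["label"] is not None]


# Two transactions scored today; one fraud outcome arrives later
first = log_prediction({"amount": 950.0}, score=0.91, model_version="v3")
second = log_prediction({"amount": 12.0}, score=0.02, model_version="v3")
attach_outcome(first, label=1)
print(f"logged: {len(prediction_log)}, labeled: {len(training_examples())}")
```

Because the logged features are the ones the model actually saw, training on them later also sidesteps recomputation differences between offline and online pipelines.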
Where Projects Actually Fail
From analysis of ML project failures, the approximate share of failures originating at each stage is:
| Stage | Share of Failures | Most Common Failure |
|---|---|---|
| Problem framing | ~15% | ML objective doesn't align with business metric |
| Data collection | ~20% | Insufficient data, wrong data, biased labeling |
| EDA/data quality | ~15% | Leakage discovered post-deployment |
| Feature engineering | ~15% | Train-serve skew (features computed differently offline vs. online) |
| Model development | ~5% | Overfitting, wrong architecture for the task |
| Offline evaluation | ~10% | Wrong test set, inflated metrics, no slice analysis |
| Deployment | ~10% | Latency, memory, dependency issues |
| Monitoring | ~10% | Silent degradation undetected for weeks or months |
Key insight: Only about 5% of failures originate at the model development stage - the stage that gets the most textbook coverage. The other ~95% are process, data, and system failures.
Iterating Fast: The Baseline-First Protocol
Week 1: Problem framing + data audit + EDA
→ Output: clear success metric, data quality report, baseline dataset
Week 2: Feature engineering + baseline model
→ Output: baseline metrics, slice analysis, leakage-free evaluation
Week 3: Model iteration (GBT → NN if needed)
→ Output: model checkpoint with reproducible training script
Week 4: Offline evaluation + shadow deployment
→ Output: comprehensive evaluation report, shadow traffic analysis
Week 5+: A/B test → monitor → iterate
This timeline is not rigid - it is a mindset. Complexity is added only when the simpler approach is insufficient. The goal is to have something in production (even a simple model) as fast as possible, because production data and user feedback are irreplaceable.
:::note Role-specific perspective
Data Scientist: Your job is not to train the best model in isolation - it is to define and validate the problem, ensure the data is sound, and connect offline metrics to online business outcomes.
ML Engineer: You own the training-serving infrastructure - feature pipelines, model serving, monitoring. Train-serve skew (features computed differently at training time vs. serving time) is your most dangerous failure mode.
MLE (ML Engineering): You connect the data scientist's model to production. Reproducibility, versioning, rollback capability, and latency budgets are your primary concerns.
Research Engineer: You push state-of-the-art. But even research has a workflow - ablation studies, held-out test sets, and baselines are not optional. A new architecture that doesn't beat the baseline on a fixed test set is not a contribution.
:::
Interview Questions
Q1: Walk me through the ML workflow from a business problem to a deployed model. Where do projects most commonly fail?
The ML workflow:
- Problem framing: Translate business objective to ML objective. Define success metric. Validate that ML is the right tool. Identify the baseline.
- Data collection and EDA: Assess data availability and quality. Audit for leakage. Check class balance, feature distributions, and temporal ordering.
- Feature engineering: Build a clean, leakage-free feature set. Document all features. Ensure training features match serving features (train-serve consistency).
- Baseline model: Build the simplest model first. If logistic regression achieves 90% of the target performance, stop there.
- Model development: Experiment tracking from day one. Hyperparameter optimization on validation set (not test set).
- Offline evaluation: Evaluate on a held-out test set. Perform slice analysis. Check calibration. Validate that offline metrics correlate with the business metric.
- Deployment: Shadow mode, then canary (5% traffic), then full rollout.
- Online evaluation: A/B test or bandit-based experiment. Measure business metric impact, not just model metrics.
- Monitoring: Track feature drift, prediction drift, and business metric degradation.
Where projects most commonly fail: In order of frequency: data quality (leakage, label noise, insufficient volume), problem framing (wrong ML objective), feature engineering (train-serve skew), and offline evaluation (test set constructed incorrectly). Model architecture failures account for only ~5% of failures despite being the most studied in textbooks.
Q2: What is train-serve skew and how do you prevent it?
Train-serve skew occurs when features are computed differently during model training than during model serving in production. This is one of the most common and most subtle production ML failures.
Example: During training, you compute the feature "user's average purchase value" by dividing total lifetime spend by number of purchases. During serving, due to a different code path in the real-time feature server, you divide by number of distinct items instead. The model was trained on one quantity and served a different one. Performance degrades, but there is no error - just wrong predictions.
How to prevent it:
- Single source of truth for feature computation: Use a feature store (Feast, Tecton, Vertex Feature Store) where features are computed once and shared across training and serving pipelines.
- Training-serving parity tests: Write automated tests that compute the same feature from the same raw data in both the training pipeline and the serving pipeline and assert that the results are identical (or within floating-point tolerance).
- Log-and-replay: During training, use logged features from production (the actual feature vectors that were used to make predictions) rather than recomputing features offline. This guarantees parity because you are training on the exact feature representation that will be used at serving time.
- Code sharing: The feature computation code should be a shared library, not duplicated between training and serving. If there is a bug in the computation, it exists in one place and is fixed in one place.
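A parity test of the kind described above can be sketched in a few lines. It reuses the average-purchase-value example: one serving path shares the training code, while the other reproduces the "divide by distinct items" bug. All function names and rows here are hypothetical.

```python
import math


def avg_purchase_value(total_spend: float, n_purchases: int) -> float:
    """Shared feature definition: the single source of truth."""
    return total_spend / n_purchases if n_purchases else 0.0


def training_features(row: dict) -> dict:
    return {"avg_purchase_value": avg_purchase_value(row["total_spend"], row["n_purchases"])}


def serving_features(row: dict) -> dict:
    # Correct path: calls the same shared function as training
    return {"avg_purchase_value": avg_purchase_value(row["total_spend"], row["n_purchases"])}


def serving_features_buggy(row: dict) -> dict:
    # The skew from the example: divides by distinct items instead of purchases
    return {"avg_purchase_value": row["total_spend"] / row["n_distinct_items"]}


def find_skew(offline_fn, online_fn, rows, rel_tol: float = 1e-9) -> list:
    """Return (row index, feature, offline value, online value) for every mismatch."""
    mismatches = []
    for i, row in enumerate(rows):
        offline, online = offline_fn(row), online_fn(row)
        for name, offline_value in offline.items():
            if not math.isclose(offline_value, online[name], rel_tol=rel_tol):
                mismatches.append((i, name, offline_value, online[name]))
    return mismatches


rows = [
    {"total_spend": 300.0, "n_purchases": 3, "n_distinct_items": 5},
    {"total_spend": 120.0, "n_purchases": 4, "n_distinct_items": 4},
]
print("shared code:", find_skew(training_features, serving_features, rows))
print("buggy path: ", find_skew(training_features, serving_features_buggy, rows))
```

Note that the second row passes even on the buggy path because the two denominators happen to coincide, which is why parity tests need a varied sample of real rows rather than one hand-picked case.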
Q3: What is the data flywheel and how does it create a competitive moat?
The data flywheel is the virtuous cycle where: a deployed ML system makes predictions → predictions influence user behavior → user behavior generates new labeled data → new data improves the next model → better model generates better predictions → more user engagement → more data.
Why it creates a competitive moat: The quality of an ML-driven product is a function of the quality and quantity of training data. A company that has been running an ML-powered product for 5 years has 5 years of production data, user feedback, and implicit labels that a new entrant cannot acquire quickly. Even if the new entrant uses a better model architecture, the incumbent's data advantage dominates.
Real examples:
- Google Search: 20+ years of clicks, reformulations, and engagement signals
- Netflix recommendation: 150M+ users generating viewing behavior daily
- Spotify Discover Weekly: user skip/listen behavior → weekly-updated playlists
Engineering implications for designing a data flywheel:
- Log model inputs at prediction time (not just outputs)
- Design feedback collection into the product (explicit or implicit)
- Build delayed-label pipelines (outcome may not be known until days later)
- Version your data - know exactly what data trained each model version
- Build infrastructure for retraining on a schedule (weekly, daily, hourly depending on the domain)
Q4: Why is offline evaluation not sufficient, and how do you design an online evaluation strategy?
Offline evaluation on a held-out test set measures: "Does my model predict the historical labels in my test set better than the baseline?" This answers a necessary but not sufficient question for deployment.
Why offline is insufficient:
- Metric misalignment: The offline metric (AUC, F1) may not correlate with the business metric (revenue, engagement). The opening story in this lesson illustrates exactly this.
- Distribution shift: The test set is historical. Production data in the future may have a different distribution, making offline performance a poor predictor of live performance.
- Feedback loops: The current production model affects what data exists (e.g., a recommendation model determines what users see, which determines what they click, which becomes training data). New models may perform very differently in this feedback loop.
- User adaptation: Users adapt to model behavior. A new model may initially perform worse simply because users are unfamiliar with its output style.
Online evaluation design:
- A/B testing: Randomly split users into control (old model) and treatment (new model). Measure business metrics for each group. Fix the sample size in advance with a power analysis (typically days to weeks of traffic) and run for the planned duration; stopping as soon as the result looks significant inflates the false-positive rate. Watch for novelty effects (initial boost that fades).
- Interleaving: For ranking tasks (search, recommendations), interleave results from both models and measure which model's results get more engagement. More sensitive than standard A/B testing, requires less traffic.
- Bandit experiments (multi-armed bandit): When exploring multiple model variants simultaneously, use Thompson Sampling or UCB to adaptively allocate more traffic to better-performing variants. This usually spends less traffic on weak variants than a fixed-split A/B test, at the cost of less straightforward statistical inference.
- Shadow mode first: Run the new model in parallel, log all predictions, and compare distributions against the production model before any user sees the new model's output. A cheap first check for major issues.
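The bandit option above can be illustrated with a minimal Beta-Bernoulli Thompson Sampling simulation. The true conversion rates, the number of rounds, and the function name are made-up inputs for the simulation. A real experiment would observe rewards from users instead of drawing them from known rates.

```python
import random


def thompson_sampling(true_rates, n_rounds: int, seed: int = 0) -> list:
    """Simulate traffic allocation across variants; returns pulls per variant."""
    rng = random.Random(seed)
    k = len(true_rates)
    successes = [0] * k
    failures = [0] * k
    pulls = [0] * k
    for _ in range(n_rounds):
        # Sample a plausible conversion rate per variant from its Beta posterior
        # (Beta(1, 1) prior, i.e. uniform before any data)
        samples = [rng.betavariate(successes[i] + 1, failures[i] + 1) for i in range(k)]
        arm = max(range(k), key=samples.__getitem__)
        # Simulated user response for the chosen variant
        converted = rng.random() < true_rates[arm]
        pulls[arm] += 1
        if converted:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return pulls


# Variant 1 is truly better; the sampler should shift traffic toward it
pulls = thompson_sampling(true_rates=[0.05, 0.25], n_rounds=3000)
print(f"traffic per variant: {pulls}")
```

Early rounds are split fairly evenly because both posteriors are wide. As evidence accumulates, the weaker variant is sampled less and less often.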
Q5: What is the "baseline first" principle and why do experienced ML engineers insist on it?
The baseline-first principle: before building any complex model, build the simplest possible model that is appropriate for the problem. Always have a baseline before iterating.
Why experienced engineers insist on this:
- Calibrates expectations: If a logistic regression achieves 91% AUC and your target is 93%, you know the remaining gap is small and the marginal return on model complexity will likely be low. If logistic regression achieves 65% AUC, either the features carry little signal or the signal is nonlinear; trying a gradient boosted tree on the same features tells you which.
- Detects data/feature problems early: If a logistic regression cannot beat a dummy classifier (most frequent class), the problem is almost certainly in the data or features, not the model. No amount of neural architecture tuning will fix bad data.
- Provides production insurance: In many projects, the baseline is good enough to ship. Deploying a logistic regression that achieves 90% of the target performance gives you a production system while you iterate toward the full model. The alternative (spend 3 months on the complex model first) gives you nothing until month 3.
- Anchors ablation studies: When you add a feature or increase model complexity, measuring the delta over a documented baseline is how you know whether the change helped. Without a baseline, every change is unmeasured.
- Builds intuition: The features and coefficients of a logistic regression are interpretable. They tell you which features are predictive before you obscure everything in a black-box neural network.
The practical hierarchy: dummy classifier → logistic regression → gradient boosted tree → neural network → pretrained model. Only escalate when the previous level is provably insufficient.
Key Takeaways
- The ML workflow is: problem framing → data → EDA → features → baseline → model → offline eval → deployment → online eval → monitoring → loop
- Most ML projects fail at data collection, problem framing, or feature engineering - not model architecture
- Always build a baseline model before a complex model; only escalate complexity when the simpler model is demonstrably insufficient
- Data leakage is one of the most common causes of falsely optimistic offline metrics - audit for it explicitly at every stage
- The data flywheel is the compounding value of production deployment: more data → better model → more users → more data
- Offline evaluation is necessary but not sufficient - always validate business metric impact via online A/B testing
Next: Lesson 04 - Data Representation and Feature Spaces →
