The ML Workflow - End to End

Reading time: ~22 minutes | Level: ML Foundations | Role: MLE, ML Engineer, Data Scientist, Research Engineer

A team at a large e-commerce company was given a clear mandate: "Build a product recommendation model that increases clickthrough rate." Three months later they presented results: their neural collaborative filtering model achieved an AUC of 0.87 on the offline test set, a 12-point improvement over the existing heuristic baseline.

When they deployed it, clickthrough rate went down 4%.

Post-mortem findings: The offline test set used historical data from the old recommendation system. Every "positive" example was a product that the old system had already surfaced. The new model learned to recommend popular products - exactly what the old system did. The AUC improvement was measuring "does our model agree with the old model," not "does our model improve CTR." The deployment metric (CTR) was never measured offline. The model was evaluated against the wrong objective for three months.

This story repeats in some form at almost every company doing ML. The solution is a disciplined workflow where the objective is validated before training begins, the evaluation is coupled to the deployment metric, and failure modes are identified at each stage - not just at the end.

What You Will Learn

The full ML workflow from problem framing to production monitoring
Where projects fail at each stage (with specific failure modes)
The data flywheel: why data compounds over time
How to iterate fast: baseline-first, complexity-later
The critical distinction between offline and online evaluation

The Full ML Workflow

Stage 1 - Problem Framing

Problem framing is the highest-leverage stage of the workflow. It is also the most commonly skipped.

Business objective → ML objective

Most business objectives are not directly optimizable with ML. The translation from business to ML objective is a design decision with major downstream consequences.

Business Objective	ML Objective	Failure Mode If Not Aligned
Increase revenue	Maximize CTR, then conversion	High CTR on cheap items → low revenue
Reduce fraud	Maximize recall at fixed precision	Optimizing recall alone blocks too many good transactions
Improve content moderation	Minimize FN rate	Optimizing FNs alone creates excessive FP removals
Improve search quality	Maximize nDCG@10	nDCG doesn't capture diversity - repetitive results
Increase engagement	Maximize session time	Outrage content maximizes time; destroys brand trust

The discipline: Before writing any code, write out:

What is the business metric we are trying to move?
What is the ML metric proxy?
How do we know these are aligned? (Can we construct a case where the ML metric improves but the business metric doesn't?)
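
One way to answer the third question is to build a small counterexample by hand. The sketch below uses made-up numbers (the `RankerOutcome` class, both rankers, and every count and order value are hypothetical) to show a case where CTR rises while revenue per impression falls.

```python
from dataclasses import dataclass

@dataclass
class RankerOutcome:
    """Aggregate results for one ranker (all values are illustrative)."""
    impressions: int
    clicks: int
    orders: int
    avg_order_value: float  # currency units per order

    @property
    def ctr(self) -> float:
        return self.clicks / self.impressions

    @property
    def revenue_per_impression(self) -> float:
        return self.orders * self.avg_order_value / self.impressions

# Hypothetical scenario: the new ranker attracts more clicks,
# but the clicks land on cheaper items.
old_ranker = RankerOutcome(impressions=100_000, clicks=2_000, orders=200, avg_order_value=60.0)
new_ranker = RankerOutcome(impressions=100_000, clicks=3_000, orders=240, avg_order_value=35.0)

ctr_improved = new_ranker.ctr > old_ranker.ctr
revenue_improved = new_ranker.revenue_per_impression > old_ranker.revenue_per_impression

print(f"CTR: {old_ranker.ctr:.3f} -> {new_ranker.ctr:.3f}")
print(f"Revenue per impression: {old_ranker.revenue_per_impression:.3f} "
      f"-> {new_ranker.revenue_per_impression:.3f}")
print(f"Proxy improved but business metric did not: {ctr_improved and not revenue_improved}")
```

If a counterexample like this is easy to construct with plausible numbers, the proxy probably needs a guardrail metric (here, revenue per impression) tracked alongside it.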

Is ML the right tool?

Before committing to an ML project:

What is the baseline (current system, heuristic, or human performance)?
What is the minimum improvement needed to justify ML over the baseline?
Do we have the data, compute, and team to build this system?
What is the cost of being wrong?

:::warning The most expensive stage to skip
Discovering at deployment that your ML objective doesn't match your business objective costs 3–6 months of wasted work. Discovering it at the problem framing stage costs one hour of whiteboarding.
:::

Stage 2 - Data Collection

After problem framing, the question becomes: do we have the data needed to learn the mapping?

Data requirements checklist

# Framework for estimating data requirements before collection

def estimate_data_needs(
    task_complexity: str,    # 'linear', 'moderate', 'complex'
    n_features: int,
    class_balance: float,    # minority class fraction for classification
    target_performance: float  # target accuracy/AUC
) -> dict:
    """
    Rough heuristics for data volume estimation.
    These are starting points, not guarantees.
    """

    # Rule of thumb: 10x features for linear tasks
    # 100x features for moderate tasks (GBT, shallow NN)
    # Much more for complex tasks (deep learning)
    multipliers = {'linear': 10, 'moderate': 100, 'complex': 1000}
    base_estimate = multipliers[task_complexity] * n_features

    # Imbalanced datasets need more data proportional to minority frequency
    # To get 1000 minority examples: need 1000 / class_balance total
    if class_balance < 0.1:
        imbalance_factor = 1.0 / class_balance
    else:
        imbalance_factor = 1.0

    # Data needs grow as the remaining error (1 - target) shrinks: the factor is
    # inversely proportional to the error, scaled so a 0.90 target gives ~1.0
    performance_factor = 1.0 / (1.0 - target_performance + 1e-6) * 0.1

    estimate = int(base_estimate * imbalance_factor * performance_factor)

    return {
        'min_samples': estimate,
        'recommended_samples': estimate * 5,
        'note': 'These are heuristics - validate with learning curves'
    }

# Example: fraud detection
needs = estimate_data_needs(
    task_complexity='moderate',
    n_features=50,
    class_balance=0.02,  # 2% fraud rate
    target_performance=0.85
)
print(needs)
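
The heuristic above ends with the advice to validate with learning curves. A minimal sketch of that check follows, using scikit-learn's `learning_curve` on synthetic data; the dataset, the logistic regression model, and the five training-size steps are assumptions chosen for illustration, not part of the original workflow.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in for a sample of your real data
X_lc, y_lc = make_classification(
    n_samples=3000, n_features=20, n_informative=10, random_state=0
)

# Train on growing subsets and score each on held-out folds
train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X_lc, y_lc,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
    scoring='roc_auc',
)

val_mean = val_scores.mean(axis=1)
for n, score in zip(train_sizes, val_mean):
    print(f"n_train={n:5d}  validation AUC={score:.4f}")

# If the last step still adds noticeable AUC, more data is likely to help;
# if the curve is flat, better features or a different model matter more.
marginal_gain = val_mean[-1] - val_mean[-2]
print(f"Gain from the last increase in data: {marginal_gain:+.4f}")
```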

Labeling strategy

Strategy	When to use	Cost	Quality
Human annotation	Gold standard required	High	High
Crowdsourcing (MTurk)	Non-expert task, volume needed	Medium	Medium (needs QA)
Weak supervision (Snorkel)	Expert rules, programmatic labels	Low	Medium
Active learning	Limited budget, high uncertainty guidance	Medium	High per label
Programmatic / heuristic	Logs, clicks, behavior signals	Very low	Varies (noisy)
Self-supervised	Pretraining, no labels needed	Very low	N/A (no labels)

Stage 3 - Exploratory Data Analysis

EDA is not optional - it is where you find data problems before they become model problems.

Critical EDA steps for ML

import pandas as pd
import numpy as np
from sklearn.datasets import make_classification

# Synthetic stand-in for your data (replace with your own loading code)
np.random.seed(42)
X, y = make_classification(n_samples=10000, n_features=20, n_informative=10,
                           weights=[0.9, 0.1], random_state=42)
df = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(20)])
df['target'] = y

print("=== Basic info ===")
print(f"Shape: {df.shape}")
print(f"Missing values:\n{df.isnull().sum().sum()} total")
print(f"\nClass balance:")
print(df['target'].value_counts(normalize=True))
# Critical: if imbalanced, accuracy is a misleading metric

print("\n=== Feature distributions ===")
print(df.describe().round(2))

# Check for constant or near-constant features
variances = df.drop('target', axis=1).var()
low_var_features = variances[variances < 0.01].index.tolist()
print(f"\nLow variance features (may be useless): {low_var_features}")

# Check for duplicates
n_dups = df.duplicated().sum()
print(f"\nDuplicate rows: {n_dups}")

# Correlation with target
from scipy import stats
correlations = []
for col in df.columns[:-1]:
    corr, _ = stats.pointbiserialr(df[col], df['target'])
    correlations.append((col, abs(corr)))

correlations.sort(key=lambda x: x[1], reverse=True)
print(f"\nTop 5 features by correlation with target:")
for feat, corr in correlations[:5]:
    print(f"  {feat}: {corr:.4f}")

Leakage audit

Leakage - where information from the future or from the test set bleeds into training - is the most common cause of falsely optimistic offline metrics.

# Common leakage patterns to check manually:

leakage_checklist = {
    "Temporal leakage": [
        "Are all features computed from data strictly before the prediction time?",
        "Does the feature pipeline use any data that would not exist at serving time?",
        "Are timestamps correctly ordered in your train/test split?"
    ],
    "Target leakage": [
        "Is any feature derived from the target variable or its future values?",
        "Could any feature be unavailable at prediction time (e.g., only known after outcome)?",
        "Example: using 'days_in_hospital' in a model that predicts at admission time - the value is only known at discharge"
    ],
    "Pipeline leakage": [
        "Is normalization (mean/std) fitted on the full dataset or only train?",
        "Are feature selection decisions made using the full dataset or only train?",
        "Is the test set ever seen by any preprocessing step?"
    ],
    "Group leakage": [
        "Are the same users/patients/entities in both train and test?",
        "For medical data: same patient in train and test is leakage",
        "For user behavior: same user in train and test overfits to that user"
    ]
}

# CRITICAL: Always fit preprocessors on TRAIN only
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2,
    stratify=y,       # keep the 90/10 class ratio in both splits
    random_state=42   # reproducible split
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # fit on TRAIN only
X_test_scaled = scaler.transform(X_test)          # transform TEST with TRAIN stats
# NEVER: scaler.fit_transform(X_test) -- that's leakage
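
The group leakage item in the checklist can be enforced in code rather than checked by eye. The sketch below uses scikit-learn's `GroupShuffleSplit` so that every row belonging to a given user lands on one side of the split; the `user_ids` array and the random data are hypothetical placeholders for your own entity key and features.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
n_rows = 1000

# Hypothetical data: 100 users with roughly 10 rows each
user_ids = rng.integers(0, 100, size=n_rows)
X_g = rng.normal(size=(n_rows, 5))
y_g = rng.integers(0, 2, size=n_rows)

# Split by user, not by row: test_size is the fraction of GROUPS held out
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X_g, y_g, groups=user_ids))

shared_users = set(user_ids[train_idx]) & set(user_ids[test_idx])
print(f"Train rows: {len(train_idx)}, test rows: {len(test_idx)}")
print(f"Users present in both train and test: {len(shared_users)}")
```

A plain row-level `train_test_split` on the same data would almost certainly put most users on both sides, which is exactly the leakage the checklist warns about.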

Stage 4 - Feature Engineering

Feature engineering is the art of representing the input data in a form that makes the learning problem easier. (Covered in depth in Lesson 04 - this section is workflow context.)

Key principle: Features should encode domain knowledge about what predicts the target. A good feature can improve a simple model more than switching to a complex model.

import numpy as np
import pandas as pd
from sklearn.preprocessing import (
    StandardScaler, MinMaxScaler,
    LabelEncoder, OneHotEncoder
)

# Example: e-commerce click prediction
# Raw features → engineered features

def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Transform raw event log into ML-ready feature vector."""

    features = pd.DataFrame()

    # Temporal features (extract signal from timestamps); parse once, reuse
    ts = pd.to_datetime(df['timestamp'])
    features['hour_of_day'] = ts.dt.hour
    features['is_weekend'] = ts.dt.dayofweek >= 5

    # User behavior features (aggregate history → summary statistics)
    features['user_30d_clicks'] = df['user_click_count_30d']
    features['user_avg_order_value'] = df['user_total_spend'] / (df['user_orders'] + 1)

    # Item features
    features['item_popularity_log'] = np.log1p(df['item_view_count'])
    features['item_click_rate'] = df['item_clicks'] / (df['item_impressions'] + 1)

    # User-item interaction (cross features)
    features['category_match'] = (df['user_top_category'] == df['item_category']).astype(int)

    return features

Stage 5 - Baseline Model First

This is the principle that most engineers skip and then regret.

Rule: Before building a complex model, build the simplest possible model that can serve as a baseline. The baseline serves three purposes:

Sanity check: If your complex model cannot beat a logistic regression, something is wrong - with the data, the features, or the problem framing.
Reference point: Every improvement from the baseline is measurable and explainable.
Production alternative: In many cases, the baseline is good enough to ship, saving months of engineering.

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.dummy import DummyClassifier
from sklearn.metrics import roc_auc_score, f1_score
import numpy as np

def evaluate_baselines(X_train, X_test, y_train, y_test):
    """
    Always evaluate these before building complex models.
    If you can't beat DummyClassifier, check your data.
    If you can't beat LogisticRegression, check your features.
    """
    baselines = {
        "Most frequent class (dummy)": DummyClassifier(strategy='most_frequent'),
        "Random predictions (dummy)": DummyClassifier(strategy='stratified'),
        "Logistic Regression": LogisticRegression(max_iter=1000, C=1.0),
        "Decision Tree (depth=3)": DecisionTreeClassifier(max_depth=3),
    }

    from sklearn.preprocessing import StandardScaler
    scaler = StandardScaler()
    X_tr = scaler.fit_transform(X_train)
    X_te = scaler.transform(X_test)

    results = {}
    for name, model in baselines.items():
        model.fit(X_tr, y_train)
        if hasattr(model, 'predict_proba'):
            y_prob = model.predict_proba(X_te)[:, 1]
            auc = roc_auc_score(y_test, y_prob)
        else:
            auc = 0.5
        f1 = f1_score(y_test, model.predict(X_te), zero_division=0)
        results[name] = {'AUC': auc, 'F1': f1}

    return results

# Run this before any complex model development
# Document the baseline numbers - you will need them for model comparison

The baseline escalation ladder:

Level 1: Constant prediction (always predict majority class) → sanity check
Level 2: Rules-based heuristic → what the team already knows
Level 3: Logistic regression → linear signal exists?
Level 4: Gradient boosted tree → nonlinear signal?
Level 5: Neural network → when GBT is not enough
Level 6: Pretrained model + fine-tune → when NN is not enough

Move up only when the previous level is insufficient. Each step adds training time, complexity, debugging surface, and serving cost.

Stage 6 - Model Development and Experimentation

Experiment tracking

Every model training run should be logged. This is not optional in a team setting.

# Using MLflow for experiment tracking (or Weights & Biases, Neptune, etc.)
import mlflow
import mlflow.sklearn
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score, f1_score

def train_with_tracking(
    X_train, X_test, y_train, y_test,
    n_estimators: int = 100,
    max_depth: int = 4,
    learning_rate: float = 0.1,
    experiment_name: str = "fraud_detection_v1"
):
    mlflow.set_experiment(experiment_name)

    with mlflow.start_run():
        # Log hyperparameters
        mlflow.log_params({
            'n_estimators': n_estimators,
            'max_depth': max_depth,
            'learning_rate': learning_rate,
            'n_train': len(X_train),
            'n_test': len(X_test)
        })

        # Train
        model = GradientBoostingClassifier(
            n_estimators=n_estimators,
            max_depth=max_depth,
            learning_rate=learning_rate,
            random_state=42  # fixed seed so the run can be reproduced
        )
        model.fit(X_train, y_train)

        # Evaluate
        y_prob = model.predict_proba(X_test)[:, 1]
        auc = roc_auc_score(y_test, y_prob)
        f1 = f1_score(y_test, model.predict(X_test))

        # Log metrics
        mlflow.log_metrics({'test_auc': auc, 'test_f1': f1})
        mlflow.sklearn.log_model(model, "model")

        print(f"AUC: {auc:.4f}, F1: {f1:.4f}")
        return model, auc

# Never make a training run you cannot reproduce or compare to a baseline

Stage 7 - Offline Evaluation

Offline evaluation is the practice of measuring model quality on held-out data before deployment. It must be designed carefully to avoid the failure described in the recommendation-system story that opened this lesson.

Slice-based evaluation

Aggregate metrics hide failures in subgroups:

from sklearn.metrics import roc_auc_score
import pandas as pd
import numpy as np

def sliced_evaluation(
    y_true: np.ndarray,
    y_pred_proba: np.ndarray,
    metadata: pd.DataFrame,
    slice_columns: list
) -> pd.DataFrame:
    """
    Evaluate model performance on each slice of the data.

    Why: A model with 92% overall AUC might have 60% AUC on
    a specific demographic or product category.
    You need to know this before deploying.
    """
    results = []

    # Overall performance
    results.append({
        'slice': 'OVERALL',
        'size': len(y_true),
        'auc': roc_auc_score(y_true, y_pred_proba)
    })

    # Per-slice performance
    for col in slice_columns:
        for value in metadata[col].unique():
            mask = metadata[col] == value
            if mask.sum() < 30:  # skip tiny slices
                continue
            try:
                auc = roc_auc_score(y_true[mask], y_pred_proba[mask])
            except ValueError:
                auc = float('nan')  # only one class in slice

            results.append({
                'slice': f'{col}={value}',
                'size': int(mask.sum()),
                'auc': auc
            })

    df = pd.DataFrame(results).sort_values('auc')
    return df

# Usage: always check if any slice has significantly worse performance
# A model that works for majority users but fails for minority groups
# is a deployment risk, not just an ethical issue

Stage 8 - Deployment

Model deployment is an engineering problem, not just a packaging problem.

Deployment modes

Mode	Description	When to use
Batch inference	Run predictions on a schedule (nightly, hourly)	When predictions don't need to be real-time
Online inference (REST API)	Real-time predictions via HTTP endpoint	When latency < 100ms required
Streaming inference	Predictions on a streaming data pipeline (Kafka)	Event-driven systems
Edge inference	Model runs on device (mobile, IoT)	When network latency is unacceptable

Shadow mode (safe deployment)

# Shadow mode: new model runs in parallel with production model
# Predictions logged but NOT served to users
# Lets you validate real-world performance before switching traffic

class ShadowDeployment:
    def __init__(self, production_model, shadow_model):
        self.production = production_model
        self.shadow = shadow_model
        self.shadow_log = []

    def predict(self, features):
        # Production prediction: this is what the user sees
        prod_pred = self.production.predict_proba(features)

        # Shadow prediction: logged for analysis, never served
        try:
            shadow_pred = self.shadow.predict_proba(features)
            self.shadow_log.append({
                'prod_score': prod_pred[0, 1],
                'shadow_score': shadow_pred[0, 1],
                'agreement': abs(prod_pred[0, 1] - shadow_pred[0, 1]) < 0.1
            })
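        except Exception:
            # Shadow model errors must never affect production.
            # In a real system, record the failure (log line or metric) here;
            # otherwise a broken shadow model goes unnoticed.
            pass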

        return prod_pred  # only return production prediction to user

    def analyze_shadow(self):
        """After N samples, analyze shadow vs production."""
        import pandas as pd
        df = pd.DataFrame(self.shadow_log)
        print(f"Shadow agreement rate: {df['agreement'].mean():.3f}")
        print(f"Shadow vs prod correlation: {df['prod_score'].corr(df['shadow_score']):.4f}")

Stage 9 - Online Evaluation: A/B Testing

Offline metrics are necessary but not sufficient. A/B testing measures the actual business metric impact of your model.

import numpy as np
from scipy import stats

def ab_test_significance(
    control_conversions: int,
    control_impressions: int,
    treatment_conversions: int,
    treatment_impressions: int,
    alpha: float = 0.05
) -> dict:
    """
    Two-proportion z-test for A/B test significance.

    Returns whether the treatment (new model) significantly
    outperforms the control (old model) on the conversion metric.
    """
    p_control = control_conversions / control_impressions
    p_treatment = treatment_conversions / treatment_impressions

    # Pooled proportion under H₀ (p_control == p_treatment)
    p_pool = (control_conversions + treatment_conversions) / \
             (control_impressions + treatment_impressions)

    # Standard error
    se = np.sqrt(p_pool * (1 - p_pool) *
                 (1/control_impressions + 1/treatment_impressions))

    # Z-statistic
    z = (p_treatment - p_control) / se
    p_value = stats.norm.sf(z)  # one-tailed; sf(z) = 1 - cdf(z) without precision loss for large z

    relative_lift = (p_treatment - p_control) / p_control * 100

    return {
        'control_rate': p_control,
        'treatment_rate': p_treatment,
        'relative_lift_pct': relative_lift,
        'z_statistic': z,
        'p_value': p_value,
        'significant': p_value < alpha,
        'conclusion': 'Launch' if (p_value < alpha and relative_lift > 0) else 'Do not launch'
    }

# Example: CTR experiment
result = ab_test_significance(
    control_conversions=12000,
    control_impressions=500000,  # 2.4% CTR
    treatment_conversions=13200,
    treatment_impressions=500000  # 2.64% CTR
)
print(f"Lift: {result['relative_lift_pct']:.1f}%")
print(f"p-value: {result['p_value']:.4f}")
print(f"Recommendation: {result['conclusion']}")
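
Before running a test like the one above, it helps to estimate how much traffic is needed to detect the lift you care about. The sketch below uses the standard normal-approximation formula for comparing two proportions with a one-tailed test; the function name, the 80% power default, and the 5% minimum relative lift are assumptions for illustration.

```python
import math
from scipy import stats

def required_sample_size_per_arm(
    baseline_rate: float,
    min_relative_lift: float,
    alpha: float = 0.05,
    power: float = 0.8,
) -> int:
    """
    Approximate impressions needed in EACH arm to detect a relative lift
    of `min_relative_lift` over `baseline_rate` (one-tailed z-test).
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + min_relative_lift)

    z_alpha = stats.norm.ppf(1 - alpha)  # critical value for the significance level
    z_beta = stats.norm.ppf(power)       # critical value for the desired power

    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance_sum / (p2 - p1) ** 2
    return math.ceil(n)

# Hypothetical planning question: 2.4% baseline CTR, want to detect a 5% relative lift
n_per_arm = required_sample_size_per_arm(baseline_rate=0.024, min_relative_lift=0.05)
n_per_arm_small_lift = required_sample_size_per_arm(baseline_rate=0.024, min_relative_lift=0.02)

print(f"Impressions per arm for a 5% lift: {n_per_arm:,}")
print(f"Impressions per arm for a 2% lift: {n_per_arm_small_lift:,}")
```

Smaller lifts need disproportionately more traffic because the required sample size scales with the inverse square of the absolute difference between the two rates.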

Stage 10 - Monitoring

Models degrade silently. Without monitoring, you discover degradation from user complaints or revenue drops.

import numpy as np
from scipy import stats

class ModelMonitor:
    """
    Basic production model monitor.
    Tracks feature distributions and prediction distributions.
    """

    def __init__(self, baseline_features: np.ndarray, baseline_predictions: np.ndarray):
        """Initialize with training/validation distribution as baseline."""
        self.baseline_feature_stats = {
            'mean': baseline_features.mean(axis=0),
            'std': baseline_features.std(axis=0),
            'p25': np.percentile(baseline_features, 25, axis=0),
            'p75': np.percentile(baseline_features, 75, axis=0),
        }
        self.baseline_pred_stats = {
            'mean': baseline_predictions.mean(),
            'std': baseline_predictions.std(),
        }

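    def check_feature_drift(
        self,
        current_features: np.ndarray,
        ks_threshold: float = 0.05
    ) -> list:
        """
        Kolmogorov-Smirnov test for feature distribution drift.
        Returns list of features that have drifted significantly.
        ks_threshold is the p-value cutoff: a feature is flagged when p < ks_threshold.
        With many features, expect some false alarms unless you correct for multiple tests.
        """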
        drifted_features = []
        n_features = current_features.shape[1]

        for i in range(n_features):
            # Compare current feature distribution to baseline
            # We approximate baseline with Gaussian for simplicity
            baseline_samples = np.random.normal(
                self.baseline_feature_stats['mean'][i],
                self.baseline_feature_stats['std'][i],
                size=len(current_features)
            )
            ks_stat, p_value = stats.ks_2samp(baseline_samples, current_features[:, i])

            if p_value < ks_threshold:
                drifted_features.append({
                    'feature_index': i,
                    'ks_stat': ks_stat,
                    'p_value': p_value
                })

        return drifted_features

    def check_prediction_drift(
        self,
        current_predictions: np.ndarray,
        psi_threshold: float = 0.2
    ) -> dict:
        """
        Population Stability Index (PSI) for prediction drift.
        PSI < 0.1: no significant change
        PSI 0.1-0.2: moderate change, investigate
        PSI > 0.2: significant change, model likely degraded
        """
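        # Only the baseline mean/std are stored, so approximate the baseline
        # prediction distribution with a Gaussian (seeded for repeatable results)
        rng = np.random.default_rng(0)
        baseline_samples = rng.normal(
            self.baseline_pred_stats['mean'],
            self.baseline_pred_stats['std'],
            size=10000
        )

        # Bin on the BASELINE deciles. Using interior edges only leaves the
        # outer bins open-ended, so no sample from either side is dropped.
        edges = np.unique(np.percentile(baseline_samples, np.arange(10, 100, 10)))
        if len(current_predictions) == 0:
            return {'psi': 0.0, 'status': 'insufficient_data'}

        n_bins = len(edges) + 1
        # +1 smoothing keeps empty bins from producing log(0)
        baseline_counts = np.bincount(np.searchsorted(edges, baseline_samples), minlength=n_bins) + 1
        current_counts = np.bincount(np.searchsorted(edges, current_predictions), minlength=n_bins) + 1

        baseline_pct = baseline_counts / baseline_counts.sum()
        current_pct = current_counts / current_counts.sum()

        psi = float(np.sum((current_pct - baseline_pct) * np.log(current_pct / baseline_pct)))

        return {
            'psi': psi,
            'status': 'stable' if psi < 0.1 else 'warning' if psi < psi_threshold else 'degraded'
        }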

The Data Flywheel

The data flywheel is the virtuous cycle that makes ML systems compound in value over time:

Deploy model
    ↓
Model makes predictions that affect user behavior
    ↓
User behavior generates new labeled data
(clicks, conversions, complaints, labels from downstream outcomes)
    ↓
New data improves the next model version
    ↓
Better model serves more users, generates more data
    ↓
[Cycle continues - data moat compounds]

Production example: A spam filter deployed in Gmail receives a steady stream of labeled examples: every email a user marks as spam or "not spam" is a label supplied at no annotation cost. The training set grows every day the model is in production, so a few years after launch it can be orders of magnitude larger than it was at launch, and the model can improve accordingly without additional labeling spend.

Engineering implication: Design your data pipeline from day one to collect feedback from production. This means:

Logging model inputs (the features at prediction time)
Logging model predictions
Capturing ground truth labels when they become available
Linking predictions to outcomes for delayed labeling (fraud outcome known 30 days later)
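
The four logging requirements above can be sketched with two tables joined on a prediction ID. Everything in this example is hypothetical: the column names, the 30-day label delay, and the rule that a matured prediction with no recorded fraud report counts as a negative.

```python
import pandas as pd

# Logged at prediction time: inputs, score, and model version
prediction_log = pd.DataFrame({
    'prediction_id': ['p1', 'p2', 'p3', 'p4'],
    'predicted_at': pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-03', '2024-01-20']),
    'amount': [250.0, 19.9, 75.0, 480.0],   # feature value as seen by the model
    'score': [0.91, 0.12, 0.55, 0.30],
    'model_version': ['v1', 'v1', 'v1', 'v1'],
})

# Ground truth that arrives later (e.g., chargeback reports)
outcomes = pd.DataFrame({
    'prediction_id': ['p1', 'p2'],
    'is_fraud': [1, 0],
})

as_of = pd.Timestamp('2024-02-05')
label_delay = pd.Timedelta(days=30)   # assumed time for outcomes to become known

joined = prediction_log.merge(outcomes, on='prediction_id', how='left')

# A prediction is "matured" once the full label delay has elapsed
matured = joined['predicted_at'] + label_delay <= as_of

# Assumption: matured predictions with no fraud report are treated as negatives
joined.loc[matured & joined['is_fraud'].isna(), 'is_fraud'] = 0

# Only matured rows are safe to use as training data
training_rows = joined[matured].copy()
print(training_rows[['prediction_id', 'amount', 'score', 'is_fraud']])
```

Rows that have not matured (here `p4`) are held back rather than labeled negative, which avoids training on labels that may still flip.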

Where Projects Actually Fail

The approximate share of ML project failures attributable to each stage:

Stage	Share of Failures	Most Common Failure
Problem framing	~15%	ML objective doesn't align with business metric
Data collection	~20%	Insufficient data, wrong data, biased labeling
EDA/data quality	~15%	Leakage discovered post-deployment
Feature engineering	~15%	Train-serve skew (features computed differently offline vs. online)
Model development	~5%	Overfitting, wrong architecture for the task
Offline evaluation	~10%	Wrong test set, inflated metrics, no slice analysis
Deployment	~10%	Latency, memory, dependency issues
Monitoring	~10%	Silent degradation undetected for weeks or months

Key insight: Only about 5% of failures occur at the model development stage - the stage that gets the most textbook coverage. The other 95% are process, data, and system failures.

Iterating Fast: The Baseline-First Protocol

Week 1: Problem framing + data audit + EDA
         → Output: clear success metric, data quality report, baseline dataset

Week 2: Feature engineering + baseline model
         → Output: baseline metrics, slice analysis, leakage-free evaluation

Week 3: Model iteration (GBT → NN if needed)
         → Output: model checkpoint with reproducible training script

Week 4: Offline evaluation + shadow deployment
         → Output: comprehensive evaluation report, shadow traffic analysis

Week 5+: A/B test → monitor → iterate

This timeline is not rigid - it is a mindset. Complexity is added only when the simpler approach is insufficient. The goal is to have something in production (even a simple model) as fast as possible, because production data and user feedback are irreplaceable.

:::note Role-specific perspective
Data Scientist: Your job is not to train the best model in isolation - it is to define and validate the problem, ensure the data is sound, and connect offline metrics to online business outcomes.

ML Engineer: You own the training-serving infrastructure - feature pipelines, model serving, monitoring. Train-serve skew (features computed differently at training time vs. serving time) is your most dangerous failure mode.

MLE (ML Engineering): You connect the data scientist's model to production. Reproducibility, versioning, rollback capability, and latency budgets are your primary concerns.

Research Engineer: You push state-of-the-art. But even research has a workflow - ablation studies, held-out test sets, and baselines are not optional. A new architecture that doesn't beat the baseline on a fixed test set is not a contribution.
:::

Interview Questions

Q1: Walk me through the ML workflow from a business problem to a deployed model. Where do projects most commonly fail?

The ML workflow:

Problem framing: Translate business objective to ML objective. Define success metric. Validate that ML is the right tool. Identify the baseline.
Data collection and EDA: Assess data availability and quality. Audit for leakage. Check class balance, feature distributions, and temporal ordering.
Feature engineering: Build a clean, leakage-free feature set. Document all features. Ensure training features match serving features (train-serve consistency).
Baseline model: Build the simplest model first. If logistic regression achieves 90% of the target performance, stop there.
Model development: Experiment tracking from day one. Hyperparameter optimization on validation set (not test set).
Offline evaluation: Evaluate on a held-out test set. Perform slice analysis. Check calibration. Validate that offline metrics correlate with the business metric.
Deployment: Shadow mode, then canary (5% traffic), then full rollout.
Online evaluation: A/B test or bandit-based experiment. Measure business metric impact, not just model metrics.
Monitoring: Track feature drift, prediction drift, and business metric degradation.

Where projects most commonly fail: In order of frequency: data quality (leakage, label noise, insufficient volume), problem framing (wrong ML objective), feature engineering (train-serve skew), and offline evaluation (test set constructed incorrectly). Model architecture failures account for only ~5% of failures despite being the most studied in textbooks.

Q2: What is train-serve skew and how do you prevent it?

Train-serve skew occurs when features are computed differently during model training than during model serving in production. This is one of the most common and most subtle production ML failures.

Example: During training, you compute the feature "user's average purchase value" by dividing total lifetime spend by number of purchases. During serving, due to a different code path in the real-time feature server, you divide by number of distinct items instead. The model was trained on one quantity and served a different one. Performance degrades, but there is no error - just wrong predictions.

How to prevent it:

Single source of truth for feature computation: Use a feature store (Feast, Tecton, Vertex Feature Store) where features are computed once and shared across training and serving pipelines.
Training-serving parity tests: Write automated tests that compute the same feature from the same raw data in both the training pipeline and the serving pipeline and assert that the results are identical (or within floating-point tolerance).
Log-and-replay: During training, use logged features from production (the actual feature vectors that were used to make predictions) rather than recomputing features offline. This guarantees parity because you are training on the exact feature representation that will be used at serving time.
Code sharing: The feature computation code should be a shared library, not duplicated between training and serving. If there is a bug in the computation, it exists in one place and is fixed in one place.
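
The parity-test and code-sharing ideas above can be sketched together. In this example every name is hypothetical (`avg_purchase_value`, `batch_features`, `serving_features`, `check_parity`), and the deliberately broken `skewed_serving_features` reproduces the bug from the example: dividing by distinct items instead of purchases.

```python
import numpy as np
import pandas as pd

def avg_purchase_value(total_spend, n_purchases):
    """Shared feature definition: works on scalars and on pandas Series."""
    return total_spend / np.maximum(n_purchases, 1)

def batch_features(raw: pd.DataFrame) -> np.ndarray:
    # Training path: vectorised over the whole table
    return avg_purchase_value(raw['total_spend'], raw['n_purchases']).to_numpy(dtype=float)

def serving_features(request: dict) -> float:
    # Serving path: one request at a time, same shared function
    return float(avg_purchase_value(request['total_spend'], request['n_purchases']))

def skewed_serving_features(request: dict) -> float:
    # Buggy serving path: divides by distinct items instead of purchases
    return float(request['total_spend'] / max(request['n_distinct_items'], 1))

def check_parity(raw: pd.DataFrame, serving_fn, rtol: float = 1e-9) -> bool:
    """Compute the feature both ways from the same raw rows and compare."""
    offline = batch_features(raw)
    online = np.array([serving_fn(row) for row in raw.to_dict('records')])
    return bool(np.allclose(offline, online, rtol=rtol))

raw_events = pd.DataFrame({
    'total_spend': [120.0, 0.0, 45.5],
    'n_purchases': [4, 0, 1],
    'n_distinct_items': [3, 0, 1],
})

print("Shared implementation in parity:", check_parity(raw_events, serving_features))
print("Skewed implementation in parity:", check_parity(raw_events, skewed_serving_features))
```

A check like this is cheap to run in CI on a small fixture of raw rows, and it fails loudly where skew in production would fail silently.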

Q3: What is the data flywheel and how does it create a competitive moat?

The data flywheel is the virtuous cycle where: a deployed ML system makes predictions → predictions influence user behavior → user behavior generates new labeled data → new data improves the next model → better model generates better predictions → more user engagement → more data.

Why it creates a competitive moat: The quality of an ML-driven product is a function of the quality and quantity of training data. A company that has been running an ML-powered product for 5 years has 5 years of production data, user feedback, and implicit labels that a new entrant cannot acquire quickly. Even if the new entrant uses a better model architecture, the incumbent's data advantage dominates.

Real examples:

Google Search: 20+ years of clicks, reformulations, and engagement signals
Netflix recommendation: 150M+ users generating viewing behavior daily
Spotify Discover Weekly: user skip/listen behavior → weekly-updated playlists

Engineering implications for designing a data flywheel:

Log model inputs at prediction time (not just outputs)
Design feedback collection into the product (explicit or implicit)
Build delayed-label pipelines (outcome may not be known until days later)
Version your data - know exactly what data trained each model version
Build infrastructure for retraining on a schedule (weekly, daily, hourly depending on the domain)

Q4: Why is offline evaluation not sufficient, and how do you design an online evaluation strategy?

Offline evaluation on a held-out test set measures: "Does my model predict the historical labels in my test set better than the baseline?" This answers a necessary but not sufficient question for deployment.

Why offline is insufficient:

Metric misalignment: The offline metric (AUC, F1) may not correlate with the business metric (revenue, engagement). The opening story in this lesson illustrates exactly this.
Distribution shift: The test set is historical. Production data in the future may have a different distribution, making offline performance a poor predictor of live performance.
Feedback loops: The current production model affects what data exists (e.g., a recommendation model determines what users see, which determines what they click, which becomes training data). New models may perform very differently in this feedback loop.
User adaptation: Users adapt to model behavior. A new model may initially perform worse simply because users are unfamiliar with its output style.

Online evaluation design:

A/B testing: Randomly split users into control (old model) and treatment (new model) and measure business metrics for each group. Fix the sample size in advance with a power analysis (typically days to weeks of traffic) and run for that full duration; stopping as soon as the result first looks significant inflates the false-positive rate. Watch for novelty effects (an initial boost that fades).
Interleaving: For ranking tasks (search, recommendations), interleave results from both models and measure which model's results get more engagement. This is more sensitive than standard A/B testing, so it requires less traffic.
Bandit experiments (multi-armed bandit): When exploring multiple model variants simultaneously, use Thompson Sampling or UCB to adaptively allocate more traffic to better-performing variants. This usually sends less traffic to losing variants than a fixed-split A/B test, at the cost of harder statistical inference on the final effect size.
Shadow mode first: Run the new model in parallel, log all predictions, and compare distributions against the production model before any user sees the new model's output. A cheap first check for major issues.
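The power analysis mentioned in the A/B testing item can be approximated with the standard two-proportion formula. The sketch below is one common approximation, not the only way to size an experiment; the function name `samples_per_arm` and the CTR values are made up for illustration.

```python
import math
from statistics import NormalDist


def samples_per_arm(p_control: float, p_treatment: float,
                    alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate users needed per arm for a two-sided two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = NormalDist().inv_cdf(power)           # critical value for power
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_treatment - p_control
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


# Hypothetical scenario: baseline CTR of 5.0%, hoping to detect a lift to 5.5%.
n_small_lift = samples_per_arm(0.050, 0.055)
# A larger lift (5.0% -> 6.0%) needs far fewer users to detect.
n_large_lift = samples_per_arm(0.050, 0.060)

print("per-arm sample size, 0.5 pt lift:", n_small_lift)
print("per-arm sample size, 1.0 pt lift:", n_large_lift)
```

Required sample size scales with the inverse square of the effect, so halving the detectable lift roughly quadruples the traffic needed. This is why small expected improvements translate into weeks rather than days of experiment time.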

Q5: What is the "baseline first" principle and why do experienced ML engineers insist on it?

The baseline-first principle: before building any complex model, build the simplest possible model that is appropriate for the problem. Always have a baseline before iterating.

Why experienced engineers insist on this:

Calibrates expectations: If a logistic regression achieves 91% AUC and your target is 93%, you know the remaining gap is small and the marginal return on model complexity will be low. If logistic regression achieves 65% AUC, there is a large gap to close: either a more complex model can capture signal the linear model misses, or the features themselves carry little signal and need work first.
Detects data/feature problems early: If a logistic regression cannot beat a dummy classifier (most frequent class), the problem is almost certainly in the data or features, not the model. No amount of neural architecture tuning will fix bad data.
Provides production insurance: In many projects, the baseline is good enough to ship. Deploying a logistic regression that achieves 90% of the target performance gives you a production system while you iterate toward the full model. The alternative (spend 3 months on the complex model first) gives you nothing until month 3.
Anchors ablation studies: When you add a feature or increase model complexity, measuring the delta over a documented baseline is how you know whether the change helped. Without a baseline, every change is unmeasured.
Builds intuition: The features and coefficients of a logistic regression are interpretable. They tell you which features are predictive before you obscure everything in a black-box neural network.

The practical hierarchy: dummy classifier → logistic regression → gradient boosted tree → neural network → pretrained model. Only escalate when the previous level is demonstrably insufficient.
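The bottom rung of this hierarchy, the dummy classifier, takes only a few lines to build. The sketch below uses made-up labels and hypothetical candidate predictions, and the `min_gain` threshold is an arbitrary illustrative value rather than a recommended setting.

```python
from collections import Counter


def accuracy(y_true: list[int], y_pred: list[int]) -> float:
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def majority_class_baseline(y_train: list[int], n_test: int) -> list[int]:
    """Dummy classifier: always predict the most frequent training label."""
    most_common = Counter(y_train).most_common(1)[0][0]
    return [most_common] * n_test


def beats_baseline(model_acc: float, baseline_acc: float,
                   min_gain: float = 0.01) -> bool:
    """Escalation gate: accept a model only if it clears the baseline by a margin."""
    return model_acc - baseline_acc >= min_gain


# Imbalanced toy data: 90% negatives in training, same ratio in test.
y_train = [0] * 90 + [1] * 10
y_test = [0] * 18 + [1] * 2

# Hypothetical predictions from some candidate model (one positive caught).
model_preds = [0] * 18 + [1, 0]

baseline_acc = accuracy(y_test, majority_class_baseline(y_train, len(y_test)))
model_acc = accuracy(y_test, model_preds)

print("dummy baseline accuracy:", baseline_acc)
print("candidate model accuracy:", model_acc)
print("worth keeping:", beats_baseline(model_acc, baseline_acc))
```

On this data the dummy classifier already scores 0.9 accuracy without learning anything, which is why a candidate's headline number means little until it is compared against the baseline on the same split.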

Key Takeaways

The ML workflow is: problem framing → data → EDA → features → baseline → model → offline eval → deployment → online eval → monitoring → loop
Most ML projects fail at data collection, problem framing, or feature engineering - not model architecture
Always build a baseline model before a complex model; only escalate complexity when the simpler model is demonstrably insufficient
Data leakage is one of the most common causes of falsely optimistic offline metrics - audit for it explicitly at every stage
The data flywheel is the compounding value of production deployment: more data → better model → more users → more data
Offline evaluation is necessary but not sufficient - always validate business metric impact via online A/B testing

Next: Lesson 04 - Data Representation and Feature Spaces →

