

ML Platform Design

The Productivity Crisis

The ML team at a mid-size tech company has grown from 3 to 20 data scientists in eighteen months. Individually, each scientist is skilled. Collectively, they are producing chaos. There are seven different ways to train a model: raw Python scripts, Jupyter notebooks committed to git, a custom Airflow DAG that nobody fully understands, three different Docker images with conflicting dependencies, and a Makefile that assumes a specific EC2 instance type. There are seven different ways to deploy: Flask on a VM, TorchServe on an EC2 instance, a FastAPI container in ECS, two different Lambda functions for "lightweight" inference, a Streamlit app that was supposed to be a demo, and one model that runs on a laptop belonging to an engineer who is on vacation.

When the head of data science asks "how many models do we have in production?" nobody can answer with confidence.

This is not a talent problem. This is a missing platform problem. An ML platform is the shared infrastructure, tooling, and conventions that let data scientists focus on models rather than infrastructure. When the platform is missing, every team member reinvents it differently. When the platform exists and is good, a new data scientist is running a training job in their first week and deploying to production in their first month.

Building an ML platform is a bet: you are trading engineering effort today for team velocity forever. The companies that got this right - Uber with Michelangelo, Airbnb with Bighead, LinkedIn with Pro-ML - are the ones that were able to scale their ML teams from tens to hundreds of practitioners without descending into the chaos described above. This lesson explains what they built and how to build it.


Why This Exists: The Tooling Gap Before ML Platforms

Before dedicated ML platforms, data scientists used general-purpose software tools that were not designed for the ML lifecycle. Jupyter notebooks for exploration: fine. But notebooks do not version data alongside code, do not reproduce environments reliably, do not scale to distributed training, and do not have built-in deployment pathways. Git for code versioning: fine. But git does not version 100 GB training datasets or 5 GB model artifacts. Cron jobs for scheduled training: fragile. A cron job does not handle retries, does not report GPU utilization, does not notify you when training diverges, and does not know that the training run from yesterday is now blocking three other jobs waiting for the same GPU cluster.

The insight at Uber in 2015 (Michelangelo), Airbnb in 2017 (Bighead), and LinkedIn in 2018 (Pro-ML) was the same: ML teams need infrastructure purpose-built for the ML lifecycle, not adapted from general software tools. The ML lifecycle has specific requirements that general tools do not address: dataset versioning, experiment tracking, distributed training orchestration, model versioning and governance, feature sharing across teams, and production monitoring tied back to training.


The ML Platform Components

A production ML platform has four layers, each with specific responsibilities.


Layer 1: The Data Layer

Feature Store

The feature store is the shared repository of computed features, accessible to all teams. It has two access patterns:

Online store: low-latency (under 5ms) key-value lookups for serving time. Redis or DynamoDB. Features are pre-computed and pushed here by offline pipelines.

Offline store: high-throughput column-oriented storage for training time. Hive, BigQuery, or Snowflake. Stores historical feature values with timestamps for point-in-time correct training dataset generation.

The feature store solves feature duplication: without it, Team A and Team B independently compute "user_30day_purchase_count" with slightly different definitions, burning compute twice and creating inconsistency.
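To make "point-in-time correct" concrete, here is a minimal sketch using only the standard library. The row layouts, the sample values, and the function names (`build_history`, `feature_as_of`, `point_in_time_join`) are assumptions made for illustration, not the API of any particular feature store. The idea is that each training example only sees the feature value that was already known at its event time.

```python
from bisect import bisect_right
from collections import defaultdict
from datetime import date

# Hypothetical offline-store rows: (entity_id, valid_from, feature_value).
FEATURE_ROWS = [
    (1, date(2024, 1, 1), 3),
    (1, date(2024, 2, 1), 7),
    (2, date(2024, 1, 15), 1),
    (2, date(2024, 2, 15), 4),
]

# Hypothetical label events: (entity_id, event_time, label).
LABEL_ROWS = [
    (1, date(2024, 1, 10), 0),
    (1, date(2024, 2, 10), 1),
    (2, date(2024, 1, 20), 0),
    (2, date(2024, 1, 5), 1),  # no feature value existed yet for entity 2
]


def build_history(rows):
    """Group feature rows per entity, sorted by the time they became valid."""
    history = defaultdict(list)
    for entity_id, valid_from, value in sorted(rows, key=lambda r: (r[0], r[1])):
        history[entity_id].append((valid_from, value))
    return history


def feature_as_of(history, entity_id, as_of):
    """Return the latest value with valid_from <= as_of, or None if none exists."""
    timeline = history.get(entity_id, [])
    times = [valid_from for valid_from, _ in timeline]
    idx = bisect_right(times, as_of)
    return timeline[idx - 1][1] if idx > 0 else None


def point_in_time_join(label_rows, feature_rows):
    """Attach to each label the feature value that was known at event time."""
    history = build_history(feature_rows)
    return [
        (entity_id, event_time, feature_as_of(history, entity_id, event_time), label)
        for entity_id, event_time, label in label_rows
    ]


training_rows = point_in_time_join(LABEL_ROWS, FEATURE_ROWS)
for row in training_rows:
    print(row)
```

Joining on the latest feature value regardless of time would leak future information into training; the backward-looking lookup is what prevents that.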

Dataset Registry

The dataset registry versions training datasets alongside model code. A dataset is not just a file path - it is a versioned artifact with metadata: the query that generated it, the feature schema version it depends on, the time range it covers, and the number of rows.

DVC (Data Version Control) is the most common open-source solution. It stores large files in S3 or GCS and commits a lightweight pointer file to git. After you check out the git commit of a past experiment, dvc pull dataset.dvc fetches the exact dataset version that experiment was trained on.
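The metadata described above can be sketched as a small immutable record. This is an illustrative in-memory stand-in, not DVC's data model: the class names, field names, and the metadata-based fingerprint are all assumptions. A real registry would also hash the data files themselves.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class DatasetVersion:
    """Metadata for one immutable training dataset (field names are illustrative)."""

    name: str
    query: str
    feature_schema_version: str
    start_date: str  # ISO date, inclusive
    end_date: str  # ISO date, exclusive
    row_count: int

    def fingerprint(self) -> str:
        """Deterministic ID: identical metadata always yields the same hash."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]


class DatasetRegistry:
    """In-memory stand-in for a registry that would be backed by a database."""

    def __init__(self) -> None:
        self._versions: dict[str, DatasetVersion] = {}

    def register(self, version: DatasetVersion) -> str:
        fingerprint = version.fingerprint()
        # Registering the same metadata twice is a no-op.
        self._versions.setdefault(fingerprint, version)
        return fingerprint

    def get(self, fingerprint: str) -> DatasetVersion:
        return self._versions[fingerprint]


registry = DatasetRegistry()
churn_v1 = DatasetVersion(
    name="churn_train",
    query="SELECT * FROM churn_features",  # hypothetical query
    feature_schema_version="v3",
    start_date="2024-01-01",
    end_date="2024-04-01",
    row_count=1_200_000,
)
dataset_id = registry.register(churn_v1)
print(dataset_id, registry.get(dataset_id).row_count)
```

Logging `dataset_id` with each training run is what lets the experiment tracker point back to the exact data a model saw.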

Label Pipeline

Ground truth labels are a first-class citizen in the ML platform. The label pipeline ingests labels from multiple sources - user feedback, manual annotation, delayed signals (chargebacks, churns) - deduplicates them, validates them against the training schema, and writes them to the offline store.
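A minimal sketch of the validate-then-deduplicate step follows. The binary label schema, the `LabelEvent` fields, and the "most recent observation wins" rule are assumptions chosen for illustration; a real pipeline would take its rules from the training schema.

```python
from dataclasses import dataclass
from datetime import datetime

ALLOWED_LABELS = {0, 1}  # assumed binary label schema


@dataclass(frozen=True)
class LabelEvent:
    entity_id: str
    label: int
    source: str  # e.g. "user_feedback", "annotation", "chargeback"
    observed_at: datetime


def validate(event: LabelEvent) -> bool:
    """Reject events with a missing entity or a label outside the schema."""
    return bool(event.entity_id) and event.label in ALLOWED_LABELS


def deduplicate(events: list[LabelEvent]) -> list[LabelEvent]:
    """Keep only the most recently observed label per entity."""
    latest: dict[str, LabelEvent] = {}
    for event in events:
        current = latest.get(event.entity_id)
        if current is None or event.observed_at > current.observed_at:
            latest[event.entity_id] = event
    return list(latest.values())


def run_label_pipeline(events: list[LabelEvent]) -> tuple[list[LabelEvent], int]:
    """Return the clean labels plus the number of rejected events."""
    valid = [e for e in events if validate(e)]
    return deduplicate(valid), len(events) - len(valid)


raw_events = [
    LabelEvent("txn-1", 0, "user_feedback", datetime(2024, 1, 1)),
    LabelEvent("txn-1", 1, "chargeback", datetime(2024, 2, 1)),  # delayed signal
    LabelEvent("txn-2", 5, "annotation", datetime(2024, 1, 3)),  # invalid label
]
clean_labels, rejected_count = run_label_pipeline(raw_events)
print(clean_labels, rejected_count)
```

Letting the latest observation win means a delayed signal such as a chargeback overrides an earlier optimistic label, which is usually what you want for fraud-style problems.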


Layer 2: The Training Layer

Training Orchestrator

The training orchestrator is the job scheduler for ML training runs. It accepts a training configuration (model code version, dataset version, hyperparameters, hardware requirements), schedules the job on available resources, monitors execution, and reports results to the experiment tracker.

At scale, the orchestrator manages priority queues (experimentation queue, production retraining queue), preemption (stop a low-priority job to run an urgent one), and resource quotas (no single team can use more than 40% of the GPU cluster during business hours).
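The three mechanisms above (priority queues, preemption, quotas) can be sketched in a few dozen lines. This is a toy in-memory scheduler under stated assumptions: the `Job` and `GpuScheduler` names are hypothetical, quotas apply at all times rather than only during business hours, and only "high" jobs may preempt "low" ones.

```python
import heapq
import itertools
from dataclasses import dataclass

PRIORITY_RANK = {"high": 0, "normal": 1, "low": 2}  # lower rank is scheduled first


@dataclass(frozen=True)
class Job:
    name: str
    team: str
    gpus: int
    priority: str = "normal"


class GpuScheduler:
    def __init__(self, total_gpus: int, team_quota: float = 0.4) -> None:
        self.total_gpus = total_gpus
        self.team_cap = int(total_gpus * team_quota)  # max GPUs per team
        self.running: list[Job] = []
        self._queue: list[tuple[int, int, Job]] = []
        self._seq = itertools.count()  # FIFO tie-breaker within a priority

    def submit(self, job: Job) -> None:
        heapq.heappush(self._queue, (PRIORITY_RANK[job.priority], next(self._seq), job))

    def _free_gpus(self) -> int:
        return self.total_gpus - sum(j.gpus for j in self.running)

    def _team_usage(self, team: str) -> int:
        return sum(j.gpus for j in self.running if j.team == team)

    def schedule(self) -> list[Job]:
        """Start every queued job that fits; return the jobs that were preempted."""
        preempted: list[Job] = []
        deferred: list[tuple[int, int, Job]] = []
        while self._queue:
            rank, seq, job = heapq.heappop(self._queue)
            if self._team_usage(job.team) + job.gpus > self.team_cap:
                deferred.append((rank, seq, job))  # over quota: keep waiting
                continue
            if job.priority == "high" and self._free_gpus() < job.gpus:
                victims = [j for j in self.running if j.priority == "low"]
                reclaimable = sum(v.gpus for v in victims)
                # Only evict if eviction actually frees enough GPUs.
                if self._free_gpus() + reclaimable >= job.gpus:
                    while self._free_gpus() < job.gpus:
                        victim = victims.pop()
                        self.running.remove(victim)
                        preempted.append(victim)
                        # Requeue the victim so it restarts when capacity returns.
                        deferred.append((PRIORITY_RANK["low"], next(self._seq), victim))
            if self._free_gpus() >= job.gpus:
                self.running.append(job)
            else:
                deferred.append((rank, seq, job))
        for item in deferred:
            heapq.heappush(self._queue, item)
        return preempted


scheduler = GpuScheduler(total_gpus=10)
scheduler.submit(Job("sweep-a", "search", 4, "low"))
scheduler.submit(Job("sweep-b", "ads", 4, "low"))
scheduler.schedule()
scheduler.submit(Job("fraud-retrain", "fraud", 4, "high"))
evicted = scheduler.schedule()
print([j.name for j in evicted], [j.name for j in scheduler.running])
```

In a Kubernetes-based platform the same policies are usually expressed declaratively (priority classes and resource quotas) rather than in application code; the sketch only shows the decision logic.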

Apache Airflow, Kubeflow Pipelines, and Metaflow are common choices. At large scale, custom orchestrators built on Kubernetes (using Jobs and CronJobs) are more common.

# Example: training job submission to a Kubernetes-based orchestrator
from typing import Optional

from kubernetes import client, config


class MLTrainingOrchestrator:
    """
    Submits training jobs to Kubernetes.
    Handles resource requests, environment configuration,
    and job monitoring.
    """

    def __init__(self, namespace: str = "ml-training"):
        # Assumes the orchestrator runs inside the cluster; use
        # config.load_kube_config() when running from a workstation.
        config.load_incluster_config()
        self.batch_v1 = client.BatchV1Api()
        self.namespace = namespace

    def submit_training_job(
        self,
        job_name: str,
        image: str,
        command: list,
        gpu_count: int = 1,
        memory_gb: int = 32,
        env_vars: Optional[dict] = None,
        priority: str = "normal",  # "low", "normal", "high"
    ) -> str:
        """
        Submit a training job as a Kubernetes Job.
        Returns job name for status polling.
        """
        job = client.V1Job(
            metadata=client.V1ObjectMeta(
                name=job_name,
                namespace=self.namespace,
                labels={
                    "app": "ml-training",
                    "priority": priority,
                },
            ),
            spec=client.V1JobSpec(
                template=client.V1PodTemplateSpec(
                    spec=client.V1PodSpec(
                        restart_policy="Never",
                        # Assumes PriorityClasses named ml-low, ml-normal,
                        # and ml-high already exist in the cluster.
                        priority_class_name=f"ml-{priority}",
                        containers=[
                            client.V1Container(
                                name="trainer",
                                image=image,
                                command=command,
                                resources=client.V1ResourceRequirements(
                                    requests={
                                        "memory": f"{memory_gb}Gi",
                                        # Heuristic: 1 CPU per 4 GiB, at least 1
                                        "cpu": str(max(1, memory_gb // 4)),
                                        "nvidia.com/gpu": str(gpu_count),
                                    },
                                    limits={
                                        "memory": f"{memory_gb * 2}Gi",
                                        # GPU request and limit must be equal
                                        "nvidia.com/gpu": str(gpu_count),
                                    },
                                ),
                                env=[
                                    # Environment variable values must be strings
                                    client.V1EnvVar(name=k, value=str(v))
                                    for k, v in (env_vars or {}).items()
                                ],
                            )
                        ],
                    )
                ),
                backoff_limit=2,  # retry up to 2 times on failure
                ttl_seconds_after_finished=86400,  # clean up after 24h
            ),
        )

        self.batch_v1.create_namespaced_job(
            namespace=self.namespace, body=job
        )
        print(f"[Orchestrator] Submitted job: {job_name}")
        return job_name

    def get_job_status(self, job_name: str) -> str:
        """Poll job status: pending, running, succeeded, failed."""
        job = self.batch_v1.read_namespaced_job(
            name=job_name, namespace=self.namespace
        )
        status = job.status
        # Use the terminal conditions rather than the pod counters:
        # status.failed counts failed pods, so it is already non-zero
        # while the Job is still retrying under backoff_limit.
        for condition in status.conditions or []:
            if condition.status != "True":
                continue
            if condition.type == "Complete":
                return "succeeded"
            if condition.type == "Failed":
                return "failed"
        if status.active:
            return "running"
        return "pending"

Experiment Tracker

The experiment tracker records every training run: the git commit of the code, the dataset version, all hyperparameters, all metrics (loss curves, final accuracy, inference latency), and the model artifact location. This turns model development from an art into a science - every result is reproducible and comparable.

import subprocess

import mlflow
import mlflow.pytorch
import torch
from sklearn.metrics import roc_auc_score
from torch import nn
from torch.utils.data import DataLoader


class MLPlatformTrainer:
    """
    Training loop integrated with MLflow experiment tracking.
    Every run is fully logged: params, metrics, artifacts.
    """

    def __init__(
        self,
        experiment_name: str,
        mlflow_tracking_uri: str = "http://mlflow-server:5000",
    ):
        mlflow.set_tracking_uri(mlflow_tracking_uri)
        mlflow.set_experiment(experiment_name)

    def train(
        self,
        model: nn.Module,
        train_loader: DataLoader,
        val_loader: DataLoader,
        config: dict,
    ) -> str:
        """
        Run training with full MLflow logging.
        Returns the run_id for model registry registration.
        """
        with mlflow.start_run() as run:
            # Log all hyperparameters
            mlflow.log_params(config)

            # Log code version (requires running inside a git checkout)
            git_hash = subprocess.check_output(
                ["git", "rev-parse", "HEAD"]
            ).strip().decode()
            mlflow.set_tag("git_commit", git_hash)
            mlflow.set_tag("dataset_version", config.get("dataset_version"))

            optimizer = torch.optim.Adam(
                model.parameters(),
                lr=config["learning_rate"],
                weight_decay=config.get("weight_decay", 1e-4),
            )

            best_val_loss = float("inf")
            for epoch in range(config["epochs"]):
                train_loss = self._train_epoch(model, train_loader, optimizer)
                val_loss, val_auc = self._eval_epoch(model, val_loader)

                # Log metrics once per epoch
                mlflow.log_metrics(
                    {
                        "train_loss": train_loss,
                        "val_loss": val_loss,
                        "val_auc": val_auc,
                    },
                    step=epoch,
                )

                if val_loss < best_val_loss:
                    best_val_loss = val_loss
                    # Save best model checkpoint
                    mlflow.pytorch.log_model(model, "best_model")

            # Log final summary metrics
            mlflow.log_metric("best_val_loss", best_val_loss)
            print(f"[Trainer] Run {run.info.run_id} completed.")
            return run.info.run_id

    def _train_epoch(self, model, loader, optimizer):
        model.train()
        total_loss = 0.0
        criterion = nn.BCEWithLogitsLoss()
        for features, labels in loader:
            optimizer.zero_grad()
            # squeeze(-1) drops only the output dimension, so a batch of
            # size 1 keeps its batch dimension
            logits = model(features).squeeze(-1)
            loss = criterion(logits, labels.float())
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        return total_loss / len(loader)

    def _eval_epoch(self, model, loader):
        model.eval()
        total_loss = 0.0
        all_preds, all_labels = [], []
        criterion = nn.BCEWithLogitsLoss()
        with torch.no_grad():
            for features, labels in loader:
                logits = model(features).squeeze(-1)
                loss = criterion(logits, labels.float())
                total_loss += loss.item()
                # Collect flat lists of probabilities and labels for AUC
                all_preds.extend(torch.sigmoid(logits).cpu().tolist())
                all_labels.extend(labels.cpu().tolist())

        auc = roc_auc_score(all_labels, all_preds)
        return total_loss / len(loader), auc

Layer 3: The Serving Layer

Model Registry

The model registry is the catalogue of production models. Every model artifact - PyTorch state dict, sklearn pickle, ONNX file - is stored here with versioned metadata: which training run produced it, what dataset it was trained on, what metrics it achieved, who approved it for production, when it was deployed.

from enum import Enum
from typing import Optional

import mlflow
import mlflow.pytorch
from mlflow.tracking import MlflowClient


# Note: MLflow has deprecated model stages (since 2.9) in favor of
# aliases and tags. This wrapper uses the stage API and assumes an
# MLflow version that still supports it.
class ModelStage(str, Enum):
    STAGING = "Staging"
    PRODUCTION = "Production"
    ARCHIVED = "Archived"


class ModelRegistry:
    """
    Wrapper around MLflow Model Registry.
    Handles model registration, promotion, and retrieval.
    """

    def __init__(self, tracking_uri: str):
        mlflow.set_tracking_uri(tracking_uri)
        self.client = MlflowClient()

    def register_model(
        self,
        run_id: str,
        model_name: str,
        artifact_path: str = "best_model",
    ) -> str:
        """
        Register a trained model in the registry.
        Model starts in None stage (not yet staged or promoted).
        Returns the model version.
        """
        model_uri = f"runs:/{run_id}/{artifact_path}"
        result = mlflow.register_model(model_uri, model_name)
        print(
            f"[Registry] Registered {model_name} version {result.version} "
            f"from run {run_id}"
        )
        return str(result.version)

    def promote_to_staging(
        self,
        model_name: str,
        version: str,
        description: str = "",
    ) -> None:
        """Move model to Staging for integration testing."""
        self.client.transition_model_version_stage(
            name=model_name,
            version=version,
            stage=ModelStage.STAGING.value,
            archive_existing_versions=False,
        )
        self.client.update_model_version(
            name=model_name, version=version, description=description
        )

    def promote_to_production(
        self,
        model_name: str,
        version: str,
        approved_by: str,
    ) -> None:
        """
        Promote model to Production. Archives the previous production version.
        Requires an approver - never promote autonomously without a human gate.
        """
        if not approved_by:
            raise ValueError("approved_by is required to promote to Production")
        self.client.set_model_version_tag(
            name=model_name,
            version=version,
            key="approved_by",
            value=approved_by,
        )
        self.client.transition_model_version_stage(
            name=model_name,
            version=version,
            stage=ModelStage.PRODUCTION.value,
            archive_existing_versions=True,  # auto-archives previous production
        )
        print(
            f"[Registry] {model_name} v{version} promoted to Production "
            f"by {approved_by}"
        )

    def get_production_model(self, model_name: str):
        """Load the current production model."""
        model_uri = f"models:/{model_name}/{ModelStage.PRODUCTION.value}"
        return mlflow.pytorch.load_model(model_uri)

    def list_versions(
        self, model_name: str, stage: Optional[ModelStage] = None
    ) -> list:
        """List all versions of a model, optionally filtered by stage."""
        versions = self.client.search_model_versions(f"name='{model_name}'")
        if stage is not None:
            versions = [v for v in versions if v.current_stage == stage.value]
        return list(versions)

Layer 4: The Monitoring Layer

Production ML monitoring requires more than uptime checks. You need to detect when the model's predictions are drifting from expected behavior - either because the world has changed (concept drift) or because your input data has changed (data drift).

from collections import deque

from prometheus_client import Counter, Gauge, Histogram
from scipy import stats


# Prometheus metrics for ML monitoring
PREDICTION_DISTRIBUTION = Histogram(
    "ml_prediction_score",
    "Distribution of prediction scores",
    labelnames=["model_name", "model_version"],
    buckets=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
)

FEATURE_DRIFT_SCORE = Gauge(
    "ml_feature_drift_ks_statistic",
    "KS test statistic for feature drift detection",
    labelnames=["model_name", "feature_name"],
)

PREDICTION_COUNT = Counter(
    "ml_predictions_total",
    "Total number of predictions served",
    labelnames=["model_name", "model_version"],
)


class MLMonitor:
    """
    Production ML monitoring: tracks prediction distributions and detects drift.
    """

    WINDOW_SIZE = 10000  # compute drift over last 10K predictions

    def __init__(
        self,
        model_name: str,
        model_version: str,
        reference_data: dict,  # baseline feature distributions from training
    ):
        self.model_name = model_name
        self.model_version = model_version
        self.reference_data = reference_data
        # Bounded deques drop the oldest entry automatically, so the
        # sliding window never needs manual trimming.
        self.recent_predictions: deque = deque(maxlen=self.WINDOW_SIZE)
        self.recent_features: dict = {}

    def record_prediction(self, features: dict, prediction: float) -> None:
        """Log a prediction for monitoring. In-memory updates only."""
        PREDICTION_DISTRIBUTION.labels(
            model_name=self.model_name,
            model_version=self.model_version,
        ).observe(prediction)
        PREDICTION_COUNT.labels(
            model_name=self.model_name,
            model_version=self.model_version,
        ).inc()

        self.recent_predictions.append(prediction)
        for feature_name, value in features.items():
            if feature_name not in self.recent_features:
                self.recent_features[feature_name] = deque(
                    maxlen=self.WINDOW_SIZE
                )
            self.recent_features[feature_name].append(value)

    def check_feature_drift(self, feature_name: str) -> float:
        """
        Kolmogorov-Smirnov test between current feature distribution
        and reference (training) distribution.
        KS statistic > 0.1 warrants investigation; > 0.2 is an alert.
        Returns 0.0 when there is not enough data to run the test.
        """
        if feature_name not in self.recent_features:
            return 0.0
        current_dist = list(self.recent_features[feature_name])
        reference_dist = self.reference_data.get(feature_name, [])

        if len(current_dist) < 100 or len(reference_dist) < 100:
            return 0.0  # not enough data for meaningful test

        ks_stat, p_value = stats.ks_2samp(current_dist, reference_dist)

        FEATURE_DRIFT_SCORE.labels(
            model_name=self.model_name,
            feature_name=feature_name,
        ).set(ks_stat)

        if ks_stat > 0.2:
            print(
                f"[Monitor] ALERT: Feature drift detected for "
                f"{feature_name}. KS={ks_stat:.3f}, p={p_value:.4f}"
            )
        return float(ks_stat)

    def run_drift_checks(self) -> dict:
        """Run drift checks for all monitored features."""
        return {
            name: self.check_feature_drift(name)
            for name in self.recent_features
        }

Build vs Buy: Managed ML Platforms

You do not always need to build your own platform. Here is how to decide.

| Factor | Build Your Own | Managed (Vertex AI / SageMaker) |
| --- | --- | --- |
| Team size | 20+ ML practitioners | Fewer than 20 |
| Customization needs | Deep custom workflows | Standard training and serving |
| Data residency | Strict on-prem requirements | Cloud-OK |
| Cost at scale | Cheaper (no platform markup) | Cheaper initially |
| Time to first model in prod | Months | Days |
| Operational burden | High | Low |

Use SageMaker or Vertex AI when: you are a small team, you need to move fast, and you are not doing anything that requires deep customization. The managed platforms handle cluster provisioning, Docker image management, model serving autoscaling, and basic experiment tracking. You pay a 20-30% markup over raw compute, which is worth it for teams under 20.

Build your own when: you have more than 20-30 practitioners, you have unique infrastructure requirements (specific GPU types, on-prem data, custom hardware accelerators), or the platform cost at your scale exceeds the cost of building and operating your own.
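The cost side of this decision is simple arithmetic: building pays off once the managed markup on your compute exceeds the yearly cost of a platform team. The sketch below works through it. The team size, the per-engineer cost, and the exact markup are hypothetical inputs chosen for illustration, and the model deliberately ignores migration cost and the infrastructure the in-house platform itself consumes.

```python
def annual_in_house_cost(engineers: int, cost_per_engineer: float) -> float:
    """Yearly cost of the team that builds and operates the platform."""
    return engineers * cost_per_engineer


def annual_managed_markup(raw_compute_spend: float, markup: float) -> float:
    """Extra yearly spend paid to a managed platform on top of raw compute."""
    return raw_compute_spend * markup


def break_even_compute_spend(
    engineers: int, cost_per_engineer: float, markup: float
) -> float:
    """Raw compute spend at which the managed markup equals the team cost."""
    return annual_in_house_cost(engineers, cost_per_engineer) / markup


# HYPOTHETICAL inputs: replace with your own numbers.
TEAM_SIZE = 4  # platform engineers
COST_PER_ENGINEER = 250_000.0  # fully loaded yearly cost in USD (assumed)
MARKUP = 0.25  # midpoint of the 20-30% range quoted above

threshold = break_even_compute_spend(TEAM_SIZE, COST_PER_ENGINEER, MARKUP)
print(f"Break-even raw compute spend: ${threshold:,.0f}/year")
```

Below the threshold the markup is cheaper than the team; above it, the in-house platform starts to pay for itself, provided the team can actually deliver it.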

Companies such as Uber, LinkedIn, and Airbnb built their own platforms because they had reached the scale where the customization and cost advantages outweighed the build cost.


Case Studies

Uber Michelangelo (2017)

Michelangelo is Uber's end-to-end ML platform. At the time of the original blog post, it supported hundreds of models used in demand forecasting, ETAs, fraud detection, and personalization. Key design decisions: unified feature store (online in Cassandra, offline in Hive), training orchestration via a DSL (rather than code), model serving via a shared serving container (engineers provide model artifacts; the platform provides the container), and per-model performance dashboards tied directly to business metrics.

The insight that distinguishes Michelangelo: the platform team treated internal ML teams as customers. The developer experience (DX) was a first-class design concern. New data scientists could train and deploy a model using a web UI with no infrastructure knowledge.

Airbnb Bighead (2018)

Airbnb's Bighead connected the full ML lifecycle from data to production, with a particular focus on cross-team feature reuse. A search engineer could browse the feature catalogue and immediately use a feature that the pricing team had built, with no re-implementation. Bighead's evaluation framework was notable: it automated the comparison between a new model version and the current production model, surfacing statistical significance of improvements before any A/B test.

LinkedIn Pro-ML (2019)

LinkedIn's Pro-ML platform emphasized model governance - who trained a model, on what data, with what approval, for what purpose. At LinkedIn's scale (thousands of models across hundreds of ML practitioners), governance is not a compliance checkbox; it is a safety requirement. Pro-ML implemented a mandatory review gate before any model could reach production: automated checks (performance regression, fairness metrics) plus a human sign-off.


:::danger Missing the Developer Experience

The most common ML platform failure is building the infrastructure without building the developer experience. An ML platform is not a cluster of GPUs. It is the SDK, CLI, notebooks, documentation, and tutorials that make those GPUs accessible to a data scientist. Uber invested heavily in Michelangelo's web UI and Python SDK because they understood that adoption is the success metric, not technical capability.

A platform that requires 200 lines of Kubernetes YAML to submit a training job will not be adopted, no matter how powerful the underlying infrastructure.

:::

:::warning Model Registry as an Afterthought

Many ML platforms are built bottom-up: feature store first, training infrastructure second, serving third, and model registry last - as an afterthought. This is backwards. The model registry is the central hub that connects training to serving, enables rollbacks, enforces governance, and provides the audit trail. Build the registry design first. Let it drive the interfaces between all other components.

:::


Interview Q&A

Q1: What are the components of an ML platform and why does each exist?

An ML platform has four layers. The data layer (feature store + dataset registry + label pipeline) solves the problem of data duplication and inconsistency across teams - one place to compute and share features, one place to manage training datasets. The training layer (orchestrator + experiment tracker + distributed training) solves the problem of reproducibility and resource management - every experiment is logged, every training job runs on managed infrastructure. The serving layer (model registry + deployment pipeline + inference service) solves the problem of model governance and deployment consistency - models go through a defined promotion pipeline before reaching production. The monitoring layer (prediction logger + drift detector + dashboards) solves the problem of silent degradation - production models are actively monitored for distributional shift.


Q2: How does a feature store differ from a feature engineering pipeline?

A feature engineering pipeline is code that computes features. A feature store is infrastructure that stores, versions, serves, and shares features. The pipeline produces the features; the store makes them available across teams with consistent definitions and low-latency access.

Specifically, the feature store solves three problems a pipeline alone cannot: (1) online serving latency - a pipeline runs in batch over hours, but serving needs features in 5ms; the feature store pre-computes and materializes features for low-latency lookup; (2) cross-team sharing - different teams reuse the same feature definitions without re-implementation; (3) training-serving consistency - the same feature definition used in the offline training dataset is used in the online serving path, preventing skew.
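Points (1) and (3) can be illustrated together: one feature definition feeds both the offline training path and a materialized online lookup. In this sketch the in-memory `PURCHASES` table, the dict standing in for Redis or DynamoDB, and all function names are assumptions for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical raw data: purchase timestamps per user.
PURCHASES = {
    "u1": [datetime(2024, 1, 5), datetime(2024, 1, 20), datetime(2024, 2, 25)],
}


def user_30day_purchase_count(purchases: list[datetime], as_of: datetime) -> int:
    """Single source of truth for the feature definition."""
    window_start = as_of - timedelta(days=30)
    return sum(1 for ts in purchases if window_start < ts <= as_of)


def build_training_value(user_id: str, event_time: datetime) -> int:
    """Offline path: compute the feature as of a historical event time."""
    return user_30day_purchase_count(PURCHASES[user_id], event_time)


def materialize_online(as_of: datetime) -> dict[str, int]:
    """Batch job: precompute current values into a key-value store (a dict here)."""
    return {
        user_id: user_30day_purchase_count(purchases, as_of)
        for user_id, purchases in PURCHASES.items()
    }


online_store = materialize_online(datetime(2024, 3, 1))


def serve_feature(user_id: str) -> int:
    """Online path: a constant-time lookup, no recomputation per request."""
    return online_store.get(user_id, 0)


print(build_training_value("u1", datetime(2024, 2, 1)), serve_feature("u1"))
```

Because both paths call the same function, a change to the window length or boundary condition cannot silently apply to training but not to serving.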


Q3: When should a company use SageMaker or Vertex AI instead of building an in-house ML platform?

Use managed platforms when team size is under 20 ML practitioners, customization needs are modest, and speed to first production model matters. SageMaker and Vertex AI handle cluster provisioning, container management, autoscaling, and basic experiment tracking. The overhead of building and operating a custom platform (a dedicated platform engineering team of 3-5 people) is not justified at small scale.

Build your own when: (1) you have 20+ practitioners generating enough platform traffic to justify dedicated engineering; (2) you have unique infrastructure requirements (on-prem GPUs, specialized hardware) that managed platforms do not support; (3) the managed platform cost at your scale exceeds the cost of building and operating your own - typically this crossover happens around $500K-1M/year in platform spend; (4) you have strict data residency or compliance requirements that prohibit cloud-managed infrastructure.


Q4: How does Uber Michelangelo handle feature sharing across teams?

Michelangelo has a centralized feature catalogue - essentially a schema registry for ML features. Any feature that a team computes and registers in the catalogue becomes available to all other teams. Features are defined by three things: the entity type (user, driver, trip), the feature name, and the computation logic (a SQL or Python definition). Michelangelo's data pipeline recomputes all registered features on a schedule, writing results to the shared online store (Cassandra) and offline store (Hive).

When a new team wants to use an existing feature, they reference it by name. Michelangelo ensures they get the same value as anyone else who uses that feature. This eliminates the "N teams computing the same feature N different ways" problem, which Uber estimated was responsible for significant compute waste and model inconsistency before Michelangelo.
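The catalogue idea can be sketched as a registry keyed by entity type and feature name. This is a generic illustration, not Michelangelo's actual API: the class names, fields, and the conflict rule are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureDefinition:
    entity_type: str  # e.g. "user", "driver", "trip"
    name: str
    logic: str  # SQL or Python expression, stored as text
    owner: str


class FeatureCatalogue:
    """Hypothetical catalogue: one definition per (entity_type, name)."""

    def __init__(self) -> None:
        self._definitions: dict[tuple[str, str], FeatureDefinition] = {}

    def register(self, definition: FeatureDefinition) -> None:
        key = (definition.entity_type, definition.name)
        existing = self._definitions.get(key)
        if existing is not None and existing.logic != definition.logic:
            # Two teams may not redefine the same feature differently.
            raise ValueError(
                f"{key} is already defined by {existing.owner} with different logic"
            )
        self._definitions.setdefault(key, definition)

    def get(self, entity_type: str, name: str) -> FeatureDefinition:
        return self._definitions[(entity_type, name)]

    def browse(self, entity_type: str) -> list[str]:
        """List the feature names available for an entity type."""
        return sorted(
            name for (etype, name) in self._definitions if etype == entity_type
        )


catalogue = FeatureCatalogue()
catalogue.register(
    FeatureDefinition(
        entity_type="user",
        name="user_30day_purchase_count",
        logic="SELECT COUNT(*) FROM purchases WHERE ts > NOW() - INTERVAL 30 DAY",
        owner="pricing",
    )
)
# A second team looks the feature up by name instead of re-implementing it.
print(catalogue.get("user", "user_30day_purchase_count").owner)
print(catalogue.browse("user"))
```

Rejecting a conflicting definition at registration time is what turns "same name" into "same value" for every consumer.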


Q5: How do you design a model promotion pipeline for an ML platform?

A model promotion pipeline has four gates, each of which must pass before the model advances to the next stage.

Gate 1 (Development to Staging): automated checks - all unit tests pass, model artifact size is within bounds, inference latency is within SLA (measured in staging environment), model metrics are above minimum thresholds.

Gate 2 (Staging): integration testing - the model serves traffic in a staging environment with production-like data. Shadow mode testing: requests are duplicated to staging model and production model; results are compared but only the production model's response is returned to users. Significant divergence triggers investigation.

Gate 3 (Human Approval): a designated reviewer (tech lead or MLOps engineer) reviews the experiment tracker results, the shadow mode comparison, and the fairness metrics report. They approve or reject in the model registry UI. No model reaches production without human sign-off.

Gate 4 (Canary Deployment): the model is deployed to 5% of traffic. Automated checks monitor for error rate increases, latency regressions, and prediction distribution shifts. If clean for 24 hours, traffic ramps to 25%, then 100%.
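Gates 2 and 4 lend themselves to a short sketch. The function names, the metric fields, and every threshold below are hypothetical values chosen for illustration; only the 5% / 25% / 100% ramp comes from the text above.

```python
from dataclasses import dataclass
from typing import Callable

RAMP_STEPS = [5, 25, 100]  # percent of traffic per canary stage


def shadow_serve(
    request: float,
    production_model: Callable[[float], float],
    shadow_model: Callable[[float], float],
    log: list,
    tolerance: float = 0.1,  # assumed divergence threshold
) -> float:
    """Gate 2: score with both models, return only the production result."""
    production_score = production_model(request)
    try:
        shadow_score = shadow_model(request)
        diverged = abs(production_score - shadow_score) > tolerance
        log.append({"request": request, "diverged": diverged})
    except Exception as exc:
        # A failing shadow model must never affect the user-facing response.
        log.append({"request": request, "error": repr(exc)})
    return production_score


@dataclass(frozen=True)
class ServingMetrics:
    error_rate: float
    p99_latency_ms: float
    mean_prediction: float


def is_healthy(canary: ServingMetrics, baseline: ServingMetrics) -> bool:
    """Gate 4 checks: errors, latency, and prediction shift (assumed limits)."""
    return (
        canary.error_rate - baseline.error_rate <= 0.001
        and canary.p99_latency_ms <= baseline.p99_latency_ms * 1.10
        and abs(canary.mean_prediction - baseline.mean_prediction) <= 0.05
    )


def next_traffic_percent(
    current: int, canary: ServingMetrics, baseline: ServingMetrics
) -> int:
    """Advance one ramp step if healthy; otherwise roll back to 0%."""
    if not is_healthy(canary, baseline):
        return 0
    idx = RAMP_STEPS.index(current)
    return RAMP_STEPS[min(idx + 1, len(RAMP_STEPS) - 1)]


baseline = ServingMetrics(error_rate=0.010, p99_latency_ms=50.0, mean_prediction=0.30)
good_canary = ServingMetrics(error_rate=0.0105, p99_latency_ms=52.0, mean_prediction=0.31)
bad_canary = ServingMetrics(error_rate=0.050, p99_latency_ms=52.0, mean_prediction=0.31)
print(next_traffic_percent(5, good_canary, baseline))
print(next_traffic_percent(25, bad_canary, baseline))
```

In practice the health check would run on a schedule (for example after the 24-hour soak mentioned above), and a return value of 0 would trigger an automatic rollback to the previous production version.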


Summary

An ML platform is shared infrastructure - data layer, training layer, serving layer, monitoring layer - that lets ML practitioners focus on models rather than infrastructure. The feature store solves cross-team duplication. The experiment tracker solves reproducibility. The model registry solves governance. The monitoring layer solves silent degradation. Build vs buy: managed platforms (SageMaker, Vertex AI) win for teams under 20; custom platforms win at scale. The case studies from Uber (Michelangelo), Airbnb (Bighead), and LinkedIn (Pro-ML) all share one lesson: developer experience is as important as technical capability. A platform that practitioners do not adopt is not a platform - it is expensive unused infrastructure.

© 2026 EngineersOfAI. All rights reserved.