ML Platform Design

When Data Scientists Spend 70% of Their Time on Infrastructure

The 2022 State of Data Science report found that data scientists at mid-size companies spend 39% of their time on data preparation and 26% on infrastructure and deployment - only 35% on actual model development. At a company with 50 data scientists at an average total compensation of $200,000, this means roughly $6.5 million per year (65% of a $10 million payroll) is being spent on work that no one hired data scientists to do.

The VP of AI at a fintech firm with 50 data scientists was looking at exactly this problem. Her team had 14 different scripts for launching training jobs, none of which shared configuration formats. They had 4 different ways to log experiment results, and when a regulator asked "which model is running in production and what data was it trained on?", it took three days to produce an answer - and they were not certain the answer was correct. New data scientists took 6-8 weeks to produce their first production model, spending most of that time learning the bespoke infrastructure.

She hired a platform engineering team of five. Their mandate: build an ML platform that lets a data scientist go from "I have an idea" to "my model is in production" in two weeks, not three months, and make every model's lineage traceable for compliance.

This case study designs that platform from scratch.

Why ML Platforms Exist

The pattern is consistent across ML organizations: every team starts building their own infrastructure. Feature preprocessing scripts, training launchers, model serialization code, deployment scripts, monitoring dashboards. Each team's solution is slightly different. When a data scientist joins a new team, they start over. When the organization wants to standardize monitoring or compliance, they find 15 different systems to integrate with.

An ML platform is the investment that converts duplicated individual infrastructure into shared organizational infrastructure. Technically it is infrastructure, not a product that anyone sells - but it has to be run like a product. The distinction matters: a product has users who choose to adopt it, while infrastructure that is forced on users breeds resentment. The best ML platforms are adopted voluntarily because they make the data scientist's life genuinely easier.

Platform Components

Component 1: Feature Store

The feature store is the most impactful component of any ML platform. It solves three problems simultaneously: training-serving skew (features computed differently during training and serving), feature duplication (10 teams each computing "user 30-day transaction count" slightly differently), and feature discoverability (new data scientists cannot find what features already exist).

Architecture: A feature store has two components:

  • Offline store (batch): historical features for training. Implemented on top of a data warehouse (BigQuery, Snowflake) or data lake (Parquet on S3). Supports point-in-time correct feature retrieval - crucial for avoiding data leakage.
  • Online store (real-time): current feature values for inference. Implemented in Redis or DynamoDB, with a target read latency under 5 ms. Kept fresh by streaming pipelines or by scheduled materialization from the offline store.

from datetime import timedelta
from typing import Dict, List

import pandas as pd
from feast import Entity, FeatureStore, FeatureView, Field, FileSource, ValueType
from feast.types import Float32, Int64


# Define the entity (the join key that features are attached to).
# Entity arguments vary somewhat between Feast releases; check the version you run.
user_entity = Entity(
    name="user_id",
    join_keys=["user_id"],
    description="User identifier",
    value_type=ValueType.STRING,
)

# Define a feature view - a logical group of related features
user_transaction_features = FeatureView(
    name="user_transaction_features",
    entities=[user_entity],  # recent Feast releases expect Entity objects, not names
    ttl=timedelta(days=1),  # online store TTL
    schema=[
        Field(name="txn_count_7d", dtype=Int64),
        Field(name="txn_amount_sum_7d", dtype=Float32),
        Field(name="txn_count_30d", dtype=Int64),
        Field(name="avg_txn_amount_30d", dtype=Float32),
        Field(name="days_since_last_txn", dtype=Int64),
    ],
    source=FileSource(  # offline store source
        path="s3://ml-feature-store/user_transactions/",
        timestamp_field="event_timestamp",
    ),
    online=True,  # also serve from online store
)


class FeatureStoreClient:
    """
    Application-level wrapper around the feature store.
    Provides training data retrieval and online serving.
    """

    def __init__(self, repo_path: str = "."):
        # repo_path is the directory that contains feature_store.yaml
        self.fs = FeatureStore(repo_path=repo_path)

    def get_training_data(
        self,
        entity_df: pd.DataFrame,  # must have entity columns + "event_timestamp"
        feature_refs: List[str],  # ["user_transaction_features:txn_count_7d", ...]
    ) -> pd.DataFrame:
        """
        Retrieve point-in-time correct historical features for training.
        For each row in entity_df, retrieves features as they were at event_timestamp.
        This prevents future data leakage in training datasets.
        """
        return self.fs.get_historical_features(
            entity_df=entity_df,
            features=feature_refs,
        ).to_df()

    def get_online_features(
        self,
        entity_rows: List[Dict],  # [{"user_id": "u123"}, ...]
        feature_refs: List[str],
    ) -> Dict:
        """
        Retrieve current feature values for real-time inference.
        Latency target: sub-5ms from Redis.
        """
        return self.fs.get_online_features(
            features=feature_refs,
            entity_rows=entity_rows,
        ).to_dict()

    def materialize_features(self, start_date: str, end_date: str):
        """
        Run the batch job that copies features from offline to online store.
        Should be run as a daily scheduled job.
        """
        self.fs.materialize(
            start_date=pd.Timestamp(start_date).to_pydatetime(),
            end_date=pd.Timestamp(end_date).to_pydatetime(),
        )

Point-in-time correctness is the feature store's most important property. Training data for a model trained in 2024 should use features as they existed in 2022 (when the training examples occurred), not as they exist today. Without point-in-time correct retrieval, you leak future information into training and get a model that looks great offline but fails in production.
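To make the idea concrete, here is a minimal sketch of an "as of" lookup in plain Python. It illustrates the concept only and is not how Feast implements it; `feature_history` and `feature_as_of` are hypothetical names, and the values are made up.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical history of txn_count_30d for one user:
# (timestamp at which the value became valid, value), sorted by timestamp.
feature_history = [
    (datetime(2022, 1, 1), 3),
    (datetime(2022, 6, 1), 8),
    (datetime(2024, 1, 1), 40),
]


def feature_as_of(history, event_time):
    """Return the latest value recorded at or before event_time, or None."""
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, event_time)
    return history[idx - 1][1] if idx > 0 else None


# A training example that occurred in March 2022 must see the 2022 value,
# not the value a naive "join on current features" would return.
label_event_time = datetime(2022, 3, 15)
point_in_time_value = feature_as_of(feature_history, label_event_time)
latest_value = feature_history[-1][1]
print(point_in_time_value, latest_value)
```

The naive join would hand the model a value (40) that did not exist when the label was generated, which is exactly the leakage described above.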

Component 2: Experiment Tracking

Every training run should be automatically logged: hyperparameters, metrics (training loss, validation metrics), model artifacts, training data version, and code version (git hash).

MLflow is a common choice for self-hosted experiment tracking. It provides a Python API for logging within training code, a web UI for comparing experiments, and artifact storage for model files.

import hashlib
import subprocess
from typing import Any, Dict, List

import mlflow
import mlflow.pytorch
import pandas as pd
import torch
from mlflow.models.signature import infer_signature


class MLPlatformTrainer:
    """
    Training wrapper that automatically logs to the ML platform.
    Data scientists use this instead of raw PyTorch training loops.
    Assumes FeatureStoreClient from the feature store section is importable.
    """

    def __init__(self, experiment_name: str, tracking_uri: str = "http://mlflow:5000"):
        mlflow.set_tracking_uri(tracking_uri)
        mlflow.set_experiment(experiment_name)

    def train(
        self,
        model: torch.nn.Module,
        train_config: Dict[str, Any],
        feature_refs: List[str],
        entity_df: pd.DataFrame,
        n_epochs: int = 10,
    ) -> str:
        """
        Train a model with full automatic logging to MLflow.
        Logs: hyperparameters, metrics per epoch, model artifact, data version.
        """
        with mlflow.start_run() as run:
            # Log all configuration
            mlflow.log_params(train_config)
            mlflow.set_tag("feature_refs", str(feature_refs))
            mlflow.set_tag("data_hash", self._hash_dataframe(entity_df))
            mlflow.set_tag("git_commit", self._get_git_hash())
            mlflow.set_tag("model_class", type(model).__name__)

            # Get training data from feature store
            feature_store = FeatureStoreClient()
            training_data = feature_store.get_training_data(entity_df, feature_refs)
            # log_dict writes a dict as a JSON artifact (log_artifact expects a local file path)
            mlflow.log_dict(self._get_schema(training_data), "training_data_schema.json")

            # Training loop
            optimizer = torch.optim.Adam(model.parameters(), lr=train_config.get("lr", 1e-3))
            best_val_loss = float("inf")

            for epoch in range(n_epochs):
                train_loss = self._train_epoch(model, training_data, optimizer)
                # In practice, evaluate on a held-out split rather than the training data
                val_metrics = self._evaluate(model, training_data)

                # Log metrics
                mlflow.log_metric("train_loss", train_loss, step=epoch)
                for metric_name, value in val_metrics.items():
                    mlflow.log_metric(metric_name, value, step=epoch)

                # Save best model
                val_loss = val_metrics.get("val_loss", float("inf"))
                if val_loss < best_val_loss:
                    best_val_loss = val_loss
                    mlflow.pytorch.log_model(model, "best_model")

            # Log final model with signature
            mlflow.pytorch.log_model(
                model,
                "final_model",
                signature=self._infer_signature(model, training_data),
            )

        return run.info.run_id

    def _train_epoch(self, model, data: pd.DataFrame, optimizer) -> float:
        """Run one epoch and return the mean training loss. Model-specific."""
        raise NotImplementedError

    def _evaluate(self, model, data: pd.DataFrame) -> Dict[str, float]:
        """Return validation metrics, including "val_loss". Model-specific."""
        raise NotImplementedError

    def _get_git_hash(self) -> str:
        return subprocess.check_output(["git", "rev-parse", "HEAD"]).decode().strip()

    def _hash_dataframe(self, df: pd.DataFrame) -> str:
        # MD5 is used as a content fingerprint here, not for security
        return hashlib.md5(pd.util.hash_pandas_object(df).values.tobytes()).hexdigest()

    def _get_schema(self, df: pd.DataFrame) -> dict:
        return {col: str(dtype) for col, dtype in df.dtypes.items()}

    def _infer_signature(self, model: torch.nn.Module, df: pd.DataFrame):
        # Assumes the model consumes every numeric column; entity keys and timestamps are dropped
        sample_input = df.select_dtypes("number").head(5).to_numpy(dtype="float32")
        with torch.no_grad():
            sample_output = model(torch.from_numpy(sample_input)).numpy()
        return infer_signature(sample_input, sample_output)

Component 3: Model Registry

The model registry is the source of truth for which models exist, their quality metrics, and which version is in production.

A model moves through lifecycle stages: Staging (tested, not yet in production), Production (serving live traffic), Archived (retired).

from typing import Dict, Optional

import mlflow
import mlflow.pytorch
from mlflow.tracking import MlflowClient


class ModelRegistryClient:
    """
    Manages model lifecycle in the MLflow Model Registry.
    Provides approval workflows and deployment integration.

    Note: the stage-based calls used here (get_latest_versions with stages,
    transition_model_version_stage) are deprecated in recent MLflow releases
    in favor of model version aliases; check the version you deploy.
    """

    def __init__(self, tracking_uri: str = "http://mlflow:5000"):
        mlflow.set_tracking_uri(tracking_uri)
        self.client = MlflowClient()

    def register_model(
        self,
        run_id: str,
        model_name: str,
        metrics: Dict[str, float],
        tags: Optional[Dict[str, str]] = None,
    ) -> str:
        """Register a trained model and return the version number."""
        model_uri = f"runs:/{run_id}/best_model"
        model_version = mlflow.register_model(model_uri, model_name)

        # Add metrics and tags to the version
        self.client.set_model_version_tag(model_name, model_version.version, "run_id", run_id)
        for k, v in (tags or {}).items():
            self.client.set_model_version_tag(model_name, model_version.version, k, str(v))

        # Store metrics as version tags so versions can be compared in the registry
        for metric_name, value in metrics.items():
            self.client.set_model_version_tag(
                model_name, model_version.version, f"metric_{metric_name}", str(value)
            )

        return model_version.version

    def promote_to_production(
        self,
        model_name: str,
        version: str,
        approved_by: str,
        approval_notes: str = "",
    ):
        """
        Promote a model to production. Requires explicit approval.
        Archives the current production version.
        """
        # Archive current production models
        current_prod = self.client.get_latest_versions(model_name, stages=["Production"])
        for prod_model in current_prod:
            self.client.transition_model_version_stage(
                model_name, prod_model.version, "Archived"
            )

        # Promote new version
        self.client.transition_model_version_stage(model_name, version, "Production")
        self.client.set_model_version_tag(model_name, version, "approved_by", approved_by)
        self.client.set_model_version_tag(model_name, version, "approval_notes", approval_notes)

        print(f"Model {model_name} v{version} promoted to production by {approved_by}")

    def get_production_model(self, model_name: str):
        """Load the current production model."""
        model_uri = f"models:/{model_name}/Production"
        return mlflow.pytorch.load_model(model_uri)

    def get_model_lineage(self, model_name: str, version: str) -> dict:
        """Return full lineage: model → training run → data version → code version."""
        mv = self.client.get_model_version(model_name, version)
        run_id = mv.tags.get("run_id") or mv.run_id
        run = self.client.get_run(run_id)

        return {
            "model_name": model_name,
            "version": version,
            "run_id": run_id,
            "git_commit": run.data.tags.get("git_commit"),
            "data_hash": run.data.tags.get("data_hash"),
            "feature_refs": run.data.tags.get("feature_refs"),
            "training_params": run.data.params,
            "metrics": run.data.metrics,
            "created_at": mv.creation_timestamp,
            "approved_by": mv.tags.get("approved_by"),
        }

Component 4: Model Serving Infrastructure

Models registered in the registry are deployed via a serving platform. NVIDIA Triton Inference Server handles multi-model serving with hardware optimization:

from typing import Dict

import numpy as np
import tritonclient.http as httpclient
from tritonclient.utils import np_to_triton_dtype


class TritonModelServer:
    """
    Client for NVIDIA Triton Inference Server.
    Triton handles: GPU batching, model versioning, multi-model serving.
    """

    def __init__(self, server_url: str = "triton:8000"):
        # The HTTP client expects host:port without an http:// scheme
        self.client = httpclient.InferenceServerClient(url=server_url)

    def infer(
        self,
        model_name: str,
        inputs: Dict[str, np.ndarray],
        model_version: str = "",  # empty string = let the server pick per its version policy
    ) -> Dict[str, np.ndarray]:
        """
        Run inference on a deployed model.
        Triton handles batching, GPU placement, and model loading.
        """
        triton_inputs = []
        for name, array in inputs.items():
            # Derive the Triton datatype from the array instead of hard-coding FP32
            inp = httpclient.InferInput(name, list(array.shape), np_to_triton_dtype(array.dtype))
            inp.set_data_from_numpy(array)
            triton_inputs.append(inp)

        response = self.client.infer(
            model_name,
            inputs=triton_inputs,
            model_version=model_version,
        )

        # The response body lists every returned output by name
        output_names = [out["name"] for out in response.get_response()["outputs"]]
        return {name: response.as_numpy(name) for name in output_names}

    def get_model_metadata(self, model_name: str) -> dict:
        return self.client.get_model_metadata(model_name)

    def is_model_ready(self, model_name: str, version: str = "") -> bool:
        return self.client.is_model_ready(model_name, model_version=version)

A/B Testing Gateway

A custom gateway routes a fraction of production traffic to the challenger model while the champion model serves the rest:

import hashlib
from typing import Callable, Dict, Optional

import numpy as np


class ABTestingGateway:
    """
    Routes inference requests between champion and challenger models.
    Logs routing decisions for experiment analysis.
    Assumes TritonModelServer from the serving section is importable.
    """

    def __init__(
        self,
        champion_model: str,
        challenger_model: str,
        challenger_traffic_fraction: float = 0.10,
        experiment_id: Optional[str] = None,
    ):
        self.champion = champion_model
        self.challenger = challenger_model
        self.challenger_fraction = challenger_traffic_fraction
        self.experiment_id = experiment_id
        self.triton = TritonModelServer()

    def route_and_infer(
        self,
        user_id: str,
        inputs: Dict[str, np.ndarray],
        log_callback: Optional[Callable] = None,
    ) -> dict:
        """Route request to champion or challenger. Log the routing decision."""
        # Deterministic routing per user (same user always hits same model).
        # Python's built-in hash() is salted per process for strings, so a
        # stable digest is used to keep buckets consistent across restarts.
        bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
        use_challenger = bucket < (self.challenger_fraction * 100)

        model_name = self.challenger if use_challenger else self.champion
        variant = "challenger" if use_challenger else "champion"

        result = self.triton.infer(model_name, inputs)

        if log_callback:
            # Stable fingerprint of the raw input bytes, in a fixed key order
            inputs_digest = hashlib.sha256(
                b"".join(inputs[name].tobytes() for name in sorted(inputs))
            ).hexdigest()
            log_callback({
                "user_id": user_id,
                "experiment_id": self.experiment_id,
                "variant": variant,
                "model": model_name,
                "inputs_hash": inputs_digest,
            })

        return {**result, "variant": variant}

Platform Adoption Strategies

The hardest part of ML platform development is not the technology - it is adoption. Data scientists who have built their own workflows resist switching to a new system, especially if it adds friction to their current process.

Adoption principles from successful platforms:

Meet data scientists where they are: The platform's Python SDK should feel like ordinary Python. Do not require data scientists to learn new configuration languages, YAML schemas, or domain-specific APIs. The first interaction should be adding three lines to an existing training script.

Make the default path the right path: If using the feature store is easier than reading from S3 directly, data scientists will use it. If experiment tracking is automatic (no explicit logging calls), it will be used. Design the platform so the best practice is also the path of least resistance.

Start with acute pain points: The first platform feature should solve the problem that data scientists complain about most. At most organizations, this is either: "I can't reproduce that experiment from 3 months ago" (experiment tracking) or "it takes me 2 weeks to deploy a model" (serving infrastructure). Solve one problem excellently before building the full platform.

Provide escape hatches: Data scientists doing research need flexibility. Do not lock them into platform APIs for experimental work. Provide a way to use the platform for production models while keeping research ad-hoc. The platform is for production; notebooks are for exploration.

Quantify the value: Track adoption metrics (percentage of training jobs going through the platform) and value metrics (time from model idea to production deployment, time spent debugging deployment issues). Present these to leadership monthly. This demonstrates ROI and creates organizational pressure for adoption.
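The two kinds of metrics in the last principle are cheap to compute once job and deployment records exist. The sketch below is one possible shape, assuming hypothetical `TrainingJob` and `Deployment` records with made-up values; a real platform would read these from its job scheduler and model registry.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class TrainingJob:
    job_id: str
    via_platform: bool  # launched through the platform's training launcher?


@dataclass
class Deployment:
    model_name: str
    days_idea_to_production: float


def adoption_rate(jobs):
    """Adoption metric: share of training jobs that went through the platform."""
    if not jobs:
        return 0.0
    return sum(job.via_platform for job in jobs) / len(jobs)


def median_time_to_production(deployments):
    """Value metric: median days from model idea to production deployment."""
    return median(d.days_idea_to_production for d in deployments)


# Illustrative records only
jobs = [
    TrainingJob("j1", True),
    TrainingJob("j2", True),
    TrainingJob("j3", False),
    TrainingJob("j4", True),
]
deployments = [
    Deployment("fraud_v3", 14),
    Deployment("churn_v1", 30),
    Deployment("credit_v2", 21),
]

print(f"adoption: {adoption_rate(jobs):.0%}")
print(f"median days to production: {median_time_to_production(deployments)}")
```

The median is used rather than the mean so that one unusually slow project does not dominate the monthly report.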

Common ML Platform Mistakes

Mistake: Building a platform that requires data scientists to change how they work.

The most successful ML platforms (Uber Michelangelo, LinkedIn Pro-ML) made existing ML workflows easier, not different. If your platform requires data scientists to rewrite their code in a new framework, they will resist and route around it. Build platform features that integrate with existing workflows (add logging to their existing training script, add feature retrieval as a library call) rather than replacing their workflows.

Mistake: Building the full platform before validating with users.

Platform engineering teams routinely build 12-month roadmaps: feature store, experiment tracking, model registry, serving, monitoring, governance - all at once. The result is a platform that is deployed 18 months later, does not match what data scientists actually needed, and has low adoption because no one was consulted during development. Build one component, deploy it to 5 data scientists, measure adoption and satisfaction, iterate. Repeat. The full platform emerges from this iterative process.

Mistake: Treating the feature store as a batch system only.

A feature store that only serves offline training data is incomplete. Production models need the same feature values at inference time that they used during training. Without an online serving path, teams implement their own feature computation for serving - and compute features differently than the training pipeline, causing training-serving skew. Always build the online store alongside the offline store, even if the online store is initially just Redis with a simple key-value interface.

Tip: Measure training-serving skew explicitly for every production model.

Training-serving skew - features computed differently between training and serving - is the most common cause of offline metrics not matching online performance. Add an automated skew detection job that samples a small fraction of production requests, recomputes the features using the training pipeline, and compares to the features actually used in serving. Alert if skew exceeds a threshold (e.g., more than 5% of feature values differ by more than 1%). This catches pipeline bugs early before they cause silent model degradation.
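The comparison step of such a job can be sketched as follows. This assumes the sampling and recomputation have already produced pairs of (value used in serving, value recomputed by the training pipeline); the function names and the example numbers are hypothetical, and the thresholds mirror the 1% and 5% figures above.

```python
def relative_diff(serving_value, training_value):
    """Relative difference, measured against the training-pipeline value."""
    denom = max(abs(training_value), 1e-12)  # guard against division by zero
    return abs(serving_value - training_value) / denom


def skew_rate(pairs, value_tolerance=0.01):
    """Fraction of (serving, recomputed) pairs that differ by more than value_tolerance."""
    if not pairs:
        return 0.0
    mismatches = sum(1 for s, t in pairs if relative_diff(s, t) > value_tolerance)
    return mismatches / len(pairs)


def should_alert(pairs, value_tolerance=0.01, max_skew_rate=0.05):
    """Alert when more than max_skew_rate of sampled values are skewed."""
    return skew_rate(pairs, value_tolerance) > max_skew_rate


# 20 sampled feature values: 18 agree, 2 were computed differently in serving
sampled = [(10.0, 10.0)] * 18 + [(10.0, 12.0), (7.0, 14.0)]
print(skew_rate(sampled), should_alert(sampled))
```

In practice the rate would be tracked per feature, so that an alert points at the specific pipeline that drifted.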

Interview Q&A

Q: What components make up an ML platform and which should you build first?

A: An ML platform typically has four layers: data (feature store, data versioning), development (experiment tracking, compute), governance (model registry, lineage), and serving (model server, monitoring, A/B testing). Build them in order of impact on data scientist productivity. Most organizations should start with experiment tracking - it solves the "I can't reproduce that experiment" problem, which every team has, and it is relatively easy to implement (MLflow is open source and can be self-hosted in about a week). Next, tackle the deployment bottleneck - usually a standardized serving infrastructure that reduces deployment time from weeks to days. The feature store is third priority - it is the most complex to build and requires buy-in from data engineering as well as data science. Model governance and compliance tooling can often be deferred until regulatory requirements demand it. The rule is: solve the most acute current pain first, not what seems architecturally most important.

Q: What is training-serving skew and how does an ML platform prevent it?

A: Training-serving skew occurs when features are computed differently between the training pipeline and the serving pipeline. The model is trained on feature values computed by the training pipeline. At inference time, a different code path computes the same features and passes them to the model - but if that code path has different logic (different aggregation window, different handling of null values, different normalization), the model receives different inputs than it trained on. The result: the model performs worse in production than offline metrics predict, with no obvious error signal. ML platforms prevent this by making the feature computation code the single source of truth: the same code path and the same feature definitions are used for both training data retrieval (from the offline store) and online serving (from the online store). Feast and Tecton enforce this by design - you define a feature view once, and both the batch materialization job and the online serving API use the same feature logic.
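A minimal sketch of the "single source of truth" idea, in plain Python rather than any feature store's API: one hypothetical function, `txn_count`, defines the aggregation window, and both the training path and the serving path call it, so the two cannot drift apart.

```python
from datetime import datetime, timedelta


def txn_count(transaction_times, as_of, window_days):
    """The one definition of the feature: transactions in (as_of - window, as_of]."""
    start = as_of - timedelta(days=window_days)
    return sum(1 for ts in transaction_times if start < ts <= as_of)


def build_training_row(transaction_times, event_time):
    """Offline path: feature value as of the historical event time."""
    return {"txn_count_7d": txn_count(transaction_times, event_time, 7)}


def serve_features(transaction_times, now):
    """Online path: the same definition, evaluated at request time."""
    return {"txn_count_7d": txn_count(transaction_times, now, 7)}


# Illustrative transaction timestamps for one user
history = [
    datetime(2024, 3, 1),
    datetime(2024, 3, 5),
    datetime(2024, 3, 6),
    datetime(2024, 3, 20),
]
moment = datetime(2024, 3, 7)
print(build_training_row(history, moment), serve_features(history, moment))
```

Skew appears when the serving path reimplements the window on its own, for example with an inclusive lower bound or a different null-handling rule.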

Q: Should you build or buy an ML platform? When does each make sense?

A: Buy (use managed solutions like Databricks, Vertex AI, SageMaker) when: your team is small (under 20 data scientists), you are not yet at the scale where managed costs are prohibitive, your use cases are standard (not requiring custom serving architectures), and you cannot afford to have engineers focused on platform instead of models. Build when: you have more than 30 data scientists and managed costs are becoming prohibitive (Vertex AI at scale is very expensive), your models require non-standard serving (sub-millisecond latency, specialized hardware), you have strict data residency requirements that preclude cloud-managed services, or you have a dedicated platform team that can maintain custom infrastructure. The common mistake is building too early. A team of 10 data scientists should almost never build a custom ML platform - the opportunity cost (the engineers building it could be building models) exceeds the efficiency gains. The inflection point is typically around 20-30 data scientists, where coordination overhead and tooling fragmentation become visibly costly.

Q: How do you measure whether an ML platform is succeeding?

A: Two categories of metrics. Adoption metrics: percentage of training jobs using the platform's training launcher, percentage of production models registered in the model registry, percentage of features served through the feature store (vs. custom retrieval code), percentage of new data scientists onboarded using the platform's standard workflow. Productivity metrics: time from model idea to production deployment (should decrease over time), time spent debugging deployment issues per incident, number of production incidents caused by training-serving skew or missing lineage, time to answer "what data was this model trained on?" (should be under 5 minutes with good lineage tooling). Survey data scientists quarterly on platform satisfaction - a platform that is technically correct but that data scientists hate is failing. Adoption rate below 60% after 6 months is a signal to investigate what friction is preventing adoption, not to build more features.

© 2026 EngineersOfAI. All rights reserved.