:::tip 🎮 Interactive Playground
Visualize this concept: Try the Multi-Tenant ML Platform demo on the EngineersOfAI Playground - no code required.
:::
Self-Service ML Platform
The Platform Nobody Used
After eight months of engineering work, the ML platform was feature-complete. It had experiment tracking, a model registry, automated training pipelines, canary deployments, drift monitoring - everything on the roadmap. The platform team had poured 3,200 engineering hours into it.
Three months after launch, adoption was 23%. Seven data scientists out of thirty were using the platform for their models. The other twenty-three had found workarounds: running training jobs in notebooks, deploying models via ad-hoc Docker containers, monitoring with custom scripts. The platform existed. Nobody used it.
The post-mortem was uncomfortable. The platform team had spent eight months building features without talking to users. They had built what they thought data scientists needed, not what data scientists actually needed. The experiment tracking was "too complicated" - it required 40 lines of boilerplate to log a run. The deployment pipeline "didn't work with my model type." The CLI "assumed too much familiarity with Kubernetes." The documentation was written by engineers for engineers, not for data scientists who think in Python rather than YAML.
The platform team spent the next three months doing user research, redesigning the highest-friction surfaces, and rebuilding the onboarding experience. Adoption went from 23% to 78%. The lesson: platform engineering is product engineering. Features are not value. Adoption is value.
The Core Insight: Adoption Is the Metric
Most ML platform teams measure feature completeness, uptime, and latency. These are necessary but not sufficient. The only metric that matters is adoption - the fraction of ML workloads running on the platform vs outside it.
A platform with 10 features and 90% adoption is massively more valuable than a platform with 50 features and 30% adoption. Unadopted infrastructure is pure waste.
This requires a fundamental mental model shift: from "we are building infrastructure" to "we are running an internal product with users, and our users are data scientists."
The Notebook-to-Production Workflow
The most critical workflow to optimize is the path from "I trained a model in a notebook" to "my model is serving real traffic." Every step of friction on this path reduces adoption.
The Typical Friction Map
Before redesign, the platform required:
- Refactor notebook code into a training script (1–4 hours)
- Create a requirements.txt (30 minutes, frequently wrong)
- Write a Dockerfile (1–2 hours for data scientists unfamiliar with Docker)
- Push to container registry (15 minutes)
- Write Kubernetes YAML manifest (2–4 hours, requires K8s knowledge)
- Get PR approved by infra team (1–2 business days)
- Deploy and debug (2–4 hours)
Total: roughly 7–15 hours of hands-on work, plus a 1–2 business day wait for PR approval, requiring skills many data scientists don't have. Result: data scientists find ways around the process.
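To make the total concrete, the per-step estimates can be summed directly. This is a minimal sketch: the hour ranges are copied from the list above, the PR approval wait is treated as calendar time rather than hands-on effort, and the names `STEPS` and `total_hands_on_hours` are illustrative.

```python
# Hands-on time per step in hours (low, high), taken from the friction list.
# The PR approval wait (1-2 business days) is calendar time, so it is
# reported separately instead of being added to the hands-on total.
STEPS = {
    "refactor notebook into training script": (1.0, 4.0),
    "create requirements.txt": (0.5, 0.5),
    "write Dockerfile": (1.0, 2.0),
    "push to container registry": (0.25, 0.25),
    "write Kubernetes manifest": (2.0, 4.0),
    "deploy and debug": (2.0, 4.0),
}


def total_hands_on_hours(steps):
    """Sum the low and high estimates across all steps."""
    low = sum(lo for lo, _ in steps.values())
    high = sum(hi for _, hi in steps.values())
    return low, high


low, high = total_hands_on_hours(STEPS)
print(f"Hands-on effort: {low:.2f}-{high:.2f} hours, plus 1-2 business days of review wait")
```

Keeping a table like this per workflow makes it easy to see which step dominates and to re-measure after each redesign.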
The Redesigned Workflow
# Target: data scientist experience should look like this
# 1. Add 3 lines to existing training code
import engineersofai_platform as platform

with platform.training_run(
    name="bert-finetuning",
    tags={"team": "nlp", "project": "document-classifier"},
) as run:
    # Existing training code - no changes needed
    model = train_bert(config)
    run.log_metrics({"val_accuracy": 0.91})
    run.register_model(model, "document-classifier")

# 2. Deploy with one command (no Dockerfile, no YAML, no K8s knowledge)
# $ platform deploy document-classifier --production
# > Detecting model type: HuggingFace transformers
# > Building container (2 min)
# > Running validation tests (3 min)
# > Deploying to staging (1 min)
# > ✓ Deployed at: https://models.internal/document-classifier/v3
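One way a wrapper like `training_run` could work is as a context manager that records metadata around the user's unchanged training code. The sketch below is an assumption, not the platform's actual SDK: the `Run` class, its fields, and the in-memory storage are all hypothetical stand-ins for a real tracking backend.

```python
from contextlib import contextmanager
from dataclasses import dataclass, field
import time


@dataclass
class Run:
    """In-memory stand-in for a tracked training run (hypothetical)."""
    name: str
    tags: dict
    metrics: dict = field(default_factory=dict)
    registered_models: list = field(default_factory=list)
    status: str = "running"
    duration_s: float = 0.0

    def log_metrics(self, metrics: dict) -> None:
        self.metrics.update(metrics)

    def register_model(self, model, model_name: str) -> None:
        # A real platform would serialize and upload the model;
        # this sketch only records the name.
        self.registered_models.append(model_name)


@contextmanager
def training_run(name: str, tags: dict):
    """Track status and wall-clock duration around arbitrary training code."""
    run = Run(name=name, tags=tags)
    start = time.monotonic()
    try:
        yield run
        run.status = "succeeded"
    except Exception:
        run.status = "failed"
        raise
    finally:
        run.duration_s = time.monotonic() - start


with training_run("bert-finetuning", {"team": "nlp"}) as run:
    run.log_metrics({"val_accuracy": 0.91})
    run.register_model(object(), "document-classifier")

print(run.status, run.metrics, run.registered_models)
```

The point of the design is that failure handling and bookkeeping live in the wrapper, so the data scientist's code inside the `with` block stays untouched.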
Implementation: Smart Containerization
# Platform SDK: auto-detect model type and build appropriate container
from enum import Enum
import subprocess
import tempfile
from pathlib import Path


class ModelFramework(Enum):
    PYTORCH = "pytorch"
    TENSORFLOW = "tensorflow"
    SKLEARN = "sklearn"
    HUGGINGFACE = "huggingface"
    XGBOOST = "xgboost"
    CUSTOM = "custom"


def detect_model_framework(model_artifact_path: str) -> ModelFramework:
    """Detect model framework from artifact files."""
    path = Path(model_artifact_path)
    if (path / "config.json").exists() and (path / "tokenizer_config.json").exists():
        return ModelFramework.HUGGINGFACE
    elif list(path.glob("*.pt")) or list(path.glob("*.pth")):
        return ModelFramework.PYTORCH
    elif list(path.glob("saved_model.pb")):
        return ModelFramework.TENSORFLOW
    elif list(path.glob("*.ubj")):
        # Assumption: XGBoost models are saved in UBJSON format (e.g. model.ubj)
        return ModelFramework.XGBOOST
    elif list(path.glob("*.pkl")):
        # Heuristic: any pickle is treated as scikit-learn
        return ModelFramework.SKLEARN
    else:
        return ModelFramework.CUSTOM


# Frameworks without an entry here (TensorFlow, custom) are rejected at build time
BASE_IMAGES = {
    ModelFramework.HUGGINGFACE: "myregistry/hf-serving:transformers-4.38-cuda12",
    ModelFramework.PYTORCH: "myregistry/pytorch-serving:2.1-cuda12",
    ModelFramework.SKLEARN: "myregistry/sklearn-serving:1.4",
    ModelFramework.XGBOOST: "myregistry/xgboost-serving:1.7",
}


def build_serving_container(
    model_artifact_path: str,
    model_name: str,
    version: str,
) -> str:
    """
    Automatically build a serving container from a model artifact.
    No Dockerfile required from the data scientist.
    """
    framework = detect_model_framework(model_artifact_path)
    base_image = BASE_IMAGES.get(framework)
    if base_image is None:
        raise ValueError(
            f"No serving image for framework: {framework.value}. "
            "Please contact the platform team to add support."
        )
    # Generate Dockerfile from template. The artifact directory is used as the
    # build context, because COPY sources are resolved relative to the context.
    dockerfile_content = f"""
FROM {base_image}
# Copy model artifacts
COPY . /model/
# Set serving configuration
ENV MODEL_PATH=/model
ENV MODEL_NAME={model_name}
ENV MODEL_VERSION={version}
ENV FRAMEWORK={framework.value}
# Platform server handles all serving logic
CMD ["platform-server", "--model-path", "/model", "--port", "8080"]
"""
    image_tag = f"myregistry/user-models/{model_name}:{version}"
    with tempfile.TemporaryDirectory() as tmpdir:
        dockerfile_path = Path(tmpdir) / "Dockerfile"
        dockerfile_path.write_text(dockerfile_content)
        subprocess.run(
            ["docker", "build", "-t", image_tag,
             "-f", str(dockerfile_path), str(model_artifact_path)],
            check=True,
        )
    subprocess.run(["docker", "push", image_tag], check=True)
    return image_tag
Template-Based Workflows
Most ML workflows at a given company are variations of a small number of patterns:
- Fine-tune a HuggingFace model on domain data
- Train an XGBoost classifier on tabular data
- Build a RAG pipeline on a document corpus
- Train a recommendation model
Templates encode these patterns as one-click starting points:
# Platform CLI: create new ML project from template
# $ platform new --template bert-finetuning my-project
TEMPLATES = {
    "bert-finetuning": {
        "description": "Fine-tune BERT on a classification task",
        "files": {
            "train.py": "templates/bert_finetuning/train.py",
            "config.yaml": "templates/bert_finetuning/config.yaml",
            "requirements.txt": "templates/bert_finetuning/requirements.txt",
            "tests/test_model.py": "templates/bert_finetuning/tests/test_model.py",
        },
        "variables": ["model_name", "dataset_path", "num_labels"],
        "estimated_cost": "$15-50 on A10G GPU",
        "estimated_time": "2-6 hours",
    },
    "tabular-classifier": {
        "description": "Train XGBoost/LightGBM on tabular data",
        "files": {
            "train.py": "templates/tabular/train.py",
            "feature_engineering.py": "templates/tabular/feature_engineering.py",
            "evaluate.py": "templates/tabular/evaluate.py",
        },
        "variables": ["dataset_path", "target_column"],
        "estimated_cost": "$2-10 on CPU",
        "estimated_time": "30 minutes - 2 hours",
    },
    "rag-pipeline": {
        "description": "Build a RAG system with document retrieval",
        "files": {
            "ingest.py": "templates/rag/ingest.py",
            "pipeline.py": "templates/rag/pipeline.py",
            "serve.py": "templates/rag/serve.py",
        },
        "variables": ["document_source", "embedding_model", "llm_model"],
        "estimated_cost": "Depends on LLM API usage",
        "estimated_time": "1-3 hours",
    },
}
import json
from datetime import datetime, timezone
from pathlib import Path
from string import Template


class TemplateEngine:
    """Generate new ML projects from templates."""

    def create_from_template(
        self,
        template_name: str,
        project_name: str,
        variables: dict,
        output_dir: str,
    ) -> None:
        if template_name not in TEMPLATES:
            available = list(TEMPLATES.keys())
            raise ValueError(f"Unknown template. Available: {available}")
        template = TEMPLATES[template_name]
        # Fail early on missing variables: safe_substitute would otherwise
        # leave unresolved $placeholders in the generated files
        missing = [v for v in template["variables"] if v not in variables]
        if missing:
            raise ValueError(f"Missing template variables: {missing}")
        output_path = Path(output_dir) / project_name
        # Raises FileExistsError rather than overwriting an existing project
        output_path.mkdir(parents=True)
        for dest_file, template_file in template["files"].items():
            content = self._render_template(template_file, variables)
            dest = output_path / dest_file
            dest.parent.mkdir(parents=True, exist_ok=True)
            dest.write_text(content)
        # Write project metadata
        metadata = {
            "project_name": project_name,
            "template": template_name,
            "created_at": datetime.now(timezone.utc).isoformat(),
            "variables": variables,
        }
        (output_path / ".platform.json").write_text(json.dumps(metadata, indent=2))
        print(f"Created project at {output_path}")
        print("Next steps:")
        print(f"  cd {project_name}")
        print("  platform run --dev   # run locally")
        print("  platform deploy      # deploy to staging")

    def _render_template(self, template_path: str, variables: dict) -> str:
        """Render $name / ${name} placeholders with string.Template."""
        content = Path(template_path).read_text()
        return Template(content).safe_substitute(variables)
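The renderer above relies on Python's standard `string.Template`, which uses `$name` and `${name}` placeholders rather than Jinja2's `{{ name }}` syntax. The short sketch below shows how the lenient and strict substitution modes differ; the template text and variable values are made up for illustration.

```python
from string import Template

# Hypothetical template file content; placeholders use $name or ${name}.
template_text = (
    'MODEL_NAME = "$model_name"\n'
    "NUM_LABELS = ${num_labels}\n"
    'DATASET = "$dataset_path"\n'
)

# dataset_path is deliberately left out to show the difference.
variables = {"model_name": "bert-base-uncased", "num_labels": 4}

# safe_substitute leaves unknown placeholders in place instead of raising.
rendered = Template(template_text).safe_substitute(variables)
print(rendered)

# substitute is the strict variant: a missing variable raises KeyError.
try:
    Template(template_text).substitute(variables)
    missing = None
except KeyError as exc:
    missing = exc.args[0]
print(f"Missing variable: {missing}")
```

With `safe_substitute`, a forgotten variable produces a file that still contains a literal `$dataset_path`, so a template engine may want to validate the variable set before rendering.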
Guardrails vs Flexibility
The most contentious design question in ML platform design is: how much should the platform constrain users?
Too constrained: Users can't do what they need. They find workarounds that bypass the platform entirely. Adoption drops.
Too flexible: Users make poor decisions at scale. No consistency, no cost controls, no security guarantees. Platform provides little value over raw infrastructure.
The right answer is sensible defaults with explicit escape hatches:
from typing import Optional


class PlatformDeploymentConfig:
    """
    Deployment configuration with sensible defaults.
    Users can override everything, but defaults are production-safe.
    """

    def __init__(
        self,
        model_name: str,
        # Defaults: sensible production settings
        replicas: int = 2,  # HA by default
        gpu_count: int = 1,
        memory_gb: float = 16.0,
        cpu_count: float = 4.0,
        max_replicas: int = 20,
        autoscaling_target_gpu_pct: float = 70,
        min_accuracy_threshold: float = 0.75,  # quality gate default
        # Escape hatch: allow overriding any default with justification
        overrides: Optional[dict] = None,
        override_reason: str = "",
    ):
        self.model_name = model_name
        self.replicas = replicas
        self.gpu_count = gpu_count
        self.memory_gb = memory_gb
        self.cpu_count = cpu_count
        self.max_replicas = max_replicas
        # Attribute name matches the parameter so it can be overridden by key
        self.autoscaling_target_gpu_pct = autoscaling_target_gpu_pct
        self.min_accuracy_threshold = min_accuracy_threshold
        if overrides:
            if not override_reason:
                raise ValueError(
                    "Providing overrides requires an override_reason for audit trail. "
                    "Example: override_reason='Latency-critical model requires single replica for consistency'"
                )
            self._apply_overrides(overrides)
            self._log_override_audit(overrides, override_reason)

    def _apply_overrides(self, overrides: dict):
        for key, value in overrides.items():
            # Reject unknown keys instead of silently ignoring a typo
            if not hasattr(self, key):
                raise ValueError(f"Unknown override key: {key}")
            setattr(self, key, value)

    def _log_override_audit(self, overrides: dict, reason: str):
        """Log all overrides for audit - required for compliance."""
        print(f"[AUDIT] Platform default overridden: {overrides}")
        print(f"[AUDIT] Reason: {reason}")
The Guardrail List
Enforce these unconditionally - they protect the cluster and the business:
| Guardrail | Why |
|---|---|
| Max replicas cap | Prevent runaway autoscaling that exhausts GPU pool |
| Required cost tags | Without tags, cost attribution is impossible |
| Minimum 2 replicas for production | Single replica = no availability guarantee (the only exception is the justified single-replica override listed below) |
| Mandatory readiness probes | Prevents bad pods from receiving traffic |
| Max GPU memory request per model | One model can't starve the whole cluster |
Offer overrides (with justification required) for:
- Accuracy threshold (edge cases where lower threshold is acceptable)
- Resource limits (models with unusual requirements)
- Single replica (stateful models where consistency matters more than HA)
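The split between hard guardrails and justified overrides can be expressed as a single validation step that runs before any deployment. The following is a minimal sketch under stated assumptions: the limit values, the required tag names, and the `DeploymentRequest` and `check_guardrails` names are hypothetical, and a real platform would likely enforce these in an admission controller or CI step.

```python
from dataclasses import dataclass

# Hypothetical limits; real values depend on cluster size and budget.
HARD_MAX_REPLICAS = 20
HARD_MAX_GPU_MEMORY_GB = 80
REQUIRED_TAGS = {"team", "project"}
DEFAULT_MIN_ACCURACY = 0.75


@dataclass
class DeploymentRequest:
    model_name: str
    replicas: int
    max_replicas: int
    gpu_memory_gb: float
    tags: dict
    has_readiness_probe: bool
    min_accuracy_threshold: float = DEFAULT_MIN_ACCURACY
    override_reason: str = ""


def check_guardrails(req: DeploymentRequest) -> list[str]:
    """Return a list of violations; an empty list means the request may proceed."""
    violations = []
    # Hard guardrails: never overridable.
    if req.max_replicas > HARD_MAX_REPLICAS:
        violations.append(f"max_replicas {req.max_replicas} exceeds cap {HARD_MAX_REPLICAS}")
    missing_tags = REQUIRED_TAGS - set(req.tags)
    if missing_tags:
        violations.append(f"missing cost tags: {sorted(missing_tags)}")
    if not req.has_readiness_probe:
        violations.append("readiness probe is mandatory")
    if req.gpu_memory_gb > HARD_MAX_GPU_MEMORY_GB:
        violations.append(f"gpu_memory_gb {req.gpu_memory_gb} exceeds cap {HARD_MAX_GPU_MEMORY_GB}")
    # Soft guardrails: allowed only with a written justification.
    if req.replicas < 2 and not req.override_reason:
        violations.append("single replica requires override_reason")
    if req.min_accuracy_threshold < DEFAULT_MIN_ACCURACY and not req.override_reason:
        violations.append("accuracy threshold below default requires override_reason")
    return violations


request = DeploymentRequest(
    model_name="document-classifier",
    replicas=1,
    max_replicas=50,
    gpu_memory_gb=24,
    tags={"team": "nlp"},
    has_readiness_probe=False,
)
for violation in check_guardrails(request):
    print(f"BLOCKED: {violation}")
```

Returning every violation at once, rather than failing on the first, lets a user fix the whole request in one pass.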
Measuring Platform Adoption
from dataclasses import dataclass
from datetime import datetime, timedelta


def _ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


@dataclass
class PlatformAdoptionMetrics:
    """Weekly platform adoption metrics."""
    period: str
    total_ml_engineers: int
    platform_users: int               # unique users who ran any platform workflow
    platform_models_deployed: int     # models deployed via platform
    total_models_deployed: int        # models deployed via any method
    platform_experiments_logged: int  # runs logged in platform experiment tracking
    total_experiments_run: int        # estimated total experiments (harder to measure)

    @property
    def user_adoption_rate(self) -> float:
        return _ratio(self.platform_users, self.total_ml_engineers)

    @property
    def deployment_adoption_rate(self) -> float:
        return _ratio(self.platform_models_deployed, self.total_models_deployed)

    @property
    def experiment_adoption_rate(self) -> float:
        return _ratio(self.platform_experiments_logged, self.total_experiments_run)


def build_weekly_adoption_report(
    metrics: list[PlatformAdoptionMetrics],
) -> str:
    """Generate weekly adoption trend report."""
    if not metrics:
        raise ValueError("At least one week of metrics is required")
    latest = metrics[-1]
    previous = metrics[-2] if len(metrics) > 1 else None
    report = f"""
## Platform Adoption Report - {latest.period}
### Summary
- User Adoption: {latest.user_adoption_rate:.0%} ({latest.platform_users}/{latest.total_ml_engineers} engineers)
- Deployment Adoption: {latest.deployment_adoption_rate:.0%}
- Experiment Adoption: {latest.experiment_adoption_rate:.0%}
"""
    if previous:
        user_delta = latest.user_adoption_rate - previous.user_adoption_rate
        report += "\n### Week-over-Week\n"
        # The "+" format flag prints an explicit sign for both directions
        report += f"- User Adoption: {user_delta:+.1%}\n"
    # Who isn't using the platform? (for follow-up conversations)
    report += "\n### Follow-up Required\n"
    report += "Engineers not yet on platform - schedule 1:1s to understand blockers\n"
    return report
The Office Hours Strategy
Technical documentation alone doesn't drive adoption. Regular office hours do:
Weekly Platform Office Hours (30 min)
- Tuesday 2pm: Open Q&A for any platform questions
- Thursday 3pm: "Platform Tip of the Week" - live demo of one feature
Monthly
- "New User Workshop" - 60-min hands-on workshop for newly onboarded data scientists
- "Power User Session" - advanced features for teams already using the platform
Quarterly
- "Platform Roadmap Preview" - share upcoming features, get user input on priorities
- "User Research Round" - 6 × 30-min 1:1 interviews with diverse platform users
The office hours strategy has two effects: it accelerates adoption by removing friction in real time, and it surfaces UX problems the team didn't know about. The Tuesday Q&A session, in particular, is where you discover that 15 people have the same friction point with the same feature - friction that could be fixed in an afternoon.
Common Mistakes
:::danger Building features before validating demand
Every quarter, the platform team should spend one week doing user research before planning the next quarter's roadmap. Ask: "What's the most frustrating thing about your ML workflow right now?" The answer is your next feature. Building features based on "this seems like it would be useful" without user validation is how you end up with 50 features and 30% adoption.
:::
:::warning Not having a day-one onboarding experience
The first 30 minutes a new user spends with your platform determine whether they adopt it. If setup requires reading a 20-page guide, filing a ticket for access, and attending a 2-hour training session, most users will decide it's not worth it. Every new ML engineer should be able to complete a "hello world" deployment within 30 minutes of joining the company, with zero platform team involvement.
:::
:::danger Treating adoption metrics as optional
"We know people are using the platform" is not a metric. Without quantified adoption tracking, you can't tell whether the last quarter's work made things better or worse. Instrument everything: how many experiments logged this week vs last week, how many deployments via platform vs manual, how many times users called platform help. These are your product metrics.
:::
:::warning Making the escape hatch too easy
If users can bypass every guardrail trivially, they will - especially under time pressure. Guardrails should have friction proportional to their importance. Overriding the accuracy threshold should require a one-line justification. Bypassing the entire deployment pipeline should require manager approval. Make the right path the easy path.
:::
Interview Q&A
Q: How do you drive adoption for an internal ML platform?
A: Product thinking, not feature-building. Three strategies I've used. First, radical friction removal: map the end-to-end workflow from "model trained in notebook" to "model serving traffic." Count every step. Every step with friction is a place users will give up or find a workaround. Reduce notebook-to-production to under 30 minutes with zero K8s knowledge required. Second, active onboarding: don't publish documentation and hope people read it. Run weekly office hours, a monthly hands-on workshop for new users, and assign a platform "buddy" to every new data scientist for their first deployment. Third, measure and prioritize by adoption blockers: track adoption weekly, find the users not on platform, and do 1:1s to understand their specific friction. The answers are always specific and actionable - "I don't use the platform because my model type isn't supported" is a 2-day fix that unlocks a whole team.
Q: What is the notebook-to-production gap and how do you close it?
A: The notebook-to-production gap is the distance between "I trained a model in a Jupyter notebook" and "this model is serving real user traffic with proper monitoring." For most teams without a platform, this gap takes 2–8 weeks and requires skills - Docker, Kubernetes, CI/CD - that many data scientists don't have. To close it: (1) make containerization automatic - detect the model framework and build the appropriate container without requiring a Dockerfile; (2) generate the Kubernetes manifests from model metadata rather than requiring YAML authorship; (3) provide templates for common model types that encode all best practices; (4) make the platform CLI the simplest path to production - platform deploy my-model --production is the target experience. The gap is never fully closed, but reducing it from 8 weeks to 2 hours changes what teams can ship.
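Point (2) in the answer above, generating manifests from model metadata, can be sketched as a plain function that returns a Kubernetes Deployment as a dictionary. This is an illustrative sketch, not the platform's implementation: the `/healthz` probe path, the container name, and the function name are assumptions, and the `nvidia.com/gpu` resource assumes the NVIDIA device plugin is installed on the cluster.

```python
import json


def build_deployment_manifest(
    model_name: str,
    version: str,
    image: str,
    replicas: int = 2,
    gpu_count: int = 1,
    memory_gb: int = 16,
    port: int = 8080,
) -> dict:
    """Build an apps/v1 Deployment manifest from model metadata."""
    labels = {"app": model_name, "version": version}
    container = {
        "name": "model-server",  # hypothetical container name
        "image": image,
        "ports": [{"containerPort": port}],
        # Readiness probe keeps pods out of rotation until the model is loaded.
        "readinessProbe": {
            "httpGet": {"path": "/healthz", "port": port},  # assumed endpoint
            "initialDelaySeconds": 10,
            "periodSeconds": 5,
        },
        "resources": {
            "limits": {"memory": f"{memory_gb}Gi", "nvidia.com/gpu": gpu_count}
        },
    }
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": f"{model_name}-{version}", "labels": labels},
        "spec": {
            "replicas": replicas,
            # The selector must match the pod template labels.
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [container]},
            },
        },
    }


manifest = build_deployment_manifest(
    model_name="document-classifier",
    version="v3",
    image="myregistry/user-models/document-classifier:v3",
)
# JSON is valid YAML, so this output can be piped to `kubectl apply -f -`.
print(json.dumps(manifest, indent=2))
```

Because the defaults (two replicas, a readiness probe) live in the generator, a data scientist gets production-safe settings without ever seeing the manifest.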
Q: How do you design guardrails for an ML platform without limiting flexibility?
A: The principle is "sensible defaults with explicit escape hatches." Default to production-safe settings: 2 replicas for HA, cost tags required, readiness probes mandatory, autoscaling with a sensible cap. These defaults protect the cluster and the business. Then provide explicit overrides for everything, with two requirements: a justification string (for the audit trail) and friction proportional to risk. Overriding a memory limit: low friction, just add a field. Bypassing the CI/CD quality gates entirely: requires manager approval in the system. The key is that defaults should make the right thing easy, not make the wrong thing impossible. If you make the platform too restrictive, teams route around it. If you make it too permissive, the defaults provide no value.
Q: What metrics do you use to evaluate whether a platform investment is working?
A: Five metrics. First, user adoption rate: percentage of ML engineers who used the platform at least once in the past 4 weeks. Target: 80% within 6 months of launch. Second, deployment adoption: percentage of new model deployments done via the platform vs manually. This is the clearest signal of whether the platform is actually reducing work. Third, time-to-first-deployment: how long it takes a new user to complete their first platform deployment. Should be under 2 hours. Fourth, experiment tracking adoption: percentage of model training runs logged in the platform. Fifth, support request volume: number of platform help requests per active user per week - should decrease over time as UX improves. I report these metrics weekly to the platform team and quarterly to engineering leadership, with trend lines showing whether things are improving.
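Two of the metrics in this answer, time-to-first-deployment and support request volume, are simple to compute once onboarding and deployment events are recorded. The sketch below assumes per-user timestamps are available as dictionaries; the function names, user names, and sample values are invented for illustration.

```python
from datetime import datetime
from statistics import median
from typing import Optional


def median_time_to_first_deployment_hours(
    onboarded_at: dict[str, datetime],
    first_deploy_at: dict[str, datetime],
) -> Optional[float]:
    """Median hours from onboarding to first deployment, over users who have deployed."""
    durations = [
        (first_deploy_at[user] - start).total_seconds() / 3600
        for user, start in onboarded_at.items()
        if user in first_deploy_at
    ]
    return median(durations) if durations else None


def support_requests_per_active_user(requests: int, active_users: int) -> float:
    """Weekly help requests divided by active users; 0.0 when nobody is active."""
    return requests / active_users if active_users else 0.0


# Made-up sample data: "chi" has onboarded but not deployed yet.
onboarded = {
    "ana": datetime(2024, 3, 4, 9, 0),
    "ben": datetime(2024, 3, 4, 9, 0),
    "chi": datetime(2024, 3, 5, 9, 0),
}
first_deploy = {
    "ana": datetime(2024, 3, 4, 10, 30),  # 1.5 hours after onboarding
    "ben": datetime(2024, 3, 4, 12, 0),   # 3.0 hours after onboarding
}

ttfd = median_time_to_first_deployment_hours(onboarded, first_deploy)
per_user = support_requests_per_active_user(requests=12, active_users=24)
print(f"Median time-to-first-deployment: {ttfd} hours")
print(f"Support requests per active user: {per_user}")
```

A median is used rather than a mean so that one user who waits weeks before deploying does not dominate the number; users who never deploy are worth tracking separately.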
Q: What is the most common failure mode for internal ML platform projects?
A: Building features in isolation from users. The pattern: a small platform team spends 6–12 months building what they think data scientists need. At launch, adoption is low. The team's response is to build more features. Adoption stays low. Eventually the project gets cancelled or the team gets reorganized. The root cause: no regular user feedback loop. The fix: mandatory weekly user conversations before any feature work. "What are you working on right now?" and "What's the most frustrating part of your ML workflow?" answered by real users in real time completely changes what you build. The teams that build the best internal platforms treat their users like customers, run user research like a product team, and measure adoption like a growth team. The teams that build the worst platforms treat their users like they should be grateful for whatever infrastructure gets built.
