What is a self-service ML platform?

Build ML platforms that data scientists actually use - applying product thinking to internal tooling, from user research and notebook-to-production workflows to adoption metrics and guardrails.

Self-Service ML Platform

The Platform Nobody Used

After eight months of engineering work, the ML platform was feature-complete. It had experiment tracking, a model registry, automated training pipelines, canary deployments, drift monitoring - everything on the roadmap. The platform team had poured 3,200 engineering hours into it.

Three months after launch, adoption was 23%. Seven data scientists out of thirty were using the platform for their models. The other twenty-three had found workarounds: running training jobs in notebooks, deploying models via ad-hoc Docker containers, monitoring with custom scripts. The platform existed. Nobody used it.

The post-mortem was uncomfortable. The platform team had spent eight months building features without talking to users. They had built what they thought data scientists needed, not what data scientists actually needed. The experiment tracking was "too complicated" - it required 40 lines of boilerplate to log a run. The deployment pipeline "didn't work with my model type." The CLI "assumed too much familiarity with Kubernetes." The documentation was written by engineers for engineers, while its actual readers were data scientists who think in Python rather than YAML.

The platform team spent the next three months doing user research, redesigning the highest-friction surfaces, and rebuilding the onboarding experience. Adoption went from 23% to 78%. The lesson: platform engineering is product engineering. Features are not value. Adoption is value.

The Core Insight: Adoption Is the Metric

Most ML platform teams measure feature completeness, uptime, and latency. These are necessary but not sufficient. The metric that ultimately matters is adoption - the fraction of ML workloads running on the platform rather than outside it.

A platform with 10 features and 90% adoption is massively more valuable than a platform with 50 features and 30% adoption. Unadopted infrastructure is pure waste.

This requires a fundamental mental model shift: from "we are building infrastructure" to "we are running an internal product with users, and our users are data scientists."

The Notebook-to-Production Workflow

The most critical workflow to optimize is the path from "I trained a model in a notebook" to "my model is serving real traffic." Every step of friction on this path reduces adoption.

The Typical Friction Map

Before redesign, the platform required:

Refactor notebook code into a training script (1–4 hours)
Create a requirements.txt (30 minutes, frequently wrong)
Write a Dockerfile (1–2 hours for data scientists unfamiliar with Docker)
Push to container registry (15 minutes)
Write Kubernetes YAML manifest (2–4 hours, requires K8s knowledge)
Get PR approved by infra team (1–2 business days)
Deploy and debug (2–4 hours)

Total: roughly 7–15 hours of hands-on work, plus a 1–2 business day wait for PR approval, requiring skills many data scientists don't have. Result: data scientists find ways around the process.
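The hands-on total can be checked by summing the per-step estimates. The sketch below uses only the figures from the list above; keeping the PR approval separate as waiting time rather than hands-on work is an assumption about how the total is meant to be read.

```python
# Hands-on effort per step in hours (low, high), taken from the friction map above.
HANDS_ON_STEPS = {
    "refactor notebook into training script": (1.0, 4.0),
    "create requirements.txt": (0.5, 0.5),
    "write Dockerfile": (1.0, 2.0),
    "push to container registry": (0.25, 0.25),
    "write Kubernetes manifest": (2.0, 4.0),
    "deploy and debug": (2.0, 4.0),
}

# Waiting time in business days; tracked separately from hands-on hours.
PR_APPROVAL_WAIT_DAYS = (1, 2)


def total_hands_on_hours(steps: dict) -> tuple:
    """Sum the low and high estimates across all steps."""
    low = sum(low_hours for low_hours, _ in steps.values())
    high = sum(high_hours for _, high_hours in steps.values())
    return low, high


low, high = total_hands_on_hours(HANDS_ON_STEPS)
print(
    f"Hands-on: {low}-{high} hours, "
    f"plus {PR_APPROVAL_WAIT_DAYS[0]}-{PR_APPROVAL_WAIT_DAYS[1]} business days of waiting"
)
```

The waiting time matters as much as the hours: even a fast data scientist cannot get a model into production on the same day.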

The Redesigned Workflow

# Target: data scientist experience should look like this

# 1. Add 3 lines to existing training code
import engineersofai_platform as platform

with platform.training_run(
    name="bert-finetuning",
    tags={"team": "nlp", "project": "document-classifier"},
) as run:
    # Existing training code - no changes needed
    model = train_bert(config)
    run.log_metrics({"val_accuracy": 0.91})
    run.register_model(model, "document-classifier")


# 2. Deploy with one command (no Dockerfile, no YAML, no K8s knowledge)
# $ platform deploy document-classifier --production
# > Detecting model type: HuggingFace transformers
# > Building container (2 min)
# > Running validation tests (3 min)
# > Deploying to staging (1 min)
# > ✓ Deployed at: https://models.internal/document-classifier/v3
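To make the "wrap existing code" idea concrete, here is a minimal in-memory sketch of how a `training_run` context manager could be structured. The `TrainingRun` class and its fields are hypothetical stand-ins, not the actual SDK; a real implementation would send metrics and artifacts to a tracking backend instead of storing them on the object.

```python
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Any, Iterator, Optional


@dataclass
class TrainingRun:
    """In-memory stand-in for a tracked run (hypothetical, for illustration)."""
    name: str
    tags: dict
    metrics: dict = field(default_factory=dict)
    registered_models: dict = field(default_factory=dict)
    status: str = "running"

    def log_metrics(self, metrics: dict) -> None:
        self.metrics.update(metrics)

    def register_model(self, model: Any, model_name: str) -> None:
        self.registered_models[model_name] = model


@contextmanager
def training_run(name: str, tags: Optional[dict] = None) -> Iterator[TrainingRun]:
    """Track a run; mark it failed if the wrapped code raises."""
    run = TrainingRun(name=name, tags=dict(tags or {}))
    try:
        yield run
    except Exception:
        run.status = "failed"  # record the failure, then let the caller see it
        raise
    else:
        run.status = "completed"


with training_run("bert-finetuning", tags={"team": "nlp"}) as run:
    run.log_metrics({"val_accuracy": 0.91})
    run.register_model(object(), "document-classifier")

print(run.status, run.metrics)
```

The context manager owns the run lifecycle (start, success, failure), so the data scientist's training code does not need to change to report status.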

Implementation: Smart Containerization

# Platform SDK: auto-detect model type and build appropriate container
from enum import Enum
import subprocess
import tempfile
from pathlib import Path

class ModelFramework(Enum):
    PYTORCH = "pytorch"
    TENSORFLOW = "tensorflow"
    SKLEARN = "sklearn"
    HUGGINGFACE = "huggingface"
    XGBOOST = "xgboost"
    CUSTOM = "custom"

def detect_model_framework(model_artifact_path: str) -> ModelFramework:
    """Detect model framework from artifact files."""
    path = Path(model_artifact_path)

    if (path / "config.json").exists() and (path / "tokenizer_config.json").exists():
        return ModelFramework.HUGGINGFACE
    elif list(path.glob("*.pt")) or list(path.glob("*.pth")):
        return ModelFramework.PYTORCH
    elif list(path.glob("saved_model.pb")):
        return ModelFramework.TENSORFLOW
    elif list(path.glob("*.pkl")):
        return ModelFramework.SKLEARN
    else:
        return ModelFramework.CUSTOM


BASE_IMAGES = {
    ModelFramework.HUGGINGFACE: "myregistry/hf-serving:transformers-4.38-cuda12",
    ModelFramework.PYTORCH: "myregistry/pytorch-serving:2.1-cuda12",
    ModelFramework.SKLEARN: "myregistry/sklearn-serving:1.4",
    ModelFramework.XGBOOST: "myregistry/xgboost-serving:1.7",
}

def build_serving_container(
    model_artifact_path: str,
    model_name: str,
    version: str,
) -> str:
    """
    Automatically build a serving container from a model artifact.
    No Dockerfile required from the data scientist.
    """
    framework = detect_model_framework(model_artifact_path)
    base_image = BASE_IMAGES.get(framework)

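    if base_image is None:
        # Covers detected frameworks with no entry in BASE_IMAGES (e.g. TENSORFLOW, CUSTOM)
        raise ValueError(
            f"No serving base image is configured for framework '{framework.value}'. "
            "Please contact the platform team to add support."
        )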

    # Generate Dockerfile from template. The artifact directory is used as the
    # build context, so COPY takes a context-relative source ("."); Docker
    # rejects COPY sources that lie outside the build context.
    dockerfile_content = f"""
FROM {base_image}

# Copy model artifacts
COPY . /model/

# Set serving configuration
ENV MODEL_PATH=/model
ENV MODEL_NAME={model_name}
ENV MODEL_VERSION={version}
ENV FRAMEWORK={framework.value}

# Platform server handles all serving logic
CMD ["platform-server", "--model-path", "/model", "--port", "8080"]
"""

    with tempfile.TemporaryDirectory() as tmpdir:
        dockerfile_path = Path(tmpdir) / "Dockerfile"
        dockerfile_path.write_text(dockerfile_content)

        image_tag = f"myregistry/user-models/{model_name}:{version}"
        # -f points at the generated Dockerfile; the last argument is the build context
        subprocess.run(
            ["docker", "build", "-t", image_tag, "-f", str(dockerfile_path), model_artifact_path],
            check=True,
        )

        subprocess.run(["docker", "push", image_tag], check=True)

    return image_tag
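The detection heuristics are order-dependent, so it helps to exercise them against small fixture directories. The sketch below restates the same rules with plain strings so that it runs on its own; the helper `detect_for_files` and the file names are made up for illustration. It also shows a limitation of the rules as written: no rule returns XGBoost, so a pickled XGBoost model is classified as scikit-learn.

```python
import tempfile
from pathlib import Path


def detect_framework(artifact_dir: Path) -> str:
    """Same rule order as detect_model_framework, returning plain strings."""
    has_hf_config = (artifact_dir / "config.json").exists()
    has_tokenizer = (artifact_dir / "tokenizer_config.json").exists()
    if has_hf_config and has_tokenizer:
        return "huggingface"
    if any(artifact_dir.glob("*.pt")) or any(artifact_dir.glob("*.pth")):
        return "pytorch"
    if (artifact_dir / "saved_model.pb").exists():
        return "tensorflow"
    if any(artifact_dir.glob("*.pkl")):
        return "sklearn"
    return "custom"


def detect_for_files(filenames: list) -> str:
    """Create a throwaway artifact directory with empty files and classify it."""
    with tempfile.TemporaryDirectory() as tmpdir:
        root = Path(tmpdir)
        for filename in filenames:
            (root / filename).touch()
        return detect_framework(root)


# The HuggingFace rule wins even when PyTorch weights are present, because it runs first.
print(detect_for_files(["config.json", "tokenizer_config.json", "model.pt"]))
print(detect_for_files(["model.pt"]))
# A pickled XGBoost model matches the *.pkl rule and is treated as scikit-learn.
print(detect_for_files(["xgb_model.pkl"]))
print(detect_for_files(["model.onnx"]))
```

Fixture-based checks like this are a cheap way to catch "didn't work with my model type" before a user does.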

Template-Based Workflows

Most ML workflows at a given company are variations of a small number of patterns:

Fine-tune a HuggingFace model on domain data
Train an XGBoost classifier on tabular data
Build a RAG pipeline on document corpus
Train a recommendation model

Templates encode these patterns as one-click starting points:

# Platform CLI: create new ML project from template
# $ platform new --template bert-finetuning my-project

import json
from datetime import datetime
from pathlib import Path

TEMPLATES = {
    "bert-finetuning": {
        "description": "Fine-tune BERT on a classification task",
        "files": {
            "train.py": "templates/bert_finetuning/train.py",
            "config.yaml": "templates/bert_finetuning/config.yaml",
            "requirements.txt": "templates/bert_finetuning/requirements.txt",
            "tests/test_model.py": "templates/bert_finetuning/tests/test_model.py",
        },
        "variables": ["model_name", "dataset_path", "num_labels"],
        "estimated_cost": "$15-50 on A10G GPU",
        "estimated_time": "2-6 hours",
    },
    "tabular-classifier": {
        "description": "Train XGBoost/LightGBM on tabular data",
        "files": {
            "train.py": "templates/tabular/train.py",
            "feature_engineering.py": "templates/tabular/feature_engineering.py",
            "evaluate.py": "templates/tabular/evaluate.py",
        },
        "variables": ["dataset_path", "target_column"],
        "estimated_cost": "$2-10 on CPU",
        "estimated_time": "30 minutes - 2 hours",
    },
    "rag-pipeline": {
        "description": "Build a RAG system with document retrieval",
        "files": {
            "ingest.py": "templates/rag/ingest.py",
            "pipeline.py": "templates/rag/pipeline.py",
            "serve.py": "templates/rag/serve.py",
        },
        "variables": ["document_source", "embedding_model", "llm_model"],
        "estimated_cost": "Depends on LLM API usage",
        "estimated_time": "1-3 hours",
    },
}


class TemplateEngine:
    """Generate new ML projects from templates."""

    def create_from_template(
        self,
        template_name: str,
        project_name: str,
        variables: dict,
        output_dir: str,
    ) -> None:
        if template_name not in TEMPLATES:
            available = list(TEMPLATES.keys())
            raise ValueError(f"Unknown template. Available: {available}")

        template = TEMPLATES[template_name]
        output_path = Path(output_dir) / project_name
        output_path.mkdir(parents=True)

        for dest_file, template_file in template["files"].items():
            content = self._render_template(template_file, variables)
            dest = output_path / dest_file
            dest.parent.mkdir(parents=True, exist_ok=True)
            dest.write_text(content)

        # Write project metadata
        metadata = {
            "project_name": project_name,
            "template": template_name,
            "created_at": datetime.utcnow().isoformat(),
            "variables": variables,
        }
        (output_path / ".platform.json").write_text(json.dumps(metadata, indent=2))

        print(f"Created project at {output_path}")
        print("Next steps:")
        print(f"  cd {output_path}")  # the project lives under output_dir, not the CWD
        print("  platform run --dev  # run locally")
        print("  platform deploy     # deploy to staging")

    def _render_template(self, template_path: str, variables: dict) -> str:
        """Render $variable placeholders with string.Template (not Jinja2 syntax).

        safe_substitute leaves unknown placeholders in place instead of raising.
        """
        from string import Template
        content = Path(template_path).read_text()
        return Template(content).safe_substitute(variables)
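Because `safe_substitute` leaves unknown placeholders untouched, a forgotten variable produces a file that still contains `$dataset_path` rather than an error. The sketch below shows that behavior and a pre-check against a template's declared `variables`; the template text, the example values, and the `find_missing_variables` helper are illustrative assumptions.

```python
from string import Template

# Hypothetical config template using string.Template's $name / ${name} syntax.
TEMPLATE_SOURCE = """\
model_name: $model_name
dataset_path: $dataset_path
num_labels: ${num_labels}
"""


def find_missing_variables(required: list, provided: dict) -> list:
    """Return required variable names that the caller did not supply."""
    return [name for name in required if name not in provided]


required = ["model_name", "dataset_path", "num_labels"]
provided = {"model_name": "bert-base-uncased", "num_labels": "4"}

# safe_substitute never raises; the missing placeholder survives in the output.
rendered = Template(TEMPLATE_SOURCE).safe_substitute(provided)
print(rendered)

# Checking up front turns a silently broken file into an actionable message.
missing = find_missing_variables(required, provided)
if missing:
    print(f"Missing template variables: {missing}")

# The strict variant raises KeyError for the first missing placeholder.
try:
    Template(TEMPLATE_SOURCE).substitute(provided)
except KeyError as error:
    print(f"substitute() failed on {error}")
```

Validating variables before rendering keeps the failure at project creation, where the message can name the missing input, instead of at the first training run.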

Guardrails vs Flexibility

The most contentious design question in ML platform design is: how much should the platform constrain users?

Too constrained: Users can't do what they need. They find workarounds that bypass the platform entirely. Adoption drops.

Too flexible: Users make poor decisions at scale. No consistency, no cost controls, no security guarantees. Platform provides little value over raw infrastructure.

The right answer is sensible defaults with explicit escape hatches:

class PlatformDeploymentConfig:
    """
    Deployment configuration with sensible defaults.
    Users can override everything, but defaults are production-safe.
    """

    def __init__(
        self,
        model_name: str,
        # Defaults: sensible production settings
        replicas: int = 2,                      # HA by default
        gpu_count: int = 1,
        memory_gb: float = 16.0,
        cpu_count: float = 4.0,
        max_replicas: int = 20,
        autoscaling_target_gpu_pct: float = 70,
        min_accuracy_threshold: float = 0.75,   # quality gate default
        # Escape hatch: allow overriding any default with justification
        overrides: dict = None,
        override_reason: str = "",
    ):
        self.model_name = model_name
        self.replicas = replicas
        self.gpu_count = gpu_count
        self.memory_gb = memory_gb
        self.cpu_count = cpu_count
        self.max_replicas = max_replicas
        # Attribute name matches the parameter so overrides can target it by the same key
        self.autoscaling_target_gpu_pct = autoscaling_target_gpu_pct
        self.min_accuracy_threshold = min_accuracy_threshold

        if overrides:
            if not override_reason:
                raise ValueError(
                    "Providing overrides requires an override_reason for audit trail. "
                    "Example: override_reason='Latency-critical model requires single replica for consistency'"
                )
            self._apply_overrides(overrides)
            self._log_override_audit(overrides, override_reason)

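    def _apply_overrides(self, overrides: dict):
        for key, value in overrides.items():
            # Reject unknown keys instead of silently ignoring a typo
            if not hasattr(self, key):
                raise ValueError(f"Unknown override: {key!r}")
            setattr(self, key, value)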

    def _log_override_audit(self, overrides: dict, reason: str):
        """Log all overrides for audit - required for compliance."""
        print(f"[AUDIT] Platform default overridden: {overrides}")
        print(f"[AUDIT] Reason: {reason}")

The Guardrail List

Enforce these unconditionally - they protect the cluster and the business:

Guardrail	Why
Max replicas cap	Prevent runaway autoscaling that exhausts GPU pool
Required cost tags	Without tags, cost attribution is impossible
Minimum 2 replicas for production	Single replica = no availability guarantee
Mandatory readiness probes	Prevents bad pods from receiving traffic
Max GPU memory request per model	One model can't starve the whole cluster

Offer overrides (with justification required) for:

Accuracy threshold (edge cases where lower threshold is acceptable)
Resource limits (models with unusual requirements)
Single replica (stateful models where consistency matters more than HA)
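The two lists above can be sketched as a validation step that runs before any deployment. Everything here is an assumption for illustration: the `DeploymentRequest` fields, the limit values, and the choice to treat the production replica minimum as waivable only through a recorded single-replica justification.

```python
from dataclasses import dataclass, field

# Hypothetical limits for illustration; real values depend on the cluster.
MAX_REPLICAS_CAP = 20
MAX_GPU_MEMORY_GB_PER_MODEL = 80.0
REQUIRED_COST_TAGS = ("team", "project")


@dataclass
class DeploymentRequest:
    model_name: str
    environment: str  # "staging" or "production"
    replicas: int
    max_replicas: int
    gpu_memory_gb: float
    has_readiness_probe: bool
    tags: dict = field(default_factory=dict)
    single_replica_reason: str = ""  # justification for the single-replica override


def check_guardrails(request: DeploymentRequest) -> list:
    """Return human-readable violations; an empty list means the request may proceed."""
    violations = []
    if request.max_replicas > MAX_REPLICAS_CAP:
        violations.append(f"max_replicas {request.max_replicas} exceeds cap {MAX_REPLICAS_CAP}")
    missing_tags = [tag for tag in REQUIRED_COST_TAGS if tag not in request.tags]
    if missing_tags:
        violations.append(f"missing cost tags: {', '.join(missing_tags)}")
    if not request.has_readiness_probe:
        violations.append("readiness probe is required")
    if request.gpu_memory_gb > MAX_GPU_MEMORY_GB_PER_MODEL:
        violations.append(f"gpu_memory_gb exceeds {MAX_GPU_MEMORY_GB_PER_MODEL} GB limit")
    below_ha_minimum = request.environment == "production" and request.replicas < 2
    if below_ha_minimum and not request.single_replica_reason:
        violations.append("production needs 2+ replicas or a single_replica_reason")
    return violations


ok_request = DeploymentRequest(
    "document-classifier", "production", replicas=2, max_replicas=10,
    gpu_memory_gb=24.0, has_readiness_probe=True,
    tags={"team": "nlp", "project": "document-classifier"},
)
bad_request = DeploymentRequest(
    "experiment", "production", replicas=1, max_replicas=50,
    gpu_memory_gb=40.0, has_readiness_probe=False,
)
print(check_guardrails(ok_request))
for violation in check_guardrails(bad_request):
    print("-", violation)
```

Returning every violation at once, rather than failing on the first, saves the user from a fix-and-retry loop.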

Measuring Platform Adoption

from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class PlatformAdoptionMetrics:
    """Weekly platform adoption metrics."""
    period: str
    total_ml_engineers: int
    platform_users: int                # unique users who ran any platform workflow
    platform_models_deployed: int      # models deployed via platform
    total_models_deployed: int         # models deployed via any method
    platform_experiments_logged: int   # runs logged in platform experiment tracking
    total_experiments_run: int         # estimated total experiments (harder to measure)

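    # Each rate returns 0.0 when its denominator is zero (e.g. a week with
    # no deployments) instead of raising ZeroDivisionError.
    @property
    def user_adoption_rate(self) -> float:
        if not self.total_ml_engineers:
            return 0.0
        return self.platform_users / self.total_ml_engineers

    @property
    def deployment_adoption_rate(self) -> float:
        if not self.total_models_deployed:
            return 0.0
        return self.platform_models_deployed / self.total_models_deployed

    @property
    def experiment_adoption_rate(self) -> float:
        if not self.total_experiments_run:
            return 0.0
        return self.platform_experiments_logged / self.total_experiments_run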


def build_weekly_adoption_report(
    metrics: list[PlatformAdoptionMetrics],
) -> str:
    """Generate weekly adoption trend report."""
    latest = metrics[-1]
    previous = metrics[-2] if len(metrics) > 1 else None

    report = f"""
## Platform Adoption Report - {latest.period}

### Summary
- User Adoption: {latest.user_adoption_rate:.0%} ({latest.platform_users}/{latest.total_ml_engineers} engineers)
- Deployment Adoption: {latest.deployment_adoption_rate:.0%}
- Experiment Adoption: {latest.experiment_adoption_rate:.0%}
"""
    if previous:
        user_delta = latest.user_adoption_rate - previous.user_adoption_rate
        report += f"\n### Week-over-Week\n"
        report += f"- User Adoption: {'+' if user_delta > 0 else ''}{user_delta:.1%}\n"

    # Who isn't using the platform? (for follow-up conversations)
    report += "\n### Follow-up Required\n"
    report += "Engineers not yet on platform - schedule 1:1s to understand blockers\n"

    return report
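The report above prints a static follow-up line; the actual list of people to talk to is a set difference between the team roster and the period's platform users. A minimal sketch, assuming both are available as sets of usernames (the names and helper functions are hypothetical):

```python
def engineers_needing_follow_up(all_engineers: set, platform_users: set) -> list:
    """Engineers with no platform activity in the period, sorted for stable output."""
    return sorted(all_engineers - platform_users)


def render_follow_up_section(names: list) -> str:
    """Format the follow-up list as a report section."""
    if not names:
        return "### Follow-up Required\nEveryone used the platform this period.\n"
    lines = ["### Follow-up Required"]
    lines += [f"- {name}: schedule a 1:1 to understand blockers" for name in names]
    return "\n".join(lines) + "\n"


roster = {"alice", "bob", "carol", "dave"}
active_this_week = {"alice", "bob"}

follow_ups = engineers_needing_follow_up(roster, active_this_week)
print(render_follow_up_section(follow_ups))
```

Naming individuals turns the adoption number into a concrete interview list for the next round of user research.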

The Office Hours Strategy

Technical documentation alone doesn't drive adoption. Regular office hours do:

Weekly Platform Office Hours (30 min)
- Tuesday 2pm: Open Q&A for any platform questions
- Thursday 3pm: "Platform Tip of the Week" - live demo of one feature

Monthly
- "New User Workshop" - 60-min hands-on workshop for newly onboarded data scientists
- "Power User Session" - advanced features for teams already using the platform

Quarterly
- "Platform Roadmap Preview" - share upcoming features, get user input on priorities
- "User Research Round" - 6 × 30-min 1:1 interviews with diverse platform users

The office hours strategy has two effects: it accelerates adoption by removing friction in real time, and it surfaces UX problems the team didn't know about. The Tuesday Q&A session, in particular, is where you discover that 15 people have the same friction point with the same feature - friction that could be fixed in an afternoon.

Common Mistakes

:::danger Building features before validating demand
Every quarter, the platform team should spend one week doing user research before planning the next quarter's roadmap. Ask: "What's the most frustrating thing about your ML workflow right now?" The answer is your next feature. Building features based on "this seems like it would be useful" without user validation is how you end up with 50 features and 30% adoption.
:::

:::warning Not having a day-one onboarding experience
The first 30 minutes a new user spends with your platform determine whether they adopt it. If setup requires reading a 20-page guide, filing a ticket for access, and attending a 2-hour training session, most users will decide it's not worth it. Every new ML engineer should be able to complete a "hello world" deployment within 30 minutes of joining the company, with zero platform team involvement.
:::

:::danger Treating adoption metrics as optional
"We know people are using the platform" is not a metric. Without quantified adoption tracking, you can't tell whether the last quarter's work made things better or worse. Instrument everything: how many experiments logged this week vs last week, how many deployments via platform vs manual, how many times users called platform help. These are your product metrics.
:::

:::warning Making the escape hatch too easy
If users can bypass every guardrail trivially, they will - especially under time pressure. Guardrails should have friction proportional to their importance. Overriding the accuracy threshold should require a one-line justification. Bypassing the entire deployment pipeline should require a manager approval. Make the right path the easy path.
:::

Interview Q&A

Q: How do you drive adoption for an internal ML platform?

A: Product thinking, not feature-building. Three strategies I've used. First, radical friction removal: map the end-to-end workflow from "model trained in notebook" to "model serving traffic." Count every step. Every step with friction is a place users will give up or find a workaround. Reduce notebook-to-production to under 30 minutes with zero K8s knowledge required. Second, active onboarding: don't publish documentation and hope people read it. Run weekly office hours, a monthly hands-on workshop for new users, and assign a platform "buddy" to every new data scientist for their first deployment. Third, measure and prioritize by adoption blockers: track adoption weekly, find the users not on platform, and do 1:1s to understand their specific friction. The answers are always specific and actionable - "I don't use the platform because my model type isn't supported" is a 2-day fix that unlocks a whole team.

Q: What is the notebook-to-production gap and how do you close it?

A: The notebook-to-production gap is the distance between "I trained a model in a Jupyter notebook" and "this model is serving real user traffic with proper monitoring." For most teams without a platform, this gap takes 2–8 weeks and requires skills - Docker, Kubernetes, CI/CD - that many data scientists don't have. To close it: (1) make containerization automatic - detect the model framework and build the appropriate container without requiring a Dockerfile; (2) generate the Kubernetes manifests from model metadata rather than requiring YAML authorship; (3) provide templates for common model types that encode all best practices; (4) make the platform CLI the simplest path to production - a single command such as "platform deploy my-model --production" is the target experience. The gap is never fully closed, but reducing it from 8 weeks to 2 hours changes what teams can ship.

Q: How do you design guardrails for an ML platform without limiting flexibility?

A: The principle is "sensible defaults with explicit escape hatches." Default to production-safe settings: 2 replicas for HA, cost tags required, readiness probes mandatory, autoscaling with a sensible cap. These defaults protect the cluster and the business. Then provide explicit overrides for everything, with two requirements: a justification string (for audit trail), and appropriate friction proportional to risk. Overriding a memory limit: low friction, just add a field. Bypassing the CI/CD quality gates entirely: requires a manager approval in the system. The key is that defaults should make the right thing easy, not make the wrong thing impossible. If you make the platform too restrictive, teams route around it. If you make it too permissive, the defaults provide no value.

Q: What metrics do you use to evaluate whether a platform investment is working?

A: Five metrics. First, user adoption rate: percentage of ML engineers who used the platform at least once in the past 4 weeks. Target: 80% within 6 months of launch. Second, deployment adoption: percentage of new model deployments done via the platform vs manually. This is the clearest signal of whether the platform is actually reducing work. Third, time-to-first-deployment: how long it takes a new user to complete their first platform deployment. Should be under 2 hours. Fourth, experiment tracking adoption: percentage of model training runs logged in the platform. Fifth, support request volume: number of platform help requests per active user per week - should decrease over time as UX improves. I report these metrics weekly to the platform team and quarterly to engineering leadership, with trend lines showing whether things are improving.
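Two of these metrics, time-to-first-deployment and support request volume, can be computed from simple event data. The sketch below assumes onboarding and first-deployment timestamps are recorded per user and that help requests are counted per week; the function names and sample values are illustrative.

```python
from datetime import datetime
from statistics import median


def hours_to_first_deployment(onboarded_at: datetime, first_deployed_at: datetime) -> float:
    """Elapsed hours between onboarding and the user's first platform deployment."""
    return (first_deployed_at - onboarded_at).total_seconds() / 3600


def median_hours_to_first_deployment(events: list) -> float:
    """Median over (onboarded_at, first_deployed_at) pairs; robust to a few slow outliers."""
    return median(hours_to_first_deployment(start, end) for start, end in events)


def support_requests_per_active_user(request_count: int, active_users: int) -> float:
    """Weekly help requests per active user; 0.0 when nobody was active."""
    if active_users == 0:
        return 0.0
    return request_count / active_users


# Hypothetical sample: three new users onboarded at 09:00 on the same day.
day = datetime(2024, 3, 4)
events = [
    (day.replace(hour=9), day.replace(hour=10, minute=30)),  # 1.5 hours
    (day.replace(hour=9), day.replace(hour=12)),             # 3.0 hours
    (day.replace(hour=9), day.replace(hour=9, minute=45)),   # 0.75 hours
]

print(median_hours_to_first_deployment(events))
print(support_requests_per_active_user(request_count=12, active_users=24))
```

The median is used instead of the mean so that one user who stalls for a week does not hide an otherwise healthy onboarding experience.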

Q: What is the most common failure mode for internal ML platform projects?

A: Building features in isolation from users. The pattern: a small platform team spends 6–12 months building what they think data scientists need. At launch, adoption is low. The team's response is to build more features. Adoption stays low. Eventually the project gets cancelled or the team gets reorganized. The root cause: no regular user feedback loop. The fix: mandatory weekly user conversations before any feature work. "What are you working on right now?" and "What's the most frustrating part of your ML workflow?" answered by real users in real time completely changes what you build. The teams that build the best internal platforms treat their users like customers, run user research like a product team, and measure adoption like a growth team. The teams that build the worst platforms treat their users like they should be grateful for whatever infrastructure gets built.
