What is the MLOps maturity model?

Understand the MLOps maturity model from Level 0 to Level 3, design the components of a complete ML platform, and build a realistic 12-month roadmap from ad-hoc to automated.

MLOps Platform Architecture

The Level-0 Company

The data scientist had been at the company for eight months. In that time, she had trained twelve models. Exactly two were in production. The others were in various states of limbo: notebooks on her laptop, CSV files on a shared drive, a Python script that "worked on my machine" but failed when the infrastructure team tried to deploy it, a model whose training data had been accidentally overwritten.

The two production models were maintained by the infrastructure team - a separate team that had spent two months integrating each model into the serving infrastructure. Communication was painful. Debugging required both teams in a call, passing screenshots back and forth, because neither had access to the other's tooling. When a model degraded, nobody noticed for three weeks because there was no monitoring. When the data schema changed upstream, the production model started silently returning garbage predictions because there was no data validation.

This is Level-0 MLOps. It is the most common state in the industry. Most companies that call themselves "AI-first" are operating at Level-0 or Level-1. The models work. The infrastructure around the models is a mess of manual processes, undocumented decisions, and institutional knowledge locked in people's heads.

The question is not whether to move up the maturity levels - it is how to do it without destroying the team's ability to ship models during the transition.

Why Platform Architecture Matters

Before MLOps, every ML deployment was a unique adventure. A data scientist trained a model, an engineer rewrote it in a "production language," another engineer deployed it to a server, and a third engineer wrote monitoring code. The same work happened for every model, every time. Knowledge didn't accumulate - it scattered.

The fundamental insight behind ML platforms is that standardization enables acceleration. When every model uses the same deployment interface, the same monitoring hooks, and the same serving infrastructure, each new model deployment takes days instead of months. When every experiment is tracked in the same system, comparisons are trivial. When every feature is stored in the same feature store, feature engineering work is shared instead of duplicated.

The cost of building this infrastructure is real. The payoff is compounding - each new model benefits from all the previous investment.

The MLOps Maturity Model

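This lesson uses a four-level maturity model, Level 0 through Level 3. It builds on Google's MLOps guidance (2020), which defines three levels (0 through 2), and adds a fourth level for fully automated operation: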

Level 0: No MLOps

Data collection and preparation: manual, ad-hoc
Model training: Jupyter notebooks on local machines
Model deployment: manual, undocumented process
Monitoring: none, or manual dashboards
Iteration cycle: weeks to months

Signature sign: "We need to ask [name] how this model works because they wrote it."

Level 1: ML Pipeline Automation

Training is automated: training script runs on CI trigger or schedule
Experiment tracking: MLflow or W&B integrated
Model validation: automated test against holdout set before deployment
Deployment: still largely manual, but documented
Monitoring: basic model metrics tracked

Signature sign: Any engineer can retrain any model without asking the original author.

Level 2: CI/CD Pipeline for ML

Training and deployment in the same CI/CD pipeline
Automated testing: unit tests, integration tests, performance tests
Canary deployment: new model deployed to small traffic fraction
Monitoring: automated drift detection, alerting
Feature store: shared features across teams

Signature sign: A model can go from "training triggered" to "serving 5% traffic" without human intervention.

Level 3: Automated ML Pipeline

Continuous training: models retrain automatically on schedule or data triggers
Automated rollback: if new model degrades, rollback triggers automatically
A/B test automation: traffic allocation adjusts based on statistical significance
Full observability: every prediction is logged, every model has a health score

Signature sign: "The recommendation model retrains every night and deploys itself if it passes quality gates."
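The four levels can be turned into a rough self-assessment. The sketch below is one possible encoding, not an official scoring method: the `MLOpsCapabilities` fields and the choice of two capabilities per level are assumptions made for illustration, picked from the bullet lists above.

```python
from dataclasses import dataclass


@dataclass
class MLOpsCapabilities:
    """Hypothetical capability checklist; field names are illustrative."""
    automated_training: bool = False    # Level 1: training runs on trigger or schedule
    experiment_tracking: bool = False   # Level 1: runs are logged centrally
    cicd_deployment: bool = False       # Level 2: deployment is part of the pipeline
    drift_alerting: bool = False        # Level 2: automated drift detection
    continuous_training: bool = False   # Level 3: retraining on data triggers
    automated_rollback: bool = False    # Level 3: degradation triggers rollback


def assess_level(caps: MLOpsCapabilities) -> int:
    """Return the highest level whose requirements, and all lower ones, are met."""
    requirements = [
        (1, caps.automated_training and caps.experiment_tracking),
        (2, caps.cicd_deployment and caps.drift_alerting),
        (3, caps.continuous_training and caps.automated_rollback),
    ]
    level = 0
    for candidate, satisfied in requirements:
        if not satisfied:
            break  # a gap at this level caps the score, even if later boxes are ticked
        level = candidate
    return level


team = MLOpsCapabilities(automated_training=True, experiment_tracking=True,
                         drift_alerting=True)
print(assess_level(team))  # 1: drift alerting alone does not reach Level 2
```

Scoring the lowest unmet level, rather than counting ticked boxes, reflects the idea that each level depends on the one below it.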

Platform Components

A complete ML platform has six major components. Understanding each independently is essential before you can understand how they fit together:

Component 1: Data Platform

The data platform stores, versions, and serves features - the inputs to ML models. Without it, every team builds their own feature pipelines, creating duplication, inconsistency, and the training-serving skew problem (training on different features than you serve in production).

Key capabilities:

Feature store: compute features once, reuse across teams and models
Data versioning: track which data version trained which model
Point-in-time correct features: build training sets from feature values as they existed at each historical prediction time, so no future data leaks into training (critical for training-serving consistency)
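Point-in-time correctness is easiest to see with a small lookup. The sketch below uses only the standard library and a made-up feature history (`purchase_count_history` and the timestamps are hypothetical); a real feature store would do the equivalent join at scale.

```python
from bisect import bisect_right

# Hypothetical feature history for one user: (timestamp, value), sorted by timestamp.
purchase_count_history = [(100, 1), (200, 2), (300, 5)]


def feature_as_of(history, event_time):
    """Return the latest value recorded at or before event_time, else None."""
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, event_time)
    if idx == 0:
        return None  # no value existed yet at that time
    return history[idx - 1][1]


# Labeled events that need features attached for training.
label_events = [(150, "no_churn"), (250, "churn")]
training_rows = [
    (event_time, feature_as_of(purchase_count_history, event_time), label)
    for event_time, label in label_events
]
print(training_rows)  # [(150, 1, 'no_churn'), (250, 2, 'churn')]
```

Joining each label to the latest value (5) instead would leak information from the future into both training rows.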

Component 2: Training Platform

The training platform provides compute orchestration for ML training jobs. It abstracts the underlying hardware (GPU instances, distributed training clusters) from the data scientist writing the training code.

Key capabilities:

Job scheduling: queue-based GPU allocation, priority management
Experiment tracking: log hyperparameters, metrics, and artifacts
Distributed training: coordinate multi-GPU and multi-node training
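Queue-based GPU allocation with priorities can be sketched with a heap. This is a toy model under simplifying assumptions (a single GPU pool, no preemption, strict head-of-line ordering); the class and job names are hypothetical.

```python
import heapq
import itertools


class GpuJobQueue:
    """Toy priority queue: lower priority number runs first, FIFO within a priority."""

    def __init__(self, total_gpus: int):
        self.free_gpus = total_gpus
        self._heap = []
        self._order = itertools.count()  # tie-breaker that preserves submission order

    def submit(self, name: str, gpus: int, priority: int) -> None:
        heapq.heappush(self._heap, (priority, next(self._order), name, gpus))

    def schedule(self) -> list:
        """Start queued jobs in priority order until the head job no longer fits."""
        started = []
        while self._heap and self._heap[0][3] <= self.free_gpus:
            _, _, name, gpus = heapq.heappop(self._heap)
            self.free_gpus -= gpus
            started.append(name)
        return started


queue = GpuJobQueue(total_gpus=8)
queue.submit("nightly-retrain", gpus=4, priority=0)
queue.submit("hparam-sweep", gpus=8, priority=2)
queue.submit("ad-hoc-debug", gpus=2, priority=1)
print(queue.schedule())  # ['nightly-retrain', 'ad-hoc-debug']
```

The 8-GPU sweep stays queued until enough GPUs are released; production schedulers typically add preemption and backfilling on top of this basic idea.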

Component 3: Model Registry

The model registry is the single source of truth for all model artifacts and their lifecycle state. It connects training outputs to serving deployments.

Key capabilities:

Artifact storage: model weights, tokenizers, preprocessing code
Lifecycle management: staging, production, archived states
Lineage: link model version to training data version and code version
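The three registry capabilities fit in a small in-memory sketch. This is not the API of MLflow or any other registry product; the class names, fields, and the `s3://` URIs are assumptions for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Stage(Enum):
    STAGING = "staging"
    PRODUCTION = "production"
    ARCHIVED = "archived"


@dataclass
class ModelVersion:
    name: str
    version: int
    artifact_uri: str   # where weights and preprocessing code live
    data_version: str   # lineage: training data snapshot
    code_version: str   # lineage: commit of the training code
    stage: Stage = Stage.STAGING


class ModelRegistry:
    def __init__(self):
        self._versions = {}  # (name, version) -> ModelVersion

    def register(self, mv: ModelVersion) -> None:
        self._versions[(mv.name, mv.version)] = mv

    def promote(self, name: str, version: int) -> None:
        """Move one version to production and archive the previous production version."""
        target = self._versions[(name, version)]  # KeyError if never registered
        for mv in self._versions.values():
            if mv.name == name and mv.stage is Stage.PRODUCTION:
                mv.stage = Stage.ARCHIVED
        target.stage = Stage.PRODUCTION

    def production(self, name: str):
        """Return the current production version of a model, or None."""
        for mv in self._versions.values():
            if mv.name == name and mv.stage is Stage.PRODUCTION:
                return mv
        return None


registry = ModelRegistry()
registry.register(ModelVersion("churn", 1, "s3://models/churn/1", "data-2024-01", "abc123"))
registry.register(ModelVersion("churn", 2, "s3://models/churn/2", "data-2024-02", "def456"))
registry.promote("churn", 1)
registry.promote("churn", 2)
print(registry.production("churn").version)  # 2
```

Because every version carries its data and code versions, a rollback or an audit can start from the registry entry alone.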

Component 4: Serving Platform

The serving platform deploys models as HTTP endpoints, manages traffic routing, and handles autoscaling.

Key capabilities:

Multi-model serving: many models on the same infrastructure
Traffic splitting: A/B tests, canary deployments
Latency SLOs: p50/p99 latency tracking, auto-scaling
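One common way to implement traffic splitting is deterministic hashing, so the same caller always lands on the same model variant. The sketch below assumes a string request or user ID is available; the function name and the 100-bucket scheme are illustrative choices, not a specific serving framework's API.

```python
import hashlib


def route(request_id: str, canary_percent: int) -> str:
    """Deterministically map a request ID to 'canary' or 'stable'."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # bucket in 0..99
    return "canary" if bucket < canary_percent else "stable"


ids = [f"user-{i}" for i in range(10_000)]
canary_share = sum(route(i, 5) == "canary" for i in ids) / len(ids)
print(f"{canary_share:.1%}")  # close to 5%; the exact value depends on the hash
```

Hashing instead of random sampling keeps a user's experience stable across requests, which also makes A/B results easier to attribute.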

Component 5: Monitoring Platform

The monitoring platform detects model quality degradation, data drift, and infrastructure health issues.

Key capabilities:

Data drift detection: detect when input distributions shift
Prediction drift: detect when model outputs change
Alert routing: notify the right team for the right issue
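Data drift detection needs a number that says how far a live distribution has moved from the training baseline. The Population Stability Index (PSI) is one widely used option; the sketch below is a minimal standard-library version with equal-width bins derived from the baseline, which is a simplifying assumption. Tools such as Evidently offer more robust implementations.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a baseline sample and a live sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid zero width when the baseline is constant

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        return [max(c / len(values), eps) for c in counts]  # eps avoids log(0)

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


baseline = [i / 100 for i in range(100)]        # roughly uniform on [0, 1)
shifted = [0.5 + i / 200 for i in range(100)]   # same size, squeezed into [0.5, 1)
print(psi(baseline, baseline))                  # 0.0
print(psi(baseline, shifted) > 0.25)            # True
```

A commonly quoted rule of thumb treats PSI above roughly 0.25 as a significant shift, but the right alert threshold depends on the feature and should be tuned per model.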

Component 6: CI/CD Platform

The CI/CD platform automates the journey from training code commit to production deployment.

Key capabilities:

Pipeline execution: trigger training, validation, deployment
Quality gates: automated checks before promotion
Rollback: automatic revert on quality degradation
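A quality gate is, at its core, a function that compares a candidate model's metrics with the current production model and returns a pass or fail decision with reasons. The metric names and thresholds below are hypothetical defaults chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class GateResult:
    passed: bool
    reasons: list


def quality_gate(candidate: dict, production: dict,
                 max_accuracy_drop: float = 0.01,
                 max_p99_latency_ms: float = 200.0) -> GateResult:
    """Block promotion on accuracy regression or a blown latency budget."""
    reasons = []
    if candidate["accuracy"] < production["accuracy"] - max_accuracy_drop:
        reasons.append("accuracy regression")
    if candidate["p99_latency_ms"] > max_p99_latency_ms:
        reasons.append("latency over budget")
    return GateResult(passed=not reasons, reasons=reasons)


result = quality_gate(
    candidate={"accuracy": 0.90, "p99_latency_ms": 150.0},
    production={"accuracy": 0.92, "p99_latency_ms": 140.0},
)
print(result.passed, result.reasons)  # False ['accuracy regression']
```

Returning reasons rather than a bare boolean makes a failed pipeline run actionable for whoever triggered it.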

Platform vs No Platform: The Numbers

Why invest in building a platform at all? The numbers tell the story:

| Metric | Level-0 (No Platform) | Level-2 (Platform) |
| --- | --- | --- |
| Time to first production model | 3–6 months | 2–4 weeks |
| Time to deploy subsequent models | 4–8 weeks | 1–3 days |
| Models per team per year | 2–4 | 15–30 |
| Mean time to detect degradation | 2–4 weeks | 24 hours |
| Engineering effort per deployment | 3–5 weeks of infra eng | 0.5 days |
| Rollback time on model failure | 4–8 hours | 5–15 minutes |

The platform investment pays back when you have enough models that the per-model savings exceed the amortized platform cost. For most teams, this crossover is at 5–10 production models.
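The crossover point follows from simple arithmetic: divide the one-time platform cost by the effort saved per model. In the sketch below, the per-deployment figures come from the table (about 4 weeks versus 0.5 days, taken as 0.1 of a 5-day week), while the 30 engineer-weeks of platform cost is a purely hypothetical input chosen for illustration.

```python
import math


def break_even_models(platform_cost_weeks: float,
                      manual_weeks_per_model: float,
                      platform_weeks_per_model: float) -> int:
    """Number of models after which cumulative savings cover the platform cost."""
    saving = manual_weeks_per_model - platform_weeks_per_model
    if saving <= 0:
        raise ValueError("the platform must reduce per-model effort")
    return math.ceil(platform_cost_weeks / saving)


# Hypothetical: 30 engineer-weeks to build; 4 weeks manual vs 0.1 weeks on the platform.
print(break_even_models(30, 4.0, 0.1))  # 8
```

Under these assumptions the platform pays for itself at the eighth model; a more expensive platform or a cheaper manual process pushes the crossover out.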

Build vs Buy by Platform Component

```python
from enum import Enum


class BuildVsBuyRecommendation(Enum):
    """Sourcing strategy for a single platform component."""
    BUILD = "Build with OSS"
    BUY_MANAGED = "Buy managed service"
    HYBRID = "OSS + managed hosting"


# Per-component recommendation, keyed by team size.
# Each entry maps to (strategy, example tooling); "rationale" explains the switch point.
PLATFORM_RECOMMENDATIONS = {
    "experiment_tracking": {
        "small_team": (BuildVsBuyRecommendation.BUY_MANAGED, "W&B Team"),
        "large_team": (BuildVsBuyRecommendation.BUILD, "MLflow self-hosted"),
        "rationale": "W&B UX is excellent; switch to MLflow when seat cost exceeds engineering TCO",
    },
    "model_registry": {
        "small_team": (BuildVsBuyRecommendation.HYBRID, "MLflow Registry or W&B"),
        "large_team": (BuildVsBuyRecommendation.BUILD, "MLflow Registry self-hosted"),
        "rationale": "MLflow Registry is solid OSS; no compelling vendor advantage at scale",
    },
    "serving_platform": {
        "small_team": (BuildVsBuyRecommendation.BUY_MANAGED, "SageMaker or Vertex AI"),
        "large_team": (BuildVsBuyRecommendation.BUILD, "BentoML or Seldon on K8s"),
        "rationale": "Managed services win until serving cost justifies self-hosted ops",
    },
    "feature_store": {
        "small_team": (BuildVsBuyRecommendation.BUY_MANAGED, "Tecton"),
        "large_team": (BuildVsBuyRecommendation.HYBRID, "Feast self-hosted"),
        "rationale": "Feature stores are complex; only build when vendor doesn't fit",
    },
    "monitoring": {
        "small_team": (BuildVsBuyRecommendation.BUY_MANAGED, "Arize or WhyLabs"),
        "large_team": (BuildVsBuyRecommendation.BUILD, "Evidently + Prometheus"),
        "rationale": "ML monitoring vendors add real value; custom at scale",
    },
    "ci_cd": {
        "small_team": (BuildVsBuyRecommendation.BUILD, "GitHub Actions + MLflow"),
        "large_team": (BuildVsBuyRecommendation.BUILD, "Argo Workflows or Kubeflow"),
        "rationale": "CI/CD pipelines need to match your infra; always custom",
    },
}
```

The 12-Month Level-0 to Level-2 Roadmap

Q1: Foundations (Months 1–3)

Goal: Any engineer can reproduce any model.

Actions:

Deploy MLflow tracking server with S3 artifact store
Add tracking to all existing training scripts (2 lines of code each)
Create model registry entries for all production models
Document training-to-deployment process for each existing model

Success metric: 100% of production models have a reproducible training script logged in MLflow.

Q2: Serving and Monitoring (Months 4–6)

Goal: Model degradation caught within 24 hours.

Actions:

Standardize model serving with a common inference interface
Add prediction logging to all serving endpoints
Deploy basic drift monitoring (feature distribution shift)
Set up PagerDuty/Slack alerts for model health

Success metric: Mean time to detect model degradation under 24 hours.

Q3: CI/CD and Feature Store (Months 7–9)

Goal: New models deploy automatically when quality gates pass.

Actions:

Build automated training pipeline (CI triggers training on code merge)
Add automated quality gates (performance regression test)
Deploy canary serving (5% traffic to new model before full rollout)
Deploy Feast feature store for top 3 most-reused features

Success metric: Time from code merge to production under 4 hours for standard models.
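The canary step in Q3 needs an automated decision: promote, roll back, or keep waiting for traffic. The sketch below uses a plain relative-threshold comparison of error rates, which is a simplification (it is not a statistical significance test), and the `min_requests` and `max_relative_increase` defaults are hypothetical.

```python
def canary_decision(stable_errors: int, stable_requests: int,
                    canary_errors: int, canary_requests: int,
                    min_requests: int = 1000,
                    max_relative_increase: float = 0.10) -> str:
    """Return 'wait', 'rollback', or 'promote' for a canary deployment."""
    if canary_requests < min_requests or stable_requests == 0:
        return "wait"  # not enough traffic to judge either variant
    stable_rate = stable_errors / stable_requests
    canary_rate = canary_errors / canary_requests
    if canary_rate > stable_rate * (1 + max_relative_increase):
        return "rollback"  # canary is meaningfully worse than the stable model
    return "promote"


print(canary_decision(190, 19_000, 10, 1_000))  # promote
print(canary_decision(190, 19_000, 20, 1_000))  # rollback
print(canary_decision(190, 19_000, 0, 100))     # wait
```

Requiring a minimum request count before deciding prevents a handful of early requests from promoting or rolling back a model by chance.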

Q4: Self-Service (Months 10–12)

Goal: Data scientists ship models without infrastructure team help.

Actions:

Build platform UI (model catalog, experiment browser)
Create model templates for common use cases
Add cost tracking per model/experiment
Measure and publish adoption metrics

Success metric: 80% of new models deployed without infrastructure team involvement.

Platform Team Structure

The platform team structure determines what gets built and how fast. Three common patterns:

Embedded model (small companies, under 20 ML engineers): 1–2 engineers dedicated to ML platform work within the ML team. Close to users, fast feedback loop. Risk: platform work gets deprioritized when model work is urgent.

Centralized platform team (medium companies, 20–100 ML engineers): Dedicated 4–8 engineer platform team serving all ML users. Clear ownership, stable roadmap. Risk: can become disconnected from user needs.

Platform-as-product (large companies, 100+ ML engineers): Platform team treats internal users as customers. Has a product roadmap, user research, SLAs. Highest quality, hardest to staff. Works at companies like Google, Meta, Uber.

Production Engineering Notes

The Two Failure Modes of ML Platforms

Over-engineering: Building a Level-3 platform for a 5-person team. The platform consumes more engineering resources than the models themselves. Teams spend more time on infrastructure than on ML.

Under-engineering: Staying at Level-0 as the team and model portfolio grow. Technical debt accumulates. Each new model requires 6 weeks of manual work. The team becomes a bottleneck.

The right level of platform sophistication is determined by: number of models in production (below 10: Level 1 is fine; above 20: Level 2 is necessary; above 50: Level 3 starts paying off), team size, and iteration velocity requirements.
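The model-count thresholds above can be written down as a first-pass heuristic. This sketch deliberately ignores team size and iteration velocity, and the text leaves the 10 to 20 band unspecified; keeping Level 1 as the target there is an assumption of this example.

```python
def recommended_level(production_models: int) -> int:
    """Target maturity level based only on the number of production models."""
    if production_models > 50:
        return 3  # Level 3 starts paying off
    if production_models > 20:
        return 2  # Level 2 is necessary
    return 1      # Level 1 is fine (assumed to hold through the 10-20 band as well)


print([recommended_level(n) for n in (5, 15, 25, 60)])  # [1, 1, 2, 3]
```

Treat the result as a starting point for discussion, since a small team with strict iteration requirements may justify a higher target than the model count alone suggests.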

Platform Adoption is the Real Problem

Most ML platforms fail not because they're technically wrong but because nobody uses them. Engineers find workarounds. Data scientists maintain their own tooling. The platform becomes shelfware.

The fundamental principle: adoption is the product metric; features are a means to adoption. A platform with 10 features and 90% adoption beats a platform with 50 features and 30% adoption every time. This lesson is revisited in depth in Module 10, Lesson 08.

Common Mistakes

:::danger Building the platform before the platform users exist
Building ML infrastructure for a team of 2 data scientists is premature. The overhead of maintaining the platform exceeds the benefit. Wait until you have at least 3–5 models in production and a team of 4–6 ML practitioners before investing seriously in platform work.
:::

:::danger Buying a "complete ML platform" from a single vendor
Vendors selling "complete MLOps platforms" (DataRobot, H2O, etc.) rarely fit your actual workflow. You end up constrained by the vendor's opinions about how ML should work. Better approach: compose best-of-breed tools (MLflow for tracking, Feast for features, Seldon for serving) that each do one thing well.
:::

:::warning Not measuring platform adoption
If you don't measure who is using the platform and how, you don't know if the investment is working. Track: number of experiments logged per week, number of models deployed via CI/CD vs manually, number of features served through the feature store vs custom pipelines. These adoption metrics are your platform's business metrics.
:::
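One of these adoption metrics, the share of deployments that went through CI/CD rather than a manual process, takes only a few lines to compute once deployments are recorded. The `Deployment` record and the sample data below are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    model: str
    via_cicd: bool  # True if deployed by the pipeline, False if deployed by hand


def cicd_adoption_rate(deployments) -> float:
    """Fraction of deployments that went through the CI/CD pipeline."""
    if not deployments:
        return 0.0
    return sum(d.via_cicd for d in deployments) / len(deployments)


last_quarter = [
    Deployment("churn", True),
    Deployment("ranking", True),
    Deployment("fraud", False),
    Deployment("pricing", True),
]
print(f"{cicd_adoption_rate(last_quarter):.0%}")  # 75%
```

Tracking this number per quarter shows whether the platform is replacing manual workflows or merely existing alongside them.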

Interview Q&A

Q: What is the MLOps maturity model and how do you use it in practice?

A: The maturity model used here describes four levels of ML operational maturity. It extends the levels in Google's 2020 MLOps guidance, which defines Levels 0 through 2, with a fourth level. Level 0: everything manual - training in notebooks, deployment ad-hoc. Level 1: ML pipeline automation - training is scripted and repeatable, experiment tracking in place. Level 2: CI/CD for ML - model deployment triggered automatically when quality gates pass. Level 3: fully automated operation - continuous training on schedule or data triggers, automated rollback. In practice, I use the model as a diagnostic tool and a roadmap. When joining a new company, I assess which level they're at, identify the highest-value gaps, and propose a prioritized roadmap to move up levels. The key insight: you don't need to reach Level 3 immediately - Level 2 is the right target for most companies, and getting there typically takes 9–18 months.

Q: How do you decide which platform components to build vs buy?

A: I evaluate on three dimensions: team size, compute cost, and how standard the use case is. Experiment tracking: buy W&B for teams under 20 (seat cost justified by productivity), switch to self-hosted MLflow at scale. Model serving: buy managed services (SageMaker) for teams with under 10 production models, self-host on Kubernetes at higher scale. Feature stores: almost always buy or use OSS (Feast) - building a correct feature store is extremely hard, and most teams underestimate the complexity of handling training-serving consistency correctly. The universal rule: build when your use case is genuinely unusual, buy when you're within 80% of a standard use case.

Q: What is the most common MLOps mistake you've seen?

A: Building the data pipeline before the model pipeline. Teams spend months building sophisticated data infrastructure for a model that hasn't been validated as useful yet. The correct order is: prototype the model first (Level 0 is fine for a prototype), validate business value in a manual deployment, then invest in automation. Investing in automation before validating the model is the fastest way to build expensive infrastructure for something that gets cancelled.

Q: How do you structure a platform team for maximum impact?

A: The best structure depends on company size, but a few principles hold universally. First, embed platform engineers with ML teams early - not in a separate "infra" org where they can't hear user complaints. Second, treat internal users as customers: hold office hours, do user research, measure adoption. Third, define the platform's SLO - data scientists should know what uptime and support response time to expect. Fourth, have a clear interface: document what the platform provides and what the user is responsible for. Blurry ownership - "who do I talk to when the training job fails?" - kills adoption faster than any technical limitation.

Q: What would you build first when joining a company at MLOps Level 0?

A: Experiment tracking and model registry - in that order. Experiment tracking first because it has zero disruption to existing workflows (you add 2 lines to existing training scripts), creates immediate value (reproducibility, comparison), and builds the habit of logging. Model registry second because it provides the foundation for everything downstream: deployments reference model versions, monitoring tracks model versions, rollbacks use model versions. With these two things in place, you've gone from "we have no idea what we've trained" to "we have a complete audit trail of every model." The third investment is monitoring - because the highest-impact production failure is a model silently degrading and nobody noticing for weeks.
