MLOps Platform Architecture
The Level-0 Company
The data scientist had been at the company for eight months. In that time, she had trained twelve models. Exactly two were in production. The others were in various states of limbo: notebooks on her laptop, CSV files on a shared drive, a Python script that "worked on my machine" but failed when the infrastructure team tried to deploy it, a model whose training data had been accidentally overwritten.
The two production models were maintained by the infrastructure team - a separate team that had spent two months integrating each model into the serving infrastructure. Communication was painful. Debugging required both teams in a call, passing screenshots back and forth, because neither had access to the other's tooling. When a model degraded, nobody noticed for three weeks because there was no monitoring. When the data schema changed upstream, the production model started silently returning garbage predictions because there was no data validation.
This is Level-0 MLOps. It is the most common state in the industry. Most companies that call themselves "AI-first" are operating at Level-0 or Level-1. The models work. The infrastructure around the models is a mess of manual processes, undocumented decisions, and institutional knowledge locked in people's heads.
The question is not whether to move up the maturity levels - it is how to do it without destroying the team's ability to ship models during the transition.
Why Platform Architecture Matters
Before MLOps, every ML deployment was a unique adventure. A data scientist trained a model, an engineer rewrote it in a "production language," another engineer deployed it to a server, and a third engineer wrote monitoring code. The same work happened for every model, every time. Knowledge didn't accumulate - it scattered.
The fundamental insight behind ML platforms is that standardization enables acceleration. When every model uses the same deployment interface, the same monitoring hooks, and the same serving infrastructure, each new model deployment takes days instead of months. When every experiment is tracked in the same system, comparisons are trivial. When every feature is stored in the same feature store, feature engineering work is shared instead of duplicated.
The cost of building this infrastructure is real. The payoff is compounding - each new model benefits from all the previous investment.
The MLOps Maturity Model
This lesson uses a four-level maturity model adapted from Google's MLOps guidance (2020), which itself describes three levels (Level 0 through Level 2):
Level 0: No MLOps
- Data collection and preparation: manual, ad-hoc
- Model training: Jupyter notebooks on local machines
- Model deployment: manual, undocumented process
- Monitoring: none, or manual dashboards
- Iteration cycle: weeks to months
Signature sign: "We need to ask [name] how this model works because they wrote it."
Level 1: ML Pipeline Automation
- Training is automated: training script runs on CI trigger or schedule
- Experiment tracking: MLflow or W&B integrated
- Model validation: automated test against holdout set before deployment
- Deployment: still largely manual, but documented
- Monitoring: basic model metrics tracked
Signature sign: Any engineer can retrain any model without asking the original author.
Level 2: CI/CD Pipeline for ML
- Training and deployment in the same CI/CD pipeline
- Automated testing: unit tests, integration tests, performance tests
- Canary deployment: new model deployed to small traffic fraction
- Monitoring: automated drift detection, alerting
- Feature store: shared features across teams
Signature sign: A model can go from "training triggered" to "serving 5% traffic" without human intervention.
Level 3: Fully Automated ML Pipeline
- Continuous training: models retrain automatically on schedule or data triggers
- Automated rollback: if new model degrades, rollback triggers automatically
- A/B test automation: traffic allocation adjusts based on statistical significance
- Full observability: every prediction is logged, every model has a health score
Signature sign: "The recommendation model retrains every night and deploys itself if it passes quality gates."
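The levels above can be used as a checklist. The sketch below is one way to encode that, assuming a hypothetical set of capability labels (the names are invented for illustration and are not part of any standard): a team sits at the highest level for which it has every capability of that level and of all levels below it.

```python
# Capability labels are hypothetical names chosen for this sketch;
# they paraphrase the bullets listed under each level above.
LEVEL_REQUIREMENTS = {
    1: {"automated_training", "experiment_tracking", "automated_validation"},
    2: {"cicd_deployment", "canary_deployment", "drift_alerting", "feature_store"},
    3: {"continuous_training", "automated_rollback", "ab_test_automation",
        "prediction_logging"},
}


def assess_maturity_level(capabilities: set) -> int:
    """Return the highest level whose requirements (and all lower ones) are met."""
    level = 0
    for candidate in sorted(LEVEL_REQUIREMENTS):
        if LEVEL_REQUIREMENTS[candidate] <= capabilities:
            level = candidate
        else:
            break  # a gap at this level caps the assessment
    return level


def missing_for_next_level(capabilities: set) -> set:
    """Return the capabilities still needed to reach the next level."""
    next_level = assess_maturity_level(capabilities) + 1
    return LEVEL_REQUIREMENTS.get(next_level, set()) - capabilities


team = {"automated_training", "experiment_tracking", "automated_validation",
        "continuous_training"}
print(assess_maturity_level(team))
print(sorted(missing_for_next_level(team)))
```

Note that having one Level 3 capability (here, continuous training) does not raise the assessment while Level 2 gaps remain, which matches how the model is used as a roadmap.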
Platform Components
A complete ML platform has six major components. Understand each on its own before looking at how they fit together.
Component 1: Data Platform
The data platform stores, versions, and serves features - the inputs to ML models. Without it, every team builds their own feature pipelines, creating duplication, inconsistency, and the training-serving skew problem (training on different features than you serve in production).
Key capabilities:
- Feature store: compute features once, reuse across teams and models
- Data versioning: track which data version trained which model
- Point-in-time correct features: build training sets from feature values as they existed at each prediction's timestamp, never from later values (critical for training-serving consistency)
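Point-in-time correctness is easier to see in code. The following is a minimal sketch, not how any particular feature store implements it: it assumes each entity has a feature history stored as a list of `(timestamp, value)` pairs sorted by timestamp, and the function names are hypothetical.

```python
from bisect import bisect_right


def feature_as_of(history, timestamp):
    """Return the latest value recorded at or before `timestamp`, else None.

    `history` is a list of (timestamp, value) pairs sorted by timestamp.
    """
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, timestamp)
    return history[idx - 1][1] if idx > 0 else None


def build_training_rows(events, feature_history):
    """Join each labeled event to the feature value known at event time.

    `events` is a list of (entity_id, event_timestamp, label) tuples.
    """
    rows = []
    for entity_id, event_ts, label in events:
        value = feature_as_of(feature_history.get(entity_id, []), event_ts)
        rows.append((entity_id, event_ts, value, label))
    return rows


history = {"u1": [(1, 10.0), (5, 20.0), (9, 30.0)]}
events = [("u1", 6, 1), ("u1", 0, 0)]
print(build_training_rows(events, history))
```

The event at timestamp 6 sees the value written at timestamp 5, not the later value from timestamp 9. Using the later value would leak future information into training.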
Component 2: Training Platform
The training platform provides compute orchestration for ML training jobs. It abstracts the underlying hardware (GPU instances, distributed training clusters) from the data scientist writing the training code.
Key capabilities:
- Job scheduling: queue-based GPU allocation, priority management
- Experiment tracking: log hyperparameters, metrics, and artifacts
- Distributed training: coordinate multi-GPU and multi-node training
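As a rough illustration of queue-based GPU allocation with priorities, here is a small in-memory sketch. The class and method names are hypothetical, and the policy is deliberately simple: strict priority order with no backfilling, so a large job at the head of the queue blocks smaller jobs behind it.

```python
import heapq
import itertools
from dataclasses import dataclass, field


@dataclass(order=True)
class _QueuedJob:
    sort_key: tuple                      # (priority, submission order)
    name: str = field(compare=False)
    gpus: int = field(compare=False)


class GpuJobQueue:
    """Toy scheduler: lower priority number runs first, FIFO within a priority."""

    def __init__(self, total_gpus: int):
        self.free_gpus = total_gpus
        self._heap = []
        self._counter = itertools.count()

    def submit(self, name: str, gpus: int, priority: int) -> None:
        key = (priority, next(self._counter))
        heapq.heappush(self._heap, _QueuedJob(key, name, gpus))

    def schedule(self) -> list:
        """Start jobs in priority order for as long as the head job fits."""
        started = []
        while self._heap and self._heap[0].gpus <= self.free_gpus:
            job = heapq.heappop(self._heap)
            self.free_gpus -= job.gpus
            started.append(job.name)
        return started

    def release(self, gpus: int) -> None:
        """Return GPUs to the pool when a job finishes."""
        self.free_gpus += gpus


queue = GpuJobQueue(total_gpus=8)
queue.submit("batch-retrain", gpus=4, priority=5)
queue.submit("urgent-fix", gpus=8, priority=1)
print(queue.schedule())
```

Real schedulers add preemption, quotas, and backfilling; the point here is only that the data scientist submits a job description and the platform decides when and where it runs.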
Component 3: Model Registry
The model registry is the single source of truth for all model artifacts and their lifecycle state. It connects training outputs to serving deployments.
Key capabilities:
- Artifact storage: model weights, tokenizers, preprocessing code
- Lifecycle management: staging, production, archived states
- Lineage: link model version to training data version and code version
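To make the lifecycle and lineage ideas concrete, here is a minimal in-memory sketch. It is not the API of MLflow or any other registry. The class names, the `registered` initial stage, the allowed transitions, and the rule that promoting a version archives the previous production version are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Assumed transition rules; "registered" is a hypothetical initial stage.
ALLOWED_TRANSITIONS = {
    "registered": {"staging"},
    "staging": {"production", "archived"},
    "production": {"archived"},
    "archived": set(),
}


@dataclass
class ModelVersion:
    name: str
    version: int
    artifact_uri: str
    data_version: str   # lineage: which data trained this model
    code_version: str   # lineage: which commit produced it
    stage: str = "registered"


class ModelRegistry:
    def __init__(self):
        self._versions = {}

    def register(self, name, artifact_uri, data_version, code_version):
        version = 1 + sum(1 for n, _ in self._versions if n == name)
        mv = ModelVersion(name, version, artifact_uri, data_version, code_version)
        self._versions[(name, version)] = mv
        return mv

    def transition(self, name, version, new_stage):
        mv = self._versions[(name, version)]
        if new_stage not in ALLOWED_TRANSITIONS[mv.stage]:
            raise ValueError(f"cannot move {mv.stage} -> {new_stage}")
        if new_stage == "production":
            # Keep at most one production version per model name.
            for other in self._versions.values():
                if other.name == name and other.stage == "production":
                    other.stage = "archived"
        mv.stage = new_stage
        return mv

    def production_version(self, name):
        for mv in self._versions.values():
            if mv.name == name and mv.stage == "production":
                return mv
        return None
```

Because serving, monitoring, and rollback all look up `(name, version)` in one place, the registry is what lets a rollback mean "point production back at version N" rather than "find the old file".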
Component 4: Serving Platform
The serving platform deploys models as HTTP endpoints, manages traffic routing, and handles autoscaling.
Key capabilities:
- Multi-model serving: many models on the same infrastructure
- Traffic splitting: A/B tests, canary deployments
- Latency SLOs: p50/p99 latency tracking, auto-scaling
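Traffic splitting for a canary is often done by hashing a stable request key, so the same user consistently hits the same model version. The sketch below shows that idea under simple assumptions: the function name and the `stable`/`canary` labels are hypothetical, and real serving platforms usually configure this at the router or service-mesh layer rather than in application code.

```python
import hashlib


def route_request(request_key: str, canary_fraction: float) -> str:
    """Deterministically assign a request key to 'canary' or 'stable'.

    The key is hashed to a number in [0, 1); keys below the fraction
    go to the canary. The same key always gets the same answer.
    """
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "canary" if bucket < canary_fraction else "stable"


# Roughly 5% of distinct users should land on the canary.
assignments = [route_request(f"user-{i}", 0.05) for i in range(10_000)]
print(assignments.count("canary") / len(assignments))
```

Hashing instead of random sampling keeps each user's experience consistent across requests and makes A/B analysis cleaner, since a user never switches arms mid-experiment.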
Component 5: Monitoring Platform
The monitoring platform detects model quality degradation, data drift, and infrastructure health issues.
Key capabilities:
- Data drift detection: detect when input distributions shift
- Prediction drift: detect when model outputs change
- Alert routing: notify the right team for the right issue
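One common statistic for data drift is the Population Stability Index (PSI). It is used here as an example only; the text above does not prescribe a particular drift measure. The sketch bins a reference sample (for example, training data) and a live sample using the reference sample's range, then compares the bin proportions. The bin count and the `eps` floor for empty bins are assumptions.

```python
import math


def population_stability_index(expected, actual, bins=10, eps=1e-6):
    """PSI between a reference sample and a live sample of one numeric feature."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant features

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp out-of-range values
            counts[idx] += 1
        # Floor empty bins at eps so the logarithm is defined.
        return [max(c / len(values), eps) for c in counts]

    p_exp = proportions(expected)
    p_act = proportions(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(p_exp, p_act))


reference = list(range(100))
shifted = [v + 50 for v in reference]
print(population_stability_index(reference, reference))
print(population_stability_index(reference, shifted))
```

Identical samples give a PSI of zero, and the shifted sample gives a much larger value. Alert thresholds are a rule of thumb that each team should tune per feature rather than a fixed standard.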
Component 6: CI/CD Platform
The CI/CD platform automates the journey from training code commit to production deployment.
Key capabilities:
- Pipeline execution: trigger training, validation, deployment
- Quality gates: automated checks before promotion
- Rollback: automatic revert on quality degradation
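A quality gate is, at its core, a function that compares a candidate model's metrics against the current production model and against fixed limits. The sketch below assumes two hypothetical metrics (`auc` and `p99_latency_ms`) and illustrative thresholds; real gates would use whatever metrics matter for the model in question.

```python
from dataclasses import dataclass, field


@dataclass
class GateResult:
    passed: bool
    failures: list = field(default_factory=list)


def evaluate_quality_gates(candidate, production,
                           max_regression=0.01,
                           max_p99_latency_ms=200.0):
    """Block promotion if the candidate regresses or violates the latency limit.

    `candidate` and `production` are dicts of metric name -> value.
    Metric names and thresholds are assumptions made for this sketch.
    """
    failures = []
    if candidate["auc"] < production["auc"] - max_regression:
        failures.append(
            f"auc regressed: {candidate['auc']:.3f} vs {production['auc']:.3f}"
        )
    if candidate["p99_latency_ms"] > max_p99_latency_ms:
        failures.append(
            f"p99 latency {candidate['p99_latency_ms']:.0f} ms exceeds limit"
        )
    return GateResult(passed=not failures, failures=failures)


production = {"auc": 0.90, "p99_latency_ms": 120.0}
candidate = {"auc": 0.85, "p99_latency_ms": 250.0}
print(evaluate_quality_gates(candidate, production))
```

In a pipeline, a failed result stops promotion and the failure messages go into the CI log, so the author can see which gate blocked the deployment.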
Platform vs No Platform: The Numbers
Why invest in building a platform at all? The table below gives typical ranges; read them as rough orders of magnitude rather than benchmarks:
| Metric | Level-0 (No Platform) | Level-2 (Platform) |
|---|---|---|
| Time to first production model | 3–6 months | 2–4 weeks |
| Time to deploy subsequent models | 4–8 weeks | 1–3 days |
| Models per team per year | 2–4 | 15–30 |
| Mean time to detect degradation | 2–4 weeks | Under 24 hours |
| Engineering effort per deployment | 3–5 weeks of infrastructure engineering | 0.5 days |
| Rollback time on model failure | 4–8 hours | 5–15 minutes |
The platform investment pays back when you have enough models that the per-model savings exceed the amortized platform cost. For most teams, this crossover is at 5–10 production models.
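The crossover point is simple arithmetic: divide the platform's cost by the effort saved per model. The sketch below works one example. The 30 engineer-week platform cost is a made-up figure for illustration, while the per-deployment efforts are taken loosely from the table (about 4 weeks manually, about 0.1 weeks with a platform).

```python
import math


def breakeven_models(platform_cost_weeks: float,
                     manual_weeks_per_model: float,
                     platform_weeks_per_model: float) -> int:
    """Number of deployed models at which the platform has paid for itself."""
    savings = manual_weeks_per_model - platform_weeks_per_model
    if savings <= 0:
        raise ValueError("the platform must save effort per model to break even")
    return math.ceil(platform_cost_weeks / savings)


# Hypothetical: 30 engineer-weeks to build, 4 weeks vs 0.1 weeks per deployment.
print(breakeven_models(30, 4, 0.1))  # 30 / 3.9 = 7.7, so 8 models
```

With these inputs the break-even lands at 8 models, inside the 5–10 range quoted above. A more expensive platform or smaller per-model savings pushes the crossover higher.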
Build vs Buy by Platform Component
```python
from enum import Enum


class BuildVsBuyRecommendation(Enum):
    BUILD = "Build with OSS"
    BUY_MANAGED = "Buy managed service"
    HYBRID = "OSS + managed hosting"


# Per component: (recommendation, example tooling) by team size, plus the rationale.
PLATFORM_RECOMMENDATIONS = {
    "experiment_tracking": {
        "small_team": (BuildVsBuyRecommendation.BUY_MANAGED, "W&B Team"),
        "large_team": (BuildVsBuyRecommendation.BUILD, "MLflow self-hosted"),
        "rationale": "W&B UX is excellent; switch to MLflow when seat cost exceeds engineering TCO",
    },
    "model_registry": {
        "small_team": (BuildVsBuyRecommendation.HYBRID, "MLflow Registry or W&B"),
        "large_team": (BuildVsBuyRecommendation.BUILD, "MLflow Registry self-hosted"),
        "rationale": "MLflow Registry is solid OSS; no compelling vendor advantage at scale",
    },
    "serving_platform": {
        "small_team": (BuildVsBuyRecommendation.BUY_MANAGED, "SageMaker or Vertex AI"),
        "large_team": (BuildVsBuyRecommendation.BUILD, "BentoML or Seldon on K8s"),
        "rationale": "Managed services win until serving cost justifies self-hosted ops",
    },
    "feature_store": {
        "small_team": (BuildVsBuyRecommendation.BUY_MANAGED, "Tecton"),
        "large_team": (BuildVsBuyRecommendation.HYBRID, "Feast self-hosted"),
        "rationale": "Feature stores are complex; only build when vendor doesn't fit",
    },
    "monitoring": {
        "small_team": (BuildVsBuyRecommendation.BUY_MANAGED, "Arize or WhyLabs"),
        "large_team": (BuildVsBuyRecommendation.BUILD, "Evidently + Prometheus"),
        "rationale": "ML monitoring vendors add real value; custom at scale",
    },
    "ci_cd": {
        "small_team": (BuildVsBuyRecommendation.BUILD, "GitHub Actions + MLflow"),
        "large_team": (BuildVsBuyRecommendation.BUILD, "Argo Workflows or Kubeflow"),
        "rationale": "CI/CD pipelines need to match your infra; always custom",
    },
}
```
The 12-Month Level-0 to Level-2 Roadmap
Q1: Foundations (Months 1–3)
Goal: Any engineer can reproduce any model.
Actions:
- Deploy MLflow tracking server with S3 artifact store
- Add tracking to all existing training scripts (2 lines of code each)
- Create model registry entries for all production models
- Document training-to-deployment process for each existing model
Success metric: 100% of production models have a reproducible training script logged in MLflow.
Q2: Serving and Monitoring (Months 4–6)
Goal: Model degradation caught within 24 hours.
Actions:
- Standardize model serving with a common inference interface
- Add prediction logging to all serving endpoints
- Deploy basic drift monitoring (feature distribution shift)
- Set up PagerDuty/Slack alerts for model health
Success metric: Mean time to detect model degradation under 24 hours.
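The first two Q2 actions fit together naturally: once every model sits behind the same `predict` interface, prediction logging can be added once, in a wrapper, instead of per model. The sketch below illustrates that. All class names are hypothetical, and the `sink` is any callable that accepts a string (a list's `append` here; in practice a log shipper or message queue client).

```python
import json
import time
from typing import Any, Callable, Protocol


class Model(Protocol):
    """The common inference interface: anything with a predict method."""

    def predict(self, features: dict) -> Any: ...


class LoggedEndpoint:
    """Wraps any Model and emits one JSON log record per prediction."""

    def __init__(self, model: Model, name: str, version: str,
                 sink: Callable[[str], None]):
        self.model = model
        self.name = name
        self.version = version
        self.sink = sink

    def predict(self, features: dict) -> Any:
        prediction = self.model.predict(features)
        self.sink(json.dumps({
            "ts": time.time(),
            "model": self.name,
            "version": self.version,
            "features": features,
            "prediction": prediction,
        }))
        return prediction


class ThresholdModel:
    """Toy model used only to exercise the wrapper."""

    def predict(self, features: dict) -> int:
        return int(features["score"] > 0.5)


log_lines = []
endpoint = LoggedEndpoint(ThresholdModel(), "churn", "3", log_lines.append)
endpoint.predict({"score": 0.9})
print(log_lines[0])
```

Logging the model name and version with every prediction is what later makes drift monitoring and per-version comparisons possible.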
Q3: CI/CD and Feature Store (Months 7–9)
Goal: New models deploy automatically when quality gates pass.
Actions:
- Build automated training pipeline (CI triggers training on code merge)
- Add automated quality gates (performance regression test)
- Deploy canary serving (5% traffic to new model before full rollout)
- Deploy Feast feature store for top 3 most-reused features
Success metric: Time from code merge to production under 4 hours for standard models.
Q4: Self-Service (Months 10–12)
Goal: Data scientists ship models without infrastructure team help.
Actions:
- Build platform UI (model catalog, experiment browser)
- Create model templates for common use cases
- Add cost tracking per model/experiment
- Measure and publish adoption metrics
Success metric: 80% of new models deployed without infrastructure team involvement.
Platform Team Structure
The platform team structure determines what gets built and how fast. Three common patterns:
Embedded model (small companies, under 20 ML engineers): 1–2 engineers dedicated to ML platform work within the ML team. Close to users, fast feedback loop. Risk: platform work gets deprioritized when model work is urgent.
Centralized platform team (medium companies, 20–100 ML engineers): Dedicated 4–8 engineer platform team serving all ML users. Clear ownership, stable roadmap. Risk: can become disconnected from user needs.
Platform-as-product (large companies, 100+ ML engineers): Platform team treats internal users as customers. Has a product roadmap, user research, SLAs. Highest quality, hardest to staff. Works at companies like Google, Meta, Uber.
Production Engineering Notes
The Two Failure Modes of ML Platforms
Over-engineering: Building a Level-3 platform for a 5-person team. The platform consumes more engineering resources than the models themselves. Teams spend more time on infrastructure than on ML.
Under-engineering: Staying at Level-0 as the team and model portfolio grow. Technical debt accumulates. Each new model requires 6 weeks of manual work. The team becomes a bottleneck.
The right level of platform sophistication is determined by: number of models in production (below 10: Level 1 is fine; above 20: Level 2 is necessary; above 50: Level 3 starts paying off), team size, and iteration velocity requirements.
Platform Adoption is the Real Problem
Most ML platforms fail not because they're technically wrong but because nobody uses them. Engineers find workarounds. Data scientists maintain their own tooling. The platform becomes shelfware.
The fundamental principle: adoption is the product metric, features are a means to adoption. A platform with 10 features and 90% adoption beats a platform with 50 features and 30% adoption every time. This lesson is revisited in depth in Module 10, Lesson 08.
Common Mistakes
:::danger Building the platform before the platform users exist
Building ML infrastructure for a team of 2 data scientists is premature. The overhead of maintaining the platform exceeds the benefit. Wait until you have at least 3–5 models in production and a team of 4–6 ML practitioners before investing seriously in platform work.
:::
:::danger Buying a "complete ML platform" from a single vendor
Vendors selling "complete MLOps platforms" (DataRobot, H2O, etc.) rarely fit your actual workflow. You end up constrained by the vendor's opinions about how ML should work. Better approach: compose best-of-breed tools (MLflow for tracking, Feast for features, Seldon for serving) that each do one thing well.
:::
:::warning Not measuring platform adoption
If you don't measure who is using the platform and how, you don't know if the investment is working. Track: number of experiments logged per week, number of models deployed via CI/CD vs manually, number of features served through the feature store vs custom pipelines. These adoption metrics are your platform's business metrics.
:::
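Two of the adoption metrics above reduce to simple shares. The sketch below assumes deployments and feature reads have already been labeled by how they happened; the labels (`cicd`, `manual`, `feature_store`, `custom`) and the function name are hypothetical.

```python
from collections import Counter


def adoption_metrics(deployments, feature_reads):
    """Share of activity that goes through the platform.

    `deployments` is a list of "cicd" / "manual" labels;
    `feature_reads` is a list of "feature_store" / "custom" labels.
    """
    def share(items, platform_label):
        counts = Counter(items)
        total = sum(counts.values())
        return counts[platform_label] / total if total else 0.0

    return {
        "cicd_deploy_share": share(deployments, "cicd"),
        "feature_store_share": share(feature_reads, "feature_store"),
    }


print(adoption_metrics(
    deployments=["cicd", "cicd", "manual", "cicd"],
    feature_reads=["feature_store", "custom"],
))
```

Tracked week over week, these shares show whether teams are moving onto the platform or routing around it.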
Interview Q&A
Q: What is the MLOps maturity model and how do you use it in practice?
A: The MLOps maturity model (adapted from Google's 2020 MLOps guidance, which describes Levels 0 through 2) describes four levels of ML operational maturity. Level 0: everything manual - training in notebooks, deployment ad-hoc. Level 1: ML pipeline automation - training is scripted and repeatable, experiment tracking in place. Level 2: CI/CD for ML - model deployment triggered automatically when quality gates pass. Level 3: fully automated pipeline - continuous training on a schedule or data triggers, automated rollback. In practice, I use the model as a diagnostic tool and a roadmap. When joining a new company, I assess which level they're at, identify the highest-value gaps, and propose a prioritized roadmap to move up levels. The key insight: you don't need to reach Level 3 immediately - Level 2 is the right target for most companies, and getting there typically takes 9–18 months.
Q: How do you decide which platform components to build vs buy?
A: I evaluate on three dimensions: team size, compute cost, and how standard the use case is. Experiment tracking: buy W&B for teams under 20 (seat cost justified by productivity), switch to self-hosted MLflow at scale. Model serving: buy managed services (SageMaker) for teams with under 10 production models, self-host on Kubernetes at higher scale. Feature stores: almost always buy or use OSS (Feast) - building a correct feature store is extremely hard, and most teams underestimate the complexity of handling training-serving consistency correctly. The universal rule: build when your use case is genuinely unusual, buy when you're within 80% of a standard use case.
Q: What is the most common MLOps mistake you've seen?
A: Building the data pipeline before the model pipeline. Teams spend months building sophisticated data infrastructure for a model that hasn't been validated as useful yet. The correct order is: prototype the model first (Level 0 is fine for a prototype), validate business value in a manual deployment, then invest in automation. Investing in automation before validating the model is the fastest way to build expensive infrastructure for something that gets cancelled.
Q: How do you structure a platform team for maximum impact?
A: The best structure depends on company size, but a few principles hold universally. First, embed platform engineers with ML teams early - not in a separate "infra" org where they can't hear user complaints. Second, treat internal users as customers: hold office hours, do user research, measure adoption. Third, define the platform's SLO - data scientists should know what uptime and support response time to expect. Fourth, have a clear interface: document what the platform provides and what the user is responsible for. Blurry ownership - "who do I talk to when the training job fails?" - kills adoption faster than any technical limitation.
Q: What would you build first when joining a company at MLOps Level 0?
A: Experiment tracking and model registry - in that order. Experiment tracking first because it has zero disruption to existing workflows (you add 2 lines to existing training scripts), creates immediate value (reproducibility, comparison), and builds the habit of logging. Model registry second because it provides the foundation for everything downstream: deployments reference model versions, monitoring tracks model versions, rollbacks use model versions. With these two things in place, you've gone from "we have no idea what we've trained" to "we have a complete audit trail of every model." The third investment is monitoring - because the highest-impact production failure is a model silently degrading and nobody noticing for weeks.
