Model Registry Concepts
The 2am Rollback
It is 2:14am on a Tuesday. The on-call engineer's phone is ringing. They answer and immediately hear: "The recommendation model is broken. Revenue is down 30%. Roll it back."
Simple enough. Except - roll back to what? The team has been shipping model updates every few days for six months. The "previous" model lives somewhere in S3. The engineer opens the AWS console and starts searching. There are 340 objects in the ml-models/ bucket. Files are named things like rec_model_v2.pkl, rec_model_v2_retrained.pkl, rec_model_v2_GOOD.pkl, rec_model_FINAL_march.pkl. None of them have timestamps in the name. The actual S3 last-modified dates don't help because several files were copied between buckets.
Forty-five minutes pass. The engineer finds a likely candidate, deploys it, and the error rate drops. But is that the right model? Was it trained on good data? Did it pass evaluation? No one knows. Post-mortem tomorrow will be brutal.
This scenario plays out at ML teams everywhere. It is not a discipline problem. It is an infrastructure problem. The team had no model registry - no system that records which model is where, what version it is, what it was trained on, and whether it is safe to use in production. The cost of that missing infrastructure was 45 minutes of downtime and a very bad night.
The model registry is the system that makes this incident impossible.
Why This Exists
The Problem Before
Before model registries became standard practice, ML teams managed models the same way they managed files: manually. A common pattern looked like this:
- Train model → save to S3 or local disk
- Name it something human-readable (hoping everyone follows the convention)
- Post in Slack: "New model is ready, please deploy rec_model_v3_final.pkl"
- Update a spreadsheet with performance metrics
- Hope everyone agrees on which file is "production"
This breaks down in four specific ways:
1. No single source of truth. Three engineers have three different answers for "what model is in production." Each is partially correct.
2. No metadata. The file knows nothing about itself. You cannot ask a .pkl file what data it was trained on, what its validation AUC was, or whether it passed fairness checks.
3. No lifecycle management. There is no official way to say "this model has been approved for production" versus "this is experimental." The distinction lives in people's heads.
4. No rollback path. When something goes wrong, finding the last known-good model is manual forensics - not an operation.
What a Model Registry Solves
A model registry is a centralized service that stores, versions, and tracks metadata for ML models throughout their lifecycle. It answers four critical questions at any moment:
- What is running in production right now?
- What version is it, and what is the full history of versions?
- What was this model trained on, and what were its metrics?
- What is the safe rollback target?
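To make the four questions concrete, here is a minimal in-memory sketch of a registry for a single model name. It is an illustration only, not how any particular registry product is implemented; the class names, the dataset identifiers, and the rule "the rollback target is the most recent archived version" are all assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ModelVersion:
    version: int
    stage: str            # "None", "Staging", "Production", or "Archived"
    artifact_uri: str     # where the serialized model lives
    training_data: str    # hypothetical dataset identifier
    metrics: dict = field(default_factory=dict)


class TinyRegistry:
    """Toy registry for one model name (illustration only)."""

    def __init__(self) -> None:
        self.versions: list[ModelVersion] = []

    def register(self, artifact_uri: str, training_data: str, metrics: dict) -> ModelVersion:
        # New versions get the next integer and start unreviewed.
        mv = ModelVersion(len(self.versions) + 1, "None", artifact_uri, training_data, metrics)
        self.versions.append(mv)
        return mv

    def production(self) -> Optional[ModelVersion]:
        # Question 1: what is running in production right now?
        return next((v for v in self.versions if v.stage == "Production"), None)

    def history(self) -> list[int]:
        # Question 2: what is the full history of versions?
        return [v.version for v in self.versions]

    def rollback_target(self) -> Optional[ModelVersion]:
        # Question 4 (assumed policy): the most recent archived version.
        archived = [v for v in self.versions if v.stage == "Archived"]
        return max(archived, key=lambda v: v.version, default=None)


registry = TinyRegistry()
v1 = registry.register("s3://ml-artifacts/rec-model/v1/model.pkl", "clicks-2024-01", {"auc": 0.84})
v2 = registry.register("s3://ml-artifacts/rec-model/v2/model.pkl", "clicks-2024-02", {"auc": 0.86})
v1.stage, v2.stage = "Archived", "Production"

current = registry.production()
# Question 3: what was it trained on, and what were its metrics?
print(current.version, current.training_data, current.metrics)
print(registry.rollback_target().artifact_uri)
```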
Historical Context
The concept of a model registry emerged as ML teams grew beyond 5-10 engineers and started shipping models more frequently. Early MLflow (released by Databricks in 2018) introduced experiment and model tracking, but the registry concept, with lifecycle stages and governance, arrived later with the MLflow Model Registry, introduced in MLflow 1.4 (late 2019).
The parallel in software engineering is the package registry (npm, PyPI, Maven). Before those existed, sharing and versioning code was chaotic. The model registry applies the same discipline to ML artifacts.
Other tools followed: Weights & Biases Model Registry, Amazon SageMaker Model Registry, Vertex AI Model Registry, and DVC with Git-based model tracking. The concept is now universal - every major ML platform has one.
Core Concepts
The Model Lifecycle
Every model moves through a series of stages from creation to retirement. The canonical model lifecycle has four stages:
Development (None): The model has been registered but has not been reviewed. It may have come from any experiment. Not safe for production.
Staging: The model has passed initial evaluation and is in the process of validation - integration tests, shadow testing, business review. Not yet serving live traffic.
Production: The model has passed all gates and is actively serving predictions. There should be at most one or two models in this stage per use case.
Archived: The model is retired. It is kept for audit and reproducibility purposes but no longer serves traffic.
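The lifecycle can be treated as a small state machine. The sketch below encodes one possible transition policy; the allowed-transition table is an assumption for illustration (your registry tool may permit any transition and leave the policy to you), and `transition` is a hypothetical helper.

```python
# Assumed policy: which target stages are reachable from each stage.
ALLOWED_TRANSITIONS = {
    "None": {"Staging", "Archived"},
    "Staging": {"Production", "Archived"},
    "Production": {"Archived"},
    "Archived": {"Production"},  # only used for rollback
}


def transition(current: str, target: str) -> str:
    """Return the new stage, or raise if the move skips a gate."""
    if target not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition: {current} -> {target}")
    return target


stage = "None"
stage = transition(stage, "Staging")
stage = transition(stage, "Production")
print(stage)
```

Under this policy a model cannot jump from None straight to Production, which is the property the stages exist to enforce.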
Model as Artifact vs Model as Service
This is a distinction that matters for architecture:
Model as Artifact is the serialized file - the weights, the preprocessing pipeline, the hyperparameters. It lives in a file system or object store. It has no opinions about infrastructure.
Model as Service is the artifact deployed behind an API. It has a URL, SLAs, a deployment configuration, health checks, and scaling policies. It lives in Kubernetes or a cloud serving platform.
The model registry tracks the artifact. The deployment infrastructure manages the service. The registry is the bridge - it knows which artifact is powering which service.
Model Metadata
Metadata is what makes a model registry valuable beyond a simple file store. When you register a model, you attach structured information:
| Metadata Category | Examples |
|---|---|
| Performance metrics | AUC: 0.847, F1: 0.791, latency p99: 23ms |
| Training data | Dataset version, date range, row count, data hash |
| Code version | Git commit SHA, branch, repository URL |
| Hyperparameters | Learning rate, batch size, architecture choices |
| Environment | Python version, framework version, CUDA version |
| Evaluation results | Held-out test set metrics, subgroup metrics |
| Tags | team, use-case, compliance-approved, author |
This metadata lets you answer questions like: "Show me all models trained on data from before the pipeline bug we found last month" or "Which models are using the old feature encoding that we deprecated?"
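As a rough sketch of those two questions, assume the registry returns metadata records as plain dictionaries. The field names (`data_end`, `feature_encoding`) and the sample records are hypothetical; a real registry exposes its own schema and query API.

```python
from datetime import date

# Hypothetical metadata records, shaped as a registry query might return them.
records = [
    {"name": "fraud-detector", "version": 3, "data_end": date(2024, 1, 10), "feature_encoding": "v1"},
    {"name": "fraud-detector", "version": 4, "data_end": date(2024, 2, 20), "feature_encoding": "v2"},
    {"name": "churn-predictor", "version": 9, "data_end": date(2024, 1, 5), "feature_encoding": "v2"},
]


def trained_on_data_before(records, cutoff):
    """Models whose training data ends before the cutoff date."""
    return [(r["name"], r["version"]) for r in records if r["data_end"] < cutoff]


def using_encoding(records, encoding):
    """Models that still use a given feature encoding."""
    return [(r["name"], r["version"]) for r in records if r["feature_encoding"] == encoding]


before_bug_fix = trained_on_data_before(records, date(2024, 1, 15))
on_old_encoding = using_encoding(records, "v1")
print(before_bug_fix)
print(on_old_encoding)
```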
The Lineage Graph
Model lineage is the complete provenance chain from raw data to production predictions. It answers: "Where did this prediction come from?"
Full lineage is essential for:
- Debugging: "The model started degrading on Jan 20 - what changed?" You can trace back to data, code, and configuration.
- Compliance: "Prove this model was not trained on data from users who opted out." You need the data version, and the data version needs its own lineage.
- Impact analysis: "We found a bug in feature pipeline v2.3 - which models are affected?" You query the registry for all models that used that pipeline version.
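The impact-analysis query can be sketched as a traversal over a lineage graph. The graph below is a hypothetical example stored as a dictionary of upstream-to-downstream edges; the node naming scheme (`model:...`, `dataset:...`) is an assumption made for the illustration.

```python
from collections import deque

# Hypothetical lineage edges: each node maps to the things derived from it.
LINEAGE = {
    "raw/events": ["feature-pipeline:v2.3", "feature-pipeline:v2.4"],
    "feature-pipeline:v2.3": ["dataset:2024-01"],
    "feature-pipeline:v2.4": ["dataset:2024-02"],
    "dataset:2024-01": ["model:rec-model:v6", "model:rec-model:v7"],
    "dataset:2024-02": ["model:rec-model:v8"],
}


def downstream(node, graph):
    """Breadth-first search for everything derived from `node`."""
    seen = set()
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for child in graph.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


affected_models = sorted(
    n for n in downstream("feature-pipeline:v2.3", LINEAGE) if n.startswith("model:")
)
print(affected_models)
```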
Registry vs Artifact Store
These two things are often confused:
| Concept | What It Is | Examples |
|---|---|---|
| Artifact Store | Object storage for the actual model files (weights, pickles, etc.) | S3, GCS, Azure Blob, local filesystem |
| Model Registry | Database tracking versions, metadata, stages, and lineage | MLflow Registry, W&B Registry, SageMaker Registry |
The registry stores references to artifacts, not the artifacts themselves. A registry entry says: "Model rec-model version 7 is stored at s3://ml-artifacts/rec-model/v7/model.pkl and has these metrics." The file is in S3. The knowledge about the file is in the registry.
This separation matters because:
- Artifact storage is optimized for large binary files (cheap, durable)
- Registry storage is optimized for queries ("show me all models with AUC greater than 0.85")
- You can swap artifact backends without changing registry logic
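A small sketch of that separation, using an in-memory SQLite table as a stand-in for the registry database. The table layout and the sample rows are assumptions for illustration; the point is that the query runs against metadata, while the model files are only referenced by URI.

```python
import sqlite3
from urllib.parse import urlparse

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE model_versions (
        name TEXT,
        version INTEGER,
        stage TEXT,
        auc REAL,
        artifact_uri TEXT,          -- a reference, not the file itself
        PRIMARY KEY (name, version)
    )
    """
)
conn.executemany(
    "INSERT INTO model_versions VALUES (?, ?, ?, ?, ?)",
    [
        ("rec-model", 6, "Archived", 0.84, "s3://ml-artifacts/rec-model/v6/model.pkl"),
        ("rec-model", 7, "Production", 0.86, "s3://ml-artifacts/rec-model/v7/model.pkl"),
        ("fraud-detector", 3, "Staging", 0.91, "gs://ml-artifacts/fraud-detector/v3/model.pkl"),
    ],
)

# "Show me all models with AUC greater than 0.85"
high_auc = conn.execute(
    "SELECT name, version, artifact_uri FROM model_versions "
    "WHERE auc > 0.85 ORDER BY name, version"
).fetchall()

for name, version, uri in high_auc:
    # The URI scheme tells a loader which artifact backend to talk to.
    print(name, version, urlparse(uri).scheme)
```

Because the registry only holds URIs, moving a model from one object store to another changes a column value, not the query logic.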
Practical Implementation Concepts
Registering a Model
The basic flow is: train → evaluate → register → promote.
import mlflow
import mlflow.sklearn
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Simulate a training run
X, y = make_classification(n_samples=10000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

with mlflow.start_run() as run:
    # Train
    model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1)
    model.fit(X_train, y_train)

    # Evaluate
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    mlflow.log_metric("auc", auc)
    mlflow.log_param("n_estimators", 100)
    mlflow.log_param("learning_rate", 0.1)

    # Log the model
    mlflow.sklearn.log_model(model, "model")
    run_id = run.info.run_id

# Register the model in the registry
model_uri = f"runs:/{run_id}/model"
mv = mlflow.register_model(model_uri, "fraud-detector")
print(f"Registered: version {mv.version}")
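The last step of the flow, promote, could look roughly like the sketch below. It assumes the stage-based MLflow client API (`MlflowClient.transition_model_version_stage`, which MLflow has deprecated in favor of aliases since version 2.9). The helper name `promote_to_staging` and the 0.80 AUC threshold are hypothetical.

```python
def promote_to_staging(client, name: str, version: str, auc: float, min_auc: float = 0.80) -> bool:
    """Move a registered version to Staging only if it clears a metric gate.

    `client` is expected to be an mlflow.tracking.MlflowClient, or any object
    exposing the same transition_model_version_stage method.
    `min_auc` is a hypothetical threshold; choose one that fits the use case.
    """
    if auc < min_auc:
        return False  # the version stays in stage "None"
    client.transition_model_version_stage(name=name, version=version, stage="Staging")
    return True
```

With the objects from a registration run, this might be called as `promote_to_staging(MlflowClient(), "fraud-detector", mv.version, auc)`.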
Querying the Registry
The registry is a database, and you should query it programmatically:
from mlflow.tracking import MlflowClient

client = MlflowClient()

# Get all versions of a model
versions = client.search_model_versions("name='fraud-detector'")
for v in versions:
    print(f"Version {v.version}: stage={v.current_stage}, run={v.run_id}")

# Get the current production model
# (stages and get_latest_versions are deprecated since MLflow 2.9 in favor of aliases)
prod_versions = client.get_latest_versions("fraud-detector", stages=["Production"])
if prod_versions:
    prod_model = prod_versions[0]
    print(f"Production: v{prod_model.version}, run={prod_model.run_id}")

    # Get training metrics for the production model
    # (kept inside the if-block so it only runs when a production version exists)
    run = client.get_run(prod_model.run_id)
    print(f"Production AUC: {run.data.metrics['auc']}")
Model Naming Conventions
A registry with 50 models needs a naming convention. Common patterns:
# By team and use case
{team}-{use-case} # fraud-detector, rec-ranker
{domain}-{team}-{use-case} # payments-fraud-detector
# By environment prefix (less common - use stages instead)
prod-fraud-detector # anti-pattern: duplicates stage concept
# Recommended: flat names + stages
fraud-detector # versions 1-N, use stages for env
recommendation-ranker
churn-predictor
Keep model names stable. Stable means: don't encode the version, the date, or the environment in the name. That is what version numbers and stages are for. A good model name is a noun describing what it does, not when or how it was made.
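A naming rule like this is easy to check automatically. The following is a minimal sketch of such a check; the regular expressions and the list of forbidden tokens are assumptions chosen to match the examples in this section, not a standard.

```python
from __future__ import annotations

import re

# Assumed convention: lowercase words (letters and digits) separated by hyphens.
VALID_FORMAT = re.compile(r"[a-z][a-z0-9]*(-[a-z0-9]+)*")
# Assumed list of tokens that duplicate what stages already express.
ENV_OR_STATUS = {"prod", "production", "staging", "dev", "final", "good"}
# Tokens such as "v2", "3", or "2024" look like versions or dates.
VERSION_OR_DATE = re.compile(r"v?\d+")


def name_problems(name: str) -> list[str]:
    """Return a list of convention violations; an empty list means the name is fine."""
    problems = []
    if not VALID_FORMAT.fullmatch(name):
        problems.append("use lowercase words separated by hyphens")
    for token in re.split(r"[-_]", name.lower()):
        if token in ENV_OR_STATUS:
            problems.append(f"environment or status token: {token}")
        elif VERSION_OR_DATE.fullmatch(token):
            problems.append(f"version or date token: {token}")
    return problems


for candidate in ["fraud-detector", "prod-fraud-detector", "rec_model_v2_GOOD"]:
    print(candidate, name_problems(candidate))
```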
Production Engineering Notes
Registry as a Deployment Contract
The registry becomes a contract between the ML team and the serving infrastructure. The serving layer should never deploy a model that is not in the Production stage. This means:
- Any deployment automation reads the current Production model from the registry
- The only way to change what is in production is through the registry (not by uploading a file directly to S3)
- Automated rollback means: promote the previous version to Production
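On the serving side, this contract can be sketched as a reconcile step: compare what the registry says is in Production with what is deployed, and deploy only when they differ. The function below is a hypothetical illustration; `deploy` stands in for whatever your serving platform uses to roll out a version.

```python
from typing import Callable, Optional


def reconcile(
    registry_version: Optional[str],
    deployed_version: Optional[str],
    deploy: Callable[[str], None],
) -> Optional[str]:
    """Make the serving layer match the registry; return the version now deployed."""
    if registry_version is None:
        # Nothing is approved for production: leave the fleet untouched.
        return deployed_version
    if registry_version != deployed_version:
        deploy(registry_version)
    return registry_version


rollouts = []
now_serving = reconcile("7", "8", rollouts.append)  # registry says v7, fleet runs v8
print(now_serving, rollouts)
```

Because the serving layer only ever follows the registry, a promotion and a rollback look identical from its point of view.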
High-Availability Considerations
The model registry itself needs to be treated as production infrastructure:
- Backup: Registry metadata should be backed up separately from the artifact store
- HA mode: MLflow supports PostgreSQL as its backend store - run it with replication
- Read replicas: Read-only clients such as serving systems should hit read replicas, not the primary; registration and promotion are writes and still go to the primary
- Access control: Role-based access - data scientists can register, only CI/CD can promote to Production
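The access-control rule can be written down as a simple permission table. This is a sketch under assumptions: the role names and action strings are hypothetical, and letting only CI/CD promote to Staging is a choice made for the example (the rule above only fixes who may promote to Production).

```python
# Hypothetical role -> allowed actions mapping.
PERMISSIONS = {
    "data-scientist": {"register"},
    "ci-cd": {"register", "promote:Staging", "promote:Production"},
}


def is_allowed(role: str, action: str) -> bool:
    """Unknown roles get no permissions."""
    return action in PERMISSIONS.get(role, set())


print(is_allowed("data-scientist", "register"))
print(is_allowed("data-scientist", "promote:Production"))
```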
Registry at Scale
A large organization (50+ data scientists, 100+ models) needs additional structure:
# Namespace by team or domain
namespace: payments
- fraud-detector (v1-v47)
- authorization-scorer (v1-v12)
namespace: recommendations
- item-ranker (v1-v31)
- diversity-reranker (v1-v8)
Some registries support this natively. MLflow uses model names as a flat namespace - you simulate hierarchy with naming conventions.
Common Mistakes
Skipping the registry for "quick" deployments. This is how you end up with shadow models in production that no one knows about. Every model that runs in production must be registered - no exceptions. The two-minute shortcut costs you 45 minutes at 2am.
Storing the model file in the registry. The registry stores metadata and references. The actual model file goes in object storage. If you configure MLflow to use a local filesystem as its artifact store in a multi-node environment, each node sees only its own disk, so the registry entry can point at a path that does not exist on the node trying to load the model.
Not logging metadata at registration time. You cannot reliably reconstruct the training dataset version after the fact if you did not log it during the run. Log everything at training time - data version, feature pipeline version, environment details. Treat registration metadata as immutable once written.
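One way to guard against this mistake is to refuse registration when required metadata is missing. A minimal sketch, assuming metadata arrives as a dictionary of tags; the tag names in `REQUIRED_TAGS` are hypothetical and should match whatever your team actually logs.

```python
# Hypothetical set of tags every registered model must carry.
REQUIRED_TAGS = ("dataset_version", "feature_pipeline_version", "git_commit", "python_version")


def missing_tags(tags: dict) -> list:
    """Return the required tags that are absent or empty, in a fixed order."""
    return [key for key in REQUIRED_TAGS if not tags.get(key)]


run_tags = {"dataset_version": "2024-01", "git_commit": "abc123"}
gaps = missing_tags(run_tags)
if gaps:
    print(f"refusing to register, missing: {gaps}")
```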
Using model stages as environment names. Staging in the MLflow model registry does not mean "the staging environment." It means "approved for testing, not yet in production." Your deployment system maps stages to environments - the mapping is your choice. Do not conflate registry stages with infrastructure environments.
Interview Q&A
Q: What is a model registry and how does it differ from an artifact store?
A: A model registry is a metadata management system that tracks model versions, lifecycle stages, performance metrics, and lineage information. An artifact store is object storage (like S3) that holds the actual model files. The registry stores references to artifacts along with structured metadata that can be queried - for example, "show me all models with production AUC above 0.85 trained in the last 30 days." You need both: the artifact store for the binary data, the registry for the intelligence about that data.
Q: Walk me through the model lifecycle stages and what gates should exist between them.
A: The canonical stages are None (newly registered, unreviewed), Staging (approved for validation), Production (serving live traffic), and Archived (retired). The gates are:
- None → Staging: model passes automated evaluation gates (metrics above threshold, no regression vs. baseline), code review of training pipeline, data quality checks pass
- Staging → Production: integration tests pass, shadow testing shows consistent predictions, business stakeholder sign-off, latency SLA verified, potentially canary period completes
- Production → Archived: a newer version has been promoted to Production, deprecation period has elapsed, no rollback needed
Each gate should be automated where possible, with human approval required only for the Staging → Production transition.
Q: How would you design the rollback process for a model registry?
A: Rollback should be a one-command operation. The design is: (1) every version that has served in production is kept in the Archived stage, never deleted; (2) rollback is implemented as a stage transition: promote the target version from Archived back to Production and transition the current Production version to Archived; (3) the deployment system watches the registry for Production stage changes and automatically updates the serving fleet; (4) the whole operation should take less than 5 minutes. The key insight is that rollback is not a special operation. It is just a stage transition, the same as any other promotion.
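Step (2) of that design can be sketched as a pure function over a version-to-stage mapping. This is an illustration under assumptions, not a registry API: the `rollback` helper and the integer version keys are hypothetical.

```python
from __future__ import annotations


def rollback(stages: dict[int, str], target_version: int) -> dict[int, str]:
    """Swap stages: current Production -> Archived, target Archived -> Production."""
    if stages.get(target_version) != "Archived":
        raise ValueError(f"version {target_version} is not an archived version")
    updated = dict(stages)  # leave the caller's mapping untouched
    for version, stage in stages.items():
        if stage == "Production":
            updated[version] = "Archived"
    updated[target_version] = "Production"
    return updated


before = {5: "Archived", 6: "Archived", 7: "Production"}
after = rollback(before, target_version=6)
print(after)
```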
Q: What metadata is critical to log in a model registry and why?
A: Critical metadata: training dataset identifier/version (for lineage and compliance), data date range (for freshness reasoning), git commit SHA of training code (for reproducibility), all hyperparameters (for debugging), all evaluation metrics on the held-out test set (for comparison), framework and Python versions (for reproduction), and the run duration (for cost tracking). Secondary but important: feature pipeline version, feature set used, training compute used, author and team. The guiding principle is: what would I need to know to reproduce this model exactly, and what would I need to know to debug a production issue with this model?
Q: How does model lineage support GDPR compliance?
A: GDPR gives users the right to erasure - the right to have their data deleted. If a user exercises this right, you must be able to answer: "Was this user's data used to train any model currently in production?" Without lineage, you cannot answer this. With lineage, you trace: user's data → training dataset versions → model versions → production deployments. If a user's data touched a model that is in production, you have a compliance obligation - typically to retrain without that user's data or to document why retraining is infeasible. Full lineage from raw data through features to model versions is the audit trail that makes this answerable.
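The trace described in that answer can be sketched with three lookups. All of the data below is hypothetical, and a real system would query lineage stores rather than in-memory dictionaries, but the chain of joins is the same: user → dataset versions → model versions → production.

```python
# Hypothetical lineage data.
DATASET_USERS = {
    "clicks-2024-01": {"u1", "u2"},
    "clicks-2024-02": {"u2", "u3"},
}
MODEL_DATASETS = {
    ("rec-model", 6): ["clicks-2024-01"],
    ("rec-model", 7): ["clicks-2024-01", "clicks-2024-02"],
    ("rec-model", 8): ["clicks-2024-02"],
}
PRODUCTION = {("rec-model", 8)}


def models_trained_on_user(user_id):
    """Every model version whose training datasets contain the user."""
    datasets = {d for d, users in DATASET_USERS.items() if user_id in users}
    return sorted(m for m, used in MODEL_DATASETS.items() if datasets.intersection(used))


def production_exposure(user_id):
    """The subset of those model versions that is serving right now."""
    return [m for m in models_trained_on_user(user_id) if m in PRODUCTION]


print(models_trained_on_user("u1"), production_exposure("u1"))
print(models_trained_on_user("u3"), production_exposure("u3"))
```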
Summary
A model registry is not optional infrastructure for serious ML teams - it is the foundation of reliable model operations. It provides:
- A single source of truth for what is running in production
- Structured metadata that makes models queryable and debuggable
- Lifecycle management that creates clear governance checkpoints
- Lineage that satisfies both engineering and compliance requirements
- A fast rollback path when things go wrong
The difference between a 2am incident that takes 5 minutes to resolve and one that takes 45 minutes is entirely a model registry question.
