Model Registry Concepts

The 2am Rollback

It is 2:14am on a Tuesday. The on-call engineer's phone is ringing. They answer and immediately hear: "The recommendation model is broken. Revenue is down 30%. Roll it back."

Simple enough. Except - roll back to what? The team has been shipping model updates every few days for six months. The "previous" model lives somewhere in S3. The engineer opens the AWS console and starts searching. There are 340 objects in the ml-models/ bucket. Files are named things like rec_model_v2.pkl, rec_model_v2_retrained.pkl, rec_model_v2_GOOD.pkl, rec_model_FINAL_march.pkl. None of them have timestamps in the name. The actual S3 last-modified dates don't help because several files were copied between buckets.

Forty-five minutes pass. The engineer finds a likely candidate, deploys it, and the error rate drops. But is that the right model? Was it trained on good data? Did it pass evaluation? No one knows. Post-mortem tomorrow will be brutal.

This scenario plays out at ML teams everywhere. It is not a discipline problem. It is an infrastructure problem. The team had no model registry - no system that records which model is where, what version it is, what it was trained on, and whether it is safe to use in production. The cost of that missing infrastructure was 45 minutes of downtime and a very bad night.

The model registry is the system that turns this incident into a routine operation that takes minutes.


Why This Exists

The Problem Before

Before model registries became standard practice, ML teams managed models the same way they managed files: manually. A common pattern looked like this:

Train model → save to S3 or local disk
Name it something human-readable (hoping everyone follows the convention)
Post in Slack: "New model is ready, please deploy rec_model_v3_final.pkl"
Update a spreadsheet with performance metrics
Hope everyone agrees on which file is "production"

This breaks down in four specific ways:

1. No single source of truth. Three engineers have three different answers for "what model is in production." Each is partially correct.

2. No metadata. The file knows nothing about itself. You cannot ask a .pkl file what data it was trained on, what its validation AUC was, or whether it passed fairness checks.

3. No lifecycle management. There is no official way to say "this model has been approved for production" versus "this is experimental." The distinction lives in people's heads.

4. No rollback path. When something goes wrong, finding the last known-good model is manual forensics - not an operation.

What a Model Registry Solves

A model registry is a centralized service that stores, versions, and tracks metadata for ML models throughout their lifecycle. It answers four critical questions at any moment:

What is running in production right now?
What version is it, and what is the full history of versions?
What was this model trained on, and what were its metrics?
What is the safe rollback target?

Historical Context

The concept of a model registry emerged as ML teams grew beyond 5-10 engineers and started shipping models more frequently. MLflow (open-sourced by Databricks in 2018) initially focused on experiment tracking and model packaging; the Model Registry, with lifecycle stages and governance, was added in MLflow 1.4 in late 2019.

The parallel in software engineering is the package registry (npm, PyPI, Maven). Before those existed, sharing and versioning code was chaotic. The model registry applies the same discipline to ML artifacts.

Other tools followed: Weights & Biases Model Registry, Amazon SageMaker Model Registry, Vertex AI Model Registry, and DVC with Git-based model tracking. The concept is now universal - every major ML platform has one.

Core Concepts

The Model Lifecycle

Every model moves through a series of stages from creation to retirement. The canonical model lifecycle has four stages:

Development (None): The model has been registered but has not been reviewed. It may have come from any experiment. Not safe for production.

Staging: The model has passed initial evaluation and is in the process of validation - integration tests, shadow testing, business review. Not yet serving live traffic.

Production: The model has passed all gates and is actively serving predictions. There should be at most one or two models in this stage per use case.

Archived: The model is retired. It is kept for audit and reproducibility purposes but no longer serves traffic.
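
The stages above can be treated as a small state machine. The sketch below is one hypothetical policy, not something a registry enforces for you: the `ALLOWED_TRANSITIONS` table and the helper names are assumptions made for illustration. It blocks a jump straight from None to Production and keeps Archived → Production open as the rollback path.

```python
from enum import Enum


class Stage(Enum):
    NONE = "None"
    STAGING = "Staging"
    PRODUCTION = "Production"
    ARCHIVED = "Archived"


# Hypothetical policy: which stage changes a team chooses to allow.
ALLOWED_TRANSITIONS = {
    Stage.NONE: {Stage.STAGING, Stage.ARCHIVED},
    Stage.STAGING: {Stage.PRODUCTION, Stage.ARCHIVED},
    Stage.PRODUCTION: {Stage.ARCHIVED},
    Stage.ARCHIVED: {Stage.PRODUCTION},  # rollback path
}


def can_transition(current: Stage, target: Stage) -> bool:
    """Return True if the policy allows moving from current to target."""
    return target in ALLOWED_TRANSITIONS[current]


def transition(current: Stage, target: Stage) -> Stage:
    """Return the new stage, or raise if the move would skip a gate."""
    if not can_transition(current, target):
        raise ValueError(f"Illegal transition: {current.value} -> {target.value}")
    return target


print(can_transition(Stage.NONE, Stage.PRODUCTION))  # skipping Staging is rejected
print(transition(Stage.STAGING, Stage.PRODUCTION).value)
```

Keeping the policy in a single table makes it easy to review and to tighten later, for example by removing the Archived → Production edge if rollbacks must go through Staging.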

Model as Artifact vs Model as Service

This is a distinction that matters for architecture:

Model as Artifact is the serialized file - the weights, the preprocessing pipeline, the hyperparameters. It lives in a file system or object store. It has no opinions about infrastructure.

Model as Service is the artifact deployed behind an API. It has a URL, SLAs, a deployment configuration, health checks, and scaling policies. It lives in Kubernetes or a cloud serving platform.

The model registry tracks the artifact. The deployment infrastructure manages the service. The registry is the bridge - it knows which artifact is powering which service.

Model Metadata

Metadata is what makes a model registry valuable beyond a simple file store. When you register a model, you attach structured information:

Metadata Category	Examples
Performance metrics	AUC: 0.847, F1: 0.791, latency p99: 23ms
Training data	Dataset version, date range, row count, data hash
Code version	Git commit SHA, branch, repository URL
Hyperparameters	Learning rate, batch size, architecture choices
Environment	Python version, framework version, CUDA version
Evaluation results	Held-out test set metrics, subgroup metrics
Tags	team, use-case, compliance-approved, author

This metadata lets you answer questions like: "Show me all models trained on data from before the pipeline bug we found last month" or "Which models are using the old feature encoding that we deprecated?"
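
As a rough illustration of what "structured" means here, the sketch below captures a few of the categories from the table in a frozen dataclass and then filters on a tag. The class name, field names, tag keys, and sample values are all assumptions made for this example; a real registry would persist the same information in its own schema.

```python
import hashlib
import platform
from dataclasses import asdict, dataclass, field
from typing import Dict, Iterable


def data_fingerprint(rows: Iterable[bytes]) -> str:
    """SHA-256 over the serialized rows, used as a simple data hash."""
    digest = hashlib.sha256()
    for row in rows:
        digest.update(row)
    return digest.hexdigest()


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class ModelMetadata:
    git_sha: str
    dataset_version: str
    data_hash: str
    metrics: Dict[str, float]
    hyperparameters: Dict[str, float]
    python_version: str = field(default_factory=platform.python_version)
    tags: Dict[str, str] = field(default_factory=dict)


rows = [b"user_1,0.3,1\n", b"user_2,0.9,0\n"]  # toy stand-in for a training set
meta = ModelMetadata(
    git_sha="<commit-sha>",  # placeholder, fill in from your VCS
    dataset_version="train-v14",  # hypothetical dataset version label
    data_hash=data_fingerprint(rows),
    metrics={"auc": 0.847, "f1": 0.791},
    hyperparameters={"learning_rate": 0.1, "n_estimators": 100},
    tags={"team": "payments", "feature_encoding": "v1"},
)

# "Which models are using the old feature encoding?" becomes a filter.
catalog = [meta]
uses_old_encoding = [m for m in catalog if m.tags.get("feature_encoding") == "v1"]
print(asdict(meta)["dataset_version"], len(uses_old_encoding))
```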

The Lineage Graph

Model lineage is the complete provenance chain from raw data to production predictions. It answers: "Where did this prediction come from?"

Full lineage is essential for:

Debugging: "The model started degrading on Jan 20 - what changed?" You can trace back to data, code, and configuration.
Compliance: "Prove this model was not trained on data from users who opted out." You need the data version, and the data version needs its own lineage.
Impact analysis: "We found a bug in feature pipeline v2.3 - which models are affected?" You query the registry for all models that used that pipeline version.
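
The impact-analysis question can be sketched as a graph traversal. The example below assumes lineage is available as a list of "was used to produce" edges; the node names and the `model:` prefix convention are invented for illustration and do not come from any particular registry.

```python
from collections import defaultdict, deque

# Edges point downstream: (source, thing produced from it). All names are made up.
EDGES = [
    ("raw-events-2024-01", "feature-pipeline:v2.3"),
    ("feature-pipeline:v2.3", "dataset:train-v14"),
    ("feature-pipeline:v2.4", "dataset:train-v15"),
    ("dataset:train-v14", "model:rec-ranker/6"),
    ("dataset:train-v14", "model:churn-predictor/3"),
    ("dataset:train-v15", "model:rec-ranker/7"),
]


def downstream(edges, start):
    """Breadth-first search for every node reachable from start."""
    graph = defaultdict(list)
    for src, dst in edges:
        graph[src].append(dst)
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def affected_models(edges, start):
    """Model versions that depend, directly or indirectly, on start."""
    return sorted(n for n in downstream(edges, start) if n.startswith("model:"))


print(affected_models(EDGES, "feature-pipeline:v2.3"))
```

Reversing the edge direction gives the debugging and compliance queries: start from a model version and walk upstream to the datasets and pipelines it came from.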

Registry vs Artifact Store

These two things are often confused:

Concept	What It Is	Examples
Artifact Store	Object storage for the actual model files (weights, pickles, etc.)	S3, GCS, Azure Blob, local filesystem
Model Registry	Database tracking versions, metadata, stages, and lineage	MLflow Registry, W&B Registry, SageMaker Registry

The registry stores references to artifacts, not the artifacts themselves. A registry entry says: "Model rec-model version 7 is stored at s3://ml-artifacts/rec-model/v7/model.pkl and has these metrics." The file is in S3. The knowledge about the file is in the registry.

This separation matters because:

Artifact storage is optimized for large binary files (cheap, durable)
Registry storage is optimized for queries ("show me all models with AUC greater than 0.85")
You can swap artifact backends without changing registry logic
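
To make the split concrete, here is a minimal sketch that uses an in-memory SQLite table as a stand-in for the registry database. The table layout and the metric values are assumptions for illustration; the point is that each row holds a URI pointing at the artifact store rather than the model bytes, and that metric queries are ordinary SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE model_versions (
        name         TEXT    NOT NULL,
        version      INTEGER NOT NULL,
        stage        TEXT    NOT NULL,
        artifact_uri TEXT    NOT NULL,  -- reference only; the file lives in S3
        auc          REAL,
        PRIMARY KEY (name, version)
    )
    """
)

# Illustrative rows: stages and AUC values are invented.
rows = [
    ("rec-model", 5, "Archived", "s3://ml-artifacts/rec-model/v5/model.pkl", 0.812),
    ("rec-model", 6, "Archived", "s3://ml-artifacts/rec-model/v6/model.pkl", 0.853),
    ("rec-model", 7, "Production", "s3://ml-artifacts/rec-model/v7/model.pkl", 0.861),
]
conn.executemany("INSERT INTO model_versions VALUES (?, ?, ?, ?, ?)", rows)

# "Show me all models with AUC greater than 0.85"
good = conn.execute(
    "SELECT version, artifact_uri FROM model_versions WHERE auc > ? ORDER BY version",
    (0.85,),
).fetchall()
for version, uri in good:
    print(version, uri)
```

Moving the artifacts to another backend would only change the `artifact_uri` values; the query logic stays the same.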

Practical Implementation Concepts

Registering a Model

The basic flow is: train → evaluate → register → promote.

import mlflow
import mlflow.sklearn
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Simulate a training run
X, y = make_classification(n_samples=10000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

with mlflow.start_run() as run:
    # Train
    model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1)
    model.fit(X_train, y_train)

    # Evaluate
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    mlflow.log_metric("auc", auc)
    mlflow.log_param("n_estimators", 100)
    mlflow.log_param("learning_rate", 0.1)

    # Log the model
    mlflow.sklearn.log_model(model, "model")

    run_id = run.info.run_id

# Register the model in the registry
model_uri = f"runs:/{run_id}/model"
mv = mlflow.register_model(model_uri, "fraud-detector")
print(f"Registered: version {mv.version}")

Querying the Registry

The registry is a database, and you should query it programmatically:

from mlflow.tracking import MlflowClient

client = MlflowClient()

# Get all versions of a model
versions = client.search_model_versions("name='fraud-detector'")
for v in versions:
    print(f"Version {v.version}: stage={v.current_stage}, run={v.run_id}")

# Get the current production model
# (stage-based lookup; MLflow 2.9+ deprecates stages in favor of aliases and tags)
prod_versions = client.get_latest_versions("fraud-detector", stages=["Production"])
if prod_versions:
    prod_model = prod_versions[0]
    print(f"Production: v{prod_model.version}, run={prod_model.run_id}")

    # Get training metrics for the production model
    run = client.get_run(prod_model.run_id)
    print(f"Production AUC: {run.data.metrics['auc']}")
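
The last step of the train → evaluate → register → promote flow could look roughly like the sketch below. It calls `get_model_version` and `transition_model_version_stage`, which are methods on `MlflowClient` in stage-based MLflow versions; the `promote_to_production` helper and the `FakeClient` stand-in are assumptions added so the example runs without a tracking server. In real use you would pass `MlflowClient()` instead, and on MLflow 2.9+ you would likely prefer aliases over stages.

```python
from types import SimpleNamespace


def promote_to_production(client, name: str, version: str) -> None:
    """Promote a Staging version to Production and archive the previous one."""
    mv = client.get_model_version(name=name, version=version)
    if mv.current_stage != "Staging":
        raise RuntimeError(
            f"{name} v{version} is in stage {mv.current_stage!r}, expected 'Staging'"
        )
    client.transition_model_version_stage(
        name=name,
        version=version,
        stage="Production",
        archive_existing_versions=True,  # old Production version moves to Archived
    )


class FakeClient:
    """Hypothetical stand-in that mimics the two MlflowClient calls used above."""

    def __init__(self, stages):
        self.stages = dict(stages)  # version -> stage

    def get_model_version(self, name, version):
        return SimpleNamespace(
            name=name, version=version, current_stage=self.stages[version]
        )

    def transition_model_version_stage(
        self, name, version, stage, archive_existing_versions=False
    ):
        if archive_existing_versions:
            for v, s in self.stages.items():
                if s == stage:
                    self.stages[v] = "Archived"
        self.stages[version] = stage


client = FakeClient({"6": "Production", "7": "Staging"})
promote_to_production(client, "fraud-detector", "7")
print(client.stages)
```

Checking the current stage before promoting is a design choice in this sketch: it prevents an unreviewed version from jumping straight to Production.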

Model Naming Conventions

A registry with 50 models needs a naming convention. Common patterns:

# By team and use case
{team}-{use-case}                      # fraud-detector, rec-ranker
{domain}-{team}-{use-case}             # payments-fraud-detector

# By environment prefix (less common - use stages instead)
prod-fraud-detector                    # anti-pattern: duplicates stage concept

# Recommended: flat names + stages
fraud-detector                         # versions 1-N, use stages for env
recommendation-ranker
churn-predictor

tip

Keep model names stable. Stable means: don't encode the version, the date, or the environment in the name. That is what version numbers and stages are for. A good model name is a noun describing what the model does, not when or how it was made.
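
A convention like this is easier to keep when it is checked automatically, for example in CI before registration. The sketch below is one possible check; the regular expressions and the token lists are assumptions and would need tuning to your own naming rules.

```python
import re
from typing import List

# Lowercase words separated by single hyphens, e.g. "fraud-detector".
NAME_PATTERN = re.compile(r"^[a-z][a-z0-9]*(-[a-z0-9]+)*$")
# Hypothetical deny-lists: tokens that belong in stages or version numbers.
ENV_TOKENS = {"prod", "production", "staging", "dev", "test"}
VERSION_TOKEN = re.compile(r"^(v\d+|final|\d{4,8})$")


def validate_model_name(name: str) -> List[str]:
    """Return a list of problems; an empty list means the name is acceptable."""
    problems = []
    if not NAME_PATTERN.match(name):
        problems.append("use lowercase words separated by hyphens")
    for token in re.split(r"[-_]", name.lower()):
        if token in ENV_TOKENS:
            problems.append(f"environment token '{token}' belongs in a stage")
        elif VERSION_TOKEN.match(token):
            problems.append(f"token '{token}' belongs in the version number")
    return problems


for candidate in ["fraud-detector", "prod-fraud-detector", "rec_model_v2_GOOD"]:
    print(candidate, validate_model_name(candidate))
```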

Production Engineering Notes

Registry as a Deployment Contract

The registry becomes a contract between the ML team and the serving infrastructure. The serving layer should never deploy a model that is not in the Production stage. This means:

Any deployment automation reads the current Production model from the registry
The only way to change what is in production is through the registry (not by uploading a file directly to S3)
Automated rollback means: promote the previous version to Production
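
One way to picture this contract is a reconcile step that deployment automation runs on a schedule or on a registry event. The sketch below is a simplified assumption of how that could work: `VersionRecord`, `desired_production`, and `reconcile` are hypothetical names, and the registry is represented as a plain list of records.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class VersionRecord:
    version: int
    stage: str
    artifact_uri: str


def desired_production(records: List[VersionRecord]) -> Optional[VersionRecord]:
    """The registry's answer to 'what should be serving?' (highest Production version)."""
    prod = [r for r in records if r.stage == "Production"]
    return max(prod, key=lambda r: r.version) if prod else None


def reconcile(records: List[VersionRecord], serving_uri: Optional[str]) -> Optional[str]:
    """Return the artifact URI to deploy, or None if no change is needed.

    If the registry has no Production version, this sketch leaves the serving
    fleet untouched rather than guessing.
    """
    target = desired_production(records)
    if target is None or target.artifact_uri == serving_uri:
        return None
    return target.artifact_uri


records = [
    VersionRecord(6, "Archived", "s3://ml-artifacts/rec-model/v6/model.pkl"),
    VersionRecord(7, "Production", "s3://ml-artifacts/rec-model/v7/model.pkl"),
]
print(reconcile(records, "s3://ml-artifacts/rec-model/v6/model.pkl"))
```

Because the serving layer only ever follows the registry, a promotion and a rollback look identical from its point of view: the Production pointer moved, so it redeploys.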

High-Availability Considerations

The model registry itself needs to be treated as production infrastructure:

Backup: Registry metadata should be backed up separately from the artifact store
HA mode: MLflow supports PostgreSQL as its backend store - run it with replication
Read replicas: Training jobs and serving systems should hit read replicas, not the primary
Access control: Role-based access - data scientists can register, only CI/CD can promote to Production

Registry at Scale

A large organization (50+ data scientists, 100+ models) needs additional structure:

# Namespace by team or domain
namespace: payments
  - fraud-detector (v1-v47)
  - authorization-scorer (v1-v12)

namespace: recommendations
  - item-ranker (v1-v31)
  - diversity-reranker (v1-v8)

Some registries support this natively. MLflow uses model names as a flat namespace - you simulate hierarchy with naming conventions.
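
With a flat namespace, the hierarchy has to be recovered from the names themselves. The sketch below assumes a `namespace.model-name` convention with a dot as the separator; that separator is an assumption for illustration, so check which characters your registry accepts in model names before adopting it.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_by_namespace(model_names: Iterable[str], separator: str = ".") -> Dict[str, List[str]]:
    """Group flat model names by the prefix before the first separator."""
    groups = defaultdict(list)
    for name in model_names:
        namespace, sep, short = name.partition(separator)
        if not sep:  # no separator found: file the model under "default"
            namespace, short = "default", name
        groups[namespace].append(short)
    return {ns: sorted(names) for ns, names in sorted(groups.items())}


flat_names = [
    "payments.fraud-detector",
    "payments.authorization-scorer",
    "recommendations.item-ranker",
    "recommendations.diversity-reranker",
    "churn-predictor",
]
for namespace, names in group_by_namespace(flat_names).items():
    print(namespace, names)
```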

Common Mistakes

danger

Skipping the registry for "quick" deployments. This is how you end up with shadow models in production that no one knows about. Every model that runs in production must be registered - no exceptions. The two-minute shortcut costs you 45 minutes at 2am.

danger

Storing the model file in the registry. The registry stores metadata and references. The actual model file goes in object storage. If you configure MLflow to use a local filesystem as its artifact store in a multi-node environment, each node sees only its own filesystem, so an artifact path recorded by one node can be missing or stale on another - and nothing warns you until a load fails.

warning

Not logging metadata at registration time. If the training dataset version was not captured during the run, there is often no reliable way to reconstruct it later. Log everything at training time - data version, feature pipeline version, environment details. Treat registration metadata as immutable once written.

warning

Using model stages as environment names. Staging in the MLflow model registry does not mean "the staging environment." It means "approved for testing, not yet in production." Your deployment system maps stages to environments - the mapping is your choice. Do not conflate registry stages with infrastructure environments.

Interview Q&A

Q: What is a model registry and how does it differ from an artifact store?

A: A model registry is a metadata management system that tracks model versions, lifecycle stages, performance metrics, and lineage information. An artifact store is object storage (like S3) that holds the actual model files. The registry stores references to artifacts along with structured metadata that can be queried - for example, "show me all models with production AUC above 0.85 trained in the last 30 days." You need both: the artifact store for the binary data, the registry for the intelligence about that data.

Q: Walk me through the model lifecycle stages and what gates should exist between them.

A: The canonical stages are None (newly registered, unreviewed), Staging (approved for validation), Production (serving live traffic), and Archived (retired). The gates are:

None → Staging: model passes automated evaluation gates (metrics above threshold, no regression vs. baseline), code review of training pipeline, data quality checks pass
Staging → Production: integration tests pass, shadow testing shows consistent predictions, business stakeholder sign-off, latency SLA verified, potentially canary period completes
Production → Archived: a newer version has been promoted to Production, deprecation period has elapsed, no rollback needed

Each gate should be automated where possible, with mandatory human approval reserved for the Staging → Production transition.
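
The automated part of the None → Staging gate could be sketched as a pure function. The function name, the `max_regression` tolerance, and the sample numbers are assumptions for illustration, and the sketch assumes every metric is "higher is better".

```python
from typing import Dict, List, Tuple


def passes_evaluation_gate(
    candidate: Dict[str, float],
    baseline: Dict[str, float],
    thresholds: Dict[str, float],
    max_regression: float = 0.01,
) -> Tuple[bool, List[str]]:
    """Check absolute thresholds and regression against the baseline.

    Returns (passed, reasons); reasons is empty when the gate passes.
    """
    reasons = []
    for metric, minimum in thresholds.items():
        value = candidate.get(metric)
        if value is None:
            reasons.append(f"{metric}: missing from candidate metrics")
        elif value < minimum:
            reasons.append(f"{metric}: {value:.3f} is below threshold {minimum:.3f}")
    for metric, base_value in baseline.items():
        value = candidate.get(metric)
        if value is not None and base_value - value > max_regression:
            reasons.append(f"{metric}: regressed from {base_value:.3f} to {value:.3f}")
    return (not reasons, reasons)


baseline = {"auc": 0.847}      # metrics of the current production model
thresholds = {"auc": 0.80}     # absolute floor for any candidate
print(passes_evaluation_gate({"auc": 0.852}, baseline, thresholds))
print(passes_evaluation_gate({"auc": 0.830}, baseline, thresholds))
```

Returning the reasons alongside the boolean lets CI post a readable explanation when a candidate is rejected.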

Q: How would you design the rollback process for a model registry?

A: Rollback should be a one-command operation. The design is: (1) every version that has served in production is moved to Archived when it is replaced, never deleted; (2) rollback is implemented as a stage transition - promote the target version from Archived back to Production and transition the current Production version to Archived; (3) the deployment system watches the registry for Production stage changes and automatically updates the serving fleet; (4) the whole operation should take less than 5 minutes. The key insight is that rollback is not a special operation - it is just a stage transition, the same as any other promotion.
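
The "rollback is just a stage transition" idea can be sketched in a few lines. The example assumes the registry state for one model is a simple mapping of version number to stage; the `rollback` function is hypothetical and returns a new mapping rather than mutating its input.

```python
from typing import Dict


def rollback(stages: Dict[int, str], target: int) -> Dict[int, str]:
    """Swap Production back to an archived version.

    The current Production version becomes Archived and the target becomes
    Production. Only an Archived version is accepted as a rollback target.
    """
    if stages.get(target) != "Archived":
        raise ValueError(f"version {target} is not an Archived version")
    updated = dict(stages)
    for version, stage in stages.items():
        if stage == "Production":
            updated[version] = "Archived"
    updated[target] = "Production"
    return updated


before = {5: "Archived", 6: "Archived", 7: "Production"}
after = rollback(before, target=6)
print(after)
```

Nothing is deleted in the process, so version 7 stays available for investigation and can be promoted again once the problem is understood.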

Q: What metadata is critical to log in a model registry and why?

A: Critical metadata: training dataset identifier/version (for lineage and compliance), data date range (for freshness reasoning), git commit SHA of training code (for reproducibility), all hyperparameters (for debugging), all evaluation metrics on the held-out test set (for comparison), framework and Python versions (for reproduction), and the run duration (for cost tracking). Secondary but important: feature pipeline version, feature set used, training compute used, author and team. The guiding principle is: what would I need to know to reproduce this model exactly, and what would I need to know to debug a production issue with this model?

Q: How does model lineage support GDPR compliance?

A: GDPR gives users the right to erasure - the right to have their data deleted. If a user exercises this right, you must be able to answer: "Was this user's data used to train any model currently in production?" Without lineage, you cannot answer this. With lineage, you trace: user's data → training dataset versions → model versions → production deployments. If a user's data touched a model that is in production, you have a compliance obligation - typically to retrain without that user's data or to document why retraining is infeasible. Full lineage from raw data through features to model versions is the audit trail that makes this answerable.

Summary

A model registry is not optional infrastructure for serious ML teams - it is the foundation of reliable model operations. It provides:

A single source of truth for what is running in production
Structured metadata that makes models queryable and debuggable
Lifecycle management that creates clear governance checkpoints
Lineage that satisfies both engineering and compliance requirements
A fast rollback path when things go wrong

The difference between a 2am incident that is resolved in 5 minutes and one that takes 45 is largely a model registry question.
