Model Versioning Strategies
When a Version Change Breaks Everything
It is a Monday morning and your phone starts lighting up. The team that owns the product recommendation service is furious. Over the weekend, the ML team released a new version of the recommendation model. The model itself is better - offline metrics improved 8%. But the output format changed.
The previous model returned a list of integers (item IDs). The new model returns a list of dictionaries: [{"id": 1234, "score": 0.87}, {"id": 5678, "score": 0.73}]. The downstream service, written six months ago, was never told about this change. It reads item_ids[0] expecting an integer and now gets the dictionary {"id": 1234, "score": 0.87}. The first operation that treats that value as an item ID fails: in Python, for example, indexing the dictionary with [0] raises a KeyError, and arithmetic on it raises a TypeError.
This is a versioning problem. Not "model registry" versioning - interface versioning. The model team changed the contract with its consumers without following a versioning policy that would have warned them. The downstream team had no way to know a breaking change was coming.
The problem also runs deeper. If you need to roll back - which model do you roll back to? How do you know the "previous" model had the old output format? Is there a version number you can look up, or are you hoping the S3 filename gives you a clue?
Model versioning is not just about incrementing numbers. It is about communicating the nature of changes, enforcing compatibility rules, and making the history of decisions traceable.
Why Versioning for ML Is Different from Software
Software versioning is well understood. Semantic versioning (SemVer) says: increment MAJOR when you make incompatible API changes, MINOR when you add backward-compatible functionality, PATCH when you fix bugs. Consumers know that upgrading from 2.3.1 to 2.4.0 is safe. Upgrading to 3.0.0 requires review.
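As a rough illustration of how a consumer might act on these rules, the sketch below classifies an upgrade by the first segment that differs. The function names `classify_upgrade` and `needs_review` are hypothetical, and the sketch assumes plain MAJOR.MINOR.PATCH strings without pre-release or build suffixes.

```python
def classify_upgrade(current: str, target: str) -> str:
    """Return the highest SemVer segment that differs between two versions."""
    cur = tuple(int(p) for p in current.split("."))
    tgt = tuple(int(p) for p in target.split("."))
    if len(cur) != 3 or len(tgt) != 3:
        raise ValueError("expected MAJOR.MINOR.PATCH")
    if tgt[0] != cur[0]:
        return "major"
    if tgt[1] != cur[1]:
        return "minor"
    if tgt[2] != cur[2]:
        return "patch"
    return "none"


def needs_review(current: str, target: str) -> bool:
    """Under SemVer, only a MAJOR change signals an incompatible API change."""
    return classify_upgrade(current, target) == "major"


print(classify_upgrade("2.3.1", "2.4.0"))  # minor
print(needs_review("2.3.1", "3.0.0"))      # True
```

A deployment script could call `needs_review` before an automatic upgrade and stop for a human when it returns True.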
ML models have an additional dimension: the model itself changes even when the code does not. Retraining on new data produces a new artifact. The interface might be identical but the behavior is different. The prediction for a given input may shift. This is a change consumers care about - it may require their downstream logic to be re-validated.
Four categories of change exist in ML models:
| Category | Examples | Interface Change? | Behavior Change? |
|---|---|---|---|
| Code change | New architecture, new features, bug fix | Sometimes | Yes |
| Data change | New training data, fixed data pipeline | No | Yes |
| Hyperparameter change | Learning rate, depth, regularization | No | Yes |
| Breaking interface change | New output schema, different input format | Yes | Yes |
A good versioning strategy must handle all four categories.
Semantic Versioning for Models
Adapting SemVer for ML
SemVer can be adapted for models with clear rules about what triggers each segment:
{MAJOR}.{MINOR}.{PATCH}
MAJOR - breaking interface change
- Input schema changed (new required field, removed field)
- Output schema changed (new fields, removed fields, type changes)
- Model API changed (different prediction endpoint)
- Fundamental architectural change (e.g., single-task to multi-task)
MINOR - significant behavior change, backward-compatible interface
- New training data (different distribution, extended time range)
- New features added (backward-compatible input - new optional fields)
- Architecture improvement (same interface, meaningfully different predictions)
- Major retraining after known data quality fix
PATCH - minor improvement, same interface, minimal behavior change
- Hyperparameter tuning without architecture change
- Same data with bug fix in preprocessing
- Regularization adjustment
- Threshold calibration update
# Version tracking in MLflow tags
from dataclasses import dataclass
from enum import Enum

import mlflow


class VersionBump(Enum):
    MAJOR = "major"
    MINOR = "minor"
    PATCH = "patch"


@dataclass
class ModelVersion:
    major: int
    minor: int
    patch: int

    def __str__(self) -> str:
        return f"{self.major}.{self.minor}.{self.patch}"

    def bump(self, bump_type: VersionBump) -> "ModelVersion":
        # Bumping a segment resets every lower segment to zero
        if bump_type == VersionBump.MAJOR:
            return ModelVersion(self.major + 1, 0, 0)
        elif bump_type == VersionBump.MINOR:
            return ModelVersion(self.major, self.minor + 1, 0)
        else:
            return ModelVersion(self.major, self.minor, self.patch + 1)

    @classmethod
    def from_string(cls, version_str: str) -> "ModelVersion":
        parts = version_str.split(".")
        return cls(int(parts[0]), int(parts[1]), int(parts[2]))


def register_model_with_version(
    run_id: str,
    model_name: str,
    semantic_version: str,
    bump_type: VersionBump,
    change_description: str,
    breaking_changes: list[str] | None = None,
) -> str:
    """Register a model with semantic version metadata."""
    model_uri = f"runs:/{run_id}/model"
    mv = mlflow.register_model(model_uri, model_name)
    client = mlflow.tracking.MlflowClient()

    # Set semantic version
    client.set_model_version_tag(
        name=model_name,
        version=mv.version,
        key="semantic_version",
        value=semantic_version,
    )
    # Record which kind of bump produced this version
    client.set_model_version_tag(
        name=model_name,
        version=mv.version,
        key="version_bump_type",
        value=bump_type.value,
    )
    if breaking_changes:
        client.set_model_version_tag(
            name=model_name,
            version=mv.version,
            key="breaking_changes",
            value="; ".join(breaking_changes),
        )
    client.update_model_version(
        name=model_name,
        version=mv.version,
        description=f"v{semantic_version} ({bump_type.value}): {change_description}",
    )
    return mv.version
Version Triggers
Automated Version Bump Detection
You should automate the detection of what kind of version bump is appropriate:
import hashlib
import json


def compute_schema_hash(schema: dict) -> str:
    """Compute a hash of the model schema for change detection."""
    # sort_keys makes the hash independent of dictionary key order
    canonical = json.dumps(schema, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]


def determine_version_bump(
    current_version: str,
    current_input_schema: dict,
    current_output_schema: dict,
    new_input_schema: dict,
    new_output_schema: dict,
    data_version_changed: bool,
    architecture_changed: bool,
) -> tuple[VersionBump, list[str]]:
    """Determine the appropriate version bump type and reasons."""
    breaking_changes = []

    # Check for breaking schema changes
    input_breaking = detect_breaking_schema_changes(
        current_input_schema, new_input_schema
    )
    output_breaking = detect_breaking_schema_changes(
        current_output_schema, new_output_schema
    )
    if input_breaking or output_breaking:
        if input_breaking:
            breaking_changes.extend([f"Input schema: {c}" for c in input_breaking])
        if output_breaking:
            breaking_changes.extend([f"Output schema: {c}" for c in output_breaking])
        return VersionBump.MAJOR, breaking_changes

    # Check for significant behavioral changes
    if data_version_changed or architecture_changed:
        reasons = []
        if data_version_changed:
            reasons.append("training data version changed")
        if architecture_changed:
            reasons.append("model architecture changed")
        return VersionBump.MINOR, reasons

    # Minor improvements
    return VersionBump.PATCH, ["hyperparameters or minor tuning only"]


def detect_breaking_schema_changes(old_schema: dict, new_schema: dict) -> list[str]:
    """Detect breaking changes between two schemas (field name -> type)."""
    changes = []
    old_fields = set(old_schema.keys())
    new_fields = set(new_schema.keys())

    # Removed fields are always breaking
    removed = old_fields - new_fields
    for field in sorted(removed):
        changes.append(f"field '{field}' removed")

    # Type changes are breaking
    for field in sorted(old_fields & new_fields):
        if old_schema[field] != new_schema[field]:
            changes.append(
                f"field '{field}' type changed from {old_schema[field]} to {new_schema[field]}"
            )

    # Added fields are not flagged here. A new *required* input field is also
    # breaking, but detecting it needs a schema that records whether a field is required.
    return changes
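The MAJOR rules above list a new required input field as breaking, which a check limited to removed fields and type changes does not cover. A minimal sketch of that extra check follows. It assumes a hypothetical schema layout in which each field maps to a dict with a `required` flag; the name `detect_new_required_fields` is illustrative and not part of any library.

```python
def detect_new_required_fields(old_schema: dict, new_schema: dict) -> list[str]:
    """Flag fields that are new in new_schema and must be supplied by callers."""
    changes = []
    for field in sorted(set(new_schema) - set(old_schema)):
        spec = new_schema[field]
        # Assumption: a field is required unless explicitly marked optional
        if spec.get("required", True):
            changes.append(f"field '{field}' added as required")
    return changes


old = {"amount": {"type": "float", "required": True}}
new = {
    "amount": {"type": "float", "required": True},
    "merchant_id": {"type": "str", "required": True},
    "device_type": {"type": "str", "required": False},
}
print(detect_new_required_fields(old, new))
# ["field 'merchant_id' added as required"]
```

Applied to an input schema, any non-empty result would argue for a MAJOR bump; the optional `device_type` field passes because existing callers can omit it.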
Champion/Challenger Model Management
The champion/challenger pattern is a widely used way to evaluate new model versions safely in production.
import hashlib

import mlflow


class ChampionChallengerManager:
    """Manage champion/challenger model versions for safe A/B evaluation.

    Uses MLflow's stage-based registry API (Production/Staging/Archived).
    Recent MLflow releases deprecate stages in favor of model version aliases.
    """

    def __init__(self, model_name: str, challenger_traffic_pct: float = 0.10):
        self.model_name = model_name
        self.challenger_traffic_pct = challenger_traffic_pct
        self.client = mlflow.tracking.MlflowClient()

    def get_champion(self):
        """Get the current champion (production) model."""
        versions = self.client.get_latest_versions(
            self.model_name, stages=["Production"]
        )
        return versions[0] if versions else None

    def get_challenger(self):
        """Get the current challenger model (Staging)."""
        versions = self.client.get_latest_versions(
            self.model_name, stages=["Staging"]
        )
        return versions[0] if versions else None

    def route_request(self, request_id: str) -> str:
        """Route a request to champion or challenger based on hash."""
        # Deterministic routing: the same ID always lands in the same bucket.
        # Pass a user ID instead of a request ID for a consistent per-user experience.
        hash_val = int(hashlib.md5(request_id.encode()).hexdigest(), 16)
        pct = (hash_val % 1000) / 1000.0
        # Note: get_challenger() queries the registry; cache it on hot paths
        if pct < self.challenger_traffic_pct and self.get_challenger():
            return "challenger"
        return "champion"

    def promote_challenger(self, approver: str) -> bool:
        """Promote the challenger to champion. Call only after evaluation shows it is winning."""
        challenger = self.get_challenger()
        if not challenger:
            print("No challenger to promote")
            return False

        # Move the challenger to Production; archive_existing_versions
        # archives the current champion in the same call
        self.client.transition_model_version_stage(
            name=self.model_name,
            version=challenger.version,
            stage="Production",
            archive_existing_versions=True,
        )
        self.client.set_model_version_tag(
            name=self.model_name,
            version=challenger.version,
            key="promoted_from_challenger",
            value="true",
        )
        # Record who approved the promotion so the decision is traceable
        self.client.set_model_version_tag(
            name=self.model_name,
            version=challenger.version,
            key="promoted_by",
            value=approver,
        )
        print(f"Promoted challenger v{challenger.version} to champion")
        return True

    def abandon_challenger(self, reason: str) -> None:
        """Archive challenger and keep champion."""
        challenger = self.get_challenger()
        if not challenger:
            return
        self.client.transition_model_version_stage(
            name=self.model_name,
            version=challenger.version,
            stage="Archived",
        )
        self.client.set_model_version_tag(
            name=self.model_name,
            version=challenger.version,
            key="abandoned_reason",
            value=reason,
        )
        print(f"Abandoned challenger v{challenger.version}: {reason}")
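The decision to promote or abandon a challenger is left to the caller above. One way to make that decision explicit is sketched below: require a gain on the primary metric and no meaningful drop on any guardrail metric. The class `MetricComparison`, the function `should_promote`, and the threshold defaults are hypothetical, and the sketch deliberately omits statistical significance testing, which a real evaluation would need.

```python
from dataclasses import dataclass


@dataclass
class MetricComparison:
    name: str
    champion: float
    challenger: float
    higher_is_better: bool = True

    def relative_change(self) -> float:
        """Signed relative improvement of challenger over champion (positive = better)."""
        delta = self.challenger - self.champion
        if not self.higher_is_better:
            delta = -delta
        return delta / abs(self.champion) if self.champion else 0.0


def should_promote(
    primary: MetricComparison,
    guardrails: list[MetricComparison],
    min_primary_gain: float = 0.01,
    max_guardrail_drop: float = 0.02,
) -> tuple[bool, list[str]]:
    """Return (decision, reasons blocking promotion)."""
    blockers = []
    if primary.relative_change() < min_primary_gain:
        blockers.append(f"{primary.name}: gain below {min_primary_gain:.0%}")
    for g in guardrails:
        if g.relative_change() < -max_guardrail_drop:
            blockers.append(f"{g.name}: degraded by more than {max_guardrail_drop:.0%}")
    return (not blockers, blockers)


primary = MetricComparison("conversion_rate", champion=0.100, challenger=0.104)
latency = MetricComparison("p95_latency_ms", 120.0, 130.0, higher_is_better=False)
ok, blockers = should_promote(primary, [latency])
print(ok, blockers)
```

In this example the challenger improves conversion but is blocked because latency regressed past the guardrail threshold.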
Shadow Versioning
Shadow mode runs a new version in parallel without serving its predictions to users. It is the safest way to validate a new version before any traffic exposure.
import asyncio
import logging
from typing import Any

logger = logging.getLogger(__name__)


class ShadowModelRunner:
    """
    Runs shadow model alongside production model.
    Production predictions are served; shadow predictions are logged for comparison.
    """

    def __init__(self, production_model, shadow_model, shadow_name: str):
        self.production = production_model
        self.shadow = shadow_model
        self.shadow_name = shadow_name
        # Hold references so pending shadow tasks are not garbage-collected
        self._shadow_tasks: set[asyncio.Task] = set()

    async def predict(self, features: dict, request_id: str) -> Any:
        """
        Run the production model inline and schedule the shadow model in the background.
        Returns the production prediction without waiting for the shadow model.
        """
        # Production prediction - this is what the user gets
        production_result = self.production.predict(features)

        # Shadow prediction - fire and forget, log result for comparison
        task = asyncio.create_task(
            self._run_shadow(features, production_result, request_id)
        )
        self._shadow_tasks.add(task)
        task.add_done_callback(self._shadow_tasks.discard)
        return production_result

    async def _run_shadow(
        self,
        features: dict,
        production_result: Any,
        request_id: str,
    ) -> None:
        """Run shadow model and log comparison."""
        try:
            # predict() is blocking; run it in a worker thread (Python 3.9+)
            # so a slow shadow model cannot stall the event loop
            shadow_result = await asyncio.to_thread(self.shadow.predict, features)
            # Log for offline analysis
            logger.info(
                "shadow_comparison",
                extra={
                    "request_id": request_id,
                    "shadow_model": self.shadow_name,
                    "production_prediction": production_result,
                    "shadow_prediction": shadow_result,
                    "agreement": production_result == shadow_result,
                },
            )
        except Exception as e:
            # Shadow failures must NEVER affect production
            logger.error(f"Shadow model failed: {e}", exc_info=True)
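Logged shadow comparisons only help once they are summarized. The sketch below shows one possible offline summary, assuming each log record has been parsed into a dict with numeric `production_prediction` and `shadow_prediction` values (the field names used in the logging call above). The function `summarize_shadow_log` and its `tolerance` parameter are hypothetical.

```python
def summarize_shadow_log(records: list[dict], tolerance: float = 0.0) -> dict:
    """Summarize how closely shadow predictions track production predictions."""
    n = len(records)
    if n == 0:
        return {"count": 0, "agreement_rate": None, "mean_abs_diff": None}
    diffs = [
        abs(r["production_prediction"] - r["shadow_prediction"]) for r in records
    ]
    # Two predictions "agree" when they differ by no more than the tolerance
    agreeing = sum(1 for d in diffs if d <= tolerance)
    return {
        "count": n,
        "agreement_rate": agreeing / n,
        "mean_abs_diff": sum(diffs) / n,
    }


records = [
    {"production_prediction": 0.90, "shadow_prediction": 0.90},
    {"production_prediction": 0.20, "shadow_prediction": 0.50},
    {"production_prediction": 0.70, "shadow_prediction": 0.72},
]
summary = summarize_shadow_log(records, tolerance=0.05)
print(summary["count"], round(summary["agreement_rate"], 3))
```

Exact equality is rarely a useful agreement test for float scores, which is why the sketch compares within a tolerance.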
Version Deprecation Policies
Models need retirement policies to prevent registry clutter and ensure compliance obligations are met:
from datetime import datetime, timedelta
from enum import Enum

import mlflow


class DeprecationPolicy(Enum):
    # Keep all versions forever (compliance-sensitive use cases)
    KEEP_FOREVER = "keep_forever"
    # Keep for N days after archival
    TIME_BASED = "time_based"
    # Keep last N versions
    COUNT_BASED = "count_based"


def apply_deprecation_policy(
    model_name: str,
    policy: DeprecationPolicy,
    retention_days: int = 90,
    max_versions: int = 20,
) -> list[str]:
    """Apply deprecation policy to archived model versions (dry run: reports, does not delete)."""
    if policy == DeprecationPolicy.KEEP_FOREVER:
        return []

    client = mlflow.tracking.MlflowClient()
    # get_latest_versions returns only the newest version per stage, so list
    # every version of the model and keep the archived ones instead
    archived = [
        v
        for v in client.search_model_versions(f"name='{model_name}'")
        if v.current_stage == "Archived"
    ]
    deleted = []

    if policy == DeprecationPolicy.TIME_BASED:
        cutoff = datetime.utcnow() - timedelta(days=retention_days)
        for v in archived:
            # Assumes an "archived_at" tag (naive UTC, ISO 8601) was set at archival time
            archived_at_tag = v.tags.get("archived_at")
            if archived_at_tag:
                archived_at = datetime.fromisoformat(archived_at_tag)
                if archived_at < cutoff:
                    # In practice: delete model version
                    # client.delete_model_version(model_name, v.version)
                    deleted.append(v.version)
                    print(f"Would delete {model_name} v{v.version} (archived {archived_at.date()})")
    elif policy == DeprecationPolicy.COUNT_BASED:
        # Oldest first, so the excess slice drops the oldest versions
        sorted_archived = sorted(archived, key=lambda v: int(v.version))
        excess = len(sorted_archived) - max_versions
        if excess > 0:
            for v in sorted_archived[:excess]:
                deleted.append(v.version)
                print(f"Would delete {model_name} v{v.version} (count-based cleanup)")

    return deleted
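The time-based policy reads an `archived_at` tag, so something has to write that tag when a version is archived. A minimal sketch of such a helper follows. It takes the registry client as a parameter and calls only `transition_model_version_stage` and `set_model_version_tag`, the same client methods used earlier; the helper name `archive_with_timestamp` and the `RecordingClient` stand-in are hypothetical and exist only so the example runs without a tracking server.

```python
from datetime import datetime, timezone


def archive_with_timestamp(client, model_name: str, version: str) -> str:
    """Archive a model version and record when it happened."""
    # Naive UTC ISO-8601 string, so it compares cleanly with a naive UTC cutoff
    archived_at = datetime.now(timezone.utc).replace(tzinfo=None).isoformat()
    client.transition_model_version_stage(
        name=model_name, version=version, stage="Archived"
    )
    client.set_model_version_tag(
        name=model_name, version=version, key="archived_at", value=archived_at
    )
    return archived_at


class RecordingClient:
    """Hypothetical stand-in for MlflowClient that only records calls."""

    def __init__(self):
        self.calls = []

    def transition_model_version_stage(self, name, version, stage):
        self.calls.append(("stage", name, version, stage))

    def set_model_version_tag(self, name, version, key, value):
        self.calls.append(("tag", name, version, key, value))


demo_client = RecordingClient()
stamp = archive_with_timestamp(demo_client, "fraud-detector", "7")
print(demo_client.calls[0])  # ('stage', 'fraud-detector', '7', 'Archived')
```

With a real registry, passing an `MlflowClient` instance in place of `RecordingClient` should behave the same way, assuming the stage-based API is still available in the installed MLflow version.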
Backward Compatibility for Model APIs
When you expose a model as a REST API, versioning the API separately from the model artifact is critical:
from functools import lru_cache

import mlflow.pyfunc
import pandas as pd
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

MODEL_URI = "models:/fraud-detector/Production"


@lru_cache(maxsize=1)
def get_model():
    """Load the production model once instead of on every request.

    The cached model is reused until the process restarts or the cache is cleared,
    so a newly promoted version is not picked up automatically.
    """
    return mlflow.pyfunc.load_model(MODEL_URI)


class PredictionRequestV1(BaseModel):
    """V1 API: simple feature list."""
    features: list[float]


class PredictionResponseV1(BaseModel):
    """V1 API: single score."""
    score: float


class PredictionRequestV2(BaseModel):
    """V2 API: named features with metadata."""
    features: dict[str, float]
    metadata: dict[str, str] = {}


class PredictionResponseV2(BaseModel):
    """V2 API: score with explanation."""
    score: float
    confidence: float
    top_features: list[dict[str, float]]
    model_version: str


@app.post("/api/v1/predict", response_model=PredictionResponseV1)
async def predict_v1(request: PredictionRequestV1):
    """V1 endpoint - kept for backward compatibility."""
    # Serve the production model regardless of its semantic version
    model = get_model()
    # Adapt old format to new model if needed
    result = model.predict([request.features])
    return PredictionResponseV1(score=float(result[0]))


@app.post("/api/v2/predict", response_model=PredictionResponseV2)
async def predict_v2(request: PredictionRequestV2):
    """V2 endpoint - full response with explanation."""
    model = get_model()
    df = pd.DataFrame([request.features])
    result = model.predict(df)
    return PredictionResponseV2(
        score=float(result[0]),
        confidence=0.95,  # Placeholder: compute from model internals
        top_features=[],  # Placeholder: SHAP values
        model_version="2.3.1",  # Placeholder: read from the semantic_version tag
    )
The API version (/v1, /v2) and the model semantic version are independent. A MAJOR model version bump may or may not require an API version bump. Introduce a new API version only when the API contract changes - not when the underlying model changes.
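To make the separation concrete, consider the recommendation incident from the opening: an adapter on the old endpoint could have kept the original contract alive while the model moved to the richer output. The sketch below is one way to write that adapter; the function name `to_v1_item_ids` is hypothetical.

```python
def to_v1_item_ids(v2_items: list[dict]) -> list[int]:
    """Convert the new output (dicts with id and score) to the old one (item IDs)."""
    # Order is preserved, so ranking by score survives the conversion
    return [int(item["id"]) for item in v2_items]


v2_response = [{"id": 1234, "score": 0.87}, {"id": 5678, "score": 0.73}]
print(to_v1_item_ids(v2_response))  # [1234, 5678]
```

A v1 endpoint that applies this adapter keeps returning plain item IDs, so existing consumers keep working while new consumers adopt the v2 schema.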
Production Engineering Notes
Documenting Version Changes at Register Time
from datetime import datetime

import mlflow


def create_version_release_note(
    model_name: str,
    version: str,
    semantic_version: str,
    bump_type: str,
    changes: list[str],
    migration_guide: str = "",
) -> str:
    """Create a structured release note for a model version."""
    # One markdown bullet per change
    change_lines = "\n".join(f"- {c}" for c in changes)
    is_major = bump_type == "major"
    note = f"""
# {model_name} - {semantic_version}
**Version type**: {bump_type}
**Registry version**: {version}
**Release date**: {datetime.utcnow().strftime('%Y-%m-%d')}
## Changes
{change_lines}
## Migration Guide
{migration_guide if migration_guide else 'No migration required.'}
## Compatibility
- API backward compatible: {'No' if is_major else 'Yes'}
- Safe rollback target: {'Check migration guide' if is_major else 'Yes'}
""".strip()

    # Store the note as the registry description so it travels with the version
    client = mlflow.tracking.MlflowClient()
    client.update_model_version(
        name=model_name,
        version=version,
        description=note,
    )
    return note
Common Mistakes
Treating every registry version as a semantic version. MLflow auto-increments version numbers (1, 2, 3…). These are registry versions - they communicate order but not the nature of the change. Without a parallel semantic versioning system in tags, consumers cannot know if upgrading to version 24 means "small hyperparameter tweak" or "completely different output schema." Always log semantic version as a tag.
Not testing backward compatibility before promoting. When you claim a new version is MINOR or PATCH, you should have automated tests that verify the output schema is identical and predictions for a reference dataset have not shifted more than acceptable tolerances. Without these tests, MINOR and PATCH claims are just hope.
Using calendar dates in model names instead of version numbers. Names like fraud-model-jan-2024 do not compose - how do you express "the second model we trained in January"? Use version numbers for ordering and semantic version tags for meaning.
Deleting production-phase model versions for compliance use cases. If a model was used in production to make financial, medical, or legal decisions, those model versions may need to be retained for years. Deletion policies must account for the regulatory context of the use case.
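The backward-compatibility tests called for above, under "Not testing backward compatibility before promoting", could look roughly like the sketch below. It compares the outputs of the old and new versions on the same reference inputs, checking that field names and types match and that scores have not shifted past a tolerance. The function `check_backward_compatible`, the `score` field name, and the default tolerance are assumptions for illustration.

```python
def check_backward_compatible(
    old_outputs: list[dict],
    new_outputs: list[dict],
    score_key: str = "score",
    max_abs_shift: float = 0.05,
) -> list[str]:
    """Return a list of problems; an empty list means the MINOR/PATCH claim holds."""
    if len(old_outputs) != len(new_outputs):
        return [f"row count differs: {len(old_outputs)} vs {len(new_outputs)}"]
    problems = []
    for i, (old, new) in enumerate(zip(old_outputs, new_outputs)):
        # Schema check: same field names with the same value types
        if set(old) != set(new):
            problems.append(f"row {i}: fields differ {sorted(set(old) ^ set(new))}")
            continue
        for key in sorted(old):
            if type(old[key]) is not type(new[key]):
                problems.append(f"row {i}: type of '{key}' changed")
        # Behavior check: score shift on the reference input stays within tolerance
        if isinstance(old.get(score_key), float) and isinstance(new.get(score_key), float):
            if abs(old[score_key] - new[score_key]) > max_abs_shift:
                problems.append(f"row {i}: '{score_key}' shifted beyond {max_abs_shift}")
    return problems


old_out = [{"id": 1, "score": 0.80}, {"id": 2, "score": 0.40}]
new_out = [{"id": 1, "score": 0.82}, {"id": 2, "score": 0.55}]
print(check_backward_compatible(old_out, new_out))
```

Run in CI before promotion, a non-empty result would block a version labeled MINOR or PATCH until the label or the model is corrected.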
Interview Q&A
Q: How do you apply semantic versioning to ML models, and what triggers each type of bump?
A: MAJOR bumps when the interface changes - input schema, output schema, or fundamental API contract. Downstream consumers need to update their code. MINOR bumps when behavior changes significantly but the interface is stable - new training data, new architecture, meaningful prediction shifts. Consumers should re-validate their downstream logic but their code still runs. PATCH bumps for minor improvements - hyperparameter tuning, small regularization changes. Predictions may shift slightly but the intent is unchanged. The key addition for ML is that the model artifact changes even with a PATCH bump - this is normal and expected, unlike software where a PATCH means the binary is nearly identical.
Q: Describe the champion/challenger pattern and how you would implement it.
A: Champion is the current production model serving the majority of traffic. Challenger is a candidate version serving a small percentage - typically 5-15%. Traffic is split deterministically by hashing a user or request ID (hashing the user ID ensures consistent routing - the same user always goes to the same model). Both models' predictions and outcomes are logged. After an evaluation period long enough to reach statistical significance (often 7-14 days), compare metrics. If the challenger is better on the primary metric and does not degrade on guardrail metrics, promote it to champion and archive the old champion. If the challenger underperforms, archive it and the champion continues unchanged.
Q: How would you handle a breaking output schema change in a production model?
A: Multi-step approach: (1) version the API separately from the model - introduce /api/v2/predict with the new schema while keeping /api/v1/predict working on an adapter layer; (2) communicate the change through your registry's description field and release notes with a migration guide; (3) notify downstream consumers at least 30 days before v1 is deprecated; (4) run both versions in parallel during migration; (5) track which consumers are still using v1 via API analytics; (6) sunset v1 only after all consumers have migrated. The model semantic version bumps to MAJOR. The registry and serving layers handle backward compatibility.
Q: What is shadow versioning and when should you use it instead of canary?
A: Shadow versioning runs a new model version in parallel with the production model. The production model's predictions are served to users. The shadow model's predictions are computed and logged but never served. There is zero risk to users. Use shadow mode when: the model has a different output schema than the champion (you cannot safely A/B because the downstream service expects one format), the model is for a safety-critical use case, or you want to measure latency and throughput characteristics before any traffic exposure. Use canary when you need real user feedback signals (click-through rate, conversion) that cannot be simulated offline.
Summary
Model versioning is a communication contract with your consumers and your future self. A good versioning strategy tells consumers whether they need to update their code (MAJOR), re-validate their logic (MINOR), or simply expect slightly better predictions (PATCH). Champion/challenger and shadow mode give you safe paths to evaluate new versions without risking production quality. Deprecation policies prevent the registry from becoming an unbounded archive. These are not optional practices - they are the difference between a model team that can ship confidently and one that ships fearfully.
