Model Versioning Strategies

When a Version Change Breaks Everything

It is a Monday morning and your phone starts lighting up. The team that owns the product recommendation service is furious. Over the weekend, the ML team released a new version of the recommendation model. The model itself is better - offline metrics improved 8%. But the output format changed.

The previous model returned a list of integers (item IDs). The new model returns a list of dictionaries: [{"id": 1234, "score": 0.87}, {"id": 5678, "score": 0.73}]. The downstream service, written six months ago, was never told about this change. It reads item_ids[0] expecting an integer and gets the dictionary {"id": 1234, "score": 0.87} instead. The crash comes one step later, when that dictionary is used where an ID is expected: as a lookup key, for example, it raises a TypeError because dictionaries are unhashable.
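
A minimal reproduction of this failure, assuming a hypothetical consumer that looks up the first recommended ID in a catalog dictionary. The names `CATALOG` and `top_item_title` are illustrative and do not come from any real service:

```python
# Hypothetical downstream consumer; CATALOG and top_item_title are illustrative.
CATALOG = {1234: "Blue running shoes", 5678: "Trail socks"}


def top_item_title(item_ids: list) -> str:
    """Written against the old contract: a list of integer item IDs."""
    first = item_ids[0]      # still succeeds with the new format...
    return CATALOG[first]    # ...but a dict cannot be used as a dict key


old_response = [1234, 5678]
new_response = [{"id": 1234, "score": 0.87}, {"id": 5678, "score": 0.73}]

print(top_item_title(old_response))

try:
    top_item_title(new_response)
except TypeError as exc:
    print(f"consumer crashed: {exc}")
```

The indexing step itself keeps working, which is why this kind of break tends to surface far from the line that received the new format.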

This is a versioning problem. Not "model registry" versioning - interface versioning. The model team changed the contract with its consumers without following a versioning policy that would have warned them. The downstream team had no way to know a breaking change was coming.

The problem also runs deeper. If you need to roll back - which model do you roll back to? How do you know the "previous" model had the old output format? Is there a version number you can look up, or are you hoping the S3 filename gives you a clue?

Model versioning is not just about incrementing numbers. It is about communicating the nature of changes, enforcing compatibility rules, and making the history of decisions traceable.

Why Versioning for ML Is Different from Software

Software versioning is well understood. Semantic versioning (SemVer) says: increment MAJOR when you make incompatible API changes, MINOR when you add backward-compatible functionality, PATCH when you fix bugs. Consumers know that upgrading from 2.3.1 to 2.4.0 is safe. Upgrading to 3.0.0 requires review.

ML models have an additional dimension: the model itself changes even when the code does not. Retraining on new data produces a new artifact. The interface might be identical but the behavior is different. The prediction for a given input may shift. This is a change consumers care about - it may require their downstream logic to be re-validated.

Four categories of change exist in ML models:

Category	Examples	Interface Change?	Behavior Change?
Code change	New architecture, new features, bug fix	Sometimes	Yes
Data change	New training data, fixed data pipeline	No	Yes
Hyperparameter change	Learning rate, depth, regularization	No	Yes
Breaking interface change	New output schema, different input format	Yes	Yes

A good versioning strategy must handle all four categories.
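
One way to make these categories visible is to record a small fingerprint with every trained model and compare fingerprints between versions. The sketch below assumes a hypothetical `ModelFingerprint` record; the field names and the idea of hashing hyperparameters and schemas are illustrative choices, not a standard:

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ModelFingerprint:
    """Hypothetical record of everything that can differ between two versions."""
    code_commit: str        # git SHA of the training code
    data_version: str       # identifier of the training data snapshot
    hyperparams_hash: str   # hash of the hyperparameter config
    interface_hash: str     # hash of the input and output schemas


def changed_dimensions(old: ModelFingerprint, new: ModelFingerprint) -> list[str]:
    """Return the names of the fingerprint fields that differ."""
    old_fields, new_fields = asdict(old), asdict(new)
    return [name for name in old_fields if old_fields[name] != new_fields[name]]


v1 = ModelFingerprint("a1b2c3", "2024-01", "hp-01", "iface-01")
retrained = ModelFingerprint("a1b2c3", "2024-02", "hp-01", "iface-01")

print(changed_dimensions(v1, retrained))  # ['data_version']
```

Here a retrain on new data shows up as a data change with an unchanged interface, which is exactly the case plain software versioning has no slot for.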

Semantic Versioning for Models

Adapting SemVer for ML

SemVer can be adapted for models with clear rules about what triggers each segment:

{MAJOR}.{MINOR}.{PATCH}

MAJOR - breaking interface change
  - Input schema changed (new required field, removed field)
  - Output schema changed (new fields, removed fields, type changes)
  - Model API changed (different prediction endpoint)
  - Fundamental architectural change (e.g., single-task to multi-task)

MINOR - significant behavior change, backward-compatible interface
  - New training data (different distribution, extended time range)
  - New features added (backward-compatible input - new optional fields)
  - Architecture improvement (same interface, meaningfully different predictions)
  - Major retraining after known data quality fix

PATCH - minor improvement, same interface, minimal behavior change
  - Hyperparameter tuning without architecture change
  - Same data with bug fix in preprocessing
  - Regularization adjustment
  - Threshold calibration update

# Version tracking in MLflow tags
import mlflow
from dataclasses import dataclass
from enum import Enum

class VersionBump(Enum):
    MAJOR = "major"
    MINOR = "minor"
    PATCH = "patch"


@dataclass
class ModelVersion:
    major: int
    minor: int
    patch: int

    def __str__(self) -> str:
        return f"{self.major}.{self.minor}.{self.patch}"

    def bump(self, bump_type: VersionBump) -> "ModelVersion":
        if bump_type == VersionBump.MAJOR:
            return ModelVersion(self.major + 1, 0, 0)
        elif bump_type == VersionBump.MINOR:
            return ModelVersion(self.major, self.minor + 1, 0)
        else:
            return ModelVersion(self.major, self.minor, self.patch + 1)

    @classmethod
    def from_string(cls, version_str: str) -> "ModelVersion":
        parts = version_str.split(".")
        return cls(int(parts[0]), int(parts[1]), int(parts[2]))


def register_model_with_version(
    run_id: str,
    model_name: str,
    semantic_version: str,
    bump_type: VersionBump,
    change_description: str,
    breaking_changes: list[str] | None = None,
) -> str:
    """Register the model logged under run_id and tag the new registry
    version with its semantic version, bump type, and any breaking changes."""

    model_uri = f"runs:/{run_id}/model"
    mv = mlflow.register_model(model_uri, model_name)

    client = mlflow.tracking.MlflowClient()

    # Set semantic version
    client.set_model_version_tag(
        name=model_name,
        version=mv.version,
        key="semantic_version",
        value=semantic_version,
    )

    client.set_model_version_tag(
        name=model_name,
        version=mv.version,
        key="version_bump_type",
        value=bump_type.value,
    )

    if breaking_changes:
        client.set_model_version_tag(
            name=model_name,
            version=mv.version,
            key="breaking_changes",
            value="; ".join(breaking_changes),
        )

    client.update_model_version(
        name=model_name,
        version=mv.version,
        description=f"v{semantic_version} ({bump_type.value}): {change_description}",
    )

    return mv.version
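
Because the semantic version is stored as a string tag, anything that sorts or compares versions has to parse it first. The sketch below is a standalone illustration (the helper names `parse_semver` and `is_breaking_upgrade` are assumptions, not part of MLflow) of why string comparison goes wrong and how a consumer might check for a MAJOR boundary:

```python
def parse_semver(tag: str) -> tuple[int, int, int]:
    """Turn '2.10.0' into (2, 10, 0) so versions compare numerically."""
    major, minor, patch = (int(part) for part in tag.split("."))
    return major, minor, patch


def is_breaking_upgrade(current: str, target: str) -> bool:
    """True when moving from current to target crosses a MAJOR boundary."""
    return parse_semver(target)[0] > parse_semver(current)[0]


tags = ["2.9.0", "2.10.0", "10.0.0", "2.3.1"]

print(sorted(tags))                    # lexicographic: ['10.0.0', '2.10.0', '2.3.1', '2.9.0']
print(sorted(tags, key=parse_semver))  # numeric: ['2.3.1', '2.9.0', '2.10.0', '10.0.0']
print(is_breaking_upgrade("2.3.1", "2.4.0"))  # False
print(is_breaking_upgrade("2.4.0", "3.0.0"))  # True
```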

Version Triggers

Automated Version Bump Detection

You should automate the detection of what kind of version bump is appropriate:

import hashlib
import json
from pathlib import Path


def compute_schema_hash(schema: dict) -> str:
    """Compute a hash of the model schema for change detection."""
    canonical = json.dumps(schema, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]


def determine_version_bump(
    current_version: str,
    current_input_schema: dict,
    current_output_schema: dict,
    new_input_schema: dict,
    new_output_schema: dict,
    data_version_changed: bool,
    architecture_changed: bool,
) -> tuple[VersionBump, list[str]]:
    """Determine the appropriate version bump type and reasons."""
    breaking_changes = []

    # Check for breaking schema changes
    input_breaking = detect_breaking_schema_changes(
        current_input_schema, new_input_schema
    )
    output_breaking = detect_breaking_schema_changes(
        current_output_schema, new_output_schema
    )

    if input_breaking or output_breaking:
        if input_breaking:
            breaking_changes.extend([f"Input schema: {c}" for c in input_breaking])
        if output_breaking:
            breaking_changes.extend([f"Output schema: {c}" for c in output_breaking])
        return VersionBump.MAJOR, breaking_changes

    # Check for significant behavioral changes
    if data_version_changed or architecture_changed:
        reasons = []
        if data_version_changed:
            reasons.append("training data version changed")
        if architecture_changed:
            reasons.append("model architecture changed")
        return VersionBump.MINOR, reasons

    # Minor improvements
    return VersionBump.PATCH, ["hyperparameters or minor tuning only"]


def detect_breaking_schema_changes(old_schema: dict, new_schema: dict) -> list[str]:
    """Detect breaking changes between two flat {field: type} schemas.

    Only removed fields and type changes are flagged. Added fields are not,
    although the rules above count a new required input field and new output
    fields as MAJOR, so those cases need a separate check.
    """
    changes = []
    old_fields = set(old_schema.keys())
    new_fields = set(new_schema.keys())

    # Removed fields are always breaking (sorted for deterministic output)
    for field in sorted(old_fields - new_fields):
        changes.append(f"field '{field}' removed")

    # Type changes are breaking
    for field in sorted(old_fields & new_fields):
        if old_schema[field] != new_schema[field]:
            changes.append(
                f"field '{field}' type changed from {old_schema[field]} to {new_schema[field]}"
            )

    return changes
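
The detector above works on flat field-to-type mappings, so it has no way to see whether an added input field is required. A possible extension, assuming schemas are written as `{field: {"type": ..., "required": ...}}` (an illustrative format, not one the lesson prescribes) and using the hypothetical name `breaking_input_changes`:

```python
def breaking_input_changes(old: dict, new: dict) -> list[str]:
    """Breaking changes between two input schemas of the assumed form
    {field: {"type": str, "required": bool}}."""
    changes = []

    # Removing a field breaks callers that still send or rely on it
    for name in sorted(old.keys() - new.keys()):
        changes.append(f"field '{name}' removed")

    for name in sorted(old.keys() & new.keys()):
        if old[name]["type"] != new[name]["type"]:
            changes.append(
                f"field '{name}' type changed from {old[name]['type']} to {new[name]['type']}"
            )
        elif new[name]["required"] and not old[name]["required"]:
            changes.append(f"field '{name}' became required")

    # New optional fields are fine; new required fields break existing callers
    for name in sorted(new.keys() - old.keys()):
        if new[name]["required"]:
            changes.append(f"new required field '{name}'")

    return changes


old_schema = {
    "user_id": {"type": "int", "required": True},
    "locale": {"type": "str", "required": False},
}
new_schema = {
    "user_id": {"type": "int", "required": True},
    "locale": {"type": "str", "required": False},
    "session_id": {"type": "str", "required": True},
    "device": {"type": "str", "required": False},
}

print(breaking_input_changes(old_schema, new_schema))
```

Only `session_id` is reported: the optional `device` field is backward compatible, which matches the MINOR rule for new optional inputs.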

Champion/Challenger Model Management

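The champion/challenger pattern is a widely used way to evaluate new model versions safely in production: the champion keeps serving most traffic while a challenger receives a small share, and the two are compared on live outcomes.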

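class ChampionChallengerManager:
    """Manage champion/challenger model versions for safe A/B evaluation.

    Relies on MLflow model stages (Production/Staging/Archived). Newer MLflow
    releases deprecate stages in favor of model version aliases, so check the
    behavior of your MLflow version before using this as-is.
    """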

    def __init__(self, model_name: str, challenger_traffic_pct: float = 0.10):
        self.model_name = model_name
        self.challenger_traffic_pct = challenger_traffic_pct
        self.client = mlflow.tracking.MlflowClient()

    def get_champion(self):
        """Get the current champion (production) model."""
        versions = self.client.get_latest_versions(
            self.model_name, stages=["Production"]
        )
        return versions[0] if versions else None

    def get_challenger(self):
        """Get the current challenger model (Staging)."""
        versions = self.client.get_latest_versions(
            self.model_name, stages=["Staging"]
        )
        return versions[0] if versions else None

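    def route_request(self, request_id: str) -> str:
        """Route a request to champion or challenger based on a hash of its ID."""
        import hashlib

        # Hashing makes routing deterministic for a given ID. Pass a stable
        # user ID rather than a per-request ID if the same user should always
        # see the same model.
        hash_val = int(hashlib.md5(request_id.encode()).hexdigest(), 16)
        pct = (hash_val % 1000) / 1000.0

        # get_challenger() queries the registry on every call; cache the
        # result in a real serving path.
        if pct < self.challenger_traffic_pct and self.get_challenger():
            return "challenger"
        return "champion"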

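    def promote_challenger(self, approver: str) -> bool:
        """Promote the challenger to champion.

        This method does not check metrics; the caller is responsible for
        confirming the challenger is winning before calling it.
        """
        challenger = self.get_challenger()
        if not challenger:
            print("No challenger to promote")
            return False

        # Move the challenger to Production; archive_existing_versions=True
        # archives the current champion in the same call
        self.client.transition_model_version_stage(
            name=self.model_name,
            version=challenger.version,
            stage="Production",
            archive_existing_versions=True,
        )

        self.client.set_model_version_tag(
            name=self.model_name,
            version=challenger.version,
            key="promoted_from_challenger",
            value="true",
        )

        # Record who approved the promotion so the decision is traceable
        self.client.set_model_version_tag(
            name=self.model_name,
            version=challenger.version,
            key="promoted_by",
            value=approver,
        )

        print(f"Promoted challenger v{challenger.version} to champion")
        return True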

    def abandon_challenger(self, reason: str) -> None:
        """Archive challenger and keep champion."""
        challenger = self.get_challenger()
        if not challenger:
            return

        self.client.transition_model_version_stage(
            name=self.model_name,
            version=challenger.version,
            stage="Archived",
        )

        self.client.set_model_version_tag(
            name=self.model_name,
            version=challenger.version,
            key="abandoned_reason",
            value=reason,
        )

        print(f"Abandoned challenger v{challenger.version}: {reason}")
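
The routing logic can be checked in isolation, without a registry. The sketch below is a standalone version under stated assumptions: it hashes a user ID rather than a request ID, uses SHA-256 instead of MD5, and the function name `assign_arm` is illustrative:

```python
import hashlib


def assign_arm(user_id: str, challenger_pct: float = 0.10) -> str:
    """Deterministically map a user to 'champion' or 'challenger'."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 1000          # 1000 equal-width buckets
    return "challenger" if bucket < challenger_pct * 1000 else "champion"


# The same user always lands in the same arm
assert assign_arm("user-42") == assign_arm("user-42")

# The realized split should be close to the configured one
arms = [assign_arm(f"user-{i}") for i in range(10_000)]
share = arms.count("challenger") / len(arms)
print(f"challenger share: {share:.3f}")
```

Because assignment depends only on the ID and the configured percentage, raising the percentage later keeps every existing challenger user in the challenger arm and only moves additional users over.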

Shadow Versioning

Shadow mode runs a new version in parallel without serving its predictions to users. It is the safest way to validate a new version before any traffic exposure.

import asyncio
import logging
from typing import Any

logger = logging.getLogger(__name__)

class ShadowModelRunner:
    """
    Runs shadow model alongside production model.
    Production predictions are served; shadow predictions are logged for comparison.
    """

    def __init__(self, production_model, shadow_model, shadow_name: str):
        self.production = production_model
        self.shadow = shadow_model
        self.shadow_name = shadow_name
        # Hold references so pending shadow tasks are not garbage-collected
        self._shadow_tasks: set[asyncio.Task] = set()

    async def predict(self, features: dict, request_id: str) -> Any:
        """
        Run the production model inline and schedule the shadow model in the background.
        Returns the production prediction without waiting for the shadow result.
        """
        # Production prediction - this is what the user gets
        production_result = self.production.predict(features)

        # Shadow prediction - fire and forget, log result for comparison
        task = asyncio.create_task(
            self._run_shadow(features, production_result, request_id)
        )
        self._shadow_tasks.add(task)
        task.add_done_callback(self._shadow_tasks.discard)

        return production_result

    async def _run_shadow(
        self,
        features: dict,
        production_result: Any,
        request_id: str,
    ) -> None:
        """Run shadow model and log comparison."""
        try:
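            # Run the blocking predict call in a worker thread so it does not
            # stall the event loop (asyncio.to_thread requires Python 3.9+)
            shadow_result = await asyncio.to_thread(self.shadow.predict, features)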

            # Log for offline analysis
            logger.info(
                "shadow_comparison",
                extra={
                    "request_id": request_id,
                    "shadow_model": self.shadow_name,
                    "production_prediction": production_result,
                    "shadow_prediction": shadow_result,
                    "agreement": production_result == shadow_result,
                },
            )
        except Exception as e:
            # Shadow failures must NEVER affect production
            logger.error(f"Shadow model failed: {e}", exc_info=True)
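
The logged comparisons still need to be analyzed offline. Exact equality, as in the `agreement` field above, is rarely meaningful for floating-point scores, so a tolerance is usually more useful. A minimal sketch, assuming each log record carries numeric `production_prediction` and `shadow_prediction` values; the function name `summarize_shadow_log` and the default tolerance are illustrative:

```python
def summarize_shadow_log(records: list[dict], tolerance: float = 0.05) -> dict:
    """Summarize how closely shadow scores track production scores."""
    if not records:
        return {"n": 0, "agreement_rate": None, "mean_abs_diff": None}

    diffs = [
        abs(r["production_prediction"] - r["shadow_prediction"]) for r in records
    ]
    within_tolerance = sum(d <= tolerance for d in diffs)
    return {
        "n": len(diffs),
        "agreement_rate": within_tolerance / len(diffs),
        "mean_abs_diff": sum(diffs) / len(diffs),
    }


records = [
    {"production_prediction": 0.80, "shadow_prediction": 0.82},
    {"production_prediction": 0.10, "shadow_prediction": 0.40},
    {"production_prediction": 0.55, "shadow_prediction": 0.55},
    {"production_prediction": 0.30, "shadow_prediction": 0.33},
]

summary = summarize_shadow_log(records)
print(summary["n"], summary["agreement_rate"], round(summary["mean_abs_diff"], 4))
```

With these four records, three fall within the 0.05 tolerance, so the agreement rate is 0.75; the one large disagreement is the kind of case worth inspecting by hand before any traffic exposure.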

Version Deprecation Policies

Models need retirement policies to prevent registry clutter and ensure compliance obligations are met:

from datetime import datetime, timedelta, timezone
from enum import Enum

import mlflow


class DeprecationPolicy(Enum):
    # Keep all versions forever (compliance-sensitive use cases)
    KEEP_FOREVER = "keep_forever"
    # Keep for N days after archival
    TIME_BASED = "time_based"
    # Keep last N versions
    COUNT_BASED = "count_based"


def apply_deprecation_policy(
    model_name: str,
    policy: DeprecationPolicy,
    retention_days: int = 90,
    max_versions: int = 20,
) -> list[str]:
    """Return the archived versions the policy would delete (dry run)."""
    if policy == DeprecationPolicy.KEEP_FOREVER:
        return []

    client = mlflow.tracking.MlflowClient()
    # get_latest_versions returns only the newest version per stage, so list
    # every version and filter for the Archived stage instead
    archived = [
        v
        for v in client.search_model_versions(f"name='{model_name}'")
        if v.current_stage == "Archived"
    ]
    to_delete = []

    if policy == DeprecationPolicy.TIME_BASED:
        cutoff = datetime.now(timezone.utc) - timedelta(days=retention_days)
        for v in archived:
            # Assumes an ISO-8601 "archived_at" tag was set when the version
            # was archived; versions without the tag are skipped
            archived_at_tag = v.tags.get("archived_at")
            if archived_at_tag:
                archived_at = datetime.fromisoformat(archived_at_tag)
                if archived_at.tzinfo is None:
                    # Treat naive timestamps as UTC
                    archived_at = archived_at.replace(tzinfo=timezone.utc)
                if archived_at < cutoff:
                    # In practice: delete model version
                    # client.delete_model_version(model_name, v.version)
                    to_delete.append(v.version)
                    print(f"Would delete {model_name} v{v.version} (archived {archived_at.date()})")

    elif policy == DeprecationPolicy.COUNT_BASED:
        # Oldest first, so the excess slice drops the oldest versions
        sorted_archived = sorted(archived, key=lambda v: int(v.version))
        excess = len(sorted_archived) - max_versions
        if excess > 0:
            for v in sorted_archived[:excess]:
                to_delete.append(v.version)
                print(f"Would delete {model_name} v{v.version} (count-based cleanup)")

    return to_delete
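
A time-based policy on its own does not distinguish versions that once served production traffic from experiments that never left staging. The registry-independent sketch below adds that guard. `ArchivedVersion`, its `served_production` flag, and `select_deletable` are hypothetical names; in practice the flag would have to come from a tag your team sets at promotion time:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class ArchivedVersion:
    """Hypothetical, registry-independent view of an archived model version."""
    version: int
    archived_at: datetime
    served_production: bool


def select_deletable(
    versions: list[ArchivedVersion], retention_days: int, now: datetime
) -> list[int]:
    """Versions past retention that never served production, oldest first."""
    cutoff = now - timedelta(days=retention_days)
    return [
        v.version
        for v in sorted(versions, key=lambda v: v.version)
        if v.archived_at < cutoff and not v.served_production
    ]


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
versions = [
    ArchivedVersion(3, datetime(2024, 1, 10, tzinfo=timezone.utc), served_production=True),
    ArchivedVersion(4, datetime(2024, 1, 20, tzinfo=timezone.utc), served_production=False),
    ArchivedVersion(7, datetime(2024, 5, 20, tzinfo=timezone.utc), served_production=False),
]

print(select_deletable(versions, retention_days=90, now=now))  # [4]
```

Version 3 is old enough to delete but is kept because it served production; version 7 is kept because it is still inside the retention window.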

Backward Compatibility for Model APIs

When you expose a model as a REST API, versioning the API separately from the model artifact is critical:

from fastapi import FastAPI
from pydantic import BaseModel
import mlflow.pyfunc

app = FastAPI()

class PredictionRequestV1(BaseModel):
    """V1 API: simple feature list."""
    features: list[float]

class PredictionResponseV1(BaseModel):
    """V1 API: single score."""
    score: float

class PredictionRequestV2(BaseModel):
    """V2 API: named features with metadata."""
    features: dict[str, float]
    metadata: dict[str, str] = {}

class PredictionResponseV2(BaseModel):
    """V2 API: score with explanation."""
    score: float
    confidence: float
    top_features: list[dict[str, float]]
    model_version: str


from functools import lru_cache


@lru_cache(maxsize=1)
def get_production_model():
    """Load the Production model once per process rather than on every request.

    A cached model does not pick up a newly promoted version until the
    process restarts or the cache is cleared.
    """
    return mlflow.pyfunc.load_model("models:/fraud-detector/Production")


@app.post("/api/v1/predict", response_model=PredictionResponseV1)
async def predict_v1(request: PredictionRequestV1):
    """V1 endpoint - kept for backward compatibility."""
    # Serve whichever model is in Production, regardless of its semantic version
    model = get_production_model()
    # Adapt old format to new model if needed
    result = model.predict([request.features])
    return PredictionResponseV1(score=float(result[0]))


@app.post("/api/v2/predict", response_model=PredictionResponseV2)
async def predict_v2(request: PredictionRequestV2):
    """V2 endpoint - full response with explanation."""
    import pandas as pd
    model = get_production_model()
    df = pd.DataFrame([request.features])
    result = model.predict(df)
    return PredictionResponseV2(
        score=float(result[0]),
        confidence=0.95,  # Placeholder; compute from model internals
        top_features=[],  # Placeholder; fill with SHAP values
        model_version="2.3.1",  # Placeholder; read from the semantic_version tag
    )

Note: The API version (/v1, /v2) and the model semantic version are independent. A MAJOR model version bump may or may not require an API version bump. Break the API version only when the API contract changes - not when the underlying model changes.

Production Engineering Notes

Documenting Version Changes at Register Time

from datetime import datetime, timezone

import mlflow


def create_version_release_note(
    model_name: str,
    version: str,
    semantic_version: str,
    bump_type: str,
    changes: list[str],
    migration_guide: str = "",
) -> str:
    """Create a structured release note for a model version."""
    note = f"""
# {model_name} - {semantic_version}

**Version type**: {bump_type}
**Registry version**: {version}
**Release date**: {datetime.now(timezone.utc).strftime('%Y-%m-%d')}

## Changes
{chr(10).join(f'- {c}' for c in changes)}

## Migration Guide
{migration_guide if migration_guide else 'No migration required.'}

## Compatibility
- API backward compatible: {'No' if bump_type == 'major' else 'Yes'}
- Safe rollback target: {'Check migration guide' if bump_type == 'major' else 'Yes'}
""".strip()

    client = mlflow.tracking.MlflowClient()
    client.update_model_version(
        name=model_name,
        version=version,
        description=note,
    )

    return note

Common Mistakes

Danger: Treating every registry version as a semantic version. MLflow auto-increments version numbers (1, 2, 3…). These are registry versions - they communicate order but not the nature of the change. Without a parallel semantic versioning system in tags, consumers cannot know if upgrading to version 24 means "small hyperparameter tweak" or "completely different output schema." Always log the semantic version as a tag.

Danger: Not testing backward compatibility before promoting. When you claim a new version is MINOR or PATCH, you should have automated tests that verify the output schema is identical and that predictions for a reference dataset have not shifted beyond acceptable tolerances. Without these tests, MINOR and PATCH claims are just hope.

Warning: Using calendar dates in model names instead of version numbers. Names like fraud-model-jan-2024 do not compose - how do you express "the second model we trained in January"? Use version numbers for ordering and semantic version tags for meaning.

Warning: Deleting model versions that served production in compliance-sensitive use cases. If a model was used in production to make financial, medical, or legal decisions, those model versions may need to be retained for years. Deletion policies must account for the regulatory context of the use case.
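
The backward-compatibility mistake above can be guarded against with a check that runs before promotion. The sketch below compares old and new outputs on a reference dataset; the function name `check_backward_compatible`, the per-row dict output format, and the default tolerance are assumptions chosen for illustration:

```python
def check_backward_compatible(
    old_outputs: list[dict],
    new_outputs: list[dict],
    max_abs_shift: float = 0.05,
) -> list[str]:
    """Return a list of violations; an empty list supports a MINOR/PATCH claim."""
    if len(old_outputs) != len(new_outputs):
        return [f"row count differs: {len(old_outputs)} vs {len(new_outputs)}"]

    problems = []
    for i, (old, new) in enumerate(zip(old_outputs, new_outputs)):
        # Schema check: same field names in every row
        if old.keys() != new.keys():
            problems.append(f"row {i}: fields changed to {sorted(new.keys())}")
            continue
        for key in old:
            # Schema check: same value type per field
            if type(old[key]) is not type(new[key]):
                problems.append(f"row {i}: type of '{key}' changed")
            # Behavior check: numeric predictions stay within tolerance
            elif isinstance(old[key], float) and abs(old[key] - new[key]) > max_abs_shift:
                problems.append(f"row {i}: '{key}' shifted by more than {max_abs_shift}")
    return problems


old = [{"score": 0.91}, {"score": 0.12}]
new_ok = [{"score": 0.93}, {"score": 0.10}]
new_shifted = [{"score": 0.70}, {"score": 0.12}]
new_schema = [{"score": 0.91, "label": "fraud"}, {"score": 0.12, "label": "ok"}]

print(check_backward_compatible(old, new_ok))
print(check_backward_compatible(old, new_shifted))
print(check_backward_compatible(old, new_schema))
```

A reasonable design choice is a tighter tolerance for PATCH than for MINOR, since a MINOR bump already signals that predictions may move meaningfully.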

Interview Q&A

Q: How do you apply semantic versioning to ML models, and what triggers each type of bump?

A: MAJOR bumps when the interface changes - input schema, output schema, or fundamental API contract. Downstream consumers need to update their code. MINOR bumps when behavior changes significantly but the interface is stable - new training data, new architecture, meaningful prediction shifts. Consumers should re-validate their downstream logic but their code still runs. PATCH bumps for minor improvements - hyperparameter tuning, small regularization changes. Predictions may shift slightly but the intent is unchanged. The key addition for ML is that the model artifact changes even with a PATCH bump - this is normal and expected, unlike software where a PATCH means the binary is nearly identical.

Q: Describe the champion/challenger pattern and how you would implement it.

A: Champion is the current production model serving the majority of traffic. Challenger is a candidate version serving a small percentage - typically 5-15%. Traffic is split deterministically by hashing a user ID or request ID. Hashing the user ID gives consistent routing (the same user always goes to the same model); hashing the request ID splits traffic evenly but lets one user see both models. Both models' predictions and outcomes are logged. After an evaluation period long enough to reach statistical significance (often 7-14 days), compare metrics. If the challenger is better on the primary metric and does not degrade on guardrail metrics, promote it to champion and archive the old champion. If the challenger underperforms, archive it and the champion continues unchanged.
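
For a primary metric that is a rate (conversion, click-through), the "is the challenger better" question can be sketched as a one-sided two-proportion z-test. This is one common choice rather than the only valid test, and the function name and the 0.05 threshold are assumptions:

```python
import math


def challenger_beats_champion(
    champ_successes: int,
    champ_trials: int,
    chall_successes: int,
    chall_trials: int,
    alpha: float = 0.05,
) -> tuple[bool, float]:
    """One-sided two-proportion z-test: is the challenger's rate higher?"""
    p_champ = champ_successes / champ_trials
    p_chall = chall_successes / chall_trials
    pooled = (champ_successes + chall_successes) / (champ_trials + chall_trials)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / champ_trials + 1 / chall_trials))
    if std_err == 0:
        return False, 1.0  # no variation in the data, nothing to conclude

    z = (p_chall - p_champ) / std_err
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # P(Z >= z) under the null
    return p_value < alpha, p_value


# 10% conversion on 10,000 champion requests vs 13% on 1,000 challenger requests
wins, p_value = challenger_beats_champion(1000, 10_000, 130, 1000)
print(wins, round(p_value, 4))
```

This covers only the primary metric. Guardrail metrics such as latency or error rate need their own checks before promotion.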

Q: How would you handle a breaking output schema change in a production model?

A: Multi-step approach: (1) version the API separately from the model - introduce /api/v2/predict with the new schema while keeping /api/v1/predict working through an adapter layer; (2) communicate the change through your registry's description field and release notes with a migration guide; (3) notify downstream consumers at least 30 days before v1 is deprecated; (4) run both versions in parallel during migration; (5) track which consumers are still using v1 via API analytics; (6) sunset v1 only after all consumers have migrated. The model semantic version bumps to MAJOR. The registry and serving layers handle backward compatibility.
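
The adapter layer in step (1) can be very small. For the recommendation example from the start of this lesson, a sketch might look like the following; `adapt_to_v1` is a hypothetical name, and the code assumes the new model returns items already in ranked order:

```python
def adapt_to_v1(model_output: list[dict]) -> list[int]:
    """Convert the new [{'id': ..., 'score': ...}] output to the old list of IDs."""
    item_ids = []
    for position, item in enumerate(model_output):
        if "id" not in item:
            raise ValueError(f"item at position {position} has no 'id' field")
        item_ids.append(int(item["id"]))
    return item_ids


new_model_output = [{"id": 1234, "score": 0.87}, {"id": 5678, "score": 0.73}]
print(adapt_to_v1(new_model_output))  # [1234, 5678]
```

With this in place, the v1 endpoint keeps its original contract while being served by the MAJOR-bumped model, and v1 consumers can migrate on their own schedule.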

Q: What is shadow versioning and when should you use it instead of canary?

A: Shadow versioning runs a new model version in parallel with the production model. The production model's predictions are served to users. The shadow model's predictions are computed and logged but never served. There is zero risk to users. Use shadow mode when: the model has a different output schema than the champion (you cannot safely A/B because the downstream service expects one format), the model is for a safety-critical use case, or you want to measure latency and throughput characteristics before any traffic exposure. Use canary when you need real user feedback signals (click-through rate, conversion) that cannot be simulated offline.

Summary

Model versioning is a communication contract with your consumers and your future self. A good versioning strategy tells consumers whether they need to update their code (MAJOR), re-validate their logic (MINOR), or simply expect slightly better predictions (PATCH). Champion/challenger and shadow mode give you safe paths to evaluate new versions without risking production quality. Deprecation policies prevent the registry from becoming an unbounded archive. These are not optional practices - they are the difference between a model team that can ship confidently and one that ships fearfully.
