Consistency and Availability in ML Systems

Consistency is not binary. The question is never "do we want consistent data?" The question is "how inconsistent can we tolerate, for how long, for which data, at what cost to the user experience?"

The Production Moment

The fraud detection model had been running flawlessly for eight months. Then, on a Thursday afternoon, the feature engineering team deployed an update to the velocity feature pipeline - the one that computed how many transactions a card had made in the last hour. The update was a performance optimization; it reduced computation time by 40%. The output format was identical.

What the team didn't notice: the new pipeline computed the 1-hour window using the processing time (when the event was processed) rather than the event time (when the transaction actually occurred). During normal conditions, these were within seconds of each other. On Thursday afternoon, a message queue backlog meant events were arriving 90 minutes late. The "1-hour velocity" feature was computing values as if transactions from 90 minutes ago had happened now.

The model, trained on event-time velocity features, was receiving processing-time velocity features. Fraud scores dropped. $3.2M in fraudulent transactions were approved in the next 6 hours before the on-call engineer diagnosed the issue.

This is a consistency problem. Not a database consistency problem - the data was consistent with itself. It was a semantic consistency problem between the training environment and the serving environment: two parts of the system that needed to agree on "what does this feature mean" had silently diverged.
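The divergence between the two definitions of "last hour" can be reproduced in a few lines. The sketch below is illustrative only: the `Txn` record, the `velocity_1h` helper, and the timestamps are hypothetical and not taken from the pipeline in the story. It counts a card's transactions in a trailing one-hour window twice, once keyed on event time and once on processing time, under a 90-minute queue backlog.

```python
from dataclasses import dataclass

HOUR = 3600  # seconds


@dataclass(frozen=True)
class Txn:
    event_time: float       # when the card was actually charged
    processing_time: float  # when the pipeline consumed the event


def velocity_1h(txns: list[Txn], now: float, use_event_time: bool) -> int:
    """Count transactions whose chosen timestamp falls in (now - 1h, now]."""
    count = 0
    for t in txns:
        ts = t.event_time if use_event_time else t.processing_time
        if now - HOUR < ts <= now:
            count += 1
    return count


# Five transactions between 08:00 and 08:20 (seconds since midnight),
# all delivered 90 minutes late because of a queue backlog.
backlog = 90 * 60
txns = [
    Txn(event_time=8 * HOUR + i * 300,
        processing_time=8 * HOUR + i * 300 + backlog)
    for i in range(5)
]

now = 8 * HOUR + backlog + 1500  # 09:55, shortly after the delayed events arrive

event_time_velocity = velocity_1h(txns, now, use_event_time=True)
processing_time_velocity = velocity_1h(txns, now, use_event_time=False)
```

With these inputs the two definitions disagree completely: the same feature name carries two different meanings, even though the output format is identical.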

ML systems have consistency requirements that don't exist in traditional software. Understanding them - and the classic framework of CAP theorem applied to this context - is the foundation of reliable ML system design.

CAP Theorem: The Basics

CAP theorem, conjectured by Eric Brewer (2000) and proved by Gilbert and Lynch (2002), states that a distributed system cannot simultaneously guarantee all three of:

Consistency: Every read receives the most recent write or an error
Availability: Every request receives a (non-error) response
Partition Tolerance: The system continues operating despite network partitions

In the presence of a network partition (which is a real-world inevitability in distributed systems), you must choose between consistency and availability.

For most ML systems, the right choice is AP (availability + partition tolerance). Here is why: an ML prediction based on slightly stale data is almost always better than no prediction at all. A recommendation system that returns yesterday's personalization is better than an error. A fraud model that uses features from 5 minutes ago is better than timing out.

The exceptions - where consistency matters more than availability - are precisely the cases where ML decisions have direct financial or safety consequences, and we will cover those explicitly.
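One way to make the AP choice concrete is a feature lookup that degrades instead of failing. The sketch below assumes a hypothetical `fetch` callable standing in for a feature store client, plus hypothetical default values; it is not any particular store's API.

```python
from typing import Callable

# Hypothetical neutral defaults used when the store cannot be reached
DEFAULT_FEATURES = {"tx_count_1h": 0.0, "avg_basket_value": 25.0}


def get_features_ap(
    fetch: Callable[[str], dict],
    last_known: dict[str, dict],
    user_id: str,
) -> tuple[dict, str]:
    """Prefer fresh features, then the last cached copy, then defaults.

    Returns the features plus a label describing where they came from,
    so the caller can monitor how often it serves degraded data.
    """
    try:
        features = fetch(user_id)
        last_known[user_id] = features  # refresh the local fallback copy
        return features, "fresh"
    except (TimeoutError, ConnectionError):
        if user_id in last_known:
            return last_known[user_id], "stale"
        return dict(DEFAULT_FEATURES), "default"


def healthy_store(user_id: str) -> dict:
    return {"tx_count_1h": 3.0, "avg_basket_value": 41.5}


def partitioned_store(user_id: str) -> dict:
    raise TimeoutError("feature store unreachable")


cache: dict[str, dict] = {}
first = get_features_ap(healthy_store, cache, "u1")       # store reachable
second = get_features_ap(partitioned_store, cache, "u1")  # partition, cached copy exists
third = get_features_ap(partitioned_store, cache, "u2")   # partition, nothing cached
```

A CP design would re-raise in the `except` branch instead of answering, which is the behavior the exceptions discussed later call for.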

Consistency in ML Systems: Three Distinct Concerns

Traditional databases have one primary consistency concern: read-after-write consistency. ML systems have three separate concerns, and confusing them is common.

1. Feature Store Consistency

The feature store serves features during model inference. Consistency concerns:

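Read-your-writes: If a user updates their profile, should their next prediction see the updated features? For personalization, usually yes, but the store has to be set up to guarantee it. In replicated deployments of stores such as Redis or Cassandra, this requires care: either the write and the subsequent read go to the same replica, or, where the store supports it, the read uses a consistency level that overlaps with the write (for example quorum reads and writes in Cassandra).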

Replica lag: Feature stores often replicate data across availability zones for resilience. During replication, different replicas may have different feature values. If two sequential requests from the same user hit different replicas, they may receive different feature values - is this acceptable?

For most ML use cases: yes. A recommendation model that occasionally uses 30-second-stale features is fine. But for fraud detection where velocity features measure recent transaction counts, 30-second lag can mean missing an ongoing fraud campaign.

from enum import Enum
from dataclasses import dataclass

class ConsistencyLevel(Enum):
    EVENTUAL = "eventual"       # Any replica, may be stale
    LOCAL_QUORUM = "local_quorum"   # Majority of local replicas
    STRONG = "strong"           # All replicas, guaranteed fresh

@dataclass
class FeatureGroup:
    name: str
    staleness_tolerance_seconds: float
    consistency_level: ConsistencyLevel

# Feature groups with different consistency requirements
feature_groups = [
    FeatureGroup(
        name="user_preferences",
        staleness_tolerance_seconds=3600,  # 1 hour OK
        consistency_level=ConsistencyLevel.EVENTUAL
    ),
    FeatureGroup(
        name="transaction_velocity_1h",
        staleness_tolerance_seconds=60,   # 1 minute max
        consistency_level=ConsistencyLevel.LOCAL_QUORUM  # stricter
    ),
    FeatureGroup(
        name="card_fraud_score_realtime",
        staleness_tolerance_seconds=5,    # 5 seconds max
        consistency_level=ConsistencyLevel.STRONG  # strongest
    )
]

class AdaptiveFeatureStore:
    """
    Feature store that adapts consistency level based on feature group requirements.
    Uses Cassandra's tunable consistency as the underlying mechanism.
    """
    def __init__(self, cassandra_session):
        self.session = cassandra_session

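    def get_features(self, user_id: str, feature_group: FeatureGroup) -> dict:
        from cassandra.query import SimpleStatement

        consistency = self._map_consistency(feature_group.consistency_level)
        # The table name comes from trusted FeatureGroup config, never from user input.
        # The consistency level is set on the statement, not passed to execute().
        statement = SimpleStatement(
            f"SELECT * FROM {feature_group.name} WHERE user_id = %s LIMIT 1",
            consistency_level=consistency,
        )
        row = self.session.execute(statement, (user_id,)).one()
        # Rows are namedtuples under the driver's default row factory
        return row._asdict() if row is not None else {}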

    def _map_consistency(self, level: ConsistencyLevel):
        """Map to Cassandra consistency levels."""
        from cassandra import ConsistencyLevel as CL
        mapping = {
            ConsistencyLevel.EVENTUAL: CL.ONE,
            ConsistencyLevel.LOCAL_QUORUM: CL.LOCAL_QUORUM,
            ConsistencyLevel.STRONG: CL.ALL
        }
        return mapping[level]

2. Training-Serving Skew: The Deep Consistency Problem

Training-serving skew is the most insidious consistency problem in ML. It is not about database replication lag - it is about semantic consistency between the training environment and the serving environment.

Definition: Training-serving skew occurs when the statistical distribution of features seen during model training differs from the statistical distribution seen during model serving.

This happens through several mechanisms:

Temporal skew: Features computed at training time use data from the past. Features computed at serving time use current data. If the world has changed, they will diverge.

# Illustrating temporal skew detection
import numpy as np
from scipy.stats import ks_2samp

class TrainingServingSkewDetector:
    """
    Detect when serving feature distribution diverges from training distribution.
    Uses Kolmogorov-Smirnov test for numerical features.
    """
    def __init__(self, alert_threshold: float = 0.05):
        self.alert_threshold = alert_threshold  # KS p-value threshold
        self.training_distributions: dict[str, np.ndarray] = {}

    def register_training_distribution(self, feature_name: str,
                                       training_values: np.ndarray):
        """Call this when training data is prepared."""
        self.training_distributions[feature_name] = training_values

    def check_serving_distribution(self, feature_name: str,
                                   serving_values: np.ndarray) -> dict:
        """
        Compare serving feature distribution against training.
        Returns alert if distribution has shifted significantly.
        """
        if feature_name not in self.training_distributions:
            return {"status": "no_baseline", "feature": feature_name}

        training = self.training_distributions[feature_name]
        # KS test: null hypothesis is that distributions are identical
        ks_statistic, p_value = ks_2samp(training, serving_values)

        is_skewed = p_value < self.alert_threshold
        return {
            "feature": feature_name,
            "ks_statistic": float(ks_statistic),
            "p_value": float(p_value),
            "skew_detected": is_skewed,
            "severity": "high" if ks_statistic > 0.3 else "medium" if ks_statistic > 0.1 else "low"
        }

    def check_all_features(self, serving_sample: dict[str, np.ndarray]) -> list[dict]:
        alerts = []
        for feature_name, values in serving_sample.items():
            result = self.check_serving_distribution(feature_name, values)
            if result.get("skew_detected"):
                alerts.append(result)
        return sorted(alerts, key=lambda x: x.get("ks_statistic", 0), reverse=True)
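The KS test is one option; the Population Stability Index (PSI) is another drift metric that can serve the same purpose. The following is a minimal pure-Python sketch, assuming equal-width bins over the training range and a small epsilon to avoid taking the log of zero. The function name, bin count, and sample data are illustrative assumptions and are not part of the detector above.

```python
import math


def population_stability_index(
    expected: list[float],
    actual: list[float],
    bins: int = 10,
    eps: float = 1e-6,
) -> float:
    """PSI = sum over bins of (a_i - e_i) * ln(a_i / e_i),
    where e_i and a_i are the fractions of training and serving values in bin i.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant feature

    def bin_fractions(values: list[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            # Clamp so serving values outside the training range land in the edge bins
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # Floor each fraction at eps so empty bins do not produce log(0)
        return [max(c / len(values), eps) for c in counts]

    e = bin_fractions(expected)
    a = bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ai, ei in zip(a, e))


training = [i / 100 for i in range(1000)]     # spread evenly over [0, 10)
same = [i / 100 for i in range(1000)]         # identical distribution
shifted = [5 + i / 200 for i in range(1000)]  # concentrated in the upper half

psi_same = population_stability_index(training, same)
psi_shifted = population_stability_index(training, shifted)
```

Unlike a p-value, PSI does not shrink toward zero simply because the serving sample is large, which can make it easier to threshold on high-traffic features. What counts as an alarming value is a judgment call to calibrate per feature.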

Preprocessing skew: If the model was trained with a specific preprocessing pipeline (normalization, imputation strategy, encoding scheme) and the serving pipeline uses a different implementation, features will be systematically different.

# The right way: share preprocessing code between training and serving
# Using scikit-learn Pipelines or custom transformers that can be serialized

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
import pickle

class ConsistentPreprocessor:
    """
    Same preprocessing logic for training and serving.
    Fit once (during training), apply identically at serving time.
    """
    def __init__(self):
        self.pipeline = Pipeline([
            ('imputer', SimpleImputer(strategy='median')),
            ('scaler', StandardScaler()),
        ])
        self._fitted = False

    def fit_and_transform(self, training_data) -> object:
        """Training time: fit on training data, save state."""
        result = self.pipeline.fit_transform(training_data)
        self._fitted = True
        return result

    def transform(self, serving_data) -> object:
        """Serving time: apply identical transformations."""
        if not self._fitted:
            raise ValueError("Preprocessor not fitted - call fit_and_transform first")
        return self.pipeline.transform(serving_data)

    def save(self, path: str):
        """Serialize the fitted pipeline for deployment."""
        with open(path, 'wb') as f:
            pickle.dump(self.pipeline, f)

    @classmethod
    def load(cls, path: str) -> 'ConsistentPreprocessor':
        """Load fitted pipeline at serving time."""
        preprocessor = cls()
        with open(path, 'rb') as f:
            preprocessor.pipeline = pickle.load(f)
        preprocessor._fitted = True
        return preprocessor

# Training:
preprocessor = ConsistentPreprocessor()
X_train_processed = preprocessor.fit_and_transform(raw_training_data)
preprocessor.save("preprocessor_v3.pkl")  # Save alongside model

# Serving (EXACT SAME LOGIC):
preprocessor = ConsistentPreprocessor.load("preprocessor_v3.pkl")
X_serving = preprocessor.transform(raw_serving_data)  # Same transformations guaranteed

Label leakage / point-in-time skew: The most subtle form - using future information to compute training features. If a training example has a label from January 15th but uses features computed with data available on February 1st (when training was run), those features contain information that wasn't available when the label was generated.

# WRONG: Feature computed with data from AFTER the label timestamp
# This is the training leakage that Spotify, Uber, and many others have
# discovered in retrospective audits of their ML systems

def wrong_feature_join(events_df, user_features_df):
    """
    Classic mistake: join on user_id only, ignoring time.
    The user_features_df has features computed at CURRENT time,
    not at the time of each event.
    """
    return events_df.join(user_features_df, on='user_id', how='left')

# CORRECT: Point-in-time correct join
def correct_point_in_time_join(events_df, user_feature_history_df):
    """
    For each event, get user features as of just before the event timestamp.
    Requires feature history (snapshots at regular intervals) to be stored.

    In SQL (e.g., using Spark SQL or BigQuery):
    SELECT
        e.*,
        f.feature_value
    FROM events e
    LEFT JOIN user_feature_history f
        ON e.user_id = f.user_id
        AND f.snapshot_time = (
            SELECT MAX(snapshot_time)
            FROM user_feature_history
            WHERE user_id = e.user_id
            AND snapshot_time <= e.event_time
        )
    """
    # PySpark implementation using window functions
    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

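    # Pair each event with every earlier snapshot for the same user, then keep
    # only the most recent snapshot per (user_id, event_time). Partitioning by
    # user_id alone would keep a single row per user across all of their events.
    window = (
        Window.partitionBy("user_id", "event_time")
        .orderBy(F.desc("snapshot_time"))
    )
    features_with_rank = (
        user_feature_history_df
        .join(
            events_df.select("user_id", "event_time").distinct(),
            on="user_id"
        )
        .filter(F.col("snapshot_time") <= F.col("event_time"))
        .withColumn("rank", F.row_number().over(window))
        .filter(F.col("rank") == 1)
    )
    # Left join keeps events that have no earlier snapshot (their features are null)
    return events_df.join(
        features_with_rank.drop("rank"),
        on=["user_id", "event_time"],
        how="left",
    )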
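The same "latest snapshot at or before the event" rule can be checked on a small in-memory example, which may be useful as a unit test for the distributed join. This is a sketch with hypothetical data and helper names, using only the standard library's `bisect`.

```python
from bisect import bisect_right

# Hypothetical feature history: user_id -> snapshots sorted by snapshot_time
feature_history = {
    "u1": [(100, 0.10), (200, 0.40), (300, 0.90)],  # (snapshot_time, feature_value)
}


def feature_as_of(user_id: str, event_time: int):
    """Return the latest feature value with snapshot_time <= event_time, or None."""
    snapshots = feature_history.get(user_id, [])
    times = [t for t, _ in snapshots]
    # bisect_right gives the number of snapshots taken at or before event_time
    idx = bisect_right(times, event_time)
    return snapshots[idx - 1][1] if idx > 0 else None


events = [("u1", 50), ("u1", 200), ("u1", 250), ("u2", 250)]
training_rows = [(user, ts, feature_as_of(user, ts)) for user, ts in events]
```

The event at time 250 receives the snapshot from time 200 and never the one from time 300, which is exactly the property the join on `user_id` alone violates. Events with no earlier snapshot get `None` rather than being dropped.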

3. Model Version Consistency

When you deploy a new model version, there is a period where multiple versions serve traffic simultaneously. This is intentional (gradual rollout, canary deployment) but creates consistency questions:

If a user's first request is served by model v1 and their second by model v2, do they see different recommendations? Is that acceptable?
Can a single user session straddle a model deployment?

import hashlib
import random

class ModelVersionRouter:
    """
    Route requests to model versions during gradual rollout.
    Supports sticky routing (same user always gets same version) and random routing.
    """
    def __init__(self, versions: dict[str, float]):
        """
        versions: {"v1": 0.9, "v2": 0.1}  # 90% v1, 10% v2
        """
        self.versions = versions
        # Validate weights sum to 1
        assert abs(sum(versions.values()) - 1.0) < 0.001

    def route_random(self) -> str:
        """Random routing: each request independently assigned."""
        r = random.random()
        cumulative = 0.0
        for version, weight in self.versions.items():
            cumulative += weight
            if r < cumulative:
                return version
        return list(self.versions.keys())[-1]

    def route_sticky(self, user_id: str) -> str:
        """
        Sticky routing: same user always gets same model version.
        Uses hash of user_id for deterministic assignment.
        Critical for: A/B testing consistency, personalization continuity.
        """
        hash_value = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
        normalized = (hash_value % 10_000) / 10_000.0  # 0 to 0.9999

        cumulative = 0.0
        for version, weight in self.versions.items():
            cumulative += weight
            if normalized < cumulative:
                return version
        return list(self.versions.keys())[-1]

# Example: 10% canary rollout with sticky routing
router = ModelVersionRouter({"v3.1": 0.90, "v3.2": 0.10})

# Same user always gets same version during rollout
user_id = "user_abc123"
versions_seen = {router.route_sticky(user_id) for _ in range(100)}
assert len(versions_seen) == 1  # deterministic!
print(f"User {user_id} always gets: {router.route_sticky(user_id)}")

Eventual Consistency in Feature Stores

Most ML feature stores implement eventual consistency: writes to the feature store will eventually propagate to all replicas, but reads may return stale data during propagation.

This is an intentional design choice, not a limitation. The trade-off: eventual consistency enables higher availability, lower write latency, and geographic distribution that strong consistency cannot achieve without prohibitive coordination costs.

The key question: how stale is acceptable for each feature?

import logging
from dataclasses import dataclass
from typing import Callable

logger = logging.getLogger(__name__)

@dataclass
class FeatureStalenessPolicy:
    """Define and enforce staleness policies for feature groups."""
    feature_group: str
    max_staleness_seconds: float
    action_on_stale: str  # "use_stale", "recompute", "use_default", "raise_error"
    default_value: float | None = None

class FeatureStalenessEnforcer:
    def __init__(self, policies: list[FeatureStalenessPolicy]):
        self.policies = {p.feature_group: p for p in policies}

    def get_or_enforce(
        self,
        feature_group: str,
        cached_value: float | None,
        cached_timestamp: float,
        current_time: float,
        compute_fn: Callable[[], float],
    ) -> float | None:
        """Return feature value, enforcing staleness policy."""
        policy = self.policies.get(feature_group)
        if policy is None:
            return cached_value  # No policy: use whatever we have

        age_seconds = current_time - cached_timestamp
        if age_seconds <= policy.max_staleness_seconds:
            return cached_value  # Fresh enough

        # Feature is too stale - apply policy
        if policy.action_on_stale == "use_stale":
            # Use it anyway, but leave a trace so the staleness is visible
            logger.warning("Feature %s is %.0fs stale; serving it anyway",
                           feature_group, age_seconds)
            return cached_value
        elif policy.action_on_stale == "recompute":
            return compute_fn()  # Recompute on the fly (expensive!)
        elif policy.action_on_stale == "use_default":
            return policy.default_value  # Safe fallback
        elif policy.action_on_stale == "raise_error":
            raise ValueError(f"Feature {feature_group} is {age_seconds:.0f}s stale, max {policy.max_staleness_seconds}s")
        return cached_value  # Unknown action: fall back to the cached value

# Define policies
enforcer = FeatureStalenessEnforcer([
    FeatureStalenessPolicy("user_prefs", 3600, "use_stale"),          # 1 hour: use anyway
    FeatureStalenessPolicy("tx_velocity_1h", 60, "recompute"),        # 1 min: must be fresh
    FeatureStalenessPolicy("device_risk_score", 300, "use_default", 0.5),  # 5 min: safe default
])

When Consistency Actually Matters in ML

Most ML systems can tolerate eventual consistency. But there are domains where the cost of incorrect predictions from stale data is high enough to require stronger guarantees.

Financial Fraud Detection

A real-time fraud detection system computes velocity features: "how many transactions has this card made in the last 60 minutes?" If the feature store returns stale velocity data, the model may approve a transaction from a card that has already made 20 transactions in the last hour - a classic card-testing pattern.

Requirement: velocity features must be strongly consistent within the transaction processing window. No stale reads.

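Architecture implication: Use a strongly consistent read path for velocity features specifically (for example DynamoDB with strongly consistent reads, or a single Redis primary that handles both the writes and the reads for these keys), even if you use eventually consistent stores for other features. Note that Redis replication is asynchronous by default, so reads served from Redis replicas can be stale.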

Medical Decision Support

A clinical decision support system that recommends drug dosages based on a patient's recent lab values must never silently use stale values. A potassium level from 4 hours ago may no longer reflect the state of a rapidly changing patient.

Requirement: lab value features must have provenance (timestamp of the reading) exposed to the model and to the clinician, with explicit staleness warnings.
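A sketch of what exposing provenance could look like: the feature travels with its measurement time, and the serving layer attaches an explicit staleness flag instead of handing over a bare number. The `LabFeature` type, the field names, and the two-hour limit are hypothetical and chosen only for illustration, not clinical guidance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabFeature:
    name: str
    value: float
    unit: str
    measured_at: float  # epoch seconds when the sample was taken


def describe(feature: LabFeature, now: float, max_age_seconds: float) -> dict:
    """Return the value together with its age and an explicit staleness flag."""
    age = now - feature.measured_at
    return {
        "name": feature.name,
        "value": feature.value,
        "unit": feature.unit,
        "age_minutes": round(age / 60, 1),
        "stale": age > max_age_seconds,
    }


potassium = LabFeature("potassium", 4.1, "mmol/L", measured_at=0.0)
recent = describe(potassium, now=30 * 60, max_age_seconds=2 * 3600)
old = describe(potassium, now=4 * 3600, max_age_seconds=2 * 3600)
```

Both the model input and the clinician-facing display can consume the same record, so the age of the reading is never lost between the store and the decision.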

Multi-Region Consistency and the Read-Your-Writes Problem

Personalization systems face a subtle consistency problem: a user updates their preferences (writes to Region A's feature store), then immediately loads their homepage (read from Region B's feature store, which hasn't yet received the replication). They see their old preferences. This violates "read-your-writes" consistency.

Solutions:

Session affinity: Route a user's requests to the same region for the duration of a session. Writes and reads hit the same replica.
Write-then-read token: After a write, return a token encoding the write's timestamp. The read request includes this token and the serving layer waits until that timestamp is replicated before serving.
Optimistic consistency: After a user updates preferences, show their new preferences immediately (from the write path) without waiting for model serving to confirm.

class MultiRegionFeatureStore:
    """
    Implements read-your-writes consistency for cross-region feature stores.
    Uses a write-token pattern based on wall-clock timestamps, so it assumes
    clocks are closely synchronized across regions.
    """
    def __init__(self, local_store, remote_store):
        self.local = local_store
        self.remote = remote_store

    def write(self, user_id: str, features: dict) -> str:
        """Write features, return a consistency token."""
        write_timestamp = self._current_timestamp_ms()
        self.local.write(user_id, features, timestamp=write_timestamp)
        # Async replication to other regions happens in background
        return f"ts:{write_timestamp}"  # Token for client to include in next read

    def read(self, user_id: str, consistency_token: str | None = None) -> dict:
        """Read features, optionally waiting for a specific write to be visible."""
        if consistency_token:
            # Parse the required timestamp
            required_ts = int(consistency_token.split(":")[1])
            # Wait until local store has data at least as fresh as the write
            self._wait_for_replication(user_id, required_ts, timeout_ms=100)

        return self.local.read(user_id) or self.remote.read(user_id)

    def _wait_for_replication(self, user_id: str, required_ts: int, timeout_ms: int):
        """Poll until local store has replicated the required write."""
        import time
        deadline = time.time() + (timeout_ms / 1000)
        while time.time() < deadline:
            local_ts = self.local.get_timestamp(user_id)
            if local_ts and local_ts >= required_ts:
                return  # Replicated!
            time.sleep(0.005)  # 5ms polling interval
        # Timeout: proceed with potentially stale data rather than failing

    def _current_timestamp_ms(self) -> int:
        import time
        return int(time.time() * 1000)
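Session affinity, the first of the solutions listed above, can be sketched even more briefly. The example assumes a hypothetical list of region names and a session identifier available at the routing layer. It hashes the session to a fixed home region so that the write and the following read land on the same store.

```python
import hashlib

REGIONS = ["us-east", "eu-west", "ap-south"]  # hypothetical region names


def home_region(session_id: str, regions: list[str] = REGIONS) -> str:
    """Deterministically map a session to one region so its writes and reads co-locate."""
    digest = hashlib.sha256(session_id.encode()).digest()
    return regions[int.from_bytes(digest[:8], "big") % len(regions)]


session = "session-42"
write_region = home_region(session)  # where the preference update is sent
read_region = home_region(session)   # where the next page load is sent
```

A plain modulo remaps most sessions whenever the region list changes, so a production router would more likely use consistent hashing or an explicit session-to-region table. The token-based approach above remains useful as a fallback when a session has to move regions.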

Model Version Consistency: Avoiding Split-Brain Serving

During a model deployment, you inevitably have a period where multiple model versions are serving traffic. This is unavoidable and desirable - gradual rollouts reduce risk. But it requires explicit consistency decisions:

Challenge: Different model versions may have been trained with different feature sets, different preprocessing, or different output semantics. Serving them simultaneously is safe only if the differences are backward-compatible.

from dataclasses import dataclass

@dataclass
class ModelMetadata:
    version: str
    feature_schema: dict      # Required features and their types
    preprocessing_version: str  # Version of preprocessing pipeline
    output_schema: dict       # Output format and semantics

class SafeModelDeployer:
    """
    Validates that a new model version is safe to serve alongside the current version.
    Prevents inconsistent behavior during gradual rollout.
    """

    def validate_rollout_compatibility(
        self,
        current: ModelMetadata,
        candidate: ModelMetadata
    ) -> tuple[bool, list[str]]:
        """Check if candidate can safely serve alongside current version."""
        issues = []

        # Check feature compatibility
        current_features = set(current.feature_schema.keys())
        candidate_features = set(candidate.feature_schema.keys())

        removed_features = current_features - candidate_features
        if removed_features:
            issues.append(f"Candidate removed features used by current: {removed_features}")

        # Check for type changes in shared features
        for feature in current_features & candidate_features:
            if current.feature_schema[feature] != candidate.feature_schema[feature]:
                issues.append(f"Feature type mismatch: {feature}")

        # Check preprocessing compatibility
        if current.preprocessing_version != candidate.preprocessing_version:
            issues.append(
                f"Preprocessing mismatch: current={current.preprocessing_version}, "
                f"candidate={candidate.preprocessing_version}. "
                f"Ensure feature store serves correct version for each model."
            )

        # Check output schema compatibility
        if current.output_schema != candidate.output_schema:
            issues.append("Output schema changed - update client-side parsing before rollout")

        is_compatible = len(issues) == 0
        return is_compatible, issues

Production Architecture: The Consistency Budget

Different components of an ML system have different consistency requirements. Document them explicitly.

The consistency budget: allocate stronger consistency (more expensive, more coordination) to features where staleness would cause measurable harm. Use eventual consistency everywhere else.
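One lightweight way to document the budget is as data that both humans and monitoring code can read. The sketch below is an assumption about how such a table might look: the `BudgetEntry` fields and the `on_violation` labels are hypothetical, and the feature groups and tolerances reuse the examples from the feature store section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BudgetEntry:
    component: str
    consistency: str            # "eventual", "quorum", or "strong"
    max_staleness_seconds: float
    on_violation: str           # what the serving path does when the budget is exceeded


# Hypothetical budget using the feature groups from earlier in this lesson
CONSISTENCY_BUDGET = [
    BudgetEntry("user_preferences", "eventual", 3600, "use_stale"),
    BudgetEntry("transaction_velocity_1h", "quorum", 60, "recompute"),
    BudgetEntry("card_fraud_score_realtime", "strong", 5, "fail_closed"),
]


def strong_share(budget: list[BudgetEntry]) -> float:
    """Fraction of components that pay for strong consistency."""
    return sum(e.consistency == "strong" for e in budget) / len(budget)


def violations(budget: list[BudgetEntry], observed_age: dict[str, float]) -> list[str]:
    """Components whose observed staleness exceeds the documented budget."""
    return [
        e.component for e in budget
        if observed_age.get(e.component, 0.0) > e.max_staleness_seconds
    ]


observed = {
    "user_preferences": 120.0,
    "transaction_velocity_1h": 95.0,
    "card_fraud_score_realtime": 1.0,
}
over_budget = violations(CONSISTENCY_BUDGET, observed)
```

Keeping the budget in one reviewed file makes it visible when the share of strongly consistent components starts to creep upward.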

Common Mistakes

:::danger Assuming All Features Need Strong Consistency
The most expensive mistake in ML system design is over-engineering consistency. A recommendation system's user preference features do not need strong consistency - if you see yesterday's preferences for 1% of requests, the business impact is negligible. But you pay in latency, coordination overhead, and operational complexity for strong consistency. Reserve it for features where staleness causes measurable, significant harm.
:::

:::danger Silently Accepting Stale Data Without Monitoring
Eventual consistency is acceptable when you know how stale the data is. If your feature store has no freshness monitoring, you don't know whether features are 5 seconds stale or 5 hours stale - and you'll discover the latter only when model performance degrades. Add staleness monitoring to every feature group: emit metrics for "feature last updated N seconds ago" and alert when N exceeds your threshold.
:::

:::warning Not Testing Training-Serving Consistency
The most common source of training-serving skew is untested preprocessing logic that differs between the training pipeline (often Spark) and the serving pipeline (often Python). Test this explicitly: run the same raw inputs through both pipelines and assert the outputs are identical. Do this in CI before every model deployment.
:::
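Such a parity check might look like the sketch below. The two `*_normalize` functions are hypothetical stand-ins for the real batch and online pipelines, written so that they differ in one easily overlooked detail: population versus sample standard deviation.

```python
import math


def training_normalize(values: list[float]) -> list[float]:
    """Stand-in for the batch pipeline: z-score using the population std (divide by n)."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]


def serving_normalize(values: list[float]) -> list[float]:
    """Stand-in for the online pipeline: same intent, but sample std (divide by n - 1)."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))
    return [(v - mean) / std for v in values]


def max_abs_diff(a: list[float], b: list[float]) -> float:
    return max(abs(x - y) for x, y in zip(a, b))


def pipelines_match(raw: list[float], tol: float = 1e-9) -> bool:
    """The parity check to run in CI on a fixed set of raw inputs."""
    return max_abs_diff(training_normalize(raw), serving_normalize(raw)) <= tol


raw_inputs = [1.0, 2.0, 3.0, 4.0, 5.0]
parity_ok = pipelines_match(raw_inputs)
```

Here the check fails, which is the point: both functions look like "standardize the feature" in code review, and only a side-by-side comparison on identical inputs exposes the difference. A small tolerance is usually needed because two implementations rarely agree to the last floating-point bit.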

Interview Q&A

Q1: Explain CAP theorem and how it applies to ML feature stores.

CAP theorem states that a distributed system cannot simultaneously guarantee consistency (every read sees the latest write), availability (every request receives a non-error response), and partition tolerance (the system works despite network splits). In the presence of a partition - inevitable in distributed systems - you must choose between consistency and availability.

ML feature stores almost universally choose AP (availability + partition tolerance) over CP. A recommendation system that returns slightly stale user features is better than one that returns an error. The practical implication: stores such as Cassandra and DynamoDB let you choose consistency per request (Cassandra's tunable consistency levels, DynamoDB's optional strongly consistent reads), while a replicated Redis deployment gets a similar effect by routing reads to the primary instead of a replica. Production ML systems typically use eventual or local-quorum consistency for most features, reserving strong consistency for the small set of features where staleness has measurable business impact (fraud velocity, medical measurements).

Q2: What is training-serving skew and how do you prevent it?

Training-serving skew is when the feature distributions seen during model training differ from those seen during model serving, causing the model to perform worse in production than in offline evaluation.

Prevention requires multiple layers: First, share preprocessing code between training and serving. The Spark job that computes features for training and the Python service that computes features for serving should use the same logic - ideally the same codebase. Second, implement point-in-time correct training data: when building training examples, use feature snapshots that existed at the time of the label, not at the time of training. Third, monitor feature distributions continuously: compare the distribution of each feature at serving time against the distribution seen during training. Use KS tests or PSI (Population Stability Index) with automated alerting when distributions drift significantly.

Q3: A user updates their preferences and immediately loads the next page, but sees their old recommendations. What consistency problem is this and how do you fix it?

This is a read-your-writes consistency violation. The write (preference update) went to the primary replica; the subsequent read (homepage load) was served by a secondary replica that hadn't yet received the replication. This is an inherent property of eventually consistent systems.

Solutions in order of increasing complexity: (1) Session affinity - route the same user's requests to the same region/replica so writes and reads hit the same data source; (2) Write tokens - after a write, return a timestamp token to the client; the next read includes this token and the server waits until that timestamp's data is locally available; (3) Optimistic UI - update the client-side state immediately after the write without waiting for the model serving layer to confirm, giving the illusion of immediate consistency.

The right solution depends on the user impact. For preference updates, brief eventual inconsistency is usually acceptable - most users don't notice if their recommendations take 30 seconds to reflect a preference change. For financial transactions (account balance), read-your-writes is a hard requirement.

Q4: How do you handle model version consistency during a rolling deployment?

During a rolling deployment, different replicas may run different model versions simultaneously. The key requirement: ensure that the feature processing logic is compatible across versions.

The canonical approach: (1) Version your feature schemas alongside your models. A model can only serve traffic if its feature schema is satisfied by the current feature store. (2) Use sticky routing during rollout - route each user to the same model version for the duration of a session to avoid inconsistent behavior within a session. (3) Validate backward compatibility before rollout: check that the new model's required features are a subset of (or compatible with) the current feature schema. (4) Maintain the old model serving until the rollout is complete and validated, so you can instant-rollback without retraining.

For models with fundamentally different feature schemas (adding a new embedding type, removing a deprecated feature), deploy as a blue-green switch with validation period rather than a rolling update.

Q5: When does consistency matter more than availability in ML systems? Give a concrete example.

Consistency takes priority over availability when the cost of acting on incorrect (stale or inconsistent) data exceeds the cost of not acting at all.

Financial fraud detection: A fraud model that approves a transaction based on stale velocity features (missing recent fraudulent activity) causes direct monetary loss. Here, it is better to fail the fraud check (block the transaction pending re-evaluation) than to approve it based on 5-minute-stale data. Strong consistency for velocity features is worth the latency and complexity cost.

Clinical decision support: A dosing recommendation system must not use lab values from a previous patient visit when the patient's current values have changed significantly. If the current lab value is unavailable (replication lag), it is better to show "data unavailable - verify with clinician" than to show a stale value and risk a dosing error.

Financial account balance: A banking app must never show a balance that doesn't reflect the most recent transactions. Showing a stale balance that then changes after a transaction could confuse users into overdrafting. This requires strong consistency even at the cost of higher read latency.

The general rule: strong consistency is warranted when (1) decisions are irreversible (transactions, medical dosing), (2) incorrect decisions have significant financial or safety consequences, or (3) users expect immediate consistency as a product guarantee (account balances).

Summary

Consistency in ML systems is multi-dimensional: feature store consistency, training-serving semantic consistency, and model version consistency. CAP theorem provides the framework for understanding the fundamental trade-off between consistency and availability under network partitions - and for most ML use cases, the right choice is eventual consistency with explicit staleness budgets per feature group.

Training-serving skew is the uniquely dangerous consistency problem in ML: a silent divergence between training and serving environments that degrades model performance without raising obvious errors. Prevention requires shared preprocessing code, point-in-time correct training data, and continuous feature distribution monitoring.

The production principle: define your consistency requirements explicitly, per feature group, with maximum staleness tolerances and automated monitoring. Reserve strong consistency for features where staleness causes measurable business harm. Use eventual consistency everywhere else and build monitoring to ensure "eventual" actually means what you think it does.

:::tip Key Takeaway
Most ML systems are AP systems - availability and partition tolerance take priority over strict consistency. This is the right choice for most use cases. The discipline is not in choosing AP, but in understanding exactly how stale each feature group can be and building monitoring that catches when "eventual" has become "way too eventual."
:::
