What is feature consistency?

Ensuring identical features between training (offline) and serving (online).

Feature Consistency

The 8-Hour Overlap

A fraud detection team had built what they considered a robust system. The model was trained using Spark batch jobs. Features were served by a Python microservice. Both implementations claimed to compute transaction_velocity_24h - the number of transactions in the past 24 hours for a given user. Both implementations were code-reviewed. Both had unit tests. The model performance in staging was good.

Three months after deployment, a compliance audit required the team to compare every feature value served in production against its expected value recomputed from raw events. The audit found discrepancies on transaction_velocity_24h for users in specific geographic regions.

The root cause took two days to find. The Spark job, running on a cluster with UTC system time, anchored its 24-hour windows at UTC midnight: "the 24 hours ending at the most recent UTC midnight before the transaction." The Python microservice, deployed on servers configured for US Pacific time, computed "the 24 hours ending right now (Pacific time)." For users who transacted between midnight UTC (4 PM Pacific Standard Time) and midnight Pacific (8 AM UTC on the same UTC day) - an 8-hour window - the two implementations were counting different sets of transactions.

The difference ranged from 0% to 40% in absolute feature value, depending on transaction patterns. For 8 hours of every day, the serving feature was computing a different quantity than the training feature. The model was making predictions based on a distribution it had never been trained on. Nobody had noticed because the model was "mostly working" - the 40% errors were bounded to specific time windows, and the model's tree structure was robust enough to absorb some feature noise.

This is feature inconsistency. It is silent. It is pervasive. And it degrades model quality in ways that are nearly impossible to detect without explicit consistency testing.

Why This Exists

Feature inconsistency has a simple cause: the same logical feature is computed by multiple code paths. A Spark job for training. A Python microservice for serving. Sometimes a third path for A/B testing. Each implementation is written by a different engineer, at a different time, with different assumptions about edge cases.

The solution sounds obvious in retrospect: compute each feature exactly once, store the result, and have both training and serving read from the stored value. This is the single-computation path. It rules out inconsistencies caused by duplicate implementations by construction, because there is only one computation.

Feature stores (Feast, Tecton, Databricks Feature Store) exist largely to enforce this guarantee. A feature store defines a feature transformation once and ensures that the same transformation runs for both training data retrieval and online serving. The training pipeline says "give me historical values of transaction_velocity_24h" and the serving API says "give me the current value of transaction_velocity_24h" - both are served by the same feature store, which runs the same code.

The discipline of feature consistency is about understanding when the single-computation path is sufficient, when it isn't, and how to test and monitor for violations in production.

Sources of Inconsistency

Feature inconsistency can arise from any of the following:

| Source | Example | Impact |
| --- | --- | --- |
| Different implementations | Spark SQL `NULLIF` vs Python `None` handling | Silent null → non-null mismatch |
| Timezone handling | UTC in training, local time in serving | 8-hour window where values diverge |
| Window boundary definition | "24h ending at midnight" vs "24h ending now" | Different transaction counts |
| Null handling | Spark `SUM` over an empty or all-NULL group returns NULL; Python `sum([])` returns 0 | Model receives 0 vs NULL |
| Library version differences | scikit-learn TF-IDF normalization changed between versions | Feature distribution shift |
| Data source versions | Training on 2024 data; serving schema has new column defaults | Different null rates |
| Approximation shortcuts | Batch daily count / 24 vs true hourly window | Up to 95% error on bursty data |
| Rounding and precision | float32 in training (GPU) vs float64 in serving (Python) | Values near a split threshold can land on different sides in gradient boosted trees |
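
To make the window-boundary row concrete, here is a minimal sketch that counts the same events under the two definitions. It is an illustration only, not either production implementation; the function names and the sample events (one per hour plus a short burst just before the request) are hypothetical.

```python
from datetime import datetime, timedelta, timezone

def count_rolling_24h(events, now):
    """Count events in the 24 hours ending at `now`."""
    start = now - timedelta(hours=24)
    return sum(1 for ts in events if start < ts <= now)

def count_midnight_anchored_24h(events, now):
    """Count events in the 24 hours ending at the most recent UTC midnight."""
    end = now.astimezone(timezone.utc).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    start = end - timedelta(hours=24)
    return sum(1 for ts in events if start < ts <= end)

# Hypothetical data: one event per hour for 48 hours, plus a burst of
# five events in the 25 minutes before the request.
now = datetime(2024, 3, 2, 6, 0, tzinfo=timezone.utc)
hourly = [now - timedelta(hours=h) for h in range(48)]
burst = [now - timedelta(minutes=m) for m in (5, 10, 15, 20, 25)]
events = hourly + burst

rolling = count_rolling_24h(events, now)
anchored = count_midnight_anchored_24h(events, now)
print(rolling, anchored)  # the burst is visible only to the rolling window
```

On perfectly uniform traffic the two definitions return the same count, which is one reason this class of bug can pass staging: it only shows up when activity is bursty.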

The Single-Computation Path

The single-computation path is the architectural fix for all sources of inconsistency that arise from multiple implementations.

The feature transformation is defined once. The feature store runs it both for historical retrieval (training) and online serving. There is no second implementation.

In Feast, this looks like:

# feature_repo/features.py - the single source of truth
# Written against the Field/schema API of recent Feast releases;
# older releases used Feature(...) and features=[...] instead.
from datetime import timedelta

from feast import Entity, FeatureView, Field, FileSource, ValueType
from feast.types import Float32, Int64

user_entity = Entity(
    name="user_id",
    join_keys=["user_id"],
    value_type=ValueType.STRING,
    description="Unique user identifier",
)

# Assumes the source already holds the aggregated columns named in the schema.
transaction_source = FileSource(
    path="s3://my-bucket/transactions/",
    timestamp_field="event_timestamp",
)

user_velocity_features = FeatureView(
    name="user_velocity",
    entities=[user_entity],
    ttl=timedelta(hours=24),
    schema=[
        Field(name="tx_count_1h", dtype=Int64),
        Field(name="tx_sum_1h", dtype=Float32),
        Field(name="tx_count_24h", dtype=Int64),
        Field(name="tx_sum_24h", dtype=Float32),
    ],
    online=True,
    source=transaction_source,
)

Training pipeline reads historical values:

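from feast import FeatureStore
store = FeatureStore(repo_path=".")

# entity_df must carry a user_id column and an event_timestamp column;
# feature values are joined as of each row's timestamp (point-in-time join).
training_df = store.get_historical_features(
    entity_df=transactions_with_timestamps,
    features=["user_velocity:tx_count_1h", "user_velocity:tx_sum_1h"],
).to_df()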

Serving retrieves current values using the same store:

features = store.get_online_features(
    features=["user_velocity:tx_count_1h", "user_velocity:tx_sum_1h"],
    entity_rows=[{"user_id": "u12345"}],
).to_dict()

Same feature store, same feature definition, same values. Consistency is structural, not tested.
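
The idea can be illustrated without Feast. Below is a minimal in-memory sketch, assuming a hypothetical `TinyFeatureStore` class: each feature value is written once, and both the historical (training) read and the online (serving) read come from that single record. It is a toy for intuition, not a description of how Feast is implemented.

```python
import bisect
from collections import defaultdict

class TinyFeatureStore:
    """Hypothetical toy store: values are computed upstream and written once."""

    def __init__(self):
        # entity_id -> parallel lists kept sorted by timestamp
        self._timestamps = defaultdict(list)
        self._values = defaultdict(list)

    def write(self, entity_id, timestamp, value):
        """Record a computed feature value; writes may arrive out of order."""
        idx = bisect.bisect_right(self._timestamps[entity_id], timestamp)
        self._timestamps[entity_id].insert(idx, timestamp)
        self._values[entity_id].insert(idx, value)

    def get_historical(self, entity_id, as_of):
        """Training read: latest value written at or before `as_of`."""
        idx = bisect.bisect_right(self._timestamps[entity_id], as_of)
        return self._values[entity_id][idx - 1] if idx else None

    def get_online(self, entity_id):
        """Serving read: the most recent value."""
        values = self._values[entity_id]
        return values[-1] if values else None

toy_store = TinyFeatureStore()
toy_store.write("u12345", 100.0, 3)
toy_store.write("u12345", 200.0, 5)

as_of_150 = toy_store.get_historical("u12345", 150.0)  # value known at t=150
before_any = toy_store.get_historical("u12345", 50.0)  # nothing written yet
latest = toy_store.get_online("u12345")
print(as_of_150, before_any, latest)
```

The historical read never looks past `as_of`, which is the property that keeps training data free of values the serving path could not have seen at that moment.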

When Single-Computation Isn't Possible

Some features cannot be pre-computed and cached. They must be computed at request time from inputs that are only available in the request:

time_since_last_login - requires the current time, which isn't known until the request
price_delta_from_historical_avg - requires the item's current price from a live price feed
distance_to_nearest_store - requires the user's current GPS coordinates

For these on-demand features, separate implementations exist at training and serving time, and consistency must be tested explicitly rather than guaranteed architecturally.

The testing strategy:

Extract the transformation logic into a pure function with no side effects
Unit test the function with edge cases: NULLs, empty windows, future timestamps, negative values
Integration test: run the training pipeline and serving code with the same inputs, compare outputs
Regression test: check that the function produces identical results across Python versions, library versions, and environments

from typing import Optional

# The transformation is a pure function - testable in isolation
def time_since_last_login(
    current_timestamp: float,
    last_login_timestamp: Optional[float],
) -> float:
    """
    Compute seconds since last login.
    Returns 0.0 for new users with no login history.
    """
    if last_login_timestamp is None or last_login_timestamp <= 0:
        return 0.0
    delta = current_timestamp - last_login_timestamp
    if delta < 0:
        # Serving timestamp is before login timestamp - clock skew
        return 0.0
    return float(delta)


# Unit tests - plain asserts; pytest collects the test_* functions below

def test_time_since_login_normal():
    assert time_since_last_login(1000.0, 500.0) == 500.0

def test_time_since_login_null():
    assert time_since_last_login(1000.0, None) == 0.0

def test_time_since_login_zero():
    assert time_since_last_login(1000.0, 0.0) == 0.0

def test_time_since_login_clock_skew():
    # Serving timestamp before login timestamp (edge case)
    assert time_since_last_login(500.0, 1000.0) == 0.0

Feature Versioning

Feature definitions are versioned. A change to a feature's computation logic is a new version, not a patch in place. This is because the model is calibrated on the feature's distribution at training time - a change to the computation produces a different distribution, and the model's predictions will be wrong until it is retrained on the new distribution.

Semantic versioning for features:

v1.0 → v1.1: backward-compatible change (e.g., better null handling that changes a small fraction of values). Acceptable for gradual rollout. Retrain is recommended but may not be required if the change is small.
v1.x → v2.0: breaking change (e.g., fixing the UTC vs. Pacific timezone bug). The distribution changes significantly. Must retrain before deploying v2.0 features to production.

class FeatureVersion:
    """
    Versioned feature definition. Tracks computation logic changes.
    New model deployments must specify the feature version they were trained on.
    """
    def __init__(self, name: str, version: str):
        self.name = name
        self.version = version  # "2.0"

    @property
    def full_name(self) -> str:
        return f"{self.name}_v{self.version.replace('.', '_')}"


# In Redis, store features under versioned keys
# user:u12345:features:v1 - old features
# user:u12345:features:v2 - new features (after fix)

# Model v3 (trained on v2 features) reads from v2 key
# Model v2 (trained on v1 features) still reads from v1 key during transition
# Both feature versions are computed and stored simultaneously during migration
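
The key scheme in the comments above can be sketched as follows. This is a minimal illustration, assuming a plain dict (`kv`) as a stand-in for Redis and hypothetical helper names (`feature_key`, `major_of`, `dual_write`, `read_for_model`); the feature values are made up.

```python
def feature_key(entity_id: str, major_version: int) -> str:
    """Build a versioned key such as 'user:u12345:features:v2'."""
    return f"user:{entity_id}:features:v{major_version}"

def major_of(version: str) -> int:
    """Extract the major component from a 'MAJOR.MINOR' string."""
    return int(version.split(".")[0])

def dual_write(kv: dict, entity_id: str, values_by_version: dict) -> None:
    """Write every live feature version during a migration.

    `kv` is a plain dict standing in for Redis.
    """
    for version, values in values_by_version.items():
        kv[feature_key(entity_id, major_of(version))] = values

def read_for_model(kv: dict, entity_id: str, trained_on: str) -> dict:
    """Read the features matching the version the model was trained on."""
    key = feature_key(entity_id, major_of(trained_on))
    if key not in kv:
        # Fail loudly rather than fall back to a different version.
        raise KeyError(f"no features stored under {key}")
    return kv[key]

kv = {}
dual_write(
    kv,
    "u12345",
    {"1.1": {"tx_count_24h": 17}, "2.0": {"tx_count_24h": 12}},
)
old_model_features = read_for_model(kv, "u12345", trained_on="1.1")
new_model_features = read_for_model(kv, "u12345", trained_on="2.0")
print(old_model_features, new_model_features)
```

Raising on a missing key is a deliberate choice in this sketch: silently serving v1 values to a model trained on v2 would reintroduce the skew that versioning is meant to prevent.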

The FeatureConsistencyValidator

This class runs both implementations with identical inputs and diffs the outputs. It should run automatically at deployment time for any feature that has separate training and serving implementations.

import numpy as np
from typing import Callable, Dict, Any, List
from dataclasses import dataclass
import logging

logger = logging.getLogger(__name__)


@dataclass
class ConsistencyReport:
    feature_name: str
    n_samples: int
    n_mismatches: int
    mismatch_rate: float
    max_absolute_error: float
    mean_absolute_error: float
    samples_with_mismatch: List[Dict]
    passed: bool


class FeatureConsistencyValidator:
    """
    Validates that two implementations of the same feature agree on the same inputs.

    Use this:
    1. At deployment time to verify a new serving implementation matches training
    2. After fixing a bug - confirm the fix aligns training and serving
    3. As a regression test in CI for all on-demand features
    """

    def __init__(
        self,
        tolerance: float = 1e-6,  # Absolute tolerance for floating-point comparison
        mismatch_threshold: float = 0.001,  # Alert if more than 0.1% of samples disagree
    ):
        self.tolerance = tolerance
        self.mismatch_threshold = mismatch_threshold

    def validate(
        self,
        feature_name: str,
        training_fn: Callable,
        serving_fn: Callable,
        test_inputs: List[Dict[str, Any]],
    ) -> ConsistencyReport:
        """
        Run both functions on all test inputs and compare outputs.

        test_inputs: list of dicts, each containing the inputs for one sample
        training_fn: the function used in the training pipeline
        serving_fn: the function used in the serving system
        """
        mismatches = []
        absolute_errors = []

        for sample in test_inputs:
            training_val = training_fn(**sample)
            serving_val = serving_fn(**sample)

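            # Map None to NaN so that a None-vs-0 disagreement is reported
            # as a mismatch instead of being silently coerced to 0.0
            training_val = float("nan") if training_val is None else float(training_val)
            serving_val = float("nan") if serving_val is None else float(serving_val)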

            if np.isnan(training_val) or np.isnan(serving_val):
                # NaN in either is a mismatch unless both are NaN
                if not (np.isnan(training_val) and np.isnan(serving_val)):
                    mismatches.append({
                        "inputs": sample,
                        "training": training_val,
                        "serving": serving_val,
                        "error": "NaN mismatch",
                    })
                continue

            abs_error = abs(training_val - serving_val)
            absolute_errors.append(abs_error)

            if abs_error > self.tolerance:
                mismatches.append({
                    "inputs": sample,
                    "training": training_val,
                    "serving": serving_val,
                    "absolute_error": abs_error,
                    "relative_error": abs_error / max(abs(training_val), 1e-10),
                })

        n_samples = len(test_inputs)
        n_mismatches = len(mismatches)
        mismatch_rate = n_mismatches / max(n_samples, 1)
        max_err = max(absolute_errors) if absolute_errors else 0.0
        mean_err = np.mean(absolute_errors) if absolute_errors else 0.0

        passed = mismatch_rate <= self.mismatch_threshold

        report = ConsistencyReport(
            feature_name=feature_name,
            n_samples=n_samples,
            n_mismatches=n_mismatches,
            mismatch_rate=mismatch_rate,
            max_absolute_error=max_err,
            mean_absolute_error=mean_err,
            samples_with_mismatch=mismatches[:10],  # First 10 for debugging
            passed=passed,
        )

        if not passed:
            logger.error(
                f"Feature consistency check FAILED for '{feature_name}': "
                f"{n_mismatches}/{n_samples} samples disagree "
                f"({mismatch_rate:.2%} mismatch rate). "
                f"Max error: {max_err:.6f}"
            )
        else:
            logger.info(
                f"Feature consistency check PASSED for '{feature_name}': "
                f"{n_mismatches}/{n_samples} samples disagree "
                f"(within {self.mismatch_threshold:.2%} threshold)"
            )

        return report


# Example usage - the UTC vs. Pacific timezone bug scenario

# PST offset from UTC, hardcoded so the demo behaves the same on any machine
PACIFIC_OFFSET_SECONDS = -8 * 3600

def training_tx_velocity_24h(user_id: str, utc_timestamp: float, event_log) -> int:
    """Training implementation - uses UTC."""
    cutoff = utc_timestamp - 86400
    # The upper bound keeps events after the request time out of the window
    return sum(1 for ts, uid in event_log
               if uid == user_id and cutoff <= ts <= utc_timestamp)

def serving_tx_velocity_24h_buggy(user_id: str, utc_timestamp: float, event_log) -> int:
    """Buggy serving implementation - accidentally shifts the window by the local offset."""
    local_timestamp = utc_timestamp + PACIFIC_OFFSET_SECONDS
    cutoff = local_timestamp - 86400
    return sum(1 for ts, uid in event_log
               if uid == user_id and cutoff <= ts <= local_timestamp)

def serving_tx_velocity_24h_fixed(user_id: str, utc_timestamp: float, event_log) -> int:
    """Fixed serving implementation - now uses UTC consistently."""
    cutoff = utc_timestamp - 86400
    return sum(1 for ts, uid in event_log
               if uid == user_id and cutoff <= ts <= utc_timestamp)


# Generate test data
import time
base_ts = time.time()
sample_event_log = [(base_ts - i * 3600, "u001") for i in range(30)]  # 30 events, 1/hour

test_inputs = [
    {
        "user_id": "u001",
        "utc_timestamp": base_ts - (i * 3600),
        "event_log": sample_event_log,
    }
    for i in range(24)
]

validator = FeatureConsistencyValidator(tolerance=0.5)

# This should FAIL - buggy implementation diverges for Pacific timezone overlap
buggy_report = validator.validate(
    "tx_velocity_24h",
    training_fn=training_tx_velocity_24h,
    serving_fn=serving_tx_velocity_24h_buggy,
    test_inputs=test_inputs,
)
print(f"Buggy: passed={buggy_report.passed}, mismatches={buggy_report.n_mismatches}")

# This should PASS - fixed implementation matches training
fixed_report = validator.validate(
    "tx_velocity_24h",
    training_fn=training_tx_velocity_24h,
    serving_fn=serving_tx_velocity_24h_fixed,
    test_inputs=test_inputs,
)
print(f"Fixed: passed={fixed_report.passed}, mismatches={fixed_report.n_mismatches}")

Canary Testing for Feature Changes

When deploying a new version of a feature, route a small percentage of traffic to the new version and compare model outputs with the old version before full rollout.

import hashlib


class FeatureCanaryRouter:
    """
    Routes requests to either the stable or canary feature version.
    Uses deterministic hashing to ensure the same entity always
    goes to the same version within a canary window.
    """

    def __init__(self, canary_fraction: float = 0.01):
        self.canary_fraction = canary_fraction

    def _is_canary(self, entity_id: str) -> bool:
        """Deterministic assignment: same entity_id always maps to same bucket."""
        hash_val = int(hashlib.md5(entity_id.encode()).hexdigest(), 16)
        bucket = (hash_val % 10000) / 10000.0  # 0.0 to 1.0
        return bucket < self.canary_fraction

    def get_features(
        self,
        entity_id: str,
        stable_client,
        canary_client,
        feature_logger,
    ) -> dict:
        use_canary = self._is_canary(entity_id)
        features = canary_client.get_features(entity_id) if use_canary else stable_client.get_features(entity_id)
        version = "canary" if use_canary else "stable"
        feature_logger.log(entity_id=entity_id, features=features, version=version)
        return features

Feature shadowing is a related technique: compute both feature versions for every request, but only use one for the actual prediction. Log both values. Analyze the distribution difference offline before switching to the new version.
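
A minimal sketch of shadowing, under the assumption that each feature version is exposed as a callable taking an entity ID. The names `shadow_get_features`, `mean_abs_diff`, and the in-memory `shadow_log` list are hypothetical; a real system would write to a logging pipeline instead.

```python
def shadow_get_features(entity_id, stable_fn, shadow_fn, shadow_log):
    """Serve the stable features; compute and log the shadow version alongside."""
    stable = stable_fn(entity_id)
    try:
        shadow = shadow_fn(entity_id)
        error = None
    except Exception as exc:
        # A failing shadow path must never affect the live request.
        shadow = None
        error = repr(exc)
    shadow_log.append(
        {"entity_id": entity_id, "stable": stable, "shadow": shadow, "error": error}
    )
    return stable

def mean_abs_diff(shadow_log, feature_name):
    """Offline analysis: mean absolute difference for one feature."""
    diffs = [
        abs(rec["stable"][feature_name] - rec["shadow"][feature_name])
        for rec in shadow_log
        if rec["shadow"] is not None
    ]
    return sum(diffs) / len(diffs) if diffs else 0.0

# Hypothetical feature functions standing in for the two versions.
def stable_v1(entity_id):
    return {"tx_count_24h": 10}

def shadow_v2(entity_id):
    return {"tx_count_24h": 12}

shadow_log = []
served = shadow_get_features("u12345", stable_v1, shadow_v2, shadow_log)
diff = mean_abs_diff(shadow_log, "tx_count_24h")
print(served, diff)
```

Unlike canary routing, every request contributes a paired observation, so the comparison does not depend on the two traffic slices being statistically similar.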

Production Notes

Run consistency checks in CI: every PR that touches a feature transformation should trigger a consistency check that runs both implementations on a representative test dataset. Fail the build if the mismatch rate exceeds the configured threshold (the validator above defaults to 0.1%).

Log feature values alongside predictions: this is the only way to detect serving distribution drift after deployment. Compare logged serving values against recomputed training values monthly. Alert if feature mean or variance shifts by more than 10%.
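
The 10% rule above could be checked along these lines. This is a sketch under stated assumptions: `relative_shift` and `drift_alerts` are hypothetical helpers, the sample values are invented, and population variance is used for simplicity.

```python
from statistics import fmean, pvariance

def relative_shift(baseline: float, current: float) -> float:
    """Relative change of `current` against `baseline`."""
    if baseline == 0:
        return 0.0 if current == 0 else float("inf")
    return abs(current - baseline) / abs(baseline)

def drift_alerts(recomputed, served, threshold=0.10):
    """Return (statistic, shift) pairs whose relative shift exceeds `threshold`."""
    alerts = []
    mean_shift = relative_shift(fmean(recomputed), fmean(served))
    var_shift = relative_shift(pvariance(recomputed), pvariance(served))
    if mean_shift > threshold:
        alerts.append(("mean", mean_shift))
    if var_shift > threshold:
        alerts.append(("variance", var_shift))
    return alerts

# Hypothetical values for one feature: recomputed from raw events vs. logged at serving.
recomputed = [10.0, 12.0, 11.0, 9.0, 13.0]
served = [10.0, 12.0, 11.0, 9.0, 20.0]

alerts = drift_alerts(recomputed, served)
print([name for name, _ in alerts])
```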

Version your Redis keys during migrations: when deploying a new feature version, write both the old version (features:v1) and new version (features:v2) simultaneously for a migration period. The old model reads v1, the new model reads v2. Rollback is fast: route traffic back to the old model, which still reads v1.

:::danger "Close Enough" Consistency is Not Consistency A 1% error rate on feature values sounds acceptable. At 1 million predictions per day, that's 10,000 predictions per day receiving incorrect features. In fraud detection, this could be 10,000 uncaught frauds or 10,000 false positives. In recommendation, it's 10,000 recommendations for a distribution the model was never calibrated on.

Rounding differences, floating-point precision differences, and null handling differences all compound across features. If errors are independent, a model with 20 features, each with a 0.5% inconsistency rate, has roughly a 10% chance (1 - 0.995^20 ≈ 9.5%) of receiving at least one incorrect feature on any given prediction.

Set the consistency threshold at 0.01% (1 in 10,000 samples) and enforce it in CI. "Close enough" is not a valid architecture. :::
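
The arithmetic behind the two figures in the box above can be reproduced in a few lines. The sketch assumes errors in different features are independent, which real pipelines may not satisfy; `p_any_inconsistent` is a hypothetical helper name.

```python
def p_any_inconsistent(n_features: int, per_feature_rate: float) -> float:
    """P(at least one inconsistent feature), assuming independent errors."""
    return 1.0 - (1.0 - per_feature_rate) ** n_features

# 20 features, each inconsistent on 0.5% of predictions
p = p_any_inconsistent(20, 0.005)
print(f"{p:.4f}")  # → 0.0954

# A 1% error rate at 1 million predictions per day
affected_per_day = 0.01 * 1_000_000
print(round(affected_per_day))  # → 10000
```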

:::warning Fixing Consistency Bugs Requires Model Retraining When you fix a feature inconsistency - correct the timezone, fix the null handling, align the window boundaries - the model's predictions will shift. The model was trained on the buggy feature values. It has calibrated its decision boundaries to that distribution. Deploying the fixed feature without retraining can make model performance worse, not better, in the short term.

The correct process: fix the feature → retrain the model on the corrected feature → validate the new model offline → deploy the new model and new feature together. Never deploy a fixed feature under an existing model without understanding the distribution impact. :::

Interview Q&A

Q: What is training-serving skew, and why is it considered one of the most dangerous ML system problems?

Training-serving skew is the condition where the features a model receives at serving time have a different distribution than the features it was trained on - caused by different implementations, timezone bugs, null handling differences, or approximation shortcuts. It is dangerous because it is silent: the model continues to produce predictions, often with only minor visible degradation, while systematically making errors that are impossible to attribute to the feature issue without explicit consistency testing. By the time the impact is detected (usually months later through a compliance audit, A/B test regression, or model review), millions of incorrect predictions have been made.

Q: Describe the single-computation path and how feature stores enforce it.

The single-computation path means a feature transformation is defined once (in a feature store like Feast or Tecton), and both the training pipeline and the serving API use the feature store to retrieve values - there is no separate implementation for each. The training pipeline calls store.get_historical_features() and the serving code calls store.get_online_features(). Both call the same feature store, which applies the same transformation logic. Consistency is structural - it cannot drift unless the feature definition itself is changed, which requires an explicit version bump.

Q: How do you test feature consistency for an on-demand feature that cannot use the single-computation path?

Extract the transformation logic into a pure Python function with no side effects (no database calls, no current time, just inputs → output). Unit test this function with all edge cases: nulls, zeros, negative values, empty lists, future timestamps. Write an integration test that runs the training pipeline's version of the function and the serving code's version on the same inputs and compares outputs with tolerance. Run this comparison in CI on every PR. Run it as a pre-deployment gate. Additionally, shadow-log both implementations in production for a period before switching to the new version.

Q: A model's performance degraded 3 months after deployment with no code changes. You suspect feature inconsistency. How do you investigate?

First, establish that the degradation is correlated with feature values, not label distribution: compare the model's prediction distribution today vs. 3 months ago. If the prediction distribution has shifted but the label distribution hasn't, feature drift is likely. Second, pull serving feature logs (you do have serving feature logs) and compare feature value distributions between now and 3 months ago. A feature whose distribution has shifted without a code change indicates inconsistency or data source drift. Third, identify which features changed: recompute the feature from raw events for a sample of recent entities, compare to the served values. Large discrepancies indicate inconsistency. Fourth, trace the implementation: when did the feature start diverging? What changed? Correlate with deployment history.

Q: How does feature versioning work in practice when you need to fix a timezone bug in a feature?

Create a new feature version (v2.0) with the correct timezone handling. Deploy the streaming pipeline to write both user:u123:features:v1 (old, for existing model) and user:u123:features:v2 (new, for next model version) to Redis simultaneously. Retrain the model using historical feature values recomputed with the v2 logic. Validate the new model's offline performance on a held-out set. Deploy the new model with its serving code pointing to v2 keys. Maintain both versions in Redis for a transition period. Once the new model is fully rolled out and stable, stop writing the v1 keys and let them expire via TTL.

Q: What metrics do you monitor to catch feature consistency issues before they affect model performance?

Feature value distribution metrics: mean, standard deviation, percentiles (p50/p95/p99) for each feature, computed on serving logs daily. Alert on shifts greater than 2 standard deviations from the trailing 30-day average. Feature staleness: max(current_time - feature.computed_at) over all served features - alert if any feature is older than 2x its intended refresh interval. Null rate: alert if a feature's null rate increases by more than 1 percentage point. Prediction distribution: alert if the model's output distribution shifts (KL divergence from baseline). And explicitly: the output of the consistency validator run weekly against recomputed ground truth.
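
Three of these checks are simple enough to sketch directly. The helpers below (`null_rate`, `is_stale`, `kl_divergence`) are hypothetical names, the histogram buckets are assumed to be aligned between baseline and current, and the smoothing constant is an arbitrary choice to avoid division by zero.

```python
import math

def null_rate(values):
    """Fraction of served values that are None."""
    return sum(v is None for v in values) / len(values) if values else 0.0

def is_stale(computed_at, now, refresh_interval, factor=2.0):
    """True if a feature is older than `factor` times its refresh interval."""
    return (now - computed_at) > factor * refresh_interval

def kl_divergence(p_counts, q_counts, eps=1e-9):
    """KL(P || Q) over aligned histogram buckets, with additive smoothing."""
    if len(p_counts) != len(q_counts):
        raise ValueError("histograms must have the same number of buckets")
    p_total = sum(p_counts) + eps * len(p_counts)
    q_total = sum(q_counts) + eps * len(q_counts)
    kl = 0.0
    for p_c, q_c in zip(p_counts, q_counts):
        p = (p_c + eps) / p_total
        q = (q_c + eps) / q_total
        kl += p * math.log(p / q)
    return kl

# Hypothetical prediction-score histograms: baseline week vs. current week.
baseline_hist = [50, 50]
current_hist = [90, 10]

print(null_rate([1.0, None, 3.0, None]))
print(is_stale(computed_at=0.0, now=250.0, refresh_interval=100.0))
print(kl_divergence(baseline_hist, current_hist))
```

Each function returns a raw number or flag; the alert thresholds (2x refresh interval, 1 percentage point of null rate, a KL bound) are left to the caller because they depend on the feature.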
