
Feature Stores in Production

The 12% Accuracy Gap Nobody Could Explain

The fraud detection model had been performing well for eight months. Offline evaluation showed an F1 of 0.89. The business was happy. Then, during a routine audit, an analyst compared the model's production confusion matrix against the offline evaluation numbers. The gap was startling: online precision was 0.79, not 0.89. The model was generating 35% more false positives than expected. The fraud operations team had been manually reviewing and clearing these false alerts for months without escalating the discrepancy.

The investigation started with the obvious suspects: data distribution shift, label drift, changes in fraud patterns. None of them explained the magnitude. The confusion matrix shapes were similar; the magnitudes were consistently off across all merchant categories and user cohorts.

Three weeks in, a junior engineer noticed something in the serving code. The merchant_category_risk_score feature - a pre-computed risk score based on historical fraud rates per merchant category - was computed differently in training and serving. In training, it used a 90-day lookback window, joining merchant-level statistics computed as of each transaction date. In serving, it used a pre-computed static table refreshed once a week. When a merchant category experienced a fraud spike, the training feature would reflect it within 24 hours; the serving feature remained stale for up to 7 days.

This is training-serving skew. The feature store exists specifically to prevent it. The fix took one afternoon once the root cause was found. The loss - in analyst time, false alarm investigation, and undetected fraud - could not be fully quantified. The lesson was not forgotten.



Why This Exists: The Training-Serving Skew Problem

The core problem a feature store solves is deceptively simple. When you train a model, you compute features from historical data. When you serve that model, you compute features from live data. If these two computations differ in any way - different code, different aggregation windows, different null handling - the model sees different feature distributions at serving time than it trained on.

The gap can be subtle. Maybe training uses Python and serving uses Java, with floating-point differences in normalization. Maybe training computes a rolling average over exactly 7 days but serving uses a cached value pre-computed 3 days ago. Maybe the training pipeline includes a null-handling step that was forgotten when the serving implementation was written six months later by a different engineer.

Any of these differences produces a feature at serving time that is statistically different from what the model expected when its weights were learned. Performance degrades - sometimes catastrophically, sometimes gradually, almost always silently.

The feature store solution: define feature computation logic once, in a single place, and use that same logic for both offline (training) and online (serving) computation. One definition, two materializations.
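
A minimal sketch of "one definition, two materializations", using only the standard library. The function name `spend_7d_avg`, the sample transactions, and the half-open window convention are assumptions made for illustration; the point is only that the training path and the serving path call the same code with a different `as_of` time.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple

Transaction = Tuple[datetime, float]  # (timestamp, amount)


def spend_7d_avg(transactions: List[Transaction], as_of: datetime) -> Optional[float]:
    """Average amount over the 7 days ending at `as_of`; None if the window is empty."""
    window_start = as_of - timedelta(days=7)
    amounts = [amt for ts, amt in transactions if window_start < ts <= as_of]
    return sum(amounts) / len(amounts) if amounts else None


# Hypothetical transaction history for one user
history = [
    (datetime(2024, 1, 1), 50.0),
    (datetime(2024, 1, 10), 100.0),
    (datetime(2024, 1, 14), 200.0),
]

# Offline path: evaluate the feature as of a historical label timestamp
training_value = spend_7d_avg(history, as_of=datetime(2024, 1, 15))

# Online path: the same function, evaluated as of request time
serving_value = spend_7d_avg(history, as_of=datetime(2024, 1, 20))

print(training_value, serving_value)
```

Because both values come from one function, a change to the window or the null handling reaches training and serving together rather than drifting apart.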


Historical Context

The feature store concept was popularized by Uber's Michelangelo platform, described publicly in 2017. Michelangelo introduced a centralized feature repository where teams could register, discover, and share features - addressing not just training-serving skew but also feature duplication across teams. Before Michelangelo, each team at Uber implemented the same features independently in different languages for different models.

Feast (Feature Store) was open-sourced by Gojek in 2019 and became the de facto open-source standard. It introduced declarative feature definitions and a clean separation between offline and online store layers.

Tecton was founded in 2019 by members of the Michelangelo team, offering a managed feature store with built-in streaming features, feature monitoring, and enterprise access controls.

By 2022, feature stores had become standard in enterprise ML stacks. The architectures converged on the same basic pattern: declarative feature definitions, an offline store for training data generation, an online store for low-latency serving, and point-in-time correct retrieval.


Core Concepts

The Dual-Store Architecture

A production feature store has two layers serving fundamentally different purposes:

The offline store is a columnar data store - typically Parquet files on S3, BigQuery, or Snowflake. It contains feature values for all entities, across all historical time. Its purpose is generating training datasets: given a list of (entity, timestamp) pairs, retrieve the feature values that were available at each timestamp. Reads take minutes to hours, accessed by batch training jobs.

The online store is a key-value store - typically Redis, DynamoDB, or Bigtable. It contains only the latest feature values for each entity. Its purpose is low-latency serving: given a user_id, return current feature values in under 10ms. It does not need history; it only needs now.
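
The relationship between the two layers can be sketched with plain Python structures. Here `offline_rows` stands in for the historical store and `materialize_latest` is a hypothetical helper, not part of any feature store library; it derives the online view by keeping only the newest row per entity.

```python
from datetime import datetime
from typing import Dict, List, Tuple

# Offline store stand-in: append-only history of (entity, feature_timestamp, value)
offline_rows: List[Tuple[str, datetime, float]] = [
    ("u1", datetime(2024, 1, 10), 120.0),
    ("u1", datetime(2024, 1, 20), 135.0),
    ("u2", datetime(2024, 1, 14), 89.0),
]


def materialize_latest(
    rows: List[Tuple[str, datetime, float]]
) -> Dict[str, Tuple[datetime, float]]:
    """Build the online key-value view: the most recent value per entity."""
    online: Dict[str, Tuple[datetime, float]] = {}
    for entity, ts, value in rows:
        if entity not in online or ts > online[entity][0]:
            online[entity] = (ts, value)
    return online


online_store = materialize_latest(offline_rows)
print(online_store["u1"])  # the 2024-01-20 row; the 2024-01-10 row stays offline only
```

Since the online view is a pure function of the offline history, it can be rebuilt at any time by re-running the materialization.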

Point-in-Time Correct Joins

This is the most important and most frequently misunderstood concept in feature stores.

When building a training dataset, you have labeled events: "at timestamp T, user U performed action A with label L." To train a model, you need feature values for user U as they existed at timestamp T - not as they exist today, not as they existed when the training job ran last week.

If you join features naively using the current (latest) feature values, you leak future information into training. Features that incorporate data from after the label event make the model appear to perform better offline than it does in production - because in production, those future features are never available.

A point-in-time correct join retrieves feature values as of the label timestamp. This requires the offline store to maintain the full history of feature value changes, not just the latest snapshot.

import pandas as pd
from datetime import datetime

# Training labels: one row per labeled event
labels = pd.DataFrame({
    "user_id": ["u1", "u2", "u1", "u3"],
    "event_timestamp": [
        datetime(2024, 1, 15, 10, 0),
        datetime(2024, 1, 15, 14, 0),
        datetime(2024, 2, 1, 9, 0),
        datetime(2024, 2, 10, 11, 0),
    ],
    "is_fraud": [1, 0, 0, 1]
})

# Historical feature values with the time each value was computed
feature_history = pd.DataFrame({
    "user_id": ["u1", "u1", "u1", "u2", "u3"],
    "feature_timestamp": [
        datetime(2024, 1, 10),  # before u1's first event
        datetime(2024, 1, 20),  # between u1's two events - do NOT use for first label
        datetime(2024, 2, 5),   # after both of u1's events - do NOT use for either label
        datetime(2024, 1, 14),
        datetime(2024, 2, 8),
    ],
    "spend_7d_avg": [120.0, 135.0, 140.0, 89.0, 2100.0]
})

def point_in_time_join(
    labels: pd.DataFrame,
    features: pd.DataFrame,
    entity_col: str,
    label_ts_col: str,
    feature_ts_col: str
) -> pd.DataFrame:
    """
    For each label row, retrieve the most recent feature value
    that was available at or before the label timestamp.
    This is point-in-time correct retrieval.
    """
    result_rows = []

    for _, label_row in labels.iterrows():
        entity = label_row[entity_col]
        label_ts = label_row[label_ts_col]

        # Filter: same entity, feature was computed before or at the label timestamp
        available_features = features[
            (features[entity_col] == entity) &
            (features[feature_ts_col] <= label_ts)
        ]

        if available_features.empty:
            # No feature value was known at this point in time - return null
            feature_vals = {
                col: None for col in features.columns
                if col not in [entity_col, feature_ts_col]
            }
        else:
            # Take the most recent feature value that was available
            latest = available_features.sort_values(feature_ts_col).iloc[-1]
            feature_vals = {
                col: latest[col] for col in features.columns
                if col not in [entity_col, feature_ts_col]
            }

        result_rows.append({**label_row.to_dict(), **feature_vals})

    return pd.DataFrame(result_rows)


training_dataset = point_in_time_join(
    labels=labels,
    features=feature_history,
    entity_col="user_id",
    label_ts_col="event_timestamp",
    feature_ts_col="feature_timestamp"
)

print(training_dataset[["user_id", "event_timestamp", "spend_7d_avg", "is_fraud"]])
# u1 at 2024-01-15: gets spend_7d_avg=120.0 (from 2024-01-10)
# u1 at 2024-02-01: gets spend_7d_avg=135.0 (from 2024-01-20, most recent before event)
# Crucially: NOT 140.0 from 2024-02-05, which was AFTER the event
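
The row-by-row loop is easy to read but slow on large label sets. One vectorized alternative, sketched here as an option rather than as how any particular feature store implements it, is `pandas.merge_asof` with `direction="backward"`. The block redeclares the same sample data so that it runs on its own.

```python
import pandas as pd

labels = pd.DataFrame({
    "user_id": ["u1", "u2", "u1", "u3"],
    "event_timestamp": pd.to_datetime([
        "2024-01-15 10:00", "2024-01-15 14:00", "2024-02-01 09:00", "2024-02-10 11:00",
    ]),
    "is_fraud": [1, 0, 0, 1],
})

feature_history = pd.DataFrame({
    "user_id": ["u1", "u1", "u1", "u2", "u3"],
    "feature_timestamp": pd.to_datetime([
        "2024-01-10", "2024-01-20", "2024-02-05", "2024-01-14", "2024-02-08",
    ]),
    "spend_7d_avg": [120.0, 135.0, 140.0, 89.0, 2100.0],
})

# merge_asof requires both frames to be sorted by their time key
training_dataset = pd.merge_asof(
    labels.sort_values("event_timestamp"),
    feature_history.sort_values("feature_timestamp"),
    left_on="event_timestamp",
    right_on="feature_timestamp",
    by="user_id",           # match rows only within the same entity
    direction="backward",   # latest feature row at or before the label time
)

print(training_dataset[["user_id", "event_timestamp", "spend_7d_avg", "is_fraud"]])
```

The output is ordered by `event_timestamp` rather than by the original label order, so re-sort or carry an index column if row order matters downstream.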

Feast in Production

Feast is the most widely deployed open-source feature store. Its core abstractions:

  • Entity: The primary key of your features (user_id, product_id, driver_id)
  • FeatureView: A group of related features from a single data source, with defined schema and TTL
  • FeatureService: A named collection of features across FeatureViews - the contract a model uses
  • DataSource: The backing storage for offline or streaming feature computation

from datetime import datetime, timedelta
import pandas as pd
from feast import Entity, FeatureView, Field, FeatureStore, FileSource
from feast.data_format import ParquetFormat
from feast.types import Float32, Int64

# 1. Declare the entity
user = Entity(
    name="user_id",
    description="Customer identifier"
)

# 2. Declare the offline data source
user_spend_source = FileSource(
    path="s3://my-feature-store/user-spend-features/",
    timestamp_field="feature_timestamp",
    file_format=ParquetFormat()
)

# 3. Declare the feature view - single source of truth for the feature schema
user_spend_fv = FeatureView(
    name="user_spend_features",
    entities=[user],
    ttl=timedelta(days=1),  # maximum feature age; older values are treated as missing at retrieval
    schema=[
        Field(name="spend_7d_avg", dtype=Float32),
        Field(name="spend_30d_avg", dtype=Float32),
        Field(name="txn_count_7d", dtype=Int64),
        Field(name="spend_velocity", dtype=Float32),
    ],
    source=user_spend_source,
)

# 4. Apply definitions to the feature registry
store = FeatureStore(repo_path="./feature_repo")
store.apply([user, user_spend_fv])

# 5. Materialize features to the online store (Redis)
# Run this daily/hourly to keep online store fresh
store.materialize_incremental(end_date=datetime.now())

# 6. Generate training dataset - POINT-IN-TIME CORRECT
entity_df = pd.DataFrame({
    "user_id": ["u1", "u2", "u3"],
    "event_timestamp": pd.to_datetime([
        "2024-01-15 10:00:00",
        "2024-01-15 14:00:00",
        "2024-02-10 11:00:00"
    ])
})

training_df = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "user_spend_features:spend_7d_avg",
        "user_spend_features:txn_count_7d",
    ]
).to_df()

# 7. Retrieve online features for serving (same feature view, same definitions)
online_features = store.get_online_features(
    features=[
        "user_spend_features:spend_7d_avg",
        "user_spend_features:txn_count_7d",
    ],
    entity_rows=[{"user_id": "u1"}, {"user_id": "u2"}]
).to_dict()

Feature Serving Latency

The online store's latency budget is set by the overall API SLA. If your recommendation API must respond in 100ms and model inference takes 20ms, 80ms remain; budgeting roughly 50ms for feature retrieval leaves about 30ms of buffer for network and overhead. Redis typically serves feature lookups in 1–3ms; DynamoDB in 2–10ms; Bigtable in 5–20ms.

Latency failure modes to watch for:

  • Too many features per request: Serializing 500 features takes time. Profile your serialization overhead and prune unused features.
  • Large feature values: Storing 768-dimensional embedding vectors in Redis is expensive in both latency and memory. For embedding features, consider storing only the entity key and running a separate approximate nearest neighbor lookup.
  • Hot keys: If millions of requests all retrieve features for the same popular entity (a trending product), that key becomes a Redis hotspot. Implement client-side caching with a short TTL for high-traffic entities.
  • Cold start after failover: After a Redis restart without persistence, the online store is empty. Requests fall through to the offline store or a fallback. Plan a warm-up procedure (re-materializing from the offline store) before cutting traffic over to a new Redis instance.
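
The hot-key mitigation above can be sketched as a small in-process cache with a short TTL. `ShortTTLCache` and `fake_online_store_lookup` are hypothetical names used for illustration; a real deployment would wrap its actual online store client and would also need to bound the cache size.

```python
import time
from typing import Any, Callable, Dict, List, Tuple


class ShortTTLCache:
    """Client-side cache that re-fetches a key once its entry is older than the TTL."""

    def __init__(
        self,
        fetch: Callable[[str], Any],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str) -> Any:
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]  # fresh enough: skip the online store entirely
        value = self._fetch(key)
        self._entries[key] = (now, value)
        return value


# Stand-in for the online store client; records every call it receives
calls: List[str] = []

def fake_online_store_lookup(key: str) -> Dict[str, float]:
    calls.append(key)
    return {"spend_7d_avg": 120.0}


fake_now = [0.0]  # controllable clock so the example is deterministic
cache = ShortTTLCache(fake_online_store_lookup, ttl_seconds=1.0, clock=lambda: fake_now[0])

cache.get("product_42")  # miss: goes to the online store
cache.get("product_42")  # hit: served from local memory
fake_now[0] = 2.0
cache.get("product_42")  # entry expired: goes to the online store again

print(len(calls))
```

The TTL bounds how stale a cached value can be, so it should stay well below the freshness the model needs for that entity.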

Tecton vs. Feast vs. Hopsworks

| Dimension | Feast (OSS) | Tecton (SaaS) | Hopsworks |
| --- | --- | --- | --- |
| Deployment model | Self-hosted | Fully managed | Self-hosted or managed |
| Streaming features | Limited | First-class | First-class |
| Feature monitoring | Basic | Built-in alerts | Built-in dashboards |
| Transformation logic | External pipeline | Built-in Python | Built-in Python/SQL |
| Approx. cost | Infra only | $50K–$200K/year | Infra or $30K+/year |
| Best for | Lean teams, cost-sensitive | Enterprise, full lifecycle | Research orgs, flexibility |

The choice depends primarily on team size and streaming requirements. Teams with fewer than 20 data scientists and batch-only features: Feast works well. Teams with real-time features (fraud, recommendations) and strict SLAs: Tecton or Hopsworks pay for themselves by avoiding engineering time spent on infrastructure.


Feature Store Failure Modes

Silent Staleness

A feature store not actively monitored for freshness will serve stale features without any error. The batch pipeline fails, Redis does not get updated, and the model continues serving yesterday's features. From the model's perspective, everything looks fine. Silent degradation continues until an analyst notices a metrics regression.

Fix: Monitor max(feature_timestamp) per feature view continuously. Alert if any feature's freshness exceeds its TTL by more than a configurable threshold (e.g., TTL + 2 hours = hard alert).
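
A minimal sketch of that freshness check, assuming the latest `feature_timestamp` per feature view has already been queried from the store. The function name, the feature view names, and the 2-hour grace period are illustrative assumptions.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def stale_feature_views(
    latest_ts: Dict[str, datetime],
    ttls: Dict[str, timedelta],
    now: datetime,
    grace: timedelta = timedelta(hours=2),
) -> List[str]:
    """Return feature views whose newest value is older than TTL + grace."""
    return sorted(
        name for name, ts in latest_ts.items()
        if now - ts > ttls[name] + grace
    )


now = datetime(2024, 2, 10, 12, 0)

# max(feature_timestamp) per feature view (hypothetical values)
latest_ts = {
    "user_spend_features": datetime(2024, 2, 10, 6, 0),     # 6 hours old
    "merchant_risk_features": datetime(2024, 2, 8, 9, 0),   # 51 hours old
}
ttls = {
    "user_spend_features": timedelta(days=1),
    "merchant_risk_features": timedelta(days=1),
}

alerts = stale_feature_views(latest_ts, ttls, now)
print(alerts)
```

Run on a schedule shorter than the smallest TTL, the returned list can feed whatever alerting system is already in place.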

Dual-Write Inconsistency

When a batch pipeline writes to the offline S3 store and then writes to the online Redis store in sequence - without atomicity - a failure between the two writes leaves them inconsistent. The offline store has new data; the online store still has old data.

Fix: Design pipelines to write to the offline store first (append-only, safe to retry), then materialize the online store by reading from the offline store. The offline store is the source of truth; the online store is always derivable from it.

Schema Drift

A feature view's schema changes (column renamed, type changed) but downstream model serving code is not updated simultaneously. The model receives null values for renamed columns, or type-coercion errors for type changes.

Fix: Version feature schemas. Use backward-compatible changes (add columns, don't remove or rename) where possible. For breaking changes, version the feature view (user_spend_features_v2) and migrate consumers explicitly before deprecating the old version.
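
A compatibility check of this kind can run in CI before a feature view change is applied. The sketch below compares two schema versions expressed as plain `name -> dtype` dictionaries; `breaking_changes` is a hypothetical helper, and the dtype strings are placeholders rather than any library's type objects.

```python
from typing import Dict, List


def breaking_changes(old: Dict[str, str], new: Dict[str, str]) -> List[str]:
    """List changes that would break existing consumers; added columns are allowed."""
    problems: List[str] = []
    for name, dtype in old.items():
        if name not in new:
            problems.append(f"removed or renamed: {name}")
        elif new[name] != dtype:
            problems.append(f"type changed: {name} {dtype} -> {new[name]}")
    return problems


v1 = {"spend_7d_avg": "float32", "txn_count_7d": "int64"}

# Additive change: backward compatible
v1_plus = {**v1, "spend_30d_avg": "float32"}

# Rename plus type change: should ship as user_spend_features_v2 instead
v_bad = {"spend_avg_7d": "float32", "txn_count_7d": "float32"}

print(breaking_changes(v1, v1_plus))
print(breaking_changes(v1, v_bad))
```

A non-empty result is the signal to publish a new versioned feature view and migrate consumers, rather than editing the existing one in place.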


Production Engineering Notes

Feature TTL: Every feature must have a defined TTL - the maximum age at which it is still considered valid. A TTL of None means stale features will be served indefinitely on pipeline failure. Start with TTL = 2× the pipeline schedule (daily pipeline → 48h TTL) and tighten as you build confidence.

Access control: In multi-team environments, features may contain PII or business-sensitive signals. Implement feature-level access control - a recommendation model should not be able to read features from the fraud scoring system without explicit approval.

Disaster recovery: Document and practice the procedure to rebuild the online store from the offline store. This is the feature store equivalent of a database restore drill. It should take under 2 hours for a well-designed system.

Feature discovery: A feature store's secondary value is discoverability. Maintain descriptions, data owners, and usage metrics for every registered feature. When a new model needs a "7-day spend" feature, the team should find it in the registry rather than reimplementing it.


Common Mistakes

:::danger Computing features differently in training and serving
This is the most expensive mistake in production ML. If your training pipeline uses Python and your serving pipeline uses Java, or if they have subtly different aggregation windows, you have training-serving skew that degrades model performance silently. Every feature must have a single canonical computation, used identically in both offline training and online serving paths.
:::

:::danger Naive joins when building training data
Joining feature tables to label tables using JOIN ON user_id (without time constraints) retrieves current feature values, not historical ones. This leaks future information into training, producing optimistic offline metrics that never transfer to production. Always use point-in-time correct joins when building training datasets.
:::

:::warning Not defining feature TTL
A feature without a TTL will be served regardless of how old it is. If your batch pipeline fails for 3 days, the model silently receives 3-day-old features. Define TTLs for all features. Features older than their TTL should trigger an alert and optionally fall back to a default or null value.
:::

:::warning Treating the feature store as an afterthought
Teams that bolt on a feature store after multiple models are in production spend months migrating pipelines and finding training-serving skew they didn't know existed. The feature store architecture decision should happen before the first production model is deployed.
:::


Interview Q&A

Q: What is training-serving skew, and how does a feature store prevent it?

A: Training-serving skew is when the features computed during model training differ from the features served at inference time. It causes offline metrics to be systematically more optimistic than production performance - the model was tuned on features it will never see again. A feature store prevents this by maintaining a single canonical definition of each feature's computation logic, shared by both the offline training path and the online serving path. The offline store materializes historical feature values for training dataset generation; the online store materializes the latest values for low-latency serving. Both derive from the same definition, so drift between training and serving is structurally prevented.

Q: What is a point-in-time correct join and why does it matter?

A: A point-in-time correct join retrieves, for each labeled event at timestamp T, the feature values as they were known at time T: the most recent values computed at or before T, not the current values. Without it, you build training data using features that incorporate future information: data that was not available when the label event occurred. This creates data leakage, resulting in inflated offline metrics and poor production performance. Implementing point-in-time joins requires the offline store to maintain the full history of feature values over time, not just a snapshot of the latest values. Feast and Tecton both implement this natively.

Q: When would you choose Feast over Tecton?

A: Feast is the right choice when you want to minimize costs, have the engineering capacity to operate the infrastructure, and don't need built-in streaming features. It works well for teams with 5–20 data scientists, batch-only feature computation, and existing infrastructure (S3, Redis). Tecton makes sense when you need managed streaming features with sub-second freshness, built-in monitoring, and you're willing to pay $50K–$200K/year for reduced operational burden. The inflection point is when the cost of operating Feast - roughly 1–2 platform engineers dedicated to it - approaches Tecton's subscription price.

Q: How do you handle feature store outages in a serving system?

A: The approach depends on whether staleness is acceptable for the use case. For models that can tolerate stale features (product ranking, content recommendation): implement a multi-layer fallback - read from the online store, fall back to a local cache with a short TTL on miss, fall back to a safe default value if both fail. For models where staleness is dangerous (real-time fraud): fail closed - return an error or route to a simpler rule-based fallback rather than serve on stale features. In all cases, the offline store (S3/BigQuery) is the durable source of truth from which the online store can always be rebuilt.
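
The two policies can be sketched in one function. Everything here is illustrative: `get_features`, `FeatureUnavailable`, and `broken_store` are hypothetical names, the local cache is a plain dict (its short TTL is omitted for brevity), and treating a miss the same as an outage is a simplifying assumption.

```python
from typing import Callable, Dict, Optional

Features = Dict[str, float]


class FeatureUnavailable(Exception):
    """Raised when a fail-closed caller cannot get fresh features."""


def get_features(
    entity_id: str,
    online_lookup: Callable[[str], Optional[Features]],
    local_cache: Dict[str, Features],
    default: Features,
    fail_closed: bool,
) -> Features:
    try:
        features = online_lookup(entity_id)
        if features is not None:
            local_cache[entity_id] = features  # remember the last good value
            return features
    except ConnectionError:
        pass  # online store unreachable: fall through to the policy below

    if fail_closed:
        # Staleness is dangerous: let the caller route to a rule-based fallback
        raise FeatureUnavailable(entity_id)
    if entity_id in local_cache:
        return local_cache[entity_id]
    return default


def broken_store(entity_id: str) -> Optional[Features]:
    raise ConnectionError("online store unreachable")


cache = {"u1": {"spend_7d_avg": 120.0}}
default = {"spend_7d_avg": 0.0}

ranking_u1 = get_features("u1", broken_store, cache, default, fail_closed=False)
ranking_u9 = get_features("u9", broken_store, cache, default, fail_closed=False)

try:
    get_features("u1", broken_store, cache, default, fail_closed=True)
    fraud_outcome = "served"
except FeatureUnavailable:
    fraud_outcome = "routed to rule-based fallback"

print(ranking_u1, ranking_u9, fraud_outcome)
```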

Q: How do you monitor a feature store in production?

A: Monitor four dimensions. Freshness: is the latest feature value recent enough? Alert if now() - max(feature_timestamp) exceeds the feature's TTL. Completeness: what percentage of entity lookups return a non-null value? A drop in completeness usually indicates a pipeline failure. Distribution: has the statistical distribution of feature values shifted compared to a stable baseline? Compute PSI or KS statistics daily and alert on significant deviation. Latency: is the online store serving within SLA? Monitor P50, P95, and P99 and alert on threshold violations. Define explicit SLOs for all four dimensions for every production feature view.
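
For the distribution check, the Population Stability Index compares binned proportions of a baseline sample and a current sample: PSI is the sum over bins of (actual - expected) * ln(actual / expected). The sketch below uses only the standard library; the bin edges, the `eps` floor for empty bins, and the sample values are assumptions, and the alert threshold is left to the team to tune. It assumes both samples are non-empty.

```python
import math
from typing import List, Sequence


def psi(
    expected: Sequence[float],
    actual: Sequence[float],
    edges: Sequence[float],
    eps: float = 1e-6,
) -> float:
    """Population Stability Index over bins defined by ascending `edges`."""

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * (len(edges) + 1)
        for v in values:
            bin_index = sum(v >= e for e in edges)  # number of edges at or below v
            counts[bin_index] += 1
        total = len(values)
        # Floor empty bins at eps so the logarithm stays defined
        return [max(c / total, eps) for c in counts]

    e = proportions(expected)
    a = proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


baseline = [10, 20, 30, 40, 50, 60, 70, 80]
shifted = [v + 40 for v in baseline]  # simulated upward drift
edges = [25, 50, 75]

psi_same = psi(baseline, baseline, edges)
psi_shifted = psi(baseline, shifted, edges)
print(psi_same, psi_shifted)
```

Identical samples give a PSI of zero, and the value grows as mass moves between bins, so a daily job can compare each feature against its baseline and alert when the value crosses the chosen threshold.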

© 2026 EngineersOfAI. All rights reserved.