Skip to main content

:::tip 🎮 Interactive Playground Visualize this concept: Try the Feature Store Architecture demo on the EngineersOfAI Playground - no code required. :::

Online vs Offline Features

The 15-Point Accuracy Gap​

A fraud detection team spent three months building a gradient boosted tree model for transaction fraud. Offline evaluation on held-out historical data looked excellent - 94% accuracy, 0.91 AUC, clean precision-recall curves. They deployed with confidence.

Two weeks into production, the model was catching 79% of fraud cases. Fifteen percentage points had evaporated between the evaluation notebook and the production endpoint. The team dug into the cases the model was missing. The pattern became clear: the missed frauds were concentrated in fraud spikes - bursts of illegitimate transactions from a single account in a short window. The model was failing in exactly the situations it was supposed to catch.

The root cause traced back to a single feature: transaction_velocity_1h - the number of transactions the cardholder had submitted in the past hour. During offline training and evaluation, this feature was computed correctly from historical data: for each transaction, the pipeline looked back exactly one hour in the event log and counted. At the moment a fraud spike occurred, this feature correctly showed 15, 20, 30 transactions in the last hour.

In production, the feature was computed differently. The real-time serving infrastructure was expensive to maintain, so the team had taken a shortcut: they pre-computed transaction_velocity_24h (transactions in the past 24 hours) in a nightly batch job, then divided by 24 at serving time to approximate the hourly rate. For average users on average days, the approximation held reasonably. But for a fraud spike - where a compromised card fires 20 transactions in 45 minutes - the past-24-hour count was mostly legitimate activity from earlier in the day. Dividing by 24 yielded a velocity of roughly 1.2. The actual velocity in the past hour was 20. The model saw a quiet customer when it should have seen an emergency.

This gap - the training-serving gap - is the central problem that online feature engineering solves. The model was not wrong. The features it received at serving time were wrong.


Why This Exists​

Before streaming infrastructure became accessible, all ML features were batch features. You ran a Spark job overnight, computed feature values for every entity in your user table, and wrote a snapshot to a feature store. The model read from that snapshot.

This worked fine for features that change slowly: a user's age, their account tenure, their historical purchase count over 90 days, their geographic region. These features are the same at 9 AM as they were at midnight. Batch computation with a 24-hour refresh cycle introduces at most 24 hours of staleness - acceptable for slowly-changing signals.

The problem appeared when teams started using features that carry real-time signal. Fraud detection, recommendation systems, and dynamic pricing all depend on what is happening right now: what this user clicked in the last 5 minutes, how many transactions came through in the last hour, how many rides were requested in this zip code in the last 15 minutes. For these features, a 24-hour batch update introduces 24 hours of staleness - and the entire predictive value of the feature exists in the last few minutes, not the last few hours.

Online feature engineering is the answer: compute features from live event streams, store them in low-latency online stores (Redis, DynamoDB), and serve them within the latency budget of the model endpoint. The discipline grew from the fraud detection and recommendation teams at companies like PayPal, Uber, and Netflix, who hit the batch-approximation problem hard and built stream-to-feature infrastructure to solve it.


The Feature Freshness Spectrum​

Not all features have the same freshness requirement. The right question is not "should this be a real-time feature?" but "how stale can this feature be before it hurts the model?"

Static features change so infrequently that batch computation is not only sufficient but preferred. User age, account creation date, geographic region, and subscription tier can all be computed in a nightly job with zero loss of model quality. The cost of streaming these features would be high; the benefit would be negligible.

Near-real-time features change on hourly or daily cycles. A user's total purchases in the past 30 days, their average transaction amount over the past month, the number of support tickets they opened this week. These features can tolerate an hourly refresh via mini-batch jobs. The business impact of a 1-hour lag is low for most use cases.

Real-time features carry signal that degrades sharply with staleness. Transaction velocity in the last hour, items clicked in the last session, search queries in the last 5 minutes, demand surge in the last 15 minutes. For these, even a 5-minute lag can matter. They require streaming computation.

Request-time features are computed inline at serving time from inputs available in the request itself: the time elapsed since the user's last login (computed from a stored last-login timestamp and now()), the difference between a product's current price and the user's historical average purchase price. These require no separate pipeline - but they require the raw values they depend on to be available in the online store.


When Batch Features Are Sufficient​

Before reaching for streaming infrastructure, confirm that the feature's freshness requirement actually demands it. Consider:

  • Signal half-life: if the predictive value of a feature decays over days, not hours, batch is sufficient
  • Update frequency of underlying data: if the source data is only available daily (e.g., a data warehouse export), streaming adds no value
  • Model use case: a weekly churn prediction model has no use for sub-second feature freshness
  • Cost-benefit: streaming infrastructure has non-trivial operational cost; batch is simpler and cheaper

A rough rule: if the feature's optimal computation window is longer than 6 hours, batch is almost always sufficient. If the optimal window is shorter than 1 hour, you need streaming.


When You Need Real-Time Features​

Four use cases reliably require real-time features:

Fraud detection - Transaction velocity features (counts, sums, distinct merchants in a time window) are the primary fraud signal. Batch approximations fail during bursts, which is exactly when fraud happens.

Real-time recommendation - Collaborative filtering captures long-term preferences. But session context - what the user clicked in the past 5 minutes, what they added and then removed from their cart - captures immediate intent. This session context is the highest-value signal for next-action recommendations.

Dynamic pricing - Uber surge pricing, airline revenue management, hotel rate optimization. The demand signal is the number of requests in the past 15 minutes in a geographic cell. This is meaningless as a daily average.

Search ranking - Query reformulation patterns within a session, items the user skipped earlier in the session, categories they browsed before searching - all carry session-level signal for re-ranking results.


The Serving Latency Constraint​

Real-time features must fit within the model's latency budget. Consider a simple model endpoint:

Total Latency=Network+Feature Retrieval+Model Inference+Response Serialization\text{Total Latency} = \text{Network} + \text{Feature Retrieval} + \text{Model Inference} + \text{Response Serialization}

A typical breakdown for a 200ms SLA:

ComponentBudget
Network (client to server)20ms
Feature retrieval70ms
Model inference90ms
Response serialization20ms
Total200ms

Feature retrieval has 70ms. That sounds comfortable until you realize that a naive implementation - fetching 8 features from 4 different services with individual HTTP calls - spends 8 Ă— 15ms = 120ms in network round trips alone. The entire budget is consumed before a single feature is computed.

This constraint forces several architectural decisions: features must be stored in a co-located low-latency store (Redis, not a remote database), multiple features must be fetched in a single batched request, and expensive computations must be pre-computed and cached.

:::note The 10x Rule Design feature retrieval to use at most 10% of your total SLA. If your SLA is 200ms, target feature retrieval under 20ms. This leaves room for latency spikes, increased feature counts, and model complexity growth. :::


The Training-Serving Gap​

The most expensive form of online feature engineering problem is also the most subtle: when the feature exists at both training and serving time but is computed differently. The model trains on one distribution and serves on another. This is training-serving skew.

Sources of training-serving skew include:

  • Different implementations: Spark SQL for training, Python for serving. SQL window functions behave differently from Python loops at edge cases (NULLs, empty windows, timezone handling).
  • Approximation shortcuts: daily batch divided by 24 instead of true hourly window.
  • Timezone bugs: training pipeline uses UTC, serving microservice uses local time.
  • Null handling differences: Spark SUM over a NULL column returns NULL; Python sum returns 0 if you're not careful.
  • Window boundary definitions: "last 1 hour" in training means event_time > now() - 3600s; in serving it might mean "the current hour starting at :00".

Three Strategies for Training on Real-Time Features​

Real-time features don't exist in historical form by definition. You cannot retroactively compute transaction_velocity_1h for a transaction from six months ago unless you logged the raw events at the time. Three strategies exist:

1. Log and Wait: Instrument the serving pipeline to log both the feature value and the prediction. Use this log as the training dataset. You get perfectly consistent training features - but you must wait for the log to accumulate and cannot use this strategy for the first model version.

2. Simulation: Replay historical raw events through the feature computation logic to reconstruct what the feature value would have been at each point in time. This is accurate if the computation logic is deterministic and you have the raw events. The Flink replay approach.

3. Approximation with awareness: Use a batch approximation for training (daily counts divided by time factor) but be aware that the model will be calibrated on that approximation. Acceptable only when the approximation error is small relative to the signal. Never acceptable for velocity features in fraud detection.

:::warning Strategy 3 is a Last Resort Approximating real-time features for training produces a model that is explicitly calibrated on an incorrect signal. This can work when the error is small, but it creates a hidden dependency that will cause problems as the model is updated or retrained. Use strategies 1 or 2. :::


Core Implementation: OnlineFeatureComputer​

This class implements the real-time serving path for velocity features. It reads from Kafka (live events) and serves pre-computed values from Redis with a fallback to direct computation.

import redis
import json
import time
from collections import defaultdict
from typing import Dict, List, Optional
from dataclasses import dataclass, field
from kafka import KafkaConsumer

@dataclass
class TransactionEvent:
user_id: str
transaction_id: str
amount: float
merchant_id: str
timestamp: float # Unix epoch seconds

@dataclass
class VelocityFeatures:
user_id: str
tx_count_1h: int
tx_count_24h: int
tx_sum_1h: float
tx_sum_24h: float
distinct_merchants_1h: int
computed_at: float


class OnlineFeatureComputer:
"""
Computes transaction velocity features from live Kafka events.
Stores pre-computed values in Redis for sub-millisecond retrieval.
"""

WINDOW_1H = 3600
WINDOW_24H = 86400
FEATURE_TTL = 90000 # 25 hours - keep enough for 24h window + buffer

def __init__(self, redis_client: redis.Redis, kafka_bootstrap: str):
self.redis = redis_client
self.consumer = KafkaConsumer(
"transactions",
bootstrap_servers=kafka_bootstrap,
value_deserializer=lambda m: json.loads(m.decode("utf-8")),
auto_offset_reset="latest",
group_id="feature-computer-velocity",
)

def _feature_key(self, user_id: str) -> str:
return f"features:velocity:{user_id}"

def _event_log_key(self, user_id: str) -> str:
# Redis Sorted Set: score = timestamp, member = tx_id:amount:merchant_id
return f"events:tx:{user_id}"

def ingest_event(self, event: TransactionEvent) -> None:
"""Append a transaction event to the user's event log in Redis."""
log_key = self._event_log_key(event.user_id)
member = f"{event.transaction_id}:{event.amount}:{event.merchant_id}"

pipe = self.redis.pipeline()
# Add to sorted set with timestamp as score
pipe.zadd(log_key, {member: event.timestamp})
# Evict events older than 25 hours to bound memory
cutoff = event.timestamp - self.FEATURE_TTL
pipe.zremrangebyscore(log_key, "-inf", cutoff)
pipe.expire(log_key, self.FEATURE_TTL)
pipe.execute()

# Recompute and cache features for this user
self._recompute_and_cache(event.user_id, event.timestamp)

def _recompute_and_cache(self, user_id: str, now: float) -> VelocityFeatures:
"""Recompute velocity features from the event log and cache in Redis."""
log_key = self._event_log_key(user_id)
cutoff_1h = now - self.WINDOW_1H
cutoff_24h = now - self.WINDOW_24H

# Fetch events in each window using sorted set range by score
events_1h = self.redis.zrangebyscore(log_key, cutoff_1h, now)
events_24h = self.redis.zrangebyscore(log_key, cutoff_24h, now)

def parse_events(raw_events):
results = []
for e in raw_events:
parts = e.decode().split(":", 2)
tx_id, amount, merchant_id = parts[0], float(parts[1]), parts[2]
results.append({"tx_id": tx_id, "amount": amount, "merchant_id": merchant_id})
return results

parsed_1h = parse_events(events_1h)
parsed_24h = parse_events(events_24h)

features = VelocityFeatures(
user_id=user_id,
tx_count_1h=len(parsed_1h),
tx_count_24h=len(parsed_24h),
tx_sum_1h=sum(e["amount"] for e in parsed_1h),
tx_sum_24h=sum(e["amount"] for e in parsed_24h),
distinct_merchants_1h=len({e["merchant_id"] for e in parsed_1h}),
computed_at=now,
)

# Cache the computed feature vector
feature_key = self._feature_key(user_id)
pipe = self.redis.pipeline()
pipe.hset(feature_key, mapping={
"tx_count_1h": features.tx_count_1h,
"tx_count_24h": features.tx_count_24h,
"tx_sum_1h": str(features.tx_sum_1h),
"tx_sum_24h": str(features.tx_sum_24h),
"distinct_merchants_1h": features.distinct_merchants_1h,
"computed_at": str(features.computed_at),
})
pipe.expire(feature_key, self.FEATURE_TTL)
pipe.execute()

return features

def get_features(
self, user_id: str, max_age_seconds: float = 30.0
) -> Optional[VelocityFeatures]:
"""
Retrieve cached features for a user.
Returns None if features are stale or missing - caller must handle fallback.
"""
feature_key = self._feature_key(user_id)
raw = self.redis.hgetall(feature_key)

if not raw:
return None

computed_at = float(raw[b"computed_at"])
if time.time() - computed_at > max_age_seconds:
# Features are stale - trigger recomputation (async in production)
return None

return VelocityFeatures(
user_id=user_id,
tx_count_1h=int(raw[b"tx_count_1h"]),
tx_count_24h=int(raw[b"tx_count_24h"]),
tx_sum_1h=float(raw[b"tx_sum_1h"]),
tx_sum_24h=float(raw[b"tx_sum_24h"]),
distinct_merchants_1h=int(raw[b"distinct_merchants_1h"]),
computed_at=computed_at,
)

def run_consumer(self):
"""Main event loop: consume Kafka events and update feature store."""
print("Feature computer starting...")
for message in self.consumer:
event_dict = message.value
event = TransactionEvent(**event_dict)
self.ingest_event(event)

This implementation uses a Redis Sorted Set as a time-indexed event log. New events are appended with their timestamp as the score. Window queries use ZRANGEBYSCORE to pull only events within the target window. The computed feature vector is then cached in a Redis Hash for sub-millisecond retrieval by the model server.


The "Divide Daily by 24" Antipattern​

:::danger Never Approximate Bursty Data by Time Division Dividing a daily count by 24 to estimate an hourly rate assumes uniform distribution across the day. For fraud signals, demand surges, and viral content spread, the distribution is anything but uniform. The signal you are trying to capture exists precisely in the burst - and time-division smoothing eliminates it.

The fraud example: a compromised card fires 20 transactions in 40 minutes at 2 AM. The past-24-hour count at that moment includes 3 legitimate daytime transactions + 20 fraudulent ones = 23. Divided by 24 = 0.96/hour. The actual 1-hour velocity: 20. The model sees "quiet" when it should see "alarm."

The demand surge example: 100 ride requests arrive in a 5-minute window in downtown at 6 PM. The past-24-hour count is 1,000. Divided by 24/hours, divided by 12/five-minute-slots = 3.5. Actual 5-minute count: 100. Surge pricing never triggers.

Use true sliding window computation. If streaming infrastructure isn't available yet, the correct temporary fix is to widen the approximation window and be explicit about the lag, not to silently smooth out the signal. :::


The Training-Serving Gap: Concrete Simulation​

To understand what the training-serving gap looks like numerically, consider this simulation for the fraud scenario:

import numpy as np
from typing import List, Tuple

def simulate_training_serving_gap(
events: List[Tuple[float, float]], # (timestamp, amount) pairs
eval_time: float,
window_seconds: float = 3600,
) -> dict:
"""
Compare true window feature vs. batch approximation.

events: list of (timestamp, amount) tuples - the ground truth event log
eval_time: the timestamp at which we evaluate the feature
window_seconds: the intended window (default 1 hour)
"""
# True feature: count events in [eval_time - window, eval_time]
window_start = eval_time - window_seconds
true_events = [(ts, amt) for ts, amt in events if window_start <= ts <= eval_time]
true_count = len(true_events)
true_sum = sum(amt for _, amt in true_events)

# Batch approximation: daily count divided by (86400 / window_seconds)
day_start = eval_time - 86400
daily_events = [(ts, amt) for ts, amt in events if day_start <= ts <= eval_time]
daily_count = len(daily_events)
daily_sum = sum(amt for _, amt in daily_events)
scale_factor = 86400 / window_seconds
approx_count = daily_count / scale_factor
approx_sum = daily_sum / scale_factor

return {
"true_count_1h": true_count,
"approx_count_1h": round(approx_count, 2),
"count_error_pct": round(abs(true_count - approx_count) / max(true_count, 1) * 100, 1),
"true_sum_1h": round(true_sum, 2),
"approx_sum_1h": round(approx_sum, 2),
"sum_error_pct": round(abs(true_sum - approx_sum) / max(true_sum, 0.01) * 100, 1),
}


# Simulate a fraud spike: 2 legitimate transactions earlier, then 18 fraudulent
import time

base_time = time.time()
normal_events = [
(base_time - 20000, 45.00), # coffee, 5.5 hours ago
(base_time - 14400, 120.00), # grocery, 4 hours ago
]
fraud_spike_events = [
(base_time - 3500 + i * 180, 299.99) # 18 rapid transactions
for i in range(18)
]
all_events = normal_events + fraud_spike_events

result = simulate_training_serving_gap(all_events, base_time)
print(f"True 1h count: {result['true_count_1h']}")
print(f"Batch approx 1h count: {result['approx_count_1h']}")
print(f"Count error: {result['count_error_pct']}%")

Expected output for the fraud spike scenario:

True 1h count: 18
Batch approx 1h count: 0.83
Count error: 95.4%

The batch approximation is 95% wrong. The model sees a velocity of 0.83 when the actual velocity is 18.


Feature Freshness in Practice: A Decision Framework​

Use this framework when deciding whether a feature requires real-time computation:

Is the feature's predictive signal concentrated in events
within the last 60 minutes?
├── YES → Real-time feature (Kafka + Flink + Redis)
└── NO → Is it concentrated within the last 6 hours?
├── YES → Near-real-time (mini-batch, 5-15 min refresh)
└── NO → Is it stable for 24+ hours?
├── YES → Batch feature (daily Spark job)
└── NO → Near-real-time (hourly job)

A practical inventory from a fraud detection system:

FeatureWindowFreshness NeededStrategy
account_age_dayslifetimeweeksstatic
historical_fraud_flaglifetimedaysbatch
avg_transaction_90d90 daysdailybatch
avg_transaction_7d7 dayshourlynear-real-time
tx_count_24h24 hourshourlynear-real-time
tx_count_1h1 hourminutesreal-time
tx_velocity_5min5 minutessecondsreal-time
session_tx_countsessionsecondsreal-time

Production Notes​

Feature logging is non-negotiable: log every feature value served alongside the prediction. This is the only way to reconstruct the serving distribution for retraining, detect feature drift, and debug model degradation.

Version your feature computations: when you change the computation logic for an existing feature (e.g., fixing a timezone bug), that's a new feature version. The model trained on v1 is calibrated to v1's distribution. Swap in v2 only after retraining.

Monitor feature freshness in production: alert when a feature's computed_at timestamp is more than 2x its intended refresh interval. A 5-minute velocity feature that hasn't been updated in 20 minutes is delivering stale values to a live model.

Cold start for new entities: a new user has no event history. Define explicit defaults for every real-time feature and ensure the model was trained with those defaults included in the training data (not NULLs silently replaced at serving time).


Interview Q&A​

Q: What is the training-serving skew problem, and what causes it?

Training-serving skew is the condition where a model's training features and its serving features compute different values for the same logical quantity, causing the model to operate on a distribution it was never trained on. Common causes: different code implementations of the same feature (Spark vs. Python), timezone mismatches, null handling differences, window boundary definitions, and approximation shortcuts taken at serving time that weren't used at training time. The fix is the single-computation path - one piece of code computes the feature, its output is stored, and both training and serving read from that store.

Q: A feature engineer says "we can approximate the 1-hour velocity by taking the daily count and dividing by 24." When is this acceptable and when is it not?

It is acceptable when the underlying events are approximately uniformly distributed throughout the day and the prediction task is not sensitive to bursts. It is never acceptable for fraud detection, anomaly detection, demand surge detection, or any use case where the signal is specifically concentrated in bursts. For these cases, the approximation is 95%+ wrong during the exact events the model needs to catch. The correct answer is to build the sliding window computation correctly, even if it requires streaming infrastructure.

Q: How do you handle the cold start problem for real-time features when a new user has no event history?

Define explicit defaults for every real-time feature that represent a "neutral" entity: zero transaction count, zero velocity, zero session history. Ensure these defaults are included in the training data as-is - don't drop rows with zero values or impute them with the population mean. The model must be calibrated on the true default values it will receive at serving time. Store defaults in the feature schema so any downstream service gets them automatically on a cache miss.

Q: How would you instrument a feature pipeline to detect that training-serving skew has developed in production?

Log every feature value served alongside the request ID and prediction. In a separate pipeline, compute the same feature from the training pipeline on the same entities. Join on entity ID and timestamp window. Compute the mean absolute percentage error between logged serving values and recomputed training values. Alert when MAPE exceeds 5% on any feature. Automate this comparison to run weekly and at every model deployment.

Q: What is the "log and wait" strategy for training on real-time features, and when does it break down?

Log-and-wait: instrument the serving pipeline to record the exact feature values that were used for each prediction, along with the ground truth outcome when it becomes available. Use this logged (feature, outcome) dataset as the training data for the next model version. This guarantees perfect training-serving consistency because the training data is literally the serving data. It breaks down in two cases: (1) for the first model version, you have no serving logs to train on; (2) if the serving pipeline had a bug (e.g., the training-serving skew you're trying to fix), the logs contain the buggy values and training on them perpetuates the bug.

Q: A feature has a 1-hour computation window. A late event arrives 90 minutes after it was generated (network delay in the event pipeline). How do you handle it?

For features served at the time of the original prediction, the late event doesn't change anything - the prediction already happened. For features used in retraining, you need to decide on a watermark: the maximum delay you're willing to wait before declaring a window closed. If your watermark is 2 hours, you delay writing training data by 2 hours to allow for late arrivals. This increases training data latency but improves completeness. In Flink, configure WatermarkStrategy.forBoundedOutOfOrderness(Duration.ofHours(2)). The trade-off: longer watermark delay = more complete windows but higher latency to generate training data.

© 2026 EngineersOfAI. All rights reserved.