:::tip 🎮 Interactive Playground Visualize this concept: Try the Feature Store Architecture demo on the EngineersOfAI Playground - no code required. :::
Online Feature Computation for Model Serving
The Production Scenario
Your ad ranking team has just trained a model that considers 120 features - and it is 4% better on CTR than the previous 40-feature model. The model is ready. The problem is that 40 of those 120 features do not exist yet in the feature store. They need to be computed at request time from the raw ad candidate data, the user's current session context, and the page the user is currently viewing.
The ad serving system has a 50ms SLA from request to response. The model itself takes 8ms to run. Your upstream feature store lookup takes 12ms. That leaves 30ms of latency budget for feature computation - across 40 on-demand features that include cross-features, embedding lookups, and session aggregates. You need to compute those 40 features in 30ms or the model cannot ship.
This is the online feature computation problem. It lives at the intersection of ML systems and low-latency engineering. The features that make models most accurate - recency signals, session context, cross-user-item interactions - are precisely the features most expensive to compute at request time. The entire craft of this discipline is finding the right balance between computation depth and latency budget.
This lesson gives you a complete mental model for online feature computation: what types of features can be pre-computed, what must be computed live, how to structure computation within a latency budget, and how to cache effectively without introducing training-serving skew.
Why This Exists - The Gap Between Training and Serving
When you train a model in a notebook, features are a pandas DataFrame column. You compute them in batch over your historical dataset, join everything together, and hand the resulting matrix to a training loop. The computation is offline, unlimited in time, and totally separate from serving.
When you deploy that model, the same features must appear at request time - and they must appear within milliseconds. The ad candidate that was a row in a training DataFrame is now a live object arriving from an auction request. The user context that was a column in training data is now a partially assembled set of signals from the current session.
Three things make this hard:
-
Freshness requirements: Some features must reflect what the user did in the last 5 minutes, not yesterday. These cannot be pre-computed and cached for more than a few minutes.
-
Request-time context: Some features depend on the specific context of the current request - which page the user is on, which other candidates are in the same auction, what the current timestamp is. These literally cannot be pre-computed because the context does not exist until the request arrives.
-
Training-serving skew: The feature computation code that runs during training must produce byte-for-byte identical outputs as the code that runs during serving. Any divergence silently degrades model quality without any error signal. This is one of the most common and hardest to detect bugs in ML production systems.
Historical Context
Early ML systems at companies like Google and Facebook had no feature store concept. Every team maintained their own feature computation pipeline, often duplicating logic between training and serving in different languages (Python training, C++ serving). When a feature's computation logic changed, both pipelines needed to be updated in sync. Drift was constant.
The feature store concept emerged around 2017-2019. Uber's Michelangelo platform, announced in 2017, was among the first public descriptions of a unified system for creating, storing, and serving ML features across both training and real-time use cases. Feast (2020, open-sourced from Gojek and later Tecton) brought this pattern to the broader community. The core innovation: a single feature definition that is used for both offline training data generation and online serving.
But feature stores solve the pre-computed feature problem. The online feature computation problem - computing features that depend on request-time context - remains a local concern for each serving pipeline. The patterns in this lesson are how production teams solve it.
Feature Taxonomy
Understanding which features fit which category is the first step in designing the computation pipeline:
Pre-computed features are the majority by count but the minority by freshness requirement. User age, item category, historical engagement rates - these change slowly and can be computed by a batch job and stored in a feature store (Redis, DynamoDB). Lookup latency is 1-5ms.
Near-real-time features require a streaming pipeline (Flink, Spark Streaming) that continuously updates feature values as new events arrive. A user's click count in the last 30 minutes is recomputed as each click event arrives. Latency to update is typically 1-5 minutes (stream processing lag). Lookup latency is the same 1-5ms as pre-computed features.
On-demand features have no pre-computation option because they depend on the specific request context. The cross-feature between user and item (does this user frequently click items in this item's category?) can be computed on demand from pre-computed components. Request-time timestamp features (is it evening? is it a weekend?) can only be computed when the request arrives.
The Latency Budget Framework
Every ML serving pipeline has a latency budget. For ad ranking, it might be 50ms total. You need to allocate that budget across all the work that happens before the model runs:
Total latency budget: 50ms
├── Feature store lookup (batch features): 8ms
├── Streaming feature lookup (near-RT features): 8ms
├── On-demand feature computation: ?ms ← this is what you control
├── Feature assembly + vectorization: 2ms
├── Model forward pass: 8ms
└── Post-processing + response: 2ms
Remaining for on-demand: 50 - 8 - 8 - 2 - 8 - 2 = 22ms
22ms to compute all on-demand features. This budget forces specific engineering choices:
- Vectorize all computation (numpy arrays, not Python loops)
- Parallelize independent computations (asyncio, thread pool)
- Cache intermediate results within a request
- Push as many features as possible into the pre-computed or near-real-time buckets
Implementation: The Feature Pipeline
# feature_pipeline.py
import asyncio
import time
import hashlib
from typing import Optional
import numpy as np
import redis.asyncio as aioredis
from dataclasses import dataclass
@dataclass
class InferenceContext:
"""Everything available at request time."""
user_id: str
item_ids: list[str] # Candidates to rank
page_type: str # "search", "homepage", "pdp"
device_type: str # "mobile", "desktop", "tablet"
request_timestamp: float
session_start_timestamp: float
items_viewed_this_session: list[str]
@dataclass
class FeatureVector:
"""Assembled feature vector for one user-item pair."""
user_features: np.ndarray # Pre-computed: user profile
item_features: np.ndarray # Pre-computed: item attributes
user_rt_features: np.ndarray # Near-RT: recent activity
ondemand_features: np.ndarray # On-demand: computed this request
feature_names: list[str] # For debugging / logging
class FeatureService:
"""
Orchestrates all three types of feature retrieval and computation.
Designed to run all I/O in parallel and compute on-demand features
while I/O is in flight.
"""
def __init__(self, redis_url: str = "redis://localhost:6379"):
self.redis: Optional[aioredis.Redis] = None
self.redis_url = redis_url
# In-process cache for features fetched this request
# Prevents duplicate Redis calls for the same key in one request
self._request_cache: dict = {}
async def setup(self):
self.redis = aioredis.from_url(
self.redis_url,
encoding="utf-8",
decode_responses=False, # Binary for numpy arrays
)
async def get_user_features(self, user_id: str) -> np.ndarray:
"""Pre-computed user profile features - batch updated daily."""
key = f"user:features:{user_id}"
if key in self._request_cache:
return self._request_cache[key]
raw = await self.redis.get(key)
if raw is None:
# Feature not found - use mean-imputation fallback
features = np.zeros(64, dtype=np.float32)
else:
features = np.frombuffer(raw, dtype=np.float32)
self._request_cache[key] = features
return features
async def get_item_features(self, item_id: str) -> np.ndarray:
"""Pre-computed item features - updated hourly."""
key = f"item:features:{item_id}"
if key in self._request_cache:
return self._request_cache[key]
raw = await self.redis.get(key)
if raw is None:
features = np.zeros(32, dtype=np.float32)
else:
features = np.frombuffer(raw, dtype=np.float32)
self._request_cache[key] = features
return features
async def get_user_realtime_features(self, user_id: str) -> np.ndarray:
"""Near-real-time features - streaming pipeline updates every minute."""
key = f"user:rt:features:{user_id}"
raw = await self.redis.get(key)
if raw is None:
return np.zeros(16, dtype=np.float32)
return np.frombuffer(raw, dtype=np.float32)
async def get_item_batch_features(
self, item_ids: list[str]
) -> dict[str, np.ndarray]:
"""Fetch multiple items in a single Redis MGET - much faster than N individual GETs."""
keys = [f"item:features:{iid}" for iid in item_ids]
raw_values = await self.redis.mget(*keys)
result = {}
for item_id, raw in zip(item_ids, raw_values):
if raw is None:
result[item_id] = np.zeros(32, dtype=np.float32)
else:
result[item_id] = np.frombuffer(raw, dtype=np.float32)
return result
def compute_ondemand_features(
self,
context: InferenceContext,
user_features: np.ndarray,
item_features: np.ndarray,
) -> np.ndarray:
"""
Compute features that depend on request-time context.
These CANNOT be pre-computed.
All computation uses numpy - no Python loops.
Target: under 5ms for this entire function.
"""
now = context.request_timestamp
session_start = context.session_start_timestamp
# Time-based features (vectorized - same for all items in this request)
session_duration_min = (now - session_start) / 60.0
hour_of_day = (now % 86400) / 3600.0
is_evening = float(18 <= hour_of_day <= 23)
is_weekend = float((int(now / 86400) % 7) >= 5)
# Session context features
n_items_viewed = len(context.items_viewed_this_session)
is_repeat_visitor = float(n_items_viewed > 0)
# Page type one-hot (3 types: search, homepage, pdp)
page_vec = np.zeros(3, dtype=np.float32)
page_map = {"search": 0, "homepage": 1, "pdp": 2}
page_vec[page_map.get(context.page_type, 1)] = 1.0
# Device type one-hot (3 types)
device_vec = np.zeros(3, dtype=np.float32)
device_map = {"mobile": 0, "desktop": 1, "tablet": 2}
device_vec[device_map.get(context.device_type, 1)] = 1.0
# Cross feature: dot product of user embedding and item embedding
# This captures user-item affinity at a single float
user_emb = user_features[:32] # First 32 dims as "embedding"
item_emb = item_features[:32]
user_item_dot = float(np.dot(user_emb, item_emb))
# L2 distance between user and item in embedding space
user_item_l2 = float(np.linalg.norm(user_emb - item_emb))
# Assemble all on-demand features into a vector
ondemand = np.array([
session_duration_min / 60.0, # Normalized to [0, 1] roughly
hour_of_day / 24.0,
is_evening,
is_weekend,
float(n_items_viewed) / 100.0, # Normalized
is_repeat_visitor,
user_item_dot,
user_item_l2,
*page_vec,
*device_vec,
], dtype=np.float32)
return ondemand
async def compute_features_for_request(
self,
context: InferenceContext,
) -> list[FeatureVector]:
"""
Compute features for all candidates in one request.
Structure:
1. Fire all I/O in parallel (asyncio.gather)
2. While I/O is in flight, prepare what we can
3. Assemble features when I/O completes
"""
start = time.perf_counter()
self._request_cache = {} # Clear per-request cache
# Step 1: Fire all I/O in parallel
user_features_task = asyncio.create_task(
self.get_user_features(context.user_id)
)
user_rt_task = asyncio.create_task(
self.get_user_realtime_features(context.user_id)
)
item_features_task = asyncio.create_task(
self.get_item_batch_features(context.item_ids) # Single MGET
)
# Step 2: Await all I/O (runs in parallel)
user_features, user_rt_features, all_item_features = await asyncio.gather(
user_features_task,
user_rt_task,
item_features_task,
)
io_elapsed_ms = (time.perf_counter() - start) * 1000
# Step 3: Compute on-demand features and assemble (pure CPU, no I/O)
feature_vectors = []
for item_id in context.item_ids:
item_features = all_item_features[item_id]
ondemand = self.compute_ondemand_features(
context, user_features, item_features
)
fv = FeatureVector(
user_features=user_features,
item_features=item_features,
user_rt_features=user_rt_features,
ondemand_features=ondemand,
feature_names=[
"user_features", "item_features",
"user_rt_features", "ondemand_features"
],
)
feature_vectors.append(fv)
total_elapsed_ms = (time.perf_counter() - start) * 1000
# Log if we're close to or over budget
if total_elapsed_ms > 20:
print(
f"Feature computation slow: {total_elapsed_ms:.1f}ms "
f"(I/O: {io_elapsed_ms:.1f}ms) for {len(context.item_ids)} items"
)
return feature_vectors
def assemble_model_input(
self, feature_vectors: list[FeatureVector]
) -> np.ndarray:
"""Stack feature vectors into a model input matrix [n_items, n_features]."""
rows = []
for fv in feature_vectors:
row = np.concatenate([
fv.user_features,
fv.item_features,
fv.user_rt_features,
fv.ondemand_features,
])
rows.append(row)
return np.stack(rows) # Shape: [n_items, total_features]
Caching Computed Features
Not all "on-demand" features are truly unique per request. Some features that seem request-specific are actually stable over a window:
# feature_cache.py
import time
import hashlib
import numpy as np
from typing import Optional
from collections import OrderedDict
class TwoLevelFeatureCache:
"""
Two-level feature cache:
- L1: In-process LRU cache (microsecond access)
- L2: Redis (millisecond access, shared across pods)
For features that are expensive to compute but stable for
a short window (e.g., "user's last 5 items viewed" changes
only when the user clicks something).
"""
def __init__(
self,
redis_client,
l1_max_size: int = 1000,
l2_ttl_seconds: int = 60,
):
self.redis = redis_client
self.l1_cache: OrderedDict = OrderedDict()
self.l1_max_size = l1_max_size
self.l2_ttl = l2_ttl_seconds
def _make_key(self, feature_name: str, *args) -> str:
key_parts = [feature_name] + [str(a) for a in args]
return hashlib.sha256(":".join(key_parts).encode()).hexdigest()[:16]
def get_l1(self, key: str) -> Optional[np.ndarray]:
if key in self.l1_cache:
# Move to end (LRU update)
self.l1_cache.move_to_end(key)
value, expiry = self.l1_cache[key]
if time.time() < expiry:
return value
else:
del self.l1_cache[key]
return None
def set_l1(self, key: str, value: np.ndarray, ttl: float = 5.0):
if len(self.l1_cache) >= self.l1_max_size:
# Evict oldest
self.l1_cache.popitem(last=False)
self.l1_cache[key] = (value, time.time() + ttl)
async def get(self, feature_name: str, *args) -> Optional[np.ndarray]:
key = self._make_key(feature_name, *args)
# L1 first - microseconds
result = self.get_l1(key)
if result is not None:
return result
# L2 - milliseconds
raw = await self.redis.get(f"fcache:{key}")
if raw is not None:
value = np.frombuffer(raw, dtype=np.float32)
self.set_l1(key, value) # Promote to L1
return value
return None
async def set(
self,
feature_name: str,
value: np.ndarray,
ttl_seconds: int = None,
*args,
):
key = self._make_key(feature_name, *args)
ttl = ttl_seconds or self.l2_ttl
self.set_l1(key, value, ttl=min(ttl, 5.0)) # L1 max 5s
await self.redis.setex(
f"fcache:{key}",
ttl,
value.astype(np.float32).tobytes(),
)
The Training-Serving Skew Problem
This is the most dangerous failure mode in online feature computation. If your training pipeline computes features differently than your serving pipeline, your model receives inputs during serving that it never saw during training. The model silently degrades.
# feature_definitions.py - THE SINGLE SOURCE OF TRUTH
# Import this in BOTH training and serving pipelines.
# Never reimplement feature computation in two places.
import numpy as np
import pandas as pd
from typing import Union
def compute_session_duration_feature(
session_start_ts: Union[float, np.ndarray],
request_ts: Union[float, np.ndarray],
) -> Union[float, np.ndarray]:
"""
Session duration in normalized minutes.
CRITICAL: This exact function must be used in both:
- Offline training feature generation
- Online serving feature computation
Any change to this function invalidates all trained models
that used it. Version this function alongside model versions.
"""
duration_seconds = request_ts - session_start_ts
# Clip to [0, 60 min] then normalize to [0, 1]
return np.clip(duration_seconds / 3600.0, 0.0, 1.0)
def compute_user_item_affinity(
user_embedding: np.ndarray,
item_embedding: np.ndarray,
) -> float:
"""
Dot-product affinity between user and item embeddings.
Both embeddings must be L2-normalized before calling this.
The dot product of L2-normalized vectors is cosine similarity.
"""
# Ensure L2 normalization - this prevents scaling bugs
u_norm = user_embedding / (np.linalg.norm(user_embedding) + 1e-8)
i_norm = item_embedding / (np.linalg.norm(item_embedding) + 1e-8)
return float(np.dot(u_norm, i_norm))
:::tip Feature Hashing for High-Cardinality Categoricals
When you have a categorical feature with millions of values (user IDs, item IDs), one-hot encoding is infeasible. Use feature hashing: bucket_id = hash(feature_value) % n_buckets. This maps any string to one of N integer buckets with some collision rate. With n_buckets=10,000, collision rate is acceptable for most use cases. The key advantage: works identically in training and serving, no vocabulary required, handles new values at serving time without retraining.
:::
Production Engineering Notes
Measure feature computation latency separately: Instrument each step independently. The standard pattern is to emit a histogram metric for each feature retrieval: feature_lookup_latency_ms{feature="user_profile"}. When serving latency spikes, you want to know immediately whether it is the model, the feature store, or a specific computation.
Feature freshness monitoring: Track how old the features are when they reach the model. If your near-real-time features are supposed to be at most 5 minutes stale and you see staleness jumping to 30 minutes, your streaming pipeline is lagging. This staleness directly degrades model quality.
Graceful degradation: Every feature must have a fallback. Redis is unavailable? Return the feature mean from the training distribution. Streaming pipeline is lagging? Use the most recent available value. Never return NaN to the model - it will produce undefined outputs for the entire batch.
:::warning The mget vs Pipeline Anti-Pattern
A common mistake: N separate await redis.get(key) calls in a loop, each round-tripping to Redis. At 1ms per call, 50 calls = 50ms. Use await redis.mget(*keys) to fetch all keys in a single round trip. For writes, use a Redis pipeline. The difference is often 50x: 50ms of sequential gets vs 1ms of a single mget.
:::
:::danger Feature Imputation During Serving Imputing missing features with 0 during serving when training used mean imputation will degrade model quality. Standardize imputation strategy in your shared feature definitions file and use the same imputation in both training data generation and serving. Better yet, track the fraction of requests that trigger imputation as a metric - rising imputation rates signal data pipeline problems upstream. :::
Interview Q&A
Q: What is training-serving skew and how do you prevent it?
Training-serving skew occurs when the feature computation logic used to generate training data produces different values than the logic used at serving time. This silently degrades model quality - the model receives inputs at serving time that differ from what it learned on. Prevention: define all feature computation in shared library code that is imported by both the offline training pipeline and the online serving pipeline. Never reimplement feature computation in two separate places. Version feature definitions alongside model versions so you always know which feature code a model was trained with.
Q: You have a 50ms serving SLA. The model takes 20ms. How do you allocate the remaining budget?
Profile each component: feature store I/O (measure separately for pre-computed and near-real-time lookups), on-demand feature computation, and post-processing. A typical allocation for 30ms remaining: 10ms for feature store lookups (parallelized with asyncio.gather across all required keys), 8ms for on-demand computation (must be fully vectorized numpy, no Python loops), 5ms for model input assembly and response serialization, 7ms buffer. If feature store lookups dominate, investigate co-location (serving and feature store in same availability zone), switching from N individual gets to a single mget, or adding an in-process L1 cache for frequently accessed features.
Q: How would you implement feature freshness SLAs in a production system?
For each feature, define a maximum acceptable age - e.g., user profile features can be up to 24 hours stale, session features can be at most 5 minutes stale. At serving time, read the feature's updated_at timestamp alongside its value. Compute staleness = request_time - updated_at. If staleness exceeds the SLA, log a warning and optionally fall back to a safer imputed value. Emit a staleness histogram metric per feature type. Alert when p95 staleness exceeds the SLA threshold. This catches streaming pipeline lag and batch pipeline failures before they materially degrade model quality.
Q: What is the difference between pre-computed, near-real-time, and on-demand features? Give an example of each.
Pre-computed features are computed by a batch job and stored in a feature store. They change slowly. Example: a user's historical click-through rate by product category, computed nightly. Near-real-time features are computed by a stream processing pipeline (Flink or Kafka Streams) and updated continuously as events arrive. Example: number of items a user has clicked in the last 30 minutes, updated each time a click event arrives (lag of 1-5 minutes). On-demand features depend on the specific context of the current request and cannot be pre-computed. Example: the cosine similarity between the current user's embedding and the specific item being ranked - it depends on which item is in the current request, which is not known until the request arrives.
Q: How do you handle a feature store that is unavailable during serving?
Every feature retrieval must have a fallback. Design order: first check an in-process L1 cache (microsecond access), then check Redis (millisecond access), then fall back to a statically configured default value (training distribution mean). Never return NaN or None to the model. Track the fallback rate per feature as a metric - a spike in fallbacks indicates the feature store is struggling and model quality is degrading. Implement a circuit breaker: if the feature store error rate exceeds a threshold, stop attempting to call it and serve all requests with fallback values, preventing the serving layer from accumulating timeout latency waiting for a failing upstream.
