

Low-Latency Feature Serving

The 380ms Problem

A model endpoint was breaching its 200ms SLA. The p99 latency was sitting at 450ms, and the team was running out of ideas - the model itself had been profiled and was fast, a PyTorch inference taking only 40ms. Where was the rest of the time going?

A Jaeger trace told the story. The request spent 380ms fetching 8 features. Not because any single feature was slow - each individual fetch averaged 47ms. The problem was that the features were being fetched sequentially, one at a time, from four different internal microservices. Feature 1: 47ms. Feature 2: 48ms. Feature 3: 45ms. And so on, eight times over.

The root cause was a pattern common in teams that build feature serving organically: different teams own different features and expose them through their own APIs. The model serving code calls each API in sequence because that's the simplest implementation. No one notices the latency issue until the number of features grows past 5 or 6 and the sequential overhead dominates the entire SLA budget.

The fix was architectural. All eight features were consolidated into a single Redis Hash keyed on user ID. One call, one round trip, all features. The p99 feature retrieval latency dropped from 380ms to 12ms, which put the endpoint well inside its 200ms SLA. Model inference now dominated the latency profile - which is exactly where it should be.

This lesson is about designing feature retrieval so it is never the bottleneck.


Why This Exists

In the early days of ML serving, model endpoints were simple. A model loaded at startup, a request arrived, the model computed a prediction from request-time inputs, and returned a response. No external feature lookups required.

As models became more sophisticated, they needed context that wasn't in the request: historical aggregates, user profile data, entity relationships. Teams started adding database queries to the serving path. A single query to a Postgres table: acceptable. Three queries to two different databases: the latency starts to mount. Eight queries across four services: the feature retrieval cost exceeds the model inference cost by 10x.

The online feature store emerged as the solution: a dedicated, purpose-built storage layer optimized for exactly one access pattern - point lookups by entity ID, returning all feature values for that entity, in under 10ms. This is fundamentally different from a general-purpose database. It is not optimized for complex queries, aggregations, or joins. It does one thing extremely fast.

Redis became the dominant online feature store not because of any ML-specific design but because its memory-first architecture, O(1) hash operations, and built-in pipelining make it a natural fit for the feature retrieval pattern.


The Latency Budget

Every model endpoint has a latency SLA. Decompose it before writing a line of feature-serving code:

$$
\text{Total SLA} = T_\text{network} + T_\text{feature} + T_\text{inference} + T_\text{serialization}
$$

A concrete budget for a 100ms recommendation SLA:

| Component | Budget | Notes |
| --- | --- | --- |
| Network (client to edge) | 10ms | CDN / edge routing |
| Feature retrieval | 15ms | Target: p99, not p50 |
| Model inference | 65ms | Small transformer model |
| Response serialization | 10ms | JSON or protobuf |
| Total | 100ms | - |

The feature retrieval budget here is 15ms. That means the entire lookup - from the moment the serving code issues the Redis call to the moment it has all feature values in memory - must complete at p99 in 15ms. With a co-located Redis instance and pipeline batching, this is achievable. With sequential REST calls, it is not.
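As a quick sanity check, the decomposition can be written down and verified in a few lines of Python. This is a minimal sketch: the `check_budget` and `feature_share` helpers and the dictionary layout are illustrative assumptions, and the numbers are the ones from the 100ms recommendation budget.

```python
from typing import Dict

# p99 budget per component, in milliseconds (values from the 100ms example).
BUDGET_MS: Dict[str, float] = {
    "network": 10,
    "feature": 15,
    "inference": 65,
    "serialization": 10,
}


def check_budget(budget_ms: Dict[str, float], sla_ms: float) -> float:
    """Return the remaining headroom in ms; raise if the budget exceeds the SLA."""
    total = sum(budget_ms.values())
    if total > sla_ms:
        raise ValueError(f"budget {total}ms exceeds SLA {sla_ms}ms")
    return sla_ms - total


def feature_share(budget_ms: Dict[str, float], sla_ms: float) -> float:
    """Fraction of the total SLA spent on feature retrieval."""
    return budget_ms["feature"] / sla_ms


headroom = check_budget(BUDGET_MS, sla_ms=100)
share = feature_share(BUDGET_MS, sla_ms=100)
print(f"headroom: {headroom}ms, feature share: {share:.0%}")
```

Running a check like this in CI whenever a component's budget changes makes it harder for feature retrieval to quietly grow past its share.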

:::tip The 10% Rule
Size your feature retrieval budget at no more than 10–15% of your total SLA. This gives headroom for latency spikes, feature count growth, and model complexity increases. If feature retrieval is consuming 30–40% of your SLA, you have a structural problem.
:::


Redis as the Online Feature Store

Redis is the most common online feature store for three reasons: it stores data in memory (sub-millisecond access), it has rich data structure support, and it has native pipeline support for batching multiple operations into a single round trip.

Data Structures for Features

Hash - the primary data structure for feature vectors. One Redis Hash per entity, fields are feature names, values are feature values.

HSET user:u12345:features \
tx_count_1h 14 \
tx_sum_1h 892.50 \
account_age_days 547 \
avg_tx_90d 63.40 \
distinct_merchants_7d 8 \
credit_score_bucket 3 \
is_premium 1 \
last_login_ts 1741996800

Sorted Set - for ranking and recency. Top-N items for a user, ordered by score (timestamp, relevance score, etc.).

ZADD user:u12345:recent_items 1741996700 item_9823
ZADD user:u12345:recent_items 1741996650 item_4401
ZREVRANGE user:u12345:recent_items 0 9 # top 10 most recent

String - for simple scalar values and JSON blobs when the feature vector is infrequently updated.

SET user:u12345:embedding '[0.12, -0.34, 0.89, ...]'

Key Design

Key design affects Redis performance through two mechanisms: key size (affects memory and network) and key distribution (affects cluster shard balance).

Good key pattern: {entity_type}:{entity_id}:{feature_group}

  • user:u12345:features
  • item:i9823:features
  • merchant:m441:risk_features

Anti-patterns:

  • Long descriptive keys that waste memory: user_features_for_user_with_id_u12345_v2
  • Embedding the timestamp in the key: user:u12345:features:2026-03-12 - defeats the point of a live feature store
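The key pattern can be enforced with a small builder so that every producer and consumer constructs keys the same way. This is a sketch under stated assumptions: the `feature_key` helper and its validation rule (no whitespace, no `:` separator, at most 32 characters per component) are hypothetical choices, not part of any Redis API.

```python
import re

# Hypothetical constraint: each component is short and contains neither
# whitespace nor the ':' separator used by the key pattern.
_COMPONENT = re.compile(r"[^\s:]{1,32}")


def feature_key(entity_type: str, entity_id: str, feature_group: str = "features") -> str:
    """Build a key of the form {entity_type}:{entity_id}:{feature_group}."""
    for part in (entity_type, entity_id, feature_group):
        if not _COMPONENT.fullmatch(part):
            raise ValueError(f"invalid key component: {part!r}")
    return f"{entity_type}:{entity_id}:{feature_group}"


print(feature_key("user", "u12345"))
print(feature_key("merchant", "m441", "risk_features"))
```

Centralizing key construction also gives one place to change the pattern if the layout ever needs to evolve.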

TTL Management

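Every feature key must have a TTL. Without one, the Redis instance fills with stale data until it reaches its memory limit. What happens next depends on the configured maxmemory-policy: with allkeys-lru Redis starts evicting keys that may still be live, with volatile-lru it can only evict keys that carry a TTL (so keys without one are never reclaimed), and with the default noeviction policy writes start failing.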

Set TTL to 2x the feature's maximum freshness window. A 1-hour velocity feature should have a TTL of 2 hours. A session feature should have a TTL of 4 hours (a generous session length). A daily aggregate feature should have a TTL of 48 hours.
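The 2x rule is simple enough to encode once and reuse wherever features are written. A minimal sketch, assuming a hypothetical `ttl_seconds` helper and illustrative feature-group names:

```python
from datetime import timedelta


def ttl_seconds(freshness_window: timedelta, multiplier: float = 2.0) -> int:
    """TTL in seconds: the feature's maximum freshness window times a safety multiplier."""
    return int(freshness_window.total_seconds() * multiplier)


# Illustrative feature groups; the names are assumptions, not a fixed schema.
TTLS = {
    "velocity_1h": ttl_seconds(timedelta(hours=1)),       # 2 hours
    "daily_aggregate": ttl_seconds(timedelta(hours=24)),  # 48 hours
}
print(TTLS)
```

The resulting value is what would be passed as the expiry when the feature hash is written.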


The Full Redis Feature Client

import redis
import redis.asyncio as aioredis
import json
import time
import logging
from typing import Dict, List, Optional, Any, Tuple
from dataclasses import dataclass, field

logger = logging.getLogger(__name__)


# Default values returned when a feature is missing from the online store.
# These must match the defaults used during model training.
FEATURE_DEFAULTS: Dict[str, Any] = {
    "tx_count_1h": 0,
    "tx_sum_1h": 0.0,
    "tx_count_24h": 0,
    "tx_sum_24h": 0.0,
    "account_age_days": 0,
    "avg_tx_90d": 0.0,
    "distinct_merchants_7d": 0,
    "credit_score_bucket": 2,  # median bucket as default
    "is_premium": 0,
    "last_login_ts": 0,
}


@dataclass
class FeatureVector:
    entity_id: str
    features: Dict[str, Any]
    source: str  # "cache", "default", "computed"
    retrieved_at: float = field(default_factory=time.time)
    cache_hit: bool = False


class OnlineFeatureClient:
    """
    Production Redis client for online feature retrieval.

    Implements:
    - Single-round-trip pipeline fetching for multiple entities
    - Fallback to defaults on cache miss
    - Connection pooling with health checks
    - Latency instrumentation
    """

    def __init__(
        self,
        redis_url: str,
        max_connections: int = 50,
        socket_timeout: float = 0.05,  # 50ms - if Redis takes longer, use defaults
        feature_namespace: str = "user",
    ):
        self.pool = redis.ConnectionPool.from_url(
            redis_url,
            max_connections=max_connections,
            socket_timeout=socket_timeout,
            socket_connect_timeout=0.1,
            retry_on_timeout=False,  # fail fast, use defaults
            decode_responses=True,
        )
        self.client = redis.Redis(connection_pool=self.pool)
        self.namespace = feature_namespace

    def _feature_key(self, entity_id: str) -> str:
        return f"{self.namespace}:{entity_id}:features"

    def get_features(
        self,
        entity_id: str,
        feature_names: Optional[List[str]] = None,
    ) -> FeatureVector:
        """
        Retrieve features for a single entity.
        Returns defaults on miss or error.
        """
        key = self._feature_key(entity_id)
        start = time.monotonic()

        try:
            if feature_names:
                raw = self.client.hmget(key, feature_names)
                raw_dict = dict(zip(feature_names, raw))
            else:
                raw_dict = self.client.hgetall(key)

            latency_ms = (time.monotonic() - start) * 1000
            logger.debug(f"Feature retrieval {entity_id}: {latency_ms:.1f}ms")

            if not raw_dict or all(v is None for v in raw_dict.values()):
                return FeatureVector(
                    entity_id=entity_id,
                    features=FEATURE_DEFAULTS.copy(),
                    source="default",
                    cache_hit=False,
                )

            # Cast types and fill in defaults for missing fields
            features = {}
            for name, default_val in FEATURE_DEFAULTS.items():
                raw_val = raw_dict.get(name)
                if raw_val is None:
                    features[name] = default_val
                elif isinstance(default_val, int):
                    features[name] = int(raw_val)
                elif isinstance(default_val, float):
                    features[name] = float(raw_val)
                else:
                    features[name] = raw_val

            return FeatureVector(
                entity_id=entity_id,
                features=features,
                source="cache",
                cache_hit=True,
            )

        except redis.RedisError as e:
            logger.error(f"Redis error fetching features for {entity_id}: {e}")
            return FeatureVector(
                entity_id=entity_id,
                features=FEATURE_DEFAULTS.copy(),
                source="default",
                cache_hit=False,
            )

    def get_features_batch(
        self,
        entity_ids: List[str],
        feature_names: Optional[List[str]] = None,
    ) -> Dict[str, FeatureVector]:
        """
        Retrieve features for multiple entities in a single pipeline.
        This is the correct way to batch feature retrieval - one round trip total.
        """
        if not entity_ids:
            return {}

        start = time.monotonic()
        pipe = self.client.pipeline(transaction=False)

        # Queue all HGETALL commands in the pipeline
        keys = [self._feature_key(eid) for eid in entity_ids]
        for key in keys:
            if feature_names:
                pipe.hmget(key, feature_names)
            else:
                pipe.hgetall(key)

        try:
            results = pipe.execute()  # Single round trip for all entities
            latency_ms = (time.monotonic() - start) * 1000
            logger.info(
                f"Batch feature retrieval {len(entity_ids)} entities: {latency_ms:.1f}ms"
            )

            output = {}
            for entity_id, raw in zip(entity_ids, results):
                if feature_names and isinstance(raw, list):
                    raw_dict = dict(zip(feature_names, raw))
                else:
                    raw_dict = raw or {}

                if not raw_dict or all(v is None for v in raw_dict.values()):
                    output[entity_id] = FeatureVector(
                        entity_id=entity_id,
                        features=FEATURE_DEFAULTS.copy(),
                        source="default",
                        cache_hit=False,
                    )
                    continue

                features = {}
                for name, default_val in FEATURE_DEFAULTS.items():
                    raw_val = raw_dict.get(name)
                    if raw_val is None:
                        features[name] = default_val
                    elif isinstance(default_val, int):
                        features[name] = int(raw_val)
                    elif isinstance(default_val, float):
                        features[name] = float(raw_val)
                    else:
                        features[name] = raw_val

                output[entity_id] = FeatureVector(
                    entity_id=entity_id,
                    features=features,
                    source="cache",
                    cache_hit=True,
                )

            return output

        except redis.RedisError as e:
            logger.error(f"Redis pipeline error: {e}")
            return {
                eid: FeatureVector(
                    entity_id=eid,
                    features=FEATURE_DEFAULTS.copy(),
                    source="default",
                    cache_hit=False,
                )
                for eid in entity_ids
            }

    def write_features(
        self,
        entity_id: str,
        features: Dict[str, Any],
        ttl_seconds: int = 7200,
    ) -> bool:
        """Write a feature vector to Redis with TTL."""
        key = self._feature_key(entity_id)
        str_features = {k: str(v) for k, v in features.items()}
        try:
            pipe = self.client.pipeline()
            pipe.hset(key, mapping=str_features)
            pipe.expire(key, ttl_seconds)
            pipe.execute()
            return True
        except redis.RedisError as e:
            logger.error(f"Redis write error for {entity_id}: {e}")
            return False

    def health_check(self) -> bool:
        """Verify Redis connectivity."""
        try:
            return self.client.ping()
        except redis.RedisError:
            return False

Pipelining: The 8x Latency Reduction

The single most important optimization in feature serving is pipelining: sending multiple Redis commands in a single network round trip. Without a pipeline, fetching features for 8 entities costs 8 round trips. With a pipeline, it costs 1.

import time
import redis

client = redis.Redis()

# Seed some data for the benchmark
for i in range(100):
    client.hset(f"user:u{i:04d}:features", mapping={
        "tx_count_1h": i,
        "tx_sum_1h": i * 45.5,
        "account_age_days": i * 10,
    })

entity_ids = [f"u{i:04d}" for i in range(100)]

# --- Sequential: N round trips ---
t0 = time.monotonic()
results_seq = {}
for eid in entity_ids:
    results_seq[eid] = client.hgetall(f"user:{eid}:features")
t_seq = (time.monotonic() - t0) * 1000

# --- Pipeline: 1 round trip ---
t0 = time.monotonic()
pipe = client.pipeline(transaction=False)
for eid in entity_ids:
    pipe.hgetall(f"user:{eid}:features")
results_pipe = dict(zip(entity_ids, pipe.execute()))
t_pipe = (time.monotonic() - t0) * 1000

print(f"Sequential (100 entities): {t_seq:.1f}ms")
print(f"Pipeline (100 entities): {t_pipe:.1f}ms")
print(f"Speedup: {t_seq / t_pipe:.1f}x")

Typical output on a local Redis instance (network RTT ~0.1ms):

Sequential (100 entities): 8.4ms
Pipeline (100 entities): 1.1ms
Speedup: 7.6x

On a remote Redis instance with 5ms RTT (realistic for a cross-AZ connection):

Sequential (100 entities): 502ms
Pipeline (100 entities): 6.2ms
Speedup: 81x

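The speedup scales with network latency: the sequential cost grows linearly with the round-trip time, while the pipelined cost stays close to a single round trip. When Redis sits in a different availability zone or data center from the model server, the pipeline advantage is enormous.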
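A back-of-the-envelope model makes the scaling explicit. The sketch below is an approximation, not a benchmark: the function names are illustrative, and the per-command server time of 0.01ms is an assumed value chosen only to show the shape of the two curves.

```python
def sequential_ms(n_commands: int, rtt_ms: float, server_ms: float = 0.0) -> float:
    """N commands issued one at a time: every command pays a full round trip."""
    return n_commands * (rtt_ms + server_ms)


def pipelined_ms(n_commands: int, rtt_ms: float, server_ms: float = 0.0) -> float:
    """N commands in one pipeline: one round trip plus the server-side work."""
    return rtt_ms + n_commands * server_ms


# 100 lookups over a 5ms round trip, assuming 0.01ms of server time per command.
seq = sequential_ms(100, rtt_ms=5.0, server_ms=0.01)
pipe = pipelined_ms(100, rtt_ms=5.0, server_ms=0.01)
print(f"sequential ~{seq:.0f}ms, pipelined ~{pipe:.0f}ms")
```

Under these assumptions the estimates land close to the measured 502ms and 6.2ms for the 5ms RTT case, which suggests round trips account for nearly all of the sequential cost.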


Pre-computation vs. On-Demand

Not all features should be pre-computed and cached. The decision depends on computation cost and update frequency:

Pre-compute and cache in Redis (read-heavy, expensive to compute):

  • Transaction velocity in the last hour (requires scanning event log)
  • User embedding (requires neural network inference)
  • Session aggregates (requires stateful streaming)

Compute inline at request time (cheap, always available from request):

  • Time since last login: now() - last_login_ts - requires only last_login_ts from Redis
  • Price delta: current_price - user_avg_purchase_price - both values from Redis, subtraction inline
  • Day of week, hour of day, is_weekend - derived from request timestamp

The rule: if the feature can be derived in microseconds from values already in the feature vector, compute it inline. If it requires a database scan, aggregation, or model inference, pre-compute it.
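The inline features listed above can be derived in one small function. This is a minimal sketch: the `derive_inline_features` name, the use of UTC for calendar features, and the assumption that `last_login_ts` and `user_avg_purchase_price` are already present in the stored feature vector are illustrative choices.

```python
from datetime import datetime, timezone
from typing import Any, Dict


def derive_inline_features(
    stored: Dict[str, Any],
    request_ts: float,
    current_price: float,
) -> Dict[str, Any]:
    """Compute cheap request-time features from stored values and the request itself."""
    now = datetime.fromtimestamp(request_ts, tz=timezone.utc)  # assumption: UTC calendar
    return {
        "seconds_since_last_login": request_ts - stored["last_login_ts"],
        "price_delta": current_price - stored["user_avg_purchase_price"],
        "day_of_week": now.weekday(),  # Monday == 0
        "hour_of_day": now.hour,
        "is_weekend": int(now.weekday() >= 5),
    }


stored = {"last_login_ts": 1741996800 - 3600, "user_avg_purchase_price": 20.0}
inline = derive_inline_features(stored, request_ts=1741996800, current_price=25.0)
print(inline)
```

Whatever logic is used here must be identical to the logic used when the training data was built, otherwise the model sees skewed inputs at serving time.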


Feature Serving Architecture

The architecture has three cache layers:

  1. In-process LRU cache (5-second TTL): eliminates Redis round trips for hot entities (top 1% of users often represent 20% of traffic). Held in the serving process's own memory.

  2. Redis Cluster (primary online store): serves 95%+ of requests. Sub-5ms p99. Volatile (in-memory), so...

  3. DynamoDB fallback (persistent backing store): handles Redis cold starts after cluster restarts, provides persistence for audit requirements. 10–30ms p99 - acceptable for the fallback path, not for the hot path.
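The read path through the three layers can be sketched as follows. This is an illustration under stated assumptions, not a production implementation: `TieredFeatureReader` is a hypothetical class, the online and backing stores are represented by plain callables (in practice they would wrap Redis and DynamoDB clients), and the local cache is a simple TTL dictionary without the size bound a real LRU cache would need.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple

Features = Dict[str, Any]
Lookup = Callable[[str], Optional[Features]]


class TieredFeatureReader:
    """Read path: local TTL cache -> online store -> backing store -> defaults."""

    def __init__(
        self,
        online: Lookup,
        backing: Lookup,
        defaults: Features,
        local_ttl_s: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self._online = online
        self._backing = backing
        self._defaults = defaults
        self._ttl = local_ttl_s
        self._clock = clock
        self._local: Dict[str, Tuple[float, Features]] = {}

    def get(self, entity_id: str) -> Tuple[Features, str]:
        """Return (features, source) where source names the tier that answered."""
        now = self._clock()
        hit = self._local.get(entity_id)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1], "local"
        for name, lookup in (("online", self._online), ("backing", self._backing)):
            try:
                features = lookup(entity_id)
            except Exception:
                features = None  # treat a failing tier as a miss and fall through
            if features:
                self._local[entity_id] = (now, features)
                return features, name
        return dict(self._defaults), "default"


reader = TieredFeatureReader(
    online={"u1": {"tx_count_1h": 14}}.get,
    backing={"u2": {"tx_count_1h": 3}}.get,
    defaults={"tx_count_1h": 0},
)
print(reader.get("u1"), reader.get("u2"), reader.get("u3"))
```

Returning the source alongside the features makes it easy to emit a per-tier hit-rate metric, which is the first thing to look at when latency regresses.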


DynamoDB for Feature Serving

DynamoDB is the right choice over Redis when:

  • Persistence is a requirement (audit, compliance, recovery without re-warming)
  • Auto-scaling is needed (variable traffic without manual capacity management)
  • Multi-region active-active serving is required (DynamoDB Global Tables)
  • Cost matters for low or spiky traffic: Redis requires an always-on dedicated cluster, while DynamoDB in on-demand mode is billed per request

Partition key design: use the entity ID as the partition key. Features are attributes on the item.

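import boto3
from decimal import Decimal
from typing import Any, Dict, List

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
table = dynamodb.Table("online-features")

def write_user_features(user_id: str, features: Dict[str, Any]) -> None:
    # The boto3 resource API rejects Python floats; convert them to Decimal.
    item = {
        k: Decimal(str(v)) if isinstance(v, float) else v
        for k, v in features.items()
    }
    table.put_item(Item={"user_id": user_id, **item})

def get_user_features(user_id: str) -> Dict[str, Any]:
    response = table.get_item(
        Key={"user_id": user_id},
        ProjectionExpression="tx_count_1h, tx_sum_1h, account_age_days, avg_tx_90d",
    )
    return response.get("Item", {})

# Batch get for multiple users (up to 100 keys per call)
def batch_get_features(user_ids: List[str]) -> Dict[str, Dict[str, Any]]:
    request = {"online-features": {"Keys": [{"user_id": uid} for uid in user_ids]}}
    items: Dict[str, Dict[str, Any]] = {}
    # DynamoDB may return only part of a batch; retry whatever comes back
    # unprocessed. Production code should add backoff between attempts.
    while request:
        response = dynamodb.batch_get_item(RequestItems=request)
        for item in response["Responses"].get("online-features", []):
            items[item["user_id"]] = item
        request = response.get("UnprocessedKeys") or {}
    return items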

DynamoDB read capacity: 1 RCU = 1 strongly consistent read per second of an item up to 4KB, or 2 eventually consistent reads per second. A feature vector with 20 features is typically under 1KB, so each read consumes the 4KB minimum. Plan for 1 RCU per read request for strong consistency and 1 RCU per 2 read requests for eventual consistency. At 10,000 reads/second, you need 10,000 RCUs for strong consistency or 5,000 RCUs for eventual consistency.
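The capacity arithmetic can be captured in a small helper. This sketch relies on DynamoDB's published read-unit rules (one unit per 4KB for a strongly consistent read, half a unit for an eventually consistent one); the `required_rcus` function name and signature are illustrative.

```python
import math


def required_rcus(reads_per_second: float, item_size_bytes: int, strongly_consistent: bool) -> int:
    """Provisioned read capacity units needed for a steady read rate."""
    # Item size is rounded up to the next 4KB boundary for billing purposes.
    units_per_read = math.ceil(item_size_bytes / 4096)
    if not strongly_consistent:
        # Eventually consistent reads cost half a unit each.
        return math.ceil(reads_per_second * units_per_read / 2)
    return math.ceil(reads_per_second * units_per_read)


strong = required_rcus(10_000, item_size_bytes=1024, strongly_consistent=True)
eventual = required_rcus(10_000, item_size_bytes=1024, strongly_consistent=False)
print(f"strong: {strong} RCUs, eventual: {eventual} RCUs")
```

Because billing rounds up to 4KB, a 1KB feature vector costs the same to read as a 4KB one, which is worth keeping in mind before splitting features across multiple items.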


Bigtable for Feature Serving

Google Bigtable suits scenarios with very large feature vectors (hundreds of columns), time-series feature history requirements, or Google Cloud infrastructure.

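Row key design: {entity_type}#{entity_id} - for user features: user#u12345. Bigtable stores rows in lexicographic key order, so neighboring keys are served by the same node. Prefixing with the entity type keeps each entity type's rows contiguous; to avoid hot spots, do not use sequential numeric IDs or timestamps as the leading part of the row key, because they concentrate traffic on a single node.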

Column families: group features by update frequency - velocity_features, historical_features, session_features. Different column families can have different garbage collection policies (TTL).

from google.cloud import bigtable
from google.cloud.bigtable import column_family

client = bigtable.Client(project="my-project", admin=True)
instance = client.instance("feature-store")
table = instance.table("user-features")

def read_user_features(user_id: str) -> dict:
    row_key = f"user#{user_id}".encode()
    row = table.read_row(row_key)
    if row is None:
        return {}

    features = {}
    for cf_id, cf_data in row.cells.items():
        for col_qualifier, cell_list in cf_data.items():
            col_name = col_qualifier.decode()
            features[col_name] = cell_list[0].value.decode()
    return features

Bigtable excels when feature counts are in the hundreds or thousands (wide rows), when you need time-series access to feature history (each cell has a timestamp version), or when you are already on Google Cloud and want managed infrastructure.


Latency Targets by Use Case

| Use Case | Total SLA | Feature Budget | Store |
| --- | --- | --- | --- |
| Real-time fraud detection | 50ms | 10ms | Redis, in-process cache |
| E-commerce recommendation | 100ms | 20ms | Redis |
| Search ranking | 200ms | 30ms | Redis or DynamoDB |
| Ad bidding | 10ms | 3ms | In-process cache only |
| Content personalization | 200ms | 40ms | Redis |
| Batch scoring | no constraint | no constraint | S3 / Iceberg |

:::danger Never Fetch Features in a Loop
The most common production latency bug: fetching features for N entities in a for loop, issuing one Redis call per entity. This multiplies your network round-trip cost by N. For N=100 entities in a batch recommendation call with 5ms Redis RTT, that is 500ms just in round trips.

Wrong:

features = {}
for user_id in user_ids:  # N=100 → 100 round trips → 500ms
    features[user_id] = redis_client.hgetall(f"user:{user_id}:features")

Right:

pipe = redis_client.pipeline(transaction=False)
for user_id in user_ids:
    pipe.hgetall(f"user:{user_id}:features")
results = pipe.execute()  # 1 round trip → ~6ms
features = dict(zip(user_ids, results))

Always use pipeline or batch get APIs.
:::

:::danger Never Store Full Event History in the Online Store
The online store is for current feature values, not event history. Storing all transactions for a user in Redis to enable arbitrary window queries means: unbounded memory growth, increasing HGETALL latency as the hash grows, and eviction of other users' features during memory pressure.

Pre-compute the feature values (tx_count_1h, tx_sum_24h) in the streaming pipeline and store only the final feature vector in Redis. Move the event log to a separate sorted set with aggressive TTL. The model server should never see raw events.
:::


Interview Q&A

Q: Why is Redis the dominant choice for online feature stores, and what are its limitations?

Redis is dominant because its memory-first architecture delivers O(1) hash field operations at sub-millisecond latency, it has native pipeline support for batching multiple lookups into one round trip, and it supports rich data structures (Hash, Sorted Set, String) that map naturally to feature vectors and rankings. Limitations: it is volatile by default (restarts lose data, requiring re-warming from a backing store), it is expensive at scale (memory is costly), and it is single-threaded per command (though this rarely bottlenecks in practice). For persistence, pair Redis with DynamoDB as a backing store and copy periodic RDB snapshots to S3.

Q: A feature serving system has a 100ms SLA. Profiling shows feature retrieval taking 70ms at p99. What is your diagnosis and fix?

Diagnosis: sequential network calls. At 70ms for feature retrieval with a 5ms Redis RTT, there are roughly 14 sequential calls. Investigate the call pattern. Are features being fetched entity by entity in a loop? Are different feature groups being fetched from different services sequentially?

Fix: consolidate all features for an entity into a single Redis Hash, fetch with one HGETALL call (or one HMGET for specific fields). If features span multiple services, introduce a feature server that fans out requests in parallel (not sequential) and returns a single response. For truly latency-critical paths, add an in-process LRU cache to eliminate the network hop entirely for hot entities.
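The parallel fan-out mentioned in the fix can be sketched with `asyncio`. This is an illustration only: `fetch_group` is a hypothetical stand-in that simulates a feature service with a sleep, and the group names, delays, and timeout are made-up values. The point is that total latency is bounded by the slowest group (or the timeout), not by the sum of all groups.

```python
import asyncio
from typing import Any, Dict


async def fetch_group(name: str, delay_s: float) -> Dict[str, Any]:
    """Hypothetical stand-in for a call to one feature service."""
    await asyncio.sleep(delay_s)
    return {f"{name}_feature": 1}


async def fetch_all_parallel(groups: Dict[str, float], timeout_s: float) -> Dict[str, Any]:
    """Fan out to every group concurrently and merge whatever returns in time."""
    tasks = [asyncio.wait_for(fetch_group(n, d), timeout_s) for n, d in groups.items()]
    results = await asyncio.gather(*tasks, return_exceptions=True)
    merged: Dict[str, Any] = {}
    for result in results:
        if isinstance(result, dict):  # timed-out or failed groups are skipped
            merged.update(result)
    return merged


features = asyncio.run(
    fetch_all_parallel({"profile": 0.01, "velocity": 0.01, "slow": 5.0}, timeout_s=0.2)
)
print(features)
```

Features from a group that timed out would then be filled from the training-time defaults, the same way a cache miss is handled.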

Q: How do you handle feature serving when Redis is down?

The system must be designed with three levels of fallback: (1) in-process LRU cache - if the feature was retrieved recently for a hot entity, the cache has a valid value; serve it. (2) DynamoDB or Bigtable as a backing store - slower (20–30ms) but persistent and highly available; tolerable for the fallback path. (3) Default values - if both stores are unavailable, serve the pre-defined default values that were used during model training. The model must be designed to be deployable with defaults - if it catastrophically fails on default inputs, the fallback is unsafe. Circuit breakers around each Redis call prevent cascading failures.
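A circuit breaker of the kind described can be quite small. The sketch below is one possible shape, assuming a consecutive-failure threshold and a fixed cool-down; the `CircuitBreaker` class, its parameters, and its defaults are hypothetical rather than taken from any particular library.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Opens after N consecutive failures; allows a trial call after a cool-down."""

    def __init__(
        self,
        failure_threshold: int = 5,
        reset_after_s: float = 10.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow(self) -> bool:
        """True if the caller should attempt the Redis call."""
        if self._opened_at is None:
            return True
        # Half-open: once the cool-down has elapsed, let trial requests through.
        return self._clock() - self._opened_at >= self.reset_after_s

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()  # (re)open and restart the cool-down


breaker = CircuitBreaker(failure_threshold=3, reset_after_s=10.0)
print(breaker.allow())
```

In the serving path, a call site would check `allow()` before touching Redis, go straight to the next fallback level when it returns False, and report the outcome with `record_success()` or `record_failure()`.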

Q: How do you right-size a Redis cluster for a feature store?

Memory sizing: measure the average size of one feature vector per entity (use MEMORY USAGE user:u123:features to get the number of bytes the key occupies). Multiply by number of active entities (entities with traffic in the past TTL window). Add 30% overhead for Redis internals. For example: 200 bytes per user × 10M active users × 1.3 overhead = 2.6GB. For cluster sizing, target 60–70% memory utilization and keep the remainder as headroom.

CPU sizing: Redis is single-threaded per command. At 100K commands/second, a single Redis instance is comfortable. Above 500K/second, consider Redis Cluster to distribute load.

Replication: always run with at least one replica per shard for HA. Use Redis Sentinel or Redis Cluster for automatic failover.
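The memory arithmetic can be scripted so it is re-run whenever the entity count or feature vector size changes. A minimal sketch, assuming illustrative function names and a 65% target utilization (the midpoint of the 60–70% range):

```python
def required_memory_gb(bytes_per_entity: float, active_entities: int, overhead: float = 1.3) -> float:
    """Memory needed for the data itself, including overhead for Redis internals."""
    return bytes_per_entity * active_entities * overhead / 1e9


def provisioned_memory_gb(required_gb: float, target_utilization: float = 0.65) -> float:
    """Memory to provision so that the data fills only the target fraction of it."""
    return required_gb / target_utilization


needed = required_memory_gb(200, 10_000_000)
provisioned = provisioned_memory_gb(needed)
print(f"data: {needed:.1f}GB, provision: {provisioned:.1f}GB")
```

With replicas, the provisioned figure applies per copy of the data, so total cluster memory scales with the replica count.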

Q: When would you choose DynamoDB over Redis for feature serving?

Choose DynamoDB when: persistence is a hard requirement (features must survive restarts without re-warming), traffic is bursty and unpredictable (DynamoDB auto-scales, Redis requires manual capacity planning), the use case is global multi-region (DynamoDB Global Tables for active-active replication), or the feature vector is large and infrequently read (DynamoDB's per-request pricing is cheaper than provisioning memory for cold data). The trade-off: DynamoDB p99 is 10–30ms vs. Redis at 1–5ms. For latency-critical fraud detection (less than 50ms total SLA), Redis is required. For recommendation serving (100–200ms SLA), DynamoDB may be sufficient.

Q: Explain the connection pool exhaustion problem in feature serving under peak traffic.

Each Redis call borrows a connection from the pool for the duration of the call. By Little's law, the average number of connections in use is peak_rps × avg_latency_ms / 1000: at 10,000 requests/second with each call holding a connection for 5ms, about 50 connections are busy at any moment. If max_connections is set below that, new requests either block waiting for a free connection (as with redis-py's BlockingConnectionPool) or fail immediately with a connection error (as with redis-py's default ConnectionPool). Blocking adds the wait time to the feature retrieval latency; under sustained overload the queue grows without bound and the system falls over.

Fix: (1) right-size the pool: max_connections = (peak_rps × avg_latency_ms / 1000) × 1.5. (2) Set socket_timeout aggressively (50ms) so slow Redis calls fail fast and release connections. (3) Add a circuit breaker: if the pool is more than 80% utilized, trip the breaker and serve defaults rather than queueing more requests into an overloaded pool.
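The sizing formula and the utilization check translate directly into code. This is a sketch with hypothetical helper names; how the number of in-use connections is obtained depends on the client library and is left out here.

```python
import math


def pool_size(peak_rps: float, avg_latency_ms: float, safety_factor: float = 1.5) -> int:
    """max_connections = (peak_rps x avg_latency_ms / 1000) x safety factor."""
    in_flight = peak_rps * avg_latency_ms / 1000  # Little's law: busy connections
    return math.ceil(in_flight * safety_factor)


def should_shed(in_use: int, max_connections: int, threshold: float = 0.8) -> bool:
    """True when pool utilization is above the threshold and defaults should be served."""
    return in_use / max_connections > threshold


size = pool_size(peak_rps=10_000, avg_latency_ms=5)
print(f"max_connections: {size}")
print(f"shed at 61 in use: {should_shed(61, size)}")
```

Note that the estimate uses average latency; if p99 latency is much higher than the average, a larger safety factor may be warranted.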

© 2026 EngineersOfAI. All rights reserved.