Skip to main content

:::tip ๐ŸŽฎ Interactive Playground Visualize this concept: Try the Feature Store Architecture demo on the EngineersOfAI Playground - no code required. :::

Production Patterns

Three Failures at 50x Loadโ€‹

A real-time ML system had been running flawlessly for months in staging. The architecture was solid: Kafka for event streaming, Flink for stream-to-feature computation, Redis Cluster for online serving, a FastAPI model server. Performance tests with 1x expected load showed excellent results - sub-20ms feature retrieval, no errors.

On the day of production launch, traffic was 50x the staging volume. Three things broke within the first hour.

Failure one: Redis connection pool exhaustion. The feature server had been configured with max_connections=20 - fine for staging's 200 req/sec, completely inadequate for production's 10,000 req/sec. Every connection held for 5ms means the system needs at least 50 simultaneous connections at 10,000 req/sec. With max_connections=20, requests queued behind a full pool. The queue grew. Latency climbed from 5ms to 800ms. The feature server appeared to function but was silently destroying the model's SLA.

Failure two: Flink pipeline falling behind. The Flink job processing click events had been tuned for a throughput of 50,000 events per second. Production traffic peaked at 250,000 events per second during the first viral moment. The Flink job's consumer lag climbed from 0 to 4 million messages in 90 minutes. Session features - items clicked in the last 30 minutes - were now reflecting a state from 90 minutes ago. Users were getting recommendations based on sessions that had already ended.

Failure three: Staleness alerts not catching the lag. The monitoring team had set a staleness alert threshold of 30 minutes: "alert if any feature is more than 30 minutes old." The Flink lag was 90 minutes by the time the alert fired. The threshold had been set during staging using a rough estimate of acceptable staleness, never validated against actual SLAs.

All three failures were predictable. All three had been described in design reviews and marked as "nice to have" improvements post-launch. This lesson is the checklist that prevents them.


Why This Existsโ€‹

Real-time ML systems fail differently than batch systems. Batch pipeline failures are visible - the job status turns red, the data doesn't arrive. Real-time system failures are often invisible: the system continues to produce predictions, features continue to be served, but the values are stale, incorrect, or from the wrong distribution. By the time the degradation is detected through model metrics, thousands or millions of incorrect predictions have been made.

Production real-time feature engineering requires the same reliability engineering discipline applied to any high-availability distributed system: circuit breakers, connection pooling, backpressure handling, graceful degradation, load testing at realistic scale, and observability at every layer.


The Production Readiness Checklistโ€‹

Before launching a real-time feature system, verify all 20 items:

Reliability

  • Connection pools sized for peak load (not average load)
  • Circuit breakers on all external calls (Redis, DynamoDB, downstream services)
  • Fallback values defined for every feature (match training defaults exactly)
  • Fallback path tested: manually break Redis and verify the system serves defaults, not errors
  • Timeout set on all Redis/DynamoDB calls (50ms max - fail fast, use fallback)

Performance

  • Feature retrieval uses pipeline/batch (no sequential loops over entities)
  • Load test at 10x expected peak, not 1x expected average
  • p99 latency profiled under load (not just p50)
  • Local in-process cache for hot entities (top 1% of traffic)
  • Redis key TTLs set on all keys (no unbounded growth)

Observability

  • Feature staleness metric (age of most recently computed feature) emitted per feature group
  • Staleness alert threshold set to 2x the intended refresh interval (not 30 minutes if refresh is 1 minute)
  • Cache hit rate tracked (low hit rate โ†’ cold start or eviction problem)
  • Feature value distribution logged (mean, p99, null rate per feature)
  • Consumer lag tracked for Flink/Kafka pipelines with alert at 5-minute lag

Operations

  • Runbook for each failure mode (Redis down, Flink lag, schema change emergency)
  • Multi-region readiness (or explicit single-region SLA acknowledgment)
  • Canary deployment process for feature changes
  • Incident response: pager rotation, escalation path, defined recovery SLA
  • Cost model: Redis memory cost at 2x current entity count

Connection Poolingโ€‹

The most common production failure in feature serving under load.

Sizing the pool correctly:

min_connections=peak_rpsร—avg_latency_msรท1000\text{min\_connections} = \text{peak\_rps} \times \text{avg\_latency\_ms} \div 1000 max_connections=min_connectionsร—1.5\text{max\_connections} = \text{min\_connections} \times 1.5

For 10,000 req/sec and 5ms Redis latency:

  • min_connections = 10,000 ร— 0.005 = 50
  • max_connections = 75
import redis
import time
import threading
from typing import Optional
import logging

logger = logging.getLogger(__name__)


class PooledFeatureClient:
"""
Redis feature client with properly sized connection pool,
timeout management, and pool exhaustion detection.
"""

def __init__(
self,
redis_url: str,
peak_rps: int,
avg_latency_ms: float = 5.0,
safety_factor: float = 1.5,
socket_timeout_ms: float = 50.0,
):
# Size the pool for peak load
min_connections = int(peak_rps * avg_latency_ms / 1000)
max_connections = int(min_connections * safety_factor)

logger.info(
f"Sizing Redis pool: peak_rps={peak_rps}, "
f"min_connections={min_connections}, max_connections={max_connections}"
)

self.pool = redis.ConnectionPool.from_url(
redis_url,
max_connections=max_connections,
socket_timeout=socket_timeout_ms / 1000,
socket_connect_timeout=0.1,
retry_on_timeout=False, # Fail fast - use fallback
)
self.client = redis.Redis(connection_pool=self.pool)
self._pool_exhaustion_count = 0

def get_features(self, entity_id: str) -> Optional[dict]:
"""
Retrieve features. Returns None on any Redis error.
Caller is responsible for applying fallback values.
"""
try:
start = time.monotonic()
key = f"user:{entity_id}:features"
raw = self.client.hgetall(key)
latency_ms = (time.monotonic() - start) * 1000

if latency_ms > 40:
logger.warning(f"Slow Redis fetch for {entity_id}: {latency_ms:.1f}ms")

return {k.decode(): v.decode() for k, v in raw.items()} if raw else None

except redis.exceptions.ConnectionError:
self._pool_exhaustion_count += 1
logger.error(f"Redis pool exhausted. Total exhaustions: {self._pool_exhaustion_count}")
return None
except redis.exceptions.TimeoutError:
logger.warning(f"Redis timeout for entity {entity_id}")
return None
except redis.RedisError as e:
logger.error(f"Redis error: {e}")
return None

def pool_utilization(self) -> float:
"""Return fraction of connections in use (0.0 to 1.0)."""
in_use = self.pool._created_connections - self.pool._available_connections.qsize()
return in_use / self.pool.max_connections

Circuit Breakersโ€‹

A circuit breaker prevents cascading failures: when a downstream service (Redis) is degraded, the circuit breaker trips and the system immediately serves fallback values instead of queueing requests behind a failing service.

import time
import threading
from enum import Enum
from typing import Callable, Any, Optional

class CircuitState(Enum):
CLOSED = "closed" # Normal: requests pass through
OPEN = "open" # Tripped: requests fail immediately with fallback
HALF_OPEN = "half_open" # Testing: one request allowed through to check recovery


class CircuitBreaker:
"""
Circuit breaker for feature store calls.

States:
- CLOSED: normal operation, requests pass through
- OPEN: failure threshold exceeded, return fallback immediately
- HALF_OPEN: after reset_timeout, allow one request through to test recovery
"""

def __init__(
self,
failure_threshold: int = 5, # Trip after 5 failures in window
failure_window_seconds: float = 10.0,
reset_timeout_seconds: float = 30.0,
success_threshold: int = 2, # Close after 2 successes in HALF_OPEN
):
self.failure_threshold = failure_threshold
self.failure_window = failure_window_seconds
self.reset_timeout = reset_timeout_seconds
self.success_threshold = success_threshold

self._state = CircuitState.CLOSED
self._failures: list = []
self._last_open_time: float = 0.0
self._half_open_successes: int = 0
self._lock = threading.Lock()

@property
def state(self) -> CircuitState:
with self._lock:
if self._state == CircuitState.OPEN:
if time.time() - self._last_open_time > self.reset_timeout:
self._state = CircuitState.HALF_OPEN
self._half_open_successes = 0
return self._state

def call(self, fn: Callable, fallback: Any = None) -> Any:
"""
Execute fn through the circuit breaker.
Returns fallback if the circuit is OPEN.
"""
state = self.state

if state == CircuitState.OPEN:
return fallback

try:
result = fn()
self._on_success()
return result
except Exception as e:
self._on_failure()
return fallback

def _on_success(self):
with self._lock:
if self._state == CircuitState.HALF_OPEN:
self._half_open_successes += 1
if self._half_open_successes >= self.success_threshold:
self._state = CircuitState.CLOSED
self._failures.clear()

def _on_failure(self):
now = time.time()
with self._lock:
# Evict old failures outside the window
self._failures = [t for t in self._failures if now - t < self.failure_window]
self._failures.append(now)

if len(self._failures) >= self.failure_threshold:
if self._state != CircuitState.OPEN:
self._state = CircuitState.OPEN
self._last_open_time = now
logger.error(
f"Circuit breaker OPEN after {len(self._failures)} failures "
f"in {self.failure_window}s"
)


# Usage: wrap Redis calls with circuit breaker
FEATURE_DEFAULTS = {
"tx_count_1h": 0,
"tx_sum_1h": 0.0,
"account_age_days": 0,
}

redis_breaker = CircuitBreaker(failure_threshold=5, reset_timeout_seconds=30)

def get_features_with_breaker(entity_id: str, redis_client: redis.Redis) -> dict:
def fetch():
raw = redis_client.hgetall(f"user:{entity_id}:features")
return {k.decode(): v.decode() for k, v in raw.items()} or FEATURE_DEFAULTS

return redis_breaker.call(fetch, fallback=FEATURE_DEFAULTS)

Backpressure Handlingโ€‹

When the Flink pipeline falls behind the Kafka event stream, consumer lag grows. Features become stale. The system must detect this and alert before staleness exceeds the acceptable threshold.

from kafka import KafkaAdminClient, KafkaConsumer
from kafka.structs import TopicPartition
import time


def get_consumer_lag(
bootstrap_servers: str,
group_id: str,
topic: str,
) -> dict:
"""
Compute consumer lag per partition.
Returns {partition: lag_messages}.
"""
admin = KafkaAdminClient(bootstrap_servers=bootstrap_servers)
consumer = KafkaConsumer(bootstrap_servers=bootstrap_servers)

# Get latest offsets (end of topic)
partitions = consumer.partitions_for_topic(topic)
tp_list = [TopicPartition(topic, p) for p in partitions]
end_offsets = consumer.end_offsets(tp_list)

# Get committed offsets for the consumer group
committed = {
tp: admin.list_consumer_group_offsets(group_id).get(tp)
for tp in tp_list
}

lag = {}
for tp in tp_list:
end = end_offsets.get(tp, 0)
committed_offset = committed.get(tp)
if committed_offset:
lag[tp.partition] = end - committed_offset.offset
else:
lag[tp.partition] = end # No committed offset โ†’ all messages are unconsumed

consumer.close()
admin.close()
return lag


def lag_to_staleness_minutes(lag_messages: int, events_per_second: float) -> float:
"""Estimate how stale features are given the current lag."""
lag_seconds = lag_messages / max(events_per_second, 1)
return lag_seconds / 60


# In a monitoring loop (run as a separate process or CloudWatch scheduled rule)
def monitor_pipeline_lag():
ALERT_LAG_MESSAGES = 100_000 # Alert if lag > 100K messages
EVENTS_PER_SECOND = 50_000

lag = get_consumer_lag(
bootstrap_servers="kafka:9092",
group_id="session-feature-pipeline",
topic="click-events",
)

total_lag = sum(lag.values())
staleness_min = lag_to_staleness_minutes(total_lag, EVENTS_PER_SECOND)

if total_lag > ALERT_LAG_MESSAGES:
logger.error(
f"ALERT: Flink pipeline lag={total_lag:,} messages "
f"(~{staleness_min:.1f} minutes of staleness). "
f"Consider scaling Flink task managers."
)

Graceful Degradationโ€‹

Design the fallback before you need it. For every real-time feature, answer these questions at design time, not incident time:

  1. What happens if the feature store is unavailable? (Default values? Batch fallback?)
  2. What happens if the feature is stale beyond the acceptable threshold? (Serve or reject?)
  3. What is the model's behavior when all features are defaults? (Is it safe? Does it bias toward false negatives?)
from enum import Enum
from typing import Dict, Any, Optional
import time

class DegradationLevel(Enum):
HEALTHY = "healthy" # All features fresh from online store
STALE = "stale" # Features served but beyond freshness threshold
DEGRADED = "degraded" # Online store unavailable, serving defaults
CRITICAL = "critical" # Cannot serve the model safely


class DegradedFeatureServer:
"""
Feature server with explicit degradation levels.
Each level has a defined behavior and produces observability signals.
"""

FRESHNESS_THRESHOLD_SECONDS = 120 # Alert if features older than 2 minutes
STALENESS_LIMIT_SECONDS = 600 # Serve degraded if older than 10 minutes

def __init__(
self,
online_client, # Redis client
defaults: Dict[str, Any],
):
self.online = online_client
self.defaults = defaults

def get_features(self, entity_id: str) -> tuple:
"""
Returns (features, degradation_level).
Callers can log the degradation_level for observability.
"""
try:
raw = self.online.get_features(entity_id)
if raw is None:
# Cache miss - new entity or evicted
return self.defaults.copy(), DegradationLevel.DEGRADED

computed_at = float(raw.get("computed_at", 0))
age_seconds = time.time() - computed_at

if age_seconds > self.STALENESS_LIMIT_SECONDS:
# Too stale - serve defaults rather than mislead the model
logger.warning(
f"Feature staleness {age_seconds:.0f}s > limit "
f"{self.STALENESS_LIMIT_SECONDS}s for {entity_id}. "
f"Serving defaults."
)
return self.defaults.copy(), DegradationLevel.STALE

if age_seconds > self.FRESHNESS_THRESHOLD_SECONDS:
# Serve but flag as stale
features = {**self.defaults, **raw}
return features, DegradationLevel.STALE

return {**self.defaults, **raw}, DegradationLevel.HEALTHY

except Exception as e:
logger.error(f"Feature retrieval error for {entity_id}: {e}")
return self.defaults.copy(), DegradationLevel.CRITICAL

The Full ProductionFeatureServerโ€‹

import redis
import time
import logging
from typing import Dict, Any, Optional, List
from fastapi import FastAPI, Response
from pydantic import BaseModel
from prometheus_client import Counter, Histogram, Gauge, generate_latest

logger = logging.getLogger(__name__)

# Prometheus metrics
FEATURE_REQUESTS = Counter("feature_requests_total", "Total feature requests", ["status"])
FEATURE_LATENCY = Histogram("feature_latency_seconds", "Feature retrieval latency",
buckets=[0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25])
FEATURE_STALENESS = Gauge("feature_staleness_seconds", "Age of most recent feature", ["entity_type"])
CACHE_HIT_RATE = Gauge("feature_cache_hit_rate", "Fraction of requests served from cache")


FEATURE_DEFAULTS = {
"tx_count_1h": 0, "tx_sum_1h": 0.0, "tx_count_24h": 0,
"tx_sum_24h": 0.0, "account_age_days": 0, "avg_tx_90d": 0.0,
"distinct_merchants_7d": 0, "credit_score_bucket": 2, "is_premium": 0,
"last_login_ts": 0,
}


class ProductionFeatureServer:
def __init__(self, redis_url: str, peak_rps: int = 10000):
# Sized for peak load
max_connections = int(peak_rps * 0.005 * 1.5) # peak_rps ร— avg_latency ร— safety

self.pool = redis.ConnectionPool.from_url(
redis_url,
max_connections=max_connections,
socket_timeout=0.05,
socket_connect_timeout=0.1,
retry_on_timeout=False,
decode_responses=True,
)
self.client = redis.Redis(connection_pool=self.pool)

# In-process LRU for hot entities
from functools import lru_cache
self._hot_cache: Dict[str, tuple] = {} # entity_id โ†’ (features, expires_at)
self._hot_cache_ttl = 5.0 # 5-second in-process cache

# Circuit breaker
self.breaker = CircuitBreaker(failure_threshold=10, reset_timeout_seconds=30)

# Hit rate tracking
self._total_requests = 0
self._cache_hits = 0

def _try_hot_cache(self, entity_id: str) -> Optional[Dict]:
entry = self._hot_cache.get(entity_id)
if entry and time.time() < entry[1]:
return entry[0]
return None

def _set_hot_cache(self, entity_id: str, features: Dict):
self._hot_cache[entity_id] = (features, time.time() + self._hot_cache_ttl)
# Evict oldest entries if cache grows too large
if len(self._hot_cache) > 10000:
oldest = min(self._hot_cache, key=lambda k: self._hot_cache[k][1])
del self._hot_cache[oldest]

def get_features(self, entity_id: str) -> Dict[str, Any]:
self._total_requests += 1
start = time.monotonic()

# Layer 1: in-process hot cache
cached = self._try_hot_cache(entity_id)
if cached:
self._cache_hits += 1
FEATURE_REQUESTS.labels(status="hot_cache").inc()
return cached

# Layer 2: Redis (via circuit breaker)
def redis_fetch():
key = f"user:{entity_id}:features"
return self.client.hgetall(key)

raw = self.breaker.call(redis_fetch, fallback=None)

latency = time.monotonic() - start
FEATURE_LATENCY.observe(latency)

if raw is None or not raw:
FEATURE_REQUESTS.labels(status="miss").inc()
return FEATURE_DEFAULTS.copy()

# Parse features
features = {}
for name, default_val in FEATURE_DEFAULTS.items():
raw_val = raw.get(name)
if raw_val is None:
features[name] = default_val
elif isinstance(default_val, int):
features[name] = int(raw_val)
elif isinstance(default_val, float):
features[name] = float(raw_val)
else:
features[name] = raw_val

# Track staleness
if "computed_at" in raw:
age = time.time() - float(raw["computed_at"])
FEATURE_STALENESS.labels(entity_type="user").set(age)

# Populate hot cache
self._set_hot_cache(entity_id, features)
FEATURE_REQUESTS.labels(status="hit").inc()
self._cache_hits += 1

if self._total_requests % 1000 == 0:
hit_rate = self._cache_hits / self._total_requests
CACHE_HIT_RATE.set(hit_rate)

return features

def get_features_batch(self, entity_ids: List[str]) -> Dict[str, Dict]:
results = {}
cache_misses = []

# Check hot cache first
for eid in entity_ids:
cached = self._try_hot_cache(eid)
if cached:
results[eid] = cached
else:
cache_misses.append(eid)

if not cache_misses:
return results

# Pipeline Redis for misses
def batch_fetch():
pipe = self.client.pipeline(transaction=False)
for eid in cache_misses:
pipe.hgetall(f"user:{eid}:features")
return pipe.execute()

redis_results = self.breaker.call(batch_fetch, fallback=[None] * len(cache_misses))

for eid, raw in zip(cache_misses, redis_results):
if not raw:
results[eid] = FEATURE_DEFAULTS.copy()
continue

features = {}
for name, default_val in FEATURE_DEFAULTS.items():
raw_val = raw.get(name)
if raw_val is None:
features[name] = default_val
elif isinstance(default_val, int):
features[name] = int(raw_val)
elif isinstance(default_val, float):
features[name] = float(raw_val)
else:
features[name] = raw_val

results[eid] = features
self._set_hot_cache(eid, features)

return results

def health_check(self) -> dict:
redis_ok = False
try:
redis_ok = self.client.ping()
except:
pass

return {
"status": "healthy" if redis_ok else "degraded",
"redis": "ok" if redis_ok else "down",
"circuit_breaker": self.breaker.state.value,
"pool_utilization": (
self.pool._created_connections / self.pool.max_connections
if self.pool.max_connections > 0 else 0
),
"cache_hit_rate": self._cache_hits / max(self._total_requests, 1),
}


# FastAPI application
app = FastAPI()
feature_server = ProductionFeatureServer(redis_url="redis://redis:6379", peak_rps=10000)


@app.get("/features/{entity_id}")
def get_features(entity_id: str):
return feature_server.get_features(entity_id)


@app.get("/health")
def health():
return feature_server.health_check()


@app.get("/metrics")
def metrics():
return Response(generate_latest(), media_type="text/plain")

Production Architecture with Resilience Patternsโ€‹


Operational Runbook: 10 Common Failure Scenariosโ€‹

#SymptomDiagnosisMitigation
1p99 latency > 100msCheck pool utilization; if >80%, pool exhaustedScale feature servers; increase max_connections
2Feature staleness alertCheck Flink consumer lag; if lag > 5 min, pipeline fell behindScale Flink task managers; check for backpressure
3High null rate in featuresCheck Redis memory (if >85% used, eviction happening)Add Redis nodes; reduce TTL; increase cluster size
4Circuit breaker OPENCheck Redis connectivity; if Redis down, serve defaultsRedis failover (Sentinel/Cluster auto-failover); DynamoDB fallback
5Model predictions shiftedCheck feature distribution logs; compare to 7-day baselineRoll back last feature change; retrain if feature bug
6Cache hit rate < 50%Check if entity distribution changed; may be new user typesPre-warm cache for new entities; increase LRU cache size
7Feature server OOMCheck hot cache size; may be unbounded entity accumulationAdd LRU eviction limit; profile cache memory
8Flink checkpoint failuresCheck S3 connectivity and RocksDB state sizeIncrease checkpoint timeout; archive old state
9DLQ volume spikeCheck upstream producer changes; schema change not registeredRegister new schema version; restart pipeline with new JAR
10Redis cluster partitionCheck Redis Cluster node status; slave lagForce failover to healthy primary; verify replica count

Multi-Region Feature Servingโ€‹

For global systems, Redis must be replicated to serving regions to avoid cross-region latency:

US-East (primary) US-West (replica) EU-West (replica)
Redis Primary โ†’ Redis Replica โ†’ Redis Replica
Flink writes โ† read-only serving โ† read-only serving

Redis replication is asynchronous - replicas may lag 10โ€“100ms behind the primary under write load. For fraud detection where freshness is critical, route all writes to the primary and all reads to the local replica, accepting a small freshness lag. For recommendation where freshness is less critical, the eventual consistency is acceptable.

DynamoDB Global Tables offer active-active multi-region with ~1 second replication lag - appropriate for features that can tolerate 1 second of cross-region staleness.


Feature SLOsโ€‹

Define and monitor Service Level Objectives for each feature group:

Feature GroupFreshness SLORetrieval p99 SLOAvailability SLO
Velocity (1h window)< 30 seconds stale< 10ms99.9%
Session features< 60 seconds stale< 10ms99.9%
Historical aggregates< 1 hour stale< 10ms99.9%
Embeddings< 24 hours stale< 5ms99.99%
Static profile< 7 days stale< 10ms99.99%

:::danger No Fallback Strategy Is a Single Point of Failure A feature system without a defined fallback strategy will fail the entire model endpoint when the feature store is unavailable. Redis has 99.9% availability - that's 8.7 hours of downtime per year. If every one of those 8.7 hours brings down your fraud detection or recommendation system, the operational impact is severe.

Define the fallback before launch:

  1. What default values does the model receive?
  2. Is the model safe to serve with those defaults? (If not, fail the model request gracefully - return an error rather than a wrong prediction)
  3. Is there a batch fallback (hourly features from S3) that is safer than defaults?

Test the fallback: deliberately break Redis in staging and verify the system serves the correct defaults, emits the correct metrics, and does not error the model endpoint. :::

:::danger Load Testing with Synthetic Traffic Only Synthetic load tests often miss real traffic patterns: power-law entity distribution (top 1% of entities receive 20% of traffic), correlated request bursts (marketing email sends โ†’ 10x traffic spike for 5 minutes), and cold-start patterns (new users with no feature history). Test with a representative sample of real entity IDs replayed at target throughput. Use shadow mode: route 5% of real production traffic to the new feature server, measure latency and error rates under real traffic patterns before full cutover. :::


Cost Optimizationโ€‹

Real-time feature infrastructure is expensive. Optimize before it becomes unmanageable:

Redis memory: audit with redis-cli --bigkeys. Most memory is consumed by a small fraction of entities (power users with long history). Set TTLs aggressively on features with high entity counts. Use Redis's memory compression settings for small values.

Batch Redis writes: when the feature pipeline writes to Redis, batch multiple entity writes into a single pipeline call. Writing 1,000 features individually uses 1,000 RTTs; a pipeline uses 1. For stream pipelines writing to Redis, use Flink's Redis sink with configurable batch size (write every 100ms or every 1,000 events, whichever comes first).

Right-size the Flink cluster: Flink task managers are memory-intensive (RocksDB state). Profile actual state size per task manager after the pipeline has been running for 24 hours (steady-state state size). Right-size memory allocation based on actual usage, not estimates.


Interview Q&Aโ€‹

Q: You are asked to design the on-call runbook for a real-time feature serving system. What are the top 5 scenarios and how do you diagnose each?

(1) High latency (p99 > SLO): check pool utilization - if pool is exhausted, scale feature servers. Check Redis cluster health - if a node is down, verify failover. Profile which features are slow with distributed tracing. (2) Feature staleness alert: check Flink consumer lag in Kafka metrics. If lag is growing, the pipeline is behind - scale Flink task managers. If lag is stable but features are stale, the pipeline may be crashing and restarting - check Flink job logs. (3) High model error rate (not latency): check feature distribution shift - compare logged feature values to 7-day baseline. A recent feature deployment may have changed the distribution. (4) Redis memory alarm: eviction is happening - features are being dropped. Add Redis nodes or reduce TTLs. (5) Circuit breaker open: Redis is failing. Verify Redis Cluster is healthy. Trigger DynamoDB fallback manually. Verify defaults are being served and model is operating safely on defaults.

Q: How do you handle the scenario where a feature pipeline falls behind during a traffic spike?

Immediate mitigation: scale Flink task managers horizontally. Flink supports live rescaling. Increase Kafka topic partitions if the bottleneck is the Kafka source (Flink parallelism cannot exceed partition count). Second, verify that the downstream systems can tolerate stale features - if the staleness alert has fired but the model is still operational, accept the situation while the pipeline catches up. If staleness exceeds the limit where model safety is uncertain, serve defaults instead of stale features (better to be explicit). Third, prevent future recurrence: set Flink parallelism to handle 10x expected peak, not 1x. Monitor consumer lag continuously with an alert threshold of 5 minutes of lag, not 30 minutes. Use Kafka consumer group lag as the primary signal for pipeline health.

Q: A new feature is being added to the serving pipeline. Walk me through the deployment process.

(1) Define the feature in the feature registry (Feast/Tecton) with a new version tag. (2) Implement the feature transformation in the streaming pipeline. Run the consistency validator: verify the training implementation and serving implementation agree on a test dataset. (3) Deploy the streaming pipeline to compute and write the new feature to a new Redis key pattern (user:{id}:features:v2) while keeping the old pipeline writing to v1. (4) Retrain the model on training data that includes the new feature. Validate offline performance improvement. (5) Deploy the new model in canary mode (1% of traffic) pointing to v2 Redis keys. Monitor for errors, latency regression, and distribution drift. (6) Promote from 1% to 10% to 50% to 100% over the course of 24 hours, verifying metrics at each step. (7) Stop writing v1 features after the old model is fully decommissioned. Let v1 keys expire via TTL.

Q: Explain the cache-aside pattern for feature serving and when you would use it.

Cache-aside (lazy loading): the feature server checks the cache first. On a miss, it fetches from the source-of-truth store (DynamoDB), stores the result in the cache (Redis), and returns it. On the next request, the cache has the value. This is appropriate when: the source-of-truth store is slower than the cache (DynamoDB at 20ms vs. Redis at 2ms); the feature is not updated via streaming (it's computed on demand); and the entity population is large (can't eagerly pre-load all entities). The risk: a cold cache after Redis restart causes a thundering herd - every entity misses, flooding DynamoDB. Mitigate with lazy loading + TTL randomization (spread cache expiration over a window, not all at once) and a rate limiter on the DynamoDB fallback path.

Q: How do you set appropriate staleness alert thresholds for real-time features?

Set the alert threshold based on the business impact of staleness, not a round number. For a fraud velocity feature with a 1-minute window: the feature must be fresher than the window itself (1 minute). Alert at 2 minutes of staleness - that's when the feature loses its meaning. For a session feature updated every 30 seconds: alert at 90 seconds. For a daily aggregate feature: alert at 2 hours. Never set staleness alerts to 30 minutes across the board - this misses the case where a 1-minute feature is 29 minutes stale (which is catastrophic). Additionally: measure staleness as the age of the most recently computed feature for a sample of active entities, not the timestamp of the last pipeline write. A pipeline can write to Redis successfully while computing stale values.

ยฉ 2026 EngineersOfAI. All rights reserved.