Production Patterns

Three Failures at 50x Load

A real-time ML system had been running flawlessly for months in staging. The architecture was solid: Kafka for event streaming, Flink for stream-to-feature computation, Redis Cluster for online serving, a FastAPI model server. Performance tests with 1x expected load showed excellent results - sub-20ms feature retrieval, no errors.

On the day of production launch, traffic was 50x the staging volume. Three things broke within the first hour.

Failure one: Redis connection pool exhaustion. The feature server had been configured with max_connections=20 - fine for staging's 200 req/sec, completely inadequate for production's 10,000 req/sec. Every connection held for 5ms means the system needs at least 50 simultaneous connections at 10,000 req/sec. With max_connections=20, requests queued behind a full pool. The queue grew. Latency climbed from 5ms to 800ms. The feature server appeared to function but was silently destroying the model's SLA.

Failure two: Flink pipeline falling behind. The Flink job processing click events had been tuned for a throughput of 50,000 events per second. Production traffic peaked at 250,000 events per second during the first viral moment. The Flink job's consumer lag climbed from 0 to 4 million messages in 90 minutes. Session features - items clicked in the last 30 minutes - were now reflecting a state from 90 minutes ago. Users were getting recommendations based on sessions that had already ended.

Failure three: Staleness alerts not catching the lag. The monitoring team had set a staleness alert threshold of 30 minutes: "alert if any feature is more than 30 minutes old." The Flink lag was 90 minutes by the time the alert fired. The threshold had been set during staging using a rough estimate of acceptable staleness, never validated against actual SLAs.

All three failures were predictable. All three had been described in design reviews and marked as "nice to have" improvements post-launch. This lesson is the checklist that prevents them.

Why This Exists

Real-time ML systems fail differently than batch systems. Batch pipeline failures are visible - the job status turns red, the data doesn't arrive. Real-time system failures are often invisible: the system continues to produce predictions, features continue to be served, but the values are stale, incorrect, or from the wrong distribution. By the time the degradation is detected through model metrics, thousands or millions of incorrect predictions have been made.

Production real-time feature engineering requires the same reliability engineering discipline applied to any high-availability distributed system: circuit breakers, connection pooling, backpressure handling, graceful degradation, load testing at realistic scale, and observability at every layer.

The Production Readiness Checklist

Before launching a real-time feature system, verify all 20 items:

Reliability

Connection pools sized for peak load (not average load)
Circuit breakers on all external calls (Redis, DynamoDB, downstream services)
Fallback values defined for every feature (match training defaults exactly)
Fallback path tested: manually break Redis and verify the system serves defaults, not errors
Timeout set on all Redis/DynamoDB calls (50ms max - fail fast, use fallback)

Performance

Feature retrieval uses pipeline/batch (no sequential loops over entities)
Load test at 10x expected peak, not 1x expected average
p99 latency profiled under load (not just p50)
Local in-process cache for hot entities (top 1% of traffic)
Redis key TTLs set on all keys (no unbounded growth)

Observability

Feature staleness metric (age of most recently computed feature) emitted per feature group
Staleness alert threshold set to 2x the intended refresh interval (not 30 minutes if refresh is 1 minute)
Cache hit rate tracked (low hit rate → cold start or eviction problem)
Feature value distribution logged (mean, p99, null rate per feature)
Consumer lag tracked for Flink/Kafka pipelines with alert at 5-minute lag
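
As one way to make the "feature value distribution logged" item concrete, here is a minimal sketch that summarizes a batch of logged values for a single feature. The function name `summarize_feature` and the nearest-rank percentile choice are illustrative assumptions, not part of the lesson's code.

```python
import math
from typing import Optional, Sequence


def summarize_feature(values: Sequence[Optional[float]]) -> dict:
    """Return mean, p99, and null rate for one feature's logged values.

    Hypothetical helper: None stands for a missing (null) feature value.
    """
    total = len(values)
    present = sorted(v for v in values if v is not None)
    null_rate = (total - len(present)) / total if total else 0.0
    if not present:
        return {"mean": None, "p99": None, "null_rate": null_rate}

    # Nearest-rank percentile: the smallest value with at least 99% of
    # the non-null samples at or below it.
    rank = max(1, math.ceil(0.99 * len(present)))
    return {
        "mean": sum(present) / len(present),
        "p99": present[rank - 1],
        "null_rate": null_rate,
    }
```

Comparing these three numbers against a rolling baseline is usually enough to catch a feature that has silently gone null or shifted range.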

Operations

Runbook for each failure mode (Redis down, Flink lag, schema change emergency)
Multi-region readiness (or explicit single-region SLA acknowledgment)
Canary deployment process for feature changes
Incident response: pager rotation, escalation path, defined recovery SLA
Cost model: Redis memory cost at 2x current entity count
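
The cost-model item can be approximated with simple arithmetic. The sketch below is a rough estimate under stated assumptions: `bytes_per_entity` must be measured from your own data, and the `overhead_factor` and `replicas` defaults are placeholders rather than recommendations.

```python
def estimate_redis_memory_gb(
    entity_count: int,
    bytes_per_entity: int,
    overhead_factor: float = 1.3,  # assumed allowance for key metadata and fragmentation
    replicas: int = 1,             # assumed number of replicas per primary
) -> float:
    """Estimate total Redis memory (GiB) across primaries and replicas."""
    total_bytes = entity_count * bytes_per_entity * overhead_factor * (1 + replicas)
    return total_bytes / 1024**3


def memory_at_2x(entity_count: int, bytes_per_entity: int) -> float:
    """Memory estimate for the checklist's 2x-entity-count scenario."""
    return estimate_redis_memory_gb(2 * entity_count, bytes_per_entity)
```

Multiplying the result by your provider's price per GiB-month gives the figure the checklist asks for.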

Connection Pooling

The most common production failure in feature serving under load.

Sizing the pool correctly:

$\text{min\_connections} = \text{peak\_rps} \times \text{avg\_latency\_ms} \div 1000$

$\text{max\_connections} = \text{min\_connections} \times 1.5$

For 10,000 req/sec and 5ms Redis latency:

min_connections = 10,000 × 0.005 = 50
max_connections = 75

import redis
import time
import threading
from typing import Optional
import logging

logger = logging.getLogger(__name__)


class PooledFeatureClient:
    """
    Redis feature client with properly sized connection pool,
    timeout management, and pool exhaustion detection.
    """

    def __init__(
        self,
        redis_url: str,
        peak_rps: int,
        avg_latency_ms: float = 5.0,
        safety_factor: float = 1.5,
        socket_timeout_ms: float = 50.0,
    ):
        # Size the pool for peak load
        min_connections = int(peak_rps * avg_latency_ms / 1000)
        max_connections = int(min_connections * safety_factor)

        logger.info(
            f"Sizing Redis pool: peak_rps={peak_rps}, "
            f"min_connections={min_connections}, max_connections={max_connections}"
        )

        self.pool = redis.ConnectionPool.from_url(
            redis_url,
            max_connections=max_connections,
            socket_timeout=socket_timeout_ms / 1000,
            socket_connect_timeout=0.1,
            retry_on_timeout=False,  # Fail fast - use fallback
        )
        self.client = redis.Redis(connection_pool=self.pool)
        self._pool_exhaustion_count = 0

    def get_features(self, entity_id: str) -> Optional[dict]:
        """
        Retrieve features. Returns None on any Redis error.
        Caller is responsible for applying fallback values.
        """
        try:
            start = time.monotonic()
            key = f"user:{entity_id}:features"
            raw = self.client.hgetall(key)
            latency_ms = (time.monotonic() - start) * 1000

            if latency_ms > 40:
                logger.warning(f"Slow Redis fetch for {entity_id}: {latency_ms:.1f}ms")

            return {k.decode(): v.decode() for k, v in raw.items()} if raw else None

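        except redis.exceptions.ConnectionError:
            # redis-py raises ConnectionError both when the pool is at
            # max_connections and when the server is unreachable.
            self._pool_exhaustion_count += 1
            logger.error(
                "Redis connection error (pool exhausted or server unreachable). "
                f"Total: {self._pool_exhaustion_count}"
            )
            return None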
        except redis.exceptions.TimeoutError:
            logger.warning(f"Redis timeout for entity {entity_id}")
            return None
        except redis.RedisError as e:
            logger.error(f"Redis error: {e}")
            return None

    def pool_utilization(self) -> float:
        """Return fraction of connections in use (0.0 to 1.0)."""
        # _in_use_connections is a private redis-py attribute (a set on the
        # default ConnectionPool) and may change between library versions.
        in_use = len(self.pool._in_use_connections)
        return in_use / self.pool.max_connections

Circuit Breakers

A circuit breaker prevents cascading failures: when a downstream service (Redis) is degraded, the circuit breaker trips and the system immediately serves fallback values instead of queueing requests behind a failing service.

import logging
import threading
import time
from enum import Enum
from typing import Callable, Any, Optional

import redis

logger = logging.getLogger(__name__)

class CircuitState(Enum):
    CLOSED = "closed"        # Normal: requests pass through
    OPEN = "open"            # Tripped: requests fail immediately with fallback
    HALF_OPEN = "half_open"  # Testing: trial requests pass through to check recovery


class CircuitBreaker:
    """
    Circuit breaker for feature store calls.

    States:
    - CLOSED: normal operation, requests pass through
    - OPEN: failure threshold exceeded, return fallback immediately
    - HALF_OPEN: after reset_timeout, let trial requests through to test recovery
    """

    def __init__(
        self,
        failure_threshold: int = 5,      # Trip after 5 failures in window
        failure_window_seconds: float = 10.0,
        reset_timeout_seconds: float = 30.0,
        success_threshold: int = 2,       # Close after 2 successes in HALF_OPEN
    ):
        self.failure_threshold = failure_threshold
        self.failure_window = failure_window_seconds
        self.reset_timeout = reset_timeout_seconds
        self.success_threshold = success_threshold

        self._state = CircuitState.CLOSED
        self._failures: list = []
        self._last_open_time: float = 0.0
        self._half_open_successes: int = 0
        self._lock = threading.Lock()

    @property
    def state(self) -> CircuitState:
        with self._lock:
            if self._state == CircuitState.OPEN:
                if time.time() - self._last_open_time > self.reset_timeout:
                    self._state = CircuitState.HALF_OPEN
                    self._half_open_successes = 0
            return self._state

    def call(self, fn: Callable, fallback: Any = None) -> Any:
        """
        Execute fn through the circuit breaker.
        Returns fallback if the circuit is OPEN.
        """
        state = self.state

        if state == CircuitState.OPEN:
            return fallback

        try:
            result = fn()
            self._on_success()
            return result
        except Exception as e:
            self._on_failure()
            return fallback

    def _on_success(self):
        with self._lock:
            if self._state == CircuitState.HALF_OPEN:
                self._half_open_successes += 1
                if self._half_open_successes >= self.success_threshold:
                    self._state = CircuitState.CLOSED
                    self._failures.clear()

    def _on_failure(self):
        now = time.time()
        with self._lock:
            # Evict old failures outside the window
            self._failures = [t for t in self._failures if now - t < self.failure_window]
            self._failures.append(now)

            # A failed trial request in HALF_OPEN re-opens the circuit at once;
            # in CLOSED, trip only when the windowed threshold is reached.
            probe_failed = self._state == CircuitState.HALF_OPEN
            if probe_failed or len(self._failures) >= self.failure_threshold:
                if self._state != CircuitState.OPEN:
                    self._state = CircuitState.OPEN
                    self._last_open_time = now
                    logger.error(
                        f"Circuit breaker OPEN after {len(self._failures)} failures "
                        f"in {self.failure_window}s"
                    )


# Usage: wrap Redis calls with circuit breaker
FEATURE_DEFAULTS = {
    "tx_count_1h": 0,
    "tx_sum_1h": 0.0,
    "account_age_days": 0,
}

redis_breaker = CircuitBreaker(failure_threshold=5, reset_timeout_seconds=30)

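def get_features_with_breaker(entity_id: str, redis_client: redis.Redis) -> dict:
    def fetch():
        raw = redis_client.hgetall(f"user:{entity_id}:features")
        # Values arrive as strings; cast them before feeding the model.
        decoded = {k.decode(): v.decode() for k, v in raw.items()}
        return decoded or FEATURE_DEFAULTS.copy()

    # Pass a copy so callers cannot mutate the shared defaults
    return redis_breaker.call(fetch, fallback=FEATURE_DEFAULTS.copy())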

Backpressure Handling

When the Flink pipeline falls behind the Kafka event stream, consumer lag grows. Features become stale. The system must detect this and alert before staleness exceeds the acceptable threshold.

from kafka import KafkaAdminClient, KafkaConsumer
from kafka.structs import TopicPartition
import time


def get_consumer_lag(
    bootstrap_servers: str,
    group_id: str,
    topic: str,
) -> dict:
    """
    Compute consumer lag per partition.
    Returns {partition: lag_messages}.
    """
    admin = KafkaAdminClient(bootstrap_servers=bootstrap_servers)
    consumer = KafkaConsumer(bootstrap_servers=bootstrap_servers)

    # Get latest offsets (end of topic)
    # partitions_for_topic returns None when the topic metadata is unavailable
    partitions = consumer.partitions_for_topic(topic) or set()
    tp_list = [TopicPartition(topic, p) for p in partitions]
    end_offsets = consumer.end_offsets(tp_list)

    # Get committed offsets for the consumer group (one admin call covers all partitions)
    group_offsets = admin.list_consumer_group_offsets(group_id)
    committed = {tp: group_offsets.get(tp) for tp in tp_list}

    lag = {}
    for tp in tp_list:
        end = end_offsets.get(tp, 0)
        committed_offset = committed.get(tp)
        if committed_offset is not None and committed_offset.offset >= 0:
            lag[tp.partition] = end - committed_offset.offset
        else:
            lag[tp.partition] = end  # No committed offset → all messages are unconsumed

    consumer.close()
    admin.close()
    return lag


def lag_to_staleness_minutes(lag_messages: int, events_per_second: float) -> float:
    """Estimate how stale features are given the current lag."""
    lag_seconds = lag_messages / max(events_per_second, 1)
    return lag_seconds / 60


# In a monitoring loop (run as a separate process or CloudWatch scheduled rule)
def monitor_pipeline_lag():
    ALERT_LAG_MESSAGES = 100_000  # Alert if lag > 100K messages (~2 s of lag at 50K events/sec)
    EVENTS_PER_SECOND = 50_000

    lag = get_consumer_lag(
        bootstrap_servers="kafka:9092",
        group_id="session-feature-pipeline",
        topic="click-events",
    )

    total_lag = sum(lag.values())
    staleness_min = lag_to_staleness_minutes(total_lag, EVENTS_PER_SECOND)

    if total_lag > ALERT_LAG_MESSAGES:
        logger.error(
            f"ALERT: Flink pipeline lag={total_lag:,} messages "
            f"(~{staleness_min:.1f} minutes of staleness). "
            f"Consider scaling Flink task managers."
        )

Graceful Degradation

Design the fallback before you need it. For every real-time feature, answer these questions at design time, not incident time:

What happens if the feature store is unavailable? (Default values? Batch fallback?)
What happens if the feature is stale beyond the acceptable threshold? (Serve or reject?)
What is the model's behavior when all features are defaults? (Is it safe? Does it bias toward false negatives?)

from enum import Enum
from typing import Dict, Any, Optional
import time

class DegradationLevel(Enum):
    HEALTHY = "healthy"              # All features fresh from online store
    STALE = "stale"                  # Features served but beyond freshness threshold
    DEGRADED = "degraded"            # Online store unavailable, serving defaults
    CRITICAL = "critical"            # Cannot serve the model safely


class DegradedFeatureServer:
    """
    Feature server with explicit degradation levels.
    Each level has a defined behavior and produces observability signals.
    """

    FRESHNESS_THRESHOLD_SECONDS = 120  # Alert if features older than 2 minutes
    STALENESS_LIMIT_SECONDS = 600      # Serve degraded if older than 10 minutes

    def __init__(
        self,
        online_client,    # Redis client
        defaults: Dict[str, Any],
    ):
        self.online = online_client
        self.defaults = defaults

    def get_features(self, entity_id: str) -> tuple:
        """
        Returns (features, degradation_level).
        Callers can log the degradation_level for observability.
        """
        try:
            raw = self.online.get_features(entity_id)
            if raw is None:
                # Cache miss - new entity or evicted
                return self.defaults.copy(), DegradationLevel.DEGRADED

            computed_at = float(raw.get("computed_at", 0))
            age_seconds = time.time() - computed_at

            if age_seconds > self.STALENESS_LIMIT_SECONDS:
                # Too stale - serve defaults rather than mislead the model
                logger.warning(
                    f"Feature staleness {age_seconds:.0f}s > limit "
                    f"{self.STALENESS_LIMIT_SECONDS}s for {entity_id}. "
                    f"Serving defaults."
                )
                return self.defaults.copy(), DegradationLevel.DEGRADED

            if age_seconds > self.FRESHNESS_THRESHOLD_SECONDS:
                # Serve but flag as stale
                features = {**self.defaults, **raw}
                return features, DegradationLevel.STALE

            return {**self.defaults, **raw}, DegradationLevel.HEALTHY

        except Exception as e:
            logger.error(f"Feature retrieval error for {entity_id}: {e}")
            return self.defaults.copy(), DegradationLevel.CRITICAL

The Full ProductionFeatureServer

import redis
import time
import logging
from typing import Dict, Any, Optional, List
from fastapi import FastAPI, Response
from pydantic import BaseModel
from prometheus_client import Counter, Histogram, Gauge, generate_latest

logger = logging.getLogger(__name__)

# Prometheus metrics
FEATURE_REQUESTS = Counter("feature_requests_total", "Total feature requests", ["status"])
FEATURE_LATENCY = Histogram("feature_latency_seconds", "Feature retrieval latency",
                            buckets=[0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25])
FEATURE_STALENESS = Gauge("feature_staleness_seconds", "Age of most recent feature", ["entity_type"])
CACHE_HIT_RATE = Gauge("feature_cache_hit_rate", "Fraction of requests served from cache")


FEATURE_DEFAULTS = {
    "tx_count_1h": 0, "tx_sum_1h": 0.0, "tx_count_24h": 0,
    "tx_sum_24h": 0.0, "account_age_days": 0, "avg_tx_90d": 0.0,
    "distinct_merchants_7d": 0, "credit_score_bucket": 2, "is_premium": 0,
    "last_login_ts": 0,
}


class ProductionFeatureServer:
    def __init__(self, redis_url: str, peak_rps: int = 10000):
        # Sized for peak load
        max_connections = int(peak_rps * 0.005 * 1.5)  # peak_rps × avg_latency × safety

        self.pool = redis.ConnectionPool.from_url(
            redis_url,
            max_connections=max_connections,
            socket_timeout=0.05,
            socket_connect_timeout=0.1,
            retry_on_timeout=False,
            decode_responses=True,
        )
        self.client = redis.Redis(connection_pool=self.pool)

        # In-process TTL cache for hot entities (size-bounded in _set_hot_cache)
        self._hot_cache: Dict[str, tuple] = {}  # entity_id → (features, expires_at)
        self._hot_cache_ttl = 5.0  # 5-second in-process cache

        # Circuit breaker
        self.breaker = CircuitBreaker(failure_threshold=10, reset_timeout_seconds=30)

        # Hit rate tracking
        self._total_requests = 0
        self._cache_hits = 0

    def _try_hot_cache(self, entity_id: str) -> Optional[Dict]:
        entry = self._hot_cache.get(entity_id)
        if entry and time.time() < entry[1]:
            return entry[0]
        return None

    def _set_hot_cache(self, entity_id: str, features: Dict):
        self._hot_cache[entity_id] = (features, time.time() + self._hot_cache_ttl)
        # Evict oldest entries if cache grows too large
        if len(self._hot_cache) > 10000:
            oldest = min(self._hot_cache, key=lambda k: self._hot_cache[k][1])
            del self._hot_cache[oldest]

    def get_features(self, entity_id: str) -> Dict[str, Any]:
        self._total_requests += 1
        start = time.monotonic()

        # Layer 1: in-process hot cache
        cached = self._try_hot_cache(entity_id)
        if cached:
            self._cache_hits += 1
            FEATURE_REQUESTS.labels(status="hot_cache").inc()
            return cached

        # Layer 2: Redis (via circuit breaker)
        def redis_fetch():
            key = f"user:{entity_id}:features"
            return self.client.hgetall(key)

        raw = self.breaker.call(redis_fetch, fallback=None)

        latency = time.monotonic() - start
        FEATURE_LATENCY.observe(latency)

        if raw is None or not raw:
            FEATURE_REQUESTS.labels(status="miss").inc()
            return FEATURE_DEFAULTS.copy()

        # Parse features
        features = {}
        for name, default_val in FEATURE_DEFAULTS.items():
            raw_val = raw.get(name)
            if raw_val is None:
                features[name] = default_val
            elif isinstance(default_val, int):
                features[name] = int(raw_val)
            elif isinstance(default_val, float):
                features[name] = float(raw_val)
            else:
                features[name] = raw_val

        # Track staleness
        if "computed_at" in raw:
            age = time.time() - float(raw["computed_at"])
            FEATURE_STALENESS.labels(entity_type="user").set(age)

        # Populate hot cache
        self._set_hot_cache(entity_id, features)
        FEATURE_REQUESTS.labels(status="hit").inc()
        self._cache_hits += 1

        if self._total_requests % 1000 == 0:
            hit_rate = self._cache_hits / self._total_requests
            CACHE_HIT_RATE.set(hit_rate)

        return features

    def get_features_batch(self, entity_ids: List[str]) -> Dict[str, Dict]:
        results = {}
        cache_misses = []

        # Check hot cache first
        for eid in entity_ids:
            cached = self._try_hot_cache(eid)
            if cached:
                results[eid] = cached
            else:
                cache_misses.append(eid)

        if not cache_misses:
            return results

        # Pipeline Redis for misses
        def batch_fetch():
            pipe = self.client.pipeline(transaction=False)
            for eid in cache_misses:
                pipe.hgetall(f"user:{eid}:features")
            return pipe.execute()

        redis_results = self.breaker.call(batch_fetch, fallback=[None] * len(cache_misses))

        for eid, raw in zip(cache_misses, redis_results):
            if not raw:
                results[eid] = FEATURE_DEFAULTS.copy()
                continue

            features = {}
            for name, default_val in FEATURE_DEFAULTS.items():
                raw_val = raw.get(name)
                if raw_val is None:
                    features[name] = default_val
                elif isinstance(default_val, int):
                    features[name] = int(raw_val)
                elif isinstance(default_val, float):
                    features[name] = float(raw_val)
                else:
                    features[name] = raw_val

            results[eid] = features
            self._set_hot_cache(eid, features)

        return results

    def health_check(self) -> dict:
        redis_ok = False
        try:
            redis_ok = self.client.ping()
        except redis.RedisError:
            pass

        return {
            "status": "healthy" if redis_ok else "degraded",
            "redis": "ok" if redis_ok else "down",
            "circuit_breaker": self.breaker.state.value,
            "pool_utilization": (
                # Private redis-py attribute: connections currently checked out
                len(self.pool._in_use_connections) / self.pool.max_connections
                if self.pool.max_connections > 0 else 0
            ),
            "cache_hit_rate": self._cache_hits / max(self._total_requests, 1),
        }


# FastAPI application
app = FastAPI()
feature_server = ProductionFeatureServer(redis_url="redis://redis:6379", peak_rps=10000)


@app.get("/features/{entity_id}")
def get_features(entity_id: str):
    return feature_server.get_features(entity_id)


@app.get("/health")
def health():
    return feature_server.health_check()


@app.get("/metrics")
def metrics():
    return Response(generate_latest(), media_type="text/plain")

Production Architecture with Resilience Patterns

Operational Runbook: 10 Common Failure Scenarios

| # | Symptom | Diagnosis | Mitigation |
|---|---------|-----------|------------|
| 1 | p99 latency > 100ms | Check pool utilization; if >80%, pool exhausted | Scale feature servers; increase `max_connections` |
| 2 | Feature staleness alert | Check Flink consumer lag; if lag > 5 min, pipeline fell behind | Scale Flink task managers; check for backpressure |
| 3 | High null rate in features | Check Redis memory (if >85% used, eviction happening) | Add Redis nodes; reduce TTL; increase cluster size |
| 4 | Circuit breaker OPEN | Check Redis connectivity; if Redis down, serve defaults | Redis failover (Sentinel/Cluster auto-failover); DynamoDB fallback |
| 5 | Model predictions shifted | Check feature distribution logs; compare to 7-day baseline | Roll back last feature change; retrain if feature bug |
| 6 | Cache hit rate < 50% | Check if entity distribution changed; may be new user types | Pre-warm cache for new entities; increase LRU cache size |
| 7 | Feature server OOM | Check hot cache size; may be unbounded entity accumulation | Add LRU eviction limit; profile cache memory |
| 8 | Flink checkpoint failures | Check S3 connectivity and RocksDB state size | Increase checkpoint timeout; archive old state |
| 9 | DLQ volume spike | Check upstream producer changes; schema change not registered | Register new schema version; restart pipeline with new JAR |
| 10 | Redis cluster partition | Check Redis Cluster node status; replica lag | Force failover to healthy primary; verify replica count |

Multi-Region Feature Serving

For global systems, Redis must be replicated to serving regions to avoid cross-region latency:

US-East (primary)    US-West (replica)    EU-West (replica)
   Redis Primary  →  Redis Replica     →  Redis Replica
   Flink writes   ←  read-only serving ←  read-only serving

Redis replication is asynchronous: replicas may lag 10–100ms behind the primary under write load. Route all writes to the primary and all reads to the local replica. For fraud detection, where freshness is critical, that means accepting a small freshness lag on reads. For recommendations, where freshness matters less, eventual consistency is acceptable.
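
The write-to-primary, read-from-local-replica routing can be sketched as a thin wrapper. This is an illustration only: `RegionalFeatureStore` is a hypothetical name, and the two clients are assumed to expose `hset(name, mapping=...)` and `hgetall(name)` the way redis-py clients do.

```python
from typing import Any, Dict


class RegionalFeatureStore:
    """Send writes to the primary region and reads to the local replica."""

    def __init__(self, primary: Any, local_replica: Any):
        self.primary = primary              # client connected to the primary region
        self.local_replica = local_replica  # client connected to the same-region replica

    @staticmethod
    def _key(entity_id: str) -> str:
        return f"user:{entity_id}:features"

    def write(self, entity_id: str, features: Dict[str, Any]) -> None:
        # Replicas are read-only, so every write goes to the primary.
        self.primary.hset(self._key(entity_id), mapping=features)

    def read(self, entity_id: str) -> Dict[Any, Any]:
        # Reads stay in-region; the result may trail the primary by the
        # replication lag described above.
        return self.local_replica.hgetall(self._key(entity_id))
```

A read issued immediately after a write may return the previous value, or nothing, until replication catches up, so callers still need the default-value fallback.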

DynamoDB Global Tables offer active-active multi-region with ~1 second replication lag - appropriate for features that can tolerate 1 second of cross-region staleness.

Feature SLOs

Define and monitor Service Level Objectives for each feature group:

| Feature Group | Freshness SLO | Retrieval p99 SLO | Availability SLO |
|---------------|---------------|-------------------|------------------|
| Velocity (1h window) | < 30 seconds stale | < 10ms | 99.9% |
| Session features | < 60 seconds stale | < 10ms | 99.9% |
| Historical aggregates | < 1 hour stale | < 10ms | 99.9% |
| Embeddings | < 24 hours stale | < 5ms | 99.99% |
| Static profile | < 7 days stale | < 10ms | 99.99% |
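
One possible way to make these SLOs checkable is to encode the table as data and compare measurements against it. The group keys and the `slo_violations` helper below are hypothetical names chosen for this sketch; the thresholds are copied from the table.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FeatureSLO:
    max_staleness_s: float   # freshness SLO, in seconds
    max_p99_ms: float        # retrieval p99 SLO, in milliseconds
    min_availability: float  # availability SLO, as a fraction


SLOS = {
    "velocity_1h": FeatureSLO(30, 10, 0.999),
    "session": FeatureSLO(60, 10, 0.999),
    "historical_aggregates": FeatureSLO(3600, 10, 0.999),
    "embeddings": FeatureSLO(24 * 3600, 5, 0.9999),
    "static_profile": FeatureSLO(7 * 24 * 3600, 10, 0.9999),
}


def slo_violations(group: str, staleness_s: float, p99_ms: float,
                   availability: float) -> List[str]:
    """Return the names of the SLOs that the measurements violate."""
    slo = SLOS[group]
    violations = []
    if staleness_s >= slo.max_staleness_s:  # the table uses strict "<" bounds
        violations.append("freshness")
    if p99_ms >= slo.max_p99_ms:
        violations.append("latency")
    if availability < slo.min_availability:
        violations.append("availability")
    return violations
```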

:::danger No Fallback Strategy Is a Single Point of Failure A feature system without a defined fallback strategy will fail the entire model endpoint when the feature store is unavailable. A Redis deployment with 99.9% availability is down for roughly 8.76 hours per year. If every one of those hours brings down your fraud detection or recommendation system, the operational impact is severe.

Define the fallback before launch:

What default values does the model receive?
Is the model safe to serve with those defaults? (If not, fail the model request gracefully - return an error rather than a wrong prediction)
Is there a batch fallback (hourly features from S3) that is safer than defaults?

Test the fallback: deliberately break Redis in staging and verify the system serves the correct defaults, emits the correct metrics, and does not error the model endpoint. :::
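
The downtime figure in the callout above follows directly from the availability percentage. A small sketch of the arithmetic, assuming a 365-day year (the function name is illustrative):

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def allowed_downtime_hours(availability: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Hours of downtime permitted by an availability target over a period."""
    return (1.0 - availability) * period_hours
```

At 99.9% this gives about 8.76 hours per year; at 99.99% it drops to under one hour, which is why the higher-tier feature groups need a tested fallback rather than reliance on the store alone.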

:::danger Load Testing with Synthetic Traffic Only Synthetic load tests often miss real traffic patterns: power-law entity distribution (top 1% of entities receive 20% of traffic), correlated request bursts (marketing email sends → 10x traffic spike for 5 minutes), and cold-start patterns (new users with no feature history). Test with a representative sample of real entity IDs replayed at target throughput. Use shadow mode: route 5% of real production traffic to the new feature server, measure latency and error rates under real traffic patterns before full cutover. :::

Cost Optimization

Real-time feature infrastructure is expensive. Optimize before it becomes unmanageable:

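Redis memory: audit with redis-cli --bigkeys. Most memory is consumed by a small fraction of entities (power users with long history). Set TTLs aggressively on features with high entity counts. Keep small hashes within Redis's compact-encoding thresholds (hash-max-listpack-entries and hash-max-listpack-value, named hash-max-ziplist-* before Redis 7) so they are stored in the memory-efficient format.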

Batch Redis writes: when the feature pipeline writes to Redis, batch multiple entity writes into a single pipeline call. Writing 1,000 features individually uses 1,000 RTTs; a pipeline uses 1. For stream pipelines writing to Redis, use Flink's Redis sink with configurable batch size (write every 100ms or every 1,000 events, whichever comes first).
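
A minimal sketch of the batching idea, assuming a redis-py style client whose `pipeline()` object supports `hset`, `expire`, and `execute`. The function name, the default batch size, and the TTL are illustrative choices, and the time-based flush ("every 100ms") is left out for brevity.

```python
from typing import Any, Dict, Iterable, Tuple


def write_features_batched(
    client: Any,
    rows: Iterable[Tuple[str, Dict[str, Any]]],
    batch_size: int = 1000,
    ttl_seconds: int = 3600,
) -> int:
    """Write (entity_id, features) rows in pipelined batches.

    Returns the number of round trips (pipeline executions) used.
    """
    round_trips = 0
    pending = 0
    pipe = client.pipeline(transaction=False)  # plain batching, no MULTI/EXEC
    for entity_id, features in rows:
        key = f"user:{entity_id}:features"
        pipe.hset(key, mapping=features)
        pipe.expire(key, ttl_seconds)  # every key gets a TTL
        pending += 1
        if pending >= batch_size:
            pipe.execute()  # one round trip for the whole batch
            round_trips += 1
            pending = 0
    if pending:
        pipe.execute()  # flush the final partial batch
        round_trips += 1
    return round_trips
```

With 2,500 rows and a batch size of 1,000 this issues three round trips instead of 5,000 individual commands.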

Right-size the Flink cluster: Flink task managers are memory-intensive (RocksDB state). Profile actual state size per task manager after the pipeline has been running for 24 hours (steady-state state size). Right-size memory allocation based on actual usage, not estimates.

Interview Q&A

Q: You are asked to design the on-call runbook for a real-time feature serving system. What are the top 5 scenarios and how do you diagnose each?

(1) High latency (p99 > SLO): check pool utilization - if pool is exhausted, scale feature servers. Check Redis cluster health - if a node is down, verify failover. Profile which features are slow with distributed tracing. (2) Feature staleness alert: check Flink consumer lag in Kafka metrics. If lag is growing, the pipeline is behind - scale Flink task managers. If lag is stable but features are stale, the pipeline may be crashing and restarting - check Flink job logs. (3) High model error rate (not latency): check feature distribution shift - compare logged feature values to 7-day baseline. A recent feature deployment may have changed the distribution. (4) Redis memory alarm: eviction is happening - features are being dropped. Add Redis nodes or reduce TTLs. (5) Circuit breaker open: Redis is failing. Verify Redis Cluster is healthy. Trigger DynamoDB fallback manually. Verify defaults are being served and model is operating safely on defaults.

Q: How do you handle the scenario where a feature pipeline falls behind during a traffic spike?

Immediate mitigation: scale Flink task managers horizontally. Rescaling a Flink job generally means restarting it from a savepoint at higher parallelism (or letting the adaptive scheduler do so in reactive mode), so expect a brief pause in processing. Increase Kafka topic partitions if the bottleneck is the Kafka source: source subtasks beyond the partition count sit idle, so the partition count caps useful source parallelism. Second, verify that the downstream systems can tolerate stale features - if the staleness alert has fired but the model is still operational, accept the situation while the pipeline catches up. If staleness exceeds the limit where model safety is uncertain, serve defaults instead of stale features (better to be explicit). Third, prevent future recurrence: set Flink parallelism to handle 10x expected peak, not 1x. Monitor consumer lag continuously with an alert threshold of 5 minutes of lag, not 30 minutes. Use Kafka consumer group lag as the primary signal for pipeline health.
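
When deciding whether to wait for the pipeline to catch up or to serve defaults, a rough catch-up estimate helps. The sketch below assumes steady ingest and processing rates, which real traffic rarely provides, so treat the result as an order-of-magnitude guide. The function name is illustrative.

```python
def estimate_catchup_seconds(
    lag_messages: float,
    ingest_rate: float,      # events/sec arriving on the topic
    processing_rate: float,  # events/sec the pipeline can consume
) -> float:
    """Estimate seconds until consumer lag reaches zero.

    Returns infinity when the pipeline is not draining the backlog.
    """
    drain_rate = processing_rate - ingest_rate
    if drain_rate <= 0:
        return float("inf")  # lag is flat or still growing
    return lag_messages / drain_rate
```

For example, a backlog of 4 million messages with 50,000 events/sec arriving and 100,000 events/sec of processing capacity drains in about 80 seconds, while any configuration that processes more slowly than it ingests never catches up.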

Q: A new feature is being added to the serving pipeline. Walk me through the deployment process.

(1) Define the feature in the feature registry (Feast/Tecton) with a new version tag. (2) Implement the feature transformation in the streaming pipeline. Run the consistency validator: verify the training implementation and serving implementation agree on a test dataset. (3) Deploy the streaming pipeline to compute and write the new feature to a new Redis key pattern (user:{id}:features:v2) while keeping the old pipeline writing to v1. (4) Retrain the model on training data that includes the new feature. Validate offline performance improvement. (5) Deploy the new model in canary mode (1% of traffic) pointing to v2 Redis keys. Monitor for errors, latency regression, and distribution drift. (6) Promote from 1% to 10% to 50% to 100% over the course of 24 hours, verifying metrics at each step. (7) Stop writing v1 features after the old model is fully decommissioned. Let v1 keys expire via TTL.
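
The staged promotion in steps (5) and (6) needs a stable assignment of entities to the canary, so that the same user keeps seeing the same model and key version as the percentage grows. One common approach, shown here as a sketch, is hash-based bucketing. The helper names are hypothetical, and the unversioned key is assumed to be the v1 pattern.

```python
import hashlib


def canary_bucket(entity_id: str) -> int:
    """Map an entity to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(entity_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def feature_key(entity_id: str, canary_percent: int) -> str:
    """Pick the v2 key for entities inside the canary, otherwise the v1 key."""
    if canary_bucket(entity_id) < canary_percent:
        return f"user:{entity_id}:features:v2"
    return f"user:{entity_id}:features"  # assumed v1 key pattern
```

Because the bucket depends only on the entity ID, raising `canary_percent` from 1 to 10 to 50 only ever adds entities to the canary and never moves one back.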

Q: Explain the cache-aside pattern for feature serving and when you would use it.

Cache-aside (lazy loading): the feature server checks the cache first. On a miss, it fetches from the source-of-truth store (DynamoDB), stores the result in the cache (Redis), and returns it. On the next request, the cache has the value. This is appropriate when: the source-of-truth store is slower than the cache (DynamoDB at 20ms vs. Redis at 2ms); the feature is not updated via streaming (it's computed on demand); and the entity population is large (can't eagerly pre-load all entities). The risk: a cold cache after Redis restart causes a thundering herd - every entity misses, flooding DynamoDB. Mitigate with TTL randomization (spread cache expiration over a window, not all at once) and a rate limiter on the DynamoDB fallback path.
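
A compact sketch of cache-aside with TTL randomization follows. To stay self-contained it uses an in-process dict where the answer uses Redis, and a caller-supplied `load_from_source` function where the answer uses DynamoDB. The class name, parameter names, and the 20% jitter are illustrative assumptions.

```python
import random
import time
from typing import Callable, Dict, Optional, Tuple


def jittered_ttl(base_ttl_s: float, jitter_fraction: float = 0.2,
                 rng: Optional[random.Random] = None) -> float:
    """Spread expirations over base_ttl_s * (1 ± jitter_fraction)."""
    rng = rng or random
    return base_ttl_s * (1.0 + rng.uniform(-jitter_fraction, jitter_fraction))


class CacheAside:
    def __init__(self, load_from_source: Callable[[str], Optional[dict]],
                 base_ttl_s: float = 300.0,
                 clock: Callable[[], float] = time.monotonic):
        self._load = load_from_source
        self._base_ttl_s = base_ttl_s
        self._clock = clock
        self._cache: Dict[str, Tuple[dict, float]] = {}  # key → (value, expires_at)

    def get(self, key: str) -> Optional[dict]:
        entry = self._cache.get(key)
        if entry is not None and self._clock() < entry[1]:
            return entry[0]  # cache hit
        value = self._load(key)  # miss: go to the source of truth
        if value is not None:
            expires_at = self._clock() + jittered_ttl(self._base_ttl_s)
            self._cache[key] = (value, expires_at)
        return value
```

The jitter keeps entries that were loaded together from expiring together. It does not help with a fully cold cache, which is where the rate limiter on the source path comes in.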

Q: How do you set appropriate staleness alert thresholds for real-time features?

Set the alert threshold based on the business impact of staleness, not a round number. For a fraud velocity feature with a 1-minute window: the feature must be fresher than the window itself (1 minute). Alert at 2 minutes of staleness - that's when the feature loses its meaning. For a session feature updated every 30 seconds: alert at 90 seconds. For a daily aggregate feature: alert at 2 hours. Never set staleness alerts to 30 minutes across the board - this misses the case where a 1-minute feature is 29 minutes stale (which is catastrophic). Additionally: measure staleness as the age of the most recently computed feature for a sample of active entities, not the timestamp of the last pipeline write. A pipeline can write to Redis successfully while computing stale values.
