Lambda and Kappa Architecture for ML Systems

The 3 AM Incident

It is 3 AM, and your on-call phone rings. The recommendation engine at your e-commerce company has been serving identical product lists to every user for the past four hours. The root cause takes twenty minutes to identify: the batch job that recomputes user embeddings every night ran into a data corruption bug. It wrote partial results. Your serving layer picked up the partial results. Every user got the same stale, truncated embedding.

You fix the batch job, rerun it, and deploy fresh embeddings by 5 AM. The postmortem the next morning has one uncomfortable question: why did four hours of user behavior - purchases, searches, clicks - go unweighted in the recommendations? Your batch pipeline recomputes from scratch on a twenty-four-hour schedule. The most recent four hours of signal were simply invisible to the model.

This is the problem Lambda architecture was designed to solve. Not the bug - bugs happen - but the architectural tension between two legitimate requirements: (1) correctness, which demands recomputing from raw data, and (2) freshness, which demands incorporating the last hour of events. Before Lambda, teams chose one or the other. Nathan Marz gave us a framework that delivers both, at the cost of running two codebases in parallel. Jay Kreps, watching teams struggle with that cost, proposed Kappa: throw out the batch layer entirely, and lean on stream reprocessing to get correctness without duplication.

Understanding when to use Lambda, when to use Kappa, and how ML changes the calculus of that decision is the subject of this lesson. By the end, you will be able to design the data architecture for a production ML system, reason about the trade-offs, and defend your choices in an architecture review.


Why This Exists: The Problem Before Lambda

Before 2011, large-scale data systems faced a binary choice. You could build a batch system (Hadoop MapReduce being the canonical example): correct, recomputable from raw data, able to handle petabytes, but unavoidably slow. Your freshest data was always N hours old, where N was however long the batch job took. Or you could build a streaming system (Storm, early Kafka consumers): fast, low-latency, but fragile. Streaming systems operated on windows of recent data and had no reliable way to recompute from historical truth if something went wrong.

The problem was acute for recommendation and fraud systems. A fraud model needs to respond to a transaction in under 100 milliseconds - that demands a streaming path. But it also needs to incorporate a user's full behavioral history spanning months - that demands batch computation. Teams bolted these together ad hoc. The result was two separate codebases, maintained by two different teams, with no guarantee they computed the same features in the same way. When results diverged (and they always did), debugging was a nightmare.

Nathan Marz, working at Twitter on distributed data systems, formalized the problem and the solution in a 2011 blog post that became the book "Big Data." His key insight: raw data is immutable. If you store it permanently and treat it as truth, you can always recompute. The architecture that follows from that insight is Lambda.


Lambda Architecture: The Full Picture

Lambda architecture has three layers that serve distinct roles.

The Batch Layer

The batch layer has two responsibilities. First, it stores all raw data immutably - every event that ever occurred is appended to the master dataset, never overwritten. Second, it periodically recomputes batch views from scratch. Because it recomputes from raw data, it is always correct. Human errors, algorithm bugs, and infrastructure failures can all be corrected by fixing the code and rerunning the batch job.
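A minimal sketch of those two responsibilities, assuming an in-memory list stands in for the master dataset and a hypothetical `recompute_click_counts` function stands in for the scheduled batch job:

```python
from collections import Counter

# Hypothetical stand-in for the master dataset: an append-only list of raw events.
master_dataset = []

def append_event(user_id: str, event_type: str, ts: int) -> None:
    """Append a raw event; existing events are never modified or deleted."""
    master_dataset.append({"user_id": user_id, "event_type": event_type, "ts": ts})

def recompute_click_counts(events: list) -> dict:
    """Rebuild the batch view from scratch by scanning every raw event."""
    return dict(Counter(e["user_id"] for e in events if e["event_type"] == "click"))

append_event("u1", "click", 100)
append_event("u1", "view", 101)
append_event("u2", "click", 102)
append_event("u1", "click", 103)

batch_view = recompute_click_counts(master_dataset)
print(batch_view)  # {'u1': 2, 'u2': 1}
```

If the counting logic turns out to be wrong, only `recompute_click_counts` changes; the raw events stay untouched and the view is rebuilt from them.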

In practice, the batch layer is usually Apache Spark on top of S3 or HDFS. The batch job might run every hour, every six hours, or every twenty-four hours depending on your latency tolerance and compute cost. The output - batch views - is typically stored in a fast-read store like Cassandra, DynamoDB, or Redis.

The batch layer's weakness is latency. A batch job that takes two hours to run means your batch views are always at least two hours stale. For the period between batch runs, the speed layer fills the gap.

The Speed Layer

The speed layer processes the same data stream in real time, producing incremental views that cover only the recent data that the batch layer has not yet incorporated. It trades some accuracy for speed. Stream processing with windowing and approximate algorithms (HyperLogLog for cardinality, Count-Min Sketch for frequency) is acceptable here because the batch layer will eventually recompute the correct answer.
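To make the approximate-algorithm idea concrete, here is a small Count-Min Sketch in plain Python. The class name, the table dimensions, and the SHA-256-based hashing are illustrative choices for this sketch, not taken from any particular streaming library:

```python
import hashlib

class CountMinSketch:
    """Approximate frequency counter: may overestimate, never underestimates."""

    def __init__(self, width: int = 1024, depth: int = 4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item: str, row: int) -> int:
        # Derive one hash per row by salting the item with the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item: str, count: int = 1) -> None:
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item: str) -> int:
        # Collisions only inflate counters, so the minimum is the tightest bound.
        return min(self.table[row][self._index(item, row)] for row in range(self.depth))

sketch = CountMinSketch()
for item_id in ["sku-1"] * 5 + ["sku-2"] * 2:
    sketch.add(item_id)

print(sketch.estimate("sku-1"))  # at least 5
```

Memory stays fixed at `width * depth` counters no matter how many distinct items arrive, which is why this kind of structure suits a speed layer whose errors the batch layer later corrects.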

In practice, the speed layer is Apache Kafka Streams, Apache Flink, or Apache Spark Streaming. It updates views continuously, keeping them fresh within seconds or minutes.

The Serving Layer

The serving layer merges batch views and real-time views to answer queries. For a recommendation query, this might mean: take the user embedding computed by last night's batch job, and apply the last two hours of click behavior from the real-time view to adjust it. The serving layer performs this merge on every query, typically in-memory.
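For an additive view such as a click count, the merge can be sketched as below. The view contents and the `merged_click_count` helper are hypothetical; the important assumption is that the real-time view covers only events that arrived after the batch cutoff, otherwise clicks would be counted twice:

```python
def merged_click_count(user_id: str, batch_view: dict, realtime_view: dict) -> int:
    """Answer a query by adding the real-time delta to the batch total."""
    return batch_view.get(user_id, 0) + realtime_view.get(user_id, 0)

# Batch view: clicks counted up to the last batch run (assumed cutoff: midnight).
batch_view = {"u1": 120, "u2": 45}
# Real-time view: clicks seen only since that cutoff.
realtime_view = {"u1": 3, "u3": 1}

print(merged_click_count("u1", batch_view, realtime_view))  # 123
print(merged_click_count("u3", batch_view, realtime_view))  # 1
```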


Lambda in ML Systems

Lambda's abstract layers map cleanly onto the ML lifecycle.

Batch layer for ML = offline model training. You train a model on your full historical dataset every N hours. This produces the most accurate model parameters because it has seen all the data. The trained model is the "batch view."

Speed layer for ML = online learning or feature updates. Between offline training runs, the speed layer updates user features in real time. When a user clicks something, that click is immediately incorporated into their feature vector.

Serving layer for ML = inference. At prediction time, the server loads the batch-trained model weights and applies real-time features to produce a prediction that is both accurate (trained on all history) and fresh (current context).
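As a toy illustration of that inference step, the sketch below scores items with a dot product over batch-trained embeddings and adds a boost when the item matches the category of the current session. The `score_item` function, the embedding values, and the boost size are all assumptions made up for the example:

```python
def score_item(user_embedding, item_embedding, item_category, session_category, boost=1.0):
    """Dot product from the batch-trained embeddings plus a real-time session boost."""
    base = sum(u * i for u, i in zip(user_embedding, item_embedding))
    return base + (boost if item_category == session_category else 0.0)

# Batch view: embedding from the last offline training run (hypothetical values).
user_embedding = [0.2, 0.8]
items = {
    "running-shoes": {"embedding": [1.0, 0.0], "category": "shoes"},
    "novel": {"embedding": [0.0, 1.0], "category": "books"},
}

def rank(session_category):
    """Rank all items for the user given the current session category (or None)."""
    scored = {
        name: score_item(user_embedding, item["embedding"], item["category"], session_category)
        for name, item in items.items()
    }
    return sorted(scored, key=scored.get, reverse=True)

print(rank(session_category=None))      # ['novel', 'running-shoes']
print(rank(session_category="shoes"))   # ['running-shoes', 'novel']
```

The model parameters are identical in both calls; only the real-time feature changes, and that alone is enough to reorder the results.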

A representative example is personalized playlist generation of the kind Spotify runs. In such a design, the batch layer trains a matrix factorization model on all listening history weekly, capturing deep long-term taste. The speed layer tracks what the user has listened to in the current session. The serving layer combines the long-term model with session context to recommend the next track. Neither layer alone is sufficient: the batch model misses the session context; the speed layer lacks the long-term taste model.


Lambda's Weakness: Two Codebases

Lambda architecture is elegant in theory. In practice, teams repeatedly discover the same problem: you must implement every computation twice - once as a batch job (Spark) and once as a streaming job (Flink). These two implementations must produce identical results, but they use different APIs, different execution models, different debugging tools. They drift over time. A bug fix applied to the Spark job is forgotten on the Flink side. A new feature is added to the streaming path but not to the batch path.

Teams at Twitter, LinkedIn, and elsewhere discovered this independently and with the same conclusion: the operational burden of maintaining two codebases exceeds the architectural benefits in many use cases. Jay Kreps, co-creator of Kafka and co-founder of Confluent, wrote about this in 2014 in a post titled "Questioning the Lambda Architecture."


Kappa Architecture: Stream Everything

Kappa architecture's premise is simple: if you can replay a stream, you do not need a batch layer. Store all your raw events in a long-retention log (Kafka with 30-day or infinite retention). When you need to recompute, replay the entire log through your streaming job. You only maintain one codebase - the streaming job.

Kappa Reprocessing Pattern

When you need to update your algorithm:

  1. Deploy the new streaming job (v2) to a separate consumer group
  2. Set v2's consumer offset to the beginning of the Kafka topic (replay from day one)
  3. v2 runs in parallel with v1, catching up on historical data
  4. Once v2 has caught up to the current time, atomically swap the serving layer to point to v2's output store
  5. Decommission v1

This gives you the correctness guarantees of Lambda (full recomputation from raw data) with the simplicity of a single codebase.
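The five steps can be simulated end to end without Kafka. In this sketch a Python list stands in for the topic, dictionaries stand in for the output stores, and `job_v1`, `job_v2`, and `replay` are hypothetical names:

```python
# Hypothetical in-memory stand-ins: a list for the Kafka topic, dicts for output stores.
event_log = [
    {"user_id": "u1", "event_type": "click"},
    {"user_id": "u1", "event_type": "purchase"},
    {"user_id": "u2", "event_type": "click"},
]

def job_v1(store: dict, event: dict) -> None:
    """Old logic: count every event."""
    store[event["user_id"]] = store.get(event["user_id"], 0) + 1

def job_v2(store: dict, event: dict) -> None:
    """New logic: count clicks only."""
    if event["event_type"] == "click":
        store[event["user_id"]] = store.get(event["user_id"], 0) + 1

def replay(job, log: list) -> dict:
    """Steps 1-3: run a job over the whole log from offset 0 into a fresh store."""
    store = {}
    for event in log:
        job(store, event)
    return store

stores = {"v1": replay(job_v1, event_log)}
serving_pointer = "v1"                      # serving reads stores[serving_pointer]

stores["v2"] = replay(job_v2, event_log)    # v2 catches up while v1 keeps serving
serving_pointer = "v2"                      # step 4: swap the serving pointer
del stores["v1"]                            # step 5: decommission v1

print(stores[serving_pointer])  # {'u1': 1, 'u2': 1}
```

The swap is a single pointer change, so readers see either the complete v1 output or the complete v2 output, never a mixture.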

Kappa in ML Systems

For ML, Kappa maps to: one streaming feature pipeline that computes all features from the event log, and one streaming training loop (or periodic retraining triggered by the same stream). You replay when you need to retrain with a new algorithm.

The Kappa approach works well for feature stores that are fully derived from events. A user's click count over the last 7 days, session duration, item interaction rate: all of these can be computed from an event stream replay. Where Kappa struggles in ML is with computations that require global passes over the data: matrix factorization iterates over the full user-item matrix many times, and gradient descent on a neural network requires multiple epochs. These are inherently batch computations that do not fit naturally into a streaming job.
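A 7-day click count is a good example of a feature that a single streaming job can maintain, whether it is processing live traffic or a replay. The sketch below keeps per-user state in a deque; `RollingClickCounter` is a hypothetical name, and the code assumes events arrive in timestamp order, as they would within one partition:

```python
from collections import deque

SEVEN_DAYS = 7 * 24 * 3600  # window length in seconds

class RollingClickCounter:
    """Streaming count of clicks in the trailing 7-day window for one user."""

    def __init__(self, window_seconds: int = SEVEN_DAYS):
        self.window = window_seconds
        self.timestamps = deque()

    def on_click(self, ts: float) -> None:
        self.timestamps.append(ts)

    def count(self, now: float) -> int:
        # Evict clicks that have fallen out of the window, then count the rest.
        while self.timestamps and self.timestamps[0] <= now - self.window:
            self.timestamps.popleft()
        return len(self.timestamps)

DAY = 24 * 3600
counter = RollingClickCounter()
for ts in (0 * DAY, 2 * DAY, 6 * DAY):   # events replayed in timestamp order
    counter.on_click(ts)

print(counter.count(now=6 * DAY))   # 3
print(counter.count(now=8 * DAY))   # 2
```

Each event is seen exactly once, which is what makes the computation stream-friendly. A multi-epoch training loop has no equivalent single-pass form.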


Lambda vs Kappa: When Each Wins

| Dimension | Lambda | Kappa |
| --- | --- | --- |
| Code complexity | Two codebases (batch + stream) | One codebase (stream only) |
| Operational complexity | Higher - two systems to manage | Lower |
| Correctness | High - batch recomputes from scratch | High - stream replay from event log |
| Freshness | Near real-time via speed layer | Real-time |
| Works with multi-epoch training | Yes - batch layer handles it | Requires workaround |
| Suitable for petabyte-scale history | Yes - Spark excels here | Challenging - replay latency grows |
| Best for | Deep model training + real-time features | Feature computation, lightweight models |

Production Code: Lambda Feature Pipeline

Here is a production-grade Python implementation of the Lambda serving layer - the component that merges batch features with real-time features at inference time.

import redis
import boto3
import json
import time
from typing import Optional
from dataclasses import dataclass


@dataclass
class UserFeatures:
    user_id: str
    # Batch features (from last training run, hours old)
    embedding: list
    long_term_category_prefs: dict
    # Real-time features (seconds old)
    session_clicks: int = 0
    session_category: Optional[str] = None
    last_event_ts: Optional[float] = None


class LambdaServingLayer:
    """
    Serving layer that merges batch features (DynamoDB) with
    real-time features (Redis) for ML inference.

    Batch layer runs every 6 hours via Spark.
    Speed layer updates Redis on every user event via Kafka consumer.
    """

    def __init__(
        self,
        redis_host: str = "localhost",
        redis_port: int = 6379,
        batch_table: str = "user_batch_features",
        batch_ttl_seconds: int = 3600,
    ):
        self.redis = redis.Redis(
            host=redis_host,
            port=redis_port,
            decode_responses=True,
            socket_connect_timeout=2,
            socket_timeout=2,
        )
        self.dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
        self.batch_table = self.dynamodb.Table(batch_table)
        self.batch_ttl = batch_ttl_seconds
        # Local in-process cache for batch features: user_id -> (features, fetch time)
        self._batch_cache: dict = {}

    def get_batch_features(self, user_id: str) -> dict:
        """
        Fetch batch features from DynamoDB with in-process TTL caching.
        Batch features are expensive to fetch; cache them aggressively.
        """
        now = time.monotonic()
        cached = self._batch_cache.get(user_id)
        if cached and (now - cached[1]) < self.batch_ttl:
            return cached[0]

        try:
            response = self.batch_table.get_item(Key={"user_id": user_id})
        except Exception as e:
            # Degrade gracefully - return empty features, not an error.
            # The failure is not cached, so the next request retries DynamoDB.
            print(f"[LambdaServing] DynamoDB fetch failed for {user_id}: {e}")
            return {}

        features = response.get("Item", {})
        self._batch_cache[user_id] = (features, now)
        return features

    def get_realtime_features(self, user_id: str) -> dict:
        """
        Fetch real-time features from Redis.
        Redis is written by the Kafka consumer in the speed layer.
        TTL is set to 2 hours on each write.
        """
        key = f"rt_features:{user_id}"
        try:
            raw = self.redis.hgetall(key)
            # Values are stored as JSON strings; counters written by HINCRBY
            # are plain integer strings, which json.loads also parses.
            return {k: json.loads(v) for k, v in raw.items()}
        except Exception as e:
            print(f"[LambdaServing] Redis fetch failed for {user_id}: {e}")
            return {}

    def get_merged_features(self, user_id: str) -> UserFeatures:
        """
        Merge batch features and real-time features.
        Real-time features override batch features for the same key.
        This is the core Lambda serving operation.
        """
        batch = self.get_batch_features(user_id)
        realtime = self.get_realtime_features(user_id)

        # Real-time features are more current - they take precedence
        merged = {**batch, **realtime}

        return UserFeatures(
            user_id=user_id,
            embedding=merged.get("embedding", []),
            long_term_category_prefs=merged.get("category_prefs", {}),
            session_clicks=int(merged.get("session_clicks", 0)),
            session_category=merged.get("session_category"),
            last_event_ts=merged.get("last_event_ts"),
        )

    def predict_recommendations(
        self,
        user_id: str,
        candidate_items: list,
        model,
        top_k: int = 20,
    ) -> list:
        """End-to-end recommendation with Lambda-merged features."""
        features = self.get_merged_features(user_id)
        # `model` is any object exposing score(features, items) -> list of floats
        scores = model.score(features, candidate_items)
        ranked = sorted(zip(candidate_items, scores), key=lambda x: -x[1])
        return [item_id for item_id, _ in ranked[:top_k]]

The Speed Layer: Kafka Consumer Updating Redis

from kafka import KafkaConsumer
import redis
import json
import time


class SpeedLayerConsumer:
    """
    Speed layer: consumes user events from Kafka and updates
    real-time feature vectors in Redis.

    This is the streaming half of Lambda architecture.
    """

    def __init__(
        self,
        kafka_brokers: list,
        topic: str = "user_events",
        group_id: str = "speed_layer_v1",
        redis_host: str = "localhost",
    ):
        self.consumer = KafkaConsumer(
            topic,
            bootstrap_servers=kafka_brokers,
            group_id=group_id,
            value_deserializer=lambda v: json.loads(v.decode("utf-8")),
            auto_offset_reset="latest",  # speed layer only needs recent data
            enable_auto_commit=True,
        )
        self.redis = redis.Redis(host=redis_host, decode_responses=True)
        self.FEATURE_TTL = 7200  # 2 hours in Redis

    def process_event(self, event: dict) -> None:
        """
        Update real-time features for one user event.
        redis-py pipelines are transactional (MULTI/EXEC) by default,
        so the queued updates are applied atomically.
        """
        user_id = event.get("user_id")
        if not user_id:
            # Skip malformed events instead of writing to "rt_features:None"
            return
        event_type = event.get("event_type")
        category = event.get("category")
        ts = event.get("timestamp", time.time())

        key = f"rt_features:{user_id}"
        pipe = self.redis.pipeline()

        if event_type == "click":
            pipe.hincrby(key, "session_clicks", 1)
            if category:
                pipe.hset(key, "session_category", json.dumps(category))

        if event_type in ("click", "purchase", "view"):
            pipe.hset(key, "last_event_ts", json.dumps(ts))

        if event_type == "purchase":
            # Reset session on purchase
            pipe.hset(key, "session_clicks", json.dumps(0))

        # Refresh the TTL on every write so idle sessions expire on their own
        pipe.expire(key, self.FEATURE_TTL)
        pipe.execute()

    def run(self) -> None:
        print("[SpeedLayer] Starting consumer...")
        for message in self.consumer:
            try:
                self.process_event(message.value)
            except Exception as e:
                print(f"[SpeedLayer] Error processing event: {e}")
                # Do not crash the consumer on a single bad event

Kappa Architecture in Code: Stream Replay

The power of Kappa is in the replay. Here is how you trigger a full historical recompute while keeping production serving live:

from kafka import KafkaConsumer
import json
import time


class KappaReprocessor:
    """
    Kappa architecture reprocessing: replay entire event log
    with new algorithm version, writing to a new output store.
    Once caught up, atomically swap to new version.
    """

    def __init__(
        self,
        kafka_brokers: list,
        topic: str = "user_events",
        new_version: str = "v2",
    ):
        self.brokers = kafka_brokers
        self.topic = topic
        self.new_version = new_version

    def replay_from_beginning(self, processor_fn, output_store) -> None:
        """
        Replay all events from offset 0.
        processor_fn: your new algorithm, callable on each event
        output_store: where to write results (new Redis namespace or table)
        """
        # A fresh group id has no committed offsets, so the replay is
        # independent of the production consumer group.
        group_id = f"kappa_replay_{self.new_version}_{int(time.time())}"

        consumer = KafkaConsumer(
            bootstrap_servers=self.brokers,
            group_id=group_id,
            value_deserializer=lambda v: json.loads(v.decode("utf-8")),
            auto_offset_reset="earliest",  # KEY: start from the very beginning
            enable_auto_commit=True,
            # End iteration if no message arrives for 10 s, so the loop
            # cannot block forever on a quiet topic once it has caught up.
            consumer_timeout_ms=10_000,
        )
        consumer.subscribe([self.topic])

        # Force assignment and seek to beginning
        consumer.poll(timeout_ms=1000)
        for tp in consumer.assignment():
            consumer.seek_to_beginning(tp)

        print(f"[KappaReprocessor] Starting replay for {self.new_version}")
        events_processed = 0

        for message in consumer:
            try:
                result = processor_fn(message.value)
                output_store.write(result)
                events_processed += 1

                # Computing lag costs a broker round trip, so check it
                # only every 100,000 events.
                if events_processed % 100_000 == 0:
                    lag = self._compute_lag(consumer)
                    print(
                        f"[Replay] Processed {events_processed:,} events. "
                        f"Lag: {lag:,}"
                    )
                    if lag < 1000:
                        print("[Replay] Caught up. Ready to swap.")
                        break

            except Exception as e:
                print(f"[Replay] Error on event {events_processed}: {e}")

        consumer.close()
        print(
            f"[KappaReprocessor] Replay complete. "
            f"{events_processed:,} events processed."
        )

    def _compute_lag(self, consumer) -> int:
        """Compute total consumer lag across all partitions."""
        partitions = list(consumer.assignment())
        end_offsets = consumer.end_offsets(partitions)  # one request for all partitions
        return sum(
            max(0, end_offsets[tp] - consumer.position(tp)) for tp in partitions
        )

Production Engineering Notes

Retention Policy for Kappa

Kafka's default retention is 7 days. For Kappa to work, you need infinite or very long retention. Use Kafka's log compaction (key-based) for stateful topics, and set retention.ms=-1 (unlimited) for your raw event topics. Budget accordingly - 1 billion events per day at 500 bytes per event is roughly 500 GB/day. Three years of history is around 550 TB in total across all partitions, before replication and compression.
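The storage arithmetic is easy to rerun for your own numbers. This sketch uses the figures from the paragraph above; the replication factor of 3 is an assumption (a common Kafka setting), and compression is ignored. Note that the result is the total for the topic, summed over all partitions:

```python
EVENTS_PER_DAY = 1_000_000_000
BYTES_PER_EVENT = 500
DAYS = 3 * 365
REPLICATION_FACTOR = 3  # assumption: a common Kafka setting, not stated in the text

daily_gb = EVENTS_PER_DAY * BYTES_PER_EVENT / 1e9   # bytes -> GB
total_tb = daily_gb * DAYS / 1e3                    # GB -> TB

print(f"{daily_gb:.0f} GB/day")                              # 500 GB/day
print(f"{total_tb:.1f} TB over 3 years")                     # 547.5 TB over 3 years
print(f"{total_tb * REPLICATION_FACTOR:.1f} TB replicated")  # 1642.5 TB replicated
```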

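Tiered storage (available as early access since Kafka 3.6) moves cold log segments to object storage such as S3 automatically, reducing broker disk costs by an estimated 60-80% for workloads dominated by cold data. This makes long-retention Kappa economically viable at scale.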

Batch Layer Sizing

A Lambda batch job that processes 10 TB of data with Spark on EMR typically completes in 30-90 minutes with a 50-node r5.4xlarge cluster. Cost is approximately $15-25 per run. Running every 6 hours is $60-100/day - reasonable for a production recommendation system with millions of users.

Lag Monitoring

Always monitor consumer lag in your speed layer. As a rule of thumb, a healthy speed layer consumer should have lag under 10,000 events, though the right threshold depends on your event rate. If lag grows to millions of events, your speed layer is falling behind and real-time features will be stale - effectively turning your Lambda system into a batch-only system.

from prometheus_client import Gauge

speed_layer_lag = Gauge(
    "speed_layer_consumer_lag_events",
    "Number of events the speed layer consumer is behind",
    labelnames=["topic", "partition"],
)
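The value fed into that gauge is the per-partition difference between the log end offset and the consumer's position. A dependency-free sketch of the calculation, with made-up offsets and a hypothetical `partition_lags` helper:

```python
LAG_ALERT_THRESHOLD = 10_000  # events; the rule-of-thumb figure mentioned above

def partition_lags(end_offsets: dict, positions: dict) -> dict:
    """Lag per partition = newest offset in the log minus the consumer's position."""
    return {p: max(0, end_offsets[p] - positions.get(p, 0)) for p in end_offsets}

# Hypothetical snapshot: partition id -> offset
end_offsets = {0: 1_500_000, 1: 1_480_000}
positions = {0: 1_499_200, 1: 1_465_000}

lags = partition_lags(end_offsets, positions)
total = sum(lags.values())

print(lags)                         # {0: 800, 1: 15000}
print(total > LAG_ALERT_THRESHOLD)  # True
```

In a real consumer, each value would be exported with `speed_layer_lag.labels(topic=..., partition=...).set(lag)`, and the alert would typically live in the monitoring system rather than in application code.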

:::danger Training-Serving Skew in Lambda

The most dangerous failure mode in Lambda ML systems is training-serving skew - when the features used to train the model (computed by the batch layer) differ from the features used at serving time (merged batch + real-time). If your batch layer computes clicks_last_7_days using one definition and your speed layer computes session_clicks using a different window, your model has never seen the actual feature distribution it encounters at inference time. This causes silent accuracy degradation that is extremely hard to diagnose.

Solution: Define all feature logic in a single shared library. Both the batch Spark job and the streaming Kafka consumer import from the same features.py. The serving layer uses the same library for any on-the-fly computation. One definition, three uses. :::
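A minimal sketch of the shared-library idea, assuming a hypothetical `clicks_in_window` function that would live in the shared `features.py`. The batch path and the streaming path below call the same function, so they cannot disagree on the window definition:

```python
# Hypothetical contents of the shared features.py module.
def clicks_in_window(events: list, now: float, window_seconds: float) -> int:
    """Single definition of the click-count feature, used by every layer."""
    return sum(
        1
        for e in events
        if e["event_type"] == "click" and now - window_seconds < e["ts"] <= now
    )

WINDOW = 7 * 24 * 3600  # 7 days in seconds

events = [
    {"event_type": "click", "ts": 1_000},     # older than 7 days at `now`
    {"event_type": "view", "ts": 2_000},
    {"event_type": "click", "ts": 100_000},
    {"event_type": "click", "ts": 700_000},
]
now = 700_000

# Batch job: applies the function to the full history at once.
batch_value = clicks_in_window(events, now, WINDOW)

# Streaming consumer: accumulates events one at a time, then calls the same function.
stream_state = []
for event in events:
    stream_state.append(event)
stream_value = clicks_in_window(stream_state, now, WINDOW)

print(batch_value, stream_value)  # 2 2
```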

:::warning Kappa Replay Time

Kappa's "replay from beginning" sounds great until you have 3 years of event history. Replaying 500 billion events through a streaming job takes time - potentially days. Plan your replay strategy carefully:

  • Use many Kafka partitions (consumer parallelism within a group is capped by the partition count, so more partitions allow more parallel replay workers)
  • Deploy replay consumers on separate hardware from production consumers
  • Stage the swap when replay lag drops below 10 minutes of events, not zero
  • Keep the old version live until the new version has been stable for 24+ hours :::
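A back-of-the-envelope estimate shows why partition count matters for replay. The per-partition throughput below is an assumption for illustration only; measure your own job before planning around it:

```python
TOTAL_EVENTS = 500_000_000_000
EVENTS_PER_SEC_PER_PARTITION = 20_000  # assumption; depends on job logic and hardware
SECONDS_PER_DAY = 86_400

def replay_days(partitions: int) -> float:
    """Days to replay the full log with one consumer per partition."""
    throughput = EVENTS_PER_SEC_PER_PARTITION * partitions
    return TOTAL_EVENTS / throughput / SECONDS_PER_DAY

for partitions in (12, 48, 192):
    print(f"{partitions:>3} partitions: {replay_days(partitions):.1f} days")
# Prints:
#  12 partitions: 24.1 days
#  48 partitions: 6.0 days
# 192 partitions: 1.5 days
```

The estimate assumes the output store and downstream systems can absorb the replay write rate, which is often the real bottleneck.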

Interview Q&A

Q1: Explain Lambda architecture and its trade-offs in the context of an ML recommendation system.

Lambda architecture splits processing into three layers: batch (full historical recompute on a schedule), speed (real-time incremental updates), and serving (merging both at query time). For a recommendation system, the batch layer trains the model on all historical interactions every 6-24 hours. The speed layer tracks current session behavior in Redis. The serving layer loads the trained model and injects session features at inference time.

The trade-off: the batch layer produces accurate, correct results from full data, while the speed layer provides low-latency freshness. The cost is two codebases (Spark for batch, Flink/Kafka for streaming) that must produce identical feature logic but use completely different APIs. This dual-codebase maintenance burden is Lambda's Achilles heel and the reason Kappa was invented.


Q2: When would you choose Kappa over Lambda for an ML system?

Choose Kappa when: (1) your feature computation can be expressed as streaming operations without requiring multiple passes over data - session counts, rolling averages, recent interaction rates; (2) you have sufficient Kafka retention to replay the full history; (3) your team cannot afford to maintain two separate pipelines; (4) your models are lightweight enough to train in an online or mini-batch fashion from the stream.

Choose Lambda when: (1) your ML algorithm requires multiple epochs over the full dataset (deep neural networks, matrix factorization); (2) your historical data volume is too large for reasonable Kafka replay; (3) your batch computation has complex joins that are impractical in streaming. Many production deep learning systems use a pragmatic hybrid: a streaming feature pipeline (Kappa) feeding periodic offline training (batch).


Q3: How do you handle training-serving skew in a Lambda ML system?

Training-serving skew is the number one silent killer of Lambda ML systems. The fix is strict feature centralization: every feature has exactly one implementation, and both the batch layer and the serving layer import it from the same library.

Concretely: define a FeatureComputer class in a shared Python package. The Spark batch job imports FeatureComputer and applies it to the full dataset. The serving layer imports the same FeatureComputer and applies it to real-time features. You publish this package as an internal pip package with versioned releases, so batch job and serving layer are always pinned to the same version. Add shadow mode testing: log the features actually used at serving time, compare their distribution to training-time features daily, and alert if distributions diverge beyond a threshold.
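One common way to quantify the distribution comparison is the Population Stability Index (PSI). The sketch below is an illustrative choice rather than the only option; the bin edges, sample values, and the 0.25 alert threshold are assumptions, with values around 0.1 to 0.25 being common rules of thumb:

```python
import math

def psi(train_values, serve_values, bin_edges, eps=1e-6):
    """Population Stability Index between two samples over shared bins."""

    def proportions(values):
        counts = [0] * (len(bin_edges) - 1)
        for v in values:
            for i in range(len(counts)):
                is_last = i == len(counts) - 1
                # Bins are [low, high), except the last, which includes its upper edge.
                if bin_edges[i] <= v < bin_edges[i + 1] or (is_last and v == bin_edges[-1]):
                    counts[i] += 1
                    break
        total = max(1, sum(counts))
        # Floor at eps so empty bins do not produce log(0) or division by zero.
        return [max(c / total, eps) for c in counts]

    p = proportions(train_values)   # expected (training-time) distribution
    q = proportions(serve_values)   # actual (serving-time) distribution
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))

train = [1, 2, 3, 4, 5, 6, 7, 8]
serve_ok = [1, 2, 3, 4, 5, 6, 7, 8]
serve_shifted = [6, 7, 8, 9, 9, 8, 7, 1]
edges = [0, 5, 10]

PSI_ALERT = 0.25  # assumed threshold
print(psi(train, serve_ok, edges))                    # 0.0
print(psi(train, serve_shifted, edges) > PSI_ALERT)   # True
```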


Q4: How does Spotify use Lambda-like architecture for music recommendations?

Spotify's Discover Weekly and similar features can be described as a Lambda-like system; treat the component names and scale figures here as illustrative rather than confirmed details of Spotify's internals. The batch layer runs weekly matrix factorization over hundreds of millions of user profiles and tens of millions of track interaction histories using Spark on GCP Dataproc, producing user and track embeddings. These embeddings are stored in Bigtable for fast serving. The speed layer tracks session listening behavior in real time using Kafka and updates session-level features in Redis. At serving time, the playlist generation service loads the user's embedding from Bigtable, reads session context from Redis, and uses approximate nearest neighbor search (for example ScaNN, or Annoy, which Spotify open-sourced) to find matching tracks. The full system refreshes embeddings weekly while maintaining real-time responsiveness.


Q5: What happens when your batch layer produces a wrong result in Lambda? How do you recover?

Lambda's recovery story is its strongest argument over pure streaming. Because the master dataset (raw events in S3/HDFS) is immutable and append-only, recovery is mechanical: fix the bug, delete the incorrect batch view, and rerun the batch job from raw data. The speed layer continues serving approximate real-time results during the batch recompute - users experience degraded quality, not an outage.

Contrast this with a pure database system where you have overwritten the state: recovery requires point-in-time restore and manually replaying transactions. Lambda's immutable master dataset eliminates that class of problem entirely. The recovery SLA is simply "time to rerun the batch job" - typically 30 minutes to 4 hours depending on data scale and cluster size.


Summary

Lambda architecture separates data processing into batch correctness and streaming freshness, merged at serving time. It works well for ML systems that require deep offline training plus real-time feature freshness, but carries a dual-codebase maintenance burden. Kappa simplifies this by making the stream replayable - one codebase, full historical correctness on demand - at the cost of needing workarounds for algorithms that require multi-epoch batch training. Many production ML systems use a pragmatic hybrid: Kappa-style streaming for features, periodic batch training for model weights. The choice between them should be driven by your algorithm's compute characteristics, your data retention capacity, and your team's operational bandwidth.

© 2026 EngineersOfAI. All rights reserved.