Skip to main content

:::tip 🎮 Interactive Playground Visualize this concept: Try the Flink Stream Processing demo on the EngineersOfAI Playground - no code required. :::

Kafka Streams vs Apache Flink - The ML Pipeline Decision Guide

Reading time: ~25 minutes | Interview relevance: High | Target roles: ML Engineer, Data Engineer, AI Platform Engineer


The 9 AM Architecture Review

It is Monday morning at a fintech startup. Maya, the ML platform architect, is standing in front of a whiteboard with three engineers and a VP of Engineering who has one question: "Should we use Kafka Streams or Flink?" They need to build the real-time feature infrastructure for a fraud detection model. 50,000 events per second at peak. Stateful aggregations: rolling transaction counts per user, velocity features, merchant risk scores. Features must be fresh within seconds.

The wrong choice is expensive in both directions. Choose Kafka Streams and hit a wall at 500 GB of state in six months - then face a painful migration. Choose Flink and spend the next three months managing cluster infrastructure that this three-engineer team cannot afford to babysit. Either way, the ML model training schedule is disrupted, features go stale during incidents, and the team loses credibility with the business.

Maya has built both. She has debugged Flink job restarts at 2 AM and watched Kafka Streams applications silently OOM under state pressure. She has seen Python ML teams spend weeks writing Java just to use Kafka Streams, and seen simple aggregation jobs massively over-engineered with a full Flink cluster. She also knows about a third option the VP has not heard of: Faust, the Python library that gives you Kafka Streams semantics without a single line of Java.

What Maya knows is that this is not a question with a universally correct answer. It is a question about your team's primary language, your current and projected state size, your tolerance for infrastructure complexity, and the specific shapes of computation your ML pipeline requires. The decision framework exists. The tools are mature. You just need to understand them deeply enough to apply the framework correctly to your situation.

By the end of this lesson you will understand exactly what Kafka Streams, Faust, and Apache Flink each are at the architectural level, what tradeoffs they encode, and how to choose the right one for a given ML feature pipeline problem.


Why This Exists - The State Problem in Streaming

Early stream processing was stateless. You received an event, transformed it, emitted a result. No memory across events. This worked for simple transformations - filtering bad records, enriching events from a static lookup table, converting formats.

The moment you need to answer "how many transactions has this user made in the last hour?" you have a state problem. You must remember. Across millions of users. Across billions of events. Across process restarts and machine failures. The naive answer - keep it all in memory - fails immediately at production scale. The state outlives any single process.

The three tools in this lesson are fundamentally different answers to the same question: how do you maintain large, fault-tolerant, queryable state in a streaming system at production scale?

  • Kafka Streams answers: keep state local to each application instance, backed by RocksDB on disk, with Kafka acting as the fault-tolerant changelog
  • Faust answers: same idea, but implemented in Python with asyncio instead of Java
  • Apache Flink answers: use a distributed computation engine purpose-built for this, with its own cluster, its own state management, and its own checkpoint coordination

What Kafka Streams Is

Kafka Streams is a client library, not a cluster. You add it as a dependency to your Java or Kotlin application. When your application runs, it is both a Kafka consumer and a Kafka producer. It reads events from input topics, applies your processing logic, and writes results to output topics. The "cluster" is your own set of JVM processes - you scale it by starting more instances.

This is the key architectural insight: there is no separate streaming infrastructure to deploy or manage. Your application is the stream processor. This is what makes Kafka Streams uniquely attractive for small teams.

KStream and KTable - The Stream-Table Duality

Everything in Kafka Streams begins with this duality:

KStream represents an unbounded sequence of events. Each message is an independent fact. A click event. A payment event. A sensor reading. Nothing is updated - new events just keep appending.

KTable represents the current state of a key. Each new message for a key is an upsert: it replaces whatever was stored before. A KTable is the materialized current state of a changelog. It answers "what is the latest value for this key?"

The stream-table duality is mathematical. A KStream is the derivative - the series of changes. A KTable is the integral - the accumulated state. Given a KStream of payment events, you materialize a KTable of "total spend per user" by grouping and summing. Given a KTable with all its changes logged (the changelog topic), you reconstruct the KStream.

KStream: each event is a new independent record
[user_1, purchase, $50]
[user_1, purchase, $30]
[user_2, purchase, $100]
[user_1, purchase, $20]

KTable: each event upserts the key's current value
user_1 → $100 (50 + 30 + 20, accumulated)
user_2 → $100 (single event)

GlobalKTable - Broadcast State for Enrichment

A regular KTable is partitioned - each instance holds only the partitions assigned to it. A GlobalKTable is fully replicated to every instance. This is the Kafka Streams equivalent of broadcast state: use it when you want to enrich streaming events with a reference dataset (merchant categories, user profiles, risk scores) without requiring co-partitioning between the stream and the reference data.

The tradeoff: GlobalKTable consumes memory and disk on every instance. Fine for a 10,000-row merchant category table. Not fine for a 100-million-row user profile table - use a partitioned KTable with co-partitioning instead.

State Stores - RocksDB Under the Hood

When your Kafka Streams application performs aggregations, joins, or windowed operations, the intermediate state lives in a state store. The default implementation is RocksDB - an embedded LSM-tree key-value store from Facebook that spills to disk when data exceeds the JVM heap. This is what lets Kafka Streams handle state that is larger than available RAM.

Every state store has a corresponding changelog topic in Kafka. Every update to the store is written to this topic. When an instance crashes and restarts - or when partition assignment changes during scaling - the new owner restores state by replaying the changelog from the beginning. This is how Kafka Streams achieves fault tolerance without any external coordination service.

Queryable state stores (Interactive Queries) expose state store contents via a query interface. External services can ask "what is the current fraud risk score for user X?" directly from the stream processor's local state. For ML feature serving, this is powerful: your feature computation process can also serve feature reads, eliminating the need for a separate cache layer in simple cases.

Exactly-Once in Kafka Streams

Kafka Streams achieves exactly-once semantics via Kafka transactions. A single atomic transaction encompasses: advancing the input offset, writing the state store changelog update, and writing the output record. Either all three happen atomically, or none do. No external coordinator is required - Kafka itself manages the transaction.

The limitation: exactly-once in Kafka Streams only covers within-Kafka operations. If you write to an external system - Redis, PostgreSQL, a REST API - those writes are not covered by the Kafka transaction and require separate idempotency handling.

Deployment - Zero Extra Infrastructure

Deploying Kafka Streams is just deploying a JVM application. Start more instances to scale out. Each additional instance automatically picks up a share of the Kafka partitions via the consumer group rebalancing protocol. Remove instances and the remaining ones absorb the freed partitions. No cluster manager. No resource negotiation. No separate scheduler to learn.


Apache Flink is a distributed streaming computation engine with its own cluster architecture. The cluster has two components:

  • JobManager: coordinates job submission, plans the execution graph, manages checkpoints, and handles failure recovery
  • TaskManagers: worker nodes that execute the operator instances assigned to them, storing state locally

You submit a job to the cluster. The JobManager plans it as a directed acyclic graph of operators and assigns operator instances (tasks) to task slots in the TaskManagers. This architecture is more complex to operate but unlocks capabilities that Kafka Streams cannot provide:

  • Arbitrary state backends: RocksDB, in-memory heap, or custom. State can reach terabytes with incremental checkpointing.
  • Rich windowing: tumbling, sliding, session, global windows with custom triggers and evictors
  • Complex Event Processing (CEP): pattern matching across sequences of events
  • Async I/O: call external services (enrichment APIs, model endpoints) asynchronously from within operators without blocking the processing pipeline
  • Multiple source types: Kafka, Kinesis, Pulsar, Apache Iceberg, JDBC, files - not just Kafka
  • Flink SQL: declarative stream processing that generates optimized execution plans

Flink is natively Java and Scala. PyFlink provides Python bindings via the Table API, DataStream API, and SQL API. The Python code runs in a Python subprocess alongside the Java process, with serialization at the boundary.

For ML teams, the Flink SQL and Table API work well from PyFlink - they push computation into the JVM and avoid the serialization overhead. The DataStream API in PyFlink is functional but more limited than the Java equivalent. For maximum PyFlink performance: use SQL or built-in operators wherever possible, and use Python only for the parts that genuinely require it - ML model inference, custom feature logic.


Faust - Kafka Streams Semantics in Python

Faust is a Python stream processing library built on asyncio, originally developed at Robinhood. It implements Kafka Streams-like semantics in Python: agents (async coroutines that process Kafka topics), tables (persistent key-value state backed by Kafka changelog topics), and exactly-once support.

import faust

app = faust.App(
"fraud-features",
broker="kafka://localhost:9092",
value_serializer="json",
)

# A Faust Table: persistent state, backed by a Kafka changelog topic
# Survives worker restarts via changelog replay
user_spend_1h = app.Table("user_spend_1h", default=float)

payments_topic = app.topic("payments", value_type=dict)

@app.agent(payments_topic)
async def process_payment(payments):
async for payment in payments:
user_id = payment["user_id"]
amount = payment["amount"]
# State update - durable across restarts
user_spend_1h[user_id] += amount
yield {"user_id": user_id, "spend_1h": user_spend_1h[user_id]}

A Faust agent is an async Python coroutine that consumes a Kafka topic. It is the fundamental unit of computation in Faust. Agents can be chained: one agent writes to a Faust Channel or Kafka topic, another agent reads from it.

A Faust Table is a persistent key-value store backed by a Kafka changelog topic. State survives worker restarts. When a Faust worker restarts, it replays the changelog topic to restore state to the last committed point. This is identical in principle to Kafka Streams state stores.

Faust is deployed as a plain Python process: faust -A myapp worker -l info. Scale by starting more worker processes. Kafka partition assignment distributes the work across workers.


Head-to-Head Comparison for ML Teams

Latency

All three achieve sub-second end-to-end latency for per-event processing. For windowed aggregations, latency equals window size plus trigger delay. Flink can achieve lower latency for complex multi-stage pipelines due to its pipelined execution model and backpressure handling. In practice, for typical ML feature computation (rolling aggregations, enrichments), the latency differences among these tools are rarely the deciding factor.

State Size

ToolState BackendPractical Per-Node Limit
Kafka StreamsRocksDB (local disk)~500 GB (disk-bound)
FaustRocksDB (local disk)~100 GB (disk-bound)
FlinkRocksDB + incremental checkpointsMultiple TB (distributed)

Flink wins decisively on large state. Its incremental checkpointing means consistent state snapshots of terabytes are taken by only writing the delta, not the full state - so checkpoints do not stall processing.

Deployment Complexity

ToolInfrastructure RequiredOperational Overhead
Kafka StreamsNone (just Kafka)Very low
FaustNone (just Kafka)Very low
FlinkJobManager + TaskManagersMedium to High

For a small team, Flink cluster management is a real ongoing cost. Managed Flink (Confluent Cloud, Amazon Kinesis Data Analytics, Ververica Platform) eliminates this at the cost of money.

Python Support

ToolPython Support
Kafka StreamsJava / Kotlin only
FaustFirst-class Python (asyncio)
FlinkPyFlink - functional but limited vs Java

ML teams are Python-native. This is often the deciding factor. If your team will not write Java, Kafka Streams is off the table. Use Faust or Flink (PyFlink) instead.

ML-Specific Capabilities

Async I/O in Flink is transformative for ML pipelines. You can call an external enrichment service or model endpoint asynchronously from within a Flink operator, without blocking the pipeline thread:

# Flink Async I/O - non-blocking calls to external enrichment
from pyflink.datastream.functions import AsyncFunction
import aiohttp

class EnrichWithModelScore(AsyncFunction):
async def open(self, runtime_context):
self.session = aiohttp.ClientSession()

async def async_invoke(self, value, result_future):
async with self.session.post(
"http://feature-enrichment-service/enrich",
json={"user_id": value["user_id"]}
) as resp:
enrichment = await resp.json()
result_future.complete([{**value, **enrichment}])

Faust handles this natively through asyncio - any async Python library works inside a Faust agent without any extra framework support.

Kafka Streams has no async I/O primitive. External calls block the processing thread, directly reducing throughput proportional to the call latency.


Code Examples

1. Faust Agent - Rolling 1-Hour Purchase Total with Session Tracking

import faust
from datetime import timedelta

app = faust.App(
"ml-features",
broker="kafka://localhost:9092",
value_serializer="json",
store="rocksdb://",
)

class PaymentEvent(faust.Record, serializer="json"):
user_id: str
merchant_id: str
amount: float
timestamp: float

class UserFeatures(faust.Record, serializer="json"):
user_id: str
spend_1h: float
txn_count_1h: int
unique_merchants_1h: int
computed_at: float

payments_topic = app.topic("payments", value_type=PaymentEvent)
features_topic = app.topic("user-features", value_type=UserFeatures)

# Persistent tables survive restarts via Kafka changelog
user_spend_table = app.Table("user_spend_1h", default=float)
user_txn_count_table = app.Table("user_txn_count_1h", default=int)
# Note: For true rolling windows, use windowed tables (see next example)

@app.agent(payments_topic, sink=[features_topic])
async def compute_user_features(payments):
# group_by ensures all events for a user go to the same worker
async for payment in payments.group_by(PaymentEvent.user_id):
uid = payment.user_id

user_spend_table[uid] += payment.amount
user_txn_count_table[uid] += 1

yield UserFeatures(
user_id=uid,
spend_1h=user_spend_table[uid],
txn_count_1h=user_txn_count_table[uid],
unique_merchants_1h=0, # simplified - see windowed example
computed_at=payment.timestamp,
)

2. Faust Windowed Table - True Tumbling Window Aggregation

import faust
from datetime import timedelta

app = faust.App("windowed-features", broker="kafka://localhost:9092")
payments_topic = app.topic("payments", value_type=dict)

# Hopping window: 1-hour window, updated every 10 minutes
# expires: how long to retain past windows
user_spend_windowed = (
app.Table("user_spend_1h_windowed", default=float)
.hopping(3600, step=600, expires=timedelta(hours=2))
)

user_txn_count_windowed = (
app.Table("user_txn_count_1h_windowed", default=int)
.hopping(3600, step=600, expires=timedelta(hours=2))
)

@app.agent(payments_topic)
async def process(payments):
async for payment in payments.group_by(lambda p: p["user_id"]):
uid = payment["user_id"]
user_spend_windowed[uid] += payment["amount"]
user_txn_count_windowed[uid] += 1

# .current() returns the value in the active window
current_spend = user_spend_windowed[uid].current()
current_count = user_txn_count_windowed[uid].current()

print(f"User {uid}: 1h_spend={current_spend:.2f}, 1h_txns={current_count}")

3. Faust + ML Inference - Scoring Events with a scikit-learn Model

import faust
import joblib
import numpy as np
from pathlib import Path

app = faust.App("fraud-scoring", broker="kafka://localhost:9092")

# Load model once at startup - never inside the agent
MODEL_PATH = Path("/models/fraud_model_v3.pkl")
model = None

@app.on_start.connect
async def load_model(app, **kwargs):
global model
model = joblib.load(MODEL_PATH)
print(f"Loaded model from {MODEL_PATH}")

payments_topic = app.topic("payments", value_type=dict)
scores_topic = app.topic("fraud-scores", value_type=dict)

spend_1h = app.Table("spend_1h", default=float)
txn_count_1h = app.Table("txn_count_1h", default=int)
avg_amount_7d = app.Table("avg_amount_7d", default=float)

@app.agent(payments_topic, sink=[scores_topic])
async def score_payments(payments):
async for payment in payments.group_by(lambda p: p["user_id"]):
uid = payment["user_id"]

# Update state
spend_1h[uid] += payment["amount"]
txn_count_1h[uid] += 1

# Build feature vector
features = np.array([[
payment["amount"],
spend_1h[uid],
txn_count_1h[uid],
avg_amount_7d[uid],
payment.get("merchant_risk_score", 0.5),
int(payment.get("is_international", False)),
]])

# Inference - synchronous, CPU-bound
# For high throughput (>50k/s), offload to a ThreadPoolExecutor
fraud_prob = float(model.predict_proba(features)[0][1])

yield {
"user_id": uid,
"transaction_id": payment["transaction_id"],
"fraud_probability": fraud_prob,
"is_flagged": fraud_prob > 0.85,
"scored_at": payment["timestamp"],
}
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.connectors.kafka import KafkaSource
from pyflink.datastream.window import TumblingEventTimeWindows
from pyflink.common import Time, WatermarkStrategy, Duration
from pyflink.datastream.functions import ProcessWindowFunction
import json

env = StreamExecutionEnvironment.get_execution_environment()
env.set_parallelism(4)
env.enable_checkpointing(60_000) # Checkpoint every 60 seconds

kafka_source = (
KafkaSource.builder()
.set_bootstrap_servers("localhost:9092")
.set_topics("payments")
.set_group_id("flink-feature-job")
.set_value_only_deserializer(SimpleStringSchema())
.build()
)

watermark_strategy = (
WatermarkStrategy
.for_bounded_out_of_orderness(Duration.of_seconds(10))
.with_timestamp_assigner(
lambda event, _: int(json.loads(event)["timestamp"] * 1000)
)
)

class UserSpendWindowFn(ProcessWindowFunction):
def process(self, user_id, context, elements):
records = [json.loads(e) for e in elements]
total_spend = sum(r["amount"] for r in records)
txn_count = len(records)

yield json.dumps({
"user_id": user_id,
"spend_5min": total_spend,
"txn_count_5min": txn_count,
"window_end_ms": context.window().end,
})

stream = (
env
.from_source(kafka_source, watermark_strategy, "Kafka Payments")
.map(lambda x: x) # parse happens inside ProcessWindowFunction
.key_by(lambda x: json.loads(x)["user_id"])
.window(TumblingEventTimeWindows.of(Time.minutes(5)))
.process(UserSpendWindowFn())
)

stream.print()
env.execute("user-spend-features-flink")
# ==============================================================
# FAUST: 5-minute tumbling window, spend per user
# Lines of code: ~15 | Infrastructure: zero extra
# State: RocksDB, local disk, max ~100 GB/node
# Python: native asyncio, any Python library works
# Best for: teams starting out, state < 100 GB, simple pipelines
# ==============================================================
import faust
from datetime import timedelta

app = faust.App("spend-features", broker="kafka://localhost:9092")
payments = app.topic("payments", value_type=dict)
spend = app.Table("spend_5min", default=float).tumbling(
300, expires=timedelta(minutes=15)
)

@app.agent(payments)
async def compute(stream):
async for event in stream.group_by(lambda e: e["user_id"]):
spend[event["user_id"]] += event["amount"]
print(f"{event['user_id']}: {spend[event['user_id']].current():.2f}")


# ==============================================================
# PYFLINK: Same computation, more boilerplate, cluster required
# Lines of code: ~45 | Infrastructure: Flink cluster
# State: distributed RocksDB, multi-TB possible
# Python: PyFlink API, serialization overhead at JVM boundary
# Best for: large state, complex joins, Async I/O, non-Kafka sources
# ==============================================================
# (See PyFlink example above - 3x more code, same logical computation)
# The extra complexity buys: proper watermark handling for out-of-order
# events, incremental checkpointing for large state, Async I/O support

6. Kafka Streams Concepts - Java Pseudocode as Reference

// Kafka Streams (Java) - shown for architectural understanding
// This is NOT runnable Python - it shows the core abstractions
StreamsBuilder builder = new StreamsBuilder();

// Source stream from Kafka topic
KStream<String, Payment> payments = builder.stream("payments");

// Group by user_id and aggregate into a KTable
KTable<Windowed<String>, UserFeatures> features = payments
.groupByKey()
.windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofHours(1)))
.aggregate(
UserFeatures::new, // initializer
(userId, payment, state) -> state.add(payment), // aggregator
Materialized.as("user-features-store") // RocksDB state store name
);

// Write results to output topic
features.toStream().to("user-features");

// Build and start the topology
KafkaStreams streams = new KafkaStreams(builder.build(), config);
streams.start();

// Interactive Query - serve feature reads from local RocksDB state
ReadOnlyWindowStore<String, UserFeatures> store =
streams.store("user-features-store", QueryableStoreTypes.windowStore());
UserFeatures current = store.fetch("user_123", Instant.now().minus(1, HOURS), Instant.now());

Decision Framework


Production Notes

Running Faust in Production

# faust_app.py - production configuration
import faust

app = faust.App(
"ml-features-prod",
broker=(
"kafka://kafka-1:9092;"
"kafka://kafka-2:9092;"
"kafka://kafka-3:9092"
),
value_serializer="json",
store="rocksdb://",

# Exactly-once semantics (adds ~15% overhead - evaluate if needed)
processing_guarantee="exactly_once",

# Topic defaults - match your Kafka cluster configuration
topic_replication_factor=3,
topic_partitions=12,

# Broker connection health
broker_request_timeout=90.0,
broker_session_timeout=60.0,
broker_heartbeat_interval=3.0,

# How often to clean up old table entries
table_cleanup_interval=30.0,

# For replay scenarios: set to "earliest"
consumer_auto_offset_reset="latest",
)

Start with: faust -A faust_app worker -l info --web-port=6066

The --web-port flag enables the Faust web interface at localhost:6066 for health checks, table inspection, and consumer lag monitoring. Use this as your liveness probe in Kubernetes.

Confluent Cloud Flink: easiest path if you are already on Confluent. Flink jobs run as SQL or Java/Python applications. No cluster management. Pay-per-compute-unit. Best integration with Confluent Schema Registry.

Amazon Kinesis Data Analytics (managed Flink): good if your stack is AWS. Some limitations on Flink version and connector support. Integrates natively with Kinesis, S3, and Glue.

Ververica Platform: enterprise Flink management by the original Flink creators (who founded the company). Best for large-scale, compliance-heavy environments.

Recommendation for a three-engineer team: start with Faust for Python-native pipelines. When you hit Faust's state size limits or need Async I/O with external services, migrate to managed Flink. Do not run self-managed Flink until you have dedicated platform engineering capacity and clear need.

Changelog Topic Retention - The Silent Killer

Faust and Kafka Streams both back their state stores with Kafka changelog topics. By default, Kafka topics have time-based retention (7 days). If a worker is down for more than 7 days, the changelog is partially truncated and state is unrecoverable.

The fix is simple and should be applied on day one:

# Set changelog topics to compacted - retains latest value per key forever
kafka-configs.sh --bootstrap-server localhost:9092 \
--entity-type topics \
--entity-name user_spend_1h-changelog \
--alter \
--add-config cleanup.policy=compact

For Faust, all table changelog topics end in -changelog. Set them all to cleanup.policy=compact immediately after first deployment.


Common Mistakes

:::danger Faust Table Without group_by If you update a Faust table inside an agent that is NOT keyed by the table key, you will get race conditions between workers - two workers may try to update the same key simultaneously, with the last write winning unpredictably. Always use .group_by(lambda e: e["user_id"]) or equivalent. This ensures all events for a given key are processed by exactly one worker instance. :::

:::danger Loading the ML Model Inside the Agent Never do this:

@app.agent(payments_topic)
async def score(payments):
async for payment in payments:
model = joblib.load("/models/model.pkl") # WRONG: loads every event
...

Load the model once at application startup using @app.on_start.connect. Loading on every event adds hundreds of milliseconds of latency per event and will cause the job to OOM as model files accumulate in memory. :::

:::warning PyFlink Serialization Overhead PyFlink's Python UDFs cross a JVM-Python process boundary on every call. For high-throughput pipelines (above 10,000 events per second), implement the computationally intensive operators in Java or use Flink SQL - these stay in the JVM. Use Python only for the final output transformation or for model inference that genuinely requires Python libraries. :::

:::warning Kafka Streams Requires Java - There Is No Python Client There is no official Kafka Streams Python client. Some community projects exist but they are thin wrappers with significant limitations and no production adoption. If your team writes Python, use Faust. Do not try to run Kafka Streams from Python. :::

:::tip When Exactly-Once Is Overkill for ML Features Exactly-once semantics carry a 10–20% throughput overhead. For ML feature computation where features are time-windowed aggregations (spend in last hour, transaction count in last day), at-least-once is often acceptable. A duplicate event slightly inflates a rolling count or sum, but it does not materially shift the model's risk assessment. Evaluate your accuracy requirements before paying the exactly-once tax. At-least-once with idempotent writes to Redis (using version checks) often gives you the practical guarantees you need at lower cost. :::


Interview Q&A

Q: What is the fundamental difference between Kafka Streams and Apache Flink?

A: Kafka Streams is a Java library - there is no separate cluster. Your application instance is the stream processor: it reads from Kafka, processes events, stores state locally in RocksDB (backed by Kafka changelog topics for fault tolerance), and writes to Kafka. Scaling is done by starting more application instances, which automatically pick up Kafka partitions via consumer group rebalancing.

Flink is a distributed computation engine with its own cluster: a JobManager that coordinates jobs and checkpoints, and TaskManagers that execute operators. Flink supports more complex computation - Async I/O, complex event patterns, terabytes of distributed state, multiple source types beyond Kafka - but requires managing or paying for cluster infrastructure. The core tradeoff is capability versus operational simplicity.

Q: What is Faust and how does it compare to Kafka Streams?

A: Faust is a Python stream processing library that implements Kafka Streams-like semantics using asyncio. It provides agents (async coroutines that process Kafka topics), tables (persistent key-value state backed by Kafka changelog topics for fault tolerance), windowed aggregations, and exactly-once support. The programming model is essentially Kafka Streams in Python.

The key differences from Kafka Streams: Faust is Python-native (huge advantage for ML teams), has a smaller community and ecosystem, has simpler windowing primitives (no session windows), and has a lower throughput ceiling than native Java Kafka Streams. For Python ML teams that want stateful stream processing without running a separate cluster, Faust is the right starting point.

Q: What is a KTable and how does it differ from a KStream?

A: This is the stream-table duality. A KStream is an unbounded sequence of events where each record is an independent fact - a click, a payment, a sensor reading. Records are never updated; new events just keep appending.

A KTable represents the current state of each key. Each new record for a key is an upsert - it replaces whatever was stored before. The KTable answers "what is the latest value for key X right now?" In ML terms, you might model raw payment events as a KStream, and compute a KTable of "current 1-hour spend per user" by aggregating the stream. Mathematically: the KStream is the derivative (the changes), the KTable is the integral (the accumulated state). Given either, you can reconstruct the other.

Q: What is a queryable state store in Kafka Streams?

A: A queryable state store (Interactive Queries) allows external services to read directly from a Kafka Streams application's local RocksDB state, without going through an output topic and separate cache. You expose the state store over HTTP, and clients can query "what is the current value for key X?" directly from the running stream processor.

For ML feature serving, this is useful for simple cases: your Kafka Streams application computes fraud features and also serves feature reads directly from the same RocksDB state. The limitation is that state is partitioned across instances - a query might need to route to the specific instance that holds the relevant partition. Most teams add a routing layer or use a dedicated feature store for production serving.

Q: When would you choose Flink over Kafka Streams (or Faust) for an ML pipeline?

A: Choose Flink when any of these apply: (1) State size will exceed ~100–500 GB - Flink's distributed RocksDB with incremental checkpointing handles terabytes where Faust hits local disk limits; (2) You need Async I/O - non-blocking calls to external enrichment services or model endpoints are a first-class Flink feature, critical for complex enrichment pipelines; (3) Your pipeline has complex multi-stream joins - Flink's interval joins and temporal table joins are more powerful than what Faust offers; (4) You need non-Kafka sources - Flink reads from Kinesis, Iceberg, JDBC natively; (5) You need Complex Event Processing - detecting patterns across sequences of events (fraud ring detection, user journey analysis). For straightforward rolling aggregations with state under 100 GB in Python, Faust is simpler and operationally cheaper.

Q: How does exactly-once work differently in Kafka Streams versus Flink?

A: In Kafka Streams, exactly-once is implemented entirely within Kafka's transaction API. A single Kafka transaction atomically covers: advancing the input consumer offset, writing the state store changelog update, and writing the output record to the output topic. Either all three happen or none do. No external coordinator is needed - Kafka manages it. The limitation: exactly-once only covers within-Kafka operations. Writes to Redis or PostgreSQL are outside the transaction boundary.

In Flink, exactly-once is implemented via distributed checkpoints plus a two-phase commit (2PC) protocol with sinks. Flink periodically injects checkpoint barriers into the data stream. Operators process the barrier, flush their in-flight state to the checkpoint store, then acknowledge. Once all operators acknowledge, the checkpoint is complete and offsets advance. For Kafka sinks, Flink uses Kafka transactions - it opens a transaction, writes output records, and commits only when the checkpoint succeeds. For non-Kafka sinks (databases, file systems), Flink uses 2PC where the sink participates in the commit protocol. This means Flink can guarantee exactly-once into any sink that supports transactions, not just Kafka.


Summary Decision Table

Kafka StreamsFaustApache Flink
LanguageJava / Kotlin onlyPython (asyncio)Java / Scala / PyFlink
InfrastructureNone (library)None (library)Cluster required
Max state~500 GB / node~100 GB / nodeMulti-TB distributed
Async I/ONoYes (asyncio)Yes (native operator)
WindowingGoodBasic-to-goodExcellent
Exactly-onceYes (Kafka only)Yes (Kafka only)Yes (any transactional sink)
Multiple sourcesKafka onlyKafka onlyKafka, Kinesis, Iceberg, JDBC, files
Best forJava teams, zero infraPython teams, state under 100 GBComplex pipelines, large state, Async I/O

Recommended path for most Python ML teams: Start with Faust. Migrate to managed Flink when state exceeds 100 GB or when Async I/O becomes necessary. Avoid self-managed Flink until you have dedicated platform engineering capacity.


Advanced Patterns

Faust Agent Chaining - Multi-Stage Feature Pipelines

Real ML feature pipelines are rarely single-stage. You often need to enrich events before computing features, compute intermediate aggregates before final features, or fan out to multiple downstream topics.

import faust
from datetime import timedelta

app = faust.App("multi-stage-features", broker="kafka://kafka:9092")

# Stage 1: Raw payment events
raw_payments = app.topic("payments", value_type=dict)

# Stage 2: Enriched payment events (after merchant lookup)
enriched_payments = app.topic("payments-enriched", value_type=dict)

# Stage 3: Feature output
user_features = app.topic("user-features", value_type=dict)

# State for enrichment
merchant_risk_scores = app.Table("merchant_risk", default=float)

# STAGE 1: Enrich payments with merchant risk score
@app.agent(raw_payments, sink=[enriched_payments])
async def enrich_payments(payments):
async for payment in payments:
merchant_id = payment["merchant_id"]
# Add merchant risk score from in-process state
risk_score = merchant_risk_scores[merchant_id]
yield {
**payment,
"merchant_risk_score": risk_score,
}

# STAGE 2: Compute per-user features from enriched events
spend_window = (
app.Table("spend_1h", default=float)
.hopping(3600, step=300, expires=timedelta(hours=2))
)
risk_window = (
app.Table("avg_risk_1h", default=float)
.hopping(3600, step=300, expires=timedelta(hours=2))
)

@app.agent(enriched_payments, sink=[user_features])
async def compute_features(payments):
async for payment in payments.group_by(lambda p: p["user_id"]):
uid = payment["user_id"]
spend_window[uid] += payment["amount"]
risk_window[uid] = (
risk_window[uid].current() * 0.9 # Exponential decay
+ payment["merchant_risk_score"] * 0.1
)
yield {
"user_id": uid,
"spend_1h": spend_window[uid].current(),
"weighted_risk_1h": risk_window[uid].current(),
"computed_at": payment["event_timestamp"],
}

This is the pattern that makes Flink uniquely powerful for ML pipelines that call external services. Without Async I/O, each external call blocks a thread slot, limiting throughput to threads × 1/latency. With Async I/O, you can have hundreds of concurrent in-flight requests with a fraction of the thread count.

from pyflink.datastream.functions import AsyncFunction, ResultFuture
from pyflink.datastream import AsyncDataStream
import aiohttp
import asyncio
import json

class EnrichWithRiskService(AsyncFunction):
"""
Calls an external risk scoring service asynchronously.
Without Async I/O: 1 thread per in-flight request = 100ms latency = 10 req/s/thread
With Async I/O: 100 in-flight requests, 1 thread = 1000 req/s/thread
"""

async def open(self, runtime_context):
self.session = aiohttp.ClientSession(
timeout=aiohttp.ClientTimeout(total=2.0) # 2 second timeout
)

async def async_invoke(self, event: str, result_future: ResultFuture):
try:
parsed = json.loads(event)
async with self.session.post(
"http://risk-service:8080/score",
json={"user_id": parsed["user_id"], "amount": parsed["amount"]},
) as resp:
risk_data = await resp.json()
parsed["risk_score"] = risk_data["score"]
result_future.complete([json.dumps(parsed)])
except asyncio.TimeoutError:
# Timeout: pass through with default risk score
parsed = json.loads(event)
parsed["risk_score"] = 0.5 # neutral default
result_future.complete([json.dumps(parsed)])
except Exception as e:
result_future.complete_exceptionally(e)

async def close(self):
await self.session.close()


# Wire Async I/O into the Flink pipeline
enriched_stream = AsyncDataStream.unordered_wait(
payments_stream,
EnrichWithRiskService(),
timeout=3000, # 3 second per-event timeout (ms)
capacity=100, # Max 100 concurrent in-flight requests
)

For teams comfortable with SQL, Flink SQL is often the fastest path to production for standard aggregation features. It stays entirely in the JVM (no Python serialization overhead) and generates optimized execution plans.

from pyflink.table import EnvironmentSettings, TableEnvironment

settings = EnvironmentSettings.new_instance().in_streaming_mode().build()
t_env = TableEnvironment.create(settings)

# Register Kafka source as a table
t_env.execute_sql("""
CREATE TABLE payments (
user_id STRING,
merchant_id STRING,
amount DOUBLE,
event_time TIMESTAMP(3),
WATERMARK FOR event_time AS event_time - INTERVAL '10' SECOND
) WITH (
'connector' = 'kafka',
'topic' = 'payments',
'properties.bootstrap.servers' = 'kafka:9092',
'format' = 'json'
)
""")

# Register Redis sink (using custom connector or JDBC)
t_env.execute_sql("""
CREATE TABLE user_features_sink (
user_id STRING,
spend_5min DOUBLE,
txn_count_5min BIGINT,
window_end TIMESTAMP(3)
) WITH (
'connector' = 'upsert-kafka',
'topic' = 'user-features-sql',
'properties.bootstrap.servers' = 'kafka:9092',
'key.format' = 'json',
'value.format' = 'json'
)
""")

# The feature computation - pure SQL, runs entirely in JVM
t_env.execute_sql("""
INSERT INTO user_features_sink
SELECT
user_id,
SUM(amount) AS spend_5min,
COUNT(*) AS txn_count_5min,
TUMBLE_END(event_time, INTERVAL '5' MINUTE) AS window_end
FROM payments
GROUP BY
user_id,
TUMBLE(event_time, INTERVAL '5' MINUTE)
""")
# Configure Flink to export metrics to Prometheus
# Add to flink-conf.yaml:
# metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter
# metrics.reporter.prom.port: 9249

# Key Flink metrics to monitor for ML feature pipelines:
FLINK_METRICS_TO_WATCH = {
# Throughput
"flink_taskmanager_job_task_numRecordsInPerSecond": "Records processed per second",
"flink_taskmanager_job_task_numRecordsOutPerSecond": "Records emitted per second",

# Latency
"flink_taskmanager_job_latency_source_id_operator_id_subtask_index_latency": "End-to-end latency",

# Watermark lag (how far behind event time is from processing time)
"flink_taskmanager_job_task_currentInputWatermark": "Current watermark (event time)",

# Late records (events arriving after watermark - dropped or handled)
"flink_taskmanager_job_task_numLateRecordsDropped": "Records dropped due to lateness",

# Checkpoint health
"flink_jobmanager_job_lastCheckpointDuration": "Last checkpoint duration (ms)",
"flink_jobmanager_job_lastCheckpointSize": "Last checkpoint size (bytes)",
"flink_jobmanager_job_numberOfFailedCheckpoints": "Failed checkpoints (cumulative)",

# Back-pressure (0=no back-pressure, 1=fully back-pressured)
"flink_taskmanager_job_task_isBackPressured": "Task back-pressure indicator",
}

Many teams start with Faust and eventually need to migrate to Flink. Here are the concrete signals that it is time to make the move, not vague thresholds:

Signal 1: State size approaching 80 GB per worker node. Faust's RocksDB is limited by local disk. When you are consistently at 80%+ disk utilization on your Faust worker nodes, you are 1-2 months from hitting the wall. Start Flink migration now.

Signal 2: You need to call an external service on every event and throughput is below 5,000 events/second. Faust is asyncio-native so external calls are non-blocking, but the Python event loop has a throughput ceiling. Flink's Async I/O operator handles this more efficiently for high-volume pipelines.

Signal 3: You have a join between two Kafka topics with different partitioning. Faust has limited join support. Flink's interval joins and temporal table joins are purpose-built for this pattern in ML pipelines (enriching events with a lookup table that itself changes over time).

Signal 4: You need to read from a non-Kafka source. Faust is Kafka-only. If you need to join a Kafka stream with data from an Iceberg table, a Kinesis stream, or a JDBC source, Flink is required.

Signal 5: Your checkpoint restore time exceeds your recovery SLA. If Faust takes 20 minutes to rebuild its RocksDB state from the changelog topic, and your feature freshness SLA is 60 seconds, you cannot meet SLA during failure recovery. Flink's incremental checkpoints restore much faster for large state.

© 2026 EngineersOfAI. All rights reserved.