Skip to main content

Module 3: Stream Processing for Real-Time AI

Batch processing keeps ML systems honest at scale. But batch processing has a ceiling: the minimum latency you can achieve is bounded by your job frequency, and for entire categories of ML problems - fraud detection, real-time recommendations, anomaly detection, live personalization - that ceiling is too low. A fraudster can drain an account in 45 minutes. A batch job that runs every 4 hours is not fraud detection; it is fraud archaeology.

This module is about the infrastructure that closes that gap. Stream processing treats data as an infinite, unbounded flow of events rather than a finite, bounded dataset. It enables you to compute features, serve models, and evaluate predictions on data that is seconds or milliseconds old rather than hours. The tools in this module - Apache Kafka, Apache Flink, Kafka Streams - are the backbone of real-time ML at every company operating at meaningful scale: Netflix recommendation serving, Uber surge pricing, DoorDash fraud detection, LinkedIn feed ranking.

By the end of this module, you will understand not just how to use these tools, but when to use them, what they cost in operational complexity, and how to make them reliable enough to trust with production ML workloads.


Lesson Progression

Lessons 01 and 02 build the conceptual and architectural foundation. Lessons 03 through 06 are where you learn to use Kafka and Flink as an ML practitioner - not as a generic data engineer, but specifically for the patterns that arise in feature pipelines, model serving, and feedback loops. Lessons 07 and 08 address the hardest problems: maintaining feature freshness guarantees under real load, and making streaming systems reliable enough that on-call engineers can sleep.


Lessons at a Glance

#TitleLevelKey Concept
01Streaming ConceptsBeginnerEvent time vs processing time, watermarks, windows, delivery semantics, backpressure
02Apache Kafka ArchitectureBeginnerDistributed commit log, partitions, ISR replication, consumer groups, compacted topics
03Kafka for ML SystemsIntermediateFeature event pipelines, Schema Registry, model serving integration, feedback loops
04Apache Flink FundamentalsIntermediateStateful operators, checkpointing, savepoints, exactly-once with Kafka
05Stream Processing PatternsIntermediateStream-stream joins, enrichment, late data strategies, out-of-order handling
06Kafka Streams vs FlinkIntermediateEmbedded vs cluster topology, DSL comparison, operational tradeoffs for ML teams
07Real-Time Feature ComputationAdvancedOnline feature computation, point-in-time correctness, feature freshness SLOs
08Streaming Pipeline ReliabilityAdvancedEnd-to-end exactly-once, lag alerting, chaos engineering, graceful degradation

What You Will Build

  • A Kafka producer and consumer pipeline in Python (confluent-kafka) that handles feature events with manual offset commits, consumer group rebalancing, and consumer lag monitoring
  • A stateful Flink job (PyFlink) that computes real-time fraud detection features - velocity counts, rolling averages, session aggregations - with exactly-once semantics via Kafka checkpoints
  • A stream-table join pattern that enriches live transaction events with user profile data, demonstrating the stream-table duality at production scale
  • A reliability runbook: lag-based SLO alerting, dead-letter queue handling, and a chaos test protocol that verifies your pipeline recovers correctly from broker failure and consumer crash

Prerequisites

This module builds directly on Module 1 (Foundations) and Module 2 (Batch Processing). Before starting, you should be comfortable with:

  • The medallion architecture and why data flows from raw events to curated features
  • Apache Spark fundamentals - DAG execution, partitioning, the difference between transformations and actions
  • Python at the intermediate level - classes, decorators, context managers, async basics
  • SQL window functions - you will see their streaming equivalents in Lessons 04 and 05
  • What a feature store is and why point-in-time correctness matters (Module 1, Lesson 07)
note

Stream processing adds operational complexity that batch processing does not have. A Spark job that fails can be rerun. A Kafka consumer with a bug may commit offsets on bad data and lose those messages permanently. The bar for correctness is higher in streaming, and this module reflects that: the later lessons cover failure modes and reliability with the same depth as happy-path functionality.

If you have not completed Module 2, complete at least Lessons 01 and 02 (Spark Architecture and Spark for ML Pipelines) before proceeding. The mental model of distributed execution - partitions, parallelism, fault tolerance - transfers directly to Kafka and Flink.


How This Module Connects to the Rest of the Course

Stream processing is not a replacement for batch processing - it is a complement. The two paradigms coexist in every mature ML platform, and understanding when to use each is as important as understanding how to use either.

  • Module 2 (Batch Processing) is the prerequisite and the contrast. After this module, you will have a precise mental model of when batch is sufficient (model training, nightly feature backfills, weekly reporting) and when streaming is necessary (fraud detection, real-time recommendations, anomaly detection). The answer is not "streaming is always better" - streaming is more expensive to operate. Use it when batch latency is the actual bottleneck.
  • Module 5 (Feature Stores) is the destination for the features computed in this module. Real-time feature computation (Lesson 07) writes to an online feature store. The feature store is what the model serving layer reads at inference time. You cannot understand feature stores without understanding how features get into them.
  • Module 10 (Real-Time Feature Engineering) extends the patterns in this module with advanced techniques: feature pipelines that join streaming features with batch features, point-in-time correct training dataset generation from streaming history, and online-offline consistency validation.
  • Module 9 (Data Observability) monitors the health of the pipelines you build in this module. Streaming pipeline reliability (Lesson 08) introduces lag alerting and chaos testing; Module 9 extends this to schema validation, statistical drift detection, and SLO tracking across the full data platform.

The skills in this module are among the most in-demand in ML engineering roles at any company running real-time ML. Kafka and Flink expertise consistently appears in job descriptions for senior ML engineers, AI platform engineers, and data engineers at companies that run fraud detection, recommendation systems, or any ML system with sub-minute feature freshness requirements.

© 2026 EngineersOfAI. All rights reserved.