Module 04: Real-Time ML Systems

Batch ML is forgiving. When you score a million records overnight, running a few percent slower than expected is irrelevant. Real-time ML is unforgiving. When an ad auction must complete in 10 milliseconds, a fraud check must finish in 50 milliseconds, or a recommendation must appear before the user scrolls past it, every microsecond is a design decision.

Real-time ML is not just fast batch ML. It requires a different architecture, different feature engineering patterns, different failure modes, and different monitoring. The system must handle unpredictable traffic bursts, maintain model freshness as the world changes, process streams of events with guaranteed ordering, and produce predictions at latencies that the human eye barely notices.

This module covers the full stack of real-time ML - from the overall architecture of a 1M QPS inference system to the specific tricks that take inference from 8ms to 0.6ms for latency-critical paths.

Module Map

Lessons at a Glance

#	Lesson	Core Question
01	Real-Time Inference Design	How do you architect ML at 1M QPS with a 10ms SLA?
02	Online Learning	How do you adapt a fraud model to new attack patterns in hours?
03	Streaming Inference	How do you run ML on 50K events/second reliably?
04	Low-Latency Optimization	What engineering gets you from 8ms to 0.6ms?
05	Event-Driven ML	How do you build ML systems that are decoupled, replayable, and debuggable?
06	Temporal Features	How do you build 200+ time-based features without training/serving skew?

Key Capabilities You Will Build

By the end of this module you will be able to:

Design ML inference systems for 1M QPS with sub-10ms p99 latency
Implement online learning systems that adapt to concept drift within hours
Build Kafka-integrated inference pipelines with exactly-once processing guarantees
Apply NUMA-aware, lock-free, zero-copy optimizations to reach sub-millisecond inference
Architect event-driven ML systems with CQRS and event replay for debugging
Engineer temporal features (session features, sliding windows, recency weights) without training/serving skew
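
Several of these targets are stated as p99 latency, meaning the value that 99% of requests finish at or below. As a rough illustration of why the tail is tracked rather than the average, the sketch below computes nearest-rank percentiles over a made-up list of latencies. The `percentile` helper and the sample values are assumptions for this example, not part of any lesson's code.

```python
import math

def percentile(samples_ms, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples_ms)
    # Integer arithmetic before the division keeps the rank exact for whole pct values.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]

# Hypothetical latencies in milliseconds: 98 fast requests and 2 slow ones.
latencies_ms = [1.0] * 98 + [9.0, 40.0]

p50 = percentile(latencies_ms, 50)   # 1.0
p99 = percentile(latencies_ms, 99)   # 9.0
mean = sum(latencies_ms) / len(latencies_ms)
print(f"mean={mean:.2f}ms p50={p50}ms p99={p99}ms")
```

In this made-up sample the mean and median both look comfortable, while the p99 sits close to a 10ms limit and the single worst request is far beyond it.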

Why Real-Time ML is Different

The fundamental challenge of real-time ML is that you cannot wait. In batch ML, a slow model can be compensated for with more hardware. In real-time ML, a 200ms prediction is as bad as no prediction for many use cases - the user has already decided, the bid window has closed, the fraudulent transaction has already gone through.
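
One common way to act on "a late prediction is as bad as none" is to give each prediction call a hard deadline and return a precomputed fallback when the model misses it. The sketch below is a minimal illustration under stated assumptions, not a production pattern: `predict_with_deadline`, `fast_model`, `slow_model`, and the fallback values are hypothetical names chosen for this example. Note that the thread-pool timeout only stops waiting for the result; it does not interrupt a call that is already running.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Shared worker pool; the size is an arbitrary choice for this sketch.
_EXECUTOR = ThreadPoolExecutor(max_workers=4)

def predict_with_deadline(model_fn, features, deadline_s, fallback):
    """Return (prediction, True) if model_fn finishes within deadline_s seconds,
    otherwise (fallback, False)."""
    future = _EXECUTOR.submit(model_fn, features)
    try:
        return future.result(timeout=deadline_s), True
    except FutureTimeout:
        # cancel() only helps if the task has not started yet;
        # a call that is already running keeps going in the background.
        future.cancel()
        return fallback, False

def fast_model(features):
    """Hypothetical model that answers almost immediately."""
    return sum(features) / len(features)

def slow_model(features):
    """Hypothetical model that takes 200ms, far over a 50ms budget."""
    time.sleep(0.2)
    return sum(features) / len(features)

if __name__ == "__main__":
    print(predict_with_deadline(fast_model, [0.2, 0.4], 0.05, fallback=0.0))
    print(predict_with_deadline(slow_model, [0.2, 0.4], 0.05, fallback=0.0))
```

The returned flag matters as much as the value: counting how often the fallback is served gives a direct measure of how often the deadline is being missed.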

Real-time ML forces every layer of the system to be explicitly designed for latency:

The model architecture must be latency-bounded, not just accuracy-optimized
Feature computation must happen in microseconds, not minutes
The serving infrastructure must eliminate every source of variable-latency overhead
The learning loop must close faster than the world changes
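
Designing every layer for latency usually starts with writing down a per-stage budget and checking it against the end-to-end limit. The sketch below assumes a 10ms limit, as in the ad auction example; the stage names and every number in it are hypothetical placeholders, not measurements from a real system. Summing per-stage budgets is also only a planning approximation, since tail latencies of separate stages do not add up exactly.

```python
SLA_MS = 10.0  # end-to-end limit, taken from the ad auction example

# Hypothetical per-stage budgets in milliseconds.
BUDGET_MS = {
    "network_and_deserialization": 2.0,
    "feature_lookup": 3.0,
    "model_inference": 3.5,
    "postprocessing_and_response": 1.0,
}

def check_budget(budget_ms, sla_ms):
    """Return (total, headroom, fits) for a per-stage latency budget."""
    total = sum(budget_ms.values())
    headroom = sla_ms - total
    return total, headroom, headroom >= 0

def stages_over_budget(measured_ms, budget_ms):
    """Return the stages whose measured latency exceeds their budget, sorted by name."""
    return sorted(
        stage for stage, limit in budget_ms.items()
        if measured_ms.get(stage, 0.0) > limit
    )

if __name__ == "__main__":
    total, headroom, fits = check_budget(BUDGET_MS, SLA_MS)
    print(f"planned total={total}ms headroom={headroom}ms fits={fits}")

    # Hypothetical measurements: feature lookup has drifted past its budget.
    measured = {
        "network_and_deserialization": 1.8,
        "feature_lookup": 4.2,
        "model_inference": 3.1,
        "postprocessing_and_response": 0.7,
    }
    print("over budget:", stages_over_budget(measured, BUDGET_MS))
```

Keeping the budget explicit turns a vague "it feels slow" into a per-stage question: which layer spent more than it was allowed.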

:::tip Prerequisites
Module 03 (Model Serving) is a direct prerequisite - this module assumes you understand batching, quantization, and compilation. Familiarity with Apache Kafka and stream processing concepts is helpful but not required; the relevant concepts are introduced within each lesson.
:::
