Module 04: Real-Time ML Systems
Batch ML is forgiving. If you score a million records overnight, running a few percent slower than expected is irrelevant. Real-time ML is unforgiving. When an ad auction must complete in 10 milliseconds, a fraud check must finish in 50 milliseconds, or a recommendation must appear before the user's scroll slows down, every microsecond is a design decision.
Real-time ML is not just fast batch ML. It requires a different architecture, different feature engineering patterns, different failure modes, and different monitoring. The system must handle unpredictable traffic bursts, maintain model freshness as the world changes, process streams of events with guaranteed ordering, and produce predictions at latencies that the human eye barely notices.
This module covers the full stack of real-time ML - from the overall architecture of a 1M QPS inference system to the specific tricks that take inference from 8ms to 0.6ms for latency-critical paths.
Module Map
Lessons at a Glance
| # | Lesson | Core Question |
|---|---|---|
| 01 | Real-Time Inference Design | How do you architect ML at 1M QPS with a 10ms SLA? |
| 02 | Online Learning | How do you adapt a fraud model to new attack patterns in hours? |
| 03 | Streaming Inference | How do you run ML on 50K events/second reliably? |
| 04 | Low-Latency Optimization | What engineering gets you from 8ms to 0.6ms? |
| 05 | Event-Driven ML | How do you build ML systems that are decoupled, replayable, and debuggable? |
| 06 | Temporal Features | How do you build 200+ time-based features without training/serving skew? |
Key Capabilities You Will Build
By the end of this module you will be able to:
- Design ML inference systems for 1M QPS with sub-10ms p99 latency
- Implement online learning systems that adapt to concept drift within hours
- Build Kafka-integrated inference pipelines with exactly-once processing guarantees
- Apply NUMA-aware, lock-free, zero-copy optimizations to reach sub-millisecond inference
- Architect event-driven ML systems with CQRS and event replay for debugging
- Engineer temporal features (session features, sliding windows, recency weights) without training/serving skew
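
As a small illustration of the last capability, here is a minimal sketch of a sliding-window count feature. The class name `SlidingWindowCount`, the `replay` helper, and the 60-second window are hypothetical choices made for this example, and the code assumes events arrive in event-time order. The point it tries to show is one common way to reduce training/serving skew: both the offline and the online path run the same feature code, driven by event timestamps rather than wall-clock time.

```python
from collections import deque


class SlidingWindowCount:
    """Counts events in the trailing `window_s` seconds, keyed by event time.

    Assumes events are fed in non-decreasing timestamp order.
    """

    def __init__(self, window_s: float) -> None:
        self.window_s = window_s
        self._timestamps = deque()

    def update(self, event_ts: float) -> int:
        """Record one event; return the count in the window ending at event_ts."""
        self._timestamps.append(event_ts)
        # Evict events that have fallen out of the window (ts <= event_ts - window_s).
        cutoff = event_ts - self.window_s
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps)


def replay(event_timestamps, window_s):
    """Offline path: replay historical events through the same class used online."""
    feature = SlidingWindowCount(window_s)
    return [feature.update(ts) for ts in event_timestamps]


# Illustrative event times in seconds (made up for this example).
events = [0.0, 10.0, 50.0, 61.0, 120.0]

# Training path: batch replay of the event log.
training_values = replay(events, window_s=60.0)

# Serving path: the same class, fed one event at a time.
online = SlidingWindowCount(window_s=60.0)
serving_values = [online.update(ts) for ts in events]

print(training_values)  # [1, 2, 3, 3, 2]
print(serving_values)   # [1, 2, 3, 3, 2]
```

Because both paths share one implementation and depend only on event time, replaying the log reproduces the values the online system would have served, provided the ordering assumption holds.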
Why Real-Time ML is Different
The fundamental challenge of real-time ML is that you cannot wait. In batch ML, a slow model can be compensated for with more hardware. In real-time ML, a 200ms prediction is as bad as no prediction for many use cases: the user has already decided, the bid window has closed, or the fraudulent transaction has already completed.
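
One way to make "a late prediction is as bad as no prediction" concrete is to enforce the deadline in code and fall back to a default when the model misses it. The sketch below is illustrative only: `predict_with_deadline`, `fast_model`, `slow_model`, the 50ms deadline, and the fallback score of 0.5 are all assumptions made for this example, and a thread pool is just one simple way to bound waiting time in Python.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Shared worker pool; the size is an arbitrary choice for this sketch.
_pool = ThreadPoolExecutor(max_workers=4)


def predict_with_deadline(model_fn, features, deadline_s, fallback):
    """Return (score, on_time). If the model misses the deadline, return the fallback."""
    future = _pool.submit(model_fn, features)
    try:
        return future.result(timeout=deadline_s), True
    except FutureTimeout:
        # Best effort only: a call that is already running cannot be interrupted,
        # but the caller stops waiting for it.
        future.cancel()
        return fallback, False


def fast_model(features):
    """Hypothetical model that answers immediately."""
    return 0.1


def slow_model(features):
    """Hypothetical model that takes 200ms, far past a 50ms deadline."""
    time.sleep(0.2)
    return 0.9


on_time_result = predict_with_deadline(fast_model, {}, deadline_s=0.05, fallback=0.5)
late_result = predict_with_deadline(slow_model, {}, deadline_s=0.05, fallback=0.5)

print(on_time_result)  # (0.1, True)
print(late_result)     # (0.5, False)
```

The slow model's score of 0.9 is never used: once the deadline passes, the caller proceeds with the fallback, which mirrors how a closed bid window or a completed transaction makes the late answer worthless.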
Real-time ML forces every layer of the system to be explicitly designed for latency:
- The model architecture must be latency-bounded, not just accuracy-optimized
- Feature computation must happen in microseconds, not minutes
- The serving infrastructure must eliminate every source of variable-latency overhead
- The learning loop must close faster than the world changes
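
Designing every layer for latency usually starts with an explicit budget: each stage gets a slice of the end-to-end SLA, and measurements are checked against those slices. The sketch below shows that arithmetic for a 10ms SLA. The stage names and all millisecond values are made-up numbers for illustration, not figures from this module, and `headroom_ms` and `over_budget` are hypothetical helpers.

```python
SLA_MS = 10.0

# Hypothetical per-stage budget in milliseconds (illustrative values only).
budget_ms = {
    "network_and_deserialization": 2.0,
    "feature_lookup": 3.0,
    "model_inference": 3.5,
    "postprocessing_and_response": 1.0,
}


def headroom_ms(budget, sla_ms):
    """Slack left after every stage spends its full budget."""
    return sla_ms - sum(budget.values())


def over_budget(measured, budget):
    """Names of stages whose measured latency exceeds their budget, sorted."""
    return sorted(
        stage for stage, ms in measured.items() if ms > budget.get(stage, 0.0)
    )


# Hypothetical measurements for one request path.
measured_ms = {
    "network_and_deserialization": 1.8,
    "feature_lookup": 4.2,
    "model_inference": 3.1,
    "postprocessing_and_response": 0.7,
}

print(headroom_ms(budget_ms, SLA_MS))       # 0.5
print(over_budget(measured_ms, budget_ms))  # ['feature_lookup']
```

In this made-up example the request still totals 9.8ms, under the SLA, but the feature lookup has overrun its slice. Tracking stages individually surfaces that regression before it consumes the remaining headroom.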
:::tip Prerequisites
Module 03 (Model Serving) is a direct prerequisite - this module assumes you understand batching, quantization, and compilation. Familiarity with Apache Kafka and stream processing concepts is helpful but not required; the relevant concepts are introduced within each lesson.
:::
