Module 05 - Feature Stores
The Infrastructure Layer That Makes ML Reproducible
Every ML model is only as reliable as the features it consumes. But features are computed by code that different people write at different times, under different assumptions. Without a shared system to define, store, and serve features consistently, the same logical concept - "how many purchases did this user make in the last 30 days?" - ends up implemented five different ways across five different teams. Models disagree because their inputs disagree. Debugging means reverse-engineering each implementation. Deploying a new model means reimplementing features from scratch.
Feature stores are the infrastructure that prevents this. They are the layer between raw data and model training/serving that enforces a single, version-controlled, time-correct definition of every feature. They let you train a model on historical data and serve that model in production using identical feature computation logic. They let you reuse work across teams instead of duplicating it.
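To make "identical feature computation logic" concrete, here is a minimal sketch of the 30-day purchase count defined once and evaluated from both a training path and a serving path. The function name `purchases_last_30d`, the window boundaries, and the sample data are assumptions for illustration, not the API of any particular feature store.

```python
from datetime import datetime, timedelta


def purchases_last_30d(purchase_times, as_of):
    """Count purchases in the 30 days up to and including `as_of`.

    Hypothetical single source of truth for the feature: both the
    training path and the serving path call this same function.
    """
    window_start = as_of - timedelta(days=30)
    return sum(1 for t in purchase_times if window_start < t <= as_of)


# Made-up purchase history for one user.
history = [
    datetime(2024, 1, 5),
    datetime(2024, 1, 20),
    datetime(2024, 2, 10),
    datetime(2024, 3, 1),
]

# Training path: evaluate the feature as of a historical label timestamp.
training_value = purchases_last_30d(history, as_of=datetime(2024, 2, 15))

# Serving path: evaluate the same definition as of request time.
serving_value = purchases_last_30d(history, as_of=datetime(2024, 3, 20))

print(training_value, serving_value)
```

The only thing that differs between the two calls is the `as_of` timestamp. Skew typically creeps in when the two paths instead carry separate implementations that disagree on details such as whether the window boundary is inclusive.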
This module covers what you need to understand, design, and operate a feature store in production - from the origins of the idea in Uber's Michelangelo platform to the architectural decisions that determine whether your system survives at scale.
Module Roadmap
Lesson Table
| # | Lesson | Key Concepts | Read Time |
|---|---|---|---|
| 01 | Why Feature Stores Exist | Training-serving skew, feature duplication, point-in-time leakage, Uber Michelangelo | 20 min |
| 02 | Feature Store Architecture | Online store, offline store, feature registry, materialization engine, dual-write pattern | 25 min |
| 03 | Point-in-Time Joins | Temporal joins, label leakage, as-of queries, entity-timestamp pairs | 25 min |
| 04 | Feature Pipelines | Batch transformation, stream processing, on-demand computation, Lambda vs. Kappa | 20 min |
| 05 | Feature Registry & Governance | Feature discovery, ownership, versioning, lineage, deprecation | 20 min |
| 06 | Materialization & Freshness | Scheduled jobs, trigger-based materialization, backfill, SLA enforcement | 20 min |
| 07 | Feature Monitoring | Distribution drift, staleness alerts, null rate tracking, model feedback loops | 20 min |
Prerequisites
Before starting this module, you should be comfortable with:
- Module 01 - Data Engineering Foundations: batch vs. streaming, data lakes, warehouse patterns
- Module 02 - Batch Processing: Spark, partitioning, window functions, large-scale transformations
- Module 03 - Stream Processing: Kafka, Flink, event-time semantics, watermarks
- Module 04 - Data Quality: schema validation, anomaly detection, pipeline monitoring
:::tip If you are new to ML pipelines
You do not need to have trained ML models yourself. But you should understand that models are trained on historical datasets and then deployed to serve predictions on new data. The gap between these two contexts is exactly what feature stores exist to bridge.
:::
Key Concepts at a Glance
| Concept | One-Line Definition |
|---|---|
| Training-serving skew | Features computed differently at training time vs. serving time, causing models to behave unexpectedly in production |
| Offline store | Historical storage of feature values - columnar, queryable, optimized for training dataset assembly |
| Online store | Low-latency key-value store of current feature values - optimized for sub-10ms serving |
| Point-in-time join | Joining feature values to training labels using only information that would have been available at the label's timestamp |
| Feature registry | Metadata catalog: what features exist, how they're computed, who owns them, what depends on them |
| Materialization | The process of running feature computation logic and writing results to offline and online stores |
| Feature drift | Statistical shift in a feature's distribution between training time and serving time |
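The point-in-time join defined in the table can be sketched in a few lines of plain Python. This is an illustrative sketch under assumed data shapes (tuples of entity ID, timestamp, and value) with a hypothetical helper name; production feature stores usually run the equivalent as-of join in a warehouse or in Spark.

```python
from bisect import bisect_right
from datetime import datetime


def point_in_time_join(labels, feature_rows):
    """Attach to each label the latest feature value known at the label's timestamp.

    labels:       list of (entity_id, label_ts, label)
    feature_rows: list of (entity_id, feature_ts, value)
    Feature rows stamped later than the label are never used.
    """
    # Group feature rows per entity, ordered by timestamp.
    by_entity = {}
    for entity_id, ts, value in sorted(feature_rows, key=lambda r: (r[0], r[1])):
        timestamps, values = by_entity.setdefault(entity_id, ([], []))
        timestamps.append(ts)
        values.append(value)

    joined = []
    for entity_id, label_ts, label in labels:
        timestamps, values = by_entity.get(entity_id, ([], []))
        # Number of feature rows with feature_ts <= label_ts.
        idx = bisect_right(timestamps, label_ts)
        feature = values[idx - 1] if idx > 0 else None
        joined.append((entity_id, label_ts, feature, label))
    return joined


# Made-up feature history and labels for two users.
features = [
    ("u1", datetime(2024, 1, 1), 3),
    ("u1", datetime(2024, 1, 10), 5),
    ("u1", datetime(2024, 1, 20), 9),
    ("u2", datetime(2024, 1, 15), 1),
]
labels = [
    ("u1", datetime(2024, 1, 12), 1),
    ("u1", datetime(2024, 1, 25), 0),
    ("u2", datetime(2024, 1, 5), 1),
]

training_rows = point_in_time_join(labels, features)
for row in training_rows:
    print(row)
```

The label for `u1` on January 12 receives the value written on January 10, not the later January 20 value, and `u2` receives `None` because no feature value existed yet at its label timestamp. Treating a feature row stamped exactly at the label timestamp as available is a design choice in this sketch; stricter setups require the feature to be strictly earlier.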
What You Will Be Able to Do After This Module
- Explain what training-serving skew is, why it happens, and how feature stores prevent it
- Design a feature store architecture with online store, offline store, registry, and materialization engine
- Implement point-in-time correct training datasets that avoid label leakage
- Build feature pipelines using batch, streaming, and on-demand computation patterns
- Define governance policies: feature ownership, versioning, lineage tracking, and deprecation
- Configure materialization schedules and monitor feature freshness against SLA targets
- Detect and respond to feature drift before it silently degrades model performance
