Module 2 - Data Infrastructure
Data infrastructure is the unsexy part of ML engineering that determines whether your models ever make it to production. A brilliant model trained on poorly engineered data is worse than a simple model trained on clean, well-understood data. This module covers the systems that store, process, and serve data at the scale that ML requires.
What This Module Covers
Lesson Map
| # | Lesson | Key Concepts |
|---|---|---|
| 01 | Data Lake and Data Warehouse | Evolution from RDBMS to lakehouse; Parquet/ORC; cost at scale |
| 02 | Batch Processing with Spark | Driver/executor model; partitioning; broadcast joins; Delta Lake integration |
| 03 | Stream Processing with Kafka | Topics, consumer groups; Flink on Kafka; exactly-once semantics |
| 04 | Feature Store Architecture | Dual-store design; point-in-time retrieval; Feast, Tecton, Hopsworks |
| 05 | Data Quality and Validation | Great Expectations; data contracts; Population Stability Index (PSI); pipeline monitoring |
| 06 | Data Versioning with Delta Lake | ACID on object storage; time travel; schema evolution; MERGE |
| 07 | Lakehouse Architecture | Medallion pattern; Iceberg vs Delta vs Hudi; DuckDB, Trino |
Key Themes
Data infrastructure is ML infrastructure. Most ML failures that aren't model architecture problems trace back to data infrastructure. This module is about preventing those failures by understanding the underlying systems deeply.
The right tool for the right job. Data lakes, warehouses, and lakehouses each occupy a different point in the cost/capability trade-off space. Batch and stream processing have different use cases. Feature stores solve specific problems that general databases don't. Understanding which tool to reach for - and when - is the core skill.
Correctness before performance. A fast pipeline that produces wrong features is worse than a slow pipeline that produces correct ones. Data quality, point-in-time correctness, and data versioning are not optional - they are what make the difference between a model that works in production and one that looks good offline.
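To make point-in-time correctness concrete, here is a minimal sketch in plain Python. It is not how any particular feature store implements retrieval; the `feature_history` data, the `point_in_time_value` helper, and the timestamps are all hypothetical and exist only for illustration. The idea is that each training row may only see the feature value that was already known at the label's timestamp, never a later one.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical feature history for a single entity, sorted by time:
# (timestamp at which the value became known, feature value).
feature_history = [
    (datetime(2024, 1, 1), 10.0),
    (datetime(2024, 1, 5), 12.5),
    (datetime(2024, 1, 9), 20.0),
]


def point_in_time_value(history, as_of):
    """Return the latest value known at or before `as_of`, or None.

    Assumes `history` is sorted by timestamp in ascending order.
    """
    timestamps = [ts for ts, _ in history]
    # bisect_right gives the count of entries with timestamp <= as_of.
    idx = bisect_right(timestamps, as_of)
    if idx == 0:
        return None  # nothing was known yet at this time
    return history[idx - 1][1]


# Hypothetical label timestamps for which training rows are built.
label_events = [
    datetime(2023, 12, 31),
    datetime(2024, 1, 5),
    datetime(2024, 1, 7),
    datetime(2024, 1, 10),
]

# Each row pairs a label time with the feature value visible at that time.
training_rows = [
    (t, point_in_time_value(feature_history, t)) for t in label_events
]

for label_time, value in training_rows:
    print(label_time.date(), value)
```

A naive join that always takes the most recent feature value would hand `20.0` to every row, leaking information from the future into training. That is the kind of bug that makes a model look good offline and fail in production.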
Prerequisites
- Module 1 (Systems Foundations) - especially lessons 01, 03, and 06
- Basic Python and SQL
- Familiarity with basic distributed computing concepts (for example, parallel processing)
