Module 2 - Data Infrastructure

Data infrastructure is the unsexy part of ML engineering that determines whether your models ever make it to production. A brilliant model trained on poorly engineered data is worse than a simple model trained on clean, well-understood data. This module covers the systems that store, process, and serve data at the scale that ML requires.

What This Module Covers

Lesson Map

#	Lesson	Key Concepts
01	Data Lake and Data Warehouse	Evolution from RDBMS to lakehouse; Parquet/ORC; cost at scale
02	Batch Processing with Spark	Driver/executor model; partitioning; broadcast joins; Delta Lake integration
03	Stream Processing with Kafka	Topics, consumer groups; Flink on Kafka; exactly-once semantics
04	Feature Store Architecture	Dual-store design; point-in-time retrieval; Feast, Tecton, Hopsworks
05	Data Quality and Validation	Great Expectations; data contracts; PSI; pipeline monitoring
06	Data Versioning with Delta Lake	ACID on object storage; time travel; schema evolution; MERGE
07	Lakehouse Architecture	Medallion pattern; Iceberg vs Delta vs Hudi; DuckDB, Trino

Key Themes

Data infrastructure is ML infrastructure. Most ML failures that aren't model architecture problems trace back to data infrastructure. This module is about preventing those failures by understanding the systems deeply.

The right tool for the right job. Data lakes, warehouses, and lakehouses each occupy a different point in the cost/capability trade-off space. Batch and stream processing have different use cases. Feature stores solve specific problems that general databases don't. Understanding which tool to reach for - and when - is the core skill.

Correctness before performance. A fast pipeline that produces wrong features is worse than a slow pipeline that produces correct ones. Data quality, point-in-time correctness, and data versioning are not optional - they are what make the difference between a model that works in production and one that looks good offline.
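
To make point-in-time correctness concrete, here is a minimal sketch in plain Python. It is an illustration, not how any particular feature store implements retrieval. The function name point_in_time_lookup, the feature purchase_count_history, and the integer timestamps are all hypothetical. For each label event, the lookup returns the latest feature value recorded at or before that event, so no future information leaks into the training row.

```python
from bisect import bisect_right


def point_in_time_lookup(feature_history, as_of):
    """Return the latest feature value recorded at or before `as_of`.

    `feature_history` is a list of (timestamp, value) pairs sorted by
    timestamp. Returns None if no value existed yet at `as_of`.
    """
    timestamps = [ts for ts, _ in feature_history]
    # Index of the first entry strictly after `as_of`.
    idx = bisect_right(timestamps, as_of)
    if idx == 0:
        return None  # the feature had not been computed yet
    return feature_history[idx - 1][1]


# Hypothetical feature: a user's running purchase count, as
# (timestamp, value) pairs. Timestamps are plain integers for brevity.
purchase_count_history = [(10, 1), (20, 3), (30, 7)]

# Hypothetical label events (times at which a prediction would be made).
label_events = [5, 20, 25, 40]

# Point-in-time correct: each row sees only what was known at its own time.
training_rows = [
    (t, point_in_time_lookup(purchase_count_history, t)) for t in label_events
]

# Leaky alternative: joining every row against the newest value uses
# information from the future for the earlier events.
latest_value = purchase_count_history[-1][1]
leaky_rows = [(t, latest_value) for t in label_events]

print(training_rows)
print(leaky_rows)
```

The leaky join gives every row the final value, including the event at time 5, when the feature did not exist yet. A model trained on those rows can look good offline and then degrade in production, where only past values are available.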

Prerequisites

Module 1 (Systems Foundations) - especially lessons 01, 03, and 06
Basic Python and SQL
Familiarity with the concept of distributed computing (parallel processing)
