Module 2 - Data Infrastructure

Data infrastructure is the unsexy part of ML engineering that determines whether your models ever make it to production. A brilliant model trained on poorly engineered data is worse than a simple model trained on clean, well-understood data. This module covers the systems that store, process, and serve data at the scale that ML requires.

What This Module Covers

Lesson Map

#   Lesson                           Key Concepts
01  Data Lake and Data Warehouse     Evolution from RDBMS to lakehouse; Parquet/ORC; cost at scale
02  Batch Processing with Spark      Driver/executor model; partitioning; broadcast joins; Delta Lake integration
03  Stream Processing with Kafka     Topics, consumer groups; Flink on Kafka; exactly-once semantics
04  Feature Store Architecture       Dual-store design; point-in-time retrieval; Feast, Tecton, Hopsworks
05  Data Quality and Validation      Great Expectations; data contracts; PSI; pipeline monitoring
06  Data Versioning with Delta Lake  ACID on object storage; time travel; schema evolution; MERGE
07  Lakehouse Architecture           Medallion pattern; Iceberg vs Delta vs Hudi; DuckDB, Trino

Key Themes

Data infrastructure is ML infrastructure. Most ML failures that aren't model architecture problems are data infrastructure problems. This module is about preventing those failures by understanding the systems deeply.

The right tool for the right job. Data lakes, warehouses, and lakehouses each occupy a different point in the cost/capability trade-off space. Batch and stream processing have different use cases. Feature stores solve specific problems that general databases don't. Understanding which tool to reach for - and when - is the core skill.

Correctness before performance. A fast pipeline that produces wrong features is worse than a slow pipeline that produces correct ones. Data quality, point-in-time correctness, and data versioning are not optional - they are what make the difference between a model that works in production and one that looks good offline.
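To make "point-in-time correctness" concrete, here is a minimal sketch in plain Python. It assumes a hypothetical per-user feature history and a hypothetical helper named feature_as_of; neither comes from a real feature store API. The idea is that a training row for an event at time t may only use feature values recorded at or before t, so no information from the future leaks into training.

```python
from bisect import bisect_right

# Hypothetical feature history for one user: (timestamp, value), sorted by timestamp.
feature_history = [
    (1, 10.0),
    (5, 12.5),
    (9, 20.0),
]


def feature_as_of(history, event_time):
    """Return the latest value recorded at or before event_time, or None if none exists."""
    timestamps = [ts for ts, _ in history]
    # bisect_right counts the entries with timestamp <= event_time.
    idx = bisect_right(timestamps, event_time)
    if idx == 0:
        return None  # the feature did not exist yet at event_time
    return history[idx - 1][1]


# Hypothetical label event times; each one becomes a training row.
label_events = [0, 5, 7, 12]
training_rows = [(t, feature_as_of(feature_history, t)) for t in label_events]
```

A naive join that always takes the latest value would give every row 20.0, including events that happened before that value existed. That is the kind of pipeline that looks good offline and fails in production.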

Prerequisites

  • Module 1 (Systems Foundations) - especially lessons 01, 03, and 06
  • Basic Python and SQL
  • Familiarity with the concept of distributed computing (parallel processing)
© 2026 EngineersOfAI. All rights reserved.