Module 06 - Data Lakehouse

The data lakehouse is the convergence point of two decades of parallel evolution. Data lakes gave us cheap, scalable storage for raw files but, without ACID guarantees, often turned into swamps. Data warehouses gave us reliable, queryable tables, but at proprietary cost and with ML teams largely locked out of the data. The lakehouse pattern unifies both: ACID transactions and schema enforcement on top of open file formats stored in object storage, with first-class support for both SQL analytics and ML model training on the same data.

This module covers the architecture, open table formats, query engines, governance patterns, and ML integration that define the modern lakehouse.

Module Roadmap

Lessons at a Glance

#	Lesson	Key Concepts	Est. Read Time
01	Lake vs Warehouse vs Lakehouse	Architecture tradeoffs, metadata layer, open formats	20 min
02	Apache Iceberg	Three-layer metadata, hidden partitioning, time travel, PyIceberg	25 min
03	Delta Lake	`_delta_log/`, MERGE INTO, Change Data Feed, Z-ordering	25 min
04	Apache Hudi	Copy-on-Write vs Merge-on-Read, incremental pull, timeline	20 min
05	Query Engines	Trino/Presto, Spark SQL, DuckDB, Athena, query planning	20 min
06	Data Governance	Unity Catalog, fine-grained access, data lineage, tagging	20 min
07	Lakehouse for ML	Feature stores, versioned training sets, experiment tracking	20 min

Prerequisites

Before this module you should be comfortable with:

Module 01 - Data Engineering Foundations: storage systems, object storage (S3/GCS/ADLS), file formats (Parquet, ORC, Avro)
Module 02 - Batch Processing: Spark architecture, DataFrame API, partitioning strategies
Module 03 - Stream Processing: event-time semantics, watermarks, streaming sinks
Module 04 - Data Modeling: star schema, dimensional modeling, slowly changing dimensions
Module 05 - Data Pipelines: orchestration, idempotency, pipeline reliability patterns

Key Concepts Reference

Concept	What It Means
ACID on object storage	Atomicity, Consistency, Isolation, Durability guarantees implemented over S3/GCS via transaction logs
Open table format	A specification (not a storage engine) that defines how metadata and data files are organized - Iceberg, Delta, Hudi
Time travel	Query historical versions of a table using timestamps or snapshot IDs
Schema evolution	Add, rename, or reorder columns without rewriting existing data files
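Hidden partitioning	Iceberg derives partition values from column transforms (for example, the day of a timestamp) and tracks them in metadata - queries filter on the source column and get partition pruning without referencing a separate partition column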
Copy-on-Write (CoW)	Row-level updates rewrite affected Parquet files - fast reads, slower writes
Merge-on-Read (MoR)	Updates written as delta files and merged with the base files at read time - fast writes, slower reads until compaction
Z-ordering	Multi-dimensional clustering of data for better data skipping and query performance
Metadata layer	The transaction log (Delta), manifest hierarchy (Iceberg), or timeline (Hudi) that tracks which data files belong to the table and what state the table is in
Lakehouse query engine	A query engine that understands open table formats - Trino, Spark SQL, DuckDB, Athena
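
To make the "ACID on object storage", "Metadata layer", and "Time travel" rows concrete, here is a minimal sketch of the idea in plain Python. It is a toy model, not the API of Iceberg, Delta, or Hudi: the `ToyTableLog` class, its methods, and the file names are all hypothetical, and a real format persists its log as files in object storage rather than in memory. The point is only that a table's state is whatever you get by replaying an append-only log of file additions and removals, so an older version is just a shorter replay.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTableLog:
    """Toy stand-in for a table format's transaction log (hypothetical, not a real API)."""

    # Each commit is a pair: (files added, files removed).
    commits: list = field(default_factory=list)

    def commit(self, added=(), removed=()):
        """Record one set of file additions and removals as a single unit.

        Returns the new version number (0-based).
        """
        live = self.snapshot()
        missing = set(removed) - live
        if missing:
            # Reject the whole commit so nothing is partially applied.
            raise ValueError(f"cannot remove files not in table: {sorted(missing)}")
        self.commits.append((frozenset(added), frozenset(removed)))
        return len(self.commits) - 1

    def snapshot(self, version=None):
        """Replay the log up to `version` (inclusive) to get the live file set."""
        if version is None:
            version = len(self.commits) - 1
        live = set()
        for added, removed in self.commits[: version + 1]:
            live -= removed
            live |= added
        return live


log = ToyTableLog()
v0 = log.commit(added=["part-000.parquet", "part-001.parquet"])
# A row-level update rewrites one file: remove the old one, add the new one.
v1 = log.commit(added=["part-001-v2.parquet"], removed=["part-001.parquet"])

print(sorted(log.snapshot()))    # current table state
print(sorted(log.snapshot(v0)))  # "time travel" back to version 0
```

Readers never see a half-applied update here because a commit either appends one complete log entry or raises before appending anything. The real formats get the same effect from an atomic operation on the log or catalog pointer.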

What You Will Be Able to Do

After completing this module you will be able to:

Explain the architectural tradeoffs between data lakes, warehouses, and lakehouses - and when to choose each
Design and implement Iceberg tables with schema evolution, time travel, and row-level deletes using PyIceberg and Spark
Build Delta Lake pipelines with MERGE INTO upserts, Change Data Feed, and Z-ordering optimizations
Decide when to use Hudi's Copy-on-Write vs Merge-on-Read table types based on your write/read ratio
Choose and configure the right query engine (Trino, Spark, DuckDB, Athena) for a given access pattern
Implement data governance with Unity Catalog including column-level security and data lineage
Use lakehouse table formats as the backbone for ML feature stores and reproducible training datasets
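
As a preview of the Copy-on-Write vs Merge-on-Read tradeoff named in the objectives above, the sketch below models both strategies with plain dictionaries. It is an illustration under simplifying assumptions, not Hudi's implementation: the function names are hypothetical, a "file" is just a dict of key to row value, and compaction is left out.

```python
def cow_update(base_file, updates):
    """Copy-on-Write: rewrite the whole file with the updates applied."""
    new_file = dict(base_file)
    new_file.update(updates)
    return new_file  # readers read this file directly, with no merge step


def mor_update(delta_log, updates):
    """Merge-on-Read: append only the changed rows to a delta log."""
    delta_log.append(dict(updates))


def mor_read(base_file, delta_log):
    """Merge the base rows with every delta, oldest first, at read time."""
    merged = dict(base_file)
    for delta in delta_log:
        merged.update(delta)
    return merged


base = {1: "alice", 2: "bob", 3: "carol"}  # key -> row value

# CoW pays at write time: all three rows are rewritten to change one.
cow_file = cow_update(base, {2: "bobby"})

# MoR pays at read time: one row is written, but each read must merge.
deltas = []
mor_update(deltas, {2: "bobby"})

# Both strategies expose the same logical table to the reader.
assert cow_file == mor_read(base, deltas)
```

In this toy model the write cost of CoW grows with file size and the read cost of MoR grows with the number of unmerged deltas, which is why the write/read ratio drives the choice.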
