Module 06 - Data Lakehouse
The data lakehouse is the convergence point of two decades of parallel evolution. Data lakes gave us cheap, scalable storage for raw files, but without ACID guarantees they often degraded into data swamps. Data warehouses gave us reliable, queryable tables, but at proprietary cost and with ML teams largely locked out of the underlying data. The lakehouse pattern unifies both: ACID transactions and schema enforcement on top of open file formats in object storage, with first-class support for both SQL analytics and ML model training on the same data.
This module covers the architecture, open table formats, query engines, governance patterns, and ML integration that define the modern lakehouse.
Lessons at a Glance
| # | Lesson | Key Concepts | Est. Read Time |
|---|---|---|---|
| 01 | Lake vs Warehouse vs Lakehouse | Architecture tradeoffs, metadata layer, open formats | 20 min |
| 02 | Apache Iceberg | Three-layer metadata, hidden partitioning, time travel, PyIceberg | 25 min |
| 03 | Delta Lake | _delta_log/, MERGE INTO, Change Data Feed, Z-ordering | 25 min |
| 04 | Apache Hudi | Copy-on-Write vs Merge-on-Read, incremental pull, timeline | 20 min |
| 05 | Query Engines | Trino/Presto, Spark SQL, DuckDB, Athena, query planning | 20 min |
| 06 | Data Governance | Unity Catalog, fine-grained access, data lineage, tagging | 20 min |
| 07 | Lakehouse for ML | Feature stores, versioned training sets, experiment tracking | 20 min |
Prerequisites
Before this module you should be comfortable with:
- Module 01 - Data Engineering Foundations: storage systems, object storage (S3/GCS/ADLS), file formats (Parquet, ORC, Avro)
- Module 02 - Batch Processing: Spark architecture, DataFrame API, partitioning strategies
- Module 03 - Stream Processing: event-time semantics, watermarks, streaming sinks
- Module 04 - Data Modeling: star schema, dimensional modeling, slowly changing dimensions
- Module 05 - Data Pipelines: orchestration, idempotency, pipeline reliability patterns
Key Concepts Reference
| Concept | What It Means |
|---|---|
| ACID on object storage | Atomicity, Consistency, Isolation, and Durability guarantees implemented over object storage (S3/GCS/ADLS) by committing every change through a transaction log or an atomically swapped metadata file |
| Open table format | A specification (not a storage engine) that defines how metadata and data files are organized - Iceberg, Delta, Hudi |
| Time travel | Query historical versions of a table by timestamp, snapshot ID (Iceberg), or version number (Delta) |
| Schema evolution | Add, rename, or reorder columns without rewriting existing data files |
| Hidden partitioning | Iceberg derives partition values from column transforms (e.g. day of a timestamp), so queries filter on the source column and never need to reference partition columns explicitly |
| Copy-on-Write (CoW) | Row-level updates rewrite affected Parquet files - fast reads, slower writes |
| Merge-on-Read (MoR) | Updates are written as small delta/log files and merged with the base files at read time - fast writes, slower reads until compaction runs |
| Z-ordering | Multi-dimensional clustering of data for better data skipping and query performance |
| Metadata layer | The transaction log (Delta) or manifest hierarchy (Iceberg) that tracks what files exist and what state the table is in |
| Lakehouse query engine | A query engine that understands open table formats - Trino, Spark SQL, DuckDB, Athena |
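
To make the ACID, metadata layer, and time travel rows above concrete, here is a minimal Python sketch of a toy commit log. It is an illustration under simplifying assumptions, not how Delta, Iceberg, or Hudi are actually implemented: `ToyTableLog`, the `_log` directory, and the Parquet file names are all hypothetical, and exclusive file creation on a local disk stands in for the atomic put-if-absent or catalog pointer swap that real table formats rely on.

```python
import json
import os
import tempfile


class ToyTableLog:
    """Toy append-only commit log (hypothetical; loosely inspired by a log directory)."""

    def __init__(self, root):
        self.log_dir = os.path.join(root, "_log")
        os.makedirs(self.log_dir, exist_ok=True)

    def _commit_path(self, version):
        # Zero-padded names keep commits ordered and make the version easy to parse.
        return os.path.join(self.log_dir, f"{version:020d}.json")

    def latest_version(self):
        versions = [int(name.split(".")[0]) for name in os.listdir(self.log_dir)]
        return max(versions, default=-1)  # -1 means "no commits yet"

    def commit(self, add=(), remove=()):
        """Publish the next version; fails if another writer already claimed it."""
        version = self.latest_version() + 1
        actions = {"add": list(add), "remove": list(remove)}
        # Mode "x" is exclusive create: it raises FileExistsError when the file
        # exists, so two writers cannot both win the same version number.
        # Simplification: a real system would also ensure readers never see a
        # partially written commit file.
        with open(self._commit_path(version), "x") as f:
            json.dump(actions, f)
        return version

    def files_at(self, version=None):
        """Replay commits 0..version to get the live data files (time travel)."""
        if version is None:
            version = self.latest_version()
        live = set()
        for v in range(version + 1):
            with open(self._commit_path(v)) as f:
                actions = json.load(f)
            live |= set(actions["add"])
            live -= set(actions["remove"])
        return live


with tempfile.TemporaryDirectory() as root:
    log = ToyTableLog(root)
    v0 = log.commit(add=["part-000.parquet", "part-001.parquet"])
    # Copy-on-Write style update: part-001 is replaced by a rewritten file.
    v1 = log.commit(add=["part-002.parquet"], remove=["part-001.parquet"])
    current = sorted(log.files_at())
    as_of_v0 = sorted(log.files_at(version=v0))
    print("current:", current)
    print("as of v0:", as_of_v0)
```

Because readers only trust files listed in committed log entries, a write that never commits stays invisible, and an older version remains queryable for as long as its log entries and data files are retained.
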
What You Will Be Able to Do
After completing this module you will be able to:
- Explain the architectural tradeoffs between data lakes, warehouses, and lakehouses - and when to choose each
- Design and implement Iceberg tables with schema evolution, time travel, and row-level deletes using PyIceberg and Spark
- Build Delta Lake pipelines with MERGE INTO upserts, Change Data Feed, and Z-ordering optimizations
- Decide when to use Hudi's Copy-on-Write vs Merge-on-Read table types based on your write/read ratio
- Choose and configure the right query engine (Trino, Spark, DuckDB, Athena) for a given access pattern
- Implement data governance with Unity Catalog including column-level security and data lineage
- Use lakehouse table formats as the backbone for ML feature stores and reproducible training datasets
