Module 06 - Data Lakehouse

The data lakehouse is the convergence point of two decades of parallel evolution. Data lakes gave us cheap, scalable storage for raw files, but without ACID guarantees they tended to degrade into data swamps. Data warehouses gave us reliable, queryable tables, but at proprietary cost and with ML teams largely locked out of the underlying data. The lakehouse pattern unifies both - ACID transactions and schema enforcement on top of open file formats in object storage, with first-class support for both SQL analytics and ML model training on the same data.
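
To make "ACID transactions on top of open files" a little more concrete, here is a minimal sketch of the general idea, not a description of how Iceberg, Delta Lake, or Hudi are actually implemented. It models a table as a set of data file names plus an append-only commit log, using only the Python standard library. The names `ToyTable`, `_log`, `commit`, and `snapshot` are hypothetical, and exclusive file creation on a local disk stands in for whatever atomic "create if absent" primitive a real catalog or object store provides.

```python
import json
import os
import tempfile


class ToyTable:
    """Toy table: data file names plus an append-only commit log (hypothetical design)."""

    def __init__(self, root):
        self.log_dir = os.path.join(root, "_log")
        os.makedirs(self.log_dir, exist_ok=True)

    def _versions(self):
        # Commit files are named 00000000.json, 00000001.json, ...
        return sorted(int(name.split(".")[0]) for name in os.listdir(self.log_dir))

    def latest_version(self):
        versions = self._versions()
        return versions[-1] if versions else -1

    def commit(self, expected_version, add=(), remove=()):
        """Try to publish version expected_version + 1."""
        new_version = expected_version + 1
        path = os.path.join(self.log_dir, f"{new_version:08d}.json")
        # Mode "x" raises FileExistsError if this version was already committed,
        # so two writers can never both claim the same version number.
        with open(path, "x") as f:
            json.dump({"add": list(add), "remove": list(remove)}, f)
        return new_version

    def snapshot(self, version=None):
        """Replay the log up to `version` (default: latest) to get the live file set."""
        if version is None:
            version = self.latest_version()
        files = set()
        for v in self._versions():
            if v > version:
                break
            with open(os.path.join(self.log_dir, f"{v:08d}.json")) as f:
                entry = json.load(f)
            files |= set(entry["add"])
            files -= set(entry["remove"])
        return sorted(files)


table = ToyTable(tempfile.mkdtemp())
v0 = table.commit(-1, add=["part-000.parquet", "part-001.parquet"])
v1 = table.commit(v0, add=["part-002.parquet"], remove=["part-000.parquet"])

print(table.snapshot())    # current file set
print(table.snapshot(v0))  # "time travel": the file set as of the first commit

try:
    table.commit(v0, add=["part-003.parquet"])  # stale writer: version 1 already exists
except FileExistsError:
    print("conflict: re-read the latest version and retry")
```

The point of the sketch is that readers never list the data directory directly: the set of live files at any version is whatever the log says it is, which is what lets a format offer atomic commits and historical reads over plain files.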

This module covers the architecture, open table formats, query engines, governance patterns, and ML integration that define the modern lakehouse.


Lessons at a Glance

| # | Lesson | Key Concepts | Est. Read Time |
|---|---|---|---|
| 01 | Lake vs Warehouse vs Lakehouse | Architecture tradeoffs, metadata layer, open formats | 20 min |
| 02 | Apache Iceberg | Three-layer metadata, hidden partitioning, time travel, PyIceberg | 25 min |
| 03 | Delta Lake | _delta_log/, MERGE INTO, Change Data Feed, Z-ordering | 25 min |
| 04 | Apache Hudi | Copy-on-Write vs Merge-on-Read, incremental pull, timeline | 20 min |
| 05 | Query Engines | Trino/Presto, Spark SQL, DuckDB, Athena, query planning | 20 min |
| 06 | Data Governance | Unity Catalog, fine-grained access, data lineage, tagging | 20 min |
| 07 | Lakehouse for ML | Feature stores, versioned training sets, experiment tracking | 20 min |

Prerequisites

Before this module you should be comfortable with:

  • Module 01 - Data Engineering Foundations: storage systems, object storage (S3/GCS/ADLS), file formats (Parquet, ORC, Avro)
  • Module 02 - Batch Processing: Spark architecture, DataFrame API, partitioning strategies
  • Module 03 - Stream Processing: event-time semantics, watermarks, streaming sinks
  • Module 04 - Data Modeling: star schema, dimensional modeling, slowly changing dimensions
  • Module 05 - Data Pipelines: orchestration, idempotency, pipeline reliability patterns

Key Concepts Reference

| Concept | What It Means |
|---|---|
| ACID on object storage | Atomicity, Consistency, Isolation, Durability guarantees implemented over S3/GCS/ADLS via a transaction log or atomically swapped table metadata |
| Open table format | A specification (not a storage engine) that defines how metadata and data files are organized - Iceberg, Delta Lake, Hudi |
| Time travel | Query historical versions of a table by timestamp or by snapshot/version ID |
| Schema evolution | Add, rename, or reorder columns without rewriting existing data files |
| Hidden partitioning | Iceberg derives partition values from transforms on regular columns (such as the day of a timestamp) and tracks them in metadata - queries filter on the source column and files are pruned automatically, with no separate partition column to reference |
| Copy-on-Write (CoW) | Row-level updates rewrite the affected Parquet files - fast reads, slower writes |
| Merge-on-Read (MoR) | Updates are written as delta files and merged with base files at read time - fast writes, somewhat slower reads until compaction |
| Z-ordering | Multi-dimensional clustering of data for better data skipping and query performance |
| Metadata layer | The transaction log (Delta), manifest hierarchy (Iceberg), or timeline (Hudi) that tracks which files exist and what state the table is in |
| Lakehouse query engine | A query engine that understands open table formats - Trino, Spark SQL, DuckDB, Athena |
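
The Copy-on-Write versus Merge-on-Read tradeoff is easier to see in a toy model. The sketch below is an illustration under simplifying assumptions, not Hudi's (or any format's) actual implementation: each "file" is a plain Python dict of key to value, and the class and attribute names (`CopyOnWriteTable`, `MergeOnReadTable`, `files_rewritten`, `delta_log`) are hypothetical.

```python
class CopyOnWriteTable:
    """Each update rewrites the whole base file that holds the affected key."""

    def __init__(self, files):
        self.files = [dict(f) for f in files]  # each "file" maps key -> value
        self.files_rewritten = 0

    def upsert(self, key, value):
        for i, f in enumerate(self.files):
            if key in f:
                # Write cost: produce a full new copy of the file with the change applied.
                self.files[i] = {**f, key: value}
                self.files_rewritten += 1
                return
        self.files.append({key: value})  # new key: write a new file

    def read(self):
        # Read cost: just scan the base files, nothing to merge.
        merged = {}
        for f in self.files:
            merged.update(f)
        return merged


class MergeOnReadTable:
    """Updates are appended to a delta log and merged with base files at read time."""

    def __init__(self, files):
        self.base_files = [dict(f) for f in files]
        self.delta_log = []  # append-only list of (key, value) records

    def upsert(self, key, value):
        self.delta_log.append((key, value))  # cheap write: no base file is touched

    def read(self):
        merged = {}
        for f in self.base_files:
            merged.update(f)
        for key, value in self.delta_log:  # merge cost is paid on every read
            merged[key] = value
        return merged

    def compact(self):
        """Fold the delta log into a fresh base file so later reads are cheap again."""
        self.base_files = [self.read()]
        self.delta_log = []


base = [{"a": 1, "b": 2}, {"c": 3}]
cow = CopyOnWriteTable(base)
mor = MergeOnReadTable(base)
for t in (cow, mor):
    t.upsert("b", 20)
    t.upsert("c", 30)

print(cow.read() == mor.read())   # both tables expose the same logical rows
print(cow.files_rewritten)        # base files rewritten by the CoW table
print(len(mor.delta_log))         # delta records waiting to be merged in MoR
mor.compact()
```

Both tables return identical rows; what differs is where the work happens. The Copy-on-Write table pays at write time by rewriting files, while the Merge-on-Read table defers the cost to readers until compaction folds the deltas back into base files.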

What You Will Be Able to Do

After completing this module you will be able to:

  1. Explain the architectural tradeoffs between data lakes, warehouses, and lakehouses - and when to choose each
  2. Design and implement Iceberg tables with schema evolution, time travel, and row-level deletes using PyIceberg and Spark
  3. Build Delta Lake pipelines with MERGE INTO upserts, Change Data Feed, and Z-ordering optimizations
  4. Decide when to use Hudi's Copy-on-Write vs Merge-on-Read table types based on your write/read ratio
  5. Choose and configure the right query engine (Trino, Spark, DuckDB, Athena) for a given access pattern
  6. Implement data governance with Unity Catalog including column-level security and data lineage
  7. Use lakehouse table formats as the backbone for ML feature stores and reproducible training datasets
© 2026 EngineersOfAI. All rights reserved.