Skip to main content

:::tip ๐ŸŽฎ Interactive Playground Visualize this concept: Try the Data Lakehouse Architecture demo on the EngineersOfAI Playground - no code required. :::

Data Lake vs Warehouse vs Lakehouse for AI Workloads

The $180k/Year Mistakeโ€‹

The data team at a mid-size SaaS company had built what looked like a solid architecture. Snowflake held the production data warehouse - clean, fast, query-optimized. Sales dashboards pulled from it. Finance ran month-end reports on it. The BI team loved it. The bill was $180,000 per year.

But the ML team was not using Snowflake. They could not. Their PyTorch training jobs needed raw Parquet files they could load with pandas.read_parquet(). They needed unstructured logs for anomaly detection. They needed to run Spark jobs that read terabytes at a time without paying per-byte query costs. So they had their own data lake on S3 - raw files, no guarantees, but free to read as many times as they wanted.

The problem was the gap between them. Every night at 2am, a six-hour Spark ETL job ran to extract data from Snowflake, transform it, and write it back to S3 so the ML team could use it. And every morning at 8am, another job ran to pull the ML team's feature outputs from S3 into Snowflake so the BI team could see them. Two copies of every important dataset. Six-hour latency from source to ML training. A $180k/year Snowflake bill for data that also existed on S3.

Then came the incident. The Snowflake-to-S3 sync job failed silently on a Tuesday. The ML team's models retrained on Wednesday using Thursday's data from S3 - but the data was from the previous week. The churn prediction model fired alerts on customers who had already churned. Customer success reached out to apologize to customers who were not even subscribed anymore. The dashboards showed "active" and the model agreed - but the model was looking at stale data.

The data engineering team spent two weeks auditing. The root cause was simple: two storage systems meant two sources of truth. The fix was a different architecture entirely. Three months later they had migrated to a lakehouse on S3 with Delta Lake as the table format. The Snowflake bill became $0. The 6-hour ETL sync became instant. The ML team and the BI team now read from the same tables, at the same version, with ACID guarantees.


Why This Exists - The Fundamental Tensionโ€‹

The data lake and data warehouse were built by different teams solving different problems. They optimized for different things. And for a decade, organizations ran both, paying the cost of keeping them synchronized.

To understand why the lakehouse exists, you need to feel the pain of both systems before it.


The Data Lake: Scalable Storage, Unreliable Dataโ€‹

The data lake concept emerged around 2010 as object storage (Amazon S3 launched in 2006) became so cheap that organizations could afford to store everything. The idea was simple: dump all raw data into S3, HDFS, or GCS in whatever format it arrived in. Analyze it later with Spark or Hive. Pay almost nothing for storage.

For a few years this worked. But as data volumes grew and more teams began writing to these lakes, the problems became severe.

Problem 1: No schema enforcement. Any job could write any schema. If a producer changed a column name, downstream readers would silently read null for that column - or crash. There was no contract, no enforcement, no validation at write time.

Problem 2: No ACID transactions. If a Spark job writing 500 files to a directory failed at file 347, those 347 partial files remained. The next reader would read a corrupt partial dataset. Readers mid-write would see a half-written state. There was no concept of a "transaction."

Problem 3: No efficient updates or deletes. Parquet is an immutable columnar file format. To update one row, you had to rewrite the entire Parquet file containing that row. To delete one record for a GDPR erasure request, you had to rewrite every file that contained data for that user. A 3TB dataset meant 4 hours of Spark jobs for a single record deletion.

Problem 4: No query optimization. A Hive query had no statistics about data distribution. It had to scan every file in a directory. Predicate pushdown was limited. File listing on S3 was slow. Queries that should have taken 5 seconds took 20 minutes.

Problem 5: Metadata chaos. Without a reliable catalog, teams did not know what tables existed, who owned them, when they were last updated, or whether the schema was current. Data discovery was done via Slack messages to teammates.

These problems combined into what practitioners called the "data swamp" - a lake where data entered but nothing useful came out because no one trusted it.

:::warning Data Swamp Warning Signs If your data lake has directories named final_v2_USE_THIS/, old-DO-NOT-USE/, and temp_amit_test/ sitting next to your production data - you have a data swamp. The lakehouse metadata layer is what prevents this from happening. :::


The Data Warehouse: Reliable Data, Expensive Constraintsโ€‹

The data warehouse evolved from 1980s relational database technology. Teradata, Oracle, then cloud-native systems like Redshift, BigQuery, and Snowflake. The core value proposition was reliability: ACID transactions, schema enforcement, query optimization with statistics, and consistent reads.

Data warehouses solved everything the data lake could not. But they introduced a different set of constraints.

Problem 1: Expensive storage and compute. Snowflake charges separately for storage (per TB per month) and compute (per second of virtual warehouse usage). Loading 1TB of data into Snowflake and running queries against it costs real money. For ML teams that need to read the same dataset 100 times during hyperparameter tuning, the economics do not work.

Problem 2: Proprietary formats. Data inside Snowflake is stored in Snowflake's internal compressed columnar format. You cannot read it with Spark, Pandas, or PyTorch directly. You must go through Snowflake's SQL interface. This creates a hard boundary between the SQL analytics world and the Python/ML world.

Problem 3: ML workloads need raw files. A PyTorch training job needs to read millions of rows as fast as possible, ideally in parallel across 8 GPUs. It does not want to issue SQL queries - it wants to open Parquet files directly. A warehouse cannot serve this pattern.

Problem 4: Can't run arbitrary code. You can run SQL in a warehouse. You cannot run a custom Spark job, a Pandas transformation, or a scikit-learn pipeline directly against warehouse storage.

Problem 5: Slow for streaming. Loading streaming data into a warehouse typically involves micro-batch ETL jobs. Sub-minute latency is expensive and operationally complex in most warehouse systems.


The Lakehouse Thesis - Armbrust et al., 2021โ€‹

In 2021, a team from Databricks and UC Berkeley published a paper at CIDR (Conference on Innovative Data Systems Research) titled "Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics." The key authors included Michael Armbrust, who also designed Spark SQL.

The thesis was precise: you can implement data warehouse features on top of low-cost object storage by adding a metadata layer that handles ACID transactions, schema enforcement, statistics, and indexing.

The insight was that a data warehouse's value is not in its proprietary storage format - it is in the metadata it maintains about that storage. If you implement the same metadata guarantees on top of open file formats (Parquet, ORC) stored in S3 or GCS, you get warehouse capabilities at lake costs. And because the underlying files are open format, ML tools can read them directly without going through a SQL interface.

The three components of a lakehouse:

  1. Open file formats for data: Parquet or ORC files stored in object storage - readable by any tool
  2. A metadata layer: a transaction log or manifest hierarchy (Delta Lake, Iceberg, Hudi) that tracks what files exist, what schema they have, and manages ACID commits
  3. A query engine: understands the metadata layer to enable SQL queries, data skipping, and statistics-driven optimization (Spark SQL, Trino, DuckDB, Athena)

Architecture Comparison: 10 Dimensionsโ€‹

DimensionData LakeData WarehouseLakehouse
Storage costVery low ($0.023/GB/mo S3)Mediumโ€“high (23โ€“23โ€“40/TB/mo Snowflake)Very low (object storage)
ACID transactionsNoYesYes (via table format)
ML workload supportYes (raw files)No (proprietary format)Yes (open Parquet files)
Query performancePoor (no statistics)Excellent (statistics + indexes)Goodโ€“excellent (data skipping, Z-order)
Schema enforcementNo (write anything)Yes (enforced at load)Yes (enforced by table format)
Open formatYes (but no guarantees)No (proprietary)Yes (Parquet + open spec)
Vendor lock-inLowHigh (proprietary SQL dialect, format)Low (open standards)
Streaming supportManual file writesExpensive micro-batchFirst-class (Delta streaming)
Time travelNoLimited or expensiveYes (configurable retention)
GovernanceManual / noneBuilt-in (role-based)Via catalog (Unity, Glue, Polaris)

How the Metadata Layer Worksโ€‹

This is the core technical insight. Understanding it makes everything else about Delta, Iceberg, and Hudi click into place.

Consider a simple table stored as Parquet files on S3:

s3://my-bucket/sales/
part-00000-abc.parquet (rows 1โ€“1,000,000)
part-00001-def.parquet (rows 1,000,001โ€“2,000,000)
part-00002-ghi.parquet (rows 2,000,001โ€“3,000,000)

Without a metadata layer, a query engine must list all files, read all of them, and filter. There is no ACID - two concurrent writers can produce overlapping or corrupt state. There is no schema - each file might have different columns.

With a metadata layer (here, Delta Lake's _delta_log/):

s3://my-bucket/sales/
_delta_log/
00000000000000000000.json โ† commit 0: add 3 files, schema definition
00000000000000000001.json โ† commit 1: add 2 new files
00000000000000000002.json โ† commit 2: remove 1 file (UPDATE rewrote it)
part-00000-abc.parquet
part-00001-def.parquet
part-00002-ghi.parquet
part-00003-new.parquet
part-00004-new.parquet
part-00005-updated.parquet

The transaction log is the source of truth. To know the current state of the table, replay the log. The log records which files were added, which were removed, the schema at each commit, and statistics about each file (min/max values per column, null counts).

This gives you:

  • Atomicity: a commit either fully succeeds (JSON file written) or fails (no JSON file) - partial writes are never visible
  • Isolation: readers always see the state at a specific commit version, even if writers are active
  • Schema enforcement: the schema is part of the log - any write that violates it is rejected
  • Data skipping: column-level statistics let the query engine skip entire files without reading them
  • Time travel: by replaying only the first N commits, you reconstruct any historical state

Query Performance: Closing the Gap with Warehousesโ€‹

Early lakehouses were slower than warehouses for SQL queries because file listing was slow and there were no fine-grained statistics. The gap has narrowed substantially.

Data skipping. Each Parquet file written to a lakehouse table stores column statistics in the metadata: minimum value, maximum value, null count. If a query filters WHERE order_date = '2024-01-15', the query engine reads the min/max order_date for each file from the metadata - if 2024-01-15 is not in range, the file is skipped entirely without reading it. For selective queries on clustered data, this can skip 99% of files.

Z-ordering. Z-ordering (also called Z-curve ordering or multi-dimensional clustering) co-locates data by multiple columns simultaneously. Running OPTIMIZE my_table ZORDER BY (customer_id, order_date) rewrites Parquet files so that rows with similar customer_id and order_date values are physically adjacent. Queries filtering on either or both columns now skip far more files.

Vectorized reads. Modern Parquet readers (Arrow, Velox) use SIMD CPU instructions to read columnar data at memory bandwidth. A well-tuned lakehouse query can match or exceed warehouse query performance for analytical workloads.

Bloom filters. Some table formats (Iceberg, Delta) support per-column bloom filters embedded in file metadata. A bloom filter lets the query engine ask "does this file contain value X?" in microseconds without reading the file.

:::tip Performance Benchmark In the Databricks 2021 benchmarks on TPC-DS 1TB, Delta Lake with Z-ordering achieved query times within 2x of Snowflake while using raw S3 storage at 1/10th the cost. For ML teams doing full-table scans at high throughput, lakehouses are often faster than warehouses because they parallelize better. :::


The Open Table Format Ecosystemโ€‹

Three open table formats exist. Understanding why helps you choose between them.

Apache Iceberg was created at Netflix in 2017 and open-sourced in 2018. Netflix had a 100-petabyte Hive metastore that was crumbling under the weight of partition listing operations. Iceberg was designed from the ground up for massive scale and correctness - it tracks partitions and files in metadata files, not Hive-style directories. Iceberg has become the preferred format for cross-engine interoperability. If you need to read the same table from Spark, Trino, Flink, and DuckDB simultaneously, Iceberg is the safest choice.

Delta Lake was created by Databricks in 2019. It uses a simple append-only JSON transaction log (_delta_log/) that is easy to understand and implement. Delta is deeply integrated with Spark and the Databricks platform. If your team is Spark-first or on Databricks, Delta is the natural choice. Delta also has strong streaming support - Spark Structured Streaming can read Delta tables with exactly-once semantics.

Apache Hudi (Hadoop Upserts Deletes and Incrementals) was created at Uber in 2016, open-sourced in 2019. Hudi was purpose-built for high-frequency upserts - Uber had ride event data arriving millions of times per second that needed to update existing records. Hudi's unique contribution is the incremental pull pattern: consumers can query only the records that changed since a given timestamp, enabling highly efficient downstream pipelines.

In 2024, the ecosystem converged toward Iceberg as the interoperability standard, driven partly by the formation of the Apache Iceberg REST catalog spec and adoption by Snowflake, Google BigQuery, and AWS Athena. But Delta remains dominant in Databricks-centric shops, and Hudi remains strong for high-throughput upsert workloads.


Hands-On: Reading a Delta Table Three Waysโ€‹

# Install: pip install delta-spark pyspark pandas deltalake

# --- 1. Reading with PySpark + Delta Lake ---
from pyspark.sql import SparkSession

spark = (
SparkSession.builder
.appName("lakehouse-demo")
.config("spark.jars.packages", "io.delta:delta-spark_2.12:3.1.0")
.config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
.config(
"spark.sql.catalog.spark_catalog",
"org.apache.spark.sql.delta.catalog.DeltaCatalog",
)
.getOrCreate()
)

# Write a Delta table
df = spark.range(1000).selectExpr(
"id",
"cast(id * 2.5 as double) as revenue",
"date_add('2024-01-01', cast(id % 365 as int)) as order_date",
)
df.write.format("delta").mode("overwrite").save("/tmp/sales_delta")

# Read it back
sales = spark.read.format("delta").load("/tmp/sales_delta")
sales.show(5)


# --- 2. Reading with the standalone deltalake Python library (no Spark) ---
from deltalake import DeltaTable
import pandas as pd

dt = DeltaTable("/tmp/sales_delta")

# Read as Pandas DataFrame
df_pandas = dt.to_pandas()
print(df_pandas.head())

# Read as Arrow Table (faster for large datasets)
arrow_table = dt.to_pyarrow_table()

# Time travel: read version 0
dt_v0 = DeltaTable("/tmp/sales_delta", version=0)
df_v0 = dt_v0.to_pandas()


# --- 3. Direct Parquet reads (ML training pattern - no metadata overhead) ---
import pyarrow.dataset as ds

# Open the dataset through the Delta metadata to get the right file list
delta_table = DeltaTable("/tmp/sales_delta")
file_list = delta_table.file_uris() # list of Parquet file URIs at current version

# Read as Arrow dataset for fast ML consumption
dataset = ds.dataset(file_list, format="parquet")
for batch in dataset.to_batches(batch_size=65536):
# Each batch is an Arrow RecordBatch - convert to numpy for training
revenue = batch.column("revenue").to_pylist()
# ... feed to model

:::note Why Three Ways to Read? The PySpark path is for large-scale batch processing. The deltalake library (Rust-based) is for Python scripts that do not want JVM overhead - faster startup for small-to-medium tables. The direct PyArrow path is for ML training loops that need maximum throughput without any metadata overhead per batch. :::


Choosing Between Lake, Warehouse, and Lakehouseโ€‹

This is not purely a technical decision - it involves cost, team skills, and access patterns.

Choose a data lake (raw object storage) when:

  • You need to store raw, unprocessed data before you know how it will be used
  • All consumers are ML/data science teams reading raw files
  • You have no SQL analytics needs
  • Budget is extremely constrained

Choose a data warehouse when:

  • Your primary consumers are BI tools and SQL analysts
  • Your data is well-structured and schema-stable
  • You need enterprise-grade governance and support
  • Compute costs are acceptable relative to the value of fast SQL queries
  • You have no ML teams accessing raw files (or they are a small minority)

Choose a lakehouse when:

  • You have both ML teams and SQL analytics teams consuming the same data
  • You need ACID guarantees but cannot afford warehouse costs for all storage
  • You want to avoid the ETL synchronization problem between lake and warehouse
  • You need time travel (regulatory compliance, model reproducibility)
  • You want to avoid vendor lock-in on your core data assets
  • You are building on cloud object storage and want open standards

:::danger The Dual-System Anti-Pattern Running a data lake AND a data warehouse in parallel is not a stable long-term architecture. The synchronization cost (engineering time + infrastructure cost + data consistency risk) compounds over time. Every organization that has gone through this migration reports the same thing: they should have moved to a lakehouse sooner. :::


Production Notesโ€‹

File size matters. The ideal Parquet file size for a lakehouse is 128MBโ€“1GB. Files smaller than 32MB create excessive metadata overhead and slow file listing. Files larger than 2GB cannot parallelize well across worker nodes. Run periodic OPTIMIZE (Delta) or rewriteDataFiles (Iceberg) to compact small files created by streaming or high-frequency writes.

Partition by query patterns, not by instinct. Partitioning by date is common but only helps if most queries filter by date. If your most common query is WHERE customer_id = X, date partitioning does nothing - Z-ordering by customer_id will help far more. Understand your query patterns before committing to a partition scheme.

The metadata layer has overhead. For tables under 1GB that are only ever fully scanned (e.g., a small lookup table), the overhead of maintaining a transaction log can exceed its benefits. Use raw Parquet for tiny static tables. Use table formats for anything that grows, changes, or needs ACID.

Catalog is not optional at scale. Without a metadata catalog (AWS Glue, Databricks Unity Catalog, Apache Polaris), table discovery becomes Slack-driven chaos. Register every lakehouse table in a catalog from day one.


Common Mistakesโ€‹

:::danger Treating the Lakehouse as a Silver Bullet The lakehouse eliminates the lake-warehouse gap but does not eliminate the need for data modeling, governance, or pipeline reliability engineering. Teams that adopt Delta Lake or Iceberg expecting their data quality problems to disappear are disappointed. ACID prevents corrupt writes - it does not prevent bad data from being written correctly. :::

:::warning Skipping OPTIMIZE and Vacuum A lakehouse table that never runs OPTIMIZE (Delta) or rewriteDataFiles (Iceberg) degrades over time. Streaming pipelines write tiny files - a week of streaming without compaction can leave millions of 1KB files that make queries painfully slow. Set up scheduled OPTIMIZE jobs. Run VACUUM (Delta) or expireSnapshots (Iceberg) periodically to reclaim storage from deleted files, but never below the retention window you need for time travel. :::

:::warning Object Storage Eventual Consistency AWS S3 is strongly consistent for new object writes (as of December 2020), but Azure ADLS Gen1 and some non-AWS S3-compatible stores are not. Test consistency guarantees on your specific object storage before relying on lakehouse ACID semantics. :::


Interview Q&Aโ€‹

Q: What is the fundamental problem the lakehouse architecture solves compared to running a data lake and data warehouse in parallel?

A: The dual-system architecture creates two copies of every important dataset that must be kept synchronized via ETL jobs. This introduces latency (data in the warehouse is hours behind the lake), consistency risk (sync failures create divergence between what SQL dashboards show and what ML models train on), and operational cost. The lakehouse solves this by adding ACID transaction semantics and schema enforcement directly on top of the open file formats already used in the data lake, so the same Parquet files can serve both SQL analytics (via Trino or Spark SQL) and ML training (via direct PyArrow reads) without any copy or ETL sync.

Q: How does a lakehouse achieve ACID transactions on top of object storage, which has no native transaction support?

A: Through a transaction log - a metadata layer separate from the data files. In Delta Lake this is the _delta_log/ directory; in Iceberg it is the metadata.json โ†’ manifest list โ†’ manifest files hierarchy. Writers atomically commit a new log entry that lists which data files were added and which were removed. Readers determine the current table state by reading the log, not by listing the data directory. Atomicity comes from the log entry being a single atomic object write. Isolation comes from each read working against a specific log version. Concurrent writes use optimistic concurrency - if two writers attempt to modify the same table simultaneously, one will see a conflict on log version and retry.

Q: What is data skipping and why does it matter for lakehouse query performance?

A: Data skipping uses per-file column statistics (min value, max value, null count) stored in the table metadata to eliminate files from a query without reading them. If a query filters WHERE event_date = '2024-06-01' and a file's metadata shows event_date ranges from 2024-07-01 to 2024-07-31, that file is skipped entirely. For selective queries on large tables, data skipping can eliminate 95โ€“99% of the data that would otherwise be scanned. Z-ordering improves data skipping by physically co-locating rows with similar values, making the min/max ranges per file narrower and therefore the skipping more aggressive.

Q: When would you choose Apache Iceberg over Delta Lake?

A: Choose Iceberg when you need multi-engine interoperability - the same table read by Spark, Trino, Flink, Dremio, and DuckDB simultaneously. Iceberg has a well-specified REST catalog API (the "open catalog" spec) that all major query engines implement. Choose Iceberg also for very large tables (petabyte scale) where Iceberg's manifest file hierarchy handles partition listing more efficiently than Delta's flat transaction log. Choose Delta Lake when you are a Databricks shop, when you need tight Spark Structured Streaming integration, or when your team prefers the simplicity of Delta's JSON transaction log format for debugging.

Q: Explain the difference between Copy-on-Write and Merge-on-Read table types in the context of row-level operations.

A: These are two strategies for implementing UPDATE, DELETE, and MERGE on immutable Parquet files. In Copy-on-Write (CoW), when a row is updated or deleted, the entire Parquet file containing that row is rewritten with the change applied. Reads are fast because every file is always fully valid. Writes are expensive because they rewrite files. In Merge-on-Read (MoR), updates and deletes are written as small delta files. The original data files are not touched. At read time, the query engine merges the delta files with the base files to produce the current view. Writes are cheap. Reads are slightly more expensive because they must merge. Apache Hudi explicitly exposes this as a table type choice. Iceberg implements both strategies (controlled by write configuration). Delta Lake primarily uses CoW with automatic MoR-like optimizations internally. For workloads with very high write frequency and moderate read volume, MoR is better. For workloads with low write frequency and very high read volume, CoW is better.

Q: Why can't you just use a data warehouse for ML model training?

A: Three reasons. First, data warehouses store data in proprietary internal formats - you cannot open a Snowflake table with pandas.read_parquet(). ML frameworks expect files (Parquet, CSV, TFRecord) not SQL cursor results, and transferring terabytes through a JDBC driver is orders of magnitude slower than reading Parquet directly. Second, the cost model is wrong - warehouses charge for compute per query, and ML training involves reading the same dataset dozens or hundreds of times during hyperparameter search. Third, you cannot run arbitrary Python or Spark code against warehouse storage. Feature engineering jobs, Spark-based preprocessing, and custom data loaders all need file access, not SQL access.


Migrating From Dual-System to Lakehouse - A Practical Pathโ€‹

If you are already running both a data lake and a data warehouse, migrating to a lakehouse is not a big-bang rewrite. It is an incremental adoption where you introduce a table format on top of existing S3 data and gradually retire the warehouse dependency.

Phase 1: Introduce the table format on new tables. Start by writing new data pipelines to Delta Lake or Iceberg instead of raw Parquet. Do not touch existing tables. Let the team build familiarity with the format, the tooling, and the operational patterns (OPTIMIZE, VACUUM, schema evolution).

Phase 2: Convert high-cost warehouse tables. Identify the tables in your warehouse that are most expensive - either by query compute cost or by ETL cost to keep them synchronized with the lake. Migrate these first. The migration pattern is: backfill historical data to a lakehouse table, set up a parallel write path, validate query results match, cut over the dashboard, then stop writing to the warehouse.

Phase 3: Retire the ETL sync jobs. Once the warehouse tables that are used by ML teams have been migrated, the ETL jobs that kept lake and warehouse in sync can be retired. This is the moment the architecture actually becomes simpler - fewer moving parts, fewer failure modes, fewer teams to coordinate.

Phase 4: Evaluate whether the warehouse is still needed. Some organizations find that after migrating their high-volume tables, the remaining warehouse usage is small enough to be cheaper than maintaining a separate warehouse. Others find that their BI tool integrations, governance workflows, or SQL analyst productivity depends on the warehouse and they keep it for those specific use cases. This is a legitimate outcome - the lakehouse does not have to eliminate the warehouse, it just needs to eliminate the expensive duplication.

# Phase 2 migration pattern: backfill warehouse table to Delta Lake
from pyspark.sql import SparkSession
import snowflake.connector

spark = SparkSession.builder.appName("migration").getOrCreate()

# Step 1: Read historical data from Snowflake
# (Do this once for the full history)
df_historical = (
spark.read
.format("snowflake")
.options(**{
"sfURL": "your-account.snowflakecomputing.com",
"sfUser": "your-user",
"sfPassword": "your-password",
"sfDatabase": "ANALYTICS",
"sfSchema": "PUBLIC",
"sfWarehouse": "COMPUTE_WH",
"dbtable": "ORDERS",
})
.load()
)

# Step 2: Write to Delta Lake (partitioned by date for efficient queries)
(
df_historical
.write
.format("delta")
.partitionBy("order_date")
.mode("overwrite")
.save("s3://your-bucket/lakehouse/orders/")
)

# Step 3: Set up ongoing sync from source to Delta (bypass warehouse)
# New orders from Kafka/source system โ†’ Delta directly
# This replaces the warehouse ETL job

# Step 4: Point dashboards at Delta via Trino or Spark SQL
# SELECT * FROM orders WHERE order_date = '2024-01-15'
# Reads from Delta, same performance as warehouse for date-filtered queries

The Storage Cost Mathโ€‹

One of the strongest arguments for the lakehouse is the economics. Let us work through a concrete example.

Suppose your company has 50TB of production data. You run a Snowflake warehouse for SQL analytics and an S3 data lake for ML. You maintain both in sync.

Dual-system cost estimate:

ItemEstimated Monthly Cost
Snowflake storage (50TB ร— $23/TB)$1,150
Snowflake compute (100 hours/day ร— 1 XS warehouse ร— $2/hr)$6,000
S3 storage (50TB ร— 0.023/GB=0.023/GB = 0.023 ร— 50,000GB)$1,150
EC2/EMR for ETL sync jobs (4hr/day ร— 10 r5.4xlarge ร— $1/hr)$1,200
Engineering time (0.5 FTE maintaining sync pipeline)~$8,000
Total per month~$17,500

Lakehouse cost estimate (Delta Lake on S3 + Spark on EMR):

ItemEstimated Monthly Cost
S3 storage (50TB ร— $0.023/GB)$1,150
EMR/Spark compute for analytics (50 hours/day ร— 2 m5.4xlarge ร— $0.80/hr)$2,400
No ETL sync jobs$0
Engineering time (0.1 FTE maintenance)~$1,600
Total per month~$5,150

The savings are roughly 12,000/monthor12,000/month or 144,000/year - for a 50TB dataset. For larger datasets the savings scale linearly with storage volume. This is why the economics of lakehouses are compelling even before you account for the data consistency improvements and engineering velocity gains.

:::note Real-World Savings Actual savings vary significantly based on query patterns, Snowflake tier, and how heavily the compute warehouse is used. Organizations with very query-intensive BI workloads may find Snowflake's query optimization still justifies the cost for those workloads - the hybrid pattern (Snowflake for BI, lakehouse for ML and large batch processing) is common and sensible. :::


Lakehouse and the Modern Data Stackโ€‹

The lakehouse sits at the center of what has been called the "modern data stack" - a loosely coupled set of open-source and managed tools that together provide the data platform capabilities that monolithic warehouses provided as a single product.

Each component is independently replaceable. You can swap Fivetran for Airbyte, or dbt for Spark, without rebuilding the whole architecture. The lakehouse storage layer is the stable foundation - because the data is in open Parquet files with open table format metadata, no single vendor controls access to it.

This modularity is a significant operational advantage over monolithic warehouse platforms, where switching vendors means migrating all your data out of a proprietary format.


Streaming Data Into a Lakehouseโ€‹

One of the early criticisms of lakehouses was that they were batch-first - real-time data required micro-batch ETL that introduced latency. That criticism has mostly been addressed.

Delta Lake + Spark Structured Streaming. Writing a streaming Spark job to a Delta table is identical to writing a batch job, with the addition of a checkpoint location. Delta's transaction log provides exactly-once semantics: if a micro-batch fails and retries, the idempotent commit prevents duplicates.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("streaming-lakehouse").getOrCreate()

# Read from Kafka
schema = StructType([
StructField("event_id", StringType()),
StructField("customer_id", StringType()),
StructField("revenue", DoubleType()),
StructField("event_time", TimestampType()),
])

raw_stream = (
spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "kafka:9092")
.option("subscribe", "orders")
.option("startingOffsets", "latest")
.load()
.select(F.from_json(F.col("value").cast("string"), schema).alias("data"))
.select("data.*")
)

# Write to Delta with exactly-once guarantees
query = (
raw_stream
.writeStream
.format("delta")
.outputMode("append")
.option("checkpointLocation", "s3://bucket/checkpoints/orders/")
.trigger(processingTime="30 seconds")
.start("s3://bucket/lakehouse/orders/")
)

query.awaitTermination()

Iceberg + Flink. Apache Flink has first-class Iceberg integration through the iceberg-flink-runtime connector. Flink writes to Iceberg tables with millisecond latency, enabling true streaming lakehouse patterns with sub-second freshness.

Latency trade-off. For use cases requiring sub-second data freshness (real-time dashboards, fraud detection), the lakehouse pattern with streaming writes is viable. For sub-millisecond use cases, a dedicated stream processing system (Kafka, Kinesis, Redis) is still necessary - the lakehouse is not the right place for sub-millisecond state.


The Governance Layerโ€‹

A lakehouse without a catalog and governance layer is just a better-organized data swamp. Data governance - knowing who owns what, who can access what, and where data came from - becomes more important as the number of tables grows.

Apache Atlas / OpenMetadata. Open-source metadata catalog solutions that index table schemas, track lineage (which pipelines produced which tables), and provide a searchable data catalog. Atlas is older and more mature; OpenMetadata is newer with a better UI.

Databricks Unity Catalog. Databricks' managed governance layer for lakehouses. Provides three-tier namespace (catalog โ†’ schema โ†’ table), column-level security, row-level security, automated data lineage, and audit logging. Works across Delta Lake, Iceberg, and external tables. Covered in depth in Module 06 Lesson 06.

AWS Glue Data Catalog. The AWS-native metadata catalog. Works as an Iceberg and Delta Lake catalog, integrates with Athena, EMR, and Glue ETL. Simpler than Unity Catalog but sufficient for AWS-native architectures.

The minimum viable governance setup for a production lakehouse:

  1. A catalog - register every table with its schema, owner, and description
  2. IAM-based access control - S3 bucket policies and object-level permissions limit who can read which tables
  3. Lineage tracking - record which pipeline produced each table (job name + version + run ID in table properties)
  4. Schema change log - every schema evolution is recorded in the table's metadata history

Production Notesโ€‹

Right-size your files. The sweet spot for Parquet file size in a lakehouse is 128MB to 1GB. Files smaller than 32MB create excessive S3 LIST and manifest overhead. Files larger than 2GB parallelize poorly across worker nodes. Monitor your average file size with DESCRIBE DETAIL (Delta) or SELECT * FROM table.files (Iceberg). Schedule compaction when average file size drops below 64MB.

Understand your partition strategy before you write data. Changing a partition scheme after data exists is expensive - it typically requires rewriting all historical data. Think carefully: what are the most common filter predicates? If most queries filter by date, partition by date. If most queries filter by region, partition by region. Never partition by high-cardinality columns like user_id - you will create millions of directories with a handful of records each.

Test time travel retention against your compliance requirements. GDPR's "right to erasure" conflicts with time travel retention. If you retain 90 days of history, deleted records are still accessible via time travel for 90 days. Define a policy: either set a short retention (7 days, vacuum aggressively) or implement a separate secure-delete process that removes records from all historical snapshots.


Reading Lakehouse Tables Without a Query Engineโ€‹

A lakehouse's open format means you can read tables without running Spark or Trino. For ML inference scripts, lightweight cron jobs, and ad-hoc analysis, reading Delta or Iceberg tables directly from Python is practical and fast.

# Reading a Delta table from Python (no Spark, no JVM)
# pip install deltalake pyarrow

from deltalake import DeltaTable
import pandas as pd
import pyarrow.dataset as ds

# Point at the S3 path (or local path)
table_path = "s3://my-bucket/lakehouse/orders/"

dt = DeltaTable(
table_path,
storage_options={
"AWS_ACCESS_KEY_ID": "...",
"AWS_SECRET_ACCESS_KEY": "...",
"AWS_REGION": "us-east-1",
},
)

# Read into Pandas
df = dt.to_pandas(
filters=[
("order_date", ">=", "2024-01-01"),
("order_date", "<", "2024-02-01"),
],
columns=["order_id", "customer_id", "revenue"],
)

# Read into PyArrow (better for large datasets, zero-copy to NumPy)
arrow_table = dt.to_pyarrow_table(
filters=[("customer_id", "=", "c100")],
)
revenue_array = arrow_table.column("revenue").to_pylist()

# Time travel for ML reproducibility
dt_historical = DeltaTable(table_path, version=42)
training_df = dt_historical.to_pandas()
# Reading an Iceberg table from Python (no Spark)
# pip install pyiceberg pyarrow

from pyiceberg.catalog import load_catalog

catalog = load_catalog("my_catalog", **{
"type": "rest",
"uri": "http://iceberg-catalog:8181",
"credential": "client:secret",
})

table = catalog.load_table("analytics.orders")

# Scan with filters - pushed down to manifest and file level
df = (
table
.scan(
row_filter="order_date >= '2024-01-01' AND order_date < '2024-02-01'",
selected_fields=["order_id", "customer_id", "revenue"],
limit=100_000,
)
.to_pandas()
)

# Read as Arrow dataset for streaming/batched ML consumption
arrow_dataset = table.scan().to_arrow_table()

The ability to read large lakehouse tables from lightweight Python processes is a major enabler for ML workflows - training jobs can read directly from the lakehouse without a data copy or a separate feature materialization step.


Lakehouse vs. Modern Warehouse - Not Always Either/Orโ€‹

A nuance that is often glossed over: the lakehouse does not always eliminate the need for a data warehouse. For some organizations, a hybrid architecture makes sense indefinitely.

Snowflake's lakehouse mode. Snowflake introduced "Iceberg tables" - Snowflake tables backed by Iceberg files stored in your S3 bucket. You get Snowflake's SQL performance, governance, and BI tool integration while the underlying data is in open Iceberg format that Spark or Trino can also read. This is genuinely useful for organizations that need Snowflake's BI integration but also want ML teams to have direct file access.

BigQuery Omni. Google BigQuery can query data stored in other clouds' object storage through BigQuery Omni. Similarly, BigQuery supports querying Iceberg external tables in GCS. The warehouse query engine becomes a layer on top of open format storage rather than a silo.

The pattern. Use the lakehouse (Delta/Iceberg on S3) as the single source of truth. Use warehouse query engines (Snowflake, BigQuery) as one of several consumers - they read from the same Iceberg tables as Spark and Trino. The data lives once, in an open format, and multiple engines access it through their respective catalog integrations. This eliminates the synchronization ETL while retaining access to warehouse SQL features.

:::note The Real Question The real architectural question is not "lakehouse or warehouse" but "where does the canonical data live and what format is it in?" If the canonical data lives in a proprietary warehouse format, you are locked in. If it lives in open Parquet/Iceberg/Delta, you have choices. Build around the open format, and use warehouse engines as optional query layers. :::


The Metadata Layer in Depth - Reading a Table Step by Stepโ€‹

Understanding exactly what happens when a query engine reads a lakehouse table demystifies the whole architecture.

Reading a Delta Lake table - step by step:

  1. The engine receives: SELECT SUM(revenue) FROM orders WHERE order_date = '2024-01-15'
  2. It checks the _delta_log/ directory to find the latest checkpoint and any subsequent JSON files
  3. It replays the log to reconstruct the current set of active data files - this gives a list of file paths, partition values, and per-file column statistics
  4. For each file in the active set, it checks the order_date statistics: if min(order_date) > '2024-01-15' or max(order_date) < '2024-01-15', the file is skipped entirely
  5. For the remaining files, the engine reads only the revenue and order_date columns (column projection - Parquet columnar format means unneeded columns are never read)
  6. It applies the filter order_date = '2024-01-15' to remaining rows and aggregates revenue

At step 4, data skipping eliminates files. At step 5, columnar projection eliminates column I/O. These two optimizations together mean a well-designed lakehouse query reads a tiny fraction of the stored data.

Reading an Iceberg table - step by step:

  1. The engine queries the catalog (Glue, REST catalog) for the table's metadata.json location
  2. It reads metadata.json to find the current snapshot's manifest list file path
  3. It reads the manifest list to get the list of manifest files for the current snapshot
  4. For each manifest file, it reads the partition summary: if a manifest covers only partitions outside the query's date range, the whole manifest is skipped
  5. For manifests that pass partition pruning, it reads individual manifest entries: per-file column statistics further prune which Parquet files to read
  6. The engine reads only the necessary Parquet files, applying column projection and row filtering

The key performance difference: Iceberg never lists an S3 directory. All file paths come from metadata. On a 10-million-file table, the difference between "list all files" and "read manifest files to find the 100 relevant files" is the difference between a 10-minute query planning phase and a sub-second one.


Encoding the Tradeoffs in Mathโ€‹

The cost of a lakehouse query has three components:

Queryย Cost=Metadataย Readย Cost+Dataย Skipย Savings+IOย Cost\text{Query Cost} = \text{Metadata Read Cost} + \text{Data Skip Savings} + \text{IO Cost}

More precisely, for a table with NN total files where a query is selective over kk files:

Filesย Read=k(ifย dataย skippingย isย perfect)\text{Files Read} = k \quad \text{(if data skipping is perfect)} Filesย Readย Withoutย Skipping=N\text{Files Read Without Skipping} = N

The data skip ratio is 1โˆ’k/N1 - k/N. A well-ordered table with Z-ordering might have k/N=0.01k/N = 0.01 (1% of files read). Without Z-ordering on a large table, k/N=1.0k/N = 1.0 (full scan).

The metadata overhead is bounded by the manifest file count. In Iceberg, each manifest covers up to 100,000 file entries by default. A 10-million-file table has at most 100 manifest files. Reading 100 manifest files adds roughly 100ms to query planning - negligible compared to the seconds saved by skipping 9.99 million files.

This is the mathematical reason why Iceberg outperforms Hive at scale: Hive's partition listing scales O(Npartitions)O(N_\text{partitions}) in S3 API calls. Iceberg's metadata read scales O(Nmanifests)O(N_\text{manifests}) in file reads, where Nmanifestsโ‰ชNpartitionsN_\text{manifests} \ll N_\text{partitions}.


Common Lakehouse Design Mistakes and How to Avoid Themโ€‹

:::danger Too Many Small Partitions Partitioning a 100GB events table by (user_id, date) creates millions of partitions - one per user per day. S3 directory listing across millions of prefixes kills query performance even with Iceberg. The rule: a partition should contain at least 128MB of data. If a partition column has more than 10,000 distinct values, it is too high-cardinality for partitioning. Use Z-ordering instead. :::

:::danger Mixing Raw and Curated Data in the Same Table A common mistake is writing raw events and curated/aggregated summaries into the same Delta or Iceberg table, distinguished only by a record_type column. This breaks the schema evolution guarantees - raw events have different schemas than summaries. Separate concerns: one table for raw events, separate tables for curated aggregations. Use a catalog to express the lineage between them. :::

:::warning Schema Enforcement Turned Off in Production During development, it is tempting to enable mergeSchema = true on all writes to avoid friction. In production, schema merging can silently allow upstream schema drift - a producer adds a column with the wrong type, or renames a critical join key. Enable mergeSchema only for controlled schema migration operations. Use strict schema enforcement (mergeSchema = false) as the default in all production write jobs. :::

:::warning Not Monitoring Commit Frequency Lakehouse tables that receive hundreds of tiny commits per minute (from a misconfigured streaming job with a 1-second trigger interval) accumulate thousands of transaction log entries per day. Even with checkpointing, querying such tables involves reading many JSON files. If a streaming job writes to a Delta table, the trigger interval should be at least 30 seconds. Use Auto Compaction on Databricks or a scheduled OPTIMIZE job to mitigate the effects of high-frequency commits. :::


Hands-On: Benchmarking Data Skippingโ€‹

The best way to understand data skipping is to measure it. The following example demonstrates the difference between a full scan and a skipping-enabled scan on the same Delta table.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
import time

spark = (
SparkSession.builder
.appName("skipping-benchmark")
.config("spark.jars.packages", "io.delta:delta-spark_2.12:3.1.0")
.config("spark.sql.extensions",
"io.delta.sql.DeltaSparkSessionExtension")
.config("spark.sql.catalog.spark_catalog",
"org.apache.spark.sql.delta.catalog.DeltaCatalog")
.getOrCreate()
)

# Generate a 10GB dataset with 100 million rows
import random
from datetime import date, timedelta

# Write in random order (no skipping benefit)
df = spark.range(10_000_000).select(
F.col("id").cast("string").alias("event_id"),
(F.col("id") % 100_000).cast("string").alias("customer_id"),
(F.rand() * 1000).alias("revenue"),
F.date_add(F.lit("2024-01-01"), (F.col("id") % 365).cast("int")).alias("event_date"),
)

# Write unordered (typical streaming output - poor data skipping)
df.write.format("delta").mode("overwrite").save("/tmp/events_unordered/")

# Write with Z-ordering (for comparison)
df.write.format("delta").mode("overwrite").save("/tmp/events_ordered/")
spark.sql("""
OPTIMIZE delta.`/tmp/events_ordered/`
ZORDER BY (customer_id, event_date)
""")

# Benchmark: selective query (customer_id = '42000')
query = "SELECT COUNT(*), SUM(revenue) FROM delta.`{path}/` WHERE customer_id = '42000'"

# Unordered table - must scan all files
start = time.time()
spark.sql(query.format(path="events_unordered")).collect()
unordered_time = time.time() - start

# Z-ordered table - skips most files
start = time.time()
spark.sql(query.format(path="events_ordered")).collect()
ordered_time = time.time() - start

print(f"Unordered scan: {unordered_time:.1f}s")
print(f"Z-ordered scan: {ordered_time:.1f}s")
print(f"Speedup: {unordered_time / ordered_time:.1f}x")

# Check how many files were skipped
spark.sql("""
EXPLAIN COST
SELECT COUNT(*), SUM(revenue)
FROM delta.`/tmp/events_ordered/`
WHERE customer_id = '42000'
""").show(truncate=False)

On a typical 10GB dataset, a selective point query on a Z-ordered column will scan 1โ€“5% of the total files versus 100% without Z-ordering. The measured speedup is typically 10โ€“50x for selective queries. For full-table aggregations (no WHERE clause), Z-ordering provides no benefit - every file must be read regardless.


When Parquet Alone Is Enoughโ€‹

The lakehouse pattern and open table formats are powerful but not always necessary. Knowing when to use raw Parquet (without Delta or Iceberg overhead) is part of being a good data engineer.

Use raw Parquet when:

  • The dataset is static and never updated (a historical export, a reference dataset, a model artifact)
  • Only one writer exists and writes are always full replacements of a partition
  • The table is smaller than 10GB and always fully scanned - metadata overhead exceeds the skipping benefit
  • Your consumers only ever need the latest version - no time travel, no audit trail needed
  • The pipeline is a one-time migration or backfill

Use Delta Lake or Iceberg when:

  • Multiple concurrent writers exist or may exist in the future
  • Rows are updated or deleted (SCD Type 1, GDPR erasure, CDC from operational systems)
  • Time travel or audit trail is required
  • Schema evolution is expected
  • ML teams need reproducible dataset versions for model training

The cost of adding a table format to raw Parquet is low - it is a configuration change, not a storage migration. The maintenance overhead (compaction, snapshot expiry) is manageable. Default to using a table format for any table that will live longer than a few months. Use raw Parquet only for ephemeral, static, or single-use datasets.

# Decision helper: should this table use a table format?
def needs_table_format(
concurrent_writers: bool,
has_row_level_mutations: bool,
needs_time_travel: bool,
schema_will_evolve: bool,
table_size_gb: float,
always_fully_scanned: bool,
) -> str:
"""
Returns 'table_format' or 'raw_parquet' with rationale.
"""
reasons_for_format = []

if concurrent_writers:
reasons_for_format.append("concurrent writes require ACID")
if has_row_level_mutations:
reasons_for_format.append("UPDATE/DELETE/MERGE requires transaction log")
if needs_time_travel:
reasons_for_format.append("time travel requires snapshot history")
if schema_will_evolve:
reasons_for_format.append("schema evolution is safer with column IDs")
if table_size_gb > 10 and not always_fully_scanned:
reasons_for_format.append(
f"data skipping beneficial at {table_size_gb:.0f}GB scale"
)

if reasons_for_format:
return f"use_table_format: {', '.join(reasons_for_format)}"
return "raw_parquet: simple static dataset, format overhead not justified"

Interview Q&Aโ€‹

The interview questions for this lesson are covered in the dedicated Interview Q&A section earlier in this file. As a quick reference, the six high-signal questions for this lesson are:

  1. What is the fundamental problem the lakehouse architecture solves compared to running a data lake and data warehouse in parallel?
  2. How does a lakehouse achieve ACID transactions on top of object storage, which has no native transaction support?
  3. What is data skipping and why does it matter for lakehouse query performance?
  4. When would you choose Apache Iceberg over Delta Lake?
  5. Explain the difference between Copy-on-Write and Merge-on-Read table types.
  6. Why can't you just use a data warehouse for ML model training?

These are all answered in full in the Interview Q&A section above. Additionally, expect follow-up questions around:

"What is the metadata layer and why is it the core of the lakehouse pattern?"

The metadata layer is the component that transforms raw Parquet files on object storage into a table with ACID semantics. Without a metadata layer, a collection of Parquet files is just files - no transaction guarantees, no schema enforcement, no statistics for query planning. The metadata layer (Delta's _delta_log/, Iceberg's manifest hierarchy) is a second layer of structured data on top of the raw data files. It tracks which files are part of the current table state, what schema they conform to, and per-column statistics about their contents. The query engine reads the metadata layer first to plan a query, then reads only the data files identified as necessary. This is why lakehouses can achieve warehouse-like query performance while storing data in open formats on cheap object storage.

"How would you explain the lakehouse to a non-technical stakeholder?"

A data lake is like a warehouse where you pile everything on the floor - cheap to store but impossible to find anything. A data warehouse is like a well-organized filing cabinet - easy to find things but expensive to maintain and you can only store things that fit the cabinet's format. A lakehouse is like a warehouse with a smart inventory system: things are still stored cheaply on shelves (object storage), but there is a computer system that tracks exactly what is where, enforces rules about how new items are stored, and can instantly tell you what was on a specific shelf at any point in the past. Both the finance team (who need organized reports) and the AI team (who need to pick up the raw items directly) can use the same warehouse.


Summaryโ€‹

The data lake gave us cheap, scalable storage. The data warehouse gave us reliable, queryable tables. For a decade, running both meant double the infrastructure, constant synchronization jobs, and data consistency risk. The lakehouse architecture, formalized by Armbrust et al. in 2021, adds a metadata layer on top of open file formats in object storage to deliver ACID transactions, schema enforcement, time travel, and query performance - at lake costs and with open-standard formats. The three major implementations (Delta Lake, Apache Iceberg, Apache Hudi) each optimize for different trade-offs. Understanding which to choose, and why the metadata layer is the core of the pattern, is the foundation for everything else in this module.

ยฉ 2026 EngineersOfAI. All rights reserved.