
Module 08 - Cloud Data Platforms

Cloud data platforms abstract away infrastructure so engineers can focus on data. But each platform makes fundamentally different trade-offs on cost, performance, and ecosystem lock-in. Choosing the wrong one, or misusing the right one, can cost your team hundreds of thousands of dollars per year.

This module maps the landscape of modern cloud data platforms from the perspective of a data engineer building AI and ML systems.


Lessons

| # | Lesson | Key Concepts | Read Time |
|---|--------|--------------|-----------|
| 01 | Snowflake for ML | Virtual warehouses, Snowpark, time travel, zero-copy cloning | 30 min |
| 02 | Google BigQuery | Dremel engine, BQML, partitioning, Vertex AI integration | 30 min |
| 03 | AWS Data Services | S3 lakehouse, Glue, Athena, Redshift, EMR, Lake Formation | 35 min |
| 04 | Databricks | Delta Lake, DLT, MLflow, Feature Store, Unity Catalog | 35 min |
| 05 | Cost Optimisation | Query tuning, storage tiering, FinOps, chargeback | 25 min |
| 06 | Multi-Cloud Strategies | Data gravity, egress costs, open formats, federation | 25 min |

Prerequisites

This module builds on Modules 01–06. Before starting, you should be comfortable with:

  • SQL and columnar storage (Module 02 - Storage Formats)
  • Apache Spark for large-scale transformations (Module 03 - Batch Processing)
  • Data lakehouse architecture using Delta Lake or Apache Iceberg (Module 05 - Lakehouses)
  • Feature engineering patterns for ML (Module 06 - Feature Engineering)

Key Concepts Introduced in This Module

| Concept | What It Means |
|---------|---------------|
| Warehouse-as-a-service | Fully managed data warehouse - no servers, patching, or cluster config |
| Separation of compute and storage | Scale query engines independently from data storage - pay only for what you use |
| Virtual warehouses | Isolated compute clusters in Snowflake that can be started, stopped, and resized independently |
| Serverless query | Submit a SQL query, get results - no cluster to manage or wait for (BigQuery, Athena) |
| Data sharing | Share live data across accounts without copying - the foundation of data mesh on cloud platforms |
| Columnar storage | Data organized by column, not row - enables up to 10-100x compression, plus column pruning and predicate pushdown for analytics |
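
To make the columnar storage idea concrete, here is a minimal sketch in plain Python. It is a toy illustration, not how any of the platforms in this module implement storage: the table contents and the helper names `to_columns` and `run_length_encode` are hypothetical. It shows the two effects the concept relies on: an aggregate reads only the column it needs, and similar adjacent values compress well.

```python
from itertools import groupby

# Hypothetical toy table: each dict is one row (row-oriented layout).
rows = [
    {"user_id": 1, "country": "DE", "spend": 10.0},
    {"user_id": 2, "country": "DE", "spend": 12.5},
    {"user_id": 3, "country": "DE", "spend": 7.5},
    {"user_id": 4, "country": "US", "spend": 20.0},
    {"user_id": 5, "country": "US", "spend": 5.0},
]


def to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists."""
    return {name: [row[name] for row in rows] for name in rows[0]}


def run_length_encode(values):
    """Collapse runs of equal adjacent values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


columns = to_columns(rows)

# Column pruning: summing "spend" touches one list, not every field of every row.
total_spend = sum(columns["spend"])

# Values of one column sit next to each other, so simple encodings shrink them.
encoded_country = run_length_encode(columns["country"])

print(total_spend)      # 55.0
print(encoded_country)  # [('DE', 3), ('US', 2)]
```

Real columnar formats add dictionary encoding, per-block statistics, and general-purpose compression on top of this, so actual ratios depend heavily on the data.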

What You Will Be Able to Do After This Module

  1. Choose the right cloud data platform for a given workload - and explain the trade-offs clearly in an interview
  2. Build end-to-end ML feature pipelines on Snowflake using Snowpark and Python UDFs
  3. Use BigQuery ML to train and serve models entirely inside SQL, and know when to use Vertex AI instead
  4. Navigate the AWS data ecosystem - understand which service to use for ETL, warehousing, streaming, and ML
  5. Architect a production ML platform on Databricks using Delta Live Tables, MLflow, and Unity Catalog
  6. Reduce a cloud data bill by 40–60% using query optimization, storage tiering, and right-sizing strategies
  7. Design a multi-cloud data architecture that avoids expensive data gravity traps

:::note Prerequisites Check
If you have not read Module 05 (Lakehouses) yet, read the Delta Lake and Apache Iceberg lessons before starting Lesson 04 (Databricks). The Delta Live Tables content builds directly on those concepts.
:::

© 2026 EngineersOfAI. All rights reserved.