Module 08 - Cloud Data Platforms

Cloud data platforms abstract away infrastructure so engineers can focus on data. But each platform makes fundamentally different trade-offs on cost, performance, and ecosystem lock-in. Choosing the wrong one - or using the right one badly - can cost your team hundreds of thousands of dollars per year.

This module maps the landscape of modern cloud data platforms from the perspective of a data engineer building AI and ML systems.

Module Map

Lessons

#	Lesson	Key Concepts	Read Time
01	Snowflake for ML	Virtual warehouses, Snowpark, time travel, zero-copy cloning	30 min
02	Google BigQuery	Dremel engine, BQML, partitioning, Vertex AI integration	30 min
03	AWS Data Services	S3 lakehouse, Glue, Athena, Redshift, EMR, Lake Formation	35 min
04	Databricks	Delta Lake, DLT, MLflow, Feature Store, Unity Catalog	35 min
05	Cost Optimisation	Query tuning, storage tiering, FinOps, chargeback	25 min
06	Multi-Cloud Strategies	Data gravity, egress costs, open formats, federation	25 min

Prerequisites

This module builds on Modules 01–06. Before starting, you should be comfortable with:

SQL and columnar storage (Module 02 - Storage Formats)
Apache Spark for large-scale transformations (Module 03 - Batch Processing)
Data lakehouse architecture using Delta Lake or Apache Iceberg (Module 05 - Lakehouses)
Feature engineering patterns for ML (Module 06 - Feature Engineering)

Key Concepts Introduced in This Module

Concept	What It Means
Warehouse-as-a-service	Fully managed data warehouse - no servers, patching, or cluster config
Separation of compute and storage	Scale query engines independently from data storage - pay only for what you use
Virtual warehouses	Isolated compute clusters in Snowflake that can be started, stopped, and resized independently
Serverless query	Submit a SQL query, get results - no cluster to manage or wait for (BigQuery, Athena)
Data sharing	Share live data across accounts without copying - the foundation of data mesh on cloud platforms
Columnar storage	Data organized by column, not row - enables high compression (up to 10-100x on repetitive columns), column pruning, and predicate pushdown for analytics
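
To make the columnar storage entry concrete, here is a minimal Python sketch. It is not how any of the platforms in this module store data internally. The sample rows and the helper names `to_columns` and `run_length_encode` are hypothetical and exist only for illustration. The sketch shows two effects of a column layout: an aggregate reads a single column instead of every row, and a column with few distinct values collapses under run-length encoding.

```python
from itertools import groupby

# Hypothetical sample data: rows of (user_id, country, amount).
rows = [
    (1, "DE", 10.0),
    (2, "DE", 12.5),
    (3, "DE", 7.0),
    (4, "US", 3.0),
    (5, "US", 8.5),
]


def to_columns(rows, names):
    """Pivot row tuples into a dict of column name -> list of values."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


columns = to_columns(rows, ["user_id", "country", "amount"])

# Column pruning: the aggregate touches only the `amount` list,
# not the user_id or country values.
total_amount = sum(columns["amount"])

# Compression: five country values shrink to two (value, count) runs.
encoded_country = run_length_encode(columns["country"])

print(total_amount)     # 41.0
print(encoded_country)  # [('DE', 3), ('US', 2)]
```

Real columnar formats combine several encodings (dictionary, run-length, bit-packing) with general-purpose compression, so the ratios achieved in practice depend heavily on the data.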

What You Will Be Able to Do After This Module

Choose the right cloud data platform for a given workload - and explain the trade-offs clearly in an interview
Build end-to-end ML feature pipelines on Snowflake using Snowpark and Python UDFs
Use BigQuery ML to train and serve models entirely inside SQL, and know when to use Vertex AI instead
Navigate the AWS data ecosystem - understand which service to use for ETL, warehousing, streaming, and ML
Architect a production ML platform on Databricks using Delta Live Tables, MLflow, and Unity Catalog
Reduce a cloud data bill by 40–60% using query optimization, storage tiering, and right-sizing strategies
Design a multi-cloud data architecture that avoids expensive data gravity traps

:::note Prerequisites Check
If you have not read Module 05 (Lakehouses) yet, read the Delta Lake and Apache Iceberg lessons before starting Lesson 04 (Databricks). The Delta Live Tables content builds directly on those concepts.
:::
