Module 08 - Cloud Data Platforms
Cloud data platforms abstract away infrastructure so engineers can focus on data. But each platform makes fundamentally different trade-offs on cost, performance, and ecosystem lock-in. Choosing the wrong one, or misusing the right one, can cost your team hundreds of thousands of dollars per year.
This module maps the landscape of modern cloud data platforms from the perspective of a data engineer building AI and ML systems.
Module Map
Lessons
| # | Lesson | Key Concepts | Read Time |
|---|---|---|---|
| 01 | Snowflake for ML | Virtual warehouses, Snowpark, time travel, zero-copy cloning | 30 min |
| 02 | Google BigQuery | Dremel engine, BQML, partitioning, Vertex AI integration | 30 min |
| 03 | AWS Data Services | S3 lakehouse, Glue, Athena, Redshift, EMR, Lake Formation | 35 min |
| 04 | Databricks | Delta Lake, DLT, MLflow, Feature Store, Unity Catalog | 35 min |
| 05 | Cost Optimisation | Query tuning, storage tiering, FinOps, chargeback | 25 min |
| 06 | Multi-Cloud Strategies | Data gravity, egress costs, open formats, federation | 25 min |
Prerequisites
This module builds on Modules 01–06. Before starting, you should be comfortable with:
- SQL and columnar storage (Module 02 - Storage Formats)
- Apache Spark for large-scale transformations (Module 03 - Batch Processing)
- Data lakehouse architecture using Delta Lake or Apache Iceberg (Module 05 - Lakehouses)
- Feature engineering patterns for ML (Module 06 - Feature Engineering)
Key Concepts Introduced in This Module
| Concept | What It Means |
|---|---|
| Warehouse-as-a-service | Fully managed data warehouse - no servers, patching, or cluster config |
| Separation of compute and storage | Scale query engines independently from data storage - pay only for what you use |
| Virtual warehouses | Isolated compute clusters in Snowflake that can be started, stopped, and resized independently |
| Serverless query | Submit a SQL query, get results - no cluster to manage or wait for (BigQuery, Athena) |
| Data sharing | Share live data across accounts without copying - the foundation of data mesh on cloud platforms |
| Columnar storage | Data organized by column, not row - similar values sit together and compress well, and queries read only the columns they need, with predicate pushdown skipping data that cannot match a filter |
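
To make the columnar storage row above more concrete, here is a minimal sketch in plain Python. It is not how any of the platforms in this module implement storage; the names `ColumnChunk`, `build_chunks`, `scan_greater_than`, and `run_length_encode` are hypothetical and exist only for illustration. It assumes a column is split into fixed-size chunks that each record their minimum and maximum value, which is roughly the idea behind the statistics that let an engine skip data during predicate pushdown.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ColumnChunk:
    """One chunk of a single column, with min/max statistics (illustrative only)."""
    values: List[int]

    @property
    def min_value(self) -> int:
        return min(self.values)

    @property
    def max_value(self) -> int:
        return max(self.values)


def build_chunks(values: List[int], chunk_size: int) -> List[ColumnChunk]:
    """Split one column into fixed-size chunks."""
    return [
        ColumnChunk(values[i:i + chunk_size])
        for i in range(0, len(values), chunk_size)
    ]


def scan_greater_than(chunks: List[ColumnChunk], threshold: int) -> Tuple[List[int], int]:
    """Return values > threshold and the number of chunks that had to be read."""
    matches: List[int] = []
    chunks_read = 0
    for chunk in chunks:
        # Predicate pushdown: if the chunk's max cannot satisfy the filter,
        # skip the chunk without touching its values.
        if chunk.max_value <= threshold:
            continue
        chunks_read += 1
        matches.extend(v for v in chunk.values if v > threshold)
    return matches, chunks_read


def run_length_encode(values: list) -> List[tuple]:
    """Collapse runs of repeated values into (value, count) pairs."""
    runs: List[tuple] = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            runs.append((v, 1))
    return runs


if __name__ == "__main__":
    amounts = list(range(1000))            # a sorted "amount" column
    chunks = build_chunks(amounts, 100)    # 10 chunks of 100 values each
    matches, chunks_read = scan_greater_than(chunks, 899)
    print(f"read {chunks_read} of {len(chunks)} chunks")  # read 1 of 10 chunks

    country = ["DE"] * 4 + ["US"] * 3 + ["DE"]
    print(run_length_encode(country))  # [('DE', 4), ('US', 3), ('DE', 1)]
```

The skip only helps when values are clustered, as in the sorted column here; on randomly ordered data most chunks would span the full value range and would all need to be read. The same applies to run-length encoding, which is one reason partitioning and clustering choices matter on the platforms covered in the lessons.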
What You Will Be Able to Do After This Module
- Choose the right cloud data platform for a given workload - and explain the trade-offs clearly in an interview
- Build end-to-end ML feature pipelines on Snowflake using Snowpark and Python UDFs
- Use BigQuery ML to train and serve models entirely inside SQL, and know when to use Vertex AI instead
- Navigate the AWS data ecosystem - understand which service to use for ETL, warehousing, streaming, and ML
- Architect a production ML platform on Databricks using Delta Live Tables, MLflow, and Unity Catalog
- Reduce a cloud data bill by 40–60% using query optimization, storage tiering, and right-sizing strategies
- Design a multi-cloud data architecture that avoids expensive data gravity traps
:::note Prerequisites Check
If you have not read Module 05 (Lakehouses) yet, read the Delta Lake and Apache Iceberg lessons before starting Lesson 04 (Databricks). The Delta Live Tables content builds directly on those concepts.
:::
© 2026 EngineersOfAI. All rights reserved.
