Module 2: Batch Processing for ML
Batch processing is the backbone of ML data engineering. Every model training run, every feature backfill, every nightly aggregation - these are batch jobs running at scale. This module takes you from the fundamentals of distributed computation with Apache Spark through to production-grade orchestration, testing, and performance tuning.
By the end of this module, you will be able to design, build, test, and tune batch pipelines that process hundreds of gigabytes to petabytes of data reliably and efficiently - the exact skills that separate junior data engineers from the senior engineers who own production ML infrastructure.
Lesson Progression
Lessons 01 and 02 establish Spark as your primary batch computation engine. Lessons 03 and 04 cover dbt for SQL-based transformations - complementary to Spark, not a replacement. Lesson 05 goes deep on cloud SQL engines at scale. Lessons 06 and 07 address the operational concerns: scheduling, dependency management, and data quality testing. Lesson 08 is an advanced performance lesson for engineers who own high-scale Spark pipelines.
Lessons at a Glance
| # | Title | Level | Key Concept |
|---|---|---|---|
| 01 | Apache Spark Architecture | Beginner | RDDs, DAG execution model, Catalyst optimizer, Tungsten engine |
| 02 | Spark for ML Pipelines | Beginner | Window functions, PySpark UDFs, MLlib Pipelines, point-in-time joins |
| 03 | dbt for Data Transformation | Beginner | Models, materializations, tests, lineage, incremental builds |
| 04 | dbt Advanced Patterns | Intermediate | Macros, snapshots, exposures, CI/CD for data transforms |
| 05 | SQL at Scale | Intermediate | Partitioning, clustering, query plans, warehouse-specific optimization |
| 06 | Batch Orchestration Patterns | Intermediate | Airflow DAGs, Dagster assets, sensors, SLAs, backfill strategies |
| 07 | Testing Data Pipelines | Intermediate | Great Expectations, dbt tests, contract enforcement, anomaly detection |
| 08 | Performance Tuning Spark | Advanced | Data skew, shuffle tuning, AQE, broadcast joins, memory management |
What You Will Build
- A complete PySpark feature pipeline that computes rolling window features from billions of rows of transaction data, correctly partitioned for efficient training set construction
- A dbt project with incremental models, macros, and automated data quality tests that catch schema drift and value anomalies before they reach model training
- An Airflow DAG with dependency management, SLA monitoring, and smart backfill logic for a nightly ML feature refresh
- A performance-tuned Spark job that handles a skewed dataset using salting, adaptive query execution, and broadcast join optimization - reducing runtime from hours to minutes
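The salting technique named in the last project can be previewed without a cluster. The sketch below is plain Python, not Spark, and only illustrates the two-stage shape of a salted aggregation: spread each key over several salted sub-keys, aggregate those, then merge per original key. The function name `salted_sum`, the salt count, and the merchant keys are all hypothetical and chosen for illustration.

```python
import random
from collections import defaultdict


def salted_sum(rows, num_salts=4, seed=0):
    """Sum amounts per key in two stages, the way a salted aggregation does.

    rows: iterable of (key, amount) pairs.
    Returns (totals, partial_sizes), where partial_sizes counts the rows
    that landed in each (key, salt) group during stage 1.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    partial_sums = defaultdict(int)
    partial_sizes = defaultdict(int)

    # Stage 1: attach a random salt so one hot key spreads over
    # num_salts groups instead of landing in a single huge one.
    for key, amount in rows:
        salted_key = (key, rng.randrange(num_salts))
        partial_sums[salted_key] += amount
        partial_sizes[salted_key] += 1

    # Stage 2: drop the salt and merge the partial sums per original key.
    totals = defaultdict(int)
    for (key, _salt), subtotal in partial_sums.items():
        totals[key] += subtotal
    return dict(totals), dict(partial_sizes)


# Hypothetical skewed input: one merchant dominates the data.
rows = [("hot_merchant", 1)] * 1000 + [("cold_merchant", 1)] * 10
totals, partial_sizes = salted_sum(rows)
```

The totals match an unsalted sum, while no single stage-1 group holds all rows of the hot key. This two-stage trick only applies to aggregations whose partial results can be merged, such as sums, counts, minimums, and maximums.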
Prerequisites
This module builds directly on Module 1 (Foundations of Data Engineering for AI). Before starting, you should be comfortable with:
- The medallion architecture (bronze → silver → gold layers)
- Idempotency as a design principle for data pipelines
- The difference between ETL and ELT, and when each pattern applies
- Basic Python (functions, classes, list comprehensions, context managers)
- SQL fundamentals (SELECT, JOIN, GROUP BY, window functions at a basic level)
- What a feature store is and why point-in-time correctness matters
If any of these feel shaky, revisit the relevant lessons in Module 1 before continuing. The lessons in this module assume these concepts without re-explaining them.
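As a quick self-check on the last prerequisite, point-in-time correctness means a training row may only see feature values that were known at or before its own timestamp. The sketch below is a minimal plain-Python illustration under stated assumptions: the function name `latest_feature_as_of` and the sample history are hypothetical, timestamps are plain integers, and a value recorded exactly at the lookup time is treated as visible (a design choice; some pipelines exclude it).

```python
from bisect import bisect_right


def latest_feature_as_of(feature_history, as_of):
    """Return the most recent feature value recorded at or before as_of.

    feature_history: list of (timestamp, value) pairs sorted by timestamp.
    Returns None when no value existed yet at as_of.
    """
    timestamps = [ts for ts, _value in feature_history]
    # bisect_right gives the count of timestamps <= as_of, so the entry
    # just before that position is the latest one visible at as_of.
    idx = bisect_right(timestamps, as_of)
    if idx == 0:
        return None
    return feature_history[idx - 1][1]


# Hypothetical history of one customer's rolling spend feature.
history = [(1, 10.0), (5, 20.0), (9, 30.0)]

value_at_6 = latest_feature_as_of(history, 6)   # sees the value from t=5
value_at_0 = latest_feature_as_of(history, 0)   # nothing recorded yet
```

Looking up the value at time 6 must not return the value written at time 9, even though that row exists in the table today. Joining on the latest available value instead would leak future information into the training set.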
How This Module Connects to the Rest of the Course
Batch processing is the foundation layer. Later modules build on the skills developed here:
- Module 3 (Stream Processing) contrasts with batch - you will understand exactly when stream processing is worth the added complexity versus a well-tuned Spark batch job.
- Module 5 (Feature Stores) builds on the feature pipeline patterns in Lessons 01 and 02. The feature store is the destination for the data this module teaches you to compute.
- Module 7 (Pipeline Orchestration) orchestrates the Spark jobs and dbt transformations you build in this module. Airflow DAGs invoke spark-submit, monitor job completion, and trigger downstream processes.
- Module 9 (Data Observability) monitors the output quality of the pipelines you build here - detecting schema drift, row count anomalies, and statistical drift before they reach model training.
The skills in this module come up routinely in senior data engineer and ML engineer interviews. Questions about Spark internals, dbt incremental models, and SQL query optimization appear consistently across companies at all scales. Work through the code examples. Run them. Read the query plans. The understanding you build here will be directly applicable on your first day in a production ML engineering role.
