
Module 2: Batch Processing for ML

Batch processing is the backbone of ML data engineering. Every model training run, feature backfill, and nightly aggregation is a batch job running at scale. This module takes you from the fundamentals of distributed computation with Apache Spark through to production-grade orchestration, testing, and performance tuning.

By the end of this module, you will be able to design, build, test, and tune batch pipelines that process hundreds of gigabytes to petabytes of data reliably and efficiently. These are the skills that separate junior data engineers from the senior engineers who own production ML infrastructure.


Lesson Progression

Lessons 01 and 02 establish Spark as your primary batch computation engine. Lessons 03 and 04 cover dbt for SQL-based transformations, which complements Spark rather than replacing it. Lesson 05 goes deep on cloud SQL engines at scale. Lessons 06 and 07 address the operational concerns: scheduling, dependency management, and data quality testing. Lesson 08 is an advanced performance chapter for engineers who own high-scale Spark pipelines.


Lessons at a Glance

#  | Title                        | Level        | Key Concept
01 | Apache Spark Architecture    | Beginner     | RDDs, DAG execution model, Catalyst optimizer, Tungsten engine
02 | Spark for ML Pipelines       | Beginner     | Window functions, PySpark UDFs, MLlib Pipelines, point-in-time joins
03 | dbt for Data Transformation  | Beginner     | Models, materializations, tests, lineage, incremental builds
04 | dbt Advanced Patterns        | Intermediate | Macros, snapshots, exposures, CI/CD for data transforms
05 | SQL at Scale                 | Intermediate | Partitioning, clustering, query plans, warehouse-specific optimization
06 | Batch Orchestration Patterns | Intermediate | Airflow DAGs, Dagster assets, sensors, SLAs, backfill strategies
07 | Testing Data Pipelines       | Intermediate | Great Expectations, dbt tests, contract enforcement, anomaly detection
08 | Performance Tuning Spark     | Advanced     | Data skew, shuffle tuning, AQE, broadcast joins, memory management

What You Will Build

  • A complete PySpark feature pipeline that computes rolling window features from billions of rows of transaction data, correctly partitioned for efficient training set construction
  • A dbt project with incremental models, macros, and automated data quality tests that catch schema drift and value anomalies before they reach model training
  • An Airflow DAG with dependency management, SLA monitoring, and smart backfill logic for a nightly ML feature refresh
  • A performance-tuned Spark job that handles a skewed dataset using salting, adaptive query execution, and broadcast join optimization, with the goal of reducing runtime from hours to minutes
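
To give a feel for the salting technique named in the last project, here is a minimal sketch in plain Python rather than Spark. It assumes a hypothetical list of (merchant, amount) rows in which one key dominates; the names `salted_sum`, `NUM_SALTS`, and the sample data are illustrative only and are not part of the lesson code. The idea is a two-stage aggregation: first sum per (key, salt) so the hot key is split across several sub-keys, then merge the partial sums back per original key.

```python
import random
from collections import defaultdict

NUM_SALTS = 4  # hypothetical salt count; in practice chosen from the observed skew


def salted_sum(rows, num_salts=NUM_SALTS, seed=0):
    """Two-stage sum: partial totals per (key, salt), then merged per key."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible

    # Stage 1: each row lands on a random sub-key, spreading a hot key
    # over up to num_salts groups instead of a single one.
    partial = defaultdict(int)
    for key, amount in rows:
        partial[(key, rng.randrange(num_salts))] += amount

    # Stage 2: drop the salt and combine the much smaller partial results.
    totals = defaultdict(int)
    for (key, _salt), subtotal in partial.items():
        totals[key] += subtotal
    return dict(totals), dict(partial)


# Hypothetical skewed input: one merchant owns almost every row.
rows = [("hot_merchant", 1)] * 1000 + [("small_merchant", 2)] * 10
totals, partial = salted_sum(rows)
hot_groups = [k for k in partial if k[0] == "hot_merchant"]
```

In a distributed engine, the stage 1 groups would be processed by different tasks, so no single task has to handle every row of the hot key. The final totals are the same as in an unsalted aggregation.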

Prerequisites

This module builds directly on Module 1 (Foundations of Data Engineering for AI). Before starting, you should be comfortable with:

  • The medallion architecture (bronze → silver → gold layers)
  • Idempotency as a design principle for data pipelines
  • The difference between ETL and ELT, and when each pattern applies
  • Basic Python (functions, classes, list comprehensions, context managers)
  • SQL fundamentals (SELECT, JOIN, GROUP BY, window functions at a basic level)
  • What a feature store is and why point-in-time correctness matters

If any of these feel shaky, revisit the relevant lessons in Module 1 before continuing. The lessons in this module assume these concepts without re-explaining them.
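
As a quick self-check on the last prerequisite, point-in-time correctness can be sketched in a few lines of plain Python. The example assumes a hypothetical feature history of (day, value) pairs sorted by day, and a list of label days; `point_in_time_lookup` is an illustrative helper, not a feature store API. For each label it returns the most recent feature value known at or before that day, so no future information leaks into the training row.

```python
from bisect import bisect_right


def point_in_time_lookup(feature_history, event_time):
    """Return the latest value with timestamp <= event_time, or None.

    feature_history must be sorted by timestamp in ascending order.
    """
    timestamps = [ts for ts, _value in feature_history]
    idx = bisect_right(timestamps, event_time)
    return feature_history[idx - 1][1] if idx > 0 else None


# Hypothetical feature snapshots: (day number, rolling spend).
history = [(1, 10.0), (5, 12.5), (9, 40.0)]

# Hypothetical label days for which training rows are needed.
label_days = [0, 5, 7, 12]

training_rows = [(day, point_in_time_lookup(history, day)) for day in label_days]
```

The label on day 7 receives the day-5 value (12.5), not the day-9 value that was computed later, and the label on day 0 receives None because no feature existed yet. Whether a snapshot stamped exactly at the event time counts as "known" is a design choice; this sketch includes it.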


How This Module Connects to the Rest of the Course

Batch processing is the foundation layer. Every other module builds on the skills developed here:

  • Module 3 (Stream Processing) contrasts with batch. You will understand exactly when stream processing is worth the added complexity versus a well-tuned Spark batch job.
  • Module 5 (Feature Stores) builds on the feature pipeline patterns in Lessons 01 and 02. The feature store is the destination for the data this module teaches you to compute.
  • Module 7 (Pipeline Orchestration) orchestrates the Spark jobs and dbt transformations you build in this module. Airflow DAGs call spark-submit, monitor job completion, and trigger downstream processes.
  • Module 9 (Data Observability) monitors the output quality of the pipelines you build here, detecting schema drift, row count anomalies, and statistical drift before they reach model training.

The skills in this module come up regularly in senior data engineer and ML engineer interviews. Questions about Spark internals, dbt incremental models, and SQL query optimization appear consistently across companies at all scales. Work through the code examples. Run them. Read the query plans. The understanding you build here will be directly applicable on your first day in a production ML engineering role.

© 2026 EngineersOfAI. All rights reserved.