Module 07 - Pipeline Orchestration

A cron job is a silent promise you make to your future self: "this will keep working." It almost never does. Cron has no retry logic, no visibility into what failed, no record of what succeeded, and no concept of dependencies between jobs. When it breaks at 3am, you find out from a user the next morning - if you find out at all.

Pipeline orchestration is the infrastructure layer that makes pipelines reliable, observable, and recoverable. It introduces the DAG - a Directed Acyclic Graph of tasks with declared dependencies - so the system knows the order of execution, which tasks can run in parallel, and exactly what to retry when something fails. It adds a metadata store so every run has a record. It adds a UI so you can see the state of every task across every run. And it adds scheduling, backfill, alerting, and SLA tracking on top.
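To make the DAG idea concrete, here is a minimal sketch using only Python's standard-library `graphlib`; it is not how any particular orchestrator implements scheduling. The task names and the helpers `execution_waves` and `downstream_of` are hypothetical, chosen for illustration. The sketch shows how declared dependencies alone are enough to derive an execution order, find tasks that can run in parallel, and work out what is affected when one task fails.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dependencies = {
    "extract_orders": set(),
    "extract_users": set(),
    "join_tables": {"extract_orders", "extract_users"},
    "train_model": {"join_tables"},
    "publish_report": {"join_tables"},
}


def execution_waves(deps):
    """Group tasks into waves; tasks in one wave do not depend on each other."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph contains a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose upstream tasks are done
        waves.append(ready)
        sorter.done(*ready)
    return waves


def downstream_of(deps, failed):
    """Return every task that transitively depends on the failed task."""
    affected = set()
    frontier = [failed]
    while frontier:
        current = frontier.pop()
        for task, upstream in deps.items():
            if current in upstream and task not in affected:
                affected.add(task)
                frontier.append(task)
    return affected


waves = execution_waves(dependencies)
blocked = downstream_of(dependencies, "join_tables")
```

Each inner list in `waves` is a set of tasks that could run concurrently, and `blocked` holds the tasks that cannot proceed until `join_tables` is retried successfully. A real orchestrator layers persistence, scheduling, and a UI on top of this same graph.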

This module covers the three orchestrators that dominate modern data and ML engineering: Apache Airflow (the incumbent, built at Airbnb in 2014), Prefect (the developer-friendly modern alternative), and Dagster (the asset-centric orchestrator built for data quality). You will learn not just how to use each, but when each one fits - and how to test, monitor, and make architectural decisions about orchestration at scale.

Module Map

Lessons at a Glance

#	Lesson	Key Concepts	Read Time
01	Apache Airflow Architecture	DAG, Scheduler, Executors, Operators, XCom, Connections	25 min
02	Airflow for ML Pipelines	Training DAGs, ShortCircuitOperator, KubernetesPodOperator, champion/challenger	25 min
03	Prefect and Modern Orchestration	`@flow`, `@task`, Work Pools, Deployments, Automations	22 min
04	Dagster for Data Assets	Software-defined assets, ops, jobs, sensors, IO managers	22 min
05	Orchestration Patterns	Idempotency, backfill, SLA monitoring, data-aware scheduling	20 min
06	Testing and Monitoring Pipelines	Unit testing DAGs, task testing, observability, alerting	20 min
07	Choosing an Orchestrator	Decision matrix, migration paths, hybrid architectures	18 min

Key Concepts

Concept	What It Means
DAG	Directed Acyclic Graph - a graph of tasks with dependency edges and no cycles; the core unit in Airflow
Task dependencies	Declared upstream/downstream relationships that determine execution order
Backfill	Re-running a pipeline for past time intervals - requires idempotent tasks
SLA	Service Level Agreement - in orchestration, the deadline by which a pipeline run is expected to complete
Idempotency	Running a task multiple times with the same inputs leaves the same end state as running it once - essential for safe retries
Data-aware scheduling	Triggering a pipeline when upstream data is ready rather than on a fixed clock
Asset materialization	The act of computing and storing a data asset (table, model, file) - Dagster's core abstraction
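The link between idempotency and backfill in the table above can be illustrated with a small sketch. It assumes a hypothetical `daily_totals` table and a hypothetical `load_daily_totals` function, and uses an in-memory SQLite database purely for demonstration. The load replaces the whole partition for a run date inside one transaction, so a retry or a backfill of the same interval does not duplicate rows.

```python
import sqlite3


def load_daily_totals(conn, run_date, rows):
    """Idempotent load: replace the partition for run_date in one transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("DELETE FROM daily_totals WHERE run_date = ?", (run_date,))
        conn.executemany(
            "INSERT INTO daily_totals (run_date, customer, total) VALUES (?, ?, ?)",
            [(run_date, customer, total) for customer, total in rows],
        )


# Hypothetical table and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_totals (run_date TEXT, customer TEXT, total REAL)")

rows = [("alice", 120.0), ("bob", 75.5)]
load_daily_totals(conn, "2024-01-01", rows)
load_daily_totals(conn, "2024-01-01", rows)  # retry or backfill of the same interval

row_count = conn.execute("SELECT COUNT(*) FROM daily_totals").fetchone()[0]
```

After both calls `row_count` is still 2. An append-only insert would have left 4 rows, which is why non-idempotent tasks make retries and backfills unsafe.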

Prerequisites

This module assumes you have completed:

Module 01 - Data Engineering Fundamentals
Module 02 - SQL and Data Modelling
Module 03 - Batch Processing
Module 04 - Stream Processing
Module 05 - Data Warehouse and Lakehouse

You should be comfortable writing Python and reading SQL, and you should have a general understanding of what an ETL pipeline does.

What You Will Be Able to Do

By the end of this module you will be able to:

Write production-grade Apache Airflow DAGs with proper retry logic, SLA monitoring, and idempotent tasks
Orchestrate ML training pipelines with data quality gates and conditional deployment decisions
Build and deploy Prefect flows with work pools, parameterized deployments, and event-driven automations
Use Dagster's software-defined asset model to build pipelines that are observable and lineage-aware
Apply orchestration patterns - backfill, partitioning, fan-out - correctly in any orchestrator
Write unit and integration tests for pipeline code
Make an informed architectural decision between Airflow, Prefect, and Dagster given a team's constraints
