Module 07 - Pipeline Orchestration
A cron job is a silent promise you make to your future self: "this will keep working." It almost never does. Cron has no retry logic, no visibility into what failed, no record of what succeeded, and no concept of dependencies between jobs. When it breaks at 3am, you find out from a user the next morning - if you find out at all.
Pipeline orchestration is the infrastructure layer that makes pipelines reliable, observable, and recoverable. It introduces the DAG - a Directed Acyclic Graph of tasks with declared dependencies - so the system knows the order of execution, which tasks can run in parallel, and exactly what to retry when something fails. It adds a metadata store so every run has a record. It adds a UI so you can see the state of every task across every run. And it adds scheduling, backfill, alerting, and SLA tracking on top.
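To make the DAG idea concrete, here is a minimal sketch using only Python's standard-library `graphlib`. It is not how any particular orchestrator is implemented. The task names and the `execution_waves` and `downstream_of` helpers are hypothetical, chosen for illustration. It shows how declared dependencies are enough to derive an execution order, find tasks that can run in parallel, and work out what is affected when one task fails.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dependencies = {
    "extract_orders": set(),
    "extract_users": set(),
    "join": {"extract_orders", "extract_users"},
    "train_model": {"join"},
    "publish_report": {"join"},
}


def execution_waves(deps):
    """Group tasks into waves; tasks in the same wave can run in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # all tasks whose upstreams are done
        waves.append(ready)
        sorter.done(*ready)
    return waves


def downstream_of(deps, failed):
    """Return the failed task plus every task that transitively depends on it."""
    affected = {failed}
    changed = True
    while changed:
        changed = False
        for task, upstream in deps.items():
            if task not in affected and upstream & affected:
                affected.add(task)
                changed = True
    return affected


waves = execution_waves(dependencies)
for number, wave in enumerate(waves, start=1):
    print(f"wave {number}: {wave}")

# If "join" fails, only it and its downstream tasks need to be re-run.
print(sorted(downstream_of(dependencies, "join")))
```

Real orchestrators layer scheduling, persistence, and retries on top of this, but the dependency graph is what lets them retry `join` and its downstream tasks without re-running the extracts.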
This module covers the three orchestrators that dominate modern data and ML engineering: Apache Airflow (the incumbent, built at Airbnb in 2014), Prefect (the developer-friendly modern alternative), and Dagster (the asset-centric orchestrator built for data quality). You will learn not just how to use each, but when each one fits - and how to test, monitor, and make architectural decisions about orchestration at scale.
Module Map
Lessons at a Glance
| # | Lesson | Key Concepts | Read Time |
|---|---|---|---|
| 01 | Apache Airflow Architecture | DAG, Scheduler, Executors, Operators, XCom, Connections | 25 min |
| 02 | Airflow for ML Pipelines | Training DAGs, ShortCircuitOperator, KubernetesPodOperator, champion/challenger | 25 min |
| 03 | Prefect and Modern Orchestration | @flow, @task, Work Pools, Deployments, Automations | 22 min |
| 04 | Dagster for Data Assets | Software-defined assets, ops, jobs, sensors, IO managers | 22 min |
| 05 | Orchestration Patterns | Idempotency, backfill, SLA monitoring, data-aware scheduling | 20 min |
| 06 | Testing and Monitoring Pipelines | Unit testing DAGs, task testing, observability, alerting | 20 min |
| 07 | Choosing an Orchestrator | Decision matrix, migration paths, hybrid architectures | 18 min |
Key Concepts
| Concept | What It Means |
|---|---|
| DAG | Directed Acyclic Graph - a graph of tasks with dependency edges and no cycles; the core unit in Airflow |
| Task dependencies | Declared upstream/downstream relationships that determine execution order |
| Backfill | Re-running a pipeline for past time intervals - requires idempotent tasks |
| SLA | Service Level Agreement - in orchestration, the maximum acceptable time for a pipeline run to complete, usually tracked as a deadline per run |
| Idempotency | Running a task multiple times with the same inputs leaves the system in the same state as running it once - essential for safe retries |
| Data-aware scheduling | Triggering a pipeline when upstream data is ready rather than on a fixed clock |
| Asset materialization | The act of computing and storing a data asset (table, model, file) - Dagster's core abstraction |
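The link between idempotency and backfill in the table above can be sketched in plain Python. This is an illustrative toy under stated assumptions, not any orchestrator's API: the in-memory `warehouse` dict stands in for a date-partitioned table, and `load_partition` and `backfill` are hypothetical names. The load is idempotent because it replaces the whole partition for a given logical date instead of appending to it.

```python
from datetime import date, timedelta

# Stand-in for a date-partitioned table: logical date -> rows for that date.
warehouse = {}


def load_partition(store, logical_date, source_rows):
    """Idempotent load: overwrite the partition for logical_date, never append."""
    store[logical_date] = [row for row in source_rows if row["day"] == logical_date]


def backfill(store, start, end, source_rows):
    """Re-run the load for every day from start to end, inclusive."""
    day = start
    while day <= end:
        load_partition(store, day, source_rows)
        day += timedelta(days=1)


# Hypothetical source data for illustration only.
source = [
    {"day": date(2024, 1, 1), "amount": 10},
    {"day": date(2024, 1, 2), "amount": 20},
    {"day": date(2024, 1, 2), "amount": 5},
]

backfill(warehouse, date(2024, 1, 1), date(2024, 1, 3), source)
first_run = {day: list(rows) for day, rows in warehouse.items()}

# Running the same backfill again (for example, after a retry) changes nothing.
backfill(warehouse, date(2024, 1, 1), date(2024, 1, 3), source)
print(warehouse == first_run)
```

Had `load_partition` appended rows instead of overwriting, the second backfill would have duplicated every row, which is why backfills are only safe over idempotent tasks.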
Prerequisites
This module assumes you have completed:
- Module 01 - Data Engineering Fundamentals
- Module 02 - SQL and Data Modelling
- Module 03 - Batch Processing
- Module 04 - Stream Processing
- Module 05 - Data Warehouse and Lakehouse
You should be comfortable writing Python and reading SQL, and you should have a general understanding of what an ETL pipeline does.
What You Will Be Able to Do
By the end of this module you will be able to:
- Write production-grade Apache Airflow DAGs with proper retry logic, SLA monitoring, and idempotent tasks
- Orchestrate ML training pipelines with data quality gates and conditional deployment decisions
- Build and deploy Prefect flows with work pools, parameterized deployments, and event-driven automations
- Use Dagster's software-defined asset model to build pipelines that are observable and lineage-aware
- Apply orchestration patterns - backfill, partitioning, fan-out - correctly in any orchestrator
- Write unit and integration tests for pipeline code
- Make an informed architectural decision between Airflow, Prefect, and Dagster given a team's constraints
