Module 07 - Pipeline Orchestration

A cron job is a silent promise you make to your future self: "this will keep working." It almost never does. Cron has no retry logic, no visibility into what failed, no record of what succeeded, and no concept of dependencies between jobs. When it breaks at 3am, you find out from a user the next morning - if you find out at all.

Pipeline orchestration is the infrastructure layer that makes pipelines reliable, observable, and recoverable. It introduces the DAG - a Directed Acyclic Graph of tasks with declared dependencies - so the system knows the order of execution, which tasks can run in parallel, and exactly what to retry when something fails. It adds a metadata store so every run has a record. It adds a UI so you can see the state of every task across every run. And it adds scheduling, backfill, alerting, and SLA tracking on top.
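
To make the DAG idea concrete, here is a minimal sketch using only Python's standard-library `graphlib`. It is not how Airflow, Prefect, or Dagster implement scheduling. The task names and the helpers `execution_waves` and `downstream_of` are hypothetical, chosen for illustration. The sketch shows how declared dependencies are enough to derive an execution order, groups of tasks that can run in parallel, and the set of tasks affected when one fails.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dependencies = {
    "extract_orders": set(),
    "extract_users": set(),
    "join": {"extract_orders", "extract_users"},
    "train": {"join"},
    "report": {"join"},
}


def execution_waves(deps):
    """Group tasks into waves; tasks in one wave have no unmet dependencies."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph contains a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything runnable right now
        waves.append(ready)
        sorter.done(*ready)  # mark the wave finished to unlock downstream tasks
    return waves


def downstream_of(deps, failed):
    """Return the failed task plus every task that transitively depends on it."""
    affected = {failed}
    changed = True
    while changed:
        changed = False
        for task, upstream in deps.items():
            if task not in affected and upstream & affected:
                affected.add(task)
                changed = True
    return affected


waves = execution_waves(dependencies)
to_rerun = downstream_of(dependencies, "join")
```

Here `waves` comes out as three groups: the two extract tasks together, then `join`, then `report` and `train` together. If `join` fails, `to_rerun` contains only `join`, `train`, and `report`; the extract tasks do not need to run again.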

This module covers the three orchestrators that dominate modern data and ML engineering: Apache Airflow (the incumbent, built at Airbnb in 2014), Prefect (the developer-friendly modern alternative), and Dagster (the asset-centric orchestrator built for data quality). You will learn not just how to use each, but when each one fits - and how to test, monitor, and make architectural decisions about orchestration at scale.


Lessons at a Glance

# | Lesson | Key Concepts | Read Time
01 | Apache Airflow Architecture | DAG, Scheduler, Executors, Operators, XCom, Connections | 25 min
02 | Airflow for ML Pipelines | Training DAGs, ShortCircuitOperator, KubernetesPodOperator, champion/challenger | 25 min
03 | Prefect and Modern Orchestration | @flow, @task, Work Pools, Deployments, Automations | 22 min
04 | Dagster for Data Assets | Software-defined assets, ops, jobs, sensors, IO managers | 22 min
05 | Orchestration Patterns | Idempotency, backfill, SLA monitoring, data-aware scheduling | 20 min
06 | Testing and Monitoring Pipelines | Unit testing DAGs, task testing, observability, alerting | 20 min
07 | Choosing an Orchestrator | Decision matrix, migration paths, hybrid architectures | 18 min

Key Concepts

Concept | What It Means
DAG | Directed Acyclic Graph - a graph of tasks with dependency edges and no cycles; the core unit in Airflow
Task dependencies | Declared upstream/downstream relationships that determine execution order
Backfill | Re-running a pipeline for past time intervals - requires idempotent tasks
SLA | Service Level Agreement - the maximum acceptable time for a pipeline to complete
Idempotency | Running a task multiple times with the same inputs leaves the same result as running it once - essential for safe retries
Data-aware scheduling | Triggering a pipeline when upstream data is ready rather than on a fixed clock
Asset materialization | The act of computing and storing a data asset (table, model, file) - Dagster's core abstraction
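
The link between idempotency and backfill can be sketched in plain Python. This is an illustration under stated assumptions, not any orchestrator's API: the warehouse is a plain dict keyed by date, and `load_daily_total` and `backfill` are hypothetical names. The point is that the task overwrites the partition for its run date instead of appending to it, so re-running any interval leaves the same final state.

```python
from datetime import date, timedelta

# Hypothetical source data: one row per order.
source_rows = [
    {"day": date(2024, 1, 1), "amount": 10},
    {"day": date(2024, 1, 1), "amount": 5},
    {"day": date(2024, 1, 2), "amount": 7},
]


def load_daily_total(warehouse, rows, run_date):
    """Idempotent load: recompute the partition for run_date and overwrite it."""
    total = sum(r["amount"] for r in rows if r["day"] == run_date)
    warehouse[run_date] = total  # overwrite, never append


def backfill(warehouse, rows, start, end):
    """Re-run the load for every day in the inclusive range [start, end]."""
    day = start
    while day <= end:
        load_daily_total(warehouse, rows, day)
        day += timedelta(days=1)


warehouse = {}
backfill(warehouse, source_rows, date(2024, 1, 1), date(2024, 1, 2))
first_state = dict(warehouse)

# Running the same backfill again changes nothing, so retries are safe.
backfill(warehouse, source_rows, date(2024, 1, 1), date(2024, 1, 2))
```

An append-style load (for example, adding `total` to whatever is already stored) would double the values on the second run, which is why backfill is listed above as requiring idempotent tasks.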

Prerequisites

This module assumes you have completed:

  • Module 01 - Data Engineering Fundamentals
  • Module 02 - SQL and Data Modelling
  • Module 03 - Batch Processing
  • Module 04 - Stream Processing
  • Module 05 - Data Warehouse and Lakehouse

You should be comfortable writing Python, reading SQL, and have a general understanding of what an ETL pipeline does.


What You Will Be Able to Do

By the end of this module you will be able to:

  • Write production-grade Apache Airflow DAGs with proper retry logic, SLA monitoring, and idempotent tasks
  • Orchestrate ML training pipelines with data quality gates and conditional deployment decisions
  • Build and deploy Prefect flows with work pools, parameterized deployments, and event-driven automations
  • Use Dagster's software-defined asset model to build pipelines that are observable and lineage-aware
  • Apply orchestration patterns - backfill, partitioning, fan-out - correctly in any orchestrator
  • Write unit and integration tests for pipeline code
  • Make an informed architectural decision between Airflow, Prefect, and Dagster given a team's constraints
© 2026 EngineersOfAI. All rights reserved.