Choosing an Orchestrator for Your AI Data Stack
The $300,000 Orchestrator Mistake
A team spent 6 months building on Airflow. They had a working platform: scheduled ETL, feature engineering, model retraining. Then a product lead saw a demo of Prefect - cleaner UI, simpler Python, local development without a Docker stack. The team migrated. 3 months of engineering effort.
Three months later, they were back to evaluating Airflow. Prefect's sensor ecosystem was too thin for their use case: they needed to react to S3 events, poll an external vendor API, and coordinate across 14 upstream data sources. In Airflow, they had operators for all of this. In Prefect, they were writing custom sensors from scratch.
The migration back to Airflow took another 2 months. Total cost: approximately 5 engineers × 5 months of migration work × $12,000 per engineer-month in loaded cost, or about $300,000. The correct orchestrator choice at the beginning would have avoided all of it. The problem was not that Prefect is bad - it is genuinely excellent for many use cases. The problem was that the team chose based on developer experience and UI quality without evaluating the sensor ecosystem depth that their specific architecture required.
The right orchestrator depends on your team's needs, your existing stack, and the specific problems you are solving today. This lesson provides a structured framework for making that decision - and making it once.
Why This Decision Is Hard - The Orchestration Landscape Has Legitimate Competition
In most software categories, one tool dominates and the choice is obvious. Kubernetes for container orchestration. PostgreSQL for relational data. Python for ML. The orchestration landscape is genuinely competitive because the problem is genuinely multi-dimensional and different tools have made different trade-offs along those dimensions.
Airflow optimized for breadth: 10 years of operators and integrations, battle-hardened at massive scale, a large commercial ecosystem. Prefect optimized for developer experience: Python-native, fast local iteration, beautiful UI. Dagster optimized for data quality: assets as first-class citizens, lineage built-in, testing is simple. Temporal optimized for reliability: durable execution, fault-tolerant workflows, built for distributed systems engineers.
None of these is uniformly better than the others. The decision depends on what you value most - and what you will be unable to change once you have built a production platform on top of a choice.
Understanding the trade-offs in depth is the only way to avoid the 6-month-then-migrate pattern that wastes engineering capacity and erodes team trust.
Historical Context - How We Got Here
2007: Oozie. Apache Oozie became the standard for Hadoop workflow orchestration. XML-based job definitions, tight Hadoop integration, limited programmability. This is where the DAG model was established as the canonical primitive for pipeline orchestration.
2012: Luigi. Spotify open-sourced Luigi, which introduced target-based idempotency - tasks declare their output targets and are skipped if the target already exists. A precursor to the asset model. It ended up less widely adopted than Airflow, largely due to its weaker UI and scheduler.
2014: Airflow. Maxime Beauchemin built Airflow at Airbnb. The breakthrough insight: define pipelines as Python code, not XML. The DAG-as-code model made pipelines version-controllable, testable, and dynamic. Airflow was donated to Apache in 2016 and became the dominant open-source orchestrator by 2018.
2018: Prefect. Jeremiah Lowin founded Prefect to address Airflow's complexity. The core insight: a modern pipeline framework should work seamlessly locally and in production without requiring a dedicated infrastructure setup. Prefect 2.0 (2022) was a near-complete rewrite with a much simpler architecture.
2019: Dagster. Nick Schrock at Elementl built Dagster around the asset-centric model. Every computation declares what data it produces. The graph is an asset graph, not a task graph. Rich lineage, first-class testing, resource injection.
2019: Temporal. Built by former Uber engineers who built Cadence (Uber's internal workflow engine). Temporal is a durable execution platform: functions run to completion even across machine failures, network partitions, and multi-day wait periods. It is less "data pipeline orchestrator" and more "workflow engine for distributed systems." But for ML platforms with long-running training jobs and complex multi-step approval workflows, it fills a unique niche.
2022–2025: Cloud-Native Maturation. AWS Step Functions, GCP Workflows, and Azure Data Factory matured into production-grade managed services. For teams already deeply embedded in a single cloud, these services eliminate the operational burden of running a self-hosted orchestrator.
The Decision Dimensions
Before evaluating specific tools, define what you are evaluating. These are the dimensions that matter most:
1. Developer experience (DX). How hard is it to write a new pipeline? How fast is the local feedback loop? How easy is it for a new team member to understand an existing pipeline?
2. Maturity and ecosystem. How many operators/integrations exist? How stable is the API? What is the production track record at scale?
3. Data-awareness. Does the orchestrator know about the data products that pipelines produce? Can it track freshness, lineage, and data quality natively?
4. Testing ergonomics. How easy is it to write a unit test for a pipeline? Does testing require a running orchestrator?
5. Operational complexity. How hard is it to run in production? What infrastructure does it require? What is the upgrade burden?
6. Cloud-native integration. Does it integrate natively with your cloud? Are there managed hosting options that eliminate operational overhead?
7. Cost. What does it cost to run? Include both infrastructure cost (compute, storage) and engineering time (setup, maintenance, on-call).
8. Migration path. If you need to migrate from or to this tool in the future, how painful is it?
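One way to make these dimensions comparable is a weighted score per candidate. The sketch below is only an illustration: the dimension keys, the weights, and the 1-5 scores are all hypothetical placeholders that a team would replace with its own priorities and prototype findings, and the candidates are deliberately unnamed.

```python
# Hypothetical weights over the eight dimensions above; they should sum to 1.0.
WEIGHTS = {
    "developer_experience": 0.15,
    "maturity_ecosystem": 0.25,
    "data_awareness": 0.10,
    "testing": 0.10,
    "operational_complexity": 0.15,
    "cloud_integration": 0.05,
    "cost": 0.10,
    "migration_path": 0.10,
}

# Placeholder 1-5 scores (higher is better), not measurements of any real tool.
CANDIDATES = {
    "candidate_a": {
        "developer_experience": 2, "maturity_ecosystem": 5, "data_awareness": 1,
        "testing": 2, "operational_complexity": 2, "cloud_integration": 4,
        "cost": 3, "migration_path": 3,
    },
    "candidate_b": {
        "developer_experience": 5, "maturity_ecosystem": 3, "data_awareness": 1,
        "testing": 4, "operational_complexity": 4, "cloud_integration": 3,
        "cost": 4, "migration_path": 3,
    },
}


def weighted_score(scores, weights=WEIGHTS):
    """Weighted sum of per-dimension scores, rounded to two decimals."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return round(sum(weights[dim] * scores[dim] for dim in weights), 2)


def rank(candidates):
    """Return (name, score) pairs, best candidate first."""
    scored = {name: weighted_score(s) for name, s in candidates.items()}
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for name, score in rank(CANDIDATES):
        print(f"{name}: {score}")
```

The weights matter more than the scores: a team that depends on many external systems would weight maturity and ecosystem heavily, which is exactly the dimension the team in the opening story left out.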
Airflow - The Incumbent
Airflow is the most widely deployed data orchestration tool in the world. It has the largest ecosystem of operators and integrations, the deepest talent pool, and the most commercial support options (Astronomer's Astro, AWS MWAA, Google Cloud Composer).
The strengths:
Airflow has operators for essentially everything. S3, GCS, BigQuery, Redshift, Snowflake, Databricks, Kubernetes, Docker, Spark, Hive, Postgres, MySQL, HTTP, SFTP, SSH - and hundreds more. If you need to integrate with a data system, an Airflow operator almost certainly exists. Building the same integration in Prefect or Dagster often means writing a custom task from scratch.
The Airflow scheduler is battle-tested at scale. Airbnb, Lyft, Spotify, Twitter, and hundreds of other companies run Airflow with thousands of DAGs and millions of task instances per day. Edge cases in scheduler behavior have been found and fixed over 10 years.
The talent pool is the largest in the category. When you post a data engineering job, candidates who know Airflow are far more common than candidates who know Prefect or Dagster. This matters for team scaling.
The weaknesses:
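Airflow's local development experience is painful. Running Airflow locally typically requires Docker Compose with a scheduler, webserver, and database. The feedback loop for testing a DAG change involves waiting for the scheduler to pick it up. The testing story is awkward: dag.test() (added in Airflow 2.5) runs a DAG in a single process without a scheduler, but it still needs an installed and initialized Airflow environment, including a metadata database, so it is slower and heavier than an ordinary unit test.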
Airflow has no built-in concept of the data that pipelines produce. It tracks task execution, not data freshness. You cannot look at the Airflow UI and answer "is this table stale?" without building your own metadata layer.
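To make "building your own metadata layer" concrete, here is a minimal sketch of the freshness check such a layer would need. It assumes the pipelines record a last-updated timestamp per table somewhere you can query; the table names and SLA values are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical freshness SLAs: the maximum acceptable age of each table.
FRESHNESS_SLA = {
    "raw_events": timedelta(hours=1),
    "user_features": timedelta(hours=24),
}


def stale_tables(last_updated, now, sla=FRESHNESS_SLA):
    """Return the tables that are older than their SLA or were never updated.

    last_updated maps table name -> timezone-aware datetime of the last
    successful write, as recorded by the pipelines themselves.
    """
    stale = []
    for table, max_age in sla.items():
        updated_at = last_updated.get(table)
        if updated_at is None or now - updated_at > max_age:
            stale.append(table)
    return sorted(stale)


if __name__ == "__main__":
    now = datetime(2024, 1, 15, 12, 0, tzinfo=timezone.utc)
    recorded = {
        "raw_events": now - timedelta(hours=3),
        "user_features": now - timedelta(hours=2),
    }
    print(stale_tables(recorded, now))
```

The check itself is small; the ongoing cost is making every pipeline write its timestamp reliably and keeping the SLA table in sync with reality.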
Dynamic task generation was a second-class citizen until Airflow 2.3. Even now, dynamic task mapping has limitations that Prefect handles more elegantly.
The DAG import model causes latency: the scheduler periodically scans all DAG files and imports them. Large DAG files or DAGs with slow imports degrade scheduler performance. This requires careful engineering discipline at scale.
When to choose Airflow:
- Your team already has significant Airflow expertise
- You need a broad operator/sensor ecosystem
- You need managed hosting with commercial support (Astronomer, MWAA)
- You are integrating with many heterogeneous data systems
- Your team size is large enough to absorb the operational complexity
When NOT to choose Airflow:
- Small team (fewer than 5 data engineers) building from scratch
- You prioritize fast local iteration and simple testing
- Your pipelines are primarily Python compute with minimal external system integration
- You want native lineage and data freshness tracking without building it yourself
Prefect - The Modern Python Orchestrator
Prefect was built with the explicit goal of providing a better developer experience than Airflow. The architecture is fundamentally different: flows (pipelines) are ordinary Python functions, the orchestrator is decoupled from execution, and you can run a Prefect flow locally with zero infrastructure setup.
The core architecture:
import os

import pandas as pd
import requests
from prefect import flow, task

# Assumption: the API token is supplied through an environment variable.
API_KEY = os.environ.get("API_KEY", "")


@task(retries=3, retry_delay_seconds=60, log_prints=True)
def extract_events(date: str) -> list[dict]:
    """Fetch raw events from the source API."""
    response = requests.get(
        "https://api.company.com/events",
        params={"date": date},
        headers={"Authorization": f"Bearer {API_KEY}"},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["events"]


@task
def compute_features(events: list[dict]) -> pd.DataFrame:
    """Compute aggregate features from raw events."""
    df = pd.DataFrame(events)
    return df.groupby("user_id").agg(...).reset_index()  # aggregation spec elided


@flow(name="Feature Engineering Pipeline")
def feature_engineering_pipeline(date: str | None = None):
    """
    This is a plain Python function. You can call it directly:
        feature_engineering_pipeline(date="2024-01-15")
    No scheduler, no Docker, no database needed for local runs.
    """
    if date is None:
        date = str(pd.Timestamp.now().date())
    events = extract_events(date)
    features = compute_features(events)
    save_to_warehouse(features, date)  # warehouse writer, defined elsewhere


# Run locally without any infrastructure
if __name__ == "__main__":
    feature_engineering_pipeline(date="2024-01-15")
The strengths:
Local development is genuinely simple. python my_flow.py runs the flow. No Docker Compose, no scheduler, no database. This makes the feedback loop extremely fast during development.
Prefect's task library is growing and covers the most common integrations. The documentation is excellent and the Python API is clean.
Prefect Cloud provides a fully managed control plane with a beautiful UI, deployment management, and team collaboration features. The open-source Prefect server can be self-hosted with minimal infrastructure.
The weaknesses:
The sensor ecosystem is significantly thinner than Airflow's. If you need to poll an obscure SaaS API, react to a specific Kafka message pattern, or coordinate across a complex set of external systems, you will likely need to write custom sensors that Airflow would provide out-of-the-box.
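For a sense of what "writing a custom sensor" involves, the sketch below shows the core polling loop in plain Python. It is not Prefect's or Airflow's API; the function name is hypothetical, and `poke_interval` simply borrows Airflow's term for the delay between checks. The clock and sleep functions are injectable so the loop can be tested without waiting.

```python
import time
from typing import Callable


def wait_for(
    condition: Callable[[], bool],
    poke_interval: float,
    timeout: float,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll condition() until it is true or the timeout elapses.

    Returns True as soon as the condition holds, False once the deadline
    has passed without it holding.
    """
    deadline = clock() + timeout
    while True:
        if condition():
            return True
        if clock() >= deadline:
            return False
        sleep(poke_interval)


if __name__ == "__main__":
    # Hypothetical check: in practice this would call a vendor API or list an S3 prefix.
    attempts = {"count": 0}

    def vendor_export_ready() -> bool:
        attempts["count"] += 1
        return attempts["count"] >= 3

    print(wait_for(vendor_export_ready, poke_interval=0.01, timeout=1.0))
```

The loop is the easy part. What a mature operator library adds on top is authentication, pagination, rate limiting, and failure handling for each specific external system, which is where the custom-sensor effort actually goes.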
Prefect has no native data model. Like Airflow, it tracks task execution, not data products. Cross-flow dependencies require explicit coordination - there is no automatic freshness propagation.
The architecture has changed significantly between major versions (0.x, 1.x, 2.x). Teams that invested in Prefect 1 had to substantially rewrite for Prefect 2. This is not Prefect-specific - software evolves - but it is a risk to weigh.
When to choose Prefect:
- Python-first team that wants minimal orchestration overhead
- Rapid development environment where DX is the top priority
- Moderate external system integration needs
- No requirement for native data lineage tracking
- You want managed hosting without vendor lock-in (Prefect Cloud vs. self-hosted server)
When NOT to choose Prefect:
- Complex sensor and operator requirements
- Large existing Airflow investment that would need migration
- Data quality and lineage are primary concerns (Dagster is a better fit)
- You need the deepest possible operator ecosystem
Dagster - The Asset-Centric Orchestrator
Dagster is covered in depth in lesson 04. The key differentiator: Dagster is the only mainstream orchestrator that treats data products as the fundamental primitive, not computation tasks.
The strengths:
Software-Defined Assets (SDAs) make lineage automatic. Every asset knows its dependencies, its owner, its freshness policy, and its materialization history. The Dagster UI is a data catalog and an execution monitor simultaneously.
Testing is dramatically simpler than Airflow. The materialize() function runs an asset graph in-process with injectable resources. Testing a Dagster pipeline is as simple as testing any other Python function.
The dbt integration is best-in-class: @dbt_assets converts the entire dbt project into Dagster assets, creating a unified lineage graph across Python ingestion and SQL transformation.
Partitions are first-class. Date-partitioned assets, partition-aware dependency resolution, and UI-driven backfills are all built in.
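To illustrate what partition-aware backfills save you from writing, here is a rough plain-Python sketch of the bookkeeping involved: enumerate the daily partition keys in a range and find the ones that have not been materialized yet. This is not Dagster's implementation, and the function names are hypothetical.

```python
from datetime import date, timedelta


def daily_partitions(start: date, end: date) -> list[str]:
    """ISO-formatted partition keys for every day in [start, end], inclusive."""
    days = (end - start).days
    return [(start + timedelta(days=i)).isoformat() for i in range(days + 1)]


def missing_partitions(start: date, end: date, materialized: set[str]) -> list[str]:
    """Partition keys in the range that still need a backfill run."""
    return [key for key in daily_partitions(start, end) if key not in materialized]


if __name__ == "__main__":
    done = {"2024-01-01", "2024-01-03"}
    print(missing_partitions(date(2024, 1, 1), date(2024, 1, 5), done))
```

In a task-based orchestrator this logic, plus the storage for the materialized set, usually ends up as custom code around templated execution dates.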
The weaknesses:
The asset model requires a different mental model than task-based orchestration. Teams coming from Airflow or Prefect need time to internalize the shift from "what should I run?" to "what do I want to produce?"
The operator/integration library is smaller than Airflow's. For uncommon integrations, you will likely write your own resource or asset from scratch.
Dagster's architecture has a daemon, a webserver, and an event log that require operational management. Less complex than Airflow's multi-component stack, but not zero.
When to choose Dagster:
- Data quality and lineage are primary concerns
- dbt-heavy stack that needs a unified orchestration layer
- Software engineering discipline is a team priority (testing, type safety, resource injection)
- Your team thinks in terms of data products, not tasks
- You need partition-aware backfills with minimal custom code
When NOT to choose Dagster:
- Team is unfamiliar with the asset model and has no bandwidth to learn it
- Simple pipelines where the asset model adds more overhead than value
- Extensive need for the Airflow operator ecosystem
Temporal - Durable Execution for Engineers
Temporal is different from the other tools in this list. It is not primarily a data pipeline scheduler - it is a durable execution platform. Temporal guarantees that a workflow function will run to completion, even across machine failures, network partitions, and restarts of the Temporal cluster itself.
import asyncio
from datetime import timedelta

from temporalio import activity, workflow


@activity.defn
async def train_model_chunk(chunk_id: int, params: dict) -> dict:
    """
    Long-running activity. If this crashes at hour 3, Temporal
    will retry it from the beginning of this activity (not from
    the beginning of the workflow).
    """
    # ... training logic ...
    return {"chunk_id": chunk_id, "metrics": {"accuracy": 0.94}}


@activity.defn
async def deploy_model(results: list[dict]) -> None:
    """Side effects such as deployment belong in activities, not workflow code."""
    # ... deployment logic ...


@workflow.defn
class ModelTrainingWorkflow:
    def __init__(self) -> None:
        self._approved = False

    @workflow.signal
    def approve(self) -> None:
        """Signal sent by the approver (a person or another workflow)."""
        self._approved = True

    @workflow.run
    async def run(self, model_config: dict) -> dict:
        """
        This workflow will resume and finish even if:
        - The worker process crashes mid-run
        - The Temporal cluster restarts
        - The network goes down for hours
        Temporal replays the event history to reconstruct state.
        """
        # Fan out: train 10 model chunks in parallel
        chunk_futures = [
            workflow.execute_activity(
                train_model_chunk,
                args=[i, model_config],
                start_to_close_timeout=timedelta(hours=4),
            )
            for i in range(10)
        ]
        results = await asyncio.gather(*chunk_futures)

        # Wait durably for human approval before deploying
        await workflow.wait_condition(lambda: self._approved)

        await workflow.execute_activity(
            deploy_model,
            results,
            start_to_close_timeout=timedelta(minutes=30),
        )
        return {"status": "deployed", "chunks": len(results)}
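The phrase "replays the event history to reconstruct state" is easier to see in a toy model. The sketch below is not Temporal's implementation or API; the `ReplayingRunner` class and the step names are hypothetical. It only shows the idea: each completed step is appended to a history, and after a crash the workflow function is re-run from the top, with completed steps answered from the history instead of being executed again.

```python
from typing import Any, Callable, Optional


class ReplayingRunner:
    """Toy model of durable execution backed by an append-only history."""

    def __init__(self, history: Optional[list] = None):
        # In a real system the history lives in a database, not in memory.
        self.history = history if history is not None else []
        self._cursor = 0

    def step(self, name: str, fn: Callable[[], Any]) -> Any:
        if self._cursor < len(self.history):
            # Replay: this step already completed in an earlier attempt.
            recorded_name, result = self.history[self._cursor]
            if recorded_name != name:
                raise RuntimeError(
                    f"non-deterministic workflow: expected {recorded_name!r}, got {name!r}"
                )
            self._cursor += 1
            return result
        # First execution: run the step and record its result.
        result = fn()
        self.history.append((name, result))
        self._cursor += 1
        return result


def demo():
    executions = []

    def train():
        executions.append("train")
        return {"accuracy": 0.94}

    def flaky_deploy():
        executions.append("deploy")
        raise ConnectionError("worker lost")

    def deploy():
        executions.append("deploy")
        return "deployed"

    def training_workflow(runner, deploy_fn):
        metrics = runner.step("train", train)
        status = runner.step("deploy", deploy_fn)
        return {"metrics": metrics, "status": status}

    history = []
    try:
        training_workflow(ReplayingRunner(history), flaky_deploy)
    except ConnectionError:
        pass  # simulated crash after training finished
    # Second attempt: "train" is replayed from history, only "deploy" runs again.
    result = training_workflow(ReplayingRunner(history), deploy)
    return result, executions


if __name__ == "__main__":
    print(demo())
```

The name check in `step` is also why workflow code has to be deterministic: if a re-run takes a different path than the recorded history, replay can no longer line the steps up.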
The strengths:
Temporal's core guarantee - durable execution - is stronger than the retry-based recovery the other orchestrators offer. Once a workflow starts, Temporal persists each step and drives the workflow to completion (or to an explicit failure, timeout, or cancellation) regardless of worker crashes. No other tool in this list makes that guarantee at the infrastructure level. This matters for multi-hour or multi-day workflows where machine failures are expected.
Temporal handles long-running workflows naturally. A workflow that waits for a 2-day human approval process, then continues, is straightforward in Temporal. In Airflow or Prefect, this requires careful state management.
Temporal is designed to scale horizontally to very high throughput and to millions of concurrent workflows. It runs in production at companies such as Stripe and Snap, and its predecessor Cadence was built for that kind of scale at Uber.
The weaknesses:
Temporal is designed for engineers, not data analysts. Writing Temporal workflows requires understanding activities, workers, workflow event history, and the determinism constraints on workflow code. The learning curve is significantly steeper than Airflow or Prefect.
Temporal has no data pipeline abstractions: no operators for Snowflake and no sensors for S3. Recurring workflows are supported through Temporal Schedules (or a cron expression on the workflow), but every integration with a data system is code you write yourself using the Temporal SDK.
The self-hosted operational complexity is high: Temporal requires a persistence backend (Cassandra, PostgreSQL, or MySQL), usually Elasticsearch for advanced visibility queries, and careful resource provisioning for high-throughput use cases. Temporal Cloud (managed hosting) reduces this significantly.
When to choose Temporal:
- Multi-day or multi-week workflows with human approval steps
- Long-running ML training jobs where failure tolerance is critical
- Microservices coordination that requires distributed workflow execution
- Engineering teams comfortable with distributed systems concepts
- You need true durable execution guarantees, not just retry logic
When NOT to choose Temporal:
- Data analyst-maintained pipelines (the engineering bar is too high)
- Standard scheduled ETL without long-running or approval-gated steps
- Teams without distributed systems experience
Cloud-Native Orchestration - Step Functions, GCP Workflows, Azure Data Factory
The major cloud providers offer managed orchestration services that eliminate the operational burden of running a self-hosted orchestrator.
AWS Step Functions:
Step Functions is a serverless state machine service. Workflows are defined as JSON-based state machines (or via the visual editor or CDK) and executed by the managed service. No infrastructure to run.
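# Define a Step Functions workflow using AWS CDK v2 (Python)
from aws_cdk import Stack
from aws_cdk import aws_lambda as lambda_
from aws_cdk import aws_stepfunctions as sfn
from aws_cdk import aws_stepfunctions_tasks as tasks
from constructs import Construct


# The stack class name is illustrative; constructs must live inside a Stack or Construct.
class FeaturePipelineStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Lambda configuration (runtime, handler, code) is elided here
        extract_fn = lambda_.Function(self, "ExtractFunction", ...)
        transform_fn = lambda_.Function(self, "TransformFunction", ...)
        load_fn = lambda_.Function(self, "LoadFunction", ...)

        extract = tasks.LambdaInvoke(
            self, "Extract",
            lambda_function=extract_fn,
            output_path="$.Payload",
        )
        transform = tasks.LambdaInvoke(
            self, "Transform",
            lambda_function=transform_fn,
            output_path="$.Payload",
        )
        load = tasks.LambdaInvoke(
            self, "Load",
            lambda_function=load_fn,
        )

        # Chain the states
        pipeline = extract.next(transform).next(load)

        # Recent CDK v2 releases use definition_body; older ones accept definition=pipeline
        state_machine = sfn.StateMachine(
            self, "FeaturePipeline",
            definition_body=sfn.DefinitionBody.from_chainable(pipeline),
        )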
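For comparison, here is a sketch of the kind of JSON state machine (Amazon States Language) that sits underneath a three-step pipeline like this one. It is a simplified hand-written form that points each Task directly at a Lambda ARN; what CDK actually synthesizes is more verbose. The ARNs, account ID, and retry settings are hypothetical.

```python
import json
from typing import Optional


def lambda_task(function_arn: str, next_state: Optional[str] = None) -> dict:
    """Build one Task state that invokes a Lambda function, with a retry policy."""
    state = {
        "Type": "Task",
        "Resource": function_arn,
        "Retry": [
            {
                "ErrorEquals": ["States.TaskFailed"],
                "IntervalSeconds": 60,
                "MaxAttempts": 3,
                "BackoffRate": 2.0,
            }
        ],
    }
    if next_state is None:
        state["End"] = True  # terminal state
    else:
        state["Next"] = next_state
    return state


def build_pipeline(arns: dict) -> dict:
    """Chain the given states in insertion order into a linear state machine."""
    names = list(arns)
    states = {}
    for index, name in enumerate(names):
        following = names[index + 1] if index + 1 < len(names) else None
        states[name] = lambda_task(arns[name], following)
    return {
        "Comment": "Linear extract-transform-load pipeline",
        "StartAt": names[0],
        "States": states,
    }


# Hypothetical ARNs for illustration only.
EXAMPLE_ARNS = {
    "Extract": "arn:aws:lambda:us-east-1:123456789012:function:extract",
    "Transform": "arn:aws:lambda:us-east-1:123456789012:function:transform",
    "Load": "arn:aws:lambda:us-east-1:123456789012:function:load",
}

if __name__ == "__main__":
    print(json.dumps(build_pipeline(EXAMPLE_ARNS), indent=2))
```

Even this linear case needs a retry block per state and explicit `Next`/`End` wiring, which gives a feel for why branching logic becomes verbose in the state machine model.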
Strengths: No infrastructure to manage. Serverless cost model. Deep AWS integration (Lambda, ECS, Glue, SageMaker, Bedrock). Built-in retry and error handling. Visual execution graph in the AWS console.
Weaknesses: JSON/YAML workflow definitions are less expressive than Python. No native concept of data lineage. Limited cross-cloud portability. Complex branching logic becomes verbose. No native scheduling (you need EventBridge for that).
GCP Workflows and Azure Data Factory follow similar patterns: managed services that eliminate operational overhead at the cost of expressiveness and portability.
When to choose cloud-native:
- Already deeply invested in a single cloud provider
- Simple to moderate pipeline complexity
- Minimal ops team - infrastructure management is a constraint
- Serverless cost model is attractive for variable workloads
- Pipeline logic fits well in the service's expression model
When NOT to choose cloud-native:
- Complex Python logic that doesn't fit the state machine model
- Multi-cloud strategy
- Need for rich UI, lineage tracking, or community integrations
- Python-native development workflow is important to your team
Full Decision Matrix
| Dimension | Airflow | Prefect | Dagster | Temporal | Step Functions | GCP Workflows |
|---|---|---|---|---|---|---|
| Core model | Task DAG | Flow + Task | Asset graph | Durable workflow | State machine | State machine |
| Developer experience | Complex | Excellent | Good | Expert-level | Moderate | Moderate |
| Local dev | Docker stack | python flow.py | Dagster UI | Worker + server | Emulator | Emulator |
| Operator ecosystem | Vast (1000+) | Good (200+) | Growing (150+) | None (DIY) | AWS-native | GCP-native |
| Data lineage | None | None | First-class | None | None | None |
| Freshness tracking | None | None | Built-in | None | None | None |
| Testing | Hard | Easy | Very easy | Moderate | Hard | Hard |
| Operational complexity | High | Low-Medium | Medium | High | Zero | Zero |
| Managed hosting | Astronomer, MWAA | Prefect Cloud | Dagster Cloud | Temporal Cloud | Native | Native |
| Cost (self-hosted) | Medium | Low | Medium | High | N/A | N/A |
| First released | 2014 | 2018 | 2019 | 2019 | 2016 | 2020 |
| Partition support | Via templates | Via parameters | First-class | N/A | N/A | N/A |
| dbt integration | Task-level | Task-level | Asset-level | None | None | None |
| Best for | Large teams, sensors | Python-native teams | dbt shops, lineage | Long-running workflows | AWS-native teams | GCP-native teams |
Migration Strategies
Airflow to Prefect
The lowest-friction migration path. Prefect tasks map directly to Python functions, which is exactly how most Airflow PythonOperator callables are written.
# Before: Airflow PythonOperator
from airflow.operators.python import PythonOperator

def my_transform_function(**context):
    date = context["ds"]
    data = load_data(date)
    result = transform(data)
    save_result(result, date)

# Airflow 2 passes the context automatically, so provide_context is not needed
task = PythonOperator(
    task_id="transform",
    python_callable=my_transform_function,
)

# After: Prefect task (the callable is nearly identical)
from prefect import task, flow

@task(retries=3, retry_delay_seconds=60)
def my_transform_function(date: str):
    data = load_data(date)
    result = transform(data)
    save_result(result, date)

@flow
def my_pipeline(date: str):
    my_transform_function(date)
The main migration cost is replacing Airflow operators (S3, BigQuery, etc.) with equivalent Prefect tasks. Operators that use Airflow hooks need to be rewritten using the underlying client libraries. Budget 1-2 weeks of engineering per 10 operators you depend on.
Airflow to Dagster
The conceptual shift is larger: you are moving from task-centric to asset-centric thinking. Each Airflow task needs to be reframed as an asset - what does it produce, what are its inputs, what are its dependencies.
# Airflow task (task-centric)
def compute_user_features(**context):
    date = context["ds"]
    events = load_events(date)  # implicit input
    features = aggregate_features(events)
    write_to_warehouse(features, "user_features", date)  # implicit output

# Dagster asset (asset-centric)
import pandas as pd
from dagster import AssetExecutionContext, DailyPartitionsDefinition, asset

@asset(
    partitions_def=DailyPartitionsDefinition(start_date="2024-01-01"),
    group_name="feature_engineering",
)
def user_features(context: AssetExecutionContext, raw_events: pd.DataFrame) -> pd.DataFrame:
    """Explicit: this asset produces user_features, consuming raw_events."""
    context.log.info(f"Materializing partition {context.partition_key}")
    return aggregate_features(raw_events)
The migration is best done incrementally: start with a new Dagster project, migrate one pipeline at a time, and run both orchestrators in parallel until migration is complete.
Incremental Migration - Running Two Orchestrators Simultaneously
The lowest-risk migration strategy for large production platforms:
# Shared Python library - used by both orchestrators
# my_pipeline/transforms.py (no orchestrator imports)
import pandas as pd

def compute_features(events: pd.DataFrame, date: str) -> pd.DataFrame:
    return events.groupby("user_id").agg(...).reset_index()

# Airflow wrapper (old)
def airflow_compute_features(**context):
    from my_pipeline.transforms import compute_features
    ...

# Dagster wrapper (new)
from dagster import asset

@asset
def user_features(raw_events):
    from my_pipeline.transforms import compute_features
    return compute_features(raw_events, ...)
Keep business logic in a shared library that neither orchestrator owns. Both orchestrators call the same functions. You can migrate the orchestrator wrapper without touching the transformation logic.
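A practical benefit of this layout is that the shared function and both wrappers can be checked without either orchestrator installed. The sketch below uses a pandas-free stand-in for the transformation so it runs on the standard library alone; every name in it is hypothetical, and the wrappers only mimic each orchestrator's calling convention.

```python
from collections import Counter


# Shared library: pure business logic, no orchestrator imports.
def count_events_per_user(events: list) -> dict:
    """Number of events per user_id."""
    return dict(Counter(event["user_id"] for event in events))


# Airflow-style wrapper: receives the run context as keyword arguments.
def airflow_callable(load_events, **context):
    return count_events_per_user(load_events(context["ds"]))


# Dagster-style wrapper: receives the upstream data as an argument.
def dagster_asset_body(raw_events):
    return count_events_per_user(raw_events)


SAMPLE_EVENTS = [
    {"user_id": "u1", "type": "click"},
    {"user_id": "u2", "type": "view"},
    {"user_id": "u1", "type": "view"},
]

if __name__ == "__main__":
    via_airflow = airflow_callable(lambda ds: SAMPLE_EVENTS, ds="2024-01-15")
    via_dagster = dagster_asset_body(SAMPLE_EVENTS)
    print(via_airflow == via_dagster, via_dagster)
```

During the parallel-run window, comparing the two wrappers' outputs on the same input is a cheap way to confirm that a migrated pipeline still produces what the old one did.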
Mermaid Diagram - Orchestrator Decision Tree
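The "when to choose" lists from the sections above can be condensed into a small decision function. This is a rough sketch, not a definitive rule: the field names are hypothetical, and the order of the checks (hard requirements first, preferences last) is an assumption about how to prioritize.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    # Multi-day workflows, approval gates, true durable execution
    needs_durable_execution: bool = False
    # Lineage, data quality, or a dbt-heavy stack is the primary concern
    lineage_is_primary: bool = False
    # Many heterogeneous systems, complex sensors and operators
    needs_broad_operator_ecosystem: bool = False
    # The team already has significant Airflow expertise
    has_airflow_expertise: bool = False
    # Committed to one cloud, minimal ops capacity, simple-to-moderate pipelines
    single_cloud_minimal_ops: bool = False


def recommend(req: Requirements) -> str:
    """Walk the decision tree from hard requirements down to preferences."""
    if req.needs_durable_execution:
        return "Temporal"
    if req.lineage_is_primary:
        return "Dagster"
    if req.needs_broad_operator_ecosystem or req.has_airflow_expertise:
        return "Airflow"
    if req.single_cloud_minimal_ops:
        return "Cloud-native"
    # Python-first team with moderate integration needs
    return "Prefect"


if __name__ == "__main__":
    print(recommend(Requirements(needs_broad_operator_ecosystem=True)))
```

Treat the output as a starting point for the prototype evaluation rather than a verdict; a real decision usually involves several of these flags at once.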
The "Good Enough" Principle
The most important principle in orchestrator selection is often the one that gets ignored in the excitement of evaluating shiny new tools: the cost of switching orchestrators is very high, so optimize for solving today's problem correctly rather than anticipated future requirements.
Every orchestrator migration involves:
- Rewriting all existing pipelines in the new framework
- Migrating or rebuilding all operator/sensor integrations
- Re-training the team on new concepts and APIs
- Running two orchestrators in parallel during the migration window
- Updating all monitoring, alerting, and on-call runbooks
- Managing the risk window where the new system is untested in production
This cost is typically 3-6 months of senior engineering time for a team with 20-30 pipelines. For a team with 200+ pipelines, it is a year-long project.
The implication: your initial orchestrator choice should be "good enough for the next 2-3 years," not "theoretically optimal." A team running 30 pipelines on Prefect with a slightly thin sensor library is better off writing custom sensors than migrating to Airflow. A team running 300 pipelines on Airflow with painful local development is better off improving their Airflow developer tooling (VS Code extensions, local DAG testing, better CI) than migrating to Prefect.
The exception to this rule is when the orchestrator choice fundamentally constrains a business requirement - not a preference. If you need native lineage tracking and data quality enforcement at scale, Dagster is not a preference - it is a requirement. If you need durable multi-week workflow execution, Temporal is not a preference - it is a requirement. Build the decision on requirements that cannot be reasonably met with your current tool, not on features that would be nice to have.
Production Engineering Notes
Evaluate on a real prototype. Before committing to an orchestrator, spend 2-3 days building a representative subset of your actual pipeline - not a toy example. Test the specific integrations you need, the local development experience, the deployment workflow, and the monitoring story. This investment prevents months of regret.
Consider managed hosting from day one. Self-hosting Airflow, Dagster, or Temporal carries ongoing maintenance burden: version upgrades, scheduler tuning, database maintenance, scaling. Astronomer (Airflow), Dagster Cloud, Prefect Cloud, and Temporal Cloud each provide managed hosting that lets your team focus on pipelines rather than infrastructure. The cost is typically justified unless you have a dedicated platform engineering team.
Document your orchestrator conventions. Whatever you choose, document your conventions in a team runbook: how to create a new DAG/flow/asset, how to handle secrets, how to write tests, how to add monitoring. This reduces the knowledge silos that develop when conventions are implicit.
Plan for the version upgrade path. Check the historical version upgrade frequency and breaking change rate of any tool you are considering. Airflow publishes migration guides for its major upgrades, though the move from 1.x to 2.x was still a substantial project for many teams. Prefect had a major breaking rewrite from 1.x to 2.x. Understand what you are signing up for.
Common Mistakes
:::danger Choosing Based on Hype Alone
Dagster has excellent marketing and a strong engineering blog. Prefect has a beautiful UI. Temporal has impressive conference talks about durable execution. None of these are valid reasons to choose an orchestrator. Evaluate on your specific requirements: which operator integrations do you need, what is your team's background, and how much operational complexity can you absorb?
:::
:::danger Under-Estimating Migration Cost
"We'll migrate when we outgrow this" is almost always more expensive than expected. Budget migration cost explicitly before starting: a realistic estimate is 2-4 engineer-weeks per 10 pipelines, including testing, monitoring, and runbook updates. If the migration is not worth that cost now, it is probably not worth it at all.
:::
:::warning Over-Engineering for Anticipated Scale
A team running 5 pipelines does not need Temporal's durable execution guarantees or Airflow's massive operator ecosystem. Simple problems deserve simple solutions. Prefect or even a simple cron-scheduled Python script may be the right answer. The correct question is not "what is the best orchestrator?" but "what is the minimum orchestrator complexity that solves my current problem?"
:::
:::warning Ignoring Operational Cost in the Evaluation
Evaluations often focus on developer experience and features without counting the operational cost: who maintains the infrastructure, who handles upgrades, who is on-call when the scheduler goes down at 3 AM. For small teams without dedicated platform engineering, operational complexity is often the deciding factor - and managed hosting options deserve serious consideration.
:::
Interview Questions and Answers
Q1: What are the key differences between Airflow, Prefect, and Dagster at an architectural level?
The fundamental difference is in their core abstraction. Airflow uses task DAGs: you define tasks and explicit dependency edges between them. The orchestrator tracks task execution. Prefect uses flows and tasks: flows are Python functions decorated with @flow, tasks with @task. The architecture is similar to Airflow but with a simpler local development model and a decoupled control plane. Dagster uses Software-Defined Assets: you define data products (assets) and their dependency relationships. The execution plan is derived from the asset graph, not manually specified. Airflow and Prefect track "did this computation run?"; Dagster tracks "is this data product fresh?".
Q2: When would you choose Temporal over Airflow for an ML pipeline?
Temporal is appropriate when you need true durable execution guarantees - workflows that will complete even across machine failures, cluster restarts, and multi-day wait periods. The canonical ML use case is a long-running model training workflow with a human approval gate: train for 8 hours, wait for an evaluation team to approve deployment (which might take 2 days), then deploy. In Airflow, the 2-day wait requires a sensor that polls an approval API, with risk of losing state if the scheduler is restarted. In Temporal, the workflow function literally pauses at await wait_for_approval() and resumes when the approval arrives, with the state durably persisted in Temporal's event log. The downside: Temporal has no data pipeline abstractions, so you build everything from scratch.
Q3: How would you evaluate a new orchestrator for a team that currently uses Airflow?
Start by cataloging your current Airflow usage: how many DAGs, which operators are used, which sensors are critical, what is the test coverage. Then evaluate the candidate orchestrator against this catalog: can it replace every operator you use, or will you need to build custom integrations? Build a proof-of-concept with 2-3 representative pipelines and measure: local development time per pipeline, test setup time, integration effort for your most complex operators. Finally, estimate the migration cost honestly: multiply the number of pipelines by a per-pipeline estimate, add 50% for unexpected complexity, and evaluate whether the benefits justify that cost over the next 2-3 years.
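The estimate described in that answer is simple arithmetic, sketched below. The function name and the example figures (40 pipelines, a quarter of an engineer-week each) are hypothetical; only the 50% contingency comes from the text.

```python
def migration_estimate_weeks(
    num_pipelines: int,
    weeks_per_pipeline: float,
    contingency: float = 0.5,
) -> float:
    """Engineer-weeks to migrate: pipelines x per-pipeline effort, plus contingency."""
    if num_pipelines < 0 or weeks_per_pipeline < 0 or contingency < 0:
        raise ValueError("inputs must be non-negative")
    return num_pipelines * weeks_per_pipeline * (1 + contingency)


if __name__ == "__main__":
    # Hypothetical example: 40 pipelines at 0.25 engineer-weeks each, plus 50%.
    print(migration_estimate_weeks(40, 0.25))  # 15.0
```

The per-pipeline figure should come from the proof-of-concept, not from a guess: time the 2-3 representative pipelines and use the slowest one as the baseline.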
Q4: What is the "good enough" principle for orchestrator selection, and why does it matter?
The "good enough" principle is: choose the orchestrator that solves your current requirements adequately, because the cost of switching is very high. Orchestrator migrations typically cost 3-6 months of senior engineering time for teams with 20-30 pipelines. That cost needs to be justified by concrete business requirements that cannot be met with the current tool - not by preferences, UI quality, or features that would be convenient to have. The implication: build your evaluation around requirements you have today, not requirements you might have in 3 years. A slightly imperfect tool that solves your current problem is almost always better than a theoretically superior tool that requires a months-long migration.
Q5: How does cloud-native orchestration (Step Functions, GCP Workflows) compare to self-hosted options for ML pipelines?
Cloud-native orchestration eliminates operational overhead entirely: no infrastructure to provision, no scheduler to maintain, no upgrades to manage. This is a genuine advantage for small teams. The trade-offs: expressiveness is limited by the state machine model (complex Python logic is awkward to express), portability is near-zero (you are fully committed to one cloud), and there is no community ecosystem of shared operators. For ML pipelines specifically, cloud-native services integrate well with managed ML services (SageMaker, Vertex AI) and have native fan-out/fan-in support for parallel training. They struggle with: complex sensor logic that requires custom polling, pipelines with complex Python transformation logic, and cross-cloud or hybrid architectures.
Q6: Describe an incremental migration strategy from Airflow to Dagster.
An incremental migration runs both orchestrators in parallel, migrating pipelines one at a time. The key enabler is a shared Python library: all transformation logic lives in pure Python functions with no orchestrator imports. The Airflow DAG calls my_transforms.compute_features(date). The Dagster asset calls the same function: from my_transforms import compute_features; return compute_features(date). Start with the simplest, most self-contained pipelines and migrate them first. This builds team familiarity with the asset model on low-risk pipelines. Add Dagster-specific features (partitions, freshness policies, dbt integration) to migrated assets once the basic migration is validated. Maintain the shared library throughout - it also becomes the testing target for both the old and new versions.
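The shared-library pattern can be sketched without installing either orchestrator. Everything below is hypothetical: the transform, its signature, and the two wrapper functions, which stand in for an Airflow `PythonOperator` callable and a Dagster `@asset` body respectively.

```python
from datetime import date


# --- my_transforms.py: pure logic, no orchestrator imports -----------------
def compute_features(rows: list[dict], run_date: date) -> list[dict]:
    """Hypothetical transform: days since signup for each user as of run_date."""
    return [
        {
            "user_id": row["user_id"],
            "days_since_signup": (run_date - row["signup_date"]).days,
        }
        for row in rows
    ]


# --- Thin wrappers: the only code that changes during the migration --------
def airflow_callable(rows: list[dict], ds: str) -> list[dict]:
    """Stand-in for a PythonOperator callable; `ds` is a YYYY-MM-DD string."""
    return compute_features(rows, date.fromisoformat(ds))


def dagster_asset_body(rows: list[dict], partition_key: str) -> list[dict]:
    """Stand-in for the body of a daily-partitioned @asset function."""
    return compute_features(rows, date.fromisoformat(partition_key))


rows = [{"user_id": 1, "signup_date": date(2024, 1, 1)}]
old = airflow_callable(rows, "2024-01-15")
new = dagster_asset_body(rows, "2024-01-15")
assert old == new  # parity check between old and new pipeline versions
```

Because both wrappers delegate to the same function, a parity assertion like the last line can serve as the acceptance test for each migrated pipeline.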
Q7: What is the risk of choosing an orchestrator based primarily on developer experience?
Developer experience is an important dimension but is dangerous as the primary selection criterion because it is the most visible dimension and the most marketable one. Orchestrators with excellent DX are excellent at marketing excellent DX. The risks of over-weighting DX: (1) Operator/sensor ecosystem gap - you discover 3 months in that a critical integration isn't available, and the custom build takes longer than the DX savings. (2) Operational complexity reveals itself in production, not in development. A tool with excellent local development may have painful production operations. (3) Team knowledge transfer - if the orchestrator has a small talent pool, onboarding new engineers takes longer and losing key engineers is riskier. Evaluate DX as one factor among many, weighted by your team's actual bottlenecks.
Building the Evaluation Checklist
When your team is selecting an orchestrator, use this structured evaluation process to avoid the 6-month migration regret.
Step 1 - Catalog Your Requirements
Before opening a single documentation page, write down your requirements across these dimensions:
Integration requirements:
- List every external system your pipelines read from or write to: Snowflake, Redshift, BigQuery, S3, GCS, Kafka, Postgres, MySQL, MongoDB, REST APIs, SFTP servers
- For each system, identify whether a native operator/integration exists in each candidate orchestrator
- Flag any systems where you would need to build a custom integration
Execution requirements:
- Maximum pipeline duration (multi-day workflows favor Temporal's durable execution; minutes-to-hours batch jobs do not need it)
- Whether you need fan-out/fan-in with dynamic task counts
- Whether pipelines need to wait for external events (human approval, external API polling)
- Whether you have multi-cloud requirements
Data quality requirements:
- Do you need native lineage tracking?
- Do you need freshness policies and SLA monitoring?
- Do you need data quality checks integrated with pipeline execution?
Team requirements:
- Current orchestrator experience by team member
- Budget for learning and ramp-up time
- Dedicated platform engineering headcount (affects self-host vs. managed)
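One way to keep the integration catalog honest is to record it as data and compute the gaps, rather than eyeballing documentation. The sketch below assumes placeholder candidates (`candidate_a`, `candidate_b`) and made-up coverage sets; fill them in from each tool's own integration listings.

```python
# Systems your pipelines touch (from the integration requirements list).
REQUIRED_SYSTEMS = ["snowflake", "s3", "kafka", "sftp"]

# Hypothetical coverage data: replace with what each tool's docs actually list.
NATIVE_INTEGRATIONS = {
    "candidate_a": {"snowflake", "s3", "kafka", "sftp"},
    "candidate_b": {"snowflake", "s3"},
}


def integration_gaps(
    required: list[str], native: dict[str, set[str]]
) -> dict[str, list[str]]:
    """Map each candidate to the required systems it has no native integration for."""
    return {
        candidate: sorted(set(required) - supported)
        for candidate, supported in native.items()
    }


gaps = integration_gaps(REQUIRED_SYSTEMS, NATIVE_INTEGRATIONS)
for candidate, missing in gaps.items():
    print(candidate, "needs custom integrations for:", missing or "nothing")
```

Every entry in a candidate's gap list is a custom integration you would have to build and maintain, which feeds directly into the scoring and cost steps that follow.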
Step 2 - Build a Scored Evaluation Matrix
Score each candidate orchestrator on each requirement dimension on a 1-5 scale, weighted by importance to your team:
| Dimension | Weight | Airflow | Prefect | Dagster | Temporal |
|---|---|---|---|---|---|
| Integration coverage | 30% | 5 | 3 | 3 | 1 |
| Developer experience | 20% | 2 | 5 | 4 | 3 |
| Data lineage | 20% | 1 | 1 | 5 | 1 |
| Testing ergonomics | 10% | 2 | 4 | 5 | 3 |
| Operational complexity (5 = simplest to operate) | 10% | 2 | 4 | 3 | 2 |
| Managed hosting | 10% | 5 | 4 | 4 | 4 |
| Weighted score | | 3.00 | 3.30 | 3.90 | 2.00 |
This matrix forces explicit trade-off reasoning. A team with 40 Snowflake integrations and heavy dbt usage should weight integration coverage and data lineage higher. A team building a new ML platform from scratch with a Python-first culture should weight developer experience higher.
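The weighted score is a plain weighted average, and computing it in code avoids arithmetic slips when you change the weights. This sketch uses the example weights and 1-5 scores from the matrix above; the dictionary keys and the `weighted_score` helper are naming assumptions for illustration.

```python
# Weights in percent, in the same order as the score lists below.
WEIGHTS = {
    "integration_coverage": 30,
    "developer_experience": 20,
    "data_lineage": 20,
    "testing_ergonomics": 10,
    "operational_complexity": 10,
    "managed_hosting": 10,
}

# Example 1-5 scores per orchestrator, one entry per dimension in WEIGHTS order.
SCORES = {
    "airflow": [5, 2, 1, 2, 2, 5],
    "prefect": [3, 5, 1, 4, 4, 4],
    "dagster": [3, 4, 5, 5, 3, 4],
    "temporal": [1, 3, 1, 3, 2, 4],
}


def weighted_score(scores: list[int], weights: dict[str, int]) -> float:
    """Weighted average of 1-5 scores; weights are percentages that must sum to 100."""
    if sum(weights.values()) != 100:
        raise ValueError("weights must sum to 100")
    if len(scores) != len(weights):
        raise ValueError("one score per dimension is required")
    return sum(s * w for s, w in zip(scores, weights.values())) / 100


results = {name: weighted_score(s, WEIGHTS) for name, s in SCORES.items()}
for name, score in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<10} {score:.2f}")
```

With these inputs the totals come out to 3.00 for Airflow, 3.30 for Prefect, 3.90 for Dagster, and 2.00 for Temporal. Re-run it with your own weights before reading anything into the ranking.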
Step 3 - Build a Representative Proof-of-Concept
Choose 2-3 pipelines that represent your real workload - not toy examples. For each candidate orchestrator:
- Build the pipelines from scratch, timing the setup and implementation
- Test your most complex integration requirement (the one most likely to be unsupported)
- Write unit tests for each pipeline and measure the test setup time
- Deploy to a staging environment and measure the ops complexity
- Simulate a pipeline failure and measure the debugging experience
This step typically takes 3-5 days per orchestrator but saves months of regret. The patterns you discover in a realistic POC almost never appear in documentation or marketing materials.
Step 4 - Estimate Total Cost of Ownership
Calculate the 2-year total cost of each option:
```python
def estimate_tco(
    orchestrator: str,
    pipelines: int,
    engineers: int,
    annual_salary: int = 150_000,
) -> dict:
    """
    Rough 2-year TCO model for orchestrator selection.
    Adjust multipliers based on your context.

    Note: `pipelines` is accepted for context but does not yet affect the
    estimate, and managed-hosting subscription fees are not included.
    """
    # One-time setup effort (weeks)
    setup_weeks = {
        "airflow_self_hosted": 6,
        "airflow_managed": 2,
        "prefect_self_hosted": 3,
        "prefect_cloud": 1,
        "dagster_self_hosted": 4,
        "dagster_cloud": 1,
        "temporal_self_hosted": 10,
        "temporal_cloud": 4,
    }
    # Ongoing maintenance (weeks/year at scale)
    maintenance_weeks_per_year = {
        "airflow_self_hosted": 8,
        "airflow_managed": 2,
        "prefect_self_hosted": 4,
        "prefect_cloud": 1,
        "dagster_self_hosted": 4,
        "dagster_cloud": 1,
        "temporal_self_hosted": 12,
        "temporal_cloud": 3,
    }
    weekly_cost = annual_salary / 52
    # Assume at most two engineers work on setup in parallel
    setup_cost = setup_weeks[orchestrator] * weekly_cost * min(engineers, 2)
    maintenance_cost = (
        maintenance_weeks_per_year[orchestrator] * weekly_cost * 2  # 2 years
    )
    return {
        "orchestrator": orchestrator,
        "setup_cost": setup_cost,
        "maintenance_cost_2yr": maintenance_cost,
        "total_2yr": setup_cost + maintenance_cost,
    }
```
Special Considerations for ML Platforms
ML pipelines have specific requirements that distinguish them from traditional ETL orchestration. When evaluating orchestrators for ML platforms, weight these dimensions heavily:
Model training job integration. Can the orchestrator launch distributed training jobs on Kubernetes, Spark, or managed services (SageMaker, Vertex AI)? Can it monitor job completion and stream logs back to the orchestrator UI?
Experiment tracking integration. Does the orchestrator have built-in or community-maintained integrations with MLflow, Weights & Biases, or Neptune? Can it capture model metrics alongside pipeline metadata?
Model registry coordination. Can the orchestrator trigger model registration after successful training and validation? Can it coordinate the promote-to-production workflow?
Feature store integration. If your stack includes Feast, Tecton, or Hopsworks, can the orchestrator coordinate feature computation and registration natively?
Online/batch serving handoff. Can the orchestrator trigger model deployment to a serving layer after training completes and quality checks pass?
A representative ML platform orchestration flow:
```python
# Dagster example: full ML pipeline as assets
import mlflow
import pandas as pd
from dagster import asset

# Assumed to exist elsewhere: the upstream assets `user_features` and `labels`,
# and the helper functions train_xgboost() and evaluate().


@asset(group_name="ml_pipeline")
def training_data(user_features: pd.DataFrame, labels: pd.DataFrame) -> pd.DataFrame:
    return user_features.merge(labels, on="user_id", how="inner")


@asset(group_name="ml_pipeline")
def trained_model(training_data: pd.DataFrame) -> dict:
    with mlflow.start_run() as run:
        model = train_xgboost(training_data)
        metrics = evaluate(model, training_data)
        mlflow.xgboost.log_model(model, "model")
        mlflow.log_metrics(metrics)
    # Return lightweight metadata; the model artifact itself lives in MLflow
    return {
        "mlflow_run_id": run.info.run_id,
        "model_uri": f"runs:/{run.info.run_id}/model",
        "metrics": metrics,
    }


@asset(group_name="ml_pipeline")
def registered_model(trained_model: dict) -> str:
    """Promote model to registry if it meets quality threshold."""
    if trained_model["metrics"]["auc"] < 0.85:
        raise ValueError(
            f"Model AUC {trained_model['metrics']['auc']:.3f} below threshold 0.85. "
            f"Refusing to register."
        )
    client = mlflow.tracking.MlflowClient()
    result = client.register_model(
        model_uri=trained_model["model_uri"],
        name="user_conversion_model",
    )
    # Note: registry stages are deprecated in recent MLflow releases in favor
    # of model version aliases; adapt this call to the version you run.
    client.transition_model_version_stage(
        name="user_conversion_model",
        version=result.version,
        stage="Staging",
    )
    return f"user_conversion_model/version/{result.version}"
```
Summary - Making the Decision
The orchestration landscape in 2025 is genuinely competitive. There is no universally correct answer. But there is a correct process:
- Catalog your actual requirements - integrations, execution patterns, team constraints
- Score each candidate honestly against those requirements, weighted by importance
- Build a realistic POC with real integrations, not toy pipelines
- Calculate 2-year total cost of ownership including setup, maintenance, and managed hosting
- Choose the option with the highest score on the requirements that matter most to your team
The most common mistake is skipping the POC step and choosing based on marketing materials and conference talks. The second most common mistake is choosing based on future requirements that may never materialize. Solve today's problem well. Migration cost is real.
For most teams in 2025:
- Running complex pipelines with many heterogeneous integrations: Airflow with managed hosting (MWAA or Astronomer)
- Python-first team building from scratch without complex sensor needs: Prefect with Prefect Cloud
- dbt shop that wants unified lineage across Python and SQL: Dagster with Dagster Cloud
- Long-running workflows with human approval gates or failure-tolerant requirements: Temporal with Temporal Cloud
- Already deep in AWS/GCP with simple pipeline logic: Step Functions or GCP Workflows
Reference - Orchestrator Quick Comparison
| Orchestrator | Open Source | Managed Hosting | Primary Abstraction | Best Single Feature |
|---|---|---|---|---|
| Airflow | Yes (Apache) | MWAA, Astronomer, Google Cloud Composer | Task DAG | Operator ecosystem depth |
| Prefect | Yes (core) | Prefect Cloud | Flow + Task | Local development experience |
| Dagster | Yes | Dagster Cloud | Software-Defined Asset | Native lineage + data catalog |
| Temporal | Yes | Temporal Cloud | Durable Workflow | Fault-tolerant execution |
| Step Functions | No (AWS-native) | Managed by AWS | State Machine | Zero infrastructure ops |
| GCP Workflows | No (GCP-native) | Managed by GCP | State Machine | GCP service integration |
| Azure Data Factory | No (Azure-native) | Managed by Azure | Pipeline + Activity | Azure service integration |
| Luigi | Yes | None | Target-based task | Idempotency model (legacy) |
The right choice is the one that solves your current problem with acceptable trade-offs - and that your team will still want to use in two years.
Vendor and Community Health Signals
An orchestrator choice is a 3-5 year commitment. The tool you choose today needs to be actively maintained, commercially viable, and growing in adoption when you need to hire engineers who know it. Evaluate these signals before committing:
GitHub activity. Check commit frequency, issue response time, and the ratio of open to closed issues over the last 6 months. A repository with thousands of stale open issues and infrequent commits signals a project in decline.
Commercial backing. Airflow (Apache + Astronomer), Prefect (Prefect Inc.), Dagster (Dagster Labs), and Temporal (Temporal Technologies) all have funded companies behind them. This matters: open-source projects without commercial backing tend to stagnate when the founding engineers move on.
Adoption trend. Check the Stack Overflow Developer Survey, GitHub star growth rate, and job posting frequency. A tool that nobody is hiring for in 2 years is a tool your team will struggle to hire for.
Community size. Active Slack or Discord communities, frequent conference talks, and a healthy third-party plugin ecosystem signal a tool that will be supported and improved over the next 3 years.
As of 2025, all four major orchestrators (Airflow, Prefect, Dagster, Temporal) have strong commercial backing and active communities. The risk of choosing any of them is low from a vendor health perspective. The risk is primarily about fit-to-requirements, not about the tool disappearing.
Avoiding the Evaluation Paralysis Trap
A final note on process: do not let perfect be the enemy of good. Teams that spend 3 months evaluating orchestrators sometimes do so to avoid the commitment and accountability of making a choice. The evaluation process described above should take 2-3 weeks, not months. Build the POC, run the matrix, make the call.
The cost of 2 extra weeks of evaluation that produces clarity is much lower than the cost of a 6-month migration triggered by a bad initial choice. But the cost of 3 months of evaluation paralysis - while production pipelines run on cron jobs and shell scripts - is also high.
Set a deadline for the evaluation. Use the decision framework. Make the choice with the information you have. The best orchestrator is the one that is running in production, not the one that is still being evaluated.
Orchestrator Configuration Patterns - Shared Best Practices
Regardless of which orchestrator you choose, these configuration patterns apply across all of them.
Secrets Management
Never hardcode credentials in pipeline code or DAG definitions. Use a secrets backend:
```python
# Airflow - configure a secrets backend (AWS Secrets Manager)
# airflow.cfg
# [secrets]
# backend = airflow.providers.amazon.aws.secrets.secrets_manager.SecretsManagerBackend
# backend_kwargs = {"connections_prefix": "airflow/connections", "variables_prefix": "airflow/variables"}

# Prefect - use Secret blocks
import sqlalchemy as sa
from prefect import task
from prefect.blocks.system import Secret


@task
def load_from_warehouse():
    # The block named "warehouse-password" must already exist in Prefect
    db_password = Secret.load("warehouse-password").get()
    engine = sa.create_engine(f"postgresql://user:{db_password}@host/db")
    ...


# Dagster - use environment variable resources
from dagster import EnvVar, resource


@resource
def database_resource():
    # DatabaseConnection is a placeholder for your own client class
    return DatabaseConnection(
        host=EnvVar("DB_HOST").get_value(),
        password=EnvVar("DB_PASSWORD").get_value(),
    )
```
Concurrency Control
Prevent pipelines from overwhelming downstream systems by limiting concurrent task execution:
```python
# Airflow - use pools to limit concurrency per resource
from airflow.operators.python import PythonOperator

# Create a pool limiting Snowflake concurrent connections to 5
# airflow pools set snowflake_pool 5 "Limit concurrent Snowflake connections"
# Then assign tasks to the pool (run_query is your own callable):
task = PythonOperator(
    task_id="query_snowflake",
    python_callable=run_query,
    pool="snowflake_pool",
    pool_slots=1,
)

# Prefect - use a global concurrency limit named "snowflake"
# (the limit itself is configured on the Prefect server or Prefect Cloud)
from prefect import task
from prefect.concurrency.sync import concurrency


@task
def query_snowflake(query: str):
    # Occupy one slot of the limit while the query runs
    with concurrency("snowflake", occupy=1):
        return execute_query(query)  # execute_query is your own helper
```
Environment-Based Configuration
Pipelines need different configuration in dev, staging, and production. Centralize this:
```python
# config.py - shared across all orchestrators
import os
from dataclasses import dataclass


@dataclass
class PipelineConfig:
    db_host: str
    s3_bucket: str
    mlflow_uri: str
    log_level: str
    dry_run: bool


def get_config() -> PipelineConfig:
    env = os.environ.get("PIPELINE_ENV", "dev")
    configs = {
        "dev": PipelineConfig(
            db_host="localhost",
            s3_bucket="my-company-dev",
            mlflow_uri="http://localhost:5000",
            log_level="DEBUG",
            dry_run=True,
        ),
        "staging": PipelineConfig(
            db_host="staging-db.internal",
            s3_bucket="my-company-staging",
            mlflow_uri="http://mlflow-staging.internal",
            log_level="INFO",
            dry_run=False,
        ),
        "prod": PipelineConfig(
            db_host="prod-db.internal",
            s3_bucket="my-company-prod",
            mlflow_uri="http://mlflow.internal",
            log_level="WARNING",
            dry_run=False,
        ),
    }
    return configs[env]
```
Logging Best Practices
Structured logging makes pipeline debugging tractable at scale:
```python
import json
import logging
from datetime import datetime, timezone
from typing import Optional


class StructuredLogger:
    def __init__(self, pipeline_name: str, execution_date: str):
        self.pipeline = pipeline_name
        self.date = execution_date
        self.logger = logging.getLogger(pipeline_name)

    def info(self, message: str, **kwargs):
        self.logger.info(json.dumps({
            "level": "INFO",
            "pipeline": self.pipeline,
            "execution_date": self.date,
            # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated)
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "message": message,
            **kwargs,
        }))

    def error(self, message: str, exc: Optional[Exception] = None, **kwargs):
        self.logger.error(json.dumps({
            "level": "ERROR",
            "pipeline": self.pipeline,
            "execution_date": self.date,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "message": message,
            "exception": str(exc) if exc else None,
            **kwargs,
        }))


# Usage in any pipeline task:
# Without a configured level, Python's root logger drops INFO records.
logging.basicConfig(level=logging.INFO)
log = StructuredLogger("feature_engineering", "2024-01-15")
log.info("Computed features", row_count=10_000, duration_seconds=42.1)
```
Structured logs flow into log aggregation systems (Datadog, Splunk, CloudWatch) where you can query across pipeline runs: "show me all runs where row_count dropped below 8000" becomes a simple query rather than a manual log search.
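If you do not yet have a log aggregation system, the same query works over raw log lines. The sketch below assumes each line is the JSON payload that `StructuredLogger.info()` emits; the sample lines and the `runs_below` helper are made up for illustration.

```python
import json

# Hypothetical log lines in the shape StructuredLogger.info() emits,
# plus one unstructured line to show that it is skipped safely.
LOG_LINES = [
    '{"level": "INFO", "pipeline": "feature_engineering", '
    '"execution_date": "2024-01-14", "message": "Computed features", "row_count": 10000}',
    '{"level": "INFO", "pipeline": "feature_engineering", '
    '"execution_date": "2024-01-15", "message": "Computed features", "row_count": 7200}',
    "scheduler heartbeat ok",
]


def runs_below(lines: list[str], field: str, threshold: float) -> list[dict]:
    """Return parsed log records whose numeric `field` is below `threshold`."""
    matches = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # not a structured log line
        if not isinstance(record, dict):
            continue
        value = record.get(field)
        if isinstance(value, (int, float)) and value < threshold:
            matches.append(record)
    return matches


low = runs_below(LOG_LINES, "row_count", 8000)
for record in low:
    print(record["execution_date"], record["row_count"])
```

The filter is only possible because `row_count` was logged as a typed field rather than embedded in a message string, which is the practical argument for structured logging.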
