Choosing an Orchestrator
The Startup CTO's Dilemma
It is Monday morning. You are the CTO of a 30-person startup that has just closed a Series A. You have 4 ML engineers, 3 data scientists, and a product that is increasingly dependent on ML-driven features. Your current "pipeline" is a collection of Python scripts on a cron job running on a single EC2 instance. The cron job fails silently. Nobody knows when it last ran successfully. The last time someone changed the preprocessing script, it broke the training job for three days before anyone noticed.
Your board deck for the next round includes "production ML infrastructure" as a key investment area. You need to pick an orchestration platform now - before you hire more ML engineers and the technical debt compounds. You open a browser and search "best MLOps orchestration tool 2024." You get 12 blog posts, each recommending a different tool, each written by the vendor of that tool.
The tools in the field: Apache Airflow, Prefect, Kubeflow Pipelines, Metaflow, ZenML, Dagster. Each has genuine strengths. Each has genuine weaknesses. Each has a community of engineers who will passionately argue it is the right choice. The decision you make will shape your ML infrastructure for the next two to three years. Getting it wrong means a painful migration during a period when you should be focused on model quality and product delivery.
This lesson provides an honest, structured framework for making this decision. We compare tools across eight dimensions that actually matter in production, build a scoring matrix for three archetypes of teams, analyze migration paths and total cost of ownership, and end with a concrete decision tree that gives you "the right answer" for your specific situation - even if the right answer is "none of the above."
:::tip 🎮 Interactive Playground Visualize this concept: Try the ML Pipeline Orchestration demo on the EngineersOfAI Playground - no code required. :::
Why This Decision Is Hard
Choosing an orchestrator is not a pure technical decision. It involves:
- Team composition: Data scientists prefer Pythonic APIs. Platform engineers prefer Kubernetes-native tools. The tools optimized for each are different.
- Infrastructure maturity: A team with Kubernetes expertise finds KFP natural. A team without Kubernetes experience will spend months learning it before getting value.
- ML maturity: A team running 2 models benefits from a simple tool. A team running 200 models needs metadata lineage, experiment tracking, and automated retraining.
- Budget: Some tools are free (Airflow, Metaflow OSS). Others have significant managed service costs (Prefect Cloud, KFP on GKE).
- Lock-in risk: Kubeflow is tightly coupled to Kubernetes. Metaflow is tightly coupled to AWS (partially). Prefect Cloud is tightly coupled to Prefect Inc.
The tools are not interchangeable. There is no universally correct answer. But there is a correct answer for your specific context - and there is a rigorous way to find it.
The Six Contenders
Before scoring, a brief, honest description of each tool.
Apache Airflow
The incumbent. Launched at Airbnb in 2014, open-sourced in 2015, donated to the Apache Software Foundation in 2019. The most widely deployed workflow orchestrator in the world. Its ubiquity means every cloud provider offers a managed version (Google Cloud Composer, AWS MWAA, Astronomer). It has the largest community, the most tutorials, and the most integrations.
Its weakness for ML is structural: DAGs are defined at import time, not at execution time. Dynamic pipelines require workarounds. The operator pattern is verbose. XCom size limits make passing large artifacts between tasks awkward. Airflow was built for data engineering - it works for ML workflows but it fights you at every turn.
Best for: Teams that already run Airflow for ETL and want to add ML pipelines without introducing a new tool.
Prefect
The modern Python-first orchestrator. Prefect 2.x (2022) threw away the Airflow-inspired architecture and rebuilt around decorated Python functions. Flows are Python functions. Tasks are Python functions. Everything is runtime - no DAG parsing, no import-time evaluation. The observability dashboard (Prefect Cloud) is excellent. The learning curve is low for Python developers.
Its weakness: the managed offering (Prefect Cloud) has per-run pricing that scales aggressively with pipeline frequency. Self-hosting Prefect Server requires managing a PostgreSQL database and server process. The ML-specific features (artifact tracking, model registry integration) are less developed than ZenML or KFP.
Best for: Python-heavy teams migrating from Airflow who want better developer experience without heavy infrastructure investment.
Kubeflow Pipelines
The Kubernetes-native ML orchestration platform. Every step runs in an isolated container. ML Metadata tracks every artifact with full lineage. The pipeline compiler produces backend-agnostic IR YAML. Vertex AI Pipelines uses the same SDK, enabling identical pipelines to run on GKE or Google's managed infrastructure.
Its weakness: it requires Kubernetes. Running KFP locally for development is cumbersome (requires Minikube or Kind). The component authoring model (import-inside-function, typed artifacts) is unfamiliar to most data scientists. The startup time for each step (container pull, pod scheduling) adds minutes of overhead to pipelines that run frequently.
Best for: Teams with strong Kubernetes expertise running on GCP, AWS, or hybrid infrastructure where container isolation and artifact lineage are first-class requirements.
Metaflow
Netflix's contribution to the open-source MLOps ecosystem. The most data-scientist-friendly orchestrator - write a Python class, run it locally, add @batch to scale to AWS. The artifact persistence model (assign to self, Metaflow handles the rest) is the most intuitive of any orchestrator. The foreach pattern for dynamic parallelism is elegant.
Its weakness: the open-source version requires AWS for cloud compute (though Kubernetes support has improved). The metadata service is minimal in OSS - you get run history but no lineage graph or artifact browser without deploying the Metaflow UI separately. The flow-step model does not map naturally to complex conditional pipelines.
Best for: Data science-heavy teams (more researchers than platform engineers) running primarily on AWS.
ZenML
The portability-first orchestrator. ZenML's stack abstraction separates pipeline logic from infrastructure - the same pipeline runs locally, on Prefect, on Airflow, on Vertex AI, or on Kubeflow without code changes. The integrations ecosystem covers every major MLOps tool. The materializer system handles custom artifact types cleanly.
Its weakness: the abstraction layer adds complexity. Debugging ZenML issues requires understanding both the ZenML layer and the underlying orchestrator (Prefect, Airflow, etc.). The OSS version lacks the UI and collaboration features of ZenML Cloud. For simple pipelines, the stack configuration overhead is more complexity than it provides value.
Best for: Teams that expect to change infrastructure (cloud migration, tool consolidation) or that need to run the same pipeline on multiple stacks (local dev, staging on Prefect, production on Vertex AI).
Dagster
The data asset-centric orchestrator. Dagster's core abstraction is the software-defined asset - instead of defining tasks, you define the data assets your pipeline produces. The scheduler figures out what needs to run to materialize requested assets. This model is excellent for complex data pipelines with multiple consumers of the same assets. The Dagster UI (Dagit) is the best observability dashboard of any open-source orchestrator.
Its weakness for ML: the asset model is less natural for training pipelines where the "asset" is a model, not a dataset. The learning curve is steeper than Prefect or Metaflow. Enterprise features (sensors, partitions, asset checks) require time to learn and configure correctly.
Best for: Data engineering-heavy teams where ML pipelines share infrastructure with complex ETL pipelines, and where the same dataset is consumed by multiple downstream processes.
Comparison Matrix
Dimension Scores (1-5 scale)
| Dimension | Airflow | Prefect | KFP | Metaflow | ZenML | Dagster |
|---|---|---|---|---|---|---|
| Learning curve (5=easy) | 2 | 4 | 2 | 5 | 3 | 2 |
| ML-specific features | 2 | 3 | 5 | 4 | 5 | 3 |
| Dynamic pipelines | 2 | 5 | 3 | 4 | 4 | 4 |
| Kubernetes required (5=no) | 5 | 5 | 1 | 4 | 4 | 5 |
| Cloud portability | 5 | 4 | 3 | 3 | 5 | 5 |
| Community size | 5 | 4 | 3 | 3 | 2 | 3 |
| Observability/UI | 3 | 4 | 4 | 2 | 3 | 5 |
| Local dev experience | 2 | 5 | 2 | 5 | 4 | 4 |
| Enterprise support | 5 | 4 | 4 | 3 | 3 | 4 |
| Artifact lineage | 1 | 2 | 5 | 3 | 4 | 4 |
Weighted Scores for Three Team Archetypes
Archetype A: Startup (5-10 engineers, 1-3 models, no Kubernetes)
Weights: Learning curve (30%), Local dev (25%), ML features (20%), Kubernetes not required (15%), Community (10%)
| Tool | Weighted Score |
|---|---|
| Metaflow | 4.3 |
| Prefect | 4.1 |
| ZenML | 3.4 |
| Airflow | 2.8 |
| Dagster | 3.2 |
| KFP | 1.9 |
Archetype B: Growth-Stage Company (20-50 engineers, 10-50 models, Kubernetes available)
Weights: ML features (25%), Artifact lineage (20%), Observability (20%), Dynamic pipelines (20%), Learning curve (15%)
| Tool | Weighted Score |
|---|---|
| KFP | 4.0 |
| ZenML | 3.9 |
| Dagster | 3.6 |
| Prefect | 3.3 |
| Metaflow | 3.2 |
| Airflow | 2.3 |
Archetype C: Enterprise (50+ engineers, 100+ models, multi-cloud, strict compliance)
Weights: Cloud portability (25%), Enterprise support (20%), Artifact lineage (20%), Community (20%), ML features (15%)
| Tool | Weighted Score |
|---|---|
| Airflow (managed) | 4.0 |
| ZenML | 3.8 |
| KFP (Vertex AI) | 3.7 |
| Dagster | 3.6 |
| Prefect Cloud | 3.4 |
| Metaflow | 2.9 |
Anti-Patterns
Understanding what not to do is as important as knowing what to do.
Over-Orchestration
The most common MLOps mistake: adopting a complex orchestration platform before you have the volume of pipelines that justifies it. A team running 2 models that retrain weekly does not need Kubeflow Pipelines. They need a Python script on a cron job with good logging. The value of an orchestration platform scales with the number of pipelines, the frequency of runs, and the complexity of dependencies.
# This is fine for a team with 2 models
import schedule
import time
from my_pipeline import run_training
schedule.every().monday.at("02:00").do(run_training)
while True:
schedule.run_pending()
time.sleep(60)
Do not introduce Kubeflow for this. The overhead of Kubernetes, container images, and KFP component authoring will slow you down. Start simple. Add complexity when the simple approach breaks.
DAG Complexity Creep
A pipeline that starts as 5 tasks becomes 50 tasks over 18 months. Each new requirement adds a new task, a new dependency, a new conditional branch. The DAG becomes impossible to reason about. No single engineer understands it fully. Failures in one part have non-obvious effects on other parts.
Prevention: set a hard limit on DAG size (20 tasks max). When a pipeline exceeds the limit, break it into sub-pipelines with clear input/output contracts. Use data artifacts (files in S3/GCS) as the interface between sub-pipelines, not in-memory task dependencies.
Tool-First Thinking
Choosing an orchestrator because it is popular or because a conference talk impressed you - then designing your pipelines around the tool's quirks. Airflow teams write code that only works with Airflow. KFP teams write components so tightly coupled to the KFP SDK that they cannot be tested independently.
Prevention: design your ML code independently of the orchestrator. The training function, the feature engineering function, and the evaluation function should all be importable and runnable without any orchestration framework. The orchestrator is a wrapper around your code, not the foundation it is built on.
Premature Cloud Lock-in
Choosing an orchestrator that ties you to a specific cloud provider before you know which cloud provider you will use long-term. A startup that adopts Metaflow heavily in Year 1 and then moves from AWS to GCP in Year 2 will spend three months migrating flows and rewriting @batch decorators to @kubernetes.
Prevention: if you expect infrastructure to change, prefer tools with cloud portability (ZenML, Airflow, Prefect) over tools tightly coupled to a specific cloud (Metaflow/AWS, early-era KFP/GKE).
Migration Paths
Airflow 1.x to Prefect
The most common migration in 2023-2024. Airflow 1.x's age means: no async, limited retry configurability, confusing state model. The migration path is well-worn.
Step 1: Map each PythonOperator to a @task function. The Python callables are usually reusable as-is.
# Airflow 1.x
def _preprocess_data(**context):
ds = context["ds"]
df = load_data(ds)
features = engineer_features(df)
features.to_parquet(f"/tmp/features/{ds}.parquet")
preprocess_op = PythonOperator(
task_id="preprocess_data",
python_callable=_preprocess_data,
provide_context=True,
dag=dag,
)
# Prefect equivalent
@task(retries=3, retry_delay_seconds=exponential_backoff(2))
def preprocess_data(date: str) -> pd.DataFrame:
df = load_data(date)
features = engineer_features(df)
return features # Pass directly, no temp files
Step 2: Replace >> dependency chains with direct Python function calls.
# Airflow
preprocess_op >> train_op >> evaluate_op >> deploy_op
# Prefect - just Python
@flow
def training_pipeline(date: str):
features = preprocess_data(date)
model = train_model(features)
metrics = evaluate_model(model, features)
deploy_model(model, metrics)
Step 3: Add retry and caching logic - this is where you immediately get value over Airflow.
Step 4: Create a Prefect deployment to replace the Airflow schedule.
Step 5: Run both systems in parallel for 2 weeks, verify outputs match, then decommission Airflow.
Airflow to Kubeflow Pipelines
More complex - requires containerizing each step.
# Airflow - inline Python
def _train_model(**context):
import pickle
model = RandomForestClassifier().fit(X_train, y_train)
with open("/tmp/model.pkl", "wb") as f:
pickle.dump(model, f)
# KFP - containerized component
@dsl.component(
base_image="python:3.11-slim",
packages_to_install=["scikit-learn", "pandas"],
)
def train_model(
train_data: Input[Dataset],
trained_model: Output[Model],
n_estimators: int = 100,
):
import pickle
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
df = pd.read_parquet(train_data.path)
feature_cols = [c for c in df.columns if c != "label"]
model = RandomForestClassifier(n_estimators=n_estimators)
model.fit(df[feature_cols], df["label"])
with open(trained_model.path, "wb") as f:
pickle.dump(model, f)
The migration effort is higher, but the result is better: containerized steps, MLMD tracking, and portable IR YAML.
Total Cost of Ownership Analysis
Cost analysis assumes a medium-sized team (10 ML engineers) running 20 pipelines that execute 10 times per week each (200 runs/week, ~10,000 runs/year).
Apache Airflow (Astronomer managed)
- Astronomer pricing: ~$500-2,000/month depending on workers
- Infrastructure: managed - no ops overhead
- Annual TCO: ~$6,000-24,000 (managed) + 0.5 FTE for pipeline maintenance
Apache Airflow (self-hosted on EC2)
- 2x m5.xlarge (scheduler + webserver): ~$300/month
- RDS PostgreSQL (metadata DB): ~$150/month
- Engineer time for ops: 0.25 FTE (maintaining Airflow infra)
- Annual TCO: ~$5,400 (infra) + 0.25 FTE
Prefect Cloud
- Prefect Cloud pricing: ~$500-2,000/month for 10,000 runs
- Infrastructure: managed Prefect Cloud + your own workers (2x t3.medium workers)
- Worker infrastructure: ~$100/month
- Annual TCO: ~$7,200-25,200 (managed) + 0.1 FTE
Kubeflow Pipelines (on GKE)
- GKE cluster (3-node n2-standard-4 for KFP control plane): ~$400/month
- Worker nodes (auto-scaling, ~60% utilization): ~$800/month
- Artifact store (GCS): ~$50/month
- Engineer time for KFP ops: 0.5 FTE (cluster management, component authoring)
- Annual TCO: ~$15,000 (infra) + 0.5 FTE
Metaflow (OSS + AWS Batch)
- Metaflow metadata service (optional, adds UI): ~$100/month on EC2
- AWS Batch: pay-per-use, ~$200-500/month for 10,000 runs
- S3 artifacts: ~$50/month
- Engineer time: 0.1 FTE
- Annual TCO: ~$4,200-7,800 (infra) + 0.1 FTE
ZenML (OSS server + Prefect orchestrator)
- Self-hosted ZenML server: ~$100/month (small EC2 + RDS)
- Prefect Cloud (orchestrator): ~$500-1,000/month
- Artifact store (S3/GCS): ~$50/month
- Engineer time: 0.2 FTE (stack management)
- Annual TCO: ~$7,800-13,800 (infra) + 0.2 FTE
The Decision Tree
Feature Comparison: Quick Reference
| Feature | Airflow | Prefect | KFP | Metaflow | ZenML | Dagster |
|---|---|---|---|---|---|---|
| Local dev | Painful | Excellent | Painful | Excellent | Good | Good |
| Dynamic tasks | Limited | Full | Moderate | foreach only | Good | Good |
| Container isolation | No | Optional | Yes | Optional | Optional | Optional |
| Artifact lineage | No | Basic | Full (MLMD) | Good | Good | Excellent |
| Retries | Verbose | Trivial | Component-level | @retry decorator | Step-level | Policy-based |
| Caching | No | input hash | Full | No built-in | Input hash | Asset-level |
| Experiment tracking | Plugin | Plugin | Built-in | Via integrations | Via stack | Plugin |
| Cloud scaling | Worker-based | Worker-based | Pod-per-step | @batch / @k8s | Via orchestrator | Worker-based |
| K8s required | No | No | Yes | No | No | No |
| Open-source | Yes | Yes | Yes | Yes | Yes | Yes |
| Managed offering | Astronomer, MWAA, Composer | Prefect Cloud | Vertex AI | Outerbounds | ZenML Cloud | Dagster Cloud |
The Honest Recommendation
After all the analysis, the decision reduces to three heuristics:
Heuristic 1: Start with the simplest thing that could possibly work. If you have fewer than 5 pipelines and a team of fewer than 10, Prefect is almost always the right choice. It is the lowest friction path from "scripts on a cron job" to "properly orchestrated pipelines." The developer experience is excellent. The learning curve is minimal. You can be running in production in a week.
Heuristic 2: Match the tool to the team's dominant skill set. KFP is the right choice for platform engineering-heavy teams who already live in Kubernetes. Metaflow is the right choice for data science-heavy teams who live in Jupyter notebooks and AWS. Airflow is the right choice for teams with strong data engineering culture. Forcing a data science team to write KFP components is a recipe for resentment and poor adoption.
Heuristic 3: Optimize for operational simplicity over feature richness. The best orchestrator is the one your team will actually use and maintain. A team running Prefect with 10 well-maintained flows is more effective than a team running Kubeflow with 50 broken pipelines. Do not choose based on the feature list - choose based on what your team will realistically operate at 2 AM when something fails.
Common Mistakes
:::danger Migrating Orchestrators Mid-Project The worst time to change your orchestration tool is in the middle of a major ML initiative. Migrations require rewriting pipelines, retraining team members, and running systems in parallel. The disruption is significant. If you are mid-project and your current orchestrator is "good enough," finish the project, then migrate. Never migrate orchestrators and ship new model features at the same time. :::
:::danger Evaluating Tools on Toy Pipelines A tool that is easy to set up for a "hello world" pipeline may be terrible in production. Evaluation criteria that matter:
- How does it behave when step 7 of 12 fails in production at 3 AM?
- How long does it take to add a new step to an existing pipeline?
- How do you debug a step that is silently producing wrong outputs?
- How do you re-run a failed pipeline from the middle, not from the beginning?
Run a realistic pipeline under failure conditions before committing to a tool. :::
:::warning Underestimating Migration Costs Teams consistently underestimate what it takes to migrate from Airflow to another orchestrator. It is not just rewriting the DAGs. It is:
- Retraining every ML engineer and data scientist
- Migrating historical run metadata
- Updating monitoring and alerting integrations
- Updating CI/CD pipelines
- Handling edge cases in production pipelines that took years to discover
Budget 3-6 months of engineering time for a migration involving 20+ pipelines. Do not let vendors tell you otherwise. :::
:::warning Choosing Based on Hype Cycles Every 18 months, a new orchestration tool becomes "the one everyone is switching to." The community hype cycle does not correlate with production reliability. Airflow has survived multiple predicted deaths because its ubiquity, managed offerings, and operator ecosystem make it genuinely useful despite its limitations. New tools fail to gain enterprise traction all the time regardless of technical merit. Evaluate based on your team's needs and the tool's production track record, not on recent conference buzz. :::
Interview Questions and Answers
Q1: A startup CTO asks you: "We have 3 ML engineers and 2 models. Should we invest in Kubeflow Pipelines?" What do you say?
Almost certainly not yet. Kubeflow Pipelines is designed for teams with Kubernetes expertise, multiple models in production, and a need for containerized isolation and MLMD artifact lineage. For a team with 2 models and 3 engineers, the overhead of maintaining a Kubernetes cluster, building container images for every pipeline step, and learning the KFP SDK's component authoring model will consume engineering time that should go toward model quality and product delivery.
The better answer for this team is Prefect or Metaflow - both have local development experience that works well from day one, both have cloud scaling as a later addition, and both have a learning curve that does not require Kubernetes expertise. When the team has 10-15 models in production and starts feeling the pain of lacking artifact lineage and container isolation, revisiting KFP or Vertex AI Pipelines makes sense.
Q2: What are the key differences between Prefect's task model and Airflow's operator model?
In Airflow, tasks are defined using Operators - classes that encapsulate a specific type of work (PythonOperator, BashOperator, BigQueryOperator). The task graph is defined in a DAG file using >> dependency syntax. The entire DAG is evaluated at import time by the Airflow scheduler, which means the task graph is static. You cannot create tasks dynamically based on runtime data without complex workarounds like dynamic task mapping (added in Airflow 2.3).
In Prefect, tasks are just Python functions decorated with @task. The flow function determines the task graph at runtime when it executes - the graph is not known until the flow actually runs. This means Prefect flows can naturally use Python loops to generate dynamic task graphs, branch conditionally based on real data, and call the same task multiple times with different inputs. For ML workflows that require dynamic hyperparameter sweeps or conditional retraining, this runtime evaluation model is a significant advantage.
Q3: When would you choose ZenML over Prefect or Metaflow?
Choose ZenML when infrastructure portability is a primary requirement. The three scenarios where ZenML is clearly the right choice:
First, the team needs to develop locally (on a laptop with SQLite and local file storage) and deploy to production on Vertex AI or SageMaker without code changes. ZenML's stack abstraction handles this directly - switch the active stack, run the same code.
Second, the team expects to migrate cloud providers within the next 1-2 years. Starting with ZenML means the pipeline code is not coupled to the current cloud provider.
Third, the team uses multiple orchestrators - for example, Airflow for ETL and Prefect for ML training pipelines - and wants a single abstraction layer that works with both. ZenML can use Airflow, Prefect, or Kubeflow as its orchestrator component.
Choose Prefect over ZenML when portability is not a concern and you want the simplest possible Python-first experience. Choose Metaflow over ZenML when your team is data scientist-heavy, you run primarily on AWS, and you prioritize the simplest possible authoring experience.
Q4: What is the total cost of ownership difference between self-hosted Airflow and Prefect Cloud?
Self-hosted Airflow requires: a PostgreSQL database for the metadata store, a scheduler process, a webserver process, and Celery workers (or the KubernetesExecutor) for task execution. For a medium team, this typically runs on 3-5 EC2 instances at roughly $300-600/month in infrastructure costs plus 0.25-0.5 FTE of engineer time for maintenance (upgrades, debugging scheduler issues, managing worker autoscaling). The hidden cost is the operational knowledge burden - Airflow's architecture has enough moving parts that you need at least one engineer who understands it deeply.
Prefect Cloud offloads the API server, dashboard, and metadata storage to Prefect Inc. You still run Prefect Workers on your own infrastructure, but workers are simple processes - much easier to manage than Airflow's architecture. Prefect Cloud pricing is typically $500-2,000/month for a medium team. The operational burden drops significantly. The tradeoff is vendor dependency and cost that scales with pipeline volume.
For most teams, Prefect Cloud is cheaper when you factor in engineer time. Self-hosted Airflow is cheaper in infrastructure dollars but expensive in engineering hours.
Q5: How do you evaluate an orchestration tool in a proof-of-concept before committing to it?
Run four tests:
First, implement a realistic pipeline - not a toy example. Use your actual ML code (data loading, feature engineering, training, evaluation). This surfaces the friction in the component authoring model.
Second, simulate a failure. Deliberately cause step 5 of 10 to fail. Evaluate: how quickly can you identify which step failed and why? How do you re-run from step 5 without re-running steps 1-4? The failure experience is as important as the success experience.
Third, measure the operational overhead. After the pipeline is running, how long does it take to add a new step? How long does it take to change a step's resource allocation? These tasks represent ongoing maintenance work.
Fourth, evaluate the observability. Can you tell, at a glance, which pipeline runs succeeded last week? Which runs failed? What was the model AUC from last Tuesday's run? Poor observability means flying blind in production.
Score each tool on these four tests against your team's specific context. The tool that scores highest on your tests - not on a generic feature matrix - is the right tool for your team.
Q6: What are the warning signs that a team has outgrown their current orchestrator?
Five clear signals:
One: pipelines regularly fail and no one knows until a downstream system breaks. The orchestrator has poor observability or the team has not invested in monitoring.
Two: adding a new pipeline takes more than a week. The authoring model is too complex or the deployment process is too slow.
Three: the pipeline team cannot answer "which dataset version trained the model in production?" Artifact lineage is missing.
Four: experiments cannot be reproduced because the environment that produced them is not captured. Container isolation or environment management is absent.
Five: onboarding a new ML engineer to the pipeline takes more than a month. The orchestrator's complexity creates a steep barrier to entry that blocks team growth.
When you see three or more of these signals, the cost of staying with the current tool exceeds the cost of migrating. The migration is painful, but the status quo is more painful.
Practical Evaluation Checklist
Use this checklist when evaluating any orchestration tool for production adoption.
Development Experience
- Can a new engineer run the first pipeline locally in under one hour?
- Can you write and test a step without running the full pipeline?
- Does the tool support iterative development (run from a specific step, not always from the beginning)?
- Are error messages actionable? Do they tell you what went wrong and where?
- Does the tool work well with standard Python IDEs (type hints, autocomplete)?
Failure Handling
- Can you configure per-step retry with exponential backoff?
- Can you re-run a failed pipeline from the failed step without re-running succeeded steps?
- Are failure notifications configurable (Slack, PagerDuty, email)?
- Can you inspect the state of a running pipeline without stopping it?
- Does the tool distinguish between transient failures (retriable) and permanent failures (require intervention)?
Observability
- Can you see a history of all pipeline runs with their status, duration, and parameters?
- Can you access step-level logs without SSHing into a worker machine?
- Does the tool track which artifacts were produced by which runs?
- Can you compare metrics between runs (run A vs run B)?
- Are there alerts when a pipeline run takes significantly longer than usual?
Operations
- How do you upgrade the orchestrator without downtime?
- How do you scale the number of concurrent pipeline runs?
- What happens when the orchestrator's control plane goes down? Do running pipelines continue?
- How do you manage secrets (API keys, database passwords) used by pipeline steps?
- Can the team on-call debug pipeline failures without needing deep orchestrator expertise?
Team Adoption
- How long does it take to train a new data scientist to write their first pipeline?
- Can data scientists test pipeline changes without needing to deploy to the shared environment?
- Is the tool's documentation comprehensive and up-to-date?
- Is the community active (GitHub issues answered, Stack Overflow responses)?
Combining Tools: The Hybrid Architecture Pattern
For larger teams, the answer is often not one tool but a combination. A common production architecture:
In this architecture:
- Airflow handles data engineering - ETL pipelines, dbt transformations, data quality checks. This is where Airflow excels. The DAG model works well for predictable, time-based data workflows.
- Prefect handles ML training pipelines - triggered by Airflow (when new data is ready) or by a monitoring event (when model drift is detected). Prefect's Python-first model suits ML workflows better than Airflow.
- MLflow handles experiment tracking and model registration - decoupled from both orchestrators.
- Seldon Core handles model serving - receives approved models from the registry.
- Evidently AI monitors deployed models and emits drift alerts when distribution shifts are detected.
This hybrid approach avoids the mistake of forcing one tool to do everything. Airflow is excellent at what it was built for. Prefect is excellent at what it was built for. Using each where it fits, and connecting them via an event bus, gives you the best of both.
Vendor Lock-in Risk Assessment
Every tool carries some lock-in risk. Here is an honest assessment for each:
Airflow Lock-in Risk: Low
Airflow DAGs are Python files. Your task logic (the Python functions called by PythonOperators) is completely portable. The DAG definition syntax (@dag, >>, @task) is Airflow-specific but not deeply embedded in your ML logic. Migrating away from Airflow means rewriting DAG definitions but not the underlying ML code. The community is large enough that Airflow is unlikely to disappear or dramatically change APIs. Lock-in risk: low.
Prefect Lock-in Risk: Low-Medium
Prefect flows use @flow and @task decorators. Your ML logic inside tasks is plain Python. The decorators themselves are not deeply coupled to ML business logic - removing them would leave runnable Python. Prefect Cloud is a commercial product; if Prefect Inc. changed pricing significantly, migrating to self-hosted Prefect Server or another tool would require work but not a ground-up rewrite. Lock-in risk: low-medium.
Kubeflow Pipelines Lock-in Risk: Medium
KFP components must import everything inside the function body. The Input[Dataset], Output[Model] typed artifact system is KFP-specific. Migrating away from KFP requires rewriting component authoring - the underlying ML logic is extractable but the component interface must be rebuilt. The IR YAML format is compatible with Vertex AI Pipelines, so migration within the KFP ecosystem is low-risk. Migration away from KFP to a non-KFP tool is significant work. Lock-in risk: medium.
Metaflow Lock-in Risk: Medium
Metaflow's FlowSpec and @step decorators are framework-specific. The artifact persistence model (self.x = value) is Metaflow-specific. Migrating away requires rewriting the flow class structure, but the underlying ML logic (sklearn models, pandas transformations) is extractable. AWS Batch coupling adds cloud lock-in on top of framework lock-in if you rely heavily on @batch. Lock-in risk: medium.
ZenML Lock-in Risk: Low
ZenML's @step and @pipeline decorators are lightweight. The underlying ML logic is plain Python. Switching from ZenML to Prefect requires replacing decorators and removing stack configuration - the ML logic is unchanged. ZenML is explicitly designed to be portable, so its own lock-in is minimal. However, if you use ZenML Cloud for collaboration features, you have some vendor dependency. Lock-in risk: low.
The One Thing Most Teams Get Wrong
After all the frameworks, decision trees, and comparison matrices, there is a single mistake that causes more MLOps pain than any tool choice: teams treat orchestration as a platform problem when it is actually a process problem.
The best orchestrator in the world does not help if:
- Engineers do not write pipelines - they run scripts manually because the pipeline authoring is too complex
- Pipelines are not monitored - failures are discovered by customers, not alerts
- There is no review process for pipeline changes - a bad change goes straight to production
- Artifacts are not tracked - nobody knows which model is in production or how it was trained
The orchestrator enables good practices. It does not enforce them. A team with good ML engineering discipline will succeed with any of the tools in this lesson. A team with poor discipline will fail with all of them.
Choose the simplest tool your team will actually use. Build the processes: code review for pipeline changes, alerts for pipeline failures, a documented runbook for on-call engineers. The tooling choice matters less than the operational discipline.
The right orchestrator is the one your team runs in production, monitors in production, and fixes at 3 AM without panic. That property belongs to the team, not the tool.
Orchestrator Feature Deep-Dive: Scheduling Models
Every orchestrator has a scheduling model - the mechanism that determines when and how pipeline runs are triggered. Understanding the differences prevents nasty surprises in production.
Time-Based Scheduling
All orchestrators support cron-based scheduling. The key differences are in reliability and backfill behavior.
Airflow uses cron expressions with a concept of "execution_date" - the logical date a DAG run represents, which can be different from the actual time it runs. This backfill model is powerful for data engineering (re-run yesterday's ETL) but confusing for ML (retraining pipeline "for" January 15th may not make sense semantically).
# Airflow - execution_date-aware scheduling
@dag(
schedule_interval="0 2 * * *", # 2 AM daily
start_date=datetime(2024, 1, 1),
catchup=False, # Don't backfill missed runs
)
def training_dag():
pass
Prefect uses simpler cron without execution_date semantics:
from prefect.deployments import Deployment
from prefect.server.schemas.schedules import CronSchedule
deployment = Deployment.build_from_flow(
flow=training_pipeline,
name="nightly-training",
schedule=CronSchedule(cron="0 2 * * *", timezone="UTC"),
)
Metaflow does not have built-in scheduling - you use --with batch and an external cron (AWS EventBridge, GitHub Actions) to trigger runs.
Event-Driven Scheduling
Event-driven scheduling - "run when new data arrives" - is increasingly important for ML pipelines that retrain on demand rather than on a fixed schedule.
Airflow supports sensors (S3KeySensor, HttpSensor, etc.) that poll for conditions and trigger DAGs. The polling interval has a minimum of seconds, which can be resource-intensive.
Prefect supports automations triggered by events:
# Prefect 3.x - trigger on S3 upload event
from prefect.events import DeploymentEventTrigger
trigger = DeploymentEventTrigger(
expect={"prefect.resource.id": "s3://data-bucket/new-data/*"},
deployment_id="fraud-detector-deployment-id",
)
Dagster has the most sophisticated event-driven model via its sensor system:
from dagster import sensor, RunRequest, SkipReason
@sensor(job=training_job)
def new_data_sensor(context):
"""Trigger training when new data appears in S3."""
import boto3
s3 = boto3.client("s3")
objects = s3.list_objects_v2(
Bucket="data-lake",
Prefix="fraud/incoming/",
StartAfter=context.cursor or "",
)
new_objects = objects.get("Contents", [])
if not new_objects:
return SkipReason("No new data")
latest_key = new_objects[-1]["Key"]
yield RunRequest(
run_key=latest_key,
run_config={"ops": {"load_data": {"config": {"s3_key": latest_key}}}},
)
context.update_cursor(latest_key)
Orchestrator Interoperability: Using Multiple Tools Together
Real production environments often need to integrate orchestrators with each other and with the broader data stack. Here are the most common integration patterns.
Airflow Triggering Prefect
When your team runs Airflow for ETL and Prefect for ML training, you need Airflow to trigger Prefect flows when data is ready:
# Airflow DAG that triggers a Prefect flow
from airflow.operators.python import PythonOperator
from prefect.client.orchestration import get_client
import asyncio
def trigger_prefect_training(**context):
"""Trigger Prefect training flow from Airflow."""
ds = context["ds"] # execution date
async def _trigger():
async with get_client() as client:
deployment = await client.read_deployment_by_name(
"fraud-detection-training/nightly"
)
flow_run = await client.create_flow_run_from_deployment(
deployment_id=deployment.id,
parameters={"date_partition": ds},
name=f"airflow-triggered-{ds}",
)
return flow_run.id
run_id = asyncio.run(_trigger())
print(f"Triggered Prefect flow run: {run_id}")
return run_id
trigger_training_task = PythonOperator(
task_id="trigger_prefect_training",
python_callable=trigger_prefect_training,
dag=dag,
)
data_quality_check >> trigger_training_task
Metaflow Calling External Services
Metaflow flows can call any Python code - including Prefect or Airflow clients - to trigger downstream pipelines:
from metaflow import FlowSpec, step
class FeaturePipelineFlow(FlowSpec):
@step
def start(self):
self.run_date = "2024-03-01"
self.next(self.compute_features)
@step
def compute_features(self):
# ... feature computation
self.feature_path = f"s3://feature-store/{self.run_date}/"
self.next(self.trigger_training)
@step
def trigger_training(self):
"""Trigger downstream training in Prefect after feature computation."""
from prefect.client.sync import get_client
client = get_client()
client.create_flow_run_from_deployment(
deployment_name="training-pipeline/production",
parameters={"feature_path": self.feature_path},
)
print(f"Training pipeline triggered with features: {self.feature_path}")
self.next(self.end)
@step
def end(self):
pass
Monitoring and Alerting Patterns
Regardless of which orchestrator you choose, production ML pipelines need monitoring and alerting. The orchestrator handles execution - your monitoring stack handles visibility.
Pipeline-Level Monitoring
Every orchestrator exposes an API for querying run status. Build a monitoring dashboard that queries this API:
# Example: Query Prefect Cloud for pipeline health
import httpx
from datetime import datetime, timedelta
def get_pipeline_health_report(hours: int = 24) -> dict:
"""Get a health summary for the last N hours of pipeline runs."""
headers = {"Authorization": f"Bearer {PREFECT_API_KEY}"}
since = (datetime.utcnow() - timedelta(hours=hours)).isoformat()
response = httpx.post(
f"https://api.prefect.cloud/api/accounts/{ACCOUNT_ID}/workspaces/{WORKSPACE_ID}/flow_runs/filter",
headers=headers,
json={
"flow_runs": {
"start_time": {"after_": since},
}
},
)
runs = response.json()
total = len(runs)
failed = sum(1 for r in runs if r["state_type"] == "FAILED")
succeeded = sum(1 for r in runs if r["state_type"] == "COMPLETED")
return {
"total_runs": total,
"succeeded": succeeded,
"failed": failed,
"success_rate": succeeded / total if total > 0 else 0,
"failed_runs": [r["name"] for r in runs if r["state_type"] == "FAILED"],
}
Model-Level Monitoring Integration
The orchestrator should be integrated with your model monitoring tool. When a monitoring alert fires (data drift, prediction distribution shift), the orchestrator should trigger a retraining run:
# Webhook endpoint that receives drift alerts from Evidently AI
# and triggers a Prefect retraining flow
from fastapi import FastAPI
from prefect.client.sync import get_client
app = FastAPI()
@app.post("/drift-alert")
async def handle_drift_alert(alert: dict):
"""Handle drift alert from Evidently AI monitoring."""
model_name = alert["model_name"]
drift_score = alert["drift_score"]
feature_with_drift = alert["feature_name"]
if drift_score > 0.3: # Significant drift threshold
client = get_client()
run = client.create_flow_run_from_deployment(
deployment_name=f"{model_name}-retraining/production",
parameters={
"trigger_reason": f"drift_alert_{feature_with_drift}",
"drift_score": drift_score,
},
)
return {"status": "retraining_triggered", "run_id": str(run.id)}
return {"status": "acknowledged", "action": "no_retraining_needed"}
This pattern closes the MLOps loop: model in production, monitoring detects drift, orchestrator triggers retraining, new model replaces the drifted one. The orchestrator is the backbone that connects all the pieces.
Summary: The Right Tool for the Right Job
There is no universally correct orchestrator. The choice depends on your team's skills, infrastructure maturity, and operational requirements. Here is the condensed guidance:
Use Airflow if: You already run it for ETL. Your team has Airflow expertise. You need the broadest ecosystem of integrations and the most managed service options (Astronomer, MWAA, Composer). Your ML pipelines are not highly dynamic.
Use Prefect if: You are starting fresh or migrating from Airflow. Your team writes Python well. You want the best developer experience with the lowest learning curve. You can accept Prefect Cloud's pricing model or are willing to self-host.
Use Kubeflow Pipelines if: You are on Kubernetes (or GCP with Vertex AI). Container isolation per step is a hard requirement. You need MLMD artifact lineage. Your team has platform engineering depth.
Use Metaflow if: Your team is data scientist-heavy. You run primarily on AWS. You want the simplest possible path from local Jupyter notebook to AWS Batch cluster without framework overhead.
Use ZenML if: Infrastructure portability is a first-class requirement. You run the same pipeline across multiple environments (local, staging, production, multi-cloud). You want standard interfaces for MLOps tools rather than direct tool coupling.
Use Dagster if: You have heavy data engineering requirements alongside ML. The software-defined asset model fits your use case. You need the best observability UI in the open-source space.
The common thread across all successful ML orchestration setups: start simple, add complexity only when pain demands it, and never lose sight of the actual goal - shipping better models to production faster.
