Skip to main content

ML Pipeline Orchestration Concepts

The 3am Failure Nobody Saw

It is Tuesday. Somewhere in a production data science team, eight cron jobs are scheduled to run a complete ML training pipeline each night starting at midnight. Job one pulls data from the warehouse. Job two validates schema. Job three featurizes. Job four trains the model. Jobs five through eight package, evaluate, register, and deploy.

At 3:17am, job three fails. The featurization script encounters a column that arrived as float64 instead of int32 due to an upstream schema change. The cron job exits with code 1. No alert fires - the team never set one up. Job four, scheduled for 3:30am, runs on time. It finds yesterday's feature file and uses it without complaint. The model trains. Evaluation passes because yesterday's features still produce reasonable accuracy on the validation set. Job eight deploys the model.

The data scientist responsible for the pipeline checks the dashboard Friday morning for their weekly review. The model metrics look fine. They do not notice that the deployed model was trained on data that is now 72 hours stale. Two days later, a business stakeholder notices that product recommendations have become oddly repetitive. A four-hour investigation begins.

The cron job approach had exactly zero of the properties a production ML pipeline needs: no dependency enforcement, no retry logic, no cross-step state tracking, no failure propagation, no alerting, and no audit trail. Eight jobs ran independently, and the system happily continued after a critical step had silently failed.

This is the problem orchestration exists to solve. Not just scheduling - orchestration is about making pipelines correct, observable, and resilient by design.


:::tip 🎮 Interactive Playground Visualize this concept: Try the ML Pipeline demo on the EngineersOfAI Playground - no code required. :::

Why This Exists - The Problem Before Orchestration

Before dedicated ML orchestrators, data science teams used one of three approaches, each with serious limitations.

Monolithic scripts: A single Python script that does everything - reads data, featurizes, trains, evaluates, deploys. Simple for a notebook but a nightmare in production. One failure kills everything. No way to restart from a checkpoint. No parallelism. Rerunning means rerunning everything.

Cron jobs: The "obvious" step up from monolithic scripts. Split the pipeline into steps, schedule each with cron, and hope they run in order. The problems are numerous: no dependency enforcement (step four runs whether step three succeeded or not), no shared state, failures are invisible unless you manually check logs, no retry on transient errors, and no visibility into the overall pipeline health.

Ad-hoc scripts glued with shell: A shell script that calls Python scripts in sequence, checking exit codes. Better than raw cron, but fragile. No retries, no distributed execution, no UI, no standard artifact tracking.

None of these scale beyond a single developer working on a single pipeline. When a team has ten pipelines running across thirty engineers, the lack of orchestration becomes an organizational failure: nobody knows what is running, when it ran, whether it succeeded, or what it produced.


Historical Context - Where Orchestration Came From

Workflow orchestration is not a new problem. The concept of representing work as a directed acyclic graph - tasks as nodes, dependencies as edges - traces back to project scheduling theory in the 1950s (the Program Evaluation and Review Technique, PERT). Computer scientists formalized DAG-based task scheduling for parallel computation in the 1960s and 1970s.

The modern era of data pipeline orchestration began with Oozie at Yahoo (2009), designed for Hadoop workflows. Oozie worked but was XML-heavy and difficult to test. Luigi (Spotify, 2012) was the first widely adopted Python-native orchestrator - pipelines as Python classes, dependency resolution built in. Luigi was a significant step forward but had no scheduler, no distributed execution, and limited observability.

Airflow (Airbnb, 2014, open-sourced 2015) changed the game by introducing Python-defined DAGs with a built-in scheduler, a web UI, pluggable executors, and a rich operator ecosystem. It became the industry default for data and ML pipelines through the late 2010s. The Apache Software Foundation accepted Airflow in 2019.

The 2020s saw a new generation: Prefect (2018, major rewrite as Prefect 2 in 2022), Dagster (2018), Kubeflow Pipelines (Google, 2018), ZenML (2021), and Metaflow (Netflix, 2019, open-sourced 2020). Each addressed real gaps in Airflow - better local development, first-class artifact tracking, native Kubernetes integration, and ML-specific primitives.


The DAG Mental Model

Every ML orchestrator, regardless of its surface API, represents pipelines as Directed Acyclic Graphs (DAGs).

Three properties define a DAG:

Directed: Edges have direction. Task B runs after Task A completes. Dependencies flow in one direction.

Acyclic: No cycles. Task A cannot depend on Task B if Task B depends on Task A, directly or transitively. This guarantees the pipeline terminates.

Graph: Tasks (nodes) are connected by dependencies (edges). Tasks with no unfulfilled dependencies can run in parallel.

The acyclic constraint is why ML pipelines are naturally representable as DAGs: training requires features, features require raw data, raw data requires ingestion. The flow is always forward. If you find yourself wanting a cycle in an ML pipeline (run training until convergence, then re-evaluate), the right pattern is to model it as a loop at the task level within a single step, not as a cycle at the pipeline level.


What an Orchestrator Is Responsible For

A common misconception: orchestrators run your code. They do not. They tell your code when to run, where to run it, and what to do when it fails. The responsibilities divide cleanly into five areas.

1. Dependency Resolution

The orchestrator reads your DAG definition and determines the execution order. Tasks with no pending dependencies are queued. When a task completes, the orchestrator marks it done and queues downstream tasks whose dependencies are now all satisfied.

This is the most fundamental job, and it is what cron completely fails at. With cron, you encode dependencies implicitly in the schedule times ("featurize runs at 3:30 because ingestion finishes by 3:00"). With an orchestrator, you encode them explicitly: featurize depends on ingestion. The orchestrator enforces this - featurize will wait regardless of the clock.

2. Scheduling

Orchestrators support cron-based schedules (run every day at midnight), interval schedules (run every 6 hours), and increasingly, event-driven triggers (run when new data arrives). They handle time zone conversion, daylight saving time transitions, and backfill (running past scheduled runs that were missed).

3. State Tracking

Every task run has a state: queued, running, success, failed, skipped, up_for_retry. The orchestrator persists these states in a metadata database. This is what gives you the ability to inspect historical runs, understand which step failed and why, and restart a pipeline from a specific checkpoint rather than from the beginning.

4. Retry Logic

Transient failures - a database timeout, a network blip, a rate-limited API - should not kill a pipeline. Orchestrators support configurable retry policies: retry up to N times, with exponential backoff, after a delay. The orchestrator handles this automatically without you writing retry loops.

5. Alerting and Observability

Orchestrators emit events on task success, failure, and retry. They integrate with alerting systems (PagerDuty, Slack, email) so that failures are visible immediately. They expose metrics (task duration, failure rate, queue depth) for monitoring dashboards.


Task vs Pipeline Granularity

A question every team wrestles with: how much work goes in a single task?

Too coarse (one task does everything): You lose the ability to restart from a checkpoint. If training takes 6 hours and the packaging step fails, you re-run 6 hours of training. You also lose parallel execution opportunities.

Too fine (every function call is a task): Orchestrator overhead dominates. Hundreds of tasks per pipeline run create scheduler bottlenecks, fill the metadata database with noise, and make the DAG diagram unreadable.

The right granularity is determined by two questions:

  1. Where would I want to restart? If training fails, I want to restart from the last good checkpoint, not from data ingestion. Each restart boundary is a task boundary.

  2. What units are independently reusable? Featurization is often reused across different model training tasks. If it is a single task, multiple training tasks can depend on one featurization run. If it is embedded inside a larger task, each training task reruns featurization.

A reasonable default for ML pipelines: tasks map to major pipeline stages (ingest, validate, featurize, train, evaluate, package, deploy). Sub-steps within a stage (e.g., individual feature transformations) live inside the task's code, not as separate orchestrator tasks.


Idempotency in ML Pipelines

Idempotency means running a task multiple times with the same inputs produces the same outputs and has no additional side effects beyond the first run.

This property is essential for orchestration. When a task fails and is retried, the orchestrator has no way to "undo" what the task partially did. If the task is idempotent, partial side effects don't matter - the retry produces a correct result regardless.

Non-idempotent patterns to avoid:

# BAD: appends to existing data - retries duplicate rows
def write_features(features_df, output_path):
features_df.to_parquet(output_path, mode="append")

# GOOD: overwrites atomically - retries are safe
def write_features(features_df, output_path):
temp_path = output_path + ".tmp"
features_df.to_parquet(temp_path)
import os
os.rename(temp_path, output_path) # atomic on most filesystems
# BAD: inserts every run - double-counting on retry
def log_training_run(conn, run_id, metrics):
conn.execute("INSERT INTO runs VALUES (?, ?)", (run_id, metrics))

# GOOD: upsert is idempotent
def log_training_run(conn, run_id, metrics):
conn.execute(
"""INSERT INTO runs (run_id, metrics) VALUES (?, ?)
ON CONFLICT(run_id) DO UPDATE SET metrics = excluded.metrics""",
(run_id, metrics)
)

The general pattern: write atomically to a deterministic location. For ML artifacts (models, feature files, evaluation reports), the output path should encode the run parameters or timestamp so that each logical run writes to its own location. Retries overwrite the same path cleanly.


The Hidden Cost of Cron Jobs for ML

Beyond the silent failure problem, cron-based ML pipelines accumulate hidden costs that compound over time.

No visibility into runtime: Cron runs jobs and forgets. If your training job starts taking 3 hours instead of 1.5, you won't know until it overlaps with the next scheduled run and causes resource contention.

No artifact provenance: Which model was trained with which data? With cron, you only know if you have a rigorous manual naming convention - and those always drift. With an orchestrator, artifact lineage is automatic.

No safe reruns: Running a cron-scheduled pipeline on a different date (for backfill or debugging) requires manually adjusting every script. With an orchestrator, you pass the logical date as a parameter.

No parallel runs: Cron schedules one run at a time per job. An orchestrator can run multiple pipeline instances concurrently (different date ranges, different hyperparameter sets) with configurable concurrency limits.

Dependency drift: As the pipeline grows, dependency timing (step four scheduled 30 minutes after step three) becomes increasingly fragile. Adding a new step or changing the data volume breaks timing assumptions silently.


Orchestrator Architecture Pattern

All major ML orchestrators share a common architectural pattern, though they name components differently:

The scheduler reads the DAG, resolves which tasks are ready to run, and places them in a queue. Workers pick tasks from the queue, execute them, and report results back to the metadata database. The web server reads the metadata database to render the UI and serve the API. Artifact storage is external - orchestrators track artifact locations, not artifact contents.

This separation means orchestrators are horizontally scalable: add more workers to increase throughput. The control plane remains small and stable.


Designing a Pipeline from Scratch - A Practical Template

Here is how to structure any new ML pipeline design:

# Step 1: Define the logical pipeline stages
PIPELINE_STAGES = [
"ingest", # Pull raw data from source
"validate", # Schema checks, null checks, range checks
"featurize", # Transform raw data to model-ready features
"train", # Train the model
"evaluate", # Compute metrics against held-out data
"register", # Push model to model registry (conditional on evaluation pass)
"deploy", # Update serving infrastructure
]

# Step 2: Define dependencies explicitly
DEPENDENCIES = {
"validate": ["ingest"],
"featurize": ["validate"],
"train": ["featurize"],
"evaluate": ["train"],
"register": ["evaluate"],
"deploy": ["register"],
}

# Step 3: Define retry and timeout per stage
STAGE_CONFIG = {
"ingest": {"retries": 3, "retry_delay_seconds": 60, "timeout_seconds": 1800},
"validate": {"retries": 2, "retry_delay_seconds": 30, "timeout_seconds": 600},
"featurize": {"retries": 2, "retry_delay_seconds": 60, "timeout_seconds": 7200},
"train": {"retries": 1, "retry_delay_seconds": 300, "timeout_seconds": 86400},
"evaluate": {"retries": 2, "retry_delay_seconds": 30, "timeout_seconds": 3600},
"register": {"retries": 3, "retry_delay_seconds": 30, "timeout_seconds": 300},
"deploy": {"retries": 2, "retry_delay_seconds": 60, "timeout_seconds": 1800},
}

# Step 4: Define what artifact each stage writes
STAGE_OUTPUTS = {
"ingest": "s3://ml-artifacts/{run_date}/raw/data.parquet",
"validate": "s3://ml-artifacts/{run_date}/raw/validation_report.json",
"featurize": "s3://ml-artifacts/{run_date}/features/features.parquet",
"train": "s3://ml-artifacts/{run_date}/models/model.pkl",
"evaluate": "s3://ml-artifacts/{run_date}/eval/metrics.json",
"register": None, # writes to model registry, not object store
"deploy": None, # updates serving config
}

This template works regardless of which orchestrator you choose. The orchestrator-specific code is a thin translation layer that wraps this logical design.


Production Engineering Notes

Design pipelines for failure from day one. Assume every task will fail at least once. Set retries, set timeouts, set failure notifications before you go to production. Retrofitting these properties is painful.

Use logical date parameters, not wall-clock time. Your pipeline should accept a run_date parameter and process data for that date, regardless of when it actually runs. This enables backfill, debugging, and idempotent reruns.

Separate pipeline code from orchestration code. Your training logic should live in a Python module. Your orchestrator task should be a thin wrapper that calls that module. This makes unit testing trivial - test the module, not the orchestrator.

Track artifacts explicitly. Every pipeline run should record what data it consumed, what model it produced, and what metrics it achieved. This is non-negotiable for debugging and compliance. Most modern orchestrators do this automatically; wire it up from the start.

Start simple. A single-machine orchestrator (Prefect local, Airflow with LocalExecutor) is appropriate for small teams. Do not pay the operational cost of a distributed executor until you actually need parallel execution across multiple machines.


Common Mistakes

:::danger Silent Failures: The Most Dangerous Mistake Never design a pipeline where a step failure allows downstream steps to continue with stale or incorrect data. This is the cron job antipattern. Every task must fail loudly - upstream failures must propagate and block downstream tasks. If you find yourself writing logic that "handles" upstream failures by falling back to yesterday's output, stop and redesign. :::

:::warning Forgetting Idempotency A task that appends to a database table or object store prefix without checking for existing data will duplicate records on every retry. Before writing any task that produces output, ask: "What happens if this task runs twice?" If the answer is "it creates duplicates," the task is not idempotent. :::

:::warning Over-Granular DAGs Creating a task for every function call in your ML pipeline makes the orchestrator a slow, noisy test runner. If a task takes less than 5 seconds to run, it probably belongs inside a larger task. Reserve tasks for operations that are independently restartable, independently parallelizable, or independently monitored. :::

:::tip Timeout Every Task Every task should have an explicit timeout. Training tasks that should take 2 hours but run forever due to a bug will consume cluster resources indefinitely without a timeout. Set timeouts generously (2x expected runtime), but always set them. :::


Interview Q&A

Q1: What is a DAG and why is it the right data structure for ML pipelines?

A DAG (Directed Acyclic Graph) represents ML pipeline steps as nodes and dependencies as directed edges. It is the right structure because ML pipelines have natural, well-defined dependencies (training requires features, features require data), they must terminate (guaranteed by the acyclic property), and many steps can run in parallel if their dependencies are satisfied. The DAG makes dependencies explicit and machine-verifiable rather than implicit in scheduling times or documentation.

Q2: What does idempotency mean for an ML pipeline task, and how do you achieve it?

A task is idempotent if running it multiple times with the same inputs produces the same output with no unintended side effects. In ML pipelines, idempotency matters because orchestrators retry failed tasks - a non-idempotent retry could duplicate data, double-count metrics, or corrupt model registries. You achieve idempotency by: writing outputs to deterministic paths that overwrite rather than append, using upsert instead of insert for database writes, and encoding run parameters in output paths so each logical run has its own namespace.

Q3: What are the five core responsibilities of an ML orchestrator?

Dependency resolution (determining task execution order), scheduling (when pipelines run), state tracking (persisting task success/failure history), retry logic (automatically re-running failed tasks with configurable policies), and alerting/observability (notifying humans when things fail and exposing metrics for monitoring).

Q4: Why is the cron job approach dangerous for production ML pipelines?

Cron has no dependency enforcement (downstream steps run regardless of upstream failure), no cross-step state tracking (each job is independent), no built-in retry logic, no artifact provenance, and no UI for visualizing pipeline health. Most critically, cron failures are silent by default - a failed job leaves no visible record unless you explicitly check logs. In practice, this means pipeline failures are discovered days later through business metrics rather than immediately through engineering alerts.

Q5: How do you choose the right task granularity in an ML pipeline?

Two questions determine granularity: (1) Where do you want restart checkpoints? Tasks should align with restart boundaries - if you want to retry training without re-running featurization, they must be separate tasks. (2) What units are independently reusable or parallelizable? Shared preprocessing that feeds multiple training experiments should be one task with many downstream training tasks, not duplicated inside each training task. Sub-steps that do not benefit from independent monitoring, reuse, or parallelism belong inside a task, not as separate tasks.

Q6: A pipeline task writes a Parquet file to S3. The task succeeds but subsequent runs find a corrupted file. What could cause this and how do you prevent it?

The most common cause is a non-atomic write: the task wrote to the final path incrementally, was interrupted partway through, and left a partial file that subsequent tasks interpret as complete. The fix is atomic writes: write to a temporary path first, then rename (which is atomic on most object stores). S3 multipart uploads are atomic - either the full upload completes or nothing is visible. GCS has similar semantics. Always verify the final file exists and passes a basic schema check before marking the task as successful.

Q7: What is the difference between a task and a pipeline run in orchestration terms?

A pipeline run is one execution instance of the entire DAG - triggered by a schedule, event, or manual trigger. A task is a single node within the DAG. One pipeline run creates one task instance per node. If the pipeline has a schedule of once-daily and runs for 30 days, that is 30 pipeline run instances, each containing the same set of task instances. State tracking operates at both levels: you can see that pipeline run #14 failed, and drill down to see that it failed specifically in the featurization task.

© 2026 EngineersOfAI. All rights reserved.