Building portable, stack-agnostic MLOps pipelines with ZenML - stacks, steps, materializers, and seamless local-to-cloud migration with MLflow and Vertex AI.

ZenML

The Stack Lock-in Problem

Your ML team has built a clean training pipeline over six months. It runs locally with MLflow experiment tracking. The model is registered in the local MLflow model registry. It deploys to a Flask API running on your on-premise server. The pipeline works. Then the company decides to move to Google Cloud Platform. The CTO wants everything on Vertex AI, the model registry should use Vertex AI Model Registry, and deployments should go through Cloud Run.

You look at your pipeline code. It is 3,000 lines of Python tightly coupled to your current stack. mlflow.start_run() is called in 12 places. The artifact paths hardcode /home/user/artifacts. The deployment step runs subprocess.call(["systemctl", "restart", "flask-api"]). Migrating this to Vertex AI is not a refactor - it is a rewrite.

This is the infrastructure coupling problem in MLOps. Nearly every pipeline written against a specific tool or cloud provider eventually faces this moment: the stack changes, and the pipeline has to change with it. The usual response is "we will abstract it properly next time." But there is never a next time. The coupling accumulates because nothing forces a separation between pipeline logic and infrastructure concerns.

ZenML was built to solve this problem structurally. It introduces a concept called a stack - the combination of infrastructure components (artifact store, orchestrator, model deployer, experiment tracker) that a pipeline runs on. Your pipeline code is written once. The stack is configured separately. Switching from a local MLflow stack to a Vertex AI stack is a single command - no pipeline code changes required.

This lesson covers ZenML's architecture in depth: the stack concept, the @step and @pipeline decorators, custom materializers for non-standard types, stack configuration via YAML and CLI, and a complete example of the same pipeline running locally with MLflow then switching to Vertex AI with zero code changes.

Why ZenML Exists

The problem ZenML addresses is subtler than the one Prefect or Metaflow addresses. Prefect and Metaflow solve workflow execution - making pipelines run reliably with retries, parallelism, and cloud compute. ZenML solves portability - making the same pipeline code run on different infrastructure stacks without modification.

Before ZenML, teams had two unsatisfying options:

Option 1: Couple directly to infrastructure. Write your pipeline with MLflow tracking, Kubernetes Jobs for training, and Seldon Core for deployment. The pipeline is simple to write but impossible to move. When your company changes cloud providers or tools, the pipeline is scrap.

Option 2: Build internal abstractions. Write your own thin wrappers around your infrastructure. Now you maintain an internal ML platform. You have reinvented ZenML, but poorly and with no community support.

ZenML's insight is that these abstractions should exist at the framework level, not the team level. The framework defines standard interfaces for artifact stores, experiment trackers, orchestrators, and model deployers. Integrations implement these interfaces for specific tools (MLflow, Weights & Biases, Vertex AI, SageMaker, etc.). Your pipeline code uses the interfaces, and the active stack determines which integration is used at runtime.
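The interface-plus-integration idea can be illustrated without ZenML at all. The toy below is a hypothetical model, not ZenML's API: `ArtifactStore`, `LocalDirArtifactStore`, `InMemoryArtifactStore` (a dict standing in for a cloud bucket), `Stack`, and `set_active_stack` are invented names. The point is that `persist_model` talks only to the interface, so changing the active stack changes where the bytes land without touching that function.

```python
import os
import tempfile
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, Optional


class ArtifactStore(ABC):
    """Hypothetical interface that every artifact-store integration implements."""

    @abstractmethod
    def write(self, key: str, data: bytes) -> str:
        """Persist bytes and return a URI saying where they went."""

    @abstractmethod
    def read(self, key: str) -> bytes:
        """Return the bytes previously written under key."""


class LocalDirArtifactStore(ArtifactStore):
    """Integration 1: plain files under a local directory."""

    def __init__(self, root: str) -> None:
        self.root = root
        os.makedirs(root, exist_ok=True)

    def write(self, key: str, data: bytes) -> str:
        path = os.path.join(self.root, key)
        with open(path, "wb") as f:
            f.write(data)
        return path

    def read(self, key: str) -> bytes:
        with open(os.path.join(self.root, key), "rb") as f:
            return f.read()


class InMemoryArtifactStore(ArtifactStore):
    """Integration 2: a dict standing in for a remote bucket such as GCS."""

    def __init__(self, bucket: str) -> None:
        self.bucket = bucket
        self.objects: Dict[str, bytes] = {}

    def write(self, key: str, data: bytes) -> str:
        self.objects[key] = data
        return f"mem://{self.bucket}/{key}"

    def read(self, key: str) -> bytes:
        return self.objects[key]


@dataclass
class Stack:
    name: str
    artifact_store: ArtifactStore


_active_stack: Optional[Stack] = None


def set_active_stack(stack: Stack) -> None:
    """Loose analogue of `zenml stack set`: a runtime choice, not a code change."""
    global _active_stack
    _active_stack = stack


def persist_model(model_bytes: bytes) -> str:
    """'Pipeline' code: it only knows the interface, never the backend."""
    if _active_stack is None:
        raise RuntimeError("no active stack")
    return _active_stack.artifact_store.write("model.bin", model_bytes)


if __name__ == "__main__":
    set_active_stack(Stack("local", LocalDirArtifactStore(tempfile.mkdtemp())))
    print(persist_model(b"weights"))  # a path inside a fresh temp directory
    set_active_stack(Stack("cloud", InMemoryArtifactStore("ml-artifacts")))
    print(persist_model(b"weights"))  # mem://ml-artifacts/model.bin
```

ZenML's actual artifact store interface exposes file-system style operations rather than a single `write` method, but the control flow is the same: integration code varies, pipeline code does not.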

:::note ZenML vs Prefect vs Metaflow ZenML is not a replacement for Prefect or Metaflow; it works at a different layer. ZenML's value is the stack abstraction and portability: execution is delegated to whichever orchestrator component the stack contains (local, Airflow, Kubeflow, Vertex AI, SageMaker, and others). Prefect's value is the Python-first execution model and observability. Metaflow's value is the simplicity of the flow-step model with transparent cloud scaling. Teams that already run Airflow often register it as the orchestrator in their ZenML stacks. :::

Historical Context

ZenML was started in 2020 by Adam Probst and Hamza Tahir, two engineers frustrated with the infrastructure coupling problem in their previous ML roles. The initial version was focused on the step/pipeline model with local execution. Version 0.20 (2022) introduced the stack abstraction - the breakthrough that separated ZenML from other orchestrators conceptually. The commercial offering, ZenML Cloud (since renamed ZenML Pro), launched in 2023, providing a managed server, team collaboration features, and a pipeline visualization UI.

ZenML is now OSS (Apache 2.0) with the server code included - you can self-host the full ZenML server. The cloud offering adds managed infrastructure and multi-user features.

Core Concepts

The Stack

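A ZenML stack is a collection of infrastructure components. Every stack needs an orchestrator (how and where steps run) and an artifact store (where step outputs are persisted). Optional components include an experiment tracker, a model deployer, and a container registry, which remote orchestrators use to store the Docker images that steps run in.

The stack is configured separately from the pipeline. Step outputs are not written to hardcoded paths: ZenML persists them to whichever artifact store the active stack contains. Experiment tracking works slightly differently. A step names the tracker it needs (@step(experiment_tracker=...)), and ZenML configures that tool's client before the step body runs (for MLflow: the tracking URI and an active run). The step still calls the tracker's own API, such as mlflow.log_metric, so moving to a different MLflow server is a stack change, while switching to a different tracking tool means changing those logging calls.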
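To make the composition concrete, here is a sketch that models a stack as a mapping from component type to a named, configured component. `ComponentType`, `Component`, and `StackSpec` are hypothetical stand-ins rather than ZenML classes; the one rule taken from ZenML itself is that every stack must contain an orchestrator and an artifact store.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class ComponentType(str, Enum):
    ORCHESTRATOR = "orchestrator"
    ARTIFACT_STORE = "artifact_store"
    EXPERIMENT_TRACKER = "experiment_tracker"
    MODEL_DEPLOYER = "model_deployer"
    CONTAINER_REGISTRY = "container_registry"


# Every stack needs these two; everything else is optional.
REQUIRED = (ComponentType.ORCHESTRATOR, ComponentType.ARTIFACT_STORE)


@dataclass
class Component:
    name: str
    flavor: str  # which integration implements the component, e.g. "mlflow"
    config: Dict[str, str] = field(default_factory=dict)


@dataclass
class StackSpec:
    name: str
    components: Dict[ComponentType, Component]

    def missing_required(self) -> List[str]:
        return [t.value for t in REQUIRED if t not in self.components]

    def describe(self) -> List[str]:
        """One line per component, loosely like `zenml stack describe`."""
        return [
            f"{ctype.value}: {comp.name} ({comp.flavor})"
            for ctype, comp in sorted(self.components.items(), key=lambda kv: kv[0].value)
        ]


if __name__ == "__main__":
    local = StackSpec("local-mlflow", {
        ComponentType.ORCHESTRATOR: Component("local_orchestrator", "local"),
        ComponentType.ARTIFACT_STORE: Component(
            "local_artifact_store", "local", {"path": "/tmp/zenml-artifacts"}
        ),
        ComponentType.EXPERIMENT_TRACKER: Component(
            "mlflow_tracker", "mlflow", {"tracking_uri": "http://localhost:5000"}
        ),
    })
    print("\n".join(local.describe()))
    print(local.missing_required())  # []
```

Swapping stacks then amounts to pointing at a different `StackSpec`; the component names and flavors change, the pipeline does not.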

Defining Steps

from zenml import step
from typing import Tuple, Annotated
import pandas as pd

@step
def load_data(
    dataset_path: str,
    test_fraction: float = 0.2,
) -> Tuple[
    Annotated[pd.DataFrame, "X_train"],
    Annotated[pd.DataFrame, "X_test"],
    Annotated[pd.Series, "y_train"],
    Annotated[pd.Series, "y_test"],
]:
    """Load and split the dataset."""
    from sklearn.model_selection import train_test_split

    df = pd.read_parquet(dataset_path)
    feature_cols = [c for c in df.columns if c != "label"]
    X, y = df[feature_cols], df["label"]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_fraction, random_state=42, stratify=y
    )
    return X_train, X_test, y_train, y_test

Notice the Annotated[pd.DataFrame, "X_train"] return type. ZenML uses these type annotations to:

Know what materializer to use for serialization
Name the artifact in the artifact store
Enable type-safe composition between steps
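A framework can recover both the output types and the output names from such annotations with the standard typing module. The sketch below illustrates the idea and is not ZenML's parser; `output_signature` is a hypothetical helper, and the fallback names `output` and `output_0`, `output_1`, ... for unnamed outputs are an assumption of this sketch.

```python
from typing import Annotated, Any, Dict, Tuple, get_args, get_origin, get_type_hints


def output_signature(func) -> Dict[str, Any]:
    """Map each output name to its type, read from the return annotation."""
    ret = get_type_hints(func, include_extras=True).get("return", Any)
    # A Tuple[...] return means several outputs; anything else is one output.
    items = get_args(ret) if get_origin(ret) is tuple else (ret,)
    outputs: Dict[str, Any] = {}
    for i, item in enumerate(items):
        if get_origin(item) is Annotated:
            base, name = get_args(item)[:2]  # Annotated[T, "name"] -> (T, "name")
        else:
            # Fallback names are an assumption of this sketch.
            base, name = item, "output" if len(items) == 1 else f"output_{i}"
        outputs[name] = base
    return outputs


def split(rows: list) -> Tuple[Annotated[list, "X_train"], Annotated[list, "X_test"]]:
    half = len(rows) // 2
    return rows[:half], rows[half:]


def untyped(rows):
    return rows


if __name__ == "__main__":
    print(output_signature(split))  # {'X_train': <class 'list'>, 'X_test': <class 'list'>}
    print(output_signature(untyped))
```

The names become artifact names, and the types are what a materializer lookup and a type check between connected steps can work with.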

Defining Pipelines

from zenml import pipeline

@pipeline
def training_pipeline(
    dataset_path: str = "data/features.parquet",
    test_fraction: float = 0.2,
    n_estimators: int = 100,
    max_depth: int = 5,
    deploy_threshold: float = 0.85,
):
    X_train, X_test, y_train, y_test = load_data(
        dataset_path=dataset_path,
        test_fraction=test_fraction,
    )
    model = train_model(
        X_train=X_train,
        y_train=y_train,
        n_estimators=n_estimators,
        max_depth=max_depth,
    )
    evaluation_results = evaluate_model(
        model=model,
        X_test=X_test,
        y_test=y_test,
    )
    deploy_model_step(
        model=model,
        evaluation_metrics=evaluation_results,
        deploy_threshold=deploy_threshold,
    )

This is standard Python. There are no decorators with infrastructure names, no SDK-specific imports for artifact stores, no cloud-specific API calls.

Complete Pipeline Example

Here is a full pipeline that runs identically on a local stack with MLflow and on a Vertex AI stack - with zero code changes.

# pipeline/steps.py
from zenml import step, log_artifact_metadata
from zenml.client import Client
from typing import Annotated, Tuple, Dict, Any, Optional
import pandas as pd
import numpy as np
from sklearn.base import ClassifierMixin


# ──────────────────────────────────────────────────────────────
# Step 1: Data Loading
# ──────────────────────────────────────────────────────────────

@step
def load_and_validate_data(
    dataset_uri: str,
) -> Tuple[
    Annotated[pd.DataFrame, "raw_dataframe"],
    Annotated[Dict[str, Any], "data_profile"],
]:
    """Load dataset from URI and compute a basic data profile."""
    df = pd.read_parquet(dataset_uri)

    # Validate schema
    required_cols = {"user_id", "feature_1", "feature_2", "feature_3", "label"}
    missing = required_cols - set(df.columns)
    if missing:
        raise ValueError(f"Missing required columns: {missing}")

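    profile = {
        "num_rows": len(df),
        "num_columns": df.shape[1],
        # Cast numpy scalars to built-in types so the dict serializes cleanly
        "null_counts": {c: int(n) for c, n in df.isnull().sum().items()},
        "label_distribution": {
            str(k): float(v)
            for k, v in df["label"].value_counts(normalize=True).items()
        },
        "feature_dtypes": {c: str(df[c].dtype) for c in df.columns},
    }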

    # ZenML: log metadata alongside the artifact
    log_artifact_metadata(
        artifact_name="raw_dataframe",
        metadata={
            "num_rows": len(df),
            "source_uri": dataset_uri,
        },
    )

    return df, profile


# ──────────────────────────────────────────────────────────────
# Step 2: Feature Engineering
# ──────────────────────────────────────────────────────────────

@step
def engineer_features(
    raw_df: pd.DataFrame,
    profile: Dict[str, Any],
    test_fraction: float = 0.2,
) -> Tuple[
    Annotated[pd.DataFrame, "X_train"],
    Annotated[pd.DataFrame, "X_test"],
    Annotated[pd.Series, "y_train"],
    Annotated[pd.Series, "y_test"],
]:
    """Feature engineering and train/test split."""
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    df = raw_df.copy()

    # Feature engineering
    df["interaction_1_2"] = df["feature_1"] * df["feature_2"]
    df["log_feature_3"] = np.log1p(df["feature_3"].clip(lower=0))

    feature_cols = [c for c in df.columns if c not in {"user_id", "label"}]
    X = df[feature_cols]
    y = df["label"]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_fraction, random_state=42, stratify=y
    )

    scaler = StandardScaler()
    X_train_scaled = pd.DataFrame(
        scaler.fit_transform(X_train), columns=X_train.columns, index=X_train.index
    )
    X_test_scaled = pd.DataFrame(
        scaler.transform(X_test), columns=X_test.columns, index=X_test.index
    )

    return X_train_scaled, X_test_scaled, y_train, y_test


# ──────────────────────────────────────────────────────────────
# Step 3: Model Training (with experiment tracking)
# ──────────────────────────────────────────────────────────────

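# Resolve the tracker from whichever stack is active when this module is imported,
# so the step is not pinned to one stack's component name (the two stacks below
# name their trackers differently).
_experiment_tracker = Client().active_stack.experiment_tracker

@step(experiment_tracker=_experiment_tracker.name if _experiment_tracker else None)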
def train_model(
    X_train: pd.DataFrame,
    y_train: pd.Series,
    n_estimators: int = 100,
    max_depth: int = 5,
    learning_rate: float = 0.1,
) -> Annotated[ClassifierMixin, "trained_model"]:
    """Train a GBM model with experiment tracking."""
    import mlflow
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.metrics import roc_auc_score

    # ZenML points MLflow at the tracker component named in the decorator and
    # starts an MLflow run before this body executes. These are ordinary mlflow
    # calls: the active stack's tracker must be an MLflow flavor (both stacks
    # below use one), and nothing here is translated to other tracking tools.
    mlflow.log_params({
        "n_estimators": n_estimators,
        "max_depth": max_depth,
        "learning_rate": learning_rate,
    })

    model = GradientBoostingClassifier(
        n_estimators=n_estimators,
        max_depth=max_depth,
        learning_rate=learning_rate,
        random_state=42,
    )
    model.fit(X_train, y_train)

    train_preds = model.predict_proba(X_train)[:, 1]
    train_auc = roc_auc_score(y_train, train_preds)
    mlflow.log_metric("train_auc", train_auc)

    return model


# ──────────────────────────────────────────────────────────────
# Step 4: Evaluation
# ──────────────────────────────────────────────────────────────

@step
def evaluate_model(
    model: ClassifierMixin,
    X_test: pd.DataFrame,
    y_test: pd.Series,
) -> Annotated[Dict[str, float], "evaluation_metrics"]:
    """Evaluate model on test set."""
    from sklearn.metrics import roc_auc_score, average_precision_score, f1_score

    preds_proba = model.predict_proba(X_test)[:, 1]
    preds = (preds_proba >= 0.5).astype(int)

    metrics = {
        "roc_auc": float(roc_auc_score(y_test, preds_proba)),
        "avg_precision": float(average_precision_score(y_test, preds_proba)),
        "f1": float(f1_score(y_test, preds)),
    }

    log_artifact_metadata(
        artifact_name="evaluation_metrics",
        metadata=metrics,
    )

    return metrics


# ──────────────────────────────────────────────────────────────
# Step 5: Conditional Deployment
# ──────────────────────────────────────────────────────────────

@step
def deploy_model_step(
    model: ClassifierMixin,
    evaluation_metrics: Dict[str, float],
    deploy_threshold: float = 0.85,
) -> Optional[str]:
    """Deploy model if it passes the quality threshold."""
    from zenml.client import Client

    roc_auc = evaluation_metrics["roc_auc"]

    if roc_auc < deploy_threshold:
        print(f"Model AUC {roc_auc:.4f} below threshold {deploy_threshold}. Skipping deployment.")
        return None

    # ZenML deployment - routes to the active stack's model deployer
    # (MLflow local server, Seldon, BentoML, or Vertex AI depending on stack)
    client = Client()
    model_deployer = client.active_stack.model_deployer

    if model_deployer is None:
        print("No model deployer in active stack. Skipping deployment.")
        return None

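    # Schematic call: the real deploy_model() expects a flavor-specific service
    # config object (MLFlowDeploymentConfig for the MLflow deployer), not a plain
    # dict, and the model usually has to be logged or registered first. Deployment
    # is therefore where code most often becomes flavor-aware, or where teams use
    # a prebuilt step such as mlflow_model_deployer_step.
    deployment_info = model_deployer.deploy_model(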
        config={
            "model": model,
            "model_name": "fraud-detector",
            "replicas": 1,
        }
    )

    print(f"Model deployed: {deployment_info}")
    return str(deployment_info)

# pipeline/pipeline.py
from zenml import pipeline
from pipeline.steps import (
    load_and_validate_data,
    engineer_features,
    train_model,
    evaluate_model,
    deploy_model_step,
)

@pipeline(name="fraud-detection-training", enable_cache=True)
def fraud_detection_pipeline(
    dataset_uri: str = "data/features.parquet",
    test_fraction: float = 0.2,
    n_estimators: int = 100,
    max_depth: int = 5,
    learning_rate: float = 0.1,
    deploy_threshold: float = 0.85,
):
    raw_df, profile = load_and_validate_data(dataset_uri=dataset_uri)
    X_train, X_test, y_train, y_test = engineer_features(
        raw_df=raw_df,
        profile=profile,
        test_fraction=test_fraction,
    )
    model = train_model(
        X_train=X_train,
        y_train=y_train,
        n_estimators=n_estimators,
        max_depth=max_depth,
        learning_rate=learning_rate,
    )
    metrics = evaluate_model(model=model, X_test=X_test, y_test=y_test)
    deploy_model_step(
        model=model,
        evaluation_metrics=metrics,
        deploy_threshold=deploy_threshold,
    )

Stack Configuration

Local Stack with MLflow

# stacks/local-mlflow.yaml
zenml_version: "0.55.0"

components:
  artifact_store:
    type: local
    name: local_artifact_store
    configuration:
      path: /tmp/zenml-artifacts

  orchestrator:
    type: local
    name: local_orchestrator

  experiment_tracker:
    type: mlflow
    name: mlflow_tracker
    configuration:
      tracking_uri: http://localhost:5000
      experiment_name: fraud-detection

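# Register the components first (names must match the stack command below)
zenml artifact-store register local_artifact_store --flavor=local --path=/tmp/zenml-artifacts
zenml orchestrator register local_orchestrator --flavor=local
zenml experiment-tracker register mlflow_tracker --flavor=mlflow --tracking_uri=http://localhost:5000

# Register and activate the local stack
zenml stack register local-mlflow \
    -a local_artifact_store \
    -o local_orchestrator \
    -e mlflow_tracker

zenml stack set local-mlflow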

# Run the pipeline
python run.py

Vertex AI Stack

# stacks/vertex-ai.yaml
zenml_version: "0.55.0"

components:
  artifact_store:
    type: gcp
    name: gcs_artifact_store
    configuration:
      path: gs://my-company-ml-artifacts/zenml

  orchestrator:
    type: vertex
    name: vertex_orchestrator
    configuration:
      project: my-gcp-project
      location: us-central1
      pipeline_root: gs://my-company-ml-artifacts/vertex-pipelines

  experiment_tracker:
    type: mlflow
    name: vertex_experiment_tracker
    configuration:
      tracking_uri: https://mlflow.my-company.com  # a remote MLflow tracking server
      experiment_name: fraud-detection-vertex

  container_registry:
    type: gcp
    name: gcr_registry
    configuration:
      uri: gcr.io/my-gcp-project

  model_deployer:
    type: vertex
    name: vertex_model_deployer
    configuration:
      project: my-gcp-project
      location: us-central1

# Register the GCP components first, the same way as the local ones, e.g.
# zenml orchestrator register vertex_orchestrator --flavor=vertex --project=my-gcp-project --location=us-central1
# Then register the Vertex AI stack
zenml stack register vertex-production \
    -a gcs_artifact_store \
    -o vertex_orchestrator \
    -e vertex_experiment_tracker \
    -c gcr_registry \
    -d vertex_model_deployer

# Switch to Vertex AI - ZERO code changes to your pipeline
zenml stack set vertex-production

# Run exactly the same pipeline on Vertex AI
python run.py

This is the central promise of ZenML: zenml stack set vertex-production and python run.py - the same pipeline, different infrastructure.

Custom Materializers

ZenML uses materializers to serialize and deserialize step outputs. Built-in materializers handle pandas DataFrames, numpy arrays, sklearn models, and primitive types. For custom types, you write a materializer:

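# materializers/xgboost_materializer.py
import os
import tempfile
from typing import Any, Dict, Type

import xgboost as xgb
from zenml.enums import ArtifactType
from zenml.io import fileio
from zenml.materializers.base_materializer import BaseMaterializer

DEFAULT_FILENAME = "model.json"


class XGBoostMaterializer(BaseMaterializer):
    """Custom materializer for XGBoost Booster objects."""

    ASSOCIATED_TYPES = (xgb.Booster,)
    ASSOCIATED_ARTIFACT_TYPE = ArtifactType.MODEL

    def load(self, data_type: Type[xgb.Booster]) -> xgb.Booster:
        """Load an XGBoost model from the artifact store."""
        # self.uri may be gs:// or s3://, which XGBoost cannot open directly,
        # so copy the file to a local temp dir through ZenML's fileio first.
        with tempfile.TemporaryDirectory() as tmp_dir:
            local_path = os.path.join(tmp_dir, DEFAULT_FILENAME)
            fileio.copy(os.path.join(self.uri, DEFAULT_FILENAME), local_path)
            model = xgb.Booster()
            model.load_model(local_path)
        return model

    def save(self, model: xgb.Booster) -> None:
        """Save an XGBoost model to the artifact store."""
        # Write locally, then let fileio copy to whatever store self.uri points at.
        with tempfile.TemporaryDirectory() as tmp_dir:
            local_path = os.path.join(tmp_dir, DEFAULT_FILENAME)
            model.save_model(local_path)
            fileio.copy(local_path, os.path.join(self.uri, DEFAULT_FILENAME), overwrite=True)

    def extract_metadata(self, model: xgb.Booster) -> Dict[str, Any]:
        """Extract metadata that ZenML stores alongside the artifact."""
        return {
            "num_trees": model.num_boosted_rounds(),
            "feature_names": model.feature_names,
        }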

from typing import Annotated

import pandas as pd
import xgboost as xgb
from zenml import step

from materializers.xgboost_materializer import XGBoostMaterializer


# The dict key must match an output name; Annotated names the output "model"
@step(output_materializers={"model": XGBoostMaterializer})
def train_xgboost(
    X_train: pd.DataFrame,
    y_train: pd.Series,
) -> Annotated[xgb.Booster, "model"]:
    dtrain = xgb.DMatrix(X_train, label=y_train)
    params = {"max_depth": 6, "eta": 0.1, "objective": "binary:logistic"}
    model = xgb.train(params, dtrain, num_boost_round=100)
    return model

ZenML will use XGBoostMaterializer to serialize this model to the artifact store (local path, S3, or GCS depending on the active stack) and to deserialize it in downstream steps.
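When no explicit override is given, the materializer is chosen by type: each materializer declares ASSOCIATED_TYPES, and a lookup on the output's type picks the match. The stdlib-only sketch below mimics that idea with hypothetical classes (`ToyMaterializer`, `JsonDictMaterializer`, `REGISTRY`, `materializer_for`); ZenML's actual registry handles more cases and its details differ.

```python
import json
import os
import tempfile
from typing import Any, Dict, Type


class ToyMaterializer:
    """Hypothetical base class: one subclass per family of Python types."""

    ASSOCIATED_TYPES: tuple = ()

    def __init__(self, uri: str) -> None:
        self.uri = uri  # directory reserved for this one artifact

    def save(self, data: Any) -> None:
        raise NotImplementedError

    def load(self, data_type: Type) -> Any:
        raise NotImplementedError


class JsonDictMaterializer(ToyMaterializer):
    ASSOCIATED_TYPES = (dict,)

    def save(self, data: dict) -> None:
        os.makedirs(self.uri, exist_ok=True)
        with open(os.path.join(self.uri, "data.json"), "w") as f:
            json.dump(data, f)

    def load(self, data_type: Type) -> dict:
        with open(os.path.join(self.uri, "data.json")) as f:
            return data_type(json.load(f))  # rebuild the requested (sub)type


# Type -> materializer, built from each class's ASSOCIATED_TYPES.
REGISTRY: Dict[type, Type[ToyMaterializer]] = {
    t: m for m in (JsonDictMaterializer,) for t in m.ASSOCIATED_TYPES
}


def materializer_for(data_type: type) -> Type[ToyMaterializer]:
    # Walk the MRO so subclasses (e.g. OrderedDict) reuse the dict materializer.
    for klass in data_type.__mro__:
        if klass in REGISTRY:
            return REGISTRY[klass]
    raise TypeError(f"no materializer registered for {data_type.__name__}")


if __name__ == "__main__":
    uri = os.path.join(tempfile.mkdtemp(), "evaluation_metrics")
    m = materializer_for(dict)(uri)
    m.save({"roc_auc": 0.91})
    print(m.load(dict))  # {'roc_auc': 0.91}
```

Explicit overrides such as output_materializers simply bypass this lookup for the named output.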

ZenML Cloud vs OSS

Feature	ZenML OSS	ZenML Cloud
Pipeline definition	Yes	Yes
Local execution	Yes	Yes
Self-hosted server	Yes (zenml up)	Managed
Dashboard UI	Basic	Full-featured
Multi-user accounts	Yes	Yes
Role-based access control	No	Yes
Model Control Plane	Yes (Python API and CLI)	Yes (plus dashboard UI)
Pipeline lineage UI	No	Yes
Support	Community	Enterprise
Cost	Free	Paid (per-seat)

The OSS server (zenml up) is sufficient for single-developer or small-team use. ZenML Cloud adds collaboration features, a better UI, and managed infrastructure.

Integrations Overview

ZenML integrates with the major MLOps tools in each stack component category:

# Install ZenML, then each integration's dependencies through the ZenML CLI
pip install zenml
zenml integration install mlflow aws bentoml wandb -y

# List available integrations
zenml integration list

# Install all integrations for a cloud provider
zenml integration install gcp -y

Stack Component	Available Integrations
Orchestrator	Local, Airflow, Kubeflow, Vertex AI, SageMaker, Kubernetes
Artifact Store	Local, S3 (AWS), GCS (GCP), Azure Blob Storage
Experiment Tracker	MLflow, Weights & Biases, Comet, Neptune
Model Deployer	MLflow, Seldon Core, BentoML, Vertex AI, Hugging Face
Feature Store	Feast
Container Registry	Docker Hub, ECR, GCR, Azure Container Registry
Data Validator	Great Expectations, Evidently, Deepchecks

Production Engineering Notes

Caching

ZenML's step caching is enabled by default. Steps are cached based on their input hashes and code hash. Use caching carefully for data-loading steps:

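from zenml import step
import numpy as np
import pandas as pd

# The cache key sees the URI string, not the file contents behind it
@step(enable_cache=False)  # Always re-run - the file at source_uri changes daily
def load_fresh_data(source_uri: str) -> pd.DataFrame:
    return pd.read_parquet(source_uri)

@step(enable_cache=True)  # Cache safely - deterministic computation
def compute_features(df: pd.DataFrame) -> pd.DataFrame:
    # Feature engineering - same inputs always produce same outputs
    engineered_df = df.copy()
    engineered_df["log_feature_3"] = np.log1p(engineered_df["feature_3"].clip(lower=0))
    return engineered_df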
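Why is caching risky for a data-loading step? A cache key built from code and parameters sees the URI string, not the bytes behind it. The toy cache below is a simplified illustration, not ZenML's key derivation: `cache_key` hashes the function's bytecode and constants as a stand-in for a source-code hash, and `read_text` plays the role of a data-loading step.

```python
import hashlib
import json
import os
import tempfile
from typing import Any, Callable, Dict

_cache: Dict[str, Any] = {}


def cache_key(fn: Callable, params: Dict[str, Any]) -> str:
    """Simplified key: the step's compiled code plus its JSON-encoded parameters.

    Hashing bytecode and constants stands in for a source-code hash here.
    """
    h = hashlib.sha256()
    h.update(fn.__code__.co_code)
    h.update(repr(fn.__code__.co_consts).encode())
    h.update(json.dumps(params, sort_keys=True).encode())
    return h.hexdigest()


def run_cached(fn: Callable, **params: Any) -> Any:
    key = cache_key(fn, params)
    if key not in _cache:
        _cache[key] = fn(**params)  # cache miss: actually run the step
    return _cache[key]


def read_text(source_uri: str) -> str:
    """Stand-in for a data-loading step: the result depends on the file's bytes."""
    with open(source_uri) as f:
        return f.read()


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "data.txt")
    with open(path, "w") as f:
        f.write("monday")
    print(run_cached(read_text, source_uri=path))  # monday
    with open(path, "w") as f:
        f.write("tuesday")  # new data, same URI
    print(run_cached(read_text, source_uri=path))  # monday (stale cache hit)
```

The second call returns stale data because nothing in the key changed. Disabling the cache for such steps, or passing an input that does change (a dated URI, for instance), avoids the problem.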

Triggering Pipelines from CI/CD

# trigger.py - called from GitHub Actions or GitLab CI
from zenml.client import Client
from pipeline.pipeline import fraud_detection_pipeline

client = Client()

# Ensure the right stack is active in CI
client.activate_stack("vertex-production")

# Run the pipeline
fraud_detection_pipeline(
    dataset_uri="gs://data-lake/fraud/2024-03-01.parquet",
    deploy_threshold=0.86,
)

Accessing Artifacts Programmatically

from zenml.client import Client

client = Client()

# Get the latest pipeline run
pipeline = client.get_pipeline("fraud-detection-training")
latest_run = pipeline.last_run

# Access artifacts from a specific step
train_step_artifact = latest_run.steps["train_model"].outputs["trained_model"].load()
metrics = latest_run.steps["evaluate_model"].outputs["evaluation_metrics"].load()

print(f"Latest model AUC: {metrics['roc_auc']:.4f}")
print(f"Model type: {type(train_step_artifact)}")

Common Mistakes

:::danger Putting Cloud-Specific Code in Steps The entire point of ZenML is stack portability. If you call boto3.client("s3") or google.cloud.storage.Client() directly inside a step, you break portability: that step keeps writing to the same cloud bucket whatever the active stack is, and it fails outright wherever that provider's SDK or credentials are missing.

# WRONG - not portable
@step
def save_model(model):
    import boto3
    boto3.client("s3").put_object(Bucket="my-bucket", Key="model.pkl", Body=...)

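# CORRECT - use ZenML's artifact store interface
from sklearn.base import ClassifierMixin
from sklearn.ensemble import RandomForestClassifier
from zenml import step

@step
def train_and_save(X_train, y_train) -> ClassifierMixin:
    model = RandomForestClassifier().fit(X_train, y_train)
    return model  # ZenML saves to active artifact store automatically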

:::

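:::danger Forgetting Type Annotations on Step Outputs ZenML uses return type annotations to choose a materializer, name artifacts in the lineage graph, and check that connected steps agree on types. A step without a return annotation still runs, but ZenML treats its output as Any: the artifact gets the generic name output, the materializer is picked from whatever type the function returns at runtime, and no type checking happens between steps. Code that looks artifacts up by a meaningful name (such as outputs["trained_model"]) then fails.

# WRONG - no annotation: stored as an untyped artifact with the generic name "output"
@step
def train_model(X_train, y_train):
    return RandomForestClassifier().fit(X_train, y_train)

# CORRECT
from sklearn.base import ClassifierMixin
from sklearn.ensemble import RandomForestClassifier
from typing import Annotated
from zenml import step

@step
def train_model(X_train, y_train) -> Annotated[ClassifierMixin, "trained_model"]:
    return RandomForestClassifier().fit(X_train, y_train)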

:::

:::warning Using experiment_tracker Without Including It in the Stack If a step uses @step(experiment_tracker="mlflow_tracker") but the active stack does not have an experiment tracker component named mlflow_tracker, the run will fail. Always verify your active stack has all required components before running a pipeline with component-specific steps.

# Check active stack components
zenml stack describe

:::
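The check that zenml stack describe supports by eye can also be scripted as a preflight step. The sketch below is a hypothetical helper, not a ZenML API: it compares the component names that steps request against a stack described as a plain dict. In practice you would fill both dicts from ZenML's client instead of hard-coding them. The example data mirrors this lesson's two stacks and shows why a step hard-coded to mlflow_tracker cannot run unchanged on a stack whose tracker is named vertex_experiment_tracker.

```python
from typing import Dict, List


def missing_components(
    stack: Dict[str, str],
    step_requirements: Dict[str, Dict[str, str]],
) -> List[str]:
    """List problems; an empty list means the stack satisfies every step.

    stack: component type -> component name in the active stack.
    step_requirements: step name -> {component type -> component name it asks for}.
    """
    problems: List[str] = []
    for step_name, needs in sorted(step_requirements.items()):
        for ctype, wanted in sorted(needs.items()):
            actual = stack.get(ctype)
            if actual != wanted:
                found = f"'{actual}'" if actual else "none"
                problems.append(f"{step_name}: needs {ctype} '{wanted}', stack has {found}")
    return problems


if __name__ == "__main__":
    requirements = {"train_model": {"experiment_tracker": "mlflow_tracker"}}
    local_stack = {"orchestrator": "local_orchestrator", "experiment_tracker": "mlflow_tracker"}
    vertex_stack = {"orchestrator": "vertex_orchestrator", "experiment_tracker": "vertex_experiment_tracker"}
    print(missing_components(local_stack, requirements))  # []
    for problem in missing_components(vertex_stack, requirements):
        print(problem)
```

Running a check like this in CI before submitting a pipeline turns a mid-run failure into an immediate, readable error.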

:::warning Stack Switching in Multi-User Environments zenml stack set changes the active stack for the current user session. In a multi-user team, each team member's local environment is isolated. But in CI/CD, stack selection must be explicit - do not assume the CI environment has the same active stack as your local machine. Always set the stack explicitly in CI scripts. :::

Interview Questions and Answers

Q1: What is a ZenML stack and what components does it contain?

A ZenML stack is a configuration object that defines the infrastructure a pipeline runs on. It is a named collection of stack components, each serving a specific role: the orchestrator determines how and where steps execute; the artifact store determines where step outputs are persisted; the experiment tracker connects to a tool like MLflow or Weights & Biases for logging metrics and parameters; the model deployer handles deploying trained models to serving infrastructure; the container registry stores Docker images used by remote orchestrators. The stack is configured and registered once, and then activated with zenml stack set stack-name. The pipeline code never references the stack directly - ZenML injects the appropriate implementation at runtime based on the active stack.

Q2: How does ZenML enable running the same pipeline on different infrastructure without code changes?

ZenML defines standard Python interfaces for each stack component type. When a step needs to write an artifact, it goes through ZenML's artifact store interface, which delegates to the implementation registered in the active stack. If the active stack uses a local artifact store, the file is written to local disk. If the active stack uses the GCS artifact store, the file is written to Google Cloud Storage - with no code changes. Orchestration works the same way: the active orchestrator decides whether steps run as a local process or on Vertex AI Pipelines. Experiment tracking and deployment are driven by the stack too, with a caveat: ZenML points the tracking client at the stack's tracker (which MLflow server to log to, for example), but the step still calls that tool's own logging API, and the stack's model deployer decides where a model is served (a local MLflow server or a cloud endpoint). As long as the tools themselves stay the same, changing infrastructure means changing the active stack, not the pipeline code.

Q3: What is a materializer in ZenML and when do you need a custom one?

A materializer is a class that handles serialization and deserialization of step artifacts. ZenML includes built-in materializers for common types: pandas DataFrames (parquet), numpy arrays (npy), sklearn models (pickle), Python primitives, and more. You need a custom materializer when a step returns a type that ZenML does not know how to serialize - for example, an XGBoost Booster, a PyTorch Lightning module, or a custom data class. A materializer inherits from BaseMaterializer, declares ASSOCIATED_TYPES (which Python types it handles), and implements load() and save() methods. You register it by passing it in the @step decorator's output_materializers argument.

Q4: How does ZenML's artifact caching work, and when should you disable it?

ZenML caches step outputs based on a cache key computed from three inputs: the hash of the step's function body (its code), the hash of its input artifacts, and the hash of its configuration parameters. If a step is run with the same code, same inputs, and same parameters, ZenML skips execution and returns the cached output from a previous run. Disable caching for steps where the output should always be fresh even when inputs are nominally the same - for example, a data loading step that fetches from an API or reads a file that may have been updated without changing its URI. Use @step(enable_cache=False) or set enable_cache=False at the pipeline decorator level to disable globally.

Q5: How does ZenML compare to tools like Prefect and Metaflow?

Prefect and Metaflow are workflow execution frameworks - they focus on making pipelines run reliably with retries, parallelism, and cloud compute. ZenML is a pipeline portability framework - it focuses on making pipelines run on different infrastructure stacks without code changes, and delegates execution to an orchestrator stack component such as Airflow, Kubeflow, or Vertex AI Pipelines. The tools solve different problems: use Prefect if your primary pain is reliable execution with good observability; use Metaflow if your primary pain is scaling from laptop to AWS Batch; use ZenML if your primary pain is being locked into specific infrastructure tools or needing to run the same pipeline on multiple cloud providers. Teams that already run a workflow orchestrator such as Airflow often use ZenML as the outer portability layer and register that orchestrator as a component inside the ZenML stack.

Q6: What is the ZenML Model Control Plane and why was it added?

The Model Control Plane is a feature in ZenML Cloud (and available in recent OSS versions) that provides a centralized registry for ML models, tracking which pipeline runs produced them, what metrics they achieved, and their deployment status. It is analogous to MLflow's Model Registry but built into the ZenML framework and connected to the full artifact lineage graph. The problem it solves is model lifecycle management: without it, you need a separate tool (MLflow, Vertex AI Model Registry, SageMaker Model Registry) to track which model is in staging, which is in production, and which pipeline run produced the production model. The Model Control Plane integrates this tracking into ZenML's existing pipeline and artifact tracking, making it part of the same lineage graph rather than a separate system.
