Pulumi for ML

When Your Data Scientists Start Reading Your Infrastructure Code

The feature pipeline team had a problem. Their infrastructure - Kafka consumers, feature transformation jobs, DynamoDB tables - was defined in Terraform HCL. The data scientists who needed to understand the pipeline topology could not read HCL. The ML engineers who wrote the training code in Python had to context-switch to a completely different language just to provision a new S3 bucket. Handoffs between "the ML code people" and "the infrastructure people" added days to every experiment.

Then one engineer proposed an experiment: rewrite the infrastructure in Pulumi, using the Python SDK. Within a week, the training pipeline, the feature store schema, and the serving endpoints were all defined in Python - the same language as the model training code. Data scientists could grep the infrastructure code for "feature_table" and immediately understand how the DynamoDB schema related to their training data. ML engineers could add a new SageMaker endpoint by calling a Python function they wrote themselves.

The real payoff came three months later when they needed to provision infrastructure for 50 different experiments - each with its own S3 prefix, DynamoDB table, and SageMaker endpoint configuration. In Terraform, this would have required careful for_each expressions and data structure mapping. In Pulumi, it was a Python for loop. Twelve lines of code. The entire experiment fleet deployed in four minutes.

Pulumi does not replace Terraform - both tools have their place. But for ML teams that live in Python, Pulumi offers a fundamentally different developer experience: infrastructure that reads like the application code around it, tested with the same pytest suite, reviewed by engineers who already know the language.

Why This Exists

Terraform's HCL (HashiCorp Configuration Language) was designed to be human-readable and simple. It succeeded at that goal, but simplicity comes with constraints. Loops over dynamic collections require for_each and count expressions that get awkward as the data structures grow. Testing is limited: Terraform 1.6 added a native terraform test command, but the tests are written in HCL rather than in your team's own test framework. You cannot import your company's internal Python library and use it inside a resource definition. HCL is a DSL - powerful within its domain, but limited to that domain.

Pulumi (founded 2017, open-sourced in 2018) took a different approach: use real programming languages - Python, TypeScript, Go, C#, Java - as the configuration language. The infrastructure is a program. You import the Pulumi SDK, call resource constructors, and the SDK handles the provider communication, state management, and dependency tracking. Every language feature is available: functions, classes, loops, imports, error handling, and type checking.

For ML teams specifically, Pulumi offers three concrete advantages over Terraform: (1) infrastructure and application code in the same language, reducing cognitive overhead; (2) real unit testing with pytest and mock providers; (3) the Automation API, which lets you embed Pulumi infrastructure operations inside Python scripts - useful for experiment lifecycle management, cost management, and dynamic environment provisioning.

Pulumi vs Terraform - The Real Comparison

| Feature | Terraform | Pulumi (Python) |
| --- | --- | --- |
| Language | HCL DSL | Real Python |
| State backend | S3 + DynamoDB | Pulumi Cloud or S3 |
| Loops | for_each / count | Python for loops |
| Conditionals | ternary expressions | Python if/else |
| Unit testing | terratest (Go) | pytest |
| Reuse | Modules | Classes / functions |
| Secrets | .tfvars (careful) | Pulumi ESC / config |
| Preview | terraform plan | pulumi preview |
| Apply | terraform apply | pulumi up |
| Learning curve | Terraform concepts | Python + Pulumi concepts |
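To make the Loops and Conditionals rows concrete, here is a minimal sketch of how a fleet of experiments might be described with an ordinary loop and an if/else. The `ExperimentSpec` shape, the naming scheme, the `entity_id` key, and the instance types are assumptions for illustration, not part of the original platform. The spec-building logic is plain Python; only `declare_resources` touches Pulumi, and it would be called from inside a Pulumi program.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentSpec:
    """Plain-Python description of one experiment's resources (hypothetical shape)."""
    experiment_id: str
    s3_prefix: str
    table_name: str
    instance_type: str


def build_experiment_specs(count: int, gpu_every: int = 10) -> list[ExperimentSpec]:
    """Build one spec per experiment with an ordinary loop and if/else."""
    specs = []
    for i in range(1, count + 1):
        exp_id = f"exp-{i:03d}"
        # Ordinary conditional: every `gpu_every`-th experiment gets a GPU instance type
        if i % gpu_every == 0:
            instance_type = "ml.g4dn.xlarge"
        else:
            instance_type = "ml.m5.xlarge"
        specs.append(
            ExperimentSpec(exp_id, f"experiments/{exp_id}/", f"{exp_id}-features", instance_type)
        )
    return specs


def declare_resources(specs: list[ExperimentSpec]) -> None:
    """Turn specs into Pulumi resources; call this only inside a Pulumi program."""
    import pulumi_aws as aws  # lazy import keeps the spec logic testable without Pulumi

    for spec in specs:
        aws.dynamodb.Table(
            spec.table_name,
            attributes=[aws.dynamodb.TableAttributeArgs(name="entity_id", type="S")],
            hash_key="entity_id",
            billing_mode="PAY_PER_REQUEST",
            tags={"Experiment": spec.experiment_id},
        )


specs = build_experiment_specs(50)
print(len(specs), specs[9].instance_type)
```

Keeping the spec generation separate from the resource declarations means the loop logic can be checked with plain pytest before any cloud resource is involved.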

Pulumi Core Concepts

Stacks and Projects

A Pulumi project is a directory containing a Pulumi.yaml and a __main__.py (for Python). A stack is an isolated deployment of that project, with its own state and its own Pulumi.<stack>.yaml configuration file. It is roughly equivalent to a Terraform workspace.

# Pulumi.yaml
name: ml-platform
runtime:
  name: python
  options:
    virtualenv: venv
description: ML Platform Infrastructure
config:
  pulumi:tags:
    value:
      project: ml-platform
      managed-by: pulumi

# Create and configure stacks
pulumi stack init dev
pulumi stack init staging
pulumi stack init prod

# Set configuration per stack
pulumi config set aws:region us-east-1 --stack prod
pulumi config set gpu_instance_type p3.8xlarge --stack prod
pulumi config set gpu_instance_type t3.medium --stack dev

# Set secrets (encrypted in state)
pulumi config set --secret mlflow_db_password "supersecret123" --stack prod

# Switch stacks
pulumi stack select prod

# Preview changes (equivalent to terraform plan)
pulumi preview

# Apply changes (equivalent to terraform apply)
pulumi up

# Destroy (equivalent to terraform destroy)
pulumi destroy

Basic Resource Definition

# __main__.py
import pulumi
import pulumi_aws as aws

# Configuration - reads from pulumi config or defaults
config = pulumi.Config()
environment = config.require("environment")
project_name = config.get("project_name") or "eai"
# Provider settings live in the "aws" namespace, not the project namespace
aws_region = pulumi.Config("aws").get("region") or "us-east-1"

# Derived locals (same concept as Terraform locals)
prefix = f"{project_name}-{environment}"
is_production = environment == "prod"

# S3 bucket for model artifacts
model_bucket = aws.s3.Bucket(
    f"{prefix}-model-artifacts",
    bucket=f"{prefix}-model-artifacts",
    versioning=aws.s3.BucketVersioningArgs(
        enabled=True,
    ),
    server_side_encryption_configuration=aws.s3.BucketServerSideEncryptionConfigurationArgs(
        rule=aws.s3.BucketServerSideEncryptionConfigurationRuleArgs(
            apply_server_side_encryption_by_default=aws.s3.BucketServerSideEncryptionConfigurationRuleApplyServerSideEncryptionByDefaultArgs(
                sse_algorithm="AES256",
            ),
        ),
    ),
    tags={
        "Project": project_name,
        "Environment": environment,
        "ManagedBy": "pulumi",
    },
)

# Block public access
aws.s3.BucketPublicAccessBlock(
    f"{prefix}-model-artifacts-public-access-block",
    bucket=model_bucket.id,
    block_public_acls=True,
    block_public_policy=True,
    ignore_public_acls=True,
    restrict_public_buckets=True,
)

# Lifecycle policy (cheaper storage for old models)
aws.s3.BucketLifecycleConfigurationV2(
    f"{prefix}-model-artifacts-lifecycle",
    bucket=model_bucket.id,
    rules=[
        aws.s3.BucketLifecycleConfigurationV2RuleArgs(
            id="transition-old-models",
            status="Enabled",
            transitions=[
                aws.s3.BucketLifecycleConfigurationV2RuleTransitionArgs(
                    days=30,
                    storage_class="STANDARD_IA",
                ),
                aws.s3.BucketLifecycleConfigurationV2RuleTransitionArgs(
                    days=90,
                    storage_class="GLACIER",
                ),
            ],
        )
    ],
)

# Export outputs (equivalent to Terraform outputs)
pulumi.export("model_bucket_name", model_bucket.id)
pulumi.export("model_bucket_arn", model_bucket.arn)

Component Resources - Reusable Infrastructure Classes

Component Resources are Pulumi's equivalent of Terraform modules - but they are actual Python classes with constructors, methods, and type checking.

# components/ml_storage.py
from typing import Optional
import pulumi
import pulumi_aws as aws


class MLStorageArgs:
    """Arguments for the MLStorage component."""
    def __init__(
        self,
        prefix: str,
        environment: str,
        transition_to_ia_days: int = 30,
        transition_to_glacier_days: int = 90,
        tags: Optional[dict] = None,
    ):
        self.prefix = prefix
        self.environment = environment
        self.transition_to_ia_days = transition_to_ia_days
        self.transition_to_glacier_days = transition_to_glacier_days
        self.tags = tags or {}


class MLStorage(pulumi.ComponentResource):
    """
    Complete ML storage stack: model artifacts, training data,
    experiment logs - all with versioning, encryption, and lifecycle.
    """

    model_bucket: aws.s3.Bucket
    training_data_bucket: aws.s3.Bucket
    experiment_logs_bucket: aws.s3.Bucket

    def __init__(
        self,
        name: str,
        args: MLStorageArgs,
        opts: Optional[pulumi.ResourceOptions] = None,
    ):
        super().__init__("eai:ml:MLStorage", name, {}, opts)

        # Child resources use self as parent - creates logical grouping
        child_opts = pulumi.ResourceOptions(parent=self)

        base_tags = {
            **args.tags,
            "ManagedBy": "pulumi",
            "Environment": args.environment,
        }

        bucket_configs = [
            ("model-artifacts", "model-storage"),
            ("training-data", "training-data"),
            ("experiment-logs", "experiment-tracking"),
        ]

        buckets = {}
        for bucket_name, purpose in bucket_configs:
            full_name = f"{args.prefix}-{bucket_name}"

            bucket = aws.s3.Bucket(
                full_name,
                bucket=full_name,
                tags={**base_tags, "Purpose": purpose},
                opts=child_opts,
            )

            aws.s3.BucketVersioning(
                f"{full_name}-versioning",
                bucket=bucket.id,
                versioning_configuration=aws.s3.BucketVersioningVersioningConfigurationArgs(
                    status="Enabled",
                ),
                opts=child_opts,
            )

            aws.s3.BucketPublicAccessBlock(
                f"{full_name}-pab",
                bucket=bucket.id,
                block_public_acls=True,
                block_public_policy=True,
                ignore_public_acls=True,
                restrict_public_buckets=True,
                opts=child_opts,
            )

            aws.s3.BucketLifecycleConfigurationV2(
                f"{full_name}-lifecycle",
                bucket=bucket.id,
                rules=[
                    aws.s3.BucketLifecycleConfigurationV2RuleArgs(
                        id="cost-optimization",
                        status="Enabled",
                        transitions=[
                            aws.s3.BucketLifecycleConfigurationV2RuleTransitionArgs(
                                days=args.transition_to_ia_days,
                                storage_class="STANDARD_IA",
                            ),
                            aws.s3.BucketLifecycleConfigurationV2RuleTransitionArgs(
                                days=args.transition_to_glacier_days,
                                storage_class="GLACIER",
                            ),
                        ],
                    )
                ],
                opts=child_opts,
            )

            buckets[bucket_name] = bucket

        self.model_bucket = buckets["model-artifacts"]
        self.training_data_bucket = buckets["training-data"]
        self.experiment_logs_bucket = buckets["experiment-logs"]

        # Register outputs so other components can reference them
        self.register_outputs({
            "model_bucket_name": self.model_bucket.id,
            "training_data_bucket_name": self.training_data_bucket.id,
            "experiment_logs_bucket_name": self.experiment_logs_bucket.id,
        })

EKS Cluster as a Component Resource

# components/eks_cluster.py
from typing import Optional, List
import pulumi
import pulumi_aws as aws
import pulumi_eks as eks  # High-level EKS component


class EKSClusterArgs:
    def __init__(
        self,
        prefix: str,
        vpc_id: pulumi.Input[str],
        private_subnet_ids: pulumi.Input[List[str]],
        kubernetes_version: str = "1.29",
        gpu_instance_type: str = "p3.2xlarge",
        gpu_min_count: int = 0,
        gpu_max_count: int = 10,
        cpu_instance_type: str = "m5.xlarge",
        cpu_min_count: int = 2,
        cpu_max_count: int = 20,
        tags: Optional[dict] = None,
    ):
        self.prefix = prefix
        self.vpc_id = vpc_id
        self.private_subnet_ids = private_subnet_ids
        self.kubernetes_version = kubernetes_version
        self.gpu_instance_type = gpu_instance_type
        self.gpu_min_count = gpu_min_count
        self.gpu_max_count = gpu_max_count
        self.cpu_instance_type = cpu_instance_type
        self.cpu_min_count = cpu_min_count
        self.cpu_max_count = cpu_max_count
        self.tags = tags or {}


class EKSCluster(pulumi.ComponentResource):
    """
    Production EKS cluster with CPU and GPU node groups,
    cluster autoscaler ready (proper node group tags).
    """

    cluster: eks.Cluster
    cluster_name: pulumi.Output[str]
    kubeconfig: pulumi.Output[str]
    oidc_provider_arn: pulumi.Output[str]
    oidc_issuer_url: pulumi.Output[str]
    node_security_group_id: pulumi.Output[str]

    def __init__(
        self,
        name: str,
        args: EKSClusterArgs,
        opts: Optional[pulumi.ResourceOptions] = None,
    ):
        super().__init__("eai:ml:EKSCluster", name, {}, opts)
        child_opts = pulumi.ResourceOptions(parent=self)

        cluster_name = f"{args.prefix}-eks"

        # Use the high-level pulumi-eks component
        self.cluster = eks.Cluster(
            cluster_name,
            name=cluster_name,
            vpc_id=args.vpc_id,
            private_subnet_ids=args.private_subnet_ids,
            version=args.kubernetes_version,
            endpoint_private_access=True,
            endpoint_public_access=False,
            # Needed so that core.oidc_provider (read below) is actually created
            create_oidc_provider=True,
            enabled_cluster_log_types=[
                "api", "audit", "authenticator",
                "controllerManager", "scheduler"
            ],
            tags=args.tags,
            opts=child_opts,
        )

        # GPU Node Group
        gpu_node_group = aws.eks.NodeGroup(
            f"{cluster_name}-gpu",
            cluster_name=self.cluster.eks_cluster.name,
            node_group_name="gpu-workers",
            node_role_arn=self.cluster.instance_roles[0].arn,
            subnet_ids=args.private_subnet_ids,
            instance_types=[args.gpu_instance_type],
            ami_type="AL2_x86_64_GPU",
            scaling_config=aws.eks.NodeGroupScalingConfigArgs(
                desired_size=args.gpu_min_count,  # Start at min - autoscaler manages desired
                min_size=args.gpu_min_count,
                max_size=args.gpu_max_count,
            ),
            labels={"role": "gpu-worker", "accelerator": "nvidia"},
            taints=[
                aws.eks.NodeGroupTaintArgs(
                    key="nvidia.com/gpu",
                    value="true",
                    effect="NO_SCHEDULE",
                )
            ],
            tags={
                **args.tags,
                f"k8s.io/cluster-autoscaler/{cluster_name}": "owned",
                "k8s.io/cluster-autoscaler/enabled": "true",
            },
            opts=pulumi.ResourceOptions(
                parent=self,
                # ignore_changes takes camelCase property paths from the
                # resource schema, not Terraform-style list indexing
                ignore_changes=["scalingConfig.desiredSize"],
            ),
        )

        # CPU Node Group
        aws.eks.NodeGroup(
            f"{cluster_name}-cpu",
            cluster_name=self.cluster.eks_cluster.name,
            node_group_name="cpu-workers",
            node_role_arn=self.cluster.instance_roles[0].arn,
            subnet_ids=args.private_subnet_ids,
            instance_types=[args.cpu_instance_type],
            ami_type="AL2_x86_64",
            scaling_config=aws.eks.NodeGroupScalingConfigArgs(
                desired_size=args.cpu_min_count,
                min_size=args.cpu_min_count,
                max_size=args.cpu_max_count,
            ),
            labels={"role": "cpu-worker"},
            tags={
                **args.tags,
                f"k8s.io/cluster-autoscaler/{cluster_name}": "owned",
                "k8s.io/cluster-autoscaler/enabled": "true",
            },
            opts=pulumi.ResourceOptions(
                parent=self,
                ignore_changes=["scalingConfig.desiredSize"],
            ),
        )

        self.cluster_name = self.cluster.eks_cluster.name
        self.kubeconfig = self.cluster.kubeconfig
        self.oidc_provider_arn = self.cluster.core.oidc_provider.arn
        self.oidc_issuer_url = self.cluster.eks_cluster.identities[0].oidcs[0].issuer
        self.node_security_group_id = self.cluster.node_security_group.id

        self.register_outputs({
            "cluster_name": self.cluster_name,
            "kubeconfig": self.kubeconfig,
        })

Stack References - Cross-Stack Dependencies

# In the ML training stack - reference outputs from the networking stack
networking_stack = pulumi.StackReference(f"myorg/ml-networking/{pulumi.get_stack()}")

vpc_id = networking_stack.get_output("vpc_id")
private_subnet_ids = networking_stack.get_output("private_subnet_ids")

# Use them to create the EKS cluster
eks_cluster = EKSCluster(
    "ml-eks",
    EKSClusterArgs(
        prefix=prefix,
        vpc_id=vpc_id,
        private_subnet_ids=private_subnet_ids,
    ),
)

Pulumi ESC - Secrets and Configuration Management

# Install: pip install pulumi-esc-sdk

# In your Pulumi program, reference ESC environments
import pulumi
import pulumi_aws as aws

# Configuration with secret (encrypted in Pulumi state)
config = pulumi.Config()

# Secret values - decrypted at runtime, never in plaintext in state
mlflow_db_password = config.require_secret("mlflow_db_password")
redis_auth_token = config.require_secret("redis_auth_token")

# Use the secret value in a resource
db_instance = aws.rds.Instance(
    "mlflow-db",
    password=mlflow_db_password,  # Pulumi handles the secret propagation
    # ...
)

# Set secrets via CLI
pulumi config set --secret mlflow_db_password "supersecret123" --stack prod
pulumi config set --secret redis_auth_token "my-redis-token" --stack prod

# ESC environment file (Pulumi Cloud)
# environments/prod.yaml
values:
  mlflow_db_password:
    fn::secret:
      ciphertext: "..."
  redis_auth_token:
    fn::secret:
      ciphertext: "..."

The Automation API - Infrastructure in Python Scripts

The Automation API is Pulumi's most powerful feature for ML teams. It lets you embed Pulumi stack operations inside Python code - creating, updating, and destroying infrastructure programmatically. This enables experiment lifecycle management: automatically provision infrastructure for a new experiment, run the experiment, clean up when done.

# experiment_lifecycle.py
import json
import pulumi
from pulumi import automation as auto


def create_experiment_stack(
    experiment_id: str,
    gpu_count: int,
    instance_type: str,
    hours_to_live: int = 24,
) -> dict:
    """
    Provision infrastructure for a specific experiment.
    Returns endpoint URLs and resource identifiers.
    """

    def pulumi_program():
        """The infrastructure for one experiment."""
        import pulumi_aws as aws

        prefix = f"exp-{experiment_id}"

        # Dedicated artifact bucket for this experiment
        exp_bucket = aws.s3.Bucket(
            f"{prefix}-artifacts",
            bucket=f"eai-experiments-{experiment_id}",
            # Let destroy remove the bucket even if artifacts were written to it
            force_destroy=True,
            tags={
                "Experiment": experiment_id,
                "TTL": str(hours_to_live),
                "ManagedBy": "automation-api",
            },
        )

        # SageMaker training job resources
        training_role = aws.iam.Role(
            f"{prefix}-training-role",
            assume_role_policy=json.dumps({
                "Version": "2012-10-17",
                "Statement": [{
                    "Action": "sts:AssumeRole",
                    "Effect": "Allow",
                    "Principal": {"Service": "sagemaker.amazonaws.com"},
                }],
            }),
        )

        aws.iam.RolePolicyAttachment(
            f"{prefix}-sagemaker-policy",
            role=training_role.name,
            policy_arn="arn:aws:iam::aws:policy/AmazonSageMakerFullAccess",
        )

        pulumi.export("artifact_bucket", exp_bucket.id)
        pulumi.export("training_role_arn", training_role.arn)

    # Create or select the stack for this experiment
    stack = auto.create_or_select_stack(
        stack_name=f"experiment-{experiment_id}",
        project_name="ml-experiments",
        program=pulumi_program,
    )

    # Configure AWS region
    stack.set_config("aws:region", auto.ConfigValue("us-east-1"))

    print(f"Provisioning infrastructure for experiment {experiment_id}...")
    up_result = stack.up(on_output=print)

    return {
        "artifact_bucket": up_result.outputs["artifact_bucket"].value,
        "training_role_arn": up_result.outputs["training_role_arn"].value,
        "stack_name": stack.name,
    }


def destroy_experiment_stack(experiment_id: str):
    """Clean up all infrastructure for a completed experiment."""

    stack = auto.select_stack(
        stack_name=f"experiment-{experiment_id}",
        project_name="ml-experiments",
        program=lambda: None,  # Program not needed for destroy
    )

    print(f"Destroying infrastructure for experiment {experiment_id}...")
    stack.destroy(on_output=print)
    stack.workspace.remove_stack(stack.name)
    print(f"Experiment {experiment_id} infrastructure cleaned up.")


# Usage in ML pipeline
if __name__ == "__main__":
    import time

    # Provision
    resources = create_experiment_stack(
        experiment_id="llm-finetuning-v3",
        gpu_count=4,
        instance_type="p3.8xlarge",
        hours_to_live=48,
    )

    print(f"Infrastructure ready: {resources}")

    # Run your ML training job here...
    time.sleep(5)  # Placeholder

    # Clean up after experiment
    destroy_experiment_stack("llm-finetuning-v3")

Testing Pulumi Programs with pytest

This is where Pulumi genuinely shines over Terraform. You can write real unit tests for your infrastructure logic.

# tests/test_ml_storage.py
import pulumi


class PulumiMocks(pulumi.runtime.Mocks):
    """Mock Pulumi runtime for unit testing - no real AWS calls."""

    def __init__(self):
        # Record every registered resource type so tests can inspect them
        self.resource_types = []

    def new_resource(self, args: pulumi.runtime.MockResourceArgs):
        self.resource_types.append(args.typ)
        return [args.name + "_id", args.inputs]

    def call(self, args: pulumi.runtime.MockCallArgs):
        return {}


mocks = PulumiMocks()
pulumi.runtime.set_mocks(mocks)


# Import AFTER setting mocks
from components.ml_storage import MLStorage, MLStorageArgs


@pulumi.runtime.test
def test_ml_storage_creates_three_buckets():
    """Verify that MLStorage creates model, data, and logs buckets."""

    storage = MLStorage(
        "test-storage",
        MLStorageArgs(
            prefix="test-eai",
            environment="dev",
            tags={"Project": "test"},
        ),
    )

    def check_buckets(args):
        model_bucket_name, training_bucket_name, logs_bucket_name = args
        assert model_bucket_name is not None
        assert "model-artifacts" in model_bucket_name
        assert training_bucket_name is not None
        assert "training-data" in training_bucket_name
        assert logs_bucket_name is not None
        assert "experiment-logs" in logs_bucket_name

    return pulumi.Output.all(
        storage.model_bucket.id,
        storage.training_data_bucket.id,
        storage.experiment_logs_bucket.id,
    ).apply(check_buckets)


@pulumi.runtime.test
def _create_storage_for_versioning_check():
    """Helper (not collected by pytest): builds the component under mocks."""
    MLStorage(
        "test-storage-v",
        MLStorageArgs(prefix="test", environment="dev"),
    )


def test_model_bucket_has_versioning():
    """Ensure the component registers a bucket versioning resource."""
    mocks.resource_types.clear()
    # The decorated helper is expected to return only after its outstanding
    # resource registrations have reached the mocks
    _create_storage_for_versioning_check()
    assert any("versioning" in typ.lower() for typ in mocks.resource_types)


@pulumi.runtime.test
def test_bucket_tags_include_environment():
    """Tags must include environment for cost attribution."""

    storage = MLStorage(
        "test-storage-tags",
        MLStorageArgs(
            prefix="eai",
            environment="prod",
            tags={"Project": "ml-platform"},
        ),
    )

    def check_tags(tags):
        assert tags.get("Environment") == "prod"
        assert tags.get("ManagedBy") == "pulumi"
        assert tags.get("Project") == "ml-platform"

    return storage.model_bucket.tags.apply(check_tags)


# Run: pytest tests/ -v

SageMaker Endpoints with Pulumi

# components/sagemaker_endpoint.py
import pulumi
import pulumi_aws as aws
from typing import Optional


class SageMakerEndpointArgs:
    def __init__(
        self,
        prefix: str,
        model_name: str,
        model_s3_uri: pulumi.Input[str],
        container_image_uri: str,
        execution_role_arn: pulumi.Input[str],  # Required by aws.sagemaker.Model
        instance_type: str = "ml.m5.xlarge",
        initial_instance_count: int = 1,
        tags: Optional[dict] = None,
    ):
        self.prefix = prefix
        self.model_name = model_name
        self.model_s3_uri = model_s3_uri
        self.container_image_uri = container_image_uri
        self.execution_role_arn = execution_role_arn
        self.instance_type = instance_type
        self.initial_instance_count = initial_instance_count
        self.tags = tags or {}


class SageMakerEndpoint(pulumi.ComponentResource):
    """
    SageMaker model + endpoint config + endpoint.
    Wraps the three SageMaker resources into one logical component.
    """

    endpoint: aws.sagemaker.Endpoint
    endpoint_name: pulumi.Output[str]
    endpoint_url: pulumi.Output[str]

    def __init__(
        self,
        name: str,
        args: SageMakerEndpointArgs,
        opts: Optional[pulumi.ResourceOptions] = None,
    ):
        super().__init__("eai:ml:SageMakerEndpoint", name, {}, opts)
        child_opts = pulumi.ResourceOptions(parent=self)

        resource_prefix = f"{args.prefix}-{args.model_name}"

        # SageMaker Model
        sm_model = aws.sagemaker.Model(
            f"{resource_prefix}-model",
            name=f"{resource_prefix}-model",
            execution_role_arn=args.execution_role_arn,
            primary_container=aws.sagemaker.ModelPrimaryContainerArgs(
                image=args.container_image_uri,
                model_data_url=args.model_s3_uri,
                environment={
                    "SAGEMAKER_CONTAINER_LOG_LEVEL": "20",
                    "TS_DEFAULT_WORKERS_PER_MODEL": "1",
                },
            ),
            tags=args.tags,
            opts=child_opts,
        )

        # Endpoint Config
        endpoint_config = aws.sagemaker.EndpointConfiguration(
            f"{resource_prefix}-config",
            name=f"{resource_prefix}-config",
            production_variants=[
                aws.sagemaker.EndpointConfigurationProductionVariantArgs(
                    variant_name="AllTraffic",
                    model_name=sm_model.name,
                    initial_instance_count=args.initial_instance_count,
                    instance_type=args.instance_type,
                    initial_variant_weight=1.0,
                )
            ],
            tags=args.tags,
            opts=child_opts,
        )

        # Endpoint
        self.endpoint = aws.sagemaker.Endpoint(
            f"{resource_prefix}-endpoint",
            name=f"{resource_prefix}-endpoint",
            endpoint_config_name=endpoint_config.name,
            tags=args.tags,
            opts=child_opts,
        )

        self.endpoint_name = self.endpoint.name
        self.endpoint_url = pulumi.Output.concat(
            "https://runtime.sagemaker.",
            pulumi.Config("aws").get("region") or "us-east-1",
            ".amazonaws.com/endpoints/",
            self.endpoint.name,
            "/invocations",
        )

        self.register_outputs({
            "endpoint_name": self.endpoint_name,
            "endpoint_url": self.endpoint_url,
        })

Complete Platform Program

# __main__.py - complete ML platform

import pulumi
import pulumi_aws as aws
from components.ml_storage import MLStorage, MLStorageArgs
from components.eks_cluster import EKSCluster, EKSClusterArgs

config = pulumi.Config()
environment = config.require("environment")
project_name = config.get("project_name") or "eai"
prefix = f"{project_name}-{environment}"
is_production = environment == "prod"

common_tags = {
    "Project": project_name,
    "Environment": environment,
    "ManagedBy": "pulumi",
}

# VPC
vpc = aws.ec2.Vpc(
    f"{prefix}-vpc",
    cidr_block="10.0.0.0/16",
    enable_dns_hostnames=True,
    enable_dns_support=True,
    tags={**common_tags, "Name": f"{prefix}-vpc"},
)

# Storage layer
storage = MLStorage(
    "ml-storage",
    MLStorageArgs(
        prefix=prefix,
        environment=environment,
        transition_to_ia_days=30 if is_production else 7,
        transition_to_glacier_days=90 if is_production else 30,
        tags=common_tags,
    ),
)

# ECR repositories for all ML workloads
ecr_repos = ["training/pytorch", "training/tensorflow", "serving/triton", "preprocessing"]

repositories = {}
for repo_name in ecr_repos:
    safe_name = repo_name.replace("/", "-")
    repo = aws.ecr.Repository(
        f"{prefix}-{safe_name}",
        name=f"{prefix}/{repo_name}",
        image_tag_mutability="MUTABLE",
        image_scanning_configuration=aws.ecr.RepositoryImageScanningConfigurationArgs(
            scan_on_push=True,
        ),
        tags={**common_tags, "Component": repo_name},
    )
    repositories[repo_name] = repo

# Exports
pulumi.export("model_bucket_name", storage.model_bucket.id)
pulumi.export("training_data_bucket_name", storage.training_data_bucket.id)
# The registry host is everything before the first "/" of the repository URL
pulumi.export("ecr_base_url", repositories["training/pytorch"].repository_url.apply(
    lambda url: url.split("/", 1)[0]
))

Production Engineering Notes

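Output typing: Most resource properties are Output[T] values - promise-like wrappers whose contents become known only once the resource has been created or updated. There is no supported way to read the raw value synchronously inside the program, and during pulumi preview the value may not be known at all. Use .apply() (or pulumi.Output.all and pulumi.Output.concat) for transformations that depend on resolved values. This is the #1 point of confusion for Pulumi beginners.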
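The behaviour can be pictured with a toy model. The `Deferred` class below is a hypothetical stand-in written for this illustration, not Pulumi's implementation: it only mimics the two properties that matter here, namely that transformations go through `apply` and that an unknown value (as in a preview) skips the callback.

```python
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")
U = TypeVar("U")


class Deferred(Generic[T]):
    """Toy stand-in for Output[T]: a value that may not be known yet."""

    def __init__(self, value: Optional[T] = None, known: bool = True):
        self._value = value
        self._known = known

    def apply(self, fn: Callable[[T], U]) -> "Deferred[U]":
        # If the value is unknown (as during a preview), skip the callback
        # and propagate "unknown" instead of calling fn with a placeholder.
        if not self._known:
            return Deferred(known=False)
        return Deferred(fn(self._value))

    def __str__(self) -> str:
        return "<deferred; use .apply()>"


bucket_id = Deferred("eai-prod-model-artifacts")   # value known after deployment
preview_id: Deferred[str] = Deferred(known=False)  # value unknown during preview

uri = bucket_id.apply(lambda b: f"s3://{b}/models/")
preview_uri = preview_id.apply(lambda b: f"s3://{b}/models/")

# Formatting the wrapper directly embeds the wrapper's text, not the bucket name
naive = f"s3://{bucket_id}/models/"
print(naive)
```

The last line is the mistake to avoid: string formatting sees the wrapper object, so the resulting URI never contains the bucket name.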

State backend: For teams, use either Pulumi Cloud (managed, free for small teams) or the self-managed S3 backend: pulumi login s3://mybucket/pulumi-state. A self-managed backend has no Pulumi Cloud-managed encryption key: secrets are encrypted with a passphrase by default, or with a cloud KMS key if you create the stack with --secrets-provider (for example an awskms:// URL).

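Resource naming: Pulumi appends a random suffix to the physical names of most resources by default (e.g., model-bucket-a1b2c3d). To control the physical name, pass the resource's naming argument explicitly (name on most resources, bucket on S3 buckets). But be careful - auto-naming is what lets Pulumi create a replacement before deleting the old resource. With an explicit name the replacement cannot coexist with the old resource, so replacing it means deleting the old one first, with downtime in between.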
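A toy illustration of why the suffix matters for replacements. The `auto_name` helper and the 7-character suffix length are assumptions that mimic the default behaviour; Pulumi generates the real names itself.

```python
import secrets


def auto_name(logical_name: str, suffix_hex_chars: int = 7) -> str:
    """Mimic auto-naming: logical name plus a random hex suffix (toy version)."""
    suffix = secrets.token_hex(4)[:suffix_hex_chars]
    return f"{logical_name}-{suffix}"


def can_create_before_delete(old_physical: str, new_physical: str) -> bool:
    """Old and new resources can only coexist if their physical names differ."""
    return old_physical != new_physical


old = auto_name("model-bucket")
print(old)

# An explicit name stays the same across a replacement, so the two would collide
explicit = "eai-prod-model-artifacts"
print(can_create_before_delete(explicit, explicit))
```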

Automation API in CI: The Automation API is ideal for ephemeral environment management in CI. Create the environment in the pre-test step, run integration tests, destroy in post-test. This replaces complex fixture management with real infrastructure.

Common Mistakes

:::danger Do Not Block on Output Values
The most common Pulumi bug: treating Output[str] as a plain string. my_bucket.id + "-suffix" does not work - use pulumi.Output.concat(my_bucket.id, "-suffix") or my_bucket.id.apply(lambda id: f"{id}-suffix").
:::

:::danger Do Not Create Resources in .apply() Callbacks
Creating Pulumi resources inside .apply() callbacks causes them to be invisible to pulumi preview and leads to ordering bugs. Always define resources at the top level of your program, using Output references as arguments.
:::

:::warning Python Execution Order
A Pulumi program is ordinary Python and runs top to bottom. A resource variable must be assigned before another resource references it, otherwise the program fails with a NameError before Pulumi has built the resource graph. Structure your __main__.py so that dependencies are defined before dependents; Pulumi then derives the deployment order from the Outputs passed between resources.
:::

:::warning Testing Limitations with Mocks
The pulumi.runtime.Mocks testing approach tests the structure of your program, not the real cloud behavior. Always complement unit tests with integration tests that run against a real AWS account (in a sandboxed dev environment) before trusting your component resources in production.
:::

Interview Q&A

Q: What is the fundamental difference between Pulumi and Terraform, and when would you choose one over the other?

The fundamental difference is language: Terraform uses HCL (a purpose-built DSL), Pulumi uses real programming languages. Both are declarative at heart - you describe desired state and the tool handles the diff and apply. Choose Terraform when: your team is already proficient with it, your infrastructure is relatively static, you want the largest community and module ecosystem. Choose Pulumi when: your team writes Python/TypeScript and wants infrastructure in the same language, you need complex logic (loops, conditionals, class hierarchies) that is awkward in HCL, or you want to use the Automation API to embed infrastructure operations in application code. For ML teams, Pulumi often wins on the "one language" argument alone.

Q: What is a Pulumi ComponentResource and how does it differ from a Terraform module?

A ComponentResource is a Python class that extends pulumi.ComponentResource. It groups related resources logically, exposes typed inputs and outputs, and appears as a single node in the Pulumi resource graph. Unlike a Terraform module (a directory of .tf files), a ComponentResource is a real class - you can add methods, implement interfaces, write docstrings, and test it with pytest. Parenting is not automatic: child resources must be created with opts=pulumi.ResourceOptions(parent=self). They are then nested under the component in pulumi stack --show-urns and are destroyed when the component is destroyed.

Q: How does Pulumi handle secrets and how does it compare to Terraform's approach?

Pulumi encrypts secret values in the state file using a key managed by Pulumi Cloud, a passphrase-derived key, or a cloud KMS key. You mark a config value as secret with pulumi config set --secret, and any Output derived from that value is automatically tainted as secret - it will not appear in plaintext in logs or the CLI output. This is stricter than Terraform, where sensitive = true on an output prevents it from printing in the terminal but it is still stored in plaintext in the state file unless you encrypt the state backend. For production, both tools benefit from external secrets management (AWS Secrets Manager, HashiCorp Vault) for the most sensitive values, using data sources/lookups rather than storing the secrets in IaC state at all.

Q: What is the Pulumi Automation API and what ML use cases does it enable?

The Automation API is a Python (and other languages) SDK that embeds Pulumi stack operations in regular code. Instead of running pulumi up from a shell, you call stack.up() from Python. This enables: (1) Ephemeral experiment environments - provision a training cluster, run the experiment, destroy everything, all in one Python script; (2) Cost management - find stacks that haven't been updated in 7 days and destroy them automatically; (3) Dynamic multi-tenant provisioning - a platform service that creates isolated per-team infrastructure on demand; (4) Integration testing pipelines - create real infrastructure in a pre-test hook, run integration tests, destroy in a post-test hook. The Automation API makes infrastructure a library function rather than an operational boundary.
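The cost-management case (2) reduces to a small selection step. The sketch below assumes a hypothetical `StackInfo` record holding a stack name and its last update time; in a real script these records would likely be built from the stack summaries an Automation API workspace can list, and each stale name would then be handed to a destroy routine.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable, Optional


@dataclass
class StackInfo:
    """Minimal stand-in for a stack summary (hypothetical; only the fields used here)."""
    name: str
    last_update: Optional[datetime]


def find_stale_stacks(
    stacks: Iterable[StackInfo],
    now: datetime,
    max_age_days: int = 7,
) -> list[str]:
    """Return names of stacks whose last update is older than max_age_days."""
    cutoff = now - timedelta(days=max_age_days)
    return [
        s.name
        for s in stacks
        # Stacks that were never deployed have no timestamp and are skipped
        if s.last_update is not None and s.last_update < cutoff
    ]


now = datetime(2024, 6, 15, tzinfo=timezone.utc)
stacks = [
    StackInfo("experiment-a", now - timedelta(days=10)),
    StackInfo("experiment-b", now - timedelta(days=2)),
    StackInfo("experiment-c", None),
]
stale = find_stale_stacks(stacks, now)
print(stale)  # ['experiment-a']
```

Passing `now` in explicitly keeps the selection logic deterministic and easy to unit test.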

Q: How would you test Pulumi infrastructure code?

Three-layer testing strategy: (1) Unit tests with pulumi.runtime.Mocks - mock the cloud provider, test that your ComponentResource creates the right resources with the right configuration, run in milliseconds with no cloud access required; (2) Integration tests with the Automation API - deploy real infrastructure in a sandbox AWS account, run assertions against the live resources, then destroy. This catches IAM permission errors, resource conflicts, and configuration mistakes that mocks cannot; (3) Policy as code with Pulumi CrossGuard (or OPA) - enforce organizational policies (all S3 buckets must be private, all resources must have cost tags) as part of the pulumi preview step, before any deployment. Each layer catches different classes of bugs.
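The policy layer (3) boils down to a function that inspects a resource's type and properties and reports violations. Here is a minimal sketch of that check, kept as plain Python so it can be unit tested on its own; the `tag_violations` name and the required tag set are assumptions, and wiring the function into a CrossGuard policy pack is left out.

```python
REQUIRED_TAGS = ("Project", "Environment", "ManagedBy")


def tag_violations(resource_type: str, props: dict) -> list[str]:
    """Return human-readable violations for one resource's properties."""
    # Only S3 buckets are checked in this sketch
    if resource_type != "aws:s3/bucket:Bucket":
        return []
    tags = props.get("tags") or {}
    return [f"missing required tag '{t}'" for t in REQUIRED_TAGS if t not in tags]


ok = tag_violations(
    "aws:s3/bucket:Bucket",
    {"tags": {"Project": "eai", "Environment": "prod", "ManagedBy": "pulumi"}},
)
bad = tag_violations("aws:s3/bucket:Bucket", {"tags": {"Project": "eai"}})
print(ok)
print(bad)
```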
