Infrastructure as Code for ML

Reading time: ~50 min · Interview relevance: Very High · Target roles: ML Engineer, MLOps Engineer, Platform Engineer

The GPU Cluster That Cost $40,000 and Could Not Be Rebuilt

A well-funded ML startup spun up a 16-node A100 cluster on AWS for a large language model pretraining run. The cluster had been manually configured over three weeks: one engineer installed CUDA drivers, another configured NCCL, a third set up the shared NFS volume, a fourth configured the Slurm scheduler. Each step was done by SSHing into machines and running commands. Nobody wrote anything down systematically.

Six months later, the cluster needed to be rebuilt in a different AWS region for regulatory reasons. The engineers who built it had moved on. The company spent four weeks trying to reproduce the setup, hit mysterious NCCL performance issues that did not exist in the original cluster, and ultimately gave up and hired a consulting firm to rebuild it from scratch. Total cost: $40,000 in engineering time, plus three weeks of delayed training.

The root cause was not that the engineers were bad. It was that infrastructure was treated as a manually configured artifact rather than as code. If the cluster configuration had been in Terraform and Ansible, rebuilding it in a new region would have taken 20 minutes and a single terraform apply.

This is the core promise of Infrastructure as Code (IaC): your infrastructure is described in version-controlled files. You can rebuild it, audit it, review changes to it, and roll it back - the same way you manage application code. For ML teams, this has additional benefits: you can reproduce the exact hardware environment used to train a model, spin up identical environments for experimentation, and tear down expensive GPU clusters when they are not needed.

This lesson covers the IaC stack most commonly used in ML infrastructure: Terraform for provisioning cloud resources, Helm for deploying applications to Kubernetes, Pulumi for Python-native IaC, Ansible for node configuration, and ArgoCD for GitOps-style deployment automation.

Why This Exists - The Snowflake Server Problem

In the pre-IaC era, servers were "pets" - individually named, hand-configured, carefully maintained, and irreplaceable. When a pet server failed, you scrambled to recreate its configuration from memory and scattered documentation. This worked fine when teams had 5 servers. It broke completely at 50, and was impossible at 500.

IaC introduced the "cattle" model: servers are interchangeable instances of a defined configuration. When one fails, you provision a new one from the same definition. The term "snowflake server" emerged to describe the opposite - a server so heavily customized by hand that no two are identical and none can be reproduced.

For ML infrastructure, the snowflake problem is acute because GPU nodes require precise configuration: specific CUDA toolkit versions, specific cuDNN versions, NCCL compiled with specific flags, InfiniBand drivers configured with specific MTU settings, EFA (Elastic Fabric Adapter) enabled for AWS P4d instances. Any mismatch causes silent performance degradation - collective operations fall back to slower paths, bandwidth is halved, and you spend days profiling before finding the driver version mismatch.
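
Before a large run it is worth snapshotting these versions on every node and diffing them. A minimal sketch, assuming a standard CUDA/PyTorch install and the AWS EFA/libfabric tooling on the nodes:

# Record the versions that matter for collective-communication performance
nvidia-smi --query-gpu=driver_version --format=csv,noheader
nvcc --version | grep release

# The CUDA and NCCL versions PyTorch was actually built against
python -c "import torch; print(torch.version.cuda)"
python -c "import torch; print(torch.cuda.nccl.version())"

# On AWS p4d instances, confirm the EFA provider is visible to libfabric
fi_info -p efa | head -5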

Terraform was created by HashiCorp in 2014, inspired by AWS CloudFormation but designed to be cloud-agnostic. Its declarative model - you describe the desired state, Terraform figures out how to reach it - proved significantly more ergonomic than the procedural scripts that preceded it.

Historical Context - From CFEngine to Pulumi

Configuration management predates cloud computing. CFEngine (1993) was the first tool to manage server configuration at scale, introducing the concept of convergence: the tool repeatedly applies the desired configuration until the system reaches it. Puppet (2005) and Chef (2009) followed, using domain-specific languages to describe desired state.

Ansible (2012) simplified this by using SSH and YAML playbooks - no agent needed on managed nodes. This made it particularly well-suited for initial node setup (bootstrapping CUDA drivers onto fresh VMs), where you cannot install an agent on the machine before the machine is configured.

Terraform (2014) separated infrastructure provisioning (creating cloud resources) from configuration management (configuring the OS and applications on those resources). This distinction matters: you use Terraform to create an EC2 instance and an S3 bucket; you use Ansible to install CUDA on that EC2 instance.

Pulumi (2017) introduced a radical alternative: write IaC in general-purpose programming languages (Python, TypeScript, Go). This means you can use loops, conditionals, functions, and classes to express infrastructure logic that would require complex workarounds in HCL (Terraform's domain-specific language).

Helm (2015) addressed the Kubernetes packaging problem. Writing raw Kubernetes YAML manifests for a model serving deployment involves dozens of files (Deployment, Service, Ingress, HorizontalPodAutoscaler, ConfigMap, Secret...). Helm packages these into a chart with templating, making it easy to deploy the same application to multiple environments with different configurations.

Core Concepts

Terraform Architecture

Terraform has four key components: providers (plugins that interface with cloud APIs), resources (cloud objects like EC2 instances or S3 buckets), state (a file tracking what Terraform has created), and modules (reusable packages of resources).
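
As a rough sketch of how those pieces interact day to day (commands only; no ML-specific flags assumed):

# Typical workflow tying the four components together
terraform init          # downloads providers and modules, configures the state backend
terraform plan          # diffs the declared resources against the recorded state
terraform apply         # creates or updates real cloud resources to match
terraform state list    # lists every resource currently tracked in state
terraform destroy       # tears everything down (handy for expensive GPU clusters)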

Terraform Module for GPU Training Cluster

# modules/gpu-cluster/main.tf
# Terraform module that creates a GPU training cluster on AWS
# using p4d.24xlarge instances (8x NVIDIA A100 40 GB GPUs each)

terraform {
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 5.0"
}
}
}

# ---------------------------------------------------------------
# Variables
# ---------------------------------------------------------------
variable "cluster_name" {
description = "Name for this GPU cluster"
type = string
}

variable "num_nodes" {
description = "Number of GPU nodes"
type = number
default = 4
}

variable "instance_type" {
description = "EC2 instance type"
type = string
default = "p4d.24xlarge"
}

variable "use_spot" {
description = "Use spot instances (cheaper, may be interrupted)"
type = bool
default = false
}

variable "vpc_id" {
description = "VPC to launch instances in"
type = string
}

variable "subnet_ids" {
description = "Subnet IDs (use placement group subnet for best NCCL performance)"
type = list(string)
}

variable "key_name" {
description = "EC2 key pair name for SSH access"
type = string
}

variable "tags" {
description = "Tags to apply to all resources"
type = map(string)
default = {}
}

# ---------------------------------------------------------------
# Placement group: ensures low-latency network between nodes
# Critical for NCCL all-reduce performance
# ---------------------------------------------------------------
resource "aws_placement_group" "gpu_cluster" {
name = "${var.cluster_name}-placement-group"
strategy = "cluster" # all instances in same physical rack

tags = merge(var.tags, {
Name = "${var.cluster_name}-placement-group"
})
}

# ---------------------------------------------------------------
# Security group: allow NCCL traffic between nodes (all ports
# within the security group), plus SSH from bastion
# ---------------------------------------------------------------
resource "aws_security_group" "gpu_nodes" {
name = "${var.cluster_name}-gpu-sg"
description = "Security group for GPU training nodes"
vpc_id = var.vpc_id

# All traffic between nodes in this security group (NCCL)
ingress {
from_port = 0
to_port = 0
protocol = "-1"
self = true
}

# SSH from within VPC
ingress {
from_port = 22
to_port = 22
protocol = "tcp"
cidr_blocks = ["10.0.0.0/8"]
}

# All outbound (for package downloads, S3 access)
egress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
}

tags = merge(var.tags, {Name = "${var.cluster_name}-gpu-sg"})
}

# ---------------------------------------------------------------
# Launch template: AMI with CUDA pre-installed, EFA enabled
# ---------------------------------------------------------------
data "aws_ami" "dlami" {
most_recent = true
owners = ["amazon"]

filter {
name = "name"
values = ["Deep Learning AMI GPU PyTorch*"]
}

filter {
name = "architecture"
values = ["x86_64"]
}
}

resource "aws_launch_template" "gpu_node" {
name_prefix = "${var.cluster_name}-lt-"
image_id = data.aws_ami.dlami.id
instance_type = var.instance_type
key_name = var.key_name

# Enable EFA (Elastic Fabric Adapter) for high-bandwidth GPU-to-GPU comm
network_interfaces {
interface_type = "efa"
security_groups = [aws_security_group.gpu_nodes.id]
subnet_id = var.subnet_ids[0]
delete_on_termination = true
}

# 500 GB encrypted gp3 root volume (training data lives on local NVMe instance storage or NFS)
block_device_mappings {
device_name = "/dev/sda1"
ebs {
volume_size = 500
volume_type = "gp3"
iops = 16000
throughput = 1000
delete_on_termination = true
encrypted = true
}
}

# Enable instance metadata v2 (security best practice)
metadata_options {
http_endpoint = "enabled"
http_tokens = "required"
http_put_response_hop_limit = 2
}

# Instance profile for S3 access
iam_instance_profile {
arn = aws_iam_instance_profile.gpu_node.arn
}

# User data: initial setup script
user_data = base64encode(templatefile(
"${path.module}/scripts/node_init.sh",
{
cluster_name = var.cluster_name
s3_bucket = var.s3_bucket
}
))

dynamic "instance_market_options" {
for_each = var.use_spot ? [1] : []
content {
market_type = "spot"
spot_options {
spot_instance_type = "one-time"
instance_interruption_behavior = "terminate"
}
}
}

tag_specifications {
resource_type = "instance"
tags = merge(var.tags, {Name = "${var.cluster_name}-gpu-node"})
}
}

# ---------------------------------------------------------------
# Auto Scaling Group: manages the node count
# ---------------------------------------------------------------
resource "aws_autoscaling_group" "gpu_nodes" {
name = "${var.cluster_name}-asg"
desired_capacity = var.num_nodes
max_size = var.num_nodes
min_size = 0

vpc_zone_identifier = var.subnet_ids
placement_group = aws_placement_group.gpu_cluster.id

launch_template {
id = aws_launch_template.gpu_node.id
version = "$Latest"
}

# Instance refresh: rolling update when launch template changes
instance_refresh {
strategy = "Rolling"
preferences {
min_healthy_percentage = 50
}
}

tag {
key = "cluster_name"
value = var.cluster_name
propagate_at_launch = true
}

lifecycle {
# Set prevent_destroy = true to block an accidental `terraform destroy`
# during long training runs; scale down with -var="num_nodes=0" instead
prevent_destroy = false
}
}

# ---------------------------------------------------------------
# IAM: allow nodes to read from S3 training data bucket
# ---------------------------------------------------------------
resource "aws_iam_role" "gpu_node" {
name = "${var.cluster_name}-gpu-node-role"

assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [{
Effect = "Allow"
Principal = {Service = "ec2.amazonaws.com"}
Action = "sts:AssumeRole"
}]
})
}

resource "aws_iam_role_policy" "gpu_node_s3" {
name = "s3-access"
role = aws_iam_role.gpu_node.id

policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Effect = "Allow"
Action = [
"s3:GetObject",
"s3:PutObject",
"s3:ListBucket",
"s3:DeleteObject"
]
Resource = [
"arn:aws:s3:::${var.s3_bucket}",
"arn:aws:s3:::${var.s3_bucket}/*"
]
}
]
})
}

resource "aws_iam_instance_profile" "gpu_node" {
name = "${var.cluster_name}-gpu-node-profile"
role = aws_iam_role.gpu_node.name
}

# ---------------------------------------------------------------
# Outputs
# ---------------------------------------------------------------
output "placement_group_id" {
value = aws_placement_group.gpu_cluster.id
}

output "security_group_id" {
value = aws_security_group.gpu_nodes.id
}

output "autoscaling_group_name" {
value = aws_autoscaling_group.gpu_nodes.name
}

# environments/prod/main.tf
# Top-level Terraform configuration that uses the gpu-cluster module

terraform {
required_version = ">= 1.5"

backend "s3" {
bucket = "my-terraform-state-bucket"
key = "ml-infrastructure/prod/terraform.tfstate"
region = "us-east-1"
encrypt = true
dynamodb_table = "terraform-state-lock" # prevents concurrent applies
}
}

provider "aws" {
region = "us-east-1"
}

module "training_cluster" {
source = "../../modules/gpu-cluster"

cluster_name = "llm-pretraining-cluster"
num_nodes = 16
instance_type = "p4d.24xlarge"
use_spot = false # preemption would disrupt 72h training runs

vpc_id = module.networking.vpc_id
subnet_ids = module.networking.private_subnet_ids
key_name = "ml-team-keypair"
s3_bucket = "my-training-data-bucket"

tags = {
Environment = "production"
Team = "ml-platform"
CostCenter = "model-training"
ManagedBy = "terraform"
}
}

# Separate spot instance cluster for experimentation
module "experiment_cluster" {
source = "../../modules/gpu-cluster"

cluster_name = "llm-experiment-cluster"
num_nodes = 4
instance_type = "p3.8xlarge"
use_spot = true # experiments can tolerate interruptions

vpc_id = module.networking.vpc_id
subnet_ids = module.networking.private_subnet_ids
key_name = "ml-team-keypair"
s3_bucket = "my-training-data-bucket"

tags = {
Environment = "staging"
Team = "ml-platform"
CostCenter = "experimentation"
ManagedBy = "terraform"
}
}

output "training_cluster_asg" {
value = module.training_cluster.autoscaling_group_name
}

Spot Instance Handling with Checkpointing

Spot instances cost 60-90% less than on-demand but can be interrupted with two minutes' notice. For ML training, this requires checkpoint-aware training code and infrastructure that handles interruptions gracefully.

# Terraform: spot interruption handling
# Uses EC2 instance termination lifecycle hooks to save a checkpoint
# before the instance is terminated

resource "aws_autoscaling_lifecycle_hook" "spot_termination" {
name = "${var.cluster_name}-spot-termination"
autoscaling_group_name = aws_autoscaling_group.gpu_nodes.name
lifecycle_transition = "autoscaling:EC2_INSTANCE_TERMINATING"
heartbeat_timeout = 120 # 2 minutes to save checkpoint
default_result = "CONTINUE"

notification_target_arn = aws_sqs_queue.spot_interruption.arn
role_arn = aws_iam_role.lifecycle_hook.arn
}

resource "aws_sqs_queue" "spot_interruption" {
name = "${var.cluster_name}-spot-interruption"
message_retention_seconds = 300
visibility_timeout_seconds = 120
}

# scripts/spot_interruption_handler.py
"""
Runs on each training node.
Listens for spot interruption notifications via SQS.
On interrupt, triggers checkpoint save and graceful shutdown.
"""

import signal
import boto3
import threading
import time
import os
import sys


class SpotInterruptionHandler:
def __init__(self, trainer, sqs_queue_url: str, region: str = "us-east-1"):
self.trainer = trainer
self.sqs = boto3.client("sqs", region_name=region)
self.queue_url = sqs_queue_url
self.interrupted = threading.Event()

# Also handle AWS instance metadata notification
# (2-minute warning via metadata service)
threading.Thread(target=self._poll_metadata, daemon=True).start()
threading.Thread(target=self._poll_sqs, daemon=True).start()

def _poll_metadata(self) -> None:
"""Check EC2 instance metadata for spot interruption notice."""
import requests

while not self.interrupted.is_set():
try:
# IMDSv2 is enforced by the launch template (http_tokens = "required"),
# so fetch a session token before querying the metadata endpoint
token = requests.put(
"http://169.254.169.254/latest/api/token",
headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
timeout=1,
).text
resp = requests.get(
"http://169.254.169.254/latest/meta-data/"
"spot/instance-action",
headers={"X-aws-ec2-metadata-token": token},
timeout=1,
)
if resp.status_code == 200:
action = resp.json()
print(f"SPOT INTERRUPTION: action={action['action']} "
f"at={action['time']}")
self._handle_interruption()
return
except Exception:
pass
time.sleep(5)

def _poll_sqs(self) -> None:
"""Poll SQS for lifecycle hook messages."""
while not self.interrupted.is_set():
try:
messages = self.sqs.receive_message(
QueueUrl = self.queue_url,
MaxNumberOfMessages = 1,
WaitTimeSeconds = 10,
)
for msg in messages.get("Messages", []):
print("Received spot interruption via SQS lifecycle hook")
self._handle_interruption()
self.sqs.delete_message(
QueueUrl = self.queue_url,
ReceiptHandle = msg["ReceiptHandle"],
)
except Exception:
pass

def _handle_interruption(self) -> None:
"""Save checkpoint and signal training to stop."""
if self.interrupted.is_set():
return # already handling

self.interrupted.set()
print("Spot interruption detected - saving emergency checkpoint...")

try:
self.trainer.save_checkpoint(
path = os.environ.get("CHECKPOINT_DIR", "/tmp/checkpoints"),
is_emergency = True,
)
print("Emergency checkpoint saved successfully")
except Exception as exc:
print(f"FAILED to save emergency checkpoint: {exc}")

# Give other ranks time to save their checkpoints
time.sleep(10)

# Signal the training process to exit gracefully
os.kill(os.getpid(), signal.SIGTERM)

Helm Chart for Model Serving

Helm packages a set of Kubernetes manifests into a "chart" - a versioned, distributable unit with template variables.

# helm/model-serving/Chart.yaml
apiVersion: v2
name: model-serving
description: Helm chart for deploying ML model inference services
version: 1.3.0
appVersion: "2.1.0"

dependencies:
- name: prometheus-operator
version: "0.72.0"
repository: "https://prometheus-community.github.io/helm-charts"
condition: prometheus.enabled

# helm/model-serving/values.yaml
# Default values - override per environment

replicaCount: 2

image:
repository: gcr.io/my-project/model-server
pullPolicy: IfNotPresent
tag: "" # overridden by CI with git commit SHA

model:
name: recommendation-model
version: "2.1.0"
artifactUri: s3://my-models/recommendation-model/v2.1.0/
configMap: model-config

resources:
requests:
memory: "8Gi"
cpu: "4"
nvidia.com/gpu: "1"
limits:
memory: "16Gi"
cpu: "8"
nvidia.com/gpu: "1"

autoscaling:
enabled: true
minReplicas: 2
maxReplicas: 20
targetCPUUtilizationPercentage: 70
# Scale on p95 latency via KEDA (optional)
kedaEnabled: false

service:
type: ClusterIP
port: 8080
metricsPort: 8090

ingress:
enabled: true
className: nginx
annotations:
nginx.ingress.kubernetes.io/proxy-read-timeout: "120"
hosts:
- host: model.internal.example.com
paths:
- path: /
pathType: Prefix

livenessProbe:
httpGet:
path: /health/live
port: 8080
initialDelaySeconds: 30
periodSeconds: 10

readinessProbe:
httpGet:
path: /health/ready
port: 8080
initialDelaySeconds: 60
periodSeconds: 5
failureThreshold: 3

tolerations:
- key: "nvidia.com/gpu"
operator: "Exists"
effect: "NoSchedule"

nodeSelector:
node-type: gpu-inference

prometheus:
enabled: true
serviceMonitor:
enabled: true
interval: 15s

# helm/model-serving/templates/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: {{ include "model-serving.fullname" . }}
labels:
{{- include "model-serving.labels" . | nindent 4 }}
spec:
replicas: {{ .Values.replicaCount }}
selector:
matchLabels:
{{- include "model-serving.selectorLabels" . | nindent 6 }}
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1
maxUnavailable: 0 # zero downtime deploys
template:
metadata:
labels:
{{- include "model-serving.selectorLabels" . | nindent 8 }}
app.kubernetes.io/version: {{ .Values.model.version | quote }}
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: {{ .Values.service.metricsPort | quote }}
prometheus.io/path: "/metrics"
spec:
{{- with .Values.nodeSelector }}
nodeSelector:
{{- toYaml . | nindent 8 }}
{{- end }}
{{- with .Values.tolerations }}
tolerations:
{{- toYaml . | nindent 8 }}
{{- end }}
containers:
- name: model-server
image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}"
imagePullPolicy: {{ .Values.image.pullPolicy }}
ports:
- name: http
containerPort: {{ .Values.service.port }}
- name: metrics
containerPort: {{ .Values.service.metricsPort }}
env:
- name: MODEL_NAME
value: {{ .Values.model.name | quote }}
- name: MODEL_VERSION
value: {{ .Values.model.version | quote }}
- name: MODEL_ARTIFACT_URI
value: {{ .Values.model.artifactUri | quote }}
resources:
{{- toYaml .Values.resources | nindent 12 }}
livenessProbe:
{{- toYaml .Values.livenessProbe | nindent 12 }}
readinessProbe:
{{- toYaml .Values.readinessProbe | nindent 12 }}
volumeMounts:
- name: model-cache
mountPath: /models/cache
volumes:
- name: model-cache
emptyDir:
sizeLimit: 20Gi

# Deploying the Helm chart

# Install to a new cluster
helm install recommendation-model ./helm/model-serving \
--namespace ml-serving \
--create-namespace \
--values helm/model-serving/values.yaml \
--values helm/model-serving/values.production.yaml \
--set image.tag=$(git rev-parse --short HEAD) \
--set model.version="2.1.0"

# Upgrade an existing deployment
helm upgrade recommendation-model ./helm/model-serving \
--namespace ml-serving \
--values helm/model-serving/values.yaml \
--values helm/model-serving/values.production.yaml \
--set image.tag=$(git rev-parse --short HEAD) \
--atomic \
--timeout 5m        # --atomic rolls back automatically if the upgrade fails

# Rollback to previous version
helm rollback recommendation-model 0 --namespace ml-serving

# Diff before applying (requires helm-diff plugin)
helm diff upgrade recommendation-model ./helm/model-serving \
--namespace ml-serving \
--set image.tag=$(git rev-parse --short HEAD)

Pulumi Python IaC

Pulumi lets you write infrastructure in Python, with the full language available for logic that would require complex workarounds in HCL.

# infra/ml_infrastructure.py
# Pulumi Python stack for ML infrastructure

import pulumi
import pulumi_aws as aws


class MLInfrastructureStack:
"""
Pulumi stack that provisions the full ML infrastructure:
- EKS cluster for model serving
- GPU node groups (training and inference sizes)
- S3 buckets for data and model artifacts
- ECR repository for container images
"""

def __init__(self, config: pulumi.Config):
self.config = config
self.env = config.require("environment")
self.region = config.get("region") or "us-east-1"

# Create resources in dependency order
self.buckets = self._create_s3_buckets()
self.ecr = self._create_ecr_repository()
self.eks = self._create_eks_cluster()
self.node_groups = self._create_node_groups()

def _create_s3_buckets(self) -> dict[str, aws.s3.BucketV2]:
"""Create S3 buckets for training data and model artifacts."""
buckets = {}

for name, suffix in [
("training_data", "training-data"),
("model_artifacts", "model-artifacts"),
("experiment_logs", "experiment-logs"),
]:
bucket = aws.s3.BucketV2(
f"{self.env}-{suffix}",
bucket=f"mycompany-ml-{self.env}-{suffix}",
force_destroy=(self.env != "production"),
tags={
"Environment": self.env,
"ManagedBy": "pulumi",
},
)

# Enable versioning for model artifacts
if name == "model_artifacts":
aws.s3.BucketVersioningV2(
f"{self.env}-{suffix}-versioning",
bucket=bucket.id,
versioning_configuration=aws.s3.BucketVersioningV2VersioningConfigurationArgs(
status="Enabled"
),
)

# Lifecycle rule: move old training data to Glacier after 90 days
if name == "training_data":
aws.s3.BucketLifecycleConfigurationV2(
f"{self.env}-{suffix}-lifecycle",
bucket=bucket.id,
rules=[aws.s3.BucketLifecycleConfigurationV2RuleArgs(
id = "archive-old-data",
status = "Enabled",
filter = aws.s3.BucketLifecycleConfigurationV2RuleFilterArgs(
prefix = ""
),
transitions=[
aws.s3.BucketLifecycleConfigurationV2RuleTransitionArgs(
days = 90,
storage_class = "GLACIER",
)
],
)],
)

buckets[name] = bucket

return buckets

def _create_ecr_repository(self) -> aws.ecr.Repository:
"""Create ECR repository for ML Docker images."""
repo = aws.ecr.Repository(
f"{self.env}-ml-models",
name = f"ml-models-{self.env}",
image_tag_mutability = "MUTABLE",
image_scanning_configuration = aws.ecr.RepositoryImageScanningConfigurationArgs(
scan_on_push = True,
),
tags = {"Environment": self.env, "ManagedBy": "pulumi"},
)

# Lifecycle policy: keep only the 10 most recent tagged images
aws.ecr.LifecyclePolicy(
f"{self.env}-ecr-lifecycle",
repository = repo.name,
policy = pulumi.Output.json_dumps({
"rules": [
{
"rulePriority": 1,
"description": "Keep last 10 tagged images",
"selection": {
"tagStatus": "tagged",
"tagPrefixList": ["v"],
"countType": "imageCountMoreThan",
"countNumber": 10,
},
"action": {"type": "expire"},
}
]
}),
)

return repo

def _create_eks_cluster(self) -> aws.eks.Cluster:
"""Create EKS cluster for ML workloads."""
cluster_role = aws.iam.Role(
f"{self.env}-eks-cluster-role",
assume_role_policy = pulumi.Output.json_dumps({
"Version": "2012-10-17",
"Statement": [{
"Effect": "Allow",
"Principal": {"Service": "eks.amazonaws.com"},
"Action": "sts:AssumeRole",
}],
}),
)

aws.iam.RolePolicyAttachment(
f"{self.env}-eks-cluster-policy",
role = cluster_role.name,
policy_arn = "arn:aws:iam::aws:policy/AmazonEKSClusterPolicy",
)

cluster = aws.eks.Cluster(
f"{self.env}-ml-cluster",
name = f"ml-serving-{self.env}",
role_arn = cluster_role.arn,
version = "1.29",
vpc_config = aws.eks.ClusterVpcConfigArgs(
subnet_ids = self.config.require_object("subnet_ids"),
endpoint_private_access = True,
endpoint_public_access = (self.env != "production"),
),
enabled_cluster_log_types = [
"api", "audit", "authenticator"
],
tags = {"Environment": self.env, "ManagedBy": "pulumi"},
)

return cluster

def _create_node_groups(self) -> dict[str, aws.eks.NodeGroup]:
"""Create GPU node groups for training and inference."""
node_groups = {}

# Node group configurations
configs = [
{
"name": "inference-gpu",
"instance_types": ["g4dn.xlarge"], # 1x T4, for inference
"min_size": 2,
"max_size": 20,
"desired_size": 2,
"labels": {"node-type": "gpu-inference"},
},
{
"name": "training-gpu",
"instance_types": ["p3.8xlarge"], # 4x V100, for training
"min_size": 0,
"max_size": 8,
"desired_size": 0, # scale to 0 when not training
"labels": {"node-type": "gpu-training"},
},
]

for cfg in configs:
node_role = aws.iam.Role(
f"{self.env}-{cfg['name']}-role",
assume_role_policy = pulumi.Output.json_dumps({
"Version": "2012-10-17",
"Statement": [{
"Effect": "Allow",
"Principal": {"Service": "ec2.amazonaws.com"},
"Action": "sts:AssumeRole",
}],
}),
)

for policy_arn in [
"arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy",
"arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy",
"arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly",
]:
aws.iam.RolePolicyAttachment(
f"{self.env}-{cfg['name']}-{policy_arn.split('/')[-1]}",
role = node_role.name,
policy_arn = policy_arn,
)

node_group = aws.eks.NodeGroup(
f"{self.env}-{cfg['name']}",
cluster_name = self.eks.name,
node_group_name = cfg["name"],
node_role_arn = node_role.arn,
subnet_ids = self.config.require_object("subnet_ids"),
instance_types = cfg["instance_types"],
scaling_config = aws.eks.NodeGroupScalingConfigArgs(
min_size = cfg["min_size"],
max_size = cfg["max_size"],
desired_size = cfg["desired_size"],
),
labels = cfg["labels"],
taints = [aws.eks.NodeGroupTaintArgs(
key = "nvidia.com/gpu",
value = "true",
effect = "NO_SCHEDULE",
)],
tags = {"Environment": self.env, "ManagedBy": "pulumi"},
)

node_groups[cfg["name"]] = node_group

return node_groups


# Pulumi entry point
config = pulumi.Config()
stack = MLInfrastructureStack(config)

# Export values for use by other stacks or CI/CD
pulumi.export("cluster_name", stack.eks.name)
pulumi.export("training_bucket", stack.buckets["training_data"].bucket)
pulumi.export("model_artifact_bucket", stack.buckets["model_artifacts"].bucket)
pulumi.export("ecr_repository_url", stack.ecr.repository_url)

Ansible for GPU Node Configuration

Ansible configures the OS and installed software on GPU nodes after Terraform provisions them.

# ansible/playbooks/setup_gpu_node.yaml
---
- name: Configure GPU training node
hosts: gpu_nodes
become: true
vars:
cuda_version: "12.2"
nccl_version: "2.18.3"
python_version: "3.11"

tasks:
# ---------------------------------------------------------------
# System packages
# ---------------------------------------------------------------
- name: Update apt cache
ansible.builtin.apt:
update_cache: true
cache_valid_time: 3600

- name: Install system dependencies
ansible.builtin.apt:
name:
- build-essential
- cmake
- git
- htop
- nvtop
- tmux
- nfs-common
- iftop
- sysstat
state: present

# ---------------------------------------------------------------
# CUDA installation (from NVIDIA package repos)
# ---------------------------------------------------------------
- name: Add NVIDIA CUDA repository
ansible.builtin.apt_key:
url: https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/3bf863cc.pub
state: present

- name: Add CUDA apt repository
ansible.builtin.apt_repository:
repo: "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/ /"
state: present

- name: Install CUDA toolkit {{ cuda_version }} and cuDNN
ansible.builtin.apt:
name:
- "cuda-toolkit-{{ cuda_version | replace('.', '-') }}"
# cuDNN packages are versioned independently of the CUDA toolkit,
# so install the latest libcudnn8 build rather than pinning to the CUDA version
- libcudnn8
- libcudnn8-dev
state: present
update_cache: true

- name: Set CUDA environment variables
ansible.builtin.template:
src: templates/cuda_env.sh.j2
dest: /etc/profile.d/cuda.sh
mode: "0644"

# ---------------------------------------------------------------
# NCCL configuration
# ---------------------------------------------------------------
- name: Configure NCCL for EFA (AWS Elastic Fabric Adapter)
ansible.builtin.lineinfile:
path: /etc/nccl.conf
create: true
line: "{{ item }}"
loop:
- "NCCL_DEBUG=WARN"
- "NCCL_SOCKET_IFNAME=^lo,docker"
- "FI_PROVIDER=efa"
- "FI_EFA_USE_DEVICE_RDMA=1"
- "NCCL_ALGO=Ring"

# ---------------------------------------------------------------
# Python environment
# ---------------------------------------------------------------
- name: Install Python {{ python_version }}
ansible.builtin.apt:
name:
- "python{{ python_version }}"
- "python{{ python_version }}-venv"
- "python{{ python_version }}-dev"
state: present

- name: Create training virtual environment
ansible.builtin.command:
cmd: "python{{ python_version }} -m venv /opt/ml-env"
creates: /opt/ml-env

- name: Install PyTorch and dependencies
ansible.builtin.pip:
virtualenv: /opt/ml-env
name:
- "torch==2.1.0+cu121"
- "torchvision==0.16.0+cu121"
- "torchaudio==2.1.0+cu121"
- "transformers==4.35.0"
- "deepspeed==0.12.3"
- "flash-attn==2.3.0"
extra_args: "--index-url https://download.pytorch.org/whl/cu121"

# ---------------------------------------------------------------
# Storage: mount shared NFS for training data
# ---------------------------------------------------------------
- name: Create mount point for training data
ansible.builtin.file:
path: /data/training
state: directory
mode: "0755"

- name: Mount NFS training data volume
ansible.posix.mount:
src: "{{ nfs_server }}:/exports/training-data"
path: /data/training
fstype: nfs
opts: "rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport"
state: mounted

# ---------------------------------------------------------------
# Validation
# ---------------------------------------------------------------
- name: Verify CUDA installation
ansible.builtin.command: nvidia-smi
register: nvidia_smi_output
changed_when: false

- name: Print nvidia-smi output
ansible.builtin.debug:
msg: "{{ nvidia_smi_output.stdout_lines }}"

- name: Verify PyTorch CUDA access
ansible.builtin.command:
cmd: "/opt/ml-env/bin/python -c \"import torch; print(torch.cuda.is_available(), torch.cuda.device_count())\""
register: torch_check
changed_when: false

- name: Fail if PyTorch cannot see CUDA
ansible.builtin.fail:
msg: "PyTorch CUDA check failed: {{ torch_check.stdout }}"
when: "'True' not in torch_check.stdout"

ArgoCD for GitOps Deployments

ArgoCD watches a Git repository and automatically applies changes to Kubernetes when the repo changes. This is "GitOps" - Git is the single source of truth for what should be running in production.

# argocd/applications/recommendation-model.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: recommendation-model
namespace: argocd
finalizers:
- resources-finalizer.argocd.argoproj.io
spec:
project: ml-serving

source:
repoURL: https://github.com/mycompany/ml-infrastructure
targetRevision: main
path: helm/model-serving

helm:
releaseName: recommendation-model
valueFiles:
- values.yaml
- values.production.yaml

destination:
server: https://kubernetes.default.svc
namespace: ml-serving

syncPolicy:
automated:
prune: true # delete resources removed from git
selfHeal: true # revert manual kubectl edits

syncOptions:
- CreateNamespace=true
- PrunePropagationPolicy=foreground
- RespectIgnoreDifferences=true

retry:
limit: 3
backoff:
duration: "30s"
factor: 2
maxDuration: "5m"

ignoreDifferences:
# Ignore the Deployment replica count - the HPA changes it dynamically
- group: apps
kind: Deployment
jsonPointers:
- /spec/replicas

revisionHistoryLimit: 10

Infrastructure Testing with Checkov

Checkov scans Terraform code for security misconfigurations before you apply it.

# Install checkov
pip install checkov

# Scan a Terraform directory
checkov -d ./terraform/environments/prod

# Example: will flag these common ML infrastructure issues:
# - CKV_AWS_79: Instance Metadata Service v1 enabled (fixed with http_tokens = "required")
# - CKV_AWS_88: EC2 instance has a public IP
# - CKV_AWS_20: S3 bucket allows public READ access
# - CKV_AWS_21: S3 bucket does not have versioning enabled
# - CKV_AWS_19: S3 bucket is not encrypted at rest

# tests/infrastructure/test_terraform.py
# Terratest-style infrastructure tests in Python (using pytest)
# These run against a temporary test environment

import pytest
import subprocess
import boto3
import json


@pytest.fixture(scope="session")
def terraform_output():
"""Apply Terraform to a test environment and return outputs."""
# Init first (workspace commands need an initialized backend), then use a
# test-specific workspace to avoid touching production state
subprocess.run(
["terraform", "init"],
cwd="terraform/environments/staging",
check=True,
)

subprocess.run(
["terraform", "workspace", "new", "test-ci"],
cwd="terraform/environments/staging",
capture_output=True,  # ok if the workspace already exists
)
subprocess.run(
["terraform", "workspace", "select", "test-ci"],
cwd="terraform/environments/staging",
check=True,
)

subprocess.run(
["terraform", "apply", "-auto-approve",
"-var", "environment=test-ci",
"-var", "num_nodes=1"],
cwd="terraform/environments/staging",
check=True,
)

result = subprocess.run(
["terraform", "output", "-json"],
cwd="terraform/environments/staging",
capture_output=True, text=True, check=True,
)

yield json.loads(result.stdout)

# Cleanup after tests
subprocess.run(
["terraform", "destroy", "-auto-approve",
"-var", "environment=test-ci"],
cwd="terraform/environments/staging",
check=True,
)


def test_s3_bucket_encrypted(terraform_output):
"""Verify that the model artifacts S3 bucket has encryption enabled."""
s3 = boto3.client("s3")
bucket = terraform_output["model_artifact_bucket"]["value"]

response = s3.get_bucket_encryption(Bucket=bucket)
rules = response["ServerSideEncryptionConfiguration"]["Rules"]

assert len(rules) > 0, "Bucket has no encryption rules"
assert any(
r["ApplyServerSideEncryptionByDefault"]["SSEAlgorithm"] == "aws:kms"
for r in rules
), "Bucket is not encrypted with KMS"


def test_eks_cluster_private_endpoint(terraform_output):
"""Verify that the EKS cluster has private endpoint access enabled."""
eks = boto3.client("eks")
cluster_name = terraform_output["cluster_name"]["value"]

response = eks.describe_cluster(name=cluster_name)
vpc_config = response["cluster"]["resourcesVpcConfig"]

assert vpc_config["endpointPrivateAccess"], \
"EKS cluster does not have private endpoint access"

IaC Architecture for ML

Production Engineering Notes

State file security: The Terraform state file contains sensitive information including resource IDs, IP addresses, and sometimes plaintext secrets. Always store it encrypted in an S3 bucket with server-side encryption (KMS), enable versioning for rollback, and use DynamoDB for state locking to prevent concurrent applies from corrupting it.
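
A hedged sketch of hardening the state bucket with the AWS CLI (the bucket name matches the example backend above; adjust to your own):

# Versioning allows recovery of an older state file after a bad apply
aws s3api put-bucket-versioning \
  --bucket my-terraform-state-bucket \
  --versioning-configuration Status=Enabled

# Default KMS encryption for every state object written to the bucket
aws s3api put-bucket-encryption \
  --bucket my-terraform-state-bucket \
  --server-side-encryption-configuration \
  '{"Rules":[{"ApplyServerSideEncryptionByDefault":{"SSEAlgorithm":"aws:kms"}}]}'

# Block any form of public access to the state bucket
aws s3api put-public-access-block \
  --bucket my-terraform-state-bucket \
  --public-access-block-configuration \
  BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true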

Cost estimation before applying: Use infracost to estimate the monthly cost of a Terraform plan before applying it. A terraform apply that adds 16 p4d.24xlarge instances commits you to several hundred thousand dollars per month at on-demand rates. Always review cost estimates in code review for large infrastructure changes.
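
A minimal sketch of the CLI usage, assuming the environment directory layout used earlier in this lesson:

# Monthly cost estimate for everything the configuration would create
infracost breakdown --path environments/prod

# Cost delta introduced by uncommitted changes, suitable for a PR comment
infracost diff --path environments/prod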

Kustomize vs Helm: Helm uses Go templating (powerful but complex). Kustomize uses strategic merge patches (simpler, no templating). For simple applications with few configuration differences between environments, Kustomize is easier. For complex applications with many configurable parameters (like a model server with autoscaling, GPU resource requests, custom metrics), Helm's templating is worth the complexity.
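
For comparison, rendering the same service with each tool might look like this (paths are illustrative; the Kustomize overlay directory is hypothetical):

# Helm: render templates with environment-specific values
helm template recommendation-model ./helm/model-serving \
  -f helm/model-serving/values.yaml \
  -f helm/model-serving/values.production.yaml

# Kustomize: render a base plus an overlay of strategic merge patches
kubectl kustomize k8s/overlays/production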

Multi-cloud ML infrastructure: Training on AWS, serving on GCP is a legitimate architecture when different clouds have different strengths or prices for different GPU types. Terraform is genuinely multi-cloud - you can use both aws and google providers in the same configuration. The main complexity is networking: you need either a VPN gateway or a dedicated interconnect between clouds.

Drift detection: Real infrastructure drifts from the IaC definition when engineers make manual changes via the console. Use terraform plan in a scheduled CI job (without applying) to detect drift. ArgoCD has built-in drift detection for Kubernetes resources - it shows you exactly which fields have been manually changed.
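
A sketch of a scheduled drift check, assuming a CI job with cloud credentials and a logged-in argocd CLI:

# Nightly CI job: exit code 2 from -detailed-exitcode means the plan found drift
terraform plan -detailed-exitcode -input=false -out=drift.tfplan
status=$?
if [ "$status" -eq 2 ]; then
  echo "Drift detected between AWS and Terraform state - review drift.tfplan"
  exit 1
fi

# Kubernetes-side equivalent: show what differs between Git and the live cluster
argocd app diff recommendation-model
argocd app get recommendation-model --refresh   # reports Synced / OutOfSync status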

:::danger Dangerous Patterns to Avoid

Do not commit secrets to Terraform code. Database passwords, API keys, and SSH private keys must never appear in .tf files. Use AWS Secrets Manager, GCP Secret Manager, or HashiCorp Vault, and reference secrets by ARN/path. The sensitive = true attribute in Terraform prevents values from appearing in plan output but does not prevent them from being stored in the state file.

Do not run terraform apply without terraform plan first. Always review the plan output, especially for destructive operations. A change to a resource that Terraform cannot update in-place (e.g., changing the instance type of a running EC2 instance) will cause Terraform to destroy and recreate the resource. Destroying a training node mid-run without a checkpoint costs you hours of GPU time.

:::
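
A minimal sketch of that plan-review-apply workflow (the grep pattern and -target address are illustrative):

# Save the plan, inspect destructive actions, then apply exactly what was reviewed
terraform plan -out=tfplan
terraform show tfplan | grep -E "must be replaced|will be destroyed"
terraform apply tfplan

# Narrow the blast radius when only one module should change
terraform plan -target=module.experiment_cluster -out=tfplan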

:::warning Common Pitfalls

Implicit provider dependencies in modules: If a Terraform module creates resources in multiple AWS regions (e.g., a replica bucket in us-west-2 when your primary is in us-east-1), you need to explicitly pass provider configurations to the module. Forgetting this causes confusing errors where Terraform tries to create a resource in the wrong region.

Helm values file merge order: When you pass multiple --values files to helm install, later files override earlier ones. A common mistake is loading values.production.yaml before values.yaml, so the production values end up overridden by the defaults. Always verify the merge order with helm template before deploying (see the sketch after this callout).

ArgoCD sync waves: When deploying multiple interdependent resources (e.g., a database migration job must complete before the model server starts), use ArgoCD sync waves (argocd.argoproj.io/sync-wave: "1" annotations) to control ordering. Without this, ArgoCD may start the model server before the database is ready.

:::
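
A quick way to check the values merge order for the chart in this lesson is to render it locally; a sketch, assuming the values files shown earlier:

# Render locally with the same -f order used at deploy time; later files win
helm template recommendation-model ./helm/model-serving \
  -f helm/model-serving/values.yaml \
  -f helm/model-serving/values.production.yaml \
  | grep -E "replicas:|image:"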

Interview Questions and Answers

Q1: What is the difference between Terraform and Ansible, and when would you use each for ML infrastructure?

A: Terraform is an infrastructure provisioning tool - it creates, modifies, and destroys cloud resources (EC2 instances, S3 buckets, VPCs, EKS clusters) using a declarative model. Ansible is a configuration management tool - it configures the OS and software on already-provisioned machines (installing CUDA, setting up Python environments, mounting NFS volumes). The typical ML workflow is: use Terraform to create the GPU instances, then use Ansible to configure them with CUDA, NCCL, and your ML framework. You would not use Terraform to install software packages, and you would not use Ansible to create cloud resources (it can, but it lacks Terraform's state management and is procedural rather than declarative). Ansible is also useful for tasks that Terraform cannot do: running commands on existing machines, configuring application settings, or performing operational tasks like restarting services.
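
A sketch of that two-phase workflow, assuming the Terraform environment and Ansible playbook from this lesson (the inventory path is hypothetical):

# 1. Provision the cloud resources (GPU instances, security groups, IAM)
terraform -chdir=environments/prod apply

# 2. Configure the OS on the freshly created nodes (CUDA, NCCL, Python env)
ansible-playbook -i inventory/gpu_nodes.ini ansible/playbooks/setup_gpu_node.yaml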

Q2: What is Terraform state and why does it need to be stored remotely for teams?

A: Terraform state is a JSON file that maps the resources in your Terraform configuration to real cloud resources. It tracks which resource in your .tf file corresponds to which EC2 instance ID in AWS. Without state, Terraform would have no way to know that the aws_instance.gpu_node resource in your code already exists as i-0abc123456 in AWS - it would try to create a new instance every time you run apply. State must be stored remotely (in S3 + DynamoDB) for teams because: (1) multiple engineers need to see the same state, (2) the DynamoDB lock prevents two engineers from running apply simultaneously (which would corrupt the state), and (3) local state on a developer's laptop would be lost if the laptop is lost or reset. Sensitive data in the state file (resource IDs, connection strings) means the state backend itself needs to be encrypted and access-controlled.
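
A few state-inspection commands that come up in practice (the resource address is illustrative; <LOCK_ID> comes from the error message Terraform prints when it cannot acquire the lock):

# Inspect what Terraform is tracking and where it lives in the cloud
terraform state list
terraform state show 'module.training_cluster.aws_autoscaling_group.gpu_nodes'

# Keep a local backup before risky state surgery
terraform state pull > state-backup.json

# Release a stale DynamoDB lock left behind by a crashed CI job
terraform force-unlock <LOCK_ID>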

Q3: Explain Helm's role in deploying ML models to Kubernetes. What problems does it solve?

A: Deploying a production ML inference service to Kubernetes requires many Kubernetes resources: Deployment, Service, Ingress, HorizontalPodAutoscaler, ConfigMap, ServiceMonitor (for Prometheus), PodDisruptionBudget, and possibly VirtualService (for Istio). Writing these as raw YAML files means repeating the application name, image tag, and namespace dozens of times, and maintaining separate copies for each environment (staging, production). Helm solves this by: (1) templating - use {{ .Values.image.tag }} instead of hardcoding the image tag, and override it at deploy time; (2) packaging - bundle all related manifests into a single versioned chart; (3) release management - Helm tracks which version of the chart is deployed and provides helm rollback to revert to a previous version; (4) dependency management - a model serving chart can declare a dependency on the prometheus-operator chart. The tradeoff is that Helm's Go templating can become complex for large charts - Kustomize is simpler for applications with only minor environment-specific differences.

Q4: How would you handle spot instance interruptions in a Terraform-managed GPU training cluster?

A: At the infrastructure level: configure an ASG lifecycle hook that fires when an instance is about to be terminated, with a heartbeat timeout of 120 seconds (the maximum AWS gives for spot interruptions). This sends a message to an SQS queue. At the application level: run a sidecar process on each training node that polls the EC2 instance metadata endpoint for spot/instance-action every 5 seconds. When it detects the interruption notice, it signals the training process (via SIGUSR1 or a shared file) to save an emergency checkpoint immediately. The training code itself should save checkpoints periodically (every N minutes) so that at most N minutes of computation is lost when interrupted. After the interrupt, the ASG requests a new spot instance and training resumes from the last checkpoint. This pattern is why DVC or S3 checkpoint uploads should use atomic operations - never delete the old checkpoint until the new one is fully uploaded.

Q5: What is GitOps and how does ArgoCD implement it for ML deployments?

A: GitOps is the practice of using Git as the single source of truth for all infrastructure and application configuration. Every desired state change goes through a Git commit and pull request. ArgoCD implements GitOps for Kubernetes by continuously watching a Git repository and comparing the live cluster state to the desired state in Git. When a difference is detected, ArgoCD can automatically sync (apply the desired state) or alert an operator. For ML deployments, the workflow is: a CI pipeline builds a new container image and updates the image tag in the Helm values file via a Git commit. ArgoCD detects the commit, diffs the new desired state against the current cluster state, and applies the change. This creates a complete audit trail: every deployment in the cluster can be traced back to a specific Git commit by a specific engineer at a specific time. The selfHeal feature means that manual kubectl edit changes to production are automatically reverted - production always reflects what is in Git.
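
A sketch of the CI-side half of that loop, assuming yq v4 and a configured argocd CLI (file paths and the GIT_SHA variable are illustrative):

GIT_SHA=$(git rev-parse --short HEAD)

# Point the Helm values at the freshly built image and commit the change
yq -i ".image.tag = \"${GIT_SHA}\"" helm/model-serving/values.production.yaml
git commit -am "Deploy model-server ${GIT_SHA}"
git push origin main

# ArgoCD picks up the commit on its next poll; optionally force a sync and wait
argocd app sync recommendation-model
argocd app wait recommendation-model --health --timeout 300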

Q6: How does Terraform handle the problem of dependent resources that need to be created in a specific order?

A: Terraform automatically determines creation order from the dependency graph. If resource B references resource A (e.g., an EC2 instance's subnet_id references an aws_subnet.id), Terraform knows to create the subnet before the instance. Explicit dependencies can be declared with depends_on = [aws_resource.name] when the dependency is not expressed through attribute references. For cross-module dependencies, you pass module outputs as inputs to other modules, which creates the dependency relationship. The terraform plan command shows the dependency-resolved creation order. For the common ML case of "create the EKS cluster, then install the NVIDIA device plugin, then deploy the ML workloads," you would use three separate Terraform applies (or apply steps in a pipeline) rather than trying to do it all in one - because the NVIDIA device plugin Helm chart requires a running EKS cluster, which terraform apply cannot know is fully ready until all pods are healthy.
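
A sketch of that staged approach (directory, cluster name, and namespaces are illustrative; the device plugin chart comes from NVIDIA's Helm repo):

# Inspect the dependency graph Terraform derived from attribute references
terraform graph | dot -Tsvg > graph.svg

# Stage 1: the EKS cluster itself
terraform -chdir=environments/prod apply

# Stage 2: cluster add-ons, once the control plane is reachable
aws eks update-kubeconfig --name <cluster-name>
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm install nvidia-device-plugin nvdp/nvidia-device-plugin --namespace kube-system

# Stage 3: the ML workloads
helm install recommendation-model ./helm/model-serving --namespace ml-serving --create-namespace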

Q7: What is infrastructure drift and how would you detect it in a production ML environment?

A: Infrastructure drift occurs when the real cloud resources diverge from what the IaC code describes. This typically happens when engineers make manual changes via the AWS console or CLI, when cloud resources change automatically (auto-scaling events, automatic OS patch updates), or when a partial terraform apply leaves resources in an inconsistent state. Detection approaches: (1) Run terraform plan on a schedule (daily, in CI) and alert if the plan shows any changes other than expected auto-generated changes. (2) Use ArgoCD's sync status - it continuously compares live Kubernetes state to Git and marks applications as "OutOfSync" immediately when drift is detected. (3) Tools like driftctl can detect cloud resources that exist but are not managed by any Terraform state (so-called "unmanaged" resources). For ML infrastructure, drift is particularly risky because a manually changed NCCL configuration or a manually added security group rule can silently degrade training performance or open security holes without triggering any alert.
