IaC for ML Teams
The GPU Cluster Cost Explosion
It is Tuesday afternoon and the quarterly infrastructure review has just started. Last month's cloud bill is on the screen: $210,000. The month before, it was $180,000. The CTO wants to know what happened.
The ML platform team starts digging. They find eight GPU clusters running in AWS, including three p3.16xlarge clusters that nobody recognizes. When they SSH into those three, they find half-finished experiments from three different teams. One cluster has been running since a hackathon six weeks ago. Another was spun up for a benchmark that finished in two days but nobody remembered to tear it down. A third appears to have been created by a contractor who left the company a month ago.
The real problem surfaces as they try to investigate: nobody can tell you exactly what is running or why. The clusters were created through the AWS Console, configured by hand, and the only record of their existence is in the billing dashboard. There is no audit trail of who created them. There is no way to know if the CUDA version matches what the research team uses locally. There is no automated process to shut them down when idle.
The team that built the most expensive cluster tries to recreate it in a new region for a customer demo. Two engineers spend a day and a half clicking through the AWS Console, SSH-ing into machines, running install scripts from memory, and debugging CUDA driver conflicts. When they finally have it working, it is subtly different from the original in ways they cannot fully articulate. The demo runs - barely.
This is not an unusual story. It is the default state for ML teams that have not adopted Infrastructure as Code. And it gets worse as the team grows. Every engineer develops their own cluster setup scripts. Every project uses a slightly different environment. The phrase "it works on my cluster" becomes a punchline that nobody finds funny.
IaC is the practice of defining infrastructure in version-controlled, machine-readable configuration files and using automated tools to apply those definitions. It does not just solve the cost problem - it solves the reproducibility problem, the audit problem, the collaboration problem, and the onboarding problem simultaneously.
Why This Exists - The Problem Before IaC
Before IaC tools became mainstream, infrastructure was managed through one of three approaches: clicking through cloud consoles, running shell scripts, or using configuration management tools like Chef and Puppet.
The ClickOps problem: Console-created infrastructure leaves no record. You cannot diff it, version it, review it in a PR, or reproduce it exactly. When something breaks, you cannot answer "what changed?" When the bill spikes, you cannot answer "who created what?" When a new team member needs the same environment, they get a two-hour verbal walkthrough that produces something similar but not identical.
The shell script problem: Scripts are better than ClickOps because they are repeatable, but they are not idempotent. Running a script twice may create duplicate resources, fail with "resource already exists" errors, or do nothing at all depending on how defensive the author was. Scripts encode how to create infrastructure, not what the desired state is. When infrastructure drifts from the script (someone tweaked a setting in the console), the script cannot detect or fix the drift.
The config management problem: Tools like Ansible, Chef, and Puppet solve the server configuration problem well. But they were designed primarily for configuring servers that already exist, not for provisioning cloud resources. Cloud resource lifecycle - creating VPCs, managing IAM roles, or handling resource dependencies across services - is not their core model.
The insight behind modern IaC tools like Terraform was: describe the desired end state, not the steps to get there. The tool compares the desired state against the current state and figures out the minimal set of changes to apply. This is declarative infrastructure, and it changes everything.
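The compare-and-reconcile idea can be sketched in a few lines of Python. This is a toy illustration, not how Terraform is implemented: the `plan` function, the resource names, and the dictionary shapes are all hypothetical.

```python
from typing import Any, Dict, List, Tuple

Resource = Dict[str, Any]


def plan(desired: Dict[str, Resource], current: Dict[str, Resource]) -> List[Tuple[str, str]]:
    """Return the (action, name) steps needed to move `current` to `desired`."""
    steps: List[Tuple[str, str]] = []
    # Declared but missing in reality: create.
    for name in sorted(desired.keys() - current.keys()):
        steps.append(("create", name))
    # Present in both but with different attributes: update.
    for name in sorted(desired.keys() & current.keys()):
        if desired[name] != current[name]:
            steps.append(("update", name))
    # Running in reality but no longer declared: delete.
    for name in sorted(current.keys() - desired.keys()):
        steps.append(("delete", name))
    return steps


# Hypothetical desired state (what the config files say).
desired = {
    "training_server": {"instance_type": "p3.2xlarge"},
    "data_bucket": {"versioning": True},
}
# Hypothetical current state (what is actually running).
current = {
    "training_server": {"instance_type": "p3.8xlarge"},     # resized by hand
    "hackathon_cluster": {"instance_type": "p3.16xlarge"},  # forgotten
}

for action, name in plan(desired, current):
    print(f"{action:7} {name}")
```

The author never writes "create", "update", or "delete". Those steps fall out of the comparison, and when the two states already match the plan is empty.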
Historical Context - The Declarative Infrastructure Insight
The concept of declarative infrastructure predates Terraform. CFEngine (1993) introduced the idea of specifying desired system state rather than imperatives. Puppet (2005) and Chef (2009) brought this to configuration management at scale. But all of these tools focused on servers that already existed.
The modern IaC era began with AWS CloudFormation (2011), which introduced declarative templates for cloud resources. CloudFormation proved the concept but had serious usability problems: templates grew to thousands of lines of JSON, the feedback loop was slow, and the tool was AWS-only.
HashiCorp released Terraform in 2014 with a key insight: a provider plugin system could abstract across every cloud provider with a consistent workflow. The same terraform apply command works for AWS, GCP, Azure, Kubernetes, and dozens of other platforms. HCL (HashiCorp Configuration Language) was designed to be more readable than JSON or YAML while remaining declarative.
Pulumi followed in 2018 with a different bet: instead of a domain-specific language, use real programming languages (Python, TypeScript, Go). This turned out to be particularly valuable for ML teams who were already fluent in Python and found HCL limiting for complex logic.
Core Concepts
Declarative vs Imperative IaC
The fundamental divide in IaC is between declarative and imperative approaches.
Declarative: You describe the desired state. The tool calculates what needs to change.
# Terraform (declarative) - describe what you want
resource "aws_instance" "training_server" {
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "p3.2xlarge"
  tags = {
    Name    = "ml-training-server"
    Team    = "ml-platform"
    Project = "recommendation-v2"
  }
}
Imperative: You describe the steps to get to the desired state.
#!/bin/bash
# Shell script (imperative) - describe how to get there
# Check if a live instance exists first (error-prone).
# Terminated instances stay visible for a while, so filter on state too.
INSTANCE_ID=$(aws ec2 describe-instances \
  --filters "Name=tag:Name,Values=ml-training-server" \
            "Name=instance-state-name,Values=pending,running,stopping,stopped" \
  --query "Reservations[0].Instances[0].InstanceId" \
  --output text)
if [ "$INSTANCE_ID" = "None" ]; then
  aws ec2 run-instances \
    --image-id ami-0c55b159cbfafe1f0 \
    --instance-type p3.2xlarge \
    --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=ml-training-server}]'
else
  echo "Instance already exists: $INSTANCE_ID"
fi
The declarative version is shorter and handles the "already exists" case automatically. More importantly, if someone modifies the instance type through the console, terraform plan will detect the drift and show you what changed.
Idempotency in ML Infrastructure
Idempotency means that applying the same configuration multiple times produces the same result as applying it once. This property is essential for ML infrastructure for several reasons:
- Experiment reproducibility: Running terraform apply on Monday and again on Thursday should produce identical infrastructure. If it does not, your experiments are not reproducible.
- CI/CD safety: Your CI pipeline should be able to run terraform apply on every merge without fear of duplicating resources or causing race conditions.
- Disaster recovery: If a region goes down, you should be able to recreate your entire ML infrastructure in a new region by running one command.
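To make the property concrete, here is a minimal Python sketch that contrasts a non-idempotent step with an idempotent one. The "cloud" is just a list, and the function names are made up for illustration.

```python
from typing import List


def create_instance(cloud: List[str], name: str) -> None:
    """Imperative step: always creates, so a second run duplicates the instance."""
    cloud.append(name)


def ensure_instance(cloud: List[str], name: str) -> None:
    """Idempotent step: converges on 'exactly one instance with this name'."""
    if name not in cloud:
        cloud.append(name)


scripted: List[str] = []
declared: List[str] = []

# Simulate a CI pipeline that runs on two consecutive merges.
for _ in range(2):
    create_instance(scripted, "ml-training-server")
    ensure_instance(declared, "ml-training-server")

print("scripted:", scripted)
print("declared:", declared)
```

After two runs the scripted environment holds two instances and the declared one still holds one. That difference is what makes re-running apply on every merge safe.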
IaC for ML-Specific Resources
ML teams work with a set of cloud resources that have unique characteristics compared to typical web application infrastructure. IaC for ML needs to handle:
GPU Instances: Different instance types have different GPU counts, memory, and networking. Getting this wrong wastes money or causes OOM errors.
# Define a map of training configurations
locals {
  training_configs = {
    small = {
      instance_type = "p3.2xlarge" # 1x V100 16GB
      gpu_count     = 1
    }
    medium = {
      instance_type = "p3.8xlarge" # 4x V100 64GB
      gpu_count     = 4
    }
    large = {
      instance_type = "p3.16xlarge" # 8x V100 128GB
      gpu_count     = 8
    }
  }
}
resource "aws_instance" "training_cluster" {
  for_each      = var.active_experiments
  ami           = data.aws_ami.deep_learning.id
  instance_type = local.training_configs[each.value.size].instance_type
  root_block_device {
    volume_size = 200 # Large root for model checkpoints
    volume_type = "gp3"
  }
  tags = {
    Name       = "training-${each.key}"
    Experiment = each.key
    AutoStop   = "true"
    CostCenter = var.cost_center
  }
}
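Choosing the size key for each experiment is where the money is won or lost. The Python sketch below picks the smallest configuration that meets a GPU memory requirement. It mirrors the sizes in the map above, but the `total_gpu_memory_gb` field and the `pick_size` helper are assumptions added for illustration. It also treats memory as a single pool, which only holds if the training job shards across GPUs.

```python
# Mirrors the Terraform `training_configs` map; memory totals come from its comments.
TRAINING_CONFIGS = {
    "small": {"instance_type": "p3.2xlarge", "gpu_count": 1, "total_gpu_memory_gb": 16},
    "medium": {"instance_type": "p3.8xlarge", "gpu_count": 4, "total_gpu_memory_gb": 64},
    "large": {"instance_type": "p3.16xlarge", "gpu_count": 8, "total_gpu_memory_gb": 128},
}


def pick_size(required_gpu_memory_gb: float) -> str:
    """Return the smallest size whose total GPU memory covers the requirement."""
    by_memory = sorted(
        TRAINING_CONFIGS.items(),
        key=lambda item: item[1]["total_gpu_memory_gb"],
    )
    for size, config in by_memory:
        if config["total_gpu_memory_gb"] >= required_gpu_memory_gb:
            return size
    raise ValueError(
        f"no configured size offers {required_gpu_memory_gb} GB of GPU memory"
    )


for need in (12, 40, 128):
    size = pick_size(need)
    print(need, "GB ->", size, TRAINING_CONFIGS[size]["instance_type"])
```

A helper like this could generate the `size` value for each entry in `var.active_experiments`, so nobody defaults to the largest instance out of habit.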
S3 Data Lakes: ML data lakes have specific access patterns (large sequential reads), lifecycle policies (move old training data to Glacier), and versioning requirements (reproducibility).
resource "aws_s3_bucket" "ml_data_lake" {
  bucket = "${var.org_name}-ml-data-lake-${var.environment}"
}
resource "aws_s3_bucket_versioning" "ml_data_lake" {
  bucket = aws_s3_bucket.ml_data_lake.id
  versioning_configuration {
    status = "Enabled" # Required for dataset reproducibility
  }
}
resource "aws_s3_bucket_lifecycle_configuration" "ml_data_lake" {
  bucket = aws_s3_bucket.ml_data_lake.id
  # Noncurrent-version rules need versioning to be enabled first
  depends_on = [aws_s3_bucket_versioning.ml_data_lake]
  rule {
    id     = "raw_data_lifecycle"
    status = "Enabled"
    filter {
      prefix = "raw/"
    }
    # Move raw data to cheaper storage after 90 days
    transition {
      days          = 90
      storage_class = "STANDARD_IA"
    }
    # Archive after 1 year
    transition {
      days          = 365
      storage_class = "GLACIER"
    }
  }
  rule {
    id     = "model_artifacts_lifecycle"
    status = "Enabled"
    filter {
      prefix = "models/"
    }
    # Keep model artifacts in standard storage (fast access needed)
    # Delete non-current versions after 30 days (keep latest)
    noncurrent_version_expiration {
      noncurrent_days = 30
    }
  }
}
Managed ML Services: SageMaker, Vertex AI, and Azure ML have their own resource types that require specific IaC configuration.
# SageMaker domain for a team
resource "aws_sagemaker_domain" "ml_team" {
  domain_name = "ml-team-${var.environment}"
  auth_mode   = "IAM"
  vpc_id      = var.vpc_id
  subnet_ids  = var.private_subnet_ids
  default_user_settings {
    execution_role = aws_iam_role.sagemaker_execution.arn
    jupyter_server_app_settings {
      default_resource_spec {
        instance_type = "system"
        # sagemaker_image_arn expects a SageMaker image ARN, not an ECR registry path.
        # var.jupyter_server_image_arn is a placeholder for the image ARN in your region.
        sagemaker_image_arn = var.jupyter_server_image_arn
      }
    }
    kernel_gateway_app_settings {
      default_resource_spec {
        instance_type       = "ml.g4dn.xlarge" # GPU for notebooks
        sagemaker_image_arn = var.data_science_image_arn # Placeholder, as above
      }
    }
  }
}
Infrastructure Drift Detection
Drift occurs when the actual state of your infrastructure diverges from what your IaC definitions say it should be. This happens when someone makes a "quick fix" in the console, when automated processes modify resources, or when cloud providers update resource properties.
For ML teams, drift is particularly dangerous because:
- A GPU driver update made by hand on an instance is invisible to the next team member who runs terraform apply
- Security groups modified manually may get reverted, breaking training jobs
- S3 bucket policies changed outside IaC may expose sensitive training data
# Detect drift in your ML infrastructure
terraform plan -detailed-exitcode
# Exit code 0: no changes needed
# Exit code 1: error
# Exit code 2: changes needed (drift, or config changes not yet applied)
# For automated drift detection in CI
terraform plan \
  -detailed-exitcode \
  -out=tfplan \
  2>&1 | tee plan_output.txt
# $? would report tee's status here; PIPESTATUS[0] is terraform's (bash only)
EXIT_CODE=${PIPESTATUS[0]}
if [ "$EXIT_CODE" -eq 2 ]; then
  echo "DRIFT DETECTED - infrastructure has diverged from IaC definition"
  # Send alert to Slack, PagerDuty, etc.
  curl -X POST "$SLACK_WEBHOOK" \
    -H 'Content-type: application/json' \
    -d '{"text":"🚨 Infrastructure drift detected in ML platform. Run: terraform plan"}'
fi
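Teams that schedule drift checks from Python rather than shell can wrap the same exit-code contract in a small function. The sketch below is one way to do it. The names `classify` and `check_drift` are made up, and the `run` parameter exists only so the function can be exercised without Terraform installed.

```python
import subprocess
from typing import Callable

# -detailed-exitcode makes `terraform plan` return 0, 1, or 2 as described above.
PLAN_CMD = ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"]


def classify(exit_code: int) -> str:
    """Map a `terraform plan -detailed-exitcode` status to a label."""
    if exit_code == 0:
        return "in_sync"
    if exit_code == 2:
        return "changes_pending"
    return "error"


def check_drift(workdir: str, run: Callable = subprocess.run) -> str:
    """Run the plan in `workdir` and report whether reality matches the config."""
    result = run(PLAN_CMD, cwd=workdir, capture_output=True, text=True)
    return classify(result.returncode)
```

Because the exit status is read straight from the completed process, there is no pipeline to mask it. A scheduled job could call `check_drift("infrastructure/")` and alert whenever the result is `changes_pending`.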
Version Controlling Infrastructure Alongside Model Code
The most important organizational practice in IaC for ML is keeping infrastructure definitions in the same repository as (or at least alongside) the model code that depends on it. This creates traceability: you can look at any commit in your model training code and find the exact infrastructure it requires.
ml-project/
├── training/
│   ├── train.py
│   ├── model.py
│   └── requirements.txt
├── serving/
│   ├── serve.py
│   └── Dockerfile
├── infrastructure/
│   ├── main.tf          # Training cluster definition
│   ├── variables.tf     # Configurable parameters
│   ├── outputs.tf       # Exported values (bucket names, endpoints)
│   └── modules/
│       ├── training-cluster/
│       └── serving-endpoint/
├── .gitlab-ci.yml       # CI/CD pipeline
└── README.md
When you open a PR that changes the model architecture, the reviewer can see whether the infrastructure change (perhaps a larger GPU instance or more memory) is also included. When the PR is merged, the CI pipeline can apply the model code changes and the infrastructure changes together, from the same commit.
IaC in ML CI/CD Pipelines
The IaC workflow integrates naturally into ML CI/CD. The key is separating the plan step (read-only, shows what will change) from the apply step (makes the changes), and requiring human review between them for significant changes.
# .gitlab-ci.yml - IaC workflow for ML infrastructure
stages:
  - validate
  - plan
  - apply
variables:
  TF_VERSION: "1.7.0"
  AWS_DEFAULT_REGION: "us-east-1"
default:
  image:
    name: hashicorp/terraform:$TF_VERSION
    # The image's entrypoint is `terraform`; clear it so script lines run in a shell
    entrypoint: [""]
terraform_validate:
  stage: validate
  script:
    - cd infrastructure/
    - terraform init -backend=false
    - terraform validate
    - terraform fmt -check
  rules:
    - changes:
        - infrastructure/**/*
terraform_plan:
  stage: plan
  script:
    - cd infrastructure/
    - terraform init
    - terraform plan -out=tfplan -var="environment=staging"
    - terraform show -no-color tfplan > plan_output.txt
    - cat plan_output.txt
  artifacts:
    paths:
      - infrastructure/tfplan
      - infrastructure/plan_output.txt
    expire_in: 1 day # Must outlive the wait for manual approval
  environment:
    name: staging
    action: prepare
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
      changes:
        - infrastructure/**/*
    # Also plan on main so terraform_apply has a tfplan artifact to apply
    - if: $CI_COMMIT_BRANCH == "main"
      changes:
        - infrastructure/**/*
terraform_apply:
  stage: apply
  script:
    - cd infrastructure/
    - terraform init
    - terraform apply tfplan
  dependencies:
    - terraform_plan
  environment:
    name: staging
    action: start
  rules:
    - if: $CI_COMMIT_BRANCH == "main"
      changes:
        - infrastructure/**/*
      when: manual # Require human approval for apply
:::tip Plan as Code Review
Treat terraform plan output like a code review. Before merging any infrastructure change, a second engineer should read the plan output and confirm that the changes are intentional, that no unexpected resources are being destroyed, and that costs are acceptable.
:::
Production Engineering Notes
State Management
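The single most dangerous aspect of Terraform for teams is state management. The state file is Terraform's record of what it has created. If it gets corrupted, deleted, or out of sync, Terraform loses track of which real resources it owns and may try to recreate resources that already exist or leave running ones unmanaged.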
Always use remote state for team use:
# backend.tf - Remote state configuration
terraform {
  backend "s3" {
    bucket  = "my-org-terraform-state"
    key     = "ml-platform/production/terraform.tfstate"
    region  = "us-east-1"
    encrypt = true
    # DynamoDB for state locking (prevents concurrent applies)
    dynamodb_table = "terraform-state-locks"
  }
}
Never run terraform apply concurrently. State locking with DynamoDB prevents this for Terraform, but you should also enforce it in your CI pipeline by using serialized pipeline stages or distributed locks.
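The locking behaviour can be pictured as a conditional write: a writer gets the lock only if no lock record exists for that state key. The Python sketch below models this with an in-memory dictionary. It is a simplified illustration, not the backend's actual code, and the class and method names are hypothetical. A real lock table performs the check-and-write atomically on the server side, which a plain dictionary does not.

```python
from typing import Dict, Optional


class StateLockTable:
    """In-memory stand-in for a lock table keyed by state path."""

    def __init__(self) -> None:
        self._holders: Dict[str, str] = {}

    def acquire(self, state_key: str, owner: str) -> bool:
        """Take the lock only if nobody holds it (a conditional write)."""
        if state_key in self._holders:
            return False
        self._holders[state_key] = owner
        return True

    def release(self, state_key: str, owner: str) -> bool:
        """Release the lock only if `owner` is the current holder."""
        if self._holders.get(state_key) != owner:
            return False
        del self._holders[state_key]
        return True

    def holder(self, state_key: str) -> Optional[str]:
        """Return who holds the lock, or None if it is free."""
        return self._holders.get(state_key)


locks = StateLockTable()
key = "ml-platform/production/terraform.tfstate"

first = locks.acquire(key, "ci-job-101")   # first apply takes the lock
second = locks.acquire(key, "ci-job-102")  # concurrent apply is refused
print(first, second, locks.holder(key))

locks.release(key, "ci-job-101")
retry = locks.acquire(key, "ci-job-102")   # succeeds once the lock is free
print(retry, locks.holder(key))
```

The refused job should fail fast or wait and retry. It should never proceed without the lock, which is why CI-level serialization is a useful second layer.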
Environment Separation
Use Terraform workspaces or separate state files to keep dev, staging, and production environments isolated:
# Workspace approach
terraform workspace new development
terraform workspace new staging
terraform workspace new production
# Apply to a specific environment
terraform workspace select staging
terraform apply -var-file="environments/staging.tfvars"
# environments/staging.tfvars
environment           = "staging"
gpu_instance_type     = "g4dn.xlarge" # Smaller GPUs for staging
training_cluster_size = 2             # Fewer nodes than prod
s3_bucket_name        = "ml-data-staging"
Tagging Strategy
Every ML infrastructure resource should be tagged for cost attribution, security, and lifecycle management:
# locals.tf - Common tags applied to all resources
locals {
  common_tags = {
    Environment = var.environment
    Team        = var.team_name
    Project     = var.project_name
    CostCenter  = var.cost_center
    ManagedBy   = "terraform"
    Repository  = var.git_repository
    CreatedBy   = var.creator_id
    AutoStop    = var.auto_stop_enabled
  }
}
# Apply common tags to all resources
resource "aws_instance" "training" {
  # ... instance config ...
  tags = merge(local.common_tags, {
    Name     = "training-${var.project_name}"
    Role     = "training"
    GPUCount = "4"
  })
}
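A tagging strategy only helps if it is enforced. The Python sketch below shows one possible check that a CI job or audit script could run against each resource's tags. The required keys mirror `common_tags` above, while the helper names and sample values are made up for illustration.

```python
from typing import Dict, Set

# Keys mirror the `common_tags` local in the Terraform example.
REQUIRED_TAGS = {
    "Environment", "Team", "Project", "CostCenter",
    "ManagedBy", "Repository", "CreatedBy", "AutoStop",
}


def merge_tags(common: Dict[str, str], specific: Dict[str, str]) -> Dict[str, str]:
    """Combine tags the way Terraform's merge() does: later maps win on conflict."""
    return {**common, **specific}


def missing_tags(tags: Dict[str, str]) -> Set[str]:
    """Return required keys that are absent or have an empty value."""
    return {key for key in REQUIRED_TAGS if not tags.get(key)}


# Hypothetical values for illustration only.
common = {
    "Environment": "staging",
    "Team": "ml-platform",
    "Project": "recommendation-v2",
    "CostCenter": "ml-research",
    "ManagedBy": "terraform",
    "Repository": "ml-project",
    "CreatedBy": "ci",
    "AutoStop": "true",
}
managed = merge_tags(common, {"Name": "training-recommendation-v2", "Role": "training"})
console_made = {"Name": "test-cluster"}  # typical hand-created resource

print(sorted(missing_tags(managed)))
print(sorted(missing_tags(console_made)))
```

Resources created through Terraform pass with nothing missing. A hand-created cluster fails on every cost-attribution key, which flags exactly the kind of unowned resource from the opening story.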
Common Mistakes
:::danger Never Store State Files in Git
The Terraform state file contains sensitive information including resource IDs, sometimes passwords, and the full structure of your infrastructure. Committing it to git exposes this information and causes conflicts when multiple engineers work simultaneously. Always use remote state (S3 + DynamoDB for AWS, GCS for GCP, Azure Blob for Azure).
:::
:::danger Never Run apply Without Reviewing plan First
The terraform apply command can destroy resources. Always run terraform plan first, read the output carefully, and confirm that no unexpected resources are being destroyed. In CI, enforce this with pipeline stages that require human approval.
:::
:::warning Beware of Force-Replace for Running Training Jobs
Changing certain resource properties (like an EC2 instance type) forces Terraform to destroy and recreate the resource. If a training job is running on that instance, it will be terminated. Use terraform plan to identify force-replaces (-/+) before applying, and schedule infrastructure changes for off-hours or after jobs complete.
:::
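Reading a long plan by eye for -/+ markers is easy to get wrong, so the check can be automated. The Python sketch below scans the JSON form of a saved plan (the output of terraform show -json tfplan) for resources that would be replaced or destroyed. The function name and the sample plan are hypothetical, and the sample includes only the fields the function reads.

```python
import json
from typing import Dict, List


def find_destructive_changes(plan_json: str) -> Dict[str, List[str]]:
    """List resource addresses that a plan would replace or destroy."""
    plan = json.loads(plan_json)
    replaced: List[str] = []
    destroyed: List[str] = []
    for change in plan.get("resource_changes", []):
        actions = change["change"]["actions"]
        # A replacement appears as delete+create (or create+delete).
        if "delete" in actions and "create" in actions:
            replaced.append(change["address"])
        elif actions == ["delete"]:
            destroyed.append(change["address"])
    return {"replaced": replaced, "destroyed": destroyed}


# Trimmed, hand-written sample in the shape of `terraform show -json` output.
sample_plan = json.dumps({
    "resource_changes": [
        {"address": "aws_instance.training", "change": {"actions": ["delete", "create"]}},
        {"address": "aws_s3_bucket.ml_data_lake", "change": {"actions": ["no-op"]}},
        {"address": "aws_instance.benchmark", "change": {"actions": ["delete"]}},
        {"address": "aws_security_group.training", "change": {"actions": ["update"]}},
    ]
})

print(find_destructive_changes(sample_plan))
```

A CI job could fail, or require an extra approval, whenever either list is non-empty for a resource tagged as hosting a running training job.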
:::warning IaC Does Not Manage Data
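Terraform manages infrastructure, not data, but its actions can still destroy data. By default, terraform destroy refuses to delete an S3 bucket that still contains objects; with force_destroy = true it deletes every object along with the bucket. Changing lifecycle rules also affects data: a new expiration rule applies to objects that already exist, not only to future uploads. Understand what terraform destroy will and will not do before running it on production data resources.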
:::
Interview Q&A
Q: What is the difference between declarative and imperative IaC, and which should you use for ML infrastructure?
A: Declarative IaC describes the desired end state - what you want the infrastructure to look like - and the tool figures out how to achieve it. Imperative IaC describes the steps to take. Terraform and Pulumi are declarative; Ansible playbooks are imperative (though Ansible has idempotent modules).
For ML infrastructure, declarative is strongly preferred because it provides idempotency (safe to run multiple times), drift detection (you can see when reality diverges from the definition), and self-documenting state (the config file IS the documentation of what exists). Where HCL becomes limiting is complex conditional logic - e.g., "if this model is in ONNX format, configure the serving layer differently". Pulumi addresses this by letting you write that logic in a real programming language while the engine still reconciles a declared desired state.
Q: How do you handle Terraform state in a team of 10 ML engineers all deploying infrastructure?
A: Remote state with locking is mandatory. For AWS: S3 bucket for state storage (with versioning enabled for recovery), DynamoDB table for locking (prevents concurrent applies). For GCP: GCS bucket with object versioning. State is never committed to git. Each environment (dev/staging/prod) has its own state file at a separate key. CI/CD pipelines serialize applies using pipeline locking. Engineers run terraform plan locally but terraform apply only through CI after PR approval.
Q: A data scientist has been clicking around in the AWS Console and now your Terraform state is out of sync with reality. How do you fix this?
A: Three options depending on the situation. First, for resources that were created by hand, use terraform import to bring them under Terraform management - this adds them to state without destroying and recreating them - and write matching configuration. Second, if console changes to an already-managed resource were intentional, update the configuration to match and re-run terraform plan until it reports no changes; terraform apply -refresh-only records the real-world values in state without modifying infrastructure. Third, if the console changes are unwanted drift, review the plan and run terraform apply - Terraform will revert the drift to match the declared state. Prevention is better than cure: use IAM policies or AWS SCPs (Service Control Policies) to make human roles read-only in infrastructure accounts, so that all changes go through IaC.
Q: How do you ensure that GPU training clusters are automatically shut down when training jobs finish?
A: Multiple layers of defense. First, IaC defines Auto Scaling groups with scheduled scale-in at known idle times. Second, training job code emits a signal (CloudWatch event, SNS notification) when it completes, triggering a Lambda that sets the ASG desired capacity to zero. Third, CloudWatch alarms on GPU utilization (published as a custom metric, since EC2 does not report GPU utilization by default): if a GPU instance shows less than 5% utilization for 30 minutes, it triggers an alert and optionally auto-terminates. Fourth, Terraform sets force_delete = true on training Auto Scaling groups (and never force_destroy on data buckets) so they can be cleanly destroyed from CI when the experiment is complete. Fifth, tag all training instances with AutoStop=true and run a daily Lambda that checks for idle instances.
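The "less than 5% for 30 minutes" rule is simple enough to sketch directly. The Python function below is one hypothetical way a scheduled idle check might evaluate a series of utilization samples. The 5-minute sampling period and the function name are assumptions; the threshold and window come from the answer above.

```python
from typing import Sequence


def is_idle(
    utilization_pct: Sequence[float],
    threshold_pct: float = 5.0,
    window_minutes: int = 30,
    period_minutes: int = 5,
) -> bool:
    """True if every sample in the most recent window is below the threshold.

    `utilization_pct` is ordered oldest to newest, one sample per period.
    """
    needed = window_minutes // period_minutes
    if len(utilization_pct) < needed:
        # Not enough history yet: never stop a freshly launched instance.
        return False
    recent = utilization_pct[-needed:]
    return all(sample < threshold_pct for sample in recent)


finished_job = [92.0, 88.0, 3.0, 1.0, 0.0, 0.0, 0.0, 0.0]
data_loading_pause = [90.0, 2.0, 1.0, 85.0, 91.0, 3.0, 2.0, 88.0]

print(is_idle(finished_job))
print(is_idle(data_loading_pause))
```

Requiring the whole window to be below the threshold, rather than the average, keeps a job that briefly stalls on data loading from being terminated.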
Q: What is infrastructure drift and why is it particularly dangerous for ML teams?
A: Infrastructure drift is when the actual state of cloud resources diverges from what the IaC definition says they should be. It happens through console changes, automated cloud processes, or cloud provider updates. For ML teams, drift is dangerous because: (1) training reproducibility - if the CUDA version on a training server was manually updated, you cannot reproduce old results; (2) security - a security group opened for debugging may never get closed; (3) cost - resources created manually are invisible to cost attribution; (4) debugging - when a model starts performing differently, you cannot tell if the infrastructure changed. Automated drift detection (running terraform plan on a cron, alerting on exit code 2) closes this gap.
Q: How do you version infrastructure alongside ML model code?
A: Keep infrastructure definitions in the same monorepo as the model code, or link them explicitly through git tags or external references. The rule: given a git commit hash of model training code, you should be able to find the exact infrastructure it requires. Practically, this means an infrastructure/ directory in the model repo, with a README that says "to run training, apply the Terraform in this directory first." In CI, the pipeline that trains a model first applies the infrastructure, then runs training, then optionally tears down the training cluster (but not the data or model storage). Git tags like experiment-v2.3-infra document the infrastructure state at a point in time.
