
Terraform for ML Infrastructure

From Zero to ML Platform in One Command

The new ML platform engineer arrived on a Monday morning with a clear mandate: rebuild the company's training infrastructure. The old platform was a patchwork of manually created EC2 instances, S3 buckets with inconsistent naming, an MLflow server that "lives on someone's laptop," and a Kubernetes cluster that three people had touched and nobody fully understood. The documentation was a shared Google Doc with 47 comments, most of them questions that were never answered.

By Friday, she had written 1,800 lines of Terraform. By the following Monday, she ran terraform apply in the staging account and watched the entire ML platform materialize: VPC with private subnets, EKS cluster with GPU node groups, MLflow tracking server backed by RDS, model artifact storage in S3 with lifecycle policies, ECR repositories for training images, IRSA roles so pods could access S3 without static credentials. Everything tagged. Everything documented in code. Everything reproducible in a fresh AWS account in under twenty minutes.

The key was not starting from scratch - it was composing well-designed modules. The EKS module from the Terraform registry handled 90% of the Kubernetes cluster configuration. The VPC module gave her a production-grade network in fifteen lines. The custom modules she wrote for MLflow and the model registry were small and focused - each one doing one thing well.

When the infrastructure manager asked "how does this work?" she did not give him a tour of the AWS console. She handed him a GitHub link. Every resource was visible, searchable, and reviewable. When she needed to add a second node group for CPU-only jobs, it was a four-line change - reviewed in a PR, applied through CI, done in ten minutes. That is what Terraform for ML infrastructure looks like when done correctly.

This lesson walks you through building a complete MLOps platform with Terraform. Real modules, real configurations, and the patterns that separate maintainable infrastructure from another pile of technical debt.

:::tip 🎮 Interactive Playground
Visualize this concept: Try the Infrastructure as Code for ML demo on the EngineersOfAI Playground - no code required.
:::

Why Terraform for ML Infrastructure Specifically

ML infrastructure has characteristics that make IaC especially important. First, it is heterogeneous - GPU instances, distributed storage, container registries, databases, queues, monitoring systems, and Kubernetes all exist in the same platform. Tracking this manually is impossible. Second, it is expensive - a misconfigured GPU cluster can burn thousands of dollars per hour. Terraform's plan phase gives you a last-chance review before those costs materialize. Third, ML infrastructure is experimental - teams spin up new environments constantly (new experiment, new model architecture, new team). Terraform makes this a five-minute operation instead of a half-day one.

The Complete ML Platform Architecture

Module Structure for ML Platforms

ml-platform/
├── modules/
│   ├── networking/       # VPC, subnets, NAT, security groups
│   ├── eks-cluster/      # EKS control plane + node groups
│   ├── ml-storage/       # S3 buckets with lifecycle policies
│   ├── model-registry/   # S3 + RDS for artifact + metadata store
│   ├── mlflow/           # MLflow server on EKS or EC2
│   ├── feature-store/    # DynamoDB + Redis online feature store
│   ├── inference/        # KEDA + Cluster Autoscaler
│   ├── ecr/              # Container registries
│   └── irsa/             # IAM Roles for Service Accounts
├── environments/
│   ├── dev/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   └── terraform.tfvars
│   ├── staging/
│   └── prod/
├── versions.tf
└── backend.tf
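The tree lists versions.tf and backend.tf without showing them. A minimal sketch of what they might contain follows. The version constraints, bucket name, and state key are illustrative assumptions, not values from the original platform. The Helm provider is held on the 2.x line because the `set { ... }` block syntax used later in this lesson belongs to that major version.

```hcl
# versions.tf - Terraform and provider constraints (illustrative versions)
terraform {
  required_version = ">= 1.5.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
    helm = {
      source  = "hashicorp/helm"
      version = "~> 2.12" # 2.x keeps the `set { }` block syntax
    }
    tls = {
      source  = "hashicorp/tls"
      version = "~> 4.0"
    }
  }
}

# backend.tf - remote state in S3 with DynamoDB locking
terraform {
  backend "s3" {
    bucket         = "example-terraform-state" # hypothetical bucket name
    key            = "ml-platform/prod/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"
  }
}
```

Provider versions selected at `terraform init` are recorded in `.terraform.lock.hcl`, so committing that file keeps provider builds stable across machines even with range constraints.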

EKS Cluster Module

# modules/eks-cluster/main.tf

# Data sources
data "aws_caller_identity" "current" {}

locals {
  cluster_name = "${var.prefix}-eks"
}

# EKS Control Plane
resource "aws_eks_cluster" "this" {
  name     = local.cluster_name
  role_arn = aws_iam_role.cluster.arn
  version  = var.kubernetes_version

  vpc_config {
    subnet_ids              = var.private_subnet_ids
    endpoint_private_access = true
    endpoint_public_access  = var.enable_public_endpoint
    public_access_cidrs     = var.enable_public_endpoint ? var.public_access_cidrs : []
    # aws_security_group.cluster is assumed to be defined elsewhere in this module
    security_group_ids      = [aws_security_group.cluster.id]
  }

  enabled_cluster_log_types = [
    "api", "audit", "authenticator", "controllerManager", "scheduler"
  ]

  encryption_config {
    provider {
      key_arn = aws_kms_key.eks.arn
    }
    resources = ["secrets"]
  }

  depends_on = [
    aws_iam_role_policy_attachment.cluster_policy,
    aws_cloudwatch_log_group.eks,
  ]

  tags = var.tags
}

# KMS key for envelope encryption of K8s secrets
resource "aws_kms_key" "eks" {
  description             = "EKS secret encryption key for ${local.cluster_name}"
  deletion_window_in_days = 7
  enable_key_rotation     = true
  tags                    = var.tags
}

resource "aws_cloudwatch_log_group" "eks" {
  name              = "/aws/eks/${local.cluster_name}/cluster"
  retention_in_days = 30
  tags              = var.tags
}

# IAM role for the EKS control plane
resource "aws_iam_role" "cluster" {
  name = "${local.cluster_name}-cluster-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action    = "sts:AssumeRole"
      Effect    = "Allow"
      Principal = { Service = "eks.amazonaws.com" }
    }]
  })

  tags = var.tags
}

resource "aws_iam_role_policy_attachment" "cluster_policy" {
  policy_arn = "arn:aws:iam::aws:policy/AmazonEKSClusterPolicy"
  role       = aws_iam_role.cluster.name
}

# GPU Node Group
resource "aws_eks_node_group" "gpu" {
  cluster_name    = aws_eks_cluster.this.name
  node_group_name = "gpu-workers"
  node_role_arn   = aws_iam_role.node.arn
  subnet_ids      = var.private_subnet_ids
  instance_types  = [var.gpu_instance_type]
  ami_type        = "AL2_x86_64_GPU" # Amazon Linux 2 with GPU drivers

  scaling_config {
    desired_size = var.gpu_desired_count
    min_size     = var.gpu_min_count
    max_size     = var.gpu_max_count
  }

  update_config {
    max_unavailable = 1
  }

  # Let the cluster autoscaler own desired_size; otherwise every apply
  # resets it to the configured value
  lifecycle {
    ignore_changes = [scaling_config[0].desired_size]
  }

  labels = {
    role        = "gpu-worker"
    accelerator = "nvidia"
  }

  # Only pods that tolerate this taint are scheduled onto GPU nodes
  taint {
    key    = "nvidia.com/gpu"
    value  = "true"
    effect = "NO_SCHEDULE"
  }

  tags = merge(var.tags, {
    "k8s.io/cluster-autoscaler/enabled"               = "true"
    "k8s.io/cluster-autoscaler/${local.cluster_name}" = "owned"
  })

  depends_on = [
    aws_iam_role_policy_attachment.node_worker_policy,
    aws_iam_role_policy_attachment.node_cni_policy,
    aws_iam_role_policy_attachment.node_ecr_policy,
  ]
}

# CPU Node Group for orchestration, preprocessing, serving
resource "aws_eks_node_group" "cpu" {
  cluster_name    = aws_eks_cluster.this.name
  node_group_name = "cpu-workers"
  node_role_arn   = aws_iam_role.node.arn
  subnet_ids      = var.private_subnet_ids
  instance_types  = [var.cpu_instance_type]
  ami_type        = "AL2_x86_64"

  scaling_config {
    desired_size = var.cpu_desired_count
    min_size     = 2 # Always keep 2 for system workloads
    max_size     = var.cpu_max_count
  }

  # Same reasoning as the GPU group: the autoscaler manages desired_size
  lifecycle {
    ignore_changes = [scaling_config[0].desired_size]
  }

  labels = {
    role = "cpu-worker"
  }

  tags = merge(var.tags, {
    "k8s.io/cluster-autoscaler/enabled"               = "true"
    "k8s.io/cluster-autoscaler/${local.cluster_name}" = "owned"
  })

  # Wait for all three node policies, as the GPU group does
  depends_on = [
    aws_iam_role_policy_attachment.node_worker_policy,
    aws_iam_role_policy_attachment.node_cni_policy,
    aws_iam_role_policy_attachment.node_ecr_policy,
  ]
}

# Node IAM role
resource "aws_iam_role" "node" {
  name = "${local.cluster_name}-node-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action    = "sts:AssumeRole"
      Effect    = "Allow"
      Principal = { Service = "ec2.amazonaws.com" }
    }]
  })

  tags = var.tags
}

resource "aws_iam_role_policy_attachment" "node_worker_policy" {
  policy_arn = "arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy"
  role       = aws_iam_role.node.name
}

resource "aws_iam_role_policy_attachment" "node_cni_policy" {
  policy_arn = "arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy"
  role       = aws_iam_role.node.name
}

resource "aws_iam_role_policy_attachment" "node_ecr_policy" {
  policy_arn = "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly"
  role       = aws_iam_role.node.name
}
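Other modules in this lesson read `cluster_oidc_issuer_url` and `node_security_group_id` from the EKS module, but its outputs file is not shown. A possible outputs.tf is sketched below. It assumes the managed node groups use no custom launch template, in which case nodes carry the EKS-managed cluster security group. Treat that mapping as an assumption to verify against your own module.

```hcl
# modules/eks-cluster/outputs.tf (sketch, not from the original module)

output "cluster_name" {
  value = aws_eks_cluster.this.name
}

output "cluster_endpoint" {
  value = aws_eks_cluster.this.endpoint
}

# Issuer URL that the IRSA module registers as an IAM OIDC provider
output "cluster_oidc_issuer_url" {
  value = aws_eks_cluster.this.identity[0].oidc[0].issuer
}

# Assumption: without a custom launch template, managed node groups use the
# EKS-managed cluster security group, so it stands in for the node SG here
output "node_security_group_id" {
  value = aws_eks_cluster.this.vpc_config[0].cluster_security_group_id
}
```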

IRSA - IAM Roles for Service Accounts

IRSA allows Kubernetes pods to assume IAM roles without static credentials. Every ML workload should use IRSA instead of instance profiles or environment variables with AWS keys.

# modules/irsa/main.tf

# OIDC provider for the EKS cluster (enables IRSA).
# NOTE: IAM allows only one OIDC provider per issuer URL in an account. If this
# module is instantiated more than once for the same cluster, create the
# provider once elsewhere and pass its ARN in instead.
data "tls_certificate" "eks" {
  url = var.cluster_oidc_issuer_url
}

resource "aws_iam_openid_connect_provider" "eks" {
  client_id_list  = ["sts.amazonaws.com"]
  thumbprint_list = [data.tls_certificate.eks.certificates[0].sha1_fingerprint]
  url             = var.cluster_oidc_issuer_url
  tags            = var.tags
}

# Create an IRSA role for a given service account
resource "aws_iam_role" "this" {
  name = "${var.prefix}-${var.service_account_name}-role"

  # Trust policy: only the named service account in the named namespace
  # may assume this role
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Action = "sts:AssumeRoleWithWebIdentity"
      Principal = {
        Federated = aws_iam_openid_connect_provider.eks.arn
      }
      Condition = {
        StringEquals = {
          "${replace(var.cluster_oidc_issuer_url, "https://", "")}:sub" = "system:serviceaccount:${var.namespace}:${var.service_account_name}"
          "${replace(var.cluster_oidc_issuer_url, "https://", "")}:aud" = "sts.amazonaws.com"
        }
      }
    }]
  })

  tags = var.tags
}

resource "aws_iam_role_policy" "this" {
  name   = "permissions"
  role   = aws_iam_role.this.id
  policy = var.policy_json
}

# modules/irsa/outputs.tf
output "role_arn" {
  value = aws_iam_role.this.arn
}

# Usage (from environments/prod/main.tf): IRSA role for the training job service account
module "training_irsa" {
  source = "../../modules/irsa"

  prefix                  = local.prefix
  service_account_name    = "training-job"
  namespace               = "ml-training"
  cluster_oidc_issuer_url = module.eks.cluster_oidc_issuer_url

  policy_json = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = ["s3:GetObject", "s3:PutObject", "s3:ListBucket"]
        Resource = [
          module.ml_storage.training_data_bucket_arn,
          "${module.ml_storage.training_data_bucket_arn}/*",
          module.ml_storage.model_artifacts_bucket_arn,
          "${module.ml_storage.model_artifacts_bucket_arn}/*",
        ]
      }
    ]
  })

  tags = local.common_tags
}
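The role above only takes effect once a Kubernetes service account with the matching name and namespace carries the role ARN annotation. The sketch below creates that service account from Terraform. It assumes a configured `hashicorp/kubernetes` provider, which this lesson does not show, and reuses the `training_irsa` module instance from the usage example.

```hcl
# Sketch: service account bound to the IRSA role (assumes the kubernetes
# provider is configured against the EKS cluster)
resource "kubernetes_service_account_v1" "training_job" {
  metadata {
    # Must match the name and namespace in the role's trust policy
    name      = "training-job"
    namespace = "ml-training"

    annotations = {
      "eks.amazonaws.com/role-arn" = module.training_irsa.role_arn
    }
  }
}
```

Training pods then set `serviceAccountName: training-job` in their pod spec. If either the name or the namespace differs from the trust policy's `sub` condition, the `sts:AssumeRoleWithWebIdentity` call is denied.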

MLflow Tracking Server Module

# modules/mlflow/main.tf

# RDS PostgreSQL for MLflow backend store
resource "aws_db_subnet_group" "mlflow" {
  name       = "${var.prefix}-mlflow"
  subnet_ids = var.database_subnet_ids
  tags       = var.tags
}

resource "aws_security_group" "mlflow_db" {
  name        = "${var.prefix}-mlflow-db"
  description = "MLflow RDS security group"
  vpc_id      = var.vpc_id

  ingress {
    from_port       = 5432
    to_port         = 5432
    protocol        = "tcp"
    security_groups = [var.eks_node_security_group_id] # inline ingress rules take security_groups
    description     = "PostgreSQL from EKS nodes"
  }

  tags = var.tags
}

resource "aws_db_instance" "mlflow" {
  identifier = "${var.prefix}-mlflow"

  engine                = "postgres"
  engine_version        = "15.4"
  instance_class        = var.db_instance_class
  allocated_storage     = 50
  max_allocated_storage = 200 # Auto-scaling up to 200 GB
  storage_encrypted     = true
  storage_type          = "gp3"

  db_name  = "mlflow"
  username = "mlflow"
  password = var.db_password

  db_subnet_group_name    = aws_db_subnet_group.mlflow.name
  vpc_security_group_ids  = [aws_security_group.mlflow_db.id]
  multi_az                = var.multi_az
  backup_retention_period = var.backup_retention_days
  deletion_protection     = var.enable_deletion_protection
  skip_final_snapshot     = !var.enable_deletion_protection
  # Required whenever skip_final_snapshot is false
  final_snapshot_identifier = "${var.prefix}-mlflow-final"

  performance_insights_enabled = true

  tags = var.tags
}

# Store DB password in AWS Secrets Manager
resource "aws_secretsmanager_secret" "mlflow_db" {
  name                    = "${var.prefix}/mlflow/db-password"
  recovery_window_in_days = 7
  tags                    = var.tags
}

resource "aws_secretsmanager_secret_version" "mlflow_db" {
  secret_id = aws_secretsmanager_secret.mlflow_db.id
  secret_string = jsonencode({
    password = var.db_password
    endpoint = aws_db_instance.mlflow.endpoint
    username = aws_db_instance.mlflow.username
    dbname   = aws_db_instance.mlflow.db_name
  })
}

# Helm chart deployment of MLflow on EKS
resource "helm_release" "mlflow" {
  name       = "mlflow"
  repository = "https://community-charts.github.io/helm-charts"
  chart      = "mlflow"
  version    = "0.7.19"
  namespace  = "mlops"

  create_namespace = true

  set {
    name  = "backendStore.postgres.enabled"
    value = "true"
  }

  set {
    name  = "backendStore.postgres.host"
    value = aws_db_instance.mlflow.address
  }

  set {
    name  = "backendStore.postgres.dbName"
    value = "mlflow"
  }

  set {
    name  = "defaultArtifactRoot"
    value = "s3://${var.model_artifacts_bucket}/${var.mlflow_artifact_prefix}"
  }

  # Dots in the annotation key are escaped so Helm treats it as one key
  set {
    name  = "serviceAccount.annotations.eks\\.amazonaws\\.com/role-arn"
    value = var.mlflow_irsa_role_arn
  }

  set {
    name  = "replicaCount"
    value = var.mlflow_replica_count
  }

  values = [
    templatefile("${path.module}/values/mlflow.yaml.tpl", {
      db_secret_arn = aws_secretsmanager_secret.mlflow_db.arn
      environment   = var.environment
    })
  ]
}

Feature Store Module (DynamoDB + Redis)

# modules/feature-store/main.tf

# DynamoDB for low-latency feature serving (online store)
resource "aws_dynamodb_table" "features" {
  name         = "${var.prefix}-features"
  billing_mode = "PAY_PER_REQUEST" # Serverless scaling
  hash_key     = "entity_id"
  range_key    = "feature_group"

  attribute {
    name = "entity_id"
    type = "S"
  }

  attribute {
    name = "feature_group"
    type = "S"
  }

  attribute {
    name = "updated_at"
    type = "N"
  }

  # GSI for time-based queries (finding stale features)
  global_secondary_index {
    name            = "feature-group-time-index"
    hash_key        = "feature_group"
    range_key       = "updated_at"
    projection_type = "ALL"
  }

  ttl {
    attribute_name = "ttl"
    enabled        = true
  }

  point_in_time_recovery {
    enabled = true
  }

  server_side_encryption {
    enabled = true
  }

  tags = merge(var.tags, { Component = "feature-store-online" })
}

# ElastiCache Redis for sub-millisecond feature lookup
resource "aws_elasticache_subnet_group" "features" {
  name       = "${var.prefix}-features"
  subnet_ids = var.private_subnet_ids
  tags       = var.tags
}

resource "aws_security_group" "redis" {
  name        = "${var.prefix}-redis"
  description = "Redis feature store security group"
  vpc_id      = var.vpc_id

  ingress {
    from_port       = 6379
    to_port         = 6379
    protocol        = "tcp"
    security_groups = [var.application_security_group_id] # inline ingress rules take security_groups
    description     = "Redis from application layer"
  }

  tags = var.tags
}

resource "aws_elasticache_replication_group" "features" {
  replication_group_id = "${var.prefix}-features"
  description          = "Feature store cache"

  node_type            = var.redis_node_type
  num_cache_clusters   = var.is_production ? 3 : 1 # Multi-AZ in prod
  parameter_group_name = "default.redis7"
  engine_version       = "7.0"
  port                 = 6379

  subnet_group_name  = aws_elasticache_subnet_group.features.name
  security_group_ids = [aws_security_group.redis.id]

  at_rest_encryption_enabled = true
  transit_encryption_enabled = true
  auth_token                 = var.redis_auth_token

  # Failover and Multi-AZ need more than one cache cluster, so prod only
  automatic_failover_enabled = var.is_production
  multi_az_enabled           = var.is_production

  snapshot_retention_limit = 7
  snapshot_window          = "05:00-06:00"

  tags = merge(var.tags, { Component = "feature-store-cache" })
}

Inference Endpoints - EKS with KEDA Autoscaling

# modules/inference/main.tf

# KEDA (Kubernetes Event-Driven Autoscaler) via Helm
resource "helm_release" "keda" {
  name       = "keda"
  repository = "https://kedacore.github.io/charts"
  chart      = "keda"
  version    = "2.13.0"
  namespace  = "keda"

  create_namespace = true

  set {
    name  = "serviceAccount.annotations.eks\\.amazonaws\\.com/role-arn"
    value = var.keda_irsa_role_arn
  }
}

# Cluster Autoscaler for node-level scaling
resource "helm_release" "cluster_autoscaler" {
  name       = "cluster-autoscaler"
  repository = "https://kubernetes.github.io/autoscaler"
  chart      = "cluster-autoscaler"
  version    = "9.35.0"
  namespace  = "kube-system"

  set {
    name  = "autoDiscovery.clusterName"
    value = var.cluster_name
  }

  set {
    name  = "awsRegion"
    value = var.aws_region
  }

  # Pin the service account name so it matches the IRSA trust policy subject
  set {
    name  = "rbac.serviceAccount.name"
    value = "cluster-autoscaler"
  }

  # Use the role created by the IRSA module below
  set {
    name  = "rbac.serviceAccount.annotations.eks\\.amazonaws\\.com/role-arn"
    value = module.cluster_autoscaler_irsa.role_arn
  }

  set {
    name  = "extraArgs.balance-similar-node-groups"
    value = "true"
  }

  set {
    name  = "extraArgs.skip-nodes-with-system-pods"
    value = "false"
  }
}

# IRSA for cluster autoscaler
module "cluster_autoscaler_irsa" {
  source = "../irsa"

  prefix                  = var.prefix
  service_account_name    = "cluster-autoscaler"
  namespace               = "kube-system"
  cluster_oidc_issuer_url = var.cluster_oidc_issuer_url

  policy_json = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "autoscaling:DescribeAutoScalingGroups",
          "autoscaling:DescribeAutoScalingInstances",
          "autoscaling:DescribeLaunchConfigurations",
          "autoscaling:DescribeScalingActivities",
          "autoscaling:DescribeTags",
          "autoscaling:SetDesiredCapacity",
          "autoscaling:TerminateInstanceInAutoScalingGroup",
          "ec2:DescribeLaunchTemplateVersions",
          "ec2:DescribeInstanceTypes",
        ]
        Resource = "*"
      }
    ]
  })

  tags = var.tags
}
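Installing KEDA only provides the operator. Scaling an inference deployment also needs a `ScaledObject` that names the target and a trigger. The sketch below declares one with the `kubernetes_manifest` resource. The deployment name, namespace, Prometheus address, query, and threshold are all hypothetical placeholders, and a configured `hashicorp/kubernetes` provider is assumed.

```hcl
# Sketch: scale a hypothetical "model-server" Deployment on request rate
resource "kubernetes_manifest" "model_server_scaler" {
  manifest = {
    apiVersion = "keda.sh/v1alpha1"
    kind       = "ScaledObject"

    metadata = {
      name      = "model-server-scaler"
      namespace = "ml-serving" # hypothetical namespace
    }

    spec = {
      scaleTargetRef = {
        name = "model-server" # hypothetical Deployment name
      }
      minReplicaCount = 1
      maxReplicaCount = 20

      triggers = [{
        type = "prometheus"
        metadata = {
          serverAddress = "http://prometheus.monitoring.svc:9090" # placeholder
          query         = "sum(rate(http_requests_total{app=\"model-server\"}[1m]))"
          threshold     = "50" # target requests/sec per replica
        }
      }]
    }
  }

  depends_on = [helm_release.keda]
}
```

`kubernetes_manifest` looks up the resource type in the cluster during planning, so the KEDA CRDs generally need to exist before this resource can be planned. In practice that often means applying the KEDA release in an earlier run or a separate module.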

Terragrunt for DRY Multi-Environment Configuration

Terragrunt wraps Terraform to solve the environment repetition problem. Instead of duplicating backend config, provider config, and common variables across dev/staging/prod, you define them once and inherit.

# terragrunt.hcl (root - applies to ALL environments)
locals {
  account_vars     = read_terragrunt_config(find_in_parent_folders("account.hcl"))
  region_vars      = read_terragrunt_config(find_in_parent_folders("region.hcl"))
  environment_vars = read_terragrunt_config(find_in_parent_folders("env.hcl"))

  account_id  = local.account_vars.locals.account_id
  aws_region  = local.region_vars.locals.aws_region
  environment = local.environment_vars.locals.environment
}

# Remote state config - generated automatically for every module
remote_state {
  backend = "s3"
  generate = {
    path      = "backend.tf"
    if_exists = "overwrite_terragrunt"
  }
  config = {
    bucket         = "mycompany-terraform-state-${local.account_id}"
    key            = "${path_relative_to_include()}/terraform.tfstate"
    region         = local.aws_region
    encrypt        = true
    dynamodb_table = "terraform-state-lock"
  }
}

# Provider config - generated automatically for every module
generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<EOF
provider "aws" {
  region = "${local.aws_region}"

  default_tags {
    tags = {
      Environment = "${local.environment}"
      ManagedBy   = "terragrunt"
    }
  }
}
EOF
}

# Common inputs available to all modules
inputs = {
  aws_region  = local.aws_region
  environment = local.environment
}

# environments/prod/eks-cluster/terragrunt.hcl
include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "../../../modules//eks-cluster"
}

dependency "networking" {
  config_path = "../networking"

  # Use mock outputs during plan when networking hasn't been applied yet
  mock_outputs = {
    vpc_id             = "vpc-00000000"
    private_subnet_ids = ["subnet-00000000", "subnet-11111111"]
  }
}

inputs = {
  # The module derives the cluster name as "${prefix}-eks", i.e. eai-prod-eks
  prefix             = "eai-prod"
  kubernetes_version = "1.29"
  vpc_id             = dependency.networking.outputs.vpc_id
  private_subnet_ids = dependency.networking.outputs.private_subnet_ids

  # Prod-specific sizing
  gpu_instance_type = "p3.8xlarge"
  gpu_min_count     = 0
  gpu_desired_count = 2
  gpu_max_count     = 20

  cpu_instance_type = "m5.2xlarge"
  cpu_desired_count = 3
  cpu_max_count     = 30

  enable_public_endpoint = false # Private cluster in prod
  multi_az               = true
}
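The root configuration reads three small files through `find_in_parent_folders`: account.hcl, region.hcl, and env.hcl. A sketch of each follows. Their placement under environments/prod/ is an assumption about the layout, and the account ID is a placeholder. They are separate files, shown together only for brevity.

```hcl
# environments/prod/account.hcl (placeholder account ID)
locals {
  account_id = "123456789012"
}

# environments/prod/region.hcl
locals {
  aws_region = "us-east-1"
}

# environments/prod/env.hcl
locals {
  environment = "prod"
}
```

Because `find_in_parent_folders` walks upward from the child terragrunt.hcl, each file must sit in the child's directory or one of its ancestors. A dev environment gets its own copies with different values and inherits everything else from the root.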

Atlantis - GitOps Terraform via Pull Requests

Atlantis is a self-hosted server that runs Terraform plan/apply in response to GitHub/GitLab PR events. It enforces the golden rule: no one applies Terraform manually.

# atlantis.yaml - in the root of your repo
version: 3
automerge: false
delete_source_branch_on_merge: false

projects:
  - name: ml-platform-networking
    dir: environments/prod/networking
    workspace: default
    autoplan:
      enabled: true
      when_modified:
        - "**/*.tf"
        - "**/*.tfvars"
        - "../../../modules/networking/**/*.tf"

  - name: ml-platform-eks
    dir: environments/prod/eks-cluster
    workspace: default
    autoplan:
      enabled: true
      when_modified:
        - "**/*.tf"
        - "../../../modules/eks-cluster/**/*.tf"
    apply_requirements:
      - approved    # Requires PR approval before apply
      - mergeable   # PR must be mergeable (no conflicts, required checks passing)
      - undiverged  # Branch must not have diverged from the base branch

  - name: ml-platform-mlflow
    dir: environments/prod/mlflow
    depends_on:
      - ml-platform-eks  # Apply EKS before MLflow
    apply_requirements:
      - approved

Putting It All Together - Root Configuration

# environments/prod/main.tf - assemble all modules

locals {
  prefix        = "eai-prod"
  environment   = "prod"
  aws_region    = "us-east-1"
  is_production = true

  common_tags = {
    Project     = "engineersofai"
    Environment = "prod"
    ManagedBy   = "terraform"
  }
}

# Networking
module "networking" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "5.5.0"

  name = "${local.prefix}-vpc"
  cidr = "10.0.0.0/16"
  azs  = ["us-east-1a", "us-east-1b", "us-east-1c"]

  private_subnets  = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
  public_subnets   = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]
  database_subnets = ["10.0.201.0/24", "10.0.202.0/24", "10.0.203.0/24"]

  enable_nat_gateway           = true
  single_nat_gateway           = false # Multi-NAT for prod HA
  enable_dns_hostnames         = true
  create_database_subnet_group = true

  tags = local.common_tags
}

# EKS Cluster
module "eks" {
  source = "../../modules/eks-cluster"

  prefix             = local.prefix
  kubernetes_version = "1.29"
  vpc_id             = module.networking.vpc_id
  private_subnet_ids = module.networking.private_subnets

  gpu_instance_type = "p3.8xlarge"
  gpu_min_count     = 0
  gpu_desired_count = 2
  gpu_max_count     = 20

  cpu_instance_type = "m5.2xlarge"
  cpu_desired_count = 3
  cpu_max_count     = 30

  enable_public_endpoint = false
  multi_az               = true
  tags                   = local.common_tags
}

# Storage
module "ml_storage" {
  source = "../../modules/ml-storage"

  prefix      = local.prefix
  environment = local.environment
  tags        = local.common_tags
}

# ECR Repositories
module "ecr" {
  source = "../../modules/ecr"

  prefix = local.prefix
  repositories = [
    "training/pytorch",
    "training/tensorflow",
    "serving/triton",
    "preprocessing/spark",
  ]
  tags = local.common_tags
}

# MLflow
module "mlflow" {
  source = "../../modules/mlflow"

  prefix                  = local.prefix
  environment             = local.environment
  vpc_id                  = module.networking.vpc_id
  database_subnet_ids     = module.networking.database_subnets
  cluster_oidc_issuer_url = module.eks.cluster_oidc_issuer_url

  # The module's DB security group and Helm release read these two inputs
  eks_node_security_group_id = module.eks.node_security_group_id
  # Assumes an IRSA module instance for the MLflow service account
  # (module "mlflow_irsa"), which is not shown in this lesson
  mlflow_irsa_role_arn = module.mlflow_irsa.role_arn

  db_instance_class          = "db.r6g.xlarge"
  db_password                = var.mlflow_db_password
  multi_az                   = true
  backup_retention_days      = 14
  enable_deletion_protection = true

  model_artifacts_bucket = module.ml_storage.model_artifacts_bucket_name
  mlflow_artifact_prefix = "mlflow-artifacts"
  mlflow_replica_count   = 3

  tags = local.common_tags
}

# Feature Store
module "feature_store" {
  source = "../../modules/feature-store"

  prefix             = local.prefix
  environment        = local.environment
  vpc_id             = module.networking.vpc_id
  private_subnet_ids = module.networking.private_subnets
  is_production      = local.is_production
  redis_node_type    = "cache.r7g.large"
  redis_auth_token   = var.redis_auth_token

  application_security_group_id = module.eks.node_security_group_id
  tags                          = local.common_tags
}
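The root configuration reads `var.mlflow_db_password` and `var.redis_auth_token` without showing their declarations. One way to declare them is sketched below. The descriptions and the validation rule are illustrative additions, not part of the original configuration.

```hcl
# environments/prod/variables.tf (sketch)

variable "mlflow_db_password" {
  description = "Master password for the MLflow RDS instance"
  type        = string
  sensitive   = true # redacted in plan/apply output
}

variable "redis_auth_token" {
  description = "AUTH token for the feature-store Redis replication group"
  type        = string
  sensitive   = true

  validation {
    # ElastiCache requires AUTH tokens of 16 to 128 characters
    condition     = length(var.redis_auth_token) >= 16 && length(var.redis_auth_token) <= 128
    error_message = "redis_auth_token must be between 16 and 128 characters."
  }
}
```

Values would typically arrive through `TF_VAR_mlflow_db_password` and `TF_VAR_redis_auth_token` in CI rather than a committed .tfvars file. Note that `sensitive = true` only hides values from CLI output. They are still written to the state file, so the state bucket needs encryption and tight access control.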

Production Engineering Notes

Module versioning: When using modules from the Terraform registry or your own internal registry, always pin to an exact version (version = "5.5.0", not version = "~> 5.0"). Floating versions mean a terraform init six months from now gets a different module version, breaking your apply.

Dependency between modules: Terragrunt's dependency blocks are the cleanest way to express cross-module dependencies. Without Terragrunt, use a terraform_remote_state data source (with backend = "s3" and a config block naming the bucket, key, and region) to read outputs from another module's state file.
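Written out in full, the plain-Terraform alternative might look like the sketch below. The bucket name and state key are hypothetical and must match wherever the networking configuration actually stores its state.

```hcl
# Read the networking stack's outputs from its remote state
data "terraform_remote_state" "networking" {
  backend = "s3"

  config = {
    bucket = "example-terraform-state" # hypothetical bucket
    key    = "prod/networking/terraform.tfstate"
    region = "us-east-1"
  }
}

locals {
  # Only root-module outputs of the other configuration are visible here
  vpc_id             = data.terraform_remote_state.networking.outputs.vpc_id
  private_subnet_ids = data.terraform_remote_state.networking.outputs.private_subnet_ids
}
```

The reading configuration needs read access to the whole upstream state object, which is one reason some teams prefer Terragrunt's dependency blocks or publishing values through a parameter store.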

Node group lifecycle: Add lifecycle { ignore_changes = [scaling_config[0].desired_size] } to EKS node groups. The cluster autoscaler changes the desired count at runtime - without this, the next terraform apply reverts the autoscaler's changes and causes unexpected scale-in.

GPU driver management: The AL2_x86_64_GPU AMI type includes NVIDIA drivers for p3/p4 instances. For g4dn/g5 instances, you may need custom launch templates with specific driver versions.

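Cost guardrails: Tag every resource with Environment and Project. Use AWS Budgets with Terraform (aws_budgets_budget) to alert when spend exceeds a threshold. A budget alerts rather than stops spend by default, so give dev environments a deliberately low limit and treat a breach as the trigger to scale down or tear down.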
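A dev budget with an alert at 80% of the limit could be declared roughly as follows. The amount, email address, and tag filter are placeholder assumptions. The tag filter also assumes `Environment` has been activated as a cost allocation tag in the billing console.

```hcl
resource "aws_budgets_budget" "dev_monthly" {
  name         = "ml-platform-dev-monthly"
  budget_type  = "COST"
  limit_amount = "2000" # placeholder monthly limit
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  # Only count resources tagged Environment=dev
  cost_filter {
    name   = "TagKeyValue"
    values = ["user:Environment$dev"]
  }

  # Email when actual spend passes 80% of the limit
  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = ["ml-platform@example.com"] # placeholder
  }
}
```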

Common Mistakes

:::danger Do Not Use terraform apply -auto-approve in Production
This bypasses the most important safety checkpoint - human review of the plan. Every production apply must be reviewed, even for "small" changes. A misconfigured security group rule or wrong instance type can cause an outage or data loss that auto-approve will apply without any warning.
:::

:::danger Do Not Share IAM Credentials in Terraform State
If your configuration creates AWS access keys or passes secrets through resource arguments or outputs, Terraform stores them in the state file in plaintext. Use IRSA for EKS workloads, instance profiles for EC2, and AWS Secrets Manager for everything else. Never output secrets.
:::

:::warning EKS Node Group Replacement
Changing the instance_types of an EKS managed node group forces Terraform to destroy and recreate the node group, draining all pods first. In production, this causes a temporary capacity reduction. Always test node group changes in staging first, and do them during low-traffic windows.
:::

:::warning Terragrunt Dependency Ordering
Terragrunt runs modules in dependency order, but it does not know about resource-level dependencies within modules. If your EKS cluster module creates resources that a later module's data sources need to read, ensure terraform apply for the upstream module completes before running the downstream one.
:::

Interview Q&A

Q: How would you structure Terraform for a team managing the same ML infrastructure across dev, staging, and prod?

The recommended approach is separate state files per environment with shared modules. Structure the repo as environments/dev/, environments/staging/, environments/prod/ - each calling the same modules from modules/. This gives complete isolation between environments (a terraform destroy on dev does not touch prod), with code reuse through modules. Add Terragrunt on top for DRY backend configuration and dependency management. Each environment gets its own .tfvars file with environment-specific sizes and settings. CI runs terraform plan on PRs and terraform apply only on merge to main with environment-specific approval gates.

Q: What is IRSA and why is it critical for ML workloads on EKS?

IRSA (IAM Roles for Service Accounts) allows Kubernetes pods to assume AWS IAM roles using OIDC federation - without any static credentials. Each EKS cluster exposes an OIDC issuer URL, which you register in IAM as an OIDC identity provider. You then create an IAM role whose trust policy allows only a specific Kubernetes service account in a specific namespace to assume it. For pods running with that service account, EKS injects the AWS_WEB_IDENTITY_TOKEN_FILE and AWS_ROLE_ARN environment variables along with a projected service account token, and the AWS SDK exchanges that token for temporary credentials via sts:AssumeRoleWithWebIdentity. For ML workloads, this means training pods can access S3 training data and write model artifacts without any hardcoded keys. The credentials are short-lived, automatically rotated, and scoped to one service account, whereas an instance profile grants the same permissions to every pod on the node and static keys in environment variables never expire on their own.

Q: How does Terragrunt solve the multi-environment problem compared to Terraform workspaces?

Terraform workspaces use a single backend configuration and a single codebase with a workspace-keyed state. They feel simple but become messy as environments diverge - you end up with terraform.workspace == "prod" ? ... : ... conditionals everywhere, making the code hard to read. Terragrunt uses a hierarchical configuration inheritance model: root terragrunt.hcl defines shared backend config, provider config, and common inputs; environment-specific terragrunt.hcl files override and extend. Each module gets its own state file automatically. Environments are truly isolated - different state files, potentially different AWS accounts. Terragrunt also adds dependency blocks for cross-module output references and run-all for applying entire environments in dependency order.

Q: What are the trade-offs between using the community EKS Terraform module vs writing your own?

The community EKS module (terraform-aws-modules/eks/aws) is well-maintained, handles many edge cases (IRSA setup, add-ons, launch templates, managed node groups), and is used by hundreds of companies. It is the right choice for most teams. The trade-offs: it is opinionated and complex (hundreds of variables), it may lag behind new EKS features, and debugging failures requires understanding someone else's module internals. Writing your own gives full control and simpler code, but you take on the maintenance burden of keeping up with EKS API changes. My recommendation: use the community module for the EKS cluster itself, but write your own thin wrapper modules for the ML-specific parts (IRSA roles, node groups with ML-specific taints and labels, GPU driver configuration).

Q: A terraform plan shows a resource will be destroyed and recreated. How do you handle this in production without downtime?

First, understand why Terraform wants to replace it - many attributes are "forces new resource" and cannot be updated in place. Options: (1) Use create_before_destroy = true in the resource's lifecycle block - Terraform creates the replacement first, then destroys the original, avoiding downtime during the switchover. (2) Target the specific resource with -target to isolate the replacement from other changes. (3) For stateful resources like RDS, create a new instance, migrate data, update application config, then destroy the old instance as a manual process outside Terraform. (4) Sometimes the right answer is to not change the attribute - check if the change is truly necessary.
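Option (1) can be sketched for the GPU node group from this lesson as follows. `create_before_destroy` only works if the old and new resources can coexist, so the sketch switches from a fixed `node_group_name` to `node_group_name_prefix`. That switch is an assumption about how you would adapt the module, not something the original configuration does.

```hcl
resource "aws_eks_node_group" "gpu" {
  cluster_name = aws_eks_cluster.this.name
  # A prefix lets the replacement get a unique name while the old group
  # still exists; a fixed node_group_name would collide
  node_group_name_prefix = "gpu-workers-"
  node_role_arn          = aws_iam_role.node.arn
  subnet_ids             = var.private_subnet_ids
  instance_types         = [var.gpu_instance_type]
  ami_type               = "AL2_x86_64_GPU"

  scaling_config {
    desired_size = var.gpu_desired_count
    min_size     = var.gpu_min_count
    max_size     = var.gpu_max_count
  }

  lifecycle {
    # Bring up the new group before the old one is drained and deleted
    create_before_destroy = true
    ignore_changes        = [scaling_config[0].desired_size]
  }
}
```

During the switchover both groups run at once, so account quotas and budget need headroom for roughly double the GPU capacity for a short period.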

© 2026 EngineersOfAI. All rights reserved.