Terraform for ML Infrastructure

From Zero to ML Platform in One Command

The new ML platform engineer arrived on a Monday morning with a clear mandate: rebuild the company's training infrastructure. The old platform was a patchwork of manually created EC2 instances, S3 buckets with inconsistent naming, an MLflow server that "lives on someone's laptop," and a Kubernetes cluster that three people had touched and nobody fully understood. The documentation was a shared Google Doc with 47 comments, most of them questions that were never answered.

By Friday, she had written 1,800 lines of Terraform. By the following Monday, she ran terraform apply in the staging account and watched the entire ML platform materialize: VPC with private subnets, EKS cluster with GPU node groups, MLflow tracking server backed by RDS, model artifact storage in S3 with lifecycle policies, ECR repositories for training images, IRSA roles so pods could access S3 without static credentials. Everything tagged. Everything documented in code. Everything reproducible in a fresh AWS account in under twenty minutes.

The key was not starting from scratch - it was composing well-designed modules. The EKS module from the Terraform registry handled 90% of the Kubernetes cluster configuration. The VPC module gave her a production-grade network in fifteen lines. The custom modules she wrote for MLflow and the model registry were small and focused - each one doing one thing well.

When the infrastructure manager asked "how does this work?" she did not give him a tour of the AWS console. She handed him a GitHub link. Every resource was visible, searchable, and reviewable. When she needed to add a second node group for CPU-only jobs, it was a four-line change - reviewed in a PR, applied through CI, done in ten minutes. That is what Terraform for ML infrastructure looks like when done correctly.

This lesson walks you through building a complete MLOps platform with Terraform. Real modules, real configurations, and the patterns that separate maintainable infrastructure from another pile of technical debt.

Why Terraform for ML Infrastructure Specifically

ML infrastructure has three characteristics that make IaC especially important. First, it is heterogeneous - GPU instances, distributed storage, container registries, databases, queues, monitoring systems, and Kubernetes all exist in the same platform, and tracking that many moving parts by hand does not scale. Second, it is expensive - a misconfigured GPU cluster can burn hundreds of dollars per hour (the 20-node p3.8xlarge ceiling used later in this lesson comes to roughly $240 per hour at on-demand rates of about $12 per node). Terraform's plan phase gives you a last-chance review before those costs materialize. Third, ML infrastructure is experimental - teams spin up new environments constantly (new experiment, new model architecture, new team). With reusable modules, a new environment is one more terraform apply (under twenty minutes in the story above) instead of a half-day of console work.

The Complete ML Platform Architecture

Module Structure for ML Platforms

ml-platform/
├── modules/
│   ├── networking/          # VPC, subnets, NAT, security groups
│   ├── eks-cluster/         # EKS control plane + node groups
│   ├── ml-storage/          # S3 buckets with lifecycle policies
│   ├── model-registry/      # S3 + RDS for artifact + metadata store
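│   ├── mlflow/              # MLflow server on EKS or EC2
│   ├── feature-store/       # DynamoDB + ElastiCache Redis online store
│   ├── inference/           # KEDA + Cluster Autoscaler Helm releases
│   ├── ecr/                 # Container registries
│   └── irsa/                # IAM Roles for Service Accounts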
├── environments/
│   ├── dev/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   └── terraform.tfvars
│   ├── staging/
│   └── prod/
├── versions.tf
└── backend.tf
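
The tree lists versions.tf and backend.tf without showing their contents. A minimal sketch of what they might hold follows. The version constraints, bucket name, and state key are assumptions, and the lock table name is borrowed from the Terragrunt example later in this lesson. The helm constraint stays on 2.x because the set { } blocks used in this lesson follow the 2.x provider syntax.

```hcl
# versions.tf - toolchain and provider constraints (versions are assumptions)
terraform {
  required_version = ">= 1.5.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
    helm = {
      source  = "hashicorp/helm"
      version = "~> 2.12" # 2.x keeps the set { } block syntax used in this lesson
    }
    tls = {
      source  = "hashicorp/tls"
      version = "~> 4.0"
    }
  }
}

# backend.tf - remote state with locking (bucket and key are hypothetical)
terraform {
  backend "s3" {
    bucket         = "mycompany-terraform-state"
    key            = "ml-platform/prod/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"
  }
}
```

Backend blocks cannot interpolate variables or locals, so each environment needs its own literal key (or a generated backend file, which is what the Terragrunt section does).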

EKS Cluster Module

# modules/eks-cluster/main.tf

# Data sources
data "aws_caller_identity" "current" {}

locals {
  cluster_name = "${var.prefix}-eks"
}

# EKS Control Plane
resource "aws_eks_cluster" "this" {
  name     = local.cluster_name
  role_arn = aws_iam_role.cluster.arn
  version  = var.kubernetes_version

  vpc_config {
    subnet_ids              = var.private_subnet_ids
    endpoint_private_access = true
    endpoint_public_access  = var.enable_public_endpoint
    public_access_cidrs     = var.enable_public_endpoint ? var.public_access_cidrs : []
    security_group_ids      = [aws_security_group.cluster.id]
  }

  enabled_cluster_log_types = [
    "api", "audit", "authenticator", "controllerManager", "scheduler"
  ]

  encryption_config {
    provider {
      key_arn = aws_kms_key.eks.arn
    }
    resources = ["secrets"]
  }

  depends_on = [
    aws_iam_role_policy_attachment.cluster_policy,
    aws_cloudwatch_log_group.eks,
  ]

  tags = var.tags
}

# KMS key for envelope encryption of K8s secrets
resource "aws_kms_key" "eks" {
  description             = "EKS secret encryption key for ${local.cluster_name}"
  deletion_window_in_days = 7
  enable_key_rotation     = true
  tags                    = var.tags
}

resource "aws_cloudwatch_log_group" "eks" {
  name              = "/aws/eks/${local.cluster_name}/cluster"
  retention_in_days = 30
  tags              = var.tags
}

# IAM role for the EKS control plane
resource "aws_iam_role" "cluster" {
  name = "${local.cluster_name}-cluster-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action    = "sts:AssumeRole"
      Effect    = "Allow"
      Principal = { Service = "eks.amazonaws.com" }
    }]
  })

  tags = var.tags
}

resource "aws_iam_role_policy_attachment" "cluster_policy" {
  policy_arn = "arn:aws:iam::aws:policy/AmazonEKSClusterPolicy"
  role       = aws_iam_role.cluster.name
}

# GPU Node Group
resource "aws_eks_node_group" "gpu" {
  cluster_name    = aws_eks_cluster.this.name
  node_group_name = "gpu-workers"
  node_role_arn   = aws_iam_role.node.arn
  subnet_ids      = var.private_subnet_ids
  instance_types  = [var.gpu_instance_type]
  ami_type        = "AL2_x86_64_GPU"  # Amazon Linux 2 with GPU drivers

  scaling_config {
    desired_size = var.gpu_desired_count
    min_size     = var.gpu_min_count
    max_size     = var.gpu_max_count
  }

  update_config {
    max_unavailable = 1
  }

  # Ignore desired_size drift so the cluster autoscaler's runtime changes are not reverted on apply
  lifecycle {
    ignore_changes = [scaling_config[0].desired_size]
  }

  labels = {
    role        = "gpu-worker"
    accelerator = "nvidia"
  }

  taint {
    key    = "nvidia.com/gpu"
    value  = "true"
    effect = "NO_SCHEDULE"
  }

  tags = merge(var.tags, {
    "k8s.io/cluster-autoscaler/enabled"                    = "true"
    "k8s.io/cluster-autoscaler/${local.cluster_name}"      = "owned"
  })

  depends_on = [
    aws_iam_role_policy_attachment.node_worker_policy,
    aws_iam_role_policy_attachment.node_cni_policy,
    aws_iam_role_policy_attachment.node_ecr_policy,
  ]
}

# CPU Node Group for orchestration, preprocessing, serving
resource "aws_eks_node_group" "cpu" {
  cluster_name    = aws_eks_cluster.this.name
  node_group_name = "cpu-workers"
  node_role_arn   = aws_iam_role.node.arn
  subnet_ids      = var.private_subnet_ids
  instance_types  = [var.cpu_instance_type]
  ami_type        = "AL2_x86_64"

  scaling_config {
    desired_size = var.cpu_desired_count
    min_size     = 2  # Always keep 2 for system workloads
    max_size     = var.cpu_max_count
  }

  labels = {
    role = "cpu-worker"
  }

  tags = merge(var.tags, {
    "k8s.io/cluster-autoscaler/enabled"                    = "true"
    "k8s.io/cluster-autoscaler/${local.cluster_name}"      = "owned"
  })

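  # Ignore desired_size drift caused by the cluster autoscaler
  lifecycle {
    ignore_changes = [scaling_config[0].desired_size]
  }

  # Nodes need all three policies attached before they can join the cluster
  depends_on = [
    aws_iam_role_policy_attachment.node_worker_policy,
    aws_iam_role_policy_attachment.node_cni_policy,
    aws_iam_role_policy_attachment.node_ecr_policy,
  ]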
}

# Node IAM role
resource "aws_iam_role" "node" {
  name = "${local.cluster_name}-node-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action    = "sts:AssumeRole"
      Effect    = "Allow"
      Principal = { Service = "ec2.amazonaws.com" }
    }]
  })

  tags = var.tags
}

resource "aws_iam_role_policy_attachment" "node_worker_policy" {
  policy_arn = "arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy"
  role       = aws_iam_role.node.name
}

resource "aws_iam_role_policy_attachment" "node_cni_policy" {
  policy_arn = "arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy"
  role       = aws_iam_role.node.name
}

resource "aws_iam_role_policy_attachment" "node_ecr_policy" {
  policy_arn = "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly"
  role       = aws_iam_role.node.name
}
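
The module above references aws_security_group.cluster, and later sections read module.eks.cluster_oidc_issuer_url and module.eks.node_security_group_id, but none of these are defined in the listing. One way they could be filled in is sketched below; the file names and descriptions are assumptions, and the sketch presumes the module declares a vpc_id variable (the root configuration passes one).

```hcl
# modules/eks-cluster/security.tf (sketch)
# Additional control-plane security group referenced by vpc_config.
resource "aws_security_group" "cluster" {
  name_prefix = "${local.cluster_name}-cluster-"
  description = "Additional security group for the EKS control plane"
  vpc_id      = var.vpc_id
  tags        = var.tags

  lifecycle {
    create_before_destroy = true
  }
}

# modules/eks-cluster/outputs.tf (sketch)
output "cluster_name" {
  value = aws_eks_cluster.this.name
}

# OIDC issuer URL consumed by the IRSA module
output "cluster_oidc_issuer_url" {
  value = aws_eks_cluster.this.identity[0].oidc[0].issuer
}

# Security group that EKS creates and attaches to the control plane and to
# managed nodes that do not use a custom launch template
output "node_security_group_id" {
  value = aws_eks_cluster.this.vpc_config[0].cluster_security_group_id
}
```

Exposing the EKS-managed cluster security group as node_security_group_id works here because the node groups above use no custom launch template; with one, the nodes' security group would need its own resource and output.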

IRSA - IAM Roles for Service Accounts

IRSA allows Kubernetes pods to assume IAM roles without static credentials. Every ML workload should use IRSA instead of instance profiles or environment variables with AWS keys.

# modules/irsa/main.tf

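# OIDC provider for the EKS cluster (enables IRSA).
# IAM allows only one provider per issuer URL in an account, so if this module is
# instantiated more than once per cluster, create the provider once (for example
# in the eks-cluster module) and pass its ARN in instead.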
data "tls_certificate" "eks" {
  url = var.cluster_oidc_issuer_url
}

resource "aws_iam_openid_connect_provider" "eks" {
  client_id_list  = ["sts.amazonaws.com"]
  thumbprint_list = [data.tls_certificate.eks.certificates[0].sha1_fingerprint]
  url             = var.cluster_oidc_issuer_url
  tags            = var.tags
}

# Create an IRSA role for a given service account
resource "aws_iam_role" "this" {
  name = "${var.prefix}-${var.service_account_name}-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Action = "sts:AssumeRoleWithWebIdentity"
      Principal = {
        Federated = aws_iam_openid_connect_provider.eks.arn
      }
      Condition = {
        StringEquals = {
          "${replace(var.cluster_oidc_issuer_url, "https://", "")}:sub" = "system:serviceaccount:${var.namespace}:${var.service_account_name}"
          "${replace(var.cluster_oidc_issuer_url, "https://", "")}:aud" = "sts.amazonaws.com"
        }
      }
    }]
  })

  tags = var.tags
}

resource "aws_iam_role_policy" "this" {
  name   = "permissions"
  role   = aws_iam_role.this.id
  policy = var.policy_json
}

# modules/irsa/outputs.tf
output "role_arn" {
  value = aws_iam_role.this.arn
}

# Usage (from environments/<env>/main.tf): create IRSA for the MLflow training job service account
module "training_irsa" {
  source = "../../modules/irsa"

  prefix                  = local.prefix
  service_account_name    = "training-job"
  namespace               = "ml-training"
  cluster_oidc_issuer_url = module.eks.cluster_oidc_issuer_url

  policy_json = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect   = "Allow"
        Action   = ["s3:GetObject", "s3:PutObject", "s3:ListBucket"]
        Resource = [
          module.ml_storage.training_data_bucket_arn,
          "${module.ml_storage.training_data_bucket_arn}/*",
          module.ml_storage.model_artifacts_bucket_arn,
          "${module.ml_storage.model_artifacts_bucket_arn}/*",
        ]
      }
    ]
  })

  tags = local.common_tags
}
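
The role only takes effect once a Kubernetes service account with the matching name and namespace carries the role ARN as an annotation. A minimal sketch of that last step, assuming the hashicorp/kubernetes provider is configured against this cluster and the ml-training namespace already exists (neither is shown in this lesson):

```hcl
# Service account that training pods run as.
# Name and namespace must match the values baked into the IRSA trust policy.
resource "kubernetes_service_account_v1" "training_job" {
  metadata {
    name      = "training-job"
    namespace = "ml-training"

    annotations = {
      "eks.amazonaws.com/role-arn" = module.training_irsa.role_arn
    }
  }
}
```

Pods then opt in by setting serviceAccountName to training-job in their spec; a pod in any other namespace or under any other service account cannot assume the role.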

MLflow Tracking Server Module

# modules/mlflow/main.tf

# RDS PostgreSQL for MLflow backend store
resource "aws_db_subnet_group" "mlflow" {
  name       = "${var.prefix}-mlflow"
  subnet_ids = var.database_subnet_ids
  tags       = var.tags
}

resource "aws_security_group" "mlflow_db" {
  name        = "${var.prefix}-mlflow-db"
  description = "MLflow RDS security group"
  vpc_id      = var.vpc_id

  ingress {
    from_port       = 5432
    to_port         = 5432
    protocol        = "tcp"
    security_groups = [var.eks_node_security_group_id]
    description     = "PostgreSQL from EKS nodes"
  }

  tags = var.tags
}

resource "aws_db_instance" "mlflow" {
  identifier = "${var.prefix}-mlflow"

  engine                = "postgres"
  engine_version        = "15.4"
  instance_class        = var.db_instance_class
  allocated_storage     = 50
  max_allocated_storage = 200 # Storage autoscaling up to 200 GB
  storage_encrypted     = true
  storage_type          = "gp3"

  db_name  = "mlflow"
  username = "mlflow"
  password = var.db_password

  db_subnet_group_name    = aws_db_subnet_group.mlflow.name
  vpc_security_group_ids  = [aws_security_group.mlflow_db.id]
  multi_az                = var.multi_az
  backup_retention_period = var.backup_retention_days
  deletion_protection     = var.enable_deletion_protection
  skip_final_snapshot     = !var.enable_deletion_protection
  # Required whenever skip_final_snapshot is false, otherwise destroy fails
  final_snapshot_identifier = "${var.prefix}-mlflow-final"

  performance_insights_enabled = true

  tags = var.tags
}

# Store DB password in AWS Secrets Manager
resource "aws_secretsmanager_secret" "mlflow_db" {
  name                    = "${var.prefix}/mlflow/db-password"
  recovery_window_in_days = 7
  tags                    = var.tags
}

resource "aws_secretsmanager_secret_version" "mlflow_db" {
  secret_id     = aws_secretsmanager_secret.mlflow_db.id
  secret_string = jsonencode({
    password = var.db_password
    endpoint = aws_db_instance.mlflow.endpoint
    username = aws_db_instance.mlflow.username
    dbname   = aws_db_instance.mlflow.db_name
  })
}

# Helm chart deployment of MLflow on EKS
resource "helm_release" "mlflow" {
  name       = "mlflow"
  repository = "https://community-charts.github.io/helm-charts"
  chart      = "mlflow"
  version    = "0.7.19"
  namespace  = "mlops"

  create_namespace = true

  set {
    name  = "backendStore.postgres.enabled"
    value = "true"
  }

  set {
    name  = "backendStore.postgres.host"
    value = aws_db_instance.mlflow.address
  }

  set {
    name  = "backendStore.postgres.dbName"
    value = "mlflow"
  }

  set {
    name  = "defaultArtifactRoot"
    value = "s3://${var.model_artifacts_bucket}/${var.mlflow_artifact_prefix}"
  }

  set {
    name  = "serviceAccount.annotations.eks\\.amazonaws\\.com/role-arn"
    value = var.mlflow_irsa_role_arn
  }

  set {
    name  = "replicaCount"
    value = var.mlflow_replica_count
  }

  values = [
    templatefile("${path.module}/values/mlflow.yaml.tpl", {
      db_secret_arn = aws_secretsmanager_secret.mlflow_db.arn
      environment   = var.environment
    })
  ]
}
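
The module takes the database password as var.db_password, which means someone has to invent it and hand it to Terraform. An alternative sketch is to generate it inside the module; this assumes the hashicorp/random provider is added to required_providers, and the resource name is hypothetical.

```hcl
# Generate the MLflow database password instead of passing it in.
resource "random_password" "mlflow_db" {
  length  = 32
  special = false # sidesteps characters RDS rejects in passwords (/, @, ", space)
}

# In aws_db_instance.mlflow and aws_secretsmanager_secret_version.mlflow_db,
# reference random_password.mlflow_db.result in place of var.db_password.
```

The generated value still lands in the state file, so the state bucket needs the same protection as the secret itself. The AWS provider also offers manage_master_user_password on aws_db_instance, which has RDS create and rotate the secret in Secrets Manager and keeps the password out of state entirely.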

Feature Store Module (DynamoDB + Redis)

# modules/feature-store/main.tf

# DynamoDB for low-latency feature serving (online store)
resource "aws_dynamodb_table" "features" {
  name         = "${var.prefix}-features"
  billing_mode = "PAY_PER_REQUEST"  # Serverless scaling
  hash_key     = "entity_id"
  range_key    = "feature_group"

  attribute {
    name = "entity_id"
    type = "S"
  }

  attribute {
    name = "feature_group"
    type = "S"
  }

  attribute {
    name = "updated_at"
    type = "N"
  }

  # GSI for time-based queries (finding stale features)
  global_secondary_index {
    name            = "feature-group-time-index"
    hash_key        = "feature_group"
    range_key       = "updated_at"
    projection_type = "ALL"
  }

  ttl {
    attribute_name = "ttl"
    enabled        = true
  }

  point_in_time_recovery {
    enabled = true
  }

  server_side_encryption {
    enabled = true
  }

  tags = merge(var.tags, { Component = "feature-store-online" })
}

# ElastiCache Redis for sub-millisecond feature lookup
resource "aws_elasticache_subnet_group" "features" {
  name       = "${var.prefix}-features"
  subnet_ids = var.private_subnet_ids
  tags       = var.tags
}

resource "aws_security_group" "redis" {
  name        = "${var.prefix}-redis"
  description = "Redis feature store security group"
  vpc_id      = var.vpc_id

  ingress {
    from_port       = 6379
    to_port         = 6379
    protocol        = "tcp"
    security_groups = [var.application_security_group_id]
    description     = "Redis from application layer"
  }

  tags = var.tags
}

resource "aws_elasticache_replication_group" "features" {
  replication_group_id = "${var.prefix}-features"
  description          = "Feature store cache"

  node_type            = var.redis_node_type
  num_cache_clusters   = var.is_production ? 3 : 1  # Multi-AZ in prod
  parameter_group_name = "default.redis7"
  engine_version       = "7.0"
  port                 = 6379

  subnet_group_name  = aws_elasticache_subnet_group.features.name
  security_group_ids = [aws_security_group.redis.id]

  at_rest_encryption_enabled = true
  transit_encryption_enabled = true
  auth_token                 = var.redis_auth_token

  automatic_failover_enabled = var.is_production
  multi_az_enabled           = var.is_production

  snapshot_retention_limit = 7
  snapshot_window          = "05:00-06:00"

  tags = merge(var.tags, { Component = "feature-store-cache" })
}
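
Serving code needs to know where the online store lives. A small outputs file along these lines would expose that; the output names are assumptions, not part of the original module.

```hcl
# modules/feature-store/outputs.tf (sketch)
output "online_table_name" {
  value = aws_dynamodb_table.features.name
}

output "online_table_arn" {
  value = aws_dynamodb_table.features.arn # useful when writing IRSA policies
}

# Primary endpoint for reads and writes. Clients must connect over TLS and
# present the AUTH token, because transit encryption is enabled above.
output "redis_primary_endpoint" {
  value = aws_elasticache_replication_group.features.primary_endpoint_address
}
```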

Inference Endpoints - EKS with KEDA Autoscaling

# modules/inference/main.tf

# KEDA (Kubernetes Event-Driven Autoscaler) via Helm
resource "helm_release" "keda" {
  name       = "keda"
  repository = "https://kedacore.github.io/charts"
  chart      = "keda"
  version    = "2.13.0"
  namespace  = "keda"

  create_namespace = true

  set {
    name  = "serviceAccount.annotations.eks\\.amazonaws\\.com/role-arn"
    value = var.keda_irsa_role_arn
  }
}

# Cluster Autoscaler for node-level scaling
resource "helm_release" "cluster_autoscaler" {
  name       = "cluster-autoscaler"
  repository = "https://kubernetes.github.io/autoscaler"
  chart      = "cluster-autoscaler"
  version    = "9.35.0"
  namespace  = "kube-system"

  set {
    name  = "autoDiscovery.clusterName"
    value = var.cluster_name
  }

  set {
    name  = "awsRegion"
    value = var.aws_region
  }

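  # Pin the service account name so it matches the IRSA trust policy below
  set {
    name  = "rbac.serviceAccount.name"
    value = "cluster-autoscaler"
  }

  set {
    name  = "rbac.serviceAccount.annotations.eks\\.amazonaws\\.com/role-arn"
    value = module.cluster_autoscaler_irsa.role_arn
  }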

  set {
    name  = "extraArgs.balance-similar-node-groups"
    value = "true"
  }

  set {
    name  = "extraArgs.skip-nodes-with-system-pods"
    value = "false"
  }
}

# IRSA for cluster autoscaler
module "cluster_autoscaler_irsa" {
  source = "../irsa"

  prefix                  = var.prefix
  service_account_name    = "cluster-autoscaler"
  namespace               = "kube-system"
  cluster_oidc_issuer_url = var.cluster_oidc_issuer_url

  policy_json = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "autoscaling:DescribeAutoScalingGroups",
          "autoscaling:DescribeAutoScalingInstances",
          "autoscaling:DescribeLaunchConfigurations",
          "autoscaling:DescribeScalingActivities",
          "autoscaling:DescribeTags",
          "autoscaling:SetDesiredCapacity",
          "autoscaling:TerminateInstanceInAutoScalingGroup",
          "ec2:DescribeLaunchTemplateVersions",
          "ec2:DescribeInstanceTypes",
        ]
        Resource = "*"
      }
    ]
  })

  tags = var.tags
}
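
Installing KEDA only adds the controller; each inference Deployment still needs a ScaledObject that tells KEDA what to scale on. The sketch below shows one possible shape using a Prometheus trigger. The namespace, Deployment name, Prometheus address, query, and threshold are all hypothetical, and it assumes the hashicorp/kubernetes provider is configured.

```hcl
# Scale a hypothetical "model-server" Deployment on request rate.
resource "kubernetes_manifest" "model_server_scaler" {
  manifest = {
    apiVersion = "keda.sh/v1alpha1"
    kind       = "ScaledObject"
    metadata = {
      name      = "model-server"
      namespace = "inference"
    }
    spec = {
      scaleTargetRef = {
        name = "model-server" # Deployment in the same namespace
      }
      minReplicaCount = 1
      maxReplicaCount = 20
      triggers = [{
        type = "prometheus"
        metadata = {
          serverAddress = "http://prometheus.monitoring.svc:9090"
          query         = "sum(rate(http_requests_total{app=\"model-server\"}[1m]))"
          threshold     = "50" # target requests per second per replica
        }
      }]
    }
  }

  depends_on = [helm_release.keda]
}
```

kubernetes_manifest validates against the cluster at plan time, so the KEDA CRDs must already exist; in practice that means applying the KEDA release in an earlier run or keeping ScaledObjects in a separate stack.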

Terragrunt for DRY Multi-Environment Configuration

Terragrunt wraps Terraform to solve the environment repetition problem. Instead of duplicating backend config, provider config, and common variables across dev/staging/prod, you define them once and inherit.

# terragrunt.hcl (root - applies to ALL environments)
locals {
  account_vars     = read_terragrunt_config(find_in_parent_folders("account.hcl"))
  region_vars      = read_terragrunt_config(find_in_parent_folders("region.hcl"))
  environment_vars = read_terragrunt_config(find_in_parent_folders("env.hcl"))

  account_id  = local.account_vars.locals.account_id
  aws_region  = local.region_vars.locals.aws_region
  environment = local.environment_vars.locals.environment
}

# Remote state config - generated automatically for every module
remote_state {
  backend = "s3"
  generate = {
    path      = "backend.tf"
    if_exists = "overwrite_terragrunt"
  }
  config = {
    bucket         = "mycompany-terraform-state-${local.account_id}"
    key            = "${path_relative_to_include()}/terraform.tfstate"
    region         = local.aws_region
    encrypt        = true
    dynamodb_table = "terraform-state-lock"
  }
}

# Provider config - generated automatically for every module
generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<EOF
provider "aws" {
  region = "${local.aws_region}"

  default_tags {
    tags = {
      Environment = "${local.environment}"
      ManagedBy   = "terragrunt"
    }
  }
}
EOF
}

# Common inputs available to all modules
inputs = {
  aws_region  = local.aws_region
  environment = local.environment
}

# environments/prod/eks-cluster/terragrunt.hcl
include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "../../../modules//eks-cluster"
}

dependency "networking" {
  config_path = "../networking"

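  # Use mock outputs when networking hasn't been applied yet, but only for
  # validate and plan, so an apply can never run against fake IDs
  mock_outputs_allowed_terraform_commands = ["validate", "plan"]
  mock_outputs = {
    vpc_id             = "vpc-00000000"
    private_subnet_ids = ["subnet-00000000", "subnet-11111111"]
  }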
}

inputs = {
  prefix             = "eai-prod"  # the module derives the cluster name as <prefix>-eks
  kubernetes_version = "1.29"
  vpc_id             = dependency.networking.outputs.vpc_id
  private_subnet_ids = dependency.networking.outputs.private_subnet_ids

  # Prod-specific sizing
  gpu_instance_type  = "p3.8xlarge"
  gpu_min_count      = 0
  gpu_desired_count  = 2
  gpu_max_count      = 20

  cpu_instance_type  = "m5.2xlarge"
  cpu_desired_count  = 3
  cpu_max_count      = 30

  enable_public_endpoint = false  # Private cluster in prod
  multi_az               = true
}
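
The root terragrunt.hcl reads account.hcl, region.hcl, and env.hcl from parent folders, but those files are not shown. They are plain locals blocks; a sketch with placeholder values follows. The file locations and the account ID are assumptions.

```hcl
# environments/account.hcl (placeholder account ID)
locals {
  account_id = "123456789012"
}

# environments/region.hcl
locals {
  aws_region = "us-east-1"
}

# environments/prod/env.hcl
locals {
  environment = "prod"
}
```

Each locals block lives in its own file. find_in_parent_folders walks upward from the module directory (for example environments/prod/eks-cluster), so a file placed higher in the tree is shared by more modules, and a dev/env.hcl next to prod/env.hcl is all it takes to switch the environment name.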

Atlantis - GitOps Terraform via Pull Requests

Atlantis is a self-hosted server that runs Terraform plan/apply in response to GitHub/GitLab PR events. It enforces the golden rule: no one applies Terraform manually.

# atlantis.yaml - in the root of your repo
version: 3
automerge: false
delete_source_branch_on_merge: false

projects:
  - name: ml-platform-networking
    dir: environments/prod/networking
    workspace: default
    autoplan:
      enabled: true
      when_modified:
        - "**/*.tf"
        - "**/*.tfvars"
        - "../../../modules/networking/**/*.tf"

  - name: ml-platform-eks
    dir: environments/prod/eks-cluster
    workspace: default
    autoplan:
      enabled: true
      when_modified:
        - "**/*.tf"
        - "../../../modules/eks-cluster/**/*.tf"
    apply_requirements:
      - approved          # PR must be approved before apply
      - mergeable         # PR must be mergeable (no conflicts, required checks passing)
      - undiverged        # Branch must not have diverged from the base branch

  - name: ml-platform-mlflow
    dir: environments/prod/mlflow
    depends_on:
      - ml-platform-eks   # Apply EKS before MLflow
    apply_requirements:
      - approved

Putting It All Together - Root Configuration

# environments/prod/main.tf - assemble all modules

locals {
  prefix      = "eai-prod"
  environment = "prod"
  aws_region  = "us-east-1"
  is_production = true

  common_tags = {
    Project     = "engineersofai"
    Environment = "prod"
    ManagedBy   = "terraform"
  }
}

# Networking
module "networking" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "5.5.0"

  name = "${local.prefix}-vpc"
  cidr = "10.0.0.0/16"
  azs  = ["us-east-1a", "us-east-1b", "us-east-1c"]

  private_subnets  = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
  public_subnets   = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]
  database_subnets = ["10.0.201.0/24", "10.0.202.0/24", "10.0.203.0/24"]

  enable_nat_gateway   = true
  single_nat_gateway   = false  # Multi-NAT for prod HA
  enable_dns_hostnames = true
  create_database_subnet_group = true

  tags = local.common_tags
}

# EKS Cluster
module "eks" {
  source = "../../modules/eks-cluster"

  prefix             = local.prefix
  kubernetes_version = "1.29"
  vpc_id             = module.networking.vpc_id
  private_subnet_ids = module.networking.private_subnets

  gpu_instance_type = "p3.8xlarge"
  gpu_min_count     = 0
  gpu_desired_count = 2
  gpu_max_count     = 20

  cpu_instance_type = "m5.2xlarge"
  cpu_desired_count = 3
  cpu_max_count     = 30

  enable_public_endpoint = false
  multi_az               = true
  tags                   = local.common_tags
}

# Storage
module "ml_storage" {
  source = "../../modules/ml-storage"

  prefix      = local.prefix
  environment = local.environment
  tags        = local.common_tags
}

# ECR Repositories
module "ecr" {
  source = "../../modules/ecr"

  prefix = local.prefix
  repositories = [
    "training/pytorch",
    "training/tensorflow",
    "serving/triton",
    "preprocessing/spark",
  ]
  tags = local.common_tags
}

# MLflow
module "mlflow" {
  source = "../../modules/mlflow"

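  prefix                     = local.prefix
  environment                = local.environment
  vpc_id                     = module.networking.vpc_id
  database_subnet_ids        = module.networking.database_subnets
  eks_node_security_group_id = module.eks.node_security_group_id

  # The mlflow module shown earlier takes a ready-made role ARN; module.mlflow_irsa
  # is an instance of modules/irsa for the MLflow service account (not shown here).
  mlflow_irsa_role_arn = module.mlflow_irsa.role_arn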

  db_instance_class           = "db.r6g.xlarge"
  db_password                 = var.mlflow_db_password
  multi_az                    = true
  backup_retention_days       = 14
  enable_deletion_protection  = true

  model_artifacts_bucket = module.ml_storage.model_artifacts_bucket_name
  mlflow_artifact_prefix = "mlflow-artifacts"
  mlflow_replica_count   = 3

  tags = local.common_tags
}

# Feature Store
module "feature_store" {
  source = "../../modules/feature-store"

  prefix             = local.prefix
  environment        = local.environment
  vpc_id             = module.networking.vpc_id
  private_subnet_ids = module.networking.private_subnets
  is_production      = local.is_production
  redis_node_type    = "cache.r7g.large"
  redis_auth_token   = var.redis_auth_token

  application_security_group_id = module.eks.node_security_group_id
  tags                          = local.common_tags
}
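
The root configuration reads var.mlflow_db_password and var.redis_auth_token without declaring them. A sketch of the matching declarations follows; the descriptions are assumptions, and the length check reflects ElastiCache's 16 to 128 character requirement for AUTH tokens.

```hcl
# environments/prod/variables.tf (sketch)
variable "mlflow_db_password" {
  description = "Master password for the MLflow PostgreSQL instance"
  type        = string
  sensitive   = true # redacts the value in plan and apply output
}

variable "redis_auth_token" {
  description = "AUTH token for the feature-store Redis replication group"
  type        = string
  sensitive   = true

  validation {
    condition     = length(var.redis_auth_token) >= 16 && length(var.redis_auth_token) <= 128
    error_message = "ElastiCache AUTH tokens must be 16 to 128 characters long."
  }
}
```

Supplying the values through TF_VAR_mlflow_db_password and TF_VAR_redis_auth_token in CI keeps them out of terraform.tfvars and out of Git. Note that sensitive = true only hides values from CLI output; they are still written to state.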

Production Engineering Notes

Module versioning: When using modules from the Terraform registry or your own internal registry, always pin to an exact version (version = "5.5.0", not version = "~> 5.0"). The dependency lock file (.terraform.lock.hcl) records provider versions only, not module versions, so with a floating constraint a terraform init six months from now can resolve a newer module release and change or break your plan.

Dependency between modules: Terragrunt's dependency blocks are the cleanest way to express cross-module dependencies. Without Terragrunt, use a terraform_remote_state data source (backend = "s3", with config pointing at the other stack's bucket and key) to read outputs from other state files.
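
Written out in full, the plain-Terraform approach looks roughly like this. The bucket and key are assumptions and must match whatever backend the networking stack actually uses.

```hcl
# Read outputs from the networking stack's state file (read-only).
data "terraform_remote_state" "networking" {
  backend = "s3"

  config = {
    bucket = "mycompany-terraform-state"
    key    = "environments/prod/networking/terraform.tfstate"
    region = "us-east-1"
  }
}

locals {
  vpc_id             = data.terraform_remote_state.networking.outputs.vpc_id
  private_subnet_ids = data.terraform_remote_state.networking.outputs.private_subnet_ids
}
```

Only root-level outputs of the other stack are visible this way, and the reader needs read access to the whole state object, which is one reason some teams prefer publishing shared values through SSM parameters instead.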

Node group lifecycle: Add lifecycle { ignore_changes = [scaling_config[0].desired_size] } to EKS node groups. The cluster autoscaler changes the desired count at runtime - without this, the next terraform apply reverts the autoscaler's changes and causes unexpected scale-in.

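GPU driver management: The AL2_x86_64_GPU AMI type is the EKS-optimized accelerated AMI, which ships with NVIDIA drivers and the NVIDIA container runtime preinstalled. You still need to deploy the NVIDIA device plugin DaemonSet so that pods can request nvidia.com/gpu resources. If a workload needs a specific driver or CUDA version, use a custom launch template or AMI. On newer Kubernetes versions, check whether the Amazon Linux 2023 equivalent (AL2023_x86_64_NVIDIA) is the better fit, since Amazon Linux 2 is being phased out.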

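Cost guardrails: Tag every resource with Environment and Project. Use AWS Budgets with Terraform (aws_budgets_budget) to alert when spend exceeds a threshold. A budget only notifies by default; for a hard limit on dev environments, attach a budget action (for example one that applies a restrictive IAM policy) and keep node group max_size low.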
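
A minimal sketch of such a budget for the dev environment follows. The limit, threshold, and email address are placeholders, and the tag filter assumes Environment has been activated as a cost allocation tag in the billing console.

```hcl
# Monthly cost budget scoped to resources tagged Environment=dev.
resource "aws_budgets_budget" "dev_monthly" {
  name         = "ml-platform-dev-monthly"
  budget_type  = "COST"
  limit_amount = "2000"
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  cost_filter {
    name   = "TagKeyValue"
    values = ["user:Environment$dev"] # format is user:<tag key>$<tag value>
  }

  # Email the team when actual spend passes 80% of the limit
  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = ["ml-platform@example.com"]
  }
}
```

Adding a second notification block with notification_type = "FORECASTED" gives earlier warning, since it fires when the projected month-end spend crosses the threshold rather than the spend to date.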

Common Mistakes

:::danger Do Not Use terraform apply -auto-approve in Production
This bypasses the most important safety checkpoint - human review of the plan. Every production apply must be reviewed, even for "small" changes. A misconfigured security group rule or wrong instance type can cause an outage or data loss that auto-approve will apply without any warning.
:::

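:::danger Do Not Share IAM Credentials in Terraform State
Terraform stores resource attributes and outputs in state in plaintext. If your configuration creates or passes around AWS access keys (for example an aws_iam_access_key resource or a key exposed as an output), anyone who can read the state file can read those credentials. Use IRSA for EKS workloads, instance profiles for EC2, and AWS Secrets Manager for everything else. Never output secrets, and keep the state bucket encrypted and tightly access-controlled.
:::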

:::warning EKS Node Group Replacement
Changing instance_types on an EKS managed node group forces Terraform to destroy and recreate the node group, and every node in it is drained and terminated. With the default destroy-then-create order, that node group has no capacity until the replacement is ready. Always test node group changes in staging first, and do them during low-traffic windows.
:::

:::warning Terragrunt Dependency Ordering
Terragrunt runs modules in dependency order, but it does not know about resource-level dependencies within modules. If your EKS cluster module creates resources that a later module's data sources need to read, ensure terraform apply for the upstream module completes before running the downstream one.
:::

Interview Q&A

Q: How would you structure Terraform for a team managing the same ML infrastructure across dev, staging, and prod?

The recommended approach is separate state files per environment with shared modules. Structure the repo as environments/dev/, environments/staging/, environments/prod/ - each calling the same modules from modules/. This gives complete isolation between environments (a terraform destroy on dev does not touch prod), with code reuse through modules. Add Terragrunt on top for DRY backend configuration and dependency management. Each environment gets its own .tfvars file with environment-specific sizes and settings. CI runs terraform plan on PRs and terraform apply only on merge to main with environment-specific approval gates.

Q: What is IRSA and why is it critical for ML workloads on EKS?

IRSA (IAM Roles for Service Accounts) allows Kubernetes pods to assume AWS IAM roles using OIDC federation - without any static credentials. Each EKS cluster exposes an OIDC issuer URL, which you register in IAM as an OIDC identity provider. You create an IAM role whose trust policy allows only a specific Kubernetes service account in a specific namespace to assume it. Pods running with that service account get a projected service account token plus two environment variables (AWS_WEB_IDENTITY_TOKEN_FILE, AWS_ROLE_ARN), and the AWS SDK exchanges the token for temporary credentials via sts:AssumeRoleWithWebIdentity. For ML workloads, this means training pods can access S3 training data and write model artifacts without any hardcoded keys. Permissions are scoped per service account and the credentials are short-lived and refreshed automatically, whereas an instance profile grants the same permissions to every pod on the node and static keys in environment variables never expire.

Q: How does Terragrunt solve the multi-environment problem compared to Terraform workspaces?

Terraform workspaces use a single backend configuration and a single codebase with a workspace-keyed state. They feel simple but become messy as environments diverge - you end up with terraform.workspace == "prod" ? ... : ... conditionals everywhere, making the code hard to read. Terragrunt uses a hierarchical configuration inheritance model: root terragrunt.hcl defines shared backend config, provider config, and common inputs; environment-specific terragrunt.hcl files override and extend. Each module gets its own state file automatically. Environments are truly isolated - different state files, potentially different AWS accounts. Terragrunt also adds dependency blocks for cross-module output references and run-all for applying entire environments in dependency order.

Q: What are the trade-offs between using the community EKS Terraform module vs writing your own?

The community EKS module (terraform-aws-modules/eks/aws) is well-maintained, handles many edge cases (IRSA setup, add-ons, launch templates, managed node groups), and is widely used in production. It is the right choice for most teams. The trade-offs: it is opinionated and complex (hundreds of variables), it may lag behind new EKS features, and debugging failures requires understanding someone else's module internals. Writing your own gives full control and simpler code, but you take on the maintenance burden of keeping up with EKS API changes. My recommendation: use the community module for the EKS cluster itself, but write your own thin wrapper modules for the ML-specific parts (IRSA roles, node groups with ML-specific taints and labels, GPU driver configuration).

Q: A terraform plan shows a resource will be destroyed and recreated. How do you handle this in production without downtime?

First, understand why Terraform wants to replace it - many attributes are "forces new resource" and cannot be updated in place. Options: (1) Use create_before_destroy = true in the resource's lifecycle block - Terraform creates the replacement first, then destroys the original, avoiding downtime during the switchover. (2) Target the specific resource with -target to isolate the replacement from other changes. (3) For stateful resources like RDS, create a new instance, migrate data, update application config, then destroy the old instance as a manual process outside Terraform. (4) Sometimes the right answer is to not change the attribute - check if the change is truly necessary.
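
Option (1) applied to the GPU node group from this lesson might look like the sketch below. It swaps the fixed node_group_name for node_group_name_prefix, because the old and new groups must coexist briefly and cannot share a name; the prefix value is an assumption.

```hcl
resource "aws_eks_node_group" "gpu" {
  cluster_name           = aws_eks_cluster.this.name
  node_group_name_prefix = "gpu-workers-" # Terraform appends a unique suffix
  node_role_arn          = aws_iam_role.node.arn
  subnet_ids             = var.private_subnet_ids
  instance_types         = [var.gpu_instance_type]
  ami_type               = "AL2_x86_64_GPU"

  scaling_config {
    desired_size = var.gpu_desired_count
    min_size     = var.gpu_min_count
    max_size     = var.gpu_max_count
  }

  lifecycle {
    # Bring up the replacement group before the old one is drained and deleted
    create_before_destroy = true
    ignore_changes        = [scaling_config[0].desired_size]
  }
}
```

Switching an existing group from node_group_name to node_group_name_prefix is itself a replacement, so make that change during a maintenance window. During later replacements both groups run side by side for a while, which briefly doubles GPU cost and requires enough instance quota for both.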
