IaC Patterns for ML Platforms

From Two Weeks to Two Hours

The ML platform team at a 50-person company had a productivity problem they could not fully explain. Their engineers were talented. Their code was good. But shipping a new ML model - from "it works in a notebook" to "it runs in production" - took two weeks on average. Nobody could point to one reason. It was a thousand small frictions.

Spinning up a new experiment environment meant filing an infrastructure ticket and waiting three days for the platform team to manually provision the right EC2 instances. Adding a new data source required three separate approval emails and a Confluence page update. Deploying a model update meant coordinating a "deployment window" with the SRE team, who needed to be present in case something went wrong. Every GPU request went through a weekly committee that prioritized workloads by team seniority, not business value. The documentation, when it existed at all, described what the infrastructure looked like six months ago.

After adopting the patterns in this lesson, the same team went from two weeks to two hours. Experiment environments self-provisioned in five minutes via a CLI command. Model deployments were PR merges. GPU requests were governed by automated quota policies that allocated based on project tags and auto-scaled down idle clusters. The platform team stopped being a bottleneck and started being a capability multiplier.

The shift was not about one tool or one technique. It was about adopting a platform engineering mindset: build reusable IaC patterns that encode best practices, expose them through golden paths that are easier to follow than to bypass, and use policy-as-code to enforce guardrails automatically rather than manually.

The Platform Engineering Model for ML

Platform engineering treats the internal ML platform as a product. The platform team builds tools, templates, and guardrails that make it easy for ML engineers and data scientists to do the right thing - and hard to do the wrong thing.

Pattern 1 - The ML Platform Module

The highest-leverage IaC pattern is a single module that provisions a complete ML workbench: training cluster, feature store access, model registry, monitoring - everything an ML team needs to go from prototype to production.

# modules/ml-workbench/main.tf
# Single module call creates a complete ML team environment

variable "team_name" {
  description = "ML team identifier (e.g., 'fraud', 'recommendations', 'nlp')"
  type        = string
  validation {
    condition     = can(regex("^[a-z][a-z0-9-]{1,20}$", var.team_name))
    error_message = "team_name must start with a lowercase letter and contain only lowercase letters, digits, and hyphens (2-21 characters)."
  }
}

variable "environment" {
  type    = string
  default = "dev"
}

variable "gpu_quota" {
  description = "Maximum number of GPU instances this team can run simultaneously"
  type        = number
  default     = 4
}

variable "budget_monthly_usd" {
  description = "Monthly AWS budget alert threshold for this team"
  type        = number
  default     = 500
}

variable "experiment_ttl_days" {
  description = "Auto-destroy experiment resources after N days of inactivity"
  type        = number
  default     = 14
}

locals {
  prefix = "eai-${var.team_name}-${var.environment}"

  required_tags = {
    Team        = var.team_name
    Environment = var.environment
    ManagedBy   = "terraform"
    CostCenter  = var.team_name
  }
}

# --- S3 Namespace for the team ---
resource "aws_s3_bucket" "team_workspace" {
  bucket = "${local.prefix}-workspace"
  tags   = merge(local.required_tags, { Purpose = "ml-workspace" })

  lifecycle {
    # lifecycle arguments must be literal values; Terraform rejects
    # expressions such as var.environment == "prod" here.
    prevent_destroy = true
  }
}

resource "aws_s3_bucket_versioning" "team_workspace" {
  bucket = aws_s3_bucket.team_workspace.id
  versioning_configuration { status = "Enabled" }
}

# --- IAM Role for the team's ML workloads ---
# Account ID referenced in the trust policy below
data "aws_caller_identity" "current" {}

data "aws_iam_policy_document" "team_assume_role" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRole"]
    principals {
      type        = "Service"
      identifiers = ["ec2.amazonaws.com", "sagemaker.amazonaws.com"]
    }

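    # Restrict role assumption to requests that arrive through the VPC.
    # Note: aws:SourceVpc is only set for calls made via a VPC endpoint,
    # so verify that the service principals can still assume the role.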
    condition {
      test     = "StringEquals"
      variable = "aws:SourceVpc"
      values   = [var.vpc_id]
    }
  }

  # Allow team engineers to assume this role via their IAM user
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRole"]
    principals {
      type        = "AWS"
      identifiers = ["arn:aws:iam::${data.aws_caller_identity.current.account_id}:root"]
    }

    condition {
      test     = "StringLike"
      variable = "aws:PrincipalTag/Team"
      values   = [var.team_name]
    }
  }
}

resource "aws_iam_role" "team_ml" {
  name               = "${local.prefix}-ml-role"
  assume_role_policy = data.aws_iam_policy_document.team_assume_role.json
  tags               = local.required_tags

  # Maximum session duration for assume-role - 8 hours
  max_session_duration = 28800
}

# Team gets read/write on its own workspace bucket and read-only access to
# its prefix in the shared training data bucket (resource-scoped access)
data "aws_iam_policy_document" "team_s3_access" {
  statement {
    sid     = "TeamWorkspaceReadWrite"
    effect  = "Allow"
    actions = ["s3:GetObject", "s3:PutObject", "s3:DeleteObject", "s3:ListBucket"]
    resources = [
      aws_s3_bucket.team_workspace.arn,
      "${aws_s3_bucket.team_workspace.arn}/*",
    ]
  }

  # Shared training data is read-only
  statement {
    sid     = "SharedTrainingDataReadOnly"
    effect  = "Allow"
    actions = ["s3:GetObject", "s3:ListBucket"]
    resources = [
      var.shared_training_data_bucket_arn,
      "${var.shared_training_data_bucket_arn}/${var.team_name}/*",
    ]
  }
}

resource "aws_iam_role_policy" "team_s3" {
  name   = "s3-access"
  role   = aws_iam_role.team_ml.id
  policy = data.aws_iam_policy_document.team_s3_access.json
}

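# --- Account-level GPU capacity via Service Quotas ---
# Note: EC2 On-Demand quotas are account-wide per Region and measured in
# vCPUs, not instances (a p3.2xlarge uses 8 vCPUs), so convert accordingly.
resource "aws_servicequotas_service_quota" "gpu_instances" {
  quota_code   = "L-417A185B"  # Running On-Demand P instances
  service_code = "ec2"
  value        = var.gpu_quota

  # This requests a quota increase - won't fail if already at value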
  lifecycle {
    ignore_changes = [value]
  }
}

# --- Monthly Budget Alert ---
resource "aws_budgets_budget" "team" {
  name         = "${local.prefix}-monthly-budget"
  budget_type  = "COST"
  limit_amount = tostring(var.budget_monthly_usd)
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  cost_filter {
    name   = "TagKeyValue"
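    # "$${" is an escape in Terraform and would yield a literal "${...}";
    # build the user:<TagKey>$<TagValue> filter with format() instead
    values = [format("user:Team$%s", var.team_name)]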
  }

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80  # Alert at 80% of budget
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = var.budget_alert_emails
  }

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 100  # Alert when forecast to exceed budget
    threshold_type             = "PERCENTAGE"
    notification_type          = "FORECASTED"
    subscriber_email_addresses = var.budget_alert_emails
  }
}

# --- Kubernetes namespace for the team's workloads ---
resource "kubernetes_namespace" "team" {
  metadata {
    name = "ml-${var.team_name}"
    labels = {
      team        = var.team_name
      environment = var.environment
      "pod-security.kubernetes.io/enforce" = "restricted"
    }
  }
}

resource "kubernetes_resource_quota" "team_gpu" {
  metadata {
    name      = "gpu-quota"
    namespace = kubernetes_namespace.team.metadata[0].name
  }

  spec {
    hard = {
      # Extended resources such as GPUs only support the "requests." quota prefix
      "requests.nvidia.com/gpu" = tostring(var.gpu_quota)
      "pods"                    = tostring(var.gpu_quota * 4)  # 4 pods per GPU max
    }
  }
}

# outputs.tf
output "workspace_bucket_name" {
  value = aws_s3_bucket.team_workspace.id
}

output "team_role_arn" {
  value = aws_iam_role.team_ml.arn
}

output "team_namespace" {
  value = kubernetes_namespace.team.metadata[0].name
}
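
The naming and tagging contract of the module can be mirrored outside Terraform, for example in a CLI that needs to know which bucket or namespace a team will get. The sketch below is a hypothetical helper (`workbench_names` is not part of the module); it reuses the `team_name` pattern, the `eai-` prefix, and the required tags shown above.

```python
import re

# Same pattern as the team_name validation in the module above.
TEAM_NAME_PATTERN = re.compile(r"^[a-z][a-z0-9-]{1,20}$")


def workbench_names(team_name: str, environment: str = "dev") -> dict:
    """Return the names and tags the module would derive (hypothetical helper)."""
    if not TEAM_NAME_PATTERN.fullmatch(team_name):
        raise ValueError(f"invalid team_name: {team_name!r}")
    prefix = f"eai-{team_name}-{environment}"
    return {
        "bucket": f"{prefix}-workspace",
        "role": f"{prefix}-ml-role",
        "namespace": f"ml-{team_name}",
        "tags": {
            "Team": team_name,
            "Environment": environment,
            "ManagedBy": "terraform",
            "CostCenter": team_name,
        },
    }
```

Calling `workbench_names("fraud", "prod")` yields the bucket name `eai-fraud-prod-workspace` and the namespace `ml-fraud`, matching what the Terraform locals produce.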

Pattern 2 - Blue-Green Infrastructure Deployment

Blue-green infrastructure means running two identical environments and switching traffic between them atomically. For ML, this is used for zero-downtime model serving infrastructure upgrades (changing instance types, updating Kubernetes versions, etc.).

# modules/blue-green-serving/main.tf

variable "active_color" {
  description = "Which color is currently active"
  type        = string
  default     = "blue"
  validation {
    condition     = contains(["blue", "green"], var.active_color)
    error_message = "active_color must be 'blue' or 'green'."
  }
}

locals {
  inactive_color = var.active_color == "blue" ? "green" : "blue"
}

# Both blue and green deployment groups
resource "aws_autoscaling_group" "serving" {
  for_each = toset(["blue", "green"])

  name                = "${var.prefix}-serving-${each.key}"
  vpc_zone_identifier = var.private_subnet_ids
  min_size            = each.key == var.active_color ? var.min_instances : 0
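  # A max_size of 0 forces the inactive group to scale in, even though
  # desired_capacity is ignored in the lifecycle block below
  max_size            = each.key == var.active_color ? var.max_instances : 0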
  desired_capacity    = each.key == var.active_color ? var.desired_instances : 0

  launch_template {
    id      = aws_launch_template.serving[each.key].id
    version = "$Latest"
  }

  tag {
    key                 = "Color"
    value               = each.key
    propagate_at_launch = true
  }

  tag {
    key                 = "Active"
    value               = each.key == var.active_color ? "true" : "false"
    propagate_at_launch = true
  }

  lifecycle {
    # Terraform manages size - autoscaler manages desired at runtime
    ignore_changes = [desired_capacity]
  }
}

# Target groups for each color
resource "aws_lb_target_group" "serving" {
  for_each = toset(["blue", "green"])

  name     = "${var.prefix}-${each.key}"
  port     = 8080
  protocol = "HTTP"
  vpc_id   = var.vpc_id

  health_check {
    path                = "/health"
    healthy_threshold   = 2
    unhealthy_threshold = 3
    timeout             = 5
    interval            = 10
  }
}

# ALB listener rule - routes 100% to active color
resource "aws_lb_listener_rule" "model_serving" {
  listener_arn = var.alb_listener_arn
  priority     = 100

  action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.serving[var.active_color].arn
  }

  condition {
    path_pattern {
      values = ["/predict/*"]
    }
  }
}

# To switch: change active_color in tfvars and apply.
# The newly active ASG scales up, the previously active ASG scales down,
# and the listener rule switches target groups in the same apply.
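
To reason about a switch before applying it, the module's conditionals can be restated as a small pure function. This is only a sketch: `blue_green_plan` is a hypothetical name, and it covers just `min_size`, `desired_capacity`, and which target group the listener forwards to.

```python
def blue_green_plan(active_color: str, min_instances: int, desired_instances: int) -> dict:
    """Mirror the module's conditionals: the active color gets capacity and traffic."""
    if active_color not in ("blue", "green"):
        raise ValueError("active_color must be 'blue' or 'green'")
    plan = {}
    for color in ("blue", "green"):
        is_active = color == active_color
        plan[color] = {
            "min_size": min_instances if is_active else 0,
            "desired_capacity": desired_instances if is_active else 0,
            "receives_traffic": is_active,
        }
    return plan
```

Comparing `blue_green_plan("blue", 2, 3)` with `blue_green_plan("green", 2, 3)` shows exactly which values a color flip changes.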

Pattern 3 - Canary Infrastructure (Weighted DNS)

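Weighted DNS records shift traffic gradually at the DNS level. This is useful when you run multiple model serving clusters and want to move a percentage of traffic from one to another. Because resolvers cache DNS answers, the observed split only approximates the configured weights.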

# modules/canary-dns/main.tf

resource "aws_route53_record" "model_serving" {
  zone_id = var.hosted_zone_id
  name    = "fraud-detector.internal.${var.domain}"
  type    = "A"

  # Weighted routing - split traffic between stable and canary
  set_identifier = "stable"
  weighted_routing_policy {
    weight = var.stable_weight  # e.g., 90
  }

  alias {
    name                   = var.stable_alb_dns_name
    zone_id                = var.stable_alb_zone_id
    evaluate_target_health = true
  }
}

resource "aws_route53_record" "model_serving_canary" {
  zone_id = var.hosted_zone_id
  name    = "fraud-detector.internal.${var.domain}"
  type    = "A"

  set_identifier = "canary"
  weighted_routing_policy {
    weight = var.canary_weight  # e.g., 10
  }

  alias {
    name                   = var.canary_alb_dns_name
    zone_id                = var.canary_alb_zone_id
    evaluate_target_health = true
  }
}

# Gradual rollout: update stable_weight and canary_weight via tfvars
# stable/canary: 90/10 → 70/30 → 50/50 → 0/100
# Each step is a terraform apply, and each is reversible
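
With weighted routing, each record receives a share of queries equal to its weight divided by the sum of the weights for that name and type. A minimal sketch of that arithmetic (the function name is hypothetical, and the all-zero case is rejected here rather than modelled):

```python
def traffic_share(stable_weight: int, canary_weight: int) -> dict:
    """Expected fraction of DNS queries answered with each record."""
    if stable_weight < 0 or canary_weight < 0:
        raise ValueError("weights must be non-negative")
    total = stable_weight + canary_weight
    if total == 0:
        raise ValueError("this sketch requires at least one non-zero weight")
    return {
        "stable": stable_weight / total,
        "canary": canary_weight / total,
    }
```

For example, weights of 90 and 10 send roughly a tenth of lookups to the canary, while 70 and 30 send roughly three tenths.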

Pattern 4 - Self-Destructing Experiment Environments

GPU clusters are expensive. Experiment environments that are left running while idle can cost thousands of dollars per month. The self-destructing experiment pattern automatically tears down environments once they exceed a configurable time-to-live (TTL), measured here in hours since creation.

# modules/experiment-environment/main.tf

variable "experiment_id" {
  type        = string
  description = "Unique identifier for this experiment"
}

variable "ttl_hours" {
  type        = number
  description = "Auto-destroy N hours after creation"
  default     = 48
}

variable "owner_email" {
  type        = string
  description = "Email to notify before auto-destroy"
}

# EC2 instances for the experiment
resource "aws_instance" "experiment_nodes" {
  count         = var.gpu_count
  ami           = data.aws_ami.training.id
  instance_type = var.gpu_instance_type

  tags = {
    Experiment   = var.experiment_id
    TTLHours     = tostring(var.ttl_hours)
    Owner        = var.owner_email
    CreatedAt    = timestamp()
    ManagedBy    = "terraform"
  }

  lifecycle {
    # Ignore timestamp changes - only the first apply matters
    ignore_changes = [tags["CreatedAt"]]
  }
}

# Lambda that checks for idle experiments and destroys them
resource "aws_lambda_function" "ttl_enforcer" {
  function_name = "experiment-ttl-enforcer"
  role          = aws_iam_role.ttl_enforcer.arn
  runtime       = "python3.12"
  handler       = "index.handler"
  timeout       = 300

  filename         = data.archive_file.ttl_enforcer.output_path
  source_code_hash = data.archive_file.ttl_enforcer.output_base64sha256

  environment {
    variables = {
      SLACK_WEBHOOK_URL = var.slack_webhook_url
      SNS_TOPIC_ARN     = aws_sns_topic.ttl_alerts.arn
    }
  }
}

# EventBridge rule: run TTL enforcer every hour
resource "aws_cloudwatch_event_rule" "ttl_enforcer" {
  name                = "experiment-ttl-check"
  description         = "Check and enforce experiment TTLs"
  schedule_expression = "rate(1 hour)"
}

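resource "aws_cloudwatch_event_target" "ttl_enforcer" {
  rule      = aws_cloudwatch_event_rule.ttl_enforcer.name
  target_id = "TTLEnforcer"
  arn       = aws_lambda_function.ttl_enforcer.arn
}

# Allow EventBridge to invoke the function; without this permission the
# rule fires but the Lambda never runs
resource "aws_lambda_permission" "ttl_enforcer" {
  statement_id  = "AllowEventBridgeInvoke"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.ttl_enforcer.function_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.ttl_enforcer.arn
}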

# lambda/ttl_enforcer/index.py

import boto3
import json
import os
from datetime import datetime, timezone

ec2 = boto3.client("ec2")
sns = boto3.client("sns")


def handler(event, context):
    """
    Find experiment instances that have exceeded their TTL.
    Send warning at 80% of TTL, terminate at 100%.
    """
    now = datetime.now(timezone.utc)

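    # Find all experiment instances (paginate so large fleets are fully covered)
    paginator = ec2.get_paginator("describe_instances")
    pages = paginator.paginate(
        Filters=[
            {"Name": "tag-key", "Values": ["Experiment"]},
            {"Name": "instance-state-name", "Values": ["running", "stopped"]},
        ]
    )
    reservations = (r for page in pages for r in page["Reservations"])

    for reservation in reservations:
        for instance in reservation["Instances"]: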
            tags = {t["Key"]: t["Value"] for t in instance.get("Tags", [])}

            experiment_id = tags.get("Experiment")
            ttl_hours = int(tags.get("TTLHours", "48"))
            created_at_str = tags.get("CreatedAt", "")
            owner_email = tags.get("Owner", "")

            if not created_at_str:
                continue

            created_at = datetime.fromisoformat(created_at_str.replace("Z", "+00:00"))
            age_hours = (now - created_at).total_seconds() / 3600
            ttl_fraction = age_hours / ttl_hours

            if ttl_fraction >= 1.0:
                # TTL exceeded - terminate
                print(f"Terminating experiment {experiment_id} (age: {age_hours:.1f}h, TTL: {ttl_hours}h)")
                ec2.terminate_instances(InstanceIds=[instance["InstanceId"]])

                sns.publish(
                    TopicArn=os.environ["SNS_TOPIC_ARN"],
                    Subject=f"Experiment {experiment_id} terminated (TTL exceeded)",
                    Message=json.dumps({
                        "experiment_id": experiment_id,
                        "instance_id": instance["InstanceId"],
                        "age_hours": round(age_hours, 1),
                        "ttl_hours": ttl_hours,
                        "owner": owner_email,
                        "action": "TERMINATED",
                    }),
                )

            elif ttl_fraction >= 0.8:
                # 80% of TTL - send warning
                hours_remaining = ttl_hours - age_hours
                print(f"Warning: experiment {experiment_id} has {hours_remaining:.1f}h remaining")

                sns.publish(
                    TopicArn=os.environ["SNS_TOPIC_ARN"],
                    Subject=f"Experiment {experiment_id} TTL warning ({hours_remaining:.0f}h remaining)",
                    Message=json.dumps({
                        "experiment_id": experiment_id,
                        "hours_remaining": round(hours_remaining, 1),
                        "owner": owner_email,
                        "action": "WARNING",
                        "extend_command": f"terraform apply -var ttl_hours={ttl_hours + 24}",
                    }),
                )
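
The decision at the heart of the enforcer is easier to test when it is separated from the AWS calls. The sketch below extracts it into a pure function; `ttl_action` is a hypothetical name, and the thresholds are the 80% and 100% used in the handler above.

```python
from datetime import datetime, timezone


def ttl_action(created_at: str, ttl_hours: float, now: datetime) -> str:
    """Return "TERMINATE", "WARN", or "KEEP" for one instance."""
    # The CreatedAt tag holds an RFC 3339 timestamp such as 2024-01-01T00:00:00Z
    created = datetime.fromisoformat(created_at.replace("Z", "+00:00"))
    age_hours = (now - created).total_seconds() / 3600
    fraction = age_hours / ttl_hours
    if fraction >= 1.0:
        return "TERMINATE"
    if fraction >= 0.8:
        return "WARN"
    return "KEEP"
```

Passing `now` in as an argument keeps the function deterministic, so unit tests do not depend on the wall clock.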

Pattern 5 - OPA Policy as Code for Infrastructure

Open Policy Agent (OPA) evaluates infrastructure changes against policy rules before terraform apply runs. This is how you enforce organizational standards automatically - no manual review required for clear-cut violations.

# policies/terraform_policies.rego

package terraform_compliance

import rego.v1

# Policy 1: All resources must have required tags
required_tags := {"Team", "Environment", "ManagedBy", "CostCenter"}

deny contains msg if {
    resource := input.resource_changes[_]
    resource.change.actions[_] in {"create", "update"}
    resource.type != "random_id"    # Skip helper resources

    # Find which required tags are missing
    provided_tags := {k | resource.change.after.tags[k]}
    missing := required_tags - provided_tags
    count(missing) > 0

    msg := sprintf(
        "Resource '%s' (%s) is missing required tags: %v",
        [resource.address, resource.type, missing]
    )
}

# Policy 2: S3 buckets must not be public
deny contains msg if {
    resource := input.resource_changes[_]
    resource.type == "aws_s3_bucket_public_access_block"
    resource.change.actions[_] in {"create", "update"}

    not resource.change.after.block_public_acls == true

    msg := sprintf(
        "S3 bucket public access block '%s' must have block_public_acls = true",
        [resource.address]
    )
}

# Policy 3: RDS instances must have deletion protection in prod
deny contains msg if {
    resource := input.resource_changes[_]
    resource.type == "aws_db_instance"
    resource.change.actions[_] in {"create", "update"}

    resource.change.after.tags.Environment == "prod"
    not resource.change.after.deletion_protection == true

    msg := sprintf(
        "RDS instance '%s' in prod environment must have deletion_protection = true",
        [resource.address]
    )
}

# Policy 4: GPU instances must have a Team tag with a known team
approved_teams := {"fraud", "recommendations", "nlp", "cv", "platform"}

deny contains msg if {
    resource := input.resource_changes[_]
    resource.type == "aws_instance"
    resource.change.actions[_] in {"create", "update"}

    # Check if it's a P-family GPU instance (extend for other GPU families such as G)
    startswith(resource.change.after.instance_type, "p")

    team := resource.change.after.tags.Team
    not team in approved_teams

    msg := sprintf(
        "GPU instance '%s' has Team tag '%s' which is not in approved teams: %v",
        [resource.address, team, approved_teams]
    )
}

# Policy 5: ECR repositories must have image scanning enabled
deny contains msg if {
    resource := input.resource_changes[_]
    resource.type == "aws_ecr_repository"
    resource.change.actions[_] in {"create", "update"}

    not resource.change.after.image_scanning_configuration[0].scan_on_push == true

    msg := sprintf(
        "ECR repository '%s' must have scan_on_push = true",
        [resource.address]
    )
}

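# Policy 6: Enforce encryption on all EBS volumes
# Standalone volumes expose `encrypted` directly
deny contains msg if {
    resource := input.resource_changes[_]
    resource.type == "aws_ebs_volume"
    resource.change.actions[_] in {"create", "update"}

    not resource.change.after.encrypted == true

    msg := sprintf(
        "EBS volume '%s' must have encrypted = true",
        [resource.address]
    )
}

# Instances carry the setting on their root_block_device
deny contains msg if {
    resource := input.resource_changes[_]
    resource.type == "aws_instance"
    resource.change.actions[_] in {"create", "update"}

    root_block := resource.change.after.root_block_device[_]
    not root_block.encrypted == true

    msg := sprintf(
        "Resource '%s' has unencrypted root EBS volume",
        [resource.address]
    )
}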

# Run OPA check in CI before terraform apply
terraform show -json tfplan > plan.json

# Evaluate against policies
opa eval \
  --input plan.json \
  --data policies/terraform_policies.rego \
  --format pretty \
  "data.terraform_compliance.deny"

# Fail CI if any violations found
VIOLATIONS=$(opa eval --input plan.json --data policies/ \
  --format raw "count(data.terraform_compliance.deny)")

if [ "$VIOLATIONS" -gt "0" ]; then
  echo "Policy violations found:"
  opa eval --input plan.json --data policies/ \
    --format pretty "data.terraform_compliance.deny"
  exit 1
fi
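
It can help to see what the required-tags rule does to a plan without running OPA. The following is a rough Python restatement of Policy 1, not a replacement for it; `missing_tag_violations` is a hypothetical name, and the input is assumed to have the `resource_changes` shape that `terraform show -json` produces.

```python
REQUIRED_TAGS = {"Team", "Environment", "ManagedBy", "CostCenter"}


def missing_tag_violations(plan: dict) -> list:
    """List created or updated resources that lack any required tag."""
    violations = []
    for rc in plan.get("resource_changes", []):
        actions = set(rc["change"]["actions"])
        if not actions & {"create", "update"}:
            continue  # deletes and no-ops are not checked
        if rc["type"] == "random_id":
            continue  # helper resources are skipped, as in the Rego rule
        after = rc["change"].get("after") or {}
        tags = after.get("tags") or {}
        missing = REQUIRED_TAGS - set(tags)
        if missing:
            violations.append(
                f"{rc['address']} is missing required tags: {sorted(missing)}"
            )
    return violations
```

A function like this is handy for quick local checks of a `plan.json`, while the Rego policy stays the source of truth in CI.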

Pattern 6 - GPU Quota Management

GPU instances are expensive and scarce. Without quota management, one team can consume all available GPUs, blocking everyone else.

# modules/gpu-quota-manager/main.tf

# Resource quotas per namespace
resource "kubernetes_resource_quota" "gpu" {
  for_each = var.team_gpu_allocations

  metadata {
    name      = "gpu-quota"
    namespace = "ml-${each.key}"
  }

  spec {
    hard = {
      # Extended resources such as GPUs only support the "requests." quota prefix
      "requests.nvidia.com/gpu" = tostring(each.value.max_gpus)
    }
  }
}

# Priority classes: high-priority for prod, low-priority for experiments
resource "kubernetes_priority_class" "prod_serving" {
  metadata {
    name = "ml-prod-serving"
  }

  value          = 1000
  global_default = false
  description    = "Production model serving - preempts experiment workloads"
}

resource "kubernetes_priority_class" "experiment" {
  metadata {
    name = "ml-experiment"
  }

  value          = 100
  global_default = false
  description    = "Experiment workloads - preempted by prod serving"
  preemption_policy = "PreemptLowerPriority"
}

resource "kubernetes_priority_class" "background" {
  metadata {
    name = "ml-background"
  }

  value          = 10
  global_default = false
  description    = "Background jobs (data preprocessing, evaluation) - never preempts"
  preemption_policy = "Never"
}

# gpu_quota_reporter.py - daily report on GPU utilization by team

import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")


def get_gpu_utilization_by_team(days: int = 7) -> dict:
    """
    Query CloudWatch for GPU utilization metrics grouped by Team tag.
    Returns average utilization per team over the specified period.
    """
    end_time = datetime.now(timezone.utc)
    start_time = end_time - timedelta(days=days)

    teams = ["fraud", "recommendations", "nlp", "cv"]
    results = {}

    for team in teams:
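        # GPU utilization is not part of the AWS/EC2 namespace. This assumes the
        # CloudWatch agent publishes NVIDIA metrics with a custom Team dimension.
        response = cloudwatch.get_metric_statistics(
            Namespace="CWAgent",
            MetricName="nvidia_smi_utilization_gpu",
            Dimensions=[
                {"Name": "Team", "Value": team}
            ],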
            StartTime=start_time,
            EndTime=end_time,
            Period=3600,    # 1-hour buckets
            Statistics=["Average"],
        )

        if response["Datapoints"]:
            avg_util = sum(d["Average"] for d in response["Datapoints"]) / len(response["Datapoints"])
            results[team] = round(avg_util, 1)
        else:
            results[team] = 0.0

    return results


def identify_idle_clusters(threshold_percent: float = 10.0) -> list:
    """Find teams with GPU utilization below threshold - idle clusters to reclaim."""
    utilization = get_gpu_utilization_by_team(days=3)
    idle = [
        {"team": team, "avg_utilization": util}
        for team, util in utilization.items()
        if util < threshold_percent
    ]
    return idle
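
One subtlety when reclaiming idle clusters is that "no datapoints" is not the same as "0% utilization": a team whose metrics are not being published should not be flagged as idle. The sketch below separates the two cases; the function names are hypothetical, and the datapoints are assumed to have the CloudWatch-style `Average` field used above.

```python
from typing import Optional


def average_utilization(datapoints: list) -> Optional[float]:
    """Average the "Average" field of the datapoints; None if there is no data."""
    if not datapoints:
        return None
    total = sum(d["Average"] for d in datapoints)
    return round(total / len(datapoints), 1)


def classify_teams(utilization: dict, threshold_percent: float = 10.0) -> dict:
    """Split teams into idle, active, and no_data groups."""
    groups = {"idle": [], "active": [], "no_data": []}
    for team, util in utilization.items():
        if util is None:
            groups["no_data"].append(team)
        elif util < threshold_percent:
            groups["idle"].append(team)
        else:
            groups["active"].append(team)
    return groups
```

Teams in the `no_data` group are better handled as a monitoring gap to investigate than as capacity to reclaim.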

Module Registry Structure

terraform-modules/              # Private module registry
├── modules/
│   ├── ml-workbench/           # v1.2.0
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   ├── CHANGELOG.md
│   │   └── examples/
│   │       ├── basic/
│   │       └── full-platform/
│   ├── experiment-environment/ # v2.0.0
│   ├── gpu-quota-manager/      # v1.0.0
│   ├── blue-green-serving/     # v1.1.0
│   ├── canary-dns/             # v1.0.0
│   └── feature-store/         # v3.1.0
├── policies/                   # OPA policies
│   ├── terraform_policies.rego
│   ├── cost_policies.rego
│   └── security_policies.rego
└── templates/                  # Golden path templates
    ├── new-ml-project/
    ├── new-model-serving/
    └── new-experiment/

# Using versioned modules from the private registry
module "fraud_team_workbench" {
  source  = "git::https://github.com/myorg/terraform-modules.git//modules/ml-workbench?ref=v1.2.0"

  team_name          = "fraud"
  environment        = "prod"
  gpu_quota          = 8
  budget_monthly_usd = 5000
  owner_email        = "fraud-team@example.com"  # placeholder address

  vpc_id             = module.networking.vpc_id
  private_subnet_ids = module.networking.private_subnets
}

Backstage - The ML Platform Portal

Backstage (by Spotify) is an open-source internal developer portal. For ML platforms, it serves as the entry point for all platform capabilities: browsing the module catalog, creating new projects from golden path templates, viewing model deployment status, and tracking costs by team.

# catalog-info.yaml - ML Workbench component in Backstage
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: fraud-detection-platform
  title: Fraud Detection ML Platform
  description: End-to-end fraud detection ML workbench
  annotations:
    github.com/project-slug: myorg/fraud-detection
    backstage.io/techdocs-ref: dir:.
    argocd/app-name: fraud-detector
  tags:
    - ml
    - fraud
    - tensorflow
spec:
  type: ml-platform
  lifecycle: production
  owner: group:fraud-team
  system: ml-platform

---
# Software Template - create a new ML project from golden path
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: ml-project-template
  title: New ML Project
  description: Creates a new ML project with all platform boilerplate
  tags:
    - ml
    - terraform
    - recommended

spec:
  owner: group:platform-team
  type: ml-project

  parameters:
    - title: Project Information
      required: [team_name, project_name, gpu_quota]
      properties:
        team_name:
          title: Team Name
          type: string
          description: Your team identifier (lowercase, no spaces)
          pattern: '^[a-z][a-z0-9-]{1,20}$'

        project_name:
          title: Project Name
          type: string
          description: Short name for the ML project

        gpu_quota:
          title: GPU Quota (max simultaneous GPUs)
          type: integer
          default: 4
          minimum: 1
          maximum: 16

        budget_monthly_usd:
          title: Monthly Budget (USD)
          type: integer
          default: 500

  steps:
    - id: fetch-template
      name: Fetch Template
      action: fetch:template
      input:
        url: ./skeleton
        values:
          team_name: ${{ parameters.team_name }}
          project_name: ${{ parameters.project_name }}
          gpu_quota: ${{ parameters.gpu_quota }}

    - id: create-repo
      name: Create GitHub Repository
      action: publish:github
      input:
        repoUrl: github.com?repo=${{ parameters.project_name }}&owner=myorg
        description: ML project for ${{ parameters.team_name }}

    - id: create-terraform-pr
      name: Create Infrastructure PR
      action: publish:github:pull-request
      input:
        repoUrl: github.com?repo=ml-platform-infra&owner=myorg
        title: Add ML workbench for ${{ parameters.team_name }}/${{ parameters.project_name }}
        branchName: add-${{ parameters.team_name }}-${{ parameters.project_name }}
        description: |
          Auto-generated by Backstage ML Project Template.
          Creates workspace for team ${{ parameters.team_name }}.

Anti-Patterns to Avoid

Anti-Pattern: Managing Model Weights with Terraform

# WRONG - never do this
resource "aws_s3_object" "model_weights" {
  bucket = aws_s3_bucket.models.id
  key    = "fraud-detector/v2.1/model.pt"
  source = "../../model-artifacts/model.pt"   # Binary artifact managed by Terraform!
}
# This couples every model release to an infrastructure apply: the 500MB file
# must exist on the machine running Terraform, and each new model version
# becomes a Terraform diff and upload.
# Model artifacts should be managed by the ML pipeline (MLflow, DVC),
# not by infrastructure-as-code.

# CORRECT - manage the bucket, not the contents
resource "aws_s3_bucket" "models" {
  bucket = "${local.prefix}-model-artifacts"
}

# MLflow / training pipeline uploads model.pt directly to S3
# Terraform only knows about the bucket infrastructure

Anti-Pattern: Hard-Coded IAM ARNs Instead of ABAC

# WRONG - list every team member's ARN in the policy
data "aws_iam_policy_document" "model_access" {
  statement {
    principals {
      type = "AWS"
      identifiers = [
        "arn:aws:iam::123:user/alice",
        "arn:aws:iam::123:user/bob",
        "arn:aws:iam::123:user/charlie",
        # Have to update this every time someone joins/leaves!
      ]
    }
    actions   = ["s3:GetObject"]
    resources = ["arn:aws:s3:::models/*"]
  }
}

# CORRECT - use ABAC: IAM conditions based on tags
data "aws_iam_policy_document" "model_access" {
  statement {
    principals {
      type        = "AWS"
      identifiers = ["arn:aws:iam::123:root"]  # Any principal in the account
    }
    actions   = ["s3:GetObject"]
    resources = ["arn:aws:s3:::${local.prefix}-*/*"]

    condition {
      test     = "StringEquals"
      variable = "aws:PrincipalTag/Team"
      values   = [var.team_name]  # Only principals tagged with this team
    }
  }
}
# Add a new engineer → tag their IAM user/role with Team=fraud
# Automatically gets access. No Terraform change required.
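
The access decision behind the ABAC version comes down to one tag comparison. As a minimal sketch (a hypothetical helper, not an IAM policy simulator), the `StringEquals` condition on `aws:PrincipalTag/Team` behaves like this:

```python
def abac_allows(principal_tags: dict, team_name: str) -> bool:
    """StringEquals is an exact, case-sensitive match; a missing tag never matches."""
    return principal_tags.get("Team") == team_name
```

Onboarding then becomes a tagging operation: a principal tagged `Team=fraud` passes the check, and removing or changing the tag revokes access without touching the policy.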

Production Engineering Notes

Module versioning lifecycle: When you update a module, use semantic versioning. Patch (1.0.1): bug fix, backward compatible. Minor (1.1.0): new optional feature, backward compatible. Major (2.0.0): breaking change to interface. Announce major versions in Slack, give teams a migration window. Use source = "...?ref=v1.2.0" (not main) everywhere in production.
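
The versioning rules above can be turned into a small check, for instance in a bot that proposes module upgrades. This is a sketch under the assumption that tags follow the `vMAJOR.MINOR.PATCH` form used in this lesson; the function names are hypothetical.

```python
def parse_version(tag: str) -> tuple:
    """Parse a tag such as 'v1.2.0' into (1, 2, 0)."""
    major, minor, patch = tag.lstrip("v").split(".")
    return int(major), int(minor), int(patch)


def bump_type(current: str, candidate: str) -> str:
    """Classify the change from current to candidate."""
    cur, new = parse_version(current), parse_version(candidate)
    if new <= cur:
        return "none"  # same version or a downgrade
    if new[0] != cur[0]:
        return "major"
    if new[1] != cur[1]:
        return "minor"
    return "patch"


def safe_to_auto_upgrade(current: str, candidate: str) -> bool:
    """Patch and minor releases are backward compatible; major ones need a migration."""
    return bump_type(current, candidate) in {"patch", "minor"}
```

With this in place, a move from `v1.2.0` to `v1.3.0` could be proposed automatically, while `v2.0.0` would be routed to the announced migration window.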

Testing IaC modules: Test your platform modules before releasing to consumers. Use Terratest (Go) or pytest with the Pulumi Automation API to spin up real infrastructure in a sandbox account, run assertions, and destroy. Testing that a module creates resources with the right tags, the right security group rules, and the right bucket policies prevents "works in the registry, breaks in prod" situations.

Cost tagging enforcement: Make missing tags a hard failure in OPA, not a warning. Without a Team tag on every resource you cannot attribute costs, and you end up with a single AWS bill that nobody owns. Enforce it at the policy level on day one.

The platform team as enabler: The platform team's job is to make the ML team fast. Every ticket that the ML team files to the platform team is a failure mode. The goal is: ML engineer wants a GPU cluster → runs a CLI command → gets a GPU cluster in 5 minutes. The platform team builds the CLI command (the golden path), not the GPU cluster for each team.

Common Mistakes

:::danger Do Not Allow Manual kubectl or aws CLI Changes in Production

Any manual change bypasses policy enforcement, audit trails, and GitOps reconciliation. Enforce this with IAM: the production cluster's admin access should require MFA and be granted only to a specific break-glass role with CloudTrail alerting on every use. Day-to-day operations should never require this role - everything goes through GitOps.

:::

:::danger Snowflake Modules Defeat the Purpose of IaC

A snowflake module is one that has been individually modified for each team until it no longer resembles the original. This happens when teams fork the module instead of contributing back. Prevent it by making the shared module flexible enough to cover real use cases (good variable design), and by making the contribution process fast (open a PR, it gets reviewed in 24 hours).

:::

:::warning Policy Violations Should Block CI, Not Just Warn

OPA policies that print warnings but allow the apply to continue are ignored within weeks. Policy violations must fail the CI pipeline, block the PR, and require either fixing the violation or getting an explicit exception (which creates an audit trail). Soft enforcement degrades to no enforcement.

:::

:::warning Experiment TTL Requires Notification - Not Just Termination

An experiment environment that disappears without warning can destroy hours of work. The TTL enforcer should send a warning at 80% of TTL with a one-click extension link, a second warning at 95%, and then terminate at 100% with a final notification that includes a summary of what was destroyed. Ungraceful termination breeds resentment toward the platform team.

:::
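The 80% / 95% / 100% schedule can be expressed as a small pure function that a TTL enforcer calls on each pass. This is a minimal sketch, not the lesson's enforcer: the function name, the stage labels, and the already_sent bookkeeping are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def ttl_action(created_at, ttl, now, already_sent=frozenset()):
    """Return the next step for an environment: 'warn_80', 'warn_95', 'terminate', or None.

    already_sent holds the warning stages that were delivered earlier, so a
    periodic enforcer run does not repeat the same notification.
    """
    elapsed = (now - created_at) / ttl  # fraction of the TTL that has passed
    if elapsed >= 1.0:
        return "terminate"
    if elapsed >= 0.95:
        stage = "warn_95"
    elif elapsed >= 0.80:
        stage = "warn_80"
    else:
        stage = None
    return stage if stage not in already_sent else None


created = datetime(2024, 1, 1, tzinfo=timezone.utc)
ttl = timedelta(hours=48)
print(ttl_action(created, ttl, created + timedelta(hours=39)))
```

Keeping the decision separate from the side effects (sending the message, destroying the stack) makes the schedule easy to unit-test without touching real infrastructure.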

Interview Q&A

Q: What is the golden path concept in platform engineering and how does it apply to ML infrastructure?

A golden path is an opinionated, curated set of tools and templates that represents the recommended way to do a given task. It is golden because it encodes best practices by default - security, tagging, cost management, monitoring are all built in. For ML infrastructure, a golden path for "start a new experiment" might be a CLI command (eai create experiment --team fraud --gpu-count 4) that calls the Pulumi Automation API, provisions a complete experiment environment from the ml-experiment module, registers it in Backstage, and sets a 48-hour TTL. The ML engineer gets a working environment in 5 minutes without knowing anything about Terraform, IAM, or VPCs. Golden paths succeed when they are easier to follow than to bypass - if following the golden path is harder than filing a ticket or running kubectl directly, engineers will not use it.
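To make the CLI shape concrete, here is a minimal argparse sketch of the eai create experiment command. It only parses flags and builds an environment spec; the provisioning call and Backstage registration are left out. The --ttl-hours flag, the spec fields, and the function names are hypothetical, and only --team, --gpu-count, the ml-experiment module name, and the 48-hour default come from the answer above.

```python
import argparse
from datetime import datetime, timedelta, timezone


def build_parser():
    """Parser for: eai create experiment --team fraud --gpu-count 4"""
    parser = argparse.ArgumentParser(prog="eai")
    commands = parser.add_subparsers(dest="command", required=True)
    create = commands.add_parser("create")
    kinds = create.add_subparsers(dest="kind", required=True)
    experiment = kinds.add_parser("experiment")
    experiment.add_argument("--team", required=True)
    experiment.add_argument("--gpu-count", type=int, default=1)
    experiment.add_argument("--ttl-hours", type=int, default=48)  # hypothetical flag
    return parser


def experiment_spec(argv, now=None):
    """Turn CLI arguments into the spec a provisioning backend would consume."""
    args = build_parser().parse_args(argv)
    now = now or datetime.now(timezone.utc)
    return {
        "module": "ml-experiment",
        "team": args.team,
        "gpu_count": args.gpu_count,
        # Tags are filled in by the golden path, not by the engineer.
        "tags": {"Team": args.team, "Environment": "experiment"},
        "expires_at": (now + timedelta(hours=args.ttl_hours)).isoformat(),
    }


print(experiment_spec(["create", "experiment", "--team", "fraud", "--gpu-count", "4"]))
```

The point of the design is that tagging and TTL are defaults the engineer never has to think about, which is what makes the golden path easier to follow than to bypass.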

Q: How would you design GPU quota management for a 50-person ML team with multiple projects?

Multi-layer approach: (1) AWS Service Quotas - set account-level limits on p-instance family, preventing runaway spend at the cloud level. (2) Kubernetes ResourceQuota - per-namespace GPU limits enforce per-team allocations at the cluster level, independent of the cloud quota. (3) Priority classes - production serving gets priority 1000, experiments get priority 100. When the cluster is full, Kubernetes preempts experiment pods to make room for serving pods. (4) TTL enforcement - idle experiment clusters are automatically destroyed after 48 hours, reclaiming GPUs for active work. (5) Quota requests - teams can request temporary quota increases through a lightweight process (Slack bot → approval → temporary quota bump with auto-expiry). Track utilization daily and reallocate quota from low-utilization teams to high-utilization ones quarterly.
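Layers (2) and (3) map onto two Kubernetes objects. The sketch below generates them as plain dictionaries that can be serialized to JSON and applied with kubectl. It assumes GPUs are exposed as the nvidia.com/gpu extended resource (as with the NVIDIA device plugin). The namespace, object names, and per-team limit are hypothetical.

```python
import json


def gpu_quota(namespace: str, gpu_limit: int) -> dict:
    """ResourceQuota capping the total GPUs requested in one team namespace."""
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": "gpu-quota", "namespace": namespace},
        # Quotas on extended resources use the requests.<resource> key.
        "spec": {"hard": {"requests.nvidia.com/gpu": str(gpu_limit)}},
    }


def priority_class(name: str, value: int, description: str) -> dict:
    """PriorityClass; higher-value pods can preempt lower-value pods when the cluster is full."""
    return {
        "apiVersion": "scheduling.k8s.io/v1",
        "kind": "PriorityClass",
        "metadata": {"name": name},
        "value": value,
        "globalDefault": False,
        "description": description,
    }


manifests = [
    gpu_quota("team-fraud", 8),  # hypothetical allocation
    priority_class("ml-serving", 1000, "Production model serving"),
    priority_class("ml-experiment", 100, "Preemptible experiment workloads"),
]
print(json.dumps(manifests, indent=2))
```

Pods opt into a priority through priorityClassName in their spec, so a golden-path template can set it automatically for experiment and serving workloads.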

Q: What is an OPA policy in the context of IaC and how would you use it for cost governance?

OPA (Open Policy Agent) evaluates JSON input against Rego policy rules and returns allow/deny decisions. In the IaC context, the pipeline is: run terraform plan -out=tfplan, convert the binary plan to JSON with terraform show -json tfplan > plan.json, evaluate plan.json with OPA, and fail CI if violations are found. For cost governance specifically: (1) Required cost tags - deny any resource creation without Team, Environment, and CostCenter tags. (2) Instance type limits - deny GPU instance types above p3.2xlarge in dev environments. (3) Storage limits - deny RDS instances with allocated_storage > 1000 without an explicit override tag. (4) Public endpoints - deny creation of any public-facing resource (internet-facing ALB, publicly accessible RDS, public S3 ACL) without a policy exception. OPA runs in CI before any infrastructure is created, so violations are caught in PRs rather than after deployment. This is "shift-left" for infrastructure compliance.
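The shape of such a check is easier to see against real plan JSON. The sketch below is a Python stand-in for rules (1) and (2), not Rego and not OPA itself. It reads the resource_changes array that terraform show -json produces. The set of taggable resource types and the dev allowlist are assumptions, and only P-family instances are treated as GPU types here.

```python
REQUIRED_TAGS = {"Team", "Environment", "CostCenter"}
# Assumption: only these resource types are checked for tags in this sketch.
TAGGABLE_TYPES = {"aws_instance", "aws_s3_bucket", "aws_db_instance"}
# Assumption: "above p3.2xlarge" is modeled as an allowlist of permitted P-family types.
DEV_ALLOWED_GPU_TYPES = {"p3.2xlarge"}


def violations(plan: dict) -> list:
    """Return human-readable violations for resources the plan would create."""
    found = []
    for rc in plan.get("resource_changes", []):
        change = rc.get("change", {})
        if "create" not in change.get("actions", []):
            continue  # only newly created resources are checked
        if rc.get("type") not in TAGGABLE_TYPES:
            continue
        after = change.get("after") or {}
        tags = after.get("tags") or {}

        missing = REQUIRED_TAGS - set(tags)
        if missing:
            found.append(f"{rc['address']}: missing tags {sorted(missing)}")

        instance_type = after.get("instance_type", "")
        if (
            tags.get("Environment") == "dev"
            and instance_type.startswith("p")
            and instance_type not in DEV_ALLOWED_GPU_TYPES
        ):
            found.append(f"{rc['address']}: {instance_type} not allowed in dev")
    return found
```

In CI, a wrapper would load plan.json, call violations, print the results, and exit non-zero when the list is non-empty. That non-zero exit is what turns a warning into a blocked PR.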

Q: Describe the blue-green infrastructure deployment pattern and when you would use it for ML.

Blue-green maintains two identical environments (blue = current production, green = new version). Traffic is 100% on one color at all times. To deploy: (1) bring up the green environment to full scale while blue serves all traffic; (2) run smoke tests and validation against green; (3) atomically switch traffic from blue to green (DNS change, load balancer update, or Route 53 weighted routing flip); (4) monitor green for a period; (5) if healthy, keep blue running through the rollback window so it remains an instant rollback target, then scale it down; if unhealthy, flip traffic back to blue in seconds. For ML infrastructure, use blue-green for: Kubernetes version upgrades (bring up a new node group, drain the old one), changing instance types for model serving (bring up a new ASG with the new instance type, validate performance, switch traffic), and major model serving framework updates where in-place upgrades are risky. The key advantage over in-place upgrades: rollback is a traffic switch, not a re-deployment.
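The five steps reduce to a short control flow. The sketch below models it with injected callables, so the same logic can drive a DNS flip, a load balancer update, or a test double. All function and parameter names are hypothetical.

```python
from typing import Callable


def blue_green_cutover(
    provision_green: Callable[[], None],
    smoke_test: Callable[[], bool],
    switch_traffic: Callable[[str], None],
    green_healthy: Callable[[], bool],
) -> str:
    """Run the cutover and return the color serving traffic at the end."""
    provision_green()           # (1) green scales up while blue keeps serving
    if not smoke_test():        # (2) validate green before any traffic moves
        return "blue"
    switch_traffic("green")     # (3) one switch, all traffic
    if not green_healthy():     # (4) monitoring window
        switch_traffic("blue")  # (5) rollback is a traffic switch, not a redeploy
        return "blue"
    return "green"              # blue can be scaled down after the rollback window


# Usage with stand-in callables that record what happened.
events = []
result = blue_green_cutover(
    provision_green=lambda: events.append("provision"),
    smoke_test=lambda: True,
    switch_traffic=events.append,
    green_healthy=lambda: False,
)
print(result, events)
```

Note that blue is never touched inside the function. Scaling it down is a separate, later decision, which is what keeps the rollback path cheap.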

Q: A junior engineer on your team wants to add a new feature to a shared Terraform module that already has 20 consumers. What process do you follow?

Follow module versioning discipline: (1) Create a feature branch in the modules repo. (2) Write the change - new variable with a default value that preserves existing behavior (backward-compatible addition). (3) Update the CHANGELOG (v1.3.0 with the new feature description). (4) Write a test in Terratest that validates the new feature. (5) Open a PR - another platform engineer reviews for backward compatibility, correct defaults, and test coverage. (6) Merge and tag the release (git tag v1.3.0). (7) Announce to the ML engineering Slack channel: "Module ml-workbench v1.3.0 released - new optional feature X. Existing users at v1.2.0 are not affected; opt-in by updating your source ref." (8) Never force consumers to upgrade - they upgrade at their own pace. For breaking changes (v2.0.0): draft a migration guide, run a workshop, give teams a migration window (typically 4-6 weeks), and offer to pair-program the migration for complex cases. Breaking modules without warning destroys trust in the platform team.
