Module 13 - Infrastructure as Code for ML
ML teams that manage infrastructure manually operate at a permanent disadvantage. Every GPU cluster spun up by hand, every S3 bucket created through the console, every environment configured from memory is infrastructure debt accumulating interest. This module teaches you to codify ML infrastructure so it is reproducible, auditable, and automatable.
What You Will Learn
Infrastructure as Code (IaC) is the practice of defining and managing infrastructure through machine-readable configuration files rather than manual processes or interactive tools. For ML teams, IaC is the difference between "it works on my cluster" and infrastructure that can be recreated identically in minutes.
Module Lessons
| # | Lesson | What You Learn |
|---|---|---|
| 01 | IaC for ML Teams | Why IaC, declarative vs imperative, IaC for ML-specific resources, drift detection |
| 02 | Terraform Fundamentals | HCL syntax, providers, state management, workspaces, modules |
| 03 | Terraform for ML Infrastructure | GPU autoscaling, S3 data lakes, SageMaker, EKS in HCL |
| 04 | Pulumi for ML | Python-native IaC, Automation API, self-service training clusters |
| 05 | GitOps for ML | ArgoCD, Flux, progressive delivery, model deployment lifecycle |
| 06 | Environment Parity | Dev/staging/prod parity, environment promotion, configuration management |
| 07 | IaC Patterns for ML Platforms | Reusable modules, platform engineering, ML infra patterns |
Key Concepts at a Glance
Declarative vs Imperative: Declarative IaC (Terraform, Pulumi) describes what you want, and the tool works out how to get there. Imperative approaches (bash scripts, step-by-step Ansible playbooks) describe how, so you manage the sequence and the failure handling yourself. Declarative tends to win for infrastructure because the tool converges on the declared end state, which makes reruns idempotent by design.
Idempotency: Running the same IaC definition ten times produces the same result as running it once. This is non-negotiable for ML infrastructure - you need to be able to recreate environments deterministically.
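To make the two ideas above concrete, here is a minimal sketch in plain Python. It is not how any real IaC tool is implemented; the in-memory `cluster` list and the functions `add_gpu_node` and `ensure_gpu_nodes` are hypothetical names invented for illustration. The imperative step changes the outcome every time it runs, while the declarative-style step converges on a desired count and is safe to rerun.

```python
# Hypothetical in-memory "cluster": just a list of GPU node names.

def add_gpu_node(cluster: list[str]) -> None:
    """Imperative step: always adds one more node, whatever already exists."""
    cluster.append(f"gpu-node-{len(cluster)}")


def ensure_gpu_nodes(cluster: list[str], desired: int) -> None:
    """Declarative-style step: converge the cluster to the desired node count."""
    while len(cluster) < desired:
        cluster.append(f"gpu-node-{len(cluster)}")
    while len(cluster) > desired:
        cluster.pop()


imperative_cluster: list[str] = []
for _ in range(3):
    add_gpu_node(imperative_cluster)  # every rerun changes the result

declarative_cluster: list[str] = []
for _ in range(10):
    ensure_gpu_nodes(declarative_cluster, desired=2)  # reruns after the first are no-ops

print(len(imperative_cluster))   # 3
print(len(declarative_cluster))  # 2
```

Running the imperative function three times leaves three nodes; running the declarative one ten times still leaves exactly two. That second property is what the module means by idempotency.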
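State: Terraform maintains a state file that maps your HCL definitions to real cloud resources. This is how it knows what to create, update, or destroy. Remote state with locking (on AWS, commonly an S3 backend with DynamoDB locking) is effectively required for team use, so that two people cannot apply conflicting changes at the same time.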
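The role of state can be sketched as a simple diff, shown here in Python. This is a simplified model under stated assumptions, not Terraform's actual algorithm: real Terraform also refreshes state against the provider's API and orders actions by dependency. The `plan` function and the resource names and attributes below are hypothetical.

```python
def plan(config: dict[str, dict], state: dict[str, dict]) -> dict[str, list[str]]:
    """Compare the desired config with recorded state and list the actions needed."""
    return {
        # Declared in config but not yet tracked in state.
        "create": sorted(name for name in config if name not in state),
        # Tracked in state, but its attributes differ from the config.
        "update": sorted(
            name for name in config if name in state and config[name] != state[name]
        ),
        # Tracked in state but no longer declared in config.
        "destroy": sorted(name for name in state if name not in config),
    }


# Hypothetical resources; names and attributes are illustrative only.
config = {
    "training_bucket": {"versioning": True},
    "gpu_node_group": {"instance_type": "p4d.24xlarge", "max_size": 8},
}
state = {
    "training_bucket": {"versioning": False},
    "old_notebook_instance": {"instance_type": "ml.t3.medium"},
}

actions = plan(config, state)
print(actions)
```

Here the plan creates `gpu_node_group`, updates `training_bucket`, and destroys `old_notebook_instance`. Without the recorded state, the tool could not tell an undeclared resource it once created from one it never managed.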
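GitOps: The git repository is the source of truth. Changes flow through PRs, not CLI commands. An agent running in the cluster (such as ArgoCD or Flux) continuously reconciles the live state to match the desired state in git. Without GitOps, Kubernetes deployments tend to rely on manual commands and drift away from what is recorded in version control.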
:::tip Prerequisites
You should be comfortable with basic cloud concepts (EC2, S3, VPC), have used a cloud CLI (aws, gcloud, or az), and understand Kubernetes basics before this module. Python familiarity helps for the Pulumi lessons.
:::
Why IaC Matters Specifically for ML
ML infrastructure has unique properties that make IaC even more critical than for traditional software:
- GPU resources are expensive - a misconfigured autoscaler can cost thousands of dollars per hour; IaC makes the configuration reviewable before it goes live
- Experiments require reproducible environments - if two researchers use slightly different CUDA versions, results are not comparable
- Models must be auditable - regulators increasingly require proof of what infrastructure ran a training job
- Data pipelines span many services - S3, Spark, feature stores, model registries; IaC documents these dependencies explicitly
- Teams scale rapidly - onboarding a new data scientist to a manually configured cluster can take a week of setup; IaC reduces it to a single command
