Terraform Fundamentals
The 3 AM Infrastructure Fire
It's 3 AM. Your model serving cluster is down. Alert after alert floods Slack. You SSH into the bastion host, navigate the undocumented maze of manually-created AWS resources, and eventually find the culprit: someone changed a security group rule last Tuesday "just to test something" and never changed it back. No ticket. No audit trail. No way to know what the correct state even looks like.
Your "infrastructure" is a collection of tribal knowledge, Confluence documents with screenshots from 2022, and one engineer who has memorized the VPC CIDR ranges because nobody else needs to know them - until now, at 3 AM, when that engineer is on vacation in Portugal and not answering their phone.
This scenario plays out every week at ML teams that treat infrastructure as a craft - something built by hand, refined through experience, never documented in machine-readable form. The GPU cluster was spun up by following a blog post. The feature store was created by clicking through the AWS console. The model registry S3 bucket exists because someone ran aws s3 mb eighteen months ago and the bucket name is in nobody's runbook.
Terraform exists to make this nightmare impossible. When your infrastructure is code, it is version-controlled, reviewed, tested, and reproducible. The 3 AM incident becomes: check out the known-good commit, run terraform apply, watch the correct state materialize. No archaeology. No tribal knowledge required. No calls to Portugal.
This lesson builds your Terraform foundation from the ground up. We cover every core concept with real HCL examples sized for ML infrastructure - VPCs, S3 buckets, IAM roles, ECR repositories. By the end you will understand not just how Terraform works, but why it was designed the way it was, and how to use it correctly in a team environment.
Why This Exists
Before Terraform (released 2014), the dominant approach to cloud infrastructure was one of two things: click-through console GUIs (not reproducible, not version-controlled), or cloud-provider-specific templating tools like AWS CloudFormation (JSON templates, AWS-only, verbose, no native state diffing).
AWS CloudFormation was the first serious attempt at IaC for AWS. It worked, but its design had fundamental problems. The templates were JSON (later YAML) - difficult to parameterize, impossible to test, and deeply AWS-specific. There was no way to reuse logic across templates without copy-pasting. The "drift detection" was manual and expensive. And if you wanted to manage infrastructure across AWS and GCP, you needed two completely different tools and skillsets.
Mitchell Hashimoto founded HashiCorp in 2012 and released Terraform in 2014 with a different design philosophy: a provider-agnostic, declarative language that could manage any cloud or service, backed by a local state file that tracked what Terraform had created. The killer insight was the execution plan - before making any changes, Terraform shows you exactly what it will do. No surprises. This was genuinely new.
For ML teams, the reasons to use Terraform over alternatives are compelling. Your ML infrastructure is complex: GPU nodes, distributed storage, container registries, IAM roles, networking, databases, monitoring. Managing this by hand is the cause of the 3 AM incident. Managing it with cloud-native tools locks you into one vendor. Terraform gives you a consistent approach regardless of whether your training cluster is on AWS, GCP, or Azure.
Historical Context
Terraform 0.1 was released in July 2014. The early versions were rough - state management was entirely local, there were no modules, and the provider ecosystem was tiny. By 2016, Terraform had introduced remote state (storing state in S3 instead of a local file) and modules (reusable configuration units). The 0.12 release in 2019 was a major inflection point: HCL 2 brought first-class expressions, for expressions, dynamic blocks, type constraints on variables, and a much richer expression language, with for_each on resources following in 0.12.6.
The big architectural decision that defines Terraform is its approach to state. Terraform maintains a JSON state file that records every resource it manages. When you run terraform plan, Terraform reads the state file, calls the cloud APIs to check the actual current state, computes a diff, and shows you what it will change. This design means Terraform is idempotent: running apply twice produces the same result. It also means the state file is critical infrastructure - lose it, and Terraform loses track of everything it manages.
In 2023, HashiCorp changed Terraform's license from MPL to BSL (Business Source License), prompting the community to fork Terraform as OpenTofu under the Linux Foundation. OpenTofu is API-compatible with Terraform and is now the open-source alternative. The concepts in this lesson apply equally to both.
Core Concepts
Providers
A provider is a plugin that knows how to talk to a specific cloud or service API. Providers are the translation layer between Terraform's declarative language and real API calls.
# versions.tf - always pin provider versions
terraform {
required_version = ">= 1.6.0"
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 5.0" # allows 5.x, not 6.x
}
kubernetes = {
source = "hashicorp/kubernetes"
version = "~> 2.24"
}
helm = {
source = "hashicorp/helm"
version = "~> 2.12"
}
}
}
provider "aws" {
region = var.aws_region
default_tags {
tags = {
Project = "mlops-platform"
Environment = var.environment
ManagedBy = "terraform"
}
}
}
# Second provider for a different region (e.g., DR region)
provider "aws" {
alias = "us_west"
region = "us-west-2"
}
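An aliased provider only takes effect when a resource or module selects it explicitly; everything else uses the default (un-aliased) configuration. A minimal sketch, assuming a hypothetical disaster-recovery copy of the model bucket (the resource name model_artifacts_dr and its bucket name are illustrative, not part of the original configuration):

```hcl
# Hypothetical DR bucket created in us-west-2 through the aliased provider.
resource "aws_s3_bucket" "model_artifacts_dr" {
  # Select the aliased provider; without this line the default provider is used.
  provider = aws.us_west

  bucket = "${var.project_name}-model-artifacts-dr-${var.environment}"
}
```

Note that default_tags are configured per provider block, so resources created through aws.us_west would not receive the default tags unless that provider block repeats them.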
Resources
Resources are the fundamental building blocks - each resource represents one real-world infrastructure object.
# An S3 bucket for storing model artifacts
resource "aws_s3_bucket" "model_artifacts" {
bucket = "${var.project_name}-model-artifacts-${var.environment}"
tags = {
Purpose = "model-storage"
}
}
# Enable versioning so we can recover old model files
resource "aws_s3_bucket_versioning" "model_artifacts" {
bucket = aws_s3_bucket.model_artifacts.id
versioning_configuration {
status = "Enabled"
}
}
# Block all public access - models should never be public
resource "aws_s3_bucket_public_access_block" "model_artifacts" {
bucket = aws_s3_bucket.model_artifacts.id
block_public_acls = true
block_public_policy = true
ignore_public_acls = true
restrict_public_buckets = true
}
# Server-side encryption
resource "aws_s3_bucket_server_side_encryption_configuration" "model_artifacts" {
bucket = aws_s3_bucket.model_artifacts.id
rule {
apply_server_side_encryption_by_default {
sse_algorithm = "AES256"
}
bucket_key_enabled = true
}
}
Data Sources
Data sources let you query existing infrastructure without managing it. Use them to read values that exist outside your Terraform configuration.
# Read the current AWS account ID
data "aws_caller_identity" "current" {}
# Find the latest Amazon Linux 2 AMI for training nodes
data "aws_ami" "amazon_linux_2" {
most_recent = true
owners = ["amazon"]
filter {
name = "name"
values = ["amzn2-ami-hvm-*-x86_64-gp2"]
}
filter {
name = "virtualization-type"
values = ["hvm"]
}
}
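# Find an existing VPC by tag (not managed by this Terraform)
data "aws_vpc" "main" {
tags = {
Name = "main-vpc"
}
}
# Find the private subnets in that VPC
# (assumes they carry a Tier = "private" tag; adjust to your own tagging scheme)
data "aws_subnets" "private" {
filter {
name = "vpc-id"
values = [data.aws_vpc.main.id]
}
tags = {
Tier = "private"
}
}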
# Use in resources
resource "aws_instance" "training_node" {
ami = data.aws_ami.amazon_linux_2.id
instance_type = "p3.2xlarge"
subnet_id = data.aws_subnets.private.ids[0]
tags = {
Name = "training-node"
AccountId = data.aws_caller_identity.current.account_id
}
}
Variables and Outputs
Variables are the inputs to your configuration. Outputs are the values your configuration exposes for other systems or modules to consume.
# variables.tf
variable "project_name" {
description = "Short name used in resource naming (e.g., 'eai')"
type = string
validation {
condition = can(regex("^[a-z][a-z0-9-]{1,10}[a-z0-9]$", var.project_name))
error_message = "project_name must be lowercase alphanumeric with hyphens, 3-12 chars."
}
}
variable "environment" {
description = "Deployment environment"
type = string
validation {
condition = contains(["dev", "staging", "prod"], var.environment)
error_message = "environment must be one of: dev, staging, prod."
}
}
variable "aws_region" {
description = "AWS region for the primary deployment"
type = string
default = "us-east-1"
}
variable "gpu_instance_count" {
description = "Number of GPU training nodes"
type = number
default = 2
validation {
condition = var.gpu_instance_count >= 1 && var.gpu_instance_count <= 32
error_message = "gpu_instance_count must be between 1 and 32."
}
}
variable "allowed_cidr_blocks" {
description = "CIDR blocks allowed to reach internal services"
type = list(string)
default = []
}
# Sensitive variable - redacted in plan/apply output, but still stored in plaintext in the state file
variable "mlflow_db_password" {
description = "Password for the MLflow tracking database"
type = string
sensitive = true
}
# outputs.tf
output "model_bucket_name" {
description = "S3 bucket name for model artifacts"
value = aws_s3_bucket.model_artifacts.id
}
output "model_bucket_arn" {
description = "S3 bucket ARN - use this in IAM policies"
value = aws_s3_bucket.model_artifacts.arn
}
output "training_role_arn" {
description = "IAM role ARN for training jobs"
value = aws_iam_role.training.arn
}
# Sensitive output - will be redacted in plan/apply output
output "mlflow_endpoint" {
description = "MLflow tracking server URL"
value = "http://${aws_instance.mlflow_server.private_ip}:5000"
sensitive = true
}
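Outputs become useful once another configuration reads them. One common way to do that is the terraform_remote_state data source. The sketch below assumes a hypothetical consumer stack (for example, an inference service) and assumes the producing configuration stores its state in an S3 backend with the bucket and key shown; adjust both to your own backend settings.

```hcl
# Hypothetical consumer configuration that reads outputs from the platform state.
data "terraform_remote_state" "platform" {
  backend = "s3"
  config = {
    bucket = "mycompany-terraform-state"                  # assumed state bucket
    key    = "mlops-platform/us-east-1/terraform.tfstate" # assumed state key
    region = "us-east-1"
  }
}

# Use the exported bucket ARN to grant read access to model artifacts.
data "aws_iam_policy_document" "read_models" {
  statement {
    effect    = "Allow"
    actions   = ["s3:GetObject"]
    resources = ["${data.terraform_remote_state.platform.outputs.model_bucket_arn}/*"]
  }
}
```

Reading remote state requires read access to the whole state file, not just its outputs, so grant it only to configurations that are allowed to see everything stored there.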
Locals
Locals are computed values - functions of your variables that you want to reuse without repeating.
# locals.tf
locals {
# Resource name prefix - consistent across all resources
prefix = "${var.project_name}-${var.environment}"
# Common tags merged with resource-specific tags
common_tags = {
Project = var.project_name
Environment = var.environment
ManagedBy = "terraform"
Owner = "ml-platform-team"
}
# Whether we're in a production environment
is_production = var.environment == "prod"
# GPU instance type - bigger in prod
gpu_instance_type = local.is_production ? "p3.8xlarge" : "p3.2xlarge"
# Derive bucket name from prefix
model_bucket_name = "${local.prefix}-model-artifacts"
# Derive ECR repo URL from account/region
ecr_base_url = "${data.aws_caller_identity.current.account_id}.dkr.ecr.${var.aws_region}.amazonaws.com"
}
The State File
The state file is Terraform's memory. It maps your HCL resources to real cloud resources. Understanding state is the most important thing to get right in a team setting.
Remote State - The Only Acceptable Approach for Teams
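Never use local state in a team. Local state lives on one engineer's machine: nobody else can see it, and nothing stops two people from running terraform apply against the same infrastructure at the same time, each with a different view of what exists. Use remote state with locking.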
# backend.tf - must be created BEFORE the rest of the config
# The S3 bucket and DynamoDB table for state must exist first
# (Bootstrap them once manually or with a separate tiny Terraform config)
terraform {
backend "s3" {
bucket = "mycompany-terraform-state"
key = "mlops-platform/us-east-1/terraform.tfstate"
region = "us-east-1"
encrypt = true
dynamodb_table = "terraform-state-lock"
}
}
Bootstrap the state backend once (this is the only thing you ever do manually):
# bootstrap/main.tf - run this once to create the state bucket
resource "aws_s3_bucket" "terraform_state" {
bucket = "mycompany-terraform-state"
lifecycle {
prevent_destroy = true # Never let Terraform delete this
}
}
resource "aws_s3_bucket_versioning" "terraform_state" {
bucket = aws_s3_bucket.terraform_state.id
versioning_configuration { status = "Enabled" }
}
resource "aws_dynamodb_table" "terraform_lock" {
name = "terraform-state-lock"
billing_mode = "PAY_PER_REQUEST"
hash_key = "LockID"
attribute {
name = "LockID"
type = "S"
}
lifecycle {
prevent_destroy = true
}
}
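Because the state file can contain secrets, the bootstrap configuration is also a natural place to lock the state bucket down. A minimal sketch that extends the bootstrap resources above; the choice of "aws:kms" with the AWS-managed key is an assumption, and "AES256" works as well:

```hcl
# Encrypt every state object at rest.
resource "aws_s3_bucket_server_side_encryption_configuration" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms" # assumption: AWS-managed KMS key
    }
  }
}

# State must never be publicly readable.
resource "aws_s3_bucket_public_access_block" "terraform_state" {
  bucket                  = aws_s3_bucket.terraform_state.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
```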
The Plan/Apply/Destroy Lifecycle
# Initialize - always run after git clone or adding a new provider
terraform init
# See what will change (safe - no side effects)
terraform plan
# Apply with explicit var file - never rely on defaults in production
terraform apply -var-file="prod.tfvars"
# Apply a specific plan file (the rigorous approach)
terraform plan -out=tfplan
terraform show -json tfplan | jq . # Review the plan as JSON
terraform apply tfplan
# Destroy a specific resource (careful!)
terraform destroy -target=aws_instance.training_node
# Destroy everything
terraform destroy -var-file="prod.tfvars"
HCL Syntax Deep Dive
for_each vs count
count creates N identical resources. for_each creates one resource per item in a map or set - much more flexible.
# count - simple, but fragile: deleting index 0 shifts everything
resource "aws_s3_bucket" "experiment" {
count = 3
bucket = "experiment-${count.index}"
}
# for_each - preferred: keyed by stable identifier
resource "aws_ecr_repository" "training_images" {
for_each = toset(["pytorch-training", "tensorflow-training", "inference-serving"])
name = "${local.prefix}-${each.key}"
image_tag_mutability = "MUTABLE"
image_scanning_configuration {
scan_on_push = true
}
}
# for_each with a map of objects - most powerful pattern
variable "training_queues" {
type = map(object({
delay_seconds = number
max_message_kib = number # per-message size limit in KiB (SQS has long capped this at 256)
}))
default = {
"gpu-high-priority" = { delay_seconds = 0, max_message_kib = 64 }
"gpu-low-priority" = { delay_seconds = 60, max_message_kib = 256 }
"cpu-training" = { delay_seconds = 0, max_message_kib = 128 }
}
}
resource "aws_sqs_queue" "training" {
for_each = var.training_queues
name = "${local.prefix}-${each.key}"
delay_seconds = each.value.delay_seconds
max_message_size = each.value.max_message_kib * 1024 # argument is in bytes
message_retention_seconds = 86400 # 1 day
receive_wait_time_seconds = 20 # long polling
}
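If a resource already exists under count and you want to move it to for_each, a moved block (available since Terraform 1.1) tells Terraform that an address was renamed, so the object is not destroyed and recreated. A sketch under the assumption that the three experiment buckets from the count example should keep their existing names; the keys baseline, ablation, and sweep are hypothetical:

```hcl
locals {
  # Stable keys mapped to the bucket names that already exist.
  experiment_buckets = {
    baseline = "experiment-0"
    ablation = "experiment-1"
    sweep    = "experiment-2"
  }
}

resource "aws_s3_bucket" "experiment" {
  for_each = local.experiment_buckets
  bucket   = each.value
}

# Map each old count index to its new for_each key.
moved {
  from = aws_s3_bucket.experiment[0]
  to   = aws_s3_bucket.experiment["baseline"]
}

moved {
  from = aws_s3_bucket.experiment[1]
  to   = aws_s3_bucket.experiment["ablation"]
}

moved {
  from = aws_s3_bucket.experiment[2]
  to   = aws_s3_bucket.experiment["sweep"]
}
```

After this change, terraform plan should report the three resources as moved rather than as destroy/create pairs. Keeping the bucket names identical matters here, because changing a bucket name forces replacement regardless of the address.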
Dynamic Blocks
Dynamic blocks generate repeated nested blocks based on a collection - essential for things like security group rules.
variable "ingress_rules" {
type = list(object({
port = number
protocol = string
description = string
cidr = string
}))
default = [
{ port = 22, protocol = "tcp", description = "SSH", cidr = "10.0.0.0/8" },
{ port = 5000, protocol = "tcp", description = "MLflow", cidr = "10.0.0.0/8" },
{ port = 8888, protocol = "tcp", description = "JupyterHub", cidr = "10.0.0.0/8" },
]
}
resource "aws_security_group" "ml_platform" {
name = "${local.prefix}-ml-platform"
description = "ML platform internal services"
vpc_id = var.vpc_id
dynamic "ingress" {
for_each = var.ingress_rules
content {
from_port = ingress.value.port
to_port = ingress.value.port
protocol = ingress.value.protocol
description = ingress.value.description
cidr_blocks = [ingress.value.cidr]
}
}
egress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
description = "Allow all outbound"
}
}
Conditional Expressions and templatefile()
# Conditional: bigger disk in prod
resource "aws_instance" "mlflow_server" {
ami = data.aws_ami.amazon_linux_2.id
instance_type = local.is_production ? "r6i.xlarge" : "t3.medium"
root_block_device {
volume_size = local.is_production ? 500 : 50
volume_type = "gp3"
encrypted = true
}
user_data = templatefile("${path.module}/templates/mlflow-init.sh.tpl", {
mlflow_version = "2.10.0"
db_endpoint = aws_db_instance.mlflow.endpoint
db_name = "mlflow"
s3_bucket = aws_s3_bucket.model_artifacts.id
environment = var.environment
})
}
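# templates/mlflow-init.sh.tpl
#!/bin/bash
set -euo pipefail
# DB_PASSWORD is a shell variable, not a template variable. It must be set
# before the heredoc below (for example, fetched from a secrets store);
# otherwise set -u aborts the script.
pip install mlflow==${mlflow_version} psycopg2-binary boto3
# Make sure the config directory exists before writing to it
mkdir -p /etc/mlflow
cat > /etc/mlflow/config.env <<EOF
MLFLOW_BACKEND_STORE_URI=postgresql://mlflow:$DB_PASSWORD@${db_endpoint}/${db_name}
MLFLOW_DEFAULT_ARTIFACT_ROOT=s3://${s3_bucket}/mlflow-artifacts
MLFLOW_ENVIRONMENT=${environment}
EOF
systemctl enable mlflow
systemctl start mlflow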
Modules
Modules are reusable configuration units. A module is just a directory of .tf files. When you call a module, you pass in variables and get back outputs.
# Module directory structure:
# modules/
# ml-storage/
# main.tf - resources
# variables.tf - inputs
# outputs.tf - outputs
# README.md - required: what does this module do?
# modules/ml-storage/main.tf
resource "aws_s3_bucket" "this" {
bucket = "${var.prefix}-${var.purpose}"
}
resource "aws_s3_bucket_versioning" "this" {
bucket = aws_s3_bucket.this.id
versioning_configuration { status = "Enabled" }
}
resource "aws_s3_bucket_lifecycle_configuration" "this" {
bucket = aws_s3_bucket.this.id
rule {
id = "transition-old-models"
status = "Enabled"
# Empty filter: apply the rule to every object in the bucket
filter {}
transition {
days = var.transition_to_ia_days
storage_class = "STANDARD_IA"
}
transition {
days = var.transition_to_glacier_days
storage_class = "GLACIER"
}
}
}
# Calling the module from the root configuration
module "model_storage" {
source = "./modules/ml-storage" # local path
prefix = local.prefix
purpose = "models"
transition_to_ia_days = 30
transition_to_glacier_days = 90
}
module "dataset_storage" {
source = "./modules/ml-storage"
prefix = local.prefix
purpose = "datasets"
transition_to_ia_days = 60
transition_to_glacier_days = 180
}
# Using a versioned module from Terraform Registry
module "vpc" {
source = "terraform-aws-modules/vpc/aws"
version = "5.5.0" # Always pin version
name = "${local.prefix}-vpc"
cidr = "10.0.0.0/16"
azs = ["us-east-1a", "us-east-1b", "us-east-1c"]
private_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
public_subnets = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]
enable_nat_gateway = true
single_nat_gateway = !local.is_production # Multi-NAT in prod only
enable_dns_hostnames = true
tags = local.common_tags
}
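The ml-storage module above references var.prefix, var.purpose, and two transition variables, and a caller usually needs the bucket name or ARN back. The original does not show those files, so the following is a sketch of what variables.tf and outputs.tf might contain; the defaults, the validation rule, and the output names bucket_name and bucket_arn are assumptions.

```hcl
# modules/ml-storage/variables.tf
variable "prefix" {
  description = "Name prefix shared by all resources, e.g. eai-prod"
  type        = string
}

variable "purpose" {
  description = "What the bucket stores, e.g. models or datasets"
  type        = string
}

variable "transition_to_ia_days" {
  description = "Days before objects move to STANDARD_IA"
  type        = number
  default     = 30

  validation {
    # S3 does not allow transitions to STANDARD_IA before 30 days.
    condition     = var.transition_to_ia_days >= 30
    error_message = "transition_to_ia_days must be at least 30."
  }
}

variable "transition_to_glacier_days" {
  description = "Days before objects move to GLACIER"
  type        = number
  default     = 90
}

# modules/ml-storage/outputs.tf
output "bucket_name" {
  description = "Name of the bucket created by this module"
  value       = aws_s3_bucket.this.id
}

output "bucket_arn" {
  description = "ARN of the bucket, for use in IAM policies"
  value       = aws_s3_bucket.this.arn
}
```

With these in place, the root configuration can reference module.model_storage.bucket_arn or module.dataset_storage.bucket_name like any other value.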
Complete ML Infrastructure Example
Here is a full, working Terraform configuration for a basic ML platform: VPC, S3, ECR, IAM roles for training.
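# main.tf - ML Platform Foundation
# Look up the availability zones usable in the current region
data "aws_availability_zones" "available" {
state = "available"
}
# --- Networking ---
module "vpc" {
source = "terraform-aws-modules/vpc/aws"
version = "5.5.0"
name = "${local.prefix}-vpc"
cidr = "10.0.0.0/16"
# Take three AZs to match the three subnets per tier below
azs = slice(data.aws_availability_zones.available.names, 0, 3)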
private_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
public_subnets = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]
database_subnets = ["10.0.201.0/24", "10.0.202.0/24", "10.0.203.0/24"]
enable_nat_gateway = true
single_nat_gateway = !local.is_production
enable_dns_hostnames = true
create_database_subnet_group = true
tags = local.common_tags
}
# --- S3 Buckets ---
resource "aws_s3_bucket" "model_artifacts" {
bucket = "${local.prefix}-model-artifacts"
tags = merge(local.common_tags, { Purpose = "model-storage" })
}
resource "aws_s3_bucket" "training_data" {
bucket = "${local.prefix}-training-data"
tags = merge(local.common_tags, { Purpose = "training-data" })
}
resource "aws_s3_bucket" "experiment_logs" {
bucket = "${local.prefix}-experiment-logs"
tags = merge(local.common_tags, { Purpose = "experiment-tracking" })
}
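# Apply versioning to model artifacts and data
resource "aws_s3_bucket_versioning" "model_artifacts" {
bucket = aws_s3_bucket.model_artifacts.id
versioning_configuration { status = "Enabled" }
}
resource "aws_s3_bucket_versioning" "training_data" {
bucket = aws_s3_bucket.training_data.id
versioning_configuration { status = "Enabled" }
}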
# Block all public access on all ML buckets
resource "aws_s3_bucket_public_access_block" "all_buckets" {
for_each = {
models = aws_s3_bucket.model_artifacts.id
data = aws_s3_bucket.training_data.id
logs = aws_s3_bucket.experiment_logs.id
}
bucket = each.value
block_public_acls = true
block_public_policy = true
ignore_public_acls = true
restrict_public_buckets = true
}
# --- ECR Repositories ---
locals {
ecr_repos = ["training", "inference", "preprocessing", "evaluation"]
}
resource "aws_ecr_repository" "ml_images" {
for_each = toset(local.ecr_repos)
name = "${local.prefix}/${each.key}"
image_tag_mutability = "MUTABLE"
image_scanning_configuration {
scan_on_push = true
}
encryption_configuration {
encryption_type = "AES256"
}
tags = merge(local.common_tags, { Component = each.key })
}
# ECR lifecycle: keep only last 30 images per repo
resource "aws_ecr_lifecycle_policy" "ml_images" {
for_each = aws_ecr_repository.ml_images
repository = each.value.name
policy = jsonencode({
rules = [{
rulePriority = 1
description = "Keep last 30 images"
selection = {
tagStatus = "any"
countType = "imageCountMoreThan"
countNumber = 30
}
action = { type = "expire" }
}]
})
}
# --- IAM Role for Training Jobs ---
data "aws_iam_policy_document" "training_assume_role" {
statement {
effect = "Allow"
actions = ["sts:AssumeRole"]
principals {
type = "Service"
identifiers = ["ec2.amazonaws.com", "sagemaker.amazonaws.com"]
}
}
}
resource "aws_iam_role" "training" {
name = "${local.prefix}-training-role"
assume_role_policy = data.aws_iam_policy_document.training_assume_role.json
tags = local.common_tags
}
data "aws_iam_policy_document" "training_permissions" {
# Read training data
statement {
effect = "Allow"
actions = ["s3:GetObject", "s3:ListBucket"]
resources = [
aws_s3_bucket.training_data.arn,
"${aws_s3_bucket.training_data.arn}/*"
]
}
# Write model artifacts
statement {
effect = "Allow"
actions = ["s3:PutObject", "s3:DeleteObject"]
resources = ["${aws_s3_bucket.model_artifacts.arn}/*"]
}
# Write experiment logs
statement {
effect = "Allow"
actions = ["s3:PutObject", "s3:GetObject"]
resources = ["${aws_s3_bucket.experiment_logs.arn}/*"]
}
# Pull training container from ECR
statement {
effect = "Allow"
actions = [
"ecr:GetDownloadUrlForLayer",
"ecr:BatchGetImage",
"ecr:BatchCheckLayerAvailability",
"ecr:GetAuthorizationToken"
]
resources = ["*"]
}
# CloudWatch logs for training job metrics
statement {
effect = "Allow"
actions = [
"logs:CreateLogGroup",
"logs:CreateLogStream",
"logs:PutLogEvents",
"logs:DescribeLogStreams"
]
resources = ["arn:aws:logs:*:*:log-group:/mlops/*"]
}
}
resource "aws_iam_role_policy" "training" {
name = "training-permissions"
role = aws_iam_role.training.id
policy = data.aws_iam_policy_document.training_permissions.json
}
resource "aws_iam_instance_profile" "training" {
name = "${local.prefix}-training-profile"
role = aws_iam_role.training.name
}
Importing Existing Infrastructure
When adopting Terraform on an existing team, you'll need to import resources that were created manually.
# Import an existing S3 bucket into Terraform state
terraform import aws_s3_bucket.model_artifacts my-existing-bucket-name
# Import with resource address when using for_each
terraform import 'aws_ecr_repository.ml_images["training"]' my-company/training
# Import an IAM role
terraform import aws_iam_role.training my-training-role-name
Terraform 1.5+ supports declarative import blocks, which are much cleaner:
# Import block in main.tf (Terraform 1.5+)
import {
to = aws_s3_bucket.model_artifacts
id = "my-existing-bucket-name"
}
import {
to = aws_iam_role.training
id = "my-training-role-name"
}
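Import blocks also accept an instance key, which mirrors the quoted for_each address used with the terraform import command. A short sketch, assuming the aws_ecr_repository.ml_images resource with for_each is already declared and that "training" is one of its keys; the repository name is the same placeholder used in the CLI example:

```hcl
# Import an existing ECR repository into one instance of a for_each resource.
import {
  to = aws_ecr_repository.ml_images["training"]
  id = "my-company/training" # placeholder: the existing repository's name
}
```

As with the CLI form, run terraform plan afterwards: any attribute where the HCL and the real repository disagree (including its name) shows up as a change to reconcile.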
Workspace Management
Workspaces let you maintain multiple state files from the same configuration - useful for deploying the same infrastructure to dev/staging/prod.
# Create workspaces
terraform workspace new dev
terraform workspace new staging
terraform workspace new prod
# List workspaces
terraform workspace list
# Switch workspace
terraform workspace select prod
# Use workspace in configuration
resource "aws_instance" "training" {
instance_type = terraform.workspace == "prod" ? "p3.8xlarge" : "t3.medium"
}
:::warning Workspace Limitations
Workspaces work well for simple cases but become awkward for complex multi-environment setups. Many teams prefer separate directories or Terragrunt (covered in the next lesson) for environment isolation. Workspaces share the same configuration - you cannot have fundamentally different resources across workspaces without messy conditionals.
:::
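When more than one value differs per workspace, a lookup map keyed by workspace name tends to stay readable longer than a chain of conditionals. A minimal sketch; the settings, the node_count field, and the resource name workspace_training are hypothetical:

```hcl
locals {
  # Hypothetical per-workspace settings; keys must match workspace names.
  workspace_settings = {
    dev     = { instance_type = "t3.medium", node_count = 1 }
    staging = { instance_type = "p3.2xlarge", node_count = 1 }
    prod    = { instance_type = "p3.8xlarge", node_count = 8 }
  }

  # Indexing fails at plan time if the current workspace has no entry,
  # which also guards against applying from the "default" workspace.
  settings = local.workspace_settings[terraform.workspace]
}

resource "aws_instance" "workspace_training" {
  count         = local.settings.node_count
  ami           = data.aws_ami.amazon_linux_2.id
  instance_type = local.settings.instance_type
}
```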
The Dependency Graph
Terraform builds a directed acyclic graph (DAG) of all resources and data sources, determining the order of operations automatically.
# Terraform knows: VPC must exist before subnets and security groups
# Subnets and security groups must exist before EC2 instances
# Subnets and security groups are independent, so they can be created in parallel
# This ordering is computed automatically from references
resource "aws_vpc" "main" {
cidr_block = "10.0.0.0/16"
}
resource "aws_subnet" "private" {
vpc_id = aws_vpc.main.id # <-- reference creates a dependency edge
cidr_block = "10.0.1.0/24"
}
resource "aws_security_group" "training" {
vpc_id = aws_vpc.main.id # <-- another dependency edge
}
resource "aws_instance" "trainer" {
subnet_id = aws_subnet.private.id # depends on subnet
vpc_security_group_ids = [aws_security_group.training.id] # depends on SG
}
# Visualize the dependency graph
terraform graph | dot -Tsvg > graph.svg
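References cover most ordering needs, but some dependencies are invisible to Terraform because no attribute carries them. The depends_on argument adds the edge by hand. A sketch, assuming the training IAM role, inline policy, and instance profile defined earlier; the instance name trainer_with_role is hypothetical:

```hcl
resource "aws_instance" "trainer_with_role" {
  ami                  = data.aws_ami.amazon_linux_2.id
  instance_type        = "p3.2xlarge"
  subnet_id            = aws_subnet.private.id
  iam_instance_profile = aws_iam_instance_profile.training.name

  # The instance references the profile, not the inline policy, so Terraform
  # cannot infer that the permissions must exist before the instance boots.
  depends_on = [aws_iam_role_policy.training]
}
```

Without the explicit edge, the instance could start, and its boot script could call S3, before the policy is attached.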
.tfvars Files for Environment Configuration
# dev.tfvars
project_name = "eai"
environment = "dev"
aws_region = "us-east-1"
gpu_instance_count = 1
# prod.tfvars
project_name = "eai"
environment = "prod"
aws_region = "us-east-1"
gpu_instance_count = 8
# Apply with environment-specific values
# terraform apply -var-file="prod.tfvars"
Store secrets in a separate file and never commit it:
# secrets.tfvars - in .gitignore
mlflow_db_password = "supersecret123"
# Apply with both
terraform apply -var-file="prod.tfvars" -var-file="secrets.tfvars"
Production Engineering Notes
State file security: The state file contains plaintext secrets (database passwords, private keys) if you store them in Terraform. Enable S3 server-side encryption and restrict bucket access to only the CI system and platform team. Never store state files in Git.
Plan before apply: In CI/CD, always run terraform plan -out=tfplan and require human approval before terraform apply tfplan. Automated applies without review are how production outages happen.
Lock files: Commit .terraform.lock.hcl to version control. This file locks the exact provider versions across your team - equivalent to package-lock.json.
Lifecycle rules: Use prevent_destroy = true on critical resources (state bucket, primary database, ECR repositories with production images). Use create_before_destroy = true on resources that would cause downtime if replaced.
Tagging enforcement: Use a default_tags block in the AWS provider to ensure every resource gets tagged. Missing tags make cost allocation and security auditing impossible at scale.
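The two lifecycle settings mentioned above look like this in practice. This is a sketch with hypothetical resource names; it assumes the local.prefix and var.vpc_id values used elsewhere in this lesson.

```hcl
# prevent_destroy: any plan that would delete this repository fails with an error.
resource "aws_ecr_repository" "inference_prod" {
  name = "${local.prefix}/inference-prod"

  lifecycle {
    prevent_destroy = true
  }
}

# create_before_destroy: when a change forces replacement, the new security
# group is created first, so dependents are never left without one.
resource "aws_security_group" "inference" {
  name_prefix = "${local.prefix}-inference-" # a prefix avoids a name clash during replacement
  vpc_id      = var.vpc_id

  lifecycle {
    create_before_destroy = true
  }
}
```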
Common Mistakes
:::danger Never Store Secrets in .tf Files
Do not put passwords, API keys, or tokens directly in .tf files or committed .tfvars files. Use AWS Secrets Manager or Parameter Store and fetch with data sources, or pass via environment variables (TF_VAR_mlflow_db_password).
:::
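As a sketch of the Secrets Manager approach: the secret is created outside Terraform and only read here. The secret name and the database settings are assumptions for illustration.

```hcl
# Read the current value of a secret that was created outside Terraform.
data "aws_secretsmanager_secret_version" "mlflow_db" {
  secret_id = "mlops/mlflow/db-password" # hypothetical secret name
}

resource "aws_db_instance" "mlflow" {
  identifier          = "${local.prefix}-mlflow"
  engine              = "postgres"
  instance_class      = "db.t3.medium"
  allocated_storage   = 20
  db_name             = "mlflow"
  username            = "mlflow"
  password            = data.aws_secretsmanager_secret_version.mlflow_db.secret_string
  skip_final_snapshot = true # fine for a sketch; keep final snapshots in production
}
```

This keeps the password out of .tf and .tfvars files, but the value is still written to the state file, so the state bucket needs the same protection as the secret itself.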
:::danger Never Run terraform apply in Production Without a Plan Review
Automated applies that skip the plan review step are responsible for most Terraform-related outages. Always generate a plan file, review it, then apply the specific plan file.
:::
:::warning Understand Resource Replacement
Many resource attribute changes require Terraform to destroy and recreate the resource - causing downtime. Always check if a change is in-place or requires replacement. The plan output clearly says # aws_instance.training must be replaced. Never skim past this.
:::
:::warning Remote State Before Day 1
Do not start with local state and plan to migrate later. Set up remote state in S3+DynamoDB before writing your first resource. State migration is painful and risky.
:::
Interview Q&A
Q: What is Terraform state and why does it exist?
Terraform state is a JSON file that maps HCL resource definitions to real-world cloud resources. It exists because Terraform needs to know what it has already created - cloud APIs keep no record of which resources belong to which Terraform configuration. When you run terraform plan, Terraform compares the desired state (your .tf files) against the current state (state file + cloud API refresh) and computes the diff. Without state, Terraform would have no way to know whether a resource needs to be created, updated, or already exists. For teams, state must be stored remotely (S3 + DynamoDB for locking) to prevent concurrent apply conflicts.
Q: What is the difference between count and for_each, and when would you use each?
count creates N copies of a resource indexed by integer. for_each creates one resource per key in a map or set. The critical difference is how deletion works: when count is driven by a list, removing the item at index 2 of 5 shifts the items at indices 3 and 4 down to 2 and 3, so Terraform modifies or replaces the resources at those indices and destroys the last one, even though only one item was removed. With for_each, each resource is keyed by a stable string, so removing one item does not affect others. Use for_each whenever the items have natural string identifiers (bucket names, queue names, repo names). Use count only when you genuinely want N identical resources with no meaningful difference between them.
Q: How does Terraform determine the order in which to create resources?
Terraform builds a Directed Acyclic Graph (DAG) of all resources by analyzing references between them. When resource B references resource A (e.g., subnet_id = aws_subnet.private.id), Terraform knows A must be created before B. Terraform parallelizes independent resources automatically - resources with no dependency relationship are created concurrently, speeding up large applies. You can also add explicit dependencies with the depends_on argument, though this is usually a sign that you should restructure your references.
Q: What is terraform import and when would you use it?
terraform import brings an existing cloud resource under Terraform management by writing its current state into the Terraform state file. You use it when adopting Terraform on a team that has existing manually-created infrastructure. The process is: write the HCL resource definition, then run terraform import <resource_address> <cloud_resource_id>. After importing, run terraform plan - it should show no changes if your HCL matches reality. Any differences need to be reconciled. Terraform 1.5+ introduced declarative import blocks that are cleaner and can be reviewed in PRs.
Q: How would you structure Terraform for a team of 5 engineers, and what practices prevent conflicts?
Use remote state with S3 + DynamoDB locking - this prevents concurrent applies by acquiring a lock before any apply and releasing it after. Structure the configuration with modules for reusability and workspaces or separate directories for environments. Establish a branch policy: all Terraform changes go through PRs, CI runs terraform plan automatically and posts the diff as a PR comment, applies require approval, and only the CI system (not individual engineers) runs terraform apply. Use Atlantis (covered next lesson) to enforce this workflow. Never allow engineers to apply from their laptops in environments beyond dev.
Q: What is the .terraform.lock.hcl file and should it be committed?
Yes - always commit .terraform.lock.hcl. This is the dependency lock file generated by terraform init. It records the exact version and hash of every provider, ensuring that everyone on the team (and your CI system) uses the same provider versions. Without it, different engineers might get different provider versions depending on when they ran terraform init, leading to subtle differences in behavior. It is analogous to package-lock.json in Node.js or Pipfile.lock in Python.
