Cloud Cost Management
Zero Visibility to Full FinOps in 4 Weeks
The ML team had eight people and a cloud bill that nobody fully understood. The monthly AWS invoice was 200 line items. Engineers knew they were using EC2 and S3. Finance knew the total. Nobody knew which project was responsible for which cost, whether they were getting good value, or where the waste was hiding.
The CTO gave them a mandate: build cost visibility within four weeks. Not just a dashboard - a complete FinOps practice. The goal: every engineer should know their team's monthly spend, every experiment should have a cost tag, and any resource that spikes 50% week-over-week should trigger an alert within an hour.
Week one: tagging strategy. Week two: cost attribution and reporting. Week three: commitment discounts and idle resource cleanup. Week four: alert system and governance process.
By the end of week four, they had found $23,000/month in unnecessary spend - idle Jupyter instances, over-provisioned RDS databases, multi-TB S3 buckets of stale model checkpoints from experiments run six months ago. The visibility work had already paid for itself three times over. And for the first time, the team could answer "what did this project cost us to train?" with a number backed by data.
This lesson documents exactly what they built and how.
Why FinOps Matters for ML Teams Specifically
Traditional software teams have relatively predictable cloud spend. ML teams don't - for three structural reasons:
1. Burst compute: A training run might cost $20,000 in 18 hours. If nobody is watching, you discover it after the fact.
2. Experiment proliferation: Each experiment spins up resources. Many never get cleaned up. Over 12 months, a team of 10 doing 5 experiments per person per month accumulates 600 potential cost-generating events, each needing lifecycle management.
3. Research waste: 70–80% of training runs fail or are superseded. The compute cost of failed experiments is real and substantial - often 30–50% of total training spend.
FinOps for ML is not about cutting costs - it's about eliminating waste so the same budget funds more experiments and better models.
Pillar 1: Tagging Strategy
Tags are the foundation of all cost attribution. Without consistent tagging, you have a bill. With tagging, you have a cost model.
The Tag Taxonomy
# Mandatory tags - enforced via AWS Service Control Policies or GCP Organization Policy
MANDATORY_TAGS = {
    "team": str,          # "recommendations", "search", "nlp", "cv"
    "project": str,       # "user-embedding-v2", "ranking-model-q1"
    "environment": str,   # "production", "staging", "development", "experiment"
    "cost_center": str,   # "cc-ml-001" - maps to finance team structure
    "owner": str,         # email of responsible person
}

# Optional but strongly encouraged
OPTIONAL_TAGS = {
    "experiment_id": str,    # MLflow/W&B run ID
    "model_version": str,    # "v1.2.3"
    "auto_shutdown": str,    # "2024-03-15" - date after which resource can be stopped
    "dataset_version": str,  # training data version
}
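The taxonomy above only names the keys. A minimal sketch of value-level validation follows, assuming the environment values and the email-style `owner` from the comments above. The names `validate_tags`, `ALLOWED_ENVIRONMENTS`, `REQUIRED_KEYS`, and the deliberately loose email pattern are illustrative, not part of the original lesson.

```python
import re

# Assumed from the comments in the tag taxonomy; adjust to your organization.
ALLOWED_ENVIRONMENTS = {"production", "staging", "development", "experiment"}
REQUIRED_KEYS = ("team", "project", "environment", "cost_center", "owner")
# Loose sanity check only, not a full email validator.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def validate_tags(tags: dict[str, str]) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for key in REQUIRED_KEYS:
        if not tags.get(key, "").strip():
            problems.append(f"missing or empty tag: {key}")
    env = tags.get("environment")
    if env and env not in ALLOWED_ENVIRONMENTS:
        problems.append(f"unknown environment: {env}")
    owner = tags.get("owner")
    if owner and not EMAIL_PATTERN.match(owner):
        problems.append(f"owner is not an email address: {owner}")
    return problems
```

Returning a list of problems rather than raising on the first one lets a launcher or CI check report everything that is wrong in a single pass.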
Enforcing Tagging at Resource Creation
import boto3
from functools import wraps
from typing import Callable


def require_cost_tags(resource_type: str):
    """
    Decorator that enforces mandatory cost tags before creating cloud resources.
    The wrapped function must receive `tags` as a keyword argument;
    MANDATORY_TAGS is the dict defined in the tag taxonomy above.
    """
    def decorator(fn: Callable):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            tags = kwargs.get("tags", {})
            missing = [k for k in MANDATORY_TAGS if k not in tags]
            if missing:
                raise ValueError(
                    f"Missing required cost tags for {resource_type}: {missing}\n"
                    f"All ML resources require: {list(MANDATORY_TAGS.keys())}"
                )
            return fn(*args, **kwargs)
        return wrapper
    return decorator


class MLResourceLauncher:
    """Centralized resource launcher with mandatory cost tagging."""

    def __init__(self, region: str = "us-east-1"):
        self.ec2 = boto3.client("ec2", region_name=region)
        self.sagemaker = boto3.client("sagemaker", region_name=region)

    @require_cost_tags("EC2 instance")
    def launch_training_instance(
        self,
        instance_type: str,
        ami_id: str,
        tags: dict,
        **kwargs,
    ) -> str:
        """Launch a training instance with enforced cost tags."""
        response = self.ec2.run_instances(
            ImageId=ami_id,
            InstanceType=instance_type,
            MinCount=1,
            MaxCount=1,
            TagSpecifications=[{
                "ResourceType": "instance",
                "Tags": [{"Key": k, "Value": v} for k, v in tags.items()]
            }],
            **kwargs,
        )
        return response["Instances"][0]["InstanceId"]

    @require_cost_tags("SageMaker training job")
    def launch_sagemaker_training(
        self,
        job_name: str,
        estimator_config: dict,
        tags: dict,
    ) -> str:
        """Launch SageMaker training job with enforced cost tags."""
        response = self.sagemaker.create_training_job(
            TrainingJobName=job_name,
            Tags=[{"Key": k, "Value": v} for k, v in tags.items()],
            **estimator_config,
        )
        return response["TrainingJobArn"]
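The launcher above only protects resources created through it. For organization-level enforcement, one common pattern is a Service Control Policy that denies `ec2:RunInstances` when a required tag is absent from the request. The sketch below builds such a policy document as a Python dict. The helper name `build_tag_enforcement_scp` is hypothetical, and the policy covers only EC2 instances; treat it as a starting point to review against your own organization's setup, not a complete guardrail.

```python
import json


def build_tag_enforcement_scp(required_tags: list[str]) -> dict:
    """Build an SCP document denying EC2 launches that omit any required tag."""
    statements = []
    for tag in required_tags:
        # Sid must be alphanumeric, so "cost_center" becomes "CostCenter".
        suffix = "".join(part.capitalize() for part in tag.split("_"))
        statements.append({
            "Sid": f"DenyRunInstancesWithout{suffix}",
            "Effect": "Deny",
            "Action": "ec2:RunInstances",
            "Resource": "arn:aws:ec2:*:*:instance/*",
            # "Null": "true" matches requests where the tag key is not supplied.
            "Condition": {"Null": {f"aws:RequestTag/{tag}": "true"}},
        })
    return {"Version": "2012-10-17", "Statement": statements}


policy = build_tag_enforcement_scp(
    ["team", "project", "environment", "cost_center", "owner"]
)
print(json.dumps(policy, indent=2))
```

One statement per tag keeps the denial reason easy to trace: the `Sid` of the statement that blocked a launch names the missing tag.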
Tag Compliance Report
def generate_tag_compliance_report() -> dict:
    """
    Scan EC2 instances and report tagging compliance.
    Run this weekly - non-compliant resources get auto-tagged with owner="unknown"
    and trigger a Slack notification. This function only builds the report;
    the auto-tagging and notification steps are not shown here.
    """
    ec2 = boto3.client("ec2")
    missing_tags = []
    total_checked = 0
    # describe_instances is paginated, so walk every page
    paginator = ec2.get_paginator("describe_instances")
    for page in paginator.paginate():
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                if instance["State"]["Name"] not in ("running", "stopped"):
                    continue
                # Count instances, not reservations: one reservation can hold several
                total_checked += 1
                instance_tags = {t["Key"]: t["Value"] for t in instance.get("Tags", [])}
                missing = [k for k in MANDATORY_TAGS if k not in instance_tags]
                if missing:
                    missing_tags.append({
                        "resource_id": instance["InstanceId"],
                        "resource_type": "EC2",
                        "instance_type": instance["InstanceType"],
                        "launch_time": instance["LaunchTime"].isoformat(),
                        "missing_tags": missing,
                        "existing_tags": instance_tags,
                    })
    return {
        "total_resources_checked": total_checked,
        "non_compliant_count": len(missing_tags),
        "non_compliant_resources": missing_tags,
        "compliance_rate": 1 - len(missing_tags) / max(1, total_checked),
    }
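To turn the report into per-person notifications, the non-compliant resources can be grouped by whatever `owner` tag they already carry. This is a small sketch over the report dict returned above; the function name `group_non_compliant_by_owner` is illustrative, and actually sending the messages is left out.

```python
from collections import defaultdict


def group_non_compliant_by_owner(report: dict) -> dict[str, list[str]]:
    """Map each owner (or "unknown") to the resource IDs they need to fix."""
    grouped = defaultdict(list)
    for resource in report["non_compliant_resources"]:
        owner = resource["existing_tags"].get("owner", "unknown")
        grouped[owner].append(resource["resource_id"])
    return dict(grouped)
```

Resources that land under `"unknown"` are the ones to escalate first, since nobody will receive a notification for them.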
Pillar 2: Commitment-Based Discounts
Reserved Instances vs Savings Plans
AWS offers two types of commitment-based discounts for EC2:
Reserved Instances (RIs): Commit to a specific instance type in a specific region. Discount: 30–60% off on-demand. Inflexible - you lose savings if your instance type changes.
Savings Plans (SPs): Commit to a dollar-per-hour spend level. Discount: 20–50% off on-demand. Flexible - a Compute Savings Plan applies to matching EC2, Fargate, and Lambda usage regardless of instance family, while SageMaker usage is covered by a separate SageMaker Savings Plan.
For ML teams, Savings Plans are almost always better than Reserved Instances because ML workloads change. The GPU instance you use today may not be the best choice in 6 months.
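The flexibility argument has a simple break-even behind it. If a commitment covers usage worth C at on-demand rates with discount d, you pay C(1 - d) whether or not you use it, so it only pays off when you use at least a fraction 1 - d of what you committed to. A small sketch of that arithmetic follows; the function names are illustrative, and the 38% discount is just an example rate.

```python
def break_even_utilization(discount: float) -> float:
    """Fraction of the commitment you must actually use to break even."""
    if not 0 <= discount < 1:
        raise ValueError("discount must be in [0, 1)")
    return 1 - discount


def net_monthly_savings(
    covered_on_demand_monthly: float,  # committed usage, valued at on-demand rates
    utilization: float,                # fraction of the commitment actually used
    discount: float,
) -> float:
    """Savings versus pure on-demand; negative means the commitment lost money."""
    commitment_cost = covered_on_demand_monthly * (1 - discount)
    on_demand_cost = covered_on_demand_monthly * utilization
    return on_demand_cost - commitment_cost


print(f"Break-even utilization at 38% discount: {break_even_utilization(0.38):.0%}")
print(f"Net savings at 50% utilization: ${net_monthly_savings(14_700, 0.5, 0.38):,.0f}")
```

This is why an instance-specific commitment is risky for ML: if the team moves to a different GPU family, utilization of the old commitment can drop below the break-even point.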
def analyze_commitment_opportunity(
    monthly_on_demand_spend: float,
    predictable_baseline_pct: float,  # fraction of spend you're confident will continue
    commitment_term_years: int = 1,   # 1 or 3 year commitment
    partial_upfront: bool = True,
) -> dict:
    """
    Calculate optimal commitment level and expected savings.
    Conservative rule: commit to 70% of your predictable baseline.
    This leaves buffer for workload changes without wasting commitment.
    """
    # Savings Plan discount rates (approximate, varies by region/instance)
    sp_discounts = {
        1: {"no_upfront": 0.30, "partial_upfront": 0.38, "full_upfront": 0.42},
        3: {"no_upfront": 0.42, "partial_upfront": 0.55, "full_upfront": 0.60},
    }
    payment = "partial_upfront" if partial_upfront else "no_upfront"
    discount = sp_discounts[commitment_term_years][payment]

    # Optimal commitment: 70% of predictable baseline spend.
    # Note: this is valued at on-demand rates; the hourly figure you enter when
    # buying a Savings Plan is the discounted amount.
    predictable_monthly = monthly_on_demand_spend * predictable_baseline_pct
    optimal_commitment_hourly = predictable_monthly * 0.70 / 730  # $/hr commitment

    # Expected savings
    committed_spend_monthly = optimal_commitment_hourly * 730
    savings_on_committed = committed_spend_monthly * discount

    return {
        "monthly_on_demand": monthly_on_demand_spend,
        "optimal_commitment_hourly": optimal_commitment_hourly,
        "committed_monthly_spend": committed_spend_monthly,
        "monthly_savings": savings_on_committed,
        "annual_savings": savings_on_committed * 12,
        "effective_discount": discount,
        "commitment_term_years": commitment_term_years,
    }


# Example: $30K/month on-demand, 70% predictable baseline
result = analyze_commitment_opportunity(
    monthly_on_demand_spend=30_000,
    predictable_baseline_pct=0.70,
    commitment_term_years=1,
)
print(f"Optimal commitment: ${result['optimal_commitment_hourly']:.2f}/hr")
print(f"Monthly savings: ${result['monthly_savings']:,.0f}")
print(f"Annual savings: ${result['annual_savings']:,.0f}")
# With these inputs: $30,000 x 0.70 x 0.70 = $14,700 committed, x 0.38 discount
# = about $5,586/month savings, or about $67,000/year on a $30K/month bill
The Commitment Purchase Playbook
- Analyze 90 days of historical spend to identify your baseline
- Separate predictable from variable: serving infrastructure is predictable; training is variable
- Commit to 70% of predictable baseline (conservative - avoids stranded commitments)
- Review monthly: if actual spend consistently exceeds commitment by 30%+, add more
- Set a calendar reminder for renewal 60 days before expiry - auto-renew is dangerous
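The first three steps of the playbook can be sketched in a few lines. The version below assumes you have already exported about 90 days of daily spend for the predictable (serving) portion of the bill, and it uses a low percentile of that history as the baseline. The 10th percentile is an assumption chosen for illustration, not a rule from this lesson, and both function names are hypothetical.

```python
import statistics


def estimate_baseline(daily_spend: list[float], percentile: int = 10) -> float:
    """Estimate baseline daily spend as a low percentile of the history."""
    if len(daily_spend) < 2:
        raise ValueError("need at least two days of history")
    if not 1 <= percentile <= 99:
        raise ValueError("percentile must be between 1 and 99")
    # quantiles(n=100) returns the 99 cut points between percentiles
    cut_points = statistics.quantiles(daily_spend, n=100, method="inclusive")
    return cut_points[percentile - 1]


def recommended_hourly_commitment(
    daily_spend: list[float],
    commit_fraction: float = 0.70,  # the 70% rule from the playbook
) -> float:
    """Hourly commitment (at on-demand rates) for the estimated baseline."""
    return estimate_baseline(daily_spend) * commit_fraction / 24
```

Using a low percentile rather than the mean keeps occasional spikes from inflating the baseline, which fits the playbook's goal of avoiding stranded commitments.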
Pillar 3: Spot Instance Automation
For ML training, spot instances are the highest-value cost lever. But they require infrastructure to be reliable. Here is the production system:
import boto3
from datetime import datetime, timedelta, timezone
from typing import Optional


class SpotBidManager:
    """
    Find the cheapest spot instance option for a training job.
    Considers spot price history across AZs and instance types.
    """

    def __init__(self, region: str = "us-east-1"):
        self.ec2 = boto3.client("ec2", region_name=region)

    def find_cheapest_spot(
        self,
        candidate_instance_types: list[str],
        target_availability_zones: list[str],
        lookback_hours: int = 24,
    ) -> Optional[dict]:
        """
        Return the cheapest current spot option from candidate instances/AZs.
        """
        start_time = datetime.now(timezone.utc) - timedelta(hours=lookback_hours)
        # Keep only the most recent price per (instance type, AZ): an old low
        # price in the history is not a price you can get right now.
        latest = {}
        paginator = self.ec2.get_paginator("describe_spot_price_history")
        for page in paginator.paginate(
            InstanceTypes=candidate_instance_types,
            ProductDescriptions=["Linux/UNIX"],
            StartTime=start_time,
        ):
            for entry in page["SpotPriceHistory"]:
                az = entry["AvailabilityZone"]
                if az not in target_availability_zones:
                    continue
                key = (entry["InstanceType"], az)
                if key not in latest or entry["Timestamp"] > latest[key]["Timestamp"]:
                    latest[key] = entry
        if not latest:
            return None

        (instance_type, az), entry = min(
            latest.items(), key=lambda item: float(item[1]["SpotPrice"])
        )
        spot_price = float(entry["SpotPrice"])
        on_demand_price = self._get_on_demand_price(instance_type)
        return {
            "instance_type": instance_type,
            "availability_zone": az,
            "spot_price": spot_price,
            "on_demand_price": on_demand_price,
            "savings_vs_on_demand": 1 - spot_price / on_demand_price,
        }

    def _get_on_demand_price(self, instance_type: str) -> float:
        """Approximate on-demand prices for common ML instances."""
        prices = {
            "p3.2xlarge": 3.06, "p3.8xlarge": 12.24,
            "p4d.24xlarge": 32.77, "g4dn.xlarge": 0.526,
            "g5.xlarge": 1.006, "g5.12xlarge": 5.672,
        }
        # Fallback is a placeholder: savings figures for unlisted types are not meaningful
        return prices.get(instance_type, 1.0)
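The hourly price is only part of the picture: every interruption costs the work done since the last checkpoint. A rough way to compare options is to add that rework time to the job length before multiplying by the spot price. The sketch below does that; the function names and the interruption and rework figures in the example are hypothetical inputs you would estimate for your own jobs.

```python
def effective_spot_cost(
    job_hours: float,
    spot_price: float,
    expected_interruptions: float,
    rework_hours_per_interruption: float,  # roughly half your checkpoint interval
) -> float:
    """Spot cost including compute repeated after each interruption."""
    total_hours = job_hours + expected_interruptions * rework_hours_per_interruption
    return total_hours * spot_price


def spot_savings_after_rework(
    job_hours: float,
    spot_price: float,
    on_demand_price: float,
    expected_interruptions: float,
    rework_hours_per_interruption: float,
) -> float:
    """Fractional savings versus on-demand once rework is counted."""
    spot_total = effective_spot_cost(
        job_hours, spot_price, expected_interruptions, rework_hours_per_interruption
    )
    return 1 - spot_total / (job_hours * on_demand_price)
```

Frequent checkpoints shrink the rework term, which is the sense in which spot training needs supporting infrastructure to be reliable.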
Pillar 4: Auto-Shutdown Policies
Idle resources are pure waste. An auto-shutdown policy catches the resources that engineers forget to stop.
import boto3
from datetime import datetime, timedelta, timezone


class IdleResourceDetector:
    """
    Detect and optionally stop idle ML resources.
    Runs daily via CloudWatch Events / EventBridge.
    """
    IDLE_THRESHOLDS = {
        "ec2_cpu_pct": 5.0,          # CPU < 5% = idle
        "ec2_gpu_pct": 2.0,          # GPU < 2% = idle (not checked below: EC2 publishes no GPU metric by default)
        "sagemaker_invocations": 0,  # 0 invocations = idle endpoint
        "idle_days": 3,              # idle for 3 consecutive days
    }

    def __init__(self):
        self.ec2 = boto3.client("ec2")
        self.cloudwatch = boto3.client("cloudwatch")
        self.sagemaker = boto3.client("sagemaker")
        self.sns = boto3.client("sns")

    def find_idle_instances(self) -> list[dict]:
        """Find running EC2 instances whose daily average CPU stayed low for the past 3 days."""
        idle = []
        threshold_days = self.IDLE_THRESHOLDS["idle_days"]
        end = datetime.now(timezone.utc)
        start = end - timedelta(days=threshold_days)
        paginator = self.ec2.get_paginator("describe_instances")
        pages = paginator.paginate(
            Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
        )
        for page in pages:
            for reservation in page["Reservations"]:
                for instance in reservation["Instances"]:
                    instance_id = instance["InstanceId"]
                    tags = {t["Key"]: t["Value"] for t in instance.get("Tags", [])}
                    # Skip if tagged for long-running use
                    if tags.get("auto_shutdown") == "never":
                        continue
                    # Check CPU utilization, one datapoint per day
                    response = self.cloudwatch.get_metric_statistics(
                        Namespace="AWS/EC2",
                        MetricName="CPUUtilization",
                        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
                        StartTime=start,
                        EndTime=end,
                        Period=86400,
                        Statistics=["Average"],
                    )
                    daily_averages = [dp["Average"] for dp in response["Datapoints"]]
                    if not daily_averages:
                        continue
                    # Idle only if every day in the window was below the threshold,
                    # so one busy day is not hidden by a multi-day average
                    if max(daily_averages) < self.IDLE_THRESHOLDS["ec2_cpu_pct"]:
                        idle.append({
                            "instance_id": instance_id,
                            "instance_type": instance["InstanceType"],
                            "avg_cpu_pct": sum(daily_averages) / len(daily_averages),
                            "owner": tags.get("owner", "unknown"),
                            "launch_time": instance["LaunchTime"].isoformat(),
                            "tags": tags,
                        })
        return idle

    def send_idle_alert(self, idle_resources: list[dict], topic_arn: str):
        """Publish an idle-resource alert to an SNS topic (which can fan out to Slack/email)."""
        if not idle_resources:
            return
        message_lines = ["🔍 Idle ML Resources Detected\n"]
        for r in idle_resources:
            message_lines.append(
                f"• {r['instance_id']} ({r['instance_type']}) - "
                f"CPU: {r['avg_cpu_pct']:.1f}% - Owner: {r['owner']}"
            )
        message_lines.append(
            "\nPlease stop these instances or tag with auto_shutdown=never"
        )
        self.sns.publish(
            TopicArn=topic_arn,
            Subject="Idle ML Resources - Action Required",
            Message="\n".join(message_lines),
        )
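The optional `auto_shutdown` tag from the taxonomy carries a date, while the detector above also treats the value `never` as an opt-out. A small helper can reconcile the two. This is a sketch assuming ISO dates such as `2024-03-15`; the function name is illustrative, and treating malformed values as "do not stop" is a deliberately cautious choice rather than something the lesson specifies.

```python
from datetime import date


def is_past_auto_shutdown(tags: dict[str, str], today: date) -> bool:
    """True if the resource has an auto_shutdown date that has already passed."""
    value = tags.get("auto_shutdown")
    if value is None or value == "never":
        return False
    try:
        deadline = date.fromisoformat(value)
    except ValueError:
        # Malformed value: do not stop anything based on a tag we cannot parse.
        return False
    return today > deadline
```

Passing `today` in explicitly, rather than calling `date.today()` inside the function, keeps the check easy to test and to replay for past dates.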
Pillar 5: Budget Alerts and Anomaly Detection
import boto3


def create_ml_budget_alerts(
    monthly_budget_usd: float,
    team_name: str,
    alert_email: str,
    cost_center_tag: str,
) -> list[str]:
    """
    Create an AWS Budget with email alerts for an ML team.
    Alerts at 50%, 80%, 100%, and 120% of monthly budget.
    """
    budgets = boto3.client("budgets")
    account_id = boto3.client("sts").get_caller_identity()["Account"]
    budget_name = f"{team_name}-monthly-budget"
    # Create cost budget filtered to team's cost center tag
    budgets.create_budget(
        AccountId=account_id,
        Budget={
            "BudgetName": budget_name,
            "BudgetLimit": {"Amount": str(monthly_budget_usd), "Unit": "USD"},
            "TimeUnit": "MONTHLY",
            "BudgetType": "COST",
            "CostFilters": {
                # Format is "user:<tag key>$<tag value>"
                "TagKeyValue": [f"user:cost_center${cost_center_tag}"]
            },
        },
        NotificationsWithSubscribers=[
            {
                "Notification": {
                    "NotificationType": "ACTUAL",
                    "ComparisonOperator": "GREATER_THAN",
                    "Threshold": threshold,
                    "ThresholdType": "PERCENTAGE",
                },
                "Subscribers": [{"SubscriptionType": "EMAIL", "Address": alert_email}]
            }
            for threshold in [50, 80, 100, 120]
        ],
    )
    # create_budget returns an empty response body, so return the name we chose
    return [budget_name]


# AWS Cost Anomaly Detection - catches unexpected spikes
def setup_anomaly_detection(team_name: str, alert_email: str, threshold_usd: float = 500):
    """
    Configure AWS Cost Anomaly Detection.
    Alerts when any service spends threshold_usd (default $500) or more above what is expected.
    """
    cost_explorer = boto3.client("ce")
    monitor = cost_explorer.create_anomaly_monitor(
        AnomalyMonitor={
            "MonitorName": f"{team_name}-anomaly-monitor",
            "MonitorType": "DIMENSIONAL",
            "MonitorDimension": "SERVICE",
        }
    )
    cost_explorer.create_anomaly_subscription(
        AnomalySubscription={
            "MonitorArnList": [monitor["MonitorArn"]],
            "Subscribers": [{"Address": alert_email, "Type": "EMAIL"}],
            # Threshold is the older field; newer API docs prefer ThresholdExpression
            "Threshold": threshold_usd,
            "Frequency": "DAILY",
            "SubscriptionName": f"{team_name}-anomaly-alerts",
        }
    )
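The CTO's mandate also called for alerting on any resource whose spend spikes 50% week-over-week, which the managed services above do not express directly. Below is a minimal sketch of that rule as a pure function over two per-resource spend dictionaries. How those dictionaries are built (for example, from a daily cost export) is assumed, and the function name is illustrative.

```python
import math


def detect_week_over_week_spikes(
    this_week: dict[str, float],
    last_week: dict[str, float],
    threshold: float = 0.50,  # 50% growth triggers an alert
) -> dict[str, float]:
    """Return {resource: fractional growth} for resources at or above the threshold."""
    spikes = {}
    for resource, current in this_week.items():
        previous = last_week.get(resource, 0.0)
        if previous <= 0:
            # No spend last week: flag any new spend as an unbounded increase.
            if current > 0:
                spikes[resource] = math.inf
            continue
        growth = current / previous - 1
        if growth >= threshold:
            spikes[resource] = growth
    return spikes
```

In practice you would probably add a minimum dollar amount so that a resource going from $0.10 to $0.20 does not page anyone.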
The 4-Week FinOps Implementation Plan
| Week | Actions | Expected Quick Wins |
|---|---|---|
| 1 | Tag all running resources, build compliance report | Identify unowned resources |
| 2 | Build team cost dashboard, 30-day trend | Identify top 3 cost drivers |
| 3 | Clean idle resources, purchase Savings Plans | 15–25% immediate cost reduction |
| 4 | Budget alerts, anomaly detection, review cadence | Prevent future cost surprises |
Common Mistakes
:::danger Purchasing Reserved Instances for ML workloads ML instance types change faster than 1–3 year RI terms. A model you're training on A100 today may run on H100 next year. Compute Savings Plans apply across EC2 instance families through a dollar-per-hour commitment, making them almost always the right choice for ML over Reserved Instances. :::
:::warning Setting tags but never enforcing them A tagging policy with no enforcement is just a suggestion. Use AWS Service Control Policies or GCP Organization Policies to prevent resource creation without mandatory tags. The short-term friction is worth it - teams adapt quickly, and the compliance rate goes from 40% to 95%+ within a month. :::
:::danger Buying 100% of your spend as Savings Plans If your workload decreases - product sunset, model deprecation, cost optimization project - you're still paying for the committed spend. Buy commitments for 70% of your stable baseline. Keep 30% on-demand for flexibility. :::
Interview Q&A
Q: How do you implement cost attribution for a team with 20 ML projects running simultaneously?
A: Three-layer approach. First, tagging - every resource gets team, project, environment, and cost_center tags at creation time, enforced via IAM policy or organization-level guardrails. Second, a cost attribution pipeline - daily job that queries AWS Cost Explorer API, joins with tag data, and writes per-project cost summaries to a shared analytics database. Third, a weekly cost report - automated Slack message to each team lead showing this week's spend vs last week, top cost drivers, and any anomalies. The pipeline runs in 15 minutes via a Lambda function. The hard part is the tagging enforcement - I always add it to the CI/CD pipeline for infrastructure code so you can't deploy untagged resources.
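The second layer of that answer, the attribution pipeline, mostly comes down to reshaping a Cost Explorer response. The sketch below sums cost per project from a response shaped like the output of `get_cost_and_usage` grouped by a tag key, where each group key looks like `project$<value>` and untagged spend has an empty value. The sample data is made up, and the function name is illustrative.

```python
from collections import defaultdict


def summarize_cost_by_project(ce_response: dict, tag_key: str = "project") -> dict[str, float]:
    """Sum UnblendedCost per tag value across all time periods in the response."""
    totals = defaultdict(float)
    prefix = f"{tag_key}$"
    for period in ce_response["ResultsByTime"]:
        for group in period["Groups"]:
            raw_key = group["Keys"][0]
            value = raw_key[len(prefix):] if raw_key.startswith(prefix) else raw_key
            project = value or "untagged"
            totals[project] += float(group["Metrics"]["UnblendedCost"]["Amount"])
    return dict(totals)


# Made-up sample in the shape Cost Explorer returns when grouping by a tag.
sample_response = {"ResultsByTime": [
    {"Groups": [
        {"Keys": ["project$user-embedding-v2"],
         "Metrics": {"UnblendedCost": {"Amount": "120.5", "Unit": "USD"}}},
        {"Keys": ["project$"],
         "Metrics": {"UnblendedCost": {"Amount": "30.25", "Unit": "USD"}}},
    ]},
    {"Groups": [
        {"Keys": ["project$user-embedding-v2"],
         "Metrics": {"UnblendedCost": {"Amount": "79.5", "Unit": "USD"}}},
    ]},
]}
print(summarize_cost_by_project(sample_response))
```

The size of the `untagged` bucket is a useful health metric in its own right: it shows how much spend the tagging policy has not yet reached.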
Q: When should you buy Savings Plans vs Reserved Instances for ML?
A: Almost always Savings Plans for ML workloads. RIs lock you to a specific instance type and region. Savings Plans commit to a $/hour spend level and apply to any matching compute. Since ML hardware is evolving fast - A100 to H100, V100 to A10G - you don't want to be locked into a specific instance family for 1–3 years. The only exception: if you have very stable, long-running serving infrastructure on a specific instance type (unlikely for ML), RIs might give marginally better discounts. In practice, I always recommend Savings Plans for ML and RIs for traditional web services where instance type stability is higher.
Q: What's the most impactful FinOps action for a team with no current cost visibility?
A: Tagging and idle resource cleanup, in that order. Tagging first because without it, nothing else is attributable. Then idle resource cleanup - a sweep of all running resources against last 7 days of CloudWatch metrics. In my experience, teams with no FinOps practice have 15–25% of their spend on idle or abandoned resources. You can find this in a week and eliminate it immediately. The psychological effect is also important: when the team sees $15K of waste cleaned up in week one, they buy into the FinOps process. Commitment discounts (Savings Plans) come next - they're free savings once you've established your baseline.
Q: How do you handle the tension between engineering velocity and cost governance?
A: Governance should add friction to waste, not to work. The way I design it: zero friction for small experiments (under roughly $100–$1,000). The tagging and budget alert system makes the cost of each experiment visible without blocking it. Engineers can still start experiments freely - but they see the cost in real time and get an alert when they hit 80% of their experiment budget. This creates cost awareness without slowing down research. The key principle: make the default behavior cost-conscious, not cost-blocked.
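The 80% alert in that answer is easy to express as a status check that informs without blocking. This is a minimal sketch; the function name and the status strings are hypothetical, and what happens on `"over_budget"` (notify, require sign-off, or nothing) is a policy decision left to the team.

```python
def experiment_budget_status(
    spent_usd: float,
    budget_usd: float,
    warn_fraction: float = 0.80,  # alert at 80% of the experiment budget
) -> str:
    """Classify an experiment's spend as "ok", "warn", or "over_budget"."""
    if budget_usd <= 0:
        raise ValueError("budget_usd must be positive")
    used = spent_usd / budget_usd
    if used >= 1.0:
        return "over_budget"
    if used >= warn_fraction:
        return "warn"
    return "ok"
```

Because the function only returns a status, the caller decides whether to send a message or take action, which keeps the default behavior cost-conscious rather than cost-blocked.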
