
Cloud FinOps for ML

3× Over Budget: A Four-Week Recovery

The ML team's quarterly cloud bill arrived. It was $847,000. The approved budget had been $280,000. The CTO sent a single-line email to the ML platform lead: "We need to talk."

The overage was not caused by a single event. It was the accumulated result of twelve months of decisions that each seemed reasonable in isolation: spinning up a development cluster that was never torn down, running hyperparameter searches on on-demand GPUs because spot configuration was complex, keeping old model versions "just in case," not tagging resources so nobody could tell who was running what.

The CTO asked for a recovery plan in 48 hours and a bill under $300,000 in four weeks.

The first 48 hours were spent on forensics: getting AWS Cost Explorer to attribute costs by resource tag, identifying the largest line items, mapping each to a team and use case. The results were uncomfortable. $180,000 had been spent on instances that were running with less than 15% utilization. $120,000 on development and experiment clusters that should have been on spot pricing. $95,000 on S3 storage for experiment artifacts that no active model was using.

Week 1: Terminate idle instances. Immediate savings: $23,000/month.

Week 2: Move development training to spot. Immediate savings: $28,000/month.

Week 3: Implement S3 lifecycle policies for experiment artifacts. Savings: $8,000/month.

Week 4: Purchase reserved instances for stable production workloads. Locked in 40% discount on baseline inference cost.

Month-end bill: $247,000 - under target. Annualized run rate: $2.96M vs. the previous trajectory of $10.2M.

This lesson covers the framework that makes this kind of rapid intervention possible - and the organizational practices that prevent the situation from recurring.



Why This Exists: Cloud Spending Without Governance Compounds

Cloud infrastructure has a fundamental property that traditional IT spending doesn't: the cost is variable, provisioned on demand, and grows silently in proportion to usage. A traditional server purchase requires a purchase order, approval, and procurement process. An EC2 instance launch requires one API call. The friction is gone. The guardrails must be built deliberately.

ML teams are particularly prone to uncontrolled cloud spending for three reasons:

Experimentation culture: ML development is inherently experimental. Researchers spin up clusters, try things, and move on - often leaving resources running.

Infrastructure complexity: ML systems span many services: compute, storage, networking, managed services, data transfer. The cost interactions between services are non-obvious. Optimizing EC2 without optimizing S3 data transfer can produce minimal real savings.

Lack of ownership: When infrastructure is shared and not tagged, no individual or team feels responsible for the costs. Nobody gets the bill; nobody optimizes.

FinOps (Financial Operations) is the practice that addresses this: bringing financial accountability, visibility, and optimization rigor to cloud spending.


Historical Context

FinOps as a formalized practice emerged around 2018–2019, driven by technology companies that had grown large enough for cloud costs to materially affect their margins. The FinOps Foundation was established in 2019 to define standards and certifications.

For ML specifically, the FinOps challenge became acute with the rise of large GPU-based workloads. GPU instances are 5–30× more expensive than CPU instances per hour. A team that applied CPU-era practices (on-demand, always-on, no cost attribution) to GPU workloads quickly ran up bills an order of magnitude higher than expected.

AWS Savings Plans (introduced 2019) and the equivalents on GCP and Azure provided a more flexible alternative to Reserved Instances - committing to an hourly spend amount rather than specific instance types, which better accommodated ML workloads that change instance types frequently.

The "ML FinOps" framing was popularized around 2022–2023 by MLOps practitioners who applied FinOps principles specifically to the ML lifecycle - where the cost drivers (training, inference, feature pipelines, storage) have very different characteristics that require different optimization strategies.


Core Concepts

The ML FinOps Maturity Model

Level 1 - Reactive: The team discovers costs from the monthly bill. No attribution by team or model. No alerts on spending anomalies. Optimization is reactive (post-budget-overage) rather than proactive.

Level 2 - Informed: All resources are tagged. Costs are attributable by team, project, model, and lifecycle phase. Daily cost reports are visible to engineering leads. Spending anomalies trigger alerts within 24 hours.

Level 3 - Optimized: Engineering teams have monthly cost budgets per project. Cost forecasting is automated based on usage trends. Reserved instance coverage is managed. Engineering incentives include cost efficiency alongside model quality metrics.

Most ML teams are at Level 1. The goal of this lesson is to move to Level 3.

Reserved Instances and Savings Plans

Cloud providers offer significant discounts for committed spending:

On-demand: Pay per hour, no commitment, full price. Use for: short-running experiments, unpredictable workloads, new projects.

Reserved Instances (RI): Commit to a specific instance type, region, and term (1 or 3 years). Typical discounts: roughly 30–40% for a 1-year term and roughly 50–70% for a 3-year, all-upfront term. Use for: stable, always-on inference workloads where the instance type is known.

Savings Plans: Commit to an hourly spend amount (not a specific instance). More flexible than RIs - Compute Savings Plans apply across EC2, Fargate, and Lambda usage regardless of instance family or region. Discounts: similar to RIs. Best for ML teams that change instance types frequently.

Spot / Preemptible: Discounts of 60–80% in exchange for interruption risk with 2-minute notice. Use for: training workloads with checkpoint-and-restart, batch inference jobs, feature pipeline compute.
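A reservation is billed for every hour of the term whether or not the instance runs, so it only beats on-demand above a break-even utilization of reserved rate divided by on-demand rate. The sketch below works that out for one instance. The function names, the 730-hour month, and the hourly rates are illustrative assumptions, not provider quotes.

```python
HOURS_PER_MONTH = 730.0  # assumption: average number of hours in a month


def reservation_break_even(on_demand_rate: float, reserved_rate: float) -> float:
    """Fraction of the month an instance must run for a reservation to cost less."""
    if on_demand_rate <= 0:
        raise ValueError("on_demand_rate must be positive")
    return reserved_rate / on_demand_rate


def cheaper_option(hours_used: float, on_demand_rate: float, reserved_rate: float) -> str:
    """Compare paying on-demand for the hours used against paying reserved for the whole month."""
    on_demand_cost = hours_used * on_demand_rate
    reserved_cost = HOURS_PER_MONTH * reserved_rate  # billed even for idle hours
    return "reserved" if reserved_cost < on_demand_cost else "on_demand"


# Illustrative hourly rates in USD (assumed, not current list prices)
break_even = reservation_break_even(on_demand_rate=3.912, reserved_rate=2.35)
print(f"Break-even utilization: {break_even:.1%}")
print("400 h/month:", cheaper_option(400, 3.912, 2.35))
print("600 h/month:", cheaper_option(600, 3.912, 2.35))
```

Under these assumed rates, an instance that runs less than about 60% of the month is cheaper on-demand, which is why reservations suit always-on inference rather than bursty experiments.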

import pandas as pd
from dataclasses import dataclass
from typing import List

@dataclass
class InstanceCoverage:
    instance_type: str
    monthly_hours: float  # total instance-hours per month for this group
    on_demand_rate: float  # USD per instance-hour
    reserved_rate_1yr_no_upfront: float
    spot_rate: float
    workload_type: str  # "always_on", "scheduled_batch", "experimental"

def compute_coverage_savings(instances: List[InstanceCoverage]) -> pd.DataFrame:
    """
    Compute savings from matching each workload type to a pricing model
    (reserved for always-on, spot for batch and experimental).
    """
    results = []

    for inst in instances:
        on_demand_monthly = inst.monthly_hours * inst.on_demand_rate

        if inst.workload_type == "always_on":
            # Always-on: buy reserved (1-year, no upfront for flexibility)
            optimal_monthly = inst.monthly_hours * inst.reserved_rate_1yr_no_upfront
            strategy = "reserved_1yr"

        elif inst.workload_type == "scheduled_batch":
            # Scheduled batch: use spot with checkpoint-and-restart
            # Assume 10% interruption rate - adds 10% to effective hours
            effective_hours = inst.monthly_hours * 1.10
            optimal_monthly = effective_hours * inst.spot_rate
            strategy = "spot_with_checkpoints"

        else:  # experimental
            # Experiments: use spot aggressively, accept interruptions
            # (the 1.2 factor assumes 20% extra hours lost to reruns)
            optimal_monthly = inst.monthly_hours * inst.spot_rate * 1.2
            strategy = "spot_experimental"

        monthly_savings = on_demand_monthly - optimal_monthly

        results.append({
            "instance_type": inst.instance_type,
            "workload_type": inst.workload_type,
            "on_demand_monthly": round(on_demand_monthly, 0),
            "optimal_monthly": round(optimal_monthly, 0),
            "monthly_savings": round(monthly_savings, 0),
            "savings_pct": round(monthly_savings / on_demand_monthly * 100, 1),
            "strategy": strategy
        })

    return pd.DataFrame(results)


# Example fleet analysis
fleet = [
    InstanceCoverage("g4dn.12xlarge", 730*4, 3.912, 2.35, 1.18, "always_on"),
    InstanceCoverage("p3.8xlarge", 72*8, 12.24, 7.34, 3.67, "scheduled_batch"),
    InstanceCoverage("p3.8xlarge", 40*8, 12.24, 7.34, 3.67, "experimental"),
    InstanceCoverage("m5.4xlarge", 730*6, 0.768, 0.461, 0.23, "always_on"),
]

coverage_df = compute_coverage_savings(fleet)
print(coverage_df)
print(f"\nTotal monthly on-demand: ${coverage_df['on_demand_monthly'].sum():,.0f}")
print(f"Total monthly optimized: ${coverage_df['optimal_monthly'].sum():,.0f}")
print(f"Monthly savings: ${coverage_df['monthly_savings'].sum():,.0f}")

Tagging Strategy for Cost Attribution

Without consistent resource tagging, cloud cost attribution is impossible. Every resource - EC2, S3, EKS, RDS, Lambda - must be tagged at creation.

# Tagging policy: required tags and their allowed values
TAGGING_POLICY = {
    "required_tags": [
        "team",             # which team owns this resource
        "project",          # which ML project
        "model_id",         # which specific model (or "shared" for platform resources)
        "environment",      # production, staging, development, experiment
        "lifecycle_phase",  # training, inference, feature_pipeline, monitoring
        "cost_center",      # finance reporting unit
    ],
    "optional_tags": [
        "experiment_id",    # specific hyperparameter search or A/B test
        "created_by",       # engineer or automation
        "expires_at",       # ISO 8601 date - for auto-termination
    ],
    "tag_validation": {
        "environment": ["production", "staging", "development", "experiment"],
        "lifecycle_phase": ["training", "inference", "feature_pipeline",
                            "monitoring", "data_pipeline", "development"],
    }
}

def validate_tags(tags: dict, policy: dict = TAGGING_POLICY) -> dict:
    """Validate a tag set against the policy. Returns {valid: bool, errors: list}."""
    errors = []

    # Check required tags
    for required_tag in policy["required_tags"]:
        if required_tag not in tags or not tags[required_tag]:
            errors.append(f"Missing required tag: {required_tag}")

    # Validate allowed values
    for tag, allowed_values in policy["tag_validation"].items():
        if tag in tags and tags[tag] not in allowed_values:
            errors.append(
                f"Invalid value for tag '{tag}': '{tags[tag]}'. "
                f"Allowed: {allowed_values}"
            )

    return {"valid": len(errors) == 0, "errors": errors}


# In CI/CD and infrastructure-as-code, validate tags before resource creation
def enforce_tagging_in_terraform(resource_config: dict) -> bool:
    """
    Called as a pre-creation hook in the ML platform's Terraform automation.
    Returns True if tags are valid, raises ValueError otherwise.
    """
    tags = resource_config.get("tags", {})
    validation = validate_tags(tags)

    if not validation["valid"]:
        raise ValueError(
            "Resource creation blocked - invalid tags:\n" +
            "\n".join(f" - {e}" for e in validation["errors"])
        )

    return True
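
The optional expires_at tag in the policy above only helps if something acts on it. Below is a minimal sketch of the selection step for an auto-termination job, assuming each resource is a dict with an `id` and a `tags` mapping. That shape, and the helper names, are hypothetical. The actual termination call is deliberately left out.

```python
from datetime import datetime, timezone
from typing import Dict, List, Optional


def parse_expiry(value: str) -> datetime:
    """Parse an ISO 8601 date or datetime; naive values are assumed to be UTC."""
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def find_expired_resources(
    resources: List[dict], now: Optional[datetime] = None
) -> Dict[str, List[str]]:
    """Split resources into expired IDs and IDs whose expires_at tag cannot be parsed."""
    now = now or datetime.now(timezone.utc)
    expired, unparseable = [], []
    for resource in resources:
        raw = resource.get("tags", {}).get("expires_at")
        if not raw:
            continue  # no expiry requested; leave the resource alone
        try:
            if parse_expiry(raw) <= now:
                expired.append(resource["id"])
        except ValueError:
            unparseable.append(resource["id"])  # surface for manual review
    return {"expired": expired, "unparseable": unparseable}


# Hypothetical inventory records
inventory = [
    {"id": "i-aaa", "tags": {"expires_at": "2025-05-01"}},
    {"id": "i-bbb", "tags": {"expires_at": "2025-07-01T00:00:00+00:00"}},
    {"id": "i-ccc", "tags": {}},
    {"id": "i-ddd", "tags": {"expires_at": "next week"}},
]
print(find_expired_resources(inventory, now=datetime(2025, 6, 1, tzinfo=timezone.utc)))
```

Reporting unparseable tags separately, rather than treating them as expired, is a conservative choice: a typo in a tag should trigger a review, not a termination.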

ML FinOps Dashboard

The FinOps dashboard is the team's primary tool for cost visibility. It should show costs by model, team, and phase - updated daily.

import calendar
import pandas as pd
from datetime import date, timedelta
from typing import Optional

def build_finops_dashboard(
    cost_explorer_data: pd.DataFrame,  # date, service, tag_team, tag_model, lifecycle_phase, cost_usd
    budget_by_team: dict,              # {team: monthly_budget_usd}
    current_month_start: Optional[date] = None
) -> dict:
    """
    Build a FinOps dashboard from AWS Cost Explorer data.
    """
    if current_month_start is None:
        today = date.today()
        current_month_start = today.replace(day=1)

    # Filter to current month
    month_data = cost_explorer_data[
        pd.to_datetime(cost_explorer_data["date"]).dt.date >= current_month_start
    ]

    # Cost by team
    cost_by_team = month_data.groupby("tag_team")["cost_usd"].sum().to_dict()

    # Cost by model
    cost_by_model = month_data.groupby(["tag_team", "tag_model"])["cost_usd"].sum().reset_index()

    # Cost by lifecycle phase (requires a lifecycle_phase column in the input)
    cost_by_phase = month_data.groupby("lifecycle_phase")["cost_usd"].sum().to_dict()

    # Budget burn rate per team
    days_elapsed = (date.today() - current_month_start).days + 1
    # Actual length of the month being reported, instead of a fixed 30
    days_in_month = calendar.monthrange(
        current_month_start.year, current_month_start.month
    )[1]

    burn_rate_analysis = []
    for team, monthly_budget in budget_by_team.items():
        spent = cost_by_team.get(team, 0)
        daily_run_rate = spent / days_elapsed
        projected_monthly = daily_run_rate * days_in_month
        pct_of_budget = spent / monthly_budget * 100

        burn_rate_analysis.append({
            "team": team,
            "spent_to_date_usd": round(spent, 0),
            "monthly_budget_usd": monthly_budget,
            "pct_of_budget_used": round(pct_of_budget, 1),
            "daily_run_rate_usd": round(daily_run_rate, 0),
            "projected_monthly_usd": round(projected_monthly, 0),
            "projected_vs_budget_pct": round(projected_monthly / monthly_budget * 100, 1),
            "status": (
                "OVER_BUDGET" if projected_monthly > monthly_budget * 1.1 else
                "AT_RISK" if projected_monthly > monthly_budget * 0.9 else
                "ON_TRACK"
            )
        })

    return {
        "month": str(current_month_start),
        "cost_by_team": cost_by_team,
        "cost_by_model": cost_by_model.to_dict(orient="records"),
        "cost_by_phase": cost_by_phase,
        "burn_rate_by_team": burn_rate_analysis,
        "total_month_to_date": round(month_data["cost_usd"].sum(), 0)
    }
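
build_finops_dashboard expects a flat table, while the Cost Explorer API returns nested JSON. The sketch below shows one way to flatten a `get_cost_and_usage` response that was grouped by a single tag. The function names are hypothetical, it assumes the tag has been activated as a cost allocation tag, and it ignores pagination (`NextPageToken`) for brevity.

```python
from typing import Dict, List


def flatten_cost_explorer_response(response: dict, tag_key: str = "team") -> List[Dict]:
    """Turn a tag-grouped Cost Explorer response into one row per (day, tag value)."""
    rows = []
    for day in response.get("ResultsByTime", []):
        day_start = day["TimePeriod"]["Start"]
        for group in day.get("Groups", []):
            # Tag group keys arrive as "<tag_key>$<value>"; an empty value means untagged.
            raw_key = group["Keys"][0]
            value = raw_key.split("$", 1)[1] if "$" in raw_key else raw_key
            rows.append({
                "date": day_start,
                f"tag_{tag_key}": value or "untagged",
                "cost_usd": float(group["Metrics"]["UnblendedCost"]["Amount"]),
            })
    return rows


def fetch_daily_costs_by_tag(start: str, end: str, tag_key: str = "team") -> List[Dict]:
    """Query Cost Explorer for daily unblended cost grouped by one tag (needs AWS credentials)."""
    import boto3  # imported lazily so the parser above works without boto3 installed

    client = boto3.client("ce")
    response = client.get_cost_and_usage(
        TimePeriod={"Start": start, "End": end},  # dates as YYYY-MM-DD, end exclusive
        Granularity="DAILY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "TAG", "Key": tag_key}],
    )
    return flatten_cost_explorer_response(response, tag_key)


# Hand-written sample in the response shape, for illustration only
sample = {
    "ResultsByTime": [{
        "TimePeriod": {"Start": "2025-06-01", "End": "2025-06-02"},
        "Groups": [
            {"Keys": ["team$fraud"], "Metrics": {"UnblendedCost": {"Amount": "123.45", "Unit": "USD"}}},
            {"Keys": ["team$"], "Metrics": {"UnblendedCost": {"Amount": "10.0", "Unit": "USD"}}},
        ],
    }]
}
for row in flatten_cost_explorer_response(sample):
    print(row)
```

The resulting rows can be passed to `pd.DataFrame(rows)`. Keeping an explicit "untagged" bucket makes unattributed spend visible instead of silently dropping it.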

Cost Anomaly Detection

Manual review of cost dashboards misses sudden spikes. Automated anomaly detection alerts on unexpected cost increases before they compound.

import pandas as pd
from datetime import timedelta
from scipy import stats  # used by the forecasting code later in this lesson

def detect_cost_anomalies(
    daily_costs: pd.DataFrame,  # columns: date (datetime or date values), tag_team, cost_usd
    lookback_days: int = 30,
    z_score_threshold: float = 3.0,  # alert if cost > 3 standard deviations above mean
    alert_fn=None
) -> list:
    """
    Detect sudden cost anomalies using z-score on rolling historical costs.
    """
    alerts = []
    today = daily_costs["date"].max()
    lookback_start = today - timedelta(days=lookback_days)

    for team in daily_costs["tag_team"].unique():
        team_data = daily_costs[daily_costs["tag_team"] == team].sort_values("date")

        # Historical window (excluding today)
        historical = team_data[
            (team_data["date"] >= lookback_start) &
            (team_data["date"] < today)
        ]["cost_usd"]

        if len(historical) < 7:
            continue  # not enough history

        today_cost = team_data[team_data["date"] == today]["cost_usd"].values
        if len(today_cost) == 0:
            continue

        today_cost = today_cost[0]
        hist_mean = historical.mean()
        hist_std = historical.std()

        # Small epsilon avoids division by zero when historical costs are constant
        z_score = (today_cost - hist_mean) / (hist_std + 1e-9)

        if z_score > z_score_threshold:
            alert = {
                "team": team,
                "date": str(today),
                "today_cost": round(today_cost, 0),
                "historical_mean": round(hist_mean, 0),
                "z_score": round(z_score, 2),
                "pct_above_mean": round((today_cost - hist_mean) / hist_mean * 100, 1)
            }
            alerts.append(alert)
            if alert_fn:
                alert_fn(alert)

    return alerts

ML Budget Forecasting

Forward-looking budget forecasting uses cost trends and growth projections to predict future spend.

import numpy as np
import pandas as pd
from scipy import stats

def forecast_ml_costs(
    historical_monthly_costs: pd.DataFrame,  # month, cost_usd
    growth_rate_pct_per_month: float = 5.0,  # expected traffic growth
    efficiency_improvement_pct: float = 2.0,  # expected cost efficiency gains
    forecast_months: int = 6
) -> pd.DataFrame:
    """
    Forecast ML costs over the next N months using trend + growth + efficiency adjustments.
    """
    # Fit linear trend to historical costs
    months = np.arange(len(historical_monthly_costs))
    costs = historical_monthly_costs["cost_usd"].values

    slope, intercept, _, _, _ = stats.linregress(months, costs)

    forecast_rows = []

    for i in range(1, forecast_months + 1):
        # Trend projection (historical months are indexed 0..n-1, so the first forecast is index n)
        trend_cost = intercept + slope * (len(months) + i - 1)

        # Adjust for growth and efficiency
        growth_factor = (1 + growth_rate_pct_per_month / 100) ** i
        efficiency_factor = (1 - efficiency_improvement_pct / 100) ** i
        adjusted_cost = trend_cost * growth_factor * efficiency_factor

        # Simple uncertainty band that widens linearly to ±20% at the final forecast month
        uncertainty = 0.20 * (i / forecast_months)

        forecast_rows.append({
            "month": f"Month+{i}",
            "forecast_cost_usd": round(adjusted_cost, 0),
            "low_estimate_usd": round(adjusted_cost * (1 - uncertainty), 0),
            "high_estimate_usd": round(adjusted_cost * (1 + uncertainty), 0),
            "growth_assumption_pct": growth_rate_pct_per_month,
            "efficiency_assumption_pct": efficiency_improvement_pct,
        })

    return pd.DataFrame(forecast_rows)

Production Engineering Notes

Implement tagging in infrastructure-as-code, not manually: Tags applied manually are inconsistently applied. Require tags in Terraform/Pulumi/CloudFormation templates and validate in CI before any apply. Resources without valid tags should not be created.

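Reserved instance reviews quarterly: Instance type requirements change as models evolve. Reserved instances for a GPU type you no longer use are wasted commitments. Review RI utilization and coverage quarterly. Unused Standard RIs can be sold in the Reserved Instance Marketplace and Convertible RIs can be exchanged for a different configuration; RIs cannot be converted into Savings Plans, so shift new commitments to Savings Plans as old reservations expire.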

Cost alerts within 24 hours of anomaly: A cluster that was accidentally left running should trigger an alert before it runs for a week. Set alerts at 150% of the team's daily average spend, with hard limits at 200%.
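
The 150% and 200% thresholds from the last note can be expressed as a simple ratio check against a trailing daily average. This is a minimal sketch: the function name, the status labels, and the choice of a plain mean as the baseline are assumptions, and what a "hard limit" actually does (page someone, stop instances) is left to the platform.

```python
from statistics import mean
from typing import Sequence


def classify_daily_spend(
    today_cost: float,
    trailing_daily_costs: Sequence[float],
    alert_ratio: float = 1.5,  # alert at 150% of the daily average
    hard_ratio: float = 2.0,   # hard limit at 200% of the daily average
) -> str:
    """Return "OK", "ALERT", "HARD_LIMIT", or "NO_BASELINE" for one team's spend today."""
    if not trailing_daily_costs:
        return "NO_BASELINE"
    baseline = mean(trailing_daily_costs)
    if baseline <= 0:
        return "NO_BASELINE"
    ratio = today_cost / baseline
    if ratio >= hard_ratio:
        return "HARD_LIMIT"
    if ratio >= alert_ratio:
        return "ALERT"
    return "OK"


# Hypothetical week of daily spend for one team, in USD
history = [950.0, 1000.0, 1050.0, 1000.0, 980.0, 1020.0, 1000.0]
for cost in (1400.0, 1600.0, 2100.0):
    print(cost, classify_daily_spend(cost, history))
```

A fixed ratio is easier to explain to teams than a z-score, at the price of more false alarms for teams whose spend is naturally spiky.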


Common Mistakes

:::danger Purchasing reserved instances before establishing baseline usage patterns
Reserved instances require a 1- or 3-year commitment to a specific instance type. If you purchase reservations before you know what instance types you actually use in production (e.g., before your first production deployment), you'll often find you reserved the wrong type. Establish 3–6 months of production usage first, then purchase reservations for the stable baseline.
:::

:::danger Treating spot instance savings as guaranteed
Spot instances can be interrupted at any time with 2-minute notice. If your training pipeline doesn't have checkpoint-and-restart implemented, a spot interruption at hour 47 of a 48-hour training run throws away 47 hours of paid compute and forces a full re-run, which can cost more than running the job on-demand once. Calculate savings after accounting for your expected interruption rate and restart overhead.
:::

:::warning Not reviewing reserved instance utilization
A reserved instance that is unused still charges you at the reserved rate. Teams that buy reservations for large GPU instances and then switch to a different instance family for production are paying for unused reservations. Check RI utilization monthly in Cost Explorer.
:::

:::tip Savings Plans before Reserved Instances for ML workloads
ML workloads change instance types more often than standard application workloads (different GPU generations for different models). Compute Savings Plans provide similar discounts to RIs but apply across all EC2 instance types. This flexibility is valuable for ML teams. Buy Savings Plans first; use specific RIs only for resources you know will be stable for at least 1 year.
:::
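
The spot warning above asks for savings net of interruptions. One rough way to estimate that is sketched below. It assumes each interruption loses, on average, half a checkpoint interval of work plus a fixed restart overhead. The function name, the rates, and the interruption count are illustrative assumptions, and modelling "no checkpoints" as an interval equal to the full job length is a simplification.

```python
def effective_spot_cost(
    job_hours: float,
    spot_rate: float,
    on_demand_rate: float,
    expected_interruptions: float,
    checkpoint_interval_hours: float,
    restart_overhead_hours: float = 0.25,
) -> dict:
    """Estimate spot cost and savings after paying for work lost to interruptions."""
    # Average loss per interruption: half a checkpoint interval, plus time to restart
    lost_per_interruption = checkpoint_interval_hours / 2 + restart_overhead_hours
    billed_hours = job_hours + expected_interruptions * lost_per_interruption
    spot_cost = billed_hours * spot_rate
    on_demand_cost = job_hours * on_demand_rate
    return {
        "billed_hours": billed_hours,
        "spot_cost": round(spot_cost, 2),
        "on_demand_cost": round(on_demand_cost, 2),
        "savings_pct": round((1 - spot_cost / on_demand_cost) * 100, 1),
    }


# 48-hour training run with 3 expected interruptions (assumed numbers)
with_checkpoints = effective_spot_cost(48, 3.67, 12.24, 3, checkpoint_interval_hours=1)
no_checkpoints = effective_spot_cost(48, 3.67, 12.24, 3, checkpoint_interval_hours=48)
print("hourly checkpoints:", with_checkpoints)
print("no checkpoints:    ", no_checkpoints)
```

Under these assumptions, hourly checkpoints keep most of the headline discount, while the same job without checkpoints gives up the majority of it.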


Interview Q&A

Q: What is the ML FinOps maturity model and where do most ML teams sit?

A: The ML FinOps maturity model has three levels. Level 1 (Reactive): costs are discovered from the monthly bill, no attribution by team or model, optimization is reactive. Most ML teams are here - they know the aggregate cloud bill but can't answer "what does the fraud model cost?" Level 2 (Informed): all resources are tagged, daily cost reports are visible to engineering leads, anomalies trigger alerts within 24 hours. Some mature ML platform teams are here. Level 3 (Optimized): teams have monthly cost budgets per project, cost efficiency is part of engineering metrics, reserved instance coverage is actively managed, and cost forecasting is automated. This requires both technical infrastructure (tagging, dashboards) and organizational change (team budgets, incentives).

Q: What is the difference between reserved instances and savings plans for ML workloads?

A: Reserved instances commit to a specific instance type, operating system, and region for 1 or 3 years, providing 30–70% discounts. They are inflexible - if you switch from p3.8xlarge to p4d.24xlarge GPUs, your reserved p3 instances continue to charge you. Savings Plans commit to an hourly spend amount (e.g., $10/hour) instead. Compute Savings Plans apply to any EC2 instance family, size, and region; EC2 Instance Savings Plans give a deeper discount but are tied to one instance family in one region. Both offer discounts similar to RIs with more flexibility. For ML teams where instance types change with model architecture, Savings Plans are usually more practical. Use specific Reserved Instances only for resources you are confident will be stable for the commitment period - typically the inference fleet for a mature, stable model.

Q: How would you investigate and fix a cloud bill 3× over budget in 4 weeks?

A: Week 1 is forensics. Use AWS Cost Explorer filtered by resource tags to identify the top 10 cost drivers. If tagging is poor, use the resource inventory to map untagged resources to teams. Find the biggest line items - often idle clusters, undeleted development environments, or on-demand GPU instances that should be on spot. Terminate idle resources immediately - this produces the largest immediate savings. Week 2 is quick wins. Move all development and training workloads to spot instances. Implement S3 lifecycle policies to move old experiment artifacts to cheaper storage tiers or delete them. Fix any obviously overprovisioned resources (CPU models running on GPU instances). Week 3 is systemic fixes. Implement mandatory tagging, cost alerts at 150% of baseline, and resource auto-termination for resources tagged with an expires_at date. Week 4 is prevention. Purchase reserved instances or savings plans for the stable production baseline to lock in discounts. Establish monthly team budget reviews.
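
The S3 lifecycle step in this answer can be sketched as a rule that tiers and then expires experiment artifacts. The bucket prefix, rule ID, and day thresholds below are assumptions to adapt. Note that `put_bucket_lifecycle_configuration` replaces the bucket's existing lifecycle configuration, so any existing rules need to be merged in first.

```python
def build_artifact_lifecycle(
    prefix: str = "experiments/",  # hypothetical key prefix for experiment artifacts
    ia_after_days: int = 30,       # move to infrequent access
    glacier_after_days: int = 90,  # then to archive storage
    expire_after_days: int = 365,  # finally delete
) -> dict:
    """Build an S3 lifecycle configuration that tiers, then deletes, old artifacts."""
    if not ia_after_days < glacier_after_days < expire_after_days:
        raise ValueError("thresholds must be strictly increasing")
    return {
        "Rules": [{
            "ID": "tier-and-expire-experiment-artifacts",
            "Filter": {"Prefix": prefix},
            "Status": "Enabled",
            "Transitions": [
                {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                {"Days": glacier_after_days, "StorageClass": "GLACIER"},
            ],
            "Expiration": {"Days": expire_after_days},
        }]
    }


def apply_lifecycle(bucket: str, config: dict) -> None:
    """Apply the configuration to a bucket (needs AWS credentials; overwrites existing rules)."""
    import boto3  # imported lazily so the builder can be tested without boto3

    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=config
    )


config = build_artifact_lifecycle()
print(config["Rules"][0]["Transitions"])
```

Scoping the rule to a prefix keeps production model artifacts out of reach of the expiration, which matters if serving still loads them from the same bucket.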

Q: How do you set cost budgets for ML teams without stifling experimentation?

A: Distinguish between production costs (which should be tightly budgeted) and experiment costs (which need flexibility but guardrails). Production budgets are based on historical costs with explicit sign-off for changes. Experiment budgets are a separate pool with a monthly cap per team - for example, $5,000/month per team for compute experiments. Within the experiment budget, teams have autonomy. To prevent runaway costs: implement hard instance count limits for development environments, require an expires_at tag on all experiment resources (enforced by auto-termination), and use cost alerts at 80% of the monthly experiment budget. This structure gives teams freedom to experiment while making the cost of each experiment visible and bounded.

Q: What tagging strategy would you implement for a multi-team ML platform?

A: The minimum viable tag set for ML cost attribution: team (which team owns the resource), project (which ML project), model_id (which specific model, or "shared" for platform resources), environment (production/staging/development/experiment), lifecycle_phase (training/inference/feature_pipeline/monitoring), and cost_center (finance reporting unit). Enforce these tags in infrastructure-as-code templates - any resource that fails tag validation should not be created. Validate in CI before any Terraform apply. Apply the same tags to S3 buckets (for storage cost attribution), EKS namespaces (for compute), and RDS instances (for feature store). Review monthly for untagged resources and establish a weekly process to tag or terminate them. After 60 days of consistent tagging, you can produce per-model cost reports from AWS Cost Explorer with no additional effort.

© 2026 EngineersOfAI. All rights reserved.