ML Infrastructure Cost Model
The CTO's Question
The CTO walked into the weekly ML platform meeting with a single question: "How much does our recommendation system cost per request?"
The room went quiet. The ML platform team knew what the system cost in aggregate - roughly $340,000 per year in cloud infrastructure. But "per request" was different. That required knowing the total request volume, which required coordinating with the analytics team. And it required attributing costs specifically to the recommendation system, not the whole ML infrastructure shared by twelve different models. And it required deciding what to include: just the inference servers, or also the training pipeline, the feature store, the experiment tracking system, the data pipeline, the monitoring infrastructure?
The engineering lead said he'd have an answer by end of week. It took three weeks, three teams, and two data warehouse queries to produce a number with reasonable confidence. The answer was roughly $0.0004 per request, at 72 million requests per month. Nobody had known this number existed.
The CTO's follow-up question was worse: "Is that good or bad?"
This lesson is about building the cost model that makes both questions answerable in under an hour - and makes the answer defensible.
Why This Exists: The Invisibility of ML Costs
Most engineering costs are naturally measured per unit of work. A software API has a cost per request. A database has a cost per query. These measurements arise naturally because requests and queries are discrete, countable events that trigger resource consumption.
ML costs are harder to attribute because they are spread across a lifecycle with very different cost structures at each phase:
- Training is a large, periodic, batch cost - a single training run might cost $5,000 and happen once a month
- Inference is a continuous, per-request cost - small per request but enormous in aggregate at scale
- Feature pipelines run on a schedule and produce inputs that many models share - hard to attribute to any single model
- Storage accumulates silently - experiment artifacts, model weights, training data, feature store snapshots
- Managed services have pricing models that don't map cleanly to usage (flat fee, tiered pricing, per-seat)
Without a cost model, ML teams cannot make economically rational decisions. They cannot evaluate whether a 2% accuracy improvement is worth a 40% increase in compute cost. They cannot compare the cost of training a larger model with hiring an engineer to do feature engineering on a smaller model. They cannot answer the CTO's question.
Historical Context
The discipline of ML cost management is young. Before cloud computing, ML training ran on on-premises hardware that was amortized over years and rarely attributed to specific projects. The marginal cost of a training run was invisible.
Cloud computing changed this by making costs explicit and variable. Suddenly, each EC2 instance or TPU hour had a direct dollar cost visible on a monthly bill. But cloud bills are organized by service (EC2, S3, EKS, RDS), not by model or team. Attributing costs requires tagging, which requires organizational discipline.
The concept of "ML unit economics" entered the vocabulary around 2019–2020, driven by the explosion in LLM costs. Training GPT-3 (OpenAI, 2020) reportedly cost $4.6M. This made the question "what does this model cost to build and operate?" suddenly urgent for any organization building large-scale ML systems.
The emergence of MLOps as a discipline brought cost management into scope alongside deployment, monitoring, and governance. FinOps (Financial Operations) - originally a cloud infrastructure practice - began to be applied specifically to ML workloads around 2021–2022.
Core Concepts
The ML Cost Stack
ML infrastructure cost breaks down into six categories. Understanding each category is prerequisite to optimizing any of them.
Building a Cost Model from Scratch
A cost model translates infrastructure resource consumption into dollar amounts, attributed to specific models and lifecycle phases.

from dataclasses import dataclass, field
from typing import Dict, List, Optional

import pandas as pd


@dataclass
class ComputeResourceCost:
    """Cost parameters for a specific compute resource."""
    resource_type: str              # "gpu_a100", "cpu_m5.4xlarge", etc.
    hourly_rate_usd: float          # on-demand rate
    spot_discount: float = 0.7      # fraction off on-demand; spot price is ~30% of on-demand
    reserved_discount: float = 0.4  # 1-year reserved, fraction off; price is ~60% of on-demand


# Cloud pricing reference (AWS us-east-1, approximate)
RESOURCE_COSTS = {
    "p4d.24xlarge_8xa100": ComputeResourceCost("p4d.24xlarge_8xa100", 32.77, 0.7, 0.4),
    "p3.16xlarge_8xv100": ComputeResourceCost("p3.16xlarge_8xv100", 24.48, 0.7, 0.4),
    "g4dn.xlarge_1xt4": ComputeResourceCost("g4dn.xlarge_1xt4", 0.526, 0.7, 0.4),
    "g4dn.12xlarge_4xt4": ComputeResourceCost("g4dn.12xlarge_4xt4", 3.912, 0.7, 0.4),
    "m5.4xlarge_cpu": ComputeResourceCost("m5.4xlarge_cpu", 0.768, 0.7, 0.4),
    "c5.2xlarge_cpu": ComputeResourceCost("c5.2xlarge_cpu", 0.34, 0.7, 0.4),
}


# kw_only=True (Python 3.10+) allows required fields to follow fields with
# defaults; without it this class definition raises a TypeError.
@dataclass(kw_only=True)
class ModelCostProfile:
    """Complete cost model for a single ML model."""
    model_name: str

    # Training costs
    training_instance_type: str
    training_instance_count: int
    training_hours_per_run: float
    training_runs_per_month: float
    training_uses_spot: bool = True

    # Inference costs
    inference_instance_type: str
    inference_instance_count: int       # always-on replicas
    monthly_request_volume: float       # total requests per month

    # Feature pipeline costs
    feature_pipeline_instance_type: str
    feature_pipeline_hours_per_day: float

    # Storage volumes (GB), priced at s3_cost_per_gb_month
    training_data_gb: float = 0
    model_artifacts_gb: float = 0
    feature_store_gb: float = 0
    experiment_artifacts_gb: float = 0
    s3_cost_per_gb_month: float = 0.023

    # Data transfer costs
    data_transfer_gb_per_month: float = 0
    data_transfer_cost_per_gb: float = 0.09


def compute_monthly_cost(profile: ModelCostProfile) -> dict:
    """Compute total monthly cost and break it down by category."""
    # Training compute cost
    train_resource = RESOURCE_COSTS[profile.training_instance_type]
    train_rate = train_resource.hourly_rate_usd
    if profile.training_uses_spot:
        # Discounts are fractions off, so the price paid is (1 - discount)
        train_rate *= (1 - train_resource.spot_discount)
    training_cost = (
        train_rate
        * profile.training_instance_count
        * profile.training_hours_per_run
        * profile.training_runs_per_month
    )

    # Inference compute cost (always-on, so use reserved pricing)
    infer_resource = RESOURCE_COSTS[profile.inference_instance_type]
    infer_rate = infer_resource.hourly_rate_usd * (1 - infer_resource.reserved_discount)
    inference_cost = infer_rate * profile.inference_instance_count * 730  # hours/month

    # Feature pipeline cost (single instance, spot pricing, 30 days/month)
    feature_resource = RESOURCE_COSTS[profile.feature_pipeline_instance_type]
    feature_rate = feature_resource.hourly_rate_usd * (1 - feature_resource.spot_discount)
    feature_cost = feature_rate * profile.feature_pipeline_hours_per_day * 30

    # Storage cost
    total_storage_gb = (
        profile.training_data_gb
        + profile.model_artifacts_gb
        + profile.feature_store_gb
        + profile.experiment_artifacts_gb
    )
    storage_cost = total_storage_gb * profile.s3_cost_per_gb_month

    # Data transfer cost
    transfer_cost = profile.data_transfer_gb_per_month * profile.data_transfer_cost_per_gb

    total_cost = training_cost + inference_cost + feature_cost + storage_cost + transfer_cost

    # Per-request cost is undefined when there is no traffic
    cost_per_request = (
        round(total_cost / profile.monthly_request_volume, 6)
        if profile.monthly_request_volume > 0
        else None
    )

    return {
        "model_name": profile.model_name,
        "training_cost_usd": round(training_cost, 2),
        "inference_cost_usd": round(inference_cost, 2),
        "feature_pipeline_cost_usd": round(feature_cost, 2),
        "storage_cost_usd": round(storage_cost, 2),
        "data_transfer_cost_usd": round(transfer_cost, 2),
        "total_cost_usd": round(total_cost, 2),
        "cost_per_request_usd": cost_per_request,
    }


# Example: recommendation system cost model
reco_profile = ModelCostProfile(
    model_name="product_recommendation_v3",
    # Training: nightly retrain on 8xV100, ~3 hours, spot
    training_instance_type="p3.16xlarge_8xv100",
    training_instance_count=1,
    training_hours_per_run=3.0,
    training_runs_per_month=30,
    training_uses_spot=True,
    # Inference: 4 instances with 4x T4 GPUs each, reserved
    inference_instance_type="g4dn.12xlarge_4xt4",
    inference_instance_count=4,
    monthly_request_volume=72_000_000,
    # Feature pipeline: daily Spark job, ~2 hours, spot
    feature_pipeline_instance_type="m5.4xlarge_cpu",
    feature_pipeline_hours_per_day=2.0,
    # Storage
    training_data_gb=500,
    model_artifacts_gb=20,
    feature_store_gb=200,
    experiment_artifacts_gb=150,
    # Transfer
    data_transfer_gb_per_month=100,
)

cost_breakdown = compute_monthly_cost(reco_profile)
print(f"Monthly cost: ${cost_breakdown['total_cost_usd']:,.0f}")
print(f"Cost per request: ${cost_breakdown['cost_per_request_usd']:.5f}")
print(f"Annual cost: ${cost_breakdown['total_cost_usd'] * 12:,.0f}")
Unit Economics: Cost Per Prediction
Unit economics connects infrastructure costs to business metrics. For a recommendation system, the relevant unit is a prediction (or a page impression, or a recommendation set). For a fraud detection model, it might be a transaction scored.

from typing import Optional


def compute_unit_economics(
    monthly_cost_usd: float,
    monthly_volume: float,
    monthly_revenue_attributed_usd: Optional[float] = None,
    monthly_value_preserved_usd: Optional[float] = None,  # fraud prevented, churn avoided, etc.
) -> dict:
    """Compute unit economics for an ML system."""
    if monthly_volume <= 0:
        raise ValueError("monthly_volume must be positive")

    cost_per_unit = monthly_cost_usd / monthly_volume
    result = {
        "monthly_cost_usd": monthly_cost_usd,
        "monthly_volume": monthly_volume,
        "cost_per_unit_usd": cost_per_unit,
        "cost_per_1000_units_usd": cost_per_unit * 1000,
    }

    # Revenue-based metrics (skipped when revenue is missing or zero)
    if monthly_revenue_attributed_usd:
        revenue_per_unit = monthly_revenue_attributed_usd / monthly_volume
        result["revenue_per_unit_usd"] = revenue_per_unit
        result["gross_margin_pct"] = (
            (monthly_revenue_attributed_usd - monthly_cost_usd)
            / monthly_revenue_attributed_usd * 100
        )
        result["cost_as_pct_of_revenue"] = (
            monthly_cost_usd / monthly_revenue_attributed_usd * 100
        )

    # Value-preservation metrics (skipped when value is missing or zero)
    if monthly_value_preserved_usd:
        result["roi"] = (monthly_value_preserved_usd - monthly_cost_usd) / monthly_cost_usd
        result["cost_per_dollar_value_preserved"] = (
            monthly_cost_usd / monthly_value_preserved_usd
        )

    return result


# Recommendation system example
reco_economics = compute_unit_economics(
    monthly_cost_usd=28_400,
    monthly_volume=72_000_000,
    monthly_revenue_attributed_usd=4_200_000,  # estimated click-through revenue
)
print(f"Cost per prediction: ${reco_economics['cost_per_unit_usd']:.5f}")
print(f"Cost per 1,000 predictions: ${reco_economics['cost_per_1000_units_usd']:.4f}")
print(f"ML cost as % of attributed revenue: {reco_economics['cost_as_pct_of_revenue']:.2f}%")
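
The same arithmetic can be applied to a proposed change rather than to the whole system, for instance the earlier question of whether an accuracy improvement justifies a 40% increase in compute cost. The sketch below is illustrative only: `evaluate_model_upgrade` is a hypothetical helper, and the 0.5% revenue uplift is an assumed figure that in practice would come from an A/B test.

```python
def evaluate_model_upgrade(
    current_monthly_cost_usd: float,
    cost_increase_pct: float,
    current_monthly_revenue_usd: float,
    revenue_uplift_pct: float,
) -> dict:
    """Compare the extra monthly cost of a model change with the extra revenue it brings."""
    extra_cost = current_monthly_cost_usd * cost_increase_pct / 100
    extra_revenue = current_monthly_revenue_usd * revenue_uplift_pct / 100
    return {
        "extra_cost_usd": extra_cost,
        "extra_revenue_usd": extra_revenue,
        "net_monthly_gain_usd": extra_revenue - extra_cost,
        # Revenue uplift (in %) at which the change exactly pays for itself
        "break_even_uplift_pct": extra_cost / current_monthly_revenue_usd * 100,
    }


upgrade = evaluate_model_upgrade(
    current_monthly_cost_usd=28_400,
    cost_increase_pct=40,                  # compute cost rises 40%
    current_monthly_revenue_usd=4_200_000,
    revenue_uplift_pct=0.5,                # assumed uplift, not a measured value
)
print(f"Extra cost: ${upgrade['extra_cost_usd']:,.0f}/month")
print(f"Extra revenue: ${upgrade['extra_revenue_usd']:,.0f}/month")
print(f"Break-even uplift: {upgrade['break_even_uplift_pct']:.2f}%")
```

The break-even uplift is the useful number to carry into a planning discussion: any measured uplift above it means the more expensive model pays for itself.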
Hidden Costs
The visible line items on a cloud bill - EC2, S3, data transfer - are only part of the total cost. Hidden costs are real costs that don't appear as direct line items:
Idle capacity: An inference fleet scaled to peak load spends most of its time at 30–50% utilization. The idle capacity is paid for but produces no value. At 50% average utilization, you're paying double for what you actually use.
Over-provisioned instances: Teams that size instances for peak memory usage may be on a GPU instance for a model that could run on a CPU. Systematic right-sizing analyses regularly find 20–30% of compute running on overspecified hardware.
Experiment waste: A hyperparameter search that runs 500 trials, each on a GPU for 2 hours, consumes 1,000 GPU-hours and can cost $2,000 on spot pricing. Many experiments run to completion even when it's clear the configuration is not competitive after the first 30 minutes. Early stopping and cost budgets for experiments are almost never implemented.
Engineering time: The most expensive item on any cost model is the people who build and maintain the system. A senior ML engineer can cost $400K per year fully loaded. An MLOps system that requires 2 FTEs to operate then costs $800K/year in engineering time - often more than the infrastructure it manages.
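
The experiment-waste point above can be made concrete with a rough estimate of what early stopping would save. This is a minimal sketch under stated assumptions: `early_stopping_savings` is a hypothetical helper, the $2.00 per GPU-hour spot rate is assumed, and the 80% share of trials that could be stopped after 30 minutes is an illustrative guess rather than a measured figure.

```python
def early_stopping_savings(
    n_trials: int,
    hours_per_trial: float,
    gpu_hourly_rate_usd: float,
    fraction_stopped_early: float,
    hours_before_stop: float,
) -> dict:
    """Compare running every trial to completion with stopping weak trials early."""
    full_cost = n_trials * hours_per_trial * gpu_hourly_rate_usd
    stopped_trials = n_trials * fraction_stopped_early
    completed_trials = n_trials - stopped_trials
    # GPU-hours actually consumed when weak trials are cut short
    gpu_hours = completed_trials * hours_per_trial + stopped_trials * hours_before_stop
    cost_with_stopping = gpu_hours * gpu_hourly_rate_usd
    return {
        "full_cost_usd": full_cost,
        "cost_with_stopping_usd": cost_with_stopping,
        "savings_usd": full_cost - cost_with_stopping,
        "savings_pct": (full_cost - cost_with_stopping) / full_cost * 100,
    }


search = early_stopping_savings(
    n_trials=500,
    hours_per_trial=2.0,
    gpu_hourly_rate_usd=2.00,     # assumed spot rate per GPU-hour
    fraction_stopped_early=0.8,   # assumed share of uncompetitive trials
    hours_before_stop=0.5,        # stop after the first 30 minutes
)
print(f"Run to completion: ${search['full_cost_usd']:,.0f}")
print(f"With early stopping: ${search['cost_with_stopping_usd']:,.0f}")
print(f"Savings: {search['savings_pct']:.0f}%")
```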

def estimate_hidden_costs(
    visible_monthly_cost: float,
    avg_utilization_pct: float = 0.5,    # utilization as a fraction (0-1), despite the name
    experiment_waste_pct: float = 0.3,   # fraction of experiment compute that's wasted
    mlops_fte_count: float = 1.5,        # FTEs dedicated to maintaining the ML platform
    avg_engineer_cost_per_year: float = 300_000,
) -> dict:
    """Estimate hidden costs that don't appear as line items on the cloud bill."""
    # Idle capacity: the share of the bill that pays for unused capacity.
    # This spend is already inside visible_monthly_cost; it is hidden only
    # because no line item names it. Simplification: the utilization figure
    # is applied to the whole bill, not just the inference fleet.
    idle_cost = visible_monthly_cost * (1 - avg_utilization_pct)

    # Experiment waste (also already inside the visible bill)
    experiment_cost = visible_monthly_cost * 0.15  # assumes ~15% of infra is experiment compute
    waste_cost = experiment_cost * experiment_waste_pct

    # Engineering time (monthly): the only item here that is not on the cloud bill
    mlops_engineering_cost = mlops_fte_count * avg_engineer_cost_per_year / 12

    total_hidden = idle_cost + waste_cost + mlops_engineering_cost

    return {
        "visible_monthly_cost": visible_monthly_cost,
        "idle_capacity_cost": round(idle_cost, 0),
        "experiment_waste_cost": round(waste_cost, 0),
        "mlops_engineering_monthly": round(mlops_engineering_cost, 0),
        "total_hidden_monthly": round(total_hidden, 0),
        # Add only off-bill costs to avoid double-counting idle and wasted spend
        "true_total_monthly": round(visible_monthly_cost + mlops_engineering_cost, 0),
        "hidden_as_pct_of_visible": round(total_hidden / visible_monthly_cost * 100, 1),
    }


hidden = estimate_hidden_costs(
    visible_monthly_cost=28_400,
    avg_utilization_pct=0.45,
    mlops_fte_count=1.5,
)
print(f"Visible cost: ${hidden['visible_monthly_cost']:,.0f}/month")
print(f"True total: ${hidden['true_total_monthly']:,.0f}/month")
print(f"Hidden = {hidden['hidden_as_pct_of_visible']}% of visible")
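
The claim that 50% utilization means paying double can be checked with a few lines of arithmetic. The helper below is a hypothetical sketch (`idle_capacity_breakdown` and the $10,000/month fleet cost are illustrative, not taken from a real bill); it splits a fleet's cost into used and idle spend and reports how many dollars are paid per dollar of capacity actually used.

```python
def idle_capacity_breakdown(fleet_monthly_cost_usd: float, avg_utilization: float) -> dict:
    """Split a fleet's monthly cost into used and idle spend.

    avg_utilization is a fraction in (0, 1], e.g. 0.5 for 50%.
    """
    if not 0 < avg_utilization <= 1:
        raise ValueError("avg_utilization must be in (0, 1]")
    used = fleet_monthly_cost_usd * avg_utilization
    return {
        "used_spend_usd": used,
        "idle_spend_usd": fleet_monthly_cost_usd - used,
        # Dollars paid per dollar of capacity that does useful work
        "cost_multiplier": 1 / avg_utilization,
    }


for utilization in (0.3, 0.5, 0.8):
    b = idle_capacity_breakdown(10_000, utilization)  # hypothetical $10,000/month fleet
    print(
        f"{utilization:.0%} utilized: idle ${b['idle_spend_usd']:,.0f}, "
        f"multiplier {b['cost_multiplier']:.2f}x"
    )
```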
Cost Allocation by Team, Project, and Model
A single cloud account with no tagging produces a bill that is impossible to attribute. Cost allocation requires a systematic tagging strategy applied to all resources at creation time.

# Required tags for every ML resource
REQUIRED_TAGS = {
    "team": str,             # "search-ml", "fraud-ml", "recommendation-ml"
    "project": str,          # "product-ranking-v2", "fraud-detection"
    "model_id": str,         # "product_rec_v3", "fraud_v7"
    "environment": str,      # "production", "staging", "experiment"
    "cost_center": str,      # "eng-ml-platform", "data-science"
    "lifecycle_phase": str,  # "training", "inference", "feature_pipeline"
}


def generate_resource_tags(
    team: str,
    project: str,
    model_id: str,
    environment: str,
    lifecycle_phase: str,
    cost_center: Optional[str] = None,
) -> dict:
    """Generate a complete tag set for an ML resource."""
    return {
        "team": team,
        "project": project,
        "model_id": model_id,
        "environment": environment,
        "lifecycle_phase": lifecycle_phase,
        # Fall back to a team-derived cost center when none is given
        "cost_center": cost_center or f"eng-{team}",
        "created_by": "mlops-automation",
        "managed_by": "terraform",
    }


def compute_cost_by_dimension(
    aws_cost_explorer_df: pd.DataFrame,
    group_by: str = "model_id",
) -> pd.DataFrame:
    """
    Aggregate AWS Cost Explorer data by a tag dimension.

    Assumes aws_cost_explorer_df has columns: date, service, cost_usd,
    plus one column per tag dimension (e.g. model_id, team), so that
    group_by names an existing column.
    """
    return (
        aws_cost_explorer_df
        .groupby([group_by, "service"])["cost_usd"]
        .sum()
        .reset_index()
        .sort_values("cost_usd", ascending=False)
    )
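
Tagging only works if it is enforced, so a validation step usually sits in front of resource creation. The following is a minimal sketch of such a check, not part of any cloud SDK: `validate_resource_tags` is a hypothetical helper, and the allowed values for `environment` and `lifecycle_phase` are assumed to be exactly the examples listed in the tag comments above.

```python
from typing import Dict, List

REQUIRED_TAG_KEYS = (
    "team", "project", "model_id", "environment", "cost_center", "lifecycle_phase",
)

# Assumed closed vocabularies, taken from the example values in the tag comments
ALLOWED_VALUES = {
    "environment": {"production", "staging", "experiment"},
    "lifecycle_phase": {"training", "inference", "feature_pipeline"},
}


def validate_resource_tags(tags: Dict[str, str]) -> List[str]:
    """Return a list of problems; an empty list means the tag set is acceptable."""
    errors = []
    for key in REQUIRED_TAG_KEYS:
        value = tags.get(key)
        if not isinstance(value, str) or not value.strip():
            errors.append(f"missing or empty tag: {key}")
        elif key in ALLOWED_VALUES and value not in ALLOWED_VALUES[key]:
            errors.append(f"invalid value for {key}: {value!r}")
    return errors


good_tags = {
    "team": "recommendation-ml",
    "project": "product-ranking-v2",
    "model_id": "product_rec_v3",
    "environment": "production",
    "cost_center": "eng-ml-platform",
    "lifecycle_phase": "inference",
}
bad_tags = {"team": "fraud-ml", "environment": "prod"}

print(validate_resource_tags(good_tags))
for problem in validate_resource_tags(bad_tags):
    print(problem)
```

A check like this could run in a CI step or a pre-apply hook, failing the pipeline whenever the returned list is non-empty.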
Production Engineering Notes
Cost model cadence: Update your cost model monthly. Prices change, architectures change, and usage patterns change. A cost model built six months ago and never updated will be systematically wrong.
Cost forecasting: Once you have a cost model, you can forecast future costs by projecting request volume growth. A recommendation system growing 20% per quarter will double its inference cost in ~4 quarters unless you also improve inference efficiency.
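
The doubling claim follows from compound growth, and the same formula shows how efficiency work bends the curve. The sketch below assumes inference cost scales linearly with request volume; `forecast_quarterly_cost`, `quarters_to_double`, the $10,000 starting cost, and the 10% quarterly efficiency gain are all illustrative assumptions.

```python
import math
from typing import List


def forecast_quarterly_cost(
    current_monthly_cost_usd: float,
    quarterly_volume_growth: float,
    quarters: int,
    quarterly_efficiency_gain: float = 0.0,
) -> List[float]:
    """Project monthly inference cost, assuming cost scales linearly with volume.

    quarterly_efficiency_gain is the fractional drop in cost per request each quarter.
    Returns one value per quarter, starting with the current cost at index 0.
    """
    factor = (1 + quarterly_volume_growth) * (1 - quarterly_efficiency_gain)
    return [current_monthly_cost_usd * factor ** q for q in range(quarters + 1)]


def quarters_to_double(quarterly_growth: float) -> float:
    """Number of quarters for cost to double at a constant compound growth rate."""
    return math.log(2) / math.log(1 + quarterly_growth)


baseline = forecast_quarterly_cost(10_000, 0.20, quarters=4)
with_gains = forecast_quarterly_cost(10_000, 0.20, quarters=4, quarterly_efficiency_gain=0.10)

print(f"Quarters to double at 20% growth: {quarters_to_double(0.20):.1f}")
print(f"After 4 quarters, no efficiency work: ${baseline[-1]:,.0f}/month")
print(f"After 4 quarters, 10% efficiency gain per quarter: ${with_gains[-1]:,.0f}/month")
```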
Cost-to-accuracy trade-off documentation: For every model in production, document the cost-accuracy frontier: how does model cost change as you vary model size, compute budget, and serving configuration? This allows economically rational decisions when budget constraints arise.
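
One way to document the frontier is to record each candidate configuration's cost and accuracy, then keep only the configurations that no other option beats on both. The sketch below is a simple Pareto filter; the `ServingConfig` type and all configuration names and numbers are hypothetical.

```python
from typing import List, NamedTuple


class ServingConfig(NamedTuple):
    name: str
    monthly_cost_usd: float
    accuracy: float


def pareto_frontier(configs: List[ServingConfig]) -> List[ServingConfig]:
    """Keep configs that are not dominated (no other config is cheaper-or-equal
    and at least as accurate). Returned in order of increasing cost."""
    frontier = []
    best_accuracy = float("-inf")
    # Cheapest first; on equal cost, the more accurate config comes first
    for config in sorted(configs, key=lambda c: (c.monthly_cost_usd, -c.accuracy)):
        if config.accuracy > best_accuracy:
            frontier.append(config)
            best_accuracy = config.accuracy
    return frontier


candidates = [  # hypothetical measurements
    ServingConfig("small", 5_000, 0.81),
    ServingConfig("medium", 9_000, 0.84),
    ServingConfig("medium-unoptimized", 12_000, 0.84),
    ServingConfig("large", 20_000, 0.85),
    ServingConfig("xl-untuned", 26_000, 0.83),
]

for config in pareto_frontier(candidates):
    print(f"{config.name}: ${config.monthly_cost_usd:,.0f}/month, accuracy {config.accuracy:.2f}")
```

When a budget cut arrives, the frontier shows directly which accuracy level each spending level buys.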
Common Mistakes
:::danger Attributing only inference compute to a model's cost
Teams regularly understate ML costs by only counting the inference servers. Training compute, feature pipeline compute, storage for artifacts and training data, monitoring infrastructure, and the engineering time to maintain all of it - these are all real costs of operating a model in production. A complete cost model must include all lifecycle phases.
:::
:::warning Using on-demand pricing for cost estimates when you're actually on spot or reserved
On-demand prices are 2–3× higher than spot (for training workloads) and 1.5–2.5× higher than 1-year reserved pricing (for always-on inference). If your cost model uses on-demand list prices but your actual bill reflects spot and reserved discounts, your cost-per-request estimate is significantly inflated.
:::
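
To see how much an on-demand-based estimate overstates cost, compare effective hourly rates under each pricing mode. This is a minimal sketch: `effective_hourly_rate` is a hypothetical helper, and the 70% spot and 40% reserved discounts are the approximate illustrative values used in this lesson, not quoted prices.

```python
def effective_hourly_rate(
    on_demand_rate_usd: float,
    pricing: str,
    spot_discount: float = 0.7,      # assumed fraction off on-demand
    reserved_discount: float = 0.4,  # assumed fraction off on-demand (1-year reserved)
) -> float:
    """Return the hourly rate actually paid under a given pricing mode."""
    discounts = {"on_demand": 0.0, "spot": spot_discount, "reserved": reserved_discount}
    if pricing not in discounts:
        raise ValueError(f"unknown pricing mode: {pricing}")
    return on_demand_rate_usd * (1 - discounts[pricing])


on_demand = 3.912  # approximate on-demand rate for g4dn.12xlarge used in this lesson
for mode in ("on_demand", "reserved", "spot"):
    rate = effective_hourly_rate(on_demand, mode)
    print(f"{mode:>9}: ${rate:.3f}/hour, on-demand estimate overstates by {on_demand / rate:.2f}x")
```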
:::tip Cost models require usage data, not just pricing data
Cost = price × usage. You need both. Don't start with a pricing spreadsheet - start with your actual cloud bill attributed by resource tag, then compute per-unit costs by dividing by request/training volumes from your observability system.
:::
Interview Q&A
Q: How would you compute cost per prediction for an ML serving system?
A: Start by identifying all infrastructure components that serve predictions: inference servers, load balancers, feature store lookups, and any preprocessing services. Get the monthly cost of each from your cloud billing data (filtering by resource tags). Sum them to get total monthly infrastructure cost for the serving path. Divide by total monthly prediction volume (from your monitoring system). For a more complete picture, also attribute a fraction of training cost (amortized over the number of predictions the model will serve before the next retrain) and feature pipeline cost (proportional to features used by this model). The result is a per-prediction cost that includes all lifecycle costs, not just the serving compute.
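
The steps in this answer can be sketched as a small function. Everything here is illustrative: `cost_per_prediction` is a hypothetical helper, and the input figures are made-up round numbers chosen to make the arithmetic easy to follow.

```python
def cost_per_prediction(
    serving_monthly_cost_usd: float,
    monthly_predictions: float,
    training_cost_per_run_usd: float,
    predictions_between_retrains: float,
    shared_feature_pipeline_monthly_usd: float,
    feature_usage_share: float,
) -> dict:
    """Per-prediction cost covering serving, amortized training, and shared features."""
    serving = serving_monthly_cost_usd / monthly_predictions
    # Each training run is amortized over the predictions served before the next retrain
    training = training_cost_per_run_usd / predictions_between_retrains
    # This model's share of a feature pipeline that several models use
    features = shared_feature_pipeline_monthly_usd * feature_usage_share / monthly_predictions
    return {
        "serving_usd": serving,
        "training_usd": training,
        "features_usd": features,
        "total_usd": serving + training + features,
    }


per_prediction = cost_per_prediction(
    serving_monthly_cost_usd=7_200,          # hypothetical serving-path bill
    monthly_predictions=72_000_000,
    training_cost_per_run_usd=24.0,          # hypothetical nightly retrain cost
    predictions_between_retrains=2_400_000,  # one day of traffic
    shared_feature_pipeline_monthly_usd=1_440,
    feature_usage_share=0.25,                # assumed share of shared features used
)
for component, value in per_prediction.items():
    print(f"{component}: ${value:.6f}")
```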
Q: What are the main hidden costs in ML systems that don't appear on the cloud bill?
A: Three categories. First, idle capacity: inference fleets sized for peak load typically run at 30–60% average utilization, meaning you pay for 40–70% of your fleet to do nothing most of the time. Second, experiment waste: hyperparameter searches that don't use early stopping or cost budgets run failed experiments to completion. A grid search over 200 configurations where 80% are clearly poor after 20% of training wastes most of its compute budget. Third, engineering time: the biggest hidden cost. A platform that requires 2 FTEs to maintain can cost $800K/year in salary, often more than the infrastructure it manages. Include fully-loaded engineering costs in your total cost of ownership.
Q: What is ML unit economics and why does it matter?
A: ML unit economics expresses the cost of ML infrastructure in terms of per-unit business metrics: cost per prediction, cost per user, cost per decision. It matters because aggregate infrastructure costs are hard to evaluate without context, whereas a figure like "$0.0047 per recommendation" can be easily compared to the revenue per recommendation, the cost of alternative recommendation approaches, and the cost trajectory as the business scales. Unit economics enables business-rational decisions: if cost per prediction is $0.0047 and each prediction brings in $0.12, the ROI is obvious and additional investment is clearly justified.
Q: How do you set up cost attribution across multiple ML teams sharing a cloud account?
A: The foundation is resource tagging: every cloud resource (EC2 instance, S3 bucket, EKS node, RDS instance) must be tagged at creation with at minimum the team, project, model ID, environment, and lifecycle phase. Enforce this in your infrastructure-as-code templates and CI/CD pipelines - resources that fail tag validation should not be created. Use AWS Cost Explorer or GCP Cost Management to aggregate costs by tag value. For shared resources (a shared feature store, shared monitoring infrastructure), define an allocation methodology: allocate by usage (number of feature reads per team), by model count, or by raw compute. Document the methodology and apply it consistently. Review monthly cost reports by team and present them in team reviews - visibility alone changes spending behavior.
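
Usage-based allocation of a shared resource, as described above, comes down to a proportional split. The sketch below assumes feature-read counts per team are available from monitoring; `allocate_shared_cost`, the $6,000 feature store bill, and the read counts are hypothetical.

```python
from typing import Dict


def allocate_shared_cost(
    shared_monthly_cost_usd: float,
    usage_by_team: Dict[str, float],
) -> Dict[str, float]:
    """Split a shared resource's cost across teams in proportion to their usage."""
    total_usage = sum(usage_by_team.values())
    if total_usage <= 0:
        raise ValueError("total usage must be positive")
    return {
        team: shared_monthly_cost_usd * usage / total_usage
        for team, usage in usage_by_team.items()
    }


feature_store_reads = {  # hypothetical monthly feature reads per team
    "search-ml": 50_000_000,
    "fraud-ml": 30_000_000,
    "recommendation-ml": 120_000_000,
}
allocation = allocate_shared_cost(6_000, feature_store_reads)
for team, cost in allocation.items():
    print(f"{team}: ${cost:,.0f}/month")
```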
Q: A model that serves 10M requests/day has a monthly cloud bill of $50,000. Is this good or bad?
A: Without context, neither. To evaluate, first compute the cost per request: $50,000 / (10M × 30 days) ≈ $0.000167 per request. Then compare it with the value each request creates: if the model generates $0.05 per request, the $0.000167 cost is 0.33% of the value created - excellent economics. Compare to the cost of alternatives: could a simpler model produce 90% of the value at 20% of the cost? The answer "good or bad" requires knowing the denominator - what value does the model create per request.
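
The arithmetic behind this answer fits in a few lines. The $0.05 of value per request is an assumed figure for illustration, and a 30-day month is assumed.

```python
monthly_bill_usd = 50_000
requests_per_day = 10_000_000
days_per_month = 30              # assumed month length
value_per_request_usd = 0.05     # assumed value created per request

monthly_requests = requests_per_day * days_per_month
cost_per_request = monthly_bill_usd / monthly_requests
cost_share_pct = cost_per_request / value_per_request_usd * 100

print(f"Cost per request: ${cost_per_request:.6f}")
print(f"Cost per 1,000 requests: ${cost_per_request * 1000:.3f}")
print(f"Cost as % of value created: {cost_share_pct:.2f}%")
```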
