

ML Cost Models

The $300K Surprise

It started with a quarterly business review. The VP of Engineering pulled up the cloud cost dashboard and the room went quiet. The AWS bill for the past three months was $312,000 - nearly three times what anyone had budgeted. The ML team had been heads-down building, shipping, and iterating. Nobody had been watching the meter.

The postmortem was painful. A training job that ran every night and had been left on a persistent p3.8xlarge instance. A feature store sync that transferred 400 GB of data across availability zones - every hour. A model serving cluster that scaled up to handle a load spike two months ago and was never scaled back down. Three Jupyter notebook servers that had been idle for six weeks because their owners had quit and nobody removed them. Individually, each was a rounding error. Together, they were a budget crisis.

The VP's question was the right one: "Why didn't we know about any of this?" The answer was equally uncomfortable: the team had no cost model. They had a cloud bill, but a bill tells you what you spent, not why you spent it, not what you got for it, and not how to prevent it from happening again. There is a fundamental difference between a cloud bill and a cost model - and most ML teams are operating with only the former.

Over the next four weeks, the team built a proper cost model from scratch. They instrumented every workload, tagged every resource, and mapped spend to business outcomes. The result was not just visibility - it was control. They could answer "what does it cost to make one recommendation?" and "how much does retraining cost per accuracy point gained?" These numbers changed how they made every subsequent engineering decision.

This lesson teaches you how to build that cost model - the foundation of all ML FinOps work.


Why Cost Models Exist

The Problem Before Cost Visibility

Early cloud computing was sold as "pay for what you use." This is technically true, but it created a psychological trap: because individual resources seem cheap, teams rarely think about the aggregate. A GPU instance at $3.06/hour sounds reasonable. Running it 24 hours a day for a quarter costs $6,692. Running ten of them costs $66,920. Running ten with 30% idle time costs $20,000 in wasted compute.
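The aggregation effect is easy to reproduce. The following is a minimal sketch, not a billing tool: the function names are made up for illustration, and the 2,190-hour quarter is an assumption (a slightly different hour count shifts the totals by a fraction of a percent).

```python
# Rough arithmetic for always-on GPU instances.
# The $3.06/hr rate and 30% idle share come from the paragraph above;
# the function names and the 2,190-hour quarter are illustrative assumptions.
HOURS_PER_QUARTER = 24 * 365 / 4  # 2,190 hours in a non-leap year


def quarterly_cost(hourly_rate: float, instances: int = 1) -> float:
    """Cost of running `instances` machines 24/7 for one quarter."""
    return hourly_rate * HOURS_PER_QUARTER * instances


def idle_waste(total_cost: float, idle_fraction: float) -> float:
    """Portion of the spend that paid for idle time."""
    return total_cost * idle_fraction


one_gpu = quarterly_cost(3.06)
ten_gpus = quarterly_cost(3.06, instances=10)
wasted = idle_waste(ten_gpus, idle_fraction=0.30)

print(f"One instance, one quarter:  ${one_gpu:,.0f}")
print(f"Ten instances, one quarter: ${ten_gpus:,.0f}")
print(f"Wasted at 30% idle:         ${wasted:,.0f}")
```

The point of writing it down is that the per-hour price never changes; only the multipliers do, and those are the numbers nobody looks at.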

The fundamental problem is that ML workloads have a property that traditional software workloads don't: their cost is highly variable and non-obvious. A web server's cost scales roughly linearly with traffic. An ML system's cost depends on model size, batch size, hardware choice, training frequency, serving strategy, data volume, and a dozen other decisions that seem unrelated to cost but are deeply intertwined with it.

Without a cost model, teams make decisions in a vacuum. They choose the most powerful GPU without asking whether they need it. They set training to run nightly without asking if daily retraining is worth the cost. They serve requests with zero batching without understanding the economics of batching. Each of these decisions has real financial consequences - but without a model, those consequences are invisible until the quarterly bill arrives.

What a Cost Model Actually Is

A cost model is not a spreadsheet of cloud prices. It is a causal model: a structured way of understanding which engineering decisions drive which costs, and by how much. A good cost model lets you answer questions like:

  • "If we double our training frequency, what happens to monthly spend?"
  • "What is our cost per prediction, and how does it change with load?"
  • "If we move from gpt-4 to gpt-3.5-turbo, how much do we save and what do we lose in quality?"
  • "What is the ROI of spending 2 engineer-weeks on inference optimization?"

These questions require a model - not just a bill.


Historical Context

The discipline of FinOps (Financial Operations) emerged from DevOps culture around 2016–2018, driven by organizations that had moved to cloud and discovered that their infrastructure costs were growing faster than their revenue. The FinOps Foundation was established in 2019 to formalize best practices.

ML-specific FinOps is even newer. The concept of "cost per prediction" as a first-class engineering metric was popularized by practitioners at companies like Airbnb and Uber around 2019–2021 as they scaled their ML platforms to thousands of models in production. The key insight - that you cannot optimize what you cannot measure - came from applying lean manufacturing principles to ML infrastructure.

The 2019 paper "Energy and Policy Considerations for Deep Learning in NLP" by Emma Strubell and colleagues was a watershed moment: it estimated that training a large Transformer with neural architecture search can emit roughly as much CO2 as five cars over their lifetimes, forcing the community to think seriously about compute cost as a first-class concern rather than an afterthought.


The ML Cost Model Framework

Layer 1: Training Costs

Training cost is the most visible and controllable cost in ML. The fundamental formula is simple:

$$\text{Training Cost} = \text{Compute Cost} \times \text{Training Time}$$

But "compute cost" and "training time" are both products of many variables:

$$\text{Compute Cost} = N_{\text{GPUs}} \times \text{Cost}_{\text{GPU/hr}} \times \text{Utilization Factor}$$

$$\text{Training Time} = \frac{N_{\text{tokens}} \times 6 \times N_{\text{params}}}{N_{\text{GPUs}} \times \text{GPU FLOPS} \times \text{MFU}}$$

Where:

  • $N_{\text{tokens}}$ = number of training tokens
  • $N_{\text{params}}$ = model parameter count
  • $\text{GPU FLOPS}$ = peak per-GPU throughput in FLOPs per second (e.g., A100 = 312 TFLOPS for BF16)
  • $\text{MFU}$ = Model FLOP Utilization (typically 0.3–0.6 in practice)
  • The factor of 6 accounts for forward pass (2×) plus backward pass (4×)
  • Training time comes out in seconds; divide by 3,600 before multiplying by an hourly GPU price

Example calculation:

Training a 7B parameter model on 1 trillion tokens:

  • Compute required: $6 \times 7 \times 10^9 \times 10^{12} = 4.2 \times 10^{22}$ FLOPs
  • A100 at 40% MFU: $312 \times 10^{12} \times 0.4 = 124.8 \times 10^{12}$ FLOPS effective
  • Time on 8x A100: $\frac{4.2 \times 10^{22}}{8 \times 124.8 \times 10^{12}} \approx 4.2 \times 10^{7}$ seconds ≈ 11,685 hours wall clock (about 93,480 GPU-hours in total)
  • At \$3/GPU-hr for A100: $8 \times 3 \times 11{,}685 \approx \$280{,}440$

This is a rough estimate - MFU varies, communication overhead adds ~10–30% - but it gives you a ballpark before you start a run.
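The same estimate can be sketched as a small function so it can be re-run whenever an assumption changes. This is a back-of-the-envelope sketch under the 6 × parameters × tokens approximation described above; `estimate_training_cost`, `TrainingEstimate`, and the `comm_overhead` knob are hypothetical names introduced here, not part of any library.

```python
from dataclasses import dataclass


@dataclass
class TrainingEstimate:
    total_flops: float
    gpu_hours: float         # summed across all GPUs
    wall_clock_hours: float  # elapsed time on the cluster
    cost_usd: float


def estimate_training_cost(
    n_params: float,
    n_tokens: float,
    n_gpus: int,
    peak_flops_per_gpu: float,  # FLOPs per second, e.g. 312e12 for A100 BF16
    mfu: float,                 # Model FLOP Utilization, 0 < mfu <= 1
    price_per_gpu_hour: float,
    comm_overhead: float = 0.0,  # assumed extra time fraction, e.g. 0.2 for +20%
) -> TrainingEstimate:
    """Estimate training compute and cost from first principles."""
    if not 0 < mfu <= 1:
        raise ValueError("mfu must be in (0, 1]")
    total_flops = 6 * n_params * n_tokens
    gpu_seconds = total_flops / (peak_flops_per_gpu * mfu)
    gpu_hours = gpu_seconds / 3600 * (1 + comm_overhead)
    return TrainingEstimate(
        total_flops=total_flops,
        gpu_hours=gpu_hours,
        wall_clock_hours=gpu_hours / n_gpus,
        cost_usd=gpu_hours * price_per_gpu_hour,
    )


# 7B parameters, 1T tokens, 8x A100 at 40% MFU, $3/GPU-hr
est = estimate_training_cost(7e9, 1e12, 8, 312e12, 0.40, 3.00)
print(f"GPU-hours:        {est.gpu_hours:,.0f}")         # ~93,483
print(f"Wall-clock hours: {est.wall_clock_hours:,.0f}")  # ~11,685
print(f"Cost:             ${est.cost_usd:,.0f}")         # ~$280,449
```

Note that total cost does not depend on the GPU count in this simplified model; adding GPUs only shortens wall-clock time. In practice, scaling out tends to lower MFU and raise communication overhead, which is what `comm_overhead` is a crude stand-in for.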

Layer 2: Inference Costs

Inference cost is harder to model because it depends on traffic patterns, latency requirements, and serving architecture choices. The fundamental formula:

$$\text{Monthly Inference Cost} = \text{Requests/month} \times \text{Cost per Request}$$

$$\text{Cost per Request} = \frac{\text{Instance Cost/hr}}{3600 \times \text{Requests/sec/instance}}$$

For LLMs specifically, cost is token-based:

$$\text{Cost per Request} = (N_{\text{input tokens}} \times \text{Price}_{\text{input}}) + (N_{\text{output tokens}} \times \text{Price}_{\text{output}})$$

Example for a RAG application on GPT-4:

Assume: 500 requests/day, average 2,000 input tokens (context + retrieved docs), 500 output tokens:

  • Daily input cost: $500 \times 2000 \times \$0.03/1000 = \$30$
  • Daily output cost: $500 \times 500 \times \$0.06/1000 = \$15$
  • Monthly cost: $45 \times 30 = \$1{,}350$

This seems manageable - until you scale to 50,000 requests/day: $135,000/month.
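The token-based formula is simple enough to keep as a helper next to your traffic projections. The sketch below assumes per-1K-token pricing and a 30-day month, as in the example; the function names are illustrative, and the prices are the ones quoted above rather than current list prices.

```python
def llm_request_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_1k: float,
    output_price_per_1k: float,
) -> float:
    """Cost in USD of one request under per-1K-token pricing."""
    return (
        input_tokens / 1000 * input_price_per_1k
        + output_tokens / 1000 * output_price_per_1k
    )


def monthly_llm_cost(requests_per_day: int, cost_per_request: float,
                     days_per_month: int = 30) -> float:
    """Monthly spend at a steady daily request volume."""
    return requests_per_day * cost_per_request * days_per_month


# RAG example from the text: 2,000 input tokens, 500 output tokens
per_request = llm_request_cost(2000, 500, input_price_per_1k=0.03,
                               output_price_per_1k=0.06)

for daily_volume in (500, 5_000, 50_000):
    total = monthly_llm_cost(daily_volume, per_request)
    print(f"{daily_volume:>6} req/day -> ${total:,.0f}/month")
```

Because cost is linear in both volume and tokens per request, trimming retrieved context is often the cheapest lever: halving input tokens in this example removes a third of the bill.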

Layer 3: Hidden Costs

These are the costs that don't show up in your "compute" line but appear scattered across your bill:

| Hidden Cost | Typical Magnitude | Common Surprise |
| --- | --- | --- |
| Data transfer (egress) | $0.09/GB from AWS | Cross-AZ feature store sync |
| S3 storage for artifacts | $0.023/GB/month | Storing every model checkpoint |
| Data transfer between services | $0.01–0.02/GB | Logging every prediction |
| Monitoring & observability | 5–15% of compute cost | Datadog GPU metrics at scale |
| Developer tooling (notebooks, IDEs) | $200–500/user/month | Unused SageMaker Studio instances |
| Data labeling | $0.01–5/label | Underestimated by 3-10x |

The checkpoint storage trap: Teams often checkpoint models every epoch or every N steps. A 13B parameter model checkpoint is ~26 GB in float16. Training for 100 epochs and keeping all checkpoints: 2,600 GB = $59.80/month in S3 storage. Multiply by 10 training runs: $598/month just for old checkpoints nobody looks at.
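To see how much a retention policy changes this, here is a small sketch. It counts weights only (optimizer state would make each checkpoint larger), uses the $0.023/GB/month figure from the table, and the "keep the last 3" policy is an assumed example rather than a recommendation from the text.

```python
S3_PRICE_PER_GB_MONTH = 0.023  # from the table above; varies by region and storage class


def checkpoint_size_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Approximate checkpoint size; 2 bytes/param corresponds to float16 weights."""
    return n_params * bytes_per_param / 1e9


def monthly_storage_cost(n_checkpoints: int, size_gb: float,
                         price_per_gb_month: float = S3_PRICE_PER_GB_MONTH) -> float:
    """Monthly storage bill for a set of equally sized checkpoints."""
    return n_checkpoints * size_gb * price_per_gb_month


size = checkpoint_size_gb(13e9)            # 13B parameters -> 26 GB
keep_all = monthly_storage_cost(100, size)  # every epoch of a 100-epoch run
keep_last_3 = monthly_storage_cost(3, size) # assumed retention policy

print(f"Checkpoint size: {size:.0f} GB")
print(f"Keep all 100:    ${keep_all:.2f}/month per run")
print(f"Keep last 3:     ${keep_last_3:.2f}/month per run")
print(f"10 runs, keep all: ${10 * keep_all:.2f}/month")
```

The absolute numbers are small for one run; the cost comes from nobody deleting them across many runs and many months.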

Layer 4: The Full Cost Model


Cost Per Prediction: The Key Metric

The single most important unit economic for ML systems is cost per prediction. This is the number that connects infrastructure decisions to business outcomes.

$$\text{Cost per Prediction} = \frac{\text{Monthly Inference Cost}}{\text{Monthly Predictions}}$$

If your product charges $0.01 per API call and your cost per prediction is $0.003, your gross margin on inference is 70%. If your cost per prediction is $0.012, you are losing money on every request - a business that scales itself into bankruptcy.

Building cost per prediction:

from dataclasses import dataclass

HOURS_PER_MONTH = 730  # average hours in a month (8,760 / 12)


@dataclass
class InferenceCostModel:
    """
    Model the cost per prediction for an ML serving system.
    All costs in USD.
    """
    # Instance configuration
    instance_hourly_cost: float  # e.g., $3.06 for p3.2xlarge
    instances_count: int  # number of serving replicas

    # Throughput
    requests_per_second_per_instance: float  # measured QPS at target latency

    # Overhead factors
    idle_buffer_pct: float = 0.30  # keep 30% headroom for spikes
    operational_overhead_pct: float = 0.15  # monitoring, LB, etc.

    def cost_per_request(self) -> float:
        """Compute cost per inference request at provisioned (buffered) capacity."""
        # Total hourly cost including overhead
        compute_cost_hr = (
            self.instance_hourly_cost
            * self.instances_count
            * (1 + self.operational_overhead_pct)
        )

        # Effective throughput accounting for idle buffer
        effective_rps = (
            self.requests_per_second_per_instance
            * self.instances_count
            * (1 - self.idle_buffer_pct)
        )

        # Cost per request = cost per second / requests per second
        cost_per_second = compute_cost_hr / 3600
        return cost_per_second / effective_rps

    def monthly_cost_at_volume(self, monthly_requests: int) -> dict:
        """Full cost breakdown for a given monthly volume."""
        # Note: cpp assumes the fleet is busy up to its buffered capacity.
        # At lower volumes the realized figure is monthly_total / monthly_requests.
        cpp = self.cost_per_request()
        compute = self.instance_hourly_cost * self.instances_count * HOURS_PER_MONTH
        overhead = compute * self.operational_overhead_pct

        return {
            "cost_per_prediction": cpp,
            "monthly_compute": compute,
            "monthly_overhead": overhead,
            "monthly_total": compute + overhead,
            "implied_rps": monthly_requests / (HOURS_PER_MONTH * 3600),
            "gross_margin_at_1cent": max(0, (0.01 - cpp) / 0.01),
        }


# Example: BERT-based classifier on p3.2xlarge
model = InferenceCostModel(
    instance_hourly_cost=3.06,
    instances_count=2,
    requests_per_second_per_instance=50,
    idle_buffer_pct=0.30,
    operational_overhead_pct=0.15,
)

result = model.monthly_cost_at_volume(monthly_requests=10_000_000)
print(f"Cost per prediction: ${result['cost_per_prediction']:.6f}")
print(f"Monthly total: ${result['monthly_total']:,.2f}")
print(f"Gross margin at $0.01/call: {result['gross_margin_at_1cent']:.1%}")
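The class above prices a request as if the fleet were busy up to its buffered capacity. A complementary view, sketched below, divides the fixed monthly bill by the requests actually served, which is the definition of cost per prediction given earlier. The helper names are hypothetical, and the sketch assumes the fleet size stays fixed across the volumes shown.

```python
def realized_cost_per_prediction(fleet_monthly_cost: float,
                                 monthly_requests: int) -> float:
    """Monthly inference cost divided by predictions actually served."""
    if monthly_requests <= 0:
        raise ValueError("monthly_requests must be positive")
    return fleet_monthly_cost / monthly_requests


def gross_margin(price_per_call: float, cost_per_prediction: float) -> float:
    """Gross margin as a fraction of price; negative means a loss per call."""
    return (price_per_call - cost_per_prediction) / price_per_call


# Two $3.06/hr instances, 730 hrs/month, 15% operational overhead
fleet_monthly_cost = 3.06 * 2 * 730 * 1.15

for volume in (100_000, 1_000_000, 10_000_000):
    cpp = realized_cost_per_prediction(fleet_monthly_cost, volume)
    margin = gross_margin(0.01, cpp)
    print(f"{volume:>10,} req/month: ${cpp:.6f}/prediction, margin {margin:.1%}")
```

At 100K requests a month the same two instances cost about five cents per prediction, well above a one-cent price; at 10M they cost a small fraction of a cent. A fixed fleet makes cost per prediction a function of load, which is why the metric has to be tracked over time rather than computed once.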

TCO: Self-Hosted vs Managed

One of the most consequential decisions in ML infrastructure is whether to run your own compute or use managed services (SageMaker, Vertex AI, Azure ML). The sticker price difference is obvious - managed services charge a premium of 20–40% over raw compute. The hidden difference is total cost of ownership.

def calculate_tco(
    raw_compute_monthly: float,
    managed_premium_pct: float,
    engineering_fte_cost_annual: float,
    engineering_hours_per_week_self_hosted: float,
    engineering_hours_per_week_managed: float,
) -> dict:
    """
    Compare TCO of self-hosted vs managed ML infrastructure.

    Args:
        raw_compute_monthly: Base compute cost per month
        managed_premium_pct: Premium charged by managed service (e.g., 0.30 for 30%)
        engineering_fte_cost_annual: Fully loaded engineer cost (salary + benefits + overhead)
        engineering_hours_per_week_self_hosted: Hours/week maintaining self-hosted infra
        engineering_hours_per_week_managed: Hours/week with managed service

    Returns:
        TCO comparison dict
    """
    # 52 weeks x 40 hours = 2,080 working hours per year
    hourly_engineer_cost = engineering_fte_cost_annual / 52 / 40

    # Self-hosted costs
    self_hosted_compute = raw_compute_monthly * 12
    self_hosted_ops = (
        engineering_hours_per_week_self_hosted
        * 52
        * hourly_engineer_cost
    )
    self_hosted_tco = self_hosted_compute + self_hosted_ops

    # Managed service costs
    managed_compute = raw_compute_monthly * (1 + managed_premium_pct) * 12
    managed_ops = (
        engineering_hours_per_week_managed
        * 52
        * hourly_engineer_cost
    )
    managed_tco = managed_compute + managed_ops

    return {
        "self_hosted_annual_tco": self_hosted_tco,
        "managed_annual_tco": managed_tco,
        "self_hosted_breakdown": {
            "compute": self_hosted_compute,
            "engineering": self_hosted_ops,
        },
        "managed_breakdown": {
            "compute": managed_compute,
            "engineering": managed_ops,
        },
        "winner": "self-hosted" if self_hosted_tco < managed_tco else "managed",
        "difference": abs(self_hosted_tco - managed_tco),
    }


# Scenario: $10K/month compute, $200K engineer, 8 hrs/wk self-hosted ops vs 2 hrs/wk managed
tco = calculate_tco(
    raw_compute_monthly=10_000,
    managed_premium_pct=0.30,
    engineering_fte_cost_annual=200_000,
    engineering_hours_per_week_self_hosted=8,
    engineering_hours_per_week_managed=2,
)

print(f"Self-hosted TCO: ${tco['self_hosted_annual_tco']:,.0f}")
# Compute: $120K + Engineering: $40K = $160K

print(f"Managed TCO: ${tco['managed_annual_tco']:,.0f}")
# Compute: $156K + Engineering: $10K = $166K

print(f"Winner: {tco['winner']} (saves ${tco['difference']:,.0f}/year)")
# Self-hosted wins narrowly here: the 30% premium ($36K/year) slightly exceeds
# the engineering savings ($30K/year). At lower compute spend, managed wins.

The critical insight: the compute premium for managed services has to be weighed against engineering savings, and at modest compute spend those savings usually justify it. Self-hosting is only cheaper when your compute costs are large enough that the premium exceeds the engineering cost delta.

The breakeven point: managed services win when engineering hours saved × hourly rate > compute premium. Solve for minimum compute spend where self-hosting makes sense:

$$\text{Compute}_{\min} = \frac{(\Delta\text{Eng hours}) \times \text{Eng rate}}{\text{Managed premium \%}}$$

For most teams with \$200K engineers saving 6 hours/week: $\frac{6 \times 52 \times \$96}{0.30} \approx \$99{,}840$/year in compute before self-hosting pays off.
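The breakeven formula can be wrapped in a helper so it is easy to test different premiums and engineering assumptions. This is a sketch under the same simplifications as the formula (a flat percentage premium and a 2,080-hour working year); `breakeven_annual_compute` is a name invented for this example.

```python
def breakeven_annual_compute(
    eng_hours_saved_per_week: float,
    engineer_annual_cost: float,
    managed_premium_pct: float,
) -> float:
    """Annual raw compute spend above which self-hosting becomes cheaper."""
    if managed_premium_pct <= 0:
        raise ValueError("managed_premium_pct must be positive")
    hourly_rate = engineer_annual_cost / 52 / 40  # 2,080 working hours/year
    annual_eng_savings = eng_hours_saved_per_week * 52 * hourly_rate
    return annual_eng_savings / managed_premium_pct


threshold = breakeven_annual_compute(
    eng_hours_saved_per_week=6,
    engineer_annual_cost=200_000,
    managed_premium_pct=0.30,
)
print(f"Self-hosting pays off above ~${threshold:,.0f}/year in compute")

# Sensitivity to the premium charged by the managed service
for premium in (0.20, 0.30, 0.40):
    value = breakeven_annual_compute(6, 200_000, premium)
    print(f"premium {premium:.0%}: breakeven ${value:,.0f}/year")
```

With the unrounded hourly rate ($96.15) the threshold comes out at $100,000 per year; the $99,840 figure in the text uses the rate rounded down to $96.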


Building the Cost Visibility Dashboard

The practical output of a cost model is a dashboard that answers key questions in real time. Here is the architecture:

The tagging strategy is the foundation. Without consistent tags, you cannot attribute cost to teams, models, or business outcomes:

import boto3

# Required tags for every ML resource
REQUIRED_TAGS = {
    "team": "recommendations",          # which team owns it
    "model": "user-embedding-v3",       # which model/project
    "environment": "production",        # prod / staging / dev / experiment
    "cost_center": "cc-ml-platform",    # for finance attribution
    "owner": "[email protected]",        # who to alert when cost spikes
    "experiment_id": "exp-2024-001",    # for training runs
    "auto_shutdown": "false",           # triggers auto-shutdown policy
}


# Boto3 example: apply tags when launching training job
def launch_training_job(config: dict) -> str:
    sagemaker = boto3.client('sagemaker')

    job = sagemaker.create_training_job(
        TrainingJobName=config["job_name"],
        # ... other config ...
        Tags=[
            {"Key": k, "Value": v}
            for k, v in REQUIRED_TAGS.items()
        ]
    )
    return job["TrainingJobArn"]
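Tags only help if they are enforced. One way to do that, sketched below, is a validation step that runs before any resource is launched. The key set mirrors the tags listed above; the allowed environment values and the `validate_tags` helper are assumptions made for this example, not part of boto3 or SageMaker.

```python
REQUIRED_TAG_KEYS = frozenset({
    "team", "model", "environment", "cost_center",
    "owner", "experiment_id", "auto_shutdown",
})
# Assumed vocabulary, based on the "prod / staging / dev / experiment" comment
ALLOWED_ENVIRONMENTS = frozenset({"production", "staging", "dev", "experiment"})


def validate_tags(tags: dict) -> list:
    """Return a list of human-readable problems; an empty list means the tags pass."""
    problems = []
    for key in sorted(REQUIRED_TAG_KEYS - set(tags)):
        problems.append(f"missing tag: {key}")
    for key in sorted(REQUIRED_TAG_KEYS & set(tags)):
        if not str(tags[key]).strip():
            problems.append(f"empty tag: {key}")
    env = tags.get("environment")
    if env and env not in ALLOWED_ENVIRONMENTS:
        problems.append(f"unknown environment: {env}")
    return problems


candidate = {
    "team": "recommendations",
    "model": "user-embedding-v3",
    "environment": "prod",  # not in the allowed vocabulary
    "cost_center": "cc-ml-platform",
    "experiment_id": "exp-2024-001",
    "auto_shutdown": "false",
}
for problem in validate_tags(candidate):
    print(problem)
```

Calling a check like this at the top of a launch function and refusing to proceed on any problem is usually enough to keep untagged resources out of the bill.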

Production Engineering Notes

Cost Attribution at Scale

When you have 50+ models in production, attribution becomes complex. Use a hierarchical tagging system:

Organization
└── Business Unit (cost_center)
    └── Team (team)
        └── Product (product)
            └── Model (model)
                └── Environment (environment)
                    └── Experiment (experiment_id)

Each level rolls up to the next for reporting. A data scientist sees their experiment costs. A VP sees business unit costs. Finance sees total organization costs.
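The roll-up itself is a group-by over tagged cost records. The sketch below assumes each record is a dict with a `cost_usd` value and a `tags` dict, which is a made-up shape for illustration; real billing exports will need mapping into it, and the sample values are hypothetical.

```python
from collections import defaultdict

# Tag keys ordered from the top of the hierarchy to the bottom
HIERARCHY = ("cost_center", "team", "product", "model", "environment", "experiment_id")


def rollup(records: list, level: str) -> dict:
    """Sum cost at the given hierarchy level, keyed by the tag path down to it."""
    depth = HIERARCHY.index(level) + 1
    totals = defaultdict(float)
    for record in records:
        # Missing tags are grouped under "untagged" so they stay visible
        path = tuple(record["tags"].get(key, "untagged") for key in HIERARCHY[:depth])
        totals[path] += record["cost_usd"]
    return dict(totals)


records = [
    {"cost_usd": 120.0, "tags": {"cost_center": "cc-ml-platform", "team": "recommendations",
                                 "product": "feed", "model": "user-embedding-v3",
                                 "environment": "production"}},
    {"cost_usd": 30.0, "tags": {"cost_center": "cc-ml-platform", "team": "recommendations",
                                "product": "feed", "model": "user-embedding-v3",
                                "environment": "dev"}},
    {"cost_usd": 50.0, "tags": {"cost_center": "cc-ml-platform", "team": "search"}},
]

for path, cost in rollup(records, "team").items():
    print(" / ".join(path), f"${cost:,.2f}")
```

Keeping an explicit "untagged" bucket is a deliberate choice here: its size is a direct measure of how well the tagging policy is being followed.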

The 30-Day Rule

Any ML resource that hasn't been used in 30 days should trigger an alert to its owner. Any resource that hasn't been used in 60 days should be automatically stopped (with a 7-day warning). This single policy, consistently applied, typically saves 10–15% of total ML infrastructure spend.

import boto3
from datetime import datetime, timedelta, timezone


def find_idle_sagemaker_endpoints(days_threshold: int = 30) -> list[dict]:
    """Find SageMaker endpoints with no invocations in N days."""
    cloudwatch = boto3.client('cloudwatch')
    sagemaker = boto3.client('sagemaker')

    now = datetime.now(timezone.utc)
    cutoff = now - timedelta(days=days_threshold)
    idle_endpoints = []

    # Paginate so accounts with many endpoints are fully covered
    paginator = sagemaker.get_paginator('list_endpoints')
    for page in paginator.paginate():
        for endpoint in page['Endpoints']:
            # Endpoints younger than the window cannot have been idle that long
            if endpoint['CreationTime'] > cutoff:
                continue

            name = endpoint['EndpointName']
            details = sagemaker.describe_endpoint(EndpointName=name)

            # Invocations is published per (EndpointName, VariantName),
            # so sum the metric across every production variant
            total_invocations = 0
            for variant in details.get('ProductionVariants', []):
                response = cloudwatch.get_metric_statistics(
                    Namespace='AWS/SageMaker',
                    MetricName='Invocations',
                    Dimensions=[
                        {'Name': 'EndpointName', 'Value': name},
                        {'Name': 'VariantName', 'Value': variant['VariantName']},
                    ],
                    StartTime=cutoff,
                    EndTime=now,
                    Period=int(days_threshold * 86400),
                    Statistics=['Sum'],
                )
                total_invocations += sum(
                    dp['Sum'] for dp in response['Datapoints']
                )

            if total_invocations == 0:
                idle_endpoints.append({
                    "name": name,
                    "days_idle": days_threshold,  # idle for at least this many days
                    "status": endpoint['EndpointStatus'],
                    "creation_time": endpoint['CreationTime'],
                    "tags": sagemaker.list_tags(ResourceArn=endpoint['EndpointArn'])['Tags'],
                })

    return idle_endpoints
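Once idle resources are found, the 30-day rule needs a decision step. The sketch below is one possible encoding of the policy described above; the action names are invented, and it assumes the 7-day warning is sent during the last week before the 60-day stop, which the text does not spell out.

```python
from datetime import date, timedelta


def idle_action(last_used: date, today: date,
                alert_after: int = 30, stop_after: int = 60,
                warning_days: int = 7) -> str:
    """Map days since last use to a policy action."""
    days_idle = (today - last_used).days
    if days_idle >= stop_after:
        return "stop"
    if days_idle >= stop_after - warning_days:
        return "warn_before_stop"  # assumed: warning window precedes the stop
    if days_idle >= alert_after:
        return "alert_owner"
    return "ok"


today = date(2024, 3, 1)
for days in (10, 30, 53, 60):
    action = idle_action(today - timedelta(days=days), today)
    print(f"{days:>2} days idle -> {action}")
```

Keeping the thresholds as parameters makes it easy to run stricter limits for expensive GPU resources than for cheap CPU notebooks.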

Common Mistakes

:::danger Treating cloud cost as a finance problem Cloud cost is an engineering problem. Finance can report it, but only engineers can fix it. The moment a team treats cost as something that "finance handles," the bill will grow unchecked. Every engineer on the team should know their team's monthly cloud spend and have a target cost per prediction for their models. :::

:::danger Ignoring data transfer costs Compute costs are visible; data transfer costs hide in line items. Moving 1 TB of features from S3 to a training instance in the same region is free. Moving it cross-region typically costs around $20 (about $0.02/GB between US regions), and moving it out to the internet costs about $90 at $0.09/GB. Running a feature store that syncs cross-AZ every hour can generate more cost than the training compute itself. Always map data flows before estimating costs. :::
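A quick way to size a recurring data flow is to multiply it out for a month. The sketch below uses the 400 GB hourly sync from the opening story and the $0.01–0.02/GB range from the hidden-costs table; which rate applies to a given flow is an assumption you should check against your own bill.

```python
def monthly_transfer_cost(gb_per_sync: float, syncs_per_hour: float,
                          rate_per_gb: float, hours_per_month: int = 730) -> float:
    """Monthly cost of a recurring data transfer at a flat per-GB rate."""
    return gb_per_sync * syncs_per_hour * hours_per_month * rate_per_gb


# 400 GB synced once an hour, at the low and high end of the quoted range
low = monthly_transfer_cost(400, 1, rate_per_gb=0.01)
high = monthly_transfer_cost(400, 1, rate_per_gb=0.02)

print(f"Hourly 400 GB sync: ${low:,.0f} to ${high:,.0f} per month")
print(f"Same sync once a day: ${monthly_transfer_cost(400, 1 / 24, 0.01):,.0f} per month at the low rate")
```

Changing the sync frequency is usually a one-line config change, which makes this one of the cheapest optimizations to try first.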

:::warning Modeling cost without modeling utilization A GPU instance costs $3.06/hr whether it is running at 100% or 5% utilization. Most teams underestimate idle time. Measure actual GPU utilization with CloudWatch or Prometheus before building a cost model - assume 40–60% unless you have data showing otherwise. :::
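Utilization turns a sticker price into an effective price. A minimal sketch, with a hypothetical helper name:

```python
def cost_per_useful_gpu_hour(hourly_rate: float, utilization: float) -> float:
    """Effective price of one hour of actual GPU work at a given utilization."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return hourly_rate / utilization


for utilization in (1.00, 0.50, 0.05):
    effective = cost_per_useful_gpu_hour(3.06, utilization)
    print(f"{utilization:>4.0%} utilized -> ${effective:.2f} per useful GPU-hour")
```

At 50% utilization the $3.06 instance effectively costs $6.12 per hour of work done, and at 5% it costs $61.20.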

:::warning Confusing cost per request with cost per user These are not the same. One user might make 10 requests (power user) or 1 request per month (casual user). Unit economics should be modeled at the level of business value - usually per user, per transaction, or per outcome - not per API call. :::


Interview Q&A

Q: How do you calculate cost per prediction for a model in production?

A: Start with the total monthly inference cost - compute instances, load balancers, data transfer, and monitoring overhead. Divide by total monthly predictions. Cost per prediction = monthly cost / monthly predictions. The tricky part is allocating shared infrastructure: if one instance serves multiple models, allocate by request volume. The even trickier part is including hidden costs: logging every prediction to S3, model artifact loading time at cold start, and autoscaling buffer capacity. A good cost model includes all of these. I typically build a structured cost model in code that takes instance configuration, throughput measurements, and overhead percentages as inputs, so the model can be updated as measurements change.

Q: When does it make sense to self-host a model vs use a managed API like OpenAI?

A: It's primarily a volume and customization decision. At low volume (under ~100K requests/month), managed APIs almost always win on TCO - the engineering cost of self-hosting exceeds the compute savings. At high volume (millions of requests/month), self-hosting a quantized open-source model can be 5–20x cheaper per token. The crossover point depends on your model quality requirements (can OSS match GPT-4 for your task?), engineering capacity, and latency requirements. I'd build a detailed TCO model before deciding, factoring in: compute cost, engineering time for deployment and maintenance, model performance gap, and switching cost if you need to migrate later.

Q: What are the biggest hidden costs in ML infrastructure?

A: Three big ones. First, data transfer - especially cross-region or cross-AZ data movement. Feature stores and training data pipelines are the biggest culprits. Second, checkpoint storage - teams checkpoint frequently but rarely clean up old checkpoints. A 13B model checkpoint is 26 GB; keeping 100 checkpoints is 2.6 TB of S3. Third, idle compute - instances that scaled up for a peak and were never scaled back down, or developer notebooks left running overnight. I'd add a fourth: experiment overhead - failed training runs, hyperparameter sweeps that ran longer than needed, and debugging sessions on expensive GPU instances. These typically add 20–40% to the stated training budget.

Q: How do you build a cost model before you've actually run the system?

A: Start with first principles. For training: estimate FLOPs from model parameters and dataset size using the 6ND formula, then divide by expected GPU throughput at realistic MFU (40–60%). For inference: estimate requests per second from product usage projections, benchmark tokens-per-second on candidate hardware, then calculate instances needed plus a 30% buffer. Add 15% for overhead (monitoring, networking). For storage: estimate model artifact size plus prediction log volume. The key is to make your assumptions explicit and build the model in code so you can easily re-run it as assumptions change. I always do a 3-scenario model: optimistic, expected, and pessimistic - and I always find that the optimistic scenario is what got budgeted and the pessimistic scenario is what happened.

Q: What KPIs do you track for ML system economics?

A: Five key metrics. (1) Cost per prediction - the fundamental unit economic, tracked weekly with a target. (2) GPU/CPU utilization - idle capacity is wasted money; target >70% average utilization for serving, >80% for training. (3) Training cost efficiency - cost per accuracy point gained, tracks whether training runs are getting more efficient over time. (4) Inference cost trend - cost per prediction over time; should be declining as optimizations are applied. (5) Cost per active user - total ML infrastructure cost divided by monthly active users; this connects engineering spend to business value. I also track anomalies: any resource that spikes more than 50% week-over-week gets investigated immediately.
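The week-over-week anomaly check mentioned in this answer can be sketched in a few lines. The input shape (a dict of resource name to weekly cost) and the choice to flag brand-new resources are assumptions made for this example.

```python
def find_cost_spikes(last_week: dict, this_week: dict, threshold: float = 0.50) -> list:
    """Return (resource, relative_change) pairs whose cost rose by more than threshold."""
    spikes = []
    for resource, current in this_week.items():
        previous = last_week.get(resource, 0.0)
        if previous <= 0:
            # No baseline: treat any new spend as worth a look
            if current > 0:
                spikes.append((resource, float("inf")))
            continue
        change = (current - previous) / previous
        if change > threshold:
            spikes.append((resource, change))
    # Largest increases first
    return sorted(spikes, key=lambda item: item[1], reverse=True)


last_week = {"endpoint-a": 100.0, "train-b": 200.0, "notebook-c": 50.0}
this_week = {"endpoint-a": 160.0, "train-b": 210.0, "notebook-c": 40.0, "new-d": 25.0}

for resource, change in find_cost_spikes(last_week, this_week):
    label = "new" if change == float("inf") else f"+{change:.0%}"
    print(f"{resource}: {label}")
```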

Q: How do you handle cost attribution for shared ML infrastructure?

A: Use a combination of tagging and activity-based cost allocation. For dedicated resources (a model deployed for one team), direct attribution is easy. For shared resources (a GPU cluster used by multiple teams), allocate by measured usage: GPU hours consumed per job, requests served per model, storage used per team. The key is instrumentation at the job level - every training job and serving request should log which team, model, and experiment generated it. Build a weekly cost report that aggregates from these logs, not just from cloud billing tags. For shared platform infrastructure (feature store, monitoring), allocate proportionally to usage rather than splitting evenly - otherwise heavy users subsidize light users and there's no incentive to be efficient.
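Proportional allocation of a shared bill is a short calculation once usage is measured. In this sketch the usage unit (GPU-hours) and the team names are illustrative assumptions; any consistent usage measure works the same way.

```python
def allocate_shared_cost(total_cost: float, usage_by_team: dict) -> dict:
    """Split a shared cost across teams in proportion to measured usage."""
    total_usage = sum(usage_by_team.values())
    if total_usage <= 0:
        raise ValueError("total usage must be positive")
    return {
        team: total_cost * usage / total_usage
        for team, usage in usage_by_team.items()
    }


# Hypothetical month: a $10,000 shared GPU cluster, usage in GPU-hours
gpu_hours = {"search": 600, "recommendations": 300, "ads": 100}
allocation = allocate_shared_cost(10_000, gpu_hours)

for team, cost in allocation.items():
    print(f"{team}: ${cost:,.2f}")
```

An even split would charge each team about $3,333 here, so the lightest user would pay more than three times its measured share while the heaviest user is subsidized.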

© 2026 EngineersOfAI. All rights reserved.