Understanding what drives ML costs - building a cost-per-request model for your ML system from scratch, and computing unit economics the CTO will believe.

ML Infrastructure Cost Model covers ML infrastructure cost, cost per prediction, ML unit economics from first principles with code examples. Free lesson at https://engineersofai.com/docs/mlops/cost-management/ml-unit-economics

ML Infrastructure Cost Model

The CTO's Question

The CTO walked into the weekly ML platform meeting with a single question: "How much does our recommendation system cost per request?"

The room went quiet. The ML platform team knew what the system cost in aggregate - roughly $340,000 per year in cloud infrastructure. But "per request" was different. That required knowing the total request volume, which required coordinating with the analytics team. And it required attributing costs specifically to the recommendation system, not the whole ML infrastructure shared by twelve different models. And it required deciding what to include: just the inference servers, or also the training pipeline, the feature store, the experiment tracking system, the data pipeline, the monitoring infrastructure?

The engineering lead said he'd have an answer by end of week. It took three weeks, three teams, and two data warehouse queries to produce a number with reasonable confidence. The answer was $0.0047 per request, at roughly 72 million requests per year. Until then, nobody had known the number.

The CTO's follow-up question was worse: "Is that good or bad?"

This lesson is about building the cost model that makes both questions answerable in under an hour - and makes the answer defensible.

Why This Exists: The Invisibility of ML Costs

Most engineering costs are naturally measured per unit of work. A software API has a cost per request. A database has a cost per query. These measurements arise naturally because requests and queries are discrete, countable events that trigger resource consumption.

ML costs are harder to attribute because they are spread across a lifecycle with very different cost structures at each phase:

Training is a large, periodic, batch cost - a single training run might cost $5,000 and happen once a month
Inference is a continuous, per-request cost - small per request but enormous in aggregate at scale
Feature pipelines run on a schedule and produce inputs that many models share - hard to attribute to any single model
Storage accumulates silently - experiment artifacts, model weights, training data, feature store snapshots
Managed services have pricing models that don't map cleanly to usage (flat fee, tiered pricing, per-seat)

Without a cost model, ML teams cannot make economically rational decisions. They cannot evaluate whether a 2% accuracy improvement is worth a 40% increase in compute cost. They cannot compare the cost of training a larger model with hiring an engineer to do feature engineering on a smaller model. They cannot answer the CTO's question.

Historical Context

The discipline of ML cost management is young. Before cloud computing, ML training ran on on-premises hardware that was amortized over years and rarely attributed to specific projects. The marginal cost of a training run was invisible.

Cloud computing changed this by making costs explicit and variable. Suddenly, each EC2 instance or TPU hour had a direct dollar cost visible on a monthly bill. But cloud bills are organized by service (EC2, S3, EKS, RDS), not by model or team. Attributing costs requires tagging, which requires organizational discipline.

The concept of "ML unit economics" entered the vocabulary around 2019–2020, driven by the explosion in LLM costs. Training GPT-3 (OpenAI, 2020) reportedly cost $4.6M. This made the question "what does this model cost to build and operate?" suddenly urgent for any organization building large-scale ML systems.

The emergence of MLOps as a discipline brought cost management into scope alongside deployment, monitoring, and governance. FinOps (Financial Operations) - originally a cloud infrastructure practice - began to be applied specifically to ML workloads around 2021–2022.

Core Concepts

The ML Cost Stack

ML infrastructure cost breaks down into six categories. Understanding each category is a prerequisite to optimizing any of them.

Building a Cost Model from Scratch

A cost model translates infrastructure resource consumption into dollar amounts, attributed to specific models and lifecycle phases.

from dataclasses import dataclass, field
from typing import Dict, List, Optional
import pandas as pd

@dataclass
class ComputeResourceCost:
    """Cost parameters for a specific compute resource."""
    resource_type: str           # "gpu_a100", "cpu_m5.4xlarge", etc.
    hourly_rate_usd: float       # on-demand rate
    spot_discount: float = 0.7   # fraction off on-demand (spot pays ~30% of on-demand)
    reserved_discount: float = 0.4  # fraction off on-demand, 1-year reserved (pays ~60%)

# Cloud pricing reference (AWS us-east-1, approximate)
RESOURCE_COSTS = {
    "p4d.24xlarge_8xa100":  ComputeResourceCost("p4d.24xlarge_8xa100",  32.77, 0.7, 0.4),
    "p3.16xlarge_8xv100":   ComputeResourceCost("p3.16xlarge_8xv100",   24.48, 0.7, 0.4),
    "g4dn.xlarge_1xt4":     ComputeResourceCost("g4dn.xlarge_1xt4",      0.526, 0.7, 0.4),
    "g4dn.12xlarge_4xt4":   ComputeResourceCost("g4dn.12xlarge_4xt4",    3.912, 0.7, 0.4),
    "m5.4xlarge_cpu":       ComputeResourceCost("m5.4xlarge_cpu",        0.768, 0.7, 0.4),
    "c5.2xlarge_cpu":       ComputeResourceCost("c5.2xlarge_cpu",        0.34,  0.7, 0.4),
}

@dataclass
class ModelCostProfile:
    """Complete cost model for a single ML model."""
    model_name: str

    # Training costs
    training_instance_type: str
    training_instance_count: int
    training_hours_per_run: float
    training_runs_per_month: float

    # Inference costs
    inference_instance_type: str
    inference_instance_count: int   # always-on replicas
    monthly_request_volume: float   # total requests per month

    # Feature pipeline costs
    feature_pipeline_instance_type: str
    feature_pipeline_hours_per_day: float

    # Fields with defaults must come after all required fields in a dataclass,
    # otherwise the class definition raises a TypeError.
    training_uses_spot: bool = True

    # Storage volumes (GB), priced per GB-month via s3_cost_per_gb_month
    training_data_gb: float = 0
    model_artifacts_gb: float = 0
    feature_store_gb: float = 0
    experiment_artifacts_gb: float = 0
    s3_cost_per_gb_month: float = 0.023

    # Data transfer costs
    data_transfer_gb_per_month: float = 0
    data_transfer_cost_per_gb: float = 0.09


def compute_monthly_cost(profile: ModelCostProfile) -> dict:
    """Compute total monthly cost and break it down by category."""

    # Training compute cost
    train_resource = RESOURCE_COSTS[profile.training_instance_type]
    train_rate = train_resource.hourly_rate_usd
    if profile.training_uses_spot:
        # spot_discount is the fraction taken off the on-demand rate
        train_rate *= (1 - train_resource.spot_discount)
    training_cost = (
        train_rate *
        profile.training_instance_count *
        profile.training_hours_per_run *
        profile.training_runs_per_month
    )

    # Inference compute cost (always-on)
    infer_resource = RESOURCE_COSTS[profile.inference_instance_type]
    # Use reserved pricing for always-on inference
    infer_rate = infer_resource.hourly_rate_usd * (1 - infer_resource.reserved_discount)
    inference_cost = infer_rate * profile.inference_instance_count * 730  # hours/month

    # Feature pipeline cost
    feature_resource = RESOURCE_COSTS[profile.feature_pipeline_instance_type]
    # Spot pricing: pay (1 - spot_discount) of the on-demand rate
    feature_rate = feature_resource.hourly_rate_usd * (1 - feature_resource.spot_discount)
    feature_cost = feature_rate * profile.feature_pipeline_hours_per_day * 30

    # Storage cost
    total_storage_gb = (
        profile.training_data_gb +
        profile.model_artifacts_gb +
        profile.feature_store_gb +
        profile.experiment_artifacts_gb
    )
    storage_cost = total_storage_gb * profile.s3_cost_per_gb_month

    # Data transfer cost
    transfer_cost = profile.data_transfer_gb_per_month * profile.data_transfer_cost_per_gb

    total_cost = training_cost + inference_cost + feature_cost + storage_cost + transfer_cost

    return {
        "model_name": profile.model_name,
        "training_cost_usd": round(training_cost, 2),
        "inference_cost_usd": round(inference_cost, 2),
        "feature_pipeline_cost_usd": round(feature_cost, 2),
        "storage_cost_usd": round(storage_cost, 2),
        "data_transfer_cost_usd": round(transfer_cost, 2),
        "total_cost_usd": round(total_cost, 2),
        "cost_per_request_usd": round(total_cost / profile.monthly_request_volume, 6)
            if profile.monthly_request_volume > 0 else None,
    }


# Example: recommendation system cost model
reco_profile = ModelCostProfile(
    model_name="product_recommendation_v3",

    # Training: nightly retrain on 8xV100, ~3 hours, spot
    training_instance_type="p3.16xlarge_8xv100",
    training_instance_count=1,
    training_hours_per_run=3.0,
    training_runs_per_month=30,
    training_uses_spot=True,

    # Inference: 4 g4dn.12xlarge instances (4x T4 each), reserved pricing
    inference_instance_type="g4dn.12xlarge_4xt4",
    inference_instance_count=4,
    monthly_request_volume=72_000_000,

    # Feature pipeline: daily Spark job, ~2 hours, spot
    feature_pipeline_instance_type="m5.4xlarge_cpu",
    feature_pipeline_hours_per_day=2.0,

    # Storage
    training_data_gb=500,
    model_artifacts_gb=20,
    feature_store_gb=200,
    experiment_artifacts_gb=150,

    # Transfer
    data_transfer_gb_per_month=100
)

cost_breakdown = compute_monthly_cost(reco_profile)
print(f"Monthly cost: ${cost_breakdown['total_cost_usd']:,.0f}")
print(f"Cost per request: ${cost_breakdown['cost_per_request_usd']:.5f}")
print(f"Annual cost: ${cost_breakdown['total_cost_usd'] * 12:,.0f}")

Unit Economics: Cost Per Prediction

Unit economics connects infrastructure costs to business metrics. For a recommendation system, the relevant unit is a prediction (or a page impression, or a recommendation set). For a fraud detection model, it might be a transaction scored.

def compute_unit_economics(
    monthly_cost_usd: float,
    monthly_volume: float,
    monthly_revenue_attributed_usd: float = None,
    monthly_value_preserved_usd: float = None,  # fraud prevented, churn avoided, etc.
) -> dict:
    """
    Compute unit economics for an ML system.
    """
    cost_per_unit = monthly_cost_usd / monthly_volume

    result = {
        "monthly_cost_usd": monthly_cost_usd,
        "monthly_volume": monthly_volume,
        "cost_per_unit_usd": cost_per_unit,
        "cost_per_1000_units_usd": cost_per_unit * 1000,
    }

    if monthly_revenue_attributed_usd:
        revenue_per_unit = monthly_revenue_attributed_usd / monthly_volume
        result["revenue_per_unit_usd"] = revenue_per_unit
        result["gross_margin_pct"] = (
            (monthly_revenue_attributed_usd - monthly_cost_usd) /
            monthly_revenue_attributed_usd * 100
        )
        result["cost_as_pct_of_revenue"] = (
            monthly_cost_usd / monthly_revenue_attributed_usd * 100
        )

    if monthly_value_preserved_usd:
        result["roi"] = (monthly_value_preserved_usd - monthly_cost_usd) / monthly_cost_usd
        result["cost_per_dollar_value_preserved"] = (
            monthly_cost_usd / monthly_value_preserved_usd
        )

    return result


# Recommendation system example
reco_economics = compute_unit_economics(
    monthly_cost_usd=28_400,
    monthly_volume=72_000_000,
    monthly_revenue_attributed_usd=4_200_000   # estimated click-through revenue
)

print(f"Cost per prediction: ${reco_economics['cost_per_unit_usd']:.5f}")
print(f"Cost per 1,000 predictions: ${reco_economics['cost_per_1000_units_usd']:.4f}")
print(f"ML cost as % of attributed revenue: {reco_economics['cost_as_pct_of_revenue']:.2f}%")

Hidden Costs

The visible line items on a cloud bill - EC2, S3, data transfer - are only part of the total cost. Hidden costs are real costs that don't appear as direct line items:

Idle capacity: An inference fleet scaled to peak load spends most of its time at 30–50% utilization. The idle capacity is paid for but produces no value. At 50% average utilization, you're paying double for what you actually use.

Over-provisioned instances: Teams that size instances for peak memory usage may end up running a model on a GPU instance when a CPU instance would do. Systematic right-sizing analyses regularly find 20–30% of compute running on overspecified hardware.

Experiment waste: A hyperparameter search that runs 500 trials, each on a GPU for 2 hours, costs $500–$2,000 on spot pricing. Many experiments run to completion even when it is clear after the first 30 minutes that the configuration is not competitive. Early stopping and cost budgets for experiments are rarely implemented.

Engineering time: Often the most expensive item in a cost model is the people who build and maintain the system. A senior ML engineer costs $200K–$400K per year fully loaded. An MLOps system that requires 2 FTEs to operate costs $400K–$800K/year in engineering time - often more than the infrastructure it manages.

def estimate_hidden_costs(
    visible_monthly_cost: float,
    avg_utilization_pct: float = 0.5,      # actual utilization of reserved capacity
    experiment_waste_pct: float = 0.3,     # fraction of experiment compute that's wasted
    mlops_fte_count: float = 1.5,          # FTEs dedicated to maintaining the ML platform
    avg_engineer_cost_per_year: float = 300_000
) -> dict:
    """Estimate hidden costs that don't appear on the cloud bill."""

    # Idle capacity: the share of the visible bill that pays for unused capacity.
    # It is already on the bill, so it is reported but not added on top of it.
    idle_cost = visible_monthly_cost * (1 - avg_utilization_pct)

    # Experiment waste (also already inside the visible bill)
    experiment_cost = visible_monthly_cost * 0.15  # assumes ~15% of infra is experiment compute
    waste_cost = experiment_cost * experiment_waste_pct

    # Engineering time (monthly): the only item here that is off the cloud bill
    mlops_engineering_cost = mlops_fte_count * avg_engineer_cost_per_year / 12

    total_hidden = idle_cost + waste_cost + mlops_engineering_cost

    return {
        "visible_monthly_cost": visible_monthly_cost,
        "idle_capacity_cost": round(idle_cost, 0),
        "experiment_waste_cost": round(waste_cost, 0),
        "mlops_engineering_monthly": round(mlops_engineering_cost, 0),
        "total_hidden_monthly": round(total_hidden, 0),
        # Idle and waste are part of the visible bill; adding them again would double count
        "true_total_monthly": round(visible_monthly_cost + mlops_engineering_cost, 0),
        "hidden_as_pct_of_visible": round(total_hidden / visible_monthly_cost * 100, 1)
    }

hidden = estimate_hidden_costs(
    visible_monthly_cost=28_400,
    avg_utilization_pct=0.45,
    mlops_fte_count=1.5
)
print(f"Visible cost: ${hidden['visible_monthly_cost']:,.0f}/month")
print(f"True total:   ${hidden['true_total_monthly']:,.0f}/month")
print(f"Hidden = {hidden['hidden_as_pct_of_visible']}% of visible")
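
The "paying double" intuition for idle capacity can be written down directly. A minimal sketch, assuming a single average utilization figure for the fleet; the helper names are hypothetical and the hourly rate is the approximate g4dn.12xlarge figure from the pricing table:

```python
def cost_per_utilized_hour(hourly_rate_usd: float, avg_utilization: float) -> float:
    """Effective price of one hour of capacity that actually does work."""
    if not 0.0 < avg_utilization <= 1.0:
        raise ValueError("avg_utilization must be in (0, 1]")
    return hourly_rate_usd / avg_utilization


def idle_share_of_bill(monthly_bill_usd: float, avg_utilization: float) -> float:
    """Portion of an existing bill that pays for capacity sitting idle."""
    if not 0.0 < avg_utilization <= 1.0:
        raise ValueError("avg_utilization must be in (0, 1]")
    return monthly_bill_usd * (1 - avg_utilization)


RATE = 3.912  # approximate on-demand USD/hour for a g4dn.12xlarge

for utilization in (0.3, 0.5, 0.8):
    effective = cost_per_utilized_hour(RATE, utilization)
    print(f"utilization {utilization:.0%}: ${effective:.2f} per utilized hour "
          f"({effective / RATE:.2f}x the list rate)")
```

At 50% utilization the effective rate is exactly twice the list rate, which is the sense in which an underused fleet costs double.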

Cost Allocation by Team, Project, and Model

A single cloud account with no tagging produces a bill that is impossible to attribute. Cost allocation requires a systematic tagging strategy applied to all resources at creation time.

# Required tags for every ML resource
REQUIRED_TAGS = {
    "team": str,         # "search-ml", "fraud-ml", "recommendation-ml"
    "project": str,      # "product-ranking-v2", "fraud-detection"
    "model_id": str,     # "product_rec_v3", "fraud_v7"
    "environment": str,  # "production", "staging", "experiment"
    "cost_center": str,  # "eng-ml-platform", "data-science"
    "lifecycle_phase": str,  # "training", "inference", "feature_pipeline"
}

def generate_resource_tags(
    team: str,
    project: str,
    model_id: str,
    environment: str,
    lifecycle_phase: str,
    cost_center: str = None
) -> dict:
    """Generate a complete tag set for an ML resource."""
    return {
        "team": team,
        "project": project,
        "model_id": model_id,
        "environment": environment,
        "lifecycle_phase": lifecycle_phase,
        "cost_center": cost_center or f"eng-{team}",
        "created_by": "mlops-automation",
        "managed_by": "terraform",
    }


def compute_cost_by_dimension(
    aws_cost_explorer_df: pd.DataFrame,
    group_by: str = "model_id"
) -> pd.DataFrame:
    """
    Aggregate AWS Cost Explorer data by a tag dimension.
    Assumes aws_cost_explorer_df has columns date, service, cost_usd, plus one
    column per tag key (for example model_id or team); group_by must name one
    of those tag columns.
    """
    return (
        aws_cost_explorer_df
        .groupby([group_by, "service"])["cost_usd"]
        .sum()
        .reset_index()
        .sort_values("cost_usd", ascending=False)
    )
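
Tags only help attribution if they are enforced before a resource is created. A minimal sketch of such a check, assuming tags arrive as a plain dict; `find_tag_problems` and the allowed-value sets are hypothetical names introduced here for illustration, not part of any cloud SDK:

```python
from typing import Dict, List

# Tag keys every ML resource must carry (same keys as REQUIRED_TAGS).
REQUIRED_TAG_KEYS = (
    "team", "project", "model_id",
    "environment", "cost_center", "lifecycle_phase",
)

# Assumed allowed values, taken from the example comments in REQUIRED_TAGS.
ALLOWED_ENVIRONMENTS = {"production", "staging", "experiment"}
ALLOWED_PHASES = {"training", "inference", "feature_pipeline"}


def find_tag_problems(tags: Dict[str, str]) -> List[str]:
    """Return human-readable problems; an empty list means the tags pass."""
    problems = []
    for key in REQUIRED_TAG_KEYS:
        value = tags.get(key)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty tag: {key}")
    env = tags.get("environment")
    if env and env not in ALLOWED_ENVIRONMENTS:
        problems.append(f"unknown environment: {env}")
    phase = tags.get("lifecycle_phase")
    if phase and phase not in ALLOWED_PHASES:
        problems.append(f"unknown lifecycle_phase: {phase}")
    return problems


good = {
    "team": "recommendation-ml", "project": "product-ranking-v2",
    "model_id": "product_rec_v3", "environment": "production",
    "cost_center": "eng-recommendation-ml", "lifecycle_phase": "inference",
}
bad = {"team": "fraud-ml", "environment": "prod"}

print(find_tag_problems(good))
for problem in find_tag_problems(bad):
    print(problem)
```

A check like this could run in a CI step or a provisioning wrapper so that untagged resources are rejected rather than discovered on next month's bill.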

Production Engineering Notes

Cost model cadence: Update your cost model monthly. Prices change, architectures change, and usage patterns change. A cost model built six months ago and never updated will be systematically wrong.

Cost forecasting: Once you have a cost model, you can forecast future costs by projecting request volume growth. A recommendation system growing 20% per quarter will double its inference cost in ~4 quarters unless you also improve inference efficiency.
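
The doubling claim can be checked with a short compounding projection. This sketch assumes inference cost scales linearly with request volume, which is a simplification; the function names and the $10,000 starting cost are illustrative:

```python
import math
from typing import List


def project_inference_cost(
    current_monthly_cost: float,
    quarterly_growth: float,
    quarters: int,
    efficiency_gain_per_quarter: float = 0.0,
) -> List[float]:
    """Project monthly inference cost at the end of each quarter.

    Assumes cost scales linearly with volume, and that efficiency work
    cuts cost per request by a fixed fraction each quarter.
    """
    costs = []
    cost = current_monthly_cost
    for _ in range(quarters):
        cost *= (1 + quarterly_growth) * (1 - efficiency_gain_per_quarter)
        costs.append(round(cost, 2))
    return costs


def quarters_to_double(quarterly_growth: float) -> float:
    """Quarters for volume (and, under the linear assumption, cost) to double."""
    return math.log(2) / math.log(1 + quarterly_growth)


baseline = project_inference_cost(10_000, 0.20, 4)
with_gains = project_inference_cost(10_000, 0.20, 4, efficiency_gain_per_quarter=0.10)

print(f"Quarters to double at 20% growth: {quarters_to_double(0.20):.1f}")
print(f"No efficiency work: {baseline}")
print(f"10%/quarter gains:  {with_gains}")
```

Under these assumptions, a 10% per-quarter efficiency gain turns a doubling into roughly a 36% increase over the same four quarters.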

Cost-to-accuracy trade-off documentation: For every model in production, document the cost-accuracy frontier: how does model cost change as you vary model size, compute budget, and serving configuration? This allows economically rational decisions when budget constraints arise.
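
One way to document the frontier is to record cost and accuracy for each candidate configuration and keep only the options that are not dominated by a cheaper one. A minimal sketch; the candidate names and numbers are made up for illustration:

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    name: str
    monthly_cost_usd: float
    accuracy: float


def pareto_frontier(candidates: List[Candidate]) -> List[Candidate]:
    """Keep candidates that no cheaper-or-equal option matches or beats on accuracy."""
    frontier = []
    best_accuracy = float("-inf")
    # Sort by cost ascending; on cost ties, higher accuracy first.
    for c in sorted(candidates, key=lambda c: (c.monthly_cost_usd, -c.accuracy)):
        if c.accuracy > best_accuracy:
            frontier.append(c)
            best_accuracy = c.accuracy
    return frontier


# Illustrative numbers only.
candidates = [
    Candidate("small_cpu", 2_000, 0.81),
    Candidate("medium_cpu", 4_500, 0.80),   # dominated: costs more, scores lower
    Candidate("medium_gpu", 7_500, 0.86),
    Candidate("large_gpu", 10_500, 0.875),
]

for c in pareto_frontier(candidates):
    print(f"{c.name:<12} ${c.monthly_cost_usd:>7,.0f}/month  accuracy={c.accuracy:.3f}")
```

When a budget cut arrives, the frontier shows which step down gives up the least accuracy per dollar saved.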

Common Mistakes

:::danger Attributing only inference compute to a model's cost
Teams regularly understate ML costs by counting only the inference servers. Training compute, feature pipeline compute, storage for artifacts and training data, monitoring infrastructure, and the engineering time to maintain all of it are all real costs of operating a model in production. A complete cost model must include all lifecycle phases.
:::

:::warning Using on-demand pricing for cost estimates when you're actually on spot or reserved
On-demand prices are typically 2–3× spot prices (for training workloads) and 1.5–2.5× 1-year reserved prices (for always-on inference). If your cost model uses on-demand list prices but your actual bill reflects spot and reserved discounts, your cost-per-request estimate is significantly inflated.
:::
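
To see how much the pricing assumption moves an estimate, the same inference fleet can be priced three ways. This sketch treats each discount as the fraction taken off the on-demand rate; the 40% and 70% discounts are assumptions, the rate is approximate, and `effective_hourly_rate` is a hypothetical helper:

```python
def effective_hourly_rate(on_demand_rate: float, discount: float = 0.0) -> float:
    """Rate actually paid, where discount is the fraction off on-demand."""
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must be in [0, 1)")
    return on_demand_rate * (1 - discount)


ON_DEMAND = 3.912        # approximate g4dn.12xlarge on-demand rate, USD/hour
HOURS_PER_MONTH = 730
REPLICAS = 4

scenarios = {
    "on_demand": effective_hourly_rate(ON_DEMAND),
    "reserved_1yr": effective_hourly_rate(ON_DEMAND, 0.4),  # assumed 40% off
    "spot": effective_hourly_rate(ON_DEMAND, 0.7),          # assumed 70% off
}

monthly = {name: rate * HOURS_PER_MONTH * REPLICAS for name, rate in scenarios.items()}
for name, cost in monthly.items():
    inflation = monthly["on_demand"] / cost
    print(f"{name:<13} ${cost:>9,.0f}/month  (on-demand is {inflation:.2f}x this)")
```

With these assumed discounts, an estimate built on list prices overstates a reserved fleet by about 1.7× and a spot fleet by about 3.3×.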

:::tip Cost models require usage data, not just pricing data
Cost = price × usage. You need both. Don't start with a pricing spreadsheet - start with your actual cloud bill attributed by resource tag, then compute per-unit costs by dividing by request/training volumes from your observability system.
:::
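
In practice this means joining two data sources: billing rows attributed by tag, and request counts from monitoring. A small sketch using plain Python structures; the row layout, model names, and figures are illustrative assumptions rather than a real billing export schema:

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# Illustrative monthly billing rows, already attributed by tag:
# (model_id, service, cost_usd)
billing_rows: List[Tuple[str, str, float]] = [
    ("product_rec_v3", "EC2", 7_500.0),
    ("product_rec_v3", "S3", 20.0),
    ("fraud_v7", "EC2", 3_000.0),
    ("fraud_v7", "S3", 10.0),
    ("legacy_ranker", "EC2", 900.0),   # billed, but no volume data below
]

# Illustrative monthly request counts from an observability system.
monthly_requests: Dict[str, int] = {
    "product_rec_v3": 72_000_000,
    "fraud_v7": 15_000_000,
}


def cost_per_1k_requests(
    rows: List[Tuple[str, str, float]],
    requests: Dict[str, int],
) -> Dict[str, Optional[float]]:
    """Sum billed cost per model and divide by request volume."""
    totals: Dict[str, float] = defaultdict(float)
    for model_id, _service, cost_usd in rows:
        totals[model_id] += cost_usd
    result: Dict[str, Optional[float]] = {}
    for model_id, total in totals.items():
        volume = requests.get(model_id)
        # No volume data means no unit cost; report None rather than guess.
        result[model_id] = total / volume * 1000 if volume else None
    return result


unit_costs = cost_per_1k_requests(billing_rows, monthly_requests)
for model_id, cost in unit_costs.items():
    label = "no volume data" if cost is None else f"${cost:.4f} per 1,000 requests"
    print(f"{model_id:<16} {label}")
```

Models that show up on the bill without matching volume data are worth flagging on their own: they are either untracked traffic or resources nobody is using.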

Interview Q&A

Q: How would you compute cost per prediction for an ML serving system?

A: Start by identifying all infrastructure components that serve predictions: inference servers, load balancers, feature store lookups, and any preprocessing services. Get the monthly cost of each from your cloud billing data (filtering by resource tags). Sum them to get total monthly infrastructure cost for the serving path. Divide by total monthly prediction volume (from your monitoring system). For a more complete picture, also attribute a fraction of training cost (amortized over the number of predictions the model will serve before the next retrain) and feature pipeline cost (proportional to features used by this model). The result is a per-prediction cost that includes all lifecycle costs, not just the serving compute.
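
The amortization step is the part most often skipped, so here is a minimal sketch of it. Training cost is spread over the predictions served between retrains, and shared feature pipeline cost is split by the share of features this model reads. All names and figures are illustrative assumptions:

```python
from typing import Dict


def cost_per_prediction(
    serving_cost_monthly: float,
    predictions_monthly: float,
    training_cost_per_run: float,
    predictions_between_retrains: float,
    feature_pipeline_cost_monthly: float,
    feature_share: float,
) -> Dict[str, float]:
    """Per-prediction cost split into serving, training, and feature components."""
    if predictions_monthly <= 0 or predictions_between_retrains <= 0:
        raise ValueError("prediction volumes must be positive")
    serving = serving_cost_monthly / predictions_monthly
    training = training_cost_per_run / predictions_between_retrains
    features = feature_pipeline_cost_monthly * feature_share / predictions_monthly
    return {
        "serving": serving,
        "training": training,
        "features": features,
        "total": serving + training + features,
    }


# Illustrative: nightly retrain, so each run is amortized over one day of traffic.
breakdown = cost_per_prediction(
    serving_cost_monthly=6_900,
    predictions_monthly=72_000_000,
    training_cost_per_run=22,
    predictions_between_retrains=72_000_000 / 30,
    feature_pipeline_cost_monthly=500,
    feature_share=0.25,
)
for part, usd in breakdown.items():
    print(f"{part:<9} ${usd * 1000:.4f} per 1,000 predictions")
```

Keeping the components separate makes it obvious which lever matters: in this made-up case serving dominates, so retraining less often would barely move the total.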

Q: What are the main hidden costs in ML systems that don't appear on the cloud bill?

A: Three categories. First, idle capacity: inference fleets sized for peak load typically run at 30–60% average utilization, meaning you pay for 40–70% of your fleet to do nothing most of the time. Second, experiment waste: hyperparameter searches that don't use early stopping or cost budgets run failed experiments to completion. A grid search over 200 configurations where 80% are clearly poor after 20% of training wastes most of its compute budget. Third, engineering time: the biggest hidden cost. A platform that requires 2 FTEs to maintain costs $400K–$800K/year fully loaded, often more than the infrastructure it manages. Include fully loaded engineering costs in your total cost of ownership.
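
The grid-search figure can be made concrete with a few lines of arithmetic. This sketch assumes every full trial costs the same and that poor trials are stopped at a fixed fraction of training; the function name and the $10 per-trial cost are illustrative:

```python
def search_cost(
    n_trials: int,
    cost_per_full_trial: float,
    poor_fraction: float,
    stop_at_fraction: float,
    early_stopping: bool,
) -> float:
    """Total cost of a search where poor trials are optionally stopped early."""
    n_poor = n_trials * poor_fraction
    n_good = n_trials - n_poor
    poor_trial_cost = cost_per_full_trial * (stop_at_fraction if early_stopping else 1.0)
    return n_good * cost_per_full_trial + n_poor * poor_trial_cost


# 200 configurations, 80% clearly poor after 20% of training (figures from the
# answer above); $10 per full trial is an illustrative assumption.
without = search_cost(200, 10.0, 0.8, 0.2, early_stopping=False)
with_es = search_cost(200, 10.0, 0.8, 0.2, early_stopping=True)
avoidable = without - with_es

print(f"Run everything to completion: ${without:,.0f}")
print(f"Stop poor trials at 20%:      ${with_es:,.0f}")
print(f"Avoidable spend:              ${avoidable:,.0f} ({avoidable / without:.0%})")
```

Under these assumptions about 64% of the search budget is avoidable, independent of what a single trial costs.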

Q: What is ML unit economics and why does it matter?

A: ML unit economics expresses the cost of ML infrastructure in terms of per-unit business metrics: cost per prediction, cost per user, cost per decision. It matters because aggregate infrastructure costs are hard to evaluate without context - "$340K/year" sounds like a lot, but "$0.0047 per recommendation" can be easily compared to the revenue per recommendation, the cost of alternative recommendation approaches, and the cost trajectory as the business scales. Unit economics enables business-rational decisions: if cost per prediction is $0.0047 and average order value lift per recommendation is $0.12 (about 25× the cost), the ROI is obvious and additional investment is clearly justified.

Q: How do you set up cost attribution across multiple ML teams sharing a cloud account?

A: The foundation is resource tagging: every cloud resource (EC2 instance, S3 bucket, EKS node, RDS instance) must be tagged at creation with at minimum the team, project, model ID, environment, and lifecycle phase. Enforce this in your infrastructure-as-code templates and CI/CD pipelines - resources that fail tag validation should not be created. Use AWS Cost Explorer or GCP Cost Management to aggregate costs by tag value. For shared resources (a shared feature store, shared monitoring infrastructure), define an allocation methodology: allocate by usage (number of feature reads per team), by model count, or by raw compute. Document the methodology and apply it consistently. Review monthly cost reports by team and present them in team reviews - visibility alone changes spending behavior.
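
Allocation by usage is the simplest of those methodologies to automate. A minimal sketch, assuming a single usage metric per team; the function name, the $12,000 feature store bill, and the read counts are illustrative:

```python
from typing import Dict


def allocate_shared_cost(
    shared_cost_usd: float,
    usage_by_team: Dict[str, float],
) -> Dict[str, float]:
    """Split a shared bill across teams in proportion to a usage metric."""
    total_usage = sum(usage_by_team.values())
    if total_usage <= 0:
        raise ValueError("total usage must be positive to allocate by usage")
    return {
        team: shared_cost_usd * usage / total_usage
        for team, usage in usage_by_team.items()
    }


# Illustrative: monthly feature store bill split by feature reads per team.
feature_reads = {
    "search-ml": 600_000_000,
    "fraud-ml": 150_000_000,
    "recommendation-ml": 250_000_000,
}
allocation = allocate_shared_cost(12_000.0, feature_reads)

for team, cost in allocation.items():
    print(f"{team:<18} ${cost:>8,.0f}")
```

Whatever metric is chosen, the allocations should always sum back to the shared bill, which makes a useful sanity check in a monthly report.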

Q: A model that serves 10M requests/day has a monthly cloud bill of $50,000. Is this good or bad?

A: Without context, neither. To evaluate, first compute cost per request: $50,000 / (10M × 30) = $0.000167 per request, or $0.167 per 1,000 requests. Compare to industry benchmarks - for a CPU-served classification model, this is on the higher end; for a GPU-served large language model, it is very competitive. Compare to the value per request: if this model increases average order value by $0.05 per request, the $0.000167 cost is 0.33% of the value created - excellent economics. Compare to the cost of alternatives: could a simpler model produce 90% of the value at 20% of the cost? The answer "good or bad" requires knowing the denominator - what value does the model create per request.
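
The arithmetic in this answer is short enough to script, which makes it easy to rerun with different value assumptions. A sketch; the function name is hypothetical and the $0.05 value per request is the figure assumed in the answer:

```python
from typing import Dict


def evaluate_request_economics(
    monthly_cost_usd: float,
    requests_per_day: float,
    value_per_request_usd: float,
    days_per_month: int = 30,
) -> Dict[str, float]:
    """Cost per request and cost as a percentage of value created."""
    monthly_requests = requests_per_day * days_per_month
    cost_per_request = monthly_cost_usd / monthly_requests
    return {
        "cost_per_request_usd": cost_per_request,
        "cost_per_1k_requests_usd": cost_per_request * 1000,
        "cost_as_pct_of_value": cost_per_request / value_per_request_usd * 100,
    }


result = evaluate_request_economics(
    monthly_cost_usd=50_000,
    requests_per_day=10_000_000,
    value_per_request_usd=0.05,
)
print(f"Cost per request:        ${result['cost_per_request_usd']:.6f}")
print(f"Cost per 1,000 requests: ${result['cost_per_1k_requests_usd']:.3f}")
print(f"Cost as % of value:      {result['cost_as_pct_of_value']:.2f}%")
```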
