
Inference Cost Optimization

$0.08 to $0.02: A Four-Week Campaign

The ML API team had a problem they'd been aware of for a year: inference cost was growing faster than revenue. The API served 10 million requests per day at a total infrastructure cost that worked out to approximately $0.08 per request. At that rate, the API was profitable only because the average revenue per request was $0.35. But the margin was thinning as volume grew faster than the team could optimize.

The CTO set a target: $0.02 per request within 90 days, without a measurable quality degradation on the company's primary accuracy benchmark. The ML engineering team mapped out the opportunity:

| Optimization | Potential Savings |
| --- | --- |
| INT8 quantization (4× throughput, 0.5% accuracy loss) | ~60% cost reduction |
| Dynamic batching (batch size 1→16 average) | 3.2× throughput |
| Instance right-sizing (m5.2xlarge → g4dn.xlarge) | Better price/performance |
| Semantic caching for repeated queries | 18% hit rate, 18% fewer model calls |
| Autoscaling (vs. always-on peak fleet) | 35% reduction in idle compute |

Not all of these were independent - quantization reduced memory enough to enable larger batches on the same instances. The team implemented them in order of ROI, tracking cost per request weekly.
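Because cost reductions multiply rather than add, a small model helps when ranking optimizations by ROI. The following is a minimal sketch that assumes each optimization is independent, which (as noted above) was not strictly true for this team; `compound_cost_per_request` is an illustrative helper, not part of the team's tooling.

```python
from functools import reduce


def compound_cost_per_request(base_cost: float, reductions: list[float]) -> float:
    """Apply fractional cost reductions one after another.

    Each reduction is the fraction of the *remaining* cost it removes,
    so 0.60 followed by 0.18 leaves 0.40 * 0.82 of the original cost.
    """
    return reduce(lambda cost, r: cost * (1.0 - r), reductions, base_cost)


# Illustrative only: treat quantization (~60%) and caching (~18%) as independent.
estimate = compound_cost_per_request(0.08, [0.60, 0.18])
print(f"Estimated cost per request: ${estimate:.4f}")
```

Under this assumption, a 60% reduction followed by an 18% reduction removes 67.2% of the cost in total, not 78%. Interacting optimizations can land above or below such an estimate, which is why the team tracked measured cost per request weekly.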

Week 4 result: $0.021 per request, within rounding of the $0.02 target and well ahead of the 90-day deadline. Accuracy benchmark: 0.3% below the pre-optimization baseline - within the acceptable threshold.

This lesson walks through each optimization and the engineering decisions behind it.



Why This Exists: Inference Is Where the Money Goes

A common misconception: training is the expensive part of ML. In production systems with sustained traffic, inference cost typically exceeds training cost by 5–20×. A model trained once at $10,000 may serve 100 million requests per month for years. At $0.08 per request, that is $8 million per month. Training is a one-time cost; inference is a recurring cost that scales with usage.

This asymmetry makes inference optimization the highest-leverage ML cost reduction opportunity for any model with significant traffic. A 50% reduction in inference cost is a 50% reduction in the largest recurring cost in the ML system.
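The arithmetic behind this asymmetry can be checked with a few lines. This is a sketch using the figures quoted above ($10,000 training cost, $0.08 per request, 100 million requests per month); the function name and the 30-day month are assumptions made for illustration.

```python
def inference_vs_training(
    training_cost_usd: float,
    cost_per_request_usd: float,
    monthly_requests: float,
) -> dict:
    """Compare a one-time training cost with recurring inference spend."""
    monthly_inference_usd = cost_per_request_usd * monthly_requests
    # Number of served requests after which inference spend equals training spend
    requests_to_match = training_cost_usd / cost_per_request_usd
    requests_per_day = monthly_requests / 30  # assumes a 30-day month
    return {
        "monthly_inference_usd": monthly_inference_usd,
        "requests_to_match_training_cost": requests_to_match,
        "days_to_match_training_cost": requests_to_match / requests_per_day,
    }


summary = inference_vs_training(10_000, 0.08, 100_000_000)
for name, value in summary.items():
    print(f"{name}: {value:,.4f}")
```

With these inputs, cumulative inference spend passes the training cost after roughly 125,000 requests, which at this traffic level is a small fraction of one day.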

The levers available for inference cost optimization operate at different levels:

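  • Model efficiency: Quantization, pruning, distillation - making the model cheaper to run (quantization usually needs no retraining; pruning and distillation involve fine-tuning or training a smaller student model)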
  • Hardware efficiency: Instance selection, GPU vs. CPU tradeoffs, multi-model serving
  • Traffic efficiency: Batching, caching, request routing - reducing the number of model calls
  • Capacity efficiency: Autoscaling, spot instances for async workloads - matching capacity to demand

Historical Context

Inference optimization became a major engineering discipline around 2018–2020 as deep learning models moved from research into high-traffic production systems. The shift from batch inference (process a queue overnight) to real-time inference (respond in under 100ms to live requests) made latency and throughput constraints explicit and demanding.

Model quantization, particularly INT8 quantization, gained production traction after NVIDIA's TensorRT added INT8 inference support (2017) and PyTorch (via PyTorch Quantization) and ONNX Runtime subsequently added INT8 inference of their own. The core insight: inference only requires forward pass computation, which is numerically less demanding than training's backward pass. Lower precision weights produce acceptable inference quality at significantly higher hardware throughput.

The concept of semantic caching for ML APIs emerged around 2023 with the rise of LLM applications, where adjacent or similar queries often produce the same or equivalent answers. GPTCache and similar tools formalized the approach: cache model outputs indexed by embedding similarity, not exact string match.


Core Concepts

Model Quantization: ROI Analysis

Quantization reduces the numeric precision of model weights and activations from 32-bit floats to 8-bit integers (INT8) or even 4-bit integers (INT4). The primary benefit: INT8 values are a quarter the size of FP32 values, so each forward pass moves 4× less data through memory, which typically yields 2–4× higher throughput on the same hardware.

The trade-off: quantization introduces small amounts of approximation error. The accuracy cost varies by model and task. For most production classification and regression models, INT8 quantization costs 0.1–1% accuracy. For language models, the cost is higher but often acceptable.
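To make the approximation error concrete, here is a minimal NumPy sketch of symmetric per-tensor INT8 quantization. It is a simplified stand-in for what frameworks do (they typically also use zero points, per-channel scales, and calibrated activation ranges); the function names and the random weight matrix are assumptions for illustration.

```python
import numpy as np


def quantize_symmetric_int8(x: np.ndarray) -> tuple[np.ndarray, float]:
    """Map float values to INT8 using a single per-tensor scale."""
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float values from INT8 codes."""
    return q.astype(np.float32) * scale


rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.05, size=(256, 256)).astype(np.float32)

q_weights, scale = quantize_symmetric_int8(weights)
restored = dequantize(q_weights, scale)
max_error = float(np.max(np.abs(weights - restored)))

print(f"storage: {weights.nbytes} bytes -> {q_weights.nbytes} bytes")
print(f"max abs error: {max_error:.6f} (scale = {scale:.6f})")
```

Storage drops by 4×, and each value is off by at most about half a quantization step (`scale / 2`). Whether that per-weight error translates into a visible accuracy drop depends on the model, which is why the accuracy cost has to be measured rather than assumed.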

```python
import time

import numpy as np
import torch
from torch.quantization import convert, prepare, quantize_dynamic


def benchmark_model(model, input_tensor: torch.Tensor, n_runs: int = 100) -> dict:
    """Benchmark model latency and throughput."""
    model.eval()

    # Warmup
    with torch.no_grad():
        for _ in range(10):
            _ = model(input_tensor)

    # Benchmark (for CUDA models, call torch.cuda.synchronize() before reading the clock)
    latencies = []
    with torch.no_grad():
        for _ in range(n_runs):
            start = time.perf_counter()
            _ = model(input_tensor)
            end = time.perf_counter()
            latencies.append((end - start) * 1000)

    return {
        "p50_ms": np.percentile(latencies, 50),
        "p95_ms": np.percentile(latencies, 95),
        "p99_ms": np.percentile(latencies, 99),
        # Forward passes per second; multiply by the batch dimension for samples per second
        "throughput_qps": 1000 / np.mean(latencies),
    }


def apply_dynamic_quantization(model) -> torch.nn.Module:
    """
    Dynamic quantization: quantize weights to INT8 at load time,
    quantize activations at inference time.
    Requires no calibration data. Best for LSTM, Transformer, Linear-heavy models.
    """
    quantized_model = quantize_dynamic(
        model,
        {torch.nn.Linear, torch.nn.LSTM},  # quantize these layer types
        dtype=torch.qint8
    )
    return quantized_model


def apply_static_quantization(model, calibration_dataloader) -> torch.nn.Module:
    """
    Static quantization: calibrate activation ranges using representative data.
    Better accuracy than dynamic quantization. Requires calibration dataset.
    Best for CNN-based models.

    Eager-mode static quantization expects the model's forward pass to be
    wrapped in QuantStub/DeQuantStub modules.
    """
    model.eval()

    # Set quantization configuration
    model.qconfig = torch.quantization.get_default_qconfig("fbgemm")  # x86 CPU
    # model.qconfig = torch.quantization.get_default_qconfig("qnnpack")  # ARM CPU

    # Prepare model (insert observers)
    prepared_model = prepare(model)

    # Calibrate using representative data
    with torch.no_grad():
        for batch in calibration_dataloader:
            prepared_model(batch)

    # Convert to quantized model
    quantized_model = convert(prepared_model)

    return quantized_model


def quantization_roi_analysis(
    baseline_throughput_qps: float,
    quantized_throughput_qps: float,
    baseline_accuracy: float,   # fraction, e.g. 0.950 for 95.0%
    quantized_accuracy: float,  # fraction, e.g. 0.945 for 94.5%
    instance_cost_per_hour: float,
    monthly_request_volume: float
) -> dict:
    """
    Compute the financial ROI of quantization given throughput and accuracy changes.
    """
    # Throughput improvement
    throughput_improvement = quantized_throughput_qps / baseline_throughput_qps

    # Instances needed (fewer instances for same throughput)
    # If we need to serve N requests/sec:
    #   Before: ceil(N / baseline_throughput) instances
    #   After:  ceil(N / quantized_throughput) instances

    rps = monthly_request_volume / (30 * 24 * 3600)  # requests per second
    baseline_instances = max(1, int(np.ceil(rps / baseline_throughput_qps)))
    quantized_instances = max(1, int(np.ceil(rps / quantized_throughput_qps)))

    monthly_baseline_cost = baseline_instances * instance_cost_per_hour * 730
    monthly_quantized_cost = quantized_instances * instance_cost_per_hour * 730
    monthly_savings = monthly_baseline_cost - monthly_quantized_cost

    accuracy_delta = quantized_accuracy - baseline_accuracy

    return {
        "throughput_improvement": f"{throughput_improvement:.1f}×",
        "baseline_instances": baseline_instances,
        "quantized_instances": quantized_instances,
        "monthly_baseline_cost_usd": round(monthly_baseline_cost, 0),
        "monthly_quantized_cost_usd": round(monthly_quantized_cost, 0),
        "monthly_savings_usd": round(monthly_savings, 0),
        "annual_savings_usd": round(monthly_savings * 12, 0),
        "accuracy_delta_pct": accuracy_delta * 100,  # percentage points
        "recommendation": (
            "STRONG GO - high savings, acceptable accuracy cost" if
            monthly_savings > 5000 and abs(accuracy_delta) < 0.01
            else "REVIEW - accuracy cost may be unacceptable" if abs(accuracy_delta) > 0.02
            else "MARGINAL - savings modest, proceed with caution"
        )
    }
```

Dynamic Batching Economics

Batching multiple inference requests together is one of the most effective ways to increase GPU utilization. A GPU running batch size 1 is typically 5–20× less efficient than the same GPU running batch size 16–32 (depending on model architecture). The throughput improvement comes from better GPU utilization - the hardware executes matrix multiplications more efficiently on larger inputs.

The trade-off: batching introduces latency. A request that arrives when the batch isn't full must wait for more requests to arrive or for a timeout.
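The size of that latency cost depends on how fast requests arrive. Below is a rough sketch that assumes steady, evenly spaced arrivals on a single serving instance; both helper names are hypothetical, and bursty traffic will behave differently.

```python
def expected_batch_size(arrival_rps: float, max_wait_ms: float, max_batch_size: int) -> float:
    """Approximate batch size when a batch opens on the first request
    and closes after max_wait_ms or when it reaches max_batch_size."""
    arrivals_during_wait = arrival_rps * (max_wait_ms / 1000.0)
    return min(float(max_batch_size), 1.0 + arrivals_during_wait)


def wait_needed_ms(arrival_rps: float, target_batch_size: int) -> float:
    """Approximate wait required for a batch to reach target_batch_size."""
    return (target_batch_size - 1) / arrival_rps * 1000.0


# 10 million requests/day, all routed to one instance (illustrative assumption)
rps = 10_000_000 / 86_400
print(f"arrival rate: {rps:.1f} req/s")
print(f"batch size with a 10 ms wait: {expected_batch_size(rps, 10.0, 32):.2f}")
print(f"wait needed for batch size 16: {wait_needed_ms(rps, 16):.1f} ms")
```

At about 116 requests per second, a 10 ms window collects only around two requests, while reaching a batch of 16 would take roughly 130 ms of waiting. Spreading the same traffic across more instances lowers the per-instance arrival rate and shrinks batches further, so batch size targets and fleet size have to be tuned together.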

```python
import asyncio
from collections import deque
from typing import Any, Callable, Optional

import numpy as np


class DynamicBatcher:
    """
    Server-side dynamic batching for ML inference.
    Groups incoming requests into batches for efficient GPU utilization.
    Call start() from inside a running event loop before submitting requests.
    """
    def __init__(
        self,
        model_fn: Callable,          # function(batch_inputs) -> batch_outputs
        max_batch_size: int = 32,
        max_wait_ms: float = 10.0,   # max time to wait for batch to fill
        min_batch_size: int = 1      # kept for API compatibility; not enforced by this loop
    ):
        self.model_fn = model_fn
        self.max_batch_size = max_batch_size
        self.max_wait_s = max_wait_ms / 1000  # convert to seconds
        self.min_batch_size = min_batch_size

        self.pending_requests = deque()
        self._lock = asyncio.Lock()
        self._batch_event = asyncio.Event()  # set while at least one request is pending
        self._full_event = asyncio.Event()   # set once a full batch is pending
        self._task: Optional[asyncio.Task] = None

    def start(self) -> None:
        """Launch the background batch loop on the running event loop."""
        if self._task is None:
            self._task = asyncio.get_running_loop().create_task(self._batch_loop())

    async def infer(self, input_data) -> Any:
        """Submit a single request and wait for its result."""
        future = asyncio.get_running_loop().create_future()

        async with self._lock:
            self.pending_requests.append((input_data, future))
            self._batch_event.set()  # wake the batch loop on the first request
            if len(self.pending_requests) >= self.max_batch_size:
                self._full_event.set()  # trigger immediate batch execution

        return await future

    async def _batch_loop(self):
        """Background loop that processes batches."""
        while True:
            # Wait for the first pending request
            await self._batch_event.wait()

            # Give the batch up to max_wait_s to fill; a full batch runs immediately
            try:
                await asyncio.wait_for(self._full_event.wait(), timeout=self.max_wait_s)
            except asyncio.TimeoutError:
                pass

            async with self._lock:
                # Take up to max_batch_size requests
                batch_requests = []
                while self.pending_requests and len(batch_requests) < self.max_batch_size:
                    batch_requests.append(self.pending_requests.popleft())

                if not self.pending_requests:
                    self._batch_event.clear()
                if len(self.pending_requests) < self.max_batch_size:
                    self._full_event.clear()

            if not batch_requests:
                continue

            # Execute batch
            batch_inputs = [req[0] for req in batch_requests]
            batch_futures = [req[1] for req in batch_requests]

            try:
                # Run the blocking model call in a worker thread so the
                # event loop can keep accepting requests for the next batch
                batch_outputs = await asyncio.to_thread(self.model_fn, batch_inputs)
                for future, output in zip(batch_futures, batch_outputs):
                    if not future.done():
                        future.set_result(output)
            except Exception as e:
                for future in batch_futures:
                    if not future.done():
                        future.set_exception(e)


def compute_batching_economics(
    single_request_throughput_qps: float,  # at batch size 1
    batched_throughput_qps: float,         # at optimal batch size
    p95_latency_single_ms: float,
    p95_latency_batched_ms: float,
    hourly_instance_cost: float,
    monthly_request_volume: float,
    latency_sla_ms: float = 200.0
) -> dict:
    """Analyze the economics of dynamic batching."""

    rps = monthly_request_volume / (30 * 24 * 3600)
    instances_single = max(1, int(np.ceil(rps / single_request_throughput_qps)))
    instances_batched = max(1, int(np.ceil(rps / batched_throughput_qps)))

    cost_single = instances_single * hourly_instance_cost * 730
    cost_batched = instances_batched * hourly_instance_cost * 730

    return {
        "throughput_improvement": f"{batched_throughput_qps / single_request_throughput_qps:.1f}×",
        "latency_increase_ms": p95_latency_batched_ms - p95_latency_single_ms,
        "latency_within_sla": p95_latency_batched_ms <= latency_sla_ms,
        "monthly_cost_single_usd": round(cost_single, 0),
        "monthly_cost_batched_usd": round(cost_batched, 0),
        "monthly_savings_usd": round(cost_single - cost_batched, 0),
    }
```

Caching: Exact and Semantic

Not all requests need a model call. Many requests are repeated or near-duplicates. Caching model outputs eliminates model compute for cached requests entirely - the most cost-effective optimization when hit rates are high.

Exact caching: Cache on exact input hash. High precision (same output as model), low hit rate (only works for literally identical inputs). Best for: search queries, product lookups, classification on structured inputs.

Semantic caching: Cache on embedding similarity. Retrieve cached outputs for semantically similar but not identical inputs. Higher hit rate than exact caching. Requires an embedding model and a vector similarity search. Acceptable when slightly different inputs produce the same or equivalent outputs.
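The hit decision in a semantic cache comes down to a cosine similarity comparison against a threshold. The sketch below uses hand-written three-dimensional vectors as stand-ins for real embeddings (which usually have hundreds of dimensions); the vectors and the helper names are assumptions for illustration only.

```python
import numpy as np


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two vectors."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def is_semantic_hit(query_emb, cached_emb, threshold: float = 0.97) -> bool:
    """Treat the cached entry as reusable only above the similarity threshold."""
    return cosine_similarity(query_emb, cached_emb) >= threshold


cached = [0.60, 0.80, 0.00]       # embedding of a previously answered query
paraphrase = [0.62, 0.78, 0.05]   # nearly the same direction
unrelated = [0.00, 0.30, 0.95]    # points somewhere else entirely

print(is_semantic_hit(paraphrase, cached))  # similar enough to reuse
print(is_semantic_hit(unrelated, cached))   # falls through to the model
```

When embeddings are normalized to unit length, the division is unnecessary and the similarity reduces to a plain dot product, which is what makes a matrix-vector product over all cached embeddings a workable lookup for small caches.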

```python
import hashlib
import json
import pickle
from typing import Any

import numpy as np
import redis
from sentence_transformers import SentenceTransformer


class ExactCache:
    """Exact-match cache using Redis for ML inference results."""
    def __init__(self, redis_url: str, ttl_seconds: int = 3600):
        self.redis = redis.from_url(redis_url)
        self.ttl = ttl_seconds
        self.hits = 0
        self.misses = 0

    def _key(self, inputs: dict) -> str:
        # Sort keys so that dict ordering does not change the hash
        canonical = json.dumps(inputs, sort_keys=True, default=str)
        return "exact:" + hashlib.sha256(canonical.encode()).hexdigest()

    def get(self, inputs: dict):
        data = self.redis.get(self._key(inputs))
        if data is not None:
            self.hits += 1
            # Only unpickle data written by your own service
            return pickle.loads(data)
        self.misses += 1
        return None

    def set(self, inputs: dict, result: Any):
        self.redis.setex(self._key(inputs), self.ttl, pickle.dumps(result))

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total > 0 else 0.0


class SemanticCache:
    """
    Semantic cache: retrieve cached results for semantically similar queries.
    Uses embedding similarity to find cache hits for near-duplicate inputs.
    The similarity index lives in process memory, so it is not shared
    between workers and is lost on restart.
    """
    def __init__(
        self,
        redis_url: str,
        embedding_model: str = "BAAI/bge-small-en-v1.5",
        similarity_threshold: float = 0.97,  # high threshold for safety
        ttl_seconds: int = 3600,
        max_cache_size: int = 10_000
    ):
        self.redis = redis.from_url(redis_url)
        self.encoder = SentenceTransformer(embedding_model)
        self.threshold = similarity_threshold
        self.ttl = ttl_seconds
        self.max_cache_size = max_cache_size

        self._cache_embeddings = []  # in-memory index for similarity search
        self._cache_keys = []        # Redis keys corresponding to embeddings
        self.hits = 0
        self.misses = 0

    def _embed(self, text: str) -> np.ndarray:
        return self.encoder.encode(text, normalize_embeddings=True)

    def get(self, query: str):
        query_emb = self._embed(query)

        if not self._cache_embeddings:
            self.misses += 1
            return None

        # Cosine similarity with cached embeddings (dot product of unit vectors)
        cache_matrix = np.array(self._cache_embeddings)
        similarities = cache_matrix @ query_emb  # (n_cached,)
        best_idx = int(np.argmax(similarities))
        best_sim = similarities[best_idx]

        if best_sim >= self.threshold:
            redis_key = self._cache_keys[best_idx]
            data = self.redis.get(redis_key)
            # data is None when the Redis entry has expired; count that as a miss
            if data is not None:
                self.hits += 1
                return pickle.loads(data)

        self.misses += 1
        return None

    def set(self, query: str, result: Any):
        query_emb = self._embed(query)
        redis_key = f"semantic:{hashlib.sha256(query.encode()).hexdigest()}"

        self.redis.setex(redis_key, self.ttl, pickle.dumps(result))
        self._cache_embeddings.append(query_emb)
        self._cache_keys.append(redis_key)

        # Evict oldest entries if over size limit
        if len(self._cache_embeddings) > self.max_cache_size:
            evict_key = self._cache_keys.pop(0)
            self._cache_embeddings.pop(0)
            self.redis.delete(evict_key)

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total > 0 else 0.0
```

LLM Cost Per Token Analysis

For large language model APIs, the pricing model is per-token (input tokens + output tokens). Understanding the true cost requires computing token consumption patterns, not just request volumes.

```python
def compute_llm_cost_analysis(
    monthly_requests: int,
    avg_input_tokens: int,
    avg_output_tokens: int,
    input_cost_per_1m_tokens: float,   # USD per 1M input tokens, e.g. 2.50
    output_cost_per_1m_tokens: float,  # USD per 1M output tokens, e.g. 10.00
    cache_hit_rate: float = 0.0
) -> dict:
    """
    Compute monthly cost for an LLM API, including caching savings.
    """
    # Effective requests after cache
    effective_requests = monthly_requests * (1 - cache_hit_rate)

    monthly_input_tokens = effective_requests * avg_input_tokens
    monthly_output_tokens = effective_requests * avg_output_tokens

    input_cost = monthly_input_tokens / 1_000_000 * input_cost_per_1m_tokens
    output_cost = monthly_output_tokens / 1_000_000 * output_cost_per_1m_tokens
    total_cost = input_cost + output_cost

    return {
        "monthly_requests": monthly_requests,
        "cache_hit_rate": cache_hit_rate,
        "effective_requests": effective_requests,
        "monthly_input_tokens_m": monthly_input_tokens / 1_000_000,
        "monthly_output_tokens_m": monthly_output_tokens / 1_000_000,
        "input_cost_usd": round(input_cost, 2),
        "output_cost_usd": round(output_cost, 2),
        "total_monthly_cost_usd": round(total_cost, 2),
        # Averaged over all requests, including those served from cache
        "cost_per_request_usd": round(total_cost / monthly_requests, 5),
        "output_as_pct_of_total": round(output_cost / total_cost * 100, 1) if total_cost > 0 else 0
    }

# GPT-4o at 1M requests/month, average 500 input + 200 output tokens.
# Prices are illustrative; check the provider's current rate card.
gpt4o_costs = compute_llm_cost_analysis(
    monthly_requests=1_000_000,
    avg_input_tokens=500,
    avg_output_tokens=200,
    input_cost_per_1m_tokens=2.50,
    output_cost_per_1m_tokens=10.00,
    cache_hit_rate=0.18  # 18% semantic cache hit rate
)
print(f"Monthly cost: ${gpt4o_costs['total_monthly_cost_usd']:,.0f}")
print(f"Cost per request: ${gpt4o_costs['cost_per_request_usd']:.4f}")
print(f"Output tokens = {gpt4o_costs['output_as_pct_of_total']:.0f}% of cost")
```

Autoscaling vs. Always-On

A fleet sized for peak load wastes compute during off-peak hours. Autoscaling matches capacity to demand by adding instances during high traffic and removing them during low traffic.

The trade-off: autoscaling introduces cold start latency (time to provision a new instance). For CPU inference, cold starts take 30–60 seconds. For GPU inference with large models, 2–10 minutes. During a traffic spike, the first burst of requests hits a scaled-down fleet before new instances come online.
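The exposure during a cold start can be estimated before choosing a scale-down policy. This is a back-of-the-envelope sketch; the function name and the example numbers (a 300 req/s spike, two warm instances at 50 req/s each, a 5-minute cold start) are hypothetical and not taken from the case study.

```python
import math


def cold_start_backlog(
    spike_rps: float,
    warm_instances: int,
    instance_throughput_qps: float,
    cold_start_seconds: float,
) -> dict:
    """Estimate how much traffic exceeds warm capacity while new instances boot."""
    warm_capacity_rps = warm_instances * instance_throughput_qps
    excess_rps = max(0.0, spike_rps - warm_capacity_rps)
    return {
        "warm_capacity_rps": warm_capacity_rps,
        "excess_rps": excess_rps,
        # Requests that must be queued, shed, or served late during the cold start
        "requests_over_capacity": excess_rps * cold_start_seconds,
        "extra_instances_needed": math.ceil(excess_rps / instance_throughput_qps),
    }


result = cold_start_backlog(
    spike_rps=300,
    warm_instances=2,
    instance_throughput_qps=50,
    cold_start_seconds=5 * 60,
)
print(result)
```

In this example, 200 requests per second exceed warm capacity for five minutes, which is 60,000 requests that have to be queued, rejected, or served slowly. A larger warm pool or a faster model load path shrinks that number directly.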

```python
import numpy as np


def estimate_autoscaling_savings(
    peak_rps: float,  # maximum requests per second
    avg_rps: float,   # average requests per second
    instance_throughput_qps: float,
    instance_cost_per_hour: float,
    cold_start_minutes: float = 5,
    scale_down_after_minutes: int = 10  # not modeled in this estimate
) -> dict:
    """
    Estimate savings from autoscaling vs. always-on fleet.
    """
    # Always-on: sized for peak
    always_on_instances = int(np.ceil(peak_rps / instance_throughput_qps * 1.2))  # 20% buffer
    always_on_monthly = always_on_instances * instance_cost_per_hour * 730

    # Autoscaling: sized for average, scales up to peak
    avg_instances = max(2, int(np.ceil(avg_rps / instance_throughput_qps)))  # min 2 for HA
    # Effective cost: average instances + overhead for scaling events
    scaling_overhead_pct = 0.1  # ~10% overhead for over-provisioning during scale events
    autoscaling_monthly = avg_instances * instance_cost_per_hour * 730 * (1 + scaling_overhead_pct)

    monthly_savings = always_on_monthly - autoscaling_monthly
    utilization_always_on = avg_rps / (always_on_instances * instance_throughput_qps)

    # Check savings first so that every cold start duration maps to a sensible answer
    if monthly_savings <= 2000:
        recommendation = "ALWAYS-ON - savings insufficient to justify scaling complexity"
    elif cold_start_minutes < 3:
        recommendation = "AUTOSCALE - large savings, acceptable cold start"
    else:
        recommendation = "CONSIDER WARM POOL - reduce cold start risk"

    return {
        "always_on_instances": always_on_instances,
        "always_on_monthly_usd": round(always_on_monthly, 0),
        "avg_instances": avg_instances,
        "autoscaling_monthly_usd": round(autoscaling_monthly, 0),
        "monthly_savings_usd": round(monthly_savings, 0),
        "savings_pct": round(monthly_savings / always_on_monthly * 100, 1),
        "always_on_avg_utilization_pct": round(utilization_always_on * 100, 1),
        "cold_start_risk_minutes": cold_start_minutes,
        "recommendation": recommendation,
    }
```

Common Mistakes

:::danger Applying quantization without measuring accuracy impact per model
INT8 quantization costs 0.1–1% accuracy for most models - but some models are significantly more sensitive. Transformer models with extreme weight distributions, models trained with low precision, and models fine-tuned on small datasets may show 3–5% accuracy degradation from INT8 quantization. Always benchmark quantized vs. original model on your specific evaluation dataset before deploying. Never apply quantization blindly based on general benchmarks.
:::

:::danger Caching on mutable inputs
If your model inputs include timestamps, session IDs, or user context that changes with each request, exact caching will never hit. If these mutable fields are not predictive and could be excluded from the cache key without changing the model's relevant output, exclude them. Cache key design requires understanding which features actually drive model predictions.
:::
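One way to keep volatile fields out of the cache key is to filter them before hashing. This is a minimal sketch; the field names in `VOLATILE_FIELDS` are hypothetical, and which fields are safe to drop depends on what your model actually uses.

```python
import hashlib
import json

# Hypothetical names for fields that change per request but do not affect the output
VOLATILE_FIELDS = frozenset({"timestamp", "session_id", "request_id"})


def stable_cache_key(inputs: dict, ignore: frozenset = VOLATILE_FIELDS) -> str:
    """Hash only the fields that influence the model's output."""
    relevant = {k: v for k, v in inputs.items() if k not in ignore}
    # sort_keys makes the key independent of dict insertion order
    canonical = json.dumps(relevant, sort_keys=True, separators=(",", ":"))
    return "exact:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


a = stable_cache_key({"query": "red shoes", "locale": "en", "timestamp": 1700000000})
b = stable_cache_key(
    {"locale": "en", "query": "red shoes", "timestamp": 1700000999, "session_id": "abc"}
)
c = stable_cache_key({"query": "blue shoes", "locale": "en"})

print(a == b)  # same predictive inputs, different volatile fields
print(a == c)  # different query
```

The first comparison matches because only `query` and `locale` reach the hash; the second differs because the query itself changed.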

:::warning Setting autoscaling scale-down too aggressively for GPU inference
CPU instances spin up in 60 seconds. GPU instances with large models (7B+ parameters) can take 5–10 minutes to load the model into GPU memory. If you scale down to zero during quiet periods, you will have 5–10 minutes of failed or degraded requests when traffic returns. Keep a minimum fleet of 1–2 warm instances for user-facing GPU inference, even during low traffic.
:::

:::tip Profile latency before and after each optimization
Never assume an optimization reduced cost without measuring. Quantization that increases latency beyond your SLA may require more instances to maintain P99 latency targets - potentially increasing cost instead of reducing it. Always measure throughput, P50/P95/P99 latency, and accuracy before and after each optimization.
:::


Interview Q&A

Q: How does INT8 quantization reduce inference cost and what is the accuracy trade-off?

A: INT8 quantization replaces 32-bit floating-point weights and activations with 8-bit integers. This provides two benefits: memory bandwidth is reduced 4×, and INT8 matrix multiplications execute 2–4× faster than FP32 on modern hardware with dedicated integer arithmetic units (NVIDIA Tensor Cores). For most production classification and regression models, the accuracy cost is 0.1–1% on standard benchmarks. The cost is higher for models that depend on large dynamic ranges in weights (some transformer layers) or for tasks requiring fine-grained numerical precision. In practice, measure accuracy on your specific held-out dataset, not on general benchmarks, before deploying quantized models.

Q: What is semantic caching and when does it make sense for ML inference?

A: Semantic caching stores model outputs indexed by the embedding of the input query rather than the exact input text. When a new query arrives, it is embedded and compared against cached embeddings using cosine similarity. If a sufficiently similar cached query exists (above a similarity threshold like 0.97), the cached result is returned without running the model. Semantic caching makes sense when: queries are natural language (search, Q&A, summarization) where users express the same intent differently, when model outputs are deterministic or near-deterministic for similar inputs, and when the embedding lookup cost is significantly less than the model inference cost. For structured inputs (tabular ML models, recommendation features), exact caching is more appropriate. The critical decision is setting the similarity threshold - too low produces incorrect responses, too high produces a cache that barely hits.
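A threshold is easier to choose with a labeled sample of query pairs than by intuition. The sketch below assumes you have collected pairs of (similarity, whether the cached answer would have been correct); the sample data and the function name are made up for illustration.

```python
def sweep_thresholds(pairs, thresholds):
    """For each threshold, report hit rate and the share of hits that would be wrong.

    pairs: list of (similarity, same_answer) tuples from a labeled sample.
    """
    rows = []
    for t in thresholds:
        hits = [same for sim, same in pairs if sim >= t]
        hit_rate = len(hits) / len(pairs)
        wrong_rate = hits.count(False) / len(hits) if hits else 0.0
        rows.append({"threshold": t, "hit_rate": hit_rate, "wrong_answer_rate": wrong_rate})
    return rows


# Hypothetical labeled sample: (cosine similarity, cached answer was acceptable)
labeled_pairs = [
    (0.99, True), (0.98, True), (0.96, True), (0.95, False),
    (0.93, True), (0.90, False), (0.85, False), (0.70, False),
]

rows = sweep_thresholds(labeled_pairs, [0.90, 0.95, 0.97])
for row in rows:
    print(row)
```

On this toy sample, lowering the threshold from 0.97 to 0.90 triples the hit rate but a third of the hits would return a wrong answer. A real evaluation needs far more pairs, drawn from production traffic.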

Q: How do you decide between autoscaling and an always-on fleet for ML inference?

A: Three factors: traffic variability, cold start time, and savings magnitude. If peak traffic is more than 3× average traffic, autoscaling saves significant money - the always-on fleet is heavily underutilized during off-peak hours. If cold start time is under 2 minutes (CPU inference with small models), autoscaling from zero is feasible. If cold start time is over 5 minutes (GPU inference with large models), maintain a minimum warm fleet and autoscale around it - never scale to zero. Compute the monthly savings: if autoscaling saves less than $500/month, the operational complexity of autoscaling (monitoring, scaling policies, cold start handling) may not be worth it. If it saves $5,000+/month, it almost always is.

Q: Walk me through how you would reduce inference cost for a model serving 10M requests/day.

A: I would approach this in order of ROI. First, measure the current baseline: throughput, latency, cost per request, and accuracy. Second, apply INT8 quantization: benchmark throughput and accuracy. If throughput improves 2–4× with less than 1% accuracy cost, deploy it - this immediately reduces the required instance count. Third, implement dynamic batching: even with quantization, batching requests together improves GPU utilization. Target average batch size of 8–16 for classification models, 4–8 for generation models. Fourth, implement caching: analyze the request distribution for repeated or near-repeated queries. Implement exact caching for identical inputs; semantic caching if hit rates justify the engineering cost. Fifth, right-size instances: after quantization, the memory footprint changes - you may be able to serve on smaller GPU instances or consolidate multiple models onto a single GPU. Finally, implement autoscaling for the serving fleet if traffic shows significant daily variation.
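The baseline in the first step reduces to one formula: fleet cost divided by requests served. Here is a small sketch with hypothetical fleet numbers (40 instances at $1.00/hour serving 1 million requests per day); both function names are illustrative.

```python
def cost_per_request(instances: int, hourly_instance_cost: float, requests_per_day: float) -> float:
    """Fleet cost per served request, ignoring storage, networking, and staff costs."""
    daily_cost = instances * hourly_instance_cost * 24
    return daily_cost / requests_per_day


def instances_for_target(
    target_cost_per_request: float,
    hourly_instance_cost: float,
    requests_per_day: float,
) -> int:
    """Largest fleet whose cost per request stays at or below the target."""
    daily_budget = target_cost_per_request * requests_per_day
    return int(daily_budget // (hourly_instance_cost * 24))


baseline = cost_per_request(instances=40, hourly_instance_cost=1.00, requests_per_day=1_000_000)
budgeted_fleet = instances_for_target(0.0005, 1.00, 1_000_000)

print(f"baseline cost per request: ${baseline:.5f}")
print(f"max instances for a $0.0005 target: {budgeted_fleet}")
```

Working backward from the target gives each optimization a concrete goal: the fleet has to shrink from 40 instances to 20 while still meeting the latency SLA.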

Q: What is the relationship between batch size and throughput in GPU inference?

A: GPU throughput increases with batch size up to the hardware's saturation point, then plateaus. At batch size 1, most of the GPU's parallel compute units are idle - the single example doesn't generate enough work to keep the hardware busy. As batch size increases, GPU utilization rises and throughput per GPU increases approximately linearly until memory bandwidth or compute capacity is saturated, typically around batch size 32–128 for medium-sized models. Beyond saturation, adding more examples per batch increases latency without proportionally increasing throughput. The practical implication: serving with batch size 1 (one request at a time) uses your GPU at 10–20% efficiency; serving with batch size 16–32 uses it at 70–90%. Dynamic batching maximizes throughput by grouping concurrent requests into batches, subject to a maximum latency budget for each request.
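The plateau can be illustrated with a deliberately simple latency model: each batch pays a fixed overhead plus a per-item cost. This is an assumption made for the sketch, not a measurement; real GPUs show flatter latency at small batch sizes and a sharper knee, so the constants (8 ms overhead, 0.25 ms per item) are placeholders.

```python
def batch_latency_ms(batch_size: int, fixed_overhead_ms: float, per_item_ms: float) -> float:
    """Latency of one batch under a linear cost model."""
    return fixed_overhead_ms + batch_size * per_item_ms


def throughput_qps(batch_size: int, fixed_overhead_ms: float, per_item_ms: float) -> float:
    """Items served per second when batches run back to back."""
    return batch_size / batch_latency_ms(batch_size, fixed_overhead_ms, per_item_ms) * 1000.0


FIXED_MS, PER_ITEM_MS = 8.0, 0.25          # placeholder constants
ceiling_qps = 1000.0 / PER_ITEM_MS          # limit as batch size grows

curve = {b: throughput_qps(b, FIXED_MS, PER_ITEM_MS) for b in (1, 4, 16, 64, 256)}
for b, qps in curve.items():
    latency = batch_latency_ms(b, FIXED_MS, PER_ITEM_MS)
    print(f"batch {b:>3}: {qps:7.0f} req/s at {latency:5.1f} ms ({qps / ceiling_qps:.0%} of ceiling)")
```

In this model, throughput climbs quickly while the fixed overhead dominates, then flattens as the per-item cost takes over, while latency keeps growing with batch size. That is the point at which larger batches stop paying for themselves.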

© 2026 EngineersOfAI. All rights reserved.