GPU Cost Optimization

The $200K Monthly Bill

The ML team had built an impressive platform: three teams training models, a serving layer handling 50 million requests per day, and an experimentation framework that provisioned GPUs on demand. They were moving fast. The monthly AWS bill was $200,000.

The CFO called a meeting. The VP of Engineering asked for a plan to cut GPU costs by 40% without impacting throughput or model quality. The ML platform lead had three weeks.

After an audit, the team found:

  • Training jobs were using p3.8xlarge instances (4× V100) for small experiments that only needed 1 GPU
  • The serving cluster ran 40 A100s at 20% average GPU utilization
  • No jobs were using spot instances - all were on-demand, paying 3× the spot price
  • Weekly model evaluation jobs ran 8 hours but had been provisioned for 24 hours "just in case"

The changes the team made:

  1. Moved all training experiments to spot instances with checkpoint tolerance: -70% training cost
  2. Enabled MIG (Multi-Instance GPU) to share A100s across serving replicas: +2× serving capacity without adding hardware
  3. Right-sized evaluation jobs with capacity planning: -66% evaluation cost
  4. Set up reserved instances for the always-on serving cluster: -40% serving cost

Total monthly bill after 4 weeks of changes: $84,000. A 58% reduction with no impact on model quality or system throughput.

This lesson gives you the playbook for each of these optimizations, with the technical depth to implement them correctly rather than just understanding the concept.


Why GPU Costs Are High and Improvable

GPU costs are high because GPU hardware is expensive, and cloud providers pass that capital cost through as high hourly rates. A single H100 card costs roughly $30,000–$40,000 at retail. Cloud providers amortize the hardware over 3–4 years and add margin, which puts a single H100 at about $5–$8/hr on demand.
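
As a rough worked example of that amortization (a sketch, not any provider's actual pricing model): the function name, the $35,000 midpoint price, and the 3.5-year lifetime below are illustrative assumptions. The gap between the hardware-only figure and the on-demand rate is what has to cover power, cooling, networking, idle capacity, and margin.

```python
HOURS_PER_YEAR = 8760


def amortized_hardware_cost_per_hour(card_price_usd: float, lifetime_years: float) -> float:
    """Hardware-only cost per hour if the card runs 24/7 for its whole lifetime."""
    return card_price_usd / (lifetime_years * HOURS_PER_YEAR)


# Assumed inputs: midpoint of the retail price range, 3.5-year amortization.
hourly = amortized_hardware_cost_per_hour(35_000, 3.5)
print(f"Hardware-only cost: ${hourly:.2f}/hr")
```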

But GPU costs are also improvable because of pervasive inefficiency in how teams use GPUs:

  1. Overprovisioning: Using 8 GPUs for a job that fits on 2
  2. Low utilization: GPUs sitting at 20% utilization during serving
  3. Wrong pricing model: Paying on-demand prices for interruptible workloads
  4. Constant allocation: Keeping GPUs allocated 24/7 for jobs that run 6 hours/day
  5. Wrong hardware: Using expensive A100s for workloads that run fine on L4

Each of these is a separate optimization with a specific technical approach.


Spot Instances for Training

Cloud providers offer "spot" (AWS), "preemptible" (GCP), or "low-priority" (Azure) instances at a 60–80% discount. The catch is that the provider can reclaim the instance on short notice when it needs the capacity back (two minutes on AWS, about 30 seconds on GCP and Azure).
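
On AWS the interruption notice is published through the instance metadata service, so something on the instance has to poll for it and tell the training process. The sketch below assumes IMDSv2 is enabled and that it runs on an EC2 spot instance; `fetch_instance_action` and `interruption_pending` are hypothetical helper names, and the injectable `fetch` argument exists only so the logic can be exercised off-instance.

```python
import json
import urllib.error
import urllib.request
from typing import Callable, Optional

IMDS = "http://169.254.169.254/latest"


def fetch_instance_action(timeout: float = 1.0) -> Optional[str]:
    """Return the raw instance-action JSON, or None if no interruption is scheduled."""
    token_req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    try:
        token = urllib.request.urlopen(token_req, timeout=timeout).read().decode()
        req = urllib.request.Request(
            f"{IMDS}/meta-data/spot/instance-action",
            headers={"X-aws-ec2-metadata-token": token},
        )
        return urllib.request.urlopen(req, timeout=timeout).read().decode()
    except urllib.error.HTTPError as err:
        if err.code == 404:  # 404 means no interruption notice has been issued
            return None
        raise


def interruption_pending(fetch: Callable[[], Optional[str]] = fetch_instance_action) -> bool:
    """True once the metadata service reports a scheduled stop, hibernate, or terminate."""
    body = fetch()
    if body is None:
        return False
    return json.loads(body).get("action") in {"stop", "hibernate", "terminate"}
```

A training process could call `interruption_pending()` every few seconds from a background thread and trigger its own checkpoint path when it returns `True`.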

When spot is appropriate for training:

  • Training runs that can resume from checkpoint (all well-designed training jobs)
  • Batch processing jobs where individual interruptions are acceptable
  • Experimentation and hyperparameter search where some runs failing is acceptable

When spot is NOT appropriate:

  • Inference serving (users need consistent availability)
  • Jobs that cannot checkpoint (e.g., early-stage experiments without checkpoint logic)
  • Very long runs where frequent interruptions would extend total wall time significantly

import signal
import subprocess
from pathlib import Path

import torch


class SpotInstanceTrainer:
    """
    Training loop that handles spot instance interruptions gracefully.

    Assumes the process receives SIGTERM when the interruption notice arrives
    (for example from a node termination handler or a metadata poller), which
    leaves about 2 minutes to save and upload a checkpoint.
    """

    def __init__(
        self,
        model,
        optimizer,
        checkpoint_dir: str,
        checkpoint_interval_steps: int = 200,
    ):
        self.model = model
        self.optimizer = optimizer
        self.checkpoint_dir = Path(checkpoint_dir)
        self.checkpoint_dir.mkdir(parents=True, exist_ok=True)
        self.checkpoint_interval_steps = checkpoint_interval_steps
        self.step = 0
        self._steps_to_skip = 0  # batches already consumed before a restart
        self.interrupted = False

        # Register SIGTERM handler: everything it does must finish before SIGKILL
        signal.signal(signal.SIGTERM, self._handle_interruption)

    def _handle_interruption(self, signum, frame):
        print(f"\nSpot interruption signal received at step {self.step}")
        print("Saving emergency checkpoint...")
        self._save_checkpoint(tag="spot_interrupt")
        self.interrupted = True

        # Upload checkpoint to S3 before instance dies
        self._sync_checkpoints_to_s3()
        print("Checkpoint saved and uploaded. Safe to terminate.")

    def _save_checkpoint(self, tag: str = "regular"):
        """Atomic checkpoint save: write to a temp file, then rename."""
        tmp = self.checkpoint_dir / "checkpoint_tmp.pt"
        final = self.checkpoint_dir / f"checkpoint_{tag}_step{self.step}.pt"
        latest = self.checkpoint_dir / "checkpoint_latest.pt"

        torch.save({
            "step": self.step,
            "model_state": self.model.state_dict(),
            "optimizer_state": self.optimizer.state_dict(),
        }, tmp)

        tmp.rename(final)
        # Repoint "latest" at the new file. The target is relative so the link
        # still resolves when checkpoint_dir itself is a relative path.
        if latest.is_symlink() or latest.exists():
            latest.unlink()
        latest.symlink_to(final.name)

    def _sync_checkpoints_to_s3(self):
        """Upload checkpoints to S3 for persistence across spot instances."""
        result = subprocess.run(
            ["aws", "s3", "sync", str(self.checkpoint_dir), "s3://your-bucket/checkpoints/"],
            capture_output=True, timeout=90,  # 90 seconds - within the 2-minute window
        )
        if result.returncode != 0:
            print(f"S3 sync warning: {result.stderr.decode()}")

    def load_from_latest_checkpoint(self) -> bool:
        """Attempt to restore from S3 checkpoint on restart."""
        # Download latest checkpoint from S3
        subprocess.run([
            "aws", "s3", "sync",
            "s3://your-bucket/checkpoints/",
            str(self.checkpoint_dir),
        ], capture_output=True)

        latest = self.checkpoint_dir / "checkpoint_latest.pt"
        if not latest.exists():
            return False

        checkpoint = torch.load(str(latest), map_location="cpu")
        self.step = checkpoint["step"]
        self._steps_to_skip = self.step
        self.model.load_state_dict(checkpoint["model_state"])
        self.optimizer.load_state_dict(checkpoint["optimizer_state"])
        print(f"Resumed from checkpoint at step {self.step}")
        return True

    def train(self, data_loader):
        """Training loop with spot interruption handling."""
        self.load_from_latest_checkpoint()

        for batch_idx, (inputs, labels) in enumerate(data_loader):
            if self.interrupted:
                print("Interrupted - exiting training loop cleanly")
                return

            # Skip batches consumed before the restart (assumes a single pass
            # over a deterministically ordered data_loader)
            if batch_idx < self._steps_to_skip:
                continue

            # Forward + backward
            self.optimizer.zero_grad()
            output = self.model(inputs.cuda())
            loss = torch.nn.functional.cross_entropy(output, labels.cuda())
            loss.backward()
            self.optimizer.step()
            self.step += 1

            if self.step % self.checkpoint_interval_steps == 0:
                self._save_checkpoint()
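
To judge whether interruptions would "extend total wall time significantly", it can help to put rough numbers on the overhead. The sketch below is a simple expected-value model, not a measurement: the function name and every input value (hourly price, discount, interruption rate, restart time) are illustrative assumptions. It conservatively bills all wall-clock hours, including time spent waiting for a replacement instance.

```python
def expected_spot_cost(
    on_demand_hourly: float,
    spot_discount: float,
    compute_hours: float,
    interruptions_per_hour: float,
    checkpoint_interval_hours: float,
    restart_overhead_hours: float,
) -> dict:
    """Estimate spot vs on-demand cost for one training run."""
    # On average an interruption loses half a checkpoint interval of work,
    # plus the time to get a new instance and reload the checkpoint.
    lost_per_interruption = checkpoint_interval_hours / 2 + restart_overhead_hours
    # Fraction of each wall-clock hour lost to interruptions.
    overhead_fraction = interruptions_per_hour * lost_per_interruption
    if overhead_fraction >= 1:
        raise ValueError("interruptions consume more time than the job can progress")

    wall_hours = compute_hours / (1 - overhead_fraction)
    spot_cost = wall_hours * on_demand_hourly * (1 - spot_discount)
    on_demand_cost = compute_hours * on_demand_hourly
    return {
        "wall_hours": wall_hours,
        "spot_cost": spot_cost,
        "on_demand_cost": on_demand_cost,
        "savings_pct": (1 - spot_cost / on_demand_cost) * 100,
    }


# Assumed scenario: $12/hr on demand, 70% spot discount, 100 compute hours,
# one interruption every 20 hours, 30-minute checkpoints, 15-minute restarts.
result = expected_spot_cost(12.0, 0.70, 100, 0.05, 0.5, 0.25)
print({k: round(v, 2) for k, v in result.items()})
```

Under these assumptions the run takes only a few percent longer while keeping most of the discount. Shortening the checkpoint interval is the main lever when interruptions are frequent.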

GPU Sharing: MPS and MIG

Multi-Process Service (MPS)

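By default, CUDA processes that share a GPU are time-sliced: only one process's kernels run at a time, so four small model-serving processes on one GPU each leave most of it idle and pay context-switch overhead. In practice teams often give each process its own GPU instead, which means 4 GPUs for 4 small models. MPS (Multi-Process Service) lets kernels from multiple CUDA processes run concurrently on a single GPU, so small workloads can fill it together.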

# Enable NVIDIA MPS daemon
export CUDA_VISIBLE_DEVICES=0
nvidia-cuda-mps-control -d # start MPS daemon

# Now multiple processes can share GPU 0
# Process 1: python serve_model_a.py &
# Process 2: python serve_model_b.py &
# Both run concurrently on GPU 0

# Stop MPS
echo quit | nvidia-cuda-mps-control

Limitations of MPS:

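  • Error isolation: a fatal GPU fault in one process can take down every process sharing the GPU
  • Memory isolation: no hardware-enforced memory partitioning; all processes draw from the same VRAM pool
  • Single user: an MPS server serves one user's processes at a time, so it is not a way to share a GPU between different users (security concern)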
  • Not recommended for multi-tenant environments
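
MPS does offer one knob for limiting contention between clients: the `CUDA_MPS_ACTIVE_THREAD_PERCENTAGE` environment variable caps the share of the GPU's compute a client process may use. The sketch below assumes the MPS daemon is already running. `mps_client_env` and `launch_mps_clients` are hypothetical helper names, and the cap is a scheduling limit, not an isolation boundary.

```python
import os
import subprocess
from typing import Dict, List


def mps_client_env(active_thread_pct: int, gpu_index: int = 0) -> Dict[str, str]:
    """Environment for an MPS client limited to a share of the GPU's compute."""
    if not 1 <= active_thread_pct <= 100:
        raise ValueError("active_thread_pct must be between 1 and 100")
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    env["CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"] = str(active_thread_pct)
    return env


def launch_mps_clients(commands: List[List[str]], active_thread_pct: int) -> List[subprocess.Popen]:
    """Start each command as an MPS client with the same compute cap."""
    return [subprocess.Popen(cmd, env=mps_client_env(active_thread_pct)) for cmd in commands]


# Example (not executed here): two serving processes capped at 50% each.
# launch_mps_clients([["python", "serve_model_a.py"], ["python", "serve_model_b.py"]], 50)
```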

Multi-Instance GPU (MIG)

MIG (available on data-center GPUs such as the A100 and H100) partitions a single GPU into isolated "slices," each with dedicated compute, memory, and cache. Unlike MPS, where processes share the same compute units and memory, MIG provides hardware isolation.

An A100 80GB can be partitioned into:

  • 7 × 1g.10gb (7 instances, 10 GB each)
  • 3 × 2g.20gb (3 instances, 20 GB each)
  • 1 × 3g.40gb + other combinations
  • 1 × 7g.80gb (a single instance spanning the whole GPU)

# Configure A100 with MIG partitioning
# First enable MIG mode
sudo nvidia-smi -mig 1

# Create 3 MIG instances of 2g.20gb (3 × 20 GB partitions)
sudo nvidia-smi mig -cgi 2g.20gb,2g.20gb,2g.20gb -C

# Verify
nvidia-smi mig -lgi # list GPU instances
nvidia-smi mig -lci # list compute instances

# Disable MIG and restore full GPU
sudo nvidia-smi mig -dci # delete compute instances
sudo nvidia-smi mig -dgi # delete GPU instances
sudo nvidia-smi -mig 0 # disable MIG mode

import torch


def configure_for_mig_instance():
    """
    When running on a MIG instance, PyTorch sees only the allocated slice.
    No code changes needed - CUDA_VISIBLE_DEVICES handles it.
    """
    # On a 2g.20gb MIG instance, this reports roughly 20 GB
    props = torch.cuda.get_device_properties(0)
    print(f"Available VRAM: {props.total_memory / 1e9:.1f} GB")
    print(f"Device: {props.name}")  # includes "MIG 2g.20gb" or similar


# Fit 3 independent inference servers on a single A100.
# List the MIG device UUIDs with `nvidia-smi -L`, then pin one per process
# (older drivers use the form MIG-GPU-<gpu-uuid>/<gi>/<ci> instead):
# CUDA_VISIBLE_DEVICES=MIG-<uuid-a> python serve_model_a.py
# CUDA_VISIBLE_DEVICES=MIG-<uuid-b> python serve_model_b.py
# CUDA_VISIBLE_DEVICES=MIG-<uuid-c> python serve_model_c.py
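
Before scripting a partition layout, a quick sanity check on the slice budget can catch impossible combinations. The sketch below assumes an A100 80GB exposes 7 compute slices and 8 memory slices, and that each profile consumes the (compute, memory) slice counts in the table. It is a necessary check only: the driver also enforces placement rules, so a layout that passes here can still be rejected by `nvidia-smi`. The function and constant names are illustrative.

```python
from typing import Dict, List, Tuple

# (compute slices, memory slices) consumed by each profile on an A100 80GB.
A100_80GB_PROFILES: Dict[str, Tuple[int, int]] = {
    "1g.10gb": (1, 1),
    "2g.20gb": (2, 2),
    "3g.40gb": (3, 4),
    "4g.40gb": (4, 4),
    "7g.80gb": (7, 8),
}
MAX_COMPUTE_SLICES = 7
MAX_MEMORY_SLICES = 8


def fits_on_a100_80gb(profiles: List[str]) -> bool:
    """True if the requested profiles stay within the GPU's slice budget."""
    compute = sum(A100_80GB_PROFILES[p][0] for p in profiles)
    memory = sum(A100_80GB_PROFILES[p][1] for p in profiles)
    return compute <= MAX_COMPUTE_SLICES and memory <= MAX_MEMORY_SLICES


print(fits_on_a100_80gb(["2g.20gb"] * 3))  # the 3-way layout used above
print(fits_on_a100_80gb(["2g.20gb"] * 4))  # needs 8 compute slices
```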

Right-Sizing GPU Instances

Right-sizing is the simplest optimization with the highest ROI. It requires measuring actual resource utilization and selecting the smallest instance that meets requirements.

import subprocess
import threading
import time
from typing import Optional

import numpy as np


def _to_float(field: str) -> Optional[float]:
    """Parse one nvidia-smi CSV field; unsupported values such as "[N/A]" become None."""
    try:
        return float(field)
    except ValueError:
        return None


def _make_sizing_recommendation(gpu_util_mean: float, vram_peak_gb: float) -> str:
    if gpu_util_mean < 25:
        return "OVERPROVISIONED: consider smaller instance or GPU sharing"
    elif gpu_util_mean < 60:
        return "UNDERUTILIZED: investigate bottleneck (likely memory or I/O)"
    else:
        return f"WELL-UTILIZED: current sizing appropriate (VRAM peak: {vram_peak_gb:.1f} GB)"


class GPUUtilizationProfiler:
    """
    Profile GPU utilization during a workload to right-size the instance.
    Samples from all visible GPUs are pooled into one summary.
    """

    def __init__(self, sample_interval_sec: float = 1.0):
        self.sample_interval = sample_interval_sec
        self.samples = []
        self._running = False
        self._thread = None

    def start(self):
        """Start background sampling."""
        self._running = True
        self._thread = threading.Thread(target=self._sample_loop, daemon=True)
        self._thread.start()

    def stop(self) -> dict:
        """Stop sampling and return summary."""
        self._running = False
        if self._thread is not None:
            self._thread.join(timeout=5)
        return self._summarize()

    def _sample_loop(self):
        while self._running:
            try:
                result = subprocess.run([
                    "nvidia-smi",
                    "--query-gpu=utilization.gpu,utilization.memory,memory.used,memory.total,power.draw",
                    "--format=csv,noheader,nounits",
                ], capture_output=True, text=True)

                for line in result.stdout.strip().split("\n"):
                    parts = [_to_float(p.strip()) for p in line.split(",")]
                    # Skip malformed rows and rows missing utilization or memory
                    if len(parts) != 5 or None in parts[:4]:
                        continue
                    self.samples.append({
                        "gpu_util_pct": parts[0],
                        "mem_util_pct": parts[1],
                        "mem_used_mb": parts[2],
                        "mem_total_mb": parts[3],
                        "power_w": parts[4],  # None when power draw is unavailable
                    })
            except OSError:
                pass  # nvidia-smi missing or not runnable; retry on next interval
            time.sleep(self.sample_interval)

    def _summarize(self) -> dict:
        if not self.samples:
            return {}

        gpu_utils = [s["gpu_util_pct"] for s in self.samples]
        mem_used_mb = [s["mem_used_mb"] for s in self.samples]
        vram_peak_gb = max(mem_used_mb) / 1024  # nvidia-smi reports MiB

        return {
            "gpu_utilization_p50_pct": round(float(np.percentile(gpu_utils, 50)), 1),
            "gpu_utilization_p95_pct": round(float(np.percentile(gpu_utils, 95)), 1),
            "gpu_utilization_mean_pct": round(float(np.mean(gpu_utils)), 1),
            "vram_peak_gb": round(vram_peak_gb, 2),
            "vram_p95_gb": round(float(np.percentile(mem_used_mb, 95)) / 1024, 2),
            "n_samples": len(self.samples),
            "recommendation": _make_sizing_recommendation(
                gpu_util_mean=float(np.mean(gpu_utils)),
                vram_peak_gb=vram_peak_gb,
            ),
        }
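
Once the profiler reports a VRAM peak, "selecting the smallest instance that meets requirements" can be a simple lookup. The sketch below is one way to do it under stated assumptions: the catalog entries are hypothetical and the hourly prices are placeholders, not quotes. `GpuOption` and `cheapest_fit` are illustrative names, and the 20% VRAM headroom is an assumed safety margin.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class GpuOption:
    name: str
    vram_gb: float
    hourly_usd: float  # placeholder price for illustration only


# Hypothetical catalog; replace with your provider's current instance list.
CATALOG: List[GpuOption] = [
    GpuOption("L4-24GB", 24, 1.00),
    GpuOption("A100-40GB", 40, 3.00),
    GpuOption("A100-80GB", 80, 5.00),
]


def cheapest_fit(
    vram_peak_gb: float,
    options: List[GpuOption],
    headroom: float = 0.2,
) -> Optional[GpuOption]:
    """Cheapest option whose VRAM covers the measured peak plus headroom."""
    required = vram_peak_gb * (1 + headroom)
    candidates = [o for o in options if o.vram_gb >= required]
    return min(candidates, key=lambda o: o.hourly_usd, default=None)


choice = cheapest_fit(14.0, CATALOG)
print(choice.name if choice else "no single GPU fits; consider sharding")
```

VRAM is only one constraint. Throughput at the target latency should be re-measured on the candidate hardware before switching.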

Reserved Instances for Serving

For always-on serving infrastructure, reserved instances provide 30–60% discount vs on-demand in exchange for a 1–3 year commitment.

def reserved_instance_roi_analysis(
    on_demand_hourly: float,
    reserved_1yr_upfront: float,
    reserved_1yr_hourly: float,
    reserved_3yr_upfront: float,
    reserved_3yr_hourly: float,
    n_instances: int,
    usage_hours_per_month: float = 720,  # 24*30 = fully on
) -> dict:
    """
    Calculate ROI for reserved vs on-demand instances.
    """
    on_demand_monthly = on_demand_hourly * usage_hours_per_month * n_instances
    on_demand_total_1yr = on_demand_monthly * 12

    reserved_1yr_monthly = reserved_1yr_hourly * usage_hours_per_month * n_instances
    reserved_1yr_total = reserved_1yr_upfront * n_instances + reserved_1yr_monthly * 12

    reserved_3yr_monthly = reserved_3yr_hourly * usage_hours_per_month * n_instances
    reserved_3yr_total = reserved_3yr_upfront * n_instances + reserved_3yr_monthly * 36

    return {
        "n_instances": n_instances,
        "on_demand_monthly": round(on_demand_monthly, 0),
        "on_demand_1yr_total": round(on_demand_total_1yr, 0),
        "reserved_1yr_total": round(reserved_1yr_total, 0),
        "reserved_1yr_savings": round(on_demand_total_1yr - reserved_1yr_total, 0),
        "reserved_1yr_savings_pct": round(
            (on_demand_total_1yr - reserved_1yr_total) / on_demand_total_1yr * 100, 1
        ),
        "reserved_3yr_total_annualized": round(reserved_3yr_total / 3, 0),
        "reserved_3yr_savings_vs_ondemand_pct": round(
            (on_demand_total_1yr * 3 - reserved_3yr_total) / (on_demand_total_1yr * 3) * 100, 1
        ),
    }


# Example: AWS p3.2xlarge (V100 16GB)
# On-demand: $3.06/hr, 1-yr reserved: ~$1.82/hr + $0 upfront (no-upfront)
roi = reserved_instance_roi_analysis(
    on_demand_hourly=3.06,
    reserved_1yr_upfront=0,
    reserved_1yr_hourly=1.82,
    reserved_3yr_upfront=0,
    reserved_3yr_hourly=1.31,
    n_instances=8,
    usage_hours_per_month=720,
)

for key, value in roi.items():
    print(f"  {key}: {value}")
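
The analysis above assumes the instances run 24/7. A reservation is billed whether or not the instance is in use, so a useful companion number is the break-even utilization: the fraction of hours an instance must actually be needed before reserving beats on-demand. A minimal sketch, using the same p3.2xlarge rates as the example above (function names are illustrative):

```python
def break_even_utilization(on_demand_hourly: float, reserved_effective_hourly: float) -> float:
    """Fraction of hours in use at which reserved and on-demand cost the same.

    reserved_effective_hourly should include any upfront payment spread over
    the term, since the reservation is paid for every hour regardless of use.
    """
    return reserved_effective_hourly / on_demand_hourly


def cheaper_option(
    on_demand_hourly: float,
    reserved_effective_hourly: float,
    expected_utilization: float,
) -> str:
    """Pick the cheaper pricing model for an expected utilization in [0, 1]."""
    threshold = break_even_utilization(on_demand_hourly, reserved_effective_hourly)
    return "reserved" if expected_utilization > threshold else "on-demand"


threshold = break_even_utilization(3.06, 1.82)
print(f"Break-even utilization: {threshold:.1%}")
```

With these rates the break-even sits just under 60%, so an always-on serving cluster clears it comfortably while a cluster needed only 6 hours a day does not.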

Efficient Batching for Serving

Batching is the most straightforward GPU utilization improvement for inference serving. A GPU running batch size 1 at 30% utilization is wasting 70% of its compute capacity on every request.

import asyncio
import time

import torch


class DynamicBatcher:
    """
    Batch individual inference requests together for GPU efficiency.
    Collects requests up to max_batch_size or max_wait_ms, whichever comes first.
    """

    def __init__(
        self,
        model,
        max_batch_size: int = 32,
        max_wait_ms: float = 50.0,  # max time to wait before processing a partial batch
    ):
        self.model = model
        self.max_batch_size = max_batch_size
        self.max_wait_ms = max_wait_ms
        self.queue: asyncio.Queue = asyncio.Queue()

    async def infer(self, input_tensor: torch.Tensor) -> torch.Tensor:
        """
        Submit a single inference request and await the result.
        The batcher groups requests into batches automatically.
        """
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((input_tensor, future))
        return await future

    async def run_batching_loop(self):
        """
        Continuously collect requests and process in batches.
        Run this as a background task.
        """
        while True:
            # Wait for the first request
            inputs_and_futures = [await self.queue.get()]
            deadline = time.perf_counter() + self.max_wait_ms / 1000

            # Collect more requests up to max_batch_size or deadline
            while (
                len(inputs_and_futures) < self.max_batch_size
                and time.perf_counter() < deadline
            ):
                try:
                    item = self.queue.get_nowait()
                    inputs_and_futures.append(item)
                except asyncio.QueueEmpty:
                    await asyncio.sleep(0.001)

            # Process batch
            inputs = [x for x, _ in inputs_and_futures]
            futures = [f for _, f in inputs_and_futures]

            try:
                batch = torch.stack(inputs, dim=0).cuda()
                # Note: the forward pass blocks the event loop while it runs
                with torch.no_grad(), torch.autocast(device_type="cuda", dtype=torch.bfloat16):
                    batch_output = self.model(batch)
            except Exception as exc:
                # Fail the whole batch rather than leaving callers waiting forever
                for future in futures:
                    if not future.done():
                        future.set_exception(exc)
                continue

            # Return individual results
            for i, future in enumerate(futures):
                if not future.done():
                    future.set_result(batch_output[i].cpu())

Utilization Monitoring Dashboard

Key metrics to track in your GPU monitoring dashboard:

| Metric | Alert threshold | What it indicates |
| --- | --- | --- |
| GPU utilization (mean) | < 40% for 30 min | Overprovisioned or idle |
| GPU utilization (peak) | > 95% for 1 hour | At capacity, scale out |
| VRAM utilization | > 85% sustained | Risk of OOM on traffic spike |
| GPU temperature | > 83°C | Thermal throttling risk |
| Power draw | > 90% TDP | Hardware stress |
| SM occupancy | < 50% for training | Kernel optimization opportunity |
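
The thresholds in the table translate directly into alert rules. The sketch below is illustrative only: the metric keys and alert names are hypothetical, and it assumes each value has already been aggregated over the window the table specifies (for example, mean utilization over 30 minutes), which a real monitoring system would handle.

```python
from typing import Callable, Dict, List, Tuple

# (metric key, predicate, alert name). Thresholds are taken from the table.
RULES: List[Tuple[str, Callable[[float], bool], str]] = [
    ("gpu_util_mean_pct", lambda v: v < 40, "overprovisioned_or_idle"),
    ("gpu_util_peak_pct", lambda v: v > 95, "at_capacity"),
    ("vram_util_pct", lambda v: v > 85, "oom_risk"),
    ("gpu_temp_c", lambda v: v > 83, "thermal_throttling_risk"),
    ("power_pct_of_tdp", lambda v: v > 90, "hardware_stress"),
    ("sm_occupancy_pct", lambda v: v < 50, "kernel_optimization_opportunity"),
]


def gpu_alerts(metrics: Dict[str, float]) -> List[str]:
    """Return the alerts triggered by the metrics that are present."""
    return [name for key, fires, name in RULES if key in metrics and fires(metrics[key])]


print(gpu_alerts({"gpu_util_mean_pct": 20, "vram_util_pct": 90}))
```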

Production Engineering Notes

Spot instance interruption rate varies by region and instance type. A100 spot instances in us-east-1 may be interrupted frequently when demand is high. Check historical spot interruption rates in the AWS Spot Instance Advisor before depending on specific instance types for critical training jobs.

Set GPU utilization targets, not GPU count targets. "We need 8 GPUs" is a capacity statement. "We need to handle 500 QPS at 50ms P99" is a requirements statement. Derive the GPU count from the requirement: if one GPU handles 200 QPS at the target latency, 500 QPS needs 3 GPUs (500 / 200 = 2.5, rounded up), and 20% headroom on top of that (3.6, rounded up) makes 4 GPUs.
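
The same arithmetic as a small helper, in case it is useful in a capacity-planning script. The function name is illustrative, and it follows the order used above: round up the GPUs needed for the load first, then apply headroom and round up again.

```python
import math


def gpus_needed(target_qps: float, qps_per_gpu: float, headroom: float = 0.2) -> int:
    """GPUs required to serve target_qps, with fractional headroom on top."""
    base = math.ceil(target_qps / qps_per_gpu)  # GPUs needed to carry the load
    return math.ceil(base * (1 + headroom))     # add spare capacity


print(gpus_needed(500, 200))
```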

Spot savings vary month to month; recalculate quarterly. Spot prices fluctuate with demand. Set calendar reminders to compare your current on-demand spend against spot pricing and reserved instance pricing every quarter.


Common Mistakes

:::danger Assuming spot instances require rearchitecting the training pipeline

The only requirement for spot training is checkpoint/restore capability - which well-designed training pipelines should have anyway. Teams often say "we can't use spot because our jobs are too long." Long jobs are exactly where spot saves the most money. Add checkpoint logic first, then migrate to spot. The two changes are independent.

:::

:::warning Enabling MIG without updating your serving infrastructure

When you enable MIG on a serving GPU and create 3 × 1g.10gb partitions, your existing serving processes may fail to start because they try to use cuda:0 which now points to only 10 GB. Each process must be configured with the correct CUDA_VISIBLE_DEVICES pointing to its assigned MIG instance. Update your Kubernetes deployment manifests, systemd service files, or Docker run commands to specify the correct MIG device before enabling MIG in production.

:::

:::tip Target 60-80% GPU utilization for inference serving

  • Below 40%: overprovisioned - consolidate or use smaller instances.
  • 60-80%: healthy target - enough headroom for traffic spikes without wasting hardware.
  • Above 85%: at risk - any traffic spike may cause latency degradation or OOM.

Scale out before reaching 85% sustained utilization.

:::


Interview Questions

Q1: Your ML team's GPU bill is $200K/month. Walk through a systematic cost reduction approach.

  1. Categorize the spend by workload type - training experiments, production training runs, inference serving, evaluation jobs.
  2. For training experiments, audit GPU count vs actual parallelism needs. Move all to spot instances with checkpoint recovery - expect 60–70% cost reduction on this category.
  3. Profile serving cluster GPU utilization. If under 50%, enable MIG to share A100s across multiple serving replicas, or consolidate to fewer instances.
  4. Evaluate reserved instances for the always-on serving cluster - 30–40% discount for 1-year commitment.
  5. Right-size evaluation and batch jobs - use the minimum instance type that meets time requirements.
  6. Check instance types - are you using A100s for small models that run fine on L4?

Typical result: 50–70% overall cost reduction with 4–6 weeks of engineering time.

Q2: What is the difference between MPS and MIG for GPU sharing, and when do you use each?

MPS (Multi-Process Service) is software-mediated sharing: an MPS server funnels work from multiple processes onto one GPU so their kernels can run concurrently, instead of the default time-slicing between processes. Isolation is weak - processes share the same compute units and VRAM pool, and a fatal fault in one process can crash all processes sharing the GPU. MPS is appropriate for development environments where multiple personal experiments share a cluster node, or for controlled production environments where all processes are trusted. MIG is hardware-level partitioning: each slice gets dedicated compute units, dedicated VRAM, and dedicated cache. Isolation is at the hardware level - a process on MIG slice 0 cannot affect MIG slice 1. Use MIG for production serving where you want to run multiple small models on one expensive A100, providing true isolation between tenants or services.

Q3: How do you calculate whether reserved instances are worth it for a serving cluster?

Break-even analysis: compare (on-demand hourly × 8,760 hours) against (reserved upfront + reserved hourly × 8,760 hours). For AWS p3.2xlarge (V100): on-demand at $3.06/hr × 8,760 = $26,806/yr per instance. 1-year no-upfront reserved at $1.82/hr × 8,760 = $15,943/yr. Savings: $10,863/yr per instance (a 40% reduction). Break-even is immediate with no-upfront reserved. For a 10-instance serving cluster, that is about $108,000/yr in savings from a one-time decision. Reserved instances are worth it when: (1) utilization will remain above 60% for the reservation period, (2) your instance type needs are predictable 1–3 years out, and (3) you have budget to commit (partial upfront provides additional savings).

Q4: You want to batch inference requests to improve GPU utilization. What are the latency tradeoffs and how do you set the batching parameters?

Dynamic batching waits up to max_wait_ms to accumulate requests before processing them. This improves GPU utilization (higher batch sizes → better compute efficiency) at the cost of latency: a request can wait up to max_wait_ms before processing begins. With max_wait_ms=50, a request that arrives during light traffic waits the full 50ms for a batch that never fills, so tail latency approaches 60ms even if the model itself takes 10ms. Set max_batch_size based on VRAM and compute capacity: find the batch size where GPU utilization exceeds 70% and latency still meets your SLA. Set max_wait_ms based on your latency budget: if your SLA is 200ms p99 and the model takes 30ms per batch, you have up to 170ms for waiting and overhead. Start with 20–30ms and tune based on observed latency percentiles.
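
The latency budget in this answer is simple enough to encode as a check. The sketch below uses illustrative function names. It treats the worst case as a full wait plus one batch of compute, and ignores any extra queueing when the batcher is already busy, which real traffic can add.

```python
def max_wait_budget_ms(sla_p99_ms: float, model_batch_ms: float, overhead_ms: float = 0.0) -> float:
    """Upper bound on max_wait_ms that still leaves room for the model and overhead."""
    budget = sla_p99_ms - model_batch_ms - overhead_ms
    if budget < 0:
        raise ValueError("the model alone already exceeds the latency SLA")
    return budget


def worst_case_latency_ms(max_wait_ms: float, model_batch_ms: float, overhead_ms: float = 0.0) -> float:
    """Latency of a request that waits the full window before its batch runs."""
    return max_wait_ms + model_batch_ms + overhead_ms


print(max_wait_budget_ms(sla_p99_ms=200, model_batch_ms=30))      # budget for waiting and overhead
print(worst_case_latency_ms(max_wait_ms=50, model_batch_ms=10))   # light-traffic tail latency
```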

Q5: A model serving pipeline has 35% GPU utilization. List 5 reasons this might happen and how to diagnose each.

  1. Batch size too small: Single-request serving doesn't use GPU efficiently. Add dynamic batching and measure batch size distribution.
  2. CPU preprocessing bottleneck: GPU sits idle waiting for data from CPU tokenization/preprocessing. Profile with PyTorch profiler - look for long gaps between CUDA kernels.
  3. Memory bandwidth bound: Arithmetic intensity too low (small model, short sequences). Use roofline analysis to check if compute or bandwidth is the bottleneck.
  4. PCIe transfer bottleneck: Input data must be transferred from CPU to GPU for every request. Look for H2D (host-to-device) transfers in the profiler trace.
  5. Concurrent request limit too low: Serving framework is throttling concurrency below GPU capacity. Increase the number of concurrent requests the serving system processes simultaneously.
© 2026 EngineersOfAI. All rights reserved.