Inference Cost Optimization
The $47,000 Mistake
In March 2023, a team at a well-funded startup launched a GPT-4-based legal document analysis feature. The product lead had done the math: at GPT-4's then-current pricing (about $0.03 per 1,000 input tokens and $0.06 per 1,000 output tokens), the projected daily spend looked totally acceptable. The launch went viral.
The problem was that the math was wrong in three ways. First, they counted only input tokens, not output tokens - and their summarization prompts generated 800 tokens of output per document, not the 200 they assumed. Second, they had not accounted for their system prompt - a 1,200-token instruction block prepended to every request. Third, they had not anticipated that "500 users per day" would send documents through multiple times to refine summaries. The real token count was 8x their estimate.
The actual daily cost was $14,400 - against $0 in direct revenue from the feature. The board asked questions. The engineers got paged at 2 AM to add rate limits.
This story is not unusual. Inference cost surprises are the second-most-common cause of production AI incidents (after accuracy issues). The teams that avoid them are the ones that treat inference cost with the same rigor they give to infrastructure cost - which means understanding exactly what drives it, which levers move it, and how to measure whether the levers are working.
This lesson is a systematic treatment of inference cost optimization for production LLM systems. It covers the cost model, the optimization techniques, the tradeoffs between them, and the engineering practices that separate teams paying $50/million tokens from teams paying $0.50/million tokens. That 100x difference is real, documented, and achievable with the techniques in this lesson.
Why This Exists - The GPU Utilization Crisis
When LLM serving moved from research to production in 2022-2023, the default deployment pattern was one request at a time. You sent a prompt, the model generated a response, the response was returned, then the next request was processed. This is called static batching.
Static batching had catastrophic GPU utilization. A typical A100 serving GPT-J-6B in this mode ran at 15-25% MFU (Model FLOP Utilization). The GPU sat idle 75-85% of the time. The causes were fundamental to how transformer inference works:
The prefill/decode asymmetry. Prefill (processing the prompt) is a parallelizable matrix multiply that uses the GPU efficiently. Decode (generating each output token) is a sequential, memory-bandwidth-bound operation that generates one token at a time. A GPU with 312 TFLOPS of FP16 compute is bottlenecked by its 2 TB/s memory bandwidth during decode - it generates one token, then must reload all model weights from HBM before generating the next. The GPU's compute units are idle while waiting for memory.
Variable sequence lengths waste padding. If you batch requests together, you must pad shorter sequences to match the longest. A batch containing requests of lengths [50, 150, 200, 500] spends 90% of the first request's compute on padding tokens, and more than half of the batch's total compute on padding.
The head-of-line blocking problem. A batch cannot complete until every request in it finishes. If one user asks for a 10-token response and another asks for a 2,000-token response, the short request's GPU slot sits idle during the other 1,990 token generation steps.
The result was absurd economics. A server that could theoretically process 100 requests/second was processing 15-20 in practice. The compute was provisioned for a theoretical peak that was never reached, and users paid for it anyway.
The industry's response was a set of techniques - continuous batching, PagedAttention, speculative decoding, quantization - that together solved these problems. Teams that implemented all of them reduced cost per token by 10-20x compared to naive static batching, without changing the model or the hardware.
Historical Context - From Static Batching to Production-Grade Serving
The first publicly available efficient LLM serving system was Triton Inference Server's batching support in 2020, but it addressed CV models, not LLMs. The LLM-specific breakthrough came in three waves.
Wave 1: Continuous Batching (2022). Orca (Yu et al., NeurIPS 2022) introduced the insight that you do not need to hold a batch together for its entire lifetime. Instead, you can add new requests to a batch at the beginning of any decode step, and remove completed requests at the end of any decode step. This eliminated head-of-line blocking and allowed GPU utilization to approach 80-90% under load. The paper showed 23x improvement over static batching in some configurations.
Wave 2: PagedAttention and vLLM (2023). The vLLM paper (Kwon et al., SOSP 2023) identified a different bottleneck: memory waste from KV cache pre-allocation. The standard approach pre-allocated contiguous memory for the maximum possible sequence length of each request, even if the actual sequence would be much shorter. On average, 60-80% of reserved KV cache memory was never used. PagedAttention borrowed the virtual memory concept from OS design: KV cache blocks are allocated in fixed-size "pages" on demand, enabling near-100% KV cache memory utilization. The result was 24x higher throughput than HuggingFace Transformers baseline.
Wave 3: Quantization at scale (2023-2024). GPTQ, AWQ, and GGUF quantization formats made it practical to serve LLMs in INT4 or INT8 without meaningful quality degradation. A pair of H100s that could serve one FP16 70B model could instead serve two INT4 70B models, one per GPU, doubling throughput per GPU dollar.
The "aha moment" that unified these insights was a cost model that let teams quantify exactly where the waste was happening and which technique would address it. That cost model is where this lesson starts.
Core Concepts
The Inference Cost Model
Every inference request on a GPU cluster costs money through a single mechanism: GPU-hours consumed. The cost per million tokens is:
$$\text{Cost per M tokens} = \frac{\text{GPU hourly rate (\$)}}{\text{Tokens per second} \times 3600} \times 10^6$$
For an A100 80GB at $3.50/hour generating 1,000 tokens/second: $$\frac{3.50}{1000 \times 3600} \times 10^6 \approx \$0.97 \text{ per million tokens}$$
This means tokens per second is the master lever. Everything else - quantization, batching, speculative decoding - is a way to increase tokens per second from the same GPU hardware.
The full cost model has three components: GPU compute, memory, and networking.
For most LLM workloads, GPU cost dominates (70-85% of total). Memory (HBM on the GPU) is included in the GPU hourly rate. Networking becomes significant only for very large models distributed across multiple GPUs.
GPU Utilization Metric
The metric that captures whether you are getting value from your GPU is Model FLOP Utilization (MFU): $$\text{MFU} = \frac{\text{achieved FLOPS}}{\text{peak hardware FLOPS}}$$
An A100 at peak delivers 312 TFLOPS (FP16). A poorly optimized LLM serving stack might achieve 50 TFLOPS - 16% MFU. A well-optimized stack with continuous batching and quantization achieves 200-250 TFLOPS - 65-80% MFU. The cost per token scales inversely with MFU, so 4x improvement in MFU means 4x reduction in cost.
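A rough way to sanity-check MFU from numbers you already have is to convert observed decode throughput into achieved FLOPS, assuming roughly 2 FLOPs per parameter per generated token (a common back-of-envelope figure that ignores attention-over-context cost). A minimal sketch:
# Rough MFU estimate from observed decode throughput (back-of-envelope only).
# Assumes ~2 FLOPs per parameter per generated token; ignores attention FLOPs.
def estimate_mfu(params_billion: float, tokens_per_second: float,
                 peak_tflops: float = 312.0) -> float:
    achieved_tflops = 2 * params_billion * 1e9 * tokens_per_second / 1e12
    return achieved_tflops / peak_tflops

# Example: a 70B model generating 500 tokens/s aggregate on one A100 (FP16 peak 312 TFLOPS)
print(f"MFU ~ {estimate_mfu(70, 500):.0%}")  # ~22%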
Time-to-First-Token (TTFT) vs Tokens Per Second
These are separate metrics that matter for different reasons:
- TTFT: How long from request submission until the first token appears. Dominated by prefill latency. Users notice TTFT as "how long before it starts responding." P95 target for production: under 1 second.
- TPS (tokens per second per user): Generation speed. Users notice this as "how fast it types." Target: 30-60 TPS feels instantaneous to most users; below 15 TPS feels slow.
- Throughput (system-level TPS): Total tokens per second across all users. This is what determines cost per token.
Optimizations that improve throughput sometimes hurt TTFT. Continuous batching, for instance, can increase TTFT slightly because new requests wait for the current iteration to complete before being added to the batch. The tradeoff is acceptable because throughput gains are 10-20x while TTFT increase is typically under 100ms.
Optimization Lever 1 - Quantization (2-4x Cost Reduction)
The most accessible cost optimization is quantization. By reducing weight precision from FP16 to INT8 or INT4:
| Precision | LLaMA-3-70B Size | A100s Needed | Cost/M Tokens (Relative) |
|---|---|---|---|
| FP16 | 140 GB | 2x A100 80GB | 1.0x (baseline) |
| INT8 | 70 GB | 1x A100 80GB | 0.5x |
| INT4 | 35 GB | 1x A100 40GB | 0.25x |
INT8 quantization (using bitsandbytes, GPTQ, or AWQ) causes perplexity degradation of 0.1-0.5 PPL - negligible for most tasks. INT4 causes 0.3-1.2 PPL degradation - meaningful for some tasks (coding, math) but acceptable for others (summarization, translation).
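The weight-memory column in the table above is simple arithmetic - parameters times bytes per weight. A quick sketch (weights only; KV cache and activations come on top):
# Weight memory at different precisions (weights only, excluding KV cache).
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

for precision, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"LLaMA-3-70B {precision}: {weight_memory_gb(70, bits):.0f} GB")
# FP16: 140 GB, INT8: 70 GB, INT4: 35 GB - matching the table above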
The key implementation decision is which quantization framework to use:
- bitsandbytes (INT8): Zero-shot, no calibration data, loads model in 8-bit using LLM.int8() algorithm. Easy to apply but slower than GPTQ for inference.
- GPTQ (INT4): Calibration required (~100 examples), but produces faster inference kernels (specifically optimized for Marlin or ExLlamaV2 backends). The right choice for production INT4 serving.
- AWQ (INT4): Activation-aware quantization, slightly better accuracy than GPTQ at the same bit width. Supported by vLLM and TGI natively.
# Serving a GPTQ-quantized model with vLLM
from vllm import LLM, SamplingParams
# GPTQ model uses 1 A100 instead of 2 - halving infrastructure cost
llm = LLM(
model="TheBloke/Llama-2-70B-Chat-GPTQ",
quantization="gptq",
dtype="float16",
max_model_len=4096,
)
sampling_params = SamplingParams(
temperature=0.7,
max_tokens=512,
)
prompts = [
"Summarize the key risks in this contract: [contract text]",
"Translate to Spanish: Hello, how are you?",
]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
print(f"Prompt: {output.prompt[:50]}...")
print(f"Generated: {output.outputs[0].text}")
print(f"Tokens: {len(output.outputs[0].token_ids)}")
Optimization Lever 2 - Continuous Batching (3-5x Throughput Improvement)
Continuous batching (also called iteration-level scheduling) is the single highest-impact optimization for server-side LLM throughput. It works by treating each decode step as an independent scheduling opportunity rather than committing to batch membership for the entire request lifetime.
The mechanism:
- When a new request arrives, it is added to the "waiting" queue.
- At each decode step, the scheduler checks the waiting queue.
- If GPU memory allows, a waiting request's prefill is added to the current batch.
- Completed requests are removed from the batch immediately after their last token.
- The freed memory is available for new requests in the next step.
This means the GPU is almost never idle waiting for one slow request to finish before accepting new work. The batch is always full (up to memory limits), and work flows continuously through the system.
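To make the scheduling difference concrete, here is a toy simulation of the two policies. It counts decode iterations only, assumes every iteration costs the same regardless of how full the batch is, and ignores prefill and memory limits - a sketch of the scheduling effect, not a performance model:
# Toy simulation: static vs continuous batching (scheduling effect only).
import random

def total_decode_steps(request_lengths, batch_size=8, continuous=True):
    pending = list(request_lengths)  # output lengths not yet started
    active = []                      # tokens remaining for in-flight requests
    steps = 0
    while pending or active:
        if continuous:
            # iteration-level scheduling: refill free slots every step
            while pending and len(active) < batch_size:
                active.append(pending.pop())
        elif not active:
            # static: only form a new batch once the previous one fully drains
            active = [pending.pop() for _ in range(min(batch_size, len(pending)))]
        steps += 1
        active = [t - 1 for t in active if t > 1]
    return steps

random.seed(0)
lengths = [random.choice([10, 50, 200, 1000]) for _ in range(64)]
static = total_decode_steps(lengths, continuous=False)
cont = total_decode_steps(lengths, continuous=True)
print(f"static: {static} steps, continuous: {cont} steps ({static / cont:.1f}x fewer)")
Real deployments see larger gains than this toy model suggests, because completed requests also free KV cache memory, admitting more concurrent sequences per step.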
vLLM implements continuous batching with PagedAttention. TGI (Text Generation Inference) implements it with a different KV cache management approach. Both achieve 10-25x throughput improvement over static batching for realistic workload distributions.
# vLLM server configuration for continuous batching
# Launch as: python -m vllm.entrypoints.openai.api_server --config serve_config.yaml
# serve_config.yaml equivalent settings:
from vllm import AsyncLLMEngine, AsyncEngineArgs
engine_args = AsyncEngineArgs(
model="meta-llama/Meta-Llama-3-8B-Instruct",
dtype="bfloat16",
max_model_len=8192, # Maximum sequence length
max_num_seqs=256, # Maximum concurrent sequences in flight
max_num_batched_tokens=8192, # Max tokens per iteration across all sequences
# This is the continuous batching budget - higher = better throughput, higher latency
gpu_memory_utilization=0.92, # Leave 8% for non-KV-cache allocations
enable_chunked_prefill=True, # Chunk long prefills to avoid blocking decode
max_chunked_prefill_tokens=512, # Prefill chunk size
)
# The key insight: max_num_batched_tokens controls the throughput/latency tradeoff
# Higher value: more tokens processed per step, better throughput, higher TTFT
# Lower value: faster TTFT, lower throughput
Optimization Lever 3 - Speculative Decoding (1.5-3x Speedup)
Speculative decoding exploits the observation that LLM decoding is memory-bandwidth-bound, not compute-bound. The GPU has spare compute capacity during each decode step. Speculative decoding uses that spare capacity to run a small "draft" model that generates multiple candidate tokens, which the large "target" model then verifies in a single forward pass.
The math: if the draft model proposes $k$ tokens and the per-token acceptance rate is $\alpha$, the expected number of tokens produced per target-model forward pass is approximately $\frac{1 - \alpha^{k+1}}{1 - \alpha}$, at the cost of running the draft model for $k$ steps.
For example, with $k = 5$ draft tokens and an acceptance rate of $\alpha = 0.7$: $$\frac{1 - 0.7^{6}}{1 - 0.7} \approx 2.9 \text{ tokens per target forward pass}$$
In practice, 2-3x is typical because the draft model itself has latency and acceptance rate varies with temperature and task.
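A small helper makes the acceptance-rate sensitivity explicit. It uses the geometric-series expression above and assumes a constant per-token acceptance rate, which real workloads only approximate:
# Expected tokens produced per target-model forward pass, assuming a constant
# per-token acceptance rate alpha and k draft tokens per step.
def expected_tokens_per_pass(k: int, alpha: float) -> float:
    if alpha >= 1.0:
        return k + 1.0
    return (1 - alpha ** (k + 1)) / (1 - alpha)

for alpha in (0.5, 0.7, 0.9):
    print(f"k=5, alpha={alpha}: ~{expected_tokens_per_pass(5, alpha):.1f} tokens/pass")
# 0.5 -> ~2.0, 0.7 -> ~2.9, 0.9 -> ~4.7 (upper bounds, before draft-model overhead)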
# Speculative decoding with vLLM
from vllm import LLM, SamplingParams
# Draft model should be 5-10x smaller than target
# LLaMA-3-8B as draft, LLaMA-3-70B as target
llm = LLM(
model="meta-llama/Meta-Llama-3-70B-Instruct",
speculative_model="meta-llama/Meta-Llama-3-8B-Instruct",
num_speculative_tokens=5, # Draft model generates 5 candidate tokens per step
speculative_draft_tensor_parallel_size=1, # Draft model uses 1 GPU
dtype="bfloat16",
)
# The API is identical - speculative decoding is transparent to callers
params = SamplingParams(temperature=0.0, max_tokens=200) # Greedy = highest acceptance rate
response = llm.generate(["Explain the transformer architecture:"], params)
An important constraint: speculative decoding provides the most benefit when temperature is low (greedy or near-greedy sampling) because high temperature makes token distributions less predictable and reduces the draft model's acceptance rate. For applications requiring creative generation (temperature 0.8-1.0), speculative decoding may provide only 1.2-1.5x improvement.
Optimization Lever 4 - KV Cache Management
The KV cache stores the key and value tensors from the attention mechanism for all previously generated tokens. Without caching, generating the $t$-th token would require recomputing the key and value projections for all $t$ previous tokens, so an $n$-token generation repeats $O(n^2)$ token-level computations. With caching, each token's K and V are computed once and stored; every decode step computes projections only for the new token and attends against the cache - $O(t)$ attention work per step, $O(n)$ projection work in total.
KV cache memory usage for a single sequence is:
$$\text{KV cache bytes} = 2 \times L \times H_{kv} \times d_{head} \times S \times \frac{b}{8}$$
Where:
- $2$ = one tensor each for keys and values
- $L$ = number of layers
- $H_{kv}$ = number of key/value heads (equal to the attention head count for models without grouped-query attention)
- $d_{head}$ = head dimension
- $S$ = sequence length
- $b$ = bits per element (16 for FP16, 8 for INT8)
For LLaMA-3-8B (32 layers, 8 KV heads, head dimension 128) with a 4,096-token context in FP16, that is $2 \times 32 \times 8 \times 128 \times 4096 \times 2$ bytes $\approx 0.5$ GB per sequence.
This means an 80 GB A100 with a 16 GB model loaded can hold at most roughly 120 concurrent sequences at maximum context. PagedAttention improves this by allocating KV cache in pages and sharing pages across sequences with common prefixes.
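The same arithmetic as a helper, with the head count parameterized as KV heads so it handles grouped-query attention models correctly (the values below are the published LLaMA-3-8B configuration):
# KV cache size per sequence. For GQA models, use the KV head count,
# which is smaller than the query head count (LLaMA-3-8B: 8 vs 32).
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                seq_len: int, bits: int = 16) -> float:
    return 2 * layers * kv_heads * head_dim * seq_len * (bits / 8) / 1e9

per_seq = kv_cache_gb(layers=32, kv_heads=8, head_dim=128, seq_len=4096)
free_hbm_gb = 80 - 16  # A100 80GB minus FP16 8B weights
print(f"{per_seq:.2f} GB per sequence -> ~{int(free_hbm_gb / per_seq)} sequences at max context")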
Prefix caching is the highest-impact KV cache optimization for applications with shared prompt prefixes (system prompts, RAG context, few-shot examples). When multiple requests share the same first tokens, the KV cache for those tokens is computed once and reused:
# vLLM automatic prefix caching
llm = LLM(
model="meta-llama/Meta-Llama-3-8B-Instruct",
enable_prefix_caching=True, # Cache KV tensors for common prefixes
max_model_len=8192,
)
# These two requests share the same system prompt - prefill computed once
system_prompt = "You are a helpful legal assistant. " * 100 # 200 tokens
requests = [
f"{system_prompt} Question 1: What is consideration in contract law?",
f"{system_prompt} Question 2: What is promissory estoppel?",
]
# Second request's system prompt prefill is served from cache
# Effective speedup: TTFT reduced roughly in proportion to the cached prefix share (~200 of the prompt's tokens)
Semantic caching goes further - it caches full request/response pairs and uses embedding similarity to serve near-duplicate queries from cache rather than re-running inference:
import numpy as np
from sentence_transformers import SentenceTransformer
import redis
import json
import hashlib
class SemanticCache:
def __init__(self, similarity_threshold: float = 0.97):
self.encoder = SentenceTransformer("all-MiniLM-L6-v2")
self.redis = redis.Redis(host="localhost", port=6379)
self.threshold = similarity_threshold
self.cache_hits = 0
self.cache_misses = 0
def _get_embedding(self, text: str) -> np.ndarray:
return self.encoder.encode(text, normalize_embeddings=True)
def lookup(self, query: str) -> str | None:
"""Return cached response if a sufficiently similar query was seen before."""
query_embedding = self._get_embedding(query)
# Check stored embeddings (in production, use a vector DB like Qdrant or Pinecone)
cache_keys = self.redis.keys("semantic:*")
for key in cache_keys:
cached_data = json.loads(self.redis.get(key))
cached_embedding = np.array(cached_data["embedding"])
similarity = float(np.dot(query_embedding, cached_embedding))
if similarity >= self.threshold:
self.cache_hits += 1
return cached_data["response"]
self.cache_misses += 1
return None
def store(self, query: str, response: str, ttl_seconds: int = 3600):
embedding = self._get_embedding(query).tolist()
key = f"semantic:{hashlib.md5(query.encode()).hexdigest()}"
data = {"query": query, "response": response, "embedding": embedding}
self.redis.setex(key, ttl_seconds, json.dumps(data))
@property
def hit_rate(self) -> float:
total = self.cache_hits + self.cache_misses
return self.cache_hits / total if total > 0 else 0.0
For production Q&A and customer support workloads, semantic cache hit rates of 20-40% are common (many user questions are paraphrases of the same underlying query). At 0.30/million - meaningful at scale.
Optimization Lever 5 - Cascade Inference
Cascade inference routes queries to models of different sizes based on expected complexity. Simple queries ("What is the capital of France?") can be answered by a 7B model. Complex queries requiring multi-step reasoning or domain expertise need a 70B or larger model.
The cost math is compelling: a 7B model costs approximately $0.10/million tokens to serve, while a 70B model costs approximately $0.80/million tokens. If you can route 60% of queries to the 7B model, the blended cost is:
$$0.6 \times \$0.10 + 0.4 \times \$0.80 = \$0.38 \text{ per million tokens}$$
That is a 2.1x cost reduction with no change to quality for either class of query.
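The blended-cost arithmetic generalizes to any routing fraction; a quick sketch using the illustrative per-model serving costs above:
# Blended cost per million tokens for a two-tier cascade.
def blended_cost_per_million(small_fraction: float,
                             small_cost: float = 0.10,
                             large_cost: float = 0.80) -> float:
    return small_fraction * small_cost + (1 - small_fraction) * large_cost

for frac in (0.4, 0.6, 0.8):
    cost = blended_cost_per_million(frac)
    print(f"{frac:.0%} routed small -> ${cost:.2f}/M ({0.80 / cost:.1f}x cheaper than large-only)")
# 60% routed small reproduces the ~2.1x figure above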
The routing mechanism is itself a small, fast model:
from transformers import pipeline
import torch
class CascadeRouter:
"""
Routes LLM queries to small or large model based on predicted complexity.
The router itself is a lightweight classifier trained on labeled examples.
"""
def __init__(self, router_model_path: str, complexity_threshold: float = 0.6):
self.classifier = pipeline(
"text-classification",
model=router_model_path, # ~50M param DeBERTa or DistilBERT fine-tuned
device=0 if torch.cuda.is_available() else -1,
)
self.threshold = complexity_threshold
self.route_counts = {"small": 0, "large": 0}
def route(self, query: str) -> str:
"""Returns 'small' or 'large' based on predicted complexity."""
result = self.classifier(query)[0]
# Label 'complex' predicted with high confidence -> large model
if result["label"] == "complex" and result["score"] >= self.threshold:
self.route_counts["large"] += 1
return "large"
self.route_counts["small"] += 1
return "small"
def serve(self, query: str, small_model, large_model) -> str:
target = self.route(query)
if target == "small":
response = small_model.generate(query)
# Optional: verify quality with a lightweight judge
# If quality check fails, fall through to large model
else:
response = large_model.generate(query)
return response
@property
def routing_stats(self) -> dict:
total = sum(self.route_counts.values())
return {
"small_pct": self.route_counts["small"] / total * 100 if total else 0,
"large_pct": self.route_counts["large"] / total * 100 if total else 0,
"total_queries": total,
}
A simpler but effective heuristic router uses query length and keyword presence as proxies for complexity:
def heuristic_route(query: str) -> str:
"""Simple routing heuristic - no ML required."""
complexity_signals = [
len(query) > 300, # Long queries tend to be complex
"analyze" in query.lower(),
"compare" in query.lower(),
"explain why" in query.lower(),
"step by step" in query.lower(),
"write code" in query.lower(),
query.count("?") > 2, # Multiple questions
]
complexity_score = sum(complexity_signals)
if complexity_score >= 2:
return "large"
return "small"
Building the Cost Model for a Production LLM API
Before optimizing, you need to measure. Here is a complete cost modeling framework:
import time
import dataclasses
from typing import Optional
import httpx
import asyncio
@dataclasses.dataclass
class InferenceMetrics:
"""Captured per-request metrics for cost modeling."""
request_id: str
model: str
prompt_tokens: int
completion_tokens: int
ttft_ms: float # Time to first token
total_latency_ms: float # End-to-end latency
gpu_id: Optional[str] # Which GPU served this request
batch_size_at_serve: int # How many concurrent requests were in batch
timestamp: float
@dataclasses.dataclass
class CostModel:
"""Cost model for an LLM serving infrastructure."""
gpu_hourly_cost_usd: float # e.g., $3.50 for A100
model_name: str
gpu_count: int
def cost_per_token(
self, observed_tokens_per_second: float
) -> float:
"""Calculate cost per output token given observed throughput."""
total_gpu_cost_per_second = (
self.gpu_hourly_cost_usd * self.gpu_count / 3600
)
return total_gpu_cost_per_second / observed_tokens_per_second
def cost_per_million_tokens(
self, observed_tokens_per_second: float
) -> float:
return self.cost_per_token(observed_tokens_per_second) * 1e6
def break_even_tps(self, target_cost_per_million: float) -> float:
"""What throughput do you need to hit a target cost per million tokens?"""
total_cost_per_second = (
self.gpu_hourly_cost_usd * self.gpu_count / 3600
)
# cost_per_M = (cost_per_second / tps) * 1e6
# tps = (cost_per_second * 1e6) / cost_per_M
return (total_cost_per_second * 1e6) / target_cost_per_million
class InferenceBenchmarker:
"""Runs a load test and computes cost model metrics."""
def __init__(self, base_url: str, model: str, cost_model: CostModel):
self.base_url = base_url
self.model = model
self.cost_model = cost_model
self.metrics: list[InferenceMetrics] = []
async def _single_request(
self,
client: httpx.AsyncClient,
prompt: str,
max_tokens: int = 200,
) -> InferenceMetrics:
request_id = f"req_{time.time():.6f}"
first_token_time = None
start = time.perf_counter()
async with client.stream(
"POST",
f"{self.base_url}/v1/completions",
json={
"model": self.model,
"prompt": prompt,
"max_tokens": max_tokens,
"stream": True,
},
timeout=60.0,
) as response:
prompt_tokens = 0
completion_tokens = 0
async for line in response.aiter_lines():
if line.startswith("data: ") and line != "data: [DONE]":
if first_token_time is None:
first_token_time = time.perf_counter()
completion_tokens += 1
total_time = time.perf_counter() - start
ttft = (first_token_time - start) * 1000 if first_token_time else 0
return InferenceMetrics(
request_id=request_id,
model=self.model,
prompt_tokens=len(prompt.split()), # Approximate
completion_tokens=completion_tokens,
ttft_ms=ttft,
total_latency_ms=total_time * 1000,
gpu_id=None,
batch_size_at_serve=1, # Would need server-side tracking
timestamp=start,
)
async def run_load_test(
self,
prompts: list[str],
concurrency: int = 10,
) -> dict:
async with httpx.AsyncClient() as client:
semaphore = asyncio.Semaphore(concurrency)
async def bounded_request(prompt):
async with semaphore:
return await self._single_request(client, prompt)
start_time = time.perf_counter()
results = await asyncio.gather(*[bounded_request(p) for p in prompts])
total_time = time.perf_counter() - start_time
self.metrics = list(results)
total_tokens = sum(m.completion_tokens for m in results)
observed_tps = total_tokens / total_time
return {
"total_requests": len(results),
"total_output_tokens": total_tokens,
"total_time_s": total_time,
"observed_tps": observed_tps,
"p50_ttft_ms": sorted(m.ttft_ms for m in results)[len(results)//2],
"p95_ttft_ms": sorted(m.ttft_ms for m in results)[int(0.95*len(results))],
"p50_latency_ms": sorted(m.total_latency_ms for m in results)[len(results)//2],
"cost_per_million_tokens": self.cost_model.cost_per_million_tokens(observed_tps),
"gpu_cost_usd": self.cost_model.gpu_hourly_cost_usd * self.cost_model.gpu_count * total_time / 3600,
}
# Usage
async def main():
cost_model = CostModel(
gpu_hourly_cost_usd=3.50, # A100 80GB on-demand
model_name="meta-llama/Meta-Llama-3-8B-Instruct",
gpu_count=1,
)
benchmarker = InferenceBenchmarker(
base_url="http://localhost:8000",
model="meta-llama/Meta-Llama-3-8B-Instruct",
cost_model=cost_model,
)
# Generate test prompts with realistic length distribution
prompts = [
f"Summarize the following in 2-3 sentences: {'This is test content. ' * 50}"
for _ in range(100)
]
results = await benchmarker.run_load_test(prompts, concurrency=20)
print("=== Load Test Results ===")
for k, v in results.items():
if isinstance(v, float):
print(f"{k}: {v:.3f}")
else:
print(f"{k}: {v}")
target = 1.00  # Target cost per million tokens (USD)
needed_tps = cost_model.break_even_tps(target)
print(f"\nTo hit ${target}/M tokens, need {needed_tps:.0f} TPS")
print(f"Currently at {results['observed_tps']:.0f} TPS")
asyncio.run(main())
Multi-Instance GPU (MIG) for Multi-Tenant Serving
NVIDIA's Multi-Instance GPU technology (available on A100 and H100) partitions a single physical GPU into up to 7 isolated GPU instances, each with dedicated HBM, L2 cache, and SM groups. This enables running multiple small models on a single physical GPU without interference.
Cost implications:
A100 80GB MIG partitioning options (a subset of the supported profiles):
- 1x MIG 7g.80gb (full GPU, one tenant)
- 2x MIG 3g.40gb (split into 2 x 40GB instances)
- 4x MIG 1g.20gb (split into 4 x 20GB instances - the layout used in the example below)
- 7x MIG 1g.10gb (split into 7 x 10GB instances)
Economics example for customer serving:
- Single 7B model on full A100: $3.50/hr, ~1000 TPS
- 4x 3B models on 4 x MIG partitions: $3.50/hr, ~3200 TPS (4 models × 800 TPS)
- Cost per token: 3.2x better utilization, 3.2x lower cost
# Configure A100 for MIG mode
sudo nvidia-smi -i 0 -mig 1
# Create 2 MIG instances of 40GB each (for two 7B models)
sudo nvidia-smi mig -cgi 3g.40gb,3g.40gb -C
# Verify
nvidia-smi -L
# GPU 0: A100 80GB PCIe (UUID: GPU-...)
# MIG 3g.40gb Device 0: (UUID: MIG-...)
# MIG 3g.40gb Device 1: (UUID: MIG-...)
# Deploy model to specific MIG instance
CUDA_VISIBLE_DEVICES=MIG-GPU-..../0/0 python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3-8B-Instruct \
--port 8000
CUDA_VISIBLE_DEVICES=MIG-GPU-..../0/1 python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3-8B-Instruct \
--port 8001
FinOps Practices for Inference Infrastructure
Spot vs Reserved vs On-Demand
GPU instances across cloud providers offer three pricing tiers:
| Type | Price | Interruption risk | Use case |
|---|---|---|---|
| On-Demand | 1.0x baseline | None | Production, latency-sensitive |
| Reserved (1yr) | 0.55-0.65x | None | Stable baseline load |
| Spot / Preemptible | 0.25-0.40x | Medium (2-15 min notice) | Batch inference, training |
The right architecture for cost optimization mixes all three:
Serving Architecture for Cost-Optimal LLM Deployment:
Reserved capacity (70% of baseline load):
- Always-on A100 instances for P50 traffic
- 1-year commitment, 40% discount
- ~$2.10/hr vs $3.50/hr on-demand
On-demand capacity (burst to P95):
- Scale out on-demand when reserved slots are full
- Pay full price but only for the burst duration
- Autoscaling rule: add on-demand when queue > 50 requests
Spot capacity (batch workloads):
- Document processing, index building, evaluation runs
- Use spot at $1.05/hr (70% savings vs on-demand)
- Checkpoint frequently; resume after interruption
- Never route real-time user requests through spot
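A rough sketch of what the mixed-capacity math looks like for one day, using the example rates above (the traffic shape is hypothetical - substitute your own usage profile):
# Blended daily GPU spend for a reserved + on-demand + spot mix vs all on-demand.
RATES_USD_PER_GPU_HOUR = {"reserved": 2.10, "on_demand": 3.50, "spot": 1.05}

def daily_cost(gpu_hours: dict[str, float]) -> float:
    return sum(RATES_USD_PER_GPU_HOUR[tier] * hours for tier, hours in gpu_hours.items())

# Hypothetical day: 4 reserved GPUs always on, 2 on-demand GPUs for a 6-hour peak,
# 3 spot GPUs running batch jobs for 8 hours.
usage = {"reserved": 4 * 24, "on_demand": 2 * 6, "spot": 3 * 8}
mixed = daily_cost(usage)
all_on_demand = 3.50 * sum(usage.values())
print(f"mixed: ${mixed:.0f}/day vs all on-demand: ${all_on_demand:.0f}/day")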
Autoscaling Configuration
# Kubernetes HPA configuration for LLM serving
# Scale based on GPU utilization and queue depth
# metrics-server must be installed, plus DCGM exporter for GPU metrics
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: llm-inference-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: vllm-server
minReplicas: 2 # Minimum for availability (at least 2 GPUs always warm)
maxReplicas: 8 # Maximum to cap cloud spend
metrics:
- type: External
external:
metric:
name: gpu_queue_depth # Custom metric from vLLM /metrics endpoint
selector:
matchLabels:
app: vllm-server
target:
type: AverageValue
averageValue: 50 # Scale out if average queue > 50 requests
- type: External
external:
metric:
name: DCGM_FI_DEV_GPU_UTIL # GPU utilization from DCGM
target:
type: AverageValue
averageValue: 75 # Scale out if GPU util > 75%
behavior:
scaleUp:
stabilizationWindowSeconds: 60 # Wait 60s of high load before scaling up
policies:
- type: Pods
value: 2 # Add max 2 pods per scale event
periodSeconds: 120
scaleDown:
stabilizationWindowSeconds: 300 # Wait 5 min before scaling down (avoid thrash)
Right-Sizing GPU Instances
The right GPU for inference is often not the most powerful one:
| Model Size | KV Cache Needs | Recommended GPU | Why |
|---|---|---|---|
| < 7B INT4 | ~2-4 GB | RTX 4090 (24 GB) | Cheapest per TOPS, no NVLink needed |
| 7-13B INT4 | ~4-8 GB | A10G (24 GB) | Best cost/token for cloud inference |
| 70B INT4 | ~15-25 GB | 2x A100 40GB | Needs tensor parallelism, cost-effective |
| 70B FP16 | ~30-50 GB | 2x A100 80GB | Full precision, highest quality |
| Mixtral 8x7B | ~20-35 GB | 2x A100 80GB | MoE reduces active params per token |
The A10G at roughly $1/hour often beats the A100 at $3.50/hour on cost per token for 7-13B models: the model fits in a single GPU's memory, avoiding the latency and cost overhead of tensor parallelism across two GPUs, and you are not paying for compute headroom you cannot use.
Production Engineering Notes
Monitoring the Right Metrics
The metrics that matter for inference cost optimization are not the ones your infra team typically monitors. CPU utilization and network throughput are irrelevant. The metrics that drive costs are:
- GPU MFU (Model FLOP Utilization): Target 65-80% for well-optimized continuous batching. Below 50% indicates batching problems. Above 85% you are likely memory-bound and should scale out.
- KV cache utilization: The percentage of KV cache pages in use. Target 80-90%. Below 60% suggests you over-provisioned memory or batch sizes are too small. Above 95% means you are dropping requests due to OOM.
- Queue depth: Number of requests waiting for a free GPU slot. Should be near 0 at P95. Consistently positive queue depth indicates you need more GPU capacity.
- Tokens per dollar: The ultimate business metric. Track it over time and verify that every optimization you deploy actually moves it.
# Pull metrics from vLLM's Prometheus endpoint
curl http://localhost:8000/metrics | grep -E "vllm_"
# Key vLLM metrics:
# vllm:num_requests_waiting - current queue depth
# vllm:gpu_cache_usage_perc - KV cache utilization (0-1)
# vllm:num_preempted_seqs_total - requests dropped due to memory pressure
# vllm:generation_tokens_total - cumulative output tokens (rate = TPS)
# vllm:prompt_tokens_total - cumulative input tokens
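A minimal sketch of turning those counters into the tokens-per-dollar metric: sample the vLLM generation counter twice and divide the rate by the GPU's per-second cost (the endpoint URL, sampling interval, and hourly rate below are assumptions to adapt):
# Tokens-per-dollar from the vLLM Prometheus counter sampled over an interval.
import re, time, urllib.request

GPU_HOURLY_USD = 3.50
METRICS_URL = "http://localhost:8000/metrics"
COUNTER = "vllm:generation_tokens_total"

def read_counter() -> float:
    text = urllib.request.urlopen(METRICS_URL).read().decode()
    match = re.search(rf"^{re.escape(COUNTER)}\S*\s+([0-9.eE+]+)$", text, re.MULTILINE)
    return float(match.group(1)) if match else 0.0

c0, t0 = read_counter(), time.time()
time.sleep(60)
c1, t1 = read_counter(), time.time()
tps = (c1 - c0) / (t1 - t0)
tokens_per_dollar = tps / (GPU_HOURLY_USD / 3600)
print(f"throughput: {tps:.0f} TPS, tokens per dollar: {tokens_per_dollar:,.0f}")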
Prompt Engineering as a Cost Optimization
Prompt length directly drives prefill cost and KV cache consumption. Every 1,000 tokens of system prompt multiplied across 1 million requests is 1 billion extra tokens of compute. For a model that costs $1 per million input tokens, that is $1,000 per million requests just in wasted system prompt processing.
Audit your prompts for compression opportunities:
- Replace verbose role descriptions ("You are a highly skilled, deeply experienced professional assistant who...") with concise ones ("You are a helpful assistant.")
- Move static context into the model via fine-tuning rather than prompting
- Use prefix caching for shared system prompts so they are computed once, not once per request
A 50% reduction in average prompt length is typically achievable and produces a 30-40% cost reduction with no quality loss.
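The arithmetic behind that claim is a weighted average over the request's token mix - a sketch assuming input and output tokens cost roughly the same to serve (adjust if your provider prices them differently):
# Fraction of per-request tokens saved by cutting the system prompt.
def token_reduction(system_tokens: int, user_tokens: int, output_tokens: int,
                    prompt_cut: float = 0.5) -> float:
    before = system_tokens + user_tokens + output_tokens
    after = system_tokens * (1 - prompt_cut) + user_tokens + output_tokens
    return 1 - after / before

# A 1,200-token system prompt with a 300-token user message and 400-token response:
print(f"{token_reduction(1200, 300, 400):.0%} fewer tokens per request")  # ~32%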
Common Mistakes
:::danger Measuring GPU Utilization and Thinking You're Done
GPU utilization percentage from nvidia-smi measures SM occupancy - what fraction of CUDA cores are doing something. But for LLM decode, "doing something" often means waiting for HBM reads. An LLM serving stack can show 85% GPU utilization and 15% MFU simultaneously. High utilization with low MFU means you are bandwidth-bound and spending money on compute that is mostly idle. The metric that matters is tokens per second per dollar, not GPU utilization.
:::
:::danger Ignoring KV Cache Fragmentation at Scale
If you pre-allocate a fixed contiguous KV cache block per sequence (the naive approach), you will waste 40-60% of KV cache memory due to internal fragmentation. A request that generates 50 tokens but was allocated a 2,048-token block wastes 96% of its allocation. Under load, this fragmentation means your GPU runs out of KV cache memory long before it runs out of HBM in total. Use PagedAttention (vLLM) or a similar block-based allocator. The impact is real: PagedAttention improves throughput by 2-4x on realistic workloads compared to contiguous allocation.
:::
:::warning Not Accounting for Input Tokens in Cost Models
Many teams budget based on output tokens and forget that input tokens (the prompt) are also computed. For RAG applications with 2,000-token context windows, the input token count often exceeds the output count 5:1. The prefill computation for those 2,000 input tokens may cost more than generating the 400-token response. Always model both input and output token costs. For context: OpenAI's GPT-4o launched at $5/million input tokens and $15/million output tokens - a 3x difference that completely changes the cost profile of prompt-heavy applications.
:::
:::warning Applying Speculative Decoding to High-Temperature Applications
Speculative decoding degrades at high temperature (0.8-1.0) because the draft model's token predictions become unreliable when the target model's output distribution is diffuse. At temperature 1.0 with a 5-token draft, you may see only 40-50% acceptance rate, providing minimal speedup while adding the overhead of running two models. Profile acceptance rate before deploying speculative decoding. If acceptance rate is below 0.6, speculative decoding is not providing enough benefit to justify the complexity.
:::
:::warning Using the Same Model for All Request Sizes
LLMs have roughly constant cost per output token regardless of input length (for small-to-medium inputs). But for very short requests ("What time is it in Tokyo?"), the overhead of routing to a 70B model - loading it from HBM, running attention over the system prompt - dwarfs the actual generation cost. A 3B model will answer this question just as accurately at 10x lower cost. Implement cascade routing. Even a simple heuristic (route requests under 50 words to the small model) can cut costs 30-40% on realistic customer query distributions.
:::
Interview Q&A
Q1: What is continuous batching and why does it improve GPU utilization so dramatically compared to static batching?
Static batching processes requests in discrete waves: collect requests, prefill them all, decode until the longest completes, then collect the next wave. During decode, the batch is fixed - if request A finishes at token 50 and request B finishes at token 500, the GPU slot held by request A sits idle processing padding for 450 additional decode steps. Useful utilization approaches $\bar{\ell} / \ell_{\max}$, the ratio of the average response length to the longest response length in the batch.
Continuous batching (iteration-level scheduling, from the Orca paper, 2022) treats each decode step as a scheduling opportunity. After every iteration, completed sequences are removed and new sequences can be added. The batch is always full (up to memory constraints). The GPU never idles waiting for the slowest request. Under realistic request distributions (where response lengths vary from 10 to 2,000 tokens), continuous batching achieves 10-25x higher throughput than static batching. vLLM combines this with PagedAttention for KV cache efficiency, together accounting for the bulk of the 24x improvement over HuggingFace Transformers that the vLLM paper reports.
Q2: Walk me through the math of how quantization reduces inference cost.
Quantization reduces cost through two mechanisms: model size reduction (which determines how many models fit on a GPU) and memory bandwidth reduction (which determines how fast each token generates).
For model size: a 70B parameter model in FP16 requires 140 GB. It needs two A100 80GB GPUs in tensor-parallel configuration; at $3.50/hour per GPU, that is $7.00/hour. In INT4, the same model requires 35 GB - it fits on a single A100. Cost drops to $3.50/hour. Same model, nearly the same quality, 2x cheaper.
For throughput: LLM decode is memory-bandwidth-bound. The time to generate one token is dominated by loading model weights from HBM. An FP16 weight requires 2 bytes; an INT4 weight requires 0.5 bytes. With 2 TB/s HBM bandwidth, FP16 weights stream at 1 trillion parameters/second and INT4 weights at 4 trillion parameters/second. For a 70B model, that puts a floor of roughly 70 ms per token on FP16 decode versus roughly 17.5 ms per token for INT4. 4x faster token generation means 4x more tokens per second per GPU, which means 4x lower cost per token.
Q3: How does PagedAttention work and why is it critical for production LLM serving?
PagedAttention borrows the virtual memory concept from operating systems. Instead of allocating a contiguous block of HBM for each sequence's KV cache upfront (which wastes memory when sequences are shorter than the maximum), PagedAttention divides KV cache memory into fixed-size "pages" (typically 16-32 tokens worth of K and V tensors per page). Pages are allocated on demand as sequences grow, and freed immediately when sequences complete.
The production impact is on memory efficiency and therefore throughput. Traditional contiguous allocation, assuming requests vary in length between 50 and 2,048 tokens, wastes on average 60-70% of allocated KV cache memory due to internal fragmentation. An A100 80GB with a 7B model (14 GB) and 66 GB for KV cache might support only 15-20 concurrent sequences in practice. With PagedAttention, the same 66 GB supports 40-60 concurrent sequences because there is almost no wasted memory. More concurrent sequences means more batching, which means more throughput and lower cost per token. The vLLM paper shows 24x throughput improvement over baseline; roughly 4-6x of that comes specifically from PagedAttention's memory efficiency - from eliminating the reserved-but-unused and fragmented memory of contiguous allocation.
An additional benefit: PagedAttention enables copy-on-write sharing of KV cache pages between sequences that share a common prefix (same system prompt, same few-shot examples). Those shared pages are computed once and referenced by all sequences, eliminating redundant computation.
Q4: What is speculative decoding and when should you not use it?
Speculative decoding is a technique where a small "draft" model generates multiple candidate tokens using the GPU's spare compute capacity, and the large "target" model verifies them in a single batched forward pass. If the target model agrees with a draft token, it accepts it (free - no extra forward pass needed). If not, it rejects it and substitutes its own token. The expected number of tokens produced per target-model forward pass is approximately $\frac{1 - \alpha^{k+1}}{1 - \alpha}$, where $k$ is the number of draft tokens and $\alpha$ is the acceptance rate.
When not to use it: (1) High temperature generation (above 0.8). The draft model's predictions become less reliable when the target model's output distribution is broad, reducing acceptance rate below 0.6 and making the technique provide minimal benefit. (2) When you are already batch-bound, not memory-bandwidth-bound. Speculative decoding helps when the bottleneck is memory reads per token. If you already have 50+ concurrent sequences in continuous batching, the GPU is compute-saturated and speculative decoding cannot help. (3) When the draft and target model have very different tokenizers or vocabularies - the draft model must share the exact tokenizer with the target model. (4) On long prefill-heavy workloads where generation is a small fraction of total compute. Speculative decoding only accelerates the decode phase.
Q5: How would you build a cost model to decide whether to self-host a 70B LLM or use an API like Together AI?
The decision hinges on throughput. API costs are fixed per token regardless of utilization. Self-hosted costs are fixed per GPU-hour regardless of utilization.
API break-even calculation:
For a 70B INT4 model on one A100 at $3.50/hour, the fixed infrastructure cost is $3.50 × 720 hours = $2,520/month.
API cost for the same model (e.g., Together AI Llama-3-70B at roughly $0.90 per million tokens) gives a break-even volume of $2,520 / $0.90 ≈ 2.8 billion tokens per month.
If you consume more than that - and your single GPU can actually sustain the roughly 1,000 tokens/second it implies - self-hosting is cheaper. Below that volume the GPU sits partly idle while the API bill scales down with usage, and the API wins.
The complications are: (1) self-hosting has operational overhead (DevOps, on-call, updates) - price this in engineer-hours and add it to the self-hosted side; (2) API providers have better reliability SLAs than most self-hosted setups; (3) API providers can absorb demand spikes you cannot - self-hosted throughput is capped at GPU capacity. The practical rule of thumb: self-host when monthly token volume is sustained in the billions, utilization is steady, and you have the DevOps capacity to run GPU infrastructure.
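A small helper for running this comparison with your own numbers (the defaults below are the illustrative figures from this answer, not quotes):
# Self-host vs API break-even sketch. Defaults: one A100 at $2,520/month,
# API price ~$0.90 per million tokens (illustrative figures from the answer above).
def breakeven_tokens_per_month(gpu_monthly_usd: float = 2520.0,
                               api_usd_per_million: float = 0.90) -> float:
    return gpu_monthly_usd / api_usd_per_million * 1e6

def monthly_costs(tokens: float, api_usd_per_million: float = 0.90,
                  gpu_monthly_usd: float = 2520.0) -> tuple[float, float]:
    return tokens / 1e6 * api_usd_per_million, gpu_monthly_usd

print(f"break-even ~ {breakeven_tokens_per_month() / 1e9:.1f}B tokens/month")
for volume in (0.5e9, 2.0e9, 2.8e9):
    api, self_host = monthly_costs(volume)
    print(f"{volume / 1e9:.1f}B tokens/month: API ${api:,.0f} vs self-host ${self_host:,.0f}")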
Q6: You have a production LLM API where cost per million tokens is $4.20 and the business needs it under $1.00. Walk through your optimization plan.
A cost of $4.20/M tokens at $3.50/hr (A100) implies approximately 833,000 output tokens per GPU-hour - about 230 tokens/second. The target of $1.00/M requires $(3.50 \times 10^6) / (1.00 \times 3600) \approx 972$ TPS, or roughly 3.5 million tokens per GPU-hour. That is a 4.2x improvement.
Step 1 - Measure the baseline. Is the current throughput close to what the hardware can realistically deliver, or is the GPU under-loaded? If MFU is 20%, the bottleneck is not hardware but the serving stack. Deploy vLLM with continuous batching if not already running. Expected improvement: 3-5x. This alone may hit the target.
Step 2 - Quantize. If running FP16 70B on two A100s, switching to INT4 fits the model on one A100. GPU cost halves from $7.00/hr to $3.50/hr. Expected improvement from quantization alone: 2x.
Step 3 - Enable prefix caching. If system prompts are shared across requests (very common), this eliminates prefill cost for the shared portion. Expected improvement: 20-40% depending on system prompt fraction.
Step 4 - Implement cascade routing. Route simple requests to a 7B model. If 50% of requests route to the small model (which is typical for mixed-workload APIs), blended cost drops to $0.5 \times \$0.20 + 0.5 \times \$0.80 = \$0.50$/M tokens on self-hosted hardware.
Step 5 - Enable speculative decoding. If the workload has moderate temperature (0.3-0.7), expect 1.5-2x improvement. Combined with the above, you will be well below $1.00/M tokens.
Key Takeaways
Inference cost optimization is not a single technique - it is a stack of complementary improvements, each targeting a different source of waste:
- Continuous batching eliminates idle GPU time between requests (3-5x improvement)
- Quantization reduces memory bandwidth demand and allows more models per GPU (2-4x)
- PagedAttention eliminates KV cache fragmentation and increases concurrent capacity (2-3x)
- Speculative decoding uses spare compute to accelerate decode (1.5-2.5x at low temperature)
- Cascade routing routes cheap queries to cheap models (2-3x blended cost reduction)
- Semantic and prefix caching eliminates redundant computation entirely (20-40% on typical workloads)
Applied together, teams routinely achieve well under $1/million tokens, versus the $5-10/million tokens typical of naive FP16 static batching deployments. That is a 10-15x cost reduction with no change to model quality.
The engineers who understand these techniques treat inference infrastructure the same way a database engineer treats query optimization: measure the bottleneck, apply the right technique, measure again. The bottleneck moves as you fix each layer. Keep iterating until you hit your cost target.
