Inference Cost Optimization

The Production Scenario

Your LLM-powered product has hit product-market fit. Three months ago you had 500 users. Today you have 50,000. Congratulations. The OpenAI invoice for last month was $87,000. Your CFO schedules a meeting.

The meeting goes about how you expect. The CFO pulls up a spreadsheet. At current growth, the inference bill will be $400,000/month in six months and $2M/month in a year. The product grosses $800,000/month in revenue. The LLM cost alone is about 11% of revenue today, on track to exceed 50%. "This is not a product," the CFO says. "It is a $2M/month GPU rental business that happens to have software on top."

You start auditing how your system uses LLMs. What you find is embarrassing in retrospect. You are sending GPT-4 requests to answer questions like "What are your business hours?" - information that never changes. You are sending full 8,000-token conversation histories to generate three-sentence replies. You are running the same "summarize this product description" prompt thousands of times per day on the same 200 products. You are paying for the model's reasoning capabilities on tasks that a much smaller model handles perfectly well.

None of these decisions were made maliciously. They were made by developers moving fast, reaching for the most capable model, defaulting to full context windows, not thinking about cost at all because cost was not the constraint yet. Now it is.

This lesson is a systematic playbook for identifying and eliminating that waste. The strategies covered can reduce inference costs by 70–95% in typical production systems - without any quality regression that users would notice.

Why This Exists: Cost is the Constraint That Scales With You

In the LLM era, infrastructure cost has a unique property: it scales directly with usage in a way that most software systems avoid. Traditional SaaS: you pay for servers that serve millions of requests once they are running. LLM SaaS: every token costs money. More users = more tokens = linearly more cost.

This makes cost optimization a first-class engineering concern, not an afterthought. Teams that treat it as an afterthought are regularly surprised when their inference bill exceeds their revenue.

Cost Structure

LLM inference cost breaks down as:

$\text{total cost} = \text{GPU hours consumed} \times \text{cost per GPU hour}$

$\text{GPU hours} = \frac{\text{total tokens generated}}{\text{tokens per second per GPU} \times 3600}$

$\text{cost per 1M tokens} = \frac{\text{GPU cost per hour}}{\text{tokens per second per GPU} \times 3600} \times 10^6$

Example calculation for a self-hosted LLaMA-3 70B on an A100 80GB ($3/hour):

$\text{throughput} \approx 800 \text{ tokens/second (with vLLM, moderate load)}$

$\text{cost per 1M tokens} = \frac{\$3}{800 \times 3600} \times 10^6 = \frac{\$3}{2{,}880{,}000} \times 10^6 \approx \$1.04/\text{1M tokens}$

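For comparison: GPT-4o is priced at roughly $5/1M input tokens + $20/1M output tokens (the approximate 2025 figures used in this lesson's pricing table; check the provider's current price list before budgeting). A well-run self-hosted LLaMA-3 70B costs ~$1–2/1M tokens. The gap is real - but self-hosting has operational overhead and quality trade-offs.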

For API-based LLMs, the cost structure is even simpler: you pay per token, per call. Input tokens (your prompt) cost less than output tokens (the model's generation). This asymmetry matters: strategies that reduce output tokens have outsized impact.
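The formulas above translate directly into a small calculator. This is a minimal sketch: the function names are invented for illustration, and the prices passed in are only the approximate figures quoted in this lesson.

```python
def self_hosted_cost_per_million(gpu_cost_per_hour: float, tokens_per_second: float) -> float:
    """USD to generate 1M tokens on one GPU at a sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000


def api_request_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """USD for one API call billed separately on input and output tokens."""
    return (
        input_tokens / 1_000_000 * input_price_per_million
        + output_tokens / 1_000_000 * output_price_per_million
    )


if __name__ == "__main__":
    # A100 at $3/hour sustaining 800 tokens/second (the worked example above)
    print(f"self-hosted: ${self_hosted_cost_per_million(3.0, 800):.2f} per 1M tokens")
    # One request: 2,000 prompt tokens and 300 generated tokens at $5 / $20 per 1M
    print(f"API call:    ${api_request_cost(2000, 300, 5.0, 20.0):.4f}")
```

In the second call the 300 output tokens account for more than a third of the bill despite being about 13% of the tokens, which is the input/output asymmetry in practice.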

The Optimization Stack: Strategies Ranked by Impact

The following strategies are ranked roughly by the effort-to-impact ratio. Start from the top.

Strategy 1: Model Selection

The highest-leverage decision is choosing the right model for each task. This sounds obvious, but most teams default to the most capable model for everything - often because it is easier than building a routing system.

Cost comparison for common models (2025 approximate pricing):

Model	Input cost / 1M tokens	Output cost / 1M tokens	Relative cost
GPT-4o	$5	$20	100×
GPT-4o mini	$0.15	$0.60	3×
Claude 3.5 Sonnet	$3	$15	75×
Claude 3 Haiku	$0.25	$1.25	5×
Mistral Small	$0.20	$0.60	3×
Llama-3 8B (self-hosted)	~$0.10	~$0.10	1×
Llama-3 70B (self-hosted)	~$0.80	~$0.80	8×

A task that GPT-4o handles correctly for $20/1M output tokens can often be handled with identical quality by GPT-4o mini for $0.60/1M output tokens - a 33× cost reduction.

Task difficulty categories:

Task type	Recommended model tier
FAQ lookup, slot extraction, simple classification	Smallest model (Haiku, mini, 7B)
Email drafting, code snippet generation	Mid-tier (Sonnet, 8B–13B)
Complex reasoning, multi-step analysis	Large model (GPT-4o, Claude Sonnet, 70B)
Creative writing, research	Large model as needed
Structured data extraction	Small model with good prompting
Long document summarization	Mid-tier with chunking

Run an audit: take 500 random production requests, have humans rate whether a smaller model's output was acceptable. You will typically find 60–80% of requests could have been routed to a cheaper model.
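Once the audit gives you a routable fraction, a back-of-the-envelope estimate shows what it is worth. The sketch below assumes spend is spread evenly across requests, which real traffic rarely satisfies exactly; `blended_cost` is a hypothetical helper, not part of any library.

```python
def blended_cost(current_cost: float, routable_fraction: float, price_ratio: float) -> float:
    """
    Estimated cost after moving `routable_fraction` of spend to a model that is
    `price_ratio` times cheaper. Assumes every request costs about the same.
    """
    if not 0.0 <= routable_fraction <= 1.0:
        raise ValueError("routable_fraction must be between 0 and 1")
    if price_ratio <= 0:
        raise ValueError("price_ratio must be positive")
    kept = current_cost * (1.0 - routable_fraction)            # stays on the large model
    moved = current_cost * routable_fraction / price_ratio     # re-priced on the small model
    return kept + moved


if __name__ == "__main__":
    before = 87_000.0  # monthly bill from the opening scenario
    after = blended_cost(before, routable_fraction=0.70, price_ratio=33)
    print(f"before: ${before:,.0f}  after: ${after:,.0f}  saved: {1 - after / before:.0%}")
```

With 70% of spend routable to a 33× cheaper model, the estimate lands near a two-thirds reduction; the remaining bill is dominated by the requests that still need the large model.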

Strategy 2: Quantization

For self-hosted models, quantization directly reduces GPU memory requirements, enabling more concurrent sequences per GPU or smaller GPUs.

Precision	Memory for LLaMA-3 70B	Throughput vs FP16	Quality impact
FP16	140 GB	1×	Reference
INT8	70 GB	1.1×	Minimal
INT4 (GPTQ/AWQ)	35 GB	1.5–2×	Small for most tasks
INT4 (GGUF Q4_K_M)	40 GB	1.3×	Small for most tasks

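INT4 quantization shrinks LLaMA-3 70B's weights to about 35 GB, so they fit on a single A100 80GB GPU instead of two for INT8 or four for FP16 (in practice a second GPU is often added for KV cache headroom). Alternatively, on the same four GPUs, the memory freed by quantization can go to KV cache, which lets you serve considerably more concurrent requests.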
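The memory column in the table follows from parameter count times bits per weight. A rough sketch, assuming 1 GB = 10^9 bytes and ignoring activations, runtime overhead, and the per-group scales that real quantization formats store (which is why GGUF Q4_K_M lands above the 4-bit figure); both function names are illustrative.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for model weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


def kv_cache_headroom_gb(num_gpus: int, gpu_memory_gb: float, weights_gb: float) -> float:
    """Memory left over for KV cache after the weights are loaded (upper bound)."""
    return num_gpus * gpu_memory_gb - weights_gb


if __name__ == "__main__":
    for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
        weights = weight_memory_gb(70e9, bits)
        headroom = kv_cache_headroom_gb(4, 80, weights)
        print(f"{name}: weights ~{weights:.0f} GB, KV headroom on 4x A100 80GB ~{headroom:.0f} GB")
```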

For API providers (OpenAI, Anthropic), you cannot control quantization; the provider has already made that decision for you. The equivalent lever is picking a newer, more efficient model generation: moving from GPT-4 to GPT-4o, for example, lowers the per-token price while generally improving quality.

For self-hosted deployments:

# vLLM with AWQ INT4 quantization
python -m vllm.entrypoints.openai.api_server \
    --model TheBloke/Llama-2-70B-Chat-AWQ \
    --quantization awq \
    --dtype float16 \
    --max-model-len 4096 \
    --tensor-parallel-size 2   # Only 2 GPUs needed now vs 4 for FP16

Strategy 3: Continuous Batching

As covered in the previous lesson, switching from static batching (or single-request serving) to a proper inference server like vLLM delivers 10–20× throughput improvement. This is the most impactful single infrastructure change for teams currently using naive serving.

Cost implication: if your current serving achieves 50 tokens/second (naive) and vLLM achieves 800 tokens/second (optimized), you need 16× fewer GPU-hours to serve the same workload. On $3/hour A100 instances, that is about $1.04 vs $16.67 per million tokens - assuming you are currently paying for compute capacity at the naive throughput rate.

Strategy 4: Caching

Caching is the highest-leverage strategy for workloads with repeated or similar prompts. Three types matter:

Exact Caching

Store the exact prompt → response mapping. Cache hit = zero LLM cost.

Best for:

FAQ systems where the same question is asked many times
Product descriptions that never change
Any deterministic prompt (temperature=0.0)

import hashlib
import json
import redis
from typing import Optional


class ExactLLMCache:
    """
    Redis-backed exact cache for LLM responses.
    Key = SHA256 hash of (model, messages, max_tokens, temperature).
    """

    def __init__(
        self,
        redis_url: str = "redis://localhost:6379",
        ttl_seconds: int = 86400,  # 24 hours default
    ):
        self.redis = redis.from_url(redis_url, decode_responses=True)
        self.ttl = ttl_seconds

    def _make_key(
        self,
        model: str,
        messages: list[dict],
        max_tokens: int,
        temperature: float,
    ) -> str:
        payload = json.dumps(
            {
                "model": model,
                "messages": messages,
                "max_tokens": max_tokens,
                "temperature": temperature,
            },
            sort_keys=True,
        )
        return f"llm:exact:{hashlib.sha256(payload.encode()).hexdigest()}"

    def get(
        self,
        model: str,
        messages: list[dict],
        max_tokens: int,
        temperature: float,
    ) -> Optional[str]:
        key = self._make_key(model, messages, max_tokens, temperature)
        cached = self.redis.get(key)
        if cached is not None:  # an empty-string response still counts as a hit
            self.redis.hincrby("llm:stats", "exact_hits", 1)
        else:
            self.redis.hincrby("llm:stats", "exact_misses", 1)
        return cached

    def set(
        self,
        model: str,
        messages: list[dict],
        max_tokens: int,
        temperature: float,
        response: str,
    ) -> None:
        key = self._make_key(model, messages, max_tokens, temperature)
        self.redis.setex(key, self.ttl, response)

    def stats(self) -> dict:
        stats = self.redis.hgetall("llm:stats")
        hits = int(stats.get("exact_hits", 0))
        misses = int(stats.get("exact_misses", 0))
        total = hits + misses
        return {
            "hits": hits,
            "misses": misses,
            "hit_rate": hits / total if total > 0 else 0,
        }
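To show where an exact cache sits in the request path, here is a sketch of the cache-aside pattern. It works with any object exposing the same `get`/`set` signature as `ExactLLMCache`; `InMemoryExactCache` is a dict-backed stand-in so the example runs without Redis, and `cached_chat` and the `call_llm` callable are hypothetical names introduced for illustration.

```python
from typing import Callable, Optional

Messages = list[dict]
LLMCall = Callable[[str, Messages, int, float], str]


class InMemoryExactCache:
    """Dict-backed stand-in with the same get/set signature as ExactLLMCache."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}

    @staticmethod
    def _key(model: str, messages: Messages, max_tokens: int, temperature: float) -> str:
        return repr((model, messages, max_tokens, temperature))

    def get(self, model: str, messages: Messages, max_tokens: int, temperature: float) -> Optional[str]:
        return self._store.get(self._key(model, messages, max_tokens, temperature))

    def set(self, model: str, messages: Messages, max_tokens: int, temperature: float, response: str) -> None:
        self._store[self._key(model, messages, max_tokens, temperature)] = response


def cached_chat(
    cache,
    call_llm: LLMCall,
    model: str,
    messages: Messages,
    max_tokens: int = 500,
    temperature: float = 0.0,
) -> tuple[str, bool]:
    """Return (response, cache_hit). Only deterministic requests are cached."""
    if temperature > 0.0:
        # Sampling is intentional here, so replaying one stored answer would be wrong.
        return call_llm(model, messages, max_tokens, temperature), False

    cached = cache.get(model, messages, max_tokens, temperature)
    if cached is not None:
        return cached, True

    response = call_llm(model, messages, max_tokens, temperature)
    cache.set(model, messages, max_tokens, temperature, response)
    return response, False
```

Skipping the cache when `temperature > 0` is a design choice for this sketch; some teams cache sampled responses too when variety does not matter.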

Semantic Caching

Semantic caching extends exact caching to similar-but-not-identical prompts. If "What are your store hours?" and "When are you open?" should produce the same answer, exact caching misses - semantic caching catches it.

Implementation:

Embed each incoming prompt with a fast embedding model (text-embedding-3-small, or a local model like all-MiniLM-L6-v2)
Query a vector database for the nearest cached prompt
If similarity exceeds a threshold, return the cached response

import numpy as np
from openai import OpenAI
from typing import Optional


class SemanticLLMCache:
    """
    Semantic cache: embed prompts, find similar cached responses.
    This version keeps entries in memory and scans them with brute-force
    cosine similarity. In production, use a vector index instead
    (for example, Redis Stack vector search).
    """

    def __init__(
        self,
        openai_client: OpenAI,
        embedding_model: str = "text-embedding-3-small",
        similarity_threshold: float = 0.92,
        max_cache_entries: int = 10_000,
        ttl_seconds: int = 3600,
    ):
        self.client = openai_client
        self.embedding_model = embedding_model
        self.threshold = similarity_threshold
        self.max_entries = max_cache_entries
        self.ttl = ttl_seconds

        # In-memory cache for embeddings (production: use Redis vector search)
        self._embeddings: list[tuple[np.ndarray, str, str]] = []
        # Each entry: (embedding, prompt_text, response_text)

    def _embed(self, text: str) -> np.ndarray:
        """Get embedding for text using OpenAI's embedding model."""
        response = self.client.embeddings.create(
            model=self.embedding_model,
            input=text,
        )
        return np.array(response.data[0].embedding, dtype=np.float32)

    def _cosine_similarity(self, a: np.ndarray, b: np.ndarray) -> float:
        """Compute cosine similarity between two vectors."""
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-10))

    def get(self, prompt: str) -> Optional[str]:
        """
        Look up a cached response for a semantically similar prompt.
        Returns None if no match above threshold.
        """
        if not self._embeddings:
            return None

        query_emb = self._embed(prompt)

        # Find most similar cached prompt
        best_score = 0.0
        best_response = None

        for cached_emb, _, response in self._embeddings:
            score = self._cosine_similarity(query_emb, cached_emb)
            if score > best_score:
                best_score = score
                best_response = response

        if best_score >= self.threshold:
            return best_response

        return None

    def set(self, prompt: str, response: str) -> None:
        """Cache a prompt-response pair with its embedding."""
        emb = self._embed(prompt)
        self._embeddings.append((emb, prompt, response))

        # Evict oldest entries if over limit (simple FIFO)
        if len(self._embeddings) > self.max_entries:
            self._embeddings = self._embeddings[-self.max_entries :]

    def cached_completion(
        self,
        prompt: str,
        model: str = "gpt-4o",
        max_tokens: int = 500,
        temperature: float = 0.0,
    ) -> tuple[str, bool]:
        """
        Get completion with caching.
        Returns (response_text, cache_hit).
        """
        cached = self.get(prompt)
        if cached is not None:
            return cached, True

        # Cache miss - call LLM
        response = self.client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=max_tokens,
            temperature=temperature,
        )
        result = response.choices[0].message.content

        # Store in cache
        self.set(prompt, result)

        return result, False


# Example usage with cost tracking
def demo_semantic_cache():
    client = OpenAI()
    cache = SemanticLLMCache(
        openai_client=client,
        similarity_threshold=0.92,
    )

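    test_queries = [
        "What are your business hours?",
        "When are you open?",             # Paraphrase of the above → intended cache hit
        "What time do you close?",        # Related → potential cache hit
        "How do I reset my password?",    # Different topic → cache miss
        "I forgot my password, what now?",  # Paraphrase of the above → intended cache hit
    ]
    # Whether a paraphrase actually clears the 0.92 threshold depends on the
    # embedding model; measure similarities on your own traffic before fixing it.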

    hits = 0
    for query in test_queries:
        response, is_hit = cache.cached_completion(
            query, model="gpt-4o-mini", max_tokens=100
        )
        hits += int(is_hit)
        print(f"{'HIT ' if is_hit else 'MISS'} | {query[:40]}")

    print(f"\nCache hit rate: {hits}/{len(test_queries)} = {hits/len(test_queries):.0%}")

Provider-Level Prompt Caching

Anthropic (Claude) and OpenAI both offer prompt caching: if the first N tokens of your prompt are identical across requests, subsequent requests pay a significantly reduced rate for those tokens.

Anthropic prompt caching:

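Mark system prompt as cacheable with cache_control
First request: the cache write is billed at a premium over the normal input price (about 25% at the time of writing). Subsequent requests: ~10% of input token price for cached prefix
Cache persists for 5 minutes, refreshed each time it is read (extended cache: 1 hour for an additional fee)
Prefixes shorter than a model-specific minimum (on the order of 1,024–2,048 tokens) are not cached
Useful for: system prompts, few-shot examples, long context documents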

OpenAI prompt caching (automatic):

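Automatically caches prompts of 1,024 tokens or longer
Cached tokens are billed at a discount (50% at launch; the exact discount depends on the model)
Cache duration: typically a few minutes of inactivity, up to about an hour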

Design principles for maximum cache utilization:

Put stable content first: system prompt, instructions, examples come before the user's variable input. The cache matches the longest common prefix.
Minimize what changes: isolate the dynamic part (user message) to the end. If your system prompt is 2,000 tokens and the user message is 50 tokens, about 97.6% of your tokens (2,000 of 2,050) are cacheable.
Batch similar requests: requests sent within the cache window get cached discounts. Don't spread similar requests over hours.
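The effect of these principles on the input bill can be estimated with a few lines of arithmetic. A simplified sketch: it models only the discounted read price for the cached prefix and ignores any surcharge charged when the cache entry is first written; the function names and the `cached_price_multiplier` parameter are assumptions made for this example.

```python
def cacheable_fraction(stable_prefix_tokens: int, dynamic_tokens: int) -> float:
    """Share of prompt tokens that sit in the stable, cacheable prefix."""
    total = stable_prefix_tokens + dynamic_tokens
    return stable_prefix_tokens / total if total else 0.0


def input_cost_per_request(
    stable_prefix_tokens: int,
    dynamic_tokens: int,
    input_price_per_million: float,
    cached_price_multiplier: float,
    cache_hit: bool,
) -> float:
    """
    USD for the input side of one request.
    cached_price_multiplier: 0.1 means cached prefix tokens cost 10% of list price.
    """
    prefix_rate = cached_price_multiplier if cache_hit else 1.0
    billable_tokens = stable_prefix_tokens * prefix_rate + dynamic_tokens
    return billable_tokens / 1_000_000 * input_price_per_million


if __name__ == "__main__":
    # 2,000-token system prompt + 50-token user message at $3 per 1M input tokens
    print(f"cacheable: {cacheable_fraction(2000, 50):.1%}")
    miss = input_cost_per_request(2000, 50, 3.0, 0.1, cache_hit=False)
    hit = input_cost_per_request(2000, 50, 3.0, 0.1, cache_hit=True)
    print(f"miss: ${miss:.5f}  hit: ${hit:.5f}")
```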

# Anthropic prompt caching example
import anthropic

client = anthropic.Anthropic()

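# System prompt: stable, ~2000 tokens. It is cached only if it meets the
# model's minimum cacheable length (check the provider docs for your model).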
SYSTEM_PROMPT = """You are a customer service agent for AcmeCorp.
... (2000 tokens of instructions, policies, and examples) ...
"""

def handle_customer_query(user_message: str) -> str:
    """
    Uses prompt caching for the stable system prompt.
    First call: pay for all tokens.
    Subsequent calls within 5 min: pay ~10% for system prompt tokens.
    """
    response = client.messages.create(
        model="claude-3-5-haiku-20241022",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"},  # Mark for caching
            }
        ],
        messages=[
            {"role": "user", "content": user_message}
        ],
    )

    # Check cache usage in response metadata
    usage = response.usage
    print(f"Input tokens: {usage.input_tokens}")
    if hasattr(usage, 'cache_read_input_tokens'):
        print(f"Cache read tokens: {usage.cache_read_input_tokens}")
        print(f"Cache creation tokens: {usage.cache_creation_input_tokens}")

    return response.content[0].text

Strategy 5: Request Routing

Model routing sends simple requests to cheap small models and complex requests to expensive large models. A well-designed router can reduce costs by 60–80% on mixed workloads.

from openai import OpenAI
from dataclasses import dataclass
from typing import Literal
import re


@dataclass
class RoutingConfig:
    """Model routing configuration with cost tracking."""
    simple_model: str = "gpt-4o-mini"
    medium_model: str = "gpt-4o-mini"
    complex_model: str = "gpt-4o"

    # Cost per 1M output tokens (approximate, 2025)
    simple_cost_per_1m: float = 0.60
    medium_cost_per_1m: float = 0.60
    complex_cost_per_1m: float = 20.0


class LLMRouter:
    """
    Routes requests to models based on estimated complexity.

    Complexity estimation: heuristics + optional lightweight classifier.
    In production, train a proper classifier on labeled data.
    """

    def __init__(self, client: OpenAI, config: RoutingConfig = None):
        self.client = client
        self.config = config or RoutingConfig()
        self._total_cost = 0.0
        self._request_counts = {"simple": 0, "medium": 0, "complex": 0}

    def estimate_complexity(self, query: str) -> tuple[str, float]:
        """
        Estimate query complexity. Returns (tier, score).

        Heuristics used (replace with a trained classifier in production):
        - Length: longer queries tend to be more complex
        - Keywords: reasoning, analyze, compare, explain why → complex
        - Question structure: multi-part questions → complex
        - Task type: classification, extraction → simple
        """
        query_lower = query.lower()
        score = 0.0

        # Length signal
        word_count = len(query.split())
        if word_count < 15:
            score += 0.1
        elif word_count < 50:
            score += 0.3
        else:
            score += 0.5

        # Simple task keywords
        simple_keywords = [
            "what is", "when", "where", "list", "name",
            "define", "classify", "extract", "categorize",
        ]
        for kw in simple_keywords:
            if kw in query_lower:
                score -= 0.15

        # Complex task keywords
        complex_keywords = [
            "analyze", "evaluate", "compare", "explain why", "reason",
            "argue", "critique", "synthesize", "implications", "trade-off",
            "design", "architect", "strategy", "comprehensive", "thorough",
        ]
        for kw in complex_keywords:
            if kw in query_lower:
                score += 0.2

        # Multi-part question detection
        question_marks = query.count("?")
        if question_marks > 1:
            score += 0.15 * (question_marks - 1)

        # Code generation (medium complexity)
        if any(kw in query_lower for kw in ["write a", "implement", "code", "function"]):
            score += 0.3

        # Clamp score to [0, 1]
        score = max(0.0, min(1.0, score))

        if score < 0.3:
            return "simple", score
        elif score < 0.7:
            return "medium", score
        else:
            return "complex", score

    def route(
        self,
        query: str,
        max_tokens: int = 500,
        force_tier: str = None,
    ) -> dict:
        """
        Route a request to the appropriate model and return the response.
        """
        tier, score = self.estimate_complexity(query)
        if force_tier:
            tier = force_tier

        model_map = {
            "simple": self.config.simple_model,
            "medium": self.config.medium_model,
            "complex": self.config.complex_model,
        }
        cost_map = {
            "simple": self.config.simple_cost_per_1m,
            "medium": self.config.medium_cost_per_1m,
            "complex": self.config.complex_cost_per_1m,
        }

        model = model_map[tier]
        response = self.client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": query}],
            max_tokens=max_tokens,
        )

        output_tokens = response.usage.completion_tokens
        cost = output_tokens / 1_000_000 * cost_map[tier]
        self._total_cost += cost
        self._request_counts[tier] += 1

        return {
            "response": response.choices[0].message.content,
            "tier": tier,
            "complexity_score": round(score, 2),
            "model_used": model,
            "output_tokens": output_tokens,
            "cost_usd": round(cost, 6),
        }

    def cost_report(self) -> dict:
        """Summary of routing decisions and costs."""
        total_requests = sum(self._request_counts.values())
        return {
            "total_cost_usd": round(self._total_cost, 4),
            "total_requests": total_requests,
            "distribution": {
                tier: {
                    "count": count,
                    "pct": round(count / total_requests * 100, 1) if total_requests else 0,
                }
                for tier, count in self._request_counts.items()
            },
        }


# Example
client = OpenAI()
router = LLMRouter(client)

test_queries = [
    "What is the capital of France?",
    "List 5 programming languages.",
    "Write a Python function to merge two sorted arrays.",
    "Analyze the trade-offs between microservices and monolithic architectures for a startup scaling from 10 to 100 engineers. Consider organizational complexity, deployment overhead, fault isolation, and performance characteristics.",
]

for query in test_queries:
    result = router.route(query, max_tokens=200)
    print(f"[{result['tier'].upper():7}] score={result['complexity_score']} | {query[:60]}")

print("\nCost Report:")
report = router.cost_report()
print(f"  Total cost: ${report['total_cost_usd']}")
for tier, info in report["distribution"].items():
    print(f"  {tier}: {info['count']} requests ({info['pct']}%)")

Strategy 6: Prompt Compression

Long prompts cost money even before the model generates a single token. If your system sends 4,000-token prompts with only 200 tokens of truly essential information, you are paying 20× more than necessary on the input side.

LLMLingua (Jiang et al., 2023) and its successor LLMLingua-2 compress prompts by removing tokens that are unlikely to affect the model's response. The compression works by using a small local language model to score each token's importance, then removing low-importance tokens while preserving grammatical structure.

Compression ratios of 3–20× have been reported with minimal quality degradation for many tasks.

# LLMLingua prompt compression
# pip install llmlingua

from llmlingua import PromptCompressor


def compress_prompt(
    long_prompt: str,
    compression_ratio: float = 0.5,  # Keep 50% of tokens
    question: str = "",               # Target question helps focus compression
) -> dict:
    """
    Compress a long prompt using LLMLingua.

    Args:
        long_prompt: The full prompt to compress
        compression_ratio: Target ratio of tokens to keep (0.5 = keep 50%)
        question: The question the compressed prompt should answer (optional)

    Returns:
        dict with compressed_prompt, original_tokens, compressed_tokens, ratio
    """
    compressor = PromptCompressor(
        model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
        use_llmlingua2=True,
        device_map="cpu",  # Or "cuda" if GPU available
    )

    # LLMLingua-2 is a token classifier: it takes a target keep-rate and does
    # not use the LongLLMLingua-only options (rank_method, condition_compare,
    # reorder_context, ...), so they are omitted here. `question` is not used
    # on this path; question-aware compression needs use_llmlingua2=False.
    result = compressor.compress_prompt(
        long_prompt,
        rate=compression_ratio,      # Fraction of tokens to keep
        force_tokens=["\n", "?"],    # Always preserve line breaks and question marks
    )

    return {
        "compressed_prompt": result["compressed_prompt"],
        "original_tokens": result["origin_tokens"],
        "compressed_tokens": result["compressed_tokens"],
        "actual_ratio": result["compressed_tokens"] / result["origin_tokens"],
        "savings_pct": (1 - result["compressed_tokens"] / result["origin_tokens"]) * 100,
    }


# Manual prompt compression patterns (no library needed)
def compress_few_shot_examples(
    system_prompt: str,
    examples: list[dict],   # [{"input": ..., "output": ...}]
    n_examples_to_keep: int = 3,
) -> str:
    """
    Reduce few-shot examples from N to K.
    Select the most diverse/representative examples.
    """
    if len(examples) <= n_examples_to_keep:
        return build_prompt(system_prompt, examples)

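    # Simple: keep first and last (boundary examples) + random middle
    import random
    selected = [examples[0], examples[-1]][:n_examples_to_keep]
    remaining = examples[1:-1]
    # Guard against a negative sample size when fewer than 2 examples are requested
    extra = max(0, n_examples_to_keep - len(selected))
    selected += random.sample(remaining, min(extra, len(remaining)))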

    return build_prompt(system_prompt, selected)


def build_prompt(system_prompt: str, examples: list[dict]) -> str:
    lines = [system_prompt, ""]
    for ex in examples:
        lines.append(f"Input: {ex['input']}")
        lines.append(f"Output: {ex['output']}")
        lines.append("")
    return "\n".join(lines)
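Another manual pattern targets the oversized conversation histories from the opening scenario: keep the system message and only as many recent turns as fit a token budget. This is a sketch under a loud assumption: `estimate_tokens` uses a rough four-characters-per-token heuristic, which you would replace with your model's real tokenizer; both function names are hypothetical.

```python
def estimate_tokens(text: str) -> int:
    """Rough assumption: ~4 characters per token for English text."""
    return max(1, len(text) // 4)


def trim_history(messages: list[dict], max_tokens: int) -> list[dict]:
    """
    Keep all system messages plus the most recent turns that fit in max_tokens.
    Older turns are dropped first; message order is preserved.
    """
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]

    budget = max_tokens - sum(estimate_tokens(m["content"]) for m in system)

    kept: list[dict] = []
    for message in reversed(turns):          # walk from newest to oldest
        cost = estimate_tokens(message["content"])
        if cost > budget:
            break                            # stop at the first turn that does not fit
        kept.append(message)
        budget -= cost

    return system + kept[::-1]               # restore chronological order
```

Dropping old turns outright is the bluntest option; summarizing them with a small model is a common refinement when earlier context still matters.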

Strategy 7: Speculative Decoding

Speculative decoding (covered in depth in Lesson 05) uses a small draft model to propose tokens that the large target model verifies in parallel. For inference at a fixed hardware budget, this means getting 2–3× more tokens per second from the same GPU, effectively reducing cost per token by 2–3×.

Cost impact: If you are serving on reserved instances (paying for GPU-hours regardless of utilization), speculative decoding increases throughput from those GPUs - reducing cost per token without changing the GPU bill. If you are paying per-token on an API, speculative decoding does not help (the API provider charges you for output tokens, not compute).

Best scenarios for speculative decoding:

Self-hosted models on owned or reserved GPU infrastructure
Applications where the output token distribution is predictable (code, templates, structured responses)
Large models (70B+) where the draft/target cost ratio is favorable

Strategy 8: Spot and Preemptible Instances

Cloud providers offer preemptible (AWS Spot, GCP Spot, Azure Spot) GPU instances at 60–90% discount versus on-demand. The trade-off: the cloud provider can reclaim the instance with 30–120 seconds notice.

GPU instance	On-demand $/hr	Spot $/hr	Savings
AWS p3.2xlarge (V100 16GB)	$3.06	$0.92	70%
AWS p3.8xlarge (4× V100)	$12.24	$3.67	70%
AWS p4d.24xlarge (8× A100 40GB)	$32.77	$9.83	70%
GCP a2-highgpu-1g (A100 40GB)	$3.67	$1.10	70%

Spot instances work for:

Batch inference jobs (document processing, embedding generation, offline summarization)
Development and testing workloads
Any workload that can tolerate interruption and retry

Spot instances do NOT work for:

Interactive user-facing inference (your serving pod can vanish mid-request)
Workloads with strict latency SLAs
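Interruptions are not free: work in flight when an instance is reclaimed has to be redone. A quick way to sanity-check the discount is to inflate the spot price by the share of work you expect to repeat. The sketch below uses the p4d.24xlarge prices from the table; the `rework_fraction` values are made-up inputs, not measurements.

```python
def effective_spot_cost(spot_price_per_hour: float, rework_fraction: float) -> float:
    """Hourly spot price inflated by the fraction of work redone after interruptions."""
    if rework_fraction < 0:
        raise ValueError("rework_fraction must be non-negative")
    return spot_price_per_hour * (1.0 + rework_fraction)


def savings_vs_on_demand(
    on_demand_price_per_hour: float,
    spot_price_per_hour: float,
    rework_fraction: float,
) -> float:
    """Fractional savings of spot over on-demand after accounting for rework."""
    effective = effective_spot_cost(spot_price_per_hour, rework_fraction)
    return 1.0 - effective / on_demand_price_per_hour


if __name__ == "__main__":
    for rework in (0.0, 0.10, 0.25):
        saved = savings_vs_on_demand(32.77, 9.83, rework)
        print(f"rework {rework:.0%}: savings {saved:.1%}")
```

Even with a quarter of the work repeated, the discount stays above 60% in this example, which is why small, idempotent work units matter more than avoiding interruptions entirely.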

Architecture for spot instance batch processing:

import boto3
import json
import time
from typing import Iterator


def process_batch_with_spot_fallback(
    prompts: list[str],
    model_name: str,
    output_bucket: str,
    output_prefix: str,
) -> Iterator[dict]:
    """
    Process a batch of prompts using spot instances via an SQS work queue.

    Pattern:
    1. Push prompts to SQS queue
    2. Spot EC2 fleet processes from queue, writes results to S3
    3. On spot interruption, messages return to queue automatically (visibility timeout)
    4. Another spot instance picks up and continues

    This function emulates the pattern - in production, this would be
    an SQS consumer running on EC2 spot fleet.
    """
    sqs = boto3.client("sqs")
    s3 = boto3.client("s3")

    queue_url = "https://sqs.us-east-1.amazonaws.com/123456789/llm-batch-queue"

    # Enqueue all prompts
    for i, prompt in enumerate(prompts):
        sqs.send_message(
            QueueUrl=queue_url,
            MessageBody=json.dumps({
                "prompt_id": i,
                "prompt": prompt,
                "model": model_name,
            }),
        )

    # In production: EC2 spot fleet runs this consumer
    # Messages have visibility timeout - if instance is interrupted,
    # message becomes visible again and another instance picks it up
    while True:
        messages = sqs.receive_message(
            QueueUrl=queue_url,
            MaxNumberOfMessages=10,
            VisibilityTimeout=300,  # 5 minutes to process
            WaitTimeSeconds=20,     # Long polling: a short poll can return empty while messages remain
        ).get("Messages", [])

        if not messages:
            break

        for msg in messages:
            body = json.loads(msg["Body"])
            # ... run LLM inference ...
            # ... write result to S3 ...
            # ... delete message from SQS ...
            sqs.delete_message(
                QueueUrl=queue_url,
                ReceiptHandle=msg["ReceiptHandle"],
            )
            yield {"prompt_id": body["prompt_id"], "status": "complete"}

Cost Monitoring and Attribution

Cost optimization is meaningless without observability. You need to know:

Which features cost the most
Whether cost per feature is increasing over time
Which experiments are expensive before they reach production

import time
from functools import wraps
from openai import OpenAI
import dataclasses
from typing import Optional
import json


@dataclasses.dataclass
class LLMCallRecord:
    timestamp: float
    feature: str          # "customer_support", "code_review", "summarization"
    model: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    cost_usd: float
    cache_hit: bool = False
    user_id: Optional[str] = None


# Cost per 1M tokens by model (approximate 2025 prices)
MODEL_COSTS = {
    "gpt-4o": {"input": 5.0, "output": 20.0},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "claude-3-5-sonnet-20241022": {"input": 3.0, "output": 15.0},
    # Claude 3 Haiku, matching the $0.25 / $1.25 row in the pricing table
    "claude-3-haiku-20240307": {"input": 0.25, "output": 1.25},
}


def compute_cost(
    model: str,
    input_tokens: int,
    output_tokens: int,
) -> float:
    """Compute USD cost for an LLM call."""
    costs = MODEL_COSTS.get(model, {"input": 1.0, "output": 4.0})
    return (
        input_tokens / 1_000_000 * costs["input"]
        + output_tokens / 1_000_000 * costs["output"]
    )


class CostTrackedLLM:
    """
    Wrapper around OpenAI client that tracks cost per feature.
    In production: persist records to a database, aggregate in Grafana.
    """

    def __init__(self, client: OpenAI):
        self.client = client
        self._records: list[LLMCallRecord] = []

    def chat(
        self,
        messages: list[dict],
        model: str = "gpt-4o-mini",
        max_tokens: int = 500,
        feature: str = "unknown",
        user_id: Optional[str] = None,
    ) -> str:
        start = time.perf_counter()

        response = self.client.chat.completions.create(
            model=model,
            messages=messages,
            max_tokens=max_tokens,
        )

        latency_ms = (time.perf_counter() - start) * 1000
        usage = response.usage
        cost = compute_cost(model, usage.prompt_tokens, usage.completion_tokens)

        record = LLMCallRecord(
            timestamp=time.time(),
            feature=feature,
            model=model,
            input_tokens=usage.prompt_tokens,
            output_tokens=usage.completion_tokens,
            latency_ms=latency_ms,
            cost_usd=cost,
            user_id=user_id,
        )
        self._records.append(record)

        return response.choices[0].message.content

    def cost_by_feature(self) -> dict:
        """Aggregate cost breakdown by feature."""
        from collections import defaultdict

        breakdown = defaultdict(lambda: {"calls": 0, "cost_usd": 0.0, "tokens": 0})
        for record in self._records:
            breakdown[record.feature]["calls"] += 1
            breakdown[record.feature]["cost_usd"] += record.cost_usd
            breakdown[record.feature]["tokens"] += record.input_tokens + record.output_tokens

        return dict(breakdown)

    def print_cost_report(self) -> None:
        breakdown = self.cost_by_feature()
        total = sum(v["cost_usd"] for v in breakdown.values())

        print(f"\nLLM Cost Report - Total: ${total:.4f}")
        print(f"{'Feature':<25} {'Calls':>8} {'Tokens':>10} {'Cost USD':>12} {'% of total':>12}")
        print("-" * 70)

        for feature, stats in sorted(breakdown.items(), key=lambda x: -x[1]["cost_usd"]):
            pct = stats["cost_usd"] / total * 100 if total else 0
            print(
                f"  {feature:<23} {stats['calls']:>8} {stats['tokens']:>10} "
                f"  ${stats['cost_usd']:>9.4f}   {pct:>9.1f}%"
            )

:::danger Never Optimize Before You Measure

The most expensive optimization mistake is optimizing the wrong thing. Before applying any strategy from this lesson, instrument your current system to understand:

What fraction of requests touch each feature?
What is the average token count (input + output) per feature?
What is the model distribution (are you using GPT-4 for FAQ lookups)?
What is the cache hit rate (if you have caching)?

Teams that skip measurement typically spend weeks optimizing a feature that represents 2% of cost while ignoring the feature that represents 60% of cost. Always profile first.

:::

:::warning Semantic Cache Threshold Tuning Is Critical

Semantic caching returns a cached response when similarity exceeds a threshold (e.g., 0.92). A threshold that is too low causes incorrect cache hits - similar-sounding questions with different correct answers return wrong cached responses. This is a correctness bug, not just a quality issue.

Example: "Who is the CEO of Apple?" and "Who is the CEO of Google?" can score a cosine similarity of roughly 0.87, depending on the embedding model. With a threshold of 0.85, the cache would return the Apple answer for the Google question.

Validate your similarity threshold with adversarial examples before deploying semantic caching to production. A/B test with a subset of traffic. Start with a high threshold (0.95) and lower it gradually while monitoring accuracy metrics.

:::
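One way to run that validation is to score candidate thresholds against labelled prompt pairs before rollout. The sketch below is illustrative only: `LabelledPair`, `hit_rate`, `false_hit_rate`, and the similarity scores in `PAIRS` are assumptions made up for this example, and real scores would come from your own embedding model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabelledPair:
    """Two prompts, their measured cosine similarity, and whether a single
    cached answer would be correct for both (hypothetical evaluation record)."""
    prompt_a: str
    prompt_b: str
    similarity: float
    same_answer: bool


def hit_rate(pairs: list[LabelledPair], threshold: float) -> float:
    """Fraction of pairs that would be served from cache at this threshold."""
    if not pairs:
        return 0.0
    return sum(p.similarity >= threshold for p in pairs) / len(pairs)


def false_hit_rate(pairs: list[LabelledPair], threshold: float) -> float:
    """Fraction of cache hits at this threshold that return a wrong answer."""
    hits = [p for p in pairs if p.similarity >= threshold]
    if not hits:
        return 0.0
    return sum(not p.same_answer for p in hits) / len(hits)


# Toy similarity scores, invented for illustration.
PAIRS = [
    LabelledPair("Who is the CEO of Apple?", "Who is the CEO of Google?", 0.87, False),
    LabelledPair("What are your hours?", "When are you open?", 0.96, True),
    LabelledPair("How do I reset my password?", "I forgot my password", 0.93, True),
    LabelledPair("Cancel my order", "Track my order", 0.90, False),
]

for threshold in (0.85, 0.92, 0.95):
    print(threshold, hit_rate(PAIRS, threshold), false_hit_rate(PAIRS, threshold))
```

On this toy set, a threshold of 0.85 serves every pair from cache but half of those hits are wrong, while 0.92 removes the wrong hits at the price of a lower hit rate. The same sweep on real labelled traffic shows where that trade-off sits for your embedding model.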

:::warning Prompt Caching Only Works With Stable Prefix Content

Anthropic and OpenAI prompt caching matches on the longest common prefix of your prompt. If your system prompt includes dynamic content (current date, user name, session ID), the prefix changes with every request and the cache never hits.

Structure your prompts as:

[stable system instructions]       ← cached
[stable few-shot examples]         ← cached
---
User: [dynamic user message]       ← not cached (that's fine)

Never put dynamic content in the system prompt unless it is truly necessary for correctness. Move it to the user message turn instead.

:::
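A minimal sketch of that layout for a chat-style `messages` list follows. The prompt text, the few-shot pair, and the `build_messages` helper are hypothetical; the point is only that everything before the final user turn is byte-identical across requests.

```python
from datetime import date

# Hypothetical stable content: identical for every request, so it can be cached.
STABLE_SYSTEM_PROMPT = "You are a support assistant. Answer concisely."
FEW_SHOT = [
    {"role": "user", "content": "What are your business hours?"},
    {"role": "assistant", "content": "We are open 9am-5pm, Monday to Friday."},
]


def build_messages(user_message: str, user_name: str, today: date) -> list[dict]:
    """Keep the prefix stable and append dynamic data in the last user turn."""
    dynamic_context = f"[context] date={today.isoformat()} user={user_name}"
    return [
        {"role": "system", "content": STABLE_SYSTEM_PROMPT},
        *FEW_SHOT,
        # Dynamic content lives here, after the cacheable prefix.
        {"role": "user", "content": f"{dynamic_context}\n\n{user_message}"},
    ]


first = build_messages("Where is my invoice?", "Ada", date(2025, 1, 15))
second = build_messages("How do I upgrade?", "Grace", date(2025, 1, 16))
print(first[:-1] == second[:-1])  # shared prefix is identical across requests
```

Because the date and user name sit in the final turn, two requests from different users on different days still share the same prefix.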

Interview Questions

Q1: Your LLM inference cost is $50,000/month and growing linearly with users. You have 8 weeks to cut it by 50%. What do you do first?

First, measure. Spend week 1 instrumenting cost attribution: which features, which models, how many tokens per request, current cache hit rate. This reveals the distribution - typically a small number of features accounts for most of the cost.

Given 8 weeks, a pragmatic ordering:

Weeks 1–2: Model selection audit. Identify which requests go to expensive models (GPT-4, GPT-4o) but could be served by a cheaper one (GPT-4o-mini). In many B2B SaaS products, 50–70% of requests are classifiable as "simple" and can shift to GPT-4o-mini (about 33× cheaper than GPT-4o on output tokens at the prices listed in MODEL_COSTS). This is often a configuration change, not code.

Weeks 3–4: Deploy semantic caching for high-volume, similar-query use cases. A customer support chatbot with 1,000 FAQ variants often achieves 40–60% cache hit rates, cutting cost in half for that feature.

Weeks 5–6: Prompt auditing - are you sending 4,000-token prompts where 2,000 tokens are boilerplate? Trim or compress them, and cap conversation history where full context is not needed.

Weeks 7–8: Enable provider-level prompt caching (Anthropic/OpenAI) for the remaining high-cost features, restructuring prompts so that stable content comes first.

A realistic outcome: 40–60% cost reduction in 8 weeks with little or no user-facing impact, provided you monitor quality on the requests you reroute or serve from cache.
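The arithmetic behind an estimate like this can be sketched in a few lines. The function below is a back-of-the-envelope model, not a forecast: `projected_monthly_cost` and its parameters are hypothetical names, and it assumes that spend share equals request share, that cache hits cost nothing, and that caching applies evenly to all remaining traffic.

```python
def projected_monthly_cost(
    baseline_usd: float,
    share_moved_to_cheap_model: float,
    cheap_to_expensive_price_ratio: float,
    cache_hit_rate: float = 0.0,
) -> float:
    """Estimate the bill after moving part of the spend to a cheaper model,
    then serving a fraction of the remaining calls from cache."""
    kept = baseline_usd * (1.0 - share_moved_to_cheap_model)
    moved = baseline_usd * share_moved_to_cheap_model * cheap_to_expensive_price_ratio
    return (kept + moved) * (1.0 - cache_hit_rate)


baseline = 50_000.0
# Move half of the spend to a model about 33x cheaper, with no caching yet.
after_routing = projected_monthly_cost(baseline, 0.5, 1 / 33)
# Add a 20% cache hit rate on top of the routing change.
after_caching = projected_monthly_cost(baseline, 0.5, 1 / 33, cache_hit_rate=0.2)

print(f"after routing: ${after_routing:,.0f}")
print(f"after caching: ${after_caching:,.0f}")
print(f"total reduction: {1 - after_caching / baseline:.0%}")
```

Under these assumptions the routing change alone gets close to the 50% target, which is why the model selection audit comes first in the ordering above.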

Q2: Explain prompt caching (Anthropic/OpenAI) - what is it, how does it work, and what do you need to do to benefit from it?

Prompt caching stores the key-value (KV) attention state for a prompt prefix on the model provider's servers. When a subsequent request has the same prefix, the provider can skip recomputing attention for those tokens - the model starts from the cached state and processes only the new tokens.

How it works: Decoder-only LLMs use causal attention - each token attends only to the tokens before it. If you have a 2,000-token system prompt followed by a 50-token user message, the attention computations for the system prompt tokens are identical across all requests that use the same system prompt. Caching avoids recomputing them.

What you need to do:

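Anthropic: explicitly mark cacheable content with cache_control: {"type": "ephemeral"}. Cost: cache writes are billed at a premium over the base input price (about 25% at the time of writing), and cached reads cost about 10% of the base input price.
OpenAI: automatic for prompts of 1,024 tokens or more. No code changes needed. Cached input tokens are billed at a discount (50% when the feature launched; the discount varies by model).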

Design requirement: Put stable content (system prompt, few-shot examples, document to analyze) at the beginning of the prompt before the variable user input. Cache matching is prefix-based - any dynamic content in the system prompt breaks caching for all requests.
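For the Anthropic case, the request shape can be sketched as follows. The `build_cached_request` helper and its default values are hypothetical; the `system` block list with a `cache_control` marker follows the Anthropic Messages API format. The function only builds keyword arguments, so it runs without network access.

```python
def build_cached_request(
    system_prompt: str,
    user_message: str,
    model: str = "claude-3-5-haiku-20241022",
    max_tokens: int = 500,
) -> dict:
    """Build keyword arguments for a Messages API call whose system prompt
    is marked as a cacheable prefix."""
    return {
        "model": model,
        "max_tokens": max_tokens,
        "system": [
            {
                "type": "text",
                "text": system_prompt,
                # Marks everything up to and including this block as cacheable.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        # The variable part comes after the cached prefix.
        "messages": [{"role": "user", "content": user_message}],
    }


request = build_cached_request(
    system_prompt="You are a support assistant. <long, stable instructions here>",
    user_message="Where is my invoice?",
)
print(request["system"][0]["cache_control"])
```

With the official SDK, the dictionary would be passed as `anthropic.Anthropic().messages.create(**request)`. The response's `usage` object reports `cache_creation_input_tokens` and `cache_read_input_tokens`, which is the simplest way to confirm the cache is actually being hit. Prompts shorter than the provider's minimum cacheable length are processed without caching.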

Q3: What is semantic caching and when would it incorrectly return a cached response?

Semantic caching embeds each incoming prompt into a vector and finds the nearest cached prompt by cosine similarity. If similarity exceeds a threshold, it returns the cached response instead of calling the LLM.

Incorrect cached responses happen when:

Semantic similarity without answer equivalence: "Who is the CEO of Apple?" and "Who is the CEO of Google?" are semantically similar (same sentence structure, same domain) but have different correct answers. If your threshold is too low, the second question returns the cached answer to the first.
Context-dependent questions: "What is the latest news?" asked twice - same question, different correct answers because the world changed. Time-sensitive queries should not be semantically cached.
User-specific queries: "What is my account balance?" is semantically similar for all users but requires different answers for each. Never cache personalized queries.

Mitigation: Set a high similarity threshold (start at 0.95), keep time-sensitive and user-specific queries out of the semantic cache (use an exact-match cache scoped to the user and a short TTL, or bypass caching entirely), and run ongoing accuracy evaluation on a sample of cache hits.
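The exclusion step can start as a simple pre-filter in front of the semantic cache. The sketch below is a rough heuristic under stated assumptions: the keyword patterns and the `semantic_cache_allowed` name are invented for illustration, and a production filter would more likely key off the feature or endpoint than off the prompt wording.

```python
import re

# Hypothetical keyword patterns; tune these against your own traffic.
TIME_SENSITIVE = re.compile(
    r"\b(today|now|latest|current|yesterday|tomorrow|this week)\b", re.IGNORECASE
)
USER_SPECIFIC = re.compile(r"\b(my|mine|me|i)\b", re.IGNORECASE)


def semantic_cache_allowed(prompt: str) -> bool:
    """Return False for prompts whose correct answer depends on time or user."""
    if TIME_SENSITIVE.search(prompt):
        return False
    if USER_SPECIFIC.search(prompt):
        return False
    return True


for prompt in (
    "What are your business hours?",
    "What is the latest news?",
    "What is my account balance?",
):
    print(prompt, "->", semantic_cache_allowed(prompt))
```

A filter like this errs on the side of skipping the cache, which costs money but never correctness.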

Q4: How would you implement a cost dashboard to track LLM spend by product feature?

Three layers:

Instrumentation layer: Wrap every LLM call with cost tracking middleware. Log: timestamp, feature name, model, input tokens, output tokens, latency, cost in USD. Calculate cost with: (input_tokens/1M × input_cost) + (output_tokens/1M × output_cost). Persist to a time-series database (InfluxDB, PostgreSQL with TimescaleDB, or a data warehouse like BigQuery).

Aggregation layer: Compute daily/weekly cost per feature. Calculate: cost per 1M tokens by feature, cost per user by feature, feature cost as % of total, month-over-month growth rate per feature. These are your budget line items.

Alerting layer: Set budget alerts per feature per day. Alert when a feature exceeds 2× its rolling 7-day average (anomaly detection for prompt injection or infinite loops). Alert when total daily cost exceeds 90% of budget. Alert when cache hit rate drops below 30% (indicates cache invalidation or configuration issue).

In practice: a single PostgreSQL table with columns (timestamp, feature, model, input_tokens, output_tokens, cost_usd, user_id, request_id) plus a Grafana dashboard on top is sufficient for most teams.
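The alerting rules above reduce to a few comparisons once the daily aggregates exist. This is a sketch with hypothetical names (`check_alerts` and the alert labels); in practice the inputs would come from queries against the cost table.

```python
from statistics import mean
from typing import Optional


def check_alerts(
    feature_history_usd: list[float],
    feature_today_usd: float,
    total_today_usd: float,
    daily_budget_usd: float,
    cache_hit_rate: Optional[float] = None,
) -> list[str]:
    """Apply the three alert rules to one feature's daily numbers."""
    alerts = []

    # Rule 1: the feature costs more than 2x its rolling 7-day average.
    window = feature_history_usd[-7:]
    if window and feature_today_usd > 2 * mean(window):
        alerts.append("feature_cost_anomaly")

    # Rule 2: total spend has passed 90% of the daily budget.
    if total_today_usd > 0.9 * daily_budget_usd:
        alerts.append("daily_budget_90_percent")

    # Rule 3: cache hit rate has dropped below 30%.
    if cache_hit_rate is not None and cache_hit_rate < 0.30:
        alerts.append("low_cache_hit_rate")

    return alerts


print(check_alerts([10.0] * 7, 25.0, 95.0, 100.0, cache_hit_rate=0.2))
print(check_alerts([10.0] * 7, 12.0, 50.0, 100.0, cache_hit_rate=0.5))
```

The first call trips all three rules and the second trips none.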

Q5: A competitor claims their product is 10× cheaper than yours. Your product uses GPT-4o; theirs uses a fine-tuned LLaMA-3 8B on self-hosted infrastructure. What are the actual trade-offs?

Cost difference is real but incomplete as an analysis.

Their advantage: LLaMA-3 8B with vLLM on an H100 generates ~3,000 tokens/second at ~$3/hour ≈ $0.28/1M tokens. GPT-4o at $20/1M output tokens is about 70× more expensive on output. For high-volume, repetitive tasks where the 8B model has been fine-tuned adequately, this cost advantage compounds dramatically.

Your advantage:

Quality ceiling: Fine-tuned 8B models can match GPT-4o on specific, narrow tasks - but GPT-4o generalizes better on novel or complex queries. If your product handles diverse, high-stakes requests, the quality gap matters.
Operational overhead: They pay for GPU infrastructure, on-call engineers, model updates, security patching, and downtime incidents. This can easily be $20,000–100,000/month in engineering time. At low volume, self-hosting costs more than the API.
Reliability: OpenAI and Anthropic run global serving infrastructure and handle capacity, failover, and upgrades for you, with uptime commitments that depend on your plan. Self-hosted requires active management.
Development speed: Fine-tuning takes time and data. GPT-4o is available today for any new task.

When to self-host: when your monthly token volume makes self-hosting cheaper than API costs even after counting infrastructure and engineering overhead. The break-even is typically $10,000–30,000/month in API costs for a small team.
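The per-token figure and the break-even point both follow from simple arithmetic. The sketch below uses the throughput, GPU price, and API price quoted in this answer; the function names and the $20,000 monthly overhead are assumptions chosen for illustration.

```python
def self_hosted_cost_per_million(
    gpu_hourly_usd: float,
    tokens_per_second: float,
    utilization: float = 1.0,
) -> float:
    """USD per 1M generated tokens on a single GPU at a given utilization."""
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return gpu_hourly_usd / tokens_per_hour * 1_000_000


def break_even_tokens_per_month(
    api_usd_per_million: float,
    self_hosted_usd_per_million: float,
    fixed_overhead_usd: float,
) -> float:
    """Monthly token volume above which self-hosting is cheaper than the API."""
    saving_per_million = api_usd_per_million - self_hosted_usd_per_million
    if saving_per_million <= 0:
        return float("inf")  # self-hosting never pays off at these prices
    return fixed_overhead_usd / saving_per_million * 1_000_000


per_million = self_hosted_cost_per_million(gpu_hourly_usd=3.0, tokens_per_second=3000)
volume = break_even_tokens_per_month(20.0, per_million, fixed_overhead_usd=20_000)

print(f"self-hosted: ${per_million:.2f} per 1M tokens")
print(f"break-even: {volume / 1e9:.2f}B tokens per month")
```

The `utilization` parameter matters in practice: a GPU that sits idle half the time doubles the effective per-token cost, which is one reason low-volume teams rarely come out ahead by self-hosting.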

Q6: What is request routing and how would you build a classifier to route requests to different model tiers?

Request routing uses a classifier to predict which model tier a request needs, then routes it accordingly. The goal: route "easy" requests to cheap models and "hard" requests to expensive models.

Training the classifier:

Collect 1,000–10,000 historical requests with human labels or quality signals (thumbs up/down, escalation rate, retry rate)
For each request, label which model tier gave acceptable quality
Extract features: prompt length, keyword signals, question structure, domain
Train a small classifier (logistic regression, gradient boosting, or a fine-tuned small LM)
Calibrate confidence thresholds to control the precision-recall trade-off

Production architecture:

The routing classifier must be fast (under 10ms) - use a small model or rule-based system
Start conservative: only route requests where the classifier is confident (score > 0.9) the cheap model suffices
Monitor quality for routed requests separately - if the small model handles 80% of traffic with equal user satisfaction scores, the routing is working

Practical shortcut: For many applications, a simple heuristic (prompt length + keyword matching) works surprisingly well and requires no training data. Build the heuristic first, then replace it with a trained classifier once you have quality signal data.
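A heuristic router of that kind might look like the sketch below. The keyword list, the 60-word cutoff, and the `route` function are illustrative assumptions rather than tuned values; the model names match the tiers used elsewhere in this lesson.

```python
# Hypothetical signals that a request needs the stronger model.
HARD_KEYWORDS = (
    "analyze", "compare", "explain why", "debug",
    "step by step", "prove", "refactor",
)

CHEAP_MODEL = "gpt-4o-mini"
STRONG_MODEL = "gpt-4o"


def route(prompt: str, max_simple_words: int = 60) -> str:
    """Pick a model tier from prompt length and keyword signals."""
    text = prompt.lower()
    if len(text.split()) > max_simple_words:
        return STRONG_MODEL  # long prompts tend to carry more context to reason over
    if any(keyword in text for keyword in HARD_KEYWORDS):
        return STRONG_MODEL
    return CHEAP_MODEL


for prompt in (
    "What are your business hours?",
    "Debug this stack trace and explain why it fails.",
):
    print(route(prompt), "<-", prompt)
```

Logging the routing decision alongside quality signals (retries, thumbs down, escalations) produces the labelled data needed to train a real classifier later.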

