

Why an LLM Gateway

The Night Everything Broke

It was 11:47 PM on a Tuesday when the on-call engineer's phone started screaming. The AI writing assistant - the flagship product that 80,000 paying customers relied on to draft emails, reports, and marketing copy - had been returning HTTP 429 errors for the past twenty-three minutes. The team scrambled onto a Zoom call. Somebody checked the Anthropic status page: all systems operational. Somebody else pulled up the logs: the API keys were valid, the requests were being sent correctly, but the provider was rate-limiting every single one of them. The quota was gone.

But nobody could explain which team had burned through it. Three separate product teams - the email assistant team, the content generation team, and the internal analytics team - were all sharing a single set of API keys. One of them had kicked off a batch job. The analytics team had decided to regenerate metadata summaries for their entire data catalog: 400,000 documents, each requiring one Claude call, launched at 9 PM with no throttling and no coordination with the teams who shared the same key. By midnight, the daily token quota was gone. The user-facing product collapsed. The on-call engineer had no dashboard to tell him which team was responsible, no tooling to throttle the offending job, and no fallback to route traffic to a different provider. He could only wait.

The incident lasted four hours. The post-mortem produced a brutal truth: the company had seventeen different places in the codebase where LLM calls were made. Each was a direct HTTP call to a provider with its own hardcoded key, its own error-handling logic written slightly differently by a different engineer on a different sprint, its own retry strategy that sometimes retried 4xx errors (which would never succeed), its own cost assumptions baked into a comment that hadn't been updated since the pricing changed. Seventeen independent integrations masquerading as a platform.

This is the problem an LLM gateway exists to solve.

Why Direct Integration Fails at Scale

In the early days of a team's AI journey, calling providers directly is perfectly reasonable. You install the Anthropic SDK, add your key to the environment, call client.messages.create(), and ship. The code is simple and there are no unnecessary abstractions. It works.

Then the codebase grows. The problems don't arrive all at once - they accumulate, slowly, then cascade:

  • Multiple models in production: you add a second provider because Claude's context window handles your legal documents better, but GPT-4o is faster for short completions. Now you have two SDKs, two authentication patterns, two response formats, two error-handling paths, and two sets of retry logic.
  • Cost blindness: you receive a $47,000 invoice at the end of the month. Finance asks for a breakdown. You have none. Every LLM call looked the same in the application logs - a line that says "API call succeeded." Nobody tracked which feature, which user, or which team generated which cost.
  • Rate limit chaos: multiple services share API keys. One batch job saturates the daily quota. Every other feature goes dark simultaneously. There is no isolation, no priority, no fairness.
  • No fallback: when Anthropic has a partial outage, every feature that calls Claude fails simultaneously. There is no automatic rerouting to OpenAI or Gemini. Every team scrambles independently to add a fallback to their own code.
  • Zero caching: the same expensive query - "Summarize this product's features for an e-commerce listing" - is sent to the LLM hundreds of times per hour because nobody built a shared cache layer. Every call costs money when most of them could be served from a cache hit.
  • Scattered observability: request traces are in CloudWatch, error counts are in Datadog, costs are in a spreadsheet, and latency metrics don't exist. Nobody has a coherent picture of what the LLM infrastructure is actually doing.

A gateway addresses every single one of these failure modes with a single architectural change: route all LLM calls through one place.

The Gateway Pattern: Architecture First

An LLM gateway is a reverse proxy and policy enforcement layer that sits between your application code and every LLM provider. From the application's perspective, it looks like one endpoint - typically OpenAI-compatible at POST /v1/chat/completions. From the gateway's perspective, it receives requests, applies a set of policies, and forwards them to the appropriate provider.

The gateway does not change what the LLM does. It changes who gets to talk to the LLM, how often, at what cost, with what resilience guarantees, and with what visibility.

What the Gateway Actually Does

Unified API

The gateway exposes a single OpenAI-compatible endpoint. Application code calls this endpoint regardless of which model ultimately handles the request. Switching from Claude to GPT-4o is a configuration change - one line in a YAML file - not a code change requiring a new SDK import, a new authentication pattern, and updated response parsing logic.

This matters at the org level. When every team calls the same endpoint, the platform team can push changes - new models, updated routing rules, revised fallback chains - without asking product teams to update their code.
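As a rough illustration of what "one endpoint" means for callers, the sketch below builds the same OpenAI-style request for two different models; only the `model` string changes. The gateway URL, the API key, and the model aliases are hypothetical placeholders, and the requests are constructed but never sent.

```python
import json
import urllib.request

# Hypothetical internal gateway address; substitute your own deployment.
GATEWAY_URL = "http://llm-gateway.internal:4000/v1/chat/completions"


def build_chat_request(model: str, prompt: str, api_key: str) -> urllib.request.Request:
    """Build an OpenAI-compatible chat request aimed at the gateway."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        GATEWAY_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


# Same call shape for both models; the gateway decides which provider serves each.
claude_req = build_chat_request("claude-sonnet", "Summarize this contract.", "team-key")
gpt_req = build_chat_request("gpt-4o", "Summarize this contract.", "team-key")
```

Because the URL, headers, and body shape are identical, moving a feature between models touches configuration rather than integration code.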

Model Routing

The gateway can route requests to different models based on rules defined at configuration time:

  • Cost-based routing: use Claude Haiku for FAQ lookups, Claude Sonnet for document analysis, Claude Opus for complex reasoning that actually needs it
  • Capability routing: route requests with large contexts to models with 200k+ token windows
  • Latency routing: for real-time features, always pick the fastest available model
  • A/B routing: send 10% of traffic to a new model to compare quality and cost before full rollout
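The routing rules above can be sketched as a small rule function. This is a minimal illustration, not how any particular gateway implements routing: the model aliases, the 100,000-token cutoff, and the `RouteRequest` fields are all assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class RouteRequest:
    task: str              # e.g. "faq", "document_analysis", "complex_reasoning"
    input_tokens: int      # estimated prompt size
    realtime: bool = False


# Hypothetical aliases; a real gateway maps these to provider model IDs in config.
COST_TIERS = {
    "faq": "claude-haiku",
    "document_analysis": "claude-sonnet",
    "complex_reasoning": "claude-opus",
}
DEFAULT_MODEL = "claude-sonnet"
LONG_CONTEXT_MODEL = "long-context-model"
FASTEST_MODEL = "claude-haiku"
LONG_CONTEXT_THRESHOLD = 100_000  # assumed cutoff, tune per deployment


def choose_model(req: RouteRequest) -> str:
    """Apply capability, then latency, then cost rules in that order."""
    if req.input_tokens > LONG_CONTEXT_THRESHOLD:
        return LONG_CONTEXT_MODEL            # capability routing
    if req.realtime:
        return FASTEST_MODEL                 # latency routing
    return COST_TIERS.get(req.task, DEFAULT_MODEL)  # cost-based routing


def in_experiment(user_id: str, percent: int) -> bool:
    """A/B routing: deterministically place about `percent`% of users in the test arm."""
    first_byte = hashlib.sha256(user_id.encode("utf-8")).digest()[0]
    return first_byte * 100 // 256 < percent
```

Hashing the user ID keeps each user in the same arm across requests, which makes quality comparisons between the two models easier to interpret.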

Semantic Caching

The gateway embeds each incoming query and compares it to cached query vectors using cosine similarity. If similarity exceeds a threshold (typically 0.92–0.95), the cached response is returned immediately. No LLM call is made, no tokens are consumed, and the user gets a response in milliseconds instead of seconds.

For applications with repetitive queries, this is often the highest-ROI optimization available. FAQ bots, documentation assistants, and support chat systems can see cache hit rates in the 30–50% range after a few weeks of warm-up traffic, depending on how repetitive their queries are.
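A minimal sketch of the lookup logic follows. To stay self-contained it uses a bag-of-words count vector as a stand-in for a real embedding model, and a linear scan instead of a vector index; both are simplifying assumptions, and the 0.93 threshold is just one value from the range mentioned above.

```python
import math
import re
from collections import Counter
from typing import List, Optional, Tuple


def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.93):
        self.threshold = threshold
        self.entries: List[Tuple[Counter, str]] = []

    def get(self, query: str) -> Optional[str]:
        """Return the cached response for the most similar query, if similar enough."""
        vec = toy_embed(query)
        best_score, best_response = 0.0, None
        for cached_vec, response in self.entries:
            score = cosine(vec, cached_vec)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def put(self, query: str, response: str) -> None:
        self.entries.append((toy_embed(query), response))


cache = SemanticCache()
cache.put("How do I install the SDK?", "Run the installer, then import the package.")
hit = cache.get("how do i install the sdk")       # same words, different casing
miss = cache.get("What is your refund policy?")   # unrelated query
```

A real deployment would swap `toy_embed` for an embedding model so that paraphrases with different wording also score above the threshold.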

Fallback and Retry

When a provider returns a 429 or a 5xx, the gateway retries with exponential backoff and then automatically routes to a fallback model if retries are exhausted. From the application's perspective, the request succeeded - slower by however long the retries took, and perhaps handled by a different model, but not failed. Most provider outages become invisible to users.
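The retry-then-fallback behaviour can be sketched as below. `ProviderError`, the retry count, and the base delay are assumptions for illustration; each provider is modelled as a zero-argument callable so the logic can be read without any SDK.

```python
import time
from typing import Callable, Optional, Sequence


class ProviderError(Exception):
    """Hypothetical error type carrying the provider's HTTP status."""

    def __init__(self, status: int):
        super().__init__(f"provider returned HTTP {status}")
        self.status = status


def is_retryable(status: int) -> bool:
    """Only 429 and 5xx are worth retrying; other 4xx errors will not succeed."""
    return status == 429 or 500 <= status < 600


def call_with_fallback(
    providers: Sequence[Callable[[], str]],
    max_retries: int = 2,
    base_delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try each provider in order, retrying retryable errors with exponential backoff."""
    last_error: Optional[ProviderError] = None
    for call in providers:
        for attempt in range(max_retries + 1):
            try:
                return call()
            except ProviderError as err:
                if not is_retryable(err.status):
                    raise  # e.g. 400 or 401: a bad request fails everywhere
                last_error = err
                if attempt < max_retries:
                    sleep(base_delay_s * 2 ** attempt)  # 0.5s, 1.0s, ...
    raise RuntimeError("all providers failed") from last_error
```

Passing `sleep` as a parameter keeps the function testable: a test can record the delays instead of waiting for them.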

Rate Limiting

The gateway enforces token budgets per user, per team, and per feature. It tracks consumption in real time using Redis and rejects requests that exceed configured limits - returning a 429 with a Retry-After header. Batch jobs get their own quota bucket that cannot starve the real-time user-facing features.
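The sketch below shows the token-bucket idea with separate buckets per feature. It keeps state in process memory purely for readability; a multi-replica gateway would hold the same counters in Redis, as described above. The class name, bucket keys, and limits are illustrative assumptions.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class Bucket:
    capacity: float       # maximum tokens the bucket can hold
    refill_per_s: float   # tokens restored per second
    tokens: float         # tokens currently available
    updated_at: float     # clock reading at the last refill


class TokenBudgetLimiter:
    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self.clock = clock
        self.buckets: Dict[str, Bucket] = {}

    def configure(self, key: str, tokens_per_minute: float) -> None:
        """Create a full bucket for one user, team, or feature."""
        self.buckets[key] = Bucket(
            capacity=tokens_per_minute,
            refill_per_s=tokens_per_minute / 60.0,
            tokens=tokens_per_minute,
            updated_at=self.clock(),
        )

    def try_consume(self, key: str, tokens: float) -> Tuple[bool, float]:
        """Return (allowed, retry_after_seconds) for a request costing `tokens`."""
        bucket = self.buckets[key]
        now = self.clock()
        elapsed = now - bucket.updated_at
        bucket.tokens = min(bucket.capacity, bucket.tokens + elapsed * bucket.refill_per_s)
        bucket.updated_at = now
        if tokens <= bucket.tokens:
            bucket.tokens -= tokens
            return True, 0.0
        # Rejected: report how long until enough tokens have refilled.
        return False, (tokens - bucket.tokens) / bucket.refill_per_s
```

The second element of the return value is what the gateway would place in the `Retry-After` header of its 429 response. Because each key has its own bucket, draining the batch bucket leaves the real-time bucket untouched.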

Cost Tracking

The gateway calculates cost at the time of each request: (input_tokens × input_price + output_tokens × output_price) / 1,000,000, where both prices are quoted in USD per million tokens. It tags each cost event with user ID, team ID, and feature name. Finance gets a per-team monthly breakdown. Engineering gets a per-feature cost-per-request metric. Anomaly detection fires a Slack alert when hourly spend suddenly spikes.
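A worked instance of the formula, as a small sketch. The prices ($3.00 input and $15.00 output per million tokens) and the token counts are example figures only; check current provider pricing before relying on them.

```python
def request_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_price_per_m: float,
    output_price_per_m: float,
) -> float:
    """Cost of one request, with prices quoted in USD per million tokens."""
    return (
        input_tokens * input_price_per_m + output_tokens * output_price_per_m
    ) / 1_000_000


# 1,200 input tokens and 350 output tokens at $3.00 / $15.00 per million:
#   (1,200 * 3.00 + 350 * 15.00) / 1,000,000 = (3,600 + 5,250) / 1,000,000
cost = request_cost_usd(1_200, 350, 3.00, 15.00)

# The cost event carries the dimensions needed for later attribution.
cost_event = {
    "user_id": "user_123",
    "team_id": "docs",
    "feature": "docs-assistant",
    "cost_usd": cost,
}
```

A single request costs under one cent here, which is why attribution matters: the spend only becomes visible once it is aggregated by team and feature.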

Observability

Every request flowing through the gateway is logged with: provider, model, latency, input tokens, output tokens, cost, cache hit/miss, user ID, and any error codes. Traces connect the LLM call to the upstream request that triggered it. Dashboards show token throughput, cost trends, cache hit rates, fallback frequency, and P95 latency - all in one place.
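One way to represent that log record is a dataclass serialized to JSON, sketched below. The class name and field names are assumptions chosen to mirror the list above; they are not a standard schema.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class LLMRequestEvent:
    user_id: str
    provider: str
    model: str
    input_tokens: int
    output_tokens: int
    cost_usd: float
    latency_ms: float
    cache_hit: bool
    error_code: Optional[str] = None
    request_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    timestamp: float = field(default_factory=time.time)

    def to_json(self) -> str:
        """Serialize as one structured JSON line, ready for a log shipper."""
        return json.dumps(asdict(self), sort_keys=True)


event = LLMRequestEvent(
    user_id="user_123",
    provider="anthropic",
    model="claude-sonnet",
    input_tokens=1200,
    output_tokens=350,
    cost_usd=0.00885,
    latency_ms=840.0,
    cache_hit=False,
)
log_line = event.to_json()
```

Emitting one JSON object per request keeps every field queryable, so dashboards for cost trends or cache hit rate become aggregations over these records.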

Historical Context: Why This Category Exists Now

The concept of an API gateway is not new. Kong, AWS API Gateway, and Nginx have handled reverse proxy, authentication, rate limiting, and circuit breaking for REST services since the mid-2010s. The pattern of centralizing cross-cutting concerns at the infrastructure layer rather than the application layer is a well-established principle in distributed systems engineering.

When LLMs became production workloads in 2022–2023, teams initially tried to adapt existing API gateways to handle LLM traffic. The results were mediocre. Traditional gateways don't understand tokens. They can rate limit by request count but not by token consumption. They can cache by URL but not by semantic similarity. They can route by path pattern but not by model capability or cost tier.

The first purpose-built LLM gateway patterns emerged in late 2022. LiteLLM was open-sourced by BerriAI in 2023, quickly becoming the de facto standard for teams that needed a self-hosted solution. Portkey launched with a stronger emphasis on observability and enterprise features. Helicone, OpenRouter, and MLflow's AI gateway followed. By 2024, "LLM gateway" had become a recognized infrastructure category, appearing in AI engineering job descriptions and system design interview questions.

The unifying insight across all these tools: translate provider-specific protocols into a unified OpenAI-compatible API, then apply routing, caching, and observability policies on top.

The $40k to $12k Story

A B2B SaaS company in the developer tools space had deployed Claude across four product areas: a code assistant, a documentation generator, a PR reviewer, and a customer-facing support chatbot. Total monthly LLM spend: $40,000. The team felt it was high but couldn't explain exactly why.

An AI infrastructure audit identified four specific problems:

Problem 1: No semantic caching. The documentation generator fielded 40,000 queries per day from developers. When stripped of minor wording variations, the unique question types numbered under 2,000. "How do I install the SDK?", "Show me the installation steps", and "What's the setup process?" all warranted the same answer and were all being sent to the LLM independently. The top 200 most common questions accounted for 31% of all documentation generator spend.

Problem 2: Wrong model for the job. The support chatbot was using Claude Opus for every request, including simple FAQ lookups like "What is your refund policy?" and "Where do I find my API key?" These questions require no frontier model reasoning - Claude Haiku handles them at 10x lower cost with no quality difference.

Problem 3: No input filtering. The PR reviewer sent entire diffs to the LLM, including auto-generated migration files, lock files, and build artifacts. Nobody had added preprocessing logic to filter these out.

Problem 4: Duplicate requests. A bug in the code assistant caused completions to be requested twice under certain network timeout conditions. No deduplication logic existed at the application layer.

Four gateway-layer changes:

  1. Semantic cache with 0.93 similarity threshold on the documentation generator
  2. Model routing: support chatbot calls classified as "simple FAQ" route to Claude Haiku
  3. Input preprocessor: strip generated files from PR diffs before sending
  4. Request deduplication using idempotency keys

Month-over-month results: documentation generator cache hit rate 34%; chatbot cost reduction 67%; PR reviewer cost reduction 22%. Total: $40,000/month → $12,000/month. A 70% reduction with zero degradation in output quality.
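The input preprocessor from the third change can be sketched in a few lines. The glob patterns below are assumptions about what counts as a generated file; a real deployment would derive the list from its own repositories.

```python
from fnmatch import fnmatchcase
from typing import List, Tuple

# Hypothetical patterns for files that add tokens but no review value.
GENERATED_PATTERNS = [
    "*.lock",
    "*package-lock.json",
    "*/migrations/*",
    "dist/*",
    "build/*",
]


def is_generated(path: str) -> bool:
    """True if the path matches any generated-file pattern."""
    return any(fnmatchcase(path, pattern) for pattern in GENERATED_PATTERNS)


def filter_diff(files: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Drop generated files from a list of (path, diff_text) pairs."""
    return [(path, diff) for path, diff in files if not is_generated(path)]


changed = [
    ("src/app.py", "+print('hello')"),
    ("poetry.lock", "+thousands of lines"),
    ("db/migrations/0001_init.py", "+auto-generated schema"),
    ("dist/bundle.js", "+minified output"),
]
reviewable = filter_diff(changed)
```

Only `src/app.py` survives the filter, so the reviewer model sees the hand-written change and none of the generated noise.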

What the Gateway Does Not Do

The gateway is infrastructure. Understanding its scope prevents misplaced expectations.

It does not:

  • Improve model quality - routing efficiently doesn't make a bad prompt better
  • Replace prompt engineering - the gateway moves requests; it doesn't write them
  • Handle fine-tuning workflows - model training happens outside the gateway
  • Solve evaluation - whether LLM responses are correct or helpful is a separate problem
  • Replace application-level error handling - the gateway handles infrastructure failures; application code still needs to handle cases where the LLM returns semantically wrong output

The gateway is a force multiplier for good AI engineering. It does not substitute for it.

Building a Minimal Gateway: The Core Pattern

Before reaching for LiteLLM or Portkey, it is worth understanding what a minimal gateway looks like in code. The following implements the core pattern: unified routing, exact-match caching, fallback, and cost tracking in roughly 200 lines.

```python
import hashlib
import json
import time
import uuid
from collections import defaultdict
from dataclasses import dataclass, field

import anthropic
import openai


@dataclass
class GatewayConfig:
    primary_provider: str = "anthropic"
    fallback_provider: str = "openai"
    cache_enabled: bool = True
    cost_tracking_enabled: bool = True
    request_timeout_s: float = 45.0


@dataclass
class GatewayResult:
    response: str
    provider: str
    model: str
    cost_usd: float
    cache_hit: bool
    input_tokens: int = 0
    output_tokens: int = 0
    latency_ms: float = 0.0
    request_id: str = field(default_factory=lambda: str(uuid.uuid4())[:8])


class SimpleGateway:
    """
    A minimal LLM gateway demonstrating the core pattern.

    Features:
    - Unified API: one .complete() method regardless of provider
    - Exact-match cache: deterministic cache key from messages + model
    - Automatic fallback: primary -> fallback on any error
    - Real-time cost tracking: per-user accumulation with formula
    """

    # Pricing per 1M tokens (March 2026 - verify with current provider docs)
    PRICING = {
        "anthropic": {"input": 3.00, "output": 15.00},  # claude-sonnet-4-6
        "openai": {"input": 2.50, "output": 10.00},  # gpt-4o
    }

    # Model each provider uses when it serves as the fallback
    DEFAULT_MODELS = {"anthropic": "claude-sonnet-4-6", "openai": "gpt-4o"}

    def __init__(self, config: GatewayConfig):
        self.config = config
        self.cache: dict[str, str] = {}
        self.spend_by_user: dict[str, float] = defaultdict(float)
        self.spend_by_feature: dict[str, float] = defaultdict(float)
        # Both clients read their API keys from environment variables.
        self.anthropic_client = anthropic.Anthropic()
        self.openai_client = openai.OpenAI()

    def _cache_key(self, messages: list[dict], model: str) -> str:
        """Generate a deterministic SHA-256 cache key."""
        payload = json.dumps({"messages": messages, "model": model}, sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def _calculate_cost(self, provider: str, input_tokens: int, output_tokens: int) -> float:
        """Calculate request cost in USD from token counts and pricing."""
        prices = self.PRICING.get(provider, {"input": 0, "output": 0})
        return (
            input_tokens * prices["input"] + output_tokens * prices["output"]
        ) / 1_000_000

    def _call_anthropic(
        self, messages: list[dict], model: str = "claude-sonnet-4-6"
    ) -> tuple[str, int, int]:
        """Call Anthropic and return (response_text, input_tokens, output_tokens)."""
        response = self.anthropic_client.messages.create(
            model=model,
            max_tokens=1024,
            messages=messages,
            timeout=self.config.request_timeout_s,
        )
        return (
            response.content[0].text,
            response.usage.input_tokens,
            response.usage.output_tokens,
        )

    def _call_openai(
        self, messages: list[dict], model: str = "gpt-4o"
    ) -> tuple[str, int, int]:
        """Call OpenAI and return (response_text, input_tokens, output_tokens)."""
        response = self.openai_client.chat.completions.create(
            model=model,
            messages=messages,
            timeout=self.config.request_timeout_s,
        )
        return (
            response.choices[0].message.content,
            response.usage.prompt_tokens,
            response.usage.completion_tokens,
        )

    def complete(
        self,
        messages: list[dict],
        user_id: str = "anonymous",
        feature: str = "default",
        model: str = "claude-sonnet-4-6",
    ) -> GatewayResult:
        """
        Route a completion request through the gateway.

        Checks cache -> tries primary provider -> falls back on error.
        Tracks cost per user and per feature for every call that reaches a provider.
        """
        cache_key = self._cache_key(messages, model)

        # Step 1: Check exact-match cache
        if self.config.cache_enabled and cache_key in self.cache:
            return GatewayResult(
                response=self.cache[cache_key],
                provider="cache",
                model=model,
                cost_usd=0.0,
                cache_hit=True,
            )

        # Step 2: Try primary provider, fall back on any error
        provider = self.config.primary_provider
        served_model = model  # updated if the fallback ends up serving the request
        start = time.perf_counter()

        try:
            if provider == "anthropic":
                text, input_tok, output_tok = self._call_anthropic(messages, served_model)
            else:
                text, input_tok, output_tok = self._call_openai(messages, served_model)

        except Exception as primary_err:
            print(f"Primary ({provider}) failed: {primary_err}. Trying fallback.")
            provider = self.config.fallback_provider
            served_model = self.DEFAULT_MODELS[provider]
            try:
                if provider == "openai":
                    text, input_tok, output_tok = self._call_openai(messages, served_model)
                else:
                    text, input_tok, output_tok = self._call_anthropic(messages, served_model)
            except Exception as fallback_err:
                raise RuntimeError(
                    f"All providers failed.\n"
                    f"  Primary: {self.config.primary_provider}: {primary_err}\n"
                    f"  Fallback: {self.config.fallback_provider}: {fallback_err}"
                ) from fallback_err

        latency_ms = (time.perf_counter() - start) * 1000
        cost = self._calculate_cost(provider, input_tok, output_tok)

        # Step 3: Track costs
        if self.config.cost_tracking_enabled:
            self.spend_by_user[user_id] += cost
            self.spend_by_feature[feature] += cost

        # Step 4: Store in cache on success
        if self.config.cache_enabled:
            self.cache[cache_key] = text

        return GatewayResult(
            response=text,
            provider=provider,
            model=served_model,  # the model that actually produced the response
            cost_usd=cost,
            cache_hit=False,
            input_tokens=input_tok,
            output_tokens=output_tok,
            latency_ms=latency_ms,
        )

    def get_spend_summary(self) -> dict:
        """Return current spend breakdown by user and feature."""
        return {
            "by_user": dict(self.spend_by_user),
            "by_feature": dict(self.spend_by_feature),
            "total": sum(self.spend_by_user.values()),
        }


if __name__ == "__main__":
    gateway = SimpleGateway(GatewayConfig())

    messages = [{"role": "user", "content": "Explain what a transformer model is in two sentences."}]

    # First call - hits the LLM
    result = gateway.complete(messages, user_id="user_123", feature="docs-assistant")
    print(f"Provider: {result.provider} | Cost: ${result.cost_usd:.6f} | Cache: {result.cache_hit}")
    print(f"Latency: {result.latency_ms:.0f}ms | Tokens: {result.input_tokens}+{result.output_tokens}")
    print(f"Response: {result.response[:120]}...\n")

    # Second call with same messages - hits the cache
    result2 = gateway.complete(messages, user_id="user_123", feature="docs-assistant")
    print(f"Provider: {result2.provider} | Cost: ${result2.cost_usd:.6f} | Cache: {result2.cache_hit}")

    print(f"\nSpend summary: {gateway.get_spend_summary()}")
```

This short implementation captures the essence of every production gateway: unified routing, caching, fallback, and cost tracking. The production tools - LiteLLM Proxy and Portkey - do the same things with more providers, better performance, built-in admin APIs, and observability dashboards around them.

Self-Hosted vs Managed: The Architecture Decision

The first decision when adopting a gateway is whether to run it yourself or use a managed service.

| Dimension | Self-Hosted (LiteLLM) | Managed (Portkey) |
|---|---|---|
| Data residency | Fully controlled - stays in your VPC | Data flows through provider infrastructure |
| Setup time | Hours (Docker Compose) | Minutes (API key only) |
| Operational overhead | You manage uptime, upgrades, scaling | Zero - provider manages it |
| Observability | Requires external tool (Langfuse, Helicone) | Built in - traces, cost analytics, dashboards |
| Provider support | 100+ via LiteLLM | 250+ via Portkey |
| Cost | Engineering time + hosting | SaaS subscription |
| Compliance | Suitable for HIPAA, SOC 2, regulated industries | Check provider certifications |

For regulated industries (healthcare, finance, government), self-hosted is typically required by compliance - data cannot flow through a third-party SaaS. For early-stage startups or product teams that want velocity without infrastructure investment, a managed gateway is the right tradeoff.

The Gateway as Team Infrastructure

The most useful frame for engineering leaders is this: the LLM gateway is to AI teams what the API gateway is to microservices teams. You wouldn't have every microservice implement its own authentication, rate limiting, and circuit breaking - you centralize those concerns at the infrastructure layer. The same principle applies to AI.

The gateway is where cross-cutting concerns live. It is infrastructure that every AI feature benefits from, without requiring each feature team to build it independently.

When the gateway is in place, a product engineer adding a new LLM-powered feature can:

  • Call POST /v1/chat/completions with the OpenAI SDK they already know
  • Trust that fallbacks are handled if the primary provider goes down
  • Trust that their feature's costs are tracked and attributable to their team
  • Trust that rate limits won't let their feature accidentally consume another team's quota
  • Trust that traces are automatically captured for debugging and quality review

Without the gateway, every feature is a one-off integration with its own fragile assumptions. With the gateway, every feature inherits the platform's resilience, observability, and cost controls automatically.

Measuring Gateway Value: Key Metrics

You cannot evaluate whether the gateway is delivering value without measuring the right things. These are the six metrics that matter most:

| Metric | Measurement | Goal |
|---|---|---|
| Cache hit rate | Cache hits / total requests | Above 25% for FAQ-type features |
| Cost per request by feature | Total cost / request count per feature | Declining trend over time |
| Fallback frequency | Fallback triggers / total requests | Below 1% under normal conditions |
| P95 gateway overhead | Gateway latency (excluding LLM) | Below 30ms |
| Budget alert lead time | Hours between soft alert and hard limit | Alert fires at least 24 hours before limit |
| Cost attribution coverage | Requests with user_id+team_id / total | 100% - no anonymous requests |

Track these in your observability platform (Datadog, Prometheus, Grafana) from day one. The cache hit rate and cost per feature metrics will be the most immediately useful for justifying the infrastructure investment.

When You Need a Gateway (and When You Don't)

You need a gateway when any of the following are true:

  • Multiple LLM providers in production: more than one SDK in your codebase
  • Multiple teams using LLMs: cost attribution is essential for accountability
  • Availability requirements: provider outages cannot cause user-facing downtime
  • Cost visibility requirements: finance is asking where the spend is going
  • Rate limit problems: production traffic is hitting provider limits during peak usage
  • Compliance or audit requirements: all LLM requests must be logged

You probably don't need a gateway when:

  • You have one model, one team, and LLM spend under $500/month
  • You are building a prototype with no production users
  • Your LLM calls are fully asynchronous batch jobs with no user-facing SLA

For most teams building AI products past the MVP stage, the gateway is not optional. It is the infrastructure foundation on which everything else is built.

Production Engineering Notes

:::tip Adopt a gateway before the first incident, not after
The instinct to add the gateway "when we need it" results in adding it during a crisis - when a $50k surprise invoice arrives, or when a provider outage causes a four-hour user-facing incident. Gateway adoption is 10x easier in a calm sprint than in a post-incident fire. Build it early.
:::

:::warning Don't add gateway complexity before you have real multi-model traffic
If you have exactly one LLM, one team, and spend under $1,000/month, a gateway adds operational overhead without proportional benefit. The inflection point is when you add a second provider or a second team. Add the gateway at that moment.
:::

:::danger The shared API key anti-pattern destroys fairness
Never share a single API key across multiple product teams without per-team quota enforcement at the gateway. Rate limits are per-key. One team's batch job will always find a way to saturate the shared limit at the worst possible time. This is not a matter of trust - it is a matter of incentives. Centralized enforcement is the only reliable solution.
:::

:::info The gateway adds 10–30ms of latency
A network-mode gateway adds one extra round-trip - typically 10–30ms. For LLM responses that take 500–2000ms, this is a 0.5–6% overhead. For sub-100ms latency requirements (rare for LLM features), use SDK-mode provider abstraction instead of a network proxy.
:::

Common Mistakes

Mistake 1: Treating the gateway as optional until a crisis

Teams wait until a major incident before investing. By then, the cost - in downtime, in surprise invoices, in engineering scramble - has already been paid. Build the gateway at the second team or second provider, whichever comes first.

Mistake 2: Bypassing the gateway for "special cases"

Once established, the gateway must handle all LLM traffic. No exceptions. Every bypass degrades cost attribution, rate limit enforcement, and fallback coverage. "Just this one batch job" and "just this internal tool" erode the system over months until the bypass paths outnumber the gateway paths.

Mistake 3: Setting semantic cache similarity thresholds too low

A 0.80 similarity threshold will serve incorrect cached responses. "What is a Python snake?" and "What is Python programming?" have a cosine similarity around 0.82. Start at 0.95 and tune downward carefully with empirical quality measurement.

Mistake 4: Not testing the fallback path

Teams configure fallbacks but never test them. Under real provider failure conditions, the untested fallback either doesn't trigger correctly or adds unacceptable latency. Add chaos testing: periodically inject synthetic provider failures in staging and verify the fallback chain behaves as expected within acceptable latency bounds.

Mistake 5: Tracking costs in aggregate, not by dimension

"We spent $40,000 last month on LLMs" is not actionable. "The documentation generator spent $18,000, of which $11,000 went to queries that now hit the cache we deployed this week" is actionable. Dimension your cost tracking from day one: by user, by team, by feature, by model.

Mistake 6: Running the gateway without Redis

In-memory cache and rate limiter state work fine for a single-replica gateway. The moment you add a second replica for high availability, state is no longer shared - each replica has its own cache and its own rate limit counters, so cache hit rates drop and the effective limits multiply by the number of replicas. Redis is non-negotiable for production multi-replica deployments. Set it up from the beginning.

Interview Q&A

Q: What is an LLM gateway and why does a production team need one?

An LLM gateway is a reverse proxy and policy enforcement layer that sits between application code and LLM providers. It exposes a unified OpenAI-compatible API regardless of which underlying provider handles each request. It provides: model routing based on cost or capability, semantic caching to avoid redundant LLM calls, per-user and per-team rate limiting, real-time cost tracking with budget alerts, automatic fallback on provider failures, and centralized observability. Teams need it when they have more than one LLM provider in use, more than one team consuming LLMs, or any cost, availability, or compliance requirements that cannot be met by per-service direct integrations.

Q: How does a gateway prevent a batch job from breaking real-time user-facing features that share the same API key?

The gateway enforces per-feature and per-user token quotas using token buckets stored in Redis. The batch pipeline is registered with a lower-priority quota bucket (e.g., 500k TPM) that is separate from the real-time chat feature's bucket (e.g., 2M TPM). When the batch job's bucket empties, the gateway starts returning 429s to the batch job - without affecting the real-time feature's independent bucket. The batch job must retry with backoff and wait for its bucket to refill. The real-time feature is completely isolated from this. Additionally, you can implement priority queuing at the gateway layer: real-time requests jump the queue; batch requests wait.

Q: What is the difference between a gateway and using the LiteLLM SDK in-process?

The LiteLLM Python SDK is a client-side abstraction that normalizes different provider APIs behind a common interface. It runs in-process within your application code and is only available in Python. The LiteLLM Proxy (gateway mode) is a standalone HTTP server. All your services - Python, Go, Node.js, any language - send requests to http://your-gateway:4000/v1/chat/completions using the standard OpenAI client. Changes to routing rules, cache configuration, or fallback logic propagate to all services by updating the gateway's config file, without any service redeployment. Centralized cost tracking and caching work across all services simultaneously. Use the SDK for single-service Python applications; use the proxy for any multi-service architecture.

Q: How does semantic caching work at the gateway layer, and what threshold should you start with?

When a request arrives, the gateway generates a dense vector embedding of the user's query using an embedding model (e.g., OpenAI text-embedding-3-small). It searches a vector store (Redis with RediSearch, Qdrant, or FAISS) for the nearest cached query vector using approximate nearest-neighbor search. If the cosine similarity between the incoming query and the nearest cached query exceeds the configured threshold, the cached response is returned immediately without calling the LLM. On a cache miss, the LLM is called, the response is stored with its query vector, and future similar queries will hit the cache. Start with a threshold of 0.95 and tune downward based on empirical quality measurement of cache hits - do not go below 0.90 without extensive testing.

Q: Walk me through how a gateway tracks cost by dimension in real time.

Each provider API response includes the input and output token counts. The gateway applies the formula: cost = (input_tokens / 1e6) * input_price_per_million + (output_tokens / 1e6) * output_price_per_million. It then writes this cost to Redis using INCRBYFLOAT - an atomic increment operation - against three keys: cost:user:{user_id}:{period}, cost:team:{team_id}:{period}, and cost:feature:{feature}:{period}. Each key accumulates the spend for the current billing period. Budget check rules fire Slack alerts when any dimension's total crosses a configured threshold. Hard limit enforcement pre-checks the Redis totals before making the LLM call and rejects the request if the budget is already exceeded. This entire flow adds roughly 1–2ms per request (two Redis round trips, assuming the reads and the increments are each pipelined) and provides real-time spend visibility at all granularities.
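The key scheme and the budget pre-check can be sketched as follows. To keep the example runnable without a Redis server, `InMemoryStore` is a hypothetical stand-in that exposes only `incrbyfloat` and `get`, the two operations the flow needs; the function names and the period format are also assumptions.

```python
from typing import Dict, Optional


class InMemoryStore:
    """Hypothetical stand-in for a Redis client (only incrbyfloat and get)."""

    def __init__(self) -> None:
        self._data: Dict[str, float] = {}

    def incrbyfloat(self, name: str, amount: float) -> float:
        self._data[name] = self._data.get(name, 0.0) + amount
        return self._data[name]

    def get(self, name: str) -> Optional[str]:
        value = self._data.get(name)
        return None if value is None else str(value)


def cost_keys(user_id: str, team_id: str, feature: str, period: str) -> Dict[str, str]:
    """Build the three per-dimension keys for one billing period."""
    return {
        "user": f"cost:user:{user_id}:{period}",
        "team": f"cost:team:{team_id}:{period}",
        "feature": f"cost:feature:{feature}:{period}",
    }


def record_cost(store, cost_usd: float, user_id: str, team_id: str,
                feature: str, period: str) -> None:
    """Add one request's cost to every dimension it belongs to."""
    for key in cost_keys(user_id, team_id, feature, period).values():
        store.incrbyfloat(key, cost_usd)


def over_budget(store, key: str, limit_usd: float) -> bool:
    """Pre-check run before the LLM call: has this dimension hit its limit?"""
    return float(store.get(key) or 0.0) >= limit_usd
```

A missing key reads as zero spend, so a new user or feature passes the pre-check until its first cost is recorded.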

Q: How would you architect an LLM gateway for a company with strict data residency requirements where all data must remain in their AWS region?

Deploy LiteLLM Proxy in the company's AWS VPC using ECS or Kubernetes. All requests from application services route to the internal LiteLLM endpoint - no external SaaS gateway. Use Amazon ElastiCache (Redis) for the caching and rate limiting state, and Amazon RDS (PostgreSQL) for spend tracking. For providers: use Amazon Bedrock in the required region for Claude access, reached through a VPC interface endpoint so that traffic stays on the AWS network, with Titan or other Bedrock-hosted models as the fallback. Calling the Anthropic API directly avoids a third-party SaaS gateway, but request data still leaves AWS for Anthropic, so it is only an option if the residency policy permits it. For observability: emit traces to AWS CloudWatch or self-hosted Langfuse within the VPC. With Bedrock-only routing, no request data leaves AWS. The LiteLLM container is the only egress point, and it is deployed with security group rules that permit only outbound HTTPS to the allowed provider endpoints.

Q: What observability data should every LLM request generate at the gateway layer?

Every request should emit a structured log event with at minimum: request ID, timestamp, user ID, team ID, feature name, provider, model, input token count, output token count, calculated cost in USD, latency in milliseconds, cache hit or miss (and similarity score if hit), fallback triggered (true/false), any error code and type. These fields enable: per-feature cost analytics (aggregate by feature+model), per-user spend tracking (aggregate by user_id), fallback rate monitoring (percentage of requests where fallback_triggered=true), cache hit rate (cache_hit aggregated over time), P95 latency per model (latency percentile by model), and anomaly detection (sudden cost spike for one feature). Emit these to your observability platform as structured JSON, not as free-text logs, so they can be queried and aggregated efficiently.

The Request Lifecycle Through a Gateway

Understanding how a single request flows through a production gateway clarifies what the gateway is doing at each step and why each step exists.

Steps 1–4 add approximately 5–15ms total. Steps 5–6 (the LLM call) typically account for 500–5000ms. The gateway overhead is therefore roughly 0.1–3% of the total request time (5ms on a 5000ms call at one extreme, 15ms on a 500ms call at the other) - negligible in practice.

The Gateway Contract: What Callers Can Expect

A well-designed LLM gateway makes specific guarantees to its callers. These guarantees should be documented and tested:

Success path: when a request succeeds, the response includes the LLM output, the model that served it, the cost in USD, the request ID (for trace correlation), and the remaining rate limit tokens. The caller does not need to know whether the response came from the primary provider, the fallback, or the semantic cache.

Rate limit path: when a rate limit is hit, the gateway returns HTTP 429 with a Retry-After header in seconds. The response body includes the dimension that was exceeded (user, team, or feature) and the wait time. Callers implement a simple "if 429, wait Retry-After seconds, retry" loop.

Budget exceeded path: when the budget is exceeded, the gateway returns HTTP 402 (Payment Required) with a generic message that does not expose internal budget details. Internal engineering systems can query the cost tracker API for details.

Provider failure path: when all configured providers fail or are unavailable, the gateway returns HTTP 503 with a retry recommendation. Callers should not retry immediately - they should wait at least 30 seconds before retrying to avoid thundering herd on already-stressed providers.
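
From the caller's side, the four paths above reduce to a small decision function. The sketch below is one possible shape, assuming the status codes and the Retry-After header behave as described; `Decision` and `decide` are hypothetical names, and the 1-second default for a missing header is an assumption.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass
class Decision:
    action: str          # "return", "retry", or "fail"
    wait_seconds: float  # how long to sleep before a retry


def decide(status_code: int, headers: Mapping[str, str]) -> Decision:
    """Map a gateway response to the caller behavior the contract describes."""
    if status_code == 200:
        return Decision("return", 0.0)
    if status_code == 429:
        # Rate limit path: honor Retry-After (seconds); fall back to 1s if absent.
        try:
            wait = float(headers.get("Retry-After", "1"))
        except ValueError:
            wait = 1.0
        return Decision("retry", max(wait, 0.0))
    if status_code == 402:
        # Budget exceeded: retrying cannot help until the budget changes.
        return Decision("fail", 0.0)
    if status_code == 503:
        # All providers failed: wait at least 30 seconds to avoid a thundering herd.
        return Decision("retry", 30.0)
    return Decision("fail", 0.0)


print(decide(429, {"Retry-After": "7"}))
print(decide(503, {}))
```

A plain dict lookup is case-sensitive; HTTP client libraries such as httpx expose case-insensitive header mappings, which is what a real caller would pass in.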

Extending the Minimal Gateway: Adding Real Semantic Cache

The minimal gateway above uses exact-match caching. Production gateways use semantic caching - embedding-based similarity search. Here is how to extend the SimpleGateway with a vector similarity cache. Anthropic's API does not provide an embeddings endpoint, so the embedding step below is a deterministic placeholder to be replaced with a real embedding model:

import hashlib
from typing import Optional

import numpy as np


class SemanticCacheLayer:
    """
    Embedding-based semantic cache for LLM query/response pairs.

    Uses cosine similarity to find cached responses to semantically equivalent queries.
    Threshold of 0.95 means: only serve a cache hit if the queries are extremely similar.
    """

    def __init__(
        self,
        threshold: float = 0.95,
        max_entries: int = 10_000,
    ):
        self.threshold = threshold
        self.max_entries = max_entries
        # In-memory storage: list of (embedding_vector, query_text, response_text)
        self._entries: list[tuple[list[float], str, str]] = []

    def _embed(self, text: str) -> list[float]:
        """
        Generate a dense embedding for the query text.

        Placeholder only: the hash-based vector below is deterministic but NOT
        semantic, so in practice only identical text scores above the threshold.
        In production, use a fast, cheap embedding model:
        OpenAI text-embedding-3-small ($0.02/1M tokens) or
        Cohere embed-english-v3 ($0.10/1M tokens).
        """
        # In production, replace the placeholder with the actual embedding API:
        # response = openai_client.embeddings.create(
        #     input=text, model="text-embedding-3-small"
        # )
        # return response.data[0].embedding

        # Deterministic placeholder for demo purposes.
        # SHA-256 yields 32 bytes, so the vector has 32 dimensions.
        h = hashlib.sha256(text.encode()).digest()
        return [b / 255.0 for b in h]

    def _cosine_similarity(self, a: list[float], b: list[float]) -> float:
        """Compute cosine similarity between two embedding vectors."""
        va = np.array(a, dtype=np.float32)
        vb = np.array(b, dtype=np.float32)
        norm_a = np.linalg.norm(va)
        norm_b = np.linalg.norm(vb)
        if norm_a == 0 or norm_b == 0:
            return 0.0
        return float(np.dot(va, vb) / (norm_a * norm_b))

    def lookup(self, query: str) -> Optional[str]:
        """
        Look up a semantically similar query in the cache.
        Returns the cached response if found above threshold, else None.
        """
        if not self._entries:
            return None

        query_embedding = self._embed(query)
        best_score = 0.0
        best_response = None

        # Linear scan over all entries; a vector index replaces this at scale.
        for entry_embedding, _entry_query, entry_response in self._entries:
            score = self._cosine_similarity(query_embedding, entry_embedding)
            if score > best_score:
                best_score = score
                best_response = entry_response

        if best_score >= self.threshold:
            return best_response
        return None

    def store(self, query: str, response: str) -> None:
        """Store a query/response pair in the semantic cache."""
        if len(self._entries) >= self.max_entries:
            # Evict oldest entry (simple FIFO - production uses LRU or TTL)
            self._entries.pop(0)

        embedding = self._embed(query)
        self._entries.append((embedding, query, response))

    def cache_size(self) -> int:
        """Return the number of cached query/response pairs."""
        return len(self._entries)

Gateway Deployment Checklist

Before declaring a gateway production-ready, verify these items:

Functional correctness:

  • All LLM providers in use are configured and tested
  • Fallback chain is tested under simulated primary provider failure
  • Semantic cache returns correct responses above threshold
  • Rate limits enforce correctly under concurrent load (50+ simultaneous requests)
  • Budget hard limits block requests and do not allow any calls through

Observability:

  • Every request emits a structured log event with all required fields
  • Cost metrics are visible in dashboard within 60 seconds of a request
  • Cache hit rate metric is tracked per feature
  • Alert fires when a budget threshold is crossed (test in staging with a $0.01 limit)

Reliability:

  • Gateway is deployed with at least 2 replicas (no single point of failure)
  • Redis is deployed with persistence and replication (not ephemeral)
  • Gateway health check endpoint returns 200 and is configured in load balancer
  • Gateway handles Redis connection loss gracefully (fail open or fail closed - decide deliberately)
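
The fail-open versus fail-closed choice in the last item can be made explicit in code rather than left to whatever the exception handler happens to do. A minimal sketch, where `StateStoreUnavailable` and `allow_request` are hypothetical names and `check_limit` stands in for the real Redis-backed rate limit check:

```python
from typing import Callable


class StateStoreUnavailable(Exception):
    """Raised by the (hypothetical) rate-limit store when Redis is unreachable."""


def allow_request(check_limit: Callable[[], bool], fail_open: bool) -> bool:
    """
    Return the rate limiter's verdict, or the configured default if the
    state store cannot be reached.

    fail_open=True  -> keep serving traffic without limits during the outage
    fail_open=False -> reject traffic until the state store recovers
    """
    try:
        return check_limit()
    except StateStoreUnavailable:
        return fail_open


def redis_down() -> bool:
    raise StateStoreUnavailable("connection refused")


print(allow_request(redis_down, fail_open=True))
print(allow_request(redis_down, fail_open=False))
```

Failing open protects availability at the risk of unbounded spend during the outage; failing closed protects the budget at the cost of an outage for every AI feature.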

Security:

  • All API keys stored in secrets manager (not in YAML files checked into git)
  • Gateway logs never contain full API key values (only key prefixes)
  • Budget error messages shown to users are generic (no internal amounts or dimensions)
  • Admin API endpoints require internal authentication

Performance:

  • Gateway P99 overhead (excluding LLM latency) is below 50ms under peak load
  • Load test completed at 2x expected peak traffic
  • Cache warm-up strategy is defined for new deployments

The Gateway Decision: When to Start

The most common mistake is deferring the gateway until after a painful incident. Here is a practical framework for when to introduce gateway infrastructure based on team and traffic characteristics:

| Stage | Characteristics | Gateway recommendation |
| --- | --- | --- |
| POC / prototype | 1 team, 1 model, under $500/month | Skip - direct SDK calls |
| Early production | 1 team, 1 model, users in production | Add basic retry + cost logging |
| Growth stage | 2+ teams OR 2+ models | Full gateway (LiteLLM or Portkey) |
| Scale | Multiple services, $5k+/month | Self-hosted with Redis, PostgreSQL |
| Enterprise | Compliance requirements, >$50k/month | Self-hosted in VPC, SIEM integration |

The inflection points are clear: the second team and the second model. Either event makes cost attribution, rate limit isolation, and centralized routing non-negotiable. Build it at that moment - not before, not after.

What Comes Next in This Module

The remaining lessons in this LLM Gateways module go deep on each gateway capability:

  • LiteLLM (next lesson): the most widely used self-hosted gateway - SDK mode, proxy mode, virtual keys, routing strategies, and Docker deployment
  • Portkey: managed gateway with first-class observability - Config JSON system, guardrails, traces, and the Feedback API
  • Semantic Caching: embedding-based caching in depth - threshold calibration, Redis vector indexing, invalidation strategies, and ROI measurement
  • Model Fallback and Retry: exponential backoff with full jitter, FailureType taxonomy, circuit breakers, and multi-provider fallback chains
  • Load Balancing: multi-key routing strategies, per-key health checking, and capacity planning
  • Cost Management: real-time cost tracking, budget enforcement, anomaly detection, and monthly forecasting
  • Rate Limiting: token bucket implementation with atomic Lua scripts, sliding windows, and priority queuing

Each lesson includes production-grade Python code using the Anthropic SDK, Mermaid architecture diagrams, and interview preparation questions with detailed answers. The code in each lesson is designed to be directly adaptable to production use - not toy examples.

Summary: The Case for a Gateway

An LLM gateway is not optional for production AI applications at scale. The economics are compelling - a 70% cost reduction is achievable without quality degradation. The reliability improvements are essential - provider outages become invisible to users. The observability gains are non-negotiable for accountability at the engineering and finance level.

The gateway is the infrastructure foundation that lets you:

  • Switch providers without touching application code
  • Cache expensive queries automatically across all services
  • Isolate teams' quota usage so one job cannot break another team's feature
  • Know exactly what each feature, team, and user is spending in real time
  • Respond to provider failures automatically rather than via on-call scramble

It is not glamorous infrastructure. It does not make your models smarter. But it is the difference between an AI platform that is manageable at scale and one that generates monthly invoice surprises and 3 AM incident pages.

Gateway Cost-Benefit Analysis: A Framework

Before adopting gateway infrastructure, quantify the expected value. This calculation helps justify the engineering investment to stakeholders.

def gateway_roi_estimate(
    monthly_llm_spend: float,
    estimated_cache_hit_rate: float = 0.30,      # 30% cache hit rate (conservative)
    estimated_cost_reduction_pct: float = 0.10,  # Model routing saves another 10%
    incident_hours_per_month: float = 2.0,       # Hours of incident time monthly
    engineer_hourly_rate: float = 150.0,         # Fully loaded cost
    gateway_setup_hours: float = 20.0,           # One-time engineering investment
) -> dict:
    """
    Estimate the ROI of deploying an LLM gateway.

    Simplification: cache and routing savings are each applied to the full
    monthly spend and summed, which slightly overstates the combined effect.
    """
    # Monthly savings from semantic caching
    cache_savings = monthly_llm_spend * estimated_cache_hit_rate

    # Monthly savings from model routing (routing FAQ to cheaper models)
    routing_savings = monthly_llm_spend * estimated_cost_reduction_pct

    # Monthly savings from reduced incidents (engineer time, not LLM spend)
    incident_savings = incident_hours_per_month * engineer_hourly_rate

    total_monthly_savings = cache_savings + routing_savings + incident_savings

    # One-time setup cost
    setup_cost = gateway_setup_hours * engineer_hourly_rate

    # Payback period in months
    payback_months = (
        setup_cost / total_monthly_savings if total_monthly_savings > 0 else float("inf")
    )

    # Total savings (including incident time) relative to LLM spend
    pct_reduction = (
        round(total_monthly_savings / monthly_llm_spend * 100, 1)
        if monthly_llm_spend > 0
        else 0.0
    )

    return {
        "monthly_llm_spend": monthly_llm_spend,
        "cache_savings_monthly": round(cache_savings, 2),
        "routing_savings_monthly": round(routing_savings, 2),
        "incident_savings_monthly": round(incident_savings, 2),
        "total_monthly_savings": round(total_monthly_savings, 2),
        "pct_reduction": pct_reduction,
        "setup_cost_one_time": round(setup_cost, 2),
        "payback_period_months": round(payback_months, 1),
        # First-year savings net of the one-time setup cost
        "annual_savings": round(total_monthly_savings * 12 - setup_cost, 2),
    }


# Example: team spending $15,000/month on LLMs
roi = gateway_roi_estimate(
    monthly_llm_spend=15_000,
    estimated_cache_hit_rate=0.30,
    estimated_cost_reduction_pct=0.15,
    incident_hours_per_month=3.0,
    gateway_setup_hours=20.0,
)
print(f"Monthly spend: ${roi['monthly_llm_spend']:,.2f}")
print(f"Cache savings: ${roi['cache_savings_monthly']:,.2f}/mo")
print(f"Routing savings: ${roi['routing_savings_monthly']:,.2f}/mo")
print(f"Incident savings: ${roi['incident_savings_monthly']:,.2f}/mo")
print(f"Total monthly savings: ${roi['total_monthly_savings']:,.2f}/mo ({roi['pct_reduction']}% reduction)")
print(f"Setup cost (one-time): ${roi['setup_cost_one_time']:,.2f}")
print(f"Payback period: {roi['payback_period_months']} months")
print(f"Annual net savings: ${roi['annual_savings']:,.2f}")
# Output (example):
# Monthly spend: $15,000.00
# Cache savings: $4,500.00/mo
# Routing savings: $2,250.00/mo
# Incident savings: $450.00/mo
# Total monthly savings: $7,200.00/mo (48.0% reduction)
# Setup cost (one-time): $3,000.00
# Payback period: 0.4 months
# Annual net savings: $83,400.00

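This framework provides a defensible business case for gateway adoption. For a team spending $15,000/month, the estimated savings equal 48% of LLM spend (45% from caching and routing, the remainder from recovered incident time) and pay back the 20-hour engineering investment in less than two weeks. These figures depend entirely on the assumed hit rate and routing savings, so replace them with measured values as soon as you have them. Present this analysis to engineering leadership before the first incident, not after.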

The Most Common Gateway Adoption Path

Teams that successfully adopt LLM gateways typically follow this sequence. Understanding the path helps you anticipate what comes next at each stage.

Week 1 - Basic proxy: deploy LiteLLM Proxy or Portkey in front of your primary provider. Update all services to call the gateway endpoint. Verify traffic is flowing through. No behavior change yet - the gateway is transparent.

Week 2 - Observability: enable structured logging at the gateway. Start seeing per-feature request counts and per-user cost accumulation. This data alone reveals surprises - features you thought were low-usage often turn out to be significant cost centers.

Week 3 - Rate limiting: configure per-feature and per-team token buckets. This requires coordination with each team to agree on limits. The first time a batch job hits a rate limit and the real-time feature keeps serving normally, the value of isolation becomes concrete.

Week 4 - Semantic cache: enable semantic caching for your highest-traffic FAQ-type features. Measure hit rate. Expect 5–15% in the first week as the cache warms up; 25–40% by the end of the month.

Month 2 - Fallbacks: configure provider fallback chains. Test them under simulated failure. The first real provider incident after this is in place demonstrates the value more convincingly than any benchmark.

Month 3 - Budget alerts: configure soft and hard budget limits per feature and per team. Set up Slack webhooks. Finance gets a dashboard they can read.

Month 6 - Optimization: use cost data to justify model routing decisions. Route FAQ traffic to the cheapest model. Route long-context requests to the appropriate model. The cost data from months 1–5 is your evidence for each routing rule.

By month 6, the gateway is a strategic asset - not just infrastructure. Engineering and finance have shared visibility. Provider outages are operational non-events. Cost attribution supports team-level accountability.

Anti-Patterns: What the Gateway Does Not Fix

The gateway is infrastructure that amplifies good AI engineering practices. It does not correct fundamental problems in how LLMs are used. Understanding what the gateway cannot fix prevents misplaced expectations.

Anti-pattern 1: Using the gateway as a substitute for prompt engineering. A poorly designed prompt will produce low-quality responses whether they go through a gateway or not. The gateway routes efficiently; it does not improve what the model receives.

Anti-pattern 2: Treating the gateway as a security boundary against prompt injection. Portkey's guardrails and LiteLLM's callbacks can detect some known injection patterns, but they are bypassable by creative phrasing. Defense against prompt injection requires application-layer input validation, output sanitization, and architectural choices about what the LLM can access.

Anti-pattern 3: Setting semantic cache thresholds based on intuition. Teams that set the threshold at 0.80 "because it feels about right" will serve semantically wrong answers. The threshold must be calibrated empirically against your specific domain's query distribution.
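
One way to calibrate empirically is to label a sample of query pairs from your own traffic as equivalent or not, then pick the lowest threshold at which the would-be cache hits are almost always correct. The sketch below assumes you already have such labeled pairs; `calibrate_threshold` and the 0.99 precision target are illustrative choices, not a standard procedure.

```python
from typing import Optional


def calibrate_threshold(
    labeled_pairs: list[tuple[float, bool]],
    target_precision: float = 0.99,
) -> Optional[float]:
    """
    labeled_pairs: (cosine_similarity, is_equivalent) for human-labeled query pairs.

    Returns the lowest observed similarity that, used as a threshold, keeps the
    share of correct cache hits at or above target_precision. Choosing the
    lowest such value maximizes hit rate. Returns None if no threshold qualifies.
    """
    candidates = sorted({score for score, _ in labeled_pairs})
    for threshold in candidates:
        # Pairs that would be served from cache at this threshold
        hits = [ok for score, ok in labeled_pairs if score >= threshold]
        if hits and sum(hits) / len(hits) >= target_precision:
            return threshold
    return None


# Toy labeled sample: (similarity, did a human judge the two queries equivalent?)
sample = [
    (0.99, True), (0.97, True), (0.96, True),
    (0.93, False), (0.90, True), (0.85, False),
]
print(calibrate_threshold(sample))
```

With a real sample of a few hundred pairs per feature, the same function can be run per feature, since FAQ-style traffic and open-ended generation usually tolerate different thresholds.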

Anti-pattern 4: Treating the gateway as a magic cost reducer. The gateway reduces costs by eliminating redundant LLM calls (caching) and routing to cheaper models where appropriate. It does not reduce the cost of calls that genuinely need to be made to the premium model. If your product's core value requires expensive model calls, the gateway improves efficiency but cannot eliminate the fundamental cost.

Anti-pattern 5: Over-centralizing without fallback for the gateway itself. If the gateway is a single point of failure and it goes down, every AI feature goes down simultaneously. Deploy the gateway with at least two replicas. Configure health checks so your load balancer can route around a failed replica. The gateway improves reliability for LLM providers; don't make the gateway itself a reliability risk.

Key Concepts Reference

| Concept | Definition | Where it matters |
| --- | --- | --- |
| OpenAI-compatible endpoint | The /v1/chat/completions request/response format that gateways expose regardless of the underlying provider | All gateway tools expose this |
| Virtual key | Gateway-scoped identifier that maps to provider credentials | Portkey, LiteLLM virtual keys |
| Token bucket | Rate limiting mechanism - bucket fills at constant rate, requests consume tokens | Rate limiting lesson |
| Cosine similarity | Vector similarity metric (range -1 to 1; values near 1 mean near-identical direction) used for semantic cache matching | Semantic caching lesson |
| Circuit breaker | Three-state (closed/open/half-open) system that stops routing to failing providers | Fallback lesson |
| Exponential backoff with jitter | Wait base * 2^n * random(0,1) between retries to spread load | Retry lesson |
| INCRBYFLOAT | Atomic Redis increment for concurrent cost accumulation | Cost tracking lesson |
| P95 latency | 95th percentile response time - 95% of requests are faster than this value | Load balancing lesson |
| Priority queuing | Queue mechanism where real-time requests are served before batch | Rate limiting lesson |
| Semantic cache warm-up | Pre-populating the cache with known high-frequency queries before launch | Semantic caching lesson |
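
Two of the entries in the table, the token bucket and exponential backoff with jitter, are easier to grasp in code. The sketch below is a single-process illustration with an injectable clock and random source so its behavior is reproducible; the class and function names are hypothetical, and the `cap` on the backoff delay is an addition not present in the table's formula.

```python
import random
from typing import Callable


class TokenBucket:
    """Bucket refills at a constant rate; each request consumes tokens."""

    def __init__(self, capacity: float, refill_per_second: float, now: float = 0.0):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = capacity  # start full
        self.last = now         # timestamp of the last refill

    def try_consume(self, amount: float, now: float) -> bool:
        """Refill for the elapsed time, then take `amount` tokens if available."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last = now
        if self.tokens >= amount:
            self.tokens -= amount
            return True
        return False


def backoff_delay(
    attempt: int,
    base: float = 0.5,
    cap: float = 30.0,
    rng: Callable[[], float] = random.random,
) -> float:
    """Full jitter: a random delay in [0, min(cap, base * 2^attempt))."""
    return min(cap, base * (2 ** attempt)) * rng()


bucket = TokenBucket(capacity=10, refill_per_second=2)
print(bucket.try_consume(10, now=0.0))  # drains the bucket
print(bucket.try_consume(1, now=0.0))   # rejected: nothing left
print(bucket.try_consume(1, now=0.5))   # allowed: 0.5s * 2 tokens/s refilled 1 token
```

A gateway shared by several replicas needs the bucket state in Redis with an atomic update, which is why the rate limiting lesson uses Lua scripts rather than in-process state like this.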

The Gateway's Role in the Broader AI Engineering Stack

The LLM gateway is one component in a larger AI engineering stack. Understanding where it sits relative to other components helps avoid scope confusion and prevents teams from expecting the gateway to do things it is not designed for.

The application layer is where AI engineering quality decisions are made - prompts, evaluation, fine-tuning. The gateway layer handles operational concerns - routing, resilience, cost, rate limiting, and observability. The infrastructure layer stores the gateway's stateful components.

A common mistake is expecting the gateway to improve output quality. The gateway routes efficiently and provides resilience - but if the prompt produces hallucinations, the gateway cannot fix that. Prompt engineering and evaluation are application-layer concerns that the gateway does not touch.

The inverse mistake is expecting application-layer code to handle operational concerns. Application code should not implement retry logic, cost tracking, or rate limiting independently - that duplicates effort across every service and produces inconsistent enforcement. Operational concerns belong in the gateway layer, where they apply uniformly to all traffic.

Testing Your Gateway Configuration

Before deploying to production, run these tests against your staging gateway:

import time

import httpx

# LLM calls routinely exceed httpx's default 5-second timeout.
LLM_TIMEOUT_SECONDS = 60.0


def test_gateway_endpoint(base_url: str, api_key: str) -> None:
    """
    Basic gateway integration test suite.
    Run against staging before every production deployment.
    """
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    payload = {
        "model": "claude-sonnet-4-6",
        "messages": [{"role": "user", "content": "Reply with exactly: GATEWAY_OK"}],
        "max_tokens": 20,
    }
    url = f"{base_url}/v1/chat/completions"

    # Test 1: Basic connectivity
    print("Test 1: Basic connectivity...")
    resp = httpx.post(url, json=payload, headers=headers, timeout=LLM_TIMEOUT_SECONDS)
    assert resp.status_code == 200, f"Expected 200, got {resp.status_code}: {resp.text}"
    data = resp.json()
    assert "choices" in data, "Response missing 'choices' field"
    print(f" PASS: Response received in {resp.elapsed.total_seconds()*1000:.0f}ms")

    # Test 2: Cache is working (second identical request should be faster)
    # This check is informational: it reports a miss but does not fail the run.
    print("Test 2: Cache effectiveness...")
    start = time.perf_counter()
    resp2 = httpx.post(url, json=payload, headers=headers, timeout=LLM_TIMEOUT_SECONDS)
    elapsed2 = time.perf_counter() - start
    assert resp2.status_code == 200, f"Expected 200, got {resp2.status_code}: {resp2.text}"
    # Cache hits should be dramatically faster (milliseconds vs seconds)
    if elapsed2 < 0.1:
        print(f" PASS: Cache hit detected ({elapsed2*1000:.0f}ms)")
    else:
        print(f" INFO: No cache hit ({elapsed2*1000:.0f}ms) - cache may need warm-up")

    # Test 3: Rate limit headers present
    print("Test 3: Rate limit headers...")
    rl_headers = [k for k in resp.headers.keys() if "rate" in k.lower() or "limit" in k.lower()]
    if rl_headers:
        print(f" PASS: Rate limit headers found: {rl_headers}")
    else:
        print(" WARN: No rate limit headers in response")

    # Test 4: Health check endpoint
    print("Test 4: Health check...")
    health_resp = httpx.get(f"{base_url}/health", headers=headers, timeout=5.0)
    assert health_resp.status_code == 200, f"Health check failed: {health_resp.status_code}"
    print(" PASS: Health check OK")

    print("\nAll gateway integration tests passed.")


# Usage:
# test_gateway_endpoint("http://localhost:4000", "your-virtual-key")

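Run this test suite as part of your staging validation pipeline before every gateway configuration change. The cache effectiveness test in particular can surface misconfigured Redis connections or incorrect cache settings before they affect production traffic. Note that it only prints INFO on a miss rather than failing, so review its output or turn the check into an assertion once caching is expected to work.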

© 2026 EngineersOfAI. All rights reserved.