Load Balancing Across Providers
The Rate Limit Ceiling
The growth team's quarterly projections landed in the engineering team's inbox on a Monday morning. The numbers were clean: the AI-powered email personalization feature was generating 23% higher response rates versus the control group. The board wanted it rolled out to the full customer base - 800,000 users - within three weeks.
The engineering team did the math over lunch. At peak send windows, the system needed to generate personalized subject lines for approximately 400 emails per second. At an average of 300 tokens per call, that was 120,000 tokens per second - 7.2 million tokens per minute. Their Anthropic account's highest rate limit tier: 400,000 tokens per minute. They were 18x short of the required throughput on a single key.
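The lunch-table arithmetic can be reproduced in a few lines. The figures are the ones quoted above; the function name `required_tpm` is only illustrative, and the key count assumes limits really are independent per key.

```python
import math


def required_tpm(requests_per_second: float, tokens_per_request: float) -> float:
    """Tokens per minute needed to sustain a given request rate."""
    return requests_per_second * tokens_per_request * 60


need = required_tpm(requests_per_second=400, tokens_per_request=300)
single_key_limit = 400_000  # TPM on the team's highest tier, per the text

shortfall = need / single_key_limit
# Only meaningful if every key gets its own independent bucket (an assumption)
keys_needed = math.ceil(shortfall)

print(f"required: {need:,.0f} TPM")
print(f"shortfall: {shortfall:.0f}x, keys needed at this tier: {keys_needed}")
```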
The team convened for a working session. Three options on the whiteboard: wait for Anthropic to grant a rate limit increase (timeline unknown, outcome uncertain), distribute traffic across multiple API keys (immediate, self-service), or add OpenAI as a secondary provider for overflow capacity (possible but required quality validation for each model). They chose multi-key load balancing as the primary approach with OpenAI overflow as a secondary safety net. By Thursday they had a working prototype. By Friday they had load-tested it to 900,000 tokens per minute across five keys.
The architecture they built had no exotic infrastructure. It was a Python class with a Redis-backed stats store, a strategy selection function, and a per-key health checker running on a background thread. The entire load balancer was under 300 lines of code. What made it effective was not complexity - it was the discipline of tracking P95 latency per key and routing to the fastest one, combined with automatic cooldown when rate limits were hit on any individual key.
This is the problem that LLM load balancing solves: distributing traffic across multiple API keys and providers to exceed the throughput ceiling that any single key imposes, while maintaining a single point of configuration and observability.
Why Provider Rate Limits Create a Scaling Problem
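Depending on the provider, rate limits are enforced per API key, per project or organization, or both. Key facts engineers need to understand before designing a load balancing solution (the first two assume per-key limits, so confirm them against your provider's documentation):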
- Rate limits are per-key: each API key has its own tokens-per-minute (TPM) and requests-per-minute (RPM) bucket that fills and drains independently
- Multiple keys multiply capacity: two keys give you two independent TPM buckets - if each is 200k TPM, you effectively have 400k TPM available across your system
- Organization-level limits exist: some providers implement a hard ceiling at the account level above the per-key limits - adding a 6th key to an account with a 1M TPM account ceiling does not give you 1.2M TPM; always verify this with your provider before designing your architecture around key multiplication
- Rate limits don't scale with money alone: you cannot simply pay more to increase limits in most provider tiers; you must request higher tiers or purchase reserved quota in advance
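The interaction between per-key buckets and an account-level ceiling can be written as a one-line formula. This is a minimal sketch; `effective_tpm` and its parameters are hypothetical names, and whether a ceiling exists at all is something only your provider can confirm.

```python
from typing import Optional


def effective_tpm(
    num_keys: int,
    per_key_tpm: int,
    account_ceiling_tpm: Optional[int] = None,
) -> int:
    """Usable tokens-per-minute across keys, capped by an account ceiling if one exists."""
    total = num_keys * per_key_tpm
    if account_ceiling_tpm is None:
        return total
    return min(total, account_ceiling_tpm)


# Two independent 200k keys with no account ceiling
print(effective_tpm(2, 200_000))
# A 6th key under a 1M account ceiling adds nothing
print(effective_tpm(6, 200_000, account_ceiling_tpm=1_000_000))
```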
Load balancing across keys and providers also provides compounding benefits beyond raw throughput:
- Resilience: if one key is temporarily blocked (billing issue, suspicious activity flag, key rotation), traffic continues on other keys without manual intervention
- Cost optimization: route to the cheapest provider that meets quality requirements for each specific request type - use Claude Haiku for simple FAQ queries and Claude Sonnet for complex reasoning, without the application needing to know which backend served the request
- Geographic routing: some providers offer region-specific endpoints with meaningfully different latency characteristics - a key routing to the EU endpoint may have 40ms less latency for European users than one routing to the US endpoint
- A/B testing at scale: split traffic between model versions by manipulating the key routing weights, comparing quality or cost before a full rollout without changing application code
Routing Strategies Compared
| Strategy | Best for | Weakness |
|---|---|---|
| Round-robin | All keys are equivalent, uniform request sizes | Ignores latency differences and rate limit state |
| Weighted | Keys with different tier limits | Weights must be tuned manually as tiers change |
| Least-connections | Streaming requests where in-flight count matters | Ignores token consumption variance across requests |
| Latency-based | User-facing latency-sensitive features | Requires tracking P95 per key over a rolling window |
| Usage-aware | Strict TPM/RPM compliance required | Requires Redis for accurate shared state across replicas |
For most production user-facing deployments, latency-based routing is the best default. It tends to steer away from overloaded or degraded keys (which often show higher latency before they start returning errors), adapts to geographic routing differences without manual configuration, and degrades gracefully under load without requiring explicit threshold tuning. The strategy is self-correcting: slow keys get fewer requests, allowing them to recover, after which they may receive more traffic again. Latency alone does not enforce TPM or RPM limits, so it is paired with an availability check and cooldowns in the implementation below.
Routing Strategy Selection: Decision Flowchart
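The decision logic can be approximated from the comparison table above. The sketch below is one possible ordering of the questions, not a definitive rule; the function name and its boolean parameters are hypothetical, and `"usage_aware"` is a label for the table row rather than a strategy implemented later in this section.

```python
def choose_strategy(
    *,
    strict_limit_compliance: bool = False,
    latency_sensitive: bool = False,
    streaming_heavy: bool = False,
    mixed_tiers: bool = False,
) -> str:
    """Map deployment characteristics to a routing strategy name."""
    if strict_limit_compliance:
        return "usage_aware"        # needs shared state (e.g. Redis) across replicas
    if latency_sensitive:
        return "latency"            # needs a rolling P95 per key
    if streaming_heavy:
        return "least_connections"  # in-flight count matters most
    if mixed_tiers:
        return "weighted"           # weights mirror each key's tier limits
    return "round_robin"            # equivalent keys, uniform request sizes


print(choose_strategy(latency_sensitive=True))
print(choose_strategy(mixed_tiers=True))
print(choose_strategy())
```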
Implementation: ProviderLoadBalancer
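The core implementation tracks per-key statistics and implements four pluggable routing strategies: round-robin, weighted, least-connections, and latency-based. Usage-aware routing from the table is not implemented here. All state mutations made through the balancer are guarded by a reentrant lock for thread safety in concurrent environments.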
```python
import anthropic
import time
import threading
import random
from dataclasses import dataclass, field
from collections import deque
from typing import Optional


@dataclass
class ApiKeyStats:
    """Runtime performance statistics for one API key. Updated after every request."""
    active_requests: int = 0
    total_requests: int = 0
    total_failures: int = 0
    recent_latencies_ms: deque = field(default_factory=lambda: deque(maxlen=30))
    # These two windows feed the RPM/TPM estimates, so they must be able to hold
    # a full minute of traffic. A cap below rpm_limit would make current_rpm
    # saturate before the 90% threshold in is_available could ever trigger.
    recent_request_times: deque = field(default_factory=lambda: deque(maxlen=5_000))
    recent_token_counts: deque = field(default_factory=lambda: deque(maxlen=5_000))
    last_failure_time: Optional[float] = None
    cooldown_until: Optional[float] = None


@dataclass
class ApiKey:
    """Configuration and runtime state for one API key."""
    key: str
    provider: str
    model: str
    rpm_limit: int = 500
    tpm_limit: int = 200_000
    weight: float = 1.0  # for weighted routing
    stats: ApiKeyStats = field(default_factory=ApiKeyStats)

    @property
    def p95_latency_ms(self) -> float:
        """P95 response latency based on the last 30 requests."""
        lats = list(self.stats.recent_latencies_ms)
        if len(lats) < 3:
            return 0.0  # not enough data - treat as "new key" for exploration
        sorted_lats = sorted(lats)
        idx = int(len(sorted_lats) * 0.95)
        return sorted_lats[min(idx, len(sorted_lats) - 1)]

    @property
    def current_rpm(self) -> int:
        """Estimate current requests-per-minute from recent request timestamps."""
        now = time.time()
        recent = [t for t in self.stats.recent_request_times if t > now - 60]
        return len(recent)

    @property
    def current_tpm(self) -> int:
        """Estimate current tokens-per-minute from recent usage."""
        now = time.time()
        # Timestamps and token counts are appended together, so they stay aligned
        pairs = list(zip(self.stats.recent_request_times, self.stats.recent_token_counts))
        return sum(tokens for ts, tokens in pairs if ts > now - 60)

    @property
    def is_available(self) -> bool:
        """True if the key is below rate limit thresholds and not in cooldown."""
        now = time.time()
        if self.stats.cooldown_until and now < self.stats.cooldown_until:
            return False
        if self.current_rpm >= self.rpm_limit * 0.90:  # 90% capacity threshold
            return False
        if self.current_tpm >= self.tpm_limit * 0.90:
            return False
        return True
```
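The RPM and TPM estimates depend on a one-minute sliding window. As an alternative to a fixed-size deque, the window can be pruned by age so that it never under-counts at high request rates. The following is a minimal, single-threaded sketch under that assumption; `SlidingWindowUsage` and its injectable `clock` are hypothetical names introduced for illustration.

```python
import time
from collections import deque
from typing import Callable, Deque, Tuple


class SlidingWindowUsage:
    """Counts requests and tokens seen during the last `window_s` seconds."""

    def __init__(self, window_s: float = 60.0, clock: Callable[[], float] = time.monotonic):
        self.window_s = window_s
        self._clock = clock
        self._events: Deque[Tuple[float, int]] = deque()  # (timestamp, tokens)

    def _prune(self, now: float) -> None:
        # Drop events that are window_s old or older
        cutoff = now - self.window_s
        while self._events and self._events[0][0] <= cutoff:
            self._events.popleft()

    def record(self, tokens: int) -> None:
        now = self._clock()
        self._events.append((now, tokens))
        self._prune(now)

    def requests(self) -> int:
        self._prune(self._clock())
        return len(self._events)

    def tokens(self) -> int:
        self._prune(self._clock())
        return sum(tok for _, tok in self._events)


# Deterministic demonstration with a fake clock
fake_now = [0.0]
usage = SlidingWindowUsage(clock=lambda: fake_now[0])
for _ in range(500):
    usage.record(300)
print(usage.requests(), usage.tokens())
fake_now[0] = 61.0  # one minute later, everything has aged out
print(usage.requests(), usage.tokens())
```

In a multi-threaded gateway the same structure would need a lock around `record` and the two readers.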
```python
class ProviderLoadBalancer:
    """
    Production LLM load balancer with pluggable routing strategies.
    Thread-safe: all state mutations are inside a reentrant lock.
    For Redis-backed shared state across replicas, see notes below.
    """

    def __init__(self, keys: list[ApiKey], strategy: str = "latency"):
        self.keys = keys
        self.strategy = strategy
        self._lock = threading.RLock()
        self._rr_index = 0  # for round-robin state

    # ─── Strategy implementations ─────────────────────────────────────────────

    def _available_keys(self) -> list[ApiKey]:
        return [k for k in self.keys if k.is_available]

    def _select_round_robin(self) -> Optional[ApiKey]:
        available = self._available_keys()
        if not available:
            return None
        with self._lock:
            key = available[self._rr_index % len(available)]
            self._rr_index += 1
        return key

    def _select_weighted(self) -> Optional[ApiKey]:
        """Weighted random selection proportional to each key's weight field."""
        available = self._available_keys()
        if not available:
            return None
        total = sum(k.weight for k in available)
        r = random.uniform(0, total)
        cumulative = 0.0
        for key in available:
            cumulative += key.weight
            if r <= cumulative:
                return key
        return available[-1]

    def _select_least_connections(self) -> Optional[ApiKey]:
        """Route to the key with the fewest active in-flight requests."""
        available = self._available_keys()
        if not available:
            return None
        return min(available, key=lambda k: k.stats.active_requests)

    def _select_latency_based(self) -> Optional[ApiKey]:
        """
        Route to the key with the lowest P95 latency.
        Keys with fewer than 3 requests (no latency data) are treated as "new"
        and given exploration priority to gather baseline measurements.
        """
        available = self._available_keys()
        if not available:
            return None
        # Exploration: new keys with no latency history get priority to build a baseline
        new_keys = [k for k in available if len(k.stats.recent_latencies_ms) < 3]
        if new_keys:
            return random.choice(new_keys)
        # Exploitation: route to lowest P95 latency
        return min(available, key=lambda k: k.p95_latency_ms)

    def select(self) -> Optional[ApiKey]:
        """Select a key based on the configured strategy."""
        strategies = {
            "round_robin": self._select_round_robin,
            "weighted": self._select_weighted,
            "least_connections": self._select_least_connections,
            "latency": self._select_latency_based,
        }
        # Unknown strategy names fall back to latency-based routing
        selector = strategies.get(self.strategy, self._select_latency_based)
        return selector()

    # ─── State recording ──────────────────────────────────────────────────────

    def mark_inflight(self, key: ApiKey) -> None:
        """Call when a request is dispatched to a key."""
        with self._lock:
            key.stats.active_requests += 1

    def record_success(self, key: ApiKey, latency_ms: float, tokens: int) -> None:
        """Call when a request to a key completes successfully."""
        with self._lock:
            key.stats.active_requests = max(0, key.stats.active_requests - 1)
            key.stats.total_requests += 1
            key.stats.recent_latencies_ms.append(latency_ms)
            key.stats.recent_request_times.append(time.time())
            key.stats.recent_token_counts.append(tokens)
            key.stats.last_failure_time = None
            # Only clear a cooldown that has already expired. A late success from a
            # request that was in flight must not cancel a cooldown that a
            # concurrent 429 on the same key has just set.
            if key.stats.cooldown_until and time.time() >= key.stats.cooldown_until:
                key.stats.cooldown_until = None

    def record_failure(
        self,
        key: ApiKey,
        is_rate_limit: bool = False,
        cooldown_s: Optional[float] = None,
    ) -> None:
        """
        Call when a request to a key fails.
        cooldown_s=None means "use the default": 60s for rate limits, none otherwise.
        An explicit positive cooldown_s is always applied, so callers can request a
        short cooldown for 5xx errors without flagging them as rate limits.
        """
        with self._lock:
            key.stats.active_requests = max(0, key.stats.active_requests - 1)
            key.stats.total_failures += 1
            key.stats.last_failure_time = time.time()
            if cooldown_s is None:
                cooldown_s = 60.0 if is_rate_limit else 0.0
            if cooldown_s > 0:
                key.stats.cooldown_until = time.time() + cooldown_s

    def status(self) -> list[dict]:
        """Return a status snapshot for all keys (safe for logging and monitoring)."""
        return [
            {
                "key_prefix": key.key[:12] + "...",
                "provider": key.provider,
                "model": key.model,
                "available": key.is_available,
                "active_requests": key.stats.active_requests,
                "current_rpm": round(key.current_rpm, 1),
                "current_tpm": round(key.current_tpm),
                "rpm_limit": key.rpm_limit,
                "tpm_limit": key.tpm_limit,
                "p95_latency_ms": round(key.p95_latency_ms, 1),
                "total_requests": key.stats.total_requests,
                "total_failures": key.stats.total_failures,
                "cooldown_remaining_s": max(0.0, (key.stats.cooldown_until or 0) - time.time()),
            }
            for key in self.keys
        ]
```
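The comparison table lists a usage-aware strategy that the class above does not implement. One way it might look is to pick the key with the most remaining headroom against its tighter limit. This is a sketch under stated assumptions: `KeyUsage`, `select_usage_aware`, and `min_headroom` are hypothetical names, and the 10% floor simply mirrors the 90% threshold used in `is_available`.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class KeyUsage:
    """Snapshot of one key's usage over the last minute (illustrative type)."""
    name: str
    current_tpm: int
    tpm_limit: int
    current_rpm: int
    rpm_limit: int

    @property
    def headroom(self) -> float:
        """Fraction of capacity left, taking the tighter of the two limits."""
        tpm_left = 1.0 - self.current_tpm / self.tpm_limit
        rpm_left = 1.0 - self.current_rpm / self.rpm_limit
        return min(tpm_left, rpm_left)


def select_usage_aware(
    keys: Sequence[KeyUsage], min_headroom: float = 0.10
) -> Optional[KeyUsage]:
    """Return the key with the most headroom, or None if all are near their limits."""
    candidates = [k for k in keys if k.headroom > min_headroom]
    if not candidates:
        return None
    return max(candidates, key=lambda k: k.headroom)


keys = [
    KeyUsage("key-a", current_tpm=150_000, tpm_limit=200_000, current_rpm=100, rpm_limit=500),
    KeyUsage("key-b", current_tpm=100_000, tpm_limit=400_000, current_rpm=950, rpm_limit=1000),
    KeyUsage("key-c", current_tpm=50_000, tpm_limit=200_000, current_rpm=50, rpm_limit=500),
]
chosen = select_usage_aware(keys)
print(chosen.name if chosen else "no key available")
```

For this to be accurate across replicas, the usage snapshot would have to come from shared state rather than per-process counters.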
Load-Balanced Anthropic Client with Automatic Failover
The LoadBalancedAnthropicClient wraps the balancer with automatic key failover. On rate limit: puts the key in cooldown and tries the next available key. On 4xx client errors: raises immediately (the request is invalid and retrying on another key will not help). On 5xx server errors: short cooldown and tries the next key.
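The policy in the paragraph above can be condensed into a small classification function, which is handy for unit-testing the failover rules in isolation. This is a sketch only: `Action` and `classify` are hypothetical names, and the cooldown values are the ones used elsewhere in this section (60s for 429, 30s for 5xx).

```python
from enum import Enum


class Action(Enum):
    COOLDOWN_AND_FAILOVER = "cooldown_and_failover"
    RAISE = "raise"


def classify(status_code: int) -> tuple[Action, float]:
    """Return (action, cooldown seconds) for an HTTP error status from a provider."""
    if status_code == 429:
        # Rate limit: the key recovers by itself, so park it and try another
        return Action.COOLDOWN_AND_FAILOVER, 60.0
    if 500 <= status_code <= 599:
        # Server-side problem: short cooldown, then try another key
        return Action.COOLDOWN_AND_FAILOVER, 30.0
    if 400 <= status_code <= 499:
        # Any other client error: surface it to the caller without failover
        return Action.RAISE, 0.0
    raise ValueError(f"not an error status: {status_code}")


for code in (429, 503, 400):
    action, cooldown = classify(code)
    print(code, action.value, cooldown)
```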
```python
class LoadBalancedAnthropicClient:
    """
    Anthropic client with multi-key load balancing and automatic failover.
    Key behaviors:
    - On rate limit (429): cooldown the key for 60s, try next key
    - On server error (5xx): short cooldown, try next key
    - On client error (4xx): raise immediately - don't waste another key attempt
    - On success: record latency + tokens for routing decisions
    """

    def __init__(self, lb: ProviderLoadBalancer, max_fallback_attempts: int = 3):
        self.lb = lb
        self.max_fallback_attempts = max_fallback_attempts
        self._clients: dict[str, anthropic.Anthropic] = {}

    def _client_for(self, api_key: str) -> anthropic.Anthropic:
        """Get or create a cached Anthropic client for a specific key."""
        if api_key not in self._clients:
            self._clients[api_key] = anthropic.Anthropic(api_key=api_key)
        return self._clients[api_key]

    def complete(
        self,
        messages: list[dict],
        model: str = "claude-sonnet-4-6",
        max_tokens: int = 1024,
        system: Optional[str] = None,
    ) -> dict:
        """
        Complete a request with automatic key failover.
        Tries up to max_fallback_attempts different keys if rate-limited or erroring.
        Returns the first successful response with routing metadata.
        """
        tried_keys: set[str] = set()
        for attempt in range(self.max_fallback_attempts):
            key = self.lb.select()
            if key is None:
                raise RuntimeError(
                    "No available API keys - all keys are at capacity or in cooldown. "
                    "Consider adding more keys or increasing rate limit tiers."
                )
            if key.key in tried_keys:
                # We've cycled through all available keys
                break
            tried_keys.add(key.key)
            self.lb.mark_inflight(key)
            client = self._client_for(key.key)
            start = time.time()
            try:
                kwargs: dict = {
                    "model": model,
                    "max_tokens": max_tokens,
                    "messages": messages,
                    "timeout": 30.0,
                }
                if system:
                    kwargs["system"] = system
                response = client.messages.create(**kwargs)
                latency_ms = (time.time() - start) * 1000
                tokens = response.usage.input_tokens + response.usage.output_tokens
                self.lb.record_success(key, latency_ms, tokens)
                return {
                    "response": response.content[0].text,
                    "model": response.model,
                    "key_prefix": key.key[:12] + "...",
                    "latency_ms": round(latency_ms, 1),
                    "input_tokens": response.usage.input_tokens,
                    "output_tokens": response.usage.output_tokens,
                    "attempts": attempt + 1,
                }
            except anthropic.RateLimitError:
                self.lb.record_failure(key, is_rate_limit=True, cooldown_s=60.0)
                print(f"Rate limit on key {key.key[:12]}... - trying next key")
                continue
            except anthropic.APIStatusError as e:
                cooldown = 30.0 if e.status_code >= 500 else 0.0
                self.lb.record_failure(key, is_rate_limit=False, cooldown_s=cooldown)
                if e.status_code < 500:
                    # 4xx errors indicate an invalid request - retrying on another key won't help
                    raise
                print(f"Server error {e.status_code} on key {key.key[:12]}... - trying next key")
                continue
            except Exception as e:
                self.lb.record_failure(key, is_rate_limit=False)
                print(f"Unexpected error on key {key.key[:12]}...: {type(e).__name__}: {e}")
                continue
        raise RuntimeError(
            f"All {len(tried_keys)} available keys failed after {self.max_fallback_attempts} attempts"
        )
```
```python
def build_demo_balancer() -> tuple[ProviderLoadBalancer, LoadBalancedAnthropicClient]:
    """Build a load balancer with 3 API keys for demonstration."""
    import os

    api_key = os.environ.get("ANTHROPIC_API_KEY", "sk-ant-your-key-here")
    # In production these would be 3 distinct API keys from your provider account
    # Key 3 has a higher tier (more weight + higher limits)
    keys = [
        ApiKey(
            key=api_key,
            provider="anthropic",
            model="claude-sonnet-4-6",
            rpm_limit=500,
            tpm_limit=200_000,
            weight=1.0,
        ),
        ApiKey(
            key=api_key,
            provider="anthropic",
            model="claude-sonnet-4-6",
            rpm_limit=500,
            tpm_limit=200_000,
            weight=1.0,
        ),
        ApiKey(
            key=api_key,
            provider="anthropic",
            model="claude-sonnet-4-6",
            rpm_limit=1000,
            tpm_limit=400_000,
            weight=2.0,  # Higher-tier key gets 2x the traffic share in weighted mode
        ),
    ]
    lb = ProviderLoadBalancer(keys, strategy="latency")
    client = LoadBalancedAnthropicClient(lb)
    return lb, client


def demo_load_balancer() -> None:
    lb, client = build_demo_balancer()
    questions = [
        "What is eventual consistency in distributed systems?",
        "Explain the Paxos consensus protocol in two sentences.",
        "What is a Bloom filter and when would you use one?",
        "How does consistent hashing work?",
        "Describe the two-phase commit protocol briefly.",
    ]
    print("=== Sending 5 requests through load balancer ===\n")
    for question in questions:
        result = client.complete(
            messages=[{"role": "user", "content": question}],
            model="claude-sonnet-4-6",
            max_tokens=80,
        )
        print(f"Q: {question}")
        print(f"  Key: {result['key_prefix']} | "
              f"Latency: {result['latency_ms']:.0f}ms | "
              f"Tokens: {result['input_tokens']}+{result['output_tokens']} | "
              f"Attempts: {result['attempts']}")
        print()
    print("=== Load Balancer Status ===")
    for s in lb.status():
        status_icon = "OK" if s['available'] else "COOLDOWN"
        print(f"  [{status_icon}] {s['key_prefix']} | "
              f"P95: {s['p95_latency_ms']:.0f}ms | "
              f"RPM: {s['current_rpm']:.0f}/{s['rpm_limit']} | "
              f"TPM: {s['current_tpm']:,}/{s['tpm_limit']:,} | "
              f"Reqs: {s['total_requests']}")


if __name__ == "__main__":
    demo_load_balancer()
```
Background Health Checker
In production, API keys can become unavailable for reasons beyond rate limiting: billing issues, account suspension, key rotation, or network-level problems. A background health checker probes each key periodically and removes unhealthy keys from routing before they cause user-visible failures.
```python
import threading
import time
import anthropic


class ProviderHealthChecker:
    """
    Background daemon thread that probes each API key periodically.
    Distinguishes three failure modes:
    - AuthenticationError: key is revoked/rotated - long cooldown, alert required
    - RateLimitError: key is healthy but saturated - do NOT penalize
    - Any other error: transient issue - short cooldown, retry on next cycle
    """
    PROBE_MESSAGE = [{"role": "user", "content": "Reply with 'ok'."}]

    def __init__(
        self,
        lb: ProviderLoadBalancer,
        interval_s: float = 30.0,
        probe_timeout_s: float = 10.0,
    ):
        self.lb = lb
        self.interval_s = interval_s
        self.probe_timeout_s = probe_timeout_s
        self._stop = threading.Event()
        self._thread = threading.Thread(
            target=self._run, daemon=True, name="llm-health-checker"
        )

    def start(self) -> None:
        self._thread.start()
        print(f"[HealthCheck] Started - interval={self.interval_s}s, "
              f"probe_timeout={self.probe_timeout_s}s, "
              f"keys={len(self.lb.keys)}")

    def stop(self) -> None:
        self._stop.set()
        self._thread.join(timeout=5.0)

    def _set_cooldown(self, key: ApiKey, seconds: float) -> None:
        """Set a cooldown under the balancer's lock without touching request counters."""
        with self.lb._lock:
            key.stats.cooldown_until = time.time() + seconds

    def probe(self, key: ApiKey) -> bool:
        """
        Send a minimal probe to a key to test its health.
        Returns True if healthy (or rate-limited but functional), False if broken.
        """
        try:
            client = anthropic.Anthropic(api_key=key.key)
            start = time.time()
            client.messages.create(
                model=key.model,
                max_tokens=5,
                messages=self.PROBE_MESSAGE,
                timeout=self.probe_timeout_s,
            )
            latency_ms = (time.time() - start) * 1000
            # Record as lightweight success - warms up latency statistics.
            # record_success decrements active_requests, so register the probe as
            # in-flight first to keep that counter balanced.
            self.lb.mark_inflight(key)
            self.lb.record_success(key, latency_ms, tokens=10)
            return True
        except anthropic.AuthenticationError:
            # Key is invalid or revoked - requires human intervention
            # Use 1-hour cooldown so the key stops receiving traffic immediately
            print(f"[HealthCheck] AUTH FAILURE for key {key.key[:12]}... - "
                  f"setting 1-hour cooldown. ALERT: check for key rotation.")
            self._set_cooldown(key, 3600)
            return False
        except anthropic.RateLimitError:
            # Rate limited on probe - the key is alive and healthy, just saturated
            # Do NOT set cooldown - the normal routing logic handles saturation
            print(f"[HealthCheck] Key {key.key[:12]}... is rate-limited - healthy but busy")
            return True
        except Exception as e:
            # Transient error - network timeout, 5xx, etc.
            # Apply the short cooldown the class docstring promises; the next cycle retries.
            print(f"[HealthCheck] Probe error for {key.key[:12]}...: "
                  f"{type(e).__name__}: {e}")
            self._set_cooldown(key, 30.0)
            return False

    def _run(self) -> None:
        while not self._stop.is_set():
            results = []
            for key in self.lb.keys:
                healthy = self.probe(key)
                results.append((key.key[:12], healthy))
            unhealthy_keys = [prefix for prefix, ok in results if not ok]
            healthy_count = len(results) - len(unhealthy_keys)
            if unhealthy_keys:
                print(f"[HealthCheck] UNHEALTHY keys: {unhealthy_keys} | "
                      f"Healthy: {healthy_count}/{len(results)}")
            else:
                print(f"[HealthCheck] All {len(results)} keys healthy")
            self._stop.wait(self.interval_s)


def demo_with_health_checking() -> None:
    lb, client = build_demo_balancer()
    checker = ProviderHealthChecker(lb, interval_s=30.0)
    checker.start()
    try:
        result = client.complete(
            messages=[{"role": "user", "content": "What is a distributed hash table?"}],
            model="claude-haiku-4-5-20251001",
            max_tokens=100,
        )
        print(f"Response via {result['key_prefix']} in {result['latency_ms']:.0f}ms")
        print(f"Content: {result['response'][:120]}...")
    finally:
        checker.stop()
```
Multi-Provider Load Balancing
When a single provider's total capacity is insufficient even across multiple keys, add a second provider as overflow capacity. This is the "primary + overflow" pattern: all traffic goes to the primary provider first; overflow is activated only when all primary keys are at capacity.
```python
import random
import threading
import time
from dataclasses import dataclass, field
from typing import Optional

import anthropic
import openai as openai_module


@dataclass
class MultiProviderKey:
    """An API key for any provider, with priority tier for overflow routing."""
    key: str
    provider: str  # "anthropic" | "openai"
    model: str
    rpm_limit: int = 500
    tpm_limit: int = 200_000
    weight: float = 1.0
    priority: int = 0  # 0 = primary, 1 = overflow - lower priority value = served first
    stats: ApiKeyStats = field(default_factory=ApiKeyStats)

    @property
    def is_available(self) -> bool:
        now = time.time()
        if self.stats.cooldown_until and now < self.stats.cooldown_until:
            return False
        if self._current_rpm >= self.rpm_limit * 0.90:
            return False
        if self._current_tpm >= self.tpm_limit * 0.90:
            return False
        return True

    @property
    def _current_rpm(self) -> int:
        now = time.time()
        return len([t for t in self.stats.recent_request_times if t > now - 60])

    @property
    def _current_tpm(self) -> int:
        now = time.time()
        pairs = list(zip(self.stats.recent_request_times, self.stats.recent_token_counts))
        return sum(tok for ts, tok in pairs if ts > now - 60)

    @property
    def _p95_latency_ms(self) -> float:
        lats = sorted(self.stats.recent_latencies_ms)
        if not lats:
            return 0.0
        idx = int(len(lats) * 0.95)
        return lats[min(idx, len(lats) - 1)]


class MultiProviderLoadBalancer:
    """
    Load balancer spanning multiple providers with priority-based overflow.
    Routing logic:
    1. Try priority=0 (primary) keys first - latency-based selection within tier
    2. Route to priority=1 (overflow) keys only when all primary keys are at capacity
    3. Log overflow activations - repeated overflow indicates need for more primary capacity
    """

    def __init__(self, keys: list[MultiProviderKey]):
        self.keys = keys
        self._lock = threading.RLock()
        self._overflow_count = 0
        # Client caches live on the instance so they survive across complete() calls
        self._anthropic_clients: dict[str, anthropic.Anthropic] = {}
        self._openai_clients: dict[str, openai_module.OpenAI] = {}

    def select(self) -> Optional[MultiProviderKey]:
        """Select the best available key, preferring primary tier."""
        # Primary tier (priority=0)
        primary = [k for k in self.keys if k.priority == 0 and k.is_available]
        if primary:
            new_keys = [k for k in primary if len(k.stats.recent_latencies_ms) < 3]
            if new_keys:
                return random.choice(new_keys)
            return min(primary, key=lambda k: k._p95_latency_ms)
        # Overflow tier (priority=1) - only when all primary keys are full
        overflow = [k for k in self.keys if k.priority == 1 and k.is_available]
        if overflow:
            with self._lock:
                self._overflow_count += 1
                print(f"[MultiProvider] All primary keys at capacity - "
                      f"routing to overflow (total overflows: {self._overflow_count})")
            return random.choice(overflow)
        return None

    def complete(
        self,
        messages: list[dict],
        max_tokens: int = 1024,
    ) -> dict:
        """Route a request across providers with automatic overflow handling."""
        key = self.select()
        if key is None:
            raise RuntimeError("No available keys across any provider")
        start = time.time()
        with self._lock:
            key.stats.active_requests += 1
        try:
            if key.provider == "anthropic":
                with self._lock:
                    if key.key not in self._anthropic_clients:
                        self._anthropic_clients[key.key] = anthropic.Anthropic(api_key=key.key)
                    client = self._anthropic_clients[key.key]
                response = client.messages.create(
                    model=key.model, max_tokens=max_tokens, messages=messages
                )
                text = response.content[0].text
                in_tok = response.usage.input_tokens
                out_tok = response.usage.output_tokens
            elif key.provider == "openai":
                with self._lock:
                    if key.key not in self._openai_clients:
                        self._openai_clients[key.key] = openai_module.OpenAI(api_key=key.key)
                    client = self._openai_clients[key.key]
                response = client.chat.completions.create(
                    model=key.model, max_tokens=max_tokens, messages=messages
                )
                text = response.choices[0].message.content
                in_tok = response.usage.prompt_tokens
                out_tok = response.usage.completion_tokens
            else:
                raise ValueError(f"Unknown provider: {key.provider}")
            latency_ms = (time.time() - start) * 1000
            with self._lock:
                key.stats.active_requests = max(0, key.stats.active_requests - 1)
                key.stats.recent_latencies_ms.append(latency_ms)
                key.stats.recent_request_times.append(time.time())
                key.stats.recent_token_counts.append(in_tok + out_tok)
            return {
                "response": text,
                "provider": key.provider,
                "model": key.model,
                "priority_tier": key.priority,
                "latency_ms": round(latency_ms, 1),
                "input_tokens": in_tok,
                "output_tokens": out_tok,
            }
        except Exception as e:
            with self._lock:
                key.stats.active_requests = max(0, key.stats.active_requests - 1)
                key.stats.total_failures += 1
                if isinstance(e, (anthropic.RateLimitError, openai_module.RateLimitError)):
                    key.stats.cooldown_until = time.time() + 60.0
            raise
```
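The primary-plus-overflow rule generalizes to any number of tiers: serve from the lowest-numbered tier that still has an available key. The sketch below isolates that rule as a pure function so it can be tested without network calls. `TieredKey` and `select_by_tier` are hypothetical names, and unlike the class above this version picks by lowest P95 in every tier, including overflow, instead of choosing randomly.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class TieredKey:
    """Minimal stand-in for a key with a priority tier (illustrative type)."""
    name: str
    priority: int          # 0 = primary, 1 = overflow, higher = later fallback
    available: bool
    p95_latency_ms: float


def select_by_tier(keys: Sequence[TieredKey]) -> Optional[TieredKey]:
    """Pick the fastest available key from the most-preferred non-empty tier."""
    available = [k for k in keys if k.available]
    if not available:
        return None
    best_tier = min(k.priority for k in available)
    tier = [k for k in available if k.priority == best_tier]
    return min(tier, key=lambda k: k.p95_latency_ms)


pool = [
    TieredKey("anthropic-1", priority=0, available=False, p95_latency_ms=800.0),
    TieredKey("anthropic-2", priority=0, available=False, p95_latency_ms=650.0),
    TieredKey("openai-1", priority=1, available=True, p95_latency_ms=900.0),
]
picked = select_by_tier(pool)
print(picked.name if picked else "no key available")  # overflow, since primaries are full
```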
Deploying Across Multiple Replicas: Shared State
The implementations above use in-process state (Python deques and floats). In a single-process deployment this is fine. But when you run multiple gateway replicas (e.g., 3 pods in Kubernetes), each replica has its own independent copy of the stats - latency measurements from replica A are invisible to replica B, so each replica makes routing decisions based on incomplete data.
For multi-replica deployments, move key state to Redis:
```python
import time
import uuid

import redis


class RedisBackedKeyStats:
    """
    Key stats backed by Redis for sharing across multiple gateway replicas.
    Trade-off: adds ~1ms Redis round-trip per request for stat reads/writes.
    For high-throughput systems, use Redis pipelines to batch the updates.
    """

    def __init__(self, redis_client: redis.Redis, key_prefix: str):
        self.r = redis_client
        self.prefix = key_prefix

    def record_latency(self, latency_ms: float) -> None:
        """Append latency to a capped Redis list (last 50 values)."""
        key = f"lb:latency:{self.prefix}"
        pipe = self.r.pipeline()
        pipe.rpush(key, f"{latency_ms:.1f}")
        pipe.ltrim(key, -50, -1)  # Keep last 50 values
        pipe.expire(key, 3600)
        pipe.execute()

    def get_p95_latency_ms(self) -> float:
        """Read P95 from the Redis latency list."""
        key = f"lb:latency:{self.prefix}"
        raw = self.r.lrange(key, 0, -1)
        if len(raw) < 3:
            return 0.0
        lats = sorted(float(v) for v in raw)
        idx = int(len(lats) * 0.95)
        return lats[min(idx, len(lats) - 1)]

    def record_request(self, tokens: int) -> None:
        """Record request timestamp and token count for RPM/TPM tracking."""
        now_ms = int(time.time() * 1000)
        key = f"lb:requests:{self.prefix}"
        # Store as "timestamp:tokens:nonce" in a sorted set, scored by timestamp.
        # The nonce keeps two requests with the same millisecond and token count
        # from collapsing into a single set member.
        member = f"{now_ms}:{tokens}:{uuid.uuid4().hex}"
        pipe = self.r.pipeline()
        pipe.zadd(key, {member: now_ms})
        # Remove entries older than 60 seconds
        pipe.zremrangebyscore(key, 0, now_ms - 60_000)
        pipe.expire(key, 120)
        pipe.execute()

    def get_current_tpm(self) -> int:
        """Get current TPM from the Redis sorted set."""
        key = f"lb:requests:{self.prefix}"
        cutoff_ms = (time.time() - 60) * 1000
        # Get all members in the last 60 seconds
        members = self.r.zrangebyscore(key, cutoff_ms, "+inf")
        total_tokens = 0
        for m in members:
            # Members are bytes unless the client was created with decode_responses=True
            text = m.decode() if isinstance(m, bytes) else m
            total_tokens += int(text.split(":")[1])
        return total_tokens
```
Production Engineering Notes
:::tip Use multiple API keys from the same provider to scale beyond single-key limits
Where a provider enforces rate limits per API key rather than per IP or per account, each key gets its own TPM and RPM bucket. If you need 1M TPM, use five keys at 200k TPM each. When it applies, this is the simplest throughput scaling mechanism available. Test multi-key throughput under realistic concurrent load to verify that effective capacity is actually multiplied and not shared at the account level.
:::
:::warning Check whether your provider has account-level rate limits above the key-level limits
Some providers implement a hard ceiling at the account level. Adding a 6th key to an account that has a 1M TPM account ceiling may not give you 1.2M TPM - it could give you 1M TPM shared across all 6 keys. Always verify this before building your architecture around key multiplication. If account-level ceilings exist, the solution is a second provider account, not a seventh key on the first account.
:::
:::danger Never log full API keys in status endpoints or monitoring output
When logging load balancer status, key selection decisions, or debugging output, always log only the key prefix (first 12 characters) followed by "...". A full API key in a log line creates a critical security risk - anyone with read access to your log aggregation system can extract valid provider credentials. Treat API keys with the same access control discipline as passwords. Rotate any key that appears in logs.
:::
:::info Latency-based routing has a warm-up period for new keys
When you add a new key to the load balancer, it has no P95 latency history. The latency-based strategy routes requests to new keys first so that it can build a baseline. During warm-up (the first 3 requests per key in the implementation above), a new key receives more traffic than its steady-state allocation. This is correct behavior - the strategy is gathering the data it needs to make informed routing decisions. After warm-up, routing converges to the stable pattern.
:::
Common Mistakes
Mistake 1: Treating rate limit errors from one key as a signal to remove it permanently. Rate limit errors are transient - the key's bucket refills over time. The correct response is a timed cooldown (60 seconds for a 429). After the cooldown period, the key returns to the available pool automatically. Removing keys permanently on rate limit errors depletes your key pool over time until no keys remain.
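Mistake 2: Failing to re-raise 4xx errors on retry. When a request fails with a request-level 4xx error (400 Bad Request, 413 Payload Too Large, 422 Unprocessable Entity), retrying on a different API key will not fix it - the error is in the request itself, not in the key or the provider's state. Re-raise these errors immediately without trying another key. The exceptions are 429, which is handled by cooldown and failover, and 401/403, which describe the key and not the request; the client above still re-raises those and relies on the health checker to take a revoked key out of rotation.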
Mistake 3: Sharing ApiKey objects across threads without locking. The stats fields on ApiKey (active_requests, recent_latencies_ms) are mutated concurrently by multiple threads handling simultaneous requests. Without a lock, you get race conditions: two threads both reading active_requests = 3, both decrementing it, and writing active_requests = 2 instead of active_requests = 1. Always use threading.RLock around stat mutation.
Mistake 4: Not distinguishing between "key is rate-limited" and "key is revoked" in the health checker. A rate-limited key is healthy - it just needs time to recover. A revoked key will never recover on its own. The health checker must detect AuthenticationError specifically and set a long cooldown (1 hour) while alerting the on-call engineer. If the health checker puts both error types in the same short cooldown, a revoked key re-enters the routing pool every time that cooldown expires and fails live requests until the next probe catches it again.
Mistake 5: Using only in-process state with multiple gateway replicas. When you run two or more gateway processes (e.g., for horizontal scaling), each process tracks its own per-key latency and rate limit stats independently. Replica A may route heavily to key 1 because it appears fast from A's perspective; replica B may also route heavily to key 1 for the same reason. Together they overwhelm key 1 while key 2 sits idle. Use Redis-backed stats for any multi-replica deployment.
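The race described in Mistake 3 is easy to guard against with a small wrapper. The following sketch shows an in-flight counter whose read-modify-write steps are serialized by a lock, exercised by several threads at once. `InFlightCounter` and `hammer` are hypothetical names used only for this demonstration.

```python
import threading


class InFlightCounter:
    """In-flight request counter whose updates are serialized by a lock."""

    def __init__(self) -> None:
        self._lock = threading.RLock()
        self._value = 0

    def increment(self) -> None:
        with self._lock:
            self._value += 1

    def decrement(self) -> None:
        with self._lock:
            # Read, subtract, and write happen as one critical section
            self._value = max(0, self._value - 1)

    @property
    def value(self) -> int:
        with self._lock:
            return self._value


def hammer(counter: InFlightCounter, iterations: int) -> None:
    """Simulate many requests starting and finishing on one key."""
    for _ in range(iterations):
        counter.increment()
        counter.decrement()


counter = InFlightCounter()
threads = [threading.Thread(target=hammer, args=(counter, 10_000)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every increment was paired with a decrement, so the counter ends at zero
print(counter.value)
```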
Interview Q&A
Q: Why does adding more API keys to the same provider increase throughput?
Where a provider enforces rate limits per API key, each key has its own independent TPM and RPM bucket. A single key with a 200k TPM limit can process at most 200k tokens per minute. Two keys give you two 200k buckets - effective 400k TPM. A load balancer distributes requests across keys such that no single key's bucket is depleted. This scales linearly with the number of keys up to any account-level ceiling. If the provider enforces limits only at the account or organization level, extra keys under the same account add no throughput, so confirm the enforcement scope before relying on this. When it applies, it is the simplest scaling mechanism available: no infrastructure changes, no provider negotiation, just key management and routing logic.
Q: What is the difference between round-robin and latency-based load balancing for LLM providers?
Round-robin distributes requests evenly across available keys in a cyclic pattern, without regard to each key's current performance. It works well when all keys have identical limits and similar geographic routing. Latency-based routing tracks the P95 response time for each key over a rolling window and routes new requests to the key with the lowest recent P95. This provides two benefits: it naturally avoids overloaded keys (which exhibit higher latency as they approach rate limits) and it adapts to geographic routing differences (a key routing to a closer provider endpoint will have lower latency and receive more traffic). Latency-based routing is self-correcting and requires no manual threshold configuration.
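For contrast with latency-based routing, round-robin needs no statistics at all. The sketch below is a hypothetical `RoundRobin` class that cycles through key names and skips any key the caller reports as unavailable; the `is_available` callback is an assumption standing in for a cooldown check.

```python
import itertools
from typing import Callable, Optional, Sequence


class RoundRobin:
    """Cycle through keys in order, skipping unavailable ones."""

    def __init__(self, keys: Sequence[str]) -> None:
        self._keys = list(keys)
        self._cycle = itertools.cycle(range(len(self._keys)))

    def next_key(self, is_available: Callable[[str], bool] = lambda k: True) -> Optional[str]:
        # Look at each key at most once per call so we never spin forever.
        for _ in range(len(self._keys)):
            key = self._keys[next(self._cycle)]
            if is_available(key):
                return key
        return None  # every key is unavailable
```

This version is not thread-safe; in a threaded gateway, `next_key` would need a lock around the cycle.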
Q: How do you detect that an API key has become unhealthy for reasons other than rate limiting?
Implement a background health checker that sends minimal probe requests ("Reply with 'ok'.") to each key at a regular interval (every 30 seconds). If a probe returns AuthenticationError, the key has been revoked, rotated, or invalidated - put it in a long cooldown (1 hour) and alert the on-call engineer. If a probe times out or returns a 5xx, put the key in a short cooldown (30–60 seconds) and retry. If a probe returns a rate limit error, the key is healthy but saturated - do not penalize it. Track health check results separately from production request outcomes to avoid contaminating the latency statistics used for routing decisions.
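The probe handling described above reduces to a small decision table. In this sketch `ProbeResult`, `ProbeAction`, and `action_for` are hypothetical names, and the 60-second short cooldown is one choice from the 30-60 second range; how a probe response gets mapped to a `ProbeResult` depends on the client library and is not shown.

```python
from dataclasses import dataclass
from enum import Enum, auto


class ProbeResult(Enum):
    OK = auto()
    RATE_LIMITED = auto()  # healthy but saturated
    AUTH_FAILED = auto()   # revoked, rotated, or invalidated
    TRANSIENT = auto()     # timeout or 5xx


@dataclass(frozen=True)
class ProbeAction:
    cooldown_seconds: int
    alert_on_call: bool


def action_for(result: ProbeResult) -> ProbeAction:
    """Map a health-probe outcome to a cooldown and an alerting decision."""
    if result is ProbeResult.AUTH_FAILED:
        # Will not recover on its own: long cooldown plus a human.
        return ProbeAction(cooldown_seconds=3600, alert_on_call=True)
    if result is ProbeResult.TRANSIENT:
        return ProbeAction(cooldown_seconds=60, alert_on_call=False)
    # OK and RATE_LIMITED: the key is healthy, so do not penalize it.
    return ProbeAction(cooldown_seconds=0, alert_on_call=False)
```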
Q: A key is still "available" by your capacity metrics but all requests to it are slow. How does latency-based routing handle this?
Latency-based routing tracks P95 latency per key over the last N requests. As the key becomes slow (due to provider-side throttling, geographic degradation, or high load on that key's backend), its P95 latency increases relative to other keys. The routing selection function min(available, key=lambda k: k.p95_latency_ms) naturally shifts traffic away from the slow key toward faster alternatives. This is self-correcting: as the slow key receives fewer requests, the provider-side load reduces, and its latency may recover. No explicit threshold configuration is needed.
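A minimal sketch of the rolling P95 and the `min` selection, assuming a nearest-rank percentile over the last 100 samples. `TrackedKey` and `pick_fastest` are hypothetical names, and returning 0.0 for a key with no samples is an assumption that makes untested keys get tried first.

```python
from collections import deque
from typing import Sequence


class TrackedKey:
    """Hypothetical key wrapper holding a rolling window of latencies."""

    def __init__(self, name: str, window: int = 100) -> None:
        self.name = name
        self.latencies_ms = deque(maxlen=window)

    def record(self, latency_ms: float) -> None:
        self.latencies_ms.append(latency_ms)

    @property
    def p95_latency_ms(self) -> float:
        if not self.latencies_ms:
            return 0.0  # assumption: no data yet, so let the key be tried
        ordered = sorted(self.latencies_ms)
        rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n), nearest-rank
        return ordered[rank - 1]


def pick_fastest(available: Sequence[TrackedKey]) -> TrackedKey:
    """Route to the key with the lowest recent P95 latency."""
    return min(available, key=lambda k: k.p95_latency_ms)
```

With 20 samples, two slow responses are enough to move a key's P95, so a degrading key loses traffic quickly. A larger window smooths this out at the cost of slower reaction.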
Q: How would you handle a scenario where all API keys across all providers are simultaneously rate-limited?
Three strategies in priority order:

1. Request queuing - buffer incoming requests in Redis with a TTL equal to your maximum acceptable wait time. A dequeue worker monitors key availability and dispatches queued requests as capacity becomes available. For real-time user-facing features, set a short TTL (5–10 seconds) and return a graceful "service busy" response when the queue wait exceeds it.
2. Tiered model degradation - if all premium model keys are exhausted but budget model keys remain available, serve requests with the budget model. Response quality is reduced but the service remains functional.
3. Proactive rate limit management - monitor TPM consumption in real time and throttle incoming requests before the provider limit is hit, returning a clear 429 to callers with a Retry-After header. This prevents stampedes on the provider at the cost of some user-visible throttling.
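The proactive throttling strategy can be sketched as a sliding-window token budget. `TpmThrottle` is a hypothetical in-process class (a multi-replica gateway would keep this state in a shared store instead), and it is not thread-safe as written. The second return value is the number of seconds a caller could place in a Retry-After header.

```python
import time
from collections import deque
from typing import Optional, Tuple


class TpmThrottle:
    """Admit requests only while the last window's token total stays under a limit."""

    def __init__(self, tpm_limit: int, window_seconds: float = 60.0) -> None:
        self.tpm_limit = tpm_limit
        self.window_seconds = window_seconds
        self._events = deque()  # (timestamp, tokens), oldest first
        self._used = 0

    def _expire(self, now: float) -> None:
        while self._events and now - self._events[0][0] >= self.window_seconds:
            _, tokens = self._events.popleft()
            self._used -= tokens

    def try_admit(self, tokens: int, now: Optional[float] = None) -> Tuple[bool, float]:
        """Return (admitted, retry_after_seconds)."""
        if tokens > self.tpm_limit:
            raise ValueError("request is larger than the whole budget")
        now = time.monotonic() if now is None else now
        self._expire(now)
        if self._used + tokens <= self.tpm_limit:
            self._events.append((now, tokens))
            self._used += tokens
            return True, 0.0
        # Capacity first frees up when the oldest event leaves the window.
        retry_after = self.window_seconds - (now - self._events[0][0])
        return False, retry_after
```

Setting `tpm_limit` somewhat below the provider's real limit gives the gateway room to reject early rather than collect 429s from the provider.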
Q: How does weighted routing differ from latency-based routing, and when would you choose weights?
Weighted routing is a static configuration: you assign each key a weight proportional to its desired traffic share, then use weighted random selection to route requests. A key with weight=2.0 receives approximately twice the traffic of a key with weight=1.0. You choose weights when you have keys on different rate limit tiers and want to send more traffic to the higher-tier key by design, regardless of current latency. Latency-based routing is dynamic: it allocates traffic based on measured P95 response times, naturally sending more to the faster key without configuration. You choose latency-based routing when you want the system to adapt automatically and you don't have a reason to prefer one key over another by design. In practice, latency-based routing is the better default for most use cases because it adapts to runtime conditions; weighted routing is useful when you want deterministic traffic splits for billing, auditing, or A/B test isolation.
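Weighted random selection is short enough to show in full. This sketch uses the standard library's `random.Random.choices`; the `pick_weighted` helper and the key names are hypothetical.

```python
import random
from typing import Mapping


def pick_weighted(weights: Mapping[str, float], rng: random.Random) -> str:
    """Pick one key name with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


# Usage: a key on a higher tier gets roughly twice the traffic.
rng = random.Random()
chosen = pick_weighted({"tier_high": 2.0, "tier_low": 1.0}, rng)
```

The split is only approximate over short runs. If an exact split matters (for billing or auditing, say), a deterministic scheme such as hashing a request attribute into buckets is a closer fit than random choice.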
Q: How would you implement A/B testing between two models using a load balancer?
Set up two key groups with different models and assign traffic splits using weighted routing. For example: 80% of requests go to keys with model="claude-sonnet-4-6" (weight=0.8), 20% go to keys with model="claude-haiku-4-5-20251001" (weight=0.2). Record the model used in each response's metadata. After sufficient traffic, compare the output quality, user satisfaction (thumbs up/down, downstream conversion), and cost per request between the two model groups. The load balancer makes this A/B test transparent to the application layer - no changes to calling code are needed. To avoid cohort contamination, ensure the same user always gets the same model: either use consistent hashing on user_id for key selection, or store the model assignment in the user session.
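Sticky assignment by `user_id` can be sketched with a deterministic hash bucket. This is a simpler stand-in for a full consistent-hash ring, and the `assign_model` helper and `salt` parameter are hypothetical. The same user always lands in the same bucket, so they always see the same model for a given experiment.

```python
import hashlib
from typing import Sequence, Tuple


def assign_model(
    user_id: str,
    split: Sequence[Tuple[str, float]],
    salt: str = "exp-1",  # change per experiment to reshuffle cohorts
) -> str:
    """Map a user to a model deterministically according to traffic shares."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    cumulative = 0.0
    for model, share in split:
        cumulative += share
        if bucket < cumulative:
            return model
    return split[-1][0]  # guard against shares summing to slightly under 1


SPLIT = [("claude-sonnet-4-6", 0.8), ("claude-haiku-4-5-20251001", 0.2)]
model = assign_model("user-12345", SPLIT)
```

The load balancer would then select only among keys configured for the returned model. Python's built-in `hash()` is avoided because string hashing is randomized per process, which would break stickiness across replicas and restarts.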
Load Balancer Observability: Metrics to Track
A load balancer without observability is a black box. These are the metrics that should be exposed at your gateway's /metrics endpoint (Prometheus format) and displayed in your operations dashboard:
| Metric | Type | Description |
|---|---|---|
| `lb_requests_total{key, provider, model}` | Counter | Total requests routed per key |
| `lb_failures_total{key, provider, error_type}` | Counter | Failures per key and error type |
| `lb_p95_latency_ms{key, provider}` | Gauge | Current P95 latency per key |
| `lb_active_requests{key}` | Gauge | In-flight requests per key |
| `lb_current_tpm{key}` | Gauge | Estimated current TPM per key |
| `lb_current_rpm{key}` | Gauge | Estimated current RPM per key |
| `lb_cooldown_active{key}` | Gauge | 1 if key is in cooldown, else 0 |
| `lb_overflow_total` | Counter | Times overflow tier was used (multi-provider) |
| `lb_keys_available` | Gauge | Number of keys currently available for routing |
The most operationally important alert: lb_keys_available == 0. This means all keys are in cooldown simultaneously and no requests can be served. Configure a PagerDuty alert to fire within 30 seconds of this condition.
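To show how the two cooldown-related gauges relate, here is a minimal sketch that derives them from per-key cooldown timestamps and renders them in the Prometheus text exposition format. The `render_gauges` helper is hypothetical, and a real gateway would more likely use a Prometheus client library than hand-format strings. The `key` label should be a key alias, never the secret itself.

```python
from typing import Mapping


def render_gauges(cooldown_until: Mapping[str, float], now: float) -> str:
    """Render lb_cooldown_active and lb_keys_available as Prometheus text."""
    lines = ["# TYPE lb_cooldown_active gauge"]
    available = 0
    for key_name in sorted(cooldown_until):
        cooling = now < cooldown_until[key_name]
        if not cooling:
            available += 1
        lines.append(f'lb_cooldown_active{{key="{key_name}"}} {int(cooling)}')
    lines.append("# TYPE lb_keys_available gauge")
    lines.append(f"lb_keys_available {available}")
    return "\n".join(lines) + "\n"
```

Because `lb_keys_available` is derived from the same timestamps as `lb_cooldown_active`, the two cannot disagree, which keeps the zero-keys alert trustworthy.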
Capacity Planning: How Many Keys Do You Need?
Given a peak throughput requirement and per-key limits, calculate the minimum key count:
```python
import math


def calculate_keys_needed(
    peak_tpm_required: int,
    key_tpm_limit: int,
    safety_factor: float = 0.85,  # Never use more than 85% of capacity
    redundancy: int = 1,          # Extra keys for failover
) -> dict:
    """
    Calculate the number of API keys needed to meet a throughput requirement.

    safety_factor: operate at this fraction of the key's limit to avoid
                   triggering rate limits under bursty load patterns.
    redundancy: extra keys above the minimum so you can lose one and still serve.
    """
    # Effective TPM per key after safety factor
    effective_tpm_per_key = key_tpm_limit * safety_factor

    # Minimum keys for the required throughput
    min_keys = math.ceil(peak_tpm_required / effective_tpm_per_key)

    # With redundancy
    total_keys = min_keys + redundancy

    # Effective capacity with total keys at safety factor
    effective_total_tpm = total_keys * effective_tpm_per_key

    return {
        "peak_tpm_required": peak_tpm_required,
        "key_tpm_limit": key_tpm_limit,
        "safety_factor": safety_factor,
        "effective_tpm_per_key": int(effective_tpm_per_key),
        "min_keys_no_redundancy": min_keys,
        "total_keys_with_redundancy": total_keys,
        "effective_total_tpm": int(effective_total_tpm),
        "headroom_pct": round(
            (effective_total_tpm - peak_tpm_required) / peak_tpm_required * 100, 1
        ),
    }


# Example: the email personalization scenario from the opening
result = calculate_keys_needed(
    peak_tpm_required=7_200_000,  # 400 emails/sec × 300 tokens × 60 seconds
    key_tpm_limit=400_000,        # Highest Anthropic tier per key
    safety_factor=0.85,
    redundancy=2,
)

print(f"Peak requirement: {result['peak_tpm_required']:,} TPM")
print(f"Effective per key: {result['effective_tpm_per_key']:,} TPM @ {result['safety_factor']*100:.0f}% safety")
print(f"Keys needed (min): {result['min_keys_no_redundancy']}")
print(f"Keys needed (w/ 2 spare): {result['total_keys_with_redundancy']}")
print(f"Effective total TPM: {result['effective_total_tpm']:,}")
print(f"Headroom above target: {result['headroom_pct']}%")

# Output:
# Peak requirement: 7,200,000 TPM
# Effective per key: 340,000 TPM @ 85% safety
# Keys needed (min): 22
# Keys needed (w/ 2 spare): 24
# Effective total TPM: 8,160,000
# Headroom above target: 13.3%
```
This calculation tells the team: to handle 7.2M TPM safely, you need 24 keys, each at the 400k TPM tier. If your provider has an account-level ceiling (e.g., 10M TPM across the entire account), that ceiling must be at least 7.2M TPM to serve the peak at all, and at least 8.16M TPM to keep the 13.3% headroom computed above. Below that, adding keys adds no usable capacity. Always verify account-level ceilings with your provider's support team before designing the architecture.
