Load Balancing and Request Routing
The Request That Breaks Your Load Balancer's Model
Your team has five vLLM instances running behind a round-robin Nginx upstream. Five servers, five GPUs, even distribution - it sounds correct. On a whiteboard it looks correct. In production at 2 PM on a Wednesday, it is the source of a bizarre performance cliff.
Server 1 receives a request from your enterprise customer's document analysis pipeline: 12,000 tokens of legal text, generate a 3,000-token summary. Server 1 is now occupied for roughly 80 seconds. During those 80 seconds, 20 more requests arrive, and round-robin deals four to each server - including four more 12,000-token legal documents, one landing on each of servers 2 through 5, because the enterprise customer sends them in batches. Now servers 2, 3, 4, and 5 each have one massive request occupying the GPU for the next 80 seconds, plus three smaller requests queued behind it.
Meanwhile, all five servers appear equally loaded to Nginx because Nginx measures connections, not GPU time consumed by in-flight requests. The small API requests from your other customers - 200-token prompts generating 50-token replies - are queued behind legal documents on all five servers. Their latency degrades from 400ms to 45 seconds. None of those servers are "overloaded" in any metric Nginx can see. Each shows only four or five active connections. Your SLO dashboard turns red.
This is the core problem that makes LLM load balancing fundamentally different from web server load balancing. A web server request costs roughly the same amount of compute whether it fetches a CSS file or a database record. An LLM request's cost is wildly variable - a 100x difference between the cheapest and most expensive requests is routine. Standard load balancing algorithms assume requests are roughly equal. LLM requests are not.
The second problem is the KV cache. When vLLM processes a request with a long system prompt - say, a 2,000-token instructions block that every request in a particular application shares - it caches that computation. If the next request with the same 2,000-token prefix lands on the same server, the prefix computation is free. If it lands on a different server via round-robin, that server pays the full prefix cost again. With five servers and round-robin, 80% of your requests pay for prefix computation that was already done on a different server.
This lesson fixes both problems. We cover the routing algorithms that account for variable request cost, the prefix-aware routing that keeps KV cache hits high, the model routing gateway that sends requests to the right model, and the circuit breaker patterns that keep a single slow server from degrading the entire cluster.
Why This Exists - What Generic Load Balancers Get Wrong for LLMs
Generic load balancers (Nginx, HAProxy, AWS ALB) were designed for HTTP request/response workloads where requests are short, homogeneous, and stateless. They optimize for connection count and connection setup overhead. For web servers handling 100ms requests, this is the right optimization - the bottleneck is connection handling, not per-request compute.
LLM inference breaks every assumption this model makes. Requests are long (seconds to minutes). They are highly heterogeneous (200-token vs 12,000-token prompts with 50-token vs 3,000-token outputs). They are not stateless with respect to the server - a server that has processed requests with a particular system prompt has warm KV cache for subsequent requests with the same prefix, giving it a significant compute advantage for those requests.
The failure modes this creates are severe:
Queue head-of-line blocking. With round-robin and a fixed maximum concurrent requests per server, a single very long request occupies a slot on a server for 2 minutes while a dozen short requests queue behind it. The short requests could be served in 400ms on any other server. They are not, because Nginx does not know this - it sees an "active connection" and has no visibility into whether that connection is using 2% or 100% of the server's GPU.
KV cache invalidation at scale. Consider an application where 95% of requests share a 4,000-token system prompt. With round-robin across 5 servers, on average each server only receives 20% of the requests with that prefix. The KV cache for the prefix on each server is cold most of the time. Routing all requests from a particular application to the same server - or using consistent hashing on the prompt prefix - keeps the cache warm on each server, reducing time-to-first-token by 30-60% for prefix-heavy workloads.
Uneven GPU memory pressure. If one server receives a disproportionate share of long-context requests, its KV cache fills up, forcing vLLM to evict earlier sequences. The server becomes memory-bound while other servers have abundant KV cache headroom. A smarter load balancer would have detected this via a "current KV cache utilization" signal and avoided routing additional long-context requests to the already-pressured server.
Generic load balancers cannot solve any of these problems because they have no visibility into GPU utilization, request length, or KV cache state. The solution is a load balancer that understands LLM serving semantics, or a routing layer that sits in front of generic load balancers and implements LLM-aware routing logic.
Historical Context - How LLM Routing Evolved
In 2022 and early 2023, when the first open-source LLM serving systems appeared, routing was an afterthought. Teams ran a single model server. There was nothing to route between. Load balancing was not a design decision - it was absent.
As model serving scaled to multiple replicas in 2023, teams borrowed the closest analogy they had: database connection pooling and API gateway patterns from microservices. Round-robin and least-connections from HAProxy. IP hash for "session affinity." These worked poorly for the reasons described above, but they worked well enough for teams operating at 100-1,000 requests per day.
The KV cache prefix insight came from the vLLM team at Berkeley in late 2023. The Automatic Prefix Caching (APC) feature in vLLM was designed to reuse KV cache computations for requests sharing common prefixes within a single server. The team quickly recognized that cross-server routing needed to be prefix-aware to fully exploit this capability. The vLLM router - a simple consistent-hash-based routing component - was added to the vLLM documentation as a reference architecture.
The variable-cost routing problem received serious attention in 2024 with papers and implementations from teams at major AI serving providers. The key insight: if you model each server as having a "load score" that accounts for both the number of active requests and the estimated compute cost of each request (based on prompt length), you can implement a weighted least-load algorithm that dramatically outperforms least-connections for LLM workloads.
KEDA and custom proxy implementations proliferated in 2024 and 2025 as teams realized that neither Nginx nor Envoy had built-in support for GPU-aware or prefix-aware routing. The community converged on a pattern: use Nginx or Envoy for connection management and TLS termination, implement a custom routing layer in Python or Go that makes routing decisions based on LLM-specific signals, and inject routing decisions via upstream selection or header manipulation.
Core Concepts
Why Round-Robin Fails for Variable-Cost Requests
The fundamental assumption behind round-robin is that if you distribute N requests across K servers, each server handles N/K work. This holds when requests have equal cost. For LLM requests, request cost varies by 100x. The correct metric is not "number of requests distributed" but "GPU-seconds of work distributed."
Mathematically, the total GPU-time load on server $k$ at time $t$ is:

$$L_k(t) = \sum_{i \in A_k(t)} c_i$$

where $A_k(t)$ is the set of requests in flight on server $k$ at time $t$, and $c_i$ is the estimated GPU time for request $i$, approximated as:

$$c_i \approx \frac{\text{prompt\_tokens}_i + \text{output\_tokens}_i}{R}$$

where $R$ is the server's tokens-per-second throughput. A good router minimizes the variance of $L_k(t)$ across servers, not the variance of request counts.
In practice, you do not know $c_i$ in advance. You can estimate it from historical data by request type, or use prompt length alone as a proxy (longer prompts tend to correlate with longer outputs).
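A minimal sketch of weighted least-load scoring built on this estimate. The server names and throughput constant are placeholders, and completion accounting is simplified to symmetric add/subtract calls - a real router would track completions via callbacks:
# load_score.py - Weighted least-load routing sketch (illustrative values)

TOKENS_PER_SECOND = 40.0  # assumed per-server throughput R

# server -> sum of estimated GPU-seconds for its in-flight requests
server_load = {"vllm-0": 0.0, "vllm-1": 0.0, "vllm-2": 0.0}

def estimate_cost(prompt_tokens: int, max_tokens: int) -> float:
    """c_i ~= (prompt tokens + expected output tokens) / R."""
    return (prompt_tokens + max_tokens) / TOKENS_PER_SECOND

def dispatch(prompt_tokens: int, max_tokens: int) -> str:
    """Route to the server with the lowest estimated GPU-seconds of work."""
    cost = estimate_cost(prompt_tokens, max_tokens)
    server = min(server_load, key=server_load.get)
    server_load[server] += cost  # add the estimate on dispatch
    return server

def complete(server: str, prompt_tokens: int, max_tokens: int) -> None:
    """Subtract the estimate when the response finishes."""
    server_load[server] -= estimate_cost(prompt_tokens, max_tokens)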
Nginx Configuration for Basic LLM Load Balancing
Nginx with the least_conn directive is a significant improvement over round-robin for LLM workloads because it routes to the server with the fewest active connections. Since long requests hold connections open, least_conn naturally avoids routing to servers already handling long requests.
# /etc/nginx/conf.d/vllm-upstream.conf
upstream vllm_backend {
least_conn; # Route to server with fewest connections
server vllm-server-0:8000 weight=1;
server vllm-server-1:8000 weight=1;
server vllm-server-2:8000 weight=1;
server vllm-server-3:8000 weight=1;
keepalive 32; # Keep connections alive - reduces TCP overhead
keepalive_requests 1000;
keepalive_timeout 60s;
}
server {
listen 80;
server_name llm-api.internal;
# Increase timeouts for long LLM requests
proxy_read_timeout 300s; # Allow up to 5 min for response
proxy_connect_timeout 10s;
proxy_send_timeout 30s;
# Disable request buffering - pass through immediately for streaming
proxy_buffering off;
proxy_request_buffering off;
location /v1/ {
proxy_pass http://vllm_backend;
proxy_http_version 1.1;
proxy_set_header Connection ""; # Required for keepalive
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Request-ID $request_id;
        # Forward the client's auth header to the backend
        proxy_set_header Authorization $http_authorization;
}
location /health {
proxy_pass http://vllm_backend/health;
access_log off;
}
}
least_conn works better than round-robin for LLM workloads, but it still has a key limitation: it counts connections, not GPU-seconds consumed. Two long requests count the same as two short ones. The routing layer described below fixes this.
Prefix-Aware Routing for KV Cache Reuse
Prefix-aware routing is the highest-impact optimization for applications where requests share common prefixes (system prompts, few-shot examples, document context). The routing algorithm uses a hash of the request's prefix to consistently route requests with the same prefix to the same server.
The math: if your system prompt is 2,000 tokens and prefill costs are proportional to prompt length, processing a 2,000-token prefix takes roughly 10x longer than processing a 200-token prompt. If the next request sharing that prefix lands on the same server, the 2,000-token prefill is skipped (KV cache hit) and only the new tokens are processed. The savings compound: for an application where every request starts with the same 2,000-token context, prefix caching reduces average TTFT from ~2 seconds to ~200ms.
# prefix_router.py - Prefix-aware routing layer for vLLM
import hashlib
import aiohttp
import asyncio
from typing import Optional
from fastapi import FastAPI, Request, Response
from fastapi.responses import StreamingResponse
import json
app = FastAPI(title="LLM Prefix-Aware Router")
# Backend vLLM server endpoints
BACKENDS = [
"http://vllm-server-0:8000",
"http://vllm-server-1:8000",
"http://vllm-server-2:8000",
"http://vllm-server-3:8000",
]
def extract_prefix(messages: list, prefix_chars: int = 512) -> str:
    """
    Extract the prefix used as the consistent-hashing key.
    Uses the system prompt plus the start of the first user message,
    which captures the most common shared prefix (system instructions).
    """
    prefix_parts = []
    for msg in messages:
        if msg.get("role") == "system":
            prefix_parts.append(msg["content"])
            break  # System prompt is the primary shared prefix
    # Also include the start of the first user message
    for msg in messages:
        if msg.get("role") == "user":
            # Take the first 512 characters (a character-count proxy, not tokens)
            prefix_parts.append(msg["content"][:prefix_chars])
            break
    return "|".join(prefix_parts)
def select_backend_by_prefix(prefix: str) -> str:
"""
Consistent hash on the prefix to select a backend.
Same prefix always routes to the same backend (cache locality).
"""
prefix_hash = int(hashlib.sha256(prefix.encode()).hexdigest(), 16)
backend_index = prefix_hash % len(BACKENDS)
return BACKENDS[backend_index]
def select_backend_least_load(backend_loads: dict) -> str:
"""
Fallback: select the backend with the lowest estimated load.
Used when prefix is empty or for non-chat completions.
"""
return min(backend_loads, key=backend_loads.get)
# Track estimated load per backend (approximation via active request count)
backend_active_requests: dict = {b: 0 for b in BACKENDS}
@app.post("/v1/chat/completions")
async def route_chat_completion(request: Request):
body = await request.json()
messages = body.get("messages", [])
stream = body.get("stream", False)
# Extract prefix for routing decision
prefix = extract_prefix(messages)
if prefix:
# Prefix-aware routing: consistent hash to cache-warm server
backend = select_backend_by_prefix(prefix)
else:
# No prefix (e.g., single-message request with no system prompt)
# Fall back to least-load routing
backend = select_backend_least_load(backend_active_requests)
backend_active_requests[backend] = backend_active_requests.get(backend, 0) + 1
try:
target_url = f"{backend}/v1/chat/completions"
if stream:
return await proxy_streaming(target_url, body, request, backend)
else:
return await proxy_blocking(target_url, body, request, backend)
    finally:
        # NOTE: for streaming, this decrement fires when the StreamingResponse
        # object is returned, not when the stream finishes, so in-flight streams
        # are undercounted. Moving the decrement into the stream generator's
        # finally block fixes that at the cost of extra plumbing.
        backend_active_requests[backend] = max(
            0, backend_active_requests.get(backend, 0) - 1
        )
async def proxy_streaming(
url: str, body: dict, request: Request, backend: str
) -> StreamingResponse:
"""Forward streaming request and stream back the response."""
headers = {
"Authorization": request.headers.get("Authorization", ""),
"Content-Type": "application/json",
"X-Routed-To": backend,
}
async def generate():
async with aiohttp.ClientSession() as session:
async with session.post(
url,
json=body,
headers=headers,
timeout=aiohttp.ClientTimeout(total=300),
) as resp:
async for chunk in resp.content.iter_chunked(1024):
yield chunk
return StreamingResponse(
generate(),
media_type="text/event-stream",
headers={"X-Routed-To": backend},
)
async def proxy_blocking(
url: str, body: dict, request: Request, backend: str
) -> Response:
"""Forward non-streaming request."""
headers = {
"Authorization": request.headers.get("Authorization", ""),
"Content-Type": "application/json",
}
async with aiohttp.ClientSession() as session:
async with session.post(
url,
json=body,
headers=headers,
timeout=aiohttp.ClientTimeout(total=300),
) as resp:
content = await resp.read()
return Response(
content=content,
status_code=resp.status,
media_type="application/json",
headers={"X-Routed-To": backend},
)
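A quick way to sanity-check the router - assuming it is launched with uvicorn prefix_router:app --port 8080; the port, model name, and prompt below are arbitrary - is to send two requests with the same system prompt and confirm the X-Routed-To header matches:
# smoke_test.py - Verify prefix-consistent routing (values are placeholders)
import requests

payload = {
    "model": "served-model-name",  # whatever the backends actually serve
    "messages": [
        {"role": "system", "content": "You are a legal document summarizer."},
        {"role": "user", "content": "Summarize the attached contract."},
    ],
}
for _ in range(2):
    resp = requests.post(
        "http://localhost:8080/v1/chat/completions", json=payload, timeout=300
    )
    # Same system prompt -> same backend on both calls
    print(resp.headers.get("X-Routed-To"))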
Model Routing - Sending Requests to the Right Model
Not all requests need the same model. A user asking "what is the capital of France?" does not need Llama 3 70B. A request that needs complex multi-step reasoning or code generation might fail on a 7B model. Model routing sends requests to the appropriate model based on complexity signals, saving GPU time and cost.
# model_router.py - Routes requests to appropriate model based on complexity
import re
from dataclasses import dataclass
from enum import Enum
from typing import Optional
class ModelTier(Enum):
SMALL = "small" # 7B - fast, cheap, simple queries
MEDIUM = "medium" # 13-34B - balanced capability and cost
LARGE = "large" # 70B+ - complex reasoning, code, analysis
@dataclass
class RoutingDecision:
tier: ModelTier
model_name: str
backend_url: str
reason: str
# Model tier configuration - customize per deployment
MODEL_CONFIG = {
ModelTier.SMALL: {
"model": "mistralai/Mistral-7B-Instruct-v0.2",
"url": "http://vllm-small:8000",
"max_prompt_tokens": 4096,
},
ModelTier.MEDIUM: {
"model": "mistralai/Mixtral-8x7B-Instruct-v0.1",
"url": "http://vllm-medium:8000",
"max_prompt_tokens": 16384,
},
ModelTier.LARGE: {
"model": "meta-llama/Llama-3-70B-Instruct",
"url": "http://vllm-large:8000",
"max_prompt_tokens": 65536,
},
}
# Keywords that signal complex reasoning - route to large model
COMPLEX_REASONING_PATTERNS = [
r"\bcode\b|\bprogram\b|\bimplementation\b|\bdebug\b",
r"\banalyze\b|\banalysis\b|\bevaluate\b|\bcompare\b",
r"\bstep.by.step\b|\bchain.of.thought\b|\breason\b",
r"\bmath\b|\bcalculate\b|\bequation\b|\bproof\b",
]
def estimate_prompt_tokens(messages: list) -> int:
"""Rough token estimate: 1 token per 4 characters."""
total_chars = sum(len(m.get("content", "")) for m in messages)
return total_chars // 4
def detect_complexity(messages: list, requested_max_tokens: int) -> ModelTier:
"""
Route to appropriate model tier based on:
1. Explicit model tier header (highest priority)
2. Prompt length (long prompts need large model for context)
3. Complexity keywords in the user message
4. Requested output length (long outputs need large model)
"""
prompt_tokens = estimate_prompt_tokens(messages)
# Long prompts - need large context window
if prompt_tokens > 16000:
return ModelTier.LARGE
if prompt_tokens > 4000:
return ModelTier.MEDIUM
# Long output requested - complex task
if requested_max_tokens > 2000:
return ModelTier.LARGE
if requested_max_tokens > 500:
return ModelTier.MEDIUM
# Check for complexity keywords in user messages
user_content = " ".join(
m.get("content", "") for m in messages if m.get("role") == "user"
).lower()
for pattern in COMPLEX_REASONING_PATTERNS:
if re.search(pattern, user_content):
return ModelTier.LARGE
# Default: small model for simple queries
return ModelTier.SMALL
def route_request(
messages: list,
max_tokens: int = 256,
force_tier: Optional[str] = None,
) -> RoutingDecision:
"""Determine which model and backend to route this request to."""
if force_tier:
try:
tier = ModelTier(force_tier.lower())
except ValueError:
tier = detect_complexity(messages, max_tokens)
else:
tier = detect_complexity(messages, max_tokens)
config = MODEL_CONFIG[tier]
return RoutingDecision(
tier=tier,
model_name=config["model"],
backend_url=config["url"],
reason=f"Detected tier={tier.value}, prompt_tokens~{estimate_prompt_tokens(messages)}",
)
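A usage sketch with an illustrative prompt - with the default max_tokens and no tier header, the "debug" keyword is what pushes this request to the large pool:
# Example routing decision (prompt text is illustrative)
decision = route_request(
    messages=[{"role": "user", "content": "Debug this recursive parser for me."}],
    max_tokens=256,
)
# Short prompt, short output, but "debug" matches a complexity pattern -> LARGE
print(decision.tier)         # ModelTier.LARGE
print(decision.backend_url)  # http://vllm-large:8000
print(decision.reason)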
Request Prioritization - Latency-Sensitive vs Batch
Not all requests have the same latency requirements. Interactive user-facing requests need sub-2-second TTFT. Batch processing jobs (nightly report generation, document indexing) can tolerate 60-second latency. Mixing these without prioritization causes interactive requests to queue behind batch jobs.
The cleanest implementation uses separate endpoints (and separate request queues) for interactive and batch traffic, pointing to the same pool of backends but with different queueing behavior:
# priority_queue_router.py - Separate queues for interactive vs batch
import asyncio
import heapq
import time
from enum import IntEnum
from dataclasses import dataclass, field
from typing import Any, Optional
class Priority(IntEnum):
INTERACTIVE = 0 # Lowest number = highest priority
BACKGROUND = 1
BATCH = 2
@dataclass(order=True)
class QueuedRequest:
priority: int
timestamp: float = field(compare=False)
request_id: str = field(compare=False)
payload: Any = field(compare=False)
def wait_time(self) -> float:
return time.monotonic() - self.timestamp
class PriorityRequestQueue:
    def __init__(self, max_wait_seconds: Optional[dict] = None):
self._heap = []
self._lock = asyncio.Lock()
# Max wait before promoting to higher priority (anti-starvation)
self.max_wait = max_wait_seconds or {
Priority.BATCH: 30.0, # Promote batch after 30s wait
Priority.BACKGROUND: 10.0, # Promote background after 10s wait
}
async def put(self, request_id: str, payload: Any, priority: Priority):
async with self._lock:
item = QueuedRequest(
priority=priority.value,
timestamp=time.monotonic(),
request_id=request_id,
payload=payload,
)
heapq.heappush(self._heap, item)
    async def get(self) -> Optional[QueuedRequest]:
"""Get highest priority request, with anti-starvation promotion."""
async with self._lock:
if not self._heap:
return None
# Check for starvation: promote long-waiting requests
now = time.monotonic()
for item in self._heap:
wait = now - item.timestamp
if item.priority == Priority.BATCH and wait > self.max_wait.get(Priority.BATCH, 30):
item.priority = Priority.BACKGROUND.value
elif item.priority == Priority.BACKGROUND and wait > self.max_wait.get(Priority.BACKGROUND, 10):
item.priority = Priority.INTERACTIVE.value
heapq.heapify(self._heap) # Rebuild heap after priority changes
return heapq.heappop(self._heap)
def size(self) -> int:
return len(self._heap)
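A sketch of how the two endpoints and a dispatcher coroutine could share this queue - the forwarding function is a placeholder for the actual backend proxy call:
# dispatcher.py - Drain the priority queue toward the backend pool (sketch)
import asyncio

queue = PriorityRequestQueue()

async def dispatch_loop(forward):
    """Pop the highest-priority request and forward it; back off when empty."""
    while True:
        item = await queue.get()
        if item is None:
            await asyncio.sleep(0.05)  # queue empty - poll again shortly
            continue
        await forward(item.payload)  # forward() proxies to a backend

# The interactive endpoint enqueues with Priority.INTERACTIVE, the batch
# endpoint with Priority.BATCH:
#   await queue.put(request_id, payload, Priority.INTERACTIVE)
#   await queue.put(request_id, payload, Priority.BATCH)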
Circuit Breakers for Failing Model Servers
A model server can fail in ways that are hard to detect at the TCP connection level: the GPU hangs, the KV cache is full and all requests are timing out internally, CUDA errors are causing 500 responses. Without a circuit breaker, your load balancer keeps sending requests to the failing server, those requests all fail after their timeout, and the failing server's timeouts consume slots that could go to healthy servers.
# circuit_breaker.py - Track backend health and stop routing to unhealthy servers
import time
import threading
from enum import Enum
from dataclasses import dataclass, field
from typing import Callable, Optional
class CircuitState(Enum):
CLOSED = "closed" # Normal operation - requests flow through
OPEN = "open" # Backend failing - block requests immediately
HALF_OPEN = "half_open" # Testing if backend recovered - allow one probe
@dataclass
class BackendStats:
success_count: int = 0
failure_count: int = 0
last_failure_time: float = 0.0
consecutive_failures: int = 0
state: CircuitState = CircuitState.CLOSED
lock: threading.Lock = field(default_factory=threading.Lock)
class CircuitBreaker:
def __init__(
self,
failure_threshold: int = 5,
recovery_timeout: float = 60.0,
        half_open_probe_timeout: float = 10.0,  # reserved - not used below
):
self.failure_threshold = failure_threshold
self.recovery_timeout = recovery_timeout
self.half_open_probe_timeout = half_open_probe_timeout
self.backends: dict = {}
def register_backend(self, backend_url: str):
self.backends[backend_url] = BackendStats()
def is_available(self, backend_url: str) -> bool:
stats = self.backends.get(backend_url)
if stats is None:
return True
with stats.lock:
if stats.state == CircuitState.CLOSED:
return True
elif stats.state == CircuitState.OPEN:
# Check if recovery timeout has elapsed - move to half-open
if time.monotonic() - stats.last_failure_time > self.recovery_timeout:
stats.state = CircuitState.HALF_OPEN
return True # Allow one probe request
return False
            else:  # HALF_OPEN
                # Allow the request through as a recovery probe. Note this does
                # not limit probing to a single request - concurrent probes pass.
                return True
def record_success(self, backend_url: str):
stats = self.backends.get(backend_url)
if stats is None:
return
with stats.lock:
stats.success_count += 1
stats.consecutive_failures = 0
if stats.state == CircuitState.HALF_OPEN:
stats.state = CircuitState.CLOSED # Backend recovered
print(f"[circuit_breaker] {backend_url} recovered, circuit CLOSED")
def record_failure(self, backend_url: str):
stats = self.backends.get(backend_url)
if stats is None:
return
with stats.lock:
stats.failure_count += 1
stats.consecutive_failures += 1
stats.last_failure_time = time.monotonic()
if stats.consecutive_failures >= self.failure_threshold:
if stats.state != CircuitState.OPEN:
stats.state = CircuitState.OPEN
print(
f"[circuit_breaker] {backend_url} OPEN after "
f"{stats.consecutive_failures} consecutive failures"
)
def available_backends(self, all_backends: list) -> list:
return [b for b in all_backends if self.is_available(b)]
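Usage is symmetrical around each proxied request: check availability before forwarding, record the outcome after. A sketch, where the request function is a placeholder for the actual HTTP call:
# Wiring sketch: record the outcome of every forwarded request
breaker = CircuitBreaker(failure_threshold=5, recovery_timeout=60.0)
breaker.register_backend("http://vllm-server-0:8000")

def forward_with_breaker(backend: str, send_request) -> bool:
    """Returns False if the circuit is open; otherwise forwards and records."""
    if not breaker.is_available(backend):
        return False  # caller should pick another backend
    try:
        send_request()  # placeholder for the actual proxied request
    except Exception:
        breaker.record_failure(backend)
        raise
    breaker.record_success(backend)
    return True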
Complete Production Routing Layer
Combining prefix routing, circuit breakers, and model routing into a single FastAPI gateway:
# gateway.py - Production LLM routing gateway
import asyncio
import hashlib
import time
import uuid
from contextlib import asynccontextmanager
from typing import Optional
import aiohttp
from fastapi import FastAPI, Header, HTTPException, Request
from fastapi.responses import JSONResponse, StreamingResponse
from circuit_breaker import CircuitBreaker
from model_router import ModelTier, route_request
# Backend pools by model tier
BACKEND_POOLS = {
ModelTier.SMALL: [
"http://vllm-small-0:8000",
"http://vllm-small-1:8000",
],
ModelTier.MEDIUM: [
"http://vllm-medium-0:8000",
],
ModelTier.LARGE: [
"http://vllm-large-0:8000",
"http://vllm-large-1:8000",
],
}
circuit_breaker = CircuitBreaker(
failure_threshold=5,
recovery_timeout=60.0,
)
# Register all backends with circuit breaker
for tier_backends in BACKEND_POOLS.values():
for backend in tier_backends:
circuit_breaker.register_backend(backend)
# Track active request count per backend for least-load routing
backend_load: dict = {}
@asynccontextmanager
async def lifespan(app: FastAPI):
# Startup: initialize backend load tracking
for tier_backends in BACKEND_POOLS.values():
for backend in tier_backends:
backend_load[backend] = 0
yield
# Shutdown: nothing to clean up
app = FastAPI(title="LLM Routing Gateway", lifespan=lifespan)
def select_backend(
tier: ModelTier,
prefix: Optional[str] = None,
use_prefix_routing: bool = True,
) -> Optional[str]:
"""Select a backend from the appropriate pool using prefix-aware + least-load routing."""
pool = BACKEND_POOLS.get(tier, [])
available = circuit_breaker.available_backends(pool)
if not available:
return None
if prefix and use_prefix_routing and len(available) > 1:
# Try prefix-consistent routing first
prefix_hash = int(hashlib.sha256(prefix.encode()).hexdigest(), 16)
preferred = available[prefix_hash % len(available)]
# Only use preferred if its load is not significantly higher than min
min_load = min(backend_load.get(b, 0) for b in available)
preferred_load = backend_load.get(preferred, 0)
if preferred_load <= min_load + 2:
# Preferred backend is within 2 requests of least-loaded - use it for cache
return preferred
# Fall back to least-loaded backend (ignores prefix for load balance)
return min(available, key=lambda b: backend_load.get(b, 0))
@app.post("/v1/chat/completions")
async def chat_completions(
request: Request,
x_model_tier: Optional[str] = Header(None, alias="X-Model-Tier"),
x_request_priority: Optional[str] = Header(None, alias="X-Request-Priority"),
):
request_id = str(uuid.uuid4())
body = await request.json()
messages = body.get("messages", [])
max_tokens = body.get("max_tokens", 256)
stream = body.get("stream", False)
# Route to appropriate model tier
decision = route_request(messages, max_tokens, force_tier=x_model_tier)
# Extract prefix for cache-aware routing
prefix_parts = []
for msg in messages:
if msg.get("role") == "system":
prefix_parts.append(msg["content"])
break
prefix = prefix_parts[0] if prefix_parts else None
# Select backend
backend = select_backend(decision.tier, prefix)
if backend is None:
raise HTTPException(
status_code=503,
detail={
"error": "No healthy backends available",
"tier": decision.tier.value,
},
)
# Override model name in body to match selected backend
body["model"] = decision.model_name
# Track load
backend_load[backend] = backend_load.get(backend, 0) + 1
start_time = time.monotonic()
try:
headers = {
"Authorization": request.headers.get("Authorization", ""),
"Content-Type": "application/json",
"X-Request-ID": request_id,
}
        if stream:
            result = await _proxy_stream(backend, body, headers)
            # NOTE: the upstream call runs lazily inside the stream generator,
            # so this records only that the stream was handed off - failures
            # during streaming bypass the circuit breaker in this version.
            circuit_breaker.record_success(backend)
            return result
else:
result = await _proxy_blocking(backend, body, headers)
circuit_breaker.record_success(backend)
duration = time.monotonic() - start_time
# Add routing metadata to response
if hasattr(result, "headers"):
result.headers["X-Routed-To"] = backend
result.headers["X-Model-Tier"] = decision.tier.value
result.headers["X-Request-Duration"] = f"{duration:.3f}s"
return result
except (aiohttp.ClientError, asyncio.TimeoutError) as e:
circuit_breaker.record_failure(backend)
raise HTTPException(status_code=502, detail=str(e))
    finally:
        # As in the prefix router, streaming requests decrement here before the
        # stream actually completes - an approximation of in-flight load.
        backend_load[backend] = max(0, backend_load.get(backend, 0) - 1)
async def _proxy_stream(backend: str, body: dict, headers: dict) -> StreamingResponse:
async def generate():
async with aiohttp.ClientSession() as session:
async with session.post(
f"{backend}/v1/chat/completions",
json=body,
headers=headers,
timeout=aiohttp.ClientTimeout(total=300),
) as resp:
if resp.status != 200:
error_body = await resp.read()
yield error_body
return
async for chunk in resp.content.iter_chunked(4096):
yield chunk
return StreamingResponse(generate(), media_type="text/event-stream")
async def _proxy_blocking(backend: str, body: dict, headers: dict) -> JSONResponse:
async with aiohttp.ClientSession() as session:
async with session.post(
f"{backend}/v1/chat/completions",
json=body,
headers=headers,
timeout=aiohttp.ClientTimeout(total=300),
) as resp:
content = await resp.json()
return JSONResponse(content=content, status_code=resp.status)
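To exercise the full gateway - assuming it runs via uvicorn gateway:app --port 8080; the address and prompt are placeholders - a request can force a tier with the X-Model-Tier header and inspect the routing metadata headers added above:
# Force the large tier and inspect routing metadata (address is an assumption)
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    headers={"X-Model-Tier": "large"},
    json={
        "messages": [{"role": "user", "content": "Prove sqrt(2) is irrational."}],
        "max_tokens": 512,
    },
    timeout=300,
)
print(resp.headers.get("X-Routed-To"))        # which backend served it
print(resp.headers.get("X-Model-Tier"))       # large
print(resp.headers.get("X-Request-Duration"))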
Production Engineering Notes
Measuring KV Cache Hit Rate
The ROI of prefix-aware routing depends on how much your traffic actually shares common prefixes. If every request has a unique system prompt, prefix routing adds overhead with no benefit. Instrument your router to measure the cache hit rate:
# Add to gateway.py - track prefix routing effectiveness
from prometheus_client import Counter, Histogram, start_http_server
prefix_route_total = Counter(
"prefix_route_total",
"Total prefix routing decisions",
["result"], # "cache_hit", "cache_miss", "no_prefix"
)
request_latency = Histogram(
"gateway_request_latency_seconds",
"Request latency by routing type",
["routing_type", "model_tier"],
buckets=[0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0, 60.0, 120.0],
)
def select_backend_instrumented(tier, prefix=None):
if not prefix:
prefix_route_total.labels(result="no_prefix").inc()
return select_backend(tier, prefix=None)
# Check if preferred (prefix-hash) backend is within load threshold
pool = BACKEND_POOLS.get(tier, [])
available = circuit_breaker.available_backends(pool)
if not available:
return None
prefix_hash = int(hashlib.sha256(prefix.encode()).hexdigest(), 16)
preferred = available[prefix_hash % len(available)]
min_load = min(backend_load.get(b, 0) for b in available)
preferred_load = backend_load.get(preferred, 0)
if preferred_load <= min_load + 2:
prefix_route_total.labels(result="cache_hit").inc()
return preferred
else:
prefix_route_total.labels(result="cache_miss").inc()
return min(available, key=lambda b: backend_load.get(b, 0))
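The start_http_server import above does nothing until it is called; one line at gateway startup exposes the counters for scraping (the port is arbitrary):
# Expose /metrics for Prometheus scraping - call once at startup
start_http_server(9100)  # metrics served at http://<gateway-host>:9100/metrics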
Handling Backend Restarts During Weight Loading
When a vLLM backend restarts (after a crash or rolling update), it takes 3-10 minutes to load model weights before it can serve requests. During this window, requests forwarded to it will fail. The circuit breaker handles this automatically - after 5 consecutive failures, the circuit for that backend opens and no new requests are routed to it until it recovers.
However, the circuit breaker's recovery probe is passive (it waits for requests to test with). For faster recovery detection, add an active health check:
# health_checker.py - Active background health checking
import asyncio
import aiohttp
# circuit_breaker and BACKEND_POOLS are defined in gateway.py, not circuit_breaker.py
from gateway import circuit_breaker, BACKEND_POOLS
async def check_backend_health(backend_url: str) -> bool:
"""Returns True if backend is healthy and ready to serve."""
try:
async with aiohttp.ClientSession() as session:
async with session.get(
f"{backend_url}/health",
timeout=aiohttp.ClientTimeout(total=5),
) as resp:
return resp.status == 200
except Exception:
return False
async def health_check_loop():
"""Background task: check all backends every 15 seconds."""
all_backends = [b for pool in BACKEND_POOLS.values() for b in pool]
while True:
for backend in all_backends:
healthy = await check_backend_health(backend)
if healthy:
circuit_breaker.record_success(backend)
else:
circuit_breaker.record_failure(backend)
await asyncio.sleep(15)
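One way to wire this in is a sketch that replaces the gateway's lifespan from gateway.py, running the checker as a background task and cancelling it on shutdown:
# In gateway.py - extend the lifespan to run the health checker (sketch)
from health_checker import health_check_loop

@asynccontextmanager
async def lifespan(app: FastAPI):
    for tier_backends in BACKEND_POOLS.values():
        for backend in tier_backends:
            backend_load[backend] = 0
    probe_task = asyncio.create_task(health_check_loop())  # active probing
    yield
    probe_task.cancel()  # stop probing on shutdown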
Nginx as the Outer Layer, Gateway as the Inner Layer
In production, the architecture typically has two layers. The outer layer is Nginx - it handles TLS termination, client connection management, basic rate limiting by IP, and request logging. The inner layer is the custom routing gateway that implements LLM-specific routing logic. This separation keeps each component focused on what it does best.
# Outer Nginx layer - TLS, connection management, routing to gateway
upstream llm_gateway {
server llm-gateway-0:8080;
server llm-gateway-1:8080; # Run multiple gateway instances for HA
keepalive 64;
}
server {
listen 443 ssl http2;
server_name api.yourdomain.com;
ssl_certificate /etc/nginx/ssl/cert.pem;
ssl_certificate_key /etc/nginx/ssl/key.pem;
    # Rate limiting by API key. Note: the limit_req_zone directive must be
    # declared in the http{} context (e.g., in nginx.conf), not inside server{}:
    #   limit_req_zone $http_authorization zone=api_rate:10m rate=100r/m;
    limit_req zone=api_rate burst=20 nodelay;
location /v1/ {
proxy_pass http://llm_gateway;
proxy_http_version 1.1;
proxy_set_header Connection "";
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
# Long timeouts for LLM generation
proxy_read_timeout 300s;
proxy_send_timeout 30s;
# SSE/streaming support
proxy_buffering off;
proxy_cache off;
}
}
Consistent Hash Stability Under Backend Changes
When you add or remove backends from the pool, hash-based routes change. With naive modulo hashing, resizing the pool remaps almost every key - roughly (K-1)/K of them, where K is the new backend count - disrupting KV cache locality across the whole cluster. Consistent hashing with virtual nodes (or rendezvous hashing) minimizes this disruption: only about 1/K of keys remap when a backend is added or removed.
The routing gateway above uses simple modulo hashing (prefix_hash % len(available)), which suffers exactly this remapping when the pool size changes. For production with frequent scaling events, switch to a consistent hash ring:
# consistent_hash.py - Minimize cache disruption during scaling events
import bisect
import hashlib
from typing import Optional
class ConsistentHashRing:
def __init__(self, replicas: int = 150):
"""
replicas: number of virtual nodes per backend.
Higher values = better distribution but more memory.
"""
self.replicas = replicas
self._ring = {} # hash position -> backend URL
self._sorted_keys = []
def add_backend(self, backend: str):
for i in range(self.replicas):
virtual_key = f"{backend}:vnode:{i}"
h = int(hashlib.sha256(virtual_key.encode()).hexdigest(), 16)
self._ring[h] = backend
bisect.insort(self._sorted_keys, h)
def remove_backend(self, backend: str):
for i in range(self.replicas):
virtual_key = f"{backend}:vnode:{i}"
h = int(hashlib.sha256(virtual_key.encode()).hexdigest(), 16)
if h in self._ring:
del self._ring[h]
idx = bisect.bisect_left(self._sorted_keys, h)
if idx < len(self._sorted_keys) and self._sorted_keys[idx] == h:
self._sorted_keys.pop(idx)
    def get_backend(self, key: str) -> Optional[str]:
if not self._ring:
return None
h = int(hashlib.sha256(key.encode()).hexdigest(), 16)
idx = bisect.bisect_right(self._sorted_keys, h) % len(self._sorted_keys)
return self._ring[self._sorted_keys[idx]]
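A quick demonstration of the stability property - adding a fourth backend should remap only about a quarter of keys, versus nearly all of them under modulo hashing. Backend URLs and key names are illustrative:
# Measure remapping when the pool grows from 3 to 4 backends
ring = ConsistentHashRing(replicas=150)
for b in ["http://vllm-0:8000", "http://vllm-1:8000", "http://vllm-2:8000"]:
    ring.add_backend(b)

keys = [f"prefix-{i}" for i in range(1000)]
before = {k: ring.get_backend(k) for k in keys}
ring.add_backend("http://vllm-3:8000")
moved = sum(1 for k in keys if ring.get_backend(k) != before[k])
print(f"{moved}/1000 keys remapped")  # expect roughly 250, not ~750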
Common Mistakes
:::danger Using Round-Robin for Variable-Cost LLM Requests
Round-robin distributes requests evenly by count, not by compute cost. One 12,000-token request can consume as much GPU time as 30 typical requests. With round-robin, a server handling one long request will receive 3-4 more requests via round-robin while it is still processing the first one, creating a pile-up. Always use least_conn as the minimum baseline for LLM load balancing, and implement load-aware routing (tracking estimated GPU-seconds per backend) for high-traffic deployments.
:::
:::danger Not Disabling Proxy Buffering for Streaming Responses
If proxy_buffering on is set in Nginx (the default), Nginx will buffer the entire LLM response before forwarding it to the client. For a 2,000-token streaming response at 40 tokens/second, this means the client waits 50 seconds for the first token instead of receiving tokens incrementally. Always set proxy_buffering off and proxy_request_buffering off for LLM proxies. This is the single most common configuration mistake in LLM serving deployments.
:::
:::warning Prefix Hash Collisions Overloading One Backend
If your application happens to use only 2-3 distinct system prompts and you have 8 backends, consistent hashing will route all traffic with those 2-3 prefixes to 2-3 backends, leaving the other 5-6 backends idle. The gateway code above mitigates this with a load threshold check: if the prefix-preferred backend is more than 2 requests above the least-loaded backend, it falls back to least-load routing. Always monitor per-backend request distribution in your routing layer and tune the load threshold to balance cache hit rate vs load distribution.
:::
:::warning Setting Timeouts Too Short for Long-Context Requests
A 12,000-token input with 3,000 tokens of output at 40 tokens/second takes 75 seconds. If your Nginx proxy_read_timeout is set to 60 seconds (a common default), this request will be killed mid-generation. Set proxy_read_timeout to at least 300 seconds for interactive LLM APIs and 600 seconds for batch processing endpoints. Different timeout values for different routes are achievable with Nginx location blocks - use stricter timeouts on lightweight endpoints (embeddings, health checks) and generous timeouts on generation endpoints.
:::
:::warning Not Propagating Request IDs for Distributed Tracing
When a request fails in a multi-server LLM serving setup, you need to know which backend it landed on, which model version it used, and what the KV cache state was. Without propagating a unique request ID through every hop (client to gateway, gateway to vLLM backend), correlating logs across services is nearly impossible. Generate a UUID at the gateway layer, set it as X-Request-ID in the forwarded request, log it with every decision (which backend selected, routing reason, latency), and include it in error responses so users can report it to support.
:::
Interview Q&A
Q: Why does round-robin load balancing perform poorly for LLM workloads, and what algorithm should you use instead?
Round-robin fails because it assumes all requests have equal compute cost. For LLM inference, a single long-context request (12,000 tokens input, 3,000 tokens output) can consume 30-100x the GPU time of a short request (500 tokens input, 50 tokens output). Round-robin distributes by count, not by GPU-seconds, so servers receiving several long requests fall far behind servers receiving short requests - but round-robin continues sending equal request counts to both.
The improved algorithms in order of sophistication are:
least_conn (Nginx built-in): route to the server with fewest active connections. Better than round-robin because long requests hold connections open, creating a natural back-pressure signal. Still imperfect because two connections can represent wildly different amounts of work.
Weighted least-load: maintain a running estimate of GPU-seconds consumed per backend. Assign estimated cost to each new request (based on prompt token count), add it to the selected backend's load score, subtract it when the response completes. Route new requests to the backend with the lowest current load score. This is the algorithm to implement for high-traffic LLM serving.
Prefix-aware + least-load hybrid: use consistent hashing on the request prefix as the first routing criterion, but fall back to least-load if the prefix-preferred backend is significantly more loaded. This captures KV cache reuse benefits while preventing overload on popular-prefix backends.
Q: Explain KV cache prefix reuse and how routing can increase its effectiveness.
When vLLM processes a request, it computes key-value attention pairs for each token in the prompt. This is computationally expensive: for a 2,000-token prompt on a 7B model, prefill takes roughly 400ms. vLLM's Automatic Prefix Caching (APC) stores these KV pairs in GPU memory. If a subsequent request begins with the same sequence of tokens (the same system prompt, for example), vLLM reuses the cached KV pairs instead of recomputing them. This reduces TTFT from 400ms to roughly 20ms for that prefix.
The catch: the cache is per-process, per-server. If requests with the same 2,000-token system prompt are distributed round-robin across 5 servers, each server only sees 20% of those requests. The cache on each server is cold most of the time. By routing requests with the same prefix to the same server using consistent hashing, you ensure the server that processed the first request with that prefix will process subsequent ones too - keeping its KV cache warm.
The impact is substantial for applications with heavy system prompts (agents with tool schemas, document QA systems with context prepended, coding assistants with examples). In practice, teams report 30-60% reduction in average TTFT for prefix-heavy workloads after implementing prefix-aware routing.
Q: How does a circuit breaker work in the context of LLM serving, and what failure modes does it protect against?
A circuit breaker tracks the health of each backend by monitoring the success/failure rate of requests forwarded to it. When failures exceed a threshold (typically 5 consecutive failures), the circuit "opens" - the routing layer stops sending requests to that backend entirely and instead routes to healthy backends.
After a recovery timeout (typically 60 seconds), the circuit enters a "half-open" state where it allows one probe request through. If the probe succeeds, the circuit closes and normal routing resumes. If the probe fails, the circuit opens again.
The failure modes this protects against in LLM serving:
GPU out-of-memory errors: vLLM's KV cache is full and all new requests return 503. Without a circuit breaker, all 5 backends with full KV caches keep receiving requests that immediately fail. With a circuit breaker, each backend opens after 5 failures and the load balancer shifts traffic to less-saturated instances.
Model loading after restart: a newly started vLLM pod takes 3-8 minutes to load weights. During this window, requests to it fail with connection refused or 503. The circuit breaker opens the backend after 5 failures, preventing the loading pod from receiving traffic until it is actually ready (and the health check circuit probe succeeds).
CUDA errors: GPU hardware errors or driver issues cause inference failures. The circuit breaker opens the affected backend and allows the on-call engineer time to investigate and potentially reschedule the pod to a healthy node.
Q: How would you implement a model routing gateway that sends simple queries to a 7B model and complex queries to a 70B model? What signals would you use?
The routing decision uses four signals in order of reliability:
First, explicit client signals: an X-Model-Tier header or a model field in the request body that explicitly specifies which tier to use. Client applications that know their use case should be able to override routing.
Second, prompt length: estimate token count from character count (roughly 4 characters per token). Prompts over 16,000 tokens require the large model for context window. Prompts over 4,000 tokens route to medium. This is the most reliable signal.
Third, requested output length: max_tokens over 2,000 suggests complex generation (code, essays, analysis) and routes to the large model. This is a strong signal - users who set max_tokens: 4096 are not asking simple questions.
Fourth, complexity keywords: patterns like "implement", "analyze", "step-by-step", "debug", "proof" in the user message suggest complex tasks. This is a weak signal - keyword matching has false positives and false negatives - but it is cheap to compute and contributes to the routing decision when other signals are ambiguous.
The implementation maintains separate backend pools per model tier and uses the routing decision to select the pool. The gateway patches the model field in the request body to match the actual model served by the selected backend, ensuring the response's model field is accurate for clients that check it.
Q: What are the tradeoffs between sticky routing (session affinity) and stateless routing for LLM serving?
Sticky routing (routing all requests from a particular client or session to the same backend) maximizes KV cache reuse within multi-turn conversations, because the conversation history accumulates in the KV cache on one server. The tradeoff is load imbalance: if one client sends 100x more requests than average, all their requests land on one server.
Stateless routing (routing each request independently based on current load) maximizes load balance but sacrifices KV cache reuse for multi-turn conversations. Each turn in a conversation may land on a different server, and each server has to recompute the previous conversation tokens from scratch.
The hybrid approach - which is what prefix-aware routing implements - routes based on a hash of the conversation prefix (including system prompt and conversation history up to the current turn). This achieves the benefits of sticky routing for prefix computation without hard-binding a client to a specific server permanently. If the preferred server is overloaded, the router can fall back to least-load routing, accepting a KV cache miss in exchange for lower latency.
For production applications with multi-turn conversations, the practical recommendation is: implement prefix-aware routing with a load threshold fallback. Measure your KV cache hit rate and TTFT distribution. If prefix routing is working, you will see TTFT drop significantly for subsequent conversation turns. If hit rates are low, your traffic pattern may not have enough prefix sharing to justify the routing complexity.
Q: A backend in your LLM serving cluster starts returning high-latency responses (20 seconds instead of 2 seconds) but is not failing outright. How do you detect and handle this?
Standard circuit breakers only trigger on failures (5xx responses, connection errors). A degraded backend returning slow 200 responses bypasses the circuit breaker entirely. This is a common failure mode: GPU thermal throttling, KV cache eviction causing many recomputes, or a memory leak causing performance degradation without outright errors.
Detection: add a timeout-based failure signal to the circuit breaker. If a request takes longer than 2x the p99 baseline latency, treat it as a soft failure and increment the failure counter. Use a histogram of recent request durations per backend to detect this:
# Degradation detection - track p95 latency per backend
# (all_backends and circuit_breaker come from the gateway module)
backend_latencies = {b: [] for b in all_backends}

def record_request(backend, duration_seconds, success):
    if success:
        backend_latencies[backend].append(duration_seconds)
        # Keep only the last 100 samples
        if len(backend_latencies[backend]) > 100:
            backend_latencies[backend].pop(0)
    # Check for degradation: p95 latency above an absolute threshold
    if len(backend_latencies[backend]) >= 20:
        sorted_durations = sorted(backend_latencies[backend])
        idx = min(int(len(sorted_durations) * 0.95), len(sorted_durations) - 1)
        p95 = sorted_durations[idx]
        if p95 > 10.0:  # a 10-second p95 is degraded against a ~2-second baseline
            circuit_breaker.record_failure(backend)
Handling: once detected, open the circuit for the slow backend and allow it to recover. Add an alert that fires when a backend's p95 latency exceeds the threshold - slow backends often indicate resource pressure (GPU temperature, KV cache fragmentation) that benefits from investigation and potentially pod restart.
