LiteLLM
The Dependency Nightmare
The platform engineering team had been quietly dreading the upcoming architecture review for months. What started as a clean OpenAI integration eighteen months ago had metastasized into a maze of provider-specific code scattered across eleven microservices. Service A used the Anthropic SDK directly, importing it and calling client.messages.create(). Service B used the OpenAI SDK. Service C had a hand-rolled HTTP client that called Amazon Bedrock with AWS Signature v4 authentication. Service D called Azure OpenAI with a different base URL pattern and a different API version header than Service B's direct OpenAI calls. Each service had its own hardcoded retry logic, each written slightly differently by a different engineer on a different sprint.
The situation came to a head during a late-night push to add Gemini Pro support. Three engineers worked independently on three different services. Each discovered different quirks in the Gemini response format, wrote different error handling, and made different assumptions about streaming behavior. By the time the feature was done, the codebase had three inconsistent Gemini integrations, and no engineer could fully explain how either of the other two worked.
The lead platform engineer proposed a different approach: route all LLM traffic through a single proxy. Give every service the same OpenAI-compatible API. Handle all the provider-specific translation in one place. The team was skeptical - adding a network hop sounded like unnecessary complexity. But after a one-week proof of concept with LiteLLM Proxy, the result was undeniable: adding Gemini support now meant one config file change. Switching a service from GPT-4o to Claude was two lines. And for the first time, the team had a single dashboard showing how much every service was spending on every provider, broken down by team and feature.
That was the moment LiteLLM stopped being a convenience and became platform infrastructure.
Why This Exists
LiteLLM was created to solve one specific problem: every LLM provider has a different API.
OpenAI, Anthropic, Google, Cohere, Mistral, Bedrock, Azure OpenAI - each has its own authentication scheme, its own request format, its own response format, its own error codes, its own streaming protocol. Building against more than one provider means understanding all of these differences and handling the translations in application code. Adding a new provider means learning another API from scratch and updating multiple services.
The key insight that makes LiteLLM work: the OpenAI API format has become the de facto standard that most downstream tooling expects. Rather than inventing a new universal format, LiteLLM translates every provider into the OpenAI format. Application code speaks OpenAI to LiteLLM; LiteLLM speaks whatever the underlying provider requires.
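To make the translation idea concrete, here is a minimal sketch of what converting an OpenAI-style chat request into an Anthropic-style one might involve. The function name `openai_to_anthropic` and the default `max_tokens` value are assumptions for illustration only; LiteLLM's real translation layer covers far more (tools, images, streaming, error mapping).

```python
from typing import Any


def openai_to_anthropic(request: dict[str, Any]) -> dict[str, Any]:
    """Translate an OpenAI-style chat request into an Anthropic-style one.

    Hypothetical helper for illustration; not LiteLLM's actual code.
    """
    messages = request["messages"]
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    chat_messages = [
        {"role": m["role"], "content": m["content"]}
        for m in messages
        if m["role"] != "system"
    ]
    translated: dict[str, Any] = {
        "model": request["model"],
        "messages": chat_messages,
        # Anthropic requires max_tokens; 1024 is an assumed default for this sketch.
        "max_tokens": request.get("max_tokens", 1024),
    }
    if system_parts:
        # Anthropic takes the system prompt as a top-level field, not as a message.
        translated["system"] = "\n".join(system_parts)
    return translated


openai_request = {
    "model": "claude-sonnet-4-6",
    "messages": [
        {"role": "system", "content": "You are concise."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
}
anthropic_request = openai_to_anthropic(openai_request)
print(anthropic_request)
```

The application only ever builds the OpenAI-shaped dictionary; everything provider-specific lives in one translation function per provider.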
Ishaan Jaffer open-sourced LiteLLM in late 2023 under the MIT license. By early 2025 it supported over 100 providers and models, had over 13,000 GitHub stars, and was in production at hundreds of teams. The project grew into two distinct modes: a Python SDK for in-process use, and a Proxy Server for network-level deployment as a true gateway.
Two Modes: SDK vs Proxy
SDK Mode: Import litellm into your Python application and call litellm.completion(). Provider translation happens in-process. Simple to set up. No network overhead. Works well for single-service Python applications or when you want provider abstraction without a separate network service.
Proxy Mode: Run litellm --config config.yaml to start an HTTP server on port 4000. Every service points its OpenAI SDK at http://your-proxy:4000. The proxy handles translation, routing, caching, and cost tracking centrally. Config changes propagate to all consumers without redeployment. This is the mode that gives you the full gateway architecture.
For teams with multiple services or multiple languages, Proxy Mode is the right choice. For a single Python application where simplicity matters more than centralization, SDK Mode is sufficient.
SDK Mode: Provider Abstraction in Python
pip install litellm anthropic openai

import litellm

# LiteLLM reads API keys from environment variables:
# ANTHROPIC_API_KEY, OPENAI_API_KEY, GEMINI_API_KEY, etc.
# No SDK-specific setup required per provider.


def demo_provider_abstraction() -> None:
    """
    The same litellm.completion() call works for any provider.
    Only the model string changes - the response format is always OpenAI-compatible.
    """
    messages = [{"role": "user", "content": "What is the capital of France?"}]

    # Anthropic Claude - prefix: none (inferred from model name)
    response_claude = litellm.completion(
        model="claude-sonnet-4-6",
        messages=messages,
        max_tokens=64,
    )
    print(f"Claude: {response_claude.choices[0].message.content}")
    print(f" Tokens: {response_claude.usage.total_tokens}")

    # OpenAI GPT-4o
    response_gpt = litellm.completion(
        model="gpt-4o",
        messages=messages,
        max_tokens=64,
    )
    print(f"GPT-4o: {response_gpt.choices[0].message.content}")

    # Google Gemini - prefix: "gemini/"
    response_gemini = litellm.completion(
        model="gemini/gemini-1.5-pro",
        messages=messages,
        max_tokens=64,
    )
    print(f"Gemini: {response_gemini.choices[0].message.content}")

    # Local Ollama - specify api_base
    response_local = litellm.completion(
        model="ollama/llama3.2",
        messages=messages,
        api_base="http://localhost:11434",
    )
    print(f"Llama (local): {response_local.choices[0].message.content}")


def demo_fallback_sdk() -> None:
    """
    LiteLLM SDK supports automatic fallback via the fallbacks parameter.
    If claude-sonnet-4-6 fails, try gpt-4o, then gpt-4o-mini.
    """
    messages = [{"role": "user", "content": "Explain gradient descent in one sentence."}]
    response = litellm.completion(
        model="claude-sonnet-4-6",
        messages=messages,
        fallbacks=["gpt-4o", "gpt-4o-mini"],
        num_retries=2,
        request_timeout=30,
    )
    # response.model tells you which model actually handled it
    print(f"Handled by: {response.model}")
    print(f"Response: {response.choices[0].message.content[:200]}")


def demo_cost_tracking() -> None:
    """
    LiteLLM computes request cost automatically.
    Use litellm.completion_cost() after every call.
    """
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a haiku about distributed systems."},
    ]
    response = litellm.completion(
        model="claude-sonnet-4-6",
        messages=messages,
        max_tokens=64,
    )
    cost_usd = litellm.completion_cost(completion_response=response)
    print(f"Model: {response.model}")
    print(f"Input tokens: {response.usage.prompt_tokens}")
    print(f"Output tokens: {response.usage.completion_tokens}")
    print(f"Cost: ${cost_usd:.8f} USD")
    print(f"Response:\n{response.choices[0].message.content}")


def demo_budget_manager() -> None:
    """
    LiteLLM SDK includes a BudgetManager that tracks per-user spend.
    The application checks the budget before each call and skips the
    request once the user has exceeded it.
    """
    budget_manager = litellm.BudgetManager(project_name="my-app", client_type="local")
    user_id = "user_8821"

    # Create a $5/month budget for this user
    budget_manager.create_budget(
        total_budget=5.00,
        user=user_id,
        duration="monthly",
    )
    messages = [{"role": "user", "content": "What is a transformer?"}]

    # Check budget before calling
    if budget_manager.get_current_cost(user=user_id) < budget_manager.get_total_budget(user=user_id):
        response = litellm.completion(
            model="claude-sonnet-4-6",
            messages=messages,
            max_tokens=200,
        )
        budget_manager.update_cost(user=user_id, completion_obj=response)
        # Remaining budget = total budget minus spend recorded so far
        remaining = (
            budget_manager.get_total_budget(user=user_id)
            - budget_manager.get_current_cost(user=user_id)
        )
        print(f"Remaining budget: ${remaining:.4f}")
    else:
        print(f"Budget exceeded for {user_id}")


if __name__ == "__main__":
    demo_provider_abstraction()
    demo_fallback_sdk()
    demo_cost_tracking()
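The interaction between `num_retries` and `fallbacks` is easier to see in isolation. The following is a minimal sketch of a retry-then-fallback loop, assuming a hypothetical `call_model` callable that raises on failure. It is not LiteLLM's implementation, which also handles error classification, timeouts, and backoff.

```python
import time
from typing import Callable, Optional


def complete_with_fallbacks(
    call_model: Callable[[str], str],
    primary: str,
    fallbacks: list[str],
    num_retries: int = 2,
    backoff_seconds: float = 0.0,
) -> tuple[str, str]:
    """Try the primary model (with retries), then each fallback in order.

    Returns (model_that_answered, response_text).
    """
    last_error: Optional[Exception] = None
    for model in [primary, *fallbacks]:
        for _attempt in range(num_retries + 1):  # first try plus retries
            try:
                return model, call_model(model)
            except Exception as exc:  # a real client would only retry transient errors
                last_error = exc
                time.sleep(backoff_seconds)
    raise RuntimeError("all models failed") from last_error


attempts: list[str] = []


def fake_call(model: str) -> str:
    """Stand-in provider call: the primary is 'down', everything else answers."""
    attempts.append(model)
    if model == "claude-sonnet-4-6":
        raise ConnectionError("provider unavailable")
    return f"answer from {model}"


handled_by, text = complete_with_fallbacks(
    fake_call, "claude-sonnet-4-6", ["gpt-4o", "gpt-4o-mini"], num_retries=2
)
print(handled_by, attempts)
```

In this sketch the primary is attempted three times (one try plus two retries) before the first fallback is used, which is why retry counts and timeouts multiply into worst-case latency.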
Proxy Mode: The Production Configuration
The real power of LiteLLM comes in proxy mode. A YAML config file defines your model list, routing strategy, fallbacks, cache settings, and authentication. The proxy runs as a standalone HTTP server that all services call.
Step 1: Write the Config
# litellm_config.yaml
model_list:
  # Primary: Claude Sonnet - for complex reasoning tasks
  - model_name: claude-sonnet
    litellm_params:
      model: claude-sonnet-4-6
      api_key: os.environ/ANTHROPIC_API_KEY
      rpm: 500        # requests per minute - used by usage-based routing
      tpm: 200000     # tokens per minute

  # Budget model: Claude Haiku - for high-volume simple tasks
  - model_name: claude-haiku
    litellm_params:
      model: claude-haiku-4-5-20251001
      api_key: os.environ/ANTHROPIC_API_KEY
      rpm: 2000
      tpm: 1000000

  # OpenAI fallback
  - model_name: gpt-4o
    litellm_params:
      model: gpt-4o
      api_key: os.environ/OPENAI_API_KEY
      rpm: 500
      tpm: 150000

  # Load-balanced group: three Anthropic API keys, same model
  # LiteLLM treats same model_name as a group to load-balance across
  - model_name: claude-sonnet-lb
    litellm_params:
      model: claude-sonnet-4-6
      api_key: os.environ/ANTHROPIC_API_KEY_1
  - model_name: claude-sonnet-lb
    litellm_params:
      model: claude-sonnet-4-6
      api_key: os.environ/ANTHROPIC_API_KEY_2
  - model_name: claude-sonnet-lb
    litellm_params:
      model: claude-sonnet-4-6
      api_key: os.environ/ANTHROPIC_API_KEY_3

router_settings:
  routing_strategy: latency-based-routing  # lowest recently observed latency wins
  num_retries: 3
  retry_after: 5       # minimum seconds to wait before retrying
  allowed_fails: 3     # failures before cooldown
  cooldown_time: 60    # seconds to cooldown failing key

litellm_settings:
  # Response caching backed by Redis.
  # Note: `type: redis` is exact-match caching. similarity_threshold only takes
  # effect with semantic caching (`type: redis-semantic`), which also requires
  # an embedding model (redis_semantic_cache_embedding_model).
  cache: true
  cache_params:
    type: redis
    host: redis
    port: 6379
    similarity_threshold: 0.93

  # Global fallback chain: if claude-sonnet fails, try these in order
  fallbacks:
    - claude-sonnet: ["gpt-4o", "claude-haiku"]

  # Observability callbacks
  success_callback: ["langfuse"]
  failure_callback: ["langfuse"]

  # Per-request timeout
  request_timeout: 60

general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY  # required auth key for proxy admin API
  database_url: os.environ/DATABASE_URL      # PostgreSQL for spend tracking
  store_model_in_db: true
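The `similarity_threshold` setting is easiest to understand through a toy version of semantic caching: a cached response is reused when the new prompt's embedding is close enough to a stored one. The sketch below assumes embeddings are already computed (the two-dimensional vectors are made up) and uses an in-memory list instead of Redis; the class and method names are hypothetical.

```python
import math
from typing import Optional


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Toy in-memory semantic cache keyed by prompt embeddings."""

    def __init__(self, similarity_threshold: float = 0.93) -> None:
        self.similarity_threshold = similarity_threshold
        self._entries: list[tuple[list[float], str]] = []

    def put(self, embedding: list[float], response: str) -> None:
        self._entries.append((embedding, response))

    def get(self, embedding: list[float]) -> Optional[str]:
        """Return the closest cached response if it clears the threshold."""
        best_score, best_response = 0.0, None
        for stored, response in self._entries:
            score = cosine_similarity(embedding, stored)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.similarity_threshold else None


cache = SemanticCache(similarity_threshold=0.93)
cache.put([1.0, 0.0], "Paris is the capital of France.")

near_hit = cache.get([0.99, 0.05])  # almost the same direction as the stored vector
far_miss = cache.get([0.6, 0.8])    # cosine similarity 0.6, below the threshold
print(near_hit, far_miss)
```

A higher threshold means fewer false hits (wrong answers served from cache) at the price of a lower hit rate, which is the trade-off the config value controls.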
Step 2: Run the Proxy
# Install with proxy dependencies
pip install 'litellm[proxy]' psycopg2-binary
# Set environment variables
export ANTHROPIC_API_KEY="sk-ant-..."
export ANTHROPIC_API_KEY_1="sk-ant-key1..."
export ANTHROPIC_API_KEY_2="sk-ant-key2..."
export ANTHROPIC_API_KEY_3="sk-ant-key3..."
export OPENAI_API_KEY="sk-..."
export LITELLM_MASTER_KEY="sk-litellm-master..."
export DATABASE_URL="postgresql://litellm:password@localhost:5432/litellm"
export LANGFUSE_PUBLIC_KEY="pk-lf-..."
export LANGFUSE_SECRET_KEY="sk-lf-..."
# Start the proxy (development)
litellm --config litellm_config.yaml --port 4000
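# Verify it's running
# /health calls each configured model and requires the master key once one is set
curl http://localhost:4000/health \
  -H "Authorization: Bearer sk-litellm-master..."
curl http://localhost:4000/v1/models \
  -H "Authorization: Bearer sk-litellm-master..."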
Step 3: Call the Proxy from Application Code
Once the proxy is running, application code uses the OpenAI SDK pointed at the proxy. No provider-specific SDKs or authentication in application code.
import json

import anthropic
import openai

# ─────────────────────────────────────────────────────────────────────────────
# Option A: OpenAI SDK → LiteLLM Proxy (recommended for most services)
# Works from any language that has an OpenAI-compatible SDK
# ─────────────────────────────────────────────────────────────────────────────
proxy_client = openai.OpenAI(
    api_key="sk-litellm-master...",    # your LiteLLM master key
    base_url="http://localhost:4000",  # or your internal service URL
)


def call_claude_via_proxy() -> str:
    """
    Route to Claude Sonnet through the proxy.
    The model name matches what is defined in litellm_config.yaml - not the Anthropic model ID.
    """
    response = proxy_client.chat.completions.create(
        model="claude-sonnet",  # matches model_name in config
        messages=[
            {"role": "system", "content": "You are a concise technical assistant."},
            {"role": "user", "content": "Explain gradient descent in two sentences."},
        ],
        max_tokens=256,
    )
    print(f"Model: {response.model}")
    print(f"Response: {response.choices[0].message.content}")
    return response.choices[0].message.content


def call_load_balanced_group() -> str:
    """
    Send to the load-balanced group. LiteLLM routes to the key with
    the lowest recently observed latency automatically.
    """
    response = proxy_client.chat.completions.create(
        model="claude-sonnet-lb",  # the load-balanced group of 3 keys
        messages=[
            {"role": "user", "content": "Write a Python function to reverse a string."},
        ],
        max_tokens=256,
    )
    print(f"Load balanced response: {response.choices[0].message.content[:200]}")
    return response.choices[0].message.content


def call_with_cost_attribution() -> dict:
    """
    Identify the end user and attach metadata for cost attribution.
    LiteLLM records these in PostgreSQL against the spend entry.
    Team attribution comes from the team the virtual key belongs to,
    not from the request body.
    """
    response = proxy_client.chat.completions.create(
        model="claude-sonnet",
        messages=[{"role": "user", "content": "What is the CAP theorem?"}],
        max_tokens=300,
        user="user_8821",  # standard OpenAI field, recorded as the end user
        extra_body={
            "metadata": {
                "feature": "documentation-assistant",
                "session_id": "sess_abc123",
                "environment": "production",
            },
        },
    )
    return {
        "content": response.choices[0].message.content,
        "model": response.model,
        # model_dump() yields plain dicts, so the result stays JSON-serializable
        "usage": response.usage.model_dump() if response.usage else {},
    }


# ─────────────────────────────────────────────────────────────────────────────
# Option B: Anthropic SDK → LiteLLM Proxy
# Useful when migrating existing code that already uses the Anthropic SDK
# ─────────────────────────────────────────────────────────────────────────────
anthropic_via_proxy = anthropic.Anthropic(
    api_key="sk-litellm-master...",
    base_url="http://localhost:4000",
)


def call_anthropic_sdk_through_proxy() -> str:
    """
    The Anthropic SDK can call through the LiteLLM proxy.
    LiteLLM translates between Anthropic message format and its internal format.
    No changes to Anthropic SDK call syntax required.
    """
    message = anthropic_via_proxy.messages.create(
        model="claude-sonnet",  # proxy model_name, not Anthropic model ID
        max_tokens=256,
        system="You are a helpful assistant.",
        messages=[{"role": "user", "content": "What is a Merkle tree?"}],
    )
    print("Via Anthropic SDK -> LiteLLM proxy:")
    print(f" {message.content[0].text[:200]}")
    return message.content[0].text


if __name__ == "__main__":
    call_claude_via_proxy()
    call_load_balanced_group()
    result = call_with_cost_attribution()
    print(f"\nCost attribution result: {json.dumps(result, indent=2)[:400]}")
    call_anthropic_sdk_through_proxy()
Routing Strategies Compared
LiteLLM's router supports five traffic distribution strategies. Choose based on your use case.
| Strategy | When to use | Trade-off |
|---|---|---|
| simple-shuffle | All keys are equivalent, just need distribution | No performance awareness; can route to degraded key |
| least-busy | Streaming responses where in-flight count matters | Doesn't account for token consumption variance |
| latency-based-routing | User-facing, latency-sensitive features | Slight overhead to track recent latency per key |
| cost-based-routing | Batch jobs where cost minimization is priority | May concentrate traffic; can exhaust one key |
| usage-based-routing-v2 | You have strict TPM/RPM provider limits | Requires Redis; most accurate quota-aware routing |
For user-facing production deployments, latency-based-routing with Redis is the best default. It naturally avoids overloaded keys, routes around degraded providers, and degrades gracefully under load without additional configuration.
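As a rough illustration of how latency-based routing avoids slow keys, the sketch below keeps a short rolling window of observed latencies per deployment and picks the one with the lowest average. The class name, window size, and the "try unmeasured deployments first" rule are assumptions for this example, not a description of LiteLLM's internals.

```python
import random
from collections import deque
from statistics import mean


class LatencyRouter:
    """Toy router: pick the deployment with the lowest recent average latency."""

    def __init__(self, deployments: list[str], window: int = 10) -> None:
        # One bounded history per deployment; old samples fall off automatically.
        self._latencies = {name: deque(maxlen=window) for name in deployments}

    def record(self, deployment: str, latency_seconds: float) -> None:
        self._latencies[deployment].append(latency_seconds)

    def pick(self) -> str:
        # Deployments with no samples are tried first so every key gets measured.
        untested = [name for name, samples in self._latencies.items() if not samples]
        if untested:
            return random.choice(untested)
        return min(self._latencies, key=lambda name: mean(self._latencies[name]))


router = LatencyRouter(["key-1", "key-2", "key-3"])
for name, latency in [("key-1", 0.9), ("key-1", 1.1), ("key-2", 0.4),
                      ("key-2", 0.5), ("key-3", 2.0)]:
    router.record(name, latency)

choice = router.pick()
print(choice)
```

Because the window is bounded, a key that was slow an hour ago but has recovered re-enters rotation once fresh samples replace the old ones.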
Team and User Budget Enforcement
LiteLLM Proxy supports team-level and user-level spend limits enforced at the database layer. The admin API manages teams, users, and virtual keys.
import httpx

LITELLM_BASE = "http://localhost:4000"
MASTER_KEY = "sk-litellm-master..."
HEADERS = {"Authorization": f"Bearer {MASTER_KEY}", "Content-Type": "application/json"}


def create_team_with_budget() -> str:
    """Create a team entity with a monthly budget limit and rate limits."""
    payload = {
        "team_id": "team-platform",
        "team_alias": "Platform Engineering",
        "max_budget": 500.00,        # USD per billing period
        "budget_duration": "30d",    # duration string; budget resets every 30 days
        "tpm_limit": 500_000,        # tokens per minute
        "rpm_limit": 1_000,          # requests per minute
        "models": ["claude-sonnet", "claude-haiku", "gpt-4o"],
    }
    response = httpx.post(f"{LITELLM_BASE}/team/new", headers=HEADERS, json=payload)
    data = response.json()
    print(f"Team created: {data.get('team_id')}")
    return data.get("team_id", "")


def create_user_virtual_key(team_id: str) -> str:
    """
    Generate a virtual API key for a user.
    The key is scoped to the team's budget and model allowlist.
    Users call the proxy with this key - not the real provider keys.
    """
    payload = {
        "user_id": "user_8821",
        "team_id": team_id,
        "key_alias": "alice-dev-key",
        "max_budget": 50.00,         # USD per user per budget period
        "budget_duration": "30d",
        "models": ["claude-sonnet", "claude-haiku"],  # restrict model access
        "tpm_limit": 100_000,
        "metadata": {"role": "senior-engineer", "department": "platform"},
    }
    response = httpx.post(f"{LITELLM_BASE}/key/generate", headers=HEADERS, json=payload)
    data = response.json()
    virtual_key = data["key"]
    print(f"Virtual key: {virtual_key[:20]}...")
    return virtual_key


def inspect_team_spend(team_id: str) -> dict:
    """Query current spend and remaining budget for a team."""
    response = httpx.get(
        f"{LITELLM_BASE}/team/info",
        headers=HEADERS,
        params={"team_id": team_id},
    )
    info = response.json()
    # Team fields may be nested under "team_info"; fall back to the top level.
    team = info.get("team_info", info)
    spend = team.get("spend") or 0.0
    budget = team.get("max_budget") or 0.0
    print(f"Team: {team_id}")
    print(f" Spend: ${spend:.4f}")
    print(f" Budget: ${budget:.2f}")
    print(f" Remaining: ${budget - spend:.4f}")
    return info


def list_spend_logs(limit: int = 10) -> list[dict]:
    """Return recent spend log entries across all users and teams."""
    response = httpx.get(
        f"{LITELLM_BASE}/spend/logs",
        headers=HEADERS,
        params={"limit": limit},
    )
    logs = response.json()
    print(f"\nRecent spend log ({len(logs)} entries):")
    for entry in logs[:5]:
        print(
            f" [{entry.get('call_type', 'chat')}] "
            f"user={entry.get('user') or 'anon'} | "
            f"model={entry.get('model', '?')} | "
            f"${entry.get('spend', 0):.6f}"
        )
    return logs


def set_per_model_budget(team_id: str) -> None:
    """
    Set different budgets for different models within the same team.
    Useful when expensive models need tighter limits than cheap models.
    """
    response = httpx.post(
        f"{LITELLM_BASE}/team/update",
        headers=HEADERS,
        json={
            "team_id": team_id,
            "model_spend_map": {
                "claude-opus-4-6": 50.00,  # strict limit for expensive model
                "claude-sonnet-4-6": 300.00,
                "claude-haiku-4-5-20251001": 150.00,
            },
        },
    )
    print(f"Per-model budget set: {response.status_code}")


if __name__ == "__main__":
    team_id = create_team_with_budget()
    virtual_key = create_user_virtual_key(team_id)
    inspect_team_spend(team_id)
    list_spend_logs()
    set_per_model_budget(team_id)
Docker Compose: Full Production Stack
Running LiteLLM with Redis (for caching and routing state) and PostgreSQL (for spend tracking) is the recommended production setup.
# docker-compose.yml
version: "3.9"

services:
  litellm:
    image: ghcr.io/berriai/litellm:main-stable
    ports:
      - "4000:4000"
    volumes:
      - ./litellm_config.yaml:/app/config.yaml
    environment:
      ANTHROPIC_API_KEY: ${ANTHROPIC_API_KEY}
      ANTHROPIC_API_KEY_1: ${ANTHROPIC_API_KEY_1}
      ANTHROPIC_API_KEY_2: ${ANTHROPIC_API_KEY_2}
      ANTHROPIC_API_KEY_3: ${ANTHROPIC_API_KEY_3}
      OPENAI_API_KEY: ${OPENAI_API_KEY}
      LITELLM_MASTER_KEY: ${LITELLM_MASTER_KEY}
      DATABASE_URL: postgresql://litellm:litellm@postgres:5432/litellm
      REDIS_HOST: redis
      REDIS_PORT: "6379"
      LANGFUSE_PUBLIC_KEY: ${LANGFUSE_PUBLIC_KEY}
      LANGFUSE_SECRET_KEY: ${LANGFUSE_SECRET_KEY}
    # --detailed_debug is useful while developing but is left off here:
    # verbose logging slows the proxy and can write prompts to the logs.
    command: --config /app/config.yaml --port 4000
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:4000/health/readiness"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 15s

  postgres:
    image: postgres:16-alpine
    environment:
      POSTGRES_DB: litellm
      POSTGRES_USER: litellm
      POSTGRES_PASSWORD: litellm
    volumes:
      - postgres_data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U litellm -d litellm"]
      interval: 10s
      timeout: 5s
      retries: 5

  redis:
    image: redis:7-alpine
    command: redis-server --maxmemory 512mb --maxmemory-policy allkeys-lru
    volumes:
      - redis_data:/data
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 5s
      retries: 5

volumes:
  postgres_data:
  redis_data:
# Deploy the full stack
docker compose up -d
# Verify the proxy is healthy
curl http://localhost:4000/health/readiness
# Check all configured models
curl http://localhost:4000/v1/models \
-H "Authorization: Bearer sk-litellm-master..."
# Query spend summary
curl "http://localhost:4000/spend/logs?limit=20" \
-H "Authorization: Bearer sk-litellm-master..."
# View the admin UI (if enabled)
open http://localhost:4000/ui
Connecting to Langfuse for Full Observability
LiteLLM supports callbacks to observability platforms. Langfuse is the most popular option for AI-specific tracing.
# In litellm_config.yaml
litellm_settings:
  success_callback: ["langfuse"]
  failure_callback: ["langfuse"]

environment_variables:
  LANGFUSE_PUBLIC_KEY: "pk-lf-..."
  LANGFUSE_SECRET_KEY: "sk-lf-..."
  LANGFUSE_HOST: "https://cloud.langfuse.com"  # or your self-hosted URL
With this config active, every LLM call is automatically traced in Langfuse with model name, token counts, cost, latency, prompt, and response. No instrumentation code required in any application service.
Other supported callbacks: helicone, prometheus, datadog, s3, sentry, and custom HTTP webhooks for building your own observability pipeline.
Understanding the Request Lifecycle
Every request passes through: authentication, rate limiting, budget check, cache lookup, routing, provider call, cost recording, and observability callbacks - all in under 30ms of gateway overhead for a cache miss, and under 10ms for a cache hit.
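The ordering of those stages matters: cheap rejections (auth, limits, budget) come before the cache lookup, and a cache hit skips routing and the provider call entirely. The toy gateway below records the stages it runs so the two paths can be compared. Everything in it (class name, flat per-request cost, echo response) is a stand-in for illustration rather than LiteLLM code.

```python
from dataclasses import dataclass, field


@dataclass
class ToyGateway:
    """Records which lifecycle stages run for each request."""

    valid_keys: set
    budget_usd: float
    max_requests: int = 100
    spend_usd: float = 0.0
    request_count: int = 0
    cache: dict = field(default_factory=dict)
    trace: list = field(default_factory=list)

    def handle(self, api_key: str, prompt: str) -> str:
        self.trace.append("auth")
        if api_key not in self.valid_keys:
            raise PermissionError("unknown API key")

        self.trace.append("rate_limit")
        self.request_count += 1
        if self.request_count > self.max_requests:
            raise RuntimeError("rate limit exceeded")

        self.trace.append("budget_check")
        if self.spend_usd >= self.budget_usd:
            raise RuntimeError("budget exceeded")

        self.trace.append("cache_lookup")
        if prompt in self.cache:
            # Cache hit: no routing, no provider call, no new cost.
            self.trace.append("callbacks")
            return self.cache[prompt]

        self.trace.append("route")
        self.trace.append("provider_call")
        response = f"echo: {prompt}"  # stand-in for the real provider call
        self.trace.append("record_cost")
        self.spend_usd += 0.01        # assumed flat cost per request
        self.cache[prompt] = response
        self.trace.append("callbacks")
        return response


gateway = ToyGateway(valid_keys={"sk-demo"}, budget_usd=1.0)
gateway.handle("sk-demo", "hello")
miss_trace = list(gateway.trace)

gateway.trace.clear()
gateway.handle("sk-demo", "hello")
hit_trace = list(gateway.trace)

print("miss:", miss_trace)
print("hit: ", hit_trace)
```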
Kubernetes Deployment for Production Scale
For production deployments requiring high availability, run multiple LiteLLM replicas behind a load balancer. Redis and PostgreSQL must be external shared services.
# kubernetes/litellm-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: litellm-proxy
  namespace: ai-platform
spec:
  replicas: 3  # 3 replicas - shared Redis state keeps them consistent
  selector:
    matchLabels:
      app: litellm-proxy
  template:
    metadata:
      labels:
        app: litellm-proxy
    spec:
      containers:
        - name: litellm
          image: ghcr.io/berriai/litellm:main-stable
          ports:
            - containerPort: 4000
          command:
            - litellm
            - --config
            - /app/config.yaml
            - --port
            - "4000"
          env:
            - name: ANTHROPIC_API_KEY
              valueFrom:
                secretKeyRef:
                  name: llm-secrets
                  key: anthropic-api-key
            - name: REDIS_HOST
              value: "redis-cluster.ai-platform.svc.cluster.local"
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: llm-secrets
                  key: database-url
          volumeMounts:
            - name: config
              mountPath: /app/config.yaml
              subPath: config.yaml
          readinessProbe:
            httpGet:
              path: /health/readiness
              port: 4000
            initialDelaySeconds: 10
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health/liveliness
              port: 4000
            initialDelaySeconds: 30
            periodSeconds: 30
          resources:
            requests:
              memory: "512Mi"
              cpu: "250m"
            limits:
              memory: "1Gi"
              cpu: "500m"
      volumes:
        - name: config
          configMap:
            name: litellm-config
---
apiVersion: v1
kind: Service
metadata:
  name: litellm-proxy
  namespace: ai-platform
spec:
  selector:
    app: litellm-proxy
  ports:
    - port: 4000
      targetPort: 4000
  type: ClusterIP
Production Engineering Notes
:::tip Pin the LiteLLM Docker image version
LiteLLM releases frequently and occasionally introduces breaking YAML schema changes. Use a pinned tag like ghcr.io/berriai/litellm:main-v1.40.0 rather than a floating tag such as :main-latest or :main-stable in production. Read the changelog before upgrading.
:::
:::warning Health check the providers, not just the container
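A LiteLLM container can be running while all configured providers are returning errors. /health/readiness only confirms that the proxy itself is ready to accept traffic; it does not call the providers. The authenticated /health endpoint does make a real call to each configured model, so poll it from your monitoring system on a schedule and alert on unhealthy models. Keep the load balancer health check on /health/readiness, because /health is slower and consumes provider quota on every probe.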
:::
:::danger Always set the master key - never run with an open proxy
Without a master key configured, any caller can send requests through your proxy and charge costs to your provider accounts. Bind the proxy to an internal network interface (127.0.0.1:4000 or the cluster-internal IP), not 0.0.0.0, unless you have authentication at the network layer.
:::
:::info The proxy adds 10–30ms latency per request
The network hop through LiteLLM Proxy adds roughly 10–30ms. For LLM responses that take 500–2000ms, this overhead is roughly 0.5–6% (10ms on a 2000ms response up to 30ms on a 500ms response). For batch processing, it is negligible. For sub-50ms total latency requirements (extremely rare for LLM use cases), use SDK Mode instead.
:::
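The overhead percentages follow from simple division of proxy latency by model response time. A small worked calculation, using the latency figures quoted in the note, makes the range explicit:

```python
def overhead_percent(proxy_ms: float, llm_ms: float) -> float:
    """Proxy latency as a percentage of the LLM response time."""
    return 100.0 * proxy_ms / llm_ms


# Corner cases of the quoted ranges: 10-30ms proxy hop, 500-2000ms LLM response.
for proxy_ms in (10, 30):
    for llm_ms in (500, 2000):
        pct = overhead_percent(proxy_ms, llm_ms)
        print(f"{proxy_ms}ms hop on a {llm_ms}ms response: {pct:.1f}% overhead")

best_case = overhead_percent(10, 2000)
worst_case = overhead_percent(30, 500)
```

The worst case pairs the slowest hop with the fastest model response, so short, fast completions are where the proxy is most visible.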
Common Mistakes
Mistake 1: Assigning the same model_name to different models unintentionally
If two entries share the same model_name, LiteLLM treats them as a load-balanced group. This is intentional for the multi-key pattern. But if you accidentally give Claude Sonnet and GPT-4o the same name, you will route production traffic to a random mix of both. Use distinct names unless you explicitly want load balancing across those entries.
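A small lint over the parsed config can catch this before deployment. The sketch below assumes the YAML has already been loaded into a Python dict with the same `model_list` shape used in this lesson; the function name `find_mixed_groups` is hypothetical. Note that mixing different underlying models under one name is sometimes deliberate, so treat the output as a prompt for review rather than a hard failure.

```python
from collections import defaultdict


def find_mixed_groups(model_list: list[dict]) -> dict[str, set[str]]:
    """Return model_name groups that point at more than one underlying model."""
    groups: dict[str, set[str]] = defaultdict(set)
    for entry in model_list:
        groups[entry["model_name"]].add(entry["litellm_params"]["model"])
    return {name: models for name, models in groups.items() if len(models) > 1}


config = {
    "model_list": [
        # Intentional: same model behind two keys.
        {"model_name": "claude-sonnet-lb", "litellm_params": {"model": "claude-sonnet-4-6"}},
        {"model_name": "claude-sonnet-lb", "litellm_params": {"model": "claude-sonnet-4-6"}},
        # Accidental: two different models sharing one name.
        {"model_name": "default", "litellm_params": {"model": "claude-sonnet-4-6"}},
        {"model_name": "default", "litellm_params": {"model": "gpt-4o"}},
    ]
}

mixed = find_mixed_groups(config["model_list"])
for name, models in mixed.items():
    print(f"model_name '{name}' load-balances across: {sorted(models)}")
```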
Mistake 2: Omitting rpm and tpm in model config
Without rate limit metadata, usage-based-routing-v2 cannot make informed routing decisions - it has no idea how close a key is to its limit. Always configure rpm and tpm in each model entry to match your actual provider quotas. The router uses these numbers to avoid routing to a key that is about to be rate-limited.
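Mistake 3: Not passing a user identifier in requests
LiteLLM tracks spend per end user, but only if each request identifies one. With the OpenAI SDK, the standard way is the user field on the request; issuing per-user virtual keys achieves the same attribution at the key level. Without either, all spend is attributed to an anonymous bucket and per-user cost reporting is useless. Make the user identifier a mandatory field in every team's LLM call wrapper.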
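One way to make the identifier mandatory is to route every call through a small wrapper that refuses to build a request without it. The sketch below only constructs the keyword arguments; it assumes attribution via the standard OpenAI `user` field plus a `metadata` block in `extra_body`, and the function name `build_chat_request` is hypothetical.

```python
from typing import Any


def build_chat_request(
    user_id: str,
    model: str,
    messages: list[dict[str, str]],
    feature: str,
) -> dict[str, Any]:
    """Build kwargs for chat.completions.create(), refusing anonymous calls."""
    if not user_id or not user_id.strip():
        raise ValueError("user_id is required for cost attribution")
    return {
        "model": model,
        "messages": messages,
        "user": user_id,  # end-user identifier used for spend attribution
        "extra_body": {"metadata": {"feature": feature}},
    }


kwargs = build_chat_request(
    user_id="user_8821",
    model="claude-sonnet",
    messages=[{"role": "user", "content": "What is the CAP theorem?"}],
    feature="documentation-assistant",
)
print(kwargs["user"], kwargs["extra_body"])

try:
    build_chat_request("", "claude-sonnet", [], "documentation-assistant")
    rejected = False
except ValueError:
    rejected = True
```

A service would then call `proxy_client.chat.completions.create(**kwargs)`, so an unattributed request fails in the caller's own code instead of silently landing in the anonymous bucket.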
Mistake 4: Running multiple replicas without Redis
Without Redis, each proxy replica maintains its own independent in-memory routing state and cache. Load balancing decisions are inconsistent across replicas, and the cache provides no benefit at scale since each replica maintains a separate cache. Redis is required for any multi-replica deployment.
Mistake 5: Not configuring fallbacks before launch
The fallback config is the most important resilience feature, but it is only useful if configured before a provider goes down. Set up fallbacks in the config file before your first production deployment. Test the fallback manually by temporarily removing a provider's API key and verifying traffic routes correctly.
Mistake 6: Ignoring the admin UI for debugging
LiteLLM Proxy ships with a web-based admin UI at /ui. It shows real-time spend by user and team, all configured models, health status of each key, and recent request logs. Most teams never open it. During an incident - when you need to understand which key is being rate-limited, why fallback triggered, or which team is overrunning their budget - the admin UI saves 20 minutes of log trawling.
Interview Q&A
Q: What is LiteLLM and how does it differ from calling provider SDKs directly?
LiteLLM is a Python library and proxy server that normalizes the APIs of 100+ LLM providers into a single OpenAI-compatible interface. When you call providers directly, each requires a different SDK, authentication scheme, and response format - switching providers means changing application code. With LiteLLM, application code calls one consistent API; the provider is a config-level decision. This separation is valuable for multi-provider setups, fallback chains, centralized cost tracking, and organizations where multiple teams consume LLMs through a shared platform endpoint.
Q: What is the difference between LiteLLM SDK mode and proxy mode? When would you choose each?
SDK mode imports LiteLLM as a Python library and runs provider translation in-process. It adds no network overhead and is simpler to set up. Proxy mode runs LiteLLM as a standalone HTTP server; all services send requests over the network. Proxy mode provides: centralized cost tracking across all services, centralized caching (shared across replicas), centralized routing state, and compatibility with non-Python services. Choose SDK mode for a single-language, single-service Python application. Choose proxy mode for any multi-service architecture or when you need a true gateway layer with centralized control.
Q: How does LiteLLM handle provider failures during routing?
LiteLLM's router tracks error rates per model/key combination. When a key exceeds the allowed_fails threshold within a time window, it enters a cooldown period (defined by cooldown_time). During cooldown, no requests are routed to that key - they go to other available keys in the group. If all keys for a model group are in cooldown, LiteLLM uses the fallbacks config to route to a different model. After the cooldown period expires, the key is reintroduced to routing. All failures - including during fallback - are logged and reported to configured observability callbacks.
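The cooldown behaviour described above can be sketched as a small state machine. This is an illustrative model under assumed semantics (failures counted within a sliding window, cooldown triggered once the count exceeds `allowed_fails`); the class name and the explicit `now` argument are choices made for testability, not LiteLLM's API.

```python
class CooldownTracker:
    """Toy model of allowed_fails / cooldown_time handling per deployment."""

    def __init__(self, allowed_fails: int = 3, cooldown_time: float = 60.0,
                 window: float = 60.0) -> None:
        self.allowed_fails = allowed_fails
        self.cooldown_time = cooldown_time
        self.window = window
        self._failures: dict[str, list[float]] = {}
        self._cooldown_until: dict[str, float] = {}

    def record_failure(self, deployment: str, now: float) -> None:
        # Keep only failures inside the sliding window, then add this one.
        recent = [t for t in self._failures.get(deployment, []) if now - t < self.window]
        recent.append(now)
        self._failures[deployment] = recent
        if len(recent) > self.allowed_fails:
            # Threshold exceeded: take the deployment out of rotation.
            self._cooldown_until[deployment] = now + self.cooldown_time
            self._failures[deployment] = []

    def available(self, deployments: list[str], now: float) -> list[str]:
        """Deployments that are not currently cooling down."""
        return [d for d in deployments if now >= self._cooldown_until.get(d, 0.0)]


tracker = CooldownTracker(allowed_fails=3, cooldown_time=60.0)
keys = ["key-1", "key-2"]

for t in (0.0, 1.0, 2.0):
    tracker.record_failure("key-1", now=t)
after_three = tracker.available(keys, now=2.5)      # threshold not yet exceeded

tracker.record_failure("key-1", now=3.0)            # fourth failure trips cooldown
during_cooldown = tracker.available(keys, now=10.0)
after_cooldown = tracker.available(keys, now=63.0)
print(after_three, during_cooldown, after_cooldown)
```

When `available()` returns an empty list for a model group, that is the point at which a router would consult the fallback chain.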
Q: How would you implement per-team cost budgets with LiteLLM Proxy?
Create teams via the admin API (POST /team/new) with max_budget and budget_duration fields. Issue virtual API keys for each team member (POST /key/generate) associated with the team. Every request using a virtual key is attributed to that user and team in the PostgreSQL spend table. LiteLLM enforces budgets by checking accumulated spend against the limit on each request; once the limit is exceeded, it rejects the request with a budget-exceeded error (a 4xx response whose exact status code has varied across versions, so match on the error type rather than the code). For soft alerting at 80% consumption, poll the /team/info endpoint on a schedule and fire Slack notifications before the hard limit is hit.
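The soft-alert check itself is a few lines once spend and budget have been fetched. The sketch below assumes those two numbers come from a scheduled poll of the team info endpoint; the function name `budget_status` and the status labels are hypothetical.

```python
def budget_status(spend: float, max_budget: float, warn_ratio: float = 0.8) -> str:
    """Classify spend against a budget: 'ok', 'warning', or 'exceeded'."""
    if max_budget <= 0:
        return "no_budget"  # nothing to enforce or alert on
    ratio = spend / max_budget
    if ratio >= 1.0:
        return "exceeded"
    if ratio >= warn_ratio:
        return "warning"
    return "ok"


# Example values for a team with a $500 budget.
for spend in (120.0, 400.0, 512.5):
    print(f"${spend:.2f} of $500.00 -> {budget_status(spend, 500.0)}")
```

A scheduler would send the notification only on the transition from "ok" to "warning", otherwise every poll above 80% produces a duplicate alert.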
Q: A service is getting 429s from Anthropic even though three API keys are configured in LiteLLM. What is the likely cause?
The most likely cause is that all three keys belong to the same Anthropic organization and share a single organization-level rate limit. Adding keys within the same org multiplies the key count but does not multiply the organization's token quota. The 429s come from the org-level limit, not the per-key limit. The fix options are: (1) request a rate limit increase from Anthropic, (2) use keys from different Anthropic accounts if permitted by the terms of service, or (3) add OpenAI or another provider as a fallback and configure LiteLLM to overflow to that provider when Anthropic is rate-limited. Using usage-based-routing-v2 with accurate tpm limits will prevent any single key from being over-subscribed, but it cannot create capacity that doesn't exist at the organization level.
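The arithmetic behind this answer is a single `min`. The sketch below uses made-up limits and assumes the simplified model described above, in which per-key limits add up but an organization-level cap applies on top:

```python
def effective_tpm(key_limits: list[int], org_limit: int) -> int:
    """Usable tokens per minute when an org-level cap sits above per-key limits."""
    return min(sum(key_limits), org_limit)


# Hypothetical numbers: each key allows 200k TPM, and so does the whole org.
one_key = effective_tpm([200_000], org_limit=200_000)
three_keys = effective_tpm([200_000, 200_000, 200_000], org_limit=200_000)
print(one_key, three_keys)
```

Under these assumptions three keys deliver exactly the capacity of one, which is why the 429s persist no matter how the router spreads traffic.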
Q: How would you migrate an existing application from direct Anthropic SDK calls to LiteLLM Proxy without breaking anything?
Deploy LiteLLM Proxy alongside the existing application, configured with the same Anthropic key as the application currently uses. Update the application's Anthropic client to point at the proxy: set base_url="http://litellm-proxy:4000" and use a LiteLLM virtual key instead of the Anthropic key directly. The Anthropic SDK is compatible with the LiteLLM proxy endpoint - no call syntax changes required. Run both pathways in parallel for one week, comparing response quality, latency, and error rates. Once validated, remove the direct Anthropic SDK dependency and standardize on the proxy for all LLM traffic. The migration can be done service by service with zero downtime.
Q: What is usage-based-routing-v2 and when should you use it over latency-based-routing?
usage-based-routing-v2 routes requests to the key with the most remaining capacity in its TPM/RPM buckets, tracked in real time using Redis. It reads the provider-configured limits from the model's rpm and tpm fields in the config, and routes to the key furthest from its limits. This is the correct strategy when your primary constraint is staying within provider rate limits - for example, when you have multiple keys at different tier levels and need to respect each key's exact quota. latency-based-routing routes to the key with the lowest recently observed latency, without explicit quota tracking. It works well when keys have similar limits and you want to optimize user-perceived response time. Use usage-based-routing-v2 when quota compliance is critical; use latency-based-routing when performance optimization is the primary goal.
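The selection rule described in this answer can be sketched as: drop any key the request would push over its limits, then take the key with the most tokens-per-minute headroom. The usage counters are plain dict fields here, whereas a real deployment would read them from Redis; the function name and field names are assumptions for illustration.

```python
from typing import Optional


def pick_by_usage(deployments: list[dict], request_tokens: int) -> Optional[str]:
    """Pick the deployment with the most remaining TPM that can fit the request."""
    candidates = []
    for d in deployments:
        if d["used_rpm"] + 1 > d["rpm"]:
            continue  # one more request would exceed the RPM limit
        if d["used_tpm"] + request_tokens > d["tpm"]:
            continue  # this request would exceed the TPM limit
        candidates.append(d)
    if not candidates:
        return None  # every key is at its limit; caller should fall back
    return max(candidates, key=lambda d: d["tpm"] - d["used_tpm"])["name"]


deployments = [
    {"name": "key-1", "rpm": 500, "tpm": 200_000, "used_rpm": 100, "used_tpm": 150_000},
    {"name": "key-2", "rpm": 500, "tpm": 200_000, "used_rpm": 100, "used_tpm": 190_000},
    {"name": "key-3", "rpm": 1000, "tpm": 400_000, "used_rpm": 300, "used_tpm": 250_000},
]

chosen = pick_by_usage(deployments, request_tokens=20_000)
too_big = pick_by_usage(deployments, request_tokens=500_000)
print(chosen, too_big)
```

Here the higher-tier key wins because it has the most headroom even though it has served the most tokens, which is the case a latency-only strategy would not account for.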
LiteLLM Proxy: Complete Production Configuration
The following is a production-ready LiteLLM Proxy configuration file combining all the elements covered in this lesson - multi-key routing, fallbacks, cost tracking, and caching.
# litellm_config.yaml - production configuration
model_list:
  # Primary: Claude Sonnet with 3 keys for throughput
  - model_name: claude-primary
    litellm_params:
      model: anthropic/claude-sonnet-4-6
      api_key: os.environ/ANTHROPIC_API_KEY_1
      rpm: 500
      tpm: 200_000
  - model_name: claude-primary
    litellm_params:
      model: anthropic/claude-sonnet-4-6
      api_key: os.environ/ANTHROPIC_API_KEY_2
      rpm: 500
      tpm: 200_000
  - model_name: claude-primary
    litellm_params:
      model: anthropic/claude-sonnet-4-6
      api_key: os.environ/ANTHROPIC_API_KEY_3
      rpm: 1000
      tpm: 400_000  # Higher-tier key
  # Fallback: OpenAI GPT-4o
  - model_name: openai-fallback
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
  # Budget model: Claude Haiku for low-cost tasks
  - model_name: claude-budget
    litellm_params:
      model: anthropic/claude-haiku-4-5-20251001
      api_key: os.environ/ANTHROPIC_API_KEY_1
router_settings:
  routing_strategy: usage-based-routing-v2
  redis_url: os.environ/REDIS_URL
  num_retries: 2
  retry_after: 5
  allowed_fails: 3
  cooldown_time: 60
  fallbacks:
    - {"claude-primary": ["openai-fallback"]}
litellm_settings:
  success_callback: ["langfuse"]
  failure_callback: ["langfuse"]
  set_verbose: false
  # Semantic caching
  cache: true
  cache_params:
    type: redis-semantic  # plain "redis" is exact-match only and ignores similarity_threshold
    host: os.environ/REDIS_HOST
    port: 6379
    similarity_threshold: 0.95
    # redis-semantic also needs redis_semantic_cache_embedding_model, set to an
    # embedding model defined in model_list (none is defined in this file)
    supported_call_types: ["acompletion", "completion"]
general_settings:
  database_url: os.environ/DATABASE_URL  # PostgreSQL for spend tracking
  store_model_in_db: true
  master_key: os.environ/LITELLM_MASTER_KEY
environment_variables:
  LANGFUSE_PUBLIC_KEY: os.environ/LANGFUSE_PUBLIC_KEY
  LANGFUSE_SECRET_KEY: os.environ/LANGFUSE_SECRET_KEY
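A config like this is easy to break with a typo in a fallback name, so a small pre-deploy check can be worthwhile. The sketch below is a hypothetical helper, not a LiteLLM feature. It assumes the file has already been loaded into a dict (for example with PyYAML's `yaml.safe_load`) and only checks that every name in `router_settings.fallbacks` matches a `model_name` in `model_list`. The inline dict is a trimmed stand-in for the configuration above.

```python
def undefined_fallback_names(config: dict) -> list[str]:
    """Return fallback sources/targets that have no matching model_name."""
    defined = {entry["model_name"] for entry in config.get("model_list", [])}
    missing: list[str] = []
    for mapping in config.get("router_settings", {}).get("fallbacks", []):
        for source, targets in mapping.items():
            for name in [source, *targets]:
                if name not in defined and name not in missing:
                    missing.append(name)
    return missing


# Trimmed stand-in for the YAML above, as it would look after parsing.
config = {
    "model_list": [
        {"model_name": "claude-primary"},
        {"model_name": "openai-fallback"},
        {"model_name": "claude-budget"},
    ],
    "router_settings": {
        "fallbacks": [{"claude-primary": ["openai-fallback"]}],
    },
}

problems = undefined_fallback_names(config)
if problems:
    raise SystemExit(f"fallbacks reference undefined models: {problems}")
print("fallback names OK")
```

Running a check like this in CI catches a misspelled group name before the proxy is restarted with it.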
Monitoring LiteLLM Proxy in Production
LiteLLM exposes Prometheus metrics at /metrics once the Prometheus integration is enabled. Key metrics to scrape and alert on:
| Metric | Type | Alert Condition |
|---|---|---|
| litellm_proxy_total_requests_metric | Counter | Sudden drop (service stopped receiving traffic) |
| litellm_proxy_failed_requests_metric | Counter | Error rate above 1% of total requests |
| litellm_deployment_success_responses | Counter | Track per-deployment success rates |
| litellm_deployment_failure_responses | Counter | Any deployment with sustained failures |
| litellm_remaining_requests | Gauge | Below 20% of RPM limit on any key |
| litellm_remaining_tokens | Gauge | Below 20% of TPM limit on any key |
| litellm_overhead_latency_metric | Histogram | P99 above 100 ms (gateway is slow) |
Configure Prometheus scraping at the /metrics endpoint and set up Grafana dashboards for these metrics from day one. The most operationally useful dashboard shows: requests per second by model, error rate by model, TPM utilization per key (as % of limit), and the P95 latency breakdown (gateway overhead vs LLM response time).
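The two percentage thresholds in the table reduce to simple ratios. In production these would normally live in Prometheus alerting rules. The Python below is only a sketch of the arithmetic, with hypothetical function names and made-up sample values, and it assumes the counter inputs are increases over the same time window rather than lifetime totals.

```python
def error_rate(failed: float, total: float) -> float:
    """Fraction of requests that failed in the window (0.0 if no traffic)."""
    return failed / total if total else 0.0


def headroom_fraction(remaining: float, limit: float) -> float:
    """Fraction of a key's per-minute quota that is still unused."""
    return remaining / limit if limit else 0.0


def fired_alerts(failed, total, remaining_tokens, tpm_limit) -> list[str]:
    """Apply the table's thresholds: error rate > 1%, headroom < 20%."""
    alerts = []
    if error_rate(failed, total) > 0.01:
        alerts.append("error_rate")
    if headroom_fraction(remaining_tokens, tpm_limit) < 0.20:
        alerts.append("low_tpm_headroom")
    return alerts


# Made-up window: 30 failures out of 2,000 requests, 30k of 200k tokens left.
print(fired_alerts(failed=30, total=2_000, remaining_tokens=30_000, tpm_limit=200_000))
```

With these sample values both conditions fire: the error rate is 1.5% and only 15% of the token quota remains.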
Summary: LiteLLM Proxy in Production
LiteLLM is one of the most widely adopted self-hosted LLM gateways. Its key properties:
- OpenAI-compatible endpoint: any service using the standard OpenAI SDK can point at LiteLLM with a base URL change - no other code changes required
- 100+ provider support: every major LLM provider, managed by one unified config file
- Virtual keys: decouple application keys from provider credentials, enabling zero-downtime rotation and per-team spend caps
- Budget enforcement: team and user budgets enforced at request time using PostgreSQL spend tracking
- Fallback chains: automatic provider failover with configurable retry logic
- Semantic caching: Redis-backed embedding similarity cache, shared across all services
- Routing strategies: simple-shuffle, latency-based-routing, usage-based-routing-v2, and least-busy
- Admin API: programmatic control over virtual keys, teams, budgets, and routing - enabling automated provisioning workflows
The canonical deployment is Docker Compose (or Kubernetes) with LiteLLM Proxy + Redis + PostgreSQL. This three-component stack handles everything described in this lesson and scales to thousands of requests per minute on modest hardware.
