Deploy LiteLLM as a universal LLM proxy supporting 100+ providers. Configure routing, load balancing, fallbacks, semantic caching, and cost tracking through a single OpenAI-compatible endpoint.


LiteLLM

The Dependency Nightmare

The platform engineering team had been quietly dreading the upcoming architecture review for months. What started as a clean OpenAI integration eighteen months ago had metastasized into a maze of provider-specific code scattered across eleven microservices. Service A used the Anthropic SDK directly, importing it and calling client.messages.create(). Service B used the OpenAI SDK. Service C had a hand-rolled HTTP client that called Amazon Bedrock with AWS Signature v4 authentication. Service D called Azure OpenAI with a different base URL pattern and a different API version header than Service B's direct OpenAI calls. Each service had its own hardcoded retry logic, each written slightly differently by a different engineer on a different sprint.

The situation came to a head during a late-night push to add Gemini Pro support. Three engineers worked independently on three different services. Each discovered different quirks in the Gemini response format independently. Each wrote different error handling independently. Each made different assumptions about streaming behavior independently. By the time the feature was done, the codebase had three inconsistent Gemini integrations, and no engineer could fully explain how the other two worked.

The lead platform engineer proposed a different approach: route all LLM traffic through a single proxy. Give every service the same OpenAI-compatible API. Handle all the provider-specific translation in one place. The team was skeptical - adding a network hop sounded like unnecessary complexity. But after a one-week proof of concept with LiteLLM Proxy, the result was undeniable: adding Gemini support now meant one config file change. Switching a service from GPT-4o to Claude was two lines. And for the first time, the team had a single dashboard showing how much every service was spending on every provider, broken down by team and feature.

That was the moment LiteLLM stopped being a convenience and became platform infrastructure.

Why This Exists

LiteLLM was created to solve one specific problem: every LLM provider has a different API.

OpenAI, Anthropic, Google, Cohere, Mistral, Bedrock, Azure OpenAI - each has its own authentication scheme, its own request format, its own response format, its own error codes, its own streaming protocol. Building against more than one provider means understanding all of these differences and handling the translations in application code. Adding a new provider means learning another API from scratch and updating multiple services.

The key insight that makes LiteLLM work: the OpenAI API format has become the de facto standard that most downstream tooling expects. Rather than inventing a new universal format, LiteLLM translates every provider into the OpenAI format. Application code speaks OpenAI to LiteLLM; LiteLLM speaks whatever the underlying provider requires.
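To make the translation idea concrete, here is a minimal sketch of what "speak OpenAI on one side, the provider's format on the other" involves for Anthropic's Messages API. The function names are hypothetical and this is a simplified illustration, not LiteLLM's actual implementation: it handles only plain-text messages and ignores tools, images, and streaming.

```python
from typing import Any

# Anthropic stop reasons mapped onto OpenAI finish reasons.
STOP_REASON_MAP = {
    "end_turn": "stop",
    "stop_sequence": "stop",
    "max_tokens": "length",
    "tool_use": "tool_calls",
}


def openai_to_anthropic(request: dict[str, Any]) -> dict[str, Any]:
    """Convert an OpenAI-style chat request into an Anthropic Messages payload."""
    # Anthropic takes the system prompt as a top-level field, not as a message.
    system_parts = [m["content"] for m in request["messages"] if m["role"] == "system"]
    chat = [
        {"role": m["role"], "content": m["content"]}
        for m in request["messages"]
        if m["role"] != "system"
    ]
    payload: dict[str, Any] = {
        "model": request["model"],
        "messages": chat,
        # max_tokens is required by Anthropic but optional in the OpenAI format.
        "max_tokens": request.get("max_tokens", 1024),
    }
    if system_parts:
        payload["system"] = "\n".join(system_parts)
    return payload


def anthropic_to_openai(response: dict[str, Any]) -> dict[str, Any]:
    """Convert an Anthropic Messages response into an OpenAI-style response."""
    text = "".join(b["text"] for b in response["content"] if b["type"] == "text")
    usage = response["usage"]
    return {
        "model": response["model"],
        "choices": [
            {
                "index": 0,
                "message": {"role": "assistant", "content": text},
                "finish_reason": STOP_REASON_MAP.get(response.get("stop_reason"), "stop"),
            }
        ],
        "usage": {
            "prompt_tokens": usage["input_tokens"],
            "completion_tokens": usage["output_tokens"],
            "total_tokens": usage["input_tokens"] + usage["output_tokens"],
        },
    }
```

Every provider needs a pair of functions like these, plus equivalents for errors and streaming chunks. Centralizing them is the whole value of the abstraction.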

Ishaan Jaffer and Krrish Dholakia, the founders of BerriAI, open-sourced LiteLLM in 2023 under the MIT license. By early 2025 it supported over 100 providers and models, had over 13,000 GitHub stars, and was in production at hundreds of teams. The project grew into two distinct modes: a Python SDK for in-process use, and a Proxy Server for network-level deployment as a true gateway.

Two Modes: SDK vs Proxy

SDK Mode: Import litellm into your Python application and call litellm.completion(). Provider translation happens in-process. Simple to set up. No network overhead. Works well for single-service Python applications or when you want provider abstraction without a separate network service.

Proxy Mode: Run litellm --config config.yaml to start an HTTP server on port 4000. Every service points its OpenAI SDK at http://your-proxy:4000. The proxy handles translation, routing, caching, and cost tracking centrally. A config change on the proxy takes effect for every consumer without redeploying any of them. This is the mode that gives you the full gateway architecture.

For teams with multiple services or multiple languages, Proxy Mode is the right choice. For a single Python application where simplicity matters more than centralization, SDK Mode is sufficient.

SDK Mode: Provider Abstraction in Python

pip install litellm anthropic openai

import litellm


# LiteLLM reads API keys from environment variables:
# ANTHROPIC_API_KEY, OPENAI_API_KEY, GEMINI_API_KEY, etc.
# No SDK-specific setup required per provider.


def demo_provider_abstraction() -> None:
    """
    The same litellm.completion() call works for any provider.
    Only the model string changes - the response format is always OpenAI-compatible.
    """
    messages = [{"role": "user", "content": "What is the capital of France?"}]

    # Anthropic Claude - prefix: none (inferred from model name)
    response_claude = litellm.completion(
        model="claude-sonnet-4-6",
        messages=messages,
        max_tokens=64,
    )
    print(f"Claude: {response_claude.choices[0].message.content}")
    print(f"  Tokens: {response_claude.usage.total_tokens}")

    # OpenAI GPT-4o
    response_gpt = litellm.completion(
        model="gpt-4o",
        messages=messages,
        max_tokens=64,
    )
    print(f"GPT-4o: {response_gpt.choices[0].message.content}")

    # Google Gemini - prefix: "gemini/"
    response_gemini = litellm.completion(
        model="gemini/gemini-1.5-pro",
        messages=messages,
        max_tokens=64,
    )
    print(f"Gemini: {response_gemini.choices[0].message.content}")

    # Local Ollama - specify api_base
    response_local = litellm.completion(
        model="ollama/llama3.2",
        messages=messages,
        api_base="http://localhost:11434",
    )
    print(f"Llama (local): {response_local.choices[0].message.content}")


def demo_fallback_sdk() -> None:
    """
    LiteLLM SDK supports automatic fallback via the fallbacks parameter.
    If claude-sonnet-4-6 fails, try gpt-4o, then gpt-4o-mini.
    """
    messages = [{"role": "user", "content": "Explain gradient descent in one sentence."}]

    response = litellm.completion(
        model="claude-sonnet-4-6",
        messages=messages,
        fallbacks=["gpt-4o", "gpt-4o-mini"],
        num_retries=2,
        timeout=30,                 # per-request timeout in seconds
    )

    # response.model tells you which model actually handled it
    print(f"Handled by: {response.model}")
    print(f"Response: {response.choices[0].message.content[:200]}")


def demo_cost_tracking() -> None:
    """
    LiteLLM computes request cost automatically.
    Use litellm.completion_cost() after every call.
    """
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a haiku about distributed systems."},
    ]

    response = litellm.completion(
        model="claude-sonnet-4-6",
        messages=messages,
        max_tokens=64,
    )

    cost_usd = litellm.completion_cost(completion_response=response)

    print(f"Model: {response.model}")
    print(f"Input tokens:  {response.usage.prompt_tokens}")
    print(f"Output tokens: {response.usage.completion_tokens}")
    print(f"Cost:          ${cost_usd:.8f} USD")
    print(f"Response:\n{response.choices[0].message.content}")


def demo_budget_manager() -> None:
    """
    LiteLLM SDK includes a BudgetManager that tracks per-user spend.
    It does not block requests by itself: compare current cost against
    the total budget before each call, as done below.
    """
    budget_manager = litellm.BudgetManager(project_name="my-app", client_type="local")

    user_id = "user_8821"

    # Create a $5/month budget for this user unless one already exists
    if not budget_manager.is_valid_user(user_id):
        budget_manager.create_budget(
            total_budget=5.00,
            user=user_id,
            duration="monthly",
        )

    messages = [{"role": "user", "content": "What is a transformer?"}]

    # Check budget before calling
    if budget_manager.get_current_cost(user=user_id) < budget_manager.get_total_budget(user=user_id):
        response = litellm.completion(
            model="claude-sonnet-4-6",
            messages=messages,
            max_tokens=200,
        )
        # update_cost computes the cost of this response and adds it to the user's total
        budget_manager.update_cost(user=user_id, completion_obj=response)
        remaining = (
            budget_manager.get_total_budget(user=user_id)
            - budget_manager.get_current_cost(user=user_id)
        )
        print(f"Remaining budget: ${remaining:.4f}")
    else:
        print(f"Budget exceeded for {user_id}")


if __name__ == "__main__":
    demo_provider_abstraction()
    demo_fallback_sdk()
    demo_cost_tracking()

Proxy Mode: The Production Configuration

The real power of LiteLLM comes in proxy mode. A YAML config file defines your model list, routing strategy, fallbacks, cache settings, and authentication. The proxy runs as a standalone HTTP server that all services call.

Step 1: Write the Config

# litellm_config.yaml

model_list:
  # Primary: Claude Sonnet - for complex reasoning tasks
  - model_name: claude-sonnet
    litellm_params:
      model: claude-sonnet-4-6
      api_key: os.environ/ANTHROPIC_API_KEY
      rpm: 500          # requests per minute - used by usage-based routing
      tpm: 200000       # tokens per minute

  # Budget model: Claude Haiku - for high-volume simple tasks
  - model_name: claude-haiku
    litellm_params:
      model: claude-haiku-4-5-20251001
      api_key: os.environ/ANTHROPIC_API_KEY
      rpm: 2000
      tpm: 1000000

  # OpenAI fallback
  - model_name: gpt-4o
    litellm_params:
      model: gpt-4o
      api_key: os.environ/OPENAI_API_KEY
      rpm: 500
      tpm: 150000

  # Load-balanced group: three Anthropic API keys, same model
  # LiteLLM treats same model_name as a group to load-balance across
  - model_name: claude-sonnet-lb
    litellm_params:
      model: claude-sonnet-4-6
      api_key: os.environ/ANTHROPIC_API_KEY_1
  - model_name: claude-sonnet-lb
    litellm_params:
      model: claude-sonnet-4-6
      api_key: os.environ/ANTHROPIC_API_KEY_2
  - model_name: claude-sonnet-lb
    litellm_params:
      model: claude-sonnet-4-6
      api_key: os.environ/ANTHROPIC_API_KEY_3

router_settings:
  routing_strategy: latency-based-routing    # prefers the deployment with the lowest recent latency
  num_retries: 3
  retry_after: 5                             # seconds between retries
  allowed_fails: 3                           # failures before cooldown
  cooldown_time: 60                          # seconds to cooldown failing key

litellm_settings:
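  # Response caching backed by Redis. type: redis caches exact-match requests.
  # Semantic (similarity-based) caching instead needs type: redis-semantic,
  # a similarity_threshold, an embedding model set via
  # redis_semantic_cache_embedding_model, and a Redis with vector search.
  cache: true
  cache_params:
    type: redis
    host: redis
    port: 6379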

  # Global fallback chain: if claude-sonnet fails, try these in order
  fallbacks:
    - claude-sonnet: ["gpt-4o", "claude-haiku"]

  # Observability callbacks
  success_callback: ["langfuse"]
  failure_callback: ["langfuse"]

  # Per-request timeout
  request_timeout: 60

general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY   # required auth key for proxy admin API
  database_url: os.environ/DATABASE_URL         # PostgreSQL for spend tracking
  store_model_in_db: true

Step 2: Run the Proxy

# Install with proxy dependencies
pip install 'litellm[proxy]' psycopg2-binary

# Set environment variables
export ANTHROPIC_API_KEY="sk-ant-..."
export ANTHROPIC_API_KEY_1="sk-ant-key1..."
export ANTHROPIC_API_KEY_2="sk-ant-key2..."
export ANTHROPIC_API_KEY_3="sk-ant-key3..."
export OPENAI_API_KEY="sk-..."
export LITELLM_MASTER_KEY="sk-litellm-master..."
export DATABASE_URL="postgresql://litellm:password@localhost:5432/litellm"
export LANGFUSE_PUBLIC_KEY="pk-lf-..."
export LANGFUSE_SECRET_KEY="sk-lf-..."

# Start the proxy (development)
litellm --config litellm_config.yaml --port 4000

# Verify it's running (the liveliness endpoint needs no auth)
curl http://localhost:4000/health/liveliness
curl http://localhost:4000/v1/models \
  -H "Authorization: Bearer sk-litellm-master..."

Step 3: Call the Proxy from Application Code

Once the proxy is running, application code uses the OpenAI SDK pointed at the proxy. No provider-specific SDKs or authentication in application code.

import openai
import anthropic
import json


# ─────────────────────────────────────────────────────────────────────────────
# Option A: OpenAI SDK → LiteLLM Proxy (recommended for most services)
# Works from any language that has an OpenAI-compatible SDK
# ─────────────────────────────────────────────────────────────────────────────

proxy_client = openai.OpenAI(
    api_key="sk-litellm-master...",       # your LiteLLM master key
    base_url="http://localhost:4000",      # or your internal service URL
)


def call_claude_via_proxy() -> str:
    """
    Route to Claude Sonnet through the proxy.
    The model name matches what is defined in litellm_config.yaml - not the Anthropic model ID.
    """
    response = proxy_client.chat.completions.create(
        model="claude-sonnet",            # matches model_name in config
        messages=[
            {"role": "system", "content": "You are a concise technical assistant."},
            {"role": "user", "content": "Explain gradient descent in two sentences."},
        ],
        max_tokens=256,
    )
    print(f"Model: {response.model}")
    print(f"Response: {response.choices[0].message.content}")
    return response.choices[0].message.content


def call_load_balanced_group() -> str:
    """
    Send to the load-balanced group. With latency-based-routing, LiteLLM
    routes to the key with the lowest recently observed latency.
    """
    response = proxy_client.chat.completions.create(
        model="claude-sonnet-lb",         # the load-balanced group of 3 keys
        messages=[
            {"role": "user", "content": "Write a Python function to reverse a string."},
        ],
        max_tokens=256,
    )
    print(f"Load balanced response: {response.choices[0].message.content[:200]}")
    return response.choices[0].message.content


def call_with_cost_attribution() -> dict:
    """
    Attribute spend to an end user and tag the request with metadata.
    The end user goes in the standard OpenAI `user` field; team attribution
    comes from the team that the calling virtual key belongs to.
    LiteLLM records these in PostgreSQL against the spend entry.
    """
    response = proxy_client.chat.completions.create(
        model="claude-sonnet",
        messages=[{"role": "user", "content": "What is the CAP theorem?"}],
        max_tokens=300,
        user="user_8821",                 # end-user ID used for spend tracking
        extra_body={
            "metadata": {
                "feature": "documentation-assistant",
                "session_id": "sess_abc123",
                "environment": "production",
            },
        },
    )
    return {
        "content": response.choices[0].message.content,
        "model": response.model,
        # model_dump() returns plain nested dicts, so json.dumps() works on the result
        "usage": response.usage.model_dump(),
    }


# ─────────────────────────────────────────────────────────────────────────────
# Option B: Anthropic SDK → LiteLLM Proxy
# Useful when migrating existing code that already uses the Anthropic SDK
# ─────────────────────────────────────────────────────────────────────────────

anthropic_via_proxy = anthropic.Anthropic(
    api_key="sk-litellm-master...",
    base_url="http://localhost:4000",
)


def call_anthropic_sdk_through_proxy() -> str:
    """
    The Anthropic SDK can call through the LiteLLM proxy.
    LiteLLM translates between Anthropic message format and its internal format.
    No changes to Anthropic SDK call syntax required.
    """
    message = anthropic_via_proxy.messages.create(
        model="claude-sonnet",            # proxy model_name, not Anthropic model ID
        max_tokens=256,
        system="You are a helpful assistant.",
        messages=[{"role": "user", "content": "What is a Merkle tree?"}],
    )
    print(f"Via Anthropic SDK -> LiteLLM proxy:")
    print(f"  {message.content[0].text[:200]}")
    return message.content[0].text


if __name__ == "__main__":
    call_claude_via_proxy()
    call_load_balanced_group()
    result = call_with_cost_attribution()
    print(f"\nCost attribution result: {json.dumps(result, indent=2)[:400]}")
    call_anthropic_sdk_through_proxy()

Routing Strategies Compared

LiteLLM's router supports five traffic distribution strategies. Choose based on your use case.

| Strategy | When to use | Trade-off |
| --- | --- | --- |
| `simple-shuffle` | All keys are equivalent, just need distribution | No performance awareness; can route to degraded key |
| `least-busy` | Streaming responses where in-flight count matters | Doesn't account for token consumption variance |
| `latency-based-routing` | User-facing, latency-sensitive features | Slight overhead to track recent latency per key |
| `cost-based-routing` | Batch jobs where cost minimization is priority | May concentrate traffic; can exhaust one key |
| `usage-based-routing-v2` | You have strict TPM/RPM provider limits | Requires Redis; most accurate quota-aware routing |

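For user-facing, latency-sensitive deployments, latency-based-routing is a reasonable choice: it steers traffic away from slow or degraded keys, and with Redis the latency data is shared across replicas. If you do not set routing_strategy at all, LiteLLM falls back to simple-shuffle, which is the cheapest strategy to run and a sound default when all keys are equivalent.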
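As a rough illustration of how a latency-aware strategy picks a key, the sketch below keeps a sliding window of recent latencies per deployment and chooses the lowest average. The class name, the window size, and the "try unmeasured keys first" rule are assumptions for illustration; LiteLLM's own router has more moving parts.

```python
from collections import deque
from statistics import mean


class LatencyRouter:
    """Pick the deployment with the lowest average latency over a recent window."""

    def __init__(self, deployments: list[str], window: int = 10) -> None:
        self.deployments = deployments
        # One bounded history per deployment; old samples fall off automatically.
        self.samples = {d: deque(maxlen=window) for d in deployments}

    def record(self, deployment: str, latency_s: float) -> None:
        """Store the observed latency of a completed request."""
        self.samples[deployment].append(latency_s)

    def pick(self) -> str:
        # A deployment with no samples yet is tried first so every key gets measured.
        for d in self.deployments:
            if not self.samples[d]:
                return d
        return min(self.deployments, key=lambda d: mean(self.samples[d]))


router = LatencyRouter(["key-1", "key-2", "key-3"])
router.record("key-1", 0.9)
router.record("key-2", 0.4)
router.record("key-3", 0.6)
print(router.pick())
```

Because the window is bounded, a key that was slow for a while recovers its share of traffic once newer, faster samples push the old ones out.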

Team and User Budget Enforcement

LiteLLM Proxy supports team-level and user-level spend limits enforced at the database layer. The admin API manages teams, users, and virtual keys.

import httpx
import json

LITELLM_BASE = "http://localhost:4000"
MASTER_KEY = "sk-litellm-master..."
HEADERS = {"Authorization": f"Bearer {MASTER_KEY}", "Content-Type": "application/json"}


def create_team_with_budget() -> str:
    """Create a team entity with a monthly budget limit and rate limits."""
    payload = {
        "team_id": "team-platform",
        "team_alias": "Platform Engineering",
        "max_budget": 500.00,            # USD per budget period
        "budget_duration": "30d",        # duration string, e.g. "30s", "30m", "30h", "30d"
        "tpm_limit": 500_000,            # tokens per minute
        "rpm_limit": 1_000,              # requests per minute
        "models": ["claude-sonnet", "claude-haiku", "gpt-4o"],
    }
    response = httpx.post(f"{LITELLM_BASE}/team/new", headers=HEADERS, json=payload)
    data = response.json()
    print(f"Team created: {data.get('team_id')}")
    return data.get("team_id", "")


def create_user_virtual_key(team_id: str) -> str:
    """
    Generate a virtual API key for a user.
    The key is scoped to the team's budget and model allowlist.
    Users call the proxy with this key - not the real provider keys.
    """
    payload = {
        "user_id": "user_8821",
        "team_id": team_id,
        "key_alias": "alice-dev-key",
        "max_budget": 50.00,             # USD per user per budget period
        "budget_duration": "30d",        # duration string, e.g. "30s", "30m", "30h", "30d"
        "models": ["claude-sonnet", "claude-haiku"],   # restrict model access
        "tpm_limit": 100_000,
        "metadata": {"role": "senior-engineer", "department": "platform"},
    }
    response = httpx.post(f"{LITELLM_BASE}/key/generate", headers=HEADERS, json=payload)
    data = response.json()
    virtual_key = data["key"]
    print(f"Virtual key: {virtual_key[:20]}...")
    return virtual_key


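def inspect_team_spend(team_id: str) -> dict:
    """Query current spend and remaining budget for a team."""
    response = httpx.get(
        f"{LITELLM_BASE}/team/info",
        headers=HEADERS,
        params={"team_id": team_id},
    )
    response.raise_for_status()
    info = response.json()
    # Team fields are nested under "team_info" in the /team/info response;
    # fall back to the top level in case your version returns them flat.
    team = info.get("team_info") or info
    spend = team.get("spend") or 0.0
    budget = team.get("max_budget") or 0.0      # None means no budget is set
    print(f"Team: {team_id}")
    print(f"  Spend:     ${spend:.4f}")
    print(f"  Budget:    ${budget:.2f}")
    print(f"  Remaining: ${budget - spend:.4f}")
    return info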


def list_spend_logs(limit: int = 10) -> list[dict]:
    """Return recent spend log entries across all users and teams."""
    response = httpx.get(
        f"{LITELLM_BASE}/spend/logs",
        headers=HEADERS,
        params={"limit": limit},
    )
    logs = response.json()
    print(f"\nRecent spend log ({len(logs)} entries):")
    for entry in logs[:5]:
        print(
            f"  [{entry.get('call_type', 'chat')}] "
            f"user={entry.get('user', 'anon')} | "
            f"model={entry.get('model', '?')} | "
            f"${entry.get('spend', 0):.6f}"
        )
    return logs


def set_per_model_budget(team_id: str) -> None:
    """
    Set different budgets for different models within the same team.
    Useful when expensive models need tighter limits than cheap models.
    """
    response = httpx.post(
        f"{LITELLM_BASE}/team/update",
        headers=HEADERS,
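        json={
            "team_id": team_id,
            # Field name as given in this lesson; confirm it against the
            # /team/update schema of your LiteLLM version before relying on it.
            "model_spend_map": {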
                "claude-opus-4-6": 50.00,    # strict limit for expensive model
                "claude-sonnet-4-6": 300.00,
                "claude-haiku-4-5-20251001": 150.00,
            },
        },
    )
    print(f"Per-model budget set: {response.status_code}")


if __name__ == "__main__":
    team_id = create_team_with_budget()
    virtual_key = create_user_virtual_key(team_id)
    inspect_team_spend(team_id)
    list_spend_logs()
    set_per_model_budget(team_id)

Docker Compose: Full Production Stack

Running LiteLLM with Redis (for caching and routing state) and PostgreSQL (for spend tracking) is the recommended production setup.

# docker-compose.yml
version: "3.9"

services:
  litellm:
    image: ghcr.io/berriai/litellm:main-stable
    ports:
      - "4000:4000"
    volumes:
      - ./litellm_config.yaml:/app/config.yaml
    environment:
      ANTHROPIC_API_KEY: ${ANTHROPIC_API_KEY}
      ANTHROPIC_API_KEY_1: ${ANTHROPIC_API_KEY_1}
      ANTHROPIC_API_KEY_2: ${ANTHROPIC_API_KEY_2}
      ANTHROPIC_API_KEY_3: ${ANTHROPIC_API_KEY_3}
      OPENAI_API_KEY: ${OPENAI_API_KEY}
      LITELLM_MASTER_KEY: ${LITELLM_MASTER_KEY}
      DATABASE_URL: postgresql://litellm:litellm@postgres:5432/litellm
      REDIS_HOST: redis
      REDIS_PORT: "6379"
      LANGFUSE_PUBLIC_KEY: ${LANGFUSE_PUBLIC_KEY}
      LANGFUSE_SECRET_KEY: ${LANGFUSE_SECRET_KEY}
    command: --config /app/config.yaml --port 4000   # add --detailed_debug only while troubleshooting
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:4000/health/readiness"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 15s

  postgres:
    image: postgres:16-alpine
    environment:
      POSTGRES_DB: litellm
      POSTGRES_USER: litellm
      POSTGRES_PASSWORD: litellm
    volumes:
      - postgres_data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U litellm -d litellm"]
      interval: 10s
      timeout: 5s
      retries: 5

  redis:
    image: redis:7-alpine
    command: redis-server --maxmemory 512mb --maxmemory-policy allkeys-lru
    volumes:
      - redis_data:/data
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 5s
      retries: 5

volumes:
  postgres_data:
  redis_data:

# Deploy the full stack
docker compose up -d

# Verify the proxy is healthy
curl http://localhost:4000/health/readiness

# Check all configured models
curl http://localhost:4000/v1/models \
  -H "Authorization: Bearer sk-litellm-master..."

# Query spend summary
curl "http://localhost:4000/spend/logs?limit=20" \
  -H "Authorization: Bearer sk-litellm-master..."

# View the admin UI (if enabled)
open http://localhost:4000/ui

Connecting to Langfuse for Full Observability

LiteLLM supports callbacks to observability platforms. Langfuse is a popular option for LLM-specific tracing.

# In litellm_config.yaml
litellm_settings:
  success_callback: ["langfuse"]
  failure_callback: ["langfuse"]

environment_variables:
  LANGFUSE_PUBLIC_KEY: "pk-lf-..."
  LANGFUSE_SECRET_KEY: "sk-lf-..."
  LANGFUSE_HOST: "https://cloud.langfuse.com"   # or your self-hosted URL

With this config active, every LLM call is automatically traced in Langfuse with model name, token counts, cost, latency, prompt, and response. No instrumentation code required in any application service.

Other supported callbacks: helicone, prometheus, datadog, s3, sentry, and custom HTTP webhooks for building your own observability pipeline.

Understanding the Request Lifecycle

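Every request passes through the same stages in order: authentication, rate limiting, budget check, cache lookup, routing, provider call, cost recording, and observability callbacks. A cache hit skips routing and the provider call. The gateway's own overhead is roughly 10–30ms on a cache miss and lower on a cache hit, but treat those figures as estimates and measure them in your own environment.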
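The ordering of those stages can be sketched as a single handler. Everything here is a hypothetical stand-in (the key set, limits, flat per-call cost, and the echo "provider"); the point is only the sequence and the way a cache hit short-circuits the expensive steps.

```python
from dataclasses import dataclass, field

# Hypothetical values for illustration; these are not LiteLLM internals.
VALID_KEYS = {"sk-demo"}
MAX_REQUESTS = 100
BUDGET_USD = 1.00
COST_PER_CALL_USD = 0.01


@dataclass
class Gateway:
    cache: dict = field(default_factory=dict)
    spend: dict = field(default_factory=dict)
    requests: dict = field(default_factory=dict)

    def handle(self, api_key: str, prompt: str) -> tuple[str, list[str]]:
        """Run one request through the stages and return (response, stages visited)."""
        trace = ["auth"]
        if api_key not in VALID_KEYS:
            raise PermissionError("unknown key")

        trace.append("rate_limit")
        self.requests[api_key] = self.requests.get(api_key, 0) + 1
        if self.requests[api_key] > MAX_REQUESTS:
            raise RuntimeError("rate limit exceeded")

        trace.append("budget_check")
        if self.spend.get(api_key, 0.0) >= BUDGET_USD:
            raise RuntimeError("budget exceeded")

        trace.append("cache_lookup")
        response = self.cache.get(prompt)
        if response is None:
            # Cache miss: only now do we pay for routing and the provider call.
            trace += ["routing", "provider_call"]
            response = f"echo: {prompt}"      # stand-in for the real provider call
            self.cache[prompt] = response
            trace.append("cost_recording")
            self.spend[api_key] = self.spend.get(api_key, 0.0) + COST_PER_CALL_USD

        trace.append("callbacks")
        return response, trace


gateway = Gateway()
print(gateway.handle("sk-demo", "hello")[1])   # stages visited on a cache miss
print(gateway.handle("sk-demo", "hello")[1])   # stages visited on a cache hit
```

Cheap rejections (auth, rate limit, budget) come first so that a request which will be refused never reaches the cache or a provider.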

Kubernetes Deployment for Production Scale

For production deployments requiring high availability, run multiple LiteLLM replicas behind a load balancer. Redis and PostgreSQL must be external shared services.

# kubernetes/litellm-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: litellm-proxy
  namespace: ai-platform
spec:
  replicas: 3       # 3 replicas - shared Redis state keeps them consistent
  selector:
    matchLabels:
      app: litellm-proxy
  template:
    metadata:
      labels:
        app: litellm-proxy
    spec:
      containers:
        - name: litellm
          image: ghcr.io/berriai/litellm:main-stable
          ports:
            - containerPort: 4000
          command:
            - litellm
            - --config
            - /app/config.yaml
            - --port
            - "4000"
          env:
            - name: ANTHROPIC_API_KEY
              valueFrom:
                secretKeyRef:
                  name: llm-secrets
                  key: anthropic-api-key
            - name: LITELLM_MASTER_KEY          # referenced by general_settings.master_key
              valueFrom:
                secretKeyRef:
                  name: llm-secrets
                  key: litellm-master-key       # assumed key name inside the Secret
            - name: REDIS_HOST
              value: "redis-cluster.ai-platform.svc.cluster.local"
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: llm-secrets
                  key: database-url
          volumeMounts:
            - name: config
              mountPath: /app/config.yaml
              subPath: config.yaml
          readinessProbe:
            httpGet:
              path: /health/readiness
              port: 4000
            initialDelaySeconds: 10
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health/liveliness
              port: 4000
            initialDelaySeconds: 30
            periodSeconds: 30
          resources:
            requests:
              memory: "512Mi"
              cpu: "250m"
            limits:
              memory: "1Gi"
              cpu: "500m"
      volumes:
        - name: config
          configMap:
            name: litellm-config
---
apiVersion: v1
kind: Service
metadata:
  name: litellm-proxy
  namespace: ai-platform
spec:
  selector:
    app: litellm-proxy
  ports:
    - port: 4000
      targetPort: 4000
  type: ClusterIP

Production Engineering Notes

:::tip Pin the LiteLLM Docker image version
LiteLLM releases frequently and occasionally introduces breaking YAML schema changes. In production, use a pinned tag like ghcr.io/berriai/litellm:main-v1.40.0 rather than a moving tag such as :main-latest or :main-stable. Read the changelog before upgrading.
:::

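:::warning Health check the providers, not just the container
A LiteLLM container can be running while all configured providers are returning errors. /health/readiness and /health/liveliness report only on the proxy process and its own dependencies, so they are the right targets for load balancer and Kubernetes probes. To test provider connectivity, call the authenticated /health endpoint, which makes a real request to each configured model. Because those requests consume tokens, run that check on a schedule from your monitoring system rather than from the load balancer.
:::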

:::danger Always set the master key - never run with an open proxy
Without a master key configured, any caller can send requests through your proxy and charge costs to your provider accounts. Bind the proxy to an internal network interface (127.0.0.1:4000 or the cluster-internal IP), not 0.0.0.0, unless you have authentication at the network layer.
:::

:::info The proxy adds 10–30ms latency per request
The network hop through LiteLLM Proxy adds roughly 10–30ms. For LLM responses that take 500–2000ms, this overhead is between 0.5% (10ms on a 2000ms response) and 6% (30ms on a 500ms response). For batch processing, it is negligible. For sub-50ms total latency requirements (extremely rare for LLM use cases), use SDK Mode instead.
:::

Common Mistakes

Mistake 1: Assigning the same model_name to different models unintentionally

If two entries share the same model_name, LiteLLM treats them as a load-balanced group. This is intentional for the multi-key pattern. But if you accidentally give Claude Sonnet and GPT-4o the same name, you will route production traffic to a random mix of both. Use distinct names unless you explicitly want load balancing across those entries.
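A small lint step in CI can catch this before deployment. The sketch below assumes the config has already been parsed into Python objects (for example with a YAML loader) and that each entry has the model_name and litellm_params.model fields used in this lesson; the function name is hypothetical.

```python
from collections import defaultdict


def find_mixed_groups(model_list: list[dict]) -> dict[str, set[str]]:
    """Return model_name groups whose entries point at more than one underlying model."""
    groups: dict[str, set[str]] = defaultdict(set)
    for entry in model_list:
        groups[entry["model_name"]].add(entry["litellm_params"]["model"])
    # A group with several keys for ONE model is fine; several models is suspicious.
    return {name: models for name, models in groups.items() if len(models) > 1}


model_list = [
    {"model_name": "claude-sonnet-lb", "litellm_params": {"model": "claude-sonnet-4-6"}},
    {"model_name": "claude-sonnet-lb", "litellm_params": {"model": "claude-sonnet-4-6"}},
    {"model_name": "default", "litellm_params": {"model": "claude-sonnet-4-6"}},
    {"model_name": "default", "litellm_params": {"model": "gpt-4o"}},   # accidental mix
]

for name, models in find_mixed_groups(model_list).items():
    print(f"WARNING: group '{name}' mixes models: {sorted(models)}")
```

If a mixed group is intentional, for example one model served by two providers under different identifiers, keep an explicit allowlist of such names rather than dropping the check.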

Mistake 2: Omitting rpm and tpm in model config

Without rate limit metadata, usage-based-routing-v2 cannot make informed routing decisions - it has no idea how close a key is to its limit. Always configure rpm and tpm in each model entry to match your actual provider quotas. The router uses these numbers to avoid routing to a key that is about to be rate-limited.
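To see why the numbers matter, consider a simplified quota-aware picker. It is an illustration of the reasoning, not LiteLLM's algorithm: the function name, the input shapes, and the choice to treat entries without a tpm value as ineligible are all assumptions made for this sketch.

```python
from typing import Optional


def pick_by_headroom(
    deployments: list[dict],
    used_tpm: dict[str, int],
    request_tokens: int,
) -> Optional[str]:
    """Pick the deployment with the most remaining tokens-per-minute headroom."""
    best_id: Optional[str] = None
    best_remaining = -1
    for dep in deployments:
        limit = dep.get("tpm")
        if limit is None:
            # No limit configured: there is nothing to compare usage against.
            continue
        remaining = limit - used_tpm.get(dep["id"], 0)
        if remaining >= request_tokens and remaining > best_remaining:
            best_id, best_remaining = dep["id"], remaining
    return best_id


deployments = [
    {"id": "key-1", "tpm": 100_000},
    {"id": "key-2", "tpm": 100_000},
    {"id": "key-3"},                      # tpm omitted in the config
]
used = {"key-1": 95_000, "key-2": 20_000, "key-3": 0}

print(pick_by_headroom(deployments, used, request_tokens=10_000))
```

key-1 is nearly exhausted and key-3 has no limit to reason about, so only key-2 can be chosen with confidence. A wrong tpm value is as harmful as a missing one, which is why the limits should mirror the provider's actual quota.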

Mistake 3: Not passing user_id in requests

LiteLLM tracks spend per end user, but only if each request identifies that user. Pass the identifier in the OpenAI-standard user field of the request; without it, all spend is attributed to an anonymous bucket and per-user cost reporting is useless. Make the user identifier a mandatory argument in every team's LLM call wrapper.
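One way to make the identifier mandatory is a shared helper that refuses to build a request without it. This sketch assumes the identifier travels in the OpenAI-standard user field and that a feature tag goes into metadata; build_chat_request and its parameters are hypothetical names, so adjust them to whatever your deployment reads.

```python
from typing import Any


def build_chat_request(
    model: str,
    messages: list[dict],
    user: str,
    feature: str,
    **extra: Any,
) -> dict[str, Any]:
    """Build kwargs for chat.completions.create(), refusing anonymous calls."""
    if not user or not user.strip():
        raise ValueError("user is required so spend can be attributed")
    return {
        "model": model,
        "messages": messages,
        "user": user,                                   # end-user ID for spend tracking
        "extra_body": {"metadata": {"feature": feature}},
        **extra,
    }


request = build_chat_request(
    model="claude-sonnet",
    messages=[{"role": "user", "content": "What is the CAP theorem?"}],
    user="user_8821",
    feature="documentation-assistant",
    max_tokens=300,
)
print(sorted(request))
```

Application code would then call the proxy with proxy_client.chat.completions.create(**request), so forgetting the user becomes a ValueError at the call site instead of a gap in the spend report.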

Mistake 4: Running multiple replicas without Redis

Without Redis, each proxy replica maintains its own independent in-memory routing state and cache. Load balancing decisions are inconsistent across replicas, and the cache provides no benefit at scale since each replica maintains a separate cache. Redis is required for any multi-replica deployment.

Mistake 5: Not configuring fallbacks before launch

The fallback config is the most important resilience feature, but it is only useful if configured before a provider goes down. Set up fallbacks in the config file before your first production deployment. Test the fallback in a staging environment by temporarily giving one provider an invalid API key and verifying that traffic routes to the fallback model.
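The behavior you are verifying in that test can be modeled in a few lines. The sketch below mirrors the claude-sonnet fallback chain from the config in this lesson; the function name and the fake provider are hypothetical, and real fallback handling in LiteLLM also involves retries and cooldowns.

```python
from typing import Callable

# Same shape as the fallbacks entry in litellm_config.yaml.
FALLBACKS = {"claude-sonnet": ["gpt-4o", "claude-haiku"]}


def complete_with_fallbacks(model: str, call: Callable[[str], str]) -> tuple[str, str]:
    """Try the requested model, then each fallback in order; return (model_used, response)."""
    errors = []
    for candidate in [model, *FALLBACKS.get(model, [])]:
        try:
            return candidate, call(candidate)
        except Exception as exc:          # any provider error triggers the next candidate
            errors.append(f"{candidate}: {exc}")
    raise RuntimeError("all models failed: " + "; ".join(errors))


def fake_provider(model: str) -> str:
    """Stand-in provider call that simulates an Anthropic outage."""
    if model.startswith("claude"):
        raise ConnectionError("provider unavailable")
    return f"ok from {model}"


used, text = complete_with_fallbacks("claude-sonnet", fake_provider)
print(used, "->", text)
```

A useful staging assertion is exactly what this prints: the response arrives, and the reported model is the fallback rather than the one requested.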

Mistake 6: Ignoring the admin UI for debugging

LiteLLM Proxy ships with a web-based admin UI at /ui. It shows real-time spend by user and team, all configured models, health status of each key, and recent request logs. Most teams never open it. During an incident - when you need to understand which key is being rate-limited, why fallback triggered, or which team is overrunning their budget - the admin UI saves 20 minutes of log trawling.

Interview Q&A

Q: What is LiteLLM and how does it differ from calling provider SDKs directly?

LiteLLM is a Python library and proxy server that normalizes the APIs of 100+ LLM providers into a single OpenAI-compatible interface. When you call providers directly, each requires a different SDK, authentication scheme, and response format - switching providers means changing application code. With LiteLLM, application code calls one consistent API; the provider is a config-level decision. This separation is valuable for multi-provider setups, fallback chains, centralized cost tracking, and organizations where multiple teams consume LLMs through a shared platform endpoint.

Q: What is the difference between LiteLLM SDK mode and proxy mode? When would you choose each?

SDK mode imports LiteLLM as a Python library and runs provider translation in-process. It adds no network overhead and is simpler to set up. Proxy mode runs LiteLLM as a standalone HTTP server; all services send requests over the network. Proxy mode provides: centralized cost tracking across all services, centralized caching (shared across replicas), centralized routing state, and compatibility with non-Python services. Choose SDK mode for a single-language, single-service Python application. Choose proxy mode for any multi-service architecture or when you need a true gateway layer with centralized control.

Q: How does LiteLLM handle provider failures during routing?

LiteLLM's router tracks error rates per model/key combination. When a key exceeds the allowed_fails threshold within a time window, it enters a cooldown period (defined by cooldown_time). During cooldown, no requests are routed to that key - they go to other available keys in the group. If all keys for a model group are in cooldown, LiteLLM uses the fallbacks config to route to a different model. After the cooldown period expires, the key is reintroduced to routing. All failures - including during fallback - are logged and reported to configured observability callbacks.
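The cooldown behavior described above can be sketched in a few lines of Python. This is a toy model for building intuition, not LiteLLM's actual implementation: the `CooldownTracker` class, its `window` parameter, and the injectable `clock` are hypothetical names introduced for this illustration. Only `allowed_fails` and `cooldown_time` come from the config.

```python
import time
from collections import defaultdict, deque


class CooldownTracker:
    """Toy model of per-key cooldown; not LiteLLM's real router code."""

    def __init__(self, allowed_fails=3, cooldown_time=60.0, window=60.0,
                 clock=time.monotonic):
        self.allowed_fails = allowed_fails
        self.cooldown_time = cooldown_time
        self.window = window      # assumed failure-counting window, in seconds
        self.clock = clock        # injectable so the logic can be tested
        self._failures = defaultdict(deque)   # key -> recent failure timestamps
        self._cooldown_until = {}             # key -> time when cooldown ends

    def record_failure(self, key):
        now = self.clock()
        failures = self._failures[key]
        failures.append(now)
        # Drop failures that have fallen out of the counting window.
        while failures and now - failures[0] > self.window:
            failures.popleft()
        # Exceeding the threshold puts the key into cooldown.
        if len(failures) > self.allowed_fails:
            self._cooldown_until[key] = now + self.cooldown_time
            failures.clear()

    def is_available(self, key):
        return self.clock() >= self._cooldown_until.get(key, float("-inf"))

    def pick(self, keys, fallback=None):
        """Return the first available key, or the fallback if all are cooling down."""
        for key in keys:
            if self.is_available(key):
                return key
        return fallback
```

Calling `pick(["key-1", "key-2"], fallback="openai-fallback")` mirrors the order of decisions in the answer: healthy keys in the group first, then the fallback model group only when every key is in cooldown.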

Q: How would you implement per-team cost budgets with LiteLLM Proxy?

Create teams via the admin API (POST /team/new) with max_budget and budget_duration fields. Issue virtual API keys for each team member (POST /key/generate) associated with the team. Every request using a virtual key is attributed to that user and team in the PostgreSQL spend table. LiteLLM enforces budgets by checking accumulated spend against the limit on each request - when the limit is exceeded, it returns a 429 with a budget-exceeded message. For soft alerting at 80% consumption, poll the /team/info endpoint on a schedule and fire Slack notifications before the hard limit is hit.
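As a rough sketch of that workflow, the snippet below builds (but does not send) the two admin requests and adds a helper for the 80% soft alert. The proxy address, the `team_alias` and `team_id` fields, the `"30d"` duration format, and the placeholder key are assumptions made for illustration; check them against your LiteLLM version before relying on them.

```python
import json
import urllib.request

PROXY_URL = "http://litellm-proxy:4000"   # assumed proxy address


def build_admin_request(path, payload, master_key):
    """Build (without sending) an authenticated POST to a proxy admin endpoint."""
    return urllib.request.Request(
        url=f"{PROXY_URL}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {master_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def should_alert(spend, max_budget, threshold=0.8):
    """True once spend reaches the soft-alert share of the budget."""
    return max_budget > 0 and spend / max_budget >= threshold


# Step 1: create the team with a budget (field values are illustrative).
team_request = build_admin_request(
    "/team/new",
    {"team_alias": "search-team", "max_budget": 500.0, "budget_duration": "30d"},
    master_key="sk-placeholder",
)

# Step 2: issue a virtual key tied to the team returned by step 1.
key_request = build_admin_request(
    "/key/generate",
    {"team_id": "team-id-from-step-1"},
    master_key="sk-placeholder",
)
```

Sending each request with `urllib.request.urlopen(...)` would perform the actual call. A scheduled job could feed the spend value read from /team/info into `should_alert` to decide when to post to Slack.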

Q: A service is getting 429s from Anthropic even though three API keys are configured in LiteLLM. What is the likely cause?

The most likely cause is that all three keys belong to the same Anthropic organization and share a single organization-level rate limit. Adding keys within the same org multiplies the key count but does not multiply the organization's token quota. The 429s come from the org-level limit, not the per-key limit. The fix options are: (1) request a rate limit increase from Anthropic, (2) use keys from different Anthropic accounts if permitted by the terms of service, or (3) add OpenAI or another provider as a fallback and configure LiteLLM to overflow to that provider when Anthropic is rate-limited. Using usage-based-routing-v2 with accurate tpm limits will prevent any single key from being over-subscribed, but it cannot create capacity that doesn't exist at the organization level.

Q: How would you migrate an existing application from direct Anthropic SDK calls to LiteLLM Proxy without breaking anything?

Deploy LiteLLM Proxy alongside the existing application, configured with the same Anthropic key as the application currently uses. Update the application's Anthropic client to point at the proxy: set base_url="http://litellm-proxy:4000" and use a LiteLLM virtual key instead of the Anthropic key directly. The Anthropic SDK is compatible with the LiteLLM proxy endpoint - no call syntax changes required. Run both pathways in parallel for one week, comparing response quality, latency, and error rates. Once validated, remove the direct Anthropic SDK dependency and standardize on the proxy for all LLM traffic. The migration can be done service by service with zero downtime.

Q: What is usage-based-routing-v2 and when should you use it over latency-based-routing?

usage-based-routing-v2 routes requests to the key with the most remaining capacity in its TPM/RPM buckets, tracked in real time using Redis. It reads the provider-configured limits from the model's rpm and tpm fields in the config, and routes to the key furthest from its limits. This is the correct strategy when your primary constraint is staying within provider rate limits - for example, when you have multiple keys at different tier levels and need to respect each key's exact quota. latency-based-routing routes to the lowest P95 latency key without explicit quota tracking. It works well when keys have similar limits and you want to optimize user-perceived response time. Use usage-based-routing-v2 when quota compliance is critical; use latency-based-routing when performance optimization is the primary goal.
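The selection rule for usage-based routing can be illustrated with a small, simplified sketch. It is not LiteLLM's implementation: the `Deployment` dataclass and `pick_by_remaining_capacity` function are hypothetical, and the usage numbers that the real proxy tracks in Redis are passed in directly here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Deployment:
    name: str
    tpm_limit: int   # tokens-per-minute limit from the config
    tpm_used: int    # tokens consumed so far in the current minute


def pick_by_remaining_capacity(
    deployments: List[Deployment], request_tokens: int
) -> Optional[Deployment]:
    """Pick the deployment with the largest share of its TPM budget still free."""
    candidates = [
        d for d in deployments
        # Skip keys that this request would push over their limit.
        if d.tpm_used + request_tokens <= d.tpm_limit
    ]
    if not candidates:
        return None   # the caller would fall back to another model group
    return max(candidates, key=lambda d: (d.tpm_limit - d.tpm_used) / d.tpm_limit)
```

Ranking by the free fraction rather than the absolute number of free tokens keeps a large, nearly exhausted key from being preferred over a small, mostly idle one. That is one reasonable reading of "furthest from its limits", chosen here for illustration.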

LiteLLM Proxy: Complete Production Configuration

The following is a production-ready LiteLLM Proxy configuration file combining all the elements covered in this lesson - multi-key routing, fallbacks, cost tracking, and caching.

# litellm_config.yaml - production configuration
model_list:
  # Primary: Claude Sonnet with 3 keys for throughput
  - model_name: claude-primary
    litellm_params:
      model: anthropic/claude-sonnet-4-6
      api_key: os.environ/ANTHROPIC_API_KEY_1
      rpm: 500
      tpm: 200_000
  - model_name: claude-primary
    litellm_params:
      model: anthropic/claude-sonnet-4-6
      api_key: os.environ/ANTHROPIC_API_KEY_2
      rpm: 500
      tpm: 200_000
  - model_name: claude-primary
    litellm_params:
      model: anthropic/claude-sonnet-4-6
      api_key: os.environ/ANTHROPIC_API_KEY_3
      rpm: 1000
      tpm: 400_000   # Higher-tier key

  # Fallback: OpenAI GPT-4o
  - model_name: openai-fallback
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY

  # Budget model: Claude Haiku for low-cost tasks
  - model_name: claude-budget
    litellm_params:
      model: anthropic/claude-haiku-4-5-20251001
      api_key: os.environ/ANTHROPIC_API_KEY_1

router_settings:
  routing_strategy: usage-based-routing-v2
  redis_url: os.environ/REDIS_URL
  num_retries: 2
  retry_after: 5
  allowed_fails: 3
  cooldown_time: 60
  fallbacks:
    - {"claude-primary": ["openai-fallback"]}

litellm_settings:
  success_callback: ["langfuse"]
  failure_callback: ["langfuse"]
  set_verbose: false

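  # Semantic caching. similarity_threshold only takes effect with the
  # redis-semantic cache type, which also needs an embedding model configured;
  # plain `type: redis` gives exact-match caching.
  cache: true
  cache_params:
    type: redis-semantic
    host: os.environ/REDIS_HOST
    port: 6379
    similarity_threshold: 0.95
    supported_call_types: ["acompletion", "completion"]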

general_settings:
  database_url: os.environ/DATABASE_URL   # PostgreSQL for spend tracking
  store_model_in_db: true
  master_key: os.environ/LITELLM_MASTER_KEY

environment_variables:
  LANGFUSE_PUBLIC_KEY: os.environ/LANGFUSE_PUBLIC_KEY
  LANGFUSE_SECRET_KEY: os.environ/LANGFUSE_SECRET_KEY
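A config of this size is easy to break with a typo in a fallback name. The following is a minimal pre-deployment check, assuming the YAML has already been parsed into a Python dict (for example with a YAML parser of your choice). The `check_fallback_targets` function is a hypothetical helper, not part of LiteLLM.

```python
def check_fallback_targets(config: dict) -> list:
    """Return a list of problems found in the fallback configuration."""
    model_names = {entry["model_name"] for entry in config.get("model_list", [])}
    problems = []
    for mapping in config.get("router_settings", {}).get("fallbacks", []):
        for source, targets in mapping.items():
            if source not in model_names:
                problems.append(f"fallback source '{source}' is not in model_list")
            for target in targets:
                if target not in model_names:
                    problems.append(f"fallback target '{target}' is not in model_list")
    return problems


# Trimmed-down version of the config above, already parsed into a dict.
config = {
    "model_list": [
        {"model_name": "claude-primary"},
        {"model_name": "openai-fallback"},
        {"model_name": "claude-budget"},
    ],
    "router_settings": {
        "fallbacks": [{"claude-primary": ["openai-fallback"]}],
    },
}

problems = check_fallback_targets(config)
```

An empty `problems` list means every fallback source and target refers to a model group that actually exists. Running a check like this in CI catches a misspelled fallback before an outage reveals it.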

Monitoring LiteLLM Proxy in Production

When the Prometheus integration is enabled, LiteLLM exposes metrics at /metrics. Exact metric names vary between LiteLLM versions, so confirm them against the /metrics output of your own deployment. Key metrics to scrape and alert on:

Metric	Type	Alert Condition
`litellm_request_total_requests`	Counter	Sudden drop (service stopped receiving traffic)
`litellm_requests_total_failed`	Counter	Error rate above 1% of total requests
`litellm_deployment_success_responses`	Counter	Track per-deployment success rates
`litellm_deployment_failure_responses`	Counter	Any deployment with sustained failures
`litellm_remaining_requests`	Gauge	Below 20% of RPM limit on any key
`litellm_remaining_tokens`	Gauge	Below 20% of TPM limit on any key
`litellm_overhead_latency_ms`	Histogram	P99 above 100ms (gateway is slow)

Configure Prometheus scraping at the /metrics endpoint and set up Grafana dashboards for these metrics from day one. The most operationally useful dashboard shows: requests per second by model, error rate by model, TPM utilization per key (as % of limit), and the P95 latency breakdown (gateway overhead vs LLM response time).
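The alert conditions in the table reduce to simple arithmetic on scraped values. The sketch below shows that arithmetic in plain Python, assuming you already have two successive samples of the cumulative counters and the current gauge values. The function names and the sample numbers are illustrative, and in practice these rules would normally live in Prometheus alerting rules rather than application code.

```python
def error_rate(total_before, total_after, failed_before, failed_after):
    """Fraction of requests that failed between two scrapes of cumulative counters."""
    total_delta = total_after - total_before
    failed_delta = failed_after - failed_before
    if total_delta <= 0:
        return 0.0   # no traffic (or a counter reset): nothing to compute
    return failed_delta / total_delta


def remaining_fraction(remaining, limit):
    """Share of a key's RPM or TPM limit that is still unused."""
    return remaining / limit if limit > 0 else 0.0


def evaluate_alerts(rate, remaining_tokens, tpm_limit,
                    max_error_rate=0.01, min_remaining=0.20):
    """Return the names of the alert conditions from the table that are firing."""
    firing = []
    if rate > max_error_rate:
        firing.append("error-rate")
    if remaining_fraction(remaining_tokens, tpm_limit) < min_remaining:
        firing.append("tpm-headroom")
    return firing


# Example with made-up samples: 20 failures out of 1000 new requests,
# and 30,000 tokens left on a key with a 200,000 TPM limit.
rate = error_rate(1000, 2000, 10, 30)
firing = evaluate_alerts(rate, remaining_tokens=30_000, tpm_limit=200_000)
```

Working from counter deltas rather than raw totals matters because Prometheus counters only ever increase; the rate over a window is what the 1% threshold applies to.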

Summary: LiteLLM Proxy in Production

LiteLLM is one of the most widely adopted self-hosted LLM gateways. Its key properties:

OpenAI-compatible endpoint: any service using the standard OpenAI SDK can point at LiteLLM with a base URL change - no other code changes required
100+ provider support: every major LLM provider, managed by one unified config file
Virtual keys: decouple application keys from provider credentials, enabling zero-downtime rotation and per-team spend caps
Budget enforcement: team and user budgets enforced at request time using PostgreSQL spend tracking
Fallback chains: automatic provider failover with configurable retry logic
Semantic caching: Redis-backed embedding similarity cache, shared across all services
Routing strategies: simple-shuffle, latency-based-routing, usage-based-routing-v2, and least-busy
Admin API: programmatic control over virtual keys, teams, budgets, and routing - enabling automated provisioning workflows

The canonical deployment is Docker Compose (or Kubernetes) with LiteLLM Proxy + Redis + PostgreSQL. This three-component stack handles everything described in this lesson and scales to thousands of requests per minute on modest hardware.
