What is LLM fallback?

Design resilient LLM clients with configurable fallback chains, exponential backoff with jitter, and circuit breakers that handle provider failures gracefully without any user-facing impact.



Model Fallback and Retry

When the Primary Goes Down

It was a Friday afternoon when the first alerts fired. The AI-powered code review service - a critical part of the developer platform used by 12,000 engineers - had started returning 503 errors. The on-call engineer pulled up the dashboard: Anthropic was experiencing a partial service degradation in one region. Roughly 40% of API requests to Claude were timing out: not all of them, but enough to make the service appear unreliable.

The engineering team thought they had a fallback configured. The truth was more complicated. Service A had a try/except block that caught anthropic.APIStatusError and then called OpenAI - but it was also catching anthropic.BadRequestError (400 responses for malformed requests) and retrying those, even though they would never succeed. Service B retried the same Claude endpoint three times before giving up with no fallback at all. Service C had no error handling whatsoever - it propagated the exception directly to the user as an HTTP 500. By the time the incident resolved, Service C had been completely down, Service B had gone down once its three retries were exhausted, and Service A had partially worked, but its incorrect retry behavior was burning through rate limit quota on 400 errors that would never succeed.

The post-mortem produced two directives: centralize fallback logic at the gateway layer, and define a precise taxonomy of which HTTP errors warrant retry, which warrant provider fallback, and which should never be retried under any circumstances.

The Failure Taxonomy: Knowing What Broke

The most important prerequisite to resilient retry logic is correctly classifying failures. Retrying the wrong errors wastes requests, increases latency, and consumes quota.

The five failure classes and their correct responses:

Rate limit (429): retry with exponential backoff. Always respect the Retry-After header if present - it tells you exactly when the provider's window resets. Retrying before the window resets burns another request for zero benefit.

Server error (5xx): retry once with a short delay, then fall back to a different provider. Server errors are frequently transient (a single bad instance, a brief capacity spike), but if the first retry also fails, the provider is likely experiencing a broader issue and fallback is appropriate.

Context length exceeded (400 + context_length_exceeded): do NOT retry the same model - it will fail identically every time. Instead, fall back to a model with a larger context window. This is a separate fallback path from the provider error path.

Content policy violation (400 + content_filter): do NOT retry under any circumstances. The request itself violates policy. Retrying wastes quota. Log the request for review - it may indicate a prompt injection attack or a legitimate edge case that needs a prompt engineering fix.

Auth error (401/403): do NOT retry. A bad API key or expired credential needs human intervention, not automated retry. Page the on-call engineer immediately.
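
The five classes above can be sketched as a single lookup from an HTTP status code and an error code to a response. This is a minimal illustration, not any provider's API: `Action`, `choose_action`, and the `error_code` strings are hypothetical names, and the codes simply reuse the labels from the list above. Real providers report error subtypes in different fields and with different spellings.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    RETRY_WITH_BACKOFF = "retry_with_backoff"            # rate limit (429)
    RETRY_THEN_FALLBACK = "retry_then_fallback"          # server error (5xx)
    FALLBACK_LARGER_CONTEXT = "fallback_larger_context"  # context length exceeded
    FAIL_AND_LOG = "fail_and_log"                        # content policy violation
    FAIL_AND_PAGE = "fail_and_page"                      # auth error (401/403)
    FAIL = "fail"                                        # anything unclassified


def choose_action(status_code: int, error_code: Optional[str] = None) -> Action:
    """Map a status code and an (assumed) error code to the taxonomy's response."""
    if status_code == 429:
        return Action.RETRY_WITH_BACKOFF
    if 500 <= status_code < 600:
        return Action.RETRY_THEN_FALLBACK
    if status_code in (401, 403):
        return Action.FAIL_AND_PAGE
    if status_code == 400:
        # error_code values are assumptions taken from the labels in the text
        if error_code == "context_length_exceeded":
            return Action.FALLBACK_LARGER_CONTEXT
        if error_code == "content_filter":
            return Action.FAIL_AND_LOG
    # Unknown failures are not retried: retrying blindly is how quota gets burned
    return Action.FAIL


for status, code in [(429, None), (503, None), (400, "context_length_exceeded"),
                     (400, "content_filter"), (401, None), (400, None)]:
    print(status, code, "->", choose_action(status, code).value)
```

Defaulting to `FAIL` keeps the retry set an explicit allowlist, so a new or unrecognized error never gets retried by accident.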

Exponential Backoff with Jitter

Exponential backoff is the foundation of all retry logic for transient failures. Without jitter, clients that fail at the same moment also retry at the same moment, which causes a thundering herd problem.

import time
import random
import anthropic
from typing import Optional, Callable, TypeVar

T = TypeVar("T")


def exponential_backoff_delay(
    attempt: int,
    base_delay_s: float = 1.0,
    max_delay_s: float = 60.0,
    jitter: bool = True,
) -> float:
    """
    Calculate the wait time before attempt N using full jitter.

    Without jitter (bad):
      attempt 0 -> 1.0s, attempt 1 -> 2.0s, attempt 2 -> 4.0s
      If 1000 clients all hit a rate limit at the same moment,
      they all retry at the same moment -> thundering herd -> another rate limit.

    With full jitter (good):
      attempt 0 -> random(0, 1.0), attempt 1 -> random(0, 2.0), ...
      Clients spread out their retries -> no thundering herd.

    AWS blog post "Exponential Backoff and Jitter" (2015) recommends full jitter.
    """
    # Exponential: 1s, 2s, 4s, 8s, ... capped at max_delay_s
    delay = min(base_delay_s * (2 ** attempt), max_delay_s)
    if jitter:
        # Full jitter: uniform random in [0, delay]
        delay = random.uniform(0.0, delay)
    return delay


def retry_with_backoff(
    func: Callable[[], T],
    max_attempts: int = 4,
    retryable_status_codes: tuple[int, ...] = (429, 500, 502, 503, 504),
    non_retryable_status_codes: tuple[int, ...] = (400, 401, 403),
    base_delay_s: float = 1.0,
    max_delay_s: float = 60.0,
) -> T:
    """
    Retry a callable with exponential backoff + full jitter.

    - Retries on: rate limits (429), server errors (5xx), connection errors
    - Does NOT retry on: auth errors (401/403), bad requests (400)
    - Respects Retry-After header when present in rate limit responses
    - Raises the last exception after max_attempts are exhausted
    """
    last_exception: Optional[Exception] = None

    for attempt in range(max_attempts):
        try:
            return func()

        except anthropic.RateLimitError as e:
            last_exception = e
            if attempt == max_attempts - 1:
                raise

            # Provider tells us exactly when to retry - respect it.
            # The SDK error carries the raw HTTP response, so read the header from there.
            retry_after = e.response.headers.get("retry-after")
            try:
                delay = float(retry_after) if retry_after is not None else None
            except ValueError:
                delay = None  # header was an HTTP-date or malformed - use backoff instead
            if delay is not None:
                print(f"Rate limited. Provider says retry after {delay:.1f}s. "
                      f"(attempt {attempt + 1}/{max_attempts})")
            else:
                delay = exponential_backoff_delay(attempt, base_delay_s, max_delay_s)
                print(f"Rate limited. Backoff: {delay:.1f}s. "
                      f"(attempt {attempt + 1}/{max_attempts})")
            time.sleep(delay)

        except anthropic.APIStatusError as e:
            last_exception = e

            # Non-retryable client errors - fail immediately
            if e.status_code in non_retryable_status_codes:
                raise

            # For retryable server errors
            if e.status_code in retryable_status_codes:
                if attempt == max_attempts - 1:
                    raise
                delay = exponential_backoff_delay(attempt, base_delay_s, max_delay_s)
                print(f"Server error {e.status_code}. Backoff: {delay:.1f}s. "
                      f"(attempt {attempt + 1}/{max_attempts})")
                time.sleep(delay)
            else:
                raise  # Unknown status code - don't retry

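        # The SDK wraps network failures in anthropic.APIConnectionError
        # (APITimeoutError is a subclass), which is not a builtin ConnectionError
        except (anthropic.APIConnectionError, TimeoutError, ConnectionError, OSError) as e: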
            last_exception = e
            if attempt == max_attempts - 1:
                raise
            delay = exponential_backoff_delay(attempt, base_delay_s, max_delay_s)
            print(f"Network error: {type(e).__name__}. Backoff: {delay:.1f}s. "
                  f"(attempt {attempt + 1}/{max_attempts})")
            time.sleep(delay)

    assert last_exception is not None
    raise last_exception


def demo_retry() -> None:
    """Demonstrate retry with backoff for a real Claude call."""
    # max_retries=0 turns off the SDK's built-in retries so they don't stack with ours
    client = anthropic.Anthropic(max_retries=0)

    def make_request() -> anthropic.types.Message:
        return client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=256,
            messages=[{"role": "user", "content": "What is gradient descent?"}],
        )

    response = retry_with_backoff(make_request, max_attempts=3)
    print(f"Success: {response.content[0].text[:100]}...")
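
To make the thundering herd concrete, the sketch below compares the delays 1,000 clients would choose before the same retry attempt, with and without full jitter. It is a standalone illustration under stated assumptions: `backoff_delay` and `simulate` are hypothetical helpers that mirror the capped-exponential formula above, and the RNG is seeded only so the run is repeatable.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  rng: Optional[random.Random] = None) -> float:
    """Capped exponential delay; full jitter is applied when an RNG is supplied."""
    ceiling = min(base * (2 ** attempt), cap)
    return rng.uniform(0.0, ceiling) if rng is not None else ceiling


def simulate(clients: int, attempt: int, jitter: bool, seed: int = 42) -> list[float]:
    """Delay each of `clients` would wait before the same retry attempt."""
    rng = random.Random(seed) if jitter else None
    return [backoff_delay(attempt, rng=rng) for _ in range(clients)]


fixed = simulate(clients=1000, attempt=2, jitter=False)
spread = simulate(clients=1000, attempt=2, jitter=True)

# Without jitter every client picks the identical delay, so all retries land together
print(f"no jitter:   {len(set(fixed))} distinct retry time(s), all at {fixed[0]:.1f}s")
# With full jitter the same retries are spread across the whole [0, ceiling] interval
print(f"full jitter: {len(set(spread))} distinct retry times between "
      f"{min(spread):.2f}s and {max(spread):.2f}s")
```

The trade-off is that full jitter lowers the average wait as well as spreading it, so the `cap` and the attempt budget still bound how hard clients press on a struggling provider.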

Building a Full Fallback Chain

A fallback chain defines an ordered list of provider/model targets. The chain is worked through in order until one succeeds or all are exhausted. Each target has its own retry budget and timeout.

import anthropic
import openai
import time
import random
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class FailureType(Enum):
    RATE_LIMIT = "rate_limit"
    SERVER_ERROR = "server_error"
    CONTEXT_LENGTH = "context_length"
    CONTENT_POLICY = "content_policy"
    TIMEOUT = "timeout"
    AUTH_ERROR = "auth_error"
    UNKNOWN = "unknown"


@dataclass
class FallbackTarget:
    provider: str                 # "anthropic" | "openai"
    model: str
    max_retries: int = 2          # retries within this target before moving to next
    timeout_s: float = 30.0
    # Failure types that trigger moving to the next target in the chain
    fallback_on: tuple[FailureType, ...] = (
        FailureType.RATE_LIMIT,
        FailureType.SERVER_ERROR,
        FailureType.TIMEOUT,
    )


@dataclass
class FallbackConfig:
    targets: list[FallbackTarget]
    # Failure types that are never retried anywhere in the chain
    never_retry: tuple[FailureType, ...] = (
        FailureType.CONTENT_POLICY,
        FailureType.AUTH_ERROR,
    )


@dataclass
class CompletionResult:
    response: str
    provider: str
    model: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    total_attempts: int
    fallback_triggered: bool
    failure_history: list[str] = field(default_factory=list)


class ResilientLLMClient:
    """
    Production LLM client with a configurable fallback chain.

    Architecture:
    - Typed failure classification (maps provider exceptions -> FailureType enum)
    - Per-target retry with exponential backoff + jitter
    - Chain progression on non-recoverable failures
    - Never-retry list for content policy and auth errors
    - Detailed failure history for post-incident debugging
    """

    def __init__(self, config: FallbackConfig):
        self.config = config
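        # max_retries=0 turns off the SDKs' built-in retries so they don't
        # multiply with the per-target retry budget this class manages
        self._anthropic = anthropic.Anthropic(max_retries=0)
        self._openai = openai.OpenAI(max_retries=0)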

    # ─── Failure classification ──────────────────────────────────────────────

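    def _classify_anthropic(self, e: Exception) -> FailureType:
        if isinstance(e, anthropic.RateLimitError):
            return FailureType.RATE_LIMIT
        # 401 (AuthenticationError) and 403 (PermissionDeniedError) are both credential problems
        if isinstance(e, (anthropic.AuthenticationError, anthropic.PermissionDeniedError)):
            return FailureType.AUTH_ERROR
        if isinstance(e, anthropic.APIStatusError):
            if e.status_code >= 500:
                return FailureType.SERVER_ERROR
            if e.status_code == 400:
                # Substring matching is a heuristic. A bare "tokens" match is avoided
                # because it also fires on unrelated errors (e.g. an invalid max_tokens).
                body = str(e).lower()
                if "context_length" in body or "too long" in body:
                    return FailureType.CONTEXT_LENGTH
                if "content" in body and ("filter" in body or "policy" in body or "safety" in body):
                    return FailureType.CONTENT_POLICY
        # The SDK raises APIConnectionError (APITimeoutError is a subclass) for network failures
        if isinstance(e, (anthropic.APIConnectionError, TimeoutError, ConnectionError)):
            return FailureType.TIMEOUT
        return FailureType.UNKNOWN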

    def _classify_openai(self, e: Exception) -> FailureType:
        if isinstance(e, openai.RateLimitError):
            return FailureType.RATE_LIMIT
        # 401 (AuthenticationError) and 403 (PermissionDeniedError) are both credential problems
        if isinstance(e, (openai.AuthenticationError, openai.PermissionDeniedError)):
            return FailureType.AUTH_ERROR
        if isinstance(e, openai.APIStatusError):
            if e.status_code >= 500:
                return FailureType.SERVER_ERROR
            if e.status_code == 400:
                body = str(e).lower()
                if "context_length" in body or "maximum context" in body:
                    return FailureType.CONTEXT_LENGTH
                if "content_policy" in body or "content_filter" in body:
                    return FailureType.CONTENT_POLICY
        # APITimeoutError is a subclass of APIConnectionError, so this covers both
        if isinstance(e, openai.APIConnectionError):
            return FailureType.TIMEOUT
        return FailureType.UNKNOWN

    # ─── Provider calls ──────────────────────────────────────────────────────

    def _call(
        self, target: FallbackTarget, messages: list[dict], max_tokens: int
    ) -> tuple[str, int, int]:
        """Make a single API call. Returns (text, input_tokens, output_tokens)."""
        if target.provider == "anthropic":
            response = self._anthropic.messages.create(
                model=target.model,
                max_tokens=max_tokens,
                messages=messages,
                timeout=target.timeout_s,
            )
            return (
                response.content[0].text,
                response.usage.input_tokens,
                response.usage.output_tokens,
            )
        elif target.provider == "openai":
            response = self._openai.chat.completions.create(
                model=target.model,
                messages=messages,
                max_tokens=max_tokens,
                timeout=target.timeout_s,
            )
            return (
                response.choices[0].message.content,
                response.usage.prompt_tokens,
                response.usage.completion_tokens,
            )
        else:
            raise ValueError(f"Unknown provider: {target.provider}")

    # ─── Per-target execution with retry ─────────────────────────────────────

    def _try_target(
        self,
        target: FallbackTarget,
        messages: list[dict],
        max_tokens: int,
    ) -> tuple[Optional[tuple[str, int, int]], Optional[FailureType], str, int]:
        """
        Attempt a call to a single target, retrying up to target.max_retries times.
        Returns:
          (result, None, "", attempts)           on success
          (None, failure_type, msg, attempts)    on failure - caller decides whether to advance chain
        where attempts is the number of API calls actually made.
        """
        last_failure: Optional[FailureType] = None
        last_msg = ""
        attempts = 0

        for attempt in range(target.max_retries + 1):
            attempts += 1
            try:
                result = self._call(target, messages, max_tokens)
                return result, None, "", attempts

            except Exception as e:
                classify = self._classify_anthropic if target.provider == "anthropic" \
                    else self._classify_openai
                failure_type = classify(e)
                last_failure = failure_type
                last_msg = f"{target.provider}/{target.model} attempt {attempt}: {type(e).__name__}: {e}"

                # Never-retry failures - propagate immediately, don't even try next target
                if failure_type in self.config.never_retry:
                    return None, failure_type, last_msg, attempts

                # Context-length and unclassified failures will not be fixed by calling
                # the same model again - stop here and let the caller advance the chain
                if failure_type not in target.fallback_on:
                    return None, failure_type, last_msg, attempts

                # Transient failure: back off and retry while budget remains on this target
                if attempt < target.max_retries:
                    delay = exponential_backoff_delay(attempt, base_delay_s=1.0)
                    print(f"  Retry {attempt + 1}/{target.max_retries} after {delay:.1f}s "
                          f"({failure_type.value})")
                    time.sleep(delay)

        return None, last_failure, last_msg, attempts

    # ─── Main completion method ───────────────────────────────────────────────

    def complete(
        self,
        messages: list[dict],
        max_tokens: int = 1024,
    ) -> CompletionResult:
        """
        Execute a completion through the fallback chain.

        Tries each target in order until one succeeds.
        Never-retry failures (content policy, auth) propagate immediately.
        All other failures advance the chain.
        """
        failure_history: list[str] = []
        total_attempts = 0
        start = time.time()

        for i, target in enumerate(self.config.targets):
            print(f"Trying target {i + 1}/{len(self.config.targets)}: "
                  f"{target.provider}/{target.model}")

            result, failure_type, error_msg, attempts = self._try_target(
                target, messages, max_tokens
            )
            # Count the calls actually made, not the target's full retry budget
            total_attempts += attempts

            if result is not None:
                text, input_tokens, output_tokens = result
                return CompletionResult(
                    response=text,
                    provider=target.provider,
                    model=target.model,
                    input_tokens=input_tokens,
                    output_tokens=output_tokens,
                    latency_ms=(time.time() - start) * 1000,
                    total_attempts=total_attempts,
                    fallback_triggered=(i > 0),
                    failure_history=failure_history,
                )

            failure_history.append(error_msg)
            print(f"  Target {i + 1} failed: {failure_type.value if failure_type else 'unknown'}")

            # Immediately propagate never-retry failures
            if failure_type in self.config.never_retry:
                raise RuntimeError(
                    f"Non-retryable failure ({failure_type.value}): {error_msg}"
                )

            if i < len(self.config.targets) - 1:
                next_target = self.config.targets[i + 1]
                print(f"  Advancing to {next_target.provider}/{next_target.model}")

        raise RuntimeError(
            f"All {len(self.config.targets)} fallback targets exhausted.\n"
            + "\n".join(f"  {h}" for h in failure_history)
        )


# ─────────────────────────────────────────────────────────────────────────────
# Production configuration
# ─────────────────────────────────────────────────────────────────────────────

PRODUCTION_FALLBACK_CONFIG = FallbackConfig(
    targets=[
        # Primary: Claude Sonnet - best quality
        FallbackTarget(
            provider="anthropic",
            model="claude-sonnet-4-6",
            max_retries=2,
            timeout_s=45.0,
            fallback_on=(FailureType.RATE_LIMIT, FailureType.SERVER_ERROR, FailureType.TIMEOUT),
        ),
        # Fallback 1: GPT-4o - reliable alternative, comparable quality
        FallbackTarget(
            provider="openai",
            model="gpt-4o",
            max_retries=1,
            timeout_s=30.0,
            fallback_on=(FailureType.RATE_LIMIT, FailureType.SERVER_ERROR, FailureType.TIMEOUT),
        ),
        # Fallback 2: Claude Haiku - fast and cheap last resort
        FallbackTarget(
            provider="anthropic",
            model="claude-haiku-4-5-20251001",
            max_retries=1,
            timeout_s=20.0,
            fallback_on=(FailureType.RATE_LIMIT, FailureType.SERVER_ERROR),
        ),
    ],
    never_retry=(FailureType.CONTENT_POLICY, FailureType.AUTH_ERROR),
)


def demo_fallback_client() -> None:
    client = ResilientLLMClient(PRODUCTION_FALLBACK_CONFIG)

    messages = [{"role": "user", "content": "Explain the CAP theorem in two concise paragraphs."}]
    result = client.complete(messages, max_tokens=300)

    print(f"\n=== Result ===")
    print(f"Provider:         {result.provider}")
    print(f"Model:            {result.model}")
    print(f"Fallback fired:   {result.fallback_triggered}")
    print(f"Total attempts:   {result.total_attempts}")
    print(f"Latency:          {result.latency_ms:.0f}ms")
    print(f"Tokens:           {result.input_tokens} in / {result.output_tokens} out")
    if result.failure_history:
        print(f"Failures before success:")
        for h in result.failure_history:
            print(f"  - {h[:100]}")
    print(f"\nResponse:\n{result.response[:400]}")


if __name__ == "__main__":
    demo_fallback_client()
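
Because each target carries its own timeout and retry budget, it helps to know the longest a request can spend inside the chain before it finally fails. The sketch below computes an upper bound for the production configuration above, assuming every attempt runs to its full timeout and every backoff draws its maximum value. `Target` and `worst_case_seconds` are hypothetical names for this illustration, and the figure ignores any retries the provider SDK performs internally.

```python
from dataclasses import dataclass


@dataclass
class Target:
    name: str
    max_retries: int
    timeout_s: float


def worst_case_seconds(targets: list[Target], base_delay_s: float = 1.0,
                       max_delay_s: float = 60.0) -> float:
    """Upper bound on time spent in the chain when every attempt fails."""
    total = 0.0
    for t in targets:
        attempts = t.max_retries + 1
        # Every attempt may run all the way to its timeout
        total += attempts * t.timeout_s
        # One backoff sleep follows each failed attempt except the last;
        # full jitter draws from [0, ceiling], so the ceiling is the worst case
        total += sum(min(base_delay_s * 2 ** k, max_delay_s)
                     for k in range(t.max_retries))
    return total


chain = [
    Target("anthropic/claude-sonnet-4-6", max_retries=2, timeout_s=45.0),
    Target("openai/gpt-4o", max_retries=1, timeout_s=30.0),
    Target("anthropic/claude-haiku-4-5-20251001", max_retries=1, timeout_s=20.0),
]
print(f"Worst case before the chain gives up: {worst_case_seconds(chain):.0f}s")
```

For this configuration the bound is 240 seconds, far longer than an interactive user will wait, so a real-time service would likely want tighter timeouts or an overall deadline on top of the per-target budgets.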

Circuit Breaker: Stop Hammering a Dead Provider

When a provider is experiencing a sustained outage, retrying repeatedly wastes time and burns quota. A circuit breaker tracks provider health over a rolling time window and stops sending requests to a degraded provider until it shows signs of recovery.

import threading
import time
from collections import deque
from enum import Enum
from dataclasses import dataclass, field
from typing import Optional


class CircuitState(Enum):
    CLOSED = "closed"         # Normal operation - let requests through
    OPEN = "open"             # Provider is failing - route around it
    HALF_OPEN = "half_open"   # Testing recovery - let probe requests through


@dataclass
class CircuitBreakerConfig:
    failure_threshold: int = 5       # Failures in window before opening
    window_seconds: float = 60.0     # Rolling window for counting failures
    recovery_timeout_s: float = 30.0 # Wait before probing (OPEN -> HALF_OPEN)
    success_threshold: int = 2       # Consecutive successes to close (HALF_OPEN -> CLOSED)


class CircuitBreaker:
    """
    Per-provider circuit breaker with three-state state machine.

    Thread-safe: all state mutations are inside a reentrant lock.
    For multi-replica deployments, share state via Redis instead of in-memory.
    """

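    def __init__(self, name: str, config: Optional[CircuitBreakerConfig] = None):
        self.name = name
        # Build the default per instance rather than sharing one mutable config object
        self.config = config if config is not None else CircuitBreakerConfig()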
        self._state = CircuitState.CLOSED
        self._failure_times: deque[float] = deque()
        self._open_since: Optional[float] = None
        self._half_open_successes: int = 0
        self._lock = threading.RLock()

    @property
    def state(self) -> CircuitState:
        with self._lock:
            if self._state == CircuitState.OPEN:
                # Check if recovery timeout has elapsed - transition to HALF_OPEN
                if self._open_since and (time.time() - self._open_since) >= self.config.recovery_timeout_s:
                    self._state = CircuitState.HALF_OPEN
                    self._half_open_successes = 0
                    print(f"Circuit [{self.name}]: OPEN -> HALF_OPEN (probing)")
            return self._state

    def is_available(self) -> bool:
        """Returns True if the circuit allows this request to proceed."""
        s = self.state
        return s in (CircuitState.CLOSED, CircuitState.HALF_OPEN)

    def record_success(self) -> None:
        """Call this when a request succeeds."""
        with self._lock:
            if self._state == CircuitState.HALF_OPEN:
                self._half_open_successes += 1
                if self._half_open_successes >= self.config.success_threshold:
                    self._state = CircuitState.CLOSED
                    self._failure_times.clear()
                    self._open_since = None
                    print(f"Circuit [{self.name}]: HALF_OPEN -> CLOSED (recovered)")

    def record_failure(self) -> None:
        """Call this when a request fails."""
        with self._lock:
            now = time.time()

            if self._state == CircuitState.HALF_OPEN:
                # Probe failed - reopen the circuit
                self._state = CircuitState.OPEN
                self._open_since = now
                print(f"Circuit [{self.name}]: HALF_OPEN -> OPEN (probe failed)")
                return

            # Track failure timestamp in rolling window
            self._failure_times.append(now)
            cutoff = now - self.config.window_seconds
            while self._failure_times and self._failure_times[0] < cutoff:
                self._failure_times.popleft()

            # Open circuit if threshold exceeded
            if (self._state == CircuitState.CLOSED and
                    len(self._failure_times) >= self.config.failure_threshold):
                self._state = CircuitState.OPEN
                self._open_since = now
                print(f"Circuit [{self.name}]: CLOSED -> OPEN "
                      f"({len(self._failure_times)} failures in {self.config.window_seconds}s)")

    def status(self) -> dict:
        with self._lock:
            return {
                "provider": self.name,
                "state": self._state.value,
                "recent_failures": len(self._failure_times),
                "open_since_s": round(time.time() - self._open_since, 1) if self._open_since else None,
            }


class CircuitBreakerAwareLLMClient:
    """
    A fallback-chain client in the style of ResilientLLMClient (one attempt
    per target, no per-target retries) with per-provider circuit breakers.

    When a provider's circuit is OPEN, requests skip it immediately
    and move to the next target - no wasted retry attempts.
    """

    def __init__(self, config: FallbackConfig):
        self.config = config
        self.breakers: dict[str, CircuitBreaker] = {
            "anthropic": CircuitBreaker("anthropic"),
            "openai": CircuitBreaker("openai"),
            "google": CircuitBreaker("google"),
        }
        self._anthropic = anthropic.Anthropic()
        self._openai = openai.OpenAI()

    def complete(
        self,
        messages: list[dict],
        max_tokens: int = 1024,
    ) -> dict:
        """
        Complete a request through the fallback chain, skipping providers
        with open circuit breakers.
        """
        failure_history: list[str] = []
        start = time.time()

        for target in self.config.targets:
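            # setdefault stores the breaker for unregistered providers; a throwaway
            # breaker created on every request would never accumulate failures
            breaker = self.breakers.setdefault(target.provider, CircuitBreaker(target.provider))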

            if not breaker.is_available():
                print(f"Circuit OPEN for {target.provider} - skipping to next target")
                failure_history.append(f"{target.provider}: circuit OPEN")
                continue

            try:
                start_call = time.time()
                if target.provider == "anthropic":
                    response = self._anthropic.messages.create(
                        model=target.model,
                        max_tokens=max_tokens,
                        messages=messages,
                        timeout=target.timeout_s,
                    )
                    text = response.content[0].text
                    input_tok = response.usage.input_tokens
                    output_tok = response.usage.output_tokens
                else:
                    response = self._openai.chat.completions.create(
                        model=target.model,
                        messages=messages,
                        max_tokens=max_tokens,
                        timeout=target.timeout_s,
                    )
                    text = response.choices[0].message.content
                    input_tok = response.usage.prompt_tokens
                    output_tok = response.usage.completion_tokens

                breaker.record_success()

                return {
                    "response": text,
                    "provider": target.provider,
                    "model": target.model,
                    "input_tokens": input_tok,
                    "output_tokens": output_tok,
                    "latency_ms": (time.time() - start) * 1000,
                    "fallback_triggered": len(failure_history) > 0,
                }

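            except Exception as e:
                # Only provider-health failures count toward opening the circuit.
                # A 4xx other than 429 (bad request, auth, content policy) reflects
                # the request or credentials, not the provider's availability.
                status = getattr(e, "status_code", None)
                if status is None or status == 429 or status >= 500:
                    breaker.record_failure()
                error_msg = f"{target.provider}/{target.model}: {type(e).__name__}: {str(e)[:80]}"
                failure_history.append(error_msg)
                print(f"Failed: {error_msg}")
                continue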

        raise RuntimeError(
            f"All providers unavailable (circuit breakers open or all targets failed).\n"
            + "\n".join(f"  {h}" for h in failure_history)
        )

    def status(self) -> list[dict]:
        """Return circuit breaker status for all providers."""
        return [b.status() for b in self.breakers.values()]
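
In the HALF_OPEN state, `is_available` above returns True for every caller, so a burst of traffic can reach a provider that is still recovering. If you want to admit only one probe at a time, one option is a small gate checked before the request is sent. This is a sketch of that idea only: `ProbeGate`, `try_acquire`, and `release` are hypothetical names, not part of the breaker above.

```python
import threading


class ProbeGate:
    """Admits at most one in-flight probe while a circuit is half-open."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._probe_in_flight = False

    def try_acquire(self) -> bool:
        """Return True for the single caller allowed to probe, False for the rest."""
        with self._lock:
            if self._probe_in_flight:
                return False
            self._probe_in_flight = True
            return True

    def release(self) -> None:
        """Call when the probe finishes, whether it succeeded or failed."""
        with self._lock:
            self._probe_in_flight = False


gate = ProbeGate()
first = gate.try_acquire()    # this request becomes the probe
second = gate.try_acquire()   # a concurrent request is turned away
gate.release()                # the probe has finished
third = gate.try_acquire()    # the next probe may proceed
print(first, second, third)
```

Callers that are turned away would be routed to the next target in the chain, exactly as they are when the circuit is OPEN.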

Testing Your Fallback Chain

Fallback chains that are never tested will fail in unexpected ways under real conditions. Add chaos testing to your development workflow.

import unittest
from unittest.mock import patch, MagicMock


class FallbackChainTest(unittest.TestCase):
    """
    Tests for the fallback chain behavior.
    Uses mocking to simulate provider failures without real API calls.
    """

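    def setUp(self):
        # Dummy keys let the SDK clients construct without real credentials;
        # _call is mocked in every test, so no request is ever sent
        env = patch.dict("os.environ", {"ANTHROPIC_API_KEY": "test", "OPENAI_API_KEY": "test"})
        env.start()
        self.addCleanup(env.stop)
        # Skip real backoff sleeps so the retry paths run instantly
        sleeper = patch("time.sleep")
        sleeper.start()
        self.addCleanup(sleeper.stop)

        self.config = PRODUCTION_FALLBACK_CONFIG
        self.client = ResilientLLMClient(self.config)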

    @patch.object(ResilientLLMClient, '_call')
    def test_primary_success_no_fallback(self, mock_call):
        """Happy path: primary succeeds on first try."""
        mock_call.return_value = ("Primary response", 100, 50)
        result = self.client.complete(
            [{"role": "user", "content": "Test"}], max_tokens=100
        )
        self.assertEqual(result.response, "Primary response")
        self.assertFalse(result.fallback_triggered)
        self.assertEqual(mock_call.call_count, 1)

    @patch.object(ResilientLLMClient, '_call')
    def test_primary_rate_limited_falls_back(self, mock_call):
        """Primary returns 429, fallback to second target."""
        rate_limit_error = anthropic.RateLimitError(
            message="Rate limit exceeded",
            response=MagicMock(status_code=429),
            body={}
        )
        mock_call.side_effect = [
            rate_limit_error, rate_limit_error, rate_limit_error,  # primary retries
            ("Fallback response", 80, 40),                         # first fallback succeeds
        ]
        result = self.client.complete(
            [{"role": "user", "content": "Test"}], max_tokens=100
        )
        self.assertTrue(result.fallback_triggered)
        self.assertEqual(result.response, "Fallback response")

    @patch.object(ResilientLLMClient, '_call')
    def test_content_policy_never_retried(self, mock_call):
        """Content policy violation should raise immediately, no retries."""
        content_policy_error = anthropic.BadRequestError(
            message="Content policy violation",
            response=MagicMock(status_code=400),
            body={"error": {"type": "invalid_request_error",
                            "message": "content_filter triggered"}}
        )
        mock_call.side_effect = content_policy_error

        with self.assertRaises(RuntimeError) as ctx:
            self.client.complete([{"role": "user", "content": "Test"}])

        self.assertIn("content_policy", str(ctx.exception).lower())
        self.assertEqual(mock_call.call_count, 1)   # No retries

    @patch.object(ResilientLLMClient, '_call')
    def test_all_targets_exhausted_raises(self, mock_call):
        """When all fallback targets fail, raise RuntimeError with history."""
        server_error = anthropic.InternalServerError(
            message="Server error",
            response=MagicMock(status_code=500),
            body={}
        )
        mock_call.side_effect = server_error

        with self.assertRaises(RuntimeError) as ctx:
            self.client.complete([{"role": "user", "content": "Test"}])

        self.assertIn("exhausted", str(ctx.exception).lower())
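
The unit tests above script failures one call at a time. For chaos-style testing, a complementary approach is to wrap a target so that a configurable fraction of calls fail at random, then check that traffic still gets served. The sketch below is a self-contained illustration: `InjectedFault`, `with_faults`, and `call_with_fallback` are hypothetical helpers, the targets are plain callables rather than real provider clients, and the 0.4 failure rate simply mirrors the partial outage in the opening story.

```python
import random
from typing import Callable, TypeVar

T = TypeVar("T")


class InjectedFault(Exception):
    """Raised by the wrapper in place of a real provider error."""


def with_faults(func: Callable[[], T], failure_rate: float,
                rng: random.Random) -> Callable[[], T]:
    """Wrap `func` so that roughly `failure_rate` of calls raise InjectedFault."""
    def wrapper() -> T:
        if rng.random() < failure_rate:
            raise InjectedFault("simulated provider failure")
        return func()
    return wrapper


def call_with_fallback(targets: list[Callable[[], T]]) -> T:
    """Try each target in order and return the first success."""
    for target in targets:
        try:
            return target()
        except InjectedFault:
            continue  # move on to the next target
    raise RuntimeError(f"all {len(targets)} targets failed")


rng = random.Random(7)  # seeded so the run is repeatable
primary = with_faults(lambda: "primary", failure_rate=0.4, rng=rng)
backup = with_faults(lambda: "backup", failure_rate=0.0, rng=rng)

results = [call_with_fallback([primary, backup]) for _ in range(1000)]
print(f"served by primary: {results.count('primary')}, "
      f"by backup: {results.count('backup')}")
```

Raising the backup's failure rate above zero is a quick way to see how often the whole chain is exhausted under a correlated outage.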

Fallback Configuration: Trade-offs by Use Case

Different applications need different fallback strategies. There is no universal correct configuration.

Use case	Recommended fallback	Rationale
Real-time user chat	Claude Sonnet -> GPT-4o -> Claude Haiku	Prioritize availability; quality step-down acceptable
Legal document analysis	Claude Sonnet -> Claude Opus -> No fallback	Quality non-negotiable; never degrade to weaker model
Batch summarization	Claude Haiku -> GPT-4o-mini -> Queue	Cost priority; bulk jobs can queue and wait
Code generation	Claude Sonnet -> Claude Sonnet (different key) -> GPT-4o	Multi-key rotation before cross-provider fallback
Context length overflow	Base model -> Large context model only	Context length error needs a different model, not retry

Production Engineering Notes

:::tip Log every fallback trigger with structured metadata
When a fallback fires, emit a structured log event with provider, model, failure_type, status_code, attempt_number, and user_id. Aggregate these logs in your observability stack to track fallback frequency per provider over time. A gradual increase in fallback frequency is the earliest warning signal of a provider reliability degradation - often visible days before it becomes an incident.
:::
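
A minimal sketch of such a log event, using only the standard library. The field names come from the tip above; the `event` key, the logger name, and the helper names `fallback_event` and `log_fallback` are assumptions for this example, and a real deployment would likely use whatever structured-logging setup its observability stack expects.

```python
import json
import logging
from typing import Optional

logger = logging.getLogger("llm.fallback")  # logger name is an arbitrary choice


def fallback_event(provider: str, model: str, failure_type: str,
                   status_code: Optional[int], attempt_number: int,
                   user_id: str) -> str:
    """Serialize one fallback trigger as a single JSON log line."""
    return json.dumps({
        "event": "llm_fallback",
        "provider": provider,
        "model": model,
        "failure_type": failure_type,
        "status_code": status_code,      # None for timeouts and connection errors
        "attempt_number": attempt_number,
        "user_id": user_id,
    }, sort_keys=True)


def log_fallback(**fields) -> None:
    """Emit the event at WARNING so it stands out from routine request logs."""
    logger.warning(fallback_event(**fields))


log_fallback(provider="anthropic", model="claude-sonnet-4-6",
             failure_type="rate_limit", status_code=429,
             attempt_number=2, user_id="user-123")
```

One JSON object per line keeps the events easy to aggregate by `provider` and `failure_type` when tracking fallback frequency over time.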

:::warning Always respect the Retry-After header When Anthropic or OpenAI returns a 429, the response typically includes a Retry-After header specifying how long to wait before the rate limit window resets. Retrying before this time elapses wastes a request and may itself be rate-limited. Your backoff logic must check for this header and, when it is present, honor it instead of the computed exponential backoff delay. :::
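
One way to honor the header is sketched below. It assumes the response headers are available as a mapping with lower-case keys (or a case-insensitive mapping), and that the value is either a number of seconds or an HTTP date, the two forms the HTTP specification allows. The function names are hypothetical.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_retry_after(value, now=None):
    """Return the Retry-After wait in seconds, or None if absent or unparseable."""
    if value is None:
        return None
    value = value.strip()
    try:
        return max(0.0, float(value))        # delta-seconds form, e.g. "7"
    except ValueError:
        pass
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


def wait_before_retry(headers, backoff_delay):
    """Prefer the provider's hint; use the computed backoff only when there is none."""
    hinted = parse_retry_after(headers.get("retry-after"))
    return hinted if hinted is not None else backoff_delay
```

If the hinted wait is longer than the request's latency budget, that is a signal to fall back or queue rather than to sleep.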

:::danger Content policy violations must never be retried A request that violates a provider's content policy will be rejected on every subsequent attempt - it is the content, not the load, that is the problem. Retrying burns quota, increases latency, and risks triggering additional rate limiting. Classify content policy failures immediately and surface them to the caller with a clear error. Log them separately for security review - they may indicate a prompt injection attempt. :::

:::info Circuit breaker state must be shared across replicas Circuit breaker state stored in process memory is not shared across replicas. If you run three API service instances, each maintains its own circuit state. Instance A may have opened its circuit for Anthropic (because it has seen 5 failures) while Instance B's circuit stays closed and keeps routing requests to the failing provider. Store circuit breaker state in Redis using atomic operations (INCR, SET with NX) to ensure all replicas have a consistent view of provider health. :::

Common Mistakes

Mistake 1: Catching broad exception types for retry decisions

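Catching Exception and retrying everything is the most common mistake in LLM retry logic. A ValueError or TypeError from malformed request construction is a bug in your own code and should not be retried; only specific provider API errors should. (A bare except: is worse still, because it also swallows KeyboardInterrupt and SystemExit.) Always catch typed exceptions (anthropic.RateLimitError, anthropic.InternalServerError, anthropic.APIConnectionError) and handle each type explicitly. Note that anthropic.APIStatusError is itself a broad base class covering every 4xx and 5xx response, so inspect its status_code before deciding to retry.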

Mistake 2: Retrying 400 errors without inspecting the error subtype

Not all 400 errors are equivalent. A context_length_exceeded 400 needs a different model, not a retry. A content_policy 400 must never be retried. A malformed request 400 is a programming error. Code that catches all 400 errors and retries them wastes quota and sometimes triggers additional rate limiting from repeated policy violations.

Mistake 3: No fallback for auth errors

Auth errors (401/403) need human intervention, not automated retry. Many teams configure an alert on auth errors but forget to configure a fallback. The result: when an API key expires or is accidentally revoked, the entire feature goes down. An auth error on the primary provider should both page the on-call engineer and trigger a fallback to a different provider that uses a different credential.

Mistake 4: Setting max_retries too high on rate limit errors

Retrying a 429 three times wastes quota on attempts that will keep failing until the rate limit window resets. The correct pattern: retry once with a short wait, then fall back to a different provider or queue the request. Only if no fallback is available should you retry multiple times with backoff. Three retries with exponential backoff can add 15 or more seconds of wall-clock time once Retry-After waits and per-request latency are included - unacceptable for real-time user-facing features.

Interview Q&A

Q: What is the difference between retry and fallback in the context of LLM failures?

Retry means attempting the same provider and model again after a brief wait. It is appropriate for transient failures - a single request that times out due to a momentary network hiccup, or a rate limit that will clear in a few seconds. Fallback means abandoning the current provider entirely and routing to a different model or provider. It is appropriate for sustained failures - a provider that is rate-limiting all requests due to quota exhaustion, or a server that is returning 5xx errors across multiple consecutive requests. A well-designed resilient client does both: retry the primary a bounded number of times with backoff, then fall back to the secondary if retries are exhausted. Retries and fallbacks operate at different time scales: retries address second-to-second transience; fallbacks address minute-to-hour degradation.

Q: How do you implement exponential backoff with jitter, and why is jitter essential?

Exponential backoff calculates the wait time as min(base * 2^attempt, max_delay): with base = 1s and attempt counted from 0, that gives 1s, 2s, 4s, 8s, up to the configured maximum. Without jitter, all clients that received a rate limit error at the same moment will retry at the same moment - a thundering herd that immediately triggers another rate limit. With full jitter (uniform random in [0, calculated_delay]), each client retries at a different time within the window. The load on the provider is spread out over the full window rather than concentrated at a single instant. The AWS Architecture Blog post on this topic ("Exponential Backoff and Jitter", 2015) found in simulation that full jitter completes the same work with fewer total calls than equal jitter and performs comparably to decorrelated jitter, which makes it a sound default for most workloads.
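
The formula above can be sketched in a few lines. The function names and the default values (`base=1.0`, `max_delay=30.0`) are assumptions for illustration, not settings taken from this lesson's client.

```python
import random


def backoff_ceiling(attempt, base=1.0, max_delay=30.0):
    """Upper bound for the wait before retry number `attempt` (0-based)."""
    return min(base * (2 ** attempt), max_delay)


def full_jitter_delay(attempt, base=1.0, max_delay=30.0, rng=None):
    """Full jitter: draw the actual wait uniformly from [0, ceiling]."""
    rng = rng or random
    return rng.uniform(0.0, backoff_ceiling(attempt, base, max_delay))


# The ceiling doubles each attempt until it hits the cap.
ceilings = [backoff_ceiling(a) for a in range(6)]
# ceilings == [1.0, 2.0, 4.0, 8.0, 16.0, 30.0]
```

Passing a seeded `random.Random` as `rng` makes the delays reproducible in tests, while production callers can leave it unset.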

Q: What is a circuit breaker and how does it differ from retry with backoff?

Retry with backoff is request-scoped: it controls what happens when a single request fails. The circuit breaker is system-scoped: it tracks the health of a provider over a rolling time window and stops routing any requests to a failing provider. When too many failures occur within the window, the circuit "opens." In the open state, requests skip the failing provider immediately - no wasted attempt, no added latency. After a recovery timeout, the circuit enters "half-open" and allows one probe request. If the probe succeeds, the circuit closes and normal routing resumes. The two mechanisms are complementary: use backoff for transient single-request failures (the provider is fine, but that one request got unlucky); use circuit breakers for sustained provider degradation (the provider is actually impaired).
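
The state machine described above can be sketched as a small in-process class. This is an illustration under stated assumptions, not the lesson's implementation: the class name, parameter names, and default thresholds are hypothetical, the clock is injectable for testing, and the code is not thread-safe.

```python
import time
from collections import deque


class CircuitBreaker:
    """In-process breaker with closed / open / half_open states."""

    def __init__(self, failure_threshold=5, window_s=60.0,
                 recovery_timeout_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.window_s = window_s
        self.recovery_timeout_s = recovery_timeout_s
        self.clock = clock
        self.state = "closed"
        self.opened_at = None
        self.failures = deque()  # timestamps of failures inside the window

    def allow_request(self):
        """Return True if a request may be sent to the provider right now."""
        if self.state == "open":
            if self.clock() - self.opened_at >= self.recovery_timeout_s:
                self.state = "half_open"  # let exactly one probe through
                return True
            return False
        if self.state == "half_open":
            return False                  # a probe is already in flight
        return True

    def record_success(self):
        """A successful probe closes the circuit; other successes change nothing."""
        if self.state == "half_open":
            self.state = "closed"
            self.opened_at = None
            self.failures.clear()

    def record_failure(self):
        now = self.clock()
        if self.state == "half_open":
            self._open(now)               # probe failed: reopen immediately
            return
        self.failures.append(now)
        while self.failures and now - self.failures[0] > self.window_s:
            self.failures.popleft()       # drop failures outside the window
        if len(self.failures) >= self.failure_threshold:
            self._open(now)

    def _open(self, now):
        self.state = "open"
        self.opened_at = now
        self.failures.clear()
```

A caller checks `allow_request()` before each provider call, skips straight to the next fallback target when it returns False, and reports the outcome with `record_success()` or `record_failure()`.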

Q: Which HTTP error codes should trigger retry, and which should never be retried?

Retry: 429 (rate limit - wait per Retry-After header), 500 (internal server error - often transient), 502 (bad gateway), 503 (service unavailable), 504 (gateway timeout). Connection errors and timeouts also warrant retry. Never retry: 401 (invalid API key - alert on-call), 403 (permission denied - alert on-call), 400 with context_length_exceeded (the request is too long - fall back to a larger-context model instead), 400 with content_policy (the content itself violates policy - no model will accept it). The critical insight: apart from 429, 4xx errors are client errors where the request itself is wrong; retrying sends the same wrong request again. 5xx errors are server errors where the problem is usually transient, so retry makes sense.
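
The rules above can be expressed as a single classification function. This is a sketch: the `Action` names and the function name are hypothetical, and the message markers are the provider-specific strings this lesson mentions, which may change over time.

```python
from enum import Enum


class Action(Enum):
    RETRY = "retry"
    FALLBACK_LARGER_CONTEXT = "fallback_larger_context"
    FAIL_FAST = "fail_fast"
    ALERT = "alert"


RETRYABLE_STATUS = {429, 500, 502, 503, 504}
CONTEXT_MARKERS = ("context_length_exceeded", "too long", "too many tokens")
POLICY_MARKERS = ("content_policy",)


def classify_failure(status_code, message=""):
    """Map a failed call to an action.

    status_code is None for connection errors and timeouts.
    """
    text = message.lower()
    if status_code is None or status_code in RETRYABLE_STATUS:
        return Action.RETRY
    if status_code in (401, 403):
        return Action.ALERT                  # needs a human, not a retry
    if status_code == 400:
        if any(marker in text for marker in POLICY_MARKERS):
            return Action.FAIL_FAST          # never retried
        if any(marker in text for marker in CONTEXT_MARKERS):
            return Action.FALLBACK_LARGER_CONTEXT
    return Action.FAIL_FAST                  # any other client error
```

Keeping the decision in one function means the retry loop, the fallback chain, and the tests all share the same definition of "retryable".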

Q: How do you handle a context length exceeded error differently from other 400 errors?

A context-length error means the input is too long for the model's context window - not that the request is malformed in any other way. Retrying the same model will always fail identically. The correct response is to fall back to a model with a strictly larger context window, not merely to another model from the same provider or to an equivalent model from a different provider. Design a separate fallback path for context-length errors: for example, if the primary is GPT-4o (128k context), the context-length fallback can be claude-sonnet-4-6 (200k). A fallback from claude-sonnet-4-6 (200k) to claude-opus-4-6 (200k) or GPT-4o (128k) does not help, because the same input will overflow those windows too. Detect context-length errors by checking both the HTTP status code (400) and the error message string for provider-specific indicators (context_length_exceeded for OpenAI, too long or too many tokens for Anthropic).

Q: How would you ensure circuit breaker state is consistent across multiple API service replicas?

Store circuit breaker state in Redis using atomic operations. Key structure: cb:{provider}:failures (sorted set of failure timestamps for rolling window), cb:{provider}:state (string: "closed"/"open"/"half_open"), cb:{provider}:open_since (unix timestamp). Use a Lua script to atomically check the failure count, compare against the threshold, and set the state - all in one Redis operation. This prevents the race condition where two replicas simultaneously observe the threshold being crossed and both try to write "open." Set a TTL on the state keys equal to the recovery timeout so that circuit state automatically resets if all replicas restart (preventing a permanently open circuit after a cluster restart).
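
A sketch of the atomic failure-recording step follows. It has not been run against a live Redis here, and it covers only the closed-to-open transition. It assumes a client object exposing redis-py's `eval(script, numkeys, *keys_and_args)` method; the function name and default values are hypothetical, while the key names follow the structure described above.

```python
import time
import uuid

# The script runs atomically inside Redis: record the failure, trim the
# rolling window, and open the circuit if the threshold is reached.
RECORD_FAILURE_LUA = """
local failures_key = KEYS[1]
local state_key    = KEYS[2]
local open_key     = KEYS[3]
local now       = tonumber(ARGV[1])
local window    = tonumber(ARGV[2])
local threshold = tonumber(ARGV[3])

redis.call('ZADD', failures_key, ARGV[1], ARGV[5])
redis.call('ZREMRANGEBYSCORE', failures_key, '-inf', now - window)
redis.call('EXPIRE', failures_key, ARGV[2])

if redis.call('ZCARD', failures_key) >= threshold then
    redis.call('SET', state_key, 'open', 'EX', ARGV[4])
    redis.call('SET', open_key, ARGV[1], 'EX', ARGV[4])
    return 1
end
return 0
"""


def record_failure(client, provider, window_s=60, threshold=5, recovery_s=30, now=None):
    """Record one failure; return True if the circuit is open afterwards."""
    now = time.time() if now is None else now
    keys = [
        f"cb:{provider}:failures",    # sorted set of failure timestamps
        f"cb:{provider}:state",       # "open" while the circuit is open
        f"cb:{provider}:open_since",  # unix timestamp of the transition
    ]
    # A unique member keeps two failures in the same instant from colliding.
    args = [now, window_s, threshold, recovery_s, uuid.uuid4().hex]
    return bool(client.eval(RECORD_FAILURE_LUA, len(keys), *keys, *args))
```

Because the state key carries a TTL equal to the recovery timeout, it disappears on its own; a replica that finds no state key treats the circuit as closed.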

Fallback Configuration Reference: By Use Case

Different use cases require different fallback configurations. This reference table summarizes the recommended settings for common production scenarios.

Use Case	Primary	Fallback Chain	Max Retries	Strategy
Real-time user chat	claude-sonnet-4-6	claude-haiku-4-5-20251001, gpt-4o-mini	1	Fast fail, cheap fallback
Document analysis	claude-sonnet-4-6	claude-opus-4-6 (for long context)	2	Quality-preserving
Code generation	claude-sonnet-4-6	gpt-4o	2	Different model family
Batch classification	claude-haiku-4-5-20251001	gpt-4o-mini	3	Retry-heavy, fallback as last resort
Legal/compliance	claude-opus-4-6	claude-sonnet-4-6	2	Grade down cautiously
Embeddings	text-embedding-3-small	text-embedding-ada-002	3	Retry within provider
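
Rows like these can be kept as data instead of being hard-coded in each service. The sketch below copies three rows from the table; the `ChainConfig` class, the dictionary keys, and the helper name are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChainConfig:
    primary: str
    fallbacks: tuple[str, ...]
    max_retries: int


# Rows copied from the table above; the dictionary keys are hypothetical names.
CHAINS = {
    "realtime_chat": ChainConfig(
        "claude-sonnet-4-6", ("claude-haiku-4-5-20251001", "gpt-4o-mini"), 1
    ),
    "code_generation": ChainConfig("claude-sonnet-4-6", ("gpt-4o",), 2),
    "embeddings": ChainConfig(
        "text-embedding-3-small", ("text-embedding-ada-002",), 3
    ),
}


def models_in_order(use_case):
    """Return the primary followed by its fallbacks for one use case."""
    cfg = CHAINS[use_case]
    return [cfg.primary, *cfg.fallbacks]
```

Keeping the chains in one mapping makes it harder for individual services to drift into inconsistent retry and fallback behavior.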

Testing Fallback Chains

Fallback chains that are never tested will fail under real conditions in unexpected ways. Add a fallback validation test suite to your CI pipeline.

import unittest
from unittest.mock import patch, MagicMock
import anthropic


class FallbackChainTest(unittest.TestCase):
    """
    Test the fallback chain under simulated provider failures.

    These tests verify that:
    1. Successful primary calls don't trigger fallback
    2. Rate limit errors trigger fallback to the next provider
    3. All providers failing raises RuntimeError
    4. Non-retryable 4xx client errors (such as 400) are not retried on fallback providers
    """

    def setUp(self):
        from dataclasses import dataclass

        @dataclass
        class FallbackTarget:
            provider: str
            model: str
            api_key: str
            max_retries: int = 2
            initial_delay_s: float = 1.0

        @dataclass
        class FallbackConfig:
            targets: list[FallbackTarget]

        self.FallbackTarget = FallbackTarget
        self.FallbackConfig = FallbackConfig

    @patch("anthropic.Anthropic")
    def test_primary_success_no_fallback(self, mock_anthropic_class):
        """When primary succeeds, fallback should not be called."""
        mock_client = MagicMock()
        mock_response = MagicMock()
        mock_response.content = [MagicMock(text="Primary response")]
        mock_response.usage.input_tokens = 100
        mock_response.usage.output_tokens = 50
        mock_client.messages.create.return_value = mock_response
        mock_anthropic_class.return_value = mock_client

        # Simulate: primary call returns successfully on first attempt
        self.assertEqual(mock_client.messages.create.call_count, 0)
        mock_client.messages.create()
        self.assertEqual(mock_client.messages.create.call_count, 1)

    def test_rate_limit_triggers_fallback(self):
        """Rate limit errors on primary should trigger fallback."""
        call_count = {"primary": 0, "fallback": 0}

        def primary_call(*args, **kwargs):
            call_count["primary"] += 1
            raise anthropic.RateLimitError(
                message="Rate limit exceeded",
                response=MagicMock(status_code=429),
                body={},
            )

        def fallback_call(*args, **kwargs):
            call_count["fallback"] += 1
            mock_response = MagicMock()
            mock_response.content = [MagicMock(text="Fallback response")]
            mock_response.usage.input_tokens = 100
            mock_response.usage.output_tokens = 50
            return mock_response

        # Verify the fallback is reached after primary fails
        try:
            primary_call()
        except anthropic.RateLimitError:
            fallback_call()

        self.assertEqual(call_count["primary"], 1)
        self.assertEqual(call_count["fallback"], 1)

    def test_all_providers_fail_raises_error(self):
        """When all providers fail, RuntimeError should be raised."""
        def always_fail(*args, **kwargs):
            raise anthropic.APIStatusError(
                message="Server error",
                response=MagicMock(status_code=500),
                body={},
            )

        errors_seen = []
        try:
            always_fail()
        except anthropic.APIStatusError as e:
            errors_seen.append(e)

        try:
            always_fail()
        except anthropic.APIStatusError as e:
            errors_seen.append(e)

        self.assertEqual(len(errors_seen), 2)
        # In production code: raise RuntimeError("All providers failed")
        # This test verifies that 2 failures would trigger that condition

    def test_4xx_not_retried(self):
        """A non-retryable client error (400) should not be sent to another provider."""
        retry_count = {"count": 0}
        tried_fallback = False

        def invalid_request(*args, **kwargs):
            retry_count["count"] += 1
            raise anthropic.APIStatusError(
                message="Bad request",
                response=MagicMock(status_code=400),
                body={},
            )

        try:
            invalid_request()
        except anthropic.APIStatusError as e:
            # Only server errors (5xx) move on to the next provider;
            # a 400 is surfaced to the caller instead.
            if e.status_code >= 500:
                tried_fallback = True

        self.assertFalse(tried_fallback)
        self.assertEqual(retry_count["count"], 1)  # Called exactly once


if __name__ == "__main__":
    unittest.main()
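
The tests above simulate the control flow inline. To exercise real chain logic without network access, the same scenarios can be run against a small provider-agnostic runner. The sketch below is an assumption-laden illustration: `RetryableError`, `FatalError`, and `run_chain` are hypothetical names, not part of the anthropic SDK or of the client built earlier in this lesson.

```python
class RetryableError(Exception):
    """Hypothetical stand-in for 429, 5xx, and timeout failures."""


class FatalError(Exception):
    """Hypothetical stand-in for non-retryable 4xx failures."""


def run_chain(targets, call, max_retries=1, sleep=lambda seconds: None):
    """Try each target in order; retry retryable errors, then move on.

    `call(target)` performs one request. A FatalError propagates
    immediately without retrying or trying the next target.
    """
    history = []
    for target in targets:
        for attempt in range(max_retries + 1):
            try:
                return call(target)
            except RetryableError as exc:
                history.append((target, attempt, str(exc)))
                if attempt < max_retries:
                    sleep(2 ** attempt)  # placeholder for jittered backoff
    raise RuntimeError(f"All fallback targets exhausted: {history}")
```

Injecting `sleep` keeps the tests fast: the default is a no-op, and production code can pass `time.sleep` or a jittered wrapper.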
