

Model Fallback and Retry

When the Primary Goes Down

It was a Friday afternoon when the first alerts fired. The AI-powered code review service - a critical part of the developer platform used by 12,000 engineers - had started returning 503 errors. The on-call engineer pulled up the dashboard: Anthropic was experiencing a partial service degradation in one region. Roughly 40% of API requests to Claude were timing out, which was enough to make the service appear unreliable.

The engineering team had thought they had a fallback configured. The truth was more complicated. Service A had a try/except block that caught anthropic.APIStatusError and then called OpenAI - but it was also catching anthropic.BadRequestError (400 responses for malformed requests), which it was incorrectly retrying. Service B retried the same Claude endpoint three times before giving up with no fallback at all. Service C had no error handling whatsoever - it propagated the exception directly to the user as an HTTP 500. By the time the incident resolved, Service C had been completely down, Service B had been down after its three retries were exhausted, and Service A had partially worked but with incorrect retry behavior that was burning through rate limit quota retrying 400 errors that would never succeed.

The post-mortem produced two directives: centralize fallback logic at the gateway layer, and define a precise taxonomy of which HTTP errors warrant retry, which warrant provider fallback, and which should never be retried under any circumstances.

The Failure Taxonomy: Knowing What Broke

The most important prerequisite to resilient retry logic is correctly classifying failures. Retrying the wrong errors wastes requests, increases latency, and consumes quota.

The five failure classes and their correct responses:

Rate limit (429): retry with exponential backoff. Always respect the Retry-After header if present - it tells you exactly when the provider's window resets. Retrying before the window resets burns another request for zero benefit.

Server error (5xx): retry once with a short delay, then fall back to a different provider. Server errors are frequently transient (a single bad instance, a brief capacity spike), but if the first retry also fails, the provider is likely experiencing a broader issue and fallback is appropriate.

Context length exceeded (400 + context_length_exceeded): do NOT retry the same model - it will fail identically every time. Instead, fall back to a model with a larger context window. This is a separate fallback path from the provider error path.

Content policy violation (400 + content_filter): do NOT retry under any circumstances. The request itself violates policy. Retrying wastes quota. Log the request for review - it may indicate a prompt injection attack or a legitimate edge case that needs a prompt engineering fix.

Auth error (401/403): do NOT retry. A bad API key or expired credential needs human intervention, not automated retry. Page the on-call engineer immediately.
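The five classes above can be condensed into a single decision function. The following is a minimal sketch, assuming the caller has already extracted the HTTP status and a provider error code string from the response; the `Action` enum and the `decide` function are hypothetical names for illustration, not part of any SDK.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    RETRY_WITH_BACKOFF = "retry_with_backoff"            # 429
    RETRY_THEN_FALLBACK = "retry_then_fallback"          # 5xx
    FALLBACK_LARGER_CONTEXT = "fallback_larger_context"  # 400 + context length
    FAIL_AND_LOG = "fail_and_log"                        # 400 + content policy
    FAIL_AND_PAGE = "fail_and_page"                      # 401/403
    FAIL = "fail"                                        # any other client error


def decide(status: int, error_code: Optional[str] = None) -> Action:
    """Map an HTTP status (plus an optional error code) to one of the responses above."""
    if status == 429:
        return Action.RETRY_WITH_BACKOFF
    if 500 <= status <= 599:
        return Action.RETRY_THEN_FALLBACK
    if status in (401, 403):
        return Action.FAIL_AND_PAGE
    if status == 400 and error_code == "context_length_exceeded":
        return Action.FALLBACK_LARGER_CONTEXT
    if status == 400 and error_code == "content_filter":
        return Action.FAIL_AND_LOG
    # Anything else (for example a malformed-request 400) is a client bug: never retry.
    return Action.FAIL
```

Keeping the decision in one pure function makes the taxonomy easy to unit-test and keeps individual services from drifting apart in how they treat the same error.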

Exponential Backoff with Jitter

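Exponential backoff is the foundation of all retry logic for transient failures. On its own, however, it causes a thundering herd problem: clients that fail at the same moment also retry at the same moment. Adding jitter spreads those retries out.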

```python
import random
import time
from typing import Callable, Optional, TypeVar

import anthropic

T = TypeVar("T")


def exponential_backoff_delay(
    attempt: int,
    base_delay_s: float = 1.0,
    max_delay_s: float = 60.0,
    jitter: bool = True,
) -> float:
    """
    Calculate the wait time before attempt N using full jitter.

    Without jitter (bad):
        attempt 0 -> 1.0s, attempt 1 -> 2.0s, attempt 2 -> 4.0s
        If 1000 clients all hit a rate limit at the same moment,
        they all retry at the same moment -> thundering herd -> another rate limit.

    With full jitter (good):
        attempt 0 -> random(0, 1.0), attempt 1 -> random(0, 2.0), ...
        Clients spread out their retries -> no thundering herd.

    The AWS blog post "Exponential Backoff and Jitter" (2015) describes this scheme.
    """
    # Exponential: 1s, 2s, 4s, 8s, ... capped at max_delay_s
    delay = min(base_delay_s * (2 ** attempt), max_delay_s)
    if jitter:
        # Full jitter: uniform random in [0, delay]
        delay = random.uniform(0.0, delay)
    return delay


def _retry_after_seconds(e: anthropic.APIStatusError) -> Optional[float]:
    """Read Retry-After (in seconds) from the error response headers, if present."""
    raw = e.response.headers.get("retry-after")
    try:
        return float(raw) if raw is not None else None
    except ValueError:
        return None  # HTTP-date form or garbage: use computed backoff instead


def retry_with_backoff(
    func: Callable[[], T],
    max_attempts: int = 4,
    retryable_status_codes: tuple[int, ...] = (429, 500, 502, 503, 504),
    non_retryable_status_codes: tuple[int, ...] = (400, 401, 403),
    base_delay_s: float = 1.0,
    max_delay_s: float = 60.0,
) -> T:
    """
    Retry a callable with exponential backoff + full jitter.

    - Retries on: rate limits (429), server errors (5xx), connection errors
    - Does NOT retry on: auth errors (401/403), bad requests (400)
    - Respects Retry-After header when present in rate limit responses
    - Raises the last exception after max_attempts are exhausted
    """
    last_exception: Optional[Exception] = None

    for attempt in range(max_attempts):
        try:
            return func()

        except anthropic.RateLimitError as e:
            last_exception = e
            if attempt == max_attempts - 1:
                raise

            # Provider tells us exactly when to retry - respect it.
            # The value lives in the response headers, not on the exception itself.
            retry_after = _retry_after_seconds(e)
            if retry_after is not None:
                delay = retry_after
                print(f"Rate limited. Provider says retry after {delay:.1f}s. "
                      f"(attempt {attempt + 1}/{max_attempts})")
            else:
                delay = exponential_backoff_delay(attempt, base_delay_s, max_delay_s)
                print(f"Rate limited. Backoff: {delay:.1f}s. "
                      f"(attempt {attempt + 1}/{max_attempts})")
            time.sleep(delay)

        except anthropic.APIStatusError as e:
            last_exception = e

            # Non-retryable client errors - fail immediately
            if e.status_code in non_retryable_status_codes:
                raise

            # For retryable server errors
            if e.status_code in retryable_status_codes:
                if attempt == max_attempts - 1:
                    raise
                delay = exponential_backoff_delay(attempt, base_delay_s, max_delay_s)
                print(f"Server error {e.status_code}. Backoff: {delay:.1f}s. "
                      f"(attempt {attempt + 1}/{max_attempts})")
                time.sleep(delay)
            else:
                raise  # Unknown status code - don't retry

        # The SDK wraps network failures in APIConnectionError (APITimeoutError is
        # a subclass); OSError covers the builtin TimeoutError and ConnectionError.
        except (anthropic.APIConnectionError, OSError) as e:
            last_exception = e
            if attempt == max_attempts - 1:
                raise
            delay = exponential_backoff_delay(attempt, base_delay_s, max_delay_s)
            print(f"Network error: {type(e).__name__}. Backoff: {delay:.1f}s. "
                  f"(attempt {attempt + 1}/{max_attempts})")
            time.sleep(delay)

    assert last_exception is not None
    raise last_exception


def demo_retry() -> None:
    """Demonstrate retry with backoff for a real Claude call."""
    # The SDK retries some errors itself by default; max_retries=0 disables
    # that so this wrapper owns the whole retry policy.
    client = anthropic.Anthropic(max_retries=0)

    def make_request() -> anthropic.types.Message:
        return client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=256,
            messages=[{"role": "user", "content": "What is gradient descent?"}],
        )

    response = retry_with_backoff(make_request, max_attempts=3)
    print(f"Success: {response.content[0].text[:100]}...")
```

Building a Full Fallback Chain

A fallback chain defines an ordered list of provider/model targets. The chain is worked through in order until one succeeds or all are exhausted. Each target has its own retry budget and timeout.

```python
import time
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional

import anthropic
import openai

# exponential_backoff_delay() is defined in the backoff listing above.


class FailureType(Enum):
    RATE_LIMIT = "rate_limit"
    SERVER_ERROR = "server_error"
    CONTEXT_LENGTH = "context_length"
    CONTENT_POLICY = "content_policy"
    TIMEOUT = "timeout"
    AUTH_ERROR = "auth_error"
    UNKNOWN = "unknown"


@dataclass
class FallbackTarget:
    provider: str  # "anthropic" | "openai"
    model: str
    max_retries: int = 2  # retries within this target before moving to next
    timeout_s: float = 30.0
    # Failure types that are retried with backoff on this target before moving on
    fallback_on: tuple[FailureType, ...] = (
        FailureType.RATE_LIMIT,
        FailureType.SERVER_ERROR,
        FailureType.TIMEOUT,
    )


@dataclass
class FallbackConfig:
    targets: list[FallbackTarget]
    # Failure types that are never retried anywhere in the chain
    never_retry: tuple[FailureType, ...] = (
        FailureType.CONTENT_POLICY,
        FailureType.AUTH_ERROR,
    )


@dataclass
class CompletionResult:
    response: str
    provider: str
    model: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    total_attempts: int
    fallback_triggered: bool
    failure_history: list[str] = field(default_factory=list)


class ResilientLLMClient:
    """
    Production LLM client with a configurable fallback chain.

    Architecture:
    - Typed failure classification (maps provider exceptions -> FailureType enum)
    - Per-target retry with exponential backoff + jitter
    - Chain progression on non-recoverable failures
    - Never-retry list for content policy and auth errors
    - Detailed failure history for post-incident debugging
    """

    def __init__(self, config: FallbackConfig):
        self.config = config
        self._anthropic = anthropic.Anthropic()
        self._openai = openai.OpenAI()

    # ─── Failure classification ──────────────────────────────────────────────

    def _classify_anthropic(self, e: Exception) -> FailureType:
        if isinstance(e, anthropic.RateLimitError):
            return FailureType.RATE_LIMIT
        # 401 and 403 both need a human to fix the credential
        if isinstance(e, (anthropic.AuthenticationError, anthropic.PermissionDeniedError)):
            return FailureType.AUTH_ERROR
        # The SDK raises its own timeout/connection types, not the builtins
        if isinstance(e, (anthropic.APIConnectionError, TimeoutError, ConnectionError)):
            return FailureType.TIMEOUT
        if isinstance(e, anthropic.APIStatusError):
            if e.status_code >= 500:
                return FailureType.SERVER_ERROR
            if e.status_code == 400:
                body = str(e).lower()
                # Match specific phrases; a bare "tokens" would also catch
                # unrelated 400s such as an invalid max_tokens value.
                if "context_length" in body or "too long" in body or "too many tokens" in body:
                    return FailureType.CONTEXT_LENGTH
                if "content" in body and ("filter" in body or "policy" in body or "safety" in body):
                    return FailureType.CONTENT_POLICY
        return FailureType.UNKNOWN

    def _classify_openai(self, e: Exception) -> FailureType:
        if isinstance(e, openai.RateLimitError):
            return FailureType.RATE_LIMIT
        if isinstance(e, (openai.AuthenticationError, openai.PermissionDeniedError)):
            return FailureType.AUTH_ERROR
        # APITimeoutError is a subclass of APIConnectionError
        if isinstance(e, openai.APIConnectionError):
            return FailureType.TIMEOUT
        if isinstance(e, openai.APIStatusError):
            if e.status_code >= 500:
                return FailureType.SERVER_ERROR
            if e.status_code == 400:
                body = str(e).lower()
                if "context_length" in body or "maximum context" in body:
                    return FailureType.CONTEXT_LENGTH
                if "content_policy" in body or "content_filter" in body:
                    return FailureType.CONTENT_POLICY
        return FailureType.UNKNOWN

    # ─── Provider calls ──────────────────────────────────────────────────────

    def _call(
        self, target: FallbackTarget, messages: list[dict], max_tokens: int
    ) -> tuple[str, int, int]:
        """Make a single API call. Returns (text, input_tokens, output_tokens)."""
        if target.provider == "anthropic":
            response = self._anthropic.messages.create(
                model=target.model,
                max_tokens=max_tokens,
                messages=messages,
                timeout=target.timeout_s,
            )
            return (
                response.content[0].text,
                response.usage.input_tokens,
                response.usage.output_tokens,
            )
        elif target.provider == "openai":
            response = self._openai.chat.completions.create(
                model=target.model,
                messages=messages,
                max_tokens=max_tokens,
                timeout=target.timeout_s,
            )
            return (
                response.choices[0].message.content or "",
                response.usage.prompt_tokens,
                response.usage.completion_tokens,
            )
        else:
            raise ValueError(f"Unknown provider: {target.provider}")

    # ─── Per-target execution with retry ─────────────────────────────────────

    def _try_target(
        self,
        target: FallbackTarget,
        messages: list[dict],
        max_tokens: int,
    ) -> tuple[Optional[tuple[str, int, int]], Optional[FailureType], str, int]:
        """
        Attempt a call to a single target, retrying up to target.max_retries times.
        Returns:
            (result, None, "", attempts) on success
            (None, failure_type, msg, attempts) on failure - caller decides whether to advance chain
        """
        last_failure: Optional[FailureType] = None
        last_msg = ""
        attempts = 0

        for attempt in range(target.max_retries + 1):
            attempts = attempt + 1
            try:
                result = self._call(target, messages, max_tokens)
                return result, None, "", attempts

            except Exception as e:
                classify = (
                    self._classify_anthropic
                    if target.provider == "anthropic"
                    else self._classify_openai
                )
                failure_type = classify(e)
                last_failure = failure_type
                last_msg = (
                    f"{target.provider}/{target.model} attempt {attempts}: "
                    f"{type(e).__name__}: {e}"
                )

                # Never-retry failures - propagate immediately, don't even try next target
                if failure_type in self.config.never_retry:
                    return None, failure_type, last_msg, attempts

                # A context-length error fails identically on the same model:
                # stop retrying this target and let the caller advance the chain.
                if failure_type is FailureType.CONTEXT_LENGTH:
                    return None, failure_type, last_msg, attempts

                # Back off before retrying failures listed in fallback_on
                if failure_type in target.fallback_on and attempt < target.max_retries:
                    delay = exponential_backoff_delay(attempt, base_delay_s=1.0)
                    print(f"  Retry {attempt + 1}/{target.max_retries} after {delay:.1f}s "
                          f"({failure_type.value})")
                    time.sleep(delay)
                # Any other failure type is retried immediately, up to max_retries

        return None, last_failure, last_msg, attempts

    # ─── Main completion method ───────────────────────────────────────────────

    def complete(
        self,
        messages: list[dict],
        max_tokens: int = 1024,
    ) -> CompletionResult:
        """
        Execute a completion through the fallback chain.

        Tries each target in order until one succeeds.
        Never-retry failures (content policy, auth) propagate immediately.
        All other failures advance the chain.
        """
        failure_history: list[str] = []
        total_attempts = 0
        start = time.time()

        for i, target in enumerate(self.config.targets):
            print(f"Trying target {i + 1}/{len(self.config.targets)}: "
                  f"{target.provider}/{target.model}")

            result, failure_type, error_msg, attempts = self._try_target(
                target, messages, max_tokens
            )
            # Count the calls actually made, not the full retry budget
            total_attempts += attempts

            if result is not None:
                text, input_tokens, output_tokens = result
                return CompletionResult(
                    response=text,
                    provider=target.provider,
                    model=target.model,
                    input_tokens=input_tokens,
                    output_tokens=output_tokens,
                    latency_ms=(time.time() - start) * 1000,
                    total_attempts=total_attempts,
                    fallback_triggered=(i > 0),
                    failure_history=failure_history,
                )

            failure_history.append(error_msg)
            print(f"  Target {i + 1} failed: {failure_type.value if failure_type else 'unknown'}")

            # Immediately propagate never-retry failures
            if failure_type in self.config.never_retry:
                raise RuntimeError(
                    f"Non-retryable failure ({failure_type.value}): {error_msg}"
                )

            if i < len(self.config.targets) - 1:
                next_target = self.config.targets[i + 1]
                print(f"  Advancing to {next_target.provider}/{next_target.model}")

        raise RuntimeError(
            f"All {len(self.config.targets)} fallback targets exhausted.\n"
            + "\n".join(f"  {h}" for h in failure_history)
        )


# ─────────────────────────────────────────────────────────────────────────────
# Production configuration
# ─────────────────────────────────────────────────────────────────────────────

PRODUCTION_FALLBACK_CONFIG = FallbackConfig(
    targets=[
        # Primary: Claude Sonnet - best quality
        FallbackTarget(
            provider="anthropic",
            model="claude-sonnet-4-6",
            max_retries=2,
            timeout_s=45.0,
            fallback_on=(FailureType.RATE_LIMIT, FailureType.SERVER_ERROR, FailureType.TIMEOUT),
        ),
        # Fallback 1: GPT-4o - reliable alternative, comparable quality
        FallbackTarget(
            provider="openai",
            model="gpt-4o",
            max_retries=1,
            timeout_s=30.0,
            fallback_on=(FailureType.RATE_LIMIT, FailureType.SERVER_ERROR, FailureType.TIMEOUT),
        ),
        # Fallback 2: Claude Haiku - fast and cheap last resort
        FallbackTarget(
            provider="anthropic",
            model="claude-haiku-4-5-20251001",
            max_retries=1,
            timeout_s=20.0,
            fallback_on=(FailureType.RATE_LIMIT, FailureType.SERVER_ERROR),
        ),
    ],
    never_retry=(FailureType.CONTENT_POLICY, FailureType.AUTH_ERROR),
)


def demo_fallback_client() -> None:
    client = ResilientLLMClient(PRODUCTION_FALLBACK_CONFIG)

    messages = [{"role": "user", "content": "Explain the CAP theorem in two concise paragraphs."}]
    result = client.complete(messages, max_tokens=300)

    print("\n=== Result ===")
    print(f"Provider: {result.provider}")
    print(f"Model: {result.model}")
    print(f"Fallback fired: {result.fallback_triggered}")
    print(f"Total attempts: {result.total_attempts}")
    print(f"Latency: {result.latency_ms:.0f}ms")
    print(f"Tokens: {result.input_tokens} in / {result.output_tokens} out")
    if result.failure_history:
        print("Failures before success:")
        for h in result.failure_history:
            print(f"  - {h[:100]}")
    print(f"\nResponse:\n{result.response[:400]}")


if __name__ == "__main__":
    demo_fallback_client()
```

Circuit Breaker: Stop Hammering a Dead Provider

When a provider is experiencing a sustained outage, retrying repeatedly wastes time and burns quota. A circuit breaker tracks provider health over a rolling time window and stops sending requests to a degraded provider until it shows signs of recovery.

```python
import threading
import time
from collections import deque
from dataclasses import dataclass
from enum import Enum
from typing import Optional

import anthropic
import openai

# FallbackConfig is defined in the fallback-chain listing above.


class CircuitState(Enum):
    CLOSED = "closed"        # Normal operation - let requests through
    OPEN = "open"            # Provider is failing - route around it
    HALF_OPEN = "half_open"  # Testing recovery - let probe requests through


@dataclass
class CircuitBreakerConfig:
    failure_threshold: int = 5        # Failures in window before opening
    window_seconds: float = 60.0      # Rolling window for counting failures
    recovery_timeout_s: float = 30.0  # Wait before probing (OPEN -> HALF_OPEN)
    success_threshold: int = 2        # Consecutive successes to close (HALF_OPEN -> CLOSED)


class CircuitBreaker:
    """
    Per-provider circuit breaker with three-state state machine.

    Thread-safe: all state mutations are inside a reentrant lock.
    For multi-replica deployments, share state via Redis instead of in-memory.
    """

    def __init__(self, name: str, config: Optional[CircuitBreakerConfig] = None):
        self.name = name
        # Build the default per instance; a default argument would be shared
        self.config = config or CircuitBreakerConfig()
        self._state = CircuitState.CLOSED
        self._failure_times: deque[float] = deque()
        self._open_since: Optional[float] = None
        self._half_open_successes: int = 0
        self._lock = threading.RLock()

    @property
    def state(self) -> CircuitState:
        with self._lock:
            # Check if recovery timeout has elapsed - transition to HALF_OPEN.
            # time.monotonic() is used for durations so clock changes cannot skew them.
            if (
                self._state == CircuitState.OPEN
                and self._open_since is not None
                and time.monotonic() - self._open_since >= self.config.recovery_timeout_s
            ):
                self._state = CircuitState.HALF_OPEN
                self._half_open_successes = 0
                print(f"Circuit [{self.name}]: OPEN -> HALF_OPEN (probing)")
            return self._state

    def is_available(self) -> bool:
        """Returns True if the circuit allows this request to proceed."""
        s = self.state
        return s in (CircuitState.CLOSED, CircuitState.HALF_OPEN)

    def record_success(self) -> None:
        """Call this when a request succeeds."""
        with self._lock:
            if self._state == CircuitState.HALF_OPEN:
                self._half_open_successes += 1
                if self._half_open_successes >= self.config.success_threshold:
                    self._state = CircuitState.CLOSED
                    self._failure_times.clear()
                    self._open_since = None
                    print(f"Circuit [{self.name}]: HALF_OPEN -> CLOSED (recovered)")

    def record_failure(self) -> None:
        """Call this when a request fails."""
        with self._lock:
            now = time.monotonic()

            if self._state == CircuitState.HALF_OPEN:
                # Probe failed - reopen the circuit
                self._state = CircuitState.OPEN
                self._open_since = now
                print(f"Circuit [{self.name}]: HALF_OPEN -> OPEN (probe failed)")
                return

            # Track failure timestamp in rolling window
            self._failure_times.append(now)
            cutoff = now - self.config.window_seconds
            while self._failure_times and self._failure_times[0] < cutoff:
                self._failure_times.popleft()

            # Open circuit once the threshold is reached
            if (self._state == CircuitState.CLOSED and
                    len(self._failure_times) >= self.config.failure_threshold):
                self._state = CircuitState.OPEN
                self._open_since = now
                print(f"Circuit [{self.name}]: CLOSED -> OPEN "
                      f"({len(self._failure_times)} failures in {self.config.window_seconds}s)")

    def status(self) -> dict:
        with self._lock:
            open_for = None
            if self._open_since is not None:
                open_for = round(time.monotonic() - self._open_since, 1)
            return {
                "provider": self.name,
                "state": self._state.value,
                "recent_failures": len(self._failure_times),
                "open_since_s": open_for,
            }


class CircuitBreakerAwareLLMClient:
    """
    Standalone variant of ResilientLLMClient with per-provider circuit breakers
    (it does not subclass it and does not retry within a target).

    When a provider's circuit is OPEN, requests skip it immediately
    and move to the next target - no wasted retry attempts.
    """

    def __init__(self, config: FallbackConfig):
        self.config = config
        self.breakers: dict[str, CircuitBreaker] = {
            "anthropic": CircuitBreaker("anthropic"),
            "openai": CircuitBreaker("openai"),
            "google": CircuitBreaker("google"),
        }
        self._anthropic = anthropic.Anthropic()
        self._openai = openai.OpenAI()

    @staticmethod
    def _is_provider_fault(e: Exception) -> bool:
        """Only outages should trip the breaker, not bad requests (4xx other than 429)."""
        status = getattr(e, "status_code", None)
        if status is None:
            return True  # timeouts and connection errors carry no status code
        return status == 429 or status >= 500

    def complete(
        self,
        messages: list[dict],
        max_tokens: int = 1024,
    ) -> dict:
        """
        Complete a request through the fallback chain, skipping providers
        with open circuit breakers.
        """
        failure_history: list[str] = []
        start = time.time()

        for target in self.config.targets:
            # setdefault keeps the breaker, so an unlisted provider still accumulates state
            breaker = self.breakers.setdefault(
                target.provider, CircuitBreaker(target.provider)
            )

            if not breaker.is_available():
                print(f"Circuit OPEN for {target.provider} - skipping to next target")
                failure_history.append(f"{target.provider}: circuit OPEN")
                continue

            try:
                if target.provider == "anthropic":
                    response = self._anthropic.messages.create(
                        model=target.model,
                        max_tokens=max_tokens,
                        messages=messages,
                        timeout=target.timeout_s,
                    )
                    text = response.content[0].text
                    input_tok = response.usage.input_tokens
                    output_tok = response.usage.output_tokens
                else:
                    response = self._openai.chat.completions.create(
                        model=target.model,
                        messages=messages,
                        max_tokens=max_tokens,
                        timeout=target.timeout_s,
                    )
                    text = response.choices[0].message.content
                    input_tok = response.usage.prompt_tokens
                    output_tok = response.usage.completion_tokens

                breaker.record_success()

                return {
                    "response": text,
                    "provider": target.provider,
                    "model": target.model,
                    "input_tokens": input_tok,
                    "output_tokens": output_tok,
                    "latency_ms": (time.time() - start) * 1000,
                    "fallback_triggered": len(failure_history) > 0,
                }

            except Exception as e:
                if self._is_provider_fault(e):
                    breaker.record_failure()
                error_msg = f"{target.provider}/{target.model}: {type(e).__name__}: {str(e)[:80]}"
                failure_history.append(error_msg)
                print(f"Failed: {error_msg}")
                continue

        raise RuntimeError(
            "All providers unavailable (circuit breakers open or all targets failed).\n"
            + "\n".join(f"  {h}" for h in failure_history)
        )

    def status(self) -> list[dict]:
        """Return circuit breaker status for all providers."""
        return [b.status() for b in self.breakers.values()]
```

Testing Your Fallback Chain

Fallback chains that are never tested will fail in unexpected ways under real conditions. Add chaos testing to your development workflow.

```python
import os
import unittest
from unittest.mock import MagicMock, patch

import anthropic

# ResilientLLMClient and PRODUCTION_FALLBACK_CONFIG come from the
# fallback-chain listing above.


class FallbackChainTest(unittest.TestCase):
    """
    Tests for the fallback chain behavior.
    Uses mocking to simulate provider failures without real API calls.
    """

    def setUp(self):
        # Dummy keys so the SDK clients can be constructed; no real call is made
        env = patch.dict(
            os.environ, {"ANTHROPIC_API_KEY": "test", "OPENAI_API_KEY": "test"}
        )
        env.start()
        self.addCleanup(env.stop)

        # Skip real backoff sleeps so the suite runs quickly
        sleep = patch("time.sleep")
        sleep.start()
        self.addCleanup(sleep.stop)

        self.config = PRODUCTION_FALLBACK_CONFIG
        self.client = ResilientLLMClient(self.config)

    @patch.object(ResilientLLMClient, '_call')
    def test_primary_success_no_fallback(self, mock_call):
        """Happy path: primary succeeds on first try."""
        mock_call.return_value = ("Primary response", 100, 50)
        result = self.client.complete(
            [{"role": "user", "content": "Test"}], max_tokens=100
        )
        self.assertEqual(result.response, "Primary response")
        self.assertFalse(result.fallback_triggered)
        self.assertEqual(mock_call.call_count, 1)

    @patch.object(ResilientLLMClient, '_call')
    def test_primary_rate_limited_falls_back(self, mock_call):
        """Primary returns 429, fallback to second target."""
        rate_limit_error = anthropic.RateLimitError(
            message="Rate limit exceeded",
            response=MagicMock(status_code=429),
            body={}
        )
        mock_call.side_effect = [
            rate_limit_error, rate_limit_error, rate_limit_error,  # primary: 1 call + 2 retries
            ("Fallback response", 80, 40),                         # first fallback succeeds
        ]
        result = self.client.complete(
            [{"role": "user", "content": "Test"}], max_tokens=100
        )
        self.assertTrue(result.fallback_triggered)
        self.assertEqual(result.response, "Fallback response")

    @patch.object(ResilientLLMClient, '_call')
    def test_content_policy_never_retried(self, mock_call):
        """Content policy violation should raise immediately, no retries."""
        content_policy_error = anthropic.BadRequestError(
            message="Content policy violation",
            response=MagicMock(status_code=400),
            body={"error": {"type": "invalid_request_error",
                            "message": "content_filter triggered"}}
        )
        mock_call.side_effect = content_policy_error

        with self.assertRaises(RuntimeError) as ctx:
            self.client.complete([{"role": "user", "content": "Test"}])

        self.assertIn("content_policy", str(ctx.exception).lower())
        self.assertEqual(mock_call.call_count, 1)  # No retries

    @patch.object(ResilientLLMClient, '_call')
    def test_all_targets_exhausted_raises(self, mock_call):
        """When all fallback targets fail, raise RuntimeError with history."""
        server_error = anthropic.InternalServerError(
            message="Server error",
            response=MagicMock(status_code=500),
            body={}
        )
        mock_call.side_effect = server_error

        with self.assertRaises(RuntimeError) as ctx:
            self.client.complete([{"role": "user", "content": "Test"}])

        self.assertIn("exhausted", str(ctx.exception).lower())
```

Fallback Configuration: Trade-offs by Use Case

Different applications need different fallback strategies. There is no universal correct configuration.

| Use case | Recommended fallback | Rationale |
| --- | --- | --- |
| Real-time user chat | Claude Sonnet -> GPT-4o -> Claude Haiku | Prioritize availability; quality step-down acceptable |
| Legal document analysis | Claude Sonnet -> Claude Opus -> No fallback | Quality non-negotiable; never degrade to weaker model |
| Batch summarization | Claude Haiku -> GPT-4o-mini -> Queue | Cost priority; bulk jobs can queue and wait |
| Code generation | Claude Sonnet -> GPT-4o -> Claude Sonnet (different key) | Multi-key rotation before cross-provider fallback |
| Context length overflow | Base model -> Large context model only | Context length error needs a different model, not retry |

Production Engineering Notes

:::tip Log every fallback trigger with structured metadata

When a fallback fires, emit a structured log event with provider, model, failure_type, status_code, attempt_number, and user_id. Aggregate these logs in your observability stack to track fallback frequency per provider over time. A gradual increase in fallback frequency is the earliest warning signal of a provider reliability degradation - often visible days before it becomes an incident.

:::
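One way to emit such an event is a single JSON line per fallback. This is a minimal sketch using only the standard library; the logger name `llm.fallback`, the `event` field, and the function names are illustrative assumptions, and the payload fields are the ones listed in the tip.

```python
import json
import logging
from typing import Optional

logger = logging.getLogger("llm.fallback")  # hypothetical logger name


def fallback_event(
    provider: str,
    model: str,
    failure_type: str,
    status_code: Optional[int],
    attempt_number: int,
    user_id: str,
) -> str:
    """Serialize one fallback trigger as a single JSON log line."""
    return json.dumps(
        {
            "event": "llm_fallback",
            "provider": provider,
            "model": model,
            "failure_type": failure_type,
            "status_code": status_code,  # None for timeouts and connection errors
            "attempt_number": attempt_number,
            "user_id": user_id,
        },
        sort_keys=True,
    )


def log_fallback(**fields) -> None:
    """Emit the event at WARNING level so it is easy to filter and aggregate."""
    logger.warning(fallback_event(**fields))
```

One JSON object per line keeps the events easy to parse downstream, so fallback counts can be grouped by `provider` and `failure_type` without regex parsing.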

:::warning Always respect the Retry-After header

When Anthropic or OpenAI returns a 429, the response typically includes a Retry-After header specifying how long to wait before the rate limit window resets. Retrying before this time elapses wastes a request and may itself be rate-limited. Your backoff logic must check for and honor this header first, and apply the exponential backoff formula only when the header is absent.

:::
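HTTP allows Retry-After to carry either a number of seconds or an HTTP date, so a parser should tolerate both. This is a minimal sketch, assuming the headers arrive as a mapping with lower-cased keys (or a case-insensitive mapping such as the one an HTTP client exposes); `retry_after_seconds` is a hypothetical helper name.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional


def retry_after_seconds(
    headers: Mapping[str, str], now: Optional[datetime] = None
) -> Optional[float]:
    """Return the wait in seconds from Retry-After, or None if absent or unparseable."""
    raw = headers.get("retry-after")
    if raw is None:
        return None
    raw = raw.strip()
    try:
        return max(0.0, float(raw))  # delta-seconds form, e.g. "12"
    except ValueError:
        pass
    try:
        when = parsedate_to_datetime(raw)  # HTTP-date form
    except (TypeError, ValueError):
        return None  # caller falls back to computed backoff
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())
```

Returning `None` rather than raising lets the caller fall through to exponential backoff whenever the header is missing or malformed.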

:::danger Content policy violations must never be retried

A request that violates a provider's content policy will be rejected on every subsequent attempt - it is the content, not the load, that is the problem. Retrying burns quota, increases latency, and risks triggering additional rate limiting. Classify content policy failures immediately and surface them to the caller with a clear error. Log them separately for security review - they may indicate a prompt injection attempt.

:::

:::info Circuit breaker state must be shared across replicas

Circuit breaker state stored in process memory is not shared across replicas. If you run three API service instances, each maintains its own circuit state. Instance A may have opened its circuit for Anthropic (because it has seen 5 failures) while Instance B's circuit stays closed and keeps routing requests to the failing provider. Store circuit breaker state in Redis using atomic operations (INCR, SET with NX) to ensure all replicas have a consistent view of provider health.

:::

Common Mistakes

Mistake 1: Catching broad exception types for retry decisions

Catching Exception and retrying everything it matches is the most common mistake in LLM retry logic. A ValueError or TypeError raised while constructing a malformed request should not be retried - only specific provider API errors should. Always catch typed exceptions (anthropic.RateLimitError, anthropic.APIStatusError) and handle each type explicitly.

Mistake 2: Retrying 400 errors without inspecting the error subtype

Not all 400 errors are equivalent. A context_length_exceeded 400 needs a different model, not a retry. A content_policy 400 must never be retried. A malformed request 400 is a programming error. Code that catches all 400 errors and retries them wastes quota and sometimes triggers additional rate limiting from repeated policy violations.

Mistake 3: No fallback for auth errors

Auth errors (401/403) need human intervention, not automated retry. Many teams configure alert-on-auth-error but forget to configure fallback. The result: when an API key expires or is accidentally revoked, the entire feature goes down with no fallback to another provider. Consider having auth errors on your primary provider trigger a page alert AND a fallback to a different provider that uses a different credential. Note that the configuration shown earlier places AUTH_ERROR in never_retry, which stops the whole chain; adopting this pattern means treating auth errors as "never retry on the same provider" rather than "never advance".

Mistake 4: Setting max_retries too high on rate limit errors

Three retries on a 429 response waste quota on attempts that will fail until the rate limit window resets. The correct pattern: retry once with a short wait, then fall back to a different provider or queue the request. Only if no fallback is available should you retry multiple times with backoff. Three retries with exponential backoff from a 1-second base can add up to 7 seconds of sleep (1 + 2 + 4) on top of the latency of each failed request, and more when Retry-After asks for a longer wait - unacceptable for real-time user-facing features.
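The sleep budget of a retry policy is easy to compute up front. This is a minimal sketch of that arithmetic, assuming the same doubling schedule and cap as the backoff function earlier in this section; `worst_case_backoff_s` is a hypothetical helper name, and the figure covers sleep time only, not the latency of the failed requests or a longer Retry-After.

```python
def worst_case_backoff_s(
    retries: int, base_delay_s: float = 1.0, max_delay_s: float = 60.0
) -> float:
    """Upper bound on total sleep across `retries` retries (jitter can only shorten it)."""
    return sum(min(base_delay_s * 2 ** k, max_delay_s) for k in range(retries))


print(worst_case_backoff_s(1))  # 1.0
print(worst_case_backoff_s(3))  # 7.0  (1 + 2 + 4)
print(worst_case_backoff_s(5))  # 31.0 (1 + 2 + 4 + 8 + 16)
```

Comparing this bound with the latency budget of the feature is a quick way to choose max_retries: for an interactive request, even the three-retry figure is usually too slow.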

Interview Q&A

Q: What is the difference between retry and fallback in the context of LLM failures?

Retry means attempting the same provider and model again after a brief wait. It is appropriate for transient failures - a single request that times out due to a momentary network hiccup, or a rate limit that will clear in a few seconds. Fallback means abandoning the current provider entirely and routing to a different model or provider. It is appropriate for sustained failures - a provider that is rate-limiting all requests due to quota exhaustion, or a server that is returning 5xx errors across multiple consecutive requests. A well-designed resilient client does both: retry the primary a bounded number of times with backoff, then fall back to the secondary if retries are exhausted. Retries and fallbacks operate at different time scales: retries address second-to-second transience; fallbacks address minute-to-hour degradation.

Q: How do you implement exponential backoff with jitter, and why is jitter essential?

Exponential backoff calculates the wait time as min(base * 2^attempt, max_delay): 1s, 2s, 4s, 8s, up to a configured maximum. Without jitter, all clients that received a rate limit error at the same moment will retry at the same moment - a thundering herd that immediately triggers another rate limit. With full jitter (uniform random in [0, calculated_delay]), each client retries at a different time within the window. The load on the provider is spread out over the full window rather than concentrated at a single instant. AWS's canonical blog post on this topic (2015) compared several jitter strategies in simulation: it found equal jitter the weakest of them, with full jitter requiring the least total client work and decorrelated jitter finishing slightly sooner, which is why full jitter is a common default.
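The effect of jitter can be seen in a small simulation. This is a minimal sketch, assuming 1000 clients that all failed at time zero and are on the same retry attempt; the function names, the 0.1-second bucket width, and the fixed seed are illustrative choices.

```python
import random


def retry_times(
    n_clients: int, attempt: int, jitter: bool,
    base_delay_s: float = 1.0, seed: int = 0,
) -> list[float]:
    """Time (seconds after the shared failure) at which each client retries."""
    rng = random.Random(seed)  # seeded so the run is reproducible
    delay = base_delay_s * 2 ** attempt
    return [rng.uniform(0.0, delay) if jitter else delay for _ in range(n_clients)]


def peak_per_bucket(times: list[float], bucket_s: float = 0.1) -> int:
    """Largest number of retries landing in the same bucket_s-wide time slot."""
    counts: dict[int, int] = {}
    for t in times:
        slot = int(t / bucket_s)
        counts[slot] = counts.get(slot, 0) + 1
    return max(counts.values())


no_jitter = peak_per_bucket(retry_times(1000, attempt=2, jitter=False))
full_jitter = peak_per_bucket(retry_times(1000, attempt=2, jitter=True))
print(f"peak retries per 100 ms: no jitter={no_jitter}, full jitter={full_jitter}")
```

Without jitter, all 1000 retries land in the same 100 ms slot. With full jitter they spread across the 4-second window, so the busiest slot sees only a small fraction of them.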

Q: What is a circuit breaker and how does it differ from retry with backoff?

Retry with backoff is request-scoped: it controls what happens when a single request fails. The circuit breaker is system-scoped: it tracks the health of a provider over a rolling time window and stops routing any requests to a failing provider. When too many failures occur within the window, the circuit "opens." In the open state, requests skip the failing provider immediately - no wasted attempt, no added latency. After a recovery timeout, the circuit enters "half-open" and lets probe requests through. If enough probes succeed (two consecutive successes in the configuration shown earlier), the circuit closes and normal routing resumes; if a probe fails, the circuit reopens. The two mechanisms are complementary: use backoff for transient single-request failures (the provider is fine, but that one request got unlucky); use circuit breakers for sustained provider degradation (the provider is actually impaired).

Q: Which HTTP error codes should trigger retry, and which should never be retried?

Retry: 429 (rate limit - wait per Retry-After header), 500 (internal server error - often transient), 502 (bad gateway), 503 (service unavailable), 504 (gateway timeout). Connection errors and timeouts also warrant retry. Never retry: 401 (invalid API key - alert on-call), 403 (permission denied - alert on-call), 400 with context_length_exceeded (the request is too long - fall back to a larger-context model instead), 400 with content_policy (the content itself violates policy - no model will accept it). The critical insight: apart from 429, 4xx errors are almost always client errors where the request itself is wrong; retrying sends the same wrong request again. 5xx errors are server errors where the problem is often transient, so retry makes sense.

Q: How do you handle a context length exceeded error differently from other 400 errors?

A context-length error means the input is too long for the model's context window - not that the request is malformed in any other way. Retrying the same model will always fail identically. The correct response is to fall back to a model with a strictly larger context window, not to an arbitrary sibling model or another provider's equivalent. Design a separate fallback path for context-length errors and check the window sizes when you build it: if the primary is claude-sonnet-4-6 (200k context), a target with the same 200k window (claude-opus-4-6) or a smaller one (GPT-4o, 128k) will reject the same input, so it does not belong on this path. When no larger-context model is available, the remaining option is to shorten the input (truncate or summarize) before resending. Detect context-length errors by checking both the HTTP status code (400) and the error message string for provider-specific indicators (context_length_exceeded for OpenAI, too long or too many tokens for Anthropic).
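As a rough sketch of the detection and routing described above: the marker strings are the ones listed in the answer, while the function names and the `candidates` mapping of model name to context size are hypothetical. Substring matching on error messages is brittle, so treat the marker list as something to verify against each provider's current error format.

```python
from typing import Dict, Optional

# Indicators taken from the text above; verify against real provider responses.
CONTEXT_LENGTH_MARKERS = ("context_length_exceeded", "too long", "too many tokens")


def is_context_length_error(status_code: int, message: str) -> bool:
    """True only for a 400 whose message matches a known context-length marker."""
    if status_code != 400:
        return False
    lowered = message.lower()
    return any(marker in lowered for marker in CONTEXT_LENGTH_MARKERS)


def pick_larger_context_model(
    current_limit: int, candidates: Dict[str, int]
) -> Optional[str]:
    """Return the smallest candidate whose window is strictly larger, else None."""
    larger = {name: limit for name, limit in candidates.items() if limit > current_limit}
    if not larger:
        return None  # caller must truncate or summarize instead
    return min(larger, key=lambda name: larger[name])
```

Returning `None` when nothing is strictly larger forces the caller to handle the "no bigger model exists" case explicitly rather than resending the same oversized input to a model that will also reject it.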

Q: How would you ensure circuit breaker state is consistent across multiple API service replicas?

Store circuit breaker state in Redis using atomic operations. Key structure: cb:{provider}:failures (sorted set of failure timestamps for rolling window), cb:{provider}:state (string: "closed"/"open"/"half_open"), cb:{provider}:open_since (unix timestamp). Use a Lua script to atomically check the failure count, compare against the threshold, and set the state - all in one Redis operation. This prevents the race condition where two replicas simultaneously observe the threshold being crossed and both try to write "open." Set a TTL on the state keys equal to the recovery timeout so that circuit state automatically resets if all replicas restart (preventing a permanently open circuit after a cluster restart).
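One way the atomic check-and-set could look is sketched below, using the key layout from the answer. The Lua script, the `record_failure` helper, and its default thresholds are illustrative assumptions; `client` is expected to be a redis-py style client exposing `eval(script, numkeys, *keys_and_args)`, and the window and timeout are assumed to be whole seconds.

```python
import time
import uuid
from typing import Optional

# KEYS: failures sorted set, state string, open_since string
# ARGV: now, window_s, threshold, recovery_timeout_s, unique member id
RECORD_FAILURE_LUA = """
local now = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local threshold = tonumber(ARGV[3])
redis.call('ZADD', KEYS[1], now, ARGV[5])
redis.call('ZREMRANGEBYSCORE', KEYS[1], '-inf', now - window)
redis.call('EXPIRE', KEYS[1], ARGV[2])
if redis.call('ZCARD', KEYS[1]) >= threshold then
  redis.call('SET', KEYS[2], 'open', 'EX', ARGV[4])
  redis.call('SET', KEYS[3], ARGV[1], 'EX', ARGV[4])
  return 'open'
end
return redis.call('GET', KEYS[2]) or 'closed'
"""


def record_failure(
    client,
    provider: str,
    *,
    threshold: int = 5,
    window_s: int = 60,
    recovery_timeout_s: int = 30,
    now: Optional[float] = None,
) -> str:
    """Record one failure and return the resulting circuit state."""
    now = time.time() if now is None else now
    keys = [
        f"cb:{provider}:failures",
        f"cb:{provider}:state",
        f"cb:{provider}:open_since",
    ]
    # A unique member keeps two failures with the same timestamp distinct.
    args = [now, window_s, threshold, recovery_timeout_s, uuid.uuid4().hex]
    result = client.eval(RECORD_FAILURE_LUA, len(keys), *keys, *args)
    return result.decode() if isinstance(result, bytes) else result
```

Because the whole script runs as one Redis operation, no replica can observe the failure count between the `ZCARD` and the `SET`. On Redis Cluster, all keys touched by a script must hash to the same slot, so the three keys would need a shared hash tag.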

Fallback Configuration Reference: By Use Case

Different use cases require different fallback configurations. This reference table summarizes the recommended settings for common production scenarios.

| Use Case | Primary | Fallback Chain | Max Retries | Strategy |
| --- | --- | --- | --- | --- |
| Real-time user chat | claude-sonnet-4-6 | claude-haiku-4-5-20251001, gpt-4o-mini | 1 | Fast fail, cheap fallback |
| Document analysis | claude-sonnet-4-6 | claude-opus-4-6 (for long context) | 2 | Quality-preserving |
| Code generation | claude-sonnet-4-6 | gpt-4o | 2 | Different model family |
| Batch classification | claude-haiku-4-5-20251001 | gpt-4o-mini | 3 | Retry-heavy, no fallback |
| Legal/compliance | claude-opus-4-6 | claude-sonnet-4-6 | 2 | Grade down cautiously |
| Embeddings | text-embedding-3-small | text-embedding-ada-002 | 3 | Retry within provider |
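A table like this is easiest to keep honest when it lives in code as data. The sketch below encodes three of the rows; the `FallbackPolicy` dataclass, the `POLICIES` mapping, its use-case keys, and `models_in_order` are hypothetical names introduced for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class FallbackPolicy:
    primary: str
    fallback_chain: Tuple[str, ...]
    max_retries: int  # retries per model before moving down the chain


# Values copied from the reference table; the keys are illustrative.
POLICIES: Dict[str, FallbackPolicy] = {
    "realtime_chat": FallbackPolicy(
        "claude-sonnet-4-6", ("claude-haiku-4-5-20251001", "gpt-4o-mini"), 1
    ),
    "code_generation": FallbackPolicy("claude-sonnet-4-6", ("gpt-4o",), 2),
    "legal_compliance": FallbackPolicy("claude-opus-4-6", ("claude-sonnet-4-6",), 2),
}


def models_in_order(use_case: str) -> List[str]:
    """Primary first, then each fallback in the order it should be tried."""
    policy = POLICIES[use_case]
    return [policy.primary, *policy.fallback_chain]
```

A gateway can then look up the policy by use case at request time, and a CI check can assert that every model named in the mapping is one the gateway actually knows how to call.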

Testing Fallback Chains

Fallback chains that are never tested will fail under real conditions in unexpected ways. Add a fallback validation test suite to your CI pipeline.

```python
import unittest
from unittest.mock import patch, MagicMock
import anthropic


class FallbackChainTest(unittest.TestCase):
    """
    Test the fallback chain under simulated provider failures.

    These tests verify that:
    1. Successful primary calls don't trigger fallback
    2. Rate limit errors trigger fallback to the next provider
    3. All providers failing raises RuntimeError
    4. 4xx client errors are not retried on fallback providers
    """

    def setUp(self):
        from dataclasses import dataclass

        @dataclass
        class FallbackTarget:
            provider: str
            model: str
            api_key: str
            max_retries: int = 2
            initial_delay_s: float = 1.0

        @dataclass
        class FallbackConfig:
            targets: list[FallbackTarget]

        self.FallbackTarget = FallbackTarget
        self.FallbackConfig = FallbackConfig

    @patch("anthropic.Anthropic")
    def test_primary_success_no_fallback(self, mock_anthropic_class):
        """When primary succeeds, fallback should not be called."""
        mock_client = MagicMock()
        mock_response = MagicMock()
        mock_response.content = [MagicMock(text="Primary response")]
        mock_response.usage.input_tokens = 100
        mock_response.usage.output_tokens = 50
        mock_client.messages.create.return_value = mock_response
        mock_anthropic_class.return_value = mock_client

        # Simulate: primary call returns successfully on first attempt
        self.assertEqual(mock_client.messages.create.call_count, 0)
        mock_client.messages.create()
        self.assertEqual(mock_client.messages.create.call_count, 1)

    def test_rate_limit_triggers_fallback(self):
        """Rate limit errors on primary should trigger fallback."""
        call_count = {"primary": 0, "fallback": 0}

        def primary_call(*args, **kwargs):
            call_count["primary"] += 1
            raise anthropic.RateLimitError(
                message="Rate limit exceeded",
                response=MagicMock(status_code=429),
                body={},
            )

        def fallback_call(*args, **kwargs):
            call_count["fallback"] += 1
            mock_response = MagicMock()
            mock_response.content = [MagicMock(text="Fallback response")]
            mock_response.usage.input_tokens = 100
            mock_response.usage.output_tokens = 50
            return mock_response

        # Verify the fallback is reached after primary fails
        try:
            primary_call()
        except anthropic.RateLimitError:
            fallback_call()

        self.assertEqual(call_count["primary"], 1)
        self.assertEqual(call_count["fallback"], 1)

    def test_all_providers_fail_raises_error(self):
        """When all providers fail, RuntimeError should be raised."""
        def always_fail(*args, **kwargs):
            raise anthropic.APIStatusError(
                message="Server error",
                response=MagicMock(status_code=500),
                body={},
            )

        errors_seen = []
        try:
            always_fail()
        except anthropic.APIStatusError as e:
            errors_seen.append(e)

        try:
            always_fail()
        except anthropic.APIStatusError as e:
            errors_seen.append(e)

        self.assertEqual(len(errors_seen), 2)
        # In production code: raise RuntimeError("All providers failed")
        # This test verifies that 2 failures would trigger that condition

    def test_4xx_not_retried(self):
        """Client errors (4xx) should not be retried on another provider."""
        retry_count = {"count": 0}

        def invalid_request(*args, **kwargs):
            retry_count["count"] += 1
            raise anthropic.APIStatusError(
                message="Bad request",
                response=MagicMock(status_code=400),
                body={},
            )

        try:
            invalid_request()
        except anthropic.APIStatusError as e:
            # Correct behavior: do NOT retry on 4xx
            if e.status_code < 500:
                pass  # Re-raise immediately, don't try another provider

        self.assertEqual(retry_count["count"], 1)  # Called exactly once


if __name__ == "__main__":
    unittest.main()
```
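The tests above simulate the chain inline rather than calling production code, so they would keep passing even if the real fallback logic regressed. A sketch of the alternative is to put the chain in one small function and assert against that. Everything here is hypothetical scaffolding: `ProviderError` stands in for whatever exception type your gateway normalizes provider errors into, and `call_with_fallback` is a deliberately minimal chain with no per-provider retries.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


class ProviderError(Exception):
    """Stand-in for a normalized provider error carrying an HTTP status."""

    def __init__(self, status_code: int) -> None:
        super().__init__(f"provider returned {status_code}")
        self.status_code = status_code


def is_client_error(status_code: int) -> bool:
    """4xx other than 429: the request itself is wrong, so do not fall back."""
    return 400 <= status_code < 500 and status_code != 429


def call_with_fallback(providers: Sequence[Callable[[], T]]) -> T:
    """Try each provider in order; raise RuntimeError if all of them fail."""
    errors: List[ProviderError] = []
    for call in providers:
        try:
            return call()
        except ProviderError as exc:
            if is_client_error(exc.status_code):
                raise  # another provider would receive the same bad request
            errors.append(exc)
    raise RuntimeError(f"All providers failed: {[e.status_code for e in errors]}")
```

With the logic in one place, each of the four documented behaviors becomes a direct assertion on `call_with_fallback`: pass in callables that raise `ProviderError(429)`, `ProviderError(500)`, or `ProviderError(400)` and check which ones were invoked and what was raised.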