Rate Limiting and Quotas

The Batch Job That Ate Everything

It was 2:17 AM on a Wednesday when the first user complaints arrived in the support queue. The AI writing assistant - 60,000 paying customers, $18 per month, the company's flagship product - had stopped generating responses. The support queue went from 0 to 400 tickets in eleven minutes. Engineers woke up to a Slack thread already two hundred messages long.

The investigation took eleven minutes from the time the first engineer joined the incident channel. A post-mortem review of the Anthropic API logs revealed the cause: one enterprise customer had written a script to regenerate product descriptions for their entire catalog - 280,000 items - using the company's writing API. The script was multithreaded and ran without throttling, without exponential backoff, and without client-side rate limiting. It launched at 10 PM. By 2 AM it had issued 280,000 API calls, consuming the company's entire provider token quota for the night in four hours.

Anthropic began throttling every request from the account at 2:04 AM - including legitimate real-time requests from all 60,000 paying customers. The enterprise customer's batch job had effectively DDoS'd the company's shared provider account. Every API call from every paying user was now hitting the account-level rate limit, getting a 429, and failing silently or with a degraded response.

The fix required two hours: manually blocklisting the enterprise customer's API key, waiting for the provider rate limit window to reset (the full window, not just the remaining time), deploying emergency per-customer rate limiting, then lifting the block and monitoring for recurrence. The incident cost four hours of service downtime for 60,000 users. The root cause: there was no per-customer limit, no distinction between real-time and batch traffic, and no mechanism to isolate one customer's runaway job from the rest of the system.

This is the exact problem gateway-layer rate limiting prevents. It is not hypothetical - it is a class of incident that has happened to dozens of companies building on LLM APIs.

Why Rate Limiting at the Gateway Level

Provider-side rate limits are blunt instruments. They apply to your entire account - every API key, every customer, every feature - as a single shared pool. When any component in your system saturates that pool, every other component is throttled simultaneously. There is no per-customer isolation, no per-feature priority, no way to protect real-time traffic from batch traffic. You cannot tell the provider "this customer should get at most 10% of our total quota" - the provider sees your account as a single entity.

Gateway-layer rate limiting is surgical. It enforces per-customer, per-team, and per-feature consumption limits before requests reach the provider. Limits are enforced at request time, not retroactively. And critically, limits are enforced on tokens - not just request count - because a single LLM request can consume anywhere from 50 tokens to 100,000+ tokens. Request count limits alone are dangerously incomplete for LLM workloads.

Two properties make gateway rate limiting different from traditional API rate limiting:

Token-based consumption: limit by token count (the actual cost dimension), not just by request count. One 50,000-token document analysis costs as much as 1,000 short 50-token FAQ queries, yet a request counter sees 1 request versus 1,000. A limit on request count alone would wave the expensive request through and throttle the cheap ones.
Multi-dimensional isolation: enforce separate limits for users, teams, features, and request types. A batch job quota is separate from a real-time chat quota - they cannot consume each other's allocation. This is the isolation property that the 2:17 AM incident lacked.
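To make the first property concrete, here is a minimal sketch that compares two hypothetical users under a request-count limit and under a token limit. The limits (`MAX_REQUESTS`, `MAX_TOKENS`) and the traffic figures are assumptions chosen for illustration, not recommended values.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    """Traffic from one user over one minute (illustrative numbers)."""
    requests: int
    tokens_per_request: int

    @property
    def total_tokens(self) -> int:
        return self.requests * self.tokens_per_request


# Hypothetical per-minute limits, chosen only for this example.
MAX_REQUESTS = 100
MAX_TOKENS = 30_000


def over_request_limit(usage: Usage) -> bool:
    return usage.requests > MAX_REQUESTS


def over_token_limit(usage: Usage) -> bool:
    return usage.total_tokens > MAX_TOKENS


doc_analysis = Usage(requests=1, tokens_per_request=50_000)
faq_queries = Usage(requests=60, tokens_per_request=50)

for name, usage in [("doc_analysis", doc_analysis), ("faq_queries", faq_queries)]:
    print(
        f"{name}: {usage.requests} req, {usage.total_tokens:,} tokens | "
        f"request limit hit: {over_request_limit(usage)} | "
        f"token limit hit: {over_token_limit(usage)}"
    )
```

The single document analysis passes the request-count check while exceeding the token budget, which is the gap a token-based limit closes.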

Rate Limiting Algorithms Compared

Token Bucket

The bucket holds a fixed number of tokens. Each request consumes tokens equal to the request's cost (variable per request). The bucket refills at a constant rate. If a request arrives when the bucket contains insufficient tokens, it is rejected.

Properties:

Allows short bursts up to bucket capacity - good for bursty-but-bounded workloads
Enforces a long-term average rate equal to the refill rate
Variable consumption per request: a large request drains more tokens than a small one
Ideal for token-based LLM rate limiting where request sizes vary widely
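The algorithm is small enough to sketch in memory before adding Redis. The class below is an illustrative, single-process version: `InMemoryTokenBucket` and `try_consume` are hypothetical names, and the clock is passed in explicitly so the behavior is easy to follow.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryTokenBucket:
    """Single-process token bucket; time is supplied by the caller."""
    capacity: float
    refill_rate: float                      # tokens added per second
    tokens: float = field(init=False)       # current fill level
    last_refill: float = 0.0                # timestamp of the last refill

    def __post_init__(self) -> None:
        self.tokens = self.capacity         # bucket starts full

    def try_consume(self, cost: float, now: float) -> bool:
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        # Variable cost: a large request drains more than a small one.
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


bucket = InMemoryTokenBucket(capacity=1_000, refill_rate=100)
first = bucket.try_consume(800, now=0.0)    # burst allowed: 1000 -> 200
second = bucket.try_consume(500, now=1.0)   # 200 + 100 refill = 300 < 500, rejected
third = bucket.try_consume(500, now=3.0)    # 300 + 200 refill = 500, allowed
print(first, second, third, bucket.tokens)
```

A rejected request still advances `last_refill`, so refilled tokens are never counted twice.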

Sliding Window Counter

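Maintains a time-ordered list of request timestamps (this exact variant is often called a sliding window log). For each new request, it counts how many timestamps fall within the last N seconds. If the count is below the limit, the request is allowed and its timestamp recorded; if the count is at or above the limit, the request is rejected.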

Properties:

Precise enforcement of "N requests per window" semantics
No burst allowance above the per-window limit
More Redis memory per key (stores timestamps, not just a counter)
Better for request-count limits when request sizes are uniform
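A minimal in-memory sketch of the same idea, using a deque of timestamps in place of Redis. `InMemorySlidingWindow` and `try_acquire` are hypothetical names, and time is passed in explicitly for clarity.

```python
from collections import deque


class InMemorySlidingWindow:
    """Allows at most max_requests within any trailing window_seconds."""

    def __init__(self, max_requests: int, window_seconds: float) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._timestamps: deque = deque()   # oldest timestamp on the left

    def try_acquire(self, now: float) -> bool:
        # Drop timestamps that have slid out of the window.
        cutoff = now - self.window_seconds
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        # At or above the limit: reject without recording.
        if len(self._timestamps) >= self.max_requests:
            return False
        self._timestamps.append(now)
        return True


window = InMemorySlidingWindow(max_requests=3, window_seconds=60)
decisions = [window.try_acquire(t) for t in (0.0, 10.0, 20.0, 30.0, 60.0, 61.0)]
print(decisions)
```

The fourth request is rejected because three are already in the window. The request at t=60 is admitted once the t=0 entry expires, and the one at t=61 is rejected again.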

For LLM rate limiting: the token bucket is the right choice. LLM requests have highly variable token consumption. The token bucket handles variable consumption naturally - each request consumes as many tokens as it actually uses. The sliding window counts requests uniformly, making it blind to the actual cost variation between a 50-token FAQ query and a 50,000-token document analysis.

Token Bucket Implementation with Atomic Lua Script

The critical correctness requirement: the check-and-consume operation must be atomic. Without atomicity, race conditions allow two concurrent requests to both pass the "enough tokens?" check before either has consumed tokens - both succeed, collectively consuming twice the tokens that should have been allowed.
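The race can be replayed deterministically. The sketch below is illustrative only: `unsafe_interleaving` spells out the bad ordering by hand, and `LockedBucket` uses a thread lock as a stand-in for the atomicity a shared store would need to provide.

```python
import threading


def unsafe_interleaving(start_tokens: int, cost: int) -> tuple:
    """Replay the bad ordering: both requests read before either writes."""
    tokens = start_tokens
    seen_a = tokens                 # request A reads the balance
    seen_b = tokens                 # request B reads before A has written
    allowed_a = seen_a >= cost
    allowed_b = seen_b >= cost
    if allowed_a:
        tokens = seen_a - cost
    if allowed_b:
        tokens = seen_b - cost      # overwrites A's write
    return allowed_a, allowed_b, tokens


class LockedBucket:
    """Check and consume under one lock, so the two steps cannot interleave."""

    def __init__(self, tokens: int) -> None:
        self._tokens = tokens
        self._lock = threading.Lock()

    def check_and_consume(self, cost: int) -> bool:
        with self._lock:
            if self._tokens >= cost:
                self._tokens -= cost
                return True
            return False


unsafe = unsafe_interleaving(start_tokens=1_000, cost=800)

bucket = LockedBucket(1_000)
results = []
threads = [
    threading.Thread(target=lambda: results.append(bucket.check_and_consume(800)))
    for _ in range(2)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("unsafe:", unsafe)            # both allowed: 1,600 tokens admitted from 1,000
print("locked:", sorted(results))   # exactly one request allowed
```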

The standard Redis solution is a Lua script. Redis executes Lua scripts atomically: no other command runs between the script's read and write operations. The entire read-modify-write sequence is one indivisible Redis operation.

import time
import redis
import anthropic
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RateLimitConfig:
    """Token bucket configuration for one rate limit dimension."""
    bucket_capacity: int        # Maximum tokens in the bucket (burst limit)
    refill_rate: float          # Tokens added per second (sustained rate)
    fixed_consume_per_request: Optional[int] = None  # Override for pre-flight estimates


@dataclass
class RateLimitResult:
    """Result of a rate limit check-and-consume operation."""
    allowed: bool
    tokens_remaining: float
    retry_after_seconds: Optional[float] = None
    triggered_dimension: Optional[str] = None
    triggered_value: Optional[str] = None


class TokenBucketRateLimiter:
    """
    Distributed token bucket rate limiter backed by Redis.

    Uses Lua scripting for atomic check-and-consume - a single Redis
    command executes the entire read-modify-write atomically.
    No Redis transactions (MULTI/EXEC) required.
    Works correctly with hundreds of concurrent requests.

    Key format: rate_limit:{dimension}:{value}
    Key structure (Redis hash): {tokens: float, last_refill: float (unix ms)}
    """

    # Atomic Lua script: read state → refill → check → consume → write → return
    # Returns [allowed (1/0), current_tokens (float), retry_after_ms (int)]
    _LUA_SCRIPT = """
    local key            = KEYS[1]
    local capacity       = tonumber(ARGV[1])
    local refill_rate    = tonumber(ARGV[2])    -- tokens per second
    local consume        = tonumber(ARGV[3])    -- tokens to consume this request
    local now_ms         = tonumber(ARGV[4])    -- current time in milliseconds

    local data = redis.call('HMGET', key, 'tokens', 'last_refill')
    local current_tokens = tonumber(data[1])
    local last_refill    = tonumber(data[2])

    -- Initialize on first use: bucket starts full
    if current_tokens == nil then
        current_tokens = capacity
        last_refill    = now_ms
    end

    -- Refill: add tokens proportional to elapsed time since last refill
    local elapsed_s = math.max(0, (now_ms - last_refill) / 1000.0)
    local added     = elapsed_s * refill_rate
    current_tokens  = math.min(capacity, current_tokens + added)
    last_refill     = now_ms

    -- Atomic check-and-consume
    -- Redis truncates Lua numbers to integers in replies, so the fractional
    -- token count is returned as a string and parsed with float() in Python.
    if current_tokens >= consume then
        current_tokens = current_tokens - consume
        redis.call('HMSET', key, 'tokens', current_tokens, 'last_refill', last_refill)
        redis.call('EXPIRE', key, 3600)
        return {1, tostring(current_tokens), 0}
    else
        -- Not enough tokens - calculate wait time
        local deficit         = consume - current_tokens
        local retry_after_ms  = math.ceil((deficit / refill_rate) * 1000)
        -- Save state even on rejection (update last_refill timestamp)
        redis.call('HMSET', key, 'tokens', current_tokens, 'last_refill', last_refill)
        redis.call('EXPIRE', key, 3600)
        return {0, tostring(current_tokens), retry_after_ms}
    end
    """

    def __init__(self, redis_url: str = "redis://localhost:6379"):
        self.redis = redis.from_url(redis_url)
        self._script = self.redis.register_script(self._LUA_SCRIPT)

        # Default tier configurations - load from external config in production
        self.configs: dict[str, RateLimitConfig] = {
            # User tiers: bucket_capacity is burst allowance, refill_rate is sustained
            "user:free": RateLimitConfig(
                bucket_capacity=30_000,     # 30k token burst
                refill_rate=500,            # 500 tok/sec → 30k TPM sustained
            ),
            "user:pro": RateLimitConfig(
                bucket_capacity=150_000,
                refill_rate=2_500,          # 2500 tok/sec → 150k TPM sustained
            ),
            "user:enterprise": RateLimitConfig(
                bucket_capacity=500_000,
                refill_rate=8_333,          # ~500k TPM sustained
            ),
            # Team defaults
            "team:default": RateLimitConfig(
                bucket_capacity=500_000,
                refill_rate=8_333,
            ),
            # Feature-level limits: different rates for different use cases
            "feature:real-time-chat": RateLimitConfig(
                bucket_capacity=200_000,
                refill_rate=10_000,          # 600k TPM sustained - highest rate, for real-time
            ),
            "feature:batch-pipeline": RateLimitConfig(
                bucket_capacity=100_000,
                refill_rate=500,             # Low sustained rate - batch waits
            ),
            "feature:docs-assistant": RateLimitConfig(
                bucket_capacity=100_000,
                refill_rate=2_000,
            ),
        }

    def _get_config(self, dimension: str, value: str) -> RateLimitConfig:
        """
        Get config for a specific (dimension, value) pair.

        Lookup order:
        1. Exact match: "user:pro:user_8821" (per-user override; value includes the tier prefix)
        2. Tier prefix match: "user:pro" (when value is "pro:user_8821")
        3. Dimension default: "user:default"
        4. Permissive fallback (prevents breakage from missing config)
        """
        exact = f"{dimension}:{value}"
        if exact in self.configs:
            return self.configs[exact]

        if ":" in value:
            tier = value.split(":")[0]
            tier_key = f"{dimension}:{tier}"
            if tier_key in self.configs:
                return self.configs[tier_key]

        default_key = f"{dimension}:default"
        if default_key in self.configs:
            return self.configs[default_key]

        # Permissive fallback - prevents service outage from missing config
        return RateLimitConfig(bucket_capacity=1_000_000, refill_rate=100_000)

    def check_and_consume(
        self,
        dimension: str,
        value: str,
        tokens_to_consume: int,
    ) -> RateLimitResult:
        """
        Atomically check and consume tokens from a rate limit bucket.
        Uses fixed_consume_per_request if configured (for pre-flight estimates).
        """
        config = self._get_config(dimension, value)
        actual_consume = config.fixed_consume_per_request or max(1, tokens_to_consume)

        key = f"rate_limit:{dimension}:{value}"
        result = self._script(
            keys=[key],
            args=[
                config.bucket_capacity,
                config.refill_rate,
                actual_consume,
                int(time.time() * 1000),    # current time in milliseconds
            ],
        )

        allowed = bool(result[0])
        tokens_remaining = float(result[1])
        retry_after_ms = float(result[2])

        return RateLimitResult(
            allowed=allowed,
            tokens_remaining=tokens_remaining,
            retry_after_seconds=retry_after_ms / 1000 if retry_after_ms > 0 else None,
            triggered_dimension=dimension if not allowed else None,
            triggered_value=value if not allowed else None,
        )

    def check_all(
        self,
        user_id: str,
        user_tier: str,         # "free", "pro", "enterprise"
        team_id: str,
        feature: str,
        estimated_tokens: int = 1_000,
    ) -> RateLimitResult:
        """
        Check all applicable rate limit dimensions in priority order.
        Returns the first dimension that would be exceeded.

        Uses an estimated token count for pre-flight checks (before the LLM responds
        and we know the actual token count). Actual consumption is not re-recorded
        here - this is the pre-flight gate check only.
        """
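        checks = [
            ("user", f"{user_tier}:{user_id}", estimated_tokens),
            ("team", team_id, estimated_tokens),
            ("feature", feature, estimated_tokens),
        ]

        # Track the tightest bucket so callers see real headroom on success.
        min_remaining = float("inf")
        for dimension, value, tokens in checks:
            result = self.check_and_consume(dimension, value, tokens)
            if not result.allowed:
                # NOTE: dimensions checked earlier in this loop have already been
                # debited. A production gateway should refund them or evaluate
                # all dimensions inside a single Lua script.
                return result
            min_remaining = min(min_remaining, result.tokens_remaining)

        return RateLimitResult(allowed=True, tokens_remaining=min_remaining)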

    def get_bucket_state(self, dimension: str, value: str) -> dict:
        """Inspect the current state of a bucket (for admin and debug endpoints)."""
        key = f"rate_limit:{dimension}:{value}"
        data = self.redis.hgetall(key)
        config = self._get_config(dimension, value)

        current = float(data.get(b"tokens", config.bucket_capacity))
        return {
            "dimension": dimension,
            "value": value,
            "current_tokens": round(current),
            "capacity": config.bucket_capacity,
            "refill_rate_per_sec": config.refill_rate,
            "fill_pct": round(current / config.bucket_capacity * 100, 1),
            "estimated_tpm": round(config.refill_rate * 60),
            "estimated_burst_s": round(config.bucket_capacity / config.refill_rate, 1),
        }

    def reset_bucket(self, dimension: str, value: str) -> None:
        """
        Reset a bucket to full capacity.
        Use for: billing period resets, manual override by support team.
        NOT for: regular operations - buckets self-restore via refill.
        """
        key = f"rate_limit:{dimension}:{value}"
        config = self._get_config(dimension, value)
        self.redis.hset(key, mapping={
            "tokens": config.bucket_capacity,
            "last_refill": int(time.time() * 1000),
        })
        print(f"[RateLimit] Bucket reset: {dimension}:{value} -> "
              f"{config.bucket_capacity:,} tokens")

    def set_custom_limit(
        self,
        dimension: str,
        value: str,
        bucket_capacity: int,
        refill_rate: float,
    ) -> None:
        """
        Set a custom rate limit for a specific (dimension, value) pair.
        Used for per-user overrides by the support team without a deployment.
        """
        self.configs[f"{dimension}:{value}"] = RateLimitConfig(
            bucket_capacity=bucket_capacity,
            refill_rate=refill_rate,
        )
        print(f"[RateLimit] Custom limit set for {dimension}:{value}: "
              f"capacity={bucket_capacity:,}, "
              f"refill={refill_rate:.0f}/sec (~{refill_rate*60:,.0f} TPM)")

Sliding Window Rate Limiter

For request-count limits (RPM enforcement), the sliding window is the right tool. It uses Redis sorted sets to maintain a precise count of events within the last N seconds.

class SlidingWindowRateLimiter:
    """
    Sliding window counter for request-count (RPM) rate limiting.

    Uses Redis sorted sets: member = unique request ID, score = timestamp.
    Window is maintained by removing members older than window_seconds.

    Provides exact "N requests in the last W seconds" semantics.
    No burst allowance - if you have used your quota, the next request is blocked.
    """

    def __init__(
        self,
        redis_url: str = "redis://localhost:6379",
        window_seconds: float = 60.0,
        max_requests: int = 100,
    ):
        self.redis = redis.from_url(redis_url)
        self.window_seconds = window_seconds
        self.max_requests = max_requests

    def check_and_record(self, key: str) -> tuple[bool, int, Optional[float]]:
        """
        Check whether a request is within the sliding window limit.
        Returns (allowed, current_count, retry_after_seconds).

        Pipeline execution order:
        1. Remove expired entries (older than window)
        2. Count current entries (before adding new one)
        3. Add the new entry (with current timestamp as score)
        4. Set TTL on the sorted set key
        """
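        import uuid  # local import keeps this snippet self-contained

        now = time.time()
        window_start = now - self.window_seconds
        # id(self) is the same for every call on this instance, so two requests in
        # the same microsecond would share a member and overwrite each other.
        request_id = f"{now:.6f}-{uuid.uuid4().hex}"  # Unique member key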

        pipe = self.redis.pipeline()
        pipe.zremrangebyscore(key, "-inf", window_start)   # Remove old entries
        pipe.zcard(key)                                     # Count before adding
        pipe.zadd(key, {request_id: now})                  # Add this request
        pipe.expire(key, int(self.window_seconds) + 1)
        results = pipe.execute()

        count_before = results[1]   # Count BEFORE adding this request

        if count_before >= self.max_requests:
            # Over limit - remove the entry we just added (we're rejecting this request)
            self.redis.zrem(key, request_id)

            # Calculate when the oldest in-window entry will expire
            oldest = self.redis.zrange(key, 0, 0, withscores=True)
            if oldest:
                oldest_score = oldest[0][1]
                retry_after = (oldest_score + self.window_seconds) - now
            else:
                retry_after = self.window_seconds

            return False, count_before, max(0.1, retry_after)

        return True, count_before + 1, None

Rate-Limited Anthropic Client

The RateLimitedAnthropicClient integrates the token bucket into the call path. Pre-flight uses an estimated token count (before the API call). The rate limit check and consume happen before any tokens are sent to the provider.

class RateLimitedAnthropicClient:
    """
    Anthropic client with gateway-layer rate limiting.

    Pre-flight: checks token bucket before calling the LLM
    Raises anthropic.RateLimitError with retry information if limits are exceeded
    The error is structurally similar to provider 429s - callers handle both the same way
    """

    def __init__(self, rate_limiter: TokenBucketRateLimiter):
        self.limiter = rate_limiter
        self.client = anthropic.Anthropic()

    def _estimate_tokens(self, messages: list[dict], max_tokens: int) -> int:
        """
        Pre-flight token estimate for rate limit check.
        Characters / 4 is a widely used heuristic (1 token ≈ 4 English characters).
        Add max_tokens for the worst-case output assumption.
        """
        char_count = sum(len(m.get("content", "")) for m in messages)
        estimated_input = max(50, char_count // 4)
        return estimated_input + max_tokens

    def complete(
        self,
        messages: list[dict],
        user_id: str,
        user_tier: str,     # "free", "pro", "enterprise"
        team_id: str,
        feature: str,
        model: str = "claude-sonnet-4-6",
        max_tokens: int = 1024,
        system: Optional[str] = None,
    ) -> dict:
        """
        Complete a request with gateway-layer rate limit enforcement.

        Pre-flight uses an estimated token count (characters / 4 + max_tokens).
        This estimate is intentionally conservative: it consumes capacity from the
        bucket before the call. This prevents races where many requests are allowed
        simultaneously based on the pre-call estimate and then the actual tokens
        exceed the effective limit.
        """
        # Step 1: Pre-flight rate limit check with estimated token count
        estimated = self._estimate_tokens(messages, max_tokens)

        rl_result = self.limiter.check_all(
            user_id=user_id,
            user_tier=user_tier,
            team_id=team_id,
            feature=feature,
            estimated_tokens=estimated,
        )

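        if not rl_result.allowed:
            retry_in = rl_result.retry_after_seconds or 60.0
            # Raise the same error type as provider rate limits
            # so callers don't need separate handling for gateway vs provider limits.
            # The SDK's status errors read response.request and response.status_code,
            # so response=None would fail with AttributeError; build a synthetic 429.
            import httpx  # already installed as a dependency of the anthropic SDK

            synthetic_response = httpx.Response(
                429,
                headers={"retry-after": str(int(retry_in) + 1)},
                # Placeholder URL: the request never leaves the gateway.
                request=httpx.Request("POST", "https://gateway.invalid/v1/messages"),
            )
            raise anthropic.RateLimitError(
                message=(
                    f"Gateway rate limit exceeded for {rl_result.triggered_dimension} "
                    f"'{rl_result.triggered_value}'. "
                    f"Retry after {retry_in:.1f} seconds. "
                    f"Tokens remaining: {rl_result.tokens_remaining:.0f}"
                ),
                response=synthetic_response,
                body=None,
            )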

        # Step 2: Make the LLM call
        kwargs: dict = {
            "model": model,
            "max_tokens": max_tokens,
            "messages": messages,
        }
        if system:
            kwargs["system"] = system

        # perf_counter is monotonic, so the latency cannot go negative on clock changes
        start = time.perf_counter()
        response = self.client.messages.create(**kwargs)
        latency_ms = (time.perf_counter() - start) * 1000

        actual_tokens = response.usage.input_tokens + response.usage.output_tokens

        return {
            "response": response.content[0].text,
            "model": response.model,
            "input_tokens": response.usage.input_tokens,
            "output_tokens": response.usage.output_tokens,
            "latency_ms": round(latency_ms, 1),
            "bucket_tokens_remaining": round(rl_result.tokens_remaining),
            "estimated_tokens": estimated,
            "actual_tokens": actual_tokens,
        }


def demo_rate_limiting() -> None:
    """Demonstrate token bucket enforcement with a burst of requests."""
    limiter = TokenBucketRateLimiter()

    # Configure a small bucket for the demo. Each request below is estimated at
    # ~100 tokens (50-token input floor + max_tokens=50), so the bucket must be
    # small for a 6-request burst to hit the limit.
    limiter.configs["user:demo:demo_user"] = RateLimitConfig(
        bucket_capacity=300,
        refill_rate=5,    # Very slow refill - makes the demo observable
    )

    client = RateLimitedAnthropicClient(limiter)
    question = "What is the Pythagorean theorem? Answer in one sentence."

    print("=== Burst test: 6 requests against a 300-token bucket ===\n")
    allowed = 0
    blocked = 0

    for i in range(6):
        try:
            result = client.complete(
                messages=[{"role": "user", "content": question}],
                user_id="demo_user",
                user_tier="demo",
                team_id="demo-team",
                feature="docs-assistant",
                model="claude-haiku-4-5-20251001",
                max_tokens=50,
            )
            print(f"Request {i+1}: ALLOWED  | "
                  f"Bucket: {result['bucket_tokens_remaining']:,} tokens remaining | "
                  f"Est: {result['estimated_tokens']} / Actual: {result['actual_tokens']}")
            allowed += 1
        except anthropic.RateLimitError as e:
            print(f"Request {i+1}: BLOCKED  | {str(e)[:90]}")
            blocked += 1

    print(f"\nSummary: {allowed} allowed, {blocked} blocked")
    print("\n=== Bucket State After Burst ===")
    state = limiter.get_bucket_state("user", "demo:demo_user")
    print(f"  Current: {state['current_tokens']:,} / {state['capacity']:,} tokens "
          f"({state['fill_pct']}% full)")
    print(f"  Refill rate: {state['refill_rate_per_sec']} tok/sec "
          f"(~{state['estimated_tpm']:,} TPM)")
    print(f"  Burst window: {state['estimated_burst_s']}s at refill rate")


if __name__ == "__main__":
    demo_rate_limiting()
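On the caller's side, a rejected request is usually retried after the suggested wait. Here is a minimal sketch of that loop, assuming a hypothetical `Throttled` exception that stands in for a 429 carrying a Retry-After value; it is not the SDK's own retry logic.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class Throttled(Exception):
    """Hypothetical stand-in for a 429 that carries a Retry-After value."""

    def __init__(self, retry_after: Optional[float] = None) -> None:
        super().__init__("rate limited")
        self.retry_after = retry_after


def call_with_retry(
    fn: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, waiting between attempts whenever it raises Throttled."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except Throttled as exc:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the error to the caller
            # Prefer the server's hint; otherwise back off exponentially.
            if exc.retry_after is not None:
                delay = exc.retry_after
            else:
                delay = base_delay * 2 ** attempt
            # A little jitter keeps many clients from retrying in lockstep.
            sleep(delay + random.uniform(0, 0.1 * delay))
    raise AssertionError("unreachable")


# Usage: a fake call that is throttled twice, then succeeds.
waits: list = []
outcomes = iter([Throttled(retry_after=2.0), Throttled(), "ok"])


def flaky_call() -> str:
    outcome = next(outcomes)
    if isinstance(outcome, Exception):
        raise outcome
    return outcome


result = call_with_retry(flaky_call, sleep=waits.append)
print(result, [round(w, 2) for w in waits])
```

Injecting `sleep` keeps the example fast to run. A real caller would leave the default `time.sleep` in place.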

Graceful Degradation Strategies

When rate limits are hit, there are three responses - each appropriate for different scenarios and feature types.

Hard reject (429): the correct choice for real-time user-facing features. An immediate 429 with a Retry-After header tells the client exactly when to retry. Well-designed clients respect the header and retry automatically after the window. The user waits a few seconds and retries; the gateway has protected the shared quota.

Request queuing: the correct choice for batch pipelines where a short wait is acceptable. Place the request in a Redis-backed priority queue. A background dequeue worker monitors bucket state and dispatches requests when capacity becomes available. The caller either polls for the result or receives a webhook notification.

Model downgrade: route to a cheaper, faster model (e.g., from claude-sonnet-4-6 to claude-haiku-4-5-20251001) when the premium model's quota is exhausted. Users get a response - potentially lower quality - rather than a failure. Requires pre-validation that the fallback model's quality is acceptable for the specific use case.
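The downgrade decision can be sketched as a walk down a preference list. The per-model quota dictionary and the `route_with_downgrade` helper below are assumptions for illustration; in a gateway, the quota check would more likely be a per-model token bucket.

```python
from typing import Optional, Sequence


def route_with_downgrade(
    preference: Sequence[str],
    remaining: dict,
    estimated_tokens: int,
) -> Optional[str]:
    """Return the first model whose quota covers the request, debiting it."""
    for model in preference:
        if remaining.get(model, 0) >= estimated_tokens:
            remaining[model] -= estimated_tokens
            return model
    return None  # Every tier is exhausted: fall back to a 429 or a queue


# Most preferred model first; quotas are illustrative token budgets.
PREFERENCE = ["claude-sonnet-4-6", "claude-haiku-4-5-20251001"]
quota = {"claude-sonnet-4-6": 1_500, "claude-haiku-4-5-20251001": 2_500}

choices = [route_with_downgrade(PREFERENCE, quota, 1_000) for _ in range(5)]
for i, choice in enumerate(choices, start=1):
    print(f"request {i}: {choice or 'rejected'}")
```

The first request uses the premium model and the next two fall through to the cheaper one. Once both budgets are spent, the function returns `None`, so the caller can reject or queue the request.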

Per-Feature Priority Queuing

Some architectures require priority rather than strict per-feature limits. Real-time user requests should always be served before batch pipeline requests, even when they consume the same shared provider quota.

import heapq
import threading
import json
import uuid
from dataclasses import dataclass, field


@dataclass(order=True)
class QueuedRequest:
    """A rate-limited request waiting in the priority queue."""
    priority: int          # Lower value = higher priority (0 = real-time, 10 = batch)
    timestamp: float       # For FIFO ordering within the same priority level
    request_id: str = field(compare=False)
    user_id: str = field(compare=False)
    messages: list[dict] = field(compare=False)
    max_tokens: int = field(compare=False)
    result_key: str = field(compare=False)   # Redis key where result is stored when complete


class PriorityQueuedLLMService:
    """
    LLM service with priority-based request queuing.

    Real-time user requests (priority=0) are always dispatched before
    batch pipeline requests (priority=10), even when provider capacity
    is constrained.

    Results are stored in Redis so callers can poll or be notified via
    a Redis keyspace notification.
    """

    def __init__(
        self,
        rate_limiter: TokenBucketRateLimiter,
        redis_url: str = "redis://localhost:6379",
        result_ttl_s: int = 300,   # 5 minutes for result to be claimed
    ):
        self.limiter = rate_limiter
        self.redis = redis.from_url(redis_url)
        self.result_ttl = result_ttl_s
        self._queue: list[QueuedRequest] = []
        self._queue_lock = threading.Lock()
        self._worker = threading.Thread(
            target=self._drain_queue, daemon=True, name="llm-queue-worker"
        )
        self._worker.start()

    def submit_real_time(
        self, user_id: str, messages: list[dict], max_tokens: int
    ) -> str:
        """Submit a real-time user request (priority 0 - served first)."""
        request_id = str(uuid.uuid4())[:8]
        return self._enqueue(request_id, user_id, messages, max_tokens, priority=0)

    def submit_batch(
        self, user_id: str, messages: list[dict], max_tokens: int
    ) -> str:
        """Submit a batch request (priority 10 - served when real-time queue is clear)."""
        request_id = str(uuid.uuid4())[:8]
        return self._enqueue(request_id, user_id, messages, max_tokens, priority=10)

    def _enqueue(
        self,
        request_id: str,
        user_id: str,
        messages: list[dict],
        max_tokens: int,
        priority: int,
    ) -> str:
        """Add a request to the priority queue. Returns the result key for polling."""
        result_key = f"llm_result:{request_id}"
        req = QueuedRequest(
            priority=priority,
            timestamp=time.time(),
            request_id=request_id,
            user_id=user_id,
            messages=messages,
            max_tokens=max_tokens,
            result_key=result_key,
        )
        with self._queue_lock:
            heapq.heappush(self._queue, req)
        return result_key

    def queue_depth(self) -> dict[str, int]:
        """Return the number of queued requests per priority level."""
        with self._queue_lock:
            real_time = sum(1 for r in self._queue if r.priority == 0)
            batch = sum(1 for r in self._queue if r.priority == 10)
        return {"real_time": real_time, "batch": batch, "total": real_time + batch}

    def poll_result(self, result_key: str) -> Optional[dict]:
        """Poll for a completed request's result. Returns None if not ready yet."""
        raw = self.redis.get(result_key)
        if raw:
            return json.loads(raw)
        return None

    def _drain_queue(self) -> None:
        """
        Background worker: drain the priority queue, highest priority first
        (lowest priority value). It does not consult self.limiter before
        dispatch; it relies on the provider's 429 to detect exhausted capacity,
        then re-queues the request and backs off.
        """
        client = anthropic.Anthropic()

        while True:
            req = None
            with self._queue_lock:
                if self._queue:
                    req = heapq.heappop(self._queue)

            if req is None:
                time.sleep(0.05)   # Empty queue - poll interval
                continue

            queue_wait_ms = round((time.time() - req.timestamp) * 1000)

            try:
                response = client.messages.create(
                    model="claude-haiku-4-5-20251001",
                    max_tokens=req.max_tokens,
                    messages=req.messages,
                )
                result = {
                    "response": response.content[0].text,
                    "model": response.model,
                    "input_tokens": response.usage.input_tokens,
                    "output_tokens": response.usage.output_tokens,
                    "priority": req.priority,
                    "queue_wait_ms": queue_wait_ms,
                    "request_id": req.request_id,
                }
                self.redis.setex(
                    req.result_key,
                    self.result_ttl,
                    json.dumps(result),
                )

            except anthropic.RateLimitError:
                # Provider rate limit hit - re-queue the request and back off
                with self._queue_lock:
                    heapq.heappush(self._queue, req)
                time.sleep(5.0)   # Wait 5 seconds before retrying

            except Exception as e:
                error_result = {
                    "error": str(e),
                    "request_id": req.request_id,
                    "queue_wait_ms": queue_wait_ms,
                }
                self.redis.setex(
                    req.result_key,
                    self.result_ttl,
                    json.dumps(error_result),
                )


def demo_priority_queue() -> None:
    """Demonstrate that real-time requests are served before batch requests."""
    limiter = TokenBucketRateLimiter()
    service = PriorityQueuedLLMService(limiter)

    # Submit a mix of batch and real-time requests
    result_keys = []

    # Submit 3 batch requests first
    for i in range(3):
        key = service.submit_batch(
            user_id="batch-job",
            messages=[{"role": "user", "content": f"Batch task {i}: summarize topic {i}"}],
            max_tokens=50,
        )
        result_keys.append(("batch", key))
        print(f"Submitted batch request {i} → result key: {key}")

    # Then submit 2 real-time requests - these should be served first
    for i in range(2):
        key = service.submit_real_time(
            user_id="user_9911",
            messages=[{"role": "user", "content": "What is 2+2?"}],
            max_tokens=20,
        )
        result_keys.append(("real-time", key))
        print(f"Submitted real-time request {i} → result key: {key}")

    print(f"\nQueue depth: {service.queue_depth()}")

    # Poll for results (time is already imported at module level)
    time.sleep(10)   # Give the worker time to process

    for req_type, key in result_keys:
        result = service.poll_result(key)
        if result:
            print(f"[{req_type}] Queue wait {result.get('queue_wait_ms', '?')}ms: "
                  f"{result.get('response', result.get('error', '?'))[:50]}")
        else:
            print(f"[{req_type}] Still pending: {key}")
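
The ordering the demo relies on comes from the heap itself. The following is a minimal standalone sketch, assuming requests compare by priority first and arrival order second. `QueuedRequest`, `REAL_TIME`, and `BATCH` are illustrative names, not the lesson's actual classes:

```python
import heapq
import itertools
from dataclasses import dataclass, field

REAL_TIME = 0   # assumed convention: lower value is served first
BATCH = 10

_seq = itertools.count()  # monotonically increasing tie-breaker (arrival order)


@dataclass(order=True)
class QueuedRequest:
    priority: int
    seq: int = field(default_factory=lambda: next(_seq))
    label: str = field(default="", compare=False)  # excluded from ordering


def drain_order(requests):
    """Return labels in the order a heap-based worker would serve them."""
    heap = []
    for req in requests:
        heapq.heappush(heap, req)
    order = []
    while heap:
        order.append(heapq.heappop(heap).label)
    return order


submitted = [
    QueuedRequest(BATCH, label="batch-0"),
    QueuedRequest(BATCH, label="batch-1"),
    QueuedRequest(REAL_TIME, label="rt-0"),
    QueuedRequest(REAL_TIME, label="rt-1"),
]
print(drain_order(submitted))
# ['rt-0', 'rt-1', 'batch-0', 'batch-1']
```

The guarantee only covers requests that are waiting in the queue at the same time. If the background worker is already running, it may dequeue the first batch request before any real-time request has been submitted.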

Rate Limiting Strategy Selection

Scenario	Recommended Strategy	Key Config
Real-time user chat	Token bucket, reject on limit	Capacity: 150k, refill: 2500/s
Batch document pipeline	Token bucket, queue on limit	Capacity: 100k, refill: 500/s
Free tier users	Token bucket, small bucket	Capacity: 30k, refill: 500/s
Enterprise users	Token bucket + custom override	Per-user config in limiter
RPM enforcement	Sliding window counter	Window: 60s, max: 500 RPM
Both RPM and TPM	Token bucket (TPM) + sliding window (RPM)	Check both; reject if either fails

Production Engineering Notes

:::tip Rate limit by token count, not just request count, for LLM APIs A single LLM request can consume anywhere from 50 to 100,000+ tokens. A request-count limit of "100 requests per minute" allows one request that consumes your entire daily provider budget. Always rate limit by token count - or combine request count and token count as parallel checks. The token bucket naturally supports variable consumption per request: large requests consume more tokens and drain the bucket faster than small ones. :::

:::warning Always include Retry-After in 429 responses When rejecting a request due to a rate limit, always return a Retry-After header with the wait time in seconds. Without it, clients have no signal for how long to wait, and many retry immediately, turning one rate limit violation into a flood of rejected requests that further strain the gateway. The Lua script calculates the exact wait time as deficit / refill_rate. Return this value in both the Retry-After HTTP header and in the response body for callers that don't inspect headers. :::
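
The deficit / refill_rate calculation can be sketched in plain Python as follows. This is an illustration under assumptions, not the lesson's Lua script: the function names and the response body fields are hypothetical, and the wait is rounded up because Retry-After takes whole seconds.

```python
import math


def retry_after_seconds(tokens_needed: float, tokens_available: float,
                        refill_rate: float) -> int:
    """Seconds until the bucket holds enough tokens, rounded up to a whole second."""
    deficit = max(0.0, tokens_needed - tokens_available)
    if deficit == 0:
        return 0
    if refill_rate <= 0:
        raise ValueError("refill_rate must be positive")
    return math.ceil(deficit / refill_rate)


def build_429(tokens_needed: float, tokens_available: float, refill_rate: float):
    """Return (status, headers, body) for a rate-limited request."""
    wait = retry_after_seconds(tokens_needed, tokens_available, refill_rate)
    headers = {"Retry-After": str(wait)}
    # Body field names are illustrative; they mirror the header for
    # callers that do not inspect headers.
    body = {"error": "rate_limit_exceeded", "retry_after_seconds": wait}
    return 429, headers, body


# 4,000 tokens requested, 1,500 available, refilling at 500 tokens/second:
# the deficit of 2,500 tokens takes 5 seconds to refill.
print(build_429(4_000, 1_500, 500.0))
```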

:::danger Rate limiters without atomic operations will over-serve under concurrent load If the check ("do we have tokens?") and consume ("deduct the tokens") are separate Redis operations, concurrent requests can both pass the check before either has consumed. This allows burst overages proportional to concurrency level. The Lua script solution executes both as a single atomic Redis command - Redis serializes all Lua script executions, making race conditions impossible. Test your rate limiter under concurrent load (50+ simultaneous requests) before deploying; a race-condition bug will not appear in sequential testing. :::

:::info Test rate limiting under concurrent load before launch A rate limiter that works correctly in sequential testing may have race conditions under concurrent load. Add a load test to your deployment checklist: send 100 concurrent requests in a burst and verify that the number of allowed requests matches your bucket configuration. The count should be deterministic, not variable from run to run because of race conditions. Starting from a full bucket, with equal-sized requests and a burst short enough that refill is negligible, expect exactly floor(bucket_capacity / tokens_per_request) allowed requests. :::
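
A burst test of this kind can be sketched without Redis. The example below is a stand-in under assumptions: `LockedBucket` is a hypothetical in-memory bucket whose lock plays the role of the atomic Lua script, and refill is disabled so the expected count is exact.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class LockedBucket:
    """In-memory stand-in for the Redis bucket (no refill, for a deterministic test)."""

    def __init__(self, capacity: int):
        self.tokens = capacity
        self._lock = threading.Lock()

    def try_consume(self, amount: int) -> bool:
        with self._lock:  # check and consume as one atomic step
            if self.tokens >= amount:
                self.tokens -= amount
                return True
            return False


def burst(bucket: LockedBucket, n_requests: int, tokens_per_request: int) -> int:
    """Fire n_requests concurrently and return how many were allowed."""
    with ThreadPoolExecutor(max_workers=n_requests) as pool:
        results = list(pool.map(
            lambda _: bucket.try_consume(tokens_per_request),
            range(n_requests),
        ))
    return sum(results)


capacity, per_request = 10_000, 300
allowed = burst(LockedBucket(capacity), n_requests=100, tokens_per_request=per_request)
print(allowed, capacity // per_request)
# 33 33
```

Against a real deployment, `try_consume` would be replaced by a call to the gateway's limiter, and the assertion would need a small tolerance for tokens refilled during the burst.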

Common Mistakes

Mistake 1: Limiting only by request count, not token count. An enterprise customer discovers they can send very large requests (100k tokens each) without hitting the request-count limit. Five requests per minute is under the RPM limit, but the resulting 500,000 tokens per minute is 2.5x a 200,000 TPM provider limit. Always enforce a TPM limit in addition to or instead of an RPM limit for LLM workloads.

Mistake 2: Not separating real-time and batch traffic quotas. When both real-time and batch traffic share the same token bucket, a heavy batch job can exhaust the shared budget and block real-time requests. Maintain separate buckets (and separate provider key pools if needed) for batch vs real-time traffic. The batch pipeline should have a deliberately lower refill rate to prevent it from crowding out real-time requests.

Mistake 3: Not setting a Retry-After header on 429 responses. Without Retry-After, clients will retry immediately after receiving a 429. If there are 100 concurrent clients all receiving 429s, they all retry simultaneously, creating a thundering herd that produces 100x the 429 rate - a self-reinforcing feedback loop. The Lua script computes the exact wait time; include it in every 429 response.

Mistake 4: Using fixed estimated token consumption for all pre-flight checks. If you use a fixed estimate (e.g., always consume 1,000 tokens from the bucket regardless of request size), large requests are under-charged (a 5,000-token request is only charged 1,000), so real provider usage runs far ahead of what the bucket reports and the provider quota can be exhausted while the bucket still looks healthy. Small requests are over-charged in the same way. Use a real estimate: characters / 4 + max_tokens is accurate enough for pre-flight purposes. The cost of a slightly wrong estimate is minor compared to a wildly wrong one.
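
The characters / 4 + max_tokens heuristic from Mistake 4 might look like this. It is a rough sketch: the function name is hypothetical, and it assumes each message's `content` is a plain string rather than a list of content blocks.

```python
def estimate_tokens(messages: list[dict], max_tokens: int,
                    chars_per_token: float = 4.0) -> int:
    """Pre-flight estimate: prompt characters / 4, plus the output ceiling."""
    prompt_chars = sum(len(m.get("content", "")) for m in messages)
    return int(prompt_chars / chars_per_token) + max_tokens


messages = [{"role": "user", "content": "x" * 8_000}]
print(estimate_tokens(messages, max_tokens=1_000))
# 3000  (8,000 chars / 4 = 2,000 prompt tokens, plus 1,000 output tokens)
```

A fixed charge of 1,000 tokens would have under-counted this request by a factor of three. Because `max_tokens` is a ceiling rather than the real output length, the estimate leans high; a gateway can reconcile against the usage reported in the response afterwards.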

Mistake 5: Rate limiting in application code rather than the gateway. If rate limiting is implemented in each application service separately, it is inconsistently applied: some services implement it, some don't; the configuration diverges; an oversight in one service exposes your provider quota. Rate limiting belongs in the gateway layer where it is applied uniformly to every request, regardless of which service made the call.

Interview Q&A

Q: What is the difference between a token bucket and a sliding window rate limiter?

A token bucket holds tokens that refill at a constant rate. Each request consumes some tokens (variable amount per request); when the bucket is empty, the request is rejected. It allows short bursts up to bucket capacity while enforcing a long-term average equal to the refill rate. A sliding window tracks all request timestamps within the last N seconds. Each new request counts prior requests in the window; if the count exceeds the limit, the request is rejected. It enforces precise "N events per window" semantics with no burst allowance. For LLM APIs where request sizes vary dramatically (50 to 100,000+ tokens), the token bucket is more appropriate because you can consume variable amounts per request based on actual token cost.

Q: Why must the check-and-consume operation be atomic in a distributed system?

Without atomicity, a race condition occurs: two concurrent requests can both pass the "enough tokens?" check before either has consumed tokens. Both proceed, collectively consuming twice the allowed amount. This is the classic check-then-act race condition. The standard Redis solution is a Lua script - Redis guarantees that Lua scripts execute atomically; no other command runs between the script's read and write operations. The Lua script reads the current token count, computes the refill from elapsed time, checks whether the request can be served, consumes tokens if allowed, writes back the updated state, and returns the result - all as one indivisible operation. This ensures correctness under any level of concurrency.

Q: How would you implement a rate limiter that enforces both RPM and TPM limits simultaneously?

Use two separate checks in sequence: one token bucket for TPM and one sliding window for RPM. Both checks must pass for the request to proceed. For the RPM check: each request consumes 1 unit from the RPM sliding window (window=60s, max=rpm_limit). For the TPM check: each request consumes its estimated token count from the TPM token bucket (capacity=tpm_limit, refill=tpm_limit/60 per second). If either check fails, reject with 429. This mirrors the behavior of most LLM providers, which enforce RPM and TPM as independent limits. A request can be within RPM but exceed TPM (one massive request), or within TPM but exceed RPM (many tiny requests).
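
The two parallel checks can be sketched in memory as follows. This is an illustration under assumptions rather than the lesson's Redis implementation: `DualLimiter` is a hypothetical class, it is not thread-safe, and the clock is injectable so the behaviour can be shown deterministically. Both limits are checked before either is charged, so a request rejected for TPM does not use up an RPM slot.

```python
import time
from collections import deque


class DualLimiter:
    """Sliding-window RPM check plus token-bucket TPM check (single-process sketch)."""

    def __init__(self, rpm_limit: int, tpm_limit: int, clock=time.monotonic):
        self.rpm_limit = rpm_limit
        self.capacity = float(tpm_limit)
        self.refill_rate = tpm_limit / 60.0   # tokens per second
        self.tokens = float(tpm_limit)
        self.clock = clock
        self.last_refill = clock()
        self.request_times = deque()          # timestamps inside the last 60 s

    def allow(self, estimated_tokens: int) -> tuple[bool, str]:
        now = self.clock()
        # Refill the TPM bucket for the elapsed time, capped at capacity.
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        # Evict request timestamps that have left the 60-second window.
        while self.request_times and self.request_times[0] <= now - 60.0:
            self.request_times.popleft()

        if len(self.request_times) >= self.rpm_limit:
            return False, "rpm"
        if estimated_tokens > self.tokens:
            return False, "tpm"

        # Both checks passed: charge both limiters together.
        self.request_times.append(now)
        self.tokens -= estimated_tokens
        return True, "ok"


fake_now = [0.0]
limiter = DualLimiter(rpm_limit=3, tpm_limit=6_000, clock=lambda: fake_now[0])
print(limiter.allow(5_000))  # (True, 'ok')
print(limiter.allow(2_000))  # (False, 'tpm'): only 1,000 tokens left
print(limiter.allow(100))    # (True, 'ok')
print(limiter.allow(100))    # (True, 'ok')
print(limiter.allow(100))    # (False, 'rpm'): 3 requests already in the window
```

In a distributed gateway, both checks would need to run atomically against shared state, for example inside one Lua script.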

Q: How do you handle a user who legitimately needs higher limits than the defaults?

Implement tiered rate limit configuration keyed by tier: user:free, user:pro, user:enterprise. Extract the user's tier from the authentication token (JWT claim or API key metadata). For enterprise customers who need limits above standard tiers, store per-user overrides in the config dictionary keyed by user:{user_id} - these take precedence over tier defaults in the lookup order. Provide an admin API endpoint (internal, authenticated) that the support team can use to set custom limits without a deployment. The set_custom_limit() method on the limiter handles this. Store custom overrides in the database so they survive service restarts.
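
The lookup order described above (per-user override, then tier default) can be sketched like this. The `Limit` dataclass and `resolve_limit` function are hypothetical names used for illustration. Overrides are kept in a separate dictionary here so that a user ID can never collide with a tier name; the limiter in this lesson may store them differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limit:
    bucket_capacity: int
    refill_rate: float  # tokens per second


# Tier defaults, using the same values as the configuration reference.
TIER_DEFAULTS = {
    "user:free": Limit(30_000, 500.0),
    "user:pro": Limit(150_000, 2_500.0),
    "user:enterprise": Limit(500_000, 8_333.0),
}


def resolve_limit(user_id: str, tier: str, overrides: dict[str, Limit]) -> Limit:
    """Per-user override first, then the tier default, then the free tier."""
    return (
        overrides.get(f"user:{user_id}")
        or TIER_DEFAULTS.get(f"user:{tier}")
        or TIER_DEFAULTS["user:free"]
    )


overrides = {"user:acme-42": Limit(2_000_000, 33_333.0)}
print(resolve_limit("acme-42", "enterprise", overrides).bucket_capacity)  # 2000000
print(resolve_limit("user_9911", "pro", overrides).bucket_capacity)       # 150000
```

Falling back to the free tier for an unknown tier name is a conservative assumption: a missing or malformed claim then yields the smallest limit rather than an unlimited one.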

Q: A user reports they are being rate-limited but claims to be sending fewer requests than their limit. How do you diagnose?

Query the Redis bucket state for that user's key: inspect current_tokens, capacity, and last_refill. If the bucket is depleted, the issue is real - the user is consuming more tokens than they believe. Common causes: (1) multiple sessions sharing one bucket - each session thinks it is within limit, but all sessions share the same per-user bucket; (2) requests are larger than expected due to conversation history or system prompt overhead that is not visible to the user; (3) the token estimate used for pre-flight checks is conservative (estimated > actual), so the bucket is charged more tokens than the usage the user sees reported in responses; (4) a bug in user_id extraction is attributing another user's tokens to them - check the attribution logic for edge cases in authentication.

Q: What does graceful degradation mean for rate limiting in an LLM application?

Rather than returning a hard 429 when a rate limit is hit, graceful degradation provides a degraded-but-functional response. Three implementations: (1) Model downgrade - if the claude-sonnet-4-6 quota is exhausted, route to claude-haiku-4-5-20251001 instead. The response may be less detailed but the feature works. Requires pre-validation that the fallback model's quality is acceptable for the use case. (2) Request queuing - place the request in a Redis-backed priority queue and tell the user "your request is queued, estimated wait 10 seconds." A background worker drains the queue when capacity becomes available. Suitable when a short wait is acceptable. (3) Cached response degradation - if the user is rate-limited and a semantically similar cached response exists (even below the normal similarity threshold), return the cached response rather than failing. This requires coordination between the rate limiter and semantic cache layers.
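
Option (1), model downgrade, reduces to trying an ordered list of models. The sketch below assumes a caller-supplied `send(model)` function and a hypothetical `QuotaExhausted` exception; in a real gateway that role would be played by the provider SDK's rate limit error or by the gateway's own limiter.

```python
class QuotaExhausted(Exception):
    """Raised by the (hypothetical) send function when a model's quota is spent."""


def call_with_fallback(models, send):
    """Try each model in order; return (model_used, response, degraded)."""
    last_error = None
    for index, model in enumerate(models):
        try:
            # degraded is True whenever the preferred (first) model was skipped
            return model, send(model), index > 0
        except QuotaExhausted as exc:
            last_error = exc  # fall through to the next model in the list
    raise QuotaExhausted("all fallback models are rate-limited") from last_error


def fake_send(model):
    # Stand-in for a provider call: the preferred model is out of quota.
    if model == "claude-sonnet-4-6":
        raise QuotaExhausted(model)
    return f"response from {model}"


chain = ["claude-sonnet-4-6", "claude-haiku-4-5-20251001"]
print(call_with_fallback(chain, fake_send))
# ('claude-haiku-4-5-20251001', 'response from claude-haiku-4-5-20251001', True)
```

Returning the `degraded` flag lets the caller tell the user, or log, that a fallback model produced the answer.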

Q: How would you design rate limiting for a multi-tenant SaaS product where enterprise customers need guaranteed throughput regardless of what other tenants are doing?

Implement capacity isolation by customer tier. Each enterprise customer gets their own dedicated token bucket with a capacity and refill rate calculated from their contracted SLA. Standard customers share a pooled bucket with a per-tenant subcap. Use separate provider API keys for enterprise vs standard tiers where possible, so that enterprise traffic is never throttled by standard customer activity at the provider. At the gateway layer, route by tenant type: enterprise requests are checked only against that customer's dedicated bucket (fast path - likely not exhausted); standard requests are checked against the pooled bucket with the per-tenant subcap applied. This gives enterprise customers hard isolation guarantees that standard customer activity cannot violate, regardless of what is happening on the shared pool.
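
The routing between dedicated and pooled buckets can be sketched as below. This is a simplified illustration: `TenantRouter` is a hypothetical class, refill is omitted, and state is held in memory rather than Redis.

```python
class TenantRouter:
    """Dedicated buckets for enterprise tenants; a shared pool with a subcap for the rest."""

    def __init__(self, pool_tokens: int, per_tenant_subcap: int, dedicated: dict[str, int]):
        self.pool_tokens = pool_tokens        # tokens left in the shared pool
        self.subcap = per_tenant_subcap       # max tokens one standard tenant may draw
        self.pool_used: dict[str, int] = {}   # tenant_id -> tokens drawn from the pool
        self.dedicated = dict(dedicated)      # tenant_id -> tokens left in own bucket

    def allow(self, tenant_id: str, tokens: int) -> bool:
        if tenant_id in self.dedicated:
            # Enterprise path: only the tenant's own bucket is consulted.
            if self.dedicated[tenant_id] >= tokens:
                self.dedicated[tenant_id] -= tokens
                return True
            return False
        # Standard path: must fit both the tenant's subcap and the shared pool.
        used = self.pool_used.get(tenant_id, 0)
        if used + tokens > self.subcap or tokens > self.pool_tokens:
            return False
        self.pool_used[tenant_id] = used + tokens
        self.pool_tokens -= tokens
        return True


router = TenantRouter(pool_tokens=10_000, per_tenant_subcap=4_000,
                      dedicated={"acme": 50_000})
print(router.allow("t1", 4_000))     # True
print(router.allow("t1", 1))         # False: t1 has reached its subcap
print(router.allow("t2", 4_000))     # True
print(router.allow("t3", 4_000))     # False: only 2,000 tokens left in the pool
print(router.allow("acme", 20_000))  # True: unaffected by the drained pool
```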

Rate Limiting Configuration Reference

A production rate limiter configuration covers four dimensions with different limits per tier. This is a reference YAML structure for externalizing the configuration:

# rate_limits.yaml - loaded at gateway startup
# bucket_capacity in tokens, refill_rate in tokens/second, rpm_limit in requests/minute.

user_tiers:
  free:
    bucket_capacity: 30_000      # 30k token burst allowance
    refill_rate: 500             # 500 tok/s = 30k TPM sustained
    rpm_limit: 20                # 20 requests per minute
  pro:
    bucket_capacity: 150_000
    refill_rate: 2_500           # 150k TPM sustained
    rpm_limit: 100
  enterprise:
    bucket_capacity: 500_000
    refill_rate: 8_333           # ~500k TPM sustained
    rpm_limit: 500

teams:
  default:
    bucket_capacity: 500_000
    refill_rate: 8_333
    rpm_limit: 500
  # Per-team overrides (support team can add these without deployment)
  # team_id_here:
  #   bucket_capacity: 2_000_000
  #   refill_rate: 33_333

features:
  real-time-chat:
    bucket_capacity: 200_000
    refill_rate: 10_000
    rpm_limit: 600
  batch-pipeline:
    bucket_capacity: 100_000
    refill_rate: 500             # Intentionally throttled - batch waits
    rpm_limit: 30
  docs-assistant:
    bucket_capacity: 100_000
    refill_rate: 2_000
    rpm_limit: 120
  code-generation:
    bucket_capacity: 150_000
    refill_rate: 3_000
    rpm_limit: 100

global:
  bucket_capacity: 5_000_000    # Account-level safety net
  refill_rate: 83_333           # ~5M TPM account ceiling

import redis
import yaml


def load_rate_limit_config(config_path: str) -> TokenBucketRateLimiter:
    """
    Load rate limit configuration from YAML and initialize the limiter.

    Only bucket_capacity and refill_rate are read here; rpm_limit values and
    the global section are not consumed by this loader.
    """
    with open(config_path) as f:
        config = yaml.safe_load(f)

    # Bypass __init__ and populate the instance by hand, so every attribute
    # the limiter relies on must be set explicitly below.
    limiter = TokenBucketRateLimiter.__new__(TokenBucketRateLimiter)
    limiter.redis = redis.from_url("redis://localhost:6379")
    limiter.configs = {}

    # Each YAML section maps to the key prefix used in limiter.configs.
    sections = {"user_tiers": "user", "teams": "team", "features": "feature"}
    for section, prefix in sections.items():
        for name, values in config.get(section, {}).items():
            limiter.configs[f"{prefix}:{name}"] = RateLimitConfig(
                bucket_capacity=values["bucket_capacity"],
                refill_rate=float(values["refill_rate"]),
            )

    limiter._script = limiter.redis.register_script(TokenBucketRateLimiter._LUA_SCRIPT)
    return limiter

Rate Limiting in the Full Gateway Stack

Rate limiting does not operate in isolation - it is one layer in the gateway stack, applied in this order for each incoming request: rate limit check, budget check, semantic cache lookup, LLM call, cost recording.

The ordering matters: rate limiting runs before budget checks, which run before the semantic cache lookup, which runs before the LLM call. This ensures: (1) abusive or misconfigured callers are stopped at the cheapest layer (no Redis cost event writes, no cache lookups, no LLM tokens consumed); (2) cache hits skip both the LLM call and the cost recording step; (3) cost is only recorded for actual LLM calls, not for blocked or cached requests.
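
The layer ordering can be sketched as a single function that stops at the first layer able to answer. Everything here is illustrative: `handle_request` and its parameters are hypothetical, the layers are plain callables, and the cache is an exact-match dictionary standing in for a semantic cache.

```python
def handle_request(prompt, *, rate_limiter, budget, cache, llm, cost_log):
    """Run one request through the layers in order, stopping at the first that answers."""
    if not rate_limiter(prompt):
        return {"ok": False, "stopped_at": "rate_limit"}
    if not budget(prompt):
        return {"ok": False, "stopped_at": "budget"}
    if prompt in cache:                  # exact-match stand-in for a semantic cache
        return {"ok": True, "stopped_at": "cache", "text": cache[prompt]}
    text, tokens_used = llm(prompt)      # the only step that spends provider tokens
    cost_log.append(tokens_used)         # cost is recorded only for real LLM calls
    cache[prompt] = text
    return {"ok": True, "stopped_at": "llm", "text": text}


llm_calls = []


def fake_llm(prompt):
    llm_calls.append(prompt)
    return f"answer to {prompt}", 42


cache, cost_log = {}, []
layers = dict(rate_limiter=lambda p: True, budget=lambda p: True,
              cache=cache, llm=fake_llm, cost_log=cost_log)

first = handle_request("hi", **layers)
second = handle_request("hi", **layers)
blocked = handle_request("hi", **{**layers, "rate_limiter": lambda p: False})
print(first["stopped_at"], second["stopped_at"], blocked["stopped_at"], cost_log)
# llm cache rate_limit [42]
```

Only the first request reaches the model, so one cost entry is written; the cache hit and the rate-limited request leave `cost_log` untouched.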
