Rate Limiting and Cost Control
The $47,000 Surprise
On a Tuesday morning in Q3, the founder of a YC startup received a Slack message from their CFO: "Did we authorize a $47,000 AWS charge last month?" They had not. The charge came entirely from one feature: a document summarization tool they had shipped quietly to beta users two weeks earlier.
The root cause was brutally simple. They had launched an endpoint that accepted arbitrary documents and summarized them using a 70B model. They had set no limit on document size. A user - almost certainly automated - discovered the endpoint and started uploading 100,000-token documents at 50 requests per minute for 72 hours straight. The LLM processed every single one. There were no rate limits, no token quotas, no budget alerts. The first signal was the AWS bill.
The engineering post-mortem revealed something more disturbing than just the abuse vector. Even without the runaway user, the unit economics were broken: the product charged $15/month per user, and the projected inference cost per user already exceeded that. No business model survives those unit economics.
This story plays out constantly in the LLM space, across teams of every size. The combination of variable cost per request (driven by token count), the ability to pass arbitrary amounts of text to the API, and the absence of built-in safeguards makes LLM serving economically explosive in a way that traditional API serving is not. A traditional API endpoint has roughly constant cost per request. An LLM endpoint's cost varies by 100x depending on input length. That is a fundamentally different economic model that requires fundamentally different controls.
Rate limiting for LLM services has three goals operating simultaneously. The first is abuse prevention - stopping automated clients from burning through compute resources maliciously or accidentally. The second is fairness - ensuring that one heavy user does not degrade service for everyone else, particularly in multi-tenant deployments. The third is unit economics - ensuring that what you charge per request exceeds what you spend to fulfill it, with enough headroom to be a viable business.
This lesson builds a complete rate limiting and cost control stack from the algorithm level (token bucket, sliding window) through the implementation level (Redis-backed FastAPI middleware) to the business intelligence level (cost attribution dashboards, budget alerts, abuse detection). By the end, you will have all the primitives to ship a production-grade LLM API service that cannot be accidentally bankrupted by a single rogue client.
Why This Exists - Fixed Cost vs Variable Cost APIs
Every rate limiting decision in software history was made with fixed-cost requests in mind. An HTTP endpoint has a cost that is roughly constant regardless of what is in the request. The query parameter might be "id=1" or "id=999999" - the database lookup takes the same time either way. This made request-per-second (RPS) rate limiting natural and sufficient.
LLMs destroy this assumption. The cost of serving a request is proportional to:
- Input token count (prefill compute scales quadratically with sequence length in standard attention; vLLM's PagedAttention manages KV cache memory efficiently but does not remove this compute cost)
- Output token count (each token requires one forward pass through the full model)
- The model size (a 70B request costs roughly 8x more than an 8B request)
A rate limit of "100 requests per minute" means almost nothing when one request can be 50 tokens and another can be 50,000. The expensive request consumes 1,000x more GPU compute but counts the same in an RPS limiter.
The industry converged on token-based rate limiting - measuring the actual quantity of compute consumed rather than the count of API calls - because it maps directly to cost. OpenAI, Anthropic, Google, and every major LLM provider now primarily rate-limits on tokens per minute (TPM) and tokens per day (TPD) rather than on request counts.
Traditional rate limiting also assumed that the relationship between a user and a session was ephemeral and stateless. LLM serving adds the KV cache as a persistent shared resource. A user sending very long requests is not just consuming their own quota - they are occupying KV cache blocks that could have served multiple other users. This creates a new class of resource pressure that requires a new approach: per-tenant KV cache quotas, not just per-request token limits.
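There is no off-the-shelf API for per-tenant KV cache quotas, so the sketch below shows one way to approximate the idea at the gateway layer: track the tokens each tenant currently holds in flight (a rough proxy for the KV cache blocks they occupy) and refuse admission above a cap. The class name and cap value are illustrative, not part of vLLM.
# kv_quota.py - illustrative gateway-level sketch, not a vLLM API.
# In-flight tokens per tenant approximate the KV cache blocks they occupy.
import threading
from collections import defaultdict

class KVCacheQuota:
    """Admission control: cap the total tokens a tenant may hold in flight."""

    def __init__(self, per_tenant_token_cap: int):
        self.cap = per_tenant_token_cap
        self.in_flight = defaultdict(int)   # tenant_id -> tokens currently held
        self.lock = threading.Lock()

    def try_admit(self, tenant_id: str, seq_tokens: int) -> bool:
        with self.lock:
            if self.in_flight[tenant_id] + seq_tokens > self.cap:
                return False                # tenant would exceed its KV share
            self.in_flight[tenant_id] += seq_tokens
            return True

    def release(self, tenant_id: str, seq_tokens: int) -> None:
        # Call when the sequence finishes and its KV blocks are freed
        with self.lock:
            self.in_flight[tenant_id] = max(0, self.in_flight[tenant_id] - seq_tokens)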
Historical Context - From Leaky Buckets to Token Economics
Rate limiting as a concept dates to the 1980s telephone network, where the "leaky bucket" algorithm was developed by Jonathan Turner and independently by John Nagle to regulate packet flow in network switches. The algorithm was simple: packets pour into a bucket that drains at a constant rate; when the bucket is full, arriving packets are dropped or queued.
The token bucket variant (which allows bursting up to bucket capacity rather than enforcing strict constant rate) became the dominant algorithm for API rate limiting in the web era. Twitter published one of the first public API rate limiting systems in 2009, and Redis-backed rate limiters became standard practice by 2012-2013 as the "rate limit per API key in Redis" pattern proliferated.
LLM-specific rate limiting emerged as a distinct discipline around 2021-2022 when OpenAI added per-organization token limits to the GPT-3 API after observing that some users were sending requests with 10,000+ token prompts that consumed disproportionate compute. The innovation was treating tokens as the fundamental unit of rate limiting rather than requests.
Multi-tenant LLM cost attribution - tracking which tenant consumed which GPU resources - is even newer, emerging primarily from 2023-2024 as enterprises started deploying self-hosted LLMs on shared infrastructure. The tooling is still maturing, but the core problem is clear: GPU time is expensive and divisible, and you need to know who consumed what.
Core Concepts
The Token Bucket Algorithm
The token bucket is the standard algorithm for rate limiting because it allows bursting (short bursts above the average rate) while enforcing a long-term average. This matches user behavior better than a strict "N per second" leaky bucket approach.
The algorithm has three parameters:
- capacity ($C$) - maximum tokens the bucket can hold (burst limit)
- refill rate ($r$) - tokens added per second (sustained rate)
- current tokens ($T$) - current bucket contents
When a request with cost $c$ arrives:
- Refill: credit tokens earned since the last update, $T \leftarrow \min(C,\; T + r \cdot \Delta t)$, where $\Delta t$ is the time since the last refill
- If $T \geq c$: allow the request and set $T \leftarrow T - c$
- If $T < c$: reject the request (or queue it)
For LLM services, "cost" is the token count of the request (input + estimated output).
Token Bucket State Over Time:
Capacity = 10,000 tokens
Refill rate = 1,000 tokens/minute
t=0: bucket = 10,000 (full)
t=0: request arrives: 3,000 tokens -> allow, bucket = 7,000
t=0: request arrives: 3,000 tokens -> allow, bucket = 4,000
t=0: request arrives: 5,000 tokens -> DENY (only 4,000 available)
t=60: refill -> bucket = 5,000
t=60: request arrives: 5,000 tokens -> allow, bucket = 0
t=120: refill -> bucket = 1,000
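A minimal single-process version of this logic, matching the trace above, might look like the following sketch (the Redis-backed production version appears later in this lesson):
# token_bucket_sketch.py - in-memory token bucket, single process only
import time

class TokenBucket:
    def __init__(self, capacity: float, refill_per_second: float):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = capacity               # bucket starts full
        self.last_refill = time.monotonic()

    def try_consume(self, cost: float) -> bool:
        now = time.monotonic()
        # Credit tokens earned since the last check, capped at capacity
        self.tokens = min(
            self.capacity,
            self.tokens + (now - self.last_refill) * self.refill_per_second,
        )
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost              # allow: deduct the request's cost
            return True
        return False                         # deny: insufficient tokens

bucket = TokenBucket(capacity=10_000, refill_per_second=1_000 / 60)
print(bucket.try_consume(3_000))  # True  (bucket: ~7,000)
print(bucket.try_consume(3_000))  # True  (bucket: ~4,000)
print(bucket.try_consume(5_000))  # False (only ~4,000 available)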
Sliding Window Counters
The token bucket is great for burst control, but you also need daily/weekly caps that the token bucket does not enforce naturally (since the bucket refills automatically and never tracks cumulative consumption).
A sliding window counter tracks total consumption over a rolling time window:
$$\text{usage}(t) = \sum_{t_i > t - W} c_i$$
where $W$ is the window duration (e.g., 24 hours) and $c_i$ is the token consumption of the request at time $t_i$.
In practice, you implement this with two Redis keys: a counter for the current window and a counter for the previous window, with a weighted interpolation at window boundaries:
$$\text{usage} \approx \text{prev} \cdot \frac{W - \text{elapsed}}{W} + \text{curr}$$
where $\text{elapsed}$ is the time since the current window started.
This avoids the thundering herd problem where all per-minute counters reset simultaneously at the start of each minute.
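A sketch of the two-key implementation, assuming a redis-py client (key names and the 60-second window are illustrative):
# sliding_window_sketch.py - weighted two-window counter described above
import time
import redis

def record_usage(r: redis.Redis, user_id: str, tokens: int, window_s: int = 60):
    bucket = int(time.time() // window_s)
    key = f"sw:{user_id}:{bucket}"
    pipe = r.pipeline()
    pipe.incrby(key, tokens)
    pipe.expire(key, window_s * 2)   # keep only current + previous window
    pipe.execute()

def sliding_window_usage(r: redis.Redis, user_id: str, window_s: int = 60) -> float:
    now = time.time()
    curr_bucket = int(now // window_s)
    curr = float(r.get(f"sw:{user_id}:{curr_bucket}") or 0)
    prev = float(r.get(f"sw:{user_id}:{curr_bucket - 1}") or 0)
    # Fraction of the previous window still inside the rolling window
    prev_weight = (window_s - (now - curr_bucket * window_s)) / window_s
    return prev * prev_weight + curr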
Cost Attribution Model
To know what anything costs, you need to model GPU resource consumption. The fundamental unit of LLM serving cost is the GPU-hour, which you can break down to per-request cost.
GPU time consumed by a request is approximately:
$$t_{\text{GPU}} \approx \frac{n_{\text{in}} \cdot F_{\text{prefill}} + n_{\text{out}} \cdot F_{\text{decode}}}{R_{\text{GPU}}}$$
Where:
- $n_{\text{in}}$, $n_{\text{out}}$ - input and output token counts
- $F_{\text{prefill}}$ - FLOPs per prefill token (roughly $2P$ for a model with $P$ parameters)
- $F_{\text{decode}}$ - FLOPs per decode token (same order of magnitude as prefill per token, but very different memory pressure)
- $R_{\text{GPU}}$ - achievable GPU FLOP/s
For practical cost attribution, you do not need the exact formula. Track input tokens, output tokens, and model name per request. Use known cost-per-token numbers (from your cloud provider or measured GPU-hour cost divided by measured throughput) to compute per-request cost.
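As a worked example, using the illustrative 70B cost figures that appear in the code below ($3.00 per million input tokens, $6.00 per million output tokens):
# Worked example: per-request cost at the illustrative 70B rates
input_tokens, output_tokens = 2_000, 500
cost_usd = input_tokens * 3.00e-6 + output_tokens * 6.00e-6
print(f"${cost_usd:.4f} per request")   # $0.0090 -> ~$9.00 per 1,000 such requests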
Code Examples
Redis Token Bucket Implementation
# rate_limiter.py
import time
import redis
import hashlib
from typing import Optional, Tuple
from dataclasses import dataclass
from enum import Enum
class RateLimitResult(Enum):
ALLOWED = "allowed"
DENIED_RATE = "denied_rate"
DENIED_DAILY = "denied_daily"
DENIED_BUDGET = "denied_budget"
@dataclass
class RateLimitConfig:
# Token bucket parameters
tokens_per_minute: int # Refill rate (sustained)
burst_tokens: int # Bucket capacity (burst allowance)
# Daily hard cap
tokens_per_day: int
# Budget cap in USD (optional)
daily_budget_usd: Optional[float] = None
# Request-level limits
max_tokens_per_request: int = 32_000
requests_per_minute: int = 60
# Default tiers
RATE_LIMIT_TIERS = {
"free": RateLimitConfig(
tokens_per_minute=10_000,
burst_tokens=20_000,
tokens_per_day=100_000,
daily_budget_usd=1.00,
max_tokens_per_request=4_096,
requests_per_minute=10,
),
"pro": RateLimitConfig(
tokens_per_minute=100_000,
burst_tokens=200_000,
tokens_per_day=5_000_000,
daily_budget_usd=50.00,
max_tokens_per_request=32_000,
requests_per_minute=100,
),
"enterprise": RateLimitConfig(
tokens_per_minute=1_000_000,
burst_tokens=2_000_000,
tokens_per_day=100_000_000,
daily_budget_usd=None, # No cap - billed directly
max_tokens_per_request=128_000,
requests_per_minute=1000,
),
}
# Cost per token in USD for cost attribution
# These are example internal cost figures - measure your actual GPU cost
MODEL_COST_PER_TOKEN = {
"meta-llama/Llama-3.1-8B-Instruct": {
"input": 0.000_000_50, # $0.50 per million input tokens
"output": 0.000_001_00, # $1.00 per million output tokens
},
"meta-llama/Llama-3.1-70B-Instruct": {
"input": 0.000_003_00, # $3.00 per million input tokens
"output": 0.000_006_00, # $6.00 per million output tokens
},
}
class TokenBucketRateLimiter:
"""
Redis-backed token bucket rate limiter for LLM API services.
Uses Lua scripts for atomic check-and-update operations.
This is critical - without atomicity, concurrent requests can
bypass the rate limit (race condition).
"""
def __init__(self, redis_client: redis.Redis):
self.redis = redis_client
self._register_lua_scripts()
def _register_lua_scripts(self):
"""
Lua script runs atomically on the Redis server.
Prevents race conditions in concurrent rate limit checks.
"""
self.check_and_consume = self.redis.register_script("""
local bucket_key = KEYS[1]
local daily_key = KEYS[2]
local budget_key = KEYS[3]
local cost = tonumber(ARGV[1])
local capacity = tonumber(ARGV[2])
local refill_rate = tonumber(ARGV[3]) -- tokens per second
local daily_limit = tonumber(ARGV[4])
local daily_budget = tonumber(ARGV[5]) -- in microcents (to avoid float)
local request_cost_microcents = tonumber(ARGV[6])
local now = tonumber(ARGV[7])
-- Token bucket logic
local bucket = redis.call('HMGET', bucket_key, 'tokens', 'last_refill')
local tokens = tonumber(bucket[1]) or capacity
local last_refill = tonumber(bucket[2]) or now
-- Calculate tokens earned since last refill
local elapsed = now - last_refill
local earned = elapsed * refill_rate
tokens = math.min(capacity, tokens + earned)
-- Check if request can be served
if tokens < cost then
return {0, 'rate_limit', math.ceil((cost - tokens) / refill_rate)}
end
-- Check daily token limit
local daily_used = tonumber(redis.call('GET', daily_key)) or 0
if daily_limit > 0 and (daily_used + cost) > daily_limit then
return {0, 'daily_limit', 0}
end
-- Check budget limit
if daily_budget > 0 then
local budget_used = tonumber(redis.call('GET', budget_key)) or 0
if (budget_used + request_cost_microcents) > daily_budget then
return {0, 'budget_limit', 0}
end
end
-- Consume tokens - all checks passed
tokens = tokens - cost
redis.call('HMSET', bucket_key, 'tokens', tokens, 'last_refill', now)
redis.call('EXPIRE', bucket_key, 3600)
-- Update daily counters
redis.call('INCRBY', daily_key, cost)
redis.call('EXPIRE', daily_key, 86400)
if daily_budget > 0 then
redis.call('INCRBY', budget_key, request_cost_microcents)
redis.call('EXPIRE', budget_key, 86400)
end
return {1, 'allowed', tokens}
""")
def check_rate_limit(
self,
user_id: str,
tier: str,
input_tokens: int,
max_output_tokens: int,
model: str,
) -> Tuple[RateLimitResult, dict]:
"""
Check if a request is within rate limits and consume quota if so.
We charge for input tokens + estimated output tokens upfront.
This prevents users from bypassing limits by requesting huge outputs.
In practice, refund unused output token quota after generation completes.
"""
config = RATE_LIMIT_TIERS.get(tier, RATE_LIMIT_TIERS["free"])
# Validate per-request token limit
total_estimated = input_tokens + max_output_tokens
if total_estimated > config.max_tokens_per_request:
return RateLimitResult.DENIED_RATE, {
"reason": "Request exceeds per-request token limit",
"limit": config.max_tokens_per_request,
"requested": total_estimated,
}
# Calculate cost in microcents for budget tracking
model_costs = MODEL_COST_PER_TOKEN.get(model, {
"input": 0.000_001,
"output": 0.000_002,
})
request_cost_usd = (
input_tokens * model_costs["input"] +
max_output_tokens * model_costs["output"]
)
request_cost_microcents = int(request_cost_usd * 100 * 1_000_000)
# Redis key namespacing
bucket_key = f"rl:bucket:{user_id}"
daily_key = f"rl:daily:{user_id}:{self._day_key()}"
budget_key = f"rl:budget:{user_id}:{self._day_key()}"
now = time.time()
refill_rate_per_second = config.tokens_per_minute / 60.0
daily_budget_microcents = int(
(config.daily_budget_usd or 0) * 100 * 1_000_000
)
result = self.check_and_consume(
keys=[bucket_key, daily_key, budget_key],
args=[
total_estimated,
config.burst_tokens,
refill_rate_per_second,
config.tokens_per_day,
daily_budget_microcents,
request_cost_microcents,
now,
],
)
allowed, reason, extra = result[0], result[1].decode(), result[2]
if allowed:
return RateLimitResult.ALLOWED, {
"tokens_remaining_in_bucket": extra,
"estimated_cost_usd": request_cost_usd,
}
elif reason == "rate_limit":
retry_after = int(extra)
return RateLimitResult.DENIED_RATE, {
"reason": "Token rate limit exceeded",
"retry_after_seconds": retry_after,
}
elif reason == "daily_limit":
return RateLimitResult.DENIED_DAILY, {
"reason": "Daily token limit exceeded",
"retry_after_seconds": self._seconds_until_midnight(),
}
else:
return RateLimitResult.DENIED_BUDGET, {
"reason": "Daily budget limit exceeded",
"retry_after_seconds": self._seconds_until_midnight(),
}
def refund_unused_tokens(
self,
user_id: str,
estimated_output_tokens: int,
actual_output_tokens: int,
model: str,
):
"""
Refund the difference between estimated and actual output tokens.
Call this after generation completes.
"""
unused = max(0, estimated_output_tokens - actual_output_tokens)
if unused <= 0:
return
model_costs = MODEL_COST_PER_TOKEN.get(model, {"output": 0.000_002})
refund_cost_microcents = int(
unused * model_costs["output"] * 100 * 1_000_000
)
bucket_key = f"rl:bucket:{user_id}"
daily_key = f"rl:daily:{user_id}:{self._day_key()}"
budget_key = f"rl:budget:{user_id}:{self._day_key()}"
        # Refund to bucket and decrement daily counters.
        # Simplified: HINCRBYFLOAT handles the float-valued bucket field, but
        # does not cap at capacity - a full implementation does this in Lua.
        pipe = self.redis.pipeline()
        pipe.hincrbyfloat(bucket_key, "tokens", unused)
pipe.decrby(daily_key, unused)
pipe.decrby(budget_key, refund_cost_microcents)
pipe.execute()
def get_usage_stats(self, user_id: str, tier: str) -> dict:
"""Get current rate limit status for a user."""
config = RATE_LIMIT_TIERS.get(tier, RATE_LIMIT_TIERS["free"])
bucket_key = f"rl:bucket:{user_id}"
daily_key = f"rl:daily:{user_id}:{self._day_key()}"
budget_key = f"rl:budget:{user_id}:{self._day_key()}"
pipe = self.redis.pipeline()
pipe.hmget(bucket_key, "tokens", "last_refill")
pipe.get(daily_key)
pipe.get(budget_key)
results = pipe.execute()
bucket_tokens = float(results[0][0] or config.burst_tokens)
daily_used = int(results[1] or 0)
budget_used_microcents = int(results[2] or 0)
return {
"tier": tier,
"bucket_tokens_available": bucket_tokens,
"bucket_capacity": config.burst_tokens,
"tokens_per_minute": config.tokens_per_minute,
"daily_tokens_used": daily_used,
"daily_tokens_limit": config.tokens_per_day,
"daily_budget_used_usd": budget_used_microcents / 100 / 1_000_000,
"daily_budget_limit_usd": config.daily_budget_usd,
}
def _day_key(self) -> str:
from datetime import datetime, timezone
return datetime.now(timezone.utc).strftime("%Y-%m-%d")
def _seconds_until_midnight(self) -> int:
from datetime import datetime, timezone
now = datetime.now(timezone.utc)
midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
from datetime import timedelta
next_midnight = midnight + timedelta(days=1)
return int((next_midnight - now).total_seconds())
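A minimal usage sketch of the limiter above (connection details and the actual token count reported by the inference server are illustrative):
# usage_example.py - illustrative wiring of the limiter above
import redis
from rate_limiter import TokenBucketRateLimiter, RateLimitResult

limiter = TokenBucketRateLimiter(redis.Redis(host="localhost", port=6379))

result, meta = limiter.check_rate_limit(
    user_id="usr_abc123",
    tier="pro",
    input_tokens=1_200,
    max_output_tokens=1_024,
    model="meta-llama/Llama-3.1-70B-Instruct",
)
if result is RateLimitResult.ALLOWED:
    # ... serve the request, then reconcile against actual usage ...
    limiter.refund_unused_tokens(
        user_id="usr_abc123",
        estimated_output_tokens=1_024,
        actual_output_tokens=187,   # reported by the inference server
        model="meta-llama/Llama-3.1-70B-Instruct",
    )
else:
    print(result.value, meta)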
FastAPI Middleware Integration
# middleware.py
import time
import json
from typing import Callable, Optional
from fastapi import FastAPI, Request, Response, HTTPException
from fastapi.responses import StreamingResponse
import redis
from .rate_limiter import TokenBucketRateLimiter, RateLimitResult, RATE_LIMIT_TIERS
app = FastAPI()
# Initialize Redis connection (use Redis Cluster in production)
redis_client = redis.Redis(
host="redis-service",
port=6379,
decode_responses=False,
socket_connect_timeout=1,
socket_timeout=1,
retry_on_timeout=True,
)
rate_limiter = TokenBucketRateLimiter(redis_client)
def get_user_info(request: Request) -> dict:
"""
Extract user identity and tier from the request.
In production, verify the API key and look up the user record.
"""
api_key = request.headers.get("Authorization", "").replace("Bearer ", "")
if not api_key:
raise HTTPException(status_code=401, detail="Missing API key")
# In production: look up from database or cache
# user = db.get_user_by_api_key(api_key)
# For this example, decode tier from key prefix
if api_key.startswith("ent_"):
return {"user_id": api_key[:32], "tier": "enterprise"}
elif api_key.startswith("pro_"):
return {"user_id": api_key[:32], "tier": "pro"}
else:
return {"user_id": api_key[:32], "tier": "free"}
@app.middleware("http")
async def rate_limit_middleware(request: Request, call_next: Callable) -> Response:
"""
Rate limiting middleware that runs before every request.
Only applies rate limiting to inference endpoints.
"""
if not request.url.path.startswith("/v1/chat"):
return await call_next(request)
try:
user_info = get_user_info(request)
except HTTPException as e:
return Response(
content=json.dumps({"error": e.detail}),
status_code=e.status_code,
media_type="application/json",
)
# Parse request body to extract token counts
# We need to read the body here, then make it available downstream
body_bytes = await request.body()
try:
body = json.loads(body_bytes)
except json.JSONDecodeError:
return Response(
content=json.dumps({"error": "Invalid JSON body"}),
status_code=400,
media_type="application/json",
)
# Estimate input tokens (use tiktoken for accuracy in production)
messages = body.get("messages", [])
input_text = " ".join(m.get("content", "") for m in messages)
estimated_input_tokens = max(1, len(input_text.split()) * 4 // 3)
max_output_tokens = body.get("max_tokens", 512)
model = body.get("model", "meta-llama/Llama-3.1-8B-Instruct")
# Check rate limit
try:
result, metadata = rate_limiter.check_rate_limit(
user_id=user_info["user_id"],
tier=user_info["tier"],
input_tokens=estimated_input_tokens,
max_output_tokens=max_output_tokens,
model=model,
)
except redis.RedisError:
# Redis unavailable - fail open (allow request) with logging
# You may choose to fail closed depending on your abuse risk
result = RateLimitResult.ALLOWED
metadata = {"redis_error": True}
if result != RateLimitResult.ALLOWED:
headers = {
"X-RateLimit-Limit": str(
RATE_LIMIT_TIERS[user_info["tier"]].tokens_per_minute
),
"Retry-After": str(metadata.get("retry_after_seconds", 60)),
"X-RateLimit-Reason": metadata.get("reason", "rate limited"),
}
return Response(
content=json.dumps({
"error": {
"type": result.value,
"message": metadata.get("reason", "Rate limit exceeded"),
"retry_after": metadata.get("retry_after_seconds", 60),
}
}),
status_code=429,
headers=headers,
media_type="application/json",
)
# Store rate limit context for use in response headers
request.state.user_info = user_info
request.state.rl_metadata = metadata
request.state.estimated_output_tokens = max_output_tokens
request.state.model = model
request.state.body = body_bytes
# Reconstruct request with body (since we consumed it)
async def receive():
return {"type": "http.request", "body": body_bytes, "more_body": False}
request._receive = receive
response = await call_next(request)
# Add rate limit headers to response
usage = rate_limiter.get_usage_stats(
user_info["user_id"], user_info["tier"]
)
response.headers["X-RateLimit-Limit-Tokens"] = str(
usage["tokens_per_minute"]
)
response.headers["X-RateLimit-Remaining-Day"] = str(
max(0, usage["daily_tokens_limit"] - usage["daily_tokens_used"])
)
return response
@app.post("/v1/chat/completions")
async def chat_completions(request: Request):
"""
Proxy endpoint to vLLM with rate limiting applied by middleware.
Handles token refund after generation completes.
"""
import httpx
body = json.loads(await request.body())
user_info = getattr(request.state, "user_info", {})
estimated_output = getattr(request.state, "estimated_output_tokens", 512)
model = body.get("model", "meta-llama/Llama-3.1-8B-Instruct")
async def stream_with_refund():
actual_output_tokens = 0
async with httpx.AsyncClient(timeout=120.0) as client:
async with client.stream(
"POST",
"http://vllm-service:8000/v1/chat/completions",
json=body,
) as resp:
async for line in resp.aiter_lines():
if line.startswith("data: ") and line != "data: [DONE]":
try:
chunk = json.loads(line[6:])
# Count actual output tokens
usage = chunk.get("usage", {})
if usage:
actual_output_tokens = usage.get(
"completion_tokens", 0
)
except json.JSONDecodeError:
pass
yield line + "\n"
# Refund unused output token quota
if user_info and actual_output_tokens > 0:
rate_limiter.refund_unused_tokens(
user_id=user_info.get("user_id", ""),
estimated_output_tokens=estimated_output,
actual_output_tokens=actual_output_tokens,
model=model,
)
return StreamingResponse(stream_with_refund(), media_type="text/event-stream")
Cost Attribution and Dashboard
# cost_tracker.py
"""
Cost attribution system for multi-tenant LLM deployments.
Tracks GPU cost per tenant, per model, per feature.
"""
import time
from datetime import datetime, timezone, timedelta
from typing import Optional
import redis
from prometheus_client import Counter, Histogram
# Prometheus counters for cost attribution
tokens_billed = Counter(
"llm_tokens_billed_total",
"Total billable tokens by tenant and model",
["tenant_id", "model", "token_type"], # token_type: input or output
)
cost_attributed_usd = Counter(
"llm_cost_attributed_usd_total",
"Total attributed cost in USD by tenant and model",
["tenant_id", "model"],
)
# Cost per token in USD (measure against your actual GPU costs)
COST_TABLE = {
"meta-llama/Llama-3.1-8B-Instruct": {"input": 5e-7, "output": 1e-6},
"meta-llama/Llama-3.1-70B-Instruct": {"input": 3e-6, "output": 6e-6},
"mistralai/Mixtral-8x7B-Instruct": {"input": 7e-7, "output": 7e-7},
"default": {"input": 1e-6, "output": 2e-6},
}
class CostAttributionTracker:
"""
Tracks per-request cost attribution for multi-tenant deployments.
Stores data in Redis with daily aggregation for fast dashboard queries.
"""
def __init__(self, redis_client: redis.Redis):
self.redis = redis_client
def record_request(
self,
tenant_id: str,
model: str,
feature: str,
input_tokens: int,
output_tokens: int,
latency_ms: float,
) -> float:
"""Record a completed request and return its cost in USD."""
costs = COST_TABLE.get(model, COST_TABLE["default"])
request_cost = (
input_tokens * costs["input"] +
output_tokens * costs["output"]
)
day = datetime.now(timezone.utc).strftime("%Y-%m-%d")
# Store in Redis sorted set for time-series queries
pipe = self.redis.pipeline()
# Daily aggregation keys
pipe.hincrbyfloat(
f"cost:daily:{tenant_id}:{day}",
f"{model}:{feature}:input_tokens",
input_tokens,
)
pipe.hincrbyfloat(
f"cost:daily:{tenant_id}:{day}",
f"{model}:{feature}:output_tokens",
output_tokens,
)
pipe.hincrbyfloat(
f"cost:daily:{tenant_id}:{day}",
f"{model}:{feature}:cost_usd",
request_cost,
)
pipe.expire(f"cost:daily:{tenant_id}:{day}", 90 * 86400) # 90 days
# Cross-tenant aggregation
pipe.hincrbyfloat(
f"cost:daily:all:{day}",
f"{tenant_id}:cost_usd",
request_cost,
)
pipe.expire(f"cost:daily:all:{day}", 90 * 86400)
# Budget alert check
pipe.hincrbyfloat(
f"cost:budget:{tenant_id}:{day}",
"total_usd",
request_cost,
)
pipe.expire(f"cost:budget:{tenant_id}:{day}", 86400)
pipe.execute()
# Update Prometheus counters
tokens_billed.labels(
tenant_id=tenant_id, model=model, token_type="input"
).inc(input_tokens)
tokens_billed.labels(
tenant_id=tenant_id, model=model, token_type="output"
).inc(output_tokens)
cost_attributed_usd.labels(
tenant_id=tenant_id, model=model
).inc(request_cost)
return request_cost
def get_tenant_daily_cost(
self, tenant_id: str, days: int = 30
) -> list:
"""Get daily cost breakdown for a tenant over the last N days."""
results = []
today = datetime.now(timezone.utc)
for i in range(days):
day = (today - timedelta(days=i)).strftime("%Y-%m-%d")
data = self.redis.hgetall(f"cost:daily:{tenant_id}:{day}")
if not data:
results.append({"date": day, "total_usd": 0.0})
continue
            total_usd = sum(
                float(v)
                for k, v in data.items()
                # Normalize bytes vs str before matching (decode_responses=False)
                if (k.decode() if isinstance(k, bytes) else k).endswith(":cost_usd")
            )
results.append({
"date": day,
"total_usd": round(total_usd, 6),
"breakdown": {
k.decode() if isinstance(k, bytes) else k: float(v)
for k, v in data.items()
},
})
return results
def check_budget_alert(
self,
tenant_id: str,
daily_budget_usd: float,
alert_threshold: float = 0.80,
) -> Optional[dict]:
"""
Check if a tenant is approaching their daily budget.
Returns alert dict if threshold exceeded, None otherwise.
"""
day = datetime.now(timezone.utc).strftime("%Y-%m-%d")
budget_data = self.redis.hgetall(f"cost:budget:{tenant_id}:{day}")
if not budget_data:
return None
total_key = b"total_usd"
if total_key not in budget_data:
return None
used_usd = float(budget_data[total_key])
usage_ratio = used_usd / daily_budget_usd
if usage_ratio >= alert_threshold:
return {
"tenant_id": tenant_id,
"used_usd": round(used_usd, 4),
"budget_usd": daily_budget_usd,
"usage_pct": round(usage_ratio * 100, 1),
"alert_level": "critical" if usage_ratio >= 0.95 else "warning",
}
return None
def get_top_spenders(self, date: Optional[str] = None) -> list:
"""Get top spending tenants for a given day. Useful for cost dashboards."""
if date is None:
date = datetime.now(timezone.utc).strftime("%Y-%m-%d")
all_costs = self.redis.hgetall(f"cost:daily:all:{date}")
if not all_costs:
return []
spenders = []
        for key, value in all_costs.items():
            # Normalize bytes vs str before matching (decode_responses=False)
            key_s = key.decode() if isinstance(key, bytes) else key
            if key_s.endswith(":cost_usd"):
                tenant_id = key_s.replace(":cost_usd", "")
spenders.append({
"tenant_id": tenant_id,
"cost_usd": round(float(value), 4),
})
return sorted(spenders, key=lambda x: x["cost_usd"], reverse=True)[:20]
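A short usage sketch of the tracker (connection details and values illustrative):
# usage sketch for CostAttributionTracker (illustrative values)
import redis

tracker = CostAttributionTracker(redis.Redis(host="localhost", port=6379))

cost = tracker.record_request(
    tenant_id="ten_xyz",
    model="meta-llama/Llama-3.1-70B-Instruct",
    feature="doc_summarizer",
    input_tokens=2_000,
    output_tokens=500,
    latency_ms=840.0,
)
print(f"attributed ${cost:.4f}")                    # $0.0090 with the table above
print(tracker.get_tenant_daily_cost("ten_xyz", days=7))
print(tracker.get_top_spenders())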
Abuse Detection
# abuse_detector.py
"""
Detects abuse patterns in LLM API usage:
- Automated scraping (high request rate, low diversity)
- Prompt injection attempts
- Unusual request size patterns
- Systematic probing of model capabilities
"""
import re
import time
import redis
import hashlib
from typing import Optional
from dataclasses import dataclass
@dataclass
class AbuseSignal:
signal_type: str
severity: str # low, medium, high, critical
description: str
action: str # log, throttle, block
class AbuseDetector:
"""
Stateful abuse detection using Redis for pattern tracking.
Designed to run in the request path with sub-millisecond overhead.
"""
# Prompt injection patterns - incomplete but covers common vectors
INJECTION_PATTERNS = [
r"ignore (all )?(previous|prior|above) instructions",
r"disregard (your )?(previous|prior|system) (instructions|prompt)",
r"you are now (a|an|DAN|jailbreak)",
r"pretend you (are|have no|don't have) (guidelines|restrictions|safety)",
r"\\n\\n(human|assistant|system):", # Prompt injection via newlines
r"<\|system\|>", # Token injection attempt
r"\[INST\].*\[/INST\]", # Llama-2 injection
]
def __init__(self, redis_client: redis.Redis):
self.redis = redis_client
self._compiled_patterns = [
re.compile(p, re.IGNORECASE) for p in self.INJECTION_PATTERNS
]
def check_request(
self,
user_id: str,
prompt_text: str,
input_tokens: int,
) -> list:
"""
Run all abuse checks on a request.
Returns list of AbuseSignal objects (empty = clean).
"""
signals = []
signals.extend(self._check_prompt_injection(prompt_text))
signals.extend(self._check_request_rate_anomaly(user_id))
signals.extend(self._check_token_size_anomaly(user_id, input_tokens))
signals.extend(self._check_prompt_diversity(user_id, prompt_text))
return signals
def _check_prompt_injection(self, text: str) -> list:
for pattern in self._compiled_patterns:
if pattern.search(text):
return [AbuseSignal(
signal_type="prompt_injection",
severity="high",
description=f"Prompt injection pattern detected: {pattern.pattern[:50]}",
action="block",
)]
return []
def _check_request_rate_anomaly(self, user_id: str) -> list:
"""
Detect burst patterns that suggest automated usage.
Uses a 10-second sliding window.
"""
key = f"abuse:rate:{user_id}"
now = int(time.time())
window = 10 # seconds
pipe = self.redis.pipeline()
        # Nanosecond timestamp gives a collision-safe unique member per request
        pipe.zadd(key, {str(time.time_ns()): now})
pipe.zremrangebyscore(key, 0, now - window)
pipe.zcard(key)
pipe.expire(key, 60)
results = pipe.execute()
request_count = results[2]
if request_count > 30: # 30+ requests in 10 seconds
return [AbuseSignal(
signal_type="burst_rate",
severity="high",
description=f"{request_count} requests in {window}s window",
action="throttle",
)]
elif request_count > 15:
return [AbuseSignal(
signal_type="burst_rate",
severity="medium",
description=f"{request_count} requests in {window}s window",
action="log",
)]
return []
def _check_token_size_anomaly(
self, user_id: str, input_tokens: int
) -> list:
"""
Detect unusually large requests compared to user's history.
A user who typically sends 200-token requests suddenly sending
50,000-token requests is a cost abuse signal.
"""
key = f"abuse:token_history:{user_id}"
# Track rolling average of input token counts
pipe = self.redis.pipeline()
pipe.lpush(key, input_tokens)
pipe.ltrim(key, 0, 99) # Keep last 100 requests
pipe.lrange(key, 0, 99)
pipe.expire(key, 7 * 86400)
results = pipe.execute()
history = [int(x) for x in results[2]]
if len(history) < 10:
return [] # Not enough history
avg_tokens = sum(history[1:]) / (len(history) - 1) # Exclude current
if avg_tokens > 0 and input_tokens > avg_tokens * 10:
return [AbuseSignal(
signal_type="token_size_spike",
severity="medium",
description=(
f"Request {input_tokens} tokens is "
f"{input_tokens/avg_tokens:.1f}x user average ({avg_tokens:.0f})"
),
action="log",
)]
return []
def _check_prompt_diversity(
self, user_id: str, prompt_text: str
) -> list:
"""
Detect repetitive/templated requests that suggest systematic scraping.
Uses content fingerprinting to detect near-duplicates.
"""
# Fingerprint first 200 chars (most variation is in the beginning)
fingerprint = hashlib.md5(
prompt_text[:200].encode()
).hexdigest()[:8]
key = f"abuse:fingerprints:{user_id}"
pipe = self.redis.pipeline()
pipe.sadd(key, fingerprint)
pipe.scard(key)
pipe.expire(key, 3600)
results = pipe.execute()
unique_count = results[1]
# Get total request count for comparison
total_key = f"abuse:total_requests:{user_id}"
total = self.redis.incr(total_key)
self.redis.expire(total_key, 3600)
if total > 20 and unique_count / total < 0.1:
# Less than 10% unique prompts out of 20+ requests
return [AbuseSignal(
signal_type="low_prompt_diversity",
severity="medium",
description=(
f"Only {unique_count} unique fingerprints in {total} requests "
f"({100*unique_count/total:.0f}% diversity)"
),
action="throttle",
)]
return []
Architecture Diagrams
(Diagram: rate limiting architecture - request flow)
(Diagram: token bucket state machine)
(Diagram: multi-tenant resource isolation)
Production Engineering Notes
Fail Open vs Fail Closed
When Redis is unavailable, you have a choice: allow all requests (fail open) or reject all requests (fail closed). For most LLM services, fail open is the right default. A Redis outage should not cause user-facing 429 errors - it should cause an alert that your rate limiting infrastructure is down.
The risk of fail open is that a determined abuser who can trigger a Redis outage can bypass your rate limits. In practice, this threat model applies to very few production deployments. For the vast majority of services, the cost of legitimate user disruption from fail-closed outweighs the risk.
If you run a high-value API where abuse during a Redis outage would be catastrophic, implement a local in-memory fallback rate limiter (with conservative limits) that activates when Redis is unreachable.
# Fail-open pattern with error logging
try:
result, metadata = rate_limiter.check_rate_limit(...)
except redis.RedisError as e:
logger.error(f"Redis unavailable for rate limiting: {e}")
metrics.increment("rate_limiter.redis_error")
result = RateLimitResult.ALLOWED # Fail open
metadata = {}
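A minimal sketch of the local fallback limiter mentioned above, assuming a single-process server and conservative fixed-window counting (each replica enforces independently, so set the limit well below the global one):
# Local fallback sketch: in-memory fixed-window counter that takes over
# when Redis is unreachable. Per-process state - keep its limits conservative.
import time
import threading

class LocalFallbackLimiter:
    def __init__(self, tokens_per_minute: int):
        self.limit = tokens_per_minute
        self.window_start = time.monotonic()
        self.used = 0
        self.lock = threading.Lock()

    def allow(self, cost: int) -> bool:
        with self.lock:
            now = time.monotonic()
            if now - self.window_start >= 60:
                self.window_start, self.used = now, 0   # new minute window
            if self.used + cost > self.limit:
                return False
            self.used += cost
            return True
Wire it into the except redis.RedisError branch above in place of the unconditional fail-open.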
Token Count Estimation
Using word count times a constant factor for token estimation is acceptable for rate limiting but not for billing. For billing, use tiktoken (OpenAI's tokenizer) or the model-native tokenizer. Different models have meaningfully different tokenization (Llama tokenizes differently than GPT-4), and the difference can be 20-30% on some text types.
For rate limiting purposes, slightly overestimating token count (by 10-20%) is acceptable and errs on the safe side - users get slightly less headroom than their stated limit, but you never over-serve. This is the conservative approach. For billing, always use exact token counts from the model's actual usage response.
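A sketch of the difference, using tiktoken for OpenAI-style vocabularies; for self-hosted Llama-family models you would use the model's own tokenizer via transformers instead (shown in comments, assuming you have access to the tokenizer files):
# Heuristic vs exact token counting
import tiktoken

text = "Summarize the attached quarterly report in three bullet points."
heuristic = max(1, len(text.split()) * 4 // 3)   # the middleware's estimate
enc = tiktoken.get_encoding("cl100k_base")
exact = len(enc.encode(text))
print(heuristic, exact)   # heuristic is fine for rate limits, not for billing

# For a Llama-family model (assumes transformers + tokenizer access):
# from transformers import AutoTokenizer
# tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
# exact_llama = len(tok.encode(text))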
Redis Key Design
Design your Redis keys for efficient prefix scanning when you need to debug or analyze. Use a consistent hierarchy: {namespace}:{resource_type}:{identifier}:{time_bucket}.
Examples:
- rl:bucket:usr_abc123 - rate limit bucket for user abc123
- rl:daily:usr_abc123:2025-01-15 - daily counter
- cost:daily:ten_xyz:2025-01-15 - daily cost for tenant xyz
- abuse:rate:usr_abc123 - abuse detection window for user
This hierarchy lets you scan all rate limit keys with SCAN 0 MATCH rl:*, scan everything for a specific user with SCAN 0 MATCH *usr_abc123*, and keeps expiry cleanup explicit rather than relying on TTL for correctness.
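For example, with redis-py's scan_iter (which wraps SCAN cursors so you never block the server the way KEYS would):
# Debugging sketch: enumerate rate limit state without blocking Redis
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

for key in r.scan_iter(match="rl:*", count=1000):        # all rate limit keys
    print(key)

for key in r.scan_iter(match="*usr_abc123*", count=1000):  # one user's keys
    print(key, r.ttl(key))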
Multi-Region Deployments
If you run vLLM in multiple regions, your rate limiter must be global - a user hitting your US region and EU region simultaneously should not get 2x their token quota. Use Redis Cluster with cross-region replication, or accept slightly stale rate limit state (which results in minor over-serving at region boundaries) in exchange for lower latency.
For most services, per-region rate limits with a 10-20% headroom buffer are acceptable. A user can burst to roughly 110% of their stated limit by spreading requests across regions, but this is unlikely to be economically significant. If it is, use an active-active enterprise Redis deployment (for example, Redis Enterprise Active-Active) to replicate rate limit state across regions.
Common Mistakes
:::danger Rate Limiting by Requests per Second Only
The most dangerous mistake in LLM API rate limiting is using request count alone. A user limited to 10 RPM who sends 100,000-token documents on every request is consuming 1,000x more GPU resources than a user sending 100-token prompts at the same rate.
This mistake is insidious because it feels like you have a rate limiter. You do - it just does not limit the thing that costs money.
Always implement both RPS limiting (to prevent server overload from many small requests) AND token-per-minute limiting (to control actual cost). For cost control purposes, token-per-minute is the primary constraint; RPS is secondary.
# WRONG - only counts requests
if requests_this_minute > 100:
raise RateLimitError("Too many requests")
# CORRECT - count tokens, which map to actual cost
if tokens_this_minute + estimated_tokens > 100_000:
raise RateLimitError("Token limit exceeded")
:::
:::danger No Upfront Token Reservation
If you only check remaining tokens at request start without reserving them, concurrent requests can simultaneously pass the rate limit check before any of them has consumed tokens. Ten concurrent requests each see "10,000 tokens available, I need 1,000" - all pass - and you serve 10,000 tokens to a user whose limit is 1,000.
The fix is atomic check-and-consume. The Lua script approach in this lesson is correct. The common mistake is doing a GET of the token count followed by a SET of tokens minus cost as two separate Redis commands (two round trips, not atomic).
Always use Lua scripts or Redis transactions (MULTI/EXEC) for rate limit state updates. :::
:::warning Not Refunding Unused Output Token Quota
You must charge for output tokens upfront (based on max_tokens) because you do not know the actual output length until generation completes. If you only charge at request end, a user can keep many long-running generations in flight simultaneously - every one of them passes the limit check, because nothing is debited until each finishes.
But if you never refund unused quota, users who set generous max_tokens values (e.g., max_tokens=4096 for a response that turns out to be 50 tokens) get penalized. A user making 100 requests per day at max_tokens=4096 might actually use only 200 tokens per response on average - without refunds, their effective daily capacity is roughly 20x lower than their actual consumption warrants.
Implement the refund pattern: reserve max_tokens upfront, refund (max_tokens - actual_tokens) after generation completes. This gives users fair access to their full quota based on actual consumption. :::
:::warning Treating All Request Content as Safe for Logging
When you log requests for abuse detection and debugging, you are collecting user content. This creates compliance obligations (GDPR, CCPA, HIPAA depending on your users' data) and security risks (prompt logs are a high-value target for data breaches).
Specifically around abuse detection: do NOT store raw prompt text in Redis. Store fingerprints (first 200 chars hashed) and token counts. Raw content belongs in your logging pipeline with proper access controls and retention policies, not in Redis keys that might be visible to any service with Redis access.
Separate your abuse detection state (Redis, ephemeral, low-sensitivity) from your request audit logs (S3/GCS, encrypted, access-controlled, retention-policy-governed). :::
Interview Q&A
Q1: Why is request-per-second rate limiting insufficient for LLM APIs, and what should you use instead?
Request-per-second limits the count of API calls, which in traditional services is a reasonable proxy for server load. For LLMs, the actual resource consumption - GPU compute time - is proportional to the number of tokens processed, not the number of API calls. A user sending one 100,000-token request consumes the same GPU resources as 100 users each sending 1,000-token requests, but would be counted as 1/100th the load by an RPS limiter.
The correct approach is token-per-minute (TPM) and token-per-day (TPD) limiting, which directly measures the quantity that drives cost. You still want RPS limits as a secondary constraint - to prevent a single user from monopolizing the request queue with thousands of tiny requests - but the primary cost control mechanism must be token-based.
In practice, set three limits per user tier: max tokens per request (prevents a single enormous request from monopolizing resources), tokens per minute (controls sustained throughput), and tokens per day (hard cap on daily spend). The first protects system stability, the second protects per-minute capacity, and the third protects your monthly bill.
Q2: Describe the token bucket algorithm and why it is preferred over a fixed window counter for API rate limiting.
A token bucket maintains a counter representing available tokens, with a constant refill rate (tokens per second) up to a maximum capacity. When a request arrives, it consumes tokens equal to its cost. If sufficient tokens are available, the request is allowed; otherwise it is rejected.
The key advantage over a fixed window counter is burst tolerance. A fixed window counter allows exactly N tokens in each 60-second window, but completely resets at window boundaries - which creates a "thundering herd" problem where users learn to send bursts at the start of each window. More importantly, a user who makes no requests for 5 minutes cannot then burst to catch up - they are still limited to the per-minute window rate.
The token bucket allows a user to accumulate unused capacity (up to the bucket size) and then spend it as a burst. This matches natural usage patterns - a user might send nothing for an hour and then need a quick burst of 10 requests. With a fixed window, only the current window's quota is available. With a token bucket, they have accumulated credit they can use immediately.
For LLM services specifically, the burst tolerance is important because users often have batch processing workflows - they want to process 50 documents quickly, then nothing for an hour. Token buckets accommodate this while still enforcing sustainable long-term rates.
Q3: How do you implement atomic rate limit checks to prevent race conditions under concurrent load?
Without atomicity, concurrent requests can bypass rate limits through a classic check-then-act race condition: multiple requests simultaneously read the current token count (say, 1,000 tokens available), each determines "I need 800 tokens, 1,000 is enough," and all proceed - consuming 3,200 tokens when only 1,000 were available.
The standard solution for Redis is Lua scripts. Redis executes Lua scripts atomically - no other Redis commands run between the first and last line of the script. This makes it impossible for concurrent requests to race. The Lua script reads the bucket state, checks availability, and consumes tokens as a single atomic operation.
The alternative is Redis transactions (MULTI/EXEC with WATCH), but this requires optimistic locking and retries when the watched key changes - more complex to implement correctly and slower under contention. Lua scripts are the recommended approach.
In environments where you cannot use Lua (some Redis-compatible services do not support scripting), you can fall back on Redis's INCR/INCRBY commands, which are atomic on their own. Keep per-user per-minute counters that you INCRBY on each token consumption, with EXPIRE to reset them. This is simpler but supports only fixed-window counting semantics (not the burst-then-refill behavior of a token bucket).
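A sketch of that INCRBY-based fallback (fixed one-minute windows, no burst accumulation; a denied request still counts against the window here, which errs on the conservative side):
# Minimal INCRBY-based fixed-window limiter - atomic without Lua,
# because INCRBY is itself an atomic read-modify-write.
import time
import redis

def allow_request(r: redis.Redis, user_id: str, cost: int,
                  limit_per_minute: int) -> bool:
    minute = int(time.time() // 60)
    key = f"rl:fixed:{user_id}:{minute}"
    used = r.incrby(key, cost)      # atomic check-and-consume in one command
    r.expire(key, 120)              # garbage-collect old windows
    return used <= limit_per_minute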
Q4: How would you design a cost attribution system for a multi-tenant LLM deployment where each tenant is billed separately?
The fundamental requirement is tracking, per request: which tenant, which model, how many input tokens, how many output tokens. From these four data points you can compute cost as input_tokens * input_cost_per_token + output_tokens * output_cost_per_token, where the cost per token is derived from your actual GPU infrastructure cost divided by measured token throughput.
For the data infrastructure: a Redis hash per tenant per day works well for real-time dashboards and budget alert checks (fast reads, in-memory). A secondary write to a time-series database (InfluxDB, TimescaleDB, or BigQuery) enables long-term trend analysis and billing exports. Do not try to do both in the hot request path - use an async event queue (Kafka or Redis Streams) to fan out to the slow path.
For multi-model deployments, maintain a cost table mapping model names to per-token costs. Update this table when you change hardware, when model quantization changes, or when you migrate to different instance types. The cost table should be versioned and date-stamped so you can accurately reconstruct historical costs even after changing the table.
For tenant isolation in the vLLM layer, you can use the priority parameter on vLLM requests to implement SLO tiers (enterprise tenants get higher priority in the scheduling queue). True KV cache partitioning per tenant requires modifications to vLLM's caching layer and is generally only worth the complexity for very large deployments.
Q5: What are the most important signals for detecting prompt injection attacks, and how do you avoid over-blocking legitimate requests?
The most reliable signals for prompt injection are pattern-based: phrases like "ignore previous instructions," "you are now [persona]," attempts to inject synthetic system/user/assistant tokens (like \n\nHuman: to manipulate chat format), and XML/HTML tags that look like they are trying to alter the prompt structure.
Pattern matching alone has high false positive rates. A cybersecurity researcher legitimately writing about prompt injection will trigger the same patterns as an attacker. The key to avoiding over-blocking is: use pattern matching as a signal for investigation, not as a hard block by default. Log and alert on detected patterns, and only hard-block on high-confidence, high-severity patterns (like clear attempts to override system prompts with admin personas).
A tiered response is better than binary block/allow: low-confidence patterns get logged; medium-confidence patterns get throttled (applied to a slower, more expensive safety-checking step); high-confidence patterns get blocked with a clear error message explaining why. This lets you tune aggressiveness over time based on your false positive rate in production.
For systems where prompt injection is a critical threat (e.g., LLMs with tool use that can trigger real-world actions), consider running all requests through a separate safety model inference - a small, fast classifier trained specifically on injection patterns. This is more accurate than regex matching and less likely to block legitimate security-related discussion.
Q6: Walk me through how you would design budget alerts that notify a tenant before they hit their monthly limit, not after.
Budget alerts work at two layers. The first layer is per-request budget checks - before serving a request, check if this request would push the tenant over their budget and reject it if so. This prevents overage but provides no advance warning.
The second layer is threshold-based alerts at 50%, 75%, 90%, and 95% of budget. When a tenant crosses each threshold, send a notification (email, webhook, Slack). This gives them time to react before they hit the wall.
The implementation uses Redis to track cumulative daily and monthly spending. On each request completion, increment the tenant's cost accumulator. Separately, run a background job (cron, every 15 minutes) that scans all active tenants, computes their spend vs budget, and fires alerts for tenants who crossed a threshold since the last check.
The background job approach is important because you want idempotent alerting - if the job runs every 15 minutes and the tenant is at 91%, you should fire the 90% alert exactly once, not every 15 minutes until they hit 95%. Use Redis to store "last threshold alerted" per tenant, so you only fire alerts for newly crossed thresholds.
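A sketch of the idempotent threshold check, assuming a redis-py client and a send_alert notifier of your own (both the key schema and the notifier are placeholders):
# Idempotent threshold alerting sketch: each threshold fires at most once/day
import redis

THRESHOLDS = [0.50, 0.75, 0.90, 0.95]

def fire_new_alerts(r: redis.Redis, tenant_id: str, used_usd: float,
                    budget_usd: float, day: str, send_alert) -> None:
    state_key = f"alert:last_ratio:{tenant_id}:{day}"
    last_ratio = float(r.get(state_key) or 0.0)   # high-water mark so far
    ratio = used_usd / budget_usd
    for threshold in THRESHOLDS:
        if ratio >= threshold > last_ratio:       # newly crossed since last run
            send_alert(tenant_id, threshold, used_usd, budget_usd)
    if ratio > last_ratio:
        r.set(state_key, ratio, ex=86400)         # remember the new high-water mark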
For the notification itself, include the tenant's current spend, budget, projected month-end spend based on current daily rate, and a link to their usage dashboard. "You've used $450 of your $500 monthly budget. At your current rate ($18/day), you'll hit your limit in 2.8 days" is far more useful than "You've exceeded 90% of budget."
