
Middleware - Wrapping Every Request and Response

Reading time: ~30 minutes | Level: Intermediate → Engineering

Before reading further, consider this puzzle:

app.add_middleware(AuthMiddleware)
app.add_middleware(LoggingMiddleware)
app.add_middleware(TimingMiddleware)

# Request comes in. What order do they execute?
# Which middleware sees the response first?

If you guessed that AuthMiddleware runs first (because it was added first), you are wrong - and in a real application that mistake creates security holes.

Middleware forms a stack, not a queue. The last middleware added wraps everything added before it. The execution order is the reverse of the add_middleware call order:

Request path: TimingMiddleware → LoggingMiddleware → AuthMiddleware → Handler
Response path: AuthMiddleware → LoggingMiddleware → TimingMiddleware → Client

TimingMiddleware is outermost: it starts timing before any other middleware runs, and it stops timing after all other middleware has processed the response. AuthMiddleware is innermost: it runs closest to the handler, right before the handler executes.

This is the onion model: request flows inward through layers, the handler executes at the core, and the response flows outward through the same layers in reverse. Every middleware is both a request preprocessor and a response postprocessor.

What You Will Learn

  • What middleware is and exactly what it wraps
  • The onion model - visualised as a sequence diagram
  • WSGI middleware: the callable pattern
  • ASGI middleware: BaseHTTPMiddleware and pure ASGI
  • Flask before_request/after_request hooks vs middleware
  • Five production-grade middleware implementations
  • The performance tradeoff between BaseHTTPMiddleware and pure ASGI middleware
  • Middleware vs dependency injection - when to use each

Prerequisites

  • Lesson 04 (FastAPI) - middleware basics, Depends(), request lifecycle
  • Lesson 05 (Request-Response Lifecycle) - how middleware fits into the full stack
  • Lesson 01 (HTTP Deep Dive) - CORS preflight, headers, status codes

Part 1 - What Middleware Is

Middleware is code that runs for every request and every response, regardless of which endpoint handles the request. It sits between the ASGI/WSGI server and your route handlers. Think of it as a series of decorators applied to your entire application rather than to individual functions.

Middleware has two distinct execution windows:

  1. Before the handler: Can inspect, modify, or reject the request. Can also short-circuit - returning a response without ever calling the handler (e.g., authentication middleware returning 401 immediately).
  2. After the handler: Can inspect or modify the response. Can add headers, compress the body, log the result, or clean up resources.

The things middleware is good at:

  • Concerns that apply to every request uniformly (logging, timing, CORS)
  • Concerns that must run before routing (authentication token parsing, request ID generation)
  • Concerns that require access to both the raw request and the raw response (timing, logging request + response together)
  • Infrastructure concerns that should be invisible to handler code

The things middleware is not good at: business logic, per-endpoint variation, anything that needs access to the dependency injection graph.

Part 2 - The Onion Model

The key to understanding middleware execution: await call_next(request) is the boundary. Code before call_next runs on the request. Code after call_next runs on the response. The response from call_next is the result of every inner middleware plus the handler running in sequence.
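You can check the ordering without a web framework. The sketch below is a hypothetical, framework-free imitation of the order Starlette produces: each `add_middleware` call wraps the current app, so the most recently added layer ends up outermost. `MiniApp`, `make_middleware` and the dict-based "requests" are illustrative stand-ins, not Starlette's API.

```python
import asyncio
from typing import Awaitable, Callable

# In this toy model an "app" is: async (request: dict) -> response: dict
Handler = Callable[[dict], Awaitable[dict]]


def make_middleware(name: str, log: list):
    """Build a toy middleware factory that records when each half runs."""
    def factory(call_next: Handler) -> Handler:
        async def middleware(request: dict) -> dict:
            log.append(f"{name} request")    # code before call_next
            response = await call_next(request)
            log.append(f"{name} response")   # code after call_next
            return response
        return middleware
    return factory


class MiniApp:
    """Toy app: every add_middleware wraps everything added before it."""
    def __init__(self, handler: Handler):
        self.app = handler

    def add_middleware(self, factory) -> None:
        self.app = factory(self.app)  # newest layer becomes the outermost

    async def __call__(self, request: dict) -> dict:
        return await self.app(request)


log: list = []

async def handler(request: dict) -> dict:
    log.append("handler")
    return {"status": 200}

app = MiniApp(handler)
app.add_middleware(make_middleware("Auth", log))
app.add_middleware(make_middleware("Logging", log))
app.add_middleware(make_middleware("Timing", log))

response = asyncio.run(app({"path": "/"}))
print(log)
# ['Timing request', 'Logging request', 'Auth request', 'handler',
#  'Auth response', 'Logging response', 'Timing response']
```

The printed order is exactly the onion from the opening puzzle. Timing was added last, so it is first in and last out.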

Part 3 - WSGI Middleware

WSGI middleware is simply a callable that wraps another callable. WSGI's interface (environ, start_response) flows through each layer:

import time

# A WSGI application
def simple_app(environ: dict, start_response) -> list[bytes]:
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello World"]

# A WSGI middleware
class TimingMiddleware:
    def __init__(self, app):
        self.app = app  # the wrapped application

    def __call__(self, environ: dict, start_response):
        start = time.perf_counter()

        # Intercept start_response to capture the status code
        captured_status = []
        def capturing_start_response(status, headers, exc_info=None):
            captured_status.append(status)
            return start_response(status, headers, exc_info)

        result = self.app(environ, capturing_start_response)

        # Caveat: this measures only until the inner app returns its iterable.
        # A generator-based app has not produced its body (or necessarily
        # called start_response) yet at this point.
        duration_ms = (time.perf_counter() - start) * 1000
        status = captured_status[0] if captured_status else "-"
        print(
            f"{environ['REQUEST_METHOD']} {environ['PATH_INFO']} "
            f"{status} in {duration_ms:.1f}ms"
        )
        return result

# Compose: wrap the app - innermost first, outermost last
# (AuthMiddleware and LoggingMiddleware are assumed to follow the same
#  __init__(app) / __call__(environ, start_response) shape)
app = simple_app
app = AuthMiddleware(app)     # innermost
app = LoggingMiddleware(app)  # middle
app = TimingMiddleware(app)   # outermost

The WSGI middleware pattern is pure composition: each middleware receives the inner app via __init__ and stores it as self.app. Calling self.app(environ, start_response) passes the request to the next layer.
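To watch the composition run without a server, you can call the WSGI app directly with a test environ. The sketch below assumes a hypothetical `AddHeaderMiddleware`. It uses the standard library's `wsgiref.util.setup_testing_defaults` to fill in a valid environ. Each layer appends a header as `start_response` travels outward, so the inner layer's header lands first.

```python
from wsgiref.util import setup_testing_defaults


def hello_app(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello World"]


class AddHeaderMiddleware:
    """Hypothetical WSGI middleware that appends one response header."""
    def __init__(self, app, name: str, value: str):
        self.app = app
        self.name = name
        self.value = value

    def __call__(self, environ, start_response):
        def start_response_with_header(status, headers, exc_info=None):
            # Response headers flow through start_response, so add ours here
            return start_response(status, headers + [(self.name, self.value)], exc_info)
        return self.app(environ, start_response_with_header)


app = AddHeaderMiddleware(hello_app, "X-Layer", "inner")  # wrapped first
app = AddHeaderMiddleware(app, "X-Layer", "outer")        # wrapped last = outermost

# Drive the app by hand instead of running a server
environ = {}
setup_testing_defaults(environ)
captured = {}

def fake_start_response(status, headers, exc_info=None):
    captured["status"] = status
    captured["headers"] = headers
    return lambda data: None  # WSGI's legacy write() callable, unused here

body = b"".join(app(environ, fake_start_response))
print(captured["status"], captured["headers"], body)
```

The captured headers end with `("X-Layer", "inner")` followed by `("X-Layer", "outer")`. This is the response path of the onion: the innermost layer touches the response first.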

Flask uses WSGI middleware for things like ProxyFix (which reads X-Forwarded-For headers from reverse proxies):

from flask import Flask
from werkzeug.middleware.proxy_fix import ProxyFix

flask_app = Flask(__name__)
# Wrap the Flask WSGI app with ProxyFix middleware
flask_app.wsgi_app = ProxyFix(flask_app.wsgi_app, x_for=1, x_proto=1)

Part 4 - ASGI Middleware: BaseHTTPMiddleware

Starlette's BaseHTTPMiddleware is the easiest way to write ASGI middleware. It provides a simple dispatch(request, call_next) interface that hides ASGI's raw scope/receive/send machinery:

from starlette.middleware.base import BaseHTTPMiddleware
from starlette.requests import Request
from starlette.responses import Response
import time

class TimingMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request: Request, call_next) -> Response:
        start = time.perf_counter()
        response = await call_next(request)  # inner layers + handler
        duration_ms = (time.perf_counter() - start) * 1000
        response.headers["X-Duration-Ms"] = f"{duration_ms:.1f}"
        return response

Register with FastAPI:

from fastapi import FastAPI

app = FastAPI()
# Remember: last added = outermost
app.add_middleware(TimingMiddleware)

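Or use the decorator style. @app.middleware("http") wraps the function in BaseHTTPMiddleware, so the usual ordering rule applies: whichever middleware is registered last is outermost.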

@app.middleware("http")
async def timing_middleware(request: Request, call_next):
    start = time.perf_counter()
    response = await call_next(request)
    duration_ms = (time.perf_counter() - start) * 1000
    response.headers["X-Duration-Ms"] = f"{duration_ms:.1f}"
    return response
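Note: BaseHTTPMiddleware is convenient but not free. In recent Starlette versions, call_next runs the inner app in a separate task and hands the response body back through an in-memory stream. If dispatch itself awaits request.body(), the whole request body is held in memory. Starlette also documents some limitations; for example, ContextVar changes made inside an endpoint do not propagate back up to the middleware. For streaming-heavy or high-throughput endpoints, write a pure ASGI middleware (Part 5) that operates on receive/send directly.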

Part 5 - Pure ASGI Middleware

Pure ASGI middleware operates on the raw scope/receive/send interface. It is more complex to write. In exchange, it avoids BaseHTTPMiddleware's per-request task and stream machinery and passes request and response chunks through as they arrive:

import time
import uuid

class PureASGITimingMiddleware:
    """
    Pure ASGI middleware - no request body buffering.
    Suitable for streaming endpoints and high-throughput APIs.
    """
    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send) -> None:
        if scope["type"] != "http":
            # Pass non-HTTP connections (WebSocket, lifespan) through unchanged
            await self.app(scope, receive, send)
            return

        start = time.perf_counter()
        request_id = str(uuid.uuid4())
        status_code = None  # captured here, e.g. for logging after the call

        # Wrap send to intercept the response start event
        async def send_wrapper(message):
            nonlocal status_code
            if message["type"] == "http.response.start":
                status_code = message["status"]
                # Add headers without buffering the body. Headers leave with
                # http.response.start, so this duration covers the time until
                # the response starts, not the full body stream.
                headers = list(message.get("headers", []))
                duration_ms = (time.perf_counter() - start) * 1000
                headers.append((b"x-duration-ms", f"{duration_ms:.1f}".encode()))
                headers.append((b"x-request-id", request_id.encode()))
                message = {**message, "headers": headers}
            await send(message)

        await self.app(scope, receive, send_wrapper)

The difference from BaseHTTPMiddleware: this middleware never touches receive - it does not buffer the request body. The response is intercepted by wrapping send, so it sees each response chunk as it streams out without accumulating it in memory.

Register a pure ASGI middleware:

from fastapi import FastAPI

app = FastAPI()
app.add_middleware(PureASGITimingMiddleware)
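Because a pure ASGI middleware is just an async callable, you can drive it by hand with fake `receive`/`send` functions. This is a quick way to see that body chunks pass through one by one. The sketch below uses a hypothetical `AddHeaderASGIMiddleware` with the same `send_wrapper` shape as above. It drops the timing so the output is deterministic. The `scope` is deliberately minimal; real servers supply many more keys.

```python
import asyncio


async def streaming_app(scope, receive, send):
    """Tiny ASGI app that streams its body in two chunks."""
    await send({"type": "http.response.start", "status": 200,
                "headers": [(b"content-type", b"text/plain")]})
    await send({"type": "http.response.body", "body": b"Hello, ", "more_body": True})
    await send({"type": "http.response.body", "body": b"world", "more_body": False})


class AddHeaderASGIMiddleware:
    """Hypothetical pure ASGI middleware: appends one header, never buffers."""
    def __init__(self, app, name: bytes, value: bytes):
        self.app = app
        self.name = name
        self.value = value

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return

        async def send_wrapper(message):
            if message["type"] == "http.response.start":
                headers = list(message.get("headers", []))
                headers.append((self.name, self.value))
                message = {**message, "headers": headers}
            await send(message)  # body chunks are forwarded as they arrive

        await self.app(scope, receive, send_wrapper)


async def run_once(app):
    """Drive an ASGI app by hand and record every message it sends."""
    sent = []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        sent.append(message)

    scope = {"type": "http", "method": "GET", "path": "/", "headers": []}
    await app(scope, receive, send)
    return sent


app = AddHeaderASGIMiddleware(streaming_app, b"x-demo", b"1")
messages = asyncio.run(run_once(app))
for m in messages:
    print(m["type"], m.get("headers") or m.get("body"))
```

Only the first message (`http.response.start`) is modified. The two body chunks reach `send` separately and unchanged, so nothing is accumulated in memory.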

Part 6 - Flask Before/After Hooks vs Middleware

Flask provides before_request and after_request hooks as a simpler alternative to WSGI middleware. These are not middleware - they are registered callbacks inside the Flask request context:

from flask import Flask, g, request
import time
import uuid

app = Flask(__name__)

@app.before_request
def start_timer():
    # Runs before every request handler
    g.start_time = time.perf_counter()
    g.request_id = str(uuid.uuid4())

@app.after_request
def log_request(response):
    # Runs after every request handler - receives the Response object
    duration_ms = (time.perf_counter() - g.start_time) * 1000
    app.logger.info(
        "method=%s path=%s status=%d duration_ms=%.1f request_id=%s",
        request.method, request.path, response.status_code,
        duration_ms, g.request_id
    )
    response.headers["X-Request-ID"] = g.request_id
    return response

@app.teardown_request
def close_db_session(exc):
    # Runs after after_request, even on exception
    # Used for cleanup (close DB connection, release locks)
    db = g.pop("db", None)
    if db is not None:
        db.close()

Flask hooks vs WSGI middleware:

| Feature | Flask before/after hooks | WSGI middleware |
|---|---|---|
| Access to Flask g and request | Yes | No (raw environ only) |
| Can short-circuit before handler | Only via before_request returning a Response | Yes, any time |
| Applies to all routes | Yes | Yes |
| Reusable across Flask apps | No (tied to app instance) | Yes (composable) |
| Suitable for auth, logging, timing | Yes | Yes |

For most Flask middleware needs, before_request/after_request hooks are idiomatic and simpler. Use WSGI middleware (app.wsgi_app = MiddlewareClass(app.wsgi_app)) when you need reusable, app-agnostic middleware or when you need to handle things before Flask's request context is set up.
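Here is an example of handling something before Flask's request context exists. The sketch below uses a hypothetical `HealthCheckMiddleware` that answers `/health` itself and never calls the wrapped app. A plain WSGI function stands in for `flask_app.wsgi_app` so the block runs without Flask. With Flask you would wrap it as `flask_app.wsgi_app = HealthCheckMiddleware(flask_app.wsgi_app)`.

```python
from wsgiref.util import setup_testing_defaults


class HealthCheckMiddleware:
    """Hypothetical WSGI middleware: answers a health path before the app runs.

    Wrapped around flask_app.wsgi_app, it responds before Flask pushes its
    request context, so no before_request hook or route is involved.
    """
    def __init__(self, app, path: str = "/health"):
        self.app = app
        self.path = path

    def __call__(self, environ, start_response):
        if environ.get("PATH_INFO") == self.path:
            start_response("200 OK", [("Content-Type", "text/plain")])
            return [b"ok"]  # short-circuit: self.app is never called
        return self.app(environ, start_response)


calls = []

def inner_app(environ, start_response):
    # Stand-in for flask_app.wsgi_app, so the example has no Flask dependency
    calls.append(environ["PATH_INFO"])
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"from app"]

app = HealthCheckMiddleware(inner_app)

def call(path):
    """Run one request through the stack; return (status, body)."""
    environ = {}
    setup_testing_defaults(environ)
    environ["PATH_INFO"] = path
    statuses = []
    body = b"".join(app(environ, lambda s, h, e=None: statuses.append(s)))
    return statuses[0], body

print(call("/health"))  # ('200 OK', b'ok')
print(call("/users"))   # ('200 OK', b'from app')
print(calls)            # ['/users'] - the inner app never saw /health
```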

Part 7 - Five Production Middleware Implementations

7.1 - Request ID Middleware

import uuid
from contextvars import ContextVar
from starlette.middleware.base import BaseHTTPMiddleware
from starlette.requests import Request
from starlette.responses import Response

# Store in a context var so a logging filter or formatter can read it
# without the request being passed around
request_id_ctx: ContextVar[str] = ContextVar("request_id", default="")

class RequestIDMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request: Request, call_next) -> Response:
        # Honour upstream request ID (from API gateway or calling service)
        request_id = request.headers.get("X-Request-ID") or str(uuid.uuid4())

        # Set on request state for handlers and dependencies
        request.state.request_id = request_id

        # Set in context var for logging
        token = request_id_ctx.set(request_id)

        try:
            response = await call_next(request)
        finally:
            request_id_ctx.reset(token)

        # Always return to client - makes debugging possible
        response.headers["X-Request-ID"] = request_id
        return response
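A ContextVar does not reach log output on its own. Something has to copy it onto each log record. One stdlib-only approach is a `logging.Filter` attached to the handler, sketched below. `RequestIDFilter` is a hypothetical name, and `request_id_ctx` is redeclared so the block runs standalone.

```python
import io
import logging
from contextvars import ContextVar

# Same declaration as in the middleware above, repeated so this runs standalone
request_id_ctx: ContextVar[str] = ContextVar("request_id", default="")


class RequestIDFilter(logging.Filter):
    """Annotate each record with the request ID of the current context."""
    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id_ctx.get()
        return True  # annotate only; never drop records


stream = io.StringIO()  # stands in for stdout or a log file
handler = logging.StreamHandler(stream)
# A filter on the handler also sees records propagated from child loggers
handler.addFilter(RequestIDFilter())
handler.setFormatter(logging.Formatter("%(levelname)s [%(request_id)s] %(message)s"))

logger = logging.getLogger("app")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False  # keep the demo output out of the root logger

token = request_id_ctx.set("abc-123")  # what the middleware does per request
try:
    logger.info("loading user")
finally:
    request_id_ctx.reset(token)
logger.info("outside any request")

print(stream.getvalue(), end="")
# INFO [abc-123] loading user
# INFO [] outside any request
```

Values set before `call_next` are visible to code running inside it, because the context is copied into the inner task. That is why handler and dependency log lines pick up the ID.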

7.2 - Timing and Structured Logging Middleware

import time
import logging
from starlette.middleware.base import BaseHTTPMiddleware
from starlette.requests import Request
from starlette.responses import Response

logger = logging.getLogger("access")

class StructuredLoggingMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request: Request, call_next) -> Response:
        start = time.perf_counter()

        # Capture request fields up front
        method = request.method
        path = request.url.path
        # The left-most X-Forwarded-For entry is client-supplied (proxies append
        # to the list): good enough for logs, not for security decisions
        client_ip = (
            request.headers.get("X-Forwarded-For", "").split(",")[0].strip()
            or (request.client.host if request.client else "unknown")
        )

        response = await call_next(request)

        duration_ms = (time.perf_counter() - start) * 1000

        logger.info(
            "http_request",
            extra={
                "method": method,
                "path": path,
                "status_code": response.status_code,
                "duration_ms": round(duration_ms, 1),
                "client_ip": client_ip,
                "request_id": getattr(request.state, "request_id", None),
                "user_id": getattr(request.state, "user_id", None),
            }
        )
        return response
Tip: Add timing middleware - it is the cheapest observability win available. Every request logs its duration. When P99 latency suddenly spikes from 50 ms to 5,000 ms, timing middleware tells you immediately. Combined with request IDs, you can find the slow endpoint and trace the slow request through all its database and service calls in minutes.
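Keys passed via `extra=` become attributes on the `LogRecord`, but the default formatter ignores them. Unless a formatter emits them, the structured fields in `StructuredLoggingMiddleware` never reach the log output. Below is a minimal hand-rolled JSON formatter as a stdlib-only sketch. `JSONFormatter` is a hypothetical name, not a library class.

```python
import io
import json
import logging

# Attribute names every LogRecord carries; anything else arrived via extra=
_BUILTIN_ATTRS = set(vars(logging.LogRecord("", logging.INFO, "", 0, "", None, None)))
_BUILTIN_ATTRS |= {"message", "asctime"}  # set later by Formatter.format


class JSONFormatter(logging.Formatter):
    """Render the message plus every extra= field as one JSON object per line."""
    def format(self, record: logging.LogRecord) -> str:
        payload = {"level": record.levelname, "event": record.getMessage()}
        for key, value in vars(record).items():
            if key not in _BUILTIN_ATTRS:
                payload[key] = value
        return json.dumps(payload, default=str)  # default=str tolerates non-JSON values


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JSONFormatter())

access_logger = logging.getLogger("access")
access_logger.addHandler(handler)
access_logger.setLevel(logging.INFO)
access_logger.propagate = False

access_logger.info(
    "http_request",
    extra={"method": "GET", "path": "/users", "status_code": 200, "duration_ms": 12.3},
)
record = json.loads(stream.getvalue())
print(record)
```

One constraint to keep in mind: `extra` keys must not collide with built-in `LogRecord` attributes such as `message` or `args`. If they do, `logging` raises `KeyError`.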

7.3 - CORS Middleware

CORS (Cross-Origin Resource Sharing) is a browser security mechanism. By default, a browser will not let a page's JavaScript read a response from a different origin (for example, app.example.com calling api.example.com). The server must explicitly permit it via response headers.

A preflight request is an OPTIONS request that the browser sends automatically before a non-simple cross-origin request. Examples include a request with an Authorization header, a JSON Content-Type, or a method such as PUT or DELETE. The preflight checks whether the server allows the real request. CORS middleware must answer these immediately - if the preflight fails, the browser never sends the actual request.

from starlette.middleware.cors import CORSMiddleware

app.add_middleware(
CORSMiddleware,
# Production: list only your actual frontend origins
allow_origins=["https://app.example.com", "https://admin.example.com"],
# Never combine allow_origins=["*"] with allow_credentials=True - list exact origins
allow_credentials=True, # allows cookies and Authorization headers
allow_methods=["GET", "POST", "PUT", "DELETE", "PATCH", "OPTIONS"],
allow_headers=["Authorization", "Content-Type", "X-Request-ID"],
expose_headers=["X-Request-ID", "X-Duration-Ms"], # headers JS can read
max_age=600, # preflight result cached for 600 seconds (10 minutes)
)

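What CORS middleware does per request:

  • Preflight (OPTIONS with Origin and Access-Control-Request-Method headers): answers directly with Access-Control-Allow-* headers, returning 200 when the origin, method, and requested headers are allowed and an error response otherwise. The route handler does not run.
  • Actual cross-origin request (carries an Origin header): passes through to the handler and adds Access-Control-Allow-Origin to the response if the origin is allowed. The handler still runs for a disallowed origin; CORS only controls whether the browser exposes the response to the page.
  • Request without an Origin header (e.g., a same-origin GET or a server-to-server call): passes through unchanged.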
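The preflight decision itself is small. The sketch below uses hypothetical helpers (`is_cors_preflight`, `preflight_response_headers`), not Starlette's internals. It assumes lower-cased header names, as in an ASGI scope, and a credentialed configuration like the one above. A real middleware would also build the error response and handle non-credentialed setups.

```python
from typing import Iterable, Optional


def is_cors_preflight(method: str, headers: dict[str, str]) -> bool:
    """A preflight is OPTIONS carrying Origin and Access-Control-Request-Method."""
    return (
        method.upper() == "OPTIONS"
        and "origin" in headers
        and "access-control-request-method" in headers
    )


def preflight_response_headers(
    headers: dict[str, str],
    allowed_origins: set[str],
    allowed_methods: set[str],
    allowed_headers: Iterable[str],
    max_age: int = 600,
) -> Optional[dict[str, str]]:
    """Return Access-Control-* headers for an allowed preflight, or None."""
    origin = headers.get("origin", "")
    method = headers.get("access-control-request-method", "")
    requested = {
        h.strip().lower()
        for h in headers.get("access-control-request-headers", "").split(",")
        if h.strip()
    }
    allowed_lower = {h.lower() for h in allowed_headers}
    if origin not in allowed_origins or method not in allowed_methods:
        return None
    if not requested <= allowed_lower:  # every requested header must be allowed
        return None
    return {
        "Access-Control-Allow-Origin": origin,  # echo one exact origin, never "*"
        "Access-Control-Allow-Credentials": "true",
        "Access-Control-Allow-Methods": ", ".join(sorted(allowed_methods)),
        "Access-Control-Allow-Headers": ", ".join(sorted(allowed_lower)),
        "Access-Control-Max-Age": str(max_age),
        "Vary": "Origin",  # caches must key the response on Origin
    }


incoming = {
    "origin": "https://app.example.com",
    "access-control-request-method": "PUT",
    "access-control-request-headers": "Authorization, Content-Type",
}
config = dict(
    allowed_origins={"https://app.example.com", "https://admin.example.com"},
    allowed_methods={"GET", "POST", "PUT", "DELETE", "PATCH"},
    allowed_headers=["Authorization", "Content-Type", "X-Request-ID"],
)
print(is_cors_preflight("OPTIONS", incoming))
print(preflight_response_headers(incoming, **config))
print(preflight_response_headers({**incoming, "origin": "https://evil.example"}, **config))
```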
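Danger: Browsers refuse a wildcard Access-Control-Allow-Origin: * on credentialed requests, so allow_origins=["*"] with allow_credentials=True is an invalid CORS configuration. If you need to send cookies or Authorization headers cross-origin, you must list the exact allowed origins. Be aware that recent Starlette versions do not reject this combination. Instead, they echo the caller's Origin back, which lets any website make credentialed requests. Wildcard origin is only appropriate for fully public APIs with no credentials.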

7.4 - Rate Limiting Middleware with Redis

import time
from starlette.middleware.base import BaseHTTPMiddleware
from starlette.requests import Request
from starlette.responses import JSONResponse, Response
import redis.asyncio as aioredis

redis_client = aioredis.from_url("redis://localhost:6379")

class RateLimitMiddleware(BaseHTTPMiddleware):
    """
    Fixed-window rate limiter: N requests per calendar minute per client IP.
    Redis INCR is atomic, so counts stay correct across workers and instances.
    """
    def __init__(self, app, requests_per_minute: int = 60):
        super().__init__(app)
        self.requests_per_minute = requests_per_minute

    def _get_client_key(self, request: Request) -> str:
        # The left-most X-Forwarded-For entry is whatever the client sent
        # (proxies append), so clients can spoof it to dodge the limit.
        # Behind your own proxy, prefer the entry that proxy adds.
        ip = (
            request.headers.get("X-Forwarded-For", "").split(",")[0].strip()
            or (request.client.host if request.client else "unknown")
        )
        # One bucket per minute - a new key (and a fresh count) every 60 seconds
        minute_bucket = int(time.time() // 60)
        return f"rate_limit:{ip}:{minute_bucket}"

    async def dispatch(self, request: Request, call_next) -> Response:
        # Skip rate limiting for health checks
        if request.url.path in {"/health", "/ready"}:
            return await call_next(request)

        key = self._get_client_key(request)

        # INCR is atomic; set the expiry only when the key is first created.
        # INCR and EXPIRE are two separate commands - use a pipeline or Lua
        # script if a key must never be left without a TTL.
        current = await redis_client.incr(key)
        if current == 1:
            await redis_client.expire(key, 60)

        if current > self.requests_per_minute:
            return JSONResponse(
                status_code=429,
                content={"error": "rate_limit_exceeded", "retry_after_seconds": 60},
                headers={"Retry-After": "60"},  # upper bound; the window may end sooner
            )

        response = await call_next(request)
        response.headers["X-RateLimit-Limit"] = str(self.requests_per_minute)
        response.headers["X-RateLimit-Remaining"] = str(
            max(0, self.requests_per_minute - current)
        )
        return response
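To test the windowing logic without Redis, you can model the same scheme in memory with an injectable clock. The sketch below uses a hypothetical `FixedWindowLimiter` that only works within one process. It mirrors the middleware: one counter per (client, window), where the window is `int(now // 60)`. It also computes how long until the window resets, rather than always answering 60.

```python
import math
import time
from typing import Callable


class FixedWindowLimiter:
    """Hypothetical in-process fixed-window counter (single process only)."""

    def __init__(self, limit: int, window_seconds: int = 60,
                 clock: Callable[[], float] = time.time):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock  # injectable so tests can control time
        self.counts: dict[tuple[str, int], int] = {}

    def hit(self, client: str) -> tuple[bool, int, int]:
        """Count one request; return (allowed, remaining, retry_after_seconds)."""
        now = self.clock()
        bucket = int(now // self.window)
        # Forget counters from earlier windows (Redis does this with EXPIRE)
        for key in [k for k in self.counts if k[1] < bucket]:
            del self.counts[key]
        key = (client, bucket)
        self.counts[key] = self.counts.get(key, 0) + 1  # Redis: INCR
        current = self.counts[key]
        retry_after = math.ceil((bucket + 1) * self.window - now)  # seconds to window end
        return current <= self.limit, max(0, self.limit - current), retry_after


fake_now = [120.0]
limiter = FixedWindowLimiter(limit=2, clock=lambda: fake_now[0])
print(limiter.hit("203.0.113.7"))  # (True, 1, 60)
print(limiter.hit("203.0.113.7"))  # (True, 0, 60)
print(limiter.hit("203.0.113.7"))  # (False, 0, 60)
fake_now[0] = 180.0                # next window: the count starts over
print(limiter.hit("203.0.113.7"))  # (True, 1, 60)
```

The same reset is the known weakness of fixed windows. A client can spend a full quota just before a boundary and another just after it, so up to twice the limit can arrive within about a second. A sliding window or token bucket smooths this out at the cost of more state.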

7.5 - JWT Authentication Middleware

import jwt  # PyJWT
from starlette.middleware.base import BaseHTTPMiddleware
from starlette.requests import Request
from starlette.responses import JSONResponse, Response

SECRET_KEY = "your-secret-key"  # placeholder: load from an env var or secret store
ALGORITHM = "HS256"
PUBLIC_PATHS = {"/health", "/ready", "/docs", "/redoc", "/openapi.json"}

class JWTAuthMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request: Request, call_next) -> Response:
        if request.url.path in PUBLIC_PATHS:
            return await call_next(request)

        auth_header = request.headers.get("Authorization", "")
        if not auth_header.startswith("Bearer "):
            return JSONResponse(
                status_code=401,
                content={"error": "missing_token"},
                headers={"WWW-Authenticate": "Bearer"},
            )

        token = auth_header.removeprefix("Bearer ")
        try:
            # Require exp and sub: a token missing either is rejected
            # (MissingRequiredClaimError is a subclass of InvalidTokenError)
            payload = jwt.decode(
                token, SECRET_KEY, algorithms=[ALGORITHM],
                options={"require": ["exp", "sub"]},
            )
        except jwt.ExpiredSignatureError:
            return JSONResponse(status_code=401, content={"error": "token_expired"})
        except jwt.InvalidTokenError:
            return JSONResponse(status_code=401, content={"error": "invalid_token"})

        # Attach decoded claims to request state - available to all handlers
        request.state.user_id = payload["sub"]
        request.state.user_role = payload.get("role", "user")

        return await call_next(request)
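The `startswith("Bearer ")` check above is case-sensitive. HTTP, however, defines the authentication scheme name as case-insensitive (RFC 7235, carried over into RFC 9110). As a result, a client sending `bearer <token>` is told `missing_token`. The sketch below shows a hypothetical `parse_bearer_token` helper that accepts any casing of the scheme and returns `None` for other schemes or malformed values.

```python
from typing import Optional


def parse_bearer_token(authorization: Optional[str]) -> Optional[str]:
    """Return the token from an Authorization header value, or None.

    The scheme match is case-insensitive; other schemes (e.g. Basic) and
    empty or space-containing tokens yield None.
    """
    if not authorization:
        return None
    scheme, _, token = authorization.strip().partition(" ")
    token = token.strip()
    if scheme.lower() != "bearer" or not token or " " in token:
        return None
    return token


print(parse_bearer_token("Bearer eyJhbGciOiJIUzI1NiJ9.e30.sig"))  # the token
print(parse_bearer_token("bearer abc"))                            # abc
print(parse_bearer_token("Basic dXNlcjpwYXNz"))                    # None
```

In `dispatch`, this would replace both the prefix check and `removeprefix`. You would call `token = parse_bearer_token(request.headers.get("Authorization"))` and return the 401 when it is `None`.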

Part 8 - Built-in Starlette Middleware

Starlette ships several ready-made middleware classes:

from starlette.middleware.trustedhost import TrustedHostMiddleware
from starlette.middleware.gzip import GZipMiddleware
from starlette.middleware.httpsredirect import HTTPSRedirectMiddleware

# Reject requests with Host headers not in the allowlist
# Prevents HTTP Host header attacks
app.add_middleware(
TrustedHostMiddleware,
allowed_hosts=["api.example.com", "*.example.com"],
)

# Compress responses larger than minimum_size bytes using gzip
# Typical savings: 70-80% size reduction on JSON responses
app.add_middleware(GZipMiddleware, minimum_size=1000)

# Redirect all HTTP requests to HTTPS
# Only add if your app directly receives HTTP (not behind Nginx with SSL termination)
app.add_middleware(HTTPSRedirectMiddleware)
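GZipMiddleware has a minimum_size because compression has fixed costs. The gzip container alone adds a 10-byte header and an 8-byte trailer, so very small bodies get larger rather than smaller. The stdlib-only check below uses made-up payloads to illustrate both ends.

```python
import gzip
import json

tiny = json.dumps({"ok": True}).encode()
rows = [{"id": i, "status": "active", "email": f"user{i}@example.com"} for i in range(200)]
large = json.dumps(rows).encode()

for label, body in [("tiny", tiny), ("large", large)]:
    compressed = gzip.compress(body)
    print(f"{label}: {len(body)} -> {len(compressed)} bytes")
```

The tiny payload grows after compression, while the repetitive list shrinks to a fraction of its size. Actual savings depend on your payloads, so measure before tuning minimum_size.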

Part 9 - Middleware vs Dependency Injection

This is a genuine architectural decision. The wrong choice makes code harder to test and maintain.

| Concern | Middleware | Depends() | Reason |
|---|---|---|---|
| Request ID generation | Yes | No | Must run for every request, before routing |
| Structured access logging | Yes | No | Needs request + response together |
| Response timing | Yes | No | Wraps the full request including dependency resolution |
| CORS | Yes | No | Must handle OPTIONS preflights before routing |
| Rate limiting (IP-level) | Yes | No | Applies uniformly, needs raw IP only |
| JWT token parsing (extract user) | Either | Preferred | Depends(): typed, testable, per-endpoint variation |
| Database session management | No | Yes | Needs cleanup after handler (yield pattern) |
| Permission checking | No | Yes | Varies per endpoint, needs type safety |
| Pagination parameters | No | Yes | Per-endpoint, needs a typed result |
| Rate limiting (per-user) | No | Yes | Needs authenticated user ID from Depends() |

The general rules:

  • If it applies uniformly to every request and does not need business logic context → middleware
  • If it varies per endpoint, needs the dependency injection graph, or produces a typed value the handler uses → Depends()
  • If it needs cleanup (like a database session) → Depends() with yield
A split that uses both for JWT handling:

# Middleware: parse the raw token, attach string to request.state
# Fast, no database call, applies to every request
class TokenParsingMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request, call_next):
        auth_header = request.headers.get("Authorization", "")
        # Only accept the Bearer scheme; anything else leaves raw_token empty
        token = auth_header.removeprefix("Bearer ") if auth_header.startswith("Bearer ") else ""
        request.state.raw_token = token
        return await call_next(request)

# Dependency: validate the token, load the user from DB, typed return value
# Only runs for endpoints that declare it; fully testable with dependency_overrides
# (AsyncSession, get_db, User, and verify_jwt are assumed to be defined elsewhere)
async def get_current_user(
    request: Request,
    db: AsyncSession = Depends(get_db),
) -> User:
    token = getattr(request.state, "raw_token", "")
    if not token:
        raise HTTPException(status_code=401, detail="Missing token")
    payload = verify_jwt(token)
    user = await db.get(User, payload["sub"])
    if not user:
        raise HTTPException(status_code=401, detail="User not found")
    return user
Warning: Middleware that raises an exception before calling call_next means the request never reaches the handler. Any resources allocated before the raise must be explicitly cleaned up. Always use try/finally around resource allocation in middleware to guarantee cleanup even when exceptions occur.

Part 10 - Putting It Together: A Production Middleware Stack

from fastapi import FastAPI
from starlette.middleware.cors import CORSMiddleware
from starlette.middleware.gzip import GZipMiddleware
from starlette.middleware.trustedhost import TrustedHostMiddleware

from myapp.middleware import (
RequestIDMiddleware,
StructuredLoggingMiddleware,
JWTAuthMiddleware,
RateLimitMiddleware,
)

app = FastAPI(title="My API", version="1.0.0")

# Order matters - last added = outermost.
# Read add_middleware calls bottom-up to understand actual request flow.

# Innermost: auth - closest to handler
app.add_middleware(JWTAuthMiddleware)

# Rate limiting - added after auth, so it runs BEFORE auth on the request path:
# floods are rejected before any token work. It keys on client IP, not user.
app.add_middleware(RateLimitMiddleware, requests_per_minute=100)

# Structured logging - wraps rate limiting, auth, and handler; logs final status
# (including 401s and 429s) and duration
app.add_middleware(StructuredLoggingMiddleware)

# Request ID - runs early so all inner middleware can log it
app.add_middleware(RequestIDMiddleware)

# GZip - compress responses before they leave the Python process
app.add_middleware(GZipMiddleware, minimum_size=500)

# CORS - outside auth and rate limiting, so preflight OPTIONS requests
# (which never carry credentials) are answered instead of rejected with 401
app.add_middleware(
CORSMiddleware,
allow_origins=["https://app.example.com"],
allow_credentials=True,
allow_methods=["*"],
allow_headers=["*"],
)

# Trusted host - outermost guard, rejects invalid Host headers first
app.add_middleware(
TrustedHostMiddleware,
allowed_hosts=["api.example.com", "localhost"],
)

Effective request execution order (outermost to innermost):

TrustedHostMiddleware
CORSMiddleware
GZipMiddleware
RequestIDMiddleware
StructuredLoggingMiddleware
RateLimitMiddleware
JWTAuthMiddleware
Handler

Graded Practice

Level 1 - Predict the Execution Order

Given this setup:

app.add_middleware(DatabaseMiddleware)
app.add_middleware(CacheMiddleware)
app.add_middleware(AuthMiddleware)
app.add_middleware(RequestIDMiddleware)
  1. List the request execution order (outer to inner).
  2. List the response execution order (inner to outer).
  3. If AuthMiddleware returns a 401 response without calling call_next, which middleware still executes on the response path? Which does not?
  4. RequestIDMiddleware sets request.state.request_id. Can DatabaseMiddleware read this value? Can the handler?
Answer:
  1. Request execution order (outer to inner): RequestIDMiddleware → AuthMiddleware → CacheMiddleware → DatabaseMiddleware → Handler

    Last added (RequestIDMiddleware) is outermost, so it runs first on requests.

  2. Response execution order (inner to outer): Handler → DatabaseMiddleware → CacheMiddleware → AuthMiddleware → RequestIDMiddleware

    Response flows outward through the same layers in reverse - each middleware's code after await call_next(request) runs in this order.

  3. If AuthMiddleware returns a 401 without calling call_next:

    • RequestIDMiddleware does execute on the response path - it called await call_next(request) which returned the 401 response from AuthMiddleware. Its post-call_next code runs normally.
    • CacheMiddleware and DatabaseMiddleware do NOT execute at all - AuthMiddleware never called call_next, so the inner layers were never invoked. This is correct: no database connections are opened for unauthenticated requests.
  4. Yes to both. RequestIDMiddleware (outermost) runs first on the request path and sets request.state.request_id. Each BaseHTTPMiddleware layer builds its own Request wrapper, but request.state is backed by the shared ASGI scope (scope["state"]), so every inner middleware and the handler see the same state. DatabaseMiddleware, CacheMiddleware, and the handler can all read request.state.request_id.

Level 2 - Debug This Middleware

A developer reports that their timing middleware always shows near-zero durations:

class BrokenTimingMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request: Request, call_next) -> Response:
        response = await call_next(request)
        start = time.perf_counter()  # BUG
        duration_ms = (time.perf_counter() - start) * 1000
        response.headers["X-Duration-Ms"] = f"{duration_ms:.1f}"
        return response

And another developer's auth middleware lets requests through even when the token is invalid:

class BrokenAuthMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request: Request, call_next) -> Response:
        token = request.headers.get("Authorization", "")
        try:
            payload = jwt.decode(token, SECRET_KEY, algorithms=["HS256"])
            request.state.user = payload
        except jwt.InvalidTokenError:
            pass  # BUG: silently ignores invalid tokens
        return await call_next(request)

Identify both bugs and write the fixed versions.

Answer:

Bug 1: Timer starts after call_next - always measures near-zero duration.

start = time.perf_counter() is called after await call_next(request) has already completed. The timing measures only the nanoseconds between start being set and the second perf_counter() call - effectively zero. All actual handler and middleware execution time occurred before start was set.

Fixed timing middleware:

class TimingMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request: Request, call_next) -> Response:
        start = time.perf_counter()  # BEFORE call_next
        response = await call_next(request)  # all inner middleware + handler run here
        duration_ms = (time.perf_counter() - start) * 1000
        response.headers["X-Duration-Ms"] = f"{duration_ms:.1f}"
        return response

Bug 2: pass on jwt.InvalidTokenError - invalid tokens are silently accepted.

When jwt.decode raises InvalidTokenError, the except block executes pass and the middleware proceeds to await call_next(request). The handler receives a request where request.state.user is not set. The handler either crashes with AttributeError or processes the request as if authentication succeeded. This is a critical security vulnerability.

Fixed auth middleware:

class AuthMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request: Request, call_next) -> Response:
        if request.url.path in PUBLIC_PATHS:
            return await call_next(request)

        token = request.headers.get("Authorization", "").removeprefix("Bearer ")
        if not token:
            return JSONResponse(
                status_code=401,
                content={"error": "missing_token"},
                headers={"WWW-Authenticate": "Bearer"},
            )
        try:
            payload = jwt.decode(token, SECRET_KEY, algorithms=["HS256"])
            request.state.user = payload
        except jwt.ExpiredSignatureError:
            return JSONResponse(status_code=401, content={"error": "token_expired"})
        except jwt.InvalidTokenError:
            # NEVER pass silently on a security check - return 401 immediately
            return JSONResponse(status_code=401, content={"error": "invalid_token"})

        return await call_next(request)

The critical fix: invalid tokens must return an error response immediately, before calling call_next. Silently catching security exceptions and continuing is always a critical bug.

Level 3 - Design Challenge

You are building a SaaS platform. The product team requires:

  1. Every API request must log: timestamp, request ID, tenant ID, user ID, method, path, status, duration, and client IP as structured JSON (searchable in Elasticsearch).
  2. Each tenant has a configurable rate limit stored in the database (not a fixed number in code).
  3. Some endpoints are public (no auth, no rate limit): /health, /v1/auth/login, /v1/auth/register.
  4. Responses over 2 KB must be gzip-compressed.
  5. The system must handle 10,000 requests per second per service instance.

Design the full middleware stack. For each requirement, decide: middleware or Depends()? Why? Identify which requirements cannot be met with BaseHTTPMiddleware. Estimate the per-request overhead.

Answer:

Decision matrix:

| Requirement | Implementation | Reason |
|---|---|---|
| Structured access logging | Middleware | Needs request + response together; applies uniformly |
| Request ID generation | Middleware | Must be first; uniform; no business logic |
| Tenant ID extraction | Middleware (header read) + Depends() (DB validation) | Header parse in middleware; DB lookup needs DI graph |
| User authentication | Depends() primarily | Typed User return; per-endpoint variation; testable |
| Per-tenant rate limiting | Depends() | Needs tenant from prior Depends; DB config lookup |
| GZip compression | GZipMiddleware (built-in pure ASGI) | Response transform, uniform, no body buffering |

Why per-tenant rate limiting is Depends() not middleware:

The rate limit value is per-tenant and stored in the database. Middleware has no access to the FastAPI dependency injection graph - it cannot call Depends(get_db). A dependency can:

# Assumes time, Depends, HTTPException, Redis, Tenant, get_tenant and
# get_redis are imported/defined elsewhere
async def per_tenant_rate_limit(
    tenant: Tenant = Depends(get_tenant),
    redis: Redis = Depends(get_redis),
) -> None:
    key = f"rate:{tenant.id}:{int(time.time() // 60)}"
    count = await redis.incr(key)
    await redis.expire(key, 60)
    if count > tenant.requests_per_minute:
        raise HTTPException(429, "Rate limit exceeded")

Does requirement 5 (10,000 RPS) conflict with BaseHTTPMiddleware?

Partly. GZipMiddleware is itself a pure ASGI middleware that compresses as the response streams out, so compression adds no BaseHTTPMiddleware overhead. The cost to watch is BaseHTTPMiddleware itself. On every request, each layer adds a task, an in-memory stream for the response body, and Request/Response wrappers. At 10,000 RPS even a fraction of a millisecond per layer adds up. That makes the hot-path middleware (request ID, structured logging) good candidates for pure ASGI rewrites. None of the requirements is strictly impossible with BaseHTTPMiddleware, but requirement 5 makes its overhead the main risk. Benchmark before and after rather than assuming.

Full middleware stack (add_middleware order - innermost first):

# GZip: pure ASGI, streaming compression
app.add_middleware(GZipMiddleware, minimum_size=2048)

# Trusted host guard
app.add_middleware(TrustedHostMiddleware, allowed_hosts=["*.saas.example.com"])

# CORS
app.add_middleware(
CORSMiddleware,
allow_origins=["https://app.saas.example.com"],
allow_credentials=True,
allow_methods=["*"],
allow_headers=["*"],
)

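# Structured JSON logging - just inside RequestID, so it wraps everything else and
# sees the final status + duration; after call_next it can read tenant/user IDs
# from request.state if the dependencies store them there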
app.add_middleware(StructuredJSONLoggingMiddleware)

# Request ID - outermost so all inner middleware log it
app.add_middleware(RequestIDMiddleware)

Protected routes - dependencies applied at router level:

router = APIRouter(
dependencies=[
Depends(get_tenant), # parse X-Tenant-ID, validate in DB/cache
Depends(get_current_user), # validate JWT, load user
Depends(per_tenant_rate_limit), # check Redis against tenant.requests_per_minute
]
)

Per-request overhead estimate:

  • RequestIDMiddleware: UUID4 generation approx 0.05 ms
  • StructuredJSONLoggingMiddleware: timer + dict + JSON serialize approx 0.1 ms
  • CORSMiddleware: header check + write approx 0.02 ms
  • GZipMiddleware: 2-5 ms for responses over 2 KB; zero for small JSON responses
  • get_tenant Depends: Redis GET cache hit approx 0.3 ms; DB query on cache miss approx 2-5 ms
  • get_current_user Depends: JWT decode approx 0.1 ms + optional DB approx 0.5 ms
  • per_tenant_rate_limit Depends: Redis INCR approx 0.3 ms

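Total overhead with all cache hits: approximately 0.9 ms per request, or about 1.4 ms with the optional user DB lookup. This excludes gzip, whose cost scales with response size.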
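Summing the list makes the total explicit. The numbers below are the rough estimates from the list above, not measurements. Gzip is left out because its cost depends on response size. The Redis figure assumes one GET in get_tenant plus the INCR and EXPIRE in per_tenant_rate_limit.

```python
# Per-request estimates from the list above, in milliseconds (all cache hits)
estimates_ms = {
    "request_id_uuid": 0.05,
    "json_logging": 0.1,
    "cors": 0.02,
    "get_tenant_cache_hit": 0.3,
    "jwt_decode": 0.1,
    "rate_limit_incr": 0.3,
}
optional_user_db_ms = 0.5

base = sum(estimates_ms.values())
with_user_db = base + optional_user_db_ms
print(f"{base:.2f} ms without the user DB lookup, {with_user_db:.2f} ms with it")

# Redis load per instance: tenant GET + rate-limit INCR + EXPIRE per request
rps = 10_000
redis_commands_per_request = 3
print(f"{rps * redis_commands_per_request:,} Redis commands/s per instance")
```

This prints 0.87 ms and 1.37 ms, plus 30,000 Redis commands per second per instance. That Redis load is worth capacity-planning alongside the application servers.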

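To sustain 10,000 RPS, cache tenant records in Redis with a 60-second TTL to eliminate per-request DB queries. If get_current_user loads the user from the database, cache that lookup too, with a TTL no longer than the token's remaining lifetime. The JWT signature check itself (approx 0.1 ms above) is already cheaper than a Redis round trip (approx 0.3 ms), so caching it gains nothing. Keep concurrency and throughput apart. Because the middleware and dependency operations are awaitable, one Uvicorn worker can hold thousands of in-flight requests without blocking threads. However, CPU-bound work (UUID generation, JSON serialization, JWT decoding) still runs on a single event loop per worker. Plan for several workers per instance, and measure.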

Key Takeaways

  • Middleware is a stack, not a queue. The last add_middleware call makes that middleware the outermost layer - it runs first on requests and last on responses. Read add_middleware calls bottom-up to understand actual request execution order.
  • The onion model: request flows inward through middleware layers, the handler executes at the core, the response flows outward through the same layers in reverse. Every middleware wraps everything inside it - both for requests and responses.
  • WSGI middleware is a callable wrapping a callable: class MW: def __init__(self, app): self.app = app; def __call__(self, environ, start_response): .... It is synchronous and composable. Pure ASGI middleware follows the same wrapping pattern with async (scope, receive, send).
  • BaseHTTPMiddleware provides a simple async def dispatch(request, call_next) interface. It adds per-request overhead (an extra task and an in-memory stream for the response body). It also has documented limitations, such as ContextVar changes made in an endpoint not propagating back to the middleware. Prefer pure ASGI middleware for streaming-heavy or high-throughput paths.
  • Pure ASGI middleware (async def __call__(scope, receive, send)) operates on raw streams. More complex to write but supports streaming, has lower memory overhead, and is preferred for high-throughput APIs.
  • Flask before_request/after_request hooks are middleware-like but Flask-specific. They have access to the Flask request context (g, request) and are simpler for Flask-internal concerns. WSGI middleware (app.wsgi_app = MW(app.wsgi_app)) is used for cross-app, infrastructure-level concerns.
  • Middleware is for uniform, infrastructure-level concerns: request IDs, timing, structured logging, CORS, IP-level rate limiting, trusted host validation, gzip compression. It runs for every request without variation.
  • Depends() is for business-logic-aware, per-endpoint concerns: JWT validation returning a typed User, database sessions needing cleanup via yield, per-endpoint permissions, per-user or per-tenant rate limiting. Dependencies are testable, typed, and composable.
  • Middleware order is a security property. Authentication must execute before any code that trusts request.state.user. Getting middleware order wrong creates silent security holes that unit tests will not catch.
  • Middleware that raises before call_next means the handler never runs and no inner middleware processes the request. Resources allocated before the raise must be cleaned up in finally blocks.
  • Never put business logic in middleware. Business logic in middleware is invisible to handler tests, cannot be varied per endpoint, and creates tight coupling. If you are writing if request.url.path == "/some-endpoint": inside middleware, that logic belongs in a dependency or the handler.
© 2026 EngineersOfAI. All rights reserved.