
Service Mesh and Load Balancing

The Model Update That Broke Production

It was a routine model update. The recommendation system was being upgraded from v12 to v13. The new model showed 3.1% improvement in NDCG@10 on A/B testing. The deployment procedure: push the new container, update the deployment, wait for pods to roll.

Three minutes into the rollout, the on-call engineer's phone lit up. The error rate on the recommendation endpoint was 23% and P99 latency was 4.2 seconds, against a baseline P99 of 180ms. Users were getting empty recommendation carousels, which meant no content to click, which meant engagement was collapsing.

The root cause took 40 minutes to find: the v13 model had a memory allocation bug that caused it to crash with an out-of-memory error on 15% of requests. During the rolling deployment, some pods were running v13 and some v12. The v13 pods crashed and Kubernetes marked them unhealthy, but the load balancer was still sending traffic to them for 10-30 seconds after the crashes because health check intervals were set to 30 seconds. Each request hitting a v13 pod hung for 30 seconds before timing out.

The deeper problem: there was no traffic control. The deployment strategy was "replace all old pods with new pods," with no ability to say "send 5% of traffic to v13, watch the error rate, stop if it exceeds 1%." There was no circuit breaker to automatically stop sending requests to crashing pods. There was no way to instantly roll back - rolling back a Kubernetes deployment still required waiting for the rolling restart.

This is exactly the problem service mesh solves. After adopting Istio with proper traffic management, the same team can now deploy a new model version, control what percentage of traffic it receives, set automated circuit breakers that stop routing to unhealthy pods within seconds, and roll back to the previous version by changing a single configuration field - with zero downtime and zero service disruption.

This lesson covers the concepts and code to build that capability in production ML serving infrastructure.


Why This Exists - The Problem Service Mesh Solves

As ML systems grow from one model to many models, and from a single service to a mesh of services, three categories of operational problems emerge:

Observability gap: Service A calls Service B calls Service C. A request fails. Which service failed? Which network hop was slow? Without a uniform observability layer, you are reading logs from three different services with different formats and correlating them manually. This can take hours.

Security complexity: Service-to-service communication within a cluster should be encrypted and authenticated. But implementing mTLS manually in every service in your ML platform - model servers, feature stores, preprocessing pipelines, monitoring services - means writing the same TLS configuration code dozens of times, and rotating certificates requires touching every service.

Traffic management complexity: Implementing circuit breaking, retries, timeouts, and canary deployments in application code creates inconsistency. Each team implements slightly different retry logic, different timeout defaults, different health check behavior. One team's aggressive retry policy sends amplified traffic to an already-overloaded service.

A service mesh solves all three by moving this logic out of application code and into a dedicated infrastructure layer - a set of sidecar proxies that intercept all network traffic and apply consistent policies.


Historical Context - From Hand-Rolled to Platform-Level Networking

In the early days of microservices (2013-2016), teams implemented resilience patterns in application code. Netflix's Hystrix library (2012) was the dominant approach: circuit breakers, fallbacks, and bulkheads implemented in the application itself. Every service that called another service integrated Hystrix. This worked, but it meant networking logic was distributed across hundreds of services in dozens of different languages.

The insight that led to service meshes: if every service in your cluster gets a sidecar proxy that intercepts all traffic, you can implement networking policies uniformly once in the proxy rather than N times in application code. Envoy, developed at Lyft by Matt Klein and open-sourced in 2016, was the first widely-adopted proxy built for this role. It was designed from the ground up to be programmatically configurable - the "data plane" controlled by a separate "control plane."

Istio, a collaboration between Google, IBM, and Lyft, launched in 2017 as a control plane for Envoy. Linkerd (originally built on Twitter's Finagle library, with its proxy rewritten in Rust for Linkerd 2.x in 2018) offered a lighter-weight alternative. By 2020, service mesh adoption in production was mainstream for large ML platforms.

The specific connection to ML: as organizations moved from monolithic ML training systems to microservices-based ML platforms (feature stores, model registries, serving layers, monitoring systems), the same networking operational problems that drove service mesh adoption in web microservices appeared in ML infrastructure.


Core Concepts

The Sidecar Proxy Pattern

A service mesh injects a proxy container (sidecar) alongside every application container in a pod. The sidecar intercepts all inbound and outbound network traffic using iptables rules injected into the pod's network namespace. The application code never needs to change - from the application's perspective, it is still making normal network calls.

The sidecar handles:

  • mTLS: Encrypts and authenticates all inter-service communication automatically
  • Load balancing: Distributes requests across service replicas
  • Circuit breaking: Stops sending requests to unhealthy backends
  • Retries and timeouts: Configurable per-route
  • Observability: Records metrics, traces, and logs for every request

Istio Architecture - Control Plane and Data Plane

Istio's architecture follows the control plane / data plane separation that Envoy was designed around:

  • Data plane: The Envoy sidecars injected into every pod. They handle the actual packet forwarding, load balancing, and policy enforcement. Data plane state is local to each proxy.
  • Control plane (istiod): The central component that holds the desired state of the mesh (routing rules, security policies, service discovery) and pushes it to all data plane proxies via the xDS API.

The xDS ("x Discovery Service", where x stands for the resource type) API is the protocol through which istiod configures Envoy proxies:

  • LDS (Listener Discovery): "Create a listener on port 50051 with these filters"
  • RDS (Route Discovery): "For requests to model-server, use these routing rules"
  • CDS (Cluster Discovery): "model-server has these 8 backend endpoints"
  • EDS (Endpoint Discovery): "The current healthy endpoints for model-server are..."

This push-based configuration means Envoy proxies update their routing configuration within seconds of a change, without any rolling restart.

Linkerd vs Istio

Both solve the same problem but with different philosophies:

| Dimension | Istio | Linkerd |
| --- | --- | --- |
| Proxy | Envoy (C++) | Linkerd2-proxy (Rust) |
| Complexity | High - many features | Low - focused on essentials |
| Resource overhead | ~50MB RAM per pod | ~10MB RAM per pod |
| mTLS | Yes (mutual TLS) | Yes (automatic) |
| Traffic management | Very flexible | Basic |
| Learning curve | Steep | Gentle |
| Best for ML use case | Complex traffic policies, multi-cluster | Simple observability, security |

For ML platforms that need advanced canary deployments, fine-grained traffic splitting, and complex routing policies (route to GPU-accelerated replicas only for large batches), Istio is the better choice. For teams that primarily want automatic mTLS and observability without the operational complexity, Linkerd is sufficient.

Load Balancing Algorithms

The choice of load balancing algorithm directly affects model serving performance. Different algorithms work best in different ML serving scenarios:

Round Robin: Simple rotation. Request 1 goes to backend 1, request 2 to backend 2, etc. Works well when all backends have similar performance. Problem for ML: if some requests require large batches (slow) and others are single-item (fast), round robin creates hot spots on the backends currently handling slow requests.

Least Connections (Least Requests): Send to the backend with the fewest in-flight requests. Better than round robin for ML serving because it naturally handles variable-duration inference. A backend stuck processing a large batch will not receive new requests until it finishes.

Consistent Hashing: Map request keys deterministically to backends. The same user ID, item ID, or request key always routes to the same backend. Critical for caching: if your model server maintains an in-memory KV-cache of computed embeddings for popular items, consistent hashing ensures requests for the same item hit the same replica (and thus the same cache), rather than spreading cache across all replicas.

Power of Two Choices (P2C): Randomly select 2 backends and send the request to the less loaded one. Provides near-optimal distribution without the coordination cost of maintaining global state. Envoy's LEAST_REQUEST load balancer uses P2C by default because of its excellent balance of simplicity and load distribution quality.
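
The difference is easy to see in a toy simulation (not part of the lesson's production code): requests of mixed duration are assigned either round-robin or by power-of-two-choices, and we track the worst backlog any single backend accumulates. The drain model is deliberately simplified, so only the relative numbers matter.

import random

def simulate(strategy: str, backends: int = 8, requests: int = 10_000) -> int:
    """Return the peak in-flight work observed on any backend."""
    in_flight = [0] * backends   # outstanding work per backend, in abstract units
    peak, rr = 0, 0
    for _ in range(requests):
        cost = 10 if random.random() < 0.1 else 1   # 10% of requests are slow batches
        if strategy == "round_robin":
            target = rr % backends
            rr += 1
        else:  # power of two choices: sample two, pick the less loaded
            a, b = random.sample(range(backends), 2)
            target = a if in_flight[a] <= in_flight[b] else b
        in_flight[target] += cost
        peak = max(peak, in_flight[target])
        in_flight = [max(0, x - 1) for x in in_flight]   # each backend drains one unit per arrival
    return peak

random.seed(0)
print("round robin peak backlog:  ", simulate("round_robin"))
print("power of two peak backlog: ", simulate("power_of_two"))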

L4 vs L7 Load Balancing

L4 load balancing (Layer 4, transport layer): Routes TCP connections to backends. The load balancer sees source/destination IP and port, but not the application payload. Efficient but coarse-grained. Connections are routed at establishment time and stay sticky to one backend. For gRPC (which multiplexes many RPCs over one TCP connection), L4 load balancing means all RPCs from one client go to one backend.

L7 load balancing (Layer 7, application layer): Routes individual HTTP/2 requests (or gRPC RPCs) to backends. The load balancer inspects the request headers, method, and body. It can route based on content, route gRPC calls individually (not per connection), and implement application-aware policies like rate limiting per API key.

For ML serving, L7 load balancing is essential:

  • gRPC requires L7 to route individual RPCs, not connections
  • Canary deployments require routing based on request headers or percentages at the request level
  • Per-user rate limiting requires reading user identity from request headers
  • A/B testing requires routing specific users to specific model versions

NGINX Ingress with gRPC support and Envoy both provide L7 load balancing for ML serving.
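
For gRPC clients inside the cluster there is also a client-side option: let the gRPC channel resolve all pod IPs via DNS and balance individual RPCs itself. A minimal sketch, assuming grpcio and the headless Service defined later in this lesson; the stub line is a placeholder for your own generated code.

import grpc

# A headless Service (clusterIP: None) makes DNS return every pod IP, so the
# round_robin policy can spread RPCs across all replicas instead of pinning
# everything to one connection.
channel = grpc.insecure_channel(
    "dns:///inference-headless.ml-serving.svc.cluster.local:50051",
    options=[
        ("grpc.lb_policy_name", "round_robin"),  # balance per RPC, not per connection
        ("grpc.enable_retries", 1),
    ],
)
# stub = inference_pb2_grpc.InferenceServiceStub(channel)  # hypothetical generated stub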

mTLS for Service-to-Service Security

Mutual TLS (mTLS) means both parties in a connection authenticate each other with certificates. Standard TLS only authenticates the server (the client verifies the server's certificate but the server does not verify the client's).

In ML serving, mTLS between your feature store, model server, and prediction logger ensures:

  1. The feature store only responds to authenticated model servers (not arbitrary pods in the cluster)
  2. The prediction logger only accepts writes from authenticated model servers
  3. All feature values and prediction data are encrypted in transit

Istio implements mTLS automatically. When Istio is installed, it issues a certificate to every pod's sidecar using SPIFFE (Secure Production Identity Framework for Everyone) IDs. All traffic between sidecars is automatically encrypted and authenticated without any application code changes.
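
For contrast, this is roughly what per-service mTLS looks like when wired by hand instead of being handled by the sidecars - a hedged sketch with placeholder certificate paths, shown only to illustrate the code the mesh makes unnecessary.

import grpc

def mtls_channel(target: str) -> grpc.Channel:
    # Both sides present certificates: the client loads its own key/cert pair
    # plus the CA bundle it uses to verify the server.
    credentials = grpc.ssl_channel_credentials(
        root_certificates=open("ca.pem", "rb").read(),
        private_key=open("client-key.pem", "rb").read(),
        certificate_chain=open("client-cert.pem", "rb").read(),
    )
    return grpc.secure_channel(target, credentials)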


Code Examples

Kubernetes Service and Ingress YAML

# model-service.yaml
# Kubernetes Service for ML model serving
# ClusterIP: internal cluster access only
# NodePort: accessible from outside the cluster on each node's IP
# LoadBalancer: provisions cloud load balancer (AWS ALB, GCP LB)
apiVersion: v1
kind: Service
metadata:
name: inference-service
namespace: ml-serving
labels:
app: inference
version: v2
annotations:
# For AWS NLB L4 load balancing
service.beta.kubernetes.io/aws-load-balancer-type: nlb
spec:
type: ClusterIP # Internal only - Istio Ingress Gateway handles external access
selector:
app: inference
version: v2
ports:
- name: grpc
port: 50051
targetPort: 50051
protocol: TCP
- name: http-metrics
port: 8080
targetPort: 8080
protocol: TCP
# Session affinity: same client IP always routes to same backend
# Useful for models that maintain per-session context
sessionAffinity: ClientIP
sessionAffinityConfig:
clientIP:
timeoutSeconds: 300 # 5-minute affinity window
---
# Headless service for StatefulSets (e.g., model servers with persistent GPU state)
apiVersion: v1
kind: Service
metadata:
name: inference-headless
namespace: ml-serving
spec:
clusterIP: None # Headless - DNS resolves to individual pod IPs
selector:
app: inference
ports:
- port: 50051
---
# NGINX Ingress for external access with gRPC support
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: ml-serving-ingress
namespace: ml-serving
annotations:
kubernetes.io/ingress.class: "nginx"
# Enable gRPC backend protocol
nginx.ingress.kubernetes.io/backend-protocol: "GRPC"
# Enable gRPC-specific settings
nginx.ingress.kubernetes.io/grpc-backend: "true"
# TLS termination at ingress
nginx.ingress.kubernetes.io/ssl-redirect: "true"
# Rate limiting: 100 requests per second per client IP
nginx.ingress.kubernetes.io/limit-rps: "100"
    # Retry the next upstream replica on connection errors and timeouts
    nginx.ingress.kubernetes.io/proxy-next-upstream: "error timeout"
    nginx.ingress.kubernetes.io/proxy-next-upstream-tries: "2"
# Timeouts for ML inference (may need longer for large batch)
nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
spec:
tls:
- hosts:
- api.inference.example.com
secretName: inference-tls-secret
rules:
- host: api.inference.example.com
http:
paths:
- path: /inference.InferenceService
pathType: Prefix
backend:
service:
name: inference-service
port:
number: 50051

Istio VirtualService for Canary Deployment

# Canary deployment for ML model upgrade: v2 (current) -> v3 (new)
# Strategy: start at 5% traffic, monitor, increase incrementally

# DestinationRule: define the subset versions
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
name: inference-destination-rule
namespace: ml-serving
spec:
host: inference-service
# Use LEAST_CONN for ML serving - handles variable-latency inference
trafficPolicy:
loadBalancer:
simple: LEAST_CONN
connectionPool:
tcp:
maxConnections: 1000
http:
h2UpgradePolicy: UPGRADE # Force HTTP/2 for gRPC
http2MaxRequests: 1000
maxRequestsPerConnection: 0 # Unlimited - connection reuse
# Outlier detection: stop sending to unhealthy pods
# This is the circuit breaker for the service
outlierDetection:
# Eject backend after 5 consecutive 5xx errors in 1 second
consecutive5xxErrors: 5
interval: 1s
# Remove from load balancer pool for 30 seconds
baseEjectionTime: 30s
# Maximum 50% of backends can be ejected simultaneously
maxEjectionPercent: 50
subsets:
- name: v2
labels:
version: v2
- name: v3
labels:
version: v3
---
# VirtualService: traffic splitting between v2 and v3
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: inference-virtual-service
namespace: ml-serving
spec:
hosts:
- inference-service
http:
# Route 1: Internal traffic from A/B testing service gets v3
- match:
- headers:
x-ab-group:
exact: "experiment-v3"
route:
- destination:
host: inference-service
subset: v3
weight: 100

# Route 2: 5% canary traffic to v3
- route:
- destination:
host: inference-service
subset: v2
weight: 95
- destination:
host: inference-service
subset: v3
weight: 5 # Increase to 10, 25, 50, 100 as confidence grows
timeout: 500ms # Hard timeout - fail fast if model is slow
retries:
attempts: 2
perTryTimeout: 250ms
# Only retry on connection failures - NOT on 5xx (model errors are not transient)
retryOn: connect-failure,refused-stream,retriable-4xx

Envoy Circuit Breaker Configuration

# Envoy circuit breaker configuration for ML serving
# Applied as Envoy filter via Istio EnvoyFilter CRD
# Or directly in Envoy static/dynamic config

# Envoy calls the ML model server cluster with circuit breaking
static_resources:
clusters:
- name: model_server
connect_timeout: 0.5s
type: STRICT_DNS
lb_policy: LEAST_REQUEST # Best for variable-latency ML inference

# Circuit breaker settings
circuit_breakers:
thresholds:
# Default priority
- priority: DEFAULT
# Max pending requests before circuit opens
max_pending_requests: 100
# Max concurrent requests to all backends
max_requests: 1000
# Max total connections (for HTTP/2, usually 1 per upstream)
max_connections: 100
# Max retry attempts queued
max_retries: 5
# High priority (for health checks, admin)
- priority: HIGH
max_pending_requests: 10
max_requests: 50

# Outlier detection = circuit breaker per endpoint
outlier_detection:
# Consecutive 5xx errors before ejection
consecutive_5xx: 5
# Consecutive gateway errors before ejection
consecutive_gateway_failure: 5
# Check interval
interval: 1s
# Base ejection duration
base_ejection_time: 30s
# Max % of hosts that can be ejected
max_ejection_percent: 50
      # Success rate outlier detection: eject hosts whose success rate falls more
      # than 1.9 standard deviations below the cluster average
      success_rate_minimum_hosts: 3
      success_rate_request_volume: 100
      success_rate_stdev_factor: 1900  # factor / 1000 = 1.9 standard deviations

load_assignment:
cluster_name: model_server
endpoints:
- lb_endpoints:
- endpoint:
address:
socket_address:
address: inference-service.ml-serving.svc.cluster.local
port_value: 50051

Consistent Hash Load Balancer in Python

"""
Consistent hash load balancer implementation.
Used when you need request affinity - same key always routes to same backend.

ML use cases:
- Model server with in-memory KV-cache: route requests for the same
item/user to the same backend to maximize cache hit rate
- Stateful inference sessions: route session_id to same backend
- Embedding computation: co-locate requests for same entity

Uses virtual nodes (vnodes) for better distribution uniformity.
"""
import bisect
import hashlib
import random
from typing import Optional


class ConsistentHashLoadBalancer:
"""
Consistent hash ring with virtual nodes.

Virtual nodes: each physical backend gets `vnodes` positions on the ring.
More virtual nodes = better distribution uniformity but more memory.
Default 150 vnodes per backend gives good distribution with small cluster sizes.

Ring property: adding/removing a backend only remaps ~1/N of keys,
not all keys. This minimizes cache invalidation when scaling.
"""

def __init__(self, vnodes: int = 150):
self.vnodes = vnodes
self._ring: list[int] = [] # Sorted list of hash positions
self._ring_to_backend: dict[int, str] = {} # hash position -> backend
self._backends: set[str] = set()

def add_backend(self, backend: str) -> None:
"""
Add a backend server to the ring.
Creates `vnodes` virtual positions for this backend.

Call this when a new model server replica comes up.
"""
if backend in self._backends:
return

self._backends.add(backend)

for i in range(self.vnodes):
# Hash backend:vnode_index to get ring position
vnode_key = f"{backend}:vnode:{i}"
position = self._hash(vnode_key)

# Insert in sorted order
bisect.insort(self._ring, position)
self._ring_to_backend[position] = backend

def remove_backend(self, backend: str) -> None:
"""
Remove a backend from the ring.
Called when a model server replica is terminated.
Only ~1/N of keys are remapped to other backends.
"""
if backend not in self._backends:
return

self._backends.discard(backend)

for i in range(self.vnodes):
vnode_key = f"{backend}:vnode:{i}"
position = self._hash(vnode_key)

if position in self._ring_to_backend:
del self._ring_to_backend[position]
# Remove from sorted list
idx = bisect.bisect_left(self._ring, position)
if idx < len(self._ring) and self._ring[idx] == position:
self._ring.pop(idx)

def get_backend(self, key: str) -> Optional[str]:
"""
Get the backend for a given key.
Deterministic: same key always returns same backend (when ring is stable).

key: partition key - user_id, item_id, session_id, etc.
"""
if not self._ring:
return None

position = self._hash(key)

# Find the first ring position >= our key's position
idx = bisect.bisect_left(self._ring, position)

# Wrap around if we've gone past the end of the ring
if idx >= len(self._ring):
idx = 0

ring_position = self._ring[idx]
return self._ring_to_backend[ring_position]

def _hash(self, key: str) -> int:
"""SHA-256 hash mapped to integer ring position."""
return int(hashlib.sha256(key.encode()).hexdigest(), 16)

def get_distribution(self) -> dict[str, float]:
"""
Show what percentage of the hash space each backend owns.
Useful for verifying distribution uniformity.
"""
if not self._ring:
return {}

counts: dict[str, int] = {backend: 0 for backend in self._backends}
for position in self._ring:
backend = self._ring_to_backend[position]
counts[backend] += 1

total = len(self._ring)
return {
backend: count / total * 100
for backend, count in counts.items()
}


class MLServingLoadBalancer:
"""
Load balancer for ML serving with multiple strategies.
Choose based on your serving pattern.
"""

def __init__(self, backends: list[str], strategy: str = "least_requests"):
self.backends = list(backends)
self.strategy = strategy

# State for least-requests tracking
self._in_flight: dict[str, int] = {b: 0 for b in backends}
self._rr_index = 0

# Consistent hash for session affinity
self._ch = ConsistentHashLoadBalancer(vnodes=150)
for backend in backends:
self._ch.add_backend(backend)

# Health state
self._healthy: set[str] = set(backends)

    def get_backend(self, key: Optional[str] = None) -> Optional[str]:
"""
Select a backend using the configured strategy.
key: used for consistent_hash strategy only
"""
healthy = [b for b in self.backends if b in self._healthy]
if not healthy:
return None

if self.strategy == "round_robin":
idx = self._rr_index % len(healthy)
self._rr_index += 1
return healthy[idx]

elif self.strategy == "least_requests":
# Pick the backend with fewest in-flight requests
return min(healthy, key=lambda b: self._in_flight.get(b, 0))

elif self.strategy == "consistent_hash":
if key is None:
raise ValueError("consistent_hash strategy requires a key")
backend = self._ch.get_backend(key)
return backend if backend in self._healthy else self.get_backend(strategy_override="round_robin")

elif self.strategy == "power_of_two":
# Sample 2 random backends, pick less loaded one
import random
if len(healthy) == 1:
return healthy[0]
sample = random.sample(healthy, min(2, len(healthy)))
return min(sample, key=lambda b: self._in_flight.get(b, 0))

return healthy[0]

def request_started(self, backend: str) -> None:
"""Call when starting a request to a backend."""
self._in_flight[backend] = self._in_flight.get(backend, 0) + 1

def request_finished(self, backend: str) -> None:
"""Call when request to backend completes."""
if self._in_flight.get(backend, 0) > 0:
self._in_flight[backend] -= 1

def mark_unhealthy(self, backend: str) -> None:
"""Remove backend from rotation (circuit breaker integration)."""
self._healthy.discard(backend)

def mark_healthy(self, backend: str) -> None:
"""Return backend to rotation."""
self._healthy.add(backend)
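
A short usage sketch (backend addresses are hypothetical): the caller reports request start and completion so the least_requests and power_of_two strategies have in-flight counts to work with.

lb = MLServingLoadBalancer(
    backends=["10.0.0.1:50051", "10.0.0.2:50051", "10.0.0.3:50051"],
    strategy="least_requests",
)

backend = lb.get_backend()
lb.request_started(backend)
try:
    ...  # send the inference request to `backend` here
finally:
    lb.request_finished(backend)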

Health Check Endpoint for ML Model

"""
Health check endpoint for ML model serving.
Required for Kubernetes liveness/readiness probes and load balancer health checks.

Two separate checks:
- Liveness: is the process alive? (restart if fails)
- Readiness: is the model ready to serve? (remove from LB pool if fails)
"""
import time
import threading
import psutil
import torch
from http.server import HTTPServer, BaseHTTPRequestHandler
import json
import logging

logger = logging.getLogger(__name__)


class ModelHealthState:
"""
Shared state for model health tracking.
Thread-safe - read by health check server, written by inference threads.
"""

def __init__(self):
self._lock = threading.Lock()
self._model_loaded = False
self._last_successful_inference = None
self._inference_errors_last_minute = 0
self._error_window_start = time.time()

def model_loaded(self, loaded: bool) -> None:
with self._lock:
self._model_loaded = loaded

def record_inference_success(self) -> None:
with self._lock:
self._last_successful_inference = time.time()

def record_inference_error(self) -> None:
with self._lock:
now = time.time()
# Reset error counter every minute
if now - self._error_window_start > 60:
self._inference_errors_last_minute = 0
self._error_window_start = now
self._inference_errors_last_minute += 1

def is_ready(self) -> tuple[bool, str]:
"""Returns (is_ready, reason)."""
with self._lock:
if not self._model_loaded:
return False, "Model not yet loaded"

# Too many errors in the last minute
if self._inference_errors_last_minute > 100:
return False, f"High error rate: {self._inference_errors_last_minute} errors/min"

# Check GPU memory if available
if torch.cuda.is_available():
for i in range(torch.cuda.device_count()):
mem_info = torch.cuda.mem_get_info(i)
free_mb = mem_info[0] / (1024 * 1024)
if free_mb < 100: # Less than 100MB free GPU memory
return False, f"GPU {i} nearly out of memory: {free_mb:.0f}MB free"

return True, "OK"

def is_alive(self) -> tuple[bool, str]:
"""Returns (is_alive, reason). Liveness check."""
with self._lock:
# Check system memory
mem = psutil.virtual_memory()
if mem.percent > 95:
return False, f"System memory critical: {mem.percent}% used"

# Check if model loaded (if it was never loaded, restart might help)
# But don't fail liveness until model has had time to load
return True, "OK"


# Global health state - shared between health server and inference code
health_state = ModelHealthState()


class MLHealthCheckHandler(BaseHTTPRequestHandler):
"""
HTTP handler for health check endpoints.
Kubernetes probes HTTP endpoints on the pod's IP.
"""

def do_GET(self):
if self.path == "/healthz" or self.path == "/livez":
# Liveness probe - is the process alive?
alive, reason = health_state.is_alive()
self._respond(alive, reason, endpoint="liveness")

elif self.path == "/readyz":
# Readiness probe - is the model ready to serve?
ready, reason = health_state.is_ready()
self._respond(ready, reason, endpoint="readiness")

elif self.path == "/metrics":
# Prometheus-format metrics
self._respond_metrics()

else:
self.send_response(404)
self.end_headers()

def _respond(self, healthy: bool, reason: str, endpoint: str) -> None:
status_code = 200 if healthy else 503
body = json.dumps({
"status": "healthy" if healthy else "unhealthy",
"reason": reason,
"endpoint": endpoint,
"timestamp": time.time(),
}).encode("utf-8")

self.send_response(status_code)
self.send_header("Content-Type", "application/json")
self.send_header("Content-Length", len(body))
self.end_headers()
self.wfile.write(body)

def _respond_metrics(self) -> None:
"""Prometheus text format metrics."""
metrics = [
"# HELP ml_model_loaded Whether the model is loaded",
"# TYPE ml_model_loaded gauge",
f"ml_model_loaded {1 if health_state._model_loaded else 0}",
]
body = "\n".join(metrics).encode("utf-8")
self.send_response(200)
self.send_header("Content-Type", "text/plain; version=0.0.4")
self.end_headers()
self.wfile.write(body)

def log_message(self, format, *args):
"""Suppress default access log to avoid noise."""
pass


def start_health_server(port: int = 8080) -> threading.Thread:
"""Start health check HTTP server in background thread."""
server = HTTPServer(("0.0.0.0", port), MLHealthCheckHandler)

thread = threading.Thread(target=server.serve_forever, daemon=True)
thread.start()

logger.info(f"Health check server started on port {port}")
return thread
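
Wiring this into a model server is mostly bookkeeping. A hedged sketch - run_model stands in for your actual inference call:

def handle_request(features):
    try:
        prediction = run_model(features)       # placeholder for the real inference call
        health_state.record_inference_success()
        return prediction
    except Exception:
        health_state.record_inference_error()  # feeds the /readyz error-rate check
        raise

start_health_server(port=8080)   # /healthz, /readyz, /metrics now answer probes
health_state.model_loaded(True)  # call once the model is actually in memory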

Rate Limiter for ML APIs

"""
Token bucket rate limiter for ML inference APIs.
Prevents model server overload from misbehaving or bursty clients.

Two levels of limiting:
- Per-client rate limit: prevent one client from consuming all capacity
- Global rate limit: protect the model server from total overload
"""
import time
import threading
from collections import defaultdict


class TokenBucketRateLimiter:
"""
Token bucket algorithm for rate limiting ML inference requests.

Token bucket: bucket holds `capacity` tokens.
Tokens refill at `rate` tokens/second.
Each request consumes 1 token.
If bucket is empty: request is rejected (rate limited).

Why token bucket for ML:
- Allows bursts up to `capacity` (handles request spikes)
- Enforces average rate of `rate` req/s over time
- No global state needed per request - just check and decrement
"""

def __init__(self, rate: float, capacity: float):
"""
rate: tokens per second (sustained request rate)
capacity: bucket size (maximum burst size)
"""
self.rate = rate
self.capacity = capacity
self._lock = threading.Lock()

# State per client
self._buckets: dict[str, dict] = defaultdict(
lambda: {"tokens": capacity, "last_refill": time.monotonic()}
)

def is_allowed(self, client_id: str) -> tuple[bool, float]:
"""
Check if a request from client_id is allowed.

Returns (allowed, retry_after_seconds).
retry_after_seconds: how long until next token is available (if denied).
"""
with self._lock:
now = time.monotonic()
bucket = self._buckets[client_id]

# Refill tokens based on elapsed time
elapsed = now - bucket["last_refill"]
new_tokens = elapsed * self.rate
bucket["tokens"] = min(
self.capacity,
bucket["tokens"] + new_tokens
)
bucket["last_refill"] = now

if bucket["tokens"] >= 1.0:
bucket["tokens"] -= 1.0
return True, 0.0
else:
# How long until 1 token is available
retry_after = (1.0 - bucket["tokens"]) / self.rate
return False, retry_after

def get_status(self, client_id: str) -> dict:
"""Get current token count for a client (for debugging)."""
with self._lock:
bucket = self._buckets.get(client_id)
if not bucket:
return {"tokens": self.capacity, "capacity": self.capacity}
return {
"tokens": bucket["tokens"],
"capacity": self.capacity,
"rate": self.rate,
}


class MLAPIRateLimiter:
"""
Two-tier rate limiter for ML serving:
- Per-client limit: fair usage per API key
- Global limit: protect model server from total overload

Usage in gRPC interceptor:
limiter = MLAPIRateLimiter(
per_client_rate=10, # 10 req/s per client
per_client_burst=50, # Allow bursts of 50
global_rate=1000, # 1000 req/s total
global_burst=2000,
)
"""

def __init__(
self,
per_client_rate: float,
per_client_burst: float,
global_rate: float,
global_burst: float,
):
self.per_client = TokenBucketRateLimiter(per_client_rate, per_client_burst)
self.global_limiter = TokenBucketRateLimiter(global_rate, global_burst)

def check(self, client_id: str) -> tuple[bool, str]:
"""
Returns (allowed, rejection_reason).
Checks global limit first (cheaper), then per-client.
"""
# Check global limit
global_ok, global_retry = self.global_limiter.is_allowed("__global__")
if not global_ok:
return False, f"Global rate limit exceeded. Retry in {global_retry:.2f}s"

# Check per-client limit
client_ok, client_retry = self.per_client.is_allowed(client_id)
if not client_ok:
return False, f"Per-client rate limit exceeded. Retry in {client_retry:.2f}s"

return True, ""

Production Engineering Notes

Canary Deployment Strategy for ML Models

Model upgrades are higher risk than most software updates because:

  1. Model behavior is hard to characterize with unit tests
  2. Edge cases only appear at production traffic scale
  3. Accuracy regressions may not manifest as errors - the model just gives wrong answers

The safe canary strategy:

Day 1 - Deploy v3, start at 1% traffic
Monitor: error rate, P99 latency, business metrics (CTR, conversion)
Abort criteria: error rate >2x baseline or P99 >3x baseline

Day 2 - If Day 1 healthy, increase to 10%
Extended monitoring window: 24 hours
Watch for distribution shift signals

Day 3 - If Day 2 healthy, increase to 50%
Monitor business metrics vs holdout

Day 4 - If Day 3 healthy, 100% traffic to v3
Keep v2 pods running for 1-2 hours for rapid rollback
Then decommission v2

The Istio weight changes are a single YAML patch - no downtime, instant effect:

# Increase canary to 10%
kubectl patch virtualservice inference-virtual-service -n ml-serving \
--type=json \
-p='[{"op": "replace", "path": "/spec/http/1/route/0/weight", "value": 90},
{"op": "replace", "path": "/spec/http/1/route/1/weight", "value": 10}]'

Session Affinity for Stateful ML Inference

Some ML serving scenarios require requests from the same client to always route to the same backend:

  1. Multi-turn dialogue models: The model server maintains conversation history in memory
  2. Streaming inference: The client opened a bidirectional gRPC stream to a specific backend
  3. KV-cache optimization: Routing same prefixes to same GPU maximizes KV-cache reuse

In Kubernetes Services, use sessionAffinity: ClientIP. In Istio, use consistent hash in DestinationRule:

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
name: llm-serving-consistent-hash
spec:
host: llm-service
trafficPolicy:
loadBalancer:
consistentHash:
# Hash on session ID header for conversation affinity
httpHeaderName: "x-session-id"
# Or hash on source IP for simpler affinity
# useSourceIp: true
# Or hash on a specific cookie
# httpCookie:
# name: session-id
# ttl: 3600s
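
Header-based affinity only works if clients actually send the header. A hedged client-side sketch with a placeholder channel target and stub: the gRPC metadata below becomes the x-session-id HTTP/2 header that the consistentHash rule hashes on.

import grpc

channel = grpc.insecure_channel("llm-service.ml-serving.svc.cluster.local:50051")
# stub = llm_pb2_grpc.LLMServiceStub(channel)  # hypothetical generated stub
metadata = [("x-session-id", "conversation-1234")]
# response = stub.Generate(request, metadata=metadata)  # same session -> same backend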

Common Mistakes

:::danger Circuit Breaker Not Configured - Cascading Failure in ML Serving

Without circuit breakers, a slow or failing model server version causes a cascade: requests queue up waiting for responses, thread pools fill, the load balancer keeps sending traffic to the failing backend, latency spikes across the entire service.

The circuit breaker pattern: after N consecutive failures from a backend, stop sending it requests for M seconds. This gives the backend time to recover and prevents request pile-up.

In Istio, configure outlierDetection in your DestinationRule. Without it, a crashing pod during a bad canary deployment will continue receiving traffic until Kubernetes marks it as unhealthy (30+ seconds by default):

outlierDetection:
consecutive5xxErrors: 5 # Eject after 5 consecutive errors
interval: 1s # Check every second
baseEjectionTime: 30s # Keep ejected for 30 seconds
maxEjectionPercent: 50 # Never eject more than 50% of backends

Without this configuration, a 5% canary with a memory bug can cause 23% error rates cluster-wide because failing pods keep receiving traffic.

:::

:::danger Using L4 Load Balancer for gRPC Services

Deploying gRPC services behind a standard L4 load balancer (an AWS NLB, a hardware load balancer in TCP mode) pins each client to a single backend. HTTP/2 multiplexes many RPCs over one TCP connection. The L4 load balancer sees one TCP connection per client - and routes all of that client's RPCs to one backend.

Symptom: One or two model server replicas are at 100% CPU while others are idle. The cluster appears under-utilized but latency is high. Adding replicas does not improve performance.

Fix: Use an L7 load balancer that understands HTTP/2:

  • NGINX Ingress with backend-protocol: "GRPC" annotation
  • AWS ALB with gRPC support enabled
  • Envoy proxy (in Istio service mesh)
  • Client-side load balancing with gRPC's built-in DNS resolver

The L7 load balancer routes individual RPCs (HTTP/2 streams), not TCP connections, distributing load correctly.

:::

:::warning Aggressive Retry Policy Causes Retry Storms

Retries amplify load on already-struggling backends. If a model server is at 90% capacity and starts returning 503 errors, and your clients are configured to retry 3 times on 503, the effective load on the server becomes 90% * 4 = 360% - a guaranteed cascade failure.

Safe retry configuration:

  1. Only retry on connection failures and on explicit retryable responses (a 503 carrying a Retry-After header), not on all 5xx
  2. Use exponential backoff with jitter between retries (see the sketch after this callout)
  3. Respect the Retry-After header when backends set it
  4. Count retries against the original request's deadline, not a fresh deadline per retry

retries:
attempts: 2
perTryTimeout: 250ms
retryOn: connect-failure,refused-stream
# NOT: retryOn: 5xx (this retries on all server errors, including overload)

:::
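
The "exponential backoff with jitter" from point 2 above, sketched for a client that only retries connection-level failures (an assumed helper, not part of the lesson's code):

import random
import time

def call_with_retries(send_request, attempts: int = 3, base: float = 0.05, cap: float = 2.0):
    """Retry only on connection errors, sleeping a random ("full jitter") delay."""
    for attempt in range(attempts):
        try:
            return send_request()
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))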

:::warning Health Check Interval Too Long During Canary

The default Kubernetes readiness probe interval is 10 seconds with a failure threshold of 3. This means a crashing pod can continue receiving traffic for up to 30 seconds before Kubernetes marks it as not-ready and removes it from the Service endpoints.

For ML serving with canary deployments, 30 seconds of requests to a crashing pod means 30 seconds of elevated error rates - visible to users.

Configure more aggressive health checks for model servers:

readinessProbe:
httpGet:
path: /readyz
port: 8080
initialDelaySeconds: 30 # Time for model to load (keep this reasonable)
periodSeconds: 5 # Check every 5 seconds (not 10)
failureThreshold: 2 # Fail after 2 consecutive failures (not 3)
timeoutSeconds: 3 # Fail fast if health check is slow

With these settings, a crashing pod is removed from rotation within 10 seconds (2 failures * 5 second interval) instead of 30.

:::


Interview Q&A

Q1: Explain the sidecar proxy pattern and how it enables a service mesh without application code changes.

The sidecar proxy pattern places a proxy container (Envoy in Istio, linkerd2-proxy in Linkerd) alongside every application container in the same pod. The proxy shares the pod's network namespace. When Istio is installed, a webhook intercepts pod creation and injects the sidecar and init container into every pod that has the Istio label.

The init container uses iptables rules to redirect all inbound and outbound TCP traffic through the Envoy sidecar before the packets reach the application container. The application code makes a network call to model-server:50051 - from the application's perspective, it is connecting normally. But the OS redirects the outgoing packet to Envoy's local port. Envoy looks up its routing configuration (pushed by istiod), applies mTLS, circuit breaking, and retry policies, then forwards the request to the destination's Envoy sidecar. The destination sidecar terminates the mTLS, decrypts the payload, and forwards the plaintext to the application container on its local port.

This is transparent to application code. No TLS configuration, no circuit breaker implementation, no retry logic in the service. All of that is managed by the control plane (istiod) and executed by the data plane (Envoy sidecars).

Q2: Compare round-robin, least-connections, and consistent hash load balancing for ML serving scenarios.

Round robin rotates requests through backends sequentially. It works when all requests have similar processing time and all backends are equivalent. For ML serving with homogeneous models and similar-duration inference, it distributes load well. It fails when inference time varies significantly - some backends accumulate slow requests while others are idle.

Least connections (least requests) sends each new request to the backend with the fewest in-flight requests. This adapts to variable-duration inference: a backend processing a large batch naturally stops receiving new requests until it finishes. This is generally the best default for ML serving where batch sizes and thus inference durations vary.

Consistent hash maps request keys deterministically to backends. The same user ID, item ID, or session key always routes to the same backend. This is valuable when backends maintain state - an in-memory embedding cache, a conversation history for a dialogue model, or a KV-cache for LLM prefill. Without consistent hashing, each of 8 replicas would hold 1/8 of the cache content, and a request for a cached item would miss 7/8 of the time. With consistent hashing, cache hit rates approach the single-server hit rate.

The tradeoff: consistent hashing creates load imbalance if some keys are "hot" (e.g., one extremely popular item routing all traffic to one backend). Mitigate with virtual nodes (more even distribution) and a fallback to least-connections for overloaded backends.

Q3: What is mTLS in a service mesh context and why is it important for ML platforms?

Standard TLS (HTTPS) authenticates only the server: the client verifies the server's certificate, but the server accepts connections from any client. Mutual TLS adds client authentication: both parties present certificates, and both verify the other's certificate. The connection proceeds only if both certificates are valid and trusted.

In an ML platform with mTLS: when your fraud detection model server calls your feature store, the feature store verifies the certificate presented by the caller. If the certificate is not issued by the mesh's certificate authority (Istio's istiod), the connection is rejected. A compromised or rogue pod cannot impersonate a model server to read feature data. An attacker who gains network access to the cluster cannot directly query your feature store - they need a valid certificate.

Istio implements mTLS automatically. It issues SPIFFE-compliant X.509 certificates to every pod's sidecar, using the pod's Kubernetes Service Account as the identity. Certificates are rotated automatically (default every 24 hours) without any application involvement. The security administrator simply sets the mesh-wide policy: PeerAuthentication with STRICT mode, and all service-to-service traffic in the mesh is automatically encrypted and mutually authenticated.

Q4: Describe a safe canary deployment strategy for upgrading an ML model in production.

A safe canary deployment for ML models uses progressive traffic shifting with automated abort criteria.

Start by deploying the new model version alongside the existing version without routing any traffic to it (0% canary, 100% current). Run internal smoke tests and verify the new version serves correctly. Then shift a small percentage of production traffic - typically 1-5% - to the new version.

Define explicit abort criteria before starting: if error rate increases more than 2x, if P99 latency increases more than 3x, or if key business metrics (CTR for a recommendation model, conversion rate for a pricing model) degrade by more than 1%, immediately shift traffic back to 0% on the new version.

If the canary passes the 1-5% window for 24 hours with no regressions, increase to 10%, then 25%, then 50%, then 100%, with a 24-hour monitoring window at each stage. Keep the previous version deployed for 1-2 hours after reaching 100% to allow rapid rollback if a problem appears at full scale.

In Istio, traffic shifting is a single YAML change to the VirtualService weights. Rollback is equally fast - one command and traffic is back on the stable version within seconds.

Q5: Explain the difference between L4 and L7 load balancing and why gRPC requires L7.

L4 load balancing operates at the transport layer. The load balancer routes TCP connections based on source/destination IP and port. It does not inspect application payload. When a TCP connection is established, the load balancer picks a backend and all subsequent data on that connection goes to the same backend. This is efficient because the load balancer only needs to make a routing decision once per connection.

L7 load balancing operates at the application layer. The load balancer inspects HTTP headers, request path, and body. It makes a routing decision for each individual request, independent of which TCP connection carries it.

gRPC requires L7 because it uses HTTP/2, which multiplexes many RPC calls over a single TCP connection. An L4 load balancer sees one TCP connection and routes it to one backend; from then on, every gRPC call multiplexed on that connection goes to the same backend while other backends sit idle. L7 load balancing routes each HTTP/2 stream (each individual gRPC call) independently, achieving proper distribution.

For ML serving with Kubernetes: standard Kubernetes Services use L4 kube-proxy rules. For gRPC model servers, you need either an L7 ingress (NGINX with gRPC annotation, AWS ALB with gRPC), client-side load balancing (gRPC channel with DNS resolver returning multiple IPs), or a service mesh (Istio Envoy does L7 per-request routing automatically).

Q6: What is the power-of-two-choices load balancing algorithm and why does it outperform round-robin for ML serving?

Power-of-two-choices (P2C) is a randomized load balancing algorithm. For each incoming request, randomly select two backends from the available pool, then route the request to whichever of the two has fewer in-flight requests.

It outperforms round robin for ML serving because round robin distributes connections evenly but does not account for varying request duration. If backend A is processing a slow batch request while backend B just completed a request, round robin still routes the next two requests to A and B alternately - sending work to an already-busy backend.

P2C handles this without needing global state. You do not need to query all backends to find the globally least-loaded one - you just sample two. The classic balls-into-bins analysis shows that choosing the less loaded of two random backends keeps the maximum load within O(log log n) of the average - an exponential improvement over the roughly log n / log log n gap of a single random choice - and in practice it avoids the pile-ups that duration-blind round robin creates.

Envoy's LEAST_REQUEST policy implements P2C: it samples two hosts and routes to the one with fewer active requests. For ML serving where inference duration varies by 10x or more between small and large requests, P2C reduces tail latency compared to round robin by keeping backends more evenly loaded.
