Service Mesh for ML Serving
Five Models, Three Versions Each, Zero Visibility
The ML serving team at a recommendation platform had accumulated a traffic routing problem so complex that no single person understood the full picture. They had five active models (content ranking, user affinity, session context, diversity re-ranker, freshness booster) each with up to three simultaneously-served versions: the production champion, a staging candidate being shadow-tested, and sometimes a legacy rollback version kept warm for emergency failover.
Traffic routing rules were implemented as environment variables in deployment YAMLs, custom Nginx configs, application-layer feature flags, and one genuinely alarming bash script checked into the infra repo. Changing a routing rule required editing multiple files, redeploying multiple services, and hoping that no one had recently made a conflicting change in a file you did not know existed.
The breaking point was an A/B test on the content ranking model. The test required routing 10% of traffic to model-v47, 85% to model-v45 (production champion), and 5% to model-v43 (legacy variant for a specific user segment). Implementing this without a service mesh took two engineers four days. When A/B testing at this complexity takes four days to set up, you run fewer tests. When you run fewer tests, your models improve more slowly. The business impact was real.
Migrating to a service mesh (Istio) reduced new routing rule implementation to 15 minutes on average. More importantly, it made all routing rules visible, auditable, and testable in a single place.
Why This Exists
A service mesh is a dedicated infrastructure layer for service-to-service communication. In Kubernetes, it is typically implemented as a sidecar proxy (Envoy) injected into every pod, with a control plane (Istiod) that distributes routing rules, policies, and telemetry collection configuration to all proxies.
For ML serving specifically, Istio solves problems that application-layer routing cannot:
- Traffic splitting without code changes: Route 5% of requests to model-v47 by weight, independent of any application-layer logic
- Header-based routing: Route requests with X-User-Segment: premium to a higher-capacity model variant
- Circuit breaking: Automatically stop routing to a model that is returning errors, preventing cascading failures
- Mutual TLS: Encrypted, authenticated communication between services without application code changes
- Telemetry: Automatic per-service, per-route latency histograms and error rates in Prometheus
Istio Architecture for ML
Istio Installation
# Install Istio (istioctl method)
# Download istioctl
curl -L https://istio.io/downloadIstio | ISTIO_VERSION=1.21.0 sh -
cd istio-1.21.0
export PATH=$PWD/bin:$PATH
# Install Istio with the default profile (suitable for production)
istioctl install --set profile=default -y
# Enable automatic sidecar injection in the ML namespace
kubectl label namespace ml-serving istio-injection=enabled
# Verify installation
kubectl get pods -n istio-system
istioctl verify-install
Core Concepts: VirtualService and DestinationRule
The two primary Istio resources for ML routing:
- DestinationRule: Defines named subsets of a service (by pod labels) and connection pool settings
- VirtualService: Defines routing rules - which subset gets which percentage of traffic
# kubernetes/istio/content-ranking-destination-rule.yaml
# DestinationRule: define which pod versions belong to which subset
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: content-ranking-dr
  namespace: ml-serving
spec:
  host: content-ranking-svc  # Kubernetes service name
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        h2UpgradePolicy: UPGRADE
        idleTimeout: 30s
    # Circuit breaker settings (applies to all subsets)
    outlierDetection:
      consecutive5xxErrors: 5   # Eject after 5 consecutive 5xx errors
      interval: 30s             # Check every 30 seconds
      baseEjectionTime: 30s     # Eject for at least 30 seconds
      maxEjectionPercent: 50    # Never eject more than 50% of instances
  subsets:
    - name: v45
      labels:
        model-version: "v45"    # Matches pods with this label
      trafficPolicy:
        connectionPool:
          http:
            http1MaxPendingRequests: 50
    - name: v47
      labels:
        model-version: "v47"
    - name: v43-legacy
      labels:
        model-version: "v43"
# kubernetes/istio/content-ranking-virtual-service.yaml
# VirtualService: traffic split between model versions
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: content-ranking-vs
  namespace: ml-serving
spec:
  hosts:
    - content-ranking-svc
  http:
    # Route 1: Premium user segment to v47 (new model)
    - match:
        - headers:
            x-user-segment:
              exact: premium
      route:
        - destination:
            host: content-ranking-svc
            subset: v47
          weight: 100
    # Route 2: Default traffic split (canary deployment)
    - route:
        - destination:
            host: content-ranking-svc
            subset: v45  # Production champion
          weight: 90
        - destination:
            host: content-ranking-svc
            subset: v47  # Canary
          weight: 10
      # Timeout and retry policies for this route
      timeout: 200ms
      retries:
        attempts: 2
        perTryTimeout: 80ms
        retryOn: "5xx,gateway-error,connect-failure,retriable-4xx"
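Before applying a route like the default split above, it can help to sanity-check the numbers. The following is a minimal sketch, not part of Istio: `parse_duration_ms` and `check_route` are hypothetical helpers, the route is passed as a plain dict mirroring the YAML, and the sketch assumes that `attempts` counts retries on top of the initial request.

```python
import re

_UNITS_MS = {"ms": 1, "s": 1000, "m": 60_000}


def parse_duration_ms(text: str) -> int:
    """Parse simple durations such as '200ms' or '30s' into milliseconds."""
    match = re.fullmatch(r"(\d+)(ms|s|m)", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    value, unit = match.groups()
    return int(value) * _UNITS_MS[unit]


def check_route(http_route: dict) -> dict:
    """Summarize the weight total and retry budget of one http route entry."""
    weights = [dest.get("weight", 0) for dest in http_route.get("route", [])]
    summary = {"weight_total": sum(weights)}
    retries = http_route.get("retries")
    if retries and "timeout" in http_route and "perTryTimeout" in retries:
        timeout_ms = parse_duration_ms(http_route["timeout"])
        per_try_ms = parse_duration_ms(retries["perTryTimeout"])
        tries = 1 + retries["attempts"]  # assumption: attempts counts retries only
        # Time needed if every try runs to its per-try limit
        summary["worst_case_ms"] = tries * per_try_ms
        # How many tries can run to completion before the overall timeout fires
        summary["full_tries_within_timeout"] = min(tries, timeout_ms // per_try_ms)
    return summary


# The default route from the VirtualService above, written as a dict
default_route = {
    "route": [
        {"destination": {"host": "content-ranking-svc", "subset": "v45"}, "weight": 90},
        {"destination": {"host": "content-ranking-svc", "subset": "v47"}, "weight": 10},
    ],
    "timeout": "200ms",
    "retries": {"attempts": 2, "perTryTimeout": "80ms"},
}

print(check_route(default_route))
```

Under these assumptions, the 200ms overall timeout leaves room for two full 80ms tries; a third would be cut short by the overall timeout.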
Progressive Canary Deployment
Automate the canary rollout process:
# scripts/canary_controller.py
"""
Automated canary deployment controller for ML model versions.
Gradually shifts traffic from champion to challenger, monitoring metrics at each step.
"""
import json
import logging
import subprocess
import time
from dataclasses import dataclass

logger = logging.getLogger(__name__)


@dataclass
class CanaryConfig:
    service_name: str
    namespace: str
    champion_version: str
    challenger_version: str
    # Traffic progression: list of (challenger_weight, hold_minutes) tuples.
    # None means "use the controller's default stages".
    stages: list[tuple[int, int]] | None = None
    # Prometheus query to check model quality during canary
    quality_metric_query: str = "rate(content_ranking_p50_ms[5m])"
    max_latency_regression_pct: float = 10.0  # Max allowed latency increase
class CanaryController:
    def __init__(self, config: CanaryConfig):
        self.config = config
        self.stages = config.stages or [
            (5, 10),    # 5% for 10 minutes
            (10, 15),   # 10% for 15 minutes
            (25, 20),   # 25% for 20 minutes
            (50, 30),   # 50% for 30 minutes
            (100, 0),   # 100% - full promotion
        ]

    def run_canary(self) -> bool:
        """
        Execute the canary progression.
        Returns True if successfully promoted to 100%, False if rolled back.
        """
        logger.info(
            f"Starting canary: {self.config.champion_version} -> "
            f"{self.config.challenger_version}"
        )
        for challenger_weight, hold_minutes in self.stages:
            champion_weight = 100 - challenger_weight
            self._update_traffic_split(champion_weight, challenger_weight)
            logger.info(
                f"Traffic split: champion={champion_weight}%, "
                f"challenger={challenger_weight}%"
            )
            if hold_minutes > 0:
                logger.info(f"Holding for {hold_minutes} minutes...")
                time.sleep(hold_minutes * 60)
                # Check quality metrics
                if not self._check_metrics_ok():
                    logger.error("Metrics degraded during canary - rolling back")
                    self._rollback()
                    return False
        logger.info(
            f"Canary complete: {self.config.challenger_version} now serves 100% of traffic"
        )
        return True
    def _update_traffic_split(self, champion_weight: int, challenger_weight: int) -> None:
        """Update Istio VirtualService with new weights."""
        # Note: a merge patch replaces the whole `http` list. Any other entries
        # in the VirtualService (header-match routes, retry settings) are dropped
        # unless they are included in this patch as well.
        vs_patch = {
            "spec": {
                "http": [{
                    "route": [
                        {
                            "destination": {
                                "host": f"{self.config.service_name}-svc",
                                "subset": self.config.champion_version,
                            },
                            "weight": champion_weight,
                        },
                        {
                            "destination": {
                                "host": f"{self.config.service_name}-svc",
                                "subset": self.config.challenger_version,
                            },
                            "weight": challenger_weight,
                        },
                    ],
                    "timeout": "200ms",
                }]
            }
        }
        subprocess.run([
            "kubectl", "patch", "virtualservice",
            f"{self.config.service_name}-vs",
            "--namespace", self.config.namespace,
            "--type", "merge",
            "--patch", json.dumps(vs_patch),
        ], check=True)
    def _check_metrics_ok(self) -> bool:
        """Query Prometheus to check if challenger is performing acceptably."""
        # Compare P99 latency between champion and challenger
        # (In practice, use the Prometheus API to query per-subset metrics)
        champion_p99 = self._query_prometheus(
            f'histogram_quantile(0.99, rate(inference_latency_seconds_bucket'
            f'{{model_version="{self.config.champion_version}"}}[5m]))'
        )
        challenger_p99 = self._query_prometheus(
            f'histogram_quantile(0.99, rate(inference_latency_seconds_bucket'
            f'{{model_version="{self.config.challenger_version}"}}[5m]))'
        )
        # A zero champion latency would make the percentage below undefined,
        # so it is treated the same as a missing metric.
        if champion_p99 is None or challenger_p99 is None or champion_p99 <= 0:
            logger.warning("Could not fetch usable metrics - assuming OK")
            return True
        regression_pct = (challenger_p99 - champion_p99) / champion_p99 * 100
        if regression_pct > self.config.max_latency_regression_pct:
            logger.error(
                f"Latency regression: challenger P99={challenger_p99:.3f}s vs "
                f"champion P99={champion_p99:.3f}s "
                f"({regression_pct:.1f}% worse, limit {self.config.max_latency_regression_pct}%)"
            )
            return False
        logger.info(
            f"Metrics OK: challenger P99={challenger_p99:.3f}s "
            f"vs champion P99={champion_p99:.3f}s ({regression_pct:+.1f}%)"
        )
        return True
    def _rollback(self) -> None:
        """Immediately route all traffic back to champion."""
        self._update_traffic_split(champion_weight=100, challenger_weight=0)
        logger.info(f"Rolled back to {self.config.champion_version} (100% traffic)")

    def _query_prometheus(self, query: str) -> float | None:
        """Query Prometheus instant query endpoint."""
        import requests
        try:
            resp = requests.get(
                "http://prometheus.monitoring.svc.cluster.local:9090/api/v1/query",
                params={"query": query},
                timeout=10,
            )
            # Treat HTTP errors as a failed query instead of parsing an error body
            resp.raise_for_status()
            data = resp.json()
            results = data.get("data", {}).get("result", [])
            if results:
                return float(results[0]["value"][1])
        except Exception as e:
            logger.warning(f"Prometheus query failed: {e}")
        return None
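The controller above treats missing metrics as healthy. A stricter variant might fail closed, so that a canary does not advance when there is no evidence it is safe. The sketch below isolates the comparison as a pure function that is easy to unit-test; `latency_regression_pct`, `canary_is_healthy`, and the `fail_open` flag are hypothetical names, not part of the script above.

```python
import math


def latency_regression_pct(champion_p99, challenger_p99):
    """Return the challenger's P99 regression in percent, or None if undefined."""
    values = (champion_p99, challenger_p99)
    # Missing values and NaN (e.g. a quantile over an empty histogram) are unusable
    if any(v is None or math.isnan(v) for v in values):
        return None
    if champion_p99 <= 0:
        return None
    return (challenger_p99 - champion_p99) / champion_p99 * 100.0


def canary_is_healthy(champion_p99, challenger_p99,
                      max_regression_pct=10.0, fail_open=False):
    """Decide whether the canary may proceed; fail_open controls the no-data case."""
    pct = latency_regression_pct(champion_p99, challenger_p99)
    if pct is None:
        return fail_open
    return pct <= max_regression_pct


print(canary_is_healthy(0.100, 0.105))         # 5% slower, within the 10% limit
print(canary_is_healthy(0.100, 0.120))         # 20% slower, over the limit
print(canary_is_healthy(0.100, float("nan")))  # no usable data, fails closed
```

Whether to fail open or closed is a policy choice: failing closed trades some spurious rollbacks during monitoring outages for protection against promoting an unobserved model.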
A/B Testing Configuration
# A/B test: route different user cohorts to different model versions
# User cohort determined by the leading character of the x-user-id header
# (assumes user IDs are uniformly distributed hex strings)
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: content-ranking-ab-test
  namespace: ml-serving
spec:
  hosts:
    - content-ranking-svc
  http:
    # Control group: user IDs starting with hex 0-c (13 of 16 values, about 81%) get v45
    - match:
        - headers:
            x-user-id:
              # Regex: match user IDs whose first character is hex 0-c
              regex: "^[0-9a-cA-C].*"
      route:
        - destination:
            host: content-ranking-svc
            subset: v45
          weight: 100
    # Treatment group: all remaining user IDs (starting with d-f, about 19%) get v47
    - route:
        - destination:
            host: content-ranking-svc
            subset: v47
          weight: 100
# Better A/B routing: use application-layer consistent hashing
# The VirtualService regex approach is fragile - prefer hash-based routing
# where the application or API gateway computes the bucket
import hashlib


def assign_ab_bucket(user_id: str, experiment_id: str, n_buckets: int = 100) -> int:
    """
    Consistent hash-based A/B assignment.
    Same user_id + experiment_id always produces same bucket.
    """
    key = f"{user_id}:{experiment_id}"
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest[:8], 16) % n_buckets

# Then use header-based routing in VirtualService based on
# an x-ab-bucket header set by the API gateway
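To connect the bucket to routing, the gateway can attach it as the x-ab-bucket header and the VirtualService can match on its value. The sketch below assumes an 80/20 split in which buckets 80-99 are treatment; `variant_for_bucket`, `ab_headers`, `TREATMENT_HEADER_REGEX`, and the experiment ID are illustrative names, and the hash function is repeated so the block runs on its own.

```python
import hashlib
import re


def assign_ab_bucket(user_id: str, experiment_id: str, n_buckets: int = 100) -> int:
    """Same user_id + experiment_id always maps to the same bucket."""
    key = f"{user_id}:{experiment_id}"
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest[:8], 16) % n_buckets


def variant_for_bucket(bucket: int, control_buckets: int = 80) -> str:
    """Buckets below the cutoff are control; the rest are treatment."""
    return "control" if bucket < control_buckets else "treatment"


# Regex a header match rule could use for treatment buckets 80-99
TREATMENT_HEADER_REGEX = r"^[89][0-9]$"


def ab_headers(user_id: str, experiment_id: str) -> dict:
    """Headers the gateway would add before forwarding the request."""
    return {"x-ab-bucket": str(assign_ab_bucket(user_id, experiment_id))}


headers = ab_headers("user-123", "ranking-v47-test")  # hypothetical IDs
is_treatment = re.fullmatch(TREATMENT_HEADER_REGEX, headers["x-ab-bucket"]) is not None
print(headers, "treatment" if is_treatment else "control")
```

Keeping the cutoff in one place (the regex and `control_buckets` must agree) avoids the situation where the gateway and the mesh disagree about which users are in the experiment.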
Circuit Breaker in Action
# DestinationRule with aggressive circuit breaker for ML inference
# If a model variant is returning errors, automatically stop routing to it
# (subsets omitted here for brevity; a DestinationRule applied under this name
# still needs the subsets that the VirtualService refers to)
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: content-ranking-dr
  namespace: ml-serving
spec:
  host: content-ranking-svc
  trafficPolicy:
    outlierDetection:
      # Eject a host after 5 consecutive 5xx errors
      consecutive5xxErrors: 5
      # Run the ejection analysis every 30 seconds
      interval: 30s
      # Keep ejected for at least 1 minute before trying again.
      # Repeated ejections last longer: the base time is multiplied by the
      # number of times the host has been ejected, up to a proxy-defined cap.
      baseEjectionTime: 60s
      maxEjectionPercent: 100  # Eject all instances if all are unhealthy
    loadBalancer:
      # Least connection: preferred for ML inference (variable response times)
      # (recent Istio releases name this policy LEAST_REQUEST)
      simple: LEAST_CONN
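The word "consecutive" matters in consecutive5xxErrors: a host that fails intermittently may never trip the breaker. The following is a deliberately simplified model of one host, ignoring the analysis interval and re-admission; `first_ejection_index` is a hypothetical helper, not an Envoy or Istio API.

```python
def first_ejection_index(status_codes, threshold=5):
    """Return the index of the response that would trigger ejection, or None."""
    consecutive = 0
    for index, code in enumerate(status_codes):
        if 500 <= code <= 599:
            consecutive += 1
            if consecutive >= threshold:
                return index
        else:
            # Any non-5xx response resets the streak
            consecutive = 0
    return None


# Two errors, a success, then five errors in a row
crashing = [200, 500, 500, 200, 500, 500, 500, 500, 500]
# Every other request fails: 50% error rate, but never 5 in a row
flaky = [500, 200] * 10

print(first_ejection_index(crashing))
print(first_ejection_index(flaky))
```

In this model the crashing host is ejected on its fifth consecutive error, while the flaky host stays in the pool despite a 50% error rate. That is one reason to pair outlier detection with error-rate alerting rather than rely on it alone.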
Envoy Telemetry for ML Inference
Istio's Envoy sidecar automatically generates metrics without any application-side code:
# Prometheus will automatically scrape:
# istio_requests_total{
#   destination_service_name="content-ranking-svc",
#   response_code="200",
#   source_app="api-gateway"
# }
# istio_request_duration_milliseconds_bucket{...}
#
# Label notes:
# - destination_service holds the full service host (for example
#   content-ranking-svc.ml-serving.svc.cluster.local); destination_service_name
#   holds the short name used in the queries below.
# - destination_version is taken from the pod's `version` label, so set that
#   label alongside model-version to get per-version breakdowns.

# Create Grafana dashboard queries for ML serving:
# P99 latency per model version:
# histogram_quantile(0.99,
#   sum(rate(istio_request_duration_milliseconds_bucket{
#     destination_service_name="content-ranking-svc"
#   }[5m])) by (le, destination_version)
# )

# Error rate per model version:
# sum(rate(istio_requests_total{
#   destination_service_name="content-ranking-svc",
#   response_code=~"5.."
# }[5m])) by (destination_version)
# /
# sum(rate(istio_requests_total{
#   destination_service_name="content-ranking-svc"
# }[5m])) by (destination_version)
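If the same queries are needed for several services, for dashboards or for a canary controller, they can be generated rather than copied. This is a small sketch with hypothetical helper names; the label matchers are passed in as a parameter because which label carries the short service name depends on your Istio metric configuration (destination_service_name is assumed in the usage below).

```python
def build_selector(matchers: dict) -> str:
    """Render exact-match label matchers as a PromQL selector body."""
    return ",".join(f'{label}="{value}"' for label, value in sorted(matchers.items()))


def p99_latency_query(matchers: dict, window: str = "5m") -> str:
    """P99 request duration per destination_version, from Istio's histogram."""
    selector = build_selector(matchers)
    return (
        "histogram_quantile(0.99, sum(rate("
        f"istio_request_duration_milliseconds_bucket{{{selector}}}[{window}]"
        ")) by (le, destination_version))"
    )


def error_ratio_query(matchers: dict, window: str = "5m") -> str:
    """Share of 5xx responses per destination_version."""
    selector = build_selector(matchers)
    errors = f'istio_requests_total{{{selector},response_code=~"5.."}}'
    total = f"istio_requests_total{{{selector}}}"
    return (
        f"sum(rate({errors}[{window}])) by (destination_version) / "
        f"sum(rate({total}[{window}])) by (destination_version)"
    )


matchers = {"destination_service_name": "content-ranking-svc"}
print(p99_latency_query(matchers))
print(error_ratio_query(matchers))
```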
Production Notes
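Start with traffic mirroring: Before shifting real traffic to a new model version, use Istio's mirroring feature to send a copy of production traffic to the new model without serving the responses. This lets you evaluate the new model on real production traffic without affecting the responses users receive. Mirrored requests still consume capacity on the new version and on anything it calls, so size the mirror percentage accordingly.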
# Mirror 10% of traffic to v47 for evaluation (responses from v45 are served)
- route:
    - destination:
        host: content-ranking-svc
        subset: v45
      weight: 100
  mirror:
    host: content-ranking-svc
    subset: v47
  mirrorPercentage:
    value: 10.0
Timeout tuning: ML inference latency is variable - set timeouts based on P99 latency plus a
safety margin, not P50. A model with a 100ms P99 needs at least a 150ms timeout to avoid
excessive timeouts under load. Set timeout at the VirtualService level and keep perTryTimeout
small enough that the retries you allow can complete within the overall timeout.
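The rule of thumb can be written down directly. The sketch below derives a timeout from observed latency samples using a nearest-rank P99 and a 50% margin; the margin value is an assumption taken from the 100ms to 150ms example, and both function names are hypothetical.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def recommended_timeout_ms(latencies_ms, margin=0.5):
    """P99 latency plus a safety margin, rounded up to whole milliseconds."""
    p99 = percentile(latencies_ms, 99)
    return math.ceil(p99 * (1 + margin))


# Synthetic sample: most requests take 40ms, a slow tail takes 100ms
latencies_ms = [40] * 98 + [100] * 2

print(percentile(latencies_ms, 50))          # median: 40
print(percentile(latencies_ms, 99))          # P99: 100
print(recommended_timeout_ms(latencies_ms))  # 150
```

A timeout based on the 40ms median would fail the slow tail on every request, which is why the P99 is the safer starting point.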
:::tip Use Kiali for Service Mesh Visualization Kiali (available as an Istio addon, installed separately from the core control plane) provides a real-time topology visualization of all services in the mesh, including traffic split percentages, error rates per service, and latency histograms. For ML teams running A/B tests, Kiali makes it immediately visible which model version is receiving which fraction of traffic and how each is performing. :::
:::warning Service Mesh Is Infrastructure, Not ML Tooling A service mesh routes HTTP requests based on headers, weights, and circuit breaker state. It knows nothing about ML concepts: model accuracy, prediction confidence, feature drift. Use the service mesh for infrastructure-level concerns (traffic splitting, retries, mTLS) and handle ML-specific concerns (quality gates, A/B metric collection, model performance) in your application layer and monitoring stack. :::
:::danger mTLS and Model-to-Model Calls If your serving infrastructure has models calling other models (e.g., a feature model feeds into a ranking model), enabling strict mutual TLS (mTLS) in Istio means all inter-service calls must be authenticated. Ensure all ML services in the mesh have properly configured service accounts and certificates. In PERMISSIVE mode (the default), mTLS is accepted but not required - both plain HTTP and mTLS work. In STRICT mode, all inter-service traffic must use mTLS. :::
Interview Q&A
Q: What is a service mesh and why is it useful for ML serving?
A service mesh is an infrastructure layer that handles service-to-service communication in distributed systems. In Kubernetes, it is implemented as sidecar proxies (Envoy in Istio) injected into every pod, with a control plane distributing configuration. For ML serving, it provides: declarative traffic splitting for canary deployments and A/B tests (without application code changes), automatic circuit breaking (stop routing to failing model versions), mutual TLS (encrypted inter-service communication), and automatic telemetry (latency, error rates per service and route) without modifying ML application code.
Q: How do you implement a canary deployment for an ML model with Istio?
Deploy the new model version as a separate Kubernetes Deployment with a different label (e.g.,
model-version: v47). Define it as a subset in an Istio DestinationRule. Update the VirtualService
to send a small percentage (5-10%) of traffic to the new subset, keeping the majority on the
champion. Monitor quality metrics during the canary period. Gradually increase the canary weight
if metrics are healthy. Roll back immediately by setting canary weight to 0 if metrics degrade.
The traffic shift is a Kubernetes resource update - no pod restarts required.
Q: What is Istio traffic mirroring and when should you use it for ML?
Traffic mirroring (also called shadowing) sends a copy of live production traffic to a new model version, but serves the original version's responses to users. This lets you evaluate the new model against production traffic distribution with zero user impact. Use it before starting a canary: mirror 10-20% of traffic to the new model, collect its predictions and latency metrics, compare to the champion's metrics. If the new model performs well in shadow mode, proceed with canary. If not, discard without any user exposure.
Q: What does Istio's circuit breaker do and why is it important for ML serving?
Istio's circuit breaker (configured via outlierDetection in DestinationRule) monitors each
instance of a service for errors. If an instance returns too many consecutive errors (5xx
responses), Istio ejects it from the load balancing pool for a time period. For ML serving,
this means: if one pod of a model deployment starts returning errors (OOM, deadlock, bad model
file), Istio automatically stops routing to it, protecting the overall service. Without circuit
breaking, a bad pod continues receiving traffic and producing errors until a human intervenes.
Q: How do you do consistent A/B assignment with Istio to ensure the same user always goes to the same model?
Istio header-based routing ensures the same request characteristics go to the same subset.
For consistent user assignment, compute a hash bucket from the user ID in your API gateway or
authentication layer, set an x-ab-bucket header (e.g., 0-99), then use Istio VirtualService
header match rules to route bucket 0-79 to the control group and 80-99 to treatment. The hash
must be consistent (same user always same bucket, independent of time or request count). MD5 or
SHA-based hashing with a stable salt (experiment ID) achieves this. This is more reliable than
Istio's regex-on-user-ID approach.
