Service Mesh for ML Serving

Five Models, Three Versions Each, Zero Visibility

The ML serving team at a recommendation platform had accumulated a traffic routing problem so complex that no single person understood the full picture. They had five active models (content ranking, user affinity, session context, diversity re-ranker, freshness booster) each with up to three simultaneously-served versions: the production champion, a staging candidate being shadow-tested, and sometimes a legacy rollback version kept warm for emergency failover.

Traffic routing rules were implemented as environment variables in deployment YAMLs, custom Nginx configs, application-layer feature flags, and one genuinely alarming bash script checked into the infra repo. Changing a routing rule required editing multiple files, redeploying multiple services, and hoping that no one had recently made a conflicting change in a file you did not know existed.

The breaking point was an A/B test on the content ranking model. The test required routing 10% of traffic to model-v47, 85% to model-v45 (production champion), and 5% to model-v43 (legacy variant for a specific user segment). Implementing this without a service mesh took two engineers four days. When A/B testing at this complexity takes four days to set up, you run fewer tests. When you run fewer tests, your models improve more slowly. The business impact was real.

The service mesh migration - Istio - reduced new routing rule implementation to 15 minutes on average. More importantly, it made all routing rules visible, auditable, and testable in a single place.


Why This Exists

A service mesh is a dedicated infrastructure layer for service-to-service communication. In Kubernetes, it is typically implemented as a sidecar proxy (Envoy) injected into every pod, with a control plane (Istiod) that distributes routing rules, policies, and telemetry collection configuration to all proxies.

For ML serving specifically, Istio solves problems that application-layer routing cannot:

Traffic splitting without code changes: Route 5% of requests to model-v47 by weight, independent of any application-layer logic
Header-based routing: Route requests with X-User-Segment: premium to a higher-capacity model variant
Circuit breaking: Automatically stop routing to a model that is returning errors, preventing cascading failures
Mutual TLS: Encrypted, authenticated communication between services without application code changes
Telemetry: Automatic per-service, per-route latency histograms and error rates in Prometheus
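
To make weight-based splitting concrete, the following is a minimal simulation, not Envoy's actual selection algorithm. It assumes each request is assigned independently with probability proportional to the subset weight; the function name `simulate_split` and the request count are illustrative.

```python
import random
from collections import Counter


def simulate_split(weights: dict[str, int], n_requests: int, seed: int = 0) -> Counter:
    """Pick a subset for each request with probability proportional to its weight."""
    if sum(weights.values()) != 100:
        raise ValueError("route weights should sum to 100")
    rng = random.Random(seed)  # seeded so the simulation is repeatable
    subsets = list(weights)
    picks = rng.choices(subsets, weights=[weights[s] for s in subsets], k=n_requests)
    return Counter(picks)


# The 85/10/5 split from the A/B test described earlier
split = simulate_split({"v45": 85, "v47": 10, "v43-legacy": 5}, n_requests=100_000)
for subset, count in split.most_common():
    print(f"{subset}: {count / 100_000:.1%}")
```

Because assignment is per request, the observed shares only approach the configured weights over many requests; a low-traffic service can deviate noticeably over short windows.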

Istio Architecture for ML

Istio Installation

# Install Istio (istioctl method)
# Download istioctl
curl -L https://istio.io/downloadIstio | ISTIO_VERSION=1.21.0 sh -
cd istio-1.21.0
export PATH=$PWD/bin:$PATH

# Install Istio with the default profile (suitable for production)
istioctl install --set profile=default -y

# Enable automatic sidecar injection in the ML namespace
kubectl label namespace ml-serving istio-injection=enabled

# Verify installation
kubectl get pods -n istio-system
istioctl verify-install

Core Concepts: VirtualService and DestinationRule

The two primary Istio resources for ML routing:

DestinationRule: Defines named subsets of a service (by pod labels) and connection pool settings
VirtualService: Defines routing rules - which subset gets which percentage of traffic

# kubernetes/istio/content-ranking-destination-rule.yaml
# DestinationRule: define which pod versions belong to which subset
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: content-ranking-dr
  namespace: ml-serving
spec:
  host: content-ranking-svc    # Kubernetes service name
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        h2UpgradePolicy: UPGRADE
        idleTimeout: 30s
    # Circuit breaker settings (applies to all subsets)
    outlierDetection:
      consecutive5xxErrors: 5          # Eject after 5 consecutive 5xx errors
      interval: 30s                    # Check every 30 seconds
      baseEjectionTime: 30s            # Eject for at least 30 seconds
      maxEjectionPercent: 50           # Never eject more than 50% of instances
  subsets:
    - name: v45
      labels:
        model-version: "v45"           # Matches pods with this label
      trafficPolicy:
        connectionPool:
          http:
            http1MaxPendingRequests: 50
    - name: v47
      labels:
        model-version: "v47"
    - name: v43-legacy
      labels:
        model-version: "v43"
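
A subset is just a label selector applied to the pods behind the service. The sketch below illustrates that idea in plain Python; the pod names and the `pods_in_subset` helper are hypothetical and are not part of any Istio or Kubernetes API.

```python
def pods_in_subset(pods: list[dict], subset_labels: dict[str, str]) -> list[str]:
    """Return names of pods whose labels contain every label the subset requires."""
    return [
        pod["name"]
        for pod in pods
        if all(pod["labels"].get(key) == value for key, value in subset_labels.items())
    ]


# Hypothetical pods behind content-ranking-svc
PODS = [
    {"name": "ranking-v45-a", "labels": {"app": "content-ranking", "model-version": "v45"}},
    {"name": "ranking-v45-b", "labels": {"app": "content-ranking", "model-version": "v45"}},
    {"name": "ranking-v47-a", "labels": {"app": "content-ranking", "model-version": "v47"}},
]

print(pods_in_subset(PODS, {"model-version": "v45"}))
print(pods_in_subset(PODS, {"model-version": "v43"}))  # no pods carry this label
```

A subset that selects no pods is worth checking for before routing traffic to it, since requests sent to an empty subset have nowhere to go.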

# kubernetes/istio/content-ranking-virtual-service.yaml
# VirtualService: traffic split between model versions
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: content-ranking-vs
  namespace: ml-serving
spec:
  hosts:
    - content-ranking-svc
  http:
    # Route 1: Premium user segment to v47 (new model)
    - match:
        - headers:
            x-user-segment:
              exact: premium
      route:
        - destination:
            host: content-ranking-svc
            subset: v47
          weight: 100

    # Route 2: Default traffic split (canary deployment)
    - route:
        - destination:
            host: content-ranking-svc
            subset: v45    # Production champion
          weight: 90
        - destination:
            host: content-ranking-svc
            subset: v47    # Canary
          weight: 10
      # Timeout and retry policies for this route
      timeout: 200ms
      retries:
        attempts: 2
        perTryTimeout: 80ms
        retryOn: "5xx,gateway-error,connect-failure,retriable-4xx"
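
The order of the `http` entries matters: rules are evaluated top to bottom and the first one that matches handles the request, which is why the premium rule sits above the catch-all split. A simplified model of that evaluation, assuming only exact header matches and lowercase header names (the `select_route` helper and the rule names are illustrative):

```python
def select_route(headers: dict[str, str], http_rules: list[dict]) -> dict:
    """Return the first rule that matches, mimicking first-match-wins evaluation."""
    for rule in http_rules:
        match_blocks = rule.get("match")
        if not match_blocks:
            return rule  # a rule without a match block is a catch-all
        # Match blocks are ORed; header conditions inside one block are ANDed.
        for block in match_blocks:
            conditions = block.get("headers", {})
            if all(headers.get(name) == cond["exact"] for name, cond in conditions.items()):
                return rule
    raise LookupError("no route matched")


HTTP_RULES = [
    {"name": "premium-to-v47",
     "match": [{"headers": {"x-user-segment": {"exact": "premium"}}}]},
    {"name": "default-split"},
]

print(select_route({"x-user-segment": "premium"}, HTTP_RULES)["name"])
print(select_route({"x-user-segment": "free"}, HTTP_RULES)["name"])
```

If the catch-all rule were listed first, the premium rule would never be reached.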

Progressive Canary Deployment

Automate the canary rollout process:

# scripts/canary_controller.py
"""
Automated canary deployment controller for ML model versions.
Gradually shifts traffic from champion to challenger, monitoring metrics at each step.
"""

import subprocess
import json
import time
import logging
from dataclasses import dataclass

logger = logging.getLogger(__name__)


@dataclass
class CanaryConfig:
    service_name: str
    namespace: str
    champion_version: str
    challenger_version: str
    # Traffic progression: list of (challenger_weight, hold_minutes) tuples
    stages: list[tuple[int, int]] | None = None
    # Prometheus query to check model quality during canary
    quality_metric_query: str = "rate(content_ranking_p50_ms[5m])"
    max_latency_regression_pct: float = 10.0  # Max allowed latency increase


class CanaryController:
    def __init__(self, config: CanaryConfig):
        self.config = config
        self.stages = config.stages or [
            (5, 10),    # 5% for 10 minutes
            (10, 15),   # 10% for 15 minutes
            (25, 20),   # 25% for 20 minutes
            (50, 30),   # 50% for 30 minutes
            (100, 0),   # 100% - full promotion
        ]

    def run_canary(self) -> bool:
        """
        Execute the canary progression.
        Returns True if successfully promoted to 100%, False if rolled back.
        """
        logger.info(
            f"Starting canary: {self.config.champion_version} -> "
            f"{self.config.challenger_version}"
        )

        for challenger_weight, hold_minutes in self.stages:
            champion_weight = 100 - challenger_weight
            self._update_traffic_split(champion_weight, challenger_weight)
            logger.info(
                f"Traffic split: champion={champion_weight}%, "
                f"challenger={challenger_weight}%"
            )

            if hold_minutes > 0:
                logger.info(f"Holding for {hold_minutes} minutes...")
                time.sleep(hold_minutes * 60)

                # Check quality metrics
                if not self._check_metrics_ok():
                    logger.error("Metrics degraded during canary - rolling back")
                    self._rollback()
                    return False

        logger.info(
            f"Canary complete: {self.config.challenger_version} now serves 100% of traffic"
        )
        return True

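    def _update_traffic_split(self, champion_weight: int, challenger_weight: int) -> None:
        """Update Istio VirtualService with new weights.

        Note: a JSON merge patch replaces the whole `http` list, so any other
        routes, header matches, or retry settings on the VirtualService are dropped.
        """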
        vs_patch = {
            "spec": {
                "http": [{
                    "route": [
                        {
                            "destination": {
                                "host": f"{self.config.service_name}-svc",
                                "subset": self.config.champion_version,
                            },
                            "weight": champion_weight,
                        },
                        {
                            "destination": {
                                "host": f"{self.config.service_name}-svc",
                                "subset": self.config.challenger_version,
                            },
                            "weight": challenger_weight,
                        },
                    ],
                    "timeout": "200ms",
                }]
            }
        }

        subprocess.run([
            "kubectl", "patch", "virtualservice",
            f"{self.config.service_name}-vs",
            "--namespace", self.config.namespace,
            "--type", "merge",
            "--patch", json.dumps(vs_patch),
        ], check=True)

    def _check_metrics_ok(self) -> bool:
        """Query Prometheus to check if challenger is performing acceptably."""
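        # Compare P99 latency between champion and challenger.
        # sum by (le) aggregates buckets across pods so each query returns one series.
        # (In practice, use the Prometheus API to query per-subset metrics)
        champion_p99 = self._query_prometheus(
            f'histogram_quantile(0.99, sum by (le) (rate(inference_latency_seconds_bucket'
            f'{{model_version="{self.config.champion_version}"}}[5m])))'
        )
        challenger_p99 = self._query_prometheus(
            f'histogram_quantile(0.99, sum by (le) (rate(inference_latency_seconds_bucket'
            f'{{model_version="{self.config.challenger_version}"}}[5m])))'
        )

        # histogram_quantile yields NaN when a version received no traffic;
        # v != v is True only for NaN. A non-positive champion value would
        # break the percentage calculation below.
        values = (champion_p99, challenger_p99)
        if any(v is None or v != v for v in values) or champion_p99 <= 0:
            logger.warning("Could not fetch metrics - assuming OK")
            return True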

        regression_pct = (challenger_p99 - champion_p99) / champion_p99 * 100
        if regression_pct > self.config.max_latency_regression_pct:
            logger.error(
                f"Latency regression: challenger P99={challenger_p99:.3f}s vs "
                f"champion P99={champion_p99:.3f}s "
                f"({regression_pct:.1f}% worse, limit {self.config.max_latency_regression_pct}%)"
            )
            return False

        logger.info(
            f"Metrics OK: challenger P99={challenger_p99:.3f}s "
            f"vs champion P99={champion_p99:.3f}s ({regression_pct:+.1f}%)"
        )
        return True

    def _rollback(self) -> None:
        """Immediately route all traffic back to champion."""
        self._update_traffic_split(champion_weight=100, challenger_weight=0)
        logger.info(f"Rolled back to {self.config.champion_version} (100% traffic)")

    def _query_prometheus(self, query: str) -> float | None:
        """Query Prometheus instant query endpoint."""
        import requests
        try:
            resp = requests.get(
                "http://prometheus.monitoring.svc.cluster.local:9090/api/v1/query",
                params={"query": query},
                timeout=10,
            )
            resp.raise_for_status()  # treat HTTP errors as a failed query
            data = resp.json()
            results = data.get("data", {}).get("result", [])
            if results:
                return float(results[0]["value"][1])
        except Exception as e:
            logger.warning(f"Prometheus query failed: {e}")
        return None
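
Before running a controller like this against a cluster, it can help to dry-run the schedule and the rollback threshold as pure functions. The sketch below uses the default stages and the 10% regression limit from the controller above; the helper names `rollout_plan`, `total_hold_minutes`, and `latency_regression_pct` are illustrative and not part of the controller.

```python
DEFAULT_STAGES = [(5, 10), (10, 15), (25, 20), (50, 30), (100, 0)]


def rollout_plan(stages: list[tuple[int, int]]) -> list[tuple[int, int, int]]:
    """Expand (challenger_weight, hold_minutes) into (champion, challenger, hold)."""
    return [(100 - challenger, challenger, hold) for challenger, hold in stages]


def total_hold_minutes(stages: list[tuple[int, int]]) -> int:
    """Minimum wall-clock time for a rollout that never rolls back."""
    return sum(hold for _, hold in stages)


def latency_regression_pct(champion_p99: float, challenger_p99: float) -> float:
    """Relative P99 increase of the challenger, in percent."""
    if champion_p99 <= 0:
        raise ValueError("champion latency must be positive")
    return (challenger_p99 - champion_p99) / champion_p99 * 100


for champion, challenger, hold in rollout_plan(DEFAULT_STAGES):
    print(f"champion={champion}% challenger={challenger}% hold={hold}min")
print(f"total hold time: {total_hold_minutes(DEFAULT_STAGES)} minutes")

# A challenger at 112 ms against a 100 ms champion exceeds a 10% limit
print(latency_regression_pct(0.100, 0.112) > 10.0)
```

Keeping the arithmetic separate from the `kubectl` and Prometheus calls also makes the rollback decision easy to unit test.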

A/B Testing Configuration

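# A/B test: route different user cohorts to different model versions
# Cohort is derived from the first character of the x-user-id header, which only
# approximates a hash split when IDs are uniformly distributed hex strings (e.g., UUIDs)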
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: content-ranking-ab-test
  namespace: ml-serving
spec:
  hosts:
    - content-ranking-svc
  http:
    # Control group: user IDs starting with hex 0-c (13 of 16 values, about 81%) get v45
    - match:
        - headers:
            x-user-id:
              # Regex: match user IDs whose first character is hex 0-c
              regex: "^[0-9a-cA-C].*"
      route:
        - destination:
            host: content-ranking-svc
            subset: v45
          weight: 100

    # Treatment group: everything else (IDs starting with d-f, about 19%,
    # plus any request without an x-user-id header) gets v47
    - route:
        - destination:
            host: content-ranking-svc
            subset: v47
          weight: 100

# Better A/B routing: use application-layer consistent hashing
# The VirtualService regex approach is fragile - prefer hash-based routing
# where the application or API gateway computes the bucket

def assign_ab_bucket(user_id: str, experiment_id: str, n_buckets: int = 100) -> int:
    """
    Consistent hash-based A/B assignment.
    Same user_id + experiment_id always produces same bucket.
    """
    import hashlib
    key = f"{user_id}:{experiment_id}"
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest[:8], 16) % n_buckets

# Then use header-based routing in VirtualService based on
# an x-ab-bucket header set by the API gateway
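
A possible way to connect the bucket to routing, sketched under assumptions: the gateway writes the bucket number into an `x-ab-bucket` header, and the VirtualService uses a regex match on that header to select buckets 0-79 as the control group. The regex, the experiment ID, and the helper `variant_for_bucket` are illustrative; the pattern is checked here with Python's `re` module, not with Istio itself.

```python
import hashlib
import re


def assign_ab_bucket(user_id: str, experiment_id: str, n_buckets: int = 100) -> int:
    """Same user_id + experiment_id always maps to the same bucket."""
    digest = hashlib.md5(f"{user_id}:{experiment_id}".encode()).hexdigest()
    return int(digest[:8], 16) % n_buckets


# Matches the strings "0".."79": one digit, or two digits with a leading 1-7
CONTROL_BUCKET_RE = re.compile(r"^([0-9]|[1-7][0-9])$")


def variant_for_bucket(bucket: int) -> str:
    """Map a bucket to a variant the way a header regex match would."""
    return "control" if CONTROL_BUCKET_RE.match(str(bucket)) else "treatment"


bucket = assign_ab_bucket("user-123", "ranking-v47-test")  # hypothetical IDs
print({"x-ab-bucket": str(bucket)}, variant_for_bucket(bucket))
```

Salting the hash with the experiment ID means a user who lands in treatment for one experiment is not automatically in treatment for the next one.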

Circuit Breaker in Action

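# DestinationRule with aggressive circuit breaker for ML inference
# If a model variant is returning errors, automatically stop routing to it
# Note: this reuses the name content-ranking-dr, so applying it as-is would replace
# the earlier rule and its subsets; in practice, merge these settings into that rule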
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: content-ranking-dr
  namespace: ml-serving
spec:
  host: content-ranking-svc
  trafficPolicy:
    outlierDetection:
      # Eject a host after 5 consecutive 5xx responses
      consecutive5xxErrors: 5
      # Run the ejection analysis sweep every 30 seconds
      interval: 30s
      # Base ejection duration; the actual duration is this value multiplied
      # by the number of times the host has been ejected
      baseEjectionTime: 60s
      maxEjectionPercent: 100  # Eject all instances if all are unhealthy
    loadBalancer:
      # Least connection: preferred for ML inference (variable response times)
      simple: LEAST_CONN
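
The ejection behaviour can be illustrated with a small simulation. This is a deliberately simplified model, not Envoy's implementation: it tracks consecutive 5xx responses per host and scales the ejection time by the number of prior ejections, and it ignores the `interval` sweep and `maxEjectionPercent`. The class and method names are illustrative.

```python
from dataclasses import dataclass


@dataclass
class HostState:
    consecutive_5xx: int = 0
    times_ejected: int = 0
    ejected_until: float = 0.0


class OutlierDetector:
    def __init__(self, consecutive_5xx_errors: int = 5, base_ejection_seconds: float = 60.0):
        self.threshold = consecutive_5xx_errors
        self.base = base_ejection_seconds
        self.hosts: dict[str, HostState] = {}

    def record(self, host: str, status_code: int, now: float) -> None:
        """Record one response; eject the host once the error streak hits the threshold."""
        state = self.hosts.setdefault(host, HostState())
        if 500 <= status_code < 600:
            state.consecutive_5xx += 1
        else:
            state.consecutive_5xx = 0  # any non-5xx response resets the streak
        if state.consecutive_5xx >= self.threshold:
            state.times_ejected += 1
            state.consecutive_5xx = 0
            # Ejection time grows with each ejection: base * times ejected
            state.ejected_until = now + self.base * state.times_ejected

    def is_ejected(self, host: str, now: float) -> bool:
        state = self.hosts.get(host)
        return state is not None and now < state.ejected_until


detector = OutlierDetector()
for second in range(5):
    detector.record("pod-a", 503, now=float(second))
print(detector.is_ejected("pod-a", now=10.0))   # inside the first 60 s ejection
print(detector.is_ejected("pod-a", now=100.0))  # ejection has expired
```

A single successful response in the middle of an error streak resets the counter, so a pod that fails intermittently may never be ejected by this check alone.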

Envoy Telemetry for ML Inference

Istio's Envoy sidecar automatically generates metrics without any application-side code:

# Prometheus will automatically scrape:
# istio_requests_total{
#   destination_service_name="content-ranking-svc",
#   response_code="200",
#   source_app="api-gateway"
# }
# istio_request_duration_milliseconds_bucket{...}
#
# Note: destination_service holds the full host
# (content-ranking-svc.ml-serving.svc.cluster.local); destination_service_name
# is the short name. destination_version is populated from the pod's `version`
# label, so pods need that label in addition to model-version.

# Create Grafana dashboard queries for ML serving:
# P99 latency per model version:
# histogram_quantile(0.99,
#   sum(rate(istio_request_duration_milliseconds_bucket{
#     destination_service_name="content-ranking-svc"
#   }[5m])) by (le, destination_version)
# )

# Error rate per model version:
# sum(rate(istio_requests_total{
#   destination_service_name="content-ranking-svc",
#   response_code=~"5.."
# }[5m])) by (destination_version)
# /
# sum(rate(istio_requests_total{
#   destination_service_name="content-ranking-svc"
# }[5m])) by (destination_version)
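
It helps to know how a P99 is derived from bucketed counts, because the result is an estimate bounded by the bucket edges. The sketch below is a simplified re-implementation of the linear interpolation that Prometheus documents for `histogram_quantile` on cumulative buckets; the bucket values are made up and the function is for illustration only.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    The last bucket is expected to be (math.inf, total_count).
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper):
                # Rank falls in the +Inf bucket: report the highest finite bound
                return buckets[-2][0]
            # Assume observations are spread evenly inside the bucket
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (upper - prev_bound) * fraction
        prev_bound, prev_count = upper, count
    return math.nan


# Hypothetical latency buckets in milliseconds: 100 requests in total
BUCKETS = [(50.0, 50.0), (100.0, 90.0), (250.0, 99.0), (math.inf, 100.0)]
print(histogram_quantile(0.50, BUCKETS))
print(histogram_quantile(0.95, BUCKETS))
print(histogram_quantile(0.99, BUCKETS))
```

With coarse buckets the estimate can be far from the true percentile, which matters when a canary gate compares two P99 values that fall inside the same bucket.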

Production Notes

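Start with traffic mirroring: Before shifting real traffic to a new model version, use Istio's mirroring feature to send a copy of production traffic to the new model. Mirrored requests are fire-and-forget: the new model's responses are discarded, so users only ever see the champion's output. This lets you evaluate the new model on real production traffic without user-facing risk, although the mirrored load still consumes serving capacity.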

# Mirror 10% of traffic to v47 for evaluation (responses from v45 are served)
- route:
    - destination:
        host: content-ranking-svc
        subset: v45
      weight: 100
  mirror:
    host: content-ranking-svc
    subset: v47
  mirrorPercentage:
    value: 10.0
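
Since mirrored responses are discarded by the mesh, comparing the two versions has to happen in your own logging pipeline. A minimal sketch, assuming both versions log their ranked output keyed by a shared request ID (the log structure and the `top1_agreement` helper are hypothetical):

```python
def top1_agreement(champion: dict[str, list[str]], shadow: dict[str, list[str]]) -> float:
    """Fraction of mirrored requests where both versions rank the same item first."""
    shared = sorted(champion.keys() & shadow.keys())  # only requests seen by both
    if not shared:
        raise ValueError("no request IDs in common")
    same = sum(1 for rid in shared if champion[rid][:1] == shadow[rid][:1])
    return same / len(shared)


# Hypothetical logged rankings: request ID -> ranked item IDs
champion_log = {"r1": ["a", "b"], "r2": ["c", "d"], "r3": ["e", "f"]}
shadow_log = {"r1": ["a", "c"], "r2": ["d", "c"], "r4": ["x"]}

print(f"top-1 agreement: {top1_agreement(champion_log, shadow_log):.0%}")
```

Only a sample of requests is mirrored, so the join on request ID is what keeps the comparison like-for-like.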

Timeout tuning: ML inference latency is variable, so set timeouts based on P99 latency plus a safety margin, not P50. A model with a 100ms P99 needs a timeout of at least 150ms to avoid spurious timeouts under load. Set timeout at the VirtualService level and size retries so that perTryTimeout multiplied by the total number of tries (the initial request plus the attempts retries) fits within the overall timeout; otherwise the overall timeout cuts the last retry short.
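
The timeout arithmetic is easy to check with a few lines of code. This sketch assumes `attempts` counts retries on top of the initial request and ignores the backoff between retries; the helper names and the 50% margin are illustrative.

```python
def suggested_timeout_ms(p99_ms: float, margin: float = 0.5) -> float:
    """Timeout derived from P99 latency plus a relative safety margin."""
    return p99_ms * (1 + margin)


def worst_case_retry_ms(attempts: int, per_try_timeout_ms: int) -> int:
    """Time consumed if the initial try and every retry hit the per-try timeout."""
    return (1 + attempts) * per_try_timeout_ms


def retries_fit(timeout_ms: int, attempts: int, per_try_timeout_ms: int) -> bool:
    """True if all tries can run to their per-try timeout within the overall timeout."""
    return worst_case_retry_ms(attempts, per_try_timeout_ms) <= timeout_ms


print(suggested_timeout_ms(100))                                       # 100 ms P99
print(worst_case_retry_ms(attempts=2, per_try_timeout_ms=80))          # 3 tries of 80 ms
print(retries_fit(timeout_ms=200, attempts=2, per_try_timeout_ms=80))  # exceeds 200 ms
print(retries_fit(timeout_ms=200, attempts=1, per_try_timeout_ms=80))  # fits
```

When the budget does not fit, the options are fewer retries, a shorter per-try timeout, or a longer overall timeout; which one is right depends on the caller's latency budget.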

:::tip Use Kiali for Service Mesh Visualization Kiali (shipped as an optional addon in the Istio release's samples/addons directory and installed separately) provides a real-time topology visualization of all services in the mesh, including traffic split percentages, error rates per service, and latency histograms. For ML teams running A/B tests, Kiali makes it immediately visible which model version is receiving which fraction of traffic and how each is performing. :::

:::warning Service Mesh Is Infrastructure, Not ML Tooling A service mesh routes HTTP requests based on headers, weights, and circuit breaker state. It knows nothing about ML concepts: model accuracy, prediction confidence, feature drift. Use the service mesh for infrastructure-level concerns (traffic splitting, retries, mTLS) and handle ML-specific concerns (quality gates, A/B metric collection, model performance) in your application layer and monitoring stack. :::

:::danger mTLS and Model-to-Model Calls If your serving infrastructure has models calling other models (e.g., a feature model feeds into a ranking model), enabling strict mutual TLS (mTLS) in Istio means every inter-service call must present a mesh-issued certificate. Istio provisions and rotates these certificates automatically for workloads with an injected sidecar, so the usual failure is a caller without a sidecar (a batch job, a notebook, a service in a non-injected namespace) being rejected. In PERMISSIVE mode (the default), mTLS is accepted but not required - both plaintext and mTLS traffic work. In STRICT mode, all inter-service traffic must use mTLS. :::

Interview Q&A

Q: What is a service mesh and why is it useful for ML serving?

A service mesh is an infrastructure layer that handles service-to-service communication in distributed systems. In Kubernetes, it is implemented as sidecar proxies (Envoy in Istio) injected into every pod, with a control plane distributing configuration. For ML serving, it provides: declarative traffic splitting for canary deployments and A/B tests (without application code changes), automatic circuit breaking (stop routing to failing model versions), mutual TLS (encrypted inter-service communication), and automatic telemetry (latency, error rates per service and route) without modifying ML application code.

Q: How do you implement a canary deployment for an ML model with Istio?

Deploy the new model version as a separate Kubernetes Deployment with a different label (e.g., model-version: v47). Define it as a subset in an Istio DestinationRule. Update the VirtualService to send a small percentage (5-10%) of traffic to the new subset, keeping the majority on the champion. Monitor quality metrics during the canary period. Gradually increase the canary weight if metrics are healthy. Roll back immediately by setting canary weight to 0 if metrics degrade. The traffic shift is a Kubernetes resource update - no pod restarts required.

Q: What is Istio traffic mirroring and when should you use it for ML?

Traffic mirroring (also called shadowing) sends a copy of live production traffic to a new model version, but serves the original version's responses to users. This lets you evaluate the new model against production traffic distribution with zero user impact. Use it before starting a canary: mirror 10-20% of traffic to the new model, collect its predictions and latency metrics, compare to the champion's metrics. If the new model performs well in shadow mode, proceed with canary. If not, discard without any user exposure.

Q: What does Istio's circuit breaker do and why is it important for ML serving?

Istio's circuit breaker (configured via outlierDetection in DestinationRule) monitors each instance of a service for errors. If an instance returns too many consecutive errors (5xx responses), Istio ejects it from the load balancing pool for a time period. For ML serving, this means: if one pod of a model deployment starts returning errors (OOM, deadlock, bad model file), Istio automatically stops routing to it, protecting the overall service. Without circuit breaking, a pod that still passes its readiness probe but returns errors keeps receiving traffic until a human intervenes.

Q: How do you do consistent A/B assignment with Istio to ensure the same user always goes to the same model?

Istio header-based routing ensures the same request characteristics go to the same subset. For consistent user assignment, compute a hash bucket from the user ID in your API gateway or authentication layer, set an x-ab-bucket header (e.g., 0-99), then use Istio VirtualService header match rules to route bucket 0-79 to the control group and 80-99 to treatment. The hash must be consistent (same user always same bucket, independent of time or request count). MD5 or SHA-based hashing with a stable salt (experiment ID) achieves this. This is more reliable than Istio's regex-on-user-ID approach.
