Pods, Deployments, and Services - Deep Dive

The Traffic Spike That Revealed the Gap

Your recommendation model has been running in production for six weeks. Then a marketing campaign drops at 9am on a Tuesday and traffic spikes 8x in under two minutes. You watch the dashboard: p99 latency climbs from 80ms to 3.4 seconds, then requests start returning 502 errors. Your serving Deployment has 3 replicas, all running, all healthy according to liveness probes. But the Service is routing traffic to a pod that's mid-update - you pushed a new version six minutes ago and one pod is still loading the new model weights. It's marked ready by mistake because you forgot to update the initialDelaySeconds after upgrading to a larger model version.

This scenario illustrates that understanding Pods, Deployments, and Services at a surface level is not enough. The details matter: how rolling updates interact with readiness probes, how Service endpoint selection works, how pod scheduling decisions cascade during failures, and how to design pod specs that fail safely. This lesson goes deep on all three primitives.

The Pod Lifecycle in Detail

Most ML engineers think of a pod as "a container running on a node." That's true, but pods have a structured lifecycle with distinct phases that affect how you monitor and debug them.

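Pending: The pod has been accepted by the API server but is not yet running. This phase covers both the time spent waiting for the scheduler to assign a node and the time the kubelet spends pulling container images. Large ML images (10–30 GB) can keep a pod in Pending for 3–10 minutes on first pull. Later pods on the same node reuse the cached image and skip the pull; a pod scheduled to a different node pays the pull cost again.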

Init containers: Special containers that run to completion before the main containers start. Used to download model weights, run database migrations, or wait for dependencies to be ready. If an init container fails, Kubernetes restarts it (following the pod's restart policy) until it succeeds or the pod is marked failed.

Running → Ready: The pod transitions from Running (containers are executing) to Ready (readiness probe is passing) only after your health check passes. Traffic is only routed to Ready pods.

Terminating: When a pod is deleted, Kubernetes sends SIGTERM to all containers and starts a grace period (default: 30 seconds). Your ML server should handle SIGTERM by finishing in-flight requests and then exiting. If the process hasn't exited after the grace period, Kubernetes sends SIGKILL.

import signal

from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()
shutdown_requested = False

def handle_sigterm(signum, frame):
    """Graceful shutdown: stop accepting new requests, let in-flight ones finish."""
    global shutdown_requested
    shutdown_requested = True

# Note: ASGI servers such as uvicorn install their own SIGTERM handler at startup,
# which can replace this one. Check how your server handles signals before relying on it.
signal.signal(signal.SIGTERM, handle_sigterm)

@app.middleware("http")
async def reject_during_shutdown(request, call_next):
    if shutdown_requested:
        # Tell clients and upstream proxies to retry on another replica.
        # Kubernetes itself stops routing based on pod readiness, not on this response.
        return JSONResponse({"detail": "shutting down"}, status_code=503)
    return await call_next(request)

And set a matching terminationGracePeriodSeconds in the pod spec:

spec:
  terminationGracePeriodSeconds: 60    # give model server 60s to drain requests
  containers:
    - name: model-server
      # ...
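
The Running → Ready transition and the SIGTERM handling above both come down to one question: should this replica receive traffic right now? A minimal sketch of that state follows, using only the standard library. The class and method names are illustrative assumptions, not part of any Kubernetes or FastAPI API; a readiness endpoint would return whatever `probe_status()` reports.

```python
import threading


class ReadinessState:
    """Tracks whether this replica should receive traffic (hypothetical helper)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._model_loaded = False
        self._shutting_down = False

    def mark_model_loaded(self) -> None:
        # Call once the model weights are fully loaded into memory.
        with self._lock:
            self._model_loaded = True

    def begin_shutdown(self) -> None:
        # Call from the SIGTERM path so the readiness probe starts failing.
        with self._lock:
            self._shutting_down = True

    def probe_status(self) -> int:
        """HTTP status a readiness endpoint would return: 200 if ready, else 503."""
        with self._lock:
            ready = self._model_loaded and not self._shutting_down
        return 200 if ready else 503
```

Tying readiness to "weights loaded" rather than to a fixed delay means a larger model simply stays not-ready for longer, instead of being marked ready too early.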

Init Containers for ML - Pre-loading Model Weights

One of the most useful patterns for ML serving is using an init container to download model weights from a registry before the main container starts. This allows:

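Smaller server images that pull faster (the weights are fetched at pod startup instead of being baked into every image build; note that init containers finish before the main container's image is started, so the download does not overlap with it)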
Separation of concerns (the model server image doesn't need the AWS CLI or GCS client built in)
Retry logic for flaky downloads without complicating the model server code

spec:
  initContainers:
    - name: download-model
      image: amazon/aws-cli:2.15.0
      command:
        - sh
        - -c
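        - |
          set -e   # abort on the first failed command so a bad download fails the init container
          echo "Downloading model weights..."
          aws s3 cp s3://ml-artifacts/fraud/v2.1.0/ /model-cache/ --recursive
          echo "Download complete. Verifying checksum..."
          cd /model-cache && sha256sum -c checksums.txt   # paths in checksums.txt are relative
          echo "Verification passed."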
      env:
        - name: AWS_REGION
          value: "us-east-1"
      envFrom:
        - secretRef:
            name: s3-credentials
      volumeMounts:
        - name: model-cache
          mountPath: /model-cache

  containers:
    - name: model-server
      image: registry.company.com/fraud-server:v2.1.0-slim  # no model weights included
      volumeMounts:
        - name: model-cache
          mountPath: /models                # reads from here, downloaded by init container
      env:
        - name: MODEL_PATH
          value: "/models/model.pt"         # s3 cp --recursive copies the prefix contents directly into the volume root

  volumes:
    - name: model-cache
      emptyDir: {}   # shared between init container and main container
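
The checksum step in the init container can also be done from Python, for example when the model server wants to re-verify the weights before loading them. This is a sketch under the assumption that `checksums.txt` uses the `sha256sum` output format (`<hex digest>  <relative path>` per line); the function names are illustrative.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so multi-gigabyte weights never sit fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksums(model_dir: Path, manifest_name: str = "checksums.txt") -> list[str]:
    """Return the relative paths that are missing or whose hash does not match."""
    failures = []
    for line in (model_dir / manifest_name).read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # sha256sum marks binary-mode entries with '*'
        target = model_dir / name
        if not target.is_file() or sha256_of(target) != expected.lower():
            failures.append(name)
    return failures
```

An empty list means every file listed in the manifest matched; anything else is a reason to refuse to start serving.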

:::tip Caching Model Weights with PersistentVolumeClaims
For model weights that rarely change, use a PVC instead of emptyDir for the model cache. The init container downloads the weights on first run and checks a version file. On subsequent pod starts (restarts, rescheduling), it skips the download if the version matches. For a 4 GB model this can cut pod startup from roughly 5 minutes (full download) to about 10 seconds (version check only). If replicas run on different nodes, the volume must support an access mode that allows that (ReadWriteMany or ReadOnlyMany); a ReadWriteOnce volume can only be mounted from one node.

# Init container script with caching
set -e   # stop before writing version.txt if the download fails
if [ -f /model-cache/version.txt ] && [ "$(cat /model-cache/version.txt)" = "v2.1.0" ]; then
  echo "Model cache is current. Skipping download."
else
  aws s3 cp s3://ml-artifacts/fraud/v2.1.0/ /model-cache/ --recursive
  echo "v2.1.0" > /model-cache/version.txt
fi

:::
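
The same version-marker logic, sketched in Python with the download step injected as a callable so it can be tested without touching S3. `ensure_model_cached` and the `download` parameter are hypothetical names for illustration; the marker file name matches the shell script above.

```python
from pathlib import Path
from typing import Callable


def ensure_model_cached(
    cache_dir: Path,
    version: str,
    download: Callable[[Path], None],
) -> bool:
    """Download into cache_dir unless version.txt already matches.

    Returns True if a download ran, False if the cache was already current.
    """
    marker = cache_dir / "version.txt"
    if marker.is_file() and marker.read_text().strip() == version:
        return False
    # If download raises, the marker is never written, so the next start retries.
    download(cache_dir)
    marker.write_text(version + "\n")
    return True
```

Writing the marker only after the download returns is the important detail: a half-finished download must not look like a valid cache.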

Sidecar Containers for ML Observability

A sidecar is a second container in the same pod that provides auxiliary functionality - logging, metrics collection, or request tracing - without modifying the main model server image.

spec:
  containers:
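    - name: model-server
      image: registry.company.com/fraud-model:v2.1.0
      ports:
        - containerPort: 8080
      volumeMounts:
        - name: logs
          mountPath: /logs              # model-server writes log files here for log-shipper to read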

    - name: log-shipper
      image: fluent/fluent-bit:3.0
      volumeMounts:
        - name: logs
          mountPath: /logs
      env:
        - name: LOKI_URL
          value: "http://loki-svc.monitoring:3100"

    - name: metrics-proxy
      image: prom/pushgateway:v1.7.0
      # Accepts metrics pushed by model-server; Prometheus scrapes them from this port
      ports:
        - containerPort: 9091

  volumes:
    - name: logs
      emptyDir: {}

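All containers in the pod share its network namespace, so model-server can reach metrics-proxy at localhost:9091. They also share the pod's lifecycle: they are scheduled, started, and deleted together. If one container exits, the kubelet restarts that container according to the pod's restartPolicy, and the pod is not Ready (and so receives no Service traffic) until every container reports ready again.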

Deployments - Rolling Update Deep Dive

The rolling update is the workhorse of ML deployment. Understanding its exact mechanics prevents outages.

Key parameters controlling this process:

strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 1          # at most 1 extra pod above desired replica count
    maxUnavailable: 0    # never reduce below desired count during update

With maxSurge: 1 and maxUnavailable: 0 on a 3-replica Deployment, the rollout creates 1 new pod (now 4 running), waits for it to pass readiness, then terminates 1 old pod (back to 3), and repeats. This guarantees zero traffic reduction but costs extra capacity (4 pods' worth of resources) during the rollout.

For GPU workloads where the 4th GPU pod cannot be scheduled because the cluster is at capacity, use maxSurge: 0 and maxUnavailable: 1 instead - but accept that serving capacity temporarily drops by one replica during each step.
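
To see how maxSurge and maxUnavailable shape a rollout, here is a simplified simulation. It is a sketch, not the Deployment controller's actual algorithm: it assumes every new pod becomes Ready before the next step and ignores scheduling failures. The function name is illustrative.

```python
def simulate_rollout(replicas: int, max_surge: int, max_unavailable: int) -> list[tuple[int, int]]:
    """Return (old_pods, new_pods) after each step of a simplified rolling update."""
    if max_surge == 0 and max_unavailable == 0:
        raise ValueError("maxSurge and maxUnavailable cannot both be 0")
    old, new = replicas, 0
    steps = [(old, new)]
    while old > 0 or new < replicas:
        # Scale up: total pods may not exceed replicas + max_surge.
        can_add = min(replicas - new, replicas + max_surge - (old + new))
        if can_add > 0:
            new += can_add  # assumption: new pods pass readiness immediately
            steps.append((old, new))
        # Scale down: available pods may not drop below replicas - max_unavailable.
        can_remove = min(old, (old + new) - (replicas - max_unavailable))
        if can_remove > 0:
            old -= can_remove
            steps.append((old, new))
    return steps


if __name__ == "__main__":
    for surge, unavailable in [(1, 0), (0, 1)]:
        steps = simulate_rollout(3, surge, unavailable)
        totals = [o + n for o, n in steps]
        print(f"maxSurge={surge} maxUnavailable={unavailable}: "
              f"peak={max(totals)} pods, floor={min(totals)} pods")
```

For 3 replicas, the first setting peaks at 4 pods and never drops below 3; the second never exceeds 3 pods but dips to 2, which is the capacity trade-off described above.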

Watching a Rollout

# Monitor rollout progress
kubectl rollout status deployment/fraud-model -n ml-prod
# Waiting for deployment "fraud-model" rollout to finish: 1 out of 3 new replicas have been updated...
# Waiting for deployment "fraud-model" rollout to finish: 1 old replicas are pending termination...
# deployment "fraud-model" successfully rolled out

# View rollout history
kubectl rollout history deployment/fraud-model -n ml-prod
# REVISION  CHANGE-CAUSE
# 6         model v2.0.0
# 7         model v2.1.0
# 8         model v2.2.0

# Rollback if the new version has issues
kubectl rollout undo deployment/fraud-model -n ml-prod
# deployment.apps/fraud-model rolled back

# Rollback to a specific revision
kubectl rollout undo deployment/fraud-model -n ml-prod --to-revision=7

Services - Endpoint Selection in Detail

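The Service endpoint selection mechanism is more dynamic than it first appears. The Endpoints controller watches the pods that match each Service's selector and maintains the Endpoints object for that Service, adding and removing pod IPs as their readiness state changes. Current Kubernetes versions track the same information in EndpointSlice objects, but the behavior described here is the same.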

# Watch the Endpoints object - see pods added/removed in real time during rollout
kubectl get endpoints fraud-model-svc -n ml-prod -w
# NAME               ENDPOINTS                                         AGE
# fraud-model-svc    10.0.1.5:8080,10.0.1.6:8080,10.0.1.7:8080       14d
# fraud-model-svc    10.0.1.5:8080,10.0.1.6:8080                      14d  # pod-c removed (terminating)
# fraud-model-svc    10.0.1.5:8080,10.0.1.6:8080,10.0.1.8:8080       14d  # pod-d added (new version)
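
The selection rule behind that output can be sketched in a few lines. This is a simplified model, not the controller's code: the `Pod` dataclass and `select_endpoints` function are illustrative, and real endpoint objects carry more state than a ready flag.

```python
from dataclasses import dataclass, field


@dataclass
class Pod:
    name: str
    ip: str
    labels: dict[str, str] = field(default_factory=dict)
    ready: bool = False
    terminating: bool = False


def select_endpoints(pods: list[Pod], selector: dict[str, str], port: int) -> list[str]:
    """Addresses a Service would route to: labels match, pod is Ready, not terminating."""
    return [
        f"{pod.ip}:{port}"
        for pod in pods
        if all(pod.labels.get(key) == value for key, value in selector.items())
        and pod.ready
        and not pod.terminating
    ]
```

A pod that is Running but has not passed its readiness probe is simply absent from the list, which is why a correct readiness probe matters more than a liveness probe for traffic safety.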

Service Types for ML

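Kubernetes offers several Service types: ClusterIP (a virtual IP reachable only inside the cluster, the default), NodePort (a port opened on every node), and LoadBalancer (an external load balancer provisioned by the cloud provider). For ML specifically: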

Internal ML microservices (feature servers, embedding servers, preprocessing pipelines) → ClusterIP
External-facing inference APIs → LoadBalancer (or ClusterIP behind an Ingress controller)
Prometheus scraping ML metrics → ClusterIP (Prometheus is in-cluster)

Headless Services for Distributed Training

For distributed training where pods need to discover each other's individual IPs (not load-balanced), use a headless Service (clusterIP: None). This returns the individual pod IPs when DNS-queried instead of a single virtual IP:

apiVersion: v1
kind: Service
metadata:
  name: distributed-training-svc
  namespace: team-fraud
spec:
  clusterIP: None                    # headless - no VIP, returns pod IPs
  selector:
    job-name: fraud-distributed-train
  ports:
    - port: 23456
      name: pytorch-dist

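PyTorch distributed can use this for rendezvous, with one caveat: the Service name itself (distributed-training-svc) resolves to the IPs of all matching pods, not only the master. To point MASTER_ADDR at one specific pod, use its per-pod DNS name, <pod-hostname>.distributed-training-svc, which exists when the pod sets hostname and subdomain (an Indexed Job sets the hostname for you; the pod template sets subdomain to the Service name). Every worker's torchrun then reaches the same rank-0 pod.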
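
To check what a headless Service name actually returns from inside a pod, a small resolver sketch using only the standard library is enough. The function name is illustrative, and the fully qualified name in the usage note assumes the default `cluster.local` cluster domain.

```python
import socket


def resolve_all_ipv4(hostname: str, port: int) -> list[str]:
    """Return every distinct IPv4 address the resolver reports for hostname."""
    infos = socket.getaddrinfo(
        hostname, port, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr is (ip, port).
    return sorted({info[4][0] for info in infos})
```

Inside the cluster, `resolve_all_ipv4("distributed-training-svc.team-fraud.svc.cluster.local", 23456)` would be expected to list one address per ready training pod, whereas a regular ClusterIP Service would return a single virtual IP.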

Advanced Scheduling - Affinity and Topology Spread

Pod Anti-Affinity for High Availability

The default scheduler might place all 3 replicas on the same node if it has capacity. One node failure takes down all serving capacity. Pod anti-affinity prevents this:

spec:
  affinity:
    podAntiAffinity:
      # Hard rule: never schedule two replicas on the same node
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: app
                operator: In
                values: ["fraud-model"]
          topologyKey: kubernetes.io/hostname   # spread by node

      # Soft rule: prefer spreading across availability zones
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchExpressions:
                - key: app
                  operator: In
                  values: ["fraud-model"]
            topologyKey: topology.kubernetes.io/zone

Topology Spread Constraints - More Flexible HA

Topology spread constraints give you more control than pod anti-affinity for distributing pods evenly across failure domains:

spec:
  topologySpreadConstraints:
    - maxSkew: 1                              # max difference between most-loaded and least-loaded zone
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule       # hard: don't schedule if constraint would be violated
      labelSelector:
        matchLabels:
          app: fraud-model
    - maxSkew: 1
      topologyKey: kubernetes.io/hostname    # also spread across individual nodes
      whenUnsatisfiable: ScheduleAnyway      # soft: prefer to spread but don't block scheduling
      labelSelector:
        matchLabels:
          app: fraud-model

With 3 pods across 3 AZs, each AZ gets exactly 1 pod. If one AZ fails, 2 of 3 pods survive.
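
The maxSkew check can be sketched as a small function. This is a simplified model of the rule (pods in the candidate domain after placement, minus the current minimum across domains, must not exceed maxSkew); it ignores node eligibility and the other constraints the real scheduler evaluates. Function names are illustrative.

```python
def can_place(counts: dict[str, int], domain: str, max_skew: int) -> bool:
    """True if adding one pod to `domain` keeps the skew within max_skew."""
    skew_after = counts[domain] + 1 - min(counts.values())
    return skew_after <= max_skew


def spread(pod_count: int, zones: list[str], max_skew: int = 1) -> dict[str, int]:
    """Place pods one at a time into the first zone that satisfies the constraint."""
    counts = {zone: 0 for zone in zones}
    for _ in range(pod_count):
        zone = next(z for z in zones if can_place(counts, z, max_skew))
        counts[zone] += 1
    return counts
```

With three zones and maxSkew 1, `spread(3, ...)` ends at one pod per zone and `spread(6, ...)` at two per zone, because a zone that is already ahead of the emptiest zone is rejected until the others catch up.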

Multi-Container Patterns for ML

The Feature Enrichment Sidecar

Some teams run a lightweight feature enrichment sidecar alongside the model server. The sidecar maintains a local cache of user features (refreshed from Redis every 30 seconds) and serves them to the model server via a local gRPC interface on localhost. This eliminates one network hop on the hot path.

spec:
  containers:
    - name: model-server
      image: registry.company.com/fraud-model:v2.1.0
      env:
        - name: FEATURE_CACHE_ADDR
          value: "localhost:9999"    # sidecar is on localhost

    - name: feature-cache
      image: registry.company.com/feature-cache-sidecar:v1.3
      env:
        - name: REDIS_URL
          value: "redis://redis-svc:6379"
        - name: REFRESH_INTERVAL_SECONDS
          value: "30"
      resources:
        requests:
          cpu: "0.5"
          memory: "512Mi"
        limits:
          cpu: "1"
          memory: "1Gi"
      ports:
        - containerPort: 9999
          name: grpc
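
The core of such a sidecar is a cache that reloads its contents once the refresh interval has elapsed. The sketch below is an assumption about how one might be built, not the implementation of the image above: the `loader` callable stands in for the Redis read, and the cache refreshes lazily on access rather than in a background thread.

```python
import time
from typing import Any, Callable, Optional


class RefreshingCache:
    """Holds a snapshot of features and reloads it when it is older than refresh_seconds."""

    def __init__(
        self,
        loader: Callable[[], dict[str, Any]],
        refresh_seconds: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader            # hypothetical: e.g. a function that reads from Redis
        self._refresh_seconds = refresh_seconds
        self._clock = clock              # injectable so tests can control time
        self._data: dict[str, Any] = {}
        self._loaded_at: Optional[float] = None

    def get(self, key: str) -> Any:
        now = self._clock()
        stale = self._loaded_at is None or now - self._loaded_at >= self._refresh_seconds
        if stale:
            self._data = self._loader()
            self._loaded_at = now
        return self._data.get(key)
```

A production sidecar would more likely refresh in the background so that no request ever waits on Redis, but the staleness rule is the same.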

Deployment Pause and Resume for Canary

During a canary release, you might want to pause a rollout after one pod has been updated to observe metrics before continuing:

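# Deploy new version
kubectl set image deployment/fraud-model model-server=registry.company.com/fraud-model:v2.2.0 -n ml-prod

# Pause right away. Pods already created keep running, but no further pods are replaced.
# The rollout does not stop at exactly one pod on its own, so check how far it got.
kubectl rollout pause deployment/fraud-model -n ml-prod
kubectl get pods -n ml-prod -l app=fraud-model

# Observe metrics for 30 minutes...
# Check error rate, latency, and model accuracy

# If metrics look good, resume
kubectl rollout resume deployment/fraud-model -n ml-prod

# If something is wrong, roll back. A paused Deployment cannot be rolled back, so resume first.
kubectl rollout resume deployment/fraud-model -n ml-prod
kubectl rollout undo deployment/fraud-model -n ml-prod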

Production Notes

Image pull policy: Set imagePullPolicy: IfNotPresent in production. The Always policy makes the kubelet contact the registry on every container start, which adds latency and makes pod startup depend on registry availability. With immutable version tags or, better, pinned image digests, the image behind a reference never changes, so there is nothing to re-check.

containers:
  - name: model-server
    image: registry.company.com/fraud-model:v2.1.0
    imagePullPolicy: IfNotPresent

Deployment annotations for traceability: Add annotations to Deployments and pod templates to track what triggered the deployment:

metadata:
  annotations:
    kubernetes.io/change-cause: "fraud model v2.1.0 - adds synthetic fraud features"
    gitlab.com/pipeline-id: "pipeline-78901"
    gitlab.com/commit-sha: "a1b2c3d4"

Minimum replica count: For any production ML service, run at least 2 replicas. With 1 replica, a pod restart (node drain, OOMKill, rolling update) causes a complete outage for the duration of model loading. Two replicas provide the minimum redundancy for zero-downtime operations.

Common Mistakes

:::danger Not Configuring terminationGracePeriodSeconds
The default grace period is 30 seconds. If your model server needs 25 seconds to finish in-flight requests, the remaining 5 seconds must also cover SIGTERM delivery and any preStop hook, so requests can be cut off mid-inference. Set terminationGracePeriodSeconds to at least 2x your maximum expected request duration plus 10 seconds for shutdown overhead.

Model servers with streaming responses or long batch inference (15–30 seconds) need grace periods of 60–120 seconds.
:::
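
The sizing rule above (twice the longest expected request, plus a fixed 10-second overhead) is simple enough to encode, for example in a script that generates or lints pod specs. The function name and the rounding up to whole seconds are assumptions for illustration.

```python
import math


def recommended_grace_period(max_request_seconds: float, overhead_seconds: float = 10.0) -> int:
    """Lower bound for terminationGracePeriodSeconds: 2x the longest request plus overhead."""
    return math.ceil(2 * max_request_seconds + overhead_seconds)
```

For a server whose slowest request takes 25 seconds this gives 60, which is already double the Kubernetes default of 30.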

:::warning Using Deployment for Stateful Workloads
Deployments are designed for stateless workloads. If your model server maintains in-memory state (user session data, per-user model personalization caches) that cannot be reconstructed from external storage, a Deployment rolling update will lose that state when old pods are terminated. Use StatefulSets for stateful ML services, or better, externalize all state to Redis or a database so the serving layer remains truly stateless.
:::

:::warning Hard Pod Anti-Affinity Blocking Scheduling
requiredDuringSchedulingIgnoredDuringExecution anti-affinity is a hard constraint. If you have 3 replicas with hard node anti-affinity but only 2 nodes in the cluster, the third pod stays Pending until a third eligible node joins the cluster. A rolling update with maxSurge can hit the same wall, because the surge pod carries the same label and also needs a node of its own. Use preferredDuringSchedulingIgnoredDuringExecution for soft constraints that guide scheduling without blocking it.
:::

Interview Q&A

Q1: Walk through exactly what happens, step by step, when you run kubectl apply -f deployment.yaml with a new image tag.

The client sends the manifest to the API server, which validates it and stores the new desired state in etcd. The Deployment controller notices the difference between desired state (new image) and actual state (old image), and begins a rolling update. It creates a new ReplicaSet with the new image spec and scales it up one pod at a time (respecting maxSurge). For each new pod: the scheduler picks a node with sufficient resources and constraints satisfied, the kubelet on that node pulls the image and starts the container, init containers run to completion, the main container starts, and Kubernetes waits for the readiness probe to pass. Only after a new pod is ready does the controller scale down the old ReplicaSet by one pod (respecting maxUnavailable). The old pod receives SIGTERM, handles graceful shutdown within terminationGracePeriodSeconds, and exits. This cycle repeats until all pods run the new image.

Q2: What is a headless Service and when would you use one for ML?

A headless Service sets clusterIP: None, which means it has no virtual IP. DNS queries for the Service name return the individual IP addresses of all matching pods, instead of a single load-balanced VIP. This is used when clients need to connect to specific pod instances rather than any arbitrary replica. In ML, the main use case is distributed training: PyTorch's torchrun needs every worker to reach the master worker's specific IP for the rendezvous protocol. With a headless Service, each pod can also get its own DNS name of the form <pod-hostname>.<service-name> when it sets hostname and subdomain. MASTER_ADDR should be set to the master pod's per-pod name rather than to the bare Service name, because the bare name resolves to all pods, not only the master.

Q3: How does Service endpoint selection interact with rolling updates, and what can go wrong?

The Endpoints controller adds a pod IP to a Service's endpoint list only once the pod's readiness probe passes. During a rolling update, new pods that are Pending, or Running but not yet Ready, receive no traffic, and old pods stay in the endpoint list until they are deleted. This keeps traffic flowing. Two things can go wrong. First, if terminationGracePeriodSeconds is too short, old pods are killed with SIGKILL before their in-flight requests finish. Second, when a pod is deleted, its removal from Endpoints and the delivery of SIGTERM happen in parallel, and each node's proxy rules update asynchronously, so new requests can still reach a pod for a short time after it has received SIGTERM. The fix: on SIGTERM, keep serving for a few seconds (or add a preStop sleep) so the endpoint removal can propagate, fail the readiness endpoint with 503, then drain in-flight requests before exiting.

Q4: Describe the sidecar container pattern for ML and give a concrete example.

A sidecar is a secondary container in the same pod that extends the main container's functionality without modifying its code. Both containers share the pod's network namespace (communicate via localhost) and can share storage volumes. For ML, common sidecars include: (1) a log shipping sidecar that tails model server logs and sends them to a central log aggregation system like Loki, (2) a feature cache sidecar that maintains a local Redis-like cache of precomputed user features refreshed periodically, reducing feature retrieval latency on the hot path, (3) an OpenTelemetry collector sidecar that receives traces from the model server and batches them to Jaeger or Tempo. The sidecar pattern keeps the model server image focused on inference while adding cross-cutting capabilities without code changes.

Q5: What are topology spread constraints and how do they improve on pod anti-affinity for ML serving?

Pod anti-affinity defines a relationship between pods: "don't schedule near pods with these labels." It works well for simple cases (one replica per node) but becomes awkward for larger counts. Topology spread constraints define a distribution goal: "spread pods as evenly as possible across zones/nodes, with at most N difference between the most-loaded and least-loaded domain." Take a 6-replica ML serving deployment across 3 AZs. Hard anti-affinity on the zone key allows at most one pod per AZ, so three of the six replicas stay Pending. Soft anti-affinity lets all six schedule but gives no guarantee of evenness, so a 3/2/1 distribution is possible if one AZ starts out full. Topology spread constraints say "keep the skew at most 1", so the scheduler aims for 2/2/2. They also support whenUnsatisfiable: ScheduleAnyway for soft preferences that don't block scheduling when perfect distribution is impossible.
