Kubernetes Fundamentals for ML Engineers

The Day You Can No Longer Avoid K8s

It's your first week at a new ML engineering role. Your manager sends a Slack message: "Hey - your first task is to deploy the new fraud detection model. It's already containerized, just needs to go to prod. Here's the cluster access." She pastes a kubeconfig file and a half-finished YAML manifest.

You stare at the YAML. There are words you recognize - image, env, port - and words that look like they come from a different dimension: livenessProbe, resources.requests.memory, volumeMounts, tolerations. You've been training models for three years. You've never needed to know what a DaemonSet is.

The uncomfortable truth about 2026: nearly every major ML platform runs on or integrates with Kubernetes. Kubeflow, Seldon, and KServe are K8s-native, and managed platforms such as Vertex AI, SageMaker, and Azure ML either build on it or plug into it. If you work at a company with more than one ML engineer, someone has almost certainly already Kubernetes-ified your infrastructure. You need to be able to read the manifests, write new ones, and debug when things go wrong. You don't need to be a K8s platform engineer, but you do need to be fluent.

This lesson gives you exactly what you need to get that model deployed, and more importantly, to understand why each piece of the YAML exists. We'll cover the twenty percent of Kubernetes that you'll use ninety percent of the time as an ML engineer: pods, deployments, services, namespaces, resource requests, health probes, ConfigMaps, Secrets, and persistent volumes. That is the minimum viable K8s knowledge for a productive ML engineer.

Why Kubernetes Exists

Before Kubernetes, deploying a service to a fleet of servers was manual and fragile. You SSHed into a server, ran Docker commands, and hoped nothing crashed. If the server died, you SSHed into another and repeated the process. Scaling meant adding servers by hand. Multiple teams shared the same machines with no isolation - one team's memory leak could take down everyone else.

Google had been running containers at scale internally since 2003 (using a system called Borg), processing billions of container launches per week. In 2014, they extracted the core ideas from Borg and open-sourced them as Kubernetes. The CNCF (Cloud Native Computing Foundation) took ownership in 2016, and the ecosystem exploded.

The core insight: describe what you want, not what to do. Instead of "SSH to server-42 and run this Docker command," you say "I want 3 replicas of this container with 4 CPUs and 16 GB RAM, always running, with traffic load-balanced between them." Kubernetes figures out which servers have capacity, schedules the containers there, monitors their health, and restarts them if they crash. This desired-state model is the foundation of every Kubernetes concept.
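To make the desired-state idea concrete, here is a toy reconciliation step in Python. It is only a sketch of the pattern, not how the real controllers are implemented; the function name and the pod names are made up for illustration.

```python
def reconcile(desired_replicas, running_pods):
    """Return the actions that move observed state toward desired state."""
    diff = desired_replicas - len(running_pods)
    if diff > 0:
        # Too few pods are running: create the missing ones.
        return [f"create pod-{i}" for i in range(diff)]
    if diff < 0:
        # Too many pods are running: delete the surplus.
        return [f"delete {name}" for name in running_pods[:-diff]]
    # Observed state already matches desired state.
    return []


# A controller would run this in a loop, re-observing the cluster each time.
print(reconcile(3, ["pod-a"]))
print(reconcile(3, ["pod-a", "pod-b", "pod-c"]))
```

The key property is that the function never needs to know how the cluster got into its current state. It only compares what exists with what was declared.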

For ML specifically, this matters because:

  • Training jobs need heterogeneous resources (GPUs, high-memory nodes) without manual server selection
  • Model serving needs to recover automatically from crashes without 3am pages
  • Multiple teams need to share expensive GPU hardware without one team starving others
  • Experiments need reproducible environments - Docker containers plus declarative manifests equal identical environments everywhere

Architecture: The Two Sides of a Cluster

The control plane is the brain - it stores desired state (etcd), accepts new state (API server), decides where to run things (scheduler), and continuously reconciles reality toward desired state (controller manager). The worker nodes are the muscle - they actually run your containers.

As an ML engineer, you interact almost exclusively with the API server through kubectl (apply, get, describe, logs). The rest is Kubernetes machinery working on your behalf.

Core Primitive 1: Pod

A Pod is the smallest deployable unit in Kubernetes. It wraps one or more containers that share a network namespace (same IP address, can reach each other via localhost) and optional shared storage. In practice, most ML workloads use one container per pod.

apiVersion: v1
kind: Pod
metadata:
  name: fraud-model-v2
  namespace: ml-prod
  labels:
    app: fraud-model
    version: "v2.1.0"
    team: risk
spec:
  containers:
    - name: model-server
      image: registry.company.com/fraud-model:v2.1.0
      ports:
        - containerPort: 8080

You almost never create naked Pods in production. Pods are mortal - if the node running the pod dies, the pod is gone and nothing recreates it. That is what Deployments are for.

But understanding Pods is essential because every higher-level abstraction (Deployment, Job, CronJob, PyTorchJob) creates Pods under the hood. When you debug a failed training job, you are reading Pod logs and events.

# Essential pod commands every ML engineer needs
kubectl get pods -n ml-prod # list all pods
kubectl get pods -n ml-prod -l app=fraud-model # filter by label
kubectl describe pod fraud-model-v2 -n ml-prod # full details + events
kubectl logs fraud-model-v2 -n ml-prod # stdout/stderr
kubectl logs fraud-model-v2 -n ml-prod --previous # logs from crashed container
kubectl exec -it fraud-model-v2 -n ml-prod -- bash # shell into pod
kubectl top pod fraud-model-v2 -n ml-prod # live CPU/memory usage
kubectl get events -n ml-prod --sort-by=.metadata.creationTimestamp # cluster events

Core Primitive 2: Deployment

A Deployment declares desired state: "I want N replicas of this Pod, always running, with this exact spec." The Deployment controller continuously reconciles reality toward that state - if a pod crashes, it creates a new one. If you update the image tag, it rolls out the new version without downtime.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: fraud-model
  namespace: ml-prod
spec:
  replicas: 3 # always maintain 3 running pods
  selector:
    matchLabels:
      app: fraud-model # manages pods with this label
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1 # can have 4 pods during update
      maxUnavailable: 0 # never drop below 3 serving pods
  template:
    metadata:
      labels:
        app: fraud-model
        version: "v2.1.0"
    spec:
      containers:
        - name: model-server
          image: registry.company.com/fraud-model:v2.1.0
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: "2"
              memory: "8Gi"
            limits:
              cpu: "4"
              memory: "16Gi"

The rolling update strategy is critical for ML serving. maxUnavailable: 0 ensures you never have fewer serving pods than requested during a deployment - essential when your model takes 30 seconds to load into GPU memory and warm up. If you set maxUnavailable: 1, a 3-replica deployment would drop to 2 pods while the new pod loads, reducing serving capacity by 33%.
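The arithmetic behind maxSurge and maxUnavailable is easy to check by hand. The sketch below assumes both settings are given as absolute pod counts (Kubernetes also accepts percentages, which this helper does not handle); the function name is illustrative.

```python
def rollout_bounds(replicas, max_surge, max_unavailable):
    """Return (minimum available pods, maximum total pods) during a rolling update."""
    min_available = replicas - max_unavailable
    max_total = replicas + max_surge
    return min_available, max_total


# The manifest above: capacity never dips below 3, and at most 4 pods exist at once.
print(rollout_bounds(3, 1, 0))
# With maxUnavailable: 1, capacity can dip to 2 of 3 pods during the rollout.
print(rollout_bounds(3, 1, 1))
```

Note that maxUnavailable: 0 only works if the cluster has room for the extra surge pod while the old ones are still running.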

# Deployment management commands
kubectl apply -f deployment.yaml # create or update deployment
kubectl rollout status deployment/fraud-model -n ml-prod # watch rollout progress
kubectl rollout history deployment/fraud-model -n ml-prod # view revision history
kubectl rollout undo deployment/fraud-model -n ml-prod # rollback to previous version
kubectl set image deployment/fraud-model model-server=registry.company.com/fraud-model:v2.2.0 -n ml-prod # update the image in place

Resource Requests and Limits - The Most Important Config for ML

This is where most ML engineers get tripped up. Kubernetes has two separate resource concepts:

  • Request: the guaranteed minimum. The scheduler uses requests to find a node with sufficient unreserved capacity. The pod is placed only on nodes where allocatable - already_requested >= this_pod_request.
  • Limit: the hard ceiling. If a container tries to use more memory than its limit, it gets OOMKilled (exit code 137). If it exceeds its CPU limit, it gets CPU-throttled.

resources:
  requests:
    cpu: "2" # 2 CPU cores guaranteed at scheduling time
    memory: "8Gi" # 8 GB RAM guaranteed
    nvidia.com/gpu: 1 # 1 GPU (requires NVIDIA device plugin - covered in lesson 03)
  limits:
    cpu: "4" # can burst to 4 cores if node has spare capacity
    memory: "16Gi" # hard limit - OOMKilled if exceeded
    nvidia.com/gpu: 1 # GPU limits must always equal requests

:::warning GPU Request Rules For GPU resources (nvidia.com/gpu), the limit must always equal the request. Kubernetes cannot partially allocate a GPU or throttle GPU usage the way it throttles CPU. If you request 1 GPU, your container gets exactly 1 GPU for its exclusive use. Set request and limit to the same value or the API server will reject the manifest. :::

Why Requests Matter for Scheduling

Imagine a node with 32 GB of allocatable RAM. Four pods each request 8 GB. The scheduler considers the node "full" - no new pods requesting any RAM will be scheduled there, even if actual usage is 15 GB total. This is correct and intentional: requests are guarantees. If all four pods use their full requested memory at the same time, the node can still serve them, because total requests never exceed allocatable capacity. Usage above the request, up to the limit, is not guaranteed.
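The scheduling rule for this example can be written as a single comparison. This is a simplified sketch for one resource (memory) that treats the full 32 GB as available to pods; the names are illustrative, and the real scheduler applies many more filters.

```python
GIB = 1024 ** 3


def fits(allocatable, existing_requests, pod_request):
    """True if the node still has enough unreserved capacity for the new pod."""
    return allocatable - sum(existing_requests) >= pod_request


node = 32 * GIB

# Three pods at 8 GiB each leave room for a fourth.
print(fits(node, [8 * GIB] * 3, 8 * GIB))
# Four pods at 8 GiB each reserve everything, whatever the actual usage is.
print(fits(node, [8 * GIB] * 4, 1 * GIB))
```

Actual usage never appears in the function. That is the point: scheduling decisions are made on requests alone.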

The practical implication: set requests to what your workload actually needs under normal load, not the maximum it could ever use. Set limits to the maximum it should ever use. The gap between request and limit is "burstable" capacity that multiple pods share when they don't all need it simultaneously.

The OOMKill Problem in ML

ML models have spiky memory profiles. A PyTorch model might use 4 GB at idle, 12 GB during a large-batch inference request, and 14 GB during concurrent requests with dynamic padding. If your memory limit is 10 GB, the pod gets OOMKilled mid-inference. The symptom:

kubectl describe pod fraud-model-abc123 -n ml-prod
# Containers:
# model-server:
# Last State: Terminated
# Reason: OOMKilled
# Exit Code: 137
# Started: Sat, 14 Mar 2026 14:23:11 +0000
# Finished: Sat, 14 Mar 2026 14:31:44 +0000
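Exit code 137 is not arbitrary. By convention, an exit code above 128 means the process was killed by signal number (code - 128), and 137 - 128 = 9 is SIGKILL, which the kernel's OOM killer sends. A small helper, assuming a Linux or macOS Python runtime and a valid signal number, decodes it:

```python
import signal


def describe_exit_code(code):
    """Translate a container exit code into a human-readable cause."""
    if code > 128:
        # Codes above 128 conventionally encode 128 + signal number.
        return f"killed by {signal.Signals(code - 128).name}"
    return f"exited with status {code}"


print(describe_exit_code(137))
print(describe_exit_code(143))
```

SIGKILL alone does not prove an OOMKill, which is why you also check Reason: OOMKilled in the describe output.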

Prevention:

  1. Profile memory usage at the maximum expected batch size before setting limits
  2. Set limits to 1.5–2x your measured peak usage for a safety margin
  3. Set requests to 60–80% of limits so pods land on nodes with real headroom
  4. For inference servers, set OMP_NUM_THREADS and MKL_NUM_THREADS to match the container's CPU allocation, which avoids oversubscribed thread pools and the extra per-thread buffers they allocate
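Steps 2 and 3 above reduce to two multiplications. The sketch below assumes a 1.5x safety factor and a 70% request-to-limit ratio, both picked from the ranges in the list; the function name and defaults are illustrative, not a standard.

```python
def suggest_memory(peak_gib, limit_factor=1.5, request_ratio=0.7):
    """Derive a memory request and limit (in GiB) from a measured peak."""
    limit = peak_gib * limit_factor      # headroom above the measured peak
    request = limit * request_ratio      # what the scheduler reserves
    return {"request_gib": round(request, 1), "limit_gib": round(limit, 1)}


# The 14 GB concurrent-request peak from the example above.
print(suggest_memory(14))
```

Treat the result as a starting point and re-measure after deployment, since real traffic rarely matches the load test exactly.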

Core Primitive 3: Service

Pods have ephemeral IP addresses - they change every time a pod is recreated or rescheduled. A Service provides a stable DNS name and virtual IP that load-balances traffic across all healthy pods matching a label selector.

apiVersion: v1
kind: Service
metadata:
  name: fraud-model-svc
  namespace: ml-prod
spec:
  selector:
    app: fraud-model # routes to any pod with this label
  ports:
    - port: 80 # service port (what callers use)
      targetPort: 8080 # container port (what the app listens on)
      name: http
  type: ClusterIP # internal only - default for ML microservices
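The selector logic is simple enough to model in a few lines. This sketch treats pods as plain dictionaries with hypothetical ip, ready, and labels fields; it mirrors the rule that a pod receives traffic only if every selector label matches and the pod is ready.

```python
def select_endpoints(selector, pods):
    """Return the IPs of ready pods whose labels include every selector pair."""
    return [
        pod["ip"]
        for pod in pods
        if pod["ready"]
        and all(pod["labels"].get(key) == value for key, value in selector.items())
    ]


pods = [
    {"ip": "10.0.0.1", "ready": True, "labels": {"app": "fraud-model", "version": "v2.1.0"}},
    {"ip": "10.0.0.2", "ready": False, "labels": {"app": "fraud-model"}},  # still loading
    {"ip": "10.0.0.3", "ready": True, "labels": {"app": "credit-model"}},  # wrong label
]
print(select_endpoints({"app": "fraud-model"}, pods))
```

Extra labels on a pod (such as version) do not prevent a match. Only the labels named in the selector are compared.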

Service types:

  • ClusterIP (default): only reachable inside the cluster. Use for internal ML services - feature stores, embedding servers, preprocessing pipelines.
  • NodePort: opens a static port on every node's external IP. Used for debugging or when you need external access without a cloud load balancer.
  • LoadBalancer: provisions a cloud load balancer (for example an AWS NLB or a GCP load balancer). Use for ML APIs that external clients call.

DNS resolution inside the cluster: fraud-model-svc.ml-prod.svc.cluster.local always resolves to the Service. Pods in the same namespace can use just fraud-model-svc. From another namespace, use fraud-model-svc.ml-prod.
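The three name forms follow one pattern. The helper below builds them, assuming the default cluster domain cluster.local (some clusters are configured with a different one); the function name is illustrative.

```python
def service_dns_names(service, namespace, cluster_domain="cluster.local"):
    """Return the DNS names a client pod can use to reach a Service."""
    return {
        "same_namespace": service,
        "cross_namespace": f"{service}.{namespace}",
        "fully_qualified": f"{service}.{namespace}.svc.{cluster_domain}",
    }


names = service_dns_names("fraud-model-svc", "ml-prod")
print(names["fully_qualified"])
```

When in doubt, use the fully qualified form in configuration. It works from any namespace and makes the target explicit.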

Namespaces - Team Isolation in Shared Clusters

A namespace is a virtual cluster within a cluster. It provides name isolation (two teams can both have a Deployment named model-server), resource quotas (limit total CPU/GPU/memory per team), RBAC scope (ML engineer Alice can only access team-fraud), and network policy boundaries.

apiVersion: v1
kind: Namespace
metadata:
  name: team-fraud
  labels:
    team: fraud
    cost-center: risk-engineering
    env: production

In multi-team ML platforms, a common pattern is one namespace per team per environment:

  • team-fraud-dev, team-fraud-staging, team-fraud-prod
  • team-credit-dev, team-credit-staging, team-credit-prod

# Namespace management
kubectl get namespaces
kubectl create namespace team-fraud
kubectl apply -f manifest.yaml -n team-fraud # explicit namespace
kubectl config set-context --current --namespace=team-fraud # set default

ConfigMaps and Secrets

ML models have configuration that varies across environments: model paths, feature server URLs, batch sizes, decision thresholds. Never bake environment-specific values into Docker images - you need the same image to run in dev, staging, and prod with different config.

# ConfigMap: non-sensitive configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: fraud-model-config
  namespace: ml-prod
data:
  MODEL_PATH: "/models/fraud/v2.1.0/model.pt"
  FEATURE_SERVER_URL: "http://feature-server-svc:8081"
  BATCH_SIZE: "32"
  DECISION_THRESHOLD: "0.75"
  LOG_LEVEL: "INFO"
---
# Secret: sensitive credentials (base64 encoded, not encrypted by default)
apiVersion: v1
kind: Secret
metadata:
  name: fraud-model-secrets
  namespace: ml-prod
type: Opaque
data:
  # base64 encode values: echo -n 'mytoken' | base64
  MODEL_REGISTRY_TOKEN: "dG9rZW4tdmFsdWU="
  REDIS_PASSWORD: "cmVkaXMtcGFzcw=="

Injecting into pods:

spec:
  containers:
    - name: model-server
      image: registry.company.com/fraud-model:v2.1.0
      # Inject all keys from ConfigMap and Secret as environment variables
      envFrom:
        - configMapRef:
            name: fraud-model-config
        - secretRef:
            name: fraud-model-secrets
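On the application side, every injected value arrives as a string, so the model server has to convert types itself. A minimal sketch of that step, using the key names from the ConfigMap above (the ModelConfig class and load_config function are illustrative, not part of any framework):

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    model_path: str
    feature_server_url: str
    batch_size: int
    decision_threshold: float
    log_level: str


def load_config(env=os.environ):
    """Build a typed config from environment variables; fail fast on missing keys."""
    return ModelConfig(
        model_path=env["MODEL_PATH"],
        feature_server_url=env["FEATURE_SERVER_URL"],
        batch_size=int(env["BATCH_SIZE"]),                  # "32" -> 32
        decision_threshold=float(env["DECISION_THRESHOLD"]),  # "0.75" -> 0.75
        log_level=env.get("LOG_LEVEL", "INFO"),
    )
```

Calling load_config() once at startup means a missing or malformed key crashes the pod immediately, which shows up in kubectl describe instead of surfacing later as a confusing inference error.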

:::tip ConfigMap as Mounted File For ML models that read a YAML or JSON config file instead of environment variables, mount the ConfigMap as a file:

volumeMounts:
  - name: config-volume
    mountPath: /app/config/model.yaml
    subPath: model.yaml # mount only this key as a file (the ConfigMap must define a model.yaml key)
volumes:
  - name: config-volume
    configMap:
      name: fraud-model-config

:::

:::danger Kubernetes Secrets Are Not Encrypted By Default Kubernetes Secrets are base64-encoded in etcd, not encrypted. Anyone with etcd read access or sufficient RBAC permissions can read secret values. For real production secrets (API keys, database passwords, model registry tokens), use External Secrets Operator with AWS Secrets Manager, GCP Secret Manager, or HashiCorp Vault. ESO keeps the source of truth in the external manager and syncs values into Kubernetes Secrets at runtime, so credentials stay out of your manifests and Git history and can be rotated centrally. The synced Secret objects still live in etcd, so enable encryption at rest and restrict Secret access with RBAC as well. :::
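To see why base64 offers no protection, decode the values from the Secret manifest above. No key is involved; the helper names below are illustrative.

```python
import base64


def decode_secret_data(data):
    """Decode every value in a Secret's data map back to plaintext."""
    return {key: base64.b64decode(value).decode("utf-8") for key, value in data.items()}


def encode_secret_value(value):
    """Encode a plaintext value the way the Secret manifest expects it."""
    return base64.b64encode(value.encode("utf-8")).decode("ascii")


secret_data = {
    "MODEL_REGISTRY_TOKEN": "dG9rZW4tdmFsdWU=",
    "REDIS_PASSWORD": "cmVkaXMtcGFzcw==",
}
print(decode_secret_data(secret_data))
```

Anyone who can read the manifest or run kubectl get secret -o yaml can do the same, which is why Secret manifests with real values should never be committed to Git.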

Liveness and Readiness Probes for ML Services

This is where ML workloads diverge most significantly from traditional web services. An ML model server might take 45 seconds to load a 4 GB model into GPU memory on startup, become unresponsive if the CUDA context gets corrupted, or need a warmup request before handling production traffic efficiently.

The two health check mechanisms that matter most for model servers (Kubernetes also offers a startupProbe for slow-starting containers):

  • Readiness probe: Is this pod ready to receive traffic? If it fails, the pod is removed from the Service endpoints (no traffic sent) but not restarted. This is what you want during model loading.
  • Liveness probe: Is this pod alive? If it fails, the kubelet restarts the container. This handles the corrupted-CUDA-context scenario where the process is running but unable to serve requests.

readinessProbe:
  httpGet:
    path: /health/ready # returns 200 only after model is loaded + warmed up
    port: 8080
  initialDelaySeconds: 60 # wait 60s before first check (covers model load time)
  periodSeconds: 10 # check every 10 seconds after that
  failureThreshold: 3 # 3 consecutive failures = not ready
  successThreshold: 1 # 1 success = ready to receive traffic
livenessProbe:
  httpGet:
    path: /health/live # returns 200 as long as process is responsive
    port: 8080
  initialDelaySeconds: 120 # longer delay - give model full time to start
  periodSeconds: 30 # check every 30 seconds
  failureThreshold: 3 # 3 consecutive failures = restart the container
  timeoutSeconds: 5 # probe must respond within 5 seconds

Implementing the health endpoints in FastAPI:

from fastapi import FastAPI, HTTPException
import torch
import logging

app = FastAPI()
logger = logging.getLogger(__name__)

model = None
model_loaded = False
warmup_complete = False


@app.on_event("startup")
async def startup():
    global model, model_loaded, warmup_complete

    logger.info("Loading model...")
    model = torch.jit.load("/models/fraud/v2.1.0/model.pt")
    model.eval()
    model_loaded = True
    logger.info("Model loaded.")

    # Warmup: run one forward pass to prime CUDA kernel caches.
    # The dummy shape and dtype are placeholders; match your model's real input.
    device = next(model.parameters()).device
    dummy = torch.zeros(1, 512, dtype=torch.long).to(device)
    with torch.no_grad():
        _ = model(dummy)
    warmup_complete = True
    logger.info("Warmup complete. Ready to serve.")


@app.get("/health/live")
def liveness():
    """Return 200 whenever the process can still answer requests."""
    return {"status": "alive"}


@app.get("/health/ready")
def readiness():
    """Return 200 only when model is loaded AND warmup is done."""
    if not model_loaded:
        raise HTTPException(status_code=503, detail="Model not loaded")
    if not warmup_complete:
        raise HTTPException(status_code=503, detail="Warmup not complete")
    return {"status": "ready", "model_loaded": True, "warmup_complete": True}

:::danger The initialDelaySeconds Trap Setting initialDelaySeconds too low is the most common ML probe misconfiguration. If your model takes 45 seconds to load but initialDelaySeconds: 30, Kubernetes starts probing before loading is complete. For the readiness probe this is mostly harmless: the pod simply stays out of the Service endpoints until a probe succeeds. For the liveness probe it is not: after failureThreshold consecutive failures, K8s kills and restarts the container before it finishes loading. This creates an infinite crash loop where the pod never starts successfully.

Measure actual model load time in your target environment. Set initialDelaySeconds to at least 1.5x that value. When in doubt, be generous - a slow rollout is better than a crash loop. :::
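You can estimate how much startup time a liveness configuration actually allows. The sketch below assumes the first probe fires at initialDelaySeconds and each following probe one periodSeconds later; it ignores timeoutSeconds and kubelet scheduling jitter, so treat the result as approximate. The function names are illustrative.

```python
def earliest_restart_seconds(initial_delay, period, failure_threshold):
    """Approximate seconds after container start before a failing liveness probe restarts it."""
    # First failure at initial_delay, then one more failure per period.
    return initial_delay + (failure_threshold - 1) * period


def liveness_is_safe(load_seconds, initial_delay, period, failure_threshold):
    """True if the model finishes loading before the restart deadline."""
    return earliest_restart_seconds(initial_delay, period, failure_threshold) > load_seconds


# Liveness settings from the probe config above: about 180 s of startup budget.
print(earliest_restart_seconds(120, 30, 3))
# A hypothetical too-tight config for a 45 s model load: restart at about 20 s.
print(liveness_is_safe(45, initial_delay=10, period=5, failure_threshold=3))
```

If the check fails, raise initialDelaySeconds or failureThreshold until the budget comfortably exceeds your measured load time.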

Persistent Volumes - Storing Checkpoints and Artifacts

Regular pod storage (emptyDir) lives and dies with the pod. ML training jobs need to save checkpoints that survive pod preemption. Persistent Volumes (PVs) are cluster-managed storage that outlasts individual pods.

# PersistentVolumeClaim - request durable storage
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-checkpoints
  namespace: team-fraud
spec:
  accessModes:
    - ReadWriteOnce # mounted read-write by a single node at a time
  storageClassName: fast-ssd # cluster's SSD storage class
  resources:
    requests:
      storage: 200Gi # 200 GB for checkpoints

Mount the PVC in a training pod:

spec:
  containers:
    - name: trainer
      image: registry.company.com/fraud-trainer:v3.0
      volumeMounts:
        - name: checkpoints
          mountPath: /checkpoints
  volumes:
    - name: checkpoints
      persistentVolumeClaim:
        claimName: training-checkpoints
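A persistent volume only helps if the training code writes checkpoints in a way that survives being killed mid-write. One common approach, sketched here with the standard library, is to write to a temporary file and atomically rename it. JSON stands in for the real serialization (a PyTorch job would typically use torch.save), and the function and file names are assumptions for illustration.

```python
import json
import os
import tempfile


def save_checkpoint(state, directory, name="latest.json"):
    """Write a checkpoint atomically so a preempted pod never leaves a half-written file."""
    os.makedirs(directory, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
        handle.flush()
        os.fsync(handle.fileno())  # make sure bytes reach the volume before the rename
    final_path = os.path.join(directory, name)
    os.replace(tmp_path, final_path)  # atomic within a single filesystem on POSIX
    return final_path


def load_checkpoint(directory, name="latest.json"):
    """Return the saved state, or None if no checkpoint exists yet."""
    path = os.path.join(directory, name)
    if not os.path.exists(path):
        return None
    with open(path) as handle:
        return json.load(handle)
```

At startup the trainer calls load_checkpoint("/checkpoints") and resumes from the stored step if it gets a result, or starts from scratch if it gets None.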

For distributed training where multiple pods need to read/write the same checkpoint directory simultaneously, use ReadWriteMany with network file storage:

spec:
  accessModes:
    - ReadWriteMany # multiple pods can mount simultaneously
  storageClassName: efs # AWS EFS, GCP Filestore, or Azure Files
  resources:
    requests:
      storage: 500Gi

A Complete Production ML Manifest

Putting everything together:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: fraud-model
  namespace: ml-prod
  annotations:
    model-version: "v2.1.0"
    deployed-by: "gitlab-ci"
    pipeline-id: "12345"
spec:
  replicas: 3
  selector:
    matchLabels:
      app: fraud-model
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: fraud-model
        version: "v2.1.0"
        team: risk
    spec:
      imagePullSecrets:
        - name: registry-credentials
      containers:
        - name: model-server
          image: registry.company.com/fraud-model:v2.1.0
          ports:
            - containerPort: 8080
              name: http
          envFrom:
            - configMapRef:
                name: fraud-model-config
            - secretRef:
                name: fraud-model-secrets
          resources:
            requests:
              cpu: "2"
              memory: "8Gi"
            limits:
              cpu: "4"
              memory: "16Gi"
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 60
            periodSeconds: 10
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 120
            periodSeconds: 30
            failureThreshold: 3
          volumeMounts:
            - name: model-cache
              mountPath: /models
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-artifact-cache # replicas run on different nodes, so this PVC needs ReadWriteMany or ReadOnlyMany
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values: ["fraud-model"]
              topologyKey: kubernetes.io/hostname # spread across nodes (the surge pod during a rollout needs a 4th node)
---
apiVersion: v1
kind: Service
metadata:
  name: fraud-model-svc
  namespace: ml-prod
spec:
  selector:
    app: fraud-model
  ports:
    - port: 80
      targetPort: 8080
  type: ClusterIP
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: fraud-model-pdb
  namespace: ml-prod
spec:
  minAvailable: 2 # always keep at least 2 pods during node drains
  selector:
    matchLabels:
      app: fraud-model

Production Engineering Notes

Never use latest image tags. latest means "whatever was most recently pushed" - it breaks reproducibility and makes rollbacks impossible to reason about. Pin to semantic version tags or, better, immutable content digests:

image: registry.company.com/fraud-model@sha256:a1b2c3d4... # immutable
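A check like this is easy to automate in CI. The sketch below classifies an image reference as digest-pinned, tag-pinned, or floating; it is a simplified parser written for illustration, not a full implementation of the image reference grammar, and the function name is made up.

```python
def image_pinning(image):
    """Classify an image reference as 'digest', 'tag', or 'floating'."""
    if "@sha256:" in image:
        return "digest"
    # A tag follows the last ':' in the final path segment.
    # Splitting on '/' first avoids mistaking a registry port for a tag.
    last_segment = image.rsplit("/", 1)[-1]
    if ":" in last_segment:
        tag = last_segment.rsplit(":", 1)[1]
        return "floating" if tag == "latest" else "tag"
    # No tag at all means the runtime falls back to :latest.
    return "floating"


print(image_pinning("registry.company.com/fraud-model:v2.1.0"))
print(image_pinning("registry.company.com/fraud-model:latest"))
```

A pipeline step could fail the build whenever any image in a manifest comes back as floating.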

Use pod anti-affinity to spread replicas. Without it, the scheduler might place all 3 replicas on the same node. If that node fails, you lose all serving capacity simultaneously. The podAntiAffinity spec above with kubernetes.io/hostname ensures each replica runs on a different node.

Add PodDisruptionBudgets. Without a PDB, a cluster administrator draining a node for maintenance can evict all your pods simultaneously. A PDB with minAvailable: 2 means the drain process will wait for a new pod to start before evicting the next one - maintaining minimum serving capacity throughout.
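The budget itself is a subtraction. This sketch covers only the minAvailable form with an absolute count (PDBs also support maxUnavailable and percentages); the function name is illustrative.

```python
def allowed_disruptions(ready_pods, min_available):
    """How many pods a voluntary disruption may evict right now."""
    return max(0, ready_pods - min_available)


# 3 ready replicas with minAvailable: 2 -> one eviction is allowed.
print(allowed_disruptions(3, 2))
# While the replacement is still loading its model, only 2 are ready -> drain must wait.
print(allowed_disruptions(2, 2))
```

Because the count is based on ready pods, a slow-loading model automatically slows the drain down. That is the desired behavior.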

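Set resource requests based on measurement, not intuition. Run your model server under realistic load and record CPU and memory usage over time. kubectl top shows only the current value, so sample it repeatedly during the test or pull the data from your metrics system. Then set requests to p75 usage and limits to at least p99 usage, keeping the safety margin from the OOMKill section for memory. Revisit quarterly as traffic patterns change.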
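Once you have a list of usage samples, the percentiles take a few lines to compute. The sketch below uses the nearest-rank definition, one of several common percentile conventions; the sample values are synthetic and the function name is illustrative.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Synthetic memory samples in MiB, e.g. collected once per second during a load test.
samples = list(range(1, 101))
request_mib = percentile(samples, 75)
limit_floor_mib = percentile(samples, 99)
print(request_mib, limit_floor_mib)
```

With real data, collect enough samples to cover peak traffic. A p99 computed from a quiet ten-minute window says little about the burst that triggers an OOMKill.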

Common Mistakes

:::danger Setting Limits Without Profiling Setting memory limit: 4Gi because "that seems like enough" without profiling actual usage is the most common ML K8s mistake. PyTorch models often use 2–3x more memory under load than at idle due to batch processing, intermediate tensor allocation, and CUDA memory fragmentation. Always profile with realistic batch sizes and concurrency before setting limits.

Use kubectl top pod --containers during a load test to observe peak usage. :::

:::warning Missing Readiness Probes on ML Services Without a readiness probe, Kubernetes routes traffic to new pods the instant the container process starts - before the model is loaded into memory. The first requests after every pod restart will hit an unloaded model and fail. This problem is easy to miss in normal operations but becomes visible during rolling updates, when every pod is replaced. Always implement readiness probes with an initialDelaySeconds value that accounts for model loading time. :::

:::warning Hardcoding Environment Config in Docker Images Baking FEATURE_SERVER_URL=http://feature-store-prod.company.com into the image means you cannot use the same image in staging. Use ConfigMaps for all environment-specific configuration. The same image tag should run identically in dev, staging, and prod - only the ConfigMap values differ. This is a core tenet of the twelve-factor app methodology. :::

Interview Q&A

Q1: What is the difference between a Kubernetes resource request and a resource limit? How does each affect pod behavior?

A request is a scheduling guarantee - the scheduler only places a pod on a node where the available capacity (total node capacity minus sum of existing requests) is at least as large as the pod's request. The request does not cap usage; a pod can use more than its requested amount if the node has spare capacity. A limit is a hard ceiling enforced at runtime: exceeding the memory limit triggers OOMKill (exit code 137); exceeding the CPU limit causes throttling (the process runs slower, not killed). For ML workloads, set requests based on typical usage to enable efficient scheduling, and limits based on the maximum safe usage to prevent one pod from starving others on the same node.

Q2: Why would an ML serving deployment use separate liveness and readiness probes?

They serve distinct purposes. A readiness probe answers "should traffic be sent to this pod?" - it should fail during model loading (preventing requests from hitting an unloaded model) and pass once the model is loaded and warmed up. A liveness probe answers "should this pod be restarted?" - it should only fail when the process is in an unrecoverable state, such as a deadlocked thread or corrupted CUDA context. Using only a liveness probe with a short initialDelaySeconds would cause Kubernetes to restart the pod repeatedly during the model loading phase, creating a crash loop. Using only a readiness probe would leave zombie pods in a permanently "not ready" state without ever restarting them.

Q3: A production ML serving pod is being OOMKilled intermittently during peak traffic. Walk through your diagnosis.

First, confirm OOMKill with kubectl describe pod <name>: look for Reason: OOMKilled and Exit Code: 137. Run kubectl top pod --containers during a load test to observe the memory trend. Check whether the OOMKill happens at startup (model loading spike), at steady state (limit too low even for idle usage), or only during bursts (batch size too large). Keep in mind that the container memory limit covers host RAM, not GPU memory: use a host-side profiler such as memory_profiler to find the allocation site, and torch.cuda.memory_stats() only to rule out a separate GPU out-of-memory problem. Common culprits: dynamic batch padding allocating large tensors, gradient retention from missing torch.no_grad() in inference, or memory growth over time from fragmentation. Fix by increasing the limit, reducing batch size, adding explicit gc.collect() calls (torch.cuda.empty_cache() releases only cached GPU memory), or implementing response streaming.

Q4: How do Kubernetes Services find the pods they should route traffic to?

Services use label selectors - they continuously watch the API server for pods whose labels match the selector. When a pod becomes ready (its readiness probe passes) and its labels match the Service selector, the endpoints controller adds that pod's IP and port to the Service's Endpoints object. When a pod becomes unready or is deleted, it is removed from Endpoints. kube-proxy on each node watches the Endpoints object and programs iptables or IPVS rules to route traffic from the Service's virtual IP to one of the current endpoint IPs. This is why label consistency matters: a pod without the correct labels will never receive traffic from the Service, even if it's running and healthy.

Q5: What are Kubernetes Secrets and what are their security limitations?

Secrets store sensitive data (passwords, API tokens, certificates) and are accessible to pods via environment variables or volume mounts, separate from non-sensitive ConfigMaps. The key limitation: by default, Secret values are only base64-encoded in etcd, not encrypted. Anyone with etcd read access, certain RBAC permissions (kubectl get secret), or access to a pod that mounts the secret can read the value. Mitigation strategies: enable etcd encryption-at-rest in the cluster configuration, restrict access to Secrets via RBAC (only the service accounts that need a secret should have get permission on it), and use the External Secrets Operator with a dedicated secrets manager (AWS Secrets Manager, HashiCorp Vault) so the source of truth lives outside the cluster and plaintext values never appear in manifests or Git. The synced Secret is still stored in etcd, which is why encryption at rest remains necessary.

Q6: Explain PodDisruptionBudgets and give a concrete example of why an ML team needs one.

A PodDisruptionBudget limits the number of pods of a given selector that can be simultaneously unavailable due to voluntary disruptions - cluster upgrades, node drains, Descheduler evictions. Without a PDB, a cluster admin draining a node for maintenance could evict all replicas simultaneously, causing a complete outage. With minAvailable: 2 on a 3-replica fraud model deployment, the drain process can evict at most 1 pod at a time and must wait for the replacement to become ready before evicting the next. For ML serving, this is especially important because model loading takes 30–60 seconds - if 2 pods are evicted simultaneously, your serving capacity drops to 1 pod for the entire model loading duration. A PDB prevents this by serializing evictions.

© 2026 EngineersOfAI. All rights reserved.