The minimum Kubernetes knowledge every ML engineer needs to be productive - pods, deployments, services, resource requests, GPU allocation, probes, and persistent volumes.

Kubernetes Fundamentals for ML Engineers

The Day You Can No Longer Avoid K8s

It's your first week at a new ML engineering role. Your manager sends a Slack message: "Hey - your first task is to deploy the new fraud detection model. It's already containerized, just needs to go to prod. Here's the cluster access." She pastes a kubeconfig file and a half-finished YAML manifest.

You stare at the YAML. There are words you recognize - image, env, port - and words that look like they come from a different dimension: livenessProbe, resources.requests.memory, volumeMount, tolerations. You've been training models for three years. You've never needed to know what a DaemonSet is.

The uncomfortable truth about 2026: most major ML platforms either run on Kubernetes or integrate with it. Kubeflow, Seldon, and KServe are K8s-native, and managed platforms such as Vertex AI, SageMaker, and Azure ML offer Kubernetes integrations. If you work at a company with more than one ML engineer, someone has almost certainly already Kubernetes-ified your infrastructure. You need to be able to read the manifests, write new ones, and debug when things go wrong. You don't need to be a K8s platform engineer, but you do need to be fluent.

This lesson gives you exactly what you need to get that model deployed, and more importantly, to understand why each piece of the YAML exists. We'll cover the twenty percent of Kubernetes that you'll use ninety percent of the time as an ML engineer: pods, deployments, services, namespaces, resource requests, health probes, ConfigMaps, Secrets, and persistent volumes. That is the minimum viable K8s knowledge for a productive ML engineer.

Why Kubernetes Exists

Before Kubernetes, deploying a service to a fleet of servers was manual and fragile. You SSHed into a server, ran Docker commands, and hoped nothing crashed. If the server died, you SSHed into another and repeated the process. Scaling meant adding servers by hand. Multiple teams shared the same machines with no isolation - one team's memory leak could take down everyone else.

Google had been running containers at scale internally since around 2003 using a system called Borg, launching billions of containers per week. In 2014, Google open-sourced Kubernetes, which builds on the core ideas from Borg. The project was donated to the newly formed CNCF (Cloud Native Computing Foundation) in 2015, and the ecosystem exploded.

The core insight: describe what you want, not what to do. Instead of "SSH to server-42 and run this Docker command," you say "I want 3 replicas of this container with 4 CPUs and 16 GB RAM, always running, with traffic load-balanced between them." Kubernetes figures out which servers have capacity, schedules the containers there, monitors their health, and restarts them if they crash. This desired-state model is the foundation of every Kubernetes concept.

For ML specifically, this matters because:

Training jobs need heterogeneous resources (GPUs, high-memory nodes) without manual server selection
Model serving needs to recover automatically from crashes without 3am pages
Multiple teams need to share expensive GPU hardware without one team starving others
Experiments need reproducible environments - Docker containers plus declarative manifests equal identical environments everywhere

Architecture: The Two Sides of a Cluster

The control plane is the brain - it stores desired state (etcd), accepts new state (API server), decides where to run things (scheduler), and continuously reconciles reality toward desired state (controller manager). The worker nodes are the muscle - they actually run your containers.

As an ML engineer, you interact almost exclusively with the API server via kubectl apply. The rest is Kubernetes machinery working on your behalf.
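
The reconciliation idea can be sketched in a few lines of Python. This is a toy model, not how the controller manager is implemented: `ClusterState` and `reconcile` are hypothetical names invented for illustration, and real controllers watch the API server rather than mutate a list.

```python
from dataclasses import dataclass, field


@dataclass
class ClusterState:
    """Toy stand-in for what the control plane tracks (not a real Kubernetes API)."""
    desired_replicas: int
    running_pods: list = field(default_factory=list)
    next_id: int = 0


def reconcile(state: ClusterState) -> list:
    """Compare desired state with observed state and return the actions taken."""
    actions = []
    # Too few pods: start new ones until the count matches.
    while len(state.running_pods) < state.desired_replicas:
        name = f"pod-{state.next_id}"
        state.next_id += 1
        state.running_pods.append(name)
        actions.append(f"start {name}")
    # Too many pods: stop the most recently started ones.
    while len(state.running_pods) > state.desired_replicas:
        actions.append(f"stop {state.running_pods.pop()}")
    return actions


state = ClusterState(desired_replicas=3)
first_pass = reconcile(state)        # starts pod-0, pod-1, pod-2
state.running_pods.remove("pod-1")   # simulate a crashed pod
second_pass = reconcile(state)       # starts one replacement
print(first_pass, second_pass)
```

Nothing in the loop says how the cluster got into its current state. It only compares desired with observed and closes the gap, which is why the same mechanism handles crashes, scale-ups, and scale-downs.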

Core Primitive 1: Pod

A Pod is the smallest deployable unit in Kubernetes. It wraps one or more containers that share a network namespace (same IP address, can reach each other via localhost) and optional shared storage. In practice, most ML workloads use one container per pod.

apiVersion: v1
kind: Pod
metadata:
  name: fraud-model-v2
  namespace: ml-prod
  labels:
    app: fraud-model
    version: "v2.1.0"
    team: risk
spec:
  containers:
    - name: model-server
      image: registry.company.com/fraud-model:v2.1.0
      ports:
        - containerPort: 8080

You almost never create naked Pods in production. Pods are mortal - if the node running the pod dies, the pod is gone and nothing recreates it. That is what Deployments are for.

But understanding Pods is essential because every higher-level abstraction (Deployment, Job, CronJob, PyTorchJob) creates Pods under the hood. When you debug a failed training job, you are reading Pod logs and events.

# Essential pod commands every ML engineer needs
kubectl get pods -n ml-prod                           # list all pods
kubectl get pods -n ml-prod -l app=fraud-model        # filter by label
kubectl describe pod fraud-model-v2 -n ml-prod        # full details + events
kubectl logs fraud-model-v2 -n ml-prod                # stdout/stderr
kubectl logs fraud-model-v2 -n ml-prod --previous     # logs from crashed container
kubectl exec -it fraud-model-v2 -n ml-prod -- bash    # shell into pod
kubectl top pod fraud-model-v2 -n ml-prod             # live CPU/memory usage
kubectl get events -n ml-prod --sort-by=.metadata.creationTimestamp  # events in the namespace, oldest first

Core Primitive 2: Deployment

A Deployment declares desired state: "I want N replicas of this Pod, always running, with this exact spec." The Deployment controller continuously reconciles reality toward that state - if a pod crashes, it creates a new one. If you update the image tag, it rolls out the new version without downtime.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: fraud-model
  namespace: ml-prod
spec:
  replicas: 3                           # always maintain 3 running pods
  selector:
    matchLabels:
      app: fraud-model                  # manages pods with this label
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1                       # can have 4 pods during update
      maxUnavailable: 0                 # never drop below 3 serving pods
  template:
    metadata:
      labels:
        app: fraud-model
        version: "v2.1.0"
    spec:
      containers:
        - name: model-server
          image: registry.company.com/fraud-model:v2.1.0
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: "2"
              memory: "8Gi"
            limits:
              cpu: "4"
              memory: "16Gi"

The rolling update strategy is critical for ML serving. maxUnavailable: 0 ensures you never have fewer serving pods than requested during a deployment - essential when your model takes 30 seconds to load into GPU memory and warm up. If you set maxUnavailable: 1, a 3-replica deployment would drop to 2 pods while the new pod loads, reducing serving capacity by 33%.
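
The arithmetic behind those two settings can be checked with a small helper. This is a sketch with a hypothetical function name, `rollout_bounds`. It assumes the documented rounding for percentage values (maxSurge rounds up, maxUnavailable rounds down) and ignores everything else the Deployment controller does.

```python
import math


def rollout_bounds(replicas: int, max_surge, max_unavailable) -> tuple:
    """Return (minimum available pods, maximum total pods) during a rolling update."""

    def resolve(value, round_up: bool) -> int:
        # Values may be absolute integers or percentage strings such as "25%".
        if isinstance(value, str) and value.endswith("%"):
            exact = replicas * int(value[:-1]) / 100
            return math.ceil(exact) if round_up else math.floor(exact)
        return int(value)

    surge = resolve(max_surge, round_up=True)
    unavailable = resolve(max_unavailable, round_up=False)
    return replicas - unavailable, replicas + surge


print(rollout_bounds(3, 1, 0))          # settings from the manifest above
print(rollout_bounds(3, 0, 1))          # the capacity-dropping alternative
print(rollout_bounds(10, "25%", "25%"))
```

With the manifest's settings the first call returns a floor of 3 serving pods and a ceiling of 4 total, which matches the comments in the YAML.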

# Deployment management commands
kubectl apply -f deployment.yaml             # create or update deployment
kubectl rollout status deployment/fraud-model -n ml-prod   # watch rollout progress
kubectl rollout history deployment/fraud-model -n ml-prod  # view revision history
kubectl rollout undo deployment/fraud-model -n ml-prod     # rollback to previous version
kubectl set image deployment/fraud-model model-server=registry.company.com/fraud-model:v2.2.0 -n ml-prod  # update the image in place

Resource Requests and Limits - The Most Important Config for ML

This is where most ML engineers get tripped up. Kubernetes has two separate resource concepts:

Request: the guaranteed minimum. The scheduler uses requests to find a node with sufficient unreserved capacity. The pod is placed only on nodes where allocatable - already_requested >= this_pod_request.
Limit: the hard ceiling. If a container tries to use more memory than its limit, it gets OOMKilled (exit code 137). If it exceeds its CPU limit, it gets CPU-throttled.

resources:
  requests:
    cpu: "2"           # 2 CPU cores guaranteed at scheduling time
    memory: "8Gi"      # 8 GB RAM guaranteed
    nvidia.com/gpu: 1  # 1 GPU (requires NVIDIA device plugin - covered in lesson 03)
  limits:
    cpu: "4"           # can burst to 4 cores if node has spare capacity
    memory: "16Gi"     # hard limit - OOMKilled if exceeded
    nvidia.com/gpu: 1  # GPU limits must always equal requests

:::warning GPU Request Rules For GPU resources (nvidia.com/gpu), the limit must always equal the request. By default, Kubernetes cannot partially allocate a GPU or throttle GPU usage the way it throttles CPU. If you request 1 GPU, your container gets exactly 1 GPU for its exclusive use. You may specify only the limit (the request then defaults to the same value), but if you specify both they must match or the API server will reject the manifest. :::

Why Requests Matter for Scheduling

Imagine a node with 32 GB of allocatable RAM. Four pods each request 8 GB. The scheduler considers the node "full" - no new pod that requests memory will be scheduled there, even if actual usage is only 15 GB in total. This is correct and intentional: requests are guarantees. If all four pods grow to their full requested 8 GB at the same time, the node can still serve them, because total requests never exceed allocatable capacity. Usage above the request, up to the limit, carries no such guarantee.
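
The same bookkeeping as a minimal sketch. `fits_on_node` is a hypothetical helper that looks only at memory. The real scheduler applies the same comparison to every requested resource (CPU, GPUs, and so on) and then runs further filters.

```python
def fits_on_node(allocatable_gib: float, existing_requests_gib: list, pod_request_gib: float) -> bool:
    """A pod fits only if its request is covered by what is not yet reserved."""
    unreserved = allocatable_gib - sum(existing_requests_gib)
    return unreserved >= pod_request_gib


node_allocatable = 32.0
existing = [8.0, 8.0, 8.0, 8.0]   # four pods requesting 8 GiB each

print(fits_on_node(node_allocatable, existing, 1.0))       # False: node is fully reserved
print(fits_on_node(node_allocatable, existing[:3], 8.0))   # True: 8 GiB still unreserved
```

Actual usage never appears in the calculation. Only requests do, which is why inflated requests waste capacity and understated requests lead to overcommitted nodes.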

The practical implication: set requests to what your workload actually needs under normal load, not the maximum it could ever use. Set limits to the maximum it should ever use. The gap between request and limit is "burstable" capacity that multiple pods share when they don't all need it simultaneously.

The OOMKill Problem in ML

ML models have spiky memory profiles. A PyTorch model might use 4 GB at idle, 12 GB during a large-batch inference request, and 14 GB during concurrent requests with dynamic padding. If your memory limit is 10 GB, the pod gets OOMKilled mid-inference. The symptom:

kubectl describe pod fraud-model-abc123 -n ml-prod
# Containers:
#   model-server:
#     Last State: Terminated
#       Reason: OOMKilled
#       Exit Code: 137
#       Started: Sat, 14 Mar 2026 14:23:11 +0000
#       Finished: Sat, 14 Mar 2026 14:31:44 +0000
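
The same check can be scripted against the JSON form of the pod, for example the output of kubectl get pod -o json. The sketch below assumes that output has already been parsed into a dict. The function name `oom_killed_containers` and the sample data are made up for illustration, while the field names (status.containerStatuses, lastState.terminated.reason) come from the Pod API.

```python
def oom_killed_containers(pod: dict) -> list:
    """Return the names of containers whose previous run ended with OOMKilled."""
    names = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        terminated = status.get("lastState", {}).get("terminated") or {}
        if terminated.get("reason") == "OOMKilled":
            names.append(status["name"])
    return names


# Hand-written sample shaped like the relevant part of a Pod object.
sample_pod = {
    "status": {
        "containerStatuses": [
            {
                "name": "model-server",
                "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}},
            },
            {"name": "sidecar", "lastState": {}},
        ]
    }
}

print(oom_killed_containers(sample_pod))
```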

Prevention:

Profile memory usage at the maximum expected batch size before setting limits
Set limits to 1.5–2x your measured peak usage for a safety margin
Set requests to 60–80% of limits so pods land on nodes with real headroom
For inference servers, set OMP_NUM_THREADS and MKL_NUM_THREADS to cap thread counts, which keeps CPU usage within the CPU limit and reduces per-thread memory overhead
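
These rules of thumb are easy to turn into numbers. The helper below is a sketch, not an official formula: `suggest_memory` is a hypothetical name, and the defaults (a 1.5x safety margin and requests at 70% of the limit) are picked from within the ranges given above.

```python
import math


def suggest_memory(peak_gib: float, safety_margin: float = 1.5, request_ratio: float = 0.7) -> dict:
    """Turn a measured peak into request/limit strings in Mi units."""
    limit_mib = math.ceil(peak_gib * 1024 * safety_margin)
    request_mib = math.ceil(limit_mib * request_ratio)
    return {"requests": f"{request_mib}Mi", "limits": f"{limit_mib}Mi"}


# 14 GiB is the concurrent-request peak from the example above.
suggestion = suggest_memory(peak_gib=14.0)
print(suggestion)
```

Treat the output as a starting point and re-measure after deployment, since the margin that is safe for one model may be wasteful or insufficient for another.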

Core Primitive 3: Service

Pods have ephemeral IP addresses - a replacement pod gets a new IP whenever a pod is recreated or rescheduled. A Service provides a stable DNS name and virtual IP that load-balances traffic across all ready pods matching a label selector.

apiVersion: v1
kind: Service
metadata:
  name: fraud-model-svc
  namespace: ml-prod
spec:
  selector:
    app: fraud-model           # routes to any pod with this label
  ports:
    - port: 80                 # service port (what callers use)
      targetPort: 8080         # container port (what the app listens on)
      name: http
  type: ClusterIP              # internal only - default for ML microservices
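
The selector match itself is plain equality on labels. The sketch below uses a hypothetical helper, `selects`, to show which pods the Service above would pick up. It covers only the equality-based selectors that Services use and ignores readiness, which also gates whether a pod receives traffic.

```python
def selects(selector: dict, pod_labels: dict) -> bool:
    """True if every key/value pair in the selector is present on the pod."""
    # A Service without a selector gets no automatically managed endpoints.
    return bool(selector) and all(pod_labels.get(k) == v for k, v in selector.items())


service_selector = {"app": "fraud-model"}

pods = {
    "fraud-model-abc": {"app": "fraud-model", "version": "v2.1.0", "team": "risk"},
    "fraud-model-canary": {"app": "fraud-model", "version": "v2.2.0"},
    "credit-model-xyz": {"app": "credit-model", "team": "risk"},
}

matched = sorted(name for name, labels in pods.items() if selects(service_selector, labels))
print(matched)
```

Extra labels on a pod never prevent a match, so both versions of the fraud model receive traffic here. That is useful for canaries and surprising when it is not intended.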

Service types:

ClusterIP (default): only reachable inside the cluster. Use for internal ML services - feature stores, embedding servers, preprocessing pipelines.
NodePort: opens a static port on every node's external IP. Used for debugging or when you need external access without a cloud load balancer.
LoadBalancer: provisions a cloud load balancer (for example, an AWS NLB or a GCP load balancer). Use for ML APIs that clients outside the cluster call.

DNS resolution inside the cluster: fraud-model-svc.ml-prod.svc.cluster.local always resolves to the Service. Services in the same namespace can use just fraud-model-svc. Cross-namespace: fraud-model-svc.ml-prod.
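
The three name forms follow a fixed pattern, sketched below. `service_dns_names` is a hypothetical helper, and cluster.local is assumed as the cluster domain. It is the usual default, but clusters can be configured with a different one.

```python
def service_dns_names(service: str, namespace: str, cluster_domain: str = "cluster.local") -> dict:
    """Build the DNS names under which a Service is reachable from inside the cluster."""
    return {
        "same_namespace": service,
        "cross_namespace": f"{service}.{namespace}",
        "fully_qualified": f"{service}.{namespace}.svc.{cluster_domain}",
    }


names = service_dns_names("fraud-model-svc", "ml-prod")
# A caller in another namespace would target the Service port (80 in the manifest above).
url = f"http://{names['cross_namespace']}:80"
print(names["fully_qualified"], url)
```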

Namespaces - Team Isolation in Shared Clusters

A namespace is a virtual cluster within a cluster. It provides name isolation (two teams can both have a Deployment named model-server), resource quotas (limit total CPU/GPU/memory per team), RBAC scope (ML engineer Alice can only access team-fraud), and network policy boundaries.

apiVersion: v1
kind: Namespace
metadata:
  name: team-fraud
  labels:
    team: fraud
    cost-center: risk-engineering
    env: production

In multi-team ML platforms, a common pattern is one namespace per team per environment:

team-fraud-dev, team-fraud-staging, team-fraud-prod
team-credit-dev, team-credit-staging, team-credit-prod

# Namespace management
kubectl get namespaces
kubectl create namespace team-fraud
kubectl apply -f manifest.yaml -n team-fraud    # explicit namespace
kubectl config set-context --current --namespace=team-fraud  # set default

ConfigMaps and Secrets

ML models have configuration that varies across environments: model paths, feature server URLs, batch sizes, decision thresholds. Never bake environment-specific values into Docker images - you need the same image to run in dev, staging, and prod with different config.

# ConfigMap: non-sensitive configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: fraud-model-config
  namespace: ml-prod
data:
  MODEL_PATH: "/models/fraud/v2.1.0/model.pt"
  FEATURE_SERVER_URL: "http://feature-server-svc:8081"
  BATCH_SIZE: "32"
  DECISION_THRESHOLD: "0.75"
  LOG_LEVEL: "INFO"
---
# Secret: sensitive credentials (base64 encoded, not encrypted by default)
apiVersion: v1
kind: Secret
metadata:
  name: fraud-model-secrets
  namespace: ml-prod
type: Opaque
data:
  # base64 encode values: echo -n 'mytoken' | base64
  MODEL_REGISTRY_TOKEN: "dG9rZW4tdmFsdWU="
  REDIS_PASSWORD: "cmVkaXMtcGFzcw=="
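
To see why base64 offers no protection, the values in the Secret above can be produced and reversed with Python's standard library alone. The two helper names are hypothetical.

```python
import base64


def encode_secret_value(plaintext: str) -> str:
    """Encode a value the way it must appear under a Secret's data field."""
    return base64.b64encode(plaintext.encode("utf-8")).decode("ascii")


def decode_secret_value(encoded: str) -> str:
    """Reverse the encoding; no key or password is involved."""
    return base64.b64decode(encoded).decode("utf-8")


print(encode_secret_value("token-value"))        # dG9rZW4tdmFsdWU=
print(decode_secret_value("cmVkaXMtcGFzcw=="))   # redis-pass
```

Anyone who can read the Secret object can run the second function, so access control and encryption at rest do the real work.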

Injecting into pods:

spec:
  containers:
    - name: model-server
      image: registry.company.com/fraud-model:v2.1.0
      # Inject all keys from ConfigMap and Secret as environment variables
      envFrom:
        - configMapRef:
            name: fraud-model-config
        - secretRef:
            name: fraud-model-secrets
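
On the application side, everything injected through envFrom arrives as a string, so numeric settings need explicit conversion. The following is a minimal sketch of a loader for the keys in the ConfigMap above. `ModelConfig` and `load_config` are illustrative names, not part of any library.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class ModelConfig:
    model_path: str
    feature_server_url: str
    batch_size: int
    decision_threshold: float
    log_level: str


def load_config(env: Mapping[str, str] = os.environ) -> ModelConfig:
    """Read settings from the environment, failing fast if a required key is missing."""
    return ModelConfig(
        model_path=env["MODEL_PATH"],
        feature_server_url=env["FEATURE_SERVER_URL"],
        batch_size=int(env["BATCH_SIZE"]),                     # "32" -> 32
        decision_threshold=float(env["DECISION_THRESHOLD"]),   # "0.75" -> 0.75
        log_level=env.get("LOG_LEVEL", "INFO"),
    )


# Stand-in for the variables Kubernetes would inject from the ConfigMap.
example_env = {
    "MODEL_PATH": "/models/fraud/v2.1.0/model.pt",
    "FEATURE_SERVER_URL": "http://feature-server-svc:8081",
    "BATCH_SIZE": "32",
    "DECISION_THRESHOLD": "0.75",
}
config = load_config(example_env)
print(config)
```

Passing the mapping in as a parameter keeps the loader testable without touching the real process environment.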

:::tip ConfigMap as Mounted File For ML models that read a YAML or JSON config file instead of environment variables, mount the ConfigMap as a file:

volumeMounts:
  - name: config-volume
    mountPath: /app/config/model.yaml
    subPath: model.yaml        # mount only this key as a file (the ConfigMap must define a model.yaml key)
volumes:
  - name: config-volume
    configMap:
      name: fraud-model-config

:::

:::danger Kubernetes Secrets Are Not Encrypted By Default Kubernetes Secrets are stored base64-encoded in etcd and, unless encryption at rest is enabled, not encrypted. Anyone with etcd read access or sufficient RBAC permissions can read secret values. For real production secrets (API keys, database passwords, model registry tokens), keep the source of truth in an external manager such as AWS Secrets Manager, GCP Secret Manager, or HashiCorp Vault, and use External Secrets Operator (ESO) to sync values into the cluster. ESO materializes them as ordinary Kubernetes Secrets, so etcd encryption at rest and tight RBAC are still required. :::

Liveness and Readiness Probes for ML Services

This is where ML workloads diverge most significantly from traditional web services. An ML model server might take 45 seconds to load a 4 GB model into GPU memory on startup, become unresponsive if the CUDA context gets corrupted, or need a warmup request before handling production traffic efficiently.

Kubernetes provides two health check mechanisms that matter most for model servers:

Readiness probe: Is this pod ready to receive traffic? If it fails, the pod is removed from the Service endpoints (no traffic sent) but not restarted. This is what you want during model loading.
Liveness probe: Is this container alive? If it fails, the kubelet restarts the container. This handles the corrupted-CUDA-context scenario where the process is running but unable to serve requests.

readinessProbe:
  httpGet:
    path: /health/ready    # returns 200 only after model is loaded + warmed up
    port: 8080
  initialDelaySeconds: 60  # wait 60s before first check (covers model load time)
  periodSeconds: 10        # check every 10 seconds after that
  failureThreshold: 3      # 3 consecutive failures = not ready
  successThreshold: 1      # 1 success = ready to receive traffic
livenessProbe:
  httpGet:
    path: /health/live     # returns 200 as long as process is responsive
    port: 8080
  initialDelaySeconds: 120 # longer delay - give model full time to start
  periodSeconds: 30        # check every 30 seconds
  failureThreshold: 3      # 3 failures = restart the pod
  timeoutSeconds: 5        # probe must respond within 5 seconds

Implementing the health endpoints in FastAPI:

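import logging
import os

import torch
from fastapi import FastAPI, HTTPException

app = FastAPI()
logger = logging.getLogger(__name__)

# MODEL_PATH is injected from the ConfigMap; the default matches the lesson's example
MODEL_PATH = os.environ.get("MODEL_PATH", "/models/fraud/v2.1.0/model.pt")

model = None
model_loaded = False
warmup_complete = False


# on_event still works, but newer FastAPI versions prefer a lifespan handler
@app.on_event("startup")
async def startup():
    global model, model_loaded, warmup_complete

    # Use the GPU when one is visible to the container, otherwise fall back to CPU
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    logger.info("Loading model...")
    model = torch.jit.load(MODEL_PATH, map_location=device)
    model.eval()
    model_loaded = True
    logger.info("Model loaded on %s.", device)

    # Warmup: run one forward pass to prime CUDA kernel caches.
    # The (1, 512) int64 input is a placeholder; match your model's real input shape.
    dummy = torch.zeros(1, 512, dtype=torch.long, device=device)
    with torch.no_grad():
        _ = model(dummy)
    warmup_complete = True
    logger.info("Warmup complete. Ready to serve.")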


@app.get("/health/live")
def liveness():
    """Always return 200 unless the process is truly wedged."""
    return {"status": "alive"}


@app.get("/health/ready")
def readiness():
    """Return 200 only when model is loaded AND warmup is done."""
    if not model_loaded:
        raise HTTPException(status_code=503, detail="Model not loaded")
    if not warmup_complete:
        raise HTTPException(status_code=503, detail="Warmup not complete")
    return {"status": "ready", "model_loaded": True, "warmup_complete": True}

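:::danger The initialDelaySeconds Trap Setting initialDelaySeconds too low is the most common ML probe misconfiguration. If your model takes 45 seconds to load but the readiness probe uses initialDelaySeconds: 30, Kubernetes starts checking readiness before loading is complete. That part is harmless: the pod simply stays out of the Service endpoints until a check succeeds. The liveness probe is the dangerous one. If its initialDelaySeconds is too low, failed checks accumulate while the model is still loading and K8s restarts the container before it finishes. Every restart hits the same wall, so the pod crash-loops and never starts successfully. A startupProbe, which holds off liveness checks until it first succeeds, is the built-in alternative to a long liveness delay.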

Measure actual model load time in your target environment. Set initialDelaySeconds to at least 1.5x that value. When in doubt, be generous - a slow rollout is better than a crash loop. :::
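
A rough way to sanity-check liveness settings against a measured load time is sketched below. Both function names are hypothetical, and the calculation is approximate. It assumes the liveness endpoint fails for the whole loading period and ignores timeoutSeconds and kubelet timing jitter.

```python
def earliest_liveness_restart(initial_delay: int, period: int, failure_threshold: int) -> int:
    """Seconds after container start at which failed liveness checks can first cause a restart."""
    # First check at initial_delay, then one more every period seconds;
    # the restart happens on the failure_threshold-th consecutive failure.
    return initial_delay + (failure_threshold - 1) * period


def survives_startup(model_load_seconds: int, initial_delay: int, period: int, failure_threshold: int) -> bool:
    """True if the model finishes loading before a restart could be triggered."""
    return model_load_seconds < earliest_liveness_restart(initial_delay, period, failure_threshold)


# Liveness settings from the manifest above, with a 45 second model load.
print(earliest_liveness_restart(120, 30, 3))   # 180
print(survives_startup(45, 120, 30, 3))        # True

# A too-aggressive configuration: restart possible at 30s, load needs 45s.
print(survives_startup(45, 10, 10, 3))         # False
```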

Persistent Volumes - Storing Checkpoints and Artifacts

Regular pod storage (emptyDir) lives and dies with the pod. ML training jobs need to save checkpoints that survive pod preemption. Persistent Volumes (PVs) are cluster-managed storage that outlasts individual pods.

# PersistentVolumeClaim - request durable storage
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-checkpoints
  namespace: team-fraud
spec:
  accessModes:
    - ReadWriteOnce          # read-write, mountable by a single node at a time
  storageClassName: fast-ssd # cluster's SSD storage class
  resources:
    requests:
      storage: 200Gi         # 200 GB for checkpoints

Mount the PVC in a training pod:

spec:
  containers:
    - name: trainer
      image: registry.company.com/fraud-trainer:v3.0
      volumeMounts:
        - name: checkpoints
          mountPath: /checkpoints
  volumes:
    - name: checkpoints
      persistentVolumeClaim:
        claimName: training-checkpoints
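
Inside the training container, the mounted path behaves like a normal directory, so surviving preemption comes down to writing checkpoints safely and resuming from the newest one. The sketch below uses JSON and hypothetical function names to stay dependency-free. A real job would write framework checkpoints (for example with torch.save) to /checkpoints using the same write-then-rename pattern.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional


def save_checkpoint(directory: Path, step: int, state: dict) -> Path:
    """Write to a temp file first, then rename, so a preempted pod never leaves a half-written file."""
    directory.mkdir(parents=True, exist_ok=True)
    final_path = directory / f"step-{step:08d}.json"
    fd, tmp_name = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump({"step": step, "state": state}, handle)
        handle.flush()
        os.fsync(handle.fileno())
    os.replace(tmp_name, final_path)  # atomic when both paths are on the same filesystem
    return final_path


def load_latest_checkpoint(directory: Path) -> Optional[dict]:
    """Return the highest-step checkpoint, or None on a fresh start."""
    checkpoints = sorted(directory.glob("step-*.json"))  # zero-padded names sort by step
    if not checkpoints:
        return None
    with checkpoints[-1].open() as handle:
        return json.load(handle)


# Demo against a temporary directory standing in for the /checkpoints mount.
with tempfile.TemporaryDirectory() as tmp:
    ckpt_dir = Path(tmp) / "checkpoints"
    save_checkpoint(ckpt_dir, 100, {"loss": 0.42})
    save_checkpoint(ckpt_dir, 200, {"loss": 0.31})
    resumed = load_latest_checkpoint(ckpt_dir)

print(resumed["step"])
```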

For distributed training where multiple pods need to read/write the same checkpoint directory simultaneously, use ReadWriteMany with network file storage:

spec:
  accessModes:
    - ReadWriteMany           # multiple pods can mount simultaneously
  storageClassName: efs       # AWS EFS, GCP Filestore, or Azure Files
  resources:
    requests:
      storage: 500Gi

A Complete Production ML Manifest

Putting everything together:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: fraud-model
  namespace: ml-prod
  annotations:
    model-version: "v2.1.0"
    deployed-by: "gitlab-ci"
    pipeline-id: "12345"
spec:
  replicas: 3
  selector:
    matchLabels:
      app: fraud-model
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: fraud-model
        version: "v2.1.0"
        team: risk
    spec:
      imagePullSecrets:
        - name: registry-credentials
      containers:
        - name: model-server
          image: registry.company.com/fraud-model:v2.1.0
          ports:
            - containerPort: 8080
              name: http
          envFrom:
            - configMapRef:
                name: fraud-model-config
            - secretRef:
                name: fraud-model-secrets
          resources:
            requests:
              cpu: "2"
              memory: "8Gi"
            limits:
              cpu: "4"
              memory: "16Gi"
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 60
            periodSeconds: 10
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 120
            periodSeconds: 30
            failureThreshold: 3
          volumeMounts:
            - name: model-cache
              mountPath: /models
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-artifact-cache   # replicas run on different nodes, so this PVC needs ReadOnlyMany or ReadWriteMany
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values: ["fraud-model"]
              topologyKey: kubernetes.io/hostname   # spread across nodes
---
apiVersion: v1
kind: Service
metadata:
  name: fraud-model-svc
  namespace: ml-prod
spec:
  selector:
    app: fraud-model
  ports:
    - port: 80
      targetPort: 8080
  type: ClusterIP
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: fraud-model-pdb
  namespace: ml-prod
spec:
  minAvailable: 2             # always keep at least 2 pods during node drains
  selector:
    matchLabels:
      app: fraud-model

Production Engineering Notes

Never use latest image tags. latest means "whatever was most recently pushed" - it breaks reproducibility and makes rollbacks impossible to reason about. Pin to semantic version tags or, better, immutable content digests:

image: registry.company.com/fraud-model@sha256:a1b2c3d4...  # immutable

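Use pod anti-affinity to spread replicas. Without it, the scheduler might place all 3 replicas on the same node. If that node fails, you lose all serving capacity simultaneously. The podAntiAffinity spec above with kubernetes.io/hostname ensures each replica runs on a different node. Because the rule is required rather than preferred, the cluster needs at least as many eligible nodes as pods, including the surge pod during a rollout; otherwise the extra pod stays Pending.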

Add PodDisruptionBudgets. Without a PDB, a cluster administrator draining a node for maintenance can evict all your pods simultaneously. A PDB with minAvailable: 2 means the drain process will wait for a new pod to start before evicting the next one - maintaining minimum serving capacity throughout.
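
The budget arithmetic is simple enough to sketch. `disruptions_allowed` is a hypothetical helper covering only the minAvailable form used above; the real controller also handles maxUnavailable and percentage values.

```python
def disruptions_allowed(healthy_pods: int, min_available: int) -> int:
    """Number of pods that may be voluntarily evicted right now."""
    return max(0, healthy_pods - min_available)


# 3 healthy replicas, minAvailable: 2 -> one eviction may proceed.
print(disruptions_allowed(3, 2))   # 1

# After that eviction only 2 pods are healthy, so the next eviction is blocked
# until the replacement pod passes its readiness probe.
print(disruptions_allowed(2, 2))   # 0
```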

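Set resource requests based on measurement, not intuition. Run your model server under realistic load and record actual CPU and memory usage, for example by sampling kubectl top during a load test or reading your metrics system. Set requests near typical usage and limits above the observed peak, following the safety margins from the OOMKill section. Revisit quarterly as traffic patterns change.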

Common Mistakes

:::danger Setting Limits Without Profiling Setting memory limit: 4Gi because "that seems like enough" without profiling actual usage is the most common ML K8s mistake. PyTorch models often use 2–3x more memory under load than at idle due to batch processing, intermediate tensor allocation, and allocator fragmentation. Always profile with realistic batch sizes and concurrency before setting limits.

Use kubectl top pod --containers during a load test to observe peak usage. :::

:::warning Missing Readiness Probes on ML Services Without a readiness probe, Kubernetes routes traffic to new pods the instant the container process starts - before the model is loaded into memory. The first requests after every pod restart will hit an unloaded model and return 500 errors. This problem is silent in normal operations but becomes visible during rolling updates, when pods regularly restart. Always implement readiness probes with initialDelaySeconds that accounts for model loading time. :::

:::warning Hardcoding Environment Config in Docker Images Baking FEATURE_SERVER_URL=http://feature-store-prod.company.com into the image means you cannot use the same image in staging. Use ConfigMaps for all environment-specific configuration. The same image tag should run identically in dev, staging, and prod - only the ConfigMap values differ. This is a core tenet of the twelve-factor app methodology. :::

Interview Q&A

Q1: What is the difference between a Kubernetes resource request and a resource limit? How does each affect pod behavior?

A request is a scheduling guarantee - the scheduler only places a pod on a node where the available capacity (total node capacity minus sum of existing requests) is at least as large as the pod's request. The request does not cap usage; a pod can use more than its requested amount if the node has spare capacity. A limit is a hard ceiling enforced at runtime: exceeding the memory limit triggers OOMKill (exit code 137); exceeding the CPU limit causes throttling (the process runs slower, not killed). For ML workloads, set requests based on typical usage to enable efficient scheduling, and limits based on the maximum safe usage to prevent one pod from starving others on the same node.

Q2: Why would an ML serving deployment use separate liveness and readiness probes?

They serve distinct purposes. A readiness probe answers "should traffic be sent to this pod?" - it should fail during model loading (preventing requests from hitting an unloaded model) and pass once the model is loaded and warmed up. A liveness probe answers "should this pod be restarted?" - it should only fail when the process is in an unrecoverable state, such as a deadlocked thread or corrupted CUDA context. Using only a liveness probe with a short initialDelaySeconds would cause Kubernetes to restart the pod repeatedly during the model loading phase, creating a crash loop. Using only a readiness probe would leave zombie pods in a permanently "not ready" state without ever restarting them.

Q3: A production ML serving pod is being OOMKilled intermittently during peak traffic. Walk through your diagnosis.

First, confirm the OOMKill with kubectl describe pod <name>: look for Reason: OOMKilled and Exit Code: 137. Run kubectl top pod --containers during a load test to observe the memory trend. Check whether the OOMKill happens at startup (model loading spike), at steady state (limit too low even for idle usage), or only during bursts (batch size too large). Keep in mind that the container memory limit covers host RAM, not GPU memory: GPU exhaustion surfaces as a CUDA out-of-memory error inside the process rather than an OOMKill. Use Python memory profiling (memory_profiler, tracemalloc) to find the host-side allocation site, and torch.cuda.memory_stats() if you also suspect GPU pressure. Common culprits: dynamic batch padding allocating large tensors, gradient retention from a missing torch.no_grad() in inference, or allocator fragmentation over time. Fix by increasing the limit, reducing batch size, releasing references to large tensors promptly (with an explicit gc.collect() where needed), or implementing response streaming.

Q4: How do Kubernetes Services find the pods they should route traffic to?

Services use label selectors - they continuously watch the API server for pods whose labels match the selector. When a pod becomes ready (its readiness probe passes) and its labels match the Service selector, the endpoints controller adds that pod's IP and port to the Service's Endpoints object. When a pod becomes unready or is deleted, it is removed from Endpoints. kube-proxy on each node watches the Endpoints object and programs iptables or IPVS rules to route traffic from the Service's virtual IP to one of the current endpoint IPs. This is why label consistency matters: a pod without the correct labels will never receive traffic from the Service, even if it's running and healthy.

Q5: What are Kubernetes Secrets and what are their security limitations?

Secrets store sensitive data (passwords, API tokens, certificates) and are accessible to pods via environment variables or volume mounts, separate from non-sensitive ConfigMaps. The key limitation: by default, Secret values are only base64-encoded in etcd, not encrypted. Anyone with etcd read access, certain RBAC permissions (kubectl get secret), or access to a pod that mounts the secret can read the value. Mitigation strategies: enable etcd encryption-at-rest in the cluster configuration, restrict access to Secrets via RBAC (only the service accounts that need a secret should have get permission on it), and keep the source of truth in a dedicated secrets manager (AWS Secrets Manager, HashiCorp Vault) synced in through the External Secrets Operator. ESO still creates ordinary Kubernetes Secrets, so the first two mitigations remain necessary.

Q6: Explain PodDisruptionBudgets and give a concrete example of why an ML team needs one.

A PodDisruptionBudget limits the number of pods of a given selector that can be simultaneously unavailable due to voluntary disruptions - cluster upgrades, node drains, Descheduler evictions. Without a PDB, a cluster admin draining a node for maintenance could evict all replicas simultaneously, causing a complete outage. With minAvailable: 2 on a 3-replica fraud model deployment, the drain process can evict at most 1 pod at a time and must wait for the replacement to become ready before evicting the next. For ML serving, this is especially important because model loading takes 30–60 seconds - if 2 pods are evicted simultaneously, your serving capacity drops to 1 pod for the entire model loading duration. A PDB prevents this by serializing evictions.
