Kubernetes for ML
The VM-to-Kubernetes Migration
The ML team had been running on EC2 VMs for three years. It worked - sort of. Training jobs ran on manually provisioned p3.8xlarge instances. Serving ran on g4dn.xlarge instances, one model per instance. When a model needed more resources, an engineer filed a Jira ticket to provision a new instance, waited 2 business days for approval, and then spent half a day configuring it.
The ML cluster had 47 separate EC2 instances. 12 of them were under 20% GPU utilization. 3 of them had been running for 8 months with no job scheduled - someone's training environment that never got cleaned up. A single model that needed 4 GPUs for a short training run required a dedicated instance that ran 24/7, because the alternative was waiting days for a new one.
The motivation for migrating to Kubernetes was GPU sharing and utilization. The goal: multiple training jobs sharing a GPU cluster, each getting the resources they need when they need them, and releasing them when done. The result: GPU utilization went from 38% average to 71%. Infrastructure cost dropped by 40%. Provisioning time for new training jobs dropped from 2 days to 30 seconds.
The migration took 8 weeks. This lesson documents the key decisions and implementation details.
Why Kubernetes for ML
Kubernetes was designed for microservices - stateless, CPU-bound workloads that scale horizontally. ML workloads are stateful, GPU-bound, and often need large chunks of shared memory. The fit isn't perfect, but Kubernetes offers four properties that make it worth the operational complexity:
1. Resource-based scheduling: Kubernetes schedules pods based on explicit resource requests - nvidia.com/gpu: 2, memory: 64Gi. This enables multiple workloads to share a cluster, with the scheduler ensuring no overallocation.
2. Declarative configuration: Every workload is a YAML manifest, version-controlled and reproducible. Compare this to SSH-ing into a VM and running commands manually.
3. Ecosystem: Kubeflow, Argo Workflows, Seldon, KServe, Ray - the ML tooling ecosystem is now primarily Kubernetes-native.
4. Node pools: Different node types (CPU-only, 1-GPU, 8-GPU, high-memory) can coexist in the same cluster, with workloads scheduled to the appropriate node type automatically.
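To make the first property concrete, here is a minimal sketch of resource-based placement. It is not the real kube-scheduler (which also scores nodes and considers CPU, memory, taints, and affinity); the `Node` class and `schedule` function are hypothetical names used only for illustration. It shows the core idea: a pod is placed only where its full GPU request fits, otherwise it waits.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical stand-in for a GPU node's capacity bookkeeping."""
    name: str
    gpu_capacity: int
    gpu_allocated: int = 0

    @property
    def gpu_free(self) -> int:
        return self.gpu_capacity - self.gpu_allocated


def schedule(pod_gpus: int, nodes: list[Node]) -> str | None:
    """Place a pod on the first node with enough free GPUs.

    Returns the node name, or None if the pod would stay Pending.
    """
    for node in nodes:
        if node.gpu_free >= pod_gpus:
            node.gpu_allocated += pod_gpus  # whole GPUs are reserved, never overallocated
            return node.name
    return None


nodes = [Node("gpu-node-1", 4), Node("gpu-node-2", 4)]
# Four pods requesting 2, 2, 4 and 1 GPUs against 8 GPUs of total capacity
placements = [schedule(gpus, nodes) for gpus in (2, 2, 4, 1)]
print(placements)  # ['gpu-node-1', 'gpu-node-1', 'gpu-node-2', None]
```

The last pod is not squeezed onto a full node; it waits until another workload finishes and releases its GPUs.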
GPU Scheduling in Kubernetes
NVIDIA Device Plugin
GPUs are not natively visible to Kubernetes - the NVIDIA device plugin bridges the gap. It runs as a DaemonSet on every GPU node and advertises GPUs as schedulable resources:
# Install NVIDIA device plugin (run once per cluster)
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      # Allow the plugin to run on nodes tainted with nvidia.com/gpu.
      # A toleration only permits scheduling; it does not restrict the
      # DaemonSet to GPU nodes.
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: nvidia-device-plugin-ctr
          image: nvcr.io/nvidia/k8s-device-plugin:v0.14.3
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
After the plugin is installed, GPUs appear as nvidia.com/gpu in node capacity:
kubectl describe node gpu-node-1
# Capacity:
# nvidia.com/gpu: 4
# Allocatable:
# nvidia.com/gpu: 4
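The same check can be scripted across a whole cluster. The sketch below assumes you have the parsed output of `kubectl get nodes -o json` (a list object with an `items` array); the function name `allocatable_gpus` and the sample data are hypothetical, and the sample is trimmed to the fields the function reads.

```python
def allocatable_gpus(node_list: dict) -> dict:
    """Map node name -> allocatable nvidia.com/gpu count.

    Nodes without the device plugin simply report no such resource,
    so they are counted as 0.
    """
    result = {}
    for node in node_list.get("items", []):
        name = node["metadata"]["name"]
        allocatable = node.get("status", {}).get("allocatable", {})
        # Resource quantities are serialized as strings in the API
        result[name] = int(allocatable.get("nvidia.com/gpu", "0"))
    return result


# Hypothetical sample shaped like `kubectl get nodes -o json`
sample = {
    "items": [
        {"metadata": {"name": "gpu-node-1"},
         "status": {"allocatable": {"cpu": "32", "nvidia.com/gpu": "4"}}},
        {"metadata": {"name": "cpu-node-1"},
         "status": {"allocatable": {"cpu": "16"}}},
    ]
}

gpus = allocatable_gpus(sample)
print(gpus)  # {'gpu-node-1': 4, 'cpu-node-1': 0}
```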
Node Taints for GPU Isolation
Prevent CPU workloads from accidentally scheduling onto GPU nodes (which are expensive and should be reserved for ML):
# Taint GPU nodes - only ML workloads that tolerate this can schedule here
kubectl taint nodes gpu-node-1 gpu-node-2 \
dedicated=ml-workloads:NoSchedule
# Label GPU nodes by type for targeting
kubectl label nodes gpu-node-1 gpu-node-2 \
accelerator=nvidia-a100 \
gpu-count=4
# Training job that targets A100 nodes
apiVersion: batch/v1
kind: Job
metadata:
  name: model-training-run
spec:
  template:
    spec:
      restartPolicy: OnFailure # Jobs require Never or OnFailure
      tolerations:
        - key: "dedicated"
          operator: "Equal"
          value: "ml-workloads"
          effect: "NoSchedule"
      nodeSelector:
        accelerator: nvidia-a100 # Schedule only on A100 nodes
      containers:
        - name: trainer
          image: myregistry/ml-trainer:v1.0
          resources:
            requests:
              memory: "32Gi"
              cpu: "8"
              nvidia.com/gpu: "2" # Request 2 GPUs
            limits:
              memory: "64Gi"
              cpu: "16"
              nvidia.com/gpu: "2" # Limit must equal request for GPUs
          # No need to set CUDA_VISIBLE_DEVICES by hand: the device plugin and
          # NVIDIA container runtime expose only the allocated GPUs, which the
          # process sees as devices 0 and 1.
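The taint and toleration matching rules are easy to get subtly wrong, so here is a small sketch of them in code. It models only the `NoSchedule` effect and the `Equal`/`Exists` operators as described in the Kubernetes documentation; the function names `tolerates` and `can_schedule` are hypothetical.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True if a single toleration matches a single taint."""
    # An empty effect on the toleration matches every effect
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    operator = toleration.get("operator", "Equal")  # Equal is the default
    if operator == "Exists":
        # An empty key with Exists matches every taint key
        return not toleration.get("key") or toleration["key"] == taint["key"]
    return (toleration.get("key") == taint["key"]
            and toleration.get("value", "") == taint.get("value", ""))


def can_schedule(tolerations: list, taints: list) -> bool:
    """A pod fits a node only if every NoSchedule taint is tolerated."""
    return all(
        any(tolerates(tol, taint) for tol in tolerations)
        for taint in taints
        if taint["effect"] == "NoSchedule"
    )


gpu_taint = {"key": "dedicated", "value": "ml-workloads", "effect": "NoSchedule"}
ml_toleration = {"key": "dedicated", "operator": "Equal",
                 "value": "ml-workloads", "effect": "NoSchedule"}
plugin_toleration = {"key": "nvidia.com/gpu", "operator": "Exists",
                     "effect": "NoSchedule"}

print(can_schedule([ml_toleration], [gpu_taint]))      # True: training job fits
print(can_schedule([], [gpu_taint]))                   # False: CPU pod is kept off
print(can_schedule([plugin_toleration], [gpu_taint]))  # False: key does not match
```

The third case is worth noticing: a toleration for the key `nvidia.com/gpu` does not match the `dedicated=ml-workloads` taint, so any DaemonSet that must run on these nodes needs a toleration for the `dedicated` key as well.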
Training Workloads: Jobs and StatefulSets
Simple Training: Kubernetes Job
For single-machine training jobs:
apiVersion: batch/v1
kind: Job
metadata:
  name: bert-finetuning-2024-03-15
  labels:
    team: nlp
    experiment: bert-domain-adaptation
spec:
  backoffLimit: 2 # retry up to 2 times on failure
  completions: 1
  template:
    metadata:
      labels:
        job-name: bert-finetuning-2024-03-15
    spec:
      restartPolicy: OnFailure
      tolerations:
        - key: "dedicated"
          value: "ml-workloads"
          effect: "NoSchedule"
      containers:
        - name: trainer
          image: myregistry/bert-trainer:v2.1.0
          args:
            - "--model-name=bert-base-uncased"
            - "--num-epochs=10"
            - "--learning-rate=2e-5"
            - "--output-dir=/checkpoints/bert-domain-v1"
          resources:
            requests:
              nvidia.com/gpu: "1"
              memory: "24Gi"
              cpu: "4"
            limits:
              nvidia.com/gpu: "1"
              memory: "32Gi"
              cpu: "8"
          volumeMounts:
            - name: checkpoints
              mountPath: /checkpoints
            - name: training-data
              mountPath: /data
              readOnly: true
          env:
            - name: MLFLOW_TRACKING_URI
              valueFrom:
                secretKeyRef:
                  name: mlflow-credentials
                  key: tracking_uri
      volumes:
        - name: checkpoints
          persistentVolumeClaim:
            claimName: ml-checkpoints-pvc
        - name: training-data
          persistentVolumeClaim:
            claimName: training-data-pvc
Distributed Training: PyTorch Operator
For distributed training across multiple GPU nodes, use the PyTorchJob resource provided by the Kubeflow Training Operator:
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: llm-distributed-training
  namespace: ml-training
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          tolerations:
            - key: dedicated
              value: ml-workloads
              effect: NoSchedule
          containers:
            - name: pytorch
              image: myregistry/llm-trainer:v1.0
              # Assumes the image ENTRYPOINT is torchrun
              args:
                - "--nproc_per_node=4"
                - "train.py"
                - "--model-size=7b"
              resources:
                limits:
                  nvidia.com/gpu: "4"
                  memory: "256Gi"
                  cpu: "32"
    Worker:
      replicas: 3 # 3 workers × 4 GPUs = 12 worker GPUs (16 GPUs including the master)
      restartPolicy: OnFailure
      template:
        spec:
          tolerations:
            - key: dedicated
              value: ml-workloads
              effect: NoSchedule
          containers:
            - name: pytorch
              image: myregistry/llm-trainer:v1.0
              # Workers run the same training command as the master
              args:
                - "--nproc_per_node=4"
                - "train.py"
                - "--model-size=7b"
              resources:
                limits:
                  nvidia.com/gpu: "4"
                  memory: "256Gi"
                  cpu: "32"
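It helps to check the process layout such a job produces before submitting it. The sketch below assumes every replica (master and workers alike) launches `nproc_per_node` processes, one per GPU, and uses the usual homogeneous-cluster rank arithmetic; the helper names `world_layout` and `global_rank` are hypothetical.

```python
def world_layout(master_replicas: int, worker_replicas: int,
                 nproc_per_node: int) -> dict:
    """Total pods and total training processes (the world size)."""
    nodes = master_replicas + worker_replicas
    return {"nodes": nodes, "world_size": nodes * nproc_per_node}


def global_rank(node_rank: int, local_rank: int, nproc_per_node: int) -> int:
    """Global rank of a process, given its pod's rank and its local GPU index."""
    return node_rank * nproc_per_node + local_rank


layout = world_layout(master_replicas=1, worker_replicas=3, nproc_per_node=4)
print(layout)  # {'nodes': 4, 'world_size': 16}

# Last GPU on the last worker pod (node rank 3, local rank 3)
print(global_rank(node_rank=3, local_rank=3, nproc_per_node=4))  # 15
```

With 1 master and 3 workers at 4 GPUs each, the job occupies 16 GPUs in total, which is the number to compare against free cluster capacity.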
Serving Workloads: Deployments and HPA
Model Serving Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: recommendation-model-v3
  namespace: ml-serving
spec:
  replicas: 3
  selector:
    matchLabels:
      app: recommendation-model
      version: v3
  template:
    metadata:
      labels:
        app: recommendation-model
        version: v3
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"
    spec:
      containers:
        - name: model-server
          image: myregistry/recommendation-server:v3.2.1
          ports:
            - containerPort: 8080
            - containerPort: 9090 # gRPC
          resources:
            requests:
              nvidia.com/gpu: "1"
              memory: "16Gi"
              cpu: "4"
            limits:
              nvidia.com/gpu: "1"
              memory: "24Gi"
              cpu: "8"
          env:
            - name: MODEL_URI
              value: "models:/recommendation-model/Production"
            - name: MLFLOW_TRACKING_URI
              valueFrom:
                secretKeyRef:
                  name: mlflow-credentials
                  key: tracking_uri
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 60 # Wait for model to load
            periodSeconds: 10
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 120
            periodSeconds: 30
Horizontal Pod Autoscaler for ML Serving
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: recommendation-model-hpa
  namespace: ml-serving
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: recommendation-model-v3
  minReplicas: 2 # Never scale to zero (cold start too slow)
  maxReplicas: 20
  metrics:
    # Scale on GPU utilization (DCGM metric from NVIDIA GPU Operator).
    # External metrics require a metrics adapter (for example Prometheus
    # Adapter) that serves them through the external metrics API.
    - type: External
      external:
        metric:
          name: dcgm_fi_dev_gpu_util
          selector:
            matchLabels:
              app: recommendation-model
        target:
          type: AverageValue
          averageValue: "70" # Target 70% GPU utilization
    # Also scale on request queue depth
    - type: External
      external:
        metric:
          name: inference_request_queue_depth
        target:
          type: AverageValue
          averageValue: "10" # Scale when >10 requests queued per pod
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Pods
          value: 2
          periodSeconds: 30 # Add up to 2 pods every 30 seconds
    scaleDown:
      stabilizationWindowSeconds: 300 # Cautious scale-down
      policies:
        - type: Pods
          value: 1
          periodSeconds: 120
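To reason about how this HPA reacts to load, the documented scaling rule is `desired = ceil(current × currentMetric / targetMetric)`, skipped when the ratio is within a tolerance (10% by default). The sketch below is a simplified model under those assumptions: it applies the scale-up cap of 2 pods per step from the manifest above, but ignores stabilization windows, the scale-down policy, and the combination of multiple metrics. The function name `desired_replicas` is hypothetical.

```python
import math


def desired_replicas(current: int, metric_value: float, target_value: float,
                     min_replicas: int = 2, max_replicas: int = 20,
                     tolerance: float = 0.1, max_step_up: int = 2) -> int:
    """Simplified HPA decision for one metric and one evaluation step."""
    ratio = metric_value / target_value
    # Within tolerance of the target: leave the replica count alone
    if abs(ratio - 1.0) <= tolerance:
        return current
    desired = math.ceil(current * ratio)
    # Scale-up policy: add at most `max_step_up` pods per period
    if desired > current:
        desired = min(desired, current + max_step_up)
    return max(min_replicas, min(max_replicas, desired))


print(desired_replicas(3, metric_value=90, target_value=70))   # 4
print(desired_replicas(3, metric_value=140, target_value=70))  # 5 (wants 6, capped at +2)
print(desired_replicas(3, metric_value=72, target_value=70))   # 3 (within tolerance)
print(desired_replicas(3, metric_value=10, target_value=70))   # 2 (floor at minReplicas)
```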
Kubeflow Pipelines vs Argo Workflows
Both orchestrate ML pipelines on Kubernetes. Key differences:
| Feature | Kubeflow Pipelines | Argo Workflows |
|---|---|---|
| ML-specific features | Built-in (metadata tracking, lineage) | General-purpose |
| SDK | Python SDK, UI-based | YAML manifests |
| Complexity | Higher (requires full Kubeflow install) | Lower (Argo only) |
| Ecosystem | Rich (TFX, Katib, PyTorch Operator) | General |
| Best for | End-to-end ML platforms | Simple pipeline orchestration |
# Kubeflow Pipeline definition (KFP v1 SDK)
from kfp import dsl
from kfp.components import create_component_from_func


@create_component_from_func
def preprocess_data(
    input_path: str,
    output_path: str,
    validation_split: float = 0.1,
) -> str:
    """Preprocessing step - runs in its own container."""
    import pandas as pd
    from sklearn.model_selection import train_test_split

    data = pd.read_parquet(input_path)
    train, val = train_test_split(data, test_size=validation_split)
    train.to_parquet(f"{output_path}/train.parquet")
    val.to_parquet(f"{output_path}/val.parquet")
    # Return the path so downstream steps can consume it via .output
    return output_path


@create_component_from_func
def train_model(
    data_path: str,
    model_output_path: str,
    learning_rate: float = 2e-5,
    num_epochs: int = 10,
) -> str:
    """Training step - runs on GPU."""
    # Training code here
    return model_output_path


@create_component_from_func
def validate_and_register(
    model_path: str,
    validation_data_path: str,
    min_accuracy: float = 0.80,
) -> bool:
    """Validation gate - reject model if below threshold."""
    # Load model and compute accuracy (evaluation code elided)
    accuracy = 0.0  # placeholder until the evaluation code is filled in
    # Register to MLflow if passes
    return accuracy >= min_accuracy


@dsl.pipeline(
    name="recommendation-model-pipeline",
    description="End-to-end training pipeline with quality gates",
)
def ml_pipeline(
    raw_data_path: str,
    model_name: str = "recommendation-model",
):
    # NOTE: each step runs in its own container, so in a real pipeline these
    # paths must point at shared storage (a mounted volume or object store).
    preprocess_task = preprocess_data(
        input_path=raw_data_path,
        output_path="/tmp/processed",
    )
    train_task = train_model(
        data_path=preprocess_task.output,
        model_output_path="/tmp/models",
    ).set_gpu_limit(1).set_memory_limit("32G")
    validate_task = validate_and_register(
        model_path=train_task.output,
        validation_data_path=preprocess_task.output,
    )
Resource Requests and Limits for ML
Getting resource requests right is critical for efficient cluster utilization:
import math


def estimate_gpu_memory_requirements(
    model_params: int,
    precision: str = "bfloat16",
    batch_size: int = 1,
    sequence_length: int = 512,
    gradient_checkpointing: bool = False,
) -> dict:
    """
    Estimate GPU memory requirements for a transformer model.
    Used to set Kubernetes resource requests.

    Rough heuristic only: gradients and optimizer state are not included,
    so validate against measured memory before relying on it.
    """
    bytes_per_param = {"float32": 4, "bfloat16": 2, "int8": 1, "int4": 0.5}[precision]

    # Model weights
    model_memory_gb = model_params * bytes_per_param / 1e9

    # KV cache for attention (2 × layers × heads × head_dim × seq_len × batch)
    # Simplified heuristic: 25% of weight memory per sequence at 512 tokens,
    # scaled linearly with batch size and sequence length
    kv_cache_gb = model_memory_gb * 0.25 * batch_size * (sequence_length / 512)

    # Activations (with gradient checkpointing, ~30% of without)
    activation_multiplier = 0.3 if gradient_checkpointing else 1.0
    activation_gb = model_memory_gb * 0.8 * batch_size * activation_multiplier

    total_gb = model_memory_gb + kv_cache_gb + activation_gb

    # Add 20% overhead for CUDA context, PyTorch allocator, etc.
    required_gpu_memory_gb = total_gb * 1.20

    return {
        "model_weights_gb": model_memory_gb,
        "kv_cache_gb": kv_cache_gb,
        "activations_gb": activation_gb,
        "total_required_gb": required_gpu_memory_gb,
        "recommended_gpu": _select_gpu(required_gpu_memory_gb),
    }


def _select_gpu(required_gb: float) -> str:
    if required_gb <= 16:
        return "nvidia-a10g (24GB)"  # g5.xlarge
    elif required_gb <= 40:
        return "nvidia-a100-40gb"
    elif required_gb <= 80:
        return "nvidia-a100-80gb"
    else:
        # Round up so that, for example, exactly 160 GB maps to 2 GPUs, not 3
        return f"Multiple GPUs required: {math.ceil(required_gb / 80)}× A100-80GB"
Production Engineering Notes
Pod Disruption Budgets for Serving
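Ensure voluntary disruptions such as node drains and cluster upgrades don't evict too many serving pods at once. (Rolling updates of a Deployment are governed by its own maxUnavailable and maxSurge settings, not by the PDB.)
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: recommendation-model-pdb
  namespace: ml-serving # Must match the namespace of the pods it protects
spec:
  minAvailable: 2 # Keep at least 2 pods running during node drains and other evictions
  selector:
    matchLabels:
      app: recommendation-model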
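The arithmetic behind a `minAvailable` budget is simple but has one consequence worth spelling out. The sketch below uses a hypothetical helper, `allowed_disruptions`, that mirrors how the budget is evaluated: evictions are allowed only while the number of healthy pods stays at or above the minimum.

```python
def allowed_disruptions(healthy_pods: int, min_available: int) -> int:
    """How many pods may be evicted right now without violating the budget."""
    return max(0, healthy_pods - min_available)


print(allowed_disruptions(healthy_pods=3, min_available=2))  # 1
print(allowed_disruptions(healthy_pods=2, min_available=2))  # 0
```

With `minAvailable: 2` and an autoscaler floor of 2 replicas, the budget permits zero evictions whenever the deployment is scaled down to its minimum, so a node drain will block until more replicas exist. Either raise the replica floor or lower `minAvailable` if drains must always make progress.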
Init Containers for Model Warming
Use init containers to pre-download model artifacts before the serving container starts - avoiding slow first-request latency:
initContainers:
  - name: model-downloader
    image: amazon/aws-cli:latest
    command:
      - aws
      - s3
      - cp
      - s3://ml-models/recommendation/v3/model.pt
      - /model-cache/model.pt
    volumeMounts:
      - name: model-cache
        mountPath: /model-cache
Common Mistakes
:::danger Setting GPU limit different from GPU request
For Kubernetes GPU resources (nvidia.com/gpu), requests and limits must be equal. Unlike CPU and memory, extended resources such as GPUs cannot be overcommitted: a pod is allocated whole devices. You may specify only the limit (Kubernetes then uses it as the request), but a request of 1 GPU with a limit of 2 GPUs fails API validation and the pod is rejected.
:::
:::warning Probes that fire before the model has finished loading
Model loading takes 30–90 seconds. A failing readiness probe only keeps the pod out of the Service endpoints, which is the correct behavior while the model loads. A failing liveness probe is different: if its initialDelaySeconds is too short, the kubelet restarts the container before it ever finishes loading, and the pod loops. Set the liveness initialDelaySeconds (or add a startupProbe) to cover at least the expected model loading time plus a 20% buffer.
:::
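A quick way to sanity-check probe settings against model load time is to estimate when the kubelet would first restart a container that never reports healthy. The sketch below is an approximation that ignores probe timeouts and scheduling jitter: the first probe runs after `initialDelaySeconds`, then once per `periodSeconds`, and the restart happens on the `failureThreshold`-th consecutive failure. Both function names are hypothetical.

```python
def earliest_restart_seconds(initial_delay: int, period: int,
                             failure_threshold: int = 3) -> int:
    """Approximate time of the failure that triggers a restart."""
    return initial_delay + (failure_threshold - 1) * period


def survives_startup(load_seconds: int, initial_delay: int, period: int,
                     failure_threshold: int = 3) -> bool:
    """True if the model finishes loading before the liveness probe gives up."""
    return load_seconds < earliest_restart_seconds(
        initial_delay, period, failure_threshold
    )


# Liveness settings from the Deployment above: 120 s delay, 30 s period
print(earliest_restart_seconds(120, 30))                      # 180
print(survives_startup(90, initial_delay=120, period=30))     # True
# An overly aggressive probe: 10 s delay, 10 s period
print(survives_startup(90, initial_delay=10, period=10))      # False
```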
:::danger Running GPU nodes without node taints
Without taints on GPU nodes, any pod can schedule there - including CPU-only pods that have no GPU requirements. CPU pods on GPU nodes waste expensive GPU instances without using the GPU. Always taint GPU nodes with dedicated=ml-workloads:NoSchedule and require ML workloads to explicitly tolerate it.
:::
Interview Q&A
Q: How does Kubernetes GPU scheduling work?
A: GPUs are exposed to Kubernetes as extended resources via the NVIDIA device plugin, which runs as a DaemonSet on every GPU node and advertises the node's GPU capacity as nvidia.com/gpu. Pods request GPUs like any other resource: nvidia.com/gpu: 2 in resource requests and limits. The scheduler finds a node with 2 unallocated GPUs and places the pod there; the kubelet then asks the device plugin which devices to hand to the container. Important: GPU requests and limits must be equal (GPUs aren't overcommitted by default), and allocation is all-or-nothing: if no node has enough free GPUs, the pod stays Pending. Isolation comes from the NVIDIA container runtime, which exposes only the allocated devices to the container (by default the plugin communicates them through the NVIDIA_VISIBLE_DEVICES environment variable), so the application sees them as devices 0 to N-1.
Q: What is the difference between Kubeflow Pipelines and Argo Workflows?
A: Argo Workflows is a general-purpose workflow orchestration engine for Kubernetes - it runs any sequence of containers with dependencies between them. Kubeflow Pipelines is built on top of Argo and adds ML-specific features: a Python SDK for defining pipelines as code, built-in metadata tracking (what data went into each step, what artifacts it produced), integration with the rest of the Kubeflow ecosystem, and a UI designed for ML practitioners rather than Kubernetes operators. I'd recommend Argo Workflows when you need simple pipeline orchestration and want minimal operational overhead. I'd recommend Kubeflow Pipelines when you want end-to-end ML platform integration and metadata lineage, and when your team is primarily Python-based (the Python SDK is much more ergonomic than writing YAML DAGs).
Q: How do you right-size GPU resource requests for ML serving?
A: Start by benchmarking: load the model and measure actual GPU memory consumption under expected batch sizes. Use nvidia-smi or torch.cuda.max_memory_allocated() to measure peak GPU memory, then add a 20–25% buffer above that peak. For serving (not training), you typically don't need memory for gradients or optimizer state - just model weights, KV cache, and a small amount of activation memory. A 7B parameter model in BF16 is 14 GB of weights; with KV cache at typical batch size, plan for 20–24 GB, which fits comfortably on an A100-40GB. Note that Kubernetes does not schedule on GPU memory: nvidia.com/gpu counts whole devices, so the measured peak decides which GPU type to target (via node labels and selectors), not a request or limit value. Set the pod's CPU and host-memory requests from the measured working set, with limits somewhat higher if you want to allow bursting. Always ensure requests == limits for the nvidia.com/gpu count itself - GPU sharing requires explicit configuration (MIG or time-slicing) and isn't available by default.
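The weight-size figure in this answer is plain arithmetic, sketched below so it can be rerun for other model sizes. The helper names `weights_gb` and `with_buffer` are hypothetical, and the 16 GB peak in the example is an illustrative number, not a measurement.

```python
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "int8": 1}


def weights_gb(num_params: int, precision: str = "bfloat16") -> float:
    """Memory taken by the model weights alone, in GB (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


def with_buffer(peak_gb: float, buffer: float = 0.25) -> float:
    """Apply a safety margin on top of a measured peak."""
    return peak_gb * (1 + buffer)


print(weights_gb(7_000_000_000))             # 14.0
print(weights_gb(7_000_000_000, "float32"))  # 28.0
# Illustrative measured peak of 16 GB, plus a 25% buffer
print(with_buffer(16.0))                     # 20.0
```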
Q: How would you migrate ML workloads from EC2 instances to Kubernetes?
A: An eight-step migration:
1. Containerize: wrap each ML workload in a Docker image with all dependencies, and test locally.
2. Create Kubernetes manifests: define Job specs for training and Deployment specs for serving, starting with resource requests equivalent to the EC2 instances.
3. Set up GPU nodes: create a GPU node pool with taints and labels, install the NVIDIA device plugin, and verify GPU visibility.
4. Migrate serving first: serving workloads are more predictable. Deploy one model, verify predictions match the EC2 version, then cut traffic over.
5. Migrate training: start with non-critical experiments, and validate that checkpoints are saved correctly to shared storage.
6. Implement autoscaling: a GPU node pool autoscaler, plus HPA on serving deployments.
7. Clean up EC2 instances: after 2 weeks of successful K8s operation, terminate the corresponding EC2 instances.
8. Optimize: measure GPU utilization, and use this data to right-size resource requests and identify opportunities for bin-packing.
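The "verify predictions match" check in step four can be as simple as replaying the same inputs against both deployments and comparing outputs within a tolerance, since GPU kernels and library versions can introduce tiny floating-point differences. This is a minimal sketch; the function name `predictions_match` and the tolerance values are assumptions to tune per model.

```python
import math


def predictions_match(old: list, new: list,
                      rel_tol: float = 1e-4, abs_tol: float = 1e-6) -> bool:
    """Compare two lists of scores element-wise within a tolerance."""
    if len(old) != len(new):
        return False
    return all(
        math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)
        for a, b in zip(old, new)
    )


ec2_scores = [0.1, 0.9, 0.42]
k8s_scores = [0.1000001, 0.9, 0.42]  # tiny numerical drift

print(predictions_match(ec2_scores, k8s_scores))             # True
print(predictions_match(ec2_scores, [0.2, 0.9, 0.42]))       # False
```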
Q: What is the role of PersistentVolumeClaims in ML workloads on Kubernetes?
A: PVCs provide persistent storage that outlives individual pods - essential for ML because model checkpoints and training data must survive pod restarts and rescheduling. Three main uses: (1) shared training data: a ReadOnlyMany PVC backed by EFS (or similar NFS) allows multiple training pods to read the same dataset simultaneously without data transfer overhead; (2) checkpoint storage: a ReadWriteOnce PVC for a training job to write checkpoints - when the job is interrupted and rescheduled, it can resume from the checkpoint; (3) model artifacts: serving pods need to load model weights at startup, either from a PVC (faster, if on fast storage) or from S3 (slower but simpler). A common pattern: store large datasets on S3, cache to local node storage via an init container before training starts, and write checkpoints to both local PVC (fast) and S3 (durable).
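Resuming from a checkpoint after rescheduling usually starts with finding the newest checkpoint on the mounted volume. The sketch below assumes a hypothetical naming scheme of `checkpoint-<step>.pt` under the PVC mount path; it compares step numbers numerically, because a plain string sort would rank `checkpoint-900.pt` above `checkpoint-1000.pt`.

```python
import re
from pathlib import Path

# Hypothetical naming scheme: checkpoint-<step>.pt
CHECKPOINT_PATTERN = re.compile(r"^checkpoint-(\d+)\.pt$")


def latest_checkpoint(checkpoint_dir: str):
    """Return (step, path) of the newest checkpoint, or None if there is none."""
    best = None
    for path in Path(checkpoint_dir).glob("checkpoint-*.pt"):
        match = CHECKPOINT_PATTERN.match(path.name)
        if not match:
            continue  # skip files that do not follow the naming scheme
        step = int(match.group(1))
        if best is None or step > best[0]:
            best = (step, path)
    return best
```

A training script would call `latest_checkpoint("/checkpoints/...")` at startup, load that file if one is returned, and otherwise start from step 0.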
