Kubernetes for ML

The VM-to-Kubernetes Migration

The ML team had been running on EC2 VMs for three years. It worked - sort of. Training jobs ran on manually provisioned p3.8xlarge instances. Serving ran on g4dn.xlarge instances, one model per instance. When a model needed more resources, an engineer filed a Jira ticket to provision a new instance, waited 2 business days for approval, and then spent half a day configuring it.

The ML cluster had 47 separate EC2 instances. 12 of them were under 20% GPU utilization. 3 of them had been running for 8 months with no job scheduled - someone's training environment that never got cleaned up. A single model that needed 4 GPUs for a short training run required a dedicated instance that ran 24/7, because the alternative was waiting days for a new one.

The main motivation for migrating to Kubernetes was better GPU sharing and utilization. The goal: multiple training jobs sharing a GPU cluster, each getting the resources it needs when it needs them and releasing them when done. The result: average GPU utilization rose from 38% to 71%, infrastructure cost dropped by 40%, and provisioning time for new training jobs dropped from 2 days to 30 seconds.

The migration took 8 weeks. This lesson documents the key decisions and implementation details.

Why Kubernetes for ML

Kubernetes was designed for microservices - stateless, CPU-bound workloads that scale horizontally. ML workloads are stateful, GPU-bound, and often need large chunks of shared memory. The fit isn't perfect, but Kubernetes offers four properties that make it worth the operational complexity:

1. Resource-based scheduling: Kubernetes schedules pods based on explicit resource requests - nvidia.com/gpu: 2, memory: 64Gi. This enables multiple workloads to share a cluster, with the scheduler ensuring no overallocation.

2. Declarative configuration: Every workload is a YAML manifest, version-controlled and reproducible. Compare this to SSH-ing into a VM and running commands manually.

3. Ecosystem: Kubeflow, Argo Workflows, Seldon, KServe, Ray - the ML tooling ecosystem is now primarily Kubernetes-native.

4. Node pools: Different node types (CPU-only, 1-GPU, 8-GPU, high-memory) can coexist in the same cluster, with workloads steered to the appropriate node type through resource requests, node selectors, and taints.
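
To make resource-based scheduling concrete, here is a toy sketch of the filtering idea: pick a node whose unallocated GPU count covers the request. The function name and the free-GPU map are illustrative assumptions. The real scheduler filters on many more resources and constraints and then scores the candidates.

```python
from typing import Dict, Optional


def first_fit_node(free_gpus: Dict[str, int], requested: int) -> Optional[str]:
    """Return the first node with enough unallocated GPUs, or None if none fits."""
    for node, free in free_gpus.items():
        if free >= requested:
            return node
    return None  # in a real cluster the pod would stay Pending


# Hypothetical cluster state: node name -> GPUs not yet allocated to any pod
cluster = {"gpu-node-1": 1, "gpu-node-2": 4}
print(first_fit_node(cluster, 2))
print(first_fit_node(cluster, 8))
```

A request that no single node can satisfy is never split across nodes, which is why multi-node training needs an operator rather than one large pod.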

GPU Scheduling in Kubernetes

NVIDIA Device Plugin

GPUs are not natively visible to Kubernetes - the NVIDIA device plugin bridges the gap. It runs as a DaemonSet on every GPU node and advertises GPUs as schedulable resources:

# Install NVIDIA device plugin (run once per cluster)
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
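      # Tolerations let the plugin schedule onto tainted GPU nodes.
      # They do not restrict it to GPU nodes; add a nodeSelector for that.
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
        - key: dedicated          # matches the dedicated=ml-workloads taint applied to GPU nodes
          operator: Exists
          effect: NoSchedule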
      containers:
        - name: nvidia-device-plugin-ctr
          image: nvcr.io/nvidia/k8s-device-plugin:v0.14.3
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins

After the plugin is installed, GPUs appear as nvidia.com/gpu in node capacity:

kubectl describe node gpu-node-1
# Capacity:
#   nvidia.com/gpu:  4
# Allocatable:
#   nvidia.com/gpu:  4
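
The same information can be read programmatically. The sketch below assumes you have the output of `kubectl get nodes -o json` available as a parsed dictionary; the helper name and the sample payload are illustrative, and only the fields used here are shown.

```python
import json
from typing import Dict


def allocatable_gpus(node_list: dict) -> Dict[str, int]:
    """Map node name -> allocatable nvidia.com/gpu count (0 if the node has none)."""
    result = {}
    for node in node_list.get("items", []):
        name = node["metadata"]["name"]
        allocatable = node.get("status", {}).get("allocatable", {})
        # Resource quantities are serialized as strings, e.g. "4"
        result[name] = int(allocatable.get("nvidia.com/gpu", "0"))
    return result


# Trimmed, made-up sample in the shape of `kubectl get nodes -o json`
sample = json.loads("""
{"items": [
  {"metadata": {"name": "gpu-node-1"},
   "status": {"allocatable": {"cpu": "32", "nvidia.com/gpu": "4"}}},
  {"metadata": {"name": "cpu-node-1"},
   "status": {"allocatable": {"cpu": "16"}}}
]}
""")
print(allocatable_gpus(sample))
```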

Node Taints for GPU Isolation

Prevent CPU workloads from accidentally scheduling onto GPU nodes (which are expensive and should be reserved for ML):

# Taint GPU nodes - only ML workloads that tolerate this can schedule here
kubectl taint nodes gpu-node-1 gpu-node-2 \
  dedicated=ml-workloads:NoSchedule

# Label GPU nodes by type for targeting
kubectl label nodes gpu-node-1 gpu-node-2 \
  accelerator=nvidia-a100 \
  gpu-count=4

# Training job that targets A100 nodes
apiVersion: batch/v1
kind: Job
metadata:
  name: model-training-run
spec:
  template:
    spec:
      tolerations:
        - key: "dedicated"
          operator: "Equal"
          value: "ml-workloads"
          effect: "NoSchedule"
      nodeSelector:
        accelerator: nvidia-a100     # Schedule only on A100 nodes
      containers:
        - name: trainer
          image: myregistry/ml-trainer:v1.0
          resources:
            requests:
              memory: "32Gi"
              cpu: "8"
              nvidia.com/gpu: "2"    # Request 2 GPUs
            limits:
              memory: "64Gi"
              cpu: "16"
              nvidia.com/gpu: "2"    # Limit must equal request for GPUs
          # CUDA_VISIBLE_DEVICES does not need to be set here: the device plugin and
          # NVIDIA container runtime expose only the allocated GPUs to the container.
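
The matching rule between a taint and a toleration is worth seeing spelled out. The following is a simplified sketch of the documented semantics (the `Equal` and `Exists` operators, with an empty effect matching any effect); the function name and the dictionary shapes are assumptions for illustration, not the scheduler's actual code.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True if the toleration matches the taint."""
    # A toleration with no effect matches taints of every effect
    effect = toleration.get("effect", "")
    if effect and effect != taint["effect"]:
        return False

    key = toleration.get("key", "")
    operator = toleration.get("operator", "Equal")  # Equal is the default
    if operator == "Exists":
        # An empty key with Exists tolerates every taint key
        return key == "" or key == taint["key"]
    # Equal: both key and value must match
    return key == taint["key"] and toleration.get("value", "") == taint.get("value", "")


gpu_taint = {"key": "dedicated", "value": "ml-workloads", "effect": "NoSchedule"}
job_toleration = {"key": "dedicated", "operator": "Equal",
                  "value": "ml-workloads", "effect": "NoSchedule"}
other_toleration = {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}

print(tolerates(job_toleration, gpu_taint))
print(tolerates(other_toleration, gpu_taint))
```

A toleration for a different key does not match, so every workload that must land on these nodes, including DaemonSets, needs a toleration for the `dedicated` key.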

Training Workloads: Jobs and StatefulSets

Simple Training: Kubernetes Job

For single-machine training jobs:

apiVersion: batch/v1
kind: Job
metadata:
  name: bert-finetuning-2024-03-15
  labels:
    team: nlp
    experiment: bert-domain-adaptation
    owner: nlp-team   # placeholder: label values cannot contain "@", so keep contact emails in annotations
spec:
  backoffLimit: 2            # retry up to 2 times on failure
  completions: 1
  template:
    metadata:
      labels:
        job-name: bert-finetuning-2024-03-15
    spec:
      restartPolicy: OnFailure
      tolerations:
        - key: "dedicated"
          value: "ml-workloads"
          effect: "NoSchedule"
      containers:
        - name: trainer
          image: myregistry/bert-trainer:v2.1.0
          args:
            - "--model-name=bert-base-uncased"
            - "--num-epochs=10"
            - "--learning-rate=2e-5"
            - "--output-dir=/checkpoints/bert-domain-v1"
          resources:
            requests:
              nvidia.com/gpu: "1"
              memory: "24Gi"
              cpu: "4"
            limits:
              nvidia.com/gpu: "1"
              memory: "32Gi"
              cpu: "8"
          volumeMounts:
            - name: checkpoints
              mountPath: /checkpoints
            - name: training-data
              mountPath: /data
              readOnly: true
          env:
            - name: MLFLOW_TRACKING_URI
              valueFrom:
                secretKeyRef:
                  name: mlflow-credentials
                  key: tracking_uri
      volumes:
        - name: checkpoints
          persistentVolumeClaim:
            claimName: ml-checkpoints-pvc
        - name: training-data
          persistentVolumeClaim:
            claimName: training-data-pvc

Distributed Training: PyTorch Operator

For multi-node distributed training, use the PyTorchJob resource from the Kubeflow Training Operator:

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: llm-distributed-training
  namespace: ml-training
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          tolerations:
            - key: dedicated
              value: ml-workloads
              effect: NoSchedule
          containers:
            - name: pytorch
              image: myregistry/llm-trainer:v1.0
              # Assumes the image ENTRYPOINT is torchrun, which receives these args
              args:
                - "--nproc_per_node=4"
                - "train.py"
                - "--model-size=7b"
              resources:
                limits:
                  nvidia.com/gpu: "4"
                  memory: "256Gi"
                  cpu: "32"

    Worker:
      replicas: 3           # 3 workers × 4 GPUs, plus the master's 4 GPUs = 16 GPUs total
      restartPolicy: OnFailure
      template:
        spec:
          tolerations:
            - key: dedicated
              value: ml-workloads
              effect: NoSchedule
          containers:
            - name: pytorch
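              image: myregistry/llm-trainer:v1.0
              args:                 # Workers need the same launch arguments as the master
                - "--nproc_per_node=4"
                - "train.py"
                - "--model-size=7b"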
              resources:
                limits:
                  nvidia.com/gpu: "4"
                  memory: "256Gi"
                  cpu: "32"
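
The arithmetic behind the replica counts can be sketched as follows. It assumes the usual launcher convention that the master pod is node rank 0 and also trains, and that global rank is `node_rank × nproc_per_node + local_rank`; the helper names are illustrative.

```python
def world_size(workers: int, nproc_per_node: int, masters: int = 1) -> int:
    """Total number of training processes (one per GPU) across all pods."""
    return (masters + workers) * nproc_per_node


def global_rank(node_rank: int, local_rank: int, nproc_per_node: int) -> int:
    """Global rank of a process, given its pod's node rank and its local GPU index."""
    return node_rank * nproc_per_node + local_rank


# 1 master + 3 workers, 4 processes (GPUs) per pod
print(world_size(workers=3, nproc_per_node=4))
# Last GPU on the last worker (node rank 3, local rank 3)
print(global_rank(node_rank=3, local_rank=3, nproc_per_node=4))
```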

Serving Workloads: Deployments and HPA

Model Serving Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: recommendation-model-v3
  namespace: ml-serving
spec:
  replicas: 3
  selector:
    matchLabels:
      app: recommendation-model
      version: v3
  template:
    metadata:
      labels:
        app: recommendation-model
        version: v3
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"
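    spec:
      # GPU nodes are tainted, so serving pods need the same toleration as training jobs
      tolerations:
        - key: "dedicated"
          operator: "Equal"
          value: "ml-workloads"
          effect: "NoSchedule"
      containers: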
        - name: model-server
          image: myregistry/recommendation-server:v3.2.1
          ports:
            - containerPort: 8080
            - containerPort: 9090   # gRPC
          resources:
            requests:
              nvidia.com/gpu: "1"
              memory: "16Gi"
              cpu: "4"
            limits:
              nvidia.com/gpu: "1"
              memory: "24Gi"
              cpu: "8"
          env:
            - name: MODEL_URI
              value: "models:/recommendation-model/Production"
            - name: MLFLOW_TRACKING_URI
              valueFrom:
                secretKeyRef:
                  name: mlflow-credentials
                  key: tracking_uri
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 60    # Wait for model to load
            periodSeconds: 10
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 120
            periodSeconds: 30

Horizontal Pod Autoscaler for ML Serving

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: recommendation-model-hpa
  namespace: ml-serving
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: recommendation-model-v3
  minReplicas: 2             # Never scale to zero (cold start too slow)
  maxReplicas: 20
  metrics:
    # Scale on GPU utilization (DCGM exporter metric; requires a metrics adapter
    # such as prometheus-adapter to serve it through the external metrics API)
    - type: External
      external:
        metric:
          name: dcgm_fi_dev_gpu_util
          selector:
            matchLabels:
              app: recommendation-model
        target:
          type: AverageValue
          averageValue: "70"    # Target 70% GPU utilization

    # Also scale on request queue depth
    - type: External
      external:
        metric:
          name: inference_request_queue_depth
        target:
          type: AverageValue
          averageValue: "10"    # Scale when >10 requests queued per pod

  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Pods
          value: 2
          periodSeconds: 30     # Add up to 2 pods every 30 seconds
    scaleDown:
      stabilizationWindowSeconds: 300   # Cautious scale-down
      policies:
        - type: Pods
          value: 1
          periodSeconds: 120
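
The replica count the HPA aims for follows the ratio rule documented for the autoscaler: desired = ceil(current × currentMetric / target), skipped when the ratio is within a tolerance (10% by default) and clamped to minReplicas and maxReplicas. A minimal sketch, assuming a single metric and ignoring the behavior policies, which further rate-limit each step:

```python
import math


def desired_replicas(
    current_replicas: int,
    current_value: float,
    target_value: float,
    min_replicas: int,
    max_replicas: int,
    tolerance: float = 0.1,
) -> int:
    """Compute the HPA's target replica count for one metric."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas  # close enough to target: no change
    else:
        desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# 4 replicas averaging 90% GPU utilization against a 70% target
print(desired_replicas(4, 90, 70, min_replicas=2, max_replicas=20))
# Light load: the result is clamped to minReplicas
print(desired_replicas(4, 10, 70, min_replicas=2, max_replicas=20))
```

With several metrics configured, the HPA evaluates each one separately and uses the largest result.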

Kubeflow Pipelines vs Argo Workflows

Both orchestrate ML pipelines on Kubernetes. Key differences:

| Feature | Kubeflow Pipelines | Argo Workflows |
| --- | --- | --- |
| ML-specific features | Built-in (metadata tracking, lineage) | General-purpose |
| SDK | Python SDK, UI-based | YAML manifests |
| Complexity | Higher (requires full Kubeflow install) | Lower (Argo only) |
| Ecosystem | Rich (TFX, Katib, PyTorch Operator) | General |
| Best for | End-to-end ML platforms | Simple pipeline orchestration |

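# Kubeflow Pipeline definition (KFP SDK v1 API; SDK v2 uses @dsl.component instead)
from functools import partial

from kfp import dsl
from kfp.components import create_component_from_func


# The default component image has no data libraries, so install them per component
@partial(create_component_from_func,
         packages_to_install=["pandas", "pyarrow", "scikit-learn"])
def preprocess_data(
    input_path: str,
    output_path: str,
    validation_split: float = 0.1,
) -> str:
    """Preprocessing step - runs in its own container."""
    import pandas as pd
    from sklearn.model_selection import train_test_split

    data = pd.read_parquet(input_path)
    train, val = train_test_split(data, test_size=validation_split)
    train.to_parquet(f"{output_path}/train.parquet")
    val.to_parquet(f"{output_path}/val.parquet")
    # Return the path so downstream steps can read it as preprocess_task.output
    return output_path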


@create_component_from_func
def train_model(
    data_path: str,
    model_output_path: str,
    learning_rate: float = 2e-5,
    num_epochs: int = 10,
) -> str:
    """Training step - runs on GPU."""
    # Training code here
    return model_output_path


@create_component_from_func
def validate_and_register(
    model_path: str,
    validation_data_path: str,
    min_accuracy: float = 0.80,
) -> bool:
    """Validation gate - reject model if below threshold."""
    # Load model and compute accuracy
    # Register to MLflow if passes
    pass


@dsl.pipeline(
    name="recommendation-model-pipeline",
    description="End-to-end training pipeline with quality gates",
)
def ml_pipeline(
    raw_data_path: str,
    model_name: str = "recommendation-model",
):
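    # Each step runs in its own container, so these paths must point at storage
    # every step can reach (a mounted PVC or object store), not container-local /tmp.
    preprocess_task = preprocess_data(
        input_path=raw_data_path,
        output_path="/tmp/processed",
    )

    train_task = train_model(
        data_path=preprocess_task.output,
        model_output_path="/tmp/models",
    ).set_gpu_limit(1).set_memory_limit("32G")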

    validate_task = validate_and_register(
        model_path=train_task.output,
        validation_data_path=preprocess_task.output,
    )
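
Passing one task's output into another is what defines the execution order: the orchestrator derives a dependency graph and runs steps in topological order. The sketch below illustrates that idea with the standard library only (`graphlib`, Python 3.9+); it is not how Kubeflow Pipelines is implemented, and the dependency map is written by hand to mirror the pipeline above.

```python
from graphlib import TopologicalSorter

# step -> set of steps whose outputs it consumes
dependencies = {
    "preprocess_data": set(),
    "train_model": {"preprocess_data"},
    "validate_and_register": {"train_model", "preprocess_data"},
}

# static_order() yields each step only after all of its dependencies
execution_order = list(TopologicalSorter(dependencies).static_order())
print(execution_order)
```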

Resource Requests and Limits for ML

Getting resource requests right is critical for efficient cluster utilization:

def estimate_gpu_memory_requirements(
    model_params: int,
    precision: str = "bfloat16",
    batch_size: int = 1,
    sequence_length: int = 512,
    gradient_checkpointing: bool = False,
) -> dict:
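    """
    Roughly estimate GPU memory requirements for a transformer model.
    Used to pick a GPU type and count for Kubernetes resource requests.
    Heuristic only: it ignores gradients and optimizer state, so treat it as
    a lower bound for training and verify against measured usage.
    """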
    bytes_per_param = {"float32": 4, "bfloat16": 2, "int8": 1, "int4": 0.5}[precision]

    # Model weights
    model_memory_gb = model_params * bytes_per_param / 1e9

    # KV cache for attention. The exact size is
    # 2 × layers × kv_heads × head_dim × seq_len × batch × bytes per element.
    # Rough heuristic used here: 25% of weight memory per sequence at 512 tokens
    kv_cache_gb = model_memory_gb * 0.25 * batch_size * (sequence_length / 512)

    # Activations (with gradient checkpointing, ~30% of without)
    activation_multiplier = 0.3 if gradient_checkpointing else 1.0
    activation_gb = model_memory_gb * 0.8 * batch_size * activation_multiplier

    total_gb = model_memory_gb + kv_cache_gb + activation_gb
    # Add 20% overhead for CUDA context, PyTorch allocator, etc.
    required_gpu_memory_gb = total_gb * 1.20

    return {
        "model_weights_gb": model_memory_gb,
        "kv_cache_gb": kv_cache_gb,
        "activations_gb": activation_gb,
        "total_required_gb": required_gpu_memory_gb,
        "recommended_gpu": _select_gpu(required_gpu_memory_gb),
    }


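def _select_gpu(required_gb: float) -> str:
    """Map a memory requirement to the smallest GPU option that fits."""
    if required_gb <= 16:
        return "nvidia-a10g (24GB)"  # g5.xlarge
    elif required_gb <= 40:
        return "nvidia-a100-40gb"
    elif required_gb <= 80:
        return "nvidia-a100-80gb"
    else:
        # Ceiling division: exactly 160 GB needs 2 GPUs, not 3
        gpu_count = int(-(-required_gb // 80))
        return f"Multiple GPUs required: {gpu_count}× A100-80GB"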
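
Once an estimate exists, it still has to be turned into a manifest value. The sketch below shows one way to do that, assuming memory is split evenly across identical GPUs (which real parallelism strategies only approximate); both helper names are hypothetical.

```python
import math


def gpus_to_request(required_gb: float, gpu_memory_gb: float = 80.0) -> int:
    """Number of whole GPUs needed to hold required_gb of device memory."""
    if required_gb <= 0:
        raise ValueError("required_gb must be positive")
    return math.ceil(required_gb / gpu_memory_gb)


def gpu_resource_block(count: int) -> dict:
    """Build the resources stanza; GPU requests and limits must be equal."""
    quantity = str(count)
    return {
        "requests": {"nvidia.com/gpu": quantity},
        "limits": {"nvidia.com/gpu": quantity},
    }


count = gpus_to_request(161.0)  # just over two 80 GB cards
print(count, gpu_resource_block(count))
```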

Production Engineering Notes

Pod Disruption Budgets for Serving

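Ensure node drains and other voluntary disruptions don't evict too many serving pods at once. (Rolling updates are governed by the Deployment's maxUnavailable and maxSurge settings, not by the PDB.)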

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: recommendation-model-pdb
  namespace: ml-serving   # must match the Deployment's namespace
spec:
  minAvailable: 2   # Keep at least 2 pods running during node drains and other evictions
  selector:
    matchLabels:
      app: recommendation-model

Init Containers for Model Warming

Use init containers to pre-download model artifacts before the serving container starts - avoiding slow first-request latency:

initContainers:
  - name: model-downloader
    image: amazon/aws-cli:latest
    command:
      - aws
      - s3
      - cp
      - s3://ml-models/recommendation/v3/model.pt
      - /model-cache/model.pt
    volumeMounts:
      - name: model-cache
        mountPath: /model-cache

Common Mistakes

:::danger Setting GPU limit different from GPU request
For Kubernetes GPU resources (nvidia.com/gpu), requests and limits must be equal. Unlike CPU and memory, GPUs are not overcommitted or time-shared by default - a pod either has the GPU or doesn't. A spec with a request of 1 GPU and a limit of 2 GPUs fails validation and is rejected. If you specify only the limit, Kubernetes uses it as the request.
:::

:::warning Not setting probes with sufficient initialDelaySeconds
Model loading takes 30–90 seconds. A failing readiness probe only keeps the pod out of the Service endpoints, which is the correct behavior while the model loads. The restart loop comes from the liveness probe: if its initialDelaySeconds is shorter than the load time, the kubelet kills and restarts the container before it ever finishes loading. Set the liveness initialDelaySeconds to at least the expected model loading time plus a 20% buffer, or use a startupProbe to hold off liveness checks until the model is ready.
:::
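
The buffer rule is simple arithmetic. The sketch below computes a delay from a measured load time and estimates when a liveness probe would first restart a container that never becomes healthy; it ignores probe timeouts and kubelet jitter, and the helper names are illustrative.

```python
import math


def initial_delay_seconds(load_seconds: int, buffer_percent: int = 20) -> int:
    """Expected model load time plus a safety buffer, rounded up to whole seconds."""
    return math.ceil(load_seconds * (100 + buffer_percent) / 100)


def earliest_restart_seconds(initial_delay: int, period: int, failure_threshold: int = 3) -> int:
    """Approximate time of the failure that triggers a restart.

    The first probe runs after initial_delay, then one per period; the container
    is restarted once failure_threshold consecutive probes have failed.
    """
    return initial_delay + period * (failure_threshold - 1)


print(initial_delay_seconds(90))
# Liveness settings from the Deployment above: 120 s delay, 30 s period, default threshold of 3
print(earliest_restart_seconds(120, 30))
```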

:::danger Running GPU nodes without node taints
Without taints on GPU nodes, any pod can schedule there - including CPU-only pods that have no GPU requirements. CPU pods on GPU nodes waste expensive GPU instances without using the GPU. Always taint GPU nodes with dedicated=ml-workloads:NoSchedule and require ML workloads to explicitly tolerate it.
:::

Interview Q&A

Q: How does Kubernetes GPU scheduling work?

A: GPUs are exposed to Kubernetes as extended resources via the NVIDIA device plugin, which runs as a DaemonSet on every GPU node and advertises the node's GPU capacity as nvidia.com/gpu. Pods request GPUs like any other resource: nvidia.com/gpu: 2 in resource requests and limits. The scheduler finds a node with 2 unallocated GPUs and places the pod there; if no node has enough free GPUs, the pod stays Pending. The kubelet then asks the device plugin to allocate specific devices, and the container runtime exposes them to the container. Important: GPU requests and limits must be equal (GPUs aren't time-shared by default), and allocation is all-or-nothing. With the plugin's default settings, the allocated devices are passed through the NVIDIA_VISIBLE_DEVICES environment variable, which the NVIDIA container runtime uses to expose only those GPUs.

Q: What is the difference between Kubeflow Pipelines and Argo Workflows?

A: Argo Workflows is a general-purpose workflow orchestration engine for Kubernetes - it runs any sequence of containers with dependencies between them. Kubeflow Pipelines is built on top of Argo and adds ML-specific features: a Python SDK for defining pipelines as code, built-in metadata tracking (what data went into each step, what artifacts it produced), integration with MLflow and the Kubeflow model registry, and a UI designed for ML practitioners rather than Kubernetes operators. I'd recommend Argo Workflows when you need simple pipeline orchestration and want minimal operational overhead. I'd recommend Kubeflow Pipelines when you want end-to-end ML platform integration, metadata lineage, and when your team is primarily Python-based (the Python SDK is much more ergonomic than writing YAML DAGs).

Q: How do you right-size GPU resource requests for ML serving?

A: Start by benchmarking: load the model and measure actual GPU memory consumption under expected batch sizes. Use nvidia-smi or torch.cuda.max_memory_allocated() to measure peak GPU memory. Add 20–25% buffer above peak. For serving (not training), you don't need memory for gradients or optimizer state, and activation memory is small - the budget is mostly model weights and KV cache. A 7B parameter model in BF16 is 14 GB of weights; with KV cache at typical batch size, plan for 20–24 GB, fitting on an A100-40GB. For requests vs limits: Kubernetes does not schedule on GPU memory, so the measured peak decides which GPU type (or MIG slice) to target rather than a memory request. For host memory, set the request to the measured working set and the limit somewhat higher if you want to allow bursting. Always ensure requests == limits for the nvidia.com/gpu count itself - GPU sharing requires explicit configuration (MIG or time-slicing) and isn't available by default.
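
The 7B example can be checked with a few lines of arithmetic. This is a sketch under stated assumptions: the 4 GB KV cache figure is a made-up placeholder for a measured value, and the function name is illustrative.

```python
def serving_gpu_memory_gb(
    params: float,
    bytes_per_param: float,
    kv_cache_gb: float,
    buffer_fraction: float = 0.2,
) -> float:
    """Weights plus KV cache, with a safety buffer on top."""
    weights_gb = params * bytes_per_param / 1e9
    return (weights_gb + kv_cache_gb) * (1 + buffer_fraction)


# 7B parameters in BF16 (2 bytes each) = 14 GB of weights; assume 4 GB of KV cache
estimate = serving_gpu_memory_gb(7e9, 2, kv_cache_gb=4.0)
print(f"{estimate:.1f} GB")
```

The result lands inside the 20–24 GB planning range quoted above, but a measured peak should always override the estimate.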

Q: How would you migrate ML workloads from EC2 instances to Kubernetes?

A: An eight-step migration:

1. Containerize: wrap each ML workload in a Docker image with all dependencies, and test it locally.

2. Create Kubernetes manifests: define Job specs for training and Deployment specs for serving, starting with resource requests equivalent to the EC2 instances.

3. Set up GPU nodes: create a GPU node pool with taints and labels, install the NVIDIA device plugin, and verify GPU visibility.

4. Migrate serving first: serving workloads are more predictable. Deploy one model, verify that predictions match the EC2 version, then cut traffic over.

5. Migrate training: start with non-critical experiments, and validate that checkpoints are saved correctly to shared storage.

6. Implement autoscaling: a GPU node pool autoscaler, plus HPA on serving deployments.

7. Clean up EC2 instances: after 2 weeks of successful K8s operation, terminate the corresponding EC2 instances.

8. Optimize: measure GPU utilization, and use this data to right-size resource requests and identify opportunities for bin-packing.
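
The "verify predictions match" step is usually a tolerance comparison rather than exact equality, since GPU kernels and library versions can differ slightly between environments. A minimal sketch, assuming predictions are flat lists of floats; the function name and the tolerance values are illustrative choices, not recommendations.

```python
import math
from typing import Sequence


def predictions_match(
    reference: Sequence[float],
    candidate: Sequence[float],
    rel_tol: float = 1e-4,
    abs_tol: float = 1e-6,
) -> bool:
    """True if both outputs have the same length and agree within tolerance."""
    if len(reference) != len(candidate):
        return False
    return all(
        math.isclose(r, c, rel_tol=rel_tol, abs_tol=abs_tol)
        for r, c in zip(reference, candidate)
    )


ec2_scores = [0.1, 0.9]          # hypothetical outputs from the EC2 deployment
k8s_scores = [0.1000001, 0.9]    # hypothetical outputs from the Kubernetes deployment
print(predictions_match(ec2_scores, k8s_scores))
```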

Q: What is the role of PersistentVolumeClaims in ML workloads on Kubernetes?

A: PVCs provide persistent storage that outlives individual pods - essential for ML because model checkpoints and training data must survive pod restarts and rescheduling. Three main uses: (1) shared training data: a ReadOnlyMany PVC backed by EFS (or similar NFS) allows multiple training pods to read the same dataset simultaneously without data transfer overhead; (2) checkpoint storage: a ReadWriteOnce PVC for a training job to write checkpoints - when the job is interrupted and rescheduled, it can resume from the checkpoint; (3) model artifacts: serving pods need to load model weights at startup, either from a PVC (faster, if on fast storage) or from S3 (slower but simpler). A common pattern: store large datasets on S3, cache to local node storage via an init container before training starts, and write checkpoints to both local PVC (fast) and S3 (durable).
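
Resuming after a reschedule means finding the newest checkpoint on the mounted volume at startup. The sketch below assumes a hypothetical naming scheme of `checkpoint-<step>.pt`; adapt the pattern to whatever your training code writes.

```python
import re
from pathlib import Path
from typing import Optional

# Hypothetical naming scheme: checkpoint-<step>.pt
CHECKPOINT_PATTERN = re.compile(r"^checkpoint-(\d+)\.pt$")


def latest_checkpoint(checkpoint_dir: str) -> Optional[Path]:
    """Return the checkpoint file with the highest step number, or None."""
    directory = Path(checkpoint_dir)
    if not directory.is_dir():
        return None  # first run: nothing to resume from

    best_path, best_step = None, -1
    for path in directory.iterdir():
        match = CHECKPOINT_PATTERN.match(path.name)
        # Compare steps numerically so checkpoint-100 beats checkpoint-20
        if match and int(match.group(1)) > best_step:
            best_path, best_step = path, int(match.group(1))
    return best_path


print(latest_checkpoint("/checkpoints/bert-domain-v1"))
```

Comparing the step as an integer matters: a plain string sort would rank checkpoint-20 after checkpoint-100.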
