Module 8 - Kubernetes for ML

Kubernetes has become the operating system of the cloud-native era. For ML engineers, it is no longer optional: Kubeflow runs directly on it, managed platforms such as Vertex AI, SageMaker, and Azure ML build on it or integrate with it, and engineering teams expect you to be fluent in it. This module takes you from zero to production-ready, covering the core skills for running ML on Kubernetes.

What You Will Learn

Module Lessons

| # | Lesson | Key Skills |
|---|---|---|
| 01 | K8s Fundamentals for ML | Pods, Deployments, ConfigMaps, Secrets, PVCs, resource requests, liveness/readiness probes |
| 02 | Pods, Deployments & Services | Core workload patterns, service types, namespaces, networking |
| 03 | GPU Scheduling | NVIDIA device plugin, MIG, time-slicing, node affinity, GPU quotas |
| 04 | Helm for ML | Chart anatomy, parameterized deployments, lifecycle hooks, umbrella charts |
| 05 | Training Jobs on K8s | Job/CronJob, PyTorchJob, TFJob, fault tolerance, spot node handling |
| 06 | Autoscaling ML Workloads | HPA, KEDA, rolling updates, readiness gates for model warmup |
| 07 | KServe Model Serving | Operators, CRDs, KServe, Seldon Core, Argo Workflows |

Why Kubernetes for ML?

The ML lifecycle has three distinct computational phases: experimentation (exploratory, bursty, GPU-heavy), training (long-running, fault-sensitive, distributed), and serving (latency-critical, variable load, cost-sensitive). No single-purpose setup, whether a workstation, a batch cluster, or a fleet of web servers, handles all three well.

Kubernetes handles all three through a unified API:

  • GPU scheduling - declarative resource requests (nvidia.com/gpu: 1) instead of manual allocation
  • Distributed training - the Kubeflow Training Operator's PyTorchJob creates the worker pods and wires up the rendezvous settings that torchrun expects
  • Elastic serving - KEDA scales Deployments on external metrics, such as GPU utilization exported to a metrics backend like Prometheus
  • Reproducibility - container images + declarative manifests = the same environment on every cluster
  • Cost control - ResourceQuotas per namespace prevent runaway GPU spend
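
To make the first bullet concrete, here is a minimal sketch of a Pod manifest that asks for one GPU declaratively. It is built as a Python dict and printed as JSON, which kubectl accepts alongside YAML. The pod name, container name, and image are hypothetical placeholders, and the sketch assumes the NVIDIA device plugin is installed so that the nvidia.com/gpu resource exists on the nodes.

```python
import json


def gpu_pod_manifest(name: str, image: str, gpus: int = 1) -> dict:
    """Build a Pod manifest that requests whole GPUs declaratively."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "trainer",  # hypothetical container name
                    "image": image,     # hypothetical image reference
                    "resources": {
                        # GPUs are extended resources: they are set under
                        # limits, and the request defaults to the same value.
                        "limits": {"nvidia.com/gpu": gpus, "memory": "8Gi"},
                        "requests": {"memory": "8Gi"},
                    },
                }
            ],
        },
    }


manifest = gpu_pod_manifest("train-demo", "registry.example.com/ml/trainer:latest")
print(json.dumps(manifest, indent=2))
```

Saving the printed JSON to a file and running kubectl apply -f on it would submit the Pod; the scheduler then places it only on a node with a free GPU.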

Prerequisites

Before starting this module, you should be comfortable with:

  • Docker: building images, understanding layers, volumes, networking
  • Basic Linux: file permissions, processes, environment variables
  • Python ML: you have trained at least one model with PyTorch or TensorFlow
  • Module 07 (CI/CD for ML) - particularly how models are packaged as container images

Key Mental Models

Kubernetes is a desired-state engine. You declare what you want ("3 replicas of my model server, each with 1 GPU and 8 GB RAM"), and Kubernetes continuously reconciles reality toward that state. If a container crashes, the kubelet restarts it. If a node goes down, controllers such as Deployments recreate its pods on healthy nodes. This is fundamentally different from imperative "run this command on that server" thinking.
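
The reconciliation idea can be sketched as a toy loop. This is not how Kubernetes controllers are implemented; ToyCluster, its fields, and the pod names are hypothetical and exist only to show "compare desired state with actual state, then act on the difference".

```python
from dataclasses import dataclass, field


@dataclass
class ToyCluster:
    """Toy model of desired-state reconciliation (hypothetical, illustrative)."""

    desired_replicas: int
    running: set = field(default_factory=set)
    next_id: int = 0

    def reconcile(self) -> list:
        """Move actual state toward desired state and report the actions taken."""
        actions = []
        # Too few pods: create replacements until the count matches.
        while len(self.running) < self.desired_replicas:
            name = f"model-server-{self.next_id}"
            self.next_id += 1
            self.running.add(name)
            actions.append(f"create {name}")
        # Too many pods: remove extras until the count matches.
        while len(self.running) > self.desired_replicas:
            name = sorted(self.running)[-1]
            self.running.remove(name)
            actions.append(f"delete {name}")
        return actions


cluster = ToyCluster(desired_replicas=3)
print(cluster.reconcile())                 # initial rollout: three creates
cluster.running.discard("model-server-1")  # simulate a crashed pod
print(cluster.reconcile())                 # one replacement pod is created
print(cluster.reconcile())                 # nothing to do: state already matches
```

Note that nobody tells the loop "a pod crashed, start a new one". It only ever sees the gap between the declared replica count and what is running, which is why the same code handles rollouts, crashes, and scale-downs.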

Everything is a resource. Pods, Deployments, Services, ConfigMaps, Jobs, CronJobs - all are Kubernetes resources, and workload resources carry a spec (what you want) and a status (what the cluster observes). Custom resources, defined through CustomResourceDefinitions (CRDs), extend this model: a PyTorchJob is just a Kubernetes resource that the Training Operator knows how to reconcile.
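
As an illustration, the sketch below builds a PyTorchJob with the same top-level shape as any built-in resource: apiVersion, kind, metadata, spec. It assumes the Kubeflow Training Operator's kubeflow.org/v1 API is installed in the cluster; the job name and image are hypothetical, and the field layout should be checked against the operator version you actually run.

```python
import json


def pytorch_job_manifest(name: str, image: str, workers: int) -> dict:
    """Build a PyTorchJob custom resource (assumes the kubeflow.org/v1 API)."""

    def replica_spec(replicas: int) -> dict:
        return {
            "replicas": replicas,
            "restartPolicy": "OnFailure",
            "template": {
                "spec": {
                    "containers": [
                        {
                            # The operator conventionally expects this
                            # container to be named "pytorch".
                            "name": "pytorch",
                            "image": image,  # hypothetical image reference
                            "resources": {"limits": {"nvidia.com/gpu": 1}},
                        }
                    ]
                }
            },
        }

    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": name},
        "spec": {
            "pytorchReplicaSpecs": {
                "Master": replica_spec(1),
                "Worker": replica_spec(workers),
            }
        },
        # No "status" here: the controller writes it after submission.
    }


job = pytorch_job_manifest("ddp-demo", "registry.example.com/ml/trainer:latest", 2)
print(json.dumps(job, indent=2))
```

You only ever author the spec. Once the resource is submitted, the operator creates the pods and reports progress back through the status field.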

Namespaces are your blast radius control. With ResourceQuotas correctly configured, a runaway training job in the team-a namespace cannot consume the GPU and memory budget reserved for the team-b namespace, because each namespace's total requests and limits are capped. Namespaces are the primary isolation boundary in shared ML clusters.
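
A per-namespace cap might look like the following sketch. The quota name, namespace, and numbers are hypothetical. The fits_quota helper is only a simplified illustration of the admission decision (usage plus the new request must stay within the hard limit), not the actual Kubernetes implementation.

```python
import json


def gpu_quota_manifest(namespace: str, max_gpus: int, max_memory: str) -> dict:
    """Build a ResourceQuota capping GPU requests and memory limits."""
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": "ml-compute-quota", "namespace": namespace},
        "spec": {
            "hard": {
                # Extended resources such as GPUs are capped through the
                # "requests." prefix.
                "requests.nvidia.com/gpu": str(max_gpus),
                "limits.memory": max_memory,
            }
        },
    }


def fits_quota(used_gpus: int, requested_gpus: int, max_gpus: int) -> bool:
    """Simplified admission check: would this request exceed the quota?"""
    return used_gpus + requested_gpus <= max_gpus


quota = gpu_quota_manifest("team-a", max_gpus=4, max_memory="64Gi")
print(json.dumps(quota, indent=2))
print(fits_quota(used_gpus=3, requested_gpus=1, max_gpus=4))  # within the cap
print(fits_quota(used_gpus=3, requested_gpus=2, max_gpus=4))  # over the cap
```

A pod that would push team-a past its cap is rejected at creation time, so the overrun never reaches the scheduler or the nodes that team-b depends on.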

© 2026 EngineersOfAI. All rights reserved.