Module 8 - Kubernetes for ML

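Kubernetes has become the de facto standard for orchestrating containers in cloud-native systems. For ML engineers, it is increasingly hard to avoid: Kubeflow runs directly on it, managed platforms such as Vertex AI, SageMaker, and Azure ML can integrate with Kubernetes clusters or Kubernetes-native tooling, and engineering teams expect you to be fluent in it. This module takes you from zero to production-ready, covering the core skills for running ML on Kubernetes: fundamentals, GPU scheduling, Helm, training jobs, autoscaling, and model serving.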

What You Will Learn

Module Lessons

#	Lesson	Key Skills
01	K8s Fundamentals for ML	Pods, Deployments, ConfigMaps, Secrets, PVCs, resource requests, liveness/readiness probes
02	Pods, Deployments & Services	Core workload patterns, service types, namespaces, networking
03	GPU Scheduling	NVIDIA device plugin, MIG, time-slicing, node affinity, GPU quotas
04	Helm for ML	Chart anatomy, parameterized deployments, lifecycle hooks, umbrella charts
05	Training Jobs on K8s	Job/CronJob, PyTorchJob, TFJob, fault tolerance, spot node handling
06	Autoscaling ML Workloads	HPA, KEDA, rolling updates, readiness gates for model warmup
07	KServe Model Serving	Operators, CRDs, KServe, Seldon Core, Argo Workflows

Why Kubernetes for ML?

The ML lifecycle has three distinct computational phases: experimentation (exploratory, bursty, GPU-heavy), training (long-running, fault-sensitive, distributed), and serving (latency-critical, variable load, cost-sensitive). No single traditional compute setup handles all three well.

Kubernetes handles all three through a unified API:

GPU scheduling - declarative resource requests (nvidia.com/gpu: 1) instead of manual allocation
Distributed training - the Kubeflow Training Operator's PyTorchJob launches and coordinates torchrun workers across pods
Elastic serving - KEDA scales Deployments on custom metrics, including GPU utilization exposed through a metrics source such as Prometheus
Reproducibility - container images plus declarative manifests give you the same environment on every cluster
Cost control - ResourceQuotas per namespace cap how many GPUs each team can request, preventing runaway GPU spend
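
To make the declarative GPU request above concrete, here is a minimal sketch that builds a Deployment manifest as a plain Python dict and prints it as JSON, which kubectl accepts alongside YAML. The function name, the app name sentiment-model, and the image registry.example.com/sentiment-model:1.0 are hypothetical placeholders, not part of this module. The sketch assumes the NVIDIA device plugin is installed so that nodes advertise nvidia.com/gpu.

```python
import json


def model_server_deployment(name, image, replicas=3, gpus=1, memory="8Gi"):
    """Build a Deployment manifest as a plain dict (illustrative helper, not a library API)."""
    labels = {"app": name}
    container = {
        "name": name,
        "image": image,
        "resources": {
            "requests": {"memory": memory},
            # GPUs are an extended resource: declare them under limits.
            # If requests are also given, they must equal the limits.
            "limits": {"memory": memory, "nvidia.com/gpu": gpus},
        },
    }
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            # The selector must match the pod template labels.
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [container]},
            },
        },
    }


# Hypothetical name and image, used only for illustration.
manifest = model_server_deployment(
    "sentiment-model", "registry.example.com/sentiment-model:1.0"
)
print(json.dumps(manifest, indent=2))
```

Saving the printed JSON to a file and running kubectl apply -f on it would ask the scheduler to place each replica on a node with a free GPU. Nothing in the manifest names a specific machine.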

Prerequisites

Before starting this module, you should be comfortable with:

Docker: building images, understanding layers, volumes, networking
Basic Linux: file permissions, processes, environment variables
Python ML: you have trained at least one model with PyTorch or TensorFlow
Module 07 (CI/CD for ML) - particularly how models are packaged as container images

Key Mental Models

Kubernetes is a desired-state engine. You declare what you want (3 replicas of my model server, each with 1 GPU and 8 GB RAM), and Kubernetes continuously reconciles reality toward that state. If a pod crashes, Kubernetes recreates it. If a node goes down, pods are rescheduled. This is fundamentally different from imperative "run this command on that server" thinking.
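
The reconciliation idea can be sketched as a toy loop in plain Python. This is a simplified simulation under stated assumptions, not how the real controllers are implemented: the Cluster class, the pod names, and the reconcile method are all hypothetical, and real controllers watch the API server rather than holding state in memory.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Cluster:
    """Toy model of a controller: desired state plus observed state."""

    desired_replicas: int
    running: set = field(default_factory=set)
    _ids: count = field(default_factory=count, repr=False)

    def reconcile(self):
        """One pass: compare desired with observed and act on the difference."""
        actions = []
        # Too few pods: create replacements until the count matches.
        while len(self.running) < self.desired_replicas:
            pod = f"model-server-{next(self._ids)}"
            self.running.add(pod)
            actions.append(f"create {pod}")
        # Too many pods: remove extras until the count matches.
        while len(self.running) > self.desired_replicas:
            pod = max(self.running)
            self.running.remove(pod)
            actions.append(f"delete {pod}")
        return actions


cluster = Cluster(desired_replicas=3)
first_pass = cluster.reconcile()            # nothing is running yet, so 3 pods are created
cluster.running.discard("model-server-1")   # simulate a pod crash
second_pass = cluster.reconcile()           # one replacement pod is created
third_pass = cluster.reconcile()            # state already matches, nothing to do
print(first_pass, second_pass, third_pass)
```

Note that the caller never says "restart the crashed pod". It only states the target count, and each pass works out the actions needed to reach it.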

Everything is a resource. Pods, Deployments, Services, ConfigMaps, Jobs, CronJobs - all are Kubernetes resources with a spec and a status. Custom resources (CRDs) extend this model: a PyTorchJob is just a Kubernetes resource that the Training Operator knows how to reconcile.
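
As an illustration of "a PyTorchJob is just a Kubernetes resource", the sketch below builds one as a Python dict with the same top-level shape as any built-in resource: apiVersion, kind, metadata, spec. The field names follow the Kubeflow Training Operator's kubeflow.org/v1 API as commonly documented; treat them as an assumption and check them against the operator version installed in your cluster. The helper name, job name, and image are hypothetical.

```python
def pytorch_job(name, image, workers=2, gpus_per_pod=1):
    """Build a PyTorchJob custom resource as a dict (illustrative helper)."""

    def replica_spec(replicas):
        return {
            "replicas": replicas,
            "restartPolicy": "OnFailure",
            "template": {
                "spec": {
                    "containers": [
                        {
                            # The Training Operator expects the training
                            # container to be named "pytorch".
                            "name": "pytorch",
                            "image": image,
                            "resources": {"limits": {"nvidia.com/gpu": gpus_per_pod}},
                        }
                    ]
                }
            },
        }

    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": name},
        # You write the spec; the operator writes the status as it reconciles.
        "spec": {
            "pytorchReplicaSpecs": {
                "Master": replica_spec(1),
                "Worker": replica_spec(workers),
            }
        },
    }


# Hypothetical job name and image, used only for illustration.
job = pytorch_job("resnet-train", "registry.example.com/resnet-train:1.0", workers=3)
print(sorted(job.keys()))
```

The manifest carries no status field. That part is filled in by the Training Operator as it creates the master and worker pods, just as the Deployment controller fills in status for a Deployment.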

Namespaces are your blast radius control. With ResourceQuotas correctly configured, a runaway training job in the team-a namespace cannot consume the CPU, memory, or GPU capacity budgeted for pods in the team-b namespace. Namespaces are the primary isolation boundary in shared ML clusters.
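
A per-namespace quota can be sketched as follows. The ResourceQuota fields (spec.hard, with the requests. prefix for extended resources such as GPUs) are standard Kubernetes. The helper names, the team-a limits, and the fits_quota check are hypothetical: the real check is done by the API server at admission time, and this function only mimics the GPU arithmetic.

```python
def gpu_quota(namespace, max_gpus, max_memory):
    """Build a ResourceQuota manifest for one team namespace (illustrative helper)."""
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": f"{namespace}-quota", "namespace": namespace},
        "spec": {
            "hard": {
                # Extended resources such as GPUs only support the requests. prefix.
                "requests.nvidia.com/gpu": str(max_gpus),
                "requests.memory": max_memory,
                "limits.memory": max_memory,
            }
        },
    }


def fits_quota(used_gpus, requested_gpus, quota):
    """Toy version of the admission check: would this request exceed the GPU cap?"""
    hard = int(quota["spec"]["hard"]["requests.nvidia.com/gpu"])
    return used_gpus + requested_gpus <= hard


# Hypothetical limits for a hypothetical team namespace.
quota = gpu_quota("team-a", max_gpus=4, max_memory="64Gi")
print(fits_quota(used_gpus=3, requested_gpus=1, quota=quota))  # True: exactly at the cap
print(fits_quota(used_gpus=4, requested_gpus=1, quota=quota))  # False: would exceed the cap
```

Once a quota covers memory, pods in that namespace must declare memory requests and limits (or inherit them from a LimitRange) or they are rejected, which is what forces every training job to state its footprint up front.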
