9 docs tagged with "kubernetes"

Autoscaling ML Workloads

Horizontal Pod Autoscaler, KEDA event-driven autoscaling for GPU metrics, zero-downtime rolling updates with readiness gates, and autoscaling patterns for production ML serving.

GPU Scheduling in Kubernetes

GPU resource management in Kubernetes - NVIDIA device plugin, MIG, time-slicing, node affinity, GPU quotas per namespace, and DCGM monitoring for ML clusters.

Health Checks and Readiness

Liveness vs readiness probes, dependency health checks, health check libraries, SLOs, and building production-grade health endpoints in Python.

Helm for ML Deployments

Helm charts for ML applications - chart anatomy, parameterizing ML deployments, environment values files, lifecycle hooks for model validation, and umbrella charts for multi-component stacks.

KServe and Kubernetes ML Operators

Custom Kubernetes operators for ML workflows - what operators enable, KServe for standardized model serving, Seldon Core, the Kubeflow Training Operator, Argo Workflows, and when to build vs. use existing operators.

Kubernetes Fundamentals for ML Engineers

The minimum Kubernetes knowledge every ML engineer needs to be productive - pods, deployments, services, resource requests, GPU allocation, probes, and persistent volumes.

Module 8 - Kubernetes for ML

A complete guide to running machine learning workloads on Kubernetes, from fundamentals to GPU scheduling, training jobs, model serving, Helm, and multi-tenant clusters.

Pods, Deployments, and Services - Deep Dive

Master the three core Kubernetes workload primitives for ML engineers - stateless serving with Deployments, traffic routing with Services, and advanced pod patterns.

Training Jobs on Kubernetes

Running ML training on Kubernetes - Jobs, CronJobs, PyTorchJob and TFJob with the Training Operator, fault tolerance, checkpoint-based recovery, spot node handling, and distributed training patterns.