GPU Scheduling in Kubernetes
Eight A100s, Thirty Data Scientists, Zero Coordination
Your ML platform team manages a GPU cluster: 8 nodes, each with 1 NVIDIA A100 80GB GPU. Thirty data scientists use it for experiments, fine-tuning runs, and production training jobs. There is no scheduling policy. People SSH in and run jobs directly, or submit Kubernetes Jobs with nvidia.com/gpu: 1 requests. Most mornings, all 8 GPUs are occupied. By afternoon, 4 of them are allocated to pods that haven't done any GPU work in hours. A typical case: a researcher started a job, went to a meeting, and left the experiment sitting idle at an interactive Python prompt for three hours, holding an entire A100 hostage.
Your team's complaints are consistent: "I submitted a training job 2 hours ago and it's still Pending." "Why can't I get a GPU? The dashboard says 6 are allocated but only 2 are actually being used." "Team A ran a giant fine-tuning job last night and burned our entire compute budget."
This is the GPU scheduling problem. It is fundamentally a multi-tenant resource management problem with three sub-problems: efficient allocation (how do you match jobs to GPU resources precisely?), fair sharing (how do you prevent any one team or job from monopolizing the cluster?), and observability (how do you see what is actually happening on every GPU?). Kubernetes, combined with NVIDIA's device plugin and DCGM exporter, can address all three, but only with deliberate configuration.
How Kubernetes Learns About GPUs
Kubernetes itself does not know what a GPU is. It knows about CPUs and memory natively. Everything else - GPUs, FPGAs, custom ASICs - is exposed through the device plugin framework. A device plugin is a DaemonSet that runs on every GPU node, discovers the hardware, and advertises it to the kubelet as a custom resource.
The NVIDIA device plugin:
- Runs as a DaemonSet on every node with GPUs (via node selector or toleration)
- Scans for NVIDIA GPUs using the NVIDIA management library
- Registers each GPU as a nvidia.com/gpu resource with the kubelet
- When a pod requesting a GPU is scheduled to the node, tells the kubelet which GPUs to assign, so that the NVIDIA container runtime can expose the matching /dev/nvidiaX device files and the NVIDIA driver libraries inside the container
Installing the NVIDIA Device Plugin
# Apply the device plugin DaemonSet
# Official: https://github.com/NVIDIA/k8s-device-plugin
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
        - key: nvidia.com/gpu # allow scheduling on GPU nodes
          operator: Exists
          effect: NoSchedule
      containers:
        - name: nvidia-device-plugin-ctr
          image: nvcr.io/nvidia/k8s-device-plugin:v0.14.5
          securityContext:
            allowPrivilegeEscalation: false
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
# Verify GPU resources are visible after plugin installation
kubectl get nodes -o custom-columns="NAME:.metadata.name,GPU:.status.capacity.nvidia\.com/gpu"
# NAME GPU
# gpu-node-01 1
# gpu-node-02 1
# gpu-node-03 1
# gpu-node-04 1
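The same capacity check can be scripted when you want the result as data rather than a table. The following is a minimal Python sketch, assuming you have already captured the output of `kubectl get nodes -o json`; the sample document and the helper name `gpu_capacity_by_node` are illustrative, not part of any Kubernetes client library.

```python
import json

# Hypothetical, heavily trimmed sample of `kubectl get nodes -o json` output.
SAMPLE_NODES_JSON = """
{
  "items": [
    {"metadata": {"name": "gpu-node-01"},
     "status": {"allocatable": {"cpu": "63", "nvidia.com/gpu": "1"}}},
    {"metadata": {"name": "cpu-node-01"},
     "status": {"allocatable": {"cpu": "31"}}}
  ]
}
"""


def gpu_capacity_by_node(nodes_json, resource="nvidia.com/gpu"):
    """Map node name -> allocatable GPU count, skipping nodes without GPUs."""
    result = {}
    for node in json.loads(nodes_json)["items"]:
        allocatable = node.get("status", {}).get("allocatable", {})
        count = int(allocatable.get(resource, "0"))
        if count > 0:
            result[node["metadata"]["name"]] = count
    return result


if __name__ == "__main__":
    print(gpu_capacity_by_node(SAMPLE_NODES_JSON))
```

A node that is missing from the result either has no GPU or has a device plugin that has not registered yet, which is worth checking before debugging the scheduler.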
Requesting GPUs in Pod Specs
Once the device plugin is running, requesting a GPU is as simple as adding a resource request:
spec:
  containers:
    - name: trainer
      image: registry.company.com/fraud-trainer:v3.0
      resources:
        requests:
          cpu: "8"
          memory: "64Gi"
          nvidia.com/gpu: 2 # request 2 GPUs
        limits:
          cpu: "8"
          memory: "64Gi"
          nvidia.com/gpu: 2 # must equal request for GPU
      env:
        - name: CUDA_VISIBLE_DEVICES
          value: "0,1" # optional: explicit GPU assignment
Inside the container, standard PyTorch/TensorFlow GPU usage works as expected:
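import torch
# Check what the container sees
print(torch.cuda.device_count()) # 2 (matches the GPU request)
print(torch.cuda.get_device_name(0)) # e.g. NVIDIA A100-SXM4-80GB
# Placeholder model so the snippet runs on its own; substitute your real model
model = torch.nn.Linear(1024, 10)
# Multi-GPU training with DataParallel
# (PyTorch recommends DistributedDataParallel for serious multi-GPU training)
model = torch.nn.DataParallel(model)
model.cuda()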
Node Taints and Tolerations for GPU Nodes
A common cluster configuration: taint all GPU nodes with nvidia.com/gpu:NoSchedule. This prevents regular CPU-only pods from being scheduled on expensive GPU nodes and wasting them. Only pods with the matching toleration can land on GPU nodes.
# On the GPU node (set by cluster admin or node pool config)
# kubectl taint nodes gpu-node-01 nvidia.com/gpu=:NoSchedule
# In the pod spec - toleration to allow scheduling on GPU nodes
spec:
  tolerations:
    - key: "nvidia.com/gpu"
      operator: "Exists"
      effect: "NoSchedule"
  containers:
    - name: trainer
      resources:
        requests:
          nvidia.com/gpu: 1
        limits:
          nvidia.com/gpu: 1
The device plugin DaemonSet already includes this toleration (see above), so it can schedule on GPU nodes. Your training pods need to add it explicitly.
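To make the matching rule concrete, here is a small Python sketch of how a toleration is compared against a taint. It is a simplified model of the documented Kubernetes semantics, not the scheduler's actual code; the `Taint` and `Toleration` classes and the `schedulable` helper are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    effect: str
    value: str = ""


@dataclass(frozen=True)
class Toleration:
    key: str = ""
    operator: str = "Equal"  # "Equal" or "Exists"
    value: str = ""
    effect: str = ""  # empty effect tolerates every effect


def tolerates(tol, taint):
    """Return True if this toleration matches this taint."""
    if tol.effect and tol.effect != taint.effect:
        return False
    if tol.operator == "Exists":
        # An empty key with Exists matches every taint key.
        return tol.key == "" or tol.key == taint.key
    return tol.key == taint.key and tol.value == taint.value


def schedulable(node_taints, pod_tolerations):
    """A pod can land on a node only if every hard taint is tolerated."""
    hard = [t for t in node_taints if t.effect in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, t) for tol in pod_tolerations) for t in hard)


if __name__ == "__main__":
    gpu_taint = Taint(key="nvidia.com/gpu", effect="NoSchedule")
    gpu_tol = Toleration(key="nvidia.com/gpu", operator="Exists", effect="NoSchedule")
    print(schedulable([gpu_taint], []))         # CPU-only pod without a toleration
    print(schedulable([gpu_taint], [gpu_tol]))  # training pod with the toleration
```

Note that a toleration only permits scheduling onto the tainted node. It does not attract the pod there, which is why GPU pods still need the nvidia.com/gpu resource request.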
Node Affinity for GPU Types
When the cluster has heterogeneous GPUs (A100s for training, V100s for serving, T4s for cost-efficient inference), use node affinity to target specific GPU types:
spec:
  affinity:
    nodeAffinity:
      # Hard requirement: must be scheduled on an A100 node
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: nvidia.com/gpu.product
                operator: In
                values:
                  - "NVIDIA-A100-SXM4-80GB"
                  - "NVIDIA-A100-PCIE-80GB"
      # Soft preference: prefer nodes in the high-memory node pool
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 50
          preference:
            matchExpressions:
              - key: node-pool
                operator: In
                values: ["gpu-a100-highmem"]
GPU nodes are typically labeled by the cluster operator (or by the NVIDIA GPU operator) with labels like:
- nvidia.com/gpu.product: exact GPU model string
- nvidia.com/gpu.memory: GPU memory in MiB
- nvidia.com/gpu.count: number of GPUs on the node
# View GPU labels on nodes
kubectl get nodes -l nvidia.com/gpu.product -o custom-columns=\
"NAME:.metadata.name,GPU:.metadata.labels.nvidia\.com/gpu\.product,MEM:.metadata.labels.nvidia\.com/gpu\.memory"
MIG (Multi-Instance GPU) in Kubernetes
NVIDIA A100 and H100 GPUs support Multi-Instance GPU (MIG) partitioning. A single A100 80GB can be sliced into up to 7 independent MIG instances, each with its own memory, compute engines, and fault isolation. This is ideal for serving multiple smaller models on one GPU.
MIG profiles available on A100:
- 1g.10gb: 1/7 GPU, 10 GB memory (7 per GPU max)
- 2g.20gb: 2/7 GPU, 20 GB memory (3 per GPU max)
- 3g.40gb: 3/7 GPU, 40 GB memory (2 per GPU max)
- 4g.40gb: 4/7 GPU, 40 GB memory (1 per GPU max)
- 7g.80gb: full GPU, 80 GB memory (1 per GPU)
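The per-GPU maximums in this list follow from slice arithmetic. The Python sketch below checks whether a mix of profiles fits on one A100 80GB, assuming the GPU exposes 7 compute slices and 8 memory slices. It is a deliberately simplified capacity check: real MIG placement has additional positional rules, so a combination that passes here is not guaranteed to be creatable. The function name `fits_on_one_gpu` is hypothetical.

```python
# Slice cost per MIG profile on an A100 80GB: (compute slices, memory slices).
# Assumption: 7 compute slices and 8 memory slices per physical GPU.
A100_80GB_PROFILES = {
    "1g.10gb": (1, 1),
    "2g.20gb": (2, 2),
    "3g.40gb": (3, 4),
    "4g.40gb": (4, 4),
    "7g.80gb": (7, 8),
}
TOTAL_COMPUTE_SLICES = 7
TOTAL_MEMORY_SLICES = 8


def fits_on_one_gpu(requested):
    """Simplified check: total compute and memory slices must both fit."""
    compute = sum(A100_80GB_PROFILES[name][0] for name in requested)
    memory = sum(A100_80GB_PROFILES[name][1] for name in requested)
    return compute <= TOTAL_COMPUTE_SLICES and memory <= TOTAL_MEMORY_SLICES


if __name__ == "__main__":
    print(fits_on_one_gpu(["1g.10gb"] * 7))          # seven small serving slices
    print(fits_on_one_gpu(["3g.40gb", "4g.40gb"]))   # one medium plus one large
    print(fits_on_one_gpu(["3g.40gb"] * 3))          # needs 9 compute slices
```

The memory dimension explains why two 3g.40gb instances leave no room for a 1g.10gb instance even though one compute slice is still free.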
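To use MIG in Kubernetes, configure the NVIDIA GPU Operator with a MIG strategy. The example below assumes the mixed strategy, which advertises each MIG profile as its own resource name: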
# node label to set MIG profile
# kubectl label node gpu-node-01 nvidia.com/mig.config=all-1g.10gb
# Pod requesting a MIG slice
spec:
  containers:
    - name: serving-pod
      resources:
        requests:
          nvidia.com/mig-1g.10gb: 1
        limits:
          nvidia.com/mig-1g.10gb: 1
MIG provides hard isolation: two MIG instances on the same physical GPU cannot affect each other's memory, error state, or performance. This is important for multi-tenant production serving.
GPU Time-Slicing - Sharing Without MIG
For GPUs that don't support MIG (V100s, older GPUs) or for experimental workloads where memory isn't the constraint, GPU time-slicing lets multiple pods share a single GPU through time-multiplexing.
# ConfigMap for NVIDIA device plugin to enable time-slicing
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-config
  namespace: kube-system
data:
  config.yaml: |
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4 # advertise 1 GPU as 4 shareable units
With this config, each GPU node with 1 physical GPU now advertises nvidia.com/gpu: 4 to the scheduler. Four pods can each request 1 GPU and all run on the same physical GPU simultaneously via time-slicing.
:::warning Time-Slicing Has No Memory Isolation
Time-slicing shares the GPU's time but NOT its memory. If you have 4 pods each requesting nvidia.com/gpu: 1 on a time-sliced GPU with 16 GB memory, and each tries to use 6 GB, you will get CUDA out-of-memory errors. There is no per-pod memory limit enforcement. Use MIG for workloads that need memory isolation. Use time-slicing only for development, experimentation, or lightweight inference where you control total memory usage.
:::
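Because nothing enforces a per-pod memory limit under time-slicing, it helps to do the arithmetic before choosing a replica count. The Python sketch below reproduces the 16 GB example from the warning; the helper names are hypothetical, and the peak-memory figures are values you would have to measure for your own workloads.

```python
def total_demand_gb(per_pod_peak_gb):
    """Sum the peak GPU memory each co-located pod is expected to use."""
    return sum(per_pod_peak_gb)


def is_oversubscribed(gpu_memory_gb, per_pod_peak_gb):
    """True if the pods' combined peak memory exceeds the physical GPU."""
    return total_demand_gb(per_pod_peak_gb) > gpu_memory_gb


def max_safe_replicas(gpu_memory_gb, peak_gb_per_pod):
    """Largest replica count whose combined peak memory still fits."""
    return int(gpu_memory_gb // peak_gb_per_pod)


if __name__ == "__main__":
    # Four pods at 6 GB each on a 16 GB GPU: 24 GB of demand, so OOM is likely.
    print(is_oversubscribed(16, [6, 6, 6, 6]))
    # With 6 GB per pod, only 2 replicas fit in 16 GB.
    print(max_safe_replicas(16, 6))
```

Setting `replicas` no higher than the result of `max_safe_replicas` keeps the advertised share count consistent with what the GPU's memory can hold, although it still relies on every pod staying within its expected peak.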
GPU Quotas Per Namespace
ResourceQuotas limit the total GPU resources a namespace can consume. This is the mechanism for fair sharing across ML teams:
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-fraud-gpu-quota
  namespace: team-fraud
spec:
  hard:
    requests.nvidia.com/gpu: "4" # team-fraud can request at most 4 GPUs total
    # No limits.nvidia.com/gpu entry: quotas on extended resources accept only the requests. prefix
    requests.cpu: "64" # 64 CPU cores total
    requests.memory: "256Gi" # 256 GiB RAM total
    pods: "50" # at most 50 pods in this namespace
With this quota, if team-fraud already has 4 GPUs allocated, any new pod requesting a GPU will be rejected at admission time with a quota exceeded error:
Error from server (Forbidden): pods "new-training-job" is forbidden:
exceeded quota: team-fraud-gpu-quota, requested: requests.nvidia.com/gpu=1,
used: requests.nvidia.com/gpu=4, limited: requests.nvidia.com/gpu=4
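The admission decision behind this error is a simple comparison of current usage plus the new request against the hard limit. The Python sketch below models that check for illustration only; `check_gpu_quota` is a hypothetical helper, not how the Kubernetes quota admission controller is implemented.

```python
def check_gpu_quota(used, requested, hard):
    """Return (admitted, message) for a new pod's GPU request in a namespace."""
    if used + requested > hard:
        message = (
            "exceeded quota: "
            f"requested: requests.nvidia.com/gpu={requested}, "
            f"used: requests.nvidia.com/gpu={used}, "
            f"limited: requests.nvidia.com/gpu={hard}"
        )
        return False, message
    return True, "admitted"


if __name__ == "__main__":
    # team-fraud already holds 4 of its 4 GPUs and asks for one more.
    print(check_gpu_quota(used=4, requested=1, hard=4))
    # With one GPU released, the same request is admitted.
    print(check_gpu_quota(used=3, requested=1, hard=4))
```

The check counts requested GPUs, not GPUs doing work, so an idle pod consumes quota exactly like a busy one.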
LimitRange for Default GPU Requests
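A pod that omits a GPU request can still land on an untainted GPU node, and depending on how the NVIDIA container runtime is configured it may even see the node's GPUs without the scheduler accounting for them. A LimitRange complements the quota: the example below sets default CPU and memory values for containers that specify none, and caps how many GPUs a single container can request: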
apiVersion: v1
kind: LimitRange
metadata:
  name: team-fraud-limits
  namespace: team-fraud
spec:
  limits:
    - type: Container
      default:
        cpu: "2"
        memory: "8Gi"
      defaultRequest:
        cpu: "1"
        memory: "4Gi"
      max:
        cpu: "32"
        memory: "128Gi"
        nvidia.com/gpu: "4" # no single container can request more than 4 GPUs
DCGM Exporter - Monitoring GPU Utilization
The DCGM (Data Center GPU Manager) Exporter is the standard Prometheus exporter for NVIDIA GPU metrics. It runs as a DaemonSet on GPU nodes and exposes per-GPU metrics including utilization, memory usage, temperature, power draw, and errors.
# DCGM Exporter DaemonSet (simplified)
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  template:
    metadata:
      labels:
        app: dcgm-exporter # must match spec.selector, or the API server rejects the DaemonSet
    spec:
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: exporter
          image: nvcr.io/nvidia/k8s/dcgm-exporter:3.3.5-3.4.0-ubuntu22.04
          securityContext:
            runAsNonRoot: false
            runAsUser: 0
          ports:
            - containerPort: 9400
              name: metrics
          env:
            - name: DCGM_EXPORTER_KUBERNETES
              value: "true" # enables pod/container labels on metrics
Key DCGM metrics for ML platform monitoring:
# GPU utilization across all nodes (%)
DCGM_FI_DEV_GPU_UTIL{node="gpu-node-01"}
# GPU framebuffer memory used vs free (MiB)
DCGM_FI_DEV_FB_USED{exported_namespace="team-fraud"}
DCGM_FI_DEV_FB_FREE
# GPU temperature (Celsius) - alert if > 85°C
DCGM_FI_DEV_GPU_TEMP > 85
# NVLink bandwidth counter (for multi-GPU training) - wrap in rate() to get throughput
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL
# GPU power consumption (watts) - for cost attribution
DCGM_FI_DEV_POWER_USAGE
# ECC errors (double-bit errors indicate hardware failure)
DCGM_FI_DEV_ECC_DBE_VOL_TOTAL > 0
With DCGM_EXPORTER_KUBERNETES=true, metrics include pod-level labels:
- exported_namespace: which K8s namespace the pod is in
- exported_pod: pod name
- exported_container: container name
This lets you build per-team GPU utilization dashboards and identify idle GPU holders.
Identifying Idle GPU Holders
This is the runaway idle job problem from the opening scenario. The following PromQL queries find namespaces with allocated but idle GPUs:
# GPUs allocated per namespace (from K8s resource requests)
kube_pod_container_resource_requests{resource="nvidia_com_gpu"}
* on(pod, namespace) group_left(phase)
kube_pod_status_phase{phase="Running"}
# GPU utilization per namespace (from DCGM)
avg by (exported_namespace) (DCGM_FI_DEV_GPU_UTIL)
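# Idle GPU pods: allocated GPU but < 5% average utilization over 30 minutes.
# The two metrics use different label names (pod vs exported_pod), so the
# DCGM labels are copied and the match is made explicit with on(...).
(
  sum by (namespace, pod) (kube_pod_container_resource_requests{resource="nvidia_com_gpu"}) > 0
)
unless on (namespace, pod)
(
  label_replace(
    label_replace(
      max by (exported_namespace, exported_pod) (avg_over_time(DCGM_FI_DEV_GPU_UTIL[30m])),
      "namespace", "$1", "exported_namespace", "(.*)"),
    "pod", "$1", "exported_pod", "(.*)")
  > 5
)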
Set up a Grafana alert on this query and page the namespace owner to release their idle GPU allocation.
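The same idle rule can be expressed outside Prometheus, for example in a script that posts a daily report. The Python sketch below assumes you have already collected recent utilization samples per pod (for instance from the Prometheus HTTP API); the input shape and the name `find_idle_pods` are hypothetical.

```python
from statistics import mean


def find_idle_pods(utilization, threshold_pct=5.0):
    """Return (namespace, pod) pairs whose mean GPU utilization is below threshold.

    `utilization` maps (namespace, pod) -> list of utilization samples in percent.
    Pods with no samples are skipped rather than reported as idle.
    """
    idle = []
    for (namespace, pod), samples in utilization.items():
        if samples and mean(samples) < threshold_pct:
            idle.append((namespace, pod))
    return sorted(idle)


if __name__ == "__main__":
    # Illustrative samples covering the last 30 minutes.
    samples = {
        ("team-fraud", "notebook-7"): [0, 0, 1, 0],
        ("team-nlp", "finetune-3"): [92, 95, 88],
        ("team-vision", "debug-1"): [],
    }
    print(find_idle_pods(samples))
```

Skipping pods with no samples is a deliberate choice in this sketch: a missing series usually means a scrape or labeling problem, and paging a team for that would erode trust in the alert.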
Production Notes
GPU Operator vs Manual Device Plugin: NVIDIA GPU Operator is the recommended installation path for clusters that need drivers installed on nodes. It manages the driver installation (as a DaemonSet), device plugin, DCGM exporter, and MIG configuration in a single Helm chart. The manual device plugin install (shown above) assumes NVIDIA drivers are already installed on nodes - appropriate for cloud providers where GPU instances come with drivers pre-installed.
Node pool isolation: In cloud Kubernetes (GKE, EKS, AKS), use separate node pools for different GPU types. A100 nodes in one pool with taints, V100 in another, T4 in a third. This makes capacity management explicit and prevents scheduling surprises.
GPU memory fragmentation: CUDA's memory allocator can fragment GPU memory over time, especially for long-running inference servers. Monitor DCGM_FI_DEV_FB_FREE over time. If free memory decreases steadily without increasing GPU utilization, the server has a memory leak or fragmentation issue. Schedule periodic pod restarts with CronJobs during low-traffic periods.
Common Mistakes
:::danger Requesting nvidia.com/gpu Without Tolerations on Tainted GPU Nodes
If GPU nodes have the nvidia.com/gpu:NoSchedule taint (common in production clusters), pods without the matching toleration will remain in Pending forever. The pod events will show "0 nodes are available: N node(s) had untolerated taint {nvidia.com/gpu: }." Always add the GPU toleration to pods that request GPU resources.
kubectl describe pod stuck-training-job | grep -A5 Events
# Warning FailedScheduling 0/8 nodes available: 8 Insufficient nvidia.com/gpu
# OR
# Warning FailedScheduling 0/8 nodes available: 8 node(s) had untolerated taint
:::
:::warning Not Setting GPU Limits Equal to Requests
For nvidia.com/gpu, the limit must always equal the request. Unlike CPU and memory, where limits can exceed requests (for bursting), GPU resources cannot be over-committed. If you set only a limit, Kubernetes uses it as the request. If you set a request with no limit, or a request and limit that differ (request: 1, limit: 2), the API server rejects the pod spec with a validation error. Set both, and set them equal.
:::
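A pre-submit check can catch GPU request and limit mismatches before the API server does. The Python sketch below encodes the rules from the Kubernetes GPU scheduling documentation as I understand them: a limit alone is fine, a request needs a limit, the two must be equal, and values must be whole numbers. `validate_gpu_resources` is a hypothetical helper that operates on a container's `resources` mapping.

```python
def validate_gpu_resources(resources, name="nvidia.com/gpu"):
    """Return a list of problems with a container's GPU request and limit."""
    request = resources.get("requests", {}).get(name)
    limit = resources.get("limits", {}).get(name)
    if request is None and limit is None:
        return []  # container does not ask for a GPU at all

    errors = []
    for label, value in (("request", request), ("limit", limit)):
        if value is not None and not str(value).isdigit():
            errors.append(f"{name} {label} must be a non-negative whole number, got {value!r}")
    if errors:
        return errors

    if limit is None:
        errors.append(f"{name} request is set but the limit is missing")
    elif request is not None and int(request) != int(limit):
        errors.append(f"{name} request ({request}) must equal limit ({limit})")
    return errors


if __name__ == "__main__":
    gpu = "nvidia.com/gpu"
    print(validate_gpu_resources({"requests": {gpu: 1}, "limits": {gpu: 1}}))
    print(validate_gpu_resources({"requests": {gpu: 1}, "limits": {gpu: 2}}))
    print(validate_gpu_resources({"requests": {gpu: 1}}))
```

Running a check like this in CI gives a clearer error than a rejected `kubectl apply`, and it keeps manifests explicit about both values.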
:::warning Using nvidia.com/gpu: 0 to Run on GPU Nodes Without Using a GPU
Some users set nvidia.com/gpu: 0 expecting it to place the pod on a GPU node without consuming a GPU. It does not. A zero request is the same as not requesting a GPU at all, so the scheduler treats the pod like any CPU-only pod. The pod can be scheduled on any node (including GPU nodes if they have no taint, wasting expensive GPU capacity). If you need to run on a GPU node without using a GPU, use node affinity explicitly, plus the GPU toleration if the nodes are tainted.
:::
Interview Q&A
Q1: How does Kubernetes know a node has GPUs, and what happens when you request nvidia.com/gpu: 1?
Kubernetes learns about GPUs through the device plugin framework, implemented here by the NVIDIA device plugin. The device plugin runs as a DaemonSet on GPU nodes, discovers GPUs using NVML (NVIDIA Management Library), and registers each GPU as a nvidia.com/gpu resource with the local kubelet. The kubelet reports this capacity to the API server, which updates the node's capacity field. When you create a pod with nvidia.com/gpu: 1, the scheduler finds nodes where allocatable.nvidia.com/gpu >= 1 and places the pod there. The kubelet then asks the device plugin which GPU to assign, and the NVIDIA container runtime exposes the appropriate /dev/nvidiaX device file and NVIDIA driver libraries inside the container. From the container's perspective, it has normal CUDA access. The scheduler tracks how many GPUs are allocated on each node, and the kubelet tracks which specific devices are assigned, so the same GPU is not handed to two pods.
Q2: What is MIG and when would you choose it over GPU time-slicing?
MIG (Multi-Instance GPU) is a hardware feature on NVIDIA A100 and H100 GPUs that partitions a single physical GPU into up to 7 fully isolated instances, each with dedicated memory, compute engines, and fault isolation. Time-slicing is a software technique that schedules multiple pods on the same GPU by time-multiplexing compute while all pods share the same memory. Choose MIG when you need strong isolation between tenants (different teams sharing a GPU node in production), when each workload needs a guaranteed amount of GPU memory (MIG slices have fixed, isolated memory), or when you need fault isolation (one tenant's CUDA error won't affect others). Choose time-slicing when the GPUs don't support MIG, when workloads are lightweight and memory contention isn't a concern, or for development/experimentation clusters where strict isolation isn't needed.
Q3: A training pod is stuck in Pending with "0/8 nodes available: 8 Insufficient nvidia.com/gpu". How do you diagnose this?
First, check the current GPU allocation: kubectl get pods -A --field-selector=status.phase=Running -o json | jq '[.items[].spec.containers[].resources.requests["nvidia.com/gpu"] // "0"] | map(tonumber) | add' to see total GPU requests. Check if GPU nodes themselves are available: kubectl describe nodes | grep -A3 "Allocated resources". Check if there are taints blocking scheduling: kubectl describe nodes | grep Taints. Check the ResourceQuota for the namespace: kubectl describe resourcequota -n <namespace>. If GPUs are physically available but all pods show Pending, the device plugin may have crashed - check kubectl get pods -n kube-system | grep device-plugin. If some jobs have been running for a long time with low GPU utilization (use DCGM metrics), you may have idle GPU holders - alert them to release resources.
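The jq one-liner gives a single cluster-wide total. When you also want to know who holds the GPUs, a per-namespace breakdown is more useful. The Python sketch below assumes you have saved the output of `kubectl get pods -A -o json`; the sample data and the function name `gpu_requests_by_namespace` are illustrative.

```python
import json
from collections import Counter

# Hypothetical, trimmed sample of `kubectl get pods -A -o json` output.
SAMPLE_PODS_JSON = """
{
  "items": [
    {"metadata": {"namespace": "team-fraud", "name": "train-a"},
     "status": {"phase": "Running"},
     "spec": {"containers": [{"resources": {"requests": {"nvidia.com/gpu": "1"}}}]}},
    {"metadata": {"namespace": "team-fraud", "name": "train-b"},
     "status": {"phase": "Running"},
     "spec": {"containers": [{"resources": {"requests": {"nvidia.com/gpu": "2"}}}]}},
    {"metadata": {"namespace": "team-nlp", "name": "finetune"},
     "status": {"phase": "Pending"},
     "spec": {"containers": [{"resources": {"requests": {"nvidia.com/gpu": "1"}}}]}},
    {"metadata": {"namespace": "team-web", "name": "api"},
     "status": {"phase": "Running"},
     "spec": {"containers": [{"resources": {}}]}}
  ]
}
"""


def gpu_requests_by_namespace(pods_json, resource="nvidia.com/gpu"):
    """Total GPU requests of Running pods, grouped by namespace."""
    totals = Counter()
    for pod in json.loads(pods_json)["items"]:
        if pod.get("status", {}).get("phase") != "Running":
            continue  # Pending pods request GPUs but do not hold them yet
        for container in pod["spec"]["containers"]:
            requests = container.get("resources", {}).get("requests", {})
            count = int(requests.get(resource, 0))
            if count > 0:
                totals[pod["metadata"]["namespace"]] += count
    return dict(totals)


if __name__ == "__main__":
    print(gpu_requests_by_namespace(SAMPLE_PODS_JSON))
```

Comparing this breakdown with each namespace's ResourceQuota shows quickly whether the Pending pod is blocked by cluster capacity or by its own team's quota.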
Q4: How do you implement fair GPU sharing across multiple ML teams on a shared cluster?
Layer three mechanisms: (1) Namespace isolation - each team gets a namespace. (2) ResourceQuota per namespace - requests.nvidia.com/gpu: "4" limits each team to 4 GPUs total. This creates equal maximum allocation. (3) Priority classes - define training-batch (low priority, can be preempted), training-interactive (medium), and inference-production (high, never preempted). Production serving pods get priority; training jobs preempt each other but never preempt serving. Add a Kubernetes queuing system (Volcano or Kueue) for the training tier to handle gang scheduling (ensuring distributed training gets all its GPUs simultaneously or none), preemption, and quota-aware scheduling.
Q5: What DCGM metrics would you monitor and what alerts would you set for a GPU ML cluster?
Core alerts: (1) DCGM_FI_DEV_GPU_UTIL < 10 for more than 30 minutes on an allocated GPU - idle GPU holder, page the owning namespace. (2) DCGM_FI_DEV_GPU_TEMP > 85 - GPU thermal throttling risk, alert on-call. (3) DCGM_FI_DEV_ECC_DBE_VOL_TOTAL > 0 - uncorrectable GPU memory error, drain the node immediately and replace hardware. (4) DCGM_FI_DEV_FB_FREE < 2000 (MB) on serving nodes - GPU memory nearly exhausted, may cause OOM on next inference. (5) NVLink errors on multi-GPU nodes - indicates bandwidth degradation for distributed training. Dashboards: per-team GPU utilization heatmap, GPU memory pressure timeline, training job queue depth vs available GPUs.
