GPU Containers
The GPU That Disappeared in Kubernetes
Rodrigo spent a full day debugging a problem that should not exist: his PyTorch training container worked perfectly on his workstation with GPU, but when the same container ran in the company's Kubernetes cluster, it used only CPU. No error. No warning. Just slow training that should have been fast.
The first sign something was wrong: a training job that took 8 minutes on his workstation was projected to take 6 hours in Kubernetes. He added a check to the training script:
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU count: {torch.cuda.device_count()}")
Output from the Kubernetes pod: CUDA available: False, GPU count: 0. The GPU nodes were
there - kubectl describe node showed nvidia.com/gpu: 4 in the allocatable resources. But
the container was not seeing them.
The root cause was a missing resource request in the Kubernetes pod spec. Without
resources.limits: nvidia.com/gpu: 1, the NVIDIA device plugin does not mount the GPU devices
into the container. PyTorch silently fell back to CPU. The fix was one line in the pod spec:
resources:
limits:
nvidia.com/gpu: 1
But understanding why that line is necessary requires understanding the full GPU container stack - which this lesson covers.
Why This Exists
Running GPU workloads in containers requires coordination between several layers of the software stack that normally do not interact: the GPU hardware, the NVIDIA kernel driver (on the host), CUDA runtime libraries (potentially in the container), and the containerization layer (Docker or containerd with NVIDIA hooks).
Before the NVIDIA Container Toolkit (released 2017, originally called nvidia-docker), GPU access in containers required running privileged containers - which bypassed all container security boundaries. The toolkit solved this with a cleaner approach: a custom container runtime hook that injects GPU device files and driver libraries into containers at creation time, without requiring root or privileged mode.
The GPU Container Stack
The key insight: the NVIDIA kernel driver is always on the host, never in the container. The
driver's user-space libraries (libcuda.so, libnvidia-ml.so) also come from the host: the container
toolkit injects them at container creation so that they always match the kernel driver. The CUDA
runtime and higher-level libraries (libcudart.so, libcudnn.so) are baked into the container's base
image. The toolkit also maps the right GPU device files (/dev/nvidia0, /dev/nvidiactl) into the
container's namespace.
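As a quick way to see whether the device files described above actually reached a container, the following sketch lists the NVIDIA entries under /dev. The function name `find_nvidia_devices` and its `dev_dir` parameter are illustrative assumptions, not part of any NVIDIA tooling.

```python
from pathlib import Path


def find_nvidia_devices(dev_dir: str = "/dev") -> list[str]:
    """Return the sorted names of NVIDIA entries found directly under dev_dir."""
    root = Path(dev_dir)
    if not root.is_dir():
        return []
    # Matches entries such as nvidia0, nvidia1, nvidiactl and nvidia-uvm.
    return sorted(p.name for p in root.glob("nvidia*"))


if __name__ == "__main__":
    devices = find_nvidia_devices()
    if devices:
        print("NVIDIA device entries:", ", ".join(devices))
    else:
        print("No /dev/nvidia* entries found - no GPU was exposed to this container")
```

An empty result inside a container that should have a GPU points at the container launch configuration rather than at PyTorch or CUDA.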
CUDA Version Compatibility Matrix
The most common source of GPU container failures is CUDA version mismatch. The rules:
Host NVIDIA Driver → supports CUDA Runtime versions ≤ its max supported version
Container CUDA Runtime → must be ≤ host driver's max supported CUDA version
PyTorch CUDA build → must match container CUDA runtime
Critical rule: the CUDA runtime in the container must be compatible with the driver on the host. Newer drivers are backward compatible - a host with driver 535.x can run containers built for CUDA 11.8 or CUDA 12.2. The reverse does not hold: a host with driver 525.x (max CUDA 12.0) should not be expected to run a container built for CUDA 12.2.
# Check host driver version and max supported CUDA
nvidia-smi
# Output includes:
# Driver Version: 535.161.07
# CUDA Version: 12.2 ← This is the MAX CUDA version this driver supports
# Check CUDA toolkit version inside a container
# (nvcc ships only in the -devel images, not in -base or -runtime)
docker run --rm --gpus all nvidia/cuda:12.1.0-cudnn8-devel-ubuntu22.04 nvcc --version
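The compatibility rule above reduces to a version comparison. The sketch below is a minimal illustration of that comparison only; the helper names `parse_version` and `container_cuda_is_supported` are assumptions made for this example, and it deliberately ignores NVIDIA's forward-compatibility packages.

```python
def parse_version(version: str) -> tuple[int, int]:
    """Turn '12.2' or '12.2.0' into (12, 2); only major.minor is compared."""
    parts = version.strip().split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    return major, minor


def container_cuda_is_supported(container_cuda: str, driver_max_cuda: str) -> bool:
    """True if the container's CUDA version is <= the driver's max CUDA version.

    driver_max_cuda is the "CUDA Version" value reported by nvidia-smi on the host.
    """
    return parse_version(container_cuda) <= parse_version(driver_max_cuda)


if __name__ == "__main__":
    # Driver 535.x reports CUDA 12.2, driver 525.x reports CUDA 12.0.
    print(container_cuda_is_supported("11.8", "12.2"))
    print(container_cuda_is_supported("12.2", "12.0"))
```

Comparing tuples rather than strings avoids the classic mistake of treating "12.10" as smaller than "12.2".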
Installing the NVIDIA Container Toolkit
# Ubuntu 22.04 setup - run on every GPU host machine
# 1. Add NVIDIA container toolkit repository
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
| sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list \
| sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
| sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
# 2. Install toolkit
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
# 3. Configure Docker to use the NVIDIA runtime
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
# 4. Verify installation
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
# Expected output: nvidia-smi showing your GPU(s)
# If this fails, the toolkit is not installed correctly
Running GPU Containers with Docker
# Grant access to all GPUs
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
# Grant access to specific GPU (by index)
docker run --rm --gpus '"device=0"' my-ml-image:latest python train.py
# Grant access to multiple specific GPUs
docker run --rm --gpus '"device=0,1"' my-ml-image:latest python train.py
# Grant access to 2 GPUs (let Docker choose which ones)
docker run --rm --gpus 2 my-ml-image:latest python train.py
# Old nvidia-docker2 style (still works but deprecated)
docker run --rm --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all my-ml-image:latest
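When these commands are launched from Python rather than a shell, the quoting of the device list is easy to get wrong. The sketch below builds the argument list for `subprocess.run` and assumes the same three forms shown above (all, a count, or explicit indices); `gpus_flag_value` and `docker_run_command` are hypothetical helper names.

```python
from typing import Sequence, Union


def gpus_flag_value(spec: Union[str, int, Sequence[int]]) -> str:
    """Build the value for docker's --gpus flag from "all", a count, or GPU indices."""
    if isinstance(spec, str):
        if spec != "all":
            raise ValueError("the only supported string value is 'all'")
        return "all"
    if isinstance(spec, bool):
        raise ValueError("spec must be 'all', a positive int, or a list of indices")
    if isinstance(spec, int):
        if spec < 1:
            raise ValueError("GPU count must be at least 1")
        return str(spec)
    indices = list(spec)
    if not indices:
        raise ValueError("at least one GPU index is required")
    # The literal double quotes keep the comma-separated list together as one
    # value, mirroring the shell form: --gpus '"device=0,1"'
    return '"device=' + ",".join(str(i) for i in indices) + '"'


def docker_run_command(image: str, spec: Union[str, int, Sequence[int]], *args: str) -> list[str]:
    """Return an argv list suitable for subprocess.run (no shell involved)."""
    return ["docker", "run", "--rm", "--gpus", gpus_flag_value(spec), image, *args]


if __name__ == "__main__":
    print(docker_run_command("my-ml-image:latest", [0, 1], "python", "train.py"))
```

Passing a list to `subprocess.run` avoids a second layer of shell quoting, which is why the double quotes are embedded in the value itself.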
CUDA_VISIBLE_DEVICES: Controlling GPU Allocation in Code
# src/training/gpu_setup.py
"""
GPU device management for ML training.
CUDA_VISIBLE_DEVICES is the primary way to control GPU allocation.
"""
import logging
import os

import torch

logger = logging.getLogger(__name__)


def setup_device(
    requested_gpus: int = 1,
    prefer_gpu: bool = True,
) -> torch.device:
    """
    Set up training device with proper CUDA visibility controls.

    Args:
        requested_gpus: Number of GPUs to use (0 for CPU-only)
        prefer_gpu: Use a GPU when one is available; if False, always use CPU

    Returns:
        torch.device configured for training
    """
    if requested_gpus == 0 or not prefer_gpu:
        return torch.device("cpu")

    if not torch.cuda.is_available():
        logger.warning("GPU requested but not available - falling back to CPU")
        return torch.device("cpu")

    available_gpus = torch.cuda.device_count()
    logger.info(f"Available GPUs: {available_gpus}")
    if available_gpus == 0:
        logger.warning("No CUDA devices visible to PyTorch - check CUDA_VISIBLE_DEVICES")
        return torch.device("cpu")

    # Log all available GPUs (major.minor is the compute capability, not the CUDA version)
    for i in range(available_gpus):
        props = torch.cuda.get_device_properties(i)
        logger.info(
            f"GPU {i}: {props.name}, "
            f"{props.total_memory / 1024**3:.1f}GB VRAM, "
            f"compute capability {props.major}.{props.minor}"
        )

    device = torch.device("cuda:0")
    torch.cuda.set_device(0)
    logger.info(f"Using device: {torch.cuda.get_device_name(0)}")
    return device


def setup_multi_gpu(requested_gpus: int) -> list[int]:
    """
    Set up multi-GPU training. Returns list of GPU indices to use.
    Sets CUDA_VISIBLE_DEVICES to restrict visibility to requested GPUs.

    Call this before any other CUDA work in the process: CUDA reads
    CUDA_VISIBLE_DEVICES when it initializes and ignores later changes.
    """
    available = torch.cuda.device_count()
    if requested_gpus > available:
        logger.warning(
            f"Requested {requested_gpus} GPUs but only {available} available. "
            f"Using {available}."
        )
        gpu_ids = list(range(available))
    else:
        # Use first N GPUs by default
        # In practice, use a GPU scheduler or CUDA_VISIBLE_DEVICES env var
        # to select non-contiguous GPUs
        gpu_ids = list(range(requested_gpus))

    # Restrict PyTorch to only see the selected GPUs
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    logger.info(f"CUDA_VISIBLE_DEVICES set to: {os.environ['CUDA_VISIBLE_DEVICES']}")
    return gpu_ids


def assert_gpu_available(min_vram_gb: float = 8.0) -> None:
    """
    Assert GPU is available and meets memory requirements.
    Call at the start of training to fail fast rather than train on CPU
    when GPU was expected.
    """
    if not torch.cuda.is_available():
        raise RuntimeError(
            "GPU not available. "
            "Check: (1) NVIDIA driver installed, (2) Container has --gpus flag, "
            "(3) K8s pod spec has nvidia.com/gpu resource limit, "
            "(4) CUDA_VISIBLE_DEVICES is not set to an empty value"
        )

    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        vram_gb = props.total_memory / 1024**3
        if vram_gb < min_vram_gb:
            raise RuntimeError(
                f"GPU {i} ({props.name}) has {vram_gb:.1f}GB VRAM, "
                f"minimum required: {min_vram_gb}GB"
            )

    logger.info(f"GPU assertion passed: {torch.cuda.device_count()} GPU(s) available")
GPU Containers in Kubernetes
The full Kubernetes GPU setup requires the NVIDIA Device Plugin:
# Install NVIDIA Device Plugin in Kubernetes cluster
# (runs as a DaemonSet on GPU nodes)
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm upgrade --install nvdp nvdp/nvidia-device-plugin \
--version 0.15.0 \
--namespace kube-system \
--set failOnInitError=false
# Verify device plugin is running
# (pod labels differ between install methods, so filter by pod name)
kubectl get pods -n kube-system | grep nvidia-device-plugin
# Verify GPU resources are visible on nodes
kubectl describe node <gpu-node-name> | grep nvidia.com/gpu
# Should show: nvidia.com/gpu: 4 (or however many GPUs the node has)
# kubernetes/training-job.yaml - ML training job with GPU resource request
apiVersion: batch/v1
kind: Job
metadata:
  name: fraud-model-training
  namespace: ml-platform
spec:
  backoffLimit: 0
  activeDeadlineSeconds: 14400 # 4 hour timeout
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: training
          image: gcr.io/myproject/ml-training:v1.2.3
          command: ["python", "-m", "src.training.train"]
          args:
            - "--config=config/training.yaml"
            - "--output-dir=/models/output"
          env:
            - name: MLFLOW_TRACKING_URI
              valueFrom:
                secretKeyRef:
                  name: mlflow-credentials
                  key: tracking-uri
          resources:
            requests:
              cpu: "4"
              memory: "16Gi"
              # GPU request: optional when a limit is set, but must equal the limit
              nvidia.com/gpu: "1"
            limits:
              cpu: "8"
              memory: "32Gi"
              # This is the critical line - without a GPU limit, no GPU is allocated
              nvidia.com/gpu: "1"
          volumeMounts:
            - name: model-storage
              mountPath: /models
            - name: training-data
              mountPath: /data
              readOnly: true
      volumes:
        - name: model-storage
          persistentVolumeClaim:
            claimName: ml-model-store
        - name: training-data
          persistentVolumeClaim:
            claimName: training-data-pvc
      nodeSelector:
        # Schedule only on GPU nodes (nodes with this label)
        cloud.google.com/gke-accelerator: nvidia-tesla-a100
      tolerations:
        # GPU nodes typically have a taint to prevent non-GPU pods from scheduling on them
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
Testing GPU Availability Inside Containers
# scripts/test_gpu_in_container.py
"""
Diagnostic script - run inside container to verify GPU access.
Usage: python scripts/test_gpu_in_container.py
"""
import subprocess
import sys

import torch


def check_nvidia_smi():
    """Test that nvidia-smi is accessible from inside the container."""
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,memory.total,driver_version",
             "--format=csv,noheader"],
            capture_output=True, text=True, timeout=10
        )
        if result.returncode == 0:
            print("nvidia-smi output:")
            print(result.stdout)
            return True
        else:
            print(f"nvidia-smi failed: {result.stderr}")
            return False
    except FileNotFoundError:
        print("nvidia-smi not found in container - NVIDIA toolkit not injecting binaries")
        return False


def check_pytorch_cuda():
    """Test that PyTorch can see and use GPUs."""
    print(f"PyTorch version: {torch.__version__}")
    print(f"CUDA available: {torch.cuda.is_available()}")

    if not torch.cuda.is_available():
        print("\nCUDA NOT AVAILABLE. Debug checklist:")
        print(" 1. Is the container running with --gpus flag (Docker)?")
        print(" 2. Does the K8s pod spec have nvidia.com/gpu resource request?")
        print(" 3. Is NVIDIA Container Toolkit installed on the host?")
        print(" 4. Does the container's CUDA version match the host driver?")
        print("    Run: nvidia-smi on host to check driver's max CUDA version")
        return False

    print(f"CUDA version: {torch.version.cuda}")
    print(f"GPU count: {torch.cuda.device_count()}")
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f" GPU {i}: {props.name}, {props.total_memory / 1024**3:.1f}GB")

    # Perform actual tensor operation on GPU to verify execution
    print("\nRunning GPU computation test...")
    device = torch.device("cuda:0")
    x = torch.randn(1000, 1000, device=device)
    y = torch.mm(x, x.T)
    torch.cuda.synchronize()
    print(f"Matrix multiply on GPU succeeded. Result shape: {y.shape}")
    return True


if __name__ == "__main__":
    print("=" * 50)
    print("GPU Container Diagnostic")
    print("=" * 50)
    smi_ok = check_nvidia_smi()
    pytorch_ok = check_pytorch_cuda()
    if smi_ok and pytorch_ok:
        print("\nAll GPU checks PASSED")
        sys.exit(0)
    else:
        print("\nSome GPU checks FAILED - see above for details")
        sys.exit(1)
Production Notes
GPU resource limits: In Kubernetes, GPUs are an extended resource and cannot be overcommitted, so
the GPU request must equal the GPU limit (unlike CPU/memory, where they can differ). Setting only
the limit is enough, because the request then defaults to the limit; setting only the request is
rejected. If you set both for nvidia.com/gpu, give them the same value.
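The rule can be checked mechanically before a manifest reaches the cluster. The following is a sketch under the assumption that the container's `resources` section has already been loaded into a plain dict (for example from YAML); `gpu_resource_problems` is a hypothetical helper name, and it treats a limit without a request as valid because Kubernetes defaults the request to the limit.

```python
GPU_RESOURCE = "nvidia.com/gpu"


def gpu_resource_problems(resources: dict) -> list[str]:
    """Return human-readable problems with a container's GPU requests/limits."""
    requests = resources.get("requests") or {}
    limits = resources.get("limits") or {}
    gpu_request = requests.get(GPU_RESOURCE)
    gpu_limit = limits.get(GPU_RESOURCE)

    problems = []
    if gpu_limit is None and gpu_request is None:
        problems.append(f"no {GPU_RESOURCE} limit: the container will not be allocated a GPU")
    elif gpu_limit is None:
        problems.append(f"{GPU_RESOURCE} is requested without a matching limit")
    elif gpu_request is not None and str(gpu_request) != str(gpu_limit):
        problems.append(
            f"{GPU_RESOURCE} request ({gpu_request}) must equal the limit ({gpu_limit})"
        )
    return problems


if __name__ == "__main__":
    spec = {"requests": {"cpu": "4"}, "limits": {"cpu": "8", GPU_RESOURCE: "1"}}
    print(gpu_resource_problems(spec) or "GPU resources look consistent")
```

Values are compared as strings so that `1` and `"1"` in a manifest are treated as the same quantity.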
GPU sharing: By default, Kubernetes GPU resources are not divisible - a pod either gets a whole GPU or no GPU. For inference workloads that do not use the full GPU, consider NVIDIA MIG (Multi-Instance GPU) for A100/H100 GPUs, or use a GPU sharing solution like NVIDIA Time Slicing for older GPUs.
CUDA version pinning: Pin CUDA versions explicitly in your Dockerfile base image tag.
Never use nvidia/cuda:latest - CUDA major version upgrades (e.g., 11.x → 12.x) can break
PyTorch compatibility in ways that are hard to debug. Pin to the exact tag:
nvidia/cuda:12.1.0-cudnn8-runtime-ubuntu22.04.
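A pinning policy like this is easy to enforce in CI. The sketch below scans Dockerfile text for FROM lines whose image has no tag or uses `latest`; the function name `unpinned_base_images` is an assumption for this example, and the parser is intentionally simple (it does not expand build arguments such as `${BASE_IMAGE}`).

```python
import re

# Captures the image reference and an optional "AS <stage>" alias on each FROM line.
FROM_LINE = re.compile(
    r"^\s*FROM\s+(?:--platform=\S+\s+)?(\S+)(?:\s+AS\s+(\S+))?",
    re.IGNORECASE | re.MULTILINE,
)


def unpinned_base_images(dockerfile_text: str) -> list[str]:
    """Return base images that have no tag or use the 'latest' tag."""
    stages: set[str] = set()
    unpinned: list[str] = []
    for image, alias in FROM_LINE.findall(dockerfile_text):
        is_stage_reference = image.lower() in stages
        if alias:
            stages.add(alias.lower())
        # Skip earlier build stages, the empty 'scratch' image, and digest-pinned images.
        if is_stage_reference or image.lower() == "scratch" or "@" in image:
            continue
        # Look for the tag only after the last '/', so a registry port is not mistaken for one.
        name = image.rsplit("/", 1)[-1]
        tag = name.split(":", 1)[1] if ":" in name else ""
        if tag in ("", "latest"):
            unpinned.append(image)
    return unpinned


if __name__ == "__main__":
    example = "FROM nvidia/cuda:latest\nRUN pip install torch\n"
    print(unpinned_base_images(example))
```

Failing the build when this list is non-empty keeps an unpinned CUDA base image from reaching a GPU cluster unnoticed.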
:::tip Always Include a GPU Availability Check at Startup
Add assert_gpu_available() or equivalent at the start of every training script. When a
training job runs on CPU instead of GPU due to a misconfiguration, you want it to fail
immediately with a clear error, not train for hours before someone notices the wall-clock time
is wrong. Fail fast, with a diagnostic message that tells the operator exactly what to check.
:::
:::warning CUDA Version Mismatch Can Be Silent
A CUDA version mismatch between container and host does not always produce an error. Sometimes
torch.cuda.is_available() simply returns False and a script with a CPU fallback keeps running on
CPU, and sometimes the job crashes with an opaque error like CUDA error: unknown error. Always
verify GPU access explicitly with a test script before deploying a new container version to
production GPU workloads.
:::
:::danger Privileged Mode for GPU Access
Some older documentation suggests running GPU containers with --privileged. Never do this in
production. Privileged mode gives the container root-level access to the host system - effectively
breaking container isolation. The NVIDIA Container Toolkit provides GPU access without privileged
mode. If you find a guide suggesting privileged mode for GPU access, it is outdated.
:::
Interview Q&A
Q: What is the NVIDIA Container Toolkit and what problem does it solve?
The NVIDIA Container Toolkit (previously called nvidia-docker or nvidia-docker2) enables GPU
access in containers without requiring privileged mode. It installs a custom OCI runtime hook
that intercepts container creation and injects GPU device files (/dev/nvidia0, /dev/nvidiactl)
and driver user-space libraries (libcuda.so, libnvidia-ml.so) into the container's namespace.
The NVIDIA kernel driver remains on the host - the container sees the GPU through these injected
files and libraries.
Q: How do CUDA version compatibility requirements work in GPU containers?
The relationship is: Host NVIDIA Driver → supports maximum CUDA Runtime version. The container's
CUDA runtime version must be less than or equal to the host driver's maximum supported CUDA
version. NVIDIA drivers are backward compatible - a newer driver can run older CUDA runtimes.
For example, driver 535.x supports up to CUDA 12.2, so it can run containers with CUDA 11.8,
12.0, or 12.2. But a host with driver 525.x (max CUDA 12.0) cannot run a container with CUDA 12.2.
Always check nvidia-smi on the host to see the maximum supported CUDA version.
Q: Why does a GPU container work in Docker but fail to use GPU in Kubernetes?
The most common reason: the Kubernetes pod spec is missing the nvidia.com/gpu resource limit.
Without it, the NVIDIA device plugin does not allocate GPU devices to the pod, and
/dev/nvidia* files are not mounted in the container. The container then has no GPU access and
PyTorch silently falls back to CPU. Fix: add nvidia.com/gpu: "1" to resources.limits in the pod
spec (and, if you also set resources.requests, use the same value there). Also check that the
NVIDIA device plugin DaemonSet is running and that the pod has tolerations matching the taints
on the GPU nodes.
Q: What is CUDA_VISIBLE_DEVICES and how do you use it in ML containers?
CUDA_VISIBLE_DEVICES is an environment variable that controls which GPUs are visible to CUDA
applications running in the process. CUDA_VISIBLE_DEVICES=0 shows only GPU 0. CUDA_VISIBLE_DEVICES=0,2
shows GPUs 0 and 2, which the process then sees as devices 0 and 1. CUDA_VISIBLE_DEVICES="" shows
no GPUs (forces CPU mode). At the container level, the equivalent control is NVIDIA_VISIBLE_DEVICES,
which the NVIDIA container runtime uses to decide which of the allocated GPUs to expose;
CUDA_VISIBLE_DEVICES then narrows the selection further inside the container. In multi-GPU training
scripts, you can set it per process, before CUDA initializes, to restrict each process to its
assigned GPU for data-parallel training.
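The three cases in this answer (unset, empty, and an explicit list) are worth distinguishing in diagnostic code. The sketch below is a simplified reading of the variable that assumes plain comma-separated entries; `visible_device_ids` and `logical_to_physical` are hypothetical helper names, and real CUDA parsing has additional rules that are not modelled here.

```python
import os
from typing import Mapping, Optional


def visible_device_ids(env: Optional[Mapping[str, str]] = None) -> Optional[list[str]]:
    """Interpret CUDA_VISIBLE_DEVICES.

    Returns None when the variable is unset (no restriction), an empty list
    when it is set but empty (no GPUs visible), otherwise the listed entries.
    """
    env = os.environ if env is None else env
    value = env.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return None
    return [part.strip() for part in value.split(",") if part.strip()]


def logical_to_physical(ids: list[str]) -> dict[int, str]:
    """Map the index the process sees (cuda:0, cuda:1, ...) to the listed entry."""
    return dict(enumerate(ids))


if __name__ == "__main__":
    ids = visible_device_ids({"CUDA_VISIBLE_DEVICES": "0,2"})
    print(ids, logical_to_physical(ids or []))
```

The mapping makes the renumbering explicit: with CUDA_VISIBLE_DEVICES=0,2, the device the process calls cuda:1 is physical GPU 2.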
Q: How do you test that a GPU container is correctly using the GPU before deploying to production?
Include a diagnostic script in your container that: (1) runs nvidia-smi and checks it returns
successfully, (2) checks torch.cuda.is_available() returns True, (3) performs an actual tensor
operation on GPU and verifies it completes without error. Run this script in CI after building the
container and again on the first pod startup in production. Add assertions at the start of training
scripts that fail immediately if GPU is not available when expected, rather than silently running
on CPU.
