Kubernetes and Auto-Scaling for LLMs

The 3 AM Alert That Rewrites Your Architecture

It is 3:17 AM on a Tuesday. Your on-call phone lights up. The LLM-powered coding assistant your team shipped three months ago - the one that was getting 500 requests per day when you launched - is now getting 50,000. A viral tweet at 11 PM started it. By midnight, your three GPU servers were fully saturated. By 1 AM, the queue depth hit 10,000 pending requests. By 3 AM, your load balancer is returning 503s to every request, your SRE is on a call with your CEO, and you are staring at a Grafana dashboard that looks like a cliff face.

You have three A100 nodes in your cluster. You know there are more available in your cloud provider's GPU pool. But your serving infrastructure - three bare-metal vLLM processes behind an Nginx reverse proxy - has no concept of elastic scaling. It cannot provision new nodes. It cannot distribute load across more replicas. It cannot even tell the difference between "completely saturated" and "responding slowly." The entire scaling surface is a single number in an Nginx upstream block.

You provision two more GPU nodes manually. It takes 23 minutes to get them running - boot time, CUDA driver install, Docker image pull, model weight download. By the time they are serving traffic, half your users have given up and left. Your SLO report for the week will show 4.2 hours of degraded performance. Your CEO asks why you cannot scale automatically "like AWS does for web servers." You have no good answer.

This is the story that drives every engineering team from ad-hoc GPU servers to a proper Kubernetes-based LLM serving platform. The difference between those three bare-metal boxes and a production Kubernetes deployment is not just operational convenience - it is the difference between manually responding to every scaling event at 3 AM and having infrastructure that detects the spike, provisions nodes, schedules pods, and routes traffic, all in the time it takes you to pour a coffee.

This lesson builds that infrastructure. We cover the NVIDIA device plugin that makes Kubernetes GPU-aware, the node pool design that keeps GPU nodes cost-efficient, the HPA and KEDA configurations that trigger scaling on the right signals (not CPU percentage), and the complete Helm chart for a production vLLM deployment. By the end, you will have the Kubernetes manifests and scaling policies to handle the 100x traffic spike - and the 3 AM alert will be a Slack notification that says "autoscaler provisioned 2 new GPU nodes, scaling event complete."


Why This Exists - The Problem with Bare-Metal LLM Serving

Before Kubernetes became the standard for LLM infrastructure, teams ran inference servers the way they ran everything else in 2015: a fixed number of machines, each running a process, traffic distributed by a static load balancer. This worked for stateless web services because web servers are cheap, fast to start, and homogeneous. A web server starts in under a second. You can run 100 of them on a single machine. When traffic drops, you just let them sit idle - the cost is negligible.

GPU inference servers violate every one of those assumptions. A single A100 80GB GPU costs roughly $3/hour on-demand. Cold-starting a vLLM server - loading the CUDA drivers, pulling the Docker image, downloading model weights from object storage, loading weights into GPU memory - takes 8 to 25 minutes depending on model size. You cannot run multiple large model instances on a single GPU without MIG partitioning. And idle GPU nodes cost exactly the same as busy ones.

The result was a painful tradeoff: provision for peak load (expensive and wasteful at baseline) or provision for average load (cheap but catastrophically slow at peak). Teams oscillated between these two failure modes until cloud-native GPU orchestration became mature enough to offer a third option: provision exactly what you need, exactly when you need it, and terminate it when you do not.

Kubernetes addresses this through three interlocking systems. The scheduler places pods on nodes based on resource requirements, including GPU resources exposed by the NVIDIA device plugin. The Horizontal Pod Autoscaler adjusts the number of running pods based on observed metrics. The Cluster Autoscaler (or Karpenter, its more modern replacement) provisions and terminates nodes based on unschedulable pod demand. Together, these three systems implement the elastic scaling loop that bare-metal infrastructure cannot.

The remaining challenge - the one that makes LLM scaling harder than web server scaling - is that LLM request cost is highly variable. A request with a 4,000-token context that generates 2,000 tokens takes roughly 8x longer and uses 8x more GPU memory than a request with 500 tokens generating 250 tokens. Simple CPU/memory-based autoscaling metrics do not capture this. A pod can look "idle" by CPU metrics while its GPU is 100% utilized running a slow, large-context request. This is why LLM autoscaling requires custom metrics - queue depth, GPU utilization, time-to-first-token - and why KEDA (Kubernetes Event-Driven Autoscaling) is often a better fit than standard HPA for LLM workloads.


Historical Context - From GPU Servers to GPU-Native Kubernetes

GPU scheduling in Kubernetes has a surprisingly short history. Before 2017, Kubernetes had no native concept of GPU resources. Teams that wanted to run GPU workloads on Kubernetes used a combination of hostPath volume mounts for CUDA libraries and nodeSelector hacks to target GPU nodes. This was fragile and completely non-portable.

The NVIDIA device plugin, released in late 2017 as part of the Kubernetes device plugin framework (introduced in Kubernetes 1.8), changed this. It gave Kubernetes a standard API for discovering, advertising, and allocating GPU resources to pods. The plugin runs as a DaemonSet on every GPU node, queries the NVML (NVIDIA Management Library) for available GPUs, and registers them with the kubelet as nvidia.com/gpu resources. Pods could now request GPUs with the same resources.requests and resources.limits fields they used for CPU and memory.

The scaling story evolved more slowly. The standard Horizontal Pod Autoscaler, introduced in Kubernetes 1.1, scaled on CPU and memory. For stateless web services, this was sufficient. For GPU inference, it was not - GPU utilization is exposed through NVIDIA's DCGM Exporter, not through the standard Kubernetes metrics pipeline. The Custom Metrics API, introduced in Kubernetes 1.6, allowed HPA to scale on arbitrary metrics from external sources. But wiring up GPU metrics to HPA required deploying DCGM Exporter, Prometheus, the Prometheus Adapter, and configuring the metrics pipeline - a non-trivial amount of work.

KEDA (Kubernetes Event-Driven Autoscaling), first released in 2019 and donated to CNCF in 2020, simplified this significantly. Rather than requiring teams to implement the Custom Metrics API themselves, KEDA provided a library of pre-built scalers for common event sources: SQS queues, Redis lists, Kafka topics, Prometheus queries. For LLM serving, the Prometheus scaler meant teams could write a PromQL query (e.g., avg(vllm_queue_depth)) and have KEDA automatically scale pods when the queue depth crossed a threshold. This is now the dominant pattern for LLM autoscaling in production.

Karpenter, released by AWS as an open-source project in 2021, addressed the node provisioning side. The original Kubernetes Cluster Autoscaler required pre-defined node groups and was slow to react. Karpenter watches for unschedulable pods and provisions exactly the right instance type to satisfy the pending pod's resource requirements - including GPU type, GPU count, and memory - without requiring pre-configured node groups. For LLM workloads where you might want to scale from 0 to 10 A100 nodes in response to demand, Karpenter is substantially faster and more cost-efficient than the original cluster autoscaler.


Core Concepts

The NVIDIA Device Plugin and GPU Scheduling

The NVIDIA device plugin solves a fundamental problem: how does Kubernetes know a node has GPUs, and how does it allocate them to pods without double-booking?

The device plugin runs as a DaemonSet on every node with NVIDIA GPUs. It performs three functions. First, it discovers available GPUs using NVML and registers them with the kubelet as nvidia.com/gpu resources. Second, it watches for pod assignments from the scheduler and allocates specific GPU devices (identified by UUID) to specific pods. Third, it injects the necessary environment variables (NVIDIA_VISIBLE_DEVICES, NVIDIA_DRIVER_CAPABILITIES) into the pod's containers so the CUDA runtime knows which GPU to use.

The GPU allocation model is binary: a GPU is either allocated to a pod or it is not. This means that if you request 1 GPU and your model only uses 40% of the GPU's compute and memory, the remaining 60% is wasted. This is the core inefficiency that MIG (Multi-Instance GPU) partitioning addresses.

# NVIDIA Device Plugin DaemonSet
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      priorityClassName: "system-node-critical"
      containers:
        - image: nvcr.io/nvidia/k8s-device-plugin:v0.14.1
          name: nvidia-device-plugin-ctr
          env:
            - name: FAIL_ON_INIT_ERROR
              value: "false"
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
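
Once the plugin is running, a quick way to confirm GPU scheduling works end-to-end is a throwaway pod that requests one GPU and runs nvidia-smi. The sketch below makes assumptions about the pod name and the CUDA image tag - any CUDA base image will do.

# Minimal GPU smoke-test pod (sketch) - pod name and image tag are illustrative
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  containers:
    - name: cuda-smoke
      image: nvidia/cuda:12.2.0-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: "1"

If the pod's logs show a GPU listed by nvidia-smi, the device plugin, taint tolerations, and scheduling are wired correctly.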

GPU Node Pools with Taints and Tolerations

GPU nodes are expensive. You do not want non-GPU workloads scheduled onto them, consuming CPU and memory and preventing GPU pods from landing. The standard pattern is to taint GPU nodes with nvidia.com/gpu=present:NoSchedule and require GPU pods to tolerate that taint.

Node pools group nodes by instance type. A well-designed GPU cluster typically has at least two node pools: a GPU node pool for inference servers and a CPU node pool for everything else (API gateways, monitoring, queue processors).

# Karpenter NodePool for GPU inference
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: gpu-inference
spec:
  template:
    metadata:
      labels:
        workload-type: gpu-inference
    spec:
      nodeClassRef:
        apiVersion: karpenter.k8s.aws/v1beta1
        kind: EC2NodeClass
        name: gpu-nodes
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values:
            - "p4d.24xlarge"   # 8x A100 40GB
            - "p4de.24xlarge"  # 8x A100 80GB
            - "p5.48xlarge"    # 8x H100 80GB
      taints:
        - key: nvidia.com/gpu
          value: "present"
          effect: NoSchedule
  limits:
    nvidia.com/gpu: 64 # Max 8 nodes of 8-GPU A100 instances
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 60s # Terminate empty GPU nodes quickly
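
The NodePool references an EC2NodeClass named gpu-nodes. A minimal sketch of that object follows - the discovery tags, IAM role name, and disk size are assumptions you would replace with your cluster's values.

# EC2NodeClass referenced by the NodePool above (sketch) - tags, role, and volume size are assumptions
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: gpu-nodes
spec:
  amiFamily: AL2                      # Karpenter picks the GPU variant of the Amazon Linux 2 AMI
  role: "KarpenterNodeRole-llm-cluster"
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: llm-cluster
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: llm-cluster
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 500Gi             # Room for container images and a local model-weight cache
        volumeType: gp3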

Pod Spec for GPU Workloads

A vLLM pod spec must specify GPU resource requests and limits, tolerate the GPU node taint, and optionally use node affinity to prefer specific GPU types.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
  namespace: llm-serving
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-server
  template:
    metadata:
      labels:
        app: vllm-server
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8000"
        prometheus.io/path: "/metrics"
    spec:
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      nodeSelector:
        workload-type: gpu-inference
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.4.2
          args:
            - "--model"
            - "meta-llama/Llama-3-8B-Instruct"
            - "--tensor-parallel-size"
            - "1"
            - "--max-model-len"
            - "8192"
            - "--gpu-memory-utilization"
            - "0.90"
            - "--port"
            - "8000"
          ports:
            - containerPort: 8000
              name: http
          resources:
            requests:
              nvidia.com/gpu: "1"
              memory: "16Gi"
              cpu: "4"
            limits:
              nvidia.com/gpu: "1"
              memory: "32Gi"
              cpu: "8"
          env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: token
          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 180
            periodSeconds: 30
            failureThreshold: 5
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 180
            periodSeconds: 10
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-weights-pvc
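
The Deployment needs a Service in front of it so your gateway or ingress has a stable virtual IP while pods scale in and out. A minimal sketch - the name mirrors the Deployment; adjust the external port to whatever your gateway expects.

# ClusterIP Service for the vLLM Deployment above (sketch)
apiVersion: v1
kind: Service
metadata:
  name: vllm-server
  namespace: llm-serving
spec:
  type: ClusterIP
  selector:
    app: vllm-server
  ports:
    - name: http
      port: 80          # Port the gateway calls
      targetPort: 8000  # vLLM's listening port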

Persistent Volume Claims for Model Weights

Model weights are large. Llama 3 8B is roughly 15GB in FP16. Llama 3 70B is roughly 140GB. Downloading these from HuggingFace Hub on every pod restart is slow (minutes to hours) and expensive (network egress costs). The standard pattern is to pre-download weights to a PersistentVolume backed by shared network storage (EFS on AWS, Filestore on GCP, Azure Files on Azure) and mount it read-only across all inference pods.

# StorageClass for model weights - ReadWriteMany via EFS
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: model-weights-efs
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap
  fileSystemId: fs-0a1b2c3d4e5f
  directoryPerms: "700"
reclaimPolicy: Retain
volumeBindingMode: Immediate

---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-weights-pvc
  namespace: llm-serving
spec:
  accessModes:
    - ReadWriteMany # The download Job writes once; inference pods mount the same claim read-only
  storageClassName: model-weights-efs
  resources:
    requests:
      storage: 200Gi

---
# One-time Job to pre-download model weights to the shared volume
apiVersion: batch/v1
kind: Job
metadata:
  name: download-model-weights
  namespace: llm-serving
spec:
  template:
    spec:
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: downloader
          image: python:3.11-slim
          command:
            - /bin/sh
            - -c
            - |
              # The slim image does not ship huggingface_hub, so install it first
              pip install --quiet huggingface_hub
              python - <<'PY'
              from huggingface_hub import snapshot_download
              import os
              snapshot_download(
                  repo_id="meta-llama/Llama-3-8B-Instruct",
                  local_dir="/model-weights/llama-3-8b-instruct",
                  token=os.environ["HF_TOKEN"],
              )
              PY
          env:
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: token
          volumeMounts:
            - name: model-weights
              mountPath: /model-weights
      restartPolicy: OnFailure
      volumes:
        - name: model-weights
          persistentVolumeClaim:
            claimName: model-weights-pvc

StatefulSet vs Deployment for Model Servers

For most vLLM deployments, a Deployment is the correct choice. Deployments provide rolling updates, easy scaling, and work well with stateless inference servers. The KV cache in vLLM is in-process memory - it does not persist across pod restarts and does not need stable network identities or persistent storage per-replica.

Use a StatefulSet only if your model server requires stable network identity (for example, because you are implementing prefix-aware routing that sends specific request hashes to specific pod replicas by DNS name) or if you need per-replica persistent storage. These are rare requirements. The overhead of StatefulSets - slower rolling updates, more complex scaling - is not worth it for standard inference workloads.

HPA with Custom Metrics for GPU Utilization

The standard HPA scales on CPU utilization. For LLM inference, GPU utilization is the correct signal. This requires three components: DCGM Exporter (exports GPU metrics to Prometheus), Prometheus (stores the metrics), and the Prometheus Adapter (exposes Prometheus metrics via the Kubernetes Custom Metrics API so HPA can read them).

# HPA scaling on GPU utilization and queue depth via custom metrics
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
  namespace: llm-serving
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-server
  minReplicas: 1
  maxReplicas: 8
  metrics:
    - type: Pods
      pods:
        metric:
          name: dcgm_fi_dev_gpu_util
        target:
          type: AverageValue
          averageValue: "70" # Scale when avg GPU util exceeds 70%
    - type: Pods
      pods:
        metric:
          name: vllm_num_requests_waiting
        target:
          type: AverageValue
          averageValue: "5" # Scale when avg queue depth exceeds 5
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 2 # Add max 2 pods per scale event
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 1 # Remove max 1 pod per scale event
          periodSeconds: 120
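
The dcgm_fi_dev_gpu_util and vllm_num_requests_waiting pod metrics above only exist once the Prometheus Adapter is told how to expose the underlying Prometheus series through the Custom Metrics API. One adapter rule is sketched below; the label names (exported_namespace, exported_pod) depend on your scrape configuration and are assumptions here.

# Prometheus Adapter rule (sketch) mapping the DCGM series to a per-pod custom metric
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-adapter-config
  namespace: monitoring
data:
  config.yaml: |
    rules:
      - seriesQuery: 'DCGM_FI_DEV_GPU_UTIL{exported_namespace!="",exported_pod!=""}'
        resources:
          overrides:
            exported_namespace: {resource: "namespace"}
            exported_pod: {resource: "pod"}
        name:
          as: "dcgm_fi_dev_gpu_util"
        metricsQuery: 'avg(DCGM_FI_DEV_GPU_UTIL{<<.LabelMatchers>>}) by (<<.GroupBy>>)'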

KEDA for Queue-Based Scaling

KEDA is often a better fit for LLM workloads because it can scale to zero (HPA cannot go below 1 replica) and has native scalers for common queue systems. If you are using an SQS queue or Redis list as your request queue, KEDA can scale inference pods directly from queue depth.

# KEDA ScaledObject for SQS-based LLM request queue
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-keda-scaler
  namespace: llm-serving
spec:
  scaleTargetRef:
    name: vllm-server
  pollingInterval: 15
  cooldownPeriod: 300 # Wait 5 min after last scale event before scale-to-zero
  minReplicaCount: 0  # True scale-to-zero for batch workloads
  maxReplicaCount: 8
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456789/llm-requests
        queueLength: "10" # Target 10 messages per pod
        awsRegion: us-east-1
        scaleOnInFlight: "true"
      authenticationRef:
        name: keda-aws-credentials

---
# Prometheus-based KEDA scaler - uses vLLM queue depth metric directly
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-prometheus-scaler
  namespace: llm-serving
spec:
  scaleTargetRef:
    name: vllm-server
  pollingInterval: 30
  cooldownPeriod: 180
  minReplicaCount: 1
  maxReplicaCount: 8
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
        metricName: vllm_queue_depth
        threshold: "8"
        query: avg(vllm_num_requests_waiting)
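
The SQS scaler's authenticationRef points at a TriggerAuthentication object. A minimal sketch follows, assuming IRSA (IAM Roles for Service Accounts) on EKS - the name matches the reference above; everything else depends on how your cluster grants AWS credentials.

# TriggerAuthentication (sketch) - lets the SQS scaler use the IAM role bound to KEDA's service account
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: keda-aws-credentials
  namespace: llm-serving
spec:
  podIdentity:
    provider: aws-eks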

MIG - Multi-Instance GPU Partitioning

Multi-Instance GPU (MIG) is NVIDIA's feature for partitioning a single A100 or H100 into multiple smaller GPU instances. Each MIG instance gets a guaranteed slice of compute, memory bandwidth, and L2 cache - it is hardware isolation, not software time-sharing. One MIG instance cannot observe or affect another's memory or compute state.

MIG profiles on an A100 80GB:

| Profile | Compute Fraction | Memory | Instances per GPU |
|---------|------------------|--------|-------------------|
| 1g.10gb | 1/7              | 10GB   | 7                 |
| 2g.20gb | 2/7              | 20GB   | 3                 |
| 3g.40gb | 3/7              | 40GB   | 2                 |
| 7g.80gb | 7/7              | 80GB   | 1 (full GPU)      |

For smaller models (7B parameter models quantized to INT4 that fit in 8-10GB), MIG allows you to run 3-7 independent inference servers on a single A100, dramatically improving cost efficiency.

# MIG configuration via NVIDIA GPU Operator ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: mig-parted-config
  namespace: gpu-operator
data:
  config.yaml: |
    version: v1
    mig-configs:
      all-1g.10gb:
        - devices: all
          mig-enabled: true
          mig-devices:
            "1g.10gb": 7 # 7 small instances per A100 80GB
      all-3g.40gb:
        - devices: all
          mig-enabled: true
          mig-devices:
            "3g.40gb": 2 # 2 medium instances per A100 80GB

---
# Pod requesting a specific MIG slice
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-small-model
  namespace: llm-serving
spec:
  replicas: 3
  selector:
    matchLabels:
      app: vllm-small-model
  template:
    metadata:
      labels:
        app: vllm-small-model
    spec:
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.4.2
          args:
            - "--model"
            - "mistralai/Mistral-7B-Instruct-v0.2"
            - "--gpu-memory-utilization"
            - "0.85"
          resources:
            requests:
              nvidia.com/mig-3g.40gb: "1"
            limits:
              nvidia.com/mig-3g.40gb: "1"

The math behind MIG cost efficiency is straightforward. An A100 80GB costs roughly $3/hour. Without MIG, hosting a 7B INT4-quantized model (needing ~5GB of VRAM) on that GPU wastes about 75GB of capacity. With 1g.10gb MIG partitioning, the same GPU hosts 7 independent inference processes, reducing the effective cost per inference server from $3/hour to roughly $0.43/hour.



Complete Helm Chart for vLLM

A Helm chart packages all Kubernetes manifests into a deployable unit with configurable values. Here is the structure for a production-ready chart.

vllm-serving/
  Chart.yaml
  values.yaml
  templates/
    deployment.yaml
    service.yaml
    hpa.yaml
    scaledobject.yaml
    pvc.yaml
    serviceaccount.yaml
    configmap.yaml
    secret.yaml

values.yaml:

replicaCount: 1

image:
  repository: vllm/vllm-openai
  tag: "v0.4.2"
  pullPolicy: IfNotPresent

model:
  name: "meta-llama/Llama-3-8B-Instruct"
  tensorParallelSize: 1
  maxModelLen: 8192
  gpuMemoryUtilization: 0.90

resources:
  requests:
    nvidia.com/gpu: "1"
    memory: "16Gi"
    cpu: "4"
  limits:
    nvidia.com/gpu: "1"
    memory: "32Gi"
    cpu: "8"

autoscaling:
  enabled: true
  minReplicas: 1
  maxReplicas: 8
  targetQueueDepth: 8
  scaleDownCooldown: 300

persistence:
  enabled: true
  existingClaim: "model-weights-pvc"
  mountPath: "/root/.cache/huggingface"

service:
  type: ClusterIP
  port: 80
  targetPort: 8000

nodeSelector:
  workload-type: gpu-inference

tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule

monitoring:
  enabled: true
  serviceMonitor: true

huggingfaceToken:
  secretName: "hf-token"
  secretKey: "token"

templates/deployment.yaml (key sections):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ include "vllm-serving.fullname" . }}
  namespace: {{ .Release.Namespace }}
spec:
  replicas: {{ .Values.replicaCount }}
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0 # Never reduce below current capacity during rollout
      maxSurge: 1       # Add one new pod at a time
  template:
    spec:
      terminationGracePeriodSeconds: 600
      tolerations:
        {{- toYaml .Values.tolerations | nindent 8 }}
      nodeSelector:
        {{- toYaml .Values.nodeSelector | nindent 8 }}
      containers:
        - name: vllm
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
          args:
            - "--model"
            - {{ .Values.model.name | quote }}
            - "--tensor-parallel-size"
            - {{ .Values.model.tensorParallelSize | quote }}
            - "--max-model-len"
            - {{ .Values.model.maxModelLen | quote }}
            - "--gpu-memory-utilization"
            - {{ .Values.model.gpuMemoryUtilization | quote }}
            - "--port"
            - "8000"
          resources:
            {{- toYaml .Values.resources | nindent 12 }}
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 180
            periodSeconds: 30
            failureThreshold: 5
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 180
            periodSeconds: 10
            failureThreshold: 3
          lifecycle:
            preStop:
              exec:
                command:
                  - "/bin/sh"
                  - "-c"
                  - "sleep 540" # Allow in-flight requests to drain

Deploy with Helm:

# Install the chart
helm install vllm-llama3 ./vllm-serving \
--namespace llm-serving \
--create-namespace \
--set model.name="meta-llama/Llama-3-8B-Instruct" \
--set autoscaling.maxReplicas=4 \
--set persistence.existingClaim=model-weights-pvc

# Upgrade with new image
helm upgrade vllm-llama3 ./vllm-serving \
--namespace llm-serving \
--set image.tag="v0.5.0" \
--reuse-values

# Check rollout status
kubectl rollout status deployment/vllm-llama3 -n llm-serving

# View current HPA state
kubectl get hpa -n llm-serving -w

Production Engineering Notes

Cold Start Latency is Your Biggest Operational Risk

GPU node cold start takes 8-25 minutes. Pod cold start (assuming the node is already running) takes 2-8 minutes for large models (image pull plus model load into GPU VRAM). This means reactive autoscaling cannot absorb a sudden spike in less than roughly 10-30 minutes. The mitigation strategies are:

1. Maintain minimum replicas. Set minReplicas: 1 or higher. Scale to zero only for truly bursty, non-latency-sensitive batch workloads.

2. Pre-warm nodes. Keep GPU nodes in a "ready but empty" state using a low-resource pause pod. The node is running and incurring cost, but not running inference until needed. This reduces scale-out from 15 minutes (node provision plus pod start) to 3-5 minutes (pod start only).

3. Use predictive scaling. If you have predictable traffic patterns (business hours spike, weekend lull), schedule pre-scaling with a CronJob-based replica adjustment before the spike arrives.

# Pre-warming script - scale up before predicted traffic spikes
# Deploy as a Kubernetes CronJob triggered 15 min before peak hours

from kubernetes import client, config
import datetime

config.load_incluster_config()
apps_v1 = client.AppsV1Api()

def scale_for_peak(target_replicas: int = 4):
    apps_v1.patch_namespaced_deployment_scale(
        name="vllm-server",
        namespace="llm-serving",
        body={"spec": {"replicas": target_replicas}},
    )
    print(f"[{datetime.datetime.now().isoformat()}] Scaled vllm-server to {target_replicas} replicas")

scale_for_peak()

# CronJob for predictive pre-scaling
apiVersion: batch/v1
kind: CronJob
metadata:
  name: vllm-prescale-morning
  namespace: llm-serving
spec:
  schedule: "45 8 * * 1-5" # 8:45 AM weekdays, 15 min before 9 AM peak
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: vllm-scaler
          containers:
            - name: prescale
              image: python:3.11-slim
              # The slim image does not ship the kubernetes client, so install it first
              command: ["/bin/sh", "-c", "pip install --quiet kubernetes && python /scripts/prescale.py"]
              volumeMounts:
                - name: scripts
                  mountPath: /scripts
          volumes:
            - name: scripts
              configMap:
                name: prescale-script # ConfigMap holding the script above
          restartPolicy: OnFailure
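
The CronJob runs as the vllm-scaler service account, which needs permission to patch the Deployment's scale subresource. A minimal RBAC sketch, with names matching the manifests above:

# ServiceAccount and RBAC for the pre-scaling CronJob (sketch)
apiVersion: v1
kind: ServiceAccount
metadata:
  name: vllm-scaler
  namespace: llm-serving
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: vllm-scaler
  namespace: llm-serving
rules:
  - apiGroups: ["apps"]
    resources: ["deployments", "deployments/scale"]
    verbs: ["get", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: vllm-scaler
  namespace: llm-serving
subjects:
  - kind: ServiceAccount
    name: vllm-scaler
    namespace: llm-serving
roleRef:
  kind: Role
  name: vllm-scaler
  apiGroup: rbac.authorization.k8s.io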

DCGM Exporter for GPU Observability

# DCGM Exporter DaemonSet - exports GPU metrics to Prometheus
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  template:
    metadata:
      labels:
        app: dcgm-exporter
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9400"
    spec:
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: dcgm-exporter
          image: nvcr.io/nvidia/k8s/dcgm-exporter:3.2.5-3.1.7-ubuntu20.04
          ports:
            - name: metrics
              containerPort: 9400
          securityContext:
            runAsNonRoot: false
            runAsUser: 0
          resources:
            requests:
              memory: "128Mi"
              cpu: "100m"

Key DCGM metrics to alert on:

| Metric | Meaning | Alert Threshold |
|--------|---------|-----------------|
| DCGM_FI_DEV_GPU_UTIL | GPU compute utilization | Above 90% sustained |
| DCGM_FI_DEV_FB_USED | VRAM used (bytes) | Above 95% of capacity |
| DCGM_FI_DEV_POWER_USAGE | GPU power draw (watts) | Above TDP for thermal risk |
| DCGM_FI_DEV_GPU_TEMP | GPU temperature (Celsius) | Above 83C for A100 |
| DCGM_FI_DEV_MEM_COPY_UTIL | Memory bandwidth utilization | Above 85% |
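
Two of these thresholds translate into Prometheus alert rules as sketched below. This assumes the Prometheus Operator's PrometheusRule CRD and DCGM Exporter's Hostname label; adjust labels to match your scrape configuration.

# PrometheusRule (sketch) for GPU saturation and VRAM pressure alerts
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-alerts
  namespace: monitoring
spec:
  groups:
    - name: gpu.rules
      rules:
        - alert: GpuSaturated
          expr: avg by (Hostname) (DCGM_FI_DEV_GPU_UTIL) > 90
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "GPU utilization above 90% for 10 minutes"
        - alert: GpuMemoryNearFull
          expr: DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) > 0.95
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "GPU VRAM above 95% of capacity"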

Resource Quotas and Namespace Isolation

In multi-tenant clusters, use ResourceQuotas to prevent any single team from consuming all GPU resources:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: team-a-llm
spec:
  hard:
    requests.nvidia.com/gpu: "4"
    limits.nvidia.com/gpu: "4"
    requests.memory: "128Gi"
    persistentvolumeclaims: "10"

Spot/Preemptible GPU Instance Strategy

Spot GPU instances cost 60-90% less than On-Demand but can be interrupted with 2 minutes of notice (AWS) or 30 seconds (GCP). Making Spot viable for LLM serving requires:

  1. AWS Node Termination Handler - watches for Spot interruption notices and cordons/drains the node, allowing in-flight requests to complete within the 2-minute window
  2. Diversified instance types - bid on multiple GPU types (p3.2xlarge, p3.8xlarge, g5.xlarge) to increase Spot availability
  3. Spot for capacity, On-Demand for baseline - Karpenter can prefer Spot but fall back to On-Demand when Spot is unavailable

# Karpenter NodePool with Spot preference
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"] # Prefer spot, fall back to on-demand
        - key: node.kubernetes.io/instance-type
          operator: In
          values:
            - "p3.2xlarge"  # 1x V100 16GB
            - "p3.8xlarge"  # 4x V100 16GB
            - "g5.xlarge"   # 1x A10G 24GB
            - "g5.2xlarge"  # 1x A10G 24GB, more CPU

Common Mistakes

:::danger Setting CPU/Memory as the HPA Target for GPU Inference

The most common mistake when moving LLM inference to Kubernetes is using targetCPUUtilizationPercentage in the HPA. GPU inference barely touches the CPU - the CPU is mostly idle while the GPU runs the forward pass. An HPA watching CPU will see 5-10% utilization even when the GPU is fully saturated and requests are queuing. The cluster will not scale and SLOs will be missed. Always use GPU utilization (via DCGM) or request queue depth as your HPA/KEDA target metric.

:::

:::danger Forgetting initialDelaySeconds on Liveness Probes

vLLM and other large model servers take 2-10 minutes to load model weights into GPU memory before they can serve requests. If you set initialDelaySeconds: 30 on your liveness probe (the default for many Helm charts), Kubernetes will kill the pod before it has finished loading - and then restart it - creating an infinite restart loop. Set initialDelaySeconds to at least 2x your measured cold start time: 180-240 seconds for 7B models and 300-400 seconds for 70B models.

:::

:::warning Not Setting terminationGracePeriodSeconds

The default Kubernetes terminationGracePeriodSeconds is 30 seconds. An LLM generation request can take 30-120 seconds to complete. If you deploy a rolling update or scale down without setting a sufficiently long grace period, Kubernetes will SIGKILL your pod mid-generation, cutting off users mid-response. Set terminationGracePeriodSeconds: 600 for safety and implement a preStop hook that sleeps long enough for in-flight requests to complete.

:::

:::warning Using Deployment maxUnavailable: 1 During Rolling Updates

With maxUnavailable: 1, Kubernetes terminates an old pod before confirming the new pod is ready. For GPU workloads where each pod holds significant capacity and readiness takes 3-5 minutes, this creates an unavailability window. Set maxUnavailable: 0 and maxSurge: 1 to ensure you never go below your current serving capacity during a rollout - at the cost of temporarily needing one extra GPU slot. Always ensure your node pool has spare GPU capacity for this surge pod.

:::

:::warning Requesting More GPU Memory than the Model Needs

Setting --gpu-memory-utilization 0.95 and requesting a full GPU for a 7B model that only needs 15GB on a 40GB GPU wastes 25GB of VRAM. Profile your model's actual VRAM usage at your target batch size and context length, then set --gpu-memory-utilization to leave 10-15% buffer for KV cache growth at peak load. Use MIG partitioning when hosting multiple small models on a single large GPU - the cost savings can be 3-7x.

:::


Interview Q&A

Q: How does the NVIDIA device plugin enable GPU scheduling in Kubernetes, and what prevents two pods from being allocated the same GPU?

The NVIDIA device plugin runs as a DaemonSet on every GPU node and implements the Kubernetes Device Plugin API. It discovers the node's GPUs via NVML (the NVIDIA Management Library) and advertises them to the kubelet over the ListAndWatch stream as nvidia.com/gpu resources, with a count equal to the number of physical GPUs on the node.

When the scheduler assigns a pod to a node, the kubelet's device manager picks specific unallocated GPU device IDs (optionally consulting the plugin's GetPreferredAllocation) and passes them to the plugin's Allocate call. The plugin responds with the runtime configuration for those devices - most importantly the NVIDIA_VISIBLE_DEVICES environment variable injected into the container - so the CUDA runtime exposes only those specific GPU UUIDs to the process.

Double-booking is prevented by the kubelet's device manager, which records which device IDs belong to which pods in its allocation checkpoint; an allocated GPU is not offered to another pod until the owning pod terminates and the device returns to the free pool. The Kubernetes scheduler also enforces resource constraints at the cluster level - it will not place a pod on a node whose advertised nvidia.com/gpu allocatable capacity is already fully requested.

Q: What is the difference between HPA and KEDA for LLM autoscaling, and when would you choose each?

HPA (Horizontal Pod Autoscaler) is a Kubernetes-native controller that scales pods based on metrics from the Metrics API or Custom Metrics API. It requires you to implement the metrics pipeline yourself (DCGM Exporter to Prometheus to Prometheus Adapter to Custom Metrics API), and its minimum replica count is 1 - it cannot scale to zero.

KEDA (Kubernetes Event-Driven Autoscaling) wraps HPA but adds a library of pre-built scalers for SQS, Redis, Kafka, Prometheus, and many other event sources. More importantly, KEDA supports minReplicaCount: 0, enabling true scale-to-zero.

For latency-sensitive LLM serving with a fixed minimum replica count, HPA with Prometheus-based GPU utilization metrics works well and has no additional dependencies beyond the metrics pipeline. For batch LLM workloads - nightly report generation, async document processing - where cost optimization matters and latency on the first request after idle is acceptable, KEDA's scale-to-zero saves significant GPU costs. In practice, many teams use KEDA for both because its Prometheus scaler is simpler to configure than the full Prometheus Adapter + Custom Metrics API pipeline.

Q: Explain MIG partitioning. When should you use 1g.10gb profiles vs 3g.40gb profiles on an A100 80GB?

MIG (Multi-Instance GPU) is a hardware feature in NVIDIA A100 and H100 that partitions a single GPU into isolated slices. Each slice receives a fixed fraction of streaming multiprocessors, L2 cache, and HBM memory bandwidth. The isolation is hardware-enforced - processes in different MIG instances cannot access each other's memory or interfere with each other's compute.

Use 1g.10gb (7 instances per A100 80GB) for INT4-quantized 7B models (4-5GB VRAM), embedding models, or small classification models. This profile gives the highest density but the least memory bandwidth per instance - not suitable for models that are memory bandwidth-bound.

Use 3g.40gb (2 instances per A100) for FP16 13B models (~28GB VRAM) or for 7B models running high-throughput workloads where memory bandwidth matters. The half-GPU allocation gives enough bandwidth to sustain reasonable tokens-per-second throughput.

Use the full 7g.80gb (no partitioning) for 70B models or any model where you need maximum throughput on a single GPU, or where tensor-parallel inference requires the full memory fabric of an undivided GPU.

Q: How would you deploy a 70B parameter model on Kubernetes that requires tensor parallelism across 4 GPUs?

A 70B model in FP16 requires approximately 140GB of VRAM. A single A100 80GB is insufficient. You need tensor parallelism (TP) across at least 2 GPUs (2x80GB=160GB with some overhead) or 4 GPUs for comfortable headroom and better throughput.

The pod spec requests nvidia.com/gpu: "4" and the vLLM argument is --tensor-parallel-size 4. Kubernetes allocates all 4 GPUs from a single node automatically - you cannot split a single TP group across multiple nodes without specialized NCCL/RDMA networking.

Node selection becomes critical: you need nodes with at least 4 available GPUs. Use node affinity to require instances with 8 GPUs (p4d.24xlarge, p5.48xlarge) and pod anti-affinity to spread replicas across different nodes so no single node failure eliminates all serving capacity.

The trade-off is that each replica now costs 4 GPUs, making scale-out 4x more expensive. For 70B models, most teams run a fixed number of replicas sized for 95th percentile traffic plus one spare, rather than aggressive autoscaling, because the cost of over-provisioning a small number of large replicas is predictable and manageable.
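
A sketch of the scheduling-relevant fields for such a replica follows. The label, model identifier, and instance-type list are illustrative assumptions; the essential parts are the 4-GPU resource limit, the matching --tensor-parallel-size, node affinity to 8-GPU instance types, and anti-affinity spreading replicas across nodes.

# Key fields of a 4-GPU tensor-parallel vLLM replica (sketch, not a complete Deployment)
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: node.kubernetes.io/instance-type
                    operator: In
                    values: ["p4d.24xlarge", "p4de.24xlarge", "p5.48xlarge"]
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                topologyKey: kubernetes.io/hostname
                labelSelector:
                  matchLabels:
                    app: vllm-70b
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.4.2
          args: ["--model", "meta-llama/Llama-3-70B-Instruct", "--tensor-parallel-size", "4"]
          resources:
            limits:
              nvidia.com/gpu: "4"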

Q: What metrics should you monitor on a Kubernetes LLM serving deployment, and what alerts would you set?

The metrics stack has three layers: infrastructure (GPU), serving framework (vLLM), and request-level SLOs.

Infrastructure metrics via DCGM Exporter: DCGM_FI_DEV_GPU_UTIL (alert at sustained 90%+), DCGM_FI_DEV_FB_USED (alert at 95% of VRAM), DCGM_FI_DEV_GPU_TEMP (alert at 83C for A100 - thermal throttling begins), DCGM_FI_DEV_POWER_USAGE (alert at TDP for thermal risk).

vLLM serving metrics via its Prometheus endpoint: vllm_num_requests_waiting (queue depth - your primary scaling signal), vllm_num_requests_running (active requests), vllm_gpu_cache_usage_perc (KV cache saturation - alert above 90%), vllm_e2e_request_latency_seconds (end-to-end latency percentiles).

Business SLOs: time-to-first-token (TTFT) p50/p95/p99 - the latency a user perceives before seeing any response; tokens-per-second throughput (your revenue rate); request error rate (5xx responses). Alert when TTFT p95 exceeds your SLO (typically 2-5 seconds for interactive use cases) or when error rate exceeds 1%.

Q: How do you handle Kubernetes rolling updates for LLM pods without dropping in-flight requests?

The key is combining three mechanisms: readiness gates, graceful shutdown, and the right rolling update strategy.

First, set maxUnavailable: 0 and maxSurge: 1 in the deployment rolling update strategy. This ensures Kubernetes starts a new pod before terminating the old one and does not route traffic to the new pod until its readiness probe passes.

Second, set a generous initialDelaySeconds (180-400 seconds depending on model size) on the readiness probe so the new pod is fully loaded before receiving traffic.

Third, configure a preStop hook that sleeps for 540 seconds and set terminationGracePeriodSeconds: 600. When Kubernetes sends SIGTERM to the old pod, the preStop hook runs first and delays the shutdown signal, giving in-flight requests up to 9 minutes to complete before the pod terminates. Pair this with connection draining on your load balancer (remove the pod from the service endpoints immediately on SIGTERM, before the preStop sleep, so no new requests are routed to the terminating pod).

The practical risk with this approach: during a rolling update, you temporarily need N+1 GPU slots. If you are running at maximum node capacity, the surge pod will be pending until Karpenter provisions a new node, which can delay the rollout by 10-15 minutes.
