
Autoscaling ML Workloads

10x Traffic in 90 Seconds

Product launches are unpredictable. Your recommendation model serves 500 requests per second on a normal morning. The company sends a mass marketing email to 2 million users at 9:00am. By 9:01:30, traffic has spiked to 5,200 requests per second. Your 3-replica serving Deployment is drowning - p99 latency is at 8 seconds, requests are timing out.

Your on-call engineer checks the HPA: it's set to scale on CPU utilization, target 70%. CPU is at 95%. The HPA has triggered - new pods are being created. But your recommendation model takes 45 seconds to load and warm up. By the time the new pods are ready, 3 minutes have passed. You lost 180 seconds × (5,200 - 1,500 serving capacity) = 666,000 requests to timeouts. Revenue impact: estimated $180K.

Post-mortem finding: the scaling policy was too slow, the metric was wrong (CPU utilization doesn't correlate well with ML inference saturation), and the pod warmup time was not accounted for in the scaling math. This lesson fixes all three problems.


Horizontal Pod Autoscaler (HPA) - Fundamentals

The HPA controller queries metrics every 15 seconds (configurable) and computes the desired replica count as:

$$\text{desiredReplicas} = \left\lceil \text{currentReplicas} \times \frac{\text{currentMetricValue}}{\text{desiredMetricValue}} \right\rceil$$

If your HPA target is CPU utilization 70%, current pods are at 95% CPU, and you have 3 replicas:

$$\text{desiredReplicas} = \left\lceil 3 \times \frac{95}{70} \right\rceil = \lceil 4.07 \rceil = 5 \text{ replicas}$$

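But the new pods still have to start, load the model, and pass readiness before they absorb any traffic, and the HPA may skip CPU samples from pods that are still initializing (controlled by --horizontal-pod-autoscaler-cpu-initialization-period, default 5 minutes). This is why CPU-based HPA is slow to respond to sudden spikes.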
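The replica formula above can be sketched in a few lines of Python. This is a simplified model, not the controller's code: it ignores minReplicas/maxReplicas clamping and missing-metric handling, and the `tolerance` parameter mirrors the controller's default 10% dead band (--horizontal-pod-autoscaler-tolerance).

```python
import math


def desired_replicas(current_replicas: int, current_value: float,
                     target_value: float, tolerance: float = 0.1) -> int:
    """HPA core formula: ceil(currentReplicas * currentMetric / targetMetric)."""
    ratio = current_value / target_value
    # Inside the tolerance band the controller leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


print(desired_replicas(3, 95, 70))   # 3 * 95/70 = 4.07, rounded up to 5
print(desired_replicas(10, 72, 70))  # ratio 1.03 is inside the band, stays at 10
```

Because of the ceiling, even a small overshoot beyond the tolerance band adds at least one replica.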

Basic HPA Configuration

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: fraud-model-hpa
  namespace: ml-prod
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: fraud-model
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60        # scale up when avg CPU > 60%
    - type: Resource
      resource:
        name: memory
        target:
          type: AverageValue
          averageValue: "12Gi"          # scale up when avg memory > 12 GiB
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0     # scale up immediately (no cooldown)
      policies:
        - type: Pods
          value: 4
          periodSeconds: 60             # add at most 4 pods per minute
        - type: Percent
          value: 100
          periodSeconds: 60             # or double replicas, whichever is larger
      selectPolicy: Max                 # use the more aggressive policy
    scaleDown:
      stabilizationWindowSeconds: 300   # wait 5 minutes before scaling down
      policies:
        - type: Pods
          value: 1
          periodSeconds: 60             # remove at most 1 pod per minute

The asymmetric scale-up/scale-down behavior is important for ML:

  • Scale up: aggressive, immediate. Traffic spikes are sudden and you can't afford to wait.
  • Scale down: conservative, slow. Model servers take time to warm up - you don't want to scale down prematurely and then scale up again 2 minutes later, wasting warmup time.
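To see what the scale-up policies in this manifest allow, here is a rough simulation of `selectPolicy: Max` with the two policies above (4 pods or 100% per 60-second period). It is an approximation for intuition only: the real controller evaluates policies against the replica count at the start of each period, and the function names are illustrative.

```python
import math


def scale_up_limit(current: int, pods: int = 4, percent: int = 100) -> int:
    """Largest replica count one period allows under selectPolicy: Max."""
    by_pods = current + pods
    by_percent = current + math.ceil(current * percent / 100)
    return max(by_pods, by_percent)


def scale_up_path(start: int, desired: int, max_replicas: int = 20) -> list[int]:
    """Replica count after each 60-second period until the goal is reached."""
    goal = min(desired, max_replicas)
    path = [start]
    while path[-1] < goal:
        path.append(min(goal, scale_up_limit(path[-1])))
    return path


print(scale_up_path(3, 35))  # [3, 7, 14, 20]
```

At low replica counts the Pods policy dominates (3 to 7), and above 4 replicas the Percent policy takes over. That is the point of listing both.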

Custom Metrics HPA - Requests Per Second

For ML serving, requests per second (RPS) per replica is a more direct and predictive metric than CPU utilization. Your model server exports RPS as a Prometheus metric; the Prometheus Adapter bridges it to the Kubernetes metrics API for HPA consumption.

# prometheus-adapter ConfigMap - expose custom metrics to HPA
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-adapter-config
  namespace: monitoring
data:
  config.yaml: |
    rules:
      - seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
        resources:
          overrides:
            namespace: {resource: "namespace"}
            pod: {resource: "pod"}
        name:
          matches: "^(.*)_total"
          as: "${1}_per_second"
        # Aggregate per pod so each pod yields a single rate value
        metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'

# HPA using custom RPS metric
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: fraud-model-rps-hpa
  namespace: ml-prod
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: fraud-model
  minReplicas: 3
  maxReplicas: 30
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second   # custom metric from Prometheus Adapter
        target:
          type: AverageValue
          averageValue: "150"              # scale when each pod handles > 150 RPS
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 30                # can double replicas every 30 seconds

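With a 150 RPS target per pod and 5,200 total RPS incoming, the HPA computes ceil(5,200 / 150) = 35 desired replicas. Because this manifest sets maxReplicas: 30, it stops at 30 (about 173 RPS per pod); raise maxReplicas to at least 35 if the target must hold at this traffic level. Either way, the HPA scales up from 3 by doubling every 30 seconds, as fast as your cluster has capacity.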
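The arithmetic behind this HPA can be checked with a short sketch. It assumes the values from the manifest above (150 RPS per pod, minReplicas 3, maxReplicas 30, doubling every 30 seconds) and the 45-second warmup from the opening scenario. The timing model is deliberately simple: it ignores scheduling and image-pull time, and the function names are illustrative.

```python
import math


def replicas_for_rps(total_rps: float, target_per_pod: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Replicas needed to hold average RPS per pod at the target, clamped."""
    wanted = math.ceil(total_rps / target_per_pod)
    return max(min_replicas, min(max_replicas, wanted))


def seconds_until_ready(start: int, goal: int, period_s: int = 30,
                        warmup_s: int = 45) -> int:
    """Time for the doubling steps plus warmup of the last batch of pods."""
    if start < 1:
        raise ValueError("start must be at least 1 replica")
    replicas, elapsed = start, 0
    while True:
        replicas = min(goal, replicas * 2)  # Percent policy: +100% per period
        if replicas >= goal:
            return elapsed + warmup_s
        elapsed += period_s


print(replicas_for_rps(5200, 150, 3, 30))  # 30: formula wants 35, cap is 30
print(replicas_for_rps(5200, 150, 3, 40))  # 35 once the cap allows it
print(seconds_until_ready(3, 30))          # 135 (3 -> 6 -> 12 -> 24 -> 30, then warmup)
```

Even with an immediate, aggressive policy, the doubling steps plus warmup leave roughly two minutes of under-capacity in this model, which is the gap that pre-scaling is meant to cover.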

KEDA - Event-Driven Autoscaling for ML

KEDA (Kubernetes Event-Driven Autoscaling) extends Kubernetes autoscaling to external event sources: Prometheus metrics, Kafka queue depth, Redis list length, AWS SQS queue length, and many more. KEDA can also scale deployments to zero replicas when idle, which HPA cannot do (minimum is 1 replica).

KEDA ScaledObject - Kafka-Driven Batch Inference

# Install KEDA: helm install keda kedacore/keda -n keda --create-namespace

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: fraud-batch-inference-scaled
  namespace: ml-prod
spec:
  scaleTargetRef:
    name: fraud-batch-inference     # Deployment to scale
  minReplicaCount: 0                # scale to zero when no messages
  maxReplicaCount: 50
  cooldownPeriod: 300               # wait 5 min before scaling to zero
  pollingInterval: 15               # check Kafka lag every 15 seconds
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka-svc:9092
        consumerGroup: fraud-inference-consumer
        topic: inference-requests
        lagThreshold: "100"         # scale up when lag per replica > 100
        offsetResetPolicy: latest

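When the consumer group has no lag, KEDA scales the Deployment to 0 replicas after the cooldown period, so no GPU pods sit idle (the cost saving materializes once the now-empty nodes are released). When messages arrive, KEDA detects the lag on its next poll (within 15 seconds here) and scales from 0 to the required replicas; the new pods still need their usual startup and warmup time before they consume messages.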
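As a rough model of how lag translates into replicas for the ScaledObject above: the target is `lagThreshold` messages of lag per replica, clamped to the configured bounds. The `partitions` cap reflects the Kafka scaler's default of not running more consumers than partitions (`allowIdleConsumers: false`); treat that detail as an assumption to verify against your KEDA version. The function name is illustrative.

```python
import math
from typing import Optional


def replicas_for_lag(total_lag: int, lag_threshold: int = 100,
                     min_replicas: int = 0, max_replicas: int = 50,
                     partitions: Optional[int] = None) -> int:
    """Replica count that keeps lag per replica at or below lag_threshold."""
    if total_lag <= 0:
        return min_replicas                   # nothing to consume: scale to the floor
    wanted = math.ceil(total_lag / lag_threshold)
    if partitions is not None:
        wanted = min(wanted, partitions)      # extra consumers would sit idle
    return max(min_replicas, min(max_replicas, wanted))


for lag in (0, 1, 2_500, 10_000):
    print(lag, replicas_for_lag(lag))
```

If the topic has fewer partitions than `maxReplicaCount`, the partition count becomes the effective ceiling, so size the topic with the peak worker count in mind.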

KEDA ScaledObject - Prometheus GPU Utilization

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: fraud-model-gpu-scaler
  namespace: ml-prod
spec:
  scaleTargetRef:
    name: fraud-model
  minReplicaCount: 3
  maxReplicaCount: 40
  triggers:
    - type: prometheus
      # The query is already an average across pods, so compare it to the
      # threshold directly instead of dividing by the replica count again
      metricType: Value
      metadata:
        serverAddress: http://prometheus-svc.monitoring:9090
        metricName: gpu_utilization
        query: |
          avg(DCGM_FI_DEV_GPU_UTIL{exported_namespace="ml-prod",
            exported_pod=~"fraud-model-.*"})
        threshold: "70"             # scale up when avg GPU util > 70%

Scale to Zero for Batch and Dev Workloads

# Development model - scale to zero when no traffic
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: fraud-model-dev-scaler
  namespace: ml-dev
spec:
  scaleTargetRef:
    name: fraud-model-dev
  minReplicaCount: 0                # fully scale to zero
  maxReplicaCount: 5
  cooldownPeriod: 600               # 10 minutes idle before scale to zero
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-svc.monitoring:9090
        metricName: http_requests_total
        query: |
          sum(rate(http_requests_total{
            namespace="ml-dev",
            pod=~"fraud-model-dev-.*"
          }[5m]))
        threshold: "1"              # scale up if any traffic

Readiness Gates - Accounting for Model Warmup in Scaling

Standard Kubernetes readiness probes work at the container level. For ML services, there's a subtler problem: even after the readiness probe passes, the first N requests might be slow because CUDA kernel caches are cold, JIT compilation hasn't happened, or inference throughput hasn't reached steady state.

Readiness gates are a Pod feature that allows external systems to add their own readiness conditions. This lets your model server report "warmed up" after serving 100 requests (ensuring JIT is complete and CUDA kernels are compiled) before being added to the load balancer.

spec:
  readinessGates:
    - conditionType: "fraud.model/warmed-up"   # custom condition

  containers:
    - name: model-server
      # ...
      readinessProbe:
        httpGet:
          path: /health/ready
          port: 8080
        initialDelaySeconds: 60
        periodSeconds: 10

A separate process (here a simple sidecar container; an init container would not work, because it finishes before the model server starts) waits for the model server and sets the condition once warmup is complete:

# warmup_reporter.py - runs as sidecar
import os
import time
from datetime import datetime, timezone

import requests
from kubernetes import client, config

config.load_incluster_config()
v1 = client.CoreV1Api()

# Expected to be injected via the Downward API (metadata.name / metadata.namespace)
POD_NAME = os.environ["POD_NAME"]
POD_NAMESPACE = os.environ["POD_NAMESPACE"]


def wait_for_model_ready():
    """Wait for the main container's readiness endpoint to return 200."""
    while True:
        try:
            r = requests.get("http://localhost:8080/health/ready", timeout=2)
            if r.status_code == 200:
                return
        except requests.RequestException:
            pass  # server is not listening yet
        time.sleep(5)


def run_warmup_requests(n=100):
    """Send N warmup requests to prime CUDA kernels."""
    dummy_payload = {"features": [0.0] * 512, "user_id": "warmup"}
    for _ in range(n):
        requests.post("http://localhost:8080/predict", json=dummy_payload, timeout=30)
    print(f"Warmup complete after {n} requests.")


def set_ready_condition():
    """Set the custom readiness gate condition to True."""
    body = {"status": {"conditions": [{
        "type": "fraud.model/warmed-up",
        "status": "True",
        "lastTransitionTime": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "reason": "WarmupComplete",
        "message": "Model has completed warmup inference requests",
    }]}}
    # The pod's service account needs RBAC permission to patch pods/status
    v1.patch_namespaced_pod_status(POD_NAME, POD_NAMESPACE, body)
    print("Readiness gate condition set to True.")


wait_for_model_ready()
run_warmup_requests(100)
set_ready_condition()
# The pod is now fully ready, including the custom gate. Stay alive: under
# restartPolicy: Always an exiting sidecar would be restarted and rerun warmup.
while True:
    time.sleep(3600)

Zero-Downtime Rolling Updates with Warmup-Aware Scaling

For ML services with a 45-second warmup, a zero-downtime rolling update brings each new pod up and through the warmup gate before the old pod it replaces is terminated.

The entire rollout for 3 replicas with 45-second warmup takes approximately:

  • 70 seconds per replica (startup + readiness + warmup gate)
  • Plus 30 seconds termination grace per old pod
  • Total: 3 × (70 + 30) = ~300 seconds (about 5 minutes) for a 3-replica service when pods are replaced one at a time
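A small calculator gives a rough bracket for the estimate above. It assumes pods are replaced one at a time (for example maxSurge: 1, maxUnavailable: 0, which the text does not state explicitly) and uses the 70-second and 30-second figures from the list. The `overlap_termination` flag models the case where an old pod's termination grace runs in parallel with the next pod's startup.

```python
def rollout_seconds(replicas: int, ready_s: int = 70, grace_s: int = 30,
                    overlap_termination: bool = False) -> int:
    """Estimated rolling-update duration when pods are replaced one at a time."""
    if overlap_termination:
        # Only the last old pod's grace period adds to the critical path.
        return replicas * ready_s + grace_s
    # Strictly sequential: start new pod, wait for the gate, terminate old pod.
    return replicas * (ready_s + grace_s)


print(rollout_seconds(3))                            # 300
print(rollout_seconds(3, overlap_termination=True))  # 240
```

The rollout time grows linearly with replica count, so a 20-replica service under the same assumptions takes well over 20 minutes unless maxSurge is raised.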

Vertical Pod Autoscaler (VPA) - Right-Sizing ML Pods

VPA automatically adjusts CPU and memory requests based on observed usage. For ML teams that don't know the right resource requests (new models, new workloads), VPA in recommendation mode provides data without auto-applying changes:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: fraud-model-vpa
  namespace: ml-prod
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: fraud-model
  updatePolicy:
    updateMode: "Off"     # Recommendation only - do not auto-resize pods
  resourcePolicy:
    containerPolicies:
      - containerName: model-server
        minAllowed:
          cpu: "1"
          memory: "4Gi"
        maxAllowed:
          cpu: "16"
          memory: "64Gi"

# View VPA recommendations
kubectl get vpa fraud-model-vpa -n ml-prod -o yaml | grep -A20 recommendation
# recommendation:
#   containerRecommendations:
#   - containerName: model-server
#     lowerBound:
#       cpu: 1500m
#       memory: 6824Mi
#     target:
#       cpu: 2200m
#       memory: 9200Mi   # VPA recommends ~9.0 GiB, you set 8 GiB - adjust!
#     upperBound:
#       cpu: 4000m
#       memory: 14Gi
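Comparing a recommendation with your current request is easier once both are in the same unit. The sketch below parses only binary-suffixed memory quantities (Ki, Mi, Gi, Ti), not the full Kubernetes quantity grammar, and uses the 8 GiB request and 9200Mi target from the example output. The helper names are illustrative.

```python
UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def memory_bytes(quantity: str) -> int:
    """Parse a binary-suffixed memory quantity such as '9200Mi' into bytes."""
    for suffix, factor in UNITS.items():
        if quantity.endswith(suffix):
            return int(float(quantity[:-len(suffix)]) * factor)
    return int(quantity)  # no suffix: plain bytes


def headroom_gib(request: str, vpa_target: str) -> float:
    """Positive when the request exceeds the VPA target, negative when short."""
    return (memory_bytes(request) - memory_bytes(vpa_target)) / 2**30


print(round(memory_bytes("9200Mi") / 2**30, 2))  # 8.98 (GiB)
print(round(headroom_gib("8Gi", "9200Mi"), 2))   # -0.98, request is about 1 GiB short
```

A negative headroom means the pod is requesting less than VPA's target, which is the situation flagged in the example output.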

:::warning Don't Use VPA and HPA Together on the Same Metric
VPA and HPA should not both scale based on CPU/memory - they will conflict. VPA increases resources of existing pods while HPA adds new pods, creating a feedback loop. The safe pattern: use VPA for right-sizing (in recommendation mode, then apply manually), and HPA for scaling. Or use VPA on training Jobs (which don't autoscale) and HPA on serving Deployments.
:::

Production Notes

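Pre-scale before known traffic spikes. If you know a marketing campaign goes live at 9am, pre-scale your serving Deployment at 8:45am: kubectl scale deployment/fraud-model -n ml-prod --replicas=20. Because the HPA keeps reconciling the replica count, it will start removing the extra pods again while traffic is still low, so it is more reliable to raise the floor temporarily: kubectl patch hpa fraud-model-hpa -n ml-prod --patch '{"spec":{"minReplicas":20}}'. This avoids the HPA lag during the initial spike. After the campaign, restore minReplicas and let HPA scale back down normally.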
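One way to script pre-scaling is to raise the HPA's minReplicas for the duration of the campaign rather than scaling the Deployment directly, since the HPA would otherwise scale idle pods back down. A minimal sketch with the official Kubernetes Python client follows. It assumes a client version that ships `AutoscalingV2Api`, a kubeconfig with permission to patch HPAs, and the `fraud-model-hpa` object from earlier; the helper names are illustrative.

```python
def min_replicas_patch(min_replicas: int) -> dict:
    """Build the merge-patch body that changes only spec.minReplicas."""
    if min_replicas < 1:
        raise ValueError("minReplicas must be at least 1 for a standard HPA")
    return {"spec": {"minReplicas": min_replicas}}


def set_hpa_floor(name: str, namespace: str, min_replicas: int) -> None:
    """Patch the HPA; keep min_replicas at or below the HPA's maxReplicas."""
    # Imported lazily so the patch builder can be used without cluster access.
    from kubernetes import client, config

    config.load_kube_config()
    api = client.AutoscalingV2Api()
    api.patch_namespaced_horizontal_pod_autoscaler(
        name, namespace, min_replicas_patch(min_replicas)
    )


print(min_replicas_patch(20))  # {'spec': {'minReplicas': 20}}
# Before the campaign: set_hpa_floor("fraud-model-hpa", "ml-prod", 20)
# After the campaign:  set_hpa_floor("fraud-model-hpa", "ml-prod", 3)
```

Running the two calls from a scheduled job makes the pre-scale repeatable and ensures the floor is restored afterwards.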

Monitor HPA decisions. The HPA status shows its current reasoning:

kubectl describe hpa fraud-model-hpa -n ml-prod
# Current replicas: 3
# Desired replicas: 7
# Current metrics:
# cpu: 87% / 60%
# Conditions:
# ScalingLimited: False
# AbleToScale: True

Set minReplicas >= 2 for production. With minReplicas: 1, a single pod failure causes an outage while the replacement warms up. Keep minimum 2 for production serving.

Common Mistakes

:::danger Using CPU Utilization Alone for ML Autoscaling
CPU utilization is a poor metric for GPU-accelerated ML serving. The model inference happens on GPU; the CPU is often not the bottleneck. A model can be fully saturated (100% GPU utilization, 8-second latency) while showing only 30% CPU utilization because the CPU threads are mostly waiting for GPU kernels to complete. Use RPS per pod or GPU utilization as the primary scaling metric. CPU can be a secondary metric for pre-processing-heavy models.
:::

:::warning Not Accounting for Pod Warmup Time in maxReplicas Math
If your cluster has 40 GPU slots and your model server takes 45 seconds to warm up, scaling from 5 to 40 replicas means 35 new pods all loading models simultaneously. If your model artifacts are on a shared NFS PVC, 35 concurrent 4 GB reads will saturate the network. The warmup storm causes pods to take 10 minutes to become ready instead of 45 seconds. Use a scaling rate limit (HPA behavior.scaleUp.policies) to stagger pod creation and avoid storage I/O saturation.
:::
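A back-of-the-envelope model of the warmup storm: if all new pods share one storage link, load time grows with the number of concurrent readers. The 0.25 GB/s aggregate throughput below is a hypothetical figure chosen for illustration, not a measurement, so substitute what your NFS server actually sustains.

```python
def load_seconds(pods: int, artifact_gb: float, shared_gb_per_s: float) -> float:
    """Time for `pods` concurrent readers to each pull the artifact over one shared link."""
    return pods * artifact_gb / shared_gb_per_s


def waves_needed(new_pods: int, pods_per_wave: int) -> int:
    """Number of scale-up periods when pod creation is rate-limited."""
    return -(-new_pods // pods_per_wave)  # ceiling division


ASSUMED_GB_PER_S = 0.25  # hypothetical aggregate NFS throughput

print(load_seconds(35, 4, ASSUMED_GB_PER_S))  # 560.0 s when all 35 pods load at once
print(load_seconds(4, 4, ASSUMED_GB_PER_S))   # 64.0 s per wave of 4 pods
print(waves_needed(35, 4))                    # 9 waves
```

Staggering does not reduce the total bytes read, but the first pods become useful after about a minute instead of every pod waiting on the slowest shared transfer.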

Interview Q&A

Q1: How does the Kubernetes HPA compute the desired replica count, and what are the limitations of CPU-based autoscaling for ML serving?

The HPA formula is: desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric). It queries the metrics API every 15 seconds and applies the formula. For CPU-based scaling, the limitations for ML are: (1) CPU doesn't reflect GPU saturation - a model running at 100% GPU utilization might show 20% CPU. (2) The CPU signal lags - utilization is averaged across pods, and samples from pods that are still initializing can be skipped, so by the time the HPA has reacted and new pods are ready the spike can be minutes old. (3) New pods need to warm up before they're useful, adding latency to the effective scaling response. Better metrics: RPS per pod (direct measure of serving load), inference queue depth (for batch systems), or GPU utilization (for GPU-bound models).

Q2: What is KEDA and what capabilities does it add beyond the built-in HPA?

KEDA (Kubernetes Event-Driven Autoscaling) extends K8s autoscaling with: (1) Scale-to-zero - KEDA can scale a Deployment to 0 replicas when idle. Standard HPA minimum is 1. (2) External event sources - Kafka consumer lag, Redis list length, SQS queue depth, Prometheus queries, Datadog metrics, and 50+ more scalers out of the box. (3) Direct polling of event sources - KEDA queries each source itself on a configurable pollingInterval (30 seconds by default; the examples in this lesson set 15) and feeds the result to an HPA it manages, so you don't need a separate metrics adapter per source. For ML specifically: scale batch inference workers based on Kafka topic lag (workers scale to zero when no messages, scale up proportional to lag), or scale serving based on Prometheus GPU utilization queries that standard HPA can't consume without a metrics adapter.

Q3: Explain the pod warmup problem for ML serving autoscaling and two ways to mitigate it.

The problem: when an HPA triggers scale-up, new pods are created but they take 30–90 seconds to load models and warm up CUDA kernels before serving traffic efficiently. During this window, they appear in the Service endpoint list (if readiness probe passes after just model loading, before warmup), receive traffic, and serve it at 3–5x normal latency due to cold kernels. Mitigation 1: readiness gates - extend the pod ready condition with a custom gate that only signals ready after N warmup requests have been processed, ensuring CUDA kernels are hot before any production traffic arrives. Mitigation 2: pre-scaling - for predictable traffic patterns (marketing campaigns, business hour peaks), pre-scale the Deployment 10–15 minutes in advance via CI/CD or a CronJob that runs kubectl scale before the expected spike.

Q4: A recommendation model serving deployment is experiencing "flapping" - constantly scaling up and back down every few minutes. What causes this and how do you fix it?

Flapping (also called thrashing) occurs when the scale-up trigger and the scale-down trigger are too close together, or the stabilization window is too short. For ML serving: (1) if warmup takes 45 seconds and you scale down within 60 seconds of the trigger metric returning to normal, you might scale down pods that haven't finished warming up yet, then scale up again because warmup made the metric spike. Fix: increase scaleDown.stabilizationWindowSeconds to 300–600 seconds for ML services with long warmup times. (2) If the target metric is too close to steady-state values, normal traffic variance triggers constant scaling. Fix: widen the target threshold. (3) If VPA and HPA are both modifying resources, they can create oscillation. Fix: use VPA in recommendation mode only, don't auto-apply.
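The effect of the scale-down stabilization window can be illustrated with a toy model: the controller acts on the highest recommendation seen within the window, so a brief dip does not remove pods. Here the window is counted in samples rather than seconds (one sample per 15-second sync would make 300 seconds about 20 samples), and the recommendation series is invented for illustration.

```python
from collections import deque


def stabilized(recommendations: list[int], window: int) -> list[int]:
    """Replica count per step when scale-down uses the max over the last `window` samples."""
    recent: deque[int] = deque(maxlen=window)
    result = []
    for rec in recommendations:
        recent.append(rec)
        result.append(max(recent))  # scale-ups still pass through immediately
    return result


raw = [10, 4, 9, 4, 10, 4, 4, 4, 4]  # noisy desired-replica recommendations
print(stabilized(raw, window=1))     # follows the noise: flapping
print(stabilized(raw, window=4))     # [10, 10, 10, 10, 10, 10, 10, 10, 4]
```

With the longer window the Deployment holds at 10 replicas through the noise and scales down only after the recommendation has stayed low for the whole window.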

Q5: How would you design an autoscaling system for a batch inference pipeline that processes 500K records per hour with peak loads of 2M records per hour?

Design: Use KEDA with a Kafka (or SQS) scaler. Producers push inference requests to a Kafka topic. Workers consume from the topic. KEDA scales workers based on consumer group lag with a target of 1,000 messages of lag per worker (tuned to your model's throughput). At steady state (500K records/hour with workers processing ~300K/hour each), you might need 2–3 workers. At peak (2M records/hour), KEDA scales to ~6–8 workers. Set minReplicaCount: 0 and cooldownPeriod: 600 - scale to zero during off-hours, saving GPU cost. For the scale-from-zero latency problem (first messages wait 45 seconds for pods to warm up), maintain a minimum of 1 "warm standby" pod that consumes no messages but is ready to serve, implemented via a separate Deployment with 1 replica that's always running.
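The worker counts in this answer follow from dividing the arrival rate by per-worker throughput. The sketch below makes that explicit. The per-worker throughput is an assumption you would measure for your own model; around 300K records/hour per worker gives 2 workers at steady state and 7 at peak, in line with the ranges quoted.

```python
import math


def workers_needed(records_per_hour: int, per_worker_per_hour: int,
                   max_workers: int = 50) -> int:
    """Workers required to keep up with the arrival rate, capped at max_workers."""
    return min(max_workers, math.ceil(records_per_hour / per_worker_per_hour))


PER_WORKER = 300_000  # assumed throughput per worker, records/hour

print(workers_needed(500_000, PER_WORKER))    # 2 at steady state
print(workers_needed(2_000_000, PER_WORKER))  # 7 at peak
```

Add headroom on top of these figures if the backlog must drain within a deadline rather than merely stop growing.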

© 2026 EngineersOfAI. All rights reserved.