Autoscaling ML Workloads

10x Traffic in 90 Seconds

Product launches are unpredictable. Your recommendation model serves 500 requests per second on a normal morning. The company sends a mass marketing email to 2 million users at 9:00am. By 9:01:30, traffic has spiked to 5,200 requests per second. Your 3-replica serving Deployment is drowning - p99 latency is at 8 seconds, requests are timing out.

Your on-call engineer checks the HPA: it's set to scale on CPU utilization, target 70%. CPU is at 95%. The HPA has triggered - new pods are being created. But your recommendation model takes 45 seconds to load and warm up. By the time the new pods are ready, 3 minutes have passed. You lost 180 seconds × (5,200 - 1,500 serving capacity) = 666,000 requests to timeouts. Revenue impact: estimated $180K.

Post-mortem finding: the scaling policy was too slow, the metric was wrong (CPU utilization doesn't correlate well with ML inference saturation), and the pod warmup time was not accounted for in the scaling math. This lesson fixes all three problems.


Horizontal Pod Autoscaler (HPA) - Fundamentals

The HPA controller queries metrics every 15 seconds (configurable) and computes the desired replica count as:

$\text{desiredReplicas} = \lceil \text{currentReplicas} \times \frac{\text{currentMetricValue}}{\text{desiredMetricValue}} \rceil$

If your HPA target is CPU utilization 70%, current pods are at 95% CPU, and you have 3 replicas:

$\text{desiredReplicas} = \lceil 3 \times \frac{95}{70} \rceil = \lceil 4.07 \rceil = 5 \text{ replicas}$

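The controller keeps re-evaluating every sync period, but it treats new pods cautiously: CPU samples taken shortly after a pod starts may be skipped (--horizontal-pod-autoscaler-cpu-initialization-period, default 5 minutes), and pods that are not yet Ready are set aside when the average is computed. Add metrics-pipeline lag and model load time on top, and this is why CPU-based HPA is slow to respond to sudden spikes.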
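
The replica formula above is easy to check by hand, but a few lines of Python make it simple to try other values. This is a minimal sketch, not the controller's actual code: the function name is made up for illustration, and the 10% tolerance band is the documented default of --horizontal-pod-autoscaler-tolerance, which your cluster may override.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, tolerance: float = 0.1) -> int:
    """Replica count from the HPA ratio formula, with a tolerance band."""
    ratio = current_metric / target_metric
    # Inside the tolerance band the controller leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


print(desired_replicas(3, 95, 70))  # 3 x 95/70 = 4.07, rounded up to 5
print(desired_replicas(3, 72, 70))  # ratio 1.03 is inside the band, stays at 3
```

The second call shows why small overshoots of the target do not trigger scaling at all.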

Basic HPA Configuration

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: fraud-model-hpa
  namespace: ml-prod
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: fraud-model
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60    # scale up when avg CPU > 60%
    - type: Resource
      resource:
        name: memory
        target:
          type: AverageValue
          averageValue: "12Gi"       # scale up when avg memory > 12 GiB
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0    # scale up immediately (no cooldown)
      policies:
        - type: Pods
          value: 4
          periodSeconds: 60          # add at most 4 pods per minute
        - type: Percent
          value: 100
          periodSeconds: 60          # or double replicas, whichever is larger
      selectPolicy: Max              # use the more aggressive policy
    scaleDown:
      stabilizationWindowSeconds: 300  # wait 5 minutes before scaling down
      policies:
        - type: Pods
          value: 1
          periodSeconds: 60          # remove at most 1 pod per minute

The asymmetric scale-up/scale-down behavior is important for ML:

Scale up: aggressive, immediate. Traffic spikes are sudden and you can't afford to wait.
Scale down: conservative, slow. Model servers take time to warm up - you don't want to scale down prematurely and then scale up again 2 minutes later, wasting warmup time.
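
To see how quickly the scaleUp policies above let the Deployment grow, the sketch below steps through one policy period at a time. It is a simplified model under stated assumptions: it treats each 60-second period independently, takes the larger of the Pods and Percent policies (selectPolicy: Max), and ignores pod startup time. The function name is hypothetical.

```python
import math


def simulate_scale_up(start: int, desired: int, max_replicas: int,
                      pods_per_period: int = 4,
                      percent_per_period: int = 100) -> list[int]:
    """Replica count at the end of each policy period until the target is hit."""
    target = min(desired, max_replicas)
    timeline = [start]
    current = start
    while current < target:
        by_pods = current + pods_per_period
        by_percent = math.ceil(current * (1 + percent_per_period / 100))
        # selectPolicy: Max picks whichever policy allows the larger change.
        nxt = min(max(by_pods, by_percent), target)
        if nxt == current:  # both policies allow zero change; stop instead of looping
            break
        current = nxt
        timeline.append(current)
    return timeline


print(simulate_scale_up(start=3, desired=20, max_replicas=20))  # [3, 7, 14, 20]
```

With these settings the Pods policy wins the first step (3 + 4 = 7 beats 3 × 2 = 6) and the Percent policy wins afterwards, so going from 3 to 20 replicas takes three periods before warmup is counted.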

Custom Metrics HPA - Requests Per Second

For ML serving, requests per second (RPS) per replica is a more direct and predictive metric than CPU utilization. Your model server exports a request counter as a Prometheus metric; the Prometheus Adapter converts it to a per-second rate and exposes it through the Kubernetes custom metrics API (custom.metrics.k8s.io) for HPA consumption.

# prometheus-adapter ConfigMap - expose custom metrics to HPA
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-adapter-config
  namespace: monitoring
data:
  config.yaml: |
    rules:
      - seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
        resources:
          overrides:
            namespace: {resource: "namespace"}
            pod: {resource: "pod"}
        name:
          matches: "^(.*)_total"
          as: "${1}_per_second"
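        # Sum per pod so extra labels (status code, path, ...) collapse into one series per pod
        metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'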

# HPA using custom RPS metric
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: fraud-model-rps-hpa
  namespace: ml-prod
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: fraud-model
  minReplicas: 3
  maxReplicas: 30
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second    # custom metric from Prometheus Adapter
        target:
          type: AverageValue
          averageValue: "150"               # scale when each pod handles > 150 RPS
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 30               # can double replicas every 30 seconds

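With a 150 RPS target per pod and 5,200 total RPS incoming, the HPA computes ceil(5200 / 150) = 35 desired replicas, but maxReplicas: 30 caps the Deployment at 30 (about 173 RPS per pod). Raise maxReplicas to at least 35 if you want to hold the 150 RPS target at this load. Under the doubling policy the path is 3 → 6 → 12 → 24 → 30, adding pods as fast as your cluster has capacity.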
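
The arithmetic behind that replica target, including the effect of the minReplicas and maxReplicas bounds, can be sketched as follows. Both function names are illustrative, and the second function assumes the "double every period" policy from the HPA above with no other limits.

```python
import math


def replicas_for_rps(total_rps: float, target_rps_per_pod: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Replicas needed to hold the per-pod RPS target, clamped to the HPA bounds."""
    wanted = math.ceil(total_rps / target_rps_per_pod)
    return max(min_replicas, min(wanted, max_replicas))


def doubling_periods(start: int, target: int) -> int:
    """Number of 'double the replicas' periods needed to reach the target."""
    periods = 0
    current = start
    while current < target:
        current *= 2
        periods += 1
    return periods


print(replicas_for_rps(5200, 150, min_replicas=3, max_replicas=30))  # capped at 30
print(replicas_for_rps(5200, 150, min_replicas=3, max_replicas=40))  # 35
print(doubling_periods(3, 30) * 30, "seconds of policy time before warmup")
```

The last line is a reminder that even an aggressive policy needs four 30-second periods to get from 3 replicas to 30, and pod warmup comes on top of that.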

KEDA - Event-Driven Autoscaling for ML

KEDA (Kubernetes Event-Driven Autoscaling) extends Kubernetes autoscaling to external event sources: Prometheus metrics, Kafka queue depth, Redis list length, AWS SQS queue length, and many more. KEDA can also scale deployments to zero replicas when idle, which HPA cannot do (minimum is 1 replica).

KEDA ScaledObject - Kafka-Driven Batch Inference

# Install KEDA:
#   helm repo add kedacore https://kedacore.github.io/charts
#   helm install keda kedacore/keda -n keda --create-namespace

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: fraud-batch-inference-scaled
  namespace: ml-prod
spec:
  scaleTargetRef:
    name: fraud-batch-inference              # Deployment to scale
  minReplicaCount: 0                        # scale to zero when no messages
  maxReplicaCount: 50
  cooldownPeriod: 300                       # wait 5 min before scaling to zero
  pollingInterval: 15                       # check Kafka lag every 15 seconds
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka-svc:9092
        consumerGroup: fraud-inference-consumer
        topic: inference-requests
        lagThreshold: "100"                 # scale up when lag per replica > 100
        offsetResetPolicy: latest

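When the consumer group has no lag on the topic, KEDA scales the Deployment to 0 replicas once the cooldown period has passed, so idle workers hold no GPUs (and with a cluster autoscaler, the GPU nodes themselves can be released). When messages arrive, KEDA activates the Deployment within one polling interval; the new pods still need to load and warm up the model before they start consuming. By default the Kafka scaler will not scale beyond the topic's partition count, because extra consumers in the group would sit idle.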
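
As a rough mental model of how the lag threshold translates into worker count, the sketch below divides total consumer lag by lagThreshold and applies the caps. It is an approximation, not KEDA's implementation: the function name and the partition count are hypothetical, and the partition cap reflects the scaler's default behavior of not running more consumers than partitions.

```python
import math


def kafka_replicas(total_lag: int, lag_threshold: int, max_replicas: int,
                   partitions: int, min_replicas: int = 0) -> int:
    """Approximate worker count for a given consumer-group lag."""
    if total_lag <= 0:
        return min_replicas  # nothing to process: fall back to the floor (0 here)
    wanted = math.ceil(total_lag / lag_threshold)
    # More consumers than partitions would sit idle, so cap there as well.
    return max(min_replicas, min(wanted, max_replicas, partitions))


# Assumed topic with 12 partitions; lagThreshold and maxReplicaCount from the ScaledObject above
for lag in (0, 250, 5000):
    print(lag, "->", kafka_replicas(lag, lag_threshold=100, max_replicas=50, partitions=12))
```

In this example a lag of 5,000 would ask for 50 workers, but only 12 can do useful work, which is why partition count belongs in the capacity plan alongside maxReplicaCount.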

KEDA ScaledObject - Prometheus GPU Utilization

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: fraud-model-gpu-scaler
  namespace: ml-prod
spec:
  scaleTargetRef:
    name: fraud-model
  minReplicaCount: 3
  maxReplicaCount: 40
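  triggers:
    - type: prometheus
      # Value compares the query result to the threshold as-is. The default
      # (AverageValue) divides it by the replica count, which would keep an
      # already-averaged metric like this one from ever scaling past a few pods.
      metricType: Value
      metadata:
        serverAddress: http://prometheus-svc.monitoring:9090
        metricName: gpu_utilization
        query: |
          avg(DCGM_FI_DEV_GPU_UTIL{exported_namespace="ml-prod",
              exported_pod=~"fraud-model-.*"})
        threshold: "70"                     # scale up when avg GPU util > 70%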
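
One thing worth checking with an averaged query like this is how the threshold is compared with the result. The sketch below contrasts the two HPA target semantics that a KEDA trigger can select through metricType: AverageValue (the documented default at the time of writing) and Value. The function names are illustrative, and the numbers are made up.

```python
import math


def replicas_average_value(metric_value: float, target: float) -> int:
    """AverageValue: the metric is treated as a total to be shared across pods."""
    return math.ceil(metric_value / target)


def replicas_value(current_replicas: int, metric_value: float, target: float) -> int:
    """Value: the metric is compared to the target directly, then scaled by replicas."""
    return math.ceil(current_replicas * metric_value / target)


avg_gpu_util = 85.0  # result of an avg(...) query, always between 0 and 100
print(replicas_average_value(avg_gpu_util, 70))  # 2, regardless of current size
print(replicas_value(3, avg_gpu_util, 70))       # 4, grows from the current 3
```

Because an average utilization can never exceed 100, the AverageValue interpretation tops out at 2 replicas for a threshold of 70. Either set metricType: Value on the trigger or switch the query to a sum so the division by replica count produces the average you intended.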

Scale to Zero for Batch and Dev Workloads

# Development model - scale to zero when no traffic
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: fraud-model-dev-scaler
  namespace: ml-dev
spec:
  scaleTargetRef:
    name: fraud-model-dev
  minReplicaCount: 0                        # fully scale to zero
  maxReplicaCount: 5
  cooldownPeriod: 600                       # 10 minutes idle before scale to zero
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-svc.monitoring:9090
        metricName: http_requests_total
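        # Caveat: at zero replicas no pod exports this series, so for a real
        # scale-from-zero setup the counter has to come from a component that
        # stays up (ingress, gateway, or service mesh).
        query: |
          sum(rate(http_requests_total{
            namespace="ml-dev",
            pod=~"fraud-model-dev-.*"
          }[5m]))
        threshold: "1"                      # ~1 RPS per replica; any traffic above 0 activates the first pod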

Readiness Gates - Accounting for Model Warmup in Scaling

Standard Kubernetes readiness probes work at the container level. For ML services, there's a subtler problem: even after the readiness probe passes, the first N requests might be slow because CUDA kernel caches are cold, JIT compilation hasn't happened, or inference throughput hasn't reached steady state.

Readiness gates are a Pod feature that allows external systems to add their own readiness conditions. This lets your model server report "warmed up" after serving 100 requests (ensuring JIT is complete and CUDA kernels are compiled) before being added to the load balancer.

spec:
  readinessGates:
    - conditionType: "fraud.model/warmed-up"    # custom condition

  containers:
    - name: model-server
      # ...
      readinessProbe:
        httpGet:
          path: /health/ready
          port: 8080
        initialDelaySeconds: 60
        periodSeconds: 10
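
The rule Kubernetes applies here is simple: a pod counts as Ready only when its containers are ready and every condition listed under readinessGates is "True", with a missing condition treated as not ready. The sketch below models that rule in plain Python; the function name and the dict representation of conditions are simplifications for illustration, not the API's actual types.

```python
def pod_is_ready(containers_ready: bool, readiness_gates: list[str],
                 conditions: dict[str, str]) -> bool:
    """Pod readiness = containers ready AND every gate condition is 'True'."""
    # A gate whose condition has not been reported yet counts as not ready.
    gates_ok = all(conditions.get(gate) == "True" for gate in readiness_gates)
    return containers_ready and gates_ok


gates = ["fraud.model/warmed-up"]
print(pod_is_ready(True, gates, {}))                                 # False: probe passed, gate unset
print(pod_is_ready(True, gates, {"fraud.model/warmed-up": "True"}))  # True: warmup reported
```

The first case is exactly the window this section is about: the readiness probe has passed, but the pod stays out of the Service endpoints until something sets the custom condition.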

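A separate process (here, a simple sidecar container) watches the model server and sets the condition once warmup is complete. An init container cannot do this job, because init containers finish before the model server starts. The sidecar's service account needs RBAC permission to patch pods/status in its namespace: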

# warmup_reporter.py - runs as a sidecar container next to the model server
import os
import time
from datetime import datetime, timezone

import requests
from kubernetes import client, config

config.load_incluster_config()
v1 = client.CoreV1Api()

# Expected to be injected into the container, e.g. via the Downward API
POD_NAME = os.environ["POD_NAME"]
POD_NAMESPACE = os.environ["POD_NAMESPACE"]

def wait_for_model_ready():
    """Poll the main container's readiness endpoint until it returns 200."""
    while True:
        try:
            r = requests.get("http://localhost:8080/health/ready", timeout=2)
            if r.status_code == 200:
                return
        except requests.RequestException:
            pass  # server not listening yet; retry
        time.sleep(5)

def run_warmup_requests(n=100):
    """Send N warmup requests to prime CUDA kernels."""
    dummy_payload = {"features": [0.0] * 512, "user_id": "warmup"}
    for _ in range(n):
        # A timeout keeps one hung request from stalling warmup forever
        requests.post("http://localhost:8080/predict", json=dummy_payload, timeout=30)
    print(f"Warmup complete after {n} requests.")

def set_ready_condition():
    """Set the custom readiness gate condition to True."""
    body = {"status": {"conditions": [{
        "type": "fraud.model/warmed-up",
        "status": "True",
        "lastTransitionTime": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "reason": "WarmupComplete",
        "message": "Model has completed warmup inference requests"
    }]}}
    v1.patch_namespaced_pod_status(POD_NAME, POD_NAMESPACE, body)
    print("Readiness gate condition set to True.")

wait_for_model_ready()
run_warmup_requests(100)
set_ready_condition()
# Stay alive: under restartPolicy: Always an exiting sidecar would be
# restarted and would rerun the warmup against a pod that is already serving.
while True:
    time.sleep(3600)

Zero-Downtime Rolling Updates with Warmup-Aware Scaling

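The zero-downtime update pattern for ML services with a 45-second warmup replaces pods one at a time: a new pod starts, passes its readiness probe, clears the warmup gate, and only then is an old pod terminated.

For 3 replicas with a 45-second warmup, the rollout takes approximately:

70 seconds per replica (startup + readiness + warmup gate)
Plus 30 seconds termination grace per old pod
Total: 3 × (70 + 30) = 300 seconds (about 5 minutes) if every step runs strictly in sequence, and somewhat less when an old pod's termination overlaps the next pod's startup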
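
The estimate above can be bounded with a small calculation. This is a simplified model that assumes pods are replaced one at a time (for example maxSurge: 1 with maxUnavailable: 0, which is an assumption, not something the rollout settings here confirm); the function name and the overlap flag are illustrative.

```python
def rollout_seconds(replicas: int, ready_seconds: int,
                    termination_grace_seconds: int,
                    overlap_termination: bool = False) -> int:
    """Rough rollout duration when pods are replaced one at a time."""
    if overlap_termination:
        # Old pods drain while the next new pod starts; only the final
        # termination adds to the total.
        return replicas * ready_seconds + termination_grace_seconds
    # Strictly sequential: wait out each termination before the next start.
    return replicas * (ready_seconds + termination_grace_seconds)


print(rollout_seconds(3, 70, 30))                            # upper bound: 300
print(rollout_seconds(3, 70, 30, overlap_termination=True))  # lower bound: 240
```

A real rollout usually lands between the two figures, depending on how quickly old pods exit after receiving SIGTERM.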

Vertical Pod Autoscaler (VPA) - Right-Sizing ML Pods

VPA automatically adjusts CPU and memory requests based on observed usage. For ML teams that don't know the right resource requests (new models, new workloads), VPA in recommendation mode provides data without auto-applying changes:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: fraud-model-vpa
  namespace: ml-prod
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: fraud-model
  updatePolicy:
    updateMode: "Off"           # Recommendation only - do not auto-resize pods
  resourcePolicy:
    containerPolicies:
      - containerName: model-server
        minAllowed:
          cpu: "1"
          memory: "4Gi"
        maxAllowed:
          cpu: "16"
          memory: "64Gi"

# View VPA recommendations
kubectl get vpa fraud-model-vpa -n ml-prod -o yaml | grep -A20 recommendation
# recommendation:
#   containerRecommendations:
#   - containerName: model-server
#     lowerBound:
#       cpu: 1500m
#       memory: 6824Mi
#     target:
#       cpu: 2200m
#       memory: 9200Mi   # VPA recommends ~9 GiB; if you requested 8 GiB, raise it
#     upperBound:
#       cpu: 4000m
#       memory: 14Gi

:::warning Don't Use VPA and HPA Together on the Same Metric
VPA and HPA should not both scale based on CPU/memory - they will conflict. VPA increases resources of existing pods while HPA adds new pods, creating a feedback loop. The safe pattern: use VPA for right-sizing (in recommendation mode, then apply manually), and HPA for scaling. Or use VPA on training Jobs (which don't autoscale) and HPA on serving Deployments.
:::

Production Notes

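Pre-scale before known traffic spikes. If you know a marketing campaign goes live at 9am, pre-scale your serving Deployment shortly beforehand: kubectl scale deployment/fraud-model -n ml-prod --replicas=20. This avoids the HPA lag during the initial spike. Keep in mind that the HPA still owns the replica count: once its scale-down stabilization window expires it starts removing the idle pods, so either pre-scale close to the event or temporarily raise minReplicas on the HPA (kubectl patch hpa fraud-model-hpa -n ml-prod -p '{"spec":{"minReplicas":20}}'). After the campaign, restore the original minimum and let HPA scale back down normally.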

Monitor HPA decisions. The HPA status shows its current reasoning:

kubectl describe hpa fraud-model-hpa -n ml-prod
# Current replicas: 3
# Desired replicas: 5        (ceil(3 × 87 / 60) = 5)
# Current metrics:
#   cpu: 87% / 60%
# Conditions:
#   ScalingLimited: False
#   AbleToScale: True

Set minReplicas >= 2 for production. With minReplicas: 1, a single pod failure causes an outage while the replacement warms up. Keep minimum 2 for production serving.

Common Mistakes

:::danger Using CPU Utilization Alone for ML Autoscaling
CPU utilization is a poor metric for GPU-accelerated ML serving. The model inference happens on GPU; the CPU is often not the bottleneck. A model can be fully saturated (100% GPU utilization, 8-second latency) while showing only 30% CPU utilization because the CPU threads are mostly waiting for GPU kernels to complete. Use RPS per pod or GPU utilization as the primary scaling metric. CPU can be a secondary metric for pre-processing-heavy models.
:::

:::warning Not Accounting for Pod Warmup Time in maxReplicas Math
If your cluster has 40 GPU slots and your model server takes 45 seconds to warm up, scaling from 5 to 40 replicas means 35 new pods all loading models simultaneously. If your model artifacts are on a shared NFS PVC, 35 concurrent 4 GB reads will saturate the network. The warmup storm causes pods to take 10 minutes to become ready instead of 45 seconds. Use a scaling rate limit (HPA behavior.scaleUp.policies) to stagger pod creation and avoid storage I/O saturation.
:::
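
To get a feel for the warmup storm described above, here is a back-of-the-envelope sketch. It assumes all pods in a batch share one storage link equally and that the link is the only bottleneck; the 0.25 GB/s bandwidth figure is a made-up number for illustration, so substitute a measurement from your own storage.

```python
def concurrent_load_seconds(pods: int, artifact_gb: float,
                            shared_bandwidth_gb_per_s: float) -> float:
    """Time until a batch of pods finishes reading the model over a shared link."""
    # Every pod in the batch competes for the same link, so they all finish
    # at roughly the same time: total bytes divided by total bandwidth.
    return pods * artifact_gb / shared_bandwidth_gb_per_s


BANDWIDTH = 0.25  # GB/s, hypothetical shared NFS throughput

print(concurrent_load_seconds(35, 4, BANDWIDTH))  # all 35 pods at once
print(concurrent_load_seconds(4, 4, BANDWIDTH))   # a rate-limited batch of 4
```

Staggering does not reduce the total bytes read, but under these assumptions the first batch of pods is serving traffic in about a minute instead of every pod waiting more than nine minutes.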

Interview Q&A

Q1: How does the Kubernetes HPA compute the desired replica count, and what are the limitations of CPU-based autoscaling for ML serving?

The HPA formula is: desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric). The controller queries the metrics API every 15 seconds and applies the formula. CPU-based scaling has three limitations for ML serving: (1) CPU doesn't reflect GPU saturation - a model running at 100% GPU utilization might show 20% CPU. (2) CPU is a lagging signal - a spike has to show up in scraped, averaged pod metrics before the next sync acts on it, and CPU samples from newly started pods may be skipped during their initialization period, so the response trails the spike. Note that the default 5-minute stabilization window applies to scale-down only; scale-up has none by default. (3) New pods need to warm up before they're useful, which adds to the effective scaling response time. Better metrics: RPS per pod (direct measure of serving load), inference queue depth (for batch systems), or GPU utilization (for GPU-bound models).

Q2: What is KEDA and what capabilities does it add beyond the built-in HPA?

KEDA (Kubernetes Event-Driven Autoscaling) extends K8s autoscaling with: (1) Scale-to-zero - KEDA can scale a Deployment to 0 replicas when idle. Standard HPA minimum is 1. (2) External event sources - Kafka consumer lag, Redis list length, SQS queue depth, Prometheus queries, Datadog metrics, and 50+ more scalers out of the box. (3) Direct polling of event sources - KEDA queries each source itself on a configurable pollingInterval (30 seconds by default; the examples in this lesson set 15) and feeds the result to an HPA that it creates and manages, so you don't have to deploy and configure a separate metrics adapter. For ML specifically: scale batch inference workers based on Kafka topic lag (workers scale to zero when no messages, scale up proportional to lag), or scale serving based on Prometheus GPU utilization queries that standard HPA can only consume through an adapter.

Q3: Explain the pod warmup problem for ML serving autoscaling and two ways to mitigate it.

The problem: when an HPA triggers scale-up, new pods are created but they take 30–90 seconds to load models and warm up CUDA kernels before serving traffic efficiently. During this window, they appear in the Service endpoint list (if readiness probe passes after just model loading, before warmup), receive traffic, and serve it at 3–5x normal latency due to cold kernels. Mitigation 1: readiness gates - extend the pod ready condition with a custom gate that only signals ready after N warmup requests have been processed, ensuring CUDA kernels are hot before any production traffic arrives. Mitigation 2: pre-scaling - for predictable traffic patterns (marketing campaigns, business hour peaks), pre-scale the Deployment 10–15 minutes in advance via CI/CD or a CronJob that runs kubectl scale before the expected spike.

Q4: A recommendation model serving deployment is experiencing "flapping" - constantly scaling up and back down every few minutes. What causes this and how do you fix it?

Flapping (also called thrashing) occurs when the scale-up trigger and the scale-down trigger are too close together, or the stabilization window is too short. For ML serving: (1) if warmup takes 45 seconds and you scale down within 60 seconds of the trigger metric returning to normal, you might scale down pods that haven't finished warming up yet, then scale up again because warmup made the metric spike. Fix: increase scaleDown.stabilizationWindowSeconds to 300–600 seconds for ML services with long warmup times. (2) If the target metric is too close to steady-state values, normal traffic variance triggers constant scaling. Fix: widen the target threshold. (3) If VPA and HPA are both modifying resources, they can create oscillation. Fix: use VPA in recommendation mode only, don't auto-apply.
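
The stabilization window in fix (1) works by remembering recent recommendations: for scale-down, the HPA acts on the highest replica recommendation seen inside the window rather than the latest one. The sketch below models only that rule; the function name is illustrative and the recommendation history is made-up data.

```python
def stabilized_scale_down(recent_recommendations: list[int], current: int) -> int:
    """Scale-down target: the highest recommendation inside the window."""
    if not recent_recommendations:
        return current
    # Never scale below the busiest moment the window still remembers.
    return min(current, max(recent_recommendations))


# Recommendations from the last 300 seconds, oldest first (hypothetical values)
window = [10, 6, 9, 5, 4]
print(stabilized_scale_down(window, current=10))      # 10: the old peak still counts
print(stabilized_scale_down(window[1:], current=10))  # 9: peak aged out of the window
```

A longer window keeps brief dips from removing pods that took 45 seconds to warm up, at the cost of holding extra capacity for a few more minutes.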

Q5: How would you design an autoscaling system for a batch inference pipeline that processes 500K records per hour with peak loads of 2M records per hour?

Design: Use KEDA with a Kafka (or SQS) scaler. Producers push inference requests to a Kafka topic. Workers consume from the topic. KEDA scales workers based on consumer group lag with a target of 1,000 messages of lag per worker (tuned to your model's throughput). At steady state (500K records/hour with workers processing ~300K/hour each), you might need 2–3 workers. At peak (2M records/hour), KEDA scales to ~6–8 workers. Set minReplicaCount: 0 and cooldownPeriod: 600 - scale to zero during off-hours, saving GPU cost. For the scale-from-zero latency problem (first messages wait 45 seconds for pods to warm up), keep one worker always running during business hours, either by setting minReplicaCount: 1 or with a separate single-replica Deployment in the same consumer group, so the first messages are processed while KEDA scales up the main pool.
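
The sizing in this answer comes down to two divisions, sketched below. The per-worker throughput of 300,000 records per hour is an assumed figure for illustration; measure your own model's throughput before using numbers like these. Both function names are hypothetical.

```python
import math


def workers_needed(records_per_hour: int, per_worker_per_hour: int) -> int:
    """Workers required to keep up with an arrival rate."""
    return math.ceil(records_per_hour / per_worker_per_hour)


def backlog_seconds(lag_per_worker: int, per_worker_per_hour: int) -> float:
    """How long one worker needs to drain its share of the lag."""
    return lag_per_worker / (per_worker_per_hour / 3600)


PER_WORKER = 300_000  # records/hour, assumed

print(workers_needed(500_000, PER_WORKER))    # steady state
print(workers_needed(2_000_000, PER_WORKER))  # peak
print(round(backlog_seconds(1_000, PER_WORKER), 1), "seconds of backlog per worker")
```

The backlog figure is a useful way to choose the lag target: a threshold of 1,000 messages means each worker is allowed to fall about twelve seconds behind before KEDA adds another.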
