
KServe and Kubernetes ML Operators

The Platform Team's Dilemma

Your ML platform team has spent three months building a custom Kubernetes controller to manage model serving. It creates Deployments, Services, HPAs, and ConfigMaps from a simple ModelServing custom resource that data scientists submit. It works well for PyTorch and ONNX models. But the NLP team needs GPU autoscaling, the vision team needs Triton Inference Server with dynamic batching, and someone just asked about A/B traffic splitting between model versions.

Six months later, your custom controller is 8,000 lines of Go, has a 2-person maintenance burden, and still doesn't support canary deployments or monitoring integration. Meanwhile, KServe - an open-source project with 200+ contributors and 5,000+ GitHub stars - does all of this and more, battle-tested at production scale by dozens of major companies.

This is the build vs. buy dilemma for ML platform operators. This lesson gives you the framework to make that decision and shows you the production-ready options available today.


What Kubernetes Operators Enable

A Kubernetes operator is a software pattern (not a product) that extends Kubernetes with domain-specific knowledge using custom resources (CRDs) and controllers. The operator pattern was introduced by CoreOS in 2016 and formalized in the Kubernetes documentation.

The core idea: instead of managing Kubernetes primitives directly (Deployments, Services, HPAs), you define a higher-level abstraction specific to your domain. For ML:

# Without an operator - data scientists write this:
apiVersion: apps/v1
kind: Deployment
metadata: ...
spec:
  replicas: 3
  template:
    spec:
      containers:
        - name: model
          image: registry/fraud-model:v2.1.0
          resources:
            limits:
              nvidia.com/gpu: 1
# ... 80 more lines of YAML
---
apiVersion: v1
kind: Service
...
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
...
---
# ... 200 total lines

# With a model serving operator - data scientists write this:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: fraud-model
  namespace: team-fraud
spec:
  predictor:
    model:
      modelFormat:
        name: pytorch
      storageUri: s3://ml-models/fraud/v2.1.0/
      resources:
        limits:
          nvidia.com/gpu: 1

The operator takes the InferenceService resource, creates all the underlying Kubernetes resources (Deployment, Service, HPA, Istio VirtualService for traffic splitting), monitors them, and reconciles them back to the desired state if anything drifts. The data scientist writes 15 lines instead of 200, and the platform team encodes their best practices into the operator once.
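The reconcile step described above can be sketched in a few lines of Python. This is a toy model, not KServe's controller (which is written in Go): the `Child` type, the `runtime-for-<format>` image naming, and the three child kinds are assumptions chosen for illustration. It shows the core idea of expanding one high-level resource into its children and diffing them against what exists.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Child:
    """A low-level object the operator owns (hypothetical, simplified)."""
    kind: str
    name: str
    spec: tuple  # hashable stand-in for the real object spec


def desired_children(isvc: dict) -> set:
    """Expand one high-level resource into the objects it implies."""
    name = isvc["metadata"]["name"]
    predictor = isvc["spec"]["predictor"]
    model = predictor["model"]
    # Hypothetical mapping from model format to a serving image.
    image = f"runtime-for-{model['modelFormat']['name']}"
    return {
        Child("Deployment", name, (("image", image), ("storageUri", model["storageUri"]))),
        Child("Service", name, (("port", 80),)),
        Child("HorizontalPodAutoscaler", name, (("max", predictor.get("maxReplicas", 1)),)),
    }


def reconcile(isvc: dict, actual: set) -> list:
    """Return the actions that move `actual` toward the desired state."""
    desired = {(c.kind, c.name): c for c in desired_children(isvc)}
    existing = {(c.kind, c.name): c for c in actual}
    actions = []
    for key in sorted(desired):
        if key not in existing:
            actions.append(("create",) + key)
        elif existing[key] != desired[key]:
            actions.append(("update",) + key)  # drift detected
    for key in sorted(existing.keys() - desired.keys()):
        actions.append(("delete",) + key)  # orphaned child
    return actions
```

Calling `reconcile` again once the actions have been applied returns an empty list, which is the property that makes the loop safe to run repeatedly.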

KServe - The Standard for ML Serving on Kubernetes

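KServe (formerly KFServing) began in 2019 as a Kubeflow subproject built jointly by Google, IBM, Bloomberg, NVIDIA, and Seldon. It was renamed KServe in 2021 and moved to its own GitHub organization, and it is now governed by an open-source foundation rather than a single vendor. It remains the recommended model serving solution for Kubeflow and is widely adopted as a standalone operator. KServe provides: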

  • Standardized model serving: built-in support for scikit-learn, XGBoost, TensorFlow, PyTorch, ONNX, MLflow, Hugging Face
  • Triton Inference Server integration: GPU-accelerated serving with dynamic batching
  • Canary deployments: traffic splitting between model versions
  • Transformer support: pre/post-processing pipelines as separate containers
  • Scale to zero: serverless mode with Knative
  • Explainability: built-in Alibi Explain integration
  • Multi-model serving: host many small models on one GPU

KServe InferenceService - Basic Usage

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: fraud-model
  namespace: team-fraud
  annotations:
    serving.kserve.io/enable-prometheus-scraping: "true"
spec:
  predictor:
    minReplicas: 2
    maxReplicas: 10
    scaleTarget: 50 # scale when >50 RPS per replica
    scaleMetric: rps
    model:
      modelFormat:
        name: pytorch
        version: "2"
      storageUri: "s3://ml-models/fraud/v2.1.0/"
      resources:
        requests:
          cpu: "2"
          memory: "8Gi"
          nvidia.com/gpu: 1
        limits:
          cpu: "4"
          memory: "16Gi"
          nvidia.com/gpu: 1
      # Model server parameters
      args:
        - --model-name=fraud-model
        - --model-dir=/mnt/models

KServe automatically:

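  • Downloads model artifacts from S3 with a storage-initializer init container into a volume shared with the model server (mounted at /mnt/models, an emptyDir by default rather than a PVC)
  • Starts the appropriate model server (TorchServe for PyTorch)
  • Creates a Service, Ingress, and optionally an Istio VirtualService
  • Configures autoscaling: the Knative autoscaler in serverless mode, or the Kubernetes HPA (optionally KEDA) in raw deployment mode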
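
To make the `scaleTarget: 50` / `scaleMetric: rps` settings concrete, here is a minimal sketch of the arithmetic a request-based autoscaler performs. It is a simplification under stated assumptions: real autoscalers also average the metric over a window and apply stabilization delays, and the function name `desired_replicas` is hypothetical.

```python
import math


def desired_replicas(observed_rps: float, target_per_replica: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Replicas needed so each one handles at most `target_per_replica` RPS."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    raw = math.ceil(observed_rps / target_per_replica)
    # Clamp to the bounds declared on the predictor (minReplicas/maxReplicas).
    return max(min_replicas, min(max_replicas, raw))
```

With the values from the manifest above (target 50, min 2, max 10), 120 RPS needs `ceil(120 / 50) = 3` replicas, while zero traffic still keeps the 2-replica floor.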

Canary Deployments with KServe

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: fraud-model
  namespace: team-fraud
spec:
  predictor:
    canaryTrafficPercent: 10 # send 10% of traffic to the new revision
    minReplicas: 2
    model:
      modelFormat:
        name: pytorch
      storageUri: "s3://ml-models/fraud/v2.1.0/" # new version being validated

# Applying this updated spec creates a new revision, which becomes the canary.
# The previously rolled-out revision (v2.0.0) keeps serving the other 90%.

While the canary is being validated:

# Check traffic distribution
kubectl get inferenceservice fraud-model -n team-fraud
# NAME URL READY PREV LATEST PREVROLLEDOUTREVISION LATESTREADYREVISION
# fraud-model http://fraud-model.team-fraud... True 90 10

After validating the canary:

# Promote the canary to 100% by removing canaryTrafficPercent (zero-downtime traffic shift)
kubectl patch inferenceservice fraud-model -n team-fraud \
  --type=merge -p '{"spec":{"predictor":{"canaryTrafficPercent":null}}}'

# To roll back instead, send all traffic to the previous revision
kubectl patch inferenceservice fraud-model -n team-fraud \
  --type=merge -p '{"spec":{"predictor":{"canaryTrafficPercent":0}}}'
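
The effect of a percentage split can be illustrated with a small Python sketch. Note the assumption: the serving mesh picks a revision per request by weight, whereas this example uses a hash of a hypothetical request ID so that the result is deterministic and easy to test. The labels `latest` and `previous` mirror the LATEST and PREV columns in the `kubectl` output.

```python
import hashlib


def route(request_id: str, canary_percent: int) -> str:
    """Pick a revision for one request, given the canary's traffic share."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in 0..99
    return "latest" if bucket < canary_percent else "previous"


def canary_share(num_requests: int, canary_percent: int) -> float:
    """Fraction of synthetic requests that land on the canary."""
    hits = sum(route(f"req-{i}", canary_percent) == "latest"
               for i in range(num_requests))
    return hits / num_requests
```

Over many requests the observed share converges on the configured percentage; with a share of 0 every request goes to the previous revision, which is why setting the field to 0 acts as a rollback.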

KServe Transformer - Pre/Post-Processing Pipelines

A Transformer is a separate component that runs before the predictor (for feature preprocessing) and after it (for response postprocessing). This keeps the predictor focused on inference:

spec:
  transformer:
    containers:
      - name: fraud-transformer
        image: registry.company.com/fraud-transformer:v1.2
        env:
          - name: FEATURE_SERVER_URL
            value: "http://feature-server-svc:8081"
        resources:
          requests:
            cpu: "1"
            memory: "2Gi"
  predictor:
    model:
      modelFormat:
        name: pytorch
      storageUri: "s3://ml-models/fraud/v2.1.0/"

Request flow: client → transformer (fetch features, preprocess) → predictor (inference) → transformer (postprocess) → client
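
The request flow can be sketched as three plain Python functions chained together. This is not the KServe SDK: the in-memory `FEATURES` table stands in for the feature server, the averaging "model" and the `threshold` value are made up for illustration, and only the `instances` / `predictions` payload keys follow the v1 inference protocol.

```python
from typing import Dict, List

# Stand-in for the feature server the transformer would call over HTTP.
FEATURES: Dict[str, List[float]] = {"user-42": [0.1, 0.7, 3.0]}


def preprocess(request: Dict) -> Dict:
    """Transformer, inbound: swap entity IDs for feature vectors."""
    return {"instances": [FEATURES[entity_id] for entity_id in request["instances"]]}


def predict(payload: Dict) -> Dict:
    """Predictor: a toy model that scores each row by its mean."""
    return {"predictions": [sum(row) / len(row) for row in payload["instances"]]}


def postprocess(response: Dict, threshold: float = 1.0) -> Dict:
    """Transformer, outbound: turn raw scores into a business-level answer."""
    return {"predictions": [{"score": s, "fraud": s > threshold}
                            for s in response["predictions"]]}


def handle(request: Dict) -> Dict:
    """client -> transformer -> predictor -> transformer -> client."""
    return postprocess(predict(preprocess(request)))
```

Because the client only sends an entity ID, feature lookup logic can change in the transformer image without retraining or redeploying the predictor.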

Multi-Model Serving with KServe and Triton

When serving dozens of small models, a single Triton instance can host all of them:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: triton-multi-model-server
  namespace: ml-prod
spec:
  predictor:
    triton:
      storageUri: "s3://ml-models/triton-model-repository/"
      runtimeVersion: "23.12-py3"
      resources:
        limits:
          nvidia.com/gpu: 1
# Triton serves all models in the repository simultaneously
# fraud-model, churn-model, ltv-model, spam-classifier, etc.
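
For this to work, the storage location has to follow Triton's model repository layout: one directory per model, each holding a `config.pbtxt` and numbered version subdirectories. The sketch below builds a tiny demo repository on local disk and lists what a server would find. Treat the details as assumptions: the helper names are hypothetical, `model.pt` is used as the weights filename, and the check is stricter than Triton itself, which can infer the configuration for some backends.

```python
import tempfile
from pathlib import Path


def make_demo_repo(root: Path, names=("fraud-model", "churn-model")) -> Path:
    """Create <root>/<model>/config.pbtxt and <root>/<model>/1/model.pt."""
    for name in names:
        version_dir = root / name / "1"
        version_dir.mkdir(parents=True)
        (root / name / "config.pbtxt").write_text(f'name: "{name}"\n')
        (version_dir / "model.pt").write_bytes(b"")  # placeholder weights
    return root


def list_models(repo: Path) -> dict:
    """Map each model name to the numeric versions found under it."""
    models = {}
    for model_dir in sorted(p for p in repo.iterdir() if p.is_dir()):
        versions = sorted(int(v.name) for v in model_dir.iterdir()
                          if v.is_dir() and v.name.isdigit())
        if (model_dir / "config.pbtxt").exists() and versions:
            models[model_dir.name] = versions
    return models


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(list_models(make_demo_repo(Path(tmp))))
```

Running a check like this in CI before uploading to the bucket catches a misplaced version directory before the server fails to load the model.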

Seldon Core - Alternative ML Serving Operator

Seldon Core is a production ML serving platform open-sourced by Seldon Technologies. Key differentiators from KServe:

  • SeldonDeployment CRD: more flexible component graph (preprocessing → model → postprocessing)
  • Outlier detection: built-in Alibi Detect integration
  • Explain: Anchors, SHAP, Counterfactuals via Alibi Explain
  • Tracing: built-in Jaeger integration
  • Canary + A/B: more sophisticated traffic routing than KServe

apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: fraud-model-seldon
  namespace: team-fraud
spec:
  predictors:
    - name: default
      graph:
        name: fraud-model
        type: MODEL
        implementation: TRITON_SERVER
        modelUri: s3://ml-models/fraud/v2.1.0/
        envSecretRefName: s3-credentials
      replicas: 3
      traffic: 90 # 90% to stable version

    - name: canary
      graph:
        name: fraud-model-canary
        type: MODEL
        implementation: TRITON_SERVER
        modelUri: s3://ml-models/fraud/v2.2.0/
      replicas: 1
      traffic: 10 # 10% to canary

Argo Workflows - ML Pipeline Orchestration

Argo Workflows is a Kubernetes-native workflow engine. Each step in a workflow runs as a Kubernetes Pod, giving you full access to cluster resources (GPUs, PVCs) with built-in retry, parallelism, and dependency management.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: fraud-model-training-pipeline
  namespace: team-fraud
spec:
  entrypoint: training-pipeline
  volumeClaimTemplates:
    - metadata:
        name: pipeline-workspace
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: fast-ssd
        resources:
          requests:
            storage: 200Gi

  templates:
    - name: training-pipeline
      dag:
        tasks:
          - name: data-validation
            template: run-data-validation
          - name: feature-engineering
            dependencies: [data-validation]
            template: run-feature-engineering
          - name: model-training
            dependencies: [feature-engineering]
            template: run-training
          - name: model-evaluation
            dependencies: [model-training]
            template: run-evaluation
          - name: model-registration
            dependencies: [model-evaluation]
            template: register-model
            when: "{{tasks.model-evaluation.outputs.parameters.auc}} > 0.92"

    - name: run-training
      resource:
        action: apply
        manifest: |
          apiVersion: kubeflow.org/v1
          kind: PyTorchJob
          metadata:
            name: fraud-train-{{workflow.uid}}
          spec:
            pytorchReplicaSpecs:
              Master:
                replicas: 1
                template:
                  spec:
                    containers:
                      - name: pytorch
                        image: registry.company.com/fraud-trainer:v3.0
                        resources:
                          limits:
                            nvidia.com/gpu: 4
        # Conditions are "<JSON path> <operator> <value>" checks against the job's status
        successCondition: status.replicaStatuses.Master.succeeded > 0
        failureCondition: status.replicaStatuses.Master.failed > 0

    # Templates for the other steps (run-data-validation, run-feature-engineering,
    # run-evaluation, register-model) are omitted here.

This Argo Workflow: validates data, engineers features, launches a PyTorchJob for training, evaluates the trained model, and only registers the model if AUC exceeds 0.92.
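
The DAG semantics (run a task only after its dependencies, and skip the gated task when its `when` expression is false) can be sketched with Python's standard `graphlib` module. This models the ordering only, not Argo itself: the `DAG` dictionary mirrors the task names in the manifest, and `run_pipeline` is a hypothetical helper that records which tasks would execute.

```python
from graphlib import TopologicalSorter

# task -> set of tasks it depends on, mirroring the `dependencies` fields.
DAG = {
    "data-validation": set(),
    "feature-engineering": {"data-validation"},
    "model-training": {"feature-engineering"},
    "model-evaluation": {"model-training"},
    "model-registration": {"model-evaluation"},
}


def run_pipeline(auc: float, threshold: float = 0.92) -> list:
    """Return the tasks that would execute, in dependency order."""
    executed = []
    for task in TopologicalSorter(DAG).static_order():
        # Equivalent of the `when` gate on the registration step.
        if task == "model-registration" and not auc > threshold:
            continue
        executed.append(task)
    return executed
```

A model that evaluates at 0.90 AUC runs the first four tasks and skips registration, so a weak model never reaches the registry.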

When to Build vs. Use Existing Operators

Use KServe when: you need standardized serving for common frameworks (PyTorch, TensorFlow, scikit-learn), GPU autoscaling, canary deployments, and integration with Kubeflow or MLflow model registries.

Use Seldon Core when: you need complex component graphs (multi-step pipelines), built-in outlier detection and explainability as serving components, or your organization is already invested in the Seldon ecosystem.

Use Argo Workflows when: you need to orchestrate multi-step training pipelines with conditional branching, artifact passing between steps, and full Kubernetes resource access (GPU Jobs, PVCs) in each step.

Build a custom operator when: your domain-specific resource management logic is genuinely novel (not just "model serving but different"), you have the engineering capacity to maintain Go code, and you've validated that extending an existing operator is more complex than building from scratch. This scenario is rare - most teams overestimate how different their needs are.

Production Notes

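KServe serverless mode: In serverless mode (with Knative Serving), KServe can scale models to zero replicas when they have no traffic and scale back up when the next request arrives. Cold-start time is dominated by pod scheduling, image pull, and model download and load, so measure it for your own models rather than assuming a fixed figure; large GPU models can take considerably longer than small CPU models. This is ideal for internal models with irregular traffic patterns - you pay only for active inference, not idle GPU capacity.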
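
The "pay only for active inference" claim is simple arithmetic, sketched below. The assumptions are deliberately generous: one single-GPU replica, no cost for cold-start time, and a cluster autoscaler that actually releases the GPU node once the pod is gone. The function names and the 3-hours-per-day figure in the usage note are hypothetical.

```python
def idle_savings_fraction(active_hours_per_day: float) -> float:
    """Share of always-on GPU time avoided by scaling to zero."""
    if not 0 <= active_hours_per_day <= 24:
        raise ValueError("active_hours_per_day must be between 0 and 24")
    return 1.0 - active_hours_per_day / 24.0


def monthly_gpu_hours(active_hours_per_day: float, days: int = 30) -> tuple:
    """Return (always_on_hours, scale_to_zero_hours) for one replica."""
    always_on = 24.0 * days
    scale_to_zero = active_hours_per_day * days
    return always_on, scale_to_zero
```

A model that is busy 3 hours a day uses 90 GPU-hours a month instead of 720 under these assumptions, which is the case where scale-to-zero pays off; a model busy around the clock saves nothing and only adds cold-start risk.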

Model registry integration: KServe can serve models packaged in the MLflow format (modelFormat name mlflow). Rather than an MLflow-specific URI scheme, point storageUri at the artifact location the registry records for the registered version (for example an s3:// path), and KServe downloads the model artifact from there. This creates a clean separation: the model registry manages versions, KServe handles serving lifecycle.

Cluster-level vs. namespace-level operator installation: KServe's operator runs cluster-wide. The CRD definitions are cluster-scoped objects, but the InferenceService resources created from them are namespaced. Install the operator once per cluster, then data science teams create InferenceService objects in their own namespaces.

Common Mistakes

:::danger Reinventing KServe Without Knowing It Exists The most common mistake on ML platform teams: spending 3–6 months building a custom model serving controller, only to discover KServe already solves the same problems with more features, better testing, and a larger community. Always evaluate KServe and Seldon before building. The operator space is mature - your serving use case is almost certainly covered.

A useful heuristic: if a senior engineer can describe the existing open-source operator's limitations in two sentences, build. If they need a spreadsheet, use the existing operator. :::

:::warning Not Implementing Proper RBAC for Operators KServe's operator needs RBAC permissions to create Deployments, Services, HPAs, ConfigMaps, and Secrets in all namespaces. Granting it cluster-admin is common but overly permissive. Use the least-privilege RBAC rules shipped with the operator's Helm chart. To review what the operator's service account is currently allowed to do, run kubectl auth can-i --as=system:serviceaccount:kserve:kserve-controller-manager --list and compare the result against what it actually needs. :::

:::warning Treating Operator CRDs as Permanent APIs CRDs are the API surface of your operator. Changing a CRD field (renaming it, removing it, or changing its type) is a breaking change for all existing resources. If you build a custom operator, version your CRDs from day one (v1alpha1 → v1beta1 → v1) and use conversion webhooks to maintain backward compatibility across versions. KServe followed this pattern when it introduced v1beta1, keeping existing v1alpha2 resources working during the migration. :::
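
The core of a conversion webhook is a pure function from one version's shape to another's. Here is a minimal sketch for a hypothetical custom `ModelServing` resource: the `ml.example.com` group and the fields `modelUri`, `replicas`, `model.storageUri`, `minReplicas`, and `maxReplicas` are all invented for illustration. A real webhook would wrap this in an HTTP handler that receives and returns a Kubernetes ConversionReview.

```python
import copy

OLD_VERSION = "ml.example.com/v1alpha1"  # hypothetical API group
NEW_VERSION = "ml.example.com/v1beta1"


def convert_v1alpha1_to_v1beta1(obj: dict) -> dict:
    """Convert one resource without mutating the caller's copy."""
    if obj.get("apiVersion") != OLD_VERSION:
        raise ValueError(f"expected {OLD_VERSION}, got {obj.get('apiVersion')}")
    out = copy.deepcopy(obj)
    out["apiVersion"] = NEW_VERSION
    spec = out["spec"]
    # v1alpha1 used a flat `modelUri`; v1beta1 nests it under `model`.
    spec["model"] = {"storageUri": spec.pop("modelUri")}
    # v1alpha1 had a fixed `replicas`; v1beta1 expresses a range.
    replicas = spec.pop("replicas", 1)
    spec["minReplicas"] = replicas
    spec["maxReplicas"] = replicas
    return out
```

Keeping the conversion lossless in both directions is what allows old manifests in Git and old clients to keep working while new fields are introduced.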

Interview Q&A

Q1: What is a Kubernetes operator and how does it differ from a regular Deployment?

A Kubernetes operator combines a Custom Resource Definition (CRD) with a controller. The CRD defines a new resource type in the Kubernetes API (like InferenceService or PyTorchJob). The controller is a control loop that watches for instances of that resource and reconciles the cluster state to match the desired state expressed in the resource. A Deployment is a built-in Kubernetes resource with built-in controller logic. An operator is a user-defined extension that follows the same controller-reconciler pattern but with domain-specific logic. The operator knows that an InferenceService with modelFormat: pytorch needs a specific init container for model download, a specific model server image, and specific readiness probe configuration - knowledge that cannot be expressed in generic Kubernetes primitives.

Q2: What are the main capabilities of KServe and what problems does it solve compared to raw Kubernetes?

KServe solves five problems: (1) Model serving boilerplate - instead of writing 200 lines of Deployment/Service/HPA YAML, data scientists write a 15-line InferenceService. (2) Multi-framework support - built-in serving runtimes for PyTorch, TensorFlow, scikit-learn, XGBoost, ONNX, and Hugging Face, with the correct server configuration for each. (3) Canary deployments - native traffic splitting between model revisions with a canaryTrafficPercent field. (4) Autoscaling - request-based autoscaling on RPS or concurrency, with scale-to-zero in serverless mode via Knative. (5) Model storage management - automatic download of model artifacts from S3, GCS, or Azure Blob by an init container before the model server starts.

Q3: Compare KServe and Seldon Core. When would you choose each?

Both are production ML serving operators, but with different strengths. KServe: tighter Kubeflow/MLflow integration, better out-of-the-box serverless (Knative) support, simpler CRD for common cases, stronger GPU autoscaling. Seldon Core: more flexible component graph (each stage in a pipeline is a separate component with independent scaling), tighter Alibi Detect integration for drift/outlier detection, more mature A/B testing support. Choose KServe for teams using Kubeflow or MLflow as the ML platform, for serverless/scale-to-zero serving, and for GPU workloads. Choose Seldon for complex multi-stage inference pipelines where preprocessing, postprocessing, and explanation need to scale independently, or when production outlier detection is a hard requirement.

Q4: When should an ML platform team build a custom Kubernetes operator vs. use KServe?

Build a custom operator when: (1) your serving requirements are genuinely novel and not covered by KServe or Seldon (e.g., you need a custom runtime for a proprietary chip that KServe doesn't support), (2) you have significant engineering capacity (at least 2 engineers dedicated to the operator, Go knowledge required), and (3) you've validated the cost by prototyping against extending an existing operator. The vast majority of teams should use KServe. Common justifications for building that don't actually justify it: "we need custom metrics" (KServe supports custom metrics), "we need canary deployments" (KServe supports this), "we need our own storage format" (KServe has pluggable storage initializers). Before building, spend one week trying to extend KServe via its extensibility points (custom serving runtimes, storage initializers). If you hit genuine blockers, build.

Q5: Describe how Argo Workflows can orchestrate an end-to-end ML training pipeline on Kubernetes.

Argo Workflows runs each pipeline step as a Kubernetes Pod, with full access to cluster resources. A typical ML pipeline: Step 1 (data-validation) runs a Python pod that validates schema, row counts, and distributions of the new training dataset. Step 2 (feature-engineering) runs a Spark-on-K8s job that generates feature vectors and stores them in a PVC. Step 3 (model-training) applies a PyTorchJob manifest and waits for it to reach Succeeded status. Step 4 (model-evaluation) runs a pod that loads the trained model, runs it on a holdout set, and exposes AUC as an output parameter. Step 5 (model-registration) only runs if AUC > 0.92 (conditional step using when), registering the model in MLflow. The DAG resolves execution order from the declared dependencies: in this pipeline every step depends on the previous one, so they run sequentially, but tasks with no dependency between them, such as hyperparameter sweep training jobs, run in parallel. Intermediate artifacts are passed through PVCs or an Argo artifact repository (S3).

© 2026 EngineersOfAI. All rights reserved.