GitOps for ML

The Model That Went to Production Without Anyone Noticing

It was a Thursday afternoon when the ML platform team realized that a model with a 3% lower F1 score was serving 100% of production traffic. Nobody had approved it. Nobody had reviewed the performance comparison. The model had been deployed because an engineer with kubectl access had run kubectl set image deployment/fraud-detector model=registry/fraud:v2.1 directly on the cluster - a "quick fix" to a configuration issue - and the team's only deployment mechanism was a Confluence page titled "How to Update a Model (Updated Feb 2023)."

There was no audit trail. No rollback mechanism. No way to reconstruct what had happened. The cluster's desired state was whatever was last applied via kubectl, which might have been three different engineers on three different days. The Git repository, which was supposed to be the source of truth, was weeks out of date.

This is the pre-GitOps state of ML deployment at many organizations. It is not incompetence - it is the natural result of giving engineers direct access to the cluster and no structured alternative. When there is no fast, safe mechanism for deploying models, engineers reach for the fastest unsafe one.

GitOps solves this problem at the architectural level. Under GitOps, Git is the only way to change the system. Every model deployment is a PR. Every rollback is a git revert. Every production incident has a complete audit trail in the commit history. The cluster continuously reconciles itself against the Git state - any manual change is automatically corrected within minutes. The Thursday afternoon incident cannot persist: a manual kubectl set image is overwritten at the next reconciliation, so the only durable way to change the serving image is a reviewed commit.

This lesson covers the GitOps toolchain for ML platforms: Flux CD and ArgoCD for continuous reconciliation, image update automation for automatic model promotion, Argo Rollouts for canary deployments, and Sealed Secrets for credential management. All of it is wired together so that a model deployment is a PR merge and a rollback is a git revert.

Why This Exists

Traditional CI/CD pushes changes to the cluster: a CI pipeline runs kubectl apply or helm upgrade when code merges. This works but has a critical failure mode: the pipeline only asserts the desired state at the moment it runs. If someone runs kubectl scale or kubectl set image, or changes a ConfigMap directly, nothing detects or corrects the drift until the next pipeline run.

GitOps, formalized by Weaveworks in 2017, inverts this model. Instead of pushing changes in, a cluster-side agent pulls the desired state from Git and continuously reconciles against it. The key properties (from the GitOps Working Group specification):

  • Declarative: the system is described as desired state, not a sequence of commands
  • Versioned and immutable: Git history is the audit trail
  • Pulled automatically: software agents pull and apply changes, not pipelines
  • Continuously reconciled: drift is detected and corrected automatically
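
The pull-and-reconcile idea in the list above can be sketched in a few lines. This is a toy model, not how Flux or ArgoCD are implemented: resources are reduced to a hypothetical name-to-spec mapping, and `plan_reconcile` and `reconcile` are illustrative names.

```python
from typing import Dict, List, Tuple

# Hypothetical simplification: resource name -> serialized spec
Manifest = Dict[str, str]


def plan_reconcile(desired: Manifest, live: Manifest, prune: bool = True) -> List[Tuple[str, str]]:
    """Return the (action, resource) pairs needed to make `live` match `desired`."""
    actions: List[Tuple[str, str]] = []
    for name, spec in sorted(desired.items()):
        if name not in live:
            actions.append(("create", name))
        elif live[name] != spec:
            actions.append(("update", name))  # drift, or a new commit in Git
    if prune:
        for name in sorted(live):
            if name not in desired:
                actions.append(("delete", name))  # resource no longer declared in Git
    return actions


def reconcile(desired: Manifest, live: Manifest, prune: bool = True) -> Manifest:
    """Apply the plan to a copy of the live state and return the result."""
    result = dict(live)
    for action, name in plan_reconcile(desired, live, prune):
        if action == "delete":
            del result[name]
        else:
            result[name] = desired[name]
    return result


git_state = {"deployment/fraud-detector": "image=fraud:v2.0"}
cluster = {
    "deployment/fraud-detector": "image=fraud:v2.1",  # someone ran kubectl set image
    "configmap/leftover": "x=1",                      # created by hand, never committed
}
print(plan_reconcile(git_state, cluster))
print(reconcile(git_state, cluster) == git_state)
```

Running the agent on an interval means the manual image change survives only until the next pass, which is the property the rest of this lesson relies on.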

For ML specifically, GitOps adds a layer of governance that is difficult to achieve any other way. Every model deployment has a PR, a code review, and a merge commit. The exact model version serving production at any point in time is readable directly from the Git history. A rollback is a git revert that the agent applies at its next reconciliation. These are not just nice-to-haves - in regulated industries (finance, healthcare), they are often legal requirements.

GitOps Architecture for ML

Flux CD - The Pull-Based Reconciler

Flux is a CNCF graduated project that implements GitOps as a set of Kubernetes controllers. It is composed of several controllers: Source, Kustomize, Helm, Image Automation.

Installing Flux

# Install Flux CLI
brew install fluxcd/tap/flux

# Bootstrap Flux on the cluster - connects to your Git repository
flux bootstrap github \
  --owner=myorg \
  --repository=ml-platform \
  --branch=main \
  --path=clusters/prod \
  --personal=false \
  --token-auth=false # Uses deploy key instead

GitRepository Source - Defining Where Flux Reads From

# clusters/prod/flux-system/ml-platform-source.yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: ml-platform
  namespace: flux-system
spec:
  interval: 1m # How often Flux checks for new commits
  ref:
    branch: main
  url: ssh://git@github.com/myorg/ml-platform.git
  secretRef:
    name: ml-platform-deploy-key # SSH key with read access
  ignore: |
    # Do not watch these paths - they change too frequently
    /.github/
    /docs/
    /experiments/

Kustomization - Applying Manifests from Git

# clusters/prod/flux-system/model-deployments.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: model-deployments
  namespace: flux-system
spec:
  interval: 5m # Re-apply every 5 minutes (catches drift)
  retryInterval: 1m # Retry on failure
  timeout: 5m
  sourceRef:
    kind: GitRepository
    name: ml-platform
  path: ./clusters/prod/model-deployments # Which directory to apply
  prune: true # Delete resources removed from Git (critical!)
  wait: true # Wait for all applied resources to become ready
  healthChecks: # Flux ignores this list while wait is true; it applies only if wait is removed
    - apiVersion: apps/v1
      kind: Deployment
      name: fraud-detector
      namespace: ml-serving
  postBuild:
    substitute:
      CLUSTER_ENV: "prod"
      AWS_REGION: "us-east-1"

HelmRelease - Deploying ML Services via Helm

# clusters/prod/model-deployments/mlflow-release.yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: HelmRepository
metadata:
  name: mlflow
  namespace: flux-system
spec:
  interval: 1h
  url: https://community-charts.github.io/helm-charts

---
apiVersion: helm.toolkit.fluxcd.io/v2beta2
kind: HelmRelease
metadata:
  name: mlflow
  namespace: mlops
spec:
  interval: 10m
  chart:
    spec:
      chart: mlflow
      version: "0.7.x" # Semver range - auto-updates patch versions within 0.7
      sourceRef:
        kind: HelmRepository
        name: mlflow
        namespace: flux-system
      interval: 1h # Check for chart updates hourly
  values:
    replicaCount: 3
    backendStore:
      postgres:
        enabled: true
        host: "${MLFLOW_DB_HOST}"
        dbName: mlflow
    defaultArtifactRoot: "s3://eai-prod-model-artifacts/mlflow-artifacts"
    serviceAccount:
      annotations:
        eks.amazonaws.com/role-arn: "arn:aws:iam::123456789:role/eai-prod-mlflow-role"
  valuesFrom:
    # Override with environment-specific values stored in a ConfigMap
    - kind: ConfigMap
      name: mlflow-values-override
      valuesKey: values.yaml
  upgrade:
    remediation:
      retries: 3
      strategy: rollback # Roll back if upgrade fails
  rollback:
    cleanupOnFail: true
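
A range such as `0.7.x` holds the major and minor components fixed and floats only the patch. The sketch below shows that selection rule on a list of made-up chart versions; it handles plain `MAJOR.MINOR.PATCH` strings only and is not the semver library Flux uses. `latest_in_x_range` is a hypothetical helper name.

```python
import re
from typing import List, Optional, Tuple


def parse(version: str) -> Optional[Tuple[int, ...]]:
    """Parse a plain MAJOR.MINOR.PATCH string; anything else returns None."""
    m = re.fullmatch(r"v?(\d+)\.(\d+)\.(\d+)", version)
    return tuple(int(g) for g in m.groups()) if m else None


def latest_in_x_range(versions: List[str], x_range: str) -> Optional[str]:
    """Pick the highest version matching a range such as '0.7.x'."""
    wanted = x_range.split(".")
    best = None
    for candidate in versions:
        parsed = parse(candidate)
        if parsed is None:
            continue  # skip tags that are not plain semver
        # Every non-wildcard component must match exactly.
        if all(w in ("x", "*") or int(w) == p for w, p in zip(wanted, parsed)):
            if best is None or parsed > best[0]:
                best = (parsed, candidate)
    return best[1] if best else None


# Illustrative version list, not taken from the real chart repository
available = ["0.6.9", "0.7.0", "0.7.12", "0.7.3", "0.8.0", "nightly"]
print(latest_in_x_range(available, "0.7.x"))
```

Note that `0.8.0` is never selected: moving to a new minor version takes an explicit edit to the range, which is itself a reviewed commit.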

ArgoCD - Application-Centric GitOps

ArgoCD is an alternative to Flux with a richer UI and more explicit application model. Many teams use ArgoCD for application workloads and Flux for infrastructure.

ArgoCD Application

# argocd/applications/fraud-detector.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: fraud-detector
  namespace: argocd
  finalizers:
    - resources-finalizer.argocd.argoproj.io # Deletes resources when app is deleted
spec:
  project: ml-serving

  source:
    repoURL: https://github.com/myorg/ml-platform.git
    targetRevision: main
    path: model-deployments/fraud-detector

  destination:
    server: https://kubernetes.default.svc
    namespace: ml-serving

  syncPolicy:
    automated:
      prune: true # Delete resources removed from Git
      selfHeal: true # Revert manual changes to cluster
      allowEmpty: false # Don't accidentally delete everything
    syncOptions:
      - Validate=true
      - CreateNamespace=true
      - PrunePropagationPolicy=foreground
      - PruneLast=true # Prune after new resources are healthy
    retry:
      limit: 5
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 3m

  revisionHistoryLimit: 10 # Keep 10 previous versions for rollback
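
The `retry.backoff` settings describe an exponential schedule. As a rough illustration, assuming each delay is `duration * factor**n` capped at `maxDuration` (the function names here are hypothetical, and this is arithmetic on the settings, not ArgoCD code):

```python
from typing import List


def parse_duration(text: str) -> int:
    """Convert a simple duration such as '5s' or '3m' to seconds."""
    units = {"s": 1, "m": 60, "h": 3600}
    return int(text[:-1]) * units[text[-1]]


def backoff_schedule(limit: int, duration: str, factor: int, max_duration: str) -> List[int]:
    """Delay in seconds before each retry, capped at max_duration."""
    base, cap = parse_duration(duration), parse_duration(max_duration)
    return [min(base * factor ** attempt, cap) for attempt in range(limit)]


# Values from the Application manifest: limit 5, duration 5s, factor 2, maxDuration 3m
print(backoff_schedule(limit=5, duration="5s", factor=2, max_duration="3m"))
# [5, 10, 20, 40, 80]
```

With these values the five retries span roughly two and a half minutes, so the 3m cap only matters if `limit` is raised.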

ApplicationSet - One Application per Model Directory

# argocd/applicationsets/model-serving.yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: model-serving
  namespace: argocd
spec:
  generators:
    # Generate one Application per model directory
    - git:
        repoURL: https://github.com/myorg/ml-platform.git
        revision: main
        directories:
          - path: model-deployments/*

  template:
    metadata:
      name: "{{path.basename}}"
      namespace: argocd
    spec:
      project: ml-serving
      source:
        repoURL: https://github.com/myorg/ml-platform.git
        targetRevision: main
        path: "{{path}}"
      destination:
        server: https://kubernetes.default.svc
        namespace: ml-serving
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
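
To make the template expansion concrete, here is a small sketch of what the Git directory generator produces: one parameter set per matching directory, with `path` and `path.basename` filled in. The directory list and the `churn` model are invented for illustration, and the sketch treats `*` as matching a single path segment.

```python
from fnmatch import fnmatch
from posixpath import basename
from typing import Dict, List


def generate_applications(directories: List[str], pattern: str) -> List[Dict[str, str]]:
    """Emit one {name, path} parameter set per directory matching the pattern."""
    apps = []
    for path in sorted(directories):
        same_depth = path.count("/") == pattern.count("/")  # '*' covers one segment only
        if same_depth and fnmatch(path, pattern):
            apps.append({"name": basename(path), "path": path})
    return apps


# Hypothetical repository layout
repo_dirs = [
    "model-deployments/fraud-detector",
    "model-deployments/churn",
    "model-deployments/churn/base",
    "docs/guide",
]
for app in generate_applications(repo_dirs, "model-deployments/*"):
    print(app)
```

Adding a new model then needs no ArgoCD change at all: committing a new directory under `model-deployments/` is enough for an Application to appear.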

Sync Waves - Ordered Deployment

For ML platforms, some resources must exist before others: the feature store must be ready before the model server starts, the model server must be ready before routing is updated.

# model-deployments/fraud-detector/rollout.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: fraud-detector
  namespace: ml-serving
  annotations:
    argocd.argoproj.io/sync-wave: "2" # Apply after wave 0 (secrets, configmaps)
spec:
  replicas: 4
  selector:
    matchLabels:
      app: fraud-detector
  template:
    metadata:
      labels:
        app: fraud-detector
    spec:
      containers:
        - name: model
          image: 123456789.dkr.ecr.us-east-1.amazonaws.com/fraud-detector:sha-abc123
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: "2"
              memory: "4Gi"
            limits:
              cpu: "4"
              memory: "8Gi"
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
  strategy:
    canary:
      steps:
        - setWeight: 10 # Send 10% of traffic to new version
        - pause: {duration: 5m} # Wait 5 minutes
        - analysis:
            templates:
              - templateName: fraud-detector-latency
        - setWeight: 50
        - pause: {duration: 10m}
        - setWeight: 100
      canaryService: fraud-detector-canary
      stableService: fraud-detector-stable
      trafficRouting:
        istio:
          virtualService:
            name: fraud-detector-vs
            routes:
              - primary

---
# Wave 0: secrets and configmaps must be ready first
apiVersion: v1
kind: ConfigMap
metadata:
  name: fraud-detector-config
  namespace: ml-serving
  annotations:
    argocd.argoproj.io/sync-wave: "0"
data:
  MODEL_VERSION: "v2.1"
  FEATURE_STORE_ENDPOINT: "http://feature-store:8080"
  THRESHOLD: "0.85"
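
The ordering behaviour can be sketched as follows: resources are grouped by their sync-wave annotation, waves are applied in ascending order, and the sync stops at the first wave that does not become healthy. This is a simplified model; `sync_in_waves` and the `is_healthy` callback are hypothetical stand-ins for ArgoCD's own apply and health assessment.

```python
from itertools import groupby
from typing import Callable, Dict, List

WAVE_KEY = "argocd.argoproj.io/sync-wave"


def wave_of(resource: Dict) -> int:
    """Read the sync-wave annotation; resources without one are treated as wave 0."""
    annotations = resource.get("metadata", {}).get("annotations", {})
    return int(annotations.get(WAVE_KEY, "0"))


def sync_in_waves(resources: List[Dict], is_healthy: Callable[[Dict], bool]) -> List[str]:
    """Apply resources wave by wave; stop after the first wave that is not healthy."""
    applied: List[str] = []
    ordered = sorted(resources, key=wave_of)
    for _, group in groupby(ordered, key=wave_of):
        batch = list(group)
        applied.extend(r["metadata"]["name"] for r in batch)
        if not all(is_healthy(r) for r in batch):
            break  # higher waves are never applied
    return applied


rollout = {"kind": "Rollout",
           "metadata": {"name": "fraud-detector", "annotations": {WAVE_KEY: "2"}}}
config = {"kind": "ConfigMap",
          "metadata": {"name": "fraud-detector-config", "annotations": {WAVE_KEY: "0"}}}

print(sync_in_waves([rollout, config], lambda r: True))
print(sync_in_waves([rollout, config], lambda r: r["kind"] != "ConfigMap"))
```

The second call shows the failure mode to watch for: an unhealthy wave 0 resource means the Rollout in wave 2 is never applied.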

Image Update Automation - Automatic Model Promotion

Flux's image automation controllers watch your container registry for new image tags and commit the updated tag back to the manifests in Git, either directly to the deployed branch or to a separate branch from which a PR is opened. This is the link between your training pipeline and your GitOps deployment.

# clusters/prod/flux-system/image-automation.yaml

# Step 1: Tell Flux which registry to watch
apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImageRepository
metadata:
  name: fraud-detector
  namespace: flux-system
spec:
  image: 123456789.dkr.ecr.us-east-1.amazonaws.com/fraud-detector
  interval: 5m # Scan registry every 5 minutes for new tags
  secretRef:
    name: ecr-credentials

---
# Step 2: Define which tags to consider (semver filtering)
apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImagePolicy
metadata:
  name: fraud-detector
  namespace: flux-system
spec:
  imageRepositoryRef:
    name: fraud-detector
  policy:
    semver:
      range: ">=1.0.0" # Only promote release tags (not sha-* or dev-*)

---
# Step 3: Tell Flux which file to update when a new image is found
apiVersion: image.toolkit.fluxcd.io/v1beta1
kind: ImageUpdateAutomation
metadata:
  name: model-images
  namespace: flux-system
spec:
  interval: 5m
  sourceRef:
    kind: GitRepository
    name: ml-platform
  git:
    checkout:
      ref:
        branch: main
    commit:
      author:
        name: Flux Image Automation
        email: flux@example.com # Required field; placeholder address, replace with your own
      messageTemplate: |
        chore: update {{range .Updated.Images}}{{.}}{{end}} to latest
    push:
      branch: staging # Push to staging branch, not main - requires PR
  update:
    path: ./clusters/prod/model-deployments
    strategy: Setters # Use marker comments in manifests

Marker comments tell Flux exactly which field to update:

# model-deployments/fraud-detector/rollout.yaml (with markers)
spec:
  template:
    spec:
      containers:
        - name: model
          image: 123456789.dkr.ecr.us-east-1.amazonaws.com/fraud-detector:v1.2.3 # {"$imagepolicy": "flux-system:fraud-detector"}
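
The marker mechanism is easy to picture as a line-level rewrite: find `image:` fields that carry a `$imagepolicy` marker, look up the policy's latest image, and replace only the value while leaving the marker in place. The sketch below is an approximation for intuition, not Flux's implementation; `apply_setters` and the `registry.example` image names are hypothetical.

```python
import json
import re
from typing import Dict

# Matches:  <indent>[- ]image: <ref>  # {"$imagepolicy": "<namespace>:<name>"}
MARKER = re.compile(
    r'^(?P<prefix>\s*(?:-\s+)?image:\s*)(?P<image>\S+)(?P<gap>\s*#\s*)(?P<marker>\{.*\})\s*$'
)


def apply_setters(manifest: str, latest: Dict[str, str]) -> str:
    """Rewrite image fields on lines carrying a {"$imagepolicy": ...} marker.

    `latest` maps "namespace:policy-name" to the full image reference to write.
    """
    out = []
    for line in manifest.splitlines():
        m = MARKER.match(line)
        if m:
            policy = json.loads(m.group("marker")).get("$imagepolicy")
            if policy in latest:
                line = f'{m.group("prefix")}{latest[policy]}{m.group("gap")}{m.group("marker")}'
        out.append(line)
    return "\n".join(out)


manifest = (
    'containers:\n'
    '  - name: model\n'
    '    image: registry.example/fraud-detector:v1.2.3 # {"$imagepolicy": "flux-system:fraud-detector"}'
)
print(apply_setters(manifest, {"flux-system:fraud-detector": "registry.example/fraud-detector:v1.3.0"}))
```

Lines without a marker are left untouched, which is why unmarked images in the same directory are never changed by the automation.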

Model Deployment via GitOps - Full Workflow

GitHub Actions - The CI Half

# .github/workflows/train-and-deploy.yml
name: Train and Deploy Model

on:
  push:
    paths:
      - 'models/fraud-detector/**'
    branches:
      - main

# Token permissions needed for OIDC role assumption and for opening the PR
permissions:
  id-token: write # OIDC token for aws-actions/configure-aws-credentials
  contents: write # Push the deploy/* branch
  pull-requests: write # Open the deployment PR

env:
  ECR_REGISTRY: 123456789.dkr.ecr.us-east-1.amazonaws.com
  IMAGE_NAME: fraud-detector
  AWS_REGION: us-east-1

jobs:
  train-and-evaluate:
    runs-on: ubuntu-latest
    outputs:
      model_version: ${{ steps.version.outputs.version }}
      evaluation_passed: ${{ steps.evaluate.outputs.passed }}

    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789:role/github-actions-ml
          aws-region: ${{ env.AWS_REGION }}

      - name: Run training job
        run: |
          python models/fraud-detector/train.py \
            --experiment-name fraud-detector-${{ github.sha }} \
            --output-dir /tmp/model-artifacts

      - name: Evaluate model
        id: evaluate
        run: |
          # A non-zero exit from evaluate.py fails this step, so "passed" is written only on success
          python models/fraud-detector/evaluate.py \
            --model-dir /tmp/model-artifacts \
            --baseline-version latest \
            --min-f1-score 0.92 \
            --max-latency-ms 50
          echo "passed=true" >> "$GITHUB_OUTPUT"

      - name: Generate version
        id: version
        run: |
          VERSION="v$(date +%Y%m%d)-$(echo ${{ github.sha }} | cut -c1-8)"
          echo "version=$VERSION" >> "$GITHUB_OUTPUT"

  build-and-push:
    needs: train-and-evaluate
    if: needs.train-and-evaluate.outputs.evaluation_passed == 'true'
    runs-on: ubuntu-latest
    outputs:
      image_tag: ${{ steps.push.outputs.image_tag }}

    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789:role/github-actions-ml
          aws-region: ${{ env.AWS_REGION }}

      - name: Login to ECR
        uses: aws-actions/amazon-ecr-login@v2

      - name: Build and push serving container
        id: push
        run: |
          IMAGE_TAG="${{ needs.train-and-evaluate.outputs.model_version }}"
          FULL_IMAGE="${ECR_REGISTRY}/${IMAGE_NAME}:${IMAGE_TAG}"

          docker build \
            -f models/fraud-detector/Dockerfile.serve \
            -t "$FULL_IMAGE" \
            --label "git-sha=${{ github.sha }}" \
            --label "trained-at=$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
            .

          docker push "$FULL_IMAGE"
          echo "image_tag=$IMAGE_TAG" >> "$GITHUB_OUTPUT"

  open-deployment-pr:
    needs: [train-and-evaluate, build-and-push]
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4

      - name: Update image tag in manifests
        run: |
          IMAGE_TAG="${{ needs.build-and-push.outputs.image_tag }}"
          MANIFEST="clusters/prod/model-deployments/fraud-detector/rollout.yaml"

          # Update the image tag using yq
          yq e ".spec.template.spec.containers[0].image = \"${ECR_REGISTRY}/${IMAGE_NAME}:${IMAGE_TAG}\"" \
            -i "$MANIFEST"

      - name: Open PR with model update
        uses: peter-evans/create-pull-request@v6
        with:
          token: ${{ secrets.GITHUB_TOKEN }}
          commit-message: "deploy: update fraud-detector to ${{ needs.build-and-push.outputs.image_tag }}"
          branch: "deploy/fraud-detector-${{ needs.build-and-push.outputs.image_tag }}"
          title: "Deploy fraud-detector ${{ needs.build-and-push.outputs.image_tag }}"
          body: |
            ## Model Deployment

            **Model**: `fraud-detector`
            **Version**: `${{ needs.build-and-push.outputs.image_tag }}`
            **Training commit**: `${{ github.sha }}`

            ### Evaluation Results
            See training run in MLflow: [View Run](https://mlflow.prod.internal/experiments/fraud-detector-${{ github.sha }})

            ### Deployment Plan
            - [ ] 10% canary traffic for 5 minutes
            - [ ] Monitor latency and error rate
            - [ ] Full rollout on success, automatic rollback on failure

            **Review the evaluation metrics before approving.**
          labels: |
            model-deployment
            fraud-detector
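
The workflow relies on `evaluate.py` exiting non-zero when a candidate is not good enough. That script is not shown here, so the following is only a guess at what such a gate could look like: the metric names, the baseline comparison, and the function names are all assumptions, with thresholds mirroring the `--min-f1-score 0.92` and `--max-latency-ms 50` flags.

```python
import sys
from typing import Dict, List


def gate_failures(metrics: Dict[str, float], baseline_f1: float,
                  min_f1: float = 0.92, max_latency_ms: float = 50.0) -> List[str]:
    """Return the reasons a candidate should be blocked; an empty list means it passes."""
    failures = []
    if metrics["f1"] < min_f1:
        failures.append(f"f1 {metrics['f1']:.3f} is below the floor {min_f1:.3f}")
    if metrics["f1"] < baseline_f1:
        failures.append(f"f1 {metrics['f1']:.3f} regresses from baseline {baseline_f1:.3f}")
    if metrics["p99_latency_ms"] > max_latency_ms:
        failures.append(
            f"p99 latency {metrics['p99_latency_ms']:.1f}ms exceeds {max_latency_ms:.1f}ms"
        )
    return failures


def main(metrics: Dict[str, float], baseline_f1: float) -> int:
    """Print each blocking reason and return the process exit code."""
    failures = gate_failures(metrics, baseline_f1)
    for reason in failures:
        print(f"BLOCKED: {reason}", file=sys.stderr)
    return 1 if failures else 0  # a non-zero exit code fails the CI step


# Illustrative numbers only
exit_code = main({"f1": 0.95, "p99_latency_ms": 41.0}, baseline_f1=0.94)
print("exit code:", exit_code)
```

In a real script the return value would be passed to `sys.exit`. The baseline comparison is what would have caught the model from the opening story: a candidate with a lower F1 than the version in production never reaches the build job.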

Secrets Management in GitOps

Sealed Secrets - Encrypt Secrets to Commit in Git

# Install kubeseal CLI
brew install kubeseal

# Fetch the cluster's public key
kubeseal --fetch-cert \
  --controller-namespace=sealed-secrets \
  --controller-name=sealed-secrets \
  > pub-cert.pem

# Create a regular Kubernetes secret, then seal it
# The namespace must match the target namespace: sealed secrets are namespace-scoped by default
kubectl create secret generic mlflow-db-credentials \
  --namespace=mlops \
  --from-literal=password=supersecret123 \
  --from-literal=username=mlflow \
  --dry-run=client \
  -o yaml | \
  kubeseal \
  --cert pub-cert.pem \
  --format yaml \
  > clusters/prod/secrets/mlflow-db-credentials-sealed.yaml

# The sealed secret is safe to commit to Git
# Only the cluster's private key can decrypt it

# clusters/prod/secrets/mlflow-db-credentials-sealed.yaml
# This file is safe to commit - encrypted with cluster public key
apiVersion: bitnami.com/v1alpha1
kind: SealedSecret
metadata:
  name: mlflow-db-credentials
  namespace: mlops
spec:
  encryptedData:
    password: AgBy3i4OJSWK+PiTySYZZA9rO43cGDEq...
    username: AgAKb8ySjB0kfGPAHDg3xOZIYqvpQQ...
  template:
    metadata:
      name: mlflow-db-credentials
      namespace: mlops
    type: Opaque

External Secrets Operator - Pull from AWS Secrets Manager

# clusters/prod/secrets/mlflow-external-secret.yaml
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: aws-secrets-manager
  namespace: mlops
spec:
  provider:
    aws:
      service: SecretsManager
      region: us-east-1
      auth:
        jwt:
          serviceAccountRef:
            name: external-secrets-sa
            namespace: external-secrets

---
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: mlflow-db-credentials
  namespace: mlops
spec:
  refreshInterval: 1h # Re-sync from Secrets Manager hourly
  secretStoreRef:
    name: aws-secrets-manager
    kind: SecretStore

  target:
    name: mlflow-db-credentials # Name of the Kubernetes Secret to create
    creationPolicy: Owner

  data:
    - secretKey: password # Key in the Kubernetes Secret
      remoteRef:
        key: eai-prod/mlflow/db # Path in AWS Secrets Manager
        property: password # JSON key within the secret

    - secretKey: username
      remoteRef:
        key: eai-prod/mlflow/db
        property: username
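
The `data` entries above describe a mapping: for each `secretKey`, read the remote secret at `remoteRef.key`, pick the JSON field named by `property`, and write it into the Kubernetes Secret. The sketch below mimics that mapping against an in-memory dictionary standing in for AWS Secrets Manager; `build_secret` and the credential values are hypothetical, and no AWS call is made.

```python
import base64
import json
from typing import Dict, List


def build_secret(name: str, data_spec: List[Dict], remote_store: Dict[str, str]) -> Dict:
    """Resolve each (remote key, property) pair and emit a Kubernetes-style Secret."""
    data = {}
    for entry in data_spec:
        ref = entry["remoteRef"]
        remote_value = json.loads(remote_store[ref["key"]])  # remote secret is a JSON document
        plaintext = remote_value[ref["property"]]
        # Kubernetes stores Secret values base64-encoded under .data
        data[entry["secretKey"]] = base64.b64encode(plaintext.encode()).decode()
    return {"apiVersion": "v1", "kind": "Secret", "type": "Opaque",
            "metadata": {"name": name}, "data": data}


# In-memory stand-in for the secrets backend, with example-only values
fake_backend = {
    "eai-prod/mlflow/db": json.dumps({"username": "mlflow", "password": "example-only"}),
}
spec = [
    {"secretKey": "password", "remoteRef": {"key": "eai-prod/mlflow/db", "property": "password"}},
    {"secretKey": "username", "remoteRef": {"key": "eai-prod/mlflow/db", "property": "username"}},
]
secret = build_secret("mlflow-db-credentials", spec, fake_backend)
print(sorted(secret["data"]))
```

Only the mapping lives in Git. Rotating the password changes the backend value, and the next refresh picks it up without any commit.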

Drift Detection and Remediation

ArgoCD flags drift as OutOfSync when it compares live state against Git, and it can be configured to self-heal:

# argocd/applications/fraud-detector.yaml
spec:
  syncPolicy:
    automated:
      selfHeal: true # Automatically revert manual changes
    syncOptions:
      - RespectIgnoreDifferences=true

  # Ignore differences that are OK to have (e.g., autoscaler changes replica count)
  ignoreDifferences:
    - group: apps
      kind: Deployment
      jsonPointers:
        - /spec/replicas # Ignore autoscaler-managed replica count
    - group: argoproj.io
      kind: Rollout
      jsonPointers:
        - /spec/replicas

# Check for drift manually
argocd app diff fraud-detector

# Force resync if something is out of sync
argocd app sync fraud-detector

# Roll back to previous version (ArgoCD refuses a rollback while automated sync is enabled)
argocd app rollback fraud-detector

# View resource history
argocd app history fraud-detector
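
The effect of `ignoreDifferences` can be illustrated by masking the listed JSON pointers on both sides before comparing. This sketch handles object keys only (no array indices) and is a conceptual model, not ArgoCD's diffing engine; `drop_pointer` and `is_out_of_sync` are hypothetical names.

```python
import copy
from typing import Any, Dict, List


def drop_pointer(doc: Dict[str, Any], pointer: str) -> None:
    """Remove the field addressed by a JSON pointer such as '/spec/replicas', if present."""
    parts = [p.replace("~1", "/").replace("~0", "~") for p in pointer.lstrip("/").split("/")]
    node: Any = doc
    for part in parts[:-1]:
        if not isinstance(node, dict) or part not in node:
            return  # path does not exist; nothing to mask
        node = node[part]
    if isinstance(node, dict):
        node.pop(parts[-1], None)


def is_out_of_sync(desired: Dict, live: Dict, ignore_pointers: List[str]) -> bool:
    """Compare desired and live state after masking the ignored fields in both."""
    d, l = copy.deepcopy(desired), copy.deepcopy(live)
    for pointer in ignore_pointers:
        drop_pointer(d, pointer)
        drop_pointer(l, pointer)
    return d != l


desired = {"spec": {"replicas": 4, "image": "fraud-detector:v2.0"}}
scaled = {"spec": {"replicas": 9, "image": "fraud-detector:v2.0"}}     # autoscaler acted
tampered = {"spec": {"replicas": 4, "image": "fraud-detector:v2.1"}}   # manual set image

print(is_out_of_sync(desired, scaled, ["/spec/replicas"]))    # False: expected drift is ignored
print(is_out_of_sync(desired, tampered, ["/spec/replicas"]))  # True: image drift is still caught
```

Keeping the ignore list as narrow as possible matters: every pointer added is a field that manual changes can alter without being reverted.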

Argo Rollouts - Canary Analysis

# model-deployments/fraud-detector/analysis-template.yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: fraud-detector-latency
  namespace: ml-serving
spec:
  metrics:
    - name: p99-latency
      interval: 1m
      # Prometheus returns a vector, so the condition indexes the first element
      successCondition: result[0] < 50 # Less than 50ms P99
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket{
                service="fraud-detector",
                version="canary"
              }[5m])) by (le)
            ) * 1000

    - name: error-rate
      interval: 1m
      successCondition: result[0] < 0.01 # Less than 1% error rate
      failureLimit: 2
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{
              service="fraud-detector",
              version="canary",
              status=~"5.."
            }[5m])) /
            sum(rate(http_requests_total{
              service="fraud-detector",
              version="canary"
            }[5m]))

    - name: business-metric
      interval: 5m
      successCondition: result > 0.92 # F1 score above threshold
      failureLimit: 1
      provider:
        web:
          url: "http://ml-metrics-api/models/fraud-detector/live-f1"
          jsonPath: "{$.f1_score}"
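
How `successCondition` and `failureLimit` interact is easier to see on a series of measurements. The sketch assumes a metric is marked failed once the number of failed measurements exceeds `failureLimit`, so a limit of 3 tolerates three bad samples; check this reading against the Argo Rollouts documentation for your version. `run_metric` and the sample values are hypothetical.

```python
from typing import Callable, Iterable


def run_metric(measurements: Iterable[float], success: Callable[[float], bool],
               failure_limit: int) -> str:
    """Return 'Failed' once failed measurements exceed failure_limit, else 'Successful'."""
    failed = 0
    for value in measurements:
        if not success(value):
            failed += 1
            if failed > failure_limit:
                return "Failed"  # the rollout would abort at this point
    return "Successful"


def p99_ok(latency_ms: float) -> bool:
    """Mirror of the p99-latency condition: under 50 ms."""
    return latency_ms < 50


# Made-up latency samples in milliseconds, one per 1m interval
print(run_metric([42, 55, 48, 61, 47], p99_ok, failure_limit=3))  # two failures are tolerated
print(run_metric([55, 61, 70, 52], p99_ok, failure_limit=3))      # the fourth failure exceeds the limit
```

A tight limit on the business metric and looser limits on noisy infrastructure metrics, as in the template above, trades sensitivity against false aborts.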

Production Engineering Notes

Repository structure matters: Separate your application code repo from your GitOps configuration repo. App repo: model code, training scripts, Dockerfile. Config repo: Kubernetes manifests, Helm values, Flux/ArgoCD resources. This separation means infrastructure changes can be reviewed independently from code changes, and the config repo becomes the pure source of truth for what runs where.

Namespace per environment: In a single cluster, use namespaces to separate environments (ml-dev, ml-staging, ml-prod). Apply Kubernetes NetworkPolicies to prevent cross-namespace traffic. Use ArgoCD projects to limit which source repositories and destination namespaces each team's Applications may target, so a team can only deploy into its own namespaces.

Gradual rollout vs instant switch: Never set canary weight from 0 to 100 in one step. For ML models, use 10% → 30% → 70% → 100% with analysis gates at each step. If a metric fails the analysis at 10%, you abort before any real damage - only 10% of users saw the bad model.
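
A quick way to reason about blast radius is to simulate the stepped schedule with a gate after each partial step. This is an illustrative model only; `run_canary` and the gate callback are hypothetical, and a real rollout would evaluate analysis runs rather than a Python function.

```python
from typing import Callable, List, Tuple


def run_canary(weights: List[int], gate: Callable[[int], bool]) -> Tuple[str, int]:
    """Step through canary weights and stop at the first failed gate.

    Returns the outcome and the peak percentage of traffic that saw the new model.
    """
    exposure = 0
    for weight in weights:
        exposure = max(exposure, weight)
        if weight < 100 and not gate(weight):
            return ("aborted", exposure)
    return ("promoted", exposure)


steps = [10, 30, 70, 100]
print(run_canary(steps, lambda w: True))     # every gate passes
print(run_canary(steps, lambda w: False))    # first gate fails: exposure stays at 10%
print(run_canary(steps, lambda w: w < 70))   # gate fails only under heavier load
```

The third case is the argument for intermediate steps: some regressions only appear at higher traffic shares, and a 70% gate catches them before full promotion.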

Flux vs ArgoCD: Flux is more composable and Kubernetes-native (everything is a CRD). ArgoCD has a better UI and more mature RBAC/multi-tenancy model. Many production teams run both: Flux for infrastructure and add-ons, ArgoCD for application deployments. They coexist fine on the same cluster.

Common Mistakes

:::danger Never Give CI Pipelines Direct kubectl Access to Production If your CI/CD pipeline runs kubectl apply or helm upgrade directly against production, you have bypassed the entire GitOps model. CI should only push to Git. The GitOps agent applies to the cluster. No exceptions. :::

:::danger prune: true Is Dangerous Without Understanding When prune: true is set (necessary for proper GitOps), deleting a file from Git deletes the corresponding Kubernetes resource. A mis-typed git rm, an accidental path change, or a repository reorganization can delete production deployments. Always test prune behavior in staging first, and ensure you have a recent etcd backup. :::

:::warning Sync Waves Create Ordering Complexity Sync waves solve the "resource A must exist before resource B" problem, but they add complexity. If a wave 0 resource fails to become ready, ArgoCD stops and all higher-wave resources never deploy. This means a broken ConfigMap blocks the model deployment. Test your wave ordering carefully and set generous health check timeouts. :::

:::warning Image Update Automation Bypasses PR Review Flux's image update automation can be configured to push commits directly to main rather than to a separate branch. This is convenient but removes human review from model deployments. For production ML systems, always configure automation to push to a staging branch and require a PR for promotion to main. :::

Interview Q&A

Q: What are the four core properties of GitOps and how do they apply to ML deployments?

The four GitOps properties from the OpenGitOps specification: (1) Declarative - system state described as desired state, not imperative commands. For ML: Kubernetes manifests describing model deployments, not kubectl set image commands. (2) Versioned and immutable - Git history is the audit trail. For ML: every model version deployed to production has a corresponding Git commit and PR. Rollback means git revert. (3) Pulled automatically - a cluster-side agent (Flux/ArgoCD) pulls state from Git, not pipelines pushing in. For ML: this means the cluster continuously converges to the Git state, even if someone makes a manual change. (4) Continuously reconciled - drift is detected and corrected. For ML: if someone manually changes the model serving container tag, ArgoCD reverts it within minutes.

Q: What is the difference between Flux and ArgoCD, and when would you use each?

Both implement GitOps but with different designs. Flux is purely Kubernetes-native - every concept is a CRD, configuration is just Kubernetes manifests, and there is no separate UI server required. It is highly composable and easier to automate with GitOps itself (bootstrapping Flux is a single command that commits manifests to Git). ArgoCD has a richer UI with a graphical application tree, more mature RBAC/multi-tenancy features (projects, RBAC policies, SSO integration), and better support for managing multiple clusters from a single control plane. For ML platform teams, a common pattern is Flux for cluster infrastructure (add-ons, operators, namespaces) and ArgoCD for model serving applications - leveraging each tool's strengths.

Q: How do you handle secrets in a GitOps model without committing plaintext credentials?

Two main approaches. Sealed Secrets: encrypt Kubernetes secrets with the cluster's public key using kubeseal, commit the encrypted SealedSecret manifest to Git. Only the controller on the specific cluster (which holds the private key) can decrypt it. Simple, self-contained, no external dependencies. External Secrets Operator: define ExternalSecret resources in Git that reference secrets stored in AWS Secrets Manager, HashiCorp Vault, or GCP Secret Manager. The operator fetches and syncs secrets at runtime. More operationally complex but gives you centralized secrets management, audit logs in the secrets backend, and secret rotation without Git commits. For ML teams on AWS, External Secrets + AWS Secrets Manager is typically the better choice because it integrates with IAM audit logging and allows secret rotation without a Git change.

Q: Walk me through what happens when a model deployment PR is merged in a GitOps system.

(1) PR merges - Git commit updates the image tag in the Kubernetes manifest. (2) Flux's GitRepository source controller detects the new commit within 1 minute (configurable interval). (3) Flux's Kustomization controller computes the diff between current cluster state and the desired state in Git. (4) The diff is applied - in this case, an Argo Rollout object is updated with the new image tag. (5) Argo Rollouts controller detects the Rollout spec change and begins the canary strategy: it creates a new ReplicaSet with the new image, routes 10% of traffic to it. (6) After 5 minutes, the AnalysisRun checks Prometheus metrics for the canary pods. (7) If latency and error rate are within bounds, traffic weight increases to 50%, then 100%. (8) If an analysis fails, Argo Rollouts automatically aborts and rolls back to the stable version. Total automation - no manual steps after the PR merge.

Q: What is drift in a GitOps context and how do you detect and remediate it?

Drift is when the actual state of the cluster diverges from the desired state declared in Git. Sources: someone runs kubectl edit, a Helm release is upgraded manually, a Horizontal Pod Autoscaler changes replica counts. ArgoCD detects drift during its reconciliation loop (every few minutes by default) by comparing the live cluster state against the rendered manifests from Git. It shows drift as OutOfSync status in the UI. Remediation options: (1) selfHeal: true - ArgoCD automatically re-applies the Git state, reverting manual changes; (2) manual sync - engineer reviews the drift, decides whether to sync Git to cluster or vice versa; (3) ignoreDifferences - for expected drift (autoscaler-managed replica counts), configure ArgoCD to ignore specific fields. For ML models, selfHeal on model serving deployments is essential - you cannot allow someone's kubectl set image to override the reviewed, tested version in Git.

Quick Reference - GitOps CLI Commands

# Flux - force reconciliation immediately (don't wait for interval)
flux reconcile source git ml-platform
flux reconcile kustomization model-deployments

# Check reconciliation status
flux get kustomizations
flux get helmreleases -A

# Suspend reconciliation (e.g., for maintenance window)
flux suspend kustomization model-deployments
flux resume kustomization model-deployments

# ArgoCD - sync an app immediately
argocd app sync fraud-detector

# Check app health and sync status
argocd app list
argocd app get fraud-detector

# Roll back to a previous version
argocd app rollback fraud-detector 3 # rollback to history item 3

# Pause and resume automated sync (e.g., during incident)
argocd app set fraud-detector --sync-policy none
argocd app set fraud-detector --sync-policy automated

# Image update automation - check what images Flux is watching
flux get image repository -A
flux get image policy -A

# Sealed secrets - re-seal after key rotation
kubeseal --fetch-cert --controller-namespace sealed-secrets > new-cert.pem
kubeseal --cert new-cert.pem -o yaml < original-secret.yaml > new-sealed.yaml
© 2026 EngineersOfAI. All rights reserved.