Apply GitOps principles to ML infrastructure - Flux CD, ArgoCD, image update automation, secrets management, and PR-gated model deployments with Argo Rollouts.

GitOps for ML

The Model That Went to Production Without Anyone Noticing

It was a Thursday afternoon when the ML platform team realized that a model with a 3% lower F1 score was serving 100% of production traffic. Nobody had approved it. Nobody had reviewed the performance comparison. The model had been deployed because an engineer with kubectl access had run kubectl set image deployment/fraud-detector model=registry/fraud:v2.1 directly on the cluster - a "quick fix" to a configuration issue - and the team's only deployment mechanism was a Confluence page titled "How to Update a Model (Updated Feb 2023)."

There was no audit trail. No rollback mechanism. No way to reconstruct what had happened. The cluster's desired state was whatever was last applied via kubectl, which might have been three different engineers on three different days. The Git repository, which was supposed to be the source of truth, was weeks out of date.

This is the pre-GitOps state of ML deployment at many organizations. It is not incompetence - it is the natural result of giving engineers direct access to the cluster and no structured alternative. When there is no fast, safe mechanism for deploying models, engineers reach for the fastest unsafe one.

GitOps solves this problem at the architectural level. Under GitOps, Git is the only way to change the system. Every model deployment is a PR. Every rollback is a git revert. Every production incident has a complete audit trail in the commit history. The cluster continuously reconciles itself against the Git state, so a manual change is overwritten on the next reconciliation, typically within minutes. The Thursday afternoon incident cannot persist under this model: an image tag applied by hand is reverted to whatever Git declares, and the only durable path to production is a reviewed PR.

This lesson covers the GitOps toolchain for ML platforms: Flux CD and ArgoCD for continuous reconciliation, image update automation for automatic model promotion, Argo Rollouts for canary deployments, and Sealed Secrets for credential management. All of it wired together so that a model deployment is a PR merge and a rollback is a git revert.

Why This Exists

Traditional CI/CD pushes changes to the cluster: a CI pipeline runs kubectl apply or helm upgrade when code merges. This works but has a critical failure mode: the pipeline is the only thing maintaining the desired state. If someone runs kubectl delete pod or changes a ConfigMap directly, there is no mechanism to detect or correct the drift.

GitOps, formalized by Weaveworks in 2017, inverts this model. Instead of pushing changes in, a cluster-side agent pulls the desired state from Git and continuously reconciles against it. The key properties (from the GitOps Working Group specification):

Declarative: the system is described as desired state, not a sequence of commands
Versioned and immutable: Git history is the audit trail
Pulled automatically: software agents pull and apply changes, not pipelines
Continuously reconciled: drift is detected and corrected automatically
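
The pull-and-reconcile idea behind these four properties can be sketched in a few lines. This is a minimal illustration, not how Flux or ArgoCD are implemented: the in-memory dictionaries stand in for Git and the cluster, and the names `plan`, `reconcile`, and `debug-pod` are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory stand-ins: resource name -> spec.
Resources = Dict[str, dict]


def plan(desired: Resources, actual: Resources, prune: bool = True) -> List[Tuple[str, str]]:
    """Return the (action, name) steps that move `actual` toward `desired`."""
    steps: List[Tuple[str, str]] = []
    for name, spec in sorted(desired.items()):
        if name not in actual:
            steps.append(("create", name))
        elif actual[name] != spec:
            steps.append(("update", name))  # drift, or a new commit in Git
    if prune:
        for name in sorted(actual.keys() - desired.keys()):
            steps.append(("delete", name))  # present in the cluster, absent from Git
    return steps


def reconcile(desired: Resources, actual: Resources, prune: bool = True) -> Resources:
    """Apply the plan once; a real agent repeats this on every interval."""
    result = dict(actual)
    for action, name in plan(desired, actual, prune):
        if action == "delete":
            del result[name]
        else:
            result[name] = desired[name]
    return result


git_state = {"fraud-detector": {"image": "fraud:v2.0"}}
cluster_state = {
    "fraud-detector": {"image": "fraud:v2.1"},  # changed by hand with kubectl set image
    "debug-pod": {"image": "busybox"},          # created by hand, never committed
}

print(plan(git_state, cluster_state))
cluster_state = reconcile(git_state, cluster_state)
print(cluster_state == git_state)
```

The manual image change and the hand-created pod both disappear after one pass, which is the behaviour the "continuously reconciled" property describes.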

For ML specifically, GitOps adds a layer of governance that is difficult to achieve any other way. Every model deployment has a PR, a code review, and a merge commit. The exact model version serving production at any point in time is readable directly from the Git history. A rollback is a single git revert, which the agent applies on its next reconciliation. These are not just nice-to-haves - in regulated industries (finance, healthcare), they are often legal requirements.

GitOps Architecture for ML

Flux CD - The Pull-Based Reconciler

Flux is a CNCF graduated project that implements GitOps as a set of Kubernetes controllers: Source, Kustomize, Helm, Notification, Image Reflector, and Image Automation.

Installing Flux

# Install Flux CLI
brew install fluxcd/tap/flux

# Bootstrap Flux on the cluster - connects to your Git repository
flux bootstrap github \
  --owner=myorg \
  --repository=ml-platform \
  --branch=main \
  --path=clusters/prod \
  --personal=false \
  --token-auth=false  # Uses deploy key instead

GitRepository Source - Defining Where Flux Reads From

# clusters/prod/flux-system/ml-platform-source.yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: ml-platform
  namespace: flux-system
spec:
  interval: 1m        # How often Flux checks for new commits
  ref:
    branch: main
  url: ssh://git@github.com/myorg/ml-platform.git
  secretRef:
    name: ml-platform-deploy-key  # SSH key with read access
  ignore: |
    # Do not watch these paths - they change too frequently
    /.github/
    /docs/
    /experiments/

Kustomization - Applying Manifests from Git

# clusters/prod/flux-system/model-deployments.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: model-deployments
  namespace: flux-system
spec:
  interval: 5m          # Re-apply every 5 minutes (catches drift)
  retryInterval: 1m     # Retry on failure
  timeout: 5m
  sourceRef:
    kind: GitRepository
    name: ml-platform
  path: ./clusters/prod/model-deployments    # Which directory to apply
  prune: true            # Delete resources removed from Git (critical!)
  wait: true             # Wait for all applied resources to become ready (when true, healthChecks below is ignored)
  healthChecks:
    - apiVersion: apps/v1
      kind: Deployment
      name: fraud-detector
      namespace: ml-serving
  postBuild:
    substitute:
      CLUSTER_ENV: "prod"
      AWS_REGION: "us-east-1"
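
The `postBuild.substitute` map replaces `${VAR}` placeholders in the rendered manifests before they are applied. The sketch below illustrates that idea only. The function name `substitute` is hypothetical, and leaving unknown variables untouched is a choice made for this example, not a statement about how Flux treats undefined variables.

```python
import re

# Matches ${NAME} placeholders only; a bare $NAME is left alone in this sketch.
_VAR = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def substitute(manifest: str, variables: dict) -> str:
    """Replace ${NAME} with variables[NAME]; unknown names are kept as-is."""
    def replace(match: re.Match) -> str:
        name = match.group(1)
        return variables.get(name, match.group(0))
    return _VAR.sub(replace, manifest)


rendered = substitute(
    'env: ${CLUSTER_ENV}\nregion: ${AWS_REGION}\nhost: "${MLFLOW_DB_HOST}"',
    {"CLUSTER_ENV": "prod", "AWS_REGION": "us-east-1"},
)
print(rendered)
```

Here `MLFLOW_DB_HOST` stays unresolved because it is not in the map, which is a reminder that every placeholder used in a manifest needs a matching entry in `substitute` (or in a referenced ConfigMap or Secret).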

HelmRelease - Deploying ML Services via Helm

# clusters/prod/model-deployments/mlflow-release.yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: HelmRepository
metadata:
  name: mlflow
  namespace: flux-system
spec:
  interval: 1h
  url: https://community-charts.github.io/helm-charts

---
apiVersion: helm.toolkit.fluxcd.io/v2beta2
kind: HelmRelease
metadata:
  name: mlflow
  namespace: mlops
spec:
  interval: 10m
  chart:
    spec:
      chart: mlflow
      version: "0.7.x"   # Semver range - auto-updates patch versions within 0.7
      sourceRef:
        kind: HelmRepository
        name: mlflow
        namespace: flux-system
      interval: 1h        # Check for chart updates hourly
  values:
    replicaCount: 3
    backendStore:
      postgres:
        enabled: true
        host: "${MLFLOW_DB_HOST}"
        dbName: mlflow
    defaultArtifactRoot: "s3://eai-prod-model-artifacts/mlflow-artifacts"
    serviceAccount:
      annotations:
        eks.amazonaws.com/role-arn: "arn:aws:iam::123456789:role/eai-prod-mlflow-role"
  valuesFrom:
    # Override with environment-specific values stored in a ConfigMap
    - kind: ConfigMap
      name: mlflow-values-override
      valuesKey: values.yaml
  upgrade:
    remediation:
      retries: 3
      strategy: rollback   # Roll back if upgrade fails
  rollback:
    cleanupOnFail: true

ArgoCD - Application-Centric GitOps

ArgoCD is an alternative to Flux with a richer UI and more explicit application model. Many teams use ArgoCD for application workloads and Flux for infrastructure.

ArgoCD Application

# argocd/applications/fraud-detector.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: fraud-detector
  namespace: argocd
  finalizers:
    - resources-finalizer.argocd.argoproj.io  # Deletes resources when app is deleted
spec:
  project: ml-serving

  source:
    repoURL: https://github.com/myorg/ml-platform.git
    targetRevision: main
    path: model-deployments/fraud-detector

  destination:
    server: https://kubernetes.default.svc
    namespace: ml-serving

  syncPolicy:
    automated:
      prune: true         # Delete resources removed from Git
      selfHeal: true      # Revert manual changes to cluster
      allowEmpty: false   # Don't accidentally delete everything
    syncOptions:
      - Validate=true
      - CreateNamespace=true
      - PrunePropagationPolicy=foreground
      - PruneLast=true    # Prune after new resources are healthy
    retry:
      limit: 5
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 3m

  revisionHistoryLimit: 10  # Keep 10 previous versions for rollback

ApplicationSet - Deploy Across Multiple Clusters

# argocd/applicationsets/model-serving.yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: model-serving
  namespace: argocd
spec:
  generators:
    # Generate one Application per model directory
    - git:
        repoURL: https://github.com/myorg/ml-platform.git
        revision: main
        directories:
          - path: model-deployments/*

  template:
    metadata:
      name: "{{path.basename}}"
      namespace: argocd
    spec:
      project: ml-serving
      source:
        repoURL: https://github.com/myorg/ml-platform.git
        targetRevision: main
        path: "{{path}}"
      destination:
        server: https://kubernetes.default.svc
        namespace: ml-serving
      syncPolicy:
        automated:
          prune: true
          selfHeal: true

Sync Waves - Ordered Deployment

For ML platforms, some resources must exist before others: the feature store must be ready before the model server starts, the model server must be ready before routing is updated.

# model-deployments/fraud-detector/rollout.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: fraud-detector
  namespace: ml-serving
  annotations:
    argocd.argoproj.io/sync-wave: "2"   # Apply after lower waves (wave 0: secrets, configmaps)
spec:
  replicas: 4
  selector:
    matchLabels:
      app: fraud-detector
  template:
    metadata:
      labels:
        app: fraud-detector
    spec:
      containers:
        - name: model
          image: 123456789.dkr.ecr.us-east-1.amazonaws.com/fraud-detector:sha-abc123
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: "2"
              memory: "4Gi"
            limits:
              cpu: "4"
              memory: "8Gi"
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
  strategy:
    canary:
      steps:
        - setWeight: 10     # Send 10% of traffic to new version
        - pause: {duration: 5m}  # Wait 5 minutes
        - analysis:
            templates:
              - templateName: fraud-detector-latency
        - setWeight: 50
        - pause: {duration: 10m}
        - setWeight: 100
      canaryService: fraud-detector-canary
      stableService: fraud-detector-stable
      trafficRouting:
        istio:
          virtualService:
            name: fraud-detector-vs
            routes:
              - primary

---
# Wave 0: secrets and configmaps must be ready first
apiVersion: v1
kind: ConfigMap
metadata:
  name: fraud-detector-config
  namespace: ml-serving
  annotations:
    argocd.argoproj.io/sync-wave: "0"
data:
  MODEL_VERSION: "v2.1"
  FEATURE_STORE_ENDPOINT: "http://feature-store:8080"
  THRESHOLD: "0.85"
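
The ordering that sync waves impose can be sketched as a sort on the annotation value, with unannotated resources treated as wave 0. This is a simplification: ArgoCD also orders by sync phase and by resource kind, and it waits for each wave to become healthy before starting the next. The Service resource below is added purely for illustration.

```python
from itertools import groupby
from typing import Dict, List, Tuple

WAVE = "argocd.argoproj.io/sync-wave"


def wave_of(resource: Dict) -> int:
    """Read the sync-wave annotation; resources without one default to wave 0."""
    annotations = resource.get("metadata", {}).get("annotations", {})
    return int(annotations.get(WAVE, "0"))


def sync_order(resources: List[Dict]) -> List[Tuple[int, List[str]]]:
    """Group resources into waves, lowest wave first."""
    ordered = sorted(resources, key=lambda r: (wave_of(r), r["kind"], r["metadata"]["name"]))
    return [
        (wave, [f'{r["kind"]}/{r["metadata"]["name"]}' for r in group])
        for wave, group in groupby(ordered, key=wave_of)
    ]


resources = [
    {"kind": "Rollout", "metadata": {"name": "fraud-detector", "annotations": {WAVE: "2"}}},
    {"kind": "ConfigMap", "metadata": {"name": "fraud-detector-config", "annotations": {WAVE: "0"}}},
    {"kind": "Service", "metadata": {"name": "fraud-detector-stable"}},  # no annotation
]
for wave, names in sync_order(resources):
    print(wave, names)
```

The Rollout lands in the last group, so it is only applied once the ConfigMap it reads from exists.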

Image Update Automation - Automatic Model Promotion

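Flux's image automation watches your container registry for new image tags and commits the updated tag to your manifests: the Image Reflector controller scans the registry, and the Image Automation controller writes the change to Git, either directly on the tracked branch or on a separate branch from which a PR can be opened. Flux itself does not open the PR. This is the link between your training pipeline and your GitOps deployment.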

# clusters/prod/flux-system/image-automation.yaml

# Step 1: Tell Flux which registry to watch
apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImageRepository
metadata:
  name: fraud-detector
  namespace: flux-system
spec:
  image: 123456789.dkr.ecr.us-east-1.amazonaws.com/fraud-detector
  interval: 5m     # Scan registry every 5 minutes for new tags
  secretRef:
    name: ecr-credentials

---
# Step 2: Define which tags to consider (semver filtering)
apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImagePolicy
metadata:
  name: fraud-detector
  namespace: flux-system
spec:
  imageRepositoryRef:
    name: fraud-detector
  policy:
    semver:
      range: ">=1.0.0"   # Only promote release tags (not sha-* or dev-*)

---
# Step 3: Tell Flux which file to update when a new image is found
apiVersion: image.toolkit.fluxcd.io/v1beta1
kind: ImageUpdateAutomation
metadata:
  name: model-images
  namespace: flux-system
spec:
  interval: 5m
  sourceRef:
    kind: GitRepository
    name: ml-platform
  git:
    checkout:
      ref:
        branch: main
    commit:
      author:
        email: [email protected]
        name: Flux Image Automation
      messageTemplate: |
        chore: update {{range .Updated.Images}}{{.}}{{end}} to latest
    push:
      branch: staging  # Push to staging branch, not main - requires PR
  update:
    path: ./clusters/prod/model-deployments
    strategy: Setters    # Use marker comments in manifests

Marker comments tell Flux exactly which field to update:

# model-deployments/fraud-detector/rollout.yaml (with markers)
spec:
  template:
    spec:
      containers:
        - name: model
          image: 123456789.dkr.ecr.us-east-1.amazonaws.com/fraud-detector:v1.2.3  # {"$imagepolicy": "flux-system:fraud-detector"}
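
The two halves of image automation, choosing a tag and rewriting the marked line, can be sketched as follows. This is a simplified illustration under stated assumptions: only plain `major.minor.patch` tags (with an optional `v` prefix) count as releases, the registry name `registry.example` and the tag list are invented, and the helper names are hypothetical.

```python
import re
from typing import List, Optional, Tuple

SEMVER = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)$")


def select_tag(tags: List[str], minimum: Tuple[int, int, int] = (1, 0, 0)) -> Optional[str]:
    """Return the highest release tag at or above `minimum`; sha-* and dev-* never match."""
    releases = []
    for tag in tags:
        m = SEMVER.match(tag)
        if m:
            version = tuple(int(part) for part in m.groups())
            if version >= minimum:
                releases.append((version, tag))
    return max(releases)[1] if releases else None


def apply_setter(manifest: str, policy: str, repo: str, tag: str) -> str:
    """Rewrite only the image lines that carry the marker for `policy`."""
    marker = '# {"$imagepolicy": "%s"}' % policy
    out = []
    for line in manifest.splitlines():
        if "image:" in line and line.rstrip().endswith(marker):
            indent = line[: len(line) - len(line.lstrip())]
            line = f"{indent}image: {repo}:{tag}  {marker}"
        out.append(line)
    return "\n".join(out)


repo = "registry.example/fraud-detector"  # invented registry for the example
manifest = (
    "        - name: model\n"
    f'          image: {repo}:v1.2.3  # {{"$imagepolicy": "flux-system:fraud-detector"}}'
)
newest = select_tag(["sha-abc123", "dev-42", "v1.2.3", "v1.10.0", "0.9.0"])
print(newest)
print(apply_setter(manifest, "flux-system:fraud-detector", repo, newest))
```

Comparing versions as integer tuples is what makes `v1.10.0` rank above `v1.2.3`. A plain string comparison would get that wrong.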

Model Deployment via GitOps - Full Workflow

GitHub Actions - The CI Half

# .github/workflows/train-and-deploy.yml
name: Train and Deploy Model

on:
  push:
    paths:
      - 'models/fraud-detector/**'
    branches:
      - main

env:
  ECR_REGISTRY: 123456789.dkr.ecr.us-east-1.amazonaws.com
  IMAGE_NAME: fraud-detector
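  AWS_REGION: us-east-1

# Assuming the IAM role is assumed via GitHub OIDC (no static keys are configured here),
# the workflow needs id-token: write; the PR job needs contents and pull-requests write access
permissions:
  id-token: write
  contents: write
  pull-requests: write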

jobs:
  train-and-evaluate:
    runs-on: ubuntu-latest
    outputs:
      model_version: ${{ steps.version.outputs.version }}
      evaluation_passed: ${{ steps.evaluate.outputs.passed }}

    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789:role/github-actions-ml
          aws-region: ${{ env.AWS_REGION }}

      - name: Run training job
        run: |
          python models/fraud-detector/train.py \
            --experiment-name fraud-detector-${{ github.sha }} \
            --output-dir /tmp/model-artifacts

      - name: Evaluate model
        id: evaluate
        run: |
          python models/fraud-detector/evaluate.py \
            --model-dir /tmp/model-artifacts \
            --baseline-version latest \
            --min-f1-score 0.92 \
            --max-latency-ms 50
          echo "passed=true" >> $GITHUB_OUTPUT

      - name: Generate version
        id: version
        run: |
          VERSION="v$(date +%Y%m%d)-$(echo ${{ github.sha }} | cut -c1-8)"
          echo "version=$VERSION" >> $GITHUB_OUTPUT

  build-and-push:
    needs: train-and-evaluate
    if: needs.train-and-evaluate.outputs.evaluation_passed == 'true'
    runs-on: ubuntu-latest
    outputs:
      image_tag: ${{ steps.push.outputs.image_tag }}

    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789:role/github-actions-ml
          aws-region: ${{ env.AWS_REGION }}

      - name: Login to ECR
        uses: aws-actions/amazon-ecr-login@v2

      - name: Build and push serving container
        id: push
        run: |
          IMAGE_TAG="${{ needs.train-and-evaluate.outputs.model_version }}"
          FULL_IMAGE="${ECR_REGISTRY}/${IMAGE_NAME}:${IMAGE_TAG}"

          docker build \
            -f models/fraud-detector/Dockerfile.serve \
            -t "$FULL_IMAGE" \
            --label "git-sha=${{ github.sha }}" \
            --label "trained-at=$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
            .

          docker push "$FULL_IMAGE"
          echo "image_tag=$IMAGE_TAG" >> $GITHUB_OUTPUT

  open-deployment-pr:
    needs: [train-and-evaluate, build-and-push]
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4

      - name: Update image tag in manifests
        run: |
          IMAGE_TAG="${{ needs.build-and-push.outputs.image_tag }}"
          MANIFEST="clusters/prod/model-deployments/fraud-detector/rollout.yaml"

          # Update the image tag using yq
          yq e ".spec.template.spec.containers[0].image = \"${ECR_REGISTRY}/${IMAGE_NAME}:${IMAGE_TAG}\"" \
            -i "$MANIFEST"

      - name: Open PR with model update
        uses: peter-evans/create-pull-request@v6
        with:
          token: ${{ secrets.GITHUB_TOKEN }}
          commit-message: "deploy: update fraud-detector to ${{ needs.build-and-push.outputs.image_tag }}"
          branch: "deploy/fraud-detector-${{ needs.build-and-push.outputs.image_tag }}"
          title: "Deploy fraud-detector ${{ needs.build-and-push.outputs.image_tag }}"
          body: |
            ## Model Deployment

            **Model**: `fraud-detector`
            **Version**: `${{ needs.build-and-push.outputs.image_tag }}`
            **Training commit**: `${{ github.sha }}`

            ### Evaluation Results
            See training run in MLflow: [View Run](https://mlflow.prod.internal/experiments/fraud-detector-${{ github.sha }})

            ### Deployment Plan
            - [ ] 10% canary traffic for 5 minutes
            - [ ] Monitor latency and error rate
            - [ ] Full rollout on success, automatic rollback on failure

            **Review the evaluation metrics before approving.**
          labels: |
            model-deployment
            fraud-detector

Secrets Management in GitOps

Sealed Secrets - Encrypt Secrets to Commit in Git

# Install kubeseal CLI
brew install kubeseal

# Fetch the cluster's public key
kubeseal --fetch-cert \
  --controller-namespace=sealed-secrets \
  --controller-name=sealed-secrets \
  > pub-cert.pem

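# Create a regular Kubernetes secret, then seal it
# The namespace must match the SealedSecret's target namespace:
# the default sealing scope binds the ciphertext to both name and namespace
kubectl create secret generic mlflow-db-credentials \
  --namespace=mlops \
  --from-literal=password=supersecret123 \
  --from-literal=username=mlflow \
  --dry-run=client \
  -o yaml | \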
  kubeseal \
    --cert pub-cert.pem \
    --format yaml \
  > clusters/prod/secrets/mlflow-db-credentials-sealed.yaml

# The sealed secret is safe to commit to Git
# Only the cluster's private key can decrypt it

# clusters/prod/secrets/mlflow-db-credentials-sealed.yaml
# This file is safe to commit - encrypted with cluster public key
apiVersion: bitnami.com/v1alpha1
kind: SealedSecret
metadata:
  name: mlflow-db-credentials
  namespace: mlops
spec:
  encryptedData:
    password: AgBy3i4OJSWK+PiTySYZZA9rO43cGDEq...
    username: AgAKb8ySjB0kfGPAHDg3xOZIYqvpQQ...
  template:
    metadata:
      name: mlflow-db-credentials
      namespace: mlops
    type: Opaque

External Secrets Operator - Pull from AWS Secrets Manager

# clusters/prod/secrets/mlflow-external-secret.yaml
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: aws-secrets-manager
  namespace: mlops
spec:
  provider:
    aws:
      service: SecretsManager
      region: us-east-1
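      auth:
        jwt:
          # A namespaced SecretStore can only reference a service account in its own
          # namespace (mlops); use a ClusterSecretStore to reference one elsewhere
          serviceAccountRef:
            name: external-secrets-sa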

---
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: mlflow-db-credentials
  namespace: mlops
spec:
  refreshInterval: 1h    # Re-sync from Secrets Manager hourly
  secretStoreRef:
    name: aws-secrets-manager
    kind: SecretStore

  target:
    name: mlflow-db-credentials    # Name of the Kubernetes Secret to create
    creationPolicy: Owner

  data:
    - secretKey: password          # Key in the Kubernetes Secret
      remoteRef:
        key: eai-prod/mlflow/db    # Path in AWS Secrets Manager
        property: password         # JSON key within the secret

    - secretKey: username
      remoteRef:
        key: eai-prod/mlflow/db
        property: username

Drift Detection and Remediation

ArgoCD detects drift by comparing the live cluster state against the manifests rendered from Git, and can be configured to self-heal:

# argocd/applications/fraud-detector.yaml
spec:
  syncPolicy:
    automated:
      selfHeal: true      # Automatically revert manual changes
    syncOptions:
      - RespectIgnoreDifferences=true

  # Ignore differences that are OK to have (e.g., autoscaler changes replica count)
  ignoreDifferences:
    - group: apps
      kind: Deployment
      jsonPointers:
        - /spec/replicas     # Ignore autoscaler-managed replica count
    - group: argoproj.io
      kind: Rollout
      jsonPointers:
        - /spec/replicas

# Check for drift manually
argocd app diff fraud-detector

# Force resync if something is out of sync
argocd app sync fraud-detector

# Roll back to previous version (automated sync must be disabled first, or ArgoCD refuses the rollback)
argocd app rollback fraud-detector

# View resource history
argocd app history fraud-detector
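
The effect of `ignoreDifferences` can be sketched as "delete the ignored fields from both sides, then compare". The helper below handles JSON pointers through nested dictionaries only, with no list indices, and is an illustration of the idea rather than ArgoCD's diffing logic. The specs are trimmed-down stand-ins.

```python
import copy
from typing import Dict, Iterable


def remove_pointer(doc: Dict, pointer: str) -> None:
    """Delete the field addressed by a JSON pointer, if present (dict paths only)."""
    parts = [p.replace("~1", "/").replace("~0", "~") for p in pointer.lstrip("/").split("/")]
    node = doc
    for part in parts[:-1]:
        if not isinstance(node, dict) or part not in node:
            return
        node = node[part]
    if isinstance(node, dict):
        node.pop(parts[-1], None)


def is_out_of_sync(desired: Dict, live: Dict, ignore_pointers: Iterable[str] = ()) -> bool:
    """Compare desired and live state after dropping the ignored fields from both."""
    d, l = copy.deepcopy(desired), copy.deepcopy(live)
    for pointer in ignore_pointers:
        remove_pointer(d, pointer)
        remove_pointer(l, pointer)
    return d != l


desired = {"spec": {"replicas": 4, "template": {"image": "fraud:v2.0"}}}
scaled = {"spec": {"replicas": 7, "template": {"image": "fraud:v2.0"}}}    # autoscaler changed replicas
tampered = {"spec": {"replicas": 7, "template": {"image": "fraud:v2.1"}}}  # manual image change

print(is_out_of_sync(desired, scaled))                       # True
print(is_out_of_sync(desired, scaled, ["/spec/replicas"]))   # False
print(is_out_of_sync(desired, tampered, ["/spec/replicas"])) # True
```

Ignoring `/spec/replicas` silences the autoscaler's expected changes while a hand-edited image tag still shows up as drift.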

Argo Rollouts - Canary Analysis

# model-deployments/fraud-detector/analysis-template.yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: fraud-detector-latency
  namespace: ml-serving
spec:
  metrics:
    - name: p99-latency
      interval: 1m
      successCondition: result[0] < 50  # Less than 50ms P99 (the query returns a vector, so index the first sample)
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket{
                service="fraud-detector",
                version="canary"
              }[5m])) by (le)
            ) * 1000

    - name: error-rate
      interval: 1m
      successCondition: result[0] < 0.01   # Less than 1% error rate (first sample of the returned vector)
      failureLimit: 2
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{
              service="fraud-detector",
              version="canary",
              status=~"5.."
            }[5m])) /
            sum(rate(http_requests_total{
              service="fraud-detector",
              version="canary"
            }[5m]))

    - name: business-metric
      interval: 5m
      successCondition: result > 0.92   # F1 score above threshold
      failureLimit: 1
      provider:
        web:
          url: "http://ml-metrics-api/models/fraud-detector/live-f1"
          jsonPath: "{$.f1_score}"

Production Engineering Notes

Repository structure matters: Separate your application code repo from your GitOps configuration repo. App repo: model code, training scripts, Dockerfile. Config repo: Kubernetes manifests, Helm values, Flux/ArgoCD resources. This separation means infrastructure changes can be reviewed independently from code changes, and the config repo becomes the pure source of truth for what runs where.

Namespace per environment: In a single cluster, use namespaces to separate environments (ml-dev, ml-staging, ml-prod). Apply Kubernetes NetworkPolicies to prevent cross-namespace traffic. Restrict who can change each environment's manifests in Git, and use ArgoCD projects to limit which repositories and destination namespaces each team's Applications may target.

Gradual rollout vs instant switch: Never set canary weight from 0 to 100 in one step. For ML models, use 10% → 30% → 70% → 100% with analysis gates at each step. If a metric fails the analysis at 10%, you abort before any real damage - only 10% of users saw the bad model.

Flux vs ArgoCD: Flux is more composable and Kubernetes-native (everything is a CRD). ArgoCD has a better UI and more mature RBAC/multi-tenancy model. Many production teams run both: Flux for infrastructure and add-ons, ArgoCD for application deployments. They coexist fine on the same cluster.

Common Mistakes

:::danger Never Give CI Pipelines Direct kubectl Access to Production
If your CI/CD pipeline runs kubectl apply or helm upgrade directly against production, you have bypassed the entire GitOps model. CI should only push to Git. The GitOps agent applies to the cluster. No exceptions.
:::

:::danger prune: true Is Dangerous Without Understanding
When prune: true is set (necessary for proper GitOps), deleting a file from Git deletes the corresponding Kubernetes resource. A mis-typed git rm, an accidental path change, or a repository reorganization can delete production deployments. Always test prune behavior in staging first, and ensure you have a recent etcd backup.
:::

:::warning Sync Waves Create Ordering Complexity
Sync waves solve the "resource A must exist before resource B" problem, but they add complexity. If a wave 0 resource fails to become ready, ArgoCD stops and all higher-wave resources never deploy. This means a broken ConfigMap blocks the model deployment. Test your wave ordering carefully and set generous health check timeouts.
:::

:::warning Image Update Automation Bypasses PR Review
Flux's image update automation can be configured to commit directly to main rather than to a separate branch that is promoted through a PR. This is convenient but removes human review from model deployments. For production ML systems, always configure automation to push to a staging branch and require a PR for promotion to main.
:::

Interview Q&A

Q: What are the four core properties of GitOps and how do they apply to ML deployments?

The four GitOps properties from the OpenGitOps specification: (1) Declarative - system state described as desired state, not imperative commands. For ML: Kubernetes manifests describing model deployments, not kubectl set image commands. (2) Versioned and immutable - Git history is the audit trail. For ML: every model version deployed to production has a corresponding Git commit and PR. Rollback means git revert. (3) Pulled automatically - a cluster-side agent (Flux/ArgoCD) pulls state from Git, not pipelines pushing in. For ML: this means the cluster continuously converges to the Git state, even if someone makes a manual change. (4) Continuously reconciled - drift is detected and corrected. For ML: if someone manually changes the model serving container tag, ArgoCD reverts it within minutes.

Q: What is the difference between Flux and ArgoCD, and when would you use each?

Both implement GitOps but with different designs. Flux is purely Kubernetes-native - every concept is a CRD, configuration is just Kubernetes manifests, and there is no separate UI server required. It is highly composable and easier to automate with GitOps itself (bootstrapping Flux is a single command that commits manifests to Git). ArgoCD has a richer UI with a graphical application tree, more mature RBAC/multi-tenancy features (projects, RBAC policies, SSO integration), and better support for managing multiple clusters from a single control plane. For ML platform teams, a common pattern is Flux for cluster infrastructure (add-ons, operators, namespaces) and ArgoCD for model serving applications - leveraging each tool's strengths.

Q: How do you handle secrets in a GitOps model without committing plaintext credentials?

Two main approaches. Sealed Secrets: encrypt Kubernetes secrets with the cluster's public key using kubeseal, commit the encrypted SealedSecret manifest to Git. Only the controller on the specific cluster (which holds the private key) can decrypt it. Simple, self-contained, no external dependencies. External Secrets Operator: define ExternalSecret resources in Git that reference secrets stored in AWS Secrets Manager, HashiCorp Vault, or GCP Secret Manager. The operator fetches and syncs secrets at runtime. More operationally complex but gives you centralized secrets management, audit logs in the secrets backend, and secret rotation without Git commits. For ML teams on AWS, External Secrets + AWS Secrets Manager is typically the better choice because it integrates with IAM audit logging and allows secret rotation without a Git change.

Q: Walk me through what happens when a model deployment PR is merged in a GitOps system.

(1) PR merges - Git commit updates the image tag in the Kubernetes manifest. (2) Flux's GitRepository source controller detects the new commit within 1 minute (configurable interval). (3) Flux's Kustomization controller computes the diff between current cluster state and the desired state in Git. (4) The diff is applied - in this case, an Argo Rollout object is updated with the new image tag. (5) Argo Rollouts controller detects the Rollout spec change and begins the canary strategy: it creates a new ReplicaSet with the new image, routes 10% of traffic to it. (6) After 5 minutes, the AnalysisRun checks Prometheus metrics for the canary pods. (7) If latency and error rate are within bounds, traffic weight increases to 50%, then 100%. (8) If an analysis fails, Argo Rollouts automatically aborts and rolls back to the stable version. Total automation - no manual steps after the PR merge.

Q: What is drift in a GitOps context and how do you detect and remediate it?

Drift is when the actual state of the cluster diverges from the desired state declared in Git. Sources: someone runs kubectl edit, a Helm release is upgraded manually, a Horizontal Pod Autoscaler changes replica counts. ArgoCD detects drift during its reconciliation loop (every few minutes by default) by comparing the live cluster state against the rendered manifests from Git. It shows drift as OutOfSync status in the UI. Remediation options: (1) selfHeal: true - ArgoCD automatically re-applies the Git state, reverting manual changes; (2) manual sync - an engineer reviews the drift and decides whether to re-apply the Git state to the cluster or to commit the live change back to Git; (3) ignoreDifferences - for expected drift (autoscaler-managed replica counts), configure ArgoCD to ignore specific fields. For ML models, selfHeal on model serving deployments is essential - you cannot allow someone's kubectl set image to override the reviewed, tested version in Git.

Quick Reference - GitOps CLI Commands

# Flux - force reconciliation immediately (don't wait for interval)
flux reconcile source git ml-platform
flux reconcile kustomization model-deployments

# Check reconciliation status
flux get kustomizations
flux get helmreleases -A

# Suspend reconciliation (e.g., for maintenance window)
flux suspend kustomization model-deployments
flux resume kustomization model-deployments

# ArgoCD - sync an app immediately
argocd app sync fraud-detector

# Check app health and sync status
argocd app list
argocd app get fraud-detector

# Roll back to a previous version (automated sync must be disabled first, see the pause commands below)
argocd app rollback fraud-detector 3   # rollback to history item 3

# Pause and resume automated sync (e.g., during incident)
argocd app set fraud-detector --sync-policy none
argocd app set fraud-detector --sync-policy automated

# Image update automation - check what images Flux is watching
flux get image repository -A
flux get image policy -A

# Sealed secrets - re-seal after key rotation
kubeseal --fetch-cert --controller-namespace sealed-secrets > new-cert.pem
kubeseal --cert new-cert.pem -o yaml < original-secret.yaml > new-sealed.yaml
