
Container Registry and CI

The Security Incident That Changed Everything

The security team at a healthcare ML startup sent a Slack message at 10 AM on a Tuesday that stopped all engineering work: "We have a critical CVE in production. Container image ml-inference:prod-2024-01-15 has OpenSSL CVE-2024-0727, CVSS 7.5. This is a production health data system. We need this patched within 4 hours per our compliance requirements."

The incident investigation revealed a process problem. The team had a container registry (AWS ECR), a CI pipeline that built images, and a deployment process. What they did not have: any security scanning in the pipeline. Images were built, pushed, and deployed without ever being checked for CVEs. The vulnerable OpenSSL version had been in the python:3.11-slim base image since before the team started using it - 11 weeks earlier. No one had noticed because no one was looking.

The 4-hour patch timeline was aggressive. The team spent 45 minutes finding and running Trivy for the first time, 30 minutes updating the base image tag, and 2 hours navigating the manual deployment process. They made it with 45 minutes to spare. The next day, they automated the entire security scanning workflow so this could never happen again without immediate detection.

This lesson builds the container registry workflow they should have had from the start.


Why This Exists

Container registries are the artifact stores of the container world - analogous to PyPI for Python packages or Maven Central for Java. They store image layers (content-addressed blobs) and metadata (manifests that describe which layers make up each image). The registry serves images to Docker daemons, Kubernetes nodes, and CI runners.
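To make "content-addressed" concrete, here is a minimal sketch of how a registry could key blobs by their SHA-256 digest and describe an image with a manifest. The layer contents, the simplified manifest fields, and the tag name are illustrative assumptions, not the full OCI format.

```python
import hashlib
import json


def digest(blob: bytes) -> str:
    """Content address of a blob: the SHA-256 of its bytes."""
    return "sha256:" + hashlib.sha256(blob).hexdigest()


# Two hypothetical layers (real layers are compressed tarballs).
layers = [b"base OS files", b"python packages + model code"]

# Blob store keyed by digest: identical content is stored only once.
blob_store = {digest(layer): layer for layer in layers}

# A simplified manifest lists the layers that make up one image.
manifest = {
    "schemaVersion": 2,
    "layers": [{"digest": digest(layer), "size": len(layer)} for layer in layers],
}
manifest_bytes = json.dumps(manifest, sort_keys=True).encode()
image_digest = digest(manifest_bytes)

# A tag is just a name that points at a manifest digest.
tags = {"git-abc1234": image_digest}
```

Because the image digest is derived from the manifest, which in turn lists layer digests, changing any byte in any layer yields a different image digest.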

For ML systems, container registries have additional requirements beyond standard software: ML images change on two independent cycles (code changes and model updates), images can be very large (gigabytes), multi-architecture support is necessary for teams with Apple Silicon development machines and x86-64 production, and security scanning must be automated because ML practitioners often do not think about CVE remediation.

Registry Choices for ML Teams

ECR for AWS shops: Use ECR if you deploy on EKS or ECS. Pulls within the same region incur no data transfer charges. IAM roles handle authentication, so there are no long-lived credentials to manage. ECR basic scanning has historically been based on the open-source Clair project; enhanced scanning is provided by Amazon Inspector. ECR Lifecycle Policies automate image cleanup.

GHCR for GitHub-native teams: GitHub Container Registry is the path of least resistance for teams whose code is on GitHub. The GITHUB_TOKEN in GitHub Actions has automatic read/write access to GHCR - no additional credentials configuration needed. Free for public repositories.

Image Tagging Strategy for ML

ML images require a more sophisticated tagging strategy than typical software images because they have two independent version axes: code version and model version.

```bash
# Tagging patterns for ML images

# Pattern 1: Git SHA (always unique, immutable)
# Use for: pinning exact versions in deployments, debugging
docker tag ml-inference:latest ml-inference:git-abc1234

# Pattern 2: Semantic version (for explicit release tracking)
docker tag ml-inference:latest ml-inference:v2.3.1

# Pattern 3: Model version + code version (ML-specific)
# Encodes both what code runs AND which model it serves
docker tag ml-inference:latest ml-inference:model-v47-code-abc1234

# Pattern 4: Environment tags (mutable - updated on promotion)
docker tag ml-inference:latest ml-inference:staging
docker tag ml-inference:latest ml-inference:production

# Pattern 5: Date-based (for scheduled builds, easy human reading)
docker tag ml-inference:latest ml-inference:2024-03-15

# Best practice: use git SHA as the immutable canonical tag
# Use environment tags as mutable pointers to the current version in each env
```
```python
# scripts/image_tagger.py - generate consistent image tags in CI
import os
import subprocess
from datetime import date
from typing import Optional


def get_image_tags(
    registry: str,
    repository: str,
    model_version: Optional[str] = None,
) -> list[str]:
    """
    Generate the set of tags to apply to an image build.
    Returns list of full image references.
    """
    # Short commit SHA of the checked-out revision
    git_sha = subprocess.check_output(
        ["git", "rev-parse", "--short", "HEAD"]
    ).decode().strip()

    # Slashes are not valid in tags, so normalize branch names like feature/x
    branch = os.environ.get("GITHUB_REF_NAME", "unknown").replace("/", "-")
    today = date.today().strftime("%Y%m%d")

    base = f"{registry}/{repository}"
    tags = [
        f"{base}:git-{git_sha}",      # Immutable - always added
        f"{base}:{today}-{git_sha}",  # Date + sha for human readability
    ]

    if model_version:
        tags.append(f"{base}:model-v{model_version}-code-{git_sha}")

    # Only builds from main move the mutable "latest" pointer
    if branch == "main":
        tags.append(f"{base}:latest")

    return tags
```
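Going the other way, from a tag back to its code and model versions, is useful when debugging what a deployment is running. The following is a minimal sketch that assumes the tag patterns shown above; `ParsedTag`, `parse_tag`, and the regular expressions are hypothetical helpers, not part of any library.

```python
import re
from typing import NamedTuple, Optional


class ParsedTag(NamedTuple):
    kind: str                      # "model+code", "git", "environment", or "other"
    model_version: Optional[str]
    git_sha: Optional[str]


# Assumed conventions: git-<sha> and model-v<version>-code-<sha>
_MODEL_CODE = re.compile(r"^model-v(?P<model>[0-9A-Za-z.]+)-code-(?P<sha>[0-9a-f]{7,40})$")
_GIT = re.compile(r"^git-(?P<sha>[0-9a-f]{7,40})$")
ENV_TAGS = {"staging", "production", "latest"}


def parse_tag(tag: str) -> ParsedTag:
    """Classify a tag and extract the versions it encodes, if any."""
    match = _MODEL_CODE.match(tag)
    if match:
        return ParsedTag("model+code", match["model"], match["sha"])
    match = _GIT.match(tag)
    if match:
        return ParsedTag("git", None, match["sha"])
    if tag in ENV_TAGS:
        return ParsedTag("environment", None, None)
    return ParsedTag("other", None, None)
```

For example, `parse_tag("model-v47-code-abc1234")` yields the model version `"47"` and the commit `"abc1234"`, while `parse_tag("staging")` is classified as a mutable environment tag.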

The Complete CI/CD Pipeline for ML Container Images

```yaml
# .github/workflows/container-ci.yml
name: ML Container CI

on:
  push:
    branches: [main, develop]
    paths:
      - 'src/**'
      - 'Dockerfile*'
      - 'requirements*.txt'
  pull_request:
    branches: [main]

env:
  REGISTRY: ghcr.io
  # GHCR image names must be lowercase; this assumes an all-lowercase owner/repo
  IMAGE_NAME: ${{ github.repository }}/ml-inference

jobs:
  # ─────────────────────────────────────────────────────────────
  # Build multi-architecture image
  # ─────────────────────────────────────────────────────────────
  build-and-scan:
    name: Build, Scan, and Push
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write          # Required for GHCR push
      security-events: write   # Required for SARIF upload

    outputs:
      image-digest: ${{ steps.build.outputs.digest }}
      image-tags: ${{ steps.meta.outputs.tags }}

    steps:
      - name: Checkout
        uses: actions/checkout@v4

      # QEMU lets the amd64 runner execute arm64 build steps
      - name: Set up QEMU
        uses: docker/setup-qemu-action@v3

      # Set up Docker Buildx for multi-architecture builds
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3
        with:
          platforms: linux/amd64,linux/arm64

      # Log in to GHCR using GITHUB_TOKEN (no additional secrets needed)
      - name: Log in to GHCR
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      # Generate image tags and labels
      - name: Extract metadata
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
          # format=long so the tag matches git-${{ github.sha }} used in later steps
          tags: |
            type=sha,prefix=git-,format=long
            type=raw,value=latest,enable={{is_default_branch}}
            type=ref,event=branch
            type=semver,pattern={{version}}

      # Build and push (multi-platform)
      - name: Build and push
        id: build
        uses: docker/build-push-action@v5
        with:
          context: .
          file: Dockerfile.inference
          platforms: linux/amd64,linux/arm64
          push: ${{ github.event_name != 'pull_request' }}
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}
          cache-from: type=registry,ref=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:cache
          cache-to: type=registry,ref=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:cache,mode=max
          # Pass build args for image metadata
          build-args: |
            BUILD_DATE=${{ github.event.head_commit.timestamp }}
            GIT_SHA=${{ github.sha }}
            VERSION=${{ steps.meta.outputs.version }}

      # ─────────────────────────────────────────────────────────
      # Security scanning with Trivy
      # The scan steps pull the pushed image, so they are skipped on
      # pull requests, where nothing is pushed.
      # In real pipelines, pin trivy-action to a release tag or commit SHA
      # rather than master.
      # ─────────────────────────────────────────────────────────
      - name: Run Trivy vulnerability scan
        id: trivy
        if: github.event_name != 'pull_request'
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:git-${{ github.sha }}
          format: sarif
          output: trivy-results.sarif
          severity: CRITICAL,HIGH
          ignore-unfixed: true   # Ignore CVEs with no fix available
          exit-code: 0           # Don't fail here - we upload results and decide below

      # Upload Trivy results to GitHub Security tab
      - name: Upload Trivy scan results to GitHub Security
        uses: github/codeql-action/upload-sarif@v3
        if: always() && github.event_name != 'pull_request'
        with:
          sarif_file: trivy-results.sarif

      # Fail if CRITICAL vulnerabilities with fixes are found
      - name: Fail on critical CVEs
        if: github.event_name != 'pull_request'
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:git-${{ github.sha }}
          format: table
          severity: CRITICAL
          ignore-unfixed: true
          exit-code: 1           # Fail CI on CRITICAL

      # Generate SBOM for compliance
      - name: Generate SBOM
        if: github.event_name != 'pull_request'
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:git-${{ github.sha }}
          format: spdx-json
          output: sbom.spdx.json

      - name: Upload SBOM
        if: github.event_name != 'pull_request'
        uses: actions/upload-artifact@v4
        with:
          name: sbom-${{ github.sha }}
          path: sbom.spdx.json
          retention-days: 90     # Keep SBOM for compliance audit window

  # ─────────────────────────────────────────────────────────────
  # Image promotion across environments
  # ─────────────────────────────────────────────────────────────
  promote-to-staging:
    name: Promote to Staging
    needs: [build-and-scan]
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    environment: staging
    permissions:
      contents: read
      packages: write          # Required to add the staging tag

    steps:
      # Needed so tests/smoke/ exists on the runner
      - name: Checkout
        uses: actions/checkout@v4

      - name: Log in to GHCR
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      # Promotion by retagging - no rebuild
      - name: Tag image as staging
        run: |
          docker buildx imagetools create \
            --tag ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:staging \
            ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:git-${{ github.sha }}

      # Assumes the runner already has cluster credentials (kubeconfig) configured
      - name: Deploy to staging Kubernetes
        run: |
          # Update Kubernetes deployment image reference
          kubectl set image deployment/ml-inference \
            ml-inference=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:git-${{ github.sha }} \
            --namespace=ml-staging
          kubectl rollout status deployment/ml-inference --namespace=ml-staging --timeout=5m

      - name: Run smoke tests against staging
        run: |
          # --base-url and --timeout come from the pytest-base-url and pytest-timeout plugins
          pip install httpx pytest pytest-base-url pytest-timeout
          pytest tests/smoke/ --base-url=${{ vars.STAGING_URL }} --timeout=30

  promote-to-production:
    name: Promote to Production
    needs: [promote-to-staging]
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    environment: production   # Requires manual approval
    permissions:
      packages: write          # Required to add the production tag

    steps:
      - name: Log in to GHCR
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Tag image as production
        run: |
          docker buildx imagetools create \
            --tag ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:production \
            ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:git-${{ github.sha }}

      # Assumes the runner already has cluster credentials (kubeconfig) configured
      - name: Deploy to production Kubernetes
        run: |
          kubectl set image deployment/ml-inference \
            ml-inference=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:git-${{ github.sha }} \
            --namespace=ml-production
          kubectl rollout status deployment/ml-inference --namespace=ml-production --timeout=10m

      - name: Notify team
        run: |
          curl -X POST ${{ secrets.SLACK_WEBHOOK }} \
            -H 'Content-type: application/json' \
            -d "{\"text\": \"ml-inference image promoted to production: git-${{ github.sha }}\"}"
```

ECR-Specific Configuration

```python
# scripts/ecr_helpers.py - ECR authentication and lifecycle management
import base64
import json
import subprocess

import boto3


def authenticate_ecr(region: str = "us-east-1") -> str:
    """Authenticate Docker to ECR. Returns the ECR registry endpoint URL."""
    ecr = boto3.client("ecr", region_name=region)

    # Get ECR authorization token
    token = ecr.get_authorization_token()
    auth_data = token["authorizationData"][0]

    # The token is base64("username:password")
    credentials = base64.b64decode(auth_data["authorizationToken"]).decode()
    username, password = credentials.split(":", 1)
    registry = auth_data["proxyEndpoint"]

    # Authenticate Docker; pass the password on stdin to keep it out of argv
    subprocess.run([
        "docker", "login",
        "--username", username,
        "--password-stdin",
        registry,
    ], input=password.encode(), check=True)

    return registry


def create_ecr_repository(
    repo_name: str,
    region: str = "us-east-1",
    enable_scan_on_push: bool = True,
    max_images: int = 10,
) -> str:
    """Create ECR repository with scanning and lifecycle policy."""
    ecr = boto3.client("ecr", region_name=region)

    try:
        response = ecr.create_repository(
            repositoryName=repo_name,
            imageScanningConfiguration={"scanOnPush": enable_scan_on_push},
            imageTagMutability="MUTABLE",  # Allow retagging for promotion
        )
        repo_uri = response["repository"]["repositoryUri"]
    except ecr.exceptions.RepositoryAlreadyExistsException:
        response = ecr.describe_repositories(repositoryNames=[repo_name])
        repo_uri = response["repositories"][0]["repositoryUri"]

    # Lifecycle policy: expire untagged images after 1 day, then keep only the
    # most recent `max_images` images. The count rule ignores image age and
    # deployment status, so size it to cover everything you may still run or
    # roll back to.
    lifecycle_policy = {
        "rules": [
            {
                "rulePriority": 1,
                "description": "Remove untagged images after 1 day",
                "selection": {
                    "tagStatus": "untagged",
                    "countType": "sinceImagePushed",
                    "countUnit": "days",
                    "countNumber": 1,
                },
                "action": {"type": "expire"},
            },
            {
                "rulePriority": 2,
                "description": f"Keep only the most recent {max_images} images",
                "selection": {
                    "tagStatus": "any",
                    "countType": "imageCountMoreThan",
                    "countNumber": max_images,
                },
                "action": {"type": "expire"},
            },
        ]
    }

    ecr.put_lifecycle_policy(
        repositoryName=repo_name,
        lifecyclePolicyText=json.dumps(lifecycle_policy),
    )

    return repo_uri
```

Production Notes

Image immutability: Never overwrite an immutable tag (SHA-based). Only mutable tags (staging, production, latest) should be updated. This ensures you can always trace exactly what is running in production by looking at the SHA tag.
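A CI script can enforce this rule before pushing. The sketch below is one possible guard, assuming you can obtain the registry's current tag-to-digest mapping; `may_push` and `MUTABLE_TAGS` are hypothetical names chosen for illustration.

```python
# Tags that are allowed to move between images (assumed convention)
MUTABLE_TAGS = {"staging", "production", "latest"}


def may_push(tag: str, new_digest: str, registry_tags: dict[str, str]) -> bool:
    """Return True if pointing `tag` at `new_digest` respects the immutability rule."""
    current = registry_tags.get(tag)
    if current is None or current == new_digest:
        return True  # brand-new tag, or an idempotent re-push of the same image
    # The tag exists and would move: only mutable pointers may do that
    return tag in MUTABLE_TAGS


existing = {"git-abc1234": "sha256:1111", "production": "sha256:1111"}
print(may_push("git-abc1234", "sha256:2222", existing))  # False: would overwrite a SHA tag
print(may_push("production", "sha256:2222", existing))   # True: environment tags may move
```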

Registry caching in CI: Use --cache-from and --cache-to with type=registry in docker buildx build to store build cache in the registry itself. This works across CI runners (unlike local BuildKit cache, which is runner-specific). Significantly speeds up CI builds when most layers have not changed.
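As a rough sketch, a build script could assemble the `docker buildx build` invocation like this. The helper name `buildx_command` and the example references are assumptions; the `--cache-from` and `--cache-to` values follow the `type=registry` form described above.

```python
def buildx_command(image: str, cache_ref: str, push: bool = True) -> list[str]:
    """Build the argv for a buildx build that reads and writes a registry cache."""
    cmd = [
        "docker", "buildx", "build",
        "--tag", image,
        # Reuse layers cached by earlier runs, possibly on other runners
        "--cache-from", f"type=registry,ref={cache_ref}",
        # mode=max also caches intermediate layers, not only the final ones
        "--cache-to", f"type=registry,ref={cache_ref},mode=max",
    ]
    if push:
        cmd.append("--push")
    cmd.append(".")  # build context
    return cmd


command = buildx_command(
    image="ghcr.io/myorg/ml-inference:git-abc1234",
    cache_ref="ghcr.io/myorg/ml-inference:cache",
)
print(" ".join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` on a machine with Docker and Buildx installed.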

Cleanup: Registries accumulate images quickly, especially with high commit velocity. Set up lifecycle policies (ECR) or a scheduled cleanup job against the registry API (GHCR; for example, removing untagged versions older than 90 days). Never delete tagged images that are currently deployed. Automate cleanup so that it removes only untagged images or images that carry only temporary or branch tags.

:::tip Sign Images for Production For regulated industries or security-conscious environments, sign container images with cosign (Sigstore). Signing proves that an image was built by your CI pipeline and has not been tampered with. Kubernetes admission controllers can verify signatures before allowing images to run.

```bash
# Sign an image after pushing
cosign sign --key cosign.key ghcr.io/myorg/ml-inference:git-abc1234

# Verify a signature
cosign verify --key cosign.pub ghcr.io/myorg/ml-inference:git-abc1234
```

:::

:::warning Pull Rate Limits on Docker Hub Docker Hub imposes pull rate limits on unauthenticated requests and free accounts. The long-standing figures were 100 pulls per 6 hours for unauthenticated requests and 200 for authenticated free accounts, with paid plans exempt; Docker has revised these limits over time, so check the current documentation. In CI/CD pipelines that run frequently, you will hit these limits for base images pulled from Docker Hub. Mirror the base images you use to your private registry (ECR, GHCR) to avoid rate limiting. :::
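Once base images are mirrored, builds need to reference the mirror instead of Docker Hub. The following sketch rewrites a Docker Hub reference to a mirror location. The mirror layout (same repository path under a mirror prefix) and the name `mirror_reference` are assumptions; the `library/` prefix reflects how Docker Hub names official images.

```python
def mirror_reference(image: str, mirror: str) -> str:
    """Rewrite a Docker Hub image reference to point at a private mirror."""
    first, _, rest = image.partition("/")
    # A first path component with "." or ":" (or "localhost") is a registry host
    has_registry = bool(rest) and ("." in first or ":" in first or first == "localhost")
    if has_registry:
        if first != "docker.io":
            return image  # already hosted elsewhere, leave untouched
        image = rest
    if "/" not in image:
        image = f"library/{image}"  # official images live under library/
    return f"{mirror}/{image}"


MIRROR = "ghcr.io/myorg/mirror"  # hypothetical mirror location
print(mirror_reference("python:3.11-slim", MIRROR))
print(mirror_reference("ghcr.io/other/tool:1.0", MIRROR))
```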

:::danger Never Push Credentials to a Registry Container images are often pulled by many systems - CI runners, production nodes, developer machines. A credential accidentally included in an image layer (in ENV, in a file copied via COPY, or in a RUN command output) is exposed to everyone who can pull the image. Use .dockerignore to exclude credential files. Audit image layers with docker history or dive before pushing to any registry. :::
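A cheap pre-push check is to scan the Dockerfile for instructions that are likely to bake a secret into a layer. This is a heuristic sketch only: the patterns and the function name `find_risky_lines` are illustrative assumptions, and it does not replace inspecting the built layers.

```python
import re

# Variable names and file names that often indicate credentials (heuristic)
SECRET_NAME = re.compile(r"(PASSWORD|SECRET|TOKEN|API_KEY|ACCESS_KEY)", re.IGNORECASE)
SECRET_FILE = re.compile(r"(\.env\b|\.pem\b|id_rsa|credentials)", re.IGNORECASE)


def find_risky_lines(dockerfile_text: str) -> list[tuple[int, str]]:
    """Return (line number, line) pairs that may embed credentials in the image."""
    findings = []
    for number, line in enumerate(dockerfile_text.splitlines(), start=1):
        stripped = line.strip()
        instruction = stripped.split(" ", 1)[0].upper() if stripped else ""
        if instruction in {"ENV", "ARG"} and SECRET_NAME.search(stripped):
            findings.append((number, stripped))
        elif instruction in {"COPY", "ADD"} and SECRET_FILE.search(stripped):
            findings.append((number, stripped))
    return findings


sample = (
    "FROM python:3.11-slim\n"
    "ENV AWS_SECRET_ACCESS_KEY=abc\n"
    "COPY .env /app/.env\n"
    "COPY src/ /app/src/\n"
)
for number, line in find_risky_lines(sample):
    print(f"line {number}: {line}")
```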

Interview Q&A

Q: What is image promotion in CI/CD and how does it work for ML containers?

Image promotion is the practice of moving a single, immutable image artifact through environments (dev → staging → production) rather than rebuilding for each environment. In container terms: build once with a SHA-based tag, run tests in staging, if tests pass retag the image with the environment name (e.g., production) rather than rebuilding. This ensures that exactly the same bytes that passed in staging are what runs in production. Promotion is a retag operation, not a rebuild.
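The retag idea can be modeled with a toy in-memory registry, where a tag is a pointer to a digest. The `Registry` class below is a hypothetical stand-in for illustration, not a real client.

```python
class Registry:
    """Toy registry: maps tag names to image digests."""

    def __init__(self) -> None:
        self.tags: dict[str, str] = {}

    def push(self, tag: str, digest: str) -> None:
        self.tags[tag] = digest

    def retag(self, source_tag: str, target_tag: str) -> str:
        """Point target_tag at the image source_tag refers to. No rebuild happens."""
        digest = self.tags[source_tag]
        self.tags[target_tag] = digest
        return digest


registry = Registry()
registry.push("git-abc1234", "sha256:1111")   # built once in CI
registry.retag("git-abc1234", "staging")      # promoted after the build
registry.retag("git-abc1234", "production")   # promoted after staging tests pass

# All three names resolve to the same bytes
same_image = registry.tags["staging"] == registry.tags["production"] == registry.tags["git-abc1234"]
print(same_image)
```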

Q: What is Trivy and how do you integrate it into a CI/CD pipeline?

Trivy (by Aqua Security) is an open-source vulnerability scanner for container images, file systems, and code repositories. In CI/CD, integrate it after the image build step: run trivy image <image-ref> with --exit-code 1 --severity CRITICAL,HIGH to fail the pipeline if critical or high-severity vulnerabilities are found. Upload the SARIF-format results to the GitHub Security tab for visibility. Run Trivy with --ignore-unfixed to avoid failing on CVEs that have no fix available yet. Set up periodic rescans of production images (not just on build) to catch new CVEs in already-deployed images.
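If you prefer to make the pass/fail decision in your own script, you can have Trivy emit JSON (`--format json`) and count findings yourself. This sketch assumes the report layout Trivy uses for image scans (`Results`, `Vulnerabilities`, `Severity`, `FixedVersion`); the sample report and its CVE IDs are made up.

```python
from collections import Counter


def count_fixable(report: dict, severities: set[str]) -> Counter:
    """Count vulnerabilities of the given severities that have a fixed version."""
    counts: Counter = Counter()
    for result in report.get("Results", []):
        # "Vulnerabilities" may be missing or null when a target is clean
        for vuln in result.get("Vulnerabilities") or []:
            if vuln.get("Severity") in severities and vuln.get("FixedVersion"):
                counts[vuln["Severity"]] += 1
    return counts


# Hypothetical report fragment with placeholder CVE IDs
sample_report = {
    "Results": [
        {
            "Target": "ml-inference (debian 12)",
            "Vulnerabilities": [
                {"VulnerabilityID": "CVE-0000-0001", "Severity": "CRITICAL", "FixedVersion": "3.0.13"},
                {"VulnerabilityID": "CVE-0000-0002", "Severity": "HIGH", "FixedVersion": ""},
                {"VulnerabilityID": "CVE-0000-0003", "Severity": "LOW", "FixedVersion": "1.2"},
            ],
        },
        {"Target": "Python", "Vulnerabilities": None},
    ]
}

counts = count_fixable(sample_report, {"CRITICAL", "HIGH"})
should_fail = counts["CRITICAL"] > 0
print(dict(counts), should_fail)
```

Skipping entries without a `FixedVersion` mirrors what `--ignore-unfixed` does on the command line.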

Q: What is a multi-architecture Docker image and when do you need one for ML?

A multi-architecture image (multi-arch or multi-platform) bundles images for multiple CPU architectures (typically linux/amd64 and linux/arm64) under a single image reference. When you pull the image, Docker automatically selects the correct architecture for the local machine. For ML teams, this is needed when data scientists develop on Apple Silicon Macs (arm64) and deploy to x86-64 cloud instances (amd64). Without multi-arch builds, an arm64 image built on an M1 Mac either fails to start on x86-64 nodes (an "exec format error") or runs under QEMU emulation, which is much slower. In the other direction, amd64-only images run under emulation on Apple Silicon, which is also slow and can behave differently from native execution. Build with docker buildx build --platform linux/amd64,linux/arm64.
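Under the hood, the single reference points at an image index that lists one manifest per platform, and the client picks the entry matching its own OS and architecture. A minimal sketch of that selection, using a trimmed-down index with placeholder digests (`select_manifest` is a hypothetical helper):

```python
from typing import Optional

# Simplified OCI image index; digests are placeholders
IMAGE_INDEX = {
    "schemaVersion": 2,
    "mediaType": "application/vnd.oci.image.index.v1+json",
    "manifests": [
        {"digest": "sha256:" + "a" * 64, "platform": {"os": "linux", "architecture": "amd64"}},
        {"digest": "sha256:" + "b" * 64, "platform": {"os": "linux", "architecture": "arm64"}},
    ],
}


def select_manifest(index: dict, os_name: str, architecture: str) -> Optional[str]:
    """Return the digest of the manifest built for the given platform, if any."""
    for entry in index.get("manifests", []):
        platform = entry.get("platform", {})
        if platform.get("os") == os_name and platform.get("architecture") == architecture:
            return entry["digest"]
    return None  # no image was built for this platform


laptop = select_manifest(IMAGE_INDEX, "linux", "arm64")   # Apple Silicon dev machine
cluster = select_manifest(IMAGE_INDEX, "linux", "amd64")  # x86-64 production node
print(laptop != cluster)
```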

Q: What is an SBOM and why do ML teams need to generate one?

An SBOM (Software Bill of Materials) is a structured list of all components in a software artifact. For a container image, this includes OS packages, Python packages, and their versions. ML teams need SBOMs for: (1) Compliance with regulations that require component inventories (healthcare HIPAA, finance), (2) Rapid CVE response - when a new vulnerability is announced, an SBOM lets you immediately determine which images are affected without scanning them all, (3) License compliance - confirming you are not accidentally shipping GPL-licensed code in a proprietary product. Generate with trivy image --format spdx-json.
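The rapid-response use case can be sketched as a lookup over stored SBOMs. This assumes SPDX JSON documents, where each entry in `packages` carries a `name` and a `versionInfo`; the image references, versions, and the helper name `images_with_package` are illustrative.

```python
def images_with_package(sboms: dict[str, dict], package_name: str) -> dict[str, str]:
    """Map image reference -> installed version for images that contain the package."""
    affected = {}
    for image_ref, sbom in sboms.items():
        for package in sbom.get("packages", []):
            if package.get("name") == package_name:
                affected[image_ref] = package.get("versionInfo", "unknown")
    return affected


# Hypothetical SBOMs, keyed by image reference and reduced to the relevant fields
stored_sboms = {
    "ml-inference:git-abc1234": {
        "packages": [
            {"name": "openssl", "versionInfo": "3.0.11"},
            {"name": "numpy", "versionInfo": "1.26.4"},
        ]
    },
    "ml-training:git-def5678": {
        "packages": [{"name": "numpy", "versionInfo": "1.26.4"}]
    },
}

affected = images_with_package(stored_sboms, "openssl")
print(affected)
```

With the SBOMs already on disk, answering "which images ship openssl?" becomes a dictionary lookup rather than a rescan of every image.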

Q: How do you manage image cleanup in a container registry to control storage costs?

Set up lifecycle policies: ECR supports native lifecycle policies via JSON rules (e.g., delete untagged images older than 1 day, keep only the 10 most recent images per tag prefix). For GHCR, use the API or retention policies. General principles: always clean up untagged images promptly (they accumulate whenever a mutable tag such as latest or staging moves to a newer image), clean up branch-specific tags when the branch is merged, never delete tagged images that are currently deployed (check which images are deployed in your deployment system before running cleanup), and keep at minimum the last N production-promoted images for rollback.
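These principles can be expressed as a selection function that a cleanup job runs before deleting anything. This is a sketch under assumptions: the `Image` record, the `branch-` tag prefix, and the age thresholds are hypothetical, and `protected` is expected to contain the digests that are deployed or kept for rollback.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class Image:
    digest: str
    pushed_at: datetime
    tags: list[str] = field(default_factory=list)


def select_for_deletion(
    images: list[Image],
    protected: set[str],
    now: datetime,
    untagged_max_age: timedelta = timedelta(days=1),
    branch_max_age: timedelta = timedelta(days=14),
) -> list[str]:
    """Return digests that are safe to delete under the rules described above."""
    doomed = []
    for image in images:
        if image.digest in protected:
            continue  # deployed or reserved for rollback: never delete
        age = now - image.pushed_at
        if not image.tags:
            if age > untagged_max_age:
                doomed.append(image.digest)
        elif all(tag.startswith("branch-") for tag in image.tags) and age > branch_max_age:
            doomed.append(image.digest)
    return doomed


now = datetime(2024, 3, 15, tzinfo=timezone.utc)
images = [
    Image("sha256:old-untagged", now - timedelta(days=3)),
    Image("sha256:new-untagged", now - timedelta(hours=2)),
    Image("sha256:stale-branch", now - timedelta(days=30), ["branch-feature-x"]),
    Image("sha256:prod", now - timedelta(days=60), ["git-abc1234", "production"]),
    Image("sha256:rollback", now - timedelta(days=90)),
]
to_delete = select_for_deletion(images, protected={"sha256:prod", "sha256:rollback"}, now=now)
print(to_delete)
```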

© 2026 EngineersOfAI. All rights reserved.