Container Registry and CI
The Security Incident That Changed Everything
The security team at a healthcare ML startup sent a Slack message at 10 AM on a Tuesday that
stopped all engineering work: "We have a critical CVE in production. Container image
ml-inference:prod-2024-01-15 has OpenSSL CVE-2024-0727, CVSS 7.5. This is a production
health data system. We need this patched within 4 hours per our compliance requirements."
The incident investigation revealed a process problem. The team had a container registry (AWS ECR),
a CI pipeline that built images, and a deployment process. What they did not have: any security
scanning in the pipeline. Images were built, pushed, and deployed without ever being checked for
CVEs. The vulnerable OpenSSL version had been in the python:3.11-slim base image since before
the team started using it - 11 weeks earlier. No one had noticed because no one was looking.
The 4-hour patch timeline was aggressive. The team spent 45 minutes finding and running Trivy for the first time, 30 minutes updating the base image tag, and 2 hours navigating the manual deployment process. They made it with 45 minutes to spare. The next day, they automated the entire security scanning workflow so this could never happen again without immediate detection.
This lesson builds the container registry workflow they should have had from the start.
Why This Exists
Container registries are the artifact stores of the container world - analogous to PyPI for Python packages or Maven Central for Java. They store image layers (content-addressed blobs) and metadata (manifests that describe which layers make up each image). The registry serves images to Docker daemons, Kubernetes nodes, and CI runners.
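To make "content-addressed" concrete, here is a minimal sketch of how a blob's address is derived from its bytes and how a manifest refers to layers by that address. It is a simplification under stated assumptions: the manifest follows the general shape of an OCI image manifest, the layer bytes are placeholders rather than real tarballs, and `blob_digest` and `build_manifest` are illustrative names, not part of any registry client.

```python
import hashlib
import json


def blob_digest(data: bytes) -> str:
    """Return the content address of a blob in the 'sha256:<hex>' form."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


def build_manifest(config: bytes, layers: list[bytes]) -> dict:
    """Build a simplified OCI-style manifest that references blobs by digest and size."""
    return {
        "schemaVersion": 2,
        "mediaType": "application/vnd.oci.image.manifest.v1+json",
        "config": {
            "mediaType": "application/vnd.oci.image.config.v1+json",
            "digest": blob_digest(config),
            "size": len(config),
        },
        "layers": [
            {
                "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
                "digest": blob_digest(layer),
                "size": len(layer),
            }
            for layer in layers
        ],
    }


if __name__ == "__main__":
    # Placeholder bytes standing in for real layer tarballs.
    base_layer = b"pretend this is the base image filesystem"
    image_a = build_manifest(b"{}", [base_layer, b"inference code v1"])
    image_b = build_manifest(b"{}", [base_layer, b"inference code v2"])
    print(json.dumps(image_a, indent=2))
    # Both images reference the same digest for the shared base layer,
    # so a registry only needs to store that blob once.
    print(image_a["layers"][0]["digest"] == image_b["layers"][0]["digest"])
```

Because the address is a hash of the content, two images that share a base layer share the stored blob, which is why a small code change usually results in a small push even when the image as a whole is gigabytes.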
For ML systems, container registries have additional requirements beyond standard software: ML images change on two independent cycles (code changes and model updates), images can be very large (gigabytes), multi-architecture support is necessary for teams with Apple Silicon development machines and x86-64 production, and security scanning must be automated because ML practitioners often do not think about CVE remediation.
Registry Choices for ML Teams
ECR for AWS shops: Use ECR if you deploy on EKS or ECS. Pulls from within the same region incur no data transfer charges. IAM roles handle authentication, so there are no long-lived registry credentials to manage. Basic scanning historically used the open-source Clair project; enhanced scanning is provided by Amazon Inspector. ECR lifecycle policies automate image cleanup.
GHCR for GitHub-native teams: GitHub Container Registry is the path of least resistance for
teams whose code is on GitHub. The GITHUB_TOKEN in GitHub Actions can pull from and push to
GHCR once the job is granted the packages: write permission - no additional credentials
configuration needed. Free for public repositories.
Image Tagging Strategy for ML
ML images require a more sophisticated tagging strategy than typical software images because they have two independent version axes: code version and model version.
# Tagging patterns for ML images
# Pattern 1: Git SHA (always unique, immutable)
# Use for: pinning exact versions in deployments, debugging
docker tag ml-inference:latest ml-inference:git-abc1234
# Pattern 2: Semantic version (for explicit release tracking)
docker tag ml-inference:latest ml-inference:v2.3.1
# Pattern 3: Model version + code version (ML-specific)
# Encodes both what code runs AND which model it serves
docker tag ml-inference:latest ml-inference:model-v47-code-abc1234
# Pattern 4: Environment tags (mutable - updated on promotion)
docker tag ml-inference:latest ml-inference:staging
docker tag ml-inference:latest ml-inference:production
# Pattern 5: Date-based (for scheduled builds, easy human reading)
docker tag ml-inference:latest ml-inference:2024-03-15
# Best practice: use git SHA as the immutable canonical tag
# Use environment tags as mutable pointers to the current version in each env
# scripts/image_tagger.py - generate consistent image tags in CI
import os
import subprocess
from datetime import date
from typing import Optional


def get_image_tags(
    registry: str,
    repository: str,
    model_version: Optional[str] = None,
) -> list[str]:
    """
    Generate the set of tags to apply to an image build.

    Returns a list of full image references.
    """
    git_sha = subprocess.check_output(
        ["git", "rev-parse", "--short", "HEAD"]
    ).decode().strip()
    # GITHUB_REF_NAME is set by GitHub Actions; "/" is not allowed in a tag
    branch = os.environ.get("GITHUB_REF_NAME", "unknown").replace("/", "-")
    today = date.today().strftime("%Y%m%d")
    base = f"{registry}/{repository}"

    tags = [
        f"{base}:git-{git_sha}",      # Immutable - always added
        f"{base}:{today}-{git_sha}",  # Date + sha for human readability
    ]
    if model_version:
        tags.append(f"{base}:model-v{model_version}-code-{git_sha}")
    if branch == "main":
        tags.append(f"{base}:latest")
    return tags
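Tags that encode both version axes are only useful if tooling can read them back. The sketch below parses the git-<sha> and model-v<model>-code-<sha> patterns produced above; `parse_image_tag` and `ParsedTag` are hypothetical names introduced for illustration, and the regular expressions assume model versions contain no hyphens.

```python
import re
from typing import NamedTuple, Optional

# model-v<model>-code-<sha>, e.g. model-v47-code-abc1234
_MODEL_CODE_TAG = re.compile(
    r"^model-v(?P<model>[A-Za-z0-9._]+)-code-(?P<sha>[0-9a-f]{7,40})$"
)
# git-<sha>, e.g. git-abc1234
_GIT_TAG = re.compile(r"^git-(?P<sha>[0-9a-f]{7,40})$")


class ParsedTag(NamedTuple):
    git_sha: str
    model_version: Optional[str]


def parse_image_tag(tag: str) -> Optional[ParsedTag]:
    """Recover the code (and, if present, model) version from an image tag.

    Returns None for tags that carry no version information, such as the
    mutable environment tags (staging, production, latest).
    """
    match = _MODEL_CODE_TAG.match(tag)
    if match:
        return ParsedTag(match.group("sha"), match.group("model"))
    match = _GIT_TAG.match(tag)
    if match:
        return ParsedTag(match.group("sha"), None)
    return None


if __name__ == "__main__":
    for tag in ("git-abc1234", "model-v47-code-abc1234", "staging"):
        print(tag, "->", parse_image_tag(tag))
```

A parser like this lets an incident responder go from the tag on a running pod to the exact commit and model version without consulting a separate lookup table.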
The Complete CI/CD Pipeline for ML Container Images
# .github/workflows/container-ci.yml
name: ML Container CI

on:
  push:
    branches: [main, develop]
    paths:
      - 'src/**'
      - 'Dockerfile*'
      - 'requirements*.txt'
  pull_request:
    branches: [main]

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}/ml-inference

jobs:
  # ─────────────────────────────────────────────────────────────
  # Build multi-architecture image
  # ─────────────────────────────────────────────────────────────
  build-and-scan:
    name: Build, Scan, and Push
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write         # Required for GHCR push
      security-events: write  # Required for SARIF upload
    outputs:
      image-digest: ${{ steps.build.outputs.digest }}
      image-tags: ${{ steps.meta.outputs.tags }}
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      # QEMU provides the emulation needed to build linux/arm64 on an x86-64 runner
      - name: Set up QEMU
        uses: docker/setup-qemu-action@v3

      # Set up Docker Buildx for multi-architecture builds
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3
        with:
          platforms: linux/amd64,linux/arm64

      # Log in to GHCR using GITHUB_TOKEN (no additional secrets needed)
      - name: Log in to GHCR
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      # Generate image tags and labels.
      # format=long emits the full commit SHA, so the tag matches the
      # git-<github.sha> reference used by the scan and promotion steps.
      - name: Extract metadata
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
          tags: |
            type=sha,prefix=git-,format=long
            type=raw,value=latest,enable={{is_default_branch}}
            type=ref,event=branch
            type=semver,pattern={{version}}

      # Build and push (multi-platform)
      - name: Build and push
        id: build
        uses: docker/build-push-action@v5
        with:
          context: .
          file: Dockerfile.inference
          platforms: linux/amd64,linux/arm64
          push: ${{ github.event_name != 'pull_request' }}
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}
          cache-from: type=registry,ref=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:cache
          cache-to: type=registry,ref=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:cache,mode=max
          # Pass build args for image metadata
          build-args: |
            BUILD_DATE=${{ github.event.head_commit.timestamp }}
            GIT_SHA=${{ github.sha }}
            VERSION=${{ steps.meta.outputs.version }}

      # ─────────────────────────────────────────────────────────
      # Security scanning with Trivy
      # The scan steps pull the pushed image from the registry, so they are
      # skipped on pull requests, where the image is built but not pushed.
      # ─────────────────────────────────────────────────────────
      - name: Run Trivy vulnerability scan
        id: trivy
        if: github.event_name != 'pull_request'
        # @master is a moving target; pin to a release tag or commit SHA in production
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:git-${{ github.sha }}
          format: sarif
          output: trivy-results.sarif
          severity: CRITICAL,HIGH
          ignore-unfixed: true  # Ignore CVEs with no fix available
          exit-code: 0          # Don't fail here - we upload results and decide below

      # Upload Trivy results to GitHub Security tab
      - name: Upload Trivy scan results to GitHub Security
        uses: github/codeql-action/upload-sarif@v3
        if: always() && github.event_name != 'pull_request'
        with:
          sarif_file: trivy-results.sarif

      # Fail if CRITICAL vulnerabilities with fixes are found
      - name: Fail on critical CVEs
        if: github.event_name != 'pull_request'
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:git-${{ github.sha }}
          format: table
          severity: CRITICAL
          ignore-unfixed: true
          exit-code: 1  # Fail CI on CRITICAL

      # Generate SBOM for compliance
      - name: Generate SBOM
        if: github.event_name != 'pull_request'
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:git-${{ github.sha }}
          format: spdx-json
          output: sbom.spdx.json

      - name: Upload SBOM
        if: github.event_name != 'pull_request'
        uses: actions/upload-artifact@v4
        with:
          name: sbom-${{ github.sha }}
          path: sbom.spdx.json
          retention-days: 90  # Keep SBOM for compliance audit window

  # ─────────────────────────────────────────────────────────────
  # Image promotion across environments
  # ─────────────────────────────────────────────────────────────
  promote-to-staging:
    name: Promote to Staging
    needs: [build-and-scan]
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    environment: staging
    permissions:
      contents: read   # Checkout for the smoke tests
      packages: write  # Required to push the retagged manifest to GHCR
    steps:
      # The smoke tests live in the repository, so check it out first
      - name: Checkout
        uses: actions/checkout@v4

      - name: Log in to GHCR
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      # Promotion by retagging - no rebuild
      - name: Tag image as staging
        run: |
          docker buildx imagetools create \
            --tag ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:staging \
            ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:git-${{ github.sha }}

      # Assumes cluster credentials (kubeconfig) are already configured on the runner
      - name: Deploy to staging Kubernetes
        run: |
          # Update Kubernetes deployment image reference
          kubectl set image deployment/ml-inference \
            ml-inference=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:git-${{ github.sha }} \
            --namespace=ml-staging
          kubectl rollout status deployment/ml-inference --namespace=ml-staging --timeout=5m

      - name: Run smoke tests against staging
        run: |
          # --timeout comes from pytest-timeout; --base-url is assumed to come from
          # pytest-base-url (omit that plugin if conftest.py defines the option itself)
          pip install httpx pytest pytest-timeout pytest-base-url
          pytest tests/smoke/ --base-url=${{ vars.STAGING_URL }} --timeout=30

  promote-to-production:
    name: Promote to Production
    needs: [promote-to-staging]
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    environment: production  # Manual approval is enforced by the environment's protection rules
    permissions:
      packages: write  # Required to push the retagged manifest to GHCR
    steps:
      - name: Log in to GHCR
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Tag image as production
        run: |
          docker buildx imagetools create \
            --tag ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:production \
            ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:git-${{ github.sha }}

      # Assumes cluster credentials (kubeconfig) are already configured on the runner
      - name: Deploy to production Kubernetes
        run: |
          kubectl set image deployment/ml-inference \
            ml-inference=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:git-${{ github.sha }} \
            --namespace=ml-production
          kubectl rollout status deployment/ml-inference --namespace=ml-production --timeout=10m

      - name: Notify team
        run: |
          curl -X POST ${{ secrets.SLACK_WEBHOOK }} \
            -H 'Content-type: application/json' \
            -d "{\"text\": \"ml-inference image promoted to production: git-${{ github.sha }}\"}"
ECR-Specific Configuration
# scripts/ecr_helpers.py - ECR authentication and lifecycle management
import base64
import json
import subprocess

import boto3


def authenticate_ecr(region: str = "us-east-1") -> str:
    """Authenticate Docker to ECR. Returns the ECR registry endpoint URL."""
    ecr = boto3.client("ecr", region_name=region)

    # Get ECR authorization token
    token = ecr.get_authorization_token()
    auth_data = token["authorizationData"][0]

    # Decode credentials (base64-encoded "username:password")
    credentials = base64.b64decode(auth_data["authorizationToken"]).decode()
    username, password = credentials.split(":", 1)
    registry = auth_data["proxyEndpoint"]

    # Authenticate Docker, passing the password on stdin rather than on the command line
    subprocess.run(
        ["docker", "login", "--username", username, "--password-stdin", registry],
        input=password.encode(),
        check=True,
    )
    return registry


def create_ecr_repository(
    repo_name: str,
    region: str = "us-east-1",
    enable_scan_on_push: bool = True,
    lifecycle_days: int = 30,
) -> str:
    """Create ECR repository with scanning and lifecycle policy.

    Note: lifecycle_days is currently unused. A single ECR lifecycle rule
    selects by image count or by age, not both, and the rules below are
    count-based for tagged images.
    """
    ecr = boto3.client("ecr", region_name=region)
    try:
        response = ecr.create_repository(
            repositoryName=repo_name,
            imageScanningConfiguration={"scanOnPush": enable_scan_on_push},
            imageTagMutability="MUTABLE",  # Allow retagging for promotion
        )
        repo_uri = response["repository"]["repositoryUri"]
    except ecr.exceptions.RepositoryAlreadyExistsException:
        response = ecr.describe_repositories(repositoryNames=[repo_name])
        repo_uri = response["repositories"][0]["repositoryUri"]

    # Lifecycle policy: expire untagged images after 1 day, then keep only the
    # 10 most recently pushed images. Rule 2 applies to every image, including
    # ones still tagged production, so raise the count (or narrow the rule with
    # tag prefixes) if more than 10 pushes can happen between releases.
    lifecycle_policy = {
        "rules": [
            {
                "rulePriority": 1,
                "description": "Remove untagged images after 1 day",
                "selection": {
                    "tagStatus": "untagged",
                    "countType": "sinceImagePushed",
                    "countUnit": "days",
                    "countNumber": 1,
                },
                "action": {"type": "expire"},
            },
            {
                "rulePriority": 2,
                "description": "Keep only the 10 most recent images",
                "selection": {
                    "tagStatus": "any",
                    "countType": "imageCountMoreThan",
                    "countNumber": 10,
                },
                "action": {"type": "expire"},
            },
        ]
    }
    ecr.put_lifecycle_policy(
        repositoryName=repo_name,
        lifecyclePolicyText=json.dumps(lifecycle_policy),
    )
    return repo_uri
Production Notes
Image immutability: Never overwrite an immutable tag (SHA-based). Only mutable tags
(staging, production, latest) should be updated. This ensures you can always trace
exactly what is running in production by looking at the SHA tag.
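The split between immutable and mutable tags can be modelled in a few lines. The following is a toy in-memory sketch, not a registry client: `TagStore`, `ImmutableTagError`, and the `MUTABLE_TAGS` set are names assumed for illustration, and a real registry would enforce this through repository settings or CI checks.

```python
class ImmutableTagError(Exception):
    """Raised when a write would move an immutable tag to a different digest."""


# Assumption: these are the only mutable pointer tags in use.
MUTABLE_TAGS = {"latest", "staging", "production"}


class TagStore:
    """Toy model of a repository's tag -> digest mapping."""

    def __init__(self) -> None:
        self._tags: dict[str, str] = {}

    def put(self, tag: str, digest: str) -> None:
        """Point tag at digest, refusing to move an immutable tag."""
        current = self._tags.get(tag)
        if current is not None and current != digest and tag not in MUTABLE_TAGS:
            raise ImmutableTagError(f"{tag} already points at {current}")
        self._tags[tag] = digest

    def resolve(self, tag: str) -> str:
        """Return the digest a tag currently points at."""
        return self._tags[tag]

    def promote(self, source_tag: str, env_tag: str) -> str:
        """Point a mutable environment tag at the digest behind source_tag."""
        if env_tag not in MUTABLE_TAGS:
            raise ValueError(f"{env_tag} is not a mutable environment tag")
        digest = self.resolve(source_tag)
        self.put(env_tag, digest)
        return digest


if __name__ == "__main__":
    store = TagStore()
    store.put("git-abc1234", "sha256:aaa")
    store.promote("git-abc1234", "production")
    # The environment tag and the SHA tag resolve to the same bytes.
    print(store.resolve("production") == store.resolve("git-abc1234"))
```

The property worth noticing is that promotion never creates a new digest: the production tag is only ever a pointer to something that already has an immutable name.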
Registry caching in CI: Use --cache-from and --cache-to with type=registry in
docker buildx build to store build cache in the registry itself. This works across CI runners
(unlike local BuildKit cache, which is runner-specific). Significantly speeds up CI builds when
most layers have not changed.
Cleanup: Registries accumulate images quickly, especially with high commit velocity. Set up lifecycle policies on ECR; GHCR has no equivalent built-in policy, so run a scheduled cleanup job there instead (for example with the GitHub Packages API or the actions/delete-package-versions action). Never delete tagged images that are currently deployed - restrict automated cleanup to untagged images and to images that carry only temporary or branch tags.
:::tip Sign Images for Production
For regulated industries or security-conscious environments, sign container images with
cosign (Sigstore). Signing proves that an image was built by your CI pipeline and has not
been tampered with. Kubernetes admission controllers can verify signatures before allowing
images to run.
# Sign an image after pushing
cosign sign --key cosign.key ghcr.io/myorg/ml-inference:git-abc1234
# Verify a signature
cosign verify --key cosign.pub ghcr.io/myorg/ml-inference:git-abc1234
:::
:::warning Pull Rate Limits on Docker Hub
Docker Hub imposes pull rate limits on unauthenticated requests and free accounts, while paid
plans are far less restricted. Historically the limits were 100 pulls per 6 hours for
unauthenticated requests and 200 for authenticated free accounts; the exact quotas have changed
over time, so check Docker's current documentation. In CI/CD pipelines that run frequently, you
will hit these limits for base images pulled from Docker Hub. Mirror the base images you use to
your private registry (ECR, GHCR) to avoid rate limiting.
:::
:::danger Never Push Credentials to a Registry
Container images are often pulled by many systems - CI runners, production nodes, developer
machines. A credential accidentally included in an image layer (in ENV, in a file copied via
COPY, or in a RUN command output) is exposed to everyone who can pull the image. Use .dockerignore
to exclude credential files. Audit image layers with docker history or dive before pushing
to any registry.
:::
Interview Q&A
Q: What is image promotion in CI/CD and how does it work for ML containers?
Image promotion is the practice of moving a single, immutable image artifact through environments
(dev → staging → production) rather than rebuilding for each environment. In container terms: build
once with a SHA-based tag, run tests in staging, if tests pass retag the image with the environment
name (e.g., production) rather than rebuilding. This ensures that exactly the same bytes that
passed in staging are what runs in production. Promotion is a retag operation, not a rebuild.
Q: What is Trivy and how do you integrate it into a CI/CD pipeline?
Trivy (by Aqua Security) is an open-source vulnerability scanner for container images, file
systems, and code repositories. In CI/CD, integrate it after the image build step: run
trivy image <image-ref> with --exit-code 1 --severity CRITICAL,HIGH to fail the pipeline
if critical or high-severity vulnerabilities are found. Upload the SARIF-format results to the GitHub Security tab
for visibility. Run Trivy with --ignore-unfixed to avoid failing on CVEs that have no fix
available yet. Set up periodic rescans of production images (not just on build) to catch new
CVEs in already-deployed images.
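When the built-in exit code is too coarse, for example during periodic rescans that should page someone only for fixable criticals, the same gate can be applied to Trivy's JSON report. This is a sketch under assumptions: it relies on the Results / Vulnerabilities / Severity / FixedVersion fields as they appear in `trivy image --format json` output, the function names are illustrative, and the CVE identifiers in the sample are made up.

```python
from collections import Counter


def count_fixable_by_severity(report: dict) -> Counter:
    """Count vulnerabilities that have a fixed version, grouped by severity."""
    counts: Counter = Counter()
    for result in report.get("Results", []):
        # "Vulnerabilities" may be missing or null when a target has no findings.
        for vuln in result.get("Vulnerabilities") or []:
            if vuln.get("FixedVersion"):
                counts[vuln.get("Severity", "UNKNOWN")] += 1
    return counts


def should_block(report: dict, blocking: tuple[str, ...] = ("CRITICAL",)) -> bool:
    """Return True if any fixable finding has a blocking severity."""
    counts = count_fixable_by_severity(report)
    return any(counts[severity] > 0 for severity in blocking)


if __name__ == "__main__":
    # Hand-written sample in the assumed report shape; the IDs are placeholders.
    sample = {
        "Results": [
            {
                "Target": "os-packages",
                "Vulnerabilities": [
                    {"VulnerabilityID": "CVE-0000-0001", "Severity": "CRITICAL",
                     "FixedVersion": "1.2.3"},
                    {"VulnerabilityID": "CVE-0000-0002", "Severity": "HIGH",
                     "FixedVersion": ""},
                ],
            },
            {"Target": "python-packages", "Vulnerabilities": None},
        ]
    }
    print(dict(count_fixable_by_severity(sample)))
    print("block deploy:", should_block(sample))
```

Filtering on FixedVersion mirrors what --ignore-unfixed does: findings nobody can act on yet are reported elsewhere but do not stop the pipeline.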
Q: What is a multi-architecture Docker image and when do you need one for ML?
A multi-architecture image (multi-arch or multi-platform) bundles images for multiple CPU
architectures (typically linux/amd64 and linux/arm64) into a single image reference. When
you pull the image, Docker automatically selects the correct architecture for the local machine.
For ML teams, this is needed when data scientists develop on Apple Silicon Macs (arm64) and
deploy to x86-64 cloud instances (amd64). Without multi-arch builds, an arm64 image built on an
M1 Mac does not run natively on x86-64 hosts (it fails with an exec format error unless QEMU
emulation is configured), and an amd64-only image runs under emulation on Apple Silicon, which
is noticeably slower. Build with docker buildx build --platform linux/amd64,linux/arm64.
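The "automatic selection" is a lookup in the image index that the multi-arch tag points to. The sketch below mimics that lookup on a hand-written index in the general shape of an OCI image index; `select_manifest` is an illustrative name and the digests are placeholders.

```python
from typing import Optional


def select_manifest(index: dict, os_name: str, architecture: str) -> Optional[str]:
    """Return the digest of the manifest matching the platform, or None."""
    for entry in index.get("manifests", []):
        platform = entry.get("platform", {})
        if platform.get("os") == os_name and platform.get("architecture") == architecture:
            return entry["digest"]
    return None


# Hand-written example index; the digests are placeholders, not real images.
EXAMPLE_INDEX = {
    "schemaVersion": 2,
    "mediaType": "application/vnd.oci.image.index.v1+json",
    "manifests": [
        {
            "mediaType": "application/vnd.oci.image.manifest.v1+json",
            "digest": "sha256:amd64-placeholder",
            "platform": {"os": "linux", "architecture": "amd64"},
        },
        {
            "mediaType": "application/vnd.oci.image.manifest.v1+json",
            "digest": "sha256:arm64-placeholder",
            "platform": {"os": "linux", "architecture": "arm64"},
        },
    ],
}

if __name__ == "__main__":
    # An Apple Silicon laptop and an x86-64 node pull the same tag
    # but end up with different per-platform manifests.
    print(select_manifest(EXAMPLE_INDEX, "linux", "arm64"))
    print(select_manifest(EXAMPLE_INDEX, "linux", "amd64"))
```

A single-architecture image has no index to choose from, which is why the client either fails or falls back to emulation when the platform does not match.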
Q: What is an SBOM and why do ML teams need to generate one?
An SBOM (Software Bill of Materials) is a structured list of all components in a software artifact
- for a container image, this includes OS packages, Python packages, and their versions. ML teams
need SBOMs for: (1) Compliance with regulations that require component inventories (healthcare HIPAA,
finance), (2) Rapid CVE response - when a new vulnerability is announced, an SBOM lets you
immediately determine which images are affected without scanning them all, (3) License compliance -
confirming you are not accidentally shipping GPL-licensed code in a proprietary product. Generate
with trivy image --format spdx-json <image-ref>.
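Point (2) amounts to a search over stored SBOMs. Below is a minimal sketch that assumes SPDX JSON documents with a top-level packages list whose entries carry name and versionInfo; `find_package_versions` and `affected_images` are illustrative helpers, the version strings are invented, and matching on exact version strings is a simplification of real version-range comparison.

```python
def find_package_versions(sbom: dict, package_name: str) -> list[str]:
    """Return every recorded version of package_name in one SPDX JSON document."""
    return [
        pkg.get("versionInfo", "")
        for pkg in sbom.get("packages", [])
        if pkg.get("name") == package_name
    ]


def affected_images(
    sboms_by_image: dict[str, dict],
    package_name: str,
    vulnerable_versions: set[str],
) -> list[str]:
    """List images whose SBOM contains the package at a known-vulnerable version."""
    hits = []
    for image_ref, sbom in sboms_by_image.items():
        versions = find_package_versions(sbom, package_name)
        if any(version in vulnerable_versions for version in versions):
            hits.append(image_ref)
    return sorted(hits)


if __name__ == "__main__":
    # Invented SBOM fragments keyed by image tag.
    sboms = {
        "ml-inference:git-abc1234": {
            "packages": [{"name": "openssl", "versionInfo": "3.0.11"}]
        },
        "ml-inference:git-def5678": {
            "packages": [{"name": "openssl", "versionInfo": "3.0.13"}]
        },
    }
    print(affected_images(sboms, "openssl", {"3.0.11"}))
```

With SBOMs archived per build, answering "which images contain the vulnerable OpenSSL?" becomes a query over JSON files instead of a rescan of every image in the registry.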
Q: How do you manage image cleanup in a container registry to control storage costs?
Set up lifecycle policies: ECR supports native lifecycle policies via JSON rules (e.g., delete untagged images older than 1 day, keep only 10 images per tag prefix). GHCR has no equivalent built-in policy, so run a scheduled cleanup job against the GitHub Packages API. General principles: always clean up untagged images promptly (they accumulate whenever a mutable tag such as latest moves to a newer image), clean up branch-specific tags when the branch is merged, never delete tagged images that are currently deployed (check which images are deployed in your deployment system before running cleanup), and keep at minimum the last N production-promoted images for rollback.
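These principles can be expressed as a pure selection function that a cleanup job runs before calling any delete API. This is a sketch under assumptions: `ImageRecord`, `select_for_deletion`, the branch- and pr- tag prefixes, and the age thresholds are all hypothetical, and the set of protected digests (deployed images plus rollback candidates) is assumed to come from your deployment system.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone

# Assumption: temporary tags are recognisable by these prefixes.
TEMP_PREFIXES = ("branch-", "pr-")


@dataclass
class ImageRecord:
    digest: str
    pushed_at: datetime
    tags: list[str] = field(default_factory=list)


def select_for_deletion(
    images: list[ImageRecord],
    protected_digests: set[str],
    now: datetime,
    untagged_max_age: timedelta = timedelta(days=1),
    temp_max_age: timedelta = timedelta(days=14),
) -> list[str]:
    """Return digests that are safe to delete under the rules described above."""
    doomed = []
    for image in images:
        # Never touch anything that is deployed or kept for rollback.
        if image.digest in protected_digests:
            continue
        age = now - image.pushed_at
        if not image.tags:
            if age > untagged_max_age:
                doomed.append(image.digest)
        elif all(tag.startswith(TEMP_PREFIXES) for tag in image.tags):
            if age > temp_max_age:
                doomed.append(image.digest)
    return doomed


if __name__ == "__main__":
    now = datetime(2024, 3, 15, tzinfo=timezone.utc)
    images = [
        ImageRecord("sha256:old-untagged", now - timedelta(days=3)),
        ImageRecord("sha256:prod", now - timedelta(days=60), ["git-abc1234", "production"]),
        ImageRecord("sha256:stale-branch", now - timedelta(days=30), ["branch-experiment"]),
    ]
    print(select_for_deletion(images, {"sha256:prod"}, now))
```

Keeping the selection separate from the deletion call makes the policy easy to unit test and to run in dry-run mode before it is trusted with a production registry.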
