
Optimizing ML Docker Images

The 8GB Image That Broke Production

The platform team at a computer vision startup had built a real-time object detection service. It worked. The model ran. And then Kubernetes started failing to schedule pods fast enough. The operations engineer pulled up the pod startup logs and found the culprit: Pulling image... 8.4GB. On their cloud instances, with a 1 Gbps network interface, pulling an 8.4GB image took 11-13 minutes. Cold start time (from pod creation to first successful health check): 14 minutes.

The downstream effect was disastrous. Auto-scaling could not respond to traffic spikes. By the time new pods were healthy, the spike had passed and P99 latency had already blown past the SLA. The team had alerts for CPU and memory but not for image pull time - it was invisible until it became a production issue.

The root cause of the 8.4GB image was accumulated technical debt in the Dockerfile. It had started as a quick solution: FROM nvidia/cuda:11.8.0-cudnn8-devel-ubuntu20.04 (the big CUDA developer image), then pip install torch torchvision opencv-python-headless pandas scikit-learn albumentations.... No one had ever questioned the base image. No one had done a multi-stage build. No one had removed the build tools after compilation. The image grew organically over 18 months.

Tanya, an engineer on the team, took three days to rebuild it. Final result: 1.18GB. Cold start: 87 seconds. That is a 7-fold reduction in image size and a nearly 10-fold reduction in cold start time (14 minutes down to 87 seconds). This lesson documents exactly what she did.


Why Image Size Matters for ML

ML images tend toward large sizes for several structural reasons:

  • CUDA base images: nvidia/cuda:12.1.0-cudnn8-devel-ubuntu22.04 is ~4.5GB before adding any Python packages
  • PyTorch: torch==2.2.0 is ~800MB installed (including CUDA libraries bundled in the wheel)
  • OpenCV: Full opencv-python pulls in Qt and X11 libraries not needed in headless servers
  • Build tools: gcc, g++, development headers installed for compiling packages and never removed

Image size affects:

  1. Cold start time: Image pull + layer extraction before container starts
  2. Registry storage costs: Every version stored at 8GB vs 1GB
  3. CI/CD pipeline time: Pushing 8GB images takes minutes; pulling in CI takes minutes
  4. Attack surface: More packages = more potential CVEs
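To make the cold start item concrete, here is a rough back-of-the-envelope sketch in Python. It uses the 8.4GB image, the 1 Gbps interface, and the 11-13 minute pulls from the story above, and assumes decimal units (1 GB = 10^9 bytes). The function names are illustrative, not part of any tool.

```python
def transfer_seconds(image_gb: float, link_gbps: float) -> float:
    """Seconds to move image_gb gigabytes over a link_gbps link at full line rate."""
    return image_gb * 8 / link_gbps


def effective_gbps(image_gb: float, observed_seconds: float) -> float:
    """End-to-end throughput implied by an observed pull time."""
    return image_gb * 8 / observed_seconds


# Best case: the network interface is the only bottleneck
best_case = transfer_seconds(8.4, 1.0)

# What the observed pull implies (12 minutes, the midpoint of the 11-13 minute range)
implied = effective_gbps(8.4, 12 * 60)

print(f"line-rate lower bound: {best_case:.0f} s")
print(f"implied end-to-end throughput: {implied * 1000:.0f} Mbps")
```

The gap between the roughly one-minute lower bound and the observed pull suggests that the interface speed was not the limiting factor. Registry throughput, decompression, and layer extraction all count toward pull time, which is why shrinking the image helps more than a faster network card would.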

The Optimization Toolkit

Technique 1: Multi-Stage Builds

Multi-stage builds are the single most impactful optimization. The idea: use a large "builder" image for compilation and package installation, then copy only the necessary artifacts into a small "runtime" image.

# ─────────────────────────────────────────────────────────────────
# BEFORE: Naive single-stage image - all build tools stay in image
# ─────────────────────────────────────────────────────────────────
FROM nvidia/cuda:12.1.0-cudnn8-devel-ubuntu22.04
# devel image includes: nvcc compiler, development headers, build tools
# This alone is ~4.5GB

RUN apt-get update && apt-get install -y \
    python3.11 python3-pip build-essential libopencv-dev

RUN pip install torch==2.2.0 torchvision opencv-python numpy scikit-learn

COPY src/ /app/src/
COPY models/ /app/models/

CMD ["python3", "/app/src/serve.py"]

# Result: ~8.4GB

# ─────────────────────────────────────────────────────────────────
# AFTER: Multi-stage build - build artifacts copied, build tools discarded
# ─────────────────────────────────────────────────────────────────

# Stage 1: Builder - install everything, including build tools
FROM nvidia/cuda:12.1.0-cudnn8-devel-ubuntu22.04 AS builder

RUN apt-get update && apt-get install -y --no-install-recommends \
        python3.11 python3-pip python3.11-venv build-essential \
    && rm -rf /var/lib/apt/lists/*

# Create virtual environment for clean copy
RUN python3.11 -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

# Install Python packages inside venv
# torch: install from the CUDA 12.1 index so the wheel matches the base image's CUDA version
RUN pip install --no-cache-dir \
        torch==2.2.0+cu121 \
        --index-url https://download.pytorch.org/whl/cu121

RUN pip install --no-cache-dir \
        torchvision==0.17.0+cu121 \
        --index-url https://download.pytorch.org/whl/cu121

RUN pip install --no-cache-dir \
        opencv-python-headless==4.9.0.80 \
        numpy==1.26.4 \
        scikit-learn==1.4.2


# ─────────────────────────────────────────────────────────────────
# Stage 2: Runtime - only what's needed to run the service
# Use the RUNTIME image (not devel) - no nvcc compiler included
# ─────────────────────────────────────────────────────────────────
FROM nvidia/cuda:12.1.0-cudnn8-runtime-ubuntu22.04 AS runtime

# Install Python runtime (not build tools - no build-essential)
RUN apt-get update && apt-get install -y --no-install-recommends \
        python3.11 \
        libglib2.0-0 \
        libgl1-mesa-glx \
    && rm -rf /var/lib/apt/lists/*

# Copy only the virtual environment from builder stage
COPY --from=builder /opt/venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

WORKDIR /app

# Copy application code and model
COPY src/ ./src/
COPY models/detector_v4.pt ./models/

# Security: non-root user
RUN useradd -m -r mluser && chown -R mluser:mluser /app
USER mluser

EXPOSE 8080
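# Standard-library probe, so the check does not depend on requests being installed
# urlopen raises on connection errors and non-2xx responses, which gives a non-zero exit
HEALTHCHECK --interval=30s --timeout=10s --start-period=90s \
    CMD python3 -c "import urllib.request; urllib.request.urlopen('http://localhost:8080/health', timeout=5)"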

CMD ["python3", "-m", "src.serve"]

# Result: ~1.18GB (7x reduction)
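
The inline health check in the Dockerfile above can also live in a small script, which is easier to read and test. The following is a minimal sketch using only the standard library. The file name `healthcheck.py`, the function names, and the `/health` URL on port 8080 are assumptions that mirror the Dockerfile, not part of the original service.

```python
import urllib.error
import urllib.request


def is_healthy(url: str, timeout: float = 5.0) -> bool:
    """Return True if the endpoint answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or a non-2xx HTTP status
        return False


def main() -> int:
    # Docker treats exit code 0 as healthy and 1 as unhealthy
    return 0 if is_healthy("http://localhost:8080/health") else 1
```

Saved as a hypothetical `src/healthcheck.py` that ends with `raise SystemExit(main())`, it could be wired in with `HEALTHCHECK ... CMD ["python3", "/app/src/healthcheck.py"]`. Keeping the probe free of third-party imports means it keeps working even if the dependency set changes.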

Technique 2: Right-Sizing the Base Image

# For inference (no training, no GPU):
# Start here - 130MB before your packages
FROM python:3.11-slim

# For inference WITH GPU:
# runtime only - no nvcc, no development headers (~2.5GB vs ~4.5GB for devel)
FROM nvidia/cuda:12.1.0-cudnn8-runtime-ubuntu22.04

# For training (needs nvcc to compile custom CUDA ops):
# devel only for training images, never for inference
FROM nvidia/cuda:12.1.0-cudnn8-devel-ubuntu22.04

# For inference with minimal security surface (Python-only services):
# distroless - no shell, no package manager, just Python runtime
# See Technique 4 below
FROM gcr.io/distroless/python3-debian12

Technique 3: BuildKit Cache Mounts

BuildKit (enabled by default in Docker >= 23) supports --mount=type=cache, which keeps a directory such as pip's download cache on the build host between builds. This speeds up builds without affecting image size, because the cache directory is never written into an image layer:

# syntax=docker/dockerfile:1.6
FROM python:3.11-slim AS base

# Enable BuildKit cache mount for pip
# The cache persists between docker build runs on the same host
# This turns a 3-minute pip install into a 20-second cache restore
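RUN --mount=type=cache,target=/root/.cache/pip \
    pip install torch==2.2.0 torchvision scikit-learn

# For apt packages:
# Debian/Ubuntu-based images ship /etc/apt/apt.conf.d/docker-clean, which deletes
# downloaded packages after install; remove it so the cache mount keeps them
RUN rm -f /etc/apt/apt.conf.d/docker-clean
RUN --mount=type=cache,target=/var/cache/apt,sharing=locked \
    --mount=type=cache,target=/var/lib/apt,sharing=locked \
    apt-get update && apt-get install -y --no-install-recommends \
        libgomp1 libglib2.0-0

# On the build host (shell, not part of the Dockerfile):
# Ensure BuildKit is enabled (default in Docker >= 23)
export DOCKER_BUILDKIT=1

# Or use docker buildx (always uses BuildKit)
docker buildx build -t my-ml-image:latest .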

Technique 4: Distroless Images

Distroless images (maintained by Google) contain only the application runtime and its dependencies: no shell, no package manager, no Linux utilities. They are significantly smaller and have a much smaller attack surface than full Debian or Ubuntu base images.

# Multi-stage: build in Python slim, copy to distroless
FROM python:3.11-slim AS builder

WORKDIR /app
RUN python -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY src/ ./src/


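# Distroless runtime - no shell, no apt, no curl
# This means: no exec into container, no debugging tools
# Trade-off: security vs debuggability
# The :nonroot tag runs as uid 65532; the default tag runs as root
FROM gcr.io/distroless/python3-debian12:nonroot AS runtime

COPY --from=builder /opt/venv /opt/venv
COPY --from=builder /app/src /app/src

# The venv's bin/python symlink targets the builder's interpreter
# (/usr/local/bin/python3.11 in python:3.11-slim), which does not exist here.
# Point the image's own Python at the venv's site-packages instead.
# Assumes both stages use Python 3.11 (Debian 12 ships 3.11).
ENV PYTHONPATH="/app:/opt/venv/lib/python3.11/site-packages"

WORKDIR /app

EXPOSE 8080

# In distroless, use exec form (no shell available to interpret string form).
# ENTRYPOINT is set explicitly because the base image already defines
# the interpreter as its entrypoint, so a CMD would be passed to it as arguments.
ENTRYPOINT ["/usr/bin/python3", "-m", "src.serve"]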

:::note Distroless Trade-offs Distroless images cannot be exec'd into for debugging (docker exec -it fails because there is no shell), unless you deliberately run one of the :debug tag variants, which add a BusyBox shell. For production inference services where all debugging goes through logs and metrics, this is acceptable. For training images or any service that might need interactive debugging, stick with Debian slim. :::

Technique 5: Package Hygiene

Small but meaningful size savings from careful package management:

# Bad: installs recommended packages and leaves apt lists
RUN apt-get update && apt-get install -y curl

# Good: no-install-recommends + clean up apt lists in same RUN (same layer)
RUN apt-get update && apt-get install -y --no-install-recommends \
        curl \
    && rm -rf /var/lib/apt/lists/*

# Critical: the install and the cleanup MUST be in the same RUN instruction
# If cleanup is in a separate RUN, the lists are deleted in a new layer
# but the data still exists in the previous layer - no size savings

# Bad: unpinned install that pulls in the GUI build of OpenCV
RUN pip install torch torchvision opencv-python

# Good: install only what you need, headless versions where available
# (opencv-python-headless has no Qt/X11 GUI dependencies)
RUN pip install --no-cache-dir \
        torch==2.2.0 \
        torchvision==0.17.0 \
        opencv-python-headless==4.9.0.80

Image Scanning with Trivy

After building an optimized image, scan it for known CVEs:

# Install Trivy
brew install trivy # macOS
# or: curl -sfL https://raw.githubusercontent.com/aquasecurity/trivy/main/contrib/install.sh | sh

# Scan image
trivy image my-ml-image:latest

# Output example:
# my-ml-image:latest (ubuntu 22.04)
# CRITICAL: 2
# libssl3 (CVE-2024-0727) → fix available in 3.0.13-0ubuntu0.22.04.1
# python3.11 (CVE-2023-6597) → fix available in 3.11.8-1~22.04
# HIGH: 5
# ...

# Fail CI if CRITICAL or HIGH vulnerabilities found
trivy image --exit-code 1 --severity CRITICAL,HIGH my-ml-image:latest

# Generate SBOM (Software Bill of Materials) for compliance
trivy image --format spdx-json --output sbom.json my-ml-image:latest

# .github/workflows/security-scan.yml (steps excerpt)
- name: Build image
  run: docker build -t my-ml-image:${{ github.sha }} .

- name: Scan with Trivy
  uses: aquasecurity/trivy-action@master
  with:
    image-ref: my-ml-image:${{ github.sha }}
    format: table
    exit-code: '1'
    ignore-unfixed: true
    severity: CRITICAL,HIGH
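
If you want a custom gate or a summary in CI logs, Trivy can also emit JSON (`trivy image --format json`). The sketch below counts findings by severity from such a report. The `Results`, `Vulnerabilities`, and `Severity` keys follow Trivy's JSON report layout as I understand it. The sample report is a hand-written excerpt reusing the two CVEs from the example output above, not real scan output, and the function names are hypothetical.

```python
from collections import Counter


def count_by_severity(report: dict) -> Counter:
    """Count vulnerabilities per severity across all scan targets in a report."""
    counts: Counter = Counter()
    for result in report.get("Results") or []:
        # Trivy omits the "Vulnerabilities" key for targets with no findings
        for vuln in result.get("Vulnerabilities") or []:
            counts[vuln.get("Severity", "UNKNOWN")] += 1
    return counts


def should_fail(counts: Counter, blocking=("CRITICAL", "HIGH")) -> bool:
    """True if any blocking severity has at least one finding."""
    return any(counts[level] > 0 for level in blocking)


# Hand-written sample shaped like a Trivy JSON report (illustrative only)
sample_report = {
    "Results": [
        {
            "Target": "my-ml-image:latest (ubuntu 22.04)",
            "Vulnerabilities": [
                {"VulnerabilityID": "CVE-2024-0727", "PkgName": "libssl3", "Severity": "CRITICAL"},
                {"VulnerabilityID": "CVE-2023-6597", "PkgName": "python3.11", "Severity": "CRITICAL"},
            ],
        },
        {"Target": "Python"},
    ]
}

counts = count_by_severity(sample_report)
print(dict(counts), "fail build:", should_fail(counts))
```

In a pipeline you would load the real report with `json.load` from the file Trivy wrote and exit non-zero when `should_fail` returns True. For most teams the built-in `--exit-code` flag is enough; a script like this is only worth it when you need custom reporting.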

Registry-Side Caching

Use registry caching to speed up CI/CD builds:

# syntax=docker/dockerfile:1.6
FROM python:3.11-slim

# Build with registry cache (GitHub Container Registry example)
docker buildx build \
    --cache-from type=registry,ref=ghcr.io/myorg/ml-service:cache \
    --cache-to type=registry,ref=ghcr.io/myorg/ml-service:cache,mode=max \
    --platform linux/amd64 \
    -t ghcr.io/myorg/ml-service:latest \
    --push \
    .

# This stores build cache in the registry itself
# Next build: cache layers are pulled from registry before building
# Effective even across different CI runners (unlike local BuildKit cache)

Measuring Results

# Check final image size
docker images my-ml-image --format "table {{.Repository}}\t{{.Tag}}\t{{.Size}}"

# Inspect layers and their sizes
docker history my-ml-image:latest

# Use dive for detailed layer analysis
# brew install dive (macOS)
dive my-ml-image:latest

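# Measure container start + import time for an image that is already local
# (to include the pull, remove the local copy first and run a registry-qualified reference)
time docker run --rm my-ml-image:latest python -c "import torch; print(torch.__version__)"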
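
To keep the image from growing back, the size check can be turned into a CI gate. This is a minimal sketch. It assumes the Docker CLI is on the PATH and relies on `docker image inspect --format '{{.Size}}'` printing the uncompressed size in bytes. The function names and the 1.5 GB budget are illustrative choices.

```python
import subprocess


def image_size_bytes(image: str) -> int:
    """Ask the local Docker daemon for the image's uncompressed size in bytes."""
    output = subprocess.run(
        ["docker", "image", "inspect", "--format", "{{.Size}}", image],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return int(output.strip())


def within_budget(size_bytes: int, budget_gb: float) -> bool:
    """Compare a size in bytes against a budget in decimal gigabytes."""
    return size_bytes <= budget_gb * 1e9


# The sizes from this lesson against a hypothetical 1.5 GB budget
print(within_budget(1_180_000_000, 1.5))
print(within_budget(8_400_000_000, 1.5))
```

In CI, something like `within_budget(image_size_bytes("my-ml-image:latest"), 1.5)` after the build step would flag a regression long before it shows up as slow pod startup.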

Results Comparison

| Optimization Applied | Image Size | Build Time (no cache) | Cold Start |
| --- | --- | --- | --- |
| Naive single-stage | 8.4 GB | 8 min | 14 min |
| + Slim base | 5.1 GB | 6 min | 9 min |
| + Multi-stage | 2.3 GB | 7 min | 4 min |
| + headless packages | 1.8 GB | 6 min | 3.5 min |
| + Registry caching | 1.8 GB | 45 sec (cached) | 3.5 min |
| + BuildKit cache mounts | 1.8 GB | 20 sec (cached) | 3.5 min |
| Final result | 1.18 GB | 20 sec | 87 sec |
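
As a quick sanity check on these numbers, the following sketch recomputes the overall ratios and finds which step saved the most space. The sizes and times are copied from the results above (caching rows are omitted because they do not change image size). Nothing here is measured independently.

```python
# (label, image size in GB) for the steps that change image size
steps = [
    ("Naive single-stage", 8.4),
    ("+ Slim base", 5.1),
    ("+ Multi-stage", 2.3),
    ("+ headless packages", 1.8),
    ("Final result", 1.18),
]

# Space saved by each step relative to the one before it
savings = [(label, prev - size) for (_, prev), (label, size) in zip(steps, steps[1:])]
biggest_step, biggest_saving = max(savings, key=lambda item: item[1])

size_ratio = steps[0][1] / steps[-1][1]
cold_start_ratio = (14 * 60) / 87  # 14 minutes down to 87 seconds

print(f"image size: {size_ratio:.1f}x smaller")
print(f"cold start: {cold_start_ratio:.1f}x faster")
print(f"largest single saving: {biggest_step} ({biggest_saving:.1f} GB)")
```

The base image swap and the multi-stage split account for most of the reduction, which matches the order in which the techniques are presented.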

Production Notes

Pin base image digests in production: Use the image digest (SHA256 hash) rather than the tag in production Dockerfiles. Tags are mutable and can change without warning when security patches are applied. Digest pinning: FROM python:3.11-slim@sha256:abc123.... Update deliberately when you want to take base image updates.
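
Digest pinning is easy to forget, so a small lint in CI can help. The sketch below scans Dockerfile text for FROM lines that lack a `@sha256:` digest, skipping `scratch` and references to earlier build stages. The function name is hypothetical and the digest in the sample is an obvious placeholder.

```python
import re

# Matches: FROM [--platform=...] <image> [AS <alias>]
FROM_LINE = re.compile(
    r"^\s*FROM\s+(?:--platform=\S+\s+)?(?P<image>\S+)(?:\s+AS\s+(?P<alias>\S+))?",
    re.IGNORECASE,
)


def unpinned_base_images(dockerfile_text: str) -> list[str]:
    """Return base images referenced by tag only (no @sha256 digest)."""
    stage_names: set[str] = set()
    unpinned: list[str] = []
    for line in dockerfile_text.splitlines():
        match = FROM_LINE.match(line)
        if not match:
            continue
        image = match.group("image")
        is_stage_ref = image.lower() in stage_names  # e.g. "FROM builder AS test"
        if image != "scratch" and not is_stage_ref and "@sha256:" not in image:
            unpinned.append(image)
        if match.group("alias"):
            stage_names.add(match.group("alias").lower())
    return unpinned


placeholder_digest = "0" * 64  # stand-in, not a real digest
sample = f"""
FROM python:3.11-slim AS builder
FROM gcr.io/distroless/python3-debian12@sha256:{placeholder_digest} AS runtime
FROM builder AS test
"""
print(unpinned_base_images(sample))
```

A check like this only reports tag-only references; deciding when to bump a pinned digest remains a deliberate, reviewed change.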

Separate training and inference images: Training images can be large (include CUDA devel tools, Jupyter, debugging libraries). Inference images must be small and fast. Never ship a training image to production inference. Maintain two Dockerfiles: Dockerfile.training and Dockerfile.inference.

:::tip Benchmark Before Optimizing Before investing time in image optimization, measure what actually matters for your system: is it cold start time, build time, or registry cost? If your pods never cold start (they run 24/7), image size matters less. If you have bursty traffic that requires autoscaling from zero, cold start time is critical. Optimize for your actual bottleneck. :::

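:::warning Multi-Stage Build and Layer Cache In multi-stage builds, every instruction in every stage is cached as its own layer. If a builder-stage input changes (e.g., requirements.txt), the builder re-runs from that instruction onward, and the runtime stage's COPY --from=builder and everything after it re-run as well. If only application code changed, the builder's dependency layers are reused and only the runtime stage's COPY src/ and later instructions rebuild. --target runtime stops the build at the runtime stage; with BuildKit, stages that runtime does not depend on (for example, a test stage) are skipped, but the builder still runs, from cache where possible. :::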

:::danger Testing Distroless Images Distroless images have no shell, so you cannot docker exec -it container bash for debugging. Before adopting distroless for a service, set up comprehensive logging and metrics - your only debugging tools are container logs and distributed tracing. For teams not yet set up for that level of observability, Debian slim with a non-root user is a better choice. :::

Interview Q&A

Q: What is a multi-stage Docker build and why is it important for ML images?

A multi-stage build uses multiple FROM instructions in a single Dockerfile. Each FROM starts a new stage. You can copy artifacts from earlier stages into later stages with COPY --from=<stage>. For ML, the pattern is: use a large "devel" image (with CUDA compiler, build tools, full Python) to install dependencies and compile any extensions, then copy only the installed packages and application code into a small "runtime" image (without build tools). The final image contains only what's needed to run, not what was needed to build. This is the single most impactful image size optimization - commonly reducing images from 8GB to under 2GB.

Q: How do BuildKit cache mounts work and how do they differ from layer caching?

Layer caching caches the output of a Dockerfile instruction (the filesystem state after the instruction runs). If the instruction changes, the cache is invalidated. BuildKit cache mounts (--mount=type=cache) provide a persistent cache directory that survives between builds but is NOT part of the image. For pip installs, this means pip's package download cache persists between builds - the second time you install a package, it comes from the local cache instead of the internet. This speeds up builds without adding to image size. Cache mounts and layer caching are complementary, not alternatives.

Q: How do you choose between runtime and devel CUDA base images?

The devel CUDA image includes the CUDA compiler (nvcc) and development headers needed to compile custom CUDA extensions. The runtime image only includes the libraries needed to execute already-compiled CUDA code. For inference services (which run pre-compiled PyTorch operations), the runtime image is correct: it is roughly 2GB smaller than devel (~2.5GB versus ~4.5GB). For training images that compile custom CUDA kernels (unusual outside of research), devel is required. Most ML teams need devel only for training images, never for inference images.

Q: What does Trivy scan for and when should you run it?

Trivy scans Docker images for CVEs (Common Vulnerabilities and Exposures) in OS packages (using the distribution's package database and security advisories) and in language packages such as Python dependencies (using sources such as the GitHub Advisory Database). It reports vulnerabilities by severity: CRITICAL, HIGH, MEDIUM, LOW. Run it as part of your CI/CD pipeline after every image build, before pushing to the registry. Configure it to fail the pipeline on CRITICAL or HIGH vulnerabilities. Trivy can also generate an SBOM (Software Bill of Materials) for compliance with regulations that require a full inventory of components in production software.

Q: How do you reduce the size of PyTorch-based ML images?

Several steps: (1) Install the PyTorch build that matches your deployment target. For CPU-only inference, the CPU wheels from download.pytorch.org/whl/cpu are far smaller than the CUDA builds; for GPU inference, pin the build for the CUDA version you deploy (the +cu121 suffix from download.pytorch.org/whl/cu121). (2) Install torchvision and torchaudio only if the service actually imports them. (3) If you export the model to ONNX and serve it with ONNX Runtime, the inference image may not need torch at all. (4) Use the --no-cache-dir flag with pip to avoid storing wheel files in the image. (5) Use multi-stage builds so that build tools and pip caches stay in the builder stage and only the installed packages reach the runtime image. Combined, these can reduce a PyTorch inference image from 4GB+ to under 2GB.

© 2026 EngineersOfAI. All rights reserved.