Module 06: Containerization
Why Containers Changed ML Engineering
"Works on my machine" is the original sin of software development. In ML, the problem is worse: the code has to match, and so do the CUDA version, cuDNN version, Python version, library versions, system libraries, and sometimes even the hardware architecture. A data scientist on a MacBook M2, a training cluster running Ubuntu 22.04 with CUDA 12.1, and a production inference server running RHEL 8 with CUDA 11.8 are three completely different environments. Without containers, shipping a model from development to production is an exercise in environment archaeology.
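To make the mismatch concrete, the following is a minimal sketch of an "environment fingerprint" that could be run on each machine and compared. The function names `environment_fingerprint` and `diff_fingerprints` and the default package list are illustrative assumptions, not part of this module's lessons. The sketch deliberately leaves out CUDA and cuDNN versions, which would have to be queried from the driver or the ML framework.

```python
import json
import platform
from importlib import metadata


def environment_fingerprint(packages=("numpy", "torch")):
    """Collect a few environment details that commonly differ between machines.

    The default package names are only examples; pass whatever your project uses.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed in this environment
    return {
        "python": platform.python_version(),
        "os": platform.system(),
        "arch": platform.machine(),
        "packages": versions,
    }


def diff_fingerprints(a, b):
    """Return the sorted top-level keys whose values differ between two fingerprints."""
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))


if __name__ == "__main__":
    # Print this machine's fingerprint as JSON so it can be saved and compared.
    print(json.dumps(environment_fingerprint(), indent=2))
```

Running the script on a laptop and on a server, then passing the two saved dictionaries to `diff_fingerprints`, shows which of these dimensions have drifted. A container image removes the need for this comparison by pinning the dimensions in one artifact.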
This module covers containers from first principles for ML engineers: why they matter, how to write efficient Dockerfiles, how to build GPU-enabled containers, how to manage images in CI/CD, how to split traffic between model versions with a service mesh, and how to build a complete local ML development environment with Docker Compose that onboards a new team member in 10 minutes instead of 3 days.
Learning Objectives
By the end of this module you will be able to:
- Write efficient ML Dockerfiles with proper layer ordering and caching strategy
- Reduce ML Docker image sizes from gigabytes to hundreds of megabytes using multi-stage builds
- Build and run GPU-enabled containers with proper NVIDIA runtime configuration
- Set up a container registry workflow with security scanning and environment promotion
- Configure Istio service mesh traffic splitting for safe ML model canary deployments
- Build a complete local ML development environment with Docker Compose
Prerequisites
- Basic Linux command line
- Python package management (pip, conda)
- Module 05 (CI/CD for ML) is recommended, since containers are used in those pipelines
Lessons
| # | Lesson | Core Problem Solved |
|---|---|---|
| 01 | Docker for ML | "Works on my machine" ML debugging |
| 02 | Optimizing ML Docker Images | 8 GB image causing 12-minute cold starts |
| 03 | GPU Containers | GPU container ignores GPU in production |
| 04 | Container Registry and CI | Security incident from unscanned image |
| 05 | Service Mesh for ML | 5 models × 3 versions routing chaos |
| 06 | Docker Compose for ML Dev | 3-day onboarding reduced to 10 minutes |
Estimated Time
6–8 hours total.
