Docker Compose for ML Development

Three Days to Set Up a Local Environment

A new ML engineer named Priya joined a 12-person ML platform team on a Monday. The team had been using the same setup for 18 months - everyone's laptop environment had evolved organically, shaped by individual debugging sessions, quick package installs, and the occasional "let me just change this config file." There was a README.md with setup instructions that no one had updated in seven months.

Priya spent three days on setup. Day one: install Conda, create the environment, hit a conflict between PyTorch and the team's custom library, spend 4 hours resolving it. Day two: install MLflow tracking server, discover it requires a specific SQLite version not present on her machine, install that, then discover the team actually uses PostgreSQL as the backend in production so some features work differently locally. Day three: install the local feature store (Redis-backed), configure it to talk to the training service, discover a hardcoded hostname in the training code that assumed a specific local network setup.

By Thursday she was writing code. But she had wasted three days - and more importantly, she had three days of accumulated environmental differences from her teammates that would generate subtle bugs for weeks.

The team lead, after Priya's experience, allocated one sprint to build a Docker Compose-based local environment. The next engineer who joined - six weeks later - was running the full stack with docker compose up in 12 minutes. This lesson documents what they built.

Why This Exists

Docker Compose was created by Orchard (later acquired by Docker Inc.) and released in 2013 as "Fig," renamed to Docker Compose in 2014. It provides a YAML format for defining multi-container applications and a CLI for managing their lifecycle.

For ML teams, Docker Compose solves a specific problem: ML development requires multiple services running simultaneously - a feature store, a tracking server (MLflow), sometimes a local model registry, a monitoring stack, and possibly a local message broker for streaming features. Without Compose, every engineer manages these services manually, leading to "works on my machine" problems at the service-dependency level, not just the package-version level.

The key insight: treating your local ML development environment as infrastructure-as-code, just like your production environment. The docker-compose.yml file is the single source of truth for "what services does this ML project require to run locally."

The Complete ML Development Stack

The Complete docker-compose.yml

# docker-compose.yml - complete ML development environment
# Usage:
#   Core only:     docker compose up
#   With training: docker compose --profile training up
#   With serving:  docker compose --profile serving up
#   Full stack:    docker compose --profile training --profile serving --profile monitoring up

name: ml-platform

services:
  # ─────────────────────────────────────────────────────────────
  # CORE: Feature Store (Redis)
  # ─────────────────────────────────────────────────────────────
  feature-store:
    image: redis:7.2-alpine
    container_name: ml-feature-store
    ports:
      - "6379:6379"
    command: redis-server --save 60 1 --loglevel warning --requirepass "${REDIS_PASSWORD:-devpassword}"
    volumes:
      - redis-data:/data
    healthcheck:
      test: ["CMD", "redis-cli", "-a", "${REDIS_PASSWORD:-devpassword}", "ping"]
      interval: 5s
      timeout: 3s
      retries: 5
      start_period: 10s
    restart: unless-stopped
    networks:
      - ml-net

  # ─────────────────────────────────────────────────────────────
  # CORE: PostgreSQL (MLflow backend)
  # ─────────────────────────────────────────────────────────────
  postgres:
    image: postgres:16-alpine
    container_name: ml-postgres
    environment:
      POSTGRES_USER: mlflow
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD:-mlflowdev}
      POSTGRES_DB: mlflow
    ports:
      - "5432:5432"
    volumes:
      - postgres-data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U mlflow"]
      interval: 5s
      timeout: 3s
      retries: 10
      start_period: 20s
    restart: unless-stopped
    networks:
      - ml-net

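  # ─────────────────────────────────────────────────────────────
  # CORE: MLflow Tracking Server
  # ─────────────────────────────────────────────────────────────
  # Note: the server needs a PostgreSQL driver (for example psycopg2-binary).
  # If the stock image does not include one, build a small derived image
  # that pip-installs it and reference that image here instead.
  mlflow:
    image: ghcr.io/mlflow/mlflow:v2.11.0
    container_name: ml-mlflow
    depends_on:
      postgres:
        condition: service_healthy
    ports:
      - "5000:5000"
    # All server settings are passed as flags below, so they are not
    # duplicated as MLFLOW_* environment variables.
    # --artifacts-destination (rather than --default-artifact-root) makes the
    # server store proxied artifacts in /mlartifacts; clients such as the
    # training container upload over HTTP and do not need the volume mounted.
    command: >
      mlflow server
      --host 0.0.0.0
      --port 5000
      --backend-store-uri postgresql://mlflow:${POSTGRES_PASSWORD:-mlflowdev}@postgres:5432/mlflow
      --artifacts-destination /mlartifacts
      --serve-artifacts
    volumes:
      - mlflow-artifacts:/mlartifacts
    healthcheck:
      # Uses Python instead of curl because the image may not ship curl
      test: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:5000/health', timeout=3)"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 30s
    restart: unless-stopped
    networks:
      - ml-net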

  # ─────────────────────────────────────────────────────────────
  # TRAINING: Training service (only with --profile training)
  # ─────────────────────────────────────────────────────────────
  training:
    profiles: [training]
    build:
      context: .
      dockerfile: Dockerfile.training
      args:
        - PYTHON_VERSION=3.11
    container_name: ml-training
    depends_on:
      feature-store:
        condition: service_healthy
      mlflow:
        condition: service_healthy
    environment:
      # MLflow connection
      MLFLOW_TRACKING_URI: http://mlflow:5000
      # Interpolated so that the override in .env actually takes effect
      MLFLOW_EXPERIMENT_NAME: ${MLFLOW_EXPERIMENT_NAME:-local-dev}
      # Feature store connection
      REDIS_HOST: feature-store
      REDIS_PORT: 6379
      REDIS_PASSWORD: ${REDIS_PASSWORD:-devpassword}
      # Data paths (mounted volumes)
      TRAINING_DATA_PATH: /data/training/train.parquet
      EVAL_DATA_PATH: /data/eval/eval_set_v3.parquet
      # Model output
      MODEL_OUTPUT_DIR: /models/output
      # Development settings
      LOG_LEVEL: DEBUG
      PYTHONPATH: /app
    volumes:
      # Live code reload: src changes are immediately reflected without rebuild
      - ./src:/app/src
      - ./config:/app/config:ro
      # Data and model volumes (local paths for development)
      - ${LOCAL_DATA_PATH:-./data}:/data:ro
      - model-store:/models
    # GPU access: uncomment on a host with an NVIDIA GPU and the
    # NVIDIA Container Toolkit installed. Leave commented out on CPU-only machines.
    # deploy:
    #   resources:
    #     reservations:
    #       devices:
    #         - driver: nvidia
    #           count: 1
    #           capabilities: [gpu]
    networks:
      - ml-net
    stdin_open: true # Keep stdin open for interactive debugging
    tty: true # Allocate pseudo-TTY

  # ─────────────────────────────────────────────────────────────
  # SERVING: Inference server (only with --profile serving)
  # ─────────────────────────────────────────────────────────────
  inference-server:
    profiles: [serving]
    build:
      context: .
      dockerfile: Dockerfile.inference
    container_name: ml-inference
    depends_on:
      feature-store:
        condition: service_healthy
    ports:
      - "8080:8080"
    environment:
      REDIS_HOST: feature-store
      REDIS_PORT: 6379
      REDIS_PASSWORD: ${REDIS_PASSWORD:-devpassword}
      MODEL_PATH: /models/current/model.joblib
      LOG_LEVEL: INFO
      PYTHONPATH: /app
    volumes:
      - ./src:/app/src # Live code reload
      - model-store:/models:ro # Read-only model access
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 10s
      timeout: 5s
      retries: 3
      start_period: 30s
    networks:
      - ml-net

  # ─────────────────────────────────────────────────────────────
  # MONITORING: Prometheus (only with --profile monitoring)
  # ─────────────────────────────────────────────────────────────
  prometheus:
    profiles: [monitoring]
    image: prom/prometheus:v2.51.0
    container_name: ml-prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./config/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus-data:/prometheus
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.path=/prometheus
      - --web.enable-lifecycle # Allows hot-reload via curl -X POST /-/reload
    networks:
      - ml-net

  # ─────────────────────────────────────────────────────────────
  # MONITORING: Grafana (only with --profile monitoring)
  # ─────────────────────────────────────────────────────────────
  grafana:
    profiles: [monitoring]
    image: grafana/grafana:10.2.0
    container_name: ml-grafana
    depends_on:
      - prometheus
    ports:
      - "3000:3000"
    environment:
      GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_PASSWORD:-admin}
      GF_USERS_ALLOW_SIGN_UP: "false"
    volumes:
      - grafana-data:/var/lib/grafana
      - ./config/grafana/dashboards:/var/lib/grafana/dashboards:ro
      - ./config/grafana/provisioning:/etc/grafana/provisioning:ro
    networks:
      - ml-net

# ─────────────────────────────────────────────────────────────
# Named volumes for data persistence
# ─────────────────────────────────────────────────────────────
volumes:
  redis-data:
  postgres-data:
  mlflow-artifacts:
  model-store:
  prometheus-data:
  grafana-data:

# ─────────────────────────────────────────────────────────────
# Shared network
# ─────────────────────────────────────────────────────────────
networks:
  ml-net:
    driver: bridge

Environment Variable Management

Use a .env file for local configuration, and commit .env.example as the template:

# .env.example - commit this to the repo
# Copy to .env and customize locally (never commit .env)

# Database passwords
POSTGRES_PASSWORD=mlflowdev
REDIS_PASSWORD=devpassword

# Grafana
GRAFANA_PASSWORD=admin

# Local data path (absolute, or relative to the directory containing docker-compose.yml)
LOCAL_DATA_PATH=./data

# MLflow experiment name (optional override)
MLFLOW_EXPERIMENT_NAME=local-dev

# GPU settings (uncomment if using GPU)
# NVIDIA_VISIBLE_DEVICES=0

# .gitignore - ensure .env is never committed
.env
.env.local
.env.*.local
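To make the `${VAR:-default}` syntax used throughout the compose file concrete, here is a minimal Python sketch of how such placeholders resolve. It is a simplified emulation written for this lesson, not Compose's actual implementation: the `interpolate` function is a hypothetical helper, and it handles only the `${NAME}` and `${NAME:-default}` forms.

```python
import re
from typing import Mapping

# Matches ${NAME} and ${NAME:-default}. Other Compose forms such as
# ${NAME-default} or ${NAME:?error} are deliberately left out of this sketch.
_PLACEHOLDER = re.compile(
    r"\$\{(?P<name>[A-Za-z_][A-Za-z0-9_]*)(?::-(?P<default>[^}]*))?\}"
)


def interpolate(template: str, env: Mapping[str, str]) -> str:
    """Resolve ${NAME} and ${NAME:-default} placeholders against env."""

    def replace(match: re.Match) -> str:
        value = env.get(match.group("name"), "")
        if value:
            # Variable is set and non-empty: its value wins
            return value
        # Unset or empty: use the default, or an empty string if none was given
        return match.group("default") or ""

    return _PLACEHOLDER.sub(replace, template)


if __name__ == "__main__":
    uri = "postgresql://mlflow:${POSTGRES_PASSWORD:-mlflowdev}@postgres:5432/mlflow"
    print(interpolate(uri, {}))
    print(interpolate(uri, {"POSTGRES_PASSWORD": "from-dotenv"}))
```

The practical consequence: a fresh clone with no .env file still starts, because every placeholder carries a development default, while a value in .env overrides it.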

GPU Access in Docker Compose

# For GPU access in the training service
# Requires: NVIDIA Container Toolkit installed on host

training:
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: 1 # "all" or specific number
            capabilities: [gpu] # Required: enables NVIDIA runtime

# Or with specific GPU selection:
training:
  environment:
    - CUDA_VISIBLE_DEVICES=0 # Use only GPU 0
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            device_ids: ["0"] # Specific GPU by index
            capabilities: [gpu]

# Verify GPU is accessible in the training container
docker compose --profile training run --rm training python -c "
import torch
print(f'CUDA available: {torch.cuda.is_available()}')
print(f'GPU count: {torch.cuda.device_count()}')
if torch.cuda.is_available():
    print(f'GPU: {torch.cuda.get_device_name(0)}')
"

Common Development Workflows

# ─────────────────────────────────────────────────────────────
# Daily workflow commands
# ─────────────────────────────────────────────────────────────

# Start core services (MLflow, Redis, Postgres)
docker compose up -d

# Check service health
docker compose ps

# Follow logs from a specific service
docker compose logs -f mlflow

# Run a one-off training job
docker compose --profile training run --rm training \
  python -m src.training.train --config config/training.yaml

# Start interactive Python session in training container
# (with all services available, src/ mounted)
docker compose --profile training run --rm -it training python

# Run tests inside the container (same environment as training)
docker compose --profile training run --rm training \
  pytest tests/ -v --tb=short

# Start inference server for local testing
docker compose --profile serving up inference-server

# Curl the local inference server
curl -X POST http://localhost:8080/predict \
  -H "Content-Type: application/json" \
  -d '{"transaction_amount": 150.0, "merchant_category": "food"}'

# Rebuild a specific service after Dockerfile change
docker compose build training

# Reset everything (removes volumes - data is lost)
docker compose down --volumes

# Pull the image tags pinned in docker-compose.yml (picks up version bumps)
docker compose pull
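The curl call above can also be issued from a test or notebook. The following is a minimal standard-library sketch of the same request. The `/predict` path and the payload fields come from the curl example; the helper names `build_predict_request` and `predict`, and the assumption that the server answers with a JSON object, are illustrative and not part of the team's code.

```python
import json
import urllib.request


def build_predict_request(base_url: str, features: dict) -> urllib.request.Request:
    """Build the POST request that the curl example sends to /predict."""
    body = json.dumps(features).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(base_url: str, features: dict, timeout: float = 5.0) -> dict:
    """Send the request and decode the response (assumed to be JSON)."""
    request = build_predict_request(base_url, features)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Requires the inference server from the serving profile to be running
    payload = {"transaction_amount": 150.0, "merchant_category": "food"}
    print(predict("http://localhost:8080", payload))
```

Splitting request construction from sending keeps the first half testable without a running server.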

The Prometheus Configuration

# config/prometheus.yml - scrape ML services
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  # Scrape inference server metrics
  - job_name: ml-inference
    static_configs:
      - targets: ["inference-server:8080"]
    metrics_path: /metrics
    scrape_interval: 5s

  # Scrape training job metrics (when running)
  - job_name: ml-training
    static_configs:
      - targets: ["training:9091"] # Assumes the training process serves metrics on 9091
    scrape_interval: 30s

  # MLflow metrics
  # Note: the MLflow server only serves /metrics when it is started with
  # the --expose-prometheus option; without it this target will report as down.
  - job_name: mlflow
    static_configs:
      - targets: ["mlflow:5000"]
    metrics_path: /metrics

Health Checks and Service Dependencies

The depends_on with condition: service_healthy is critical for ML stacks where services must start in order:

# This ensures training does NOT start until MLflow is ready
# Without this: training script fails with "connection refused" on MLflow
training:
  depends_on:
    feature-store:
      condition: service_healthy # Redis health check must pass
    mlflow:
      condition: service_healthy # MLflow health check must pass
    postgres:
      condition: service_healthy # Postgres must be ready (MLflow depends on it)

# src/utils/wait_for_services.py
# Alternative: implement wait-for logic in application startup
import logging
import time

import redis
import requests

logger = logging.getLogger(__name__)


def wait_for_redis(host: str, port: int, password: str, timeout: int = 60) -> None:
    """Wait for Redis to be available. Raises TimeoutError if not ready in time."""
    # time.monotonic() is unaffected by system clock adjustments
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            r = redis.Redis(host=host, port=port, password=password)
            r.ping()
            logger.info(f"Redis at {host}:{port} is ready")
            return
        except redis.exceptions.RedisError:
            # Connection refused, timeout, or server still loading: retry
            time.sleep(2)
    raise TimeoutError(f"Redis at {host}:{port} not ready after {timeout}s")


def wait_for_mlflow(tracking_uri: str, timeout: int = 120) -> None:
    """Wait for MLflow server to be available. Raises TimeoutError if not ready in time."""
    health_url = f"{tracking_uri.rstrip('/')}/health"
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            resp = requests.get(health_url, timeout=5)
            if resp.status_code == 200:
                logger.info(f"MLflow at {tracking_uri} is ready")
                return
        except requests.exceptions.RequestException:
            # Covers refused connections and read timeouts alike
            pass
        time.sleep(3)
    raise TimeoutError(f"MLflow at {tracking_uri} not ready after {timeout}s")

Onboarding Script

#!/bin/bash
# scripts/dev-setup.sh - run this once after cloning the repo
# Sets up the complete local ML development environment

set -euo pipefail

echo "=== ML Platform Development Setup ==="

# 1. Check prerequisites
# The rest of this script uses the "docker compose" (v2) plugin syntax,
# so the legacy docker-compose binary alone is not sufficient.
echo "Checking prerequisites..."
command -v docker >/dev/null 2>&1 || { echo "Error: Docker not installed"; exit 1; }
docker compose version >/dev/null 2>&1 || {
  echo "Error: Docker Compose v2 plugin not installed"
  exit 1
}

# 2. Create .env from template if it doesn't exist
if [ ! -f .env ]; then
  cp .env.example .env
  echo "Created .env from .env.example - review and customize if needed"
fi

# 3. Create local data directory structure
mkdir -p data/training data/eval models/output

# 4. Build custom images
# The services with a build: section sit behind profiles, so the profiles
# must be enabled here or nothing gets built.
echo "Building Docker images..."
docker compose --profile training --profile serving build

# 5. Start core services and block until their health checks pass
# --wait fails if a service becomes unhealthy or the timeout expires,
# unlike grepping "docker compose ps" output, which also matches "unhealthy".
echo "Starting core services (MLflow, Redis, PostgreSQL)..."
docker compose up -d --wait --wait-timeout 120

# 6. Verify setup
echo ""
echo "=== Setup Complete ==="
echo ""
echo "Services available:"
echo "  MLflow UI:  http://localhost:5000"
echo "  Redis:      localhost:6379"
echo "  PostgreSQL: localhost:5432"
echo ""
echo "Quick commands:"
echo "  Train a model:   docker compose --profile training run --rm training python -m src.training.train"
echo "  Start inference: docker compose --profile serving up"
echo "  View logs:       docker compose logs -f <service>"
echo "  Stop all:        docker compose down"
echo ""
echo "Total setup time from this script: under 15 minutes on first run (image download)"
echo "Subsequent runs: under 30 seconds"

Production Notes

Volume mounts for live code reload: The ./src:/app/src volume mount in the training service means Python files edited locally are immediately reflected inside the container - no rebuild needed. This is the key to a productive development workflow. Make sure the PYTHONPATH in the container is set correctly so that from src.features.engineering import ... works with the mounted path.

Named volumes vs bind mounts: Use named volumes for data that should persist (Postgres data, MLflow artifacts, trained models). Use bind mounts (local path → container path) for code directories where you want live reload. Avoid bind mounts for large data files where you can (they are slow, especially on macOS due to filesystem virtualization): copy the data into volumes or access it via S3. The ${LOCAL_DATA_PATH:-./data} bind mount in the training service is a convenience suited to small development samples.

macOS performance: Docker on macOS runs Linux inside a VM, and bind mounts (your code directory → container) are significantly slower than on Linux due to the VM/FUSE overhead. For code that is frequently reloaded, this is acceptable. For data access, mount data into named volumes (copy once) rather than bind-mounting large data directories.

:::tip Use docker compose watch for Hot Reload
Docker Compose v2.22+ introduced docker compose watch, which automatically syncs file changes into running containers without volume mounts:

training:
  develop:
    watch:
      - action: sync
        path: ./src
        target: /app/src
      - action: rebuild
        path: requirements.txt # Rebuild image when deps change

This is typically more efficient than bind mounts on macOS and avoids permission issues. If you switch to watch, remove the ./src bind mount from the service so the two mechanisms do not overlap.
:::

:::warning .dockerignore Is Critical for Compose Builds
When Docker Compose builds images, it uses the configured build context (here context: ., the repository root). Without a proper .dockerignore, it will send your entire repo - including large data files, git history, and virtual environments - to the Docker daemon on every build. This can add minutes to each build. Always have a .dockerignore that excludes data/, models/, .git/, and venv/.
:::
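To get a feel for how much a .dockerignore saves, the following rough sketch sums file sizes under a directory with and without a set of excluded top-level folders. It is an approximation under stated assumptions: `context_size_bytes` is a hypothetical helper, it matches exact top-level directory names only, and it does not implement real .dockerignore pattern semantics.

```python
import os
from pathlib import Path
from typing import Collection, Union


def context_size_bytes(
    root: Union[str, Path], excluded: Collection[str] = ()
) -> int:
    """Sum file sizes under root, skipping the named top-level directories."""
    total = 0
    for index, (dirpath, dirnames, filenames) in enumerate(os.walk(root)):
        if index == 0:
            # Prune excluded directories in place so os.walk never descends into them
            dirnames[:] = [name for name in dirnames if name not in excluded]
        for filename in filenames:
            path = Path(dirpath) / filename
            if not path.is_symlink():
                total += path.stat().st_size
    return total


if __name__ == "__main__":
    everything = context_size_bytes(".")
    trimmed = context_size_bytes(".", excluded={"data", "models", ".git", "venv"})
    print(f"Full directory:  {everything / 1e6:.1f} MB")
    print(f"With exclusions: {trimmed / 1e6:.1f} MB")
```

Running it from the repository root before writing the .dockerignore gives a quick estimate of what each build would otherwise transfer.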

:::danger Hardcoded Hostnames in ML Code
The most common source of "works in Docker Compose but not in production" bugs is hardcoded hostnames. If your code has redis.Redis(host="localhost"), it works when Redis runs on the same machine but fails when Redis is in a Docker network. Use environment variables for all service hostnames: REDIS_HOST=feature-store in Compose, REDIS_HOST=redis.ml-platform.svc.cluster.local in Kubernetes. Then in code: os.environ.get("REDIS_HOST", "localhost").
:::
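One way to keep hostnames out of the code is to read all connection settings in a single place. The sketch below assumes the REDIS_HOST, REDIS_PORT, and REDIS_PASSWORD variables from the compose file; the `RedisSettings` class itself is a hypothetical illustration, not the team's actual code.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class RedisSettings:
    """Connection settings for the feature store, resolved from the environment."""

    host: str
    port: int
    password: Optional[str]

    @classmethod
    def from_env(cls, env: Optional[Mapping[str, str]] = None) -> "RedisSettings":
        # Accepting an explicit mapping keeps the function easy to unit test
        env = os.environ if env is None else env
        return cls(
            host=env.get("REDIS_HOST", "localhost"),
            port=int(env.get("REDIS_PORT", "6379")),
            # Treat an empty password the same as "no password configured"
            password=env.get("REDIS_PASSWORD") or None,
        )


if __name__ == "__main__":
    settings = RedisSettings.from_env()
    print(f"Feature store at {settings.host}:{settings.port}")
```

The same code then runs unchanged on a laptop (defaults), in Compose (REDIS_HOST=feature-store), and in Kubernetes (the cluster DNS name).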

Interview Q&A

Q: Why use Docker Compose for local ML development instead of just installing everything locally?

Local installation accumulates version drift between team members. One engineer installs MLflow 2.8, another has 2.11 - different behavior for the same code. With Docker Compose, the docker-compose.yml file specifies exact service versions and is committed to the repo. Every developer gets the same Redis version, the same MLflow version, the same PostgreSQL version, with the same configuration. Setup reduces from "follow the README and debug for 3 days" to docker compose up. When a dependency version needs updating, one commit to docker-compose.yml updates everyone's environment on next docker compose pull.
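Version pinning only helps if every image actually carries a fixed tag. As a small illustration, the check below flags image references that have no tag or use `latest`. The `is_pinned` and `unpinned_images` helpers are hypothetical and the tag parsing is deliberately simple; treat it as a sketch of the idea rather than a complete image-reference parser.

```python
from typing import Dict, List


def is_pinned(image: str) -> bool:
    """Return True if the image reference has a digest or a tag other than 'latest'."""
    if "@sha256:" in image:
        return True
    # Only a colon in the last path segment is a tag separator;
    # a colon earlier in the reference belongs to a registry port.
    last_segment = image.rsplit("/", 1)[-1]
    if ":" not in last_segment:
        return False
    tag = last_segment.rsplit(":", 1)[1]
    return tag not in ("", "latest")


def unpinned_images(images: Dict[str, str]) -> List[str]:
    """Return the names of services whose image is not pinned, sorted."""
    return sorted(name for name, image in images.items() if not is_pinned(image))


if __name__ == "__main__":
    stack = {
        "feature-store": "redis:7.2-alpine",
        "postgres": "postgres:16-alpine",
        "mlflow": "ghcr.io/mlflow/mlflow:v2.11.0",
        "prometheus": "prom/prometheus:v2.51.0",
        "grafana": "grafana/grafana:10.2.0",
    }
    print(unpinned_images(stack))
```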

Q: How do Docker Compose profiles work and how would you use them for an ML stack?

Docker Compose profiles let you define groups of services that start together. Services without a profiles: field always start. Services with profiles only start when that profile is explicitly requested. For ML, use profiles to separate: core services (always on: MLflow, Redis, Postgres), training (only when actively training: the training container), serving (only for inference testing: the inference server), and monitoring (only for observability work: Prometheus, Grafana). Start with docker compose --profile training up for training sessions, docker compose up for just the infrastructure during code writing.
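The selection rule can be modeled in a few lines. This is a simplified sketch of the behavior described above, using the service names from this lesson's compose file; `started_services` is a hypothetical helper, and it ignores cases such as naming a service explicitly on the command line.

```python
from typing import Dict, Iterable, List, Set

# Service name -> profiles it belongs to (empty set means "always starts")
SERVICE_PROFILES: Dict[str, Set[str]] = {
    "feature-store": set(),
    "postgres": set(),
    "mlflow": set(),
    "training": {"training"},
    "inference-server": {"serving"},
    "prometheus": {"monitoring"},
    "grafana": {"monitoring"},
}


def started_services(
    active_profiles: Iterable[str],
    service_profiles: Dict[str, Set[str]] = SERVICE_PROFILES,
) -> List[str]:
    """Return services that start for the given set of active profiles."""
    active = set(active_profiles)
    return [
        name
        for name, profiles in service_profiles.items()
        # No profiles: always on. Otherwise at least one profile must be active.
        if not profiles or profiles & active
    ]


if __name__ == "__main__":
    print("docker compose up ->", started_services([]))
    print("--profile training ->", started_services(["training"]))
```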

Q: How do you give a Docker Compose service access to a GPU?

In Compose, use the deploy.resources.reservations.devices block with driver: nvidia and capabilities: [gpu]. With count: all this is equivalent to docker run --gpus all; count: 1 corresponds to docker run --gpus 1. The NVIDIA Container Toolkit must be installed on the host. For specific GPU selection, use device_ids: ["0"] to target a specific GPU by index (count and device_ids cannot be combined), or set CUDA_VISIBLE_DEVICES as an environment variable. Note: the deploy.resources key was historically only valid for Swarm mode, but Docker Compose now supports GPU device reservations for regular up commands.

Q: How do you handle service startup ordering in Docker Compose for an ML stack?

Use depends_on with condition: service_healthy to enforce startup order. This requires each service to have a healthcheck defined. The pattern for an ML stack: Postgres starts first (no dependencies), MLflow waits for Postgres healthy, training waits for both MLflow and Redis healthy. Without condition: service_healthy, depends_on only ensures the container is started - not that the service inside it is ready, leading to race conditions where training tries to connect to MLflow before its server is accepting connections.
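The ordering that depends_on implies is a topological sort of the dependency graph. The sketch below illustrates this with Python's standard graphlib module (Python 3.9+), using the dependencies from this lesson's compose file. It models ordering only and is not how Compose is implemented; waiting for health checks is a separate step that Compose performs at each edge.

```python
from graphlib import TopologicalSorter
from typing import Dict, List

# Service -> services it depends on (mirrors depends_on in the compose file)
DEPENDS_ON: Dict[str, List[str]] = {
    "feature-store": [],
    "postgres": [],
    "mlflow": ["postgres"],
    "training": ["feature-store", "mlflow"],
}


def startup_order(depends_on: Dict[str, List[str]]) -> List[str]:
    """Return one valid start order in which dependencies come first.

    Raises graphlib.CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(depends_on).static_order())


if __name__ == "__main__":
    print(" -> ".join(startup_order(DEPENDS_ON)))
```

Services with no dependency between them (here feature-store and postgres) may appear in either order, which matches Compose being free to start them in parallel.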

Q: What is the difference between a named volume and a bind mount in Docker Compose and when do you use each?

A bind mount maps a host filesystem path to a container path. Changes on either side are immediately visible on the other - critical for live code reload (./src:/app/src). A named volume is Docker-managed storage, typically faster than bind mounts on macOS (no filesystem virtualization overhead), and persists between container restarts independently of host paths. Use bind mounts for code directories (live reload), use named volumes for databases, ML artifacts, and any data that should persist but does not need to be edited directly from the host.
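In the short volume syntax used in this lesson, the two kinds are told apart by the source part of the entry: a path means a bind mount, a bare name means a named volume. The following sketch encodes that rule for POSIX-style entries. `classify_mount` is a hypothetical helper, and it assumes variable interpolation has already been applied and ignores Windows paths.

```python
def classify_mount(spec: str) -> str:
    """Classify a short-syntax volume entry as 'bind', 'named', or 'anonymous'."""
    if ":" not in spec:
        # Only a container path was given, e.g. "/var/lib/data"
        return "anonymous"
    source = spec.split(":", 1)[0]
    # A source that looks like a path refers to the host filesystem
    if source.startswith((".", "/", "~")):
        return "bind"
    return "named"


if __name__ == "__main__":
    for entry in ("./src:/app/src", "./config:/app/config:ro", "model-store:/models"):
        print(f"{entry:28} -> {classify_mount(entry)}")
```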

© 2026 EngineersOfAI. All rights reserved.