
Docker and Containerized Local Inference

The 3 AM Production Incident

It is 3 AM. A client's on-premise AI assistant has been down for six hours. The engineer on call is staring at a wall of Python dependency errors. The model worked perfectly on the developer's MacBook. It worked in the staging environment. It failed the moment it hit the client's air-gapped Ubuntu 22.04 server with a different CUDA version, a different Python minor version, and a systemd service that someone had configured six months ago and nobody remembers how.

The root cause takes two more hours to find. The developer had installed bitsandbytes from source on their machine against CUDA 11.8. The production server is running CUDA 12.2. The library silently compiled with the wrong backend. The 70B model loads, passes its health check, and then hangs on the first real inference request. By the time the postmortem is written, the team has lost a full business day and the client has lost faith.

This story is not hypothetical. It plays out in some variation across every AI team that skips containerization. LLM inference has a deeper dependency chain than almost any other software workload: Python version, PyTorch version, CUDA toolkit version, cuDNN version, driver version, and the model weights themselves must all align. Miss any one link and you get silent failures, degraded throughput, or hard crashes at the worst possible moment.

Docker does not make this problem disappear. But it compresses the surface of failure from a dozen moving parts to exactly one: the container image. If the image runs on your laptop, it runs on the server. If it ran last Tuesday, it will run next Tuesday. That guarantee is worth more in production AI than almost any other engineering investment you can make.

This lesson is the practical manual for building that guarantee. You will learn how GPU passthrough actually works at the kernel level, how to build minimal and reproducible inference containers, how to manage the awkward problem of 50 GB model weight files inside a containerized world, and how to wire together a full local AI stack using Docker Compose. By the end, you will have Dockerfiles and compose files you can drop directly into a production repo.

Why This Exists - The Dependency Hell That Broke Everything

Before containerization became standard practice for AI workloads, teams managed inference environments through a combination of virtual environments, conda, and hope. The problem was that virtual environments only isolate Python packages. They do nothing for system libraries, CUDA toolkit versions, or native compiled extensions.

The early LLM serving tools from 2022 and 2023 - llama.cpp, text-generation-inference, and the first versions of vLLM - each had strong opinions about their system dependencies. llama.cpp needs specific compiler flags for CUDA or Metal. vLLM's prebuilt wheels are compiled against a specific CUDA major version and fail against another. Text Generation Inference from HuggingFace pins specific versions of flash-attention that must be compiled against a matching torch version.

Teams discovered this the hard way: a pip install vllm that worked in one environment would silently install the CPU-only build in another because the CUDA detection logic read the wrong environment variable. The failure mode was insidious - the server started, the model loaded, and everything looked fine until a user noticed responses were taking 45 seconds instead of 2 seconds.

Docker solves this by making the entire runtime environment - including the CUDA toolkit, system libraries, Python, and all packages - part of the artifact you ship. The image is the environment. You build it once, you test it once, and you deploy that exact tested artifact everywhere. NVIDIA extended this guarantee to the GPU itself through the Container Toolkit, which handles the boundary between the container's CUDA libraries and the host's GPU driver. You no longer need to match CUDA versions between the container and the host - only the driver version matters at the host level.

Historical Context - From VMs to Containers to GPU Containers

Docker was released by Solomon Hykes at dotCloud in March 2013. The insight was not new - Linux namespaces date to 2002 and cgroups to 2008 - but Docker made them usable by combining them into a developer-friendly workflow: build an image, push it to a registry, pull and run it anywhere.

GPU support came much later. The first attempt was through device mapping: passing /dev/nvidia0 into the container as a device file. This worked but required the container to have exactly the same NVIDIA driver libraries as the host, which defeated half the purpose of containerization.

NVIDIA released the NVIDIA Docker runtime (nvidia-docker) in 2016, and then the Container Toolkit (nvidia-container-toolkit) in 2019. The breakthrough was the "driver injection" model: at container startup, the toolkit injects the host's driver libraries into the container through a mount. The container ships its own CUDA toolkit (the version it was compiled against), but at runtime it talks to the host's driver. This means you can run a CUDA 12.2 container on a host with an NVIDIA driver from 2023 as long as the driver supports CUDA 12.2 (driver version 535.86 or higher). The CUDA forward-compatibility guarantee handles the rest.

The practical consequence for LLM inference appeared in 2023 when vLLM, TGI, and Ollama all started shipping official Docker images. Teams could for the first time run a single docker pull and get a working, GPU-accelerated inference server in under five minutes, regardless of the host OS's Python or CUDA state.

Core Concepts

How GPU Passthrough Actually Works

When you run a container with --gpus all, the following happens:

  1. The Docker daemon calls the NVIDIA Container Runtime, a shim that wraps the standard OCI runtime (runc).
  2. The shim reads the container's environment and the GPU device requirements.
  3. Before the container process starts, the shim mounts the host's NVIDIA driver libraries (from /usr/lib/x86_64-linux-gnu/ or similar) into the container at a well-known path.
  4. The shim exposes the GPU device files (/dev/nvidia0, /dev/nvidiactl, /dev/nvidia-uvm) to the container's device cgroup.
  5. The container process starts. When it calls CUDA APIs, those calls go through the container's CUDA toolkit headers but ultimately land on the host's driver via the injected libraries.

The mathematical relationship that matters here is the CUDA compatibility constraint:

$$\text{driver\_version} \geq \text{minimum\_driver}(\text{cuda\_version\_in\_container})$$

For CUDA 12.2, the minimum driver version is 535.86. For CUDA 12.4, it is 550.54. You can always run a container built against an older CUDA version on a newer driver, but not the reverse.
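A minimal host-side check of this constraint, assuming nvidia-smi is on the PATH - read the driver version (the left side of the inequality) and the highest CUDA version that driver supports, then compare against the matrix below:

# Print the host driver version
nvidia-smi --query-gpu=driver_version --format=csv,noheader

# The header line also reports the highest CUDA version the driver supports
nvidia-smi | grep "CUDA Version"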

CUDA Version Compatibility Matrix

| Container CUDA | Minimum Driver | Typical Card |
| --- | --- | --- |
| 11.8 | 520.61 | RTX 3090, A100 (older setups) |
| 12.0 | 525.60 | RTX 4090, A6000 |
| 12.1 | 530.30 | RTX 4090, H100 |
| 12.2 | 535.86 | RTX 4090, H100 |
| 12.4 | 550.54 | RTX 4090, H100, L40S |
| 12.6 | 560.28 | RTX 5090, H200 |

The practical recommendation for new inference containers in 2025 is CUDA 12.4 - it is stable, supported by all current PyTorch and vLLM releases, and the driver requirement (550.x) is met by any card purchased in the last two years.

Base Image Selection

NVIDIA publishes official CUDA base images on Docker Hub under nvidia/cuda. There are three flavors:

  • base - only the essential CUDA runtime library (libcudart). Smallest. Use this if you are only running pre-compiled binaries that bundle everything else they need.
  • runtime - adds the full CUDA math libraries (cuBLAS, cuFFT, NCCL) and, in the -cudnn tags, cuDNN. Use this for most inference workloads.
  • devel - adds the headers and the nvcc compiler toolchain on top of runtime. Use this only if you need to compile custom CUDA kernels inside the image.

For vLLM and most Python inference servers, runtime is correct. A typical base image tag looks like:

nvidia/cuda:12.4.1-cudnn-runtime-ubuntu22.04

The Ubuntu 22.04 base is the current standard. Ubuntu 20.04 is reaching end of life in 2025 and should be avoided for new images.
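If you want to see the size difference between the flavors yourself, pulling all three and comparing is a quick sanity check (exact sizes vary by CUDA version, so treat the output as illustrative):

# Pull the three flavors and compare on-disk sizes
docker pull nvidia/cuda:12.4.1-base-ubuntu22.04
docker pull nvidia/cuda:12.4.1-cudnn-runtime-ubuntu22.04
docker pull nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04
docker images nvidia/cuda --format "table {{.Tag}}\t{{.Size}}"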

Layer Caching and Model Weights

The largest practical challenge in containerizing LLM inference is model weights. A 7B model in float16 is 14 GB. A 70B model in Q4 quantization is roughly 40 GB. Baking these into the image makes the image immense - Docker Hub's free tier has a 10 GB layer limit, and even private registries become expensive to store and slow to pull.

The standard pattern separates weights from the image:

image = code + runtime environment (2-8 GB)
volume = model weights (14-140 GB, mounted at container start)

Docker volumes or bind mounts hold the weights. The container image itself stays lean and pushable. This also means you can update the inference server code (a new vLLM version, a bug fix) without re-downloading the weights, and you can share a single weight directory across multiple containers.

The tradeoff: the first time a container starts with a given volume, it must wait for weights to be available - either downloaded from HuggingFace or copied from a network share. For production air-gapped deployments, you pre-stage the weights volume separately.
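A sketch of that pre-staging step, using a throwaway python:3.11-slim container to populate a named volume. The model ID is an example; gated models additionally need a token (huggingface-cli login or an HF_TOKEN environment variable):

# Create a named volume and fill it once; inference containers then mount it read-only
docker volume create llama31_weights
docker run --rm -v llama31_weights:/models python:3.11-slim bash -c "
  pip install -q 'huggingface_hub[cli]' &&
  huggingface-cli download meta-llama/Llama-3.1-8B-Instruct --local-dir /models/current
"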

Code Examples

Installing NVIDIA Container Toolkit

# On Ubuntu 22.04 / Debian
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit

# Configure Docker to use the NVIDIA runtime
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Verify - should print your GPU info
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi

Running Ollama in Docker

Ollama publishes an official image that handles model downloading, quantization selection, and serving in a single container.

# CPU only
docker run -d \
--name ollama \
-p 11434:11434 \
-v ollama_models:/root/.ollama \
ollama/ollama

# With NVIDIA GPU passthrough
docker run -d \
--name ollama \
--gpus all \
-p 11434:11434 \
-v ollama_models:/root/.ollama \
ollama/ollama

# Pull a model into the running container
docker exec ollama ollama pull llama3.2:3b
docker exec ollama ollama pull qwen2.5-coder:7b

# Run inference via the API
curl http://localhost:11434/api/generate \
-d '{"model": "llama3.2:3b", "prompt": "Explain backpropagation in one paragraph.", "stream": false}'

Production-Grade vLLM Dockerfile

This Dockerfile builds a vLLM server with a specific model, using multi-stage builds to keep the final image clean and using BuildKit cache mounts to speed up repeated builds.

# syntax=docker/dockerfile:1.4
# Stage 1: build dependencies
FROM nvidia/cuda:12.4.1-cudnn-runtime-ubuntu22.04 AS builder

ENV DEBIAN_FRONTEND=noninteractive
ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1

RUN apt-get update && apt-get install -y \
python3.11 \
python3.11-dev \
python3-pip \
git \
curl \
&& rm -rf /var/lib/apt/lists/*

# Pin pip and install build tools for Python 3.11
RUN python3.11 -m pip install --upgrade pip setuptools wheel

# Install vLLM - use a BuildKit cache mount to avoid re-downloading wheels on every build.
# Invoke pip through python3.11 explicitly; a bare `pip` would install into Ubuntu's
# default Python 3.10 and the packages would be missed by the COPY below.
RUN --mount=type=cache,target=/root/.cache/pip \
    python3.11 -m pip install vllm==0.6.3 \
    huggingface_hub \
    accelerate \
    sentencepiece \
    protobuf

# Stage 2: final runtime image
FROM nvidia/cuda:12.4.1-cudnn-runtime-ubuntu22.04 AS runtime

ENV DEBIAN_FRONTEND=noninteractive
ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1
ENV HF_HOME=/models
ENV TRANSFORMERS_CACHE=/models

RUN apt-get update && apt-get install -y \
python3.11 \
python3-pip \
curl \
&& rm -rf /var/lib/apt/lists/*

# Copy installed packages from builder stage
COPY --from=builder /usr/local/lib/python3.11 /usr/local/lib/python3.11
COPY --from=builder /usr/local/bin /usr/local/bin

# Create non-root user for security
RUN useradd -m -u 1000 inference
USER inference

WORKDIR /app

# Healthcheck - verify the server is accepting connections
HEALTHCHECK --interval=30s --timeout=10s --start-period=120s --retries=3 \
CMD curl -f http://localhost:8000/health || exit 1

# Entrypoint - the weights directory is mounted to /models/current at runtime
# Add --tensor-parallel-size N to the CMD to shard across multiple GPUs
ENTRYPOINT ["python3.11", "-m", "vllm.entrypoints.openai.api_server"]
CMD ["--model", "/models/current", \
"--host", "0.0.0.0", \
"--port", "8000", \
"--max-model-len", "4096", \
"--gpu-memory-utilization", "0.90"]

Build and run:

# Build the image
DOCKER_BUILDKIT=1 docker build -t myorg/vllm-server:0.6.3 .

# Run with the model directory mounted as a volume
docker run -d \
--name vllm-server \
--gpus all \
-p 8000:8000 \
-v /data/models/llama-3.1-8b:/models/current:ro \
-e CUDA_VISIBLE_DEVICES=0 \
myorg/vllm-server:0.6.3

# Test the OpenAI-compatible endpoint
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "/models/current",
"messages": [{"role": "user", "content": "Hello"}],
"max_tokens": 100
}'

Pre-baking Model Weights Into an Image

For air-gapped deployments where pulling from HuggingFace at runtime is not possible, you can bake the weights directly into the image. This creates a large image (40-150 GB) but makes deployment fully self-contained.

# syntax=docker/dockerfile:1.4
FROM nvidia/cuda:12.4.1-cudnn-runtime-ubuntu22.04

ENV DEBIAN_FRONTEND=noninteractive
ENV HF_HOME=/models

# curl is needed by the HEALTHCHECK below
RUN apt-get update && apt-get install -y python3.11 python3-pip curl \
    && rm -rf /var/lib/apt/lists/*

RUN --mount=type=cache,target=/root/.cache/pip \
    python3.11 -m pip install vllm==0.6.3 huggingface_hub

# Download model weights at build time.
# The token for gated models is read from a BuildKit secret mounted only for
# this RUN - it never enters a build arg or the image layer history.
ARG MODEL_ID="meta-llama/Llama-3.1-8B-Instruct"

RUN --mount=type=secret,id=hf_token \
    --mount=type=cache,target=/root/.cache/huggingface \
    python3.11 <<'EOF'
import os
from pathlib import Path
from huggingface_hub import snapshot_download

token_path = Path("/run/secrets/hf_token")
token = token_path.read_text().strip() if token_path.exists() else None
snapshot_download(
    repo_id=os.environ["MODEL_ID"],
    local_dir="/models/current",
    token=token,
    ignore_patterns=["*.gguf", "original/*"],
)
EOF

HEALTHCHECK --interval=30s --timeout=10s --start-period=180s --retries=5 \
CMD curl -f http://localhost:8000/health || exit 1

ENTRYPOINT ["python3.11", "-m", "vllm.entrypoints.openai.api_server"]
CMD ["--model", "/models/current", "--host", "0.0.0.0", "--port", "8000"]

Build with the HuggingFace token supplied as a BuildKit secret. Unlike a build arg, a secret is mounted only for the duration of its RUN instruction and is never recorded in image layers:

DOCKER_BUILDKIT=1 docker build \
--secret id=hf_token,src=~/.hf_token \
--build-arg MODEL_ID="meta-llama/Llama-3.1-8B-Instruct" \
-t myorg/llama-3.1-8b-airgapped:latest .
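After the build, it is worth confirming the token value never landed in image metadata. With BuildKit secrets it should not, and the check is cheap:

# Search the recorded build steps for the token value - expect no output
# (grep exits non-zero when nothing matches, i.e. when the image is clean)
docker history --no-trunc myorg/llama-3.1-8b-airgapped:latest | grep -F "$(cat ~/.hf_token)"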

Docker Compose for a Full Local RAG Stack

This compose file assembles a complete local retrieval-augmented generation stack: an Ollama model server, an Open-WebUI frontend, and a Qdrant vector database.

# docker-compose.yml
version: "3.9"

services:
  # ---- Model server (Ollama) ----
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: unless-stopped
    ports:
      - "11434:11434"
    volumes:
      - ollama_models:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    environment:
      - OLLAMA_NUM_PARALLEL=2
      - OLLAMA_MAX_LOADED_MODELS=2
    healthcheck:
      # the ollama image does not ship curl, so probe via the ollama CLI
      test: ["CMD", "ollama", "list"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 60s

  # ---- Web UI ----
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    restart: unless-stopped
    ports:
      - "3000:8080"
    volumes:
      - open_webui_data:/app/backend/data
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
      - WEBUI_SECRET_KEY=change-this-in-production
      - ENABLE_RAG_WEB_SEARCH=true
      - RAG_EMBEDDING_ENGINE=ollama
      - RAG_EMBEDDING_MODEL=nomic-embed-text
    depends_on:
      ollama:
        condition: service_healthy

  # ---- Vector database ----
  qdrant:
    image: qdrant/qdrant:latest
    container_name: qdrant
    restart: unless-stopped
    ports:
      - "6333:6333"
      - "6334:6334"
    volumes:
      - qdrant_storage:/qdrant/storage
    environment:
      - QDRANT__SERVICE__GRPC_PORT=6334

  # ---- Model initialization (runs once and exits) ----
  model-init:
    image: curlimages/curl:latest
    container_name: model-init
    restart: "no"
    depends_on:
      ollama:
        condition: service_healthy
    command: >
      sh -c "
      curl -s -X POST http://ollama:11434/api/pull -d '{\"name\": \"llama3.2:3b\"}' &&
      curl -s -X POST http://ollama:11434/api/pull -d '{\"name\": \"nomic-embed-text\"}'
      "

volumes:
  ollama_models:
    driver: local
  open_webui_data:
    driver: local
  qdrant_storage:
    driver: local

Start the full stack:

# Start everything in detached mode
docker compose up -d

# Watch the model downloads
docker compose logs -f model-init

# Check all services are healthy
docker compose ps

# Open the UI
open http://localhost:3000

Mounting the HuggingFace Cache as a Volume

If you already have models downloaded locally via huggingface-cli download or the transformers library, you can mount that cache directory directly into containers to avoid re-downloading.

# The default HuggingFace cache location on Linux/Mac
HF_CACHE="${HOME}/.cache/huggingface/hub"

# Mount it read-only into vLLM
docker run -d \
--name vllm \
--gpus all \
-p 8000:8000 \
-v "${HF_CACHE}:/root/.cache/huggingface/hub:ro" \
-e HF_HOME=/root/.cache/huggingface \
-e MODEL_NAME="meta-llama/Llama-3.1-8B-Instruct" \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0

# Or with docker compose, add to volumes section:
# volumes:
# - ~/.cache/huggingface/hub:/root/.cache/huggingface/hub:ro

Architecture Diagrams

(Diagrams not reproduced in this text version: container runtime architecture, local RAG stack architecture, and the multi-stage build flow.)

Production Engineering Notes

Image Tagging Strategy

Never use latest for inference servers in production. Tag images with the exact version of the inference framework plus a build date or git hash:

myorg/vllm-server:0.6.3-20250415
myorg/ollama-custom:0.5.4-llama3.1

This makes rollbacks unambiguous. When a model update causes a regression, you can identify exactly which image to roll back to.
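A small build wrapper makes the tag mechanical; this sketch assumes a git checkout and that the framework version is pinned in one place:

# Tag with framework version + short git hash for unambiguous rollbacks
VLLM_VERSION=0.6.3
GIT_SHA=$(git rev-parse --short HEAD)
docker build -t "myorg/vllm-server:${VLLM_VERSION}-${GIT_SHA}" .
docker push "myorg/vllm-server:${VLLM_VERSION}-${GIT_SHA}"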

Layer Ordering for Cache Efficiency

Docker builds images layer by layer. Layers are cached and reused if the instructions before them have not changed. For inference containers, order your Dockerfile instructions from least to most frequently changed:

  1. Base image (FROM)
  2. System package installation (apt-get)
  3. Python package installation (pip install)
  4. Application code (COPY)
  5. Default command (CMD)

This means a code change in step 4 does not invalidate the pip install cache in step 3, which typically takes 5-10 minutes for large ML packages.
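A skeleton of that ordering (the requirements file and src/ layout are placeholders, not part of the earlier Dockerfiles):

# 1. Base image - changes rarely
FROM nvidia/cuda:12.4.1-cudnn-runtime-ubuntu22.04

# 2. System packages - change rarely
RUN apt-get update && apt-get install -y python3.11 python3-pip \
    && rm -rf /var/lib/apt/lists/*

# 3. Python dependencies - copy only the requirements file first,
#    so editing application code does not invalidate this layer
COPY requirements.txt /app/requirements.txt
RUN python3.11 -m pip install -r /app/requirements.txt

# 4. Application code - changes often
COPY src/ /app/src/

# 5. Default command
CMD ["python3.11", "/app/src/server.py"]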

GPU Memory Reservation

vLLM pre-allocates a fraction of GPU memory at startup via --gpu-memory-utilization. The default is 0.90 (90%). In a multi-container environment where multiple inference servers share a GPU (not recommended for production but common in dev), you must set this lower or use CUDA_VISIBLE_DEVICES to assign specific GPUs to specific containers:

# Container 1 gets GPU 0
docker run -e CUDA_VISIBLE_DEVICES=0 --gpus all myorg/vllm-server ...

# Container 2 gets GPU 1
docker run -e CUDA_VISIBLE_DEVICES=1 --gpus all myorg/vllm-server ...
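Alternatively, let Docker do the assignment so the container never sees the other GPUs at all; --gpus accepts device indices (note the nested quoting, which protects the inner quotes from the shell):

# Equivalent isolation at the Docker level - the container only sees GPU 1
docker run --gpus '"device=1"' myorg/vllm-server ...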

Volume Backup and Migration

Model weight volumes are large but static - they rarely change. Use Docker volume backup patterns for disaster recovery:

# Backup an Ollama models volume to a tar file
docker run --rm \
-v ollama_models:/source:ro \
-v $(pwd):/backup \
ubuntu \
tar czf /backup/ollama_models_backup.tar.gz -C /source .

# Restore
docker run --rm \
-v ollama_models:/target \
-v $(pwd):/backup \
ubuntu \
tar xzf /backup/ollama_models_backup.tar.gz -C /target

Resource Limits

Always set memory limits on CPU-only inference containers to prevent OOM kills from taking down other services:

# In docker-compose.yml
services:
  cpu-inference:
    image: myorg/llama-cpu:latest
    deploy:
      resources:
        limits:
          memory: 32G
          cpus: "16"
        reservations:
          memory: 16G
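The equivalent limits with plain docker run, useful for quick experiments outside Compose:

# Flag equivalents of the Compose limits above
docker run -d \
  --memory=32g \
  --memory-reservation=16g \
  --cpus=16 \
  myorg/llama-cpu:latest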

GPU memory limits are managed by the CUDA runtime, not Docker. Set --gpu-memory-utilization in vLLM or OLLAMA_MAX_VRAM in Ollama instead.

Logging and Observability

Inference servers generate high-volume logs during busy periods. Use Docker's log drivers to avoid filling disk:

services:
  vllm:
    image: myorg/vllm-server:latest
    logging:
      driver: "json-file"
      options:
        max-size: "100m"
        max-file: "5"

For production, route logs to a centralized system (Loki, CloudWatch, Elasticsearch) using the loki or awslogs log drivers.
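To apply the same rotation to every container on a host by default, the daemon config accepts default log options. A sketch; note that this overwrites the file, so merge with any existing keys (such as the nvidia runtime entry added by nvidia-ctk) rather than replacing them:

# Write /etc/docker/daemon.json, then restart the daemon to apply
sudo tee /etc/docker/daemon.json <<'EOF'
{
  "log-driver": "json-file",
  "log-opts": { "max-size": "100m", "max-file": "5" }
}
EOF
sudo systemctl restart docker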

Common Mistakes

:::danger Forgetting to Restart Docker After Installing NVIDIA Container Toolkit Installing nvidia-container-toolkit does not automatically apply the configuration. You must run sudo nvidia-ctk runtime configure --runtime=docker followed by sudo systemctl restart docker. Skipping the restart is the single most common reason docker run --gpus all fails with "could not select device driver with capabilities: [[gpu]]" even after a correct toolkit installation. (An "unknown flag: --gpus" error instead means the Docker Engine itself predates 19.03 and must be upgraded.) :::

:::danger Baking HuggingFace Tokens Into Image Layers Using ARG HF_TOKEN and then running huggingface-cli download --token $HF_TOKEN in a RUN instruction bakes the token into the image layer history. Anyone who can pull the image can extract the token with docker history --no-trunc. Always use --secret with BuildKit: RUN --mount=type=secret,id=hf_token .... Leaked HuggingFace tokens can be used to access gated models or, if the account has write access, to push malicious model files. :::

:::warning Mismatching CUDA Versions Between Container and Driver Running a container built against CUDA 12.4 on a host with driver version 470.x (which supports only CUDA 11.4) will fail at runtime, not at container start. The container will start, the model will begin loading, and then PyTorch or CUDA will throw a driver version mismatch error. Always check the host driver version with nvidia-smi before deploying. The required minimum driver version for each CUDA version is documented at docs.nvidia.com/cuda/cuda-toolkit-release-notes. :::

:::warning Large Images in CI/CD Pipelines If you are building inference container images in CI and pushing to a registry, be aware that a vLLM image with baked-in weights can be 50-150 GB. Standard CI runners (GitHub Actions, GitLab CI) have limited disk space (14-100 GB) and will fail silently or with opaque errors if you exceed it. Either use external volumes for weights (the recommended approach) or provision dedicated CI runners with sufficient local storage for image builds. :::

:::warning Running as Root Inside Containers The default user in most NVIDIA CUDA base images is root. For production inference servers, create a non-root user in your Dockerfile. This limits the blast radius of a container escape or a prompt injection attack that gains code execution. The useradd -m -u 1000 inference pattern shown in the Dockerfile above is the minimum required. Also mount model weight volumes as read-only (:ro) to prevent any process inside the container from modifying the weights. :::

Interview Q&A

Q: Explain what happens when you run docker run --gpus all on a host with NVIDIA Container Toolkit installed. Walk through the process from the CLI command to CUDA kernel execution.

A: The Docker CLI parses --gpus all and passes a device request to the Docker daemon. The daemon invokes the NVIDIA Container Runtime, a shim that wraps runc (the standard OCI runtime). Before the container process starts, the shim queries the host for available GPU devices and uses libnvidia-container to bind-mount the host's NVIDIA driver libraries into the container's filesystem namespace at a path like /run/nvidia-container-toolkit/driver-mounts. It also exposes the GPU device files (/dev/nvidia0, /dev/nvidiactl, /dev/nvidia-uvm) in the container's device cgroup. When the container process starts and calls CUDA API functions, those calls go through the CUDA toolkit headers that were compiled into the container image, but the actual driver calls route through the injected host driver libraries and ultimately reach the physical GPU over PCIe. The CUDA forward-compatibility layer allows the container's CUDA toolkit version to be higher than the minimum version supported by the host driver, as long as the driver version meets the minimum for the container's CUDA version.


Q: What is the difference between baking model weights into a Docker image versus using external volumes, and when would you choose each approach?

A: Baking weights into the image creates a fully self-contained artifact - you can deploy it to any Docker host without network access or pre-staged storage. This is valuable for air-gapped environments, kiosk deployments, or when you need exact reproducibility including the model version. The downsides are that the image becomes 14-150 GB depending on model size, making it expensive to store in registries, slow to pull, and impractical to update frequently. External volumes separate the model weights from the runtime environment. The image stays small (2-8 GB), you can update the inference server code without re-downloading weights, and you can share a single weight directory across multiple containers. The tradeoff is that deployment requires the weights to be pre-staged on the target host. For most production use cases, external volumes are the correct choice. Baked-in weights are the right choice when: the deployment target is air-gapped, the model version must be pinned exactly with the code version, or you are distributing a self-contained appliance.


Q: You have a vLLM container that starts successfully but produces inference responses much slower than expected. GPU utilization is at 100% during inference. What are the likely causes and how would you diagnose them?

A: 100% GPU utilization with slow throughput points to a few specific issues. First, check if the model is actually running on the GPU: docker exec vllm nvidia-smi should show the vllm process consuming VRAM. If VRAM usage is near zero, the model is not on the GPU at all - the server has silently fallen back to CPU execution, which happens when the CUDA build in the container is incompatible with the host driver. Second, check --max-model-len: if vLLM is configured with a very large context window (e.g., 32768) but the model's actual sequences are short, vLLM pre-allocates KV cache for the maximum length, leaving less VRAM for parallel requests. Third, check --tensor-parallel-size: if you have multiple GPUs but are only using one, you may be VRAM-constrained and swapping. Fourth, look at request batching - if you are sending one request at a time, vLLM cannot batch them. Use continuous batching by sending concurrent requests. Finally, verify that PCIe bandwidth is not the bottleneck: on multi-GPU systems without NVLink, model-parallel communication over PCIe can become a bottleneck for large models.


Q: Describe a Docker Compose setup for a local RAG system. What services do you need, how do they communicate, and what volumes does each service require?

A: A minimal local RAG stack needs three services. The model server handles both text generation and embedding generation. Ollama is practical here because it serves both capabilities through a single API. It needs a volume for downloaded model files (typically mapped to /root/.ollama). The vector database stores document embeddings and handles similarity search. Qdrant is the standard choice for local deployment - it is a single binary, has a clean HTTP and gRPC API, and supports all the metadata filtering that RAG pipelines need. It needs a volume for its storage directory. The application layer can be Open-WebUI for a no-code interface or a custom Python service for programmatic RAG. It communicates with the model server via HTTP and with the vector database via gRPC or HTTP. All three services should be on the same Docker network (Compose creates this automatically) and communicate by service name rather than localhost. Health checks on the model server are critical because it takes 30-120 seconds to load models at startup, and the application layer should not start sending requests until the model server is ready. The depends_on with condition: service_healthy pattern in Compose handles this.


Q: What security considerations apply specifically to containerized LLM inference servers that do not apply to typical web services?

A: Several LLM-specific risks compound the usual container security concerns. Prompt injection attacks that gain code execution are more feasible with LLMs than with typical web services because the model processes untrusted input in a highly capable reasoning system. Running the inference server as a non-root user and mounting model weights as read-only limits the damage from any such exploit. Model extraction attacks: if the inference server API is accessible on the network without authentication, an attacker can query it systematically to extract the model's capabilities or attempt to recover training data. Inference APIs should always require authentication tokens, even on local networks. VRAM side-channel attacks: a multi-tenant GPU inference server may leak information about one user's request through timing patterns observable by another user sharing the same GPU. This is relevant when running containerized inference for multiple untrusted users on shared hardware. HuggingFace token exposure: as discussed, tokens must never be baked into image layers. Additionally, if the container needs to authenticate to HuggingFace at runtime, pass the token as an environment variable from a Docker secret rather than hardcoding it in the compose file. Finally, model weight integrity: for production deployments, verify model checksums after download and before serving. A supply-chain attack that replaces a popular model with a backdoored version is a real threat given the size of the HuggingFace ecosystem.


Q: How do you handle CUDA version upgrades in a containerized inference deployment? What is the upgrade path and what can go wrong?

A: A CUDA version upgrade in a containerized environment involves two independent upgrades that must be coordinated: the host driver upgrade and the container image rebuild. The host driver can be upgraded independently of the containers - a newer driver is backward compatible with containers built against older CUDA versions. So upgrading from driver 525.x to 550.x on the host allows you to run containers built against CUDA 12.0 through 12.4. The container image rebuild happens separately: you update the base image tag in your Dockerfile from cuda:12.2.x to cuda:12.4.x, rebuild, and redeploy. What can go wrong: if you rebuild the container against CUDA 12.4 and forget to upgrade the host driver, the container will fail at CUDA initialization (requires driver 550.54, but host has 525.x). The failure message from PyTorch is usually something like "CUDA driver version is insufficient for CUDA runtime version." A second risk: some compiled extensions (flash-attention, bitsandbytes) are built against a specific CUDA version and must be recompiled when the container's CUDA version changes. The safe upgrade sequence is: (1) upgrade host driver, (2) verify existing containers still work, (3) rebuild container images with the new CUDA base, (4) test on a staging host, (5) deploy to production. Never do host driver upgrades and container image upgrades simultaneously in production.


Advanced Patterns

Sidecar Containers for Model Management

In production setups with Docker Compose, a sidecar pattern keeps the inference server focused on serving while a separate container handles model lifecycle operations: downloading updates, validating checksums, and pre-warming the cache.

services:
  # Main inference server - never restarts for model ops
  vllm:
    image: myorg/vllm-server:0.6.3
    volumes:
      - model_store:/models:ro  # read-only - only model-manager writes here
    depends_on:
      model-manager:
        condition: service_completed_successfully

  # Sidecar - downloads and validates the model, then exits 0
  model-manager:
    image: python:3.11-slim
    restart: "no"
    volumes:
      - model_store:/models  # read-write
      - ./scripts:/scripts:ro
    environment:
      - HF_TOKEN_FILE=/run/secrets/hf_token
      - MODEL_ID=meta-llama/Llama-3.1-8B-Instruct
      - TARGET_DIR=/models/current
    # python:3.11-slim does not include huggingface_hub, so install it first
    command: sh -c "pip install -q huggingface_hub && python /scripts/ensure_model.py"
    secrets:
      - hf_token

secrets:
  hf_token:
    file: ~/.hf_token

volumes:
  model_store:

The ensure_model.py sidecar script:

#!/usr/bin/env python3
"""
Model management sidecar. Downloads and validates a HuggingFace model.
Exits 0 when the model is ready, so dependent services can gate on the
service_completed_successfully condition.
"""
import os
import sys
from pathlib import Path

from huggingface_hub import snapshot_download

MODEL_ID = os.environ["MODEL_ID"]
TARGET_DIR = Path(os.environ["TARGET_DIR"])
TOKEN_FILE = os.environ.get("HF_TOKEN_FILE")


def load_token() -> str | None:
    """Prefer a mounted secret file; fall back to the HF_TOKEN env var."""
    if TOKEN_FILE and Path(TOKEN_FILE).exists():
        return Path(TOKEN_FILE).read_text().strip()
    return os.environ.get("HF_TOKEN")


def model_is_ready(target: Path) -> bool:
    """Basic validation: config.json and at least one weight shard exist."""
    if not (target / "config.json").exists():
        return False
    weight_files = list(target.glob("*.safetensors")) + list(target.glob("*.bin"))
    return len(weight_files) > 0


def main():
    token = load_token()
    TARGET_DIR.mkdir(parents=True, exist_ok=True)

    if model_is_ready(TARGET_DIR):
        print(f"Model already present at {TARGET_DIR}", flush=True)
        sys.exit(0)

    print(f"Downloading {MODEL_ID} to {TARGET_DIR}...", flush=True)
    snapshot_download(
        repo_id=MODEL_ID,
        local_dir=str(TARGET_DIR),
        token=token,
        ignore_patterns=["*.gguf", "original/*", "*.pt"],
    )
    print("Download complete.", flush=True)


if __name__ == "__main__":
    main()

Health Check Endpoints

vLLM exposes a /health endpoint that returns 200 when the model is loaded and ready to serve. Use this in production orchestration rather than just checking if the process is alive:

# Check if vLLM is ready (not just running)
until curl -s -f http://localhost:8000/health > /dev/null; do
echo "Waiting for vLLM to be ready..."
sleep 5
done
echo "vLLM is ready"

# Get model info
curl http://localhost:8000/v1/models | python3 -m json.tool

# Check GPU memory usage inside the container
docker exec vllm python3 -c "
import torch
for i in range(torch.cuda.device_count()):
total = torch.cuda.get_device_properties(i).total_memory / 1e9
allocated = torch.cuda.memory_allocated(i) / 1e9
print(f'GPU {i}: {allocated:.1f}/{total:.1f} GB used')
"

Cleaning Up Unused Images and Volumes

LLM Docker images and model weight volumes accumulate quickly. A 70B model baked into an image is 40+ GB. Docker's standard prune commands are essential for hygiene:

# Remove all stopped containers
docker container prune -f

# Remove dangling images (untagged intermediary layers)
docker image prune -f

# Remove all images not referenced by a running container
# WARNING: this removes ALL unused images including large inference images
docker image prune -a -f

# Remove unused volumes (does NOT remove named volumes by default)
docker volume prune -f

# Nuclear option: remove everything not currently running
docker system prune -a --volumes -f

# Check disk usage before and after
docker system df

For model weight volumes that you want to keep across container rebuilds, use named volumes (as shown in the compose files above) rather than anonymous volumes. On Docker Engine 23.0 and later, docker volume prune removes only anonymous volumes by default; named volumes survive unless you pass --all. On older engines, prune removes any unused volume, named or not, so check your engine version before pruning on a host that stores weights.
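One way to make the keep/prune distinction explicit, on engines where prune can touch named volumes, is labels; volume prune accepts label filters:

# Label volumes you never want pruned, then exclude them by filter
docker volume create --label keep=true model_weights
docker volume prune --filter "label!=keep" -f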

Testing the Full Stack End to End

A quick integration test script that verifies the core services in the local RAG compose stack are functioning:

#!/usr/bin/env python3
"""
Integration test for the local AI stack.
Tests: Ollama API, embedding generation, Qdrant connectivity, text generation.
"""
import sys

import requests

OLLAMA_URL = "http://localhost:11434"
QDRANT_URL = "http://localhost:6333"


def test_ollama():
    r = requests.get(f"{OLLAMA_URL}/api/tags", timeout=10)
    assert r.status_code == 200, f"Ollama not responding: {r.status_code}"
    models = [m["name"] for m in r.json().get("models", [])]
    print(f"Ollama OK - loaded models: {models}")
    return models


def test_embedding(model="nomic-embed-text"):
    r = requests.post(
        f"{OLLAMA_URL}/api/embed",
        json={"model": model, "input": "test sentence"},
        timeout=30,
    )
    assert r.status_code == 200, f"Embedding failed: {r.status_code}"
    embedding = r.json()["embeddings"][0]
    print(f"Embedding OK - dim={len(embedding)}")
    return embedding


def test_qdrant():
    r = requests.get(f"{QDRANT_URL}/collections", timeout=10)
    assert r.status_code == 200, f"Qdrant not responding: {r.status_code}"
    collections = [c["name"] for c in r.json()["result"]["collections"]]
    print(f"Qdrant OK - collections: {collections}")


def test_generation(model="llama3.2:3b"):
    r = requests.post(
        f"{OLLAMA_URL}/api/generate",
        json={"model": model, "prompt": "Say 'OK' and nothing else.", "stream": False},
        timeout=60,
    )
    assert r.status_code == 200, f"Generation failed: {r.status_code}"
    response = r.json()["response"]
    print(f"Generation OK - response: {response[:50]}")


if __name__ == "__main__":
    errors = []
    for test_fn in [test_ollama, test_embedding, test_qdrant, test_generation]:
        try:
            test_fn()
        except Exception as e:
            errors.append(f"{test_fn.__name__}: {e}")
            print(f"FAILED {test_fn.__name__}: {e}")

    if errors:
        print(f"\n{len(errors)} test(s) failed")
        sys.exit(1)
    else:
        print("\nAll tests passed - stack is healthy")
        sys.exit(0)