Containers and Namespaces
Reading time: ~35 min · Interview relevance: High · Target roles: MLOps Engineer, Platform Engineer, AI Infrastructure
It is 2:00 AM on a Tuesday. Your model serving cluster hosts fifteen different teams, each running inference containers for their respective models. Team A runs a BERT-based text classifier. Team B runs a ResNet-50 image encoder. Team C is doing something exotic with a 7B-parameter quantized LLaMA model. All of these processes live on the same physical machines. The kernels they call into are identical. The hardware they touch is shared.
Then Team C's inference process starts consuming 47 GB of RAM because someone forgot to set a memory limit on a request that had a 200,000-token context window. Within thirty seconds the kernel's OOM killer starts shooting processes. It picks Team A's serving process because it has the largest resident set. Team A's SLA breach pages six engineers at 2:07 AM. An incident begins. By the time the postmortem is written, the root cause is clear: Team C's container had no memory limit set.
The fix was not architectural. It was operational. Team C's container should have had a cgroup memory limit set to 16 GB. Had that limit been in place, Team C's process would have been killed in isolation - only Team C's service would have experienced the error. Team A would have slept through the night. The incident would have been a five-minute ticket, not a forty-minute postmortem.
Containers are often presented as a deployment packaging story. "Ship your code with its dependencies." That is true but incomplete. The deeper story is about isolation guarantees. A container is a set of Linux kernel primitives - namespaces, cgroups, and a union filesystem - composed to create a process that believes it is alone on a machine, cannot accidentally harm its neighbors, and cannot consume more resources than it is allocated. Understanding these primitives is what separates engineers who run containers from engineers who design the platforms that run containers reliably at scale.
For ML workloads specifically the stakes are higher. Training jobs can consume entire machines for days. Inference services must coexist on dense multi-tenant clusters. GPU resources are scarce and expensive. Data pipelines churn through terabytes. When any of these workloads misbehave the blast radius must be contained - literally. The kernel mechanisms described in this lesson are what make that containment possible.
Why This Exists
Before Linux namespaces, process isolation required either virtual machines (heavyweight, slow to start, poor density) or a single shared OS with no isolation at all - everything could see and kill everything else. The chroot syscall from 1979 gave processes a fake filesystem root but provided no network, PID, or resource isolation. It was a hack, not a solution.
The problem namespace isolation solves: multiple workloads need to run on the same kernel without being able to observe each other's processes, network state, filesystem contents, or hostname. They need to think they are alone while actually sharing kernel resources. cgroups solve the complementary problem: even if processes are isolated from each other's view, they still share the same CPU time, RAM, and I/O bandwidth. Without limits, one workload starves the others.
The combination - namespaces for isolation of view, cgroups for isolation of resource consumption, and a union filesystem for isolation of storage - is what Docker packaged into a developer-friendly tool in 2013. But Docker itself is just a user-friendly interface over these kernel primitives that existed years before Docker was written.
Historical Context
Linux namespaces were introduced gradually over more than a decade. Mount namespaces arrived in 2002. UTS and IPC namespaces came in 2006. Network and PID namespaces followed in 2008. User namespaces - the last and most powerful type, enabling unprivileged container creation - became stable in kernel 3.8 in 2013. The namespace concept itself dates to Plan 9 from Bell Labs in the late 1980s, where the idea that every process should have its own view of the filesystem was fundamental to the OS design.
Control groups (cgroups) were contributed by Google engineers Paul Menage and Rohit Seth in 2006 and merged into Linux 2.6.24 in 2008. They were born from internal Google needs - their production clusters ran thousands of processes per machine and needed fine-grained resource accounting and limits. cgroups v2 (the unified hierarchy) was merged in Linux 4.5 (2016) and became the default in most distributions around 2019-2020.
Docker launched in March 2013 and became the first tool to make these primitives accessible to application developers without requiring kernel expertise. But Docker itself uses libcontainer (now runc), which directly calls the kernel namespace and cgroup APIs. The Open Container Initiative (OCI) standardized the container runtime specification in 2015, producing runc as the reference implementation and containerd as the higher-level daemon that manages the container lifecycle. Kubernetes deprecated its Docker-specific shim (dockershim) in v1.20 (December 2020) and removed it in v1.24 (2022), leaving containerd as the de facto default runtime.
Core Concepts
Linux Namespaces - The Isolation Layer
A namespace wraps a global kernel resource so that processes within the namespace see their own isolated copy. Modern kernels provide eight namespace types; the seven that container runtimes compose are:
| Namespace | Clone Flag | Isolates |
|---|---|---|
| PID | CLONE_NEWPID | Process IDs - container's init is PID 1 |
| NET | CLONE_NEWNET | Network interfaces, routes, iptables rules |
| MNT | CLONE_NEWNS | Mount points and filesystem tree |
| UTS | CLONE_NEWUTS | Hostname and NIS domain name |
| IPC | CLONE_NEWIPC | SysV IPC and POSIX message queues |
| USER | CLONE_NEWUSER | User and group IDs (enables rootless) |
| CGROUP | CLONE_NEWCGROUP | cgroup root directory view |
The key insight: namespaces are not separate OS instances. They are separate views into the same kernel. The PID namespace makes a process think it is PID 1, but the kernel still assigns it a real PID in the root namespace. A process in a net namespace sees only the virtual network interfaces assigned to it, but all packets still flow through the same physical NIC and kernel network stack.
PID namespace in depth: The first process created in a new PID namespace becomes PID 1 from the namespace's perspective; from the host, it has a completely different PID. If PID 1 inside the container exits, the kernel sends SIGKILL to all remaining processes in that namespace - this is the container's death signal. This is also why your container entrypoint must handle signals correctly: for PID 1, the kernel ignores any signal for which no handler is installed, so a naive entrypoint never acts on the SIGTERM from docker stop, and it is responsible for reaping orphans and propagating signals to its child processes.
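What that means in practice: a PID 1 entrypoint needs a signal handler and a reaping loop. Here is a minimal, illustrative sketch of such an init in Python (real deployments usually reach for tini or dumb-init rather than hand-rolling this):

```python
import os
import signal
import sys

# Minimal PID-1 sketch: exec the real workload as a child, forward
# termination signals to it, and reap every orphan in the namespace.
child_pid = os.fork()
if child_pid == 0:
    os.execvp(sys.argv[1], sys.argv[1:])   # the actual workload, e.g. train.py

def forward(signum: int, frame) -> None:
    os.kill(child_pid, signum)             # propagate SIGTERM/SIGINT

signal.signal(signal.SIGTERM, forward)
signal.signal(signal.SIGINT, forward)

while True:
    pid, status = os.waitpid(-1, 0)        # as PID 1, we reap all orphans
    if pid == child_pid:
        sys.exit(os.waitstatus_to_exitcode(status))
```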
Network namespace in depth: Each container gets its own loopback interface, its own routing table, and its own set of virtual ethernet (veth) interfaces. Docker creates veth pairs: one end goes into the container's net namespace, the other end connects to a bridge (docker0) in the host namespace. Traffic between the container and the outside world is routed through this bridge with NAT. This is why ifconfig inside a container shows different interfaces than the host.
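The plumbing is easy to reproduce by hand, which makes the mechanism less magical. A sketch of roughly what Docker does per container, driven from Python (requires root; all the names here - demo-ns, veth-host, veth-ctr, the 10.200.0.0/24 range - are made up for the example):

```python
import subprocess

def sh(cmd: str) -> None:
    subprocess.run(cmd.split(), check=True)

sh("ip netns add demo-ns")                                 # new network namespace
sh("ip link add veth-host type veth peer name veth-ctr")   # create a veth pair
sh("ip link set veth-ctr netns demo-ns")                   # move one end inside
sh("ip addr add 10.200.0.1/24 dev veth-host")
sh("ip link set veth-host up")
sh("ip netns exec demo-ns ip addr add 10.200.0.2/24 dev veth-ctr")
sh("ip netns exec demo-ns ip link set veth-ctr up")
sh("ip netns exec demo-ns ip link set lo up")
# Verify isolation plus connectivity:
#   ip netns exec demo-ns ping -c1 10.200.0.1
# Docker additionally attaches the host end to the docker0 bridge and adds NAT.
```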
Mount namespace in depth: The container's filesystem tree is independent from the host. The container can mount, unmount, and pivot_root without affecting the host. This is what makes overlayfs work correctly - the container's filesystem mount is entirely contained within its mount namespace.
User namespace in depth: Unprivileged user namespaces allow a non-root user to create a container where they appear as root (UID 0) inside the namespace but are an unprivileged user outside. The kernel maintains a UID mapping: inside the container UID 0 maps to outside UID 1000. This is the foundation of rootless containers. An important limitation: the mapped UID can only access files that the real UID can access on the host.
Inspecting Namespaces in Python
import os
import subprocess
import ctypes
def get_process_namespaces(pid: int) -> dict[str, str]:
"""Read namespace identifiers for a given PID.
Each namespace appears as a symlink in /proc/<pid>/ns/.
The inode number uniquely identifies the namespace instance.
Two processes with the same inode for a namespace type
share that namespace - they have a shared view of that resource.
"""
ns_dir = f"/proc/{pid}/ns"
namespaces = {}
try:
for ns_name in os.listdir(ns_dir):
ns_path = os.path.join(ns_dir, ns_name)
# readlink returns something like "net:[4026531992]"
link_target = os.readlink(ns_path)
ns_type, ns_inode = link_target.split(":[")
namespaces[ns_name] = ns_inode.rstrip("]")
except PermissionError:
print(f"No permission to read namespaces of PID {pid}")
return namespaces
def compare_namespaces(pid1: int, pid2: int) -> None:
"""Compare namespace membership of two processes.
Same inode = same namespace = shared resource view.
Different inode = isolated view of that resource.
"""
ns1 = get_process_namespaces(pid1)
ns2 = get_process_namespaces(pid2)
print(f"{'Namespace':<12} {'PID ' + str(pid1):<22} {'PID ' + str(pid2):<22} {'Shared?'}")
print("-" * 70)
for ns in sorted(set(ns1) | set(ns2)):
inode1 = ns1.get(ns, "N/A")
inode2 = ns2.get(ns, "N/A")
shared = "YES (same namespace)" if inode1 == inode2 else "NO (isolated)"
print(f"{ns:<12} {inode1:<22} {inode2:<22} {shared}")
def enter_namespace(pid: int, ns_type: str) -> None:
"""Enter the namespace of another process using setns().
This is what 'docker exec' does internally - it enters the
existing namespaces of the container's init process before
execing the new command.
Requires CAP_SYS_ADMIN or user namespace support.
"""
LIBC = ctypes.CDLL("libc.so.6", use_errno=True)
ns_path = f"/proc/{pid}/ns/{ns_type}"
fd = os.open(ns_path, os.O_RDONLY)
try:
ret = LIBC.setns(fd, 0)
if ret != 0:
errno = ctypes.get_errno()
raise OSError(errno, os.strerror(errno))
print(f"Entered {ns_type} namespace of PID {pid}")
finally:
os.close(fd)
def list_container_processes(container_id: str) -> list[dict]:
"""Use docker inspect to find container PID, then list all
processes that share the same PID namespace.
This shows the host-visible PIDs of all container processes,
which is useful for debugging and tracing.
"""
result = subprocess.run(
["docker", "inspect", "--format", "{{.State.Pid}}", container_id],
capture_output=True, text=True
)
container_pid = int(result.stdout.strip())
container_ns_inode = get_process_namespaces(container_pid).get("pid")
container_procs = []
for proc_dir in os.scandir("/proc"):
if not proc_dir.name.isdigit():
continue
try:
proc_pid = int(proc_dir.name)
proc_ns_inode = get_process_namespaces(proc_pid).get("pid")
if proc_ns_inode == container_ns_inode:
cmdline_path = f"/proc/{proc_pid}/cmdline"
with open(cmdline_path, "r") as f:
cmdline = f.read().replace("\x00", " ").strip()
container_procs.append({"host_pid": proc_pid, "cmd": cmdline})
except (PermissionError, FileNotFoundError, ValueError):
continue
return container_procs
# Example usage
print("=== Current process namespaces ===")
my_namespaces = get_process_namespaces(os.getpid())
for ns, inode in sorted(my_namespaces.items()):
print(f" {ns}: {inode}")
cgroups v1 vs v2 - The Resource Accounting Layer
Control groups impose resource limits and provide accounting for groups of processes. Every process in Linux belongs to exactly one cgroup in each hierarchy.
cgroups v1 problems: Each resource controller (memory, cpu, blkio, etc.) had its own independent hierarchy. A process could be in /sys/fs/cgroup/memory/team-a/job-1 for memory but in /sys/fs/cgroup/cpu/team-a/job-1 for CPU. These hierarchies were independent and inconsistent, making it hard to reason about the total resource allocation of a "container." Thread-level granularity was broken in subtle ways. The memory "soft limit" didn't work well in practice.
cgroups v2 uses a single unified hierarchy. All controllers attach to the same tree. This makes accounting consistent and enables the "no internal processes" rule - a cgroup that has child cgroups cannot itself have processes, preventing split-brain accounting.
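Creating and delegating a v2 cgroup is plain filesystem manipulation. A sketch (requires root; the /ml-jobs/train-demo path is hypothetical):

```python
import os
import pathlib

root = pathlib.Path("/sys/fs/cgroup")
job = root / "ml-jobs" / "train-demo"      # hypothetical hierarchy
job.mkdir(parents=True, exist_ok=True)

# Honor the "no internal processes" rule: enable controllers for children
# on the parent, keep processes only in leaf cgroups.
(root / "ml-jobs" / "cgroup.subtree_control").write_text("+cpu +memory")

# Move the current process into the leaf
(job / "cgroup.procs").write_text(str(os.getpid()))
```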
The memory controller in v2 has evolved significantly:
- `memory.max` - hard limit; OOM kill fires if exceeded
- `memory.high` - soft limit; the process is throttled (reclaim triggered) but not killed
- `memory.swap.max` - swap usage limit; set to 0 to disable swap for latency-sensitive jobs
- `memory.current` - current usage in bytes
- `memory.stat` - detailed breakdown (anon, file, shmem, kernel, etc.)
For ML workloads, memory.high is a critical safety valve. Set it to 85-90% of memory.max. When a training job starts leaking memory it gets throttled before the OOM killer fires. This gives your monitoring system time to detect the issue and alert before data loss occurs.
import os
import pathlib
CGROUP_V2_ROOT = pathlib.Path("/sys/fs/cgroup")
def is_cgroup_v2() -> bool:
"""Check whether this system uses cgroups v2 (unified hierarchy)."""
with open("/proc/filesystems") as f:
return "cgroup2" in f.read()
def get_current_cgroup() -> str:
"""Read the cgroup path of the current process.
cgroup v2 entry is prefixed with '0::' and contains the path
relative to the cgroup v2 root at /sys/fs/cgroup.
"""
with open("/proc/self/cgroup") as f:
for line in f:
if line.startswith("0::"): # v2: single line "0::/<path>"
return line.strip().split("::", 1)[1]
return "unknown"
def read_memory_stats(cgroup_path: str) -> dict:
"""Read memory statistics from a cgroup v2 path.
Useful for monitoring containers and training jobs from outside.
"""
base = CGROUP_V2_ROOT / cgroup_path.lstrip("/")
stats = {}
try:
current = (base / "memory.current").read_text().strip()
stats["current_bytes"] = int(current)
stats["current_mb"] = round(int(current) / (1024 ** 2), 1)
max_val = (base / "memory.max").read_text().strip()
stats["max_bytes"] = int(max_val) if max_val != "max" else -1
high_val = (base / "memory.high").read_text().strip()
stats["high_bytes"] = int(high_val) if high_val != "max" else -1
# memory.stat has detailed breakdown
stat_text = (base / "memory.stat").read_text()
key_stats = ["anon", "file", "shmem", "kernel", "pgfault", "pgmajfault"]
for line in stat_text.splitlines():
parts = line.split()
if len(parts) == 2 and parts[0] in key_stats:
stats[f"stat_{parts[0]}"] = int(parts[1])
except FileNotFoundError as e:
stats["error"] = str(e)
return stats
def set_memory_limits(
cgroup_path: str,
max_mb: int,
high_mb: int | None = None,
swap_mb: int = 0,
) -> None:
"""Set memory limits on a cgroup. Requires write access (root or delegation).
Best practice for ML training jobs:
- Set high to 85-90% of max (throttle buffer before OOM kill)
- Set swap to 0 (swap causes latency spikes that corrupt timing)
"""
base = CGROUP_V2_ROOT / cgroup_path.lstrip("/")
max_bytes = max_mb * 1024 * 1024
(base / "memory.max").write_text(str(max_bytes))
if high_mb is None:
high_mb = int(max_mb * 0.87)
high_bytes = high_mb * 1024 * 1024
(base / "memory.high").write_text(str(high_bytes))
# Disable swap for ML workloads - swap causes stalls that look like hanging
if swap_mb == 0:
(base / "memory.swap.max").write_text("0")
else:
(base / "memory.swap.max").write_text(str(swap_mb * 1024 * 1024))
print(f"Memory limits set on {cgroup_path}:")
print(f" memory.max = {max_mb} MB")
print(f" memory.high = {high_mb} MB (throttle threshold)")
print(f" memory.swap = {swap_mb} MB")
def set_cpu_quota(cgroup_path: str, cpu_cores: float) -> None:
"""Set CPU quota. cpu_cores=4.0 means 4 full CPU cores.
cpu.max format: "<quota_us> <period_us>"
Default period is 100ms (100000 us). Setting quota to N*period
gives N virtual CPUs worth of time.
"""
period_us = 100_000
quota_us = int(period_us * cpu_cores)
base = CGROUP_V2_ROOT / cgroup_path.lstrip("/")
(base / "cpu.max").write_text(f"{quota_us} {period_us}")
print(f"CPU quota: {cpu_cores} cores ({quota_us}us / {period_us}us period)")
def set_io_weight(cgroup_path: str, weight: int = 100) -> None:
"""Set I/O weight for a cgroup (1-10000, default 100).
Higher weight = more I/O bandwidth during contention.
For training data loading, set to 200-500 relative to serving.
"""
base = CGROUP_V2_ROOT / cgroup_path.lstrip("/")
(base / "io.weight").write_text(f"default {weight}")
print(f"I/O weight set to {weight} on {cgroup_path}")
# Example: configure a training job cgroup
# set_memory_limits("/ml-jobs/team-a/train-bert", max_mb=32768)
# set_cpu_quota("/ml-jobs/team-a/train-bert", cpu_cores=16.0)
# set_io_weight("/ml-jobs/team-a/train-bert", weight=300)
Overlay Filesystems - The Union Layer
An overlay filesystem (overlayfs) composes multiple directory trees into a single unified view. It has three logical components:
- Lower layer (read-only): The container image layers. These are immutable and shared between all containers running the same image. A 10 GB PyTorch image shared by 50 training containers consumes 10 GB of storage, not 500 GB.
- Upper layer (read-write): The container's writable layer. All writes go here. This is discarded when the container is removed unless committed to a new image.
- Work directory: Required by the kernel's overlayfs implementation for atomic rename operations. Usually at the same level as the upper directory.
When a container reads a file: overlayfs checks the upper layer first. If not found, it checks lower layers top-to-bottom. The lower layers are the stacked Docker image layers (each RUN, COPY, ADD in a Dockerfile creates a layer).
When a container writes a file that only exists in a lower layer: overlayfs performs a "copy-up" operation. It copies the entire file from the lower layer to the upper layer, then modifies the copy. The lower layer is never modified. This is why modifying a large file inside a container's root filesystem is expensive: the entire original file must be copied up before the first byte can be changed.
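You can reproduce the mechanism by hand to make it concrete. A sketch that builds a two-layer overlay and triggers a copy-up (requires root; /tmp/ovl-demo is an arbitrary path):

```python
import pathlib
import subprocess

root = pathlib.Path("/tmp/ovl-demo")
for d in ("lower", "upper", "work", "merged"):
    (root / d).mkdir(parents=True, exist_ok=True)

# A 1 MB file in the read-only lower layer (stands in for an image layer)
(root / "lower" / "model.cfg").write_bytes(b"x" * 1024 * 1024)

subprocess.run([
    "mount", "-t", "overlay", "overlay",
    "-o", f"lowerdir={root/'lower'},upperdir={root/'upper'},workdir={root/'work'}",
    str(root / "merged"),
], check=True)

# Appending a single byte forces the full 1 MB copy-up into upper/ first
with open(root / "merged" / "model.cfg", "ab") as f:
    f.write(b"y")
print("copied up:", [p.name for p in (root / "upper").iterdir()])
```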
Practical implications for ML:
- Never write large checkpoints or model outputs to paths inside the container's root filesystem. Use mounted volumes. The upper layer is stored in `/var/lib/docker/overlay2/` and has no special performance characteristics - it is just a directory on the host filesystem.
- When building images, order Dockerfile layers from least- to most-frequently-changed. The base OS and Python version change rarely; your training script changes daily. This ordering maximizes Docker's layer cache reuse.
- Overlayfs has a maximum stacking depth for lower layers, and very deep image stacks (50+ layers) can approach it. Use multi-stage builds (or the experimental `docker build --squash` flag) to flatten layer count for production images.
Container Runtime - runc and containerd
The container runtime stack has two levels that solve different problems:
runc (low-level runtime): Implements the OCI runtime specification. Given an OCI bundle (a directory with a config.json describing the container configuration and a rootfs/ directory for the filesystem), runc calls the appropriate Linux syscalls: clone() with namespace flags, unshare() for additional namespaces, pivot_root() to make the overlayfs mount the container's root, and writes to cgroup files to set resource limits. runc is a one-shot process - it sets up the container, execs the container's init process, and exits. The container runs independently.
containerd (high-level runtime): Manages the full container lifecycle - image pulling and unpacking, container creation and deletion, execution, networking, and snapshotting. containerd calls runc via a shim process (containerd-shim). The shim stays alive for the container's entire lifetime: it holds the container's stdio, reports exit codes to containerd, and ensures the container can outlive a containerd restart. Docker and Kubernetes both use containerd; they are clients of its gRPC API.
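The division of labor is easy to see because runc consumes nothing but an OCI bundle. A sketch that builds a minimal bundle by hand (assumes runc is installed; /tmp/oci-demo is arbitrary, and rootfs/ must be populated with a real root filesystem, e.g. an exported image):

```python
import json
import pathlib
import subprocess

bundle = pathlib.Path("/tmp/oci-demo")
(bundle / "rootfs").mkdir(parents=True, exist_ok=True)  # fill with a real rootfs

# 'runc spec' writes a default config.json into the bundle directory
subprocess.run(["runc", "spec"], cwd=bundle, check=True)

cfg = json.loads((bundle / "config.json").read_text())
cfg["process"]["args"] = ["/bin/echo", "hello from the container"]
# The namespaces runc will clone() into existence:
print([ns["type"] for ns in cfg["linux"]["namespaces"]])
(bundle / "config.json").write_text(json.dumps(cfg, indent=2))

# Then: sudo runc run --bundle /tmp/oci-demo demo-container
```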
GPU Containers - nvidia-container-toolkit
NVIDIA GPUs require kernel drivers and userspace libraries that must be accessible inside containers. The naive approach - install CUDA inside every container image - creates huge images and tightly couples the image to a specific driver version. The nvidia-container-toolkit solves this with a cleaner separation:
How it works: The toolkit installs a custom OCI hook that runs before runc execs the container's init process. This hook inspects the container's requested GPU devices, then injects:
- Device bindings for `/dev/nvidia0`, `/dev/nvidiactl`, `/dev/nvidia-uvm`
- Mount bindings for the host's driver userspace libraries (libcuda.so, libnvidia-ml.so, etc.) onto the container's library search path
- Environment variables describing the available GPUs
The container image only needs CUDA headers and stub libraries. The actual NVIDIA driver (with its specific version) is mounted in at runtime from the host. Upgrading the host driver upgrades the driver seen by all containers without rebuilding a single image.
# Correct production approach: CUDA base image (headers + stubs, no driver)
FROM nvcr.io/nvidia/pytorch:24.01-py3
# The image provides CUDA headers and PyTorch with CUDA support.
# The actual libcuda.so driver is NOT in this image - it is mounted
# from the host by nvidia-container-toolkit at runtime.
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY train.py .
# Handle signals correctly as PID 1 in the container namespace
# Use exec form (JSON array) so the process is PID 1 directly,
# not wrapped in a shell that ignores signals.
ENTRYPOINT ["python", "train.py"]
import subprocess
import json
def run_gpu_container(
image: str,
script: str,
gpu_indices: list[int],
memory_gb: int = 16,
cpu_cores: int = 8,
) -> subprocess.Popen:
"""Launch a GPU training container with proper resource limits.
Key flags:
--gpus: requests specific GPUs via nvidia-container-toolkit
--memory: cgroup v2 memory.max (CPU RAM, not GPU VRAM)
--cpus: cgroup v2 cpu.max
--shm-size: /dev/shm size for DataLoader shared memory workers
--ipc=host: share host IPC namespace (needed for some NCCL configs)
"""
gpu_spec = ",".join(str(i) for i in gpu_indices)
cmd = [
"docker", "run", "--rm",
f"--gpus=device={gpu_spec}",
f"--memory={memory_gb}g",
f"--memory-swap={memory_gb}g", # disable swap
f"--cpus={cpu_cores}",
"--shm-size=16g", # /dev/shm for DataLoader workers
"--ulimit", "memlock=-1", # required for GPU pinned memory
"--ulimit", "stack=67108864",
"-v", "/fast-storage/data:/data:ro",
"-v", "/fast-storage/checkpoints:/checkpoints",
image,
"python", script,
]
print("Launching container:")
print(" " + " ".join(cmd))
return subprocess.Popen(cmd)
def check_gpu_visibility_in_container(container_id: str) -> None:
"""Verify that a running container can see its assigned GPUs."""
result = subprocess.run(
["docker", "exec", container_id, "nvidia-smi", "--query-gpu=index,name,memory.total",
"--format=csv,noheader"],
capture_output=True, text=True
)
if result.returncode == 0:
print("GPUs visible to container:")
for line in result.stdout.strip().splitlines():
print(f" {line}")
else:
print(f"nvidia-smi failed: {result.stderr}")
seccomp and Linux Capabilities - The Security Layer
Linux capabilities break root's omnipotence into discrete privileges. Instead of a process being "root" (all-powerful) or "not root" (powerless), capabilities allow a process to hold only the specific privileges it needs. There are approximately 40 capabilities in modern Linux kernels.
For an ML inference container, the capabilities it needs are essentially none. It reads model files, runs matrix multiplications, writes results to a network socket. It does not need:
- `CAP_SYS_MODULE` - load kernel modules
- `CAP_SYS_TIME` - change the system time
- `CAP_NET_RAW` - raw network socket access
- `CAP_SYS_ADMIN` - broad system administration operations
The principle of least privilege applied to containers: drop all capabilities by default, add back only what is specifically required.
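You can verify what a process actually holds by decoding the CapEff bitmask in /proc. A small sketch (the name table is deliberately partial; the bit numbers come from linux/capability.h):

```python
def effective_capabilities(pid: str = "self") -> int:
    """Return the CapEff bitmask from /proc/<pid>/status."""
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("CapEff:"):
                return int(line.split()[1], 16)
    return 0

CAP_NAMES = {0: "CAP_CHOWN", 12: "CAP_NET_ADMIN", 13: "CAP_NET_RAW",
             16: "CAP_SYS_MODULE", 21: "CAP_SYS_ADMIN", 25: "CAP_SYS_TIME"}

caps = effective_capabilities()
held = [n for bit, n in CAP_NAMES.items() if caps & (1 << bit)]
print(f"CapEff = {caps:#x}; holds (of those checked): {held or 'none'}")
# In a container started with --cap-drop=ALL, CapEff reads as 0.
```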
seccomp (Secure Computing Mode) provides syscall filtering. A BPF program runs before every syscall and can allow, deny with ERRNO, or SIGKILL the process. Docker's default seccomp profile blocks approximately 44 syscalls that containers almost never need. For hardened ML inference containers, write a custom allowlist profile:
{
"defaultAction": "SCMP_ACT_ERRNO",
"architectures": ["SCMP_ARCH_X86_64", "SCMP_ARCH_X86", "SCMP_ARCH_X32"],
"syscalls": [
{
"comment": "Syscalls needed for Python ML inference workload",
"names": [
"read", "write", "open", "openat", "close", "stat", "fstat",
"lstat", "poll", "lseek", "mmap", "mprotect", "munmap",
"brk", "rt_sigaction", "rt_sigprocmask", "rt_sigreturn",
"ioctl", "pread64", "pwrite64", "readv", "writev", "access",
"pipe", "select", "sched_yield", "mremap", "msync", "madvise",
"shmget", "shmat", "shmctl", "dup", "dup2", "nanosleep",
"getpid", "socket", "connect", "accept", "sendto", "recvfrom",
"sendmsg", "recvmsg", "shutdown", "bind", "listen",
"getsockname", "getpeername", "socketpair",
"setsockopt", "getsockopt", "clone", "fork", "vfork",
"execve", "exit", "wait4", "kill", "uname", "fcntl",
"flock", "fsync", "fdatasync", "truncate", "ftruncate",
"getdents", "getdents64", "getcwd", "chdir", "rename",
"mkdir", "rmdir", "unlink", "symlink", "readlink",
"chmod", "fchmod", "gettimeofday", "getrlimit", "getrusage",
"sysinfo", "getuid", "getgid", "getppid",
"futex", "sched_getaffinity", "epoll_create", "epoll_create1",
"epoll_ctl", "epoll_wait", "epoll_pwait",
"set_tid_address", "clock_gettime", "clock_nanosleep",
"exit_group", "tgkill", "openat2", "statx",
"io_uring_setup", "io_uring_enter", "io_uring_register",
"getrandom", "memfd_create", "copy_file_range",
"prlimit64", "sendfile"
],
"action": "SCMP_ACT_ALLOW"
}
]
}
import subprocess

def run_hardened_inference_container(
image: str,
model_path: str,
seccomp_profile_path: str,
) -> subprocess.CompletedProcess:
"""Launch a hardened inference container with minimal permissions.
Security posture:
- All Linux capabilities dropped
- Custom syscall allowlist via seccomp
- Read-only root filesystem (no writes to image layers)
- No privilege escalation via setuid binaries
- Memory limit prevents OOM spreading to neighbors
- PID limit prevents fork bombs
"""
cmd = [
"docker", "run", "--rm",
"--cap-drop=ALL",
f"--security-opt=seccomp={seccomp_profile_path}",
"--read-only",
"--security-opt=no-new-privileges",
"--memory=8g",
"--memory-swap=8g",
"--cpus=4",
"--pids-limit=100",
# Writable tmpfs for temp files (since root fs is read-only)
"--tmpfs=/tmp:size=512m,noexec",
"-v", f"{model_path}:/model:ro",
"-v", "/fast-storage/inference-output:/output",
image,
"python", "serve.py", "--model", "/model"
]
return subprocess.run(cmd, capture_output=True, text=True)
Rootless Containers
Traditional Docker requires a root-privileged daemon (dockerd). Any container escape gives an attacker host root. Rootless containers (Podman, rootless Docker) use user namespaces to allow non-root users to create and run containers.
The mechanism: a user namespace maps the container's UID 0 (root) to the user's real UID (e.g., 1000) on the host. Inside the container, the process sees itself as root and can bind to port 80, install packages, and write to root-owned paths. From the host's perspective, it is running as UID 1000 with no elevated privileges. An escape from the container gives the attacker only UID 1000 access.
The UID mapping is stored in /proc/<pid>/uid_map and /proc/<pid>/gid_map and enforced by the kernel for every filesystem access check.
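Reading the mapping is a one-liner worth knowing when debugging rootless setups:

```python
def print_uid_map(pid: str = "self") -> None:
    """Each /proc/<pid>/uid_map line is: <inside-uid> <outside-uid> <count>."""
    with open(f"/proc/{pid}/uid_map") as f:
        for line in f:
            inside, outside, count = line.split()
            print(f"container UID {inside} -> host UID {outside} (range {count})")

print_uid_map()
# Outside any user namespace this prints the identity mapping:
#   container UID 0 -> host UID 0 (range 4294967295)
# Inside a rootless container it prints something like 0 -> 1000.
```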
Limitation for ML GPU workloads: NVIDIA GPU device nodes (/dev/nvidia*) require elevated permissions to access. Rootless support in the nvidia-container-toolkit is improving but not yet uniformly production-ready; it varies with driver and toolkit versions. For GPU training and serving, most production deployments still use a privileged daemon (with carefully managed capabilities) rather than fully rootless containers.
BuildKit and Multi-Stage Builds for ML
ML images are large. A naive Dockerfile installs everything in one layer and produces a 15-20 GB image. Multi-stage builds separate the build environment (with compilers, headers, build tools) from the runtime image (which only needs what runs at inference time).
# syntax=docker/dockerfile:1.7
# Stage 1: Build stage - has compilers and build tools (large, discarded)
FROM python:3.11-slim AS builder
RUN apt-get update && apt-get install -y --no-install-recommends \
gcc g++ cmake libffi-dev libssl-dev \
&& rm -rf /var/lib/apt/lists/*
WORKDIR /build
COPY requirements.txt .
# Install to /install prefix so we can copy selectively
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt
# Stage 2: Distroless runtime image (minimal attack surface, small size)
# gcr.io/distroless/python3 contains ONLY the Python interpreter.
# No bash. No shell. No package manager. No curl or wget.
# An attacker who exploits your inference service cannot pivot further.
FROM gcr.io/distroless/python3-debian12
# Copy only the installed packages from the builder stage
COPY --from=builder /install /usr/local
# Debian-derived Python images resolve dist-packages, not site-packages,
# under /usr/local by default, so point the interpreter at the copied
# packages explicitly (adjust the version to match the base image).
ENV PYTHONPATH=/usr/local/lib/python3.11/site-packages
WORKDIR /app
# Model artifacts should NOT be copied here - mount them via volume.
# Only copy the inference server code.
COPY --chown=nonroot:nonroot model_server.py .
COPY --chown=nonroot:nonroot tokenizer_config.json .
# Run as non-root even within distroless
USER nonroot
# Must use exec form (JSON array) with distroless - no shell available
ENTRYPOINT ["python3", "model_server.py"]
Why distroless for ML inference specifically: Production inference containers never need bash, curl, pip, or any shell. A distroless image cuts the attack surface drastically. If an attacker turns a bug in your inference service into remote code execution inside a distroless container, they cannot run wget to pull a second-stage payload and cannot install tools - the lateral movement path is dead. The image is also several times smaller than a full Python image, which means faster container startup and less storage pressure.
Kubernetes Pod Isolation Model
A Kubernetes pod is a group of containers that share a network namespace and optionally a PID namespace. They run on the same node and communicate via localhost. Each pod gets its own unique IP address from the pod CIDR. Containers within a pod share that IP.
cgroup hierarchy in Kubernetes: kubepods -> QoS tier -> pod-<uid> -> container. Guaranteed pods sit directly under kubepods; Burstable and BestEffort pods live under their respective QoS subtrees. The QoS class determines placement in this hierarchy and how the pod is treated during resource pressure.
Three QoS classes:
- Guaranteed: requests == limits for all resources. Highest protection from OOM kill. Use for latency-sensitive inference servers.
- Burstable: requests < limits. Can use more resources if available. Use for training jobs that can tolerate some variability.
- BestEffort: no requests or limits specified. Killed first during memory pressure. Never use for ML workloads in production.
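The classification rule itself is simple enough to sketch (simplified; the real kubelet logic also handles per-container and per-resource partial specs):

```python
def qos_class(containers: list[dict]) -> str:
    """Simplified Kubernetes QoS classification.
    Each dict looks like: {"requests": {"cpu": "8"}, "limits": {"cpu": "8"}}.
    """
    if not any(c.get("requests") or c.get("limits") for c in containers):
        return "BestEffort"
    guaranteed = all(
        c.get("requests") and c.get("limits")
        and c["requests"] == c["limits"]
        and {"cpu", "memory"} <= set(c["requests"])
        for c in containers
    )
    return "Guaranteed" if guaranteed else "Burstable"

print(qos_class([{"requests": {"cpu": "8", "memory": "32Gi"},
                  "limits": {"cpu": "8", "memory": "32Gi"}}]))  # Guaranteed
print(qos_class([{"requests": {"cpu": "4"}, "limits": {"cpu": "8"}}]))  # Burstable
print(qos_class([{}]))  # BestEffort
```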
# Kubernetes pod spec - properly configured for GPU training
apiVersion: v1
kind: Pod
metadata:
name: bert-finetuning-job
namespace: ml-team-a
labels:
job-type: training
model: bert-large
spec:
# Restart policy for training - only restart on failure, not after completion
restartPolicy: OnFailure
# Terminate gracefully - give the job time to save checkpoint
terminationGracePeriodSeconds: 300
containers:
- name: trainer
image: nvcr.io/nvidia/pytorch:24.01-py3
command: ["python", "finetune_bert.py"]
args: ["--checkpoint-dir", "/checkpoints", "--data-dir", "/data"]
resources:
requests:
memory: "32Gi"
cpu: "8"
nvidia.com/gpu: "2"
limits:
# Set limits == requests for Guaranteed QoS class
memory: "32Gi"
cpu: "8"
nvidia.com/gpu: "2"
env:
- name: NCCL_SOCKET_IFNAME
value: "eth0"
- name: NCCL_DEBUG
value: "WARN"
- name: OMP_NUM_THREADS
value: "4"
volumeMounts:
- name: training-data
mountPath: /data
readOnly: true
- name: checkpoints
mountPath: /checkpoints
- name: dshm
mountPath: /dev/shm # Override default 64MB shm for DataLoader workers
securityContext:
allowPrivilegeEscalation: false
readOnlyRootFilesystem: false # false because PyTorch writes temp files
runAsNonRoot: true
runAsUser: 1000
capabilities:
drop: ["ALL"]
volumes:
- name: training-data
persistentVolumeClaim:
claimName: bert-training-data-pvc
- name: checkpoints
persistentVolumeClaim:
claimName: bert-checkpoints-pvc
- name: dshm
emptyDir:
medium: Memory
sizeLimit: "16Gi" # Override default 64MB /dev/shm
tolerations:
- key: nvidia.com/gpu
operator: Exists
effect: NoSchedule
nodeSelector:
accelerator: nvidia-a100-80gb
import os
import pathlib
def get_container_resource_limits() -> dict:
"""Read this container's own resource limits from its cgroup.
ML workloads should call this at startup to configure themselves
appropriately. PyTorch's memory allocator can be tuned based on
the known container memory limit.
"""
cgroup_path = ""
with open("/proc/self/cgroup") as f:
for line in f:
if line.startswith("0::"): # cgroup v2
cgroup_path = line.strip().split("::", 1)[1]
break
base = pathlib.Path("/sys/fs/cgroup") / cgroup_path.lstrip("/")
limits = {"cgroup_path": cgroup_path}
try:
mem_max = (base / "memory.max").read_text().strip()
if mem_max == "max":
limits["memory_max_bytes"] = None
limits["memory_max_gb"] = None
else:
limits["memory_max_bytes"] = int(mem_max)
limits["memory_max_gb"] = round(int(mem_max) / (1024 ** 3), 1)
cpu_max = (base / "cpu.max").read_text().strip()
if cpu_max == "max":
limits["cpu_cores"] = None
else:
quota, period = cpu_max.split()
limits["cpu_cores"] = round(int(quota) / int(period), 2)
# Read memory.stat for current usage breakdown
stat_text = (base / "memory.stat").read_text()
for line in stat_text.splitlines():
parts = line.split()
if len(parts) == 2 and parts[0] in ("anon", "file", "shmem"):
limits[f"current_{parts[0]}_bytes"] = int(parts[1])
except FileNotFoundError:
limits["error"] = "cgroup files not found (may be cgroup v1 or unconfined)"
return limits
def configure_pytorch_from_cgroup_limits() -> None:
"""Configure PyTorch memory allocator based on container cgroup limits.
If we know the container has 32 GB, we can tell PyTorch to not
fragment memory above 80% of that, preventing OOM kills near the limit.
"""
    limits = get_container_resource_limits()
if limits.get("memory_max_bytes"):
max_gb = limits["memory_max_gb"]
print(f"Container memory limit: {max_gb} GB")
        # Keep ~25% headroom below the cgroup limit. Caveat: this tunes the
        # CUDA caching allocator (GPU VRAM), which cgroups do not govern; the
        # CPU-RAM limit is used here only as a sizing heuristic.
        allocator_max_mb = int(limits["memory_max_bytes"] * 0.75 / (1024 * 1024))
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = (
f"max_split_size_mb:{allocator_max_mb},"
"garbage_collection_threshold:0.8,"
"expandable_segments:True"
)
print(f"Set PYTORCH_CUDA_ALLOC_CONF max_split_size_mb={allocator_max_mb}")
if limits.get("cpu_cores"):
cpu_cores = int(limits["cpu_cores"])
# DataLoader workers should not exceed available CPU cores
os.environ.setdefault("NUM_WORKERS", str(max(1, cpu_cores - 2)))
print(f"Container CPU limit: {limits['cpu_cores']} cores")
print(f"Recommended DataLoader workers: {os.environ['NUM_WORKERS']}")
Production Engineering Notes
Image layer optimization for ML: Keep your base image (CUDA + Python + system libs) as a separately-tagged image that rarely changes. Your model code and requirements.txt should be in the top layers. A build that only changes train.py should take seconds, not minutes. The Docker layer cache is your build speed multiplier.
Memory limit math for GPU workloads: GPU memory is managed by the NVIDIA driver, not by Linux cgroups. memory.max=32G limits CPU RAM only. PyTorch uses CPU RAM for gradient buffers, optimizer states (Adam's moment estimates are 2x the model size in float32), DataLoader worker shared memory (/dev/shm), and CPU-pinned memory for DMA transfers. If the cgroup limit is too low, pin_memory() calls fail, DataLoader workers get OOM-killed, and optimizer state offloading fails. A rough formula: memory.max >= (model_params * bytes_per_param * 6) + shm_size + 4GB_OS_overhead.
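That formula is easy to encode as a startup check. A sketch (the multiplier and overhead constants are this note's rules of thumb, not measured values):

```python
def recommended_memory_max_gb(
    params_billions: float,
    bytes_per_param: int = 4,     # float32 weights
    shm_size_gb: float = 16.0,    # /dev/shm for DataLoader workers
    os_overhead_gb: float = 4.0,
) -> float:
    """memory.max >= params * bytes_per_param * 6 + shm + OS overhead.
    The 6x multiplier folds in gradients, Adam moments, and staging buffers.
    """
    model_gb = params_billions * 1e9 * bytes_per_param / (1024 ** 3)
    return round(model_gb * 6 + shm_size_gb + os_overhead_gb, 1)

# A 1B-parameter float32 model (~3.7 GB of weights) -> ~42 GB of memory.max
print(recommended_memory_max_gb(1.0))
```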
Container startup latency for ML inference autoscaling: Cold start latency is your autoscaling Achilles heel. A 7B model that takes 60 seconds to load from a network-mounted PVC adds 60 seconds to every scale-out event. Mitigations: (1) use local NVMe storage (PVC with storageClassName: local-nvme) so the model is already on the node, (2) use model server frameworks like Triton or TorchServe that keep models warm in a ready pool, (3) pre-pull container images on all GPU nodes using a DaemonSet.
The copy-up trap: Copy-up proper only fires when a container modifies a file inherited from an image layer, but checkpoint writes inside the container's root filesystem are expensive regardless: every byte is funneled through the overlayfs upper layer under /var/lib/docker/overlay2/, on whatever disk backs the Docker data root, competing with every other container on the host. For a 10 GB checkpoint that is 10 GB of I/O to the wrong disk. Always bind-mount a dedicated volume for checkpoint output.
Namespace leak detection: If containers crash without proper cleanup, network namespace files can be held open by orphaned processes, preventing kernel garbage collection. The symptom is accumulating veth devices on the host. Monitor with ip link | grep veth | wc -l and alert if this count grows unbounded between container lifecycles.
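A sketch of that check as a script you could run on a schedule (the slack threshold is illustrative):

```python
import subprocess

def count_veth_devices() -> int:
    out = subprocess.run(["ip", "-o", "link", "show", "type", "veth"],
                         capture_output=True, text=True, check=True).stdout
    return len(out.splitlines())

def count_running_containers() -> int:
    out = subprocess.run(["docker", "ps", "-q"],
                         capture_output=True, text=True, check=True).stdout
    return len(out.split())

veths, containers = count_veth_devices(), count_running_containers()
# Each container normally owns one host-side veth; a growing surplus
# suggests network namespaces leaked by uncleaned crashes.
if veths > containers + 5:  # illustrative slack
    print(f"ALERT: {veths} veth devices for {containers} running containers")
```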
Common Mistakes
:::danger Setting memory.max Without memory.high
Setting only memory.max means the OOM killer fires with no warning the instant the process exceeds the limit. For ML workloads that have variable memory usage, set memory.high to 85-90% of memory.max. When the workload approaches the limit, Linux starts reclaiming memory and throttling allocations. Your monitoring system gets time to detect and alert. The training job slows down instead of dying. The OOM kill becomes a last resort, not the first response.
:::
:::danger Running ML Containers as Root with --privileged
docker run --privileged grants the container every capability, disables the seccomp and AppArmor/SELinux confinement, and exposes all host devices. The namespaces still technically exist, but the isolation they provide is effectively nullified: the container can mount host paths, load kernel modules, and modify iptables. This is appropriate only for container runtime development and testing, never for production ML workloads. If you think you need --privileged for GPU access, you are wrong - the nvidia-container-toolkit handles this correctly without it. Audit all containers in your cluster for the Privileged: true flag.
:::
:::warning Storing Model Weights in Container Image Layers
Never COPY large model weight files (anything over 500 MB) into a Docker image. A 7B parameter model at bfloat16 is ~14 GB. Every docker pull downloads 14 GB. Every pushed version stores a new 14 GB layer. Layer storage in your registry balloons. Instead: store model weights on object storage (S3, GCS, Azure Blob) or a shared PVC, download at container startup, or mount via a volume. The container image should be small (1-4 GB) and fast to pull.
:::
:::warning Ignoring cgroup v1 vs v2 Differences
If your infrastructure mixes kernel versions (some nodes 4.x, others 5.x+), some hosts use cgroup v1, others v2. The paths and file names differ: v1 is /sys/fs/cgroup/memory/<path>/memory.limit_in_bytes, v2 is /sys/fs/cgroup/<path>/memory.max. Python code that reads cgroup stats must detect which version is active. Check /proc/filesystems for cgroup2 or examine whether /sys/fs/cgroup/cgroup.controllers exists. Failing to detect this leads to incorrect memory monitoring that shows "no limit" even when limits are set.
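A version-aware limit reader, following the detection advice above (a sketch; assumes it runs inside the container whose limit it reports):

```python
import pathlib

def container_memory_limit_bytes() -> int | None:
    """Return the memory limit under cgroup v1 or v2, or None if unlimited."""
    if pathlib.Path("/sys/fs/cgroup/cgroup.controllers").exists():  # v2
        raw = pathlib.Path("/sys/fs/cgroup/memory.max").read_text().strip()
        return None if raw == "max" else int(raw)
    # v1: the memory controller lives in its own hierarchy
    raw = pathlib.Path(
        "/sys/fs/cgroup/memory/memory.limit_in_bytes").read_text().strip()
    limit = int(raw)
    return None if limit >= 2**62 else limit  # v1 encodes "no limit" as a huge value

print(container_memory_limit_bytes())
```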
:::
Interview Questions
Q1: What is the difference between a container and a virtual machine at the kernel level?
A VM runs a complete guest kernel on top of a hypervisor. The hypervisor virtualizes the hardware (CPU rings, memory management, device I/O). The guest kernel manages its own memory, scheduler, and device drivers in isolation. An exploit in the guest kernel does not directly reach the host kernel. A container shares the host kernel. There is no guest kernel to boot. Isolation is achieved through Linux namespaces (separate views of kernel resources) and cgroups (resource limits). Containers start in milliseconds because there is no kernel to boot, no firmware to initialize, no hardware enumeration. The density is 10-100x higher than VMs. The isolation is weaker: a kernel exploit (e.g., a namespace escape CVE) affects all containers simultaneously. For ML workloads, containers are the right choice for performance and density; VMs are sometimes layered underneath for stronger multi-tenant boundaries (VM per team, containers within each VM).
Q2: Walk me through what happens when you run docker run --gpus 1 pytorch/pytorch train.py.
- Docker CLI parses the command and sends an API request to dockerd over its Unix socket.
- dockerd checks whether `pytorch/pytorch` is in the local image store; if not, it pulls the image layer by layer from Docker Hub, extracting each layer to `/var/lib/docker/overlay2/`.
- dockerd calls containerd to create the container, providing the image manifest and runtime config (including `--gpus 1`).
- containerd creates an overlayfs mount: lower layers from the image, upper layer a new empty writable directory.
- containerd spawns a containerd-shim process that persists for the container's lifetime.
- The shim invokes runc with the OCI bundle (config.json plus the rootfs path).
- Because `--gpus 1` was specified, the nvidia-container-runtime OCI hook fires before runc execs the init process. The hook discovers available GPUs, selects one, and injects device bindings (`/dev/nvidia0`, etc.) and driver library mounts into the container config.
- runc calls `clone()` with `CLONE_NEWPID | CLONE_NEWNET | CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC` to create the container namespaces.
- runc writes the cgroup resource limits to files under `/sys/fs/cgroup/`.
- runc calls `pivot_root()` to make the overlayfs mount the container's `/`.
- runc execs `python train.py` as PID 1 of the new PID namespace.
Q3: How do cgroups v2 memory limits interact with PyTorch's memory management?
cgroups v2 memory.max limits CPU RAM (host memory managed by the Linux kernel). PyTorch's CUDA allocator manages GPU VRAM independently via the CUDA runtime - this is completely invisible to cgroups. However, PyTorch uses significant CPU RAM for: Adam optimizer states (2 float32 tensors per parameter = 8 bytes per parameter), DataLoader worker processes (each worker forks and copies the dataset object into its address space), CPU-pinned memory buffers for async DMA transfers (pin_memory=True), and gradient accumulation buffers. For a 1B parameter model with Adam in float32, optimizer states alone consume 8 GB of CPU RAM. Set the cgroup limit to at least: (model_params * 8 bytes) + (num_workers * dataset_size_in_ram) + 4 GB OS overhead. Set memory.high to 85% of that value to get throttling before OOM.
Q4: What is a rootless container, how does it work, and when can you not use it for ML?
Rootless containers use user namespaces to map container UID 0 to the user's real UID on the host. Inside the container the process sees itself as root and can perform root-privileged operations within the container's namespace. Outside, the process has only the user's privileges. An attacker who escapes the container gets the user's access, not root access. The mapping is stored in /proc/<pid>/uid_map. Limitations for ML: (1) NVIDIA GPU device nodes require specific permissions that rootless containers cannot obtain without additional host-level configuration. The toolkit has improving rootless support but it varies by driver version and distribution. (2) Network plugins may require root for veth pair creation. (3) cgroup v1 delegation requires root. For CPU-only ML workloads (preprocessing, feature engineering, serving small models on CPU), rootless containers are fully suitable. For GPU training and serving at scale, most production deployments still use privileged daemons with careful capability management rather than fully rootless containers.
Q5: Your cluster has 40 A100 GPUs across 10 nodes. How do you prevent one team from monopolizing GPU resources?
Three-layer enforcement: (1) Kubernetes namespace ResourceQuota: each team gets a dedicated namespace with ResourceQuota capping nvidia.com/gpu to their allocation. Kubernetes enforces this at scheduling time - the pod is rejected if the quota would be exceeded. (2) LimitRange: define default GPU requests and limits so pods without explicit specs get a reasonable default allocation rather than 0 (which would allow the pod to land on a GPU node without actually claiming any GPUs). (3) Priority classes: create PriorityClass objects with different preemption policies. Interactive inference services (high priority, no preemption) evict training jobs (low priority, preemptible) when GPU nodes are needed for serving. Additionally, NVIDIA MIG (Multi-Instance GPU) on A100s allows partitioning a single physical GPU into up to 7 isolated GPU instances, each with dedicated memory bandwidth and SM counts. This allows multi-tenant sharing of individual GPUs for smaller inference workloads without interference.
Q6: Why is the overlayfs copy-up mechanism a performance trap for ML checkpoint writes?
Overlayfs copy-up triggers when a container modifies a file that exists only in the lower (read-only image) layers. For a new file written to the container's root filesystem, there is no existing lower-layer version, so no copy-up occurs on first write. However, if a training script overwrites a file that came from the image (e.g., a config file), overlayfs must copy the entire original file from the lower layer to the upper layer before applying the modification. For checkpoint writes specifically, the issue is not copy-up but upper-layer I/O performance. All writes to the container's root filesystem go through the overlayfs upper layer, which is stored in /var/lib/docker/overlay2/. This adds filesystem metadata overhead, is subject to the host's filesystem performance, and may compete with other containers sharing the same host disk. In contrast, a mounted volume (-v /fast-nvme/checkpoints:/checkpoints) writes directly to the host filesystem at the device level, bypassing overlayfs entirely. The measured throughput difference is typically 2-5x for sequential checkpoint writes on NVMe storage.
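A quick experiment makes the gap visible on your own hardware (illustrative only; absolute numbers depend on the host filesystem, the Docker storage driver, and the disk):

```python
import subprocess
import time

def timed_write(path_in_container: str, extra_args: list[str]) -> float:
    """Write 1 GB at the given in-container path and return elapsed seconds."""
    start = time.monotonic()
    subprocess.run(
        ["docker", "run", "--rm", *extra_args, "ubuntu:22.04",
         "dd", "if=/dev/zero", f"of={path_in_container}/ckpt.bin",
         "bs=1M", "count=1024", "conv=fsync"],
        check=True, capture_output=True)
    return time.monotonic() - start

overlay_s = timed_write("/root", [])                       # overlay upper layer
volume_s = timed_write("/out", ["-v", "/tmp/bench:/out"])  # bind mount, no overlay
print(f"overlay: {overlay_s:.1f}s   volume: {volume_s:.1f}s")
```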
