Containers and Namespaces
Reading time: ~35 min · Interview relevance: High · Target roles: MLOps Engineer, Platform Engineer, AI Infrastructure
It is 2:00 AM on a Tuesday. Your model serving cluster hosts fifteen different teams, each running inference containers for their respective models. Team A runs a BERT-based text classifier. Team B runs a ResNet-50 image encoder. Team C is doing something exotic with a 7B-parameter quantized LLaMA model. All of these processes live on the same physical machines. The kernels they call into are identical. The hardware they touch is shared.
Then Team C's inference process starts consuming 47 GB of RAM because someone forgot to set a memory limit on a request that had a 200,000-token context window. Within thirty seconds the kernel's OOM killer starts shooting processes. It picks Team A's serving process because it has the largest resident set. Team A's SLA breach pages six engineers at 2:07 AM. An incident begins. By the time the postmortem is written, the root cause is clear: Team C's container had no memory limit set.
The fix was not architectural. It was operational. Team C's container should have had a cgroup memory limit set to 16 GB. Had that limit been in place, Team C's process would have been killed in isolation - only Team C's service would have experienced the error. Team A would have slept through the night. The incident would have been a five-minute ticket, not a forty-minute postmortem.
Containers are often presented as a deployment packaging story. "Ship your code with its dependencies." That is true but incomplete. The deeper story is about isolation guarantees. A container is a set of Linux kernel primitives - namespaces, cgroups, and a union filesystem - composed to create a process that believes it is alone on a machine, cannot accidentally harm its neighbors, and cannot consume more resources than it is allocated. Understanding these primitives is what separates engineers who run containers from engineers who design the platforms that run containers reliably at scale.
For ML workloads specifically the stakes are higher. Training jobs can consume entire machines for days. Inference services must coexist on dense multi-tenant clusters. GPU resources are scarce and expensive. Data pipelines churn through terabytes. When any of these workloads misbehave the blast radius must be contained - literally. The kernel mechanisms described in this lesson are what make that containment possible.
Why This Exists
Before Linux namespaces, process isolation required either virtual machines (heavyweight, slow to start, poor density) or a single shared OS with no isolation at all - everything could see and kill everything else. The chroot syscall from 1979 gave processes a fake filesystem root but provided no network, PID, or resource isolation. It was a hack, not a solution.
The problem namespace isolation solves: multiple workloads need to run on the same kernel without being able to observe each other's processes, network state, filesystem contents, or hostname. They need to think they are alone while actually sharing kernel resources. cgroups solve the complementary problem: even if processes are isolated from each other's view, they still share the same CPU time, RAM, and I/O bandwidth. Without limits, one workload starves the others.
The combination - namespaces for isolation of view, cgroups for isolation of resource consumption, and a union filesystem for isolation of storage - is what Docker packaged into a developer-friendly tool in 2013. But Docker itself is just a user-friendly interface over these kernel primitives that existed years before Docker was written.
Historical Context
Linux namespaces were introduced gradually over more than a decade. Mount namespaces arrived in 2002. UTS and IPC namespaces came in 2006. Network and PID namespaces followed in 2008. User namespaces - the last and most powerful type, enabling unprivileged container creation - became stable in kernel 3.8 in 2013. The namespace concept itself dates to Plan 9 from Bell Labs in the late 1980s, where the idea that every process should have its own view of the filesystem was fundamental to the OS design.
Control groups (cgroups) were contributed by Google engineers Paul Menage and Rohit Seth in 2006 and merged into Linux 2.6.24 in 2008. They were born from internal Google needs - their production clusters ran thousands of processes per machine and needed fine-grained resource accounting and limits. cgroups v2 (the unified hierarchy) was merged in Linux 4.5 (2016) and became the default in most distributions around 2019-2020.
Docker launched in March 2013 and became the first tool to make these primitives accessible to application developers without requiring kernel expertise. But Docker itself uses libcontainer (now runc), which directly calls the kernel namespace and cgroup APIs. The Open Container Initiative (OCI) standardized the container runtime specification in 2015, producing runc as the reference implementation and containerd as the higher-level daemon that manages the container lifecycle. Kubernetes deprecated its Docker-specific shim (dockershim) in v1.20 (December 2020) and removed it in v1.24 (2022), leaving containerd as the de facto default runtime.
Core Concepts
Linux Namespaces - The Isolation Layer
A namespace wraps a global kernel resource so that processes within the namespace see their own isolated copy. Modern kernels provide eight namespace types; the seven that container runtimes compose are:
| Namespace | Clone Flag | Isolates |
|---|---|---|
| PID | CLONE_NEWPID | Process IDs - container's init is PID 1 |
| NET | CLONE_NEWNET | Network interfaces, routes, iptables rules |
| MNT | CLONE_NEWNS | Mount points and filesystem tree |
| UTS | CLONE_NEWUTS | Hostname and NIS domain name |
| IPC | CLONE_NEWIPC | SysV IPC and POSIX message queues |
| USER | CLONE_NEWUSER | User and group IDs (enables rootless) |
| CGROUP | CLONE_NEWCGROUP | cgroup root directory view |
The key insight: namespaces are not separate OS instances. They are separate views into the same kernel. The PID namespace makes a process think it is PID 1, but the kernel still assigns it a real PID in the root namespace. A process in a net namespace sees only the virtual network interfaces assigned to it, but all packets still flow through the same physical NIC and kernel network stack.
PID namespace in depth: The first process created in a new PID namespace becomes PID 1 from the namespace's perspective; from the host, it has a completely different PID. If PID 1 inside the container exits, the kernel sends SIGKILL to all remaining processes in that namespace - this is the container's death signal. This is also why your container entrypoint must handle signals correctly: for PID 1, the kernel ignores any signal for which no handler is installed, so a naive entrypoint never acts on the SIGTERM from docker stop, and it is responsible for reaping orphans and propagating signals to its child processes.
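What that means in practice: a PID 1 entrypoint needs a signal handler and a reaping loop. Here is a minimal, illustrative sketch of such an init in Python (real deployments usually reach for tini or dumb-init rather than hand-rolling this):

```python
import os
import signal
import sys

# Minimal PID-1 sketch: exec the real workload as a child, forward
# termination signals to it, and reap every orphan in the namespace.
child_pid = os.fork()
if child_pid == 0:
    os.execvp(sys.argv[1], sys.argv[1:])   # the actual workload, e.g. train.py

def forward(signum: int, frame) -> None:
    os.kill(child_pid, signum)             # propagate SIGTERM/SIGINT

signal.signal(signal.SIGTERM, forward)
signal.signal(signal.SIGINT, forward)

while True:
    pid, status = os.waitpid(-1, 0)        # as PID 1, we reap all orphans
    if pid == child_pid:
        sys.exit(os.waitstatus_to_exitcode(status))
```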
Network namespace in depth: Each container gets its own loopback interface, its own routing table, and its own set of virtual ethernet (veth) interfaces. Docker creates veth pairs: one end goes into the container's net namespace, the other end connects to a bridge (docker0) in the host namespace. Traffic between the container and the outside world is routed through this bridge with NAT. This is why ifconfig inside a container shows different interfaces than the host.
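The plumbing is easy to reproduce by hand, which makes the mechanism less magical. A sketch of roughly what Docker does per container, driven from Python (requires root; all the names here - demo-ns, veth-host, veth-ctr, the 10.200.0.0/24 range - are made up for the example):

```python
import subprocess

def sh(cmd: str) -> None:
    subprocess.run(cmd.split(), check=True)

sh("ip netns add demo-ns")                                 # new network namespace
sh("ip link add veth-host type veth peer name veth-ctr")   # create a veth pair
sh("ip link set veth-ctr netns demo-ns")                   # move one end inside
sh("ip addr add 10.200.0.1/24 dev veth-host")
sh("ip link set veth-host up")
sh("ip netns exec demo-ns ip addr add 10.200.0.2/24 dev veth-ctr")
sh("ip netns exec demo-ns ip link set veth-ctr up")
sh("ip netns exec demo-ns ip link set lo up")
# Verify isolation plus connectivity:
#   ip netns exec demo-ns ping -c1 10.200.0.1
# Docker additionally attaches the host end to the docker0 bridge and adds NAT.
```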
Mount namespace in depth: The container's filesystem tree is independent from the host. The container can mount, unmount, and pivot_root without affecting the host. This is what makes overlayfs work correctly - the container's filesystem mount is entirely contained within its mount namespace.
User namespace in depth: Unprivileged user namespaces allow a non-root user to create a container where they appear as root (UID 0) inside the namespace but are an unprivileged user outside. The kernel maintains a UID mapping: inside the container UID 0 maps to outside UID 1000. This is the foundation of rootless containers. An important limitation: the mapped UID can only access files that the real UID can access on the host.
Inspecting Namespaces in Python
import os
import subprocess
import ctypes
def get_process_namespaces(pid: int) -> dict[str, str]:
"""Read namespace identifiers for a given PID.
Each namespace appears as a symlink in /proc/<pid>/ns/.
The inode number uniquely identifies the namespace instance.
Two processes with the same inode for a namespace type
share that namespace - they have a shared view of that resource.
"""
ns_dir = f"/proc/{pid}/ns"
namespaces = {}
try:
for ns_name in os.listdir(ns_dir):
ns_path = os.path.join(ns_dir, ns_name)
# readlink returns something like "net:[4026531992]"
link_target = os.readlink(ns_path)
ns_type, ns_inode = link_target.split(":[")
namespaces[ns_name] = ns_inode.rstrip("]")
except PermissionError:
print(f"No permission to read namespaces of PID {pid}")
return namespaces
def compare_namespaces(pid1: int, pid2: int) -> None:
"""Compare namespace membership of two processes.
Same inode = same namespace = shared resource view.
Different inode = isolated view of that resource.
"""
ns1 = get_process_namespaces(pid1)
ns2 = get_process_namespaces(pid2)
print(f"{'Namespace':<12} {'PID ' + str(pid1):<22} {'PID ' + str(pid2):<22} {'Shared?'}")
print("-" * 70)
for ns in sorted(set(ns1) | set(ns2)):
inode1 = ns1.get(ns, "N/A")
inode2 = ns2.get(ns, "N/A")
shared = "YES (same namespace)" if inode1 == inode2 else "NO (isolated)"
print(f"{ns:<12} {inode1:<22} {inode2:<22} {shared}")
def enter_namespace(pid: int, ns_type: str) -> None:
"""Enter the namespace of another process using setns().
This is what 'docker exec' does internally - it enters the
existing namespaces of the container's init process before
execing the new command.
Requires CAP_SYS_ADMIN or user namespace support.
"""
LIBC = ctypes.CDLL("libc.so.6", use_errno=True)
ns_path = f"/proc/{pid}/ns/{ns_type}"
fd = os.open(ns_path, os.O_RDONLY)
try:
ret = LIBC.setns(fd, 0)
if ret != 0:
errno = ctypes.get_errno()
raise OSError(errno, os.strerror(errno))
print(f"Entered {ns_type} namespace of PID {pid}")
finally:
os.close(fd)
def list_container_processes(container_id: str) -> list[dict]:
"""Use docker inspect to find container PID, then list all
processes that share the same PID namespace.
This shows the host-visible PIDs of all container processes,
which is useful for debugging and tracing.
"""
result = subprocess.run(
["docker", "inspect", "--format", "{{.State.Pid}}", container_id],
capture_output=True, text=True
)
container_pid = int(result.stdout.strip())
container_ns_inode = get_process_namespaces(container_pid).get("pid")
container_procs = []
for proc_dir in os.scandir("/proc"):
if not proc_dir.name.isdigit():
continue
try:
proc_pid = int(proc_dir.name)
proc_ns_inode = get_process_namespaces(proc_pid).get("pid")
if proc_ns_inode == container_ns_inode:
cmdline_path = f"/proc/{proc_pid}/cmdline"
with open(cmdline_path, "r") as f:
cmdline = f.read().replace("\x00", " ").strip()
container_procs.append({"host_pid": proc_pid, "cmd": cmdline})
except (PermissionError, FileNotFoundError, ValueError):
continue
return container_procs
# Example usage
print("=== Current process namespaces ===")
my_namespaces = get_process_namespaces(os.getpid())
for ns, inode in sorted(my_namespaces.items()):
print(f" {ns}: {inode}")
cgroups v1 vs v2 - The Resource Accounting Layer
Control groups impose resource limits and provide accounting for groups of processes. Every process in Linux belongs to exactly one cgroup in each hierarchy.
cgroups v1 problems: Each resource controller (memory, cpu, blkio, etc.) had its own independent hierarchy. A process could be in /sys/fs/cgroup/memory/team-a/job-1 for memory but in /sys/fs/cgroup/cpu/team-a/job-1 for CPU. These hierarchies were independent and inconsistent, making it hard to reason about the total resource allocation of a "container." Thread-level granularity was broken in subtle ways. The memory "soft limit" didn't work well in practice.
cgroups v2 uses a single unified hierarchy. All controllers attach to the same tree. This makes accounting consistent and enables the "no internal processes" rule - a cgroup that has child cgroups cannot itself have processes, preventing split-brain accounting.
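Creating and delegating a v2 cgroup is plain filesystem manipulation. A sketch (requires root; the /ml-jobs/train-demo path is hypothetical):

```python
import os
import pathlib

root = pathlib.Path("/sys/fs/cgroup")
job = root / "ml-jobs" / "train-demo"      # hypothetical hierarchy
job.mkdir(parents=True, exist_ok=True)

# Honor the "no internal processes" rule: enable controllers for children
# on the parent, keep processes only in leaf cgroups.
(root / "ml-jobs" / "cgroup.subtree_control").write_text("+cpu +memory")

# Move the current process into the leaf
(job / "cgroup.procs").write_text(str(os.getpid()))
```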
The memory controller in v2 has evolved significantly:
- `memory.max` - hard limit; OOM kill fires if exceeded
- `memory.high` - soft limit; the process is throttled (reclaim triggered) but not killed
- `memory.swap.max` - swap usage limit; set to 0 to disable swap for latency-sensitive jobs
- `memory.current` - current usage in bytes
- `memory.stat` - detailed breakdown (anon, file, shmem, kernel, etc.)
For ML workloads, memory.high is a critical safety valve. Set it to 85-90% of memory.max. When a training job starts leaking memory it gets throttled before the OOM killer fires. This gives your monitoring system time to detect the issue and alert before data loss occurs.
import os
import pathlib
CGROUP_V2_ROOT = pathlib.Path("/sys/fs/cgroup")
def is_cgroup_v2() -> bool:
"""Check whether this system uses cgroups v2 (unified hierarchy)."""
with open("/proc/filesystems") as f:
return "cgroup2" in f.read()
def get_current_cgroup() -> str:
"""Read the cgroup path of the current process.
cgroup v2 entry is prefixed with '0::' and contains the path
relative to the cgroup v2 root at /sys/fs/cgroup.
"""
with open("/proc/self/cgroup") as f:
for line in f:
if line.startswith("0::"): # v2: single line "0::/<path>"
return line.strip().split("::", 1)[1]
return "unknown"
def read_memory_stats(cgroup_path: str) -> dict:
"""Read memory statistics from a cgroup v2 path.
Useful for monitoring containers and training jobs from outside.
"""
base = CGROUP_V2_ROOT / cgroup_path.lstrip("/")
stats = {}
try:
current = (base / "memory.current").read_text().strip()
stats["current_bytes"] = int(current)
stats["current_mb"] = round(int(current) / (1024 ** 2), 1)
max_val = (base / "memory.max").read_text().strip()
stats["max_bytes"] = int(max_val) if max_val != "max" else -1
high_val = (base / "memory.high").read_text().strip()
stats["high_bytes"] = int(high_val) if high_val != "max" else -1
# memory.stat has detailed breakdown
stat_text = (base / "memory.stat").read_text()
key_stats = ["anon", "file", "shmem", "kernel", "pgfault", "pgmajfault"]
for line in stat_text.splitlines():
parts = line.split()
if len(parts) == 2 and parts[0] in key_stats:
stats[f"stat_{parts[0]}"] = int(parts[1])
except FileNotFoundError as e:
stats["error"] = str(e)
return stats
def set_memory_limits(
cgroup_path: str,
max_mb: int,
high_mb: int | None = None,
swap_mb: int = 0,
) -> None:
"""Set memory limits on a cgroup. Requires write access (root or delegation).
Best practice for ML training jobs:
- Set high to 85-90% of max (throttle buffer before OOM kill)
- Set swap to 0 (swap causes latency spikes that corrupt timing)
"""
base = CGROUP_V2_ROOT / cgroup_path.lstrip("/")
max_bytes = max_mb * 1024 * 1024
(base / "memory.max").write_text(str(max_bytes))
if high_mb is None:
high_mb = int(max_mb * 0.87)
high_bytes = high_mb * 1024 * 1024
(base / "memory.high").write_text(str(high_bytes))
# Disable swap for ML workloads - swap causes stalls that look like hanging
if swap_mb == 0:
(base / "memory.swap.max").write_text("0")
else:
(base / "memory.swap.max").write_text(str(swap_mb * 1024 * 1024))
print(f"Memory limits set on {cgroup_path}:")
print(f" memory.max = {max_mb} MB")
print(f" memory.high = {high_mb} MB (throttle threshold)")
print(f" memory.swap = {swap_mb} MB")
def set_cpu_quota(cgroup_path: str, cpu_cores: float) -> None:
"""Set CPU quota. cpu_cores=4.0 means 4 full CPU cores.
cpu.max format: "<quota_us> <period_us>"
Default period is 100ms (100000 us). Setting quota to N*period
gives N virtual CPUs worth of time.
"""
period_us = 100_000
quota_us = int(period_us * cpu_cores)
base = CGROUP_V2_ROOT / cgroup_path.lstrip("/")
(base / "cpu.max").write_text(f"{quota_us} {period_us}")
print(f"CPU quota: {cpu_cores} cores ({quota_us}us / {period_us}us period)")
def set_io_weight(cgroup_path: str, weight: int = 100) -> None:
"""Set I/O weight for a cgroup (1-10000, default 100).
Higher weight = more I/O bandwidth during contention.
For training data loading, set to 200-500 relative to serving.
"""
base = CGROUP_V2_ROOT / cgroup_path.lstrip("/")
(base / "io.weight").write_text(f"default {weight}")
print(f"I/O weight set to {weight} on {cgroup_path}")
# Example: configure a training job cgroup
# set_memory_limits("/ml-jobs/team-a/train-bert", max_mb=32768)
# set_cpu_quota("/ml-jobs/team-a/train-bert", cpu_cores=16.0)
# set_io_weight("/ml-jobs/team-a/train-bert", weight=300)
Overlay Filesystems - The Union Layer
An overlay filesystem (overlayfs) composes multiple directory trees into a single unified view. It has three logical components:
- Lower layer (read-only): The container image layers. These are immutable and shared between all containers running the same image. A 10 GB PyTorch image shared by 50 training containers consumes 10 GB of storage, not 500 GB.
- Upper layer (read-write): The container's writable layer. All writes go here. This is discarded when the container is removed unless committed to a new image.
- Work directory: Required by the kernel's overlayfs implementation for atomic rename operations. Usually at the same level as the upper directory.
When a container reads a file: overlayfs checks the upper layer first. If not found, it checks lower layers top-to-bottom. The lower layers are the stacked Docker image layers (each RUN, COPY, ADD in a Dockerfile creates a layer).
When a container writes a file that only exists in a lower layer: overlayfs performs a "copy-up" operation. It copies the entire file from the lower layer to the upper layer, then modifies the copy. The lower layer is never modified. This is why modifying a large file inside a container's root filesystem is expensive: the entire original file must be copied up before the first byte can be changed.
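You can reproduce the mechanism by hand to make it concrete. A sketch that builds a two-layer overlay and triggers a copy-up (requires root; /tmp/ovl-demo is an arbitrary path):

```python
import pathlib
import subprocess

root = pathlib.Path("/tmp/ovl-demo")
for d in ("lower", "upper", "work", "merged"):
    (root / d).mkdir(parents=True, exist_ok=True)

# A 1 MB file in the read-only lower layer (stands in for an image layer)
(root / "lower" / "model.cfg").write_bytes(b"x" * 1024 * 1024)

subprocess.run([
    "mount", "-t", "overlay", "overlay",
    "-o", f"lowerdir={root/'lower'},upperdir={root/'upper'},workdir={root/'work'}",
    str(root / "merged"),
], check=True)

# Appending a single byte forces the full 1 MB copy-up into upper/ first
with open(root / "merged" / "model.cfg", "ab") as f:
    f.write(b"y")
print("copied up:", [p.name for p in (root / "upper").iterdir()])
```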
Practical implications for ML:
- Never write large checkpoints or model outputs to paths inside the container's root filesystem. Use mounted volumes. The upper layer is stored in `/var/lib/docker/overlay2/` and has no special performance characteristics - it is just a directory on the host filesystem.
- When building images, order Dockerfile layers from least- to most-frequently-changed. The base OS and Python version change rarely; your training script changes daily. This ordering maximizes Docker's layer cache reuse.
- Overlayfs has a maximum stacking depth for lower layers, and very deep image stacks (50+ layers) can approach it. Use multi-stage builds (or the experimental `docker build --squash` flag) to flatten layer count for production images.
Container Runtime - runc and containerd
The container runtime stack has two levels that solve different problems:
runc (low-level runtime): Implements the OCI runtime specification. Given an OCI bundle (a directory with a config.json describing the container configuration and a rootfs/ directory for the filesystem), runc calls the appropriate Linux syscalls: clone() with namespace flags, unshare() for additional namespaces, pivot_root() to make the overlayfs mount the container's root, and writes to cgroup files to set resource limits. runc is a one-shot process - it sets up the container, execs the container's init process, and exits. The container runs independently.
containerd (high-level runtime): Manages the full container lifecycle - image pulling and unpacking, container creation and deletion, execution, networking, and snapshotting. containerd calls runc via a shim process (containerd-shim). The shim stays alive for the container's entire lifetime: it holds the container's stdio, reports exit codes to containerd, and ensures the container can outlive a containerd restart. Docker and Kubernetes both use containerd; they are clients of its gRPC API.
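The division of labor is easy to see because runc consumes nothing but an OCI bundle. A sketch that builds a minimal bundle by hand (assumes runc is installed; /tmp/oci-demo is arbitrary, and rootfs/ must be populated with a real root filesystem, e.g. an exported image):

```python
import json
import pathlib
import subprocess

bundle = pathlib.Path("/tmp/oci-demo")
(bundle / "rootfs").mkdir(parents=True, exist_ok=True)  # fill with a real rootfs

# 'runc spec' writes a default config.json into the bundle directory
subprocess.run(["runc", "spec"], cwd=bundle, check=True)

cfg = json.loads((bundle / "config.json").read_text())
cfg["process"]["args"] = ["/bin/echo", "hello from the container"]
# The namespaces runc will clone() into existence:
print([ns["type"] for ns in cfg["linux"]["namespaces"]])
(bundle / "config.json").write_text(json.dumps(cfg, indent=2))

# Then: sudo runc run --bundle /tmp/oci-demo demo-container
```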
GPU Containers - nvidia-container-toolkit
NVIDIA GPUs require kernel drivers and userspace libraries that must be accessible inside containers. The naive approach - install CUDA inside every container image - creates huge images and tightly couples the image to a specific driver version. The nvidia-container-toolkit solves this with a cleaner separation:
How it works: The toolkit installs a custom OCI hook that runs before runc execs the container's init process. This hook inspects the container's requested GPU devices, then injects:
- Device bindings for `/dev/nvidia0`, `/dev/nvidiactl`, `/dev/nvidia-uvm`
- Mount bindings for the host's driver userspace libraries (libcuda.so, libnvidia-ml.so, etc.) onto the container's library search path
- Environment variables describing the available GPUs
The container image only needs CUDA headers and stub libraries. The actual NVIDIA driver (with its specific version) is mounted in at runtime from the host. Upgrading the host driver upgrades the driver seen by all containers without rebuilding a single image.
# Correct production approach: CUDA base image (headers + stubs, no driver)
FROM nvcr.io/nvidia/pytorch:24.01-py3
# The image provides CUDA headers and PyTorch with CUDA support.
# The actual libcuda.so driver is NOT in this image - it is mounted
# from the host by nvidia-container-toolkit at runtime.
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY train.py .
# Handle signals correctly as PID 1 in the container namespace
# Use exec form (JSON array) so the process is PID 1 directly,
# not wrapped in a shell that ignores signals.
ENTRYPOINT ["python", "train.py"]
import subprocess
import json
def run_gpu_container(
image: str,
script: str,
gpu_indices: list[int],
memory_gb: int = 16,
cpu_cores: int = 8,
) -> subprocess.Popen:
"""Launch a GPU training container with proper resource limits.
Key flags:
--gpus: requests specific GPUs via nvidia-container-toolkit
--memory: cgroup v2 memory.max (CPU RAM, not GPU VRAM)
--cpus: cgroup v2 cpu.max
--shm-size: /dev/shm size for DataLoader shared memory workers
--ipc=host: share host IPC namespace (needed for some NCCL configs)
"""
gpu_spec = ",".join(str(i) for i in gpu_indices)
cmd = [
"docker", "run", "--rm",
f"--gpus=device={gpu_spec}",
f"--memory={memory_gb}g",
f"--memory-swap={memory_gb}g", # disable swap
f"--cpus={cpu_cores}",
"--shm-size=16g", # /dev/shm for DataLoader workers
"--ulimit", "memlock=-1", # required for GPU pinned memory
"--ulimit", "stack=67108864",
"-v", "/fast-storage/data:/data:ro",
"-v", "/fast-storage/checkpoints:/checkpoints",
image,
"python", script,
]
print("Launching container:")
print(" " + " ".join(cmd))
return subprocess.Popen(cmd)
def check_gpu_visibility_in_container(container_id: str) -> None:
"""Verify that a running container can see its assigned GPUs."""
result = subprocess.run(
["docker", "exec", container_id, "nvidia-smi", "--query-gpu=index,name,memory.total",
"--format=csv,noheader"],
capture_output=True, text=True
)
if result.returncode == 0:
print("GPUs visible to container:")
for line in result.stdout.strip().splitlines():
print(f" {line}")
else:
print(f"nvidia-smi failed: {result.stderr}")
seccomp and Linux Capabilities - The Security Layer
Linux capabilities break root's omnipotence into discrete privileges. Instead of a process being "root" (all-powerful) or "not root" (powerless), capabilities allow a process to hold only the specific privileges it needs. There are approximately 40 capabilities in modern Linux kernels.
For an ML inference container, the capabilities it needs are essentially none. It reads model files, runs matrix multiplications, writes results to a network socket. It does not need:
- `CAP_SYS_MODULE` - load kernel modules
- `CAP_SYS_TIME` - change the system time
- `CAP_NET_RAW` - raw network socket access
- `CAP_SYS_ADMIN` - broad system administration operations
The principle of least privilege applied to containers: drop all capabilities by default, add back only what is specifically required.
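You can verify what a process actually holds by decoding the CapEff bitmask in /proc. A small sketch (the name table is deliberately partial; the bit numbers come from linux/capability.h):

```python
def effective_capabilities(pid: str = "self") -> int:
    """Return the CapEff bitmask from /proc/<pid>/status."""
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("CapEff:"):
                return int(line.split()[1], 16)
    return 0

CAP_NAMES = {0: "CAP_CHOWN", 12: "CAP_NET_ADMIN", 13: "CAP_NET_RAW",
             16: "CAP_SYS_MODULE", 21: "CAP_SYS_ADMIN", 25: "CAP_SYS_TIME"}

caps = effective_capabilities()
held = [n for bit, n in CAP_NAMES.items() if caps & (1 << bit)]
print(f"CapEff = {caps:#x}; holds (of those checked): {held or 'none'}")
# In a container started with --cap-drop=ALL, CapEff reads as 0.
```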
seccomp (Secure Computing Mode) provides syscall filtering. A BPF program runs before every syscall and can allow, deny with ERRNO, or SIGKILL the process. Docker's default seccomp profile blocks approximately 44 syscalls that containers almost never need. For hardened ML inference containers, write a custom allowlist profile:
{
"defaultAction": "SCMP_ACT_ERRNO",
"architectures": ["SCMP_ARCH_X86_64", "SCMP_ARCH_X86", "SCMP_ARCH_X32"],
"syscalls": [
{
"comment": "Syscalls needed for Python ML inference workload",
"names": [
"read", "write", "open", "openat", "close", "stat", "fstat",
"lstat", "poll", "lseek", "mmap", "mprotect", "munmap",
"brk", "rt_sigaction", "rt_sigprocmask", "rt_sigreturn",
"ioctl", "pread64", "pwrite64", "readv", "writev", "access",
"pipe", "select", "sched_yield", "mremap", "msync", "madvise",
"shmget", "shmat", "shmctl", "dup", "dup2", "nanosleep",
"getpid", "socket", "connect", "accept", "sendto", "recvfrom",
"sendmsg", "recvmsg", "shutdown", "bind", "listen",
"getsockname", "getpeername", "socketpair",
"setsockopt", "getsockopt", "clone", "fork", "vfork",
"execve", "exit", "wait4", "kill", "uname", "fcntl",
"flock", "fsync", "fdatasync", "truncate", "ftruncate",
"getdents", "getdents64", "getcwd", "chdir", "rename",
"mkdir", "rmdir", "unlink", "symlink", "readlink",
"chmod", "fchmod", "gettimeofday", "getrlimit", "getrusage",
"sysinfo", "getuid", "getgid", "getppid",
"futex", "sched_getaffinity", "epoll_create", "epoll_create1",
"epoll_ctl", "epoll_wait", "epoll_pwait",
"set_tid_address", "clock_gettime", "clock_nanosleep",
"exit_group", "tgkill", "openat2", "statx",
"io_uring_setup", "io_uring_enter", "io_uring_register",
"getrandom", "memfd_create", "copy_file_range",
"prlimit64", "sendfile"
],
"action": "SCMP_ACT_ALLOW"
}
]
}
import subprocess

def run_hardened_inference_container(
image: str,
model_path: str,
seccomp_profile_path: str,
) -> subprocess.CompletedProcess:
"""Launch a hardened inference container with minimal permissions.
Security posture:
- All Linux capabilities dropped
- Custom syscall allowlist via seccomp
- Read-only root filesystem (no writes to image layers)
- No privilege escalation via setuid binaries
- Memory limit prevents OOM spreading to neighbors
- PID limit prevents fork bombs
"""
cmd = [
"docker", "run", "--rm",
"--cap-drop=ALL",
f"--security-opt=seccomp={seccomp_profile_path}",
"--read-only",
"--security-opt=no-new-privileges",
"--memory=8g",
"--memory-swap=8g",
"--cpus=4",
"--pids-limit=100",
# Writable tmpfs for temp files (since root fs is read-only)
"--tmpfs=/tmp:size=512m,noexec",
"-v", f"{model_path}:/model:ro",
"-v", "/fast-storage/inference-output:/output",
image,
"python", "serve.py", "--model", "/model"
]
return subprocess.run(cmd, capture_output=True, text=True)
Rootless Containers
Traditional Docker requires a root-privileged daemon (dockerd). Any container escape gives an attacker host root. Rootless containers (Podman, rootless Docker) use user namespaces to allow non-root users to create and run containers.
The mechanism: a user namespace maps the container's UID 0 (root) to the user's real UID (e.g., 1000) on the host. Inside the container, the process sees itself as root and can bind to port 80, install packages, and write to root-owned paths. From the host's perspective, it is running as UID 1000 with no elevated privileges. An escape from the container gives the attacker only UID 1000 access.
The UID mapping is stored in /proc/<pid>/uid_map and /proc/<pid>/gid_map and enforced by the kernel for every filesystem access check.
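Reading the mapping is a one-liner worth knowing when debugging rootless setups:

```python
def print_uid_map(pid: str = "self") -> None:
    """Each /proc/<pid>/uid_map line is: <inside-uid> <outside-uid> <count>."""
    with open(f"/proc/{pid}/uid_map") as f:
        for line in f:
            inside, outside, count = line.split()
            print(f"container UID {inside} -> host UID {outside} (range {count})")

print_uid_map()
# Outside any user namespace this prints the identity mapping:
#   container UID 0 -> host UID 0 (range 4294967295)
# Inside a rootless container it prints something like 0 -> 1000.
```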
Limitation for ML GPU workloads: NVIDIA GPU device nodes (/dev/nvidia*) require elevated permissions to access. Rootless support in the nvidia-container-toolkit is improving but not yet uniformly production-ready; it varies with driver and toolkit versions. For GPU training and serving, most production deployments still use a privileged daemon (with carefully managed capabilities) rather than fully rootless containers.
BuildKit and Multi-Stage Builds for ML
ML images are large. A naive Dockerfile installs everything in one layer and produces a 15-20 GB image. Multi-stage builds separate the build environment (with compilers, headers, build tools) from the runtime image (which only needs what runs at inference time).
# syntax=docker/dockerfile:1.7
# Stage 1: Build stage - has compilers and build tools (large, discarded)
FROM python:3.11-slim AS builder
RUN apt-get update && apt-get install -y --no-install-recommends \
gcc g++ cmake libffi-dev libssl-dev \
&& rm -rf /var/lib/apt/lists/*
WORKDIR /build
COPY requirements.txt .
# Install to /install prefix so we can copy selectively
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt
# Stage 2: Distroless runtime image (minimal attack surface, small size)
# gcr.io/distroless/python3 contains ONLY the Python interpreter.
# No bash. No shell. No package manager. No curl or wget.
# An attacker who exploits your inference service cannot pivot further.
FROM gcr.io/distroless/python3-debian12
# Copy only the installed packages from the builder stage
COPY --from=builder /install /usr/local
# Debian-derived Python images resolve dist-packages, not site-packages,
# under /usr/local by default, so point the interpreter at the copied
# packages explicitly (adjust the version to match the base image).
ENV PYTHONPATH=/usr/local/lib/python3.11/site-packages
WORKDIR /app
# Model artifacts should NOT be copied here - mount them via volume.
# Only copy the inference server code.
COPY --chown=nonroot:nonroot model_server.py .
COPY --chown=nonroot:nonroot tokenizer_config.json .
# Run as non-root even within distroless
USER nonroot
# Must use exec form (JSON array) with distroless - no shell available
ENTRYPOINT ["python3", "model_server.py"]
Why distroless for ML inference specifically: Production inference containers never need bash, curl, pip, or any shell. A distroless image cuts the attack surface drastically. If an attacker turns a bug in your inference service into remote code execution inside a distroless container, they cannot run wget to pull a second-stage payload and cannot install tools - the lateral movement path is dead. The image is also several times smaller than a full Python image, which means faster container startup and less storage pressure.
Kubernetes Pod Isolation Model
A Kubernetes pod is a group of containers that share a network namespace and optionally a PID namespace. They run on the same node and communicate via localhost. Each pod gets its own unique IP address from the pod CIDR. Containers within a pod share that IP.
cgroup hierarchy in Kubernetes: kubepods -> QoS tier -> pod-<uid> -> container. Guaranteed pods sit directly under kubepods; Burstable and BestEffort pods live under their respective QoS subtrees. The QoS class determines placement in this hierarchy and how the pod is treated during resource pressure.
Three QoS classes:
- Guaranteed: requests == limits for all resources. Highest protection from OOM kill. Use for latency-sensitive inference servers.
- Burstable: requests < limits. Can use more resources if available. Use for training jobs that can tolerate some variability.
- BestEffort: no requests or limits specified. Killed first during memory pressure. Never use for ML workloads in production.
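The classification rule itself is simple enough to sketch (simplified; the real kubelet logic also handles per-container and per-resource partial specs):

```python
def qos_class(containers: list[dict]) -> str:
    """Simplified Kubernetes QoS classification.
    Each dict looks like: {"requests": {"cpu": "8"}, "limits": {"cpu": "8"}}.
    """
    if not any(c.get("requests") or c.get("limits") for c in containers):
        return "BestEffort"
    guaranteed = all(
        c.get("requests") and c.get("limits")
        and c["requests"] == c["limits"]
        and {"cpu", "memory"} <= set(c["requests"])
        for c in containers
    )
    return "Guaranteed" if guaranteed else "Burstable"

print(qos_class([{"requests": {"cpu": "8", "memory": "32Gi"},
                  "limits": {"cpu": "8", "memory": "32Gi"}}]))  # Guaranteed
print(qos_class([{"requests": {"cpu": "4"}, "limits": {"cpu": "8"}}]))  # Burstable
print(qos_class([{}]))  # BestEffort
```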
# Kubernetes pod spec - properly configured for GPU training
apiVersion: v1
kind: Pod
metadata:
name: bert-finetuning-job
namespace: ml-team-a
labels:
job-type: training
model: bert-large
spec:
# Restart policy for training - only restart on failure, not after completion
restartPolicy: OnFailure
# Terminate gracefully - give the job time to save checkpoint
terminationGracePeriodSeconds: 300
containers:
- name: trainer
image: nvcr.io/nvidia/pytorch:24.01-py3
command: ["python", "finetune_bert.py"]
args: ["--checkpoint-dir", "/checkpoints", "--data-dir", "/data"]
resources:
requests:
memory: "32Gi"
cpu: "8"
nvidia.com/gpu: "2"
limits:
# Set limits == requests for Guaranteed QoS class
memory: "32Gi"
cpu: "8"
nvidia.com/gpu: "2"
env:
- name: NCCL_SOCKET_IFNAME
value: "eth0"
- name: NCCL_DEBUG
value: "WARN"
- name: OMP_NUM_THREADS
value: "4"
volumeMounts:
- name: training-data
mountPath: /data
readOnly: true
- name: checkpoints
mountPath: /checkpoints
- name: dshm
mountPath: /dev/shm # Override default 64MB shm for DataLoader workers
securityContext:
allowPrivilegeEscalation: false
readOnlyRootFilesystem: false # false because PyTorch writes temp files
runAsNonRoot: true
runAsUser: 1000
capabilities:
drop: ["ALL"]
volumes:
- name: training-data
persistentVolumeClaim:
claimName: bert-training-data-pvc
- name: checkpoints
persistentVolumeClaim:
claimName: bert-checkpoints-pvc
- name: dshm
emptyDir:
medium: Memory
sizeLimit: "16Gi" # Override default 64MB /dev/shm
tolerations:
- key: nvidia.com/gpu
operator: Exists
effect: NoSchedule
nodeSelector:
accelerator: nvidia-a100-80gb
import os
import pathlib
def get_container_resource_limits() -> dict:
"""Read this container's own resource limits from its cgroup.
ML workloads should call this at startup to configure themselves
appropriately. PyTorch's memory allocator can be tuned based on
the known container memory limit.
"""
cgroup_path = ""
with open("/proc/self/cgroup") as f:
for line in f:
if line.startswith("0::"): # cgroup v2
cgroup_path = line.strip().split("::", 1)[1]
break
base = pathlib.Path("/sys/fs/cgroup") / cgroup_path.lstrip("/")
limits = {"cgroup_path": cgroup_path}
try:
mem_max = (base / "memory.max").read_text().strip()
if mem_max == "max":
limits["memory_max_bytes"] = None
limits["memory_max_gb"] = None
else:
limits["memory_max_bytes"] = int(mem_max)
limits["memory_max_gb"] = round(int(mem_max) / (1024 ** 3), 1)
cpu_max = (base / "cpu.max").read_text().strip()
if cpu_max == "max":
limits["cpu_cores"] = None
else:
quota, period = cpu_max.split()
limits["cpu_cores"] = round(int(quota) / int(period), 2)
# Read memory.stat for current usage breakdown
stat_text = (base / "memory.stat").read_text()
for line in stat_text.splitlines():
parts = line.split()
if len(parts) == 2 and parts[0] in ("anon", "file", "shmem"):
limits[f"current_{parts[0]}_bytes"] = int(parts[1])
except FileNotFoundError:
limits["error"] = "cgroup files not found (may be cgroup v1 or unconfined)"
return limits
def configure_pytorch_from_cgroup_limits() -> None:
"""Configure PyTorch memory allocator based on container cgroup limits.
If we know the container has 32 GB, we can tell PyTorch to not
fragment memory above 80% of that, preventing OOM kills near the limit.
"""
    limits = get_container_resource_limits()
if limits.get("memory_max_bytes"):
max_gb = limits["memory_max_gb"]
print(f"Container memory limit: {max_gb} GB")
        # Keep ~25% headroom below the cgroup limit. Caveat: this tunes the
        # CUDA caching allocator (GPU VRAM), which cgroups do not govern; the
        # CPU-RAM limit is used here only as a sizing heuristic.
        allocator_max_mb = int(limits["memory_max_bytes"] * 0.75 / (1024 * 1024))
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = (
f"max_split_size_mb:{allocator_max_mb},"
"garbage_collection_threshold:0.8,"
"expandable_segments:True"
)
print(f"Set PYTORCH_CUDA_ALLOC_CONF max_split_size_mb={allocator_max_mb}")
if limits.get("cpu_cores"):
cpu_cores = int(limits["cpu_cores"])
# DataLoader workers should not exceed available CPU cores
os.environ.setdefault("NUM_WORKERS", str(max(1, cpu_cores - 2)))
print(f"Container CPU limit: {limits['cpu_cores']} cores")
print(f"Recommended DataLoader workers: {os.environ['NUM_WORKERS']}")
Production Engineering Notes
Image layer optimization for ML: Keep your base image (CUDA + Python + system libs) as a separately-tagged image that rarely changes. Your model code and requirements.txt should be in the top layers. A build that only changes train.py should take seconds, not minutes. The Docker layer cache is your build speed multiplier.
Memory limit math for GPU workloads: GPU memory is managed by the NVIDIA driver, not by Linux cgroups. memory.max=32G limits CPU RAM only. PyTorch uses CPU RAM for gradient buffers, optimizer states (Adam's moment estimates are 2x the model size in float32), DataLoader worker shared memory (/dev/shm), and CPU-pinned memory for DMA transfers. If the cgroup limit is too low, pin_memory() calls fail, DataLoader workers get OOM-killed, and optimizer state offloading fails. A rough formula: memory.max >= (model_params * bytes_per_param * 6) + shm_size + 4GB_OS_overhead.
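That formula is easy to encode as a startup check. A sketch (the multiplier and overhead constants are this note's rules of thumb, not measured values):

```python
def recommended_memory_max_gb(
    params_billions: float,
    bytes_per_param: int = 4,     # float32 weights
    shm_size_gb: float = 16.0,    # /dev/shm for DataLoader workers
    os_overhead_gb: float = 4.0,
) -> float:
    """memory.max >= params * bytes_per_param * 6 + shm + OS overhead.
    The 6x multiplier folds in gradients, Adam moments, and staging buffers.
    """
    model_gb = params_billions * 1e9 * bytes_per_param / (1024 ** 3)
    return round(model_gb * 6 + shm_size_gb + os_overhead_gb, 1)

# A 1B-parameter float32 model (~3.7 GB of weights) -> ~42 GB of memory.max
print(recommended_memory_max_gb(1.0))
```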
Container startup latency for ML inference autoscaling: Cold start latency is your autoscaling Achilles heel. A 7B model that takes 60 seconds to load from a network-mounted PVC adds 60 seconds to every scale-out event. Mitigations: (1) use local NVMe storage (PVC with storageClassName: local-nvme) so the model is already on the node, (2) use model server frameworks like Triton or TorchServe that keep models warm in a ready pool, (3) pre-pull container images on all GPU nodes using a DaemonSet.
The copy-up trap: Copy-up proper only fires when a container modifies a file inherited from an image layer, but checkpoint writes inside the container's root filesystem are expensive regardless: every byte is funneled through the overlayfs upper layer under /var/lib/docker/overlay2/, on whatever disk backs the Docker data root, competing with every other container on the host. For a 10 GB checkpoint that is 10 GB of I/O to the wrong disk. Always bind-mount a dedicated volume for checkpoint output.
Namespace leak detection: If containers crash without proper cleanup, network namespace files can be held open by orphaned processes, preventing kernel garbage collection. The symptom is accumulating veth devices on the host. Monitor with ip link | grep veth | wc -l and alert if this count grows unbounded between container lifecycles.
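A sketch of that check as a script you could run on a schedule (the slack threshold is illustrative):

```python
import subprocess

def count_veth_devices() -> int:
    out = subprocess.run(["ip", "-o", "link", "show", "type", "veth"],
                         capture_output=True, text=True, check=True).stdout
    return len(out.splitlines())

def count_running_containers() -> int:
    out = subprocess.run(["docker", "ps", "-q"],
                         capture_output=True, text=True, check=True).stdout
    return len(out.split())

veths, containers = count_veth_devices(), count_running_containers()
# Each container normally owns one host-side veth; a growing surplus
# suggests network namespaces leaked by uncleaned crashes.
if veths > containers + 5:  # illustrative slack
    print(f"ALERT: {veths} veth devices for {containers} running containers")
```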
Common Mistakes
:::danger Setting memory.max Without memory.high
Setting only memory.max means the OOM killer fires with no warning the instant the process exceeds the limit. For ML workloads that have variable memory usage, set memory.high to 85-90% of memory.max. When the workload approaches the limit, Linux starts reclaiming memory and throttling allocations. Your monitoring system gets time to detect and alert. The training job slows down instead of dying. The OOM kill becomes a last resort, not the first response.
:::
:::danger Running ML Containers as Root with --privileged
docker run --privileged grants the container every capability, disables the seccomp and AppArmor/SELinux confinement, and exposes all host devices. The namespaces still technically exist, but the isolation they provide is effectively nullified: the container can mount host paths, load kernel modules, and modify iptables. This is appropriate only for container runtime development and testing, never for production ML workloads. If you think you need --privileged for GPU access, you are wrong - the nvidia-container-toolkit handles this correctly without it. Audit all containers in your cluster for the Privileged: true flag.
:::
:::warning Storing Model Weights in Container Image Layers
Never COPY large model weight files (anything over 500 MB) into a Docker image. A 7B parameter model at bfloat16 is ~14 GB. Every docker pull downloads 14 GB. Every pushed version stores a new 14 GB layer. Layer storage in your registry balloons. Instead: store model weights on object storage (S3, GCS, Azure Blob) or a shared PVC, download at container startup, or mount via a volume. The container image should be small (1-4 GB) and fast to pull.
:::
:::warning Ignoring cgroup v1 vs v2 Differences
If your infrastructure mixes kernel versions (some nodes 4.x, others 5.x+), some hosts use cgroup v1, others v2. The paths and file names differ: v1 is /sys/fs/cgroup/memory/<path>/memory.limit_in_bytes, v2 is /sys/fs/cgroup/<path>/memory.max. Python code that reads cgroup stats must detect which version is active. Check /proc/filesystems for cgroup2 or examine whether /sys/fs/cgroup/cgroup.controllers exists. Failing to detect this leads to incorrect memory monitoring that shows "no limit" even when limits are set.
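A version-aware limit reader, following the detection advice above (a sketch; assumes it runs inside the container whose limit it reports):

```python
import pathlib

def container_memory_limit_bytes() -> int | None:
    """Return the memory limit under cgroup v1 or v2, or None if unlimited."""
    if pathlib.Path("/sys/fs/cgroup/cgroup.controllers").exists():  # v2
        raw = pathlib.Path("/sys/fs/cgroup/memory.max").read_text().strip()
        return None if raw == "max" else int(raw)
    # v1: the memory controller lives in its own hierarchy
    raw = pathlib.Path(
        "/sys/fs/cgroup/memory/memory.limit_in_bytes").read_text().strip()
    limit = int(raw)
    return None if limit >= 2**62 else limit  # v1 encodes "no limit" as a huge value

print(container_memory_limit_bytes())
```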
:::
Interview Questions
Q1: What is the difference between a container and a virtual machine at the kernel level?
A VM runs a complete guest kernel on top of a hypervisor. The hypervisor virtualizes the hardware (CPU rings, memory management, device I/O). The guest kernel manages its own memory, scheduler, and device drivers in isolation. An exploit in the guest kernel does not directly reach the host kernel. A container shares the host kernel. There is no guest kernel to boot. Isolation is achieved through Linux namespaces (separate views of kernel resources) and cgroups (resource limits). Containers start in milliseconds because there is no kernel to boot, no firmware to initialize, no hardware enumeration. The density is 10-100x higher than VMs. The isolation is weaker: a kernel exploit (e.g., a namespace escape CVE) affects all containers simultaneously. For ML workloads, containers are the right choice for performance and density; VMs are sometimes layered underneath for stronger multi-tenant boundaries (VM per team, containers within each VM).
Q2: Walk me through what happens when you run docker run --gpus 1 pytorch/pytorch train.py.
- Docker CLI parses the command and sends an API request to dockerd over its Unix socket.
- dockerd checks whether `pytorch/pytorch` is in the local image store; if not, it pulls the image layer by layer from Docker Hub, extracting each layer to `/var/lib/docker/overlay2/`.
- dockerd calls containerd to create the container, providing the image manifest and runtime config (including `--gpus 1`).
- containerd creates an overlayfs mount: lower layers from the image, upper layer a new empty writable directory.
- containerd spawns a containerd-shim process that persists for the container's lifetime.
- The shim invokes runc with the OCI bundle (config.json plus the rootfs path).
- Because `--gpus 1` was specified, the nvidia-container-runtime OCI hook fires before runc execs the init process. The hook discovers available GPUs, selects one, and injects device bindings (`/dev/nvidia0`, etc.) and driver library mounts into the container config.
- runc calls `clone()` with `CLONE_NEWPID | CLONE_NEWNET | CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC` to create the container namespaces.
- runc writes the cgroup resource limits to files under `/sys/fs/cgroup/`.
- runc calls `pivot_root()` to make the overlayfs mount the container's `/`.
- runc execs `python train.py` as PID 1 of the new PID namespace.
Q3: How do cgroups v2 memory limits interact with PyTorch's memory management?
cgroups v2 memory.max limits CPU RAM (host memory managed by the Linux kernel). PyTorch's CUDA allocator manages GPU VRAM independently via the CUDA runtime - this is completely invisible to cgroups. However, PyTorch uses significant CPU RAM for: Adam optimizer states (2 float32 tensors per parameter = 8 bytes per parameter), DataLoader worker processes (each worker forks and copies the dataset object into its address space), CPU-pinned memory buffers for async DMA transfers (pin_memory=True), and gradient accumulation buffers. For a 1B parameter model with Adam in float32, optimizer states alone consume 8 GB of CPU RAM. Set the cgroup limit to at least: (model_params * 8 bytes) + (num_workers * dataset_size_in_ram) + 4 GB OS overhead. Set memory.high to 85% of that value to get throttling before OOM.
Q4: What is a rootless container, how does it work, and when can you not use it for ML?
Rootless containers use user namespaces to map container UID 0 to the user's real UID on the host. Inside the container the process sees itself as root and can perform root-privileged operations within the container's namespace. Outside, the process has only the user's privileges. An attacker who escapes the container gets the user's access, not root access. The mapping is stored in /proc/<pid>/uid_map. Limitations for ML: (1) NVIDIA GPU device nodes require specific permissions that rootless containers cannot obtain without additional host-level configuration. The toolkit has improving rootless support but it varies by driver version and distribution. (2) Network plugins may require root for veth pair creation. (3) cgroup v1 delegation requires root. For CPU-only ML workloads (preprocessing, feature engineering, serving small models on CPU), rootless containers are fully suitable. For GPU training and serving at scale, most production deployments still use privileged daemons with careful capability management rather than fully rootless containers.
Q5: Your cluster has 40 A100 GPUs across 10 nodes. How do you prevent one team from monopolizing GPU resources?
Three-layer enforcement: (1) Kubernetes namespace ResourceQuota: each team gets a dedicated namespace with ResourceQuota capping nvidia.com/gpu to their allocation. Kubernetes enforces this at scheduling time - the pod is rejected if the quota would be exceeded. (2) LimitRange: define default GPU requests and limits so pods without explicit specs get a reasonable default allocation rather than 0 (which would allow the pod to land on a GPU node without actually claiming any GPUs). (3) Priority classes: create PriorityClass objects with different preemption policies. Interactive inference services (high priority, no preemption) evict training jobs (low priority, preemptible) when GPU nodes are needed for serving. Additionally, NVIDIA MIG (Multi-Instance GPU) on A100s allows partitioning a single physical GPU into up to 7 isolated GPU instances, each with dedicated memory bandwidth and SM counts. This allows multi-tenant sharing of individual GPUs for smaller inference workloads without interference.
Q6: Why is the overlayfs copy-up mechanism a performance trap for ML checkpoint writes?
Overlayfs copy-up triggers when a container modifies a file that exists only in the lower (read-only image) layers. For a new file written to the container's root filesystem, there is no existing lower-layer version, so no copy-up occurs on first write. However, if a training script overwrites a file that came from the image (e.g., a config file), overlayfs must copy the entire original file from the lower layer to the upper layer before applying the modification. For checkpoint writes specifically, the issue is not copy-up but upper-layer I/O performance. All writes to the container's root filesystem go through the overlayfs upper layer, which is stored in /var/lib/docker/overlay2/. This adds filesystem metadata overhead, is subject to the host's filesystem performance, and may compete with other containers sharing the same host disk. In contrast, a mounted volume (-v /fast-nvme/checkpoints:/checkpoints) writes directly to the host filesystem at the device level, bypassing overlayfs entirely. The measured throughput difference is typically 2-5x for sequential checkpoint writes on NVMe storage.
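A quick experiment makes the gap visible on your own hardware (illustrative only; absolute numbers depend on the host filesystem, the Docker storage driver, and the disk):

```python
import subprocess
import time

def timed_write(path_in_container: str, extra_args: list[str]) -> float:
    """Write 1 GB at the given in-container path and return elapsed seconds."""
    start = time.monotonic()
    subprocess.run(
        ["docker", "run", "--rm", *extra_args, "ubuntu:22.04",
         "dd", "if=/dev/zero", f"of={path_in_container}/ckpt.bin",
         "bs=1M", "count=1024", "conv=fsync"],
        check=True, capture_output=True)
    return time.monotonic() - start

overlay_s = timed_write("/root", [])                       # overlay upper layer
volume_s = timed_write("/out", ["-v", "/tmp/bench:/out"])  # bind mount, no overlay
print(f"overlay: {overlay_s:.1f}s   volume: {volume_s:.1f}s")
```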
