Multi-Tenant ML Platforms

The Infrastructure Sharing Problem

Your company has 15 ML teams. Collectively they have requisitioned and received a GPU cluster: 80 A100 nodes, 8 GPUs each, 640 GPUs total. A reasonable platform decision - shared infrastructure is cheaper than 15 separate clusters, and most teams do not need all their GPUs at peak utilization simultaneously.

Three weeks after the cluster goes live, the performance team's fine-tuning job occupies 200 GPUs and is scheduled to run for 36 hours. The fraud team has a production retraining job due to start in 6 hours - it needs 32 GPUs. With the rest of the cluster already busy, the performance team's job is holding every GPU that would otherwise be free. The fraud team cannot run their job. The fraud model goes stale during a period of high transaction volume. The fraud team is furious. The platform team gets paged.

This is the multi-tenancy problem: how do you share expensive GPU infrastructure fairly across teams that have conflicting needs, without letting any team's workload starve others, while maintaining strong enough isolation that one team's bugs cannot affect another team's data or results?

The answer involves Kubernetes resource management primitives - namespaces, resource quotas, priority classes, fair scheduling - combined with network policies for data isolation and a cost attribution system that gives each team visibility into their consumption. Databricks built an architecture that serves thousands of tenants on shared infrastructure. The principles they discovered are applicable at any scale.

Why Multi-Tenancy: The Economics

Running dedicated GPU infrastructure for each team sounds simple: no sharing, no conflicts. But it is economically wasteful in ways that compound at scale.

Utilization problem: GPU utilization at most ML teams averages 30-40% across a day. Training jobs run intensively for hours, then stop while engineers analyze results. A dedicated cluster for a team of 5 sits idle 60-70% of the time. At $30,000/month for a 10-node A100 cluster, that is $18,000-21,000/month in idle capacity.
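The idle-capacity figure follows directly from the utilization range. A small sketch of the arithmetic, using the $30,000/month cluster cost from the text; the helper name `idle_cost_per_month` is illustrative only.

```python
def idle_cost_per_month(cluster_cost: float, utilization: float) -> float:
    """Monthly spend on capacity that sits idle at a given average utilization."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    return cluster_cost * (1.0 - utilization)


cluster_cost = 30_000  # $/month for the 10-node A100 cluster in the text
for utilization in (0.40, 0.30):
    idle = idle_cost_per_month(cluster_cost, utilization)
    print(f"{utilization:.0%} utilized -> ${idle:,.0f}/month idle")
```

Multiplying by the number of dedicated clusters gives a rough upper bound on what sharing could recover, before accounting for the platform team's own cost.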

Cold-start problem: a team that needs to run a one-off large training job (new architecture exploration, full dataset retraining) would need to wait for procurement of additional dedicated infrastructure - weeks. On a shared cluster, they can borrow unused capacity immediately.

Amortized management cost: operating a GPU cluster requires monitoring, patching, CUDA driver management, network configuration, and on-call rotation. With dedicated clusters per team, each team bears the full operational cost. With a shared platform, a dedicated platform team operates one cluster for everyone.

The trade-off: multi-tenancy introduces complexity - scheduling fairness, isolation, cost attribution - that dedicated infrastructure avoids. The economics favor multi-tenancy for any organization with more than 3-4 ML teams.

The Multi-Tenancy Architecture

Resource Isolation: Kubernetes Namespaces

The foundational isolation primitive in Kubernetes is the namespace. Each tenant (team, project, or customer) gets their own namespace. Namespaces provide:

Name scoping: resources (Pods, Services, ConfigMaps) in one namespace do not conflict with same-named resources in another
RBAC scope: permissions defined at namespace level - team members get admin rights to their namespace, read-only to others
ResourceQuota scope: limits applied per namespace, not per cluster
NetworkPolicy scope: network isolation rules applied per namespace

# namespace-setup.yaml - template applied for each new tenant
apiVersion: v1
kind: Namespace
metadata:
  name: fraud-team
  labels:
    team: fraud
    cost-center: "cc-1042"
    tier: "production"
---
# ResourceQuota: hard limits per namespace
apiVersion: v1
kind: ResourceQuota
metadata:
  name: fraud-team-quota
  namespace: fraud-team
spec:
  hard:
    # Compute limits
    requests.cpu: "200"         # total CPU requests across all pods
    limits.cpu: "400"
    requests.memory: 800Gi
    limits.memory: 1600Gi
    requests.nvidia.com/gpu: "32"  # max 32 GPUs at any time
    # Extended resources such as GPUs accept only the requests. prefix in a quota;
    # a limits.nvidia.com/gpu entry is not allowed.

    # Object limits (prevent runaway resource creation)
    count/pods: "200"
    count/services: "50"
    count/persistentvolumeclaims: "100"
    count/jobs.batch: "500"

    # Storage limits
    requests.storage: "100Ti"
---
# LimitRange: default and max per-pod limits
apiVersion: v1
kind: LimitRange
metadata:
  name: fraud-team-limits
  namespace: fraud-team
spec:
  limits:
  - type: Container
    default:
      cpu: "2"
      memory: "8Gi"
    defaultRequest:
      cpu: "500m"
      memory: "2Gi"
    max:
      cpu: "96"       # max 96 vCPUs per container
      memory: "768Gi" # max 768 GB per container
      nvidia.com/gpu: "8"  # max 8 GPUs per container (one node)

Fair Scheduling: No Team Monopolizes the Cluster

A quota of 32 GPUs means a team cannot hold more than 32 GPUs at once. It says nothing about ordering: a team that submits 200 jobs at once can fill the job queue and delay every other team's work while staying within its quota. Fair scheduling addresses this.

Priority Classes

# Priority classes for different workload types
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: ml-production-critical
value: 1000
globalDefault: false
description: "Production model retraining - preempts experimental jobs"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: ml-normal
value: 500
globalDefault: true
description: "Regular experimentation"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: ml-best-effort
value: 100
description: "Long-running jobs that can be preempted"
preemptionPolicy: Never  # this class cannot preempt others
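The priority values above determine which running work can be evicted when a more urgent job arrives. A minimal sketch of victim selection, assuming a simplified cluster-wide model in which the lowest-priority jobs are evicted first until enough GPUs are free. The real kube-scheduler evaluates preemption per node and weighs more factors; `RunningJob` and `pick_victims` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RunningJob:
    name: str
    priority: int  # PriorityClass value, e.g. 1000, 500, or 100
    gpus: int


def pick_victims(
    running: list[RunningJob],
    incoming_priority: int,
    gpus_needed: int,
    gpus_free: int,
) -> Optional[list[RunningJob]]:
    """Jobs to evict so the incoming job fits; [] if it already fits, None if it cannot."""
    shortfall = gpus_needed - gpus_free
    if shortfall <= 0:
        return []
    # Only strictly lower-priority jobs are candidates; lowest priority goes first.
    candidates = sorted(
        (job for job in running if job.priority < incoming_priority),
        key=lambda job: job.priority,
    )
    victims: list[RunningJob] = []
    freed = 0
    for job in candidates:
        if freed >= shortfall:
            break
        victims.append(job)
        freed += job.gpus
    return victims if freed >= shortfall else None


running = [RunningJob("perf-finetune", 100, 200), RunningJob("reco-experiment", 500, 16)]
victims = pick_victims(running, incoming_priority=1000, gpus_needed=32, gpus_free=0)
print([job.name for job in victims])  # ['perf-finetune']
```

In the opening scenario, this is what lets a 32-GPU production retraining job displace a best-effort fine-tuning run instead of waiting 36 hours.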

Kueue: Kubernetes-Native Fair Scheduling for ML

Kueue (Kubernetes Queue) is a Kubernetes-native job queuing system that implements fair scheduling across teams. Each team gets a ClusterQueue with a guaranteed share and a borrowing limit from unused capacity.

apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: cluster-total
spec:
  namespaceSelector: {}
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu"]  # must match the resources listed under each flavor
    flavors:
    - name: "gpu-a100"
      resources:
      - name: "nvidia.com/gpu"
        nominalQuota: 640   # total GPUs in cluster
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: fraud-team-queue
  namespace: fraud-team
spec:
  clusterQueue: cluster-total
---
# Cohort for borrowing: teams can borrow from the cohort's unused quota
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: fraud-team-cq
spec:
  cohort: ml-platform
  namespaceSelector:
    matchLabels:
      team: fraud
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu"]
    flavors:
    - name: "gpu-a100"
      resources:
      - name: "nvidia.com/gpu"
        nominalQuota: 64    # fraud team's guaranteed share: 64 GPUs
        borrowingLimit: 128 # can borrow up to 128 more if cluster has capacity
        lendingLimit: 32    # lends at most 32 of its unused nominal GPUs to the cohort
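How the three quota fields interact is easier to see with numbers. The sketch below is a simplified reading of the cohort semantics, not Kueue's actual admission algorithm: a queue can use its nominal quota plus whatever other queues in the cohort are willing to lend, capped by its own borrowing limit. `QueueQuota`, `lendable`, `max_admissible`, and the two non-fraud queues are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QueueQuota:
    nominal: int                           # nominalQuota
    used: int = 0                          # GPUs currently admitted
    borrowing_limit: Optional[int] = None  # None = no explicit cap on borrowing
    lending_limit: Optional[int] = None    # None = all unused quota is lendable


def lendable(queue: QueueQuota) -> int:
    """Unused nominal quota this queue offers to the cohort."""
    unused = max(0, queue.nominal - queue.used)
    if queue.lending_limit is None:
        return unused
    return min(unused, queue.lending_limit)


def max_admissible(name: str, cohort: dict[str, QueueQuota]) -> int:
    """Upper bound on the GPUs queue `name` could hold right now."""
    me = cohort[name]
    pool = sum(lendable(q) for other, q in cohort.items() if other != name)
    borrow = pool if me.borrowing_limit is None else min(pool, me.borrowing_limit)
    return me.nominal + borrow


cohort = {
    "fraud-team-cq": QueueQuota(nominal=64, borrowing_limit=128, lending_limit=32),
    "perf-team-cq": QueueQuota(nominal=256, used=200),   # hypothetical queue
    "reco-team-cq": QueueQuota(nominal=320, used=320),   # hypothetical queue
}
print(max_admissible("fraud-team-cq", cohort))  # 64 nominal + 56 borrowed = 120
```

Under these assumptions the fraud queue can reach 120 GPUs, well short of its 128-GPU borrowing limit, because only 56 GPUs are idle elsewhere in the cohort.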

Custom Fair Scheduler Logic

For platforms with more complex fairness requirements (accounting for historical usage, priority weights per team):

from dataclasses import dataclass
from typing import Optional
import heapq
import time


@dataclass
class TrainingJob:
    job_id: str
    team_id: str
    gpu_count: int
    priority: int  # higher = more urgent
    submitted_at: float

    def __lt__(self, other):
        # Higher priority first; tie-break by submission time (FIFO)
        if self.priority != other.priority:
            return self.priority > other.priority
        return self.submitted_at < other.submitted_at


class FairShareScheduler:
    """
    Fair share scheduler for GPU training jobs.
    Each team gets a guaranteed share; idle capacity goes to the team
    furthest below its guarantee. This is a single-resource deficit
    heuristic, not full Dominant Resource Fairness (DRF).
    """

    def __init__(self, total_gpus: int, team_shares: dict):
        """
        total_gpus: total GPUs in cluster
        team_shares: {"fraud": 0.1, "recommendation": 0.3, ...}
                     shares should sum to <= 1.0
        """
        self.total_gpus = total_gpus
        self.team_shares = team_shares
        self.guaranteed_gpus = {
            team: int(share * total_gpus)
            for team, share in team_shares.items()
        }
        self.current_usage: dict = {team: 0 for team in team_shares}
        self.job_queue: list = []  # min-heap

    def submit_job(self, job: TrainingJob) -> None:
        """Add a job to the scheduling queue."""
        heapq.heappush(self.job_queue, job)
        print(f"[Scheduler] Queued job {job.job_id} for {job.team_id}")

    def get_available_gpus(self) -> int:
        """Total currently available GPUs."""
        used = sum(self.current_usage.values())
        return self.total_gpus - used

    def get_team_fair_share(self, team_id: str) -> int:
        """
        GPUs a team is still owed: guaranteed allocation minus current usage.
        Teams with a larger deficit are scheduled first.
        """
        guaranteed = self.guaranteed_gpus.get(team_id, 0)
        current = self.current_usage.get(team_id, 0)
        # Deficit in GPUs; 0 once the team is at or above its guarantee
        return max(0, guaranteed - current)

    def schedule_next(self) -> Optional[TrainingJob]:
        """
        Select the next job to run using fair-share logic.
        Prefers teams that are below their guaranteed allocation.
        """
        if not self.job_queue or self.get_available_gpus() == 0:
            return None

        # Find the job from the team most below its fair share
        best_job = None
        best_deficit = -1

        temp_queue = []
        while self.job_queue:
            job = heapq.heappop(self.job_queue)
            deficit = self.get_team_fair_share(job.team_id)
            available = self.get_available_gpus()

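            # Strict ">" so that on equal deficit the job popped first
            # (higher priority, then earlier submission) keeps its place
            if job.gpu_count <= available and deficit > best_deficit: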
                if best_job:
                    heapq.heappush(temp_queue, best_job)
                best_job = job
                best_deficit = deficit
            else:
                heapq.heappush(temp_queue, job)

        for job in temp_queue:
            heapq.heappush(self.job_queue, job)

        if best_job:
            self.current_usage[best_job.team_id] = (
                self.current_usage.get(best_job.team_id, 0) + best_job.gpu_count
            )
            print(
                f"[Scheduler] Scheduling {best_job.job_id} ({best_job.gpu_count} GPUs) "
                f"for {best_job.team_id}. Usage: {self.current_usage}"
            )

        return best_job

    def complete_job(self, job: TrainingJob) -> None:
        """Release GPUs when a job completes."""
        self.current_usage[job.team_id] = max(
            0, self.current_usage.get(job.team_id, 0) - job.gpu_count
        )
        print(
            f"[Scheduler] Completed {job.job_id}. "
            f"Released {job.gpu_count} GPUs for {job.team_id}."
        )
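The scheduler above looks only at current usage. One way to fold in the historical usage and per-team weights mentioned earlier is to score each team by exponentially decayed GPU-hours divided by its weight. This is a sketch under assumed parameters: the one-week half-life, the function names, and the sample data are all illustrative.

```python
WEEK_SECONDS = 7 * 24 * 3600


def decayed_usage(
    samples: list[tuple[float, float]],
    now: float,
    half_life_s: float = WEEK_SECONDS,
) -> float:
    """Sum (timestamp, gpu_hours) samples, halving each sample's weight every half-life."""
    return sum(
        gpu_hours * 0.5 ** ((now - timestamp) / half_life_s)
        for timestamp, gpu_hours in samples
    )


def fair_share_score(usage: float, weight: float) -> float:
    """Lower score = served first. Usage is normalized by the team's weight."""
    if weight <= 0:
        raise ValueError("weight must be positive")
    return usage / weight


now = float(WEEK_SECONDS)
history = {
    "fraud": [(0.0, 100.0)],           # 100 GPU-hours, one week ago
    "recommendation": [(now, 100.0)],  # 100 GPU-hours, just now
}
weights = {"fraud": 0.1, "recommendation": 0.3}

scores = {
    team: fair_share_score(decayed_usage(history[team], now), weights[team])
    for team in history
}
next_team = min(scores, key=scores.get)
print(scores, next_team)
```

Decay keeps a team from being penalized indefinitely for one large run, while the weight lets a team with a larger entitlement consume more before it drops in the ordering.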

Data Isolation: Preventing Cross-Tenant Leakage

In an ML platform, data isolation is not just a Kubernetes concern - it spans storage, secrets, and network access.

Kubernetes Network Policies

# Block all cross-namespace pod-to-pod communication by default
# Each namespace gets this policy applied at creation time

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-cross-namespace
  namespace: fraud-team
spec:
  podSelector: {}  # applies to all pods in namespace
  policyTypes:
  - Ingress
  - Egress
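  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: fraud-team  # label set automatically on every namespace
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: platform    # allow ingress from platform namespace (monitoring, etc.)
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: fraud-team
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: platform
  # Allow DNS resolution (UDP, plus TCP for large responses)
  - ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
  # Allow egress to S3 (via VPC endpoint)
  # Caution: if pod IPs also fall inside this range, this rule re-opens
  # cross-namespace traffic; narrow it to the endpoint's subnets.
  - to:
    - ipBlock:
        cidr: 10.0.0.0/8  # VPC CIDR - S3 VPC endpoint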
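The manifest above is written for a single tenant. A sketch of generating the same shape per tenant, assuming namespaces are matched on the `kubernetes.io/metadata.name` label that Kubernetes sets automatically on every namespace. The function name and the default `platform` namespace are assumptions, and the S3 `ipBlock` rule is left out because its CIDR is environment-specific.

```python
import json


def tenant_network_policy(namespace: str, platform_namespace: str = "platform") -> dict:
    """Build a NetworkPolicy limiting traffic to the tenant and platform namespaces."""

    def namespace_peer(name: str) -> dict:
        return {
            "namespaceSelector": {
                "matchLabels": {"kubernetes.io/metadata.name": name}
            }
        }

    peers = [namespace_peer(namespace), namespace_peer(platform_namespace)]
    dns_ports = [
        {"protocol": "UDP", "port": 53},
        {"protocol": "TCP", "port": 53},
    ]
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-cross-namespace", "namespace": namespace},
        "spec": {
            "podSelector": {},  # every pod in the namespace
            "policyTypes": ["Ingress", "Egress"],
            "ingress": [{"from": peers}],
            "egress": [{"to": peers}, {"ports": dns_ports}],
        },
    }


print(json.dumps(tenant_network_policy("fraud-team"), indent=2))
```

Because the output is plain JSON, it can be piped to `kubectl apply -f -` or passed to a Kubernetes client without a per-tenant template file.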

S3 Bucket Isolation with IRSA

Each team gets an IAM Role bound to their Kubernetes Service Account (IRSA - IAM Roles for Service Accounts). The role grants access only to their S3 prefix.

import boto3


class TenantAwareS3Client:
    """
    S3 client that enforces tenant-scoped access.
    Uses IRSA for credential injection - no hardcoded secrets.
    """

    def __init__(self, team_id: str, bucket: str = "ml-platform-data"):
        self.team_id = team_id
        self.bucket = bucket
        self.prefix = f"teams/{team_id}/"

        # boto3 automatically uses IRSA credentials when running in Kubernetes
        # with the correct service account annotation
        self.client = boto3.client("s3")

    def _scoped_key(self, key: str) -> str:
        """Ensure all S3 keys are within the team's prefix."""
        if not key.startswith(self.prefix):
            return f"{self.prefix}{key}"
        return key

    def upload_dataset(self, local_path: str, dataset_name: str) -> str:
        """Upload a dataset to the team's S3 prefix."""
        s3_key = self._scoped_key(f"datasets/{dataset_name}")
        self.client.upload_file(local_path, self.bucket, s3_key)
        return f"s3://{self.bucket}/{s3_key}"

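    def list_datasets(self) -> list:
        """List all datasets for this team."""
        # list_objects_v2 returns at most 1,000 keys per call, so paginate
        paginator = self.client.get_paginator("list_objects_v2")
        pages = paginator.paginate(
            Bucket=self.bucket,
            Prefix=f"{self.prefix}datasets/",
        )
        return [
            obj["Key"][len(self.prefix):]  # strip only the leading team prefix
            for page in pages
            for obj in page.get("Contents", [])
        ]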

    def upload_model(self, local_path: str, model_name: str, version: str) -> str:
        """Upload a model artifact to the team's S3 prefix."""
        s3_key = self._scoped_key(f"models/{model_name}/{version}/model.pkl")
        self.client.upload_file(local_path, self.bucket, s3_key)
        return f"s3://{self.bucket}/{s3_key}"
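The client class above only keeps keys tidy; the prefix boundary is actually enforced by the policy attached to the team's IAM role. A minimal sketch of such a policy, assuming the `ml-platform-data` bucket and `teams/<team_id>/` layout used above. The function name `team_s3_policy` and the exact action list are illustrative choices, not a prescribed set.

```python
import json


def team_s3_policy(bucket: str, team_id: str) -> dict:
    """IAM policy document that confines a role to one team's S3 prefix."""
    prefix = f"teams/{team_id}/"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is a bucket-level action, so scope it with a prefix condition
                "Sid": "ListOwnPrefix",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}*"]}},
            },
            {
                # Object-level actions are scoped through the resource ARN
                "Sid": "ReadWriteOwnObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
            },
        ],
    }


print(json.dumps(team_s3_policy("ml-platform-data", "fraud"), indent=2))
```

With a policy like this, a pod that bypasses `TenantAwareS3Client` and calls boto3 directly still cannot read or list another team's prefix.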

Cost Attribution: Who Consumed What

Shared infrastructure without cost attribution leads to the tragedy of the commons - teams have no incentive to optimize their GPU usage because they do not pay for it. Cost attribution gives each team visibility into their consumption and creates the right incentives.

from prometheus_client import Counter, Gauge
from datetime import datetime, timezone


# Prometheus metrics for cost attribution
GPU_SECONDS_USED = Counter(
    "ml_platform_gpu_seconds_total",
    "Total GPU-seconds consumed per team",
    labelnames=["team_id", "job_type", "gpu_type"],
)

CURRENT_GPU_USAGE = Gauge(
    "ml_platform_current_gpus",
    "Currently allocated GPUs per team",
    labelnames=["team_id"],
)


class CostAttributionTracker:
    """
    Track and report GPU consumption per team for chargeback.
    Integrates with Kubernetes pod metrics and Prometheus.
    """

    # A100 GPU pricing (approximate AWS p4d.24xlarge spot equivalent)
    GPU_COST_PER_HOUR: dict = {
        "a100": 2.50,   # $/GPU/hour
        "v100": 1.80,
        "t4": 0.60,
    }

    def __init__(self, db_client):
        self.db = db_client

    def record_job_start(
        self,
        job_id: str,
        team_id: str,
        gpu_count: int,
        gpu_type: str = "a100",
        job_type: str = "training",
    ) -> None:
        """Record when a job starts consuming GPUs."""
        record = {
            "job_id": job_id,
            "team_id": team_id,
            "gpu_count": gpu_count,
            "gpu_type": gpu_type,
            "job_type": job_type,
            "started_at": datetime.now(timezone.utc).isoformat(),
            "estimated_cost_per_hour": (
                gpu_count * self.GPU_COST_PER_HOUR.get(gpu_type, 1.0)
            ),
        }
        self.db.insert("job_cost_records", record)
        CURRENT_GPU_USAGE.labels(team_id=team_id).inc(gpu_count)

    def record_job_end(
        self,
        job_id: str,
        team_id: str,
        gpu_count: int,
        gpu_type: str = "a100",
    ) -> dict:
        """Record when a job completes and compute final cost."""
        record = self.db.get("job_cost_records", job_id)
        if not record:
            return {}

        started = datetime.fromisoformat(record["started_at"])
        ended = datetime.now(timezone.utc)
        duration_hours = (ended - started).total_seconds() / 3600

        cost_usd = (
            gpu_count
            * duration_hours
            * self.GPU_COST_PER_HOUR.get(gpu_type, 1.0)
        )

        # Update Prometheus counters
        GPU_SECONDS_USED.labels(
            team_id=team_id,
            job_type=record["job_type"],
            gpu_type=gpu_type,
        ).inc(gpu_count * duration_hours * 3600)

        CURRENT_GPU_USAGE.labels(team_id=team_id).dec(gpu_count)

        result = {
            "job_id": job_id,
            "team_id": team_id,
            "duration_hours": round(duration_hours, 3),
            "gpu_count": gpu_count,
            "cost_usd": round(cost_usd, 4),
        }
        self.db.update("job_cost_records", job_id, result)
        return result

    def get_team_monthly_spend(self, team_id: str, month: str) -> dict:
        """
        Compute total spend for a team in a given month.
        month format: "2024-03"
        """
        records = self.db.query(
            "SELECT SUM(cost_usd), SUM(gpu_count * duration_hours) "
            "FROM job_cost_records "
            "WHERE team_id = ? AND started_at LIKE ?",
            [team_id, f"{month}%"],
        )
        return {
            "team_id": team_id,
            "month": month,
            "total_cost_usd": records[0][0] or 0.0,
            "total_gpu_hours": records[0][1] or 0.0,
        }

    def generate_chargeback_report(self, month: str) -> list:
        """Generate per-team chargeback report for finance."""
        teams = self.db.query(
            "SELECT DISTINCT team_id FROM job_cost_records"
        )
        return [
            self.get_team_monthly_spend(row[0], month)
            for row in teams
        ]

Tenant Onboarding Automation

Manual tenant onboarding is error-prone and slow. Automate it with a Terraform module or Kubernetes operator.

# tenant_onboarding.py
from kubernetes import client, config
import subprocess


class TenantOnboarder:
    """
    Automates new team onboarding to the ML platform.
    Creates namespace, RBAC, resource quota, network policies,
    and S3 bucket prefix.
    """

    def __init__(self, namespace_prefix: str = ""):
        config.load_kube_config()
        self.core_v1 = client.CoreV1Api()
        self.rbac_v1 = client.RbacAuthorizationV1Api()
        self.networking_v1 = client.NetworkingV1Api()

    def onboard_team(
        self,
        team_id: str,
        cost_center: str,
        gpu_quota: int,
        cpu_quota: int,
        memory_gb_quota: int,
        admin_users: list,
    ) -> dict:
        """
        Create all Kubernetes resources for a new team.
        Returns a summary of created resources.
        """
        namespace = f"ml-{team_id}"
        created = []

        # 1. Create namespace
        self._create_namespace(namespace, team_id, cost_center)
        created.append(f"Namespace/{namespace}")

        # 2. Apply resource quota
        self._apply_resource_quota(
            namespace, gpu_quota, cpu_quota, memory_gb_quota
        )
        created.append(f"ResourceQuota/{namespace}-quota")

        # 3. Apply network policy (default-deny cross-namespace)
        self._apply_network_policy(namespace)
        created.append("NetworkPolicy/default-deny-cross-namespace")

        # 4. Create RBAC - admin role for team members
        self._create_team_rbac(namespace, team_id, admin_users)
        created.append(f"RoleBinding/{team_id}-admins")

        # 5. Create service account with IRSA annotation
        self._create_service_account(namespace, team_id)
        created.append(f"ServiceAccount/{team_id}-sa")

        print(f"[Onboarding] Team {team_id} onboarded. Created: {created}")
        return {
            "team_id": team_id,
            "namespace": namespace,
            "gpu_quota": gpu_quota,
            "created_resources": created,
        }

    def _create_namespace(
        self, namespace: str, team_id: str, cost_center: str
    ) -> None:
        ns = client.V1Namespace(
            metadata=client.V1ObjectMeta(
                name=namespace,
                labels={
                    "team": team_id,
                    "cost-center": cost_center,
                    "managed-by": "ml-platform",
                },
            )
        )
        self.core_v1.create_namespace(body=ns)

    def _apply_resource_quota(
        self,
        namespace: str,
        gpu_quota: int,
        cpu_quota: int,
        memory_gb_quota: int,
    ) -> None:
        quota = client.V1ResourceQuota(
            metadata=client.V1ObjectMeta(
                name=f"{namespace}-quota",
                namespace=namespace,
            ),
            spec=client.V1ResourceQuotaSpec(
                hard={
                    # Extended resources accept only the "requests." prefix in a quota
                    "requests.nvidia.com/gpu": str(gpu_quota),
                    "requests.cpu": str(cpu_quota),
                    "limits.cpu": str(cpu_quota * 2),
                    "requests.memory": f"{memory_gb_quota}Gi",
                    "limits.memory": f"{memory_gb_quota * 2}Gi",
                    "count/pods": "500",
                    "count/jobs.batch": "2000",
                }
            ),
        )
        self.core_v1.create_namespaced_resource_quota(
            namespace=namespace, body=quota
        )

    def _create_team_rbac(
        self, namespace: str, team_id: str, admin_users: list
    ) -> None:
        # Bind the edit ClusterRole to team admins within their namespace
        binding = client.V1RoleBinding(
            metadata=client.V1ObjectMeta(
                name=f"{team_id}-admins",
                namespace=namespace,
            ),
            role_ref=client.V1RoleRef(
                api_group="rbac.authorization.k8s.io",
                kind="ClusterRole",
                name="edit",  # built-in: can create/update/delete most resources
            ),
            subjects=[
                # Newer kubernetes client releases rename V1Subject to RbacV1Subject
                client.V1Subject(
                    kind="User",
                    name=user,
                    api_group="rbac.authorization.k8s.io",
                )
                for user in admin_users
            ],
        )
        self.rbac_v1.create_namespaced_role_binding(
            namespace=namespace, body=binding
        )

    def _create_service_account(self, namespace: str, team_id: str) -> None:
        sa = client.V1ServiceAccount(
            metadata=client.V1ObjectMeta(
                name=f"{team_id}-sa",
                namespace=namespace,
                annotations={
                    # IRSA: bind to IAM role that scopes S3 access to team prefix
                    "eks.amazonaws.com/role-arn": (
                        f"arn:aws:iam::123456789:role/ml-platform-{team_id}"
                    )
                },
            )
        )
        self.core_v1.create_namespaced_service_account(
            namespace=namespace, body=sa
        )

    def _apply_network_policy(self, namespace: str) -> None:
        # Default deny all cross-namespace traffic - applied via kubectl
        subprocess.run(
            [
                "kubectl", "apply",
                "-n", namespace,
                "-f", "platform-templates/default-deny-network-policy.yaml",
            ],
            check=True,
        )
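Because `team_id` flows straight into a namespace name, it is worth validating before any resource is created. A small sketch, assuming the `ml-` prefix used in `onboard_team`; the helper name `tenant_namespace` is hypothetical. Namespace names must be valid RFC 1123 labels: at most 63 characters, lowercase alphanumerics or `-`, starting and ending with an alphanumeric.

```python
import re

# RFC 1123 label: lowercase alphanumerics and '-', alphanumeric at both ends
_DNS_LABEL = re.compile(r"[a-z0-9]([-a-z0-9]*[a-z0-9])?")


def tenant_namespace(team_id: str, prefix: str = "ml-") -> str:
    """Return the namespace name for a team, or raise if it would be invalid."""
    name = f"{prefix}{team_id}"
    if len(name) > 63 or not _DNS_LABEL.fullmatch(name):
        raise ValueError(
            f"{name!r} is not a valid namespace name "
            "(lowercase letters, digits, and '-', max 63 characters)"
        )
    return name


print(tenant_namespace("fraud"))
```

Failing here, before the first API call, avoids leaving a half-onboarded tenant behind when a later step rejects the name.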

Security: Pod Security Standards

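Kubernetes Pod Security Standards (PSS), enforced by the built-in Pod Security Admission controller, replace PodSecurityPolicy, which was deprecated in Kubernetes 1.21 and removed in 1.25. Apply them at the namespace level with labels.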

apiVersion: v1
kind: Namespace
metadata:
  name: ml-fraud-team
  labels:
    # Enforce the "restricted" pod security standard
    # Blocks: privileged containers, host network, host PID, dangerous capabilities
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted

Training containers in a multi-tenant platform should never run as root, never use host network mode, and never mount the host filesystem.
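A platform can also lint job specs before submission so users see violations early instead of at admission time. The sketch below checks only a handful of the rules named above and is not the full "restricted" profile; the function name and the dictionary-based pod spec are assumptions for illustration.

```python
def restricted_violations(pod_spec: dict) -> list[str]:
    """Return human-readable problems for a small subset of restricted-profile rules."""
    problems = []
    if pod_spec.get("hostNetwork"):
        problems.append("hostNetwork is enabled")
    if pod_spec.get("hostPID"):
        problems.append("hostPID is enabled")
    for volume in pod_spec.get("volumes", []):
        if "hostPath" in volume:
            problems.append(f"volume {volume.get('name')} mounts the host filesystem")

    pod_context = pod_spec.get("securityContext", {})
    for container in pod_spec.get("containers", []):
        name = container.get("name")
        context = container.get("securityContext", {})
        if context.get("privileged"):
            problems.append(f"container {name} is privileged")
        # A container-level setting overrides the pod-level one
        non_root = context.get("runAsNonRoot", pod_context.get("runAsNonRoot"))
        if non_root is not True:
            problems.append(f"container {name} does not set runAsNonRoot")
    return problems


spec = {
    "hostNetwork": True,
    "containers": [{"name": "train", "securityContext": {"privileged": True}}],
}
for problem in restricted_violations(spec):
    print(problem)
```

The namespace labels remain the enforcement point; a pre-check like this only shortens the feedback loop.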

How Databricks Serves Thousands of Tenants

Databricks operates a multi-tenant managed Spark and ML platform serving tens of thousands of enterprise customers on shared infrastructure (AWS, Azure, GCP). Their isolation architecture:

Cluster isolation: each customer's Databricks cluster runs in the customer's own cloud account (classic compute plane) or in Databricks-managed infrastructure with strict network isolation (serverless compute plane). Serverless uses Kubernetes with network policies, RBAC, and separate IAM roles per workspace.

Data isolation: the Unity Catalog governance layer enforces data access controls at the catalog, schema, and table level. Even on shared compute, a workspace cannot access data it does not own. Fine-grained access control is enforced at the query layer - a SQL query in workspace A cannot read tables owned by workspace B.

Compute isolation: the Serverless architecture allocates pre-warmed microVMs per SQL warehouse. Each query runs in an isolated microVM with no shared memory with other tenants. The isolation is at the VM level, stronger than container-level isolation.

Cost isolation: every resource usage event (compute seconds, data scanned, storage read) is tagged with workspace ID and attributed to the customer. The billing system aggregates these events into monthly invoices.

:::danger Noisy Neighbor Problem

The noisy neighbor problem: a team with a GPU-intensive workload degrades performance for all other teams on the same physical node. Kubernetes ResourceQuota prevents quota overuse but does not prevent a legitimate job from consuming all available CPU, memory bandwidth, or network I/O on a node, affecting co-located pods from other teams.

Solution: use node taints and team-specific node pools. High-priority workloads get dedicated node pools; experimental workloads share a common pool. Use PodAntiAffinity to spread pods from the same team across different nodes, reducing the blast radius of any single misbehaving pod. For the most sensitive production workloads (fraud detection, real-time pricing), use dedicated nodes with taints and tolerations - no sharing.

:::

:::warning Resource Quota Does Not Cover All Resource Types

Kubernetes ResourceQuota covers CPU, memory, GPU, storage requests, and object counts. It does not cover: (1) network bandwidth - a team's pod can saturate the node's network interface; (2) disk IOPS - a pod can saturate the node's NVMe drives with random I/O; (3) custom hardware accelerators that are not registered as Kubernetes resources. Monitor these at the node level with Prometheus node exporter and alert when node-level resource saturation is detected, even if no single pod has exceeded its quota.

:::

Interview Q&A

Q1: How does Kubernetes ResourceQuota enable multi-tenancy for ML platforms?

ResourceQuota applies hard limits on resource consumption within a Kubernetes namespace. For ML multi-tenancy, each team gets a namespace with a ResourceQuota that caps their GPU allocation, CPU, memory, and object counts. When a team tries to create a pod that would exceed their quota, the API server rejects the request with a 403 Forbidden "exceeded quota" error. The over-quota pod is never created, so the training job stalls or fails visibly rather than taking GPUs from other teams.

ResourceQuota alone is not sufficient - it prevents over-allocation but does not ensure fair distribution of available capacity. For fairness, you need a fair scheduling layer (Kueue or Volcano) that queues jobs across teams and dispatches them in proportion to each team's guaranteed share when the cluster has available capacity.
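The quota check itself is simple bookkeeping. A sketch of the decision for a single pod, assuming usage and limits are tracked as plain integers per resource name; this mirrors the idea rather than the API server's implementation, and `quota_violations` is a hypothetical helper.

```python
def quota_violations(
    requested: dict[str, int],
    used: dict[str, int],
    hard: dict[str, int],
) -> list[str]:
    """Resources whose hard limit the new pod would exceed; empty means admit."""
    return [
        resource
        for resource, amount in requested.items()
        if resource in hard and used.get(resource, 0) + amount > hard[resource]
    ]


hard = {"requests.nvidia.com/gpu": 32}
used = {"requests.nvidia.com/gpu": 24}

print(quota_violations({"requests.nvidia.com/gpu": 8}, used, hard))   # []
print(quota_violations({"requests.nvidia.com/gpu": 16}, used, hard))  # ['requests.nvidia.com/gpu']
```

Note what the check does not consider: which team has waited longest, or whether the cluster has idle GPUs. Those are the questions the queueing layer answers.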

Q2: How do you prevent data leakage between tenants on a shared ML platform?

Data isolation requires controls at multiple layers. At the network layer, Kubernetes NetworkPolicy blocks all cross-namespace pod-to-pod communication by default, with explicit allow rules only for platform services (monitoring, registry). At the storage layer, each team gets a separate S3 prefix or bucket, with access controlled by IAM roles bound to Kubernetes Service Accounts (IRSA). Team pods automatically receive credentials scoped to their prefix only - they cannot list or read from other teams' prefixes. At the secret layer, each team's API keys and model access tokens are stored in Kubernetes Secrets scoped to their namespace, not readable from other namespaces. At the application layer, the feature store enforces tenant-scoped access - a team can only read/write feature groups they own.

Q3: How do you handle the noisy neighbor problem on a shared GPU cluster?

The noisy neighbor problem is when one tenant's workload degrades performance for others through contention over non-quota resources: memory bandwidth, PCIe bandwidth, NVMe IOPS, or network. Pure Kubernetes quota does not prevent this.

Solutions in order of escalating isolation: (1) PodAntiAffinity rules to spread tenants across different nodes, limiting the blast radius; (2) Node pools per priority tier - production jobs get dedicated node pools with team taints, experimental jobs share a common pool; (3) Physical isolation for the most sensitive workloads - dedicated bare-metal GPU nodes with no sharing. Monitor node-level resource saturation (not just quota utilization) with Prometheus node exporter. Alert on node-level NVMe utilization above 80% and network bandwidth above 70% - these thresholds indicate noisy neighbor risk before it impacts other tenants.
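The alerting rule can be expressed as a threshold check over node-level metrics. A sketch assuming utilization values have already been scraped and normalized to the range 0-1; the metric names and the `saturated_nodes` helper are hypothetical, while the 80% and 70% thresholds come from the answer above.

```python
# Thresholds from the text: NVMe above 80%, network bandwidth above 70%
THRESHOLDS = {"nvme_utilization": 0.80, "network_utilization": 0.70}


def saturated_nodes(node_metrics: dict[str, dict[str, float]]) -> dict[str, list[str]]:
    """Map node name to the metrics over threshold; healthy nodes are omitted."""
    alerts = {}
    for node, metrics in node_metrics.items():
        over = [
            name
            for name, limit in THRESHOLDS.items()
            if metrics.get(name, 0.0) > limit
        ]
        if over:
            alerts[node] = over
    return alerts


metrics = {
    "gpu-node-1": {"nvme_utilization": 0.91, "network_utilization": 0.40},
    "gpu-node-2": {"nvme_utilization": 0.20, "network_utilization": 0.50},
}
print(saturated_nodes(metrics))  # {'gpu-node-1': ['nvme_utilization']}
```

In practice the same condition would usually live in a Prometheus alerting rule; the Python form is only meant to make the logic explicit.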

Q4: How does cost attribution work in a multi-tenant ML platform and why does it matter?

Cost attribution records the GPU-hours, CPU-hours, and storage consumed by each team and associates that consumption with a dollar amount. This matters for three reasons: (1) Internal chargeback - platform costs are allocated to the teams that generated them, creating incentive to optimize; (2) Budget enforcement - teams know how much they are spending and can make cost/benefit decisions about model complexity and training frequency; (3) Platform funding - the platform team recovers infrastructure costs from consuming teams, making the platform financially sustainable.

Implementation: instrument every job start/end event with team ID, GPU type, GPU count, and duration. Store this in a time-series database. Join with GPU pricing (on-demand or spot) to compute dollar amounts. Export weekly and monthly reports per team. Integrate with the company's internal billing system for actual chargeback. The hardest part is attributing shared costs (control plane, monitoring, storage overhead) fairly across teams - typically allocated proportionally to compute consumption.
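Proportional allocation of shared costs can be sketched in a few lines. This assumes each team's direct compute cost is already known for the period; the function name, the team figures, and the even split when nobody consumed compute are illustrative choices.

```python
def allocate_shared_cost(
    direct_costs: dict[str, float],
    shared_cost: float,
) -> dict[str, float]:
    """Add each team's share of shared cost, proportional to its direct compute cost."""
    total = sum(direct_costs.values())
    if total <= 0:
        # Nobody consumed compute: fall back to an even split
        even = shared_cost / len(direct_costs) if direct_costs else 0.0
        return {team: cost + even for team, cost in direct_costs.items()}
    return {
        team: cost + shared_cost * cost / total
        for team, cost in direct_costs.items()
    }


direct = {"fraud": 1000.0, "performance": 3000.0}  # hypothetical monthly compute cost in USD
print(allocate_shared_cost(direct, shared_cost=400.0))
# {'fraud': 1100.0, 'performance': 3300.0}
```

The allocated totals always sum to direct plus shared cost, which makes the report easy to reconcile against the cloud bill.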

Q5: How does Databricks achieve multi-tenancy isolation in their Serverless SQL product?

Databricks Serverless SQL uses a microVM-per-query architecture. Each SQL query executes inside an isolated microVM (Firecracker, the same technology used by AWS Lambda). MicroVMs provide VM-level isolation with near-container startup times (50-150ms). No shared memory, no shared CPU cache, no shared network namespace between queries from different customers.

The compute pool (warmed microVMs) is shared, but each running query instance is isolated. At the network layer, each microVM gets a virtual network interface with no routing to other microVMs. At the data layer, Unity Catalog enforces fine-grained access control at the query planning stage - before any data is read, the query planner validates that the requesting workspace has permission to access each referenced table. This happens even on shared compute because the access control logic runs inside the microVM, not as a gate before the compute.

Summary

Multi-tenant ML platforms share expensive GPU infrastructure across teams while maintaining isolation, fairness, and accountability. Kubernetes namespaces provide the isolation unit - separate RBAC, network policies, resource quotas, and storage access per tenant. Fair scheduling (Kueue, Volcano) ensures no team monopolizes the cluster, with guaranteed shares and borrowable capacity from the cohort. Data isolation spans network policies (blocking cross-namespace communication), IAM roles (scoped S3 access per team), and application-layer enforcement (feature store access controls). Cost attribution tracks GPU-hours per team and converts them to dollars, creating the right economic incentives. Tenant onboarding is fully automated - a single script creates namespace, quota, RBAC, network policy, and service account. The Databricks model shows that even at thousands-of-tenants scale, these primitives combine into a system that is both economically efficient and strongly isolated.
