Multi-Model Serving

Fifty Fine-Tuned Models, One Team, One Budget

The platform team has fifty customer accounts. Each customer has a fine-tuned version of the base recommendation model, trained on their proprietary data. Each fine-tuned model is 850MB. Fifty models: 42.5GB of weights. The team has eight A100 GPUs. An A100 has 80GB of VRAM.

The naive approach: one GPU per customer. Problem: eight GPUs, fifty customers. The math does not work.

The next idea: load all models on the CPUs and move to GPU on demand. Problem: a PCIe transfer of 850MB takes roughly 1.4 seconds. Add model load time and you are at 3-5 seconds before serving the first prediction. Unacceptable.

The solution the team implements: model multiplexing with LRU eviction. The A100s collectively hold 12-15 models in VRAM at any time (the most recently requested ones). Models are organized by customer usage frequency: the top 12 customers (accounting for 80% of traffic) have their models pinned; the remaining 38 share the remaining GPU slots with LRU eviction. When a less-frequent customer's request arrives and their model is not loaded, it is fetched from the host DRAM cache (not disk) in ~400ms. Customer experience: fast for frequent users, acceptable for occasional ones.

This is multi-model serving. Not 50 services with 50 teams. One infrastructure, one team, fifty models, all with adequate resource isolation, metrics, and the ability to A/B test any of them independently.

Why This Exists - The Economics of Model Proliferation

Early ML deployments had one or two models. One recommendation model, one spam filter. One GPU cluster per model. Life was simple.

Modern ML products have model proliferation at every level:

- Fine-tuned variants: the same base model fine-tuned per customer, per language, per domain
- A/B test variants: new model version vs old, for gradual rollout
- Ensemble components: 3-5 models whose predictions are averaged for a final output
- Pipeline stages: embedding model → retrieval model → reranking model → generation model
- Feature-specific models: different models for different input types (text, image, tabular)

The infrastructure question is not "how do I serve one model" but "how do I serve a hundred models with the same reliability, monitoring, and update lifecycle, using hardware efficiently."

Historical Context

Single-model serving was the norm until roughly 2018. As companies built larger ML teams and more diverse product features, model counts grew. Google's internal TensorFlow Serving (open-sourced in 2016) was one of the first systems designed for multi-model serving: it could load and serve multiple models from the same server process, with separate endpoints per model.

NVIDIA's Triton Inference Server (first released in 2018 as TensorRT Inference Server and renamed Triton in 2020) went further: model repositories on shared storage, dynamic model loading and unloading, concurrent model execution on the same GPU, and per-model version policies. Triton popularized the model repository abstraction that many production serving systems now use.

The serving-at-scale challenge intensified with LLMs. A single 7B parameter model occupies about 14GB of VRAM in fp16. Running 50 fine-tuned variants as full copies would require 700GB, which is nine 80GB A100s for the weights alone. LoRA fine-tuning leaves the base model weights frozen and shared, and S-LoRA (Sheng et al., 2023) exploited this for serving: you can serve 50 variants by loading the base model once (14GB) and swapping only the small adapters (10-100MB each) on demand. This made multi-model LLM serving economically viable.

Model Multiplexing Architecture

Model multiplexing serves multiple models from shared serving infrastructure, routing requests to the appropriate model based on the request metadata.
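As a minimal sketch of the routing step, assuming each request carries a `customer_id` field in its metadata: the table `MODEL_BY_CUSTOMER`, the fallback `DEFAULT_MODEL`, and the function `resolve_model_id` are hypothetical names introduced for illustration, not part of any serving library.

```python
from typing import Dict, Mapping

# Hypothetical routing table: customer_id -> model_id (illustrative values)
MODEL_BY_CUSTOMER: Dict[str, str] = {
    "acme": "recsys-acme-v3",
    "globex": "recsys-globex-v1",
}
DEFAULT_MODEL = "recsys-base"  # assumed fallback for customers without a fine-tune


def resolve_model_id(request_meta: Mapping[str, str]) -> str:
    """Pick the model for a request from its metadata.

    An explicit model_id in the request wins; otherwise the customer's
    fine-tuned model is used, falling back to the shared base model.
    """
    if "model_id" in request_meta:
        return request_meta["model_id"]
    customer = request_meta.get("customer_id", "")
    return MODEL_BY_CUSTOMER.get(customer, DEFAULT_MODEL)


print(resolve_model_id({"customer_id": "acme"}))  # recsys-acme-v3
```

The returned ID is the key the serving layer would use to look the model up in its GPU pool.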

# model_multiplexer.py - GPU model pool with LRU eviction
import threading
import time
from collections import OrderedDict
from typing import Dict, Optional, Callable
import torch

class ModelPool:
    """
    Manages a bounded pool of models in GPU memory with LRU eviction.
    Thread-safe for concurrent serving.
    """

    def __init__(
        self,
        max_gpu_models: int,
        load_fn: Callable[[str], torch.nn.Module],
        device: str = "cuda:0"
    ):
        self.max_gpu_models = max_gpu_models
        self.load_fn = load_fn  # function: model_id -> loaded model
        self.device = device

        # LRU cache: OrderedDict preserves insertion order
        self.gpu_models: OrderedDict[str, torch.nn.Module] = OrderedDict()
        self.pinned_models: set = set()  # never evicted
        self.load_times: Dict[str, float] = {}

        self.lock = threading.Lock()
        self.loading_locks: Dict[str, threading.Event] = {}

    def pin(self, model_id: str):
        """Mark a model as pinned - never evicted regardless of LRU."""
        with self.lock:
            self.pinned_models.add(model_id)
            # Load it now if not loaded
            if model_id not in self.gpu_models:
                self._load_model_locked(model_id)

    def get_model(self, model_id: str) -> torch.nn.Module:
        """
        Get a model, loading it if necessary (with LRU eviction).
        Thread-safe: concurrent requests for the same model wait for
        the single loading event.
        """
        # Fast path: model already in GPU memory
        with self.lock:
            if model_id in self.gpu_models:
                # Move to end (most recently used)
                self.gpu_models.move_to_end(model_id)
                return self.gpu_models[model_id]

            # Check if another thread is already loading this model.
            # event stays None when this thread becomes the loader.
            event = self.loading_locks.get(model_id)
            if event is None:
                self.loading_locks[model_id] = threading.Event()

        if event is None:
            # This thread loads the model
            try:
                start = time.perf_counter()
                model = self.load_fn(model_id)
                model.to(self.device)
                model.eval()
                load_time_ms = (time.perf_counter() - start) * 1000

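                with self.lock:
                    # Note: the new model is already on the GPU at this point, so
                    # VRAM briefly holds max_gpu_models + 1 models; size the pool
                    # with that headroom in mind.
                    self._evict_if_needed()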
                    self.gpu_models[model_id] = model
                    self.load_times[model_id] = load_time_ms
                    if model_id in self.loading_locks:
                        self.loading_locks[model_id].set()
                        del self.loading_locks[model_id]

                return model
            except Exception:
                # Wake any waiters so they fail fast instead of blocking on the event
                with self.lock:
                    if model_id in self.loading_locks:
                        self.loading_locks[model_id].set()
                        del self.loading_locks[model_id]
                raise
        else:
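            # Wait for the loading thread (give up after 30s)
            if not event.wait(timeout=30.0):
                raise TimeoutError(f"Timed out waiting for model {model_id} to load")
            with self.lock:
                if model_id in self.gpu_models:
                    # Count this access for LRU ordering, like the fast path
                    self.gpu_models.move_to_end(model_id)
                    return self.gpu_models[model_id]
                raise RuntimeError(f"Model {model_id} failed to load")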

    def _evict_if_needed(self):
        """Evict LRU model if pool is full. Called with lock held."""
        if len(self.gpu_models) < self.max_gpu_models:
            return

        # Find oldest non-pinned model
        for model_id in list(self.gpu_models.keys()):
            if model_id not in self.pinned_models:
                model = self.gpu_models.pop(model_id)
                # Free GPU memory
                del model
                torch.cuda.empty_cache()
                print(f"Evicted {model_id} from GPU pool")
                return

        raise RuntimeError(
            "GPU pool full and all models are pinned - cannot evict"
        )

    def _load_model_locked(self, model_id: str):
        """Load model, assuming lock is held."""
        model = self.load_fn(model_id)
        model.to(self.device)
        model.eval()
        self._evict_if_needed()
        self.gpu_models[model_id] = model
        self.gpu_models.move_to_end(model_id)

    @property
    def stats(self) -> dict:
        with self.lock:
            return {
                "loaded_models": list(self.gpu_models.keys()),
                "pinned_models": list(self.pinned_models),
                "pool_utilization": len(self.gpu_models) / self.max_gpu_models,
                "avg_load_time_ms": (
                    sum(self.load_times.values()) / len(self.load_times)
                    if self.load_times else 0
                )
            }
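To see how pinning changes hit rates without needing a GPU, the eviction policy can be replayed against a request trace. The following is a standalone sketch that mirrors the pool's LRU-with-pinning rule using plain strings instead of models; `simulate_pool` is a hypothetical helper, not part of `ModelPool`.

```python
from collections import OrderedDict
from typing import Iterable, Set, Tuple


def simulate_pool(
    trace: Iterable[str], capacity: int, pinned: Set[str]
) -> Tuple[int, int]:
    """Replay a request trace against an LRU pool; return (hits, misses).

    Pinned models are preloaded and never evicted, so they always hit.
    """
    if len(pinned) > capacity:
        raise ValueError("more pinned models than pool slots")
    pool: OrderedDict = OrderedDict((m, None) for m in sorted(pinned))
    hits = misses = 0
    for model_id in trace:
        if model_id in pool:
            hits += 1
            pool.move_to_end(model_id)  # mark as most recently used
            continue
        misses += 1
        if len(pool) >= capacity:
            # Evict the least recently used model that is not pinned
            victim = next((m for m in pool if m not in pinned), None)
            if victim is None:
                continue  # every slot is pinned: serve without caching
            del pool[victim]
        pool[model_id] = None
    return hits, misses


trace = ["a", "b", "a", "c", "b", "d", "a"]
print(simulate_pool(trace, capacity=2, pinned=set()))   # (1, 6)
print(simulate_pool(trace, capacity=2, pinned={"a"}))   # (3, 4)
```

In this tiny trace, pinning the frequently requested model "a" turns two of its misses into hits. Replaying real traffic logs the same way is one way to choose how many slots to pin.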

A/B Testing Infrastructure

A/B testing for ML models requires traffic splitting, consistent user assignment, and metric collection per variant.

# ab_testing.py - model A/B testing router
import hashlib
import random
import time
from dataclasses import dataclass, field
from typing import Dict, List, Optional
from enum import Enum

from model_multiplexer import ModelPool  # defined in model_multiplexer.py above

class TrafficSplit(Enum):
    CONTROL = "control"
    TREATMENT = "treatment"

@dataclass
class ExperimentConfig:
    experiment_id: str
    control_model: str      # model_id for control (e.g., "model_v1.2")
    treatment_model: str    # model_id for treatment (e.g., "model_v1.3")
    treatment_percentage: float = 0.10  # 10% in treatment
    sticky: bool = True     # same user always gets same variant
    enabled: bool = True
    metrics: List[str] = field(default_factory=lambda: ["latency", "error_rate"])

class ABTestRouter:
    """
    Routes requests to A or B model based on experiment config.
    Uses deterministic hashing for consistent user assignment.
    """

    def __init__(self, model_pool: ModelPool):
        self.model_pool = model_pool
        self.experiments: Dict[str, ExperimentConfig] = {}
        self.metrics: Dict[str, Dict] = {}

    def register_experiment(self, config: ExperimentConfig):
        self.experiments[config.experiment_id] = config
        self.metrics[config.experiment_id] = {
            "control": {"requests": 0, "errors": 0, "total_latency_ms": 0},
            "treatment": {"requests": 0, "errors": 0, "total_latency_ms": 0},
        }
        print(f"Experiment '{config.experiment_id}' registered: "
              f"{config.control_model} vs {config.treatment_model} "
              f"({config.treatment_percentage*100:.0f}% treatment)")

    def _assign_variant(
        self, experiment: ExperimentConfig, user_id: str
    ) -> TrafficSplit:
        """
        Deterministically assign user to control or treatment.
        Same user_id always gets the same variant for sticky experiments.
        """
        if not experiment.sticky:
            # Random assignment - not consistent per user
            return (TrafficSplit.TREATMENT
                    if random.random() < experiment.treatment_percentage
                    else TrafficSplit.CONTROL)

        # Hash-based assignment - consistent per user per experiment
        hash_input = f"{experiment.experiment_id}:{user_id}"
        hash_value = int(hashlib.md5(hash_input.encode()).hexdigest(), 16)
        bucket = (hash_value % 10000) / 10000.0  # [0, 1)

        return (TrafficSplit.TREATMENT
                if bucket < experiment.treatment_percentage
                else TrafficSplit.CONTROL)

    async def predict(
        self,
        experiment_id: str,
        user_id: str,
        input_data,
        predict_fn: callable
    ) -> dict:
        """Route request to appropriate model variant and collect metrics."""
        if experiment_id not in self.experiments:
            raise ValueError(f"Unknown experiment: {experiment_id}")

        experiment = self.experiments[experiment_id]
        if not experiment.enabled:
            # Experiment disabled - use control
            variant = TrafficSplit.CONTROL
        else:
            variant = self._assign_variant(experiment, user_id)

        model_id = (experiment.treatment_model
                    if variant == TrafficSplit.TREATMENT
                    else experiment.control_model)

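        # Note: get_model blocks while a model loads; in an async server,
        # run it in an executor so the event loop keeps serving other requests
        model = self.model_pool.get_model(model_id)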
        metrics = self.metrics[experiment_id][variant.value]

        start = time.perf_counter()
        try:
            result = await predict_fn(model, input_data)
        except Exception:
            # Count the failure, then let the original exception propagate
            metrics["errors"] += 1
            raise
        finally:
            latency_ms = (time.perf_counter() - start) * 1000
            metrics["requests"] += 1
            metrics["total_latency_ms"] += latency_ms

        return {
            **result,
            "_experiment": experiment_id,
            "_variant": variant.value,
            "_model": model_id,
        }

    def get_experiment_stats(self, experiment_id: str) -> dict:
        """Return per-variant metrics for experiment analysis."""
        metrics = self.metrics.get(experiment_id, {})
        stats = {}
        for variant, m in metrics.items():
            n = m["requests"]
            stats[variant] = {
                "requests": n,
                "error_rate": m["errors"] / n if n > 0 else 0,
                "avg_latency_ms": m["total_latency_ms"] / n if n > 0 else 0,
            }
        return stats
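The per-variant counters above give raw error rates, but a difference between control and treatment can be noise when sample sizes are small. One common check is a two-proportion z-test. The sketch below is an illustration only: the lesson does not prescribe a specific statistical test, and `two_proportion_z` is a hypothetical helper.

```python
import math


def two_proportion_z(errors_a: int, n_a: int, errors_b: int, n_b: int) -> float:
    """z-statistic for the difference in error rate between two variants.

    Positive values mean variant B has the higher error rate.
    Returns 0.0 when there is not enough data to compute a variance.
    """
    if n_a == 0 or n_b == 0:
        return 0.0
    p_a, p_b = errors_a / n_a, errors_b / n_b
    pooled = (errors_a + errors_b) / (n_a + n_b)
    variance = pooled * (1 - pooled) * (1 / n_a + 1 / n_b)
    if variance == 0:
        return 0.0
    return (p_b - p_a) / math.sqrt(variance)


# Control: 10 errors in 1000 requests; treatment: 30 errors in 1000 requests
z = two_proportion_z(10, 1000, 30, 1000)
print(f"z = {z:.2f}")
```

An absolute z value above roughly 1.96 corresponds to a two-sided 5% significance level, which is a conventional (not mandatory) cutoff.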

Shadow Mode Testing

Shadow mode runs a new model in parallel with the production model without returning the new model's results to users. This lets you measure real-traffic behavior before any user-visible rollout.

# shadow_testing.py - shadow mode for ML model validation
import asyncio
import time
from typing import Any

class ShadowTester:
    """
    Runs shadow model in parallel. Production results returned to users.
    Shadow results logged for offline analysis.
    """

    def __init__(
        self,
        production_model,
        shadow_model,
        shadow_ratio: float = 1.0,  # fraction of requests to shadow (1.0 = all)
        log_writer = None
    ):
        self.production = production_model
        self.shadow = shadow_model
        self.shadow_ratio = shadow_ratio
        self.log_writer = log_writer
        # Strong references to in-flight shadow tasks; without them the event
        # loop may garbage-collect a task before it finishes
        self._shadow_tasks: set = set()

    async def predict(self, input_data: Any, request_id: str) -> dict:
        """
        Await production inference and return its result.
        Shadow inference runs as a background task.
        """
        import random

        # Always run production model
        prod_start = time.perf_counter()
        prod_result = await self._run_inference(self.production, input_data)
        prod_latency_ms = (time.perf_counter() - prod_start) * 1000

        # Shadow: fire and forget (non-blocking)
        if random.random() < self.shadow_ratio:
            task = asyncio.create_task(
                self._shadow_inference(input_data, request_id, prod_result, prod_latency_ms)
            )
            self._shadow_tasks.add(task)
            task.add_done_callback(self._shadow_tasks.discard)

        return prod_result

    async def _shadow_inference(
        self,
        input_data: Any,
        request_id: str,
        prod_result: dict,
        prod_latency_ms: float
    ):
        """Run shadow inference and log comparison. Does not affect user response."""
        try:
            shadow_start = time.perf_counter()
            shadow_result = await self._run_inference(self.shadow, input_data)
            shadow_latency_ms = (time.perf_counter() - shadow_start) * 1000

            # Compare outputs
            comparison = {
                "request_id": request_id,
                "timestamp": time.time(),
                "production": {
                    "result": prod_result,
                    "latency_ms": prod_latency_ms,
                },
                "shadow": {
                    "result": shadow_result,
                    "latency_ms": shadow_latency_ms,
                },
                "diverged": prod_result != shadow_result,
                "latency_delta_ms": shadow_latency_ms - prod_latency_ms,
            }

            if self.log_writer:
                await self.log_writer.write(comparison)

        except Exception as e:
            # Shadow failures never propagate to users
            print(f"Shadow inference failed for {request_id}: {e}")

    async def _run_inference(self, model, input_data: Any) -> dict:
        """Run inference in thread pool (non-blocking)."""
        # get_running_loop() is the supported way to get the loop inside a coroutine
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(None, model.predict, input_data)
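The `diverged` flag above uses exact equality, which flags almost every request when outputs contain floating-point scores. A tolerance-based comparison is one alternative. This is a sketch under the assumption that outputs are nested dicts and lists of numbers and strings; `outputs_diverge` and its default tolerance are hypothetical.

```python
import math
from typing import Any


def outputs_diverge(prod: Any, shadow: Any, rel_tol: float = 1e-3) -> bool:
    """Compare two model outputs, tolerating small floating-point drift.

    Dicts and lists are compared element by element; numbers use a relative
    tolerance; everything else falls back to exact equality.
    """
    if isinstance(prod, bool) or isinstance(shadow, bool):
        return prod != shadow  # bools are ints in Python; compare them exactly
    if isinstance(prod, (int, float)) and isinstance(shadow, (int, float)):
        return not math.isclose(prod, shadow, rel_tol=rel_tol, abs_tol=1e-9)
    if isinstance(prod, dict) and isinstance(shadow, dict):
        if prod.keys() != shadow.keys():
            return True
        return any(outputs_diverge(prod[k], shadow[k], rel_tol) for k in prod)
    if isinstance(prod, (list, tuple)) and isinstance(shadow, (list, tuple)):
        if len(prod) != len(shadow):
            return True
        return any(outputs_diverge(p, s, rel_tol) for p, s in zip(prod, shadow))
    return prod != shadow


print(outputs_diverge({"score": 0.5}, {"score": 0.5000001}))  # False
print(outputs_diverge({"score": 0.5}, {"score": 0.6}))        # True
```

The right tolerance depends on the model and on what counts as a user-visible difference, so treat the default here as a placeholder.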

Canary Deployments for ML Models

Canary deployment gradually shifts traffic from the old model to the new model, monitoring error rates and business metrics at each step.

# canary_deployment.py - gradual traffic migration with automatic rollback
import asyncio
import time
from typing import Dict

from ab_testing import ABTestRouter, ExperimentConfig  # defined in ab_testing.py above

class CanaryController:
    """
    Controls canary deployment of a new model version.
    Automatically rolls back if error rate or latency degrades.
    """

    def __init__(
        self,
        current_model_id: str,
        canary_model_id: str,
        router: ABTestRouter,
        rollback_error_rate_threshold: float = 0.05,  # 5% errors → rollback
        rollback_avg_latency_ms: float = 500.0,       # avg latency >500ms → rollback
    ):
        self.current = current_model_id
        self.canary = canary_model_id
        self.router = router
        self.error_threshold = rollback_error_rate_threshold
        # Compared against average latency: ABTestRouter tracks totals, not p99
        self.latency_threshold = rollback_avg_latency_ms

        # Register experiment with 0% canary initially
        self.experiment_id = f"canary_{canary_model_id}"
        self.router.register_experiment(ExperimentConfig(
            experiment_id=self.experiment_id,
            control_model=current_model_id,
            treatment_model=canary_model_id,
            treatment_percentage=0.0,  # start at 0%
            enabled=True
        ))

    async def run_canary_rollout(
        self,
        stages: tuple = (0.01, 0.05, 0.10, 0.25, 0.50, 1.0),  # tuple avoids a shared mutable default
        stage_duration_minutes: float = 30.0,
        check_interval_seconds: float = 60.0
    ):
        """
        Run gradual canary rollout through traffic percentage stages.
        Monitors health at each stage and rolls back if thresholds exceeded.
        """
        for target_pct in stages:
            print(f"\nCanary stage: {target_pct*100:.0f}% → {self.canary}")
            await self._set_canary_percentage(target_pct)

            stage_end = time.time() + stage_duration_minutes * 60
            while time.time() < stage_end:
                await asyncio.sleep(check_interval_seconds)

                stats = self.router.get_experiment_stats(self.experiment_id)
                canary_stats = stats.get("treatment", {})

                error_rate = canary_stats.get("error_rate", 0)
                avg_latency = canary_stats.get("avg_latency_ms", 0)

                print(f"  Canary: errors={error_rate:.3f}, "
                      f"latency={avg_latency:.1f}ms "
                      f"({canary_stats.get('requests', 0)} requests)")

                if error_rate > self.error_threshold:
                    print(f"  ROLLBACK: error rate {error_rate:.3f} > {self.error_threshold}")
                    await self._rollback()
                    return

                if avg_latency > self.latency_threshold:
                    print(f"  ROLLBACK: latency {avg_latency:.1f}ms > {self.latency_threshold}ms")
                    await self._rollback()
                    return

            print(f"  Stage {target_pct*100:.0f}% healthy - advancing")

        print(f"\nCanary complete - {self.canary} is now serving 100% of traffic")
        await self._complete_rollout()

    async def _set_canary_percentage(self, pct: float):
        exp = self.router.experiments[self.experiment_id]
        exp.treatment_percentage = pct

    async def _rollback(self):
        print(f"Rolling back to {self.current}")
        await self._set_canary_percentage(0.0)
        exp = self.router.experiments[self.experiment_id]
        exp.enabled = False

    async def _complete_rollout(self):
        """Update production model to the canary version."""
        exp = self.router.experiments[self.experiment_id]
        exp.control_model = self.canary
        await self._set_canary_percentage(0.0)
        exp.enabled = False
        print(f"Production model updated to {self.canary}")
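The controller above compares an average latency against its threshold, because the router only accumulates totals. Averages can hide tail regressions, so a canary check often looks at a high percentile instead. The sketch below computes a nearest-rank percentile from raw samples; it assumes the router were extended to keep per-request latencies, which the code in this lesson does not do, and `percentile` is a hypothetical helper.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of
    the data at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# 98 fast requests and 2 very slow ones
samples = [10.0] * 98 + [900.0, 950.0]
avg = sum(samples) / len(samples)
p99 = percentile(samples, 99)
print(avg, p99)  # 28.3 900.0
```

With a 500ms threshold, the average (28.3ms) would pass while the p99 (900ms) would trigger a rollback. Keeping every sample is memory-heavy at high QPS; a histogram or sketch structure is the usual production substitute.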

LoRA-Based Multi-Model Serving for LLMs

LoRA (Low-Rank Adaptation) fine-tunes only a small set of adapter weights added to the frozen base model. Multiple customers can have separate LoRA adapters that share the base model weights - enabling multi-tenant LLM serving without loading separate full models per customer.

# lora_serving.py - serving multiple LoRA adapters on shared base model
# Using the peft and transformers libraries (vLLM has its own LoRA serving path, not shown here)
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
from typing import Dict

class LoRAModelServer:
    """
    Serves multiple LoRA adapters on a shared base model.
    Base model loaded once; adapters swapped per-request.
    """

    def __init__(self, base_model_id: str, device: str = "cuda"):
        self.device = device
        self.tokenizer = AutoTokenizer.from_pretrained(base_model_id)

        print(f"Loading base model: {base_model_id}")
        self.base_model = AutoModelForCausalLM.from_pretrained(
            base_model_id,
            torch_dtype=torch.float16,
            device_map="auto"
        )
        self.base_model.eval()

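        # One PeftModel wraps the base model and holds every loaded adapter
        # (adapters are far smaller than full models, so many fit in memory)
        self.peft_model = None  # PeftModel, created on first adapter load
        self.loaded_adapters: set = set()
        self.active_adapter = None

    def load_adapter(self, adapter_id: str, adapter_path: str):
        """Load a LoRA adapter and register it under adapter_id."""
        if adapter_id in self.loaded_adapters:
            return  # already loaded

        print(f"Loading adapter: {adapter_id}")
        # LoRA adapter weights are typically 10-100MB vs 14GB for base model
        if self.peft_model is None:
            # First adapter: wrap the base model once
            self.peft_model = PeftModel.from_pretrained(
                self.base_model,
                adapter_path,
                adapter_name=adapter_id,
            )
            self.peft_model.eval()
        else:
            # Later adapters attach to the existing wrapper instead of
            # re-wrapping the already modified base model
            self.peft_model.load_adapter(adapter_path, adapter_name=adapter_id)
        self.loaded_adapters.add(adapter_id)

    def predict(self, adapter_id: str, prompt: str, max_new_tokens: int = 200) -> str:
        """Run inference with the specified LoRA adapter."""
        if adapter_id not in self.loaded_adapters:
            raise ValueError(f"Adapter {adapter_id} not loaded")

        model = self.peft_model

        # Activate the requested adapter. This mutates shared state, so
        # concurrent requests for different adapters must be serialized.
        model.set_adapter(adapter_id)
        self.active_adapter = adapter_id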

        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.device)

        with torch.no_grad():
            output = model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                pad_token_id=self.tokenizer.eos_token_id
            )

        return self.tokenizer.decode(
            output[0][inputs['input_ids'].shape[1]:],
            skip_special_tokens=True
        )

# Key economics:
# Base model (LLaMA-7B): 14GB VRAM (loaded once)
# LoRA adapter per customer: ~50MB
# 50 customers: 14GB + 50 * 0.05GB = 16.5GB total
# vs. 50 * 14GB = 700GB without LoRA sharing
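Where the adapter size comes from can be worked out from the LoRA shapes: each adapted d x d weight matrix gains two low-rank factors, A (r x d) and B (d x r). The sketch below is a back-of-the-envelope estimate, assuming only the query and value projections are adapted and weights are stored in fp16; `lora_adapter_bytes` is a hypothetical helper, and real adapters vary with rank and target modules.

```python
def lora_adapter_bytes(
    hidden_size: int,
    num_layers: int,
    rank: int,
    matrices_per_layer: int = 2,   # assumption: adapt only q_proj and v_proj
    bytes_per_param: int = 2,      # fp16
) -> int:
    """Approximate size of a LoRA adapter on square d x d projections.

    Each adapted matrix adds A (rank x d) and B (d x rank): 2 * rank * d params.
    """
    params_per_matrix = 2 * rank * hidden_size
    total_params = params_per_matrix * matrices_per_layer * num_layers
    return total_params * bytes_per_param


# LLaMA-7B shape: hidden size 4096, 32 transformer layers
size = lora_adapter_bytes(hidden_size=4096, num_layers=32, rank=16)
print(f"{size / 2**20:.0f} MiB")  # 16 MiB
```

Raising the rank to 64 and adapting four projections per layer gives 128 MiB under the same assumptions, which is why quoted adapter sizes span roughly one order of magnitude.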

Resource Isolation in Multi-Tenant GPU Serving

Multiple models on shared GPUs require resource isolation to prevent one model's traffic spike from starving others.

NVIDIA MIG (Multi-Instance GPU) on A100 divides the GPU into isolated partitions with dedicated memory and compute, giving hard resource guarantees between tenants.

For softer isolation (without MIG), use rate limiting at the request router level:

# rate_limiter.py - per-model QPS limiting for resource isolation
import asyncio
import time
from collections import defaultdict
from typing import Dict

class TokenBucketRateLimiter:
    """
    Per-model token bucket rate limiter.
    Prevents one model from consuming all GPU capacity.
    """

    def __init__(self, model_qps_limits: Dict[str, float]):
        self.limits = model_qps_limits  # model_id -> max QPS
        self.tokens: Dict[str, float] = {m: limit for m, limit in model_qps_limits.items()}
        self.last_refill: Dict[str, float] = defaultdict(time.monotonic)
        self.lock = asyncio.Lock()

    async def acquire(self, model_id: str) -> bool:
        """
        Try to acquire a token for model_id without waiting.
        Returns True if request can proceed, False if rate-limited.
        """
        limit = self.limits.get(model_id, 10)  # default 10 QPS for unknown models

        async with self.lock:
            # Refill tokens based on time elapsed
            now = time.monotonic()
            elapsed = now - self.last_refill[model_id]
            # Unknown models start with a full bucket, like configured ones
            self.tokens[model_id] = min(
                limit,
                self.tokens.get(model_id, limit) + elapsed * limit
            )
            self.last_refill[model_id] = now

            if self.tokens[model_id] >= 1:
                self.tokens[model_id] -= 1
                return True
            else:
                return False  # Rate limited - caller should return 429

Production Engineering Notes

Model Versioning and Registry

Use a model registry (MLflow, Weights & Biases, SageMaker Model Registry) as the authoritative source of model versions. The serving infrastructure should pull model metadata from the registry, not from configuration files that diverge across environments.

Key metadata to track per model version: training data fingerprint, validation metrics, resource requirements (GPU VRAM, CPU), serving SLA, rollout stage (shadow → canary → production).

Health Checks Per Model

When serving multiple models, health checks must be per-model, not per-server. A server can be healthy (process running, port open) while one of its loaded models is corrupted or producing nonsense outputs. Run warm-up inference on each model at load time and expose per-model health at /health/{model_id}.
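A per-model health check can be as small as one warm-up inference whose result is recorded by model ID. The sketch below is framework-agnostic and assumes the serving layer supplies a `predict_fn` and a representative sample input; `ModelHealthRegistry` and its method names are hypothetical.

```python
from typing import Any, Callable, Dict


class ModelHealthRegistry:
    """Tracks per-model health based on a warm-up inference."""

    def __init__(self) -> None:
        self._status: Dict[str, dict] = {}

    def warm_up(
        self,
        model_id: str,
        predict_fn: Callable[[Any], Any],
        sample_input: Any,
        validate: Callable[[Any], bool] = lambda out: out is not None,
    ) -> bool:
        """Run one inference and record whether the output looks sane."""
        try:
            healthy = bool(validate(predict_fn(sample_input)))
            detail = "ok" if healthy else "warm-up output failed validation"
        except Exception as exc:  # a crashing model is simply unhealthy
            healthy, detail = False, f"warm-up raised {type(exc).__name__}"
        self._status[model_id] = {"healthy": healthy, "detail": detail}
        return healthy

    def health(self, model_id: str) -> dict:
        """Payload a /health/{model_id} handler could return."""
        return self._status.get(
            model_id, {"healthy": False, "detail": "model not loaded"}
        )


registry = ModelHealthRegistry()
registry.warm_up("doubler", lambda x: x * 2, 21)
print(registry.health("doubler"))  # {'healthy': True, 'detail': 'ok'}
```

The `validate` callback is where model-specific sanity checks belong, for example output shape or score range, since "did not crash" is a weak signal for a corrupted model.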

Common Mistakes

:::danger GPU Memory Fragmentation with Model Cycling
Loading and unloading models from GPU memory repeatedly causes memory fragmentation. torch.cuda.empty_cache() releases unused cached blocks from PyTorch's caching allocator back to the driver, but it does not defragment memory that is still in use. After 50+ load/evict cycles, you may see "CUDA out of memory" even though nvidia-smi shows available memory, because no single contiguous block is large enough. Mitigation: pre-allocate fixed-size memory buffers and load model weights into those buffers, or use a memory allocator designed for frequent allocation (NVIDIA's cnmem or RAPIDS rmm). In production, plan your model slot sizes to match model weight sizes and avoid very frequent cycling.
:::

:::danger A/B Testing Without Consistent User Assignment
Running A/B tests without sticky user assignment causes users to experience different model versions on different requests - creating inconsistent experiences and polluting your metrics. "The new model increased click rate by 15%" is meaningless if the same user saw both models randomly. Use hash-based assignment that maps user_id consistently to control or treatment for the duration of the experiment.
:::

:::warning Shadow Mode Consuming Production Resources
Shadow mode runs the shadow model on every request (or a fraction of them). At a 100% ratio with a comparably sized model, this roughly doubles GPU compute demand and adds the shadow model's weights to VRAM. If the production model is already at 70% GPU utilization, demand rises to about 140% of capacity, which shows up as request queueing and severe latency degradation, and the extra weights and activations can trigger OOM errors. Either run shadow mode at a reduced ratio (10-20%), dedicate separate GPU capacity to shadow inference, or use asynchronous shadow processing with a queue that drops requests when the queue is full.
:::
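The drop-when-full queue mentioned in this warning can be sketched with `asyncio.Queue`. This is an illustration under the assumption that a separate worker coroutine drains the queue and runs shadow inference; `DroppingShadowQueue` is a hypothetical class, not part of the `ShadowTester` shown earlier.

```python
import asyncio
from typing import Any


class DroppingShadowQueue:
    """Bounded queue that drops shadow work instead of blocking production."""

    def __init__(self, maxsize: int = 1000) -> None:
        self._queue: asyncio.Queue = asyncio.Queue(maxsize=maxsize)
        self.dropped = 0

    def offer(self, item: Any) -> bool:
        """Enqueue without waiting; return False (and count a drop) if full."""
        try:
            self._queue.put_nowait(item)
            return True
        except asyncio.QueueFull:
            self.dropped += 1
            return False

    async def take(self) -> Any:
        """Called by the shadow worker to fetch the next request."""
        return await self._queue.get()


async def demo() -> tuple:
    q = DroppingShadowQueue(maxsize=2)
    accepted = [q.offer(i) for i in range(3)]  # third offer overflows
    first = await q.take()
    return accepted, q.dropped, first


print(asyncio.run(demo()))  # ([True, True, False], 1, 0)
```

Exporting the `dropped` counter as a metric shows how much of the traffic the shadow model actually saw, which matters when interpreting the comparison logs.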

:::warning Not Measuring Business Metrics, Only Technical Metrics
A new model can have lower latency and lower error rate but still degrade business metrics (click-through rate, conversion, user satisfaction). A/B tests must measure the metrics that matter to the product. Pure technical dashboards (latency, throughput, error rate) tell you the model is serving correctly; business dashboards tell you if it is serving the right results. Always define business metric success criteria before starting an A/B test.
:::

Interview Q&A

Q1: You have 200 fine-tuned models to serve. GPU VRAM holds 15 at a time. How do you design the serving infrastructure?

A: The key insight is that not all models are equally active. Analyze traffic to find the distribution: typically, the top 10-20% of models handle 80%+ of requests (Pareto distribution). Design a tiered system: (1) Hot pool - the top 15 most-active models are pinned in GPU VRAM permanently, never evicted. (2) Warm pool - the next 50 models are loaded in host CPU DRAM. GPU transfer time is ~400ms for a 1GB model, acceptable for occasional requests. (3) Cold pool - remaining 135 models live in object storage (S3). First request triggers load from S3 (3-5 seconds), warmed into host DRAM, then GPU VRAM if traffic warrants. Use LRU eviction within each tier. If any model consistently moves from cold to warm, promote it proactively. Monitor per-model p99 latency to detect when a model needs promotion. For LLMs where models share a base architecture, use LoRA adapters - the base model is pinned (14GB) and adapters are loaded on demand (~50MB each, negligible overhead).
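The tier split in this answer can be sketched as a ranking by recent request counts. This is a simplified illustration: `assign_tiers` is a hypothetical helper, and a real system would also smooth counts over time to avoid thrashing between tiers.

```python
from typing import Dict, List


def assign_tiers(
    request_counts: Dict[str, int], hot_slots: int, warm_slots: int
) -> Dict[str, List[str]]:
    """Rank models by traffic and split them into hot / warm / cold tiers."""
    # Sort by descending traffic; break ties by name for a stable result
    ranked = sorted(request_counts, key=lambda m: (-request_counts[m], m))
    return {
        "hot": ranked[:hot_slots],                          # pinned in GPU VRAM
        "warm": ranked[hot_slots:hot_slots + warm_slots],   # host DRAM
        "cold": ranked[hot_slots + warm_slots:],            # object storage
    }


counts = {"a": 100, "b": 50, "c": 50, "d": 1}
print(assign_tiers(counts, hot_slots=1, warm_slots=2))
# {'hot': ['a'], 'warm': ['b', 'c'], 'cold': ['d']}
```

For the scenario in the question, `hot_slots=15` and `warm_slots=50` would leave the remaining 135 models in the cold tier.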

Q2: What is the difference between canary deployment and shadow mode for model rollouts?

A: Shadow mode and canary are complementary, not alternatives. Shadow mode runs the new model on real traffic in parallel with production, but returns only production results to users. It lets you measure the new model's accuracy, latency, and resource usage on real traffic with zero user risk. Use shadow mode first to validate basic correctness and performance. Canary deployment shifts a small percentage of real traffic to the new model - users in the canary group actually receive the new model's results. Canary measures real user impact: click rate, engagement, satisfaction. Use canary after shadow mode validates technical correctness. The typical order is: offline eval → shadow mode (0% user exposure) → canary 1% → canary 10% → canary 50% → 100% rollout. Automatic rollback triggers at each stage if error rate or latency thresholds are exceeded.

Q3: How do LoRA adapters enable efficient multi-tenant LLM serving?

A: LoRA (Low-Rank Adaptation) adds small trainable low-rank matrices to frozen base model layers. The adapter for a 7B model is typically 50-100MB vs 14GB for the full model in fp16. In multi-tenant serving, the base model weights are loaded once into GPU VRAM and shared across all customers. Each customer's adapter (their fine-tuning delta) is loaded separately and applied at inference time as a low-rank update added to the base layer's output, so the shared base weights are never modified. S-LoRA (2023) showed you can serve 2,000+ LoRA adapters on a few GPUs by storing adapters in CPU memory and fetching only the adapters needed by in-flight requests to the GPU. The GPU only needs the base model plus the currently active adapters. Memory: 14GB base + 0.1GB per active adapter vs 14GB per customer without LoRA. This is the key economic enabler for per-customer fine-tuned models at scale.

Q4: How do you implement consistent user assignment for A/B testing in a distributed serving system?

A: Use deterministic hashing. Compute hash(experiment_id + user_id) % 10000 to get a bucket in [0, 10000). If the bucket is less than treatment_percentage * 10000, the user is in treatment; otherwise control. MD5 or MurmurHash gives good distribution. This is stateless - any server can compute the same assignment for the same user without coordination. The experiment_id is included in the hash to ensure different experiments do not systematically overlap (a user in the treatment for experiment A is not systematically more likely to be in treatment for experiment B). No database lookup required; no session state. For production-scale systems (billions of users), this also means zero infrastructure for assignment - just a hash function call per request.
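The assignment described in this answer fits in a few lines. The sketch below uses MD5 purely as a mixing function (not for security); `assign_bucket` and `in_treatment` are hypothetical names, and the user IDs are synthetic.

```python
import hashlib


def assign_bucket(experiment_id: str, user_id: str, buckets: int = 10_000) -> int:
    """Map (experiment, user) to a stable bucket in [0, buckets)."""
    digest = hashlib.md5(f"{experiment_id}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % buckets


def in_treatment(experiment_id: str, user_id: str, treatment_pct: float) -> bool:
    """True when the user's bucket falls inside the treatment share."""
    return assign_bucket(experiment_id, user_id) < treatment_pct * 10_000


users = [f"user-{i}" for i in range(10_000)]
share = sum(in_treatment("exp-a", u, 0.10) for u in users) / len(users)
print(f"treatment share: {share:.3f}")  # close to 0.10, not exactly
```

Because the function has no state, any replica computes the same answer for the same user. Raising `treatment_pct` keeps existing treatment users in treatment, since their buckets stay below the new cutoff.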

Q5: What resource isolation mechanisms exist for multi-tenant GPU serving?

A: Three mechanisms at different granularities. Hardware isolation: NVIDIA MIG (Multi-Instance GPU) on A100/H100 partitions the GPU into isolated instances with dedicated VRAM, compute engines, and cache. Tenants cannot exceed their partition's resources regardless of load. Best for strong SLA guarantees but requires static capacity allocation. Software isolation: CUDA streams and priority queues allow scheduling priority between tenants but do not partition VRAM. A high-priority tenant's kernels are scheduled first but still compete for VRAM with low-priority tenants. Request-level isolation: rate limiting at the serving layer (token bucket, sliding window) caps per-tenant QPS, preventing one tenant from saturating the GPU. Combined with priority queues, high-SLA tenants get low latency and low-SLA tenants are rate-limited under load. In practice, most deployments use software isolation + request-level rate limiting, with MIG reserved for strict contractual SLA requirements.
