Multi-Model Serving
Fifty Fine-Tuned Models, One Team, One Budget
The platform team has fifty customer accounts. Each customer has a fine-tuned version of the base recommendation model, trained on their proprietary data. Each fine-tuned model is 850MB. Fifty models: 42.5GB of weights. The team has eight A100 GPUs. An A100 has 80GB of VRAM.
The naive approach: one GPU per customer. Problem: eight GPUs, fifty customers. The math does not work.
The next idea: keep all models on the CPU side and move each one to the GPU on demand. Problem: done naively, every cold request pays for the host-to-GPU copy of 850MB plus model load and initialization time, and you are at 3-5 seconds before serving the first prediction. Unacceptable.
The solution the team implements: model multiplexing with LRU eviction. The A100s collectively hold 12-15 models in VRAM at any time (the most recently requested ones). Models are organized by customer usage frequency: the top 12 customers (accounting for 80% of traffic) have their models pinned; the remaining 38 share the remaining GPU slots with LRU eviction. When a less-frequent customer's request arrives and their model is not loaded, it is fetched from the host DRAM cache (not disk) in ~400ms. Customer experience: fast for frequent users, acceptable for occasional ones.
This is multi-model serving. Not 50 services with 50 teams. One infrastructure, one team, fifty models, all with adequate resource isolation, metrics, and the ability to A/B test any of them independently.
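The numbers in this scenario can be checked with a few lines of arithmetic. The sketch below is illustrative only: the model size, the 80% pinned-traffic share, and the ~400ms reload cost come from the story above, while the `lru_hit_rate` argument is a hypothetical value you would measure in your own system.

```python
# capacity_sketch.py - back-of-the-envelope numbers for the scenario above
MODEL_SIZE_GB = 0.85          # 850MB per fine-tuned model
NUM_MODELS = 50
PINNED_TRAFFIC_SHARE = 0.80   # traffic served by the pinned models
RELOAD_MS = 400.0             # host DRAM -> GPU reload on a miss


def total_weights_gb(num_models: int, size_gb: float) -> float:
    """Total weight footprint if every model were resident at once."""
    return num_models * size_gb


def expected_reload_ms(pinned_share: float, lru_hit_rate: float, reload_ms: float) -> float:
    """Mean reload latency added per request, averaged over all traffic.

    Only non-pinned traffic can miss, and only the fraction that is not
    already sitting in an LRU slot pays the reload cost.
    """
    miss_fraction = (1.0 - pinned_share) * (1.0 - lru_hit_rate)
    return miss_fraction * reload_ms


if __name__ == "__main__":
    print(f"All weights: {total_weights_gb(NUM_MODELS, MODEL_SIZE_GB):.1f} GB")
    # lru_hit_rate=0.5 is an assumed value, not a measurement
    added = expected_reload_ms(PINNED_TRAFFIC_SHARE, 0.5, RELOAD_MS)
    print(f"Mean added latency per request: {added:.0f} ms")
```

The mean hides the shape of the distribution: most requests pay nothing and a small fraction pay the full reload, which is why per-model tail latency is the number to watch.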
Why This Exists - The Economics of Model Proliferation
Early ML deployments had one or two models. One recommendation model, one spam filter. One GPU cluster per model. Life was simple.
Modern ML products have model proliferation at every level:
- Fine-tuned variants: the same base model fine-tuned per customer, per language, per domain
- A/B test variants: new model version vs old, for gradual rollout
- Ensemble components: 3-5 models whose predictions are averaged for a final output
- Pipeline stages: embedding model → retrieval model → reranking model → generation model
- Feature-specific models: different models for different input types (text, image, tabular)
The infrastructure question is not "how do I serve one model" but "how do I serve a hundred models with the same reliability, monitoring, and update lifecycle, using hardware efficiently."
Historical Context
Single-model serving was the norm until roughly 2018. As companies built larger ML teams and more diverse product features, model counts grew. Google's internal TensorFlow Serving (open-sourced in 2016) was one of the first systems designed for multi-model serving: it could load and serve multiple models from the same server process, with separate endpoints per model.
NVIDIA's Triton Inference Server (first released in 2018 as TensorRT Inference Server and later renamed) went further: model repositories on shared storage, dynamic model loading and unloading, concurrent model execution on the same GPU, and per-model version policies. Triton popularized the model repository abstraction that most production serving systems now use.
The serving-at-scale challenge intensified with LLMs. A single 7B-parameter model occupies about 14GB of VRAM in FP16. Running 50 fully fine-tuned variants would require 700GB for the weights alone, roughly nine or ten 80GB A100s. LoRA fine-tuning leaves the base model weights frozen, so every variant shares them: you can serve 50 variants by loading the base model once (14GB) and swapping only the small adapters (10-100MB each) on demand. S-LoRA (Sheng et al., 2023) showed that this approach scales to thousands of adapters, which made multi-model LLM serving economically viable.
Model Multiplexing Architecture
Model multiplexing serves multiple models from shared serving infrastructure, routing requests to the appropriate model based on the request metadata.
# model_multiplexer.py - GPU model pool with LRU eviction
import threading
import time
from collections import OrderedDict
from typing import Callable, Dict, Set

import torch


class ModelPool:
    """
    Manages a bounded pool of models in GPU memory with LRU eviction.
    Thread-safe for concurrent serving.
    """

    def __init__(
        self,
        max_gpu_models: int,
        load_fn: Callable[[str], torch.nn.Module],
        device: str = "cuda:0",
    ):
        self.max_gpu_models = max_gpu_models
        self.load_fn = load_fn  # function: model_id -> loaded model
        self.device = device
        # LRU cache: least recently used first, most recently used last
        self.gpu_models: "OrderedDict[str, torch.nn.Module]" = OrderedDict()
        self.pinned_models: Set[str] = set()  # never evicted
        self.load_times: Dict[str, float] = {}
        self.lock = threading.Lock()
        # One Event per in-flight load; other threads wait on it
        self.loading_locks: Dict[str, threading.Event] = {}

    def pin(self, model_id: str):
        """Mark a model as pinned - never evicted regardless of LRU."""
        with self.lock:
            self.pinned_models.add(model_id)
            # Load it now if not loaded. The lock is held for the whole load,
            # so pin models at startup rather than while serving traffic.
            if model_id not in self.gpu_models:
                self._load_model_locked(model_id)

    def get_model(self, model_id: str) -> torch.nn.Module:
        """
        Get a model, loading it if necessary (with LRU eviction).
        Thread-safe: concurrent requests for the same model wait for
        the single loading event.
        """
        with self.lock:
            # Fast path: model already in GPU memory
            if model_id in self.gpu_models:
                self.gpu_models.move_to_end(model_id)  # mark most recently used
                return self.gpu_models[model_id]
            # Slow path: join an in-flight load, or become the loading thread
            event = self.loading_locks.get(model_id)
            is_loader = event is None
            if is_loader:
                event = threading.Event()
                self.loading_locks[model_id] = event

        if not is_loader:
            # Wait for the loading thread
            event.wait(timeout=30.0)  # 30s timeout
            with self.lock:
                if model_id in self.gpu_models:
                    self.gpu_models.move_to_end(model_id)
                    return self.gpu_models[model_id]
            raise RuntimeError(f"Model {model_id} failed to load")

        # This thread loads the model, outside the lock so that requests
        # for already-loaded models are not blocked during the load.
        try:
            start = time.perf_counter()
            model = self.load_fn(model_id)
            model.to(self.device)
            model.eval()
            load_time_ms = (time.perf_counter() - start) * 1000
            with self.lock:
                self._evict_if_needed()
                self.gpu_models[model_id] = model
                self.load_times[model_id] = load_time_ms
            return model
        finally:
            # Wake waiters on success and on failure alike
            with self.lock:
                self.loading_locks.pop(model_id, None)
            event.set()

    def _evict_if_needed(self):
        """Evict the LRU non-pinned model if the pool is full. Called with lock held."""
        if len(self.gpu_models) < self.max_gpu_models:
            return
        # OrderedDict iterates oldest first: find the oldest non-pinned model
        for model_id in list(self.gpu_models.keys()):
            if model_id not in self.pinned_models:
                model = self.gpu_models.pop(model_id)
                # Drop the pool's reference and release cached blocks. The memory
                # is only freed once in-flight requests drop their references too.
                del model
                torch.cuda.empty_cache()
                print(f"Evicted {model_id} from GPU pool")
                return
        raise RuntimeError(
            "GPU pool full and all models are pinned - cannot evict"
        )

    def _load_model_locked(self, model_id: str):
        """Load model, assuming lock is held."""
        model = self.load_fn(model_id)
        model.to(self.device)
        model.eval()
        self._evict_if_needed()
        self.gpu_models[model_id] = model
        self.gpu_models.move_to_end(model_id)

    @property
    def stats(self) -> dict:
        with self.lock:
            return {
                "loaded_models": list(self.gpu_models.keys()),
                "pinned_models": list(self.pinned_models),
                "pool_utilization": len(self.gpu_models) / self.max_gpu_models,
                "avg_load_time_ms": (
                    sum(self.load_times.values()) / len(self.load_times)
                    if self.load_times else 0
                ),
            }
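To see the eviction order without a GPU, the following stripped-down sketch keeps only the LRU and pinning logic. `TinyLRUPool` and `demo` are hypothetical names for illustration; the "weights" are placeholder strings, and there is no locking, so this is not a substitute for the thread-safe pool.

```python
from collections import OrderedDict
from typing import List, Set


class TinyLRUPool:
    """Bounded pool of model ids with pinning (no real models, no GPU)."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.slots: "OrderedDict[str, str]" = OrderedDict()  # oldest first
        self.pinned: Set[str] = set()
        self.evicted: List[str] = []

    def pin(self, model_id: str) -> None:
        self.pinned.add(model_id)
        self.get(model_id)

    def get(self, model_id: str) -> str:
        if model_id in self.slots:
            self.slots.move_to_end(model_id)  # mark as most recently used
            return self.slots[model_id]
        if len(self.slots) >= self.capacity:
            self._evict_one()
        self.slots[model_id] = f"weights-of-{model_id}"  # stand-in for a real load
        return self.slots[model_id]

    def _evict_one(self) -> None:
        # Iterate over a copy so the dict can be modified safely
        for candidate in list(self.slots):
            if candidate not in self.pinned:
                del self.slots[candidate]
                self.evicted.append(candidate)
                return
        raise RuntimeError("pool full and every model is pinned")


def demo() -> TinyLRUPool:
    pool = TinyLRUPool(capacity=3)
    pool.pin("a")      # pinned, never evicted
    pool.get("b")
    pool.get("c")
    pool.get("b")      # "b" is now more recent than "c"
    pool.get("d")      # pool full: evicts "c", the oldest non-pinned model
    pool.get("e")      # evicts "b"
    return pool
```

After `demo()`, the pool holds `a`, `d`, and `e`, and the eviction log is `["c", "b"]`: the pinned model survives even though it is the oldest entry.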
A/B Testing Infrastructure
A/B testing for ML models requires traffic splitting, consistent user assignment, and metric collection per variant.
# ab_testing.py - model A/B testing router
import asyncio
import hashlib
import random
import time
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Awaitable, Callable, Dict, List

from model_multiplexer import ModelPool  # the pool defined in the previous section


class TrafficSplit(Enum):
    CONTROL = "control"
    TREATMENT = "treatment"


@dataclass
class ExperimentConfig:
    experiment_id: str
    control_model: str    # model_id for control (e.g., "model_v1.2")
    treatment_model: str  # model_id for treatment (e.g., "model_v1.3")
    treatment_percentage: float = 0.10  # 10% in treatment
    sticky: bool = True   # same user always gets same variant
    enabled: bool = True
    metrics: List[str] = field(default_factory=lambda: ["latency", "error_rate"])


class ABTestRouter:
    """
    Routes requests to A or B model based on experiment config.
    Uses deterministic hashing for consistent user assignment.
    """

    def __init__(self, model_pool: ModelPool):
        self.model_pool = model_pool
        self.experiments: Dict[str, ExperimentConfig] = {}
        self.metrics: Dict[str, Dict] = {}

    def register_experiment(self, config: ExperimentConfig):
        self.experiments[config.experiment_id] = config
        self.metrics[config.experiment_id] = {
            "control": {"requests": 0, "errors": 0, "total_latency_ms": 0},
            "treatment": {"requests": 0, "errors": 0, "total_latency_ms": 0},
        }
        print(f"Experiment '{config.experiment_id}' registered: "
              f"{config.control_model} vs {config.treatment_model} "
              f"({config.treatment_percentage*100:.0f}% treatment)")

    def _assign_variant(
        self, experiment: ExperimentConfig, user_id: str
    ) -> TrafficSplit:
        """
        Deterministically assign user to control or treatment.
        Same user_id always gets the same variant for sticky experiments.
        """
        if not experiment.sticky:
            # Random assignment - not consistent per user
            return (TrafficSplit.TREATMENT
                    if random.random() < experiment.treatment_percentage
                    else TrafficSplit.CONTROL)
        # Hash-based assignment - consistent per user per experiment.
        # MD5 is used only for its even distribution, not for security.
        hash_input = f"{experiment.experiment_id}:{user_id}"
        hash_value = int(hashlib.md5(hash_input.encode()).hexdigest(), 16)
        bucket = (hash_value % 10000) / 10000.0  # [0, 1)
        return (TrafficSplit.TREATMENT
                if bucket < experiment.treatment_percentage
                else TrafficSplit.CONTROL)

    async def predict(
        self,
        experiment_id: str,
        user_id: str,
        input_data: Any,
        predict_fn: Callable[[Any, Any], Awaitable[dict]],
    ) -> dict:
        """Route request to appropriate model variant and collect metrics."""
        if experiment_id not in self.experiments:
            raise ValueError(f"Unknown experiment: {experiment_id}")
        experiment = self.experiments[experiment_id]
        if not experiment.enabled:
            # Experiment disabled - use control
            variant = TrafficSplit.CONTROL
        else:
            variant = self._assign_variant(experiment, user_id)
        model_id = (experiment.treatment_model
                    if variant == TrafficSplit.TREATMENT
                    else experiment.control_model)
        # get_model may block on a cold load, so keep it off the event loop
        loop = asyncio.get_running_loop()
        model = await loop.run_in_executor(None, self.model_pool.get_model, model_id)
        metrics = self.metrics[experiment_id][variant.value]
        start = time.perf_counter()
        try:
            result = await predict_fn(model, input_data)  # must return a dict
        except Exception:
            metrics["errors"] += 1
            raise
        finally:
            # Runs on success and on failure: every request is counted
            latency_ms = (time.perf_counter() - start) * 1000
            metrics["requests"] += 1
            metrics["total_latency_ms"] += latency_ms
        return {
            **result,
            "_experiment": experiment_id,
            "_variant": variant.value,
            "_model": model_id,
        }

    def get_experiment_stats(self, experiment_id: str) -> dict:
        """Return per-variant metrics for experiment analysis."""
        metrics = self.metrics.get(experiment_id, {})
        stats = {}
        for variant, m in metrics.items():
            n = m["requests"]
            stats[variant] = {
                "requests": n,
                "error_rate": m["errors"] / n if n > 0 else 0,
                "avg_latency_ms": m["total_latency_ms"] / n if n > 0 else 0,
            }
        return stats
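Before trusting a hash-based split, it is worth checking two properties on synthetic ids: assignment is stable for a given user, and the realized treatment share is close to the configured percentage. The sketch below repeats the bucketing scheme from the router as a standalone function; `assign_variant` and `treatment_share` are hypothetical helper names, and the `user-{i}` ids are made up for the check.

```python
import hashlib


def assign_variant(experiment_id: str, user_id: str, treatment_pct: float) -> str:
    """Map (experiment, user) to a bucket in [0, 1) and compare to the split."""
    digest = hashlib.md5(f"{experiment_id}:{user_id}".encode()).hexdigest()
    bucket = (int(digest, 16) % 10000) / 10000.0
    return "treatment" if bucket < treatment_pct else "control"


def treatment_share(experiment_id: str, num_users: int, treatment_pct: float) -> float:
    """Fraction of synthetic users that land in treatment."""
    hits = sum(
        assign_variant(experiment_id, f"user-{i}", treatment_pct) == "treatment"
        for i in range(num_users)
    )
    return hits / num_users


if __name__ == "__main__":
    # Same inputs always produce the same variant
    first = assign_variant("ranker_v2", "user-42", 0.10)
    assert first == assign_variant("ranker_v2", "user-42", 0.10)
    # Realized share should sit close to the configured 10%
    print(f"treatment share: {treatment_share('ranker_v2', 50_000, 0.10):.4f}")
```

Because the experiment id is part of the hash input, the same user can fall into different buckets in different experiments, which is what keeps experiments from overlapping systematically.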
Shadow Mode Testing
Shadow mode runs a new model in parallel with the production model without returning the new model's results to users. This lets you measure real-traffic behavior before any user-visible rollout.
# shadow_testing.py - shadow mode for ML model validation
import asyncio
import random
import time
from typing import Any, Set


class ShadowTester:
    """
    Runs shadow model in parallel. Production results returned to users.
    Shadow results logged for offline analysis.
    """

    def __init__(
        self,
        production_model,
        shadow_model,
        shadow_ratio: float = 1.0,  # fraction of requests to shadow (1.0 = all)
        log_writer=None,
        max_in_flight: int = 1000,  # drop shadow work beyond this backlog
    ):
        self.production = production_model
        self.shadow = shadow_model
        self.shadow_ratio = shadow_ratio
        self.log_writer = log_writer
        self.max_in_flight = max_in_flight
        # Keep references to running tasks so they are not garbage-collected
        self._tasks: Set[asyncio.Task] = set()

    async def predict(self, input_data: Any, request_id: str) -> dict:
        """
        Await production inference and return its result.
        Shadow inference runs as a background task.
        """
        # Always run production model
        prod_start = time.perf_counter()
        prod_result = await self._run_inference(self.production, input_data)
        prod_latency_ms = (time.perf_counter() - prod_start) * 1000
        # Shadow: fire and forget (non-blocking); skipped when the backlog is full
        if (random.random() < self.shadow_ratio
                and len(self._tasks) < self.max_in_flight):
            task = asyncio.create_task(
                self._shadow_inference(input_data, request_id, prod_result, prod_latency_ms)
            )
            self._tasks.add(task)
            task.add_done_callback(self._tasks.discard)
        return prod_result

    async def _shadow_inference(
        self,
        input_data: Any,
        request_id: str,
        prod_result: dict,
        prod_latency_ms: float,
    ):
        """Run shadow inference and log comparison. Does not affect user response."""
        try:
            shadow_start = time.perf_counter()
            shadow_result = await self._run_inference(self.shadow, input_data)
            shadow_latency_ms = (time.perf_counter() - shadow_start) * 1000
            # Compare outputs
            comparison = {
                "request_id": request_id,
                "timestamp": time.time(),
                "production": {
                    "result": prod_result,
                    "latency_ms": prod_latency_ms,
                },
                "shadow": {
                    "result": shadow_result,
                    "latency_ms": shadow_latency_ms,
                },
                # Exact comparison: floating-point scores rarely match bit for bit
                "diverged": prod_result != shadow_result,
                "latency_delta_ms": shadow_latency_ms - prod_latency_ms,
            }
            if self.log_writer:
                await self.log_writer.write(comparison)
        except Exception as e:
            # Shadow failures never propagate to users
            print(f"Shadow inference failed for {request_id}: {e}")

    async def _run_inference(self, model, input_data: Any) -> dict:
        """Run blocking model.predict in the default thread pool."""
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(None, model.predict, input_data)
Canary Deployments for ML Models
Canary deployment gradually shifts traffic from the old model to the new model, monitoring error rates and business metrics at each step.
# canary_deployment.py - gradual traffic migration with automatic rollback
import asyncio
import time
from typing import Optional, Sequence

from ab_testing import ABTestRouter, ExperimentConfig

DEFAULT_STAGES = (0.01, 0.05, 0.10, 0.25, 0.50, 1.0)


class CanaryController:
    """
    Controls canary deployment of a new model version.
    Automatically rolls back if error rate or latency degrades.
    """

    def __init__(
        self,
        current_model_id: str,
        canary_model_id: str,
        router: ABTestRouter,
        rollback_error_rate_threshold: float = 0.05,  # 5% errors → rollback
        # The router only tracks mean latency, so this is a threshold on the
        # average. Track a percentile separately if your SLO is on p99.
        rollback_avg_latency_ms: float = 500.0,       # >500ms average → rollback
    ):
        self.current = current_model_id
        self.canary = canary_model_id
        self.router = router
        self.error_threshold = rollback_error_rate_threshold
        self.latency_threshold = rollback_avg_latency_ms
        # Register experiment with 0% canary initially
        self.experiment_id = f"canary_{canary_model_id}"
        self.router.register_experiment(ExperimentConfig(
            experiment_id=self.experiment_id,
            control_model=current_model_id,
            treatment_model=canary_model_id,
            treatment_percentage=0.0,  # start at 0%
            enabled=True,
        ))

    async def run_canary_rollout(
        self,
        stages: Optional[Sequence[float]] = None,  # avoids a mutable default argument
        stage_duration_minutes: float = 30.0,
        check_interval_seconds: float = 60.0,
    ):
        """
        Run gradual canary rollout through traffic percentage stages.
        Monitors health at each stage and rolls back if thresholds exceeded.
        """
        if stages is None:
            stages = DEFAULT_STAGES
        for target_pct in stages:
            print(f"\nCanary stage: {target_pct*100:.0f}% → {self.canary}")
            await self._set_canary_percentage(target_pct)
            stage_end = time.time() + stage_duration_minutes * 60
            while time.time() < stage_end:
                await asyncio.sleep(check_interval_seconds)
                # Stats are cumulative since the experiment was registered
                stats = self.router.get_experiment_stats(self.experiment_id)
                canary_stats = stats.get("treatment", {})
                error_rate = canary_stats.get("error_rate", 0)
                avg_latency = canary_stats.get("avg_latency_ms", 0)
                print(f"  Canary: errors={error_rate:.3f}, "
                      f"latency={avg_latency:.1f}ms "
                      f"({canary_stats.get('requests', 0)} requests)")
                if error_rate > self.error_threshold:
                    print(f"  ROLLBACK: error rate {error_rate:.3f} > {self.error_threshold}")
                    await self._rollback()
                    return
                if avg_latency > self.latency_threshold:
                    print(f"  ROLLBACK: latency {avg_latency:.1f}ms > {self.latency_threshold}ms")
                    await self._rollback()
                    return
            print(f"  Stage {target_pct*100:.0f}% healthy - advancing")
        print(f"\nCanary complete - {self.canary} is now serving 100% of traffic")
        await self._complete_rollout()

    async def _set_canary_percentage(self, pct: float):
        exp = self.router.experiments[self.experiment_id]
        exp.treatment_percentage = pct

    async def _rollback(self):
        print(f"Rolling back to {self.current}")
        await self._set_canary_percentage(0.0)
        exp = self.router.experiments[self.experiment_id]
        exp.enabled = False

    async def _complete_rollout(self):
        """Update production model to the canary version."""
        exp = self.router.experiments[self.experiment_id]
        exp.control_model = self.canary
        await self._set_canary_percentage(0.0)
        exp.enabled = False
        print(f"Production model updated to {self.canary}")
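The router's metrics only support a check on mean latency, and a mean can look healthy while the tail is not. A rollback rule based on p99 needs the individual samples. The sketch below is one simple way to keep them, using a bounded window and the nearest-rank percentile definition; `LatencyWindow` is a hypothetical helper, not part of the router above.

```python
import math
from collections import deque
from typing import Deque


class LatencyWindow:
    """Keeps the most recent latencies and reports nearest-rank percentiles."""

    def __init__(self, max_samples: int = 10_000):
        # deque with maxlen silently drops the oldest sample when full
        self.samples: Deque[float] = deque(maxlen=max_samples)

    def record(self, latency_ms: float) -> None:
        self.samples.append(latency_ms)

    def mean(self) -> float:
        return sum(self.samples) / len(self.samples) if self.samples else 0.0

    def percentile(self, p: int) -> float:
        """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
        if not self.samples:
            return 0.0
        ordered = sorted(self.samples)
        rank = max(1, math.ceil(p * len(ordered) / 100))
        return ordered[rank - 1]


if __name__ == "__main__":
    window = LatencyWindow()
    for _ in range(98):
        window.record(10.0)      # fast requests
    for _ in range(2):
        window.record(2000.0)    # two slow outliers
    print(f"mean={window.mean():.1f}ms p99={window.percentile(99):.0f}ms")
```

In this example the mean stays well under a 500ms threshold while the p99 is far above it. Sorting on every call is fine for a periodic health check; a streaming quantile sketch is a better fit if the percentile is read on every request.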
LoRA-Based Multi-Model Serving for LLMs
LoRA (Low-Rank Adaptation) fine-tunes only a small set of adapter weights added to the frozen base model. Multiple customers can have separate LoRA adapters that share the base model weights - enabling multi-tenant LLM serving without loading separate full models per customer.
# lora_serving.py - serving multiple LoRA adapters on shared base model
# Uses the peft and transformers libraries. This is a one-adapter-at-a-time
# sketch; it does not use vLLM's batched multi-LoRA serving.
import threading
from typing import Optional, Set

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer


class LoRAModelServer:
    """
    Serves multiple LoRA adapters on a shared base model.
    Base model loaded once; adapters swapped per-request.
    """

    def __init__(self, base_model_id: str, device: str = "cuda"):
        self.device = device
        self.tokenizer = AutoTokenizer.from_pretrained(base_model_id)
        print(f"Loading base model: {base_model_id}")
        self.base_model = AutoModelForCausalLM.from_pretrained(
            base_model_id,
            torch_dtype=torch.float16,
            device_map="auto",
        )
        self.base_model.eval()
        # One PeftModel wrapper holds every adapter by name. Wrapping the base
        # model again for each adapter would stack LoRA layers on top of each other.
        self.peft_model: Optional[PeftModel] = None
        self.loaded_adapters: Set[str] = set()
        # set_adapter changes shared state, so switch + generate must be serialized
        self._lock = threading.Lock()

    def load_adapter(self, adapter_id: str, adapter_path: str):
        """Load a LoRA adapter and cache it."""
        with self._lock:
            if adapter_id in self.loaded_adapters:
                return  # already loaded
            print(f"Loading adapter: {adapter_id}")
            # LoRA adapter weights are typically 10-100MB vs 14GB for base model
            if self.peft_model is None:
                self.peft_model = PeftModel.from_pretrained(
                    self.base_model,
                    adapter_path,
                    adapter_name=adapter_id,
                )
                self.peft_model.eval()
            else:
                self.peft_model.load_adapter(adapter_path, adapter_name=adapter_id)
            self.loaded_adapters.add(adapter_id)

    def predict(self, adapter_id: str, prompt: str, max_new_tokens: int = 200) -> str:
        """Run inference with the specified LoRA adapter."""
        if adapter_id not in self.loaded_adapters:
            raise ValueError(f"Adapter {adapter_id} not loaded")
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.device)
        with self._lock:
            # Set active adapter, then generate while holding the lock
            self.peft_model.set_adapter(adapter_id)
            with torch.no_grad():
                output = self.peft_model.generate(
                    **inputs,
                    max_new_tokens=max_new_tokens,
                    pad_token_id=self.tokenizer.eos_token_id,
                )
        # Strip the prompt tokens and decode only the generated continuation
        return self.tokenizer.decode(
            output[0][inputs["input_ids"].shape[1]:],
            skip_special_tokens=True,
        )


# Key economics:
# Base model (LLaMA-7B, FP16): 14GB VRAM (loaded once)
# LoRA adapter per customer: ~50MB
# 50 customers: 14GB + 50 * 0.05GB = 16.5GB total
# vs. 50 * 14GB = 700GB without LoRA sharing
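Where the "10-100MB per adapter" range comes from can be derived from the LoRA construction itself: each adapted weight of shape d_in x d_out gains two matrices with r * (d_in + d_out) parameters in total. The sketch below applies that to LLaMA-7B layer shapes (hidden size 4096, MLP size 11008, 32 layers). The rank of 16 and the two target-module sets are assumptions chosen for illustration, and real adapter files carry a little extra metadata.

```python
from typing import Dict, Tuple

Shape = Tuple[int, int]  # (d_in, d_out) of the adapted linear layer


def lora_param_count(layer_shapes: Dict[str, Shape], rank: int, num_layers: int) -> int:
    """Each adapted weight gains A (d_in x r) and B (r x d_out)."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in layer_shapes.values())
    return per_layer * num_layers


def adapter_size_mb(param_count: int, bytes_per_param: int = 2) -> float:
    """Size in MB at the given precision (2 bytes per parameter for FP16)."""
    return param_count * bytes_per_param / 1e6


# LLaMA-7B shapes
ATTN_QV: Dict[str, Shape] = {"q_proj": (4096, 4096), "v_proj": (4096, 4096)}
ALL_LINEAR: Dict[str, Shape] = {
    "q_proj": (4096, 4096), "k_proj": (4096, 4096),
    "v_proj": (4096, 4096), "o_proj": (4096, 4096),
    "gate_proj": (4096, 11008), "up_proj": (4096, 11008),
    "down_proj": (11008, 4096),
}

if __name__ == "__main__":
    for name, shapes in (("q/v only", ATTN_QV), ("all linear", ALL_LINEAR)):
        params = lora_param_count(shapes, rank=16, num_layers=32)
        print(f"{name}: {params:,} params, {adapter_size_mb(params):.1f} MB in FP16")
```

Under these assumptions, adapting only the query and value projections gives an adapter of roughly 17MB, and adapting every linear layer gives roughly 80MB, which brackets the range quoted above.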
Resource Isolation in Multi-Tenant GPU Serving
Multiple models on shared GPUs require resource isolation to prevent one model's traffic spike from starving others.
NVIDIA MIG (Multi-Instance GPU) on A100 divides the GPU into isolated partitions with dedicated memory and compute, giving hard resource guarantees between tenants.
For softer isolation (without MIG), use rate limiting at the request router level:
# rate_limiter.py - per-model QPS limiting for resource isolation
import asyncio
import time
from typing import Dict


class TokenBucketRateLimiter:
    """
    Per-model token bucket rate limiter.
    Prevents one model from consuming all GPU capacity.
    """

    def __init__(self, model_qps_limits: Dict[str, float], default_qps: float = 10.0):
        self.limits = dict(model_qps_limits)  # model_id -> max QPS
        self.default_qps = default_qps        # limit for unknown models
        # Buckets start full, so a model can burst up to one second of traffic
        self.tokens: Dict[str, float] = dict(model_qps_limits)
        self.last_refill: Dict[str, float] = {}
        self.lock = asyncio.Lock()

    async def acquire(self, model_id: str, timeout_ms: float = 100) -> bool:
        """
        Try to acquire a token for model_id, waiting up to timeout_ms.
        Returns True if request can proceed, False if rate-limited.
        """
        limit = self.limits.get(model_id, self.default_qps)
        if limit <= 0:
            return False
        deadline = time.monotonic() + timeout_ms / 1000
        while True:
            async with self.lock:
                # Refill tokens based on time elapsed
                now = time.monotonic()
                elapsed = now - self.last_refill.get(model_id, now)
                self.tokens[model_id] = min(
                    limit,
                    self.tokens.get(model_id, limit) + elapsed * limit,
                )
                self.last_refill[model_id] = now
                if self.tokens[model_id] >= 1:
                    self.tokens[model_id] -= 1
                    return True
                # Time until one full token is available
                wait_s = (1 - self.tokens[model_id]) / limit
            if now + wait_s > deadline:
                return False  # Rate limited - caller should return 429
            await asyncio.sleep(wait_s)  # sleep outside the lock
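The refill arithmetic is easier to follow, and to test, when the clock is injected instead of read from `time.monotonic()`. The following synchronous sketch shows the same token bucket idea for a single model; `SimpleTokenBucket`, `FakeClock`, and `demo` are hypothetical names for illustration.

```python
from typing import Callable, List


class FakeClock:
    """Manually advanced clock so the example is deterministic."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now

    def advance(self, seconds: float) -> None:
        self.now += seconds


class SimpleTokenBucket:
    def __init__(self, rate_per_s: float, capacity: float, clock: Callable[[], float]):
        self.rate = rate_per_s      # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.clock = clock
        self.tokens = capacity      # start full
        self.last = clock()

    def try_acquire(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def demo() -> List[bool]:
    clock = FakeClock()
    bucket = SimpleTokenBucket(rate_per_s=2.0, capacity=2.0, clock=clock)
    results = [bucket.try_acquire() for _ in range(3)]   # burst of 2, then empty
    clock.advance(0.5)                                   # 0.5s * 2/s = 1 token
    results += [bucket.try_acquire() for _ in range(2)]
    clock.advance(10.0)                                  # refill is capped at capacity
    results += [bucket.try_acquire() for _ in range(3)]
    return results
```

`demo()` returns `[True, True, False, True, False, True, True, False]`: the bucket allows a burst up to its capacity, then admits requests only as fast as tokens are refilled.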
Production Engineering Notes
Model Versioning and Registry
Use a model registry (MLflow, Weights & Biases, SageMaker Model Registry) as the authoritative source of model versions. The serving infrastructure should pull model metadata from the registry, not from configuration files that diverge across environments.
Key metadata to track per model version: training data fingerprint, validation metrics, resource requirements (GPU VRAM, CPU), serving SLA, rollout stage (shadow → canary → production).
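As a rough illustration of the metadata listed above, the record below captures one model version and its rollout stage. This is a sketch of a possible shape, not the schema of MLflow or any other registry; every field name here is an assumption.

```python
from dataclasses import dataclass, replace
from enum import Enum
from typing import Dict


class RolloutStage(Enum):
    SHADOW = "shadow"
    CANARY = "canary"
    PRODUCTION = "production"


# Allowed promotions: shadow -> canary -> production
_NEXT_STAGE = {
    RolloutStage.SHADOW: RolloutStage.CANARY,
    RolloutStage.CANARY: RolloutStage.PRODUCTION,
}


@dataclass(frozen=True)
class ModelVersionRecord:
    model_id: str
    version: str
    training_data_fingerprint: str        # e.g. a hash of the training set manifest
    validation_metrics: Dict[str, float]
    gpu_vram_mb: int
    cpu_cores: float
    sla_p99_ms: float
    stage: RolloutStage = RolloutStage.SHADOW

    def promoted(self) -> "ModelVersionRecord":
        """Return a copy at the next rollout stage; the original is unchanged."""
        if self.stage not in _NEXT_STAGE:
            raise ValueError(f"{self.model_id}:{self.version} is already in production")
        return replace(self, stage=_NEXT_STAGE[self.stage])


if __name__ == "__main__":
    record = ModelVersionRecord(
        model_id="customer-17-recsys",      # hypothetical identifiers and values
        version="1.3.0",
        training_data_fingerprint="sha256:<manifest-hash>",
        validation_metrics={"auc": 0.91},
        gpu_vram_mb=850,
        cpu_cores=2.0,
        sla_p99_ms=200.0,
    )
    print(record.promoted().stage.value)
```

Keeping the record immutable means a promotion creates a new entry rather than editing history, which makes it easier to audit which version served traffic at a given time.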
Health Checks Per Model
When serving multiple models, health checks must be per-model, not per-server. A server can be healthy (process running, port open) while one of its loaded models is corrupted or producing nonsense outputs. Run warm-up inference on each model at load time and expose per-model health at /health/{model_id}.
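A per-model health check can be sketched independently of any web framework: run one warm-up inference when a model is loaded, validate the output, and store the result where a `/health/{model_id}` handler can read it. `ModelHealthRegistry`, the `validate` callback, and the payload fields below are hypothetical; what counts as a valid output depends on the model.

```python
import time
from typing import Any, Callable, Dict


class ModelHealthRegistry:
    """Tracks warm-up results per model so health can be reported per model."""

    def __init__(self) -> None:
        self._status: Dict[str, dict] = {}

    def warm_up(
        self,
        model_id: str,
        predict_fn: Callable[[Any], Any],
        sample_input: Any,
        validate: Callable[[Any], bool],
    ) -> bool:
        """Run one inference on a known input and record whether it looks sane."""
        start = time.perf_counter()
        try:
            output = predict_fn(sample_input)
            healthy = bool(validate(output))
            detail = "ok" if healthy else "warm-up output failed validation"
        except Exception as exc:
            healthy = False
            detail = f"warm-up raised {type(exc).__name__}: {exc}"
        self._status[model_id] = {
            "healthy": healthy,
            "detail": detail,
            "warmup_ms": (time.perf_counter() - start) * 1000,
        }
        return healthy

    def health(self, model_id: str) -> dict:
        """Payload a /health/{model_id} handler could return."""
        return self._status.get(model_id, {"healthy": False, "detail": "model not loaded"})


if __name__ == "__main__":
    registry = ModelHealthRegistry()
    in_unit_range = lambda scores: all(0.0 <= s <= 1.0 for s in scores)
    registry.warm_up("good", lambda x: [0.2, 0.8], [1, 2, 3], in_unit_range)
    registry.warm_up("corrupt", lambda x: [float("nan")], [1, 2, 3], in_unit_range)
    print(registry.health("good")["healthy"], registry.health("corrupt")["healthy"])
```

The second model in the example loads and runs without raising, yet fails validation because its output is NaN. A process-level health check would not catch that case.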
Common Mistakes
:::danger GPU Memory Fragmentation with Model Cycling
Loading and unloading models from GPU memory repeatedly causes memory fragmentation. torch.cuda.empty_cache() releases unused cached blocks from PyTorch's caching allocator back to the driver, but it does not defragment memory that is still in use. After enough load/evict cycles, you may see "CUDA out of memory" even though nvidia-smi shows available memory, because no single contiguous free region is large enough. Mitigation: pre-allocate fixed-size memory buffers and load model weights into those buffers, or use a memory allocator designed for frequent allocation (NVIDIA's cnmem or RAPIDS rmm). In production, plan your model slot sizes to match model weight sizes and avoid very frequent cycling.
:::
:::danger A/B Testing Without Consistent User Assignment
Running A/B tests without sticky user assignment causes users to experience different model versions on different requests - creating inconsistent experiences and polluting your metrics. "The new model increased click rate by 15%" is meaningless if the same user saw both models randomly. Use hash-based assignment that maps user_id consistently to control or treatment for the duration of the experiment.
:::
:::warning Shadow Mode Consuming Production Resources
Shadow mode runs the shadow model on every request (or a fraction of them). At a 100% ratio this roughly doubles GPU work, and the shadow model's weights also take up VRAM. If the production model is already at 70% GPU utilization, full shadowing pushes demand to about 140% of capacity - causing request queueing, severe latency degradation, or OOM errors. Either run shadow mode at a reduced ratio (10-20%), dedicate separate GPU capacity to shadow inference, or use asynchronous shadow processing with a queue that drops requests when the queue is full.
:::
:::warning Not Measuring Business Metrics, Only Technical Metrics
A new model can have lower latency and lower error rate but still degrade business metrics (click-through rate, conversion, user satisfaction). A/B tests must measure the metrics that matter to the product. Pure technical dashboards (latency, throughput, error rate) tell you the model is serving correctly; business dashboards tell you if it is serving the right results. Always define business metric success criteria before starting an A/B test.
:::
Interview Q&A
Q1: You have 200 fine-tuned models to serve. GPU VRAM holds 15 at a time. How do you design the serving infrastructure?
A: The key insight is that not all models are equally active. Analyze traffic to find the distribution: typically, the top 10-20% of models handle 80%+ of requests (a Pareto-like distribution). Design a tiered system: (1) Hot pool - the most active models (for example, the top 12) are pinned in GPU VRAM and never evicted; the remaining GPU slots are managed with LRU eviction so that less active models can still be swapped in. (2) Warm pool - the next 50 or so models are held in host CPU DRAM. Moving a ~1GB model from host DRAM to the GPU takes on the order of 400ms, acceptable for occasional requests. (3) Cold pool - the remaining models live in object storage (S3). A first request triggers a load from S3 (3-5 seconds) into host DRAM, then into GPU VRAM if traffic warrants. Use LRU eviction within each tier. If a model is repeatedly pulled from cold to warm, promote it proactively. Monitor per-model p99 latency to detect when a model needs promotion. For LLMs where the variants share a base model, use LoRA adapters - the base model is pinned (14GB) and adapters are loaded on demand (~50MB each, negligible overhead).
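The tier assignment described in this answer can be sketched as a ranking over recent request counts. The function names, tier sizes, and sample counts below are hypothetical; a real system would recompute this periodically from traffic logs and add hysteresis so that models do not bounce between tiers.

```python
from typing import Dict, List, Sequence


def assign_tiers(
    request_counts: Dict[str, int], hot_slots: int, warm_slots: int
) -> Dict[str, List[str]]:
    """Rank models by traffic and split them into hot, warm, and cold tiers."""
    # Sort by descending request count; break ties by model id for stable output
    ranked = sorted(request_counts, key=lambda m: (-request_counts[m], m))
    return {
        "hot": ranked[:hot_slots],                          # pinned in GPU VRAM
        "warm": ranked[hot_slots:hot_slots + warm_slots],   # held in host DRAM
        "cold": ranked[hot_slots + warm_slots:],            # object storage only
    }


def traffic_share(request_counts: Dict[str, int], models: Sequence[str]) -> float:
    """Fraction of all requests that go to the given models."""
    total = sum(request_counts.values())
    return sum(request_counts[m] for m in models) / total if total else 0.0


if __name__ == "__main__":
    counts = {"a": 500, "b": 300, "c": 100, "d": 60, "e": 40}  # made-up traffic
    tiers = assign_tiers(counts, hot_slots=2, warm_slots=2)
    print(tiers)
    print(f"hot tier serves {traffic_share(counts, tiers['hot']):.0%} of requests")
```

Checking the hot tier's traffic share is a quick way to validate the Pareto assumption: if the pinned models cover far less than expected, the pool will spend most of its time on reloads.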
Q2: What is the difference between canary deployment and shadow mode for model rollouts?
A: Shadow mode and canary are complementary, not alternatives. Shadow mode runs the new model on real traffic in parallel with production, but returns only production results to users. It lets you measure the new model's accuracy, latency, and resource usage on real traffic with zero user risk. Use shadow mode first to validate basic correctness and performance. Canary deployment shifts a small percentage of real traffic to the new model - users in the canary group actually receive the new model's results. Canary measures real user impact: click rate, engagement, satisfaction. Use canary after shadow mode validates technical correctness. The typical order is: offline eval → shadow mode (0% user exposure) → canary 1% → canary 10% → canary 50% → 100% rollout. Automatic rollback triggers at each stage if error rate or latency thresholds are exceeded.
Q3: How do LoRA adapters enable efficient multi-tenant LLM serving?
A: LoRA (Low-Rank Adaptation) adds small trainable low-rank matrices to frozen base model layers. The adapter for a 7B model is typically tens of megabytes (10-100MB) vs 14GB for the full model. In multi-tenant serving, the base model weights are loaded once into GPU VRAM and shared across all customers. Each customer's adapter (their fine-tuning delta) is loaded separately and applied on top of the base weights at inference time: an adapted layer computes the base output plus the low-rank update, Wx + BAx. Merging BA into W is an option when a replica serves a single adapter, but multi-tenant servers keep adapters separate so that one batch can mix requests for different customers. S-LoRA (2023) showed you can serve thousands of LoRA adapters on a small number of GPUs by keeping adapters in CPU memory and fetching only those needed by the requests currently running. The GPU only needs the base model plus the currently active adapters. Memory: 14GB base + 0.1GB per active adapter vs 14GB per customer without LoRA. This is the key economic enabler for per-customer fine-tuned models at scale.
Q4: How do you implement consistent user assignment for A/B testing in a distributed serving system?
A: Use deterministic hashing. Compute hash(experiment_id + user_id) % 10000 to get a bucket in [0, 10000). If the bucket is less than treatment_percentage * 10000, the user is in treatment; otherwise control. MD5 or MurmurHash gives good distribution. This is stateless - any server can compute the same assignment for the same user without coordination. The experiment_id is included in the hash to ensure different experiments do not systematically overlap (a user in the treatment for experiment A is not systematically more likely to be in treatment for experiment B). No database lookup required; no session state. For production-scale systems (billions of users), this also means zero infrastructure for assignment - just a hash function call per request.
Q5: What resource isolation mechanisms exist for multi-tenant GPU serving?
A: Three mechanisms at different granularities. Hardware isolation: NVIDIA MIG (Multi-Instance GPU) on A100/H100 partitions the GPU into isolated instances with dedicated VRAM, compute engines, and cache. Tenants cannot exceed their partition's resources regardless of load. Best for strong SLA guarantees, but it requires static capacity allocation. Software isolation: CUDA streams and priority queues allow scheduling priority between tenants but do not partition VRAM. A high-priority tenant's kernels are scheduled first but still compete for VRAM with low-priority tenants. Request-level isolation: rate limiting at the serving layer (token bucket, sliding window) caps per-tenant QPS, preventing one tenant from saturating the GPU. Combined with priority queues, high-SLA tenants get low latency and low-SLA tenants are rate-limited under load. In practice, most deployments use software isolation + request-level rate limiting, with MIG reserved for strict contractual SLA requirements.
