Multi-Model Serving Architecture
The Economics of Serving Nine Models
The machine learning platform team at a mid-sized SaaS company has a problem. Over the past 18 months, their product team shipped features that require AI: a coding assistant using a code-specialized model, a customer support chatbot fine-tuned on their product docs, a document summarization pipeline using a long-context model, a semantic search system requiring an embedding model, a content moderation classifier, and a real-time intent classifier for routing user messages. Plus three internal tools: a Slack bot, a code review assistant, and a SQL generation helper.
Nine distinct models. Each running on dedicated GPU instances. Each with its own deployment pipeline. Each billed at full instance rates whether or not it is actively serving requests at 2 AM.
The cloud bill hits $47,000 per month. The infrastructure engineer opens a capacity utilization dashboard and discovers that the average GPU utilization across all nine deployments is 11%. The coding assistant sees heavy traffic from 9 AM to 7 PM but is nearly idle overnight. The document summarization pipeline gets bursts a few times per day. The content moderation classifier runs at high volume all the time but requires very little GPU memory. They are paying for 9 x full GPU instances and using the equivalent of about 1.
This is the default state for most organizations that have been running AI features for more than a year: a sprawl of single-model deployments, each provisioned for peak load, sitting mostly idle. The fix is not complex, but it requires thinking about model serving as infrastructure rather than a collection of one-off deployments. You need a multi-model serving layer: a shared infrastructure platform that hosts multiple models, routes requests to the right model, manages GPU memory efficiently, and scales each model independently based on actual demand.
Done right, this same team can serve all nine models on three well-utilized GPU instances, reduce their cloud bill to $12,000 per month, and add new model deployments in minutes rather than days. The engineering investment is one or two sprints of platform work. The payoff is immediate and compounds as the number of models grows.
This lesson covers how to design and build that platform.
Why This Exists - The Single-Model Deployment Trap
The single-model deployment pattern emerges naturally. The first model a team ships gets its own instance. That works fine. The second model gets its own instance. Also fine. By the fifth model, the pattern is established and nobody questions it. By the tenth model, you have a GPU cluster that costs more to run than the engineering team and is 89% idle.
The fundamental problem is that single-model deployments are provisioned for peak load, not average load. An LLM serving 100 requests per minute at peak needs N GPU replicas. But at 3 AM it might serve 2 requests per minute. If you keep N replicas running to handle peak, you are wasting (N-1) replicas overnight. If you scale down to 1 replica, cold start latency becomes a problem when traffic ramps up in the morning.
The second problem is that different models have very different resource profiles. A 7B parameter chat model needs 14GB of GPU memory. A 1.5B embedding model needs 3GB. A BERT-based classifier needs 1.5GB. Running each on a dedicated A100 (40GB) is like driving one passenger per car on a highway: legal, but wasteful.
The third problem is operational overhead. Nine separate model deployments means nine sets of deployment configs, nine monitoring dashboards, nine on-call runbooks. The cognitive load scales linearly with the number of models.
Multi-model serving addresses all three problems. A shared serving layer bin-packs multiple models onto shared GPU memory, dynamically loads and unloads models based on demand, routes requests through a unified API, and scales the entire platform rather than individual models. Operational overhead is concentrated in one place rather than scattered across nine separate stacks.
The trade-off is architectural complexity upfront in exchange for dramatically lower operational complexity and cost at scale.
Historical Context - From Dedicated Servers to Shared Inference
The earliest production ML serving systems (circa 2015-2017) were direct ports of software service thinking: one model per server, one service per deployment. TensorFlow Serving (released by Google in 2016) was among the first attempts to generalize this, providing a server that could load multiple model versions and expose them through a unified gRPC interface. But it was still fundamentally single-model-type: all models had to use the same serving framework and format.
The multi-model serving problem became acute around 2021-2022, when transformer models arrived in production at scale. Unlike traditional ML models (gradient boosted trees, CNNs), LLMs were large enough to dominate GPU memory - a 7B model alone could fill an A100. Serving multiple LLMs on shared hardware required new approaches to memory management.
NVIDIA's Multi-Instance GPU (MIG) feature, introduced with the A100 GPU in 2020, provided hardware-level isolation: a single physical GPU could be partitioned into up to 7 isolated GPU instances, each with its own memory, cache, and compute. For the first time, you could run multiple models on one GPU with true resource isolation. This was a significant "aha moment" for teams running many small-to-medium models.
On the software side, the development of LiteLLM (2023) was a turning point for multi-model routing. LiteLLM provided a unified OpenAI-compatible API that could proxy to any backend - local vLLM, Hugging Face TGI, OpenAI, Anthropic, or any other provider. For the first time, application code could be written against a single interface and the routing layer could transparently direct requests to different model backends based on cost, latency, or task type.
By 2024, frameworks like Ray Serve had added native multi-model support with independent autoscaling per model, and specialized multi-model servers like Triton Inference Server (NVIDIA) had matured to the point where a single Triton instance could serve dozens of models simultaneously with GPU memory pooling.
Core Concepts
The Multi-Model Taxonomy
Before building a multi-model architecture, you need to classify your models along two dimensions: request pattern and resource profile.
Request pattern:
- High-volume, low-latency: classifiers, embedding models, intent detection. Thousands of requests per minute, each completing in under 100ms.
- Medium-volume, medium-latency: chat models, completion models. Hundreds of requests per minute, each taking 500ms to 5 seconds.
- Low-volume, high-compute: long-context summarization, code generation, batch inference. Tens of requests per minute, each taking 10-60 seconds.
Resource profile:
- Memory-dominant: large LLMs (7B-70B). Their weights alone consume most of the GPU memory.
- Compute-dominant: embedding models, small classifiers. Low memory footprint but need GPU throughput.
- Burst-heavy: batch inference jobs. Quiet for long periods, then need full GPU for minutes at a time.
These classifications determine your bin-packing strategy. Memory-dominant models constrain how many can co-exist on a single GPU. Compute-dominant models can often share GPU time via time-slicing with minimal performance impact. Burst-heavy models should use autoscaling that can scale to zero and spin up on demand.
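One way to make the taxonomy operational is to encode it and derive the placement strategy from it. A minimal sketch - the enum names and the mapping are illustrative, not from any specific framework:

```python
from enum import Enum

class ResourceProfile(Enum):
    MEMORY_DOMINANT = "memory_dominant"    # 7B-70B LLMs: weights dominate GPU memory
    COMPUTE_DOMINANT = "compute_dominant"  # embeddings, classifiers: need throughput
    BURST_HEAVY = "burst_heavy"            # batch jobs with long quiet periods

def placement_strategy(profile: ResourceProfile) -> str:
    """Map a model's resource profile to a GPU placement approach."""
    return {
        ResourceProfile.MEMORY_DOMINANT: "bin-pack carefully; weights constrain co-residency",
        ResourceProfile.COMPUTE_DOMINANT: "share GPU time via time-slicing",
        ResourceProfile.BURST_HEAVY: "scale-to-zero autoscaling, load on demand",
    }[profile]
```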
GPU Sharing Strategies
There are three distinct strategies for sharing GPU resources across multiple models, each with different isolation guarantees and performance characteristics.
MIG (Multi-Instance GPU): Hardware-level partitioning. An A100 40GB can be split into up to 7 instances (e.g., 7x 5GB instances, or mixed configurations like 1x 20GB + 2x 10GB). Each MIG instance has isolated memory, cache, and compute engines - models running on different MIG instances cannot interfere with each other. MIG is ideal for strict isolation requirements (multi-tenant environments, models with different SLAs). The downside is inflexibility: MIG geometry can only be changed while the GPU is idle, so repartitioning requires draining every workload on the device.
MPS (Multi-Process Service): CUDA-level time-sharing. Multiple CUDA processes share the physical GPU, multiplexing access to compute. Models can run concurrently if memory fits. MPS offers better GPU utilization than dedicated partitions but no memory isolation - a memory error in one model can affect others. Good for internal workloads where models are trusted.
Time-slicing: OS-level context switching between GPU processes. The simplest approach - CUDA schedules work from multiple processes by time-slicing. Lowest overhead, lowest isolation. Fine for small models and batch workloads that are not latency-sensitive.
For production LLM serving, MIG is typically preferred for large models requiring isolation. Time-slicing works well for embedding models and classifiers that have low memory footprint and can tolerate slightly variable latency.
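As an illustration, carving an A100-40GB into the mixed layout above is a handful of nvidia-smi commands. This is a sketch - profile IDs vary by GPU model and driver, so always list them first:

```bash
# List the MIG profiles this GPU/driver supports, with their IDs
nvidia-smi mig -lgip

# Enable MIG mode on GPU 0 (requires the GPU to be drained; may need a reset)
sudo nvidia-smi -i 0 -mig 1

# Create 1x 3g.20gb + 2x 2g.10gb instances (profile IDs 9 and 14 on an A100-40GB);
# -C also creates the default compute instance inside each GPU instance
sudo nvidia-smi mig -i 0 -cgi 9,14,14 -C

# Verify: each MIG device now shows up with its own UUID
nvidia-smi -L
```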
Model Cascade Architecture
A model cascade routes requests through a series of increasingly capable (and expensive) models, short-circuiting to a cheaper model whenever it can handle the request with sufficient quality. The intuition is that most requests are simple, and a small model can handle them faster and cheaper than a large model. Only the genuinely hard requests escalate.
For example, a tiered architecture might look like:
- A 1.5B parameter model handles simple FAQ responses (sub-100ms, cheap)
- A 7B model handles medium-complexity requests (200-500ms, moderate cost)
- A 70B model handles complex reasoning, multi-step tasks (1-5s, expensive)
The routing logic is a confidence estimator: the small model generates a response and an associated confidence score. If confidence is above a threshold, return the response. If not, escalate to the next tier.
The mathematics of cascade cost reduction are compelling. If 70% of requests can be handled by the 1.5B model, 25% by the 7B, and only 5% reach the 70B, the average per-request compute cost is:

avg_cost = 0.70 × c_1.5B + 0.25 × c_7B + 0.05 × c_70B

Where c_7B is roughly $0.01/request and c_70B is $0.10/request, the 7B and 70B tiers together contribute 0.25 × $0.01 + 0.05 × $0.10 ≈ $0.0075/request, and the 1.5B tier adds only a fraction of a cent on top.

Versus routing all requests to the 70B model at $0.10/request, the cascade reduces cost by more than 90%. The catch is that the cascade requires a reliable confidence estimator, which is itself a non-trivial engineering problem.
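A minimal sketch of the escalation loop, using mean token log-probability as the confidence signal (both vLLM and the OpenAI API expose per-token logprobs). The tier aliases and threshold here are assumptions and would need calibration on labeled data:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:4000/v1", api_key="sk-proxy-key")

TIERS = ["tier-1.5b", "tier-7b", "tier-70b"]  # hypothetical model aliases
CONFIDENCE_THRESHOLD = -0.5  # mean token logprob; must be tuned per task

def cascade_completion(messages: list[dict]) -> str:
    """Try cheap tiers first; escalate when mean logprob signals low confidence."""
    for model in TIERS[:-1]:
        resp = client.chat.completions.create(
            model=model, messages=messages, logprobs=True
        )
        choice = resp.choices[0]
        token_logprobs = [t.logprob for t in (choice.logprobs.content or [])]
        if token_logprobs:
            mean_logprob = sum(token_logprobs) / len(token_logprobs)
            if mean_logprob > CONFIDENCE_THRESHOLD:
                return choice.message.content  # confident enough - short-circuit
    # Genuinely hard requests fall through to the most capable (and expensive) tier
    resp = client.chat.completions.create(model=TIERS[-1], messages=messages)
    return resp.choices[0].message.content
```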
Dynamic Model Loading and Offloading
In a multi-model serving system, not all models need to be loaded into GPU memory simultaneously. Models that have been idle for a period can be offloaded to CPU RAM or NVMe SSD, freeing GPU memory for actively serving models. When a request arrives for an offloaded model, it is loaded back to GPU before serving.
The latency of loading a model from storage back to GPU is the key operational parameter:
- GPU to CPU RAM (PCIe bandwidth ~32 GB/s): A 14GB model loads in ~0.44 seconds
- CPU RAM to GPU (same ~32 GB/s): Same 0.44 seconds
- NVMe SSD to CPU to GPU (~7 GB/s NVMe + ~32 GB/s PCIe): A 14GB model in ~2 seconds
- Cold load from S3/GCS (~1 Gbps): A 14GB model in ~112 seconds - unacceptable for interactive serving
The practical conclusion: models must always reside at minimum in CPU RAM for acceptable cold-start performance. NVMe is borderline acceptable for batch workloads. Cloud storage is only acceptable for very infrequent cold starts on batch pipelines.
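These figures are just model size divided by bandwidth; a quick sanity-check using the same assumed bandwidths:

```python
PCIE_GBPS = 32.0   # CPU RAM <-> GPU over PCIe 4.0 x16, approximate
NVME_GBPS = 7.0    # NVMe -> CPU RAM, approximate
S3_GBPS = 0.125    # ~1 Gbps network = 0.125 GB/s

def load_time_s(model_gb: float, tier: str) -> float:
    if tier == "cpu_ram":
        return model_gb / PCIE_GBPS
    if tier == "nvme":  # two hops: NVMe -> RAM -> GPU
        return model_gb / NVME_GBPS + model_gb / PCIE_GBPS
    if tier == "s3":
        return model_gb / S3_GBPS
    raise ValueError(tier)

for tier in ("cpu_ram", "nvme", "s3"):
    print(f"14GB model from {tier}: {load_time_s(14, tier):.1f}s")
# cpu_ram: 0.4s, nvme: 2.4s, s3: 112.0s
```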
A model loading strategy for a 40GB A100 serving 6 models:
- Always hot in GPU: top 2-3 models by request volume (fit within 40GB)
- Warm in CPU RAM: next 2-3 models (swap time ~0.5s, acceptable for medium-frequency requests)
- Cold on NVMe: rarely-used models (swap time ~2s, acceptable for batch)
LiteLLM as a Unified Proxy Layer
LiteLLM provides an OpenAI-compatible API that routes requests to any backend: local vLLM/TGI instances, Anthropic, OpenAI, Cohere, Bedrock, and more. For multi-model architectures, it serves as the unified routing and abstraction layer.
The key capabilities:
- Unified endpoint: all models are accessed via /v1/chat/completions regardless of backend
- Model aliasing: you define "model names" that map to specific backends and versions
- Load balancing: round-robin or weighted routing across multiple instances of the same model
- Fallback chains: if the primary backend is unavailable, automatically fall back to a secondary
- Cost tracking: per-request token costs tracked across all backends
- Rate limiting: per-model and per-user rate limits enforced at the proxy layer
For a team running 9 models, LiteLLM means application code calls one endpoint with a model name, and the platform team owns all routing logic in one config file. Adding a new model means updating the config, not changing application code.
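From the application side, the whole platform collapses to one OpenAI-compatible endpoint. A sketch - the proxy URL and key are placeholders:

```python
from openai import OpenAI

# Application code knows only the proxy URL and a model alias
client = OpenAI(base_url="http://litellm-proxy:4000/v1", api_key="sk-master-key")

resp = client.chat.completions.create(
    model="chat",  # alias resolved by the proxy's routing table, not by the app
    messages=[{"role": "user", "content": "Summarize our refund policy."}],
)
print(resp.choices[0].message.content)
```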
Code Examples
LiteLLM Proxy Configuration
# litellm_config.yaml
model_list:
# Coding assistant - routes to local Codestral via vLLM
- model_name: coding-assistant
litellm_params:
model: openai/codestral-22b
api_base: http://vllm-codestral:8000/v1
api_key: "none"
max_tokens: 4096
# Chat model - load balanced across two vLLM instances
- model_name: chat
litellm_params:
model: openai/llama-3-8b-instruct
api_base: http://vllm-chat-0:8000/v1
api_key: "none"
- model_name: chat
litellm_params:
model: openai/llama-3-8b-instruct
api_base: http://vllm-chat-1:8000/v1
api_key: "none"
# Embeddings - routes to TGI instance
- model_name: text-embedding
litellm_params:
model: openai/bge-m3
api_base: http://tgi-embed:8080/v1
api_key: "none"
# Fallback to OpenAI if all local instances are down
- model_name: chat-fallback
litellm_params:
model: gpt-4o-mini
api_key: os.environ/OPENAI_API_KEY
router_settings:
routing_strategy: least-busy
num_retries: 2
retry_after: 5
fallbacks:
- {"chat": ["chat-fallback"]}
litellm_settings:
# Track spend per model
success_callback: ["langfuse"]
failure_callback: ["langfuse"]
# Rate limits per model
rpm_limit: 1000 # global requests per minute
general_settings:
master_key: os.environ/LITELLM_MASTER_KEY
database_url: os.environ/DATABASE_URL
# Run LiteLLM proxy
docker run -d \
--name litellm-proxy \
-p 4000:4000 \
-v $(pwd)/litellm_config.yaml:/app/config.yaml \
-e OPENAI_API_KEY=$OPENAI_API_KEY \
-e LITELLM_MASTER_KEY=$LITELLM_MASTER_KEY \
ghcr.io/berriai/litellm:main-latest \
--config /app/config.yaml \
--port 4000 \
--detailed_debug
Model Router - Request Classification and Routing
from enum import Enum
from dataclasses import dataclass
from typing import Optional
import re
class ModelTask(Enum):
CODING = "coding"
CHAT = "chat"
EMBED = "embed"
SUMMARIZE = "summarize"
CLASSIFY = "classify"
@dataclass
class RoutingDecision:
task: ModelTask
model_name: str
priority: str # "high", "medium", "low"
estimated_tokens: int
reason: str
# Map from task to model configuration
MODEL_ROUTING_TABLE = {
ModelTask.CODING: {
"model": "coding-assistant",
"max_tokens": 4096,
"timeout": 60,
},
ModelTask.CHAT: {
"model": "chat",
"max_tokens": 2048,
"timeout": 30,
},
ModelTask.EMBED: {
"model": "text-embedding",
"max_tokens": 512,
"timeout": 5,
},
ModelTask.SUMMARIZE: {
"model": "summarizer",
"max_tokens": 1024,
"timeout": 45,
},
ModelTask.CLASSIFY: {
"model": "classifier",
"max_tokens": 64,
"timeout": 2,
},
}
class ModelRouter:
"""
Routes incoming requests to the appropriate model based on
request type, content analysis, and explicit model hints.
"""
# Code indicators - words and patterns that suggest coding tasks
CODE_PATTERNS = [
r"\b(def |class |import |from |return |async |await )\b",
r"```(python|javascript|typescript|sql|bash|java|go|rust)",
r"\b(function|variable|bug|error|syntax|compile|debug)\b",
r"\b(write me|implement|refactor|fix this code)\b",
]
EMBED_INDICATORS = ["embed", "embedding", "vector", "similarity", "semantic search"]
SUMMARIZE_INDICATORS = ["summarize", "summary", "tldr", "key points", "brief"]
CLASSIFY_INDICATORS = ["classify", "category", "label", "intent", "sentiment"]
def route(self, messages: list[dict], explicit_model: Optional[str] = None) -> RoutingDecision:
"""
Determine the best model for this request.
explicit_model overrides task detection if provided.
"""
if explicit_model and explicit_model in [m.value for m in ModelTask]:
task = ModelTask(explicit_model)
else:
task = self._classify_task(messages)
config = MODEL_ROUTING_TABLE[task]
estimated_tokens = self._estimate_tokens(messages)
priority = self._determine_priority(task, estimated_tokens)
return RoutingDecision(
task=task,
model_name=config["model"],
priority=priority,
estimated_tokens=estimated_tokens,
reason=f"Task detected: {task.value}",
)
def _classify_task(self, messages: list[dict]) -> ModelTask:
"""Lightweight task classifier using pattern matching."""
text = " ".join(
m.get("content", "") for m in messages
if isinstance(m.get("content"), str)
).lower()
# Check for embedding request (usually comes with specific field)
if any(ind in text for ind in self.EMBED_INDICATORS):
return ModelTask.EMBED
# Check for classification request
if any(ind in text for ind in self.CLASSIFY_INDICATORS):
return ModelTask.CLASSIFY
# Check for summarization request
if any(ind in text for ind in self.SUMMARIZE_INDICATORS):
return ModelTask.SUMMARIZE
# Check for code patterns
for pattern in self.CODE_PATTERNS:
if re.search(pattern, text, re.IGNORECASE):
return ModelTask.CODING
# Default to chat
return ModelTask.CHAT
def _estimate_tokens(self, messages: list[dict]) -> int:
"""Rough token estimate: ~4 chars per token."""
total_chars = sum(
len(m.get("content", "")) for m in messages
if isinstance(m.get("content"), str)
)
return total_chars // 4
def _determine_priority(self, task: ModelTask, estimated_tokens: int) -> str:
"""Priority affects queue position in high-load scenarios."""
if task in (ModelTask.CLASSIFY, ModelTask.EMBED):
return "high" # Fast, cheap, often blocking user actions
if estimated_tokens > 8000:
return "low" # Long context = batch territory
return "medium"
# Integration with LiteLLM client
from openai import OpenAI
litellm_client = OpenAI(
api_key="your-litellm-master-key",
base_url="http://localhost:4000/v1",
)
router = ModelRouter()
def serve_request(messages: list[dict], user_id: str) -> str:
decision = router.route(messages)
response = litellm_client.chat.completions.create(
model=decision.model_name,
messages=messages,
max_tokens=MODEL_ROUTING_TABLE[decision.task]["max_tokens"],
timeout=MODEL_ROUTING_TABLE[decision.task]["timeout"],
extra_headers={
"x-user-id": user_id,
"x-task-type": decision.task.value,
"x-priority": decision.priority,
}
)
return response.choices[0].message.content
Dynamic Model Loading Manager
import asyncio
import time
from collections import OrderedDict
from dataclasses import dataclass, field
from typing import Optional
import logging
logger = logging.getLogger(__name__)
@dataclass
class ModelState:
name: str
gpu_memory_gb: float
location: str # "gpu", "cpu_ram", "nvme", "cold"
last_used: float = field(default_factory=time.time)
load_count: int = 0
avg_load_time_s: float = 0.0
class ModelMemoryManager:
"""
Manages model placement across GPU / CPU RAM / NVMe.
Uses LRU eviction policy for GPU memory management.
"""
def __init__(self, gpu_capacity_gb: float = 40.0, cpu_ram_capacity_gb: float = 128.0):
self.gpu_capacity_gb = gpu_capacity_gb
self.cpu_ram_capacity_gb = cpu_ram_capacity_gb
# Ordered dicts maintain LRU ordering
self.gpu_models: OrderedDict[str, ModelState] = OrderedDict()
self.cpu_models: OrderedDict[str, ModelState] = OrderedDict()
self.gpu_used_gb: float = 0.0
self.cpu_used_gb: float = 0.0
def gpu_free_gb(self) -> float:
return self.gpu_capacity_gb - self.gpu_used_gb
def cpu_free_gb(self) -> float:
return self.cpu_ram_capacity_gb - self.cpu_used_gb
async def ensure_model_on_gpu(self, model_name: str) -> float:
"""
Ensures model is loaded to GPU. Returns load latency in seconds.
Evicts LRU models if necessary.
"""
# Already on GPU
if model_name in self.gpu_models:
self.gpu_models.move_to_end(model_name) # Mark as recently used
self.gpu_models[model_name].last_used = time.time()
return 0.0
model = self._get_model_state(model_name)
start = time.time()
# Evict from GPU if necessary
while self.gpu_free_gb() < model.gpu_memory_gb:
evicted = self._evict_from_gpu()
if evicted is None:
raise RuntimeError(f"Cannot free enough GPU memory for {model_name}")
        # Capture the source tier before loading - the load helpers mutate model.location
        source = model.location
        if source == "cpu_ram":
            await self._load_cpu_to_gpu(model)
        elif source == "nvme":
            await self._load_nvme_to_gpu(model)
        else:
            await self._load_cold_to_gpu(model)
elapsed = time.time() - start
model.avg_load_time_s = (
(model.avg_load_time_s * model.load_count + elapsed)
/ (model.load_count + 1)
)
model.load_count += 1
logger.info(f"Loaded {model_name} to GPU in {elapsed:.2f}s from {model.location}")
return elapsed
def _evict_from_gpu(self) -> Optional[str]:
"""Evict the least recently used model from GPU to CPU RAM."""
if not self.gpu_models:
return None
# LRU: first item in OrderedDict is least recently used
evict_name, evict_state = next(iter(self.gpu_models.items()))
logger.info(f"Evicting {evict_name} from GPU (last used {time.time() - evict_state.last_used:.0f}s ago)")
if self.cpu_free_gb() >= evict_state.gpu_memory_gb:
evict_state.location = "cpu_ram"
self.cpu_models[evict_name] = evict_state
self.cpu_used_gb += evict_state.gpu_memory_gb
else:
evict_state.location = "nvme"
logger.warning(f"CPU RAM full, evicting {evict_name} to NVMe")
del self.gpu_models[evict_name]
self.gpu_used_gb -= evict_state.gpu_memory_gb
return evict_name
async def _load_cpu_to_gpu(self, model: ModelState) -> None:
"""Simulate CPU RAM to GPU transfer (~32 GB/s PCIe bandwidth)."""
transfer_time = model.gpu_memory_gb / 32.0
await asyncio.sleep(transfer_time) # Simulated; real impl calls vLLM API
model.location = "gpu"
self.gpu_models[model.name] = model
self.gpu_used_gb += model.gpu_memory_gb
del self.cpu_models[model.name]
self.cpu_used_gb -= model.gpu_memory_gb
async def _load_nvme_to_gpu(self, model: ModelState) -> None:
"""NVMe to CPU to GPU - slower path."""
nvme_time = model.gpu_memory_gb / 7.0 # NVMe ~7 GB/s
pcie_time = model.gpu_memory_gb / 32.0 # PCIe ~32 GB/s
await asyncio.sleep(nvme_time + pcie_time)
model.location = "gpu"
self.gpu_models[model.name] = model
self.gpu_used_gb += model.gpu_memory_gb
async def _load_cold_to_gpu(self, model: ModelState) -> None:
"""Cold load from object storage - only for batch, never interactive."""
logger.warning(f"Cold loading {model.name} from storage - this will take minutes")
# Real implementation: dvc pull or s3 download
await asyncio.sleep(5) # Placeholder
model.location = "gpu"
self.gpu_models[model.name] = model
self.gpu_used_gb += model.gpu_memory_gb
def _get_model_state(self, model_name: str) -> ModelState:
"""Look up model state across all tiers."""
if model_name in self.gpu_models:
return self.gpu_models[model_name]
if model_name in self.cpu_models:
return self.cpu_models[model_name]
# Model registry lookup would go here
raise ValueError(f"Unknown model: {model_name}")
Ray Serve Multi-Model Deployment
import ray
from ray import serve
from typing import Optional
import httpx
ray.init()
serve.start()
@serve.deployment(
ray_actor_options={"num_gpus": 1},
autoscaling_config={
"min_replicas": 1,
"max_replicas": 4,
"target_num_ongoing_requests_per_replica": 10,
},
max_ongoing_requests=20,
)
class ChatModelDeployment:
def __init__(self):
# vLLM server is co-located on the same GPU in the Ray actor
import subprocess
self.process = subprocess.Popen([
"python", "-m", "vllm.entrypoints.openai.api_server",
"--model", "meta-llama/Llama-3-8B-Instruct",
"--port", "8001",
"--max-model-len", "4096",
])
        # Use the async client - a blocking httpx.Client inside an async handler
        # would stall the event loop. Production code should also poll vLLM's
        # /health endpoint before accepting traffic; startup takes tens of seconds.
        self.client = httpx.AsyncClient(base_url="http://localhost:8001", timeout=120.0)

    async def __call__(self, request):
        body = await request.json()
        resp = await self.client.post("/v1/chat/completions", json=body)
        return resp.json()
@serve.deployment(
ray_actor_options={"num_gpus": 0.25}, # Fractional GPU for small model
autoscaling_config={
"min_replicas": 2,
"max_replicas": 20,
"target_num_ongoing_requests_per_replica": 50,
},
)
class EmbeddingDeployment:
def __init__(self):
from sentence_transformers import SentenceTransformer
self.model = SentenceTransformer("BAAI/bge-m3", device="cuda")
async def __call__(self, request):
body = await request.json()
texts = body.get("input", [])
if isinstance(texts, str):
texts = [texts]
embeddings = self.model.encode(texts, normalize_embeddings=True)
return {
"data": [
{"embedding": emb.tolist(), "index": i}
for i, emb in enumerate(embeddings)
],
"model": "bge-m3",
}
# Bind and deploy both as named apps - without distinct names, the second
# serve.run() call would replace the first application.
chat_app = ChatModelDeployment.bind()
embed_app = EmbeddingDeployment.bind()
serve.run(chat_app, name="chat", route_prefix="/chat")
serve.run(embed_app, name="embeddings", route_prefix="/embeddings")
Per-Model Autoscaling with Kubernetes HPA
# hpa-chat-model.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: hpa-chat-model
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: vllm-chat
minReplicas: 1
maxReplicas: 8
metrics:
- type: External
external:
metric:
name: pending_requests_count
selector:
matchLabels:
model: "llama-3-8b-chat"
target:
type: AverageValue
averageValue: "10" # Scale up if >10 pending requests per replica
behavior:
scaleUp:
stabilizationWindowSeconds: 60
policies:
- type: Pods
value: 2
periodSeconds: 60
scaleDown:
stabilizationWindowSeconds: 300 # Wait 5 min before scaling down
policies:
- type: Pods
value: 1
periodSeconds: 120
---
# KEDA ScaledObject for scale-to-zero (coding model - used only during business hours)
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: coding-model-scaler
spec:
scaleTargetRef:
name: vllm-coding
minReplicaCount: 0 # Scale to zero when no traffic
maxReplicaCount: 4
cooldownPeriod: 300 # 5 min cooldown before scale-down
triggers:
- type: prometheus
metadata:
serverAddress: http://prometheus:9090
metricName: vllm_num_requests_waiting
threshold: "1"
query: |
sum(vllm_num_requests_waiting{model="codestral-22b"})
Cost Optimization - Bin-Packing Models on GPUs
from dataclasses import dataclass
@dataclass
class ModelRequirement:
name: str
gpu_memory_gb: float
min_replicas: int
peak_requests_per_sec: float
tokens_per_request: int
@dataclass
class GPUNode:
gpu_type: str
memory_gb: float
cost_per_hour: float
throughput_tokens_per_sec: float
def bin_pack_models(
models: list[ModelRequirement],
gpu: GPUNode,
overhead_fraction: float = 0.15, # Reserve 15% GPU memory for overhead
) -> dict:
"""
Greedy bin-packing: assign models to GPU instances to minimize cost.
Returns number of GPU instances needed and assignment.
"""
usable_memory = gpu.memory_gb * (1 - overhead_fraction)
# Sort models by memory requirement (largest first - First Fit Decreasing)
sorted_models = sorted(models, key=lambda m: m.gpu_memory_gb, reverse=True)
bins = [] # Each bin = one GPU instance
    for model in sorted_models:
        # A model larger than one GPU's usable memory can never be placed
        if model.gpu_memory_gb > usable_memory:
            raise ValueError(
                f"{model.name} needs {model.gpu_memory_gb}GB; one "
                f"{gpu.gpu_type} has only {usable_memory:.1f}GB usable"
            )
        # Find the first bin where this model fits
        placed = False
for gpu_bin in bins:
used = sum(m.gpu_memory_gb for m in gpu_bin)
if used + model.gpu_memory_gb <= usable_memory:
gpu_bin.append(model)
placed = True
break
if not placed:
bins.append([model]) # Open a new GPU instance
total_gpus = len(bins)
total_cost_per_hour = total_gpus * gpu.cost_per_hour
print(f"\nBin-packing result: {total_gpus} x {gpu.gpu_type}")
print(f"Estimated cost: ${total_cost_per_hour:.2f}/hour (${total_cost_per_hour * 24 * 30:.0f}/month)")
for i, gpu_bin in enumerate(bins):
used = sum(m.gpu_memory_gb for m in gpu_bin)
util = used / gpu.memory_gb * 100
print(f" GPU {i}: {', '.join(m.name for m in gpu_bin)} - {used:.1f}GB / {gpu.memory_gb}GB ({util:.0f}% utilized)")
return {"num_gpus": total_gpus, "assignment": bins, "monthly_cost": total_cost_per_hour * 24 * 30}
# Example: what is the minimum GPU count for this six-model fleet?
models = [
ModelRequirement("codestral-22b", gpu_memory_gb=44, min_replicas=1, peak_requests_per_sec=2, tokens_per_request=512),
ModelRequirement("llama-3-8b-chat", gpu_memory_gb=16, min_replicas=2, peak_requests_per_sec=20, tokens_per_request=256),
ModelRequirement("llama-3-8b-support", gpu_memory_gb=16, min_replicas=1, peak_requests_per_sec=10, tokens_per_request=512),
ModelRequirement("mistral-7b-summarizer", gpu_memory_gb=14, min_replicas=1, peak_requests_per_sec=3, tokens_per_request=1024),
ModelRequirement("bge-m3-embed", gpu_memory_gb=3, min_replicas=1, peak_requests_per_sec=100, tokens_per_request=128),
ModelRequirement("bert-classifier", gpu_memory_gb=1.5, min_replicas=1, peak_requests_per_sec=200, tokens_per_request=64),
]
a100_40gb = GPUNode(
gpu_type="A100-40GB",
memory_gb=40.0,
cost_per_hour=3.20,
throughput_tokens_per_sec=15000,
)
result = bin_pack_models(models, a100_40gb)
Production Engineering Notes
Model warm-up is not optional. When a vLLM or TGI instance starts, the first request is slow due to CUDA kernel compilation (CUDA graphs), KV cache initialization, and model weight loading into L2 cache. For a 7B model, the first request can take 10-30x longer than subsequent requests. Always implement warm-up: send a few synthetic requests immediately after a new instance comes online before routing real traffic to it. In Kubernetes, use a readiness probe that checks not just that the server is up but that it has processed a test request.
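A sketch of the warm-up gate, assuming an OpenAI-compatible server; the prompts and success criterion are illustrative:

```python
import httpx

WARMUP_PROMPTS = ["Hello", "def add(a, b):", "Summarize: the quick brown fox"]

def warm_up(base_url: str, model: str) -> bool:
    """Send synthetic requests so CUDA graphs compile before real traffic arrives."""
    ok = 0
    with httpx.Client(base_url=base_url, timeout=120.0) as client:
        for prompt in WARMUP_PROMPTS:
            r = client.post("/v1/chat/completions", json={
                "model": model,
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": 16,
            })
            ok += r.status_code == 200
    return ok == len(WARMUP_PROMPTS)  # gate the readiness probe on this
```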
Co-locating an embedding model with an LLM is the highest-value optimization most teams miss. Almost every LLM application also has a vector search component with an embedding model. The embedding model (BGE-M3, E5-large) typically requires 3-6GB of GPU memory. If you are running a 7B LLM on a 40GB A100, you have 20-26GB free - more than enough for the embedding model. Running both on the same GPU means zero network latency for embedding lookups. Most teams run them on separate instances, paying for network egress and extra GPU hours unnecessarily.
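One way to implement this with vLLM: cap the LLM's memory share so the embedding server fits on the same device. A sketch - the 0.70 figure is an assumption sized against a 40GB card:

```bash
# vLLM claims --gpu-memory-utilization of the card for weights + KV cache;
# capping it at 70% leaves ~12GB for a co-located embedding server.
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3-8B-Instruct \
  --gpu-memory-utilization 0.70 \
  --port 8001 &

# The embedding model (e.g., BGE-M3, ~3-6GB) runs as a separate process
# on the same GPU - zero network hops between generation and retrieval.
```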
Time-of-day autoscaling beats reactive autoscaling for predictable traffic patterns. If your coding assistant has zero traffic from midnight to 6 AM, do not wait for HPA to scale it down reactively. Use KEDA with a cron trigger to pre-scale to zero at 11 PM and pre-scale back up at 7 AM. Reactive autoscaling has a lag (typically 3-5 minutes) during which you pay for idle GPUs. Scheduled scaling eliminates this for predictable patterns.
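KEDA's cron scaler expresses this directly; a sketch, with schedule and replica counts as examples:

```yaml
# Replaces the reactive Prometheus trigger for workloads with known hours
triggers:
  - type: cron
    metadata:
      timezone: America/New_York
      start: 0 7 * * 1-5        # pre-scale up at 7 AM, Monday-Friday
      end: 0 23 * * 1-5         # scale to zero at 11 PM
      desiredReplicas: "2"
```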
Per-model GPU metrics are not exposed by default - you must configure them. vLLM exposes Prometheus metrics at /metrics but they are per-instance, not per-model. In a multi-model setup where multiple models share a proxy, you need the proxy layer (LiteLLM) to emit per-model cost and request metrics. Configure LiteLLM's success_callback to emit to your metrics backend (Prometheus, Datadog) with a model label so you can see per-model utilization, latency, and cost on separate dashboards.
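In LiteLLM config terms this is a callback setting; the sketch below assumes LiteLLM's Prometheus integration is available in your deployment:

```yaml
litellm_settings:
  success_callback: ["prometheus"]   # per-request counters labeled by model
  failure_callback: ["prometheus"]
# The proxy then serves Prometheus metrics (request count, latency, spend)
# with a model label, ready for per-model dashboards and alerts.
```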
Model cascade confidence estimation is harder than it looks. The cascade architecture is compelling in theory but requires a reliable confidence estimator. For chat models, there is no built-in confidence score - you have to infer it from log-probability of the response tokens, a separate classifier trained on your task, or a lightweight LLM-as-judge call. Log-probability is the most practical: vLLM exposes logprobs in its API response. Low mean log-probability on the response tokens correlates with model uncertainty and is a reasonable escalation signal.
Shared GPU time-slicing degrades tail latency. When two models share a GPU via CUDA MPS or time-slicing, P99 latency increases because one model can preempt the other's CUDA kernel. For latency-sensitive models (intent classifiers, real-time chat), MIG isolation is worth the additional memory overhead. Use time-slicing only for batch workloads or models where P99 latency is not customer-facing.
Keep model artifacts in a local registry, not just cloud storage. Every serving node should pull model weights from a local registry (Harbor, a local S3-compatible store like MinIO) rather than Hugging Face Hub or AWS S3 directly. This eliminates cold-start dependency on external services, reduces egress costs, and gives you control over which model versions are available. A common pattern: CI pipeline pushes fine-tuned weights to your internal registry, serving nodes pull only from there.
Common Mistakes
:::danger Running Each Model on a Dedicated Instance Without Utilization Analysis
This is the most expensive mistake in multi-model architectures. Nine models on nine dedicated A100 instances costs roughly $207,360/year in GPU time. If average utilization is 11%, you are wasting ~$184,000/year. Before provisioning any dedicated instance, measure actual utilization for at least two weeks. Then use bin-packing to determine the minimum number of shared instances that can handle your peak load. In most cases the answer is 2-4x fewer instances than dedicated single-model deployments.
:::
:::danger Sending Cold Traffic to a Just-Loaded Model
A model that was just loaded from CPU RAM or NVMe into GPU needs several "warm-up" requests before it runs at full speed. CUDA graphs need to be compiled, KV cache needs initialization, and weight data needs to be in L2 cache. If your model loading system swaps a model in and immediately routes real user requests to it, the first few requests will have abnormally high latency (10-30x normal). This creates a bad user experience and incorrect latency metrics. Always send 3-5 synthetic warm-up requests to a newly loaded model and only mark it ready after they complete successfully.
:::
:::warning Building a Model Router Without a Fallback Strategy
Your model router classifies requests and sends them to specific models. What happens when the target model is unavailable - either crashed, overloaded, or being updated? Without a fallback strategy, requests simply fail. A robust router has: (1) a fallback model for every model in the routing table (e.g., if the coding assistant is down, fall back to the general chat model), (2) circuit breaker logic that stops routing to a failing model before the queue fills up, (3) request queuing with a TTL so requests wait briefly for a model to become available rather than failing immediately. LiteLLM's fallbacks configuration handles case 1 automatically.
:::
:::warning Ignoring Memory Fragmentation in Long-Running Multi-Model Processes
GPU memory fragmentation is a real problem in long-running serving processes that load and unload models. When you load a 7B model, use it, offload it to CPU, load a 13B model, use it, then try to reload the 7B model, the GPU allocator may fail to allocate a contiguous 14GB block even if 14GB is technically free, because the memory is fragmented. PyTorch's CUDA allocator has a caching behavior that makes this worse over time. The mitigation: periodically restart model server processes (during low-traffic windows) to reset allocator state, use CUDA's torch.cuda.empty_cache() after model unloads, and monitor GPU memory fragmentation metrics (nvidia-smi reserved vs used vs free).
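A minimal unload hygiene sketch - the PyTorch calls shown are real APIs, though how much they defragment depends on allocator state:

```python
import gc
import torch

def reclaim_gpu_memory() -> tuple[float, float]:
    """Call after dropping all references to an unloaded model."""
    gc.collect()                 # collect cyclic refs still holding CUDA tensors
    torch.cuda.empty_cache()     # return cached allocator blocks to the driver
    reserved = torch.cuda.memory_reserved() / 1e9
    allocated = torch.cuda.memory_allocated() / 1e9
    # reserved far above allocated after unloads suggests fragmentation -
    # schedule a process restart in the next low-traffic window
    return reserved, allocated
```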
:::
:::warning Not Accounting for KV Cache Memory in Multi-Model Estimates
When calculating whether multiple models fit on a GPU, teams typically only count model weight memory. But the KV cache for an LLM under load can consume 30-50% of the model's weight memory. A 7B model's weights are ~14GB, but at full utilization with a 4096-token context window and 32 concurrent requests, the KV cache adds another 6-8GB. If you bin-pack models assuming only weight memory, you will run out of GPU memory under production load. Always model KV cache memory separately and include it in your bin-packing calculations.
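The KV cache footprint is mechanical to estimate once you know the model's layout. A sketch - the Llama-3-8B-style numbers (32 layers, 8 KV heads via GQA, head_dim 128) are assumptions for illustration:

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                tokens_in_flight: int, bytes_per_value: int = 2) -> float:
    """FP16 KV cache size: 2 (K and V) x layers x kv_heads x head_dim x tokens."""
    return 2 * n_layers * n_kv_heads * head_dim * tokens_in_flight * bytes_per_value / 1e9

# 32 concurrent requests averaging ~1500 tokens of context each:
print(kv_cache_gb(32, 8, 128, 32 * 1500))  # ~6.3 GB on top of the weights
```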
:::
Interview Q&A
Q: What is the difference between MIG, MPS, and time-slicing for sharing a GPU across multiple models, and when would you use each?
A: All three allow multiple processes to share a physical GPU, but with different isolation and performance characteristics. MIG (Multi-Instance GPU, available on A100 and H100) is hardware partitioning: the physical GPU is divided into isolated instances each with their own memory, L2 cache, and compute engines. Models on different MIG instances cannot interfere with each other. MIG is ideal for multi-tenant environments or when models have strict latency SLAs and cannot tolerate interference. The downside is that MIG configuration is set at the OS level and requires stopping all workloads to change. MPS (Multi-Process Service) is a CUDA-level feature that allows multiple CUDA processes to share compute, multiplexing their CUDA kernels. MPS gives better GPU utilization than dedicated partitions but has no memory isolation - a CUDA error in one process can affect others. Time-slicing is pure OS-level context switching: the GPU scheduler interleaves CUDA kernels from multiple processes. It has the lowest overhead but the highest latency variability because any process can preempt any other. In practice: use MIG for latency-critical production LLM serving, MPS for internal tools and batch workloads, and time-slicing for small embedding models and classifiers where latency variance is acceptable.
Q: How does a model cascade work, and what are the failure modes of a naive confidence estimator?
A: A model cascade routes requests through increasingly powerful (and expensive) models, short-circuiting to the cheapest model that can handle the request with sufficient quality. The appeal is cost reduction: if 70% of requests are simple enough for a 1.5B model, you avoid running them through a 70B model at 50x the cost. The cascade requires a confidence estimator that tells you whether the current model's response is good enough to return. A naive confidence estimator based on response log-probability has several failure modes. First, models are often overconfident on topics outside their training distribution - they produce high-confidence, fluent, wrong answers. The log-probability of the tokens may be high even when the answer is factually incorrect. Second, models are calibration-inconsistent across task types - a code model may produce very high-confidence responses for Python code but also high-confidence responses for a medical question it should not answer. Third, the confidence threshold needs to be tuned per task type, which is a calibration problem that requires labeled data. More robust approaches: use a separate lightweight quality classifier trained on (query, response, quality_label) triples from your own data, or use downstream task completion as the escalation signal (if the user immediately follows up with a clarification, the previous response was likely inadequate and should have escalated).
Q: Describe how you would design the model routing layer for a platform serving 10 different models, including coding, chat, embeddings, classification, and summarization tasks.
A: The routing layer needs four components. First, a task classifier that maps incoming requests to model types. This should be fast (under 5ms) and stateless. The simplest implementation is a rule-based classifier using keyword patterns and request metadata. A more robust approach trains a small BERT-like classifier on labeled examples from your own request logs, which handles edge cases better than rules. Second, a routing table that maps task types to model backends, including fallback configurations and per-model timeout/token limits. This should be config-driven so you can update routing without code changes. Third, a proxy layer (LiteLLM is the standard choice) that provides a unified OpenAI-compatible API and handles load balancing, fallbacks, and retry logic. Fourth, per-model observability: cost, latency, and quality metrics emitted with a model label so you can identify which models are expensive, slow, or degrading. The routing decision should also consider current backend health - if the coding model's queue depth is above threshold, fall back to the general chat model rather than queueing indefinitely. Session context matters too: in a multi-turn conversation, the same model should serve all turns rather than potentially routing turn 1 to the chat model and turn 2 to the coding model mid-session.
Q: How do you handle cold-start latency for scale-to-zero model deployments in a production environment?
A: Scale-to-zero is appealing because you pay nothing when a model is not in use. The problem is that the first request after a period of inactivity can wait 30-120 seconds for the model to load, which is unacceptable for interactive users. Several strategies mitigate this. Predictive pre-warming: analyze your traffic patterns and pre-warm models before expected demand. If your coding assistant consistently receives its first request at 8:45 AM on weekdays, schedule a warm-up job at 8:30 AM. This works well for predictable diurnal patterns but fails for sudden spikes. Tiered response: when a cold-start is unavoidable, return a "processing" response immediately and deliver the actual response asynchronously, rather than making the user wait for the cold start. This requires client-side support for async responses but is the best user experience for long-running models. Keep-warm at low cost: instead of true scale-to-zero, scale to a minimal fractional allocation that keeps the model weights in CPU RAM (not GPU). When a request arrives, load to GPU in ~0.5s rather than 60+ seconds from cold. This costs a small amount for CPU RAM reservation but eliminates the worst-case cold-start penalty. KEDA cron scaling: use KEDA's cron scaler to guarantee at least one warm replica during known high-traffic windows, and scale to zero only during confirmed low-traffic periods. The combination of predictive warm-up and CPU-warm standby handles most production cold-start requirements.
Q: Walk through how you would instrument per-model cost tracking in a multi-model serving environment.
A: Per-model cost visibility is essential for making bin-packing and routing decisions, but it requires intentional instrumentation at every layer. At the serving layer: vLLM and TGI expose prompt_tokens and completion_tokens in every response. Log these alongside model_name, user_id, and request_id in your access log. At the proxy layer: LiteLLM has a built-in cost tracking database that records per-request cost based on per-token pricing you configure for each model. Its /spend/logs endpoint gives you cost breakdowns by model, user, and time period. At the infrastructure layer: use GPU metrics (nvidia-smi, DCGM) to measure actual GPU utilization per model, and combine with your cloud instance cost to compute cost-per-token at the infrastructure level. Tag cloud billing dimensions (AWS cost allocation tags, GCP labels) with model names so your cloud cost dashboard shows GPU spend per model. The most actionable metric is cost-per-successful-response per model type: if your 70B model costs $0.10 per response while your 7B model costs $0.01 per response but resolves the user's intent only 80% of the time, you can make a data-driven decision about where to set the cascade escalation threshold. A common mistake is tracking only raw token cost without accounting for retry cost (failed requests that retry) and cascade escalation cost (requests that hit multiple models before returning).
Q: How do you ensure consistent model behavior across multiple replicas in a multi-model serving architecture?
A: This is a subtle but important problem. Multiple replicas of the same model should behave identically for the same input (assuming temperature=0 or deterministic sampling), but in practice there are several sources of divergence. Quantization inconsistency: if different replicas are running different quantization levels (one INT8, one FP16) due to a partial rollout or accident, outputs will differ even for the same input and temperature. Always version and tag the exact model artifact including quantization format, and verify replicas are running identical artifacts at startup. CUDA non-determinism: even with the same weights, CUDA operations on different GPU hardware or driver versions can produce slightly different floating-point results due to non-deterministic reduction operations. For most applications this is acceptable, but for financial or legal applications requiring auditability it matters. Mitigate by pinning CUDA driver versions and using torch.use_deterministic_algorithms(True) with the performance cost accepted. Session affinity: for multi-turn conversations, route all turns of a session to the same replica. This is not for determinism per se, but to avoid subtle behavioral differences between replicas affecting conversation coherence - even if replicas are technically identical, small floating-point differences in prior context can compound turn-over-turn. Config drift: replicas started at different times may have different configuration (different context window limits, different sampling parameters in their startup config). Enforce configuration through environment variables set at deployment time, not runtime config files, so all replicas in a deployment have identical settings.
Summary
Multi-model serving architecture is the difference between a collection of one-off AI features and a coherent AI platform. The organizations that get this right - shared GPU infrastructure, model routing, dynamic loading, per-model autoscaling - see 3-5x cost reduction compared to single-model dedicated deployments, with better operational visibility and faster time-to-deploy for new models.
The core engineering investments are modest: a routing layer (LiteLLM plus a task classifier), a model memory manager (LRU eviction from GPU to CPU to NVMe), per-model observability (cost and latency metrics with model labels), and autoscaling configured per model based on actual traffic patterns. A small platform team can build this in two sprints. The payoff - lower cloud bills, simpler operations, faster feature shipping - compounds with every model you add.
The teams that do not invest in this infrastructure pay for it every month on their cloud bill.
