
vLLM and Inference Servers

The Production Scenario

You have trained a fine-tuned version of LLaMA-3 8B for customer support automation. The model is great - it handles 85% of tier-1 support tickets without human intervention. Now you need to deploy it. The product manager asks: "How many users can we handle simultaneously?" You realize you have no good answer. You ran some ad-hoc tests with the HuggingFace generate() call and got about 40 tokens per second. That is fast enough for a single user, but what about 100 simultaneous users? 1,000?

You try the naive approach first: wrap the HuggingFace pipeline in a FastAPI server, serialize requests behind a lock, process one at a time. Your benchmark shows 40 tokens/second for user 1, and user 100 waits 25 minutes for their first token. This obviously does not scale.

You try batching: collect requests into a batch of 8, process together. Now user 1 gets their first token in 0.5 seconds instead of 40ms - you introduced latency to gain throughput. But you are getting 280 tokens/second across 8 concurrent requests. Still, your p99 latency is enormous because of static batching (short requests wait for long ones), and your KV cache management is fragile - you reserved fixed maximum-length buffers for every slot, wasting 60% of GPU memory.

Then you switch to vLLM. Same model, same GPU. Throughput jumps from 280 to 1,800 tokens/second. P99 latency drops by 8×. You can handle 200 concurrent users with acceptable latency SLAs. You did not change your model or your hardware - you changed your inference server.

This lesson explains why that happened and how the major inference servers work.


Why This Exists: The Inference Serving Problem

Building a production-grade LLM serving system is significantly more complex than running model.generate(). The core challenges are:

Throughput: A single user generates tokens at 40–100 tokens/second on a modern GPU. But the GPU's maximum throughput is 3,000+ tokens/second. That gap is the batching opportunity - you can serve 30–80 users simultaneously with properly implemented batching.

Latency: Users expect interactive responses. Time-to-first-token (TTFT) should be under 2 seconds. Time-between-tokens (TBT) should be under 50ms for smooth streaming. High throughput and low latency are in tension - a production system must balance both.

Memory efficiency: KV cache is the memory bottleneck. A naive system pre-allocates maximum-length KV buffers for each sequence - most of this allocation is wasted because sequences complete far before their maximum length. With 100 concurrent requests and 4,096 token max length, the waste can consume 70%+ of GPU memory.

API compatibility: Clients expect the OpenAI API format - /v1/chat/completions, /v1/completions, SSE streaming. Building this correctly with error handling, timeouts, authentication, and proper HTTP semantics is non-trivial.

Multi-GPU coordination: As covered in the previous lesson, large models require tensor or pipeline parallelism. The serving system must coordinate across GPUs transparently.

Inference servers solve all of these problems so you can focus on model selection and prompt engineering rather than GPU memory management.
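The latency targets above (TTFT under 2 seconds, TBT under 50ms) can be checked from per-token arrival timestamps. The sketch below is a minimal illustration; the helper name `latency_metrics` and the sample timestamps are assumptions made for the example, not part of any server's API.

```python
from statistics import mean


def latency_metrics(request_start: float, token_times: list[float]) -> dict:
    """Compute TTFT and mean TBT (in seconds) from token arrival timestamps."""
    if not token_times:
        raise ValueError("need at least one token timestamp")
    ttft = token_times[0] - request_start
    # Gaps between consecutive tokens; empty when only one token arrived.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    return {"ttft_s": ttft, "mean_tbt_s": mean(gaps) if gaps else 0.0}


# Hypothetical trace: request sent at t=0, first token after 0.8 s,
# then one token roughly every 40 ms.
times = [0.8 + 0.04 * i for i in range(5)]
m = latency_metrics(0.0, times)
meets_targets = m["ttft_s"] < 2.0 and m["mean_tbt_s"] < 0.050
print(m, "meets targets:", meets_targets)
```

In practice you would report percentiles (p50, p99) over many requests rather than a single mean, since tail latency is what users notice.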


Historical Context

Before dedicated LLM inference servers, practitioners used general-purpose model serving systems designed for traditional deep learning (ONNX Runtime, TensorFlow Serving, Triton Inference Server). These assumed fixed input/output shapes and batch sizes - poorly suited for autoregressive LLM generation.

FasterTransformer (NVIDIA, 2019) was the first specialized framework for transformer inference, providing hand-optimized CUDA kernels for attention and MLP layers. It significantly improved throughput over PyTorch baselines but still used static batching.

Orca (Yu et al., 2022) introduced continuous batching, which the paper calls iteration-level scheduling (covered in the previous lesson). The paper reported up to a 36.9× throughput improvement over FasterTransformer at the same latency level.

vLLM (Kwon et al., 2023) combined continuous batching with PagedAttention and released a production-ready open-source serving system. It demonstrated 2–24× higher throughput than HuggingFace Transformers and became the de facto standard for LLM serving within months of release.

TGI (HuggingFace, 2023) provided a competing production system with tighter HuggingFace model hub integration and Flash Attention 2.

TensorRT-LLM (NVIDIA, 2023) focused on maximum throughput on NVIDIA hardware through CUDA kernel fusion and quantization.

Ollama and llama.cpp served the developer and on-premise market with easy setup for local inference on consumer hardware.


vLLM: PagedAttention + Continuous Batching

vLLM is the most widely deployed open-source inference server as of 2025. Its two core innovations are PagedAttention (memory efficiency) and continuous batching (compute efficiency).

PagedAttention Recap

(Covered in depth in Lesson 02 - KV Cache. Brief summary here.)

Traditional KV cache allocates a contiguous memory block for each sequence's maximum possible length. This wastes memory when sequences complete early (internal fragmentation) and prevents sharing identical prefix KV states across requests (no reuse).
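To see how large the waste can get, here is a rough calculation for LLaMA-3 8B in FP16. The architecture constants come from the public model config (32 layers, 8 KV heads, head dimension 128); the workload of 100 sequences with a 4,096-token maximum matches the earlier example, while the average actual length of 1,024 tokens is an assumption chosen for illustration.

```python
# LLaMA-3 8B architecture values (public model config)
NUM_LAYERS = 32
NUM_KV_HEADS = 8       # grouped-query attention: fewer KV heads than query heads
HEAD_DIM = 128
BYTES_PER_VALUE = 2    # FP16 / BF16


def kv_bytes_per_token() -> int:
    # Factor 2: one key vector and one value vector per layer and KV head.
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE


def kv_gib(n_sequences: int, tokens_per_sequence: int) -> float:
    return n_sequences * tokens_per_sequence * kv_bytes_per_token() / 2**30


per_token = kv_bytes_per_token()          # 131072 bytes = 128 KiB per token
reserved = kv_gib(100, 4096)              # contiguous max-length reservation
used = kv_gib(100, 1024)                  # assumed average actual length
wasted_fraction = 1 - used / reserved

print(f"{per_token} bytes/token, reserved {reserved} GiB, used {used} GiB")
print(f"wasted: {wasted_fraction:.0%}")
```

Under these assumptions the contiguous scheme reserves 50 GiB but touches only 12.5 GiB, which is the kind of gap on-demand page allocation closes.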

PagedAttention stores KV cache in fixed-size pages (blocks), similar to virtual memory in operating systems. Each page holds 16 tokens of KV state by default (the block size is configurable). A block table maps each sequence's logical positions to physical pages. Pages are allocated on demand, freed immediately on sequence completion, and can be shared across requests with identical prefixes (prompt caching).

This allows vLLM to achieve near 100% KV cache memory utilization versus 30–60% in naive implementations.
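The block-table idea can be sketched in a few lines. This is a toy model that tracks page IDs only and stores no tensors; the class `PagedKVAllocator` is hypothetical and is not vLLM's implementation, which also handles prefix sharing, copy-on-write, and preemption.

```python
BLOCK_SIZE = 16  # tokens per page, matching the page size quoted above


class PagedKVAllocator:
    """Toy block-table allocator: maps each sequence to a list of page IDs."""

    def __init__(self, num_blocks: int) -> None:
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[str, list[int]] = {}
        self.lengths: dict[str, int] = {}

    def append_token(self, seq_id: str) -> None:
        """Account for one new token, allocating a page only when needed."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length == len(table) * BLOCK_SIZE:  # every allocated page is full
            if not self.free_blocks:
                raise MemoryError("no free KV pages (a real scheduler would preempt)")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1

    def free(self, seq_id: str) -> None:
        """Return all pages of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


alloc = PagedKVAllocator(num_blocks=8)
for _ in range(17):               # 17 tokens need 2 pages (16 + 1)
    alloc.append_token("req-a")
for _ in range(5):                # 5 tokens need 1 page
    alloc.append_token("req-b")
pages_a = len(alloc.block_tables["req-a"])
pages_b = len(alloc.block_tables["req-b"])
alloc.free("req-a")               # both pages become reusable immediately
print(pages_a, pages_b, len(alloc.free_blocks))
```

The only internal waste is the unfilled tail of each sequence's last page, at most `BLOCK_SIZE - 1` token slots per sequence.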

vLLM Architecture

vLLM Key Features

| Feature | Detail |
| --- | --- |
| Throughput vs HuggingFace | 2–24× higher depending on load |
| Tensor parallelism | Built-in, tensor_parallel_size=N |
| Quantization | AWQ, GPTQ, INT8, FP8 |
| Speculative decoding | Built-in, configurable draft model |
| Prompt caching | Automatic prefix caching |
| Chunked prefill | enable_chunked_prefill=True |
| Model support | LLaMA, Mistral, Gemma, Qwen, Phi, and 50+ others |
| API | OpenAI-compatible /v1/completions and /v1/chat/completions |
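Several of these features are switched on through server flags. The sketch below assembles a launch command from a few options; the helper `build_vllm_command` is hypothetical, and while the flag names are ones vLLM accepts at the time of writing, defaults (notably for prefix caching and chunked prefill) have changed between releases, so check `--help` for your installed version.

```python
import shlex
from typing import Optional


def build_vllm_command(
    model: str,
    tensor_parallel_size: int = 1,
    quantization: Optional[str] = None,
    prefix_caching: bool = False,
    chunked_prefill: bool = False,
) -> list[str]:
    """Assemble an argv list for the vLLM OpenAI-compatible server."""
    cmd = ["python", "-m", "vllm.entrypoints.openai.api_server", "--model", model]
    if tensor_parallel_size > 1:
        cmd += ["--tensor-parallel-size", str(tensor_parallel_size)]
    if quantization:
        cmd += ["--quantization", quantization]   # e.g. "awq" or "gptq"
    if prefix_caching:
        cmd.append("--enable-prefix-caching")
    if chunked_prefill:
        cmd.append("--enable-chunked-prefill")
    return cmd


cmd = build_vllm_command(
    "meta-llama/Meta-Llama-3-70B-Instruct",
    tensor_parallel_size=4,
    prefix_caching=True,
    chunked_prefill=True,
)
print(shlex.join(cmd))  # pass `cmd` to subprocess.Popen to actually launch
```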

HuggingFace TGI (Text Generation Inference)

TGI is HuggingFace's production inference server. Written in Rust (HTTP layer) + Python (model logic), it provides tighter integration with the HuggingFace model hub and slightly better out-of-box experience for HuggingFace models.

Key features:

  • Flash Attention 2 integration for faster attention computation
  • Tensor parallelism across multiple GPUs
  • Token streaming via SSE
  • Quantization: GPTQ, AWQ, BitsAndBytes
  • Continuous batching (iteration-level scheduling)
  • Watermarking support for generated text
  • Native Safetensors format support

When to prefer TGI over vLLM:

  • Heavy HuggingFace ecosystem integration (Inference Endpoints, Inference API)
  • Need watermarking
  • Models only available in HuggingFace format and not yet supported by vLLM

TGI benchmarks vs vLLM: Performance is similar for most workloads. vLLM's PagedAttention implementation tends to win on memory utilization; TGI's Rust HTTP layer handles extreme connection counts more efficiently.


TensorRT-LLM: Maximum NVIDIA Throughput

TensorRT-LLM (NVIDIA) is the highest-throughput option for NVIDIA hardware. It works by:

  1. Compiling the model: TensorRT compiles the LLM into an optimized CUDA engine for your specific GPU, dtype, and max batch/sequence size. This is a one-time step that takes 10–30 minutes.

  2. Kernel fusion: TensorRT fuses adjacent operations (attention + residual add + layernorm) into single CUDA kernels, eliminating intermediate memory reads and writes.

  3. INT8/FP8 quantization: Native support for hardware-accelerated INT8 (A100 and newer) and FP8 (H100 and other Hopper- or Ada-generation GPUs; the A100 has no FP8 tensor cores), achieving up to 2–4× further speedup over FP16.

  4. In-flight batching: TensorRT-LLM's term for continuous batching with its own scheduling implementation.

Trade-offs:

  • Highest throughput on NVIDIA hardware (20–40% better than vLLM for some workloads)
  • Requires compilation step - not dynamic, cannot swap models without recompiling
  • Harder to debug and customize
  • Best supported on H100; some features require specific GPU generations

Use TensorRT-LLM when: You have a fixed model, NVIDIA hardware, and throughput is the primary optimization target. It fits production-scale deployments where compute cost is significant.


Ollama: Developer-Friendly Local Inference

Ollama is designed for developers who want to run LLMs locally with minimal setup. It wraps llama.cpp in a user-friendly interface with a simple CLI and OpenAI-compatible API.

# Install and run LLaMA-3 8B locally
curl -fsSL https://ollama.ai/install.sh | sh
ollama pull llama3.1:8b
ollama run llama3.1:8b "Explain attention mechanisms"

# Or use the OpenAI-compatible API
ollama serve # Starts API server on localhost:11434
# Use with OpenAI SDK
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "What is backpropagation?"}],
)
print(response.choices[0].message.content)

Key features:

  • Model management (pull, list, remove)
  • GGUF format (quantized models from TheBloke, bartowski, etc.)
  • Apple Silicon GPU acceleration via Metal
  • CPU fallback when GPU memory is insufficient
  • Simple Modelfile for custom system prompts and parameters

Use Ollama when: Local development, prototyping, air-gapped deployments, or low-volume internal tools where cloud inference is not an option.


llama.cpp: The Foundation Layer

llama.cpp is a C/C++ inference engine for quantized LLMs. It is the underlying engine for Ollama and many other tools. Key characteristics:

  • GGUF format: Quantized model format (Q4_K_M, Q5_K_M, Q8_0, etc.) that packs model weights efficiently
  • CPU inference: Runs on any CPU with decent performance - 10–30 tokens/second on a modern laptop with Q4 quantization
  • Apple Silicon (Metal): Full GPU acceleration through Metal, competitive with NVIDIA for smaller models
  • CUDA/ROCm: GPU acceleration on NVIDIA and AMD
  • Minimal dependencies: Plain C/C++ codebase, no Python required

GGUF quantization formats:

| Format | Size reduction vs FP16 (nominal) | Quality loss | Use case |
| --- | --- | --- | --- |
| Q8_0 | ~50% | Minimal | GPU with memory constraints |
| Q5_K_M | ~69% | Small | Good balance |
| Q4_K_M | ~75% | Moderate | Standard choice |
| Q3_K_M | ~81% | Noticeable | CPU-only, space constrained |
| Q2_K | ~87% | Significant | Extreme compression, not recommended |
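The size-reduction column follows from the nominal bit width: a format that stores about b bits per weight is 1 - b/16 smaller than FP16. The sketch below reproduces that arithmetic for an 8B-parameter model. It assumes the nominal widths (8, 5, 4, 3, 2 bits); real GGUF files come out somewhat larger because K-quants also store per-block scales and keep some tensors at higher precision.

```python
FP16_BITS = 16

# Nominal bits per weight (an approximation; actual files use slightly more).
NOMINAL_BITS = {"Q8_0": 8, "Q5_K_M": 5, "Q4_K_M": 4, "Q3_K_M": 3, "Q2_K": 2}


def size_reduction(bits_per_weight: float) -> float:
    """Fraction of FP16 size saved at the given bit width."""
    return 1 - bits_per_weight / FP16_BITS


def model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Weight storage in GB (10^9 bytes), ignoring metadata and scales."""
    return n_params * bits_per_weight / 8 / 1e9


for name, bits in NOMINAL_BITS.items():
    print(
        f"{name}: {size_reduction(bits) * 100:.2f}% smaller, "
        f"about {model_size_gb(8e9, bits):.1f} GB for 8B parameters"
    )
```

For comparison, the same 8B model in FP16 needs `model_size_gb(8e9, 16)`, or about 16 GB, for weights alone.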

Comparison: All Five Servers

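| | vLLM | TGI | TensorRT-LLM | Ollama | llama.cpp |
| --- | --- | --- | --- | --- | --- |
| Primary use | Production serving | Production serving | NVIDIA production | Local dev | Local/embedded |
| Throughput | Excellent | Excellent | Best on NVIDIA | Good | Moderate |
| Ease of setup | Good | Good | Difficult | Excellent | Moderate |
| Model support | Very broad | Broad | Broad (NVIDIA GPUs only) | GGUF models | GGUF models |
| API | OpenAI compat | OpenAI compat | Custom + OpenAI | OpenAI compat | OpenAI compat (llama-server) |
| Tensor parallel | Yes | Yes | Yes | No | No |
| Quantization | AWQ, GPTQ, INT8, FP8 | GPTQ, AWQ, BitsAndBytes | INT8, FP8, INT4 | GGUF (Q2–Q8) | GGUF (Q2–Q8) |
| Continuous batching | Yes (PagedAttention) | Yes | Yes (in-flight) | Limited (parallel requests) | Limited (llama-server) |
| Streaming | SSE | SSE | SSE | SSE | SSE |
| CPU support | No | No | No | Yes (slow) | Yes |
| Apple Silicon | No | No | No | Yes (Metal) | Yes (Metal) |
| License | Apache 2.0 | Apache 2.0 | Apache 2.0 | MIT | MIT |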

Code: Deploy LLaMA-3 8B with vLLM

# Install vLLM
pip install vllm

# Start the server (single GPU)
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --dtype auto \
  --api-key my-secret-key \
  --port 8000 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90

# With 4 GPUs (tensor parallel)
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-70B-Instruct \
  --tensor-parallel-size 4 \
  --dtype float16 \
  --port 8000

# Benchmark with concurrent requests using OpenAI SDK
import asyncio
import time
from statistics import quantiles

from openai import AsyncOpenAI


async def send_request(client: AsyncOpenAI, prompt: str) -> dict:
    """Send a single streaming chat completion request and measure latency."""
    start = time.perf_counter()
    first_token_time = None
    tokens_generated = 0

    stream = await client.chat.completions.create(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=200,
        stream=True,
    )

    async for chunk in stream:
        # Some chunks carry no choices (for example a trailing usage chunk); skip them.
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token_time is None:
                first_token_time = time.perf_counter()
            # Approximation: each content chunk is counted as one token.
            tokens_generated += 1

    total_time = time.perf_counter() - start
    ttft = first_token_time - start if first_token_time else total_time

    return {
        "total_time_s": total_time,
        "ttft_s": ttft,
        "tokens_generated": tokens_generated,
        "tps": tokens_generated / total_time if total_time > 0 else 0,
    }


async def benchmark_concurrent(
    n_concurrent: int,
    n_requests: int = 100,
    server_url: str = "http://localhost:8000",
) -> dict:
    """
    Send n_requests at n_concurrent concurrency.
    Returns throughput and latency percentiles.
    """
    client = AsyncOpenAI(base_url=f"{server_url}/v1", api_key="my-secret-key")

    prompts = [
        "Explain gradient descent in one paragraph.",
        "Write a Python function to compute fibonacci numbers.",
        "What is the capital of France and what is it known for?",
        "Describe the transformer architecture briefly.",
        "What causes the seasons on Earth?",
    ]

    # Cap the number of in-flight requests with a semaphore
    semaphore = asyncio.Semaphore(n_concurrent)

    async def bounded_request(prompt: str) -> dict:
        async with semaphore:
            return await send_request(client, prompt)

    start = time.perf_counter()
    results = await asyncio.gather(
        *[bounded_request(prompts[i % len(prompts)]) for i in range(n_requests)]
    )
    wall_time = time.perf_counter() - start

    ttfts = [r["ttft_s"] for r in results]
    total_times = [r["total_time_s"] for r in results]
    total_tokens = sum(r["tokens_generated"] for r in results)

    # quantiles(n=100) returns 99 cut points: index 49 is p50, index 98 is p99.
    # It needs at least two data points, so keep n_requests >= 2.
    qs_ttft = quantiles(ttfts, n=100)
    qs_total = quantiles(total_times, n=100)

    return {
        "n_concurrent": n_concurrent,
        "n_requests": n_requests,
        "wall_time_s": round(wall_time, 2),
        "throughput_tps": round(total_tokens / wall_time, 1),
        "throughput_rps": round(n_requests / wall_time, 2),
        "ttft_p50_ms": round(qs_ttft[49] * 1000, 1),
        "ttft_p99_ms": round(qs_ttft[98] * 1000, 1),
        "total_p50_s": round(qs_total[49], 2),
        "total_p99_s": round(qs_total[98], 2),
    }


async def main():
    print("vLLM LLaMA-3 8B Benchmark")
    print(f"{'Concurrency':>12} {'Throughput TPS':>16} {'TTFT p50 ms':>12} {'TTFT p99 ms':>12}")
    print("-" * 60)

    for n_concurrent in [1, 4, 8, 16, 32]:
        result = await benchmark_concurrent(n_concurrent, n_requests=50)
        # Column widths match the header row above
        print(
            f"{n_concurrent:>12} "
            f"{result['throughput_tps']:>16.0f} "
            f"{result['ttft_p50_ms']:>12.0f} "
            f"{result['ttft_p99_ms']:>12.0f}"
        )


asyncio.run(main())

Code: OpenAI-Compatible Client

All five inference servers (when configured) accept the OpenAI API format. This means you can switch between servers by changing the base_url:

from openai import OpenAI

# --- vLLM ---
vllm_client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="your-key",
)

# --- TGI ---
tgi_client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="your-key",
)

# --- Ollama ---
ollama_client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",
)

# --- OpenAI (production) ---
openai_client = OpenAI(api_key="sk-...")


def chat(client: OpenAI, model: str, prompt: str) -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=500,
        temperature=0.7,
    )
    return response.choices[0].message.content


# Streaming response
def stream_chat(client: OpenAI, model: str, prompt: str) -> None:
    print(f"\nStreaming response from {model}:\n")
    with client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=200,
        stream=True,
    ) as stream:
        for chunk in stream:
            # Skip chunks without choices or without text content
            if chunk.choices and chunk.choices[0].delta.content:
                print(chunk.choices[0].delta.content, end="", flush=True)
    print()


# Batch completions (non-chat); requests are sent sequentially
def batch_completions(
    client: OpenAI,
    model: str,
    prompts: list[str],
) -> list[str]:
    responses = []
    for prompt in prompts:
        response = client.completions.create(
            model=model,
            prompt=prompt,
            max_tokens=100,
            temperature=0.0,
        )
        responses.append(response.choices[0].text)
    return responses

Production: Autoscaling, Health Checks, and Load Balancing

Health Check Endpoint

vLLM and TGI expose health check endpoints for load balancers:

import asyncio

import httpx


async def check_server_health(server_url: str) -> dict:
    """Check if an inference server is ready to accept requests."""
    async with httpx.AsyncClient(timeout=5.0) as client:
        try:
            # vLLM health endpoint
            response = await client.get(f"{server_url}/health")
            if response.status_code != 200:
                return {
                    "status": "unhealthy",
                    "server": server_url,
                    "error": f"/health returned HTTP {response.status_code}",
                }

            # Check available models. This is best-effort: the endpoint may
            # require an API key, so a non-200 here does not mark the server down.
            models_response = await client.get(f"{server_url}/v1/models")
            models = (
                models_response.json().get("data", [])
                if models_response.status_code == 200
                else []
            )
            return {
                "status": "healthy",
                "server": server_url,
                "models": [m["id"] for m in models],
            }
        except httpx.HTTPError as e:
            # Covers connection errors and timeouts
            return {"status": "unhealthy", "server": server_url, "error": str(e)}


async def check_all_servers(servers: list[str]) -> dict:
    results = await asyncio.gather(*[check_server_health(s) for s in servers])
    return {r["server"]: r for r in results}

Rolling Deployment

# Deploy new model version with zero downtime
# 1. Start new vLLM server on different port
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct-v2 \
  --port 8001 &

# 2. Wait for health check to pass
until curl -f http://localhost:8001/health; do sleep 5; done

# 3. Update nginx upstream to include new server
# nginx.conf:
# upstream llm_backend {
#     server localhost:8000 weight=50; # old
#     server localhost:8001 weight=50; # new
# }
nginx -s reload

# 4. Update nginx to 100% new server, so the old one stops receiving new requests
# upstream llm_backend {
#     server localhost:8001;
# }
nginx -s reload

# 5. Drain and shutdown old server
# Wait for in-flight requests to complete, then:
kill -SIGTERM <old_vllm_pid>

nginx Load Balancing

# nginx.conf for load balancing across multiple vLLM instances
upstream llm_backend {
    # Least connections - important for LLM inference where request duration varies
    least_conn;

    server llm-server-1:8000 max_fails=3 fail_timeout=30s;
    server llm-server-2:8000 max_fails=3 fail_timeout=30s;
    server llm-server-3:8000 max_fails=3 fail_timeout=30s;

    keepalive 64;
}

server {
    listen 80;

    location /v1/ {
        proxy_pass http://llm_backend;
        proxy_http_version 1.1;
        proxy_set_header Connection "";

        # Critical for streaming: disable proxy buffering
        proxy_buffering off;
        proxy_cache off;

        # Long timeouts for long-running generation
        proxy_read_timeout 300s;
        proxy_send_timeout 300s;

        # Pass through headers
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }

    # Health check endpoint for ALB/NLB
    location /health {
        proxy_pass http://llm_backend/health;
        access_log off;
    }
}
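The reason `least_conn` suits LLM traffic can be shown with a small sketch of the selection rule. This is an illustration only, not nginx's implementation; the function names and the in-flight counts are made up for the example.

```python
def pick_least_conn(in_flight: dict[str, int]) -> str:
    """Choose the backend with the fewest in-flight requests (ties: first listed)."""
    return min(in_flight, key=in_flight.get)


def pick_round_robin(backends: list[str], request_index: int) -> str:
    """Choose backends in fixed rotation, ignoring their current load."""
    return backends[request_index % len(backends)]


# Hypothetical snapshot: server 1 is tied up with many long generations.
in_flight = {"llm-server-1": 12, "llm-server-2": 3, "llm-server-3": 7}
backends = list(in_flight)

print(pick_round_robin(backends, 0))  # llm-server-1, despite being the busiest
print(pick_least_conn(in_flight))     # llm-server-2
```

Because generation time varies by orders of magnitude between requests, a rotation that ignores load can pile new work onto a server that is still busy with long completions.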

Docker Compose for Local Development

# docker-compose.yml
version: "3.8"

services:
  vllm:
    image: vllm/vllm-openai:latest
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=0
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
    volumes:
      - huggingface-cache:/root/.cache/huggingface
    command: >
      --model meta-llama/Meta-Llama-3-8B-Instruct
      --dtype auto
      --max-model-len 8192
      --gpu-memory-utilization 0.85
      --port 8000
    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s # Model loading takes time

volumes:
  huggingface-cache:

:::danger Do Not Use HuggingFace generate() in Production

The HuggingFace pipeline() and model.generate() APIs are designed for research and experimentation. They do not implement continuous batching, PagedAttention, or efficient KV cache management. Wrapping them in a FastAPI server produces a system that:

  • Serves exactly one request at a time (unless you implement batching yourself)
  • Wastes 40–70% of GPU memory on KV cache fragmentation
  • Has no backpressure mechanism - will OOM under load
  • Cannot stream tokens per-request while processing a batch

For any workload with more than one user, use vLLM, TGI, or TensorRT-LLM.

:::

:::warning Disable Proxy Buffering for Streaming Responses

If your nginx or load balancer buffers responses before sending to clients, streaming will break. The client will wait until the entire response is generated before receiving anything - negating the streaming benefit.

Required nginx configuration:

proxy_buffering off;
proxy_cache off;

Required FastAPI/uvicorn: use StreamingResponse with media_type="text/event-stream". vLLM and TGI handle this correctly - ensure any reverse proxy in front does not re-buffer.

:::
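To make the wire format concrete: an OpenAI-style stream is a sequence of `data: {json}` lines separated by blank lines and terminated by `data: [DONE]`. The sketch below parses such lines and yields the text deltas. The function name `iter_sse_content` and the sample payloads are illustrative assumptions; in real code the OpenAI SDK does this parsing for you.

```python
import json
from typing import Iterable, Iterator


def iter_sse_content(lines: Iterable[str]) -> Iterator[str]:
    """Yield text deltas from OpenAI-style SSE lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # blank separators and comment lines carry no payload
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Hand-written sample of what a server might send for the reply "Hello".
sample = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    "",
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    "",
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    "",
    "data: [DONE]",
]
text = "".join(iter_sse_content(sample))
print(text)  # Hello
```

A buffering proxy delivers exactly these lines, just all at once at the end, which is why the client sees no output until generation finishes.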

:::warning Check GPU Memory Before Starting the Server

vLLM loads the full model into GPU memory at startup, then allocates the remaining GPU memory for KV cache (controlled by --gpu-memory-utilization). If other processes hold GPU memory (other models, monitoring, etc.), vLLM may OOM at startup or reduce KV cache budget unexpectedly.

# Check current GPU memory usage before starting vLLM
nvidia-smi --query-gpu=memory.used,memory.free --format=csv

# If running multiple services, set CUDA_VISIBLE_DEVICES
CUDA_VISIBLE_DEVICES=0,1 python -m vllm.entrypoints.openai.api_server ...

:::
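A back-of-envelope estimate shows how sensitive the KV cache budget is to the memory left over after weights. The sketch below is a rough model, not how vLLM measures memory (vLLM profiles actual usage at startup). The 24 GiB GPU, the 15 GiB weight footprint (about 8B parameters at 2 bytes each), and the 1 GiB runtime overhead are assumptions; the 131,072 bytes per token is the FP16 KV size for LLaMA-3 8B (2 × 32 layers × 8 KV heads × 128 dims × 2 bytes).

```python
def kv_token_budget(
    gpu_mem_gib: float,
    utilization: float,
    weights_gib: float,
    kv_bytes_per_token: int,
    overhead_gib: float = 0.0,
) -> int:
    """Estimate how many tokens of KV cache fit after weights and overhead."""
    usable_gib = gpu_mem_gib * utilization - weights_gib - overhead_gib
    if usable_gib <= 0:
        return 0  # the server would fail to start or have no cache at all
    return int(usable_gib * 2**30 // kv_bytes_per_token)


KV_BYTES_PER_TOKEN = 131_072  # LLaMA-3 8B, FP16

budget_90 = kv_token_budget(24, 0.90, 15, KV_BYTES_PER_TOKEN, overhead_gib=1)
budget_70 = kv_token_budget(24, 0.70, 15, KV_BYTES_PER_TOKEN, overhead_gib=1)
print(f"utilization 0.90: ~{budget_90} tokens of KV cache")
print(f"utilization 0.70: ~{budget_70} tokens of KV cache")
```

Under these assumptions, lowering the utilization setting from 0.90 to 0.70 cuts the token budget by roughly a factor of seven, because the weights are a fixed cost and only the remainder goes to KV cache.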


Interview Questions

Q1: Why does vLLM achieve 2–24× higher throughput than HuggingFace's generate() for multi-user workloads?

Three primary reasons:

  1. Continuous batching: HuggingFace generate() processes one request at a time (or a fixed static batch). vLLM uses iteration-level scheduling - it replaces completed sequences immediately at every decode step rather than waiting for the longest sequence in a batch. This eliminates GPU idle time caused by static batching's "slowest request holds everyone up" problem.

  2. PagedAttention: HuggingFace pre-allocates maximum-length KV cache buffers for each sequence (wastes 40–70% of GPU memory on padding). vLLM's PagedAttention allocates KV cache in 16-token pages on demand, achieving near 100% memory utilization. More efficient memory use means more concurrent sequences fit in GPU memory.

  3. Custom CUDA kernels: vLLM's attention kernel is hand-optimized for the paged memory layout, whereas HuggingFace Transformers runs general-purpose attention implementations (eager, SDPA, or FlashAttention) over contiguous KV tensors. The custom kernel achieves better memory bandwidth utilization for the typical LLM decode pattern.
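The effect of the first point can be seen in a toy simulation that counts decode steps under each scheduling policy. This is a deliberately simplified sketch: it ignores prefill, memory limits, and per-step cost, and assumes every request has a positive output length. The function names and the request lengths are made up for the example.

```python
def static_batching_steps(lengths: list[int], batch_size: int) -> int:
    """Each fixed batch runs until its longest sequence finishes."""
    return sum(
        max(lengths[i : i + batch_size])
        for i in range(0, len(lengths), batch_size)
    )


def continuous_batching_steps(lengths: list[int], batch_size: int) -> int:
    """A finished sequence's slot is refilled from the queue before the next step."""
    queue = list(lengths)
    slots: list[int] = []  # remaining tokens for each active sequence
    steps = 0
    while queue or slots:
        while queue and len(slots) < batch_size:
            slots.append(queue.pop(0))
        steps += 1  # one decode step produces one token per active sequence
        slots = [remaining - 1 for remaining in slots if remaining > 1]
    return steps


# Two long requests mixed with six short ones, four slots available.
lengths = [200, 10, 10, 10, 200, 10, 10, 10]
static_steps = static_batching_steps(lengths, batch_size=4)
continuous_steps = continuous_batching_steps(lengths, batch_size=4)
print(static_steps, continuous_steps)  # 400 210
```

With static batching the second long request cannot start until the first batch ends at step 200; with continuous batching it starts at step 11, as soon as a short request frees a slot.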

Q2: How does vLLM's OpenAI-compatible API help with inference server portability?

The OpenAI API format - /v1/chat/completions, /v1/completions, /v1/models with standard JSON schemas - has become the de facto standard for LLM APIs. vLLM, TGI, Ollama, and TensorRT-LLM support it, and commercial providers such as Anthropic, Mistral, and Groq offer compatible endpoints or compatibility layers.

Portability implications:

  • Client code using the OpenAI Python SDK requires only a base_url change to switch between providers
  • Load testing, monitoring, and observability tools built for OpenAI's API work without modification
  • Gradual migration: run vLLM locally in development, switch base_url for staging/production
  • A/B testing between inference backends requires only load balancer configuration, not code changes

The primary limitation: extensions (like vLLM's --lora-modules or TGI's watermarking) are not standardized and require provider-specific configuration.

Q3: When would you choose TensorRT-LLM over vLLM, and what are the trade-offs?

Choose TensorRT-LLM when:

  • You have a fixed, unchanging model in production
  • You are running on NVIDIA H100/A100/L40S (TensorRT is NVIDIA-specific)
  • Throughput is the primary optimization metric and you can invest in operational complexity
  • You need FP8 quantization for maximum performance on H100

TensorRT-LLM trade-offs:

  • Pro: 20–40% higher throughput over vLLM for some workloads; best hardware utilization via kernel fusion; native FP8 support
  • Con: Requires model compilation (10–30 minutes per model × GPU × dtype × max-length configuration); any model change requires recompilation; harder to debug; less community support than vLLM; not portable to non-NVIDIA hardware

vLLM is the right default. Switch to TensorRT-LLM when vLLM's throughput ceiling is your production bottleneck and the operational overhead of pre-compilation is acceptable.

Q4: A user reports that streaming responses from your vLLM server have a long delay before tokens start appearing, followed by all tokens appearing at once. What is wrong?

This is almost certainly a reverse proxy buffering issue. When proxy_buffering is enabled in nginx (or similar), the proxy accumulates the entire SSE stream before forwarding it to the client. The client experiences what looks like a blocked response: nothing for 5–30 seconds, then the complete output appears instantly.

Diagnosis:

# Test directly (bypassing proxy); -N turns off curl's own output buffering
curl -N http://vllm-server:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"...", "messages":[{"role":"user","content":"hello"}], "stream":true}'

If streaming works when hitting vLLM directly but not through the proxy, it is definitely proxy buffering.

Fix in nginx:

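proxy_buffering off;
proxy_cache off;
# Alternative: have the upstream send the response header "X-Accel-Buffering: no".
# nginx reads it from the upstream response; setting it with proxy_set_header has no effect.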

Also check: application-level buffering (uvicorn worker settings), CDN caching, or any middleware that buffers responses.

Q5: How would you design a multi-region LLM serving setup for a global application with 50,000 daily active users?

Architecture:

  1. Regional clusters: Deploy vLLM clusters in 3–4 regions (US-East, EU-West, Asia-Pacific). Route users to the nearest region via DNS geolocation or Anycast. This reduces TTFT by cutting network round-trip time.

  2. Autoscaling: Use Kubernetes HPA (Horizontal Pod Autoscaler) with GPU utilization and request queue depth as scaling metrics. Scale up when queue depth exceeds 20 requests or GPU utilization exceeds 75%.

  3. Model consistency: All regions run identical model checkpoints from a central artifact store (S3/GCS). Model updates are rolled out region by region to avoid global outages.

  4. Load balancing within cluster: nginx or envoy with least_conn algorithm. Use DRAINING state for graceful rolling deployments - stop accepting new requests, let in-flight requests complete, then terminate.

  5. Observability: Centralized metrics (Prometheus + Grafana) tracking TTFT, TBT, throughput, queue depth, GPU utilization, and error rates per region. Alert on p99 TTFT > 5s or queue depth > 50 requests.

  6. Cost optimization: Use spot/preemptible instances for batch processing queues; reserve on-demand instances for interactive serving. Scale each region down during its off-peak hours (user activity is time-zone correlated), and scale fully to zero only where the cold-start delay from model loading is acceptable.
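The autoscaling rule in point 2 can be sketched with the scaling formula the Kubernetes HPA documents: desired = ceil(current × metric / target), taking the largest proposal when several metrics are configured. The function below is an illustration, not the HPA controller; it omits the tolerance band and stabilization windows, and the replica bounds and metric readings are assumptions.

```python
import math


def desired_replicas(
    current: int,
    metrics: dict[str, tuple[float, float]],
    min_replicas: int = 1,
    max_replicas: int = 16,
) -> int:
    """metrics maps a name to (current_value, target_value), averaged per replica."""
    proposals = [
        math.ceil(current * value / target) for value, target in metrics.values()
    ]
    # With several metrics, the largest proposal wins; then clamp to the bounds.
    return max(min_replicas, min(max_replicas, max(proposals)))


# Hypothetical reading: 4 replicas, queue depth 30 per replica against a
# target of 20, GPU utilization 60% against a target of 75%.
replicas = desired_replicas(
    current=4,
    metrics={"queue_depth": (30, 20), "gpu_utilization": (60, 75)},
)
print(replicas)  # 6
```

Queue depth drives the decision here even though GPU utilization is under target, which is why it is worth exporting both as scaling metrics.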

Q6: What is the significance of the OpenAI-compatible API format, and what does it enable for the broader LLM ecosystem?

The OpenAI API has become the HTTP standard for language model access - analogous to how SQL became the standard for relational database access. Any system that speaks this format participates in a large ecosystem of compatible tools.

What it enables:

  • Framework compatibility: LangChain, LlamaIndex, Semantic Kernel, and similar frameworks all support the OpenAI format - they work with any compatible server without modification
  • Observability tools: LangFuse, Helicone, Arize, and LLM monitoring platforms can proxy and log any OpenAI-format API
  • Provider portability: Switch from OpenAI GPT-4 to local vLLM or Mistral API by changing one URL and API key
  • Cost control: Route requests between providers based on cost/capability without changing application code
  • Testing: Use local Ollama for development tests, switch to production vLLM for integration tests, same code throughout

The limitation: the format covers the basics (chat completions, text completions, embeddings), but advanced features such as tool-use schemas, vision inputs, and function calling are not fully standardized. Each provider extends the format in different ways for these capabilities.


© 2026 EngineersOfAI. All rights reserved.