vLLM Architecture and Deployment
The Production Traffic Spike That Changed Everything
It is 2:47 AM on a Tuesday. Your company just launched an AI-powered feature to 500,000 users, and the Slack alerts are firing. Response latency has climbed from 340ms to 14 seconds. GPU memory is at 94%. Requests are queuing. The on-call engineer is watching the dashboard, helpless, because the serving infrastructure was not built for this.
The model itself is fine. LLaMA 3 70B is sitting on two A100 80GB GPUs, doing its job correctly. The problem is everything around it. The naive serving approach - one request at a time, waiting for the full sequence to complete before accepting the next - means those GPUs are idle most of the time between forward passes. The KV cache, which stores the attention keys and values for every token in every sequence, is being allocated in large contiguous blocks. When a 2,048-token request finishes early, that memory sits wasted. When a new 4,096-token request comes in, there is no contiguous block large enough, so it waits.
This is not a hypothetical. It is the state of LLM serving before 2023. Every team that tried to productionize an open-source language model hit the same wall: the model was capable, the hardware was available, but the serving stack was a bottleneck. Teams wrote custom batching logic, hand-tuned memory pools, and still got throughput numbers that were embarrassing compared to the theoretical capacity of the hardware.
Then a team at UC Berkeley published a paper. Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica released vLLM: Efficient Memory Management for Large Language Model Serving with PagedAttention in June 2023. The core insight was deceptively simple: stop treating the KV cache like a contiguous array. Treat it like virtual memory in an operating system.
That one insight unlocked 24x higher throughput on the same hardware. And it is now the production standard for serving open-source language models at scale.
This lesson will teach you exactly how vLLM works, why each design decision matters, and how to deploy it correctly for production workloads. By the end, you will understand PagedAttention, continuous batching, tensor parallelism, and how to tune a vLLM deployment to hit latency SLAs while maximizing throughput.
Why This Exists - The Problem Before vLLM
Static KV Cache Allocation Was Destroying GPU Utilization
Before vLLM, the standard approach to LLM serving was to allocate a KV cache proportional to the maximum sequence length at model load time. If your model supported 4,096 tokens, you pre-allocated memory for 4,096 key-value pairs per attention head per layer.
This created three severe problems.
Problem 1: Memory fragmentation. Each request got a contiguous block of pre-allocated memory. If a request finished at token 312 instead of 4,096, roughly 92% of the reserved memory for that slot sat wasted until it was cleared. With 16 concurrent request slots, you could be using only 8% of your KV cache memory effectively.
Problem 2: Head-of-line blocking. Because memory was pre-allocated in fixed slots, you had a hard limit on concurrent requests equal to the number of pre-allocated slots. When all slots were full, new requests queued even if the GPU was 60% idle. Concurrency was capped by slot count, not by what the hardware could actually sustain.
Problem 3: No batching across sequences of different lengths. Naive batching required padding all sequences in a batch to the same length, which meant paying the compute cost of attending to padding tokens. In a batch with one 50-token sequence and one 3,900-token sequence, the short sequence was padded out to 3,900 tokens - 78x more positions than it actually needed.
The collective result: GPU utilization on a well-provisioned LLM serving cluster in 2022 was often 20-35%. The hardware was expensive. The utilization was terrible. The latency was still high because you could not pack enough requests together to amortize the fixed costs.
Why Previous Solutions Failed
Teams tried several workarounds, all of which had fundamental limits.
Dynamic batching (collect requests for 50ms, batch them together) helped with throughput but hurt latency. Setting the collection window too long made responses feel sluggish. Setting it too short captured too few requests to matter.
Request-level memory pooling reduced fragmentation somewhat but still required contiguous allocation per request. The fragmentation problem moved to a different level but did not go away.
Sequence packing (bin-packing variable-length sequences into fixed-size batches) helped utilization but required complex scheduling logic and still wasted memory on sequences that ended early.
None of these addressed the root cause: the KV cache was being managed like a static array when it should have been managed like a dynamic memory allocator.
Historical Context - PagedAttention and the Berkeley Paper
The Operating Systems Insight
Ion Stoica, one of the coauthors on the vLLM paper, is the same researcher who co-created Apache Spark and Apache Mesos. His background is distributed systems and operating systems. When his group at Berkeley started looking at LLM serving, they recognized a familiar pattern.
The KV cache problem looked exactly like the virtual memory problem that operating systems researchers solved in the 1960s. Physical RAM is limited and fragmented. Programs want to believe they have a large, contiguous address space. The OS uses page tables to map virtual addresses to physical pages, which can be scattered anywhere in RAM. Programs see a clean abstraction. The OS manages the messy reality underneath.
The "aha moment" in the vLLM paper was applying this exact abstraction to attention. Instead of allocating a contiguous block of KV cache memory per sequence, use a paging system. Each sequence has a logical block table that maps logical KV blocks to physical KV blocks. Physical blocks can be anywhere in GPU memory. The attention computation needs to be modified to work with non-contiguous blocks, but that is a tractable engineering problem.
The vLLM system was released in June 2023, and the paper was presented at SOSP 2023 (the top systems conference). Within months, it had become the default serving solution for organizations running open-source LLMs at any serious scale. The throughput gains were not marginal - they were order-of-magnitude improvements on real production workloads.
Continuous Batching - The Second Insight
PagedAttention solved memory fragmentation. But it enabled a second crucial innovation: continuous batching (also called iteration-level scheduling).
In naive batching, you take a batch of requests, run them all until every sequence in the batch is finished, then take the next batch. The problem is that sequences finish at different times. A 50-token response and a 2,000-token response in the same batch means you are waiting for the 2,000-token response while the GPU slots that finished at token 50 sit idle.
Continuous batching fixes this by operating at the iteration level rather than the batch level. After every forward pass (every token generated), the scheduler checks: which sequences just finished? Remove them from the batch. Are there new requests waiting? Add them to the batch. The GPU is always running full batches. Sequences flow in and out continuously.
This was not a new idea. Orca (Yu et al. 2022) had proposed iteration-level scheduling before vLLM. But PagedAttention made it practical, because continuous batching requires variable-sized KV cache allocations. Without PagedAttention, you could not efficiently add and remove sequences mid-batch without memory fragmentation.
Core Concepts - How vLLM Works
PagedAttention - Memory Management for Attention
The standard attention mechanism computes:

Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V

During autoregressive generation, for each new token, you need the keys and values for all previous tokens in the sequence. Storing all those K and V tensors is the KV cache. For a 70B parameter model with 80 attention heads, 128 head dimension, and 80 layers, each token in the KV cache occupies:

2 × 80 layers × 80 heads × 128 dims × 2 bytes = 3.28 MB per token

(The factor of 2 is for K and V. The final 2 is for float16.)

For a sequence of 4,096 tokens, the KV cache for that single sequence is roughly 13.4 GB. This is why memory management matters.
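To sanity-check these figures, the arithmetic can be scripted directly. This is an illustrative calculation, not a vLLM API; the hyperparameters are the ones quoted above, plus the grouped-query variant used by LLaMA 3 70B.

# kv_cache_math.py - back-of-the-envelope KV cache sizing (illustrative only)
def kv_cache_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                             bytes_per_value: int = 2) -> int:
    """One K and one V vector per layer per KV head, at the given precision."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value

# Full multi-head attention, as in the worked example above (80 KV heads)
mha = kv_cache_bytes_per_token(n_layers=80, n_kv_heads=80, head_dim=128)
print(f"{mha / 1e6:.2f} MB per token")                  # ~3.28 MB
print(f"{mha * 4096 / 1e9:.1f} GB for 4,096 tokens")    # ~13.4 GB

# Grouped-query attention (8 KV heads, as in LLaMA 3 70B) shrinks the cache by 10x
gqa = kv_cache_bytes_per_token(n_layers=80, n_kv_heads=8, head_dim=128)
print(f"{gqa / 1e6:.2f} MB per token with GQA")         # ~0.33 MB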
PagedAttention partitions the KV cache into fixed-size blocks (default 16 tokens per block). Each sequence is assigned a sequence of logical blocks. The physical location of those blocks in GPU memory is managed by a block table - exactly like a page table in virtual memory.
When a new token is generated, it goes into the current block. When a block fills up, a new physical block is allocated from the free pool. When a sequence finishes, all its physical blocks are returned to the free pool immediately. No fragmentation. No wasted memory.
Logical KV Cache (sequence perspective):
[Block 0: tokens 0-15][Block 1: tokens 16-31][Block 2: tokens 32-47]...
Physical KV Cache (GPU memory):
Physical slot 47 -> Block 0 of seq A
Physical slot 12 -> Block 1 of seq A
Physical slot 83 -> Block 0 of seq B
Physical slot 3 -> Block 1 of seq B
Physical slot 19 -> Block 2 of seq A
...
The attention computation must be modified to work with this non-contiguous layout. vLLM uses custom CUDA kernels that take the block tables as input and gather the correct KV blocks during attention computation. The mathematical result is identical to standard attention - it is purely an implementation difference in how memory is accessed.
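The bookkeeping behind this layout fits in a few lines. The sketch below is a toy model of the idea, not vLLM's actual allocator; the BlockAllocator name and its methods are invented for illustration.

# block_table_toy.py - a toy PagedAttention-style block table (not vLLM's allocator)
class BlockAllocator:
    def __init__(self, num_physical_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_physical_blocks))   # pool of physical slots
        self.block_tables: dict[str, list[int]] = {}          # seq_id -> physical block ids

    def append_token(self, seq_id: str, tokens_so_far: int) -> None:
        table = self.block_tables.setdefault(seq_id, [])
        # A new physical block is needed only when the previous one is full.
        if tokens_so_far % self.block_size == 0:
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted - the scheduler would preempt a sequence")
            table.append(self.free_blocks.pop())

    def free_sequence(self, seq_id: str) -> None:
        # Every physical block returns to the pool immediately - no fragmentation.
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

alloc = BlockAllocator(num_physical_blocks=8)
for i in range(40):                        # a 40-token sequence needs ceil(40/16) = 3 blocks
    alloc.append_token("seq-A", i)
print(alloc.block_tables["seq-A"])         # physical slots can be anywhere, e.g. [7, 6, 5]
alloc.free_sequence("seq-A")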
Copy-on-Write for Parallel Sampling
One elegant property of PagedAttention is how it handles parallel sampling (generating multiple responses to the same prompt). In beam search or temperature sampling with multiple outputs, many sequences share a common prefix - the prompt itself.
With PagedAttention, the KV cache blocks for the shared prefix are marked as shared and reference-counted. All output sequences point to the same physical blocks for the prompt. When a sequence diverges (generates a token that differs from another), only then are the diverging blocks copied. This is copy-on-write, identical to how Linux handles process forking.
The memory savings for parallel sampling can be enormous. If you are generating 10 candidate responses to a 1,000-token prompt, instead of 10 copies of the 1,000-token KV cache, you have 1 shared copy and 10 small deltas.
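Reference counting is what makes this copy-on-write behaviour cheap. Again a toy sketch rather than vLLM internals: fork shares the parent's prefix blocks, and a block is copied only when a sequence that still shares it needs to write into it.

# copy_on_write_toy.py - reference-counted prefix sharing (illustrative, not vLLM internals)
from dataclasses import dataclass, field

@dataclass
class Block:
    block_id: int
    ref_count: int = 1

@dataclass
class Sequence:
    blocks: list[Block] = field(default_factory=list)

def fork(parent: Sequence, n_children: int) -> list[Sequence]:
    """Fork n output sequences from one prompt: children share the parent's physical blocks."""
    children = []
    for _ in range(n_children):
        for block in parent.blocks:
            block.ref_count += 1
        children.append(Sequence(blocks=list(parent.blocks)))
    return children

def write_last_block(seq: Sequence, new_block_id: int) -> None:
    """Copy-on-write: copy the block only if another sequence still references it."""
    last = seq.blocks[-1]
    if last.ref_count > 1:
        last.ref_count -= 1
        seq.blocks[-1] = Block(block_id=new_block_id)   # private copy for this sequence only

prompt = Sequence(blocks=[Block(0), Block(1)])    # a shared two-block prompt
candidates = fork(prompt, n_children=10)          # 10 samples, 1 physical copy of the prompt
write_last_block(candidates[0], new_block_id=42)  # the first sample diverges
print(candidates[0].blocks[-1].block_id, candidates[1].blocks[-1].block_id)   # 42 1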
Continuous Batching - The Scheduler
vLLM's scheduler runs at every decoding step. Its job is to decide which sequences to include in the next forward pass. The scheduling algorithm:
- Check the waiting queue for new prefill requests (first-pass processing of a new prompt)
- Check running sequences for completions - sequences that generated an EOS token or hit max length
- Evict sequences from GPU if memory pressure is high (optionally swapping to CPU or recomputing)
- Compose the next batch: a mix of new prefills and ongoing decodes
The key insight is that prefill (processing the prompt) and decode (generating tokens one at a time) have very different compute characteristics. Prefill is compute-bound (processing many tokens in parallel). Decode is memory-bandwidth-bound (one token at a time, loading all model weights for each step).
vLLM addresses this with chunked prefill - breaking large prefills into smaller chunks that can be interleaved with decode steps. This prevents a long prompt from blocking all decode requests while it is being processed.
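Stripped of KV-cache accounting and chunked prefill, the iteration-level loop can be simulated in a few lines of Python. This is a toy model of the scheduling idea, not vLLM's scheduler.

# continuous_batching_toy.py - iteration-level scheduling in miniature
import collections
from dataclasses import dataclass

@dataclass
class Seq:
    tokens_left: int          # tokens this request still wants to generate
    finished: bool = False

MAX_NUM_SEQS = 4
waiting = collections.deque(Seq(tokens_left=n) for n in (2, 5, 1, 3, 4))
running: list[Seq] = []

def model_step(batch: list[Seq]) -> None:
    """Stand-in for one forward pass: every running sequence emits one token."""
    for seq in batch:
        seq.tokens_left -= 1
        seq.finished = seq.tokens_left == 0

steps = 0
while waiting or running:
    # Backfill the batch from the waiting queue up to MAX_NUM_SEQS.
    while waiting and len(running) < MAX_NUM_SEQS:
        running.append(waiting.popleft())
    model_step(running)       # one iteration = one new token per running sequence
    steps += 1
    # Retire finished sequences immediately; their KV blocks would be freed here.
    running = [s for s in running if not s.finished]

# 5 forward passes here, versus 9 for naive batching (a batch of four held until the
# 5-token request finishes, then the last request alone).
print(f"served 5 requests in {steps} iterations with a max batch of {MAX_NUM_SEQS}")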
Tensor Parallelism - Distributing Across GPUs
When a model is too large for a single GPU (LLaMA 3 70B at float16 requires ~140GB, which exceeds a single A100 80GB), vLLM uses tensor parallelism to split the model across multiple GPUs.
In tensor parallelism, each weight matrix is split along one dimension across GPUs. For an attention layer with 80 heads and 2 GPUs, each GPU holds 40 heads. For a feed-forward layer with hidden dim 8,192 and intermediate dim 28,672, each GPU holds 14,336 rows of the weight matrix.
During the forward pass, each GPU processes its partition independently, then GPUs synchronize using all-reduce operations across NVLink (for intra-node) or InfiniBand (for inter-node). The communication overhead is manageable when GPUs are connected via NVLink, which provides 600 GB/s bidirectional bandwidth (vs ~25 GB/s for PCIe 4.0).
For a 70B model with batch size 32 and hidden dim 8,192, the activation tensors are small compared to the bandwidth, so tensor parallelism across 2-4 GPUs incurs only 5-15% overhead.
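The split itself is plain linear algebra, which a NumPy sketch makes concrete. This is illustrative only: two weight shards stand in for two GPUs, and the final sum plays the role of the all-reduce.

# tensor_parallel_toy.py - column/row-split matmul with an all-reduce, in NumPy
import numpy as np

rng = np.random.default_rng(0)
hidden, inter, batch = 8, 32, 4                  # tiny stand-ins for 8,192 / 28,672 dims
x = rng.standard_normal((batch, hidden))
W1 = rng.standard_normal((hidden, inter))        # up-projection
W2 = rng.standard_normal((inter, hidden))        # down-projection

# Single-GPU reference: y = relu(x @ W1) @ W2
reference = np.maximum(x @ W1, 0) @ W2

# Tensor parallel across 2 "GPUs": W1 split by columns, W2 split by rows
W1_shards = np.split(W1, 2, axis=1)
W2_shards = np.split(W2, 2, axis=0)

# Each GPU computes its partial output independently...
partials = [np.maximum(x @ w1, 0) @ w2 for w1, w2 in zip(W1_shards, W2_shards)]

# ...and the all-reduce (a sum across GPUs) recovers the full result
y = partials[0] + partials[1]
assert np.allclose(y, reference)
print("tensor-parallel result matches the single-GPU result")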
Architecture Deep Dive - vLLM Components
AsyncLLMEngine
The AsyncLLMEngine is vLLM's core request handler. It exposes an async Python API that accepts PromptInputs (text or token IDs), SamplingParams (temperature, top_p, max_tokens, etc.), and a request ID. Internally it manages an event loop that continuously runs the scheduler and feeds batches to the model workers.
Key properties:
- Non-blocking: requests are queued and results streamed via async generators
- Request cancellation: clients can cancel in-flight requests, freeing KV cache immediately
- Metrics collection: tracks queue length, batch sizes, token throughput per step
KV Cache Manager
The KVCacheManager maintains the block allocator. It tracks:
- Free physical blocks (the pool available for new allocations)
- Per-sequence block tables (the mapping from logical to physical)
- Reference counts for shared blocks (prefix caching, parallel sampling)
- GPU memory watermarks (when to start evicting sequences)
Block eviction policy: when memory pressure hits the configured threshold, sequences are preempted and their blocks freed. Preempted sequences can be either recomputed (prompt is re-processed on the next opportunity) or swapped to CPU memory. Swap is slower but avoids losing progress on long generations.
Speculative Decoding Integration
vLLM supports speculative decoding natively. A small draft model generates k candidate tokens speculatively, then the large target model verifies them in a single forward pass. If all tokens are accepted, you get k+1 tokens at the cost of roughly one target-model step. If some are rejected, you fall back gracefully to the target model's own prediction at the first rejected position.
The speedup depends on the per-token acceptance rate α. Under the standard assumption that each draft token is accepted independently with probability α, the expected number of tokens per verification step is (1 − α^(k+1)) / (1 − α).
For code generation tasks with high acceptance rates (70-85%), speculative decoding with a small draft model can deliver a 2-3x throughput improvement.
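The formula is easy to tabulate; a minimal helper (not part of vLLM) shows how acceptance rate and draft length interact:

# speculative_speedup.py - expected tokens per target-model step (illustrative helper)
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """(1 - alpha^(k+1)) / (1 - alpha): expected tokens per verification pass,
    assuming each of the k draft tokens is accepted independently with probability alpha."""
    if alpha >= 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)

for alpha in (0.5, 0.7, 0.75, 0.85):
    print(f"alpha={alpha:.2f}, k=5 -> {expected_tokens_per_step(alpha, k=5):.2f} tokens/step")
# alpha=0.75, k=5 gives ~3.3 - the figure used in the deployment example later in this lesson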
Deploying vLLM - Step by Step
Installation
# Install vLLM with CUDA 12.1 support
pip install vllm
# Or install from source for latest features
git clone https://github.com/vllm-project/vllm
cd vllm
pip install -e .
# Verify GPU setup
python -c "import torch; print(torch.cuda.get_device_name(0))"
python -c "import vllm; print(vllm.__version__)"
Starting the OpenAI-Compatible Server
vLLM ships with a built-in server that is drop-in compatible with the OpenAI API. This means any code that calls openai.chat.completions.create() can point at your vLLM server with no changes.
# Basic deployment - single GPU
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3-8B-Instruct \
--port 8000
# LLaMA 3 70B on 2x A100 80GB
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3-70B-Instruct \
--tensor-parallel-size 2 \
--port 8000 \
--gpu-memory-utilization 0.90 \
--max-model-len 8192 \
--max-num-seqs 256
# With quantization (AWQ) - fits 70B on single A100
python -m vllm.entrypoints.openai.api_server \
--model casperhansen/llama-3-70b-instruct-awq \
--quantization awq \
--port 8000 \
--gpu-memory-utilization 0.90
Docker Deployment
For production, always containerize. vLLM publishes official Docker images.
# Dockerfile
FROM vllm/vllm-openai:latest
# Set model cache directory
ENV HF_HOME=/model-cache
ENV TRANSFORMERS_CACHE=/model-cache
# Pre-bake model weights into the image (optional, avoids runtime download)
# RUN python3 -c "from huggingface_hub import snapshot_download; snapshot_download('meta-llama/Llama-3-8B-Instruct')"
EXPOSE 8000
# The base image's ENTRYPOINT already launches the OpenAI-compatible server,
# so CMD only supplies its arguments
CMD ["--model", "meta-llama/Llama-3-8B-Instruct", "--host", "0.0.0.0", "--port", "8000"]
# docker-compose.yml
version: "3.8"
services:
vllm:
image: vllm/vllm-openai:latest
runtime: nvidia
environment:
- NVIDIA_VISIBLE_DEVICES=0,1
- HF_TOKEN=${HF_TOKEN}
ports:
- "8000:8000"
volumes:
- /mnt/model-cache:/root/.cache/huggingface
# The image's entrypoint already runs the API server, so command supplies only its flags
command: >
--model meta-llama/Llama-3-70B-Instruct
--tensor-parallel-size 2
--gpu-memory-utilization 0.90
--max-model-len 8192
--max-num-seqs 256
--host 0.0.0.0
--port 8000
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 2
capabilities: [gpu]
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
interval: 30s
timeout: 10s
retries: 3
Client Usage - Python
from openai import OpenAI
# Point at your vLLM server
client = OpenAI(
base_url="http://localhost:8000/v1",
api_key="not-needed", # vLLM doesn't require auth by default
)
# Chat completion - identical to OpenAI SDK usage
response = client.chat.completions.create(
model="meta-llama/Llama-3-70B-Instruct",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain transformer attention in 3 sentences."},
],
max_tokens=512,
temperature=0.7,
)
print(response.choices[0].message.content)
# Streaming
stream = client.chat.completions.create(
model="meta-llama/Llama-3-70B-Instruct",
messages=[{"role": "user", "content": "Write a function to merge two sorted arrays."}],
max_tokens=1024,
stream=True,
)
for chunk in stream:
if chunk.choices[0].delta.content:
print(chunk.choices[0].delta.content, end="", flush=True)
Direct Python API (for embedding in applications)
from vllm import LLM, SamplingParams
# Initialize - downloads model if not cached
llm = LLM(
model="meta-llama/Llama-3-8B-Instruct",
tensor_parallel_size=1,
gpu_memory_utilization=0.90,
max_model_len=4096,
)
sampling_params = SamplingParams(
temperature=0.8,
top_p=0.95,
max_tokens=512,
stop=["</s>", "<|eot_id|>"],
)
prompts = [
"What is the capital of France?",
"Explain gradient descent in one paragraph.",
"Write a Python function to check if a number is prime.",
]
# Batch inference - processes all prompts together
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
print(f"Prompt: {output.prompt!r}")
print(f"Generated: {output.outputs[0].text!r}")
print("---")
Configuration Reference - Critical Parameters
Memory Parameters
--gpu-memory-utilization (default: 0.90)
The fraction of total GPU memory vLLM is allowed to use - model weights, activation workspace, and KV cache combined. Whatever remains of that budget after weights are loaded and activations are profiled becomes the KV cache. At 0.90 on an 80GB A100 serving LLaMA 3 70B across 2 GPUs (~70GB of weights per GPU), the budget is ~72GB per GPU, leaving only a couple of GB per GPU for KV cache - which is why the parameters below matter so much for large models.
Setting this too high causes CUDA OOM errors when the OS or other processes need GPU memory. Setting it too low reduces the number of concurrent sequences you can serve.
--max-model-len (default: model maximum)
The maximum total sequence length (prompt + completion). Reducing this is one of the most effective ways to increase concurrency. If your use case never needs sequences longer than 4,096 tokens, set this to 4,096 even if the model supports 128K. vLLM must be able to fit at least one max-length sequence in the KV cache, and the scheduler budgets admission against this value, so a smaller limit directly increases how many sequences can run concurrently.
--block-size (default: 16)
Number of tokens per KV cache block. Larger blocks reduce block table overhead but increase internal fragmentation. The default of 16 is well-calibrated for most workloads. Change only if you have measured a specific bottleneck.
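To see how these memory knobs interact, a rough capacity estimate helps. This is back-of-the-envelope arithmetic only - real capacity is lower because activation workspace and CUDA overhead also come out of the budget - and the per-token figure assumes LLaMA 3 70B's grouped-query attention, counted per GPU (the other GPU holds the matching shard).

# kv_capacity_estimate.py - rough KV cache capacity under vLLM-style memory budgeting
import math

def kv_capacity(gpu_mem_gb: float, gpu_mem_util: float, weights_gb: float,
                kv_bytes_per_token: float, max_model_len: int, block_size: int = 16):
    budget_bytes = gpu_mem_gb * 1e9 * gpu_mem_util - weights_gb * 1e9   # left over for KV cache
    total_tokens = budget_bytes / kv_bytes_per_token
    blocks_per_full_seq = math.ceil(max_model_len / block_size)
    full_len_seqs = total_tokens / max_model_len
    return int(total_tokens), blocks_per_full_seq, full_len_seqs

# LLaMA 3 70B across 2 GPUs: ~70GB of weights and ~0.16 MB of KV cache per token per GPU (GQA)
tokens, blocks, seqs = kv_capacity(gpu_mem_gb=80, gpu_mem_util=0.90, weights_gb=70,
                                   kv_bytes_per_token=0.16e6, max_model_len=8192)
print(f"~{tokens:,} cacheable tokens, {blocks} blocks per full-length sequence, "
      f"~{seqs:.1f} concurrent 8K-token sequences")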
Performance Parameters
--max-num-seqs (default: 256)
Maximum number of sequences in the running batch at any time. Higher values increase throughput but also increase memory pressure and per-step latency. For latency-sensitive deployments serving interactive users, keep this lower (64-128). For batch processing jobs where throughput matters more than latency, go higher (256-512).
--max-num-batched-tokens (default: max_model_len)
Maximum total tokens across all sequences in a single forward pass. This is the primary lever for controlling memory-bandwidth vs compute tradeoff. Higher values mean larger batches, higher GPU utilization, and higher throughput - at the cost of higher per-request latency.
--enable-chunked-prefill
When enabled, long prefill requests are broken into chunks and interleaved with decode steps. This prevents a single long-context request from blocking all ongoing decode requests. Essential for deployments that mix long-context and short-context requests.
Quantization Parameters
--quantization awq
AWQ (Activation-aware Weight Quantization) quantizes weights to INT4 with minimal accuracy loss. Reduces model memory by ~4x. A quantized LLaMA 3 70B fits on a single A100 80GB. Requires the model to be pre-quantized and available in AWQ format on HuggingFace.
--kv-cache-dtype fp8_e5m2
Quantizes the KV cache itself to FP8. Reduces KV cache memory by 2x compared to FP16, allowing more concurrent sequences. Most effective on H100-class GPUs, which have native FP8 support.
Performance Tuning - Throughput vs Latency
The core tension in LLM serving is throughput vs latency. They move in opposite directions.
Maximizing throughput: large batches, high --max-num-seqs, high --max-num-batched-tokens. Every GPU cycle is doing useful work. Individual requests wait longer because they are queued behind others.
Minimizing latency: smaller batches and headroom. Keep --max-num-seqs and --max-num-batched-tokens moderate so each decoding step stays fast, keep actual concurrency well below capacity so requests never queue, and scale out across instances rather than packing more work onto one. Individual requests see fast responses, but GPU utilization may be low.
# Load testing with locust
# locustfile.py
from locust import HttpUser, task, between
import json
import random
PROMPTS = [
"Explain the difference between supervised and unsupervised learning.",
"Write a Python function to compute Fibonacci numbers iteratively.",
"What are the main components of a transformer architecture?",
"Summarize the key innovations in GPT-3 compared to GPT-2.",
]
class VLLMUser(HttpUser):
wait_time = between(0.1, 1.0)
@task
def generate(self):
payload = {
"model": "meta-llama/Llama-3-8B-Instruct",
"messages": [
{"role": "user", "content": random.choice(PROMPTS)}
],
"max_tokens": 256,
"temperature": 0.7,
}
with self.client.post(
"/v1/chat/completions",
json=payload,
catch_response=True,
) as response:
if response.status_code == 200:
data = response.json()
# Check time to first token via custom timing
response.success()
else:
response.failure(f"HTTP {response.status_code}")
# Run load test - 50 concurrent users, spawned at 5 users/second
locust -f locustfile.py \
--host http://localhost:8000 \
--users 50 \
--spawn-rate 5 \
--run-time 120s \
--headless
Benchmark with vLLM's Built-in Benchmarks
# Download benchmark datasets
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
# Run throughput benchmark (the benchmarks/ scripts live in the vLLM source checkout)
python benchmarks/benchmark_throughput.py \
--model meta-llama/Llama-3-8B-Instruct \
--dataset ShareGPT_V3_unfiltered_cleaned_split.json \
--num-prompts 1000 \
--output-len 256
# Run latency benchmark (single-request latency, no concurrent load)
python benchmarks/benchmark_latency.py \
--model meta-llama/Llama-3-8B-Instruct \
--input-len 512 \
--output-len 256 \
--num-iters 100
Monitoring with Prometheus
vLLM's OpenAI-compatible server exposes Prometheus metrics at /metrics on the serving port - no extra flag is required.
# Start the server as usual...
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3-8B-Instruct \
--port 8000
# ...then scrape metrics from the same port
curl -s http://localhost:8000/metrics | grep '^vllm:'
Key metrics to track:
# prometheus_scrape_config.py - vLLM key metrics
CRITICAL_METRICS = {
# Saturation
"vllm:num_requests_running": "Active requests in the batch",
"vllm:num_requests_waiting": "Requests queued (high = overloaded)",
"vllm:gpu_cache_usage_perc": "KV cache utilization (>95% = OOM risk)",
# Latency
"vllm:time_to_first_token_seconds": "TTFT histogram",
"vllm:time_per_output_token_seconds": "Inter-token latency",
"vllm:e2e_request_latency_seconds": "End-to-end latency",
# Throughput
"vllm:request_success_total": "Completed requests",
"vllm:prompt_tokens_total": "Total prompt tokens processed",
"vllm:generation_tokens_total": "Total tokens generated",
}
# prometheus.yml - scrape config
scrape_configs:
- job_name: 'vllm'
scrape_interval: 10s
static_configs:
- targets: ['localhost:8000']
metric_relabel_configs:
- source_labels: [__name__]
regex: 'vllm:.*'
action: keep
Grafana Dashboard Panels
For each deployment, track these four panels:
- Requests in flight - vllm:num_requests_running and vllm:num_requests_waiting. The ratio tells you if you are overloaded.
- KV cache utilization - vllm:gpu_cache_usage_perc. Alert at 90%. At 100%, requests start getting preempted and recomputed.
- Time to first token (p50/p95/p99) - the latency users feel while waiting for the response to start appearing.
- Tokens per second - rate(vllm:generation_tokens_total[1m]). Your throughput gauge.
Production Deployment for LLaMA 3 70B on 2x A100 80GB
Hardware Setup
Two A100 80GB GPUs connected via NVLink. LLaMA 3 70B at float16 is ~140GB of weights, split evenly by tensor parallelism to ~70GB per GPU. With --gpu-memory-utilization 0.90, vLLM caps its total usage at ~72GB per GPU, so roughly 2GB per GPU (about 4GB across both) remains for the KV cache and activation workspace.
LLaMA 3 70B uses grouped-query attention with 8 KV heads, so each token costs roughly 0.33 MB of KV cache (2 × 80 layers × 8 KV heads × 128 dims × 2 bytes), split across the two GPUs. A ~4GB KV cache therefore holds on the order of 12,000 tokens - only a handful of concurrent 2,048-token sequences. The configuration works, but it is tight: for real concurrency at this scale, raise --gpu-memory-utilization toward 0.95, lower --max-model-len, or serve an AWQ-quantized checkpoint to free tens of GB for KV cache.
# Production launch script
#!/bin/bash
set -e
MODEL="meta-llama/Llama-3-70B-Instruct"
TENSOR_PARALLEL=2
GPU_UTIL=0.90
MAX_LEN=8192
MAX_SEQS=256
PORT=8000
echo "Starting vLLM server for ${MODEL}..."
echo "Tensor parallel: ${TENSOR_PARALLEL}"
echo "GPU memory utilization: ${GPU_UTIL}"
python -m vllm.entrypoints.openai.api_server \
--model "${MODEL}" \
--tensor-parallel-size "${TENSOR_PARALLEL}" \
--gpu-memory-utilization "${GPU_UTIL}" \
--max-model-len "${MAX_LEN}" \
--max-num-seqs "${MAX_SEQS}" \
--host 0.0.0.0 \
--port "${PORT}" \
--enable-chunked-prefill \
--served-model-name "llama-3-70b" \
--chat-template /app/llama3-chat-template.jinja 2>&1 | tee /var/log/vllm.log
Chat Template for LLaMA 3
LLaMA 3 uses a specific chat format with special tokens. You must configure the correct template or responses will be malformed.
{# llama3-chat-template.jinja - whitespace-trimmed so no stray newlines leak into the prompt #}
{%- set loop_messages = messages -%}
{%- for message in loop_messages -%}
{%- set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n' + message['content'] | trim + '<|eot_id|>' -%}
{%- if loop.first -%}
{%- set content = bos_token + content -%}
{%- endif -%}
{{- content -}}
{%- endfor -%}
{{- '<|start_header_id|>assistant<|end_header_id|>\n\n' -}}
Speculative Decoding with vLLM
# Enable speculative decoding with a draft model
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3-70B-Instruct \
--speculative-model meta-llama/Llama-3-8B-Instruct \
--num-speculative-tokens 5 \
--tensor-parallel-size 2 \
--speculative-draft-tensor-parallel-size 1
The draft model (8B) runs on one GPU, generating 5 candidate tokens per step. The target model (70B) verifies them in one forward pass. With a 75% per-token acceptance rate, the expected yield is (1 − 0.75^6) / (1 − 0.75) ≈ 3.3 tokens per expensive 70B forward pass, versus exactly 1 without speculation.
Speculative decoding helps most for:
- Code generation (high acceptance rate, repetitive patterns)
- Structured outputs (JSON, templates)
- Retrieval-augmented generation (model often copies from context)
It helps least for:
- Creative writing (high diversity, low acceptance)
- Instruction following on diverse tasks
- Short responses (overhead of draft model not amortized)
Production Engineering Notes
Graceful Startup and Health Checks
vLLM takes 2-5 minutes to load a 70B model. Your load balancer must not route traffic until the model is warm. Always implement a proper readiness probe.
# health_check.py
import argparse
import sys
import time

import httpx

def wait_for_ready(url: str, timeout: int = 300) -> bool:
    start = time.time()
    while time.time() - start < timeout:
        try:
            response = httpx.get(f"{url}/health", timeout=5)
            if response.status_code == 200:
                return True
        except (httpx.ConnectError, httpx.TimeoutException):
            pass
        time.sleep(5)
    return False

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--url", default="http://localhost:8000")
    parser.add_argument("--timeout", type=int, default=300)
    args = parser.parse_args()
    if not wait_for_ready(args.url, args.timeout):
        print(f"vLLM at {args.url} failed to become ready within {args.timeout} seconds")
        sys.exit(1)
    print("vLLM is ready")
Memory Pressure and Preemption Handling
When the KV cache is exhausted, vLLM preempts sequences. In the default configuration, preempted sequences are recomputed from scratch when capacity frees up. This is invisible to the client but doubles the compute cost for those requests.
Configure alerts when vllm:gpu_cache_usage_perc exceeds 85%. At that level, start scaling horizontally before hitting preemption. Preemption is not a failure state, but chronic preemption indicates chronic underprovisioning.
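A minimal watcher for this threshold can poll /metrics directly. This is a sketch, assuming the server runs on localhost:8000 and that vllm:gpu_cache_usage_perc is exposed as a fraction in [0, 1].

# kv_pressure_watch.py - poll /metrics and warn before preemption kicks in (sketch)
import re
import time

import httpx

METRICS_URL = "http://localhost:8000/metrics"
THRESHOLD = 0.85   # warn at 85% KV cache usage

def kv_cache_usage() -> float | None:
    """Parse the gpu_cache_usage_perc gauge out of the Prometheus text format."""
    text = httpx.get(METRICS_URL, timeout=5).text
    match = re.search(r"^vllm:gpu_cache_usage_perc(?:\{[^}]*\})?\s+([0-9.eE+-]+)",
                      text, flags=re.MULTILINE)
    return float(match.group(1)) if match else None

while True:
    usage = kv_cache_usage()
    if usage is not None and usage > THRESHOLD:
        print(f"WARNING: KV cache at {usage:.0%} - scale out before preemption starts")
    time.sleep(30)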
Handling Model Updates Without Downtime
# Blue-green deployment pattern with nginx
# Run two vLLM instances on different ports
# Start new version on port 8001 (note the updated model)
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-70B-Instruct \
--port 8001 \
--tensor-parallel-size 2
# Wait for health check
python health_check.py --url http://localhost:8001
# Shift nginx upstream from 8000 to 8001
nginx -s reload
# Drain old instance (wait for in-flight requests to complete)
sleep 60
kill $(cat /var/run/vllm-old.pid)
Multi-Node Tensor Parallelism (Ray)
For models that do not fit on a single node, vLLM integrates with Ray for multi-node tensor parallelism:
# On head node
ray start --head --port=6379
# On worker nodes
ray start --address=<head-node-ip>:6379
# Launch vLLM with Ray
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-405B-Instruct \
--tensor-parallel-size 8 \
--pipeline-parallel-size 2 \
--distributed-executor-backend ray
Common Mistakes
:::danger CUDA OOM from setting gpu-memory-utilization too high
Setting --gpu-memory-utilization 0.99 is tempting to maximize KV cache. Do not do it. The GPU needs headroom for PyTorch memory management, CUDA streams, and temporary computation buffers. At 0.99, you will hit sporadic CUDA OOM errors under peak load that are hard to reproduce and hard to diagnose. The safe maximum is 0.92-0.95. Test under load before setting above 0.90.
:::
:::danger Forgetting to set --max-model-len and running out of memory
Without --max-model-len, vLLM assumes the model's full context window. LLaMA 3.1 supports 128K tokens, and vLLM must be able to fit at least one full-length sequence in the KV cache, so startup can fail outright on long-context models - and even when it starts, the scheduler admits far fewer concurrent sequences. Always explicitly set --max-model-len to the longest sequence your workload actually needs.
:::
:::warning Tensor parallel size must match GPU count exactly
If you set --tensor-parallel-size 2 but only have 1 GPU visible (due to CUDA_VISIBLE_DEVICES), vLLM will crash at startup. Always verify nvidia-smi shows the expected GPUs and that CUDA_VISIBLE_DEVICES is set correctly before launching.
:::
:::warning Not using the correct chat template
LLaMA 3, Mistral, and other models each have their own chat formatting conventions. If you use the wrong template (or no template), the model will see malformed input and produce low-quality outputs. Always verify the chat template matches the model's training format. The vLLM repo includes templates for common models in examples/.
:::
:::warning Running benchmarks on a cold model
The first few requests after model load are slower due to CUDA graph compilation and memory initialization. Always warm up the model before benchmarking: send 10-20 requests first, then measure. Cold-start latency numbers are not representative of steady-state performance.
:::
Interview Q&A
Q1: What is PagedAttention and why does it improve throughput?
PagedAttention reimplements the KV cache using virtual memory principles. Instead of allocating a contiguous block of GPU memory for each sequence (which wastes memory when sequences complete early and creates fragmentation), PagedAttention uses fixed-size blocks (typically 16 tokens) and a block table that maps a sequence's logical block positions to physical block positions in GPU memory. Physical blocks can be scattered across GPU memory just like OS pages are scattered in RAM. When a sequence finishes, its physical blocks are immediately returned to the free pool. This essentially eliminates external fragmentation (inability to allocate despite scattered free memory) and bounds internal fragmentation (wasted memory within an allocated block) to the last, partially filled block of each sequence. The result is that vLLM can run 3-24x more concurrent sequences on the same hardware compared to naive approaches, which directly multiplies throughput.
Q2: How does continuous batching differ from naive batching, and why does it matter?
Naive batching takes a group of requests, processes them all together until every sequence is finished, then takes the next group. The problem is that sequences have different lengths - a 50-token response and a 2,000-token response in the same batch means waiting for the long response while GPU slots for finished sequences sit idle.
Continuous batching operates at the iteration level. After every single forward pass (every token generated), the scheduler removes completed sequences and adds new waiting requests. The batch is always full. The GPU is always doing useful work. For a typical production workload where response lengths vary significantly, continuous batching delivers 5-10x higher throughput than naive batching on the same hardware.
Q3: How do you tune vLLM for a latency-sensitive customer-facing application vs a batch processing pipeline?
For latency-sensitive interactive use (target p95 TTFT under 500ms):
- Keep --max-num-seqs lower (64-128) to reduce queuing time
- Enable streaming so users see output before completion
- Use a faster/smaller model if possible
- Consider speculative decoding for specific task types
- Deploy more instances for horizontal scaling rather than cramming more concurrency into one
For batch processing (maximize token throughput, latency flexible):
- Maximize --max-num-seqs (256-512)
- Maximize --max-num-batched-tokens
- Enable chunked prefill
- Use async clients that submit all jobs upfront and collect results
Q4: LLaMA 3 70B at float16 requires 140GB. You have two A100 80GB GPUs (160GB combined). Will it fit and how?
Yes, with tensor parallelism. The model weights are split across both GPUs: each GPU holds half the weight matrices. During the forward pass, the computation is split accordingly - each GPU processes its half and the results are combined with all-reduce operations (sum across GPUs) at specific points in the computation graph (after attention output projection and after FFN output projection).
The memory accounting per GPU is roughly: 70GB for model weights (half of 140GB), with --gpu-memory-utilization 0.90 capping vLLM's total footprint at ~72GB of the 80GB card. The ~2GB left per GPU covers the KV cache plus activation workspace, so the deployment fits - but with limited KV-cache headroom, which is why lowering --max-model-len or serving a quantized checkpoint makes such a difference at this scale.
The latency cost of tensor parallelism is the all-reduce communication overhead. On A100 GPUs connected via NVLink (600 GB/s bidirectional), this is typically 5-15% overhead compared to a theoretical single-GPU deployment.
Q5: Your vLLM deployment is experiencing high num_requests_waiting during peak hours. What are your options?
Diagnose first. High waiting queue means either the model is too slow to generate (compute-bound) or the KV cache is exhausted (memory-bound). Check vllm:gpu_cache_usage_perc - if it is near 100%, the bottleneck is memory. If it is low (below 80%), the bottleneck is compute.
For memory-bound (high KV cache usage):
- Reduce --max-model-len if requests are shorter than the limit
- Apply quantization (AWQ/GPTQ) to free GPU memory for more KV cache
- Add more GPU instances and load balance across them
- Use KV cache quantization (--kv-cache-dtype fp8_e5m2) on H100s
For compute-bound (low KV cache usage but high latency):
- Apply speculative decoding if the task is amenable
- Use a smaller/faster model if quality requirements allow
- Add more GPU instances
- Enable tensor parallelism if not already using it
For either bottleneck, the correct long-term solution is horizontal scaling: multiple vLLM instances behind a load balancer. Lesson 03 in this module covers Kubernetes autoscaling for exactly this scenario.
Q6: What is speculative decoding and what workloads benefit most from it?
Speculative decoding uses a small draft model to generate k candidate tokens speculatively, then the target model verifies all k in a single forward pass. If the target model accepts every candidate (its probability distribution agrees with the draft's choices), you get k+1 tokens at roughly the cost of one target-model step. If some are rejected, you fall back to the target model's own output at the first rejected position.
The expected number of tokens per verification step is (1 − α^(k+1)) / (1 − α), where α is the per-token acceptance rate. For α = 0.75 and k = 5, that is roughly 3.3 tokens per target-model forward pass.
High-benefit workloads: code completion (the model often continues predictably), document summarization (the model tends to copy from context), structured data extraction (low-entropy outputs like JSON keys). Low-benefit workloads: open-ended creative tasks (high entropy, low acceptance rate), diverse instruction following (the draft model diverges often).
Summary
vLLM is the production standard for serving open-source LLMs because it solved the two fundamental problems that made earlier approaches unscalable: memory fragmentation via PagedAttention, and GPU idling via continuous batching. The combination delivers 3-24x higher throughput on identical hardware.
For deployment: start with the OpenAI-compatible server, use --tensor-parallel-size for multi-GPU, set --gpu-memory-utilization 0.90 and always explicitly set --max-model-len. Monitor num_requests_waiting and gpu_cache_usage_perc as your primary health signals.
The next lesson covers TGI and alternative serving frameworks - when HuggingFace's TGI is a better choice than vLLM, and how tools like LiteLLM let you build a unified serving layer across multiple backends.
