
Monitoring LLM Services

The 3 AM Alert

It is 3:17 AM. Your on-call phone lights up. The alert says: "p99 latency exceeded SLA threshold - 45s." You stare at it. The SLA threshold is 10 seconds. You open your dashboard and see... a single graph: CPU utilization at 22%. Everything looks fine. Nothing looks fine. The alert keeps firing.

This is the nightmare that plays out for every team that deploys LLMs without purpose-built observability. The problem is not that something went wrong - something always goes wrong. The problem is that you built your monitoring stack for traditional services, and LLMs are not traditional services. They have completely different failure modes. A traditional API either responds or it times out. An LLM can start streaming tokens immediately (making your load balancer think everything is fine) and then stall for 40 seconds mid-stream, invisible to every HTTP-level health check you have.

On that particular night, the root cause turned out to be KV cache exhaustion. The vLLM instance had hit 98% KV cache utilization. New requests were being accepted, streaming was starting, but the engine was swapping cached attention states to CPU memory (a process called "block swapping") which introduced catastrophic latency in the middle of generation. The fix took 4 minutes. The investigation took 2 hours - entirely because the right metric (vllm:gpu_cache_usage_perc) was not being collected.

That incident changed how the team thought about observability. Traditional metrics - CPU, memory, request rate, HTTP status codes - are necessary but nowhere near sufficient for LLM services. You need a completely different mental model. You need to understand the internal state of the inference engine: how full is the KV cache, how long are sequences waiting in the queue, what is the token generation rate right now, and is the model producing outputs that are consistent with what you expect. This lesson builds that complete observability stack from scratch.

We will cover infrastructure metrics that map to GPU-level resource consumption, LLM-specific latency decomposition (TTFT vs ITL vs E2E), the full set of vLLM Prometheus metrics and what each one tells you, quality monitoring for detecting output drift, distributed tracing with OpenTelemetry, alerting rules that actually fire at the right time, and a complete Grafana dashboard setup you can deploy today.


Why This Exists - The Gap Between APM and LLM Reality

Application Performance Monitoring (APM) tools like Datadog, New Relic, and Prometheus were built around a simple model: a request comes in, work happens, a response goes out. You measure latency (time from request to response), throughput (requests per second), and error rate (HTTP 4xx/5xx). This works beautifully for REST APIs, databases, and microservices.

LLMs break every assumption in this model.

The streaming problem. When a user sends a prompt to an LLM, the response is not a single atomic event. The model generates one token at a time, streaming them back as they are produced. HTTP 200 is returned the moment the first token arrives - which might be 500ms. But the full response might take 30 more seconds. From a traditional APM perspective, the request succeeded instantly. From a user perspective, they are staring at a spinner for 30 seconds after the first word appears.

The batching problem. LLMs process requests in batches for GPU efficiency. A request that arrives when the batch is full waits in a queue. Queue depth is a critical metric that standard APM tools never thought to expose. A traditional web server queue backs up when CPU is saturated. An LLM queue backs up when KV cache is full - a concept that has no analog in traditional computing.

The quality problem. A traditional API either returns the correct data or throws an error. You can measure correctness. An LLM always returns something - but what it returns might be wrong, hallucinated, off-topic, or subtly degraded compared to what the model returned last week. Traditional error rate monitoring tells you nothing about this. You need quality metrics.

The economics problem. Every token costs compute. A user who sends a 10,000-token prompt costs 10x more to serve than a user sending a 1,000-token prompt, but both look identical to HTTP-level monitoring. Without token-level visibility, you cannot attribute costs, detect abuse, or optimize serving economics.

The solution is a monitoring stack built specifically for the LLM inference lifecycle - one that understands tokens, sequences, KV cache, batching, and streaming as first-class concepts.


Historical Context - From Model Cards to Runtime Observability

When the first LLM APIs went public - GPT-3 in 2020 via OpenAI's beta - there was essentially no production observability thinking at all. You called an API, you got text back. The internal state of the model was completely opaque. Even the concept of "latency decomposition" for LLMs did not exist publicly.

The first serious thinking about LLM-specific metrics emerged from the vLLM team at UC Berkeley in 2023. When Woosuk Kwon, Zhuohan Li, and the PagedAttention paper team open-sourced vLLM, they also introduced a Prometheus metrics endpoint that tracked queue depth, cache utilization, and per-request token statistics. This was the first time the infrastructure community had a concrete vocabulary for LLM observability.

The terms TTFT (Time to First Token) and TPOT (Time Per Output Token) - also called ITL, Inter-Token Latency - became standard around mid-2023 when Anyscale, Replicate, and other serving platforms started publishing benchmarks using this decomposition. The insight was simple but important: users experience latency in two phases. The first token arriving tells the user "something is happening." Subsequent tokens create the streaming experience. These two phases have completely different acceptable thresholds and different root causes when they degrade.

OpenTelemetry's LLM Observability working group formally started in late 2023, producing the first draft semantic conventions for LLM spans in early 2024. This gave the ecosystem a standardized way to instrument LLM calls across the entire stack - from the application layer calling the API down to the inference engine executing the model.

By 2024-2025, monitoring LLM services had become a distinct engineering discipline, with dedicated tooling from Langfuse, Helicone, Arize Phoenix, and others specifically targeting the quality and behavioral monitoring layer that infrastructure tools could not address.


Core Concepts

Latency Decomposition - The Three Numbers That Matter

Understanding LLM latency requires decomposing the total request time into distinct phases. Each phase has different causes and different solutions when it degrades.

\text{E2E Latency} = \text{Queue Wait} + \text{Prefill Time} + \text{Decode Time}

Where:

  • Queue Wait - time the request spends waiting for the engine to accept it
  • Prefill Time - time to process the input prompt and generate the first token (this IS the TTFT when queue wait is excluded)
  • Decode Time - time to generate all remaining tokens

The TTFT as seen by the client is:

\text{TTFT}_\text{client} = \text{Queue Wait} + \text{Prefill Time}

The Inter-Token Latency (ITL) measures the generation speed during decode:

\text{ITL} = \frac{\text{Decode Time}}{\text{Output Tokens} - 1}

A healthy LLM service has:

  • TTFT under 500ms for interactive use cases
  • ITL under 50ms per token (which translates to approximately 20 tokens/second, fast enough to appear instantaneous to a human reader)

When TTFT is high but ITL is normal, the bottleneck is prefill - usually caused by very long input prompts or KV cache pressure. When ITL is high but TTFT is normal, the bottleneck is decode - usually caused by GPU memory bandwidth saturation or CPU swapping of KV blocks.

Request Timeline:

|--- Queue Wait ---|--- Prefill ---|-------- Decode (token by token) --------|
^                                  ^      ^      ^      ^              ^
Request arrives                    T1     T2     T3     T4     ...     Tn
                                   |
                                   TTFT measured here

TTFT = time from request arrival to first token (T1)
ITL  = average gap between consecutive tokens Ti and Ti+1
E2E  = total time from request arrival to last token
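To make the decomposition concrete, here is a minimal Python sketch (not a vLLM API - request_start and token_times are assumed to come from your own streaming client) that computes the three numbers from token arrival timestamps:

# latency_decomposition.py - sketch: derive TTFT, ITL, E2E from token timestamps
from statistics import mean

def decompose_latency(request_start: float, token_times: list[float]) -> dict:
    """Compute the three latency numbers for one streamed response."""
    if not token_times:
        return {"ttft": None, "itl": None, "e2e": None}
    ttft = token_times[0] - request_start                  # queue wait + prefill
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    itl = mean(gaps) if gaps else 0.0                      # decode speed per token
    e2e = token_times[-1] - request_start                  # total latency
    return {"ttft": ttft, "itl": itl, "e2e": e2e}

# Example: first token after 0.4s, then roughly 40ms per token
print(decompose_latency(0.0, [0.4, 0.44, 0.48, 0.52]))
# -> ttft 0.4s, itl ~0.04s, e2e 0.52s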

KV Cache - The Resource That Defines Your Capacity

The KV cache stores the key-value attention matrices for every token in every active sequence. It is the primary GPU memory consumer in an LLM serving deployment, and its utilization level directly controls your system's ability to accept new requests.

When KV cache reaches capacity, vLLM has three options:

  1. Wait - hold new requests in the queue until cache space frees up
  2. Preempt - evict a running sequence's cache blocks (swapping them to CPU memory or recomputing them later) and resume the sequence afterwards, which adds latency to that request
  3. Reject - return an error to the caller

None of these are good outcomes for users. The right strategy is to monitor cache utilization proactively and scale out before it becomes a problem.

KV cache size for a given model is approximately:

\text{KV Cache Size} = 2 \times n_\text{layers} \times n_\text{heads} \times d_\text{head} \times \text{max\_seq\_len} \times \text{bytes\_per\_element}

For Llama 3 8B with 32 layers, 8 KV heads, 128 head dimension, FP16:

\text{KV per token} = 2 \times 32 \times 8 \times 128 \times 2 = 131{,}072 \text{ bytes} \approx 128 \text{ KB per token}

With 16GB of GPU memory dedicated to KV cache, you can hold approximately $\frac{16 \times 10^9}{131072} \approx 122{,}000$ tokens across all active sequences simultaneously.
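The arithmetic above is easy to script. A small sketch, using the Llama 3 8B numbers from the example (adjust the layer, head, and dtype parameters for your own model):

# kv_cache_sizing.py - sketch of the per-token KV cache estimate above
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, d_head: int,
                       bytes_per_element: int = 2) -> int:
    # Factor of 2 accounts for both the K and V tensors
    return 2 * n_layers * n_kv_heads * d_head * bytes_per_element

def max_cached_tokens(cache_bytes: float, per_token_bytes: int) -> int:
    return int(cache_bytes // per_token_bytes)

# Llama 3 8B: 32 layers, 8 KV heads, head dim 128, FP16 (2 bytes)
per_token = kv_bytes_per_token(32, 8, 128)
print(per_token)                          # 131072 bytes = 128 KB per token
print(max_cached_tokens(16e9, per_token)) # ~122,000 tokens in 16 GB of cache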

Throughput Metrics - Tokens Per Second vs Requests Per Second

For LLM services, the most meaningful throughput metric is not requests per second but tokens per second - both input (prefill) and output (decode) measured separately.

A single request generating 2,000 output tokens consumes as much GPU compute as 10 requests generating 200 tokens each, but looks like 1/10th the throughput if you measure RPS. Token-level throughput gives you a GPU-normalized view of capacity consumption.

Throughput Metrics Priority (PromQL equivalents follow the list):

1. Output tokens/sec (decode throughput) <- primary GPU compute signal
2. Input tokens/sec (prefill throughput) <- secondary, usually higher
3. Requests/sec <- useful for SLA accounting
4. Concurrent sequences <- direct KV cache pressure signal
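Expressed as PromQL over the vLLM metrics documented below (labels omitted for brevity), the four signals look roughly like this:

# 1. Output tokens/sec (decode throughput)
rate(vllm:generation_tokens_total[1m])

# 2. Input tokens/sec (prefill throughput)
rate(vllm:prompt_tokens_total[1m])

# 3. Requests/sec
rate(vllm:request_success_total[1m])

# 4. Concurrent sequences (direct KV cache pressure)
vllm:num_requests_running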

vLLM Prometheus Metrics - The Complete Reference

vLLM exposes a /metrics endpoint that Prometheus can scrape. Here are the critical metrics and what they tell you.

Infrastructure Metrics

| Metric | Type | What It Measures |
|---|---|---|
| vllm:num_requests_running | Gauge | Active sequences currently being decoded |
| vllm:num_requests_waiting | Gauge | Requests queued, waiting for KV cache space |
| vllm:num_requests_swapped | Gauge | Sequences whose KV cache was swapped to CPU |
| vllm:gpu_cache_usage_perc | Gauge | KV cache utilization (0.0 to 1.0) |
| vllm:cpu_cache_usage_perc | Gauge | CPU KV cache utilization (non-zero means swapping) |

Latency Metrics

| Metric | Type | What It Measures |
|---|---|---|
| vllm:e2e_request_latency_seconds | Histogram | Full end-to-end request latency |
| vllm:request_prompt_tokens | Histogram | Input token count per request |
| vllm:request_generation_tokens | Histogram | Output token count per request |
| vllm:time_to_first_token_seconds | Histogram | TTFT distribution |
| vllm:time_per_output_token_seconds | Histogram | ITL distribution |

Throughput Metrics

| Metric | Type | What It Measures |
|---|---|---|
| vllm:prompt_tokens_total | Counter | Cumulative input tokens processed |
| vllm:generation_tokens_total | Counter | Cumulative output tokens generated |
| vllm:request_success_total | Counter | Successfully completed requests |

Code Examples

Setting Up vLLM with Prometheus Metrics

# start_vllm_server.py
import subprocess
import sys

def start_vllm_with_metrics(
    model: str = "meta-llama/Llama-3.1-8B-Instruct",
    port: int = 8000,
    gpu_memory_utilization: float = 0.90,
    max_model_len: int = 8192,
    tensor_parallel_size: int = 1,
):
    """
    Start vLLM server with full Prometheus metrics enabled.
    Metrics are served on the same port at the /metrics endpoint.
    """
    cmd = [
        sys.executable, "-m", "vllm.entrypoints.openai.api_server",
        "--model", model,
        "--port", str(port),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
        "--max-model-len", str(max_model_len),
        "--tensor-parallel-size", str(tensor_parallel_size),
        # Enable metrics
        "--enable-metrics",
        "--disable-log-requests",  # Use structured metrics instead of log spam
        # Enable prefix caching for better cache hit rates
        "--enable-prefix-caching",
        # Chunked prefill improves TTFT under load
        "--enable-chunked-prefill",
        "--max-num-batched-tokens", "8192",
    ]

    print(f"Starting vLLM server: {' '.join(cmd)}")
    subprocess.run(cmd)


if __name__ == "__main__":
    start_vllm_with_metrics()

# Or start directly from the CLI
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --port 8000 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192 \
  --enable-metrics \
  --enable-prefix-caching \
  --enable-chunked-prefill

# Verify metrics are exposed
curl http://localhost:8000/metrics | grep vllm | head -20

Prometheus Configuration

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "vllm_alerts.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

scrape_configs:
  - job_name: "vllm"
    static_configs:
      - targets: ["vllm-service:8000"]
    metrics_path: "/metrics"
    scrape_interval: 10s  # More frequent for LLM services
    scrape_timeout: 5s

  - job_name: "dcgm_exporter"
    # NVIDIA DCGM Exporter for GPU hardware metrics
    static_configs:
      - targets: ["dcgm-exporter:9400"]
    scrape_interval: 15s

  - job_name: "node_exporter"
    static_configs:
      - targets: ["node-exporter:9100"]
    scrape_interval: 30s

Alert Rules for LLM Services

# vllm_alerts.yml
groups:
  - name: vllm_critical
    interval: 30s
    rules:

      # KV cache near exhaustion - most critical alert
      - alert: VLLMKVCacheExhaustion
        expr: vllm:gpu_cache_usage_perc > 0.90
        for: 2m
        labels:
          severity: critical
          team: ml-platform
        annotations:
          summary: "vLLM KV cache above 90% on {{ $labels.instance }}"
          description: >
            KV cache utilization is {{ $value | humanizePercentage }}.
            New requests will be queued or preempted.
            Action: scale out replicas or reduce max_model_len.

      # Requests stuck in queue
      - alert: VLLMRequestQueueBackup
        expr: vllm:num_requests_waiting > 20
        for: 1m
        labels:
          severity: warning
          team: ml-platform
        annotations:
          summary: "{{ $value }} requests queued in vLLM on {{ $labels.instance }}"
          description: >
            Request queue depth is {{ $value }}.
            This indicates KV cache pressure or insufficient capacity.

      # CPU swapping active - severe performance degradation
      - alert: VLLMCPUSwappingActive
        expr: vllm:num_requests_swapped > 0
        for: 0m  # Alert immediately
        labels:
          severity: critical
          team: ml-platform
        annotations:
          summary: "vLLM is swapping KV cache to CPU on {{ $labels.instance }}"
          description: >
            {{ $value }} sequences are swapped to CPU memory.
            This causes severe latency degradation (often 5-10x slower).
            Immediate action required: scale out or reduce load.

      # E2E latency SLA breach
      - alert: VLLMLatencySLABreach
        expr: >
          histogram_quantile(0.95,
            rate(vllm:e2e_request_latency_seconds_bucket[5m])
          ) > 30
        for: 3m
        labels:
          severity: critical
          team: ml-platform
        annotations:
          summary: "vLLM p95 latency {{ $value | humanizeDuration }} exceeds 30s SLA"

      # TTFT degradation
      - alert: VLLMTTFTDegradation
        expr: >
          histogram_quantile(0.95,
            rate(vllm:time_to_first_token_seconds_bucket[5m])
          ) > 5
        for: 2m
        labels:
          severity: warning
          team: ml-platform
        annotations:
          summary: "vLLM p95 TTFT {{ $value | humanizeDuration }} exceeds 5s"
          description: >
            High TTFT usually indicates: long input prompts causing slow prefill,
            KV cache pressure causing queue waits, or insufficient GPU memory bandwidth.

      # GPU OOM risk
      - alert: GPUMemoryHighUtilization
        expr: >
          DCGM_FI_DEV_FB_USED /
          (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) > 0.95
        for: 2m
        labels:
          severity: critical
          team: ml-platform
        annotations:
          summary: "GPU memory {{ $value | humanizePercentage }} on {{ $labels.instance }}"

      # Low throughput anomaly detection
      - alert: VLLMThroughputDrop
        expr: >
          rate(vllm:generation_tokens_total[5m]) <
          0.5 * rate(vllm:generation_tokens_total[30m] offset 1h)
        for: 5m
        labels:
          severity: warning
          team: ml-platform
        annotations:
          summary: "vLLM token generation rate dropped 50% compared to 1h ago"

Python Metrics Client - Custom Instrumentation

# llm_metrics.py
import json
import time
from typing import AsyncGenerator, Optional

import httpx
from prometheus_client import CollectorRegistry, Counter, Histogram
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter


# Custom metrics registry for application-level LLM metrics
registry = CollectorRegistry()

# Request metrics
llm_requests_total = Counter(
    "llm_requests_total",
    "Total LLM API requests",
    ["model", "status", "user_tier"],
    registry=registry,
)

llm_ttft_seconds = Histogram(
    "llm_ttft_seconds",
    "Time to first token in seconds",
    ["model"],
    buckets=[0.1, 0.25, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0],
    registry=registry,
)

llm_itl_seconds = Histogram(
    "llm_itl_seconds",
    "Inter-token latency in seconds",
    ["model"],
    buckets=[0.01, 0.02, 0.05, 0.1, 0.25, 0.5, 1.0],
    registry=registry,
)

llm_e2e_latency_seconds = Histogram(
    "llm_e2e_latency_seconds",
    "End-to-end request latency",
    ["model", "user_tier"],
    buckets=[1, 5, 10, 30, 60, 120, 300],
    registry=registry,
)

llm_input_tokens = Histogram(
    "llm_input_tokens",
    "Input token count per request",
    ["model"],
    buckets=[64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384],
    registry=registry,
)

llm_output_tokens = Histogram(
    "llm_output_tokens",
    "Output token count per request",
    ["model"],
    buckets=[32, 64, 128, 256, 512, 1024, 2048, 4096],
    registry=registry,
)

llm_output_length_ratio = Histogram(
    "llm_output_length_ratio",
    "Ratio of output tokens to max_tokens (detect truncation)",
    ["model"],
    buckets=[0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99, 1.0],
    registry=registry,
)

# Quality signals
llm_refusal_rate = Counter(
    "llm_refusals_total",
    "Requests that received a refusal response",
    ["model", "reason"],
    registry=registry,
)

llm_empty_response_total = Counter(
    "llm_empty_responses_total",
    "Requests that received empty or very short responses",
    ["model"],
    registry=registry,
)


class InstrumentedLLMClient:
    """
    LLM client with full observability instrumentation.
    Wraps an OpenAI-compatible API with Prometheus metrics and OTEL tracing.
    """

    def __init__(
        self,
        base_url: str = "http://localhost:8000",
        model: str = "meta-llama/Llama-3.1-8B-Instruct",
        otlp_endpoint: Optional[str] = None,
    ):
        self.base_url = base_url
        self.model = model
        self.client = httpx.AsyncClient(timeout=120.0)

        # Set up OpenTelemetry tracing
        provider = TracerProvider()
        if otlp_endpoint:
            exporter = OTLPSpanExporter(endpoint=otlp_endpoint)
            provider.add_span_processor(BatchSpanProcessor(exporter))
        trace.set_tracer_provider(provider)
        self.tracer = trace.get_tracer(__name__)

    async def chat_completion_stream(
        self,
        messages: list,
        max_tokens: int = 512,
        temperature: float = 0.7,
        user_tier: str = "default",
        user_id: Optional[str] = None,
    ) -> AsyncGenerator[str, None]:
        """
        Streaming chat completion with full metrics instrumentation.
        Yields token strings as they arrive.
        """
        request_start = time.perf_counter()
        first_token_time: Optional[float] = None
        last_token_time: Optional[float] = None
        token_count = 0
        full_response = ""
        status = "success"

        # Count input tokens (approximate - use tiktoken for accuracy)
        input_text = " ".join(m.get("content", "") for m in messages)
        approx_input_tokens = int(len(input_text.split()) * 1.3)  # rough estimate

        with self.tracer.start_as_current_span("llm.chat_completion") as span:
            span.set_attribute("llm.model", self.model)
            span.set_attribute("llm.max_tokens", max_tokens)
            span.set_attribute("llm.temperature", temperature)
            span.set_attribute("llm.user_tier", user_tier)
            span.set_attribute("llm.approx_input_tokens", approx_input_tokens)
            if user_id:
                span.set_attribute("llm.user_id", user_id)

            try:
                payload = {
                    "model": self.model,
                    "messages": messages,
                    "max_tokens": max_tokens,
                    "temperature": temperature,
                    "stream": True,
                    "stream_options": {"include_usage": True},
                }

                async with self.client.stream(
                    "POST",
                    f"{self.base_url}/v1/chat/completions",
                    json=payload,
                ) as response:
                    response.raise_for_status()

                    async for line in response.aiter_lines():
                        if not line.startswith("data: "):
                            continue
                        data = line[6:]
                        if data == "[DONE]":
                            break

                        chunk = json.loads(data)
                        now = time.perf_counter()

                        # Check for usage data in the final chunk
                        if "usage" in chunk and chunk["usage"]:
                            usage = chunk["usage"]
                            llm_input_tokens.labels(model=self.model).observe(
                                usage.get("prompt_tokens", 0)
                            )
                            actual_output = usage.get("completion_tokens", 0)
                            llm_output_tokens.labels(model=self.model).observe(
                                actual_output
                            )
                            # Truncation detection
                            if max_tokens > 0:
                                ratio = actual_output / max_tokens
                                llm_output_length_ratio.labels(
                                    model=self.model
                                ).observe(ratio)

                        choices = chunk.get("choices", [])
                        if not choices:
                            continue

                        delta = choices[0].get("delta", {})
                        content = delta.get("content", "")

                        if content:
                            # First token received
                            if first_token_time is None:
                                first_token_time = now
                                ttft = first_token_time - request_start
                                llm_ttft_seconds.labels(model=self.model).observe(ttft)
                                span.set_attribute("llm.ttft_seconds", ttft)

                            # Inter-token latency
                            elif last_token_time is not None:
                                itl = now - last_token_time
                                llm_itl_seconds.labels(model=self.model).observe(itl)

                            last_token_time = now
                            token_count += 1
                            full_response += content
                            yield content

            except httpx.HTTPStatusError as e:
                status = f"http_{e.response.status_code}"
                span.record_exception(e)
                raise
            except Exception as e:
                status = "error"
                span.record_exception(e)
                raise
            finally:
                e2e_latency = time.perf_counter() - request_start
                llm_e2e_latency_seconds.labels(
                    model=self.model, user_tier=user_tier
                ).observe(e2e_latency)
                llm_requests_total.labels(
                    model=self.model, status=status, user_tier=user_tier
                ).inc()

                span.set_attribute("llm.e2e_latency_seconds", e2e_latency)
                span.set_attribute("llm.output_tokens", token_count)

                # Quality signal: detect refusals
                refusal_phrases = [
                    "i cannot", "i can't", "i'm unable to",
                    "i am unable to", "i won't", "i will not",
                    "as an ai", "as a language model",
                ]
                response_lower = full_response.lower()
                for phrase in refusal_phrases:
                    if phrase in response_lower:
                        llm_refusal_rate.labels(
                            model=self.model, reason="policy"
                        ).inc()
                        break

                # Detect empty/very short responses
                if len(full_response.strip()) < 10:
                    llm_empty_response_total.labels(model=self.model).inc()

Structured Logging for LLM Services

# llm_logging.py
import json
import logging
from typing import Optional
from datetime import datetime, timezone


class LLMStructuredLogger:
    """
    Structured JSON logger for LLM request/response pairs.
    Designed for ingestion into ELK Stack, Loki, or CloudWatch Logs.

    IMPORTANT: Never log full prompts/responses in production without
    PII scrubbing. This example includes a simple scrubber hook.
    """

    def __init__(
        self,
        service_name: str = "llm-api",
        log_prompts: bool = False,    # Default OFF for privacy
        log_responses: bool = False,  # Default OFF for cost/privacy
        pii_scrubber=None,
    ):
        self.service_name = service_name
        self.log_prompts = log_prompts
        self.log_responses = log_responses
        self.pii_scrubber = pii_scrubber

        # Set up structured logger
        self.logger = logging.getLogger(service_name)
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter("%(message)s"))
        self.logger.addHandler(handler)
        self.logger.setLevel(logging.INFO)

    def log_request_start(
        self,
        request_id: str,
        model: str,
        user_id: Optional[str],
        messages: list,
        max_tokens: int,
        temperature: float,
    ) -> dict:
        """Log when a request begins. Returns context dict for correlation."""
        context = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "event": "llm_request_start",
            "service": self.service_name,
            "request_id": request_id,
            "model": model,
            "user_id": user_id or "anonymous",
            "max_tokens": max_tokens,
            "temperature": temperature,
            "message_count": len(messages),
        }

        if self.log_prompts:
            prompt_text = " ".join(m.get("content", "") for m in messages)
            if self.pii_scrubber:
                prompt_text = self.pii_scrubber(prompt_text)
            context["prompt_preview"] = prompt_text[:200]

        self.logger.info(json.dumps(context))
        return context

    def log_request_complete(
        self,
        context: dict,
        ttft: float,
        e2e_latency: float,
        input_tokens: int,
        output_tokens: int,
        finish_reason: str,
        response_text: str = "",
    ):
        """Log when a request completes with full metrics."""
        log_entry = {
            **context,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "event": "llm_request_complete",
            "ttft_ms": round(ttft * 1000, 1),
            "e2e_latency_ms": round(e2e_latency * 1000, 1),
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "total_tokens": input_tokens + output_tokens,
            "finish_reason": finish_reason,
            "tokens_per_second": round(
                output_tokens / e2e_latency if e2e_latency > 0 else 0, 1
            ),
        }

        if self.log_responses and response_text:
            if self.pii_scrubber:
                response_text = self.pii_scrubber(response_text)
            log_entry["response_preview"] = response_text[:200]

        self.logger.info(json.dumps(log_entry))

    def log_request_error(
        self,
        context: dict,
        error_type: str,
        error_message: str,
        latency_at_failure: float,
    ):
        """Log failed requests with error details."""
        log_entry = {
            **context,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "event": "llm_request_error",
            "error_type": error_type,
            "error_message": error_message[:500],
            "latency_at_failure_ms": round(latency_at_failure * 1000, 1),
        }
        self.logger.error(json.dumps(log_entry))

Grafana Dashboard as Code

# generate_grafana_dashboard.py
"""
Generate a Grafana dashboard JSON for vLLM monitoring.
Import this into Grafana via the UI or API.
"""
import json

DASHBOARD = {
    "title": "vLLM Production Dashboard",
    "uid": "vllm-prod-v1",
    "refresh": "30s",
    "time": {"from": "now-1h", "to": "now"},
    "panels": [
        # Row 1: Request health
        {
            "id": 1, "type": "stat", "gridPos": {"x": 0, "y": 0, "w": 4, "h": 4},
            "title": "Requests Running",
            "targets": [{"expr": "vllm:num_requests_running", "legendFormat": "running"}],
        },
        {
            "id": 2, "type": "stat", "gridPos": {"x": 4, "y": 0, "w": 4, "h": 4},
            "title": "Requests Waiting",
            "targets": [{"expr": "vllm:num_requests_waiting", "legendFormat": "waiting"}],
            "options": {"thresholds": {"steps": [
                {"color": "green", "value": 0},
                {"color": "yellow", "value": 5},
                {"color": "red", "value": 20},
            ]}},
        },
        {
            "id": 3, "type": "stat", "gridPos": {"x": 8, "y": 0, "w": 4, "h": 4},
            "title": "KV Cache Usage",
            "targets": [{"expr": "vllm:gpu_cache_usage_perc * 100", "legendFormat": "%"}],
            "options": {"unit": "percent", "thresholds": {"steps": [
                {"color": "green", "value": 0},
                {"color": "yellow", "value": 70},
                {"color": "red", "value": 90},
            ]}},
        },
        # Row 2: Latency
        {
            "id": 4, "type": "timeseries", "gridPos": {"x": 0, "y": 4, "w": 12, "h": 8},
            "title": "TTFT Percentiles",
            "targets": [
                {
                    "expr": 'histogram_quantile(0.50, rate(vllm:time_to_first_token_seconds_bucket[5m]))',
                    "legendFormat": "p50",
                },
                {
                    "expr": 'histogram_quantile(0.95, rate(vllm:time_to_first_token_seconds_bucket[5m]))',
                    "legendFormat": "p95",
                },
                {
                    "expr": 'histogram_quantile(0.99, rate(vllm:time_to_first_token_seconds_bucket[5m]))',
                    "legendFormat": "p99",
                },
            ],
        },
        # Row 3: Throughput
        {
            "id": 5, "type": "timeseries", "gridPos": {"x": 0, "y": 12, "w": 12, "h": 8},
            "title": "Token Throughput (tokens/sec)",
            "targets": [
                {
                    "expr": "rate(vllm:generation_tokens_total[1m])",
                    "legendFormat": "output tokens/sec",
                },
                {
                    "expr": "rate(vllm:prompt_tokens_total[1m])",
                    "legendFormat": "input tokens/sec",
                },
            ],
        },
    ],
}

if __name__ == "__main__":
    with open("vllm_dashboard.json", "w") as f:
        json.dump({"dashboard": DASHBOARD}, f, indent=2)
    print("Dashboard JSON written to vllm_dashboard.json")
    print("Import into Grafana: Dashboards -> Import -> Upload JSON file")

Docker Compose - Full Monitoring Stack

# docker-compose.monitoring.yml
version: "3.8"

services:
  vllm:
    image: vllm/vllm-openai:latest
    command:
      - "--model"
      - "meta-llama/Llama-3.1-8B-Instruct"
      - "--port"
      - "8000"
      - "--gpu-memory-utilization"
      - "0.90"
      - "--enable-metrics"
    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    environment:
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}

  prometheus:
    image: prom/prometheus:v2.48.0
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./vllm_alerts.yml:/etc/prometheus/vllm_alerts.yml
      - prometheus_data:/prometheus
    ports:
      - "9090:9090"
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.retention.time=30d"
      - "--web.enable-lifecycle"

  grafana:
    image: grafana/grafana:10.2.0
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_AUTH_ANONYMOUS_ENABLED=true
    depends_on:
      - prometheus

  alertmanager:
    image: prom/alertmanager:v0.26.0
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    ports:
      - "9093:9093"

  # NVIDIA DCGM Exporter for GPU hardware metrics
  dcgm-exporter:
    image: nvcr.io/nvidia/k8s/dcgm-exporter:3.2.5-3.1.8-ubuntu20.04
    environment:
      - DCGM_EXPORTER_LISTEN=:9400
      - DCGM_EXPORTER_KUBERNETES=false
    ports:
      - "9400:9400"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    cap_add:
      - SYS_ADMIN

  # OpenTelemetry Collector
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.89.0
    volumes:
      - ./otel-collector-config.yml:/etc/otel-collector-config.yml
    command: ["--config=/etc/otel-collector-config.yml"]
    ports:
      - "4317:4317"  # OTLP gRPC
      - "4318:4318"  # OTLP HTTP
      - "8889:8889"  # Prometheus metrics

  jaeger:
    image: jaegertracing/all-in-one:1.52
    ports:
      - "16686:16686"  # Jaeger UI
      - "14268:14268"  # HTTP collector

volumes:
  prometheus_data:
  grafana_data:

Architecture Diagrams

LLM Observability Stack Architecture

LLM Request Lifecycle - What Gets Measured When

GPU Memory Layout and KV Cache Monitoring


Production Engineering Notes

Scrape Interval Tuning

The default Prometheus scrape interval of 60 seconds is too slow for LLM services. A KV cache can go from 80% to 100% and trigger OOM errors in under 30 seconds under load. Set scrape interval to 10-15 seconds for LLM metrics specifically, while keeping slower intervals for less volatile metrics like node stats.

# In prometheus.yml - different intervals per job
scrape_configs:
  - job_name: "vllm"
    scrape_interval: 10s  # Fast - LLM state changes quickly
  - job_name: "node_exporter"
    scrape_interval: 60s  # Slow - node metrics are stable

Histogram Bucket Selection

vLLM's default histogram buckets may not match your latency SLAs. For interactive applications with a 5-second TTFT SLA, you want high-resolution buckets in the 0-5s range. The default buckets often have gaps right where your SLA threshold sits, making percentile estimation inaccurate.

# Custom histogram buckets aligned with your SLAs
TTFT_BUCKETS = [
    0.05, 0.1, 0.2, 0.3, 0.5, 0.75,
    1.0, 1.5, 2.0, 3.0, 5.0,  # SLA boundary at 5s
    10.0, 20.0, 30.0, 60.0,
]

Cardinality Management

Adding user_id as a label on every metric is tempting but catastrophic. With 100,000 users, every metric becomes 100,000 time series. Prometheus will OOM. Use bucketed user tiers (free, pro, enterprise) or session-type labels instead. Reserve per-user granularity for your structured logs and traces, not your metrics.
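As a concrete sketch of the pattern (the metric and tier names here are illustrative, not part of vLLM):

# cardinality.py - sketch: bound label cardinality with a small tier set
from prometheus_client import Counter

ALLOWED_TIERS = {"free", "pro", "enterprise"}

requests_by_tier = Counter(
    "llm_requests_by_tier_total",
    "LLM requests bucketed by pricing tier (bounded cardinality)",
    ["model", "user_tier"],
)

def tier_label(raw_tier: str) -> str:
    """Collapse anything unexpected into 'other' so cardinality stays bounded."""
    return raw_tier if raw_tier in ALLOWED_TIERS else "other"

# Good: at most four series per model
requests_by_tier.labels(model="llama-3.1-8b", user_tier=tier_label("pro")).inc()
# Bad (do not do this): labels(..., user_id="user-48213") -> one series per user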

Multi-Instance Aggregation

When running multiple vLLM replicas behind a load balancer, remember that vLLM metrics are per-instance. Use Prometheus sum() and avg() to aggregate across replicas. The instance label lets you identify which replica is degraded.

# Total tokens per second across all replicas
sum(rate(vllm:generation_tokens_total[1m]))

# KV cache usage per replica - identify the hottest instance
max by (instance) (vllm:gpu_cache_usage_perc)

# P95 latency - aggregate across all instances
histogram_quantile(0.95, sum(rate(vllm:e2e_request_latency_seconds_bucket[5m])) by (le))

Quality Monitoring at Scale

Infrastructure metrics tell you that your service is running. They do not tell you if it is running well. Build a separate quality monitoring pipeline that:

  1. Samples 1-5% of requests for human review
  2. Runs automated evaluations using a smaller judge model (e.g., evaluate output coherence with a GPT-4o-mini call) - see the sketch after this list
  3. Tracks output length distribution over time - sudden changes indicate model behavior shifts
  4. Monitors refusal rate trends - a spike often indicates a broken system prompt or an adversarial prompt campaign
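A minimal sketch of the sampling-plus-judge step from item 2 (the judge endpoint, model, prompt, and judge_score histogram are illustrative assumptions, not a specific product API):

# quality_sampler.py - sketch of async LLM-as-judge sampling on a small traffic slice
import random

import httpx
from prometheus_client import Histogram

judge_score = Histogram(
    "llm_judge_score",
    "LLM-as-judge coherence score (1-5) on sampled responses",
    ["model"],
    buckets=[1, 2, 3, 4, 5],
)

JUDGE_PROMPT = (
    "Rate the coherence and helpfulness of this response to the user prompt "
    "on a scale of 1-5. Reply with only the number.\n\n"
    "Prompt: {prompt}\n\nResponse: {response}"
)

async def maybe_judge(prompt: str, response: str, model: str,
                      judge_url: str, judge_model: str, sample_rate: float = 0.02):
    """Judge roughly 2% of traffic asynchronously; never block the user-facing path."""
    if random.random() > sample_rate:
        return
    payload = {
        "model": judge_model,
        "messages": [{"role": "user",
                      "content": JUDGE_PROMPT.format(prompt=prompt, response=response)}],
        "max_tokens": 4,
        "temperature": 0.0,
    }
    async with httpx.AsyncClient(timeout=30.0) as client:
        r = await client.post(f"{judge_url}/v1/chat/completions", json=payload)
        r.raise_for_status()
        text = r.json()["choices"][0]["message"]["content"].strip()
    try:
        judge_score.labels(model=model).observe(float(text))
    except ValueError:
        pass  # judge returned something unparseable; skip this sample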

Common Mistakes

:::danger Missing KV Cache Alert - The Most Dangerous Gap

The single most common monitoring gap is not alerting on vllm:gpu_cache_usage_perc. Teams configure CPU and memory alerts (because that is what their generic Kubernetes monitoring template includes) and never add the KV-cache-specific alert.

When KV cache hits 100%, vLLM starts preempting sequences or rejecting requests. This is completely invisible to CPU and GPU utilization metrics - the GPU might be at 40% compute utilization while being completely starved of KV cache space. You will see rising latency and request failures with no obvious cause.

Always configure this alert before going to production:

- alert: VLLMKVCacheExhaustion
  expr: vllm:gpu_cache_usage_perc > 0.85
  for: 1m
  labels:
    severity: critical

:::

:::danger Measuring HTTP Latency Instead of E2E Latency

Many teams plug their LLM service into existing APM tools that measure HTTP request latency at the load balancer. For streaming responses, the HTTP request returns 200 immediately and keeps the connection open. The load balancer records the time to the first byte as the "request latency" - often 200-500ms - while users are actually waiting 30+ seconds for the full response.

This creates a completely false picture of service health. Your dashboards show 400ms p99 latency while users are filing support tickets about 45-second waits.

You must instrument at the application layer, where you can measure TTFT and E2E latency properly, or use vLLM's own Prometheus histogram, which measures from request receipt to last token sent.

:::

:::warning Logging Prompts Without PII Scrubbing

It is tempting to log full prompts and responses for debugging purposes. In a production service that handles user data, this is a significant privacy and compliance risk. Users routinely include PII (names, emails, addresses, SSNs) in their prompts without realizing the implications.

Never enable full prompt/response logging without:

  1. A PII scrubbing pipeline (presidio, AWS Comprehend, or similar)
  2. Clear data retention policies (max 30 days, encrypted at rest)
  3. Legal review of your terms of service
  4. User consent mechanisms

Default your logging configuration to log_prompts: false and require explicit opt-in for debug logging in production.

:::

:::warning Using the Same Alert Thresholds as Your Web Services

A p99 latency of 5 seconds is catastrophic for a web API. For an LLM generating 2,000 tokens, 5 seconds might be completely acceptable. Set thresholds that reflect the nature of LLM workloads, not reflexes from your microservices monitoring culture.

Typical production LLM thresholds:

  • TTFT p95: alert at 5s (not 500ms)
  • E2E p95: alert at 60s for long-form generation
  • KV cache: alert at 85% (not 70%)
  • Queue depth: alert at 20 requests (not 5)

Calibrate based on your actual workload characteristics during load testing.

:::


Interview Q&A

Q1: What is TTFT and why is it more important than E2E latency for interactive LLM applications?

TTFT (Time to First Token) is the elapsed time from when a request arrives at the inference server to when the first output token begins streaming back to the client. It is more important than E2E latency for interactive applications because of the psychology of perceived responsiveness.

When a user sends a message, the most anxiety-inducing period is complete silence - no feedback at all. Once the first token appears, the user has confirmation that something is happening and will tolerate a much longer wait. This is why a chat interface that starts streaming within 300ms and takes 15 seconds total feels faster than one that takes 2 seconds to start streaming and finishes in 8 seconds.

From a systems perspective, TTFT is dominated by two factors: queue wait time (how long the request waited for KV cache space) and prefill time (how long it takes the GPU to process the input tokens and generate the first output token). Prefill is compute-bound and scales with input length. Queue wait is resource-bound and scales with system load. High TTFT usually means either very long input prompts or a system running near KV cache saturation.

Q2: Walk me through how you would debug a sudden 10x increase in p95 latency on a vLLM deployment.

Start with the latency decomposition. Check vllm:time_to_first_token_seconds and vllm:time_per_output_token_seconds separately. If TTFT spiked but ITL is normal, the bottleneck is in prefill or queuing. Check vllm:num_requests_waiting - if it is elevated, KV cache exhaustion is the likely cause. Check vllm:gpu_cache_usage_perc to confirm.

If ITL spiked but TTFT is normal, the bottleneck is in decode. This could indicate GPU memory bandwidth saturation, CPU swapping of KV blocks (vllm:num_requests_swapped and vllm:cpu_cache_usage_perc), or a sudden increase in average output length (vllm:request_generation_tokens).

Check the input token distribution - a sudden increase in average prompt length (perhaps a user discovered they can send very long inputs) will consume KV cache faster and increase prefill time. Look at vllm:request_prompt_tokens histogram to detect this pattern.

Finally, check GPU hardware metrics via DCGM. GPU memory errors, thermal throttling (GPU frequency drops to maintain thermal limits), or NVLink errors in multi-GPU setups can all cause sudden latency spikes that have nothing to do with the serving software.
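In practice, that walkthrough maps to a handful of PromQL queries (assuming the vLLM histogram names from the reference tables above) you can paste into Prometheus or Grafana Explore:

# Step 1: decompose latency - did TTFT or ITL spike?
histogram_quantile(0.95, rate(vllm:time_to_first_token_seconds_bucket[5m]))
histogram_quantile(0.95, rate(vllm:time_per_output_token_seconds_bucket[5m]))

# Step 2: if TTFT spiked - queueing or KV cache pressure?
vllm:num_requests_waiting
vllm:gpu_cache_usage_perc

# Step 3: if ITL spiked - swapping or longer outputs?
vllm:num_requests_swapped
histogram_quantile(0.95, rate(vllm:request_generation_tokens_bucket[5m]))

# Step 4: did the input length distribution shift?
histogram_quantile(0.95, rate(vllm:request_prompt_tokens_bucket[5m]))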

Q3: How would you monitor for quality degradation in an LLM service, not just infrastructure health?

Quality monitoring operates at a different layer than infrastructure monitoring. Infrastructure metrics tell you the service is running; quality metrics tell you if it is producing useful outputs.

The most practical approaches are: (1) Output length monitoring - track the distribution of output token counts over time. A sudden shift (much shorter or longer outputs) often indicates a broken system prompt, model configuration change, or a prompt injection attack. (2) Refusal rate tracking - count responses containing phrases like "I cannot" or "I am unable to." A spike often indicates an adversarial prompt campaign or an overly restrictive system prompt. (3) Truncation detection - measure the ratio of actual output tokens to max_tokens. High truncation rate means users are consistently hitting the output limit, suggesting they need longer responses.

For higher-confidence quality signals, implement async evaluation: sample 1-5% of request/response pairs, send them to a judge model (a smaller, cheaper LLM) with a scoring prompt, and track the judge scores over time as a proxy for output quality. This is imperfect but detects major quality regressions.

Q4: What are the key differences between monitoring a traditional ML model serving endpoint and a generative LLM endpoint?

Traditional ML serving (classification, regression, embedding) involves fixed-size outputs, synchronous responses, and deterministic compute costs. You measure latency (one number), throughput (requests/sec), and accuracy drift (requires label collection). The monitoring vocabulary is well-established.

Generative LLM serving introduces: variable-length outputs (a request for a haiku and a request for a research paper both hit the same endpoint), streaming responses (latency is not one number but a series: TTFT, ITL, E2E), KV cache as a stateful shared resource across concurrent requests (no analog in traditional serving), and quality that is subjective and hard to measure automatically.

The key additions for LLM monitoring are: TTFT and ITL histograms (replacing single-latency measurement), KV cache utilization and queue depth (the two most critical capacity signals), token throughput (output tokens/sec, not requests/sec), and quality proxies like output length distribution and refusal rate.

Q5: How would you set up distributed tracing for an LLM application with multiple services - for example, a RAG pipeline with retrieval, reranking, and generation stages?

Use OpenTelemetry with a parent trace that spans the entire user request, and child spans for each stage of the pipeline. The parent span starts when the user request arrives at your API gateway and ends when the streaming response completes. Each stage - vector retrieval, document reranking, prompt construction, LLM inference - becomes a child span.

The critical span attributes for LLM spans follow the OpenTelemetry semantic conventions for GenAI: gen_ai.system (the model provider), gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, and gen_ai.response.finish_reason.

The tracing data lets you answer questions that metrics cannot: "For this specific slow request, was the bottleneck in retrieval (slow vector DB query), reranking (CPU-bound cross-encoder), or generation (LLM latency)?" You sample traces at 5-10% for normal requests, but use tail-based sampling to capture 100% of slow requests (above p95 threshold) and 100% of error requests. This gives you full visibility into failure modes without the cost of storing every trace.
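A minimal OpenTelemetry sketch of that span structure (the retrieve/rerank/generate functions are stubs standing in for your own pipeline stages):

# rag_tracing.py - sketch: parent span per request, child spans per pipeline stage
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))  # swap for OTLP in prod
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("rag-pipeline")

# Stubs standing in for real pipeline stages
def retrieve(q): return ["doc-1", "doc-2"]
def rerank(q, docs): return docs
def generate(q, docs): return "answer", {"prompt_tokens": 120, "completion_tokens": 40}

def answer(question: str) -> str:
    with tracer.start_as_current_span("rag.request") as root:       # parent span
        root.set_attribute("rag.question_length", len(question))

        with tracer.start_as_current_span("rag.retrieve"):          # vector DB query
            docs = retrieve(question)

        with tracer.start_as_current_span("rag.rerank"):            # cross-encoder reranker
            ranked = rerank(question, docs)

        with tracer.start_as_current_span("rag.generate") as gen:   # LLM call
            gen.set_attribute("gen_ai.request.model", "meta-llama/Llama-3.1-8B-Instruct")
            response, usage = generate(question, ranked)
            gen.set_attribute("gen_ai.usage.input_tokens", usage["prompt_tokens"])
            gen.set_attribute("gen_ai.usage.output_tokens", usage["completion_tokens"])
        return response

print(answer("What does TTFT measure?"))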

Q6: Explain how to configure Prometheus alerting so that a KV cache alarm fires quickly but does not generate alert fatigue.

The key is matching the for duration to the volatility of the signal. KV cache utilization can spike and recover quickly under bursty load. An alert with for: 0m (fires immediately) will generate false positives every time a traffic burst temporarily saturates the cache. An alert with for: 10m will miss real incidents.

Use a two-tier approach: a warning at 80% with for: 5m (gives the team time to prepare, not urgent), and a critical alert at 90% with for: 2m (definitely a problem that needs action). The 2-minute window filters out brief spikes while still catching sustained saturation before it impacts users.

Route the warning to a Slack channel (low urgency), and the critical alert to PagerDuty (on-call wake-up). This way the on-call engineer is only paged for genuinely urgent situations while the team still has visibility into developing problems during business hours.

Also configure alert inhibition: if the critical KV cache alert is firing, suppress the "request queue backup" alert - the queue backup is a consequence of the cache saturation, not an independent problem. Deduplication prevents the on-call engineer from receiving five simultaneous alerts for one root cause.
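Sketched as configuration (alert names mirror the rules earlier in this lesson; the inhibit_rules block lives in Alertmanager's config, not Prometheus):

# Two-tier KV cache alerting (Prometheus rule fragment)
- alert: VLLMKVCacheWarning
  expr: vllm:gpu_cache_usage_perc > 0.80
  for: 5m
  labels:
    severity: warning    # routed to Slack
- alert: VLLMKVCacheCritical
  expr: vllm:gpu_cache_usage_perc > 0.90
  for: 2m
  labels:
    severity: critical   # routed to PagerDuty

# Alertmanager: suppress the queue-backup alert while the critical cache alert fires
inhibit_rules:
  - source_matchers: ['alertname="VLLMKVCacheCritical"']
    target_matchers: ['alertname="VLLMRequestQueueBackup"']
    equal: ['instance']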

© 2026 EngineersOfAI. All rights reserved.