
Observability and Logging

Reading time: ~45 min · Interview relevance: Very High · Target roles: ML Engineer, MLOps Engineer, Platform Engineer

When the Model Stops Working and Nobody Knows Why

A fraud detection model at a payments company has been in production for eight months. One Tuesday morning, the on-call engineer gets a page: false positive rate is spiking, legitimate transactions are being declined. The engineer opens the dashboard. The model is serving predictions. The API is healthy. Error rate is zero. Latency looks normal.

But something is wrong. The model is making bad predictions. Why?

Without proper observability, this investigation takes days. With it, it takes 20 minutes. In this case, the engineer sees a metric called feature_store_cache_hit_rate that dropped from 98% to 12% an hour before the spike in false positives. A cache configuration change had caused the model to receive stale feature values - 6-hour-old transaction history instead of real-time. The model was not broken. The data it was receiving was broken.

This scenario illustrates why ML systems need observability that goes beyond what traditional software needs. You need to monitor not just "is the service up" but "is the model receiving the right inputs," "is the model's output distribution consistent with training," "which feature values are contributing most to predictions," and "how does latency break down across preprocessing, inference, and postprocessing."

This lesson covers the three pillars of observability (logs, metrics, traces), the tools used at scale (Prometheus, OpenTelemetry, Grafana, Loki), and the ML-specific concerns that most observability guides ignore: input distribution monitoring, output drift detection, per-feature latency attribution, and GPU utilization tracking in serving.

Why This Exists - The Debugging Problem at Scale

In 2003, George Candea and Armando Fox at Stanford published "Crash-Only Software," arguing that the only way to stop a system should be to crash it and the only way to start it should be to recover - so recovery paths are exercised constantly instead of only during emergencies. The broader lesson that became foundational for distributed systems engineering is that you cannot rely on inspecting a process's internal state: you have to understand and manage systems from the outside, through the signals they emit.

The problem is that at scale, you cannot attach a debugger to production. A model serving 100,000 requests per second has no room for print() statements or step-through debugging. You need to observe the system's behavior through signals it emits: logs it writes, metrics it exposes, and traces it generates.

Traditional application monitoring focused on a simple question: "Is the service up?" ML systems need to answer a harder question: "Is the service correct?" A model serving predictions with 5ms latency and zero errors could still be completely wrong - serving stale features, using a model with a silently corrupted weight file, or making predictions on a data distribution that has shifted dramatically from training.

The field of ML observability (sometimes called "ML monitoring" or "model monitoring") has converged on three additional signals beyond the standard three pillars: input drift (are incoming features distributed like training data?), output drift (is the prediction distribution changing?), and data quality (are features arriving with correct types, ranges, and missing value rates?).

Historical Context - From printf to OpenTelemetry

Logging is the oldest observability primitive. Early Unix programs wrote messages to stdout or to /var/log/* files. The syslog protocol, which originated in 1980s BSD Unix and was later documented in RFC 3164 (2001), established severity levels and centralized log collection.

Structured logging - writing logs as machine-parseable key-value pairs rather than human-readable strings - emerged in the mid-2000s as log volumes made grep-based analysis impractical. The Splunk platform (founded 2003) built a business on structured log analysis.

Prometheus was developed at SoundCloud in 2012 (open-sourced 2015) as a replacement for Graphite and StatsD. Its key innovations were the pull model (Prometheus scrapes metrics from endpoints rather than applications pushing to a central server) and the data model (time series identified by metric name plus label key-value pairs).

OpenTelemetry (2019) merged the OpenCensus and OpenTracing standards into a single vendor-neutral API for collecting logs, metrics, and traces. It has become the dominant standard for new observability instrumentation.

The ELK stack (Elasticsearch, Logstash, Kibana) dominated centralized log management for most of the 2010s. Grafana Loki (2018) offered a lighter alternative: instead of indexing log content (expensive), Loki only indexes log labels, storing the log content compressed. For ML systems that generate gigabytes of training logs per day, Loki's cost profile is dramatically better.

Core Concepts

The Three Pillars of Observability

Logs are discrete events with context. Each log entry captures something that happened at a specific point in time. Logs are high-cardinality (each request can log a unique request ID) and unsampled (you want every error log). The cost is volume.

Metrics are aggregated measurements over time. A counter tracks "total requests handled"; a histogram tracks "distribution of request latencies." Metrics are low-cardinality by design - you cannot have a metric per user ID - but they are cheap and persistent.

Traces capture the causal chain of events across service boundaries. When a single ML inference request involves a feature store lookup, a model prediction, and a post-processing step, a trace shows the timing of each step and how they relate. Traces are typically sampled (you do not trace every request) because they are expensive.

Structured Logging with structlog

Plain text logs (print("Starting epoch 3")) are useless at scale. You cannot query them efficiently, and they lose context across distributed systems. Structured logs emit JSON events where every piece of information is a named field.

# logging_setup.py
# Configure structlog for production ML services

import logging
import sys
import structlog
from typing import Any


def configure_structlog(
    log_level: str = "INFO",
    json_output: bool = True,
    service_name: str = "ml-inference",
    version: str = "unknown",
) -> None:
    """
    Configure structlog for production use.
    JSON output for log aggregation systems (Loki, Elasticsearch).
    Console output (colored, readable) for development.
    """

    def add_service_context(logger, method_name, event_dict):
        # Add service-level context to every log event
        event_dict.setdefault("service_name", service_name)
        event_dict.setdefault("service_version", version)
        return event_dict

    shared_processors: list[Any] = [
        add_service_context,
        structlog.contextvars.merge_contextvars,
        structlog.stdlib.add_log_level,
        structlog.stdlib.add_logger_name,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.StackInfoRenderer(),
        # Add call-site info (filename, line number) to every log event
        structlog.processors.CallsiteParameterAdder(
            [structlog.processors.CallsiteParameter.FILENAME,
             structlog.processors.CallsiteParameter.LINENO]
        ),
    ]

    if json_output:
        # Production: emit JSON for log aggregation
        processors = shared_processors + [
            structlog.processors.dict_tracebacks,
            structlog.processors.JSONRenderer(),
        ]
    else:
        # Development: colored console output
        processors = shared_processors + [
            structlog.dev.ConsoleRenderer(colors=True),
        ]

    structlog.configure(
        processors=processors,
        wrapper_class=structlog.make_filtering_bound_logger(
            getattr(logging, log_level.upper())
        ),
        context_class=dict,
        logger_factory=structlog.PrintLoggerFactory(sys.stdout),
        cache_logger_on_first_use=True,
    )

    # Also configure stdlib logging to go through structlog
    logging.basicConfig(
        format="%(message)s",
        stream=sys.stdout,
        level=getattr(logging, log_level.upper()),
    )


# Usage: in your application entry point
configure_structlog(
log_level="INFO",
json_output=True,
service_name="recommendation-model",
version="2.1.0",
)

logger = structlog.get_logger()

# ---------------------------------------------------------------
# Structured logging in an ML inference handler
# ---------------------------------------------------------------

import time


class InferenceHandler:
    def __init__(self, model, feature_store):
        self.model = model
        self.feature_store = feature_store
        self.log = structlog.get_logger().bind(
            component="inference_handler",
            model_version=model.version,
        )

    def predict(self, request_id: str, user_id: str, item_ids: list[str]):
        # Bind request-level context to all logs within this request
        log = self.log.bind(
            request_id=request_id,
            user_id=user_id,
            num_items=len(item_ids),
        )

        log.info("inference_request_received")

        t0 = time.perf_counter()

        # Feature retrieval
        try:
            features = self.feature_store.get(user_id, item_ids)
            feature_latency_ms = (time.perf_counter() - t0) * 1000
            log.info("features_retrieved",
                     latency_ms=round(feature_latency_ms, 2),
                     cache_hit=features.from_cache,
                     num_features=len(features))
        except Exception as exc:
            log.error("feature_retrieval_failed",
                      error=str(exc),
                      error_type=type(exc).__name__)
            raise

        # Inference
        t1 = time.perf_counter()
        try:
            predictions = self.model(features)
            inference_latency_ms = (time.perf_counter() - t1) * 1000

            log.info("inference_complete",
                     inference_latency_ms=round(inference_latency_ms, 2),
                     total_latency_ms=round((time.perf_counter() - t0) * 1000, 2),
                     top_prediction=predictions[0].item_id,
                     top_score=round(predictions[0].score, 4))
        except Exception as exc:
            log.error("inference_failed",
                      error=str(exc),
                      error_type=type(exc).__name__)
            raise

        return predictions

A single inference_complete log event from this handler looks like:

{
  "timestamp": "2026-04-22T14:23:01.234567Z",
  "level": "info",
  "event": "inference_complete",
  "service_name": "recommendation-model",
  "service_version": "2.1.0",
  "component": "inference_handler",
  "model_version": "2.1.0",
  "request_id": "req-abc123",
  "user_id": "u-456",
  "num_items": 20,
  "inference_latency_ms": 4.7,
  "total_latency_ms": 12.3,
  "top_prediction": "item-789",
  "top_score": 0.9342,
  "filename": "inference.py",
  "lineno": 67
}

Every field is queryable in Loki or Elasticsearch. You can instantly answer: "Show me all requests where total_latency_ms > 100" or "Show me all requests where cache_hit = false in the last hour."

Correlation IDs for Distributed Tracing

In a microservices architecture, a single user request might touch 5 services. Without correlation IDs, you cannot link the logs from each service back to the same user request.

# middleware/correlation_id.py
import uuid
import structlog
from contextvars import ContextVar
from fastapi import Request, Response
from starlette.middleware.base import BaseHTTPMiddleware

# Thread-local (actually context-var-local) storage for correlation ID
CORRELATION_ID_CTX: ContextVar[str] = ContextVar("correlation_id", default="")

HEADER_NAME = "X-Correlation-ID"


class CorrelationIDMiddleware(BaseHTTPMiddleware):
    """
    FastAPI middleware that:
    1. Reads X-Correlation-ID header from incoming requests
       (so callers can pass their own ID for end-to-end tracing)
    2. Generates a new UUID if not present
    3. Binds it to the structlog context so every log in this request
       automatically includes correlation_id
    4. Returns it in the response header
    """

    async def dispatch(self, request: Request, call_next) -> Response:
        correlation_id = request.headers.get(HEADER_NAME) or str(uuid.uuid4())

        # Store in context var (asyncio-safe, no thread-safety issues)
        token = CORRELATION_ID_CTX.set(correlation_id)

        # Bind to structlog context for this request
        structlog.contextvars.bind_contextvars(
            correlation_id=correlation_id,
            http_method=request.method,
            http_path=request.url.path,
        )

        try:
            response = await call_next(request)
            response.headers[HEADER_NAME] = correlation_id
            return response
        finally:
            # Clean up context vars after request
            structlog.contextvars.clear_contextvars()
            CORRELATION_ID_CTX.reset(token)
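Wiring it up is one line per application. The sketch below assumes a FastAPI app object and the module above; the downstream feature-store URL and route are placeholders. Note how the correlation ID is forwarded on the outbound call so the downstream service logs the same ID.

# app.py - registering the middleware and propagating the ID downstream (sketch)
import httpx
from fastapi import FastAPI

from middleware.correlation_id import (
    CORRELATION_ID_CTX, HEADER_NAME, CorrelationIDMiddleware,
)

app = FastAPI()
app.add_middleware(CorrelationIDMiddleware)


@app.get("/recommend/{user_id}")
async def recommend(user_id: str):
    # Forward the correlation ID so the feature store's logs carry the same ID
    async with httpx.AsyncClient() as client:
        resp = await client.get(
            "http://feature-store.ml-serving.svc/features",  # placeholder URL
            params={"user_id": user_id},
            headers={HEADER_NAME: CORRELATION_ID_CTX.get()},
        )
    return {"user_id": user_id, "features": resp.json()}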

OpenTelemetry Tracing for ML Inference

OpenTelemetry provides a unified API for distributed tracing. Spans represent units of work; they can be nested and span service boundaries.

# tracing.py
# OpenTelemetry setup for an ML inference service

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor


def configure_tracing(
    service_name: str,
    service_version: str,
    otlp_endpoint: str = "http://jaeger.monitoring.svc:4317",
) -> trace.Tracer:
    """Configure OpenTelemetry tracing with OTLP exporter (Jaeger)."""

    resource = Resource.create({
        "service.name": service_name,
        "service.version": service_version,
    })

    provider = TracerProvider(resource=resource)
    exporter = OTLPSpanExporter(endpoint=otlp_endpoint, insecure=True)
    processor = BatchSpanProcessor(exporter)
    provider.add_span_processor(processor)
    trace.set_tracer_provider(provider)

    # Auto-instrument FastAPI and outbound HTTP calls
    FastAPIInstrumentor().instrument()
    HTTPXClientInstrumentor().instrument()

    return trace.get_tracer(service_name)


# Initialize in your application
tracer = configure_tracing(
service_name="recommendation-model",
service_version="2.1.0",
)


# ---------------------------------------------------------------
# Using spans in ML inference
# ---------------------------------------------------------------

class TracedInferenceHandler:
    def __init__(self, model, feature_store):
        self.model = model
        self.feature_store = feature_store
        self.tracer = tracer

    async def predict(self, request_id: str, user_id: str, item_ids: list[str]):
        # Root span for the entire inference request
        with self.tracer.start_as_current_span("ml.inference") as root_span:
            root_span.set_attributes({
                "ml.request_id": request_id,
                "ml.user_id": user_id,
                "ml.num_items": len(item_ids),
                "ml.model_name": self.model.name,
                "ml.model_version": self.model.version,
            })

            # Child span: feature retrieval
            with self.tracer.start_as_current_span("feature_store.get") as feat_span:
                features = await self.feature_store.get(user_id, item_ids)
                feat_span.set_attributes({
                    "feature_store.cache_hit": features.from_cache,
                    "feature_store.num_features": len(features),
                })

            # Child span: model inference
            with self.tracer.start_as_current_span("model.forward") as model_span:
                predictions = await self.model.async_predict(features)
                model_span.set_attributes({
                    "model.batch_size": len(item_ids),
                    "model.top_score": float(predictions[0].score),
                })

            # Child span: postprocessing (filtering, ranking)
            with self.tracer.start_as_current_span("postprocess") as post_span:
                results = self._postprocess(predictions)
                post_span.set_attributes({
                    "postprocess.num_results": len(results),
                })

            return results

The resulting trace in Jaeger shows a waterfall diagram: root span ml.inference containing three child spans (feature_store.get, model.forward, postprocess) with precise timing for each. If feature_store.get takes 80ms out of a 100ms total request, you immediately know where the bottleneck is.

Prometheus Metrics for ML Model Servers

Prometheus scrapes an HTTP endpoint (usually /metrics) and stores time series. Here are the metrics every ML inference service should expose:

# metrics.py
# Prometheus metrics for an ML inference service

from prometheus_client import (
    Counter, Gauge, Histogram, start_http_server,
)
import time


# ---------------------------------------------------------------
# Request metrics
# ---------------------------------------------------------------

REQUEST_COUNTER = Counter(
"ml_inference_requests_total",
"Total number of inference requests",
["model_name", "model_version", "status"], # labels
)

REQUEST_LATENCY = Histogram(
"ml_inference_duration_seconds",
"Inference request duration in seconds",
["model_name", "model_version"],
buckets=[0.001, 0.005, 0.01, 0.025, 0.05,
0.1, 0.25, 0.5, 1.0, 2.5, 5.0],
)

# Sub-operation latencies
FEATURE_LATENCY = Histogram(
"ml_feature_retrieval_duration_seconds",
"Feature store retrieval duration",
["model_name", "cache_hit"],
buckets=[0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.5],
)

PREPROCESS_LATENCY = Histogram(
"ml_preprocessing_duration_seconds",
"Input preprocessing duration",
["model_name"],
)

MODEL_FORWARD_LATENCY = Histogram(
"ml_model_forward_duration_seconds",
"Model forward pass duration",
["model_name", "model_version", "batch_size"],
buckets=[0.001, 0.002, 0.005, 0.01, 0.02, 0.05, 0.1],
)

# ---------------------------------------------------------------
# Batch and throughput metrics
# ---------------------------------------------------------------

BATCH_SIZE_HISTOGRAM = Histogram(
"ml_inference_batch_size",
"Distribution of inference batch sizes",
["model_name"],
buckets=[1, 2, 4, 8, 16, 32, 64, 128, 256],
)

THROUGHPUT_COUNTER = Counter(
"ml_inference_samples_total",
"Total samples processed",
["model_name", "model_version"],
)

# ---------------------------------------------------------------
# Model health and drift metrics
# ---------------------------------------------------------------

PREDICTION_SCORE_HISTOGRAM = Histogram(
"ml_prediction_score",
"Distribution of model prediction scores (for drift detection)",
["model_name", "model_version"],
buckets=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99],
)

# Gauge: current value (not cumulative)
ACTIVE_REQUESTS = Gauge(
"ml_inference_active_requests",
"Number of inference requests currently in flight",
["model_name"],
)

MODEL_INFO = Gauge(
"ml_model_info",
"Model metadata (always 1, labels carry the metadata)",
["model_name", "model_version", "git_commit"],
)

# ---------------------------------------------------------------
# GPU metrics (if serving on GPU)
# ---------------------------------------------------------------

GPU_UTILIZATION = Gauge(
"gpu_utilization_percent",
"GPU compute utilization percentage",
["gpu_index", "gpu_name"],
)

GPU_MEMORY_USED_BYTES = Gauge(
"gpu_memory_used_bytes",
"GPU memory currently in use",
["gpu_index"],
)

GPU_MEMORY_TOTAL_BYTES = Gauge(
"gpu_memory_total_bytes",
"Total GPU memory",
["gpu_index"],
)

GPU_TEMPERATURE_CELSIUS = Gauge(
"gpu_temperature_celsius",
"GPU temperature",
["gpu_index"],
)


# ---------------------------------------------------------------
# Instrumented inference handler
# ---------------------------------------------------------------

class InstrumentedInferenceHandler:
    def __init__(self, model, feature_store):
        self.model = model
        self.feature_store = feature_store

        # Register model info metric (set once at startup)
        MODEL_INFO.labels(
            model_name=model.name,
            model_version=model.version,
            git_commit=model.git_commit,
        ).set(1)

    def predict(self, request_id: str, user_id: str, item_ids: list[str]):
        model_labels = {
            "model_name": self.model.name,
            "model_version": self.model.version,
        }

        # ACTIVE_REQUESTS only has a model_name label
        ACTIVE_REQUESTS.labels(model_name=self.model.name).inc()

        try:
            with REQUEST_LATENCY.labels(**model_labels).time():
                BATCH_SIZE_HISTOGRAM.labels(
                    model_name=self.model.name
                ).observe(len(item_ids))

                # Feature retrieval
                t0 = time.perf_counter()
                features = self.feature_store.get(user_id, item_ids)
                FEATURE_LATENCY.labels(
                    model_name=self.model.name,
                    cache_hit=str(features.from_cache),
                ).observe(time.perf_counter() - t0)

                # Model forward pass
                t1 = time.perf_counter()
                predictions = self.model(features)
                MODEL_FORWARD_LATENCY.labels(
                    model_name=self.model.name,
                    model_version=self.model.version,
                    batch_size=str(len(item_ids)),
                ).observe(time.perf_counter() - t1)

                # Record prediction score distribution (for drift detection)
                for pred in predictions:
                    PREDICTION_SCORE_HISTOGRAM.labels(**model_labels).observe(
                        pred.score
                    )

                THROUGHPUT_COUNTER.labels(**model_labels).inc(len(item_ids))
                REQUEST_COUNTER.labels(**model_labels, status="success").inc()

                return predictions

        except Exception:
            REQUEST_COUNTER.labels(**model_labels, status="error").inc()
            raise
        finally:
            ACTIVE_REQUESTS.labels(model_name=self.model.name).dec()


# ---------------------------------------------------------------
# GPU metrics collector (run as a background thread)
# ---------------------------------------------------------------

def collect_gpu_metrics() -> None:
    """
    Query nvidia-smi and update GPU Prometheus gauges.
    Run this in a background thread every 15 seconds.
    """
    try:
        import subprocess
        output = subprocess.check_output(
            ["nvidia-smi",
             "--query-gpu=index,name,utilization.gpu,"
             "memory.used,memory.total,temperature.gpu",
             "--format=csv,noheader,nounits"],
            text=True,
        )
        for line in output.strip().splitlines():
            parts = [p.strip() for p in line.split(",")]
            if len(parts) < 6:
                continue
            idx, name, util, mem_used, mem_total, temp = parts

            GPU_UTILIZATION.labels(
                gpu_index=idx, gpu_name=name
            ).set(float(util))
            GPU_MEMORY_USED_BYTES.labels(gpu_index=idx).set(
                float(mem_used) * 1024 * 1024
            )
            GPU_MEMORY_TOTAL_BYTES.labels(gpu_index=idx).set(
                float(mem_total) * 1024 * 1024
            )
            GPU_TEMPERATURE_CELSIUS.labels(gpu_index=idx).set(float(temp))
    except Exception:
        pass  # do not crash the server if nvidia-smi fails


import threading


def start_gpu_metrics_collector(interval_seconds: int = 15) -> None:
    """Start background thread to collect GPU metrics."""
    def _loop():
        while True:
            collect_gpu_metrics()
            time.sleep(interval_seconds)

    t = threading.Thread(target=_loop, daemon=True)
    t.start()
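None of these metrics are visible until the process exposes a /metrics endpoint for Prometheus to scrape. A minimal startup sketch, assuming the module above is saved as metrics.py and the exporter runs on a dedicated port next to your model server:

# serve.py - expose Prometheus metrics and start the GPU collector (sketch)
from prometheus_client import start_http_server

from metrics import start_gpu_metrics_collector  # the module defined above

if __name__ == "__main__":
    start_http_server(8000)          # Prometheus scrapes http://<pod>:8000/metrics
    start_gpu_metrics_collector(15)  # refresh GPU gauges every 15 seconds
    # ... start the FastAPI / gRPC model server here ...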

Alerting Rules with Alertmanager

Prometheus alert rules define conditions that trigger alerts. Here are production-ready ML alerting rules:

# alerting/ml_alerts.yaml
groups:
  - name: ml_inference_alerts
    interval: 30s
    rules:

      # High error rate on inference endpoint
      - alert: MLInferenceHighErrorRate
        expr: |
          (
            sum by (model_name, model_version) (
              rate(ml_inference_requests_total{status="error"}[5m])
            )
            /
            sum by (model_name, model_version) (
              rate(ml_inference_requests_total[5m])
            )
          ) > 0.01
        for: 2m
        labels:
          severity: critical
          team: ml-platform
        annotations:
          summary: "High ML inference error rate: {{ $labels.model_name }}"
          description: >
            Error rate is {{ $value | humanizePercentage }} for
            model {{ $labels.model_name }} version {{ $labels.model_version }}.
            This exceeds the 1% SLO threshold.

      # P99 latency SLO violation
      - alert: MLInferenceHighLatency
        expr: |
          histogram_quantile(
            0.99,
            rate(ml_inference_duration_seconds_bucket[5m])
          ) > 0.1
        for: 5m
        labels:
          severity: warning
          team: ml-platform
        annotations:
          summary: "ML inference p99 latency above 100ms: {{ $labels.model_name }}"
          description: >
            p99 latency is {{ $value | humanizeDuration }} for
            model {{ $labels.model_name }}.

      # No traffic (model may be down)
      - alert: MLInferenceNoTraffic
        expr: |
          rate(ml_inference_requests_total[10m]) == 0
        for: 10m
        labels:
          severity: warning
          team: ml-platform
        annotations:
          summary: "No inference traffic for model {{ $labels.model_name }}"
          description: "Zero requests in the last 10 minutes. Service may be down."

      # Feature store cache miss rate spike
      - alert: MLFeatureCacheMissRateHigh
        expr: |
          (
            sum by (model_name) (
              rate(ml_feature_retrieval_duration_seconds_count{cache_hit="false"}[5m])
            )
            /
            sum by (model_name) (
              rate(ml_feature_retrieval_duration_seconds_count[5m])
            )
          ) > 0.15
        for: 5m
        labels:
          severity: warning
          team: ml-platform
        annotations:
          summary: "Feature store cache miss rate high: {{ $labels.model_name }}"
          description: >
            Cache miss rate is {{ $value | humanizePercentage }}.
            This may cause degraded model quality if stale features are served.

      # Prediction score distribution drift
      # Alert if the median prediction score shifts by more than 0.1
      - alert: MLPredictionScoreDrift
        expr: |
          abs(
            histogram_quantile(0.5, rate(ml_prediction_score_bucket[1h]))
            - histogram_quantile(0.5, rate(ml_prediction_score_bucket[24h] offset 1h))
          ) > 0.1
        for: 15m
        labels:
          severity: warning
          team: ml-platform
        annotations:
          summary: "Prediction score drift detected: {{ $labels.model_name }}"
          description: >
            Median prediction score has shifted by {{ $value }}
            compared to the 24h baseline. This may indicate input distribution shift.

      # GPU memory near capacity
      - alert: GPUMemoryHigh
        expr: |
          (gpu_memory_used_bytes / gpu_memory_total_bytes) > 0.90
        for: 5m
        labels:
          severity: warning
          team: ml-platform
        annotations:
          summary: "GPU memory above 90% on GPU {{ $labels.gpu_index }}"
          description: >
            GPU {{ $labels.gpu_index }} memory is at
            {{ $value | humanizePercentage }}.

      # GPU utilization low during serving hours
      - alert: GPUUtilizationLow
        expr: |
          gpu_utilization_percent < 30
        for: 10m
        labels:
          severity: info
          team: ml-platform
        annotations:
          summary: "GPU underutilized: {{ $labels.gpu_index }}"
          description: >
            GPU {{ $labels.gpu_index }} utilization is {{ $value }}%.
            Consider reducing instance count or increasing batch size.
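The rules above only fire alerts; Alertmanager decides where they go. A minimal routing sketch (receiver names, channel, and keys are placeholders) that pages on critical alerts and sends everything else to chat:

# alertmanager.yml - route ML platform alerts by severity (illustrative)
route:
  receiver: ml-platform-slack          # default: everything goes to chat
  group_by: [alertname, model_name]
  routes:
    - matchers:
        - severity = "critical"
      receiver: ml-platform-pagerduty  # critical alerts page the on-call engineer
      repeat_interval: 1h

receivers:
  - name: ml-platform-slack
    slack_configs:
      - channel: "#ml-platform-alerts"
        api_url: https://hooks.slack.com/services/placeholder
  - name: ml-platform-pagerduty
    pagerduty_configs:
      - routing_key: placeholder-pagerduty-integration-key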

Loki Log Aggregation

Loki aggregates logs from all your ML services into a central store, queryable with LogQL. The key design decision in Loki is that it only indexes labels (like Prometheus), not log content. This makes it dramatically cheaper than Elasticsearch for high-volume ML logs.

# loki/promtail-config.yaml
# Promtail ships logs from Kubernetes pods to Loki

server:
  http_listen_port: 9080

clients:
  - url: http://loki.monitoring.svc:3100/loki/api/v1/push

scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod

    pipeline_stages:
      # Parse JSON logs (structlog output)
      - json:
          expressions:
            level: level
            event: event
            model_name: model_name
            model_version: model_version
            request_id: request_id
            latency_ms: total_latency_ms

      # Add parsed fields as Loki labels (these become queryable)
      - labels:
          level:
          model_name:
          model_version:

      # Drop DEBUG logs in production to reduce volume
      # (matches the level field extracted by the json stage above)
      - drop:
          source: level
          value: debug

    relabel_configs:
      # Use Kubernetes namespace and pod name as labels
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
      - source_labels: [__meta_kubernetes_pod_label_app]
        target_label: app

Sample LogQL queries for ML debugging:

# All error logs for the recommendation model in the last hour
{app="recommendation-model", level="error"} | json

# Inference requests where total_latency_ms > 100
{app="recommendation-model"} | json
| event = "inference_complete"
| total_latency_ms > 100

# Rate of inference_complete events per second, grouped by model
sum by (model_name) (
  rate({app="recommendation-model"} | json
    | event = "inference_complete" [1m])
)

# Find all requests with a specific correlation ID (end-to-end trace)
{namespace="ml-serving"} | json | correlation_id = "req-abc123"

ML-Specific Metrics Dashboard (Grafana)

A Grafana dashboard for an ML inference service should have four sections:

  1. Service health: request rate, error rate, p50/p95/p99 latency
  2. Model performance: prediction score distribution, top prediction rate
  3. Infrastructure: GPU utilization, GPU memory, CPU, network
  4. Data quality: feature cache hit rate, feature null rate, input feature distribution
{
"title": "ML Inference Dashboard",
"uid": "ml-inference",
"panels": [
{
"title": "Request Rate",
"type": "timeseries",
"targets": [{
"expr": "sum(rate(ml_inference_requests_total[1m])) by (model_name)",
"legendFormat": "{{ model_name }}"
}]
},
{
"title": "Error Rate",
"type": "timeseries",
"targets": [{
"expr": "sum(rate(ml_inference_requests_total{status='error'}[5m])) by (model_name) / sum(rate(ml_inference_requests_total[5m])) by (model_name)",
"legendFormat": "{{ model_name }} error rate"
}],
"fieldConfig": {
"defaults": {
"unit": "percentunit",
"thresholds": {
"steps": [
{"value": 0, "color": "green"},
{"value": 0.005, "color": "yellow"},
{"value": 0.01, "color": "red"}
]
}
}
}
},
{
"title": "Inference Latency (p50 / p95 / p99)",
"type": "timeseries",
"targets": [
{
"expr": "histogram_quantile(0.50, rate(ml_inference_duration_seconds_bucket[5m])) * 1000",
"legendFormat": "p50"
},
{
"expr": "histogram_quantile(0.95, rate(ml_inference_duration_seconds_bucket[5m])) * 1000",
"legendFormat": "p95"
},
{
"expr": "histogram_quantile(0.99, rate(ml_inference_duration_seconds_bucket[5m])) * 1000",
"legendFormat": "p99"
}
],
"fieldConfig": {"defaults": {"unit": "ms"}}
},
{
"title": "GPU Utilization",
"type": "timeseries",
"targets": [{
"expr": "gpu_utilization_percent",
"legendFormat": "GPU {{ gpu_index }}"
}],
"fieldConfig": {
"defaults": {
"unit": "percent",
"min": 0,
"max": 100
}
}
},
{
"title": "Feature Cache Hit Rate",
"type": "stat",
"targets": [{
"expr": "rate(ml_feature_retrieval_duration_seconds_count{cache_hit='true'}[5m]) / rate(ml_feature_retrieval_duration_seconds_count[5m])",
"legendFormat": "Cache Hit Rate"
}],
"fieldConfig": {
"defaults": {
"unit": "percentunit",
"thresholds": {
"steps": [
{"value": 0, "color": "red"},
{"value": 0.8, "color": "yellow"},
{"value": 0.95, "color": "green"}
]
}
}
}
}
]
}

Observability Architecture

Putting the pieces together: each serving pod writes structured JSON logs to stdout (shipped by Promtail to Loki), exposes a /metrics endpoint (scraped by Prometheus, with Alertmanager routing the resulting alerts), and exports OTLP traces to a backend such as Jaeger. Grafana queries all three backends, so a single dashboard can move from a latency panel to the traces and logs of the offending requests.

Production Engineering Notes

Log sampling for high-throughput services: At 100,000 requests per second, emitting a structured log event for every request generates ~50 GB of logs per hour. Use log sampling: log 100% of errors, 10% of warnings, 1% of successful requests. structlog supports this with a custom processor that inspects log level and drops based on a configurable sample rate.
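A minimal sketch of such a processor (the rates are illustrative; levels not listed, including error, default to keeping everything):

# Drop a configurable fraction of low-severity events before they are rendered.
import random
import structlog

SAMPLE_RATES = {"debug": 0.0, "info": 0.01, "warning": 0.1}  # error/critical: keep all

def sample_by_level(logger, method_name, event_dict):
    rate = SAMPLE_RATES.get(method_name, 1.0)
    if rate < 1.0 and random.random() >= rate:
        raise structlog.DropEvent  # structlog silently discards this event
    return event_dict

# Add it to the processor chain just before the renderer:
#   processors = shared_processors + [sample_by_level, structlog.processors.JSONRenderer()]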

Histogram bucket selection: Prometheus histograms are pre-bucketed. If your inference latency is typically 5-50ms but your buckets only start at 0.1s (0.1s, 0.5s, 1s, ...), nearly every observation lands in the first bucket and your percentile calculations become meaningless. Always set buckets appropriate for your workload. A good rule: cover the full expected range with roughly equal logarithmic spacing.
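For example, a small helper (not part of prometheus_client) that generates a 1-2-5 sequence across the range you care about:

# Roughly log-spaced (1-2-5) histogram buckets between two latency bounds, in seconds.
def log_spaced_buckets(lo: float, hi: float) -> list[float]:
    buckets, decade = [], lo
    while decade <= hi:
        for mult in (1, 2, 5):
            value = round(decade * mult, 6)
            if lo <= value <= hi:
                buckets.append(value)
        decade *= 10
    return buckets

print(log_spaced_buckets(0.001, 5.0))
# [0.001, 0.002, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1.0, 2.0, 5.0]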

Cardinality explosion: Every unique combination of label values creates a new time series in Prometheus. A label like user_id or request_id would create millions of time series and crash Prometheus. Labels must have bounded cardinality - model name, model version, GPU index, status code are good labels. User IDs, item IDs, request IDs are not.

OpenTelemetry sampling: Tracing every request at 100K RPS is impractical. Use head-based sampling (decide at the root span whether to trace this request) with a rate like 1%. Use tail-based sampling (decide after the fact, keep all traces that had errors or were slow) for catching long tail issues.
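Head-based sampling is a one-line change to the TracerProvider created in configure_tracing() above; tail-based sampling typically lives in an OpenTelemetry Collector between your services and the tracing backend rather than in the SDK. A sketch of the head-based version:

# Sample ~1% of new traces at the root, but follow the parent's decision for
# child spans so sampled traces stay complete across services.
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

provider = TracerProvider(
    resource=resource,  # the Resource built in configure_tracing()
    sampler=ParentBased(root=TraceIdRatioBased(0.01)),
)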

eBPF for zero-instrumentation profiling: Tools like Parca and Pyroscope use eBPF to continuously profile production processes without modifying application code. They sample call stacks across all processes and aggregate them into flamegraphs. For ML serving, this can reveal unexpected hotspots in Python interpreter overhead, memory allocation patterns, or kernel-level bottlenecks invisible to application-level metrics.

:::danger Dangerous Patterns to Avoid

Do not log sensitive data. Model inputs often contain PII - user IDs, demographic features, location data. Logging these creates compliance risk (GDPR, CCPA). Log identifiers, not values: instead of log.info("features", age=user.age, location=user.location), prefer log.info("features_retrieved", feature_names=list(features.keys())).

Do not use unbounded label cardinality in Prometheus. Adding a label like user_id to a counter will crash your Prometheus instance. Each unique label value combination creates a separate time series. At 1 million users, that is 1 million time series for a single metric. Rule of thumb: if you cannot enumerate all possible values in advance, it should not be a label.

:::

:::warning Common Pitfalls

Alert fatigue: If you define too many alerts with tight thresholds, on-call engineers start ignoring alerts because they fire constantly. Every alert should be actionable. "GPU memory above 70%" is not actionable. "GPU memory above 95%" and "inference p99 latency above 200ms for 10 minutes" are actionable.

Missing the "why" in metrics: High p99 latency is a symptom, not a root cause. Structure your metrics to enable root cause analysis: separate latency metrics for feature retrieval, preprocessing, model forward pass, and postprocessing. When p99 is high, you can immediately see which phase is slow.

Structured log context leaking between requests: If you use structlog.contextvars.bind_contextvars() in async code, ensure you call clear_contextvars() at the end of each request. Otherwise, context from request A leaks into request B's logs, making correlation IDs useless.

:::

Interview Questions and Answers

Q1: What are the three pillars of observability and how do they differ?

A: Logs, metrics, and traces. Logs are discrete events - structured records of what happened at a specific point in time, useful for detailed debugging and understanding exact sequences of events. They are high-cardinality (each log can have unique fields) and should capture enough context to reconstruct what happened. Metrics are aggregated measurements over time - counters, gauges, and histograms that summarize system behavior. They are low-cardinality by design (bounded label sets) and persistent (Prometheus retains metrics for weeks or months). They are best for dashboards, SLO tracking, and alerting. Traces capture causal relationships across service boundaries - a trace shows that request X triggered feature store call Y which took 80ms and model forward pass Z which took 15ms. Traces are typically sampled and used for latency attribution and debugging distributed systems. The key insight is that no single pillar is sufficient: metrics tell you something is wrong, logs tell you what happened, and traces tell you where in the call chain it happened.

Q2: You are building a Prometheus metric for ML inference latency. What histogram bucket boundaries would you use and why?

A: It depends on the expected latency range. For a typical online inference service targeting p99 under 100ms, I would use: [0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0] (in seconds). The buckets should be denser in the typical operating range (1-50ms) to give accurate percentile calculations, and sparser at the extremes. Using log-scale spacing ensures roughly equal accuracy across orders of magnitude. Critically, the largest bucket must be larger than any value you expect to see - if a request takes 10 seconds and your largest bucket is 5 seconds, it falls into the infinity bucket and percentile calculations above p95 or so become inaccurate. For a batch inference service where latency is in seconds, you would use completely different buckets like [0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0, 60.0].

Q3: Explain how correlation IDs enable debugging in a microservices ML system.

A: A correlation ID is a unique identifier attached to a user request at the entry point (usually the API gateway) and propagated through every downstream service call via HTTP headers (X-Correlation-ID). Each service logs the correlation ID with every log event for that request. When something goes wrong - a user reports a bad recommendation, or you see a latency spike - you can take the correlation ID from the user's request or from an error alert and search for it across all your logs (in Loki, Elasticsearch, or any centralized logging system). You immediately see the complete sequence of events: the API gateway received the request at 14:23:01.200, the feature store was queried at 14:23:01.210 with cache miss, the model inference ran at 14:23:01.310 (100ms later because the cache miss caused a slow database query), and the response was returned at 14:23:01.325. Without correlation IDs, you would need to manually correlate logs across 5 services using timestamps and user IDs, which is error-prone and slow.

Q4: What is cardinality in the context of Prometheus, and why is it dangerous?

A: Cardinality is the number of unique time series Prometheus stores. Each unique combination of metric name and label values creates a separate time series. Prometheus keeps the index and recent samples for every active series in memory, so cardinality directly impacts memory usage. High cardinality occurs when labels have many possible values - the classic example is using request_id, user_id, or url_path (with query parameters) as labels. A system with 1 million users and a user_id label on a single counter metric would create 1 million time series. Each active series costs memory for its labels, index entries, and in-memory chunks, and at millions of series a single instance can run out of memory and crash. The fix is to never use high-cardinality values as labels. If you need per-user analytics, use logs (which can have arbitrary fields) rather than metrics. For URLs, strip query parameters and aggregate to path templates (/users/{id} instead of /users/123).

Q5: How would you detect that a production ML model is receiving input data that has drifted from its training distribution?

A: Several approaches, from simple to sophisticated. The simplest is monitoring the output distribution: track the histogram of prediction scores with PREDICTION_SCORE_HISTOGRAM. If the trained model produces scores peaking around 0.8-0.9 on typical inputs, but production scores start peaking at 0.5, something has changed. For input monitoring, log summary statistics of key features (mean, p10, p90, null rate) and track them as Prometheus gauges. Alert if the mean of a feature drifts by more than two standard deviations from the training baseline. More sophisticated approaches use statistical tests: the Population Stability Index (PSI) or Kolmogorov-Smirnov test to compare the distribution of a feature between a reference window (training data or last week's production) and the current window. Tools like Evidently or WhyLogs implement these tests and expose them as metrics. The key insight is that you need both input and output monitoring - input drift warns you that the model is operating outside its training distribution even while its outputs still look plausible, and output drift tells you the model's behavior is changing even when no individual feature looks obviously wrong.
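A minimal PSI sketch for a single numeric feature (pure NumPy; the thresholds are the common rule of thumb, and the quantile-based bin edges assume a continuous feature with distinct quantiles):

import numpy as np

def population_stability_index(reference: np.ndarray, current: np.ndarray,
                               n_bins: int = 10) -> float:
    """PSI between a reference sample (e.g. training data) and current production data."""
    # Quantile bin edges from the reference distribution (avoids empty reference bins)
    edges = np.quantile(reference, np.linspace(0.0, 1.0, n_bins + 1))
    ref_counts, _ = np.histogram(reference, bins=edges)
    # Clip production values into the reference range so out-of-range values
    # land in the first/last bin instead of being silently dropped
    cur_counts, _ = np.histogram(np.clip(current, edges[0], edges[-1]), bins=edges)

    ref_frac = np.clip(ref_counts / len(reference), 1e-6, None)
    cur_frac = np.clip(cur_counts / len(current), 1e-6, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

# Common rule of thumb: PSI < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 major shift.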

Q6: What is OpenTelemetry and why was it created?

A: OpenTelemetry is a vendor-neutral open standard and SDK for collecting telemetry data (traces, metrics, and logs) from applications. It was created in 2019 by merging two competing standards: OpenCensus (from Google) and OpenTracing (CNCF). Before OpenTelemetry, every observability backend (Datadog, New Relic, Jaeger, Zipkin) had its own SDK. If you instrumented your application with the Jaeger SDK, switching to Datadog required replacing all your instrumentation code. OpenTelemetry separates the instrumentation API from the exporter. You instrument your code once against the OTel API, then configure exporters to send data to any backend (Jaeger, Tempo, Datadog, Honeycomb) without changing application code. For ML systems, this is important because the observability backend often changes as teams grow and requirements evolve.

Q7: Explain the difference between logs and metrics for an ML serving system. When should you use each?

A: Use metrics when you need aggregation and alerting. Metrics answer questions like: "What is our p99 inference latency over the last hour?" or "How many requests per second are we serving?" They are cheap to query because Prometheus pre-aggregates them. Use logs when you need to debug a specific request or understand what happened in detail. Logs answer questions like: "What features did this specific request receive?" or "Why did this specific user get prediction X instead of Y?" Logs retain full event detail but are expensive to query at scale. The rule of thumb: if you need to alert on it, make it a metric. If you need to debug an individual request, log it. If you need both (e.g., inference latency: alert when p99 is high AND debug which requests were slow), use both: a Prometheus histogram for the aggregate and a structured log with total_latency_ms for per-request details. OpenTelemetry Logs is attempting to unify these, but in practice Prometheus metrics and structured logs remain separate tools optimized for their respective use cases.

© 2026 EngineersOfAI. All rights reserved.