
Prometheus and Grafana for ML

Building the ML Observability Stack From Scratch

You've joined an ML team that has a model in production but no meaningful monitoring. The data scientist who deployed it checks the server's CPU dashboard occasionally. The model serves 800K predictions per day. Nobody knows if it's performing well. Nobody gets paged when it slows down. The business reviews model accuracy quarterly.

Your task: build the observability stack. Not the infrastructure observability (that exists) - the ML-specific observability. Custom prediction latency histograms broken down by inference stage. Feature freshness gauges. Prediction score distribution monitoring. Drift detection alerts. Grafana dashboards that tell you in 10 seconds whether the model is healthy.

This lesson walks through the complete stack: Prometheus for metrics collection and storage, custom metrics instrumented in your model server, PromQL for querying, and Grafana for visualization.


Prometheus Architecture

Prometheus follows a pull-based metrics collection model. Instead of your application pushing metrics to a central server, Prometheus scrapes metrics from HTTP endpoints exposed by your application.
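To make the pull model concrete, here is a minimal sketch of what a scraper sees when it fetches a /metrics endpoint. The sample payload is hand-written for illustration (a real Python client also emits extra lines such as `_created` samples), and the `parse_scrape` helper is hypothetical: it handles only the simple cases shown here, not escaped quotes, timestamps, or exemplars.

```python
import re

# Hand-written sample of the plain-text exposition format (illustrative only).
SAMPLE_SCRAPE = '''\
# HELP model_requests_total Total prediction requests
# TYPE model_requests_total counter
model_requests_total{model_name="fraud-model",status="success"} 1027.0
model_requests_total{model_name="fraud-model",status="error"} 3.0
# HELP feature_freshness_seconds Seconds since last feature update
# TYPE feature_freshness_seconds gauge
feature_freshness_seconds{feature_group="user_txn"} 3600.0
'''

# name, optional {labels}, then the sample value
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_scrape(text):
    """Return a list of (metric_name, labels_dict, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None:
            continue  # ignore anything this simplified parser cannot read
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


samples = parse_scrape(SAMPLE_SCRAPE)
for name, labels, value in samples:
    print(name, labels, value)
```

Each scrape is a snapshot of current values. Prometheus attaches the scrape timestamp and stores one sample per series, which is why the application never needs to know where Prometheus lives.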

Installing Prometheus with the kube-prometheus-stack

# Install using Helm (includes Prometheus, Alertmanager, Grafana, kube-state-metrics, node-exporter)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set grafana.adminPassword=your-secure-password \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=100Gi

The Four Prometheus Metric Types

Prometheus provides four metric types. Choosing the right type matters for correctness and query efficiency:

Counter: monotonically increasing value. Never decreases. Resets to 0 on pod restart. Use for: request counts, error counts, bytes processed, predictions made.

from prometheus_client import Counter

REQUEST_COUNT = Counter(
    "model_requests_total",
    "Total prediction requests",
    ["model_name", "status"]
)

# Usage
REQUEST_COUNT.labels(model_name="fraud-model", status="success").inc()
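Because a counter resets to 0 on restart, queries work with its increase over a window and treat any drop as a reset. The sketch below illustrates that idea in plain Python. It is a simplification: the function names are hypothetical, and PromQL's rate() additionally extrapolates to the window boundaries, which this version does not do.

```python
def counter_increase(samples):
    """Total increase over (timestamp, value) samples.

    A value lower than its predecessor is treated as a counter reset,
    so the new value itself counts as the increase since the restart.
    """
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        if cur >= prev:
            total += cur - prev
        else:
            total += cur  # counter restarted from 0
    return total


def simple_rate(samples):
    """Per-second rate between the first and last sample (no extrapolation)."""
    if len(samples) < 2:
        return 0.0
    elapsed = samples[-1][0] - samples[0][0]
    return counter_increase(samples) / elapsed


# Scrapes every 15 s; the pod restarts between t=30 and t=45.
scrapes = [(0, 100.0), (15, 130.0), (30, 160.0), (45, 10.0), (60, 40.0)]
print(counter_increase(scrapes))  # increase despite the reset
print(simple_rate(scrapes))       # requests per second
```

This is why the raw counter value is rarely graphed directly: the per-second rate stays meaningful across restarts, while the raw value does not.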

Gauge: can go up or down. Use for: current value of something. Memory usage, active connections, current model version.

from prometheus_client import Gauge

FEATURE_FRESHNESS = Gauge(
    "feature_freshness_seconds",
    "Seconds since last feature update",
    ["feature_group"]
)

GPU_MEMORY_FREE_RATIO = Gauge(
    "model_gpu_memory_free_ratio",
    "Fraction of GPU memory free",
    ["model_name"]
)

# Usage
FEATURE_FRESHNESS.labels(feature_group="user_txn").set(3600) # 1 hour old

Histogram: samples observations and counts them in configurable buckets. Use for: latency, response sizes, batch sizes. Enables percentile computation.

from prometheus_client import Histogram

REQUEST_LATENCY = Histogram(
    "model_request_latency_seconds",
    "Prediction request latency",
    ["model_name", "stage"],
    # Bucket boundaries tuned for ML serving latency profile:
    buckets=[0.005, 0.01, 0.025, 0.05, 0.075, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]
)

PREDICTION_SCORE = Histogram(
    "model_prediction_score",
    "Distribution of prediction scores (0-1)",
    ["model_name"],
    # Buckets for score distribution monitoring
    buckets=[0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
)

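# Usage with context manager (inside an async handler; fetch_features is
# assumed to be your own feature-store client)
async def get_features(user_id):
    with REQUEST_LATENCY.labels(model_name="fraud-model", stage="feature_fetch").time():
        return await fetch_features(user_id)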

Summary: similar to Histogram but computes percentiles client-side. Avoid for ML - Histograms are more flexible for server-side percentile computation and aggregation across replicas.

The Complete ML Model Server Metrics Setup

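# metrics.py - centralized metrics for the fraud model server
# fetch_features() and run_inference() are assumed to be defined elsewhere
# in your serving code.
from prometheus_client import Counter, Histogram, Gauge, Info
from prometheus_client import generate_latest, CONTENT_TYPE_LATEST
from fastapi import FastAPI, Response
from pydantic import BaseModel
import time

app = FastAPI()


class PredictionRequest(BaseModel):
    # Assumed request schema: the handler below only reads user_id
    user_id: str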

# ---- Counter metrics ----
REQUESTS_TOTAL = Counter(
    "fraud_model_requests_total",
    "Total prediction requests",
    ["status", "decision"]  # status: success/error/timeout, decision: approved/declined/error
)

FEATURE_CACHE = Counter(
    "fraud_model_feature_cache_ops_total",
    "Feature cache operations",
    ["operation"]  # operation: hit/miss
)

# ---- Gauge metrics ----
ACTIVE_REQUESTS = Gauge(
    "fraud_model_active_requests",
    "Number of requests currently being processed"
)

MODEL_SCORE_PSI = Gauge(
    "fraud_model_score_psi",
    "Population Stability Index of prediction scores vs reference"
)

MODEL_APPROVAL_RATE = Gauge(
    "fraud_model_approval_rate_1h",
    "Fraction of requests resulting in approval over the last hour"
)

FEATURE_AGE_SECONDS = Gauge(
    "fraud_model_feature_age_seconds",
    "Age of the oldest feature used in predictions",
    ["feature_group"]
)

# ---- Histogram metrics ----
REQUEST_DURATION = Histogram(
    "fraud_model_request_duration_seconds",
    "End-to-end request duration",
    buckets=[0.01, 0.025, 0.05, 0.075, 0.1, 0.15, 0.2, 0.3, 0.5, 1.0, 2.0]
)

STAGE_DURATION = Histogram(
    "fraud_model_stage_duration_seconds",
    "Duration of each processing stage",
    ["stage"],  # stage: feature_fetch / inference / postprocess
    buckets=[0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0]
)

PREDICTION_SCORE_HIST = Histogram(
    "fraud_model_prediction_score",
    "Distribution of fraud probability scores",
    buckets=[0.0, 0.05, 0.1, 0.15, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 1.0]
)

BATCH_SIZE_HIST = Histogram(
    "fraud_model_batch_size",
    "Number of samples in each inference batch",
    buckets=[1, 2, 4, 8, 16, 32, 64, 128]
)

# ---- Info metric ----
MODEL_INFO = Info(
    "fraud_model",
    "Information about the loaded model"
)
MODEL_INFO.info({
    "version": "v2.1.0",
    "framework": "pytorch",
    "training_date": "2026-01-15",
    "architecture": "transformer-fraud-v3"
})


@app.get("/metrics")
def metrics():
    return Response(
        generate_latest(),
        media_type=CONTENT_TYPE_LATEST
    )


@app.post("/predict")
async def predict(request: PredictionRequest):
    ACTIVE_REQUESTS.inc()
    start = time.perf_counter()

    try:
        # Stage 1: feature fetch
        t1 = time.perf_counter()
        features, cache_hit = await fetch_features(request.user_id)
        STAGE_DURATION.labels("feature_fetch").observe(time.perf_counter() - t1)
        FEATURE_CACHE.labels("hit" if cache_hit else "miss").inc()

        # Stage 2: inference
        t2 = time.perf_counter()
        score = await run_inference(features)
        STAGE_DURATION.labels("inference").observe(time.perf_counter() - t2)
        PREDICTION_SCORE_HIST.observe(score)

        # Stage 3: postprocess
        t3 = time.perf_counter()
        decision = "approved" if score < 0.5 else "declined"
        STAGE_DURATION.labels("postprocess").observe(time.perf_counter() - t3)

        # Record overall metrics
        REQUEST_DURATION.observe(time.perf_counter() - start)
        REQUESTS_TOTAL.labels(status="success", decision=decision).inc()

        return {"score": score, "decision": decision}

    except Exception:
        # Count the failure, then let FastAPI handle the exception
        REQUESTS_TOTAL.labels(status="error", decision="error").inc()
        raise
    finally:
        ACTIVE_REQUESTS.dec()

PromQL for ML Monitoring

PromQL (Prometheus Query Language) is the query language for Prometheus. Essential PromQL patterns for ML:

Request Rate

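# Current request rate (requests per second, 5-minute window)
sum(rate(fraud_model_requests_total[5m]))

# Request rate by status (visualize success vs error)
sum by (status) (rate(fraud_model_requests_total[5m]))

# Error rate percentage
# Both sides are aggregated with sum() so the label sets match; without it,
# the division only pairs error series with themselves and always yields 100.
sum(rate(fraud_model_requests_total{status="error"}[5m])) /
sum(rate(fraud_model_requests_total[5m])) * 100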

Latency Percentiles

# p50 (median) request latency
histogram_quantile(0.50,
  sum by (le) (rate(fraud_model_request_duration_seconds_bucket[5m]))
)

# p95 request latency
histogram_quantile(0.95,
  sum by (le) (rate(fraud_model_request_duration_seconds_bucket[5m]))
)

# p99 request latency (the SLO metric)
histogram_quantile(0.99,
  sum by (le) (rate(fraud_model_request_duration_seconds_bucket[5m]))
)

# Latency breakdown by stage - identify the bottleneck
histogram_quantile(0.95,
  sum by (le, stage) (rate(fraud_model_stage_duration_seconds_bucket[5m]))
)
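The estimate behind these queries can be sketched in a few lines of Python. This is a simplified illustration of the interpolation that histogram_quantile() performs on classic histograms, not the Prometheus implementation itself. It assumes cumulative bucket counts, a lower bound of 0 for the first bucket, and a final +Inf bucket; the replica data is invented.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with a math.inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # No finite upper edge (or empty bucket) to interpolate in.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Cumulative bucket counts from two replicas (made-up numbers).
replica_a = [(0.05, 80), (0.1, 95), (0.25, 100), (math.inf, 100)]
replica_b = [(0.05, 10), (0.1, 40), (0.25, 90), (math.inf, 100)]

# Equivalent of `sum by (le)`: add counts bucket by bucket.
fleet = [(bound, a + b) for (bound, a), (_, b) in zip(replica_a, replica_b)]

print(histogram_quantile(0.50, fleet))
print(histogram_quantile(0.95, fleet))
```

The result can never be more precise than the bucket it falls in, which is why bucket boundaries near the latencies you care about matter so much.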

Prediction Score Distribution Monitoring

# Current approval rate (fraction of scores at or below the 0.5 threshold;
# the le="0.5" bucket includes scores equal to 0.5, so this is a close
# approximation of the server's `score < 0.5` rule)
sum(rate(fraud_model_prediction_score_bucket{le="0.5"}[1h])) /
sum(rate(fraud_model_prediction_score_count[1h]))

# Compare current to 7-day-ago (drift detection)
(
  sum(rate(fraud_model_prediction_score_bucket{le="0.5"}[1h]))
  /
  sum(rate(fraud_model_prediction_score_count[1h]))
) -
(
  sum(rate(fraud_model_prediction_score_bucket{le="0.5"}[1h] offset 7d))
  /
  sum(rate(fraud_model_prediction_score_count[1h] offset 7d))
)

Feature Cache Hit Rate

# Feature cache hit rate (target: > 90%)
# sum() drops the `operation` label so hits can be divided by hits + misses;
# adding the raw hit and miss series directly matches nothing because their
# label sets differ.
sum(rate(fraud_model_feature_cache_ops_total{operation="hit"}[5m])) /
sum(rate(fraud_model_feature_cache_ops_total[5m]))

GPU Utilization for ML Serving

# Average GPU utilization across all model server pods
avg(DCGM_FI_DEV_GPU_UTIL{exported_namespace="ml-prod", exported_pod=~"fraud-model-.*"})

# GPU memory pressure per pod
(DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)) * 100

Grafana Dashboard Design for ML

A well-designed ML Grafana dashboard has a specific layout that tells the story at a glance:

┌────────────────────────────────────────────────────────────────────────┐
│ Row 1: KEY HEALTH INDICATORS (stat panels, color-coded)                │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐    │
│ │ Req Rate     │ │ Error Rate   │ │ p99 Latency  │ │ Score PSI    │    │
│ │ 1,247 RPS    │ │ 0.02% ✓      │ │ 87ms ✓       │ │ 0.08 ✓       │    │
│ └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘    │
│                                                                        │
│ Row 2: SERVING PERFORMANCE (time series)                               │
│ ┌────────────────────────────────┐ ┌────────────────────────────────┐  │
│ │ Request Latency by Stage       │ │ Request Rate and Error Rate    │  │
│ │ [graph: feature/infer/post]    │ │ [graph: rps + error %]         │  │
│ └────────────────────────────────┘ └────────────────────────────────┘  │
│                                                                        │
│ Row 3: MODEL BEHAVIOR (time series)                                    │
│ ┌────────────────────────────────┐ ┌────────────────────────────────┐  │
│ │ Prediction Score Distribution  │ │ Approval Rate vs 7-day Avg     │  │
│ │ [heatmap: score buckets]       │ │ [graph: current + baseline]    │  │
│ └────────────────────────────────┘ └────────────────────────────────┘  │
│                                                                        │
│ Row 4: INFRASTRUCTURE                                                  │
│ ┌────────────────────────────────┐ ┌────────────────────────────────┐  │
│ │ GPU Utilization per Pod        │ │ CPU + Memory per Pod           │  │
│ │ [graph: DCGM_GPU_UTIL]         │ │ [graph: resource usage]        │  │
│ └────────────────────────────────┘ └────────────────────────────────┘  │
└────────────────────────────────────────────────────────────────────────┘

Grafana Dashboard as Code (JSON Model)

{
  "title": "Fraud Model - Production Monitoring",
  "uid": "fraud-model-prod",
  "tags": ["ml", "fraud", "production"],
  "time": {"from": "now-6h", "to": "now"},
  "refresh": "30s",
  "panels": [
    {
      "type": "stat",
      "title": "Request Rate (RPS)",
      "gridPos": {"h": 4, "w": 6, "x": 0, "y": 0},
      "fieldConfig": {
        "defaults": {
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {"color": "green", "value": null},
              {"color": "yellow", "value": 2000},
              {"color": "red", "value": 5000}
            ]
          }
        }
      },
      "targets": [{
        "expr": "sum(rate(fraud_model_requests_total[5m]))",
        "legendFormat": "RPS"
      }]
    },
    {
      "type": "timeseries",
      "title": "Request Latency by Stage (p95)",
      "gridPos": {"h": 8, "w": 12, "x": 0, "y": 8},
      "targets": [
        {
          "expr": "histogram_quantile(0.95, sum by (le, stage) (rate(fraud_model_stage_duration_seconds_bucket[5m])))",
          "legendFormat": "{{stage}}"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "unit": "s",
          "custom": {"lineWidth": 2}
        }
      }
    }
  ]
}

Store dashboards as JSON in your Git repository and provision them via Grafana's ConfigMap provisioning:

# grafana-dashboards-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-ml-dashboards
  namespace: monitoring
  labels:
    grafana_dashboard: "1" # Grafana sidecar auto-imports dashboards with this label
data:
  fraud-model-dashboard.json: |
    { ... your dashboard JSON ... }

Thanos - Long-Term Metric Storage

Prometheus defaults to 15-day local storage. For ML monitoring, you need longer retention:

  • Drift analysis requires comparing to data from months ago
  • Seasonal pattern detection needs at least 1 year of history
  • Regulatory requirements may mandate 1–3 year retention

Thanos extends Prometheus with object storage (S3/GCS) for unlimited retention:

# Thanos sidecar - runs alongside Prometheus, ships blocks to S3
containers:
  - name: thanos-sidecar
    image: quay.io/thanos/thanos:v0.34.0
    args:
      - sidecar
      - --tsdb.path=/prometheus # same volume as Prometheus
      - --objstore.config-file=/etc/thanos/objstore.yaml
      - --prometheus.url=http://localhost:9090
    volumeMounts:
      - name: prometheus-data
        mountPath: /prometheus
      - name: thanos-config
        mountPath: /etc/thanos
---
# objstore.yaml
type: S3
config:
  bucket: company-prometheus-long-term
  region: us-east-1
  endpoint: s3.amazonaws.com

With Thanos, you can query long-range drift trends, such as a year-over-year comparison:

# Compare this week's approval rate to the same week last year
avg_over_time(fraud_model_approval_rate_1h[7d]) -
avg_over_time(fraud_model_approval_rate_1h[7d] offset 364d)

Production Notes

Expose /metrics on a dedicated port. In Kubernetes, configure a separate port for Prometheus scraping so it doesn't conflict with your API traffic:

# In the pod spec
ports:
  - containerPort: 8080
    name: http # API traffic
  - containerPort: 9090
    name: metrics # Prometheus scraping only

Use ServiceMonitor for automatic scrape config. With the Prometheus Operator, ServiceMonitor resources auto-configure scraping without modifying prometheus.yml:

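apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: fraud-model-metrics
  namespace: ml-prod
  labels:
    # With the chart's default selector settings, Prometheus only picks up
    # ServiceMonitors labeled with the Helm release name used at install.
    release: kube-prometheus-stack
spec:
  selector:
    matchLabels:
      app: fraud-model
  endpoints:
    - port: metrics
      interval: 15s
      path: /metrics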

Common Mistakes

:::danger Using Summary Instead of Histogram Summaries compute percentiles client-side and cannot be aggregated across replicas. If your model server has 3 replicas, summary_quantile{quantile="0.99"} from each replica cannot be combined to give the fleet-wide p99. Histograms aggregate correctly: sum all three replicas' bucket counters, then apply histogram_quantile(). Always use Histograms for latency in Kubernetes where you have multiple pod replicas. :::
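A small numeric illustration of why per-replica quantiles cannot be combined by averaging. The latencies are invented and the nearest-rank `percentile` helper is a deliberately simple stand-in for a client-side quantile; the point is only the gap between the two results.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile of a list of observations."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[min(rank, len(ordered)) - 1]


# Replica A: lightly loaded, every request takes 20 ms.
replica_a = [0.020] * 100
# Replica B: takes most of the traffic and has a slow tail of 500 ms requests.
replica_b = [0.020] * 870 + [0.500] * 30

p99_a = percentile(replica_a, 0.99)
p99_b = percentile(replica_b, 0.99)

# What averaging two Summary quantiles would report.
avg_of_p99s = (p99_a + p99_b) / 2

# The true fleet-wide p99, computed over all requests together.
fleet_p99 = percentile(replica_a + replica_b, 0.99)

print(f"avg of per-replica p99s: {avg_of_p99s:.3f}s")
print(f"fleet-wide p99:          {fleet_p99:.3f}s")
```

In this made-up case the averaged figure is roughly half the real fleet p99, because the average weights both replicas equally even though one of them serves nine times the traffic.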

:::warning Hardcoding Prometheus Scrape Configs Adding model server scrape targets directly to prometheus.yml means every new model server requires a manual config update and Prometheus reload. Use ServiceMonitors (Prometheus Operator) or PodMonitors to dynamically discover scraping targets based on labels. New pods are scraped automatically without any configuration change. :::

:::danger Not Setting Appropriate Histogram Buckets Default histogram buckets are designed for general web service latency: the Go client uses [.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10], and the Python client adds .075, .75, and 7.5 to that list. For an ML model that consistently responds in 40–100 ms, you want finer-grained buckets in that range: [0.01, 0.02, 0.03, 0.04, 0.05, 0.07, 0.1, 0.15, 0.2, 0.3, 0.5, 1.0]. Poorly placed buckets give imprecise percentile estimates because histogram_quantile() uses linear interpolation within buckets. Profile your model's latency distribution before choosing bucket boundaries. :::
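One possible way to turn a latency profile into bucket boundaries is to place a boundary just above each percentile you intend to query. The sketch below is a heuristic under that assumption, not an official recommendation; `suggest_buckets`, `round_up`, and the sample data are all hypothetical.

```python
import math


def round_up(value, digits=2):
    """Round a positive value up to `digits` significant figures."""
    if value <= 0:
        return 0.0
    exponent = math.floor(math.log10(value))
    factor = 10 ** (digits - 1 - exponent)
    return math.ceil(value * factor) / factor


def suggest_buckets(latencies, quantiles=(0.5, 0.75, 0.9, 0.95, 0.99, 0.999)):
    """Return sorted bucket upper bounds placed at the given percentiles."""
    ordered = sorted(latencies)
    bounds = set()
    for q in quantiles:
        rank = max(1, math.ceil(q * len(ordered)))
        observed = ordered[min(rank, len(ordered)) - 1]
        bounds.add(round_up(observed))  # boundary sits at or just above the percentile
    return sorted(bounds)


# Profiled latencies in seconds: mostly 40-90 ms with a few slow outliers.
profile = [0.040 + 0.0005 * i for i in range(100)] + [0.150, 0.220, 0.400]
buckets = suggest_buckets(profile)
print(buckets)
```

The output would be passed as the `buckets=` argument of a Histogram. It is worth adding one or two boundaries above the largest suggested value so that a latency regression still lands in a finite bucket.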

Interview Q&A

Q1: Explain the difference between Counter, Gauge, and Histogram in Prometheus. Give an ML-specific example of each.

Counter: monotonically increasing number that never decreases (resets to 0 on restart). ML example: model_predictions_total - total number of predictions served. Use rate() to compute predictions per second.

Gauge: any value that can increase or decrease. ML example: feature_freshness_seconds - current age of the most recent feature update. Set directly to the current age.

Histogram: counts observations in predefined buckets and exposes bucket counts, sum, and total count. ML example: model_request_latency_seconds with buckets at 10ms, 25ms, 50ms, 100ms, 250ms. Enables computing any percentile server-side with histogram_quantile(0.99, ...).

Q2: How does histogram_quantile() work in PromQL and why must you use Histograms (not Summaries) for multi-replica ML services?

histogram_quantile(p, histogram_metric) estimates the p-th quantile by finding the first bucket whose cumulative count reaches p × total_count and interpolating linearly within that bucket. It works on aggregated bucket data. The key: Histogram bucket counters from multiple replicas can be summed (sum by (le) (rate(...))), then passed to histogram_quantile to get the fleet-wide percentile. Summary quantiles are computed per-replica and cannot be mathematically combined. avg(summary_quantile{quantile="0.99"}) does not give the fleet p99 - it gives the average of three independent p99s, which can land well below or above the true p99 when replicas carry different load.

Q3: How would you instrument a model serving endpoint to expose latency broken down by processing stage?

Define a Histogram metric with a stage label: model_stage_duration_seconds{stage=["feature_fetch", "inference", "postprocess"]}. In the serving code, time each stage separately, either with the histogram's .time() context manager or by calling .observe() with a measured duration. This creates a separate set of bucket series per stage under the same metric, all queryable with histogram_quantile. In Grafana, set legendFormat: "{{stage}}" to show separate lines per stage on the same graph. During an incident, you can immediately see which stage is slow: if feature_fetch p95 is 200ms and inference p95 is 40ms, the feature store is the bottleneck, not the model.

Q4: What is Thanos and when would an ML team need it?

Thanos extends Prometheus with object storage (S3, GCS, Azure Blob) for unlimited long-term metric retention. The Thanos sidecar runs alongside Prometheus and uploads each completed 2-hour TSDB block to object storage. The Thanos Querier then provides a query layer that transparently reads from both local Prometheus storage (for recent data) and object storage (for historical data). ML teams need Thanos when: (1) drift analysis requires comparing to data more than 15 days old (Prometheus default retention), (2) seasonal pattern detection needs 1+ year of metric history, (3) regulatory compliance requires multi-year metric retention for auditable ML behavior, (4) multi-cluster deployments need a unified query layer across several Prometheus instances (Thanos Querier federates them).

Q5: Design a Prometheus metrics schema for an ML model serving endpoint that enables: request rate monitoring, latency SLO tracking, model behavior monitoring, and feature quality monitoring.

Metrics schema:

  • model_requests_total (Counter, labels: model_name, status, decision) - for request rate and error rate.
  • model_request_duration_seconds (Histogram, labels: model_name, stage, buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0]) - for latency SLO tracking with histogram_quantile(0.99, ...).
  • model_prediction_score (Histogram, labels: model_name, buckets: [0.0, 0.1, ..., 1.0]) - for score distribution monitoring and approval rate tracking.
  • feature_freshness_seconds (Gauge, labels: feature_group) - for feature staleness alerts.
  • feature_null_rate (Gauge, labels: feature_name) - for data quality monitoring.
  • model_score_psi (Gauge, labels: model_name) - computed hourly by a separate job, represents the PSI between reference and current score distribution.
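The PSI gauge in this schema has to be computed somewhere. A minimal sketch of that calculation is shown here, assuming the job has the score histogram's cumulative bucket counts for a reference window and a current window. The function names, the epsilon floor, and the example counts are illustrative assumptions rather than part of any library.

```python
import math


def per_bucket_counts(cumulative):
    """Convert cumulative `le` bucket counts into per-bucket counts."""
    return [cumulative[0]] + [b - a for a, b in zip(cumulative, cumulative[1:])]


def population_stability_index(reference_counts, current_counts, eps=1e-6):
    """PSI = sum((cur% - ref%) * ln(cur% / ref%)) over matching buckets.

    `eps` floors empty buckets so the logarithm stays defined.
    """
    ref_total = sum(reference_counts)
    cur_total = sum(current_counts)
    psi = 0.0
    for ref, cur in zip(reference_counts, current_counts):
        ref_share = max(ref / ref_total, eps)
        cur_share = max(cur / cur_total, eps)
        psi += (cur_share - ref_share) * math.log(cur_share / ref_share)
    return psi


# Cumulative counts for four score buckets (made-up numbers).
reference = per_bucket_counts([10, 30, 60, 100])   # scores skew high
current = per_bucket_counts([40, 70, 90, 100])     # scores now skew low

print(population_stability_index(reference, reference))  # no shift
print(population_stability_index(reference, current))    # large shift
```

The periodic job would then publish the result by calling .set() on the PSI gauge. A commonly quoted rule of thumb treats values below 0.1 as stable and values above 0.25 as a significant shift, though the right alert threshold depends on the model.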

© 2026 EngineersOfAI. All rights reserved.