What is ML infrastructure monitoring?

Monitoring the infrastructure layer of ML systems - CPU, GPU, memory, latency, the four monitoring layers, custom ML metrics with Prometheus, and building the observability foundation for model quality monitoring.

Infrastructure Monitoring for ML Systems

The 4am Page That Took Three Hours to Diagnose

At 4:07am, PagerDuty wakes the on-call ML engineer. Alert: "fraud_model_p99_latency > 500ms." She opens the Grafana dashboard. CPU utilization: 45% - fine. Memory: 62% - fine. Error rate: 0.02% - fine. GPU utilization: 89% - suspicious. She checks the GPU memory dashboard. All full. But the feature server CPU is at 97%.

Three hours of log-spelunking later, she finds the cause: the feature store's Redis connection pool exhausted at 3:51am. The feature server started falling back to database lookups, which are 40x slower than Redis hits. Inference latency spiked because the inference step was waiting for slow features, not because the model itself was slow.

The monitoring system detected the symptom (high latency) but could not explain the cause because it had no instrumentation on the feature retrieval step, the Redis connection pool health, or the cascade between them. Proper infrastructure monitoring would have shown the Redis pool exhaustion at 3:51am with a clear causal chain: Redis pool exhausted → feature retrieval degrades → inference latency spikes.

This lesson shows you how to build infrastructure monitoring that tells you not just what broke, but why.


The Four Monitoring Layers for ML

Most ML teams over-instrument Layer 1 (easy, standard tooling) and under-instrument Layers 3 and 4 (harder, ML-specific). But silent failures hide in Layers 3 and 4. Build all four.

Layer 1: Infrastructure Metrics

Standard infrastructure metrics, scraped by Prometheus from cAdvisor (via the kubelet), kube-state-metrics, node-exporter, and the DCGM Exporter:

CPU and Memory

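# cAdvisor (via the kubelet) exposes the container_* metrics; kube-state-metrics exposes kube_pod_*
# Key PromQL queries for ML workloads:

# CPU usage per pod, as % of one core (200 = two full cores)
# container!="" drops the pod-level aggregate series so usage is not double-counted
sum by (pod) (rate(container_cpu_usage_seconds_total{
  namespace="ml-prod",
  pod=~"fraud-model-.*",
  container!=""
}[5m])) * 100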

# Memory usage vs limit (%)
container_memory_usage_bytes{namespace="ml-prod", pod=~"fraud-model-.*"} /
container_spec_memory_limit_bytes{namespace="ml-prod", pod=~"fraud-model-.*"} * 100

# OOMKill events (these should be zero)
kube_pod_container_status_last_terminated_reason{reason="OOMKilled",namespace="ml-prod"} == 1

# Pod restart count (high restarts = instability)
kube_pod_container_status_restarts_total{namespace="ml-prod", pod=~"fraud-model-.*"}

GPU Metrics (DCGM Exporter)

# GPU utilization - target: 70-90% for serving (too low = waste, too high = latency spike risk)
DCGM_FI_DEV_GPU_UTIL{exported_namespace="ml-prod", exported_pod=~"fraud-model-.*"}

# GPU memory used (MiB - DCGM reports framebuffer memory in MiB, not bytes)
DCGM_FI_DEV_FB_USED{exported_namespace="ml-prod"}

# GPU memory free - alert if less than 2 GiB (2048 MiB) free
DCGM_FI_DEV_FB_FREE{exported_namespace="ml-prod"} < 2048

# GPU temperature - throttling thresholds vary by GPU model; 80°C is an early warning
DCGM_FI_DEV_GPU_TEMP > 80

# ECC errors - double-bit errors indicate hardware failure
DCGM_FI_DEV_ECC_DBE_VOL_TOTAL > 0

Disk and Network

# Disk usage on nodes with model artifact storage
node_filesystem_avail_bytes{mountpoint="/models"}
  / node_filesystem_size_bytes{mountpoint="/models"} * 100 < 20
# Alert: less than 20% disk free on model storage

# Network bandwidth - relevant for feature store calls and model downloads
rate(container_network_receive_bytes_total{pod=~"fraud-model-.*"}[5m])
rate(container_network_transmit_bytes_total{pod=~"fraud-model-.*"}[5m])

Layer 2: Application Metrics

Layer 2 is where you instrument your model serving code with custom Prometheus metrics:

from prometheus_client import Counter, Histogram, Gauge
from prometheus_client import start_http_server   # start_http_server(8000) exposes /metrics for scraping
import time

# --- Define metrics ---

# Request counter (by status and model version)
REQUEST_COUNT = Counter(
    "model_requests_total",
    "Total number of prediction requests",
    ["model_name", "model_version", "status"]   # labels
)

# Request latency histogram (full end-to-end)
REQUEST_LATENCY = Histogram(
    "model_request_latency_seconds",
    "Prediction request latency in seconds",
    ["model_name", "stage"],    # stage: feature_fetch, inference, postprocess, total
    buckets=[0.01, 0.025, 0.05, 0.075, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0]
)

# Model prediction score distribution
PREDICTION_SCORE = Histogram(
    "model_prediction_score",
    "Distribution of model prediction scores",
    ["model_name"],
    buckets=[0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
)

# Currently loaded model version (gauge for tracking)
MODEL_VERSION_INFO = Gauge(
    "model_version_info",
    "Information about the currently loaded model version",
    ["model_name", "version", "framework"]
)

# Feature store cache hit rate
FEATURE_CACHE_HITS = Counter("feature_cache_hits_total", "Feature cache hits")
FEATURE_CACHE_MISSES = Counter("feature_cache_misses_total", "Feature cache misses")

# Batch size distribution (for batch inference endpoints)
BATCH_SIZE = Histogram(
    "model_batch_size",
    "Size of batches processed",
    buckets=[1, 2, 4, 8, 16, 32, 64, 128, 256]
)


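# --- Instrument the model server ---

class CacheMiss(Exception):
    """Placeholder for the feature store client's cache-miss error (assumed, not a library class)."""


class InstrumentedModelServer:
    def __init__(self, model, feature_store, model_name: str, model_version: str):
        self.model = model
        self.feature_store = feature_store   # client exposing get() and get_from_db()
        self.model_name = model_name
        self.model_version = model_version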

        # Record model version info
        MODEL_VERSION_INFO.labels(
            model_name=model_name,
            version=model_version,
            framework="pytorch"
        ).set(1)

    def predict(self, request_id: str, raw_features: dict) -> dict:
        start_time = time.perf_counter()

        try:
            # Stage 1: Feature retrieval
            with REQUEST_LATENCY.labels(self.model_name, "feature_fetch").time():
                features = self._fetch_features(raw_features)

            # Stage 2: Inference
            with REQUEST_LATENCY.labels(self.model_name, "inference").time():
                score = self._run_inference(features)

            # Stage 3: Postprocessing
            with REQUEST_LATENCY.labels(self.model_name, "postprocess").time():
                result = self._postprocess(score)

            # Record metrics
            REQUEST_COUNT.labels(self.model_name, self.model_version, "success").inc()
            PREDICTION_SCORE.labels(self.model_name).observe(score)

            return result

        except Exception:
            REQUEST_COUNT.labels(self.model_name, self.model_version, "error").inc()
            raise

        finally:
            # Record end-to-end latency for failed requests too, so slow failures
            # still show up in the "total" stage used for the latency SLO
            total_latency = time.perf_counter() - start_time
            REQUEST_LATENCY.labels(self.model_name, "total").observe(total_latency)

    def _fetch_features(self, raw_features: dict) -> dict:
        try:
            features = self.feature_store.get(raw_features["user_id"])
            FEATURE_CACHE_HITS.inc()
            return features
        except CacheMiss:
            FEATURE_CACHE_MISSES.inc()
            return self.feature_store.get_from_db(raw_features["user_id"])
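
The bucket boundaries chosen for `model_request_latency_seconds` limit how precisely a p99 can be read back, because quantiles are estimated from cumulative bucket counts by linear interpolation. The following is a simplified sketch of that estimate, not Prometheus's actual `histogram_quantile` implementation: the function name and the bucket counts are made up for illustration, and edge cases such as the `+Inf` bucket are ignored.

```python
from bisect import bisect_left


def estimate_quantile(q: float, upper_bounds: list[float], cumulative_counts: list[int]) -> float:
    """Estimate the q-quantile from cumulative histogram buckets.

    upper_bounds are the finite `le` boundaries in ascending order;
    cumulative_counts[i] is the number of observations <= upper_bounds[i].
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = cumulative_counts[-1]
    if total == 0:
        raise ValueError("no observations")

    rank = q * total
    # First bucket whose cumulative count reaches the target rank
    idx = bisect_left(cumulative_counts, rank)
    lower = upper_bounds[idx - 1] if idx > 0 else 0.0
    upper = upper_bounds[idx]
    count_below = cumulative_counts[idx - 1] if idx > 0 else 0
    in_bucket = cumulative_counts[idx] - count_below

    # Assume observations are spread evenly inside the bucket
    return lower + (upper - lower) * (rank - count_below) / in_bucket


# Illustrative counts: 1,000 requests across four buckets
bounds = [0.05, 0.1, 0.25, 0.5]
counts = [600, 900, 990, 1000]
p95 = estimate_quantile(0.95, bounds, counts)   # falls inside the 0.1-0.25 bucket
```

The estimate can never be more precise than the bucket that contains it, which is one reason to place a boundary exactly at the SLO threshold (here, `0.1`).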

Latency SLOs and Error Budget

# PromQL to track SLO compliance
# SLO: 99% of requests complete within 100ms

# Good requests (under 100ms)
good_requests = """
sum(rate(model_request_latency_seconds_bucket{
    model_name="fraud-model",
    stage="total",
    le="0.1"
}[7d]))
"""

# Total requests
total_requests = """
sum(rate(model_request_latency_seconds_count{
    model_name="fraud-model",
    stage="total"
}[7d]))
"""

# SLO compliance rate = good / total (target: >= 0.99)
# Error budget burn rate = rate at which you're consuming the 1% error budget
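
The compliance ratio and burn rate described in the comments above can be worked through numerically. This is a minimal sketch: `slo_report` is a hypothetical helper, and the request counts are invented. In practice `good` and `total` would come from the two PromQL queries.

```python
def slo_report(good: float, total: float, slo_target: float = 0.99,
               window_days: float = 30.0) -> dict:
    """Summarize SLO compliance from 'good' and 'total' request counts."""
    if total <= 0:
        raise ValueError("total must be positive")

    compliance = good / total
    error_budget = 1.0 - slo_target          # allowed fraction of slow requests
    bad_fraction = 1.0 - compliance          # observed fraction of slow requests
    burn_rate = bad_fraction / error_budget  # 1.0 = budget lasts exactly one window
    days_to_exhaustion = window_days / burn_rate if burn_rate > 0 else float("inf")

    return {
        "compliance": compliance,
        "meets_slo": compliance >= slo_target,
        "burn_rate": burn_rate,
        "days_to_exhaustion": days_to_exhaustion,
    }


# 9,850 of 10,000 requests finished within 100ms
report = slo_report(good=9_850, total=10_000)
print(report)
```

With these numbers the compliance is 98.5%, the burn rate is about 1.5, and a 30-day budget would run out after roughly 20 days.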

Layer 3: ML Pipeline Metrics

The ML pipeline layer is the most neglected and most important for detecting silent failures.

Feature Freshness

Features that go stale (not updated for hours or days) silently degrade predictions. Monitor the age of each feature category:

from prometheus_client import Gauge
from datetime import datetime, timezone

FEATURE_FRESHNESS = Gauge(
    "feature_freshness_seconds",
    "Age of the most recent feature update in seconds",
    ["feature_group"]
)

def update_feature_freshness():
    """Called periodically to update feature age metrics."""
    # get_last_update_time is your feature store's metadata lookup (not shown);
    # it is assumed to return a timezone-aware UTC datetime.
    feature_groups = {
        "user_transaction_features": get_last_update_time("user_txn_features"),
        "merchant_risk_features": get_last_update_time("merchant_risk_features"),
        "device_fingerprint_features": get_last_update_time("device_features"),
    }

    # datetime.utcnow() returns a naive datetime and is deprecated since Python 3.12
    now = datetime.now(timezone.utc)
    for group, last_update in feature_groups.items():
        age_seconds = (now - last_update).total_seconds()
        FEATURE_FRESHNESS.labels(feature_group=group).set(age_seconds)

# Alert: feature age > 4 hours
# FIRING: feature_freshness_seconds{feature_group="user_transaction_features"} > 14400
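
A single 4-hour threshold may be too loose for fast-moving groups and too strict for slow ones. One option is a per-group limit, sketched below. The limits in `MAX_AGE` and the `stale_groups` helper are assumptions for illustration, not values from any particular feature store.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical per-group freshness limits; tune these to each pipeline's schedule
MAX_AGE = {
    "user_transaction_features": timedelta(hours=1),
    "merchant_risk_features": timedelta(hours=4),
    "device_fingerprint_features": timedelta(hours=24),
}


def stale_groups(last_updates: dict, now: datetime, max_age: dict = MAX_AGE) -> list:
    """Return (group, age) pairs whose age exceeds the group's limit, oldest first."""
    stale = [
        (group, now - updated_at)
        for group, updated_at in last_updates.items()
        if now - updated_at > max_age[group]
    ]
    return sorted(stale, key=lambda item: item[1], reverse=True)


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
last_updates = {
    "user_transaction_features": now - timedelta(hours=2),
    "merchant_risk_features": now - timedelta(minutes=45),
    "device_fingerprint_features": now - timedelta(minutes=12),
}
print(stale_groups(last_updates, now))
```

Here only the transaction features are flagged: they are two hours old against a one-hour limit, even though they would pass a blanket 4-hour check.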

Data Quality Metrics

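import numpy as np
from prometheus_client import Counter

# Count nulls and total observations separately; the null rate is then derived in PromQL:
#   rate(feature_null_total[5m]) / rate(feature_observed_total[5m])
# (A Gauge incremented on every null would report a running count, not a fraction.)
FEATURE_NULLS = Counter(
    "feature_null_total",
    "Count of null or NaN values seen for a feature",
    ["feature_name"]
)

FEATURE_OBSERVED = Counter(
    "feature_observed_total",
    "Count of feature values seen, null or not",
    ["feature_name"]
)

FEATURE_VALUE_COUNT = Counter(
    "feature_value_total",
    "Count of feature values by range bucket",
    ["feature_name", "bucket"]
)

def monitor_request_data_quality(features: dict, model_name: str):
    """Called per-request to track data quality metrics."""
    for feature_name, value in features.items():
        FEATURE_OBSERVED.labels(feature_name=feature_name).inc()
        if value is None or (isinstance(value, float) and np.isnan(value)):
            FEATURE_NULLS.labels(feature_name=feature_name).inc()   # count nulls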
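
The per-request null check above turns into a rate once null counts are divided by the number of values seen. The sketch below does this offline over a small batch of requests using only the standard library. `null_rates` and the sample payloads are hypothetical.

```python
import math
from collections import defaultdict


def is_null(value) -> bool:
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def null_rates(requests: list) -> dict:
    """Fraction of null values per feature across a batch of request payloads."""
    nulls = defaultdict(int)
    seen = defaultdict(int)
    for features in requests:
        for name, value in features.items():
            seen[name] += 1
            if is_null(value):
                nulls[name] += 1
    return {name: nulls[name] / seen[name] for name in seen}


batch = [
    {"amount": 10.0, "country": "US"},
    {"amount": None, "country": "US"},
    {"amount": float("nan"), "country": "DE"},
    {"amount": 5.0, "country": "FR"},
]
rates = null_rates(batch)
print(rates)
```

In this batch `amount` is missing in half of the requests and `country` in none.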

Prediction Volume and Pipeline Throughput

PREDICTION_VOLUME = Counter(
    "model_predictions_total",
    "Total predictions made, by model and decision",
    ["model_name", "decision"]
)

PIPELINE_LAG = Gauge(
    "ml_pipeline_lag_seconds",
    "Lag between event time and prediction time",
    ["pipeline_name"]
)

Layer 4: Model Quality Metrics

Connecting infrastructure to model quality:

# These come from the delayed evaluation pipeline (Lesson 02)
MODEL_AUC = Gauge(
    "model_auc_roc",
    "Model AUC-ROC computed on labeled data",
    ["model_name", "model_version", "evaluation_window"]
)

MODEL_PSI = Gauge(
    "model_score_psi",
    "Population Stability Index of prediction score distribution",
    ["model_name"]
)

DRIFT_DETECTED = Gauge(
    "feature_drift_detected",
    "1 if drift detected for this feature, 0 otherwise",
    ["feature_name", "test"]
)
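
The value behind a gauge like `model_score_psi` has to be computed somewhere before it is set. Below is a minimal sketch of the Population Stability Index over equal-width score bins, assuming scores lie in [0, 1]. The function names, bin count, and sample scores are illustrative choices, and `eps` is a simple guard against empty bins.

```python
import math


def bin_fractions(scores: list, n_bins: int) -> list:
    """Fraction of scores in each equal-width bin over [0, 1]."""
    if not scores:
        raise ValueError("scores must not be empty")
    counts = [0] * n_bins
    for score in scores:
        idx = min(int(score * n_bins), n_bins - 1)  # a score of 1.0 falls in the last bin
        counts[idx] += 1
    return [count / len(scores) for count in counts]


def population_stability_index(expected: list, actual: list,
                               n_bins: int = 10, eps: float = 1e-6) -> float:
    """PSI = sum over bins of (actual - expected) * ln(actual / expected)."""
    expected_frac = bin_fractions(expected, n_bins)
    actual_frac = bin_fractions(actual, n_bins)
    psi = 0.0
    for e, a in zip(expected_frac, actual_frac):
        e, a = max(e, eps), max(a, eps)   # avoid log(0) and division by zero
        psi += (a - e) * math.log(a / e)
    return psi


baseline = [0.05] * 50 + [0.95] * 50   # reference window: half low, half high scores
shifted = [0.05] * 90 + [0.95] * 10    # current window: mostly low scores
print(population_stability_index(baseline, baseline, n_bins=2))
print(population_stability_index(baseline, shifted, n_bins=2))
```

Identical distributions give a PSI of zero, and the value grows as the current score distribution moves away from the baseline.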

Building the Unified Dashboard

A well-designed ML observability dashboard has four panels arranged top-to-bottom, mirroring the monitoring layer hierarchy:

┌─────────────────────────────────────────────────────────────────┐
│  LAYER 4: MODEL QUALITY                                         │
│  AUC: 0.91 (7d)  |  Score PSI: 0.08  |  Approval Rate: 68.2%    │
├─────────────────────────────────────────────────────────────────┤
│  LAYER 3: ML PIPELINE                                           │
│  Feature Freshness: txn=2h ago, risk=45m ago, device=12m ago    │
│  Null Rate: all < 1%  |  Prediction Volume: 2.3K/min            │
├─────────────────────────────────────────────────────────────────┤
│  LAYER 2: APPLICATION                                           │
│  p50: 23ms | p95: 67ms | p99: 89ms | Error Rate: 0.01%          │
│  Feature Fetch: 18ms | Inference: 44ms | Post: 4ms              │
├─────────────────────────────────────────────────────────────────┤
│  LAYER 1: INFRASTRUCTURE                                        │
│  CPU: 45% | Memory: 62% | GPU Util: 87% | GPU Mem: 78%          │
│  Pod Restarts: 0 | OOMKills: 0 | Node Health: 3/3 OK            │
└─────────────────────────────────────────────────────────────────┘

This layout lets on-call engineers immediately see the full picture: model quality at top (business impact), infrastructure health at bottom (implementation reality). During an incident, you can trace from symptom to root cause top-down or bottom-up.

Production Notes

Cardinality management in Prometheus: Be careful with high-cardinality label dimensions. Labeling by request_id or user_id creates millions of metric series and OOMs Prometheus. Only use labels with bounded, known value sets: model_name, model_version, status, decision, stage. Never use request-level IDs as labels.
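
A quick way to review a new metric before deploying it is to multiply out the label value counts. The sketch below gives a worst-case series count, assuming every label combination actually occurs. `estimate_series` is a hypothetical helper and the value counts are guesses.

```python
from math import prod


def estimate_series(label_cardinalities: dict, histogram_buckets: int = 0) -> int:
    """Worst-case number of time series for one metric.

    label_cardinalities maps each label name to its number of distinct values.
    For a Histogram, pass the number of explicit buckets.
    """
    combinations = prod(label_cardinalities.values())
    if histogram_buckets:
        # Per label combination: one series per explicit bucket, plus the +Inf
        # bucket, _sum and _count. Some client libraries expose extra series
        # (for example _created), so treat this as a lower bound per combination.
        return combinations * (histogram_buckets + 3)
    return combinations


# Latency histogram with bounded labels: 20 models x 4 stages, 10 buckets
bounded = estimate_series({"model_name": 20, "stage": 4}, histogram_buckets=10)

# The same 20 models with a user_id label on a plain counter
unbounded = estimate_series({"model_name": 20, "user_id": 1_000_000})

print(bounded, unbounded)
```

The bounded histogram stays around a thousand series, while one `user_id` label pushes a simple counter to twenty million.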

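Sampling for high-throughput services: At 10,000 RPS, capturing per-request detail (structured logs, traces, payload snapshots) for every request creates real storage and processing overhead, so sample it at 1–10%. Keep Counters and Histograms unsampled: recording an observation is an in-process increment, and their cost grows with the number of series (each Histogram bucket is one series per label combination), not with the request rate.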

Metric retention: Prometheus defaults to 15-day retention. For ML monitoring, you often need 90-day or 1-year retention to detect seasonal patterns. Use Thanos or Cortex for long-term metric storage.
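
Before extending retention, it helps to estimate the disk it will need. The Prometheus storage documentation gives the rough formula retention time × ingested samples per second × bytes per sample, with compressed samples averaging 1-2 bytes. A minimal sketch of that arithmetic follows; the helper name, series count, and scrape interval are assumptions.

```python
def estimate_disk_bytes(active_series: int, scrape_interval_s: float,
                        retention_days: float, bytes_per_sample: float = 2.0) -> float:
    """Rough disk need: retention seconds x samples per second x bytes per sample."""
    samples_per_second = active_series / scrape_interval_s
    retention_seconds = retention_days * 86_400
    return samples_per_second * retention_seconds * bytes_per_sample


# 100,000 active series scraped every 15 seconds
for days in (15, 90, 365):
    gigabytes = estimate_disk_bytes(100_000, 15, days) / 1e9
    print(f"{days:>3} days: ~{gigabytes:.0f} GB")
```

The estimate grows linearly with retention, so moving from 15 days to a year multiplies local disk needs by about 24. That growth is part of why long-term storage is usually delegated to a system such as Thanos or Cortex.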

Common Mistakes

:::danger Only Monitoring One Layer
Teams that monitor only infrastructure (Layer 1) miss data quality and model quality problems. Teams that monitor only model accuracy miss infrastructure problems that are causing latency SLO breaches. All four layers are necessary. The root cause of a model accuracy problem often starts in Layer 1 (a node failure degraded the feature store) or Layer 3 (a feature pipeline started producing stale values).
:::

:::warning High-Cardinality Labels Crashing Prometheus
Adding a user_id or request_id label to any metric will create a unique time series for every user or request. At 1M daily users, that's 1M time series for a single metric - Prometheus will OOM within hours. Always review new metric label dimensions before deploying. Use cardinality analysis tools (prometheus-cardinality-exporter) to catch this before it hits production.
:::

:::warning Not Instrumenting Stage-Level Latency
A metric showing "inference request takes 200ms" is not actionable. Is it the feature retrieval? The model forward pass? The response serialization? Instrument each stage separately so you can pinpoint the bottleneck immediately during an incident. In the 4am story at the top of this lesson, stage-level latency would have shown "feature_fetch: 190ms, inference: 10ms" and pointed straight at feature retrieval, and from there to the Redis issue.
:::

Interview Q&A

Q1: Describe the four monitoring layers for ML systems and give an example of what each layer catches.

Layer 1 (infrastructure): CPU, memory, GPU utilization, pod restarts, OOMKills. Catches hardware failures, resource exhaustion, and platform-level issues. Example: GPU temperature spiking to 90°C, triggering thermal throttling that increases inference latency.

Layer 2 (application): request latency, error rate, throughput, stage-level timing. Catches software bugs, overload, and serving infrastructure issues. Example: feature retrieval latency spiking from 20ms to 200ms because the Redis connection pool is exhausted.

Layer 3 (ML pipeline): feature freshness, data quality, null rates, prediction volume. Catches data pipeline failures and preprocessing issues. Example: a feature pipeline silently starting to produce all-zero vectors for a key feature because an upstream schema changed.

Layer 4 (model quality): AUC, F1, data drift metrics, score distribution PSI. Catches model degradation from distribution shift, concept drift, and staleness. Example: a recommendation model's predicted CTR over-estimating actual CTR by 40% because user behavior shifted after a UI redesign.

Q2: What is a latency SLO for an ML serving endpoint and how do you implement error budget tracking?

An SLO (Service Level Objective) for latency defines the fraction of requests that must complete within a threshold. Example: "99% of fraud model predictions must complete in under 100ms." Implementation: record per-request latency in a Prometheus Histogram. Divide the le="0.1" bucket (requests that finished within 100ms) by total requests to compute the SLO compliance rate. The error budget is the remaining 1% (100% − 99%): up to 1% of requests over a 30-day window may miss the threshold. Error budget burn rate alerts fire when you are consuming the budget faster than is sustainable. A sustainable pace is about 3.3% of the budget per day; if you burn 5% of the monthly budget in one day (a burn rate of 1.5), you will exhaust it in 20 days instead of 30. Much faster burns, such as 5% within a few hours, usually mean an active incident.

Q3: Why is feature freshness monitoring important and how do you implement it?

Feature freshness measures how recently each feature was computed. Stale features cause silent degradation: if "user's transaction history from the last 7 days" hasn't been updated for 48 hours, the model is scoring against a window that ended two days ago. It covers activity from two to nine days back and misses everything since, which may reflect a completely different behavioral state for the user. Implementation: maintain a metadata table in your feature store that records the last update timestamp for each feature group. A monitoring job runs every 5 minutes, reads these timestamps, computes the age (now - last_update), and exposes it to Prometheus as a Gauge metric. Alert when any critical feature group is older than its SLA (e.g., transaction features SLA: updated every hour). This catches upstream data pipeline failures before they degrade model quality.

Q4: How do you avoid Prometheus cardinality explosion in ML monitoring?

Cardinality explosion occurs when a label dimension has an unbounded number of values. For ML monitoring, the dangerous dimensions are: user_id (millions of values), request_id (billions of values), and free-text error messages. Prevention: only use labels with bounded, known value sets - model_name (< 20), model_version (< 100), status (success/error/timeout), decision (approved/declined), feature_name (< 200). For any dimension you're unsure about, estimate the series count first: multiply the number of distinct values of every label on the metric (and, for a Histogram, multiply again by the number of buckets). Disk usage then scales as (number of series) × (samples per series per day) × (bytes per sample) × (retention days), while memory scales with the number of active series. As a rough guide, a metric with 1M series costs on the order of gigabytes of Prometheus memory. For high-cardinality observability (per-request tracing), use distributed tracing (Jaeger/Tempo) instead of Prometheus metrics.

Q5: An ML model's p99 latency is 350ms against a 100ms SLO. Walk through your diagnosis using the four monitoring layers.

Start at Layer 2 (application): look at stage-level latency. If feature_fetch is 300ms, inference is 40ms, and postprocess is 10ms, the bottleneck is feature retrieval, not the model.

Move to Layer 3 (ML pipeline): check feature store metrics. Redis cache hit rate: 23% (normally 95%) - Redis is not being used.

Check Layer 1 (infrastructure): Redis pod status. kubectl get pods -n ml-prod | grep redis shows redis-0 CrashLoopBackOff. Redis crashed 4 hours ago; the feature server is falling back to database queries (300ms vs 5ms Redis reads).

Fix: restart Redis and investigate the crash (check logs for OOMKill or config error). Verify that the feature server has proper circuit-breaker logic to detect Redis unavailability and either fail fast or serve cached values. Post-fix: p99 drops back to 90ms within 2 minutes of Redis recovering.
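
The first step of this walkthrough, comparing stage latencies to find where the time goes, is simple enough to automate in a runbook script. Below is a minimal sketch using the stage timings from the answer above. `dominant_stage` is a hypothetical helper, not part of any monitoring library.

```python
def dominant_stage(stage_latency_ms: dict) -> tuple:
    """Return the slowest stage and its share of the summed stage latency."""
    if not stage_latency_ms:
        raise ValueError("no stage latencies provided")
    total = sum(stage_latency_ms.values())
    stage = max(stage_latency_ms, key=stage_latency_ms.get)
    share = stage_latency_ms[stage] / total if total > 0 else 0.0
    return stage, share


stages = {"feature_fetch": 300.0, "inference": 40.0, "postprocess": 10.0}
stage, share = dominant_stage(stages)
print(f"{stage} accounts for {share:.0%} of request time")
```

With these timings, feature retrieval accounts for about 86% of the request, which is what sends the investigation to the feature store rather than the model.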
