
Module 9 - Monitoring and Observability

Traditional software monitoring asks: "Is the service running? Is it fast? Are there errors?" Machine learning monitoring asks all of that, plus four harder questions: "Is the model still making good decisions? Have the input distributions shifted? Is the population the model was trained on the same as the population it's serving today? Are the business outcomes we care about still improving?"

The gap between these two monitoring paradigms is where most ML production failures hide.

The Silent Failure Problem

A model can be completely broken from a business perspective while every infrastructure metric looks green. CPU utilization: normal. Latency: 45 ms p99. Error rate: 0.01%. Requests per second: stable. And yet the model has gradually shifted from "rank products by predicted user preference" to "rank products by popularity" over three months. From the outside it looks like concept drift, but the actual cause is a feature pipeline that broke silently and started serving zero values for several user behavior features.

No infrastructure alert fires. No error rate spikes. The model returns valid predictions with the right schema. Only a quarterly A/B test reveals the degradation - and by then, 90 days of suboptimal recommendations have cost the business an estimated $2.4M in lost revenue.
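A failure like this one is detectable with a very small data quality check. The following is a minimal sketch, not a production implementation: it compares the share of zero or missing values per feature between a baseline window and the current serving window. The function names, the feature names, and the 0.20 tolerance are all illustrative assumptions.

```python
from typing import Dict, List, Optional


def zero_rate(values: List[Optional[float]]) -> float:
    """Fraction of values that are exactly zero or missing (None)."""
    zeros = sum(1 for v in values if v is None or v == 0)
    return zeros / len(values)


def find_suspect_features(
    baseline: Dict[str, List[Optional[float]]],
    current: Dict[str, List[Optional[float]]],
    max_increase: float = 0.20,  # assumed tolerance; tune per feature
) -> Dict[str, float]:
    """Return features whose zero rate rose by more than max_increase."""
    suspects = {}
    for name, base_values in baseline.items():
        cur_values = current.get(name)
        # A feature that vanished from serving counts as fully missing.
        cur_rate = zero_rate(cur_values) if cur_values else 1.0
        increase = cur_rate - zero_rate(base_values)
        if increase > max_increase:
            suspects[name] = round(increase, 3)
    return suspects


# Hypothetical feature samples: one behavior feature has collapsed to zeros.
baseline = {"clicks_7d": [3, 0, 5, 2, 8], "price": [9.5, 12.0, 7.25, 3.0, 4.0]}
current = {"clicks_7d": [0, 0, 0, 0, 1], "price": [8.0, 11.0, 6.5, 2.0, 5.0]}
print(find_suspect_features(baseline, current))  # {'clicks_7d': 0.6}
```

A check like this belongs next to the feature pipeline, not the serving infrastructure, which is why no CPU or latency alert would ever substitute for it.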

This module teaches you to build monitoring systems that catch problems like this before they affect users.

What You Will Learn

Module Lessons

#   Lesson                          Key Skills
01  Data Drift Detection            KS test, PSI, chi-squared, Wasserstein distance, MMD, EvidentlyAI
02  Model Performance Degradation   Ground truth delay, proxy metrics, shadow evaluation, cohort monitoring
03  Infrastructure Monitoring       Four monitoring layers, Prometheus, GPU metrics, latency SLOs
04  Alerting Strategies             ML alert taxonomy, PagerDuty routing, on-call runbooks, post-mortems
05  Prometheus & Grafana for ML     Custom metrics, PromQL for ML, Grafana dashboard design
06  Logging for ML Systems          Structured prediction logging, audit logs, log aggregation
07  Explainability in Production    SHAP serving, Anchors, explanation as a service, debugging with explanations

The Four Monitoring Layers

ML observability has a natural hierarchy. Lower layers are necessary but not sufficient. Upper layers catch the problems that lower layers miss:

Layer 4: Model Quality ← Business impact, accuracy, fairness
↑ catches business problems
Layer 3: ML Pipeline ← Feature freshness, data quality, drift
↑ catches data problems
Layer 2: Application ← Latency, error rates, throughput
↑ catches software problems
Layer 1: Infrastructure ← CPU, GPU, memory, network, disk
↑ catches hardware/platform problems

You need all four layers. Most teams over-invest in Layer 1 (infrastructure) and under-invest in Layers 3 and 4 (ML-specific). The silent failure problem lives in Layers 3 and 4.

Key Mental Models

Monitoring is sampling. At scale, you cannot inspect every prediction. For a model serving 10M predictions per day, you draw a representative sample for drift analysis, log a portion for quality auditing, and compute aggregate metrics hourly. Design for sampling from the start.
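One common way to design for sampling is to decide deterministically from a request identifier. The sketch below assumes each prediction carries a string ID. The function name `should_sample` and the 1% rate are illustrative choices, not part of any particular library.

```python
import hashlib


def should_sample(request_id: str, sample_rate: float) -> bool:
    """Deterministically decide whether to log this prediction.

    Hashing the ID instead of calling random() means the same request is
    always in or out of the sample, across services and restarts.
    """
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash onto the unit interval.
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate


# Hypothetical request IDs; at a 1% rate roughly 1,000 of 100,000 are kept.
ids = [f"req-{i}" for i in range(100_000)]
sampled = [rid for rid in ids if should_sample(rid, 0.01)]
print(len(sampled))
```

Because the decision depends only on the ID, the feature logger and the prediction logger can sample independently and still end up with the same set of requests, which makes later joins straightforward.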

Ground truth is delayed. For most ML models, you don't know if a prediction was correct for days, weeks, or months. A fraud model knows if a transaction was truly fraudulent only after chargebacks are processed (30–60 days). This forces you to use proxy metrics for near-real-time monitoring.
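The split between delayed ground truth and proxy metrics can be sketched as follows. This is an illustration under stated assumptions: the `Prediction` record, the 60-day maturity window, and the 0.5 decision threshold are hypothetical. The proxy used here (the share of recent predictions flagged as fraud) is only one of several possible choices.

```python
from datetime import datetime, timedelta
from typing import Dict, List, NamedTuple, Optional


class Prediction(NamedTuple):
    made_at: datetime
    score: float            # predicted fraud probability
    label: Optional[bool]   # None until ground truth arrives


def monitoring_snapshot(
    preds: List[Prediction],
    now: datetime,
    maturity: timedelta = timedelta(days=60),  # assumed chargeback window
    threshold: float = 0.5,                    # assumed decision threshold
) -> Dict[str, Optional[float]]:
    """Precision on matured labels plus a proxy metric on recent traffic."""
    mature = [p for p in preds
              if now - p.made_at >= maturity and p.label is not None]
    recent = [p for p in preds if now - p.made_at < maturity]

    # True quality metric: only computable once labels have matured.
    flagged = [p for p in mature if p.score >= threshold]
    precision = sum(p.label for p in flagged) / len(flagged) if flagged else None

    # Proxy metric: available immediately, no labels required.
    flag_rate = (sum(p.score >= threshold for p in recent) / len(recent)
                 if recent else None)
    return {"mature_precision": precision, "recent_flag_rate": flag_rate}


now = datetime(2026, 1, 1)
old, new = now - timedelta(days=90), now - timedelta(days=3)
preds = [
    Prediction(old, 0.9, True), Prediction(old, 0.8, False),
    Prediction(old, 0.1, False),
    Prediction(new, 0.7, None), Prediction(new, 0.2, None),
    Prediction(new, 0.6, None), Prediction(new, 0.1, None),
]
print(monitoring_snapshot(preds, now))
# {'mature_precision': 0.5, 'recent_flag_rate': 0.5}
```

A sudden move in the recent flag rate does not prove the model is wrong, but it is visible weeks before the matured precision can confirm or refute it.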

Drift is not always bad. Input distribution shift is a warning signal, not a confirmed problem. Correlate drift alerts with business metric changes before triggering automatic retraining. Seasonal drift (winter vs. summer feature values) is expected and not harmful if the model was trained to handle it.
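To make the "correlate before retraining" idea concrete, here is a minimal sketch that computes a Population Stability Index (PSI) with equal-width bins and then gates the retraining decision on a business metric. The 0.2 PSI threshold is a widely used rule of thumb, not a universal constant. The 2% metric drop and the function `retraining_decision` are assumptions made for illustration.

```python
import math
from typing import List, Sequence


def psi(expected: Sequence[float], actual: Sequence[float], bins: int = 10) -> float:
    """PSI between a baseline sample and a current sample (equal-width bins)."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for a constant baseline

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        # A small floor avoids log(0) when a bin is empty.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def retraining_decision(
    psi_value: float,
    metric_change: float,        # relative change in the business metric
    psi_alert: float = 0.2,      # rule-of-thumb drift threshold
    metric_drop: float = -0.02,  # assumed tolerance: a 2% drop
) -> str:
    drifted = psi_value >= psi_alert
    degraded = metric_change <= metric_drop
    if drifted and degraded:
        return "retrain"       # drift confirmed by business impact
    if drifted:
        return "watch"         # possibly seasonal; no impact yet
    if degraded:
        return "investigate"   # impact without input drift: look elsewhere
    return "ok"


baseline = [i / 100 for i in range(100)]
shifted = [0.5 + i / 200 for i in range(100)]
print(retraining_decision(psi(baseline, shifted), metric_change=0.0))    # watch
print(retraining_decision(psi(baseline, shifted), metric_change=-0.05))  # retrain
```

Keeping the drift score and the decision in separate functions lets you swap PSI for another statistic without touching the gating logic.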

© 2026 EngineersOfAI. All rights reserved.