Module 9 - Monitoring and Observability
Traditional software monitoring asks: "Is the service running? Is it fast? Are there errors?" Machine learning monitoring asks all of that, plus four harder questions: "Is the model still making good decisions? Have the input distributions shifted? Is the population the model was trained on the same as the population it's serving today? Are the business outcomes we care about still improving?"
The gap between these two monitoring paradigms is where most ML production failures hide.
The Silent Failure Problem
A model can be completely broken from a business perspective while every infrastructure metric looks green. CPU utilization: normal. Latency: 45 ms p99. Error rate: 0.01%. Requests per second: stable. And yet, over three months, the model has gradually shifted from "rank products by predicted user preference" to "rank products by popularity". The cause is not a change in user behavior but a data quality failure: the feature pipeline broke silently and started serving zero values for several user behavior features.
No infrastructure alert fires. No error rate spikes. The model returns valid predictions with the right schema. Only a quarterly A/B test reveals the degradation - and by then, 90 days of suboptimal recommendations have cost the business an estimated $2.4M in lost revenue.
This module teaches you to build monitoring systems that catch problems like this early, before they run unnoticed for months.
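The failure described above (features silently served as zeros) can be caught with a very simple Layer 3 check: compare the share of zero or missing values per feature against a baseline window. The following is a minimal sketch, not a production implementation. The function names, the feature names, and the 0.10 threshold are all hypothetical and chosen only for illustration.

```python
from typing import Dict, Optional, Sequence, Tuple


def zero_rate(values: Sequence[Optional[float]]) -> float:
    """Fraction of values that are exactly zero or missing (None)."""
    if not values:
        return 1.0  # assumption: an empty or absent feature counts as fully missing
    return sum(1 for v in values if v is None or v == 0) / len(values)


def find_degraded_features(
    baseline: Dict[str, Sequence[Optional[float]]],
    current: Dict[str, Sequence[Optional[float]]],
    max_increase: float = 0.10,  # hypothetical threshold; tune per feature
) -> Dict[str, Tuple[float, float]]:
    """Return {feature: (baseline_rate, current_rate)} for features whose
    zero/missing rate rose by more than max_increase."""
    flagged = {}
    for name, base_values in baseline.items():
        base = zero_rate(base_values)
        now = zero_rate(current.get(name, []))
        if now - base > max_increase:
            flagged[name] = (base, now)
    return flagged


# Illustrative data: "clicks_7d" starts arriving as zeros, "price" is unaffected.
baseline_window = {
    "clicks_7d": [3, 0, 5, 2, 1, 4, 0, 6, 2, 3],
    "price": [9.5, 12.0, 7.25, 3.0],
}
current_window = {
    "clicks_7d": [0, 0, 0, 0, 0, 0, 0, 0, 0, 2],
    "price": [8.0, 11.0, 6.5, 4.0],
}
print(find_degraded_features(baseline_window, current_window))
```

A check like this would not explain why the pipeline broke, but it turns a silent failure into an alertable signal without waiting for ground truth or an A/B test.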
What You Will Learn
Module Lessons
| # | Lesson | Key Skills |
|---|---|---|
| 01 | Data Drift Detection | KS test, PSI, chi-squared, Wasserstein distance, MMD, EvidentlyAI |
| 02 | Model Performance Degradation | Ground truth delay, proxy metrics, shadow evaluation, cohort monitoring |
| 03 | Infrastructure Monitoring | Four monitoring layers, Prometheus, GPU metrics, latency SLOs |
| 04 | Alerting Strategies | ML alert taxonomy, PagerDuty routing, on-call runbooks, post-mortems |
| 05 | Prometheus & Grafana for ML | Custom metrics, PromQL for ML, Grafana dashboard design |
| 06 | Logging for ML Systems | Structured prediction logging, audit logs, log aggregation |
| 07 | Explainability in Production | SHAP serving, Anchors, explanation as a service, debugging with explanations |
The Four Monitoring Layers
ML observability has a natural hierarchy. Lower layers are necessary but not sufficient. Upper layers catch the problems that lower layers miss:
Layer 4: Model Quality ← Business impact, accuracy, fairness
↑ catches business problems
Layer 3: ML Pipeline ← Feature freshness, data quality, drift
↑ catches data problems
Layer 2: Application ← Latency, error rates, throughput
↑ catches software problems
Layer 1: Infrastructure ← CPU, GPU, memory, network, disk
↑ catches hardware/platform problems
You need all four layers. Most teams over-invest in Layer 1 (infrastructure) and under-invest in Layers 3 and 4 (ML-specific). The silent failure problem lives in Layers 3 and 4.
Key Mental Models
Monitoring is sampling. You usually cannot afford to analyze every prediction in depth. For a model serving 10M predictions per day, you sample a statistically representative subset for drift analysis, log a portion for quality auditing, and compute aggregate metrics hourly. Design for sampling from the start.
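One way to design for sampling from the start is to make the sampling decision a deterministic function of the request ID, so that every service and every replay agrees on which predictions are in the sample. The sketch below assumes each prediction carries a string ID; the function name `in_sample` and the 1% rate are illustrative assumptions. At 10M predictions per day, a 1% rate yields roughly 100,000 sampled predictions daily.

```python
import hashlib


def in_sample(request_id: str, rate: float) -> bool:
    """Deterministically decide whether a prediction is in the monitoring sample.

    The request ID is hashed to a number in [0, 1); the prediction is sampled
    if that number falls below `rate`. The same ID always gets the same answer.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket < rate


# Illustrative usage with made-up request IDs.
ids = [f"req-{i}" for i in range(100_000)]
sampled = [rid for rid in ids if in_sample(rid, rate=0.01)]
print(f"sampled {len(sampled)} of {len(ids)} predictions")
```

Hash-based sampling is only one option. Its main advantage over calling a random number generator is reproducibility: the logged sample can be reconstructed later from the IDs alone.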
Ground truth is delayed. For most ML models, you don't know if a prediction was correct for days, weeks, or months. A fraud model knows if a transaction was truly fraudulent only after chargebacks are processed (30–60 days). This forces you to use proxy metrics for near-real-time monitoring.
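The split between delayed ground truth and near-real-time proxies can be sketched as two separate metrics: accuracy computed only over predictions old enough for their labels to be final, and a label-free proxy (here, the share of transactions flagged as fraud) computed over recent traffic. Everything in this example is an assumption made for illustration: the `Prediction` record, the 60-day maturity window, and the choice of flag rate as the proxy.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Dict, List, Optional


@dataclass
class Prediction:
    prediction_id: str
    made_on: date
    predicted_fraud: bool


def matured_accuracy(
    predictions: List[Prediction],
    labels: Dict[str, bool],  # prediction_id -> confirmed fraud (True/False)
    today: date,
    maturity_days: int = 60,  # assumed time for chargebacks to settle
) -> Optional[float]:
    """Accuracy over predictions old enough for their labels to be final."""
    cutoff = today - timedelta(days=maturity_days)
    scored = [p for p in predictions
              if p.made_on <= cutoff and p.prediction_id in labels]
    if not scored:
        return None
    correct = sum(1 for p in scored if p.predicted_fraud == labels[p.prediction_id])
    return correct / len(scored)


def flag_rate(predictions: List[Prediction], since: date) -> Optional[float]:
    """Proxy metric: share of recent predictions flagged as fraud (no labels needed)."""
    recent = [p for p in predictions if p.made_on >= since]
    if not recent:
        return None
    return sum(1 for p in recent if p.predicted_fraud) / len(recent)


# Illustrative data only.
preds = [
    Prediction("a", date(2024, 3, 1), True),
    Prediction("b", date(2024, 3, 15), False),
    Prediction("c", date(2024, 5, 25), True),
    Prediction("d", date(2024, 5, 28), False),
]
confirmed = {"a": True, "b": True}
print(matured_accuracy(preds, confirmed, today=date(2024, 6, 1)))
print(flag_rate(preds, since=date(2024, 5, 1)))
```

The proxy cannot tell you whether the model is right, only whether its behavior has changed. A sudden move in the flag rate is a prompt to investigate, while the matured accuracy confirms or refutes the concern weeks later.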
Drift is not always bad. Input distribution shift is a warning signal, not a confirmed problem. Correlate drift alerts with business metric changes before triggering an automatic retraining. Seasonal drift (for example, winter vs. summer feature values) is expected and not harmful if the training data covered those seasons.
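To make "correlate before retraining" concrete, the sketch below computes the Population Stability Index (PSI) for one feature and only recommends retraining when drift coincides with a drop in a business metric. The PSI formula is the standard one, $\sum_i (a_i - e_i)\ln(a_i / e_i)$ over bin proportions. The equal-width binning, the 0.25 PSI threshold (a commonly cited rule of thumb), the 2% metric-drop threshold, and the function names are assumptions for illustration.

```python
import math
from typing import List, Sequence


def psi(expected: Sequence[float], actual: Sequence[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index using equal-width bins over the baseline range."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid zero width for a constant baseline

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp out-of-range values to edge bins
            counts[idx] += 1
        # eps keeps empty bins from producing log(0) or division by zero
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ai, ei in zip(a, e))


def should_retrain(psi_value: float, metric_change: float,
                   psi_threshold: float = 0.25,
                   max_metric_drop: float = 0.02) -> bool:
    """Recommend retraining only when drift AND business degradation coincide.

    metric_change is the relative change in a business metric, e.g. -0.05
    for a 5% drop. Both thresholds are illustrative assumptions.
    """
    drifted = psi_value > psi_threshold
    degraded = metric_change < -max_metric_drop
    return drifted and degraded


# Illustrative data: the current window is the baseline shifted upward.
baseline = [i / 100 for i in range(100)]
shifted = [v + 0.3 for v in baseline]
score = psi(baseline, shifted)
print(f"PSI = {score:.2f}")
print(should_retrain(score, metric_change=0.0))    # drift without business impact
print(should_retrain(score, metric_change=-0.05))  # drift with a 5% metric drop
```

Drift without a metric drop would still be worth logging and reviewing. The gate only prevents an automatic retraining job from reacting to expected shifts such as seasonality.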
