01 Module 9 - Monitoring and Observability
Complete ML monitoring and observability - data drift detection, model performance monitoring, Prometheus/Grafana for ML, distributed tracing, alerting, and production monitoring tools like EvidentlyAI and NannyML.

02 Data Drift Detection
Detecting when input data distributions change in production - KS test, PSI, chi-squared, Wasserstein distance, MMD, univariate vs. multivariate drift, reference window selection, and EvidentlyAI.

03 Model Performance Monitoring
Monitoring model quality in production - the ground truth delay problem, proxy metrics, shadow evaluation, cohort-based monitoring, SLOs for model quality, and detecting degradation before it hurts the business.

04 Infrastructure Monitoring for ML Systems
Monitoring the infrastructure layer of ML systems - CPU, GPU, memory, latency, the four monitoring layers, custom ML metrics with Prometheus, and building the observability foundation for model quality monitoring.

05 Alerting and Incident Response for ML
ML-specific alerting design, alert taxonomy, routing with PagerDuty and OpsGenie, on-call runbook design for ML models, post-mortem templates, and reducing MTTD and MTTR for ML incidents.

06 Prometheus and Grafana for ML
Building production ML observability infrastructure - Prometheus architecture, custom ML metrics, PromQL for ML, Grafana dashboard design for model serving, and scaling with Thanos for long-term storage.

07 Logging for ML Systems
Structured logging for ML systems - prediction logging for delayed evaluation, structured JSON logs, audit logs for regulated models, log aggregation with Loki and Elasticsearch, and tracing individual prediction failures.

08 Explainability in Production
Serving model explanations alongside predictions - SHAP for production, Anchors for rule-based explanations, explanation as a service, debugging production failures with explanations, and regulatory compliance.