Alerting and Incident Response for ML
ML-specific alerting design, alert taxonomy, routing with PagerDuty and OpsGenie, on-call runbook design for ML models, post-mortem templates, and reducing MTTD and MTTR for ML incidents.
Detecting when input data distributions change in production - KS test, PSI, chi-squared, Wasserstein distance, MMD, univariate vs. multivariate drift, reference window selection, and EvidentlyAI.
Serving model explanations alongside predictions - SHAP for production, Anchors for rule-based explanations, explanation as a service, debugging production failures with explanations, and regulatory compliance.
Monitoring the infrastructure layer of ML systems - CPU, GPU, memory, latency, the four monitoring layers, custom ML metrics with Prometheus, and building the observability foundation for model quality monitoring.
Structured logging for ML systems - prediction logging for delayed evaluation, structured JSON logs, audit logs for regulated models, log aggregation with Loki and Elasticsearch, and tracing individual prediction failures.
Monitoring model quality in production - the ground truth delay problem, proxy metrics, shadow evaluation, cohort-based monitoring, SLOs for model quality, and detecting degradation before it hurts the business.
Complete ML monitoring and observability - data drift detection, model performance monitoring, Prometheus/Grafana for ML, distributed tracing, alerting, and production monitoring tools like EvidentlyAI and NannyML.
Building production ML observability infrastructure - Prometheus architecture, custom ML metrics, PromQL for ML, Grafana dashboard design for model serving, and scaling with Thanos for long-term storage.