8 docs tagged with "monitoring"

Alerting and Incident Response for ML

ML-specific alerting design, alert taxonomy, routing with PagerDuty and OpsGenie, on-call runbook design for ML models, post-mortem templates, and reducing MTTD and MTTR for ML incidents.

Data Drift Detection

Detecting when input data distributions change in production - the Kolmogorov-Smirnov (KS) test, Population Stability Index (PSI), chi-squared test, Wasserstein distance, Maximum Mean Discrepancy (MMD), univariate vs. multivariate drift, reference window selection, and EvidentlyAI.

Serving model explanations alongside predictions - SHAP for production, Anchors for rule-based explanations, explanation as a service, debugging production failures with explanations, and regulatory compliance.

Infrastructure Monitoring for ML Systems

Monitoring the infrastructure layer of ML systems - CPU, GPU, memory, latency, the four monitoring layers, custom ML metrics with Prometheus, and building the observability foundation for model quality monitoring.

Logging for ML Systems

Structured logging for ML systems - prediction logging for delayed evaluation, structured JSON logs, audit logs for regulated models, log aggregation with Loki and Elasticsearch, and tracing individual prediction failures.

Model Performance Monitoring

Monitoring model quality in production - the ground truth delay problem, proxy metrics, shadow evaluation, cohort-based monitoring, SLOs for model quality, and detecting degradation before it hurts the business.

Module 9 - Monitoring and Observability

Complete ML monitoring and observability - data drift detection, model performance monitoring, Prometheus/Grafana for ML, distributed tracing, alerting, and production monitoring tools like EvidentlyAI and NannyML.

Prometheus and Grafana for ML

Building production ML observability infrastructure - Prometheus architecture, custom ML metrics, PromQL for ML, Grafana dashboard design for model serving, and scaling with Thanos for long-term storage.