
Alerting and Incident Response for ML

The 3am Page Without a Runbook

3:17am. PagerDuty fires: "fraud_model_approval_rate_delta_7day > 0.08." Your ML engineer on-call opens her laptop, still half-asleep. The alert says approval rate has drifted 8 percentage points over the past week. Is this bad? Is the model failing catastrophically, or is this a normal Tuesday-to-Tuesday business variation? She opens Grafana. She opens the model logs. She opens Slack history looking for recent deployments. Two hours later, she's determined the cause: a marketing team changed a customer segmentation that feeds one of the model's features, causing the feature distribution to shift. The model is technically fine - it's doing exactly what it was trained to do on these features. But the feature inputs have changed.

Two hours of incident diagnosis could have been three minutes with a proper runbook: "Alert: approval_rate_delta > 0.08 → Check [these 5 things] in [this order] → Expected causes: [list] → Resolution steps: [list]." That's what this lesson builds.


What Makes an Alert Good or Bad

Not all alerts are created equal. The goal of an alerting system is to notify the right person at the right time with enough context to act immediately. Alerts fail when they are:

  • Not actionable: "GPU utilization is 84%" - what do I do with that?
  • Not accurate: false positive rate > 20% trains engineers to ignore alerts
  • Not timely: alerting 2 hours after the incident started means 2 hours of impact
  • Not contextualized: "AUC dropped" without knowing the model, version, cohort, or likely cause

A good alert answers four questions immediately:

  1. What is broken? (fraud-model approval_rate, not "a model metric")
  2. How bad is it? (8 percentage points above baseline, affecting ~2,400 users/day)
  3. When did it start? (Tuesday 14:22 UTC, 5 days ago)
  4. What to check first? (link to runbook step 1)
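The four questions can be treated as a checklist that every notification must pass before it is sent. The sketch below is illustrative only: the `AlertContext` class and its field names are hypothetical and do not belong to PagerDuty, Prometheus, or any other tool.

```python
from dataclasses import dataclass


@dataclass
class AlertContext:
    """Hypothetical container for the four things a good alert must say."""
    what: str         # 1. What is broken?
    how_bad: str      # 2. How bad is it?
    started_at: str   # 3. When did it start?
    runbook_url: str  # 4. What to check first?

    def missing_fields(self) -> list[str]:
        # Any blank field means the on-call engineer has to go digging.
        return [name for name, value in vars(self).items() if not value.strip()]

    def render(self) -> str:
        missing = self.missing_fields()
        if missing:
            raise ValueError(f"alert is missing context: {missing}")
        return (
            f"WHAT: {self.what}\n"
            f"IMPACT: {self.how_bad}\n"
            f"SINCE: {self.started_at}\n"
            f"FIRST STEP: {self.runbook_url}"
        )


alert = AlertContext(
    what="fraud-model approval_rate",
    how_bad="8 percentage points above baseline, affecting ~2,400 users/day",
    started_at="Tuesday 14:22 UTC, 5 days ago",
    runbook_url="https://wiki.company.com/runbooks/approval-rate-drift",
)
print(alert.render())
```

Refusing to render an incomplete alert is one possible design choice: it pushes the missing-context problem to the person writing the alert rather than the person woken up by it.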

ML Alert Taxonomy

ML systems need a richer alert taxonomy than standard software services:

Alert Severity Tiers

# Severity definitions for ML systems

CRITICAL:
  description: "Business impact is occurring or imminent. Page immediately."
  response_time: "15 minutes"
  examples:
    - "fraud_model p99 latency > 500ms for 2+ minutes (SLO breach)"
    - "fraud_model error rate > 5%"
    - "GPU OOMKill on serving node"
    - "feature_freshness > 4 hours for transaction features"
    - "chargeback_proxy_rate > 1.5% (model quality)"

WARNING:
  description: "Degraded but functional. Investigate within 2 business hours."
  response_time: "2 hours (business hours)"
  examples:
    - "fraud_model p99 latency > 200ms"
    - "score_psi > 0.20 (drift detected)"
    - "feature null_rate > 5%"
    - "approval_rate_delta > 0.06 from 7-day baseline"

INFO:
  description: "Informational. Review daily. No immediate action required."
  response_time: "Next business day"
  examples:
    - "model retrain pipeline completed"
    - "score_psi > 0.10 (mild drift)"
    - "weekly AUC report"
    - "feature distribution report available"
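Several of the tier examples are thresholds on score_psi. As a minimal sketch of where such a number could come from, the code below computes the Population Stability Index over two binned score histograms and maps the result to the tiers using the 0.10 and 0.20 thresholds from the examples. The function names and the histogram counts are made up for illustration.

```python
import math


def psi(expected_counts, actual_counts, eps=1e-6):
    """Population Stability Index between two binned score distributions."""
    if len(expected_counts) != len(actual_counts):
        raise ValueError("both distributions must use the same bins")
    e_total = sum(expected_counts)
    a_total = sum(actual_counts)
    total = 0.0
    for e, a in zip(expected_counts, actual_counts):
        # Convert counts to proportions; eps avoids log(0) for empty bins.
        e_pct = max(e / e_total, eps)
        a_pct = max(a / a_total, eps)
        total += (a_pct - e_pct) * math.log(a_pct / e_pct)
    return total


def psi_severity(value):
    """Map a PSI value to a tier, using the thresholds from the examples."""
    if value > 0.20:
        return "WARNING"
    if value > 0.10:
        return "INFO"
    return None  # below every threshold: no alert


baseline = [100, 200, 400, 200, 100]  # hypothetical score histogram at training time
today = [60, 140, 350, 260, 190]      # hypothetical histogram from today's traffic
value = psi(baseline, today)
print(f"PSI = {value:.3f} -> {psi_severity(value)}")
```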

Prometheus Alert Rules for ML

# prometheus/alerts/ml-model-alerts.yaml

groups:
  - name: fraud-model-serving
    interval: 60s  # evaluate every 60 seconds
    rules:

      # Layer 2: Application SLO breach
      - alert: FraudModelHighLatency
        expr: |
          histogram_quantile(0.99,
            rate(model_request_latency_seconds_bucket{
              model_name="fraud-model", stage="total"
            }[5m])
          ) > 0.5
        for: 2m  # must be true for 2 minutes to avoid flapping
        labels:
          severity: critical
          team: ml-platform
          model: fraud-model
        annotations:
          summary: "Fraud model p99 latency exceeds 500ms SLO"
          # The alert start time is not available as a label here; Alertmanager
          # notification templates expose it as .StartsAt.
          description: >
            Fraud model p99 latency is {{ $value | humanizeDuration }} (SLO: 500ms).
            Runbook: https://wiki.company.com/runbooks/fraud-model-latency
          dashboard: "https://grafana.company.com/d/fraud-model?from=now-30m"

      # Layer 2: Error rate
      - alert: FraudModelHighErrorRate
        # sum() on both sides drops the status label, so the ratio is
        # errors / all requests rather than errors / errors
        expr: |
          sum(rate(model_requests_total{
            model_name="fraud-model", status="error"
          }[5m]))
          /
          sum(rate(model_requests_total{
            model_name="fraud-model"
          }[5m])) > 0.05
        for: 1m
        labels:
          severity: critical
          team: ml-platform
        annotations:
          summary: "Fraud model error rate exceeds 5%"
          description: "Current error rate: {{ $value | humanizePercentage }}. Check pod logs."
          runbook: "https://wiki.company.com/runbooks/fraud-model-errors"

      # Layer 3: Feature freshness
      - alert: FeatureStaleness
        expr: feature_freshness_seconds{feature_group="user_transaction_features"} > 14400  # 4 hours
        for: 5m
        labels:
          severity: critical
          team: data-engineering
        annotations:
          summary: "Transaction features are stale (>4 hours)"
          description: >
            Feature group {{ $labels.feature_group }} last updated
            {{ $value | humanizeDuration }} ago.
            This will cause model degradation. Check the feature pipeline.
          runbook: "https://wiki.company.com/runbooks/feature-staleness"

      # Layer 4: Model quality (proxy metric)
      - alert: ApprovalRateDrift
        expr: |
          abs(
            avg_over_time(model_approval_rate[1d]) -
            avg_over_time(model_approval_rate[7d] offset 1d)
          ) > 0.06
        for: 30m
        labels:
          severity: warning
          team: ml-team
        annotations:
          summary: "Fraud model approval rate drifted >6pp from 7-day baseline"
          # $value is the absolute delta, not the approval rate itself
          description: >
            Approval rate has moved {{ $value | humanizePercentage }} from the 7-day baseline.
            This may indicate distribution shift or model behavior change.
            Check data drift report and recent deployments.
          runbook: "https://wiki.company.com/runbooks/approval-rate-drift"

      # Infrastructure: GPU
      - alert: GPUOOMRisk
        expr: DCGM_FI_DEV_FB_FREE{exported_namespace="ml-prod"} < 2000  # MiB
        for: 5m
        labels:
          severity: warning
          team: ml-platform
        annotations:
          summary: "GPU memory nearly exhausted on serving node"
          description: >
            GPU on {{ $labels.exported_pod }} has {{ $value }}MiB free.
            Risk of OOM during inference. Consider scaling up or reducing batch size.
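The ApprovalRateDrift expression compares two non-overlapping windows: the most recent day, and the seven days before it (that is what `offset 1d` does). A rough Python equivalent, assuming one approval-rate sample per hour and using a hypothetical function name, makes the window arithmetic explicit:

```python
from statistics import fmean


def approval_rate_drift(hourly_rates, threshold=0.06):
    """Mirror the ApprovalRateDrift expression on a list of hourly samples.

    hourly_rates is ordered oldest to newest and must cover at least 8 days.
    """
    hours_per_day = 24
    if len(hourly_rates) < 8 * hours_per_day:
        raise ValueError("need at least 8 days of hourly samples")
    # avg_over_time(model_approval_rate[1d]): the most recent 24 hours
    current = fmean(hourly_rates[-hours_per_day:])
    # avg_over_time(model_approval_rate[7d] offset 1d): the 7 days before that
    baseline = fmean(hourly_rates[-8 * hours_per_day:-hours_per_day])
    delta = abs(current - baseline)
    return current, baseline, delta, delta > threshold


# Hypothetical data: a steady 72% approval rate, then 80% over the last day.
samples = [0.72] * (7 * 24) + [0.80] * 24
current, baseline, delta, firing = approval_rate_drift(samples)
print(f"current={current:.2f} baseline={baseline:.2f} delta={delta:.2f} firing={firing}")
```

Returning the current and baseline values alongside the delta is deliberate: those are the numbers an on-call engineer wants in the notification, not just the fact that a threshold was crossed.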

On-Call Runbooks for ML Incidents

A runbook is a step-by-step diagnostic procedure for a specific alert. Good runbooks cut root-cause diagnosis, and with it MTTR (Mean Time to Resolve), from hours to minutes. Bad runbooks are generic ("check the logs") and provide no value.

Runbook Template: FraudModelHighLatency

# Runbook: FraudModelHighLatency

**Alert**: fraud_model p99 latency > 500ms for 2+ minutes
**Severity**: CRITICAL
**Owner**: ML Platform team
**Last updated**: 2026-03-14

## Context
The fraud model serves 1,200 RPS. p99 > 500ms means that 1% of requests
(~12 per second) are experiencing unacceptable delays. The payment flow has a 1-second
timeout - latency > 1s causes transaction failures.

## Step 1: Check Application-Level Breakdown (2 min)
Open: https://grafana.company.com/d/fraud-model-latency

Look at stage-level latency panel:
- feature_fetch > 200ms → go to Step 2
- inference > 200ms → go to Step 3
- postprocess > 100ms → go to Step 4
- all stages normal → go to Step 5 (load balancer issue)

## Step 2: Feature Store Issues
```bash
# Check feature store pod health
kubectl get pods -n ml-prod | grep feature-server

# Check Redis pod health
kubectl get pods -n ml-prod | grep redis

# Check Redis connection pool metrics
# Grafana panel: "Redis Connection Pool" in the ML Infrastructure dashboard

# Check feature store logs for timeout errors
kubectl logs -n ml-prod -l app=feature-server --since=30m | grep -i "timeout\|error\|redis"
```

Common causes:

  • Redis pod crashed (CrashLoopBackOff → restart it)
  • Redis memory exhausted (increase Redis memory limit or flush stale keys)
  • Feature store pod OOMKilled (check kubectl describe)

Resolution:

# Restart Redis if crashed
kubectl rollout restart deployment/redis -n ml-prod

# Monitor: latency should recover within 2 minutes of Redis restarting

## Step 3: Inference Server Issues

# Check GPU utilization
kubectl exec -n ml-prod <model-pod> -- nvidia-smi

# Check if GPU is memory-bound
# If DCGM_FI_DEV_FB_FREE < 2000 MiB → reduce batch_size in ConfigMap

# Check for OOMKill
kubectl describe pod -n ml-prod -l app=fraud-model | grep OOMKilled

Common causes:

  • GPU memory full (too many concurrent requests, batch size too large)
  • Throttling: inference is normally 40ms - if it's 200ms, the GPU is throttling or the container is hitting its CPU limit

Resolution:

  • If GPU memory full: kubectl set env deployment/fraud-model -n ml-prod BATCH_SIZE=16
  • If CPU throttling: check whether the container is hitting its CPU limit (resources.limits.cpu)

## Step 4: Postprocessing Timeout

Usually indicates that the upstream API (business rules service) is slow. Check: kubectl logs -n ml-prod -l app=fraud-model | grep "rules-engine"

## Step 5: Escalation

If all stages look normal but latency is high:

  • Check ingress/load balancer: may be a network issue
  • Check for network policy changes (kubectl get networkpolicies -n ml-prod)
  • Page: @ml-platform-lead with "Unknown source of latency, stages all nominal"

## Post-Incident

After resolution, file a post-mortem within 48 hours: https://wiki.company.com/postmortems/new
Tag with: fraud-model, latency, <root-cause-tag>
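Step 1 of the runbook above is a small decision tree, which makes it a candidate for automation: the alert notification could name the step to start at. The sketch below encodes the thresholds from Step 1. The function name is hypothetical, and how the per-stage latencies are fetched (for example from the Grafana panel's data source) is left out.

```python
def next_runbook_step(stage_latency_ms):
    """Return the runbook step to jump to, given per-stage latency in milliseconds."""
    # (stage name, threshold in ms, step to go to), checked in runbook order
    rules = [
        ("feature_fetch", 200, "Step 2: Feature Store Issues"),
        ("inference", 200, "Step 3: Inference Server Issues"),
        ("postprocess", 100, "Step 4: Postprocessing Timeout"),
    ]
    for stage, threshold_ms, step in rules:
        if stage_latency_ms.get(stage, 0) > threshold_ms:
            return step
    # Every stage is within its budget, so the delay is outside the model server.
    return "Step 5: Escalation"


print(next_runbook_step({"feature_fetch": 35, "inference": 240, "postprocess": 12}))
```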


PagerDuty Routing for ML Teams

ML incidents involve multiple teams: ML engineers (model quality issues), data engineers (feature pipelines), platform engineers (infrastructure), and business stakeholders (when business metrics are affected). Route alerts to the right team immediately.

```yaml
# PagerDuty escalation policy (conceptual)

ml-model-serving:
  schedule: fraud-model-oncall-rotation
  escalation:
    - level: 1
      targets: [fraud-model-oncall-engineer]
      notify_after: 5m
    - level: 2
      targets: [ml-platform-lead, ml-team-slack-channel]
      notify_after: 15m
    - level: 3
      targets: [ml-engineering-manager]
      notify_after: 30m

ml-feature-pipeline:
  schedule: data-engineering-oncall
  escalation:
    - level: 1
      targets: [data-engineering-oncall]
      notify_after: 5m
    - level: 2
      targets: [data-platform-lead]
      notify_after: 15m

ml-model-quality:
  schedule: ml-team-oncall
  # Quality alerts during business hours only (09:00-18:00)
  # After-hours: notify via Slack only, page if not acknowledged in 2 hours
  escalation:
    - level: 1
      targets: [ml-team-slack-channel]
      notify_after: 0m
    - level: 2  # only if not acknowledged
      targets: [ml-team-oncall]
      notify_after: 120m
```
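To make the escalation semantics concrete, here is a small Python model of the conceptual policy above. It assumes that `notify_after` means "minutes the alert has gone unacknowledged before this level is notified", which is one reading of the config. The `ESCALATION` table and the function name are illustrative and unrelated to the PagerDuty API.

```python
# Conceptual policy from above as (notify_after_minutes, targets) per level.
ESCALATION = {
    "ml-model-serving": [
        (5, ["fraud-model-oncall-engineer"]),
        (15, ["ml-platform-lead", "ml-team-slack-channel"]),
        (30, ["ml-engineering-manager"]),
    ],
    "ml-feature-pipeline": [
        (5, ["data-engineering-oncall"]),
        (15, ["data-platform-lead"]),
    ],
    "ml-model-quality": [
        (0, ["ml-team-slack-channel"]),
        (120, ["ml-team-oncall"]),
    ],
}


def notified_so_far(service, minutes_unacknowledged):
    """List everyone notified for an alert that is still unacknowledged."""
    targets = []
    for notify_after, level_targets in ESCALATION[service]:
        if minutes_unacknowledged >= notify_after:
            targets.extend(level_targets)
    return targets


# A quality alert sits in Slack for two hours before anyone is paged.
print(notified_so_far("ml-model-quality", 30))
print(notified_so_far("ml-model-quality", 120))
```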

MTTD and MTTR for ML Systems

  • MTTD (Mean Time to Detect): how quickly you detect that something is wrong. Covered by monitoring and alerting configuration.
  • MTTR (Mean Time to Resolve): how quickly you fix it once detected. Covered by runbooks and incident process.

For ML systems, these metrics split into two categories:

Infrastructure MTTD/MTTR (should match standard SRE targets):

  • MTTD: < 5 minutes (infrastructure monitoring is mature)
  • MTTR: < 30 minutes (runbooks are well-established)

ML Quality MTTD/MTTR (typically much worse, needs improvement):

  • MTTD without proxy metrics: days to months (until ground truth arrives)
  • MTTD with proxy metrics: hours to days (PSI alerts, approval rate drift)
  • MTTD with CBPE (Confidence-Based Performance Estimation): hours (estimated AUC monitoring)
  • MTTR: depends on root cause - data pipeline fix (hours), retraining (days), fundamental architecture change (weeks)

# Track MTTD and MTTR in your incident log
import pandas as pd

# Parse the timestamp columns as datetimes; loaded as plain strings,
# the subtractions below would fail
incidents_df = pd.read_csv(
    "incidents.csv",  # from your incident management tool
    parse_dates=["incident_start_time", "first_alert_time", "resolution_time"],
)

# MTTD: time from incident_start_time to first_alert_time
incidents_df["mttd_minutes"] = (
    incidents_df["first_alert_time"] - incidents_df["incident_start_time"]
).dt.total_seconds() / 60

# MTTR: time from first_alert_time to resolution_time
incidents_df["mttr_minutes"] = (
    incidents_df["resolution_time"] - incidents_df["first_alert_time"]
).dt.total_seconds() / 60

# Summary statistics (count, mean, percentiles) per incident type
print("MTTD by incident type:")
print(incidents_df.groupby("incident_type")["mttd_minutes"].describe())

print("\nMTTR by incident type:")
print(incidents_df.groupby("incident_type")["mttr_minutes"].describe())

Post-Mortem Template for ML Incidents

# Post-Mortem: [incident title]
**Date**: 2026-03-14
**Severity**: P1 / P2 / P3
**Duration**: Xs (detected at Y, resolved at Z)
**Models affected**: fraud-model v2.1.0
**Business impact**: [quantify: N predictions affected, $X revenue at risk, etc.]
**Author**: [on-call engineer]
**Status**: Draft / Under review / Final

## Timeline
- **14:22 UTC**: Feature pipeline for user_transaction_features starts producing null values (root cause)
- **14:47 UTC**: Score PSI alert fires (25 min after root cause)
- **15:02 UTC**: On-call engineer acknowledges and begins investigation (15 min after the alert fired)
- **15:23 UTC**: Root cause identified: upstream data source schema change
- **15:41 UTC**: Feature pipeline redeployed with schema fix
- **15:48 UTC**: Score PSI returns to baseline. Incident resolved.
- **Total duration**: 86 minutes. MTTD: 25 minutes. MTTR: 61 minutes.

## Root Cause
The upstream user transaction database team deployed a schema change on 2026-03-14 at 14:18 UTC that renamed the `transaction_amount` column to `txn_amount_usd`. The feature pipeline's SQL query referenced the old column name and started returning NULL for all values. The model received all-null transaction features and produced a shifted score distribution.

## Detection
Alert FraudModelScorePSI fired 25 minutes after the root cause. The detection lag occurred because the feature staleness alert, which would have fired sooner, was not configured for this feature group. See action items.

## What Went Well
- Once PSI alert fired, runbook led to root cause in 21 minutes
- Feature pipeline had clear error logs pointing to the schema mismatch
- Redeployment was fast (automated CI/CD pipeline, < 5 minutes)

## What Went Poorly
- No schema validation on the feature pipeline inputs - the null values were not caught at the source
- Feature freshness alert was not configured for this feature group
- The upstream team did not notify the ML team of the schema change
- Post-incident analysis (was any model output affected?) took 3 additional hours because prediction logs did not include feature values

## Action Items
| Action | Owner | Due Date |
|--------|-------|----------|
| Add schema validation to feature pipeline (fail fast, not silently) | Data Eng | 2026-03-21 |
| Add feature freshness alert for all feature groups | ML Platform | 2026-03-18 |
| Require ML team notification for upstream schema changes (add to data contracts) | Data Governance | 2026-03-28 |
| Log feature values alongside prediction scores | ML Platform | 2026-03-25 |

## Metrics
- **MTTD**: 25 minutes (target: < 10 minutes for schema failures)
- **MTTR**: 61 minutes (target: < 30 minutes)
- **Predictions affected**: ~35,000 (25 minutes at 1,400/min)
- **Business impact**: estimated $47K in suboptimal fraud decisions (calculated post-hoc)
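The durations in the timeline are easy to get wrong when written by hand at the end of a long incident. A few lines of Python can derive them from the timestamps. This sketch uses the times from the example timeline above, with variable names chosen for illustration.

```python
from datetime import datetime


def minutes_between(start, end):
    """Whole minutes elapsed between two datetimes."""
    return int((end - start).total_seconds() // 60)


fmt = "%Y-%m-%d %H:%M"
root_cause = datetime.strptime("2026-03-14 14:22", fmt)    # pipeline starts emitting nulls
first_alert = datetime.strptime("2026-03-14 14:47", fmt)   # Score PSI alert fires
acknowledged = datetime.strptime("2026-03-14 15:02", fmt)  # on-call engineer acknowledges
resolved = datetime.strptime("2026-03-14 15:48", fmt)      # PSI back to baseline

mttd = minutes_between(root_cause, first_alert)
time_to_ack = minutes_between(first_alert, acknowledged)
mttr = minutes_between(first_alert, resolved)
total = minutes_between(root_cause, resolved)

print(f"MTTD={mttd} min, time to acknowledge={time_to_ack} min, MTTR={mttr} min, total={total} min")
```

Note that time to acknowledge is a separate number from MTTD: the first measures how fast a human responds to the page, the second how fast monitoring notices the problem.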

Production Notes

Alert fatigue is the enemy: if your on-call engineer sees 200 alerts in a week, they will start ignoring them. Aim for < 10 actionable alerts per week per engineer. Ruthlessly suppress false positives - a 20% false positive rate destroys on-call effectiveness.
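One way to keep this honest is to review a week of alert history and compute, per alert, how often it fired and how often it needed no action. The sketch below assumes a hypothetical log of `(alert name, required action?)` pairs and flags anything at or above the 20% false positive rate mentioned above. The function name and the data are made up.

```python
from collections import Counter


def review_alerts(alert_log, max_fp_rate=0.20):
    """Summarize one week of alerts as {alert name: stats}."""
    fired = Counter(name for name, _ in alert_log)
    false_positives = Counter(
        name for name, needed_action in alert_log if not needed_action
    )
    report = {}
    for name, count in fired.items():
        fp_rate = false_positives[name] / count
        report[name] = {
            "fired": count,
            "fp_rate": fp_rate,
            "needs_tuning": fp_rate >= max_fp_rate,
        }
    return report


# Hypothetical week of alerts: (alert name, did it require human action?)
alert_log = (
    [("FraudModelHighLatency", True)] * 2
    + [("ApprovalRateDrift", True)] * 3
    + [("ApprovalRateDrift", False)]
    + [("GPUOOMRisk", False)] * 4
    + [("GPUOOMRisk", True)]
)

for name, stats in review_alerts(alert_log).items():
    print(f"{name}: fired={stats['fired']} "
          f"fp_rate={stats['fp_rate']:.0%} tune={stats['needs_tuning']}")
```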

Every alert needs a runbook before it goes live: never deploy an alert without a corresponding runbook. If you can't write the runbook (because you don't know what to do when it fires), the alert shouldn't exist yet.

Test your alerts: use amtool alert add (Alertmanager) to fire test alerts, and promtool test rules (Prometheus) to unit-test rule expressions. Verify routing, notification channels, and runbook links all work before an actual incident.
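A test alert can also be fired from a script by posting to Alertmanager's v2 HTTP API. The sketch below builds such a payload. The `http://localhost:9093` address, the `test` label, and the function names are assumptions for illustration, and the actual POST is left commented out so the example does not depend on a running Alertmanager.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def build_test_alert(alertname, severity, team, runbook_url, minutes=5):
    """Build a JSON-serializable payload for POST /api/v2/alerts."""
    now = datetime.now(timezone.utc)
    return [{
        # Labels drive routing, so they should match the real alert's labels.
        "labels": {"alertname": alertname, "severity": severity,
                   "team": team, "test": "true"},
        "annotations": {"summary": "TEST ALERT - please ignore",
                        "runbook": runbook_url},
        "startsAt": now.isoformat(),
        # The alert resolves itself once endsAt has passed.
        "endsAt": (now + timedelta(minutes=minutes)).isoformat(),
    }]


def send_test_alert(payload, alertmanager_url="http://localhost:9093"):
    """POST the payload to Alertmanager and return the HTTP status code."""
    request = urllib.request.Request(
        f"{alertmanager_url}/api/v2/alerts",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


payload = build_test_alert(
    alertname="FraudModelHighLatency",
    severity="critical",
    team="ml-platform",
    runbook_url="https://wiki.company.com/runbooks/fraud-model-latency",
)
print(json.dumps(payload, indent=2))
# send_test_alert(payload)  # uncomment to post to a real Alertmanager
```

After sending, the things to verify are the ones listed above: the page reached the right rotation, the notification channel rendered the annotations, and the runbook link opens.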

Common Mistakes

:::danger Alerting on Everything
The default Prometheus + kube-state-metrics setup generates hundreds of potential alert rules. Enabling all of them creates alert storms that train engineers to ignore alerts. Be selective: only alert on things that require immediate human action. "CPU utilization > 70%" is not actionable if your model server is designed to run at 70% CPU. Remove or silence alerts that consistently fire without requiring action.
:::

:::warning Alerts Without Context
An alert message that says "model_approval_rate_delta > 0.06" is meaningless without context. The message should include: current value, baseline value, time since deviation started, whether there was a recent deployment, and a direct link to the relevant Grafana panel. Engineers who wake up at 3am need everything in the notification - they should not have to open four tabs to understand the situation.
:::

:::danger Incident Post-Mortems That Assign Blame
Post-mortems that identify a single person as "the cause" of an incident create a culture where engineers hide problems instead of reporting them. Blameless post-mortems focus on systems, processes, and lack of safeguards - not individual human error. The question is never "who made the mistake?" but "what system failure allowed this mistake to have this impact?" The schema change example: the engineer who renamed the column wasn't wrong to rename it - the system lacked a change management process that would have notified affected consumers.
:::

Interview Q&A

Q1: What makes an alert good? What are the most common alerting anti-patterns in ML systems?

A good alert is: actionable (tells the engineer what to do, not just what happened), accurate (low false positive rate - < 5% false positives), timely (detects problems within the impact window), and contextualized (includes current value, baseline, time since onset, and runbook link). Common ML anti-patterns: (1) Alerting on infrastructure metrics that don't correlate with model quality (CPU utilization by itself means nothing for ML quality). (2) Single-threshold alerts that fire constantly due to expected variation - add a minimum duration (for: 5m) to avoid flapping. (3) Missing alerts on ML-specific signals (feature staleness, score PSI) while over-alerting on generic metrics. (4) Alert messages with no context - "model_metric > threshold" with no current value, no baseline, no runbook.

Q2: How would you design on-call rotation and alert routing for a team with ML engineers, data engineers, and infrastructure engineers?

Route alerts to the team closest to the root cause, not to a single "ML on-call" that handles everything. Infrastructure alerts (GPU OOMKill, pod CrashLoopBackOff) → infrastructure/platform team on-call. Feature pipeline alerts (freshness, null rates, schema validation failures) → data engineering on-call. Model quality alerts (score PSI, approval rate drift, AUC degradation) → ML team on-call. All P1/P2 alerts → also notify team leads via Slack. Escalation policy: page on-call for CRITICAL; Slack for WARNING (page only if unacknowledged after 2 hours). Use separate schedules for each team to avoid waking up the wrong expert at 3am.

Q3: Describe MTTD and MTTR for ML systems and explain why they differ from traditional software systems.

MTTD (Mean Time to Detect) and MTTR (Mean Time to Resolve) measure incident detection and resolution speed. For traditional software: MTTD is typically minutes (error rates spike visibly). For ML: MTTD splits into two categories. Infrastructure failures (pod crash, latency spike) have MTTD of minutes - same as traditional software. But model quality degradation (accuracy drop from drift, silent data pipeline failure) has MTTD of days to months without specialized monitoring, because the model continues to return valid-looking predictions with no error spikes. MTTR also differs: a software bug is fixed with a code change or a rollback (hours). An ML quality incident may require data collection and retraining (days) or even fundamental feature engineering redesign (weeks). The key to improving ML MTTD: proxy metrics and CBPE-based performance estimation that detect quality degradation in hours instead of waiting for delayed ground truth.

Q4: What should a post-mortem for an ML incident include that's different from a traditional software incident?

ML-specific post-mortem elements: (1) Model impact quantification - how many predictions were affected, and what was the estimated business impact (suboptimal decisions made, revenue at risk). Traditional software post-mortems focus on availability; ML incidents may have "all systems up" while producing wrong outputs. (2) Ground truth gap analysis - was degradation detectable with available proxy metrics, or did it require waiting for ground truth? If the latter, what monitoring would have detected it sooner? (3) Data lineage section - trace the root cause through the data pipeline: upstream source → feature pipeline → feature store → model input → model output. (4) Model version tracking - which model version was affected, was it recently updated, and does rollback to the previous version resolve the issue? (5) Retraining analysis - does the incident suggest the model needs retraining (distribution shift) or is it a data pipeline fix?

Q5: How do you reduce alert fatigue in ML monitoring without missing real incidents?

Five strategies: (1) Minimum duration filters - setting for: 5m keeps an alert from firing on 30 seconds of transient noise. (2) Composite alerts - instead of alerting on score PSI alone, alert when PSI > 0.25 AND approval_rate_delta > 0.05. The compound condition filters benign shifts that affect only one metric. (3) Suppress dependent alerts - if a feature pipeline alert is firing, automatically suppress model quality alerts that are likely caused by it (using Alertmanager's inhibit_rules). (4) Regular alert review - monthly review of alert fire history: any alert with > 30% false positive rate gets tuned or removed. (5) Better thresholds - instead of fixed thresholds (PSI > 0.25), use dynamic thresholds based on historical variance: alert when the value exceeds the 99th percentile of the last 30 days for that metric. This adapts to seasonal patterns automatically.
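Strategies (2) and (5) are simple enough to sketch in a few lines. The code below is an illustration under stated assumptions: one PSI value per day for the last 30 days, made-up history values, and hypothetical function names. It computes a 99th-percentile threshold with the standard library and evaluates the composite condition with the limits quoted in the answer.

```python
from statistics import quantiles


def dynamic_threshold(history, percentile=99):
    """Return the given percentile (1-99) of the historical values."""
    # quantiles(n=100) returns the 99 cut points for percentiles 1..99.
    cut_points = quantiles(history, n=100, method="inclusive")
    return cut_points[percentile - 1]


def composite_alert(psi, approval_rate_delta, psi_limit=0.25, delta_limit=0.05):
    """Fire only when both signals agree that something has shifted."""
    return psi > psi_limit and approval_rate_delta > delta_limit


# Hypothetical daily PSI values for the last 30 days: 0.050, 0.051, ..., 0.079
psi_history = [0.050 + 0.001 * day for day in range(30)]
threshold = dynamic_threshold(psi_history)

todays_psi = 0.12
print(f"threshold={threshold:.4f} exceeds={todays_psi > threshold}")
print(composite_alert(psi=0.30, approval_rate_delta=0.02))  # PSI alone is not enough
```

With only 30 samples, the 99th percentile sits just below the largest observed value, so this threshold is sensitive to a single past outlier. A longer history or a lower percentile may be more stable, depending on the metric.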

© 2026 EngineersOfAI. All rights reserved.