

Data Incident Management

Six Hours That Should Have Been Forty-Five Minutes

The detection came at 9:12 AM. Not from an automated alert - from a Slack message. A product manager had noticed that the weekly retention report showed 0% day-7 retention for new users. That could not be right.

The data engineering team started investigating at 9:52 AM, 40 minutes after the detection. First question: which table? The retention report read from a dbt model, which read from three other models, which read from five source tables. Nobody on the call had run this specific pipeline recently. 25 minutes later: they had identified the source table. events.user_sessions.

Second question: what changed? Row count looked normal. Schema looked normal. They started reading dbt model SQL looking for a bug. An hour passed. Then someone noticed: the session_start_ts column had switched from UTC to local time in the most recent load. UTC-5. Day-7 retention appeared as 0% because the join condition on session timestamps was now off by 5 hours, pushing sessions out of their retention windows.

The fix: rerun the last 14 days of data through the pipeline with the correct timezone normalization. 45 minutes. Then validate the output. 30 minutes. Then notify all stakeholders. 80 minutes of communication.

Total time: 6 hours. Time of the actual fix: 45 minutes. The other 5 hours and 15 minutes were spent on detection delay, triage, communication, and finding the right people to escalate to.

With a data incident management process - automated detection, pre-built runbooks, clear severity definitions, and communication templates - the same incident resolves in 45 minutes. Same fix, same root cause, same pipeline. The difference is process.

Data Incidents vs. Software Incidents

Software incident management is well-understood. Systems like PagerDuty, Incident.io, and Opsgenie have standardized the process. But data incidents are fundamentally different from software incidents in ways that matter for how you manage them.

Software incidents: binary state (up/down), visible to users immediately (errors, timeouts), concentrated blast radius (the failing service), fast detection (error rate alerts fire in seconds).

Data incidents: continuous state (partially wrong, wrong for some users, wrong in some columns), often invisible to users (data looks present and queryable, just wrong), diffuse blast radius (a broken source table can affect dozens of downstream consumers), slow detection (often discovered by a human noticing something looks off, not by an automated alert).

The key implication: data incidents require a different detection strategy (proactive monitoring vs. reactive error rates), a different triage process (understand the data flow vs. check service health), and different communication (notify analysts and BI teams, not just engineering).

The other major difference: data incidents have a recovery dimension that software incidents often do not. When a service crashes and comes back up, the state is usually correct. When a data pipeline runs with wrong data, you often need to reprocess historical data - running the entire pipeline again for the affected time window. This can take hours for large datasets.
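To make the reprocessing cost concrete, here is a minimal sketch of enumerating the daily partitions a backfill would have to cover. It assumes the pipeline is partitioned by calendar day; `partitions_to_reprocess` and the example dates are hypothetical, not part of any scheduler's API.

```python
from datetime import date, timedelta
from typing import List


def partitions_to_reprocess(first_bad_day: date, last_bad_day: date) -> List[date]:
    """Return every daily partition in the affected window, inclusive."""
    if last_bad_day < first_bad_day:
        raise ValueError("last_bad_day must not precede first_bad_day")
    n_days = (last_bad_day - first_bad_day).days + 1
    return [first_bad_day + timedelta(days=i) for i in range(n_days)]


# Illustrative 14-day window (dates are made up for the example)
window = partitions_to_reprocess(date(2024, 2, 27), date(2024, 3, 11))
print(len(window), window[0], window[-1])
```

The length of this list, multiplied by the runtime of one daily run, gives a rough lower bound on recovery time if partitions are reprocessed sequentially.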

Incident Severity Classification

Severity classification is the first decision in every incident. It determines the response time expectation, who gets paged, and what escalation path to follow.

| Severity | Definition | Response SLA | Who Gets Paged |
| --- | --- | --- | --- |
| P0 | ML model producing wrong outputs at scale, or financial reports wrong in production | 15 minutes | On-call data engineer + team lead |
| P1 | Model-impacting, limited scope; or a production dashboard wrong for executives | 1 hour | On-call data engineer |
| P2 | Analyst dashboard incorrect, no ML or financial impact | 4 hours | Slack alert to team channel |
| P3 | Data quality issue detected, no known consumer impact yet | Next business day | Jira ticket created |

The critical P0 scenario for data engineering is a broken ML feature table. If ml.user_features is wrong, every model that reads from it is making wrong predictions. If those models control pricing, content ranking, fraud detection, or recommendations at scale, a P0 is fully justified. The blast radius of a broken feature table is often larger than any single service outage.

# Severity classifier - determines severity from a check result

from enum import Enum
from typing import Dict, Optional


class Severity(str, Enum):
    P0 = "P0"
    P1 = "P1"
    P2 = "P2"
    P3 = "P3"


# Configure which tables are critical and their downstream impact
TABLE_CRITICALITY = {
    "ml.user_features": {"tier": "ml_critical", "downstream": ["churn_model", "ltv_model", "rec_engine"]},
    "mart.revenue_metrics": {"tier": "financial_critical", "downstream": ["executive_dashboard", "board_report"]},
    "mart.user_purchase_hist": {"tier": "product_critical", "downstream": ["retention_report", "growth_dashboard"]},
    "staging.events": {"tier": "foundational", "downstream": ["every mart model"]},
}


def classify_severity(result: Dict) -> Optional[Severity]:
    """Map a failed check result to a severity; return None if the check passed."""
    table = result.get("table", "")
    pillar = result.get("pillar", "")
    passed = result.get("passed")

    if passed:
        return None  # Not an incident

    criticality = TABLE_CRITICALITY.get(table, {}).get("tier", "low")

    # P0: critical table + ML or financial impact
    if criticality in ("ml_critical", "financial_critical"):
        if pillar in ("freshness", "volume", "distribution"):
            return Severity.P0

    # P1: critical table + schema change (may not have propagated yet)
    if criticality in ("ml_critical", "financial_critical") and pillar == "schema":
        return Severity.P1

    # P1: foundational table affecting many downstream consumers.
    # The tier itself encodes the wide fan-out: the "downstream" list may be a
    # single placeholder such as "every mart model", so its length is not a
    # reliable consumer count.
    if criticality == "foundational":
        return Severity.P1

    # P2: product-critical table
    if criticality == "product_critical":
        return Severity.P2

    # P3: everything else
    return Severity.P3
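Once a severity is assigned, the response SLA from the severity table above can be turned into a concrete deadline. The following is a minimal sketch: `response_deadline` is a hypothetical helper, and the P3 rule (end of the next Monday-to-Friday business day, ignoring holidays) is an assumption rather than something the table specifies.

```python
from datetime import datetime, timedelta

# Response SLAs from the severity table. P3 is handled separately because
# "next business day" is not a fixed duration.
RESPONSE_SLA = {
    "P0": timedelta(minutes=15),
    "P1": timedelta(hours=1),
    "P2": timedelta(hours=4),
}


def response_deadline(severity: str, detected_at: datetime) -> datetime:
    """Return the latest time by which the incident should be picked up."""
    if severity in RESPONSE_SLA:
        return detected_at + RESPONSE_SLA[severity]
    if severity == "P3":
        # Assumption: business days are Monday-Friday, holidays ignored.
        next_day = detected_at + timedelta(days=1)
        while next_day.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
            next_day += timedelta(days=1)
        return next_day.replace(hour=23, minute=59, second=59, microsecond=0)
    raise ValueError(f"Unknown severity: {severity}")


print(response_deadline("P0", datetime(2024, 3, 12, 9, 12)))
```

Storing the deadline alongside the incident record makes it possible to report SLA breaches later without re-deriving them from chat logs.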

The Five Phases of a Data Incident

A well-managed data incident moves through five phases: Detection, Triage, Mitigation, Root Cause Analysis, and Post-Mortem. Each phase has a time target and a clear deliverable.

Phase 1: Detection

The goal of detection is to minimize the time between when a problem starts and when the team knows about it. There are two detection modes:

Proactive detection: automated observability checks fire before any human notices. This is the target state. A freshness alert fires 2 hours after a pipeline stops updating - before any analyst has had a chance to open a dashboard.

Reactive detection: a stakeholder reports something looks wrong. This is the current state at most organizations. The problem with reactive detection is the detection gap: the time between when the problem started and when the stakeholder noticed it. For a dashboard that is only looked at once a week, the detection gap can be days.

Closing the detection gap is the single most important improvement you can make to incident response time. Every P0 that is detected by a stakeholder complaint rather than an automated alert is an argument for investing more in observability coverage.
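A proactive freshness check of the kind described above can be very small. This is a minimal sketch, assuming the caller already knows the table's last update time as a timezone-aware datetime; `check_freshness` and the shape of its result dict are illustrative, modelled on the check results used elsewhere in this lesson.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Optional


def check_freshness(
    table: str,
    last_updated: datetime,
    max_lag: timedelta = timedelta(hours=2),
    now: Optional[datetime] = None,
) -> Dict:
    """Fail when the table has not been updated within max_lag."""
    now = now or datetime.now(timezone.utc)
    lag = now - last_updated
    return {
        "table": table,
        "pillar": "freshness",
        "passed": lag <= max_lag,
        "metrics": {"lag_minutes": round(lag.total_seconds() / 60, 1)},
    }


now = datetime(2024, 3, 12, 9, 0, tzinfo=timezone.utc)
stale = check_freshness("events.user_sessions", now - timedelta(hours=3), now=now)
fresh = check_freshness("events.user_sessions", now - timedelta(minutes=30), now=now)
print(stale["passed"], fresh["passed"])  # False True
```

Passing `now` explicitly keeps the check deterministic in tests; in a scheduler it can be left at its default.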

Phase 2: Triage

Triage is the 15-minute sprint to understand scope before committing to a mitigation approach. The goal: identify which table is the root source, what pillar has failed, and which downstream consumers are affected.
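Identifying affected downstream consumers amounts to a graph traversal over lineage. The sketch below uses a breadth-first search over a hand-written lineage map; the edges in `LINEAGE` are hypothetical and only illustrate the shape of the data, and in practice the map would come from a lineage tool or dbt metadata.

```python
from collections import deque
from typing import Dict, List

# Hypothetical lineage map: table -> direct downstream consumers
LINEAGE: Dict[str, List[str]] = {
    "events.user_sessions": ["staging.events"],
    "staging.events": ["mart.user_purchase_hist"],
    "mart.user_purchase_hist": ["retention_report", "growth_dashboard"],
}


def downstream_consumers(root: str, lineage: Dict[str, List[str]]) -> List[str]:
    """Return every asset reachable from root, nearest consumers first."""
    seen = {root}
    ordered: List[str] = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in seen:
                seen.add(child)
                ordered.append(child)
                queue.append(child)
    return ordered


print(downstream_consumers("events.user_sessions", LINEAGE))
```

Breadth-first order is a deliberate choice here: it lists the closest consumers first, which is usually the order in which they need to be notified.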

The triage playbook - 10 queries to run in the first 5 minutes:

-- TRIAGE PLAYBOOK: run these queries in sequence
-- Table: ${AFFECTED_TABLE}
-- Timestamp: ${INCIDENT_START_TIME}
-- Dialect: PostgreSQL/Snowflake-style SQL (::float, ILIKE); adapt to your warehouse.

-- 1. When was the table last updated?
SELECT MAX(created_at) AS last_updated,
       NOW() - MAX(created_at) AS lag
FROM ${AFFECTED_TABLE};

-- 2. How does the daily row count compare over the last 7 days?
SELECT
    DATE(created_at) AS dt,
    COUNT(*) AS row_count
FROM ${AFFECTED_TABLE}
WHERE created_at >= CURRENT_DATE - INTERVAL '7 days'
GROUP BY DATE(created_at)
ORDER BY dt DESC;

-- 3. Are there null rate spikes in key columns?
-- (NULLIF avoids a division-by-zero error when yesterday has no rows)
SELECT
    COUNT(*) AS total,
    SUM(CASE WHEN user_id IS NULL THEN 1 ELSE 0 END)::float / NULLIF(COUNT(*), 0) AS user_id_null_rate,
    SUM(CASE WHEN created_at IS NULL THEN 1 ELSE 0 END)::float / NULLIF(COUNT(*), 0) AS ts_null_rate
FROM ${AFFECTED_TABLE}
WHERE DATE(created_at) = CURRENT_DATE - 1;

-- 4. What does the current schema look like?
SELECT column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_schema = '${SCHEMA}'
  AND table_name = '${TABLE_NAME}'
ORDER BY ordinal_position;

-- 5. Have there been any recent DDL changes?
-- (Snowflake-style query history. The view name and the query_type values
--  vary by warehouse; BigQuery, for example, exposes INFORMATION_SCHEMA.JOBS instead.)
SELECT query_text, user_name, start_time
FROM information_schema.query_history
WHERE query_type IN ('ALTER', 'CREATE', 'DROP', 'TRUNCATE')
  AND query_text ILIKE '%${TABLE_NAME}%'
  AND start_time >= NOW() - INTERVAL '24 hours'
ORDER BY start_time DESC;

-- 6. When did the last pipeline run complete?
-- (Airflow metadata DB)
SELECT dag_id, run_id, state, execution_date, end_date
FROM airflow.dag_run
WHERE dag_id ILIKE '%${TABLE_NAME}%'
ORDER BY execution_date DESC
LIMIT 5;

-- 7. Are upstream tables also affected?
SELECT 'staging.events' AS upstream_table, MAX(created_at) AS last_updated, COUNT(*) AS row_count FROM staging.events
UNION ALL
SELECT 'staging.users', MAX(created_at), COUNT(*) FROM staging.users;

-- 8. Is the value distribution different from the day before?
-- (returns one row per day so the two most recent complete days can be compared)
SELECT
    DATE(created_at) AS dt,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY ${KEY_METRIC_COL}) AS median_value,
    AVG(${KEY_METRIC_COL}) AS mean_value
FROM ${AFFECTED_TABLE}
WHERE DATE(created_at) >= CURRENT_DATE - 2
  AND DATE(created_at) < CURRENT_DATE
GROUP BY DATE(created_at)
ORDER BY dt DESC;

-- 9. Are there duplicate primary keys?
SELECT ${PK_COL}, COUNT(*) AS freq
FROM ${AFFECTED_TABLE}
WHERE DATE(created_at) = CURRENT_DATE - 1
GROUP BY ${PK_COL}
HAVING COUNT(*) > 1
LIMIT 10;

-- 10. What do the last 5 rows look like?
SELECT *
FROM ${AFFECTED_TABLE}
ORDER BY created_at DESC
LIMIT 5;
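The `${...}` placeholders in the playbook match the syntax of Python's standard-library `string.Template`, so the queries can be filled in by a small script instead of by hand during an incident. This is a minimal sketch: `render_query` and the identifier whitelist are assumptions added for illustration, and only the first playbook query is shown.

```python
import re
from string import Template

# Accept only plain (optionally dotted) SQL identifiers, because the values
# are substituted directly into the query text.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)*$")

FRESHNESS_QUERY = Template(
    "SELECT MAX(created_at) AS last_updated,\n"
    "       NOW() - MAX(created_at) AS lag\n"
    "FROM ${AFFECTED_TABLE};"
)


def render_query(template: Template, **params: str) -> str:
    """Substitute placeholders; reject anything that is not an identifier."""
    for name, value in params.items():
        if not _IDENTIFIER.match(value):
            raise ValueError(f"{name}={value!r} is not a plain SQL identifier")
    # substitute() raises KeyError if a placeholder is left unfilled
    return template.substitute(**params)


print(render_query(FRESHNESS_QUERY, AFFECTED_TABLE="events.user_sessions"))
```

Using `substitute` rather than `safe_substitute` is intentional: a missing placeholder fails loudly instead of producing a query that still contains `${...}`.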

Phase 3: Mitigation

Mitigation stops the damage. It is not the same as root cause analysis - mitigation is about restoring correct data as quickly as possible, even if the root cause is not yet fully understood. Common mitigation options:

Rerun the pipeline: the most common mitigation. Identify the affected time window, fix or revert the change that caused the issue, and rerun all pipeline tasks for that window. Works when the source data is correct and the problem is in the transformation layer.

Rollback to snapshot: if your data lake uses Delta Lake or Apache Iceberg, you can time-travel to a previous version of the table. This is the fastest mitigation when the problem is in the data itself rather than the transformation.

# Delta Lake time travel rollback
from typing import Optional

from delta.tables import DeltaTable
from pyspark.sql import SparkSession


def rollback_delta_table(
    spark: SparkSession,
    table_path: str,
    rollback_to_version: Optional[int] = None,
    rollback_to_timestamp: Optional[str] = None,
) -> None:
    """
    Roll back a Delta Lake table to a previous version.
    Use this when a bad write has corrupted current data.
    """
    # Fail fast if table_path does not point at a Delta table
    DeltaTable.forPath(spark, table_path)

    if rollback_to_version is not None:
        # Read the table at the target version
        target = f"version {rollback_to_version}"
        df_old = (
            spark.read.format("delta")
            .option("versionAsOf", rollback_to_version)
            .load(table_path)
        )
    elif rollback_to_timestamp is not None:
        # Read the table at the target timestamp
        target = f"timestamp {rollback_to_timestamp}"
        df_old = (
            spark.read.format("delta")
            .option("timestampAsOf", rollback_to_timestamp)
            .load(table_path)
        )
    else:
        raise ValueError("Must specify either rollback_to_version or rollback_to_timestamp")

    # Overwrite the current table with the old version, keeping the current schema
    (
        df_old.write.format("delta")
        .mode("overwrite")
        .option("overwriteSchema", "false")
        .save(table_path)
    )

    # Placeholder: emit an OpenLineage event here so lineage records the rollback
    print(f"Rolled back {table_path} to {target}")


# Iceberg time travel (via Spark)
# Assumes the Iceberg catalog is registered as "spark_catalog" and that
# table_name comes from trusted config, since it is interpolated into SQL.
def rollback_iceberg_table(
    spark: SparkSession,
    table_name: str,
    snapshot_id: int,
) -> None:
    spark.sql(f"""
        CALL spark_catalog.system.rollback_to_snapshot(
            table => '{table_name}',
            snapshot_id => {snapshot_id}
        )
    """)

Serve stale data with a staleness indicator: when a rerun will take hours, notify downstream consumers that the data is stale and provide the timestamp of the last known good data. This allows BI and ML consumers to make an informed decision about whether to wait or proceed with a staleness caveat.

Disable the affected model temporarily: if an ML model is producing wrong outputs, disable it and fall back to a default or heuristic. Wrong outputs at scale are worse than no outputs.
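The last two options, a staleness indicator and a temporary model kill switch, can share one serving-side mechanism. The sketch below is a minimal illustration: `IncidentFlags`, `score_user`, and the lambda stand-ins for the model and heuristic are all hypothetical names, not part of any serving framework.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class IncidentFlags:
    """Hypothetical switches an on-call engineer flips during an incident."""
    model_enabled: bool      # False = kill switch engaged
    data_is_stale: bool      # True = features come from the last good load
    last_good_data_at: str   # ISO timestamp of the last validated data


def score_user(
    user_id: str,
    flags: IncidentFlags,
    model: Callable[[str], float],
    fallback: Callable[[str], float],
) -> Dict:
    """Score a user, falling back to a heuristic when the model is disabled."""
    use_model = flags.model_enabled
    score = model(user_id) if use_model else fallback(user_id)
    return {
        "user_id": user_id,
        "score": score,
        "source": "model" if use_model else "fallback",
        # Staleness indicator: consumers decide whether to wait or proceed
        "stale": flags.data_is_stale,
        "data_as_of": flags.last_good_data_at,
    }


flags = IncidentFlags(model_enabled=False, data_is_stale=True,
                      last_good_data_at="2024-03-11T00:00:00Z")
result = score_user("u1", flags, model=lambda _: 0.93, fallback=lambda _: 0.5)
print(result["source"], result["stale"])
```

Returning `source` and `data_as_of` with every response means downstream consumers never have to guess whether they are looking at model output or a heuristic.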

Phase 4: Root Cause Analysis

After mitigation, before the post-mortem, do the 5-whys analysis. This is a structured technique for finding the root cause rather than stopping at the proximate cause.

A worked example, tracing the wrong day-7 retention report from the opening scenario:

| Why # | Question | Answer |
| --- | --- | --- |
| 1 | Why was day-7 retention showing 0%? | The join between session events and the retention window was producing no matches |
| 2 | Why was the join producing no matches? | The session_start_ts column in events.user_sessions was in EST, but the retention window was computed in UTC |
| 3 | Why was session_start_ts in EST instead of UTC? | The most recent data load from the session tracking system used the local timestamp instead of UTC |
| 4 | Why did the load use local timestamp? | The session tracking team deployed a new version of the event schema that dropped the utc_offset field; the ingestion script fell back to local time |
| 5 | Why did this change go undetected? | There was no automated check on the timezone offset of session_start_ts - the observability system checked row count and null rates but not the value distribution of timestamp fields |

The 5th why reveals the root cause: a monitoring gap. The action item from this analysis: add a distribution check on session_start_ts that verifies values are UTC-normalized, for example by comparing the hour-of-day distribution against a known UTC baseline, or by checking that the UTC offset recorded with each timestamp is consistently 0.

def five_whys_template() -> dict:
    """
    Structured template for data incident 5-whys analysis.
    Fill this in during or after the incident.
    """
    return {
        "incident_id": "INC-2024-0312-001",
        "incident_summary": "Day-7 retention report showing 0% for new users",
        "detection_method": "stakeholder_complaint",
        "detection_at": "2024-03-12T09:12:00Z",
        "mitigation_at": "2024-03-12T14:30:00Z",
        "total_downtime_hours": 5.3,
        "affected_consumers": ["retention_report", "growth_dashboard", "weekly_exec_email"],
        "five_whys": [
            {
                "why": 1,
                "question": "Why was day-7 retention showing 0%?",
                "answer": "Join between session events and retention window produced no matches",
            },
            {
                "why": 2,
                "question": "Why was the join producing no matches?",
                "answer": "session_start_ts column was in EST, retention window computed in UTC - 5-hour offset caused events to fall outside windows",
            },
            {
                "why": 3,
                "question": "Why was session_start_ts in EST?",
                "answer": "Most recent data load from session tracking system used local timestamp instead of UTC",
            },
            {
                "why": 4,
                "question": "Why did the load use local timestamp?",
                "answer": "Session tracking team deployed new event schema that dropped utc_offset field; ingestion script fell back to local time",
            },
            {
                "why": 5,
                "question": "Why did this change go undetected?",
                "answer": "No automated check on timezone distribution of session_start_ts - monitoring covered row count and null rates but not timestamp value distribution",
            },
        ],
        "root_cause": "Monitoring gap: no distribution check on timestamp timezone consistency",
        "contributing_factors": [
            "Source schema change not communicated to data engineering team",
            "No schema change notification from session tracking system to downstream consumers",
        ],
        "action_items": [
            {
                "id": "ACTION-001",
                "description": "Add timezone distribution check to session_start_ts: verify all values are UTC-normalized",
                "owner": "data-platform-team",
                "due_date": "2024-03-19",
                "prevents": "Same incident recurrence",
            },
            {
                "id": "ACTION-002",
                "description": "Establish schema change notification process between session tracking team and data engineering",
                "owner": "engineering-leads",
                "due_date": "2024-03-26",
                "prevents": "Silent source schema changes going undetected",
            },
            {
                "id": "ACTION-003",
                "description": "Configure automated Slack alert when any timestamp column distribution shifts by more than 1 standard deviation",
                "owner": "data-platform-team",
                "due_date": "2024-03-19",
                "prevents": "Distribution shifts going undetected",
            },
        ],
    }
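Derived fields in the template, such as total_downtime_hours, are easy to get wrong when typed by hand. As a small sketch, the value can be computed from the two timestamps instead; `downtime_hours` is a hypothetical helper name.

```python
from datetime import datetime


def downtime_hours(detection_at: str, mitigation_at: str) -> float:
    """Hours between two ISO-8601 timestamps, rounded to one decimal."""
    # Replace the trailing "Z" so fromisoformat() also works before Python 3.11
    start = datetime.fromisoformat(detection_at.replace("Z", "+00:00"))
    end = datetime.fromisoformat(mitigation_at.replace("Z", "+00:00"))
    return round((end - start).total_seconds() / 3600, 1)


print(downtime_hours("2024-03-12T09:12:00Z", "2024-03-12T14:30:00Z"))  # 5.3
```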

Phase 5: Post-Mortem

The post-mortem document is the output of the incident. It serves two purposes: it creates an institutional memory of what happened and how it was resolved, and it is the forcing function for turning incidents into monitoring improvements.

A good post-mortem has:

  1. Timeline: exact timestamps from first detection through all-clear, with who was on the call at each step
  2. Impact statement: which downstream consumers were affected, for how long, and what the business consequence was (wrong decisions made, analyst time wasted, executive report incorrect)
  3. Root cause: the full 5-whys chain, not just the proximate cause
  4. What went well: the parts of the response that worked (e.g., "the Airflow circuit breaker prevented the bad data from reaching the mart layer")
  5. What went poorly: the parts that need improvement (e.g., "triage took 2 hours because nobody had a lineage map of the affected table")
  6. Action items: specific, assigned, dated - not "improve monitoring" but "add distribution check on session_start_ts by March 19, owned by the data platform team"
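The "specific, assigned, dated" rule for action items is mechanical enough to lint. This is a minimal sketch, assuming action items are dicts with `description`, `owner`, and `due_date` keys as in the 5-whys template; `action_item_problems` and the list of vague phrases are illustrative choices, not an established standard.

```python
from datetime import date
from typing import Dict, List

REQUIRED_FIELDS = ("description", "owner", "due_date")
# Hypothetical examples of descriptions too vague to act on
VAGUE_PHRASES = ("improve monitoring", "be more careful", "investigate")


def action_item_problems(item: Dict) -> List[str]:
    """Return a list of problems; an empty list means the item is acceptable."""
    problems = [f"missing {field}" for field in REQUIRED_FIELDS if not item.get(field)]
    if item.get("due_date"):
        try:
            date.fromisoformat(item["due_date"])
        except ValueError:
            problems.append("due_date is not an ISO date")
    if item.get("description", "").strip().lower() in VAGUE_PHRASES:
        problems.append("description is too vague")
    return problems


good = {
    "description": "Add distribution check on session_start_ts",
    "owner": "data-platform-team",
    "due_date": "2024-03-19",
}
print(action_item_problems(good))
print(action_item_problems({"description": "Improve monitoring"}))
```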

The Prevention Loop

The most important part of incident management is what happens after the post-mortem. Every P0 and P1 incident should result in at least one new monitoring check that would have detected the incident 24 hours earlier.

This is the prevention loop: incident → root cause → monitoring gap identified → new check implemented → incident becomes detectable before it reaches stakeholders next time.

# After every incident, add a check to the monitoring config
# that would have caught it earlier.

# The incident: timezone distribution shift in session_start_ts
# New check: verify that timestamp values are UTC-normalized

from typing import Collection, Dict


def check_timestamp_utc_consistency(
    conn,
    table: str,
    timestamp_col: str,
    expected_peak_hours_utc: Collection[int] = range(13, 19),
) -> Dict:
    """
    Heuristic check that a timestamp column is still UTC-normalized.
    Added after the day-7 retention incident (INC-2024-0312-001).

    expected_peak_hours_utc is the hour-of-day range in which traffic
    normally peaks when the column holds correct UTC values. The default
    (13:00-18:00 UTC) is a placeholder: calibrate it from historical data.
    """
    # table and timestamp_col are interpolated into the SQL text, so they
    # must come from trusted config, never from user input.
    with conn.cursor() as cur:
        cur.execute(f"""
            SELECT
                -- Modal (busiest) hour of day for yesterday's timestamps.
                -- A load written in local time (e.g. UTC-5) shifts this peak
                -- by the offset, moving it outside the expected range.
                EXTRACT(HOUR FROM {timestamp_col}) AS hour_of_day,
                COUNT(*) AS event_count
            FROM {table}
            WHERE DATE({timestamp_col}) = CURRENT_DATE - 1
              AND {timestamp_col} IS NOT NULL
            GROUP BY EXTRACT(HOUR FROM {timestamp_col})
            ORDER BY event_count DESC
            LIMIT 1
        """)
        modal_hour = cur.fetchone()

    if not modal_hour:
        return {"table": table, "passed": None, "reason": "No data"}

    # EXTRACT may return a Decimal or float depending on the driver
    peak_hour = int(modal_hour[0])

    # A peak outside the calibrated UTC range suggests the column is no
    # longer UTC (for example, a local-time load shifted it by 5 hours).
    suspicious_timezone = peak_hour not in expected_peak_hours_utc

    return {
        "table": table,
        "pillar": "distribution",
        "check_name": "timestamp_utc_consistency",
        "passed": not suspicious_timezone,
        "severity": "critical" if suspicious_timezone else "info",
        "metrics": {"peak_hour_utc": peak_hour},
        "reason": f"Peak timestamp hour {peak_hour}:00 is outside the expected UTC peak range" if suspicious_timezone else None,
    }
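To see why an hour-of-day check can surface a timezone regression, here is a database-free sketch using synthetic timestamps. It assumes traffic has a stable daily peak; the data and the `modal_hour` helper are made up for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta
from typing import List


def modal_hour(timestamps: List[datetime]) -> int:
    """Return the hour of day that occurs most often."""
    return Counter(ts.hour for ts in timestamps).most_common(1)[0][0]


# Synthetic sessions that peak at 15:00 when stored correctly in UTC
utc_sessions = [datetime(2024, 3, 11, 15, m) for m in range(0, 60, 10)]
utc_sessions += [datetime(2024, 3, 11, 9, 0), datetime(2024, 3, 11, 20, 30)]

# The same sessions written in UTC-5 local time: every value is 5 hours earlier
local_sessions = [ts - timedelta(hours=5) for ts in utc_sessions]

print(modal_hour(utc_sessions), modal_hour(local_sessions))  # 15 10
```

Row counts and null rates are identical for both lists; only the position of the peak moves, which is why a distribution check is needed to catch this class of failure.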

Communication Templates

During a P0/P1 incident, communication takes as long as mitigation. Having templates ready means communication is never the bottleneck.

Initial incident notification (send within 5 minutes of detection):

DATA INCIDENT - P{SEVERITY} - {TIMESTAMP}

Affected: {TABLE(S)} - {PILLAR AFFECTED}
Impact: {DOWNSTREAM CONSUMERS AFFECTED}
Status: INVESTIGATING
ETA for update: 30 minutes

Incident commander: {NAME}
Investigation channel: #incident-{DATE}

Status update (send every 30 minutes for P0, every hour for P1):

DATA INCIDENT UPDATE - {TIMESTAMP}

Status: {INVESTIGATING | MITIGATION IN PROGRESS | RESOLVED}
Latest finding: {WHAT WE KNOW}
Current action: {WHAT WE ARE DOING RIGHT NOW}
ETA for resolution: {TIME ESTIMATE}

Data affected from: {START_TIME} to {CURRENT_TIME OR END_TIME}

All-clear notification (send when data is correct and verified):

DATA INCIDENT RESOLVED - {TIMESTAMP}

Issue: {ONE-SENTENCE SUMMARY}
Root cause: {ONE-SENTENCE ROOT CAUSE}
Duration: {HOURS AFFECTED}

Data has been validated and is correct as of {VERIFICATION_TIMESTAMP}.
All downstream consumers ({LIST}) should be operational.

Post-mortem will be published by: {DATE}
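Templates are most useful when they can be filled in by a script from the alert payload. The following is a minimal sketch for the initial notification; `initial_notification` is a hypothetical helper, and the one-hour fallback cadence for severities other than P0 and P1 is an assumption.

```python
from datetime import datetime, timedelta
from typing import List

# Update cadence from the text: every 30 minutes for P0, every hour for P1
UPDATE_INTERVAL = {"P0": timedelta(minutes=30), "P1": timedelta(hours=1)}


def initial_notification(
    severity: str,
    tables: List[str],
    pillar: str,
    consumers: List[str],
    commander: str,
    detected_at: datetime,
) -> str:
    """Fill in the initial incident notification template."""
    # Assumption: anything below P1 also gets hourly updates
    interval = UPDATE_INTERVAL.get(severity, timedelta(hours=1))
    minutes = int(interval.total_seconds() // 60)
    return "\n".join([
        f"DATA INCIDENT - {severity} - {detected_at:%Y-%m-%d %H:%M} UTC",
        "",
        f"Affected: {', '.join(tables)} - {pillar}",
        f"Impact: {', '.join(consumers)}",
        "Status: INVESTIGATING",
        f"ETA for update: {minutes} minutes",
        "",
        f"Incident commander: {commander}",
        f"Investigation channel: #incident-{detected_at:%Y-%m-%d}",
    ])


msg = initial_notification(
    "P0", ["events.user_sessions"], "distribution",
    ["retention_report", "growth_dashboard"], "Alex",
    datetime(2024, 3, 12, 9, 12),
)
print(msg)
```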

:::danger Never fix data silently

When you rerun a pipeline to fix bad data, always notify the affected stakeholders. The worst outcome is a stakeholder who has already made a business decision on the wrong data while the team fixed it and moved on without telling anyone. Communicate every fix, even when the fix is fast and confident.

:::

:::warning The detection gap is the most important metric to track

Track the time between when a data quality issue started and when the team was notified. This is your "detection gap." A reasonable target for P0 issues is under 30 minutes; teams without systematic observability often see detection gaps of 2–8 hours. Review this metric monthly. It is the clearest indicator of observability program maturity and a strong basis for investment decisions.

:::
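Tracking the detection gap only requires three timestamps per incident. This is a minimal sketch that also computes time to resolve; the field names (`started_at`, `detected_at`, `resolved_at`) and the two sample incidents are synthetic assumptions, not data from a real incident log.

```python
from datetime import datetime
from statistics import median
from typing import Dict, List


def incident_metrics(incidents: List[Dict[str, datetime]]) -> Dict[str, float]:
    """Median detection gap and time to resolve, both in minutes."""
    gaps = [(i["detected_at"] - i["started_at"]).total_seconds() / 60 for i in incidents]
    mttrs = [(i["resolved_at"] - i["detected_at"]).total_seconds() / 60 for i in incidents]
    return {
        "median_detection_gap_min": median(gaps),
        "median_mttr_min": median(mttrs),
    }


# Synthetic month with two incidents
incidents = [
    {"started_at": datetime(2024, 3, 12, 6, 0),
     "detected_at": datetime(2024, 3, 12, 9, 12),
     "resolved_at": datetime(2024, 3, 12, 14, 30)},
    {"started_at": datetime(2024, 3, 20, 10, 0),
     "detected_at": datetime(2024, 3, 20, 10, 20),
     "resolved_at": datetime(2024, 3, 20, 11, 0)},
]
print(incident_metrics(incidents))
```

The median is used rather than the mean so that one unusually long incident does not dominate the monthly figure.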

Interview Q&A

Q: How do you classify the severity of a data quality incident?

A: I use a P0–P3 classification based on two dimensions: the downstream impact (what consumers are affected) and the nature of the failure (silent corruption vs. missing data vs. schema change). P0 is when an ML model is producing wrong outputs at scale or financial reports are incorrect in production - both have immediate business consequences. P1 is when a model-impacting table is broken with limited scope, or executive dashboards are wrong. P2 covers analyst dashboards and operational reports with no ML or financial impact. P3 is a quality issue detected before it reaches any consumer. The classification drives the response SLA: P0 requires acknowledgment within 15 minutes, P1 within 1 hour, P2 within 4 hours. For data incidents specifically, I also consider whether the issue is silent (data looks present but is wrong) vs. obvious (table is empty or pipeline has failed) - silent issues are always higher severity because they have been affecting consumers longer before detection.

Q: Walk me through how you would triage a data quality incident in the first 10 minutes.

A: The first 10 minutes are about understanding scope, not fixing anything. Step one: identify the reported symptom and the consuming asset (which dashboard, which model). Step two: trace one hop upstream to the source table. Run SELECT MAX(created_at), COUNT(*) to check freshness and volume immediately. Step three: check the schema - run SELECT column_name, data_type FROM information_schema.columns and compare to what you expect. Step four: check upstream table health using the same freshness and volume queries. Step five: look at query history for recent DDL operations on the affected table. With those five checks, you will have identified whether the problem is in the transformation layer (upstream tables are healthy, downstream is wrong), in ingestion (upstream tables have the issue), or in schema (a column changed type or was removed). The entire triage takes 10–15 minutes and tells you where to focus the fix.

Q: What is the 5-whys technique and how do you apply it to data incidents?

A: The 5-whys is a structured root cause analysis technique where you ask "why?" repeatedly, each time targeting the answer to the previous question, until you reach a root cause that can be acted upon. For data incidents, the technique is powerful because the proximate cause (wrong value in a column) is almost never the actionable root cause. The actionable root cause is usually either a monitoring gap (we did not have a check that would have caught this), a process gap (no schema change notification between source and destination teams), or a code defect (a transformation has a bug). In the session timezone example, stopping at the first why - "the join produced no matches" - leads you to fix the join. Stopping at the fifth why - "we had no distribution check on timestamp values" - leads you to add a check that prevents recurrence. The post-mortem action items should always target the root cause, not the proximate cause.

Q: What are the mitigation options when a data pipeline has produced incorrect data, and how do you choose between them?

A: The mitigation options are, in order of speed: (1) serve stale data with a staleness indicator - fastest, but requires downstream consumers to handle staleness gracefully; (2) rollback to a previous snapshot using Delta Lake or Iceberg time travel - fast if the table supports time travel and the correct data exists in a previous version; (3) rerun the affected pipeline for the affected time window - requires fixing the root cause first, takes minutes to hours depending on data volume; (4) disable the affected consumer - appropriate when a model is producing dangerously wrong outputs and rerun will take too long. I choose based on the severity and the time available: for a P0 affecting an ML model, I disable the model first (to stop wrong outputs immediately), then rerun the pipeline (to restore correct operation), then communicate the timeline to affected teams. For a P2 affecting an analyst dashboard, a rerun after the fix is usually sufficient.

Q: How do you prevent the same data incident from recurring?

A: The prevention loop is the most important part of the incident process, and the part most teams skip. After every P0 or P1, I require one action item specifically: "add a monitoring check that would have detected this incident 24 hours before the stakeholder noticed it." If the incident was caused by a timezone distribution shift, add a distribution check on timestamp values. If it was caused by a schema change, add schema snapshot monitoring to the affected table. If it was caused by an upstream table going stale, add freshness monitoring to that table. Over time, this compounds: each incident makes the monitoring system more comprehensive. After 20 incidents, your monitoring covers 20 failure modes it did not cover before. This is how you build a monitoring system that covers unknown unknowns - not by anticipating them, but by systematically converting every past unknown into a known with a check.

Q: How do you structure data incident communication to avoid confusion and information overload?

A: I use three communication artifacts with clear purposes. First: the initial incident notification, sent within 5 minutes of detection, stating affected systems, severity, current status, and who is investigating. This goes to a broad audience. Second: status updates every 30 minutes for P0, every hour for P1, with specific progress ("we have identified the root cause, fix is in progress") rather than generic updates ("we are still investigating"). These go to the incident channel only - anyone who wants updates subscribes to the channel. Third: the all-clear notification, with a one-sentence root cause summary, the data validation timestamp (when data was confirmed correct), and a pointer to the post-mortem. The key discipline is to over-communicate during the incident and use templates so communication never becomes a bottleneck. The worst pattern is a team that goes silent during investigation - stakeholders fill the silence with speculation, escalations, and questions that distract the investigation team.

Q: How do you measure the effectiveness of your data incident management process over time?

A: I track four metrics monthly. Detection gap: the time between when the issue started (often traceable from the query logs or pipeline logs) and when the team was notified. The target is under 30 minutes for P0 and P1. Mean time to resolve (MTTR): from detection to all-clear. This measures how efficiently triage and mitigation work. False positive rate: the percentage of automated alerts that did not correspond to real data quality issues. Tracks alert fatigue risk. Prevention success rate: for each P0 and P1 incident, did the team implement the action items from the post-mortem? If the prevention loop closes correctly, you should see the same incident type rarely repeat. These four metrics together tell you whether your observability program is improving: detection gap decreasing means better monitoring coverage, MTTR decreasing means better runbooks and tooling, false positive rate stable means thresholds are well-calibrated, and prevention success rate above 80% means the organization is actually learning from incidents.
