:::tip 🎮 Interactive Playground Visualize this concept: Try the Data Quality Checks demo on the EngineersOfAI Playground - no code required. :::
Five Pillars of Data Observability for ML Systems
The Investigation That Should Never Have Happened
It started with a Slack message at 2:47 PM: "Hey, can someone check the Q3 revenue numbers? They look really high."
The data team began their investigation at 3:00 PM. By 4:00 PM they had ruled out a dashboard bug and a calculation error. By 5:00 PM they had identified that the source table, events.purchase_events, had changed. By 5:30 PM they understood what had changed: the table had gained two new columns from a recently integrated partner system, row counts had jumped 40% because that same partner system had started backfilling historical events, and the created_at timestamp field had silently shifted from UTC to the partner system's local timezone - Eastern Standard Time, UTC-5.
The root cause of a 3-hour investigation was three simultaneous changes to a single source table: a schema change, a volume spike, and a distribution shift. All three were detectable in under 5 minutes with automated observability checks. Instead, they were caught by a confused product manager who noticed the numbers looked wrong.
This is what data downtime looks like in practice. It is not a crashed server or a 500 error. It is data that is technically present and queryable but wrong - wrong schema, wrong volume, wrong values. Software reliability engineering has decades of tooling to detect and alert on service failures. Data reliability engineering has been, until recently, improvised on top of SQL queries, manual checks, and human intuition.
The five pillars of data observability are the systematic answer to this improvisation. They define exactly what properties of data need to be monitored, how to measure each property, and how to set alert thresholds that catch real problems without generating so much noise that engineers start ignoring alerts.
Why This Exists
The first generation of data quality tooling focused on correctness: does this value match a rule? Is this column non-null? Does this value fall in a valid range? These checks, implemented in tools like Great Expectations, answered the question "is this specific thing wrong?" They required a human to anticipate every possible failure mode and write a check for it.
The problem with rule-based checks is that they only catch known unknowns. You can write a check for the failure mode you have already seen. You cannot write a check for the failure mode you have never imagined. The timezone shift in the opening example would not have been caught by any rule-based check unless someone had previously anticipated exactly that failure pattern and written a check for distribution drift on timestamp fields.
Barr Moses, CEO of Monte Carlo, coined the term "data observability" in 2020 and articulated the five-pillar framework as an answer to this limitation. The insight was borrowed directly from software observability: instead of asking "is this specific thing wrong?", ask "does the overall behavior of this data match what I would expect?" This shifts from rule-based to anomaly-based detection, and from known unknowns to unknown unknowns.
The five pillars give you a complete picture of data health. If any one pillar changes unexpectedly, something in your data pipeline has changed. That change may or may not be intentional, but you need to know about it. Monitoring all five pillars means you have a systematically complete view of data health, not a partial view shaped by whatever failure modes happened to have occurred to your team in the past.
Historical Context
Data quality as a discipline predates the five-pillar framework. The ISO 8000 standard for data quality dates to 2008. Master Data Management platforms from IBM and Informatica were selling data governance solutions in the 2000s. The difference between that era's tools and modern data observability is the shift from batch validation (run a quality check once per day, report results) to continuous monitoring (always-on anomaly detection that alerts within minutes of a problem).
The modern data stack enabled this shift. When Snowflake and BigQuery made it cheap to run arbitrary SQL against petabytes of data in seconds, it became practical to run hundreds of monitoring queries every hour. The rise of dbt made it possible to instrument quality checks directly in the transformation layer. The rise of Apache Airflow made it possible to block downstream pipelines when quality checks failed.
Monte Carlo's 2020 Series A and the subsequent coining of "data observability" created a category. Bigeye, Anomalo, Soda, and Datafold all followed with variations on the same core insight. The five-pillar framework became the industry-standard way to describe what data observability measures. It is now referenced in engineering blog posts from Airbnb, Lyft, Netflix, and virtually every data-mature company.
The Five Pillars
The five pillars are best understood not as independent metrics, but as five different lenses through which you can observe the same data asset. A single problem in a pipeline can manifest in multiple pillars simultaneously - as in the opening example, where one change showed up in volume, schema, and distribution all at once.
Pillar 1: Freshness
Freshness answers the question: "Is this data current?" For any table with a time-based update pattern, freshness is the elapsed time since the last row was written. A table that should update every hour but has not updated in 6 hours has a freshness violation.
Freshness violations are the most common cause of ML model degradation in production. A feature pipeline that stops updating means your model is making predictions on stale data. The model continues to run, continues to serve predictions, and continues to appear healthy on model monitoring dashboards - but it is predicting based on data from 6, 12, 24 hours ago.
The formal freshness metric is data lag: . You define a freshness SLA per table: for example, "this table must be updated within 2 hours of the hour boundary." If , trigger an alert.
import psycopg2
from datetime import datetime, timedelta
import pytz
def check_freshness(
conn,
table: str,
timestamp_col: str,
sla_hours: float,
) -> dict:
"""
Check if a table meets its freshness SLA.
Returns a dict with lag, SLA, and whether it passed.
"""
query = f"""
SELECT MAX({timestamp_col}) AS last_updated
FROM {table}
"""
with conn.cursor() as cur:
cur.execute(query)
last_updated = cur.fetchone()[0]
if last_updated is None:
return {
"table": table,
"last_updated": None,
"lag_hours": None,
"sla_hours": sla_hours,
"passed": False,
"reason": "Table is empty - no rows found",
}
# Normalize to UTC
if last_updated.tzinfo is None:
last_updated = last_updated.replace(tzinfo=pytz.UTC)
now = datetime.now(tz=pytz.UTC)
lag = now - last_updated
lag_hours = lag.total_seconds() / 3600
return {
"table": table,
"last_updated": last_updated.isoformat(),
"lag_hours": round(lag_hours, 2),
"sla_hours": sla_hours,
"passed": lag_hours <= sla_hours,
"reason": None if lag_hours <= sla_hours else f"Lag {lag_hours:.1f}h exceeds SLA {sla_hours}h",
}
Pillar 2: Volume
Volume answers the question: "Does this table have the expected number of rows?" Volume monitoring catches a wide class of pipeline failures: truncated loads, duplicate inserts, missed partitions, and filter bugs that silently exclude large amounts of data.
The key insight is that volume does not have a fixed expected value - it has an expected range that evolves over time. A growing e-commerce platform should see more orders every month. Volume monitoring needs to be adaptive, not static.
The standard approach is to compare today's row count to a rolling baseline. A common threshold is ±20% of the 7-day rolling average. This is simple, interpretable, and catches most volume anomalies without generating excessive false positives.
For partitioned tables, check volume per partition rather than total volume. A single missing partition is not visible in the total row count if you have 100 other partitions - but it will show up as a zero-row partition in per-partition monitoring.
from scipy import stats
import numpy as np
def check_volume(
conn,
table: str,
date_col: str,
lookback_days: int = 14,
alert_threshold_pct: float = 0.25,
) -> dict:
"""
Check daily row count against rolling baseline.
Uses z-score to detect unusual volume.
"""
query = f"""
SELECT
DATE({date_col}) AS dt,
COUNT(*) AS row_count
FROM {table}
WHERE {date_col} >= CURRENT_DATE - INTERVAL '{lookback_days} days'
GROUP BY DATE({date_col})
ORDER BY dt
"""
with conn.cursor() as cur:
cur.execute(query)
rows = cur.fetchall()
if len(rows) < 2:
return {"table": table, "passed": None, "reason": "Insufficient history"}
dates = [r[0] for r in rows]
counts = [r[1] for r in rows]
today_count = counts[-1]
baseline_counts = counts[:-1]
baseline_mean = np.mean(baseline_counts)
baseline_std = np.std(baseline_counts)
if baseline_mean == 0:
return {"table": table, "passed": False, "reason": "Baseline mean is zero"}
# Deviation from mean as a percentage
pct_deviation = abs(today_count - baseline_mean) / baseline_mean
# Z-score for statistical significance
z_score = (today_count - baseline_mean) / baseline_std if baseline_std > 0 else 0
passed = pct_deviation <= alert_threshold_pct
return {
"table": table,
"today_count": today_count,
"baseline_mean": round(baseline_mean, 1),
"baseline_std": round(baseline_std, 1),
"pct_deviation": round(pct_deviation * 100, 1),
"z_score": round(z_score, 2),
"passed": passed,
"reason": None if passed else f"Volume {pct_deviation*100:.1f}% from baseline (z={z_score:.2f})",
}
Pillar 3: Schema
Schema monitoring answers the question: "Have the columns, types, or nullable constraints of this table changed?" Schema changes are among the most disruptive events in a data pipeline because downstream tables, models, and dashboards depend on columns existing with specific names and types.
Schema monitoring works by capturing a snapshot of the schema at each check interval and comparing it to the previous snapshot. Changes are classified by severity:
- Additive (safe): a new nullable column was added. Downstream consumers that do not reference this column are unaffected.
- Breaking: an existing column was renamed or removed. Every downstream consumer that references this column will break.
- Dangerous: a column's type changed. Downstream consumers may silently produce wrong results - for example, an INTEGER column changing to FLOAT will cause sum aggregations to produce different results.
import json
from dataclasses import dataclass, field
from typing import List
@dataclass
class ColumnDef:
name: str
dtype: str
nullable: bool
@dataclass
class SchemaChange:
change_type: str # added, removed, type_changed, nullable_changed
column: str
old_value: str = None
new_value: str = None
severity: str = "info" # info, warning, critical
def get_schema(conn, table: str) -> List[ColumnDef]:
"""Fetch current schema from information_schema."""
schema, tbl = table.split(".")
query = """
SELECT column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_schema = %s AND table_name = %s
ORDER BY ordinal_position
"""
with conn.cursor() as cur:
cur.execute(query, (schema, tbl))
return [
ColumnDef(name=r[0], dtype=r[1], nullable=(r[2] == "YES"))
for r in cur.fetchall()
]
def diff_schemas(
old: List[ColumnDef],
new: List[ColumnDef],
) -> List[SchemaChange]:
"""Diff two schema snapshots, classify each change by severity."""
old_map = {c.name: c for c in old}
new_map = {c.name: c for c in new}
changes = []
# Removed columns - always critical (breaks downstream)
for name in old_map:
if name not in new_map:
changes.append(SchemaChange(
change_type="removed",
column=name,
old_value=old_map[name].dtype,
severity="critical",
))
# Added columns - info (additive, safe)
for name in new_map:
if name not in old_map:
changes.append(SchemaChange(
change_type="added",
column=name,
new_value=new_map[name].dtype,
severity="info",
))
# Type changes - critical (silent corruption risk)
for name in old_map:
if name in new_map:
if old_map[name].dtype != new_map[name].dtype:
changes.append(SchemaChange(
change_type="type_changed",
column=name,
old_value=old_map[name].dtype,
new_value=new_map[name].dtype,
severity="critical",
))
elif old_map[name].nullable != new_map[name].nullable:
changes.append(SchemaChange(
change_type="nullable_changed",
column=name,
old_value=str(old_map[name].nullable),
new_value=str(new_map[name].nullable),
severity="warning",
))
return changes
Pillar 4: Distribution
Distribution monitoring answers the question: "Do the values in this column still look the way they should?" This is the most statistically sophisticated pillar and the most important for ML systems. A model trained on a feature distribution will degrade when that distribution shifts - even if the data is fresh, at the right volume, with the correct schema.
Distribution monitoring covers three sub-checks:
Null rate monitoring: track the percentage of null values per column. A column that was 0.1% null becoming 15% null signals a broken upstream join or a source system issue.
Value distribution monitoring: track histogram bins, percentiles, and summary statistics. The 95th percentile of order_value_usd shifting from 2,000 signals either a data quality issue or a significant business change.
Categorical distribution monitoring: for low-cardinality columns, track the frequency of each value. A column that should contain {US, UK, EU} with roughly equal distribution showing 95% US signals something changed in the upstream system or ingestion.
from scipy import stats
import numpy as np
from typing import Dict, Any
def check_distribution(
conn,
table: str,
column: str,
lookback_partition: str = "CURRENT_DATE - 1",
reference_partition: str = "CURRENT_DATE - 8",
) -> Dict[str, Any]:
"""
Compare column distribution between two time windows.
Uses KS test for continuous and chi-square for categorical.
"""
query = f"""
WITH today AS (
SELECT {column}::float AS val
FROM {table}
WHERE DATE(created_at) = {lookback_partition}
AND {column} IS NOT NULL
),
reference AS (
SELECT {column}::float AS val
FROM {table}
WHERE DATE(created_at) = {reference_partition}
AND {column} IS NOT NULL
)
SELECT 'today' AS period, val FROM today
UNION ALL
SELECT 'reference' AS period, val FROM reference
"""
with conn.cursor() as cur:
cur.execute(query)
rows = cur.fetchall()
today_vals = [r[1] for r in rows if r[0] == "today"]
ref_vals = [r[1] for r in rows if r[0] == "reference"]
if not today_vals or not ref_vals:
return {"column": column, "passed": None, "reason": "Insufficient data"}
# Kolmogorov-Smirnov test: are these two distributions the same?
ks_stat, p_value = stats.ks_2samp(today_vals, ref_vals)
# Null rate comparison
null_rate_query = f"""
SELECT
SUM(CASE WHEN {column} IS NULL THEN 1 ELSE 0 END)::float / COUNT(*) AS null_rate
FROM {table}
WHERE DATE(created_at) = {{partition}}
"""
passed = p_value > 0.05 # Fail if distributions are significantly different
return {
"column": column,
"today_count": len(today_vals),
"today_mean": round(np.mean(today_vals), 4),
"today_p95": round(np.percentile(today_vals, 95), 4),
"ref_mean": round(np.mean(ref_vals), 4),
"ref_p95": round(np.percentile(ref_vals, 95), 4),
"ks_statistic": round(ks_stat, 4),
"p_value": round(p_value, 4),
"passed": passed,
"reason": None if passed else f"KS test p={p_value:.4f} - distribution shifted significantly",
}
Pillar 5: Lineage
Lineage answers the question: "If this data changes, what breaks?" It is both a debugging tool (trace a wrong value backward to its source) and a risk management tool (trace a change forward to understand its blast radius before deploying it).
Lineage has two levels. Table-level lineage tracks which tables feed which tables: raw.events → staging.events → mart.user_features. Column-level lineage tracks which specific source columns produce which specific output columns: raw.events.created_at → (UTC conversion in staging.events) → mart.user_features.last_event_ts.
Column-level lineage is dramatically more useful but dramatically harder to compute. It requires parsing SQL transformations to understand which source columns are referenced in which output column expressions. The sqlglot library is the current best-in-class tool for this.
We cover lineage in full depth in Lesson 02. For this lesson, understand that lineage is the connective tissue that turns the other four pillars from isolated alerts into an actionable understanding of your data estate. When a volume alert fires on raw.events, lineage tells you immediately: "this table feeds staging.events, mart.user_features, and model.churn_risk - all three downstream consumers may be affected."
The Full Five-Pillar Monitor
Here is a complete Python class that implements checks for all five pillars and stores results in a metrics table for trending:
import psycopg2
import json
import logging
from datetime import datetime, timedelta
from typing import List, Dict, Any, Optional
import numpy as np
from scipy import stats
import pytz
logger = logging.getLogger(__name__)
class DataObservabilityMonitor:
"""
Implements all five pillars of data observability for a PostgreSQL/Redshift warehouse.
Stores check results in a metrics table for trending and dashboarding.
"""
METRICS_TABLE = "data_observability.check_results"
def __init__(self, conn_string: str):
self.conn_string = conn_string
def _get_conn(self):
return psycopg2.connect(self.conn_string)
def _store_result(self, conn, result: Dict[str, Any]) -> None:
"""Persist a check result to the metrics store."""
query = f"""
INSERT INTO {self.METRICS_TABLE}
(checked_at, table_name, pillar, passed, metric_value, threshold, notes)
VALUES
(%s, %s, %s, %s, %s, %s, %s)
"""
with conn.cursor() as cur:
cur.execute(query, (
datetime.now(tz=pytz.UTC),
result.get("table"),
result.get("pillar"),
result.get("passed"),
json.dumps(result.get("metrics", {})),
json.dumps(result.get("thresholds", {})),
result.get("reason"),
))
conn.commit()
def check_freshness(
self,
table: str,
timestamp_col: str,
sla_hours: float,
) -> Dict[str, Any]:
with self._get_conn() as conn:
query = f"SELECT MAX({timestamp_col}) FROM {table}"
with conn.cursor() as cur:
cur.execute(query)
last_updated = cur.fetchone()[0]
if last_updated is None:
result = {
"pillar": "freshness", "table": table,
"passed": False, "reason": "Table is empty",
}
else:
if last_updated.tzinfo is None:
last_updated = last_updated.replace(tzinfo=pytz.UTC)
lag_hours = (datetime.now(tz=pytz.UTC) - last_updated).total_seconds() / 3600
passed = lag_hours <= sla_hours
result = {
"pillar": "freshness", "table": table,
"passed": passed,
"metrics": {"lag_hours": round(lag_hours, 2), "last_updated": last_updated.isoformat()},
"thresholds": {"sla_hours": sla_hours},
"reason": None if passed else f"Lag {lag_hours:.1f}h exceeds SLA {sla_hours}h",
}
self._store_result(conn, result)
return result
def check_volume(
self,
table: str,
date_col: str,
alert_threshold_pct: float = 0.25,
lookback_days: int = 14,
) -> Dict[str, Any]:
with self._get_conn() as conn:
query = f"""
SELECT DATE({date_col}), COUNT(*)
FROM {table}
WHERE {date_col} >= CURRENT_DATE - INTERVAL '{lookback_days} days'
GROUP BY DATE({date_col})
ORDER BY 1
"""
with conn.cursor() as cur:
cur.execute(query)
rows = cur.fetchall()
if len(rows) < 3:
return {"pillar": "volume", "table": table, "passed": None, "reason": "Insufficient history"}
counts = [r[1] for r in rows]
today_count = counts[-1]
baseline = counts[:-1]
mean = np.mean(baseline)
pct_dev = abs(today_count - mean) / mean if mean > 0 else 0
passed = pct_dev <= alert_threshold_pct
result = {
"pillar": "volume", "table": table, "passed": passed,
"metrics": {
"today_count": today_count,
"baseline_mean": round(mean, 1),
"pct_deviation": round(pct_dev * 100, 1),
},
"thresholds": {"alert_threshold_pct": alert_threshold_pct * 100},
"reason": None if passed else f"Volume {pct_dev*100:.1f}% from baseline",
}
self._store_result(conn, result)
return result
def check_schema(
self,
table: str,
previous_schema_json: Optional[str] = None,
) -> Dict[str, Any]:
"""
Check for schema changes. If previous_schema_json is None,
this is the first run - store schema and return passed=True.
"""
schema_parts = table.split(".")
schema_name, table_name = schema_parts[0], schema_parts[1]
with self._get_conn() as conn:
query = """
SELECT column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_schema = %s AND table_name = %s
ORDER BY ordinal_position
"""
with conn.cursor() as cur:
cur.execute(query, (schema_name, table_name))
current_schema = {
r[0]: {"dtype": r[1], "nullable": r[2]}
for r in cur.fetchall()
}
if previous_schema_json is None:
return {
"pillar": "schema", "table": table,
"passed": True,
"metrics": {"current_schema": current_schema},
"reason": "Baseline schema captured",
}
previous_schema = json.loads(previous_schema_json)
changes = []
for col in previous_schema:
if col not in current_schema:
changes.append({"type": "removed", "column": col, "severity": "critical"})
for col in current_schema:
if col not in previous_schema:
changes.append({"type": "added", "column": col, "severity": "info"})
elif current_schema[col]["dtype"] != previous_schema[col]["dtype"]:
changes.append({
"type": "type_changed", "column": col, "severity": "critical",
"old": previous_schema[col]["dtype"],
"new": current_schema[col]["dtype"],
})
critical_changes = [c for c in changes if c["severity"] == "critical"]
passed = len(critical_changes) == 0
result = {
"pillar": "schema", "table": table, "passed": passed,
"metrics": {"changes": changes, "change_count": len(changes)},
"reason": None if passed else f"{len(critical_changes)} critical schema changes detected",
}
return result
def run_all_checks(self, check_configs: List[Dict]) -> List[Dict]:
"""
Run all configured checks and return results.
check_configs example:
[
{"pillar": "freshness", "table": "events.purchases",
"timestamp_col": "created_at", "sla_hours": 2},
{"pillar": "volume", "table": "events.purchases",
"date_col": "created_at", "alert_threshold_pct": 0.20},
]
"""
results = []
for config in check_configs:
pillar = config["pillar"]
try:
if pillar == "freshness":
result = self.check_freshness(
config["table"], config["timestamp_col"], config["sla_hours"]
)
elif pillar == "volume":
result = self.check_volume(
config["table"], config["date_col"],
config.get("alert_threshold_pct", 0.25),
)
elif pillar == "schema":
result = self.check_schema(
config["table"], config.get("previous_schema_json")
)
else:
result = {"pillar": pillar, "passed": None, "reason": "Unknown pillar"}
results.append(result)
status = "PASS" if result.get("passed") else "FAIL"
logger.info(f"[{status}] {pillar.upper()} - {config['table']}")
except Exception as e:
logger.error(f"Check failed for {pillar}/{config['table']}: {e}")
results.append({
"pillar": pillar, "table": config.get("table"),
"passed": False, "reason": f"Check error: {str(e)}",
})
return results
Setting Effective Alert Thresholds
One of the hardest problems in data observability is not implementing the checks - it is calibrating the thresholds. An alert threshold that is too tight generates so many false positives that engineers stop paying attention. A threshold that is too loose misses real problems.
The standard starting points are:
- Freshness: set SLA based on pipeline schedule plus a 2x buffer. If a pipeline runs every hour, set freshness SLA to 2 hours.
- Volume: start with ±25% of rolling 7-day average. Tighten to ±15% once you have enough baseline data to understand seasonality.
- Distribution: use a KS test p-value threshold of 0.05 for the initial alert, but require 3 consecutive failures before paging (reduces false positives from random daily variation).
These are starting points, not fixed rules. Every table has different volatility characteristics. High-volume transactional tables need tighter thresholds. Low-volume reference tables are more sensitive to percentage changes but less likely to have daily variation.
Why All Five Pillars Are Required
A common mistake is to instrument only one or two pillars and declare victory. In practice, each pillar catches failure modes the others miss.
| Failure mode | Freshness | Volume | Schema | Distribution | Lineage |
|---|---|---|---|---|---|
| Pipeline didn't run | YES | YES | No | No | context |
| New data source backfill | No | YES | Maybe | YES | context |
| Column removed | No | No | YES | Yes (nulls) | context |
| Timezone shift | No | No | No | YES | context |
| Upstream join broken | No | YES | No | YES (nulls) | context |
| Silent filter bug | No | YES | No | Yes | context |
The timezone shift in our opening example showed up only in distribution. The missing partition scenario shows up in volume but not freshness (the table is still being updated - just missing a day's worth of data). Without all five pillars, you will always have blind spots.
:::danger Schema monitoring alone is not enough Many teams start data observability by monitoring schema changes. This is necessary but deeply insufficient. Schema monitoring detects structural changes to a table's definition. It does not detect data that is stale, data at the wrong volume, or data whose values have shifted. A table can have a perfectly stable schema and still be completely wrong. :::
:::warning The null rate trap Null rate increases are often the first signal of a distribution problem - and they are often ignored because "nulls are normal." Train yourself to treat unexpected null rate increases as P1 alerts. A column that was 0.5% null jumping to 30% null means a join broke, a source system stopped populating a field, or a transformation has a bug. This kind of signal should never be silently absorbed. :::
Production Engineering Notes
In production, each check should be a separate Airflow task that runs at the end of each pipeline DAG. This allows you to block downstream tasks on quality failures - if the volume check fails, do not proceed to the feature engineering DAG that reads from this table.
Store check results in a dedicated data_observability.check_results table. This enables you to trend checks over time, build Grafana dashboards, and retrospectively investigate when a quality issue started. Storing only the current state (pass/fail) is not enough - you need the full time series.
For large data estates, the monitoring queries themselves can be expensive. Mitigate this by:
- Running checks on partition columns with
WHEREfilters rather than full table scans - Using approximate aggregation functions (
APPROX_COUNT_DISTINCT,APPROX_PERCENTILE) for distribution checks - Scheduling intensive distribution checks less frequently (daily) than freshness checks (hourly)
Interview Q&A
Q: What are the five pillars of data observability, and why is each one necessary?
A: The five pillars are freshness (is data current?), volume (is row count as expected?), schema (have table structure changes occurred?), distribution (do values still look normal?), and lineage (what is the data's origin and what does it feed?). Each pillar catches a distinct class of failure. Freshness catches pipeline failures and ingestion lag. Volume catches backfills, truncations, and filter bugs. Schema catches structural changes that break downstream consumers. Distribution catches value-level changes including feature drift for ML models - the most insidious failure because it is silent. Lineage contextualizes alerts by showing impact scope. You need all five because each class of failure is invisible to the others.
Q: How do you decide on alert thresholds for volume monitoring?
A: I use rolling window baselines rather than static thresholds. I compute the mean and standard deviation of daily row counts over the past 14 days and alert when today's count deviates by more than 2-3 standard deviations or more than 20-25% from the rolling mean. For tables with strong weekly seasonality, I compare to the same day of the week in prior weeks rather than a simple rolling mean. Initial thresholds are starting points that I refine over the first few weeks based on false positive rates - if a check fires more than twice a week on a healthy table, the threshold needs to be relaxed.
Q: A data scientist reports that a feature in the training dataset looks different from the same feature in production. Walk me through how you would diagnose this.
A: This is a training-serving skew problem. I would start with the distribution pillar: run a KS test comparing the feature distribution in the training dataset sample against recent production data. If the distributions differ significantly, I move to lineage: trace the feature back through the transformation pipeline to find where the divergence occurs. Common culprits are: different SQL filters applied at training vs. serving time, a time-based column computed differently in batch vs. stream processing, or the training data being captured at a different point in the pipeline than the serving data. I would also check the schema pillar - did the source column change type between when training data was extracted and now?
Q: What is the difference between data quality checks and data observability?
A: Data quality checks answer the question "is this specific thing wrong?" - they are rule-based assertions that you write in advance, like "revenue must be positive" or "user_id must be non-null." Data observability answers the question "does the overall behavior of this data match what I would expect?" - it uses historical baselines and anomaly detection to catch unknown unknowns. Quality checks catch the failure modes you anticipated. Observability catches the ones you did not. In production, you need both: quality checks for known business rules, observability for unexpected change detection.
Q: How would you instrument data observability in an Airflow-orchestrated pipeline?
A: I add observability checks as explicit Airflow tasks at two points in the DAG. At ingestion completion, I run freshness and volume checks on the raw table. If either fails, I set the task state to FAILED and halt the DAG - there is no point running transformations on stale or incomplete data. At the end of each transformation layer, I run schema and distribution checks on the output table. I use Airflow's sensor pattern for upstream dependency checks and store all check results in a PostgreSQL metrics table with timestamps, so I can trend health over time. Alerts route to Slack for warnings and PagerDuty for critical failures.
Q: What is data downtime, and how is it different from service downtime?
A: Service downtime is when a system is unavailable - requests fail, the product is inaccessible. Data downtime is when data is technically available and queryable but wrong - stale, incomplete, incorrectly transformed, or otherwise unfit for the purpose it is being used for. Data downtime is more dangerous than service downtime in many ways because it is invisible. A service outage produces errors that users can see. Data downtime produces wrong numbers that users act on - wrong ML predictions, wrong dashboards, wrong business decisions. The goal of data observability is to minimize data downtime by reducing the time to detection, triage, and resolution of data quality incidents.
