The Data Engineering Landscape for AI Teams
The Pipeline That Broke the Launch
It's 9:47 PM on a Thursday when Priya's phone buzzes. The message is from the product manager: "Recommendation quality dropped 18% over the last 6 hours. Did something change on the model side?" Priya is a senior ML engineer at a company that does food delivery at the scale of hundreds of millions of orders per year. The recommendation model she owns ranks restaurants for each user. An 18% quality drop means real revenue impact - the kind that gets noticed in a Monday morning exec review.
Priya pulls up the monitoring dashboard. Model inference latency: normal. Feature retrieval: normal. The model itself hasn't been retrained in three days - that's expected. She starts digging into the feature pipeline. The "user_7d_order_history" feature, which captures what categories of food a user ordered in the last seven days, is supposed to be refreshed every four hours. The last successful run timestamp reads: 31 hours ago.
The feature pipeline failed silently. No alert fired because the alerting rule checked for job failure, not for feature staleness. The model has been serving 31-hour-old user preference data to every user. A user who ordered Thai food for the last three days is being shown Italian restaurant recommendations because the pipeline never picked up those signals. The fix takes eight minutes - rerun the pipeline, wait for feature refresh. The diagnosis took two hours.
What Priya learned that night is the lesson every ML engineer eventually learns the hard way: the model is rarely the problem. The data infrastructure surrounding the model is where production systems fail. The feature pipeline, the ingestion job, the transformation layer, the schema contract between the data team and the ML team - these are the components that determine whether a model performs in production. You can have the most sophisticated transformer architecture in the world and it will underperform a simpler model if the features feeding it are stale, corrupt, or incomplete.
This is what data engineering for AI teams is about. Not just moving data from point A to point B, but designing systems that guarantee the right data arrives at the right place at the right time with the right quality, every time, at scale. The rest of this lesson maps out the entire landscape - the roles, the tools, the architecture, and the failure modes you need to understand before going deeper.
Why This Exists
The Problem Before Modern Data Engineering
Before 2010, "data engineering" as a named discipline barely existed. Companies had data warehouses - typically Oracle or Teradata - maintained by a small team of database administrators. ETL was done with commercial tools like Informatica or SSIS. The process was slow, brittle, and expensive. A new data source meant a multi-week project to write SSIS packages, configure staging tables, and hand-test the transformation logic.
When ML teams started asking for training data, the answer from the DBA team was often: "Put in a ticket, we'll get to it in the next sprint." Training data requests meant weeks of lead time. When a data scientist wanted to iterate on features, they were blocked. When a model went to production and needed a new feature, the timeline was measured in months.
The explosion of data volume in the early 2010s - driven by mobile apps, clickstream data, and event-driven architectures - made the old approach completely unviable. Informatica couldn't process terabytes of events per day. Teradata couldn't handle the cardinality of user event logs from a company like Facebook or Airbnb. The data warehouse couldn't keep up.
The Insight That Changed Everything
The insight came from inside the companies dealing with this scale: treat data infrastructure like software infrastructure. Apply software engineering principles - version control, testing, CI/CD, modular design, observability - to data systems. This was the birth of the "data engineer" role.
At Airbnb, engineers built Airflow (open-sourced in 2015) to manage complex pipeline dependencies with code, not GUI drag-and-drop tools. At LinkedIn, engineers built Kafka (2011) to handle real-time event streams at a scale no traditional message broker could match. At Netflix, engineers built Metacat and various data catalog tools to manage schema evolution across thousands of datasets. These weren't academic innovations - they were solutions to immediate production problems.
The modern data stack emerged from this period: cloud data warehouses (Redshift, BigQuery, Snowflake) that could scale storage and compute independently, transformation tools (dbt) that brought software engineering discipline to SQL, orchestrators (Airflow, Dagster, Prefect) that managed pipeline dependencies as code, and eventually feature stores (Feast, Tecton, Hopsworks) that solved the specific problem of serving ML features reliably at low latency.
What Data Engineers Actually Do
There is a persistent misconception in ML teams that data engineers "just move data." The actual scope is far wider and the decisions are consequential.
Schema design and data modeling - Deciding how data is stored, indexed, and partitioned affects every downstream consumer. A table partitioned by user_id is fast for user-level queries but slow for time-range queries. The schema design decision made when a table is created is expensive to change later.
Ingestion architecture - How does raw data get from source systems (application databases, clickstream events, third-party APIs, IoT sensors) into the data warehouse? Change data capture (CDC) from Postgres? Fivetran connectors? Custom Kafka consumers? Each has different latency, cost, and reliability trade-offs.
Transformation pipelines - Raw data is rarely usable directly. It contains duplicates, nulls in unexpected places, schema inconsistencies, and business logic that must be applied uniformly. Data engineers write and maintain the transformation code (typically SQL in dbt, or Python in Spark) that converts raw data into clean, reliable datasets.
Data quality monitoring - Pipelines break. Upstream systems change schemas without warning. Data volumes spike or drop. Data engineers build monitoring systems that catch these anomalies before they reach ML models or production dashboards.
Feature pipeline ownership - In ML-heavy organizations, data engineers often co-own or fully own the pipelines that compute features for ML models. This includes backfilling historical features, maintaining point-in-time correctness, and ensuring feature freshness SLAs are met.
Platform and tooling - Evaluating, selecting, and maintaining the tools in the data stack. This is increasingly significant as the number of options explodes.
:::note What DE is NOT
Data engineers are not responsible for model architecture, training loops, or hyperparameter tuning. The boundary is typically: DE is responsible for the data that arrives at the training script's input. ML engineers are responsible for what happens inside the training script and beyond. In practice, this boundary is fuzzy and collaboration is constant.
:::
The Data Platform Maturity Model
Organizations evolve through predictable stages as their data platforms grow. Knowing which stage you are at clarifies what problems to focus on.
Stage 1: Notebooks and manual exports. Data scientists query production databases directly. Training data is generated by running a SQL query and downloading the result. There is no pipeline, no orchestration, no quality monitoring. This is fine at two data scientists and 10 GB of data. It breaks when the team grows to five and data volume hits 100 GB.
Stage 2: Scheduled scripts. Python scripts run on cron. Data is copied from source to S3 or a warehouse on a schedule. Failures go unnoticed. There are no tests. Schema drift breaks pipelines silently. This stage is where most early-stage startups live. The transition out of this stage is usually triggered by a data quality incident that affects a business decision.
Stage 3: Managed pipelines with orchestration. Airflow or Dagster manages pipeline dependencies. Failures page the on-call engineer. dbt tests catch quality issues. The bronze/silver/gold pattern is in place. Most data is reliable. The remaining problems are: no feature store (so training-serving skew exists), no self-serve training data (ML teams file DE tickets for each new dataset), and no lineage system (the impact of a change is unclear).
Stage 4: ML-ready data platform. A feature store provides self-serve training data with point-in-time correctness. Online feature serving is unified with the offline store. Data contracts are formalized. Lineage is tracked. New features can be added by ML engineers using the feature store SDK without DE involvement for routine requests. Backfill runs on dedicated infrastructure. This is where Uber, DoorDash, and Airbnb operate.
Most teams reading this are at Stage 2 or early Stage 3. The goal of this module is to give you a complete map of Stage 4 so you can make architectural decisions today that move you toward it rather than away from it.
Historical Context
The data engineer title became common around 2012–2014, largely driven by companies operating at internet scale. Before that, the closest role was "ETL developer" or "data warehouse engineer."
2006: Hadoop is released, making distributed batch processing accessible. MapReduce becomes the dominant paradigm for large-scale data transformation.
2008: Hive open-sourced by Facebook (it became a top-level Apache project in 2010). SQL-on-Hadoop becomes possible. Data analysts can now query petabyte-scale data without writing MapReduce jobs.
2011: Apache Kafka created at LinkedIn by Jay Kreps, Neha Narkhede, and Jun Rao to handle activity stream data. The publish-subscribe log model proves transformative.
2013: Amazon Redshift becomes generally available and is widely adopted. The first major cloud-native columnar warehouse makes it clear that on-premise data warehouses are going to be replaced.
2014: Apache Spark releases 1.0. In-memory distributed processing replaces MapReduce for interactive and iterative workloads. ML on Spark (MLlib) makes large-scale ML training feasible.
2015: Airflow open-sourced by Airbnb. Pipeline orchestration as code becomes the industry standard.
2016: dbt (data build tool) starts at RJMetrics, later spun out. SQL transformations get version control, testing, and documentation. The "analytics engineer" role emerges.
2019–2021: The feature store category emerges. Uber had already built a feature store internally as part of Michelangelo, which it described publicly in 2017. Feast (Gojek), Tecton, Hopsworks, and Databricks Feature Store follow. The ML-specific data infrastructure layer becomes a recognized product category.
2022–present: The "data lakehouse" pattern (Databricks, Apache Iceberg, Delta Lake) merges the flexibility of data lakes with the reliability of warehouses. Real-time feature computation becomes standard. Apache Iceberg's table format - open, vendor-neutral, supporting ACID transactions and time-travel queries on object storage - begins displacing proprietary warehouse storage formats for organizations that want to avoid vendor lock-in while retaining warehouse-grade reliability guarantees.
How Data Flows in an AI Organization
Understanding the full data flow from raw event to model prediction is essential context for every decision a data engineer makes. The flow has six stages.
Stage 1: Data Sources
Data originates from application databases (OLTP systems), event streams (user clickstream, IoT sensors, app logs), external APIs (weather, financial data, social media), and files (partner data dumps, legacy system exports). Each source has different characteristics: relational databases have strong schema consistency; event streams have high volume and variable schema; external APIs have rate limits and unreliable uptime.
Stage 2: Ingestion
Raw data is extracted from sources and loaded into the data platform. Two patterns dominate:
Batch ingestion - a scheduled job (hourly, daily) reads from the source and loads a batch of new records. Simple, reliable, adds latency. Tools: Fivetran, Airbyte, custom scripts.
Streaming ingestion - events are published to a message broker (Kafka, Kinesis, Pub/Sub) as they occur, and consumers write them to storage in near-real-time. Complex, low-latency. Tools: Kafka Connect, custom Flink/Spark Streaming jobs.
Stage 3: Raw Storage (Bronze Layer)
Raw data lands in its native format with no transformation - just ingestion. This is the bronze layer of the medallion architecture. It serves as an audit log and recovery point. If a transformation bug is found, bronze data is always reprocessable. Bronze data is never cleaned; it is never queried directly by analysts or models.
Stage 4: Transformation (Silver and Gold Layers)
Silver tables are cleaned, deduplicated, standardized bronze data. Nulls are handled. Types are cast. Timestamps are normalized to UTC. Business keys are validated. Silver tables have a schema contract - downstream consumers can depend on them.
Gold tables are business-level aggregates and ML features computed from silver tables. User-level features (7-day order count, lifetime value, preferred cuisine), item-level features (average rating, order volume, price tier), and context features (time of day, day of week, session length) are all computed at the gold layer.
This three-layer pattern is the medallion architecture, popularized by Databricks. It provides clear data lineage, blast radius containment (a bug in silver doesn't touch bronze), and a clear escalation path for debugging.
Stage 5: Feature Store
For ML specifically, gold-layer features are published to a feature store. The feature store serves two functions:
Offline store - historical feature values joined to training labels, ensuring point-in-time correctness. When you train a model on data from six months ago, each training example must use only the feature values that were available at that example's timestamp, never values computed later. The offline store makes this possible.
Online store - low-latency key-value store (Redis, DynamoDB, Cassandra) serving the same features at inference time. A recommendation model needs to retrieve user features in under 10 milliseconds. A database query over historical data cannot do this. The online store pre-computes and caches the features.
The feature store solves a specific problem: training-serving skew. Without it, the features computed for training (from a batch pipeline) and the features computed at serving time (from a real-time pipeline) are often subtly different, leading to models that perform well in offline evaluation and poorly in production.
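A point-in-time lookup can be illustrated with a minimal, standard-library sketch. This is not how Feast or Tecton implement it; the `FEATURE_HISTORY` structure, the `order_count_7d` feature name, and both function names are hypothetical, chosen only to show the rule "use the latest value computed at or before the label's timestamp."

```python
from __future__ import annotations

from bisect import bisect_right
from datetime import datetime

# Hypothetical feature history: (computed_at, value) pairs per user,
# sorted by computed_at. A real offline store holds this in a table.
FEATURE_HISTORY = {
    "u1": [
        (datetime(2024, 1, 1), 2),
        (datetime(2024, 1, 8), 5),
        (datetime(2024, 1, 15), 9),
    ],
}


def feature_as_of(user_id: str, label_time: datetime) -> int | None:
    """Return the latest feature value computed at or before label_time."""
    history = FEATURE_HISTORY.get(user_id, [])
    timestamps = [ts for ts, _ in history]
    # bisect_right returns the insertion point after any equal timestamps,
    # so idx - 1 is the last value computed at or before label_time.
    idx = bisect_right(timestamps, label_time)
    if idx == 0:
        return None  # no feature value existed yet at label time
    return history[idx - 1][1]


def build_training_rows(labels):
    """Join each (user_id, label_time, label) to its point-in-time feature."""
    return [
        {
            "user_id": user_id,
            "label_time": label_time,
            "label": label,
            "order_count_7d": feature_as_of(user_id, label_time),
        }
        for user_id, label_time, label in labels
    ]
```

A label dated January 10 picks up the value computed on January 8, not the one from January 15. Joining on "latest value" instead would leak future information into training, which is one common source of the offline/online gap described above.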
Stage 6: Model Training and Serving
The ML layer consumes features from the feature store. Training jobs pull from the offline store via point-in-time joins. Serving infrastructure retrieves features from the online store at request time, combines them with request-level features (what the user just clicked, the current time), runs the model, and returns a prediction.
The Modern Data Stack
Each layer of the data platform has a category of tools with different trade-offs. Here is what experienced engineers actually use and why.
Ingestion: Fivetran and Airbyte
Fivetran is a managed, SaaS connector service. You point it at a source (Salesforce, Stripe, Postgres, Google Analytics) and it handles extraction, schema inference, incremental loading, and schema change detection automatically. It is expensive per connector but requires almost zero maintenance. Used by companies that want to move fast and pay for reliability.
Airbyte is open-source with a managed cloud offering. More connectors, more flexibility, more operational overhead. Used by companies that want to avoid vendor lock-in or have custom source systems that Fivetran doesn't support.
For custom high-volume ingestion (event streams, CDC from proprietary databases), teams write custom connectors using Kafka Connect or write Python/Spark jobs.
Transformation: dbt
dbt (data build tool) transforms data inside the warehouse using SQL. Each transformation is a .sql file. dbt manages the dependency graph between transformations, runs them in the correct order, generates documentation, and runs data quality tests (row count assertions, uniqueness checks, referential integrity).
dbt brought software engineering discipline to SQL. Before dbt, transformation logic lived in stored procedures, custom Python scripts, or undocumented SSIS packages. With dbt, it lives in version-controlled .sql files with tests and lineage graphs.
dbt runs on top of the warehouse: Snowflake, BigQuery, Redshift, Databricks, or DuckDB locally. It does not move data - it transforms data that is already in the warehouse.
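The idea of running transformations in dependency order can be sketched with Python's standard `graphlib` module. This is an illustration of topological ordering in general, not dbt's actual implementation; the model names in `MODEL_DEPENDENCIES` and the `execution_order` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical models: each key maps to the set of models it selects from.
MODEL_DEPENDENCIES = {
    "bronze_events": set(),
    "bronze_users": set(),
    "silver_events": {"bronze_events"},
    "silver_users": {"bronze_users"},
    "gold_user_features": {"silver_events", "silver_users"},
}


def execution_order(dependencies: dict[str, set[str]]) -> list[str]:
    """Return one valid run order in which every model follows its upstreams.

    Raises graphlib.CycleError if the dependencies contain a cycle.
    """
    return list(TopologicalSorter(dependencies).static_order())


order = execution_order(MODEL_DEPENDENCIES)
print(order)
```

Several valid orders exist for this graph; the only guarantee is that each model appears after everything it depends on.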
Warehouse: Snowflake, BigQuery, Redshift
All three are columnar, cloud-native, MPP (massively parallel processing) warehouses. Snowflake and BigQuery separate compute from storage by design, and Redshift does so on its RA3 node types and serverless offering, meaning you can scale query capacity independently of storage capacity.
Snowflake - separates storage and compute most cleanly. Virtual warehouses (compute clusters) can be spun up and down per workload. Popular for SQL-heavy organizations. Auto-suspend reduces cost significantly.
BigQuery - serverless; no cluster management. Pricing is per query (bytes scanned). Extremely cost-effective for infrequent large queries, expensive for frequent small queries. Native integration with the Google Cloud ecosystem.
Redshift - AWS-native. Tighter integration with S3, Glue, SageMaker. Requires more operational management than BigQuery but more flexible for custom configurations.
For large-scale ML training (terabyte-plus), teams often bypass the warehouse and read directly from cloud object storage (S3, GCS) using columnar file formats (Parquet, ORC) via Spark or PyArrow.
Orchestration: Airflow and Dagster
Airflow - the incumbent. DAGs defined in Python. A massive ecosystem of operators. Mature, battle-tested, widely understood. Operationally heavy (you need to manage the scheduler, workers, metadata DB). Complex DAG dependencies can be hard to debug.
Dagster - the modern challenger. Treats data assets (not tasks) as the primary abstraction. Each pipeline step declares what data it produces and what data it depends on. Excellent observability, software-defined assets, and testing story. Gaining adoption rapidly in data teams that are tired of Airflow's operational overhead.
Prefect - focuses on developer experience. Easier local development. Hybrid execution model (cloud orchestration, local/cloud compute). Smaller community than Airflow.
Feature Store: Feast and Tecton
Feast - open-source. Supports offline (Parquet, BigQuery, Redshift) and online (Redis, DynamoDB, Firestore) stores. You define features in Python, register them, and Feast handles the materialization from offline to online. Requires significant configuration and operational ownership.
Tecton - commercial, fully managed. Feature pipeline definition, backfill, monitoring, and serving in one platform. High cost but much lower operational overhead than Feast. Used at DoorDash, Hypercell, and others at significant scale.
Batch vs Streaming vs Micro-batch
The choice of execution model is one of the most consequential architectural decisions in data engineering. It affects latency, cost, complexity, and failure modes.
Batch Processing
Data is processed in discrete chunks on a schedule. A nightly job runs at midnight and processes all events from the previous day. Simple, highly reliable, straightforward to debug. The downside is latency: data is always at least as stale as the batch interval.
When to use batch: training data for models that are retrained infrequently (daily, weekly). Historical analytics. Large-scale transformations where freshness isn't critical.
Batch processing cost model: pay for compute during the job window. Idle between runs. Cost is predictable and often low.
Streaming Processing
Events are processed as they arrive. A Kafka consumer reads each user click event as it happens, updates feature aggregates in Redis, and those aggregates are immediately available for inference. Latency measured in seconds or milliseconds.
When to use streaming: features for real-time models (fraud detection, content recommendation, dynamic pricing). Monitoring pipelines. Event-driven architectures where downstream systems must react immediately.
Streaming cost model: always-on consumers, continuous compute. More expensive than batch for the same data volume, but enables use cases batch cannot serve.
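The per-event update described in this section can be sketched without any broker or store. In the minimal example below, an in-memory dict of deques stands in for Redis, and the `SlidingWindowCounter` class is hypothetical. It assumes events arrive in timestamp order; handling late or out-of-order events is exactly what watermarks in a real stream processor are for.

```python
from collections import defaultdict, deque


class SlidingWindowCounter:
    """Per-user count of events in the trailing window_seconds.

    Assumes events for a user arrive in non-decreasing timestamp order.
    """

    def __init__(self, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # user_id -> event timestamps

    def _evict(self, user_id: str, now: float) -> None:
        # Drop timestamps that have fallen out of the trailing window.
        queue = self._events[user_id]
        while queue and queue[0] <= now - self.window_seconds:
            queue.popleft()

    def record(self, user_id: str, event_time: float) -> int:
        """Process one event and return the user's updated count."""
        self._events[user_id].append(event_time)
        self._evict(user_id, event_time)
        return len(self._events[user_id])

    def count(self, user_id: str, now: float) -> int:
        """Read the current count at inference time."""
        self._evict(user_id, now)
        return len(self._events[user_id])
```

The consumer calls `record` once per event and the serving path calls `count`, so the feature is never staler than the most recent event. The price is the always-on process and per-user state that the cost model above refers to.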
Micro-batch
A middle ground: small batches processed on a tight schedule (every 1–5 minutes). Simpler than true streaming (no watermarking complexity), lower latency than daily batch. Spark Structured Streaming operates in micro-batch mode by default.
When to use micro-batch: features that need to be fresher than daily but don't require true millisecond latency. A user's "last 30 minutes of activity" feature for a recommendation model.
The tradeoff is operational cost: micro-batch adds scheduling and state-handling complexity without the per-event latency of true streaming, and plain scheduled micro-batch jobs have no built-in support for event-time windows that span batches. Many teams find themselves eventually migrating micro-batch to true streaming as latency requirements tighten.
Choosing the Right Execution Model: A Framework
The right question is not "batch or streaming?" but "what is the acceptable staleness for this feature, and what is the cost of reducing that staleness?"
| Freshness requirement | Typical solution | Example |
|---|---|---|
| Daily or slower | Batch (Spark, dbt) | Weekly churn model features |
| Every 1–4 hours | Scheduled micro-batch | Recommendation model user profile |
| Every 1–15 minutes | Micro-batch or Spark Structured Streaming | Dynamic pricing signal |
| Under 60 seconds | True streaming (Flink, Kafka Streams) | Fraud detection features |
| Under 1 second | Stream processing + low-latency store (Redis) | Real-time session features |
The cost curve is nonlinear. Moving from daily to hourly batch costs only marginally more compute: 24x more runs, each processing roughly 1/24 of the data. Moving from hourly to sub-minute streaming requires a fundamentally different technology stack (Kafka, Flink, stateful stream processing), a different operational model (always-on consumers, checkpoint management, state backends), and engineers with streaming expertise. That second jump is a step change in cost and complexity, not an incremental one.
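The freshness table can be read as a simple lookup from acceptable staleness to execution model. The sketch below encodes it as a function; the name `recommend_execution_model` is hypothetical, and the table leaves gaps (between 15 minutes and 1 hour, and between 4 hours and daily) that are assigned to scheduled micro-batch here purely as an assumption.

```python
def recommend_execution_model(max_staleness_seconds: float) -> str:
    """Map an acceptable feature staleness to a typical execution model.

    Thresholds follow the freshness table; gaps the table leaves open are
    assigned to scheduled micro-batch as an assumption of this sketch.
    """
    if max_staleness_seconds <= 0:
        raise ValueError("staleness budget must be positive")
    if max_staleness_seconds < 1:
        return "stream processing + low-latency store"
    if max_staleness_seconds < 60:
        return "true streaming"
    if max_staleness_seconds <= 15 * 60:
        return "micro-batch or Spark Structured Streaming"
    if max_staleness_seconds < 24 * 60 * 60:
        return "scheduled micro-batch"
    return "batch"


print(recommend_execution_model(30))           # fraud-style budget
print(recommend_execution_model(2 * 60 * 60))  # user-profile-style budget
```

The function only answers the first half of the question. Whether the business value justifies the cost of the recommended tier still has to be judged per feature.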
:::tip The most common production pattern
Most companies at mid-scale (think: Series B to large Series D) use a hybrid: batch pipelines for the majority of features (daily or hourly), with one or two streaming pipelines for the small set of features with genuinely tight latency requirements (fraud signals, real-time session state). The streaming minority is the most expensive to build and operate. Quantify the business value before committing to streaming for every feature.
:::
The ML Data Lifecycle: Bronze, Silver, Gold
The medallion architecture gives every dataset a clear tier with defined quality guarantees.
Bronze (raw) - exact copy of source data, no modifications. Schema matches the source system. Every ingested record is here. If a source system has a bug that produces malformed records, bronze contains those malformed records. Nothing is dropped. This is the recovery point for the entire platform.
Silver (cleaned) - transformations applied: deduplication, null handling, type casting, timestamp normalization, business key validation, PII masking. Silver tables have a published schema. Downstream consumers write against silver schemas. Breaking schema changes require a versioning strategy.
Gold (aggregated/feature-ready) - business-level aggregates and ML features. A user_features gold table might have columns: user_id, 7d_order_count, 30d_total_spend, preferred_cuisine_category, last_active_timestamp. Gold tables are typically smaller than silver (aggregation reduces rows) but are the most expensive to compute (complex joins, window functions).
The key insight of the medallion architecture is blast radius containment. A bug in a silver transformation can corrupt silver and gold tables, but bronze is intact - you can reprocess silver from bronze when the bug is fixed. A bug in a bronze ingestion job means you need to re-ingest from source, which is more expensive.
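A silver-to-gold step can be sketched in a few lines of plain Python. The example below is illustrative only: it assumes silver rows are dicts with the `user_id`, `event_type`, `event_timestamp`, `amount_usd`, and `is_valid` fields, and the `build_user_features` function is hypothetical. A production job would normally express the same logic in SQL or Spark.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


def build_user_features(silver_events, as_of):
    """Aggregate valid order events into per-user gold features as of a time."""
    features = defaultdict(lambda: {"7d_order_count": 0, "30d_total_spend": 0.0})
    for event in silver_events:
        # Gold is built only from rows that passed silver validation.
        if not event["is_valid"] or event["event_type"] != "order_placed":
            continue
        age = as_of - event["event_timestamp"]
        # Ignore events after as_of (no future data) or older than 30 days.
        if age < timedelta(0) or age > timedelta(days=30):
            continue
        row = features[event["user_id"]]
        row["30d_total_spend"] += event["amount_usd"] or 0.0
        if age <= timedelta(days=7):
            row["7d_order_count"] += 1
    return dict(features)


as_of = datetime(2024, 3, 10, tzinfo=timezone.utc)
events = [
    {"user_id": "u1", "event_type": "order_placed", "is_valid": True,
     "event_timestamp": as_of - timedelta(days=1), "amount_usd": 20.0},
    {"user_id": "u1", "event_type": "order_placed", "is_valid": True,
     "event_timestamp": as_of - timedelta(days=10), "amount_usd": 15.5},
    {"user_id": "u1", "event_type": "order_placed", "is_valid": False,
     "event_timestamp": as_of - timedelta(days=2), "amount_usd": 99.0},
]
features = build_user_features(events, as_of)
print(features)
```

Because the function takes `as_of` as an argument, the same code can be rerun for historical dates, which is what makes backfills from silver possible after a bug fix.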
Data Contracts and SLAs
Data contracts are the formal agreements between the team producing a dataset and the teams consuming it. Without them, ML pipelines break silently when data engineers change upstream tables.
A complete data contract specifies:
- Schema - column names, types, nullability. Adding a column is non-breaking; renaming or removing a column is breaking.
- Freshness SLA - "this table will be updated by 6 AM UTC daily." If it isn't, consuming pipelines know to alert.
- Volume range - "this table typically has 1–3 million rows per day." A day with 50,000 rows or 50 million rows is anomalous and should trigger a data quality alert.
- Value distributions - "the event_type column contains only these valid values." Unexpected values indicate an upstream schema change.
- Ownership - who to page when the contract is violated.
:::tip Why contracts matter for ML
ML training pipelines are especially sensitive to silent data quality issues. A model trained on six months of data where one month silently contains a schema bug will have degraded performance that is hard to diagnose. With a data contract, the anomaly is caught at ingestion and the training pipeline fails loudly rather than silently producing a worse model.
:::
Implementing Contracts with dbt Tests
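The contract fields listed above can be expressed as a small check that a consuming pipeline runs before it reads a table. This is a minimal sketch, not a standard contract format: the `DataContract` class, the `check_contract` function, and the violation labels are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class DataContract:
    required_columns: frozenset      # schema
    max_staleness: timedelta         # freshness SLA
    min_rows: int                    # volume range, lower bound
    max_rows: int                    # volume range, upper bound
    allowed_event_types: frozenset   # value distribution
    owner: str                       # who to page on violation


def check_contract(contract, rows, last_updated, now):
    """Return the list of violated contract clauses (empty means healthy)."""
    violations = []
    if now - last_updated > contract.max_staleness:
        violations.append("freshness")
    if not contract.min_rows <= len(rows) <= contract.max_rows:
        violations.append("volume")
    if any(not contract.required_columns.issubset(row) for row in rows):
        violations.append("schema")
    if any(row.get("event_type") not in contract.allowed_event_types for row in rows):
        violations.append("values")
    return violations
```

A consuming job would typically raise when the returned list is non-empty and route the alert to `contract.owner`. Note that the freshness clause checks data age directly, so it fires even when the producing job never reported a failure.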
dbt provides built-in tests that encode contract requirements as code:
# models/silver/users.yml
version: 2

models:
  - name: silver_users
    description: Cleaned user records from OLTP source
    columns:
      - name: user_id
        description: Primary key
        tests:
          - unique
          - not_null
      - name: email
        tests:
          - not_null
      - name: created_at
        tests:
          - not_null
    # Model-level test; requires the dbt_utils package
    tests:
      - dbt_utils.recency:
          datepart: hour
          field: created_at
          interval: 2
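These tests run when you invoke dbt test (or dbt build, which runs each model's tests right after building it). If uniqueness is violated, or the most recent created_at value is more than 2 hours old, the test fails and the command exits with an error, which the orchestrator can turn into an alert.
Code Example: A Raw-to-Silver Pipeline in Python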
This pipeline reads raw user events from a Parquet file (simulating bronze layer data), applies silver-layer transformations, and writes the result to a new Parquet file. This is the simplest form of a medallion pipeline.
"""
raw_to_silver.py
Simulates a bronze -> silver transformation pipeline.
Reads raw user events (bronze), applies cleaning transforms,
writes cleaned data (silver).
"""
from __future__ import annotations  # allows "X | None" hints on Python < 3.10

import logging
from datetime import datetime, timezone
from pathlib import Path

import pyarrow as pa
import pyarrow.parquet as pq

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(name)s - %(message)s",
)
logger = logging.getLogger(__name__)

# ------------------------------------------------------------------
# Schema definitions
# ------------------------------------------------------------------
BRONZE_SCHEMA = pa.schema([
    pa.field("event_id", pa.string()),
    pa.field("user_id", pa.string()),
    pa.field("event_type", pa.string()),
    pa.field("timestamp_raw", pa.string()),  # string from source
    pa.field("amount_cents", pa.int64()),
    pa.field("currency", pa.string()),
    pa.field("metadata", pa.string()),  # raw JSON blob
])

SILVER_SCHEMA = pa.schema([
    pa.field("event_id", pa.string()),
    pa.field("user_id", pa.string()),
    pa.field("event_type", pa.string()),
    pa.field("event_timestamp", pa.timestamp("us", tz="UTC")),  # normalized
    pa.field("amount_usd", pa.float64()),  # normalized to USD
    pa.field("is_valid", pa.bool_()),
])

VALID_EVENT_TYPES = {"order_placed", "order_cancelled", "payment_completed"}
# Static illustrative FX rates; a real pipeline would look up dated rates
USD_FX = {"USD": 1.0, "EUR": 1.09, "GBP": 1.27, "CAD": 0.74}

# ------------------------------------------------------------------
# Transformation functions
# ------------------------------------------------------------------
def parse_timestamp(ts_raw: str | None) -> datetime | None:
    """
    Parse a timestamp string from bronze. Bronze may contain multiple
    formats from different upstream services. Return None for null or
    unparseable values - they will be marked invalid.
    """
    if not ts_raw:
        return None
    formats = [
        "%Y-%m-%dT%H:%M:%S.%fZ",
        "%Y-%m-%dT%H:%M:%SZ",
        "%Y-%m-%d %H:%M:%S",
    ]
    for fmt in formats:
        try:
            dt = datetime.strptime(ts_raw, fmt)
            # Source timestamps are assumed to already be in UTC
            return dt.replace(tzinfo=timezone.utc)
        except ValueError:
            continue
    return None


def normalize_amount(amount_cents: int | None, currency: str | None) -> float | None:
    """
    Convert amount_cents in any supported currency to USD float.
    Returns None for unsupported currencies or null inputs.
    """
    if amount_cents is None or currency is None:
        return None
    currency = currency.upper().strip()
    fx_rate = USD_FX.get(currency)
    if fx_rate is None:
        return None
    return (amount_cents / 100.0) * fx_rate


def validate_event(row: dict) -> bool:
    """
    An event is valid if:
    - event_id is non-null
    - user_id is non-null
    - event_type is a known type
    - timestamp parsed successfully
    - amount, when present, is non-negative
    """
    if not row.get("event_id"):
        return False
    if not row.get("user_id"):
        return False
    if row.get("event_type") not in VALID_EVENT_TYPES:
        return False
    if row.get("event_timestamp") is None:
        return False
    if row.get("amount_usd") is not None and row["amount_usd"] < 0:
        return False
    return True


def transform_bronze_to_silver(
    bronze_table: pa.Table,
) -> tuple[pa.Table, dict]:
    """
    Apply silver-layer transformations to a bronze table.
    Returns (silver_table, quality_metrics).
    """
    rows = bronze_table.to_pylist()
    silver_rows = []
    metrics = {
        "input_count": len(rows),
        "output_count": 0,
        "invalid_timestamp": 0,
        "invalid_event_type": 0,
        "invalid_currency": 0,
        "null_user_id": 0,
    }

    for row in rows:
        # Parse timestamp (a null value is treated as unparseable)
        ts = parse_timestamp(row.get("timestamp_raw") or "")
        if ts is None:
            metrics["invalid_timestamp"] += 1

        # Normalize amount
        amount_usd = normalize_amount(row.get("amount_cents"), row.get("currency"))

        # Track missing or unsupported currencies, normalizing case and
        # whitespace the same way normalize_amount does
        currency = (row.get("currency") or "").upper().strip()
        if currency not in USD_FX:
            metrics["invalid_currency"] += 1

        # Track null user IDs
        if not row.get("user_id"):
            metrics["null_user_id"] += 1

        # Track invalid event types
        if row.get("event_type") not in VALID_EVENT_TYPES:
            metrics["invalid_event_type"] += 1

        silver_row = {
            "event_id": row.get("event_id"),
            "user_id": row.get("user_id"),
            "event_type": row.get("event_type"),
            "event_timestamp": ts,
            "amount_usd": amount_usd,
            "is_valid": None,  # filled below
        }
        silver_row["is_valid"] = validate_event(silver_row)
        silver_rows.append(silver_row)

    # Build PyArrow table from dicts; invalid rows are kept and flagged
    silver_table = pa.Table.from_pylist(silver_rows, schema=SILVER_SCHEMA)

    metrics["output_count"] = len(silver_rows)
    metrics["valid_count"] = sum(1 for r in silver_rows if r["is_valid"])
    metrics["invalid_count"] = metrics["output_count"] - metrics["valid_count"]
    return silver_table, metrics

# ------------------------------------------------------------------
# Pipeline entry point
# ------------------------------------------------------------------
def run_pipeline(bronze_path: Path, silver_path: Path) -> None:
    logger.info(f"Starting bronze->silver transform: {bronze_path}")

    # Read bronze
    bronze_table = pq.read_table(bronze_path, schema=BRONZE_SCHEMA)
    logger.info(f"Read {len(bronze_table)} rows from bronze layer")

    # Transform
    silver_table, metrics = transform_bronze_to_silver(bronze_table)

    # Log quality metrics
    logger.info(f"Quality metrics: {metrics}")
    invalid_pct = metrics["invalid_count"] / max(metrics["output_count"], 1) * 100
    if invalid_pct > 5.0:
        logger.warning(
            f"High invalid rate: {invalid_pct:.1f}% of records failed validation. "
            f"Check upstream schema for changes."
        )

    # Write silver - one file per run date so downstream readers can select by date
    silver_path.mkdir(parents=True, exist_ok=True)
    run_date = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    output_file = silver_path / f"events_date={run_date}.parquet"
    pq.write_table(silver_table, output_file, compression="snappy")
    logger.info(f"Wrote {len(silver_table)} rows to {output_file}")


if __name__ == "__main__":
    run_pipeline(
        bronze_path=Path("data/bronze/events/"),
        silver_path=Path("data/silver/events/"),
    )
This pipeline demonstrates several important patterns: explicit schema definitions at each layer, field-level validation with quality metric tracking, alerting when the invalid rate exceeds a threshold, and output partitioned by date for efficient downstream queries.
When DE Becomes the ML Bottleneck
In early-stage ML organizations, the bottleneck is usually modeling: not enough data science expertise, too few labeled examples, poorly specified problems. As the organization matures and modeling capability grows, the bottleneck shifts. Data engineering becomes the constraint.
The signs that DE is the bottleneck for ML:
Feature request backlog. The ML team has a list of features they want to add to models. Each feature requires a new pipeline. The DE team's sprint capacity can't keep up. Models sit unchanged for months, not because the modeling work is blocked, but because the features aren't available.
Long training data SLAs. An ML team wants to run experiments: try different feature combinations, different time windows, different label definitions. Each experiment requires a new training dataset. If generating a training dataset takes days (because it requires a DE ticket, a pipeline build, a backfill run, and a QA review), experiment velocity is throttled. Teams that run 2-3 experiments per week cannot iterate as fast as teams that run 20-30.
Training-serving feature mismatch. When DE owns the batch feature pipeline and ML owns the serving-time feature computation, the two diverge over time. Each team makes pragmatic decisions without coordinating. Six months later, the serving features are computed differently from the training features and nobody is sure when the divergence started.
Pipeline SLA violations blocking retraining. The model retraining job is scheduled for Sunday at 2 AM. It depends on the weekly feature refresh completing by 1 AM. The feature refresh runs long (3.5 hours instead of 1.5 hours due to a data volume spike) and isn't ready when the training job starts. The training job fails. The model runs another week on stale weights. Nobody notices until a product manager asks why recommendation quality has been flat.
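One way to avoid the silent failure described above is to have the training job check feature freshness explicitly before it starts, instead of assuming the upstream refresh finished on time. The sketch below is a minimal illustration; `is_fresh`, `gate_training`, and the returned action strings are hypothetical names, not part of any scheduler's API.

```python
from datetime import datetime, timedelta


def is_fresh(last_success: datetime, max_age: timedelta, now: datetime) -> bool:
    """Return True if the last successful refresh is within the freshness SLA."""
    return now - last_success <= max_age


def gate_training(last_success: datetime, max_age: timedelta, now: datetime) -> str:
    """Decide whether the retraining job may start (hypothetical action names)."""
    if is_fresh(last_success, max_age, now):
        return "start_training"
    # Surface the problem instead of training on stale features
    # or failing without anyone being notified.
    return "alert_and_wait"
```

In an orchestrator this check would typically run as a sensor or precondition task, so a late refresh delays retraining and pages someone rather than leaving the model on old weights unnoticed.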
The DE-ML Collaboration Model That Works
The most effective organizations structure DE-ML collaboration around three principles:
Shared feature ownership with clear interfaces. Features have a declared owner (either DE or ML) and a published interface: schema, freshness SLA, and data contract. ML teams can request new features through a lightweight specification process. DE teams build and own the pipeline. Both teams sign off on the contract. Neither team changes the contract unilaterally.
Self-serve training data with guardrails. ML teams should be able to generate their own training datasets from the feature store without filing DE tickets. This requires the feature store to provide a self-serve interface (Feast's get_historical_features, Tecton's dataset API) with guardrails: point-in-time correctness is enforced automatically, data access controls prevent teams from accessing unauthorized data, and quotas prevent runaway backfill jobs from consuming the entire cluster.
Shared observability. Feature freshness, data quality metrics, and pipeline SLAs should be visible to both DE and ML teams in a shared dashboard. When a feature's freshness degrades, both teams see it simultaneously. The ML team doesn't have to file a bug report to notify DE - they see it the same moment DE does, and they understand the downstream impact (model performance degradation) that motivates urgency.
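The point-in-time guardrail mentioned under self-serve training data can be illustrated without a feature store. The sketch below is a simplified, standard-library stand-in for what a feature store's historical retrieval does: for each label, it picks the latest feature value recorded at or before the label timestamp. The function names and the `orders_7d` feature are illustrative assumptions.

```python
from bisect import bisect_right
from datetime import datetime


def point_in_time_value(history: list, as_of: datetime):
    """history: list of (timestamp, value) tuples sorted by timestamp.

    Returns the latest value whose timestamp is <= as_of, or None
    if no value existed yet at that time.
    """
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, as_of)
    return history[idx - 1][1] if idx > 0 else None


def build_training_rows(labels: list, feature_history: dict) -> list:
    """labels: list of (user_id, label_timestamp, label).

    feature_history: {user_id: [(timestamp, value), ...]} sorted by timestamp.
    """
    rows = []
    for user_id, label_ts, label in labels:
        value = point_in_time_value(feature_history.get(user_id, []), label_ts)
        rows.append({"user_id": user_id, "orders_7d": value, "label": label})
    return rows
```

Because the lookup never reads a value recorded after the label timestamp, a training row cannot see information from the future, which is the property the guardrail is meant to enforce.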
:::note The analytics engineer role Some organizations solve the DE-ML interface problem by creating an "analytics engineer" or "ML data engineer" role that sits between pure data engineering and ML. This person writes dbt models for gold-layer features, owns the feature store population pipelines, and speaks fluently to both the DE team (on pipeline architecture) and the ML team (on feature semantics and training data assembly). The role is increasingly common in organizations with more than 10 ML practitioners. :::
How Uber's Data Platform Evolved to Serve ML at Scale
Uber's data platform evolution is one of the most documented and instructive examples in the industry. It went through four distinct phases, each driven by a specific scaling pain point.
Phase 1 (2014-2015): MySQL and manual exports. Early Uber used MySQL for everything. Data scientists pulled data directly from production MySQL replicas. This worked for a few analysts. It did not work when the team grew and SQL queries started impacting production database performance.
Phase 2 (2015-2016): Hadoop + Hive. Uber built a Hadoop cluster and began streaming application events to HDFS via Kafka. Data scientists queried via Hive. This solved the production database impact problem but introduced new ones: Hive queries were slow (minutes to hours), schema management was manual and error-prone, and pipeline dependencies were managed by cron jobs that silently failed.
Phase 3 (2016-2018): Michelangelo. Uber built Michelangelo, their internal ML platform. The core insight was that ML teams were spending 80% of their time on data engineering work - feature computation, training data assembly, serving infrastructure - and only 20% on modeling. Michelangelo standardized the feature pipeline: a unified interface for defining features, computing them on historical data, and serving them at low latency. It included a feature store, a training pipeline, and a serving layer. Feature reuse across teams became possible: a feature computed by the fraud team could be used by the pricing team.
Phase 4 (2019-present): Unified lakehouse. Uber moved to a lakehouse architecture built on Apache Hudi (a table format that originated at Uber), enabling ACID transactions on large datasets, incremental processing (process only changed records, not full datasets), and a unified storage layer for batch and streaming. Hudi's "upsert" capability - update or insert a record based on a primary key - was critical for handling late-arriving events and corrections without full table rewrites.
The lesson from Uber's evolution: every generation of data infrastructure solves the current bottleneck and creates the next one. The right answer is not a single architecture but a commitment to evolving the platform as requirements change. The teams that do this well treat their data platform as a product with internal customers (ML engineers, analysts, product managers), maintain a roadmap, and allocate headcount to platform work rather than treating it as overhead. The teams that do it poorly rebuild from scratch every two years when the accumulated technical debt becomes unmanageable - each rebuild expensive, disruptive, and only slightly better than what it replaced.
One more pattern Uber's evolution illustrates: the feature store is a forcing function for organizational alignment. Before Michelangelo, different ML teams at Uber computed the same features independently, with different logic, different freshness, and different quality guarantees. After Michelangelo, feature definitions became canonical. A feature computed by the fraud team was available to the pricing team with no recomputation cost. This reuse is the primary economic argument for the feature store investment - the amortized cost of a feature drops with each additional team that uses it.
dbt in Practice: Transformation as Code
To understand why dbt changed data engineering, look at what a transformation pipeline looks like with and without it.
Without dbt (2015 era): A Python script runs a sequence of SQL statements, writes results to staging tables, then swaps them into production. The script lives on one engineer's laptop. It runs via cron. There are no tests. When it fails, it fails silently. When an analyst asks "how is this metric computed?", the answer is "look at the Python script if you can find it."
With dbt (2024): Transformation logic lives in .sql files in a git repository. dbt manages the dependency graph - if silver_orders depends on bronze_events, dbt runs them in the correct order. Tests are co-located with the transformation: unique, not_null, accepted_values, and custom SQL tests run after every transformation. Documentation is generated from the project itself: column descriptions live next to the models, and every dependency is visible in a lineage graph.
Here is a realistic dbt model for computing user-level features (gold layer) from cleaned events (silver layer):
-- models/gold/user_features.sql
-- Gold-layer user feature table for the recommendation model.
-- Depends on: silver_orders, silver_users
-- Refreshed: every 4 hours (see dbt job configuration)
-- Note: DATEADD/DATEDIFF, FILTER, and ::float are not all available in every
-- warehouse dialect; adapt the date and filter syntax to your adapter.
{{
    config(
        materialized='incremental',
        unique_key='user_id',
        on_schema_change='fail'
    )
}}

WITH base_orders AS (
    SELECT
        user_id,
        order_id,
        order_date,
        total_amount_usd,
        cuisine_category,
        is_cancelled,
        created_at
    FROM {{ ref('silver_orders') }}
    WHERE is_valid = true
    {% if is_incremental() %}
        -- On incremental runs, only process new/updated orders
        -- that fall within the feature lookback window
        AND created_at >= (
            SELECT DATEADD(day, -7, MAX(feature_computed_at))
            FROM {{ this }}
        )
    {% endif %}
),

order_stats AS (
    SELECT
        user_id,
        COUNT(*) AS total_order_count,
        COUNT(*) FILTER (WHERE NOT is_cancelled) AS completed_order_count,
        COUNT(*) FILTER (WHERE is_cancelled) AS cancelled_order_count,
        -- 7-day rolling counts
        COUNT(*) FILTER (
            WHERE order_date >= CURRENT_DATE - INTERVAL '7 days'
            AND NOT is_cancelled
        ) AS orders_7d,
        -- 30-day rolling counts
        COUNT(*) FILTER (
            WHERE order_date >= CURRENT_DATE - INTERVAL '30 days'
            AND NOT is_cancelled
        ) AS orders_30d,
        -- Spend metrics (FILTER requires the WHERE keyword)
        SUM(total_amount_usd) FILTER (WHERE NOT is_cancelled) AS lifetime_spend_usd,
        AVG(total_amount_usd) FILTER (WHERE NOT is_cancelled) AS avg_order_value_usd,
        MAX(total_amount_usd) AS max_order_value_usd,
        -- Recency
        MAX(order_date) FILTER (WHERE NOT is_cancelled) AS last_order_date,
        DATEDIFF(day, MAX(order_date), CURRENT_DATE) AS days_since_last_order,
        -- Cuisine preferences (most common in last 90 days)
        MODE() WITHIN GROUP (ORDER BY cuisine_category)
            FILTER (
                WHERE order_date >= CURRENT_DATE - INTERVAL '90 days'
                AND NOT is_cancelled
            ) AS preferred_cuisine_90d
    FROM base_orders
    GROUP BY user_id
)

SELECT
    os.user_id,
    os.total_order_count,
    os.completed_order_count,
    os.cancelled_order_count,
    ROUND(
        os.cancelled_order_count::float / NULLIF(os.total_order_count, 0),
        4
    ) AS cancellation_rate,
    os.orders_7d,
    os.orders_30d,
    os.lifetime_spend_usd,
    os.avg_order_value_usd,
    os.max_order_value_usd,
    os.last_order_date,
    os.days_since_last_order,
    os.preferred_cuisine_90d,
    -- Derived flags for model bucketing
    CASE
        WHEN os.lifetime_spend_usd >= 500 THEN 'high_value'
        WHEN os.lifetime_spend_usd >= 100 THEN 'mid_value'
        ELSE 'low_value'
    END AS user_value_tier,
    CASE
        WHEN os.days_since_last_order <= 7 THEN 'active'
        WHEN os.days_since_last_order <= 30 THEN 'lapsing'
        ELSE 'churned'
    END AS engagement_status,
    CURRENT_TIMESTAMP AS feature_computed_at
FROM order_stats os
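The {{ config(materialized='incremental') }} block tells dbt to insert or update rows in the existing table (matched on unique_key) rather than rebuilding the table from scratch on every run. The {% if is_incremental() %} block adds a filter that limits the source data to the lookback window on incremental runs. On the first run, or on a full refresh, the filter is omitted and the entire history is processed. One caveat: because the incremental filter narrows base_orders to roughly the last seven days, aggregates that span longer windows (lifetime totals, the 30-day count, the 90-day cuisine preference) are computed from partial history on incremental runs and are only exact after a full refresh. If those columns must stay exact between full refreshes, restrict the incremental filter to users with new activity and recompute those users from their full history.
A common production pattern is a full refresh weekly (or whenever the logic changes), with incremental runs every few hours to update the feature table efficiently.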
# models/gold/user_features.yml
# Schema tests co-located with the model definition.
version: 2

models:
  - name: user_features
    description: >
      Gold-layer user feature table. Refreshed every 4 hours.
      Consumed by the recommendation model (offline store) and
      the feature materialization job (online store → Redis).
    tests:
      # dbt_utils.recency is a model-level test, so it is declared here
      # rather than under the feature_computed_at column.
      - dbt_utils.recency:
          datepart: hour
          field: feature_computed_at
          interval: 6  # Alert if table is more than 6 hours stale
    columns:
      - name: user_id
        description: Unique user identifier (FK to silver_users)
        tests:
          - unique
          - not_null
      - name: orders_7d
        description: Completed orders in the last 7 days
        tests:
          - not_null
          - dbt_utils.expression_is_true:
              expression: ">= 0"
      - name: user_value_tier
        tests:
          - accepted_values:
              values: ['high_value', 'mid_value', 'low_value']
      - name: engagement_status
        tests:
          - accepted_values:
              values: ['active', 'lapsing', 'churned']
      - name: feature_computed_at
        tests:
          - not_null
When dbt test runs after each pipeline execution, every assertion above is validated. A uniqueness violation on user_id, for example, means duplicate users were introduced. That is a data bug that would silently corrupt model training, and the test catches it before the feature table is used.
YouTube Resources
| Title | Channel | Why Watch |
|---|---|---|
| The Modern Data Stack Explained | Seattle Data Guy | Clear walkthrough of how Fivetran, dbt, Snowflake, and Airflow connect |
| Michelangelo: Uber's ML Platform | InfoQ | Direct walkthrough of how Uber built a feature store and training pipeline |
| Data Engineering Overview 2024 | Andreas Kretz | DE role landscape, what changed, what skills matter |
| dbt Fundamentals Course | dbt Labs | How dbt transforms raw data into clean models with testing |
| Feature Stores Explained | Chip Huyen | Why feature stores exist, what they solve, when you need one |
Production Engineering Notes
Schema drift is the most common silent failure mode. An upstream service deploys a new version and renames a field - user_id becomes userId, or amount changes from integer cents to float dollars. Your ingestion pipeline either silently ignores the field (null propagation) or fails. Neither is acceptable without monitoring. Implement schema validation at ingestion: compare the incoming schema against the expected schema and alert immediately on mismatch.
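A minimal sketch of the schema check described above, using only the standard library. `EXPECTED_SCHEMA` and `diff_schema` are hypothetical names, and a real pipeline would more likely compare against a schema registry or a typed table schema than a plain dict of Python types.

```python
# Expected field names and Python types for one incoming record (assumed for illustration).
EXPECTED_SCHEMA = {"user_id": str, "amount": int, "event_type": str}


def diff_schema(record: dict, expected: dict) -> dict:
    """Compare one incoming record against the expected schema.

    Returns the fields that are missing, unexpected, or of the wrong type.
    None values are treated as nulls, not type errors.
    """
    missing = sorted(set(expected) - set(record))
    unexpected = sorted(set(record) - set(expected))
    wrong_type = sorted(
        name
        for name, expected_type in expected.items()
        if name in record
        and record[name] is not None
        and not isinstance(record[name], expected_type)
    )
    return {"missing": missing, "unexpected": unexpected, "wrong_type": wrong_type}


def has_drift(diff: dict) -> bool:
    """True if any category of mismatch is non-empty."""
    return any(diff.values())
```

For a record where user_id was renamed to userId and amount became a float, the diff reports user_id as missing, userId as unexpected, and amount as the wrong type, which is enough to alert before nulls propagate downstream.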
Data volume monitoring catches more bugs than schema monitoring. Schema changes are dramatic and obvious when caught. Volume anomalies are subtle but often more damaging. A pipeline that usually writes 2 million rows per day writing 200,000 rows indicates something is wrong upstream - a missing event source, a filter bug, or a silent ingestion failure. Volume monitoring with a threshold alert (e.g., "alert if today's row count is less than 50% of the 7-day average") catches a class of bugs that schema validation entirely misses.
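The threshold rule quoted above translates into a few lines of code. This is a sketch only: `volume_alert` is a hypothetical helper, and the 50% ratio is the example threshold from the text rather than a recommended default.

```python
from statistics import mean


def volume_alert(today_count: int, previous_counts: list, min_ratio: float = 0.5) -> bool:
    """Return True if today's row count is below min_ratio of the trailing average.

    previous_counts holds the daily row counts of the baseline window,
    for example the last 7 days.
    """
    if not previous_counts:
        return False  # no baseline yet, so nothing to compare against
    baseline = mean(previous_counts)
    return today_count < min_ratio * baseline
```

A fixed ratio is crude for tables with strong weekly seasonality; comparing against the same weekday in previous weeks is a common refinement.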
Columnar file format selection matters for training throughput. Training jobs that read feature data from S3 should always read Parquet, not CSV or JSON. Parquet is columnar: reading 10 columns from a 200-column dataset scans roughly 5% of the data (assuming columns of similar size). CSV is row-oriented: reading 10 columns requires scanning 100% of the data. At training scale (terabytes of feature data), this difference determines whether a training data load takes 4 minutes or 80 minutes. Use snappy compression for Parquet: it is fast to decompress (important for training loop throughput) while typically reducing file size by 50-70%.
Partition strategy must be designed with read patterns in mind. Data partitioned by date=YYYY-MM-DD is efficient for time-range queries. Data partitioned by user_id_bucket=0001 is efficient for user-level joins. Training jobs that filter by date and then join on user_id need a partition scheme that supports both efficiently - often implemented as date=YYYY-MM-DD/user_id_bucket=NNNN with the outer partition being the primary filter. Choosing the wrong partition scheme is expensive to fix: it requires a full table rewrite.
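As an illustration of the nested layout, the sketch below builds a date=YYYY-MM-DD/user_id_bucket=NNNN path for a record. `partition_path` and the default of 1024 buckets are assumptions for the example. It uses hashlib rather than Python's built-in hash() because string hashes from hash() vary between processes, which would scatter the same user across buckets.

```python
import hashlib
from datetime import date


def partition_path(event_date: date, user_id: str, num_buckets: int = 1024) -> str:
    """Build a Hive-style partition path with date as the outer partition."""
    # A stable digest keeps a given user in the same bucket across runs and machines.
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % num_buckets
    return f"date={event_date.isoformat()}/user_id_bucket={bucket:04d}"
```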
The catalog is as important as the data. A data catalog (DataHub, Amundsen, Apache Atlas) records what datasets exist, what they contain, who owns them, what their lineage is, and what depends on them. Without a catalog, engineers spend significant time asking "does a dataset for X already exist?" and "what would break if I changed this table?" The catalog pays for itself in the first incident where you need to trace a data quality issue through three levels of pipeline dependencies to find the root cause.
Partition pruning is the difference between a 30-second query and a 30-minute query. If a Parquet dataset is partitioned by date=YYYY-MM-DD, a query filtering WHERE date = '2024-01-15' reads only one partition directory, not the entire dataset. This is partition pruning. If your gold tables are not partitioned by the columns that ML training jobs filter on (typically date or entity ID), training data extraction will be slow and expensive.
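Query engines perform pruning internally from the filter predicate, but the idea can be shown with a few lines of path filtering. `prune_partitions` is a hypothetical helper that assumes Hive-style date=YYYY-MM-DD path segments.

```python
from datetime import date


def prune_partitions(paths: list, start: date, end: date) -> list:
    """Keep only paths whose date=YYYY-MM-DD segment falls within [start, end]."""
    selected = []
    for path in paths:
        for part in path.split("/"):
            if part.startswith("date="):
                partition_date = date.fromisoformat(part[len("date="):])
                if start <= partition_date <= end:
                    selected.append(path)
                break  # only the first date= segment is considered
    return selected
```

Only the selected directories are ever opened, so the cost of the read scales with the size of the requested date range rather than the size of the table.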
The fan-out problem. A widely-used gold table (say, a user_features table used by 40 different ML models) becomes a single point of failure. When it breaks - even briefly - 40 downstream pipelines break simultaneously. Solutions: separate compute from serving (write to a feature store, not directly to a table), implement circuit breakers on consuming pipelines, and track which jobs depend on which tables in a data catalog.
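A circuit breaker on a consuming pipeline can be sketched as follows. This is a deliberately small illustration: `CircuitBreaker` and `read_features` are hypothetical names, the loader callables stand in for real table reads, and a production breaker would also need a cooldown or half-open state so it can recover automatically.

```python
class CircuitBreaker:
    """Opens after a number of consecutive upstream failures."""

    def __init__(self, failure_threshold: int = 3):
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    @property
    def is_open(self) -> bool:
        return self.consecutive_failures >= self.failure_threshold

    def record(self, success: bool) -> None:
        # Any success resets the counter; each failure increments it.
        self.consecutive_failures = 0 if success else self.consecutive_failures + 1


def read_features(breaker: CircuitBreaker, load_latest, load_last_good):
    """Read the latest table, falling back to the last known-good snapshot."""
    if breaker.is_open:
        # Stop hitting the broken upstream table; serve the snapshot instead.
        return load_last_good()
    try:
        result = load_latest()
    except Exception:
        breaker.record(False)
        return load_last_good()
    breaker.record(True)
    return result
```

The design choice here is to degrade to slightly stale features rather than fail all downstream jobs at once, which limits the blast radius of a single broken gold table.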
Change data capture (CDC) is the most underused ingestion pattern. Most teams sync data from OLTP databases by running a nightly SELECT * FROM orders WHERE updated_at > last_run_time. This misses hard deletes (rows deleted from the source are never deleted from the warehouse), is unreliable for rows without an updated_at column, and adds load to the production database on every run. CDC tools like Debezium read the database's replication log (Postgres WAL, MySQL binlog) and stream every insert, update, and delete as an event to Kafka. This adds very little query load to the production database, delivers changes with low latency (often sub-second), and preserves complete fidelity, including hard deletes.
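The sketch below shows why a change log captures deletes that timestamp-based polling cannot: replaying events in log order reproduces the source table exactly. The event shape (`op`, `key`, `row`) is a simplified assumption for illustration and is not Debezium's actual event envelope.

```python
def apply_change_events(state: dict, events: list) -> dict:
    """Apply insert/update/delete events, keyed by primary key, in log order."""
    for event in events:
        op, key = event["op"], event["key"]
        if op in ("insert", "update"):
            state[key] = event["row"]
        elif op == "delete":
            # A polling query on updated_at never sees this row again,
            # so the warehouse copy would keep it forever.
            state.pop(key, None)
        else:
            raise ValueError(f"unknown op: {op}")
    return state
```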
Backfill planning is as important as forward pipeline design. When a bug in a transformation is found after three months of production data, the fix requires re-running three months of history. How long does that take? How much does it cost? Does it interfere with the live pipeline? The answer to these questions should be known before the bug occurs. Backfill capacity should be designed into the platform, not improvised after the fact. A good benchmark: your platform should be able to reprocess 30 days of history for any gold-layer feature within 4 hours using dedicated backfill infrastructure. If that benchmark cannot be met, the platform's operational resilience is insufficient for production ML.
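The planning questions above can be answered with rough arithmetic before an incident happens. The sketch below splits a backfill window into chunks and estimates wall-clock time; both helper names are hypothetical, and the estimate assumes the work parallelizes perfectly, which real clusters rarely achieve.

```python
from datetime import date, timedelta


def backfill_chunks(start: date, end: date, chunk_days: int) -> list:
    """Split the inclusive range [start, end] into consecutive date chunks."""
    chunks = []
    cursor = start
    while cursor <= end:
        chunk_end = min(cursor + timedelta(days=chunk_days - 1), end)
        chunks.append((cursor, chunk_end))
        cursor = chunk_end + timedelta(days=1)
    return chunks


def estimated_hours(num_days: int, minutes_per_day: float, parallelism: int) -> float:
    """Estimate backfill duration, assuming perfectly parallel day-level jobs."""
    return num_days * minutes_per_day / parallelism / 60
```

For example, if one day of history takes 40 minutes to reprocess and five days can run in parallel, 30 days takes about 4 hours, which is right at the benchmark stated above.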
Test data quality at every layer transition, not just at the end. A common mistake is to run data quality checks only on the final gold-layer output. By that point, a quality issue that originated in bronze has propagated through silver and corrupted multiple gold tables. Running dbt test after every layer transition - bronze ingestion, silver transformation, gold aggregation - catches issues at their origin layer, reduces the blast radius, and makes root cause analysis dramatically easier. The cost is marginal (dbt tests on a modern warehouse are fast). The benefit in reduced incident investigation time is substantial.
Common Mistakes
:::danger Allowing ML engineers direct write access to the feature store Granting ML engineers permissions to manually write feature values into the online store for debugging or experimentation. This bypasses the pipeline, breaks auditability, and creates values in the online store that do not correspond to any point in the offline store's history. The next model training run, which reads from the offline store, will use different values than the ones the model was tested against in the online store. The correct pattern: all writes to the feature store go through the pipeline. Debugging is done in a separate staging environment. :::
:::danger Training on silver instead of gold Passing silver data directly to model training. Silver data has been deduplicated and type-cast, but it has not been aggregated or joined into the feature representation the model expects. Models trained on wrong feature representations produce incorrect outputs that look plausible until you look closely. Always use the gold/feature-layer data for training, not intermediate cleaning layers. :::
:::danger Using future data in training features (target leakage) When building a training dataset for a model, joining features that were not available at the time of the label event. Example: predicting whether a user will churn, using the user's total lifetime spend - which includes spend that happened after the churn label was assigned. This inflates offline metrics and produces models that fail in production. Point-in-time correct joins in the feature store prevent this. :::
:::warning Skipping data contracts because "the team is small" A data contract feels like overhead when you are two engineers. When you are ten engineers, a broken schema causes a production outage. When you are fifty engineers, it causes data corruption that isn't discovered for weeks. Data contracts scale with teams - implement them early, even informally. :::
:::warning Using the production OLTP database as a data source for training Querying an application Postgres database directly from a training job adds load to the production database that can cause latency spikes or outages in the application. Always read from a replica, or better, from the data warehouse where the data has been ingested and processed. Never read from production OLTP for batch analytics or training. :::
Interview Q&A
Q1: Walk me through how you'd design a data pipeline to serve a recommendation model at DoorDash scale.
A well-structured answer covers all six stages of the data flow. Start with sources: OLTP (order history, restaurant catalog) and event streams (user clicks, impressions). Ingestion: use CDC (change data capture) via Debezium for OLTP changes, Kafka for events. Bronze: write raw events and CDC records to S3 Parquet, partitioned by date. Silver: dbt transformations to clean, deduplicate, and standardize - run every 30 minutes to keep data fresh. Gold/feature layer: compute user features (order history, cuisine preferences, price sensitivity), restaurant features (avg rating, delivery time, current capacity), and context features (time of day, weather). Materialize to a feature store with both offline (S3 Parquet, for training) and online (Redis, for serving) tiers. Training jobs pull from the offline store using point-in-time joins. The serving layer retrieves online features at inference time with p99 latency under 10ms.
The interviewer is listening for: awareness of point-in-time correctness, offline vs online store distinction, freshness SLA design, and the connection between pipeline design and model quality.
Q2: What is training-serving skew and how do you prevent it?
Training-serving skew is when the features used to train a model are computed differently than the features used at serving time, causing a mismatch between the distribution the model was trained on and the distribution it sees in production. Common causes: the training pipeline uses a batch join that aggregates slightly differently than the real-time serving pipeline; timezone handling is inconsistent; null values are filled with different defaults; a feature was added to training after the serving pipeline was already deployed.
Prevention: use a single feature definition that runs on both offline and online pipelines. In practice, this means defining features in the feature store using a declarative SDK (like Feast's FeatureView) and letting the store handle both historical retrieval and real-time serving. Regularly run "consistency checks" that compare offline and online feature values for the same entity at the same timestamp - significant divergence indicates skew.
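A consistency check of the kind described above can be as simple as comparing two snapshots keyed by entity and feature name. The sketch below is an assumption-laden illustration: `find_skew` and the dict-based snapshot format are hypothetical, and a real check would sample entities and align both reads to the same timestamp.

```python
import math


def values_match(a, b, rel_tol: float) -> bool:
    """Compare numbers with a tolerance; compare everything else exactly."""
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return math.isclose(a, b, rel_tol=rel_tol)
    return a == b


def find_skew(offline: dict, online: dict, rel_tol: float = 1e-6) -> list:
    """Return the (entity_id, feature_name) keys where the two stores disagree.

    offline and online map (entity_id, feature_name) -> value.
    A key missing from the online store counts as a mismatch.
    """
    mismatches = []
    for key, offline_value in offline.items():
        if key not in online or not values_match(offline_value, online[key], rel_tol):
            mismatches.append(key)
    return mismatches
```

Tracking the mismatch rate over time is more useful than any single run: a rate that jumps after a deploy points directly at the change that introduced the skew.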
Q3: What is the medallion architecture and why does it matter for ML?
The medallion architecture organizes data into three layers: bronze (raw), silver (cleaned), and gold (aggregated). Bronze is immutable and represents the source of truth for recovery. Silver has schema contracts and quality guarantees. Gold is optimized for analytics and ML consumption. For ML specifically, the medallion architecture matters because it: (1) provides a clear reprocessing path when bugs are found - reprocess from bronze, not from source; (2) separates concerns - data cleaning bugs are fixed in silver transformations without touching bronze; (3) enables data quality testing at each layer boundary; (4) makes lineage clear - you can trace a gold-layer feature back through silver to the exact bronze record.
Q4: What is a feature store and when do you need one?
A feature store is a centralized system for storing, managing, and serving ML features. It has two components: an offline store (historical feature values, typically Parquet on S3 or a warehouse) for training data assembly with point-in-time correctness, and an online store (Redis, DynamoDB) for low-latency feature retrieval at serving time. You need a feature store when: (1) multiple models use the same features - without a store, each team recomputes the same features independently, wasting compute and creating consistency issues; (2) point-in-time correctness is required - feature stores handle the time-travel join automatically; (3) online serving latency requirements are tight - sub-10ms feature retrieval requires a cache, not a database query. You probably don't need a feature store for a single model in a small organization; a well-organized set of Parquet files and a simple Redis cache may be sufficient.
Q5: How do data contracts prevent ML pipeline failures?
Data contracts are formal agreements between data producers and consumers specifying schema, freshness, volume ranges, and value distributions. Without contracts, upstream schema changes break ML pipelines silently: a renamed field becomes null in the feature table, the feature silently has zero variance, and the model's predictions degrade without any alert firing. With contracts, the anomaly is caught at ingestion: the schema validator detects the renamed field, the ingestion job fails loudly, and the alert fires immediately - before corrupted data propagates to the feature store and training jobs. For ML teams specifically, contracts also protect training data quality: a contract violation that would corrupt three months of historical features is caught in real time rather than discovered during the next model training run.
Q6: How do you handle the DE-to-ML team handoff for feature development?
The handoff should be contractual, not ad hoc. Define a process: (1) ML team specifies the feature in a feature specification document - what business concept it captures, the expected computation, the freshness requirement; (2) DE team implements the pipeline, writes tests, and publishes the feature to the feature store offline store; (3) Backfill: run the computation over historical data so the feature is available for training; (4) Online materialization: schedule the feature computation to run continuously and materialize to the online store; (5) Contract published: DE publishes an SLA for freshness and data quality. This process prevents the common failure where an ML team discovers, after months of development, that the feature they need is computationally infeasible at the freshness they assumed.
Q7: How would you measure and improve data pipeline observability?
Data pipeline observability means knowing at any moment: is every pipeline running on schedule, is the data quality acceptable, and if something is wrong, what is it and why? Most teams start with logging and alerting on job success/failure. This is necessary but insufficient - a pipeline can succeed (complete without error) and produce wrong data.
Comprehensive observability has three layers:
Operational metrics - pipeline-level health. Metrics to track: job runtime (alert on significant regression from baseline), row counts per partition (volume anomaly detection), p95/p99 latency for feature materialization to the online store, and backfill queue depth.
Data quality metrics - row-level health. For each critical table: null rate per column (alert if null rate increases from 0.1% to 5%), uniqueness violation count, referential integrity failure rate, distribution shift (track statistical properties of numeric columns - mean, stddev, percentiles - and alert on significant changes). Great Expectations and dbt tests are common tools for this layer.
Lineage and impact analysis - what depends on what. When a bronze table has a data quality issue, how many silver tables, gold tables, and ML models are affected? A data catalog with lineage (DataHub, Marquez) answers this immediately. Without it, the impact assessment is a manual process of reading pipeline code and asking engineers what they depend on. The difference between a 15-minute impact assessment and a 4-hour one, when there is a production data incident at 2 AM, is the presence or absence of lineage tracking.
Practical implementation starting point: instrument every pipeline with three metrics - row count in, row count out, and wall-clock runtime. Emit these to your monitoring system (Datadog, Prometheus, CloudWatch). Set volume-based alerts. This alone catches the majority of production pipeline failures that slip past job-level success monitoring.
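The three starting metrics can be captured with a thin wrapper around each transformation. In this sketch, `run_instrumented` and the `emit` callback are hypothetical; `emit` stands in for whatever client sends metrics to your monitoring system.

```python
import time


def run_instrumented(name: str, transform, rows: list, emit) -> list:
    """Run transform(rows) and emit row count in, row count out, and runtime."""
    started = time.perf_counter()
    output = transform(rows)
    elapsed = time.perf_counter() - started
    emit(
        {
            "pipeline": name,
            "rows_in": len(rows),
            "rows_out": len(output),
            "runtime_seconds": elapsed,
        }
    )
    return output
```

Because the wrapper emits metrics even when the job succeeds, a run that completes without error but drops most of its rows still shows up as an anomaly.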
