Data Systems for ML - The Foundation Layer

The model is the crown jewel. The data pipeline is the mine. No mine, no jewel - and most ML teams spend 80% of their time in the mine.

The Production Moment

The retrospective memo lands in your inbox at 4 PM on a Friday. It is three pages long. The ML team at the ride-sharing company had spent six months building a demand forecasting model. The model architecture was sophisticated - a temporal fusion transformer with geospatial embeddings. The offline metrics were excellent: MAPE of 8.2% on holdout data, significantly better than the 12.1% baseline.

Production performance: MAPE of 19.7%.

The post-mortem identified three data pipeline problems:

The training data was built by joining ride events to weather data on "city." But the weather API returned data at 15-minute intervals by weather station, and the join logic used the city-level average - which smoothed out the hyperlocal weather patterns that actually drove demand.
The validation set was built from the same data pipeline as the training set, using a random 80/20 split. The correct split for time-series data is temporal - train on months 1-10, validate on months 11-12. The random split leaked future patterns into training.
The feature for "average surge pricing last 24 hours" was computed at training time over the most recent data available when the training job ran, not over the 24 hours preceding each training example. At serving time, it was computed using events from the past 24 hours. Because surge pricing varies dramatically by hour of day, these two computations produced systematically different values.

Three data problems. None of them were model problems. The model was fine. The data system was broken.

This lesson is about building the data system right the first time.

The ML Data Stack

Every ML system, regardless of scale, has the same logical layers. The specific technologies differ by team size and scale, but the layers are universal.

Each layer has distinct access patterns, consistency requirements, and cost profiles. Understanding them separately is essential before designing their integration.

OLTP vs OLAP vs ML Workloads

The two classic database workload types, OLTP and OLAP, have fundamentally different requirements, and ML adds a third that is distinct from both:

OLTP (Online Transaction Processing): Low latency reads and writes for individual records. Think: user authentication, order placement, payment processing. Requires strong consistency, row-level access, high concurrency. PostgreSQL, MySQL.

OLAP (Online Analytical Processing): High-throughput scans over large datasets, aggregations, and complex queries. Think: "What was our revenue by region in Q3?" Requires column-oriented storage (to scan only the columns you need), batch operations, eventual consistency acceptable. BigQuery, Snowflake, Redshift.

ML workloads: Somewhere between both, and neither. Training requires scanning petabytes of data with complex feature engineering (OLAP-like), but also needs point-in-time correct feature retrieval (a temporal operation no OLAP system was designed for). Serving requires millisecond-latency feature reads (OLTP-like) at high throughput.

| Dimension | OLTP | OLAP | ML Training | ML Serving |
| --- | --- | --- | --- | --- |
| Access pattern | Row reads/writes | Column scans | Column scans + temporal joins | Row reads |
| Latency requirement | <10ms | Minutes/hours | Hours (batch) | <10ms |
| Data volume | GB | TB-PB | TB-PB | GB (cached) |
| Typical tech | PostgreSQL, MySQL | BigQuery, Snowflake | Spark, Dask | Redis, Cassandra |
| Consistency | Strong | Eventual OK | Temporal correctness critical | Eventual OK |

The key insight: ML systems need both OLAP-scale processing (for training) and OLTP-scale serving (for inference). This is why ML infrastructure is architecturally complex - it spans the full spectrum.

Data Lakes: The Foundation

A data lake is an object storage system (S3, GCS, Azure Data Lake Storage) that stores data in its raw format, without enforcement of schema or structure at write time. Data is organized in a directory hierarchy, typically by source system, date, and entity type.

The data lake is where everything lands first. Raw event logs, database snapshots (via CDC), API responses, model predictions, user feedback - all stored as files (Parquet, Avro, JSON, CSV).

Why data lakes for ML?

Schema flexibility: Different data sources have different schemas. A data lake accepts everything without transformation. You figure out the schema later, at read time ("schema on read").
Historical retention: ML training benefits from years of historical data. Data lakes store data cheaply ($0.023/GB/month on S3) for as long as you want.
Reprocessing capability: If you discover a bug in your feature engineering pipeline, you need to reprocess historical data. With a data lake, the raw data is always available. With a database, it might have been overwritten.
Decoupled storage and compute: You pay for storage separately from compute. A data lake with 100 TB of data costs ~$2,300/month in storage, regardless of whether you process it. You only pay for Spark compute when you run a job.
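The storage figure above is simple arithmetic, sketched here so it can be re-run for other volumes. The function name and the decimal-unit convention (1 TB = 1,000 GB) are assumptions for illustration; real bills also include request, transfer, and storage-tier charges that this ignores.

```python
# Back-of-the-envelope storage cost using the S3 price quoted in the text.
# Assumption: decimal units (1 TB = 1,000 GB) and a flat per-GB rate.
S3_STANDARD_USD_PER_GB_MONTH = 0.023


def monthly_storage_cost_usd(
    terabytes: float,
    usd_per_gb_month: float = S3_STANDARD_USD_PER_GB_MONTH,
) -> float:
    """Return the monthly storage cost in USD for a given volume in TB."""
    gigabytes = terabytes * 1_000
    return gigabytes * usd_per_gb_month


cost_100_tb = monthly_storage_cost_usd(100)
print(f"100 TB: ${cost_100_tb:,.0f}/month")  # 100 TB: $2,300/month
```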

# Organizing data in a data lake - follow the Hive partitioning convention
# This allows query engines (Spark, Athena, BigQuery) to skip irrelevant partitions

# Directory structure:
# s3://company-data-lake/
#   events/
#     year=2024/
#       month=11/
#         day=15/
#           hour=14/
#             part-00000.parquet   # ~128 MB per file
#             part-00001.parquet
#   user_profiles/
#     snapshot_date=2024-11-15/
#       part-00000.parquet
#   model_predictions/
#     model_version=v3.2/
#       year=2024/month=11/day=15/
#         part-00000.parquet

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("FeatureEngineering") \
    .config("spark.sql.sources.partitionOverwriteMode", "dynamic") \
    .getOrCreate()

# Writing with partition columns - Spark handles the directory structure
# (events_df is assumed to be an existing DataFrame with year/month/day/hour columns)
events_df.write \
    .mode("overwrite") \
    .partitionBy("year", "month", "day", "hour") \
    .parquet("s3://company-data-lake/events/")

# Reading with partition pruning - only reads relevant partitions
# Spark generates the S3 LIST only for the specified date range
yesterday_events = spark.read \
    .parquet("s3://company-data-lake/events/") \
    .filter("year = 2024 AND month = 11 AND day = 14")

# Without partitioning: Spark would scan ALL data (100 TB)
# With partitioning: Spark scans only one day (~275 GB, assuming the 100 TB
# covers about one year of evenly distributed events)
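To make the pruning idea concrete without a cluster, here is a minimal pure-Python sketch of what a query engine effectively does with a Hive-style layout: it turns a date filter into a short list of directories and never lists the rest. The helper names `partition_path` and `prune_partitions` are hypothetical, and the path layout simply mirrors the directory tree shown above.

```python
from datetime import date, timedelta


def partition_path(root: str, day: date, hour: int) -> str:
    """Build the Hive-style partition directory for one hour of events."""
    return f"{root}/year={day.year}/month={day.month}/day={day.day}/hour={hour}/"


def prune_partitions(root: str, start: date, end: date) -> list:
    """List only the hourly partition directories inside [start, end], inclusive."""
    paths = []
    day = start
    while day <= end:
        paths.extend(partition_path(root, day, hour) for hour in range(24))
        day += timedelta(days=1)
    return paths


paths = prune_partitions(
    "s3://company-data-lake/events", date(2024, 11, 14), date(2024, 11, 14)
)
print(len(paths))  # 24 hourly partitions for a single day
print(paths[14])   # s3://company-data-lake/events/year=2024/month=11/day=14/hour=14/
```

Everything outside those 24 directories is skipped before any file is opened, which is where the savings come from.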

Data Warehouses: Structured Analytics

A data warehouse is a storage system designed for fast analytical queries on structured data. Unlike data lakes (raw, unstructured, cheap), data warehouses maintain schemas, enforce data quality, and optimize storage for query performance.

Modern cloud data warehouses (BigQuery, Snowflake, Redshift) use columnar storage: data for each column is stored together, enabling extremely fast scans of individual columns without reading the entire row. For analytical queries like "compute the 30-day average transaction amount per user," this can be orders of magnitude faster than row-oriented storage, because the engine reads only the columns the query touches.
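The difference between the two layouts can be sketched in a few lines of plain Python. This is only a toy model of the access pattern, with made-up records; it says nothing about the compression and vectorized execution that real warehouses add on top.

```python
# Row-oriented layout: all fields of a record are stored together,
# which suits point lookups and updates of single records.
rows = [
    {"user_id": "u1", "amount": 20.0, "country": "DE"},
    {"user_id": "u2", "amount": 35.5, "country": "US"},
    {"user_id": "u1", "amount": 12.5, "country": "DE"},
]

# Column-oriented layout: all values of a column are stored together,
# which suits scans and aggregations over a few columns.
columns = {
    "user_id": [r["user_id"] for r in rows],
    "amount": [r["amount"] for r in rows],
    "country": [r["country"] for r in rows],
}


def avg_amount_row_store(records):
    # Iterates over whole records; a row store has to read every record
    # from storage to reach its "amount" field.
    return sum(r["amount"] for r in records) / len(records)


def avg_amount_column_store(cols):
    # Touches only the "amount" column; the other columns are never read.
    amounts = cols["amount"]
    return sum(amounts) / len(amounts)


assert avg_amount_row_store(rows) == avg_amount_column_store(columns)
```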

When to use data warehouses for ML:

Feature engineering with complex SQL: If your features require multi-table joins, window functions, and aggregations, SQL on a data warehouse is often simpler than equivalent Spark code.
Serving pre-aggregated features: For features that can be computed in batch (daily statistics, historical averages), store them in the warehouse and export to the feature store on schedule.
Data exploration: Analysts and ML engineers exploring data interactively need a query engine with low setup overhead. BigQuery's serverless model - query TB of data in seconds, pay per byte scanned - is ideal.

-- Feature engineering in BigQuery: computing 30-day user transaction features
-- This is the kind of query that would be run daily to generate training features

CREATE OR REPLACE TABLE `project.features.user_transaction_features_v1` AS
WITH
-- Step 1: Compute rolling aggregations
-- BigQuery RANGE frames need a numeric ORDER BY key, so dates are converted
-- to day numbers with UNIX_DATE and the frame bounds are plain day counts.
user_stats AS (
  SELECT
    user_id,
    DATE(event_timestamp) AS feature_date,

    -- 30-day features (the current day plus the 29 preceding days)
    COUNT(*) OVER (
      PARTITION BY user_id
      ORDER BY UNIX_DATE(DATE(event_timestamp))
      RANGE BETWEEN 29 PRECEDING AND CURRENT ROW
    ) AS tx_count_30d,

    AVG(amount) OVER (
      PARTITION BY user_id
      ORDER BY UNIX_DATE(DATE(event_timestamp))
      RANGE BETWEEN 29 PRECEDING AND CURRENT ROW
    ) AS tx_avg_amount_30d,

    -- 7-day features (the current day plus the 6 preceding days)
    COUNT(*) OVER (
      PARTITION BY user_id
      ORDER BY UNIX_DATE(DATE(event_timestamp))
      RANGE BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS tx_count_7d,

    -- Day-of-week pattern (BigQuery DAYOFWEEK: 1=Sunday, 7=Saturday)
    EXTRACT(DAYOFWEEK FROM event_timestamp) AS dow,

    -- First transaction inside the 60-day scan window below. This is only a
    -- proxy for account age, not the true signup date.
    MIN(event_timestamp) OVER (PARTITION BY user_id) AS first_tx_timestamp,

    -- Most recent transaction on an earlier day. An unbounded MAX would read
    -- transactions after feature_date and leak future information.
    MAX(event_timestamp) OVER (
      PARTITION BY user_id
      ORDER BY UNIX_DATE(DATE(event_timestamp))
      RANGE BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
    ) AS last_tx_timestamp

  FROM `project.raw.transactions`
  WHERE event_timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 60 DAY)
)

-- Every row of the same user and day carries identical window values,
-- so DISTINCT collapses them to one row per (user_id, feature_date).
SELECT DISTINCT user_id, feature_date, tx_count_30d, tx_avg_amount_30d,
       tx_count_7d, dow,
       DATE_DIFF(feature_date, DATE(first_tx_timestamp), DAY) AS account_age_days,
       DATE_DIFF(feature_date, DATE(last_tx_timestamp), DAY) AS days_since_last_tx
FROM user_stats;

-- Export to GCS for downstream use
-- (EXPORT DATA accepts exactly one wildcard in the URI)
EXPORT DATA OPTIONS(
  uri='gs://company-data-lake/features/user_tx_features/part-*.parquet',
  format='PARQUET',
  overwrite=true
) AS SELECT * FROM `project.features.user_transaction_features_v1`;

The Lakehouse Pattern

The lakehouse (Delta Lake, Apache Iceberg, Apache Hudi) is a hybrid architecture that adds data warehouse capabilities to a data lake:

ACID transactions: Atomic writes to S3/GCS, preventing partial writes and enabling consistent reads
Schema enforcement: Reject writes that don't match the defined schema
Time travel: Query the data as it existed at a past timestamp or version, as long as that version is still within the table's retention window
DML operations: UPDATE, DELETE, MERGE (upsert) on immutable object storage

# Delta Lake: ACID transactions on your data lake
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config("spark.jars.packages", "io.delta:delta-core_2.12:2.4.0") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

# Write with ACID guarantees
user_features_df.write \
    .format("delta") \
    .mode("overwrite") \
    .save("s3://company-data-lake/delta/user_features/")

# UPSERT: update existing users, insert new ones
delta_table = DeltaTable.forPath(spark, "s3://company-data-lake/delta/user_features/")

delta_table.alias("target").merge(
    new_features_df.alias("updates"),
    "target.user_id = updates.user_id"
).whenMatchedUpdateAll() \
 .whenNotMatchedInsertAll() \
 .execute()

# TIME TRAVEL: query data as it existed 7 days ago
# Essential for debugging: what features did the model see last Tuesday?
features_7_days_ago = spark.read \
    .format("delta") \
    .option("timestampAsOf", "2024-11-08") \
    .load("s3://company-data-lake/delta/user_features/")

# Also works with version numbers
features_at_version_42 = spark.read \
    .format("delta") \
    .option("versionAsOf", 42) \
    .load("s3://company-data-lake/delta/user_features/")
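Why time travel and upserts are possible on immutable storage is easier to see with a toy model. The sketch below is not how Delta Lake is implemented (it keeps a transaction log that references Parquet files rather than full copies); `ToyVersionedTable` is a hypothetical class that only illustrates the idea that every write publishes a new immutable version and readers pick a version.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyVersionedTable:
    """Append-only list of commits; each commit is a full, immutable snapshot."""

    _commits: list = field(default_factory=list)

    def commit(self, rows: dict) -> int:
        """Publish a new version in one step and return its version number."""
        self._commits.append({key: dict(value) for key, value in rows.items()})
        return len(self._commits) - 1

    def merge(self, updates: dict) -> int:
        """Upsert by key: overwrite matching keys, insert new ones."""
        latest = self.read()
        latest.update(updates)
        return self.commit(latest)

    def read(self, version_as_of: Optional[int] = None) -> dict:
        """Read the latest version, or an older one if version_as_of is given."""
        if not self._commits:
            return {}
        index = len(self._commits) - 1 if version_as_of is None else version_as_of
        return {key: dict(value) for key, value in self._commits[index].items()}


table = ToyVersionedTable()
v0 = table.commit({"user_A": {"tx_count_30d": 5}})
v1 = table.merge({"user_A": {"tx_count_30d": 6}, "user_B": {"tx_count_30d": 2}})

print(table.read(version_as_of=v0))  # {'user_A': {'tx_count_30d': 5}}
print(table.read())                  # latest version, with both users
```

Because old versions are never modified in place, a reader that started on `v0` keeps seeing a consistent table while `v1` is being written.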

Time-Series Data for ML: The Point-in-Time Problem

This is the most critical concept in ML data engineering, and the most commonly mishandled.

When building training data for an ML model, you need to recreate what the world looked like at the moment each label was generated. If you use data that was available after the label was generated, you introduce temporal leakage - the model learns from information it won't have at serving time, producing optimistic offline metrics that don't hold in production.

The formal problem: You have a table of events with timestamps. Each event has a label (e.g., "did this transaction turn out to be fraudulent?"). You want to attach features to each event. The features must be computed using only data that was available when the event occurred.

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("PointInTimeJoin").getOrCreate()

# Simulated data: transaction events (with labels) and user feature snapshots
transactions = spark.createDataFrame([
    ("tx001", "user_A", "2024-11-01 10:00:00", 1),   # fraudulent transaction
    ("tx002", "user_A", "2024-11-05 14:00:00", 0),   # legitimate transaction
    ("tx003", "user_B", "2024-11-03 09:00:00", 0),
], ["tx_id", "user_id", "event_time", "label"])

# Feature snapshots: computed daily, stored with the snapshot timestamp
feature_snapshots = spark.createDataFrame([
    ("user_A", "2024-10-31 00:00:00", 5, 250.0),    # 5 tx last 30 days, $250 avg
    ("user_A", "2024-11-02 00:00:00", 6, 300.0),    # first snapshot that includes the fraud tx
    ("user_A", "2024-11-04 00:00:00", 7, 285.0),
    ("user_B", "2024-10-31 00:00:00", 2, 120.0),
    ("user_B", "2024-11-02 00:00:00", 3, 130.0),
], ["user_id", "snapshot_time", "tx_count_30d", "avg_amount_30d"])

# Cast timestamps
transactions = transactions.withColumn("event_time", F.to_timestamp("event_time"))
feature_snapshots = feature_snapshots.withColumn(
    "snapshot_time", F.to_timestamp("snapshot_time")
)

# WRONG approach: plain join on user_id (ignores time)
wrong_result = transactions.join(feature_snapshots, on="user_id", how="left")
# This pairs every transaction with ALL of the user's snapshots, including future ones

# CORRECT: Point-in-time join
# For each transaction, find the most recent feature snapshot at or before the event

# Step 1: Join on user_id to get all (transaction, snapshot) pairs per user
joined = transactions.alias("tx").join(
    feature_snapshots.alias("fs"),
    on="user_id",
    how="left"
)

# Step 2: Keep only snapshots that were available at event time
# (this also drops transactions that have no earlier snapshot at all)
joined_filtered = joined.filter(
    F.col("fs.snapshot_time") <= F.col("tx.event_time")
)

# Step 3: Keep only the MOST RECENT snapshot before each event
window = Window.partitionBy("tx_id").orderBy(F.desc("snapshot_time"))
point_in_time_correct = (
    joined_filtered
    .withColumn("rank", F.row_number().over(window))
    .filter(F.col("rank") == 1)
    .drop("rank", "snapshot_time")
)

point_in_time_correct.orderBy("tx_id").show()
# tx001 | user_A | label=1 | tx_count_30d=5 | avg_amount_30d=250  (uses 10/31 snapshot)
# tx002 | user_A | label=0 | tx_count_30d=7 | avg_amount_30d=285  (uses 11/04 snapshot)
# tx003 | user_B | label=0 | tx_count_30d=3 | avg_amount_30d=130  (uses 11/02 snapshot)

The point-in-time join ensures that transaction tx001 (fraudulent, on Nov 1) uses features computed as of Oct 31 - the data the model would have had at serving time on Nov 1. If we had used the Nov 2 snapshot (the first one that includes the fraud transaction itself), we'd be leaking label information into the features.
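The same "latest snapshot at or before the event" rule can be sketched without Spark as a binary search over sorted snapshot times. This is a minimal illustration on hypothetical data (user `u1` and its three snapshots are made up); it is the lookup an as-of join performs per row, not a replacement for a distributed join.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical daily snapshots per user, sorted by snapshot time.
snapshots = {
    "u1": [
        (datetime(2024, 1, 1), {"tx_count_30d": 3}),
        (datetime(2024, 1, 2), {"tx_count_30d": 4}),
        (datetime(2024, 1, 5), {"tx_count_30d": 6}),
    ],
}


def features_as_of(user_id, event_time):
    """Return the latest snapshot with snapshot_time <= event_time, or None."""
    user_snapshots = snapshots.get(user_id, [])
    times = [snapshot_time for snapshot_time, _ in user_snapshots]
    # bisect_right counts how many snapshots are at or before event_time.
    count = bisect_right(times, event_time)
    return user_snapshots[count - 1][1] if count > 0 else None


print(features_as_of("u1", datetime(2024, 1, 3, 12)))  # {'tx_count_30d': 4}
print(features_as_of("u1", datetime(2023, 12, 31)))    # None: no snapshot existed yet
```

Returning `None` when no earlier snapshot exists makes the missing-history case explicit instead of silently borrowing a later value.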

Data Lineage: Knowing Where Your Features Came From

When a model's performance degrades in production, one of the first questions is: "which features changed, and why?" Without data lineage, answering this requires hours of archaeology through pipeline code. With data lineage, you can trace any feature value back through every transformation to the original source event.

from dataclasses import dataclass, field
from datetime import datetime
from typing import Any

@dataclass
class LineageNode:
    """Represents a dataset or feature in the data lineage graph."""
    name: str
    version: str
    created_at: datetime
    schema: dict[str, str]
    row_count: int
    source_nodes: list[str] = field(default_factory=list)
    transformation_code: str = ""  # git commit hash of the code that created this
    storage_path: str = ""

class DataLineageTracker:
    """
    Track the lineage of ML datasets and features.
    Integrates with Apache Atlas, OpenLineage, or Marquez for production use.
    """
    def __init__(self):
        self.nodes: dict[str, LineageNode] = {}

    def register_source(self, name: str, schema: dict, path: str) -> LineageNode:
        """Register a source dataset (raw data, no parents)."""
        node = LineageNode(
            name=name,
            version=self._generate_version(),
            created_at=datetime.now(),
            schema=schema,
            row_count=0,
            storage_path=path
        )
        self.nodes[name] = node
        return node

    def register_transformation(
        self,
        output_name: str,
        input_names: list[str],
        transformation_code_hash: str,
        output_schema: dict,
        output_row_count: int,
        output_path: str
    ) -> LineageNode:
        """Register a derived dataset and its provenance."""
        # Validate all inputs exist
        for input_name in input_names:
            if input_name not in self.nodes:
                raise ValueError(f"Input dataset not registered: {input_name}")

        node = LineageNode(
            name=output_name,
            version=self._generate_version(),
            created_at=datetime.now(),
            schema=output_schema,
            row_count=output_row_count,
            source_nodes=input_names,
            transformation_code=transformation_code_hash,
            storage_path=output_path
        )
        self.nodes[output_name] = node
        return node

    def trace_lineage(self, dataset_name: str) -> list[str]:
        """Return all ancestor datasets in topological order."""
        visited = []
        self._dfs(dataset_name, visited, set())
        return visited

    def _dfs(self, name: str, visited: list, seen: set):
        if name in seen:
            return
        seen.add(name)
        node = self.nodes.get(name)
        if node:
            for parent in node.source_nodes:
                self._dfs(parent, visited, seen)
            visited.append(name)

    def _generate_version(self) -> str:
        import uuid
        return str(uuid.uuid4())[:8]

# Usage
tracker = DataLineageTracker()

# Register raw sources
tracker.register_source(
    "raw_transactions",
    schema={"user_id": "string", "amount": "float", "timestamp": "timestamp"},
    path="s3://lake/raw/transactions/"
)
tracker.register_source(
    "raw_user_profiles",
    schema={"user_id": "string", "account_age_days": "int"},
    path="s3://lake/raw/user_profiles/"
)

# Register a derived feature dataset
tracker.register_transformation(
    output_name="user_tx_features_v1",
    input_names=["raw_transactions", "raw_user_profiles"],
    transformation_code_hash="a3f7c891",  # git commit hash
    output_schema={
        "user_id": "string",
        "tx_count_30d": "int",
        "avg_amount_30d": "float",
        "account_age_days": "int",
        "feature_date": "date"
    },
    output_row_count=50_000_000,
    output_path="s3://lake/features/user_tx_v1/"
)

# Trace lineage: what went into the training data?
lineage = tracker.trace_lineage("user_tx_features_v1")
print(f"Lineage: {' -> '.join(lineage)}")
# Lineage: raw_transactions -> raw_user_profiles -> user_tx_features_v1

Production lineage systems (Apache Atlas, OpenLineage / Marquez, DataHub, Amundsen) provide automatic lineage capture by instrumenting Spark, Airflow, and dbt, without requiring manual registration calls.

Complete Architecture: End-to-End Data Flow

Putting it all together: a production ML data architecture for a recommendation system.

The key flows:

Training path: RAW → CLEAN → SPARK → OFFLINE → TRAIN → REGISTRY. Runs daily (or on trigger). Produces versioned training datasets and model artifacts.
Serving path: ONLINE → SERVE. Runs on every user request. Must be <10ms. ONLINE store is populated by FLINK (real-time) and SPARK (batch).
No direct database queries at serving time: The ONLINE store is a pre-computed cache of features. Never query your production database in the serving path.

Common Mistakes

:::danger Querying the Production Database in the Serving Path
One of the most common and costly mistakes: fetching features by querying your production PostgreSQL or MySQL database in the hot serving path. Under load, this increases database query rate, degrades the primary database's OLTP performance, and adds unpredictable latency to model serving. Always pre-compute and cache features in a dedicated feature store (Redis, Cassandra). The production database is for your application, not for ML feature serving.
:::
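As a rough sketch of the pattern, the serving path should do nothing more than a key lookup against pre-computed values. `ToyOnlineStore` below is a hypothetical in-memory stand-in for a store such as Redis; the staleness check and the default values are illustrative design choices, not part of any particular product's API.

```python
import time


class ToyOnlineStore:
    """In-memory stand-in for an online feature store."""

    def __init__(self):
        self._data = {}

    def put(self, entity_id, features, written_at=None):
        # Called by the batch or streaming pipeline, never by the serving path.
        timestamp = time.time() if written_at is None else written_at
        self._data[entity_id] = (dict(features), timestamp)

    def get(self, entity_id, defaults, max_age_s, now=None):
        # Serving-path read: one key lookup, falling back to defaults
        # when the entity is missing or its features are too old.
        now = time.time() if now is None else now
        entry = self._data.get(entity_id)
        if entry is None or now - entry[1] > max_age_s:
            return dict(defaults)
        return dict(entry[0])


store = ToyOnlineStore()
store.put("user_A", {"tx_count_30d": 5}, written_at=100.0)

defaults = {"tx_count_30d": 0}
print(store.get("user_A", defaults, max_age_s=60, now=120.0))  # {'tx_count_30d': 5}
print(store.get("user_A", defaults, max_age_s=60, now=500.0))  # {'tx_count_30d': 0}
```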

:::danger Random Train/Test Split for Time-Series Data
Using a random 80/20 split for time-series ML data introduces temporal leakage: training examples from March end up adjacent to test examples from January. The model sees future patterns in training. Always split temporally - train on early time periods, validate on later time periods. The gap between training and validation should match the gap between your training cutoff and production deployment.
:::
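A temporal split with an optional gap can be sketched as a pair of date filters. The function name, the `event_date` field, and the monthly toy data are assumptions for illustration; the only point is that the cut is made on time, not at random.

```python
from datetime import date


def temporal_split(examples, train_end, valid_start):
    """Train on examples before train_end, validate on examples from valid_start on.

    Examples dated between train_end and valid_start form the gap and are dropped.
    """
    if valid_start < train_end:
        raise ValueError("validation window must not start before training ends")
    train = [e for e in examples if e["event_date"] < train_end]
    valid = [e for e in examples if e["event_date"] >= valid_start]
    return train, valid


# One toy example per month of 2024.
examples = [{"event_date": date(2024, month, 15), "y": month} for month in range(1, 13)]

# No gap: train on months 1-10, validate on months 11-12.
train, valid = temporal_split(examples, date(2024, 11, 1), date(2024, 11, 1))
print(len(train), len(valid))  # 10 2

# One-month gap: October is held out of both sets.
train_gap, valid_gap = temporal_split(examples, date(2024, 10, 1), date(2024, 11, 1))
print(len(train_gap), len(valid_gap))  # 9 2
```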

:::warning Using Processing Time Instead of Event Time
Features based on "the last N minutes" must be computed using event time (when the event actually occurred), not processing time (when the event arrived in the pipeline). Under queue backlog, messages can be delayed significantly. A velocity feature computed with processing time during a 30-minute backlog is wildly incorrect. Always use event timestamps, and design your stream processing to handle out-of-order events with watermarking.
:::
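The effect of a backlog on a velocity feature can be reproduced with three hypothetical events. The sketch below only contrasts the two timestamps for a "transactions in the last 5 minutes" count; it does not implement watermarking, which a stream processor would handle for you.

```python
from datetime import datetime, timedelta

# Hypothetical scenario: three events happen 10 minutes apart, but a queue
# backlog delivers all of them to the pipeline at 12:30.
events = [
    {"event_time": datetime(2024, 11, 15, 12, 0), "processed_time": datetime(2024, 11, 15, 12, 30)},
    {"event_time": datetime(2024, 11, 15, 12, 10), "processed_time": datetime(2024, 11, 15, 12, 30)},
    {"event_time": datetime(2024, 11, 15, 12, 20), "processed_time": datetime(2024, 11, 15, 12, 30)},
]


def count_in_window(events, time_key, window_end, window=timedelta(minutes=5)):
    """Count events whose chosen timestamp falls in (window_end - window, window_end]."""
    window_start = window_end - window
    return sum(1 for e in events if window_start < e[time_key] <= window_end)


now = datetime(2024, 11, 15, 12, 30)
by_processing_time = count_in_window(events, "processed_time", now)
by_event_time = count_in_window(events, "event_time", now)

print(by_processing_time)  # 3: looks like a burst of activity
print(by_event_time)       # 0: nothing actually happened in the last 5 minutes
```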

:::warning No Data Lineage for Model Features
Without data lineage, when a model degrades in production, you cannot trace the cause back to a data pipeline change. Adding lineage after the fact requires rewriting pipelines. Instrument lineage from day one using OpenLineage-compatible tools - it adds minimal overhead and is invaluable during incidents.
:::

Interview Q&A

Q1: What is the difference between a data lake and a data warehouse, and when do you use each for ML?

A data lake stores data in its raw format (files on object storage like S3) without enforcing schema at write time. It accepts any data, is cheap ($0.023/GB/month), and is ideal for raw event logs, unstructured data, and historical retention. The downside: no query optimization, no ACID guarantees, no schema enforcement.

A data warehouse (BigQuery, Snowflake) enforces schema, stores data in columnar format for fast analytical queries, and supports SQL. It's ideal for structured analytical queries - "compute the average transaction amount per user per month" - and for pre-aggregated feature engineering. The downside: more expensive, requires schema definition upfront, doesn't store unstructured data.

For ML: use the data lake for raw data storage (everything lands here first), and the warehouse for feature engineering via SQL (window functions, complex joins). The lakehouse pattern (Delta Lake, Iceberg) adds ACID transactions and time travel to the data lake, capturing the benefits of both.

Q2: What is point-in-time correct feature retrieval and why does it matter?

When building training data, each training example has a label (what happened) and features (what was known when it happened). Point-in-time correct retrieval means: for each training example, use feature values that were available at the moment the label was generated, not at the time training was run.

Without it, you get temporal leakage: features computed today contain information about events that hadn't happened when the labels were generated. For example, computing "the user's average transaction amount last 30 days" at training time (using current data) will include transactions that occurred after the fraud label event. The model learns to "predict" fraud using information that exists only in hindsight.

The practical implementation: store feature values with timestamps (as daily or hourly snapshots), and use an ASOF JOIN (or equivalent temporal join) to attach, for each training event, the most recent feature snapshot that was available before the event timestamp.

Q3: What is data lineage and why is it important for production ML systems?

Data lineage is the recorded history of where each piece of data came from, what transformations were applied to it, and what it was used to produce. For ML, it means: knowing which raw data sources went into each feature, which features went into each training dataset, which dataset was used to train which model version.

It matters in production for three reasons: First, debugging - when a model's performance degrades, you can trace the regression to a specific data pipeline change. Second, reproducibility - you can recreate any historical training dataset exactly, which is necessary for regulatory compliance and model audits. Third, impact analysis - when a data source schema changes, lineage tells you which models will be affected.

Without lineage, debugging a production ML incident becomes archaeology: reading through pipeline code to figure out what might have changed. With lineage (Apache Atlas, OpenLineage/Marquez, DataHub), it is a graph traversal.
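Impact analysis is the same graph walked in the other direction: from a source to everything derived from it. The sketch below assumes a hypothetical `parents` mapping (the dataset and model names are made up) and uses a breadth-first traversal; production lineage tools expose this as a query rather than hand-written code.

```python
from collections import deque

# Hypothetical lineage edges: each dataset or model maps to what it was derived from.
parents = {
    "user_tx_features_v1": ["raw_transactions", "raw_user_profiles"],
    "training_set_2024_11": ["user_tx_features_v1"],
    "fraud_model_v7": ["training_set_2024_11"],
}


def downstream_of(source):
    """Return everything that directly or indirectly depends on `source`."""
    # Invert the parent edges into child edges.
    children = {}
    for child, child_parents in parents.items():
        for parent in child_parents:
            children.setdefault(parent, []).append(child)

    affected, queue, seen = [], deque([source]), {source}
    while queue:
        for child in children.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                affected.append(child)
                queue.append(child)
    return affected


print(downstream_of("raw_transactions"))
# ['user_tx_features_v1', 'training_set_2024_11', 'fraud_model_v7']
```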

Q4: Explain the medallion architecture (Bronze, Silver, Gold) for ML data pipelines.

Medallion architecture organizes data in three layers of increasing quality and structure:

Bronze (Raw): Raw data exactly as received from source systems. Immutable. No transformation, no filtering. If a source sends malformed data, it lands in Bronze in its original form. Purpose: never lose the original data; enable reprocessing from scratch.

Silver (Cleaned): Data that has been validated, deduplicated, and type-cast. Schema enforced. Null handling applied. Business rules for validity (e.g., "amount must be positive") checked. Malformed records quarantined. Purpose: consistent, reliable data that all downstream consumers can trust.

Gold (Features): Feature-level datasets ready for ML training and serving. Window aggregations computed. Point-in-time joins applied. Serving schema aligned with the feature store schema. Purpose: features that can go directly into model training without additional transformation.

The three-layer approach is powerful because it separates concerns: data quality (Bronze to Silver) from feature logic (Silver to Gold). When a feature engineering bug is discovered, you reprocess from Silver, not from Bronze. When a schema change happens upstream, Silver handles the validation, protecting Gold from malformed data.
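The separation of concerns can be sketched with two small functions, one per layer transition. The records, field names, and the "amount must be positive" rule are illustrative; in practice each layer would be a table in the lake rather than a Python list.

```python
# Bronze: raw records exactly as received, including bad ones.
bronze = [
    {"tx_id": "t1", "user_id": "u1", "amount": "20.0"},
    {"tx_id": "t1", "user_id": "u1", "amount": "20.0"},  # duplicate delivery
    {"tx_id": "t2", "user_id": "u1", "amount": "-5"},    # violates amount > 0
    {"tx_id": "t3", "user_id": "u2", "amount": "oops"},  # malformed amount
    {"tx_id": "t4", "user_id": "u1", "amount": "40.0"},
]


def to_silver(raw_records):
    """Bronze -> Silver: deduplicate, type-cast, validate, quarantine bad rows."""
    clean, quarantine, seen_ids = [], [], set()
    for record in raw_records:
        if record["tx_id"] in seen_ids:
            continue
        seen_ids.add(record["tx_id"])
        try:
            amount = float(record["amount"])
        except ValueError:
            quarantine.append(record)
            continue
        if amount <= 0:
            quarantine.append(record)
            continue
        clean.append({**record, "amount": amount})
    return clean, quarantine


def to_gold(silver_records):
    """Silver -> Gold: per-user aggregates ready to be used as features."""
    totals = {}
    for record in silver_records:
        count, total = totals.get(record["user_id"], (0, 0.0))
        totals[record["user_id"]] = (count + 1, total + record["amount"])
    return {
        user: {"tx_count": count, "avg_amount": total / count}
        for user, (count, total) in totals.items()
    }


silver, quarantined = to_silver(bronze)
gold = to_gold(silver)
print(gold)  # {'u1': {'tx_count': 2, 'avg_amount': 30.0}}
```

A bug in `to_gold` can be fixed and re-run against `silver` alone, while `bronze` stays untouched for the rare case where the cleaning rules themselves were wrong.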

Q5: How would you design a data pipeline that ensures the features used in model training are identical to the features computed at serving time?

This is the training-serving consistency problem, and it requires several coordinated elements:

First, shared feature computation code: the same Python/PySpark functions that compute features for training should be importable by the serving code. Never implement the same logic twice in different languages or frameworks.

Second, versioned feature schemas: the feature schema (names, types, transformations) should be explicitly versioned. A model is trained with schema v2.3. When deployed, the serving infrastructure fetches schema v2.3 features. If the schema changes, it's a new version, and old models continue serving with the old schema until retrained.

Third, serialize the fitted preprocessor: if your features include scaling, normalization, or one-hot encoding, serialize the fitted transformer (for scikit-learn, with pickle or joblib, or as a custom artifact) and deploy it alongside the model. The same fitted scaler that was applied to training data is applied to serving data.

Fourth, integration tests: for every new model deployment, run a validation suite that (a) takes a sample of raw inputs, (b) runs them through both the training feature pipeline and the serving feature pipeline, and (c) asserts the outputs are identical within floating-point tolerance. This catches regressions before they reach production.
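The first and fourth points can be combined in a small sketch: one shared feature function, two thin pipelines that call it, and a parity check that compares their outputs. All names here (`compute_features`, `training_pipeline`, `serving_pipeline`, `check_parity`) and the feature definitions are hypothetical; the check becomes far more valuable when the two pipelines diverge in how they load and shape raw inputs.

```python
import math


def compute_features(raw):
    """Single source of truth for feature logic, imported by both pipelines."""
    amounts = raw["amounts_last_24h"]
    return {
        "tx_count_24h": len(amounts),
        "avg_amount_24h": sum(amounts) / len(amounts) if amounts else 0.0,
    }


def training_pipeline(batch):
    # Offline path: computes features for a whole batch of raw inputs.
    return [compute_features(raw) for raw in batch]


def serving_pipeline(request):
    # Online path: computes features for a single request.
    return compute_features(request)


def check_parity(samples, rel_tol=1e-9):
    """Assert that both pipelines produce the same features for every sample."""
    for raw, offline in zip(samples, training_pipeline(samples)):
        online = serving_pipeline(raw)
        assert offline.keys() == online.keys(), "feature names differ"
        for name in offline:
            assert math.isclose(offline[name], online[name], rel_tol=rel_tol), name
    return True


samples = [
    {"amounts_last_24h": [10.0, 30.0]},
    {"amounts_last_24h": []},
]
print(check_parity(samples))  # True
```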

Summary

The ML data stack is the foundation that everything else rests on. Elegant model architectures cannot compensate for training data that leaks future information, features that differ between training and serving, or pipelines with no lineage tracking.

The key principles: land everything in the data lake first; use a data warehouse for SQL-based feature engineering; add Delta Lake or Iceberg for ACID guarantees and time travel; implement point-in-time correct joins religiously; track lineage from day one; and always use event time, not processing time, for time-based features.

The ML engineer who masters data systems is worth ten engineers who master model architectures. Models improve incrementally with hyperparameter tuning. Data systems have step-function improvements when designed correctly from the start.

:::tip Key Takeaway
The most common source of production ML failures is not model architecture - it is data pipeline design. Invest disproportionately in getting the data layer right: point-in-time correct joins, shared preprocessing code, data quality validation, and data lineage. These are the foundations that make every model that runs on top of them more reliable, reproducible, and debuggable.
:::
