What is data modelling for machine learning?

How to design data models for machine learning - point-in-time correctness, entity-centric tables, SCD Type 2, label leakage prevention, and the training-serving skew problem.

How does point-in-time join work in practice?

Data Modelling for ML covers data modelling for machine learning, point-in-time join, label leakage from first principles with code examples. Free lesson at https://engineersofai.com/docs/data-engineering/foundations/data-modelling-for-ml

What is the difference between data modelling for machine learning and label leakage?

See the full breakdown at https://engineersofai.com/docs/data-engineering/foundations/data-modelling-for-ml

:::tip 🎮 Interactive Playground Visualize this concept: Try the Feature Store Architecture demo on the EngineersOfAI Playground - no code required. :::

Data Modelling for ML

The Night the Fraud Model Lied to Us

It was 11:47 PM on a Tuesday when the Slack message came in. The fraud model - three weeks in production - had just triggered a wave of false declines across Stripe's highest-value merchants. Cards were getting rejected at restaurants, hotels, and airlines. The on-call team was scrambling.

The model had backtested at 94.2% AUC. In production it was performing at 71%. Something was deeply wrong.

The senior data engineer, Maya, pulled up the training data generation script. Within ten minutes she found it. The training pipeline was joining user account features - account age, total spend, verified status - to transaction events using the transaction's created_at timestamp. But the account features table only stored the current state of the account. When a transaction happened three months ago, the pipeline was attaching the user's account metrics as they stood today, not as they were at the time of the transaction.

The model had learned to predict fraud using information that didn't exist at the time of prediction. An account that had been flagged as suspicious two months after the transaction - information that was reflected in today's account table - was leaking forward into the training set. The model had learned a pattern that was causally impossible in production.

This is label leakage. And it is the single most common and most expensive mistake in ML data modelling. Unlike software bugs, it doesn't crash anything. It silently inflates your offline metrics while guaranteeing production failure. The model looks great in backtesting precisely because it cheated - and it only reveals itself when you deploy to a world where you can't cheat anymore.

The fix required a complete rewrite of the training data generation logic using point-in-time joins: for every training example, every feature had to be computed using only data that existed at or before the event timestamp. This is not a nice-to-have. It is the foundational invariant of any ML data model worth building.

Why This Exists

Before feature stores and ML-specific data modelling practices, data scientists built training datasets the way analysts built reports: grab the relevant tables, join them together, filter to the date range, and train. This approach works fine for analytics because analysts ask backward-looking questions. A report about Q3 revenue doesn't care about what the data looked like in real time - it just needs the final numbers.

ML is different in a way that matters enormously. A machine learning model learns causal patterns from historical data and then applies those patterns to future, unseen data. For this to work, the historical training data must perfectly simulate the information environment the model will face in production. Every feature at training time must represent what would have been knowable at the moment of prediction.

Traditional OLAP data models don't preserve this property. Star schemas optimized for reporting use slowly changing dimensions that overwrite history or track current state. Wide analytical tables denormalize for query convenience, not for temporal correctness. None of these were designed with point-in-time correctness in mind.

The other problem is feature reuse. Data science teams at scale build hundreds or thousands of features. Without discipline, every team recomputes the same features differently - slightly different time windows, different null handling, different aggregation logic. When the same feature is computed differently in training and serving, you get training-serving skew: the model sees one distribution of data during training and a different distribution in production. The model degrades silently.

ML data modelling is the discipline of designing data structures and pipelines that solve both of these problems: point-in-time correctness and consistent feature computation across the full ML lifecycle.

Historical Context

The academic foundations come from decision support systems in the 1990s. Ralph Kimball's dimensional modelling methodology (first published in The Data Warehouse Toolkit, 1996) gave us star schemas, slowly changing dimensions, and the discipline of separating facts from dimensions. These ideas were designed for reporting at scale.

The shift toward ML-specific data modelling emerged more slowly, driven by practitioners hitting Kimball's limits. Uber's Michelangelo platform (2017) was one of the first public acknowledgements that ML needs its own data layer. Their engineering blog post described the training-serving skew problem explicitly and introduced the concept of an online/offline feature store. Around the same time, Airbnb's Zipline project (2018) introduced declarative feature pipelines with built-in point-in-time join semantics.

The key insight of these systems was that features need versioning through time, not just versioning by deployment. A feature computed over a rolling 30-day window has a different value on every day of history. Storing and retrieving those historical values correctly - at arbitrary past timestamps - requires a different data model than anything in traditional data warehousing.

Feast (open-sourced by Gojek in 2019, now maintained by the Linux Foundation) brought these ideas to the broader open-source community. Tecton (2020) commercialized them. By 2022, every major cloud provider had some form of managed feature store, all built on the same core insight: ML needs time-aware data models.

Core Concepts

ML Needs Different Data Models Than OLAP

OLAP data models optimize for aggregation queries across large time ranges. A star schema is designed to answer "what were total sales by region in Q3?" efficiently. For this, you want normalized dimensions, compressed fact tables, and indexes that support GROUP BY + aggregate patterns.

ML training data has almost the opposite requirements. You want:

Point-in-time correctness - every row in the training set must represent the exact information state at a specific timestamp
Wide, denormalized tables - most ML frameworks (scikit-learn, XGBoost, PyTorch) expect a 2D matrix where each row is a complete training example with all features pre-joined
Consistent feature semantics - features computed the same way offline (training) and online (serving)
Historical feature values - not just current state, but the value of every feature at every past timestamp where a training label exists

A reporting data model optimized for a business dashboard will fail all four of these requirements. This is why ML teams end up building separate feature pipelines that duplicate much of the warehouse logic - not because they want to, but because the warehouse model was never designed for these access patterns.

Point-in-Time Correctness: The Most Important Concept

The core mathematical invariant of any ML data model is:

$\text{feature}(t) = f\left(\text{data available at } t - \epsilon\right)$

For any feature used to train a model on an event at time $t$ , that feature must be computed using only data that was available at time $t$ (minus an infinitesimal delay). No data created after time $t$ can appear in the feature.

Violating this invariant is called label leakage - the model learns from information that leaked from the future into the training set. Label leakage is insidious because it improves offline metrics while guaranteeing production degradation. The bigger the leakage, the better your backtests look, and the worse your production model performs.

There are three common forms of label leakage:

Target leakage: Features that are computed after the label is determined. Classic example: using days_until_churn as a feature for a churn model. The feature is derived from the label itself.

Temporal leakage: Features joined without time constraints. Using today's account balance to predict a transaction that happened six months ago. The account balance reflects all activity since then, including the target event.

Aggregation leakage: Computing statistics (mean, stddev) on the entire dataset before splitting into train/test. The test set statistics leak into the training features.

The formal definition of a valid training dataset is:

$\mathcal{D}_{\text{train}} = \left\{ \left( \mathbf{x}_{t_i}^{\text{PIT}},\ y_{t_i} \right) \mid i = 1,\ldots,N \right\}$

Where $\mathbf{x}_{t_i}^{\text{PIT}}$ is the point-in-time feature vector at time $t_i$ - every feature computed using only data with timestamp $\leq t_i$ .

Point-in-Time Joins

The implementation mechanism for point-in-time correctness is the AS-OF join (also called a point-in-time join or temporal join). The idea: for each event in your label table, join to the most recent feature record that existed at or before the event timestamp.

In SQL (standard ANSI SQL:2011 and supported in some dialects):

-- Standard AS OF JOIN syntax (supported in some databases)
SELECT
    t.transaction_id,
    t.event_time,
    t.is_fraud,
    u.account_age_days,
    u.total_spend_90d,
    u.verified_status
FROM transactions t
ASOF JOIN user_features u
    ON t.user_id = u.user_id
    AND t.event_time >= u.feature_computed_at

In practice, most SQL databases don't support ASOF JOIN natively. BigQuery, Snowflake, and DuckDB have workarounds. The most portable approach uses a window function or a correlated subquery.

DuckDB (which added ASOF JOIN in v0.8.0) is now the most practical option for local development:

-- DuckDB ASOF JOIN - one of the few databases with native support
SELECT
    t.transaction_id,
    t.event_time,
    t.is_fraud,
    u.account_age_days,
    u.total_spend_90d
FROM transactions t
ASOF JOIN user_features_history u
    ON t.user_id = u.user_id
    AND t.event_time >= u.valid_at
ORDER BY t.event_time

Slowly Changing Dimensions: Which Type Matters for ML

Ralph Kimball defined several types of slowly changing dimensions (SCDs) for handling attributes that change over time. For ML, two types matter:

SCD Type 1 - Overwrite: When an attribute changes, overwrite the current value. No history is preserved. This is appropriate for attributes where historical values are genuinely irrelevant (e.g., fixing a typo in a user's name). For ML, using SCD Type 1 on features that change meaningfully will cause label leakage. Don't use Type 1 for any feature that's predictive.

SCD Type 2 - Add Row: When an attribute changes, add a new row with valid_from and valid_to timestamps. The previous row's valid_to is set to the change time. This preserves full history and enables point-in-time joins. This is the correct approach for any feature used in ML training.

SCD Type 3 - Add Column: Tracks only the previous value alongside the current value (two columns: current_value and previous_value). Too limited for ML - you can only look back one step.

For ML data modelling, the rule is simple: any feature that changes over time must be stored as SCD Type 2 to enable point-in-time correct training data generation.

Entity-Centric Data Modelling

Traditional data modelling organizes around business events: orders, payments, page views, clicks. Entity-centric modelling organizes around the entities that participate in those events: users, items, contexts.

The shift matters for ML because models predict properties of entities, not of events in isolation. A fraud model predicts whether a user is committing fraud, not whether a transaction is fraudulent in isolation. The transaction is just the observation point. All the predictive signal is in the entity history.

Entity-centric modelling means building one wide feature table per entity type, keyed on (entity_id, timestamp), containing all predictive features computed up to that timestamp. These are sometimes called entity snapshots or feature snapshots.

For a recommendation system, the core entities are:

User: demographics, behavioral history, preferences, session context
Item: content features, popularity metrics, recency
Context: time of day, device, location, recent session

The training example for a recommendation is not just "user X clicked item Y." It's "user X, with these features as of timestamp T, clicked item Y in context Z." The entity-centric model makes this natural to construct.

Training-Serving Skew

Training-serving skew is the condition where the distribution of features at training time differs from the distribution at serving time. It is one of the most common causes of silent model degradation in production.

The causes are usually:

Different code paths: Python pandas code computes a feature in training; a different Java service computes the same feature at serving time. Subtle differences in null handling, rounding, or edge cases create distributional differences.
Different time windows: Training uses a 30-day rolling window; serving uses a 7-day window because the 30-day query was too slow.
Data freshness differences: Training data is recomputed monthly from the full warehouse; serving features are computed from a streaming pipeline with a 5-minute lag. These are different signals.
Schema drift: A dimension table added a new status code after the training data was generated. The serving pipeline handles the new code; the model has never seen it.

The solution is a feature store: a system that ensures features are computed identically in both the offline (training) and online (serving) paths. The offline path reads from historical snapshots; the online path reads from a low-latency store (Redis, DynamoDB). Both paths use the same feature definitions.

Even without a full feature store, you can reduce training-serving skew by:

Logging serving-time features alongside predictions, then using those logged features as your training data (the "log-and-train" pattern)
Writing feature logic once in a language-agnostic specification (SQL, Feast feature views) and generating both offline and online implementations from it

Star Schema vs. Wide Tables for ML

The star schema (fact table + dimension tables, joined at query time) is excellent for reporting. For ML, it has two problems: join overhead at training time and risk of temporal leakage when dimensions use SCD Type 1.

Wide, denormalized tables are generally preferred for ML because:

Training pipelines can read a single table rather than executing multi-table joins at scale
The feature computation logic (time windows, aggregations) is materialized, not recomputed at every training run
Point-in-time correctness can be enforced once, at materialization time

The pattern is: use a star schema in your data warehouse for reporting, then run a separate feature materialization pipeline that produces wide entity-snapshot tables for ML. These are not competing approaches - they serve different consumers.

Normalization vs. Denormalization for ML

Database normalization (1NF through 5NF) reduces data redundancy and ensures update consistency. For transactional systems, this is essential. For ML training data, it is actively harmful.

ML training requires random access to millions or billions of training examples. When each example requires 5-10 joins to reconstruct, the I/O cost becomes prohibitive at scale. A model training on 100 million examples with 10 table joins will spend more time on joins than on actual gradient computation.

The ML-optimal table design is:

One wide row per training example
All features pre-joined and pre-computed
Stored in a columnar format (Parquet) for efficient batch reads
Partitioned by time for efficient temporal filtering

The trade-off is storage space and refresh latency - wide denormalized tables take more space and require scheduled recomputation. But for ML training at scale, this trade-off is almost always correct.

Schema Evolution for ML Pipelines

Pipelines break when schemas change unexpectedly. Adding a column is usually safe. Removing or renaming a column breaks every downstream consumer. Changing a column's type or semantics is the most dangerous - it breaks consumers silently.

Best practices for schema evolution in ML pipelines:

Additive changes only: New columns can be added freely. Never remove or rename columns in a live pipeline - deprecate them first, migrate consumers, then remove.

Null-safe defaults: When a new feature column is added, existing historical rows get null. ML training pipelines must handle nulls gracefully (imputation or exclusion).

Column versioning: For features that change semantics, add a new versioned column (spend_90d_v2) rather than modifying the existing one. Mark the old column as deprecated.

Schema registries: For event-driven pipelines (Kafka, Kinesis), use a schema registry (Confluent Schema Registry, AWS Glue Schema Registry) with compatibility checks. Enforce backward compatibility at publish time.

Code Examples

Point-in-Time Join with `pd.merge_asof`

import pandas as pd
import numpy as np
from datetime import datetime, timedelta

# ── Simulate a label table ─────────────────────────────────────────────────
# In practice: transactions, orders, user events, etc.
np.random.seed(42)

n_transactions = 1000
transaction_times = pd.date_range("2024-01-01", "2024-06-30", periods=n_transactions)

labels = pd.DataFrame({
    "transaction_id": range(n_transactions),
    "user_id": np.random.randint(1, 100, size=n_transactions),
    "event_time": transaction_times,
    "is_fraud": np.random.binomial(1, 0.03, size=n_transactions),
    "amount": np.random.lognormal(4, 1, size=n_transactions).round(2),
})
labels = labels.sort_values("event_time").reset_index(drop=True)

print(f"Labels shape: {labels.shape}")
print(labels.head(3).to_string())

# ── Simulate a user features table (SCD Type 2 style) ─────────────────────
# Each row is a snapshot of user features at a point in time
user_feature_records = []
for user_id in range(1, 100):
    # Each user gets several feature snapshots over the period
    n_snapshots = np.random.randint(3, 15)
    snapshot_times = sorted(
        pd.date_range("2023-10-01", "2024-06-29", periods=n_snapshots).tolist()
    )
    for snap_time in snapshot_times:
        user_feature_records.append({
            "user_id": user_id,
            "feature_computed_at": snap_time,
            "account_age_days": (snap_time - pd.Timestamp("2022-01-01")).days,
            "total_spend_90d": np.random.lognormal(6, 1.5),
            "n_transactions_30d": np.random.randint(0, 50),
            "verified": np.random.choice([True, False], p=[0.7, 0.3]),
            "risk_score": np.random.beta(2, 8),  # Most users low risk
        })

user_features = pd.DataFrame(user_feature_records)
user_features = user_features.sort_values(
    ["user_id", "feature_computed_at"]
).reset_index(drop=True)

print(f"\nUser features shape: {user_features.shape}")
print(user_features.head(3).to_string())

# ── Point-in-time join using pd.merge_asof ────────────────────────────────
# merge_asof finds, for each row in `left`, the most recent row in `right`
# where right_key <= left_key. This is the "AS OF" join.
#
# CRITICAL: Both DataFrames must be sorted by the join key before calling.

training_data = pd.merge_asof(
    left=labels.sort_values("event_time"),
    right=user_features.sort_values("feature_computed_at"),
    left_on="event_time",          # The timestamp in the label table
    right_on="feature_computed_at",  # The timestamp in the feature table
    by="user_id",                  # Join key - match on user_id first
    direction="backward",          # Use the most recent feature BEFORE event_time
    # direction="forward" would cause leakage!
)

print(f"\nTraining data shape: {training_data.shape}")
print(f"Rows with null features (no prior snapshot): {training_data['risk_score'].isna().sum()}")
print(training_data[["transaction_id", "user_id", "event_time",
                       "feature_computed_at", "risk_score", "is_fraud"]].head(5).to_string())

# ── Verify point-in-time correctness ─────────────────────────────────────
# The feature_computed_at must always be <= event_time
leakage_check = training_data.dropna(subset=["feature_computed_at"])
violations = leakage_check[
    leakage_check["feature_computed_at"] > leakage_check["event_time"]
]
assert len(violations) == 0, f"Label leakage detected: {len(violations)} violations"
print(f"\nPoint-in-time check passed: 0 violations")

# ── Handle rows with no prior feature snapshot ────────────────────────────
# Drop or impute - never fill with future values
training_data_clean = training_data.dropna(subset=["risk_score"])
print(f"Training data after dropping no-history rows: {training_data_clean.shape}")

SCD Type 2 Implementation

import pandas as pd
from datetime import datetime

# ── SCD Type 2: Track historical values with valid_from / valid_to ─────────
# This pattern lets you answer "what was user X's status on date D?"

def apply_scd2_update(
    scd_table: pd.DataFrame,
    updates: pd.DataFrame,
    entity_key: str,
    effective_date: datetime,
) -> pd.DataFrame:
    """
    Apply SCD Type 2 updates to a snapshot table.

    - Closes out the current row for any entity that is being updated
      by setting valid_to = effective_date
    - Appends new rows with valid_from = effective_date, valid_to = None

    Parameters
    ----------
    scd_table   : Current SCD Type 2 table (with valid_from, valid_to columns)
    updates     : Rows representing new attribute values
    entity_key  : Column name of the entity identifier (e.g., 'user_id')
    effective_date : The timestamp at which these changes become effective
    """
    SENTINEL_FUTURE = pd.Timestamp("9999-12-31")

    # Find current-active rows for entities being updated
    # Current = valid_to is null or equals the sentinel future date
    updating_entities = updates[entity_key].unique()

    # Close out active rows for updated entities
    is_active = scd_table["valid_to"].isna() | (scd_table["valid_to"] == SENTINEL_FUTURE)
    is_affected = scd_table[entity_key].isin(updating_entities)

    scd_table = scd_table.copy()
    scd_table.loc[is_active & is_affected, "valid_to"] = effective_date

    # Append new rows
    new_rows = updates.copy()
    new_rows["valid_from"] = effective_date
    new_rows["valid_to"] = SENTINEL_FUTURE  # "currently active"

    return pd.concat([scd_table, new_rows], ignore_index=True)


def lookup_at_time(
    scd_table: pd.DataFrame,
    entity_key: str,
    entity_id,
    as_of: datetime,
) -> pd.Series:
    """Retrieve the attribute values for an entity at a specific point in time."""
    mask = (
        (scd_table[entity_key] == entity_id) &
        (scd_table["valid_from"] <= as_of) &
        ((scd_table["valid_to"] > as_of) | scd_table["valid_to"].isna())
    )
    result = scd_table[mask]
    if result.empty:
        raise ValueError(f"No record found for {entity_key}={entity_id} at {as_of}")
    return result.iloc[0]


# ── Example: User account status changes over time ─────────────────────────
SENTINEL = pd.Timestamp("9999-12-31")

# Initial state (Jan 1)
users_scd2 = pd.DataFrame([
    {"user_id": 1, "status": "free",    "spend_tier": "low",    "valid_from": pd.Timestamp("2024-01-01"), "valid_to": SENTINEL},
    {"user_id": 2, "status": "paid",    "spend_tier": "medium", "valid_from": pd.Timestamp("2024-01-01"), "valid_to": SENTINEL},
    {"user_id": 3, "status": "free",    "spend_tier": "low",    "valid_from": pd.Timestamp("2024-01-01"), "valid_to": SENTINEL},
])

# User 1 upgrades to paid on Feb 15
upgrade_event = pd.DataFrame([
    {"user_id": 1, "status": "paid", "spend_tier": "medium"}
])
users_scd2 = apply_scd2_update(
    users_scd2, upgrade_event, "user_id", pd.Timestamp("2024-02-15")
)

# User 1 gets suspended on April 3
suspension_event = pd.DataFrame([
    {"user_id": 1, "status": "suspended", "spend_tier": "medium"}
])
users_scd2 = apply_scd2_update(
    users_scd2, suspension_event, "user_id", pd.Timestamp("2024-04-03")
)

print("SCD Type 2 table for user_id=1:")
print(users_scd2[users_scd2["user_id"] == 1].to_string(index=False))
print()

# Point-in-time lookups
jan_record  = lookup_at_time(users_scd2, "user_id", 1, pd.Timestamp("2024-01-20"))
mar_record  = lookup_at_time(users_scd2, "user_id", 1, pd.Timestamp("2024-03-01"))
may_record  = lookup_at_time(users_scd2, "user_id", 1, pd.Timestamp("2024-05-01"))

print(f"User 1 status on Jan 20:  {jan_record['status']}")   # free
print(f"User 1 status on Mar 01:  {mar_record['status']}")   # paid
print(f"User 1 status on May 01:  {may_record['status']}")   # suspended

Detecting Label Leakage with Timestamp Checks

import pandas as pd
import numpy as np

def detect_temporal_leakage(
    training_df: pd.DataFrame,
    event_timestamp_col: str,
    feature_timestamp_cols: list[str],
    tolerance_seconds: int = 0,
) -> dict:
    """
    Scan a training DataFrame for temporal leakage.

    For each feature timestamp column, check whether any feature was
    computed AFTER the event timestamp (indicating future data leaked in).

    Parameters
    ----------
    training_df            : The assembled training DataFrame
    event_timestamp_col    : Column name of the label/event timestamp
    feature_timestamp_cols : List of columns recording when each feature was computed
    tolerance_seconds      : Allow features computed within N seconds after event
                             (for near-real-time systems with small lag)

    Returns
    -------
    dict with per-column violation counts and example rows
    """
    tolerance = pd.Timedelta(seconds=tolerance_seconds)
    results = {}

    event_ts = pd.to_datetime(training_df[event_timestamp_col])

    for feat_ts_col in feature_timestamp_cols:
        feat_ts = pd.to_datetime(training_df[feat_ts_col])

        # Leakage: feature timestamp > event timestamp + tolerance
        violation_mask = feat_ts > (event_ts + tolerance)
        n_violations = violation_mask.sum()
        n_valid = (~feat_ts.isna()).sum()

        results[feat_ts_col] = {
            "n_violations": int(n_violations),
            "n_valid_rows": int(n_valid),
            "leakage_rate": float(n_violations / n_valid) if n_valid > 0 else 0.0,
            "max_leakage_seconds": float(
                (feat_ts[violation_mask] - event_ts[violation_mask]).dt.total_seconds().max()
            ) if n_violations > 0 else 0.0,
            "example_violations": training_df[violation_mask].head(3).to_dict("records"),
        }

    total_violations = sum(v["n_violations"] for v in results.values())

    return {
        "total_violations": total_violations,
        "is_clean": total_violations == 0,
        "per_column": results,
    }


# ── Demonstration ─────────────────────────────────────────────────────────
np.random.seed(0)
N = 500

event_times = pd.date_range("2024-01-01", periods=N, freq="1h")

# Correctly constructed feature (always before event)
clean_feature_times = event_times - pd.to_timedelta(
    np.random.randint(1, 48, N), unit="h"
)

# Accidentally leaked feature (sometimes after event - classic mistake)
# Simulates a feature joined without time constraint from a "current state" table
leaked_feature_times = event_times + pd.to_timedelta(
    np.random.choice([-12, -6, 0, 6, 12, 24], size=N), unit="h"
)

df = pd.DataFrame({
    "event_time":              event_times,
    "label":                   np.random.binomial(1, 0.05, N),
    "clean_feature_computed_at":  clean_feature_times,
    "leaked_feature_computed_at": leaked_feature_times,
    "clean_feature_value":    np.random.randn(N),
    "leaked_feature_value":   np.random.randn(N),
})

report = detect_temporal_leakage(
    training_df=df,
    event_timestamp_col="event_time",
    feature_timestamp_cols=["clean_feature_computed_at", "leaked_feature_computed_at"],
)

print(f"Total violations: {report['total_violations']}")
print(f"Dataset is clean: {report['is_clean']}\n")

for col, stats in report["per_column"].items():
    print(f"Column: {col}")
    print(f"  Violations:    {stats['n_violations']} / {stats['n_valid_rows']}")
    print(f"  Leakage rate:  {stats['leakage_rate']:.1%}")
    print(f"  Max leakage:   {stats['max_leakage_seconds']:.0f} seconds\n")

Architecture Diagram

YouTube Resources

Title	Channel	Why Watch
Feature Stores for ML - Feast Deep Dive	Tecton	End-to-end walkthrough of point-in-time joins in a production feature store
Slowly Changing Dimensions (SCD) Explained	Kahan Data Solutions	Clear visual walkthrough of SCD Types 1, 2, 3 with dimensional modelling context
How Airbnb Builds ML Features at Scale	InfoQ	Zipline architecture - the real-world system that pioneered entity-centric feature pipelines
Data Leakage in Machine Learning	StatQuest with Josh Starmer	Intuitive explanation of why leakage inflates offline metrics
Kimball Dimensional Modelling for Modern Data Teams	dbt Labs	Star schema foundations adapted to the modern data stack

Production Engineering Notes

The Backfill Problem

When you add a new feature to a feature store, you need to backfill its historical values for all past training examples. This is expensive and operationally complex. At Airbnb scale, backfilling a new feature over two years of transaction history requires:

Replaying all raw events through the feature computation logic
Ensuring the computation logic handles schema changes over the backfill period correctly
Storing potentially billions of feature snapshot records
Re-running all model training jobs that use this feature

This is why feature design discipline matters upfront. Adding a feature is cheap; backfilling it is expensive. Plan your feature schema before you need it in production.

Late-Arriving Data

In streaming systems, events don't always arrive in order. A transaction that occurred at 14:00:00 might arrive at the processing system at 14:05:37. If your feature computation assumes strict time ordering, late-arriving events will cause incorrect point-in-time feature values.

The standard mitigation is a watermark: a moving threshold that defines "events before this time are considered complete." Features are only computed after the watermark has passed the relevant time window. This introduces latency (you wait for late events) in exchange for correctness. The right watermark depends on your data source's typical latency characteristics.

Feature Drift and Data Quality

A model that was correct at training time can degrade as the underlying data distributions shift. Common causes:

A data pipeline change that alters how a feature is computed
A product change that affects the behavioral signal a feature captures
Seasonal or secular trends that shift feature distributions over time

Production ML systems must monitor feature distributions continuously, not just model performance. By the time you see model degradation, the underlying feature drift has usually been present for days or weeks. Tools like Great Expectations, Monte Carlo, and Evidently AI are designed for this.

Point-in-Time Join Scale

pd.merge_asof works for development and small datasets (up to ~10M rows). At production scale:

Spark: Use a broadcast join approach - broadcast the feature table and apply window functions. For very large feature tables, consider Spark's AS OF JOIN equivalent via temporal filtering.
BigQuery/Snowflake: Use a QUALIFY ROW_NUMBER() OVER (PARTITION BY entity_id ORDER BY feat_ts DESC) = 1 WHERE feat_ts <= event_ts pattern.
DuckDB: Native ASOF JOIN - excellent for intermediate scale (100M-1B rows) on a single large machine.
Feast/Tecton: Managed point-in-time join computation - recommended for teams that need this at scale repeatedly.

Common Mistakes

:::danger Label Leakage from SCD Type 1 Dimensions

Using a dimension table that overwrites current values (SCD Type 1) for any attribute that changes over time will cause label leakage. The most common version: a users table stores current account status. Training data joins to this table without time constraints. Users whose accounts were flagged or upgraded after the training event get their current (future) status attached to past events.

Fix: Convert predictive dimension attributes to SCD Type 2. If you can't change the source, materialize daily snapshots and use point-in-time joins against those snapshots. :::

:::danger Using direction="nearest" in pd.merge_asof

pd.merge_asof with direction="nearest" will match the closest feature record - which might be after the event timestamp. This silently introduces leakage without any error or warning.

Always use direction="backward" for point-in-time joins. The feature record timestamp must be less than or equal to the event timestamp.

# WRONG - can use future features
pd.merge_asof(labels, features, left_on="event_time", right_on="feat_time",
              direction="nearest")  # DO NOT USE

# CORRECT - only uses features available at or before event_time
pd.merge_asof(labels, features, left_on="event_time", right_on="feat_time",
              direction="backward")  # ALWAYS USE THIS

:::

:::warning Computing Aggregates Before Train/Test Split

Computing global statistics (mean, stddev, quantiles) on the full dataset before splitting introduces statistical leakage. The test set's distribution leaks into the training features.

Always fit preprocessing transforms (scalers, imputers, encoders) on the training set only, then apply them to the test set.

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# WRONG - fits on full dataset
scaler = StandardScaler().fit(X)
X_train_scaled = scaler.transform(X_train)

# CORRECT - fits only on training data
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

:::

:::warning Ignoring Training-Serving Skew

If you compute features differently in training (pandas, Spark) and serving (Java, Go, SQL), you will have skew. The model will perform significantly worse in production than in backtesting, and the gap will appear random and hard to diagnose.

Log serving-time features alongside predictions. Periodically compare the distribution of logged serving features against training features. Any distributional difference is a skew signal. :::

Interview Q&A

Q1: What is label leakage and how do you prevent it?

Label leakage is when information about the target variable (or information that was only available after the target event) is used as a feature during training. The model learns this illegitimate signal and achieves inflated offline metrics. When deployed to production - where only past data is available - it fails because the leaked signal doesn't exist.

Prevention requires:

Point-in-time joins: for every training example, every feature must be computed using only data with a timestamp less than or equal to the event timestamp
SCD Type 2 for changing dimensions: never join to a "current state" table for features that change over time
Timestamp auditing: maintain a feature_computed_at column for every feature and verify it's always <= event_time in the training set
Train/test splitting before feature computation: compute aggregate features only on training fold data

The most common real-world form is joining entity features from a "current state" table without time constraints - a user's current risk score attached to transactions from 6 months ago.

Q2: How would you prevent label leakage when building training data for a fraud model?

A fraud model predicts at transaction time whether the transaction is fraudulent. Every feature must represent what was knowable at transaction time.

Concrete steps:

Identify all feature tables: account features, device features, merchant features, historical transaction aggregates
Version all feature tables as SCD Type 2: any table that can change over time gets valid_from / valid_to columns
Implement point-in-time joins: for each transaction in the label set, join to the feature table row where valid_from <= transaction_time < valid_to
Audit temporal consistency: after joining, verify that max(feature_computed_at) <= min(transaction_time) for each feature group
Be especially careful with derived features: things like "was the user's email verified" need to be from the SCD2 table, not from the current users table
Watch out for future aggregates: if you compute "total spend this month" as a feature, make sure it's the cumulative spend up to the transaction moment, not the end-of-month total

The most subtle form: the fraud label itself is often applied retroactively (a chargeback comes in 30 days after the transaction). If your feature pipeline runs after the label is applied, any features computed from the same user's data during that 30-day window will have seen the fraud signal.

Q3: What is SCD Type 2 and why is it important for ML?

SCD Type 2 (Slowly Changing Dimension, Type 2) is a data modelling pattern where every change to a dimension attribute creates a new row rather than overwriting the existing one. Each row has valid_from and valid_to timestamps defining when that attribute value was active.

For ML, it's essential because:

Models learn from historical data and must therefore have access to historical feature values
Without SCD Type 2, a join on a dimension table gives you the current attribute value, not the historical value - this is label leakage
SCD Type 2 enables true point-in-time joins: "what was user X's status on March 15th?"

Example: a user upgrades from "free" to "paid" on March 1st. Without SCD Type 2, all transactions - including January transactions - look like they came from a "paid" user. The model learns that paid users are low fraud risk, but this signal was leaked: the user wasn't paid in January.

With SCD Type 2, the January transactions correctly join to the "free" row (valid_from=2024-01-01, valid_to=2024-03-01), preserving temporal correctness.

Q4: What is training-serving skew and how do you detect it?

Training-serving skew is a mismatch between the feature distributions the model sees during training and the distributions it sees in production serving. The model was optimized for a slightly different input space than it operates in, causing production performance to be worse than expected.

Detection:

Feature logging: log every feature vector at inference time alongside the prediction
Distribution comparison: compare feature means, percentiles, and null rates between training data and logged serving data using statistical tests (KS test, PSI)
Population Stability Index (PSI): PSI > 0.25 signals significant distribution shift and warrants investigation

$\text{PSI} = \sum_{i=1}^{N} \left( A_i - E_i \right) \cdot \ln\frac{A_i}{E_i}$

Where $A_i$ is the actual (serving) proportion and $E_i$ is the expected (training) proportion in bucket $i$ .

Common causes: different code paths for training vs. serving, data freshness differences, schema changes after model training.

Prevention: use a feature store that enforces a single feature computation definition for both offline and online paths. The "log-and-train" pattern - training on logged serving features - eliminates most skew by definition.

Q5: When would you use a star schema vs. a wide denormalized table for ML?

Star schema is appropriate when:

Features are shared across many different ML use cases and you want a single source of truth
Feature reuse and join flexibility matter more than training throughput
The team is small and training jobs run infrequently (daily or weekly)
Storage is constrained (denormalized tables are larger)

Wide denormalized tables are appropriate when:

Training jobs run frequently (multiple times per day) and join latency matters
The training dataset is large (hundreds of millions of rows) and joins add significant compute cost
The feature set for a specific model is stable and well-defined
You're using a columnar format (Parquet) and need to maximize read throughput

In practice, the standard pattern is: maintain a star schema in the warehouse for flexibility and reporting, then run a scheduled feature materialization pipeline that produces pre-joined wide tables in Parquet for ML training. The star schema is the source of truth; the wide tables are optimized derivatives.

Q6: How do you handle schema evolution in a feature store without breaking models?

Schema evolution in feature stores is managed through three principles:

Additive-only changes: New features (columns) can always be added. Existing models ignore columns they don't know about. New models can optionally use the new features.
Deprecation over deletion: When a feature is being removed, mark it as deprecated and set a sunset date. Migrate all consuming models before deletion. Never delete a feature from a live pipeline.
Versioned feature views: When a feature's semantics change significantly (new computation logic, different time window), create a new version (user_spend_30d_v2) rather than modifying the existing one. The old and new versions coexist until all models have migrated.

For event schemas in streaming pipelines, enforce backward compatibility at the schema registry level. A new schema version must be able to deserialize all previous messages. Forward compatibility (old consumers can read new messages) is desirable but sometimes requires accepting unknown fields gracefully.

:::tip Feature Store Decision Framework Use a dedicated feature store (Feast, Tecton, Vertex AI Feature Store) when:

You have more than 3-4 ML models in production
Multiple teams are computing the same or similar features independently
You need sub-50ms feature retrieval at inference time
Training-serving skew has caused production incidents

For teams with fewer models, a disciplined combination of SCD Type 2 tables + point-in-time joins implemented in dbt or Spark is often sufficient and has much lower operational overhead. :::

Q7: How do you validate that a training dataset is free of temporal leakage before training?

Validation should happen at three layers:

Layer 1 - Schema checks (fast, run every time):

Verify every feature column has an associated _computed_at timestamp column
Assert that max(feature_computed_at) <= min(event_time) per entity group
Flag any feature with a null _computed_at - untracked compute time is untrusted time

Layer 2 - Statistical checks (run on samples):

Compute the correlation between each feature and the label. Features with suspiciously high correlations (AUC > 0.95 individually) are candidates for leakage investigation.
Split the dataset into past and future halves by time. A leakage-free model trained on the past half should perform similarly when applied to the future half. A large gap in performance (past AUC 0.94, future AUC 0.71) indicates leakage.
Plot feature value distributions conditioned on the label. A leaked feature will show clear label separation in a way that seems too convenient.

Layer 3 - Causal audit (run when a new feature is added):

For each new feature, trace its data lineage back to the source table
Identify the earliest timestamp at which the feature value could be known
Verify that this timestamp is earlier than the event timestamp for all training examples
Special scrutiny on: any feature derived from an event-level aggregate, any feature that joins to a table without a time constraint, any feature that references the target entity's future activity

The automated leakage detection code shown earlier in this lesson implements Layer 1. Layer 2 and Layer 3 require human judgment informed by domain knowledge of what should be causally knowable at prediction time.

Production Engineering Notes

The Entity History Explosion Problem

Entity-centric data models store a snapshot of entity features at every point in time. For a high-activity user who generates 1,000 transactions per day, storing one feature snapshot per transaction means 365,000 rows per year per user. At 1 million active users, that's 365 billion rows per year - before any actual feature values.

In practice, you don't need a snapshot at every transaction. Features computed over fixed time windows (30-day spend, 7-day login count) only change meaningfully when new data arrives that falls within the window. Daily snapshots are sufficient for most features; hourly snapshots are only warranted for features with very high volatility.

The right granularity:

Daily snapshots: recommended for most slowly-changing entity features (account status, cumulative spend, verification status)
Event-triggered snapshots: recompute and store a snapshot when a significant event occurs (large transaction, status change, verification)
Rolling aggregations: pre-compute time windows (7d, 30d, 90d) and store as columns in the snapshot, rather than reconstructing them from raw events at training time

Monitoring Point-in-Time Correctness in Production

Point-in-time joins are correct when the feature pipeline is implemented correctly. They break silently in these scenarios:

A new feature is added without SCD Type 2: developer adds a new column to the users table using a regular UPDATE (overwrite). The column immediately becomes a source of leakage for all future training data generation.
A pipeline is re-run on historical data: a backfill job reprocesses past events using current logic, which may incorporate data that wasn't available at the original event time.
Time zones are inconsistent: event timestamps stored in UTC, feature computation timestamps stored in local time. A one-to-eight-hour offset can cause small amounts of leakage that are hard to detect statistically.
Late-arriving labels: fraud labels often arrive weeks after the transaction (via chargebacks). If the label assignment time is used as the event timestamp instead of the transaction time, all features between the transaction and the chargeback are leaked.

Mitigation: instrument your point-in-time join pipeline to emit a temporal_gap_seconds metric - the time between each feature's computed_at and the event timestamp. Alert when this metric drops below zero (leakage) or above a threshold (staleness).

dbt Integration for Feature Pipelines

dbt (data build tool) is increasingly used to implement feature computation logic in SQL, which runs in the data warehouse. The advantages:

SQL-native: feature logic is readable, version-controlled, and testable
Incremental models: dbt can compute features only for new events, not the full history
Column lineage: dbt's metadata graph tracks which source tables feed which feature columns
Tests: dbt's built-in test framework can enforce timestamp ordering (not_null, accepted_values, custom temporal checks)

A typical dbt feature pipeline structure:

models/
  staging/          # Raw source cleaning, type casting
  intermediate/     # SCD Type 2 snapshots (dbt snapshots feature)
  features/         # Feature computation SQL models
    user_features.sql          # Daily user snapshot
    item_features.sql          # Item-level features
  training_sets/    # Point-in-time join SQL models
    fraud_training_set.sql     # Final joined training dataset

The dbt snapshot command is designed exactly for SCD Type 2 - it tracks changes to a source table and maintains a dbt_valid_from / dbt_valid_to history automatically.

Key Takeaways

Point-in-time correctness is non-negotiable: every training feature must use only data available at or before the event timestamp. Violations inflate offline metrics and guarantee production failure.
SCD Type 2 is the correct dimension storage strategy for ML: any feature that changes over time must be stored with valid_from / valid_to columns. SCD Type 1 (overwrite) causes label leakage.
pd.merge_asof with direction="backward" is the pandas implementation of a point-in-time join. Always verify feature_computed_at <= event_time after joining.
Entity-centric data models organize tables around entities (user, item, context) rather than events, making it natural to attach historical feature values to each training example.
Training-serving skew arises from different code paths, different time windows, or data freshness differences between training and serving. Use feature logging and PSI monitoring to detect it.
Wide denormalized tables are preferred for ML training over star schemas because they avoid join overhead at training time and can enforce point-in-time correctness at materialization time.
Schema evolution should follow the additive-only principle: add new columns freely, deprecate before removing, version semantically changed features rather than modifying existing ones.
dbt snapshots implement SCD Type 2 automatically - dbt_valid_from and dbt_valid_to are maintained by the dbt snapshot command, making temporal correctness easier to enforce in SQL pipelines.

:::tip Feature Store Decision Framework Use a dedicated feature store (Feast, Tecton, Vertex AI Feature Store) when:

You have more than 3-4 ML models in production
Multiple teams are computing the same or similar features independently
You need sub-50ms feature retrieval at inference time
Training-serving skew has caused production incidents

For teams with fewer models, a disciplined combination of SCD Type 2 tables + point-in-time joins implemented in dbt or Spark is often sufficient and has much lower operational overhead. :::

The Night the Fraud Model Lied to Us​

Why This Exists​

Historical Context​

Core Concepts​

ML Needs Different Data Models Than OLAP​

Point-in-Time Correctness: The Most Important Concept​

Point-in-Time Joins​

Slowly Changing Dimensions: Which Type Matters for ML​

Entity-Centric Data Modelling​

Training-Serving Skew​

Star Schema vs. Wide Tables for ML​

Normalization vs. Denormalization for ML​

Schema Evolution for ML Pipelines​

Code Examples​

Point-in-Time Join with pd.merge_asof​

SCD Type 2 Implementation​

Detecting Label Leakage with Timestamp Checks​

Architecture Diagram​

YouTube Resources​

Production Engineering Notes​

The Backfill Problem​

Late-Arriving Data​

Feature Drift and Data Quality​

Point-in-Time Join Scale​

Common Mistakes​

Interview Q&A​

Q1: What is label leakage and how do you prevent it?​

Q2: How would you prevent label leakage when building training data for a fraud model?​

Q3: What is SCD Type 2 and why is it important for ML?​

Q4: What is training-serving skew and how do you detect it?​

Q5: When would you use a star schema vs. a wide denormalized table for ML?​

Q6: How do you handle schema evolution in a feature store without breaking models?​

Q7: How do you validate that a training dataset is free of temporal leakage before training?​

Production Engineering Notes​

The Entity History Explosion Problem​

Monitoring Point-in-Time Correctness in Production​

dbt Integration for Feature Pipelines​

Key Takeaways​

The Night the Fraud Model Lied to Us

Why This Exists

Historical Context

Core Concepts

ML Needs Different Data Models Than OLAP

Point-in-Time Correctness: The Most Important Concept

Point-in-Time Joins

Slowly Changing Dimensions: Which Type Matters for ML

Entity-Centric Data Modelling

Training-Serving Skew

Star Schema vs. Wide Tables for ML

Normalization vs. Denormalization for ML

Schema Evolution for ML Pipelines

Code Examples

Point-in-Time Join with `pd.merge_asof`

SCD Type 2 Implementation

Detecting Label Leakage with Timestamp Checks

Architecture Diagram

YouTube Resources

Production Engineering Notes

The Backfill Problem

Late-Arriving Data

Feature Drift and Data Quality

Point-in-Time Join Scale

Common Mistakes

Interview Q&A

Q1: What is label leakage and how do you prevent it?

Q2: How would you prevent label leakage when building training data for a fraud model?

Q3: What is SCD Type 2 and why is it important for ML?

Q4: What is training-serving skew and how do you detect it?

Q5: When would you use a star schema vs. a wide denormalized table for ML?

Q6: How do you handle schema evolution in a feature store without breaking models?

Q7: How do you validate that a training dataset is free of temporal leakage before training?

Production Engineering Notes

The Entity History Explosion Problem

Monitoring Point-in-Time Correctness in Production

dbt Integration for Feature Pipelines

Key Takeaways