Point-in-Time Correctness

Time-travel queries, point-in-time joins, and preventing data leakage.

The Day the Model Lied to the Business

The churn prediction model had been a quiet success story for eight months. AUC of 0.91 on the held-out validation set, precision hovering around 0.84 at the operating threshold, lift of 3.2× over baseline. The model interventions - targeted retention emails and discount offers - had supposedly reduced monthly churn by a measurable amount. Product managers referenced it in quarterly reviews. The data science team had built something they were proud of.

Then a new engineer joined the team and ran a routine audit. She pulled the training dataset construction code - a 200-line Python script that had been written in a weekend sprint 14 months ago. The script loaded the label table (users flagged as churned, with their churn date), then called merge(features_df, on='user_id') to attach feature values. No timestamp condition. No as_of logic. Just a straight join on user ID, attaching whatever the most recent feature record was.

For a user who had churned in January, the model had trained on that user's March feature values - values that were calculated after the user had already left. Features like days_since_last_login (large, because they were gone), support_tickets_last_30d (0, they weren't submitting tickets anymore), and session_count_last_14d (0). The model had learned to predict churn from the absence of activity that only existed because the user had already churned. It was not predicting churn. It was describing it.

When the model was deployed, it received feature values for users who were still active - users who had normal session counts, normal login frequencies, normal support interactions. The distribution looked nothing like the corrupted training data. AUC in production was 0.74, a gap of 17 points from the offline evaluation. The business had spent months making retention decisions based on a model that had essentially memorized a consequence of churn rather than a predictor of it.

This is temporal leakage. It is one of the most common and most damaging failure modes in machine learning systems, and standard validation procedures will not reveal it unless you know to look for it. The fix requires understanding a fundamental concept: point-in-time correctness - the discipline of ensuring that for every training example, the features used were actually available at the moment the label was generated.

Why This Exists: The Time Travel Problem

Machine learning models are time machines in one very specific sense: they are trained on the past, then asked to make decisions about the present. This temporal structure creates a strict constraint that most data infrastructure ignores by default.

When you join two tables in SQL - a labels table and a features table - the database has no concept of time. It matches rows based on keys, and it will happily give you the most recent feature value regardless of when the label event occurred. The database is not doing anything wrong. SQL joins are designed for relational queries, not temporal queries. The problem is that we are using a relational tool to solve a temporal problem.

The core insight is simple: a model that will be used to make decisions at time T must be trained exclusively on information that was available strictly before time T. Any information generated after T - or any aggregation that includes data from after T - is contamination. The model will learn patterns that do not exist in the decision-making context it will be deployed into.

Feature stores exist, in part, specifically to solve this problem. Point-in-time correctness is not an optimization or a best practice - it is a correctness requirement. Training on temporally contaminated data produces a model that is incorrect by construction, and standard metrics will not tell you this unless you carefully design your evaluation to avoid the same contamination.

Historical Context: How the Industry Learned This Lesson

The terminology "point-in-time correct" emerged prominently around 2018–2020 as feature stores became a recognized pattern. Uber's Michelangelo team, which published details of their internal feature platform around 2017, identified temporal correctness as one of the hardest problems in training data generation. Feast's early documentation from 2020 listed ASOF joins as a core primitive.

Before feature stores, the industry learned about temporal leakage the hard way - through deployed models that performed well offline and poorly in production, and through post-mortems that traced the performance gap back to training data construction. Kaggle competitions saw the same pattern: solutions that placed highly on the private leaderboard often used features that leaked label information, but this only became clear when the solution was examined carefully.

The quantitative finance community had understood this problem decades earlier. Backtesting stock strategies requires careful discipline about "look-ahead bias" - the equivalent of temporal leakage. A trading strategy that uses earnings data from a quarter to make trades that would have been executed before that quarter's earnings were announced is not a real strategy; it is a simulation using information that did not exist. Financial engineers built tools to handle this in the 1990s. The ML community had to rediscover the same lesson.

What Point-in-Time Correctness Means

For every training example $(x_i, y_i)$, where $y_i$ is a label generated at time $T_i$, every feature $x_i^{(j)}$ must be computed exclusively from data that was available at time $T_i - \epsilon$ - strictly before the label event occurred.

Formally, for a feature computed by a function $f$ over a data window, and a label at time $T_i$:

$x_i^{(j)} = f(\text{data}_{t < T_i})$

Any feature that depends on $\text{data}_{t \geq T_i}$ introduces temporal leakage. The severity of the leakage depends on:

How far into the future the feature looks - using tomorrow's data for a label from today is severe leakage; using data from one hour after the label event in an hourly model is mild but still problematic
How correlated the leaked data is with the label - leakage involving data that is a direct consequence of the label event (like post-churn behavior) destroys the model entirely
What fraction of training examples are affected - if only 5% of examples are contaminated, the model degrades slightly; if 100% are contaminated, the model may have near-zero real-world predictive power

The practical definition: for each (entity, event_time) pair in your training label set, retrieve the feature value that was most recently computed before event_time. This is the point-in-time correct value.

Why Naive Joins Always Leak

The Most Common Pattern

import pandas as pd

# Labels: one row per churn event
labels_df = pd.DataFrame({
    'user_id': [101, 102, 103, 104],
    'label_date': pd.to_datetime(['2024-01-15', '2024-02-10', '2024-03-05', '2024-01-28']),
    'churned': [1, 1, 0, 0]
})

# Features: latest computed values (updated daily)
features_df = pd.DataFrame({
    'user_id': [101, 102, 103, 104],
    'last_updated': pd.to_datetime(['2024-03-20', '2024-03-20', '2024-03-20', '2024-03-20']),
    'session_count_30d': [2, 3, 45, 38],
    'days_since_login': [42, 38, 1, 2]
})

# This join is WRONG - attaches March features to January labels
training_df = labels_df.merge(features_df, on='user_id')
print(training_df)

For user 101 who churned in January, this join attaches features computed in March - after they had already left. Their session_count_30d is 2 (they barely logged in after churning), and days_since_login is 42. These values are consequences of the churn event, not predictors of it.

The Date Column Trap

A common attempt at fixing this adds a date condition:

# This looks better but is STILL WRONG
training_df = labels_df.merge(
    features_df,
    left_on=['user_id', 'label_date'],
    right_on=['user_id', 'last_updated']
)

This only returns rows where the feature was computed on exactly the same day as the label event. It misses most rows (features are not usually computed on exactly the same day as a label event) and still does not solve the temporal boundary problem for windowed features.

The Window Overlap Problem

Suppose you have a feature purchase_count_30d - the number of purchases in the last 30 days, computed at time $T$ . The label is "user made a high-value purchase on date $T$ ." The 30-day window for the feature at time $T$ includes the day $T$ itself. If the label event is a purchase, it is included in the feature computation. The feature is partially defined by the label.

This is circular leakage - the label event contaminates the feature. Even if the feature was genuinely computed at time $T$ , the window overlaps with the label event.

$\text{purchase\_count\_30d}(T) = \sum_{t=T-29}^{T} \text{purchases}(t)$

The fix is to keep the window 30 days long but shift it back so that it excludes the label timestamp:

$\text{purchase\_count\_30d}(T) = \sum_{t=T-30}^{T-1} \text{purchases}(t)$

In practice: use a one-day lag when your feature aggregation window endpoint is the same as your label date.
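The effect of the window endpoint can be checked on a toy example. The sketch below assumes a hypothetical purchase log with one row per purchase at day granularity; the column name purchase_date, the helper count_in_window, and the sample dates are illustrative only.

```python
import pandas as pd

# Hypothetical purchase log for one user; the last purchase is the label event.
purchases = pd.DataFrame({
    "purchase_date": pd.to_datetime(
        ["2024-02-20", "2024-03-01", "2024-03-10", "2024-03-15"]
    )
})
label_date = pd.Timestamp("2024-03-15")  # T: the day of the label purchase


def count_in_window(df, start, end):
    """Count purchases with start <= purchase_date <= end."""
    in_window = (df["purchase_date"] >= start) & (df["purchase_date"] <= end)
    return int(in_window.sum())


# 30-day window ending on T itself: the label purchase is counted.
leaky = count_in_window(
    purchases, label_date - pd.Timedelta(days=29), label_date
)

# 30-day window ending on T-1: the label purchase is excluded.
lagged = count_in_window(
    purchases,
    label_date - pd.Timedelta(days=30),
    label_date - pd.Timedelta(days=1),
)

print(leaky, lagged)
```

The two counts differ by exactly the label event, which is the part of the feature that the model could never see at decision time.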

The Point-in-Time Join Algorithm

The correct algorithm is called an ASOF join - "as of this point in time, what was the most recent feature value?"

The logic for each training example $(user\_id_i, event\_time_i)$ :

Filter the feature history table to rows where feature_timestamp < event_time_i
Within those rows, filter to entity_id = user_id_i
Take the row with the maximum feature_timestamp
That row's feature values are the point-in-time correct values for this training example
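
The four steps can be written out literally in pandas for a single training example. This is an illustrative, unoptimized sketch; the function name pit_lookup and the column names entity_id and feature_timestamp are assumptions made for the example.

```python
import pandas as pd


def pit_lookup(feature_history, entity_id, event_time):
    """Return the latest feature row strictly before event_time, or None."""
    # Step 1: keep only rows computed strictly before the label event.
    prior = feature_history[feature_history["feature_timestamp"] < event_time]
    # Step 2: keep only rows for this entity.
    prior = prior[prior["entity_id"] == entity_id]
    if prior.empty:
        return None  # no snapshot existed yet (cold start)
    # Steps 3 and 4: the row with the maximum timestamp is the as-of value.
    return prior.loc[prior["feature_timestamp"].idxmax()]


feature_history = pd.DataFrame({
    "entity_id": [101, 101, 101, 102],
    "feature_timestamp": pd.to_datetime(
        ["2024-01-10", "2024-02-01", "2024-03-15", "2024-01-20"]
    ),
    "session_count_30d": [12, 8, 3, 22],
})

row = pit_lookup(feature_history, 101, pd.Timestamp("2024-01-15"))
print(row["session_count_30d"])  # value from the 2024-01-10 snapshot
```

Running this once per label row is quadratic in the worst case, which is why real implementations use a sorted merge or push the join into the database.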

SQL Implementation: The LATERAL Join Pattern

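A common SQL implementation uses a LATERAL join, which behaves like a correlated subquery in the FROM clause: the inner query runs once per label row and can reference that row's columns. PostgreSQL and DuckDB support this syntax directly. Snowflake also has LATERAL, though it can reject some correlated subqueries that combine ORDER BY with LIMIT. BigQuery has no LATERAL keyword, so the same logic is usually written there with a window function such as ROW_NUMBER().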

-- Point-in-time correct training dataset generation
-- For each label event, find the most recent feature snapshot
-- computed BEFORE the label event occurred

SELECT
    l.user_id,
    l.label_date,
    l.churned,
    f.purchase_count_30d,
    f.session_count_14d,
    f.days_since_last_login,
    f.feature_computed_at  -- keep this for auditing
FROM labels l
LEFT JOIN LATERAL (
    SELECT
        purchase_count_30d,
        session_count_14d,
        days_since_last_login,
        feature_computed_at
    FROM user_features uf
    WHERE uf.user_id = l.user_id
      AND uf.feature_computed_at < l.label_date  -- strict less-than
    ORDER BY uf.feature_computed_at DESC
    LIMIT 1
) f ON true;

note

The LEFT JOIN LATERAL ensures that label rows without any prior feature snapshot are still included in the result - they will have NULL feature values. This is intentional. A missing feature value is better than a leaked feature value. You can impute or filter NULLs downstream, but you cannot undo contamination.

DuckDB Variant (ASOF JOIN syntax)

DuckDB 0.8+ has native ASOF join syntax, which is more readable and more performant than LATERAL:

-- DuckDB native ASOF join
SELECT
    l.user_id,
    l.label_date,
    l.churned,
    f.purchase_count_30d,
    f.session_count_14d,
    f.days_since_last_login
FROM labels l
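ASOF LEFT JOIN user_features f
    ON l.user_id = f.user_id
    AND l.label_date > f.feature_computed_at;
-- DuckDB picks the latest f row where feature_computed_at < label_date.
-- ASOF LEFT JOIN keeps label rows with no prior snapshot (NULL features),
-- matching the LEFT JOIN LATERAL behavior; a plain ASOF JOIN drops them.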

Snowflake Variant

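-- Snowflake: same LATERAL pattern, comma-join syntax.
-- This form behaves like an inner join, so label rows without a prior
-- snapshot are dropped instead of being returned with NULL features.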
SELECT
    l.user_id,
    l.label_date,
    l.churned,
    f.purchase_count_30d
FROM labels l,
LATERAL (
    SELECT purchase_count_30d
    FROM user_features uf
    WHERE uf.user_id = l.user_id
      AND uf.feature_computed_at < l.label_date
    ORDER BY uf.feature_computed_at DESC
    LIMIT 1
) f;

Python Implementation: `PointInTimeJoin`

For medium-scale workloads where SQL is not convenient, here is a reference Python implementation built on pandas merge_asof. Treat it as a starting point rather than production-ready code:

import pandas as pd
import numpy as np
from typing import List, Optional
from dataclasses import dataclass


@dataclass
class FeatureTable:
    """Represents a feature history table with temporal metadata."""
    df: pd.DataFrame
    entity_col: str
    timestamp_col: str
    feature_cols: List[str]


class PointInTimeJoin:
    """
    Performs point-in-time correct joins between a label set and
    one or more feature history tables.

    For each (entity, event_time) in the label set, retrieves
    the most recent feature value where feature_timestamp < event_time.
    """

    def __init__(self, label_df: pd.DataFrame,
                 entity_col: str = 'entity_id',
                 event_time_col: str = 'event_timestamp'):
        self.label_df = label_df.copy()
        self.entity_col = entity_col
        self.event_time_col = event_time_col

        # Ensure timestamps are datetime
        self.label_df[event_time_col] = pd.to_datetime(
            self.label_df[event_time_col], utc=True
        )

    def join(self, feature_table: FeatureTable,
             tolerance: Optional[pd.Timedelta] = None) -> pd.DataFrame:
        """
        Join a feature table to the label set using ASOF semantics.

        Args:
            feature_table: FeatureTable with feature history
            tolerance: If set, features older than event_time - tolerance
                      are treated as missing (NaN). Useful when a feature
                      has a known staleness limit.

        Returns:
            Label DataFrame with feature columns added.
        """
        feat_df = feature_table.df.copy()
        feat_df[feature_table.timestamp_col] = pd.to_datetime(
            feat_df[feature_table.timestamp_col], utc=True
        )

        # Sort both dataframes by timestamp (required by merge_asof)
        labels_sorted = self.label_df.sort_values(self.event_time_col)
        features_sorted = feat_df.sort_values(feature_table.timestamp_col)

        result_frames = []

        # Process each entity separately to avoid cross-entity matches
        all_entities = labels_sorted[self.entity_col].unique()

        for entity in all_entities:
            label_subset = labels_sorted[
                labels_sorted[self.entity_col] == entity
            ]
            feature_subset = features_sorted[
                features_sorted[feature_table.entity_col] == entity
            ]

            if feature_subset.empty:
                # No features available - attach NaN columns
                for col in feature_table.feature_cols:
                    label_subset = label_subset.copy()
                    label_subset[col] = np.nan
                result_frames.append(label_subset)
                continue

            # merge_asof: for each label row, find the most recent
            # feature row where feature_timestamp < label_timestamp.
            # direction='backward' looks for the most recent prior value;
            # allow_exact_matches=False makes the comparison strict, because
            # merge_asof matches equal timestamps (<=) by default.
            merged = pd.merge_asof(
                label_subset,
                feature_subset[[feature_table.entity_col,
                                 feature_table.timestamp_col] +
                                feature_table.feature_cols],
                left_on=self.event_time_col,
                right_on=feature_table.timestamp_col,
                left_by=self.entity_col,
                right_by=feature_table.entity_col,
                direction='backward',  # most recent feature before event
                allow_exact_matches=False,  # strict <, never <=
                tolerance=tolerance,
                suffixes=('', '_feat')
            )

            result_frames.append(merged)

        return pd.concat(result_frames, ignore_index=True)

    def join_multiple(self, feature_tables: List[FeatureTable]) -> pd.DataFrame:
        """Join multiple feature tables sequentially."""
        result = self.label_df.copy()
        joiner = PointInTimeJoin(result, self.entity_col, self.event_time_col)

        for ft in feature_tables:
            result = joiner.join(ft)
            joiner = PointInTimeJoin(result, self.entity_col, self.event_time_col)

        return result


# Usage example
if __name__ == "__main__":
    # Label set: churn events with timestamps
    labels = pd.DataFrame({
        'user_id': [101, 102, 103, 101],
        'event_timestamp': [
            '2024-01-15 00:00:00+00:00',
            '2024-02-10 00:00:00+00:00',
            '2024-03-05 00:00:00+00:00',
            '2024-03-20 00:00:00+00:00',
        ],
        'churned': [1, 1, 0, 0]
    })

    # Feature history: daily snapshots
    feature_history = pd.DataFrame({
        'user_id': [101, 101, 101, 102, 102, 103, 103],
        'feature_timestamp': [
            '2024-01-10 00:00:00+00:00',
            '2024-02-01 00:00:00+00:00',
            '2024-03-15 00:00:00+00:00',
            '2024-01-20 00:00:00+00:00',
            '2024-02-08 00:00:00+00:00',
            '2024-02-28 00:00:00+00:00',
            '2024-03-04 00:00:00+00:00',
        ],
        'session_count_30d': [12, 8, 3, 22, 18, 40, 42],
        'days_since_login': [2, 5, 30, 1, 3, 1, 1]
    })

    ft = FeatureTable(
        df=feature_history,
        entity_col='user_id',
        timestamp_col='feature_timestamp',
        feature_cols=['session_count_30d', 'days_since_login']
    )

    joiner = PointInTimeJoin(labels, entity_col='user_id',
                              event_time_col='event_timestamp')
    result = joiner.join(ft)

    print(result[['user_id', 'event_timestamp', 'churned',
                   'session_count_30d', 'days_since_login']])

    # For user 101 with label on 2024-01-15:
    # Should get features from 2024-01-10 (the snapshot BEFORE the event)
    # NOT the 2024-03-15 snapshot (which is after)

tip

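The merge_asof function uses a sorted merge under the hood, so once both frames are sorted (an $O(n \log n)$ step) the match itself is a single pass rather than an $O(n^2)$ comparison. For datasets with millions of label rows and feature snapshots, this is fast enough for offline training data generation, although the per-entity Python loop above becomes the bottleneck when there are many entities; passing the entity column through by= in a single merge_asof call avoids it. For extremely large datasets, push the computation to a warehouse such as BigQuery or to Spark.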

Feast: `get_historical_features()`

Feast, the open-source feature store, implements point-in-time correctness as a first-class primitive through its get_historical_features() API. Understanding how it works under the hood clarifies why the API looks the way it does.

from feast import FeatureStore
import pandas as pd

store = FeatureStore(repo_path=".")

# entity_df is the key concept: it must contain BOTH the entity ID
# AND an event_timestamp column. Without the timestamp, Feast cannot
# perform the point-in-time join.
entity_df = pd.DataFrame({
    "user_id": [101, 102, 103, 104],
    # event_timestamp should be a datetime column, not plain strings
    "event_timestamp": pd.to_datetime([
        "2024-01-15",
        "2024-02-10",
        "2024-03-05",
        "2024-01-28"
    ], utc=True),
    "churned": [1, 1, 0, 0]  # labels travel with the entity_df
})

# Feast performs the point-in-time join internally.
# For each row in entity_df, it finds the most recent feature values
# recorded no later than event_timestamp, subject to the feature view's TTL
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "user_activity_features:session_count_30d",
        "user_activity_features:days_since_last_login",
        "user_payment_features:purchase_count_30d",
        "user_payment_features:lifetime_value"
    ]
).to_df()

print(training_df.head())

What Feast Does Internally

Feast's offline store (BigQuery, Snowflake, DuckDB, or Spark, depending on your feature_store.yaml configuration) translates the get_historical_features() call into point-in-time joins. For each feature view, it generates SQL that implements the same as-of logic as the LATERAL join pattern above, although the generated query may use window functions rather than LATERAL.

When you use the BigQuery backend, Feast uploads the entity dataframe to a temporary table and performs the point-in-time join inside BigQuery. For a training dataset with 10 million labels and 5 feature views, the heavy lifting happens in the warehouse rather than in Python - an architecture that scales to hundreds of millions of examples.

The `event_timestamp` Column is Mandatory

A common mistake with Feast is calling get_historical_features() with an entity_df that lacks an event_timestamp column. Feast rejects the request with an error instead of returning data (the exact exception type varies by Feast version and offline store). This is intentional. The API refuses to generate training data without temporal context because doing so would produce temporally leaked data by default.

# This fails: entity_df has no event_timestamp column
bad_entity_df = pd.DataFrame({
    "user_id": [101, 102, 103]
    # no event_timestamp - Feast refuses this
})

# Feast enforces correctness at the API level
store.get_historical_features(
    entity_df=bad_entity_df,  # rejected because the timestamp is missing
    features=[...]
)

The Feature Window Overlap Problem

Even with a correct ASOF join, you can still have leakage if the feature computation window includes the label event itself.

Case Study: Purchase Prediction

Suppose you are predicting whether a user will make a purchase in the next 7 days. Your label is 1 if they made a purchase in the 7-day window starting at $T$ , 0 otherwise. One of your features is purchase_count_30d - the count of purchases in the 30 days ending at $T$ .

If the user made a purchase on day $T+3$ (within the label window), and your feature window ending at $T$ is defined as "the 30 days before and including $T$ ," then:

The label captures the purchase at $T+3$ : label = 1
The feature purchase_count_30d computed at $T$ does not capture it - correct so far

But now consider: what if your feature is computed with a window of "30 days before and including the day of computation," and you compute it at $T$ including data from day $T$ itself. If day $T$ had other activity that correlates with the purchase decision (e.g., the user viewed 5 product pages on day $T$ ), that activity is now in your feature. This is not strictly leakage if the activity at $T$ is genuinely prior to the purchase decision - but the window boundary is subtle.

The safest convention: use a one-day lag for all window features. A feature labeled as "computed for date $T$ " uses a window ending at $T-1$ , not $T$ .

def compute_purchase_count(df: pd.DataFrame,
                            as_of_date: pd.Timestamp,
                            lookback_days: int = 30) -> int:
    """
    Compute purchase count for lookback window ending BEFORE as_of_date.

    Window: [as_of_date - lookback_days, as_of_date - 1 day]
    This ensures the feature does not include any event on as_of_date itself.
    """
    window_end = as_of_date - pd.Timedelta(days=1)  # explicit lag
    window_start = as_of_date - pd.Timedelta(days=lookback_days)

    mask = (
        (df['purchase_date'] >= window_start) &
        (df['purchase_date'] <= window_end)
    )
    return mask.sum()

Time Zone Correctness

Time zones are a silent source of temporal leakage that rarely appears in post-mortems because it is subtle.

Suppose you have a feature pipeline that runs at midnight Pacific Time (PST, UTC-8). It computes "today's" aggregations for all users. When you store this feature, the timestamp is:

Local (PST): 2024-01-15 00:00:00 PST - "January 15th features"
UTC: 2024-01-15 08:00:00 UTC - stored as January 15th, 8 AM UTC

Now you have a label event that occurred at 2024-01-15 04:00:00 UTC (4 AM UTC = 8 PM PST on January 14th). In PST, this is still January 14th. But the feature run started at 08:00 UTC and aggregated data up to midnight PST, which means:

PST perspective: the feature covers January 14th events up to midnight - the label event at 8 PM PST falls inside that window, so the aggregation already includes four hours of post-label activity
UTC perspective: the feature timestamp (08:00 UTC) is after the label event (04:00 UTC), yet a join that compares calendar dates sees January 15th on both sides and treats the feature as valid

The rule: store all timestamps in UTC, always. Compute all window boundaries in UTC. Convert to local time only for display.

import warnings

import pandas as pd


def normalize_to_utc(ts) -> pd.Timestamp:
    """Ensure a timestamp is timezone-aware UTC."""
    ts = pd.Timestamp(ts)
    if ts.tzinfo is None:
        # Assume UTC if no timezone info, and warn so the gap gets fixed upstream
        warnings.warn(
            f"Received naive timestamp {ts} - assuming UTC. "
            "Ensure all pipeline timestamps are explicitly UTC-aware.",
            UserWarning
        )
        return ts.tz_localize('UTC')
    return ts.tz_convert('UTC')

warning

When working with international user bases, be especially careful with "daily" features. A feature computed "for Monday" in New York is a different time range than "for Monday" in Tokyo. If your label events are timestamped in UTC and your features are computed in local time, the mismatch can create leakage for users in certain time zones but not others - producing training data that has inconsistent leakage, which is harder to detect than uniform leakage.
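
To make the "Monday in New York versus Monday in Tokyo" point concrete, the sketch below converts the same local calendar day into UTC bounds for both cities. The helper name local_day_in_utc and the sample label time are assumptions for illustration.

```python
import pandas as pd


def local_day_in_utc(day, tz):
    """Return the [start, end) bounds of a local calendar day, in UTC."""
    start_local = pd.Timestamp(day).tz_localize(tz)
    end_local = (pd.Timestamp(day) + pd.Timedelta(days=1)).tz_localize(tz)
    return start_local.tz_convert("UTC"), end_local.tz_convert("UTC")


# Monday 2024-01-15 in two time zones.
ny_start, ny_end = local_day_in_utc("2024-01-15", "America/New_York")
tokyo_start, tokyo_end = local_day_in_utc("2024-01-15", "Asia/Tokyo")

print(ny_start, ny_end)
print(tokyo_start, tokyo_end)

# A label event stamped in UTC can fall inside one city's Monday
# while still being before the other city's Monday has started.
label_utc = pd.Timestamp("2024-01-15 03:00", tz="UTC")
in_tokyo_monday = tokyo_start <= label_utc < tokyo_end
in_ny_monday = ny_start <= label_utc < ny_end
print(in_tokyo_monday, in_ny_monday)
```

A "daily" feature keyed only by local date therefore covers a different UTC range per user, which is why window boundaries should be computed in UTC.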

Feature Lag: Intentionally Delayed Features

Some features are not available immediately after the event that generates them. Credit card chargebacks, for example, take 45–90 days to be confirmed and reported. If you use chargeback_rate_90d as a feature for a fraud model, a historical table built today already contains chargebacks that were confirmed weeks after each training row's event time, whereas at serving time the most recent 45 or more days of chargebacks are still unknown. The model trains on a more complete picture than it will ever see in production. The same delay affects labels: recent transactions that look clean may simply not have had their chargebacks confirmed yet.

This is label latency bias, a variant of temporal leakage. The fix is to build an explicit lag into the feature definition:

@dataclass
class FeatureDefinition:
    """Feature definition with explicit lag and window configuration."""
    name: str
    source_table: str
    aggregation: str  # "sum", "count", "avg", etc.
    window_days: int
    lag_days: int = 0  # Minimum lag before this feature is considered valid

    def get_window_bounds(self, as_of: pd.Timestamp):
        """
        Compute the [start, end] window for this feature as of a given time.

        With a 45-day lag and 90-day window:
        - Window ends at: as_of - 45 days
        - Window starts at: as_of - 45 days - 90 days = as_of - 135 days
        """
        window_end = as_of - pd.Timedelta(days=self.lag_days)
        window_start = window_end - pd.Timedelta(days=self.window_days)
        return window_start, window_end


# Chargeback rate with 45-day confirmation lag
chargeback_feature = FeatureDefinition(
    name="chargeback_rate_90d",
    source_table="chargebacks",
    aggregation="rate",
    window_days=90,
    lag_days=45  # only count chargebacks confirmed at least 45 days ago
)

# At training time for a label event on 2024-03-01:
# Window end: 2024-03-01 - 45 days = 2024-01-16
# Window start: 2024-01-16 - 90 days = 2023-10-18
# Only chargebacks confirmed before 2024-01-16 are included
start, end = chargeback_feature.get_window_bounds(pd.Timestamp('2024-03-01'))
print(f"Feature window: {start.date()} to {end.date()}")
# Feature window: 2023-10-18 to 2024-01-16

Detecting Leakage After the Fact

If you suspect a model may have been trained on leaked data, there are several diagnostic approaches.

The Temporal Train-Test Gap Test

import pandas as pd
from sklearn.metrics import roc_auc_score


def temporal_leakage_test(model, df: pd.DataFrame,
                           timestamp_col: str,
                           label_col: str,
                           feature_cols: list) -> dict:
    """
    Temporal leakage diagnostic:
    Sort by time, then train on the first third, validate on the
    second third, and test on the final third.

    If validation AUC >> test AUC, suspect leakage in training data
    (concept drift can produce the same gap, so treat this as a prompt
    to investigate rather than proof).
    If validation AUC ≈ test AUC, training data is likely clean.
    """
    df = df.sort_values(timestamp_col)
    n = len(df)

    q1_end = n // 3
    q2_end = 2 * n // 3

    train = df.iloc[:q1_end]
    val = df.iloc[q1_end:q2_end]
    test = df.iloc[q2_end:]

    model.fit(train[feature_cols], train[label_col])

    val_auc = roc_auc_score(val[label_col],
                             model.predict_proba(val[feature_cols])[:, 1])
    test_auc = roc_auc_score(test[label_col],
                              model.predict_proba(test[feature_cols])[:, 1])

    gap = val_auc - test_auc

    return {
        'val_auc': val_auc,
        'test_auc': test_auc,
        'gap': gap,
        'leakage_suspected': gap > 0.05,
        'interpretation': (
            "LEAKAGE SUSPECTED: validation performance significantly exceeds "
            "test performance. Training data may be temporally contaminated."
            if gap > 0.05 else
            "No strong evidence of leakage. Val/test performance is consistent."
        )
    }

Feature Correlation with Time of Event

A feature that is temporally leaked carries a timestamp at or after the label event, and it often shows a suspiciously high correlation with the label. The check below measures both:

def check_feature_temporal_correlation(
        training_df: pd.DataFrame,
        feature_col: str,
        label_col: str,
        label_timestamp_col: str,
        feature_timestamp_col: str) -> dict:
    """
    Check if a feature was computed at or after the label event.
    Returns the fraction of rows where feature_timestamp >= label_timestamp
    (the boundary is strict, so equal timestamps count as leaked).
    """
    leaked_rows = (
        training_df[feature_timestamp_col] >= training_df[label_timestamp_col]
    ).sum()

    total_rows = len(training_df)
    leak_rate = leaked_rows / total_rows

    # Also check correlation between feature and label
    # (leaked features often have suspiciously high correlation)
    correlation = training_df[feature_col].corr(
        training_df[label_col].astype(float)
    )

    return {
        'feature': feature_col,
        'leak_rate': leak_rate,
        'leaked_rows': leaked_rows,
        'label_correlation': correlation,
        'verdict': 'LEAKED' if leak_rate > 0.0 else 'CLEAN'
    }

Visual: Naive Join vs. Point-in-Time Join

The naive join reaches forward in time, attaching a March snapshot to a January label. The point-in-time join correctly identifies that only the January 10th snapshot was available before the churn event.

Production Engineering Notes

Storing Feature History

Point-in-time correctness requires that you store feature history, not just the latest values. This is a storage design decision:

Storage Pattern	Correct for PIT?	Storage Cost	Notes
Latest-value table (upsert)	No	Low	Cannot look back in time
Append-only log with timestamps	Yes	High	Enables any PIT query
Daily snapshots	Partial	Medium	PIT-correct only to day granularity
Event sourcing	Yes	Very high	Full auditability

For most feature stores, daily or hourly snapshots are the practical choice. The snapshot frequency determines the precision of your PIT joins - if you need hourly accuracy, you need hourly snapshots.
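
The difference between the first two storage patterns can be seen on a three-row example. This is a minimal sketch with made-up values; the column names follow the SQL examples above, and the table names feature_log and latest_only are hypothetical.

```python
import pandas as pd

# Append-only feature log: every pipeline run adds rows, nothing is overwritten.
feature_log = pd.DataFrame({
    "user_id": [101, 101, 101],
    "feature_computed_at": pd.to_datetime(
        ["2024-01-10", "2024-02-01", "2024-03-15"], utc=True
    ),
    "session_count_30d": [12, 8, 3],
})

# A latest-value (upsert) table is what remains if only the newest row is kept.
latest_only = (
    feature_log.sort_values("feature_computed_at")
    .groupby("user_id", as_index=False)
    .last()
)

label_time = pd.Timestamp("2024-01-15", tz="UTC")

# With the log, the value as of the label time is still recoverable.
prior = feature_log[feature_log["feature_computed_at"] < label_time]
as_of_value = prior.sort_values("feature_computed_at")["session_count_30d"].iloc[-1]

# With the upsert table, only the March value survives.
latest_value = latest_only["session_count_30d"].iloc[0]

print(as_of_value, latest_value)
```

Once the January and February rows have been overwritten, no query can reconstruct what the model would have seen on January 15th.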

Backfill Contamination

When you backfill historical features (computing past feature values for time periods before the feature pipeline existed), you must ensure the backfill does not use information that was not available at that historical time.

Example of bad backfill:

# WRONG: using today's model scores to backfill historical features
def backfill_user_risk_score(historical_date: pd.Timestamp) -> pd.DataFrame:
    # This model was trained on data including the historical period!
    # Using it to generate "historical" features creates circular leakage
    scores = current_risk_model.predict(users_as_of(historical_date))
    return scores  # these are not what risk scores would have been historically

danger

Backfill contamination is one of the most insidious forms of temporal leakage because it looks like legitimate historical feature computation. When you backfill feature values using a model or algorithm that was trained on data from the period you are backfilling, every example in that period is contaminated. The backfill should only use algorithms and data sources that would have been available at the historical time.

Common Mistakes

danger

The "Close Enough" Approximation

"The label event happened at 2:47 PM and the feature was computed at end-of-day on the same date - that's only a few hours difference, it's fine."

It is not fine if:

The label event influenced the end-of-day computation (e.g., the label is a transaction, and end-of-day counts include transactions)
Your model operates at sub-day granularity in production (features will be from the previous day, but you trained on same-day features)

The gap between "close enough" thinking and rigorous point-in-time correctness is exactly where production performance gaps come from. Default to strict < comparisons rather than <=, and when in doubt, add a day of lag.

danger

Silently Missing Rows After PIT Join

After a point-in-time join, some label rows will have NULL features because no feature snapshot existed before the label event (e.g., a new user with no prior feature history). A common mistake is to silently drop these rows, treating them as "bad data."

These rows are not bad data - they represent genuinely cold-start cases. If you drop them from training, your model will never see cold-start users during training, and it will perform poorly on them in production (where cold-start users exist). Handle NULLs explicitly: impute with a cold-start strategy, train a separate cold-start model, or use a default feature vector that represents "no prior history."
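One way to keep cold-start rows is sketched below with `pandas.merge_asof`. The tables, the `has_feature_history` flag, and the default values are assumptions made for illustration; which imputation strategy is appropriate depends on the feature.

```python
import pandas as pd

labels = pd.DataFrame({
    "user_id": [1, 2],
    "label_timestamp": pd.to_datetime(["2024-01-15", "2024-01-16"]),
    "churned": [1, 0],
}).sort_values("label_timestamp")

# User 2 is brand new and has no feature snapshot yet.
features = pd.DataFrame({
    "user_id": [1],
    "feature_timestamp": pd.to_datetime(["2024-01-10"]),
    "session_count_last_14d": [9.0],
}).sort_values("feature_timestamp")

# merge_asof behaves like a left join: every label row is kept.
joined = pd.merge_asof(
    labels, features,
    left_on="label_timestamp", right_on="feature_timestamp",
    by="user_id", direction="backward", allow_exact_matches=False,
)

# Record which rows had no history, then impute instead of dropping them.
joined["has_feature_history"] = joined["feature_timestamp"].notna()
COLD_START_DEFAULTS = {"session_count_last_14d": 0.0}  # assumed default values
joined = joined.fillna(value=COLD_START_DEFAULTS)
```

Keeping the indicator column lets the model distinguish "zero sessions" from "no history yet", which a bare imputed zero would hide.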

warning

Using <= Instead of < in the Time Boundary

The time boundary in ASOF joins should be strict: feature_timestamp < label_timestamp. Using <= (less-than-or-equal) admits features that were computed at the exact moment of the label event. For a batch pipeline that runs at midnight, a feature row stamped "January 15th" (00:00:00 UTC) would be joined to a label stamped at that same instant, even though the batch may have already included data from the same second as the label event. Use strict less-than.
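In pandas, the boundary is controlled by the `allow_exact_matches` argument of `merge_asof`. The sketch below uses hypothetical tables in which a feature row and a label share the same midnight timestamp, so the two settings return different snapshots.

```python
import pandas as pd

labels = pd.DataFrame({
    "user_id": [1],
    "label_timestamp": pd.to_datetime(["2024-01-15 00:00:00"]),
})
features = pd.DataFrame({
    "user_id": [1, 1],
    "feature_timestamp": pd.to_datetime(
        ["2024-01-14 00:00:00", "2024-01-15 00:00:00"]
    ),
    "purchase_count_30d": [3, 4],
})

def pit_join(strict: bool) -> pd.DataFrame:
    """As-of join; strict=True means feature_timestamp < label_timestamp."""
    return pd.merge_asof(
        labels, features,
        left_on="label_timestamp", right_on="feature_timestamp",
        by="user_id", direction="backward",
        allow_exact_matches=not strict,
    )

strict_value = pit_join(strict=True)["purchase_count_30d"].iloc[0]      # Jan 14 snapshot
inclusive_value = pit_join(strict=False)["purchase_count_30d"].iloc[0]  # Jan 15 snapshot
```

Note that `merge_asof` defaults to `allow_exact_matches=True`, so the strict boundary has to be requested explicitly.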

warning

Different Feature Views with Different Snapshot Frequencies

If you join multiple feature views, each with a different snapshot frequency (one updated hourly, one updated daily, one updated weekly), the PIT join will retrieve the correct as-of snapshot for each - but the snapshots may have different "ages" relative to the label event. A weekly feature last updated 6 days before the label event may be significantly stale. Track the feature_computed_at timestamp for each feature view in your training data and monitor the distribution of lags.
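A lag report can be as simple as the following sketch. It assumes the PIT join output keeps one `*_computed_at` column per feature view; the view names and the data are hypothetical.

```python
import pandas as pd

# Hypothetical PIT join output with one computed-at column per feature view.
training = pd.DataFrame({
    "label_timestamp": pd.to_datetime(["2024-01-15 12:00", "2024-01-20 12:00"]),
    "hourly_view_computed_at": pd.to_datetime(["2024-01-15 11:00", "2024-01-20 11:00"]),
    "daily_view_computed_at": pd.to_datetime(["2024-01-15 00:00", "2024-01-20 00:00"]),
    "weekly_view_computed_at": pd.to_datetime(["2024-01-09 12:00", "2024-01-15 12:00"]),
})

computed_cols = [c for c in training.columns if c.endswith("_computed_at")]

# Age of each feature view, in hours, at the moment of the label event.
lag_hours = pd.DataFrame({
    c.replace("_computed_at", ""):
        (training["label_timestamp"] - training[c]).dt.total_seconds() / 3600
    for c in computed_cols
})

# One column per view; min / median / max lag is enough to spot stale views.
lag_summary = lag_hours.describe().loc[["min", "50%", "max"]]
```

Comparing this summary against the same statistics measured at serving time is a cheap check that training and production see features of similar freshness.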

Interview Questions and Answers

Q1: What is point-in-time correctness and why does it matter?

Point-in-time correctness is the property that for each training example with a label generated at time T, every feature value was computed from data that was strictly available before T. It matters because machine learning models learn statistical associations. If training features include data from after the label event, the model learns associations that cannot be reproduced at prediction time - where only past data is available. The result is a model that performs well on contaminated offline evaluations but poorly in production, often with no obvious reason why. The performance gap is a symptom of having trained on a fundamentally different distribution than the deployment distribution.

Q2: How would you explain temporal leakage to a non-technical stakeholder?

Imagine you're training a model to predict which students will fail an exam, so you can intervene before the exam happens. If the person building the model accidentally trains it by looking at each student's behavior after the exam - their study hours recorded after they already knew they failed, their tutor sessions booked after the exam - the model learns patterns that only appear after failure has already occurred. When you deploy this model to predict who will fail the next exam (before it happens), it performs terribly, because none of the "warning signs" it learned exist yet. That is temporal leakage: you trained the model on information from the future.

Q3: How would you detect label leakage after the fact, on a model that is already deployed?

Several methods work. First, a temporal train-test gap test - retrain the model on Q1 data only, evaluate on Q2 (validation) and Q3 (test) separately. If validation AUC is much higher than test AUC, leakage in Q1 training data is likely. Second, audit the training data construction code for any join without a temporal condition. Third, for each feature, check whether feature_timestamp >= label_timestamp for any rows in the training set - this directly measures the leak rate under the strict boundary. Fourth, check feature-label correlations: a feature that is unexpectedly highly correlated with the label (e.g., Pearson r > 0.8) is suspicious and warrants investigation. Fifth, examine feature importance - if a feature related to post-event behavior ranks highest, that is a red flag.
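The third check is the easiest to automate. The sketch below assumes the training set retains both timestamps per row (the column names are hypothetical) and counts equal timestamps as leaks, matching the strict boundary used in this lesson.

```python
import pandas as pd

training = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "label_timestamp": pd.to_datetime(
        ["2024-01-15", "2024-01-20", "2024-02-01", "2024-02-10"]
    ),
    # User 4 is a cold-start row with no feature snapshot.
    "feature_timestamp": pd.to_datetime(
        ["2024-01-10", "2024-03-01", "2024-02-01", None]
    ),
})

def leak_rate(df: pd.DataFrame) -> float:
    """Fraction of rows with features whose feature is not strictly older than the label."""
    has_feature = df["feature_timestamp"].notna()
    leaked = has_feature & (df["feature_timestamp"] >= df["label_timestamp"])
    return float(leaked.sum() / has_feature.sum())

rate = leak_rate(training)  # users 2 and 3 leak, out of 3 rows that have features
```

A non-zero rate on any feature view is worth investigating before looking at correlations or importances, because it is direct evidence rather than a symptom.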

Q4: How does Feast implement point-in-time correctness, and what does it require from the caller?

Feast's get_historical_features() method requires the entity_df parameter to contain both entity ID columns and an event_timestamp column. This is mandatory - Feast raises an error if no event timestamp column is present, preventing accidental non-temporal queries. Internally, Feast translates the request into point-in-time joins against the configured offline store (for example BigQuery, Snowflake, Redshift, or DuckDB). For each feature view, it generates a query that finds, for each entity, the most recent feature row that is not later than event_timestamp and, if a TTL is configured, not older than event_timestamp minus the TTL. The exact boundary comparison and the exception type can vary by offline store and Feast version, so check the query template for your store if you need same-timestamp rows excluded. When using the BigQuery backend, Feast materializes intermediate results to temporary tables and performs the joins in BigQuery's compute environment, enabling training dataset generation at scale.

Q5: What is the feature window overlap problem and how do you fix it?

The feature window overlap problem occurs when the aggregation window for a feature includes the time of the label event itself. For example, if the label is "user made a purchase today" and the feature is "purchase count in the last 30 days," the feature may include today's purchase (the label event) in its count. This creates circular leakage: the label directly affects the feature value. The fix is to use a one-day lag on the feature window - compute purchase_count_30d over the 30 days ending at yesterday, not today. More generally: the window endpoint for any feature used to predict label at time T should be T minus a configurable lag offset, where the lag is at least 1 unit (day, hour, etc.) depending on the time granularity of your pipeline.
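The effect of the lag can be seen in a small sketch. The `purchases` table and the `purchase_count_30d` helper are hypothetical; `as_of` stands for the time at which the feature is computed for the label day (here, the midnight batch that closes January 31st).

```python
import pandas as pd

purchases = pd.DataFrame({
    "user_id": [1, 1, 1],
    "purchase_time": pd.to_datetime(
        ["2024-01-02 10:00", "2024-01-20 10:00", "2024-01-31 09:00"]
    ),  # the last purchase is the label event itself
})

def purchase_count_30d(df: pd.DataFrame, user_id: int, as_of: pd.Timestamp,
                       lag_days: int = 1, window_days: int = 30) -> int:
    """Count purchases in [as_of - lag - window, as_of - lag)."""
    window_end = as_of - pd.Timedelta(days=lag_days)
    window_start = window_end - pd.Timedelta(days=window_days)
    mask = (
        (df["user_id"] == user_id)
        & (df["purchase_time"] >= window_start)
        & (df["purchase_time"] < window_end)
    )
    return int(mask.sum())

end_of_label_day = pd.Timestamp("2024-02-01 00:00")
leaky = purchase_count_30d(purchases, 1, end_of_label_day, lag_days=0)  # counts the label event
safe = purchase_count_30d(purchases, 1, end_of_label_day, lag_days=1)   # window ends before label day
```

Without the lag, the feature is one higher for every positive example than for an otherwise identical negative one, which is exactly the circular signal the model would latch onto.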

Q6: How do you handle features that have intentional delays (like chargebacks)?

Some features have inherent latency - chargebacks take 45 days to confirm, returns take 30 days, customer satisfaction surveys take 7 days. For these features, you must build the lag into the feature definition. The feature chargeback_rate_90d for a training example at time T should be computed over the window ending at T minus 45 days, not T. This ensures that at both training time and prediction time, the feature uses the same definition: only chargebacks that have had 45 days to be confirmed. If you instead train with a window ending at T, the historical data includes chargebacks from the final 45 days that were only confirmed later. At serving time those recent chargebacks are not yet visible, so the feature is systematically lower in production than in training. That training-serving skew degrades model performance on recent transactions.
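A minimal sketch of a lag-aware feature definition follows, assuming a hypothetical `transactions` table with a confirmed `is_chargeback` flag. The 45-day and 90-day constants come from the example above; the function and column names are illustrative. The key design choice is that training and serving call the same function with the same lag.

```python
import pandas as pd

CONFIRMATION_LAG = pd.Timedelta(days=45)  # time for a chargeback to be confirmed
WINDOW = pd.Timedelta(days=90)

transactions = pd.DataFrame({
    "merchant_id": [7, 7, 7, 7],
    "txn_time": pd.to_datetime(
        ["2024-01-10", "2024-02-01", "2024-03-20", "2024-04-25"]
    ),
    "is_chargeback": [True, False, False, True],
})

def chargeback_rate_90d(txns: pd.DataFrame, merchant_id: int, t: pd.Timestamp) -> float:
    """Chargeback rate over the 90 days ending at t minus the confirmation lag."""
    window_end = t - CONFIRMATION_LAG
    window_start = window_end - WINDOW
    in_window = txns[
        (txns["merchant_id"] == merchant_id)
        & (txns["txn_time"] >= window_start)
        & (txns["txn_time"] < window_end)
    ]
    if in_window.empty:
        return float("nan")  # no settled transactions to measure
    return float(in_window["is_chargeback"].mean())

# For t = 2024-05-01 the window is [2023-12-18, 2024-03-17): the March and
# April transactions are too recent to have a trustworthy chargeback status.
rate_at_t = chargeback_rate_90d(transactions, 7, pd.Timestamp("2024-05-01"))
```

The April chargeback is excluded even though it is already known in the historical data, because a model scoring on May 1st could not have seen it confirmed.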

Q7: Why does the ASOF join use strict less-than (<) rather than less-than-or-equal (<=)?

Using strict less-than ensures that features computed at exactly the same timestamp as the label event are excluded. This is important for two reasons. First, if features are computed as part of an event processing pipeline and the label event is also part of that pipeline, the same batch run might compute both the label and the feature - using <= would allow a feature that may have incorporated the label event in its computation to be used for that label. Second, at exactly T, the label event has just occurred but the "future" has not yet started - whether to include T depends on whether the feature computation at T could have observed the label event. The safest default is strict exclusion (<). You can relax to <= only when you can verify that features timestamped at T were computed before the label event at T.
