Skip to main content

Feature Engineering at Scale - The 80% of ML Work That Determines 80% of Results

Reading time: ~40 minutes | Level: ML System Design | Role: MLE, AI Engineer, MLOps


The Real Interview Moment

It is 2017, and the Uber Michelangelo team is staring at a production incident dashboard. The ETA model - the one displayed on hundreds of millions of rides every day - has started drifting. Not by a lot. Maybe 15% higher mean absolute error than last week. No code change was deployed. No data schema changed. The model weights are identical. Yet something is measurably, undeniably wrong.

After two days of debugging, the root cause surfaces: the feature "average trip speed in a 1 km radius, last 30 minutes" is computed differently in two places. In the training pipeline, a Spark job reads historical GPS logs and computes a weighted average over all trip segments falling within the geohash. In the serving pipeline, a Redis lookup reads a precomputed value that was aggregated by a separate microservice using a slightly different definition of "trip segment." When the microservice was updated three weeks ago, nobody updated the Spark job.

The disagreement is less than 1% on average. But it is systematic. In dense urban areas with complex traffic patterns - exactly the cases where ETA prediction matters most - the divergence spikes to 8%. The model was trained on one distribution and served on a subtly different one. The engineers had assumed feature computation was "done." It was not.

This incident, and dozens like it across the industry, led Michelangelo to formalize what we now call the feature store: a centralized system where features are defined once, computed once, and served identically to both training and inference. It is not glamorous engineering. There are no breakthrough algorithms involved. But as a senior Uber engineer later wrote in their blog post, "fixing training-serving skew gave us more model improvement than any architecture change we made that year."

Training-serving skew became the #1 ML bug category at Google, Uber, Netflix, and every major tech company that has published ML postmortems. Understanding it - how it happens, how to detect it, and how to prevent it - is the difference between an ML engineer who ships reliable systems and one who ships demo-quality models that degrade in production. This lesson is about building features that work, at scale, in the real world.


Feature Taxonomy and Extraction

Features are the vocabulary your model uses to understand the world. Choosing and computing that vocabulary well is genuinely difficult engineering. Every feature type requires a different treatment, and getting these treatments wrong is one of the most common sources of silent model degradation.

Numerical Features

Raw numerical features rarely enter models unchanged. The transformations you apply depend on the feature distribution and the model architecture.

Z-score normalization centers and scales: z=xμσz = \frac{x - \mu}{\sigma}. This is appropriate when the feature is approximately Gaussian and you are using gradient-based models (neural networks, SVMs). It does not handle skewed distributions well - one outlier can dominate μ\mu and σ\sigma.

Log transformation compresses heavy-tailed distributions: x=log(1+x)x' = \log(1 + x). Use for income, prices, counts - anything that spans several orders of magnitude. The +1+1 handles zeros. For signed values, use sign(x)log(1+x)\text{sign}(x) \cdot \log(1 + |x|).

Quantile transformation maps any distribution to a uniform or Gaussian: rank each value, then map ranks to quantiles. This is robust to outliers and works for any distribution shape. sklearn's QuantileTransformer implements this. The downside: it destroys exact numerical relationships, so interpretation becomes harder.

Binning converts continuous to categorical: divide the range into kk buckets. Useful when the relationship to the target is step-wise (e.g., age groups, income brackets). Two approaches: equal-width bins (simple, sensitive to outliers) or equal-frequency bins (robust, requires knowing the distribution). Binning introduces discretization error but makes the model robust to extreme values.

Clipping removes extreme outliers before normalization: x=clip(x,p1,p99)x' = \text{clip}(x, p_1, p_{99}) where p1p_1 and p99p_{99} are the 1st and 99th percentiles. Always clip before z-score normalization to prevent outliers from dominating μ\mu and σ\sigma.

Categorical Features

The right encoding depends on cardinality (number of unique values) and whether the categories have an inherent order.

One-hot encoding works for low-cardinality categories (under ~50 unique values). Each category becomes a binary column. Pros: interpretable, no ordinal assumption. Cons: sparse, does not scale to high cardinality, cannot handle new categories at serving time without special handling (set unknown to all-zeros or a dedicated "unknown" category).

Ordinal encoding assigns integers 0, 1, 2, … when there is a natural order (education level, shirt size). Never use ordinal encoding for unordered categories - it imposes a false distance metric on the feature space. XGBoost sees "0 is close to 1 and far from 5" - which is meaningless for unordered categories like city names.

Target encoding (also called mean encoding) replaces each category with the mean of the target variable for that category. With Laplace smoothing to handle rare categories:

μ^c=i:xi=cyi+λμglobalnc+λ\hat{\mu}_c = \frac{\sum_{i: x_i = c} y_i + \lambda \cdot \mu_{\text{global}}}{n_c + \lambda}

The λ\lambda term blends the category mean toward the global mean - the smaller ncn_c (fewer observations for that category), the more the estimate shrinks toward the global mean. Target encoding is powerful for high-cardinality categorical features. However, it introduces target leakage if you compute it on the full training set and then train on the same set. Always compute target encodings in a cross-validation loop.

Embedding lookup is the correct approach for very high cardinality categories (user IDs, product IDs, ZIP codes). Each category is mapped to a learned dense vector. These embeddings can be trained end-to-end with the model (works for neural networks), pre-trained from a separate task (user embeddings from collaborative filtering, item embeddings from a two-tower model), or fine-tuned from pre-trained sentence transformers (for text categories like product descriptions).

Hashing trick maps categories to a fixed-size integer space without storing a vocabulary: h(x)=hash(x)modKh(x) = \text{hash}(x) \mod K. Collisions are possible but acceptable when KK is large. This handles new categories at serving time naturally and requires no vocabulary storage. Used extensively in production recommendation systems.

Temporal Features

Time is one of the richest feature sources in production ML and one of the most commonly mishandled.

Cyclical encoding is essential for periodic time features. Simply encoding hour-of-day as an integer (0–23) creates a discontinuity: hour 23 and hour 0 are adjacent in time but at opposite ends of the integer range. Cyclical encoding preserves periodicity using sine and cosine pairs:

hour_sin=sin ⁣(2πtT),hour_cos=cos ⁣(2πtT)\text{hour\_sin} = \sin\!\left(\frac{2\pi \cdot t}{T}\right), \quad \text{hour\_cos} = \cos\!\left(\frac{2\pi \cdot t}{T}\right)

where TT is the period (24 for hours, 7 for days-of-week, 12 for months). The pair (sin,cos)(\text{sin}, \text{cos}) encodes position on a unit circle - hour 23 and hour 0 are now neighbors in this 2D space, as they should be.

Days since event captures recency: how many days since this user last purchased, since this item was last viewed, since this account was created. Combine with exponential decay for recency-weighted features:

w(Δt)=eλΔtw(\Delta t) = e^{-\lambda \Delta t}

where Δt\Delta t is the time elapsed and λ\lambda controls the decay rate (λ=0.01\lambda = 0.01 means roughly half-weight at 70 days). This is particularly effective for user activity and engagement features.

Rolling statistics capture temporal context: rolling mean, rolling standard deviation, rolling maximum over windows of 1 hour, 24 hours, 7 days, 30 days. The key implementation challenge: these must be computed point-in-time correctly. When training on historical data, the rolling window at time tt must only use data strictly from before time tt.

Cross Features

Feature crossing captures interactions that individual features cannot express alone. If "user from California" and "query contains tacos" each have weak signal individually, their cross - "California user querying tacos" - might be a strong signal for a local restaurant search model.

Direct product: for two categorical features x1{1,...,N1}x_1 \in \{1, ..., N_1\} and x2{1,...,N2}x_2 \in \{1, ..., N_2\}, the crossed feature takes N1×N2N_1 \times N_2 values. This is only feasible for low-cardinality features (otherwise the feature space explodes).

Hash feature crossing handles high-cardinality crosses by hashing the joint value into a fixed-size space:

ϕhash(x1,x2)=hash(x1n+x2)modK\phi_{\text{hash}}(x_1, x_2) = \text{hash}(x_1 \cdot n + x_2) \mod K

where nn is a large prime and KK is the output dimension. Collisions occur but are controlled by choosing KK large enough relative to the number of unique combinations. This is how Google's Wide and Deep model implements feature crosses in the "wide" component.

Learned feature interactions: instead of manually crossing features, models like DeepFM and DCN (Deep and Cross Network) learn pairwise and higher-order feature interactions automatically through their architecture. This is the modern alternative to manual feature crossing for neural ranking models.

Embedding Features

When features come from a pre-trained representation system:

  • User embeddings: from a collaborative filtering model or two-tower retrieval model. Represent user preference as a dense vector learned from historical interactions.
  • Item embeddings: from the same two-tower model or from a content-based encoder (image features, text features).
  • Text embeddings: sentence transformer applied to text fields (product description, user review, search query). Fine-tune on domain-specific data for best results.
  • Graph embeddings: Node2Vec or GraphSAGE applied to social or transaction graphs, capturing structural proximity.

Embedding features require special handling in pipelines: they are dense vectors (typically 32–512 dimensions), expensive to compute (cache aggressively in a key-value store), and need dimension-appropriate normalization before combining with other features.


Training-Serving Skew - The #1 ML Production Bug

Training-serving skew is the systematic difference between the feature values seen during model training and the feature values computed at serving time for the same input. Formally:

Δskew=E ⁣[ϕtrain(x)ϕserve(x)]\Delta_{\text{skew}} = \mathbb{E}\!\left[|\phi_{\text{train}}(x) - \phi_{\text{serve}}(x)|\right]

Even small values of Δskew\Delta_{\text{skew}} cause silent model degradation - the model works fine in offline evaluation (where training features are used consistently) but underperforms in production. The danger is that it is silent: no errors are thrown, no metrics alert, and the model just quietly performs worse than it should.

Root Causes

Different code paths: the most common cause. Training uses a Python or Spark implementation; serving uses a Java or C++ implementation. They are supposed to compute the same thing but diverge on edge cases - null handling, rounding behavior, timezone interpretation, empty array handling, string normalization differences.

Different data sources: training reads from a data warehouse with fully consistent historical data; serving reads from a different operational database that may have slightly different values, different refresh frequencies, or different schema versions due to recent migrations.

Time-window bugs: a feature defined as "purchases in the last 30 days" is computed during training by looking back 30 days from the label timestamp. During serving, if the reference point is not handled carefully (e.g., it uses wall clock time rather than request timestamp), the computed window is different.

Post-event features: using features that were only available after the event you are trying to predict. Example: predicting whether a transaction is fraudulent using the feature "account flagged for fraud." This flag is set after fraud is detected, not before. At training time you see the label and the post-event feature together. At serving time the feature has not yet been set. This is also called target leakage or data leakage.

Missing value handling differences: training fills nulls with the training set mean or median; serving fills nulls with a hardcoded constant. After the training set distribution shifts, the hardcoded constant becomes a poor approximation.

Preprocessing order differences: normalization applied before or after log transformation, for example, or clipping applied at different percentile cutoffs.

Detection

Log features at serving time: the most reliable detection method. Log the exact feature vector used for each prediction. Periodically sample these logs and compare the feature distributions to the training set distributions using statistical tests.

The Population Stability Index (PSI) is the industry standard metric:

PSI=b=1B(actualbexpectedb)ln ⁣actualbexpectedb\text{PSI} = \sum_{b=1}^{B} \left(\text{actual}_b - \text{expected}_b\right) \ln\!\frac{\text{actual}_b}{\text{expected}_b}

where actual and expected are the frequencies in each bin for serving and training distributions respectively. PSI below 0.1 indicates no significant change. PSI between 0.1 and 0.25 suggests moderate shift worth investigating. PSI above 0.25 indicates major distribution shift - retrain or investigate the pipeline.

Shadow mode testing: run a new feature computation in shadow mode alongside the existing production path, compare outputs on the same live requests, and alert on divergence before going live. This is the correct way to validate a feature pipeline change.

Unit tests with golden examples: maintain a set of (input, expected_feature_vector) pairs tested in both training and serving code. If either implementation changes in a way that breaks the golden examples, tests fail before reaching production.

Prevention

The most reliable prevention is one code path, one data source.

  • Define feature computation logic once (ideally in Python, which can be deployed both offline and online)
  • Deploy the same logic to both offline batch jobs and online serving
  • Read from the same underlying data store for both (the feature store)

This is exactly what feature stores enable. When you cannot use the same code (e.g., serving requires C++ for latency requirements), maintain bidirectional tests: compute the feature in both implementations on the same test inputs and assert equality within a defined tolerance.


Feature Stores

A feature store is a centralized system for managing, computing, and serving features. It eliminates training-serving skew by being the single source of truth for all feature values - both for model training and for real-time inference.

Architecture: Online and Offline Stores

Offline store holds the complete history of feature values with timestamps. When training a model, you query the offline store for historical feature snapshots. The key requirement: these snapshots must be point-in-time correct.

Online store holds the latest feature value for each entity (user, item, driver, product, etc.). At serving time, you provide an entity key (e.g., user_id=12345) and retrieve the feature vector in sub-10ms. Redis is the most common backend for the online store. The online store is updated continuously by the streaming pipeline, and periodically backfilled from the offline store.

Point-in-Time Correct Joins

This is the most critical concept in feature store design. When training on historical data, you must join features as of the exact moment the label was generated - not as of the time of training, not as of current query time.

The problem: suppose you have a loan approval model. The label is "did this loan default?" The label is generated months after the loan was issued. If you join the applicant's "current credit score" (as of training time, months or years later) rather than their credit score at the time of the loan application, you are using information that was not available at decision time. Your model will learn spurious correlations and fail in production.

The solution: for each (entity, label_timestamp) pair in your training set, look up the feature value that was current as of label_timestamp - specifically, the latest feature value with a timestamp less than or equal to label_timestamp. In SQL:

SELECT
labels.user_id,
labels.label_timestamp,
labels.target,
f.credit_score,
f.debt_to_income_ratio
FROM labels
LEFT JOIN features f
ON labels.user_id = f.user_id
AND f.feature_timestamp = (
SELECT MAX(feature_timestamp)
FROM features
WHERE user_id = labels.user_id
AND feature_timestamp <= labels.label_timestamp
)

This is called an as-of join or point-in-time join. Feature store frameworks like Feast and Tecton implement this efficiently using sorted merge algorithms over time-partitioned Parquet files - much faster than the correlated subquery above.

Feast Example

Feast is the most widely used open-source feature store. Here is the core setup for a user churn prediction system:

from feast import FeatureStore, FeatureView, Entity, Field
from feast.types import Float32, Int64
from feast import FileSource
from datetime import timedelta

# ─────────────────────────────────────────────────────────────────────────────
# Step 1: Define the entity - the "key" for feature lookup
# ─────────────────────────────────────────────────────────────────────────────
user = Entity(
name="user_id",
description="User entity for feature joins",
)

# ─────────────────────────────────────────────────────────────────────────────
# Step 2: Define the data source (offline: Parquet files or BigQuery)
# ─────────────────────────────────────────────────────────────────────────────
user_stats_source = FileSource(
path="data/user_stats.parquet",
timestamp_field="event_timestamp", # column containing feature computation time
)

# ─────────────────────────────────────────────────────────────────────────────
# Step 3: Define the feature view - a group of related features from one source
# ─────────────────────────────────────────────────────────────────────────────
user_stats_view = FeatureView(
name="user_stats",
entities=[user],
ttl=timedelta(days=30), # max age for online store features before staleness
schema=[
Field(name="purchase_count_7d", dtype=Int64),
Field(name="avg_order_value", dtype=Float32),
Field(name="days_since_last_purchase", dtype=Int64),
Field(name="session_count_30d", dtype=Int64),
Field(name="support_ticket_count", dtype=Int64),
],
source=user_stats_source,
)

# Initialize the feature store (reads from feature_store.yaml config)
store = FeatureStore(repo_path=".")

Retrieving historical features for training (point-in-time correct automatically):

import pandas as pd
from datetime import datetime

# Your training labels with timestamps - the "spine" of the training set
# Each row: which entity, at what point in time, with what label
entity_df = pd.DataFrame({
"user_id": [1001, 1002, 1003, 1004, 1005],
"event_timestamp": [
datetime(2024, 1, 15),
datetime(2024, 1, 20),
datetime(2024, 2, 1),
datetime(2024, 2, 10),
datetime(2024, 3, 5),
],
"label": [1, 0, 1, 0, 1], # churned in next 30 days?
})

# Feast performs the point-in-time join automatically
# For each (user_id, event_timestamp) row, it looks up the latest feature
# value with feature_timestamp <= event_timestamp
training_df = store.get_historical_features(
entity_df=entity_df,
features=[
"user_stats:purchase_count_7d",
"user_stats:avg_order_value",
"user_stats:days_since_last_purchase",
"user_stats:session_count_30d",
"user_stats:support_ticket_count",
],
).to_df()

print(training_df.head())
# user_id | event_timestamp | label | purchase_count_7d | avg_order_value | ...
# Features are from the correct point in time for each row - guaranteed no leakage

Online serving (real-time lookup, sub-10ms):

# At serving time: entity key lookup in Redis
# The exact same feature definitions are used - no skew possible
feature_vector = store.get_online_features(
features=[
"user_stats:purchase_count_7d",
"user_stats:avg_order_value",
"user_stats:days_since_last_purchase",
"user_stats:session_count_30d",
"user_stats:support_ticket_count",
],
entity_rows=[{"user_id": 1001}],
).to_dict()

# Same Feast registry → same feature definitions → no skew
prediction = model.predict([[
feature_vector["purchase_count_7d"][0],
feature_vector["avg_order_value"][0],
feature_vector["days_since_last_purchase"][0],
feature_vector["session_count_30d"][0],
feature_vector["support_ticket_count"][0],
]])
print(f"Churn probability: {prediction[0]:.3f}")

The key insight: both training and serving go through the same Feast registry and the same feature view definitions. It is architecturally impossible for them to compute features differently, because they are not computing features at all - they are both reading from the same store, which was populated by a single pipeline.


Feature Selection at Scale

With hundreds of candidate features, you need principled methods to select the most informative subset. Fewer features means faster training, faster inference, lower storage costs, and often better generalization through reduced overfitting.

Correlation Filtering

Remove redundant features that carry near-duplicate information. Compute the Pearson correlation matrix (for numerical features) or Spearman rank correlation (for ordinal or non-linear relationships). If rij>0.9|r_{ij}| > 0.9 for two features, drop one of them - preferably the one with lower individual correlation to the target.

This is a filter method: it does not consider the model, only pairwise feature relationships. It is fast and scales to thousands of features since it requires only O(p2)O(p^2) correlation computations.

Mutual Information

Mutual information measures the nonlinear dependency between a feature XX and the target YY:

I(X;Y)=H(Y)H(YX)=x,yp(x,y)logp(x,y)p(x)p(y)I(X;Y) = H(Y) - H(Y|X) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)p(y)}

I(X;Y)=0I(X;Y) = 0 means XX and YY are statistically independent (the feature is useless). I(X;Y)=H(Y)I(X;Y) = H(Y) means XX perfectly predicts YY. Unlike correlation, mutual information captures non-linear dependencies and works naturally for categorical variables using the discrete version.

At scale: computing exact mutual information requires estimating the joint distribution. For large datasets, estimate it on a sample (1M rows is typically sufficient) using sklearn's mutual_info_classif or mutual_info_regression. This is fast enough for hundreds of candidate features.

LASSO Regularization

LASSO (L1 regularization) adds a sparsity penalty to the loss:

LLASSO=Lbase+αj=1pwj\mathcal{L}_{\text{LASSO}} = \mathcal{L}_{\text{base}} + \alpha \sum_{j=1}^{p} |w_j|

The L1 penalty drives many coefficients exactly to zero, performing automatic feature selection. Features with zero coefficients after fitting are not useful for the model. The regularization strength α\alpha controls sparsity - larger α\alpha means more features zeroed out.

LASSO is a wrapper method - it considers features in the context of the model, capturing interactions and redundancies that filter methods miss. Use LassoCV to select the optimal α\alpha via cross-validation automatically.

SHAP-Based Feature Importance

After training any model (including non-linear gradient boosted trees or neural networks), use SHAP values to measure feature importance. The SHAP value for feature jj and instance ii is:

ϕj=SF{j}S!(FS1)!F![f(S{j})f(S)]\phi_j = \sum_{S \subseteq F \setminus \{j\}} \frac{|S|!(|F|-|S|-1)!}{|F|!}\left[f(S \cup \{j\}) - f(S)\right]

This computes the average marginal contribution of feature jj across all possible subsets of other features. SHAP values satisfy efficiency, symmetry, linearity, and null player - the unique axioms that define a fair attribution. They capture interaction effects and are consistent across model types.

At scale: TreeSHAP for gradient boosted models runs in polynomial time O(TLD2)O(TLD^2). For neural networks, use KernelSHAP on a representative sample (10K–100K rows).

Selection Strategy by Dataset Size

Dataset SizeRecommended ApproachRationale
Less than 100K rowsMutual information + LASSOFull dataset fits in memory; wrapper method feasible
100K–10M rowsCorrelation filter then MI on sample then SHAP on trained modelSample MI for speed; use model output for final selection
More than 10M rowsCorrelation filter then approximate MI on 1M sample then gradient boosted importanceApproximate methods sufficient at scale

Code: Complete Feature Engineering Pipeline

Cyclical Encoding for Temporal Features

import numpy as np
import pandas as pd

def cyclical_encode(df: pd.DataFrame, col: str, period: int) -> pd.DataFrame:
"""
Encode a periodic column with sin/cos to preserve cyclical distance.

The key insight: integer encoding hour-of-day creates a false discontinuity
where hour 23 and hour 0 are 23 units apart - but they are adjacent in time.
Cyclical encoding maps them to neighboring points on a unit circle.

Args:
df: DataFrame containing the column
col: Column name (e.g., 'hour', 'day_of_week', 'month')
period: Period of the cycle (24 for hours, 7 for days, 12 for months)

Returns:
DataFrame with two new columns: {col}_sin and {col}_cos
"""
df = df.copy()
df[f"{col}_sin"] = np.sin(2 * np.pi * df[col] / period)
df[f"{col}_cos"] = np.cos(2 * np.pi * df[col] / period)
return df


# Example: encode all temporal features from a timestamp column
df = pd.DataFrame({
"timestamp": pd.date_range("2024-01-01", periods=48, freq="h"),
"sales": np.random.lognormal(4, 1, 48),
})
df["hour"] = df["timestamp"].dt.hour
df["day_of_week"] = df["timestamp"].dt.dayofweek
df["month"] = df["timestamp"].dt.month

df = cyclical_encode(df, "hour", 24)
df = cyclical_encode(df, "day_of_week", 7)
df = cyclical_encode(df, "month", 12)

# Verify: hour 23 and hour 0 should be close in (sin, cos) space
h0 = df.loc[df.hour == 0, ["hour_sin", "hour_cos"]].values[0]
h23 = df.loc[df.hour == 23, ["hour_sin", "hour_cos"]].values[0]
dist_cyclical = np.sqrt(((h0 - h23) ** 2).sum())
dist_naive = abs(0 - 23)

print(f"Cyclical distance hour 0 to hour 23: {dist_cyclical:.4f}") # ~0.26
print(f"Naive integer distance hour 0 to hour 23: {dist_naive}") # 23

# The cyclical distance is small - correctly captures proximity
# The naive distance is huge - incorrectly treats them as far apart

Target Encoding with Cross-Validation (No Leakage)

from sklearn.model_selection import KFold
import numpy as np
import pandas as pd

def target_encode_cv(
X_train: pd.DataFrame,
y_train: pd.Series,
col: str,
n_splits: int = 5,
smoothing: float = 10.0,
) -> pd.Series:
"""
Target encoding computed in cross-validation folds to prevent target leakage.

Each row in the training set is encoded using the mean target of all OTHER
rows in different folds - it never sees its own target value during encoding.

Without this: the encoding for a category c is computed from all rows with
category c, including the row being encoded. The model then sees a feature
that is perfectly correlated with the target for that row. Severe overfitting.

Args:
X_train: Training features DataFrame
y_train: Training labels Series
col: Column to target-encode
n_splits: Number of CV folds (5 is standard)
smoothing: Laplace smoothing (higher = more smoothing for rare categories)

Returns:
Series of target-encoded values for X_train (same index as X_train)
"""
global_mean = y_train.mean()
encoded = pd.Series(index=X_train.index, dtype=float)

kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)

for fold_train_idx, fold_val_idx in kf.split(X_train):
# Compute encoding statistics on the fold's training portion
fold_X = X_train.iloc[fold_train_idx]
fold_y = y_train.iloc[fold_train_idx]

# Compute per-category statistics from the fold training set
stats = fold_y.groupby(fold_X[col]).agg(["sum", "count"])
stats.columns = ["sum", "count"]

# Smoothed mean: blend category mean toward global mean
# Categories with few observations get strongly regularized toward global_mean
# Categories with many observations trust their own mean
stats["smoothed_mean"] = (
(stats["sum"] + smoothing * global_mean) /
(stats["count"] + smoothing)
)

# Encode the validation portion using statistics from training portion
val_categories = X_train.iloc[fold_val_idx][col]
encoded.iloc[fold_val_idx] = val_categories.map(
stats["smoothed_mean"]
).fillna(global_mean)

return encoded


def build_target_encoding_map(
X: pd.DataFrame,
y: pd.Series,
col: str,
smoothing: float = 10.0,
) -> tuple[pd.Series, float]:
"""
Build a target encoding map from the full training set, for applying to test/serving data.

Returns:
(encoding_map, global_mean) - use encoding_map.map() on new data
"""
global_mean = y.mean()
stats = y.groupby(X[col]).agg(["sum", "count"])
stats.columns = ["sum", "count"]
stats["smoothed_mean"] = (
(stats["sum"] + smoothing * global_mean) /
(stats["count"] + smoothing)
)
return stats["smoothed_mean"], global_mean


# Usage: target-encode 'city' for a churn prediction model
np.random.seed(42)
n = 10000
X = pd.DataFrame({
"city": np.random.choice(
["NYC", "LA", "Chicago", "Houston", "Phoenix", "RareTown1", "RareTown2"], n,
p=[0.25, 0.20, 0.15, 0.15, 0.15, 0.05, 0.05]
),
"age": np.random.randint(18, 65, n),
})
y = pd.Series(np.random.binomial(1, 0.25, n))

# For training: CV-based encoding (no leakage)
X["city_encoded"] = target_encode_cv(X, y, col="city", smoothing=10.0)

# Build the final encoding map (fitted on full training set) for test/serving
encoding_map, global_mean = build_target_encoding_map(X, y, col="city", smoothing=10.0)

# For test/serving data:
X_test = pd.DataFrame({"city": ["NYC", "LA", "UnseenCity", "RareTown1"]})
X_test["city_encoded"] = X_test["city"].map(encoding_map).fillna(global_mean)
print(X_test)

Hash Feature Crossing

import hashlib
import numpy as np
from scipy.sparse import csr_matrix

def hash_feature_cross(
x1: np.ndarray,
x2: np.ndarray,
output_dim: int = 10000,
) -> np.ndarray:
"""
Hash-based feature crossing for high-cardinality categorical pairs.

Instead of creating N1 * N2 one-hot dimensions (which explodes combinatorially),
we hash the joint value into a fixed-size space of output_dim buckets.
Collisions are controlled by choosing output_dim >> expected unique combinations.

Args:
x1: First categorical feature array (string or int)
x2: Second categorical feature array (string or int)
output_dim: Hash space size - controls collision rate

Returns:
Integer array of indices in [0, output_dim)
"""
crossed = np.array([
int(hashlib.md5(f"{a}_{b}".encode()).hexdigest(), 16) % output_dim
for a, b in zip(x1, x2)
])
return crossed


def hash_cross_sparse(
x1: np.ndarray,
x2: np.ndarray,
output_dim: int = 1000,
) -> csr_matrix:
"""
Hash cross as sparse binary matrix - suitable for linear models.
Each row has exactly one 1, at the position of the crossed hash value.
"""
indices = hash_feature_cross(x1, x2, output_dim)
n = len(indices)
data = np.ones(n)
row_ind = np.arange(n)
return csr_matrix((data, (row_ind, indices)), shape=(n, output_dim))


# Example: cross user_country × product_category
# Wide & Deep style: wide component uses hash crosses for memorization
countries = np.array(["US", "UK", "DE", "US", "FR", "UK", "US", "DE"])
categories = np.array(["electronics", "clothing", "electronics", "books",
"clothing", "books", "electronics", "clothing"])

crossed = hash_feature_cross(countries, categories, output_dim=500)
print("Crossed feature indices:", crossed)

# Verify determinism: same inputs always give same output
assert hash_feature_cross(
np.array(["US"]), np.array(["electronics"]), output_dim=500
)[0] == hash_feature_cross(
np.array(["US"]), np.array(["electronics"]), output_dim=500
)[0], "Hash crossing must be deterministic"

# Collision analysis: for M unique combinations and K buckets,
# expected collision rate ≈ 1 - e^{-M^2 / (2K)}
M = len(set(zip(countries, categories))) # unique pairs
K = 500
collision_rate = 1 - np.exp(-M**2 / (2 * K))
print(f"Unique combinations: {M}, bucket size: {K}")
print(f"Expected collision rate: {collision_rate:.1%}") # should be low

Production sklearn Pipeline with ColumnTransformer

import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import (
StandardScaler,
OneHotEncoder,
FunctionTransformer,
QuantileTransformer,
)
from sklearn.impute import SimpleImputer
from sklearn.ensemble import GradientBoostingClassifier
import joblib

# ─────────────────────────────────────────────────────────────────────────────
# Feature group definitions
# ─────────────────────────────────────────────────────────────────────────────
NUMERICAL_STANDARD = ["age", "account_age_days", "session_duration_seconds"]
NUMERICAL_LOG = ["purchase_amount", "page_views_30d"] # heavy-tailed: log-transform
NUMERICAL_QUANTILE = ["income_estimate"] # unknown shape: quantile
CATEGORICAL_LOW_CARD = ["device_type", "country", "subscription_tier"]

# ─────────────────────────────────────────────────────────────────────────────
# Individual transformers for each numerical type
# ─────────────────────────────────────────────────────────────────────────────
numerical_standard_transformer = Pipeline([
("imputer", SimpleImputer(strategy="median")),
("scaler", StandardScaler()),
])

log_transformer = Pipeline([
("imputer", SimpleImputer(strategy="median")),
("log", FunctionTransformer(np.log1p, validate=True)),
("scaler", StandardScaler()),
])

quantile_transformer = Pipeline([
("imputer", SimpleImputer(strategy="median")),
("quantile", QuantileTransformer(
output_distribution="normal",
n_quantiles=1000,
random_state=42,
)),
])

# ─────────────────────────────────────────────────────────────────────────────
# Categorical transformer
# ─────────────────────────────────────────────────────────────────────────────
categorical_transformer = Pipeline([
("imputer", SimpleImputer(strategy="constant", fill_value="__missing__")),
("onehot", OneHotEncoder(
handle_unknown="ignore", # ignore unseen categories at serving time (all zeros)
sparse_output=True, # memory efficient for high-cardinality
drop=None, # keep all categories for interpretability
)),
])

# ─────────────────────────────────────────────────────────────────────────────
# Assemble ColumnTransformer - routes each column to the right transformer
# ─────────────────────────────────────────────────────────────────────────────
preprocessor = ColumnTransformer(
transformers=[
("num_standard", numerical_standard_transformer, NUMERICAL_STANDARD),
("num_log", log_transformer, NUMERICAL_LOG),
("num_quantile", quantile_transformer, NUMERICAL_QUANTILE),
("cat", categorical_transformer, CATEGORICAL_LOW_CARD),
],
remainder="drop", # columns not listed above are dropped (not silently passed through)
n_jobs=-1, # parallel transformer fitting
)

# ─────────────────────────────────────────────────────────────────────────────
# Full pipeline: preprocessing + model as single serializable artifact
# This is the key to training-serving consistency:
# - One artifact contains both fitted preprocessors and the model
# - Deploying this artifact guarantees the same transformations at serving time
# ─────────────────────────────────────────────────────────────────────────────
full_pipeline = Pipeline([
("preprocessor", preprocessor),
("model", GradientBoostingClassifier(
n_estimators=300,
max_depth=5,
learning_rate=0.05,
subsample=0.8,
random_state=42,
)),
])

# ─────────────────────────────────────────────────────────────────────────────
# Generate synthetic training data
# ─────────────────────────────────────────────────────────────────────────────
np.random.seed(42)
n = 5000
X_train = pd.DataFrame({
"age": np.random.randint(18, 75, n),
"account_age_days": np.random.randint(0, 3000, n),
"session_duration_seconds": np.random.exponential(300, n),
"purchase_amount": np.random.lognormal(4, 1, n),
"page_views_30d": np.random.poisson(20, n),
"income_estimate": np.random.lognormal(10, 0.5, n),
"device_type": np.random.choice(["mobile", "desktop", "tablet"], n),
"country": np.random.choice(["US", "UK", "DE", "FR", "CA"], n),
"subscription_tier": np.random.choice(["free", "basic", "premium"], n),
})
y_train = np.random.binomial(1, 0.25, n)

# ─────────────────────────────────────────────────────────────────────────────
# Train: fit_transform on training data only
# The scaler/encoder statistics are computed from training set only
# (never from test or serving data - that would be leakage)
# ─────────────────────────────────────────────────────────────────────────────
full_pipeline.fit(X_train, y_train)

# ─────────────────────────────────────────────────────────────────────────────
# Serving: the pipeline applies the same transformations using training statistics
# No refitting. No recomputing means/stds. Guaranteed consistency.
# ─────────────────────────────────────────────────────────────────────────────
X_serve = pd.DataFrame({
"age": [34],
"account_age_days": [450],
"session_duration_seconds": [280.5],
"purchase_amount": [125.99],
"page_views_30d": [18],
"income_estimate": [75000.0],
"device_type": ["mobile"],
"country": ["US"],
"subscription_tier": ["basic"],
})

prediction = full_pipeline.predict_proba(X_serve)[:, 1]
print(f"Churn probability: {prediction[0]:.3f}")

# ─────────────────────────────────────────────────────────────────────────────
# Persist: single artifact for deployment
# Load this artifact in your serving container - identical behavior guaranteed
# ─────────────────────────────────────────────────────────────────────────────
joblib.dump(full_pipeline, "churn_model_pipeline.pkl")
loaded = joblib.load("churn_model_pipeline.pkl")
assert abs(loaded.predict_proba(X_serve)[0, 1] - prediction[0]) < 1e-9

Feature Selection Code

import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler


def correlation_filter(X: pd.DataFrame, threshold: float = 0.9) -> list[str]:
"""
Remove highly correlated features (keep one of each correlated pair).
Filters redundant features before more expensive methods.
"""
corr_matrix = X.corr().abs()
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
to_drop = [col for col in upper.columns if any(upper[col] > threshold)]
to_keep = [col for col in X.columns if col not in to_drop]
print(f"Correlation filter: dropped {len(to_drop)} features, kept {len(to_keep)}")
return to_keep


def select_features_mutual_info(
X: pd.DataFrame,
y: pd.Series,
top_k: int = 50,
sample_size: int = 100_000,
) -> list[str]:
"""
Select top-k features by mutual information with the target.
Captures nonlinear dependencies that correlation misses.
Samples the dataset for scalability on large datasets.
"""
if len(X) > sample_size:
idx = np.random.choice(len(X), sample_size, replace=False)
X_sample, y_sample = X.iloc[idx], y.iloc[idx]
else:
X_sample, y_sample = X, y

X_filled = X_sample.fillna(X_sample.median(numeric_only=True))
mi_scores = mutual_info_classif(X_filled, y_sample, random_state=42)
mi_series = pd.Series(mi_scores, index=X.columns).sort_values(ascending=False)

print("Top 10 features by mutual information:")
print(mi_series.head(10))
return mi_series.head(top_k).index.tolist()


def select_features_lasso(X: pd.DataFrame, y: pd.Series) -> list[str]:
"""
Select features with non-zero LASSO coefficients.
LassoCV finds the optimal regularization strength via cross-validation.
Considers feature interactions within the model - wrapper method.
"""
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X.fillna(X.median(numeric_only=True)))

lasso = LassoCV(cv=5, random_state=42, max_iter=10000, n_jobs=-1)
lasso.fit(X_scaled, y)

selected = [col for col, coef in zip(X.columns, lasso.coef_) if abs(coef) > 1e-6]
print(f"LASSO: selected {len(selected)}/{len(X.columns)} features, alpha={lasso.alpha_:.4f}")
return selected

Common Mistakes

:::danger Training-Serving Skew from Different Code Paths Computing features differently in training (Python/Spark) and serving (Java/C++) is the #1 ML production bug. The disagreement can be sub-1% on average but systematic in the exact cases that matter most. The fix: use a feature store as the single source of truth, or maintain bidirectional tests with golden examples that run against both implementations every time either changes. :::

:::danger Post-Event Features (Future Leakage) Using features that are only available after the event you are predicting. The model learns to use information from the future, achieves unrealistically high offline metrics, and then catastrophically fails in production where those features are not yet set. Audit every feature for temporal validity: "Was this value available at the exact moment the prediction would have been made?" :::

:::danger One-Hot Encoding High-Cardinality Columns One-hot encoding ZIP codes (30,000 unique values), user IDs (millions), or product SKUs produces impossibly sparse, high-dimensional feature matrices. Memory explodes. Training slows. Generalization degrades. Use embedding lookup (for neural networks) or hashing trick (for any model). Rule of thumb: cardinality above 50 - do not use one-hot. :::

:::warning Not Fitting Preprocessing on Training Data Only Calling scaler.fit(X_all) on the full dataset before splitting leaks test set distribution statistics into training (the mean and variance used for scaling encode information about the test set). Always split first, then fit preprocessors on training data only, then apply (transform) to both train and test. sklearn Pipelines with fit on training data and predict on test data enforce this correctly by design. :::

:::warning Target Encoding Without Cross-Validation Computing target encoding on the full training set and then training on the same set is a classic leakage bug. The encoded value for each row directly incorporates that row's target value (especially for rare categories). The model sees a near-perfect predictor and wildly overfits. Always compute target encoding in cross-validation folds so each row's encoding uses only out-of-fold data. :::

:::tip Log Raw Features at Serving Time Always log the exact feature vector (before model inference) for each prediction. This enables post-hoc debugging of serving issues, distribution drift detection via PSI, offline replication of any production prediction for debugging, and eventual training data collection from real-world distributions. Without feature logs, training-serving skew is nearly impossible to diagnose. :::


YouTube Resources

VideoCreatorWhy Watch
Feature Engineering for MLAndrej KarpathyFeature engineering principles and intuition
Feature Stores ExplainedTectonWhat feature stores solve and when you need one
Uber MichelangeloUber EngineeringProduction ML platform origin story
Feast Feature Store TutorialFeastPoint-in-time joins in practice

Interview Questions and Answers

Q1: What is training-serving skew and what are its root causes?

Training-serving skew is a systematic difference between the feature values computed during model training and the feature values computed at serving time for the same underlying input: Δskew=E[ϕtrain(x)ϕserve(x)]\Delta_{\text{skew}} = \mathbb{E}[|\phi_{\text{train}}(x) - \phi_{\text{serve}}(x)|].

Root causes: (1) Different code paths - training in Python/Spark, serving in Java/Go, with subtle implementation differences in null handling, rounding, window definitions, or string normalization. (2) Different data sources - training reads from a warehouse with different refresh schedules or schema versions than the serving database. (3) Time-window bugs - rolling aggregations computed relative to the wrong reference timestamp. (4) Post-event features - features that encode information from after the label timestamp. (5) Missing value handling - training imputes with training-set statistics but serving uses hardcoded values that drift over time.

Detection: log features at serving time, compare distributions using PSI or KS test. Prevention: feature store with a single definition and data source for both training and serving, plus unit tests with golden examples.


Q2: Why is a point-in-time join necessary? What goes wrong without it?

Without a point-in-time join, you join features to training labels using their current values at training time rather than their values at the time of the historical event. This causes temporal leakage: training features contain information from the future that would not have been available at prediction time.

Concrete example: predicting loan default. If you join "current debt-to-income ratio" (as of training time, potentially years after the loan was issued) to loan default labels, you are using the applicant's financial state after the default - not before it. Default often changes financial behavior substantially. Your model learns that "high current debt load predicts past default," which is a causal loop that cannot be exploited at prediction time.

The point-in-time join retrieves, for each (entity, label_timestamp) pair, the latest feature value with feature_timestamp ≤ label_timestamp. This ensures training features represent the information state at decision time, which matches what is available at serving time.


Q3: When should you use embeddings instead of one-hot encoding for categorical features?

Use embeddings when: (1) Cardinality is high (more than ~50 unique values), making one-hot encoding impractical due to sparsity and memory cost. (2) The model is a neural network, which can learn embeddings end-to-end as differentiable parameters. (3) Semantic similarity matters - embeddings can learn that "SUV" and "truck" are similar vehicle categories without explicit engineering. (4) Serving latency matters - a 64-dim embedding lookup is far faster than operating on a 50,000-dim one-hot vector.

Use one-hot when: (1) Cardinality is low (under 50) and you need interpretability. (2) The model is a linear model or tree (which cannot natively use embeddings). (3) You want a simple, debuggable baseline.

For tree-based models with high cardinality (XGBoost, LightGBM), use target encoding or ordinal encoding instead - trees cannot use embeddings natively, and one-hot encoding is wasteful for high-cardinality features.


Q4: Design a feature store for a ride-sharing ETA prediction system.

Entities: trip (trip_id), driver (driver_id), geohash (location grid cell).

Feature views:

  • driver_stats: acceptance rate (7 day), average speed (1 hour, 6 hour), active trip count - updated every 5 minutes via streaming.
  • geohash_traffic: average speed in 1km radius (5 min, 30 min, 1 hour), demand density - updated every 30 seconds via streaming.
  • trip_context: pickup/dropoff geohash, distance estimate - computed at request time, not stored.

Online store: Redis cluster, keyed by (entity_type, entity_id). TTL 30 minutes for geohash features, 1 hour for driver stats. Sub-10ms SLA enforced at the 99th percentile.

Offline store: Parquet on S3, partitioned by date. Retained for 2 years for retraining.

Streaming pipeline: GPS events and trip events flow through Kafka, consumed by a Flink job that computes tumbling window aggregations (5-min, 30-min). Outputs written to Redis online store.

Batch pipeline: daily Spark job computes 7-day driver statistics → offline store → Redis backfill for the less time-sensitive features.

Point-in-time join for training: when constructing training datasets from historical trips, join geohash_traffic and driver_stats features as of the trip request timestamp - not the trip completion timestamp, which would be future leakage.


Q5: Explain hash feature crossing. What are its trade-offs versus direct product crossing?

Hash feature crossing maps the joint value of two categorical features into a fixed-size integer space:

ϕhash(x1,x2)=hash(x1n+x2)modK\phi_{\text{hash}}(x_1, x_2) = \text{hash}(x_1 \cdot n + x_2) \mod K

The result is an integer index in [0,K)[0, K) - serves as an embedding table index (neural networks) or a one-hot position (linear models).

Advantages over direct product: scales to high-cardinality features without storing a vocabulary or creating N1×N2N_1 \times N_2 columns; handles unseen combinations at serving time naturally (they just hash to some bucket); memory usage is fixed at KK regardless of true cardinality.

Trade-offs: (1) Collisions - two different (x1,x2)(x_1, x_2) pairs may hash to the same bucket. Controlled by choosing KK large (by the birthday paradox, for MM unique combinations and KK buckets, collision rate is approximately 1eM2/(2K)1 - e^{-M^2/(2K)}). For M=10,000M = 10,000 unique combinations, setting K=1,000,000K = 1,000,000 gives roughly 5% collision rate. (2) Interpretability - you cannot easily tell which (country, category) pair corresponds to which bucket. (3) Manual selection - you must decide which feature pairs to cross. Models like DCN (Deep and Cross Network) learn this automatically.


Summary

Feature engineering is the most labor-intensive and most impactful part of the ML workflow. The key principles for production-quality feature pipelines:

  1. Use appropriate transforms for each feature type: z-score for Gaussian numericals, log for heavy-tailed distributions, cyclical encoding for temporal periodicity, embeddings for high-cardinality categoricals.

  2. Prevent training-serving skew: define features once, in one place (a feature store), computed from one data source. Test with golden examples across all implementations. Log serving features.

  3. Point-in-time joins are non-negotiable: when building training datasets from historical data, features must reflect the information state at label time - not training time, not current time.

  4. Select features with principled methods: correlation filter for redundancy removal, mutual information for nonlinear relevance, LASSO for sparsity-inducing wrapper selection, SHAP for post-hoc importance attribution.

  5. Use sklearn Pipelines end-to-end: a Pipeline containing preprocessing and model, serialized as a single artifact, is the most reliable way to ensure consistency between training and serving.

The 80/20 of ML is real: 80% of model improvement comes from feature engineering, not model architecture. Get your features right, and the models will follow.

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Feature Engineering Transformations demo on the EngineersOfAI Playground - no code required.

:::

© 2026 EngineersOfAI. All rights reserved.