Feature Engineering - The Highest-Leverage Skill in ML

Reading time: ~35 min | Interview relevance: Critical | Roles: MLE, Data Scientist, AI Eng, MLOps

The Real Interview Moment

You're in a machine learning system design round. The interviewer asks: "Walk me through your feature engineering process for a problem you've worked on." This question sounds soft, but it's one of the most revealing in the entire interview loop.

The weak candidate lists a few transforms: "I one-hot encoded the categorical variables and normalized the numerical ones." The strong candidate tells a story: "We were building a churn prediction model. The raw data had 50 columns, but most predictive power came from features we engineered - the trend in a user's activity over the last 30 days (not just the count), the ratio of support tickets to purchases, and a time-since-last-login feature that captured recency. We tried 200+ features, used mutual information to filter to the top 40, then ran recursive feature elimination with a gradient boosted model to get down to 25. The engineered features improved AUC from 0.72 to 0.89."

That's the difference. Feature engineering is where domain knowledge meets data science, and it remains the single highest-leverage activity in applied ML - even in the age of deep learning. For tabular data (which is most production ML), feature engineering matters more than model selection.

What You Will Master

After reading this page, you will be able to:

Apply numerical transformations: log, Box-Cox, power transforms, binning, standardization, and normalization
Encode categorical variables using one-hot, target, frequency, ordinal, and learned embeddings
Extract features from text (TF-IDF, n-grams), time series (lag features, rolling statistics), and dates
Create interaction features and polynomial features with appropriate complexity control
Apply feature selection methods: filter (mutual information, chi-squared), wrapper (RFE), and embedded (L1, tree importance)
Design a feature engineering pipeline for production, including feature stores
Answer the "tell me about your feature engineering process" interview question with a compelling narrative
Avoid common pitfalls: data leakage, target leakage, and feature-target correlation traps
Reason about when feature engineering matters most vs. when deep learning replaces it

Self-Assessment: Where Are You Now?

Skill Area	1 (Never done it)	3 (Used in projects)	5 (Production expert)	Your Rating
Numerical transforms	Don't know what StandardScaler does	Used log and standardization	Choose transforms based on distribution analysis	___
Categorical encoding	Only know one-hot	Used 2-3 encoding methods	Know when target encoding beats embeddings	___
Text features	Never extracted text features	Used TF-IDF	Designed text feature pipelines with n-grams + embeddings	___
Time features	Never engineered time features	Used day-of-week, month	Created lag features, rolling stats, trend indicators	___
Feature selection	Never done it	Used correlation filtering	Applied filter + wrapper + embedded methods systematically	___
Feature stores	Never heard of one	Know what they are	Used Feast/Tecton/built custom feature store	___
Data leakage	Not sure what it is	Can identify obvious cases	Can detect subtle temporal and target leakage	___

Score interpretation:

7-14: Start here. Feature engineering is the most practical skill in ML.
15-25: Good foundation. Focus on advanced encoding, selection methods, and the leakage section.
26-35: You're ready for senior-level questions. Drill the practice problems and feature store design.

Part 1 - Numerical Feature Transforms

Why Transform Numerical Features?

Raw numerical features often have properties that hurt model performance: skewed distributions, different scales, outliers, and non-linear relationships with the target. Transforms address these issues.

[Figure: Numerical Transform Selection]

Standardization (Z-Score Normalization)

z = (x - mean) / std

Centers data at 0 with unit variance. Essential for algorithms that assume features are on the same scale: linear regression, logistic regression, SVMs, k-NN, PCA, and neural networks.

Not needed for: Tree-based models (decision trees, random forests, XGBoost) - they split on individual features and are invariant to monotonic transforms.
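As a minimal sketch of the z-score formula above, the following plain-Python helpers learn the mean and standard deviation from a training split and reuse them on a test split. The names `fit_standardizer` and `standardize` are illustrative, not from any library; in scikit-learn the equivalent tool is `StandardScaler`.

```python
import statistics

def fit_standardizer(train_values):
    """Learn mean and std from the training split only."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values)      # population std (ddof=0)
    return mean, (std if std > 0 else 1.0)     # guard against constant features

def standardize(values, mean, std):
    """Apply z = (x - mean) / std with previously fitted statistics."""
    return [(x - mean) / std for x in values]

train = [10.0, 20.0, 30.0, 40.0]
test = [25.0, 100.0]

mean, std = fit_standardizer(train)
train_z = standardize(train, mean, std)
test_z = standardize(test, mean, std)          # reuses the training statistics
```

The test values are scaled with the training mean and std, so a test point can legitimately land far outside the range seen in training.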

Log Transform

x' = log(x + 1)   # log1p to handle zeros

Compresses the right tail of skewed distributions. Extremely common for:

Revenue, prices, salaries (power-law distributions)
Word counts, page views, click counts
Any feature spanning multiple orders of magnitude
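A small illustration of the tail compression, using only the standard library. The dollar amounts are made-up example values.

```python
import math

# Same absolute gap ($100) at two very different scales.
low_gap = math.log1p(200) - math.log1p(100)
high_gap = math.log1p(10_100) - math.log1p(10_000)

# Same relative change (a doubling) at two very different scales.
doubling_low = math.log1p(200) - math.log1p(100)
doubling_high = math.log1p(20_000) - math.log1p(10_000)

# log1p maps zero to zero, so zero counts need no special handling.
zero_value = math.log1p(0)
```

After the transform, the $100 gap at the low end is far larger than the same gap at the high end, while the two doublings are nearly equal in size.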

60-Second Answer

"I apply log transforms to right-skewed features like revenue or click counts. The intuition is that the difference between $100 and $200 is more meaningful than the difference between $10,000 and $10,100. Log transform captures this - it converts multiplicative relationships to additive ones, which linear models handle naturally. I use log1p (log(x+1)) to handle zeros. For features with negative values, I use the Yeo-Johnson transform, which generalizes Box-Cox to handle negative inputs."

Box-Cox and Yeo-Johnson Transforms

Box-Cox finds the optimal power transform to make data more Gaussian:

x' = (x^lambda - 1) / lambda    (lambda != 0)
x' = log(x)                      (lambda = 0)

The parameter lambda is estimated from the data. Box-Cox requires positive values.

Yeo-Johnson extends Box-Cox to handle zero and negative values. Preferred in practice because it doesn't require positive inputs.
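The two transforms can be sketched for a single value and a fixed lambda as follows. This is an illustration of the formulas only: it does not estimate lambda, which libraries such as scikit-learn (`PowerTransformer`) or SciPy (`scipy.stats.boxcox`) do by maximum likelihood.

```python
import math

def box_cox(x, lam):
    """Box-Cox for one strictly positive value and a fixed lambda."""
    if x <= 0:
        raise ValueError("Box-Cox requires x > 0")
    return math.log(x) if lam == 0 else (x ** lam - 1) / lam

def yeo_johnson(x, lam):
    """Yeo-Johnson for any real value and a fixed lambda."""
    if x >= 0:
        # Non-negative branch: Box-Cox applied to (x + 1).
        return math.log1p(x) if lam == 0 else ((x + 1) ** lam - 1) / lam
    # Negative branch: mirrored transform with exponent (2 - lambda).
    if lam == 2:
        return -math.log1p(-x)
    return -(((-x + 1) ** (2 - lam)) - 1) / (2 - lam)
```

With lambda = 1 Yeo-Johnson leaves values unchanged and Box-Cox only shifts them by 1; with lambda = 0 both reduce to a log transform for non-negative inputs.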

Binning (Discretization)

Convert continuous features into categorical bins:

Equal-width binning: Divide range into k equal intervals. Simple but sensitive to outliers.
Equal-frequency (quantile) binning: Each bin contains roughly the same number of samples. More robust.
Custom bins: Based on domain knowledge (e.g., age groups: 0-17, 18-24, 25-34, ...).

When binning helps:

Captures non-linear relationships in linear models (for example, being under 18 may have a qualitatively different effect than the difference between age 25 and age 30)
Reduces the impact of outliers
Can improve interpretability

When binning hurts:

Loses information within each bin
Creates arbitrary boundaries
Tree-based models already learn optimal splits - binning is redundant
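To make the outlier sensitivity concrete, here is a rough sketch comparing equal-width and equal-frequency bin edges on a toy sample with one extreme value. The helper names are hypothetical, and the quantile edges use a simple index-based approximation rather than interpolated quantiles.

```python
from bisect import bisect_right

def equal_width_edges(train_values, k):
    """k-1 interior edges that split [min, max] into k equal intervals."""
    lo, hi = min(train_values), max(train_values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]

def quantile_edges(train_values, k):
    """k-1 interior edges taken at evenly spaced ranks of the sorted data."""
    ordered = sorted(train_values)
    n = len(ordered)
    return [ordered[(i * n) // k] for i in range(1, k)]

def assign_bin(x, edges):
    """Bin index in 0..len(edges); values equal to an edge go to the upper bin."""
    return bisect_right(edges, x)

train = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]      # one extreme outlier
ew_bins = [assign_bin(x, equal_width_edges(train, 4)) for x in train]
qt_bins = [assign_bin(x, quantile_edges(train, 4)) for x in train]
```

Equal-width binning puts the nine ordinary values into a single bin because the outlier stretches the range, while the quantile edges spread them across all four bins. As with scalers, the edges should be computed on training data and then reused.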

Common Trap

Never fit your scaler or binner on the test set. Always fit on training data and transform both train and test. This is one of the most common data leakage mistakes in practice. In sklearn, always use fit_transform on the training set and transform on the test set - never fit_transform on the full dataset before splitting.

Handling Missing Values

Strategy	When to Use	Notes
Mean/Median imputation	MCAR (missing completely at random)	Simple, fast; can distort distribution
Mode imputation	Categorical features	Preserves data type
Indicator variable	Missingness itself is informative	Add `feature_is_missing` boolean column
Model-based (KNN, iterative)	MAR (missing at random)	More accurate but slower, risk of leakage
Leave as NaN	Tree-based models	XGBoost/LightGBM handle NaN natively

Interviewer's Perspective

I always ask candidates how they handle missing values. The worst answer is "I dropped all rows with missing values." This throws away data and introduces selection bias. The best answer depends on the mechanism: "I first analyze why values are missing - is missingness random, or correlated with the target? If missingness itself is predictive (e.g., users who don't fill in income are more likely to churn), I add a missing indicator feature. For the imputed value, I use median for skewed numericals and mode for categoricals, always fitting on training data only."
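A minimal sketch of median imputation combined with a missing indicator, fitted on the training split only. The income values and helper names are invented for illustration; scikit-learn's `SimpleImputer` with `add_indicator=True` covers the same idea.

```python
import statistics

def fit_median(train_values):
    """Median of the observed (non-missing) training values."""
    observed = [v for v in train_values if v is not None]
    return statistics.median(observed)

def impute_with_indicator(values, fill_value):
    """Return (imputed values, 0/1 flags marking which entries were missing)."""
    imputed = [fill_value if v is None else v for v in values]
    is_missing = [int(v is None) for v in values]
    return imputed, is_missing

train_income = [40_000, None, 55_000, 1_000_000, None, 60_000]
test_income = [None, 52_000]

median_income = fit_median(train_income)       # learned from training data only
train_filled, train_flag = impute_with_indicator(train_income, median_income)
test_filled, test_flag = impute_with_indicator(test_income, median_income)
```

The median is barely affected by the 1,000,000 outlier, which is why it is usually preferred over the mean for skewed features.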

Part 2 - Categorical Feature Encoding

Encoding Methods Overview

[Figure: Categorical Encoding Selection]

One-Hot Encoding

Create a binary column for each category value:

City	city_NYC	city_LA	city_CHI
NYC	1	0	0
LA	0	1	0
CHI	0	0	1

Pros: No ordinal assumption, works with all model types
Cons: Explodes dimensionality for high-cardinality features (1M users = 1M columns)

Drop one category? For linear models, drop one column to avoid multicollinearity (the "dummy variable trap"). For tree-based models, it doesn't matter.

Label (Ordinal) Encoding

Map each category to an integer: NYC=0, LA=1, CHI=2.

Appropriate when: The variable has a natural order (education: high school < bachelor's < master's < PhD).
Dangerous when: Applied to non-ordinal categories - it implies NYC < LA < CHI, which is meaningless.
Exception: Tree-based models (XGBoost, LightGBM) handle label-encoded categoricals well because they only learn splits, not magnitudes.

Target Encoding (Mean Encoding)

Replace each category with the mean of the target variable for that category:

city_encoded = E[y | city]

Example: If the click-through rate for NYC users is 0.15, all NYC entries get the value 0.15.

The leakage problem: Using the target to compute the encoding creates data leakage. The category's encoded value contains information about the target, which inflates training metrics.

Solutions:

Leave-one-out: For each row, compute the mean excluding that row
K-fold target encoding: Compute encodings using only out-of-fold data (like cross-validation)
Smoothing: Blend category mean with global mean, weighted by category frequency:

encoded = (n * category_mean + m * global_mean) / (n + m)

where n is category count and m is the smoothing parameter.

Instant Rejection

If you mention target encoding without immediately addressing the leakage risk, it signals you've used it without understanding it. Always say: "Target encoding requires careful handling to avoid leakage - I use k-fold encoding where each fold's encoding is computed from the other folds' target values, never from the same data being encoded."
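One way to combine out-of-fold encoding with smoothing is sketched below in plain Python. Folds are assigned by row index modulo `n_folds`, which assumes the rows are already shuffled; the function names and toy data are illustrative. scikit-learn's `TargetEncoder` applies a similar cross-fitting scheme.

```python
from collections import defaultdict

def smoothed_means(categories, targets, m):
    """Per-category (n * mean + m * global_mean) / (n + m), plus the global mean."""
    global_mean = sum(targets) / len(targets)
    sums, counts = defaultdict(float), defaultdict(int)
    for c, y in zip(categories, targets):
        sums[c] += y                      # sums[c] equals n * category_mean
        counts[c] += 1
    table = {c: (sums[c] + m * global_mean) / (counts[c] + m) for c in counts}
    return table, global_mean

def kfold_target_encode(categories, targets, n_folds=3, m=2.0):
    """Encode each row using target statistics from the other folds only."""
    n = len(categories)
    encoded = [0.0] * n
    for fold in range(n_folds):
        held_out = [i for i in range(n) if i % n_folds == fold]
        rest = [i for i in range(n) if i % n_folds != fold]
        table, global_mean = smoothed_means(
            [categories[i] for i in rest], [targets[i] for i in rest], m)
        for i in held_out:
            # Categories unseen in the other folds fall back to their global mean.
            encoded[i] = table.get(categories[i], global_mean)
    return encoded

cities = ["NYC", "NYC", "LA", "NYC", "LA", "CHI"]
clicks = [1, 0, 0, 1, 1, 1]
encoded = kfold_target_encode(cities, clicks)
```

No row's encoding depends on its own label. For test or serving data, a common choice is to fit one smoothed table on the full training set and apply it unchanged.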

Frequency Encoding

Replace each category with its frequency (count or proportion) in the training data:

city_encoded = count(city) / total_count

Pros: No leakage risk, simple, captures frequency patterns
Cons: Two categories with the same frequency get the same encoding (information loss)
When it works well: When frequency is genuinely predictive (e.g., popular products are more likely to be purchased)
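A short sketch of frequency encoding fitted on training data. The helper names and the default of 0.0 for unseen categories are assumptions made for this example.

```python
from collections import Counter

def fit_frequency_encoding(train_categories):
    """Map each category to its share of the training rows."""
    counts = Counter(train_categories)
    total = len(train_categories)
    return {cat: n / total for cat, n in counts.items()}

def frequency_encode(categories, table, default=0.0):
    """Look up each category; unseen ones get the default value."""
    return [table.get(cat, default) for cat in categories]

train_cities = ["NYC", "NYC", "LA", "CHI"]
freq_table = fit_frequency_encoding(train_cities)
encoded = frequency_encode(["NYC", "LA", "SEA"], freq_table)
```

In this toy data LA and CHI both map to 0.25, which shows the collision drawback directly.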

Learned Embeddings

For high-cardinality categoricals (user IDs, product IDs, zip codes), learn a dense vector representation within a neural network:

embedding = nn.Embedding(num_categories, embedding_dim)
# e.g., 1M products -> 64-dimensional vectors

Pros: Captures complex relationships, compact representation, can be pre-trained
Cons: Requires a neural network, needs enough data per category to learn meaningful embeddings
Production use: Large-scale recommendation systems typically use learned embeddings for users and items

Feature Hashing (Hash Encoding)

Hash each category to a fixed-size integer space:

encoded = hash(category) % num_buckets

Pros: Fixed dimensionality (choose num_buckets), handles unseen categories at inference, no need to store vocabulary
Cons: Hash collisions (different categories map to same bucket), not interpretable
When to use: Very high cardinality + limited memory, or when you need to handle unseen categories
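One practical caveat when implementing the hashing formula in Python: the built-in `hash()` of a string is salted per process by default, so bucket assignments would differ between training and serving. The sketch below uses a stable digest from `hashlib` instead; MD5 is an arbitrary choice here (scikit-learn's `FeatureHasher` uses MurmurHash3).

```python
import hashlib

def hash_bucket(category, num_buckets):
    """Map a string category to a stable bucket index in [0, num_buckets)."""
    digest = hashlib.md5(category.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets

bucket_nyc = hash_bucket("NYC", 1000)
bucket_new = hash_bucket("city_never_seen_in_training", 1000)
```

Unseen categories need no special handling because every string maps to some bucket, at the price of possible collisions.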

Encoding Comparison Table

Method	Cardinality	Leakage Risk	Works With	Key Advantage
One-hot	Low (< 20)	None	All models	Simple, no assumptions
Label/Ordinal	Any	None	Trees, ordinal features	Compact, fast
Target encoding	Medium-High	High (must mitigate)	All models	Most predictive for tabular
Frequency	Medium-High	None	All models	Simple, no leakage
Embeddings	Very high	None	Neural networks	Learns complex relationships
Hashing	Very high	None	All models	Fixed dimension, handles unseen

Part 3 - Text, Time, and Interaction Features

Text Features

TF-IDF (Term Frequency-Inverse Document Frequency)

TF-IDF(t, d) = TF(t, d) * IDF(t)
TF(t, d) = count(t in d) / |d|
IDF(t) = log(N / DF(t))

High TF-IDF means the term is frequent in this document but rare across all documents - it's distinctive.
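The formulas above translate almost line for line into plain Python. This sketch uses whitespace tokenization and the unsmoothed IDF exactly as written here; scikit-learn's `TfidfVectorizer` uses raw counts, a smoothed IDF, and L2 normalization by default, so its numbers will differ. The three documents are made up.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: score} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # DF(t): number of documents that contain term t at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

docs = ["refund my order", "cancel my order", "love this product"]
scores = tf_idf(docs)
```

In the first document "refund" scores higher than "my" because it appears in only one of the three documents.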

Text feature pipeline for tabular ML:

Basic: TF-IDF on unigrams + bigrams, keep top 500-5000 features
Intermediate: Add character n-grams (captures typos, word fragments)
Advanced: Pre-trained sentence embeddings (Sentence-BERT) as features - often the best approach now

Other text features:

Text length: Number of characters, words, sentences
Special character counts: Exclamation marks, question marks, URLs, mentions
Readability scores: Flesch-Kincaid, Coleman-Liau
Sentiment scores: From a pre-trained sentiment model
Named entity counts: Number of people, organizations, locations mentioned

60-Second Answer

"For text features in tabular ML, I typically start with TF-IDF on unigrams and bigrams - it's simple, fast, and surprisingly effective. I'd also add basic text statistics: length, word count, and any domain-specific pattern counts. For modern approaches, I'd compute sentence embeddings using a pre-trained model like Sentence-BERT and use those as dense features - this captures semantic meaning that TF-IDF misses. The choice between TF-IDF and embeddings depends on the dataset size and whether domain-specific vocabulary matters more than general semantics."

Time Features

Time is one of the richest sources of engineered features. Categories:

Calendar features:

Day of week, hour of day, month, quarter, is_weekend, is_holiday
These capture cyclical patterns (sales spike on weekends, usage drops at night)

Cyclical encoding (important!): Hour 23 and hour 0 are 1 hour apart, but if encoded as integers, they appear 23 apart. Fix with sin/cos encoding:

hour_sin = sin(2 * pi * hour / 24)
hour_cos = cos(2 * pi * hour / 24)
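A quick check of this property, assuming Python 3.8+ for `math.dist`. The helper names are illustrative.

```python
import math

def encode_hour(hour):
    """Map an hour in [0, 24) to a point on the unit circle."""
    angle = 2 * math.pi * hour / 24
    return math.sin(angle), math.cos(angle)

def hour_distance(a, b):
    """Euclidean distance between two hours in the sin/cos encoding."""
    return math.dist(encode_hour(a), encode_hour(b))

wrap_gap = hour_distance(23, 0)       # across midnight
adjacent_gap = hour_distance(0, 1)    # ordinary neighbouring hours
opposite_gap = hour_distance(0, 12)   # opposite sides of the clock
```

In the encoded space 23:00 and 00:00 are exactly as close as 00:00 and 01:00, and hours 12 apart are the farthest from each other.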

Lag features:

value_t-1, value_t-7, value_t-30 - past values of the target or key metrics
Critical for time series forecasting

Rolling statistics:

Rolling mean, median, std, min, max over windows (7-day, 30-day, 90-day)
Captures trends and volatility

Time-since features:

Days since last purchase, hours since last login, time since account creation
Captures recency, a powerful predictor for engagement and churn

Trend features:

Slope of a value over the last N periods
Difference between recent average and longer-term average
Captures acceleration/deceleration in user behavior
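Both trend features can be sketched in a few lines. The slope is an ordinary least-squares fit against the time index; the function names, the 7-value default window, and the epsilon guard are choices made for this example.

```python
def slope(values):
    """Least-squares slope of values against their index 0..n-1 (needs n >= 2)."""
    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    num = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(values))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den

def recent_vs_long_ratio(values, recent=7, eps=1e-9):
    """Average of the last `recent` values divided by the average of all values."""
    long_avg = sum(values) / len(values)
    tail = values[-recent:]
    return (sum(tail) / len(tail)) / (long_avg + eps)

daily_minutes = list(range(30, 0, -1))       # 30 days of steadily falling usage
trend = slope(daily_minutes)
ratio = recent_vs_long_ratio(daily_minutes)
```

A negative slope and a ratio below 1 both indicate that the user's activity is declining.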

Common Trap

Time features are one of the most common sources of data leakage in ML. If you compute a rolling 7-day average that includes future data points (relative to the prediction time), you've leaked the future. Always ensure that lag and rolling features only use data available before the prediction timestamp. In an interview, explicitly state: "All time features are computed using only data available at prediction time."
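A leakage-safe version of lag and rolling features might look like the sketch below, where the value at position t is never allowed into its own features. The helper names and sales figures are made up; in pandas the analogous pattern is to shift before rolling, for example `s.shift(1).rolling(3).mean()`.

```python
def lag(series, k):
    """Value from k steps earlier (1 <= k < len(series)); None where no history exists."""
    return [None] * k + list(series[:-k])

def rolling_mean_past_only(series, window):
    """Mean of the `window` values strictly before each position."""
    out = []
    for t in range(len(series)):
        past = series[max(0, t - window):t]      # slice excludes series[t] itself
        out.append(sum(past) / window if len(past) == window else None)
    return out

sales = [10, 12, 11, 15, 14, 18]
lag_1 = lag(sales, 1)
roll_3 = rolling_mean_past_only(sales, 3)
```

The first `window` positions have no complete history and are left as None rather than filled with partial or future data.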

Interaction Features

Create new features from combinations of existing ones:

Multiplicative interactions:

feature_new = feature_A * feature_B

Example: total_spend = price * quantity. (price_per_sqft = price / square_feet, by contrast, is a ratio feature.)

Polynomial features:

[x1, x2] -> [x1, x2, x1^2, x1*x2, x2^2]

Ratio features:

support_ticket_ratio = support_tickets / total_orders
engagement_rate = clicks / impressions

When interactions help:

Linear models cannot learn interactions automatically - you must provide them
Even tree-based models benefit from well-chosen ratio features (the tree would need multiple splits to approximate a ratio)

When to be careful:

Polynomial features explode combinatorially - use domain knowledge to select meaningful pairs
Ratio features can create division-by-zero or infinity - always add a small constant or handle edge cases
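The sketch below shows a guarded ratio and a degree-2 polynomial expansion, and counts how quickly the expansion grows. The function names and the default of 0.0 for a zero denominator are assumptions; depending on the feature, a missing value or a smoothed ratio may be a better fallback.

```python
from itertools import combinations_with_replacement

def safe_ratio(numerator, denominator, default=0.0):
    """Ratio feature that returns a default instead of dividing by zero."""
    return numerator / denominator if denominator else default

def degree_2_terms(row):
    """[x1, x2, ...] -> originals plus all squares and pairwise products."""
    pairs = combinations_with_replacement(range(len(row)), 2)
    return list(row) + [row[i] * row[j] for i, j in pairs]

expanded = degree_2_terms([2, 3])                # [x1, x2, x1^2, x1*x2, x2^2]
n_terms_for_100 = len(degree_2_terms([1] * 100))
ticket_ratio = safe_ratio(3, 0)                  # user with no orders yet
```

With 100 input features the degree-2 expansion already produces 5,150 columns, which is why selecting pairs by domain knowledge usually beats expanding everything.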

Part 4 - Feature Selection

Why Feature Selection Matters

More features is not always better:

Curse of dimensionality: More features require exponentially more data to avoid overfitting
Noise features: Irrelevant features add noise that can hurt model performance
Computational cost: More features = slower training and inference
Interpretability: Fewer features make the model easier to understand and debug

The Three Approaches

[Figure: Feature Selection Methods]

Filter Methods (Fast, Model-Independent)

Evaluate each feature independently, without training a model.

Method	For	Formula/Approach	Pros	Cons
Correlation	Numerical features, regression	Pearson/Spearman correlation with target	Fast, intuitive	Only captures linear/monotonic relationships
Mutual Information	Any feature type	MI(X, Y) - measures any statistical dependency	Captures non-linear relationships	Computationally expensive for continuous features
Chi-squared	Categorical features, classification	Chi-squared test of independence	Fast, well-understood	Only for categorical features with categorical target
ANOVA F-test	Numerical features, classification	F-statistic between groups	Fast	Assumes normality and equal variance
Variance threshold	Any numerical	Remove features with near-zero variance	Very fast, removes constants	Doesn't consider relationship with target
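As an illustration of a filter score, here is mutual information for two discrete variables, computed directly from empirical frequencies. It is a sketch for categorical data only; continuous features need binning or a nearest-neighbour estimator such as the one behind scikit-learn's `mutual_info_classif`. The toy columns are invented.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """MI in nats between two discrete sequences of equal length."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )

target      = [0, 0, 1, 1, 0, 0, 1, 1]
informative = ["a", "a", "b", "b", "a", "a", "b", "b"]   # determines the target
noise       = ["u", "v", "u", "v", "u", "v", "u", "v"]   # independent of it

mi_informative = mutual_information(informative, target)
mi_noise = mutual_information(noise, target)
```

The informative column reaches the maximum possible score for a balanced binary target (ln 2), while the independent column scores zero.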

Wrapper Methods (Accurate, Expensive)

Use a model's performance to evaluate feature subsets.

Recursive Feature Elimination (RFE):

Train model on all features
Remove the least important feature (by coefficient or importance)
Retrain and repeat until desired number of features reached
Optionally cross-validate at each step (RFECV)

Forward Selection: Start with zero features. Add the feature that improves performance the most. Repeat.

Backward Elimination: Start with all features. Remove the feature whose removal hurts performance the least. Repeat.

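Cost: RFE that drops one feature per step needs O(n_features) model fits. Forward selection and backward elimination evaluate every remaining candidate at each step, so they need O(n_features^2) fits. Both are expensive for large feature sets.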
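Backward elimination can be sketched generically with a scoring callback, which also makes it easy to count how many model fits the search needs. Here `score_fn` stands in for "train and cross-validate a model on this feature subset"; the toy scoring function and feature names are invented for the example.

```python
def backward_elimination(features, score_fn, n_keep):
    """Greedily drop the feature whose removal leaves the best score."""
    selected = list(features)
    n_fits = 0
    while len(selected) > n_keep:
        candidates = []
        for f in selected:
            subset = [g for g in selected if g != f]
            candidates.append((score_fn(subset), f))   # one model fit per candidate
            n_fits += 1
        _, to_drop = max(candidates)                   # ties broken by feature name
        selected.remove(to_drop)
    return selected, n_fits

# Toy stand-in for a cross-validated score: useful features add signal,
# and every extra feature costs a small penalty.
USEFULNESS = {"recency": 0.10, "trend": 0.08, "noise_a": 0.0, "noise_b": 0.0}

def toy_score(subset):
    return 0.70 + sum(USEFULNESS[f] for f in subset) - 0.001 * len(subset)

kept, n_fits = backward_elimination(list(USEFULNESS), toy_score, n_keep=2)
```

Removing two of four features already takes 4 + 3 = 7 fits, so the number of fits grows roughly quadratically when many features have to be removed.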

Embedded Methods (Best of Both Worlds)

Feature selection happens during model training.

L1 Regularization (Lasso): L1 penalty drives some coefficients exactly to zero, performing automatic feature selection. Features with zero coefficients are eliminated.

Tree-based importance:

Split importance: How often (and how much) each feature reduces impurity across all trees
Permutation importance: Shuffle each feature and measure performance drop. More reliable but slower.
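Permutation importance is simple enough to sketch without any ML library. The "model" below is a hypothetical fixed rule that only looks at the first feature, so the expected outcome is easy to reason about; in practice the model would be trained and the rows would be held-out data. scikit-learn offers this as `sklearn.inspection.permutation_importance`.

```python
import random

def accuracy(model, rows, labels):
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(model, rows, labels, n_features, n_repeats=20, seed=0):
    """Mean accuracy drop when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = []
    for j in range(n_features):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)                      # break the feature-label link
            shuffled = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(model, shuffled, labels))
        importances.append(sum(drops) / n_repeats)
    return importances

def toy_model(row):
    """Hypothetical stand-in for a trained classifier; ignores feature 1."""
    return int(row[0] > 0.5)

rows = [(i / 10, ((i * 7) % 10) / 10) for i in range(10)]   # rows are tuples
labels = [int(r[0] > 0.5) for r in rows]
importances = permutation_importance(toy_model, rows, labels, n_features=2)
```

Shuffling the feature the model relies on lowers accuracy, while shuffling the ignored feature changes nothing, so its importance is exactly zero.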

Interviewer's Perspective

When a candidate says "I used feature importance from XGBoost," I follow up with: "Which type of importance - gain, cover, or weight (split frequency)? And do you know the problems with gain-based importance?" The issue is that gain-based importance, which the scikit-learn wrapper reports by default (plot_importance defaults to weight instead), is biased toward high-cardinality features and can be misleading. Permutation importance is more reliable. Knowing this distinction signals real experience with feature selection.

Feature Selection Pipeline (Practical)

Remove constant/near-constant features (variance threshold)
Remove highly correlated features (keep one from each correlated pair, |r| > 0.95)
Filter by mutual information (remove features with MI < threshold)
Embedded selection with L1 or tree importance (top-k features)
Validate with cross-validation - check that removing features doesn't hurt performance

Company Variation

Google: Uses feature analysis tools (TFX/TFDV) to detect anomalies and compute statistics before feature selection
Meta: Heavy use of feature importance from gradient boosted models for ranking features
Netflix: Feature stores with hundreds of pre-computed features; selection is about choosing the right subset for each model
Startups: Often skip formal selection - use domain knowledge to choose 10-20 features and iterate

Part 5 - Feature Stores and Production Feature Engineering

What Is a Feature Store?

A feature store is a centralized system for computing, storing, and serving ML features. It solves three critical problems:

Train-serve skew: Features computed differently in training (batch Python) vs. serving (real-time API). A feature store ensures identical computation.
Feature reuse: Multiple models use the same features (user click count, item popularity). Without a feature store, each team re-implements them.
Point-in-time correctness: For training, you need the feature values as they existed at prediction time, not today's values. A feature store handles time-travel queries.
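Point-in-time correctness boils down to an "as of" lookup: for each training example, take the latest feature value recorded at or before that example's timestamp. The sketch below shows the idea on a toy history; the data layout and function name are assumptions, not any particular feature store's API, and whether the boundary is inclusive depends on when the value would really have been available.

```python
from bisect import bisect_right

def value_as_of(history, as_of):
    """history: list of (timestamp, value) sorted by timestamp.
    Returns the latest value recorded at or before `as_of`, else None."""
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, as_of)
    return history[idx - 1][1] if idx > 0 else None

# (day, user_7d_click_count) snapshots for one user
click_count_history = [(1, 3), (5, 8), (9, 20)]

feature_at_day_6 = value_as_of(click_count_history, 6)    # sees the day-5 snapshot
feature_at_day_0 = value_as_of(click_count_history, 0)    # nothing recorded yet
```

A training example labelled on day 6 gets the day-5 value, not the latest value from day 9, which is what a naive join against today's table would return.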

[Figure: Feature Store Consistency]

Feature Store Components

Component	Purpose	Example
Feature definitions	Code that computes features	"user_7d_click_count = count clicks in last 7 days"
Offline store	Historical feature values for training	Data warehouse (BigQuery, Snowflake)
Online store	Low-latency feature serving for inference	Redis, DynamoDB
Feature registry	Catalog of all features with metadata	Documentation, lineage, ownership
Materialization	Process that computes and stores features	Batch (Spark) + streaming (Flink/Kafka)

Popular Feature Stores

Feature Store	Type	Best For
Feast	Open-source	Startup/mid-size, GCP/AWS
Tecton	Managed SaaS	Enterprise, real-time features
Hopsworks	Open-source + managed	Python-first teams
Databricks Feature Store	Part of Databricks	Teams already on Databricks
Amazon SageMaker Feature Store	AWS managed	AWS-native teams
Vertex AI Feature Store	GCP managed	GCP-native teams

The Feature Engineering Production Pipeline

60-Second Answer

"In production, feature engineering isn't a one-time notebook exercise - it's an ongoing pipeline. I think of it in three layers: (1) Batch features computed daily or hourly from data warehouses - things like user lifetime value, 30-day rolling averages, aggregated statistics. (2) Near-real-time features computed from streaming data - recent click counts, session-level behavior, trending signals. (3) Real-time features computed at request time - user's current location, time of day, device type. These all flow through a feature store that ensures consistency between training and serving. The most important thing is avoiding train-serve skew - if a feature is computed differently at training time vs. serving time, your model's performance in production will differ from offline evaluation."

Part 6 - Data Leakage: The Silent Killer

What Is Data Leakage?

Data leakage occurs when information from outside the training dataset leaks into the model during training, giving artificially high performance that doesn't generalize.

Types of Leakage

1. Target Leakage

A feature contains information that is only available because of the target:

Using "treatment_outcome" to predict "should_treat" - the outcome is known only after treatment
Using "default_flag" to predict "credit_risk" - default is the definition of risk
Using "cancellation_reason" to predict "will_churn" - reason only exists after churn

2. Temporal Leakage

Using future information to predict the past:

Computing a rolling average that includes future data points
Feature engineering on the full dataset before train/test split (target encoding, imputation statistics)
Not respecting temporal ordering in train/validation split

3. Train-Test Contamination

Fitting preprocessing (scaler, imputer, encoder) on the full dataset before splitting
Duplicate records appearing in both train and test (especially after data augmentation)
Information leaking through group membership (same patient in train and test with different visits)

Instant Rejection

If you describe a feature engineering workflow where you apply TF-IDF, target encoding, or any form of imputation to the entire dataset before splitting into train/test, you've committed data leakage. In an interview, this is an immediate red flag. Always say: "I split first, then fit all transformations on the training set only, and apply the fitted transformers to the test set."

How to Detect Leakage

Suspiciously high performance: If your model achieves 99% accuracy on a problem where 90% is state-of-the-art, suspect leakage
Feature importance analysis: If one feature dominates all others by a huge margin, investigate it
Temporal validation: Does performance drop significantly when you use a proper time-based split vs. random split?
Remove and retest: Remove the top feature and retrain - if performance collapses from implausibly high to a realistic level, that feature was probably leaking
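For the temporal validation check, a time-based split can be as simple as the sketch below. The row format (dicts with a `day` field) and the cutoff are invented for the example.

```python
def time_based_split(rows, timestamp_key, cutoff):
    """Train on everything before the cutoff, test on everything from it onward."""
    train = [r for r in rows if r[timestamp_key] < cutoff]
    test = [r for r in rows if r[timestamp_key] >= cutoff]
    return train, test

rows = [{"day": d, "amount": 10 * d} for d in range(1, 9)]
train, test = time_based_split(rows, "day", cutoff=7)

latest_train_day = max(r["day"] for r in train)
earliest_test_day = min(r["day"] for r in test)
```

If a model scores much lower under this split than under a random split of the same rows, the random split was probably letting information from the future into training.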

Practice Problems

Problem 1: The Feature Engineering Narrative

"Tell me about a time you engineered features for a machine learning project. Walk me through your process from raw data to final feature set."

Hint 1 - Direction

Structure your answer as: (1) Problem context, (2) Raw data description, (3) Feature engineering decisions with reasoning, (4) Feature selection, (5) Impact on model performance.

Hint 2 - Insight

The interviewer wants to hear domain knowledge, not just a list of transforms. Explain why you chose each feature transformation. For example: "I log-transformed revenue because it follows a power-law distribution and the model needs to differentiate between $100 and $200 as much as between $10K and $20K."

Hint 3 - Full Solution + Rubric

Example strong answer:

"I built a churn prediction model for a SaaS product. The raw data had user demographics, subscription info, and activity logs.

Step 1: Understanding the data. I started with EDA - plotted distributions, checked missing values, and looked at the target rate (8% churn, moderately imbalanced).

Step 2: Temporal features from activity logs. The most impactful features came from user behavior over time. I created:

Activity trend: slope of daily active minutes over the last 30 days (capturing decline)
Engagement ratio: ratio of last 7 days activity to last 30 days activity (capturing recent change)
Days since last login (recency)
Session count change: this week vs. average of last 4 weeks

I was careful to only use data available before the prediction cutoff date, which itself precedes the churn event - no temporal leakage.

Step 3: Categorical encoding. Plan type (3 values) was one-hot encoded. Industry (200 values) used target encoding with 5-fold cross-validation to prevent leakage.

Step 4: Feature selection. Started with 80 features. Used mutual information to filter to 40, then permutation importance with XGBoost to select the final 25. Cross-validated to ensure no performance loss.

Impact: AUC improved from 0.72 (raw features) to 0.89 (engineered). The activity trend feature alone was worth 7 AUC points."

Scoring Rubric:

Strong Hire: Tells a coherent story with specific numbers, explains why for each decision, addresses leakage, discusses feature selection with validation, quantifies impact
Lean Hire: Describes reasonable features but lacks the "why" reasoning, or doesn't discuss leakage prevention
No Hire: Lists transforms ("I used one-hot encoding and StandardScaler") without context, reasoning, or impact

Problem 2: High-Cardinality Categorical

You have a user_id column with 10 million unique values. You're building a click prediction model. How do you encode this feature?

Hint 1 - Direction

Think about why you'd want to encode user_id at all (it captures user preferences). Then think about which encoding methods can handle 10M unique values without creating a 10M-dimensional feature vector.

Hint 2 - Insight

One-hot encoding is impossible (10M columns). Target encoding risks leakage. The best approaches are learned embeddings (if using a neural network) or aggregation-based features (compute statistics per user and use those instead of the raw ID). Consider whether you have enough data per user.

Hint 3 - Full Solution + Rubric

Approach depends on the model architecture:

Option A: Learned Embeddings (Neural Network)

Create an embedding layer: nn.Embedding(10M, 64) - each user gets a 64-dimensional vector
The embeddings are learned during training
Handles cold-start by having a default embedding for unseen users
This is the standard approach in recommendation systems (Meta, Google, Netflix)

Option B: Aggregated User Features (Tabular Models)

Instead of encoding user_id directly, compute features about each user:
- Historical click-through rate (with smoothing)
- Total impressions, total clicks
- Days since account creation
- Average session length
- Category preferences (click distribution across categories)
This captures the information in user_id without the dimensionality problem

Option C: Feature Hashing

Hash user_id to a fixed-size space (e.g., 1000 buckets)
Loses individual user identity due to collisions
Useful as a baseline or when computational resources are limited

What NOT to do:

One-hot encoding (10M columns = impossible)
Label encoding without a tree model (implies ordinal relationship between users)
Target encoding without extreme care (10M categories = severe leakage risk)

Scoring Rubric:

Strong Hire: Discusses multiple approaches, recommends embeddings for neural nets or aggregated features for tabular models, mentions cold-start handling, discusses data sparsity concerns
Lean Hire: Mentions embeddings or hashing, but doesn't discuss trade-offs or cold-start
No Hire: Suggests one-hot encoding or doesn't recognize why user_id encoding is challenging

Problem 3: Data Leakage Detection

Your fraud detection model achieves 0.999 AUC on the test set. In production, it performs at 0.65 AUC. What went wrong?

Hint 1 - Direction

A massive gap between offline and online performance almost always indicates data leakage or a distribution shift between training data and production data. Think about what could be different.

Hint 2 - Insight

Common causes for this pattern: (1) A feature that's only available after fraud is confirmed (target leakage), (2) random train/test split on temporal data instead of time-based split, (3) duplicate transactions in train and test, (4) a feature that's computed differently in batch (training) vs. real-time (serving).

Hint 3 - Full Solution + Rubric

Investigation checklist (in priority order):

Target leakage: Is any feature computed using information only available after the fraud label is assigned?
- "is_disputed" - only exists because the user reported fraud
- "chargeback_amount" - directly derived from the fraud outcome
- Fix: Remove any feature that wouldn't be available at prediction time
Temporal leakage: Was the train/test split random instead of time-based?
- Random split means future transactions are in training, past in testing
- The model memorizes patterns from the future
- Fix: Time-based split (train on months 1-6, test on months 7-8)
Duplicate/near-duplicate leakage: Are the same transactions (or very similar ones) in both train and test?
- A transaction and its retry/reversal might both appear
- Fix: Deduplicate by transaction_id, or split by user (no user in both sets)
Train-serve skew: Are features computed differently at training time vs. serving time?
- Training: batch SQL computes "user_30d_transaction_count" including all 30 days
- Serving: real-time system only has access to the last 7 days of cached data
- Fix: Use a feature store that ensures consistency
Distribution shift: Did the fraud pattern change between training period and deployment?
- New fraud tactics that didn't exist in training data
- Fix: Regular retraining, monitor feature distributions

Scoring Rubric:

Strong Hire: Systematically investigates all leakage types, checks feature availability at prediction time, mentions train-serve skew as a production-specific issue, proposes a feature store as a fix
Lean Hire: Identifies target or temporal leakage but misses train-serve skew
No Hire: Says "the model overfit" without investigating the specific mechanism

Problem 4: Feature Engineering for Time Series

You're building a demand forecasting model for a retail chain. You have daily sales data for 500 stores over 3 years. What features would you engineer?

Hint 1 - Direction

Think about what drives retail demand: time patterns (day of week, season, holidays), trends (is this store's sales growing or declining?), external factors (weather, promotions), and store-specific characteristics.

Hint 2 - Insight

The key challenge in time series feature engineering is avoiding temporal leakage while capturing enough temporal context. Lag features, rolling statistics, and trend indicators are essential, but they must only use past data.

Hint 3 - Full Solution + Rubric

Feature categories:

1. Calendar features:

Day of week (cyclical encoded), month, quarter, year
is_weekend, is_holiday, is_school_break
Days until/since nearest holiday (captures pre/post holiday effects)
Pay period indicators (1st and 15th of month for paycheck effects)

2. Lag features:

sales_1d_ago, sales_7d_ago, sales_14d_ago, sales_28d_ago, sales_365d_ago
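Same-day-last-week (the 7-day lag) and same-day-last-year capture weekly and yearly seasonality; a 364-day lag (52 weeks) keeps the weekday aligned, whereas a 365-day lag does not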
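The lag features can be sketched with a per-store `shift`. The column names `store_id`, `date`, and `sales` are assumed for illustration, and the sketch assumes exactly one row per store per day with no gaps, because `shift` moves by rows rather than by calendar days.

```python
import pandas as pd

def add_lag_features(df, lags=(1, 7, 14, 28, 365), group_col="store_id",
                     date_col="date", target_col="sales"):
    """Add per-store lagged sales columns (hypothetical column names)."""
    # Sort so that "previous row" means "previous day" within each store.
    out = df.sort_values([group_col, date_col]).copy()
    grouped = out.groupby(group_col)[target_col]
    for lag in lags:
        # Shifting inside the group keeps one store's history from
        # spilling into the next store's first rows.
        out[f"{target_col}_{lag}d_ago"] = grouped.shift(lag)
    return out
```

The first `lag` rows of each store come out as NaN, which tree models can often handle directly; linear models need imputation or row filtering.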

3. Rolling statistics (windows: 7, 14, 28, 90 days):

Rolling mean, median, std, min, max
Rolling quantiles (25th, 75th) - captures distribution changes
Coefficient of variation (std/mean) - captures volatility
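A sketch of the rolling statistics, again assuming hypothetical `store_id`, `date`, and `sales` columns with one row per store per day. The important detail is the `shift(1)` before `rolling`: without it, each window would include the current day's sales, which is the value being predicted.

```python
import pandas as pd

def add_rolling_features(df, windows=(7, 14, 28, 90), group_col="store_id",
                         date_col="date", target_col="sales"):
    """Add per-store rolling statistics computed from past days only."""
    out = df.sort_values([group_col, date_col]).copy()
    grouped = out.groupby(group_col)[target_col]

    def past(series, window):
        # shift(1) drops the current day, so the window ends yesterday.
        return series.shift(1).rolling(window, min_periods=1)

    for w in windows:
        name = f"{target_col}_{w}d"
        out[f"{name}_mean"] = grouped.transform(lambda s: past(s, w).mean())
        out[f"{name}_median"] = grouped.transform(lambda s: past(s, w).median())
        out[f"{name}_std"] = grouped.transform(lambda s: past(s, w).std())
        out[f"{name}_min"] = grouped.transform(lambda s: past(s, w).min())
        out[f"{name}_max"] = grouped.transform(lambda s: past(s, w).max())
        out[f"{name}_q25"] = grouped.transform(lambda s: past(s, w).quantile(0.25))
        out[f"{name}_q75"] = grouped.transform(lambda s: past(s, w).quantile(0.75))
        # Coefficient of variation: volatility relative to the level.
        out[f"{name}_cv"] = out[f"{name}_std"] / out[f"{name}_mean"]
    return out
```

`min_periods=1` lets early rows use whatever history exists; setting it equal to the window size is the stricter choice if partial windows would be misleading.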

4. Trend features:

Slope of sales over last 7/30/90 days
Ratio: last 7-day average / last 30-day average (acceleration/deceleration)
Year-over-year growth rate
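The trend features can be sketched the same way. The function and column names below are hypothetical, the slope is an ordinary least-squares fit over the window, and year-over-year growth is approximated here as the recent short-window average divided by the same average 364 days earlier, which is one reasonable definition among several.

```python
import numpy as np
import pandas as pd

def _slope(y):
    """Least-squares slope of y against 0, 1, ..., len(y) - 1."""
    x = np.arange(len(y), dtype=float)
    xc = x - x.mean()
    return float(xc @ (y - y.mean()) / (xc @ xc))

def trend_features_one_store(sales, short=7, long=30, slope_window=30, year_lag=364):
    """Trend features for one store's daily sales series, using past days only."""
    past = sales.shift(1)  # exclude the current day
    short_mean = past.rolling(short).mean()
    long_mean = past.rolling(long).mean()
    return pd.DataFrame({
        "sales_slope": past.rolling(slope_window).apply(_slope, raw=True),
        # Above 1 means recent sales are running ahead of the longer baseline.
        "sales_short_over_long": short_mean / long_mean,
        "sales_yoy_growth": short_mean / short_mean.shift(year_lag) - 1.0,
    }, index=sales.index)

def add_trend_features(df, group_col="store_id", date_col="date",
                       target_col="sales", **kwargs):
    """Apply trend_features_one_store per store; assumes a unique row index."""
    out = df.sort_values([group_col, date_col]).copy()
    parts = [trend_features_one_store(s, **kwargs)
             for _, s in out.groupby(group_col)[target_col]]
    return out.join(pd.concat(parts))
```

Stores with a zero long-window average produce infinite or undefined ratios, so a production version would need to guard against that.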

5. Store-level features:

Store size, location type (urban/suburban/rural)
Historical average sales (store baseline)
Store ranking within region

6. External features:

Weather (temperature, precipitation - affects foot traffic)
Promotions/discounts (binary or amount)
Competitor activity (if available)
Local events (concerts, sports games near the store)

7. Product-level aggregations:

Sales by category, department
New product launch indicators
Stock-out indicators (if available)

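Critical: No temporal leakage! All lag and rolling features must use data strictly before the prediction date. If you forecast h days ahead, the shortest usable lag is h days, because more recent sales are not yet known at prediction time. Validate with a time-based split, and check that validation performance matches production performance.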
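A time-based validation split for panel data (many stores, one row per store per day) has to cut on dates rather than on rows. The sketch below yields expanding-window folds as boolean masks; the function name and its parameters, including the optional gap between training and validation, are illustrative assumptions.

```python
import pandas as pd

def time_based_splits(dates, n_splits=3, val_days=28, gap_days=0):
    """Yield (train_mask, val_mask) boolean arrays, oldest fold first.

    Every training row is dated strictly before every validation row, and
    gap_days leaves a buffer between the two (for example, the forecast horizon).
    """
    dates = pd.to_datetime(pd.Series(dates)).reset_index(drop=True)
    last = dates.max()
    for k in range(n_splits, 0, -1):
        val_end = last - pd.Timedelta(days=(k - 1) * val_days)
        val_start = val_end - pd.Timedelta(days=val_days - 1)
        train_end = val_start - pd.Timedelta(days=gap_days + 1)
        train_mask = (dates <= train_end).to_numpy()
        val_mask = ((dates >= val_start) & (dates <= val_end)).to_numpy()
        yield train_mask, val_mask
```

Because the masks are built from dates, all stores share the same cut-off, so no store's future can inform another store's past through global features.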

Scoring Rubric:

Strong Hire: Comprehensive feature list across multiple categories, explicitly addresses temporal leakage, includes cyclical encoding, discusses store-level vs. global features, mentions validation strategy
Lean Hire: Good lag and calendar features but missing trend features, external signals, or leakage discussion
No Hire: Only mentions basic features (day of week, month) without lag, rolling, or trend features

Interview Cheat Sheet

Topic	Key Fact	When to Mention
Log transform	Compresses right tail; use log1p for zeros; makes power-law distributions more Gaussian	Skewed numerical features
StandardScaler	z = (x-mean)/std; needed for linear models, SVMs, NNs; NOT needed for trees	Feature preprocessing
One-hot encoding	Binary columns per category; drop one for linear models; only for low cardinality	Categorical encoding
Target encoding	Mean of target per category; compute out-of-fold (k-fold) to prevent target leakage	High-cardinality categoricals
Embeddings	Learned dense vectors; standard for user/item IDs in rec systems	Very high cardinality
TF-IDF	Term frequency * inverse document frequency; captures distinctive words	Text features
Lag features	Past values of target/features; critical for time series; MUST avoid temporal leakage	Time series
Rolling stats	Mean/std/min/max over windows; captures trends and volatility	Time series
Mutual information	Captures any statistical dependency; better than correlation for non-linear relationships	Feature selection
Permutation importance	Shuffle feature, measure performance drop; more reliable than gain importance	Feature selection
Feature stores	Centralized feature computation + serving; prevents train-serve skew	Production ML
Data leakage	Information from future/target leaks into features; causes offline/online gap	Always mention
Cyclical encoding	sin/cos encoding for periodic features (hour, day of week)	Time features

Spaced Repetition Checkpoints

Day 0 - Immediate Recall

List 4 ways to encode a categorical variable with 50 unique values
Explain why you must fit StandardScaler on training data only
Define data leakage in one sentence
Name the three types of feature selection methods

Day 3 - Active Recall

Without notes: When would you use target encoding vs. one-hot encoding vs. embeddings?
Explain the leakage risk of target encoding and how to mitigate it
List 5 features you'd engineer from a timestamp column
What's the difference between gain-based and permutation-based feature importance?

Day 7 - Application

Design a feature engineering pipeline for a churn prediction model. Include numerical transforms, categorical encoding, time features, and feature selection.
Explain to a junior data scientist what a feature store is and why it matters
A model achieves 0.99 AUC offline but 0.70 online. Investigate systematically.

Day 14 - Synthesis

Compare feature engineering approaches for: (a) tabular classification, (b) time series forecasting, (c) recommendation system, (d) NLP classification
Design a feature store architecture for a company with 10 models sharing features
"Tell me about your feature engineering process" - deliver a 3-minute answer

Day 21 - Interview Simulation

You're given a dataset with 500 features. Walk through your feature selection workflow.
The PM asks why the model performs differently in production. Diagnose the train-serve skew.
Design features for a ride-hailing demand prediction system (include spatial, temporal, and contextual features).
