Feature Engineering - The Highest-Leverage Skill in ML
Reading time: ~35 min | Interview relevance: Critical | Roles: MLE, Data Scientist, AI Eng, MLOps
The Real Interview Moment
You're in a machine learning system design round. The interviewer asks: "Walk me through your feature engineering process for a problem you've worked on." This question sounds soft, but it's one of the most revealing in the entire interview loop.
The weak candidate lists a few transforms: "I one-hot encoded the categorical variables and normalized the numerical ones." The strong candidate tells a story: "We were building a churn prediction model. The raw data had 50 columns, but most predictive power came from features we engineered - the trend in a user's activity over the last 30 days (not just the count), the ratio of support tickets to purchases, and a time-since-last-login feature that captured recency. We tried 200+ features, used mutual information to filter to the top 40, then ran recursive feature elimination with a gradient boosted model to get down to 25. The engineered features improved AUC from 0.72 to 0.89."
That's the difference. Feature engineering is where domain knowledge meets data science, and it remains the single highest-leverage activity in applied ML - even in the age of deep learning. For tabular data (which is most production ML), feature engineering matters more than model selection.
What You Will Master
After reading this page, you will be able to:
- Apply numerical transformations: log, Box-Cox, power transforms, binning, standardization, and normalization
- Encode categorical variables using one-hot, target, frequency, ordinal, and learned embeddings
- Extract features from text (TF-IDF, n-grams), time series (lag features, rolling statistics), and dates
- Create interaction features and polynomial features with appropriate complexity control
- Apply feature selection methods: filter (mutual information, chi-squared), wrapper (RFE), and embedded (L1, tree importance)
- Design a feature engineering pipeline for production, including feature stores
- Answer the "tell me about your feature engineering process" interview question with a compelling narrative
- Avoid common pitfalls: data leakage, target leakage, and feature-target correlation traps
- Reason about when feature engineering matters most vs. when deep learning replaces it
Self-Assessment: Where Are You Now?
| Skill Area | 1 (Never done it) | 3 (Used in projects) | 5 (Production expert) | Your Rating |
|---|---|---|---|---|
| Numerical transforms | Don't know what StandardScaler does | Used log and standardization | Choose transforms based on distribution analysis | ___ |
| Categorical encoding | Only know one-hot | Used 2-3 encoding methods | Know when target encoding beats embeddings | ___ |
| Text features | Never extracted text features | Used TF-IDF | Designed text feature pipelines with n-grams + embeddings | ___ |
| Time features | Never engineered time features | Used day-of-week, month | Created lag features, rolling stats, trend indicators | ___ |
| Feature selection | Never done it | Used correlation filtering | Applied filter + wrapper + embedded methods systematically | ___ |
| Feature stores | Never heard of one | Know what they are | Used Feast/Tecton/built custom feature store | ___ |
| Data leakage | Not sure what it is | Can identify obvious cases | Can detect subtle temporal and target leakage | ___ |
Score interpretation:
- 7-14: Start here. Feature engineering is the most practical skill in ML.
- 15-25: Good foundation. Focus on advanced encoding, selection methods, and the leakage section.
- 26-35: You're ready for senior-level questions. Drill the practice problems and feature store design.
Part 1 - Numerical Feature Transforms
Why Transform Numerical Features?
Raw numerical features often have properties that hurt model performance: skewed distributions, different scales, outliers, and non-linear relationships with the target. Transforms address these issues.
Standardization (Z-Score Normalization)
z = (x - mean) / std
Centers data at 0 with unit variance. Essential for algorithms that assume features are on the same scale: linear regression, logistic regression, SVMs, k-NN, PCA, and neural networks.
Not needed for: Tree-based models (decision trees, random forests, XGBoost) - they split on individual features and are invariant to monotonic transforms.
Log Transform
x' = log(x + 1) # log1p to handle zeros
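A quick numerical check of what this transform does to absolute differences. This is a minimal sketch with made-up revenue values, not data from a real project.

```python
import numpy as np

# Illustrative revenue values (hypothetical), including a zero.
revenue = np.array([0.0, 100.0, 200.0, 10_000.0, 10_100.0])
log_revenue = np.log1p(revenue)  # log(x + 1), so zero maps to zero

# The same absolute gap of 100 at two very different scales.
gap_small = log_revenue[2] - log_revenue[1]  # 100 -> 200
gap_large = log_revenue[4] - log_revenue[3]  # 10,000 -> 10,100
print(f"100 -> 200: {gap_small:.3f}, 10,000 -> 10,100: {gap_large:.3f}")
```

After the transform, the doubling from 100 to 200 is a far larger step than the 1% move from 10,000 to 10,100. `np.expm1` inverts the transform if you need values back on the original scale.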
Compresses the right tail of skewed distributions. Extremely common for:
- Revenue, prices, salaries (power-law distributions)
- Word counts, page views, click counts
- Any feature spanning multiple orders of magnitude
"I apply log transforms to right-skewed features like revenue or click counts. The intuition is that the difference between $100 and $200 is more meaningful than the difference between $10,000 and $10,100. Log transform captures this - it converts multiplicative relationships to additive ones, which linear models handle naturally. I use log1p (log(x+1)) to handle zeros. For features with negative values, I use the Yeo-Johnson transform, which generalizes Box-Cox to handle negative inputs."
Box-Cox and Yeo-Johnson Transforms
Box-Cox finds the optimal power transform to make data more Gaussian:
x' = (x^lambda - 1) / lambda (lambda != 0)
x' = log(x) (lambda = 0)
The parameter lambda is estimated from the data. Box-Cox requires positive values.
Yeo-Johnson extends Box-Cox to handle zero and negative values. Preferred in practice because it doesn't require positive inputs.
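One way to try both transforms is scikit-learn's `PowerTransformer`. The sketch below assumes scikit-learn is installed and uses a synthetic skewed feature with some negative values. The `skewness` helper is a hypothetical convenience function, not part of any library.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
# Synthetic right-skewed feature, shifted so that some values are negative.
x = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 1)) - 1.0

def skewness(a):
    """Sample skewness (third standardized moment)."""
    a = np.ravel(a)
    return float(np.mean((a - a.mean()) ** 3) / a.std() ** 3)

# Yeo-Johnson estimates lambda from the data and accepts negative inputs.
yeo = PowerTransformer(method="yeo-johnson")  # standardize=True by default
x_yj = yeo.fit_transform(x)

# Box-Cox rejects the same data because it is not strictly positive.
try:
    PowerTransformer(method="box-cox").fit(x)
    box_cox_failed = False
except ValueError:
    box_cox_failed = True

print("skew before:", skewness(x), "after:", skewness(x_yj))
print("estimated lambda:", yeo.lambdas_, "box-cox failed:", box_cox_failed)
```

As with any fitted transform, lambda should be estimated on the training split only and then reused on validation and test data.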
Binning (Discretization)
Convert continuous features into categorical bins:
- Equal-width binning: Divide range into k equal intervals. Simple but sensitive to outliers.
- Equal-frequency (quantile) binning: Each bin contains roughly the same number of samples. More robust.
- Custom bins: Based on domain knowledge (e.g., age groups: 0-17, 18-24, 25-34, ...).
When binning helps:
- Captures non-linear relationships in linear models (being under 18 might have a qualitatively different effect, while the difference between age 25 and age 30 might barely matter)
- Reduces the impact of outliers
- Can improve interpretability
When binning hurts:
- Loses information within each bin
- Creates arbitrary boundaries
- Tree-based models already learn optimal splits - binning is redundant
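The outlier sensitivity of equal-width binning is easy to see on a toy example. This sketch assumes pandas is available and uses a made-up array with a single extreme value.

```python
import numpy as np
import pandas as pd

# 99 typical values plus one extreme outlier (hypothetical data).
values = np.append(np.arange(1, 100), 10_000)

# labels=False returns the integer bin index for each value.
equal_width = pd.cut(values, bins=4, labels=False)   # equal-width bins
equal_freq = pd.qcut(values, q=4, labels=False)      # quantile bins

width_counts = np.bincount(equal_width, minlength=4)
freq_counts = np.bincount(equal_freq, minlength=4)
print("equal-width counts:", width_counts)
print("equal-frequency counts:", freq_counts)
```

With equal-width bins the outlier stretches the range, so almost every value lands in the first bin. Quantile bins keep the counts balanced. In a real pipeline the bin edges would be learned on training data and reused at inference.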
Never fit your scaler or binner on the test set. Always fit on training data and transform both train and test. This is one of the most common data leakage mistakes in practice. In sklearn, always use fit_transform on the training set and transform on the test set - never fit_transform on the full dataset before splitting.
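A minimal sketch of the fit-on-train, transform-on-test pattern, assuming scikit-learn and a synthetic feature matrix:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.lognormal(mean=3.0, sigma=1.0, size=(200, 2))  # synthetic features

# Split first, so test rows never influence the fitted statistics.
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # mean/std come from train only
X_test_scaled = scaler.transform(X_test)        # reuse the train statistics
```

The test set will generally not have exactly zero mean and unit variance after scaling. That is expected, because its statistics were never used.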
Handling Missing Values
| Strategy | When to Use | Notes |
|---|---|---|
| Mean/Median imputation | MCAR (missing completely at random) | Simple, fast; can distort distribution |
| Mode imputation | Categorical features | Preserves data type |
| Indicator variable | Missingness itself is informative | Add feature_is_missing boolean column |
| Model-based (KNN, iterative) | MAR (missing at random) | More accurate but slower, risk of leakage |
| Leave as NaN | Tree-based models | XGBoost/LightGBM handle NaN natively |
I always ask candidates how they handle missing values. The worst answer is "I dropped all rows with missing values." This throws away data and introduces selection bias. The best answer depends on the mechanism: "I first analyze why values are missing - is missingness random, or correlated with the target? If missingness itself is predictive (e.g., users who don't fill in income are more likely to churn), I add a missing indicator feature. For the imputed value, I use median for skewed numericals and mode for categoricals, always fitting on training data only."
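The indicator-plus-median approach described in that answer might look like the sketch below. The `income` column and the `add_missing_features` helper are hypothetical names chosen for illustration.

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({"income": [40.0, np.nan, 55.0, 300.0, np.nan, 48.0]})
test = pd.DataFrame({"income": [np.nan, 52.0]})

# The fill value is learned from the training rows only.
train_median = train["income"].median()

def add_missing_features(df, fill_value):
    """Add a missingness flag, then impute with a precomputed value."""
    out = df.copy()
    out["income_is_missing"] = out["income"].isna().astype(int)
    out["income"] = out["income"].fillna(fill_value)
    return out

train_features = add_missing_features(train, train_median)
test_features = add_missing_features(test, train_median)
```

The flag has to be created before imputation, otherwise the information about which rows were missing is lost.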
Part 2 - Categorical Feature Encoding
Encoding Methods Overview
One-Hot Encoding
Create a binary column for each category value:
| City | city_NYC | city_LA | city_CHI |
|---|---|---|---|
| NYC | 1 | 0 | 0 |
| LA | 0 | 1 | 0 |
| CHI | 0 | 0 | 1 |
- Pros: No ordinal assumption, works with all model types
- Cons: Explodes dimensionality for high-cardinality features (1M users = 1M columns)
Drop one category? For linear models, drop one column to avoid multicollinearity (the "dummy variable trap"). For tree-based models, it doesn't matter.
Label (Ordinal) Encoding
Map each category to an integer: NYC=0, LA=1, CHI=2.
- Appropriate when: The variable has a natural order (education: high school < bachelor's < master's < PhD).
- Dangerous when: Applied to non-ordinal categories - it implies NYC < LA < CHI, which is meaningless.
- Exception: Tree-based models (XGBoost, LightGBM) handle label-encoded categoricals well because they only learn splits, not magnitudes.
Target Encoding (Mean Encoding)
Replace each category with the mean of the target variable for that category:
city_encoded = E[y | city]
Example: If the click-through rate for NYC users is 0.15, all NYC entries get the value 0.15.
The leakage problem: Using the target to compute the encoding creates data leakage. The category's encoded value contains information about the target, which inflates training metrics.
Solutions:
- Leave-one-out: For each row, compute the mean excluding that row
- K-fold target encoding: Compute encodings using only out-of-fold data (like cross-validation)
- Smoothing: Blend category mean with global mean, weighted by category frequency:
encoded = (n * category_mean + m * global_mean) / (n + m)
where n is category count and m is the smoothing parameter.
If you mention target encoding without immediately addressing the leakage risk, it signals you've used it without understanding it. Always say: "Target encoding requires careful handling to avoid leakage - I use k-fold encoding where each fold's encoding is computed from the other folds' target values, never from the same data being encoded."
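K-fold encoding and smoothing can be combined. The sketch below is one possible implementation, assuming pandas and scikit-learn. The function name `kfold_target_encode` and its defaults are illustrative, and `m` is the smoothing parameter from the formula above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def kfold_target_encode(categories, target, n_splits=5, m=10.0, seed=0):
    """Out-of-fold smoothed target encoding for a single categorical column."""
    categories = pd.Series(categories).reset_index(drop=True)
    target = pd.Series(target, dtype=float).reset_index(drop=True)
    encoded = np.empty(len(categories), dtype=float)

    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(categories):
        fit_y = target.iloc[fit_idx]
        global_mean = fit_y.mean()
        stats = fit_y.groupby(categories.iloc[fit_idx]).agg(["sum", "count"])
        # (n * category_mean + m * global_mean) / (n + m); n * mean equals the sum.
        smoothed = (stats["sum"] + m * global_mean) / (stats["count"] + m)
        # Categories absent from the fitting folds fall back to the global mean.
        encoded[enc_idx] = (
            categories.iloc[enc_idx].map(smoothed).fillna(global_mean).to_numpy()
        )
    return encoded

rng = np.random.default_rng(0)
cities = rng.choice(["NYC", "LA", "CHI"], size=100)
clicked = rng.integers(0, 2, size=100)
city_encoded = kfold_target_encode(cities, clicked)
```

Each row's encoding is computed without that row's own label, which is the property that prevents the leakage. For test or production data, the encoding would instead be fitted once on the full training set.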
Frequency Encoding
Replace each category with its frequency (count or proportion) in the training data:
city_encoded = count(city) / total_count
- Pros: No leakage risk, simple, captures frequency patterns
- Cons: Two categories with the same frequency get the same encoding (information loss)
- When it works well: When frequency is genuinely predictive (e.g., popular products are more likely to be purchased)
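A short pandas sketch of frequency encoding. Mapping unseen categories to 0 is an assumption made here for illustration, and other fallbacks are reasonable.

```python
import pandas as pd

train_city = pd.Series(["NYC", "NYC", "LA", "CHI", "NYC", "LA"])
test_city = pd.Series(["LA", "SF"])  # "SF" never appears in training

# Proportions are learned from the training data only.
city_freq = train_city.value_counts(normalize=True)

train_encoded = train_city.map(city_freq)
test_encoded = test_city.map(city_freq).fillna(0.0)  # unseen category -> 0
```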
Learned Embeddings
For high-cardinality categoricals (user IDs, product IDs, zip codes), learn a dense vector representation within a neural network:
embedding = nn.Embedding(num_categories, embedding_dim)
# e.g., 1M products -> 64-dimensional vectors
- Pros: Captures complex relationships, compact representation, can be pre-trained
- Cons: Requires a neural network, needs enough data per category to learn meaningful embeddings
- Production use: Recommendation systems universally use learned embeddings for users and items
Feature Hashing (Hash Encoding)
Hash each category to a fixed-size integer space:
encoded = hash(category) % num_buckets
- Pros: Fixed dimensionality (choose num_buckets), handles unseen categories at inference, no need to store vocabulary
- Cons: Hash collisions (different categories map to same bucket), not interpretable
- When to use: Very high cardinality + limited memory, or when you need to handle unseen categories
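One practical caveat in Python: the built-in `hash()` is randomized per process for strings, so it is unsuitable when training and serving run in different processes. The sketch below uses a stable digest from `hashlib` instead. The helper name `hash_bucket` is hypothetical.

```python
import hashlib

def hash_bucket(category: str, num_buckets: int) -> int:
    """Map a category string to a stable bucket index in [0, num_buckets)."""
    digest = hashlib.md5(category.encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer, then fold into the buckets.
    return int.from_bytes(digest[:8], "big") % num_buckets

buckets = {city: hash_bucket(city, 1000) for city in ["NYC", "LA", "CHI"]}
unseen_bucket = hash_bucket("a_city_never_seen_in_training", 1000)
```

MD5 is used here only as a fast, deterministic hash, not for security. Any stable hash function would serve the same purpose.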
Encoding Comparison Table
| Method | Cardinality | Leakage Risk | Works With | Key Advantage |
|---|---|---|---|---|
| One-hot | Low (< 20) | None | All models | Simple, no assumptions |
| Label/Ordinal | Any | None | Trees, ordinal features | Compact, fast |
| Target encoding | Medium-High | High (must mitigate) | All models | Most predictive for tabular |
| Frequency | Medium-High | None | All models | Simple, no leakage |
| Embeddings | Very high | None | Neural networks | Learns complex relationships |
| Hashing | Very high | None | All models | Fixed dimension, handles unseen |
Part 3 - Text, Time, and Interaction Features
Text Features
TF-IDF (Term Frequency-Inverse Document Frequency)
TF-IDF(t, d) = TF(t, d) * IDF(t)
TF(t, d) = count(t in d) / |d|
IDF(t) = log(N / DF(t))
High TF-IDF means the term is frequent in this document but rare across all documents - it's distinctive.
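The formulas above can be implemented directly in a few lines. This is a sketch on a toy corpus with whitespace tokenization. Note that library implementations such as scikit-learn's `TfidfVectorizer` use a smoothed IDF and normalize the vectors by default, so their numbers will differ.

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]
tokenized = [doc.split() for doc in docs]
N = len(tokenized)

# DF(t): number of documents that contain the term at least once.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tf_idf(term, doc_tokens):
    """TF-IDF per the formulas above; the term must occur in the corpus."""
    tf = doc_tokens.count(term) / len(doc_tokens)
    idf = math.log(N / doc_freq[term])
    return tf * idf

score_cat = tf_idf("cat", tokenized[0])
score_the = tf_idf("the", tokenized[0])
```

In the first document "the" appears twice and "cat" once, yet "cat" scores higher because it occurs in only one of the three documents.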
Text feature pipeline for tabular ML:
- Basic: TF-IDF on unigrams + bigrams, keep top 500-5000 features
- Intermediate: Add character n-grams (captures typos, word fragments)
- Advanced: Pre-trained sentence embeddings (Sentence-BERT) as features - often the best approach now
Other text features:
- Text length: Number of characters, words, sentences
- Special character counts: Exclamation marks, question marks, URLs, mentions
- Readability scores: Flesch-Kincaid, Coleman-Liau
- Sentiment scores: From a pre-trained sentiment model
- Named entity counts: Number of people, organizations, locations mentioned
"For text features in tabular ML, I typically start with TF-IDF on unigrams and bigrams - it's simple, fast, and surprisingly effective. I'd also add basic text statistics: length, word count, and any domain-specific pattern counts. For modern approaches, I'd compute sentence embeddings using a pre-trained model like Sentence-BERT and use those as dense features - this captures semantic meaning that TF-IDF misses. The choice between TF-IDF and embeddings depends on the dataset size and whether domain-specific vocabulary matters more than general semantics."
Time Features
Time is one of the richest sources of engineered features. Categories:
Calendar features:
- Day of week, hour of day, month, quarter, is_weekend, is_holiday
- These capture cyclical patterns (sales spike on weekends, usage drops at night)
Cyclical encoding (important!): Hour 23 and hour 0 are 1 hour apart, but if encoded as integers, they appear 23 apart. Fix with sin/cos encoding:
hour_sin = sin(2 * pi * hour / 24)
hour_cos = cos(2 * pi * hour / 24)
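A small check that the sin/cos encoding restores the wrap-around. The `encoded_distance` helper is a hypothetical name used only for this illustration.

```python
import numpy as np

hours = np.arange(24)
hour_sin = np.sin(2 * np.pi * hours / 24)
hour_cos = np.cos(2 * np.pi * hours / 24)

def encoded_distance(h1, h2):
    """Euclidean distance between two hours in (sin, cos) space."""
    return float(np.hypot(hour_sin[h1] - hour_sin[h2],
                          hour_cos[h1] - hour_cos[h2]))

print(encoded_distance(23, 0), encoded_distance(0, 1), encoded_distance(0, 12))
```

Hours 23 and 0 end up exactly as close as hours 0 and 1, while hours 0 and 12 are maximally far apart. Both columns are needed: sin alone maps two different hours to the same value.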
Lag features:
- value_t-1, value_t-7, value_t-30 - past values of the target or key metrics
- Critical for time series forecasting
Rolling statistics:
- Rolling mean, median, std, min, max over windows (7-day, 30-day, 90-day)
- Captures trends and volatility
Time-since features:
- Days since last purchase, hours since last login, time since account creation
- Captures recency, a powerful predictor for engagement and churn
Trend features:
- Slope of a value over the last N periods
- Difference between recent average and longer-term average
- Captures acceleration/deceleration in user behavior
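Two of these trend features, sketched with NumPy. The usage numbers are invented, and `trend_slope` is a hypothetical helper that fits a straight line by least squares.

```python
import numpy as np

def trend_slope(values):
    """Least-squares slope of values against their position (0, 1, 2, ...)."""
    values = np.asarray(values, dtype=float)
    t = np.arange(len(values))
    slope, _intercept = np.polyfit(t, values, deg=1)
    return float(slope)

daily_minutes = [60, 55, 52, 48, 40, 35, 30]  # hypothetical declining usage

slope = trend_slope(daily_minutes)
# Recent average relative to the longer-term average (< 1 means slowing down).
recent_vs_long = np.mean(daily_minutes[-3:]) / np.mean(daily_minutes)
```

A negative slope and a ratio below 1 both flag declining activity, which a raw 7-day count would not reveal on its own.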
Time features are the most common source of data leakage in ML. If you compute a rolling 7-day average that includes future data points (relative to the prediction time), you've leaked the future. Always ensure that lag and rolling features only use data available before the prediction timestamp. In an interview, explicitly state: "All time features are computed using only data available at prediction time."
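In pandas, the usual way to keep the current row out of a window is to shift before rolling. The sketch below uses a tiny made-up daily series and assumes the value for a given day is not known when predicting that day.

```python
import pandas as pd

sales = pd.DataFrame(
    {
        "date": pd.date_range("2024-01-01", periods=10, freq="D"),
        "sales": [10, 12, 11, 15, 14, 13, 18, 20, 19, 22],
    }
).set_index("date")

# Lag features: values from 1 and 7 days earlier.
sales["sales_lag_1"] = sales["sales"].shift(1)
sales["sales_lag_7"] = sales["sales"].shift(7)

# Safe: shift first, so the 3-day window ends at yesterday.
sales["rolling_mean_3_safe"] = sales["sales"].shift(1).rolling(window=3).mean()

# Leaky, shown for comparison only: the window includes today's value.
sales["rolling_mean_3_leaky"] = sales["sales"].rolling(window=3).mean()
```

On the fourth day the safe feature averages the first three days only, while the leaky version already includes the value being predicted. With multiple entities (stores, users), the same operations would be applied per group.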
Interaction Features
Create new features from combinations of existing ones:
Multiplicative interactions:
feature_new = feature_A * feature_B
Example: price_per_sqft = price / square_feet (price multiplied by the reciprocal of square_feet)
Polynomial features:
[x1, x2] -> [x1, x2, x1^2, x1*x2, x2^2]
Ratio features:
support_ticket_ratio = support_tickets / total_orders
engagement_rate = clicks / impressions
When interactions help:
- Linear models cannot learn interactions automatically - you must provide them
- Even tree-based models benefit from well-chosen ratio features (the tree would need multiple splits to approximate a ratio)
When to be careful:
- Polynomial features explode combinatorially - use domain knowledge to select meaningful pairs
- Ratio features can create division-by-zero or infinity - always add a small constant or handle edge cases
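One way to handle the zero-denominator case explicitly, instead of adding a small constant, is sketched below. The `safe_ratio` helper and its `fill` default are assumptions for illustration. What a zero denominator should map to depends on the feature.

```python
import numpy as np

def safe_ratio(numerator, denominator, fill=0.0):
    """Element-wise ratio that returns `fill` where the denominator is zero."""
    numerator = np.asarray(numerator, dtype=float)
    denominator = np.asarray(denominator, dtype=float)
    out = np.full(numerator.shape, fill, dtype=float)
    # Positions where the condition is False keep the pre-filled value.
    np.divide(numerator, denominator, out=out, where=denominator != 0)
    return out

clicks = [3, 0, 5]
impressions = [10, 0, 0]
engagement_rate = safe_ratio(clicks, impressions)
```

Pairing the ratio with a "denominator was zero" indicator keeps the model from confusing a true zero rate with a missing one.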
Part 4 - Feature Selection
Why Feature Selection Matters
More features is not always better:
- Curse of dimensionality: More features require exponentially more data to avoid overfitting
- Noise features: Irrelevant features add noise that can hurt model performance
- Computational cost: More features = slower training and inference
- Interpretability: Fewer features make the model easier to understand and debug
The Three Approaches
Filter Methods (Fast, Model-Independent)
Evaluate each feature independently, without training a model.
| Method | For | Formula/Approach | Pros | Cons |
|---|---|---|---|---|
| Correlation | Numerical features, regression | Pearson/Spearman correlation with target | Fast, intuitive | Only captures linear/monotonic relationships |
| Mutual Information | Any feature type | MI(X, Y) - measures any statistical dependency | Captures non-linear relationships | Computationally expensive for continuous features |
| Chi-squared | Categorical features, classification | Chi-squared test of independence | Fast, well-understood | Only for categorical features with categorical target |
| ANOVA F-test | Numerical features, classification | F-statistic between groups | Fast | Assumes normality and equal variance |
| Variance threshold | Any numerical | Remove features with near-zero variance | Very fast, removes constants | Doesn't consider relationship with target |
Wrapper Methods (Accurate, Expensive)
Use a model's performance to evaluate feature subsets.
Recursive Feature Elimination (RFE):
- Train model on all features
- Remove the least important feature (by coefficient or importance)
- Retrain and repeat until desired number of features reached
- Optionally cross-validate at each step (RFECV)
Forward Selection: Start with zero features. Add the feature that improves performance the most. Repeat.
Backward Elimination: Start with all features. Remove the feature whose removal hurts performance the least. Repeat.
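Cost: RFE needs roughly one model fit per eliminated feature, O(n_features * model_training_time). Forward selection and backward elimination evaluate every remaining candidate at each step, so they need O(n_features^2) fits - expensive for large feature sets.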
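A minimal RFE sketch using scikit-learn on synthetic data. Logistic regression is an arbitrary choice of estimator here. Any model that exposes coefficients or feature importances can be used.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic problem: 10 features, 3 of which carry signal.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=3, n_redundant=0, random_state=0
)

# Drop one feature per iteration until 3 remain.
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3, step=1)
selector.fit(X, y)

selected = selector.support_  # boolean mask of the kept features
ranking = selector.ranking_   # 1 = kept; larger = eliminated earlier
```

`RFECV` from the same module wraps this loop in cross-validation and picks the number of features automatically, at a correspondingly higher cost.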
Embedded Methods (Best of Both Worlds)
Feature selection happens during model training.
L1 Regularization (Lasso): L1 penalty drives some coefficients exactly to zero, performing automatic feature selection. Features with zero coefficients are eliminated.
Tree-based importance:
- Split importance: How often (and how much) each feature reduces impurity across all trees
- Permutation importance: Shuffle each feature and measure performance drop. More reliable but slower.
When a candidate says "I used feature importance from XGBoost," I follow up with: "Which type of importance - gain, cover, or frequency? And do you know the problems with the default (gain-based) importance?" The issue is that gain-based importance is biased toward high-cardinality features and can be misleading. Permutation importance is more reliable. Knowing this distinction signals real experience with feature selection.
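Permutation importance is available in scikit-learn for any fitted estimator. The sketch below uses a random forest as a stand-in for XGBoost (an assumption made to keep the example dependency-light) on synthetic data with one informative and one pure-noise feature.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 600
signal = rng.normal(size=n)  # determines the label
noise = rng.normal(size=n)   # unrelated to the label
X = np.column_stack([signal, noise])
y = (signal > 0).astype(int)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0
)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

# Shuffle each column on held-out data and measure the drop in score.
result = permutation_importance(model, X_val, y_val, n_repeats=5, random_state=0)
print(result.importances_mean)
```

Computing it on held-out data matters: importance measured on the training set can reward features the model has merely memorized.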
Feature Selection Pipeline (Practical)
- Remove constant/near-constant features (variance threshold)
- Remove highly correlated features (keep one from each correlated pair, |r| > 0.95)
- Filter by mutual information (remove features with MI < threshold)
- Embedded selection with L1 or tree importance (top-k features)
- Validate with cross-validation - check that removing features doesn't hurt performance
How this looks in practice at different companies:
- Google: Uses feature analysis tools (TFX/TFDV) to detect anomalies and compute statistics before feature selection
- Meta: Heavy use of feature importance from gradient boosted models for ranking features
- Netflix: Feature stores with hundreds of pre-computed features; selection is about choosing the right subset for each model
- Startups: Often skip formal selection - use domain knowledge to choose 10-20 features and iterate
Part 5 - Feature Stores and Production Feature Engineering
What Is a Feature Store?
A feature store is a centralized system for computing, storing, and serving ML features. It solves three critical problems:
- Train-serve skew: Features computed differently in training (batch Python) vs. serving (real-time API). A feature store ensures identical computation.
- Feature reuse: Multiple models use the same features (user click count, item popularity). Without a feature store, each team re-implements them.
- Point-in-time correctness: For training, you need the feature values as they existed at prediction time, not today's values. A feature store handles time-travel queries.
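Point-in-time correctness can be illustrated without a feature store by using an as-of join in pandas. The table and column names below are hypothetical. A real feature store performs an equivalent lookup against its offline store.

```python
import pandas as pd

# Feature values as they were logged over time for one user.
feature_log = pd.DataFrame({
    "user_id": [1, 1, 1],
    "event_time": pd.to_datetime(["2024-01-01", "2024-01-10", "2024-01-20"]),
    "clicks_7d": [3, 8, 1],
})

# Training labels, each tied to the moment a prediction would have been made.
labels = pd.DataFrame({
    "user_id": [1, 1],
    "prediction_time": pd.to_datetime(["2024-01-05", "2024-01-15"]),
    "churned": [0, 1],
})

# For each label, take the latest feature row at or before prediction_time.
training_rows = pd.merge_asof(
    labels.sort_values("prediction_time"),
    feature_log.sort_values("event_time"),
    left_on="prediction_time",
    right_on="event_time",
    by="user_id",
    direction="backward",
)
```

A plain join on `user_id` against the latest feature values would have attached the January 20 value to both labels, leaking information from after each prediction time.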
Feature Store Components
| Component | Purpose | Example |
|---|---|---|
| Feature definitions | Code that computes features | "user_7d_click_count = count clicks in last 7 days" |
| Offline store | Historical feature values for training | Data warehouse (BigQuery, Snowflake) |
| Online store | Low-latency feature serving for inference | Redis, DynamoDB |
| Feature registry | Catalog of all features with metadata | Documentation, lineage, ownership |
| Materialization | Process that computes and stores features | Batch (Spark) + streaming (Flink/Kafka) |
Popular Feature Stores
| Feature Store | Type | Best For |
|---|---|---|
| Feast | Open-source | Startup/mid-size, GCP/AWS |
| Tecton | Managed SaaS | Enterprise, real-time features |
| Hopsworks | Open-source + managed | Python-first teams |
| Databricks Feature Store | Part of Databricks | Teams already on Databricks |
| Amazon SageMaker Feature Store | AWS managed | AWS-native teams |
| Vertex AI Feature Store | GCP managed | GCP-native teams |
The Feature Engineering Production Pipeline
"In production, feature engineering isn't a one-time notebook exercise - it's an ongoing pipeline. I think of it in three layers: (1) Batch features computed daily or hourly from data warehouses - things like user lifetime value, 30-day rolling averages, aggregated statistics. (2) Near-real-time features computed from streaming data - recent click counts, session-level behavior, trending signals. (3) Real-time features computed at request time - user's current location, time of day, device type. These all flow through a feature store that ensures consistency between training and serving. The most important thing is avoiding train-serve skew - if a feature is computed differently at training time vs. serving time, your model's performance in production will differ from offline evaluation."
Part 6 - Data Leakage: The Silent Killer
What Is Data Leakage?
Data leakage occurs when information from outside the training dataset leaks into the model during training, giving artificially high performance that doesn't generalize.
Types of Leakage
1. Target Leakage: A feature contains information that is only available because of the target:
- Using "treatment_outcome" to predict "should_treat" - the outcome is known only after treatment
- Using "default_flag" to predict "credit_risk" - default is the definition of risk
- Using "cancellation_reason" to predict "will_churn" - reason only exists after churn
2. Temporal Leakage: Using future information to predict the past:
- Computing a rolling average that includes future data points
- Feature engineering on the full dataset before train/test split (target encoding, imputation statistics)
- Not respecting temporal ordering in train/validation split
3. Train-Test Contamination
- Fitting preprocessing (scaler, imputer, encoder) on the full dataset before splitting
- Duplicate records appearing in both train and test (especially after data augmentation)
- Information leaking through group membership (same patient in train and test with different visits)
If you describe a feature engineering workflow where you apply TF-IDF, target encoding, or any form of imputation to the entire dataset before splitting into train/test, you've committed data leakage. In an interview, this is an immediate red flag. Always say: "I split first, then fit all transformations on the training set only, and apply the fitted transformers to the test set."
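In scikit-learn, one common way to enforce this is to wrap every preprocessing step and the model in a single pipeline, so that cross-validation refits the transformers inside each training fold. A minimal sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Imputer and scaler are fitted on each training fold only, never on the
# held-out fold, because they are part of the estimator being cross-validated.
pipeline = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    LogisticRegression(max_iter=1000),
)
scores = cross_val_score(pipeline, X, y, cv=5, scoring="roc_auc")
print(scores.mean())
```

The same fitted pipeline object can then be applied to the test set and shipped to production, which also reduces the chance of train-serve skew.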
How to Detect Leakage
- Suspiciously high performance: If your model achieves 99% accuracy on a problem where 90% is state-of-the-art, suspect leakage
- Feature importance analysis: If one feature dominates all others by a huge margin, investigate it
- Temporal validation: Does performance drop significantly when you use a proper time-based split vs. random split?
- Remove and retest: Remove the top feature and retrain - if performance collapses from implausibly high to a realistic level, that feature was probably leaking
Practice Problems
Problem 1: The Feature Engineering Narrative
"Tell me about a time you engineered features for a machine learning project. Walk me through your process from raw data to final feature set."
Hint 1 - Direction
Structure your answer as: (1) Problem context, (2) Raw data description, (3) Feature engineering decisions with reasoning, (4) Feature selection, (5) Impact on model performance.
Hint 2 - Insight
The interviewer wants to hear domain knowledge, not just a list of transforms. Explain why you chose each feature transformation. For example: "I log-transformed revenue because it follows a power-law distribution and the model should treat the jump from $100 to $200 as more significant than the jump from $10K to $10.1K."
Hint 3 - Full Solution + Rubric
Example strong answer:
"I built a churn prediction model for a SaaS product. The raw data had user demographics, subscription info, and activity logs.
Step 1: Understanding the data. I started with EDA - plotted distributions, checked missing values, and looked at the target rate (8% churn, moderately imbalanced).
Step 2: Temporal features from activity logs. The most impactful features came from user behavior over time. I created:
- Activity trend: slope of daily active minutes over the last 30 days (capturing decline)
- Engagement ratio: ratio of last 7 days activity to last 30 days activity (capturing recent change)
- Days since last login (recency)
- Session count change: this week vs. average of last 4 weeks
I was careful to only use data available before the churn date - no temporal leakage.
Step 3: Categorical encoding. Plan type (3 values) was one-hot encoded. Industry (200 values) used target encoding with 5-fold cross-validation to prevent leakage.
Step 4: Feature selection. Started with 80 features. Used mutual information to filter to 40, then permutation importance with XGBoost to select the final 25. Cross-validated to ensure no performance loss.
Impact: AUC improved from 0.72 (raw features) to 0.89 (engineered). The activity trend feature alone was worth 7 AUC points."
Scoring Rubric:
- Strong Hire: Tells a coherent story with specific numbers, explains why for each decision, addresses leakage, discusses feature selection with validation, quantifies impact
- Lean Hire: Describes reasonable features but lacks the "why" reasoning, or doesn't discuss leakage prevention
- No Hire: Lists transforms ("I used one-hot encoding and StandardScaler") without context, reasoning, or impact
Problem 2: High-Cardinality Categorical
You have a user_id column with 10 million unique values. You're building a click prediction model. How do you encode this feature?
Hint 1 - Direction
Think about why you'd want to encode user_id at all (it captures user preferences). Then think about which encoding methods can handle 10M unique values without creating a 10M-dimensional feature vector.
Hint 2 - Insight
One-hot encoding is impossible (10M columns). Target encoding risks leakage. The best approaches are learned embeddings (if using a neural network) or aggregation-based features (compute statistics per user and use those instead of the raw ID). Consider whether you have enough data per user.
Hint 3 - Full Solution + Rubric
Approach depends on the model architecture:
Option A: Learned Embeddings (Neural Network)
- Create an embedding layer: nn.Embedding(10_000_000, 64) - each user gets a 64-dimensional vector
- The embeddings are learned during training
- Handles cold-start by having a default embedding for unseen users
- This is the standard approach in recommendation systems (Meta, Google, Netflix)
Option B: Aggregated User Features (Tabular Models)
- Instead of encoding user_id directly, compute features about each user:
- Historical click-through rate (with smoothing)
- Total impressions, total clicks
- Days since account creation
- Average session length
- Category preferences (click distribution across categories)
- This captures the information in user_id without the dimensionality problem
Option C: Feature Hashing
- Hash user_id to a fixed-size space (e.g., 1000 buckets)
- Loses individual user identity due to collisions
- Useful as a baseline or when computational resources are limited
What NOT to do:
- One-hot encoding (10M columns = impossible)
- Label encoding without a tree model (implies ordinal relationship between users)
- Target encoding without extreme care (10M categories = severe leakage risk)
Scoring Rubric:
- Strong Hire: Discusses multiple approaches, recommends embeddings for neural nets or aggregated features for tabular models, mentions cold-start handling, discusses data sparsity concerns
- Lean Hire: Mentions embeddings or hashing, but doesn't discuss trade-offs or cold-start
- No Hire: Suggests one-hot encoding or doesn't recognize why user_id encoding is challenging
Problem 3: Data Leakage Detection
Your fraud detection model achieves 99.9% AUC on the test set. In production, it performs at 0.65 AUC. What went wrong?
Hint 1 - Direction
A massive gap between offline and online performance almost always indicates data leakage or a distribution shift between training data and production data. Think about what could be different.
Hint 2 - Insight
Common causes for this pattern: (1) A feature that's only available after fraud is confirmed (target leakage), (2) random train/test split on temporal data instead of time-based split, (3) duplicate transactions in train and test, (4) a feature that's computed differently in batch (training) vs. real-time (serving).
Hint 3 - Full Solution + Rubric
Investigation checklist (in priority order):
1. Target leakage: Is any feature computed using information only available after the fraud label is assigned?
   - "is_disputed" - only exists because the user reported fraud
   - "chargeback_amount" - directly derived from the fraud outcome
   - Fix: Remove any feature that wouldn't be available at prediction time
2. Temporal leakage: Was the train/test split random instead of time-based?
   - Random split means future transactions are in training, past in testing
   - The model memorizes patterns from the future
   - Fix: Time-based split (train on months 1-6, test on months 7-8)
3. Duplicate/near-duplicate leakage: Are the same transactions (or very similar ones) in both train and test?
   - A transaction and its retry/reversal might both appear
   - Fix: Deduplicate by transaction_id, or split by user (no user in both sets)
4. Train-serve skew: Are features computed differently at training time vs. serving time?
   - Training: batch SQL computes "user_30d_transaction_count" including all 30 days
   - Serving: real-time system only has access to the last 7 days of cached data
   - Fix: Use a feature store that ensures consistency
5. Distribution shift: Did the fraud pattern change between training period and deployment?
   - New fraud tactics that didn't exist in training data
   - Fix: Regular retraining, monitor feature distributions
Scoring Rubric:
- Strong Hire: Systematically investigates all leakage types, checks feature availability at prediction time, mentions train-serve skew as a production-specific issue, proposes a feature store as a fix
- Lean Hire: Identifies target or temporal leakage but misses train-serve skew
- No Hire: Says "the model overfit" without investigating the specific mechanism
Problem 4: Feature Engineering for Time Series
You're building a demand forecasting model for a retail chain. You have daily sales data for 500 stores over 3 years. What features would you engineer?
Hint 1 - Direction
Think about what drives retail demand: time patterns (day of week, season, holidays), trends (is this store's sales growing or declining?), external factors (weather, promotions), and store-specific characteristics.
Hint 2 - Insight
The key challenge in time series feature engineering is avoiding temporal leakage while capturing enough temporal context. Lag features, rolling statistics, and trend indicators are essential, but they must only use past data.
Hint 3 - Full Solution + Rubric
Feature categories:
1. Calendar features:
- Day of week (cyclical encoded), month, quarter, year
- is_weekend, is_holiday, is_school_break
- Days until/since nearest holiday (captures pre/post holiday effects)
- Pay period indicators (1st and 15th of month for paycheck effects)
2. Lag features:
- sales_1d_ago, sales_7d_ago, sales_14d_ago, sales_28d_ago, sales_365d_ago
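- Same-day-last-week (7-day lag) and same-day-last-year capture weekly and yearly seasonality; note that a 364-day lag (52 weeks) lands on the same weekday, whereas 365 days ago does not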
3. Rolling statistics (windows: 7, 14, 28, 90 days):
- Rolling mean, median, std, min, max
- Rolling quantiles (25th, 75th) - captures distribution changes
- Coefficient of variation (std/mean) - captures volatility
4. Trend features:
- Slope of sales over last 7/30/90 days
- Ratio: last 7-day average / last 30-day average (acceleration/deceleration)
- Year-over-year growth rate
5. Store-level features:
- Store size, location type (urban/suburban/rural)
- Historical average sales (store baseline)
- Store ranking within region
6. External features:
- Weather (temperature, precipitation - affects foot traffic)
- Promotions/discounts (binary or amount)
- Competitor activity (if available)
- Local events (concerts, sports games near the store)
7. Product-level aggregations:
- Sales by category, department
- New product launch indicators
- Stock-out indicators (if available)
Critical: No temporal leakage! All lag and rolling features must use data strictly before the prediction date; for multi-day forecast horizons, that means only data available on the day the forecast is made. Validate by checking that your time-based validation performance matches production performance.
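The leakage rule can be made concrete with a small sketch of lag and rolling features for a single store. It uses plain Python lists to stay self-contained; the `lag_and_rolling_features` helper, the feature names, and the sample sales figures are illustrative assumptions. The key detail is that the features for day `t` are built from `sales[:t]`, which excludes day `t` itself.

```python
from statistics import mean

def lag_and_rolling_features(sales, lags=(1, 7), windows=(7,)):
    """Build features for each day t using only sales strictly before t."""
    rows = []
    for t in range(len(sales)):
        history = sales[:t]  # excludes day t, so the target never leaks in
        row = {"t": t}
        for lag in lags:
            # None when there is not enough history for this lag
            row[f"sales_{lag}d_ago"] = sales[t - lag] if t >= lag else None
        for w in windows:
            window = history[-w:]
            row[f"rolling_mean_{w}d"] = mean(window) if len(window) == w else None
        rows.append(row)
    return rows

# Two weeks of hypothetical daily sales for one store.
sales = [10, 12, 11, 13, 15, 20, 22, 11, 13, 12, 14, 16, 21, 23]
rows = lag_and_rolling_features(sales)

# Leakage check: changing the value on day 7 must not change day 7's features.
tampered = list(sales)
tampered[7] = 999
leak_free = lag_and_rolling_features(tampered)[7] == rows[7]
print(rows[7], "leak-free:", leak_free)
```

With 500 stores the same function would be applied per store so that one store's history never feeds another's features. In pandas, the equivalent pattern is to shift by one day within each store before applying the rolling window.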
Scoring Rubric:
- Strong Hire: Comprehensive feature list across multiple categories, explicitly addresses temporal leakage, includes cyclical encoding, discusses store-level vs. global features, mentions validation strategy
- Lean Hire: Good lag and calendar features but missing trend features, external signals, or leakage discussion
- No Hire: Only mentions basic features (day of week, month) without lag, rolling, or trend features
Interview Cheat Sheet
| Topic | Key Fact | When to Mention |
|---|---|---|
| Log transform | Compresses right tail; use log1p for zeros; makes power-law distributions more Gaussian | Skewed numerical features |
| StandardScaler | z = (x-mean)/std; needed for regularized linear models, SVMs, NNs, and distance-based models; NOT needed for trees | Feature preprocessing |
| One-hot encoding | Binary columns per category; drop one for linear models; only for low cardinality | Categorical encoding |
| Target encoding | Mean of target per category; compute it out-of-fold (k-fold) so a row's own label never enters its encoding | High-cardinality categoricals |
| Embeddings | Learned dense vectors; standard for user/item IDs in rec systems | Very high cardinality |
| TF-IDF | Term frequency * inverse document frequency; captures distinctive words | Text features |
| Lag features | Past values of target/features; critical for time series; MUST avoid temporal leakage | Time series |
| Rolling stats | Mean/std/min/max over windows; captures trends and volatility | Time series |
| Mutual information | Captures any statistical dependency; better than correlation for non-linear | Feature selection |
| Permutation importance | Shuffle feature, measure performance drop; more reliable than gain importance | Feature selection |
| Feature stores | Centralized feature computation + serving; prevents train-serve skew | Production ML |
| Data leakage | Information from future/target leaks into features; causes offline/online gap | Always mention |
| Cyclical encoding | sin/cos encoding for periodic features (hour, day of week) | Time features |
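The cyclical encoding row can be illustrated with a short sketch. The `cyclical_encode` helper and the Monday = 0 convention are assumptions made for the example. Mapping a periodic value onto the unit circle makes the last and first values of the cycle neighbors, which a raw integer encoding does not.

```python
import math

def cyclical_encode(value, period):
    """Map a periodic value (e.g. day of week) to a point on the unit circle."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

# Day of week with Monday = 0 ... Sunday = 6 (assumed convention).
monday = cyclical_encode(0, 7)
tuesday = cyclical_encode(1, 7)
sunday = cyclical_encode(6, 7)

# As raw integers, Sunday (6) looks far from Monday (0).
# On the circle, Sunday-Monday is as close as Monday-Tuesday.
d_sun_mon = math.dist(sunday, monday)
d_mon_tue = math.dist(monday, tuesday)
print(f"Sunday-Monday: {d_sun_mon:.4f}, Monday-Tuesday: {d_mon_tue:.4f}")
```

Both the sine and cosine components are needed: either one alone maps two different points in the cycle to the same value.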
Spaced Repetition Checkpoints
Day 0 - Immediate Recall
- List 4 ways to encode a categorical variable with 50 unique values
- Explain why you must fit StandardScaler on training data only
- Define data leakage in one sentence
- Name the three types of feature selection methods
Day 3 - Active Recall
- Without notes: When would you use target encoding vs. one-hot encoding vs. embeddings?
- Explain the leakage risk of target encoding and how to mitigate it
- List 5 features you'd engineer from a timestamp column
- What's the difference between gain-based and permutation-based feature importance?
Day 7 - Application
- Design a feature engineering pipeline for a churn prediction model. Include numerical transforms, categorical encoding, time features, and feature selection.
- Explain to a junior data scientist what a feature store is and why it matters
- A model achieves 0.99 AUC offline but 0.70 online. Investigate systematically.
Day 14 - Synthesis
- Compare feature engineering approaches for: (a) tabular classification, (b) time series forecasting, (c) recommendation system, (d) NLP classification
- Design a feature store architecture for a company with 10 models sharing features
- "Tell me about your feature engineering process" - deliver a 3-minute answer
Day 21 - Interview Simulation
- You're given a dataset with 500 features. Walk through your feature selection workflow.
- The PM asks why the model performs differently in production. Diagnose the train-serve skew.
- Design features for a ride-hailing demand prediction system (include spatial, temporal, and contextual features).
