Feature Engineering - The Highest-Leverage Skill in ML

Reading time: ~35 min | Interview relevance: Critical | Roles: MLE, Data Scientist, AI Eng, MLOps

The Real Interview Moment

You're in a machine learning system design round. The interviewer asks: "Walk me through your feature engineering process for a problem you've worked on." This question sounds soft, but it's one of the most revealing in the entire interview loop.

The weak candidate lists a few transforms: "I one-hot encoded the categorical variables and normalized the numerical ones." The strong candidate tells a story: "We were building a churn prediction model. The raw data had 50 columns, but most predictive power came from features we engineered - the trend in a user's activity over the last 30 days (not just the count), the ratio of support tickets to purchases, and a time-since-last-login feature that captured recency. We tried 200+ features, used mutual information to filter to the top 40, then ran recursive feature elimination with a gradient boosted model to get down to 25. The engineered features improved AUC from 0.72 to 0.89."

That's the difference. Feature engineering is where domain knowledge meets data science, and it remains the single highest-leverage activity in applied ML - even in the age of deep learning. For tabular data (which is most production ML), feature engineering matters more than model selection.

What You Will Master

After reading this page, you will be able to:

  • Apply numerical transformations: log, Box-Cox, power transforms, binning, standardization, and normalization
  • Encode categorical variables using one-hot, target, frequency, ordinal, and learned embeddings
  • Extract features from text (TF-IDF, n-grams), time series (lag features, rolling statistics), and dates
  • Create interaction features and polynomial features with appropriate complexity control
  • Apply feature selection methods: filter (mutual information, chi-squared), wrapper (RFE), and embedded (L1, tree importance)
  • Design a feature engineering pipeline for production, including feature stores
  • Answer the "tell me about your feature engineering process" interview question with a compelling narrative
  • Avoid common pitfalls: data leakage, target leakage, and feature-target correlation traps
  • Reason about when feature engineering matters most vs. when deep learning replaces it

Self-Assessment: Where Are You Now?

| Skill Area | 1 (Never done it) | 3 (Used in projects) | 5 (Production expert) | Your Rating |
|---|---|---|---|---|
| Numerical transforms | Don't know what StandardScaler does | Used log and standardization | Choose transforms based on distribution analysis | ___ |
| Categorical encoding | Only know one-hot | Used 2-3 encoding methods | Know when target encoding beats embeddings | ___ |
| Text features | Never extracted text features | Used TF-IDF | Designed text feature pipelines with n-grams + embeddings | ___ |
| Time features | Never engineered time features | Used day-of-week, month | Created lag features, rolling stats, trend indicators | ___ |
| Feature selection | Never done it | Used correlation filtering | Applied filter + wrapper + embedded methods systematically | ___ |
| Feature stores | Never heard of one | Know what they are | Used Feast/Tecton or built a custom feature store | ___ |
| Data leakage | Not sure what it is | Can identify obvious cases | Can detect subtle temporal and target leakage | ___ |
Data leakageNot sure what it isCan identify obvious casesCan detect subtle temporal and target leakage___

Score interpretation:

  • 7-14: Start here. Feature engineering is the most practical skill in ML.
  • 15-25: Good foundation. Focus on advanced encoding, selection methods, and the leakage section.
  • 26-35: You're ready for senior-level questions. Drill the practice problems and feature store design.

Part 1 - Numerical Feature Transforms

Why Transform Numerical Features?

Raw numerical features often have properties that hurt model performance: skewed distributions, different scales, outliers, and non-linear relationships with the target. Transforms address these issues.

[Figure: Numerical Transform Selection]

Standardization (Z-Score Normalization)

z = (x - mean) / std

Centers data at 0 with unit variance. Essential for algorithms that assume features are on the same scale: linear regression, logistic regression, SVMs, k-NN, PCA, and neural networks.

Not needed for: Tree-based models (decision trees, random forests, XGBoost) - they split on individual features and are invariant to monotonic transforms.

Log Transform

x' = log(x + 1) # log1p to handle zeros

Compresses the right tail of skewed distributions. Extremely common for:

  • Revenue, prices, salaries (power-law distributions)
  • Word counts, page views, click counts
  • Any feature spanning multiple orders of magnitude
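A quick sketch of the compression effect, using only the standard library. The revenue values are made up for illustration.

```python
import math

# Hypothetical right-skewed revenue values, including a zero.
revenue = [0, 100, 200, 10_000, 10_100]

# log1p(x) = log(1 + x), so a raw value of 0 maps to 0 instead of -infinity.
logged = [math.log1p(x) for x in revenue]

# Both gaps are 100 in raw units, but very different after the transform.
small_value_gap = logged[2] - logged[1]   # 100 -> 200 (a doubling)
large_value_gap = logged[4] - logged[3]   # 10,000 -> 10,100 (a 1% change)
print(round(small_value_gap, 3), round(large_value_gap, 3))
```

After the transform the model sees relative change rather than absolute change, which is usually what matters for money-like and count-like features.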

60-Second Answer

"I apply log transforms to right-skewed features like revenue or click counts. The intuition is that the difference between $100 and $200 is more meaningful than the difference between $10,000 and $10,100. Log transform captures this - it converts multiplicative relationships to additive ones, which linear models handle naturally. I use log1p (log(x+1)) to handle zeros. For features with negative values, I use the Yeo-Johnson transform, which generalizes Box-Cox to handle negative inputs."

Box-Cox and Yeo-Johnson Transforms

Box-Cox finds the optimal power transform to make data more Gaussian:

x' = (x^lambda - 1) / lambda (lambda != 0)
x' = log(x) (lambda = 0)

The parameter lambda is estimated from the data. Box-Cox requires positive values.

Yeo-Johnson extends Box-Cox to handle zero and negative values. Preferred in practice because it doesn't require positive inputs.
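To make the two definitions concrete, here is a minimal sketch of both transforms for a fixed lambda. It does not estimate lambda from the data (that is normally done by maximum likelihood inside a library); the function names and sample values are illustrative assumptions.

```python
import math

def box_cox(x: float, lam: float) -> float:
    """Box-Cox for a fixed lambda; x must be strictly positive."""
    if x <= 0:
        raise ValueError("Box-Cox requires x > 0")
    return math.log(x) if lam == 0 else (x ** lam - 1) / lam

def yeo_johnson(x: float, lam: float) -> float:
    """Yeo-Johnson for a fixed lambda; accepts any real x."""
    if x >= 0:
        # Non-negative branch: Box-Cox applied to (x + 1).
        return math.log1p(x) if lam == 0 else ((x + 1) ** lam - 1) / lam
    # Negative branch: mirrored transform with exponent (2 - lambda).
    if lam == 2:
        return -math.log1p(-x)
    return -(((1 - x) ** (2 - lam) - 1) / (2 - lam))

values = [-3.0, 0.0, 4.0]
transformed = [yeo_johnson(v, lam=0.5) for v in values]
```

With lambda = 1 Yeo-Johnson is the identity, which is a handy sanity check. In sklearn, `PowerTransformer(method="yeo-johnson")` estimates lambda for you and should be fit on training data only.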

Binning (Discretization)

Convert continuous features into categorical bins:

  • Equal-width binning: Divide range into k equal intervals. Simple but sensitive to outliers.
  • Equal-frequency (quantile) binning: Each bin contains roughly the same number of samples. More robust.
  • Custom bins: Based on domain knowledge (e.g., age groups: 0-17, 18-24, 25-34, ...).
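The outlier sensitivity of equal-width binning is easy to see on a toy example. This sketch uses hypothetical helper names and a made-up list with a single extreme value.

```python
import bisect
from collections import Counter

def equal_width_edges(values, k):
    """Interior edges of k equal-width bins spanning [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]

def quantile_edges(values, k):
    """Interior edges chosen so each bin holds roughly the same number of samples."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // k] for i in range(1, k)]

def assign_bin(x, edges):
    """Index of the bin x falls into (a value equal to an edge goes to the upper bin)."""
    return bisect.bisect_right(edges, x)

values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]  # one extreme outlier

width_counts = Counter(assign_bin(v, equal_width_edges(values, 5)) for v in values)
quantile_counts = Counter(assign_bin(v, quantile_edges(values, 5)) for v in values)
print(dict(width_counts))     # nine values share bin 0, the outlier sits alone in bin 4
print(dict(quantile_counts))  # two values in each of the five bins
```

As with any fitted transform, the edges should be computed from training data and then reused unchanged on validation and test data.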

When binning helps:

  • Captures non-linear relationships in linear models (being under 18 may have a qualitatively different effect, while the difference between ages 25 and 30 may be negligible)
  • Reduces the impact of outliers
  • Can improve interpretability

When binning hurts:

  • Loses information within each bin
  • Creates arbitrary boundaries
  • Tree-based models already learn optimal splits - binning is redundant

Common Trap

Never fit your scaler or binner on the test set. Always fit on training data and transform both train and test. This is one of the most common data leakage mistakes in practice. In sklearn, always use fit_transform on the training set and transform on the test set - never fit_transform on the full dataset before splitting.
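The fit/transform split can be sketched without any library. The `Standardizer` class below is an illustrative stand-in for sklearn's StandardScaler, not its implementation, and the numbers are invented.

```python
import statistics

class Standardizer:
    """Minimal z-score scaler with separate fit and transform steps."""

    def fit(self, values):
        # Statistics are learned once, from whatever data is passed here.
        self.mean_ = statistics.fmean(values)
        self.std_ = statistics.pstdev(values)  # population std (ddof = 0)
        return self

    def transform(self, values):
        return [(v - self.mean_) / self.std_ for v in values]

train = [10.0, 12.0, 14.0, 16.0]
test = [13.0, 40.0]

scaler = Standardizer().fit(train)   # fit on the training split only
train_z = scaler.transform(train)
test_z = scaler.transform(test)      # test reuses the training mean and std
```

Note that `test_z` is not centered at 0 and does not need to be. If the scaler had been fit on `train + test`, the extreme test value of 40 would have shifted the mean and inflated the std used for training, which is exactly the leak described above.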

Handling Missing Values

| Strategy | When to Use | Implementation |
|---|---|---|
| Mean/Median imputation | MCAR (missing completely at random) | Simple, fast; can distort distribution |
| Mode imputation | Categorical features | Preserves data type |
| Indicator variable | Missingness itself is informative | Add feature_is_missing boolean column |
| Model-based (KNN, iterative) | MAR (missing at random) | More accurate but slower, risk of leakage |
| Leave as NaN | Tree-based models | XGBoost/LightGBM handle NaN natively |

Interviewer's Perspective

I always ask candidates how they handle missing values. The worst answer is "I dropped all rows with missing values." This throws away data and introduces selection bias. The best answer depends on the mechanism: "I first analyze why values are missing - is missingness random, or correlated with the target? If missingness itself is predictive (e.g., users who don't fill in income are more likely to churn), I add a missing indicator feature. For the imputed value, I use median for skewed numericals and mode for categoricals, always fitting on training data only."
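A minimal sketch of the "median plus missing indicator" answer, assuming missing values are represented as `None`. The helper names and income figures are hypothetical.

```python
import statistics

def fit_median(train_values):
    """Median of the observed (non-missing) training values."""
    return statistics.median(v for v in train_values if v is not None)

def impute_with_indicator(values, fill_value):
    """Return (imputed values, is_missing flags) for one feature column."""
    flags = [int(v is None) for v in values]
    imputed = [fill_value if v is None else v for v in values]
    return imputed, flags

train_income = [40_000, None, 55_000, 1_000_000, None, 48_000]
test_income = [None, 52_000]

median_income = fit_median(train_income)  # learned from training data only
train_imputed, train_missing = impute_with_indicator(train_income, median_income)
test_imputed, test_missing = impute_with_indicator(test_income, median_income)
```

The median (51,500 here) is barely moved by the 1,000,000 outlier, whereas the mean would be pulled far above any typical value. The indicator column preserves the fact that the value was missing, so the model can still learn from missingness.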

Part 2 - Categorical Feature Encoding

Encoding Methods Overview

[Figure: Categorical Encoding Selection]

One-Hot Encoding

Create a binary column for each category value:

| City | city_NYC | city_LA | city_CHI |
|---|---|---|---|
| NYC | 1 | 0 | 0 |
| LA | 0 | 1 | 0 |
| CHI | 0 | 0 | 1 |

Pros: No ordinal assumption, works with all model types

Cons: Explodes dimensionality for high-cardinality features (1M users = 1M columns)

Drop one category? For linear models, drop one column to avoid multicollinearity (the "dummy variable trap"). For tree-based models, it doesn't matter.
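A small sketch of one-hot encoding with and without dropping the first category. The `one_hot` helper is hypothetical, and "SF" stands in for a category that never appeared in training.

```python
def one_hot(values, categories, drop_first=False):
    """One binary column per category; optionally drop the first as the baseline."""
    columns = categories[1:] if drop_first else categories
    return [[int(v == c) for c in columns] for v in values]

categories = ["NYC", "LA", "CHI"]     # vocabulary learned from training data
cities = ["NYC", "LA", "CHI", "SF"]   # "SF" was not seen in training

full = one_hot(cities, categories)
reduced = one_hot(cities, categories, drop_first=True)
print(full)     # [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 0]]
print(reduced)  # [[0, 0], [1, 0], [0, 1], [0, 0]]
```

One side effect worth knowing: with the first column dropped, an unseen category encodes to all zeros and becomes indistinguishable from the baseline category (NYC here).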

Label (Ordinal) Encoding

Map each category to an integer: NYC=0, LA=1, CHI=2.

Appropriate when: The variable has a natural order (education: high school < bachelor's < master's < PhD).

Dangerous when: Applied to non-ordinal categories in linear or distance-based models - it implies NYC < LA < CHI, which is meaningless.

Exception: Tree-based models (XGBoost, LightGBM) usually tolerate label-encoded categoricals because they split on thresholds rather than treating the integer as a magnitude, although an arbitrary ordering can cost extra splits.

Target Encoding (Mean Encoding)

Replace each category with the mean of the target variable for that category:

city_encoded = E[y | city]

Example: If the click-through rate for NYC users is 0.15, all NYC entries get the value 0.15.

The leakage problem: Using the target to compute the encoding creates data leakage. The category's encoded value contains information about the target, which inflates training metrics.

Solutions:

  1. Leave-one-out: For each row, compute the mean excluding that row
  2. K-fold target encoding: Compute encodings using only out-of-fold data (like cross-validation)
  3. Smoothing: Blend category mean with global mean, weighted by category frequency:
encoded = (n * category_mean + m * global_mean) / (n + m)

where n is category count and m is the smoothing parameter.
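Solutions 2 and 3 combine naturally. The sketch below is one possible implementation, not a reference one: folds are assigned by row index modulo k, which assumes the rows are already shuffled, and all names are illustrative.

```python
from collections import defaultdict

def smoothed_means(categories, targets, m):
    """Per-category (n * category_mean + m * global_mean) / (n + m)."""
    global_mean = sum(targets) / len(targets)
    sums, counts = defaultdict(float), defaultdict(int)
    for c, y in zip(categories, targets):
        sums[c] += y
        counts[c] += 1
    # n * category_mean is simply the category's target sum.
    table = {c: (sums[c] + m * global_mean) / (counts[c] + m) for c in counts}
    return table, global_mean

def kfold_target_encode(categories, targets, k=5, m=10.0):
    """Encode each row using target statistics from the other folds only."""
    n = len(categories)
    encoded = [0.0] * n
    for fold in range(k):
        held_out = [i for i in range(n) if i % k == fold]
        rest = [i for i in range(n) if i % k != fold]
        table, global_mean = smoothed_means(
            [categories[i] for i in rest], [targets[i] for i in rest], m
        )
        for i in held_out:
            # A category absent from the other folds falls back to their global mean.
            encoded[i] = table.get(categories[i], global_mean)
    return encoded

cities = ["a", "a", "b", "b"]
clicked = [1, 0, 1, 1]
encoded = kfold_target_encode(cities, clicked, k=2, m=1.0)
```

Row 0 has target 1 but is encoded purely from the other fold, so its own label never enters its feature value. At inference time, the encoding for new rows is computed once from the full training set.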

Instant Rejection

If you mention target encoding without immediately addressing the leakage risk, it signals you've used it without understanding it. Always say: "Target encoding requires careful handling to avoid leakage - I use k-fold encoding where each fold's encoding is computed from the other folds' target values, never from the same data being encoded."

Frequency Encoding

Replace each category with its frequency (count or proportion) in the training data:

city_encoded = count(city) / total_count

Pros: No leakage risk, simple, captures frequency patterns

Cons: Two categories with the same frequency get the same encoding (information loss)

When it works well: When frequency is genuinely predictive (e.g., popular products are more likely to be purchased)
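A minimal sketch, with frequencies learned from the training split. Mapping unseen categories to 0.0 is an assumption made here for illustration, not a rule.

```python
from collections import Counter

def fit_frequency(train_categories):
    """Proportion of training rows belonging to each category."""
    counts = Counter(train_categories)
    total = len(train_categories)
    return {category: n / total for category, n in counts.items()}

train_cities = ["NYC", "NYC", "LA", "CHI"]
freq = fit_frequency(train_cities)

# "SF" never appeared in training, so it falls back to the chosen default.
encoded_test = [freq.get(city, 0.0) for city in ["NYC", "LA", "SF"]]
print(encoded_test)  # [0.5, 0.25, 0.0]
```

LA and CHI both encode to 0.25, which is the information loss mentioned under Cons.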

Learned Embeddings

For high-cardinality categoricals (user IDs, product IDs, zip codes), learn a dense vector representation within a neural network:

import torch.nn as nn

# e.g., 1M products -> 64-dimensional vectors
embedding = nn.Embedding(num_embeddings=1_000_000, embedding_dim=64)

Pros: Captures complex relationships, compact representation, can be pre-trained

Cons: Requires a neural network, needs enough data per category to learn meaningful embeddings

Production use: Large-scale recommendation systems commonly use learned embeddings for users and items

Feature Hashing (Hash Encoding)

Hash each category to a fixed-size integer space:

encoded = hash(category) % num_buckets

Pros: Fixed dimensionality (choose num_buckets), handles unseen categories at inference, no need to store vocabulary

Cons: Hash collisions (different categories map to same bucket), not interpretable

When to use: Very high cardinality + limited memory, or when you need to handle unseen categories
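One practical caveat when implementing this in Python: the built-in `hash()` is randomized per process for strings, so training and serving could disagree on bucket assignments. A sketch using a stable hash from `hashlib` instead (the helper name and IDs are hypothetical):

```python
import hashlib

def hash_bucket(category: str, num_buckets: int) -> int:
    """Map a category string to a stable bucket index in [0, num_buckets)."""
    digest = hashlib.sha256(category.encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer, then reduce to the bucket range.
    return int.from_bytes(digest[:8], "big") % num_buckets

num_buckets = 1000
seen = hash_bucket("user_12345", num_buckets)
unseen = hash_bucket("user_never_seen_in_training", num_buckets)  # no vocabulary lookup needed
```

The same input always lands in the same bucket across processes and machines, which is what makes the encoding safe to share between a training job and a serving system.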

Encoding Comparison Table

| Method | Cardinality | Leakage Risk | Works With | Key Advantage |
|---|---|---|---|---|
| One-hot | Low (< 20) | None | All models | Simple, no assumptions |
| Label/Ordinal | Any | None | Trees, ordinal features | Compact, fast |
| Target encoding | Medium-High | High (must mitigate) | All models | Most predictive for tabular |
| Frequency | Medium-High | None | All models | Simple, no leakage |
| Embeddings | Very high | None | Neural networks | Learns complex relationships |
| Hashing | Very high | None | All models | Fixed dimension, handles unseen |

Part 3 - Text, Time, and Interaction Features

Text Features

TF-IDF (Term Frequency-Inverse Document Frequency)

TF-IDF(t, d) = TF(t, d) * IDF(t)
TF(t, d) = count(t in d) / |d|
IDF(t) = log(N / DF(t))

High TF-IDF means the term is frequent in this document but rare across all documents - it's distinctive.
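The formulas above translate almost line for line into code. This is a bare-bones sketch with whitespace tokenization and made-up support-ticket snippets.

```python
import math
from collections import Counter

def tf_idf(docs):
    """TF-IDF per the formulas above: TF = count / |d|, IDF = log(N / DF)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # DF(t): number of documents that contain term t at least once.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append(
            {t: (c / len(tokens)) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return scores

docs = ["refund my order", "cancel my order", "great product"]
scores = tf_idf(docs)
# In the first document, "refund" (unique to it) outscores "my" (shared with another document).
```

Library implementations often differ in the details. sklearn's `TfidfVectorizer`, for instance, applies a smoothed IDF and L2-normalizes each document vector by default, so its numbers will not match this sketch exactly.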

Text feature pipeline for tabular ML:

  1. Basic: TF-IDF on unigrams + bigrams, keep top 500-5000 features
  2. Intermediate: Add character n-grams (captures typos, word fragments)
  3. Advanced: Pre-trained sentence embeddings (Sentence-BERT) as features - often the best approach now

Other text features:

  • Text length: Number of characters, words, sentences
  • Special character counts: Exclamation marks, question marks, URLs, mentions
  • Readability scores: Flesch-Kincaid, Coleman-Liau
  • Sentiment scores: From a pre-trained sentiment model
  • Named entity counts: Number of people, organizations, locations mentioned

60-Second Answer

"For text features in tabular ML, I typically start with TF-IDF on unigrams and bigrams - it's simple, fast, and surprisingly effective. I'd also add basic text statistics: length, word count, and any domain-specific pattern counts. For modern approaches, I'd compute sentence embeddings using a pre-trained model like Sentence-BERT and use those as dense features - this captures semantic meaning that TF-IDF misses. The choice between TF-IDF and embeddings depends on the dataset size and whether domain-specific vocabulary matters more than general semantics."

Time Features

Time is one of the richest sources of engineered features. Categories:

Calendar features:

  • Day of week, hour of day, month, quarter, is_weekend, is_holiday
  • These capture cyclical patterns (sales spike on weekends, usage drops at night)

Cyclical encoding (important!): Hour 23 and hour 0 are 1 hour apart, but if encoded as integers, they appear 23 apart. Fix with sin/cos encoding:

hour_sin = sin(2 * pi * hour / 24)
hour_cos = cos(2 * pi * hour / 24)
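A quick check that the encoding behaves as claimed: on the unit circle, 23:00 is as close to 00:00 as 00:00 is to 01:00, and noon is as far from midnight as possible. Function names are illustrative.

```python
import math

def encode_hour(hour):
    """Map an hour in [0, 24) to a point on the unit circle."""
    angle = 2 * math.pi * hour / 24
    return math.sin(angle), math.cos(angle)

def hour_distance(a, b):
    """Euclidean distance between two encoded hours."""
    return math.dist(encode_hour(a), encode_hour(b))

wrap_around = hour_distance(23, 0)   # crosses midnight
adjacent = hour_distance(0, 1)       # ordinary neighbouring hours
opposite = hour_distance(0, 12)      # midnight vs. noon
```

Both components are needed. With sine alone, two different hours can share a value (for example, 01:00 and 11:00), and the cosine term disambiguates them.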

Lag features:

  • value_t-1, value_t-7, value_t-30 - past values of the target or key metrics
  • Critical for time series forecasting

Rolling statistics:

  • Rolling mean, median, std, min, max over windows (7-day, 30-day, 90-day)
  • Captures trends and volatility

Time-since features:

  • Days since last purchase, hours since last login, time since account creation
  • Captures recency, a powerful predictor for engagement and churn

Trend features:

  • Slope of a value over the last N periods
  • Difference between recent average and longer-term average
  • Captures acceleration/deceleration in user behavior

Common Trap

Time features are one of the most common sources of data leakage in ML. If you compute a rolling 7-day average that includes future data points (relative to the prediction time), you've leaked the future. Always ensure that lag and rolling features only use data available before the prediction timestamp. In an interview, explicitly state: "All time features are computed using only data available at prediction time."
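A leak-free lag and rolling mean can be sketched in a few lines. The key detail is that the window ends strictly before the current time step. The sales numbers are invented.

```python
def lag(series, k):
    """Value k steps earlier; None where there is not enough history."""
    return [series[t - k] if t >= k else None for t in range(len(series))]

def rolling_mean_past(series, window):
    """Mean of the `window` values strictly before t (the value at t is excluded)."""
    out = []
    for t in range(len(series)):
        past = series[max(0, t - window):t]
        out.append(sum(past) / window if len(past) == window else None)
    return out

daily_sales = [10, 12, 11, 15, 14, 20]
lag_1 = lag(daily_sales, 1)
roll_3 = rolling_mean_past(daily_sales, 3)

# Sanity check for leakage: changing today's value must not change today's feature.
tampered = daily_sales[:-1] + [999]
assert rolling_mean_past(tampered, 3)[-1] == roll_3[-1]
```

In pandas the same idea is usually written as `series.shift(1).rolling(3).mean()`. Calling `rolling(3).mean()` without the shift includes the current row, which leaks the target whenever the series being averaged is the thing you are predicting.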

Interaction Features

Create new features from combinations of existing ones:

Multiplicative interactions:

feature_new = feature_A * feature_B

Example: total_spend = unit_price * quantity

Polynomial features:

[x1, x2] -> [x1, x2, x1^2, x1*x2, x2^2]

Ratio features:

price_per_sqft = price / square_feet
support_ticket_ratio = support_tickets / total_orders
engagement_rate = clicks / impressions

When interactions help:

  • Linear models cannot learn interactions automatically - you must provide them
  • Even tree-based models benefit from well-chosen ratio features (the tree would need multiple splits to approximate a ratio)

When to be careful:

  • Polynomial features explode combinatorially - use domain knowledge to select meaningful pairs
  • Ratio features can create division-by-zero or infinity - always add a small constant or handle edge cases
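Both cautions can be illustrated briefly: a ratio helper that handles a zero denominator explicitly, and a degree-2 expansion whose output size shows how quickly polynomial features grow. The helper names and the default of 0.0 are assumptions for this sketch.

```python
from itertools import combinations_with_replacement

def safe_ratio(numerator, denominator, default=0.0):
    """Ratio that returns `default` instead of dividing by zero."""
    return numerator / denominator if denominator else default

def degree2_features(row):
    """[x1, x2, ...] -> the originals plus every square and pairwise product."""
    pairs = combinations_with_replacement(range(len(row)), 2)
    return list(row) + [row[i] * row[j] for i, j in pairs]

expanded = degree2_features([2.0, 3.0])          # [x1, x2, x1^2, x1*x2, x2^2]
ticket_ratio = safe_ratio(3, 0)                  # user with tickets but no orders yet
n_expanded = len(degree2_features([1.0] * 100))  # size of the expansion for 100 inputs
print(expanded, ticket_ratio, n_expanded)
```

One hundred raw features become 5,150 after a degree-2 expansion, which is why hand-picking a few meaningful pairs usually beats expanding everything. Whether a zero-denominator ratio should default to 0, a sentinel, or NaN depends on what the feature means for your problem.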

Part 4 - Feature Selection

Why Feature Selection Matters

More features is not always better:

  • Curse of dimensionality: More features require exponentially more data to avoid overfitting
  • Noise features: Irrelevant features add noise that can hurt model performance
  • Computational cost: More features = slower training and inference
  • Interpretability: Fewer features make the model easier to understand and debug

The Three Approaches

[Figure: Feature Selection Methods]

Filter Methods (Fast, Model-Independent)

Evaluate each feature independently, without training a model.

| Method | For | Formula/Approach | Pros | Cons |
|---|---|---|---|---|
| Correlation | Numerical features, regression | Pearson/Spearman correlation with target | Fast, intuitive | Only captures linear/monotonic relationships |
| Mutual Information | Any feature type | MI(X, Y) - measures any statistical dependency | Captures non-linear relationships | Computationally expensive for continuous features |
| Chi-squared | Categorical features, classification | Chi-squared test of independence | Fast, well-understood | Only for categorical features with categorical target |
| ANOVA F-test | Numerical features, classification | F-statistic between groups | Fast | Assumes normality and equal variance |
| Variance threshold | Any numerical | Remove features with near-zero variance | Very fast, removes constants | Doesn't consider relationship with target |
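For discrete features, mutual information can be computed directly from counts. This sketch covers only that case; the toy columns are made up, and continuous features would need binning or an estimator such as sklearn's `mutual_info_classif`.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences of equal length."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )

plan = ["free", "free", "pro", "pro"]
noise = ["a", "b", "a", "b"]
churned = [1, 1, 0, 0]

mi_plan = mutual_information(plan, churned)    # plan fully determines the target here
mi_noise = mutual_information(noise, churned)  # statistically independent of the target
```

A feature that fully determines a balanced binary target scores log(2), about 0.693 nats, and an independent feature scores 0. On real data, small positive values also arise by chance, so thresholds are best chosen relative to the scores of known-noise features.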

Wrapper Methods (Accurate, Expensive)

Use a model's performance to evaluate feature subsets.

Recursive Feature Elimination (RFE):

  1. Train model on all features
  2. Remove the least important feature (by coefficient or importance)
  3. Retrain and repeat until desired number of features reached
  4. Optionally cross-validate at each step (RFECV)

Forward Selection: Start with zero features. Add the feature that improves performance the most. Repeat.

Backward Elimination: Start with all features. Remove the feature whose removal hurts performance the least. Repeat.

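Cost: RFE that drops one feature per step needs O(n_features) model fits; forward selection and backward elimination need O(n_features^2) fits, because every step re-evaluates each remaining candidate. Either way, this is expensive for large feature sets.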
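Greedy forward selection is short to write down, and counting evaluations makes the cost visible. In this sketch `score_fn` is a stand-in: in practice it would train a model on the subset and return a cross-validated score, whereas here it just sums hypothetical per-feature weights.

```python
def forward_select(features, score_fn, k):
    """Greedy forward selection; score_fn(subset) returns a score where higher is better."""
    selected, evaluations = [], 0
    remaining = list(features)
    while remaining and len(selected) < k:
        scored = []
        for candidate in remaining:
            scored.append((score_fn(selected + [candidate]), candidate))
            evaluations += 1  # one model fit per candidate in a real setting
        _, best = max(scored)
        selected.append(best)
        remaining.remove(best)
    return selected, evaluations

# Hypothetical usefulness of each feature, standing in for a real validation score.
weights = {"recency": 0.5, "trend": 0.3, "tickets": 0.2, "noise": 0.0}

def toy_score(subset):
    return sum(weights[f] for f in subset)

selected, evaluations = forward_select(list(weights), toy_score, k=2)
```

Selecting 2 of 4 features already costs 4 + 3 = 7 evaluations. With hundreds of features and a slow model, that count is what makes wrapper methods expensive.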

Embedded Methods (Best of Both Worlds)

Feature selection happens during model training.

L1 Regularization (Lasso): L1 penalty drives some coefficients exactly to zero, performing automatic feature selection. Features with zero coefficients are eliminated.

Tree-based importance:

  • Split importance: How often (and how much) each feature reduces impurity across all trees
  • Permutation importance: Shuffle each feature and measure performance drop. More reliable but slower.

Interviewer's Perspective

When a candidate says "I used feature importance from XGBoost," I follow up with: "Which type of importance - gain, cover, or weight (split frequency)? And do you know the problems with gain-based importance?" The issue is that gain-based importance is computed on the training data and tends to favor high-cardinality and continuous features, so it can be misleading. Permutation importance, measured on held-out data, is more reliable. Knowing this distinction signals real experience with feature selection.
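Permutation importance is model-agnostic, so it can be sketched with a plain prediction function. Everything below is illustrative: the "model" is a hard-coded rule, and in practice `rows` and `targets` should be held-out data.

```python
import random

def permutation_importance(predict, rows, targets, metric, n_repeats=10, seed=0):
    """Mean drop in `metric` when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = metric(targets, [predict(r) for r in rows])
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)  # break the link between feature j and the target
            shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - metric(targets, [predict(r) for r in shuffled]))
        importances.append(sum(drops) / n_repeats)
    return importances

def predict(row):
    # Toy "model" that only looks at feature 0.
    return int(row[0] > 0.5)

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

rows = [[0.1, 5.0], [0.9, 1.0], [0.2, 3.0], [0.8, 2.0], [0.3, 4.0], [0.7, 6.0]]
targets = [0, 1, 0, 1, 0, 1]
importances = permutation_importance(predict, rows, targets, accuracy)
```

Feature 1 gets an importance of exactly 0 because the model never reads it, while shuffling feature 0 hurts accuracy. sklearn ships a production version of this idea as `sklearn.inspection.permutation_importance`.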

Feature Selection Pipeline (Practical)

  1. Remove constant/near-constant features (variance threshold)
  2. Remove highly correlated features (keep one from each correlated pair, |r| > 0.95)
  3. Filter by mutual information (remove features with MI < threshold)
  4. Embedded selection with L1 or tree importance (top-k features)
  5. Validate with cross-validation - check that removing features doesn't hurt performance

Company Variation
  • Google: Uses feature analysis tools (TFX/TFDV) to detect anomalies and compute statistics before feature selection
  • Meta: Heavy use of feature importance from gradient boosted models for ranking features
  • Netflix: Feature stores with hundreds of pre-computed features; selection is about choosing the right subset for each model
  • Startups: Often skip formal selection - use domain knowledge to choose 10-20 features and iterate

Part 5 - Feature Stores and Production Feature Engineering

What Is a Feature Store?

A feature store is a centralized system for computing, storing, and serving ML features. It solves three critical problems:

  1. Train-serve skew: Features computed differently in training (batch Python) vs. serving (real-time API). A feature store ensures identical computation.
  2. Feature reuse: Multiple models use the same features (user click count, item popularity). Without a feature store, each team re-implements them.
  3. Point-in-time correctness: For training, you need the feature values as they existed at prediction time, not today's values. A feature store handles time-travel queries.
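Point-in-time correctness boils down to an "as of" lookup: for each training label, fetch the latest feature value recorded at or before the label's timestamp. A minimal sketch with a hypothetical in-memory history (real feature stores do this as a join over the offline store):

```python
import bisect

def value_as_of(history, timestamp):
    """Latest feature value recorded at or before `timestamp`.

    `history` is a list of (timestamp, value) pairs sorted by timestamp.
    """
    times = [t for t, _ in history]
    idx = bisect.bisect_right(times, timestamp)
    return history[idx - 1][1] if idx else None

# Hypothetical snapshots of user_7d_click_count, keyed by day number.
click_count_history = [(1, 3), (5, 8), (9, 2)]

# A label observed on day 6 must see the day-5 value, not the latest one.
feature_at_label_time = value_as_of(click_count_history, 6)
latest_value = click_count_history[-1][1]
```

Training on `latest_value` (2) instead of `feature_at_label_time` (8) would give the model information from day 9 that did not exist when the prediction was made. Whether the boundary should be "at or before" or "strictly before" depends on when the feature actually becomes available in serving.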

[Figure: Feature Store Consistency]

Feature Store Components

| Component | Purpose | Example |
|---|---|---|
| Feature definitions | Code that computes features | "user_7d_click_count = count clicks in last 7 days" |
| Offline store | Historical feature values for training | Data warehouse (BigQuery, Snowflake) |
| Online store | Low-latency feature serving for inference | Redis, DynamoDB |
| Feature registry | Catalog of all features with metadata | Documentation, lineage, ownership |
| Materialization | Process that computes and stores features | Batch (Spark) + streaming (Flink/Kafka) |

| Feature Store | Type | Best For |
|---|---|---|
| Feast | Open-source | Startup/mid-size, GCP/AWS |
| Tecton | Managed SaaS | Enterprise, real-time features |
| Hopsworks | Open-source + managed | Python-first teams |
| Databricks Feature Store | Part of Databricks | Teams already on Databricks |
| Amazon SageMaker Feature Store | AWS managed | AWS-native teams |
| Vertex AI Feature Store | GCP managed | GCP-native teams |

The Feature Engineering Production Pipeline

60-Second Answer

"In production, feature engineering isn't a one-time notebook exercise - it's an ongoing pipeline. I think of it in three layers: (1) Batch features computed daily or hourly from data warehouses - things like user lifetime value, 30-day rolling averages, aggregated statistics. (2) Near-real-time features computed from streaming data - recent click counts, session-level behavior, trending signals. (3) Real-time features computed at request time - user's current location, time of day, device type. These all flow through a feature store that ensures consistency between training and serving. The most important thing is avoiding train-serve skew - if a feature is computed differently at training time vs. serving time, your model's performance in production will differ from offline evaluation."

Part 6 - Data Leakage: The Silent Killer

What Is Data Leakage?

Data leakage occurs when information from outside the training dataset leaks into the model during training, giving artificially high performance that doesn't generalize.

Types of Leakage

1. Target Leakage

A feature contains information that is only available because of the target:

  • Using "treatment_outcome" to predict "should_treat" - the outcome is known only after treatment
  • Using "default_flag" to predict "credit_risk" - default is the definition of risk
  • Using "cancellation_reason" to predict "will_churn" - reason only exists after churn

2. Temporal Leakage

Using future information to predict the past:

  • Computing a rolling average that includes future data points
  • Feature engineering on the full dataset before train/test split (target encoding, imputation statistics)
  • Not respecting temporal ordering in train/validation split

3. Train-Test Contamination

  • Fitting preprocessing (scaler, imputer, encoder) on the full dataset before splitting
  • Duplicate records appearing in both train and test (especially after data augmentation)
  • Information leaking through group membership (same patient in train and test with different visits)

Instant Rejection

If you describe a feature engineering workflow where you apply TF-IDF, target encoding, or any form of imputation to the entire dataset before splitting into train/test, you've committed data leakage. In an interview, this is an immediate red flag. Always say: "I split first, then fit all transformations on the training set only, and apply the fitted transformers to the test set."
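Two split strategies address the temporal and group-membership cases above. This is a minimal sketch over a list of dict rows; the field names `ts` and `group` are assumptions for illustration.

```python
def time_based_split(rows, cutoff):
    """Train on rows strictly before `cutoff`, test on rows at or after it."""
    train = [r for r in rows if r["ts"] < cutoff]
    test = [r for r in rows if r["ts"] >= cutoff]
    return train, test

def group_split(rows, test_groups):
    """Keep every row of a group (e.g. one patient) on the same side of the split."""
    train = [r for r in rows if r["group"] not in test_groups]
    test = [r for r in rows if r["group"] in test_groups]
    return train, test

rows = [
    {"ts": 1, "group": "p1"},
    {"ts": 2, "group": "p2"},
    {"ts": 3, "group": "p1"},
    {"ts": 4, "group": "p3"},
]

time_train, time_test = time_based_split(rows, cutoff=3)
group_train, group_test = group_split(rows, test_groups={"p1"})
```

Either split has to happen before any scaler, imputer, or encoder is fit, so that the fitted statistics come from the training side only.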

How to Detect Leakage

  1. Suspiciously high performance: If your model achieves 99% accuracy on a problem where 90% is state-of-the-art, suspect leakage
  2. Feature importance analysis: If one feature dominates all others by a huge margin, investigate it
  3. Temporal validation: Does performance drop significantly when you use a proper time-based split vs. random split?
  4. Remove and retest: Remove the top feature and retrain - if performance collapses from implausibly high to a realistic level, that feature was likely leaking

Practice Problems

Problem 1: The Feature Engineering Narrative

"Tell me about a time you engineered features for a machine learning project. Walk me through your process from raw data to final feature set."

Hint 1 - Direction

Structure your answer as: (1) Problem context, (2) Raw data description, (3) Feature engineering decisions with reasoning, (4) Feature selection, (5) Impact on model performance.

Hint 2 - Insight

The interviewer wants to hear domain knowledge, not just a list of transforms. Explain why you chose each feature transformation. For example: "I log-transformed revenue because it follows a power-law distribution and the model should treat the gap between $100 and $200 as more significant than the gap between $10K and $10.1K."

Hint 3 - Full Solution + Rubric

Example strong answer:

"I built a churn prediction model for a SaaS product. The raw data had user demographics, subscription info, and activity logs.

Step 1: Understanding the data. I started with EDA - plotted distributions, checked missing values, and looked at the target rate (8% churn, moderately imbalanced).

Step 2: Temporal features from activity logs. The most impactful features came from user behavior over time. I created:

  • Activity trend: slope of daily active minutes over the last 30 days (capturing decline)
  • Engagement ratio: ratio of last 7 days activity to last 30 days activity (capturing recent change)
  • Days since last login (recency)
  • Session count change: this week vs. average of last 4 weeks

I was careful to only use data available before the prediction cutoff date - no temporal leakage.

Step 3: Categorical encoding. Plan type (3 values) was one-hot encoded. Industry (200 values) used target encoding with 5-fold cross-validation to prevent leakage.

Step 4: Feature selection. Started with 80 features. Used mutual information to filter to 40, then permutation importance with XGBoost to select the final 25. Cross-validated to ensure no performance loss.

Impact: AUC improved from 0.72 (raw features) to 0.89 (engineered). The activity trend feature alone was worth 7 AUC points."

Scoring Rubric:

  • Strong Hire: Tells a coherent story with specific numbers, explains why for each decision, addresses leakage, discusses feature selection with validation, quantifies impact
  • Lean Hire: Describes reasonable features but lacks the "why" reasoning, or doesn't discuss leakage prevention
  • No Hire: Lists transforms ("I used one-hot encoding and StandardScaler") without context, reasoning, or impact

Problem 2: High-Cardinality Categorical

You have a user_id column with 10 million unique values. You're building a click prediction model. How do you encode this feature?

Hint 1 - Direction

Think about why you'd want to encode user_id at all (it captures user preferences). Then think about which encoding methods can handle 10M unique values without creating a 10M-dimensional feature vector.

Hint 2 - Insight

One-hot encoding is impossible (10M columns). Target encoding risks leakage. The best approaches are learned embeddings (if using a neural network) or aggregation-based features (compute statistics per user and use those instead of the raw ID). Consider whether you have enough data per user.

Hint 3 - Full Solution + Rubric

Approach depends on the model architecture:

Option A: Learned Embeddings (Neural Network)

  • Create an embedding layer: nn.Embedding(10M, 64) - each user gets a 64-dimensional vector
  • The embeddings are learned during training
  • Handles cold-start by having a default embedding for unseen users
  • This is the standard approach in recommendation systems (Meta, Google, Netflix)

Option B: Aggregated User Features (Tabular Models)

  • Instead of encoding user_id directly, compute features about each user:
    • Historical click-through rate (with smoothing)
    • Total impressions, total clicks
    • Days since account creation
    • Average session length
    • Category preferences (click distribution across categories)
  • This captures the information in user_id without the dimensionality problem

Option C: Feature Hashing

  • Hash user_id to a fixed-size space (e.g., 1000 buckets)
  • Loses individual user identity due to collisions
  • Useful as a baseline or when computational resources are limited

What NOT to do:

  • One-hot encoding (10M columns = impossible)
  • Label encoding without a tree model (implies ordinal relationship between users)
  • Target encoding without extreme care (10M categories = severe leakage risk)

Scoring Rubric:

  • Strong Hire: Discusses multiple approaches, recommends embeddings for neural nets or aggregated features for tabular models, mentions cold-start handling, discusses data sparsity concerns
  • Lean Hire: Mentions embeddings or hashing, but doesn't discuss trade-offs or cold-start
  • No Hire: Suggests one-hot encoding or doesn't recognize why user_id encoding is challenging

Problem 3: Data Leakage Detection

Your fraud detection model achieves 0.999 AUC on the test set. In production, it performs at 0.65 AUC. What went wrong?

Hint 1 - Direction

A massive gap between offline and online performance almost always indicates data leakage or a distribution shift between training data and production data. Think about what could be different.

Hint 2 - Insight

Common causes for this pattern: (1) A feature that's only available after fraud is confirmed (target leakage), (2) random train/test split on temporal data instead of time-based split, (3) duplicate transactions in train and test, (4) a feature that's computed differently in batch (training) vs. real-time (serving).

Hint 3 - Full Solution + Rubric

Investigation checklist (in priority order):

  1. Target leakage: Is any feature computed using information only available after the fraud label is assigned?
    • "is_disputed" - only exists because the user reported fraud
    • "chargeback_amount" - directly derived from the fraud outcome
    • Fix: Remove any feature that wouldn't be available at prediction time
  2. Temporal leakage: Was the train/test split random instead of time-based?
    • Random split means future transactions are in training, past in testing
    • The model memorizes patterns from the future
    • Fix: Time-based split (train on months 1-6, test on months 7-8)
  3. Duplicate/near-duplicate leakage: Are the same transactions (or very similar ones) in both train and test?
    • A transaction and its retry/reversal might both appear
    • Fix: Deduplicate by transaction_id, or split by user (no user in both sets)
  4. Train-serve skew: Are features computed differently at training time vs. serving time?
    • Training: batch SQL computes "user_30d_transaction_count" including all 30 days
    • Serving: real-time system only has access to the last 7 days of cached data
    • Fix: Use a feature store that ensures consistency
  5. Distribution shift: Did the fraud pattern change between training period and deployment?
    • New fraud tactics that didn't exist in training data
    • Fix: Regular retraining, monitor feature distributions

Scoring Rubric:

  • Strong Hire: Systematically investigates all leakage types, checks feature availability at prediction time, mentions train-serve skew as a production-specific issue, proposes a feature store as a fix
  • Lean Hire: Identifies target or temporal leakage but misses train-serve skew
  • No Hire: Says "the model overfit" without investigating the specific mechanism

Problem 4: Feature Engineering for Time Series

You're building a demand forecasting model for a retail chain. You have daily sales data for 500 stores over 3 years. What features would you engineer?

Hint 1 - Direction

Think about what drives retail demand: time patterns (day of week, season, holidays), trends (is this store's sales growing or declining?), external factors (weather, promotions), and store-specific characteristics.

Hint 2 - Insight

The key challenge in time series feature engineering is avoiding temporal leakage while capturing enough temporal context. Lag features, rolling statistics, and trend indicators are essential, but they must only use past data.

Hint 3 - Full Solution + Rubric

Feature categories:

1. Calendar features:

  • Day of week (cyclically encoded), month, quarter, year
  • is_weekend, is_holiday, is_school_break
  • Days until/since nearest holiday (captures pre/post holiday effects)
  • Pay period indicators (1st and 15th of month for paycheck effects)

2. Lag features:

  • sales_1d_ago, sales_7d_ago, sales_14d_ago, sales_28d_ago, sales_365d_ago
  • Same-day-last-week and same-day-last-year lags capture weekly and yearly seasonality (a 364-day lag keeps the weekday aligned; a 365-day lag does not)

3. Rolling statistics (windows: 7, 14, 28, 90 days):

  • Rolling mean, median, std, min, max
  • Rolling quantiles (25th, 75th) - captures distribution changes
  • Coefficient of variation (std/mean) - captures volatility

4. Trend features:

  • Slope of sales over last 7/30/90 days
  • Ratio: last 7-day average / last 30-day average (acceleration/deceleration)
  • Year-over-year growth rate

5. Store-level features:

  • Store size, location type (urban/suburban/rural)
  • Historical average sales (store baseline)
  • Store ranking within region

6. External features:

  • Weather (temperature, precipitation - affects foot traffic)
  • Promotions/discounts (binary or amount)
  • Competitor activity (if available)
  • Local events (concerts, sports games near the store)

7. Product-level aggregations:

  • Sales by category, department
  • New product launch indicators
  • Stock-out indicators (if available)
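The calendar features in category 1 are mostly one-liners once the date column is parsed. Here is a minimal sketch, assuming pandas and NumPy; the function name `add_calendar_features`, the output column names, and the one-entry holiday list are hypothetical choices for illustration.

```python
import numpy as np
import pandas as pd

def add_calendar_features(df, date_col, holidays):
    """Append calendar features derived from `date_col` (a datetime column)."""
    out = df.copy()
    dates = out[date_col]
    dow = dates.dt.dayofweek  # Monday=0 ... Sunday=6

    # Cyclical encoding: Sunday (6) and Monday (0) end up adjacent on the circle.
    out["dow_sin"] = np.sin(2 * np.pi * dow / 7)
    out["dow_cos"] = np.cos(2 * np.pi * dow / 7)

    out["month"] = dates.dt.month
    out["is_weekend"] = (dow >= 5).astype(int)
    out["is_payday"] = dates.dt.day.isin([1, 15]).astype(int)

    holiday_dates = pd.to_datetime(list(holidays))
    out["is_holiday"] = dates.isin(holiday_dates).astype(int)

    # Absolute gap in days to the closest holiday, before or after.
    gaps = np.abs(dates.to_numpy()[:, None] - holiday_dates.to_numpy()[None, :])
    out["days_to_nearest_holiday"] = (
        gaps.min(axis=1) / np.timedelta64(1, "D")
    ).astype(int)
    return out

# Hypothetical example: one week around a single holiday.
daily = pd.DataFrame({"date": pd.date_range("2024-12-20", periods=7)})
calendar = add_calendar_features(daily, "date", holidays=["2024-12-25"])
print(calendar[["date", "is_weekend", "is_holiday", "days_to_nearest_holiday"]])
```

Calendar features depend only on the date itself, so they carry no leakage risk and can be computed for future dates at serving time. A signed version of the holiday gap (days until vs. days since) would separate pre-holiday from post-holiday effects.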

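Critical: No temporal leakage! All lag and rolling features must use only data that is available when the forecast is made, which means data strictly before the prediction date. If you forecast h days ahead, the shortest usable lag is h days. Validate by checking that your time-based validation performance matches production performance.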
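The lag, rolling, and trend categories can be built so that they never see the current day. Below is a minimal sketch, assuming pandas and NumPy, a one-day-ahead forecast, and one row per store per day with no missing dates. The function name `add_history_features`, the column names (`store_id`, `date`, `sales`), and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def _slope(values):
    """Least-squares slope of `values` against 0..n-1 (sales units per day)."""
    return np.polyfit(np.arange(len(values)), values, 1)[0]

def add_history_features(df, lags=(1, 7, 28), windows=(7, 28)):
    """Per-store lag, rolling, and trend features that only look at earlier days.

    Assumes one row per store per day with no gaps, so shifting by rows
    is the same as shifting by days.
    """
    out = df.sort_values(["store_id", "date"]).reset_index(drop=True)
    store = out["store_id"]
    sales_by_store = out["sales"].groupby(store)

    for lag in lags:
        out[f"sales_{lag}d_ago"] = sales_by_store.shift(lag)

    # Shift by one day first so every window ends yesterday, never today.
    past_by_store = sales_by_store.shift(1).groupby(store)
    for w in windows:
        mean = past_by_store.transform(lambda s: s.rolling(w).mean())
        std = past_by_store.transform(lambda s: s.rolling(w).std())
        out[f"roll_mean_{w}d"] = mean
        out[f"roll_std_{w}d"] = std
        out[f"roll_cv_{w}d"] = std / mean  # coefficient of variation
        out[f"slope_{w}d"] = past_by_store.transform(
            lambda s: s.rolling(w).apply(_slope, raw=True)
        )
    return out

# Hypothetical toy data: store A grows by 1 unit per day, store B is flat.
dates = pd.date_range("2024-01-01", periods=10)
sales = pd.DataFrame({
    "store_id": ["A"] * 10 + ["B"] * 10,
    "date": list(dates) * 2,
    "sales": [10.0 + i for i in range(10)] + [5.0] * 10,
})
features = add_history_features(sales, lags=(1, 7), windows=(3,))
print(features[features["store_id"] == "A"].head())
```

A quick leakage test is to change the current day's sales and confirm that the current day's features do not move. For a forecast horizon of h days, the `shift(1)` and the shortest lag would both become h. A short-window over long-window ratio follows by dividing two of the `roll_mean` columns.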

Scoring Rubric:

  • Strong Hire: Comprehensive feature list across multiple categories, explicitly addresses temporal leakage, includes cyclical encoding, discusses store-level vs. global features, mentions validation strategy
  • Lean Hire: Good lag and calendar features but missing trend features, external signals, or leakage discussion
  • No Hire: Only mentions basic features (day of week, month) without lag, rolling, or trend features

Interview Cheat Sheet

| Topic | Key Fact | When to Mention |
| --- | --- | --- |
| Log transform | Compresses right tail; use log1p for zeros; makes power-law distributions more Gaussian | Skewed numerical features |
| StandardScaler | z = (x - mean) / std; needed for linear models, SVMs, NNs; NOT needed for trees | Feature preprocessing |
| One-hot encoding | Binary columns per category; drop one for linear models; only for low cardinality | Categorical encoding |
| Target encoding | Mean of target per category; must use k-fold to prevent leakage | High-cardinality categoricals |
| Embeddings | Learned dense vectors; standard for user/item IDs in rec systems | Very high cardinality |
| TF-IDF | Term frequency * inverse document frequency; captures distinctive words | Text features |
| Lag features | Past values of target/features; critical for time series; MUST avoid temporal leakage | Time series |
| Rolling stats | Mean/std/min/max over windows; captures trends and volatility | Time series |
| Mutual information | Captures any statistical dependency; better than correlation for non-linear | Feature selection |
| Permutation importance | Shuffle feature, measure performance drop; more reliable than gain importance | Feature selection |
| Feature stores | Centralized feature computation + serving; prevents train-serve skew | Production ML |
| Data leakage | Information from future/target leaks into features; causes offline/online gap | Always mention |
| Cyclical encoding | sin/cos encoding for periodic features (hour, day of week) | Time features |
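The "must use k-fold" note on target encoding is the row people most often get wrong in practice, so here is a minimal out-of-fold sketch, assuming pandas and NumPy. The function name `oof_target_encode` and its parameters are hypothetical, and the additive smoothing toward the global mean is an extra assumption that the cheat sheet does not require.

```python
import numpy as np
import pandas as pd

def oof_target_encode(categories, target, n_splits=5, smoothing=10.0, seed=0):
    """Encode each row with target statistics computed on the other folds only."""
    categories = pd.Series(categories).reset_index(drop=True)
    target = pd.Series(target, dtype=float).reset_index(drop=True)
    encoded = pd.Series(np.nan, index=categories.index)

    # Random, roughly equal-sized fold assignment.
    rng = np.random.default_rng(seed)
    fold_ids = rng.permutation(len(categories)) % n_splits

    for fold in range(n_splits):
        held_out = fold_ids == fold
        fit_cat, fit_y = categories[~held_out], target[~held_out]

        prior = fit_y.mean()  # global mean of the fitting folds
        stats = fit_y.groupby(fit_cat).agg(["sum", "count"])
        # Smoothed category mean: small categories shrink toward the prior.
        smoothed = (stats["sum"] + smoothing * prior) / (stats["count"] + smoothing)

        # Categories unseen in the fitting folds fall back to the prior.
        encoded[held_out] = (
            categories[held_out].map(smoothed).fillna(prior).to_numpy()
        )
    return encoded

# Hypothetical toy data.
cats = ["a", "b", "a", "c", "b", "a", "b", "a", "c", "b"]
labels = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]
encoded = oof_target_encode(cats, labels)
print(encoded.round(3).tolist())
```

Because a row's own label never contributes to its encoding, flipping that label leaves its encoded value unchanged, which is a handy unit test for any target encoder. At serving time, the encoder would instead be fit once on the full training set.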

Spaced Repetition Checkpoints

Day 0 - Immediate Recall

  • List 4 ways to encode a categorical variable with 50 unique values
  • Explain why you must fit StandardScaler on training data only
  • Define data leakage in one sentence
  • Name the three types of feature selection methods

Day 3 - Active Recall

  • Without notes: When would you use target encoding vs. one-hot encoding vs. embeddings?
  • Explain the leakage risk of target encoding and how to mitigate it
  • List 5 features you'd engineer from a timestamp column
  • What's the difference between gain-based and permutation-based feature importance?

Day 7 - Application

  • Design a feature engineering pipeline for a churn prediction model. Include numerical transforms, categorical encoding, time features, and feature selection.
  • Explain to a junior data scientist what a feature store is and why it matters
  • A model achieves 0.99 AUC offline but 0.70 online. Investigate systematically.

Day 14 - Synthesis

  • Compare feature engineering approaches for: (a) tabular classification, (b) time series forecasting, (c) recommendation system, (d) NLP classification
  • Design a feature store architecture for a company with 10 models sharing features
  • "Tell me about your feature engineering process" - deliver a 3-minute answer

Day 21 - Interview Simulation

  • You're given a dataset with 500 features. Walk through your feature selection workflow.
  • The PM asks why the model performs differently in production. Diagnose the train-serve skew.
  • Design features for a ride-hailing demand prediction system (include spatial, temporal, and contextual features).