Feature Selection and Importance
500 Features, 50 That Matter
The model had been accumulating features for two years. Every quarter, a few new signals were added - a social media activity score here, a device type breakdown there, some web analytics aggregates. Nobody ever removed a feature. Why would they? Adding features could only help, the intuition went.
By the time the senior ML engineer inherited the project, there were 512 features in the training pipeline. The model itself was a gradient boosting model with 1,200 estimators, trained on a dataset of 8 million records. Training took 4 hours. The feature pipeline, reading and computing 512 features for 8 million records, took 6 hours before training even started.
The model had an AUC of 0.87. The business was satisfied. But the 10-hour end-to-end training cycle meant that model updates took over a day to deploy. Hyperparameter experiments were nearly impossible - each trial took 4 hours. When a regulatory requirement emerged that demanded model interpretability, the team couldn't clearly explain what the 512 features represented or which ones actually drove decisions.
The engineer ran a systematic feature selection analysis. The result: 47 features produced a model with AUC 0.868 - within 0.002 of the full model. The training pipeline took 45 minutes. The feature pipeline took 30 minutes. The total cycle dropped from 10 hours to 75 minutes. The model became explainable.
Feature selection is not just about performance. It is about maintainability, interpretability, compute cost, and inference latency. This lesson covers the methods that make it systematic.
Why This Exists: The Curse of Dimensionality and the Noise Problem
Adding features to a machine learning model does not monotonically improve performance. Beyond a certain point, additional features introduce noise that the model must learn to ignore - which requires more data, more training time, and produces a more overfit model.
Several concrete failure modes emerge from excessive features:
Noise dimensions: Features that contain no predictive signal act as noise. The model wastes capacity learning that these features don't matter, which can hurt performance on small datasets and certainly increases training time.
Correlated features: Highly correlated features contain redundant information. Including both doesn't improve the model but increases the feature space, complicates interpretation, and (for linear models) creates multicollinearity that destabilizes coefficient estimates.
Leaking features: A feature that inadvertently encodes the target variable or is computed using information from after the label event appears highly predictive in training but is not available at serving time. More features means more opportunities for leakage.
Compute and latency costs: Every feature must be computed, stored, transmitted, and processed. Each additional feature in the serving path adds latency. Feature selection is a cost optimization.
Feature selection is the principled process of choosing a minimal subset of features that preserves (or improves) model performance while reducing all the above costs.
Historical Context
Feature selection has been an active area of machine learning research since the 1990s. John, Kohavi, and Pfleger (1994) formalized the distinction between filter and wrapper methods; embedded methods were added to the taxonomy soon after, and the three-way split remains standard today.
The LASSO (Least Absolute Shrinkage and Selection Operator), introduced by Tibshirani (1996), showed that regularization could perform feature selection automatically by shrinking some coefficients exactly to zero. This was a significant advance: feature selection became a natural byproduct of model training rather than a separate preprocessing step.
Permutation importance, introduced by Breiman (2001) alongside random forests, measures importance by the drop in predictive accuracy when a feature's values are shuffled. The idea carries over to any fitted model and is generally more reliable than impurity-based measures.
SHAP (SHapley Additive exPlanations), introduced by Lundberg and Lee (2017), unified several importance measures under a game-theoretic framework and introduced fast tree-specific computation (TreeSHAP). SHAP values are now the dominant tool for both feature importance and model explanation in tree-based model production systems.
Core Concepts
The Taxonomy: Filter, Wrapper, Embedded
Filter Methods
Filter methods score features independently of any model, using statistical relationships between features and the target. They are fast and scalable but cannot detect interaction effects.
Correlation (Pearson/Spearman): Measures linear (Pearson) or monotonic (Spearman) relationship between a numerical feature and a numerical target. Features with low absolute correlation with the target are poor linear predictors.
Mutual Information: Measures the reduction in uncertainty about the target given the feature. Captures non-linear relationships that correlation misses. Available via sklearn.feature_selection.mutual_info_classif and mutual_info_regression.
Variance Threshold: Removes features with near-zero variance. A feature that is constant or nearly constant has no predictive power and should be removed.
```python
import pandas as pd
import numpy as np
from sklearn.feature_selection import (
    mutual_info_classif, mutual_info_regression,
    VarianceThreshold, SelectKBest, chi2
)
from scipy import stats
from typing import List, Tuple


def filter_features_by_mutual_information(
    X: pd.DataFrame,
    y: pd.Series,
    task: str = "classification",  # or "regression"
    top_k: int = 100,
    min_mi_score: float = 0.01
) -> Tuple[List[str], pd.DataFrame]:
    """
    Score features by mutual information with the target.
    Returns up to top_k features with MI score >= min_mi_score,
    plus the full table of scores for inspection.
    """
    if task == "classification":
        mi_scores = mutual_info_classif(X, y, random_state=42)
    else:
        mi_scores = mutual_info_regression(X, y, random_state=42)

    mi_df = pd.DataFrame({
        "feature": X.columns,
        "mi_score": mi_scores
    }).sort_values("mi_score", ascending=False)

    selected = mi_df[mi_df["mi_score"] >= min_mi_score].head(top_k)
    return selected["feature"].tolist(), mi_df


def remove_low_variance_features(
    X: pd.DataFrame,
    variance_threshold: float = 0.01
) -> Tuple[pd.DataFrame, List[str]]:
    """
    Remove features whose raw variance does not exceed the threshold.
    Do not standard-scale X first: scaling sets the variance of every
    non-constant feature to 1, which makes the filter meaningless.
    """
    selector = VarianceThreshold(threshold=variance_threshold)
    selector.fit(X)
    selected_mask = selector.get_support()
    removed = [col for col, keep in zip(X.columns, selected_mask) if not keep]
    return X.loc[:, selected_mask], removed


def remove_correlated_features(
    X: pd.DataFrame,
    correlation_threshold: float = 0.95
) -> Tuple[pd.DataFrame, List[str]]:
    """
    Remove features that are highly correlated with another feature.
    For each pair correlated above the threshold, the column that appears
    later in X is dropped and the earlier one is kept. Correlation with
    the target is not considered (this function never sees y).
    """
    corr_matrix = X.corr().abs()

    # Upper triangle of correlation matrix, so each pair is examined once
    upper = corr_matrix.where(
        np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)
    )

    # Features to remove: any column with correlation > threshold to an earlier column
    to_remove = [
        col for col in upper.columns
        if any(upper[col] > correlation_threshold)
    ]
    return X.drop(columns=to_remove), to_remove
```
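To make the upper-triangle rule concrete, here is a small sketch on made-up data. The column names `a`, `b`, and `c` and the 0.95 cutoff are illustrative assumptions, not values from the project described above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2.0 * a + rng.normal(scale=0.01, size=200),  # near-duplicate of "a"
    "c": rng.normal(size=200),                        # independent of both
})

corr = toy.corr().abs()
# Keep only entries above the diagonal so each pair is examined once.
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
# A column is dropped when it correlates strongly with any earlier column.
dropped = [col for col in upper.columns if (upper[col] > 0.95).any()]
kept = [col for col in toy.columns if col not in dropped]
print("dropped:", dropped, "kept:", kept)
```

Because only the later column of each pair is dropped, the result depends on column order. If one of the two features is cheaper to compute or easier to explain, place it first.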
Wrapper Methods: Recursive Feature Elimination
RFE fits a model, ranks features by importance, removes the least important feature, and repeats. It is slow (requires n_features model fits) but finds a feature subset that is specifically good for the chosen model.
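Before the cross-validated version, a minimal sketch of plain RFE may help. It assumes a synthetic dataset from `make_classification` and a logistic regression as the ranking model; both are stand-ins chosen for speed, not the gradient boosting setup described earlier.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: 12 features, of which 4 carry signal (illustrative only).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=12, n_informative=4,
    n_redundant=0, shuffle=False, random_state=0,
)

rfe = RFE(
    estimator=LogisticRegression(max_iter=1000),
    n_features_to_select=4,
    step=1,  # drop one feature per iteration, refitting each time
)
rfe.fit(X_demo, y_demo)

selected_idx = np.flatnonzero(rfe.support_)  # column indices that survived
print("selected columns:", selected_idx.tolist())
print("elimination ranking (1 = kept):", rfe.ranking_.tolist())
```

With `step=1` the model is refit once per eliminated feature, which is where the cost comes from. Raising `step` trades ranking resolution for speed.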
```python
from sklearn.feature_selection import RFE, RFECV
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold


def recursive_feature_elimination(
    X: pd.DataFrame,
    y: pd.Series,
    n_features_to_select: int = 50,  # not used: RFECV picks the count itself
    cv_folds: int = 3
) -> Tuple[List[str], object]:
    """
    Use RFECV to find the optimal number of features automatically.
    More expensive than RFE but doesn't require specifying n_features.
    """
    estimator = GradientBoostingClassifier(
        n_estimators=100,  # lighter model for selection
        max_depth=3,
        random_state=42
    )

    # RFECV: automatically finds optimal n_features via cross-validation
    rfecv = RFECV(
        estimator=estimator,
        step=10,  # remove 10 features at each step for speed
        cv=StratifiedKFold(n_splits=cv_folds, shuffle=True, random_state=42),
        scoring="roc_auc",
        min_features_to_select=10,
        n_jobs=-1
    )
    rfecv.fit(X, y)

    selected_features = X.columns[rfecv.support_].tolist()
    print(f"Optimal number of features: {rfecv.n_features_}")

    # cv_results_ holds one entry per evaluated subset size, not one per feature,
    # so n_features_ - 1 is not a valid index when step > 1. RFECV keeps the
    # subset with the best mean score, so report that score directly.
    best_cv_auc = max(rfecv.cv_results_["mean_test_score"])
    print(f"CV AUC at optimal: {best_cv_auc:.4f}")
    return selected_features, rfecv
```
Embedded Methods: LASSO and Tree Importance
LASSO regularization applies an L1 penalty to model coefficients, which has the effect of shrinking some coefficients exactly to zero. This provides feature selection as a byproduct of model fitting - only features that provide sufficient predictive value to overcome the penalty remain in the model.
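The zeroing effect is easy to see on synthetic data. The sketch below assumes a regression target built from 3 of 20 standard-normal features; the coefficients, the noise level, and `alpha=0.2` are illustrative choices, not tuned values.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 20))
# Only the first three columns influence the target.
y_demo = (
    3.0 * X_demo[:, 0]
    - 2.0 * X_demo[:, 1]
    + 1.5 * X_demo[:, 2]
    + rng.normal(scale=0.5, size=500)
)

lasso = Lasso(alpha=0.2)  # fixed penalty strength, chosen for illustration
lasso.fit(X_demo, y_demo)

nonzero_idx = np.flatnonzero(lasso.coef_).tolist()
print("features with non-zero coefficients:", nonzero_idx)
print("their coefficients:", np.round(lasso.coef_[nonzero_idx], 2).tolist())
```

The surviving coefficients come out smaller in magnitude than the true values because the L1 penalty shrinks everything it does not remove. A common follow-up is to refit an unpenalized model on the selected features.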
```python
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler


def lasso_feature_selection(
    X: pd.DataFrame,
    y: pd.Series,
    cv: int = 5
) -> Tuple[List[str], LassoCV]:
    """
    Use LASSO with cross-validated alpha selection.
    Returns features with non-zero coefficients.

    LassoCV is a regression model. For a classification target, an
    L1-penalized LogisticRegression is the usual analogue.
    """
    # LASSO requires scaled features: the penalty treats all coefficients alike
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    lasso = LassoCV(cv=cv, random_state=42, max_iter=5000, n_jobs=-1)
    lasso.fit(X_scaled, y)

    selected_mask = lasso.coef_ != 0
    selected_features = X.columns[selected_mask].tolist()
    print(f"LASSO selected {len(selected_features)} features (alpha={lasso.alpha_:.4f})")
    return selected_features, lasso
```
SHAP-Based Feature Selection
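SHAP values are often the most dependable importance measure available for tree-based models. Gini impurity-based importance is biased toward high-cardinality features, and permutation importance becomes slow when there are many features. TreeSHAP computes exact Shapley values for the fitted trees and gives a per-sample attribution for every feature. One caveat: when features are strongly correlated, credit can be split among them, so read SHAP rankings alongside a collinearity check.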
```python
import shap
import lightgbm as lgb
import matplotlib.pyplot as plt


def shap_feature_selection(
    model,
    X_train: pd.DataFrame,
    X_val: pd.DataFrame,
    importance_threshold_pct: float = 1.0  # keep features above 1% of total importance
) -> Tuple[List[str], pd.DataFrame]:
    """
    Select features based on their mean absolute SHAP value.
    More reliable than Gini importance for tree-based models.
    """
    # Compute SHAP values (TreeSHAP - fast for tree models)
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X_val)

    # For binary classification, shap_values may be a list [neg_class, pos_class]
    if isinstance(shap_values, list):
        shap_values = shap_values[1]  # use positive class
    elif getattr(shap_values, "ndim", 2) == 3:
        # Some shap releases return (n_samples, n_features, n_classes) instead
        shap_values = shap_values[:, :, 1]

    # Mean absolute SHAP value per feature
    mean_abs_shap = np.abs(shap_values).mean(axis=0)

    importance_df = pd.DataFrame({
        "feature": X_val.columns,
        "mean_abs_shap": mean_abs_shap
    }).sort_values("mean_abs_shap", ascending=False)

    # Normalize to percentage of total importance
    importance_df["importance_pct"] = (
        importance_df["mean_abs_shap"] /
        importance_df["mean_abs_shap"].sum() * 100
    )

    # Select features above threshold
    selected = importance_df[
        importance_df["importance_pct"] >= importance_threshold_pct
    ]["feature"].tolist()

    print("\nTop 20 features by SHAP importance:")
    print(importance_df.head(20).to_string(index=False))
    print(f"\nSelected {len(selected)} features above {importance_threshold_pct}% importance threshold")
    return selected, importance_df


def shap_feature_selection_with_interaction_analysis(
    model,
    X_sample: pd.DataFrame,  # sample - SHAP interactions are O(n^2 features)
    n_samples: int = 1000
) -> pd.DataFrame:
    """
    Compute SHAP interaction values to find feature pairs with strong interactions.
    Use to identify which features are worth crossing.
    """
    X_small = X_sample.sample(min(n_samples, len(X_sample)), random_state=42)
    explainer = shap.TreeExplainer(model)
    shap_interaction_values = explainer.shap_interaction_values(X_small)

    if isinstance(shap_interaction_values, list):
        shap_interaction_values = shap_interaction_values[1]

    # Sum absolute interaction values over samples -> (n_features, n_features) matrix
    interaction_strength = np.abs(shap_interaction_values).sum(axis=0)
    np.fill_diagonal(interaction_strength, 0)  # remove self-interactions (main effects)

    interaction_df = pd.DataFrame(
        interaction_strength,
        index=X_small.columns,
        columns=X_small.columns
    )
    return interaction_df
```
Permutation Importance
Permutation importance measures how much a model's performance decreases when a feature's values are randomly shuffled (breaking its relationship with the target). A feature that matters causes a large decrease when shuffled; an irrelevant feature causes no change.
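A small sketch shows the behavior described above. It assumes synthetic data in which only the first of four columns drives the label, and a logistic regression as the model; both are illustrative choices.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(800, 4))
# Only column 0 determines the label; columns 1-3 are irrelevant.
y_demo = (X_demo[:, 0] + 0.3 * rng.normal(size=800) > 0).astype(int)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)
clf = LogisticRegression().fit(X_tr, y_tr)

# Shuffle each column of the validation set 10 times and record the accuracy drop.
result = permutation_importance(
    clf, X_va, y_va, scoring="accuracy", n_repeats=10, random_state=0
)
most_important = int(np.argmax(result.importances_mean))
print("mean accuracy drop per feature:", np.round(result.importances_mean, 3).tolist())
```

The irrelevant columns land near zero, and some may come out slightly negative. That is sampling noise, not evidence that shuffling helps.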
```python
from sklearn.inspection import permutation_importance


def compute_permutation_importance(
    model,
    X_val: pd.DataFrame,
    y_val: pd.Series,
    scoring: str = "roc_auc",
    n_repeats: int = 10,
    n_jobs: int = -1
) -> pd.DataFrame:
    """
    Compute permutation importance on held-out validation data.
    More reliable than training-set importance measures.
    """
    result = permutation_importance(
        model, X_val, y_val,
        n_repeats=n_repeats,
        random_state=42,
        scoring=scoring,
        n_jobs=n_jobs
    )

    importance_df = pd.DataFrame({
        "feature": X_val.columns,
        "importance_mean": result.importances_mean,
        "importance_std": result.importances_std,
    }).sort_values("importance_mean", ascending=False)

    # A negative mean importance means shuffling the feature did not hurt the score.
    # This is usually noise around zero, so treat these features as removal
    # candidates and check importance_std before acting on the sign alone.
    noise_features = importance_df[
        importance_df["importance_mean"] < 0
    ]["feature"].tolist()
    print(f"Features with negative permutation importance (likely noise): {len(noise_features)}")
    return importance_df
```
Leakage Detection
Leakage-infected features appear disproportionately important. A systematic leakage check looks for features that are suspicious in their importance relative to their conceptual predictive value.
```python
def leakage_detection_audit(
    importance_df: pd.DataFrame,
    feature_metadata: dict,  # {feature_name: {compute_time, domain_relevance}}
    target_correlation_df: pd.Series  # index = feature name, value = correlation with target
) -> pd.DataFrame:
    """
    Flag features that are both highly important and highly correlated with
    the target, and whose compute timing is after the label or unknown.
    High correlation + questionable compute timing = potential leakage.
    """
    leakage_flags = []
    # The importance cutoff is the same for every row, so compute it once
    importance_cutoff = importance_df["importance_mean"].quantile(0.9)

    for _, row in importance_df.iterrows():
        feature = row["feature"]
        importance = row["importance_mean"]

        # Get target correlation (features missing from the Series default to 0.0)
        target_corr = target_correlation_df.get(feature, 0.0)

        # Get metadata
        meta = feature_metadata.get(feature, {})
        compute_time = meta.get("compute_time", "unknown")  # "before_label" or "after_label"

        # Flag if: high importance AND high target correlation AND compute time is suspicious
        is_suspicious = (
            abs(target_corr) > 0.7 and
            importance > importance_cutoff and
            compute_time in ("after_label", "unknown")
        )

        leakage_flags.append({
            "feature": feature,
            "importance_mean": importance,
            "target_correlation": target_corr,
            "compute_time": compute_time,
            "leakage_suspected": is_suspicious,
        })

    return pd.DataFrame(leakage_flags).sort_values("leakage_suspected", ascending=False)
```
Collinearity Analysis: VIF
For linear models, highly correlated features cause multicollinearity - the model's coefficient estimates become unstable, and standard errors inflate. Variance Inflation Factor (VIF) quantifies how much a feature's variance is inflated due to correlation with other features.
$$\text{VIF}_j = \frac{1}{1 - R_j^2}$$

where $R_j^2$ is the coefficient of determination from regressing feature $j$ on all other features. VIF greater than 5 indicates problematic collinearity; greater than 10 is severe.
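The definition can be checked by hand: regress each feature on the others, take the R-squared, and compute 1 / (1 - R²). The sketch below does this with scikit-learn on made-up data in which `x2` is nearly a copy of `x1`. The helper name `vif_from_r2` is hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def vif_from_r2(X: np.ndarray, j: int) -> float:
    """VIF of column j: regress it on all other columns, then 1 / (1 - R^2)."""
    others = np.delete(X, j, axis=1)
    r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
    return 1.0 / (1.0 - r2)


rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + 0.1 * rng.normal(size=500)  # almost a copy of x1
x3 = rng.normal(size=500)             # unrelated to the others
X_demo = np.column_stack([x1, x2, x3])

vifs = [vif_from_r2(X_demo, j) for j in range(X_demo.shape[1])]
print("VIF per column:", [round(v, 1) for v in vifs])
```

Both halves of the collinear pair get a high VIF, which is why the usual remedy is to drop one feature, recompute, and repeat.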
```python
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant


def compute_vif(X: pd.DataFrame) -> pd.DataFrame:
    """
    Compute Variance Inflation Factor for each feature.
    High VIF indicates collinearity - feature provides redundant information.
    """
    # variance_inflation_factor does not add an intercept on its own.
    # Without one, features with a non-zero mean get misleading VIFs.
    X_const = add_constant(X, has_constant="add")

    vif_data = pd.DataFrame({
        "feature": X.columns,
        "VIF": [
            variance_inflation_factor(X_const.values, i + 1)  # +1 skips the intercept column
            for i in range(X.shape[1])
        ]
    }).sort_values("VIF", ascending=False)
    return vif_data


def remove_high_vif_features(
    X: pd.DataFrame,
    vif_threshold: float = 10.0,
    y: pd.Series = None  # currently unused; reserved for a target-aware tie-break
) -> Tuple[pd.DataFrame, List[str]]:
    """
    Iteratively remove features with high VIF until all are below threshold.
    """
    removed = []
    X_current = X.copy()

    while True:
        vif_df = compute_vif(X_current)
        max_vif = vif_df["VIF"].max()
        if max_vif <= vif_threshold:
            break

        # Remove the feature with highest VIF (vif_df is sorted descending)
        to_remove = vif_df.iloc[0]["feature"]
        X_current = X_current.drop(columns=[to_remove])
        removed.append(to_remove)
        print(f"Removed '{to_remove}' (VIF={max_vif:.1f})")

    return X_current, removed
```
Selection Stability
A robust feature selection process produces similar results when run on different random subsets of the training data. Unstable selection - where different subsets produce radically different feature sets - indicates that many features have similar importance and the selection boundary is noisy.
```python
def measure_selection_stability(
    X: pd.DataFrame,
    y: pd.Series,
    selection_fn,  # callable that returns list of selected features
    n_bootstrap: int = 20,
    sample_fraction: float = 0.8
) -> pd.DataFrame:
    """
    Measure how consistently each feature is selected across bootstrap samples.
    A stability score of 1.0 means selected in every bootstrap sample.
    """
    selection_counts = {col: 0 for col in X.columns}

    for i in range(n_bootstrap):
        # Bootstrap sample
        idx = np.random.choice(len(X), int(len(X) * sample_fraction), replace=True)
        X_sample = X.iloc[idx]
        y_sample = y.iloc[idx]

        selected = selection_fn(X_sample, y_sample)
        for feature in selected:
            selection_counts[feature] += 1

    stability_df = pd.DataFrame({
        "feature": list(selection_counts.keys()),
        "selection_frequency": [v / n_bootstrap for v in selection_counts.values()]
    }).sort_values("selection_frequency", ascending=False)
    return stability_df
```
Production Engineering Notes
Selection cadence: Feature selection should be re-run when significant amounts of new training data are available, when model performance degrades, or when new features are added to the pipeline. It is not a one-time exercise.
Selection in CI: Add a feature importance check to your CI pipeline. If any feature in the approved selection has near-zero importance in the current training run (below a threshold), flag it for review - it may have become stale or the upstream data source may have changed.
Don't select features on the test set: Feature selection that uses test set information is a form of data leakage. Run selection on training (or training + validation) data only. Report performance on a held-out test set.
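The CI check described above can be as small as one function. This is a sketch under assumptions: the helper `find_stale_features`, the feature names, and the `1e-4` floor are all hypothetical, and the importance values would come from whichever measure your training run already reports.

```python
from typing import Dict, List


def find_stale_features(
    approved_features: List[str],
    current_importance: Dict[str, float],
    min_importance: float = 1e-4,  # illustrative floor; tune to your importance scale
) -> List[str]:
    """Return approved features whose importance in this run is below the floor.

    Features missing from the current run are treated as importance 0.0,
    so a dropped upstream column is flagged as well.
    """
    return [
        name for name in approved_features
        if current_importance.get(name, 0.0) < min_importance
    ]


# Hypothetical inputs: an approved list and the importances from the latest run.
approved = ["tenure_days", "avg_basket", "device_type"]
importance = {"tenure_days": 0.41, "avg_basket": 0.00002}

stale = find_stale_features(approved, importance)
print("flag for review:", stale)
```

In a pipeline, a non-empty result would typically post a warning or fail a non-blocking check, since a stale feature calls for investigation before removal.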
Common Mistakes
:::danger Selecting features using the full dataset If you use test set data to compute mutual information scores, SHAP values, or correlations to drive feature selection, you have leaked test set information into your feature set. Always perform feature selection on training data only. :::
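The size of this effect is easy to underestimate. The sketch below uses pure noise, so no feature is truly predictive, and compares two workflows: selecting features on all rows before cross-validation, and selecting inside each training fold via a pipeline. The dataset shape and `k=20` are illustrative assumptions.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X_noise = rng.normal(size=(100, 2000))  # pure noise: nothing here predicts y
y_noise = rng.integers(0, 2, size=100)

# Leaky: pick the 20 "best" features using ALL rows, then cross-validate.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X_noise, y_noise)
leaky_acc = cross_val_score(
    LogisticRegression(max_iter=1000), X_leaky, y_noise, cv=5
).mean()

# Honest: the selector is refit on the training part of every fold.
honest_model = make_pipeline(
    SelectKBest(f_classif, k=20), LogisticRegression(max_iter=1000)
)
honest_acc = cross_val_score(honest_model, X_noise, y_noise, cv=5).mean()

print(f"selection on all rows: {leaky_acc:.2f}")
print(f"selection inside CV:   {honest_acc:.2f}")
```

The honest estimate should sit near chance, while the leaky one looks like a working model. Wrapping the selector in the pipeline is the simplest way to keep the held-out rows out of the selection step.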
:::danger Dropping features with zero linear correlation to target Linear correlation only measures linear relationships. A feature can have zero Pearson correlation with the target but high mutual information - if the relationship is non-linear (quadratic, sinusoidal). Always supplement correlation analysis with mutual information for non-linear screening. :::
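A quadratic relationship makes the point. The sketch below assumes a synthetic feature drawn uniformly from [-1, 1] and a target equal to its square plus a little noise; the sample size and noise level are arbitrary.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=2000)
y = x ** 2 + rng.normal(scale=0.05, size=2000)  # symmetric, non-linear dependence

pearson_r = float(np.corrcoef(x, y)[0, 1])
mi = float(mutual_info_regression(x.reshape(-1, 1), y, random_state=0)[0])

print(f"Pearson r:          {pearson_r:.3f}")
print(f"Mutual information: {mi:.3f}")
```

A correlation filter would discard this feature, while the mutual information score makes the dependence visible.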
:::warning Using training-set Gini importance for tree models
Gini impurity-based feature importance (sklearn's feature_importances_) is computed on the training set and is biased toward high-cardinality features. It will overstate the importance of features with many unique values (continuous features, high-cardinality categoricals) relative to binary features. Use permutation importance on the validation set or SHAP values for more reliable estimates.
:::
:::tip Re-evaluate importance after feature engineering Feature importance measured on raw features will differ from importance measured after feature engineering (log transforms, interactions, target encoding). Run your final importance analysis on the engineered feature set that will actually be used in the model. :::
Interview Q&A
Q: What is the difference between filter, wrapper, and embedded feature selection methods?
A: Filter methods score features independently of any model - using statistics like mutual information, correlation, or variance. They're fast and scalable but can't detect interaction effects. Wrapper methods use a model as an oracle: they iteratively add or remove features and measure the model's performance at each step. They find feature sets specifically good for the chosen model but are expensive - RFE with 500 features requires hundreds of model fits. Embedded methods perform feature selection as part of model training: LASSO shrinks some coefficients to zero, and tree models compute importance during fitting. Embedded methods balance accuracy and cost. In practice, a pipeline that uses filter methods for initial screening (500 → 150 features), then embedded methods for refinement (150 → 50), is common.
Q: Why is SHAP-based feature importance more reliable than Gini importance for tree models?
A: Gini importance is computed during training by summing the decrease in impurity across all splits where a feature is used. It is biased toward high-cardinality features (continuous features and high-cardinality categoricals) because those features offer more split points, so the model tends to use them more often even if their contribution per split is small. It is also computed on training data, so it reflects what the model learned, not what generalizes. SHAP values (specifically TreeSHAP) compute each feature's marginal contribution to each individual prediction, averaged over all possible orderings of features. They are exact, unbiased, and can be computed on held-out data. The result is a feature importance measure that is game-theoretically fair and reflects out-of-sample importance.
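The "averaged over all possible orderings" idea can be written out directly for a tiny model. This is a brute-force sketch, not TreeSHAP: it enumerates every ordering, which is only feasible for a handful of features, and it assumes an absent feature is represented by a fixed baseline value. The function `exact_shapley` and the toy model are hypothetical.

```python
from itertools import permutations
from typing import Callable, List, Sequence


def exact_shapley(
    f: Callable[[Sequence[float]], float],
    x: Sequence[float],
    baseline: Sequence[float],
) -> List[float]:
    """Average each feature's marginal contribution over all feature orderings."""
    n = len(x)
    phi = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)  # start with every feature "absent"
        prev = f(current)
        for i in order:
            current[i] = x[i]     # switch feature i on
            new = f(current)
            phi[i] += new - prev  # its marginal contribution in this ordering
            prev = new
    return [p / len(orderings) for p in phi]


def toy_model(v: Sequence[float]) -> float:
    # One additive term plus one interaction between features 1 and 2.
    return 2.0 * v[0] + v[1] * v[2]


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = exact_shapley(toy_model, x, baseline)
print("attributions:", phi)
print("sum:", sum(phi), "= f(x) - f(baseline):", toy_model(x) - toy_model(baseline))
```

The interaction term's contribution of 6 is split evenly between the two features involved, and the attributions sum to the gap between the prediction and the baseline prediction.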
Q: How do you detect feature leakage during feature selection?
A: Leakage shows as features with suspiciously high importance or correlation with the target. Specifically: a feature that appears in the top 5% of importance but has no clear conceptual reason to be predictive is suspicious. A feature with correlation to the target greater than 0.9 is almost certainly leaking. To investigate: check when the feature is computed relative to the label timestamp - if it uses data from after the label event, it's leaking. Remove the suspected feature and check if model performance drops dramatically - if it does, the feature was probably carrying real information (possibly leaked). If performance is unaffected, the feature was either noise or another feature captured the same information.
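The timestamp check is straightforward once each feature's compute time is recorded. The sketch below assumes a hypothetical log with one row per feature and a single label time; the feature names and dates are invented for illustration.

```python
import pandas as pd

# Hypothetical audit log: when each feature value was computed.
feature_log = pd.DataFrame({
    "feature": ["days_since_signup", "support_tickets_30d", "refund_issued"],
    "computed_at": pd.to_datetime(["2024-01-10", "2024-01-12", "2024-02-01"]),
})
label_time = pd.Timestamp("2024-01-15")  # when the label event occurred

# Any feature computed after the label event could not exist at prediction time.
feature_log["after_label"] = feature_log["computed_at"] > label_time
leak_candidates = feature_log.loc[feature_log["after_label"], "feature"].tolist()
print("computed after the label:", leak_candidates)
```

In a real pipeline the comparison would run per training row, since label times differ from record to record.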
Q: How many features should you include in a production model?
A: There is no universal answer, but some heuristics. For tree-based models: performance typically peaks somewhere between 20 and 200 features, depending on training set size. The rule of thumb is that you need at least 10–100 training examples per feature to avoid overfitting. For neural networks: more features can be handled because the network learns its own feature weighting through attention or embedding layers. For linear models: regularization (LASSO/Ridge) keeps this manageable, but collinearity is a concern. From an operational standpoint: minimize features to what is needed for the target performance level. Every additional feature is a dependency that can fail, drift, or become unavailable. Prefer 50 robust features over 500 fragile ones.
Q: What is VIF and when do you use it?
A: Variance Inflation Factor (VIF) measures how much the variance of a feature's regression coefficient is inflated due to collinearity with other features. For feature $j$, $\text{VIF}_j = 1 / (1 - R_j^2)$, where $R_j^2$ is the R-squared from regressing feature $j$ on all other features. A VIF of 1 means no collinearity; 5–10 means moderate concern; above 10 means severe. Use VIF primarily for linear models and logistic regression, where collinearity destabilizes coefficient estimates and standard errors. For tree-based models, collinearity is less problematic because trees split on individual features at each node - two correlated features will just share splitting duties. VIF analysis is also useful for model interpretability: if two features are highly collinear, including both makes it harder to interpret their individual coefficients.
