
Numerical and Categorical Features

The Model That Underperformed on Purpose

The gradient boosting model was doing its job - just not particularly well. AUC of 0.71 on the test set. The data science team had spent three weeks tuning hyperparameters: learning rate, max depth, number of estimators, subsampling ratios. Every configuration yielded results in the 0.69–0.72 band. They had hit a ceiling, and nobody knew why.

The dataset was a standard churn prediction problem: 340,000 customer records with 47 raw features - account age, plan type, usage metrics, billing history, support ticket counts, demographic information. Gradient boosting is well-suited for this kind of structured data. The ceiling shouldn't be at 0.71.

A consultant was brought in. She spent the first two days not touching the model at all. She spent them looking at the features: their distributions, their relationships to the target, their missing value patterns. By day three, she had a list of ten interventions. By day five, the model was at 0.84.

She hadn't changed the model architecture. She hadn't changed the hyperparameters significantly. She had changed the features. The lesson that the team internalized that week: model tuning has diminishing returns; feature engineering has increasing returns. Most of the predictive signal in the dataset was present but inaccessible - encoded in a form the model couldn't efficiently use.

This lesson shows you what she did, and why each intervention worked.



Why This Exists: The Representation Gap

Machine learning algorithms - even the most sophisticated ones - operate on numbers. They cannot directly process the meaning of "Enterprise" vs. "SMB" in a plan type column. They cannot automatically detect that a support ticket count of 15 is anomalous for a user with 2 months of tenure but normal for a user with 5 years. They cannot extract the compound signal that exists in the interaction between high usage and a recent price increase.

Feature engineering is the act of bridging the representation gap: transforming raw data into a numerical form that makes the underlying patterns accessible to the learning algorithm. Good features are:

  • Predictive: correlated with the target variable
  • Stable: their distribution does not shift dramatically over time
  • Generalisable: they capture patterns that hold in new data, not quirks of the training set
  • Computable: they can be produced reliably in production at the required latency

The failure mode of bad features is not that the algorithm breaks - it is that the algorithm learns noise instead of signal, or fails to capture signal that exists in the raw data.


Historical Context

Feature engineering has been the dominant focus of applied machine learning for most of its history. The emergence of deep learning for images and text (roughly 2012–2016) reduced the need for manual feature engineering in those domains, because neural networks learn feature representations automatically from raw inputs.

For tabular data - which constitutes the majority of real-world ML applications - manual feature engineering remains valuable and often decisive. Gradient boosting algorithms (XGBoost, LightGBM, CatBoost) are extremely competitive on tabular benchmarks precisely because feature engineering can be applied effectively.

The systematic study of encoding strategies for categorical variables was formalized by Micci-Barreca (2001) with target encoding, and later extended by CatBoost's ordered target encoding (2018) to prevent leakage. The theoretical understanding of missing value mechanisms (MCAR/MAR/MNAR) comes from Rubin (1976), which remains the foundational framework for missing data analysis.


Core Concepts

Numerical Transformations

Raw numerical features often have distributions that are suboptimal for gradient boosting or linear models. The goal of numerical transformation is to produce features with distributions that make the learning task easier.

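Standard scaling (z-score normalization): Subtracts the mean, divides by the standard deviation. Result: mean 0, standard deviation 1. Important for distance-based algorithms (SVM, KNN) and for models fitted with regularization or gradient descent (regularized logistic regression, neural networks), where features on larger scales would otherwise dominate. Not required for tree-based models, but rarely harmful.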

$$z = \frac{x - \mu}{\sigma}$$

Min-max scaling: Maps to the [0, 1] range. Sensitive to outliers - a single extreme value compresses all other values. Use only when you know the feature has a hard minimum and maximum.
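
The outlier sensitivity described above is easy to see on a toy column. The following is a minimal sketch with made-up spend values (the numbers are illustrative assumptions, not data from the churn case):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical monthly spend values; the last one is an extreme outlier.
spend = np.array([[20.0], [25.0], [30.0], [35.0], [5000.0]])

minmax_scaled = MinMaxScaler().fit_transform(spend).ravel()
standard_scaled = StandardScaler().fit_transform(spend).ravel()

# Min-max: the four typical customers are squeezed into a sliver near 0,
# while the single outlier sits alone at 1.
print(np.round(minmax_scaled, 4))

# Standard scaling: mean 0 and standard deviation 1 by construction,
# but the outlier still dominates the spread.
print(np.round(standard_scaled, 3))
```

In both cases the typical values end up almost indistinguishable from one another, which is why a log transform or clipping is often applied before scaling a heavy-tailed feature.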

Log transform: For right-skewed distributions (income, transaction amounts, counts). Makes multiplicative relationships additive, which many algorithms handle better.

$$x' = \log(x + 1)$$

The +1 handles zero values. Use log1p in NumPy for numerical stability.
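
As a rough illustration of the effect on skewness, the sketch below applies `np.log1p` to synthetic log-normal "transaction amounts" with some zeros mixed in. The data is generated purely for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed amounts plus some exact zeros (assumed data).
amounts = pd.Series(
    np.concatenate([rng.lognormal(mean=3.0, sigma=1.0, size=5000), np.zeros(200)])
)

log_amounts = np.log1p(amounts)  # log(1 + x), defined at x = 0

print(f"skew before: {amounts.skew():.2f}")
print(f"skew after:  {log_amounts.skew():.2f}")
```

The zeros map to exactly 0 after the transform, so no rows are lost and no special-casing is needed.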

Box-Cox transform: Generalizes the log transform. Estimates the power $\lambda$ that makes the transformed data as close to normally distributed as possible, which in practice removes most of the skew:

$$x'(\lambda) = \begin{cases} \frac{x^\lambda - 1}{\lambda} & \lambda \neq 0 \\ \log(x) & \lambda = 0 \end{cases}$$

Requires strictly positive values. scipy.stats.boxcox estimates $\lambda$ by maximum likelihood when none is supplied.
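
A minimal sketch of fitting Box-Cox once and reusing the fitted power later, on synthetic strictly positive data (the data and variable names are assumptions for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=0.8, size=5000)  # strictly positive, right-skewed

# With no lmbda argument, boxcox returns the transformed data and the fitted power.
transformed, fitted_lambda = stats.boxcox(x)

print(f"fitted lambda: {fitted_lambda:.3f}")  # expected to be near 0 for log-normal data
print(f"skew before: {stats.skew(x):.2f}, after: {stats.skew(transformed):.2f}")

# At serving time, reuse the stored lambda instead of re-estimating it.
# With lmbda supplied, boxcox returns only the transformed array.
new_values = np.array([0.5, 1.0, 2.0])
new_transformed = stats.boxcox(new_values, lmbda=fitted_lambda)
```

Storing the fitted power alongside the model is what keeps training and serving transformations consistent.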

Binning (discretization): Converts a continuous feature into bins. Useful when the relationship between the feature and target is non-monotonic, or when you want to capture "round number" effects (users who pay \$9.99 behave differently from users who pay \$10.01).
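
The two common binning styles can be sketched as follows. The price values and bin edges are hypothetical, chosen so that the round-number boundary at 10 separates 9.99 from 10.01:

```python
import pandas as pd

prices = pd.Series([4.99, 9.99, 10.01, 19.99, 20.00, 49.00])

# Fixed-edge bins: edges chosen by hand (hypothetical price tiers).
# right=False makes each bin [low, high), so 9.99 and 10.01 land in different bins.
tier = pd.cut(
    prices,
    bins=[0, 10, 20, 50],
    right=False,
    labels=["under_10", "10_to_20", "20_to_50"],
)

# Quantile bins: each bin holds roughly the same number of rows.
tercile = pd.qcut(prices, q=3, labels=False)

print(pd.DataFrame({"price": prices, "tier": tier, "tercile": tercile}))
```

Fixed edges encode domain knowledge about where behaviour changes; quantile edges adapt to the data but must be computed on the training set and reused at serving time.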

import numpy as np
import pandas as pd
from scipy import stats
from sklearn.preprocessing import StandardScaler

def analyze_distribution(series: pd.Series) -> dict:
    """Diagnose the distribution of a numerical feature."""
    skewness = series.skew()
    kurtosis = series.kurtosis()
    pct_zeros = (series == 0).mean()
    pct_negative = (series < 0).mean()

    return {
        "skewness": skewness,
        "kurtosis": kurtosis,
        "pct_zeros": pct_zeros,
        "pct_negative": pct_negative,
        "recommendation": recommend_transform(skewness, pct_zeros, pct_negative)
    }

def recommend_transform(skewness: float, pct_zeros: float, pct_negative: float) -> str:
    """Rule-of-thumb choice of transform; the thresholds are heuristics, not laws."""
    if pct_negative > 0.05:
        return "standard_scaling"  # negative values preclude log/box-cox
    if abs(skewness) < 0.5:
        return "standard_scaling"  # already roughly symmetric
    if pct_zeros > 0.3:
        return "log1p"  # log(x+1) handles zeros
    if skewness > 1.0:
        return "box_cox_or_log"
    return "standard_scaling"

def apply_transforms(df: pd.DataFrame, numerical_cols: list) -> pd.DataFrame:
    """Add transformed copies of each numerical column, then scale the originals.

    Assumes the columns have no missing values (impute first).
    """
    df = df.copy()

    for col in numerical_cols:
        analysis = analyze_distribution(df[col])

        if analysis["recommendation"] == "log1p":
            df[f"{col}_log"] = np.log1p(df[col].clip(lower=0))

        elif analysis["recommendation"] == "box_cox_or_log":
            if df[col].min() > 0:
                # Strictly positive: Box-Cox applies directly
                transformed, _ = stats.boxcox(df[col])
                df[f"{col}_boxcox"] = transformed
            else:
                df[f"{col}_log"] = np.log1p(df[col].clip(lower=0))

        # Quantile binning (deciles; duplicate edges are dropped for tied values)
        df[f"{col}_bin"] = pd.qcut(
            df[col], q=10, labels=False, duplicates="drop"
        )

    # Standard scale all numerical columns together.
    # In a real pipeline, fit the scaler on the training split only and reuse it
    # for validation/test/serving data to avoid leaking their statistics.
    scaler = StandardScaler()
    df[numerical_cols] = scaler.fit_transform(df[numerical_cols])

    return df

Categorical Encoding

Categorical variables require transformation into numerical form. The choice of encoding method significantly affects model performance and can be the difference between the model learning real patterns and learning noise.

One-hot encoding: Creates a binary column for each unique value. Works well for low-cardinality categoricals (fewer than 20–30 unique values). Fails for high-cardinality categoricals - a column with 10,000 unique values becomes 10,000 sparse binary columns.

Ordinal encoding: Maps each category to an integer (0, 1, 2, ...). Appropriate when categories have a natural order ("low", "medium", "high"). Dangerous when applied to nominal categories - it implies an ordering that does not exist, introducing false structure.
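
A short sketch contrasting the two encodings on a toy frame. The `usage_tier` column is a hypothetical ordered feature added for illustration; `plan_type` is treated as nominal:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({
    "plan_type": ["SMB", "Enterprise", "Consumer", "SMB"],  # nominal: no order
    "usage_tier": ["low", "high", "medium", "low"],         # ordinal: low < medium < high
})

# One-hot: one binary column per plan, no ordering implied.
one_hot = pd.get_dummies(df["plan_type"], prefix="plan", dtype=int)

# Ordinal: pass the order explicitly instead of relying on alphabetical order.
ordinal = OrdinalEncoder(categories=[["low", "medium", "high"]])
df["usage_tier_ord"] = ordinal.fit_transform(df[["usage_tier"]]).ravel()

print(one_hot)
print(df[["usage_tier", "usage_tier_ord"]])
```

Spelling out the category order matters: left to its default, the encoder sorts categories alphabetically, which would place "high" below "low".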

Target encoding (mean encoding): Replaces each category with the mean of the target variable for that category. Extremely powerful for high-cardinality categoricals because it directly encodes predictive signal. Dangerous if implemented naively - it leaks the target into the feature and causes overfitting.

$$\text{encoded}(c) = \frac{\sum_{i: x_i = c} y_i + \alpha \cdot \bar{y}}{|\{i: x_i = c\}| + \alpha}$$

The smoothing parameter $\alpha$ prevents rare categories from having extreme encoded values. $\bar{y}$ is the global target mean. This is Micci-Barreca's smoothed target encoding.
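
To make the effect of $\alpha$ concrete, here is the formula evaluated for a rare and a common category. The churn rate, counts and $\alpha = 10$ are made-up numbers for illustration:

```python
def smoothed_target_mean(category_sum: float, category_count: int,
                         global_mean: float, alpha: float) -> float:
    """(sum of targets in category + alpha * global mean) / (count + alpha)."""
    return (category_sum + alpha * global_mean) / (category_count + alpha)

global_mean = 0.20  # hypothetical overall churn rate
alpha = 10.0

# Rare category: 2 customers, both churned. Raw mean is 1.0.
rare = smoothed_target_mean(2, 2, global_mean, alpha)

# Common category: 1,000 customers, 500 churned. Raw mean is 0.5.
common = smoothed_target_mean(500, 1000, global_mean, alpha)

print(f"rare:   {rare:.3f}")    # (2 + 2) / 12, pulled strongly toward 0.20
print(f"common: {common:.3f}")  # (500 + 2) / 1010, barely moved from 0.5
```

With only two observations the encoder mostly trusts the global mean; with a thousand it mostly trusts the category.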

Frequency encoding: Replaces each category with its frequency of occurrence. A simpler alternative to target encoding that doesn't leak the target. Useful when category frequency correlates with the target.
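
Frequency encoding takes only a few lines in pandas. A minimal sketch, assuming a toy `region` column and treating categories unseen in training as frequency 0 (one possible convention among several):

```python
import pandas as pd

train = pd.Series(["A", "B", "A", "C", "A", "B"], name="region")
test = pd.Series(["A", "C", "D"], name="region")  # "D" never appears in training

# Relative frequency of each category, computed on training data only.
freq_map = train.value_counts(normalize=True)

train_encoded = train.map(freq_map)
test_encoded = test.map(freq_map).fillna(0.0)  # unseen category -> 0 (assumed convention)

print(test_encoded.tolist())
```

Because the mapping never looks at the target, it can be fitted on the full training set without cross-fold protection.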

Embedding-based encoding: For very high cardinality categoricals (user IDs, product IDs), learn a low-dimensional dense vector representation. Requires a neural network component. Most practical in collaborative filtering and recommendation contexts.

import pandas as pd
from sklearn.model_selection import KFold

class SmoothedTargetEncoder:
    """
    Target encoding with additive smoothing, which shrinks rare categories
    toward the global target mean. fit_transform uses cross-fold encoding
    to prevent target leakage on training data.
    """
    def __init__(self, smoothing_alpha: float = 5.0, n_folds: int = 5):
        self.smoothing_alpha = smoothing_alpha
        self.n_folds = n_folds
        self.category_stats = {}  # category -> smoothed mean, fitted from training data
        self.global_mean = None

    def _smoothed_means(self, X: pd.Series, y: pd.Series, prior: float) -> dict:
        """Per-category smoothed mean: (sum + alpha * prior) / (count + alpha)."""
        stats = pd.DataFrame({"category": X, "target": y}).groupby("category").agg(
            count=("target", "count"),
            sum=("target", "sum")
        )
        smoothed = (
            (stats["sum"] + self.smoothing_alpha * prior) /
            (stats["count"] + self.smoothing_alpha)
        )
        return smoothed.to_dict()

    def fit(self, X: pd.Series, y: pd.Series) -> "SmoothedTargetEncoder":
        # Category-level statistics from the full training data
        self.global_mean = y.mean()
        self.category_stats = self._smoothed_means(X, y, self.global_mean)
        return self

    def transform(self, X: pd.Series) -> pd.Series:
        """For test/serving: use fitted category stats; unseen categories get the global mean."""
        return X.map(self.category_stats).fillna(self.global_mean)

    def fit_transform(self, X: pd.Series, y: pd.Series) -> pd.Series:
        """
        For training: use cross-fold encoding to prevent leakage.
        Each fold is encoded using statistics computed from other folds.
        """
        self.fit(X, y)  # full-data statistics, used later by transform()
        result = pd.Series(index=X.index, dtype=float)

        kf = KFold(n_splits=self.n_folds, shuffle=True, random_state=42)

        for train_idx, val_idx in kf.split(X):
            fold_X_train = X.iloc[train_idx]
            fold_y_train = y.iloc[train_idx]

            # Prior and category statistics come from the training fold only,
            # so no target value from the validation fold reaches its own encoding
            fold_prior = fold_y_train.mean()
            fold_mapping = self._smoothed_means(fold_X_train, fold_y_train, fold_prior)

            # Apply to validation fold (assign by position)
            encoded = X.iloc[val_idx].map(fold_mapping).fillna(fold_prior)
            result.iloc[val_idx] = encoded.to_numpy()

        return result


# High-cardinality encoding example: 5,000 unique zip codes.
# train_df and test_df are assumed to be existing DataFrames with
# "zip_code" and "churn" columns.
encoder = SmoothedTargetEncoder(smoothing_alpha=10.0, n_folds=5)
train_df["zip_code_encoded"] = encoder.fit_transform(
    train_df["zip_code"], train_df["churn"]
)
test_df["zip_code_encoded"] = encoder.transform(test_df["zip_code"])

Missing Value Handling: MCAR, MAR, MNAR

The correct imputation strategy depends on why values are missing. Rubin (1976) defines three mechanisms:

MCAR (Missing Completely At Random): Missingness has no relationship to any observed or unobserved variables. Example: a sensor fails at random intervals with no pattern. Safe to impute with mean/median. Tests: Little's MCAR test.

MAR (Missing At Random): Missingness depends on other observed variables but not on the missing variable itself. Example: older customers are less likely to fill out a digital form, so demographic data is missing more for older cohorts. Imputation should condition on the observed variables (predictive imputation, MICE). Do not use mean imputation - it introduces bias.

MNAR (Missing Not At Random): Missingness depends on the missing variable itself. Example: high-income users are less likely to report their income. The missing value is informative - its absence is itself a signal. Impute the value and create a binary is_missing indicator column.
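
The practical difference between the mechanisms shows up as soon as you compare observed means. The simulation below is a sketch on synthetic incomes: the distribution, the 30% missing rate and the rule that the top 30% of earners never report are all assumptions made for the demonstration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
income = pd.Series(rng.lognormal(mean=11.0, sigma=0.5, size=20_000))

# MCAR: every value has the same 30% chance of being missing.
mcar = income.mask(rng.random(len(income)) < 0.30)

# MNAR: the top 30% of earners never report their income.
mnar = income.mask(income > income.quantile(0.70))

print(f"true mean:           {income.mean():,.0f}")
print(f"observed mean, MCAR: {mcar.mean():,.0f}")  # close to the true mean
print(f"observed mean, MNAR: {mnar.mean():,.0f}")  # biased low
```

Under MCAR the observed values are a random sample, so mean or median imputation is roughly unbiased. Under MNAR the observed values are a distorted sample, which is why the missingness indicator carries information that the imputed value alone cannot.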

import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (must precede the IterativeImputer import)
from sklearn.impute import IterativeImputer

def classify_missingness(df: pd.DataFrame, target_col: str) -> dict:
    """
    Heuristic classification of missing value mechanism per column.

    These rules are a rough triage, not a statistical test: the mechanisms
    cannot be fully distinguished from observed data alone. Formal MCAR
    testing requires Little's test, which scikit-learn does not provide.
    """
    result = {}

    for col in df.columns:
        if col == target_col:
            continue
        if df[col].isnull().sum() == 0:
            continue

        missing_mask = df[col].isnull()
        missing_rate = missing_mask.mean()

        # Check if missingness correlates with target (MNAR heuristic)
        if target_col in df.columns:
            missing_target_rate = df.loc[missing_mask, target_col].mean()
            non_missing_target_rate = df.loc[~missing_mask, target_col].mean()
            target_diff = abs(missing_target_rate - non_missing_target_rate)
        else:
            target_diff = 0.0

        result[col] = {
            "missing_rate": missing_rate,
            "suspected_mechanism": (
                "MNAR" if target_diff > 0.05 else
                "MAR" if missing_rate > 0.05 else
                "MCAR"
            ),
            "recommended_strategy": (
                "impute_and_indicator" if target_diff > 0.05 else
                "mice_imputation" if missing_rate > 0.05 else
                "mean_median_imputation"
            )
        }

    return result


def apply_imputation_strategy(df: pd.DataFrame, missingness_report: dict) -> pd.DataFrame:
    """
    Apply the strategy recommended for each column. Assumes numeric feature
    columns and that the target column has been dropped from df beforehand.
    In production, fit medians and the imputer on training data only.
    """
    df = df.copy()

    mnar_cols = [col for col, info in missingness_report.items()
                 if info["suspected_mechanism"] == "MNAR"]
    mar_cols = [col for col, info in missingness_report.items()
                if info["suspected_mechanism"] == "MAR"]
    mcar_cols = [col for col, info in missingness_report.items()
                 if info["suspected_mechanism"] == "MCAR"]

    # MNAR: impute + add binary indicator
    for col in mnar_cols:
        df[f"{col}_was_missing"] = df[col].isnull().astype(int)
        df[col] = df[col].fillna(df[col].median())

    # MAR: MICE-style iterative imputation (IterativeImputer returns a single
    # imputation by default). Fit on all numeric columns so each MAR column is
    # predicted from the other observed features, then copy back the MAR columns.
    if mar_cols:
        numeric_cols = df.select_dtypes(include="number").columns.tolist()
        mice = IterativeImputer(max_iter=10, random_state=42)
        imputed = pd.DataFrame(
            mice.fit_transform(df[numeric_cols]),
            columns=numeric_cols, index=df.index
        )
        df[mar_cols] = imputed[mar_cols]

    # MCAR: simple median imputation
    for col in mcar_cols:
        df[col] = df[col].fillna(df[col].median())

    return df

Feature Crossing

Feature crossing creates new features by combining two or more existing features. The power comes from capturing interactions that neither feature captures alone. The classic example: user_age × product_category. A 25-year-old looking at electronics and a 65-year-old looking at electronics may have very different purchase propensities. The age feature alone and the category feature alone miss this interaction.

import pandas as pd

def create_interaction_features(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()

    # Ratio features - often more informative than raw values
    df["support_tickets_per_tenure_month"] = (
        df["support_tickets_last_year"] /
        (df["tenure_months"] + 1)  # +1 avoids division by zero
    )

    df["spend_per_login"] = (
        df["monthly_spend"] /
        (df["monthly_logins"] + 1)
    )

    # Categorical cross features - useful for tree-based models
    df["plan_x_region"] = df["plan_type"].astype(str) + "_" + df["region"].astype(str)

    # Polynomial features for numerical pairs (use sparingly - dimensionality explodes)
    df["age_times_tenure"] = df["customer_age"] * df["tenure_months"]
    df["spend_squared"] = df["monthly_spend"] ** 2

    # Threshold-based binary features (encode expert domain knowledge).
    # The 90th-percentile threshold is computed from df itself here; in production,
    # compute it once on training data and reuse the stored value.
    df["is_heavy_user"] = (df["monthly_logins"] > df["monthly_logins"].quantile(0.9)).astype(int)
    df["is_recent_price_increase_victim"] = (
        (df["price_changed_last_90d"] == 1) & (df["tenure_months"] < 12)
    ).astype(int)

    return df

The Systematic AUC 0.71 to 0.84 Journey

Here is the actual sequence of interventions from the opening scenario, and the AUC contribution of each:

| Intervention | AUC Before | AUC After | Delta |
| --- | --- | --- | --- |
| Baseline (raw features) | - | 0.71 | - |
| Log transform on skewed numericals | 0.71 | 0.73 | +0.02 |
| Target encode plan_type + region (5-fold) | 0.73 | 0.76 | +0.03 |
| Add is_missing indicators for MNAR columns | 0.76 | 0.78 | +0.02 |
| Ratio features (tickets/tenure, spend/login) | 0.78 | 0.81 | +0.03 |
| Quantile binning on tenure | 0.81 | 0.82 | +0.01 |
| Feature crossing (price_change × tenure) | 0.82 | 0.84 | +0.02 |

No hyperparameter changes. No architecture changes. Feature engineering alone.


Common Mistakes

:::danger Applying target encoding without cross-fold protection
Naively replacing a categorical with the mean of the target for that category, computed on the full training set, leaks the target into the feature. The model sees the target in its inputs. Cross-validation scores look great; production performance is poor. Always use cross-fold target encoding on training data, and use statistics fitted on the full training set for test/serving.
:::
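
The leak is easy to reproduce. In the sketch below, the category and the target are generated independently, so the category carries no real signal. All data is synthetic and the column names are placeholders. The naive encoding still scores well against the target it was computed from, while the out-of-fold encoding correctly looks like noise.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
n = 5000
df = pd.DataFrame({
    "zip_code": rng.integers(0, 2000, size=n),  # high cardinality, pure noise
    "churn": rng.integers(0, 2, size=n),        # target independent of zip_code
})

# Naive: category mean computed on the same rows it is applied to.
naive = df.groupby("zip_code")["churn"].transform("mean")

# Out-of-fold: each row is encoded from folds that exclude it.
oof = pd.Series(np.nan, index=df.index)
for fit_idx, enc_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(df):
    fit_part = df.iloc[fit_idx]
    mapping = fit_part.groupby("zip_code")["churn"].mean()
    encoded = df["zip_code"].iloc[enc_idx].map(mapping).fillna(fit_part["churn"].mean())
    oof.iloc[enc_idx] = encoded.to_numpy()

naive_auc = roc_auc_score(df["churn"], naive)
oof_auc = roc_auc_score(df["churn"], oof)
print(f"AUC of naive encoding vs. its own target: {naive_auc:.2f}")  # inflated
print(f"AUC of out-of-fold encoding:              {oof_auc:.2f}")    # near chance
```

Any apparent signal in the naive column comes from each row's own label being averaged into its feature value, which is exactly what disappears in production.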

:::danger Using ordinal encoding for nominal categories
Mapping "Enterprise", "SMB", "Consumer" to 2, 1, 0 implies that the categories are ordered and evenly spaced, with SMB sitting exactly halfway between Consumer and Enterprise. For tree-based models this is relatively harmless because trees split on thresholds. For linear models it introduces completely false structure. Reserve ordinal encoding for categories with a genuine ordering.
:::

:::warning Mean imputation for MAR or MNAR data
If missingness depends on other observed variables (MAR) or on the missing value itself (MNAR, e.g., high earners don't report income), imputing with the global mean systematically misrepresents the missing values. Use conditional imputation (MICE) for MAR data; for MNAR data, add a binary is_missing indicator alongside the imputed value.
:::

:::warning One-hot encoding high-cardinality categoricals
One-hot encoding a column with 10,000 unique values creates 10,000 sparse columns. Training on this is slow, and most columns will be near-zero in importance. Use frequency encoding, target encoding, or embeddings for high-cardinality categoricals.
:::


Interview Q&A

Q: When should you use target encoding vs. one-hot encoding for a categorical feature?

A: Use one-hot encoding for low-cardinality categoricals (under ~20 unique values) where you don't want to assume any relationship between category identity and the target. Use target encoding for high-cardinality categoricals (dozens to thousands of unique values) where one-hot would explode dimensionality. Target encoding is also more powerful for tree-based models because it directly encodes the predictive signal. The critical caveat: always implement cross-fold target encoding on training data to prevent target leakage. On test/serving data, use statistics fitted on the full training set.

Q: What are MCAR, MAR, and MNAR, and why does the distinction matter for imputation?

A: MCAR (Missing Completely At Random) means missingness has no pattern - simple mean/median imputation is valid. MAR (Missing At Random) means missingness depends on other observed variables - imputation should condition on those variables, using methods like MICE. MNAR (Missing Not At Random) means the missing variable's value determines whether it's missing - for example, sick people not reporting health scores. The distinction matters because wrong imputation strategy introduces bias. MNAR is especially dangerous: the missingness itself is a signal, so you should add a binary indicator column alongside any imputed value.

Q: What is feature crossing and when does it help?

A: Feature crossing creates new features by combining two or more existing features - typically multiplication, division, or string concatenation for categoricals. It helps when the relationship between a feature and the target depends on the value of another feature. A ratio like "support tickets per tenure month" captures something neither raw tickets nor raw tenure does alone - it normalizes ticket volume for how long the customer has been around. For tree-based models, crossing is less critical because the model can learn interactions through successive splits. For linear models, crossing is essential because linear models cannot learn interactions without explicit feature engineering. In practice, domain knowledge drives which crossings are worth creating - not automatic enumeration of all pairs.
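
The linear-model point can be checked with a deliberately extreme synthetic case. In this sketch the target depends only on the combination of two binary features (an XOR pattern), so neither feature is informative alone. The feature names echo the churn example but the data is invented, and real interactions are rarely this clean.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 4000
price_changed = rng.integers(0, 2, size=n)  # hypothetical binary feature
short_tenure = rng.integers(0, 2, size=n)   # hypothetical binary feature

# Synthetic target: 1 when exactly one of the two flags is set (XOR).
y = price_changed ^ short_tenure

X_raw = np.column_stack([price_changed, short_tenure])
X_crossed = np.column_stack([price_changed, short_tenure, price_changed * short_tenure])

def holdout_accuracy(X: np.ndarray) -> float:
    """Fit logistic regression on half the rows, score on the other half."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.5, random_state=0
    )
    return LogisticRegression().fit(X_train, y_train).score(X_test, y_test)

raw_accuracy = holdout_accuracy(X_raw)
crossed_accuracy = holdout_accuracy(X_crossed)
print(f"raw features:       {raw_accuracy:.2f}")
print(f"with cross feature: {crossed_accuracy:.2f}")
```

Without the product term, a linear boundary can get at most three of the four feature combinations right; with it, the pattern becomes linearly separable. A tree-based model would find the same split structure on the raw features by itself.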

Q: A numerical feature has 40% missing values. How do you handle it?

A: First, classify the missingness mechanism. Is there a pattern to who is missing this value? Compute the target mean for missing vs. non-missing - if they differ significantly, suspect MNAR. If missingness correlates with other observed features, suspect MAR. Based on the mechanism: for MCAR, median imputation is reasonable. For MAR, use MICE (iterative imputation conditioned on other observed features). For MNAR, impute the value (median or model-based) AND create a binary is_missing indicator column - the indicator encodes the information that the value was missing, which the model can learn from. Do not drop the column at 40% missingness unless there's a strong reason; the remaining 60% still contains signal.

Q: What is the Box-Cox transform and when would you use it over a simple log transform?

A: The Box-Cox transform generalizes the log transform by fitting a power $\lambda$ that makes the transformed distribution as close to normal as possible, which typically removes most of the skew. When $\lambda = 0$, it reduces to a log transform. The log transform is a fixed choice; Box-Cox finds the best choice for the specific distribution. Use Box-Cox when you want to rigorously reduce skewness and the data is strictly positive. Use log transform (log1p) when you need a simpler, more interpretable transformation, when data contains zeros (Box-Cox requires strictly positive values), or when you're building a production pipeline and need a stable, parameter-free transformation that doesn't require fitting.

© 2026 EngineersOfAI. All rights reserved.