Data Representation and Feature Spaces

Reading time: ~24 minutes | Level: ML Foundations | Role: MLE, ML Engineer, Data Scientist, Research Engineer

A healthcare startup built a readmission prediction model that achieved 0.76 AUC in the lab. When they deployed it, performance dropped to 0.61 AUC - a catastrophic 15-point gap.

The root cause had nothing to do with the model. It was the feature days_since_last_visit. In training, this feature was computed from a clean EHR database with complete history, so patients with no recent visits got a large number reflecting their full history (or a mean imputation from a different distribution). In production, the feature server had a 30-day rolling window: patients with no visits in the last 30 days got NaN, which was imputed with 0. A value of 0 therefore meant "visited today" in training but "no visit in over 30 days" in production. The same feature meant different things at training time and serving time.
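
The mismatch is easy to reproduce in a few lines. The sketch below is a hypothetical reconstruction, not the startup's code: the function names, the visit dates, and the `fill_value=0` default are assumptions chosen to mirror the description above.

```python
from datetime import date

def days_since_last_visit_training(visits, as_of):
    """Training path (assumed): the full visit history is available."""
    past = [v for v in visits if v <= as_of]
    return (as_of - max(past)).days if past else None

def days_since_last_visit_serving(visits, as_of, window_days=30, fill_value=0):
    """Serving path (assumed): only a rolling window is visible; missing -> fill_value."""
    recent = [v for v in visits if 0 <= (as_of - v).days <= window_days]
    return (as_of - max(recent)).days if recent else fill_value

as_of = date(2024, 6, 1)
lapsed_patient = [date(2024, 1, 10)]   # last visit 143 days ago
recent_patient = [date(2024, 5, 27)]   # last visit 5 days ago

train_lapsed = days_since_last_visit_training(lapsed_patient, as_of)
serve_lapsed = days_since_last_visit_serving(lapsed_patient, as_of)
train_recent = days_since_last_visit_training(recent_patient, as_of)
serve_recent = days_since_last_visit_serving(recent_patient, as_of)

print(f"lapsed patient: train={train_lapsed}, serve={serve_lapsed}")
print(f"recent patient: train={train_recent}, serve={serve_recent}")
```

The two paths agree for the recent patient and disagree completely for the lapsed one, which is exactly the population a readmission model cares about.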

This is a representation problem. Before any model architecture decision, before any hyperparameter tuning, the fundamental engineering question is: how do you transform raw data into a vector that the model can learn from - consistently, in both training and production?

What You Will Learn

What a feature space is geometrically and why it matters
How to represent tabular, text, image, time-series, and graph data as vectors
The curse of dimensionality - intuition and mathematical grounding
Feature engineering vs. representation learning
Normalization, standardization, and encoding strategies
Python: building production-quality feature vectors with sklearn

Part 1 - What is a Feature Space?

Every ML model operates on a feature vector: a finite-dimensional real-valued vector $x \in \mathbb{R}^d$ where $d$ is the number of features.

The feature space is the set of all possible feature vectors, $\mathbb{R}^d$. Every training example is a point in this space. Every prediction asks: "given this point in feature space, what is the output?"

Feature space for a 2D problem (d=2):

feature_2
    │     ●  ●           ● = class A
  3 ┤  ●     ●   ○       ○ = class B
    │   ●  ○        ○
  2 ┤      ○  ●  ○
    │  ○      ○   ●
  1 ┤    ○  ○   ●
    │
  0 ┼──────┬──────┬────
    0      1      2      feature_1

The model learns a decision boundary in this space.

In high dimensions (d = 512, d = 10,000), this space is not visualizable, but the geometry still governs how models learn.

Why the representation matters

The same physical measurement can be represented many ways, and the representation determines what the model can learn:

import numpy as np

# Example: encoding "day of week" for a demand forecasting model

# Bad representation: integer encoding
# Monday=0, Tuesday=1, ..., Sunday=6
# This tells the model that Sunday (6) is "larger than" Monday (0)
# and that the distance Monday→Tuesday equals Tuesday→Wednesday
# Neither is meaningful for demand
day_integer = np.array([0, 1, 2, 3, 4, 5, 6])

# Better: one-hot encoding
# Each day is its own binary feature - no ordinal relationship implied
day_onehot = np.eye(7)  # 7x7 identity matrix
# Monday = [1,0,0,0,0,0,0], Sunday = [0,0,0,0,0,0,1]

# Best for capturing periodicity: cyclical encoding
# Projects day onto a circle: the distance Mon→Sun equals Mon→Tue
day_sin = np.sin(2 * np.pi * day_integer / 7)
day_cos = np.cos(2 * np.pi * day_integer / 7)
# Now Sunday and Monday are geometrically close (they wrap around)

print("Integer encoding: Sunday - Monday =", 6 - 0)          # 6 (meaningless)
print(f"Cyclical: Monday = ({day_sin[0]:.3f}, {day_cos[0]:.3f})")
print(f"Cyclical: Sunday = ({day_sin[6]:.3f}, {day_cos[6]:.3f})")
# Monday and Sunday are close in cyclical encoding
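
To check the geometric claim numerically, the following dependency-free sketch measures distances between encoded days using only the standard library. The helper name `cyclical` is illustrative and not part of the code above.

```python
import math

def cyclical(day: int, period: int = 7) -> tuple:
    """Map an integer day index onto the unit circle as (sin, cos)."""
    angle = 2 * math.pi * day / period
    return (math.sin(angle), math.cos(angle))

mon, tue, thu, sun = (cyclical(d) for d in (0, 1, 3, 6))

d_mon_tue = math.dist(mon, tue)   # adjacent days
d_mon_sun = math.dist(mon, sun)   # adjacent across the week boundary
d_mon_thu = math.dist(mon, thu)   # three days apart

print(f"Mon-Tue: {d_mon_tue:.3f}  Mon-Sun: {d_mon_sun:.3f}  Mon-Thu: {d_mon_thu:.3f}")
```

Mon-Tue and Mon-Sun both equal the chord length 2·sin(π/7) ≈ 0.868, while Mon-Thu is larger. With integer encoding, Mon-Sun would have been the largest gap of all.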

Part 2 - Data Types and Their Representations

Tabular Data

Tabular data (the most common type in production ML) contains:

Numerical continuous: age, income, temperature → often normalize
Numerical ordinal: rating (1–5), education level → may encode as ordinal or ordinal + powers
Categorical nominal: country, product category → one-hot or embedding
Binary: is_active, has_churned → 0/1
Datetime: timestamps → extract cyclical features, time deltas

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, OrdinalEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Sample customer data
data = pd.DataFrame({
    'age': [23, 45, 31, 67, 28],
    'income': [35000, 120000, 55000, 85000, 42000],
    'country': ['US', 'UK', 'DE', 'US', 'FR'],
    'education': ['high_school', 'graduate', 'college', 'graduate', 'college'],
    'is_premium': [0, 1, 0, 1, 0],
    'signup_month': [1, 6, 3, 11, 8],
})

# Education has natural ordering
education_order = ['high_school', 'college', 'graduate', 'phd']

# Build a ColumnTransformer: different encoding for each feature type
preprocessor = ColumnTransformer(
    transformers=[
        # Numerical continuous: z-score standardization
        ('num', StandardScaler(), ['age', 'income']),
        # Nominal categorical: one-hot (drop first to avoid multicollinearity)
        ('cat_nom', OneHotEncoder(drop='first', sparse_output=False), ['country']),
        # Ordinal categorical: map to integers that preserve order
        ('cat_ord', OrdinalEncoder(
            categories=[education_order]
        ), ['education']),
        # Binary: pass through unchanged
        ('binary', 'passthrough', ['is_premium']),
    ]
)

# Cyclical encoding for month (computed manually, then appended to the transformer output)
month_angle = 2 * np.pi * data['signup_month'] / 12
month_cyclical = np.column_stack([np.sin(month_angle), np.cos(month_angle)])

X_encoded = preprocessor.fit_transform(data.drop(columns=['signup_month']))
X_transformed = np.hstack([X_encoded, month_cyclical])
print(f"Original features: {data.shape[1]}")                               # 6
# 2 scaled numerics + 3 one-hot columns + 1 ordinal + 1 binary + 2 cyclical
print(f"Transformed feature vector dimension: {X_transformed.shape[1]}")   # 9

Text Data

Text must be converted to vectors. The representation depends on whether you need bag-of-words features or semantic embeddings:

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
import numpy as np

corpus = [
    "machine learning models require careful evaluation",
    "deep learning models are a subset of machine learning",
    "careful evaluation prevents poor model deployment",
    "model deployment requires monitoring and evaluation",
]

# Bag of Words: count occurrence of each word in vocabulary
bow = CountVectorizer(max_features=20)
X_bow = bow.fit_transform(corpus)
# max_features is an upper bound: this corpus has only 17 distinct tokens
# (the default tokenizer drops single-character tokens such as "a")
print(f"BoW matrix shape: {X_bow.shape}")   # (4, 17)
print(f"Vocabulary: {list(bow.vocabulary_.keys())[:10]}")

# TF-IDF: downweight words that appear in many documents
# Textbook form: TF-IDF(t, d) = TF(t, d) * log(N / df(t))
# (sklearn uses a smoothed IDF and L2-normalizes each row by default)
# Words like "the" appear everywhere → low IDF → low weight
# Domain-specific words appear rarely → high IDF → high weight
tfidf = TfidfVectorizer(max_features=100, ngram_range=(1, 2))
X_tfidf = tfidf.fit_transform(corpus)
# Again bounded by the corpus: 17 unigrams + 18 bigrams = 35 columns
print(f"TF-IDF matrix shape: {X_tfidf.shape}")   # (4, 35)

# For semantic search and modern NLP: use transformer embeddings
# (conceptual - requires the sentence-transformers library)
# from sentence_transformers import SentenceTransformer
# model = SentenceTransformer('all-MiniLM-L6-v2')
# embeddings = model.encode(corpus)  # shape: (4, 384)
# Each sentence → 384-dim dense vector encoding semantic meaning
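
To make the TF-IDF formula concrete, here is a minimal hand computation of the textbook form TF(t, d) * log(N / df(t)). It is a sketch under simplifying assumptions: whitespace tokenization, term frequency normalized by document length, and no smoothing, so the numbers will not match `TfidfVectorizer` output.

```python
import math
from collections import Counter

docs = [
    "machine learning models require careful evaluation",
    "deep learning models are a subset of machine learning",
    "careful evaluation prevents poor model deployment",
    "model deployment requires monitoring and evaluation",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
df = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf(term: str, tokens: list) -> float:
    tf = tokens.count(term) / len(tokens)   # relative frequency in this document
    idf = math.log(n_docs / df[term])       # rarer across documents -> larger
    return tf * idf

w_evaluation = tfidf("evaluation", tokenized[0])  # appears in 3 of 4 documents
w_require = tfidf("require", tokenized[0])        # appears in 1 of 4 documents

print(f"evaluation: {w_evaluation:.4f}  require: {w_require:.4f}")
```

Both terms occur once in the first document, so they share the same TF; the weight difference comes entirely from the IDF factor.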

Image Data

Images are represented as pixel tensors. The representation choices:

import numpy as np

# Raw pixel representation
# A 224x224 RGB image = 224 * 224 * 3 = 150,528 features
# This is too high-dimensional for most traditional ML algorithms
image_flat = np.random.randint(0, 256, (224, 224, 3)).flatten()
print(f"Raw pixel features: {len(image_flat)}")  # 150,528

# CNNs process images in their natural spatial structure
# Input: (batch, channels, height, width) = (N, 3, 224, 224) in PyTorch
import torch
batch_images = torch.randn(4, 3, 224, 224)  # 4 images, RGB, 224x224

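# The full ResNet-50 classifier outputs 1000 ImageNet logits per image.
# Replacing its final fully connected layer exposes the 2048-dim pooled features,
# which are far more useful than raw pixels for most tasks
# (conceptual - requires torchvision >= 0.13 and downloads pretrained weights)
# import torchvision.models as models
# resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT).eval()
# logits = resnet(batch_images)      # shape: (4, 1000) for classification
# resnet.fc = torch.nn.Identity()    # drop the classification head
# features = resnet(batch_images)    # shape: (4, 2048)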

# Normalization: ImageNet mean/std for pretrained models
imagenet_mean = np.array([0.485, 0.456, 0.406])
imagenet_std = np.array([0.229, 0.224, 0.225])
# Always normalize when using pretrained models
# normalized = (image / 255.0 - mean) / std
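
The normalization formula in the last comment can be sketched with NumPy broadcasting. The random array is a stand-in for a decoded image, and the height-width-channel input layout is an assumption; real loaders differ.

```python
import numpy as np

imagenet_mean = np.array([0.485, 0.456, 0.406])
imagenet_std = np.array([0.229, 0.224, 0.225])

rng = np.random.default_rng(0)
# Stand-in for a decoded RGB image in (height, width, channels) layout
image_hwc = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)

scaled = image_hwc.astype(np.float32) / 255.0            # pixel values in [0, 1]
normalized = (scaled - imagenet_mean) / imagenet_std     # broadcasts over the channel axis
image_chw = np.transpose(normalized, (2, 0, 1))          # PyTorch-style (C, H, W)

print(image_chw.shape)
```

Because the mean and std arrays have shape `(3,)`, they line up with the trailing channel axis, so each channel is shifted and scaled by its own statistics.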

Time-Series Data

Time-series requires careful feature extraction that respects temporal ordering:

import numpy as np
import pandas as pd

def extract_ts_features(series: np.ndarray, window_sizes: tuple = (7, 30, 90)) -> dict:
    """
    Extract statistical features from a time series window.
    Used for tabular ML on time-series data (before sequence models).
    """
    features = {}

    # Statistical moments
    features['mean'] = np.mean(series)
    features['std'] = np.std(series)
    features['min'] = np.min(series)
    features['max'] = np.max(series)
    features['skewness'] = float(pd.Series(series).skew())

    # Trend features
    features['trend_slope'] = np.polyfit(np.arange(len(series)), series, 1)[0]
    features['last_value'] = series[-1]
    features['change_from_start'] = series[-1] - series[0]

    # Rolling window statistics
    for w in window_sizes:
        if len(series) >= w:
            window = series[-w:]
            features[f'mean_{w}d'] = np.mean(window)
            features[f'std_{w}d'] = np.std(window)
            features[f'max_{w}d'] = np.max(window)

    # Lag features: value at specific past time steps
    for lag in [1, 7, 14, 30]:
        if len(series) > lag:
            features[f'lag_{lag}'] = series[-lag - 1]

    return features

# Example: daily sales time series
daily_sales = np.random.exponential(100, size=365) + np.sin(np.arange(365) / 365 * 2 * np.pi) * 20
features = extract_ts_features(daily_sales)
print(f"Extracted {len(features)} features from time series")
print(f"Trend slope: {features['trend_slope']:.4f}")
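
Respecting temporal ordering also applies when building a training table: features for time step t should use only values before t, and the train/test split should be chronological rather than random. A minimal sketch, with the helper name `make_lagged_rows` and the toy series as illustrative assumptions:

```python
def make_lagged_rows(series, lags=(1, 7)):
    """Build (features, target) pairs where features only use values before t."""
    max_lag = max(lags)
    rows = []
    for t in range(max_lag, len(series)):
        features = {f"lag_{lag}": series[t - lag] for lag in lags}
        rows.append((features, series[t]))
    return rows

series = [float(v) for v in range(20)]          # toy series: 0.0, 1.0, ..., 19.0
rows = make_lagged_rows(series, lags=(1, 7))    # the first 7 steps have no full history

# Chronological split: the earliest 80% of rows train, the latest 20% test
split = int(len(rows) * 0.8)
train_rows, test_rows = rows[:split], rows[split:]

print(rows[0])
print(len(train_rows), len(test_rows))
```

A random shuffle before splitting would let the model train on rows that come after its test rows, which overstates offline performance.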

Part 3 - The Curse of Dimensionality

As the number of features $d$ increases, the geometry of the feature space changes in ways that break many ML algorithms' assumptions.

Mathematical intuition

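Nearest neighbor distance concentration: In $\mathbb{R}^d$, for many common data distributions (for example, independent coordinates), the ratio of the distance to the farthest neighbor $d_{max}$ to the distance to the nearest neighbor $d_{min}$ approaches 1 as $d \to \infty$, so the relative contrast between them vanishes: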

$\lim_{d \to \infty} \frac{d_{max} - d_{min}}{d_{min}} = 0$

This means all points look equally far away - nearest neighbor becomes meaningless.

Volume concentration near the surface: The fraction of the volume of a unit $d$-dimensional ball that lies within a distance $\epsilon$ of the surface is $1 - (1-\epsilon)^d \to 1$ as $d$ grows (for any fixed $0 < \epsilon < 1$). Essentially all the volume is in a thin shell.

Exponential data sparsity: To maintain the same sample density as dimensions increase, data requirements grow exponentially. If you need 10 samples per unit interval in 1D, you need $10^d$ samples in $d$ dimensions.
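
Both formulas are easy to evaluate directly. The sketch below tabulates the shell fraction $1 - (1-\epsilon)^d$ for an assumed $\epsilon = 0.05$ and the $10^d$ sample requirement; the function names are illustrative.

```python
def shell_fraction(d: int, eps: float) -> float:
    """Fraction of a unit d-ball's volume within eps of its surface."""
    return 1.0 - (1.0 - eps) ** d

def samples_needed(d: int, per_axis: int = 10) -> int:
    """Samples required to keep `per_axis` points per unit interval along each axis."""
    return per_axis ** d

for d in (1, 2, 10, 50, 100):
    print(f"d={d:>3}  shell fraction={shell_fraction(d, 0.05):.4f}  "
          f"samples needed={samples_needed(d):.1e}")
```

At d = 100, more than 99% of the volume already sits in the outer 5% shell, and the sample requirement has long passed anything a real dataset can supply.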

import numpy as np

# Visualize curse of dimensionality: distance concentration
np.random.seed(42)
n_samples = 1000

print("Curse of dimensionality - distance concentration")
print(f"{'Dims':>6} {'min_dist':>10} {'max_dist':>10} {'ratio':>10} {'relative_std':>14}")
print("-" * 55)

for d in [2, 5, 10, 50, 100, 500, 1000]:
    # Sample n_samples points in d-dimensional unit hypercube
    points = np.random.uniform(0, 1, (n_samples, d))
    # Compute distances from first point to all others
    distances = np.linalg.norm(points[1:] - points[0], axis=1)
    dmin = distances.min()
    dmax = distances.max()
    ratio = (dmax - dmin) / (dmin + 1e-10)
    rel_std = distances.std() / distances.mean()

    print(f"{d:>6} {dmin:>10.4f} {dmax:>10.4f} {ratio:>10.4f} {rel_std:>14.6f}")

# Output shows ratio → 0 and relative_std → 0 as d grows
# At d=1000, all points appear approximately the same distance away

Illustrative output (your exact values may differ):

  Dims   min_dist   max_dist      ratio   relative_std
-------------------------------------------------------
     2     0.0198     1.3241    65.8989       0.445123
     5     0.2547     2.1068     7.2738       0.193264
    10     0.5711     2.8947     4.0700       0.112034
    50     2.5124     5.2943     1.1067       0.039258
   100     3.8901     6.5123     0.6740       0.025201
   500     9.7234    12.3401     0.2692       0.009847
  1000    13.9482    16.2834     0.1675       0.006543

Practical consequences

KNN degrades in high dimensions: When all points appear equidistant, "nearest neighbor" is essentially random. Use KNN only in low-dimensional spaces (roughly $d < 20$) or after dimensionality reduction.

More features is not always better: Adding irrelevant features to a high-dimensional problem can hurt performance by increasing sparsity and noise. Feature selection and regularization are necessary.

Linear models degrade gracefully: Regularized linear models (Ridge, Lasso) are more robust to high dimensions than distance-based methods, because they learn a global linear separator rather than relying on local geometry.

Deep learning with high dimensions: CNNs and transformers are designed to handle high-dimensional inputs (images, text) by learning structured representations that exploit spatial/sequential relationships. They do not directly operate on raw high-dimensional vectors as generic feature vectors.

:::danger High-dimensional KNN in production
Using KNN with raw features in more than ~20 dimensions produces unreliable nearest-neighbor rankings. If you must use KNN in high dimensions, apply PCA or an embedding model first to reduce to a lower-dimensional representation where distances are meaningful.
:::
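
One way to follow this advice is to put the reduction step and the KNN model in a single sklearn pipeline. This is a minimal sketch on synthetic data; the choice of 10 components and 5 neighbors is an assumption for illustration, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic problem: 200 raw features, only 10 of them informative
X, y = make_classification(n_samples=1000, n_features=200,
                           n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Scale, project to 10 dimensions, then run KNN in the reduced space
knn_reduced = make_pipeline(
    StandardScaler(),
    PCA(n_components=10),
    KNeighborsClassifier(n_neighbors=5),
)
knn_reduced.fit(X_train, y_train)

acc = knn_reduced.score(X_test, y_test)
reduced_dim = knn_reduced[:-1].transform(X_test).shape[1]  # output of scaler + PCA
print(f"KNN operates on {reduced_dim} dimensions, test accuracy {acc:.3f}")
```

Whether the reduced pipeline beats KNN on the raw features depends on the data, so compare both on a held-out set before committing to either.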

Part 4 - Feature Engineering vs. Representation Learning

There are two philosophies for creating useful feature representations:

Feature engineering: domain experts manually design features that capture relevant structure. Requires deep domain knowledge. Produces interpretable features. Cannot adapt automatically to new patterns.

Representation learning: the model learns to extract useful features from raw data. Requires more data. Produces opaque features. Can discover unexpected structure.

                Feature Engineering
Raw Data → [Domain Expert Design] → Feature Vector → Model

                Representation Learning
Raw Data → [End-to-End Model] → Implicit Features → Output

In practice, both approaches are used:

Tabular data: Feature engineering often dominates (time deltas, ratios, cross-products)
Images, text, audio: Representation learning with pretrained models dominates
Hybrid: Engineered features + learned embeddings concatenated into a single vector

import numpy as np
import pandas as pd

# Hybrid approach: engineered features + learned embeddings
# Common in e-commerce, fraud detection, search

def build_hybrid_features(
    user_id: int,
    item_id: int,
    context: dict,
    user_embedding: np.ndarray,    # learned from interaction history
    item_embedding: np.ndarray,    # learned from item content
) -> np.ndarray:
    """
    Combine engineered features with learned embeddings.
    This hybrid approach often outperforms either alone.
    """
    # Engineered features
    engineered = np.array([
        context['hour_of_day'] / 24.0,             # normalized hour
        float(context['is_weekend']),               # binary
        np.log1p(context['user_session_length']),  # log-transformed count
        context['days_since_last_visit'] / 365.0,  # normalized time delta
    ])

    # Interaction feature: user-item cosine similarity
    cos_sim = np.dot(user_embedding, item_embedding) / (
        np.linalg.norm(user_embedding) * np.linalg.norm(item_embedding) + 1e-10
    )

    # Concatenate everything into one feature vector
    feature_vector = np.concatenate([
        engineered,                           # [4,]
        user_embedding,                       # [64,]
        item_embedding,                       # [64,]
        [cos_sim],                            # [1,]
    ])

    return feature_vector  # [133,]

# Each component contributes different signal:
# Engineered features: explicit domain knowledge (time of day matters for CTR)
# User embedding: learned user preferences from behavior history
# Item embedding: learned item semantics from content/interactions
# Cosine sim: explicit user-item affinity

Part 5 - Normalization and Standardization

Scaling features is not cosmetic. For many algorithms, unscaled features directly degrade performance.

Why scaling matters

Gradient-based optimizers: If feature 1 ranges over [0, 100,000] and feature 2 ranges over [0, 1], the gradient of the loss with respect to feature 1's weight can be up to ~100,000x larger in magnitude, because each weight's gradient is proportional to the value of its feature. No single learning rate suits both directions: a rate small enough to keep feature 1's weight stable makes progress on feature 2's weight very slow, so convergence slows dramatically.

Distance-based algorithms (KNN, K-Means, SVM with RBF kernel): Distance is dominated by features with large scales. If income (0–200,000) is in the same feature vector as age (0–100), income drives the distance metric and age is effectively ignored.

Tree-based models: Decision trees and gradient boosted trees split on thresholds - they are scale-invariant. You do not need to scale for Random Forests or XGBoost.
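
The first two effects can be seen with toy numbers. The sketch below is dependency-free; the data points and the assumed feature ranges (age 0-100, income 0-200,000) are made up for illustration.

```python
import math

# --- Gradient scale: MSE gradient of a linear model with all weights at zero ---
x_big = [20000.0, 80000.0, 50000.0, 100000.0]   # income-like feature
x_small = [0.9, 0.1, 0.5, 0.3]                   # feature already in [0, 1]
y = [1.0, 0.0, 1.0, 0.0]

def mse_grad(feature, targets, preds):
    """d/dw of mean((pred - y)^2) for the weight attached to `feature`."""
    n = len(targets)
    return sum(2 * (p - t) * x for x, t, p in zip(feature, targets, preds)) / n

preds = [0.0] * len(y)                    # predictions when every weight is zero
g_big = mse_grad(x_big, y, preds)         # -35000.0
g_small = mse_grad(x_small, y, preds)     # about -0.7

lo, hi = min(x_big), max(x_big)
x_big_scaled = [(x - lo) / (hi - lo) for x in x_big]
g_big_scaled = mse_grad(x_big_scaled, y, preds)   # same order of magnitude as g_small

# --- Distance domination: (age, income) pairs ---
a = (25, 50_000)
b = (65, 51_000)   # 40 years older, similar income
c = (26, 60_000)   # same age, noticeably higher income

raw_ab, raw_ac = math.dist(a, b), math.dist(a, c)

def rescale(p):
    """Divide each feature by its assumed range."""
    return (p[0] / 100, p[1] / 200_000)

scaled_ab = math.dist(rescale(a), rescale(b))
scaled_ac = math.dist(rescale(a), rescale(c))

print(f"gradients: big={g_big:.1f}, small={g_small:.2f}, big after scaling={g_big_scaled:.4f}")
print(f"raw:    d(a,b)={raw_ab:.1f}  d(a,c)={raw_ac:.1f}")
print(f"scaled: d(a,b)={scaled_ab:.3f}  d(a,c)={scaled_ac:.3f}")
```

On raw features, b looks like a's nearest neighbor despite the 40-year age gap, because income differences swamp everything else. After rescaling, c becomes the nearer point.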

Standardization vs. Normalization

Standardization (Z-score scaling): Maps features to mean 0, standard deviation 1.

$z = \frac{x - \mu}{\sigma}$

Use when: the feature distribution is approximately Gaussian, or for algorithms that are sensitive to feature scale or variance (regularized logistic regression, linear SVM, PCA).

Min-Max Normalization: Maps features to [0, 1] range.

$z = \frac{x - x_{min}}{x_{max} - x_{min}}$

Use when: the feature has a bounded range and the distribution is roughly uniform, or for neural networks (bounded activation functions benefit from bounded inputs).

Robust Scaling: Uses median and IQR instead of mean and std.

$z = \frac{x - \text{median}}{IQR}$

Use when: the feature has outliers. StandardScaler is sensitive to outliers; RobustScaler is much less so.
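
A worked example of the three formulas on five numbers shows how differently they treat a single outlier. This is a hand computation using the standard library, not sklearn; it assumes the population standard deviation and linearly interpolated quartiles, which is how I understand sklearn's defaults to behave.

```python
import statistics

values = [1.0, 2.0, 3.0, 4.0, 100.0]   # four typical points and one outlier

# Standardization: z = (x - mu) / sigma
mu = statistics.fmean(values)           # 22.0, pulled upward by the outlier
sigma = statistics.pstdev(values)       # population standard deviation
z_scores = [(x - mu) / sigma for x in values]

# Min-max normalization: z = (x - min) / (max - min)
lo, hi = min(values), max(values)
min_max = [(x - lo) / (hi - lo) for x in values]

# Robust scaling: z = (x - median) / IQR
q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
robust = [(x - median) / (q3 - q1) for x in values]

for name, scaled in [("z-score", z_scores), ("min-max", min_max), ("robust", robust)]:
    print(f"{name:>8}: " + "  ".join(f"{v:8.3f}" for v in scaled))
```

With z-scores and min-max, the four typical points are squeezed into a narrow band because the outlier sets the scale. With robust scaling they keep a usable spread of -1 to 0.5, and the outlier stands out at 48.5.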

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Example: income data with outliers
np.random.seed(42)
income = np.concatenate([
    np.random.normal(60000, 20000, 990),  # typical incomes
    np.array([500000, 800000, 1200000, 2000000, 10000000] * 2)  # outliers
])
income = income.reshape(-1, 1)

# StandardScaler: outliers skew the mean and std
ss = StandardScaler()
income_standard = ss.fit_transform(income)
print(f"StandardScaler: mean={income_standard.mean():.3f}, std={income_standard.std():.3f}")
print(f"  Max (outlier): {income_standard.max():.1f} std deviations above mean")

# RobustScaler: uses median and IQR, largely unaffected by a few outliers
rs = RobustScaler()
income_robust = rs.fit_transform(income)
print(f"RobustScaler:   median={np.median(income_robust):.3f}")
print(f"  Max (outlier): {income_robust.max():.1f} IQRs above median")

# For neural networks: log-transform skewed features first, then StandardScaler
income_log = np.log1p(income)
income_log_standard = StandardScaler().fit_transform(income_log)
print(f"Log + Standard: max={income_log_standard.max():.1f}")
# Much more reasonable scale - outliers are less extreme after log transform

Encoding categorical variables

| Encoding | When to use | Cardinality | Notes |
|---|---|---|---|
| One-hot | Low cardinality (<20), linear models | Low | Creates sparse vectors |
| Ordinal | Natural ordering exists (size: S/M/L/XL) | Any | Imposes ordering - use with care |
| Target encoding | High cardinality, tree models | High | Risk of leakage - apply within CV folds |
| Embedding | Very high cardinality (user ID, product ID) | Very high | Learned, dense, captures relationships |
| Hashing | Extreme cardinality, memory-constrained | Extreme | Collisions, but works at scale |
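
Hashing is the one row in this table that needs no fitted vocabulary. The following is a minimal sketch of the idea with illustrative helper names; it uses MD5 only as a stable hash (Python's built-in `hash()` is randomized per process for strings), and sklearn's `FeatureHasher` is the production-grade alternative.

```python
import hashlib

def hash_bucket(value: str, n_buckets: int = 16) -> int:
    """Map a category string to a stable bucket index in [0, n_buckets)."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

def hash_encode(value: str, n_buckets: int = 16) -> list:
    """One-hot vector over hash buckets; no vocabulary needs to be stored."""
    vec = [0] * n_buckets
    vec[hash_bucket(value, n_buckets)] = 1
    return vec

encoded = hash_encode("user_123")

# 1000 distinct IDs share 16 buckets, so collisions are unavoidable
buckets_used = {hash_bucket(f"user_{i}") for i in range(1000)}
print(f"{len(buckets_used)} buckets cover 1000 distinct IDs")
```

The vector width is fixed up front and unseen categories need no special handling; the price is that unrelated categories sometimes share a bucket.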

import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.model_selection import KFold

# Target encoding with cross-validation (prevents leakage)
def target_encode_cv(X_train: np.ndarray, y_train: np.ndarray,
                     X_test: np.ndarray, n_splits: int = 5) -> tuple:
    """
    Target encoding with K-fold to prevent leakage.

    Naive target encoding (mean(y) per category) uses the target
    to create features, which leaks if done naively on the whole train set.
    Solution: use out-of-fold means.
    """
    categories = np.unique(X_train)
    global_mean = y_train.mean()

    # Encode training set using out-of-fold means
    X_train_encoded = np.zeros(len(X_train))
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)

    for fold_train_idx, fold_val_idx in kf.split(X_train):
        fold_means = {}
        for cat in categories:
            cat_mask = X_train[fold_train_idx] == cat
            if cat_mask.sum() > 0:
                fold_means[cat] = y_train[fold_train_idx][cat_mask].mean()
            else:
                fold_means[cat] = global_mean
        # Apply fold means to validation fold
        for i in fold_val_idx:
            X_train_encoded[i] = fold_means.get(X_train[i], global_mean)

    # Encode test set using full training means
    full_means = {}
    for cat in categories:
        cat_mask = X_train == cat
        full_means[cat] = y_train[cat_mask].mean() if cat_mask.sum() > 0 else global_mean

    X_test_encoded = np.array([full_means.get(x, global_mean) for x in X_test])

    return X_train_encoded, X_test_encoded

Part 6 - Building a Production Feature Pipeline with sklearn

import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, RobustScaler
from sklearn.impute import SimpleImputer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Simulated loan application data
np.random.seed(42)
n = 5000

df = pd.DataFrame({
    'age': np.random.randint(18, 80, n).astype(float),
    'annual_income': np.random.lognormal(11, 0.7, n),
    'credit_score': np.random.randint(300, 850, n).astype(float),
    'employment_length': np.random.randint(0, 30, n).astype(float),
    'loan_amount': np.random.lognormal(9, 0.8, n),
    'home_ownership': np.random.choice(['RENT', 'OWN', 'MORTGAGE'], n),
    'purpose': np.random.choice(['debt_consolidation', 'home_improvement',
                                  'car', 'medical', 'vacation'], n),
    'n_open_accounts': np.random.randint(0, 30, n).astype(float),
})

# Introduce missing values (realistic)
missing_idx = np.random.choice(n, size=int(0.05 * n), replace=False)
df.loc[missing_idx, 'employment_length'] = np.nan
df.loc[missing_idx[:100], 'credit_score'] = np.nan

# Generate synthetic target: higher risk for a high loan-to-income ratio and a low credit score
risk_score = (df['loan_amount'] / (df['annual_income'] + 1) +
              (850 - df['credit_score'].fillna(600)) / 850)
y = (risk_score > risk_score.quantile(0.8)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, stratify=y, random_state=42)

# Define feature groups
numerical_features = ['age', 'annual_income', 'credit_score',
                       'employment_length', 'loan_amount', 'n_open_accounts']
categorical_features = ['home_ownership', 'purpose']

# Numerical pipeline: impute missing → robust scale (for outliers in income)
numerical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),  # median robust to outliers
    ('scaler', RobustScaler()),
])

# Categorical pipeline: impute missing → one-hot encode
categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore')),
])

# Combine into ColumnTransformer
preprocessor = ColumnTransformer([
    ('num', numerical_pipeline, numerical_features),
    ('cat', categorical_pipeline, categorical_features),
])

# Full pipeline: preprocessing → model
full_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('model', GradientBoostingClassifier(n_estimators=200, max_depth=4, random_state=42)),
])

# Train
full_pipeline.fit(X_train, y_train)

# Evaluate
y_prob = full_pipeline.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_prob)
print(f"Loan default prediction AUC: {auc:.4f}")

# The pipeline serializes as one object - training and serving are consistent
import joblib
# joblib.dump(full_pipeline, 'loan_model_v1.pkl')
# loaded = joblib.load('loan_model_v1.pkl')
# loaded.predict_proba(new_data)  # same preprocessing guaranteed

:::tip Pipeline = consistency guarantee
Using sklearn's Pipeline ensures that the same preprocessing transformations (imputation, scaling, encoding) that were learned on training data are applied at serving time. This is the primary tool for preventing train-serve skew in Python ML systems. The fit() call memorizes the statistics (mean, std, categories); the transform() call applies them. The guarantee covers only the steps inside the pipeline: features computed upstream (for example, by a feature server) still need the same definition in training and serving.
:::

Part 7 - Feature Selection: Less Can Be More

Adding more features increases dimensionality. Feature selection reduces it by keeping only the most informative ones.

from sklearn.feature_selection import (
    SelectKBest, f_classif, mutual_info_classif,
    RFE
)
from sklearn.linear_model import LogisticRegression
from sklearn.inspection import permutation_importance
import numpy as np

# Method 1: Univariate statistical tests
# Select features with highest mutual information with target
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=2000, n_features=50, n_informative=10, random_state=42)

selector = SelectKBest(mutual_info_classif, k=10)
X_selected = selector.fit_transform(X, y)
selected_indices = selector.get_support(indices=True)
print(f"Selected feature indices: {selected_indices}")

# Method 2: Permutation importance (model-agnostic, post-training)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=42)
importance = result.importances_mean
top_features = np.argsort(importance)[-10:]
print(f"Top 10 features by permutation importance: {top_features}")

# Method 3: L1 regularization (the Lasso penalty) for implicit feature selection
# The L1 penalty drives the weights of many uninformative features to exactly 0
from sklearn.linear_model import LogisticRegressionCV
lasso_model = LogisticRegressionCV(
    Cs=10, cv=5, penalty='l1', solver='saga', max_iter=5000
)
lasso_model.fit(X_train, y_train)
n_zero = np.sum(np.abs(lasso_model.coef_[0]) < 1e-6)
print(f"Lasso zeroed out {n_zero} of {X.shape[1]} features")

Interview Questions

Q1: What is the curse of dimensionality and what are its practical consequences for ML engineers?

The curse of dimensionality refers to several phenomena that make ML harder as the number of features $d$ increases:

1. Distance concentration: As $d$ grows, the ratio of the maximum to the minimum distance from a query point to the other points converges to 1 for many data distributions. This means all points appear approximately equidistant. Nearest-neighbor search becomes unreliable because there is no meaningful "closest" point.

2. Exponential sample complexity: To maintain a fixed sample density, you need exponentially more data as dimensions increase. If 100 samples provide good coverage in 1D, you need $100^d$ samples in $d$ dimensions (roughly).

3. Volume concentration near surface: Almost all the volume of a high-dimensional sphere lies in a thin shell near the surface. Random samples cluster near the boundary, leaving the interior empty.

Practical consequences:

KNN degrades: Avoid KNN with $d > 20$. Apply PCA or an embedding model first.
Feature selection matters: Irrelevant features add noise dimensions that degrade distance-based algorithms. Select or regularize aggressively.
Sparsity: A dataset with $n = 10{,}000$ points in $d = 500$ dimensions is extremely sparse - the data manifold occupies a tiny fraction of the feature space.
Linear models are more robust: Regularized linear models do not rely on local distance geometry and degrade more gracefully with dimensionality.
Representation learning helps: CNNs and transformers learn compact, low-dimensional representations from raw high-dimensional data (images, text) that respect the actual data manifold.

Q2: What is the difference between StandardScaler and RobustScaler? When would you use each?

StandardScaler: Subtracts the mean and divides by the standard deviation: $z = (x - \mu) / \sigma$. Both $\mu$ and $\sigma$ are computed from the training set.

RobustScaler: Subtracts the median and divides by the IQR (interquartile range, Q3–Q1): $z = (x - \text{median}) / IQR$.

Key difference: StandardScaler is sensitive to outliers. If your data has outliers, they inflate $\sigma$, making the scaled values for typical points very small (all clustered near 0). Outliers themselves get very large values.

RobustScaler is resistant to outliers because median and IQR are quantile-based statistics - a small fraction of extreme values barely moves them.

Use StandardScaler when: The feature is roughly Gaussian with few outliers; for PCA (which is defined in terms of variance/covariance); for scale-sensitive algorithms such as regularized linear models and linear SVM.

Use RobustScaler when: The feature has known outliers (income, transaction amounts, web traffic); for any dataset from the real world where extreme values are expected.

Neither is needed for: Tree-based models (Random Forest, XGBoost, LightGBM) - they split on thresholds, which are scale-invariant.

Q3: What is target encoding, why does it cause leakage, and how do you prevent it?

Target encoding replaces a categorical variable with the mean of the target variable for that category. For example, if "city = London" appears in 500 training examples and 30% of those are fraud, you replace "London" with 0.30.

Why it causes leakage: If you compute the mean on the full training set and then train the model on the same training set, the model is implicitly told the answer for each training example. The target-encoded feature for a training example is partially computed from that very example's label. This inflates training performance and gives overly optimistic offline evaluation.

How to prevent it: Use out-of-fold (OOF) target encoding within cross-validation:

Split training data into K folds
For each fold, compute category means from the K-1 other folds
Apply those means to encode the held-out fold
For the test set: use means computed from the full training set

This ensures that the target-encoded value for any training example is never computed using that example's own label.

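Sklearn's TargetEncoder (added in 1.3) handles this automatically: fit_transform() encodes the training data with internal cross-fitting controlled by its cv parameter, while transform() uses encodings learned from the full training set.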

Q4: When should you use feature engineering vs. representation learning? How do you decide?

The choice depends on data modality, data volume, and interpretability requirements:

Use feature engineering when:

Tabular data: No state-of-the-art representation learning for structured tabular data has consistently outperformed GBT + good feature engineering. Tabular data has heterogeneous types (categorical, numerical, datetime) that benefit from domain knowledge.
Limited data: Representation learning requires large datasets to learn good representations. With thousands of examples, pretrained representations + engineered features often win.
Interpretability required: Engineered features have explicit meaning. A learned embedding dimension does not.
Domain knowledge is available: If you know that loan_amount / income (the loan-to-income ratio) is highly predictive, encode it explicitly rather than hoping the model discovers it.

Use representation learning when:

Image, text, audio: These modalities have dominant pretrained models (ResNet, BERT, Whisper) that produce far better features than any manual engineering.
Large scale: With millions of examples, neural end-to-end learning can discover features you would not think to engineer.
Complex interactions: When the relevant patterns are high-order interactions across many features, learned representations can capture them.
Transfer learning available: If a pretrained model exists for your domain, fine-tuning its representations is usually faster and better than engineering features from scratch.

Hybrid approach (often best): Concatenate engineered features (explicit domain knowledge) with learned embeddings (implicit interaction features). This is standard in industrial recommendation systems - user/item embeddings (learned) + contextual features (engineered) → combined feature vector → output model.
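As a rough sketch of the concatenation step, the snippet below joins a per-user embedding with a few engineered features. The embedding table here is random and only stands in for weights that a trained model would supply; the names user_embeddings and build_hybrid_features and the two engineered columns are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a learned embedding table: 1000 users x 16 dims.
# In a real system these weights would come from a trained model.
user_embeddings = rng.normal(size=(1000, 16))


def build_hybrid_features(user_ids, engineered):
    """Concatenate each user's embedding with that row's engineered features."""
    user_ids = np.asarray(user_ids)
    engineered = np.asarray(engineered, dtype=float)
    learned = user_embeddings[user_ids]        # shape (n, 16)
    return np.hstack([learned, engineered])    # shape (n, 16 + n_engineered)


# Hypothetical engineered features: [loan_to_income, is_weekend]
user_ids = [3, 42, 7]
engineered = [[0.25, 1.0], [0.80, 0.0], [0.10, 1.0]]

X = build_hybrid_features(user_ids, engineered)
print(X.shape)
```

The combined matrix X is what the downstream output model would consume.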

Q5: What is train-serve skew in feature engineering, and how does a sklearn Pipeline prevent it?

Train-serve skew in feature engineering occurs when the feature transformations applied during model training differ from those applied during model serving. Examples:

StandardScaler is fit on the training set (mean μ, std σ). At serving time, a different mean and std are used (perhaps recomputed from recent production data, or using a different code path).
Missing values are imputed with the training set median during training, but with a constant value (e.g., 0) during serving.
Categorical encoding maps "RENT" → [1,0,0] during training but to a different encoding during serving due to a code change.

Consequences: The model receives different-distribution inputs than it was trained on, causing performance degradation. There is no error - just silently wrong predictions.
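The imputation mismatch is easy to reproduce. In this minimal sketch, assuming a single hypothetical feature days_since_last_visit, the training code fills missing values with the training median while a separate serving code path fills them with 0. Neither call raises an error, yet the model sees a very different value from the one it learned on.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical training column: days_since_last_visit, with one missing value
train = np.array([[12.0], [45.0], [np.nan], [200.0], [90.0]])

# A serving request where the feature is missing
serving_row = np.array([[np.nan]])

# Training code path: impute with the training-set median
train_imputer = SimpleImputer(strategy="median").fit(train)

# Serving code path (the bug): impute with a constant 0
serving_imputer = SimpleImputer(strategy="constant", fill_value=0.0).fit(train)

value_model_expects = train_imputer.transform(serving_row)[0, 0]
value_model_receives = serving_imputer.transform(serving_row)[0, 0]

print("imputed at training time:", value_model_expects)
print("imputed at serving time: ", value_model_receives)
```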

How sklearn Pipeline prevents it:

Single object encapsulates everything: Pipeline([('scaler', StandardScaler()), ('model', LogisticRegression())]) stores the scaler (with its fitted mean and std) and the model together. Serializing the pipeline with joblib.dump() saves both.
fit_transform() vs. transform(): Calling pipeline.fit(X_train, y_train) runs fit_transform() on each transformer using X_train only. Calling pipeline.predict(X_test) runs only transform() with those already-fitted transformers, so the statistics stay fixed to the training data.
Reproducibility: The same pipeline object is used in the notebook, the unit tests, and the production serving code. There is no "offline feature pipeline" vs. "online feature server" - there is one pipeline, serialized once.
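The three points above can be seen in a short round trip. This is a sketch on synthetic data, assuming a hypothetical file name pipeline.joblib and a temporary directory in place of a real model store: the pipeline is fit once, serialized, reloaded as the serving side would, and the two sets of scores are compared.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for real training and serving inputs
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, y_train, X_new = X[:150], y[:150], X[150:]

pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),
])
# Fits the imputer and scaler on X_train only, then fits the model
pipeline.fit(X_train, y_train)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "pipeline.joblib")
    joblib.dump(pipeline, path)   # saves transformers and model together
    served = joblib.load(path)    # what the serving process would load

train_time_scores = pipeline.predict_proba(X_new)
serve_time_scores = served.predict_proba(X_new)
print(np.allclose(train_time_scores, serve_time_scores))
```

In practice the loading process should run the same scikit-learn version that produced the file, since serialized estimators are not guaranteed to load correctly across versions.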

The limitation: sklearn pipelines work for Python serving. If you serve in Java, Go, or C++, you must ensure that the feature transformations are faithfully reimplemented and tested for parity.
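One way to approach the parity requirement is to export the fitted statistics and run a parity test against the original transformer. The sketch below assumes a StandardScaler whose parameters are exported as JSON; reimplemented_transform is a hypothetical plain-Python stand-in for the logic a Java, Go, or C++ service would contain.

```python
import json

import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
scaler = StandardScaler().fit(X_train)

# Export the fitted statistics in a language-neutral format
exported = json.dumps({
    "mean": scaler.mean_.tolist(),
    "scale": scaler.scale_.tolist(),
})


def reimplemented_transform(row, params_json):
    """Stand-in for the non-Python service: (x - mean) / scale per feature."""
    params = json.loads(params_json)
    return [
        (value - mean) / scale
        for value, mean, scale in zip(row, params["mean"], params["scale"])
    ]


# Parity check on held-out rows: both code paths must agree numerically
X_check = rng.normal(loc=50.0, scale=10.0, size=(20, 3))
expected = scaler.transform(X_check)
actual = np.array([reimplemented_transform(row, exported) for row in X_check])

max_abs_diff = np.abs(expected - actual).max()
print("max abs difference:", max_abs_diff)
```

A check like this can run in CI against a fixed set of golden inputs, so that a change on either side fails a test instead of silently shifting predictions.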

Key Takeaways

Feature space is the geometric space where ML operates - every training example is a point, and the model finds patterns in this space
Different data modalities require different representations: tabular (normalization + encoding), text (TF-IDF or embeddings), image (CNN features), time-series (window statistics + lags)
The curse of dimensionality makes distances less meaningful and data sparser as $d$ increases - avoid high-dimensional feature spaces for distance-based algorithms
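StandardScaler centers and scales with the mean and standard deviation, so it works best on roughly symmetric features and is sensitive to outliers; RobustScaler uses the median and IQR and is robust to outliers; neither is needed for tree-based models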
Target encoding requires out-of-fold computation to prevent leakage from the target into training features
sklearn's Pipeline is the primary tool for preventing train-serve skew - it ensures training-time preprocessing transformations are identically applied at serving time

Next: Lesson 05 - The Bias-Variance Tradeoff →

