What is pandas for machine learning?

Pandas for machine learning engineers - DataFrame operations, missing data, groupby feature aggregation, time series, memory optimization, and building leakage-free feature matrices.

How does pandas feature engineering work in practice?

Pandas for ML covers pandas for machine learning, pandas feature engineering, pandas groupby aggregation from first principles with code examples. Free lesson at https://engineersofai.com/docs/ml/ml-with-python/pandas-for-ml

What is the difference between pandas for machine learning and pandas groupby aggregation?

See the full breakdown at https://engineersofai.com/docs/ml/ml-with-python/pandas-for-ml

Pandas for ML

It is your third week on Airbnb's pricing team. You have just inherited the ETL pipeline that prepares features for the nightly price recommendation model. The previous engineer left a sticky note on the codebase: "runs 3 hours, do not touch." You open the notebook and within ten minutes you understand why. Every major computation is done with .apply() - row-by-row Python function calls on DataFrames with eight million rows. There is a median computation using apply(lambda x: x.median()), a string parsing step using apply(lambda row: parse_amenities(row['amenities']), axis=1), and - most egregiously - a 30-day rolling average implemented as a nested Python loop inside an apply.

You spend a day rewriting. The median computation becomes .median(). The amenity parsing becomes .str.split().explode() and a vectorized .isin(). The rolling average becomes .rolling(30).mean(). You run the pipeline. It finishes in five minutes and thirty seconds. The three-hour runtime was not a data size problem. It was a Pandas usage problem - the difference between 200x slower and optimal is just knowing which operations are vectorized and which are Python loops in disguise.

That story plays out on every data team that has a large enough dataset. .apply() is the most common performance trap in Pandas because it looks like a vectorized operation but is not. It calls your Python function once per row or once per element, with all the interpreter overhead that implies. For one million rows, an apply() call might take 10 seconds where a vectorized .str or arithmetic operation takes 50 milliseconds.

But performance is only one dimension. The deeper lesson from the Airbnb pipeline was correctness. Two of the features were computed on the full dataset before the train/test split - the rolling average and a groupby aggregation that used future data to characterize past listings. The model was learning from information it would never have at inference time. Fixing the leakage dropped the model's validation RMSE by 12% and, more importantly, closed the gap between offline validation and online A/B test performance that had puzzled the team for months.

This lesson teaches Pandas the way ML engineers need to know it: not just the API, but the reasoning behind each choice - when a merge introduces leakage, how memory layout affects pipeline feasibility, what the difference between transform and apply actually is, and how to reason about correctness in time-series feature pipelines.

DataFrames and Series: The Core Model

import pandas as pd
import numpy as np

# Create DataFrame from dict
df = pd.DataFrame({
    'user_id': [1001, 1002, 1003, 1004],
    'age': [25, np.nan, 34, 28],
    'spend_30d': [120.5, 0.0, 450.2, 88.3],
    'plan': ['free', 'paid', 'paid', 'free'],
    'churn': [0, 0, 1, 0],
})

# Inspect immediately - always do this after loading
print(df.dtypes)
print(df.shape)           # (4, 5)
print(df.describe())      # summary stats for numeric columns
print(df.info())          # dtypes + null counts + memory usage

# Selecting columns
ages = df['age']             # Series
subset = df[['age', 'plan']] # DataFrame

# Conditional selection
high_spend = df[df['spend_30d'] > 100]
paid_users = df.query("plan == 'paid' and churn == 0")

# .loc (label-based) vs .iloc (position-based)
row0 = df.loc[0, 'age']          # value at row-label 0, column 'age'
row0_idx = df.iloc[0, 1]         # value at row 0, column 1 (position)

Essential Operations Every ML Engineer Needs

apply and map

# apply: row-wise or column-wise function - use sparingly, it is slow
df['log_spend'] = df['spend_30d'].apply(np.log1p)

# Better: use the vectorized version directly
df['log_spend'] = np.log1p(df['spend_30d'])   # ~50x faster for large DataFrames

# map: element-wise transformation with a dict or function
plan_map = {'free': 0, 'paid': 1}
df['plan_encoded'] = df['plan'].map(plan_map)

# str accessor: vectorized string operations (not apply!)
df['plan_upper'] = df['plan'].str.upper()
df['plan_starts_f'] = df['plan'].str.startswith('f')

Renaming, dropping, reindexing

df = df.rename(columns={'spend_30d': 'spend'})
df = df.drop(columns=['user_id'])     # drop column
df = df.reset_index(drop=True)        # reset row index after filtering

Handling Missing Data Rigorously

Missing data in ML is not just a Pandas concern - it is a data integrity concern with three distinct generating mechanisms, each requiring a different strategy.

The three types of missingness

MCAR (Missing Completely At Random): The probability of a value being missing is independent of the value itself and all other variables. Example: a sensor randomly drops readings due to hardware noise. Safe to drop or impute with simple statistics.

MAR (Missing At Random): The probability of missing depends on other observed variables but not on the missing value itself. Example: age is more likely to be missing for younger users (who skip optional fields), but the age values themselves are not the cause. Can be imputed using other columns.

MNAR (Missing Not At Random): The probability of missing depends on the value that is missing. Example: high-income users are more likely to skip the income field. Simple imputation will introduce systematic bias - requires careful modeling.

# Detect missing values
print(df.isnull().sum())             # count per column
print(df.isnull().mean())            # fraction missing per column

# Detect patterns in missingness - is missingness correlated across columns?
miss_df = df.isnull().astype(int)
miss_corr = miss_df.corr()
# High correlation in missingness → likely MAR or MNAR, not MCAR

# Visualize missingness pattern
for col in df.columns:
    missing_pct = df[col].isnull().mean() * 100
    if missing_pct > 0:
        print(f"{col}: {missing_pct:.1f}% missing")

Imputation strategies

# Forward fill and backward fill (time series only)
ts = pd.DataFrame({
    'timestamp': pd.date_range('2024-01-01', periods=5, freq='D'),
    'value': [1.0, np.nan, np.nan, 4.0, 5.0],
})
ts['value_ffill'] = ts['value'].ffill()    # 1, 1, 1, 4, 5
ts['value_bfill'] = ts['value'].bfill()    # 1, 4, 4, 4, 5

# Interpolation for smoothly changing signals
ts['value_interp'] = ts['value'].interpolate(method='linear')
# 1, 2, 3, 4, 5 - fills linearly between known points

# KNN imputation (via sklearn) - uses neighbor information
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=5)
X_imputed = imputer.fit_transform(df[['age', 'spend_30d']].values)

The leakage trap with fillna

# WRONG: fill with global mean computed on all data
df['age'] = df['age'].fillna(df['age'].mean())

# CORRECT: compute fill value on training set only
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

# Fit on train
age_mean_train = train_df['age'].mean()

# Apply to both
train_df = train_df.copy()
test_df  = test_df.copy()
train_df['age'] = train_df['age'].fillna(age_mean_train)
test_df['age']  = test_df['age'].fillna(age_mean_train)  # use TRAIN mean

:::warning fillna leakage Filling with the global mean before splitting leaks information from the test set into training. Always compute imputation statistics on the training set and apply them to both splits. This is exactly what sklearn's SimpleImputer inside a Pipeline does correctly. :::

:::tip Sentinel values for tree-based models For gradient boosted trees (XGBoost, LightGBM), leaving NaN as-is is often better than imputing. These models have native NaN handling and can learn different split directions for missing vs. present values - preserving the "missingness" as a signal. Only impute when the model requires it (e.g., linear models, SVMs, neural networks). :::

Memory Optimization: 4x Reduction with Dtype Engineering

A 10-million-row DataFrame with naive dtypes can easily consume 8GB of RAM. Memory optimization is not premature optimization in ML - it is often the difference between a pipeline that runs and one that crashes. It also directly affects speed: smaller arrays fit in CPU cache, leading to faster operations.

import pandas as pd
import numpy as np

df = pd.read_csv('large_dataset.csv')
baseline_mem = df.memory_usage(deep=True).sum() / 1e6
print(f"Baseline memory: {baseline_mem:.1f} MB")

# Step 1: Downcast integer columns
def downcast_numeric(df):
    for col in df.select_dtypes(include=['int64']).columns:
        df[col] = pd.to_numeric(df[col], downcast='integer')
    for col in df.select_dtypes(include=['float64']).columns:
        df[col] = pd.to_numeric(df[col], downcast='float')
    return df

df = downcast_numeric(df)
after_downcast = df.memory_usage(deep=True).sum() / 1e6
print(f"After downcast: {after_downcast:.1f} MB "
      f"({(1 - after_downcast/baseline_mem)*100:.0f}% reduction)")

# Step 2: Convert low-cardinality strings to category
for col in df.select_dtypes(include='object').columns:
    n_unique = df[col].nunique()
    n_total  = len(df)
    if n_unique / n_total < 0.05:   # less than 5% unique values → good candidate
        df[col] = df[col].astype('category')

after_category = df.memory_usage(deep=True).sum() / 1e6
print(f"After category: {after_category:.1f} MB "
      f"({(1 - after_category/baseline_mem)*100:.0f}% reduction)")

Specifying dtypes at read time

# Best practice: specify dtypes when reading - avoids the object inference step
dtype_map = {
    'user_id':    np.int32,
    'age':        np.float32,
    'country':    'category',
    'plan':       'category',
    'amount':     np.float32,
    'is_fraud':   np.int8,
}

df = pd.read_csv('data.csv', dtype=dtype_map)
# Memory is allocated correctly from the start - no post-hoc conversion needed

pd.Categorical for low-cardinality columns

# Categorical dtype: stores integer codes + a lookup table
# For a column with 50 unique values and 10M rows:
# object dtype:   10M × ~50 bytes per Python string ≈ 500 MB
# category dtype: 10M × 1 byte (int8 codes) + 50 × 50 bytes lookup ≈ 10 MB

countries = pd.Categorical(['US', 'UK', 'DE', 'US', 'FR', 'US'])
print(countries.codes)       # [0, 3, 1, 0, 2, 0] - integer codes
print(countries.categories)  # ['DE', 'FR', 'UK', 'US']

# GroupBy on categorical is also faster - operates on integer codes
df['country'] = df['country'].astype('category')
country_spend = df.groupby('country')['amount'].mean()

Typical memory savings reference

Dtype change	Memory reduction	Risk
`int64` → `int32`	50%	Overflow if values exceed 2.1B
`int64` → `int16`	75%	Overflow if values exceed 32,767
`int64` → `int8`	87.5%	Only for boolean-like 0/1 data
`float64` → `float32`	50%	Precision loss at 7th decimal place
`object` (50 unique) → `category`	90%+	Slight overhead for new category assignment

Chunked reading for very large files

# Process 10GB CSV without loading it all into memory
chunk_size = 100_000
results = []

for chunk in pd.read_csv('huge_file.csv', chunksize=chunk_size):
    # Compute per-chunk aggregations
    chunk_agg = chunk.groupby('user_id')['amount'].sum()
    results.append(chunk_agg)

# Combine all chunk results
final = pd.concat(results).groupby(level=0).sum()

Temporal Data: DatetimeIndex, Resample, Rolling, and Lag Features

Time series data appears in virtually every real-world ML problem - clickstreams, transaction histories, sensor readings, pricing data. Pandas has the richest time series API of any data library.

# Parse and set datetime index
df = pd.DataFrame({
    'timestamp': ['2024-01-01 09:00', '2024-01-01 10:30', '2024-01-02 08:15'],
    'value': [10.5, 12.3, 9.8],
})
df['timestamp'] = pd.to_datetime(df['timestamp'])
df = df.set_index('timestamp')

# Resample to daily mean
daily = df.resample('D').mean()

# Rolling window features
df['value_7d_mean'] = df['value'].rolling('7D').mean()
df['value_7d_std']  = df['value'].rolling('7D').std()
df['value_30d_max'] = df['value'].rolling('30D').max()

# Expanding window: uses all data from start up to current row
df['cumulative_mean'] = df['value'].expanding().mean()

# Lag features (shift-based)
df['value_lag_1'] = df['value'].shift(1)    # previous period
df['value_lag_7'] = df['value'].shift(7)    # one week ago
df['value_pct_change'] = df['value'].pct_change(periods=1)   # % change

# Time-based feature extraction
df['hour']        = df.index.hour
df['day_of_week'] = df.index.dayofweek    # 0=Monday, 6=Sunday
df['is_weekend']  = df.index.dayofweek >= 5
df['month']       = df.index.month
df['quarter']     = df.index.quarter
df['is_month_end'] = df.index.is_month_end

Timezone handling

# Localize naive timestamps to UTC
df.index = df.index.tz_localize('UTC')

# Convert to local timezone
df_local = df.copy()
df_local.index = df_local.index.tz_convert('America/New_York')

# Extract hour in local time (important for user behavior models)
df_local['local_hour'] = df_local.index.hour
# A transaction at 2am UTC is 9pm EST - very different user behavior pattern

Preventing time series leakage with shift()

# The critical rule: lag features must use shift(1) to avoid look-ahead bias
txn = pd.DataFrame({
    'user_id':   [1, 1, 1, 2, 2],
    'ts':        pd.to_datetime(['2024-01-01', '2024-01-05', '2024-01-10',
                                 '2024-01-02', '2024-01-08']),
    'amount':    [50, 120, 80, 200, 150],
})
txn = txn.sort_values(['user_id', 'ts'])

# WRONG: cumulative sum includes current transaction
txn['cum_spend_LEAKY'] = txn.groupby('user_id')['amount'].cumsum()

# CORRECT: shift(1) excludes the current row
txn['cum_spend_safe'] = txn.groupby('user_id')['amount'].transform(
    lambda x: x.shift(1).expanding().sum()
)
# Row 0 for each user: NaN (no prior data)
# Row 1: sum of prior transactions only

:::danger Time series leakage via cumsum A cumulative sum that includes the current row is one of the most common forms of feature leakage in time-series ML. At inference time, you will not have the current transaction's amount - you are predicting about it. Always shift by 1 period before computing running statistics. :::

GroupBy Internals and Performance: Split-Apply-Combine

GroupBy is the highest-leverage Pandas operation for feature engineering. Understanding when to use agg, transform, and apply - and their performance implications - is essential.

The split-apply-combine model

Pandas GroupBy works in three phases:

Split: partition the DataFrame into groups based on a key
Apply: run a function on each group
Combine: concatenate the results back into a DataFrame or Series

import pandas as pd
import numpy as np

transactions = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2, 3],
    'amount':  [50, 120, 80, 200, 150, 30],
    'is_fraud': [0, 0, 1, 0, 0, 0],
    'category': ['food', 'electronics', 'food', 'electronics', 'travel', 'food'],
})

# agg: reduce each group to a scalar per column
# Returns one row per group - use for building feature tables
user_stats = transactions.groupby('user_id')['amount'].agg([
    'mean', 'sum', 'max', 'count', 'std'
]).reset_index()
user_stats.columns = ['user_id', 'amt_mean', 'amt_sum', 'amt_max', 'txn_count', 'amt_std']

# Multiple column aggregation with named output
user_profile = transactions.groupby('user_id').agg(
    amt_mean   = ('amount', 'mean'),
    amt_sum    = ('amount', 'sum'),
    fraud_rate = ('is_fraud', 'mean'),
    txn_count  = ('amount', 'count'),
).reset_index()

# Multi-key groupby
category_user = transactions.groupby(['user_id', 'category']).agg(
    amt_sum = ('amount', 'sum'),
    txn_n   = ('amount', 'count'),
).reset_index()

transform vs apply - the critical distinction

# transform: returns a Series with the SAME index as the original DataFrame
# Output shape = input shape → perfect for adding group features as new columns

# Add the user's mean spend as a feature on every row
transactions['user_avg_spend'] = transactions.groupby('user_id')['amount'].transform('mean')
# Every row for user 1 gets 83.33 (mean of 50, 120, 80)

# apply: calls a Python function on each group DataFrame
# More flexible but much slower - avoids unless transform/agg cannot do the job
transactions['user_rank'] = transactions.groupby('user_id')['amount'].apply(
    lambda x: x.rank(pct=True)   # rank within user's transactions
).reset_index(level=0, drop=True)

# Custom aggregation with apply (only when needed)
def gini_coefficient(x):
    """Custom aggregation not available in standard agg."""
    x = np.sort(x.values)
    n = len(x)
    cumsum = np.cumsum(x)
    return (2 * cumsum.sum()) / (n * x.sum()) - (n + 1) / n

gini_per_user = transactions.groupby('user_id')['amount'].apply(gini_coefficient)

Method	Returns	Speed	Use when
`agg`	One row per group	Fast (C)	Summarizing groups into a feature table
`transform`	Same shape as input	Fast (C)	Adding group stats as columns to original df
`apply`	Flexible	Slow (Python loop)	Custom logic unavailable in agg/transform

GroupBy with multiple keys

# Groupby on two columns - creates a MultiIndex
cat_stats = transactions.groupby(['user_id', 'category']).agg(
    spend = ('amount', 'sum')
).reset_index()

# Pivot to wide format: one column per category
cat_wide = cat_stats.pivot_table(
    index='user_id',
    columns='category',
    values='spend',
    fill_value=0,
    aggfunc='sum'
)
cat_wide.columns = [f'spend_{c}' for c in cat_wide.columns]
cat_wide = cat_wide.reset_index()

# Join back to main table
enriched = transactions.merge(cat_wide, on='user_id', how='left')

Merge features back correctly

# Left join - preserves all rows in transactions
enriched = transactions.merge(user_stats, on='user_id', how='left')

# But watch out - if computed on all data, this leaks future transactions!
# For time-series: compute aggregations on past data only (see shift pattern above)

Merging Strategies: Join Types, merge_asof, and Many-to-Many

Standard join types and their ML implications

users = pd.DataFrame({
    'user_id': [1, 2, 3, 4],
    'age': [25, 30, 35, 28],
})
purchases = pd.DataFrame({
    'user_id': [1, 1, 3, 5],
    'purchase_amount': [100, 200, 50, 80],
})

# Inner join - only users with purchases (can bias toward active users)
inner = users.merge(purchases, on='user_id', how='inner')
# Rows: 3 (users 1, 1, 3 - user 5 dropped since not in users table)

# Left join - all users, NaN for those with no purchases
left = users.merge(purchases, on='user_id', how='left')
# Rows: 5 (user 4 gets NaN for purchase_amount)

# Indicator column for debugging merges
left_debug = users.merge(purchases, on='user_id', how='left', indicator=True)
print(left_debug['_merge'].value_counts())
# left_only: users with no purchases (user_id 2, 4)
# both: users with purchases (user_id 1, 3)

# Anti-join pattern: users with NO purchases (negative class)
no_purchase_mask = ~users['user_id'].isin(purchases['user_id'])
no_purchase_users = users[no_purchase_mask]

:::note Join type and label bias For fraud detection, using an inner join keeps only users who had at least one event - this can bias your training set. A left join on the user table preserves all users, with NaN for missing features. The right join type depends on your label definition. :::

merge_asof - time-series joins without future leakage

merge_asof performs a merge on the nearest key rather than exact match - essential for joining event tables with slowly-changing dimension tables without introducing future leakage.

# Stock prices (sampled every minute)
prices = pd.DataFrame({
    'time': pd.to_datetime(['09:30', '09:31', '09:32', '09:33'], format='%H:%M'),
    'price': [100.0, 100.5, 99.8, 101.2],
})

# Trades (happen at arbitrary times)
trades = pd.DataFrame({
    'time': pd.to_datetime(['09:30:30', '09:31:45', '09:32:15'], format='%H:%M:%S'),
    'volume': [500, 1200, 300],
})

# Join each trade to the most recent price BEFORE it
# merge_asof looks backward - no future leakage
result = pd.merge_asof(
    trades.sort_values('time'),
    prices.sort_values('time'),
    on='time',
    direction='backward',   # take the most recent price before the trade
)
# Each trade row now has the most recently available price - safe for ML features

# tolerance parameter: only match if within 5 minutes
result_tol = pd.merge_asof(
    trades.sort_values('time'),
    prices.sort_values('time'),
    on='time',
    direction='backward',
    tolerance=pd.Timedelta('5min'),
)

Handling many-to-many merges

# Many-to-many merges produce a cartesian product - usually a bug
# If user_id 1 appears 3 times in left and 2 times in right:
# inner join produces 3 × 2 = 6 rows for that user

# Detect before merging
left_dups  = transactions.duplicated(subset=['user_id']).any()
right_dups = user_stats.duplicated(subset=['user_id']).any()
print(f"Left has user_id duplicates: {left_dups}")    # True (many transactions per user)
print(f"Right has user_id duplicates: {right_dups}")  # False (one row per user)

# Safe pattern: aggregate to one row per key before merging
user_features = transactions.groupby('user_id').agg(
    txn_count = ('amount', 'count'),
    total_spend = ('amount', 'sum'),
).reset_index()
# Now right side has no duplicates - safe to join

enriched = transactions.merge(user_features, on='user_id', how='left')

Data Pipeline Architecture - Mermaid Diagram

GroupBy for Feature Aggregation (Full Pattern)

import pandas as pd
import numpy as np

transactions = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2, 3],
    'amount':  [50, 120, 80, 200, 150, 30],
    'is_fraud': [0, 0, 1, 0, 0, 0],
    'category': ['food', 'electronics', 'food', 'electronics', 'travel', 'food'],
})

# Full multi-column aggregation with named outputs
user_profile = transactions.groupby('user_id').agg(
    amt_mean   = ('amount',   'mean'),
    amt_sum    = ('amount',   'sum'),
    amt_max    = ('amount',   'max'),
    amt_std    = ('amount',   'std'),
    txn_count  = ('amount',   'count'),
    fraud_sum  = ('is_fraud', 'sum'),
    fraud_rate = ('is_fraud', 'mean'),
).reset_index()

print(user_profile.head())

Point-in-time safe aggregation (preventing leakage in time series)

# Transactions with timestamps
txn = pd.DataFrame({
    'user_id':   [1, 1, 1, 2, 2],
    'ts':        pd.to_datetime(['2024-01-01', '2024-01-05', '2024-01-10',
                                 '2024-01-02', '2024-01-08']),
    'amount':    [50, 120, 80, 200, 150],
})
txn = txn.sort_values(['user_id', 'ts'])

# Safe: each row only uses data from BEFORE that transaction
txn['cumulative_spend'] = txn.groupby('user_id')['amount'].transform(
    lambda x: x.shift(1).expanding().sum()  # shift(1): exclude current row
)

Pandas vs Polars: When to Switch

Polars is a modern DataFrame library written in Rust with a lazy evaluation engine. For large-scale ML pipelines, knowing when to switch is a practical skill.

Speed comparison (approximate, 10M rows)

Operation	Pandas	Polars	Speedup
GroupBy aggregation	~8s	~0.4s	20x
String operations	~12s	~1s	12x
CSV read	~6s	~1s	6x
Filter + select	~0.5s	~0.1s	5x
Join (large tables)	~15s	~1s	15x

# Polars syntax (comparable to Pandas but with lazy evaluation)
import polars as pl

# Lazy computation - builds a query plan, executes only on .collect()
result = (
    pl.scan_csv('large_data.csv')        # lazy read
    .filter(pl.col('amount') > 100)      # lazy filter
    .groupby('user_id')
    .agg([
        pl.col('amount').mean().alias('avg_amount'),
        pl.col('amount').sum().alias('total_amount'),
        pl.col('is_fraud').mean().alias('fraud_rate'),
    ])
    .collect()                            # execute everything at once
)

# Convert Polars DataFrame to Pandas for sklearn compatibility
result_pd = result.to_pandas()

Decision framework: Pandas vs Polars

Dataset size < 1M rows      → Pandas (ergonomics win, speed irrelevant)
Dataset size 1M–100M rows   → either; Polars wins on speed
Dataset size > 100M rows    → Polars or Spark (Pandas may OOM)
Team knows Pandas well       → keep Pandas unless bottlenecked
Pipeline already in PySpark  → keep PySpark
Need complex string ops      → Polars str API is faster
Need .apply() flexibility    → Pandas (Polars has less flexibility)

:::tip Polars migration strategy You do not need to rewrite everything. Profile your pipeline first with %timeit on each major step. The bottlenecks are usually 2–3 operations (a large groupby, a join, or a string column transformation). Rewrite only those steps in Polars, convert to Pandas with .to_pandas() for the sklearn-compatible final step. :::

Data Validation: Pandera for Schema Validation

Before data enters a model, validate its schema. Silent schema violations - a column that should be non-negative has negative values, a required column is null, a string column has unexpected categories - are a leading cause of model degradation in production.

import pandera as pa
from pandera import Column, DataFrameSchema, Check
import pandas as pd

# Define the expected schema for your feature matrix
feature_schema = DataFrameSchema({
    'user_id': Column(int, nullable=False, unique=False),
    'age': Column(float, checks=[
        Check.greater_than_or_equal_to(0),
        Check.less_than_or_equal_to(120),
    ], nullable=True),
    'spend_30d': Column(float, checks=[
        Check.greater_than_or_equal_to(0),
    ], nullable=False),
    'plan': Column(str, checks=[
        Check.isin(['free', 'paid', 'enterprise']),
    ], nullable=False),
    'churn': Column(int, checks=[
        Check.isin([0, 1]),
    ], nullable=False),
})

# Validate - raises SchemaError with informative message on failure
try:
    validated_df = feature_schema.validate(df)
    print("Schema validation passed")
except pa.errors.SchemaError as e:
    print(f"Schema validation failed: {e}")

# In production: validate both training data AND inference data
# Add validation at pipeline ingestion, before feature engineering, and
# before model input - catch errors as early as possible

:::note Data validation in production Pandera's .validate() can be added as a decorator on feature engineering functions. When an unexpected value or dtype arrives from upstream (changed ETL, new data source, schema migration), the validation failure surfaces immediately instead of silently corrupting model predictions hours or days later. :::

From Pandas to ML: Building Feature Matrices

The final step before model training is converting your Pandas DataFrame into NumPy arrays (or direct sklearn inputs). Here is a complete, correct pipeline:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# --- Load and inspect ---
df = pd.read_csv('customer_data.csv')
print(df.dtypes)
print(df.isnull().sum())

# --- Feature engineering (all on df, before split) ---
df['log_spend'] = np.log1p(df['spend_30d'])
df['plan_encoded'] = df['plan'].map({'free': 0, 'paid': 1}).fillna(0)

# Group features - computed on full df (will be refactored to avoid leakage in prod)
user_spend = df.groupby('user_id')['spend_30d'].mean().rename('user_avg_spend')
df = df.merge(user_spend, on='user_id', how='left')

# --- Define features and label ---
FEATURE_COLS = ['age', 'log_spend', 'plan_encoded', 'user_avg_spend']
LABEL_COL = 'churn'

df = df.dropna(subset=FEATURE_COLS + [LABEL_COL])

X = df[FEATURE_COLS].values.astype(np.float32)   # NumPy array
y = df[LABEL_COL].values.astype(np.int64)

# --- Split ---
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Train: {X_train.shape}, Test: {X_test.shape}")
print(f"Positive rate: train={y_train.mean():.3f}, test={y_test.mean():.3f}")

YouTube Resources

Video	Channel	Description
Pandas Tutorial for Beginners	Corey Schafer	Complete pandas walkthrough - DataFrames, merging, groupby
Pandas Performance Tips	Python Programmer	Vectorization and speed - apply vs vectorized operations
Time Series with Pandas	SentDex	DateTime features for ML - rolling, resampling, lag features
Polars vs Pandas	Rob Mulla	When to switch from pandas - benchmarks and migration guide
Pandas GroupBy Explained	Keith Galli	Deep dive into split-apply-combine and transform vs apply

Interview Q&A

Q1: What is the difference between transform and apply in Pandas groupby, and when should you use each?

transform calls a function on each group and returns a result with the same index as the original DataFrame - it "broadcasts" the group result back to every row. This is the right tool when you want to add a group-level statistic as a new column: df.groupby('user_id')['amount'].transform('mean') adds each user's mean spend to every row. apply is more flexible - it receives the full group DataFrame and can return any shape, including a scalar, a Series, or a modified DataFrame. The cost is performance: apply is a Python loop over groups, while transform with named aggregation functions (like 'mean', 'sum') uses C implementations. The rule: use agg for summarizing groups into a feature table, transform for adding group features to the original DataFrame, and apply only when the computation cannot be expressed with agg or transform.

Q2: What are the most important memory optimization techniques for large Pandas DataFrames?

Three techniques cover 90% of cases. First, downcast integer columns - int64 to int32 saves 50%, int16 saves 75%. Check that values fit the smaller range before downcasting. Second, downcast float columns - float64 to float32 saves 50%, and float32 is what most ML training uses anyway, so there is no precision cost for the ML use case. Third, convert low-cardinality string columns to category - instead of storing each string as a Python object (50+ bytes), Pandas stores integer codes (1 byte) plus a small lookup table, often 90%+ memory reduction for columns like country, plan type, or status code. Apply these at read time with dtype= in pd.read_csv for best results - avoids the intermediate object allocation.

Q3: How do you handle time series leakage when building features in Pandas?

Time series leakage occurs when a feature computed at time $t$ uses information from after $t$ . The two common sources: (1) cumulative statistics that include the current row - fix with shift(1) before expanding() or cumsum(); (2) group aggregations computed on all data before splitting - fix by computing aggregations only on the training set and applying them to test. For joining slowly-changing tables (prices, exchange rates), use pd.merge_asof with direction='backward' to ensure each row only gets the most recent value as of its timestamp. The underlying principle: at inference time, you will only have data up to the prediction moment. Every feature must be computable from that constraint.

Q4: How does Pandas groupby perform internally, and why is apply slow?

Pandas GroupBy internally builds a hash table from the groupby keys and partitions row indices into groups. For agg with standard functions like 'mean' or 'sum', Pandas dispatches to Cython-optimized C functions that operate directly on the underlying NumPy arrays - this is fast. For apply, Pandas iterates over the groups in Python, calls your Python function with a DataFrame slice for each group, and concatenates the results. For 10,000 groups with a Python function, this means 10,000 Python function calls, 10,000 DataFrame slice operations, and 10,000 result concatenations - a Python-speed operation. Profiling rule: if groupby().apply() is your bottleneck, check whether agg, transform, or a pivot operation can replace it.

Q5: How do you detect and handle multicollinearity before model training?

# Detect: correlation matrix on numeric features
corr = df[FEATURE_COLS].corr().abs()

# Find pairs with correlation above threshold
high_corr_pairs = []
for i in range(len(corr.columns)):
    for j in range(i + 1, len(corr.columns)):
        if corr.iloc[i, j] > 0.90:
            high_corr_pairs.append((corr.columns[i], corr.columns[j], corr.iloc[i, j]))

for feat_a, feat_b, corr_val in high_corr_pairs:
    print(f"{feat_a} - {feat_b}: {corr_val:.3f}")

# Handle options:
# 1. Drop one of the correlated features (domain knowledge guides which)
# 2. PCA to create orthogonal features
# 3. L2 regularization (Ridge) - handles multicollinearity in linear models
# 4. Tree-based models - largely unaffected by multicollinearity

Multicollinearity is most damaging for linear models - it inflates coefficient variance and makes interpretation unreliable. For gradient boosting or random forests, it causes redundant feature importance split credit but rarely hurts prediction quality.

Q6: When would you choose Polars over Pandas, and how do you make that decision?

The decision is driven by three factors: data size, operation type, and team familiarity. Polars is typically 5–20x faster than Pandas on large DataFrames because it uses a query optimizer, parallel execution across CPU cores, and Apache Arrow's columnar memory format. It is the clear choice when: (1) your DataFrame exceeds 5–10 million rows and groupby or join operations are bottlenecks; (2) you are reading large Parquet or CSV files repeatedly and scan time matters; (3) you need lazy evaluation to avoid loading full datasets into memory. Stick with Pandas when: the dataset fits comfortably in memory under 1M rows, you need .apply() with complex Python logic that Polars does not support, or your team is deeply familiar with Pandas and the performance difference does not matter in practice. A practical hybrid: use Polars for heavy ETL and feature aggregation, convert to Pandas with .to_pandas() for the final sklearn-compatible steps.

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Tensor Viewer demo on the EngineersOfAI Playground - no code required.

:::

DataFrames and Series: The Core Model​

Essential Operations Every ML Engineer Needs​

apply and map​

Renaming, dropping, reindexing​

Handling Missing Data Rigorously​

The three types of missingness​

Imputation strategies​

The leakage trap with fillna​

Memory Optimization: 4x Reduction with Dtype Engineering​

Specifying dtypes at read time​

pd.Categorical for low-cardinality columns​

Typical memory savings reference​

Chunked reading for very large files​

Temporal Data: DatetimeIndex, Resample, Rolling, and Lag Features​

Timezone handling​

Preventing time series leakage with shift()​

GroupBy Internals and Performance: Split-Apply-Combine​

The split-apply-combine model​

transform vs apply - the critical distinction​

GroupBy with multiple keys​

Merge features back correctly​

Merging Strategies: Join Types, merge_asof, and Many-to-Many​

Standard join types and their ML implications​

merge_asof - time-series joins without future leakage​

Handling many-to-many merges​

Data Pipeline Architecture - Mermaid Diagram​

GroupBy for Feature Aggregation (Full Pattern)​

Point-in-time safe aggregation (preventing leakage in time series)​

Pandas vs Polars: When to Switch​

Speed comparison (approximate, 10M rows)​

Decision framework: Pandas vs Polars​

Data Validation: Pandera for Schema Validation​

From Pandas to ML: Building Feature Matrices​

YouTube Resources​

Interview Q&A​