Pandas for Interviews - Data Wrangling Mastery for ML Roles

Reading time: ~45 min | Interview relevance: Critical | Roles: Data Scientist, MLE, AI Eng, Analytics Engineer

The Real Interview Moment

You are forty-five minutes into an Airbnb Data Science on-site. The interviewer opens a shared notebook with a DataFrame of 2 million booking records and says: "Compute the 7-day rolling average of bookings per city, then flag any city-day pair where the rolling average dropped more than 30% compared to the same day last year. You have 20 minutes."

You stare at the DataFrame. You know you need groupby, a rolling window, a date offset for year-over-year comparison, and a boolean mask for the flag. But the clock is ticking. Do you start with a for-loop iterating over cities? Do you merge the DataFrame with itself on a shifted date? Or do you reach for groupby().rolling() with a DateOffset, chain it with a transform, and solve it in five lines?

This is what Pandas interviews test: not whether you can Google the API, but whether you can compose operations fluently under pressure. Candidates who build DataFrames row by row or write nested for-loops over groups get a "no hire." Candidates who chain groupby, merge, transform, and window functions into clean pipelines get a "strong hire." This page builds that fluency.

What You Will Master

Create and inspect DataFrames with dtypes, shape, describe, info, and value_counts
Select and filter with loc, iloc, boolean indexing, and query
Group and aggregate with groupby, agg, transform, and apply
Merge and join DataFrames with merge, join, concat - and know when each is appropriate
Reshape with pivot_table, melt, stack, and unstack
Window functions with rolling, expanding, shift, and rank
Handle missing data with fillna, dropna, interpolate, and imputation strategies
Optimize performance with vectorized operations, categoricals, and avoiding common antipatterns
Engineer features for ML models using Pandas operations

Self-Assessment: Where Are You Now?

Skill	1 - Cannot	2 - Vaguely	3 - Can Do	4 - Can Optimize	5 - Can Teach	Your Score
Select/filter with loc, iloc, boolean						___
GroupBy with agg and multiple functions						___
Merge/join (inner, left, outer)						___
Pivot tables and reshaping						___
Rolling/expanding window functions						___
Handle missing data strategically						___
Avoid loops in favor of vectorized ops						___
Feature engineering for ML						___

Target: All 4s and 5s before your interview.

Part 1 - DataFrame Fundamentals

Creation and Inspection

import pandas as pd
import numpy as np

# From dictionary
df = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 5],
    'age': [25, 30, 35, 28, 42],
    'city': ['NYC', 'SF', 'NYC', 'LA', 'SF'],
    'signup_date': pd.to_datetime(['2024-01-15', '2024-02-20', '2024-01-10',
                                    '2024-03-05', '2024-02-28']),
    'revenue': [100.0, 250.0, None, 175.0, 300.0]
})

# Essential inspection
df.shape          # (5, 5) - rows, columns
df.dtypes         # dtype per column
df.info()         # Non-null counts, memory usage
df.describe()     # Summary statistics for numeric columns
df['city'].value_counts()  # Frequency counts
df.isnull().sum() # Missing values per column

60-Second Answer

"Pandas DataFrames are column-oriented data structures built on NumPy arrays. Each column is a contiguous array with a single dtype, which is why column-wise operations are fast and row-wise iteration is slow. I always start with df.info() and df.describe() to understand dtypes, missing values, and distributions. For ML feature engineering, I use vectorized operations, groupby transforms, and merge-based feature joins - never iterrows()."

Selection and Filtering

# Column selection
df['age']                          # Single column (Series)
df[['age', 'city']]                # Multiple columns (DataFrame)

# Row selection by label
df.loc[0]                          # Row with index label 0
df.loc[0:2, 'age':'city']         # Slice by labels (inclusive)

# Row selection by position
df.iloc[0]                         # First row
df.iloc[0:2, 1:3]                 # First 2 rows, columns 1-2

# Boolean filtering
df[df['age'] > 30]                 # Filter rows
df[(df['age'] > 25) & (df['city'] == 'NYC')]  # Multiple conditions
df.query('age > 25 and city == "NYC"')         # Query syntax (cleaner)

# isin for multiple values
df[df['city'].isin(['NYC', 'SF'])]

# String operations
df[df['city'].str.startswith('N')]

Common Trap

Do NOT use df[df['age'] > 30 and df['city'] == 'NYC'] - Python's and operator does not work element-wise on Series. You must use & (bitwise AND) and wrap each condition in parentheses. This is one of the most common Pandas bugs and interviewers will test it. Similarly, use | for OR and ~ for NOT.

Part 2 - GroupBy: The Most Important Operation

Basic GroupBy

# Sample data: e-commerce orders
orders = pd.DataFrame({
    'order_id': range(1, 11),
    'user_id': [1, 2, 1, 3, 2, 1, 3, 2, 1, 3],
    'product_category': ['electronics', 'books', 'clothing', 'electronics',
                         'electronics', 'books', 'clothing', 'books',
                         'electronics', 'books'],
    'amount': [120, 15, 45, 200, 80, 25, 60, 18, 150, 30],
    'date': pd.to_datetime(['2024-01-05', '2024-01-06', '2024-01-07',
                            '2024-01-08', '2024-01-10', '2024-01-12',
                            '2024-01-15', '2024-01-18', '2024-01-20',
                            '2024-01-22'])
})

# Single aggregation
orders.groupby('user_id')['amount'].sum()

# Multiple aggregations
orders.groupby('user_id')['amount'].agg(['sum', 'mean', 'count', 'max'])

# Named aggregations (clean output)
orders.groupby('user_id').agg(
    total_spend=('amount', 'sum'),
    avg_order=('amount', 'mean'),
    order_count=('amount', 'count'),
    first_order=('date', 'min'),
    last_order=('date', 'max')
).reset_index()

# Multiple group columns
orders.groupby(['user_id', 'product_category'])['amount'].sum().reset_index()

Transform vs Agg

Pandas GroupBy - Three Methods

# transform: add a column with the group mean (for normalization)
orders['user_avg'] = orders.groupby('user_id')['amount'].transform('mean')

# Use transform for group-level features without merge
orders['pct_of_user_total'] = (
    orders['amount'] / orders.groupby('user_id')['amount'].transform('sum')
)

# Use transform for group-level standardization
orders['z_score_within_user'] = orders.groupby('user_id')['amount'].transform(
    lambda x: (x - x.mean()) / x.std()
)

Instant Rejection

Never write for user_id in df['user_id'].unique(): subset = df[df['user_id'] == user_id]; ... - this is an O(n * k) anti-pattern where k is the number of groups. Use groupby, which is O(n). Interviewers interpret manual iteration over groups as a fundamental misunderstanding of Pandas. The only exception is when your per-group operation genuinely cannot be expressed as agg, transform, or apply.

Part 3 - Merge and Join

Merge Types

# Sample data
users = pd.DataFrame({
    'user_id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'plan': ['free', 'paid', 'paid', 'free']
})

purchases = pd.DataFrame({
    'purchase_id': [101, 102, 103, 104, 105],
    'user_id': [1, 2, 2, 5, 1],
    'amount': [50, 100, 75, 200, 30]
})

# Inner join: only matching rows from both
inner = pd.merge(users, purchases, on='user_id', how='inner')
# Users 1, 2 only (3, 4 have no purchases; 5 is not in users)

# Left join: all rows from left, matching from right
left = pd.merge(users, purchases, on='user_id', how='left')
# All 4 users; Charlie and Diana have NaN for purchase columns

# Right join: all rows from right, matching from left
right = pd.merge(users, purchases, on='user_id', how='right')
# All 5 purchases; user_id 5 has NaN for user columns

# Outer join: all rows from both
outer = pd.merge(users, purchases, on='user_id', how='outer')

# Merge on different column names
pd.merge(users, purchases, left_on='user_id', right_on='user_id')

# Merge on multiple keys
pd.merge(df1, df2, on=['user_id', 'date'])

Merge Comparison Table

Join Type	Keeps From Left	Keeps From Right	Use Case
inner	Matching only	Matching only	Standard feature join when all keys exist
left	All	Matching only	Enrich base table without losing rows
right	Matching only	All	Enrich lookup table
outer	All	All	Union - find unmatched on both sides
cross	All (repeated)	All (repeated)	Cartesian product for pairwise features

Company Variation

Meta and Airbnb heavily test merge operations because their data is spread across many tables (users, events, sessions, bookings). They expect you to know when to use left vs inner join and how to handle duplicates after merging. Google tends to test merge concepts in SQL rather than Pandas. Startups and ML-focused roles test merge in the context of feature engineering: "Join user features with transaction features for your model."

Handling Merge Pitfalls

# Pitfall 1: Unintended row multiplication
# If keys are not unique on both sides, merge creates a cross product
A = pd.DataFrame({'key': [1, 1], 'val_a': ['a1', 'a2']})
B = pd.DataFrame({'key': [1, 1], 'val_b': ['b1', 'b2']})
result = pd.merge(A, B, on='key')  # 4 rows (2 x 2)!

# Prevention: validate the merge
pd.merge(A, B, on='key', validate='one_to_one')  # Raises error

# Pitfall 2: Overlapping column names
pd.merge(df1, df2, on='user_id', suffixes=('_left', '_right'))

# Pitfall 3: Missing key values after merge
# Always check shape after merge
print(f"Before: {len(users)}, After: {len(left)}")

Part 4 - Pivot Tables and Reshaping

# Sample data
sales = pd.DataFrame({
    'date': pd.to_datetime(['2024-01-01']*3 + ['2024-01-02']*3),
    'city': ['NYC', 'SF', 'LA'] * 2,
    'product': ['A', 'B', 'A', 'B', 'A', 'B'],
    'revenue': [100, 200, 150, 300, 120, 250]
})

# Pivot table: rows = date, columns = city, values = revenue
pivot = sales.pivot_table(
    values='revenue',
    index='date',
    columns='city',
    aggfunc='sum',
    fill_value=0
)

# Multiple aggregations
pivot_multi = sales.pivot_table(
    values='revenue',
    index='date',
    columns='city',
    aggfunc=['sum', 'mean', 'count']
)

# Melt: wide -> long (inverse of pivot)
wide_df = pd.DataFrame({
    'user_id': [1, 2, 3],
    'score_2023': [80, 90, 75],
    'score_2024': [85, 88, 82]
})
long_df = wide_df.melt(
    id_vars='user_id',
    value_vars=['score_2023', 'score_2024'],
    var_name='year',
    value_name='score'
)

# Stack/Unstack: move between row and column indices
stacked = pivot.stack()      # Columns -> row multi-index
unstacked = stacked.unstack() # Row multi-index -> columns

Part 5 - Window Functions

Rolling Windows

# Time series data
ts = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=30, freq='D'),
    'value': np.random.randn(30).cumsum() + 100
})
ts = ts.set_index('date')

# Rolling mean (7-day moving average)
ts['ma_7'] = ts['value'].rolling(window=7).mean()

# Rolling with minimum periods (handles start of series)
ts['ma_7_min3'] = ts['value'].rolling(window=7, min_periods=3).mean()

# Rolling standard deviation (volatility)
ts['std_7'] = ts['value'].rolling(window=7).std()

# Expanding window (cumulative)
ts['cumulative_mean'] = ts['value'].expanding().mean()
ts['cumulative_max'] = ts['value'].expanding().max()

# Shift (lag features - critical for ML)
ts['lag_1'] = ts['value'].shift(1)    # Yesterday's value
ts['lag_7'] = ts['value'].shift(7)    # Last week's value
ts['diff_1'] = ts['value'].diff(1)    # Day-over-day change
ts['pct_change'] = ts['value'].pct_change(1)  # Percentage change

# Rolling per group
multi_ts = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=10, freq='D').tolist() * 3,
    'city': ['NYC']*10 + ['SF']*10 + ['LA']*10,
    'sales': np.random.randint(50, 200, 30)
})
multi_ts['rolling_avg'] = (
    multi_ts.groupby('city')['sales']
    .transform(lambda x: x.rolling(3, min_periods=1).mean())
)

Rank Functions

df = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2, 2],
    'order_date': pd.to_datetime(['2024-01-01', '2024-01-15', '2024-02-01',
                                   '2024-01-05', '2024-01-20', '2024-02-10']),
    'amount': [100, 200, 150, 300, 50, 250]
})

# Rank within group
df['order_rank'] = df.groupby('user_id')['order_date'].rank(method='dense')

# Get first and last per group
df['is_first_order'] = df.groupby('user_id')['order_date'].rank() == 1
df['is_last_order'] = df.groupby('user_id')['order_date'].rank(ascending=False) == 1

# Cumulative count per group
df['order_number'] = df.groupby('user_id').cumcount() + 1

Part 6 - Handling Missing Data

Detection and Strategies

df = pd.DataFrame({
    'age': [25, None, 35, 28, None],
    'income': [50000, 75000, None, 60000, 80000],
    'city': ['NYC', 'SF', None, 'NYC', 'LA'],
    'score': [0.8, 0.6, 0.9, None, 0.7]
})

# Detection
df.isnull().sum()                  # Count per column
df.isnull().sum() / len(df)        # Percentage per column
df.isnull().any(axis=1).sum()      # Rows with any missing

# Strategy 1: Drop rows/columns
df.dropna()                        # Drop rows with any NaN
df.dropna(subset=['age'])          # Drop only if 'age' is NaN
df.dropna(thresh=3)                # Keep rows with at least 3 non-null values

# Strategy 2: Fill with constant
df['city'].fillna('Unknown')

# Strategy 3: Fill with statistics
df['age'].fillna(df['age'].median())
df['income'].fillna(df['income'].mean())

# Strategy 4: Group-aware fill (better for ML)
df['income'] = df.groupby('city')['income'].transform(
    lambda x: x.fillna(x.median())
)

# Strategy 5: Forward/backward fill (for time series)
df['score'].fillna(method='ffill')   # Forward fill
df['score'].fillna(method='bfill')   # Backward fill

# Strategy 6: Interpolation (for time series)
df['score'].interpolate(method='linear')

# Create missing indicator features (useful for ML)
df['age_missing'] = df['age'].isnull().astype(int)

Missing Data Decision Framework

Missing Data Strategy

Common Trap

Never impute missing values using statistics computed on the entire dataset including the test set. This is data leakage. In interview settings, compute fill values on the training set only, then apply them to both train and test. Say this explicitly - interviewers listen for it. "I would compute the median on the training fold only and use that value to fill both train and test."

Part 7 - Performance Optimization

Common Antipatterns and Fixes

# ANTIPATTERN 1: iterrows() - O(n) with high constant factor
# Slow: ~1 sec for 100K rows
for idx, row in df.iterrows():
    df.loc[idx, 'new_col'] = row['a'] + row['b']
# Fast: ~1 ms for 100K rows
df['new_col'] = df['a'] + df['b']

# ANTIPATTERN 2: apply() with a simple function
# Slow
df['upper_city'] = df['city'].apply(lambda x: x.upper() if pd.notna(x) else x)
# Fast
df['upper_city'] = df['city'].str.upper()

# ANTIPATTERN 3: Growing a DataFrame row by row
# Slow: O(n^2) due to repeated copies
result = pd.DataFrame()
for i in range(10000):
    result = pd.concat([result, pd.DataFrame([{'a': i}])])
# Fast: build list first, then create DataFrame
rows = [{'a': i} for i in range(10000)]
result = pd.DataFrame(rows)

# ANTIPATTERN 4: Not using categoricals for low-cardinality strings
# Before: each string is a full Python object
df['city'].dtype  # object - each value is a separate string
# After: integers with a lookup table
df['city'] = df['city'].astype('category')
# Memory: often 10-50x reduction

Performance Comparison Table

Operation	Slow Way	Fast Way	Speedup
Element-wise math	`iterrows()` + arithmetic	Vectorized: `df['a'] + df['b']`	100-1000x
Conditional column	`apply(lambda)`	`np.where(cond, val1, val2)`	10-100x
String operations	`apply(str.method)`	`df['col'].str.method()`	5-50x
Group statistics	Manual loop over groups	`groupby().transform()`	10-100x
Build DataFrame	Append row by row	Build list, then DataFrame	100x
Low-cardinality strings	object dtype	category dtype	10-50x memory
Multiple aggregations	Multiple groupby calls	Single `agg({'col': [fns]})`	2-5x

Part 8 - Feature Engineering for ML

Time-Based Features

df = pd.DataFrame({
    'user_id': [1]*5 + [2]*5,
    'event_time': pd.to_datetime([
        '2024-01-01 08:00', '2024-01-01 09:30', '2024-01-02 14:00',
        '2024-01-05 16:00', '2024-01-10 10:00',
        '2024-01-02 07:00', '2024-01-03 11:00', '2024-01-03 15:00',
        '2024-01-07 09:00', '2024-01-15 12:00'
    ]),
    'event_type': ['view', 'click', 'purchase', 'view', 'view',
                   'click', 'view', 'purchase', 'view', 'click']
})

# Date/time component features
df['hour'] = df['event_time'].dt.hour
df['day_of_week'] = df['event_time'].dt.dayofweek   # 0=Mon, 6=Sun
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
df['month'] = df['event_time'].dt.month

# Time since previous event (per user)
df = df.sort_values(['user_id', 'event_time'])
df['time_since_prev'] = (
    df.groupby('user_id')['event_time']
    .diff()
    .dt.total_seconds() / 3600  # Convert to hours
)

# Days since first event (user tenure)
df['days_since_first'] = (
    df['event_time'] -
    df.groupby('user_id')['event_time'].transform('min')
).dt.days

Aggregation-Based Features

# User-level aggregation features for a ML model
user_features = df.groupby('user_id').agg(
    total_events=('event_type', 'count'),
    unique_event_types=('event_type', 'nunique'),
    first_event=('event_time', 'min'),
    last_event=('event_time', 'max'),
    avg_time_between_events=('time_since_prev', 'mean'),
).reset_index()

# Count per event type (one-hot-like aggregation)
event_counts = (
    df.groupby(['user_id', 'event_type'])
    .size()
    .unstack(fill_value=0)
    .add_prefix('count_')
    .reset_index()
)
user_features = pd.merge(user_features, event_counts, on='user_id')

# Recency features
reference_date = df['event_time'].max()
user_features['days_since_last_event'] = (
    (reference_date - user_features['last_event']).dt.days
)

# Activity span
user_features['active_days'] = (
    (user_features['last_event'] - user_features['first_event']).dt.days
)

Target Encoding (with Regularization)

def target_encode(train_df, test_df, col, target, alpha=10):
    """
    Target encoding with smoothing to prevent overfitting.

    Args:
        train_df: training DataFrame
        test_df: test DataFrame
        col: column to encode
        target: target column name
        alpha: smoothing parameter (higher = more regularization)
    Returns:
        train_encoded, test_encoded: Series with encoded values
    """
    global_mean = train_df[target].mean()

    # Group statistics on training data only
    group_stats = train_df.groupby(col)[target].agg(['mean', 'count'])

    # Smoothed encoding: weighted average of group mean and global mean
    # More samples -> trust group mean more; few samples -> fall back to global mean
    smoothed = (
        (group_stats['count'] * group_stats['mean'] + alpha * global_mean) /
        (group_stats['count'] + alpha)
    )

    # Map to both train and test
    train_encoded = train_df[col].map(smoothed).fillna(global_mean)
    test_encoded = test_df[col].map(smoothed).fillna(global_mean)

    return train_encoded, test_encoded

Practice Problems

Problem 1: Cohort Retention Analysis

Given a DataFrame of user events with columns user_id, event_date, compute the monthly retention rate for each signup cohort.

Definition: A user is "retained" in month N if they have at least one event in the Nth month after their first event.

Hint 1 - Direction

First, determine each user's cohort (month of first event). Then, for each event, compute the months elapsed since the user's first event. Finally, compute the retention rate for each cohort-month pair.

Hint 2 - Key Operations

Use groupby().transform('min') to get each user's first event date. Use dt.to_period('M') for monthly grouping. Pivot to get a cohort x month retention matrix.

Hint 3 - Full Solution

def cohort_retention(events_df):
    """
    Compute monthly cohort retention rates.

    Args:
        events_df: DataFrame with 'user_id' and 'event_date' columns
    Returns:
        DataFrame with cohort months as rows, months-since-signup as columns,
        values are retention rates (0 to 1)
    """
    df = events_df.copy()

    # Step 1: Determine each user's cohort (month of first event)
    df['first_event'] = df.groupby('user_id')['event_date'].transform('min')
    df['cohort'] = df['first_event'].dt.to_period('M')

    # Step 2: Compute months elapsed since first event
    df['event_month'] = df['event_date'].dt.to_period('M')
    df['months_since_signup'] = (
        (df['event_month'] - df['cohort']).apply(lambda x: x.n)
    )

    # Step 3: Count unique users per cohort-month pair
    cohort_data = (
        df.groupby(['cohort', 'months_since_signup'])['user_id']
        .nunique()
        .reset_index()
        .rename(columns={'user_id': 'active_users'})
    )

    # Step 4: Get cohort sizes (month 0 count)
    cohort_sizes = (
        cohort_data[cohort_data['months_since_signup'] == 0]
        [['cohort', 'active_users']]
        .rename(columns={'active_users': 'cohort_size'})
    )

    # Step 5: Compute retention rate
    cohort_data = pd.merge(cohort_data, cohort_sizes, on='cohort')
    cohort_data['retention_rate'] = (
        cohort_data['active_users'] / cohort_data['cohort_size']
    )

    # Step 6: Pivot to matrix form
    retention_matrix = cohort_data.pivot_table(
        values='retention_rate',
        index='cohort',
        columns='months_since_signup',
        fill_value=0
    )

    return retention_matrix

Scoring:

Strong Hire: Clean pipeline with no loops. Handles edge cases (users with only one event). Explains that retention is always 100% at month 0.
Lean Hire: Correct logic but uses apply/iterrows in places where vectorized operations would work.
No Hire: Cannot construct the cohort assignment without loops.

Problem 2: Feature Engineering Pipeline

Given a transactions DataFrame, create a feature matrix for a churn prediction model.

Columns: user_id, transaction_date, amount, category, is_refund

Create at least 10 meaningful features per user.

Hint 1 - Direction

Think about recency, frequency, monetary value (RFM), behavioral patterns (category diversity, refund rate), and temporal trends (recent vs historical behavior).

Hint 2 - Feature Categories

Categories: (1) basic aggregations (count, sum, mean), (2) recency (days since last), (3) frequency patterns (events per week), (4) behavioral (category entropy, refund rate), (5) trends (recent-30d vs all-time).

Hint 3 - Full Solution

def build_churn_features(transactions, reference_date=None):
    """
    Build user-level features for churn prediction.

    Args:
        transactions: DataFrame with user_id, transaction_date, amount,
                      category, is_refund
        reference_date: date for computing recency (default: max date)
    Returns:
        DataFrame with one row per user and 15+ features
    """
    df = transactions.copy()
    df['transaction_date'] = pd.to_datetime(df['transaction_date'])

    if reference_date is None:
        reference_date = df['transaction_date'].max()

    # ---- Basic aggregation features ----
    features = df.groupby('user_id').agg(
        total_transactions=('transaction_date', 'count'),
        total_spend=('amount', 'sum'),
        avg_transaction=('amount', 'mean'),
        max_transaction=('amount', 'max'),
        std_transaction=('amount', 'std'),
    ).reset_index()

    # ---- Recency features ----
    recency = df.groupby('user_id')['transaction_date'].agg(
        first_transaction='min',
        last_transaction='max'
    ).reset_index()
    recency['days_since_last'] = (
        (reference_date - recency['last_transaction']).dt.days
    )
    recency['tenure_days'] = (
        (recency['last_transaction'] - recency['first_transaction']).dt.days
    )
    features = features.merge(
        recency[['user_id', 'days_since_last', 'tenure_days']], on='user_id'
    )

    # ---- Frequency features ----
    features['transactions_per_day'] = (
        features['total_transactions'] /
        features['tenure_days'].clip(lower=1)
    )

    # ---- Behavioral features ----
    # Category diversity
    cat_diversity = df.groupby('user_id')['category'].nunique().reset_index()
    cat_diversity.columns = ['user_id', 'unique_categories']
    features = features.merge(cat_diversity, on='user_id')

    # Refund rate
    refund_stats = df.groupby('user_id')['is_refund'].mean().reset_index()
    refund_stats.columns = ['user_id', 'refund_rate']
    features = features.merge(refund_stats, on='user_id')

    # ---- Trend features (recent vs all-time) ----
    cutoff_30d = reference_date - pd.Timedelta(days=30)
    recent = df[df['transaction_date'] >= cutoff_30d]
    recent_stats = recent.groupby('user_id').agg(
        recent_30d_count=('transaction_date', 'count'),
        recent_30d_spend=('amount', 'sum')
    ).reset_index()
    features = features.merge(recent_stats, on='user_id', how='left')
    features['recent_30d_count'] = features['recent_30d_count'].fillna(0)
    features['recent_30d_spend'] = features['recent_30d_spend'].fillna(0)

    # Ratio of recent to total activity
    features['recent_activity_ratio'] = (
        features['recent_30d_count'] /
        features['total_transactions'].clip(lower=1)
    )

    # ---- Fill remaining NaNs ----
    features['std_transaction'] = features['std_transaction'].fillna(0)

    return features

Scoring:

Strong Hire: 10+ features covering RFM, behavioral, and trend categories. No data leakage (uses reference_date). Clean merge-based pipeline.
Lean Hire: 5-8 reasonable features but misses trend features or has potential leakage.
No Hire: Fewer than 5 features or uses loops to compute them.

Problem 3: Funnel Analysis

Given a DataFrame of user events with user_id, event_name, timestamp, compute conversion rates through a defined funnel: page_view -> add_to_cart -> checkout -> purchase.

Hint 1 - Direction

For each user, check if they completed each stage (in order). A user who added to cart but never viewed a page should not count - events must happen in the correct sequence.

Hint 2 - Approach

Pivot the data to get the first timestamp of each event type per user. Then check if each stage's timestamp is after the previous stage's timestamp.

Hint 3 - Full Solution

def funnel_analysis(events_df, funnel_stages):
    """
    Compute funnel conversion rates.

    Args:
        events_df: DataFrame with user_id, event_name, timestamp
        funnel_stages: list of event names in order, e.g.
                       ['page_view', 'add_to_cart', 'checkout', 'purchase']
    Returns:
        DataFrame with stage, users_at_stage, conversion_rate, drop_off_rate
    """
    df = events_df.copy()

    # Step 1: Get first occurrence of each event type per user
    first_events = (
        df[df['event_name'].isin(funnel_stages)]
        .sort_values('timestamp')
        .groupby(['user_id', 'event_name'])['timestamp']
        .first()
        .unstack(fill_value=pd.NaT)
    )

    # Step 2: Ensure events happen in order
    # A user completes stage N if:
    # (a) they performed stage N, AND
    # (b) they completed stage N-1, AND
    # (c) stage N happened after stage N-1
    completed = pd.DataFrame(index=first_events.index)
    completed[funnel_stages[0]] = first_events[funnel_stages[0]].notna()

    for i in range(1, len(funnel_stages)):
        prev_stage = funnel_stages[i - 1]
        curr_stage = funnel_stages[i]

        has_event = first_events[curr_stage].notna()
        completed_prev = completed[prev_stage]

        if curr_stage in first_events.columns:
            in_order = (
                first_events[curr_stage] >= first_events[prev_stage]
            ).fillna(False)
        else:
            in_order = False

        completed[curr_stage] = has_event & completed_prev & in_order

    # Step 3: Compute funnel metrics
    total_users = len(first_events)
    results = []
    for stage in funnel_stages:
        users_at_stage = completed[stage].sum()
        conversion_rate = users_at_stage / total_users
        results.append({
            'stage': stage,
            'users_at_stage': int(users_at_stage),
            'conversion_from_top': f"{conversion_rate:.1%}"
        })

    results_df = pd.DataFrame(results)

    # Add stage-to-stage conversion
    results_df['stage_conversion'] = (
        results_df['users_at_stage'] /
        results_df['users_at_stage'].shift(1)
    ).fillna(1.0).apply(lambda x: f"{x:.1%}")

    return results_df

Scoring:

Strong Hire: Handles event ordering correctly (events must happen in sequence). Computes both overall and stage-to-stage conversion. No loops over users.
Lean Hire: Correct logic but does not enforce event ordering.
No Hire: Uses loops over users to check funnel completion.

Problem 4: Time Series Anomaly Detection

Given daily metrics per city, flag days where the metric deviates more than 2 standard deviations from the 30-day rolling mean for that city.

Hint 1 - Direction

Group by city, compute rolling mean and std, then flag where the absolute deviation exceeds 2 * rolling_std.

Hint 2 - Edge cases

Handle the start of each city's time series where the rolling window is not full. Use min_periods parameter.

Hint 3 - Full Solution

def detect_anomalies(df, metric_col, group_col='city',
                     window=30, n_std=2, min_periods=7):
    """
    Detect anomalies using rolling z-score per group.

    Args:
        df: DataFrame with date index, group column, and metric column
        metric_col: name of the metric column
        group_col: column to group by
        window: rolling window size
        n_std: number of standard deviations for threshold
        min_periods: minimum observations for rolling stats
    Returns:
        DataFrame with added columns: rolling_mean, rolling_std,
        z_score, is_anomaly
    """
    result = df.copy().sort_values([group_col, 'date'])

    # Compute rolling statistics per group
    grouped = result.groupby(group_col)[metric_col]

    result['rolling_mean'] = grouped.transform(
        lambda x: x.rolling(window, min_periods=min_periods).mean()
    )
    result['rolling_std'] = grouped.transform(
        lambda x: x.rolling(window, min_periods=min_periods).std()
    )

    # Compute z-score
    result['z_score'] = (
        (result[metric_col] - result['rolling_mean']) /
        result['rolling_std'].clip(lower=1e-8)  # Avoid division by zero
    )

    # Flag anomalies
    result['is_anomaly'] = result['z_score'].abs() > n_std

    # Classify direction
    result['anomaly_type'] = np.where(
        ~result['is_anomaly'], 'normal',
        np.where(result['z_score'] > 0, 'spike', 'drop')
    )

    return result

Scoring:

Strong Hire: Handles per-group rolling, min_periods, division by zero, and anomaly direction. Discusses limitations (non-stationarity, seasonality).
Lean Hire: Correct basic approach but misses edge cases.
No Hire: Computes rolling stats without grouping or uses global statistics.

Problem 5: Efficient Large-Scale Merge

You have a 10M-row events table and a 1M-row users table. The merge is slow. Optimize it.

Hint 1 - Direction

Check dtypes (string keys are slow), consider sorting before merge, and reduce columns before merging.

Hint 2 - Techniques

(1) Convert string IDs to categoricals or integers. (2) Select only needed columns before merge. (3) Use merge instead of join when possible. (4) Consider chunked processing.

Hint 3 - Full Solution

def optimized_merge(events_df, users_df, key='user_id', needed_user_cols=None):
    """
    Optimized large-scale merge.

    Args:
        events_df: large DataFrame (10M+ rows)
        users_df: lookup DataFrame (1M+ rows)
        key: merge key
        needed_user_cols: list of columns needed from users_df
    Returns:
        Merged DataFrame
    """
    # Optimization 1: Select only needed columns
    if needed_user_cols is not None:
        users_subset = users_df[[key] + needed_user_cols].copy()
    else:
        users_subset = users_df.copy()

    # Optimization 2: Ensure key dtype is efficient
    # Convert string IDs to categorical
    if events_df[key].dtype == 'object':
        # Create shared categories
        all_ids = pd.CategoricalDtype(
            categories=pd.unique(
                pd.concat([events_df[key], users_subset[key]])
            )
        )
        events_df[key] = events_df[key].astype(all_ids)
        users_subset[key] = users_subset[key].astype(all_ids)

    # Optimization 3: Drop duplicates in lookup table
    users_subset = users_subset.drop_duplicates(subset=[key])

    # Optimization 4: Use merge with validate
    result = pd.merge(
        events_df,
        users_subset,
        on=key,
        how='left',
        validate='many_to_one'  # Catch unexpected duplicates
    )

    return result

# Additional tip: if memory is a concern, process in chunks
def chunked_merge(events_df, users_df, key, chunk_size=1_000_000):
    """Process merge in chunks to limit peak memory."""
    chunks = []
    for start in range(0, len(events_df), chunk_size):
        chunk = events_df.iloc[start:start + chunk_size]
        merged_chunk = pd.merge(chunk, users_df, on=key, how='left')
        chunks.append(merged_chunk)
    return pd.concat(chunks, ignore_index=True)

Scoring:

Strong Hire: Addresses dtype optimization, column selection, deduplication, and memory concerns. Mentions alternatives (Polars, DuckDB) for truly large datasets.
Lean Hire: Identifies one or two optimizations but not the full picture.
No Hire: Cannot identify why the merge is slow or only suggests "get more RAM."

Interview Cheat Sheet

Concept	Key Pattern	One-Liner	Red Flag
Selection	`df.loc[rows, cols]` for labels, `df.iloc` for positions	loc = labels, iloc = integers	Confusing loc and iloc
Boolean filter	`df[(cond1) & (cond2)]`	Use & not `and`, parentheses required	Using Python `and` with Series
GroupBy agg	`df.groupby(col).agg(name=(col, func))`	One row per group	Using loops over unique values
Transform	`groupby().transform()`	Same shape as input, broadcasts group stat	Not knowing transform exists
Merge	`pd.merge(left, right, on=key, how='left')`	SQL JOIN equivalent	Not specifying how (default inner)
Pivot	`df.pivot_table(values, index, columns, aggfunc)`	Long -> wide with aggregation	Confusing pivot and pivot_table
Rolling	`df.rolling(window).mean()`	Moving average with min_periods	Forgetting min_periods
Missing data	Fill train stats only, add indicator	Never leak test info	Computing fill values on full dataset
Performance	Vectorize, categoricals, select cols first	No iterrows, no row-by-row append	Building DataFrames row by row
Feature eng	RFM + behavioral + trend features	Group-aggregate-merge pipeline	Computing features with for-loops

Spaced Repetition Checkpoints

Day 0 - Initial Learning

Read this entire page and run all code examples
Write a groupby-agg pipeline from memory
Explain the difference between agg, transform, and apply
Complete the self-assessment

Day 3 - First Recall

Implement cohort retention analysis without looking at the solution
Write 5 different merge examples (inner, left, outer, multi-key, with validation)
Create 10 features from a transactions DataFrame from memory

Day 7 - Connections

Solve the funnel analysis problem under timed conditions (15 min)
Implement the anomaly detection problem without hints
Explain 5 Pandas performance antipatterns and their fixes

Day 14 - Application

Given a new dataset, build a complete feature engineering pipeline in 20 minutes
Optimize a slow Pandas pipeline (identify bottlenecks, fix them)
Explain the difference between Pandas and SQL for data manipulation

Day 21 - Mock Interview

Have someone give you a novel dataset and ask you to compute metrics
Solve all 5 practice problems in sequence under timed conditions (60 min)
Explain your feature engineering approach as if presenting to an interviewer

Key Takeaways

GroupBy is the backbone of Pandas interviews. Know agg() for summarization, transform() for broadcasting group statistics back to every row, and apply() for everything else. If you reach for a for-loop over groups, you have already lost.
Merges are where data quality issues hide. Always specify how=, check the output shape, and use validate= to catch unexpected duplicates. In interviews, explicitly state your join type and why.
Feature engineering is what separates DS/MLE candidates. Anyone can compute a mean. The strong candidate computes recency, frequency, monetary value, behavioral diversity, temporal trends, and missing indicators - all without loops, all from training data only.
Performance matters at scale. Knowing that iterrows() is 1000x slower than vectorized operations, that categoricals save 10x memory, and that you should select columns before merging are signals that you can handle production data.

Next Steps

Continue to SQL for ML Roles to master the query language that every data role requires - with ML-specific patterns for feature extraction, A/B test analysis, and metric computation.

The Real Interview Moment​

What You Will Master​

Self-Assessment: Where Are You Now?​

Part 1 - DataFrame Fundamentals​

Creation and Inspection​

Selection and Filtering​

Part 2 - GroupBy: The Most Important Operation​

Basic GroupBy​

Transform vs Agg​

Part 3 - Merge and Join​

Merge Types​

Merge Comparison Table​

Handling Merge Pitfalls​

Part 4 - Pivot Tables and Reshaping​

Part 5 - Window Functions​

Rolling Windows​

Rank Functions​

Part 6 - Handling Missing Data​

Detection and Strategies​

Missing Data Decision Framework​

Part 7 - Performance Optimization​

Common Antipatterns and Fixes​

Performance Comparison Table​

Part 8 - Feature Engineering for ML​

Time-Based Features​

Aggregation-Based Features​

Target Encoding (with Regularization)​

Practice Problems​

Problem 1: Cohort Retention Analysis​

Problem 2: Feature Engineering Pipeline​

Problem 3: Funnel Analysis​

Problem 4: Time Series Anomaly Detection​

Problem 5: Efficient Large-Scale Merge​

Interview Cheat Sheet​

Spaced Repetition Checkpoints​

Day 0 - Initial Learning​

Day 3 - First Recall​

Day 7 - Connections​

Day 14 - Application​

Day 21 - Mock Interview​

Key Takeaways​

Next Steps​

The Real Interview Moment

What You Will Master

Self-Assessment: Where Are You Now?

Part 1 - DataFrame Fundamentals

Creation and Inspection

Selection and Filtering

Part 2 - GroupBy: The Most Important Operation

Basic GroupBy

Transform vs Agg

Part 3 - Merge and Join

Merge Types

Merge Comparison Table

Handling Merge Pitfalls

Part 4 - Pivot Tables and Reshaping

Part 5 - Window Functions

Rolling Windows

Rank Functions

Part 6 - Handling Missing Data

Detection and Strategies

Missing Data Decision Framework

Part 7 - Performance Optimization

Common Antipatterns and Fixes

Performance Comparison Table

Part 8 - Feature Engineering for ML

Time-Based Features

Aggregation-Based Features

Target Encoding (with Regularization)

Practice Problems

Problem 1: Cohort Retention Analysis

Problem 2: Feature Engineering Pipeline

Problem 3: Funnel Analysis

Problem 4: Time Series Anomaly Detection

Problem 5: Efficient Large-Scale Merge

Interview Cheat Sheet

Spaced Repetition Checkpoints

Day 0 - Initial Learning

Day 3 - First Recall

Day 7 - Connections

Day 14 - Application

Day 21 - Mock Interview

Key Takeaways

Next Steps