Skip to main content

Pandas for Interviews - Data Wrangling Mastery for ML Roles

Reading time: ~45 min | Interview relevance: Critical | Roles: Data Scientist, MLE, AI Eng, Analytics Engineer

The Real Interview Moment

You are forty-five minutes into an Airbnb Data Science on-site. The interviewer opens a shared notebook with a DataFrame of 2 million booking records and says: "Compute the 7-day rolling average of bookings per city, then flag any city-day pair where the rolling average dropped more than 30% compared to the same day last year. You have 20 minutes."

You stare at the DataFrame. You know you need groupby, a rolling window, a date offset for year-over-year comparison, and a boolean mask for the flag. But the clock is ticking. Do you start with a for-loop iterating over cities? Do you merge the DataFrame with itself on a shifted date? Or do you reach for groupby().rolling() with a DateOffset, chain it with a transform, and solve it in five lines?

This is what Pandas interviews test: not whether you can Google the API, but whether you can compose operations fluently under pressure. Candidates who build DataFrames row by row or write nested for-loops over groups get a "no hire." Candidates who chain groupby, merge, transform, and window functions into clean pipelines get a "strong hire." This page builds that fluency.

What You Will Master

  • Create and inspect DataFrames with dtypes, shape, describe, info, and value_counts
  • Select and filter with loc, iloc, boolean indexing, and query
  • Group and aggregate with groupby, agg, transform, and apply
  • Merge and join DataFrames with merge, join, concat - and know when each is appropriate
  • Reshape with pivot_table, melt, stack, and unstack
  • Window functions with rolling, expanding, shift, and rank
  • Handle missing data with fillna, dropna, interpolate, and imputation strategies
  • Optimize performance with vectorized operations, categoricals, and avoiding common antipatterns
  • Engineer features for ML models using Pandas operations

Self-Assessment: Where Are You Now?

Skill1 - Cannot2 - Vaguely3 - Can Do4 - Can Optimize5 - Can TeachYour Score
Select/filter with loc, iloc, boolean___
GroupBy with agg and multiple functions___
Merge/join (inner, left, outer)___
Pivot tables and reshaping___
Rolling/expanding window functions___
Handle missing data strategically___
Avoid loops in favor of vectorized ops___
Feature engineering for ML___

Target: All 4s and 5s before your interview.

Part 1 - DataFrame Fundamentals

Creation and Inspection

import pandas as pd
import numpy as np

# From dictionary
df = pd.DataFrame({
'user_id': [1, 2, 3, 4, 5],
'age': [25, 30, 35, 28, 42],
'city': ['NYC', 'SF', 'NYC', 'LA', 'SF'],
'signup_date': pd.to_datetime(['2024-01-15', '2024-02-20', '2024-01-10',
'2024-03-05', '2024-02-28']),
'revenue': [100.0, 250.0, None, 175.0, 300.0]
})

# Essential inspection
df.shape # (5, 5) - rows, columns
df.dtypes # dtype per column
df.info() # Non-null counts, memory usage
df.describe() # Summary statistics for numeric columns
df['city'].value_counts() # Frequency counts
df.isnull().sum() # Missing values per column
60-Second Answer

"Pandas DataFrames are column-oriented data structures built on NumPy arrays. Each column is a contiguous array with a single dtype, which is why column-wise operations are fast and row-wise iteration is slow. I always start with df.info() and df.describe() to understand dtypes, missing values, and distributions. For ML feature engineering, I use vectorized operations, groupby transforms, and merge-based feature joins - never iterrows()."

Selection and Filtering

# Column selection
df['age'] # Single column (Series)
df[['age', 'city']] # Multiple columns (DataFrame)

# Row selection by label
df.loc[0] # Row with index label 0
df.loc[0:2, 'age':'city'] # Slice by labels (inclusive)

# Row selection by position
df.iloc[0] # First row
df.iloc[0:2, 1:3] # First 2 rows, columns 1-2

# Boolean filtering
df[df['age'] > 30] # Filter rows
df[(df['age'] > 25) & (df['city'] == 'NYC')] # Multiple conditions
df.query('age > 25 and city == "NYC"') # Query syntax (cleaner)

# isin for multiple values
df[df['city'].isin(['NYC', 'SF'])]

# String operations
df[df['city'].str.startswith('N')]
Common Trap

Do NOT use df[df['age'] > 30 and df['city'] == 'NYC'] - Python's and operator does not work element-wise on Series. You must use & (bitwise AND) and wrap each condition in parentheses. This is one of the most common Pandas bugs and interviewers will test it. Similarly, use | for OR and ~ for NOT.

Part 2 - GroupBy: The Most Important Operation

Basic GroupBy

# Sample data: e-commerce orders
orders = pd.DataFrame({
'order_id': range(1, 11),
'user_id': [1, 2, 1, 3, 2, 1, 3, 2, 1, 3],
'product_category': ['electronics', 'books', 'clothing', 'electronics',
'electronics', 'books', 'clothing', 'books',
'electronics', 'books'],
'amount': [120, 15, 45, 200, 80, 25, 60, 18, 150, 30],
'date': pd.to_datetime(['2024-01-05', '2024-01-06', '2024-01-07',
'2024-01-08', '2024-01-10', '2024-01-12',
'2024-01-15', '2024-01-18', '2024-01-20',
'2024-01-22'])
})

# Single aggregation
orders.groupby('user_id')['amount'].sum()

# Multiple aggregations
orders.groupby('user_id')['amount'].agg(['sum', 'mean', 'count', 'max'])

# Named aggregations (clean output)
orders.groupby('user_id').agg(
total_spend=('amount', 'sum'),
avg_order=('amount', 'mean'),
order_count=('amount', 'count'),
first_order=('date', 'min'),
last_order=('date', 'max')
).reset_index()

# Multiple group columns
orders.groupby(['user_id', 'product_category'])['amount'].sum().reset_index()

Transform vs Agg

Pandas GroupBy - Three Methods

# transform: add a column with the group mean (for normalization)
orders['user_avg'] = orders.groupby('user_id')['amount'].transform('mean')

# Use transform for group-level features without merge
orders['pct_of_user_total'] = (
orders['amount'] / orders.groupby('user_id')['amount'].transform('sum')
)

# Use transform for group-level standardization
orders['z_score_within_user'] = orders.groupby('user_id')['amount'].transform(
lambda x: (x - x.mean()) / x.std()
)
Instant Rejection

Never write for user_id in df['user_id'].unique(): subset = df[df['user_id'] == user_id]; ... - this is an O(n * k) anti-pattern where k is the number of groups. Use groupby, which is O(n). Interviewers interpret manual iteration over groups as a fundamental misunderstanding of Pandas. The only exception is when your per-group operation genuinely cannot be expressed as agg, transform, or apply.

Part 3 - Merge and Join

Merge Types

# Sample data
users = pd.DataFrame({
'user_id': [1, 2, 3, 4],
'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
'plan': ['free', 'paid', 'paid', 'free']
})

purchases = pd.DataFrame({
'purchase_id': [101, 102, 103, 104, 105],
'user_id': [1, 2, 2, 5, 1],
'amount': [50, 100, 75, 200, 30]
})

# Inner join: only matching rows from both
inner = pd.merge(users, purchases, on='user_id', how='inner')
# Users 1, 2 only (3, 4 have no purchases; 5 is not in users)

# Left join: all rows from left, matching from right
left = pd.merge(users, purchases, on='user_id', how='left')
# All 4 users; Charlie and Diana have NaN for purchase columns

# Right join: all rows from right, matching from left
right = pd.merge(users, purchases, on='user_id', how='right')
# All 5 purchases; user_id 5 has NaN for user columns

# Outer join: all rows from both
outer = pd.merge(users, purchases, on='user_id', how='outer')

# Merge on different column names
pd.merge(users, purchases, left_on='user_id', right_on='user_id')

# Merge on multiple keys
pd.merge(df1, df2, on=['user_id', 'date'])

Merge Comparison Table

Join TypeKeeps From LeftKeeps From RightUse Case
innerMatching onlyMatching onlyStandard feature join when all keys exist
leftAllMatching onlyEnrich base table without losing rows
rightMatching onlyAllEnrich lookup table
outerAllAllUnion - find unmatched on both sides
crossAll (repeated)All (repeated)Cartesian product for pairwise features
Company Variation

Meta and Airbnb heavily test merge operations because their data is spread across many tables (users, events, sessions, bookings). They expect you to know when to use left vs inner join and how to handle duplicates after merging. Google tends to test merge concepts in SQL rather than Pandas. Startups and ML-focused roles test merge in the context of feature engineering: "Join user features with transaction features for your model."

Handling Merge Pitfalls

# Pitfall 1: Unintended row multiplication
# If keys are not unique on both sides, merge creates a cross product
A = pd.DataFrame({'key': [1, 1], 'val_a': ['a1', 'a2']})
B = pd.DataFrame({'key': [1, 1], 'val_b': ['b1', 'b2']})
result = pd.merge(A, B, on='key') # 4 rows (2 x 2)!

# Prevention: validate the merge
pd.merge(A, B, on='key', validate='one_to_one') # Raises error

# Pitfall 2: Overlapping column names
pd.merge(df1, df2, on='user_id', suffixes=('_left', '_right'))

# Pitfall 3: Missing key values after merge
# Always check shape after merge
print(f"Before: {len(users)}, After: {len(left)}")

Part 4 - Pivot Tables and Reshaping

# Sample data
sales = pd.DataFrame({
'date': pd.to_datetime(['2024-01-01']*3 + ['2024-01-02']*3),
'city': ['NYC', 'SF', 'LA'] * 2,
'product': ['A', 'B', 'A', 'B', 'A', 'B'],
'revenue': [100, 200, 150, 300, 120, 250]
})

# Pivot table: rows = date, columns = city, values = revenue
pivot = sales.pivot_table(
values='revenue',
index='date',
columns='city',
aggfunc='sum',
fill_value=0
)

# Multiple aggregations
pivot_multi = sales.pivot_table(
values='revenue',
index='date',
columns='city',
aggfunc=['sum', 'mean', 'count']
)

# Melt: wide -> long (inverse of pivot)
wide_df = pd.DataFrame({
'user_id': [1, 2, 3],
'score_2023': [80, 90, 75],
'score_2024': [85, 88, 82]
})
long_df = wide_df.melt(
id_vars='user_id',
value_vars=['score_2023', 'score_2024'],
var_name='year',
value_name='score'
)

# Stack/Unstack: move between row and column indices
stacked = pivot.stack() # Columns -> row multi-index
unstacked = stacked.unstack() # Row multi-index -> columns

Part 5 - Window Functions

Rolling Windows

# Time series data
ts = pd.DataFrame({
'date': pd.date_range('2024-01-01', periods=30, freq='D'),
'value': np.random.randn(30).cumsum() + 100
})
ts = ts.set_index('date')

# Rolling mean (7-day moving average)
ts['ma_7'] = ts['value'].rolling(window=7).mean()

# Rolling with minimum periods (handles start of series)
ts['ma_7_min3'] = ts['value'].rolling(window=7, min_periods=3).mean()

# Rolling standard deviation (volatility)
ts['std_7'] = ts['value'].rolling(window=7).std()

# Expanding window (cumulative)
ts['cumulative_mean'] = ts['value'].expanding().mean()
ts['cumulative_max'] = ts['value'].expanding().max()

# Shift (lag features - critical for ML)
ts['lag_1'] = ts['value'].shift(1) # Yesterday's value
ts['lag_7'] = ts['value'].shift(7) # Last week's value
ts['diff_1'] = ts['value'].diff(1) # Day-over-day change
ts['pct_change'] = ts['value'].pct_change(1) # Percentage change

# Rolling per group
multi_ts = pd.DataFrame({
'date': pd.date_range('2024-01-01', periods=10, freq='D').tolist() * 3,
'city': ['NYC']*10 + ['SF']*10 + ['LA']*10,
'sales': np.random.randint(50, 200, 30)
})
multi_ts['rolling_avg'] = (
multi_ts.groupby('city')['sales']
.transform(lambda x: x.rolling(3, min_periods=1).mean())
)

Rank Functions

df = pd.DataFrame({
'user_id': [1, 1, 1, 2, 2, 2],
'order_date': pd.to_datetime(['2024-01-01', '2024-01-15', '2024-02-01',
'2024-01-05', '2024-01-20', '2024-02-10']),
'amount': [100, 200, 150, 300, 50, 250]
})

# Rank within group
df['order_rank'] = df.groupby('user_id')['order_date'].rank(method='dense')

# Get first and last per group
df['is_first_order'] = df.groupby('user_id')['order_date'].rank() == 1
df['is_last_order'] = df.groupby('user_id')['order_date'].rank(ascending=False) == 1

# Cumulative count per group
df['order_number'] = df.groupby('user_id').cumcount() + 1

Part 6 - Handling Missing Data

Detection and Strategies

df = pd.DataFrame({
'age': [25, None, 35, 28, None],
'income': [50000, 75000, None, 60000, 80000],
'city': ['NYC', 'SF', None, 'NYC', 'LA'],
'score': [0.8, 0.6, 0.9, None, 0.7]
})

# Detection
df.isnull().sum() # Count per column
df.isnull().sum() / len(df) # Percentage per column
df.isnull().any(axis=1).sum() # Rows with any missing

# Strategy 1: Drop rows/columns
df.dropna() # Drop rows with any NaN
df.dropna(subset=['age']) # Drop only if 'age' is NaN
df.dropna(thresh=3) # Keep rows with at least 3 non-null values

# Strategy 2: Fill with constant
df['city'].fillna('Unknown')

# Strategy 3: Fill with statistics
df['age'].fillna(df['age'].median())
df['income'].fillna(df['income'].mean())

# Strategy 4: Group-aware fill (better for ML)
df['income'] = df.groupby('city')['income'].transform(
lambda x: x.fillna(x.median())
)

# Strategy 5: Forward/backward fill (for time series)
df['score'].fillna(method='ffill') # Forward fill
df['score'].fillna(method='bfill') # Backward fill

# Strategy 6: Interpolation (for time series)
df['score'].interpolate(method='linear')

# Create missing indicator features (useful for ML)
df['age_missing'] = df['age'].isnull().astype(int)

Missing Data Decision Framework

Missing Data Strategy

Common Trap

Never impute missing values using statistics computed on the entire dataset including the test set. This is data leakage. In interview settings, compute fill values on the training set only, then apply them to both train and test. Say this explicitly - interviewers listen for it. "I would compute the median on the training fold only and use that value to fill both train and test."

Part 7 - Performance Optimization

Common Antipatterns and Fixes

# ANTIPATTERN 1: iterrows() - O(n) with high constant factor
# Slow: ~1 sec for 100K rows
for idx, row in df.iterrows():
df.loc[idx, 'new_col'] = row['a'] + row['b']
# Fast: ~1 ms for 100K rows
df['new_col'] = df['a'] + df['b']

# ANTIPATTERN 2: apply() with a simple function
# Slow
df['upper_city'] = df['city'].apply(lambda x: x.upper() if pd.notna(x) else x)
# Fast
df['upper_city'] = df['city'].str.upper()

# ANTIPATTERN 3: Growing a DataFrame row by row
# Slow: O(n^2) due to repeated copies
result = pd.DataFrame()
for i in range(10000):
result = pd.concat([result, pd.DataFrame([{'a': i}])])
# Fast: build list first, then create DataFrame
rows = [{'a': i} for i in range(10000)]
result = pd.DataFrame(rows)

# ANTIPATTERN 4: Not using categoricals for low-cardinality strings
# Before: each string is a full Python object
df['city'].dtype # object - each value is a separate string
# After: integers with a lookup table
df['city'] = df['city'].astype('category')
# Memory: often 10-50x reduction

Performance Comparison Table

OperationSlow WayFast WaySpeedup
Element-wise mathiterrows() + arithmeticVectorized: df['a'] + df['b']100-1000x
Conditional columnapply(lambda)np.where(cond, val1, val2)10-100x
String operationsapply(str.method)df['col'].str.method()5-50x
Group statisticsManual loop over groupsgroupby().transform()10-100x
Build DataFrameAppend row by rowBuild list, then DataFrame100x
Low-cardinality stringsobject dtypecategory dtype10-50x memory
Multiple aggregationsMultiple groupby callsSingle agg({'col': [fns]})2-5x

Part 8 - Feature Engineering for ML

Time-Based Features

df = pd.DataFrame({
'user_id': [1]*5 + [2]*5,
'event_time': pd.to_datetime([
'2024-01-01 08:00', '2024-01-01 09:30', '2024-01-02 14:00',
'2024-01-05 16:00', '2024-01-10 10:00',
'2024-01-02 07:00', '2024-01-03 11:00', '2024-01-03 15:00',
'2024-01-07 09:00', '2024-01-15 12:00'
]),
'event_type': ['view', 'click', 'purchase', 'view', 'view',
'click', 'view', 'purchase', 'view', 'click']
})

# Date/time component features
df['hour'] = df['event_time'].dt.hour
df['day_of_week'] = df['event_time'].dt.dayofweek # 0=Mon, 6=Sun
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
df['month'] = df['event_time'].dt.month

# Time since previous event (per user)
df = df.sort_values(['user_id', 'event_time'])
df['time_since_prev'] = (
df.groupby('user_id')['event_time']
.diff()
.dt.total_seconds() / 3600 # Convert to hours
)

# Days since first event (user tenure)
df['days_since_first'] = (
df['event_time'] -
df.groupby('user_id')['event_time'].transform('min')
).dt.days

Aggregation-Based Features

# User-level aggregation features for a ML model
user_features = df.groupby('user_id').agg(
total_events=('event_type', 'count'),
unique_event_types=('event_type', 'nunique'),
first_event=('event_time', 'min'),
last_event=('event_time', 'max'),
avg_time_between_events=('time_since_prev', 'mean'),
).reset_index()

# Count per event type (one-hot-like aggregation)
event_counts = (
df.groupby(['user_id', 'event_type'])
.size()
.unstack(fill_value=0)
.add_prefix('count_')
.reset_index()
)
user_features = pd.merge(user_features, event_counts, on='user_id')

# Recency features
reference_date = df['event_time'].max()
user_features['days_since_last_event'] = (
(reference_date - user_features['last_event']).dt.days
)

# Activity span
user_features['active_days'] = (
(user_features['last_event'] - user_features['first_event']).dt.days
)

Target Encoding (with Regularization)

def target_encode(train_df, test_df, col, target, alpha=10):
"""
Target encoding with smoothing to prevent overfitting.

Args:
train_df: training DataFrame
test_df: test DataFrame
col: column to encode
target: target column name
alpha: smoothing parameter (higher = more regularization)
Returns:
train_encoded, test_encoded: Series with encoded values
"""
global_mean = train_df[target].mean()

# Group statistics on training data only
group_stats = train_df.groupby(col)[target].agg(['mean', 'count'])

# Smoothed encoding: weighted average of group mean and global mean
# More samples -> trust group mean more; few samples -> fall back to global mean
smoothed = (
(group_stats['count'] * group_stats['mean'] + alpha * global_mean) /
(group_stats['count'] + alpha)
)

# Map to both train and test
train_encoded = train_df[col].map(smoothed).fillna(global_mean)
test_encoded = test_df[col].map(smoothed).fillna(global_mean)

return train_encoded, test_encoded

Practice Problems

Problem 1: Cohort Retention Analysis

Given a DataFrame of user events with columns user_id, event_date, compute the monthly retention rate for each signup cohort.

Definition: A user is "retained" in month N if they have at least one event in the Nth month after their first event.

Hint 1 - Direction

First, determine each user's cohort (month of first event). Then, for each event, compute the months elapsed since the user's first event. Finally, compute the retention rate for each cohort-month pair.

Hint 2 - Key Operations

Use groupby().transform('min') to get each user's first event date. Use dt.to_period('M') for monthly grouping. Pivot to get a cohort x month retention matrix.

Hint 3 - Full Solution
def cohort_retention(events_df):
"""
Compute monthly cohort retention rates.

Args:
events_df: DataFrame with 'user_id' and 'event_date' columns
Returns:
DataFrame with cohort months as rows, months-since-signup as columns,
values are retention rates (0 to 1)
"""
df = events_df.copy()

# Step 1: Determine each user's cohort (month of first event)
df['first_event'] = df.groupby('user_id')['event_date'].transform('min')
df['cohort'] = df['first_event'].dt.to_period('M')

# Step 2: Compute months elapsed since first event
df['event_month'] = df['event_date'].dt.to_period('M')
df['months_since_signup'] = (
(df['event_month'] - df['cohort']).apply(lambda x: x.n)
)

# Step 3: Count unique users per cohort-month pair
cohort_data = (
df.groupby(['cohort', 'months_since_signup'])['user_id']
.nunique()
.reset_index()
.rename(columns={'user_id': 'active_users'})
)

# Step 4: Get cohort sizes (month 0 count)
cohort_sizes = (
cohort_data[cohort_data['months_since_signup'] == 0]
[['cohort', 'active_users']]
.rename(columns={'active_users': 'cohort_size'})
)

# Step 5: Compute retention rate
cohort_data = pd.merge(cohort_data, cohort_sizes, on='cohort')
cohort_data['retention_rate'] = (
cohort_data['active_users'] / cohort_data['cohort_size']
)

# Step 6: Pivot to matrix form
retention_matrix = cohort_data.pivot_table(
values='retention_rate',
index='cohort',
columns='months_since_signup',
fill_value=0
)

return retention_matrix

Scoring:

  • Strong Hire: Clean pipeline with no loops. Handles edge cases (users with only one event). Explains that retention is always 100% at month 0.
  • Lean Hire: Correct logic but uses apply/iterrows in places where vectorized operations would work.
  • No Hire: Cannot construct the cohort assignment without loops.

Problem 2: Feature Engineering Pipeline

Given a transactions DataFrame, create a feature matrix for a churn prediction model.

Columns: user_id, transaction_date, amount, category, is_refund

Create at least 10 meaningful features per user.

Hint 1 - Direction

Think about recency, frequency, monetary value (RFM), behavioral patterns (category diversity, refund rate), and temporal trends (recent vs historical behavior).

Hint 2 - Feature Categories

Categories: (1) basic aggregations (count, sum, mean), (2) recency (days since last), (3) frequency patterns (events per week), (4) behavioral (category entropy, refund rate), (5) trends (recent-30d vs all-time).

Hint 3 - Full Solution
def build_churn_features(transactions, reference_date=None):
"""
Build user-level features for churn prediction.

Args:
transactions: DataFrame with user_id, transaction_date, amount,
category, is_refund
reference_date: date for computing recency (default: max date)
Returns:
DataFrame with one row per user and 15+ features
"""
df = transactions.copy()
df['transaction_date'] = pd.to_datetime(df['transaction_date'])

if reference_date is None:
reference_date = df['transaction_date'].max()

# ---- Basic aggregation features ----
features = df.groupby('user_id').agg(
total_transactions=('transaction_date', 'count'),
total_spend=('amount', 'sum'),
avg_transaction=('amount', 'mean'),
max_transaction=('amount', 'max'),
std_transaction=('amount', 'std'),
).reset_index()

# ---- Recency features ----
recency = df.groupby('user_id')['transaction_date'].agg(
first_transaction='min',
last_transaction='max'
).reset_index()
recency['days_since_last'] = (
(reference_date - recency['last_transaction']).dt.days
)
recency['tenure_days'] = (
(recency['last_transaction'] - recency['first_transaction']).dt.days
)
features = features.merge(
recency[['user_id', 'days_since_last', 'tenure_days']], on='user_id'
)

# ---- Frequency features ----
features['transactions_per_day'] = (
features['total_transactions'] /
features['tenure_days'].clip(lower=1)
)

# ---- Behavioral features ----
# Category diversity
cat_diversity = df.groupby('user_id')['category'].nunique().reset_index()
cat_diversity.columns = ['user_id', 'unique_categories']
features = features.merge(cat_diversity, on='user_id')

# Refund rate
refund_stats = df.groupby('user_id')['is_refund'].mean().reset_index()
refund_stats.columns = ['user_id', 'refund_rate']
features = features.merge(refund_stats, on='user_id')

# ---- Trend features (recent vs all-time) ----
cutoff_30d = reference_date - pd.Timedelta(days=30)
recent = df[df['transaction_date'] >= cutoff_30d]
recent_stats = recent.groupby('user_id').agg(
recent_30d_count=('transaction_date', 'count'),
recent_30d_spend=('amount', 'sum')
).reset_index()
features = features.merge(recent_stats, on='user_id', how='left')
features['recent_30d_count'] = features['recent_30d_count'].fillna(0)
features['recent_30d_spend'] = features['recent_30d_spend'].fillna(0)

# Ratio of recent to total activity
features['recent_activity_ratio'] = (
features['recent_30d_count'] /
features['total_transactions'].clip(lower=1)
)

# ---- Fill remaining NaNs ----
features['std_transaction'] = features['std_transaction'].fillna(0)

return features

Scoring:

  • Strong Hire: 10+ features covering RFM, behavioral, and trend categories. No data leakage (uses reference_date). Clean merge-based pipeline.
  • Lean Hire: 5-8 reasonable features but misses trend features or has potential leakage.
  • No Hire: Fewer than 5 features or uses loops to compute them.

Problem 3: Funnel Analysis

Given a DataFrame of user events with user_id, event_name, timestamp, compute conversion rates through a defined funnel: page_view -> add_to_cart -> checkout -> purchase.

Hint 1 - Direction

For each user, check if they completed each stage (in order). A user who added to cart but never viewed a page should not count - events must happen in the correct sequence.

Hint 2 - Approach

Pivot the data to get the first timestamp of each event type per user. Then check if each stage's timestamp is after the previous stage's timestamp.

Hint 3 - Full Solution
def funnel_analysis(events_df, funnel_stages):
"""
Compute funnel conversion rates.

Args:
events_df: DataFrame with user_id, event_name, timestamp
funnel_stages: list of event names in order, e.g.
['page_view', 'add_to_cart', 'checkout', 'purchase']
Returns:
DataFrame with stage, users_at_stage, conversion_rate, drop_off_rate
"""
df = events_df.copy()

# Step 1: Get first occurrence of each event type per user
first_events = (
df[df['event_name'].isin(funnel_stages)]
.sort_values('timestamp')
.groupby(['user_id', 'event_name'])['timestamp']
.first()
.unstack(fill_value=pd.NaT)
)

# Step 2: Ensure events happen in order
# A user completes stage N if:
# (a) they performed stage N, AND
# (b) they completed stage N-1, AND
# (c) stage N happened after stage N-1
completed = pd.DataFrame(index=first_events.index)
completed[funnel_stages[0]] = first_events[funnel_stages[0]].notna()

for i in range(1, len(funnel_stages)):
prev_stage = funnel_stages[i - 1]
curr_stage = funnel_stages[i]

has_event = first_events[curr_stage].notna()
completed_prev = completed[prev_stage]

if curr_stage in first_events.columns:
in_order = (
first_events[curr_stage] >= first_events[prev_stage]
).fillna(False)
else:
in_order = False

completed[curr_stage] = has_event & completed_prev & in_order

# Step 3: Compute funnel metrics
total_users = len(first_events)
results = []
for stage in funnel_stages:
users_at_stage = completed[stage].sum()
conversion_rate = users_at_stage / total_users
results.append({
'stage': stage,
'users_at_stage': int(users_at_stage),
'conversion_from_top': f"{conversion_rate:.1%}"
})

results_df = pd.DataFrame(results)

# Add stage-to-stage conversion
results_df['stage_conversion'] = (
results_df['users_at_stage'] /
results_df['users_at_stage'].shift(1)
).fillna(1.0).apply(lambda x: f"{x:.1%}")

return results_df

Scoring:

  • Strong Hire: Handles event ordering correctly (events must happen in sequence). Computes both overall and stage-to-stage conversion. No loops over users.
  • Lean Hire: Correct logic but does not enforce event ordering.
  • No Hire: Uses loops over users to check funnel completion.

Problem 4: Time Series Anomaly Detection

Given daily metrics per city, flag days where the metric deviates more than 2 standard deviations from the 30-day rolling mean for that city.

Hint 1 - Direction

Group by city, compute rolling mean and std, then flag where the absolute deviation exceeds 2 * rolling_std.

Hint 2 - Edge cases

Handle the start of each city's time series where the rolling window is not full. Use min_periods parameter.

Hint 3 - Full Solution
def detect_anomalies(df, metric_col, group_col='city',
window=30, n_std=2, min_periods=7):
"""
Detect anomalies using rolling z-score per group.

Args:
df: DataFrame with date index, group column, and metric column
metric_col: name of the metric column
group_col: column to group by
window: rolling window size
n_std: number of standard deviations for threshold
min_periods: minimum observations for rolling stats
Returns:
DataFrame with added columns: rolling_mean, rolling_std,
z_score, is_anomaly
"""
result = df.copy().sort_values([group_col, 'date'])

# Compute rolling statistics per group
grouped = result.groupby(group_col)[metric_col]

result['rolling_mean'] = grouped.transform(
lambda x: x.rolling(window, min_periods=min_periods).mean()
)
result['rolling_std'] = grouped.transform(
lambda x: x.rolling(window, min_periods=min_periods).std()
)

# Compute z-score
result['z_score'] = (
(result[metric_col] - result['rolling_mean']) /
result['rolling_std'].clip(lower=1e-8) # Avoid division by zero
)

# Flag anomalies
result['is_anomaly'] = result['z_score'].abs() > n_std

# Classify direction
result['anomaly_type'] = np.where(
~result['is_anomaly'], 'normal',
np.where(result['z_score'] > 0, 'spike', 'drop')
)

return result

Scoring:

  • Strong Hire: Handles per-group rolling, min_periods, division by zero, and anomaly direction. Discusses limitations (non-stationarity, seasonality).
  • Lean Hire: Correct basic approach but misses edge cases.
  • No Hire: Computes rolling stats without grouping or uses global statistics.

Problem 5: Efficient Large-Scale Merge

You have a 10M-row events table and a 1M-row users table. The merge is slow. Optimize it.

Hint 1 - Direction

Check dtypes (string keys are slow), consider sorting before merge, and reduce columns before merging.

Hint 2 - Techniques

(1) Convert string IDs to categoricals or integers. (2) Select only needed columns before merge. (3) Use merge instead of join when possible. (4) Consider chunked processing.

Hint 3 - Full Solution
def optimized_merge(events_df, users_df, key='user_id', needed_user_cols=None):
"""
Optimized large-scale merge.

Args:
events_df: large DataFrame (10M+ rows)
users_df: lookup DataFrame (1M+ rows)
key: merge key
needed_user_cols: list of columns needed from users_df
Returns:
Merged DataFrame
"""
# Optimization 1: Select only needed columns
if needed_user_cols is not None:
users_subset = users_df[[key] + needed_user_cols].copy()
else:
users_subset = users_df.copy()

# Optimization 2: Ensure key dtype is efficient
# Convert string IDs to categorical
if events_df[key].dtype == 'object':
# Create shared categories
all_ids = pd.CategoricalDtype(
categories=pd.unique(
pd.concat([events_df[key], users_subset[key]])
)
)
events_df[key] = events_df[key].astype(all_ids)
users_subset[key] = users_subset[key].astype(all_ids)

# Optimization 3: Drop duplicates in lookup table
users_subset = users_subset.drop_duplicates(subset=[key])

# Optimization 4: Use merge with validate
result = pd.merge(
events_df,
users_subset,
on=key,
how='left',
validate='many_to_one' # Catch unexpected duplicates
)

return result

# Additional tip: if memory is a concern, process in chunks
def chunked_merge(events_df, users_df, key, chunk_size=1_000_000):
"""Process merge in chunks to limit peak memory."""
chunks = []
for start in range(0, len(events_df), chunk_size):
chunk = events_df.iloc[start:start + chunk_size]
merged_chunk = pd.merge(chunk, users_df, on=key, how='left')
chunks.append(merged_chunk)
return pd.concat(chunks, ignore_index=True)

Scoring:

  • Strong Hire: Addresses dtype optimization, column selection, deduplication, and memory concerns. Mentions alternatives (Polars, DuckDB) for truly large datasets.
  • Lean Hire: Identifies one or two optimizations but not the full picture.
  • No Hire: Cannot identify why the merge is slow or only suggests "get more RAM."

Interview Cheat Sheet

ConceptKey PatternOne-LinerRed Flag
Selectiondf.loc[rows, cols] for labels, df.iloc for positionsloc = labels, iloc = integersConfusing loc and iloc
Boolean filterdf[(cond1) & (cond2)]Use & not and, parentheses requiredUsing Python and with Series
GroupBy aggdf.groupby(col).agg(name=(col, func))One row per groupUsing loops over unique values
Transformgroupby().transform()Same shape as input, broadcasts group statNot knowing transform exists
Mergepd.merge(left, right, on=key, how='left')SQL JOIN equivalentNot specifying how (default inner)
Pivotdf.pivot_table(values, index, columns, aggfunc)Long -> wide with aggregationConfusing pivot and pivot_table
Rollingdf.rolling(window).mean()Moving average with min_periodsForgetting min_periods
Missing dataFill train stats only, add indicatorNever leak test infoComputing fill values on full dataset
PerformanceVectorize, categoricals, select cols firstNo iterrows, no row-by-row appendBuilding DataFrames row by row
Feature engRFM + behavioral + trend featuresGroup-aggregate-merge pipelineComputing features with for-loops

Spaced Repetition Checkpoints

Day 0 - Initial Learning

  • Read this entire page and run all code examples
  • Write a groupby-agg pipeline from memory
  • Explain the difference between agg, transform, and apply
  • Complete the self-assessment

Day 3 - First Recall

  • Implement cohort retention analysis without looking at the solution
  • Write 5 different merge examples (inner, left, outer, multi-key, with validation)
  • Create 10 features from a transactions DataFrame from memory

Day 7 - Connections

  • Solve the funnel analysis problem under timed conditions (15 min)
  • Implement the anomaly detection problem without hints
  • Explain 5 Pandas performance antipatterns and their fixes

Day 14 - Application

  • Given a new dataset, build a complete feature engineering pipeline in 20 minutes
  • Optimize a slow Pandas pipeline (identify bottlenecks, fix them)
  • Explain the difference between Pandas and SQL for data manipulation

Day 21 - Mock Interview

  • Have someone give you a novel dataset and ask you to compute metrics
  • Solve all 5 practice problems in sequence under timed conditions (60 min)
  • Explain your feature engineering approach as if presenting to an interviewer

Key Takeaways

  1. GroupBy is the backbone of Pandas interviews. Know agg() for summarization, transform() for broadcasting group statistics back to every row, and apply() for everything else. If you reach for a for-loop over groups, you have already lost.

  2. Merges are where data quality issues hide. Always specify how=, check the output shape, and use validate= to catch unexpected duplicates. In interviews, explicitly state your join type and why.

  3. Feature engineering is what separates DS/MLE candidates. Anyone can compute a mean. The strong candidate computes recency, frequency, monetary value, behavioral diversity, temporal trends, and missing indicators - all without loops, all from training data only.

  4. Performance matters at scale. Knowing that iterrows() is 1000x slower than vectorized operations, that categoricals save 10x memory, and that you should select columns before merging are signals that you can handle production data.

Next Steps

Continue to SQL for ML Roles to master the query language that every data role requires - with ML-specific patterns for feature extraction, A/B test analysis, and metric computation.

© 2026 EngineersOfAI. All rights reserved.