Pandas for Interviews - Data Wrangling Mastery for ML Roles
Reading time: ~45 min | Interview relevance: Critical | Roles: Data Scientist, MLE, AI Eng, Analytics Engineer
The Real Interview Moment
You are forty-five minutes into an Airbnb Data Science on-site. The interviewer opens a shared notebook with a DataFrame of 2 million booking records and says: "Compute the 7-day rolling average of bookings per city, then flag any city-day pair where the rolling average dropped more than 30% compared to the same day last year. You have 20 minutes."
You stare at the DataFrame. You know you need groupby, a rolling window, a date offset for year-over-year comparison, and a boolean mask for the flag. But the clock is ticking. Do you start with a for-loop iterating over cities? Do you merge the DataFrame with itself on a shifted date? Or do you reach for groupby().rolling() with a DateOffset, chain it with a transform, and solve it in five lines?
This is what Pandas interviews test: not whether you can Google the API, but whether you can compose operations fluently under pressure. Candidates who build DataFrames row by row or write nested for-loops over groups get a "no hire." Candidates who chain groupby, merge, transform, and window functions into clean pipelines get a "strong hire." This page builds that fluency.
What You Will Master
- Create and inspect DataFrames with dtypes, shape, describe, info, and value_counts
- Select and filter with loc, iloc, boolean indexing, and query
- Group and aggregate with groupby, agg, transform, and apply
- Merge and join DataFrames with merge, join, concat - and know when each is appropriate
- Reshape with pivot_table, melt, stack, and unstack
- Window functions with rolling, expanding, shift, and rank
- Handle missing data with fillna, dropna, interpolate, and imputation strategies
- Optimize performance with vectorized operations, categoricals, and avoiding common antipatterns
- Engineer features for ML models using Pandas operations
Self-Assessment: Where Are You Now?
| Skill | 1 - Cannot | 2 - Vaguely | 3 - Can Do | 4 - Can Optimize | 5 - Can Teach | Your Score |
|---|---|---|---|---|---|---|
| Select/filter with loc, iloc, boolean | ___ | |||||
| GroupBy with agg and multiple functions | ___ | |||||
| Merge/join (inner, left, outer) | ___ | |||||
| Pivot tables and reshaping | ___ | |||||
| Rolling/expanding window functions | ___ | |||||
| Handle missing data strategically | ___ | |||||
| Avoid loops in favor of vectorized ops | ___ | |||||
| Feature engineering for ML | ___ |
Target: All 4s and 5s before your interview.
Part 1 - DataFrame Fundamentals
Creation and Inspection
import pandas as pd
import numpy as np
# From dictionary
df = pd.DataFrame({
'user_id': [1, 2, 3, 4, 5],
'age': [25, 30, 35, 28, 42],
'city': ['NYC', 'SF', 'NYC', 'LA', 'SF'],
'signup_date': pd.to_datetime(['2024-01-15', '2024-02-20', '2024-01-10',
'2024-03-05', '2024-02-28']),
'revenue': [100.0, 250.0, None, 175.0, 300.0]
})
# Essential inspection
df.shape # (5, 5) - rows, columns
df.dtypes # dtype per column
df.info() # Non-null counts, memory usage
df.describe() # Summary statistics for numeric columns
df['city'].value_counts() # Frequency counts
df.isnull().sum() # Missing values per column
"Pandas DataFrames are column-oriented data structures built on NumPy arrays. Each column is a contiguous array with a single dtype, which is why column-wise operations are fast and row-wise iteration is slow. I always start with df.info() and df.describe() to understand dtypes, missing values, and distributions. For ML feature engineering, I use vectorized operations, groupby transforms, and merge-based feature joins - never iterrows()."
Selection and Filtering
# Column selection
df['age'] # Single column (Series)
df[['age', 'city']] # Multiple columns (DataFrame)
# Row selection by label
df.loc[0] # Row with index label 0
df.loc[0:2, 'age':'city'] # Slice by labels (inclusive)
# Row selection by position
df.iloc[0] # First row
df.iloc[0:2, 1:3] # First 2 rows, columns 1-2
# Boolean filtering
df[df['age'] > 30] # Filter rows
df[(df['age'] > 25) & (df['city'] == 'NYC')] # Multiple conditions
df.query('age > 25 and city == "NYC"') # Query syntax (cleaner)
# isin for multiple values
df[df['city'].isin(['NYC', 'SF'])]
# String operations
df[df['city'].str.startswith('N')]
Do NOT use df[df['age'] > 30 and df['city'] == 'NYC'] - Python's and operator does not work element-wise on Series. You must use & (bitwise AND) and wrap each condition in parentheses. This is one of the most common Pandas bugs and interviewers will test it. Similarly, use | for OR and ~ for NOT.
Part 2 - GroupBy: The Most Important Operation
Basic GroupBy
# Sample data: e-commerce orders
orders = pd.DataFrame({
'order_id': range(1, 11),
'user_id': [1, 2, 1, 3, 2, 1, 3, 2, 1, 3],
'product_category': ['electronics', 'books', 'clothing', 'electronics',
'electronics', 'books', 'clothing', 'books',
'electronics', 'books'],
'amount': [120, 15, 45, 200, 80, 25, 60, 18, 150, 30],
'date': pd.to_datetime(['2024-01-05', '2024-01-06', '2024-01-07',
'2024-01-08', '2024-01-10', '2024-01-12',
'2024-01-15', '2024-01-18', '2024-01-20',
'2024-01-22'])
})
# Single aggregation
orders.groupby('user_id')['amount'].sum()
# Multiple aggregations
orders.groupby('user_id')['amount'].agg(['sum', 'mean', 'count', 'max'])
# Named aggregations (clean output)
orders.groupby('user_id').agg(
total_spend=('amount', 'sum'),
avg_order=('amount', 'mean'),
order_count=('amount', 'count'),
first_order=('date', 'min'),
last_order=('date', 'max')
).reset_index()
# Multiple group columns
orders.groupby(['user_id', 'product_category'])['amount'].sum().reset_index()
Transform vs Agg
# transform: add a column with the group mean (for normalization)
orders['user_avg'] = orders.groupby('user_id')['amount'].transform('mean')
# Use transform for group-level features without merge
orders['pct_of_user_total'] = (
orders['amount'] / orders.groupby('user_id')['amount'].transform('sum')
)
# Use transform for group-level standardization
orders['z_score_within_user'] = orders.groupby('user_id')['amount'].transform(
lambda x: (x - x.mean()) / x.std()
)
Never write for user_id in df['user_id'].unique(): subset = df[df['user_id'] == user_id]; ... - this is an O(n * k) anti-pattern where k is the number of groups. Use groupby, which is O(n). Interviewers interpret manual iteration over groups as a fundamental misunderstanding of Pandas. The only exception is when your per-group operation genuinely cannot be expressed as agg, transform, or apply.
Part 3 - Merge and Join
Merge Types
# Sample data
users = pd.DataFrame({
'user_id': [1, 2, 3, 4],
'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
'plan': ['free', 'paid', 'paid', 'free']
})
purchases = pd.DataFrame({
'purchase_id': [101, 102, 103, 104, 105],
'user_id': [1, 2, 2, 5, 1],
'amount': [50, 100, 75, 200, 30]
})
# Inner join: only matching rows from both
inner = pd.merge(users, purchases, on='user_id', how='inner')
# Users 1, 2 only (3, 4 have no purchases; 5 is not in users)
# Left join: all rows from left, matching from right
left = pd.merge(users, purchases, on='user_id', how='left')
# All 4 users; Charlie and Diana have NaN for purchase columns
# Right join: all rows from right, matching from left
right = pd.merge(users, purchases, on='user_id', how='right')
# All 5 purchases; user_id 5 has NaN for user columns
# Outer join: all rows from both
outer = pd.merge(users, purchases, on='user_id', how='outer')
# Merge on different column names
pd.merge(users, purchases, left_on='user_id', right_on='user_id')
# Merge on multiple keys
pd.merge(df1, df2, on=['user_id', 'date'])
Merge Comparison Table
| Join Type | Keeps From Left | Keeps From Right | Use Case |
|---|---|---|---|
| inner | Matching only | Matching only | Standard feature join when all keys exist |
| left | All | Matching only | Enrich base table without losing rows |
| right | Matching only | All | Enrich lookup table |
| outer | All | All | Union - find unmatched on both sides |
| cross | All (repeated) | All (repeated) | Cartesian product for pairwise features |
Meta and Airbnb heavily test merge operations because their data is spread across many tables (users, events, sessions, bookings). They expect you to know when to use left vs inner join and how to handle duplicates after merging. Google tends to test merge concepts in SQL rather than Pandas. Startups and ML-focused roles test merge in the context of feature engineering: "Join user features with transaction features for your model."
Handling Merge Pitfalls
# Pitfall 1: Unintended row multiplication
# If keys are not unique on both sides, merge creates a cross product
A = pd.DataFrame({'key': [1, 1], 'val_a': ['a1', 'a2']})
B = pd.DataFrame({'key': [1, 1], 'val_b': ['b1', 'b2']})
result = pd.merge(A, B, on='key') # 4 rows (2 x 2)!
# Prevention: validate the merge
pd.merge(A, B, on='key', validate='one_to_one') # Raises error
# Pitfall 2: Overlapping column names
pd.merge(df1, df2, on='user_id', suffixes=('_left', '_right'))
# Pitfall 3: Missing key values after merge
# Always check shape after merge
print(f"Before: {len(users)}, After: {len(left)}")
Part 4 - Pivot Tables and Reshaping
# Sample data
sales = pd.DataFrame({
'date': pd.to_datetime(['2024-01-01']*3 + ['2024-01-02']*3),
'city': ['NYC', 'SF', 'LA'] * 2,
'product': ['A', 'B', 'A', 'B', 'A', 'B'],
'revenue': [100, 200, 150, 300, 120, 250]
})
# Pivot table: rows = date, columns = city, values = revenue
pivot = sales.pivot_table(
values='revenue',
index='date',
columns='city',
aggfunc='sum',
fill_value=0
)
# Multiple aggregations
pivot_multi = sales.pivot_table(
values='revenue',
index='date',
columns='city',
aggfunc=['sum', 'mean', 'count']
)
# Melt: wide -> long (inverse of pivot)
wide_df = pd.DataFrame({
'user_id': [1, 2, 3],
'score_2023': [80, 90, 75],
'score_2024': [85, 88, 82]
})
long_df = wide_df.melt(
id_vars='user_id',
value_vars=['score_2023', 'score_2024'],
var_name='year',
value_name='score'
)
# Stack/Unstack: move between row and column indices
stacked = pivot.stack() # Columns -> row multi-index
unstacked = stacked.unstack() # Row multi-index -> columns
Part 5 - Window Functions
Rolling Windows
# Time series data
ts = pd.DataFrame({
'date': pd.date_range('2024-01-01', periods=30, freq='D'),
'value': np.random.randn(30).cumsum() + 100
})
ts = ts.set_index('date')
# Rolling mean (7-day moving average)
ts['ma_7'] = ts['value'].rolling(window=7).mean()
# Rolling with minimum periods (handles start of series)
ts['ma_7_min3'] = ts['value'].rolling(window=7, min_periods=3).mean()
# Rolling standard deviation (volatility)
ts['std_7'] = ts['value'].rolling(window=7).std()
# Expanding window (cumulative)
ts['cumulative_mean'] = ts['value'].expanding().mean()
ts['cumulative_max'] = ts['value'].expanding().max()
# Shift (lag features - critical for ML)
ts['lag_1'] = ts['value'].shift(1) # Yesterday's value
ts['lag_7'] = ts['value'].shift(7) # Last week's value
ts['diff_1'] = ts['value'].diff(1) # Day-over-day change
ts['pct_change'] = ts['value'].pct_change(1) # Percentage change
# Rolling per group
multi_ts = pd.DataFrame({
'date': pd.date_range('2024-01-01', periods=10, freq='D').tolist() * 3,
'city': ['NYC']*10 + ['SF']*10 + ['LA']*10,
'sales': np.random.randint(50, 200, 30)
})
multi_ts['rolling_avg'] = (
multi_ts.groupby('city')['sales']
.transform(lambda x: x.rolling(3, min_periods=1).mean())
)
Rank Functions
df = pd.DataFrame({
'user_id': [1, 1, 1, 2, 2, 2],
'order_date': pd.to_datetime(['2024-01-01', '2024-01-15', '2024-02-01',
'2024-01-05', '2024-01-20', '2024-02-10']),
'amount': [100, 200, 150, 300, 50, 250]
})
# Rank within group
df['order_rank'] = df.groupby('user_id')['order_date'].rank(method='dense')
# Get first and last per group
df['is_first_order'] = df.groupby('user_id')['order_date'].rank() == 1
df['is_last_order'] = df.groupby('user_id')['order_date'].rank(ascending=False) == 1
# Cumulative count per group
df['order_number'] = df.groupby('user_id').cumcount() + 1
Part 6 - Handling Missing Data
Detection and Strategies
df = pd.DataFrame({
'age': [25, None, 35, 28, None],
'income': [50000, 75000, None, 60000, 80000],
'city': ['NYC', 'SF', None, 'NYC', 'LA'],
'score': [0.8, 0.6, 0.9, None, 0.7]
})
# Detection
df.isnull().sum() # Count per column
df.isnull().sum() / len(df) # Percentage per column
df.isnull().any(axis=1).sum() # Rows with any missing
# Strategy 1: Drop rows/columns
df.dropna() # Drop rows with any NaN
df.dropna(subset=['age']) # Drop only if 'age' is NaN
df.dropna(thresh=3) # Keep rows with at least 3 non-null values
# Strategy 2: Fill with constant
df['city'].fillna('Unknown')
# Strategy 3: Fill with statistics
df['age'].fillna(df['age'].median())
df['income'].fillna(df['income'].mean())
# Strategy 4: Group-aware fill (better for ML)
df['income'] = df.groupby('city')['income'].transform(
lambda x: x.fillna(x.median())
)
# Strategy 5: Forward/backward fill (for time series)
df['score'].fillna(method='ffill') # Forward fill
df['score'].fillna(method='bfill') # Backward fill
# Strategy 6: Interpolation (for time series)
df['score'].interpolate(method='linear')
# Create missing indicator features (useful for ML)
df['age_missing'] = df['age'].isnull().astype(int)
Missing Data Decision Framework
Never impute missing values using statistics computed on the entire dataset including the test set. This is data leakage. In interview settings, compute fill values on the training set only, then apply them to both train and test. Say this explicitly - interviewers listen for it. "I would compute the median on the training fold only and use that value to fill both train and test."
Part 7 - Performance Optimization
Common Antipatterns and Fixes
# ANTIPATTERN 1: iterrows() - O(n) with high constant factor
# Slow: ~1 sec for 100K rows
for idx, row in df.iterrows():
df.loc[idx, 'new_col'] = row['a'] + row['b']
# Fast: ~1 ms for 100K rows
df['new_col'] = df['a'] + df['b']
# ANTIPATTERN 2: apply() with a simple function
# Slow
df['upper_city'] = df['city'].apply(lambda x: x.upper() if pd.notna(x) else x)
# Fast
df['upper_city'] = df['city'].str.upper()
# ANTIPATTERN 3: Growing a DataFrame row by row
# Slow: O(n^2) due to repeated copies
result = pd.DataFrame()
for i in range(10000):
result = pd.concat([result, pd.DataFrame([{'a': i}])])
# Fast: build list first, then create DataFrame
rows = [{'a': i} for i in range(10000)]
result = pd.DataFrame(rows)
# ANTIPATTERN 4: Not using categoricals for low-cardinality strings
# Before: each string is a full Python object
df['city'].dtype # object - each value is a separate string
# After: integers with a lookup table
df['city'] = df['city'].astype('category')
# Memory: often 10-50x reduction
Performance Comparison Table
| Operation | Slow Way | Fast Way | Speedup |
|---|---|---|---|
| Element-wise math | iterrows() + arithmetic | Vectorized: df['a'] + df['b'] | 100-1000x |
| Conditional column | apply(lambda) | np.where(cond, val1, val2) | 10-100x |
| String operations | apply(str.method) | df['col'].str.method() | 5-50x |
| Group statistics | Manual loop over groups | groupby().transform() | 10-100x |
| Build DataFrame | Append row by row | Build list, then DataFrame | 100x |
| Low-cardinality strings | object dtype | category dtype | 10-50x memory |
| Multiple aggregations | Multiple groupby calls | Single agg({'col': [fns]}) | 2-5x |
Part 8 - Feature Engineering for ML
Time-Based Features
df = pd.DataFrame({
'user_id': [1]*5 + [2]*5,
'event_time': pd.to_datetime([
'2024-01-01 08:00', '2024-01-01 09:30', '2024-01-02 14:00',
'2024-01-05 16:00', '2024-01-10 10:00',
'2024-01-02 07:00', '2024-01-03 11:00', '2024-01-03 15:00',
'2024-01-07 09:00', '2024-01-15 12:00'
]),
'event_type': ['view', 'click', 'purchase', 'view', 'view',
'click', 'view', 'purchase', 'view', 'click']
})
# Date/time component features
df['hour'] = df['event_time'].dt.hour
df['day_of_week'] = df['event_time'].dt.dayofweek # 0=Mon, 6=Sun
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
df['month'] = df['event_time'].dt.month
# Time since previous event (per user)
df = df.sort_values(['user_id', 'event_time'])
df['time_since_prev'] = (
df.groupby('user_id')['event_time']
.diff()
.dt.total_seconds() / 3600 # Convert to hours
)
# Days since first event (user tenure)
df['days_since_first'] = (
df['event_time'] -
df.groupby('user_id')['event_time'].transform('min')
).dt.days
Aggregation-Based Features
# User-level aggregation features for a ML model
user_features = df.groupby('user_id').agg(
total_events=('event_type', 'count'),
unique_event_types=('event_type', 'nunique'),
first_event=('event_time', 'min'),
last_event=('event_time', 'max'),
avg_time_between_events=('time_since_prev', 'mean'),
).reset_index()
# Count per event type (one-hot-like aggregation)
event_counts = (
df.groupby(['user_id', 'event_type'])
.size()
.unstack(fill_value=0)
.add_prefix('count_')
.reset_index()
)
user_features = pd.merge(user_features, event_counts, on='user_id')
# Recency features
reference_date = df['event_time'].max()
user_features['days_since_last_event'] = (
(reference_date - user_features['last_event']).dt.days
)
# Activity span
user_features['active_days'] = (
(user_features['last_event'] - user_features['first_event']).dt.days
)
Target Encoding (with Regularization)
def target_encode(train_df, test_df, col, target, alpha=10):
"""
Target encoding with smoothing to prevent overfitting.
Args:
train_df: training DataFrame
test_df: test DataFrame
col: column to encode
target: target column name
alpha: smoothing parameter (higher = more regularization)
Returns:
train_encoded, test_encoded: Series with encoded values
"""
global_mean = train_df[target].mean()
# Group statistics on training data only
group_stats = train_df.groupby(col)[target].agg(['mean', 'count'])
# Smoothed encoding: weighted average of group mean and global mean
# More samples -> trust group mean more; few samples -> fall back to global mean
smoothed = (
(group_stats['count'] * group_stats['mean'] + alpha * global_mean) /
(group_stats['count'] + alpha)
)
# Map to both train and test
train_encoded = train_df[col].map(smoothed).fillna(global_mean)
test_encoded = test_df[col].map(smoothed).fillna(global_mean)
return train_encoded, test_encoded
Practice Problems
Problem 1: Cohort Retention Analysis
Given a DataFrame of user events with columns user_id, event_date, compute the monthly retention rate for each signup cohort.
Definition: A user is "retained" in month N if they have at least one event in the Nth month after their first event.
Hint 1 - Direction
First, determine each user's cohort (month of first event). Then, for each event, compute the months elapsed since the user's first event. Finally, compute the retention rate for each cohort-month pair.
Hint 2 - Key Operations
Use groupby().transform('min') to get each user's first event date. Use dt.to_period('M') for monthly grouping. Pivot to get a cohort x month retention matrix.
Hint 3 - Full Solution
def cohort_retention(events_df):
"""
Compute monthly cohort retention rates.
Args:
events_df: DataFrame with 'user_id' and 'event_date' columns
Returns:
DataFrame with cohort months as rows, months-since-signup as columns,
values are retention rates (0 to 1)
"""
df = events_df.copy()
# Step 1: Determine each user's cohort (month of first event)
df['first_event'] = df.groupby('user_id')['event_date'].transform('min')
df['cohort'] = df['first_event'].dt.to_period('M')
# Step 2: Compute months elapsed since first event
df['event_month'] = df['event_date'].dt.to_period('M')
df['months_since_signup'] = (
(df['event_month'] - df['cohort']).apply(lambda x: x.n)
)
# Step 3: Count unique users per cohort-month pair
cohort_data = (
df.groupby(['cohort', 'months_since_signup'])['user_id']
.nunique()
.reset_index()
.rename(columns={'user_id': 'active_users'})
)
# Step 4: Get cohort sizes (month 0 count)
cohort_sizes = (
cohort_data[cohort_data['months_since_signup'] == 0]
[['cohort', 'active_users']]
.rename(columns={'active_users': 'cohort_size'})
)
# Step 5: Compute retention rate
cohort_data = pd.merge(cohort_data, cohort_sizes, on='cohort')
cohort_data['retention_rate'] = (
cohort_data['active_users'] / cohort_data['cohort_size']
)
# Step 6: Pivot to matrix form
retention_matrix = cohort_data.pivot_table(
values='retention_rate',
index='cohort',
columns='months_since_signup',
fill_value=0
)
return retention_matrix
Scoring:
- Strong Hire: Clean pipeline with no loops. Handles edge cases (users with only one event). Explains that retention is always 100% at month 0.
- Lean Hire: Correct logic but uses apply/iterrows in places where vectorized operations would work.
- No Hire: Cannot construct the cohort assignment without loops.
Problem 2: Feature Engineering Pipeline
Given a transactions DataFrame, create a feature matrix for a churn prediction model.
Columns: user_id, transaction_date, amount, category, is_refund
Create at least 10 meaningful features per user.
Hint 1 - Direction
Think about recency, frequency, monetary value (RFM), behavioral patterns (category diversity, refund rate), and temporal trends (recent vs historical behavior).
Hint 2 - Feature Categories
Categories: (1) basic aggregations (count, sum, mean), (2) recency (days since last), (3) frequency patterns (events per week), (4) behavioral (category entropy, refund rate), (5) trends (recent-30d vs all-time).
Hint 3 - Full Solution
def build_churn_features(transactions, reference_date=None):
"""
Build user-level features for churn prediction.
Args:
transactions: DataFrame with user_id, transaction_date, amount,
category, is_refund
reference_date: date for computing recency (default: max date)
Returns:
DataFrame with one row per user and 15+ features
"""
df = transactions.copy()
df['transaction_date'] = pd.to_datetime(df['transaction_date'])
if reference_date is None:
reference_date = df['transaction_date'].max()
# ---- Basic aggregation features ----
features = df.groupby('user_id').agg(
total_transactions=('transaction_date', 'count'),
total_spend=('amount', 'sum'),
avg_transaction=('amount', 'mean'),
max_transaction=('amount', 'max'),
std_transaction=('amount', 'std'),
).reset_index()
# ---- Recency features ----
recency = df.groupby('user_id')['transaction_date'].agg(
first_transaction='min',
last_transaction='max'
).reset_index()
recency['days_since_last'] = (
(reference_date - recency['last_transaction']).dt.days
)
recency['tenure_days'] = (
(recency['last_transaction'] - recency['first_transaction']).dt.days
)
features = features.merge(
recency[['user_id', 'days_since_last', 'tenure_days']], on='user_id'
)
# ---- Frequency features ----
features['transactions_per_day'] = (
features['total_transactions'] /
features['tenure_days'].clip(lower=1)
)
# ---- Behavioral features ----
# Category diversity
cat_diversity = df.groupby('user_id')['category'].nunique().reset_index()
cat_diversity.columns = ['user_id', 'unique_categories']
features = features.merge(cat_diversity, on='user_id')
# Refund rate
refund_stats = df.groupby('user_id')['is_refund'].mean().reset_index()
refund_stats.columns = ['user_id', 'refund_rate']
features = features.merge(refund_stats, on='user_id')
# ---- Trend features (recent vs all-time) ----
cutoff_30d = reference_date - pd.Timedelta(days=30)
recent = df[df['transaction_date'] >= cutoff_30d]
recent_stats = recent.groupby('user_id').agg(
recent_30d_count=('transaction_date', 'count'),
recent_30d_spend=('amount', 'sum')
).reset_index()
features = features.merge(recent_stats, on='user_id', how='left')
features['recent_30d_count'] = features['recent_30d_count'].fillna(0)
features['recent_30d_spend'] = features['recent_30d_spend'].fillna(0)
# Ratio of recent to total activity
features['recent_activity_ratio'] = (
features['recent_30d_count'] /
features['total_transactions'].clip(lower=1)
)
# ---- Fill remaining NaNs ----
features['std_transaction'] = features['std_transaction'].fillna(0)
return features
Scoring:
- Strong Hire: 10+ features covering RFM, behavioral, and trend categories. No data leakage (uses reference_date). Clean merge-based pipeline.
- Lean Hire: 5-8 reasonable features but misses trend features or has potential leakage.
- No Hire: Fewer than 5 features or uses loops to compute them.
Problem 3: Funnel Analysis
Given a DataFrame of user events with user_id, event_name, timestamp, compute conversion rates through a defined funnel: page_view -> add_to_cart -> checkout -> purchase.
Hint 1 - Direction
For each user, check if they completed each stage (in order). A user who added to cart but never viewed a page should not count - events must happen in the correct sequence.
Hint 2 - Approach
Pivot the data to get the first timestamp of each event type per user. Then check if each stage's timestamp is after the previous stage's timestamp.
Hint 3 - Full Solution
def funnel_analysis(events_df, funnel_stages):
"""
Compute funnel conversion rates.
Args:
events_df: DataFrame with user_id, event_name, timestamp
funnel_stages: list of event names in order, e.g.
['page_view', 'add_to_cart', 'checkout', 'purchase']
Returns:
DataFrame with stage, users_at_stage, conversion_rate, drop_off_rate
"""
df = events_df.copy()
# Step 1: Get first occurrence of each event type per user
first_events = (
df[df['event_name'].isin(funnel_stages)]
.sort_values('timestamp')
.groupby(['user_id', 'event_name'])['timestamp']
.first()
.unstack(fill_value=pd.NaT)
)
# Step 2: Ensure events happen in order
# A user completes stage N if:
# (a) they performed stage N, AND
# (b) they completed stage N-1, AND
# (c) stage N happened after stage N-1
completed = pd.DataFrame(index=first_events.index)
completed[funnel_stages[0]] = first_events[funnel_stages[0]].notna()
for i in range(1, len(funnel_stages)):
prev_stage = funnel_stages[i - 1]
curr_stage = funnel_stages[i]
has_event = first_events[curr_stage].notna()
completed_prev = completed[prev_stage]
if curr_stage in first_events.columns:
in_order = (
first_events[curr_stage] >= first_events[prev_stage]
).fillna(False)
else:
in_order = False
completed[curr_stage] = has_event & completed_prev & in_order
# Step 3: Compute funnel metrics
total_users = len(first_events)
results = []
for stage in funnel_stages:
users_at_stage = completed[stage].sum()
conversion_rate = users_at_stage / total_users
results.append({
'stage': stage,
'users_at_stage': int(users_at_stage),
'conversion_from_top': f"{conversion_rate:.1%}"
})
results_df = pd.DataFrame(results)
# Add stage-to-stage conversion
results_df['stage_conversion'] = (
results_df['users_at_stage'] /
results_df['users_at_stage'].shift(1)
).fillna(1.0).apply(lambda x: f"{x:.1%}")
return results_df
Scoring:
- Strong Hire: Handles event ordering correctly (events must happen in sequence). Computes both overall and stage-to-stage conversion. No loops over users.
- Lean Hire: Correct logic but does not enforce event ordering.
- No Hire: Uses loops over users to check funnel completion.
Problem 4: Time Series Anomaly Detection
Given daily metrics per city, flag days where the metric deviates more than 2 standard deviations from the 30-day rolling mean for that city.
Hint 1 - Direction
Group by city, compute rolling mean and std, then flag where the absolute deviation exceeds 2 * rolling_std.
Hint 2 - Edge cases
Handle the start of each city's time series where the rolling window is not full. Use min_periods parameter.
Hint 3 - Full Solution
def detect_anomalies(df, metric_col, group_col='city',
window=30, n_std=2, min_periods=7):
"""
Detect anomalies using rolling z-score per group.
Args:
df: DataFrame with date index, group column, and metric column
metric_col: name of the metric column
group_col: column to group by
window: rolling window size
n_std: number of standard deviations for threshold
min_periods: minimum observations for rolling stats
Returns:
DataFrame with added columns: rolling_mean, rolling_std,
z_score, is_anomaly
"""
result = df.copy().sort_values([group_col, 'date'])
# Compute rolling statistics per group
grouped = result.groupby(group_col)[metric_col]
result['rolling_mean'] = grouped.transform(
lambda x: x.rolling(window, min_periods=min_periods).mean()
)
result['rolling_std'] = grouped.transform(
lambda x: x.rolling(window, min_periods=min_periods).std()
)
# Compute z-score
result['z_score'] = (
(result[metric_col] - result['rolling_mean']) /
result['rolling_std'].clip(lower=1e-8) # Avoid division by zero
)
# Flag anomalies
result['is_anomaly'] = result['z_score'].abs() > n_std
# Classify direction
result['anomaly_type'] = np.where(
~result['is_anomaly'], 'normal',
np.where(result['z_score'] > 0, 'spike', 'drop')
)
return result
Scoring:
- Strong Hire: Handles per-group rolling, min_periods, division by zero, and anomaly direction. Discusses limitations (non-stationarity, seasonality).
- Lean Hire: Correct basic approach but misses edge cases.
- No Hire: Computes rolling stats without grouping or uses global statistics.
Problem 5: Efficient Large-Scale Merge
You have a 10M-row events table and a 1M-row users table. The merge is slow. Optimize it.
Hint 1 - Direction
Check dtypes (string keys are slow), consider sorting before merge, and reduce columns before merging.
Hint 2 - Techniques
(1) Convert string IDs to categoricals or integers. (2) Select only needed columns before merge. (3) Use merge instead of join when possible. (4) Consider chunked processing.
Hint 3 - Full Solution
def optimized_merge(events_df, users_df, key='user_id', needed_user_cols=None):
"""
Optimized large-scale merge.
Args:
events_df: large DataFrame (10M+ rows)
users_df: lookup DataFrame (1M+ rows)
key: merge key
needed_user_cols: list of columns needed from users_df
Returns:
Merged DataFrame
"""
# Optimization 1: Select only needed columns
if needed_user_cols is not None:
users_subset = users_df[[key] + needed_user_cols].copy()
else:
users_subset = users_df.copy()
# Optimization 2: Ensure key dtype is efficient
# Convert string IDs to categorical
if events_df[key].dtype == 'object':
# Create shared categories
all_ids = pd.CategoricalDtype(
categories=pd.unique(
pd.concat([events_df[key], users_subset[key]])
)
)
events_df[key] = events_df[key].astype(all_ids)
users_subset[key] = users_subset[key].astype(all_ids)
# Optimization 3: Drop duplicates in lookup table
users_subset = users_subset.drop_duplicates(subset=[key])
# Optimization 4: Use merge with validate
result = pd.merge(
events_df,
users_subset,
on=key,
how='left',
validate='many_to_one' # Catch unexpected duplicates
)
return result
# Additional tip: if memory is a concern, process in chunks
def chunked_merge(events_df, users_df, key, chunk_size=1_000_000):
"""Process merge in chunks to limit peak memory."""
chunks = []
for start in range(0, len(events_df), chunk_size):
chunk = events_df.iloc[start:start + chunk_size]
merged_chunk = pd.merge(chunk, users_df, on=key, how='left')
chunks.append(merged_chunk)
return pd.concat(chunks, ignore_index=True)
Scoring:
- Strong Hire: Addresses dtype optimization, column selection, deduplication, and memory concerns. Mentions alternatives (Polars, DuckDB) for truly large datasets.
- Lean Hire: Identifies one or two optimizations but not the full picture.
- No Hire: Cannot identify why the merge is slow or only suggests "get more RAM."
Interview Cheat Sheet
| Concept | Key Pattern | One-Liner | Red Flag |
|---|---|---|---|
| Selection | df.loc[rows, cols] for labels, df.iloc for positions | loc = labels, iloc = integers | Confusing loc and iloc |
| Boolean filter | df[(cond1) & (cond2)] | Use & not and, parentheses required | Using Python and with Series |
| GroupBy agg | df.groupby(col).agg(name=(col, func)) | One row per group | Using loops over unique values |
| Transform | groupby().transform() | Same shape as input, broadcasts group stat | Not knowing transform exists |
| Merge | pd.merge(left, right, on=key, how='left') | SQL JOIN equivalent | Not specifying how (default inner) |
| Pivot | df.pivot_table(values, index, columns, aggfunc) | Long -> wide with aggregation | Confusing pivot and pivot_table |
| Rolling | df.rolling(window).mean() | Moving average with min_periods | Forgetting min_periods |
| Missing data | Fill train stats only, add indicator | Never leak test info | Computing fill values on full dataset |
| Performance | Vectorize, categoricals, select cols first | No iterrows, no row-by-row append | Building DataFrames row by row |
| Feature eng | RFM + behavioral + trend features | Group-aggregate-merge pipeline | Computing features with for-loops |
Spaced Repetition Checkpoints
Day 0 - Initial Learning
- Read this entire page and run all code examples
- Write a groupby-agg pipeline from memory
- Explain the difference between agg, transform, and apply
- Complete the self-assessment
Day 3 - First Recall
- Implement cohort retention analysis without looking at the solution
- Write 5 different merge examples (inner, left, outer, multi-key, with validation)
- Create 10 features from a transactions DataFrame from memory
Day 7 - Connections
- Solve the funnel analysis problem under timed conditions (15 min)
- Implement the anomaly detection problem without hints
- Explain 5 Pandas performance antipatterns and their fixes
Day 14 - Application
- Given a new dataset, build a complete feature engineering pipeline in 20 minutes
- Optimize a slow Pandas pipeline (identify bottlenecks, fix them)
- Explain the difference between Pandas and SQL for data manipulation
Day 21 - Mock Interview
- Have someone give you a novel dataset and ask you to compute metrics
- Solve all 5 practice problems in sequence under timed conditions (60 min)
- Explain your feature engineering approach as if presenting to an interviewer
Key Takeaways
-
GroupBy is the backbone of Pandas interviews. Know
agg()for summarization,transform()for broadcasting group statistics back to every row, andapply()for everything else. If you reach for a for-loop over groups, you have already lost. -
Merges are where data quality issues hide. Always specify
how=, check the output shape, and usevalidate=to catch unexpected duplicates. In interviews, explicitly state your join type and why. -
Feature engineering is what separates DS/MLE candidates. Anyone can compute a mean. The strong candidate computes recency, frequency, monetary value, behavioral diversity, temporal trends, and missing indicators - all without loops, all from training data only.
-
Performance matters at scale. Knowing that
iterrows()is 1000x slower than vectorized operations, that categoricals save 10x memory, and that you should select columns before merging are signals that you can handle production data.
Next Steps
Continue to SQL for ML Roles to master the query language that every data role requires - with ML-specific patterns for feature extraction, A/B test analysis, and metric computation.
