SQL for ML Roles - The Query Language Every AI Engineer Needs
Reading time: ~45 min | Interview relevance: Critical | Roles: Data Scientist, MLE, AI Eng, Analytics Engineer, Applied Scientist
The Real Interview Moment
You are in a Netflix Data Science on-site. The interviewer pulls up a SQL editor and says: "We have a table of streaming events - user_id, title_id, start_time, duration_seconds, device_type. Write a query to find users whose 28-day rolling engagement dropped by more than 40% compared to their historical average, broken down by device. You have 25 minutes."
You know this requires window functions, date arithmetic, conditional aggregation, and a CTE or subquery structure. But the clock is running. Do you start with a monolithic query and get lost in nested subqueries? Or do you build it layer by layer - first the daily aggregation, then the rolling window, then the historical baseline, then the comparison - each in a clean CTE that you can debug independently?
This is what SQL interviews for ML roles test. They are not about SELECT-FROM-WHERE basics. They are about whether you can decompose a complex analytical question into composable query layers, use window functions fluently, and translate a business metric definition into precise SQL. Every DS, MLE, and analytics role at a major tech company will test this.
Candidates who say "I would do this in Pandas" get a polite rejection - the data lives in a warehouse and your Pandas cannot handle 500M rows. Candidates who build clean, layered CTEs with correct window functions get a "strong hire."
What You Will Master
- Window functions: ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD, SUM/AVG OVER, NTILE
- CTEs: Common Table Expressions for readable, composable queries
- Self-joins: Comparing rows within the same table
- Aggregation: GROUP BY, HAVING, conditional aggregation with CASE WHEN
- Date functions: DATEDIFF, DATE_TRUNC, INTERVAL, date arithmetic
- Query optimization: indexes, EXPLAIN plans, partition pruning
- ML-specific patterns: feature extraction, A/B test analysis, cohort analysis, funnel analysis, metric computation
Self-Assessment: Where Are You Now?
| Skill | 1 - Cannot | 2 - Vaguely | 3 - Can Write | 4 - Can Optimize | 5 - Can Teach | Your Score |
|---|---|---|---|---|---|---|
| Window functions (RANK, LAG, running sum) | ___ | |||||
| CTEs for multi-step queries | ___ | |||||
| Self-joins | ___ | |||||
| Conditional aggregation (CASE WHEN) | ___ | |||||
| Date arithmetic and DATE_TRUNC | ___ | |||||
| A/B test metric queries | ___ | |||||
| Cohort retention queries | ___ | |||||
| Query optimization (indexes, EXPLAIN) | ___ |
Target: All 4s and 5s before your interview.
Part 1 - Window Functions: The Core Interview Topic
What Window Functions Do
Essential Window Functions
-- Sample table: daily_metrics
-- | user_id | metric_date | page_views | purchases |
-- |---------|-------------|------------|-----------|
-- ROW_NUMBER: unique sequential number per partition
SELECT user_id, metric_date, page_views,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY metric_date) AS day_number
FROM daily_metrics;
-- RANK vs DENSE_RANK vs ROW_NUMBER
-- Values: [100, 100, 90, 80]
-- ROW_NUMBER: 1, 2, 3, 4 (no ties)
-- RANK: 1, 1, 3, 4 (ties skip)
-- DENSE_RANK: 1, 1, 2, 3 (ties don't skip)
-- LAG / LEAD: access previous/next row
SELECT user_id, metric_date, page_views,
LAG(page_views, 1) OVER (PARTITION BY user_id ORDER BY metric_date) AS prev_day_views,
LEAD(page_views, 1) OVER (PARTITION BY user_id ORDER BY metric_date) AS next_day_views,
page_views - LAG(page_views, 1) OVER (PARTITION BY user_id ORDER BY metric_date) AS day_over_day_change
FROM daily_metrics;
-- Running sum / running average
SELECT user_id, metric_date, purchases,
SUM(purchases) OVER (
PARTITION BY user_id
ORDER BY metric_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS cumulative_purchases,
AVG(page_views) OVER (
PARTITION BY user_id
ORDER BY metric_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS rolling_7day_avg
FROM daily_metrics;
-- NTILE: divide into equal-sized buckets
SELECT user_id, total_spend,
NTILE(10) OVER (ORDER BY total_spend) AS spend_decile
FROM user_summary;
"Window functions compute a value for each row based on a set of related rows, without collapsing the result. They use OVER() to define the window - PARTITION BY groups the rows (like GROUP BY but without reducing), ORDER BY sorts within each partition, and the frame clause (ROWS BETWEEN) defines which rows to include in the computation. The most common ones in ML contexts are LAG for time-series features, ROW_NUMBER for deduplication, and running SUM/AVG for cumulative metrics."
Window Frame Specifications
| Frame Clause | Meaning | Use Case |
|---|---|---|
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW | All rows from start to current | Running/cumulative sum |
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW | Current + 6 previous rows | 7-day rolling average |
ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING | Previous, current, next | Smoothing |
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING | Entire partition | Partition-level stats |
RANGE BETWEEN INTERVAL '7' DAY PRECEDING AND CURRENT ROW | Date-based window | Calendar-aware rolling |
ROWS and RANGE are different. ROWS BETWEEN 6 PRECEDING AND CURRENT ROW counts exactly 7 rows regardless of gaps. RANGE BETWEEN INTERVAL '7' DAY PRECEDING AND CURRENT ROW includes all rows within a 7-day calendar window, which may be more or fewer than 7 rows if there are gaps or duplicates. Interviewers test this distinction. For time-series with potential missing dates, RANGE is usually correct.
Part 2 - CTEs: Building Queries in Layers
Why CTEs Matter
-- Bad: deeply nested subquery (hard to read, debug, modify)
SELECT * FROM (
SELECT * FROM (
SELECT user_id, SUM(amount) as total
FROM orders
GROUP BY user_id
) a
WHERE total > 100
) b
JOIN users u ON b.user_id = u.user_id;
-- Good: layered CTEs (each step is clear and testable)
WITH user_totals AS (
SELECT user_id,
SUM(amount) AS total_spend,
COUNT(*) AS order_count
FROM orders
GROUP BY user_id
),
high_value_users AS (
SELECT *
FROM user_totals
WHERE total_spend > 100
)
SELECT u.name, u.signup_date, h.total_spend, h.order_count
FROM high_value_users h
JOIN users u ON h.user_id = u.user_id
ORDER BY h.total_spend DESC;
Never write a query as a single monolithic block with 5 levels of nested subqueries. Interviewers evaluate your SQL code the same way they evaluate your Python code - readability and maintainability matter. CTEs show you can decompose a problem. They also make it easy for you to debug during the interview: you can run each CTE independently to verify intermediate results.
Part 3 - ML-Specific SQL Patterns
Pattern 1: Feature Extraction for ML Models
-- Build a feature table for a churn prediction model
-- One row per user with aggregated features
WITH user_activity AS (
SELECT
user_id,
COUNT(*) AS total_events,
COUNT(DISTINCT DATE_TRUNC('day', event_time)) AS active_days,
MIN(event_time) AS first_event,
MAX(event_time) AS last_event,
SUM(CASE WHEN event_type = 'purchase' THEN 1 ELSE 0 END) AS purchase_count,
SUM(CASE WHEN event_type = 'view' THEN 1 ELSE 0 END) AS view_count,
COUNT(DISTINCT CASE WHEN event_type = 'purchase' THEN product_id END) AS unique_products_purchased
FROM events
WHERE event_time >= '2024-01-01'
GROUP BY user_id
),
recency AS (
SELECT
user_id,
EXTRACT(DAY FROM CURRENT_TIMESTAMP - last_event) AS days_since_last_activity,
EXTRACT(DAY FROM last_event - first_event) AS tenure_days
FROM user_activity
),
recent_vs_historical AS (
SELECT
user_id,
-- Last 30 days
SUM(CASE WHEN event_time >= CURRENT_DATE - INTERVAL '30' DAY
THEN 1 ELSE 0 END) AS events_last_30d,
-- 31-60 days ago
SUM(CASE WHEN event_time >= CURRENT_DATE - INTERVAL '60' DAY
AND event_time < CURRENT_DATE - INTERVAL '30' DAY
THEN 1 ELSE 0 END) AS events_prev_30d
FROM events
GROUP BY user_id
)
SELECT
ua.user_id,
ua.total_events,
ua.active_days,
ua.purchase_count,
ua.view_count,
ua.unique_products_purchased,
r.days_since_last_activity,
r.tenure_days,
-- Derived features
COALESCE(ua.purchase_count * 1.0 / NULLIF(ua.view_count, 0), 0) AS conversion_rate,
COALESCE(ua.total_events * 1.0 / NULLIF(r.tenure_days, 0), 0) AS events_per_day,
rh.events_last_30d,
rh.events_prev_30d,
CASE WHEN rh.events_prev_30d > 0
THEN (rh.events_last_30d - rh.events_prev_30d) * 1.0 / rh.events_prev_30d
ELSE 0
END AS activity_trend
FROM user_activity ua
JOIN recency r ON ua.user_id = r.user_id
LEFT JOIN recent_vs_historical rh ON ua.user_id = rh.user_id;
Pattern 2: A/B Test Analysis
-- Compute the key metrics for an A/B test
-- Test: does a new recommendation algorithm increase purchases?
WITH experiment_users AS (
-- Get users and their experiment assignments
SELECT user_id, variant -- 'control' or 'treatment'
FROM experiment_assignments
WHERE experiment_id = 'rec_algo_v2'
AND assignment_date BETWEEN '2024-06-01' AND '2024-06-30'
),
user_metrics AS (
-- Compute per-user metrics during the experiment period
SELECT
eu.user_id,
eu.variant,
COALESCE(COUNT(DISTINCT o.order_id), 0) AS num_purchases,
COALESCE(SUM(o.amount), 0) AS total_revenue,
COALESCE(COUNT(DISTINCT DATE_TRUNC('day', o.order_date)), 0) AS active_days
FROM experiment_users eu
LEFT JOIN orders o
ON eu.user_id = o.user_id
AND o.order_date BETWEEN '2024-06-01' AND '2024-06-30'
GROUP BY eu.user_id, eu.variant
)
SELECT
variant,
COUNT(*) AS num_users,
-- Purchase rate
AVG(CASE WHEN num_purchases > 0 THEN 1.0 ELSE 0.0 END) AS purchase_rate,
-- Average purchases per user
AVG(num_purchases) AS avg_purchases,
-- Average revenue per user (ARPU)
AVG(total_revenue) AS arpu,
-- Revenue per purchasing user (ARPPU)
AVG(CASE WHEN num_purchases > 0 THEN total_revenue END) AS arppu,
-- Standard deviation (for significance testing)
STDDEV(total_revenue) AS revenue_stddev,
-- Percentiles
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY total_revenue) AS median_revenue,
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY total_revenue) AS p95_revenue
FROM user_metrics
GROUP BY variant
ORDER BY variant;
Netflix, Airbnb, and Uber test A/B analysis SQL heavily because experimentation is core to their product development. They want you to compute not just averages but also variance (for significance), percentiles (for understanding distribution), and conditional metrics (ARPPU vs ARPU). Google and Meta often test this in a statistics interview rather than a SQL interview. Startups rarely test A/B SQL.
Pattern 3: Cohort Retention Analysis
-- Monthly cohort retention: what % of users who signed up in month M
-- are still active N months later?
WITH user_cohorts AS (
-- Assign each user to their signup cohort
SELECT
user_id,
DATE_TRUNC('month', signup_date) AS cohort_month
FROM users
),
monthly_activity AS (
-- Get each user's active months
SELECT DISTINCT
user_id,
DATE_TRUNC('month', event_date) AS activity_month
FROM events
),
cohort_activity AS (
-- Join cohorts with activity and compute months since signup
SELECT
uc.cohort_month,
ma.activity_month,
-- Months between cohort month and activity month
EXTRACT(YEAR FROM AGE(ma.activity_month, uc.cohort_month)) * 12 +
EXTRACT(MONTH FROM AGE(ma.activity_month, uc.cohort_month)) AS months_since_signup,
COUNT(DISTINCT uc.user_id) AS active_users
FROM user_cohorts uc
JOIN monthly_activity ma ON uc.user_id = ma.user_id
WHERE ma.activity_month >= uc.cohort_month -- Only activity after signup
GROUP BY uc.cohort_month, ma.activity_month
),
cohort_sizes AS (
SELECT cohort_month, COUNT(DISTINCT user_id) AS cohort_size
FROM user_cohorts
GROUP BY cohort_month
)
SELECT
ca.cohort_month,
ca.months_since_signup,
ca.active_users,
cs.cohort_size,
ROUND(ca.active_users * 100.0 / cs.cohort_size, 1) AS retention_pct
FROM cohort_activity ca
JOIN cohort_sizes cs ON ca.cohort_month = cs.cohort_month
WHERE ca.months_since_signup <= 12 -- First year
ORDER BY ca.cohort_month, ca.months_since_signup;
Pattern 4: Funnel Analysis
-- Conversion funnel: page_view -> add_to_cart -> checkout -> purchase
-- Track how many users complete each step within a single session
WITH funnel_events AS (
SELECT
session_id,
user_id,
event_name,
event_time,
-- Assign numeric order to funnel stages
CASE event_name
WHEN 'page_view' THEN 1
WHEN 'add_to_cart' THEN 2
WHEN 'checkout' THEN 3
WHEN 'purchase' THEN 4
END AS stage_order
FROM events
WHERE event_name IN ('page_view', 'add_to_cart', 'checkout', 'purchase')
),
session_max_stage AS (
-- For each session, find the deepest funnel stage reached
SELECT
session_id,
user_id,
MAX(stage_order) AS max_stage
FROM funnel_events
GROUP BY session_id, user_id
)
SELECT
'page_view' AS stage,
COUNT(DISTINCT CASE WHEN max_stage >= 1 THEN session_id END) AS sessions,
ROUND(COUNT(DISTINCT CASE WHEN max_stage >= 1 THEN session_id END) * 100.0 /
COUNT(DISTINCT session_id), 1) AS pct_of_total
FROM session_max_stage
UNION ALL
SELECT
'add_to_cart',
COUNT(DISTINCT CASE WHEN max_stage >= 2 THEN session_id END),
ROUND(COUNT(DISTINCT CASE WHEN max_stage >= 2 THEN session_id END) * 100.0 /
COUNT(DISTINCT session_id), 1)
FROM session_max_stage
UNION ALL
SELECT
'checkout',
COUNT(DISTINCT CASE WHEN max_stage >= 3 THEN session_id END),
ROUND(COUNT(DISTINCT CASE WHEN max_stage >= 3 THEN session_id END) * 100.0 /
COUNT(DISTINCT session_id), 1)
FROM session_max_stage
UNION ALL
SELECT
'purchase',
COUNT(DISTINCT CASE WHEN max_stage >= 4 THEN session_id END),
ROUND(COUNT(DISTINCT CASE WHEN max_stage >= 4 THEN session_id END) * 100.0 /
COUNT(DISTINCT session_id), 1)
FROM session_max_stage;
Pattern 5: Rolling Metric Computation
-- Compute 7-day and 28-day rolling active users (DAU, WAU, MAU)
WITH daily_active AS (
SELECT
DATE_TRUNC('day', event_time) AS metric_date,
COUNT(DISTINCT user_id) AS dau
FROM events
GROUP BY DATE_TRUNC('day', event_time)
),
rolling_metrics AS (
SELECT
metric_date,
dau,
-- 7-day rolling unique users (WAU approximation)
SUM(dau) OVER (
ORDER BY metric_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS wau_sum, -- Note: this overcounts; see below
-- 28-day rolling average DAU
AVG(dau) OVER (
ORDER BY metric_date
ROWS BETWEEN 27 PRECEDING AND CURRENT ROW
) AS avg_dau_28d
FROM daily_active
)
SELECT * FROM rolling_metrics
ORDER BY metric_date;
-- NOTE: For true WAU (unique users in 7 days), you need:
-- This cannot be done with simple rolling windows because uniqueness
-- requires access to raw user-level data within the window.
-- True WAU query:
SELECT
d.metric_date,
COUNT(DISTINCT e.user_id) AS true_wau
FROM (
SELECT DISTINCT DATE_TRUNC('day', event_time) AS metric_date
FROM events
) d
JOIN events e
ON DATE_TRUNC('day', e.event_time)
BETWEEN d.metric_date - INTERVAL '6' DAY AND d.metric_date
GROUP BY d.metric_date;
Part 4 - Self-Joins and Advanced Patterns
Self-Join: Year-over-Year Comparison
-- Compare each month's revenue to the same month last year
WITH monthly_revenue AS (
SELECT
DATE_TRUNC('month', order_date) AS month,
SUM(amount) AS revenue
FROM orders
GROUP BY DATE_TRUNC('month', order_date)
)
SELECT
curr.month,
curr.revenue AS current_revenue,
prev.revenue AS previous_year_revenue,
ROUND((curr.revenue - prev.revenue) * 100.0 / NULLIF(prev.revenue, 0), 1) AS yoy_growth_pct
FROM monthly_revenue curr
LEFT JOIN monthly_revenue prev
ON curr.month = prev.month + INTERVAL '1 year';
Self-Join: Finding Consecutive Events
-- Find users who made purchases on 3+ consecutive days
WITH purchase_days AS (
SELECT DISTINCT
user_id,
DATE_TRUNC('day', order_date) AS purchase_date
FROM orders
),
with_prev AS (
SELECT
user_id,
purchase_date,
LAG(purchase_date, 1) OVER (PARTITION BY user_id ORDER BY purchase_date) AS prev_date,
LAG(purchase_date, 2) OVER (PARTITION BY user_id ORDER BY purchase_date) AS prev_prev_date
FROM purchase_days
)
SELECT DISTINCT user_id
FROM with_prev
WHERE purchase_date - prev_date = INTERVAL '1' DAY
AND prev_date - prev_prev_date = INTERVAL '1' DAY;
Deduplication with ROW_NUMBER
-- Keep only the most recent record per user (common data cleaning task)
WITH ranked AS (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY user_id
ORDER BY updated_at DESC
) AS rn
FROM user_profiles
)
SELECT *
FROM ranked
WHERE rn = 1;
Part 5 - Query Optimization
Key Optimization Techniques
Optimization Rules
| Rule | Bad | Good | Why |
|---|---|---|---|
| Filter early | JOIN then WHERE | WHERE then JOIN (or in CTE) | Reduces data before expensive operations |
| Use appropriate joins | Cross join then filter | Inner/left join with ON clause | Avoids Cartesian product |
| Avoid SELECT * | SELECT * FROM big_table | SELECT col1, col2 FROM big_table | Reduces I/O and memory |
| Use EXISTS vs IN | WHERE id IN (SELECT id FROM ...) | WHERE EXISTS (SELECT 1 FROM ... WHERE ...) | EXISTS short-circuits |
| Avoid functions on indexed columns | WHERE YEAR(date_col) = 2024 | WHERE date_col >= '2024-01-01' AND date_col < '2025-01-01' | Allows index usage |
| Use approximate functions | COUNT(DISTINCT user_id) | APPROX_COUNT_DISTINCT(user_id) | 100x faster for large tables |
| Partition pruning | Scan all partitions | Filter on partition key | Only reads relevant partitions |
When I ask about query optimization, I want to hear three things: (1) "I would start with EXPLAIN ANALYZE to understand the query plan." (2) "I would check for full table scans and add indexes on WHERE and JOIN columns." (3) "I would push filters as early as possible to reduce the data volume before joins and aggregations." Candidates who jump to "add more machines" without analyzing the query plan reveal they lack hands-on database experience.
Practice Problems
Problem 1: User Engagement Segments
Given a table events(user_id, event_type, event_date), classify users into engagement segments based on their activity in the last 90 days:
- Power users: 50+ events
- Active users: 10-49 events
- Casual users: 1-9 events
- Churned: 0 events (had activity before but not in last 90 days)
Hint 1 - Direction
You need to consider ALL users, including those with zero recent events. Use a LEFT JOIN from all known users to their recent activity.
Hint 2 - Structure
CTE 1: Get all users who have ever been active. CTE 2: Count events in the last 90 days. Main query: CASE WHEN to classify.
Hint 3 - Full Solution
WITH all_users AS (
-- All users who have ever been active
SELECT DISTINCT user_id
FROM events
),
recent_activity AS (
-- Event count in last 90 days
SELECT
user_id,
COUNT(*) AS event_count
FROM events
WHERE event_date >= CURRENT_DATE - INTERVAL '90' DAY
GROUP BY user_id
)
SELECT
au.user_id,
COALESCE(ra.event_count, 0) AS recent_events,
CASE
WHEN COALESCE(ra.event_count, 0) >= 50 THEN 'power_user'
WHEN COALESCE(ra.event_count, 0) >= 10 THEN 'active_user'
WHEN COALESCE(ra.event_count, 0) >= 1 THEN 'casual_user'
ELSE 'churned'
END AS segment
FROM all_users au
LEFT JOIN recent_activity ra ON au.user_id = ra.user_id;
-- Summary statistics
-- Wrap the above in another CTE and do:
-- SELECT segment, COUNT(*) as user_count,
-- ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER(), 1) AS pct
-- FROM user_segments
-- GROUP BY segment;
Scoring:
- Strong Hire: Uses LEFT JOIN to include churned users. COALESCE for null handling. Can extend with additional dimensions (segment by device, by cohort).
- Lean Hire: Correct segments but misses churned users (uses INNER JOIN).
- No Hire: Cannot structure the query or uses non-standard syntax.
Problem 2: Session Computation
Events have no session_id. Define a session as a sequence of events by the same user where no two consecutive events are more than 30 minutes apart. Compute session_id, session duration, and events per session.
Hint 1 - Direction
Use LAG to compute time gap between consecutive events per user. A gap > 30 minutes starts a new session. Use cumulative SUM of new-session flags to assign session IDs.
Hint 2 - The cumsum trick
Create a is_new_session flag (1 when gap > 30 min or first event). Then SUM(is_new_session) OVER (PARTITION BY user_id ORDER BY event_time) gives the session number.
Hint 3 - Full Solution
WITH ordered_events AS (
SELECT
user_id,
event_time,
event_type,
LAG(event_time) OVER (
PARTITION BY user_id ORDER BY event_time
) AS prev_event_time
FROM events
),
session_flags AS (
SELECT
*,
CASE
WHEN prev_event_time IS NULL THEN 1 -- First event
WHEN EXTRACT(EPOCH FROM (event_time - prev_event_time)) > 1800 THEN 1 -- 30 min gap
ELSE 0
END AS is_new_session
FROM ordered_events
),
sessionized AS (
SELECT
*,
SUM(is_new_session) OVER (
PARTITION BY user_id
ORDER BY event_time
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS session_id
FROM session_flags
)
SELECT
user_id,
session_id,
COUNT(*) AS events_in_session,
MIN(event_time) AS session_start,
MAX(event_time) AS session_end,
EXTRACT(EPOCH FROM (MAX(event_time) - MIN(event_time))) / 60.0 AS duration_minutes
FROM sessionized
GROUP BY user_id, session_id;
Scoring:
- Strong Hire: Clean LAG + cumsum pattern. Correct gap threshold handling. Can discuss why 30 minutes is a common choice and how to tune it.
- Lean Hire: Correct approach but syntax errors in the window function or time arithmetic.
- No Hire: Cannot conceptualize the sessionization without loops.
Problem 3: Top-K per Group
Find the top 3 products by revenue in each category, including ties.
Hint 1 - Direction
Use DENSE_RANK (not ROW_NUMBER) to handle ties correctly.
Hint 2 - Structure
CTE: Compute per-product revenue. Window: DENSE_RANK partitioned by category, ordered by revenue DESC. Filter: WHERE rank <= 3.
Hint 3 - Full Solution
WITH product_revenue AS (
SELECT
category,
product_id,
product_name,
SUM(revenue) AS total_revenue
FROM sales
GROUP BY category, product_id, product_name
),
ranked AS (
SELECT
*,
DENSE_RANK() OVER (
PARTITION BY category
ORDER BY total_revenue DESC
) AS revenue_rank
FROM product_revenue
)
SELECT category, product_id, product_name, total_revenue, revenue_rank
FROM ranked
WHERE revenue_rank <= 3
ORDER BY category, revenue_rank;
Key distinction:
ROW_NUMBER(): If two products tie, one gets rank 2 and the other rank 3. You may miss a tied product.RANK(): Both get rank 2, next gets rank 4. Top-3 filter may return fewer than 3 if there is a tie at rank 3.DENSE_RANK(): Both get rank 2, next gets rank 3. Top-3 filter includes all products ranked 1, 2, or 3.
Scoring:
- Strong Hire: Uses DENSE_RANK and explains why. Discusses the difference between the three ranking functions.
- Lean Hire: Uses ROW_NUMBER (correct but does not handle ties).
- No Hire: Cannot write a ranking window function.
Problem 4: A/B Test with Pre-Experiment Validation
Write a query that validates an A/B test for sample ratio mismatch (SRM) and computes the treatment effect with confidence intervals.
Hint 1 - Direction
First check that control and treatment groups are balanced (similar sizes and similar pre-experiment metrics). Then compute the metric difference and standard error.
Hint 2 - Statistics in SQL
For a difference in means: SE = sqrt(var1/n1 + var2/n2). A 95% CI is (diff - 1.96SE, diff + 1.96SE). If the CI excludes 0, the result is significant.
Hint 3 - Full Solution
WITH experiment_data AS (
SELECT
ea.user_id,
ea.variant,
COALESCE(SUM(o.amount), 0) AS total_spend,
COUNT(DISTINCT o.order_id) AS num_orders
FROM experiment_assignments ea
LEFT JOIN orders o
ON ea.user_id = o.user_id
AND o.order_date BETWEEN ea.assignment_date
AND ea.assignment_date + INTERVAL '14' DAY
WHERE ea.experiment_id = 'checkout_redesign'
GROUP BY ea.user_id, ea.variant
),
group_stats AS (
SELECT
variant,
COUNT(*) AS n,
AVG(total_spend) AS mean_spend,
VARIANCE(total_spend) AS var_spend,
AVG(num_orders * 1.0) AS mean_orders
FROM experiment_data
GROUP BY variant
),
srm_check AS (
-- Sample Ratio Mismatch: are groups balanced?
SELECT
MAX(CASE WHEN variant = 'control' THEN n END) AS n_control,
MAX(CASE WHEN variant = 'treatment' THEN n END) AS n_treatment,
MAX(CASE WHEN variant = 'control' THEN n END) * 1.0 /
NULLIF(MAX(CASE WHEN variant = 'treatment' THEN n END), 0) AS ratio
FROM group_stats
),
effect_size AS (
SELECT
-- Revenue per user
MAX(CASE WHEN variant = 'treatment' THEN mean_spend END) -
MAX(CASE WHEN variant = 'control' THEN mean_spend END) AS spend_diff,
-- Standard error of the difference
SQRT(
MAX(CASE WHEN variant = 'control' THEN var_spend / n END) +
MAX(CASE WHEN variant = 'treatment' THEN var_spend / n END)
) AS se_diff
FROM group_stats
)
SELECT
sc.n_control,
sc.n_treatment,
sc.ratio AS sample_ratio,
CASE WHEN ABS(sc.ratio - 1.0) > 0.05 THEN 'WARNING: SRM detected' ELSE 'OK' END AS srm_status,
ROUND(es.spend_diff, 2) AS revenue_lift,
ROUND(es.spend_diff - 1.96 * es.se_diff, 2) AS ci_lower,
ROUND(es.spend_diff + 1.96 * es.se_diff, 2) AS ci_upper,
CASE
WHEN es.spend_diff - 1.96 * es.se_diff > 0 THEN 'Significant positive'
WHEN es.spend_diff + 1.96 * es.se_diff < 0 THEN 'Significant negative'
ELSE 'Not significant'
END AS significance
FROM srm_check sc, effect_size es;
Scoring:
- Strong Hire: Includes SRM check, correct CI computation, LEFT JOIN for users with zero purchases. Discusses assumptions (normality via CLT for large n).
- Lean Hire: Computes averages but no confidence interval or significance check.
- No Hire: Cannot structure the A/B test query.
Problem 5: Customer Lifetime Value (LTV) Prediction Features
Build a feature table for predicting 12-month LTV for users who signed up in the last 3 months, using only data from their first 30 days.
Hint 1 - Direction
This is a data leakage problem disguised as a feature engineering problem. You must restrict features to the first 30 days only.
Hint 2 - Key features
First-30-day features: spend, frequency, recency within window, category diversity, refund rate, device type, time-of-day patterns.
Hint 3 - Full Solution
WITH new_users AS (
SELECT user_id, signup_date
FROM users
WHERE signup_date >= CURRENT_DATE - INTERVAL '90' DAY
),
first_30d_orders AS (
-- Only orders within first 30 days of signup (no leakage)
SELECT
nu.user_id,
nu.signup_date,
o.order_id,
o.order_date,
o.amount,
o.category,
o.is_refund
FROM new_users nu
JOIN orders o ON nu.user_id = o.user_id
WHERE o.order_date BETWEEN nu.signup_date
AND nu.signup_date + INTERVAL '30' DAY
),
features AS (
SELECT
user_id,
signup_date,
-- Monetary
COALESCE(SUM(amount), 0) AS first_30d_revenue,
COALESCE(AVG(amount), 0) AS avg_order_value,
COALESCE(MAX(amount), 0) AS max_order_value,
-- Frequency
COUNT(DISTINCT order_id) AS first_30d_orders,
COUNT(DISTINCT DATE_TRUNC('day', order_date)) AS active_days,
-- Recency within window
EXTRACT(DAY FROM
MAX(order_date) - signup_date
) AS days_to_last_order,
EXTRACT(DAY FROM
MIN(order_date) - signup_date
) AS days_to_first_order,
-- Behavioral
COUNT(DISTINCT category) AS unique_categories,
SUM(CASE WHEN is_refund THEN 1 ELSE 0 END) * 1.0 /
NULLIF(COUNT(*), 0) AS refund_rate,
-- Temporal patterns
COUNT(DISTINCT EXTRACT(DOW FROM order_date)) AS unique_dow,
SUM(CASE WHEN EXTRACT(DOW FROM order_date) IN (0, 6) THEN 1 ELSE 0 END) * 1.0 /
NULLIF(COUNT(*), 0) AS weekend_order_ratio
FROM first_30d_orders
GROUP BY user_id, signup_date
),
ltv_labels AS (
-- 12-month LTV as the target variable
SELECT
nu.user_id,
COALESCE(SUM(o.amount), 0) AS ltv_12m
FROM new_users nu
LEFT JOIN orders o ON nu.user_id = o.user_id
AND o.order_date BETWEEN nu.signup_date
AND nu.signup_date + INTERVAL '365' DAY
GROUP BY nu.user_id
)
SELECT
f.*,
l.ltv_12m
FROM features f
JOIN ltv_labels l ON f.user_id = l.user_id;
Scoring:
- Strong Hire: Correctly restricts features to first 30 days. Explains why this prevents data leakage. Mentions that this approach has a survivorship bias issue (only new users who made at least one purchase in 30 days appear in features - should LEFT JOIN from new_users). Discusses feature importance for LTV.
- Lean Hire: Computes reasonable features but does not carefully restrict the time window.
- No Hire: Uses all-time data for features (data leakage).
Interview Cheat Sheet
| Concept | Key Pattern | One-Liner | Red Flag |
|---|---|---|---|
| Window function | FUNC() OVER (PARTITION BY ... ORDER BY ...) | Compute per-row stats without reducing rows | Confusing window functions with GROUP BY |
| ROW_NUMBER vs RANK | ROW_NUMBER has no ties; RANK/DENSE_RANK do | Choose based on tie handling needs | Using ROW_NUMBER when ties matter |
| LAG / LEAD | LAG(col, n) OVER (PARTITION BY ... ORDER BY ...) | Access previous/next row value | Not specifying PARTITION BY (global lag) |
| Running sum | SUM() OVER (ORDER BY ... ROWS UNBOUNDED PRECEDING) | Cumulative aggregation | Forgetting the frame clause |
| CTE | WITH cte AS (SELECT ...) SELECT ... FROM cte | Readable, composable query layers | 5+ levels of nested subqueries |
| Self-join | Join table to itself with offset | YoY comparison, consecutive events | Not recognizing when self-join is needed |
| A/B test | Control vs treatment with LEFT JOIN | Include users with zero events | INNER JOIN drops zero-conversion users |
| Cohort analysis | DATE_TRUNC + months_since_signup | Track retention by signup cohort | Not using DISTINCT for user counts |
| Sessionization | LAG + cumsum of gap flags | Group events into sessions | Cannot implement without procedural code |
| Optimization | Filter early, index JOIN/WHERE columns | Start with EXPLAIN ANALYZE | Applying function to indexed column |
Spaced Repetition Checkpoints
Day 0 - Initial Learning
- Read this entire page and trace through each query mentally
- Write a window function query using ROW_NUMBER, LAG, and running SUM
- Build a cohort retention query from scratch
- Complete the self-assessment
Day 3 - First Recall
- Write the sessionization query from memory (LAG + cumsum pattern)
- Build an A/B test analysis query with SRM check and confidence intervals
- Explain ROWS vs RANGE frame clauses with examples
Day 7 - Connections
- Solve the user engagement segments problem without hints (15 min)
- Write a funnel analysis query for a 5-step funnel
- Explain 5 query optimization techniques
Day 14 - Application
- Build a complete LTV feature extraction query under timed conditions (25 min)
- Write a query to detect anomalies in daily metrics (rolling z-score)
- Optimize a slow query given an EXPLAIN plan
Day 21 - Mock Interview
- Have someone give you a table schema and business question - write the query in 20 minutes
- Solve all practice problems in sequence (90 min total)
- Explain your query design decisions as if presenting to an interviewer
Key Takeaways
-
Window functions are the most tested SQL concept in ML interviews. LAG for time-series features, ROW_NUMBER for deduplication, running SUM for cumulative metrics, and RANK for top-k queries cover 80% of interview questions.
-
CTEs are non-negotiable. Every complex query should be built as a sequence of CTEs, each solving one sub-problem. This is not just about readability - it is about debuggability during a timed interview.
-
ML-specific SQL is about preventing data leakage. Feature extraction queries must respect temporal boundaries (only use data available at prediction time). A/B test queries must use LEFT JOINs to include users with zero conversions. LTV queries must restrict features to the observation window.
-
Optimization questions test systems thinking. Knowing that you should filter before joining, avoid functions on indexed columns, and use EXPLAIN ANALYZE demonstrates that you understand what happens under the hood - critical for ML roles that process large datasets.
Next Steps
Continue to Python ML Coding to implement ML algorithms from scratch - the definitive test of whether you understand how models actually work.
