Skip to main content

SQL for ML Roles - The Query Language Every AI Engineer Needs

Reading time: ~45 min | Interview relevance: Critical | Roles: Data Scientist, MLE, AI Eng, Analytics Engineer, Applied Scientist

The Real Interview Moment

You are in a Netflix Data Science on-site. The interviewer pulls up a SQL editor and says: "We have a table of streaming events - user_id, title_id, start_time, duration_seconds, device_type. Write a query to find users whose 28-day rolling engagement dropped by more than 40% compared to their historical average, broken down by device. You have 25 minutes."

You know this requires window functions, date arithmetic, conditional aggregation, and a CTE or subquery structure. But the clock is running. Do you start with a monolithic query and get lost in nested subqueries? Or do you build it layer by layer - first the daily aggregation, then the rolling window, then the historical baseline, then the comparison - each in a clean CTE that you can debug independently?

This is what SQL interviews for ML roles test. They are not about SELECT-FROM-WHERE basics. They are about whether you can decompose a complex analytical question into composable query layers, use window functions fluently, and translate a business metric definition into precise SQL. Every DS, MLE, and analytics role at a major tech company will test this.

Candidates who say "I would do this in Pandas" get a polite rejection - the data lives in a warehouse and your Pandas cannot handle 500M rows. Candidates who build clean, layered CTEs with correct window functions get a "strong hire."

What You Will Master

  • Window functions: ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD, SUM/AVG OVER, NTILE
  • CTEs: Common Table Expressions for readable, composable queries
  • Self-joins: Comparing rows within the same table
  • Aggregation: GROUP BY, HAVING, conditional aggregation with CASE WHEN
  • Date functions: DATEDIFF, DATE_TRUNC, INTERVAL, date arithmetic
  • Query optimization: indexes, EXPLAIN plans, partition pruning
  • ML-specific patterns: feature extraction, A/B test analysis, cohort analysis, funnel analysis, metric computation

Self-Assessment: Where Are You Now?

Skill1 - Cannot2 - Vaguely3 - Can Write4 - Can Optimize5 - Can TeachYour Score
Window functions (RANK, LAG, running sum)___
CTEs for multi-step queries___
Self-joins___
Conditional aggregation (CASE WHEN)___
Date arithmetic and DATE_TRUNC___
A/B test metric queries___
Cohort retention queries___
Query optimization (indexes, EXPLAIN)___

Target: All 4s and 5s before your interview.

Part 1 - Window Functions: The Core Interview Topic

What Window Functions Do

GROUP BY vs Window Functions

Essential Window Functions

-- Sample table: daily_metrics
-- | user_id | metric_date | page_views | purchases |
-- |---------|-------------|------------|-----------|

-- ROW_NUMBER: unique sequential number per partition
SELECT user_id, metric_date, page_views,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY metric_date) AS day_number
FROM daily_metrics;

-- RANK vs DENSE_RANK vs ROW_NUMBER
-- Values: [100, 100, 90, 80]
-- ROW_NUMBER: 1, 2, 3, 4 (no ties)
-- RANK: 1, 1, 3, 4 (ties skip)
-- DENSE_RANK: 1, 1, 2, 3 (ties don't skip)

-- LAG / LEAD: access previous/next row
SELECT user_id, metric_date, page_views,
LAG(page_views, 1) OVER (PARTITION BY user_id ORDER BY metric_date) AS prev_day_views,
LEAD(page_views, 1) OVER (PARTITION BY user_id ORDER BY metric_date) AS next_day_views,
page_views - LAG(page_views, 1) OVER (PARTITION BY user_id ORDER BY metric_date) AS day_over_day_change
FROM daily_metrics;

-- Running sum / running average
SELECT user_id, metric_date, purchases,
SUM(purchases) OVER (
PARTITION BY user_id
ORDER BY metric_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS cumulative_purchases,
AVG(page_views) OVER (
PARTITION BY user_id
ORDER BY metric_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS rolling_7day_avg
FROM daily_metrics;

-- NTILE: divide into equal-sized buckets
SELECT user_id, total_spend,
NTILE(10) OVER (ORDER BY total_spend) AS spend_decile
FROM user_summary;
60-Second Answer

"Window functions compute a value for each row based on a set of related rows, without collapsing the result. They use OVER() to define the window - PARTITION BY groups the rows (like GROUP BY but without reducing), ORDER BY sorts within each partition, and the frame clause (ROWS BETWEEN) defines which rows to include in the computation. The most common ones in ML contexts are LAG for time-series features, ROW_NUMBER for deduplication, and running SUM/AVG for cumulative metrics."

Window Frame Specifications

Frame ClauseMeaningUse Case
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROWAll rows from start to currentRunning/cumulative sum
ROWS BETWEEN 6 PRECEDING AND CURRENT ROWCurrent + 6 previous rows7-day rolling average
ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWINGPrevious, current, nextSmoothing
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWINGEntire partitionPartition-level stats
RANGE BETWEEN INTERVAL '7' DAY PRECEDING AND CURRENT ROWDate-based windowCalendar-aware rolling
Common Trap

ROWS and RANGE are different. ROWS BETWEEN 6 PRECEDING AND CURRENT ROW counts exactly 7 rows regardless of gaps. RANGE BETWEEN INTERVAL '7' DAY PRECEDING AND CURRENT ROW includes all rows within a 7-day calendar window, which may be more or fewer than 7 rows if there are gaps or duplicates. Interviewers test this distinction. For time-series with potential missing dates, RANGE is usually correct.

Part 2 - CTEs: Building Queries in Layers

Why CTEs Matter

-- Bad: deeply nested subquery (hard to read, debug, modify)
SELECT * FROM (
SELECT * FROM (
SELECT user_id, SUM(amount) as total
FROM orders
GROUP BY user_id
) a
WHERE total > 100
) b
JOIN users u ON b.user_id = u.user_id;

-- Good: layered CTEs (each step is clear and testable)
WITH user_totals AS (
SELECT user_id,
SUM(amount) AS total_spend,
COUNT(*) AS order_count
FROM orders
GROUP BY user_id
),
high_value_users AS (
SELECT *
FROM user_totals
WHERE total_spend > 100
)
SELECT u.name, u.signup_date, h.total_spend, h.order_count
FROM high_value_users h
JOIN users u ON h.user_id = u.user_id
ORDER BY h.total_spend DESC;
Instant Rejection

Never write a query as a single monolithic block with 5 levels of nested subqueries. Interviewers evaluate your SQL code the same way they evaluate your Python code - readability and maintainability matter. CTEs show you can decompose a problem. They also make it easy for you to debug during the interview: you can run each CTE independently to verify intermediate results.

Part 3 - ML-Specific SQL Patterns

Pattern 1: Feature Extraction for ML Models

-- Build a feature table for a churn prediction model
-- One row per user with aggregated features

WITH user_activity AS (
SELECT
user_id,
COUNT(*) AS total_events,
COUNT(DISTINCT DATE_TRUNC('day', event_time)) AS active_days,
MIN(event_time) AS first_event,
MAX(event_time) AS last_event,
SUM(CASE WHEN event_type = 'purchase' THEN 1 ELSE 0 END) AS purchase_count,
SUM(CASE WHEN event_type = 'view' THEN 1 ELSE 0 END) AS view_count,
COUNT(DISTINCT CASE WHEN event_type = 'purchase' THEN product_id END) AS unique_products_purchased
FROM events
WHERE event_time >= '2024-01-01'
GROUP BY user_id
),
recency AS (
SELECT
user_id,
EXTRACT(DAY FROM CURRENT_TIMESTAMP - last_event) AS days_since_last_activity,
EXTRACT(DAY FROM last_event - first_event) AS tenure_days
FROM user_activity
),
recent_vs_historical AS (
SELECT
user_id,
-- Last 30 days
SUM(CASE WHEN event_time >= CURRENT_DATE - INTERVAL '30' DAY
THEN 1 ELSE 0 END) AS events_last_30d,
-- 31-60 days ago
SUM(CASE WHEN event_time >= CURRENT_DATE - INTERVAL '60' DAY
AND event_time < CURRENT_DATE - INTERVAL '30' DAY
THEN 1 ELSE 0 END) AS events_prev_30d
FROM events
GROUP BY user_id
)
SELECT
ua.user_id,
ua.total_events,
ua.active_days,
ua.purchase_count,
ua.view_count,
ua.unique_products_purchased,
r.days_since_last_activity,
r.tenure_days,
-- Derived features
COALESCE(ua.purchase_count * 1.0 / NULLIF(ua.view_count, 0), 0) AS conversion_rate,
COALESCE(ua.total_events * 1.0 / NULLIF(r.tenure_days, 0), 0) AS events_per_day,
rh.events_last_30d,
rh.events_prev_30d,
CASE WHEN rh.events_prev_30d > 0
THEN (rh.events_last_30d - rh.events_prev_30d) * 1.0 / rh.events_prev_30d
ELSE 0
END AS activity_trend
FROM user_activity ua
JOIN recency r ON ua.user_id = r.user_id
LEFT JOIN recent_vs_historical rh ON ua.user_id = rh.user_id;

Pattern 2: A/B Test Analysis

-- Compute the key metrics for an A/B test
-- Test: does a new recommendation algorithm increase purchases?

WITH experiment_users AS (
-- Get users and their experiment assignments
SELECT user_id, variant -- 'control' or 'treatment'
FROM experiment_assignments
WHERE experiment_id = 'rec_algo_v2'
AND assignment_date BETWEEN '2024-06-01' AND '2024-06-30'
),
user_metrics AS (
-- Compute per-user metrics during the experiment period
SELECT
eu.user_id,
eu.variant,
COALESCE(COUNT(DISTINCT o.order_id), 0) AS num_purchases,
COALESCE(SUM(o.amount), 0) AS total_revenue,
COALESCE(COUNT(DISTINCT DATE_TRUNC('day', o.order_date)), 0) AS active_days
FROM experiment_users eu
LEFT JOIN orders o
ON eu.user_id = o.user_id
AND o.order_date BETWEEN '2024-06-01' AND '2024-06-30'
GROUP BY eu.user_id, eu.variant
)
SELECT
variant,
COUNT(*) AS num_users,
-- Purchase rate
AVG(CASE WHEN num_purchases > 0 THEN 1.0 ELSE 0.0 END) AS purchase_rate,
-- Average purchases per user
AVG(num_purchases) AS avg_purchases,
-- Average revenue per user (ARPU)
AVG(total_revenue) AS arpu,
-- Revenue per purchasing user (ARPPU)
AVG(CASE WHEN num_purchases > 0 THEN total_revenue END) AS arppu,
-- Standard deviation (for significance testing)
STDDEV(total_revenue) AS revenue_stddev,
-- Percentiles
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY total_revenue) AS median_revenue,
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY total_revenue) AS p95_revenue
FROM user_metrics
GROUP BY variant
ORDER BY variant;
Company Variation

Netflix, Airbnb, and Uber test A/B analysis SQL heavily because experimentation is core to their product development. They want you to compute not just averages but also variance (for significance), percentiles (for understanding distribution), and conditional metrics (ARPPU vs ARPU). Google and Meta often test this in a statistics interview rather than a SQL interview. Startups rarely test A/B SQL.

Pattern 3: Cohort Retention Analysis

-- Monthly cohort retention: what % of users who signed up in month M
-- are still active N months later?

WITH user_cohorts AS (
-- Assign each user to their signup cohort
SELECT
user_id,
DATE_TRUNC('month', signup_date) AS cohort_month
FROM users
),
monthly_activity AS (
-- Get each user's active months
SELECT DISTINCT
user_id,
DATE_TRUNC('month', event_date) AS activity_month
FROM events
),
cohort_activity AS (
-- Join cohorts with activity and compute months since signup
SELECT
uc.cohort_month,
ma.activity_month,
-- Months between cohort month and activity month
EXTRACT(YEAR FROM AGE(ma.activity_month, uc.cohort_month)) * 12 +
EXTRACT(MONTH FROM AGE(ma.activity_month, uc.cohort_month)) AS months_since_signup,
COUNT(DISTINCT uc.user_id) AS active_users
FROM user_cohorts uc
JOIN monthly_activity ma ON uc.user_id = ma.user_id
WHERE ma.activity_month >= uc.cohort_month -- Only activity after signup
GROUP BY uc.cohort_month, ma.activity_month
),
cohort_sizes AS (
SELECT cohort_month, COUNT(DISTINCT user_id) AS cohort_size
FROM user_cohorts
GROUP BY cohort_month
)
SELECT
ca.cohort_month,
ca.months_since_signup,
ca.active_users,
cs.cohort_size,
ROUND(ca.active_users * 100.0 / cs.cohort_size, 1) AS retention_pct
FROM cohort_activity ca
JOIN cohort_sizes cs ON ca.cohort_month = cs.cohort_month
WHERE ca.months_since_signup <= 12 -- First year
ORDER BY ca.cohort_month, ca.months_since_signup;

Pattern 4: Funnel Analysis

-- Conversion funnel: page_view -> add_to_cart -> checkout -> purchase
-- Track how many users complete each step within a single session

WITH funnel_events AS (
SELECT
session_id,
user_id,
event_name,
event_time,
-- Assign numeric order to funnel stages
CASE event_name
WHEN 'page_view' THEN 1
WHEN 'add_to_cart' THEN 2
WHEN 'checkout' THEN 3
WHEN 'purchase' THEN 4
END AS stage_order
FROM events
WHERE event_name IN ('page_view', 'add_to_cart', 'checkout', 'purchase')
),
session_max_stage AS (
-- For each session, find the deepest funnel stage reached
SELECT
session_id,
user_id,
MAX(stage_order) AS max_stage
FROM funnel_events
GROUP BY session_id, user_id
)
SELECT
'page_view' AS stage,
COUNT(DISTINCT CASE WHEN max_stage >= 1 THEN session_id END) AS sessions,
ROUND(COUNT(DISTINCT CASE WHEN max_stage >= 1 THEN session_id END) * 100.0 /
COUNT(DISTINCT session_id), 1) AS pct_of_total

FROM session_max_stage

UNION ALL

SELECT
'add_to_cart',
COUNT(DISTINCT CASE WHEN max_stage >= 2 THEN session_id END),
ROUND(COUNT(DISTINCT CASE WHEN max_stage >= 2 THEN session_id END) * 100.0 /
COUNT(DISTINCT session_id), 1)
FROM session_max_stage

UNION ALL

SELECT
'checkout',
COUNT(DISTINCT CASE WHEN max_stage >= 3 THEN session_id END),
ROUND(COUNT(DISTINCT CASE WHEN max_stage >= 3 THEN session_id END) * 100.0 /
COUNT(DISTINCT session_id), 1)
FROM session_max_stage

UNION ALL

SELECT
'purchase',
COUNT(DISTINCT CASE WHEN max_stage >= 4 THEN session_id END),
ROUND(COUNT(DISTINCT CASE WHEN max_stage >= 4 THEN session_id END) * 100.0 /
COUNT(DISTINCT session_id), 1)
FROM session_max_stage;

Pattern 5: Rolling Metric Computation

-- Compute 7-day and 28-day rolling active users (DAU, WAU, MAU)

WITH daily_active AS (
SELECT
DATE_TRUNC('day', event_time) AS metric_date,
COUNT(DISTINCT user_id) AS dau
FROM events
GROUP BY DATE_TRUNC('day', event_time)
),
rolling_metrics AS (
SELECT
metric_date,
dau,
-- 7-day rolling unique users (WAU approximation)
SUM(dau) OVER (
ORDER BY metric_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS wau_sum, -- Note: this overcounts; see below
-- 28-day rolling average DAU
AVG(dau) OVER (
ORDER BY metric_date
ROWS BETWEEN 27 PRECEDING AND CURRENT ROW
) AS avg_dau_28d
FROM daily_active
)
SELECT * FROM rolling_metrics
ORDER BY metric_date;

-- NOTE: For true WAU (unique users in 7 days), you need:
-- This cannot be done with simple rolling windows because uniqueness
-- requires access to raw user-level data within the window.

-- True WAU query:
SELECT
d.metric_date,
COUNT(DISTINCT e.user_id) AS true_wau
FROM (
SELECT DISTINCT DATE_TRUNC('day', event_time) AS metric_date
FROM events
) d
JOIN events e
ON DATE_TRUNC('day', e.event_time)
BETWEEN d.metric_date - INTERVAL '6' DAY AND d.metric_date
GROUP BY d.metric_date;

Part 4 - Self-Joins and Advanced Patterns

Self-Join: Year-over-Year Comparison

-- Compare each month's revenue to the same month last year

WITH monthly_revenue AS (
SELECT
DATE_TRUNC('month', order_date) AS month,
SUM(amount) AS revenue
FROM orders
GROUP BY DATE_TRUNC('month', order_date)
)
SELECT
curr.month,
curr.revenue AS current_revenue,
prev.revenue AS previous_year_revenue,
ROUND((curr.revenue - prev.revenue) * 100.0 / NULLIF(prev.revenue, 0), 1) AS yoy_growth_pct
FROM monthly_revenue curr
LEFT JOIN monthly_revenue prev
ON curr.month = prev.month + INTERVAL '1 year';

Self-Join: Finding Consecutive Events

-- Find users who made purchases on 3+ consecutive days

WITH purchase_days AS (
SELECT DISTINCT
user_id,
DATE_TRUNC('day', order_date) AS purchase_date
FROM orders
),
with_prev AS (
SELECT
user_id,
purchase_date,
LAG(purchase_date, 1) OVER (PARTITION BY user_id ORDER BY purchase_date) AS prev_date,
LAG(purchase_date, 2) OVER (PARTITION BY user_id ORDER BY purchase_date) AS prev_prev_date
FROM purchase_days
)
SELECT DISTINCT user_id
FROM with_prev
WHERE purchase_date - prev_date = INTERVAL '1' DAY
AND prev_date - prev_prev_date = INTERVAL '1' DAY;

Deduplication with ROW_NUMBER

-- Keep only the most recent record per user (common data cleaning task)

WITH ranked AS (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY user_id
ORDER BY updated_at DESC
) AS rn
FROM user_profiles
)
SELECT *
FROM ranked
WHERE rn = 1;

Part 5 - Query Optimization

Key Optimization Techniques

SQL Query Optimization Flow

Optimization Rules

RuleBadGoodWhy
Filter earlyJOIN then WHEREWHERE then JOIN (or in CTE)Reduces data before expensive operations
Use appropriate joinsCross join then filterInner/left join with ON clauseAvoids Cartesian product
Avoid SELECT *SELECT * FROM big_tableSELECT col1, col2 FROM big_tableReduces I/O and memory
Use EXISTS vs INWHERE id IN (SELECT id FROM ...)WHERE EXISTS (SELECT 1 FROM ... WHERE ...)EXISTS short-circuits
Avoid functions on indexed columnsWHERE YEAR(date_col) = 2024WHERE date_col >= '2024-01-01' AND date_col < '2025-01-01'Allows index usage
Use approximate functionsCOUNT(DISTINCT user_id)APPROX_COUNT_DISTINCT(user_id)100x faster for large tables
Partition pruningScan all partitionsFilter on partition keyOnly reads relevant partitions
Interviewer's Perspective

When I ask about query optimization, I want to hear three things: (1) "I would start with EXPLAIN ANALYZE to understand the query plan." (2) "I would check for full table scans and add indexes on WHERE and JOIN columns." (3) "I would push filters as early as possible to reduce the data volume before joins and aggregations." Candidates who jump to "add more machines" without analyzing the query plan reveal they lack hands-on database experience.

Practice Problems

Problem 1: User Engagement Segments

Given a table events(user_id, event_type, event_date), classify users into engagement segments based on their activity in the last 90 days:

  • Power users: 50+ events
  • Active users: 10-49 events
  • Casual users: 1-9 events
  • Churned: 0 events (had activity before but not in last 90 days)
Hint 1 - Direction

You need to consider ALL users, including those with zero recent events. Use a LEFT JOIN from all known users to their recent activity.

Hint 2 - Structure

CTE 1: Get all users who have ever been active. CTE 2: Count events in the last 90 days. Main query: CASE WHEN to classify.

Hint 3 - Full Solution
WITH all_users AS (
-- All users who have ever been active
SELECT DISTINCT user_id
FROM events
),
recent_activity AS (
-- Event count in last 90 days
SELECT
user_id,
COUNT(*) AS event_count
FROM events
WHERE event_date >= CURRENT_DATE - INTERVAL '90' DAY
GROUP BY user_id
)
SELECT
au.user_id,
COALESCE(ra.event_count, 0) AS recent_events,
CASE
WHEN COALESCE(ra.event_count, 0) >= 50 THEN 'power_user'
WHEN COALESCE(ra.event_count, 0) >= 10 THEN 'active_user'
WHEN COALESCE(ra.event_count, 0) >= 1 THEN 'casual_user'
ELSE 'churned'
END AS segment
FROM all_users au
LEFT JOIN recent_activity ra ON au.user_id = ra.user_id;

-- Summary statistics
-- Wrap the above in another CTE and do:
-- SELECT segment, COUNT(*) as user_count,
-- ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER(), 1) AS pct
-- FROM user_segments
-- GROUP BY segment;

Scoring:

  • Strong Hire: Uses LEFT JOIN to include churned users. COALESCE for null handling. Can extend with additional dimensions (segment by device, by cohort).
  • Lean Hire: Correct segments but misses churned users (uses INNER JOIN).
  • No Hire: Cannot structure the query or uses non-standard syntax.

Problem 2: Session Computation

Events have no session_id. Define a session as a sequence of events by the same user where no two consecutive events are more than 30 minutes apart. Compute session_id, session duration, and events per session.

Hint 1 - Direction

Use LAG to compute time gap between consecutive events per user. A gap > 30 minutes starts a new session. Use cumulative SUM of new-session flags to assign session IDs.

Hint 2 - The cumsum trick

Create a is_new_session flag (1 when gap > 30 min or first event). Then SUM(is_new_session) OVER (PARTITION BY user_id ORDER BY event_time) gives the session number.

Hint 3 - Full Solution
WITH ordered_events AS (
SELECT
user_id,
event_time,
event_type,
LAG(event_time) OVER (
PARTITION BY user_id ORDER BY event_time
) AS prev_event_time
FROM events
),
session_flags AS (
SELECT
*,
CASE
WHEN prev_event_time IS NULL THEN 1 -- First event
WHEN EXTRACT(EPOCH FROM (event_time - prev_event_time)) > 1800 THEN 1 -- 30 min gap
ELSE 0
END AS is_new_session
FROM ordered_events
),
sessionized AS (
SELECT
*,
SUM(is_new_session) OVER (
PARTITION BY user_id
ORDER BY event_time
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS session_id
FROM session_flags
)
SELECT
user_id,
session_id,
COUNT(*) AS events_in_session,
MIN(event_time) AS session_start,
MAX(event_time) AS session_end,
EXTRACT(EPOCH FROM (MAX(event_time) - MIN(event_time))) / 60.0 AS duration_minutes
FROM sessionized
GROUP BY user_id, session_id;

Scoring:

  • Strong Hire: Clean LAG + cumsum pattern. Correct gap threshold handling. Can discuss why 30 minutes is a common choice and how to tune it.
  • Lean Hire: Correct approach but syntax errors in the window function or time arithmetic.
  • No Hire: Cannot conceptualize the sessionization without loops.

Problem 3: Top-K per Group

Find the top 3 products by revenue in each category, including ties.

Hint 1 - Direction

Use DENSE_RANK (not ROW_NUMBER) to handle ties correctly.

Hint 2 - Structure

CTE: Compute per-product revenue. Window: DENSE_RANK partitioned by category, ordered by revenue DESC. Filter: WHERE rank <= 3.

Hint 3 - Full Solution
WITH product_revenue AS (
SELECT
category,
product_id,
product_name,
SUM(revenue) AS total_revenue
FROM sales
GROUP BY category, product_id, product_name
),
ranked AS (
SELECT
*,
DENSE_RANK() OVER (
PARTITION BY category
ORDER BY total_revenue DESC
) AS revenue_rank
FROM product_revenue
)
SELECT category, product_id, product_name, total_revenue, revenue_rank
FROM ranked
WHERE revenue_rank <= 3
ORDER BY category, revenue_rank;

Key distinction:

  • ROW_NUMBER(): If two products tie, one gets rank 2 and the other rank 3. You may miss a tied product.
  • RANK(): Both get rank 2, next gets rank 4. Top-3 filter may return fewer than 3 if there is a tie at rank 3.
  • DENSE_RANK(): Both get rank 2, next gets rank 3. Top-3 filter includes all products ranked 1, 2, or 3.

Scoring:

  • Strong Hire: Uses DENSE_RANK and explains why. Discusses the difference between the three ranking functions.
  • Lean Hire: Uses ROW_NUMBER (correct but does not handle ties).
  • No Hire: Cannot write a ranking window function.

Problem 4: A/B Test with Pre-Experiment Validation

Write a query that validates an A/B test for sample ratio mismatch (SRM) and computes the treatment effect with confidence intervals.

Hint 1 - Direction

First check that control and treatment groups are balanced (similar sizes and similar pre-experiment metrics). Then compute the metric difference and standard error.

Hint 2 - Statistics in SQL

For a difference in means: SE = sqrt(var1/n1 + var2/n2). A 95% CI is (diff - 1.96SE, diff + 1.96SE). If the CI excludes 0, the result is significant.

Hint 3 - Full Solution
WITH experiment_data AS (
SELECT
ea.user_id,
ea.variant,
COALESCE(SUM(o.amount), 0) AS total_spend,
COUNT(DISTINCT o.order_id) AS num_orders
FROM experiment_assignments ea
LEFT JOIN orders o
ON ea.user_id = o.user_id
AND o.order_date BETWEEN ea.assignment_date
AND ea.assignment_date + INTERVAL '14' DAY
WHERE ea.experiment_id = 'checkout_redesign'
GROUP BY ea.user_id, ea.variant
),
group_stats AS (
SELECT
variant,
COUNT(*) AS n,
AVG(total_spend) AS mean_spend,
VARIANCE(total_spend) AS var_spend,
AVG(num_orders * 1.0) AS mean_orders
FROM experiment_data
GROUP BY variant
),
srm_check AS (
-- Sample Ratio Mismatch: are groups balanced?
SELECT
MAX(CASE WHEN variant = 'control' THEN n END) AS n_control,
MAX(CASE WHEN variant = 'treatment' THEN n END) AS n_treatment,
MAX(CASE WHEN variant = 'control' THEN n END) * 1.0 /
NULLIF(MAX(CASE WHEN variant = 'treatment' THEN n END), 0) AS ratio
FROM group_stats
),
effect_size AS (
SELECT
-- Revenue per user
MAX(CASE WHEN variant = 'treatment' THEN mean_spend END) -
MAX(CASE WHEN variant = 'control' THEN mean_spend END) AS spend_diff,
-- Standard error of the difference
SQRT(
MAX(CASE WHEN variant = 'control' THEN var_spend / n END) +
MAX(CASE WHEN variant = 'treatment' THEN var_spend / n END)
) AS se_diff
FROM group_stats
)
SELECT
sc.n_control,
sc.n_treatment,
sc.ratio AS sample_ratio,
CASE WHEN ABS(sc.ratio - 1.0) > 0.05 THEN 'WARNING: SRM detected' ELSE 'OK' END AS srm_status,
ROUND(es.spend_diff, 2) AS revenue_lift,
ROUND(es.spend_diff - 1.96 * es.se_diff, 2) AS ci_lower,
ROUND(es.spend_diff + 1.96 * es.se_diff, 2) AS ci_upper,
CASE
WHEN es.spend_diff - 1.96 * es.se_diff > 0 THEN 'Significant positive'
WHEN es.spend_diff + 1.96 * es.se_diff < 0 THEN 'Significant negative'
ELSE 'Not significant'
END AS significance
FROM srm_check sc, effect_size es;

Scoring:

  • Strong Hire: Includes SRM check, correct CI computation, LEFT JOIN for users with zero purchases. Discusses assumptions (normality via CLT for large n).
  • Lean Hire: Computes averages but no confidence interval or significance check.
  • No Hire: Cannot structure the A/B test query.

Problem 5: Customer Lifetime Value (LTV) Prediction Features

Build a feature table for predicting 12-month LTV for users who signed up in the last 3 months, using only data from their first 30 days.

Hint 1 - Direction

This is a data leakage problem disguised as a feature engineering problem. You must restrict features to the first 30 days only.

Hint 2 - Key features

First-30-day features: spend, frequency, recency within window, category diversity, refund rate, device type, time-of-day patterns.

Hint 3 - Full Solution
WITH new_users AS (
SELECT user_id, signup_date
FROM users
WHERE signup_date >= CURRENT_DATE - INTERVAL '90' DAY
),
first_30d_orders AS (
-- Only orders within first 30 days of signup (no leakage)
SELECT
nu.user_id,
nu.signup_date,
o.order_id,
o.order_date,
o.amount,
o.category,
o.is_refund
FROM new_users nu
JOIN orders o ON nu.user_id = o.user_id
WHERE o.order_date BETWEEN nu.signup_date
AND nu.signup_date + INTERVAL '30' DAY
),
features AS (
SELECT
user_id,
signup_date,
-- Monetary
COALESCE(SUM(amount), 0) AS first_30d_revenue,
COALESCE(AVG(amount), 0) AS avg_order_value,
COALESCE(MAX(amount), 0) AS max_order_value,
-- Frequency
COUNT(DISTINCT order_id) AS first_30d_orders,
COUNT(DISTINCT DATE_TRUNC('day', order_date)) AS active_days,
-- Recency within window
EXTRACT(DAY FROM
MAX(order_date) - signup_date
) AS days_to_last_order,
EXTRACT(DAY FROM
MIN(order_date) - signup_date
) AS days_to_first_order,
-- Behavioral
COUNT(DISTINCT category) AS unique_categories,
SUM(CASE WHEN is_refund THEN 1 ELSE 0 END) * 1.0 /
NULLIF(COUNT(*), 0) AS refund_rate,
-- Temporal patterns
COUNT(DISTINCT EXTRACT(DOW FROM order_date)) AS unique_dow,
SUM(CASE WHEN EXTRACT(DOW FROM order_date) IN (0, 6) THEN 1 ELSE 0 END) * 1.0 /
NULLIF(COUNT(*), 0) AS weekend_order_ratio
FROM first_30d_orders
GROUP BY user_id, signup_date
),
ltv_labels AS (
-- 12-month LTV as the target variable
SELECT
nu.user_id,
COALESCE(SUM(o.amount), 0) AS ltv_12m
FROM new_users nu
LEFT JOIN orders o ON nu.user_id = o.user_id
AND o.order_date BETWEEN nu.signup_date
AND nu.signup_date + INTERVAL '365' DAY
GROUP BY nu.user_id
)
SELECT
f.*,
l.ltv_12m
FROM features f
JOIN ltv_labels l ON f.user_id = l.user_id;

Scoring:

  • Strong Hire: Correctly restricts features to first 30 days. Explains why this prevents data leakage. Mentions that this approach has a survivorship bias issue (only new users who made at least one purchase in 30 days appear in features - should LEFT JOIN from new_users). Discusses feature importance for LTV.
  • Lean Hire: Computes reasonable features but does not carefully restrict the time window.
  • No Hire: Uses all-time data for features (data leakage).

Interview Cheat Sheet

ConceptKey PatternOne-LinerRed Flag
Window functionFUNC() OVER (PARTITION BY ... ORDER BY ...)Compute per-row stats without reducing rowsConfusing window functions with GROUP BY
ROW_NUMBER vs RANKROW_NUMBER has no ties; RANK/DENSE_RANK doChoose based on tie handling needsUsing ROW_NUMBER when ties matter
LAG / LEADLAG(col, n) OVER (PARTITION BY ... ORDER BY ...)Access previous/next row valueNot specifying PARTITION BY (global lag)
Running sumSUM() OVER (ORDER BY ... ROWS UNBOUNDED PRECEDING)Cumulative aggregationForgetting the frame clause
CTEWITH cte AS (SELECT ...) SELECT ... FROM cteReadable, composable query layers5+ levels of nested subqueries
Self-joinJoin table to itself with offsetYoY comparison, consecutive eventsNot recognizing when self-join is needed
A/B testControl vs treatment with LEFT JOINInclude users with zero eventsINNER JOIN drops zero-conversion users
Cohort analysisDATE_TRUNC + months_since_signupTrack retention by signup cohortNot using DISTINCT for user counts
SessionizationLAG + cumsum of gap flagsGroup events into sessionsCannot implement without procedural code
OptimizationFilter early, index JOIN/WHERE columnsStart with EXPLAIN ANALYZEApplying function to indexed column

Spaced Repetition Checkpoints

Day 0 - Initial Learning

  • Read this entire page and trace through each query mentally
  • Write a window function query using ROW_NUMBER, LAG, and running SUM
  • Build a cohort retention query from scratch
  • Complete the self-assessment

Day 3 - First Recall

  • Write the sessionization query from memory (LAG + cumsum pattern)
  • Build an A/B test analysis query with SRM check and confidence intervals
  • Explain ROWS vs RANGE frame clauses with examples

Day 7 - Connections

  • Solve the user engagement segments problem without hints (15 min)
  • Write a funnel analysis query for a 5-step funnel
  • Explain 5 query optimization techniques

Day 14 - Application

  • Build a complete LTV feature extraction query under timed conditions (25 min)
  • Write a query to detect anomalies in daily metrics (rolling z-score)
  • Optimize a slow query given an EXPLAIN plan

Day 21 - Mock Interview

  • Have someone give you a table schema and business question - write the query in 20 minutes
  • Solve all practice problems in sequence (90 min total)
  • Explain your query design decisions as if presenting to an interviewer

Key Takeaways

  1. Window functions are the most tested SQL concept in ML interviews. LAG for time-series features, ROW_NUMBER for deduplication, running SUM for cumulative metrics, and RANK for top-k queries cover 80% of interview questions.

  2. CTEs are non-negotiable. Every complex query should be built as a sequence of CTEs, each solving one sub-problem. This is not just about readability - it is about debuggability during a timed interview.

  3. ML-specific SQL is about preventing data leakage. Feature extraction queries must respect temporal boundaries (only use data available at prediction time). A/B test queries must use LEFT JOINs to include users with zero conversions. LTV queries must restrict features to the observation window.

  4. Optimization questions test systems thinking. Knowing that you should filter before joining, avoid functions on indexed columns, and use EXPLAIN ANALYZE demonstrates that you understand what happens under the hood - critical for ML roles that process large datasets.

Next Steps

Continue to Python ML Coding to implement ML algorithms from scratch - the definitive test of whether you understand how models actually work.

© 2026 EngineersOfAI. All rights reserved.