SQL for ML Roles - The Query Language Every AI Engineer Needs

Reading time: ~45 min | Interview relevance: Critical | Roles: Data Scientist, MLE, AI Eng, Analytics Engineer, Applied Scientist

The Real Interview Moment

You are in a Netflix Data Science on-site. The interviewer pulls up a SQL editor and says: "We have a table of streaming events - user_id, title_id, start_time, duration_seconds, device_type. Write a query to find users whose 28-day rolling engagement dropped by more than 40% compared to their historical average, broken down by device. You have 25 minutes."

You know this requires window functions, date arithmetic, conditional aggregation, and a CTE or subquery structure. But the clock is running. Do you start with a monolithic query and get lost in nested subqueries? Or do you build it layer by layer - first the daily aggregation, then the rolling window, then the historical baseline, then the comparison - each in a clean CTE that you can debug independently?

This is what SQL interviews for ML roles test. They are not about SELECT-FROM-WHERE basics. They are about whether you can decompose a complex analytical question into composable query layers, use window functions fluently, and translate a business metric definition into precise SQL. Every DS, MLE, and analytics role at a major tech company will test this.

Candidates who say "I would do this in Pandas" get a polite rejection - the data lives in a warehouse and your Pandas cannot handle 500M rows. Candidates who build clean, layered CTEs with correct window functions get a "strong hire."

What You Will Master

Window functions: ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD, SUM/AVG OVER, NTILE
CTEs: Common Table Expressions for readable, composable queries
Self-joins: Comparing rows within the same table
Aggregation: GROUP BY, HAVING, conditional aggregation with CASE WHEN
Date functions: DATEDIFF, DATE_TRUNC, INTERVAL, date arithmetic
Query optimization: indexes, EXPLAIN plans, partition pruning
ML-specific patterns: feature extraction, A/B test analysis, cohort analysis, funnel analysis, metric computation

Self-Assessment: Where Are You Now?

Skill	1 - Cannot	2 - Vaguely	3 - Can Write	4 - Can Optimize	5 - Can Teach	Your Score
Window functions (RANK, LAG, running sum)						___
CTEs for multi-step queries						___
Self-joins						___
Conditional aggregation (CASE WHEN)						___
Date arithmetic and DATE_TRUNC						___
A/B test metric queries						___
Cohort retention queries						___
Query optimization (indexes, EXPLAIN)						___

Target: All 4s and 5s before your interview.

Part 1 - Window Functions: The Core Interview Topic

What Window Functions Do

GROUP BY vs Window Functions

Essential Window Functions

-- Sample table: daily_metrics
-- | user_id | metric_date | page_views | purchases |
-- |---------|-------------|------------|-----------|

-- ROW_NUMBER: unique sequential number per partition
SELECT user_id, metric_date, page_views,
       ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY metric_date) AS day_number
FROM daily_metrics;

-- RANK vs DENSE_RANK vs ROW_NUMBER
-- Values: [100, 100, 90, 80]
-- ROW_NUMBER:  1, 2, 3, 4  (no ties)
-- RANK:        1, 1, 3, 4  (ties skip)
-- DENSE_RANK:  1, 1, 2, 3  (ties don't skip)

-- LAG / LEAD: access previous/next row
SELECT user_id, metric_date, page_views,
       LAG(page_views, 1) OVER (PARTITION BY user_id ORDER BY metric_date) AS prev_day_views,
       LEAD(page_views, 1) OVER (PARTITION BY user_id ORDER BY metric_date) AS next_day_views,
       page_views - LAG(page_views, 1) OVER (PARTITION BY user_id ORDER BY metric_date) AS day_over_day_change
FROM daily_metrics;

-- Running sum / running average
SELECT user_id, metric_date, purchases,
       SUM(purchases) OVER (
           PARTITION BY user_id
           ORDER BY metric_date
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       ) AS cumulative_purchases,
       AVG(page_views) OVER (
           PARTITION BY user_id
           ORDER BY metric_date
           ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
       ) AS rolling_7day_avg
FROM daily_metrics;

-- NTILE: divide into equal-sized buckets
SELECT user_id, total_spend,
       NTILE(10) OVER (ORDER BY total_spend) AS spend_decile
FROM user_summary;

60-Second Answer

"Window functions compute a value for each row based on a set of related rows, without collapsing the result. They use OVER() to define the window - PARTITION BY groups the rows (like GROUP BY but without reducing), ORDER BY sorts within each partition, and the frame clause (ROWS BETWEEN) defines which rows to include in the computation. The most common ones in ML contexts are LAG for time-series features, ROW_NUMBER for deduplication, and running SUM/AVG for cumulative metrics."

Window Frame Specifications

Frame Clause	Meaning	Use Case
`ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW`	All rows from start to current	Running/cumulative sum
`ROWS BETWEEN 6 PRECEDING AND CURRENT ROW`	Current + 6 previous rows	7-day rolling average
`ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING`	Previous, current, next	Smoothing
`ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING`	Entire partition	Partition-level stats
`RANGE BETWEEN INTERVAL '7' DAY PRECEDING AND CURRENT ROW`	Date-based window	Calendar-aware rolling

Common Trap

ROWS and RANGE are different. ROWS BETWEEN 6 PRECEDING AND CURRENT ROW counts exactly 7 rows regardless of gaps. RANGE BETWEEN INTERVAL '7' DAY PRECEDING AND CURRENT ROW includes all rows within a 7-day calendar window, which may be more or fewer than 7 rows if there are gaps or duplicates. Interviewers test this distinction. For time-series with potential missing dates, RANGE is usually correct.

Part 2 - CTEs: Building Queries in Layers

Why CTEs Matter

-- Bad: deeply nested subquery (hard to read, debug, modify)
SELECT * FROM (
    SELECT * FROM (
        SELECT user_id, SUM(amount) as total
        FROM orders
        GROUP BY user_id
    ) a
    WHERE total > 100
) b
JOIN users u ON b.user_id = u.user_id;

-- Good: layered CTEs (each step is clear and testable)
WITH user_totals AS (
    SELECT user_id,
           SUM(amount) AS total_spend,
           COUNT(*) AS order_count
    FROM orders
    GROUP BY user_id
),
high_value_users AS (
    SELECT *
    FROM user_totals
    WHERE total_spend > 100
)
SELECT u.name, u.signup_date, h.total_spend, h.order_count
FROM high_value_users h
JOIN users u ON h.user_id = u.user_id
ORDER BY h.total_spend DESC;

Instant Rejection

Never write a query as a single monolithic block with 5 levels of nested subqueries. Interviewers evaluate your SQL code the same way they evaluate your Python code - readability and maintainability matter. CTEs show you can decompose a problem. They also make it easy for you to debug during the interview: you can run each CTE independently to verify intermediate results.

Part 3 - ML-Specific SQL Patterns

Pattern 1: Feature Extraction for ML Models

-- Build a feature table for a churn prediction model
-- One row per user with aggregated features

WITH user_activity AS (
    SELECT
        user_id,
        COUNT(*) AS total_events,
        COUNT(DISTINCT DATE_TRUNC('day', event_time)) AS active_days,
        MIN(event_time) AS first_event,
        MAX(event_time) AS last_event,
        SUM(CASE WHEN event_type = 'purchase' THEN 1 ELSE 0 END) AS purchase_count,
        SUM(CASE WHEN event_type = 'view' THEN 1 ELSE 0 END) AS view_count,
        COUNT(DISTINCT CASE WHEN event_type = 'purchase' THEN product_id END) AS unique_products_purchased
    FROM events
    WHERE event_time >= '2024-01-01'
    GROUP BY user_id
),
recency AS (
    SELECT
        user_id,
        EXTRACT(DAY FROM CURRENT_TIMESTAMP - last_event) AS days_since_last_activity,
        EXTRACT(DAY FROM last_event - first_event) AS tenure_days
    FROM user_activity
),
recent_vs_historical AS (
    SELECT
        user_id,
        -- Last 30 days
        SUM(CASE WHEN event_time >= CURRENT_DATE - INTERVAL '30' DAY
                 THEN 1 ELSE 0 END) AS events_last_30d,
        -- 31-60 days ago
        SUM(CASE WHEN event_time >= CURRENT_DATE - INTERVAL '60' DAY
                  AND event_time < CURRENT_DATE - INTERVAL '30' DAY
                 THEN 1 ELSE 0 END) AS events_prev_30d
    FROM events
    GROUP BY user_id
)
SELECT
    ua.user_id,
    ua.total_events,
    ua.active_days,
    ua.purchase_count,
    ua.view_count,
    ua.unique_products_purchased,
    r.days_since_last_activity,
    r.tenure_days,
    -- Derived features
    COALESCE(ua.purchase_count * 1.0 / NULLIF(ua.view_count, 0), 0) AS conversion_rate,
    COALESCE(ua.total_events * 1.0 / NULLIF(r.tenure_days, 0), 0) AS events_per_day,
    rh.events_last_30d,
    rh.events_prev_30d,
    CASE WHEN rh.events_prev_30d > 0
         THEN (rh.events_last_30d - rh.events_prev_30d) * 1.0 / rh.events_prev_30d
         ELSE 0
    END AS activity_trend
FROM user_activity ua
JOIN recency r ON ua.user_id = r.user_id
LEFT JOIN recent_vs_historical rh ON ua.user_id = rh.user_id;

Pattern 2: A/B Test Analysis

-- Compute the key metrics for an A/B test
-- Test: does a new recommendation algorithm increase purchases?

WITH experiment_users AS (
    -- Get users and their experiment assignments
    SELECT user_id, variant  -- 'control' or 'treatment'
    FROM experiment_assignments
    WHERE experiment_id = 'rec_algo_v2'
      AND assignment_date BETWEEN '2024-06-01' AND '2024-06-30'
),
user_metrics AS (
    -- Compute per-user metrics during the experiment period
    SELECT
        eu.user_id,
        eu.variant,
        COALESCE(COUNT(DISTINCT o.order_id), 0) AS num_purchases,
        COALESCE(SUM(o.amount), 0) AS total_revenue,
        COALESCE(COUNT(DISTINCT DATE_TRUNC('day', o.order_date)), 0) AS active_days
    FROM experiment_users eu
    LEFT JOIN orders o
        ON eu.user_id = o.user_id
        AND o.order_date BETWEEN '2024-06-01' AND '2024-06-30'
    GROUP BY eu.user_id, eu.variant
)
SELECT
    variant,
    COUNT(*) AS num_users,
    -- Purchase rate
    AVG(CASE WHEN num_purchases > 0 THEN 1.0 ELSE 0.0 END) AS purchase_rate,
    -- Average purchases per user
    AVG(num_purchases) AS avg_purchases,
    -- Average revenue per user (ARPU)
    AVG(total_revenue) AS arpu,
    -- Revenue per purchasing user (ARPPU)
    AVG(CASE WHEN num_purchases > 0 THEN total_revenue END) AS arppu,
    -- Standard deviation (for significance testing)
    STDDEV(total_revenue) AS revenue_stddev,
    -- Percentiles
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY total_revenue) AS median_revenue,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY total_revenue) AS p95_revenue
FROM user_metrics
GROUP BY variant
ORDER BY variant;

Company Variation

Netflix, Airbnb, and Uber test A/B analysis SQL heavily because experimentation is core to their product development. They want you to compute not just averages but also variance (for significance), percentiles (for understanding distribution), and conditional metrics (ARPPU vs ARPU). Google and Meta often test this in a statistics interview rather than a SQL interview. Startups rarely test A/B SQL.

Pattern 3: Cohort Retention Analysis

-- Monthly cohort retention: what % of users who signed up in month M
-- are still active N months later?

WITH user_cohorts AS (
    -- Assign each user to their signup cohort
    SELECT
        user_id,
        DATE_TRUNC('month', signup_date) AS cohort_month
    FROM users
),
monthly_activity AS (
    -- Get each user's active months
    SELECT DISTINCT
        user_id,
        DATE_TRUNC('month', event_date) AS activity_month
    FROM events
),
cohort_activity AS (
    -- Join cohorts with activity and compute months since signup
    SELECT
        uc.cohort_month,
        ma.activity_month,
        -- Months between cohort month and activity month
        EXTRACT(YEAR FROM AGE(ma.activity_month, uc.cohort_month)) * 12 +
        EXTRACT(MONTH FROM AGE(ma.activity_month, uc.cohort_month)) AS months_since_signup,
        COUNT(DISTINCT uc.user_id) AS active_users
    FROM user_cohorts uc
    JOIN monthly_activity ma ON uc.user_id = ma.user_id
    WHERE ma.activity_month >= uc.cohort_month  -- Only activity after signup
    GROUP BY uc.cohort_month, ma.activity_month
),
cohort_sizes AS (
    SELECT cohort_month, COUNT(DISTINCT user_id) AS cohort_size
    FROM user_cohorts
    GROUP BY cohort_month
)
SELECT
    ca.cohort_month,
    ca.months_since_signup,
    ca.active_users,
    cs.cohort_size,
    ROUND(ca.active_users * 100.0 / cs.cohort_size, 1) AS retention_pct
FROM cohort_activity ca
JOIN cohort_sizes cs ON ca.cohort_month = cs.cohort_month
WHERE ca.months_since_signup <= 12  -- First year
ORDER BY ca.cohort_month, ca.months_since_signup;

Pattern 4: Funnel Analysis

-- Conversion funnel: page_view -> add_to_cart -> checkout -> purchase
-- Track how many users complete each step within a single session

WITH funnel_events AS (
    SELECT
        session_id,
        user_id,
        event_name,
        event_time,
        -- Assign numeric order to funnel stages
        CASE event_name
            WHEN 'page_view' THEN 1
            WHEN 'add_to_cart' THEN 2
            WHEN 'checkout' THEN 3
            WHEN 'purchase' THEN 4
        END AS stage_order
    FROM events
    WHERE event_name IN ('page_view', 'add_to_cart', 'checkout', 'purchase')
),
session_max_stage AS (
    -- For each session, find the deepest funnel stage reached
    SELECT
        session_id,
        user_id,
        MAX(stage_order) AS max_stage
    FROM funnel_events
    GROUP BY session_id, user_id
)
SELECT
    'page_view' AS stage,
    COUNT(DISTINCT CASE WHEN max_stage >= 1 THEN session_id END) AS sessions,
    ROUND(COUNT(DISTINCT CASE WHEN max_stage >= 1 THEN session_id END) * 100.0 /
          COUNT(DISTINCT session_id), 1) AS pct_of_total

FROM session_max_stage

UNION ALL

SELECT
    'add_to_cart',
    COUNT(DISTINCT CASE WHEN max_stage >= 2 THEN session_id END),
    ROUND(COUNT(DISTINCT CASE WHEN max_stage >= 2 THEN session_id END) * 100.0 /
          COUNT(DISTINCT session_id), 1)
FROM session_max_stage

UNION ALL

SELECT
    'checkout',
    COUNT(DISTINCT CASE WHEN max_stage >= 3 THEN session_id END),
    ROUND(COUNT(DISTINCT CASE WHEN max_stage >= 3 THEN session_id END) * 100.0 /
          COUNT(DISTINCT session_id), 1)
FROM session_max_stage

UNION ALL

SELECT
    'purchase',
    COUNT(DISTINCT CASE WHEN max_stage >= 4 THEN session_id END),
    ROUND(COUNT(DISTINCT CASE WHEN max_stage >= 4 THEN session_id END) * 100.0 /
          COUNT(DISTINCT session_id), 1)
FROM session_max_stage;

Pattern 5: Rolling Metric Computation

-- Compute 7-day and 28-day rolling active users (DAU, WAU, MAU)

WITH daily_active AS (
    SELECT
        DATE_TRUNC('day', event_time) AS metric_date,
        COUNT(DISTINCT user_id) AS dau
    FROM events
    GROUP BY DATE_TRUNC('day', event_time)
),
rolling_metrics AS (
    SELECT
        metric_date,
        dau,
        -- 7-day rolling unique users (WAU approximation)
        SUM(dau) OVER (
            ORDER BY metric_date
            ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
        ) AS wau_sum,  -- Note: this overcounts; see below
        -- 28-day rolling average DAU
        AVG(dau) OVER (
            ORDER BY metric_date
            ROWS BETWEEN 27 PRECEDING AND CURRENT ROW
        ) AS avg_dau_28d
    FROM daily_active
)
SELECT * FROM rolling_metrics
ORDER BY metric_date;

-- NOTE: For true WAU (unique users in 7 days), you need:
-- This cannot be done with simple rolling windows because uniqueness
-- requires access to raw user-level data within the window.

-- True WAU query:
SELECT
    d.metric_date,
    COUNT(DISTINCT e.user_id) AS true_wau
FROM (
    SELECT DISTINCT DATE_TRUNC('day', event_time) AS metric_date
    FROM events
) d
JOIN events e
    ON DATE_TRUNC('day', e.event_time)
       BETWEEN d.metric_date - INTERVAL '6' DAY AND d.metric_date
GROUP BY d.metric_date;

Part 4 - Self-Joins and Advanced Patterns

Self-Join: Year-over-Year Comparison

-- Compare each month's revenue to the same month last year

WITH monthly_revenue AS (
    SELECT
        DATE_TRUNC('month', order_date) AS month,
        SUM(amount) AS revenue
    FROM orders
    GROUP BY DATE_TRUNC('month', order_date)
)
SELECT
    curr.month,
    curr.revenue AS current_revenue,
    prev.revenue AS previous_year_revenue,
    ROUND((curr.revenue - prev.revenue) * 100.0 / NULLIF(prev.revenue, 0), 1) AS yoy_growth_pct
FROM monthly_revenue curr
LEFT JOIN monthly_revenue prev
    ON curr.month = prev.month + INTERVAL '1 year';

Self-Join: Finding Consecutive Events

-- Find users who made purchases on 3+ consecutive days

WITH purchase_days AS (
    SELECT DISTINCT
        user_id,
        DATE_TRUNC('day', order_date) AS purchase_date
    FROM orders
),
with_prev AS (
    SELECT
        user_id,
        purchase_date,
        LAG(purchase_date, 1) OVER (PARTITION BY user_id ORDER BY purchase_date) AS prev_date,
        LAG(purchase_date, 2) OVER (PARTITION BY user_id ORDER BY purchase_date) AS prev_prev_date
    FROM purchase_days
)
SELECT DISTINCT user_id
FROM with_prev
WHERE purchase_date - prev_date = INTERVAL '1' DAY
  AND prev_date - prev_prev_date = INTERVAL '1' DAY;

Deduplication with ROW_NUMBER

-- Keep only the most recent record per user (common data cleaning task)

WITH ranked AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY user_id
               ORDER BY updated_at DESC
           ) AS rn
    FROM user_profiles
)
SELECT *
FROM ranked
WHERE rn = 1;

Part 5 - Query Optimization

Key Optimization Techniques

SQL Query Optimization Flow

Optimization Rules

Rule	Bad	Good	Why
Filter early	JOIN then WHERE	WHERE then JOIN (or in CTE)	Reduces data before expensive operations
Use appropriate joins	Cross join then filter	Inner/left join with ON clause	Avoids Cartesian product
Avoid SELECT *	`SELECT * FROM big_table`	`SELECT col1, col2 FROM big_table`	Reduces I/O and memory
Use EXISTS vs IN	`WHERE id IN (SELECT id FROM ...)`	`WHERE EXISTS (SELECT 1 FROM ... WHERE ...)`	EXISTS short-circuits
Avoid functions on indexed columns	`WHERE YEAR(date_col) = 2024`	`WHERE date_col >= '2024-01-01' AND date_col < '2025-01-01'`	Allows index usage
Use approximate functions	`COUNT(DISTINCT user_id)`	`APPROX_COUNT_DISTINCT(user_id)`	100x faster for large tables
Partition pruning	Scan all partitions	Filter on partition key	Only reads relevant partitions

Interviewer's Perspective

When I ask about query optimization, I want to hear three things: (1) "I would start with EXPLAIN ANALYZE to understand the query plan." (2) "I would check for full table scans and add indexes on WHERE and JOIN columns." (3) "I would push filters as early as possible to reduce the data volume before joins and aggregations." Candidates who jump to "add more machines" without analyzing the query plan reveal they lack hands-on database experience.

Practice Problems

Problem 1: User Engagement Segments

Given a table events(user_id, event_type, event_date), classify users into engagement segments based on their activity in the last 90 days:

Power users: 50+ events
Active users: 10-49 events
Casual users: 1-9 events
Churned: 0 events (had activity before but not in last 90 days)

Hint 1 - Direction

You need to consider ALL users, including those with zero recent events. Use a LEFT JOIN from all known users to their recent activity.

Hint 2 - Structure

CTE 1: Get all users who have ever been active. CTE 2: Count events in the last 90 days. Main query: CASE WHEN to classify.

Hint 3 - Full Solution

WITH all_users AS (
    -- All users who have ever been active
    SELECT DISTINCT user_id
    FROM events
),
recent_activity AS (
    -- Event count in last 90 days
    SELECT
        user_id,
        COUNT(*) AS event_count
    FROM events
    WHERE event_date >= CURRENT_DATE - INTERVAL '90' DAY
    GROUP BY user_id
)
SELECT
    au.user_id,
    COALESCE(ra.event_count, 0) AS recent_events,
    CASE
        WHEN COALESCE(ra.event_count, 0) >= 50 THEN 'power_user'
        WHEN COALESCE(ra.event_count, 0) >= 10 THEN 'active_user'
        WHEN COALESCE(ra.event_count, 0) >= 1 THEN 'casual_user'
        ELSE 'churned'
    END AS segment
FROM all_users au
LEFT JOIN recent_activity ra ON au.user_id = ra.user_id;

-- Summary statistics
-- Wrap the above in another CTE and do:
-- SELECT segment, COUNT(*) as user_count,
--        ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER(), 1) AS pct
-- FROM user_segments
-- GROUP BY segment;

Scoring:

Strong Hire: Uses LEFT JOIN to include churned users. COALESCE for null handling. Can extend with additional dimensions (segment by device, by cohort).
Lean Hire: Correct segments but misses churned users (uses INNER JOIN).
No Hire: Cannot structure the query or uses non-standard syntax.

Problem 2: Session Computation

Events have no session_id. Define a session as a sequence of events by the same user where no two consecutive events are more than 30 minutes apart. Compute session_id, session duration, and events per session.

Hint 1 - Direction

Use LAG to compute time gap between consecutive events per user. A gap > 30 minutes starts a new session. Use cumulative SUM of new-session flags to assign session IDs.

Hint 2 - The cumsum trick

Create a is_new_session flag (1 when gap > 30 min or first event). Then SUM(is_new_session) OVER (PARTITION BY user_id ORDER BY event_time) gives the session number.

Hint 3 - Full Solution

WITH ordered_events AS (
    SELECT
        user_id,
        event_time,
        event_type,
        LAG(event_time) OVER (
            PARTITION BY user_id ORDER BY event_time
        ) AS prev_event_time
    FROM events
),
session_flags AS (
    SELECT
        *,
        CASE
            WHEN prev_event_time IS NULL THEN 1  -- First event
            WHEN EXTRACT(EPOCH FROM (event_time - prev_event_time)) > 1800 THEN 1  -- 30 min gap
            ELSE 0
        END AS is_new_session
    FROM ordered_events
),
sessionized AS (
    SELECT
        *,
        SUM(is_new_session) OVER (
            PARTITION BY user_id
            ORDER BY event_time
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) AS session_id
    FROM session_flags
)
SELECT
    user_id,
    session_id,
    COUNT(*) AS events_in_session,
    MIN(event_time) AS session_start,
    MAX(event_time) AS session_end,
    EXTRACT(EPOCH FROM (MAX(event_time) - MIN(event_time))) / 60.0 AS duration_minutes
FROM sessionized
GROUP BY user_id, session_id;

Scoring:

Strong Hire: Clean LAG + cumsum pattern. Correct gap threshold handling. Can discuss why 30 minutes is a common choice and how to tune it.
Lean Hire: Correct approach but syntax errors in the window function or time arithmetic.
No Hire: Cannot conceptualize the sessionization without loops.

Problem 3: Top-K per Group

Find the top 3 products by revenue in each category, including ties.

Hint 1 - Direction

Use DENSE_RANK (not ROW_NUMBER) to handle ties correctly.

Hint 2 - Structure

CTE: Compute per-product revenue. Window: DENSE_RANK partitioned by category, ordered by revenue DESC. Filter: WHERE rank <= 3.

Hint 3 - Full Solution

WITH product_revenue AS (
    SELECT
        category,
        product_id,
        product_name,
        SUM(revenue) AS total_revenue
    FROM sales
    GROUP BY category, product_id, product_name
),
ranked AS (
    SELECT
        *,
        DENSE_RANK() OVER (
            PARTITION BY category
            ORDER BY total_revenue DESC
        ) AS revenue_rank
    FROM product_revenue
)
SELECT category, product_id, product_name, total_revenue, revenue_rank
FROM ranked
WHERE revenue_rank <= 3
ORDER BY category, revenue_rank;

Key distinction:

ROW_NUMBER(): If two products tie, one gets rank 2 and the other rank 3. You may miss a tied product.
RANK(): Both get rank 2, next gets rank 4. Top-3 filter may return fewer than 3 if there is a tie at rank 3.
DENSE_RANK(): Both get rank 2, next gets rank 3. Top-3 filter includes all products ranked 1, 2, or 3.

Scoring:

Strong Hire: Uses DENSE_RANK and explains why. Discusses the difference between the three ranking functions.
Lean Hire: Uses ROW_NUMBER (correct but does not handle ties).
No Hire: Cannot write a ranking window function.

Problem 4: A/B Test with Pre-Experiment Validation

Write a query that validates an A/B test for sample ratio mismatch (SRM) and computes the treatment effect with confidence intervals.

Hint 1 - Direction

First check that control and treatment groups are balanced (similar sizes and similar pre-experiment metrics). Then compute the metric difference and standard error.

Hint 2 - Statistics in SQL

For a difference in means: SE = sqrt(var1/n1 + var2/n2). A 95% CI is (diff - 1.96SE, diff + 1.96SE). If the CI excludes 0, the result is significant.

Hint 3 - Full Solution

WITH experiment_data AS (
    SELECT
        ea.user_id,
        ea.variant,
        COALESCE(SUM(o.amount), 0) AS total_spend,
        COUNT(DISTINCT o.order_id) AS num_orders
    FROM experiment_assignments ea
    LEFT JOIN orders o
        ON ea.user_id = o.user_id
        AND o.order_date BETWEEN ea.assignment_date
            AND ea.assignment_date + INTERVAL '14' DAY
    WHERE ea.experiment_id = 'checkout_redesign'
    GROUP BY ea.user_id, ea.variant
),
group_stats AS (
    SELECT
        variant,
        COUNT(*) AS n,
        AVG(total_spend) AS mean_spend,
        VARIANCE(total_spend) AS var_spend,
        AVG(num_orders * 1.0) AS mean_orders
    FROM experiment_data
    GROUP BY variant
),
srm_check AS (
    -- Sample Ratio Mismatch: are groups balanced?
    SELECT
        MAX(CASE WHEN variant = 'control' THEN n END) AS n_control,
        MAX(CASE WHEN variant = 'treatment' THEN n END) AS n_treatment,
        MAX(CASE WHEN variant = 'control' THEN n END) * 1.0 /
            NULLIF(MAX(CASE WHEN variant = 'treatment' THEN n END), 0) AS ratio
    FROM group_stats
),
effect_size AS (
    SELECT
        -- Revenue per user
        MAX(CASE WHEN variant = 'treatment' THEN mean_spend END) -
            MAX(CASE WHEN variant = 'control' THEN mean_spend END) AS spend_diff,
        -- Standard error of the difference
        SQRT(
            MAX(CASE WHEN variant = 'control' THEN var_spend / n END) +
            MAX(CASE WHEN variant = 'treatment' THEN var_spend / n END)
        ) AS se_diff
    FROM group_stats
)
SELECT
    sc.n_control,
    sc.n_treatment,
    sc.ratio AS sample_ratio,
    CASE WHEN ABS(sc.ratio - 1.0) > 0.05 THEN 'WARNING: SRM detected' ELSE 'OK' END AS srm_status,
    ROUND(es.spend_diff, 2) AS revenue_lift,
    ROUND(es.spend_diff - 1.96 * es.se_diff, 2) AS ci_lower,
    ROUND(es.spend_diff + 1.96 * es.se_diff, 2) AS ci_upper,
    CASE
        WHEN es.spend_diff - 1.96 * es.se_diff > 0 THEN 'Significant positive'
        WHEN es.spend_diff + 1.96 * es.se_diff < 0 THEN 'Significant negative'
        ELSE 'Not significant'
    END AS significance
FROM srm_check sc, effect_size es;

Scoring:

Strong Hire: Includes SRM check, correct CI computation, LEFT JOIN for users with zero purchases. Discusses assumptions (normality via CLT for large n).
Lean Hire: Computes averages but no confidence interval or significance check.
No Hire: Cannot structure the A/B test query.

Problem 5: Customer Lifetime Value (LTV) Prediction Features

Build a feature table for predicting 12-month LTV for users who signed up in the last 3 months, using only data from their first 30 days.

Hint 1 - Direction

This is a data leakage problem disguised as a feature engineering problem. You must restrict features to the first 30 days only.

Hint 2 - Key features

First-30-day features: spend, frequency, recency within window, category diversity, refund rate, device type, time-of-day patterns.

Hint 3 - Full Solution

WITH new_users AS (
    SELECT user_id, signup_date
    FROM users
    WHERE signup_date >= CURRENT_DATE - INTERVAL '90' DAY
),
first_30d_orders AS (
    -- Only orders within first 30 days of signup (no leakage)
    SELECT
        nu.user_id,
        nu.signup_date,
        o.order_id,
        o.order_date,
        o.amount,
        o.category,
        o.is_refund
    FROM new_users nu
    JOIN orders o ON nu.user_id = o.user_id
    WHERE o.order_date BETWEEN nu.signup_date
          AND nu.signup_date + INTERVAL '30' DAY
),
features AS (
    SELECT
        user_id,
        signup_date,
        -- Monetary
        COALESCE(SUM(amount), 0) AS first_30d_revenue,
        COALESCE(AVG(amount), 0) AS avg_order_value,
        COALESCE(MAX(amount), 0) AS max_order_value,
        -- Frequency
        COUNT(DISTINCT order_id) AS first_30d_orders,
        COUNT(DISTINCT DATE_TRUNC('day', order_date)) AS active_days,
        -- Recency within window
        EXTRACT(DAY FROM
            MAX(order_date) - signup_date
        ) AS days_to_last_order,
        EXTRACT(DAY FROM
            MIN(order_date) - signup_date
        ) AS days_to_first_order,
        -- Behavioral
        COUNT(DISTINCT category) AS unique_categories,
        SUM(CASE WHEN is_refund THEN 1 ELSE 0 END) * 1.0 /
            NULLIF(COUNT(*), 0) AS refund_rate,
        -- Temporal patterns
        COUNT(DISTINCT EXTRACT(DOW FROM order_date)) AS unique_dow,
        SUM(CASE WHEN EXTRACT(DOW FROM order_date) IN (0, 6) THEN 1 ELSE 0 END) * 1.0 /
            NULLIF(COUNT(*), 0) AS weekend_order_ratio
    FROM first_30d_orders
    GROUP BY user_id, signup_date
),
ltv_labels AS (
    -- 12-month LTV as the target variable
    SELECT
        nu.user_id,
        COALESCE(SUM(o.amount), 0) AS ltv_12m
    FROM new_users nu
    LEFT JOIN orders o ON nu.user_id = o.user_id
        AND o.order_date BETWEEN nu.signup_date
            AND nu.signup_date + INTERVAL '365' DAY
    GROUP BY nu.user_id
)
SELECT
    f.*,
    l.ltv_12m
FROM features f
JOIN ltv_labels l ON f.user_id = l.user_id;

Scoring:

Strong Hire: Correctly restricts features to first 30 days. Explains why this prevents data leakage. Mentions that this approach has a survivorship bias issue (only new users who made at least one purchase in 30 days appear in features - should LEFT JOIN from new_users). Discusses feature importance for LTV.
Lean Hire: Computes reasonable features but does not carefully restrict the time window.
No Hire: Uses all-time data for features (data leakage).

Interview Cheat Sheet

Concept	Key Pattern	One-Liner	Red Flag
Window function	`FUNC() OVER (PARTITION BY ... ORDER BY ...)`	Compute per-row stats without reducing rows	Confusing window functions with GROUP BY
ROW_NUMBER vs RANK	ROW_NUMBER has no ties; RANK/DENSE_RANK do	Choose based on tie handling needs	Using ROW_NUMBER when ties matter
LAG / LEAD	`LAG(col, n) OVER (PARTITION BY ... ORDER BY ...)`	Access previous/next row value	Not specifying PARTITION BY (global lag)
Running sum	`SUM() OVER (ORDER BY ... ROWS UNBOUNDED PRECEDING)`	Cumulative aggregation	Forgetting the frame clause
CTE	`WITH cte AS (SELECT ...) SELECT ... FROM cte`	Readable, composable query layers	5+ levels of nested subqueries
Self-join	Join table to itself with offset	YoY comparison, consecutive events	Not recognizing when self-join is needed
A/B test	Control vs treatment with LEFT JOIN	Include users with zero events	INNER JOIN drops zero-conversion users
Cohort analysis	DATE_TRUNC + months_since_signup	Track retention by signup cohort	Not using DISTINCT for user counts
Sessionization	LAG + cumsum of gap flags	Group events into sessions	Cannot implement without procedural code
Optimization	Filter early, index JOIN/WHERE columns	Start with EXPLAIN ANALYZE	Applying function to indexed column

Spaced Repetition Checkpoints

Day 0 - Initial Learning

Read this entire page and trace through each query mentally
Write a window function query using ROW_NUMBER, LAG, and running SUM
Build a cohort retention query from scratch
Complete the self-assessment

Day 3 - First Recall

Write the sessionization query from memory (LAG + cumsum pattern)
Build an A/B test analysis query with SRM check and confidence intervals
Explain ROWS vs RANGE frame clauses with examples

Day 7 - Connections

Solve the user engagement segments problem without hints (15 min)
Write a funnel analysis query for a 5-step funnel
Explain 5 query optimization techniques

Day 14 - Application

Build a complete LTV feature extraction query under timed conditions (25 min)
Write a query to detect anomalies in daily metrics (rolling z-score)
Optimize a slow query given an EXPLAIN plan

Day 21 - Mock Interview

Have someone give you a table schema and business question - write the query in 20 minutes
Solve all practice problems in sequence (90 min total)
Explain your query design decisions as if presenting to an interviewer

Key Takeaways

Window functions are the most tested SQL concept in ML interviews. LAG for time-series features, ROW_NUMBER for deduplication, running SUM for cumulative metrics, and RANK for top-k queries cover 80% of interview questions.
CTEs are non-negotiable. Every complex query should be built as a sequence of CTEs, each solving one sub-problem. This is not just about readability - it is about debuggability during a timed interview.
ML-specific SQL is about preventing data leakage. Feature extraction queries must respect temporal boundaries (only use data available at prediction time). A/B test queries must use LEFT JOINs to include users with zero conversions. LTV queries must restrict features to the observation window.
Optimization questions test systems thinking. Knowing that you should filter before joining, avoid functions on indexed columns, and use EXPLAIN ANALYZE demonstrates that you understand what happens under the hood - critical for ML roles that process large datasets.

Next Steps

Continue to Python ML Coding to implement ML algorithms from scratch - the definitive test of whether you understand how models actually work.

The Real Interview Moment​

What You Will Master​

Self-Assessment: Where Are You Now?​

Part 1 - Window Functions: The Core Interview Topic​

What Window Functions Do​

Essential Window Functions​

Window Frame Specifications​

Part 2 - CTEs: Building Queries in Layers​

Why CTEs Matter​

Part 3 - ML-Specific SQL Patterns​

Pattern 1: Feature Extraction for ML Models​

Pattern 2: A/B Test Analysis​

Pattern 3: Cohort Retention Analysis​

Pattern 4: Funnel Analysis​

Pattern 5: Rolling Metric Computation​

Part 4 - Self-Joins and Advanced Patterns​

Self-Join: Year-over-Year Comparison​

Self-Join: Finding Consecutive Events​

Deduplication with ROW_NUMBER​

Part 5 - Query Optimization​

Key Optimization Techniques​

Optimization Rules​

Practice Problems​

Problem 1: User Engagement Segments​

Problem 2: Session Computation​

Problem 3: Top-K per Group​

Problem 4: A/B Test with Pre-Experiment Validation​

Problem 5: Customer Lifetime Value (LTV) Prediction Features​

Interview Cheat Sheet​

Spaced Repetition Checkpoints​

Day 0 - Initial Learning​

Day 3 - First Recall​

Day 7 - Connections​

Day 14 - Application​

Day 21 - Mock Interview​

Key Takeaways​

Next Steps​

The Real Interview Moment

What You Will Master

Self-Assessment: Where Are You Now?

Part 1 - Window Functions: The Core Interview Topic

What Window Functions Do

Essential Window Functions

Window Frame Specifications

Part 2 - CTEs: Building Queries in Layers

Why CTEs Matter

Part 3 - ML-Specific SQL Patterns

Pattern 1: Feature Extraction for ML Models

Pattern 2: A/B Test Analysis

Pattern 3: Cohort Retention Analysis

Pattern 4: Funnel Analysis

Pattern 5: Rolling Metric Computation

Part 4 - Self-Joins and Advanced Patterns

Self-Join: Year-over-Year Comparison

Self-Join: Finding Consecutive Events

Deduplication with ROW_NUMBER

Part 5 - Query Optimization

Key Optimization Techniques

Optimization Rules

Practice Problems

Problem 1: User Engagement Segments

Problem 2: Session Computation

Problem 3: Top-K per Group

Problem 4: A/B Test with Pre-Experiment Validation

Problem 5: Customer Lifetime Value (LTV) Prediction Features

Interview Cheat Sheet

Spaced Repetition Checkpoints

Day 0 - Initial Learning

Day 3 - First Recall

Day 7 - Connections

Day 14 - Application

Day 21 - Mock Interview

Key Takeaways

Next Steps