Data Scientist Problem List
Reading time: ~40 min | Interview relevance: Critical | Roles: Data Scientist, Applied Scientist, Analytics Data Scientist, Product Data Scientist
You are in a Data Scientist interview at a major tech company. The interviewer slides a laptop across the table and says: "We ran an A/B test for a new feature. The p-value is 0.03, but the product manager says the effect is too small to ship. Walk me through how you would analyze this situation." If your heart just skipped a beat, this list is for you.
Data Scientist interviews are uniquely challenging because they blend statistics, coding, machine learning, and business reasoning in ways that no other role does. This list of 45 problems covers all four pillars: Statistics & Experimentation, SQL & Data Manipulation, ML Modeling, and Business Case Studies.
Data Scientist Interview Structure
| Round | Duration | What They Test | Weight |
|---|---|---|---|
| Statistics & Probability | 45-60 min | Hypothesis testing, distributions, experimental design | 25-30% |
| SQL & Coding | 45-60 min | SQL queries, pandas, data manipulation | 20-25% |
| ML / Modeling | 45-60 min | Model building, feature engineering, evaluation | 20-25% |
| Business Case / Product Sense | 45-60 min | Metric definition, problem framing, communication | 15-20% |
| Behavioral | 30-45 min | Influence without authority, stakeholder management | 10% |
:::tip The Data Scientist Secret
The best Data Scientists are not the best coders or the best statisticians. They are the best at translating vague business problems into precise analytical questions. Practice the translation, not just the execution.
:::
Section 1: Statistics & Experimentation (15 Problems)
Statistics is the backbone of data science. These problems test your ability to reason about uncertainty, design experiments, and interpret results.
Hypothesis Testing & Inference
| # | Problem | Difficulty | Time | Key Concept | Why It Matters | Company Tags |
|---|---|---|---|---|---|---|
| 1 | Design an A/B Test for a New Checkout Flow | Medium | 25 min | Sample size, power analysis, MDE | The single most important DS skill | FAANG, All |
| 2 | A/B Test Shows p=0.04 but CI Includes Practically Zero Effect. Ship or Not? | Hard | 30 min | Statistical vs. practical significance | Distinguishes senior from junior DS | Meta, Google, Airbnb |
| 3 | Multiple Comparison Problem: Testing 10 Variants Simultaneously | Medium | 20 min | Bonferroni, FDR, family-wise error rate | Real experiments test many variants | Meta, Netflix, Microsoft |
| 4 | Design a Switchback Experiment for a Marketplace | Hard | 30 min | Network effects, interference, switchback design | Standard A/B testing fails with network effects | Uber, Lyft, DoorDash, Airbnb |
| 5 | Analyze an A/B Test with Novelty Effect | Medium | 25 min | Time-varying treatment effect, holdout analysis | New features often show inflated initial effects | Meta, Google, Netflix |
Probability & Distributions
| # | Problem | Difficulty | Time | Key Concept | Why It Matters | Company Tags |
|---|---|---|---|---|---|---|
| 6 | Estimate the Probability of Server Failure Given Alert Data | Medium | 20 min | Bayes' theorem, conditional probability | Bayesian reasoning is fundamental | Google, Amazon, Meta |
| 7 | Model the Number of Customer Arrivals per Hour | Easy | 15 min | Poisson distribution, rate estimation | Distribution selection for count data | All |
| 8 | Calculate Confidence Intervals for a Proportion | Easy | 15 min | Normal approximation, Wilson interval | Basic inference that many candidates get wrong | All |
| 9 | Compare Two Methods Using Bootstrap Confidence Intervals | Medium | 25 min | Bootstrap resampling, percentile method | Non-parametric inference for complex metrics | Meta, Google, Airbnb |
| 10 | Design a Sequential Testing Procedure (Peeking Problem) | Hard | 30 min | Sequential analysis, alpha spending | Real experiments are monitored continuously | Netflix, Meta, Uber |
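Problem 8 contrasts the normal approximation with the Wilson interval. The sketch below shows one way to compute both for a proportion, using only the standard library. The function names are illustrative, not from any particular package. The example is meant to show why the normal (Wald) interval breaks down near 0 or 1 and at small sample sizes.

```python
import math
from statistics import NormalDist


def wald_interval(successes, n, confidence=0.95):
    """Normal-approximation (Wald) interval for a proportion."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p_hat = successes / n
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


def wilson_interval(successes, n, confidence=0.95):
    """Wilson score interval for a proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p_hat = successes / n
    denom = 1 + z ** 2 / n
    center = (p_hat + z ** 2 / (2 * n)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2)) / denom
    )
    return center - half_width, center + half_width


# With 0 conversions out of 20, the Wald interval collapses to a single point,
# while the Wilson interval still reports a plausible upper bound.
print(wald_interval(0, 20))
print(wilson_interval(0, 20))
```

In an interview, the point to make is that the Wald interval has zero width when the observed proportion is exactly 0 or 1, which is rarely a believable statement about uncertainty.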
Advanced Experimentation
| # | Problem | Difficulty | Time | Key Concept | Why It Matters | Company Tags |
|---|---|---|---|---|---|---|
| 11 | Estimate Treatment Effect When Random Assignment Is Not Possible | Hard | 30 min | Causal inference, propensity score matching, DiD | Observational data is more common than experiments | Meta, Google, Uber |
| 12 | Design Guardrail Metrics for an A/B Test | Medium | 20 min | Guardrails, pre-specified boundaries, SRM check | Protect against shipping harmful changes | FAANG, All |
| 13 | Variance Reduction Techniques for Faster Experiments | Hard | 30 min | CUPED, stratification, pre-experiment covariates | Faster experiments = faster iteration | Meta, Netflix, Microsoft |
| 14 | Long-Term Impact Estimation When Only Short-Term Data Exists | Hard | 30 min | Surrogate metrics, long-term holdout | Most business outcomes are long-term | Google, Meta, Netflix |
| 15 | Simpson's Paradox in Experiment Analysis | Medium | 20 min | Confounding, segmented analysis | Aggregate results can mislead | All |
:::warning Statistics Red Flags
These mistakes immediately concern interviewers:
- Confusing the p-value with the probability that the hypothesis is true
- Not checking assumptions (normality, independence) before applying tests
- Using a t-test on a ratio metric that requires the delta method or bootstrap
- Not accounting for multiple comparisons in multi-variant tests
- Ignoring practical significance and focusing only on statistical significance
:::
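The multiple-comparisons red flag (and Problem 3) can be made concrete with a small sketch. It compares a Bonferroni correction, which controls the family-wise error rate, with the Benjamini-Hochberg procedure, which controls the false discovery rate. The ten p-values are made up for illustration and do not come from a real experiment.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H0 for each test whose p-value is at most alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure controlling the FDR at alpha."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that p_(k) <= (k / m) * alpha.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank
    # Reject every hypothesis ranked at or below that k.
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            reject[idx] = True
    return reject


# Hypothetical p-values for 10 variants tested against one control.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]

naive = [p <= 0.05 for p in p_values]
print("Uncorrected rejections:", sum(naive))
print("Bonferroni rejections: ", sum(bonferroni(p_values)))
print("BH rejections:         ", sum(benjamini_hochberg(p_values)))
```

Bonferroni is the more conservative choice. Benjamini-Hochberg typically keeps more discoveries when many variants are tested, at the cost of tolerating a controlled share of false positives.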
Section 2: SQL & Data Manipulation (12 Problems)
Data Scientists live in SQL and pandas. These problems test fluency in data extraction and transformation.
SQL Problems
| # | Problem | Difficulty | Time | Key Concept | Why It Matters | Company Tags |
|---|---|---|---|---|---|---|
| 16 | Calculate Daily Active Users (DAU), Weekly Active Users (WAU), and Stickiness | Medium | 20 min | Date functions, COUNT DISTINCT, ratio computation | Core product metric computation | Meta, Google, Snap |
| 17 | Find Power Users (Users in Top 10% of Activity) | Medium | 20 min | Window functions, NTILE/PERCENT_RANK | User segmentation drives product decisions | Meta, Uber, Airbnb |
| 18 | Compute Funnel Conversion Rates with Drop-Off Analysis | Medium | 25 min | Sequential event joins, conditional aggregation | Funnel analysis is a DS bread-and-butter task | All |
| 19 | Detect Churned Users Who Reactivated | Hard | 25 min | Self-join with temporal logic, gap detection | Win-back analysis is high business value | Spotify, Netflix, Uber |
| 20 | Build a Cohort Retention Table | Hard | 30 min | Cohort join, pivot logic, date arithmetic | The canonical product analytics query | Meta, Airbnb, Spotify |
| 21 | Find the First Touchpoint That Led to Conversion | Medium | 20 min | Attribution logic, FIRST_VALUE window function | Marketing attribution analysis | Google, Meta, Airbnb |
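As a rough illustration of Problem 16, the sketch below runs a DAU / WAU / stickiness query against an in-memory SQLite database. The `events(user_id, event_date)` schema and the sample rows are hypothetical. WAU is taken here as distinct users in the trailing 7 days and stickiness as DAU / WAU, but interviewers may define either differently (DAU / MAU is also common), so confirm the definition first.

```python
import sqlite3

QUERY = """
WITH days AS (
    SELECT DISTINCT event_date AS day FROM events
),
dau AS (
    SELECT event_date AS day, COUNT(DISTINCT user_id) AS dau_users
    FROM events
    GROUP BY event_date
),
wau AS (
    -- Trailing 7-day window: the day itself plus the 6 days before it.
    SELECT d.day, COUNT(DISTINCT e.user_id) AS wau_users
    FROM days AS d
    JOIN events AS e
      ON e.event_date BETWEEN date(d.day, '-6 days') AND d.day
    GROUP BY d.day
)
SELECT dau.day,
       dau.dau_users,
       wau.wau_users,
       ROUND(1.0 * dau.dau_users / wau.wau_users, 3) AS stickiness
FROM dau
JOIN wau ON wau.day = dau.day
ORDER BY dau.day;
"""


def dau_wau_stickiness(events):
    """Load (user_id, 'YYYY-MM-DD') rows into SQLite and run the query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE events (user_id INTEGER, event_date TEXT)")
        conn.executemany("INSERT INTO events VALUES (?, ?)", events)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


# Hypothetical activity log.
sample_events = [
    (1, "2024-01-01"), (2, "2024-01-01"),
    (1, "2024-01-02"),
    (2, "2024-01-03"), (3, "2024-01-03"),
    (1, "2024-01-09"),
]

for row in dau_wau_stickiness(sample_events):
    print(row)
```

`COUNT(DISTINCT user_id)` matters in both CTEs: a user with several events on one day, or on several days in the window, must be counted once.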
Pandas Problems
| # | Problem | Difficulty | Time | Key Concept | Why It Matters | Company Tags |
|---|---|---|---|---|---|---|
| 22 | Clean and Merge Multiple Messy CSV Files | Easy | 20 min | Data cleaning, merge, type coercion | Real data is always messy | All |
| 23 | Compute Rolling Engagement Metrics with User Segmentation | Medium | 25 min | GroupBy, rolling window, multi-level aggregation | Time-series feature engineering | Meta, Uber, Airbnb |
| 24 | Build a Feature Matrix from Event-Level Data | Medium | 25 min | Pivot, aggregation, sparse features | Feature engineering from raw logs | Big Tech, Startups |
| 25 | Detect and Handle Outliers in Metric Data | Medium | 20 min | IQR, z-score, Winsorization | Outliers distort metrics and model performance | All |
| 26 | Perform Time-Series Decomposition of Revenue Data | Medium | 25 min | Trend, seasonality, residual decomposition | Understanding revenue patterns | All |
| 27 | Create Automated Summary Statistics Report | Easy | 15 min | Descriptive stats, distribution visualization | First step of any analysis | All |
Section 3: ML Modeling (10 Problems)
Data Scientist ML questions focus more on practical modeling decisions than algorithm implementation.
Model Building & Selection
| # | Problem | Difficulty | Time | Key Concept | Why It Matters | Company Tags |
|---|---|---|---|---|---|---|
| 28 | Build a Churn Prediction Model and Explain Feature Importance | Medium | 35 min | Classification, SHAP/feature importance, business action | Connects modeling to business impact | All |
| 29 | Predict Customer Lifetime Value (LTV) | Hard | 35 min | Regression, censored data, cohort-based estimation | LTV drives acquisition and retention strategy | Meta, Airbnb, Netflix, Uber |
| 30 | Build a Propensity Model for Targeted Marketing | Medium | 30 min | Binary classification, calibration, uplift modeling | Marketing optimization requires calibrated models | All |
| 31 | Forecast Daily Revenue for Next 90 Days | Medium | 30 min | Time-series forecasting, seasonality, uncertainty quantification | Revenue forecasting is the highest-visibility DS task | All |
| 32 | Detect Fraudulent Transactions in Highly Imbalanced Data | Hard | 35 min | Extreme imbalance, precision-recall tradeoff, cost-sensitive learning | Fraud detection is a classic DS problem | Stripe, PayPal, Amazon |
| 33 | Build a Recommendation System for a Content Platform | Medium | 30 min | Collaborative filtering, content-based, hybrid | Recommendations drive engagement at every platform | Netflix, Spotify, Meta |
Model Evaluation & Interpretation
| # | Problem | Difficulty | Time | Key Concept | Why It Matters | Company Tags |
|---|---|---|---|---|---|---|
| 34 | Your Model Has 95% Accuracy but Stakeholders Don't Trust It. Diagnose. | Medium | 25 min | Class imbalance, confusion matrix deep dive, calibration | Accuracy alone is misleading | All |
| 35 | Compare Two Models: One Has Better AUC, The Other Better Precision@K | Medium | 25 min | Metric selection, business context, threshold optimization | Different metrics tell different stories | All |
| 36 | Explain a Black-Box Model to a Non-Technical Stakeholder | Medium | 20 min | SHAP, partial dependence, plain language explanation | Communication is a core DS skill | All |
| 37 | Detect Data Drift in a Production Model | Medium | 25 min | PSI, KS test, feature distribution monitoring | Models degrade over time in production | FAANG, Big Tech |
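For Problem 37, the Population Stability Index (PSI) is one of the simpler drift checks to explain on a whiteboard. The sketch below computes it from pre-binned counts. The bin counts are invented for illustration, and the `eps` floor for empty bins is an assumption of this sketch, not a standard value.

```python
import math


def psi(expected_counts, actual_counts, eps=1e-6):
    """Population Stability Index between two binned distributions.

    expected_counts: per-bin counts from the reference (e.g. training) data.
    actual_counts:   per-bin counts from the current (e.g. production) data.
    """
    if len(expected_counts) != len(actual_counts):
        raise ValueError("both inputs must use the same bins")
    expected_total = sum(expected_counts)
    actual_total = sum(actual_counts)
    total = 0.0
    for e, a in zip(expected_counts, actual_counts):
        # Floor each share at eps so that empty bins do not produce log(0).
        e_share = max(e / expected_total, eps)
        a_share = max(a / actual_total, eps)
        total += (a_share - e_share) * math.log(a_share / e_share)
    return total


# Hypothetical histogram of one feature over the same 4 bins.
training_bins = [250, 250, 250, 250]
production_bins = [150, 200, 300, 350]

print(round(psi(training_bins, training_bins), 4))
print(round(psi(training_bins, production_bins), 4))
```

A commonly cited rule of thumb treats PSI below about 0.1 as stable and above about 0.25 as a significant shift, but these cutoffs are conventions rather than statistical guarantees, so they should be validated against the model's actual sensitivity.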
Section 4: Business Case Studies (8 Problems)
Business case studies test your ability to frame problems, define metrics, and connect analysis to decisions.
| # | Problem | Difficulty | Time | Key Concept | Why It Matters | Company Tags |
|---|---|---|---|---|---|---|
| 38 | A Key Metric Dropped 10% Overnight. Walk Through Your Investigation. | Medium | 25 min | Root cause analysis, segmentation, data quality checks | The most common DS on-call scenario | FAANG, All |
| 39 | Define the Success Metrics for a New Social Feature | Medium | 20 min | Metric hierarchy (north star, primary, guardrail) | Product sense is critical for product DS | Meta, Google, Snap |
| 40 | Should We Launch This Feature Based on Inconclusive A/B Test Results? | Hard | 30 min | Decision under uncertainty, business judgment, cost of wrong decision | Textbooks rarely cover inconclusive results | Meta, Google, Netflix |
| 41 | Design a Data Strategy for a New Market Entry | Medium | 25 min | Data collection, baseline establishment, success criteria | Data strategy drives business strategy | Uber, Airbnb, DoorDash |
| 42 | Evaluate Whether a Pricing Change Increased Revenue | Hard | 30 min | Price elasticity, causal inference, confounders | Pricing analysis requires causal thinking | Uber, Airbnb, Amazon |
| 43 | Prioritize Three Potential ML Projects Given Resource Constraints | Medium | 20 min | Impact estimation, feasibility assessment, ROI framework | Resource allocation is a key DS leadership skill | All |
| 44 | A Model Performs Well Offline but Poorly Online. Diagnose. | Hard | 30 min | Train-serve skew, data leakage, feedback loops | The classic production ML problem | FAANG, Big Tech |
| 45 | Design a Metric for Measuring Marketplace Health | Medium | 25 min | Two-sided metrics, supply-demand balance, leading indicators | Marketplace metrics are inherently complex | Uber, Airbnb, DoorDash |
:::tip Business Case Framework
For any business case, follow this structure:
- Clarify the problem and business context
- Define success metrics (primary + guardrails)
- Hypothesize root causes or expected outcomes
- Analyze using data (describe the analysis you would do)
- Recommend a course of action with confidence level
- Acknowledge risks and next steps
:::
4-Week Data Scientist Study Plan
| Week | Focus | Problems | Daily Load |
|---|---|---|---|
| Week 1 | Statistics & Experimentation | #1-15 | 2-3 problems/day |
| Week 2 | SQL & Data Manipulation | #16-27 | 2 problems/day |
| Week 3 | ML Modeling | #28-37 | 1-2 problems/day (deeper) |
| Week 4 | Business Cases + Review | #38-45 + review | 1 case/day + review |
Week 1: Statistics Deep Dive
Day 1: #1, #2 (A/B testing fundamentals)
Day 2: #3, #4 (multiple comparisons, switchback)
Day 3: #5, #6 (novelty effect, Bayes)
Day 4: #7, #8 (distributions, confidence intervals)
Day 5: #9, #10 (bootstrap, sequential testing)
Day 6: #11, #12 (causal inference, guardrails)
Day 7: #13, #14, #15 (variance reduction, long-term impact, Simpson's paradox)
Week 2: SQL & Pandas Sprint
Day 1: #16, #17 (DAU/WAU, power users)
Day 2: #18, #19 (funnels, churn detection)
Day 3: #20, #21 (retention, attribution)
Day 4: #22, #23 (data cleaning, rolling metrics)
Day 5: #24, #25 (feature matrix, outliers)
Day 6: #26, #27 (time-series decomposition, summary stats)
Day 7: Review all SQL problems without reference
Key Statistical Formulas to Know
Sample Size Calculation
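n = (Z_alpha/2 + Z_beta)^2 * (2 * sigma^2) / delta^2
Where:
- n = required sample size per group (two groups of equal size)
- Z_alpha/2 = 1.96 for a two-sided test at 95% confidence (alpha = 0.05)
- Z_beta = 0.84 for 80% power
- sigma^2 = variance of the metric
- delta = minimum detectable effect (MDE), in the same absolute units as the metric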
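A minimal sketch of the sample size formula, using only the standard library. The function name and the example inputs (a 10% baseline conversion rate and a 1 percentage point absolute MDE) are assumptions chosen for illustration.

```python
import math
from statistics import NormalDist


def sample_size_per_group(sigma, delta, alpha=0.05, power=0.80):
    """Per-group n for a two-sided, two-sample comparison of means.

    sigma: standard deviation of the metric
    delta: minimum detectable effect in absolute units
    """
    if delta == 0:
        raise ValueError("delta must be non-zero")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    n = (z_alpha + z_beta) ** 2 * (2 * sigma ** 2) / delta ** 2
    return math.ceil(n)


# Hypothetical conversion-rate test: baseline p = 0.10, absolute MDE = 0.01.
# For a Bernoulli metric, sigma^2 = p * (1 - p).
baseline = 0.10
sigma = math.sqrt(baseline * (1 - baseline))
print(sample_size_per_group(sigma, delta=0.01))
```

Because n scales with 1 / delta^2, halving the MDE roughly quadruples the required sample size, which is usually the key tradeoff to state out loud in an interview.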
Common Statistical Tests Cheat Sheet
| Scenario | Test | Assumptions |
|---|---|---|
| Compare two means (large n) | Z-test | Normal approximation |
| Compare two means (small n) | t-test (Welch's if variances differ) | Normality; equal variance for Student's t-test |
| Compare two proportions | Chi-squared / Z-test for proportions | Large n for normal approx |
| Compare means of 3+ groups | ANOVA | Normality, equal variance |
| Non-normal distributions | Mann-Whitney U | Independent samples |
| Paired measurements | Paired t-test | Normal differences |
| Ratio metrics | Delta method or bootstrap | Depends on method |
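To make one row of the cheat sheet concrete, here is a sketch of the pooled two-proportion Z-test, the usual choice for comparing conversion rates when both groups are large. The counts in the example are hypothetical.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided pooled Z-test for H0: p1 == p2.

    x1, x2: number of successes in control and treatment
    n1, n2: number of users in control and treatment
    Returns (z statistic, two-sided p-value).
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical experiment: 10.0% vs 11.0% conversion, 10,000 users per arm.
z, p_value = two_proportion_z_test(1000, 10_000, 1100, 10_000)
print(f"z = {z:.3f}, p = {p_value:.4f}")
```

The normal approximation is only reasonable when each arm has enough successes and failures. With very small counts, an exact test is the safer answer.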
Variance Reduction with CUPED
Y_adjusted = Y - theta * X
Where:
- Y = metric during experiment
- X = same metric pre-experiment (covariate)
- theta = Cov(Y, X) / Var(X)
Variance after adjustment: Var(Y_adjusted) = Var(Y) * (1 - Corr(Y, X)^2), so the variance is reduced by a fraction of Corr(Y, X)^2
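The CUPED adjustment can be checked numerically with a short simulation. The sketch below uses synthetic data, with no real experiment behind it. It centers X at its mean, a common variant that leaves the mean of Y unchanged, and the function names are illustrative. With theta estimated as Cov(Y, X) / Var(X), the adjusted variance equals Var(Y) * (1 - Corr(Y, X)^2).

```python
import random
from statistics import mean, variance


def sample_cov(a, b):
    """Unbiased sample covariance of two equal-length sequences."""
    a_bar, b_bar = mean(a), mean(b)
    return sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b)) / (len(a) - 1)


def cuped_adjust(y, x):
    """Return (adjusted metric, theta), with theta = Cov(Y, X) / Var(X).

    X is centered at its mean so the adjustment leaves the mean of Y unchanged.
    """
    theta = sample_cov(y, x) / variance(x)
    x_bar = mean(x)
    return [yi - theta * (xi - x_bar) for yi, xi in zip(y, x)], theta


# Synthetic data: the in-experiment metric y is correlated with its
# pre-experiment value x.
random.seed(0)
x = [random.gauss(10, 2) for _ in range(2000)]
y = [0.8 * xi + random.gauss(0, 1) for xi in x]

y_adj, theta = cuped_adjust(y, x)
rho_squared = sample_cov(y, x) ** 2 / (variance(x) * variance(y))

print(f"theta                  = {theta:.3f}")
print(f"Var(Y)                 = {variance(y):.3f}")
print(f"Var(Y_adjusted)        = {variance(y_adj):.3f}")
print(f"Var(Y) * (1 - rho^2)   = {variance(y) * (1 - rho_squared):.3f}")
```

In a real experiment, theta is usually estimated once on the pooled treatment and control data and then applied to both arms, so that the adjustment cannot bias the difference in means.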
Problem Deep Dives
Problem 2: Statistical vs. Practical Significance
Scenario: An A/B test shows p=0.04 (significant at alpha=0.05). The 95% CI for the effect on revenue per user is [$0.002, $0.15]. The product change requires 2 engineers for 3 months.
Analysis Framework:
- The test is statistically significant, but the lower bound of the CI ($0.002/user) is tiny
- Calculate the minimum expected annual impact: 0.002 * DAU * 365 (this assumes the metric is revenue per active user per day)
- Compare against engineering cost (2 engineers * 3 months * salary)
- Consider opportunity cost: what else could those engineers build?
- Decision: If minimum impact < cost, don't ship despite significance
Key Insight: Statistical significance means the observed effect would be unlikely if the true effect were zero. It does not mean the effect is large enough to matter.
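The cost comparison in the framework above is simple arithmetic, sketched below. Only the $0.002 lower bound comes from the scenario. The DAU and the monthly cost per engineer are hypothetical placeholders, and the sketch assumes the effect is measured per active user per day.

```python
def annual_impact(effect_per_user_per_day, dau, days=365):
    """Annual revenue impact for a given per-user daily effect."""
    return effect_per_user_per_day * dau * days


def engineering_cost(engineers, months, monthly_cost_per_engineer):
    """Total build cost for the project."""
    return engineers * months * monthly_cost_per_engineer


def break_even_dau(effect_per_user_per_day, cost, days=365):
    """DAU at which one year of impact exactly covers the cost."""
    return cost / (effect_per_user_per_day * days)


CI_LOWER = 0.002                    # lower bound of the CI, from the scenario
HYPOTHETICAL_DAU = 100_000          # assumption for illustration only
HYPOTHETICAL_MONTHLY_COST = 20_000  # assumed fully loaded cost per engineer

cost = engineering_cost(2, 3, HYPOTHETICAL_MONTHLY_COST)
min_impact = annual_impact(CI_LOWER, HYPOTHETICAL_DAU)

print(f"Engineering cost:      ${cost:,.0f}")
print(f"Minimum annual impact: ${min_impact:,.0f}")
print(f"Break-even DAU:        {break_even_dau(CI_LOWER, cost):,.0f}")
```

Under these made-up inputs the worst-case impact does not cover the cost, but the conclusion flips for a product with a larger user base, which is why the answer should always be stated in terms of the company's own numbers.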
Problem 11: Causal Inference Without Random Assignment
Scenario: You want to measure the impact of a new onboarding flow, but it was rolled out to all new users in one region. You cannot run a randomized experiment.
Approaches:
- Difference-in-Differences (DiD): Compare treated region before/after vs. control region before/after. Requires parallel trends assumption.
- Propensity Score Matching: Match treated users to similar untreated users on observables. Requires no unmeasured confounders.
- Synthetic Control: Create a weighted combination of control regions that matches the treated region pre-intervention.
- Regression Discontinuity: If there is a sharp cutoff (e.g., date of rollout), compare users just before/after the cutoff.
When to use each:
| Method | Best When | Key Assumption |
|---|---|---|
| DiD | Regional or temporal rollout | Parallel trends |
| PSM | Individual-level treatment variation | No unmeasured confounders |
| Synthetic Control | Few treated units (regions, countries) | Pre-treatment fit |
| RDD | Sharp cutoff exists | Continuity around cutoff |
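Of the four methods, Difference-in-Differences is the one most often worked through by hand. The sketch below computes the basic 2x2 DiD estimate from group means. The weekly onboarding completion rates are invented for illustration, and the estimate is only meaningful if the parallel trends assumption holds.

```python
from statistics import mean


def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """Basic 2x2 difference-in-differences estimate.

    (change in the treated group) minus (change in the control group).
    """
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Hypothetical weekly onboarding completion rates.
treated_pre = [0.40, 0.42, 0.41]    # treated region, before the rollout
treated_post = [0.50, 0.52, 0.51]   # treated region, after the rollout
control_pre = [0.38, 0.40, 0.39]    # control region, same pre-period weeks
control_post = [0.41, 0.43, 0.42]   # control region, same post-period weeks

effect = did_estimate(treated_pre, treated_post, control_pre, control_post)
print(f"Estimated treatment effect: {effect:.3f}")
```

The control region's change absorbs whatever moved both regions at the same time, such as seasonality or a marketing push. In practice this is usually fitted as a regression with an interaction term so that standard errors and covariates can be included.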
DS-Specific Patterns to Master
| Pattern | Where It Appears | Problems |
|---|---|---|
| A/B test design and analysis | Nearly every DS interview | #1-5, #10, #12, #13, #40 |
| Metric definition | Product DS roles | #39, #45 |
| Causal inference | Quasi-experiments | #4, #11, #14, #42 |
| Cohort analysis | Retention, LTV | #20, #29 |
| Funnel analysis | Product optimization | #18, #38 |
| Time-series reasoning | Forecasting, monitoring | #26, #31, #38 |
| SQL window functions | Every DS SQL round | #16, #17, #19, #20, #21 |
| Model interpretability | Stakeholder communication | #34, #36 |
Difficulty Distribution
| Difficulty | Problems | Count |
|---|---|---|
| Easy | #7, #8, #22, #27 | 4 |
| Medium | #1, #3, #5, #6, #9, #12, #15, #16, #17, #18, #21, #23, #24, #25, #26, #28, #30, #31, #33, #34, #35, #36, #37, #38, #39, #41, #43, #45 | 28 |
| Hard | #2, #4, #10, #11, #13, #14, #19, #20, #29, #32, #40, #42, #44 | 13 |
Next Steps
After completing the Data Scientist problem list, continue with:
- Easy Tier if you need more practice on fundamentals
- Meta-Style Problems since Meta heavily hires product Data Scientists
- Google-Style Problems for research-oriented DS roles
- Section 15: Role-Specific Prep for the full Data Scientist preparation path
