Data Scientist: 6-Week Prep Path
Reading time: ~40 min | Interview relevance: Critical | Roles: Data Scientist, Applied Scientist, Product Data Scientist, Analytics Data Scientist
The Real Interview Moment
The interviewer slides a laptop across the table. On the screen is a dataset from an e-commerce company. "We launched a new checkout flow two weeks ago. The product team says it increased revenue by 12%. The engineering team says it increased page load time by 300ms. The CEO wants to know if we should keep it. You have 45 minutes. Go."
This is the Data Scientist interview distilled to its essence. It is not about building the most sophisticated model or writing the most elegant code. It is about translating messy business questions into rigorous statistical analyses, and then communicating findings in a way that drives decisions.
The Data Scientist interview is unique because it tests something that no other AI/ML role tests as heavily: your ability to think critically about data and communicate insights to non-technical stakeholders. You need statistics, SQL, ML, and business sense -- all working together.
This 6-week plan will prepare you for every dimension.
Role Overview
What Data Scientists Do
Data Scientists extract insights from data to drive business decisions. They:
- Design and analyze A/B tests and experiments
- Build predictive models for business outcomes
- Create dashboards and reports for stakeholders
- Define and monitor key business metrics
- Perform deep-dive analyses on product and user behavior
- Collaborate with product, engineering, and leadership teams
Interview Format (Typical)
| Round | Duration | Focus |
|---|---|---|
| Phone Screen | 45-60 min | SQL + probability/statistics basics |
| SQL Round | 45-60 min | Complex queries, window functions, optimization |
| Statistics / Probability | 45-60 min | Hypothesis testing, A/B testing, distributions |
| Case Study / Product Sense | 45-60 min | Business metric design, product analysis |
| ML / Modeling | 45-60 min | Feature engineering, model selection, evaluation |
| Behavioral / Presentation | 45-60 min | Communication, stakeholder management |
Focus Area Allocation
Breakdown by Skill
Statistics and Probability (25% -- ~35 hours total)
- Probability: Bayes theorem, conditional probability, common distributions
- Hypothesis testing: t-tests, chi-squared, ANOVA, multiple testing correction
- A/B testing: power analysis, sample size calculation, sequential testing
- Bayesian thinking: priors, posteriors, Bayesian A/B testing
- Causal inference: difference-in-differences, instrumental variables, propensity scores
SQL and Data Manipulation (25% -- ~35 hours total)
- Complex joins, subqueries, CTEs
- Window functions: ROW_NUMBER, RANK, LAG, LEAD, running aggregates
- Query optimization: execution plans, indexing strategies
- pandas: groupby, merge, pivot, time series operations
ML Fundamentals (20% -- ~28 hours total)
- Supervised learning: regression, classification, ensemble methods
- Feature engineering: encoding, scaling, feature selection
- Model evaluation: metrics, cross-validation, overfitting diagnosis
- Practical ML: when to use what model and why
Business Case Studies (15% -- ~22 hours total)
- Metric design: defining success metrics for a product
- Root cause analysis: diagnosing metric changes
- Product sense: understanding user behavior and business logic
- Communication: presenting findings to non-technical audiences
Behavioral (15% -- ~22 hours total)
- Stakeholder management stories
- Project impact quantification
- Handling ambiguity and conflicting priorities
- Communication of technical concepts to non-technical people
6-Week Schedule Overview
Week 1: Foundations -- Statistics and SQL
Goal: Refresh statistical foundations and build SQL fluency.
Daily time: 3 hours (weekdays), 5 hours (weekends)
Monday -- Probability Fundamentals
| Time | Activity | Details |
|---|---|---|
| Morning (45 min) | SQL practice | 2 easy/medium SQL problems (HackerRank or LeetCode) |
| Lunch (20 min) | Read | ML Fundamentals probability section |
| Evening (90 min) | Study | Probability rules, conditional probability, Bayes theorem, independence, common distributions (normal, binomial, Poisson, exponential) |
| Night (15 min) | Review | Solve 3 probability brain teasers |
Probability problems to practice:
- Given a fair coin, what is the expected number of flips to get two heads in a row?
- A diagnostic test has 95% sensitivity and 99% specificity. If 1% of the population has the disease, what is P(disease | positive test)?
- You roll two dice. What is P(sum = 7)?
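The three practice problems above can be checked numerically. The following is a minimal sketch using only the Python standard library; the function name `flips_until_two_heads` and the simulation size are illustrative choices, not part of the original problem set.

```python
from fractions import Fraction
from itertools import product
import random

# Problem 1: expected flips until two heads in a row.
# Let E0 = expected flips from scratch, E1 = expected flips after one head.
#   E0 = 1 + 0.5 * E1 + 0.5 * E0
#   E1 = 1 + 0.5 * 0  + 0.5 * E0
# Solving the two equations gives E0 = 6. A seeded simulation is a sanity check.
def flips_until_two_heads(rng):
    flips, streak = 0, 0
    while streak < 2:
        flips += 1
        streak = streak + 1 if rng.random() < 0.5 else 0
    return flips

rng = random.Random(0)
n_trials = 100_000
simulated_mean = sum(flips_until_two_heads(rng) for _ in range(n_trials)) / n_trials

# Problem 2: P(disease | positive) via Bayes theorem.
sensitivity = Fraction(95, 100)   # P(positive | disease)
specificity = Fraction(99, 100)   # P(negative | no disease)
prevalence = Fraction(1, 100)     # P(disease), the prior
p_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
p_disease_given_positive = sensitivity * prevalence / p_positive

# Problem 3: P(sum = 7) by enumerating all 36 equally likely outcomes.
outcomes = list(product(range(1, 7), repeat=2))
p_sum_7 = Fraction(sum(1 for a, b in outcomes if a + b == 7), len(outcomes))

print(f"Simulated expected flips: {simulated_mean:.2f} (analytic answer: 6)")
print(f"P(disease | positive) = {p_disease_given_positive} ~ {float(p_disease_given_positive):.3f}")
print(f"P(sum = 7) = {p_sum_7}")
```

Note how the diagnostic test answer comes out just under 50% despite the high sensitivity and specificity, because the 1% base rate dominates.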
:::tip Bayes Theorem is Your Best Friend
Data Scientist interviews love Bayes theorem problems. Memorize the formula and practice until it is second nature:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

More importantly, develop the intuition: start with the base rate (prior), update with evidence (likelihood), and normalize.
:::
Tuesday -- Statistical Distributions and Estimation
| Time | Activity | Details |
|---|---|---|
| Morning (45 min) | SQL practice | 2 medium SQL problems |
| Lunch (20 min) | Read | Central Limit Theorem and its applications |
| Evening (90 min) | Study | CLT, confidence intervals, maximum likelihood estimation, method of moments |
| Night (15 min) | Review | Calculate a 95% confidence interval by hand |
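For the "by hand" confidence interval exercise, the arithmetic can be sketched as below. The sample values are hypothetical, and the critical value 2.262 is the t-table entry for 9 degrees of freedom at the 97.5th percentile, so it only applies to a sample of size 10.

```python
import math
import statistics

# Hypothetical sample of 10 measurements (illustrative data only).
sample = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7, 12.5, 12.1]

n = len(sample)
mean = statistics.fmean(sample)
s = statistics.stdev(sample)          # sample standard deviation (n - 1 denominator)
se = s / math.sqrt(n)                 # standard error of the mean

# t critical value for a 95% two-sided interval with df = n - 1 = 9,
# looked up from a t-table. It must be changed if n changes.
t_crit = 2.262

margin = t_crit * se
ci = (mean - margin, mean + margin)

print(f"mean = {mean:.3f}, s = {s:.4f}, SE = {se:.4f}")
print(f"95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

With a large sample the t critical value approaches the familiar z value of 1.96, which is why the two are often used interchangeably in quick interview calculations.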
Wednesday -- Hypothesis Testing
| Time | Activity | Details |
|---|---|---|
| Morning (45 min) | SQL practice | 2 medium SQL problems (JOINs, GROUP BY) |
| Lunch (20 min) | Read | Type I and Type II errors |
| Evening (90 min) | Study | Null and alternative hypotheses, p-values, significance level, power, t-tests (one-sample, two-sample, paired) |
| Night (15 min) | Review | Work through a complete hypothesis test example |
:::warning Understand p-values Correctly
A p-value is NOT the probability that the null hypothesis is true. It is the probability of observing data at least as extreme as what you observed, assuming the null hypothesis is true. This is a very common interview question and many candidates get it wrong.
:::
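The definition in the warning above can be made concrete with a permutation test, which literally counts how often data "at least as extreme" shows up when the null hypothesis (no difference between groups) is forced to be true. This is a sketch using the standard library; the function name `permutation_p_value` and the two small samples are hypothetical, and a t-test would be the more usual choice in practice.

```python
import random
import statistics

def permutation_p_value(a, b, n_permutations=10_000, seed=0):
    """Two-sided permutation p-value for a difference in means."""
    rng = random.Random(seed)
    observed = abs(statistics.fmean(a) - statistics.fmean(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_permutations):
        # Shuffling the pooled data simulates the null: group labels do not matter.
        rng.shuffle(pooled)
        diff = abs(statistics.fmean(pooled[:len(a)]) - statistics.fmean(pooled[len(a):]))
        if diff >= observed:
            extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)

# Hypothetical session lengths (minutes) for control and treatment.
control = [10, 11, 9, 10, 12, 11, 10, 9]
treatment = [14, 15, 13, 16, 14, 15, 13, 14]

p_value = permutation_p_value(control, treatment)
print(f"p-value = {p_value:.4f}")
```

The result is the share of label shufflings that produce a gap at least as large as the observed one. It says nothing directly about the probability that the null hypothesis itself is true.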
Thursday -- Advanced Hypothesis Testing
| Time | Activity | Details |
|---|---|---|
| Morning (45 min) | SQL practice | 2 medium SQL problems (subqueries) |
| Lunch (20 min) | Read | Chi-squared tests and ANOVA |
| Evening (90 min) | Study | Chi-squared test, ANOVA, non-parametric tests (Mann-Whitney, Wilcoxon), multiple testing correction (Bonferroni, Benjamini-Hochberg) |
| Night (15 min) | Review | Decision tree for choosing the right statistical test |
Friday -- SQL: JOINs and Aggregations
| Time | Activity | Details |
|---|---|---|
| Morning (45 min) | SQL practice | 3 medium SQL problems |
| Lunch (20 min) | Read | Coding Interviews SQL section |
| Evening (90 min) | Study | INNER, LEFT, RIGHT, FULL OUTER, CROSS JOINs. GROUP BY, HAVING, DISTINCT, CASE WHEN |
| Night (15 min) | Review | Write a query to find the top 3 customers by revenue per month |
Saturday -- SQL: Window Functions
| Time | Activity | Details |
|---|---|---|
| Morning (2 hrs) | Study | Window functions: ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD, SUM OVER, AVG OVER, NTILE |
| Afternoon (2 hrs) | Practice | Solve 8 window function problems |
| Evening (1 hr) | Review | Write queries for running totals, moving averages, and year-over-year comparisons |
:::tip Window Functions Are the SQL Interview Differentiator
Basic SQL (JOINs, GROUP BY) is table stakes. Window functions separate strong candidates from average ones. Practice these patterns:
- Running totals: `SUM(revenue) OVER (ORDER BY date)`
- Month-over-month growth: `LAG(metric, 1) OVER (ORDER BY month)`
- Ranking within groups: `ROW_NUMBER() OVER (PARTITION BY category ORDER BY sales DESC)`
- Moving averages: `AVG(value) OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)`
:::
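The four patterns above can be tried end to end with Python's built-in `sqlite3` module. This is a sketch under a few assumptions: the `daily_sales` table and its rows are made up for illustration, the moving average uses a 2-row window so the tiny dataset shows an effect, and the bundled SQLite must be version 3.25 or newer for window function support.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (day TEXT, category TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO daily_sales VALUES (?, ?, ?)",
    [
        ("2024-01-01", "books", 100.0),
        ("2024-01-02", "books", 150.0),
        ("2024-01-03", "books", 50.0),
        ("2024-01-01", "toys", 200.0),
        ("2024-01-02", "toys", 100.0),
        ("2024-01-03", "toys", 300.0),
    ],
)

query = """
SELECT
    day,
    category,
    revenue,
    -- Running total within each category
    SUM(revenue) OVER (PARTITION BY category ORDER BY day) AS running_total,
    -- Previous day's revenue, the building block for period-over-period growth
    LAG(revenue, 1) OVER (PARTITION BY category ORDER BY day) AS prev_revenue,
    -- Rank of each day within its category by revenue
    ROW_NUMBER() OVER (PARTITION BY category ORDER BY revenue DESC) AS revenue_rank,
    -- 2-row moving average (current row plus the one before it)
    AVG(revenue) OVER (
        PARTITION BY category ORDER BY day
        ROWS BETWEEN 1 PRECEDING AND CURRENT ROW
    ) AS moving_avg_2d
FROM daily_sales
ORDER BY category, day
"""

results = conn.execute(query).fetchall()
for row in results:
    print(row)
```

`LAG` returns NULL (Python `None`) for the first row of each partition, which is worth mentioning aloud in an interview before dividing by it.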
Sunday -- Week 1 Review
| Time | Activity | Details |
|---|---|---|
| Morning (2 hrs) | Review | Redo 5 hardest SQL problems from the week |
| Afternoon (2 hrs) | Study | Review all statistical concepts; create a cheat sheet |
| Evening (1 hr) | Plan | Update resume per Resume and Portfolio |
:::note Week 1 Milestone Checkpoint
- Solve Bayes theorem problems in under 3 minutes
- Choose the correct hypothesis test for a given scenario
- Write SQL with window functions confidently
- Explain CLT, confidence intervals, and p-values accurately
- Calculate sample statistics and construct confidence intervals by hand
- Write a query with 3+ JOINs and window functions
:::
Week 2: Foundations -- A/B Testing and pandas
Goal: Master A/B testing methodology and data manipulation with pandas and SQL.
Daily time: 3 hours (weekdays), 5 hours (weekends)
Monday -- A/B Testing Fundamentals
| Time | Activity | Details |
|---|---|---|
| Morning (45 min) | SQL practice | 2 medium SQL problems |
| Lunch (20 min) | Read | A/B testing at tech companies |
| Evening (90 min) | Study | A/B test design: control/treatment, randomization, sample size calculation, power analysis, guardrail metrics |
| Night (15 min) | Review | Calculate required sample size for an A/B test with specific parameters |
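The sample size exercise can be sketched with a common normal-approximation formula for comparing two proportions: n per group = (z_{1-alpha/2} + z_{power})^2 * (p1(1-p1) + p2(1-p2)) / (p2 - p1)^2. The function name `sample_size_per_group` and the 10% to 12% conversion scenario are assumptions for illustration; real experimentation platforms may use slightly different variants.

```python
import math
from statistics import NormalDist

def sample_size_per_group(p_control, p_treatment, alpha=0.05, power=0.80):
    """Approximate users needed per arm for a two-sided two-proportion test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    variance_sum = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_treatment - p_control
    return math.ceil((z_alpha + z_power) ** 2 * variance_sum / effect ** 2)

# Hypothetical scenario: baseline conversion 10%, hoping to detect a lift to 12%.
n_needed = sample_size_per_group(0.10, 0.12)
print(f"Users needed per group: {n_needed}")

# Halving the detectable effect roughly quadruples the required sample.
n_smaller_effect = sample_size_per_group(0.10, 0.11)
print(f"Users needed per group for a 1-point lift: {n_smaller_effect}")
```

The inverse-square relationship between effect size and sample size is the point interviewers usually want to hear.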
Tuesday -- A/B Testing: Advanced Topics
| Time | Activity | Details |
|---|---|---|
| Morning (45 min) | SQL practice | 2 medium/hard SQL problems |
| Lunch (20 min) | Read | Common A/B testing pitfalls |
| Evening (90 min) | Study | Network effects, novelty/primacy effects, multiple testing, peeking problem, sequential testing, Bayesian A/B testing |
| Night (15 min) | Review | List 5 reasons an A/B test result might be invalid |
:::danger A/B Testing Pitfalls That Fail Candidates
- Peeking: Repeatedly checking results and stopping as soon as they look significant, before reaching the required sample size, inflates the false positive rate
- Simpson's Paradox: Overall results can contradict segment-level results
- Survivorship bias: Only analyzing users who completed the flow
- Interference: Users in treatment affecting control users (network effects)
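- Not accounting for multiple comparisons: Testing 20 independent metrics at alpha=0.05 yields about 1 false positive on average, and roughly a 64% chance of at least one, even when no metric truly moved
:::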
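The multiple comparisons pitfall is easy to quantify. The sketch below computes the chance of at least one false positive across independent tests and applies the Bonferroni and Benjamini-Hochberg corrections named earlier in this plan. The function names and the five p-values are hypothetical.

```python
def family_wise_error_rate(n_tests, alpha=0.05):
    """P(at least one false positive) across independent tests when all nulls are true."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni_reject(p_values, alpha=0.05):
    """Reject only p-values below alpha divided by the number of tests."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]

def benjamini_hochberg_reject(p_values, alpha=0.05):
    """Control the false discovery rate with the step-up BH procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose p-value is at or below (k / m) * alpha.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_k:
            reject[idx] = True
    return reject

fwer_20 = family_wise_error_rate(20)
print(f"Expected false positives across 20 tests: {20 * 0.05:.1f}")
print(f"P(at least one false positive): {fwer_20:.3f}")

# Hypothetical p-values from five metrics in one experiment.
p_values = [0.001, 0.012, 0.025, 0.038, 0.20]
bonferroni = bonferroni_reject(p_values)
bh = benjamini_hochberg_reject(p_values)
print("Bonferroni:", bonferroni)
print("Benjamini-Hochberg:", bh)
```

Bonferroni is the more conservative of the two, so it rejects fewer hypotheses here; Benjamini-Hochberg trades a controlled share of false discoveries for more power.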
Wednesday -- A/B Testing Case Studies
| Time | Activity | Details |
|---|---|---|
| Morning (45 min) | SQL practice | 2 medium SQL problems |
| Lunch (20 min) | Read | Real A/B test case studies from tech companies |
| Evening (90 min) | Practice | Solve 3 A/B testing case studies: design the test, choose metrics, analyze results, make a recommendation |
| Night (15 min) | Review | Practice explaining your reasoning aloud |
Thursday -- pandas Fundamentals
| Time | Activity | Details |
|---|---|---|
| Morning (45 min) | SQL practice | 2 medium SQL problems |
| Lunch (20 min) | Read | pandas vs SQL comparison |
| Evening (90 min) | Study | pandas: DataFrames, Series, indexing, filtering, groupby, merge, concat, pivot_table |
| Night (15 min) | Practice | Replicate a SQL query in pandas |
Friday -- pandas Advanced and EDA
| Time | Activity | Details |
|---|---|---|
| Morning (45 min) | SQL practice | 2 medium/hard SQL problems |
| Lunch (20 min) | Read | EDA best practices |
| Evening (90 min) | Study | pandas: apply, map, time series resampling, string methods. Matplotlib/seaborn for quick visualizations |
| Night (15 min) | Practice | Perform EDA on a sample dataset (distributions, correlations, missing values) |
Saturday -- End-to-End Analysis Practice
| Time | Activity | Details |
|---|---|---|
| Morning (2.5 hrs) | Practice | Given a dataset, perform complete analysis: EDA, hypothesis formulation, statistical testing, visualization, conclusion |
| Afternoon (1.5 hrs) | Study | Common data patterns: seasonality, trends, cohort effects, funnel analysis |
| Evening (1 hr) | Review | Practice presenting your analysis in 10 minutes |
Sunday -- Week 2 Review
| Time | Activity | Details |
|---|---|---|
| Morning (2 hrs) | Review | Redo A/B testing problems; practice calculations |
| Afternoon (2 hrs) | SQL | Solve 5 hard SQL problems |
| Evening (1 hr) | Mock | First practice: explain an analysis to a non-technical partner |
:::note Week 2 Milestone Checkpoint
- Design a complete A/B test with guardrail metrics and sample size calculation
- Identify 5+ A/B testing pitfalls with mitigation strategies
- Manipulate data fluently in both SQL and pandas
- Perform end-to-end exploratory data analysis
- Present statistical findings clearly to a non-technical audience
- Solve hard SQL problems involving self-joins, CTEs, and window functions
:::
Week 3: Core Skills -- ML Fundamentals and Feature Engineering
Goal: Master practical ML for data science: model selection, feature engineering, and evaluation.
Daily time: 3.5 hours (weekdays), 5 hours (weekends)
Monday -- Supervised Learning: Regression
| Time | Activity | Details |
|---|---|---|
| Morning (45 min) | SQL practice | 2 medium/hard SQL problems |
| Lunch (20 min) | Read | ML Fundamentals regression section |
| Evening (120 min) | Study | Linear regression, polynomial regression, regularization (Ridge, Lasso, Elastic Net), assumptions, diagnostics |
| Night (15 min) | Review | List the assumptions of linear regression and how to check them |
Tuesday -- Supervised Learning: Classification
| Time | Activity | Details |
|---|---|---|
| Morning (45 min) | SQL practice | 2 medium SQL problems |
| Lunch (20 min) | Read | Classification metrics comparison |
| Evening (120 min) | Study | Logistic regression, decision trees, random forests, gradient boosting, SVM. When to use what |
| Night (15 min) | Review | Create a model selection decision tree |
Wednesday -- Model Evaluation Deep Dive
| Time | Activity | Details |
|---|---|---|
| Morning (45 min) | SQL practice | 2 medium SQL problems |
| Lunch (20 min) | Read | Precision-recall trade-offs |
| Evening (120 min) | Study | Accuracy, precision, recall, F1, AUC-ROC, AUC-PR, log loss, calibration, cross-validation strategies |
| Night (15 min) | Review | When is accuracy a misleading metric? (imbalanced classes) |
:::tip The Metric Selection Question
Data Scientist interviews love asking: "Which metric would you use and why?" The answer is rarely just "accuracy." Consider:
- Precision over recall: When false positives are costly (a spam filter that sends an important email to the spam folder)
- Recall over precision: When false negatives are costly (disease screening)
- AUC-ROC: When you need a threshold-independent measure
- AUC-PR: When you have heavily imbalanced data
- Business metric: When you can directly tie model performance to dollars
:::
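A small worked example shows why accuracy alone misleads on imbalanced data. The confusion-matrix counts below are hypothetical (1,000 transactions, 10 of them fraudulent), and `classification_metrics` is an illustrative helper rather than a library function.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute common metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Baseline that predicts "not fraud" for every transaction.
always_negative = classification_metrics(tp=0, fp=0, fn=10, tn=990)

# A model that catches 8 of the 10 fraud cases at the cost of 12 false alarms.
model = classification_metrics(tp=8, fp=12, fn=2, tn=978)

for name, metrics in [("always negative", always_negative), ("model", model)]:
    summary = ", ".join(f"{k}={v:.3f}" for k, v in metrics.items())
    print(f"{name}: {summary}")
```

The useless baseline has the higher accuracy, while recall and F1 reveal that it never catches a single fraud case.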
Thursday -- Feature Engineering
| Time | Activity | Details |
|---|---|---|
| Morning (45 min) | SQL practice | 2 medium SQL problems |
| Lunch (20 min) | Read | Feature engineering best practices |
| Evening (120 min) | Study | Encoding categorical variables, handling missing data, feature scaling, feature selection (filter, wrapper, embedded methods), feature importance |
| Night (15 min) | Practice | Given a raw dataset description, list 10 features you would engineer |
Friday -- Practical ML: End-to-End Pipeline
| Time | Activity | Details |
|---|---|---|
| Morning (45 min) | SQL practice | 2 medium SQL problems |
| Lunch (20 min) | Read | scikit-learn pipeline patterns |
| Evening (120 min) | Practice | Build an end-to-end ML pipeline: data cleaning, feature engineering, model training, hyperparameter tuning, evaluation |
| Night (15 min) | Review | Identify potential data leakage in your pipeline |
Saturday -- Unsupervised Learning and Dimensionality Reduction
| Time | Activity | Details |
|---|---|---|
| Morning (2 hrs) | Study | K-means, hierarchical clustering, DBSCAN, PCA, t-SNE, UMAP |
| Afternoon (2 hrs) | Practice | Apply clustering to a customer segmentation problem |
| Evening (1 hr) | Mock | First ML mock: given a problem, propose a modeling approach (30 min) |
Sunday -- Week 3 Review
| Time | Activity | Details |
|---|---|---|
| Morning (2 hrs) | Review | Revisit all ML concepts; create a one-page cheat sheet |
| Afternoon (2 hrs) | Practice | Solve 5 "what model would you use?" scenario questions |
| Evening (1 hr) | Behavioral | Draft 3 STAR stories about data science projects |
:::note Week 3 Milestone Checkpoint
- Select the right model for a given problem with justification
- Explain regularization (L1, L2) and when to use each
- Evaluate models using appropriate metrics for imbalanced data
- Engineer features from raw data descriptions
- Build an end-to-end ML pipeline without data leakage
- Apply and interpret clustering results
:::
Week 4: Core Skills -- Business Cases and Product Sense
Goal: Master business case study interviews and product metric design.
Daily time: 3.5 hours (weekdays), 5 hours (weekends)
Monday -- Metric Design Frameworks
| Time | Activity | Details |
|---|---|---|
| Morning (45 min) | SQL practice | 2 medium/hard SQL problems |
| Lunch (20 min) | Read | Product metric frameworks (HEART, AARRR, North Star) |
| Evening (120 min) | Study | How to define success metrics, counter-metrics, guardrail metrics. HEART framework (Happiness, Engagement, Adoption, Retention, Task success) |
| Night (15 min) | Practice | Define metrics for 3 different products |
Tuesday -- Root Cause Analysis
| Time | Activity | Details |
|---|---|---|
| Morning (45 min) | SQL practice | 2 medium/hard SQL problems |
| Lunch (20 min) | Read | Root cause analysis frameworks |
| Evening (120 min) | Practice | Solve 3 root cause analysis scenarios: "Daily active users dropped 10% week-over-week. What happened?" |
| Night (15 min) | Review | Develop a systematic debugging checklist for metric drops |
Wednesday -- Case Study Practice: E-Commerce
| Time | Activity | Details |
|---|---|---|
| Morning (45 min) | SQL practice | 2 medium SQL problems |
| Lunch (20 min) | Read | E-commerce metrics primer |
| Evening (120 min) | Practice | Case study: "Design the metrics for a new marketplace feature. How would you measure success? What experiment would you run?" |
| Night (15 min) | Review | Practice presenting your case study answer in 15 minutes |
Thursday -- Case Study Practice: Social Media
| Time | Activity | Details |
|---|---|---|
| Morning (45 min) | SQL practice | 2 medium SQL problems |
| Lunch (20 min) | Read | Social media engagement metrics |
| Evening (120 min) | Practice | Case study: "Instagram engagement is down among users aged 18-24. Diagnose the problem and propose solutions." |
| Night (15 min) | Review | Identify the data you would need to support your analysis |
Friday -- Causal Inference
| Time | Activity | Details |
|---|---|---|
| Morning (45 min) | SQL practice | 2 medium SQL problems |
| Lunch (20 min) | Read | Causal inference overview |
| Evening (120 min) | Study | Observational studies vs experiments, confounding variables, difference-in-differences, propensity score matching, instrumental variables, regression discontinuity |
| Night (15 min) | Review | Explain when you cannot run an A/B test and what alternatives exist |
:::warning Not Everything Can Be A/B Tested
Interviewers will test whether you know when A/B testing is inappropriate and what to do instead:
- Ethical constraints: Cannot randomly deny a safety feature
- Network effects: Users influence each other
- Long-term effects: Cannot wait months for results
- Rare events: Not enough samples for statistical power
- No randomization possible: Historical policy changes

Alternatives: difference-in-differences, propensity score matching, instrumental variables, regression discontinuity, interrupted time series.
:::
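Of the alternatives listed, difference-in-differences is the one most often worked through on a whiteboard. The sketch below shows only the core arithmetic with hypothetical conversion rates; it assumes the parallel trends condition (the treated group would have moved like the control group without the change), which a real analysis would need to defend.

```python
def difference_in_differences(treated_pre, treated_post, control_pre, control_post):
    """Estimate the treatment effect as the treated change minus the control change."""
    treated_change = treated_post - treated_pre
    control_change = control_post - control_pre
    return treated_change - control_change

# Hypothetical conversion rates before and after a policy change that
# rolled out in one region (treated) but not in another (control).
effect = difference_in_differences(
    treated_pre=0.10, treated_post=0.13,
    control_pre=0.10, control_post=0.11,
)
print(f"Naive before/after change in treated region: {0.13 - 0.10:.3f}")
print(f"Difference-in-differences estimate: {effect:.3f}")
```

The naive before/after comparison attributes the full 3-point lift to the change, while subtracting the control region's 1-point drift leaves an estimated 2-point effect.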
Saturday -- Business Presentation Practice
| Time | Activity | Details |
|---|---|---|
| Morning (2.5 hrs) | Practice | Complete case study: analyze a dataset, formulate insights, create a 3-slide summary, present findings |
| Afternoon (1.5 hrs) | Study | Time series basics: trends, seasonality, decomposition, forecasting |
| Evening (1 hr) | Mock | Case study mock: root cause analysis scenario (30 min) |
Sunday -- Week 4 Review
| Time | Activity | Details |
|---|---|---|
| Morning (2 hrs) | Review | Redo all case studies from the week |
| Afternoon (2 hrs) | Practice | Rapid-fire metric design: define metrics for 10 products in 30 minutes |
| Evening (1 hr) | Behavioral | Add 2 STAR stories about business impact and stakeholder communication |
:::note Week 4 Milestone Checkpoint
- Define success metrics for any product using the HEART framework
- Diagnose a metric change using a systematic root cause analysis approach
- Complete a business case study in 30-45 minutes
- Explain causal inference methods and when to use each
- Present data findings clearly in under 10 minutes
- Know when A/B testing is inappropriate and propose alternatives
:::
Week 5: Polish -- Advanced Topics and Mock Interviews
Goal: Cover advanced DS topics, practice take-homes, and intensify mocks.
Daily time: 3.5 hours (weekdays), 5 hours (weekends)
Monday -- Advanced ML: Time Series and NLP
| Time | Activity | Details |
|---|---|---|
| Morning (45 min) | SQL practice | 2 hard SQL problems |
| Lunch (20 min) | Read | Deep Learning overview (skim) |
| Evening (120 min) | Study | Time series: ARIMA, Prophet, feature engineering for time series. Basic NLP: TF-IDF, embeddings, sentiment analysis |
| Night (15 min) | Review | When to use time series models vs ML models for forecasting |
Tuesday -- System Design for Data Science
| Time | Activity | Details |
|---|---|---|
| Morning (45 min) | SQL practice | 2 hard SQL problems |
| Lunch (20 min) | Read | ML System Design overview |
| Evening (120 min) | Study | Designing analytics pipelines, dashboard architecture, real-time metrics, experimentation platforms |
| Night (15 min) | Review | Design an experimentation platform at a high level |
Wednesday -- Take-Home Project Practice
| Time | Activity | Details |
|---|---|---|
| Morning (45 min) | SQL practice | 1 hard SQL problem |
| Lunch (20 min) | Read | Take-Home Projects |
| Evening (120 min) | Project | Complete a mock take-home: analyze a dataset, build a model, write up findings |
| Night (15 min) | Review | Self-critique: Is your analysis rigorous? Are your conclusions justified? |
Thursday -- Company Research
| Time | Activity | Details |
|---|---|---|
| Morning (45 min) | SQL practice | 2 company-specific SQL problems |
| Lunch (20 min) | Read | Company Guides |
| Evening (120 min) | Research | Target company data science blog posts, products, metrics, culture |
| Night (15 min) | Notes | Prepare company-specific talking points |
Friday -- Mock Interview Day
| Time | Activity | Details |
|---|---|---|
| Morning (45 min) | Warm-up | 1 easy SQL problem |
| Afternoon (3 hrs) | Mocks | SQL mock (45 min) + statistics/probability mock (45 min) + case study mock (45 min) |
| Evening (30 min) | Debrief | Catalog weaknesses for Week 6 focus |
Saturday -- Weakness Remediation
| Time | Activity | Details |
|---|---|---|
| Morning (2.5 hrs) | Study | Deep dive into weakest area from mocks |
| Afternoon (1.5 hrs) | Practice | 5 targeted practice problems |
| Evening (1 hr) | Behavioral | Practice all STAR stories aloud |
Sunday -- Week 5 Review
| Time | Activity | Details |
|---|---|---|
| Morning (2 hrs) | Review | Create comprehensive cheat sheets: statistics, SQL patterns, metric frameworks |
| Afternoon (2 hrs) | Practice | 5 rapid-fire case studies (10 minutes each) |
| Evening (1 hr) | Plan | Finalize Week 6 based on remaining gaps |
:::note Week 5 Milestone Checkpoint
- Handle time series and basic NLP problems
- Complete a take-home analysis project in under 4 hours
- Pass SQL, statistics, and case study mocks with 7/10+ scores
- Know target company's products, metrics, and data science culture
- Have 6+ polished STAR stories ready
- Handle rapid-fire case studies with structured frameworks
:::
Week 6: Final Week -- Simulations, Behavioral, and Confidence
Goal: Final mock interviews, behavioral polish, and mental preparation.
Daily time: 2.5 hours (weekdays), 4 hours (weekends)
Monday -- Light Review
| Time | Activity | Details |
|---|---|---|
| Morning (45 min) | SQL | 2 medium problems for flow |
| Lunch (20 min) | Read | Negotiation and Offers |
| Evening (60 min) | Review | Skim all cheat sheets |
| Night (15 min) | Rest | Light reading |
Tuesday -- Full Loop Simulation
| Time | Activity | Details |
|---|---|---|
| Morning (45 min) | Warm-up | 1 easy problem |
| Afternoon (3 hrs) | Mock | Full simulation: SQL + stats + case study + behavioral |
| Evening (30 min) | Debrief | Final notes |
Wednesday -- Targeted Review
| Time | Activity | Details |
|---|---|---|
| Morning (45 min) | Study | Weakest area from mock |
| Evening (90 min) | Practice | 3 targeted problems |
Thursday -- Behavioral Final Prep
| Time | Activity | Details |
|---|---|---|
| Morning (60 min) | Practice | All STAR stories aloud, timed |
| Lunch (20 min) | Read | Behavioral final tips |
| Evening (90 min) | Mock | Final behavioral mock |
| Night (15 min) | Prep | Questions to ask interviewers |
Friday -- Rest
| Time | Activity | Details |
|---|---|---|
| Morning (30 min) | Logistics | Confirm schedule, test setup |
| Rest of day | Relax | Recharge |
Weekend -- Light and Rest
Light review Saturday. Full rest Sunday.
:::note Week 6 Final Assessment
- Can solve complex SQL problems in under 20 minutes
- Can design and analyze an A/B test from scratch
- Can diagnose a metric change systematically
- Can build and evaluate an ML model for a business problem
- Can present findings clearly to non-technical stakeholders
- Can answer probability/statistics questions confidently
- Have prepared questions showing genuine curiosity about the company
:::
SQL Problem Categories to Master
Must-Solve Problem Types
| Category | Example | Difficulty |
|---|---|---|
| Funnel analysis | Calculate conversion rates across steps | Medium |
| Retention cohorts | Monthly retention by signup cohort | Hard |
| Running totals | Cumulative revenue by category | Medium |
| Year-over-year | Compare metrics across time periods | Medium |
| Sessionization | Group user events into sessions | Hard |
| Self-joins | Find users who did A then B within 7 days | Hard |
| Ranking within groups | Top N items per category | Medium |
| Gap analysis | Find periods with no activity | Hard |
| Moving averages | 7-day rolling average of daily metrics | Medium |
| Percentiles | Median and P95 response times | Hard |
Sample SQL Problems
Problem 1: Retention Analysis
Given a user_activity table with user_id, activity_date, and signup_date, calculate the Day-1, Day-7, and Day-30 retention rates by signup month.
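One possible approach to Problem 1, sketched with Python's `sqlite3` so it can be run locally. Several things here are assumptions: the sample rows are invented, every active day is assumed to have one row carrying the user's `signup_date`, Day-N retention is taken to mean activity on exactly the Nth day after signup, and the date functions are SQLite-specific (other dialects would use `DATEDIFF` or similar).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_activity (user_id INTEGER, activity_date TEXT, signup_date TEXT)")
conn.executemany(
    "INSERT INTO user_activity VALUES (?, ?, ?)",
    [
        (1, "2024-01-01", "2024-01-01"), (1, "2024-01-02", "2024-01-01"),
        (1, "2024-01-08", "2024-01-01"), (1, "2024-01-31", "2024-01-01"),
        (2, "2024-01-10", "2024-01-10"), (2, "2024-01-11", "2024-01-10"),
        (3, "2024-02-05", "2024-02-05"), (3, "2024-02-12", "2024-02-05"),
        (4, "2024-02-20", "2024-02-20"),
    ],
)

query = """
WITH days AS (
    -- Days elapsed between signup and each activity row
    SELECT
        user_id,
        strftime('%Y-%m', signup_date) AS signup_month,
        CAST(julianday(activity_date) - julianday(signup_date) AS INTEGER) AS day_n
    FROM user_activity
),
per_user AS (
    -- One row per user with a 0/1 flag for each retention day
    SELECT
        user_id,
        signup_month,
        MAX(CASE WHEN day_n = 1 THEN 1 ELSE 0 END) AS d1,
        MAX(CASE WHEN day_n = 7 THEN 1 ELSE 0 END) AS d7,
        MAX(CASE WHEN day_n = 30 THEN 1 ELSE 0 END) AS d30
    FROM days
    GROUP BY user_id, signup_month
)
SELECT
    signup_month,
    COUNT(*) AS cohort_size,
    AVG(d1) AS d1_retention,
    AVG(d7) AS d7_retention,
    AVG(d30) AS d30_retention
FROM per_user
GROUP BY signup_month
ORDER BY signup_month
"""

retention = conn.execute(query).fetchall()
for row in retention:
    print(row)
```

In an interview it is worth raising that cohorts younger than 30 days cannot have a meaningful Day-30 rate yet, and asking whether users with no activity rows at all exist in a separate users table.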
Problem 2: Funnel Conversion
Given tables page_views, add_to_cart, and purchases, calculate the conversion rate at each funnel step by device type, for the last 30 days.
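A sketch of one way to approach Problem 2. The problem statement does not give column names, so the schemas below (`user_id`, `device_type`, `event_date`) are assumptions, as are the sample rows and the `:as_of` reference date used in place of the current date so the example is reproducible. Device type is taken from the page view, and a purchase only counts if the same user also added to cart in the window.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE page_views  (user_id INTEGER, device_type TEXT, event_date TEXT);
CREATE TABLE add_to_cart (user_id INTEGER, event_date TEXT);
CREATE TABLE purchases   (user_id INTEGER, event_date TEXT);
""")
conn.executemany("INSERT INTO page_views VALUES (?, ?, ?)", [
    (1, "mobile", "2024-03-20"), (2, "mobile", "2024-03-21"), (3, "mobile", "2024-03-22"),
    (4, "desktop", "2024-03-20"), (5, "desktop", "2024-03-25"),
    (6, "mobile", "2024-01-01"),  # outside the 30-day window
])
conn.executemany("INSERT INTO add_to_cart VALUES (?, ?)", [
    (1, "2024-03-20"), (2, "2024-03-21"), (4, "2024-03-20"), (6, "2024-01-01"),
])
conn.executemany("INSERT INTO purchases VALUES (?, ?)", [
    (1, "2024-03-21"), (4, "2024-03-22"),
])

query = """
WITH viewers AS (
    SELECT DISTINCT user_id, device_type
    FROM page_views
    WHERE event_date >= date(:as_of, '-30 days')
),
carts AS (
    SELECT DISTINCT user_id FROM add_to_cart
    WHERE event_date >= date(:as_of, '-30 days')
),
buyers AS (
    SELECT DISTINCT user_id FROM purchases
    WHERE event_date >= date(:as_of, '-30 days')
)
SELECT
    v.device_type,
    COUNT(*) AS viewers,
    COUNT(c.user_id) AS carted,
    COUNT(b.user_id) AS purchased,
    1.0 * COUNT(c.user_id) / COUNT(*) AS view_to_cart,
    -- NULLIF avoids division by zero when nobody added to cart
    1.0 * COUNT(b.user_id) / NULLIF(COUNT(c.user_id), 0) AS cart_to_purchase
FROM viewers v
LEFT JOIN carts c ON c.user_id = v.user_id
LEFT JOIN buyers b ON b.user_id = c.user_id
GROUP BY v.device_type
ORDER BY v.device_type
"""

funnel = conn.execute(query, {"as_of": "2024-03-31"}).fetchall()
for row in funnel:
    print(row)
```

A user who viewed pages on two device types is counted once under each, which is one of the clarifying questions to raise before writing the query.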
Problem 3: Revenue Growth
Given an orders table, calculate the month-over-month revenue growth rate, and flag months where growth exceeded 20%.
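Problem 3 combines an aggregation with `LAG`. The sketch below assumes an `orders` table with `order_id`, `order_date`, and `amount` columns (the problem does not specify them) and uses invented rows; `strftime` is SQLite-specific.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, order_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, "2024-01-05", 100.0),
        (2, "2024-01-20", 100.0),
        (3, "2024-02-10", 250.0),
        (4, "2024-03-03", 150.0),
        (5, "2024-03-15", 125.0),
    ],
)

query = """
WITH monthly AS (
    SELECT strftime('%Y-%m', order_date) AS month, SUM(amount) AS revenue
    FROM orders
    GROUP BY strftime('%Y-%m', order_date)
),
with_prev AS (
    SELECT month, revenue, LAG(revenue) OVER (ORDER BY month) AS prev_revenue
    FROM monthly
)
SELECT
    month,
    revenue,
    (revenue - prev_revenue) / prev_revenue AS growth_rate,
    -- The first month has no previous value, so its NULL growth falls to the ELSE branch
    CASE WHEN (revenue - prev_revenue) / prev_revenue > 0.20 THEN 1 ELSE 0 END AS high_growth
FROM with_prev
ORDER BY month
"""

growth = conn.execute(query).fetchall()
for row in growth:
    print(row)
```

One caveat to mention aloud: a month with no orders produces no row, so `LAG` would silently compare non-adjacent months unless the query joins against a calendar table.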
Statistics Quick Reference
Formulas You Must Know
| Concept | Formula | When to Use |
|---|---|---|
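| Sample mean | $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ | Always |
| Standard error | $SE = \frac{s}{\sqrt{n}}$ | Confidence intervals, hypothesis tests |
| Confidence interval | $\bar{x} \pm z_{\alpha/2} \cdot \frac{s}{\sqrt{n}}$ | Estimating population parameters |
| Z-test statistic | $z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$ | Large sample hypothesis testing |
| T-test statistic | $t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$ | Small sample mean comparison |
| Sample size (A/B) | $n = \frac{2\,(z_{\alpha/2} + z_{\beta})^2\,\sigma^2}{\delta^2}$ per group | Planning A/B tests |
| Bayes theorem | $P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$ | Conditional probability problems |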
Distribution Quick Reference
| Distribution | Use Case | Key Parameter |
|---|---|---|
| Normal | Continuous data, CLT | mean, std dev |
| Binomial | Count of successes in n trials | n, p |
| Poisson | Count of events in a time period | lambda |
| Exponential | Time between events | lambda |
| Bernoulli | Single yes/no trial | p |
| Uniform | Equal probability outcomes | a, b |
| Geometric | Trials until first success | p |
Case Study Framework
Use this framework for every case study question:
- Clarify the question -- What are we trying to answer? Who is the stakeholder?
- Define metrics -- What does success look like? What are guardrail metrics?
- Formulate hypotheses -- What might explain the observed behavior?
- Design the analysis -- What data do you need? What methods will you use?
- Analyze and conclude -- Present findings with confidence levels
- Recommend action -- What should the business do? What are the risks?
Essential Resources
Handbook Chapters to Prioritize
| Priority | Chapter | When to Study |
|---|---|---|
| Critical | ML Fundamentals | Weeks 1-3 |
| Critical | Coding Interviews (SQL focus) | Weeks 1-5 |
| High | Behavioral | Weeks 5-6 |
| High | ML System Design | Week 5 |
| Medium | Deep Learning | Week 5 (skim) |
| Medium | Company Guides | Week 5 |
| Medium | Take-Home Projects | Week 5 |
| Low | Negotiation | Week 6 |
Books
- "Practical Statistics for Data Scientists" by Peter Bruce and Andrew Bruce
- "Trustworthy Online Controlled Experiments" by Kohavi, Tang, and Xu
- "Naked Statistics" by Charles Wheelan (for intuition building)
Practice Platforms
- StrataScratch -- SQL and data science interview questions from real companies
- Mode Analytics -- SQL practice with real datasets
- DataLemur -- SQL interview questions by difficulty
- Kaggle -- Datasets for analysis practice
Next Steps
You now have a complete 6-week roadmap for Data Scientist interview preparation. If this path does not match your target role, consider:
- MLE Prep Path -- If your role requires more model building and engineering
- AI Engineer Prep Path -- If your role focuses on LLM applications
- Data Engineer Prep Path -- If your role emphasizes data infrastructure over analysis
The best data scientists are not just technically strong -- they are storytellers who translate data into decisions. Start practicing both skills today.
