Skip to main content

Data Engineer: 6-Week Prep Path

Reading time: ~45 min | Interview relevance: Critical | Roles: Data Engineer, Analytics Engineer, Data Platform Engineer, Data Infrastructure Engineer

The Real Interview Moment

It is 2:15 PM on a Tuesday. You are in the third round of your on-site at a major tech company. The interviewer opens a shared document and says: "We have a platform that processes 500 million events per day from mobile apps. Events arrive via Kafka, need to be enriched with user profile data, deduplicated, and stored for both real-time dashboards and historical analysis. Design the data pipeline."

You have sixty minutes. She wants to see the end-to-end architecture: ingestion, processing, storage, serving, monitoring, and how you handle late-arriving data, schema evolution, and failures. She will probe your understanding of exactly-once semantics, partitioning strategies, backfill procedures, and cost trade-offs.

This is not about memorizing tools. Any candidate can name Spark, Kafka, and Airflow. What separates the strong candidates is understanding why you would choose one approach over another, how components interact under failure, and what breaks at scale.

The data engineer interview is deceptively broad. It spans SQL (often at a level that surprises candidates), data modeling, distributed systems, pipeline design, coding, and system design. But the scope is manageable in six weeks because the concepts connect logically: if you understand how data flows through systems, the rest follows.

This 6-week plan is compressed but complete. It assumes you have some familiarity with databases and programming. If you are starting from zero, add two weeks of basics at the front.

Role Overview

What Data Engineers Do

Data Engineers build and maintain the infrastructure that enables organizations to collect, store, transform, and serve data. They:

  • Design and build data pipelines (batch and streaming) that process terabytes to petabytes daily
  • Model data warehouses and data lakes for analytical and operational use cases
  • Write complex SQL for transformations, quality checks, and analysis
  • Build and maintain orchestration workflows (Airflow, Dagster, Prefect)
  • Optimize query performance and storage costs
  • Ensure data quality, reliability, and freshness through monitoring and testing
  • Collaborate with data scientists, analysts, and ML engineers on data needs
  • Manage schema evolution, data governance, and access control

How It Differs from Other Data Roles

DimensionData EngineerData ScientistAnalytics EngineerMLE
Primary focusData infrastructureStatistical analysisBusiness data modelsML models in production
Key outputPipelines, warehousesInsights, modelsdbt models, dashboardsDeployed ML systems
SQL depthExpertProficientExpertModerate
Systems knowledgeDeepMinimalModerateModerate
Coding emphasisProduction-gradeExploratorySQL-focusedML-focused
Scale concernData volume and velocityModel accuracyQuery performanceModel latency

Interview Format (Typical)

RoundDurationFocus
Phone Screen45-60 minSQL + coding basics
SQL Round45-60 minComplex queries, window functions, optimization
Coding Round45-60 minPython/data processing (not heavy DSA)
Data Modeling45-60 minDimensional modeling, schema design, normalization
System Design60 minEnd-to-end pipeline design
Behavioral45 minCollaboration, incident response, prioritization

:::tip SQL Is the Make-or-Break Round Unlike ML or software engineering interviews where coding is the main filter, data engineering interviews use SQL as the primary technical screen. Many otherwise strong candidates are eliminated because they cannot write complex window functions, CTEs, or self-joins fluently under pressure. Do not underestimate the SQL round. :::

Focus Area Allocation

Data Engineer Interview Prep Time Allocation - SQL Mastery 25%, System Design 25%, Data Modeling 15%, Coding 15%, Distributed Systems 10%, Behavioral 10%

Breakdown by Skill

SQL Mastery (25% -- ~45 hours total)

  • Window functions: ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD, running aggregates
  • CTEs and recursive queries
  • Self-joins and complex join patterns
  • Query optimization: EXPLAIN plans, indexing, partitioning, materialized views
  • Advanced: pivoting, unpivoting, sessionization, gap-and-island problems

System Design (25% -- ~45 hours total)

  • Batch pipeline architecture: ingestion, transformation, serving
  • Streaming pipeline design: Kafka, Flink, exactly-once semantics
  • Data lake and lakehouse architecture: Iceberg, Delta Lake, Hudi
  • Orchestration: DAG design, dependency management, backfill strategies
  • Monitoring: data quality, freshness, volume, schema drift

Data Modeling (15% -- ~27 hours total)

  • Dimensional modeling: star schema, snowflake schema, slowly changing dimensions
  • Normalization: 1NF through 3NF, BCNF, denormalization trade-offs
  • Data vault modeling
  • Schema design for specific use cases (events, time series, hierarchical)

Coding (15% -- ~27 hours total)

  • Python: data processing with pandas, file I/O, API interaction
  • Data structures: hash maps, heaps, queues (not heavy graph/DP)
  • Testing: unit tests for pipelines, data validation
  • Scripting: ETL scripts, data quality checks

Distributed Systems (10% -- ~18 hours total)

  • CAP theorem and consistency models
  • Partitioning and sharding strategies
  • Replication and fault tolerance
  • MapReduce paradigm and Spark internals

Behavioral (10% -- ~18 hours total)

  • Incident response stories (data pipeline failures)
  • Cross-team collaboration
  • Prioritization and technical debt management
  • Data quality and governance experiences

6-Week Schedule Overview

Data Engineer 6-Week Prep Plan - gantt-style schedule: SQL and Data Modeling weeks 1–2, Pipeline Design and Distributed Systems weeks 3–4, System Design and Mock Interviews weeks 5–6

Week 1: Foundations -- SQL Mastery Begins

Goal: Rebuild SQL fluency from basics to advanced window functions. Start data modeling fundamentals.

Daily time: 3.5 hours (weekdays), 5 hours (weekends)

Monday -- SQL Review: Joins, Aggregations, Subqueries

TimeActivityDetails
Morning (60 min)SQL practice5 easy/medium SQL problems on LeetCode or HackerRank (INNER JOIN, LEFT JOIN, GROUP BY, HAVING)
Lunch (20 min)ReadCoding Interviews overview
Evening (90 min)StudyJoin types (INNER, LEFT, RIGHT, FULL, CROSS, SELF), aggregation functions, subqueries vs CTEs, NULL handling
Night (15 min)ReviewWrite a cheat sheet of join behavior with NULL values

Practice problems:

  • Find employees earning more than their manager (self-join)
  • Department with highest average salary (GROUP BY + subquery)
  • Customers who never ordered (LEFT JOIN + NULL check)
  • Second highest salary in each department (subquery approach)

Tuesday -- SQL: Window Functions I

TimeActivityDetails
Morning (60 min)SQL practice5 medium SQL problems (window functions)
Lunch (20 min)ReadWindow function syntax and frame specification
Evening (90 min)StudyROW_NUMBER, RANK, DENSE_RANK, NTILE, PARTITION BY, ORDER BY, frame clauses (ROWS BETWEEN, RANGE BETWEEN)
Night (15 min)ReviewWrite 3 queries using different ranking functions on the same data; explain when each is appropriate

Practice problems:

  • Rank employees by salary within each department
  • Find top 3 earners per department
  • Calculate running total of sales by month
  • Identify duplicate rows using ROW_NUMBER

:::tip Window Functions Are Interview Gold Over 80% of data engineering SQL rounds include at least one window function problem. Master these patterns cold: ranking within groups, running aggregates, comparing current row to previous/next row (LAG/LEAD), and percentile calculations. If you can write window functions fluently, you are already ahead of most candidates. :::

Wednesday -- SQL: Window Functions II and CTEs

TimeActivityDetails
Morning (60 min)SQL practice5 medium/hard SQL problems
Lunch (20 min)ReadCTE best practices and recursive CTEs
Evening (90 min)StudyLAG, LEAD, FIRST_VALUE, LAST_VALUE, running averages, cumulative sums, CTEs for readability, recursive CTEs for hierarchical data
Night (15 min)PracticeWrite a recursive CTE to traverse an org chart

Practice problems:

  • Calculate month-over-month growth rate (LAG)
  • Find consecutive login days per user (gap-and-island)
  • Compute 7-day moving average of daily revenue
  • Traverse an employee hierarchy to find all reports (recursive CTE)

Thursday -- Data Modeling: Normalization

TimeActivityDetails
Morning (60 min)SQL practice5 medium SQL problems
Lunch (20 min)ReadDatabase normalization forms overview
Evening (90 min)Study1NF, 2NF, 3NF, BCNF: definitions, examples, identifying violations. Functional dependencies. When to denormalize and why
Night (15 min)PracticeGiven a flat table, normalize it to 3NF step by step

:::warning Normalization Is Not Just Theory Interviewers often give you a denormalized table and ask you to normalize it, or give you a normalized schema and ask when you would denormalize. You need to be able to do both directions fluently, explaining the trade-offs at each step: data integrity vs query performance, storage efficiency vs join complexity, write performance vs read performance. :::

Friday -- Data Modeling: Dimensional Modeling

TimeActivityDetails
Morning (60 min)SQL practice5 medium SQL problems
Lunch (20 min)ReadKimball dimensional modeling introduction
Evening (90 min)StudyStar schema, snowflake schema, fact tables (transaction, periodic, accumulating), dimension tables, surrogate keys, degenerate dimensions
Night (15 min)DesignDesign a star schema for an e-commerce order system

Saturday -- SQL Deep Practice + Slowly Changing Dimensions

TimeActivityDetails
Morning (2 hrs)SQL practice8 medium/hard SQL problems covering all concepts from the week
Afternoon (2 hrs)StudySlowly Changing Dimensions (SCD Types 1, 2, 3, 4, 6). Implementation patterns for each type. When to use which
Evening (1 hr)DesignDesign a data warehouse schema for a ride-sharing company: trips fact table, driver dimension (SCD Type 2), location dimension

Sunday -- Week 1 Review

TimeActivityDetails
Morning (2 hrs)SQL reviewRe-solve hardest SQL problems from the week; time yourself
Afternoon (2 hrs)Data modeling reviewPractice: given a business scenario, design both an OLTP schema (normalized) and OLAP schema (dimensional)
Evening (1 hr)PlanReview Week 2 plan; identify SQL patterns that need more practice

:::note Week 1 Milestone Checkpoint By the end of Week 1, you should be able to:

  • Write window functions (ROW_NUMBER, RANK, LAG, LEAD) without reference
  • Use CTEs and recursive CTEs for complex query decomposition
  • Normalize a table to 3NF and explain each step
  • Design a star schema for a given business scenario
  • Explain SCD Types 1, 2, and 3 with implementation examples
  • Solve medium SQL problems in under 10 minutes :::

Week 2: Foundations -- Advanced SQL and Pipeline Basics

Goal: Master advanced SQL patterns. Begin understanding data pipeline architecture.

Daily time: 3.5 hours (weekdays), 5 hours (weekends)

Monday -- SQL: Sessionization and Gap-and-Island

TimeActivityDetails
Morning (60 min)SQL practice5 medium/hard SQL problems
Lunch (20 min)ReadSessionization techniques in SQL
Evening (90 min)StudySessionization (grouping events into sessions using time gaps), gap-and-island problems (finding consecutive sequences), LEAD/LAG for boundary detection
Night (15 min)PracticeWrite a sessionization query with a 30-minute timeout

Practice problems:

  • Group user clickstream into sessions (30-min gap)
  • Find consecutive days a stock price increased
  • Identify gaps in a sequence of IDs
  • Calculate session duration and pages per session

Tuesday -- SQL: Query Optimization

TimeActivityDetails
Morning (60 min)SQL practice5 medium/hard SQL problems
Lunch (20 min)ReadHow to read an EXPLAIN plan
Evening (90 min)StudyEXPLAIN/EXPLAIN ANALYZE, index types (B-tree, hash, GIN, GiST), index selection, partition pruning, query rewriting for performance, common anti-patterns (SELECT *, implicit type casting, functions on indexed columns)
Night (15 min)PracticeTake a slow query and optimize it using EXPLAIN analysis

:::danger SQL Optimization Questions Are Tricky Interviewers love asking: "This query is slow. How would you fix it?" The trap is jumping to "add an index." A strong answer follows a systematic process: (1) Read the EXPLAIN plan, (2) Identify the bottleneck (full table scan, nested loop join, sort spill to disk), (3) Consider multiple solutions (indexing, query rewrite, materialized view, partitioning), (4) Discuss trade-offs of each approach. :::

Wednesday -- SQL: Advanced Patterns and Anti-Patterns

TimeActivityDetails
Morning (60 min)SQL practice5 hard SQL problems
Lunch (20 min)ReadSQL anti-patterns reference
Evening (90 min)StudyPivot/unpivot, GROUPING SETS / CUBE / ROLLUP, correlated subqueries, EXISTS vs IN performance, approximate distinct counts (HyperLogLog), handling large IN lists
Night (15 min)PracticeWrite a single query using GROUPING SETS to produce a multi-level summary

Thursday -- Introduction to Data Pipelines

TimeActivityDetails
Morning (60 min)Coding practice2 Python problems (file I/O, data manipulation)
Lunch (20 min)ReadETL vs ELT comparison
Evening (90 min)StudyPipeline architectures: ETL (extract-transform-load), ELT (extract-load-transform), batch vs streaming, orchestration concepts (DAGs, dependencies, triggers)
Night (15 min)DrawSketch a basic ETL pipeline with failure handling

Friday -- Orchestration: Airflow Fundamentals

TimeActivityDetails
Morning (60 min)Coding practice2 Python problems (APIs, error handling)
Lunch (20 min)ReadApache Airflow architecture overview
Evening (90 min)StudyAirflow concepts: DAGs, operators (PythonOperator, BashOperator, SQL operators), sensors, XComs, task dependencies, scheduling, backfilling, idempotency
Night (15 min)DesignDesign an Airflow DAG for a daily reporting pipeline

Saturday -- Python for Data Engineering + Pipeline Practice

TimeActivityDetails
Morning (2 hrs)CodingWrite a data pipeline in Python: read from CSV, validate schema, transform (clean, deduplicate, enrich), write to output. Include proper error handling and logging
Afternoon (2 hrs)StudyData formats: Parquet vs CSV vs Avro vs ORC. Compression: gzip, snappy, zstd. Columnar vs row-based storage trade-offs
Evening (1 hr)SQL reviewSolve 5 hard SQL problems

Sunday -- Week 2 Review

TimeActivityDetails
Morning (2 hrs)SQL reviewTime yourself on 10 SQL problems (mix of medium and hard). Target: medium in 8 min, hard in 15 min
Afternoon (2 hrs)Pipeline reviewDesign a pipeline for a specific scenario: daily user activity aggregation from raw event logs
Evening (1 hr)PlanList all distributed systems topics needed for Week 3

:::note Week 2 Milestone Checkpoint

  • Write sessionization queries without reference
  • Read and interpret EXPLAIN plans to identify bottlenecks
  • Optimize a slow query using at least 3 different techniques
  • Explain ETL vs ELT with trade-offs for different scenarios
  • Design a basic Airflow DAG with proper error handling and idempotency
  • Solve hard SQL problems in under 15 minutes
  • Choose between Parquet, Avro, and CSV for different use cases :::

Week 3: Core Skills -- Distributed Systems and Streaming

Goal: Understand the distributed systems concepts that underpin modern data engineering. Learn streaming fundamentals.

Daily time: 4 hours (weekdays), 6 hours (weekends)

Monday -- Distributed Systems Fundamentals

TimeActivityDetails
Morning (60 min)SQL practice3 hard SQL problems
Lunch (20 min)ReadCAP theorem explained
Evening (120 min)StudyCAP theorem, consistency models (strong, eventual, causal), partitioning strategies (hash, range, round-robin), replication (leader-follower, multi-leader, leaderless)
Night (15 min)ReviewFor each data store you know (Postgres, Redis, Cassandra, DynamoDB), identify where it sits on the CAP triangle

Tuesday -- Distributed Storage: HDFS, S3, and Data Lakes

TimeActivityDetails
Morning (60 min)SQL practice3 hard SQL problems
Lunch (20 min)ReadData lake vs data warehouse vs data lakehouse
Evening (120 min)StudyHDFS architecture, S3 consistency model, data lake organization (bronze/silver/gold layers), table formats: Apache Iceberg, Delta Lake, Hudi. Partitioning, compaction, time travel
Night (15 min)CompareCreate a comparison table: Iceberg vs Delta Lake vs Hudi

Wednesday -- Apache Spark Internals

TimeActivityDetails
Morning (60 min)Coding practice2 Python problems (data processing)
Lunch (20 min)ReadSpark architecture overview
Evening (120 min)StudySpark architecture: driver, executors, DAG scheduler, stages, tasks. RDDs vs DataFrames vs Datasets. Shuffles, partitions, broadcast joins, skew handling, caching strategies
Night (15 min)DrawDiagram the execution of a Spark job with a join and aggregation, showing shuffle boundaries

:::tip Understand Spark Conceptually, Not Just Syntactically Interviewers rarely ask you to write Spark code (though some do). More commonly, they ask: "Your Spark job is running for 6 hours. How do you debug and optimize it?" You need to understand: shuffle operations, data skew, partition count, broadcast join thresholds, spill to disk, and Spark UI interpretation. Conceptual understanding beats syntax memorization. :::

Thursday -- Apache Kafka and Event Streaming

TimeActivityDetails
Morning (60 min)SQL practice3 hard SQL problems
Lunch (20 min)ReadKafka architecture introduction
Evening (120 min)StudyKafka architecture: brokers, topics, partitions, consumer groups, offsets. Delivery guarantees: at-most-once, at-least-once, exactly-once semantics. Schema registry, compacted topics, retention policies
Night (15 min)DesignDesign a Kafka topic strategy for an event-driven e-commerce platform

Friday -- Stream Processing: Concepts and Patterns

TimeActivityDetails
Morning (60 min)Coding practice2 Python problems (concurrency, generators)
Lunch (20 min)ReadFlink vs Spark Streaming vs Kafka Streams comparison
Evening (120 min)StudyStream processing concepts: event time vs processing time, watermarks, windowing (tumbling, sliding, session), late data handling, state management, checkpointing
Night (15 min)PracticeDesign a streaming pipeline that computes real-time click-through rates with 5-minute tumbling windows

Saturday -- Data Quality and Testing

TimeActivityDetails
Morning (2.5 hrs)StudyData quality dimensions: accuracy, completeness, consistency, timeliness, uniqueness. Data contracts. Testing frameworks: Great Expectations, dbt tests, custom validation. Schema evolution strategies
Afternoon (2 hrs)ImplementationWrite a data quality check suite for a pipeline: null checks, range checks, uniqueness checks, referential integrity, volume anomaly detection
Evening (1.5 hrs)SQL practiceSolve 5 hard SQL problems

:::warning Data Quality Questions Are Increasing Modern data engineering interviews increasingly include data quality and observability questions. "How do you ensure your pipeline produces correct data?" is not optional. Be ready to discuss: automated testing, data contracts, monitoring, alerting, and root cause analysis when data quality issues arise. Companies like Monte Carlo, Soda, and Elementary have made data observability mainstream. :::

Sunday -- Week 3 Review

TimeActivityDetails
Morning (2 hrs)ReviewRe-draw system architectures from memory: Spark, Kafka, data lake
Afternoon (2.5 hrs)PracticeDesign a streaming pipeline: real-time fraud detection with Kafka and Flink. Include failure handling, late data, and monitoring
Evening (1.5 hrs)Behavioral prepDraft 3 STAR stories about pipeline incidents, data quality issues, and cross-team collaboration

:::note Week 3 Milestone Checkpoint

  • Explain CAP theorem and its implications for data system design
  • Describe Spark's execution model (DAG, stages, tasks, shuffles)
  • Explain Kafka's architecture and delivery guarantees
  • Design a streaming pipeline with proper windowing and late data handling
  • Articulate data lake layering (bronze/silver/gold) and its purpose
  • Write comprehensive data quality checks for a pipeline :::

Week 4: Core Skills -- Pipeline System Design and Coding

Goal: Practice end-to-end pipeline system design. Build coding fluency for data engineering interviews.

Daily time: 4 hours (weekdays), 6 hours (weekends)

Monday -- System Design Framework for Data Pipelines

TimeActivityDetails
Morning (60 min)SQL practice3 hard SQL problems
Lunch (20 min)ReadML System Design overview (adapt framework for data)
Evening (120 min)StudyData pipeline design framework: (1) Requirements and constraints, (2) Data sources and ingestion, (3) Storage layer, (4) Processing/transformation, (5) Serving layer, (6) Monitoring and operations
Night (15 min)ReviewWrite the framework from memory

:::tip The Data Pipeline Design Framework Use this structure for every pipeline design question:

  1. Clarify requirements -- What are we building? What is the data volume, velocity, and variety? What are the SLAs (freshness, latency, correctness)?
  2. Data sources and ingestion -- Where does data come from? CDC, API, event stream, file upload? How do we handle schema differences?
  3. Storage layer -- Where do we land raw data? What format? How do we partition? Data lake, warehouse, or lakehouse?
  4. Processing -- Batch or streaming? What transformations? How do we handle failures and retries? Idempotency?
  5. Serving layer -- How do consumers access the data? OLAP queries, API, export, real-time dashboard?
  6. Monitoring and operations -- Data quality checks, freshness monitoring, volume alerts, schema drift detection, runbook for incidents. :::

Tuesday -- Design: Event Analytics Pipeline

TimeActivityDetails
Morning (60 min)SQL practice3 hard SQL problems
Lunch (20 min)ReadEvent-driven architecture patterns
Evening (120 min)PracticeDesign an event analytics pipeline for a mobile app: 100M events/day, need real-time dashboards (sub-minute) and historical analysis (years of data). Walk through all 6 framework steps
Night (15 min)ReviewIdentify single points of failure in your design

Wednesday -- Design: Data Warehouse Migration

TimeActivityDetails
Morning (60 min)Coding2 Python data processing problems
Lunch (20 min)ReadData warehouse migration case studies
Evening (120 min)PracticeDesign a migration plan: legacy on-prem Oracle data warehouse to cloud (Snowflake/BigQuery/Redshift). Address: parallel running, validation, rollback, data reconciliation, zero-downtime cutover
Night (15 min)ReviewList the top 5 risks and mitigation strategies

Thursday -- Coding: Data Processing in Python

TimeActivityDetails
Morning (60 min)Coding2 LeetCode medium (hash maps, arrays)
Lunch (20 min)ReadPython performance tips for data processing
Evening (120 min)ImplementationImplement common data engineering tasks in Python: (1) Parse and validate JSON events, (2) Deduplicate records using hash-based approach, (3) Implement a simple schema validator, (4) Write a rate-limited API data fetcher
Night (15 min)ReviewAdd error handling and unit tests to each implementation

Friday -- Coding: Data Structures for Data Engineering

TimeActivityDetails
Morning (60 min)Coding2 LeetCode medium (queues, heaps)
Lunch (20 min)ReadData structures used in data systems (bloom filters, LSM trees, B-trees)
Evening (120 min)Study and implementBloom filters (implement from scratch), consistent hashing (implement), LRU cache (implement). Understand how each is used in data systems
Night (15 min)ReviewExplain when you would use each data structure in a real pipeline

Saturday -- Full System Design Practice

TimeActivityDetails
Morning (2.5 hrs)MockTimed system design: Design a data pipeline for a ride-sharing company (trips, pricing, driver matching analytics). 60 min design + 30 min self-review
Afternoon (2 hrs)StudyChange Data Capture (CDC): Debezium, log-based CDC, trigger-based CDC, query-based CDC. When to use each approach
Evening (1.5 hrs)SQL practiceSolve 5 hard SQL problems (timed)

Sunday -- Week 4 Review

TimeActivityDetails
Morning (2 hrs)ReviewRe-do all system designs from the week. Focus on areas where you lacked depth
Afternoon (2.5 hrs)Coding reviewRefactor Python implementations. Add comprehensive tests. Clean up code quality
Evening (1.5 hrs)BehavioralPrepare 3 more STAR stories. Practice all 6 stories aloud

:::note Week 4 Milestone Checkpoint

  • Walk through the data pipeline design framework from memory
  • Design an event analytics pipeline end-to-end with monitoring
  • Implement bloom filter and consistent hashing from scratch
  • Write Python data processing code with proper error handling and testing
  • Explain CDC approaches and their trade-offs
  • Design a migration strategy with risk mitigation :::

Week 5: Polish -- Advanced Topics and Mock Interviews

Goal: Cover remaining advanced topics. Begin intensive mock interview practice.

Daily time: 4 hours (weekdays), 6 hours (weekends)

Monday -- Advanced Data Modeling: Data Vault and Activity Schema

TimeActivityDetails
Morning (60 min)SQL practice3 hard SQL problems (company-tagged if possible)
Lunch (20 min)ReadData vault 2.0 overview
Evening (120 min)StudyData vault modeling (hubs, links, satellites), activity schema, one big table (OBT) pattern, wide vs tall table trade-offs. When each modeling approach is appropriate
Night (15 min)CompareCreate a decision matrix: when to use star schema vs data vault vs OBT

Tuesday -- dbt and Modern Data Stack

TimeActivityDetails
Morning (60 min)SQL practice3 hard SQL problems
Lunch (20 min)Readdbt (data build tool) introduction
Evening (120 min)StudyModern data stack: dbt models (staging, intermediate, marts), testing in dbt, incremental models, snapshots, documentation. ELT patterns with Fivetran/Airbyte + warehouse + dbt
Night (15 min)DesignDesign a dbt project structure for an e-commerce company

:::note dbt Knowledge Is Increasingly Expected Even if the job description does not mention dbt, understanding the ELT paradigm it represents is important. Many interviewers will ask about transformation-in-warehouse approaches, incremental processing, and testing data transformations. If you know dbt concepts, you can speak to these patterns confidently. :::

Wednesday -- Cost Optimization and Performance Tuning

TimeActivityDetails
Morning (60 min)SQL practice3 hard SQL problems
Lunch (20 min)ReadCloud data warehouse cost models
Evening (120 min)StudyCost optimization: warehouse sizing, clustering/sort keys, materialized views, partitioning strategies, query patterns that waste money. Performance tuning: Spark job optimization, query profiling, caching strategies
Night (15 min)PracticeGiven a scenario with a $50K/month Snowflake bill, identify 5 ways to reduce costs

Thursday -- Mock: SQL Round

TimeActivityDetails
Morning (60 min)Warm-up2 medium SQL problems
Afternoon (120 min)MockFull SQL round simulation: 4 problems in 50 minutes. Problems covering: window functions, sessionization, complex joins, optimization. Then 70 min to review and re-solve any you missed
Evening (60 min)ReviewIdentify SQL patterns that slowed you down; create targeted practice list
Night (15 min)PlanSelect 5 problems targeting your weakest SQL patterns

Friday -- Mock: System Design Round

TimeActivityDetails
Morning (60 min)Coding2 LeetCode medium (maintain fluency)
Lunch (20 min)ReviewSkim pipeline design framework notes
Evening (120 min)MockTimed system design (60 min): Design a real-time recommendation data pipeline. Self-review (60 min): grade yourself on each framework component
Night (15 min)DebriefList areas where your design was weakest

Saturday -- Company-Specific Research and Full Mock

TimeActivityDetails
Morning (2.5 hrs)ResearchDeep dive into target company: their data stack (check engineering blog), scale (data volume), challenges, recent data engineering job postings
Afternoon (2 hrs)MockFull mock interview: SQL (45 min) + System Design (45 min). Self-timed, strict
Evening (1.5 hrs)BehavioralPractice all STAR stories. Prepare company-specific answers for: "Why this company?" and "Tell me about a time you dealt with data quality issues"

Sunday -- Week 5 Review

TimeActivityDetails
Morning (2 hrs)SQL marathonSolve 10 hard SQL problems in 90 minutes (timed)
Afternoon (2.5 hrs)System designRedo your weakest system design from the week with improvements
Evening (1.5 hrs)PreparationCreate one-page cheat sheets: SQL patterns, system design framework, data modeling patterns, distributed systems concepts

:::note Week 5 Milestone Checkpoint

  • Explain data vault modeling and when to use it
  • Describe the modern data stack and dbt's role in it
  • Identify 5+ cost optimization strategies for cloud data warehouses
  • Complete a SQL round simulation in under 50 minutes
  • Complete a system design round with all 6 framework components
  • Have 6+ polished STAR stories ready
  • Know your target company's data engineering stack and challenges :::

Week 6: Final Week -- Simulation and Confidence

Goal: Final mock interviews, confidence building, and logistics. No new material.

Daily time: 3 hours (weekdays), 4 hours (weekends)

Monday -- Light Practice and Review

TimeActivityDetails
Morning (60 min)SQL practice5 medium SQL problems (patterns you know well, for speed)
Lunch (20 min)ReadNegotiation and Offers overview
Evening (60 min)ReviewSkim all cheat sheets and notes
Night (15 min)RelaxMental break

:::warning Do Not Cram in the Final Week Week 6 is about reinforcing and sharpening. If you encounter a concept you have not studied, make a note but do not go deep. Focus on being fluent in what you already know. Cramming new material in the final days increases anxiety and reduces performance. :::

Tuesday -- Full Mock Interview Day

TimeActivityDetails
Morning (60 min)Warm-up2 easy SQL problems for confidence
Afternoon (150 min)MockFull loop simulation: SQL (45 min) + Coding (45 min) + System Design (60 min). Strict time limits. No looking at notes
Evening (30 min)DebriefNote confidence level for each round. Celebrate progress

Wednesday -- Targeted Review

TimeActivityDetails
Morning (60 min)SQL practice3 medium/hard problems targeting your weakest pattern
Lunch (20 min)ReadReview your system design cheat sheet
Evening (90 min)ReviewWalk through 2 system designs slowly and thoroughly. Practice explaining your decisions out loud
Night (15 min)Mental rehearsalVisualize a successful interview day

Thursday -- Behavioral Final Polish

TimeActivityDetails
Morning (60 min)PracticeTell all STAR stories aloud. Time each one (aim for 3-4 minutes)
Lunch (20 min)ReadTarget company values and culture
Evening (90 min)MockFinal behavioral mock + practice "tell me about yourself" (aim for 2 minutes, covering: background, data engineering experience, why this company, what excites you)
Night (15 min)PrepareWrite 8 thoughtful questions to ask interviewers

Questions to prepare answers for:

  • Walk me through a data pipeline you built end-to-end
  • Tell me about a time your pipeline failed in production. What happened?
  • How do you ensure data quality across pipelines?
  • Describe a disagreement with a stakeholder about data modeling
  • What is the most complex SQL query you have written?
  • How do you prioritize between building new pipelines and maintaining existing ones?

Friday -- Rest and Logistics

TimeActivityDetails
Morning (30 min)LogisticsConfirm interview schedule, test video/audio, prepare workspace
AfternoonRestExercise, relaxation, enjoyable activities
Evening (30 min)Light reviewSkim cheat sheets one final time

Saturday -- Optional Light Practice

TimeActivityDetails
Morning (2 hrs)Light practice5 medium SQL problems for flow and confidence
Afternoon (1.5 hrs)ReviewWalk through one system design slowly
Evening (30 min)Mental prepConfidence visualization; remind yourself of the work you have put in

Sunday -- Complete Rest

Take the day completely off. Get good sleep. You are ready.

:::note Week 6 Final Assessment

  • Can solve hard SQL problems in under 15 minutes with clean, readable queries
  • Can design a data pipeline end-to-end in 60 minutes using the full framework
  • Can explain distributed systems concepts (CAP, partitioning, replication) clearly
  • Can discuss data modeling trade-offs (star vs snowflake vs data vault) with examples
  • Can deliver all STAR stories naturally in 3-4 minutes each
  • Feel confident (not perfect, but confident) about each interview round
  • Have prepared thoughtful questions for each interviewer :::

Problem Sets by Week

Cumulative Practice Counts

WeekSQL ProblemsCoding ProblemsSystem DesignsData ModelsSTAR Stories
1330030
2284100
3164103
4146306
5222316
6130106
Total12616946
CategoryCountDifficulty Mix
Window Functions (ranking, running aggs)255 medium, 20 hard
Joins (self, outer, cross, complex multi-table)205 easy, 10 medium, 5 hard
Aggregation and Grouping155 easy, 8 medium, 2 hard
CTEs and Recursive Queries155 medium, 10 hard
Sessionization and Gap-and-Island122 medium, 10 hard
Subqueries and EXISTS124 medium, 8 hard
Query Optimization105 medium, 5 hard
Pivoting, GROUPING SETS, Advanced103 medium, 7 hard
Date/Time Manipulation73 medium, 4 hard

SQL Practice Platforms

PlatformStrengthDifficulty Range
LeetCode (Database section)Standard interview problemsEasy to Hard
HackerRank (SQL domain)Good progressive difficultyEasy to Hard
StrataScratchReal company interview questionsMedium to Hard
DataLemurData-specific SQL challengesEasy to Hard
SQLZooInteractive tutorials for basicsEasy to Medium
Mode Analytics SQL TutorialReal-world analytical queriesMedium

System Design Problems to Practice

Ordered by Difficulty

ProblemKey Concepts TestedDifficulty
Daily reporting pipeline from raw logsETL basics, scheduling, idempotencyStarter
User activity data warehouseDimensional modeling, SCD, partitioningStarter
Real-time click analytics dashboardStreaming, Kafka, windowing, serving layerIntermediate
Event-driven e-commerce pipelineCDC, event sourcing, exactly-once processingIntermediate
Multi-source customer 360 pipelineData integration, deduplication, identity resolutionIntermediate
Data warehouse migration (on-prem to cloud)Migration strategy, validation, zero downtimeAdvanced
Real-time ML feature pipelineFeature store, streaming features, servingAdvanced
Global data platform (multi-region)Replication, consistency, latency, complianceAdvanced
Ad event pipeline at scale (10B events/day)High throughput, cost optimization, backpressureAdvanced

Mock Interview Schedule

Data Engineer Mock Interview Cadence - timeline of SQL mocks, coding mocks, system design mocks, behavioral mocks, and full loop simulations across the 6-week prep plan

Finding Mock Interview Partners

  • DataTalks.Club -- Free community with data engineering focus
  • Data Engineering Slack/Discord -- Active communities for peer practice
  • interviewing.io -- Anonymous technical interviews
  • Pramp -- Free peer mock interviews
  • Friends in data roles -- Ask data engineers to run you through their favorite interview questions

Common Mistakes to Avoid

:::danger Top 7 Data Engineer Interview Mistakes

  1. Writing correct but unreadable SQL. Interviewers read your SQL. A 200-line query with no CTEs, no comments, and cryptic aliases is a red flag. Structure your queries with CTEs, use meaningful aliases, and format consistently. Readability is part of the evaluation.

  2. Not handling edge cases in SQL. NULLs, empty tables, ties in ranking, duplicate records, and division by zero. Many candidates write queries that work on the happy path but break on edge cases. Always ask: "What happens if this column is NULL?"

  3. Naming tools without understanding them. "I would use Kafka for streaming" is not enough. Why Kafka over Kinesis or Pulsar? How many partitions? What is the retention? How do you handle exactly-once? What happens when a consumer falls behind? Depth matters more than breadth.

  4. Ignoring monitoring and failure handling in system design. A pipeline design that only covers the happy path is incomplete. Every system design answer must address: "What happens when this fails?" "How do you detect data quality issues?" "How do you backfill?"

  5. Underestimating the data modeling round. Many candidates with strong coding skills stumble on data modeling because they treat it as "just drawing tables." You need to understand normalization theory, dimensional modeling philosophy, and the trade-offs behind every design decision.

  6. Over-engineering system designs. You do not need Kafka, Spark, Flink, Airflow, dbt, Iceberg, AND a feature store for a pipeline processing 10,000 records per day. Right-sizing your solution to the requirements demonstrates engineering maturity. Start simple, then explain what you would add as scale increases.

  7. Not preparing behavioral stories about incidents. Data pipelines break. Data quality degrades. Stakeholders get incorrect data. Interviewers want to hear how you responded to real incidents: how you detected the issue, communicated with stakeholders, fixed it, and prevented recurrence. Have 2-3 strong incident stories ready. :::

Essential Resources

Handbook Chapters to Prioritize

PriorityChapterWhen to Study
CriticalCoding InterviewsWeeks 1-4
CriticalML System Design (adapt for data)Weeks 3-5
HighBehavioralWeeks 5-6
MediumCompany GuidesWeek 5
MediumML Fundamentals (skim for context)Week 4
LowNegotiationWeek 6

External Resources

Books:

  • "Designing Data-Intensive Applications" by Martin Kleppmann (the data engineering bible)
  • "The Data Warehouse Toolkit" by Ralph Kimball (dimensional modeling reference)
  • "Fundamentals of Data Engineering" by Joe Reis and Matt Housley
  • "Streaming Systems" by Akidau, Chernyak, and Lax

Courses:

  • DataTalks.Club Data Engineering Zoomcamp (free, comprehensive)
  • Stanford CS245 Principles of Data-Intensive Systems
  • CMU 15-445 Database Systems (query optimization, storage)

Practice Platforms:

  • LeetCode Database section (SQL practice)
  • StrataScratch (real interview SQL questions)
  • DataLemur (SQL and data science problems)
  • HackerRank SQL domain

Blogs and References:

  • Target company engineering blogs (always read before interviews)
  • "The Modern Data Stack" landscape (understand the ecosystem)
  • Apache project documentation (Kafka, Spark, Flink, Airflow, Iceberg)
  • dbt documentation and best practices

Next Steps

You now have a complete 6-week roadmap for Data Engineer interview preparation. The timeline is compressed because data engineering interviews, while broad, are more predictable than research or ML engineering interviews. The key to success is SQL fluency and system design depth.

If this path does not quite fit your target role, consider:

  • MLE Prep Path -- If your target role involves building and deploying ML models, not just data infrastructure
  • MLOps Prep Path -- If your target role emphasizes ML infrastructure, model serving, and CI/CD for ML
  • Data Scientist Prep Path -- If your target role emphasizes statistical analysis and experimentation over infrastructure

Start today. Open a SQL editor. Write a window function. Design a pipeline on paper. The 6-week clock starts now.

© 2026 EngineersOfAI. All rights reserved.