Data Engineer: 6-Week Prep Path

Reading time: ~45 min | Interview relevance: Critical | Roles: Data Engineer, Analytics Engineer, Data Platform Engineer, Data Infrastructure Engineer

The Real Interview Moment

It is 2:15 PM on a Tuesday. You are in the third round of your on-site at a major tech company. The interviewer opens a shared document and says: "We have a platform that processes 500 million events per day from mobile apps. Events arrive via Kafka, need to be enriched with user profile data, deduplicated, and stored for both real-time dashboards and historical analysis. Design the data pipeline."

You have sixty minutes. She wants to see the end-to-end architecture: ingestion, processing, storage, serving, monitoring, and how you handle late-arriving data, schema evolution, and failures. She will probe your understanding of exactly-once semantics, partitioning strategies, backfill procedures, and cost trade-offs.

This is not about memorizing tools. Any candidate can name Spark, Kafka, and Airflow. What separates the strong candidates is understanding why you would choose one approach over another, how components interact under failure, and what breaks at scale.

The data engineer interview is deceptively broad. It spans SQL (often at a level that surprises candidates), data modeling, distributed systems, pipeline design, coding, and system design. But the scope is manageable in six weeks because the concepts connect logically: if you understand how data flows through systems, the rest follows.

This 6-week plan is compressed but complete. It assumes you have some familiarity with databases and programming. If you are starting from zero, add two weeks of basics at the front.

Role Overview

What Data Engineers Do

Data Engineers build and maintain the infrastructure that enables organizations to collect, store, transform, and serve data. They:

Design and build data pipelines (batch and streaming) that process terabytes to petabytes daily
Model data warehouses and data lakes for analytical and operational use cases
Write complex SQL for transformations, quality checks, and analysis
Build and maintain orchestration workflows (Airflow, Dagster, Prefect)
Optimize query performance and storage costs
Ensure data quality, reliability, and freshness through monitoring and testing
Collaborate with data scientists, analysts, and ML engineers on data needs
Manage schema evolution, data governance, and access control

How It Differs from Other Data Roles

Dimension	Data Engineer	Data Scientist	Analytics Engineer	MLE
Primary focus	Data infrastructure	Statistical analysis	Business data models	ML models in production
Key output	Pipelines, warehouses	Insights, models	dbt models, dashboards	Deployed ML systems
SQL depth	Expert	Proficient	Expert	Moderate
Systems knowledge	Deep	Minimal	Moderate	Moderate
Coding emphasis	Production-grade	Exploratory	SQL-focused	ML-focused
Scale concern	Data volume and velocity	Model accuracy	Query performance	Model latency

Interview Format (Typical)

Round	Duration	Focus
Phone Screen	45-60 min	SQL + coding basics
SQL Round	45-60 min	Complex queries, window functions, optimization
Coding Round	45-60 min	Python/data processing (not heavy DSA)
Data Modeling	45-60 min	Dimensional modeling, schema design, normalization
System Design	60 min	End-to-end pipeline design
Behavioral	45 min	Collaboration, incident response, prioritization

:::tip SQL Is the Make-or-Break Round Unlike ML or software engineering interviews where coding is the main filter, data engineering interviews use SQL as the primary technical screen. Many otherwise strong candidates are eliminated because they cannot write complex window functions, CTEs, or self-joins fluently under pressure. Do not underestimate the SQL round. :::

Focus Area Allocation

Data Engineer Interview Prep Time Allocation - SQL Mastery 25%, System Design 25%, Data Modeling 15%, Coding 15%, Distributed Systems 10%, Behavioral 10%

Breakdown by Skill

SQL Mastery (25% -- ~45 hours total)

Window functions: ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD, running aggregates
CTEs and recursive queries
Self-joins and complex join patterns
Query optimization: EXPLAIN plans, indexing, partitioning, materialized views
Advanced: pivoting, unpivoting, sessionization, gap-and-island problems

System Design (25% -- ~45 hours total)

Batch pipeline architecture: ingestion, transformation, serving
Streaming pipeline design: Kafka, Flink, exactly-once semantics
Data lake and lakehouse architecture: Iceberg, Delta Lake, Hudi
Orchestration: DAG design, dependency management, backfill strategies
Monitoring: data quality, freshness, volume, schema drift

Data Modeling (15% -- ~27 hours total)

Dimensional modeling: star schema, snowflake schema, slowly changing dimensions
Normalization: 1NF through 3NF, BCNF, denormalization trade-offs
Data vault modeling
Schema design for specific use cases (events, time series, hierarchical)

Coding (15% -- ~27 hours total)

Python: data processing with pandas, file I/O, API interaction
Data structures: hash maps, heaps, queues (not heavy graph/DP)
Testing: unit tests for pipelines, data validation
Scripting: ETL scripts, data quality checks

Distributed Systems (10% -- ~18 hours total)

CAP theorem and consistency models
Partitioning and sharding strategies
Replication and fault tolerance
MapReduce paradigm and Spark internals

Behavioral (10% -- ~18 hours total)

Incident response stories (data pipeline failures)
Cross-team collaboration
Prioritization and technical debt management
Data quality and governance experiences

6-Week Schedule Overview

Data Engineer 6-Week Prep Plan - gantt-style schedule: SQL and Data Modeling weeks 1–2, Pipeline Design and Distributed Systems weeks 3–4, System Design and Mock Interviews weeks 5–6

Week 1: Foundations -- SQL Mastery Begins

Goal: Rebuild SQL fluency from basics to advanced window functions. Start data modeling fundamentals.

Daily time: 3.5 hours (weekdays), 5 hours (weekends)

Monday -- SQL Review: Joins, Aggregations, Subqueries

Time	Activity	Details
Morning (60 min)	SQL practice	5 easy/medium SQL problems on LeetCode or HackerRank (INNER JOIN, LEFT JOIN, GROUP BY, HAVING)
Lunch (20 min)	Read	Coding Interviews overview
Evening (90 min)	Study	Join types (INNER, LEFT, RIGHT, FULL, CROSS, SELF), aggregation functions, subqueries vs CTEs, NULL handling
Night (15 min)	Review	Write a cheat sheet of join behavior with NULL values

Practice problems:

Find employees earning more than their manager (self-join)
Department with highest average salary (GROUP BY + subquery)
Customers who never ordered (LEFT JOIN + NULL check)
Second highest salary in each department (subquery approach)

Tuesday -- SQL: Window Functions I

Time	Activity	Details
Morning (60 min)	SQL practice	5 medium SQL problems (window functions)
Lunch (20 min)	Read	Window function syntax and frame specification
Evening (90 min)	Study	ROW_NUMBER, RANK, DENSE_RANK, NTILE, PARTITION BY, ORDER BY, frame clauses (ROWS BETWEEN, RANGE BETWEEN)
Night (15 min)	Review	Write 3 queries using different ranking functions on the same data; explain when each is appropriate

Practice problems:

Rank employees by salary within each department
Find top 3 earners per department
Calculate running total of sales by month
Identify duplicate rows using ROW_NUMBER

:::tip Window Functions Are Interview Gold Over 80% of data engineering SQL rounds include at least one window function problem. Master these patterns cold: ranking within groups, running aggregates, comparing current row to previous/next row (LAG/LEAD), and percentile calculations. If you can write window functions fluently, you are already ahead of most candidates. :::

Wednesday -- SQL: Window Functions II and CTEs

Time	Activity	Details
Morning (60 min)	SQL practice	5 medium/hard SQL problems
Lunch (20 min)	Read	CTE best practices and recursive CTEs
Evening (90 min)	Study	LAG, LEAD, FIRST_VALUE, LAST_VALUE, running averages, cumulative sums, CTEs for readability, recursive CTEs for hierarchical data
Night (15 min)	Practice	Write a recursive CTE to traverse an org chart

Practice problems:

Calculate month-over-month growth rate (LAG)
Find consecutive login days per user (gap-and-island)
Compute 7-day moving average of daily revenue
Traverse an employee hierarchy to find all reports (recursive CTE)

Thursday -- Data Modeling: Normalization

Time	Activity	Details
Morning (60 min)	SQL practice	5 medium SQL problems
Lunch (20 min)	Read	Database normalization forms overview
Evening (90 min)	Study	1NF, 2NF, 3NF, BCNF: definitions, examples, identifying violations. Functional dependencies. When to denormalize and why
Night (15 min)	Practice	Given a flat table, normalize it to 3NF step by step

:::warning Normalization Is Not Just Theory Interviewers often give you a denormalized table and ask you to normalize it, or give you a normalized schema and ask when you would denormalize. You need to be able to do both directions fluently, explaining the trade-offs at each step: data integrity vs query performance, storage efficiency vs join complexity, write performance vs read performance. :::

Friday -- Data Modeling: Dimensional Modeling

Time	Activity	Details
Morning (60 min)	SQL practice	5 medium SQL problems
Lunch (20 min)	Read	Kimball dimensional modeling introduction
Evening (90 min)	Study	Star schema, snowflake schema, fact tables (transaction, periodic, accumulating), dimension tables, surrogate keys, degenerate dimensions
Night (15 min)	Design	Design a star schema for an e-commerce order system

Saturday -- SQL Deep Practice + Slowly Changing Dimensions

Time	Activity	Details
Morning (2 hrs)	SQL practice	8 medium/hard SQL problems covering all concepts from the week
Afternoon (2 hrs)	Study	Slowly Changing Dimensions (SCD Types 1, 2, 3, 4, 6). Implementation patterns for each type. When to use which
Evening (1 hr)	Design	Design a data warehouse schema for a ride-sharing company: trips fact table, driver dimension (SCD Type 2), location dimension

Sunday -- Week 1 Review

Time	Activity	Details
Morning (2 hrs)	SQL review	Re-solve hardest SQL problems from the week; time yourself
Afternoon (2 hrs)	Data modeling review	Practice: given a business scenario, design both an OLTP schema (normalized) and OLAP schema (dimensional)
Evening (1 hr)	Plan	Review Week 2 plan; identify SQL patterns that need more practice

:::note Week 1 Milestone Checkpoint By the end of Week 1, you should be able to:

Write window functions (ROW_NUMBER, RANK, LAG, LEAD) without reference
Use CTEs and recursive CTEs for complex query decomposition
Normalize a table to 3NF and explain each step
Design a star schema for a given business scenario
Explain SCD Types 1, 2, and 3 with implementation examples
Solve medium SQL problems in under 10 minutes :::

Week 2: Foundations -- Advanced SQL and Pipeline Basics

Goal: Master advanced SQL patterns. Begin understanding data pipeline architecture.

Daily time: 3.5 hours (weekdays), 5 hours (weekends)

Monday -- SQL: Sessionization and Gap-and-Island

Time	Activity	Details
Morning (60 min)	SQL practice	5 medium/hard SQL problems
Lunch (20 min)	Read	Sessionization techniques in SQL
Evening (90 min)	Study	Sessionization (grouping events into sessions using time gaps), gap-and-island problems (finding consecutive sequences), LEAD/LAG for boundary detection
Night (15 min)	Practice	Write a sessionization query with a 30-minute timeout

Practice problems:

Group user clickstream into sessions (30-min gap)
Find consecutive days a stock price increased
Identify gaps in a sequence of IDs
Calculate session duration and pages per session

Tuesday -- SQL: Query Optimization

Time	Activity	Details
Morning (60 min)	SQL practice	5 medium/hard SQL problems
Lunch (20 min)	Read	How to read an EXPLAIN plan
Evening (90 min)	Study	EXPLAIN/EXPLAIN ANALYZE, index types (B-tree, hash, GIN, GiST), index selection, partition pruning, query rewriting for performance, common anti-patterns (SELECT *, implicit type casting, functions on indexed columns)
Night (15 min)	Practice	Take a slow query and optimize it using EXPLAIN analysis

:::danger SQL Optimization Questions Are Tricky Interviewers love asking: "This query is slow. How would you fix it?" The trap is jumping to "add an index." A strong answer follows a systematic process: (1) Read the EXPLAIN plan, (2) Identify the bottleneck (full table scan, nested loop join, sort spill to disk), (3) Consider multiple solutions (indexing, query rewrite, materialized view, partitioning), (4) Discuss trade-offs of each approach. :::

Wednesday -- SQL: Advanced Patterns and Anti-Patterns

Time	Activity	Details
Morning (60 min)	SQL practice	5 hard SQL problems
Lunch (20 min)	Read	SQL anti-patterns reference
Evening (90 min)	Study	Pivot/unpivot, GROUPING SETS / CUBE / ROLLUP, correlated subqueries, EXISTS vs IN performance, approximate distinct counts (HyperLogLog), handling large IN lists
Night (15 min)	Practice	Write a single query using GROUPING SETS to produce a multi-level summary

Thursday -- Introduction to Data Pipelines

Time	Activity	Details
Morning (60 min)	Coding practice	2 Python problems (file I/O, data manipulation)
Lunch (20 min)	Read	ETL vs ELT comparison
Evening (90 min)	Study	Pipeline architectures: ETL (extract-transform-load), ELT (extract-load-transform), batch vs streaming, orchestration concepts (DAGs, dependencies, triggers)
Night (15 min)	Draw	Sketch a basic ETL pipeline with failure handling

Friday -- Orchestration: Airflow Fundamentals

Time	Activity	Details
Morning (60 min)	Coding practice	2 Python problems (APIs, error handling)
Lunch (20 min)	Read	Apache Airflow architecture overview
Evening (90 min)	Study	Airflow concepts: DAGs, operators (PythonOperator, BashOperator, SQL operators), sensors, XComs, task dependencies, scheduling, backfilling, idempotency
Night (15 min)	Design	Design an Airflow DAG for a daily reporting pipeline

Saturday -- Python for Data Engineering + Pipeline Practice

Time	Activity	Details
Morning (2 hrs)	Coding	Write a data pipeline in Python: read from CSV, validate schema, transform (clean, deduplicate, enrich), write to output. Include proper error handling and logging
Afternoon (2 hrs)	Study	Data formats: Parquet vs CSV vs Avro vs ORC. Compression: gzip, snappy, zstd. Columnar vs row-based storage trade-offs
Evening (1 hr)	SQL review	Solve 5 hard SQL problems

Sunday -- Week 2 Review

Time	Activity	Details
Morning (2 hrs)	SQL review	Time yourself on 10 SQL problems (mix of medium and hard). Target: medium in 8 min, hard in 15 min
Afternoon (2 hrs)	Pipeline review	Design a pipeline for a specific scenario: daily user activity aggregation from raw event logs
Evening (1 hr)	Plan	List all distributed systems topics needed for Week 3

:::note Week 2 Milestone Checkpoint

Write sessionization queries without reference
Read and interpret EXPLAIN plans to identify bottlenecks
Optimize a slow query using at least 3 different techniques
Explain ETL vs ELT with trade-offs for different scenarios
Design a basic Airflow DAG with proper error handling and idempotency
Solve hard SQL problems in under 15 minutes
Choose between Parquet, Avro, and CSV for different use cases :::

Week 3: Core Skills -- Distributed Systems and Streaming

Goal: Understand the distributed systems concepts that underpin modern data engineering. Learn streaming fundamentals.

Daily time: 4 hours (weekdays), 6 hours (weekends)

Monday -- Distributed Systems Fundamentals

Time	Activity	Details
Morning (60 min)	SQL practice	3 hard SQL problems
Lunch (20 min)	Read	CAP theorem explained
Evening (120 min)	Study	CAP theorem, consistency models (strong, eventual, causal), partitioning strategies (hash, range, round-robin), replication (leader-follower, multi-leader, leaderless)
Night (15 min)	Review	For each data store you know (Postgres, Redis, Cassandra, DynamoDB), identify where it sits on the CAP triangle

Tuesday -- Distributed Storage: HDFS, S3, and Data Lakes

Time	Activity	Details
Morning (60 min)	SQL practice	3 hard SQL problems
Lunch (20 min)	Read	Data lake vs data warehouse vs data lakehouse
Evening (120 min)	Study	HDFS architecture, S3 consistency model, data lake organization (bronze/silver/gold layers), table formats: Apache Iceberg, Delta Lake, Hudi. Partitioning, compaction, time travel
Night (15 min)	Compare	Create a comparison table: Iceberg vs Delta Lake vs Hudi

Wednesday -- Apache Spark Internals

Time	Activity	Details
Morning (60 min)	Coding practice	2 Python problems (data processing)
Lunch (20 min)	Read	Spark architecture overview
Evening (120 min)	Study	Spark architecture: driver, executors, DAG scheduler, stages, tasks. RDDs vs DataFrames vs Datasets. Shuffles, partitions, broadcast joins, skew handling, caching strategies
Night (15 min)	Draw	Diagram the execution of a Spark job with a join and aggregation, showing shuffle boundaries

:::tip Understand Spark Conceptually, Not Just Syntactically Interviewers rarely ask you to write Spark code (though some do). More commonly, they ask: "Your Spark job is running for 6 hours. How do you debug and optimize it?" You need to understand: shuffle operations, data skew, partition count, broadcast join thresholds, spill to disk, and Spark UI interpretation. Conceptual understanding beats syntax memorization. :::

Thursday -- Apache Kafka and Event Streaming

Time	Activity	Details
Morning (60 min)	SQL practice	3 hard SQL problems
Lunch (20 min)	Read	Kafka architecture introduction
Evening (120 min)	Study	Kafka architecture: brokers, topics, partitions, consumer groups, offsets. Delivery guarantees: at-most-once, at-least-once, exactly-once semantics. Schema registry, compacted topics, retention policies
Night (15 min)	Design	Design a Kafka topic strategy for an event-driven e-commerce platform

Friday -- Stream Processing: Concepts and Patterns

Time	Activity	Details
Morning (60 min)	Coding practice	2 Python problems (concurrency, generators)
Lunch (20 min)	Read	Flink vs Spark Streaming vs Kafka Streams comparison
Evening (120 min)	Study	Stream processing concepts: event time vs processing time, watermarks, windowing (tumbling, sliding, session), late data handling, state management, checkpointing
Night (15 min)	Practice	Design a streaming pipeline that computes real-time click-through rates with 5-minute tumbling windows

Saturday -- Data Quality and Testing

Time	Activity	Details
Morning (2.5 hrs)	Study	Data quality dimensions: accuracy, completeness, consistency, timeliness, uniqueness. Data contracts. Testing frameworks: Great Expectations, dbt tests, custom validation. Schema evolution strategies
Afternoon (2 hrs)	Implementation	Write a data quality check suite for a pipeline: null checks, range checks, uniqueness checks, referential integrity, volume anomaly detection
Evening (1.5 hrs)	SQL practice	Solve 5 hard SQL problems

:::warning Data Quality Questions Are Increasing Modern data engineering interviews increasingly include data quality and observability questions. "How do you ensure your pipeline produces correct data?" is not optional. Be ready to discuss: automated testing, data contracts, monitoring, alerting, and root cause analysis when data quality issues arise. Companies like Monte Carlo, Soda, and Elementary have made data observability mainstream. :::

Sunday -- Week 3 Review

Time	Activity	Details
Morning (2 hrs)	Review	Re-draw system architectures from memory: Spark, Kafka, data lake
Afternoon (2.5 hrs)	Practice	Design a streaming pipeline: real-time fraud detection with Kafka and Flink. Include failure handling, late data, and monitoring
Evening (1.5 hrs)	Behavioral prep	Draft 3 STAR stories about pipeline incidents, data quality issues, and cross-team collaboration

:::note Week 3 Milestone Checkpoint

Explain CAP theorem and its implications for data system design
Describe Spark's execution model (DAG, stages, tasks, shuffles)
Explain Kafka's architecture and delivery guarantees
Design a streaming pipeline with proper windowing and late data handling
Articulate data lake layering (bronze/silver/gold) and its purpose
Write comprehensive data quality checks for a pipeline :::

Week 4: Core Skills -- Pipeline System Design and Coding

Goal: Practice end-to-end pipeline system design. Build coding fluency for data engineering interviews.

Daily time: 4 hours (weekdays), 6 hours (weekends)

Monday -- System Design Framework for Data Pipelines

Time	Activity	Details
Morning (60 min)	SQL practice	3 hard SQL problems
Lunch (20 min)	Read	ML System Design overview (adapt framework for data)
Evening (120 min)	Study	Data pipeline design framework: (1) Requirements and constraints, (2) Data sources and ingestion, (3) Storage layer, (4) Processing/transformation, (5) Serving layer, (6) Monitoring and operations
Night (15 min)	Review	Write the framework from memory

:::tip The Data Pipeline Design Framework Use this structure for every pipeline design question:

Clarify requirements -- What are we building? What is the data volume, velocity, and variety? What are the SLAs (freshness, latency, correctness)?
Data sources and ingestion -- Where does data come from? CDC, API, event stream, file upload? How do we handle schema differences?
Storage layer -- Where do we land raw data? What format? How do we partition? Data lake, warehouse, or lakehouse?
Processing -- Batch or streaming? What transformations? How do we handle failures and retries? Idempotency?
Serving layer -- How do consumers access the data? OLAP queries, API, export, real-time dashboard?
Monitoring and operations -- Data quality checks, freshness monitoring, volume alerts, schema drift detection, runbook for incidents. :::

Tuesday -- Design: Event Analytics Pipeline

Time	Activity	Details
Morning (60 min)	SQL practice	3 hard SQL problems
Lunch (20 min)	Read	Event-driven architecture patterns
Evening (120 min)	Practice	Design an event analytics pipeline for a mobile app: 100M events/day, need real-time dashboards (sub-minute) and historical analysis (years of data). Walk through all 6 framework steps
Night (15 min)	Review	Identify single points of failure in your design

Wednesday -- Design: Data Warehouse Migration

Time	Activity	Details
Morning (60 min)	Coding	2 Python data processing problems
Lunch (20 min)	Read	Data warehouse migration case studies
Evening (120 min)	Practice	Design a migration plan: legacy on-prem Oracle data warehouse to cloud (Snowflake/BigQuery/Redshift). Address: parallel running, validation, rollback, data reconciliation, zero-downtime cutover
Night (15 min)	Review	List the top 5 risks and mitigation strategies

Thursday -- Coding: Data Processing in Python

Time	Activity	Details
Morning (60 min)	Coding	2 LeetCode medium (hash maps, arrays)
Lunch (20 min)	Read	Python performance tips for data processing
Evening (120 min)	Implementation	Implement common data engineering tasks in Python: (1) Parse and validate JSON events, (2) Deduplicate records using hash-based approach, (3) Implement a simple schema validator, (4) Write a rate-limited API data fetcher
Night (15 min)	Review	Add error handling and unit tests to each implementation

Friday -- Coding: Data Structures for Data Engineering

Time	Activity	Details
Morning (60 min)	Coding	2 LeetCode medium (queues, heaps)
Lunch (20 min)	Read	Data structures used in data systems (bloom filters, LSM trees, B-trees)
Evening (120 min)	Study and implement	Bloom filters (implement from scratch), consistent hashing (implement), LRU cache (implement). Understand how each is used in data systems
Night (15 min)	Review	Explain when you would use each data structure in a real pipeline

Saturday -- Full System Design Practice

Time	Activity	Details
Morning (2.5 hrs)	Mock	Timed system design: Design a data pipeline for a ride-sharing company (trips, pricing, driver matching analytics). 60 min design + 30 min self-review
Afternoon (2 hrs)	Study	Change Data Capture (CDC): Debezium, log-based CDC, trigger-based CDC, query-based CDC. When to use each approach
Evening (1.5 hrs)	SQL practice	Solve 5 hard SQL problems (timed)

Sunday -- Week 4 Review

Time	Activity	Details
Morning (2 hrs)	Review	Re-do all system designs from the week. Focus on areas where you lacked depth
Afternoon (2.5 hrs)	Coding review	Refactor Python implementations. Add comprehensive tests. Clean up code quality
Evening (1.5 hrs)	Behavioral	Prepare 3 more STAR stories. Practice all 6 stories aloud

:::note Week 4 Milestone Checkpoint

Walk through the data pipeline design framework from memory
Design an event analytics pipeline end-to-end with monitoring
Implement bloom filter and consistent hashing from scratch
Write Python data processing code with proper error handling and testing
Explain CDC approaches and their trade-offs
Design a migration strategy with risk mitigation :::

Week 5: Polish -- Advanced Topics and Mock Interviews

Goal: Cover remaining advanced topics. Begin intensive mock interview practice.

Daily time: 4 hours (weekdays), 6 hours (weekends)

Monday -- Advanced Data Modeling: Data Vault and Activity Schema

Time	Activity	Details
Morning (60 min)	SQL practice	3 hard SQL problems (company-tagged if possible)
Lunch (20 min)	Read	Data vault 2.0 overview
Evening (120 min)	Study	Data vault modeling (hubs, links, satellites), activity schema, one big table (OBT) pattern, wide vs tall table trade-offs. When each modeling approach is appropriate
Night (15 min)	Compare	Create a decision matrix: when to use star schema vs data vault vs OBT

Tuesday -- dbt and Modern Data Stack

Time	Activity	Details
Morning (60 min)	SQL practice	3 hard SQL problems
Lunch (20 min)	Read	dbt (data build tool) introduction
Evening (120 min)	Study	Modern data stack: dbt models (staging, intermediate, marts), testing in dbt, incremental models, snapshots, documentation. ELT patterns with Fivetran/Airbyte + warehouse + dbt
Night (15 min)	Design	Design a dbt project structure for an e-commerce company

:::note dbt Knowledge Is Increasingly Expected Even if the job description does not mention dbt, understanding the ELT paradigm it represents is important. Many interviewers will ask about transformation-in-warehouse approaches, incremental processing, and testing data transformations. If you know dbt concepts, you can speak to these patterns confidently. :::

Wednesday -- Cost Optimization and Performance Tuning

Time	Activity	Details
Morning (60 min)	SQL practice	3 hard SQL problems
Lunch (20 min)	Read	Cloud data warehouse cost models
Evening (120 min)	Study	Cost optimization: warehouse sizing, clustering/sort keys, materialized views, partitioning strategies, query patterns that waste money. Performance tuning: Spark job optimization, query profiling, caching strategies
Night (15 min)	Practice	Given a scenario with a $50K/month Snowflake bill, identify 5 ways to reduce costs

Thursday -- Mock: SQL Round

Time	Activity	Details
Morning (60 min)	Warm-up	2 medium SQL problems
Afternoon (120 min)	Mock	Full SQL round simulation: 4 problems in 50 minutes. Problems covering: window functions, sessionization, complex joins, optimization. Then 70 min to review and re-solve any you missed
Evening (60 min)	Review	Identify SQL patterns that slowed you down; create targeted practice list
Night (15 min)	Plan	Select 5 problems targeting your weakest SQL patterns

Friday -- Mock: System Design Round

Time	Activity	Details
Morning (60 min)	Coding	2 LeetCode medium (maintain fluency)
Lunch (20 min)	Review	Skim pipeline design framework notes
Evening (120 min)	Mock	Timed system design (60 min): Design a real-time recommendation data pipeline. Self-review (60 min): grade yourself on each framework component
Night (15 min)	Debrief	List areas where your design was weakest

Saturday -- Company-Specific Research and Full Mock

Time	Activity	Details
Morning (2.5 hrs)	Research	Deep dive into target company: their data stack (check engineering blog), scale (data volume), challenges, recent data engineering job postings
Afternoon (2 hrs)	Mock	Full mock interview: SQL (45 min) + System Design (45 min). Self-timed, strict
Evening (1.5 hrs)	Behavioral	Practice all STAR stories. Prepare company-specific answers for: "Why this company?" and "Tell me about a time you dealt with data quality issues"

Sunday -- Week 5 Review

Time	Activity	Details
Morning (2 hrs)	SQL marathon	Solve 10 hard SQL problems in 90 minutes (timed)
Afternoon (2.5 hrs)	System design	Redo your weakest system design from the week with improvements
Evening (1.5 hrs)	Preparation	Create one-page cheat sheets: SQL patterns, system design framework, data modeling patterns, distributed systems concepts

:::note Week 5 Milestone Checkpoint

Explain data vault modeling and when to use it
Describe the modern data stack and dbt's role in it
Identify 5+ cost optimization strategies for cloud data warehouses
Complete a SQL round simulation in under 50 minutes
Complete a system design round with all 6 framework components
Have 6+ polished STAR stories ready
Know your target company's data engineering stack and challenges :::

Week 6: Final Week -- Simulation and Confidence

Goal: Final mock interviews, confidence building, and logistics. No new material.

Daily time: 3 hours (weekdays), 4 hours (weekends)

Monday -- Light Practice and Review

Time	Activity	Details
Morning (60 min)	SQL practice	5 medium SQL problems (patterns you know well, for speed)
Lunch (20 min)	Read	Negotiation and Offers overview
Evening (60 min)	Review	Skim all cheat sheets and notes
Night (15 min)	Relax	Mental break

:::warning Do Not Cram in the Final Week Week 6 is about reinforcing and sharpening. If you encounter a concept you have not studied, make a note but do not go deep. Focus on being fluent in what you already know. Cramming new material in the final days increases anxiety and reduces performance. :::

Tuesday -- Full Mock Interview Day

Time	Activity	Details
Morning (60 min)	Warm-up	2 easy SQL problems for confidence
Afternoon (150 min)	Mock	Full loop simulation: SQL (45 min) + Coding (45 min) + System Design (60 min). Strict time limits. No looking at notes
Evening (30 min)	Debrief	Note confidence level for each round. Celebrate progress

Wednesday -- Targeted Review

Time	Activity	Details
Morning (60 min)	SQL practice	3 medium/hard problems targeting your weakest pattern
Lunch (20 min)	Read	Review your system design cheat sheet
Evening (90 min)	Review	Walk through 2 system designs slowly and thoroughly. Practice explaining your decisions out loud
Night (15 min)	Mental rehearsal	Visualize a successful interview day

Thursday -- Behavioral Final Polish

Time	Activity	Details
Morning (60 min)	Practice	Tell all STAR stories aloud. Time each one (aim for 3-4 minutes)
Lunch (20 min)	Read	Target company values and culture
Evening (90 min)	Mock	Final behavioral mock + practice "tell me about yourself" (aim for 2 minutes, covering: background, data engineering experience, why this company, what excites you)
Night (15 min)	Prepare	Write 8 thoughtful questions to ask interviewers

Questions to prepare answers for:

Walk me through a data pipeline you built end-to-end
Tell me about a time your pipeline failed in production. What happened?
How do you ensure data quality across pipelines?
Describe a disagreement with a stakeholder about data modeling
What is the most complex SQL query you have written?
How do you prioritize between building new pipelines and maintaining existing ones?

Friday -- Rest and Logistics

Time	Activity	Details
Morning (30 min)	Logistics	Confirm interview schedule, test video/audio, prepare workspace
Afternoon	Rest	Exercise, relaxation, enjoyable activities
Evening (30 min)	Light review	Skim cheat sheets one final time

Saturday -- Optional Light Practice

Time	Activity	Details
Morning (2 hrs)	Light practice	5 medium SQL problems for flow and confidence
Afternoon (1.5 hrs)	Review	Walk through one system design slowly
Evening (30 min)	Mental prep	Confidence visualization; remind yourself of the work you have put in

Sunday -- Complete Rest

Take the day completely off. Get good sleep. You are ready.

:::note Week 6 Final Assessment

Can solve hard SQL problems in under 15 minutes with clean, readable queries
Can design a data pipeline end-to-end in 60 minutes using the full framework
Can explain distributed systems concepts (CAP, partitioning, replication) clearly
Can discuss data modeling trade-offs (star vs snowflake vs data vault) with examples
Can deliver all STAR stories naturally in 3-4 minutes each
Feel confident (not perfect, but confident) about each interview round
Have prepared thoughtful questions for each interviewer :::

Problem Sets by Week

Cumulative Practice Counts

Week	SQL Problems	Coding Problems	System Designs	Data Models	STAR Stories
1	33	0	0	3	0
2	28	4	1	0	0
3	16	4	1	0	3
4	14	6	3	0	6
5	22	2	3	1	6
6	13	0	1	0	6
Total	126	16	9	4	6

Category	Count	Difficulty Mix
Window Functions (ranking, running aggs)	25	5 medium, 20 hard
Joins (self, outer, cross, complex multi-table)	20	5 easy, 10 medium, 5 hard
Aggregation and Grouping	15	5 easy, 8 medium, 2 hard
CTEs and Recursive Queries	15	5 medium, 10 hard
Sessionization and Gap-and-Island	12	2 medium, 10 hard
Subqueries and EXISTS	12	4 medium, 8 hard
Query Optimization	10	5 medium, 5 hard
Pivoting, GROUPING SETS, Advanced	10	3 medium, 7 hard
Date/Time Manipulation	7	3 medium, 4 hard

SQL Practice Platforms

Platform	Strength	Difficulty Range
LeetCode (Database section)	Standard interview problems	Easy to Hard
HackerRank (SQL domain)	Good progressive difficulty	Easy to Hard
StrataScratch	Real company interview questions	Medium to Hard
DataLemur	Data-specific SQL challenges	Easy to Hard
SQLZoo	Interactive tutorials for basics	Easy to Medium
Mode Analytics SQL Tutorial	Real-world analytical queries	Medium

System Design Problems to Practice

Ordered by Difficulty

Problem	Key Concepts Tested	Difficulty
Daily reporting pipeline from raw logs	ETL basics, scheduling, idempotency	Starter
User activity data warehouse	Dimensional modeling, SCD, partitioning	Starter
Real-time click analytics dashboard	Streaming, Kafka, windowing, serving layer	Intermediate
Event-driven e-commerce pipeline	CDC, event sourcing, exactly-once processing	Intermediate
Multi-source customer 360 pipeline	Data integration, deduplication, identity resolution	Intermediate
Data warehouse migration (on-prem to cloud)	Migration strategy, validation, zero downtime	Advanced
Real-time ML feature pipeline	Feature store, streaming features, serving	Advanced
Global data platform (multi-region)	Replication, consistency, latency, compliance	Advanced
Ad event pipeline at scale (10B events/day)	High throughput, cost optimization, backpressure	Advanced

Mock Interview Schedule

Data Engineer Mock Interview Cadence - timeline of SQL mocks, coding mocks, system design mocks, behavioral mocks, and full loop simulations across the 6-week prep plan

Finding Mock Interview Partners

DataTalks.Club -- Free community with data engineering focus
Data Engineering Slack/Discord -- Active communities for peer practice
interviewing.io -- Anonymous technical interviews
Pramp -- Free peer mock interviews
Friends in data roles -- Ask data engineers to run you through their favorite interview questions

Common Mistakes to Avoid

:::danger Top 7 Data Engineer Interview Mistakes

Writing correct but unreadable SQL. Interviewers read your SQL. A 200-line query with no CTEs, no comments, and cryptic aliases is a red flag. Structure your queries with CTEs, use meaningful aliases, and format consistently. Readability is part of the evaluation.
Not handling edge cases in SQL. NULLs, empty tables, ties in ranking, duplicate records, and division by zero. Many candidates write queries that work on the happy path but break on edge cases. Always ask: "What happens if this column is NULL?"
Naming tools without understanding them. "I would use Kafka for streaming" is not enough. Why Kafka over Kinesis or Pulsar? How many partitions? What is the retention? How do you handle exactly-once? What happens when a consumer falls behind? Depth matters more than breadth.
Ignoring monitoring and failure handling in system design. A pipeline design that only covers the happy path is incomplete. Every system design answer must address: "What happens when this fails?" "How do you detect data quality issues?" "How do you backfill?"
Underestimating the data modeling round. Many candidates with strong coding skills stumble on data modeling because they treat it as "just drawing tables." You need to understand normalization theory, dimensional modeling philosophy, and the trade-offs behind every design decision.
Over-engineering system designs. You do not need Kafka, Spark, Flink, Airflow, dbt, Iceberg, AND a feature store for a pipeline processing 10,000 records per day. Right-sizing your solution to the requirements demonstrates engineering maturity. Start simple, then explain what you would add as scale increases.
Not preparing behavioral stories about incidents. Data pipelines break. Data quality degrades. Stakeholders get incorrect data. Interviewers want to hear how you responded to real incidents: how you detected the issue, communicated with stakeholders, fixed it, and prevented recurrence. Have 2-3 strong incident stories ready. :::

Essential Resources

Handbook Chapters to Prioritize

Priority	Chapter	When to Study
Critical	Coding Interviews	Weeks 1-4
Critical	ML System Design (adapt for data)	Weeks 3-5
High	Behavioral	Weeks 5-6
Medium	Company Guides	Week 5
Medium	ML Fundamentals (skim for context)	Week 4
Low	Negotiation	Week 6

External Resources

Books:

"Designing Data-Intensive Applications" by Martin Kleppmann (the data engineering bible)
"The Data Warehouse Toolkit" by Ralph Kimball (dimensional modeling reference)
"Fundamentals of Data Engineering" by Joe Reis and Matt Housley
"Streaming Systems" by Akidau, Chernyak, and Lax

Courses:

DataTalks.Club Data Engineering Zoomcamp (free, comprehensive)
Stanford CS245 Principles of Data-Intensive Systems
CMU 15-445 Database Systems (query optimization, storage)

Practice Platforms:

LeetCode Database section (SQL practice)
StrataScratch (real interview SQL questions)
DataLemur (SQL and data science problems)
HackerRank SQL domain

Blogs and References:

Target company engineering blogs (always read before interviews)
"The Modern Data Stack" landscape (understand the ecosystem)
Apache project documentation (Kafka, Spark, Flink, Airflow, Iceberg)
dbt documentation and best practices

Next Steps

You now have a complete 6-week roadmap for Data Engineer interview preparation. The timeline is compressed because data engineering interviews, while broad, are more predictable than research or ML engineering interviews. The key to success is SQL fluency and system design depth.

If this path does not quite fit your target role, consider:

MLE Prep Path -- If your target role involves building and deploying ML models, not just data infrastructure
MLOps Prep Path -- If your target role emphasizes ML infrastructure, model serving, and CI/CD for ML
Data Scientist Prep Path -- If your target role emphasizes statistical analysis and experimentation over infrastructure

Start today. Open a SQL editor. Write a window function. Design a pipeline on paper. The 6-week clock starts now.

The Real Interview Moment​

Role Overview​

What Data Engineers Do​

How It Differs from Other Data Roles​

Interview Format (Typical)​

Focus Area Allocation​

Breakdown by Skill​

6-Week Schedule Overview​

Week 1: Foundations -- SQL Mastery Begins​

Monday -- SQL Review: Joins, Aggregations, Subqueries​

Tuesday -- SQL: Window Functions I​

Wednesday -- SQL: Window Functions II and CTEs​

Thursday -- Data Modeling: Normalization​

Friday -- Data Modeling: Dimensional Modeling​

Saturday -- SQL Deep Practice + Slowly Changing Dimensions​

Sunday -- Week 1 Review​

Week 2: Foundations -- Advanced SQL and Pipeline Basics​

Monday -- SQL: Sessionization and Gap-and-Island​

Tuesday -- SQL: Query Optimization​

Wednesday -- SQL: Advanced Patterns and Anti-Patterns​

Thursday -- Introduction to Data Pipelines​

Friday -- Orchestration: Airflow Fundamentals​

Saturday -- Python for Data Engineering + Pipeline Practice​

Sunday -- Week 2 Review​

Week 3: Core Skills -- Distributed Systems and Streaming​

Monday -- Distributed Systems Fundamentals​

Tuesday -- Distributed Storage: HDFS, S3, and Data Lakes​

Wednesday -- Apache Spark Internals​

Thursday -- Apache Kafka and Event Streaming​

Friday -- Stream Processing: Concepts and Patterns​

Saturday -- Data Quality and Testing​

Sunday -- Week 3 Review​

Week 4: Core Skills -- Pipeline System Design and Coding​

Monday -- System Design Framework for Data Pipelines​

Tuesday -- Design: Event Analytics Pipeline​

Wednesday -- Design: Data Warehouse Migration​

Thursday -- Coding: Data Processing in Python​

Friday -- Coding: Data Structures for Data Engineering​

Saturday -- Full System Design Practice​

Sunday -- Week 4 Review​

Week 5: Polish -- Advanced Topics and Mock Interviews​

Monday -- Advanced Data Modeling: Data Vault and Activity Schema​

Tuesday -- dbt and Modern Data Stack​

Wednesday -- Cost Optimization and Performance Tuning​

Thursday -- Mock: SQL Round​

Friday -- Mock: System Design Round​

Saturday -- Company-Specific Research and Full Mock​

Sunday -- Week 5 Review​

Week 6: Final Week -- Simulation and Confidence​

Monday -- Light Practice and Review​

Tuesday -- Full Mock Interview Day​

Wednesday -- Targeted Review​

Thursday -- Behavioral Final Polish​

Friday -- Rest and Logistics​

Saturday -- Optional Light Practice​

Sunday -- Complete Rest​

Problem Sets by Week​

Cumulative Practice Counts​

Recommended SQL Problem Categories​

SQL Practice Platforms​

System Design Problems to Practice​

Ordered by Difficulty​

Mock Interview Schedule​

Finding Mock Interview Partners​

Common Mistakes to Avoid​

Essential Resources​

Handbook Chapters to Prioritize​

External Resources​

Next Steps​

The Real Interview Moment

Role Overview

What Data Engineers Do

How It Differs from Other Data Roles

Interview Format (Typical)

Focus Area Allocation

Breakdown by Skill

6-Week Schedule Overview

Week 1: Foundations -- SQL Mastery Begins

Monday -- SQL Review: Joins, Aggregations, Subqueries

Tuesday -- SQL: Window Functions I

Wednesday -- SQL: Window Functions II and CTEs

Thursday -- Data Modeling: Normalization

Friday -- Data Modeling: Dimensional Modeling

Saturday -- SQL Deep Practice + Slowly Changing Dimensions

Sunday -- Week 1 Review

Week 2: Foundations -- Advanced SQL and Pipeline Basics

Monday -- SQL: Sessionization and Gap-and-Island

Tuesday -- SQL: Query Optimization

Wednesday -- SQL: Advanced Patterns and Anti-Patterns

Thursday -- Introduction to Data Pipelines

Friday -- Orchestration: Airflow Fundamentals

Saturday -- Python for Data Engineering + Pipeline Practice

Sunday -- Week 2 Review

Week 3: Core Skills -- Distributed Systems and Streaming

Monday -- Distributed Systems Fundamentals

Tuesday -- Distributed Storage: HDFS, S3, and Data Lakes

Wednesday -- Apache Spark Internals

Thursday -- Apache Kafka and Event Streaming

Friday -- Stream Processing: Concepts and Patterns

Saturday -- Data Quality and Testing

Sunday -- Week 3 Review

Week 4: Core Skills -- Pipeline System Design and Coding

Monday -- System Design Framework for Data Pipelines

Tuesday -- Design: Event Analytics Pipeline

Wednesday -- Design: Data Warehouse Migration

Thursday -- Coding: Data Processing in Python

Friday -- Coding: Data Structures for Data Engineering

Saturday -- Full System Design Practice

Sunday -- Week 4 Review

Week 5: Polish -- Advanced Topics and Mock Interviews

Monday -- Advanced Data Modeling: Data Vault and Activity Schema

Tuesday -- dbt and Modern Data Stack

Wednesday -- Cost Optimization and Performance Tuning

Thursday -- Mock: SQL Round

Friday -- Mock: System Design Round

Saturday -- Company-Specific Research and Full Mock

Sunday -- Week 5 Review

Week 6: Final Week -- Simulation and Confidence

Monday -- Light Practice and Review

Tuesday -- Full Mock Interview Day

Wednesday -- Targeted Review

Thursday -- Behavioral Final Polish

Friday -- Rest and Logistics

Saturday -- Optional Light Practice

Sunday -- Complete Rest

Problem Sets by Week

Cumulative Practice Counts

Recommended SQL Problem Categories

SQL Practice Platforms

System Design Problems to Practice

Ordered by Difficulty

Mock Interview Schedule

Finding Mock Interview Partners

Common Mistakes to Avoid

Essential Resources

Handbook Chapters to Prioritize

External Resources

Next Steps