Data Engineer: 6-Week Prep Path
Reading time: ~45 min | Interview relevance: Critical | Roles: Data Engineer, Analytics Engineer, Data Platform Engineer, Data Infrastructure Engineer
The Real Interview Moment
It is 2:15 PM on a Tuesday. You are in the third round of your on-site at a major tech company. The interviewer opens a shared document and says: "We have a platform that processes 500 million events per day from mobile apps. Events arrive via Kafka, need to be enriched with user profile data, deduplicated, and stored for both real-time dashboards and historical analysis. Design the data pipeline."
You have sixty minutes. She wants to see the end-to-end architecture: ingestion, processing, storage, serving, monitoring, and how you handle late-arriving data, schema evolution, and failures. She will probe your understanding of exactly-once semantics, partitioning strategies, backfill procedures, and cost trade-offs.
This is not about memorizing tools. Any candidate can name Spark, Kafka, and Airflow. What separates the strong candidates is understanding why you would choose one approach over another, how components interact under failure, and what breaks at scale.
The data engineer interview is deceptively broad. It spans SQL (often at a level that surprises candidates), data modeling, distributed systems, pipeline design, coding, and system design. But the scope is manageable in six weeks because the concepts connect logically: if you understand how data flows through systems, the rest follows.
This 6-week plan is compressed but complete. It assumes you have some familiarity with databases and programming. If you are starting from zero, add two weeks of basics at the front.
Role Overview
What Data Engineers Do
Data Engineers build and maintain the infrastructure that enables organizations to collect, store, transform, and serve data. They:
- Design and build data pipelines (batch and streaming) that process terabytes to petabytes daily
- Model data warehouses and data lakes for analytical and operational use cases
- Write complex SQL for transformations, quality checks, and analysis
- Build and maintain orchestration workflows (Airflow, Dagster, Prefect)
- Optimize query performance and storage costs
- Ensure data quality, reliability, and freshness through monitoring and testing
- Collaborate with data scientists, analysts, and ML engineers on data needs
- Manage schema evolution, data governance, and access control
How It Differs from Other Data Roles
| Dimension | Data Engineer | Data Scientist | Analytics Engineer | MLE |
|---|---|---|---|---|
| Primary focus | Data infrastructure | Statistical analysis | Business data models | ML models in production |
| Key output | Pipelines, warehouses | Insights, models | dbt models, dashboards | Deployed ML systems |
| SQL depth | Expert | Proficient | Expert | Moderate |
| Systems knowledge | Deep | Minimal | Moderate | Moderate |
| Coding emphasis | Production-grade | Exploratory | SQL-focused | ML-focused |
| Scale concern | Data volume and velocity | Model accuracy | Query performance | Model latency |
Interview Format (Typical)
| Round | Duration | Focus |
|---|---|---|
| Phone Screen | 45-60 min | SQL + coding basics |
| SQL Round | 45-60 min | Complex queries, window functions, optimization |
| Coding Round | 45-60 min | Python/data processing (not heavy DSA) |
| Data Modeling | 45-60 min | Dimensional modeling, schema design, normalization |
| System Design | 60 min | End-to-end pipeline design |
| Behavioral | 45 min | Collaboration, incident response, prioritization |
:::tip SQL Is the Make-or-Break Round Unlike ML or software engineering interviews where coding is the main filter, data engineering interviews use SQL as the primary technical screen. Many otherwise strong candidates are eliminated because they cannot write complex window functions, CTEs, or self-joins fluently under pressure. Do not underestimate the SQL round. :::
Focus Area Allocation
Breakdown by Skill
SQL Mastery (25% -- ~45 hours total)
- Window functions: ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD, running aggregates
- CTEs and recursive queries
- Self-joins and complex join patterns
- Query optimization: EXPLAIN plans, indexing, partitioning, materialized views
- Advanced: pivoting, unpivoting, sessionization, gap-and-island problems
System Design (25% -- ~45 hours total)
- Batch pipeline architecture: ingestion, transformation, serving
- Streaming pipeline design: Kafka, Flink, exactly-once semantics
- Data lake and lakehouse architecture: Iceberg, Delta Lake, Hudi
- Orchestration: DAG design, dependency management, backfill strategies
- Monitoring: data quality, freshness, volume, schema drift
Data Modeling (15% -- ~27 hours total)
- Dimensional modeling: star schema, snowflake schema, slowly changing dimensions
- Normalization: 1NF through 3NF, BCNF, denormalization trade-offs
- Data vault modeling
- Schema design for specific use cases (events, time series, hierarchical)
Coding (15% -- ~27 hours total)
- Python: data processing with pandas, file I/O, API interaction
- Data structures: hash maps, heaps, queues (not heavy graph/DP)
- Testing: unit tests for pipelines, data validation
- Scripting: ETL scripts, data quality checks
Distributed Systems (10% -- ~18 hours total)
- CAP theorem and consistency models
- Partitioning and sharding strategies
- Replication and fault tolerance
- MapReduce paradigm and Spark internals
Behavioral (10% -- ~18 hours total)
- Incident response stories (data pipeline failures)
- Cross-team collaboration
- Prioritization and technical debt management
- Data quality and governance experiences
6-Week Schedule Overview
Week 1: Foundations -- SQL Mastery Begins
Goal: Rebuild SQL fluency from basics to advanced window functions. Start data modeling fundamentals.
Daily time: 3.5 hours (weekdays), 5 hours (weekends)
Monday -- SQL Review: Joins, Aggregations, Subqueries
| Time | Activity | Details |
|---|---|---|
| Morning (60 min) | SQL practice | 5 easy/medium SQL problems on LeetCode or HackerRank (INNER JOIN, LEFT JOIN, GROUP BY, HAVING) |
| Lunch (20 min) | Read | Coding Interviews overview |
| Evening (90 min) | Study | Join types (INNER, LEFT, RIGHT, FULL, CROSS, SELF), aggregation functions, subqueries vs CTEs, NULL handling |
| Night (15 min) | Review | Write a cheat sheet of join behavior with NULL values |
Practice problems:
- Find employees earning more than their manager (self-join)
- Department with highest average salary (GROUP BY + subquery)
- Customers who never ordered (LEFT JOIN + NULL check)
- Second highest salary in each department (subquery approach)
Tuesday -- SQL: Window Functions I
| Time | Activity | Details |
|---|---|---|
| Morning (60 min) | SQL practice | 5 medium SQL problems (window functions) |
| Lunch (20 min) | Read | Window function syntax and frame specification |
| Evening (90 min) | Study | ROW_NUMBER, RANK, DENSE_RANK, NTILE, PARTITION BY, ORDER BY, frame clauses (ROWS BETWEEN, RANGE BETWEEN) |
| Night (15 min) | Review | Write 3 queries using different ranking functions on the same data; explain when each is appropriate |
Practice problems:
- Rank employees by salary within each department
- Find top 3 earners per department
- Calculate running total of sales by month
- Identify duplicate rows using ROW_NUMBER
:::tip Window Functions Are Interview Gold Over 80% of data engineering SQL rounds include at least one window function problem. Master these patterns cold: ranking within groups, running aggregates, comparing current row to previous/next row (LAG/LEAD), and percentile calculations. If you can write window functions fluently, you are already ahead of most candidates. :::
Wednesday -- SQL: Window Functions II and CTEs
| Time | Activity | Details |
|---|---|---|
| Morning (60 min) | SQL practice | 5 medium/hard SQL problems |
| Lunch (20 min) | Read | CTE best practices and recursive CTEs |
| Evening (90 min) | Study | LAG, LEAD, FIRST_VALUE, LAST_VALUE, running averages, cumulative sums, CTEs for readability, recursive CTEs for hierarchical data |
| Night (15 min) | Practice | Write a recursive CTE to traverse an org chart |
Practice problems:
- Calculate month-over-month growth rate (LAG)
- Find consecutive login days per user (gap-and-island)
- Compute 7-day moving average of daily revenue
- Traverse an employee hierarchy to find all reports (recursive CTE)
Thursday -- Data Modeling: Normalization
| Time | Activity | Details |
|---|---|---|
| Morning (60 min) | SQL practice | 5 medium SQL problems |
| Lunch (20 min) | Read | Database normalization forms overview |
| Evening (90 min) | Study | 1NF, 2NF, 3NF, BCNF: definitions, examples, identifying violations. Functional dependencies. When to denormalize and why |
| Night (15 min) | Practice | Given a flat table, normalize it to 3NF step by step |
:::warning Normalization Is Not Just Theory Interviewers often give you a denormalized table and ask you to normalize it, or give you a normalized schema and ask when you would denormalize. You need to be able to do both directions fluently, explaining the trade-offs at each step: data integrity vs query performance, storage efficiency vs join complexity, write performance vs read performance. :::
Friday -- Data Modeling: Dimensional Modeling
| Time | Activity | Details |
|---|---|---|
| Morning (60 min) | SQL practice | 5 medium SQL problems |
| Lunch (20 min) | Read | Kimball dimensional modeling introduction |
| Evening (90 min) | Study | Star schema, snowflake schema, fact tables (transaction, periodic, accumulating), dimension tables, surrogate keys, degenerate dimensions |
| Night (15 min) | Design | Design a star schema for an e-commerce order system |
Saturday -- SQL Deep Practice + Slowly Changing Dimensions
| Time | Activity | Details |
|---|---|---|
| Morning (2 hrs) | SQL practice | 8 medium/hard SQL problems covering all concepts from the week |
| Afternoon (2 hrs) | Study | Slowly Changing Dimensions (SCD Types 1, 2, 3, 4, 6). Implementation patterns for each type. When to use which |
| Evening (1 hr) | Design | Design a data warehouse schema for a ride-sharing company: trips fact table, driver dimension (SCD Type 2), location dimension |
Sunday -- Week 1 Review
| Time | Activity | Details |
|---|---|---|
| Morning (2 hrs) | SQL review | Re-solve hardest SQL problems from the week; time yourself |
| Afternoon (2 hrs) | Data modeling review | Practice: given a business scenario, design both an OLTP schema (normalized) and OLAP schema (dimensional) |
| Evening (1 hr) | Plan | Review Week 2 plan; identify SQL patterns that need more practice |
:::note Week 1 Milestone Checkpoint By the end of Week 1, you should be able to:
- Write window functions (ROW_NUMBER, RANK, LAG, LEAD) without reference
- Use CTEs and recursive CTEs for complex query decomposition
- Normalize a table to 3NF and explain each step
- Design a star schema for a given business scenario
- Explain SCD Types 1, 2, and 3 with implementation examples
- Solve medium SQL problems in under 10 minutes :::
Week 2: Foundations -- Advanced SQL and Pipeline Basics
Goal: Master advanced SQL patterns. Begin understanding data pipeline architecture.
Daily time: 3.5 hours (weekdays), 5 hours (weekends)
Monday -- SQL: Sessionization and Gap-and-Island
| Time | Activity | Details |
|---|---|---|
| Morning (60 min) | SQL practice | 5 medium/hard SQL problems |
| Lunch (20 min) | Read | Sessionization techniques in SQL |
| Evening (90 min) | Study | Sessionization (grouping events into sessions using time gaps), gap-and-island problems (finding consecutive sequences), LEAD/LAG for boundary detection |
| Night (15 min) | Practice | Write a sessionization query with a 30-minute timeout |
Practice problems:
- Group user clickstream into sessions (30-min gap)
- Find consecutive days a stock price increased
- Identify gaps in a sequence of IDs
- Calculate session duration and pages per session
Tuesday -- SQL: Query Optimization
| Time | Activity | Details |
|---|---|---|
| Morning (60 min) | SQL practice | 5 medium/hard SQL problems |
| Lunch (20 min) | Read | How to read an EXPLAIN plan |
| Evening (90 min) | Study | EXPLAIN/EXPLAIN ANALYZE, index types (B-tree, hash, GIN, GiST), index selection, partition pruning, query rewriting for performance, common anti-patterns (SELECT *, implicit type casting, functions on indexed columns) |
| Night (15 min) | Practice | Take a slow query and optimize it using EXPLAIN analysis |
:::danger SQL Optimization Questions Are Tricky Interviewers love asking: "This query is slow. How would you fix it?" The trap is jumping to "add an index." A strong answer follows a systematic process: (1) Read the EXPLAIN plan, (2) Identify the bottleneck (full table scan, nested loop join, sort spill to disk), (3) Consider multiple solutions (indexing, query rewrite, materialized view, partitioning), (4) Discuss trade-offs of each approach. :::
Wednesday -- SQL: Advanced Patterns and Anti-Patterns
| Time | Activity | Details |
|---|---|---|
| Morning (60 min) | SQL practice | 5 hard SQL problems |
| Lunch (20 min) | Read | SQL anti-patterns reference |
| Evening (90 min) | Study | Pivot/unpivot, GROUPING SETS / CUBE / ROLLUP, correlated subqueries, EXISTS vs IN performance, approximate distinct counts (HyperLogLog), handling large IN lists |
| Night (15 min) | Practice | Write a single query using GROUPING SETS to produce a multi-level summary |
Thursday -- Introduction to Data Pipelines
| Time | Activity | Details |
|---|---|---|
| Morning (60 min) | Coding practice | 2 Python problems (file I/O, data manipulation) |
| Lunch (20 min) | Read | ETL vs ELT comparison |
| Evening (90 min) | Study | Pipeline architectures: ETL (extract-transform-load), ELT (extract-load-transform), batch vs streaming, orchestration concepts (DAGs, dependencies, triggers) |
| Night (15 min) | Draw | Sketch a basic ETL pipeline with failure handling |
Friday -- Orchestration: Airflow Fundamentals
| Time | Activity | Details |
|---|---|---|
| Morning (60 min) | Coding practice | 2 Python problems (APIs, error handling) |
| Lunch (20 min) | Read | Apache Airflow architecture overview |
| Evening (90 min) | Study | Airflow concepts: DAGs, operators (PythonOperator, BashOperator, SQL operators), sensors, XComs, task dependencies, scheduling, backfilling, idempotency |
| Night (15 min) | Design | Design an Airflow DAG for a daily reporting pipeline |
Saturday -- Python for Data Engineering + Pipeline Practice
| Time | Activity | Details |
|---|---|---|
| Morning (2 hrs) | Coding | Write a data pipeline in Python: read from CSV, validate schema, transform (clean, deduplicate, enrich), write to output. Include proper error handling and logging |
| Afternoon (2 hrs) | Study | Data formats: Parquet vs CSV vs Avro vs ORC. Compression: gzip, snappy, zstd. Columnar vs row-based storage trade-offs |
| Evening (1 hr) | SQL review | Solve 5 hard SQL problems |
Sunday -- Week 2 Review
| Time | Activity | Details |
|---|---|---|
| Morning (2 hrs) | SQL review | Time yourself on 10 SQL problems (mix of medium and hard). Target: medium in 8 min, hard in 15 min |
| Afternoon (2 hrs) | Pipeline review | Design a pipeline for a specific scenario: daily user activity aggregation from raw event logs |
| Evening (1 hr) | Plan | List all distributed systems topics needed for Week 3 |
:::note Week 2 Milestone Checkpoint
- Write sessionization queries without reference
- Read and interpret EXPLAIN plans to identify bottlenecks
- Optimize a slow query using at least 3 different techniques
- Explain ETL vs ELT with trade-offs for different scenarios
- Design a basic Airflow DAG with proper error handling and idempotency
- Solve hard SQL problems in under 15 minutes
- Choose between Parquet, Avro, and CSV for different use cases :::
Week 3: Core Skills -- Distributed Systems and Streaming
Goal: Understand the distributed systems concepts that underpin modern data engineering. Learn streaming fundamentals.
Daily time: 4 hours (weekdays), 6 hours (weekends)
Monday -- Distributed Systems Fundamentals
| Time | Activity | Details |
|---|---|---|
| Morning (60 min) | SQL practice | 3 hard SQL problems |
| Lunch (20 min) | Read | CAP theorem explained |
| Evening (120 min) | Study | CAP theorem, consistency models (strong, eventual, causal), partitioning strategies (hash, range, round-robin), replication (leader-follower, multi-leader, leaderless) |
| Night (15 min) | Review | For each data store you know (Postgres, Redis, Cassandra, DynamoDB), identify where it sits on the CAP triangle |
Tuesday -- Distributed Storage: HDFS, S3, and Data Lakes
| Time | Activity | Details |
|---|---|---|
| Morning (60 min) | SQL practice | 3 hard SQL problems |
| Lunch (20 min) | Read | Data lake vs data warehouse vs data lakehouse |
| Evening (120 min) | Study | HDFS architecture, S3 consistency model, data lake organization (bronze/silver/gold layers), table formats: Apache Iceberg, Delta Lake, Hudi. Partitioning, compaction, time travel |
| Night (15 min) | Compare | Create a comparison table: Iceberg vs Delta Lake vs Hudi |
Wednesday -- Apache Spark Internals
| Time | Activity | Details |
|---|---|---|
| Morning (60 min) | Coding practice | 2 Python problems (data processing) |
| Lunch (20 min) | Read | Spark architecture overview |
| Evening (120 min) | Study | Spark architecture: driver, executors, DAG scheduler, stages, tasks. RDDs vs DataFrames vs Datasets. Shuffles, partitions, broadcast joins, skew handling, caching strategies |
| Night (15 min) | Draw | Diagram the execution of a Spark job with a join and aggregation, showing shuffle boundaries |
:::tip Understand Spark Conceptually, Not Just Syntactically Interviewers rarely ask you to write Spark code (though some do). More commonly, they ask: "Your Spark job is running for 6 hours. How do you debug and optimize it?" You need to understand: shuffle operations, data skew, partition count, broadcast join thresholds, spill to disk, and Spark UI interpretation. Conceptual understanding beats syntax memorization. :::
Thursday -- Apache Kafka and Event Streaming
| Time | Activity | Details |
|---|---|---|
| Morning (60 min) | SQL practice | 3 hard SQL problems |
| Lunch (20 min) | Read | Kafka architecture introduction |
| Evening (120 min) | Study | Kafka architecture: brokers, topics, partitions, consumer groups, offsets. Delivery guarantees: at-most-once, at-least-once, exactly-once semantics. Schema registry, compacted topics, retention policies |
| Night (15 min) | Design | Design a Kafka topic strategy for an event-driven e-commerce platform |
Friday -- Stream Processing: Concepts and Patterns
| Time | Activity | Details |
|---|---|---|
| Morning (60 min) | Coding practice | 2 Python problems (concurrency, generators) |
| Lunch (20 min) | Read | Flink vs Spark Streaming vs Kafka Streams comparison |
| Evening (120 min) | Study | Stream processing concepts: event time vs processing time, watermarks, windowing (tumbling, sliding, session), late data handling, state management, checkpointing |
| Night (15 min) | Practice | Design a streaming pipeline that computes real-time click-through rates with 5-minute tumbling windows |
Saturday -- Data Quality and Testing
| Time | Activity | Details |
|---|---|---|
| Morning (2.5 hrs) | Study | Data quality dimensions: accuracy, completeness, consistency, timeliness, uniqueness. Data contracts. Testing frameworks: Great Expectations, dbt tests, custom validation. Schema evolution strategies |
| Afternoon (2 hrs) | Implementation | Write a data quality check suite for a pipeline: null checks, range checks, uniqueness checks, referential integrity, volume anomaly detection |
| Evening (1.5 hrs) | SQL practice | Solve 5 hard SQL problems |
:::warning Data Quality Questions Are Increasing Modern data engineering interviews increasingly include data quality and observability questions. "How do you ensure your pipeline produces correct data?" is not optional. Be ready to discuss: automated testing, data contracts, monitoring, alerting, and root cause analysis when data quality issues arise. Companies like Monte Carlo, Soda, and Elementary have made data observability mainstream. :::
Sunday -- Week 3 Review
| Time | Activity | Details |
|---|---|---|
| Morning (2 hrs) | Review | Re-draw system architectures from memory: Spark, Kafka, data lake |
| Afternoon (2.5 hrs) | Practice | Design a streaming pipeline: real-time fraud detection with Kafka and Flink. Include failure handling, late data, and monitoring |
| Evening (1.5 hrs) | Behavioral prep | Draft 3 STAR stories about pipeline incidents, data quality issues, and cross-team collaboration |
:::note Week 3 Milestone Checkpoint
- Explain CAP theorem and its implications for data system design
- Describe Spark's execution model (DAG, stages, tasks, shuffles)
- Explain Kafka's architecture and delivery guarantees
- Design a streaming pipeline with proper windowing and late data handling
- Articulate data lake layering (bronze/silver/gold) and its purpose
- Write comprehensive data quality checks for a pipeline :::
Week 4: Core Skills -- Pipeline System Design and Coding
Goal: Practice end-to-end pipeline system design. Build coding fluency for data engineering interviews.
Daily time: 4 hours (weekdays), 6 hours (weekends)
Monday -- System Design Framework for Data Pipelines
| Time | Activity | Details |
|---|---|---|
| Morning (60 min) | SQL practice | 3 hard SQL problems |
| Lunch (20 min) | Read | ML System Design overview (adapt framework for data) |
| Evening (120 min) | Study | Data pipeline design framework: (1) Requirements and constraints, (2) Data sources and ingestion, (3) Storage layer, (4) Processing/transformation, (5) Serving layer, (6) Monitoring and operations |
| Night (15 min) | Review | Write the framework from memory |
:::tip The Data Pipeline Design Framework Use this structure for every pipeline design question:
- Clarify requirements -- What are we building? What is the data volume, velocity, and variety? What are the SLAs (freshness, latency, correctness)?
- Data sources and ingestion -- Where does data come from? CDC, API, event stream, file upload? How do we handle schema differences?
- Storage layer -- Where do we land raw data? What format? How do we partition? Data lake, warehouse, or lakehouse?
- Processing -- Batch or streaming? What transformations? How do we handle failures and retries? Idempotency?
- Serving layer -- How do consumers access the data? OLAP queries, API, export, real-time dashboard?
- Monitoring and operations -- Data quality checks, freshness monitoring, volume alerts, schema drift detection, runbook for incidents. :::
Tuesday -- Design: Event Analytics Pipeline
| Time | Activity | Details |
|---|---|---|
| Morning (60 min) | SQL practice | 3 hard SQL problems |
| Lunch (20 min) | Read | Event-driven architecture patterns |
| Evening (120 min) | Practice | Design an event analytics pipeline for a mobile app: 100M events/day, need real-time dashboards (sub-minute) and historical analysis (years of data). Walk through all 6 framework steps |
| Night (15 min) | Review | Identify single points of failure in your design |
Wednesday -- Design: Data Warehouse Migration
| Time | Activity | Details |
|---|---|---|
| Morning (60 min) | Coding | 2 Python data processing problems |
| Lunch (20 min) | Read | Data warehouse migration case studies |
| Evening (120 min) | Practice | Design a migration plan: legacy on-prem Oracle data warehouse to cloud (Snowflake/BigQuery/Redshift). Address: parallel running, validation, rollback, data reconciliation, zero-downtime cutover |
| Night (15 min) | Review | List the top 5 risks and mitigation strategies |
Thursday -- Coding: Data Processing in Python
| Time | Activity | Details |
|---|---|---|
| Morning (60 min) | Coding | 2 LeetCode medium (hash maps, arrays) |
| Lunch (20 min) | Read | Python performance tips for data processing |
| Evening (120 min) | Implementation | Implement common data engineering tasks in Python: (1) Parse and validate JSON events, (2) Deduplicate records using hash-based approach, (3) Implement a simple schema validator, (4) Write a rate-limited API data fetcher |
| Night (15 min) | Review | Add error handling and unit tests to each implementation |
Friday -- Coding: Data Structures for Data Engineering
| Time | Activity | Details |
|---|---|---|
| Morning (60 min) | Coding | 2 LeetCode medium (queues, heaps) |
| Lunch (20 min) | Read | Data structures used in data systems (bloom filters, LSM trees, B-trees) |
| Evening (120 min) | Study and implement | Bloom filters (implement from scratch), consistent hashing (implement), LRU cache (implement). Understand how each is used in data systems |
| Night (15 min) | Review | Explain when you would use each data structure in a real pipeline |
Saturday -- Full System Design Practice
| Time | Activity | Details |
|---|---|---|
| Morning (2.5 hrs) | Mock | Timed system design: Design a data pipeline for a ride-sharing company (trips, pricing, driver matching analytics). 60 min design + 30 min self-review |
| Afternoon (2 hrs) | Study | Change Data Capture (CDC): Debezium, log-based CDC, trigger-based CDC, query-based CDC. When to use each approach |
| Evening (1.5 hrs) | SQL practice | Solve 5 hard SQL problems (timed) |
Sunday -- Week 4 Review
| Time | Activity | Details |
|---|---|---|
| Morning (2 hrs) | Review | Re-do all system designs from the week. Focus on areas where you lacked depth |
| Afternoon (2.5 hrs) | Coding review | Refactor Python implementations. Add comprehensive tests. Clean up code quality |
| Evening (1.5 hrs) | Behavioral | Prepare 3 more STAR stories. Practice all 6 stories aloud |
:::note Week 4 Milestone Checkpoint
- Walk through the data pipeline design framework from memory
- Design an event analytics pipeline end-to-end with monitoring
- Implement bloom filter and consistent hashing from scratch
- Write Python data processing code with proper error handling and testing
- Explain CDC approaches and their trade-offs
- Design a migration strategy with risk mitigation :::
Week 5: Polish -- Advanced Topics and Mock Interviews
Goal: Cover remaining advanced topics. Begin intensive mock interview practice.
Daily time: 4 hours (weekdays), 6 hours (weekends)
Monday -- Advanced Data Modeling: Data Vault and Activity Schema
| Time | Activity | Details |
|---|---|---|
| Morning (60 min) | SQL practice | 3 hard SQL problems (company-tagged if possible) |
| Lunch (20 min) | Read | Data vault 2.0 overview |
| Evening (120 min) | Study | Data vault modeling (hubs, links, satellites), activity schema, one big table (OBT) pattern, wide vs tall table trade-offs. When each modeling approach is appropriate |
| Night (15 min) | Compare | Create a decision matrix: when to use star schema vs data vault vs OBT |
Tuesday -- dbt and Modern Data Stack
| Time | Activity | Details |
|---|---|---|
| Morning (60 min) | SQL practice | 3 hard SQL problems |
| Lunch (20 min) | Read | dbt (data build tool) introduction |
| Evening (120 min) | Study | Modern data stack: dbt models (staging, intermediate, marts), testing in dbt, incremental models, snapshots, documentation. ELT patterns with Fivetran/Airbyte + warehouse + dbt |
| Night (15 min) | Design | Design a dbt project structure for an e-commerce company |
:::note dbt Knowledge Is Increasingly Expected Even if the job description does not mention dbt, understanding the ELT paradigm it represents is important. Many interviewers will ask about transformation-in-warehouse approaches, incremental processing, and testing data transformations. If you know dbt concepts, you can speak to these patterns confidently. :::
Wednesday -- Cost Optimization and Performance Tuning
| Time | Activity | Details |
|---|---|---|
| Morning (60 min) | SQL practice | 3 hard SQL problems |
| Lunch (20 min) | Read | Cloud data warehouse cost models |
| Evening (120 min) | Study | Cost optimization: warehouse sizing, clustering/sort keys, materialized views, partitioning strategies, query patterns that waste money. Performance tuning: Spark job optimization, query profiling, caching strategies |
| Night (15 min) | Practice | Given a scenario with a $50K/month Snowflake bill, identify 5 ways to reduce costs |
Thursday -- Mock: SQL Round
| Time | Activity | Details |
|---|---|---|
| Morning (60 min) | Warm-up | 2 medium SQL problems |
| Afternoon (120 min) | Mock | Full SQL round simulation: 4 problems in 50 minutes. Problems covering: window functions, sessionization, complex joins, optimization. Then 70 min to review and re-solve any you missed |
| Evening (60 min) | Review | Identify SQL patterns that slowed you down; create targeted practice list |
| Night (15 min) | Plan | Select 5 problems targeting your weakest SQL patterns |
Friday -- Mock: System Design Round
| Time | Activity | Details |
|---|---|---|
| Morning (60 min) | Coding | 2 LeetCode medium (maintain fluency) |
| Lunch (20 min) | Review | Skim pipeline design framework notes |
| Evening (120 min) | Mock | Timed system design (60 min): Design a real-time recommendation data pipeline. Self-review (60 min): grade yourself on each framework component |
| Night (15 min) | Debrief | List areas where your design was weakest |
Saturday -- Company-Specific Research and Full Mock
| Time | Activity | Details |
|---|---|---|
| Morning (2.5 hrs) | Research | Deep dive into target company: their data stack (check engineering blog), scale (data volume), challenges, recent data engineering job postings |
| Afternoon (2 hrs) | Mock | Full mock interview: SQL (45 min) + System Design (45 min). Self-timed, strict |
| Evening (1.5 hrs) | Behavioral | Practice all STAR stories. Prepare company-specific answers for: "Why this company?" and "Tell me about a time you dealt with data quality issues" |
Sunday -- Week 5 Review
| Time | Activity | Details |
|---|---|---|
| Morning (2 hrs) | SQL marathon | Solve 10 hard SQL problems in 90 minutes (timed) |
| Afternoon (2.5 hrs) | System design | Redo your weakest system design from the week with improvements |
| Evening (1.5 hrs) | Preparation | Create one-page cheat sheets: SQL patterns, system design framework, data modeling patterns, distributed systems concepts |
:::note Week 5 Milestone Checkpoint
- Explain data vault modeling and when to use it
- Describe the modern data stack and dbt's role in it
- Identify 5+ cost optimization strategies for cloud data warehouses
- Complete a SQL round simulation in under 50 minutes
- Complete a system design round with all 6 framework components
- Have 6+ polished STAR stories ready
- Know your target company's data engineering stack and challenges :::
Week 6: Final Week -- Simulation and Confidence
Goal: Final mock interviews, confidence building, and logistics. No new material.
Daily time: 3 hours (weekdays), 4 hours (weekends)
Monday -- Light Practice and Review
| Time | Activity | Details |
|---|---|---|
| Morning (60 min) | SQL practice | 5 medium SQL problems (patterns you know well, for speed) |
| Lunch (20 min) | Read | Negotiation and Offers overview |
| Evening (60 min) | Review | Skim all cheat sheets and notes |
| Night (15 min) | Relax | Mental break |
:::warning Do Not Cram in the Final Week Week 6 is about reinforcing and sharpening. If you encounter a concept you have not studied, make a note but do not go deep. Focus on being fluent in what you already know. Cramming new material in the final days increases anxiety and reduces performance. :::
Tuesday -- Full Mock Interview Day
| Time | Activity | Details |
|---|---|---|
| Morning (60 min) | Warm-up | 2 easy SQL problems for confidence |
| Afternoon (150 min) | Mock | Full loop simulation: SQL (45 min) + Coding (45 min) + System Design (60 min). Strict time limits. No looking at notes |
| Evening (30 min) | Debrief | Note confidence level for each round. Celebrate progress |
Wednesday -- Targeted Review
| Time | Activity | Details |
|---|---|---|
| Morning (60 min) | SQL practice | 3 medium/hard problems targeting your weakest pattern |
| Lunch (20 min) | Read | Review your system design cheat sheet |
| Evening (90 min) | Review | Walk through 2 system designs slowly and thoroughly. Practice explaining your decisions out loud |
| Night (15 min) | Mental rehearsal | Visualize a successful interview day |
Thursday -- Behavioral Final Polish
| Time | Activity | Details |
|---|---|---|
| Morning (60 min) | Practice | Tell all STAR stories aloud. Time each one (aim for 3-4 minutes) |
| Lunch (20 min) | Read | Target company values and culture |
| Evening (90 min) | Mock | Final behavioral mock + practice "tell me about yourself" (aim for 2 minutes, covering: background, data engineering experience, why this company, what excites you) |
| Night (15 min) | Prepare | Write 8 thoughtful questions to ask interviewers |
Questions to prepare answers for:
- Walk me through a data pipeline you built end-to-end
- Tell me about a time your pipeline failed in production. What happened?
- How do you ensure data quality across pipelines?
- Describe a disagreement with a stakeholder about data modeling
- What is the most complex SQL query you have written?
- How do you prioritize between building new pipelines and maintaining existing ones?
Friday -- Rest and Logistics
| Time | Activity | Details |
|---|---|---|
| Morning (30 min) | Logistics | Confirm interview schedule, test video/audio, prepare workspace |
| Afternoon | Rest | Exercise, relaxation, enjoyable activities |
| Evening (30 min) | Light review | Skim cheat sheets one final time |
Saturday -- Optional Light Practice
| Time | Activity | Details |
|---|---|---|
| Morning (2 hrs) | Light practice | 5 medium SQL problems for flow and confidence |
| Afternoon (1.5 hrs) | Review | Walk through one system design slowly |
| Evening (30 min) | Mental prep | Confidence visualization; remind yourself of the work you have put in |
Sunday -- Complete Rest
Take the day completely off. Get good sleep. You are ready.
:::note Week 6 Final Assessment
- Can solve hard SQL problems in under 15 minutes with clean, readable queries
- Can design a data pipeline end-to-end in 60 minutes using the full framework
- Can explain distributed systems concepts (CAP, partitioning, replication) clearly
- Can discuss data modeling trade-offs (star vs snowflake vs data vault) with examples
- Can deliver all STAR stories naturally in 3-4 minutes each
- Feel confident (not perfect, but confident) about each interview round
- Have prepared thoughtful questions for each interviewer :::
Problem Sets by Week
Cumulative Practice Counts
| Week | SQL Problems | Coding Problems | System Designs | Data Models | STAR Stories |
|---|---|---|---|---|---|
| 1 | 33 | 0 | 0 | 3 | 0 |
| 2 | 28 | 4 | 1 | 0 | 0 |
| 3 | 16 | 4 | 1 | 0 | 3 |
| 4 | 14 | 6 | 3 | 0 | 6 |
| 5 | 22 | 2 | 3 | 1 | 6 |
| 6 | 13 | 0 | 1 | 0 | 6 |
| Total | 126 | 16 | 9 | 4 | 6 |
Recommended SQL Problem Categories
| Category | Count | Difficulty Mix |
|---|---|---|
| Window Functions (ranking, running aggs) | 25 | 5 medium, 20 hard |
| Joins (self, outer, cross, complex multi-table) | 20 | 5 easy, 10 medium, 5 hard |
| Aggregation and Grouping | 15 | 5 easy, 8 medium, 2 hard |
| CTEs and Recursive Queries | 15 | 5 medium, 10 hard |
| Sessionization and Gap-and-Island | 12 | 2 medium, 10 hard |
| Subqueries and EXISTS | 12 | 4 medium, 8 hard |
| Query Optimization | 10 | 5 medium, 5 hard |
| Pivoting, GROUPING SETS, Advanced | 10 | 3 medium, 7 hard |
| Date/Time Manipulation | 7 | 3 medium, 4 hard |
SQL Practice Platforms
| Platform | Strength | Difficulty Range |
|---|---|---|
| LeetCode (Database section) | Standard interview problems | Easy to Hard |
| HackerRank (SQL domain) | Good progressive difficulty | Easy to Hard |
| StrataScratch | Real company interview questions | Medium to Hard |
| DataLemur | Data-specific SQL challenges | Easy to Hard |
| SQLZoo | Interactive tutorials for basics | Easy to Medium |
| Mode Analytics SQL Tutorial | Real-world analytical queries | Medium |
System Design Problems to Practice
Ordered by Difficulty
| Problem | Key Concepts Tested | Difficulty |
|---|---|---|
| Daily reporting pipeline from raw logs | ETL basics, scheduling, idempotency | Starter |
| User activity data warehouse | Dimensional modeling, SCD, partitioning | Starter |
| Real-time click analytics dashboard | Streaming, Kafka, windowing, serving layer | Intermediate |
| Event-driven e-commerce pipeline | CDC, event sourcing, exactly-once processing | Intermediate |
| Multi-source customer 360 pipeline | Data integration, deduplication, identity resolution | Intermediate |
| Data warehouse migration (on-prem to cloud) | Migration strategy, validation, zero downtime | Advanced |
| Real-time ML feature pipeline | Feature store, streaming features, serving | Advanced |
| Global data platform (multi-region) | Replication, consistency, latency, compliance | Advanced |
| Ad event pipeline at scale (10B events/day) | High throughput, cost optimization, backpressure | Advanced |
Mock Interview Schedule
Finding Mock Interview Partners
- DataTalks.Club -- Free community with data engineering focus
- Data Engineering Slack/Discord -- Active communities for peer practice
- interviewing.io -- Anonymous technical interviews
- Pramp -- Free peer mock interviews
- Friends in data roles -- Ask data engineers to run you through their favorite interview questions
Common Mistakes to Avoid
:::danger Top 7 Data Engineer Interview Mistakes
-
Writing correct but unreadable SQL. Interviewers read your SQL. A 200-line query with no CTEs, no comments, and cryptic aliases is a red flag. Structure your queries with CTEs, use meaningful aliases, and format consistently. Readability is part of the evaluation.
-
Not handling edge cases in SQL. NULLs, empty tables, ties in ranking, duplicate records, and division by zero. Many candidates write queries that work on the happy path but break on edge cases. Always ask: "What happens if this column is NULL?"
-
Naming tools without understanding them. "I would use Kafka for streaming" is not enough. Why Kafka over Kinesis or Pulsar? How many partitions? What is the retention? How do you handle exactly-once? What happens when a consumer falls behind? Depth matters more than breadth.
-
Ignoring monitoring and failure handling in system design. A pipeline design that only covers the happy path is incomplete. Every system design answer must address: "What happens when this fails?" "How do you detect data quality issues?" "How do you backfill?"
-
Underestimating the data modeling round. Many candidates with strong coding skills stumble on data modeling because they treat it as "just drawing tables." You need to understand normalization theory, dimensional modeling philosophy, and the trade-offs behind every design decision.
-
Over-engineering system designs. You do not need Kafka, Spark, Flink, Airflow, dbt, Iceberg, AND a feature store for a pipeline processing 10,000 records per day. Right-sizing your solution to the requirements demonstrates engineering maturity. Start simple, then explain what you would add as scale increases.
-
Not preparing behavioral stories about incidents. Data pipelines break. Data quality degrades. Stakeholders get incorrect data. Interviewers want to hear how you responded to real incidents: how you detected the issue, communicated with stakeholders, fixed it, and prevented recurrence. Have 2-3 strong incident stories ready. :::
Essential Resources
Handbook Chapters to Prioritize
| Priority | Chapter | When to Study |
|---|---|---|
| Critical | Coding Interviews | Weeks 1-4 |
| Critical | ML System Design (adapt for data) | Weeks 3-5 |
| High | Behavioral | Weeks 5-6 |
| Medium | Company Guides | Week 5 |
| Medium | ML Fundamentals (skim for context) | Week 4 |
| Low | Negotiation | Week 6 |
External Resources
Books:
- "Designing Data-Intensive Applications" by Martin Kleppmann (the data engineering bible)
- "The Data Warehouse Toolkit" by Ralph Kimball (dimensional modeling reference)
- "Fundamentals of Data Engineering" by Joe Reis and Matt Housley
- "Streaming Systems" by Akidau, Chernyak, and Lax
Courses:
- DataTalks.Club Data Engineering Zoomcamp (free, comprehensive)
- Stanford CS245 Principles of Data-Intensive Systems
- CMU 15-445 Database Systems (query optimization, storage)
Practice Platforms:
- LeetCode Database section (SQL practice)
- StrataScratch (real interview SQL questions)
- DataLemur (SQL and data science problems)
- HackerRank SQL domain
Blogs and References:
- Target company engineering blogs (always read before interviews)
- "The Modern Data Stack" landscape (understand the ecosystem)
- Apache project documentation (Kafka, Spark, Flink, Airflow, Iceberg)
- dbt documentation and best practices
Next Steps
You now have a complete 6-week roadmap for Data Engineer interview preparation. The timeline is compressed because data engineering interviews, while broad, are more predictable than research or ML engineering interviews. The key to success is SQL fluency and system design depth.
If this path does not quite fit your target role, consider:
- MLE Prep Path -- If your target role involves building and deploying ML models, not just data infrastructure
- MLOps Prep Path -- If your target role emphasizes ML infrastructure, model serving, and CI/CD for ML
- Data Scientist Prep Path -- If your target role emphasizes statistical analysis and experimentation over infrastructure
Start today. Open a SQL editor. Write a window function. Design a pipeline on paper. The 6-week clock starts now.
