Data Engineer Problem List

Reading time: ~45 min | Interview relevance: Critical | Roles: Data Engineer, Analytics Engineer, Data Platform Engineer, Streaming Engineer

A company has 2 billion events per day flowing into their data lake. Dashboards are slow, pipelines are failing at 3 AM, and the data science team complains that feature tables are stale. Someone needs to fix this. That someone is a Data Engineer.

Data Engineering interviews test a unique combination of skills: SQL mastery, pipeline architecture, data modeling, distributed systems knowledge, and streaming fluency. This list of 45 problems covers every category you will encounter, organized by interview round and calibrated to real questions reported at major tech companies.

Data Engineer Interview Structure

Data Engineering interviews vary more than MLE or SWE interviews, but most top companies follow a pattern:

Round	Duration	What They Test	Weight
SQL & Data Manipulation	45-60 min	Complex queries, window functions, optimization	25-30%
Coding (Python/Scala)	45-60 min	Data processing logic, ETL implementation	20-25%
System Design	45-60 min	Pipeline architecture, data modeling, scale	25-30%
Domain & Behavioral	30-45 min	Data quality, debugging, collaboration	15-20%

:::tip The Data Engineer Differentiator Data Engineers are judged by reliability, not cleverness. An interviewer would rather see a boring, correct, fault-tolerant pipeline than an elegant but fragile one. Always talk about error handling, idempotency, and monitoring. :::

Round 1: SQL & Query Optimization (15 Problems)

SQL is the lingua franca of data engineering. You will face at least one round of pure SQL, often with follow-up questions about query plans and optimization.

Core SQL Problems

#	Problem	Difficulty	Time	Key Concept	Why DEs Need It	Company Tags
1	Find the Nth Highest Salary per Department	Medium	15 min	Window functions (DENSE_RANK)	Ranking within groups is fundamental to reporting and analytics	Google, Amazon, Meta
2	Compute Running Totals with Resets	Medium	20 min	Window functions, conditional aggregation	Revenue tracking, session analysis, cumulative metrics	Uber, Airbnb, Stripe
3	Identify Gaps in Sequential Data	Medium	15 min	Self-join, LAG/LEAD	Missing date detection, pipeline monitoring, SLA tracking	FAANG, All
4	Sessionize Clickstream Data	Hard	25 min	Window functions, conditional logic, time gaps	User behavior analysis, session-based analytics	Google, Meta, Airbnb
5	Pivot Rows to Columns Dynamically	Medium	20 min	CASE WHEN, conditional aggregation	Reporting transformations, dashboard data preparation	Amazon, Microsoft, Databricks

Advanced SQL

#	Problem	Difficulty	Time	Key Concept	Why DEs Need It	Company Tags
6	Recursive CTE for Hierarchical Data	Hard	25 min	Recursive queries, tree traversal	Org charts, category trees, bill of materials	Google, Amazon, Snowflake
7	Optimize a Slow Query with Proper Indexing Strategy	Hard	30 min	EXPLAIN plan, index selection, query rewriting	Performance tuning is daily DE work	FAANG, All
8	Deduplicate Records with Fuzzy Matching	Medium	20 min	Similarity functions, group-by strategies	Data quality, entity resolution	Airbnb, Stripe, Databricks
9	Compute Retention Cohorts	Medium	20 min	Self-join, date arithmetic, conditional aggregation	Product analytics, user lifecycle analysis	Meta, Spotify, Pinterest
10	Implement Slowly Changing Dimension Type 2	Hard	25 min	Temporal joins, effective date ranges	Dimensional modeling, historical tracking	Amazon, Snowflake, dbt Labs

Query Optimization & Internals

#	Problem	Difficulty	Time	Key Concept	Why DEs Need It	Company Tags
11	Explain and Fix a Cartesian Join	Easy	10 min	Join strategies, query plan reading	Debugging production queries that blow up	All
12	Partition Pruning and Predicate Pushdown	Medium	20 min	Query optimization, storage layout	Reducing scan cost on petabyte-scale tables	Databricks, Snowflake, Google
13	Compare Hash Join vs. Sort-Merge Join	Medium	15 min	Join algorithms, memory vs. disk	Understanding why some queries are slow	FAANG, Big Tech
14	Optimize a Query Hitting a 100TB Table	Hard	25 min	Partitioning, clustering, materialized views	Real-world optimization at scale	Google, Amazon, Snowflake
15	Write an Idempotent MERGE (Upsert) Statement	Medium	15 min	MERGE/INSERT ON CONFLICT, idempotency	Pipeline reruns must not duplicate data	All

:::warning SQL Round Red Flags

Using correlated subqueries where window functions are appropriate
Not considering NULL handling in joins and aggregations
Writing queries that scan entire tables when partition pruning is possible
Cannot explain the difference between RANK, DENSE_RANK, and ROW_NUMBER
Ignoring data skew in GROUP BY operations :::

Round 2: Data Pipeline & ETL Coding (12 Problems)

These problems test your ability to write data processing code in Python (or Spark/Scala). Interviewers want to see clean, testable, fault-tolerant pipeline logic.

ETL Implementation

#	Problem	Difficulty	Time	Key Concept	Why DEs Need It	Company Tags
16	Implement a Schema Validator for Incoming JSON Data	Medium	20 min	Schema validation, error handling	Data quality at ingestion is the first line of defense	Airbnb, Stripe, All
17	Build an Incremental Data Loader with Watermarks	Medium	25 min	Change data capture, watermark tracking	Incremental loads reduce cost and latency	FAANG, Databricks
18	Implement a Dead Letter Queue Handler	Medium	20 min	Error routing, retry logic	Malformed records must not crash the pipeline	Uber, Stripe, Amazon
19	Build a Backfill Orchestrator	Hard	30 min	Date range partitioning, idempotent execution	Historical reprocessing is a weekly task	FAANG, Big Tech
20	Implement a Data Reconciliation Check	Medium	20 min	Count matching, checksum validation	Verifying data completeness after migration or transfer	All

Data Processing Logic

#	Problem	Difficulty	Time	Key Concept	Why DEs Need It	Company Tags
21	Flatten Deeply Nested JSON to Tabular Format	Medium	20 min	Recursive parsing, schema inference	Semi-structured data is everywhere (APIs, logs, events)	Airbnb, Snowflake, Databricks
22	Implement a Custom Partitioner for Skewed Data	Hard	25 min	Data skew mitigation, salting	Skewed keys cause OOM and stragglers in Spark	Google, Meta, Databricks
23	Build a File Format Converter (CSV to Parquet with Schema)	Easy	15 min	Columnar storage, type inference	File format choice impacts query performance by 10-100x	All
24	Implement Log Parsing with Regex and Error Handling	Easy	15 min	Regex, defensive parsing	Log processing is the most common first DE task	All
25	Build a Data Deduplication Pipeline	Medium	20 min	Hashing, window-based dedup	Exact and near-duplicate detection at scale	Meta, Airbnb, Uber

Streaming & Real-Time

#	Problem	Difficulty	Time	Key Concept	Why DEs Need It	Company Tags
26	Implement a Tumbling Window Aggregation	Medium	20 min	Time-based windowing, watermarks	Streaming aggregation is core to real-time analytics	Uber, LinkedIn, Confluent
27	Handle Late-Arriving Events in a Stream	Hard	25 min	Watermarks, allowed lateness, side outputs	Late data is the norm, not the exception	Google, Uber, Confluent

:::tip Coding Round Expectations Data Engineering coding rounds are NOT LeetCode rounds. Interviewers expect:

Clean separation of concerns (extract, transform, load as separate functions)
Proper error handling and logging
Idempotent operations (safe to re-run)
Configuration-driven design (no hardcoded paths or schemas) :::

Round 3: System Design (13 Problems)

Data Engineering system design focuses on data platforms, pipeline architecture, and storage systems. Unlike MLE system design, the emphasis is on data flow, not model architecture.

Data Pipeline Architecture

#	Problem	Difficulty	Time	Key Concept	Why DEs Need It	Company Tags
28	Design a Real-Time Event Processing Pipeline	Hard	45 min	Kafka, stream processing, exactly-once semantics	Real-time data is table stakes at scale	Uber, LinkedIn, Google
29	Design a Data Lake Architecture	Hard	45 min	Storage layers (bronze/silver/gold), metadata management	The foundational data platform decision	FAANG, Databricks, Snowflake
30	Design a Change Data Capture Pipeline	Medium	35 min	CDC, log-based replication, schema evolution	Syncing operational DBs to analytics is the #1 DE use case	Uber, Airbnb, Stripe
31	Design a Data Quality Monitoring System	Medium	35 min	Statistical checks, anomaly detection, alerting	Bad data costs millions; monitoring prevents it	FAANG, All
32	Design a Batch ETL Platform for 10TB/Day	Medium	35 min	Orchestration, partitioning, retry, monitoring	The bread-and-butter DE system design problem	Amazon, Google, Big Tech

Data Modeling & Storage

#	Problem	Difficulty	Time	Key Concept	Why DEs Need It	Company Tags
33	Design a Star Schema for an E-Commerce Data Warehouse	Medium	30 min	Fact and dimension tables, grain, conformed dimensions	Dimensional modeling is foundational to analytics engineering	Amazon, Walmart, All
34	Design a Metrics Layer	Hard	40 min	Metric definitions, consistency, self-serve analytics	Ensuring every team computes "revenue" the same way	Airbnb, Uber, Transform
35	Design Schema Evolution Strategy for a Data Lake	Medium	30 min	Schema registry, backward/forward compatibility	Schema changes must not break downstream consumers	Confluent, Databricks, FAANG
36	Design a Multi-Tenant Data Platform	Hard	45 min	Isolation, access control, resource quotas	Serving multiple teams from one platform	FAANG, Snowflake, Databricks

Distributed Systems for Data

#	Problem	Difficulty	Time	Key Concept	Why DEs Need It	Company Tags
37	Design a Distributed Task Scheduler (Like Airflow)	Hard	45 min	DAG execution, dependency management, fault tolerance	Understanding orchestration internals	Airbnb, Google, Astronomer
38	Design a Distributed File Storage System	Hard	45 min	Replication, partitioning, consistency	Understanding HDFS/S3 internals at a design level	Google, Amazon, Meta
39	Explain and Design Around the CAP Theorem	Medium	25 min	Consistency vs. availability tradeoffs	Every storage decision involves CAP tradeoffs	FAANG, All
40	Design a Data Catalog and Discovery System	Medium	35 min	Metadata management, lineage, search	Data discovery is critical as data assets grow	Lyft, Airbnb, LinkedIn

:::note System Design Evaluation Criteria for DEs Interviewers evaluate DE system designs on:

Reliability: What happens when things fail? Retries, idempotency, dead letter queues
Scalability: How does it handle 10x data growth?
Data Quality: Where are the validation checkpoints?
Operability: How do you monitor, debug, and backfill?
Cost: Storage format choices, compute optimization, data lifecycle :::

Round 4: Domain Knowledge & Debugging (5 Problems)

These are discussion-based problems that test real-world experience.

#	Problem	Difficulty	Time	Key Concept	Why DEs Need It	Company Tags
41	Debug a Pipeline That Produces Duplicate Records	Medium	20 min	Idempotency, exactly-once processing	The most common production DE bug	All
42	Explain Data Lineage and Its Importance	Easy	15 min	Metadata, impact analysis, regulatory compliance	GDPR, SOX, and debugging all require lineage	FAANG, Finance
43	Compare Batch vs. Stream Processing Tradeoffs	Easy	15 min	Latency, complexity, cost, correctness	Choosing the right processing paradigm	All
44	Design a Data Governance Strategy	Medium	25 min	PII handling, access controls, retention policies	Regulatory compliance is non-negotiable	FAANG, Finance, Healthcare
45	Troubleshoot a Spark Job That Is Running Slowly	Medium	20 min	Shuffle optimization, partition sizing, skew detection	Spark tuning is daily work for many DEs	Databricks, FAANG, Big Tech

4-Week Data Engineer Study Plan

Week	Focus	Problems	Daily Load
Week 1	SQL mastery	#1-15 (SQL round)	2-3 problems/day
Week 2	Pipeline coding	#16-27 (ETL + streaming)	2 problems/day + review SQL
Week 3	System design	#28-40 (Architecture + modeling)	1-2 designs/day
Week 4	Integration + domain	#41-45 + mocks	1 problem + 1 mock/day

Week 1: SQL Deep Dive

Day 1: #1, #2 (window functions: DENSE_RANK, running totals)
Day 2: #3, #4 (gaps detection, sessionization)
Day 3: #5, #6 (pivot, recursive CTE)
Day 4: #7 (query optimization - spend extra time on EXPLAIN)
Day 5: #8, #9 (fuzzy dedup, retention cohorts)
Day 6: #10, #11 (SCD Type 2, Cartesian join debug)
Day 7: #12, #13, #14, #15 (optimization topics + MERGE)

Week 2: Pipeline Coding

Day 1: #16, #17 (schema validation, incremental loads)
Day 2: #18, #19 (dead letter queue, backfill orchestrator)
Day 3: #20, #21 (reconciliation, nested JSON flattening)
Day 4: #22, #23 (custom partitioner, file format conversion)
Day 5: #24, #25 (log parsing, deduplication pipeline)
Day 6: #26, #27 (tumbling windows, late events)
Day 7: Review weak problems from weeks 1-2

Week 3: System Design Sprint

Day 1: #28 (real-time event processing - the big one)
Day 2: #29 (data lake architecture)
Day 3: #30, #31 (CDC pipeline, data quality monitoring)
Day 4: #32, #33 (batch ETL platform, star schema)
Day 5: #34, #35 (metrics layer, schema evolution)
Day 6: #36, #37 (multi-tenant platform, distributed scheduler)
Day 7: #38, #39, #40 (distributed storage, CAP, data catalog)

Week 4: Polish and Mock

Day 1: #41-45 (domain knowledge and debugging)
Day 2-3: Re-solve all Yellow/Red problems
Day 4-5: Full mock interviews (1 SQL + 1 coding + 1 design)
Day 6-7: Final review of weak areas + behavioral prep

Problem Deep Dives

Problem 4: Sessionize Clickstream Data

Why this problem matters: Sessionization is the canonical DE SQL problem. It combines window functions, conditional logic, and time-based reasoning. Every analytics-heavy company asks a variant of this.

Problem Statement: Given a table of user events with user_id, event_timestamp, and page_url, assign a session_id to each event. A new session starts when more than 30 minutes have elapsed since the previous event for the same user.

Solution Approach:

WITH time_gaps AS (
  SELECT
    user_id,
    event_timestamp,
    page_url,
    LAG(event_timestamp) OVER (
      PARTITION BY user_id ORDER BY event_timestamp
    ) AS prev_event_time
  FROM clickstream
),
session_flags AS (
  SELECT
    *,
    CASE
      WHEN prev_event_time IS NULL THEN 1
      WHEN EXTRACT(EPOCH FROM event_timestamp - prev_event_time) > 1800 THEN 1
      ELSE 0
    END AS new_session_flag
  FROM time_gaps
),
session_ids AS (
  SELECT
    *,
    SUM(new_session_flag) OVER (
      PARTITION BY user_id ORDER BY event_timestamp
    ) AS session_id
  FROM session_flags
)
SELECT
  user_id,
  event_timestamp,
  page_url,
  user_id || '-' || session_id AS session_id
FROM session_ids;

Key Points Interviewers Check:

Correct use of LAG with PARTITION BY
Proper handling of the first event (no previous event)
Using SUM of flags as a session counter (the cumulative sum trick)
Ability to modify the session timeout parameter

Problem 29: Design a Data Lake Architecture

Why this problem matters: This is the most strategic DE system design problem. Your answer reveals whether you think about data platforms architecturally or just as a collection of scripts.

Architecture (Medallion / Multi-Hop):

Raw Sources -> Bronze -> Silver -> Gold -> Consumers

1. Bronze Layer (Raw)
   - Exact copy of source data, append-only
   - Schema: source schema preserved, metadata columns added
   - Format: Parquet/Delta with ingestion timestamp
   - Retention: 90 days to 2 years

2. Silver Layer (Cleaned)
   - Deduplication, type casting, null handling
   - Standardized schemas, conformed dimensions
   - Incremental processing with watermarks
   - Format: Delta/Iceberg with partitioning

3. Gold Layer (Business)
   - Aggregated, business-ready datasets
   - Star schemas, metrics tables, feature tables
   - SLAs on freshness and quality
   - Format: Optimized for query patterns

4. Metadata & Governance
   - Schema registry for all layers
   - Data catalog with lineage tracking
   - Access control per layer/dataset
   - Quality checks at each transition

Key Design Decisions to Discuss:

Table format: Delta Lake vs. Iceberg vs. Hudi
Storage: S3/GCS/ADLS object storage
Compute: Spark vs. serverless (Athena, BigQuery)
Orchestration: Airflow, Dagster, Prefect
Catalog: Unity Catalog, AWS Glue, Hive Metastore

Data Engineer Patterns to Master

Pattern	Where It Appears	Problems
Window functions	SQL analysis, sessionization, dedup	#1, #2, #3, #4, #9
Idempotent processing	Pipeline design, MERGE, backfill	#15, #17, #19, #41
Schema evolution	Data lake, CDC, streaming	#16, #30, #35
Partitioning strategies	Query optimization, storage layout	#12, #14, #22, #32
Watermark-based processing	Incremental loads, streaming	#17, #26, #27
Data quality checks	Validation, monitoring, reconciliation	#20, #31, #44
Medallion architecture	Data lake, pipeline layering	#29, #32
Exactly-once semantics	Streaming, dedup, idempotency	#27, #28, #41

SQL vs. Python: When to Use Which

A common interview discussion topic for Data Engineers:

Scenario	Preferred Tool	Reasoning
Ad-hoc analysis	SQL	Faster iteration, optimizer handles it
Aggregation on structured data	SQL	Declarative, easy to audit
Complex business logic	Python/Spark	Procedural logic, testing, modularity
Schema validation	Python	Programmatic checks, richer error handling
Streaming	Python/Scala + framework	Kafka Streams, Flink, Spark Structured Streaming
UDF-heavy transformations	Python + SQL	Best of both worlds
Data quality checks	SQL + Python	SQL for checks, Python for orchestration

Technology Stack Expectations by Company Tier

Tier	SQL	Processing	Orchestration	Storage	Streaming
FAANG	Advanced (Presto/Spark SQL)	Spark, internal tools	Airflow, internal	S3/GCS, internal	Kafka, Flink
Unicorn	Advanced (Snowflake/BigQuery)	Spark, dbt	Airflow, Dagster	S3/GCS, Delta/Iceberg	Kafka, Kinesis
Growth Stage	Intermediate+	dbt, Spark, Python	Airflow, Prefect	S3, Snowflake	Kafka, Pub/Sub
Startup	Intermediate	Python, dbt	Cron, basic Airflow	Postgres, BigQuery	Simple queues

:::danger Common Data Engineer Interview Mistakes

Overcomplicating SQL - Use window functions instead of nested subqueries
Ignoring data quality - Every pipeline answer should mention validation
No error handling - Pipelines fail; your design must handle it
Forgetting idempotency - Pipelines get re-run; they must produce the same result
Not discussing monitoring - A pipeline without observability is a ticking time bomb
Treating batch and streaming as equivalent - They have fundamentally different guarantees :::

Difficulty Distribution

Difficulty	Problems	Count
Easy	#11, #23, #24, #42, #43	5
Medium	#1, #2, #3, #5, #8, #9, #12, #13, #15, #16, #17, #18, #20, #21, #25, #26, #30, #31, #32, #33, #35, #39, #40, #41, #44, #45	26
Hard	#4, #6, #7, #10, #14, #19, #22, #27, #28, #29, #34, #36, #37, #38	14

Problem Deep Dives (Continued)

Problem 7: Optimize a Slow Query with Proper Indexing Strategy

Why this problem matters: Query optimization is the single most practical skill a Data Engineer uses on a daily basis. Interviewers want to see that you can read an EXPLAIN plan, identify bottlenecks, and apply the right fix without over-indexing.

Problem Setup:

-- This query runs on a 500M row events table and takes 45 minutes
SELECT
    user_id,
    DATE(event_time) AS event_date,
    COUNT(*) AS event_count,
    COUNT(DISTINCT session_id) AS session_count
FROM events
WHERE event_time >= '2025-01-01'
  AND event_time < '2025-04-01'
  AND country = 'US'
  AND event_type IN ('page_view', 'click', 'purchase')
GROUP BY user_id, DATE(event_time)
ORDER BY event_count DESC
LIMIT 100;

EXPLAIN Plan Analysis (What to Look For):

Seq Scan on events (cost=0.00..125000000.00)
  Filter: (event_time >= ... AND country = ...)
  Rows Removed by Filter: 450000000
Sort (cost=...)
  Sort Key: event_count DESC
  Sort Method: external merge (Disk)

Red Flags in This Plan:

Sequential scan on a 500M row table (no index being used)
450M rows removed by filter (90% of data scanned and discarded)
Disk-based sort (insufficient work_mem)

Optimization Steps:

Step	Action	Impact
1	Add partition by `event_time` (monthly)	Prune 9 of 12 month partitions; scan 125M instead of 500M
2	Add composite index `(country, event_type, event_time)`	Further reduce scanned rows within partition
3	Pre-aggregate daily summaries in a materialized view	Reduce 125M rows to ~2M daily aggregates
4	Increase `work_mem` for this query	In-memory sort instead of disk sort

Key Points Interviewers Check:

Can you read an EXPLAIN plan and identify the bottleneck?
Do you know the difference between Seq Scan, Index Scan, Index Only Scan, and Bitmap Scan?
Can you reason about index selectivity (high cardinality = good index candidate)?
Do you understand the cost of maintaining indexes on write-heavy tables?
Can you explain when a materialized view is better than an index?

Problem 10: Implement Slowly Changing Dimension Type 2

Why this problem matters: SCD Type 2 is the foundation of historical tracking in data warehouses. Every analytics team needs to know what a customer's status was at the time of a transaction, not what it is today.

Problem Statement: Given a customers source table that gets updated in place, maintain a dim_customers table that tracks all historical states with effective date ranges.

Solution:

-- Step 1: Close existing records that have changed
UPDATE dim_customers d
SET
    effective_end_date = CURRENT_DATE - INTERVAL '1 day',
    is_current = FALSE
FROM staging_customers s
WHERE d.customer_id = s.customer_id
  AND d.is_current = TRUE
  AND (
    d.name != s.name
    OR d.email != s.email
    OR d.plan_tier != s.plan_tier
    OR d.address != s.address
  );

-- Step 2: Insert new versions for changed records
INSERT INTO dim_customers (
    customer_id, name, email, plan_tier, address,
    effective_start_date, effective_end_date, is_current
)
SELECT
    s.customer_id, s.name, s.email, s.plan_tier, s.address,
    CURRENT_DATE,
    '9999-12-31'::DATE,
    TRUE
FROM staging_customers s
JOIN dim_customers d
  ON s.customer_id = d.customer_id
WHERE d.effective_end_date = CURRENT_DATE - INTERVAL '1 day'
  AND d.is_current = FALSE;

-- Step 3: Insert genuinely new customers
INSERT INTO dim_customers (
    customer_id, name, email, plan_tier, address,
    effective_start_date, effective_end_date, is_current
)
SELECT
    s.customer_id, s.name, s.email, s.plan_tier, s.address,
    CURRENT_DATE,
    '9999-12-31'::DATE,
    TRUE
FROM staging_customers s
WHERE NOT EXISTS (
    SELECT 1 FROM dim_customers d
    WHERE d.customer_id = s.customer_id
);

Key Design Decisions:

effective_end_date = '9999-12-31' for current records (avoids NULL comparisons)
is_current boolean flag for fast lookup of current state
Hash-based change detection (hash all columns, compare hash) is faster for wide tables
Surrogate keys (auto-increment) vs. natural keys for the dimension

Interview Follow-Ups:

How do you handle SCD Type 2 with Spark (hint: merge with whenMatchedUpdateIf condition)?
What about Type 1 (overwrite) vs. Type 3 (previous + current columns)?
How does SCD Type 2 interact with fact table foreign keys?

Problem 22: Implement a Custom Partitioner for Skewed Data

Why this problem matters: Data skew is the #1 cause of slow Spark jobs. A single partition with 100x more data than others causes the entire job to wait. This problem tests whether you have real production experience with distributed data processing.

Problem Statement: You have a Spark job that joins two tables on merchant_id. One merchant accounts for 40% of all transactions. The join stage takes 4 hours because one executor processes 40% of the data while others sit idle.

Solution: Salting Approach

from pyspark.sql import functions as F
import random

# Configuration
SALT_BUCKETS = 10
skewed_key = "merchant_12345"  # The problematic merchant

# Step 1: Salt the large (left) table
transactions_salted = transactions.withColumn(
    "salt",
    F.when(
        F.col("merchant_id") == skewed_key,
        F.floor(F.rand() * SALT_BUCKETS).cast("int")
    ).otherwise(F.lit(0))
).withColumn(
    "salted_key",
    F.concat(F.col("merchant_id"), F.lit("_"), F.col("salt"))
)

# Step 2: Replicate the small (right) table for skewed keys
from functools import reduce
from pyspark.sql import DataFrame

def replicate_for_skewed(df, skewed_key, salt_buckets):
    skewed = df.filter(F.col("merchant_id") == skewed_key)
    non_skewed = df.filter(F.col("merchant_id") != skewed_key)

    replicated = reduce(
        DataFrame.unionAll,
        [
            skewed.withColumn("salt", F.lit(i))
            for i in range(salt_buckets)
        ]
    )
    non_skewed_with_salt = non_skewed.withColumn("salt", F.lit(0))

    return replicated.unionAll(non_skewed_with_salt).withColumn(
        "salted_key",
        F.concat(F.col("merchant_id"), F.lit("_"), F.col("salt"))
    )

merchants_replicated = replicate_for_skewed(merchants, skewed_key, SALT_BUCKETS)

# Step 3: Join on salted key
result = transactions_salted.join(
    merchants_replicated,
    on="salted_key",
    how="inner"
).drop("salt", "salted_key")

Why This Works:

The skewed merchant's transactions are distributed across 10 partitions (salt 0-9)
The merchant's reference data is replicated 10 times (one copy per salt bucket)
Each executor now processes ~4% of transactions instead of 40%
Non-skewed merchants are unaffected (salt = 0 for both sides)

Alternative Approaches:

Broadcast join: If the right table is small (<100MB), broadcast it to all executors
Adaptive Query Execution (AQE): Spark 3.0+ can detect and handle skew automatically
Separate processing: Process the skewed key separately and union the results
Pre-aggregation: If possible, aggregate the skewed key before joining

Problem 37: Design a Distributed Task Scheduler (Like Airflow)

Why this problem matters: Understanding how orchestrators work internally makes you better at using them and debugging them. This is a Staff-level DE system design question.

Architecture:

Web UI -> API Server -> Scheduler -> Executor Pool -> Workers

Components:
1. DAG Parser
   - Read DAG definitions (Python files or YAML)
   - Build dependency graph
   - Validate: no cycles, all operators exist, correct parameters

2. Scheduler (the brain)
   - Event loop: check for ready tasks every N seconds
   - Task state machine: queued -> scheduled -> running -> success/failed
   - Dependency resolution: task is ready when all upstream tasks succeeded
   - Concurrency control: max active tasks per DAG, per pool, globally

3. Executor
   - LocalExecutor: runs tasks as subprocesses (dev/small scale)
   - CeleryExecutor: distributes tasks to Celery workers (production)
   - KubernetesExecutor: spins up K8s pods per task (elastic scaling)

4. Metadata Database
   - Task instance state (PostgreSQL)
   - DAG run history
   - XCom (cross-communication between tasks)
   - Variable and connection storage

5. Worker
   - Pulls task from queue
   - Executes the operator (BashOperator, PythonOperator, etc.)
   - Reports status back to metadata DB
   - Handles heartbeat (so scheduler knows worker is alive)

Key Design Decisions:
- Pull vs. push model for task distribution
- Exactly-once execution guarantee (database lock on task instance)
- Retry logic (exponential backoff, max retries)
- Dead letter handling for permanently failed tasks
- Scheduler HA (active-passive or active-active with database locking)

Scaling Challenges:

Scale	Challenge	Solution
100 DAGs	Single scheduler is fine	No changes needed
1,000 DAGs	Scheduler parsing becomes slow	DAG serialization, pre-parse to DB
10,000 DAGs	Single scheduler bottleneck	Multiple schedulers with database locking
100,000 tasks/day	Executor throughput	KubernetesExecutor with auto-scaling
1M tasks/day	Metadata DB load	Database partitioning, read replicas, archive old runs

Common SQL Window Function Patterns

A reference table for the most commonly tested window function patterns:

Pattern	SQL	Use Case
Ranking	`RANK() OVER (PARTITION BY dept ORDER BY salary DESC)`	Top N per group
Dense ranking	`DENSE_RANK() OVER (...)`	Top N with ties
Row number	`ROW_NUMBER() OVER (...)`	Deduplication (keep first/last)
Running total	`SUM(amount) OVER (ORDER BY date ROWS UNBOUNDED PRECEDING)`	Cumulative metrics
Moving average	`AVG(value) OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)`	Smoothed metrics
Previous value	`LAG(value, 1) OVER (PARTITION BY user ORDER BY date)`	Change detection
Next value	`LEAD(value, 1) OVER (PARTITION BY user ORDER BY date)`	Forward-looking analysis
First/last value	`FIRST_VALUE(value) OVER (PARTITION BY user ORDER BY date)`	Session start/end
Percent rank	`PERCENT_RANK() OVER (ORDER BY score)`	Percentile computation
Ntile	`NTILE(4) OVER (ORDER BY score)`	Bucketing into quartiles

Data Engineering Anti-Patterns

Patterns that seem reasonable but cause problems at scale:

Anti-Pattern	Why It Seems Reasonable	Why It Fails	Better Approach
SELECT *	Convenient for ad-hoc queries	Wastes I/O on wide tables; breaks when schema changes	Explicitly list needed columns
DELETE + INSERT instead of MERGE	Simpler to implement	Not atomic; partial failures leave inconsistent state	Use MERGE/UPSERT for idempotency
Storing JSON strings in SQL columns	Flexible schema	Cannot index, filter, or join on nested fields efficiently	Use structured columns or a document store
Cron-only scheduling	No external dependencies	No dependency management, no retry, no monitoring	Use Airflow/Dagster/Prefect
Processing all data every run	Simpler logic	Costs grow linearly with data; wastes compute on old data	Incremental processing with watermarks
Single large table	Fewer joins	Slow queries, expensive scans, schema changes affect everything	Normalize into fact + dimension tables
No data contracts	Move fast, schema evolves	Downstream breakage when upstream changes schema	Schema registry + backward compatibility checks
Manual backfills	"We'll just re-run it"	Error-prone, inconsistent, takes days	Automated backfill orchestrator with validation

Data Quality Framework

A structured approach to data quality that interviewers want to hear:

Dimension	What It Measures	Example Check	Where to Check
Completeness	Are all expected records present?	Row count matches source; no missing dates	After ingestion
Accuracy	Are values correct?	Checksums match; spot-check against source	After transformation
Consistency	Do related values agree?	Order total = sum of line items; FK relationships valid	After join/aggregation
Freshness	Is data up-to-date?	Latest event timestamp within SLA	Continuous monitoring
Uniqueness	No duplicates?	Primary key is unique; no duplicate events	After deduplication step
Validity	Values within expected range?	Prices > 0; dates not in the future; enums are valid	At ingestion and after transform

Implementation with Great Expectations or dbt tests:

# dbt schema.yml example
models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: total_amount
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0
              max_value: 1000000
      - name: order_date
        tests:
          - not_null
          - dbt_utils.accepted_range:
              max_value: "current_date"
    tests:
      - dbt_utils.recency:
          datepart: day
          field: order_date
          interval: 1

Recommended Resources

Resource	Type	Best For
"Designing Data-Intensive Applications" (Kleppmann)	Book	System design fundamentals
"The Data Warehouse Toolkit" (Kimball)	Book	Dimensional modeling
DataLemur SQL problems	Practice	SQL interview prep
Spark: The Definitive Guide	Book	Spark internals
Confluent Kafka documentation	Docs	Streaming concepts
dbt documentation	Docs	Analytics engineering patterns
"Fundamentals of Data Engineering" (Reis & Housley)	Book	Modern DE practices
"Streaming Systems" (Akidau, Chernyak, Lax)	Book	Streaming theory and practice

Next Steps

After completing the Data Engineer problem list, consider:

Core 50 to fill gaps in general DSA and ML fundamentals
Hard Tier if targeting Staff+ DE roles or FAANG
Google-Style Problems for Google-specific preparation (heavy on distributed systems)
Startup-Style Problems if interviewing at data-intensive startups

Data Engineer Interview Structure​

Round 1: SQL & Query Optimization (15 Problems)​

Core SQL Problems​

Advanced SQL​

Query Optimization & Internals​

Round 2: Data Pipeline & ETL Coding (12 Problems)​

ETL Implementation​

Data Processing Logic​

Streaming & Real-Time​

Round 3: System Design (13 Problems)​

Data Pipeline Architecture​

Data Modeling & Storage​

Distributed Systems for Data​

Round 4: Domain Knowledge & Debugging (5 Problems)​

4-Week Data Engineer Study Plan​

Week 1: SQL Deep Dive​

Week 2: Pipeline Coding​

Week 3: System Design Sprint​

Week 4: Polish and Mock​

Problem Deep Dives​

Problem 4: Sessionize Clickstream Data​

Problem 29: Design a Data Lake Architecture​

Data Engineer Patterns to Master​

SQL vs. Python: When to Use Which​

Technology Stack Expectations by Company Tier​

Difficulty Distribution​

Problem Deep Dives (Continued)​

Problem 7: Optimize a Slow Query with Proper Indexing Strategy​

Problem 10: Implement Slowly Changing Dimension Type 2​

Problem 22: Implement a Custom Partitioner for Skewed Data​

Problem 37: Design a Distributed Task Scheduler (Like Airflow)​

Common SQL Window Function Patterns​

Data Engineering Anti-Patterns​

Data Quality Framework​

Recommended Resources​

Next Steps​