Skip to main content

Data Engineer Problem List

Reading time: ~45 min | Interview relevance: Critical | Roles: Data Engineer, Analytics Engineer, Data Platform Engineer, Streaming Engineer

A company has 2 billion events per day flowing into their data lake. Dashboards are slow, pipelines are failing at 3 AM, and the data science team complains that feature tables are stale. Someone needs to fix this. That someone is a Data Engineer.

Data Engineering interviews test a unique combination of skills: SQL mastery, pipeline architecture, data modeling, distributed systems knowledge, and streaming fluency. This list of 45 problems covers every category you will encounter, organized by interview round and calibrated to real questions reported at major tech companies.

Data Engineer Interview Structure

Data Engineering interviews vary more than MLE or SWE interviews, but most top companies follow a pattern:

RoundDurationWhat They TestWeight
SQL & Data Manipulation45-60 minComplex queries, window functions, optimization25-30%
Coding (Python/Scala)45-60 minData processing logic, ETL implementation20-25%
System Design45-60 minPipeline architecture, data modeling, scale25-30%
Domain & Behavioral30-45 minData quality, debugging, collaboration15-20%

:::tip The Data Engineer Differentiator Data Engineers are judged by reliability, not cleverness. An interviewer would rather see a boring, correct, fault-tolerant pipeline than an elegant but fragile one. Always talk about error handling, idempotency, and monitoring. :::

Round 1: SQL & Query Optimization (15 Problems)

SQL is the lingua franca of data engineering. You will face at least one round of pure SQL, often with follow-up questions about query plans and optimization.

Core SQL Problems

#ProblemDifficultyTimeKey ConceptWhy DEs Need ItCompany Tags
1Find the Nth Highest Salary per DepartmentMedium15 minWindow functions (DENSE_RANK)Ranking within groups is fundamental to reporting and analyticsGoogle, Amazon, Meta
2Compute Running Totals with ResetsMedium20 minWindow functions, conditional aggregationRevenue tracking, session analysis, cumulative metricsUber, Airbnb, Stripe
3Identify Gaps in Sequential DataMedium15 minSelf-join, LAG/LEADMissing date detection, pipeline monitoring, SLA trackingFAANG, All
4Sessionize Clickstream DataHard25 minWindow functions, conditional logic, time gapsUser behavior analysis, session-based analyticsGoogle, Meta, Airbnb
5Pivot Rows to Columns DynamicallyMedium20 minCASE WHEN, conditional aggregationReporting transformations, dashboard data preparationAmazon, Microsoft, Databricks

Advanced SQL

#ProblemDifficultyTimeKey ConceptWhy DEs Need ItCompany Tags
6Recursive CTE for Hierarchical DataHard25 minRecursive queries, tree traversalOrg charts, category trees, bill of materialsGoogle, Amazon, Snowflake
7Optimize a Slow Query with Proper Indexing StrategyHard30 minEXPLAIN plan, index selection, query rewritingPerformance tuning is daily DE workFAANG, All
8Deduplicate Records with Fuzzy MatchingMedium20 minSimilarity functions, group-by strategiesData quality, entity resolutionAirbnb, Stripe, Databricks
9Compute Retention CohortsMedium20 minSelf-join, date arithmetic, conditional aggregationProduct analytics, user lifecycle analysisMeta, Spotify, Pinterest
10Implement Slowly Changing Dimension Type 2Hard25 minTemporal joins, effective date rangesDimensional modeling, historical trackingAmazon, Snowflake, dbt Labs

Query Optimization & Internals

#ProblemDifficultyTimeKey ConceptWhy DEs Need ItCompany Tags
11Explain and Fix a Cartesian JoinEasy10 minJoin strategies, query plan readingDebugging production queries that blow upAll
12Partition Pruning and Predicate PushdownMedium20 minQuery optimization, storage layoutReducing scan cost on petabyte-scale tablesDatabricks, Snowflake, Google
13Compare Hash Join vs. Sort-Merge JoinMedium15 minJoin algorithms, memory vs. diskUnderstanding why some queries are slowFAANG, Big Tech
14Optimize a Query Hitting a 100TB TableHard25 minPartitioning, clustering, materialized viewsReal-world optimization at scaleGoogle, Amazon, Snowflake
15Write an Idempotent MERGE (Upsert) StatementMedium15 minMERGE/INSERT ON CONFLICT, idempotencyPipeline reruns must not duplicate dataAll

:::warning SQL Round Red Flags

  • Using correlated subqueries where window functions are appropriate
  • Not considering NULL handling in joins and aggregations
  • Writing queries that scan entire tables when partition pruning is possible
  • Cannot explain the difference between RANK, DENSE_RANK, and ROW_NUMBER
  • Ignoring data skew in GROUP BY operations :::

Round 2: Data Pipeline & ETL Coding (12 Problems)

These problems test your ability to write data processing code in Python (or Spark/Scala). Interviewers want to see clean, testable, fault-tolerant pipeline logic.

ETL Implementation

#ProblemDifficultyTimeKey ConceptWhy DEs Need ItCompany Tags
16Implement a Schema Validator for Incoming JSON DataMedium20 minSchema validation, error handlingData quality at ingestion is the first line of defenseAirbnb, Stripe, All
17Build an Incremental Data Loader with WatermarksMedium25 minChange data capture, watermark trackingIncremental loads reduce cost and latencyFAANG, Databricks
18Implement a Dead Letter Queue HandlerMedium20 minError routing, retry logicMalformed records must not crash the pipelineUber, Stripe, Amazon
19Build a Backfill OrchestratorHard30 minDate range partitioning, idempotent executionHistorical reprocessing is a weekly taskFAANG, Big Tech
20Implement a Data Reconciliation CheckMedium20 minCount matching, checksum validationVerifying data completeness after migration or transferAll

Data Processing Logic

#ProblemDifficultyTimeKey ConceptWhy DEs Need ItCompany Tags
21Flatten Deeply Nested JSON to Tabular FormatMedium20 minRecursive parsing, schema inferenceSemi-structured data is everywhere (APIs, logs, events)Airbnb, Snowflake, Databricks
22Implement a Custom Partitioner for Skewed DataHard25 minData skew mitigation, saltingSkewed keys cause OOM and stragglers in SparkGoogle, Meta, Databricks
23Build a File Format Converter (CSV to Parquet with Schema)Easy15 minColumnar storage, type inferenceFile format choice impacts query performance by 10-100xAll
24Implement Log Parsing with Regex and Error HandlingEasy15 minRegex, defensive parsingLog processing is the most common first DE taskAll
25Build a Data Deduplication PipelineMedium20 minHashing, window-based dedupExact and near-duplicate detection at scaleMeta, Airbnb, Uber

Streaming & Real-Time

#ProblemDifficultyTimeKey ConceptWhy DEs Need ItCompany Tags
26Implement a Tumbling Window AggregationMedium20 minTime-based windowing, watermarksStreaming aggregation is core to real-time analyticsUber, LinkedIn, Confluent
27Handle Late-Arriving Events in a StreamHard25 minWatermarks, allowed lateness, side outputsLate data is the norm, not the exceptionGoogle, Uber, Confluent

:::tip Coding Round Expectations Data Engineering coding rounds are NOT LeetCode rounds. Interviewers expect:

  • Clean separation of concerns (extract, transform, load as separate functions)
  • Proper error handling and logging
  • Idempotent operations (safe to re-run)
  • Configuration-driven design (no hardcoded paths or schemas) :::

Round 3: System Design (13 Problems)

Data Engineering system design focuses on data platforms, pipeline architecture, and storage systems. Unlike MLE system design, the emphasis is on data flow, not model architecture.

Data Pipeline Architecture

#ProblemDifficultyTimeKey ConceptWhy DEs Need ItCompany Tags
28Design a Real-Time Event Processing PipelineHard45 minKafka, stream processing, exactly-once semanticsReal-time data is table stakes at scaleUber, LinkedIn, Google
29Design a Data Lake ArchitectureHard45 minStorage layers (bronze/silver/gold), metadata managementThe foundational data platform decisionFAANG, Databricks, Snowflake
30Design a Change Data Capture PipelineMedium35 minCDC, log-based replication, schema evolutionSyncing operational DBs to analytics is the #1 DE use caseUber, Airbnb, Stripe
31Design a Data Quality Monitoring SystemMedium35 minStatistical checks, anomaly detection, alertingBad data costs millions; monitoring prevents itFAANG, All
32Design a Batch ETL Platform for 10TB/DayMedium35 minOrchestration, partitioning, retry, monitoringThe bread-and-butter DE system design problemAmazon, Google, Big Tech

Data Modeling & Storage

#ProblemDifficultyTimeKey ConceptWhy DEs Need ItCompany Tags
33Design a Star Schema for an E-Commerce Data WarehouseMedium30 minFact and dimension tables, grain, conformed dimensionsDimensional modeling is foundational to analytics engineeringAmazon, Walmart, All
34Design a Metrics LayerHard40 minMetric definitions, consistency, self-serve analyticsEnsuring every team computes "revenue" the same wayAirbnb, Uber, Transform
35Design Schema Evolution Strategy for a Data LakeMedium30 minSchema registry, backward/forward compatibilitySchema changes must not break downstream consumersConfluent, Databricks, FAANG
36Design a Multi-Tenant Data PlatformHard45 minIsolation, access control, resource quotasServing multiple teams from one platformFAANG, Snowflake, Databricks

Distributed Systems for Data

#ProblemDifficultyTimeKey ConceptWhy DEs Need ItCompany Tags
37Design a Distributed Task Scheduler (Like Airflow)Hard45 minDAG execution, dependency management, fault toleranceUnderstanding orchestration internalsAirbnb, Google, Astronomer
38Design a Distributed File Storage SystemHard45 minReplication, partitioning, consistencyUnderstanding HDFS/S3 internals at a design levelGoogle, Amazon, Meta
39Explain and Design Around the CAP TheoremMedium25 minConsistency vs. availability tradeoffsEvery storage decision involves CAP tradeoffsFAANG, All
40Design a Data Catalog and Discovery SystemMedium35 minMetadata management, lineage, searchData discovery is critical as data assets growLyft, Airbnb, LinkedIn

:::note System Design Evaluation Criteria for DEs Interviewers evaluate DE system designs on:

  • Reliability: What happens when things fail? Retries, idempotency, dead letter queues
  • Scalability: How does it handle 10x data growth?
  • Data Quality: Where are the validation checkpoints?
  • Operability: How do you monitor, debug, and backfill?
  • Cost: Storage format choices, compute optimization, data lifecycle :::

Round 4: Domain Knowledge & Debugging (5 Problems)

These are discussion-based problems that test real-world experience.

#ProblemDifficultyTimeKey ConceptWhy DEs Need ItCompany Tags
41Debug a Pipeline That Produces Duplicate RecordsMedium20 minIdempotency, exactly-once processingThe most common production DE bugAll
42Explain Data Lineage and Its ImportanceEasy15 minMetadata, impact analysis, regulatory complianceGDPR, SOX, and debugging all require lineageFAANG, Finance
43Compare Batch vs. Stream Processing TradeoffsEasy15 minLatency, complexity, cost, correctnessChoosing the right processing paradigmAll
44Design a Data Governance StrategyMedium25 minPII handling, access controls, retention policiesRegulatory compliance is non-negotiableFAANG, Finance, Healthcare
45Troubleshoot a Spark Job That Is Running SlowlyMedium20 minShuffle optimization, partition sizing, skew detectionSpark tuning is daily work for many DEsDatabricks, FAANG, Big Tech

4-Week Data Engineer Study Plan

WeekFocusProblemsDaily Load
Week 1SQL mastery#1-15 (SQL round)2-3 problems/day
Week 2Pipeline coding#16-27 (ETL + streaming)2 problems/day + review SQL
Week 3System design#28-40 (Architecture + modeling)1-2 designs/day
Week 4Integration + domain#41-45 + mocks1 problem + 1 mock/day

Week 1: SQL Deep Dive

Day 1: #1, #2 (window functions: DENSE_RANK, running totals)
Day 2: #3, #4 (gaps detection, sessionization)
Day 3: #5, #6 (pivot, recursive CTE)
Day 4: #7 (query optimization - spend extra time on EXPLAIN)
Day 5: #8, #9 (fuzzy dedup, retention cohorts)
Day 6: #10, #11 (SCD Type 2, Cartesian join debug)
Day 7: #12, #13, #14, #15 (optimization topics + MERGE)

Week 2: Pipeline Coding

Day 1: #16, #17 (schema validation, incremental loads)
Day 2: #18, #19 (dead letter queue, backfill orchestrator)
Day 3: #20, #21 (reconciliation, nested JSON flattening)
Day 4: #22, #23 (custom partitioner, file format conversion)
Day 5: #24, #25 (log parsing, deduplication pipeline)
Day 6: #26, #27 (tumbling windows, late events)
Day 7: Review weak problems from weeks 1-2

Week 3: System Design Sprint

Day 1: #28 (real-time event processing - the big one)
Day 2: #29 (data lake architecture)
Day 3: #30, #31 (CDC pipeline, data quality monitoring)
Day 4: #32, #33 (batch ETL platform, star schema)
Day 5: #34, #35 (metrics layer, schema evolution)
Day 6: #36, #37 (multi-tenant platform, distributed scheduler)
Day 7: #38, #39, #40 (distributed storage, CAP, data catalog)

Week 4: Polish and Mock

Day 1: #41-45 (domain knowledge and debugging)
Day 2-3: Re-solve all Yellow/Red problems
Day 4-5: Full mock interviews (1 SQL + 1 coding + 1 design)
Day 6-7: Final review of weak areas + behavioral prep

Problem Deep Dives

Problem 4: Sessionize Clickstream Data

Why this problem matters: Sessionization is the canonical DE SQL problem. It combines window functions, conditional logic, and time-based reasoning. Every analytics-heavy company asks a variant of this.

Problem Statement: Given a table of user events with user_id, event_timestamp, and page_url, assign a session_id to each event. A new session starts when more than 30 minutes have elapsed since the previous event for the same user.

Solution Approach:

WITH time_gaps AS (
SELECT
user_id,
event_timestamp,
page_url,
LAG(event_timestamp) OVER (
PARTITION BY user_id ORDER BY event_timestamp
) AS prev_event_time
FROM clickstream
),
session_flags AS (
SELECT
*,
CASE
WHEN prev_event_time IS NULL THEN 1
WHEN EXTRACT(EPOCH FROM event_timestamp - prev_event_time) > 1800 THEN 1
ELSE 0
END AS new_session_flag
FROM time_gaps
),
session_ids AS (
SELECT
*,
SUM(new_session_flag) OVER (
PARTITION BY user_id ORDER BY event_timestamp
) AS session_id
FROM session_flags
)
SELECT
user_id,
event_timestamp,
page_url,
user_id || '-' || session_id AS session_id
FROM session_ids;

Key Points Interviewers Check:

  • Correct use of LAG with PARTITION BY
  • Proper handling of the first event (no previous event)
  • Using SUM of flags as a session counter (the cumulative sum trick)
  • Ability to modify the session timeout parameter

Problem 29: Design a Data Lake Architecture

Why this problem matters: This is the most strategic DE system design problem. Your answer reveals whether you think about data platforms architecturally or just as a collection of scripts.

Architecture (Medallion / Multi-Hop):

Raw Sources -> Bronze -> Silver -> Gold -> Consumers

1. Bronze Layer (Raw)
- Exact copy of source data, append-only
- Schema: source schema preserved, metadata columns added
- Format: Parquet/Delta with ingestion timestamp
- Retention: 90 days to 2 years

2. Silver Layer (Cleaned)
- Deduplication, type casting, null handling
- Standardized schemas, conformed dimensions
- Incremental processing with watermarks
- Format: Delta/Iceberg with partitioning

3. Gold Layer (Business)
- Aggregated, business-ready datasets
- Star schemas, metrics tables, feature tables
- SLAs on freshness and quality
- Format: Optimized for query patterns

4. Metadata & Governance
- Schema registry for all layers
- Data catalog with lineage tracking
- Access control per layer/dataset
- Quality checks at each transition

Key Design Decisions to Discuss:

  • Table format: Delta Lake vs. Iceberg vs. Hudi
  • Storage: S3/GCS/ADLS object storage
  • Compute: Spark vs. serverless (Athena, BigQuery)
  • Orchestration: Airflow, Dagster, Prefect
  • Catalog: Unity Catalog, AWS Glue, Hive Metastore

Data Engineer Patterns to Master

PatternWhere It AppearsProblems
Window functionsSQL analysis, sessionization, dedup#1, #2, #3, #4, #9
Idempotent processingPipeline design, MERGE, backfill#15, #17, #19, #41
Schema evolutionData lake, CDC, streaming#16, #30, #35
Partitioning strategiesQuery optimization, storage layout#12, #14, #22, #32
Watermark-based processingIncremental loads, streaming#17, #26, #27
Data quality checksValidation, monitoring, reconciliation#20, #31, #44
Medallion architectureData lake, pipeline layering#29, #32
Exactly-once semanticsStreaming, dedup, idempotency#27, #28, #41

SQL vs. Python: When to Use Which

A common interview discussion topic for Data Engineers:

ScenarioPreferred ToolReasoning
Ad-hoc analysisSQLFaster iteration, optimizer handles it
Aggregation on structured dataSQLDeclarative, easy to audit
Complex business logicPython/SparkProcedural logic, testing, modularity
Schema validationPythonProgrammatic checks, richer error handling
StreamingPython/Scala + frameworkKafka Streams, Flink, Spark Structured Streaming
UDF-heavy transformationsPython + SQLBest of both worlds
Data quality checksSQL + PythonSQL for checks, Python for orchestration

Technology Stack Expectations by Company Tier

TierSQLProcessingOrchestrationStorageStreaming
FAANGAdvanced (Presto/Spark SQL)Spark, internal toolsAirflow, internalS3/GCS, internalKafka, Flink
UnicornAdvanced (Snowflake/BigQuery)Spark, dbtAirflow, DagsterS3/GCS, Delta/IcebergKafka, Kinesis
Growth StageIntermediate+dbt, Spark, PythonAirflow, PrefectS3, SnowflakeKafka, Pub/Sub
StartupIntermediatePython, dbtCron, basic AirflowPostgres, BigQuerySimple queues

:::danger Common Data Engineer Interview Mistakes

  1. Overcomplicating SQL - Use window functions instead of nested subqueries
  2. Ignoring data quality - Every pipeline answer should mention validation
  3. No error handling - Pipelines fail; your design must handle it
  4. Forgetting idempotency - Pipelines get re-run; they must produce the same result
  5. Not discussing monitoring - A pipeline without observability is a ticking time bomb
  6. Treating batch and streaming as equivalent - They have fundamentally different guarantees :::

Difficulty Distribution

DifficultyProblemsCount
Easy#11, #23, #24, #42, #435
Medium#1, #2, #3, #5, #8, #9, #12, #13, #15, #16, #17, #18, #20, #21, #25, #26, #30, #31, #32, #33, #35, #39, #40, #41, #44, #4526
Hard#4, #6, #7, #10, #14, #19, #22, #27, #28, #29, #34, #36, #37, #3814

Problem Deep Dives (Continued)

Problem 7: Optimize a Slow Query with Proper Indexing Strategy

Why this problem matters: Query optimization is the single most practical skill a Data Engineer uses on a daily basis. Interviewers want to see that you can read an EXPLAIN plan, identify bottlenecks, and apply the right fix without over-indexing.

Problem Setup:

-- This query runs on a 500M row events table and takes 45 minutes
SELECT
user_id,
DATE(event_time) AS event_date,
COUNT(*) AS event_count,
COUNT(DISTINCT session_id) AS session_count
FROM events
WHERE event_time >= '2025-01-01'
AND event_time < '2025-04-01'
AND country = 'US'
AND event_type IN ('page_view', 'click', 'purchase')
GROUP BY user_id, DATE(event_time)
ORDER BY event_count DESC
LIMIT 100;

EXPLAIN Plan Analysis (What to Look For):

Seq Scan on events (cost=0.00..125000000.00)
Filter: (event_time >= ... AND country = ...)
Rows Removed by Filter: 450000000
Sort (cost=...)
Sort Key: event_count DESC
Sort Method: external merge (Disk)

Red Flags in This Plan:

  1. Sequential scan on a 500M row table (no index being used)
  2. 450M rows removed by filter (90% of data scanned and discarded)
  3. Disk-based sort (insufficient work_mem)

Optimization Steps:

StepActionImpact
1Add partition by event_time (monthly)Prune 9 of 12 month partitions; scan 125M instead of 500M
2Add composite index (country, event_type, event_time)Further reduce scanned rows within partition
3Pre-aggregate daily summaries in a materialized viewReduce 125M rows to ~2M daily aggregates
4Increase work_mem for this queryIn-memory sort instead of disk sort

Key Points Interviewers Check:

  • Can you read an EXPLAIN plan and identify the bottleneck?
  • Do you know the difference between Seq Scan, Index Scan, Index Only Scan, and Bitmap Scan?
  • Can you reason about index selectivity (high cardinality = good index candidate)?
  • Do you understand the cost of maintaining indexes on write-heavy tables?
  • Can you explain when a materialized view is better than an index?

Problem 10: Implement Slowly Changing Dimension Type 2

Why this problem matters: SCD Type 2 is the foundation of historical tracking in data warehouses. Every analytics team needs to know what a customer's status was at the time of a transaction, not what it is today.

Problem Statement: Given a customers source table that gets updated in place, maintain a dim_customers table that tracks all historical states with effective date ranges.

Solution:

-- Step 1: Close existing records that have changed
UPDATE dim_customers d
SET
effective_end_date = CURRENT_DATE - INTERVAL '1 day',
is_current = FALSE
FROM staging_customers s
WHERE d.customer_id = s.customer_id
AND d.is_current = TRUE
AND (
d.name != s.name
OR d.email != s.email
OR d.plan_tier != s.plan_tier
OR d.address != s.address
);

-- Step 2: Insert new versions for changed records
INSERT INTO dim_customers (
customer_id, name, email, plan_tier, address,
effective_start_date, effective_end_date, is_current
)
SELECT
s.customer_id, s.name, s.email, s.plan_tier, s.address,
CURRENT_DATE,
'9999-12-31'::DATE,
TRUE
FROM staging_customers s
JOIN dim_customers d
ON s.customer_id = d.customer_id
WHERE d.effective_end_date = CURRENT_DATE - INTERVAL '1 day'
AND d.is_current = FALSE;

-- Step 3: Insert genuinely new customers
INSERT INTO dim_customers (
customer_id, name, email, plan_tier, address,
effective_start_date, effective_end_date, is_current
)
SELECT
s.customer_id, s.name, s.email, s.plan_tier, s.address,
CURRENT_DATE,
'9999-12-31'::DATE,
TRUE
FROM staging_customers s
WHERE NOT EXISTS (
SELECT 1 FROM dim_customers d
WHERE d.customer_id = s.customer_id
);

Key Design Decisions:

  • effective_end_date = '9999-12-31' for current records (avoids NULL comparisons)
  • is_current boolean flag for fast lookup of current state
  • Hash-based change detection (hash all columns, compare hash) is faster for wide tables
  • Surrogate keys (auto-increment) vs. natural keys for the dimension

Interview Follow-Ups:

  • How do you handle SCD Type 2 with Spark (hint: merge with whenMatchedUpdateIf condition)?
  • What about Type 1 (overwrite) vs. Type 3 (previous + current columns)?
  • How does SCD Type 2 interact with fact table foreign keys?

Problem 22: Implement a Custom Partitioner for Skewed Data

Why this problem matters: Data skew is the #1 cause of slow Spark jobs. A single partition with 100x more data than others causes the entire job to wait. This problem tests whether you have real production experience with distributed data processing.

Problem Statement: You have a Spark job that joins two tables on merchant_id. One merchant accounts for 40% of all transactions. The join stage takes 4 hours because one executor processes 40% of the data while others sit idle.

Solution: Salting Approach

from pyspark.sql import functions as F
import random

# Configuration
SALT_BUCKETS = 10
skewed_key = "merchant_12345" # The problematic merchant

# Step 1: Salt the large (left) table
transactions_salted = transactions.withColumn(
"salt",
F.when(
F.col("merchant_id") == skewed_key,
F.floor(F.rand() * SALT_BUCKETS).cast("int")
).otherwise(F.lit(0))
).withColumn(
"salted_key",
F.concat(F.col("merchant_id"), F.lit("_"), F.col("salt"))
)

# Step 2: Replicate the small (right) table for skewed keys
from functools import reduce
from pyspark.sql import DataFrame

def replicate_for_skewed(df, skewed_key, salt_buckets):
skewed = df.filter(F.col("merchant_id") == skewed_key)
non_skewed = df.filter(F.col("merchant_id") != skewed_key)

replicated = reduce(
DataFrame.unionAll,
[
skewed.withColumn("salt", F.lit(i))
for i in range(salt_buckets)
]
)
non_skewed_with_salt = non_skewed.withColumn("salt", F.lit(0))

return replicated.unionAll(non_skewed_with_salt).withColumn(
"salted_key",
F.concat(F.col("merchant_id"), F.lit("_"), F.col("salt"))
)

merchants_replicated = replicate_for_skewed(merchants, skewed_key, SALT_BUCKETS)

# Step 3: Join on salted key
result = transactions_salted.join(
merchants_replicated,
on="salted_key",
how="inner"
).drop("salt", "salted_key")

Why This Works:

  • The skewed merchant's transactions are distributed across 10 partitions (salt 0-9)
  • The merchant's reference data is replicated 10 times (one copy per salt bucket)
  • Each executor now processes ~4% of transactions instead of 40%
  • Non-skewed merchants are unaffected (salt = 0 for both sides)

Alternative Approaches:

  1. Broadcast join: If the right table is small (<100MB), broadcast it to all executors
  2. Adaptive Query Execution (AQE): Spark 3.0+ can detect and handle skew automatically
  3. Separate processing: Process the skewed key separately and union the results
  4. Pre-aggregation: If possible, aggregate the skewed key before joining

Problem 37: Design a Distributed Task Scheduler (Like Airflow)

Why this problem matters: Understanding how orchestrators work internally makes you better at using them and debugging them. This is a Staff-level DE system design question.

Architecture:

Web UI -> API Server -> Scheduler -> Executor Pool -> Workers

Components:
1. DAG Parser
- Read DAG definitions (Python files or YAML)
- Build dependency graph
- Validate: no cycles, all operators exist, correct parameters

2. Scheduler (the brain)
- Event loop: check for ready tasks every N seconds
- Task state machine: queued -> scheduled -> running -> success/failed
- Dependency resolution: task is ready when all upstream tasks succeeded
- Concurrency control: max active tasks per DAG, per pool, globally

3. Executor
- LocalExecutor: runs tasks as subprocesses (dev/small scale)
- CeleryExecutor: distributes tasks to Celery workers (production)
- KubernetesExecutor: spins up K8s pods per task (elastic scaling)

4. Metadata Database
- Task instance state (PostgreSQL)
- DAG run history
- XCom (cross-communication between tasks)
- Variable and connection storage

5. Worker
- Pulls task from queue
- Executes the operator (BashOperator, PythonOperator, etc.)
- Reports status back to metadata DB
- Handles heartbeat (so scheduler knows worker is alive)

Key Design Decisions:
- Pull vs. push model for task distribution
- Exactly-once execution guarantee (database lock on task instance)
- Retry logic (exponential backoff, max retries)
- Dead letter handling for permanently failed tasks
- Scheduler HA (active-passive or active-active with database locking)

Scaling Challenges:

ScaleChallengeSolution
100 DAGsSingle scheduler is fineNo changes needed
1,000 DAGsScheduler parsing becomes slowDAG serialization, pre-parse to DB
10,000 DAGsSingle scheduler bottleneckMultiple schedulers with database locking
100,000 tasks/dayExecutor throughputKubernetesExecutor with auto-scaling
1M tasks/dayMetadata DB loadDatabase partitioning, read replicas, archive old runs

Common SQL Window Function Patterns

A reference table for the most commonly tested window function patterns:

PatternSQLUse Case
RankingRANK() OVER (PARTITION BY dept ORDER BY salary DESC)Top N per group
Dense rankingDENSE_RANK() OVER (...)Top N with ties
Row numberROW_NUMBER() OVER (...)Deduplication (keep first/last)
Running totalSUM(amount) OVER (ORDER BY date ROWS UNBOUNDED PRECEDING)Cumulative metrics
Moving averageAVG(value) OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)Smoothed metrics
Previous valueLAG(value, 1) OVER (PARTITION BY user ORDER BY date)Change detection
Next valueLEAD(value, 1) OVER (PARTITION BY user ORDER BY date)Forward-looking analysis
First/last valueFIRST_VALUE(value) OVER (PARTITION BY user ORDER BY date)Session start/end
Percent rankPERCENT_RANK() OVER (ORDER BY score)Percentile computation
NtileNTILE(4) OVER (ORDER BY score)Bucketing into quartiles

Data Engineering Anti-Patterns

Patterns that seem reasonable but cause problems at scale:

Anti-PatternWhy It Seems ReasonableWhy It FailsBetter Approach
**SELECT ***Convenient for ad-hoc queriesWastes I/O on wide tables; breaks when schema changesExplicitly list needed columns
DELETE + INSERT instead of MERGESimpler to implementNot atomic; partial failures leave inconsistent stateUse MERGE/UPSERT for idempotency
Storing JSON strings in SQL columnsFlexible schemaCannot index, filter, or join on nested fields efficientlyUse structured columns or a document store
Cron-only schedulingNo external dependenciesNo dependency management, no retry, no monitoringUse Airflow/Dagster/Prefect
Processing all data every runSimpler logicCosts grow linearly with data; wastes compute on old dataIncremental processing with watermarks
Single large tableFewer joinsSlow queries, expensive scans, schema changes affect everythingNormalize into fact + dimension tables
No data contractsMove fast, schema evolvesDownstream breakage when upstream changes schemaSchema registry + backward compatibility checks
Manual backfills"We'll just re-run it"Error-prone, inconsistent, takes daysAutomated backfill orchestrator with validation

Data Quality Framework

A structured approach to data quality that interviewers want to hear:

DimensionWhat It MeasuresExample CheckWhere to Check
CompletenessAre all expected records present?Row count matches source; no missing datesAfter ingestion
AccuracyAre values correct?Checksums match; spot-check against sourceAfter transformation
ConsistencyDo related values agree?Order total = sum of line items; FK relationships validAfter join/aggregation
FreshnessIs data up-to-date?Latest event timestamp within SLAContinuous monitoring
UniquenessNo duplicates?Primary key is unique; no duplicate eventsAfter deduplication step
ValidityValues within expected range?Prices > 0; dates not in the future; enums are validAt ingestion and after transform

Implementation with Great Expectations or dbt tests:

# dbt schema.yml example
models:
- name: orders
columns:
- name: order_id
tests:
- unique
- not_null
- name: total_amount
tests:
- not_null
- dbt_utils.accepted_range:
min_value: 0
max_value: 1000000
- name: order_date
tests:
- not_null
- dbt_utils.accepted_range:
max_value: "current_date"
tests:
- dbt_utils.recency:
datepart: day
field: order_date
interval: 1
ResourceTypeBest For
"Designing Data-Intensive Applications" (Kleppmann)BookSystem design fundamentals
"The Data Warehouse Toolkit" (Kimball)BookDimensional modeling
DataLemur SQL problemsPracticeSQL interview prep
Spark: The Definitive GuideBookSpark internals
Confluent Kafka documentationDocsStreaming concepts
dbt documentationDocsAnalytics engineering patterns
"Fundamentals of Data Engineering" (Reis & Housley)BookModern DE practices
"Streaming Systems" (Akidau, Chernyak, Lax)BookStreaming theory and practice

Next Steps

After completing the Data Engineer problem list, consider:

© 2026 EngineersOfAI. All rights reserved.