Data Engineer Problem List
Reading time: ~45 min | Interview relevance: Critical | Roles: Data Engineer, Analytics Engineer, Data Platform Engineer, Streaming Engineer
A company has 2 billion events per day flowing into their data lake. Dashboards are slow, pipelines are failing at 3 AM, and the data science team complains that feature tables are stale. Someone needs to fix this. That someone is a Data Engineer.
Data Engineering interviews test a unique combination of skills: SQL mastery, pipeline architecture, data modeling, distributed systems knowledge, and streaming fluency. This list of 45 problems covers every category you will encounter, organized by interview round and calibrated to real questions reported at major tech companies.
Data Engineer Interview Structure
Data Engineering interviews vary more than MLE or SWE interviews, but most top companies follow a pattern:
| Round | Duration | What They Test | Weight |
|---|---|---|---|
| SQL & Data Manipulation | 45-60 min | Complex queries, window functions, optimization | 25-30% |
| Coding (Python/Scala) | 45-60 min | Data processing logic, ETL implementation | 20-25% |
| System Design | 45-60 min | Pipeline architecture, data modeling, scale | 25-30% |
| Domain & Behavioral | 30-45 min | Data quality, debugging, collaboration | 15-20% |
:::tip The Data Engineer Differentiator Data Engineers are judged by reliability, not cleverness. An interviewer would rather see a boring, correct, fault-tolerant pipeline than an elegant but fragile one. Always talk about error handling, idempotency, and monitoring. :::
Round 1: SQL & Query Optimization (15 Problems)
SQL is the lingua franca of data engineering. You will face at least one round of pure SQL, often with follow-up questions about query plans and optimization.
Core SQL Problems
| # | Problem | Difficulty | Time | Key Concept | Why DEs Need It | Company Tags |
|---|---|---|---|---|---|---|
| 1 | Find the Nth Highest Salary per Department | Medium | 15 min | Window functions (DENSE_RANK) | Ranking within groups is fundamental to reporting and analytics | Google, Amazon, Meta |
| 2 | Compute Running Totals with Resets | Medium | 20 min | Window functions, conditional aggregation | Revenue tracking, session analysis, cumulative metrics | Uber, Airbnb, Stripe |
| 3 | Identify Gaps in Sequential Data | Medium | 15 min | Self-join, LAG/LEAD | Missing date detection, pipeline monitoring, SLA tracking | FAANG, All |
| 4 | Sessionize Clickstream Data | Hard | 25 min | Window functions, conditional logic, time gaps | User behavior analysis, session-based analytics | Google, Meta, Airbnb |
| 5 | Pivot Rows to Columns Dynamically | Medium | 20 min | CASE WHEN, conditional aggregation | Reporting transformations, dashboard data preparation | Amazon, Microsoft, Databricks |
Advanced SQL
| # | Problem | Difficulty | Time | Key Concept | Why DEs Need It | Company Tags |
|---|---|---|---|---|---|---|
| 6 | Recursive CTE for Hierarchical Data | Hard | 25 min | Recursive queries, tree traversal | Org charts, category trees, bill of materials | Google, Amazon, Snowflake |
| 7 | Optimize a Slow Query with Proper Indexing Strategy | Hard | 30 min | EXPLAIN plan, index selection, query rewriting | Performance tuning is daily DE work | FAANG, All |
| 8 | Deduplicate Records with Fuzzy Matching | Medium | 20 min | Similarity functions, group-by strategies | Data quality, entity resolution | Airbnb, Stripe, Databricks |
| 9 | Compute Retention Cohorts | Medium | 20 min | Self-join, date arithmetic, conditional aggregation | Product analytics, user lifecycle analysis | Meta, Spotify, Pinterest |
| 10 | Implement Slowly Changing Dimension Type 2 | Hard | 25 min | Temporal joins, effective date ranges | Dimensional modeling, historical tracking | Amazon, Snowflake, dbt Labs |
Query Optimization & Internals
| # | Problem | Difficulty | Time | Key Concept | Why DEs Need It | Company Tags |
|---|---|---|---|---|---|---|
| 11 | Explain and Fix a Cartesian Join | Easy | 10 min | Join strategies, query plan reading | Debugging production queries that blow up | All |
| 12 | Partition Pruning and Predicate Pushdown | Medium | 20 min | Query optimization, storage layout | Reducing scan cost on petabyte-scale tables | Databricks, Snowflake, Google |
| 13 | Compare Hash Join vs. Sort-Merge Join | Medium | 15 min | Join algorithms, memory vs. disk | Understanding why some queries are slow | FAANG, Big Tech |
| 14 | Optimize a Query Hitting a 100TB Table | Hard | 25 min | Partitioning, clustering, materialized views | Real-world optimization at scale | Google, Amazon, Snowflake |
| 15 | Write an Idempotent MERGE (Upsert) Statement | Medium | 15 min | MERGE/INSERT ON CONFLICT, idempotency | Pipeline reruns must not duplicate data | All |
:::warning SQL Round Red Flags
- Using correlated subqueries where window functions are appropriate
- Not considering NULL handling in joins and aggregations
- Writing queries that scan entire tables when partition pruning is possible
- Cannot explain the difference between RANK, DENSE_RANK, and ROW_NUMBER
- Ignoring data skew in GROUP BY operations :::
Round 2: Data Pipeline & ETL Coding (12 Problems)
These problems test your ability to write data processing code in Python (or Spark/Scala). Interviewers want to see clean, testable, fault-tolerant pipeline logic.
ETL Implementation
| # | Problem | Difficulty | Time | Key Concept | Why DEs Need It | Company Tags |
|---|---|---|---|---|---|---|
| 16 | Implement a Schema Validator for Incoming JSON Data | Medium | 20 min | Schema validation, error handling | Data quality at ingestion is the first line of defense | Airbnb, Stripe, All |
| 17 | Build an Incremental Data Loader with Watermarks | Medium | 25 min | Change data capture, watermark tracking | Incremental loads reduce cost and latency | FAANG, Databricks |
| 18 | Implement a Dead Letter Queue Handler | Medium | 20 min | Error routing, retry logic | Malformed records must not crash the pipeline | Uber, Stripe, Amazon |
| 19 | Build a Backfill Orchestrator | Hard | 30 min | Date range partitioning, idempotent execution | Historical reprocessing is a weekly task | FAANG, Big Tech |
| 20 | Implement a Data Reconciliation Check | Medium | 20 min | Count matching, checksum validation | Verifying data completeness after migration or transfer | All |
Data Processing Logic
| # | Problem | Difficulty | Time | Key Concept | Why DEs Need It | Company Tags |
|---|---|---|---|---|---|---|
| 21 | Flatten Deeply Nested JSON to Tabular Format | Medium | 20 min | Recursive parsing, schema inference | Semi-structured data is everywhere (APIs, logs, events) | Airbnb, Snowflake, Databricks |
| 22 | Implement a Custom Partitioner for Skewed Data | Hard | 25 min | Data skew mitigation, salting | Skewed keys cause OOM and stragglers in Spark | Google, Meta, Databricks |
| 23 | Build a File Format Converter (CSV to Parquet with Schema) | Easy | 15 min | Columnar storage, type inference | File format choice impacts query performance by 10-100x | All |
| 24 | Implement Log Parsing with Regex and Error Handling | Easy | 15 min | Regex, defensive parsing | Log processing is the most common first DE task | All |
| 25 | Build a Data Deduplication Pipeline | Medium | 20 min | Hashing, window-based dedup | Exact and near-duplicate detection at scale | Meta, Airbnb, Uber |
Streaming & Real-Time
| # | Problem | Difficulty | Time | Key Concept | Why DEs Need It | Company Tags |
|---|---|---|---|---|---|---|
| 26 | Implement a Tumbling Window Aggregation | Medium | 20 min | Time-based windowing, watermarks | Streaming aggregation is core to real-time analytics | Uber, LinkedIn, Confluent |
| 27 | Handle Late-Arriving Events in a Stream | Hard | 25 min | Watermarks, allowed lateness, side outputs | Late data is the norm, not the exception | Google, Uber, Confluent |
:::tip Coding Round Expectations Data Engineering coding rounds are NOT LeetCode rounds. Interviewers expect:
- Clean separation of concerns (extract, transform, load as separate functions)
- Proper error handling and logging
- Idempotent operations (safe to re-run)
- Configuration-driven design (no hardcoded paths or schemas) :::
Round 3: System Design (13 Problems)
Data Engineering system design focuses on data platforms, pipeline architecture, and storage systems. Unlike MLE system design, the emphasis is on data flow, not model architecture.
Data Pipeline Architecture
| # | Problem | Difficulty | Time | Key Concept | Why DEs Need It | Company Tags |
|---|---|---|---|---|---|---|
| 28 | Design a Real-Time Event Processing Pipeline | Hard | 45 min | Kafka, stream processing, exactly-once semantics | Real-time data is table stakes at scale | Uber, LinkedIn, Google |
| 29 | Design a Data Lake Architecture | Hard | 45 min | Storage layers (bronze/silver/gold), metadata management | The foundational data platform decision | FAANG, Databricks, Snowflake |
| 30 | Design a Change Data Capture Pipeline | Medium | 35 min | CDC, log-based replication, schema evolution | Syncing operational DBs to analytics is the #1 DE use case | Uber, Airbnb, Stripe |
| 31 | Design a Data Quality Monitoring System | Medium | 35 min | Statistical checks, anomaly detection, alerting | Bad data costs millions; monitoring prevents it | FAANG, All |
| 32 | Design a Batch ETL Platform for 10TB/Day | Medium | 35 min | Orchestration, partitioning, retry, monitoring | The bread-and-butter DE system design problem | Amazon, Google, Big Tech |
Data Modeling & Storage
| # | Problem | Difficulty | Time | Key Concept | Why DEs Need It | Company Tags |
|---|---|---|---|---|---|---|
| 33 | Design a Star Schema for an E-Commerce Data Warehouse | Medium | 30 min | Fact and dimension tables, grain, conformed dimensions | Dimensional modeling is foundational to analytics engineering | Amazon, Walmart, All |
| 34 | Design a Metrics Layer | Hard | 40 min | Metric definitions, consistency, self-serve analytics | Ensuring every team computes "revenue" the same way | Airbnb, Uber, Transform |
| 35 | Design Schema Evolution Strategy for a Data Lake | Medium | 30 min | Schema registry, backward/forward compatibility | Schema changes must not break downstream consumers | Confluent, Databricks, FAANG |
| 36 | Design a Multi-Tenant Data Platform | Hard | 45 min | Isolation, access control, resource quotas | Serving multiple teams from one platform | FAANG, Snowflake, Databricks |
Distributed Systems for Data
| # | Problem | Difficulty | Time | Key Concept | Why DEs Need It | Company Tags |
|---|---|---|---|---|---|---|
| 37 | Design a Distributed Task Scheduler (Like Airflow) | Hard | 45 min | DAG execution, dependency management, fault tolerance | Understanding orchestration internals | Airbnb, Google, Astronomer |
| 38 | Design a Distributed File Storage System | Hard | 45 min | Replication, partitioning, consistency | Understanding HDFS/S3 internals at a design level | Google, Amazon, Meta |
| 39 | Explain and Design Around the CAP Theorem | Medium | 25 min | Consistency vs. availability tradeoffs | Every storage decision involves CAP tradeoffs | FAANG, All |
| 40 | Design a Data Catalog and Discovery System | Medium | 35 min | Metadata management, lineage, search | Data discovery is critical as data assets grow | Lyft, Airbnb, LinkedIn |
:::note System Design Evaluation Criteria for DEs Interviewers evaluate DE system designs on:
- Reliability: What happens when things fail? Retries, idempotency, dead letter queues
- Scalability: How does it handle 10x data growth?
- Data Quality: Where are the validation checkpoints?
- Operability: How do you monitor, debug, and backfill?
- Cost: Storage format choices, compute optimization, data lifecycle :::
Round 4: Domain Knowledge & Debugging (5 Problems)
These are discussion-based problems that test real-world experience.
| # | Problem | Difficulty | Time | Key Concept | Why DEs Need It | Company Tags |
|---|---|---|---|---|---|---|
| 41 | Debug a Pipeline That Produces Duplicate Records | Medium | 20 min | Idempotency, exactly-once processing | The most common production DE bug | All |
| 42 | Explain Data Lineage and Its Importance | Easy | 15 min | Metadata, impact analysis, regulatory compliance | GDPR, SOX, and debugging all require lineage | FAANG, Finance |
| 43 | Compare Batch vs. Stream Processing Tradeoffs | Easy | 15 min | Latency, complexity, cost, correctness | Choosing the right processing paradigm | All |
| 44 | Design a Data Governance Strategy | Medium | 25 min | PII handling, access controls, retention policies | Regulatory compliance is non-negotiable | FAANG, Finance, Healthcare |
| 45 | Troubleshoot a Spark Job That Is Running Slowly | Medium | 20 min | Shuffle optimization, partition sizing, skew detection | Spark tuning is daily work for many DEs | Databricks, FAANG, Big Tech |
4-Week Data Engineer Study Plan
| Week | Focus | Problems | Daily Load |
|---|---|---|---|
| Week 1 | SQL mastery | #1-15 (SQL round) | 2-3 problems/day |
| Week 2 | Pipeline coding | #16-27 (ETL + streaming) | 2 problems/day + review SQL |
| Week 3 | System design | #28-40 (Architecture + modeling) | 1-2 designs/day |
| Week 4 | Integration + domain | #41-45 + mocks | 1 problem + 1 mock/day |
Week 1: SQL Deep Dive
Day 1: #1, #2 (window functions: DENSE_RANK, running totals)
Day 2: #3, #4 (gaps detection, sessionization)
Day 3: #5, #6 (pivot, recursive CTE)
Day 4: #7 (query optimization - spend extra time on EXPLAIN)
Day 5: #8, #9 (fuzzy dedup, retention cohorts)
Day 6: #10, #11 (SCD Type 2, Cartesian join debug)
Day 7: #12, #13, #14, #15 (optimization topics + MERGE)
Week 2: Pipeline Coding
Day 1: #16, #17 (schema validation, incremental loads)
Day 2: #18, #19 (dead letter queue, backfill orchestrator)
Day 3: #20, #21 (reconciliation, nested JSON flattening)
Day 4: #22, #23 (custom partitioner, file format conversion)
Day 5: #24, #25 (log parsing, deduplication pipeline)
Day 6: #26, #27 (tumbling windows, late events)
Day 7: Review weak problems from weeks 1-2
Week 3: System Design Sprint
Day 1: #28 (real-time event processing - the big one)
Day 2: #29 (data lake architecture)
Day 3: #30, #31 (CDC pipeline, data quality monitoring)
Day 4: #32, #33 (batch ETL platform, star schema)
Day 5: #34, #35 (metrics layer, schema evolution)
Day 6: #36, #37 (multi-tenant platform, distributed scheduler)
Day 7: #38, #39, #40 (distributed storage, CAP, data catalog)
Week 4: Polish and Mock
Day 1: #41-45 (domain knowledge and debugging)
Day 2-3: Re-solve all Yellow/Red problems
Day 4-5: Full mock interviews (1 SQL + 1 coding + 1 design)
Day 6-7: Final review of weak areas + behavioral prep
Problem Deep Dives
Problem 4: Sessionize Clickstream Data
Why this problem matters: Sessionization is the canonical DE SQL problem. It combines window functions, conditional logic, and time-based reasoning. Every analytics-heavy company asks a variant of this.
Problem Statement:
Given a table of user events with user_id, event_timestamp, and page_url, assign a session_id to each event. A new session starts when more than 30 minutes have elapsed since the previous event for the same user.
Solution Approach:
WITH time_gaps AS (
SELECT
user_id,
event_timestamp,
page_url,
LAG(event_timestamp) OVER (
PARTITION BY user_id ORDER BY event_timestamp
) AS prev_event_time
FROM clickstream
),
session_flags AS (
SELECT
*,
CASE
WHEN prev_event_time IS NULL THEN 1
WHEN EXTRACT(EPOCH FROM event_timestamp - prev_event_time) > 1800 THEN 1
ELSE 0
END AS new_session_flag
FROM time_gaps
),
session_ids AS (
SELECT
*,
SUM(new_session_flag) OVER (
PARTITION BY user_id ORDER BY event_timestamp
) AS session_id
FROM session_flags
)
SELECT
user_id,
event_timestamp,
page_url,
user_id || '-' || session_id AS session_id
FROM session_ids;
Key Points Interviewers Check:
- Correct use of LAG with PARTITION BY
- Proper handling of the first event (no previous event)
- Using SUM of flags as a session counter (the cumulative sum trick)
- Ability to modify the session timeout parameter
Problem 29: Design a Data Lake Architecture
Why this problem matters: This is the most strategic DE system design problem. Your answer reveals whether you think about data platforms architecturally or just as a collection of scripts.
Architecture (Medallion / Multi-Hop):
Raw Sources -> Bronze -> Silver -> Gold -> Consumers
1. Bronze Layer (Raw)
- Exact copy of source data, append-only
- Schema: source schema preserved, metadata columns added
- Format: Parquet/Delta with ingestion timestamp
- Retention: 90 days to 2 years
2. Silver Layer (Cleaned)
- Deduplication, type casting, null handling
- Standardized schemas, conformed dimensions
- Incremental processing with watermarks
- Format: Delta/Iceberg with partitioning
3. Gold Layer (Business)
- Aggregated, business-ready datasets
- Star schemas, metrics tables, feature tables
- SLAs on freshness and quality
- Format: Optimized for query patterns
4. Metadata & Governance
- Schema registry for all layers
- Data catalog with lineage tracking
- Access control per layer/dataset
- Quality checks at each transition
Key Design Decisions to Discuss:
- Table format: Delta Lake vs. Iceberg vs. Hudi
- Storage: S3/GCS/ADLS object storage
- Compute: Spark vs. serverless (Athena, BigQuery)
- Orchestration: Airflow, Dagster, Prefect
- Catalog: Unity Catalog, AWS Glue, Hive Metastore
Data Engineer Patterns to Master
| Pattern | Where It Appears | Problems |
|---|---|---|
| Window functions | SQL analysis, sessionization, dedup | #1, #2, #3, #4, #9 |
| Idempotent processing | Pipeline design, MERGE, backfill | #15, #17, #19, #41 |
| Schema evolution | Data lake, CDC, streaming | #16, #30, #35 |
| Partitioning strategies | Query optimization, storage layout | #12, #14, #22, #32 |
| Watermark-based processing | Incremental loads, streaming | #17, #26, #27 |
| Data quality checks | Validation, monitoring, reconciliation | #20, #31, #44 |
| Medallion architecture | Data lake, pipeline layering | #29, #32 |
| Exactly-once semantics | Streaming, dedup, idempotency | #27, #28, #41 |
SQL vs. Python: When to Use Which
A common interview discussion topic for Data Engineers:
| Scenario | Preferred Tool | Reasoning |
|---|---|---|
| Ad-hoc analysis | SQL | Faster iteration, optimizer handles it |
| Aggregation on structured data | SQL | Declarative, easy to audit |
| Complex business logic | Python/Spark | Procedural logic, testing, modularity |
| Schema validation | Python | Programmatic checks, richer error handling |
| Streaming | Python/Scala + framework | Kafka Streams, Flink, Spark Structured Streaming |
| UDF-heavy transformations | Python + SQL | Best of both worlds |
| Data quality checks | SQL + Python | SQL for checks, Python for orchestration |
Technology Stack Expectations by Company Tier
| Tier | SQL | Processing | Orchestration | Storage | Streaming |
|---|---|---|---|---|---|
| FAANG | Advanced (Presto/Spark SQL) | Spark, internal tools | Airflow, internal | S3/GCS, internal | Kafka, Flink |
| Unicorn | Advanced (Snowflake/BigQuery) | Spark, dbt | Airflow, Dagster | S3/GCS, Delta/Iceberg | Kafka, Kinesis |
| Growth Stage | Intermediate+ | dbt, Spark, Python | Airflow, Prefect | S3, Snowflake | Kafka, Pub/Sub |
| Startup | Intermediate | Python, dbt | Cron, basic Airflow | Postgres, BigQuery | Simple queues |
:::danger Common Data Engineer Interview Mistakes
- Overcomplicating SQL - Use window functions instead of nested subqueries
- Ignoring data quality - Every pipeline answer should mention validation
- No error handling - Pipelines fail; your design must handle it
- Forgetting idempotency - Pipelines get re-run; they must produce the same result
- Not discussing monitoring - A pipeline without observability is a ticking time bomb
- Treating batch and streaming as equivalent - They have fundamentally different guarantees :::
Difficulty Distribution
| Difficulty | Problems | Count |
|---|---|---|
| Easy | #11, #23, #24, #42, #43 | 5 |
| Medium | #1, #2, #3, #5, #8, #9, #12, #13, #15, #16, #17, #18, #20, #21, #25, #26, #30, #31, #32, #33, #35, #39, #40, #41, #44, #45 | 26 |
| Hard | #4, #6, #7, #10, #14, #19, #22, #27, #28, #29, #34, #36, #37, #38 | 14 |
Problem Deep Dives (Continued)
Problem 7: Optimize a Slow Query with Proper Indexing Strategy
Why this problem matters: Query optimization is the single most practical skill a Data Engineer uses on a daily basis. Interviewers want to see that you can read an EXPLAIN plan, identify bottlenecks, and apply the right fix without over-indexing.
Problem Setup:
-- This query runs on a 500M row events table and takes 45 minutes
SELECT
user_id,
DATE(event_time) AS event_date,
COUNT(*) AS event_count,
COUNT(DISTINCT session_id) AS session_count
FROM events
WHERE event_time >= '2025-01-01'
AND event_time < '2025-04-01'
AND country = 'US'
AND event_type IN ('page_view', 'click', 'purchase')
GROUP BY user_id, DATE(event_time)
ORDER BY event_count DESC
LIMIT 100;
EXPLAIN Plan Analysis (What to Look For):
Seq Scan on events (cost=0.00..125000000.00)
Filter: (event_time >= ... AND country = ...)
Rows Removed by Filter: 450000000
Sort (cost=...)
Sort Key: event_count DESC
Sort Method: external merge (Disk)
Red Flags in This Plan:
- Sequential scan on a 500M row table (no index being used)
- 450M rows removed by filter (90% of data scanned and discarded)
- Disk-based sort (insufficient work_mem)
Optimization Steps:
| Step | Action | Impact |
|---|---|---|
| 1 | Add partition by event_time (monthly) | Prune 9 of 12 month partitions; scan 125M instead of 500M |
| 2 | Add composite index (country, event_type, event_time) | Further reduce scanned rows within partition |
| 3 | Pre-aggregate daily summaries in a materialized view | Reduce 125M rows to ~2M daily aggregates |
| 4 | Increase work_mem for this query | In-memory sort instead of disk sort |
Key Points Interviewers Check:
- Can you read an EXPLAIN plan and identify the bottleneck?
- Do you know the difference between Seq Scan, Index Scan, Index Only Scan, and Bitmap Scan?
- Can you reason about index selectivity (high cardinality = good index candidate)?
- Do you understand the cost of maintaining indexes on write-heavy tables?
- Can you explain when a materialized view is better than an index?
Problem 10: Implement Slowly Changing Dimension Type 2
Why this problem matters: SCD Type 2 is the foundation of historical tracking in data warehouses. Every analytics team needs to know what a customer's status was at the time of a transaction, not what it is today.
Problem Statement:
Given a customers source table that gets updated in place, maintain a dim_customers table that tracks all historical states with effective date ranges.
Solution:
-- Step 1: Close existing records that have changed
UPDATE dim_customers d
SET
effective_end_date = CURRENT_DATE - INTERVAL '1 day',
is_current = FALSE
FROM staging_customers s
WHERE d.customer_id = s.customer_id
AND d.is_current = TRUE
AND (
d.name != s.name
OR d.email != s.email
OR d.plan_tier != s.plan_tier
OR d.address != s.address
);
-- Step 2: Insert new versions for changed records
INSERT INTO dim_customers (
customer_id, name, email, plan_tier, address,
effective_start_date, effective_end_date, is_current
)
SELECT
s.customer_id, s.name, s.email, s.plan_tier, s.address,
CURRENT_DATE,
'9999-12-31'::DATE,
TRUE
FROM staging_customers s
JOIN dim_customers d
ON s.customer_id = d.customer_id
WHERE d.effective_end_date = CURRENT_DATE - INTERVAL '1 day'
AND d.is_current = FALSE;
-- Step 3: Insert genuinely new customers
INSERT INTO dim_customers (
customer_id, name, email, plan_tier, address,
effective_start_date, effective_end_date, is_current
)
SELECT
s.customer_id, s.name, s.email, s.plan_tier, s.address,
CURRENT_DATE,
'9999-12-31'::DATE,
TRUE
FROM staging_customers s
WHERE NOT EXISTS (
SELECT 1 FROM dim_customers d
WHERE d.customer_id = s.customer_id
);
Key Design Decisions:
effective_end_date = '9999-12-31'for current records (avoids NULL comparisons)is_currentboolean flag for fast lookup of current state- Hash-based change detection (hash all columns, compare hash) is faster for wide tables
- Surrogate keys (auto-increment) vs. natural keys for the dimension
Interview Follow-Ups:
- How do you handle SCD Type 2 with Spark (hint: merge with
whenMatchedUpdateIfcondition)? - What about Type 1 (overwrite) vs. Type 3 (previous + current columns)?
- How does SCD Type 2 interact with fact table foreign keys?
Problem 22: Implement a Custom Partitioner for Skewed Data
Why this problem matters: Data skew is the #1 cause of slow Spark jobs. A single partition with 100x more data than others causes the entire job to wait. This problem tests whether you have real production experience with distributed data processing.
Problem Statement:
You have a Spark job that joins two tables on merchant_id. One merchant accounts for 40% of all transactions. The join stage takes 4 hours because one executor processes 40% of the data while others sit idle.
Solution: Salting Approach
from pyspark.sql import functions as F
import random
# Configuration
SALT_BUCKETS = 10
skewed_key = "merchant_12345" # The problematic merchant
# Step 1: Salt the large (left) table
transactions_salted = transactions.withColumn(
"salt",
F.when(
F.col("merchant_id") == skewed_key,
F.floor(F.rand() * SALT_BUCKETS).cast("int")
).otherwise(F.lit(0))
).withColumn(
"salted_key",
F.concat(F.col("merchant_id"), F.lit("_"), F.col("salt"))
)
# Step 2: Replicate the small (right) table for skewed keys
from functools import reduce
from pyspark.sql import DataFrame
def replicate_for_skewed(df, skewed_key, salt_buckets):
skewed = df.filter(F.col("merchant_id") == skewed_key)
non_skewed = df.filter(F.col("merchant_id") != skewed_key)
replicated = reduce(
DataFrame.unionAll,
[
skewed.withColumn("salt", F.lit(i))
for i in range(salt_buckets)
]
)
non_skewed_with_salt = non_skewed.withColumn("salt", F.lit(0))
return replicated.unionAll(non_skewed_with_salt).withColumn(
"salted_key",
F.concat(F.col("merchant_id"), F.lit("_"), F.col("salt"))
)
merchants_replicated = replicate_for_skewed(merchants, skewed_key, SALT_BUCKETS)
# Step 3: Join on salted key
result = transactions_salted.join(
merchants_replicated,
on="salted_key",
how="inner"
).drop("salt", "salted_key")
Why This Works:
- The skewed merchant's transactions are distributed across 10 partitions (salt 0-9)
- The merchant's reference data is replicated 10 times (one copy per salt bucket)
- Each executor now processes ~4% of transactions instead of 40%
- Non-skewed merchants are unaffected (salt = 0 for both sides)
Alternative Approaches:
- Broadcast join: If the right table is small (<100MB), broadcast it to all executors
- Adaptive Query Execution (AQE): Spark 3.0+ can detect and handle skew automatically
- Separate processing: Process the skewed key separately and union the results
- Pre-aggregation: If possible, aggregate the skewed key before joining
Problem 37: Design a Distributed Task Scheduler (Like Airflow)
Why this problem matters: Understanding how orchestrators work internally makes you better at using them and debugging them. This is a Staff-level DE system design question.
Architecture:
Web UI -> API Server -> Scheduler -> Executor Pool -> Workers
Components:
1. DAG Parser
- Read DAG definitions (Python files or YAML)
- Build dependency graph
- Validate: no cycles, all operators exist, correct parameters
2. Scheduler (the brain)
- Event loop: check for ready tasks every N seconds
- Task state machine: queued -> scheduled -> running -> success/failed
- Dependency resolution: task is ready when all upstream tasks succeeded
- Concurrency control: max active tasks per DAG, per pool, globally
3. Executor
- LocalExecutor: runs tasks as subprocesses (dev/small scale)
- CeleryExecutor: distributes tasks to Celery workers (production)
- KubernetesExecutor: spins up K8s pods per task (elastic scaling)
4. Metadata Database
- Task instance state (PostgreSQL)
- DAG run history
- XCom (cross-communication between tasks)
- Variable and connection storage
5. Worker
- Pulls task from queue
- Executes the operator (BashOperator, PythonOperator, etc.)
- Reports status back to metadata DB
- Handles heartbeat (so scheduler knows worker is alive)
Key Design Decisions:
- Pull vs. push model for task distribution
- Exactly-once execution guarantee (database lock on task instance)
- Retry logic (exponential backoff, max retries)
- Dead letter handling for permanently failed tasks
- Scheduler HA (active-passive or active-active with database locking)
Scaling Challenges:
| Scale | Challenge | Solution |
|---|---|---|
| 100 DAGs | Single scheduler is fine | No changes needed |
| 1,000 DAGs | Scheduler parsing becomes slow | DAG serialization, pre-parse to DB |
| 10,000 DAGs | Single scheduler bottleneck | Multiple schedulers with database locking |
| 100,000 tasks/day | Executor throughput | KubernetesExecutor with auto-scaling |
| 1M tasks/day | Metadata DB load | Database partitioning, read replicas, archive old runs |
Common SQL Window Function Patterns
A reference table for the most commonly tested window function patterns:
| Pattern | SQL | Use Case |
|---|---|---|
| Ranking | RANK() OVER (PARTITION BY dept ORDER BY salary DESC) | Top N per group |
| Dense ranking | DENSE_RANK() OVER (...) | Top N with ties |
| Row number | ROW_NUMBER() OVER (...) | Deduplication (keep first/last) |
| Running total | SUM(amount) OVER (ORDER BY date ROWS UNBOUNDED PRECEDING) | Cumulative metrics |
| Moving average | AVG(value) OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) | Smoothed metrics |
| Previous value | LAG(value, 1) OVER (PARTITION BY user ORDER BY date) | Change detection |
| Next value | LEAD(value, 1) OVER (PARTITION BY user ORDER BY date) | Forward-looking analysis |
| First/last value | FIRST_VALUE(value) OVER (PARTITION BY user ORDER BY date) | Session start/end |
| Percent rank | PERCENT_RANK() OVER (ORDER BY score) | Percentile computation |
| Ntile | NTILE(4) OVER (ORDER BY score) | Bucketing into quartiles |
Data Engineering Anti-Patterns
Patterns that seem reasonable but cause problems at scale:
| Anti-Pattern | Why It Seems Reasonable | Why It Fails | Better Approach |
|---|---|---|---|
| **SELECT *** | Convenient for ad-hoc queries | Wastes I/O on wide tables; breaks when schema changes | Explicitly list needed columns |
| DELETE + INSERT instead of MERGE | Simpler to implement | Not atomic; partial failures leave inconsistent state | Use MERGE/UPSERT for idempotency |
| Storing JSON strings in SQL columns | Flexible schema | Cannot index, filter, or join on nested fields efficiently | Use structured columns or a document store |
| Cron-only scheduling | No external dependencies | No dependency management, no retry, no monitoring | Use Airflow/Dagster/Prefect |
| Processing all data every run | Simpler logic | Costs grow linearly with data; wastes compute on old data | Incremental processing with watermarks |
| Single large table | Fewer joins | Slow queries, expensive scans, schema changes affect everything | Normalize into fact + dimension tables |
| No data contracts | Move fast, schema evolves | Downstream breakage when upstream changes schema | Schema registry + backward compatibility checks |
| Manual backfills | "We'll just re-run it" | Error-prone, inconsistent, takes days | Automated backfill orchestrator with validation |
Data Quality Framework
A structured approach to data quality that interviewers want to hear:
| Dimension | What It Measures | Example Check | Where to Check |
|---|---|---|---|
| Completeness | Are all expected records present? | Row count matches source; no missing dates | After ingestion |
| Accuracy | Are values correct? | Checksums match; spot-check against source | After transformation |
| Consistency | Do related values agree? | Order total = sum of line items; FK relationships valid | After join/aggregation |
| Freshness | Is data up-to-date? | Latest event timestamp within SLA | Continuous monitoring |
| Uniqueness | No duplicates? | Primary key is unique; no duplicate events | After deduplication step |
| Validity | Values within expected range? | Prices > 0; dates not in the future; enums are valid | At ingestion and after transform |
Implementation with Great Expectations or dbt tests:
# dbt schema.yml example
models:
- name: orders
columns:
- name: order_id
tests:
- unique
- not_null
- name: total_amount
tests:
- not_null
- dbt_utils.accepted_range:
min_value: 0
max_value: 1000000
- name: order_date
tests:
- not_null
- dbt_utils.accepted_range:
max_value: "current_date"
tests:
- dbt_utils.recency:
datepart: day
field: order_date
interval: 1
Recommended Resources
| Resource | Type | Best For |
|---|---|---|
| "Designing Data-Intensive Applications" (Kleppmann) | Book | System design fundamentals |
| "The Data Warehouse Toolkit" (Kimball) | Book | Dimensional modeling |
| DataLemur SQL problems | Practice | SQL interview prep |
| Spark: The Definitive Guide | Book | Spark internals |
| Confluent Kafka documentation | Docs | Streaming concepts |
| dbt documentation | Docs | Analytics engineering patterns |
| "Fundamentals of Data Engineering" (Reis & Housley) | Book | Modern DE practices |
| "Streaming Systems" (Akidau, Chernyak, Lax) | Book | Streaming theory and practice |
Next Steps
After completing the Data Engineer problem list, consider:
- Core 50 to fill gaps in general DSA and ML fundamentals
- Hard Tier if targeting Staff+ DE roles or FAANG
- Google-Style Problems for Google-specific preparation (heavy on distributed systems)
- Startup-Style Problems if interviewing at data-intensive startups
