Data Engineer - The Pipeline Builder

Reading time: ~20 min | Interview relevance: Critical | Roles: DE

The Real Interview Moment

You're in a system design round and the interviewer says: "Our ML team complains that they spend 60% of their time cleaning data instead of building models. Design a data platform that gives them clean, fresh, well-documented features they can use for both training and real-time serving."

This isn't an ML question. It's about building the data infrastructure that makes ML possible. Without clean data pipelines, feature stores, and quality monitoring, every ML model is built on a shaky foundation. Data Engineers are the unsung heroes of AI - you don't build the models, but nothing works without what you build.

What You Will Master

After reading this page, you will be able to:

Define the Data Engineer role and distinguish it from MLE, MLOps, and Analytics Engineer
Understand the DE interview loop and what each round evaluates
Identify core skills: SQL, distributed systems, pipeline orchestration, data modeling
Navigate career trajectories from junior DE to Staff/Principal
Evaluate whether DE is the right entry point into AI for you

Self-Assessment: Where Are You Now?

Skill Area	1 (Weak)	3 (Moderate)	5 (Strong)	Your Rating
SQL	Basic SELECT	Window functions, CTEs	Query optimization, execution plans	___
Python	Basic scripting	Data processing, APIs	Production pipelines, async	___
Distributed systems (Spark, Kafka)	Never used	Basic Spark/Kafka	Optimize Spark jobs, design event systems	___
Data modeling	No experience	Basic schemas	Dimensional modeling, SCDs	___
Pipeline orchestration (Airflow, Dagster)	Never used	Built basic DAGs	Complex pipelines with retries, monitoring	___
Cloud data services	No experience	Basic usage	Optimize storage, partitioning, cost	___
Coding (DSA)	Can't solve Easy	Solve Medium in 30 min	Solve Medium-Hard consistently	___
Data quality	No experience	Basic assertions	Data contracts, quality monitoring, lineage	___

Part 1 - What a Data Engineer Actually Does

60-Second Answer

"A Data Engineer builds the data foundation that everything else runs on. I design and maintain pipelines that ingest data from dozens of sources, transform it into usable formats, and serve it to ML engineers for training, analysts for dashboards, and products for real-time features. My core concerns are reliability (data arrives on time), quality (data is accurate and complete), and scale (handling terabytes to petabytes efficiently). Without solid data engineering, ML models train on stale or incorrect data, analysts make decisions on wrong numbers, and products serve broken experiences."

How DE Fits in the AI Ecosystem

Figure: DE Ecosystem Fit

DE vs. Adjacent Roles

Dimension	Data Engineer	MLOps Engineer	Analytics Engineer
Builds	Data pipelines, warehouses	Model pipelines, serving infra	Transformed datasets, dashboards
Key tool	Spark, Airflow, Kafka	Kubeflow, MLflow, K8s	dbt, SQL, Looker
Cares about	Data freshness, quality, scale	Model freshness, serving latency	Analyst usability, metric definitions
ML knowledge	Light (understands data needs)	Moderate (understands pipeline)	Light (understands metrics)

Interviewer's Perspective

In DE interviews, I look for someone who thinks about data holistically: not just "can you write a pipeline" but "can you design a data platform that's reliable, scalable, well-documented, and easy for downstream consumers to use?" The best candidates talk about data contracts, lineage, freshness SLAs, and quality monitoring without being prompted.

Part 2 - The DE Interview Loop

Round	Duration	Focus
SQL Deep Dive	45-60 min	Complex queries, optimization, window functions
Coding	45-60 min	Python/DSA, data processing problems
System Design	45-60 min	Data pipeline architecture, warehouse design
Data Modeling	45-60 min	Schema design, dimensional modeling, trade-offs
Behavioral	45-60 min	Data quality incidents, cross-team collaboration

Key Differences from Other Role Interviews

Aspect	DE	MLE	MLOps
SQL depth	1-2 rounds, very deep	0-1 rounds, basic	0-1 rounds, basic
System design focus	Data pipeline + warehouse	ML system (rec, fraud)	ML platform (registry, serving)
Unique round	Data modeling	Paper discussion	Incident response

Company Variation

Google: Strong coding bar, distributed systems (MapReduce concepts). System design for data is key.
Meta: SQL-heavy (expect 2 SQL rounds). Data warehouse design is critical.
Netflix: "Analytics Engineer" title. dbt, SQL, data modeling.
Startups: "Build our data stack from scratch." Breadth over depth.
Finance: Data governance, compliance, lineage tracking. Regulatory requirements drive design.

Part 3 - Career Trajectory

Figure: DE Career Ladder

Transition Paths

From	Difficulty (to DE)	Advantages	Gaps to Close
Backend SWE	🟢 Easy	Coding, systems, databases	Distributed data systems, data modeling
Analyst / BI	🟢 Easy	SQL, business context	Programming, distributed systems
DBA	🟢 Easy	SQL, database internals	Python, cloud, pipeline orchestration
MLOps	🟢 Easy	Pipeline thinking, infra skills	Data modeling, warehouse design

Instant Rejection

Never say: "I want to be a Data Engineer because I'm not good enough at math for ML." This frames DE as a consolation prize. Instead: "I'm drawn to Data Engineering because I love building reliable, scalable systems that other teams depend on. The ML team can only build great models if I give them great data - and solving that infrastructure challenge at scale is what excites me."

Practice Problems

Problem 1: Pipeline Design

Design a data pipeline that ingests clickstream events from a mobile app (10M events/day), enriches them with user profile data, computes daily aggregations, and loads them into a warehouse.

Hint 1 - Direction

Think about ingestion method (streaming vs. batch), storage format (Parquet), partitioning strategy (by date), and late-arriving events.

Full Answer + Rubric

Ingestion: Mobile app → Kafka (clickstream-raw). 10M/day ≈ 116 events/sec on average, with peaks well above that - use 4-8 Kafka partitions for parallelism.
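The sizing arithmetic can be checked with a short sketch. The peak factor and per-consumer throughput below are assumed placeholder values, not Kafka limits and not figures from this page; measure both for your own workload.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_events_per_second(events_per_day: int) -> float:
    """Average arrival rate, assuming traffic were spread evenly over the day."""
    return events_per_day / SECONDS_PER_DAY


def partitions_needed(events_per_day: int, peak_factor: float,
                      per_consumer_events_per_sec: float) -> int:
    """Partitions required so that one consumer per partition keeps up at peak.

    peak_factor and per_consumer_events_per_sec are assumed inputs that you
    would measure for your own workload; they are not Kafka limits.
    """
    peak_rate = average_events_per_second(events_per_day) * peak_factor
    return max(1, math.ceil(peak_rate / per_consumer_events_per_sec))


avg = average_events_per_second(10_000_000)
print(f"average: {avg:.1f} events/sec")  # average: 115.7 events/sec

# Assumed: peak traffic is 10x the average, one consumer sustains 250 events/sec.
print(partitions_needed(10_000_000, peak_factor=10.0,
                        per_consumer_events_per_sec=250.0))  # 5
```

In an interview, stating the assumptions out loud (peak factor, consumer throughput) matters more than the exact partition count.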

Processing (two paths):

Real-time enrichment (Flink): Join with user profiles from Redis. Write enriched events to S3 (Parquet, date-partitioned).
Daily batch aggregation (Spark on Airflow): Process yesterday's data at 2 AM UTC. Compute DAU, session metrics, top pages by segment. Write to warehouse.
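Date partitioning for the enriched events might look like the sketch below. The bucket name, the `dt=` key, and the choice to partition by UTC event time (not arrival time) are all assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def partition_prefix(base: str, event_time: datetime) -> str:
    """Build a Hive-style date partition prefix from a timezone-aware event time."""
    if event_time.tzinfo is None:
        raise ValueError("event_time must be timezone-aware")
    # Normalize to UTC so the same event always lands in the same partition.
    day = event_time.astimezone(timezone.utc).date().isoformat()
    return f"{base.rstrip('/')}/dt={day}/"


# 23:30 at UTC-5 on March 9 is 04:30 UTC on March 10.
local_time = datetime(2024, 3, 9, 23, 30, tzinfo=timezone(timedelta(hours=-5)))
print(partition_prefix("s3://example-bucket/clickstream-enriched", local_time))
# s3://example-bucket/clickstream-enriched/dt=2024-03-10/
```

Partitioning by event time keeps the daily batch job's input well defined, at the cost of having to revisit older partitions when events arrive late.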

Late arrivals: 48-hour reprocessing window. Rerun aggregation for previous 2 days nightly.
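A minimal sketch of the reprocessing window, assuming events are simple `(user_id, event_time)` tuples and that each nightly run overwrites its target days in full. The function names are hypothetical.

```python
from collections import defaultdict
from datetime import date, datetime, timedelta


def partitions_to_reprocess(run_date: date, lookback_days: int = 2) -> list:
    """Days a nightly run should (re)aggregate, oldest first."""
    return [run_date - timedelta(days=offset)
            for offset in range(lookback_days, 0, -1)]


def daily_active_users(events, target_days) -> dict:
    """Count distinct users per day, bucketing by event time, not arrival time."""
    users_by_day = defaultdict(set)
    wanted = set(target_days)
    for user_id, event_time in events:
        day = event_time.date()
        if day in wanted:
            users_by_day[day].add(user_id)
    return {day: len(users_by_day[day]) for day in target_days}


events = [
    ("u1", datetime(2024, 3, 8, 9, 0)),
    ("u2", datetime(2024, 3, 9, 14, 0)),
    ("u3", datetime(2024, 3, 8, 22, 0)),  # late arrival: delivered on March 10
]
days = partitions_to_reprocess(date(2024, 3, 10))
print(daily_active_users(events, days))
# March 8 -> 2 users (u1 and the late u3), March 9 -> 1 user (u2)
```

Because each run recomputes whole days and overwrites the previous result, rerunning it is idempotent: the late `u3` event simply corrects the March 8 count.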

Data quality: Great Expectations checks at each stage - row counts, nulls, schema validation.
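The three kinds of check named above can be illustrated without any library. This is a plain-Python sketch of the idea, not the Great Expectations API; the schema, thresholds, and function name are hypothetical.

```python
EXPECTED_SCHEMA = {"user_id": str, "event_type": str, "country": str}


def validate_batch(rows, schema, min_rows=1, max_null_fraction=0.03):
    """Return a list of human-readable failures; an empty list means the batch passed."""
    failures = []
    # Row count check: catches empty or truncated loads.
    if len(rows) < min_rows:
        failures.append(f"row_count: {len(rows)} below minimum {min_rows}")
    if not rows:
        return failures
    for column, expected_type in schema.items():
        # Schema check: every row must carry the column with the expected type.
        missing = sum(1 for row in rows if column not in row)
        if missing:
            failures.append(f"schema: column '{column}' missing in {missing} rows")
        values = [row.get(column) for row in rows]
        wrong_type = sum(1 for v in values
                         if v is not None and not isinstance(v, expected_type))
        if wrong_type:
            failures.append(f"schema: column '{column}' has {wrong_type} values of the wrong type")
        # Null check: a missing column also counts as null here.
        null_fraction = sum(v is None for v in values) / len(rows)
        if null_fraction > max_null_fraction:
            failures.append(f"nulls: column '{column}' is {null_fraction:.1%} null "
                            f"(limit {max_null_fraction:.1%})")
    return failures


good = [{"user_id": "u1", "event_type": "click", "country": "DE"}]
bad = good + [{"user_id": "u2", "event_type": "view", "country": None}]
print(validate_batch(good, EXPECTED_SCHEMA))  # []
print(validate_batch(bad, EXPECTED_SCHEMA))
```

Running the same function after ingestion, after enrichment, and before the warehouse load is what "checks at each stage" means in practice.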

Scoring:

Strong Hire: Both real-time and batch paths, handles late arrivals, includes quality checks
Lean Hire: Reasonable architecture but misses late arrivals or quality
No Hire: Just "put events in S3"

Problem 2: Data Quality Investigation

5% of user records have NULL country, but the mobile app always collects location. What's happening?

Hint 1 - Direction

Investigate each pipeline stage: app-side (permissions), ingestion (parsing), transformation (join failures), loading (schema mismatch).

Full Answer + Rubric

Systematic investigation:

App-side: Users deny location permissions? Check if 5% matches denial rate. Use IP geolocation as fallback.
Ingestion: Check raw Kafka events. If country present in raw but NULL in warehouse → transformation issue.
Transformation: Check enrichment join. LEFT JOIN on user_id with missing profiles → NULLs.
Schema evolution: App schema changed? Old: country, New: user_country. Schema mismatch.
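The stage-by-stage diagnosis above can be sketched as a comparison of NULL rates between consecutive stages. The stage names, sample records, and the deliberately naive enrichment step are hypothetical; the point is only to show how a LEFT JOIN against incomplete profiles introduces NULLs that were not in the raw data.

```python
def null_rate(records, field):
    """Fraction of records where the field is missing or None."""
    if not records:
        return 0.0
    return sum(1 for r in records if r.get(field) is None) / len(records)


def first_degraded_stage(stages, field, tolerance=0.01):
    """Name of the first stage whose NULL rate rises by more than tolerance.

    stages is an ordered list of (name, records). The baseline before the
    first stage is 0.0, so NULLs already present in the raw data point at
    the source (for example, denied location permissions).
    """
    previous = 0.0
    for name, records in stages:
        rate = null_rate(records, field)
        if rate - previous > tolerance:
            return name
        previous = rate
    return None


def left_join_country(events, profiles):
    """Mimic a LEFT JOIN on user_id: unmatched events keep the row, lose country."""
    return [{**e, "country": profiles.get(e["user_id"], {}).get("country")}
            for e in events]


raw = [{"user_id": "u1", "country": "US"},
       {"user_id": "u2", "country": "IN"},
       {"user_id": "u3", "country": "BR"}]
profiles = {"u1": {"country": "US"}, "u2": {"country": "IN"}}  # u3 has no profile
enriched = left_join_country(raw, profiles)

stages = [("raw", raw), ("enriched", enriched), ("warehouse", enriched)]
print(first_degraded_stage(stages, "country"))  # enriched
```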

Prevention: Data contracts between app team and data team. Automated check: expect_column_values_to_not_be_null("country", mostly=0.97), which fails the batch when fewer than 97% of rows have a non-null country.
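A data contract can be as small as an agreed list of fields with a minimum non-null fraction for each. The sketch below is plain Python written for illustration, not a Great Expectations call; the contract format and the `user_id` threshold are assumptions, while the 0.97 threshold for `country` comes from the check above.

```python
# Field -> minimum fraction of records that must carry a non-null value.
CONTRACT = {"user_id": 1.0, "country": 0.97}


def contract_violations(records, contract):
    """Return {field: observed_non_null_fraction} for every field below its threshold."""
    violations = {}
    total = len(records)
    for field, mostly in contract.items():
        present = sum(1 for r in records if r.get(field) is not None)
        fraction = present / total if total else 0.0
        if fraction < mostly:
            violations[field] = fraction
    return violations


# The scenario from this problem: 5% of records have a NULL country.
five_percent_null = ([{"user_id": f"u{i}", "country": "US"} for i in range(95)]
                     + [{"user_id": f"n{i}", "country": None} for i in range(5)])
print(contract_violations(five_percent_null, CONTRACT))  # {'country': 0.95}

# A producer-side rename (country -> user_country) breaks the contract outright.
renamed = [{"user_id": "u1", "user_country": "US"}]
print(contract_violations(renamed, CONTRACT))  # {'country': 0.0}
```

Running this check in the producer's CI as well as in the pipeline means the rename is caught before it ships, not after 5% of the warehouse is NULL.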

Scoring:

Strong Hire: Systematic diagnosis across stages, proposes data contracts
Lean Hire: Identifies one cause but not systematic
No Hire: "Fill NULLs with a default value"

Interview Cheat Sheet

Question Pattern	Framework	Key Phrases
"Design a data pipeline"	Ingest → Store → Transform → Serve → Monitor	"I'd start by understanding the SLA: how fresh must the data be, and what failure rate is acceptable?"
"Ensure data quality"	Contracts → Validation → Monitoring → Alerting → Remediation	"Data quality is a first-class concern with automated checks at every stage"
"Design a data model"	Entities → Relationships → Grain → Dimensions → Facts	"The key decision is the grain - what does one row represent?"
"Handle schema evolution"	Registry → Compatibility → Migration	"I enforce backward-compatible changes with a schema registry"

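To make "backward-compatible changes" concrete, here is a simplified compatibility check of the kind a schema registry runs before accepting a new schema version. The rules are a deliberately reduced assumption (a reader on the new schema must be able to read old data: new fields need defaults, shared fields keep their type); real registries apply more detailed rules, such as type promotion.

```python
def backward_compatibility_problems(old_schema, new_schema):
    """List reasons the new schema cannot read data written with the old one.

    Schemas are dicts of field name -> {"type": str, optional "default": value}.
    Removing a field is allowed under this simplified rule set.
    """
    problems = []
    for name, spec in new_schema.items():
        if name not in old_schema:
            # Old records lack this field, so the reader needs a default to fill in.
            if "default" not in spec:
                problems.append(f"new field '{name}' has no default")
        elif old_schema[name]["type"] != spec["type"]:
            problems.append(
                f"field '{name}' changed type from "
                f"{old_schema[name]['type']} to {spec['type']}")
    return problems


v1 = {"user_id": {"type": "string"}, "country": {"type": "string"}}

# Renaming country -> user_country looks like "drop one field, add another".
v2_rename = {"user_id": {"type": "string"}, "user_country": {"type": "string"}}
print(backward_compatibility_problems(v1, v2_rename))
# ["new field 'user_country' has no default"]

# Adding an optional field with a default is accepted.
v2_additive = {**v1, "locale": {"type": "string", "default": "unknown"}}
print(backward_compatibility_problems(v1, v2_additive))  # []
```

The rename case is the same `country` to `user_country` change described in Problem 2: a compatibility gate would have rejected it before any NULLs reached the warehouse.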
Spaced Repetition Checkpoints

Day 0: Read this page. Assess your SQL and pipeline skills.
Day 3: Solve 3 advanced SQL problems (window functions, self-joins, recursive CTEs).
Day 7: Design a data pipeline on a whiteboard: ingestion, transformation, storage, quality.
Day 14: Design a star schema for an e-commerce company.
Day 21: Mock system design: "Design the data platform for a ride-sharing company."

What's Next

If DE is your target → The Interview Process
Compare → MLOps or MLE
Coding prep → Coding Interviews
Salary context → Salary Bands
