Data Engineer - The Pipeline Builder

Reading time: ~20 min | Interview relevance: Critical | Roles: DE

The Real Interview Moment

You're in a system design round and the interviewer says: "Our ML team complains that they spend 60% of their time cleaning data instead of building models. Design a data platform that gives them clean, fresh, well-documented features they can use for both training and real-time serving."

This isn't an ML question. It's about building the data infrastructure that makes ML possible. Without clean data pipelines, feature stores, and quality monitoring, every ML model is built on a shaky foundation. Data Engineers are the unsung heroes of AI - you don't build the models, but nothing works without what you build.

What You Will Master

After reading this page, you will be able to:

  • Define the Data Engineer role and distinguish it from MLE, MLOps, and Analytics Engineer
  • Understand the DE interview loop and what each round evaluates
  • Identify core skills: SQL, distributed systems, pipeline orchestration, data modeling
  • Navigate career trajectories from junior DE to Staff/Principal
  • Evaluate whether DE is the right entry point into AI for you

Self-Assessment: Where Are You Now?

Skill Area | 1 (Weak) | 3 (Moderate) | 5 (Strong) | Your Rating
SQL | Basic SELECT | Window functions, CTEs | Query optimization, execution plans | ___
Python | Basic scripting | Data processing, APIs | Production pipelines, async | ___
Distributed systems (Spark, Kafka) | Never used | Basic Spark/Kafka | Optimize Spark jobs, design event systems | ___
Data modeling | No experience | Basic schemas | Dimensional modeling, SCDs | ___
Pipeline orchestration (Airflow, Dagster) | Never used | Built basic DAGs | Complex pipelines with retries, monitoring | ___
Cloud data services | No experience | Basic usage | Optimize storage, partitioning, cost | ___
Coding (DSA) | Can't solve Easy | Solve Medium in 30 min | Solve Medium-Hard consistently | ___
Data quality | No experience | Basic assertions | Data contracts, quality monitoring, lineage | ___

Part 1 - What a Data Engineer Actually Does

60-Second Answer

"A Data Engineer builds the data foundation that everything else runs on. I design and maintain pipelines that ingest data from dozens of sources, transform it into usable formats, and serve it to ML engineers for training, analysts for dashboards, and products for real-time features. My core concerns are reliability (data arrives on time), quality (data is accurate and complete), and scale (handling terabytes to petabytes efficiently). Without solid data engineering, ML models train on stale or incorrect data, analysts make decisions on wrong numbers, and products serve broken experiences."

How DE Fits in the AI Ecosystem

[Figure: DE Ecosystem Fit]

DE vs. Adjacent Roles

Dimension | Data Engineer | MLOps Engineer | Analytics Engineer
Builds | Data pipelines, warehouses | Model pipelines, serving infra | Transformed datasets, dashboards
Key tool | Spark, Airflow, Kafka | Kubeflow, MLflow, K8s | dbt, SQL, Looker
Cares about | Data freshness, quality, scale | Model freshness, serving latency | Analyst usability, metric definitions
ML knowledge | Light (understands data needs) | Moderate (understands pipeline) | Light (understands metrics)

Interviewer's Perspective

In DE interviews, I look for someone who thinks about data holistically: not just "can you write a pipeline" but "can you design a data platform that's reliable, scalable, well-documented, and easy for downstream consumers to use?" The best candidates talk about data contracts, lineage, freshness SLAs, and quality monitoring without being prompted.

Part 2 - The DE Interview Loop

Round | Duration | Focus
SQL Deep Dive | 45-60 min | Complex queries, optimization, window functions
Coding | 45-60 min | Python/DSA, data processing problems
System Design | 45-60 min | Data pipeline architecture, warehouse design
Data Modeling | 45-60 min | Schema design, dimensional modeling, trade-offs
Behavioral | 45-60 min | Data quality incidents, cross-team collaboration

Key Differences from Other Role Interviews

Aspect | DE | MLE | MLOps
SQL depth | 2 rounds, very deep | 0-1 rounds, basic | 0-1 rounds, basic
System design focus | Data pipeline + warehouse | ML system (rec, fraud) | ML platform (registry, serving)
Unique round | Data modeling | Paper discussion | Incident response

Company Variation

  • Google: Strong coding bar, distributed systems (MapReduce concepts). System design for data is key.
  • Meta: SQL-heavy (expect 2 SQL rounds). Data warehouse design is critical.
  • Netflix: "Analytics Engineer" title. dbt, SQL, data modeling.
  • Startups: "Build our data stack from scratch." Breadth over depth.
  • Finance: Data governance, compliance, lineage tracking. Regulatory requirements drive design.

Part 3 - Career Trajectory

[Figure: DE Career Ladder]

Transition Paths

From | Difficulty (to DE) | Advantages | Gaps
Backend SWE | 🟢 Easy | Coding, systems, databases | Distributed data systems, data modeling
Analyst / BI | 🟢 Easy | SQL, business context | Programming, distributed systems
DBA | 🟢 Easy | SQL, database internals | Python, cloud, pipeline orchestration
MLOps | 🟢 Easy | Pipeline thinking, infra skills | Data modeling, warehouse design

Instant Rejection

Never say: "I want to be a Data Engineer because I'm not good enough at math for ML." This frames DE as a consolation prize. Instead: "I'm drawn to Data Engineering because I love building reliable, scalable systems that other teams depend on. The ML team can only build great models if I give them great data - and solving that infrastructure challenge at scale is what excites me."

Practice Problems

Problem 1: Pipeline Design

Design a data pipeline that ingests clickstream events from a mobile app (10M events/day), enriches them with user profile data, computes daily aggregations, and loads them into a warehouse.

Hint 1 - Direction

Think about ingestion method (streaming vs. batch), storage format (Parquet), partitioning strategy (by date), and late-arriving events.

Full Answer + Rubric

Ingestion: Mobile app → Kafka (clickstream-raw). 10M events/day ≈ 116 events/sec on average (10,000,000 / 86,400 s) - use 4-8 Kafka partitions for parallelism.
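
The rate estimate in the ingestion step is simple arithmetic, sketched here in Python. The function names and the `peak_factor` of 5 are hypothetical assumptions for illustration, not figures from the problem statement; a real design would measure the actual peak-to-average ratio.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_events_per_second(events_per_day):
    """Average rate if events were spread evenly across the day."""
    return events_per_day / SECONDS_PER_DAY


def peak_events_per_second(events_per_day, peak_factor):
    """Peak rate under an assumed peak-to-average ratio (hypothetical input)."""
    return average_events_per_second(events_per_day) * peak_factor


avg = average_events_per_second(10_000_000)
peak = peak_events_per_second(10_000_000, peak_factor=5.0)  # assumed 5x peak
print(f"average: {avg:.1f} events/sec, assumed peak: {peak:.1f} events/sec")
```

Either way the volume is modest for Kafka, so the partition count is driven more by consumer parallelism than by raw throughput.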

Processing (two paths):

  1. Real-time enrichment (Flink): Join with user profiles from Redis. Write enriched events to S3 (Parquet, date-partitioned).
  2. Daily batch aggregation (Spark job scheduled by Airflow): Process yesterday's data at 2 AM UTC. Compute DAU, session metrics, and top pages by segment. Write to warehouse.
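
The two processing paths can be sketched in plain Python to show the logic, with a dictionary standing in for the Redis profile lookup and a list standing in for the event stream. The field names (`user_id`, `date`, `segment`) and both function names are assumptions made for this example; a production version would live in Flink and Spark.

```python
from collections import defaultdict


def enrich(event, profiles):
    """Attach the user's segment; unknown users get None (LEFT JOIN semantics)."""
    profile = profiles.get(event["user_id"], {})
    return {**event, "segment": profile.get("segment")}


def daily_active_users(events):
    """Count distinct users per (date, segment)."""
    users = defaultdict(set)
    for event in events:
        users[(event["date"], event["segment"])].add(event["user_id"])
    return {key: len(ids) for key, ids in users.items()}


# Hypothetical profile store and raw events.
profiles = {"u1": {"segment": "free"}, "u2": {"segment": "paid"}}
raw = [
    {"user_id": "u1", "date": "2024-01-01", "page": "/home"},
    {"user_id": "u1", "date": "2024-01-01", "page": "/cart"},
    {"user_id": "u2", "date": "2024-01-01", "page": "/home"},
    {"user_id": "u3", "date": "2024-01-01", "page": "/home"},  # no profile
]

enriched = [enrich(event, profiles) for event in raw]
dau = daily_active_users(enriched)
print(dau)
```

Note that the user without a profile is kept with a `None` segment rather than dropped, which keeps DAU totals honest and makes missing profiles visible downstream.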

Late arrivals: 48-hour reprocessing window. Rerun aggregation for previous 2 days nightly.
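
A minimal sketch of the reprocessing window, assuming daily partitions keyed by event date. The helper names and the in-memory `daily_counts` dictionary are hypothetical stand-ins for warehouse partitions. The key design point is that each run overwrites a partition instead of appending to it, so reruns are idempotent.

```python
from datetime import date, timedelta


def partitions_to_reprocess(run_date, window_days=2):
    """Event dates a nightly run recomputes: the full days before run_date, oldest first."""
    return [run_date - timedelta(days=offset) for offset in range(window_days, 0, -1)]


def nightly_run(daily_counts, events, run_date):
    """Recompute each date in the window and overwrite its result (idempotent)."""
    for day in partitions_to_reprocess(run_date):
        daily_counts[day] = sum(1 for e in events if e["event_date"] == day)


events = [{"event_date": date(2024, 3, 8)}, {"event_date": date(2024, 3, 9)}]
daily_counts = {}
nightly_run(daily_counts, events, run_date=date(2024, 3, 10))

# A late event for March 9 arrives a day later; the next run picks it up.
events.append({"event_date": date(2024, 3, 9)})
nightly_run(daily_counts, events, run_date=date(2024, 3, 11))
print(daily_counts)
```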

Data quality: Great Expectations checks at each stage - row counts, nulls, schema validation.
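
To make the three kinds of checks concrete, here is a plain-Python sketch of row-count, schema, and null-rate validation for one batch. It deliberately does not use the Great Expectations API; `check_batch` and its parameters are hypothetical names chosen for this example.

```python
def check_batch(rows, expected_columns, min_rows, max_null_rate):
    """Return a list of failure messages; an empty list means the batch passed."""
    failures = []

    # Row count: catch empty or truncated loads.
    if len(rows) < min_rows:
        failures.append(f"row count {len(rows)} < minimum {min_rows}")

    # Schema: every row must carry exactly the expected columns.
    expected = set(expected_columns)
    bad_rows = [i for i, row in enumerate(rows) if set(row) != expected]
    if bad_rows:
        failures.append(f"schema mismatch in {len(bad_rows)} row(s), first at index {bad_rows[0]}")

    # Nulls: each column's null rate must stay within its allowed maximum.
    for column in expected_columns:
        limit = max_null_rate.get(column, 0.0)
        nulls = sum(1 for row in rows if row.get(column) is None)
        rate = nulls / len(rows) if rows else 0.0
        if rate > limit:
            failures.append(f"{column}: null rate {rate:.2f} > allowed {limit:.2f}")

    return failures


good = [{"user_id": "u1", "country": "US"}, {"user_id": "u2", "country": "DE"}]
bad = [{"user_id": "u1", "country": None}, {"user_id": "u2"}]

good_result = check_batch(good, ["user_id", "country"], min_rows=2, max_null_rate={"country": 0.05})
bad_result = check_batch(bad, ["user_id", "country"], min_rows=3, max_null_rate={"country": 0.05})
print(good_result)
print(bad_result)
```

Running the same function after ingestion, after enrichment, and after the warehouse load is what "checks at each stage" amounts to in practice.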

Scoring:

  • Strong Hire: Both real-time and batch paths, handles late arrivals, includes quality checks
  • Lean Hire: Reasonable architecture but misses late arrivals or quality
  • No Hire: Just "put events in S3"

Problem 2: Data Quality Investigation

5% of user records have NULL country, but the mobile app always collects location. What's happening?

Hint 1 - Direction

Investigate each pipeline stage: app-side (permissions), ingestion (parsing), transformation (join failures), loading (schema mismatch).

Full Answer + Rubric

Systematic investigation:

  1. App-side: Are users denying location permissions? Check whether the 5% matches the permission denial rate. If so, use IP geolocation as a fallback.
  2. Ingestion: Check the raw Kafka events. If country is present in the raw events but NULL in the warehouse, the issue is downstream (transformation or loading).
  3. Transformation: Check the enrichment join. A LEFT JOIN on user_id yields NULLs for users with missing profiles.
  4. Schema evolution: Did the app schema change? If the field was renamed (old: country, new: user_country), a pipeline still reading country gets NULLs.
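
The stage-by-stage investigation can be partly automated by measuring the NULL rate of the column at each stage and reporting the first stage where it jumps. This is a sketch under assumptions: the stage names and sample rows are hypothetical snapshots of the same records, and the function names are invented for the example.

```python
def null_rate(rows, column):
    """Fraction of rows where the column is missing or None."""
    if not rows:
        return 0.0
    return sum(1 for row in rows if row.get(column) is None) / len(rows)


def first_stage_over_threshold(stages, column, threshold):
    """Walk the stages in pipeline order; return the first one exceeding the threshold."""
    for name, rows in stages:
        if null_rate(rows, column) > threshold:
            return name
    return None


# Hypothetical samples of the same records captured at each stage.
raw = [{"user_id": "u1", "country": "US"}, {"user_id": "u2", "country": "DE"}]
enriched = [{"user_id": "u1", "country": "US"}, {"user_id": "u2", "country": None}]
warehouse = [{"user_id": "u1", "country": "US"}, {"user_id": "u2", "country": None}]

stages = [("raw", raw), ("enriched", enriched), ("warehouse", warehouse)]
culprit = first_stage_over_threshold(stages, "country", threshold=0.03)
print(culprit)
```

Here the raw events are clean and the NULLs first appear after enrichment, which points at the join or a field rename rather than at the app.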

Prevention: Data contracts between the app team and the data team. Automated check: expect_column_values_to_not_be_null(column="country", mostly=0.97).
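
The `mostly` threshold means the check passes when at least that fraction of values is non-null. The snippet below is a plain-Python re-implementation of that idea for illustration, not the Great Expectations call itself; `not_null_with_tolerance` is a hypothetical name.

```python
def not_null_with_tolerance(values, mostly):
    """Pass when at least `mostly` (a fraction from 0 to 1) of the values are non-null."""
    if not values:
        return True  # assumption: an empty batch is handled by a separate row-count check
    non_null = sum(1 for value in values if value is not None)
    return non_null / len(values) >= mostly


todays_batch = ["US"] * 95 + [None] * 5    # the 5% NULL rate from the problem
healthy_batch = ["US"] * 98 + [None] * 2

print(not_null_with_tolerance(todays_batch, mostly=0.97))
print(not_null_with_tolerance(healthy_batch, mostly=0.97))
```

With the threshold at 0.97, the 5% NULL rate from this problem would have failed the check and raised an alert before anyone noticed it in a dashboard.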

Scoring:

  • Strong Hire: Systematic diagnosis across stages, proposes data contracts
  • Lean Hire: Identifies one cause but not systematic
  • No Hire: "Fill NULLs with a default value"

Interview Cheat Sheet

Question Pattern | Framework | Key Phrases
"Design a data pipeline" | Ingest → Store → Transform → Serve → Monitor | "I'd start by understanding the SLA: how fresh and what failure rate?"
"Ensure data quality" | Contracts → Validation → Monitoring → Alerting → Remediation | "Data quality is a first-class concern with automated checks at every stage"
"Design a data model" | Entities → Relationships → Grain → Dimensions → Facts | "The key decision is the grain - what does one row represent?"
"Handle schema evolution" | Registry → Compatibility → Migration | "I enforce backward-compatible changes with a schema registry"

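As an illustration of the schema evolution row, here is a simplified backward-compatibility check. It assumes a toy schema format (field name mapped to a type and an optional default) and just two rules: added fields need a default, and kept fields keep their type. Real schema registries apply more rules, so treat `backward_compatibility_issues` as a hypothetical sketch rather than any registry's actual behavior.

```python
def backward_compatibility_issues(old_fields, new_fields):
    """List reasons the new schema could not read records written with the old one."""
    issues = []
    for name, spec in new_fields.items():
        if name not in old_fields:
            # Old records lack this field, so the reader needs a default to fill in.
            if "default" not in spec:
                issues.append(f"field '{name}' added without a default")
        elif spec["type"] != old_fields[name]["type"]:
            issues.append(
                f"field '{name}' changed type from {old_fields[name]['type']} to {spec['type']}"
            )
    return issues


old = {"user_id": {"type": "string"}, "country": {"type": "string"}}

# A rename looks like a removal plus an addition without a default.
renamed = {"user_id": {"type": "string"}, "user_country": {"type": "string"}}

# Adding an optional field with a default is safe under these rules.
extended = {**old, "device": {"type": "string", "default": None}}

print(backward_compatibility_issues(old, renamed))
print(backward_compatibility_issues(old, extended))
```

A check like this, run in CI against the registered schema, would reject a country to user_country rename before it reaches production.
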
Spaced Repetition Checkpoints

  • Day 0: Read this page. Assess your SQL and pipeline skills.
  • Day 3: Solve 3 advanced SQL problems (window functions, self-joins, recursive CTEs).
  • Day 7: Design a data pipeline on a whiteboard: ingestion, transformation, storage, quality.
  • Day 14: Design a star schema for an e-commerce company.
  • Day 21: Mock system design: "Design the data platform for a ride-sharing company."

© 2026 EngineersOfAI. All rights reserved.