Data Engineer - The Pipeline Builder
Reading time: ~20 min | Interview relevance: Critical | Roles: DE
The Real Interview Moment
You're in a system design round and the interviewer says: "Our ML team complains that they spend 60% of their time cleaning data instead of building models. Design a data platform that gives them clean, fresh, well-documented features they can use for both training and real-time serving."
This isn't an ML question. It's about building the data infrastructure that makes ML possible. Without clean data pipelines, feature stores, and quality monitoring, every ML model is built on a shaky foundation. Data Engineers are the unsung heroes of AI - you don't build the models, but nothing works without what you build.
What You Will Master
After reading this page, you will be able to:
- Define the Data Engineer role and distinguish it from MLE, MLOps, and Analytics Engineer
- Understand the DE interview loop and what each round evaluates
- Identify core skills: SQL, distributed systems, pipeline orchestration, data modeling
- Navigate career trajectories from junior DE to Staff/Principal
- Evaluate whether DE is the right entry point into AI for you
Self-Assessment: Where Are You Now?
| Skill Area | 1 (Weak) | 3 (Moderate) | 5 (Strong) | Your Rating |
|---|---|---|---|---|
| SQL | Basic SELECT | Window functions, CTEs | Query optimization, execution plans | ___ |
| Python | Basic scripting | Data processing, APIs | Production pipelines, async | ___ |
| Distributed systems (Spark, Kafka) | Never used | Basic Spark/Kafka | Optimize Spark jobs, design event systems | ___ |
| Data modeling | No experience | Basic schemas | Dimensional modeling, SCDs | ___ |
| Pipeline orchestration (Airflow, Dagster) | Never used | Built basic DAGs | Complex pipelines with retries, monitoring | ___ |
| Cloud data services | No experience | Basic usage | Optimize storage, partitioning, cost | ___ |
| Coding (DSA) | Can't solve Easy | Solve Medium in 30 min | Solve Medium-Hard consistently | ___ |
| Data quality | No experience | Basic assertions | Data contracts, quality monitoring, lineage | ___ |
Part 1 - What a Data Engineer Actually Does
"A Data Engineer builds the data foundation that everything else runs on. I design and maintain pipelines that ingest data from dozens of sources, transform it into usable formats, and serve it to ML engineers for training, analysts for dashboards, and products for real-time features. My core concerns are reliability (data arrives on time), quality (data is accurate and complete), and scale (handling terabytes to petabytes efficiently). Without solid data engineering, ML models train on stale or incorrect data, analysts make decisions on wrong numbers, and products serve broken experiences."
How DE Fits in the AI Ecosystem
DE vs. Adjacent Roles
| Dimension | Data Engineer | MLOps Engineer | Analytics Engineer |
|---|---|---|---|
| Builds | Data pipelines, warehouses | Model pipelines, serving infra | Transformed datasets, dashboards |
| Key tool | Spark, Airflow, Kafka | Kubeflow, MLflow, K8s | dbt, SQL, Looker |
| Cares about | Data freshness, quality, scale | Model freshness, serving latency | Analyst usability, metric definitions |
| ML knowledge | Light (understands data needs) | Moderate (understands pipeline) | Light (understands metrics) |
In DE interviews, I look for someone who thinks about data holistically: not just "can you write a pipeline" but "can you design a data platform that's reliable, scalable, well-documented, and easy for downstream consumers to use?" The best candidates talk about data contracts, lineage, freshness SLAs, and quality monitoring without being prompted.
Part 2 - The DE Interview Loop
| Round | Duration | Focus |
|---|---|---|
| SQL Deep Dive | 45-60 min | Complex queries, optimization, window functions |
| Coding | 45-60 min | Python/DSA, data processing problems |
| System Design | 45-60 min | Data pipeline architecture, warehouse design |
| Data Modeling | 45-60 min | Schema design, dimensional modeling, trade-offs |
| Behavioral | 45-60 min | Data quality incidents, cross-team collaboration |
Key Differences from Other Role Interviews
| Aspect | DE | MLE | MLOps |
|---|---|---|---|
| SQL depth | 1-2 rounds, very deep | 0-1 rounds, basic | 0-1 rounds, basic |
| System design focus | Data pipeline + warehouse | ML system (rec, fraud) | ML platform (registry, serving) |
| Unique round | Data modeling | Paper discussion | Incident response |
- Google: Strong coding bar, distributed systems (MapReduce concepts). System design for data is key.
- Meta: SQL-heavy (expect 2 SQL rounds). Data warehouse design is critical.
- Netflix: "Analytics Engineer" title. dbt, SQL, data modeling.
- Startups: "Build our data stack from scratch." Breadth over depth.
- Finance: Data governance, compliance, lineage tracking. Regulatory requirements drive design.
Part 3 - Career Trajectory
Transition Paths
| From (to DE) | Difficulty | Advantages | Gaps |
|---|---|---|---|
| Backend SWE | 🟢 Easy | Coding, systems, databases | Distributed data systems, data modeling |
| Analyst / BI | 🟢 Easy | SQL, business context | Programming, distributed systems |
| DBA | 🟢 Easy | SQL, database internals | Python, cloud, pipeline orchestration |
| MLOps | 🟢 Easy | Pipeline thinking, infra skills | Data modeling, warehouse design |
Never say: "I want to be a Data Engineer because I'm not good enough at math for ML." This frames DE as a consolation prize. Instead: "I'm drawn to Data Engineering because I love building reliable, scalable systems that other teams depend on. The ML team can only build great models if I give them great data - and solving that infrastructure challenge at scale is what excites me."
Practice Problems
Problem 1: Pipeline Design
Design a data pipeline that ingests clickstream events from a mobile app (10M events/day), enriches them with user profile data, computes daily aggregations, and loads them into a warehouse.
Hint 1 - Direction
Think about ingestion method (streaming vs. batch), storage format (Parquet), partitioning strategy (by date), and late-arriving events.
Full Answer + Rubric
Ingestion: Mobile app → Kafka (clickstream-raw). 10M/day ≈ 116 events/sec on average - use 4-8 Kafka partitions for parallelism and headroom at peak.
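A quick sanity check of that sizing arithmetic, as a minimal sketch. The 5x peak factor is an assumption for illustration (the problem only gives a daily total), and the helper names are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_events_per_second(events_per_day: int) -> float:
    """Average rate implied by a daily event count."""
    return events_per_day / SECONDS_PER_DAY


def per_partition_rate(events_per_day: int, partitions: int, peak_factor: float = 1.0) -> float:
    """Events/sec each partition handles if load is spread evenly.

    peak_factor is an assumed multiplier for traffic spikes,
    not a value taken from the problem statement.
    """
    return average_events_per_second(events_per_day) * peak_factor / partitions


avg_rate = average_events_per_second(10_000_000)
print(f"average: {avg_rate:.1f} events/sec")
for partitions in (4, 8):
    rate = per_partition_rate(10_000_000, partitions, peak_factor=5.0)
    print(f"{partitions} partitions at 5x peak: {rate:.1f} events/sec each")
```

Even under the assumed peak, each partition sees a small load, so at this volume the partition count is driven more by consumer parallelism than by raw throughput.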
Processing (two paths):
- Real-time enrichment (Flink): Join with user profiles from Redis. Write enriched events to S3 (Parquet, date-partitioned).
- Daily batch aggregation (Spark on Airflow): Process yesterday's data at 2 AM UTC. Compute DAU, session metrics, top pages by segment. Write to warehouse.
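The two paths above can be sketched in plain Python to make the logic concrete. This is an illustration only, not Flink or Spark code: the PROFILES dict is a hypothetical stand-in for the Redis lookup, and the event fields (user_id, event_time, segment) are assumed names.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical stand-in for the Redis profile store: user_id -> profile fields.
PROFILES = {
    "u1": {"segment": "free"},
    "u2": {"segment": "paid"},
}


def enrich(event, profiles):
    """Attach the user's segment; users without a profile get segment=None."""
    profile = profiles.get(event["user_id"], {})
    return {**event, "segment": profile.get("segment")}


def daily_active_users(enriched_events):
    """Count distinct users per (event date, segment).

    Timestamps are assumed to already be in UTC.
    """
    users = defaultdict(set)
    for event in enriched_events:
        day = datetime.fromisoformat(event["event_time"]).date().isoformat()
        users[(day, event["segment"])].add(event["user_id"])
    return {key: len(ids) for key, ids in users.items()}


raw_events = [
    {"user_id": "u1", "event_time": "2024-01-01T10:00:00+00:00"},
    {"user_id": "u1", "event_time": "2024-01-01T11:30:00+00:00"},
    {"user_id": "u2", "event_time": "2024-01-01T12:00:00+00:00"},
    {"user_id": "u3", "event_time": "2024-01-02T09:00:00+00:00"},  # no profile
]
enriched = [enrich(e, PROFILES) for e in raw_events]
dau = daily_active_users(enriched)
print(dau)
```

Note that u3 has no profile and ends up with segment=None. In an interview, call out how you would surface such enrichment misses instead of silently dropping or mislabeling them.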
Late arrivals: 48-hour reprocessing window. Rerun aggregation for previous 2 days nightly.
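A minimal sketch of the reprocessing window, assuming events carry an event_date string and that the aggregate for each day is overwritten on every rerun (which keeps reruns idempotent). The function names and the dict standing in for the warehouse are hypothetical.

```python
from datetime import date, timedelta


def partitions_to_reprocess(run_date, lookback_days=2):
    """Date partitions covered by a nightly run: the previous `lookback_days` days."""
    return [
        (run_date - timedelta(days=offset)).isoformat()
        for offset in range(lookback_days, 0, -1)
    ]


def rebuild_daily_counts(events, partitions, warehouse):
    """Recompute each partition from scratch and overwrite the stored value."""
    for day in partitions:
        warehouse[day] = sum(1 for e in events if e["event_date"] == day)
    return warehouse


warehouse = {}
events = [{"event_date": "2024-03-08"}, {"event_date": "2024-03-08"}]

# Nightly run on March 9 covers March 7 and March 8.
rebuild_daily_counts(events, partitions_to_reprocess(date(2024, 3, 9)), warehouse)

# A March 8 event arrives late, after that run.
events.append({"event_date": "2024-03-08"})

# The March 10 run covers March 8 again and picks up the late event.
rebuild_daily_counts(events, partitions_to_reprocess(date(2024, 3, 10)), warehouse)
print(warehouse)
```

Overwriting a whole partition, instead of appending to it, is what makes the nightly rerun safe to repeat.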
Data quality: Great Expectations checks at each stage - row counts, nulls, schema validation.
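To show what those three check types look like, here is a sketch in plain Python. It deliberately does not use the Great Expectations API; the function name, schema format, and thresholds are assumptions for illustration.

```python
# Assumed schema for a clickstream batch: column name -> expected Python type.
EXPECTED_SCHEMA = {"user_id": str, "event_type": str, "event_time": str}


def check_batch(rows, expected_schema, min_rows=1, max_null_fraction=0.0):
    """Return a list of human-readable failures; an empty list means the batch passed."""
    failures = []
    if len(rows) < min_rows:
        failures.append(f"row_count: {len(rows)} < {min_rows}")
    if not rows:
        return failures
    for column, expected_type in expected_schema.items():
        present = [r[column] for r in rows if column in r]
        if len(present) < len(rows):
            failures.append(f"schema: {column} missing from {len(rows) - len(present)} row(s)")
        bad_type = sum(1 for v in present if v is not None and not isinstance(v, expected_type))
        if bad_type:
            failures.append(f"schema: {column} has {bad_type} value(s) of the wrong type")
        null_fraction = sum(1 for v in present if v is None) / len(rows)
        if null_fraction > max_null_fraction:
            failures.append(f"nulls: {column} null fraction {null_fraction:.2f} exceeds {max_null_fraction:.2f}")
    return failures


good_batch = [{"user_id": "u1", "event_type": "click", "event_time": "2024-01-01T10:00:00+00:00"}]
bad_batch = [
    {"user_id": None, "event_type": "click", "event_time": "2024-01-01T10:00:00+00:00"},
    {"user_id": "u2", "event_type": 7, "event_time": "2024-01-01T10:05:00+00:00"},
]
print(check_batch(good_batch, EXPECTED_SCHEMA))
for failure in check_batch(bad_batch, EXPECTED_SCHEMA):
    print(failure)
```

In a real pipeline, a non-empty failure list would block the downstream stage or page the on-call, depending on severity.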
Scoring:
- Strong Hire: Both real-time and batch paths, handles late arrivals, includes quality checks
- Lean Hire: Reasonable architecture but misses late arrivals or quality
- No Hire: Just "put events in S3"
Problem 2: Data Quality Investigation
5% of user records have NULL country, but the mobile app always collects location. What's happening?
Hint 1 - Direction
Investigate each pipeline stage: app-side (permissions), ingestion (parsing), transformation (join failures), loading (schema mismatch).
Full Answer + Rubric
Systematic investigation:
- App-side: Users deny location permissions? Check if 5% matches denial rate. Use IP geolocation as fallback.
- Ingestion: Check raw Kafka events. If country present in raw but NULL in warehouse → transformation issue.
- Transformation: Check enrichment join. LEFT JOIN on user_id with missing profiles → NULLs.
- Schema evolution: App schema changed? Old: country, New: user_country. Schema mismatch.
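The stage-by-stage comparison can be sketched with an in-memory SQLite database. The table names (raw_events, profiles, warehouse) and the sample rows are hypothetical. The point is the shape of the diagnostic query: classify each NULL in the warehouse by where the value was lost.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE raw_events (user_id TEXT, country TEXT);            -- as sent by the app
CREATE TABLE profiles   (user_id TEXT PRIMARY KEY, country TEXT);
INSERT INTO raw_events VALUES ('u1','DE'), ('u2','US'), ('u3',NULL), ('u4','FR');
INSERT INTO profiles   VALUES ('u1','DE'), ('u2','US'), ('u3',NULL);  -- u4 has no profile

-- Warehouse table built by the enrichment join under investigation.
CREATE TABLE warehouse AS
SELECT e.user_id AS user_id, p.country AS country
FROM raw_events e
LEFT JOIN profiles p ON p.user_id = e.user_id;
""")

diagnosis = conn.execute("""
SELECT
  CASE
    WHEN e.country IS NULL THEN 'missing_in_raw'       -- app-side or ingestion
    WHEN p.user_id IS NULL THEN 'no_matching_profile'  -- join failure
    ELSE 'other'
  END AS cause,
  COUNT(*) AS null_rows
FROM warehouse w
JOIN raw_events e ON e.user_id = w.user_id
LEFT JOIN profiles p ON p.user_id = w.user_id
WHERE w.country IS NULL
GROUP BY cause
ORDER BY cause
""").fetchall()

causes = dict(diagnosis)
print(causes)
```

Here u3 never had a country at the source, while u4 lost it in the join. The two causes need different fixes, which is why the breakdown matters more than the overall NULL rate.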
Prevention: Data contracts between app team and data team. Automated check: expect_column_values_to_not_be_null("country", mostly=0.97), which fails when fewer than 97% of values are non-null.
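The threshold semantics are worth being able to explain. A plain-Python sketch of what a "mostly" check computes, assuming the rule "pass when the non-null fraction is at least the threshold" (this mirrors the idea, it is not the Great Expectations implementation, and the function name is hypothetical):

```python
def not_null_mostly(values, mostly):
    """True when the fraction of non-null values is at least `mostly`."""
    values = list(values)
    if not values:
        return True  # assumption: an empty column passes; pair this with a row-count check
    non_null = sum(v is not None for v in values)
    return non_null / len(values) >= mostly


# The incident from the problem: 5% NULL country.
incident = ["US"] * 95 + [None] * 5
# An assumed healthy baseline: 2% NULL, e.g. users who deny location permission.
baseline = ["US"] * 98 + [None] * 2

print(not_null_mostly(incident, mostly=0.97))
print(not_null_mostly(baseline, mostly=0.97))
```

With a 0.97 threshold, the 5% NULL rate from this problem would have failed the check, while a small expected rate of missing values would not trigger alerts.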
Scoring:
- Strong Hire: Systematic diagnosis across stages, proposes data contracts
- Lean Hire: Identifies one cause but not systematic
- No Hire: "Fill NULLs with a default value"
Interview Cheat Sheet
| Question Pattern | Framework | Key Phrases |
|---|---|---|
| "Design a data pipeline" | Ingest → Store → Transform → Serve → Monitor | "I'd start by understanding the SLA: how fresh must the data be, and what failure rate is acceptable?" |
| "Ensure data quality" | Contracts → Validation → Monitoring → Alerting → Remediation | "Data quality is a first-class concern with automated checks at every stage" |
| "Design a data model" | Entities → Relationships → Grain → Dimensions → Facts | "The key decision is the grain - what does one row represent?" |
| "Handle schema evolution" | Registry → Compatibility → Migration | "I enforce backward-compatible changes with a schema registry" |
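If asked what "backward-compatible" means in practice, a simplified rule helps: a new schema can still read old data if every field it adds has a default and no shared field changes type. The sketch below encodes only that rule. It is an assumption-laden illustration loosely modeled on Avro-style compatibility, not the behavior of any specific schema registry, and the schema format is hypothetical.

```python
def backward_compatibility_problems(old_schema, new_schema):
    """List reasons the new schema could not read data written with the old one."""
    problems = []
    for name, spec in new_schema.items():
        if name not in old_schema:
            # Old records lack this field, so the reader needs a default to fill in.
            if "default" not in spec:
                problems.append(f"new field '{name}' has no default")
        elif old_schema[name]["type"] != spec["type"]:
            problems.append(
                f"field '{name}' changed type from {old_schema[name]['type']} to {spec['type']}"
            )
    return problems


old = {"user_id": {"type": "string"}, "country": {"type": "string"}}

# A rename looks like "drop country, add user_country" and breaks readers of old data.
renamed = {"user_id": {"type": "string"}, "user_country": {"type": "string"}}

# Adding the new field with a default, while keeping the old one, is safe.
additive = {**old, "user_country": {"type": "string", "default": None}}

print(backward_compatibility_problems(old, renamed))
print(backward_compatibility_problems(old, additive))
```

The rename case is the same country → user_country change from Problem 2: a check like this in CI would have rejected it before it reached production.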
Spaced Repetition Checkpoints
- Day 0: Read this page. Assess your SQL and pipeline skills.
- Day 3: Solve 3 advanced SQL problems (window functions, self-joins, recursive CTEs).
- Day 7: Design a data pipeline on a whiteboard: ingestion, transformation, storage, quality.
- Day 14: Design a star schema for an e-commerce company.
- Day 21: Mock system design: "Design the data platform for a ride-sharing company."
What's Next
- If DE is your target → The Interview Process
- Compare → MLOps or MLE
- Coding prep → Coding Interviews
- Salary context → Salary Bands
