Research Engineer - The Paper Implementer

Reading time: ~22 min | Interview relevance: Critical | Roles: RE

The Real Interview Moment

You're in the on-site at a frontier AI lab. The interviewer says: "Here's a paper published last week on a new attention mechanism. You have 45 minutes. Read the key sections, identify the core algorithmic contribution, and implement it in PyTorch. Then tell me what experiment you'd run first to validate it."

You've never seen this paper before. It is dense: einsum-style index notation, custom normalization schemes, a novel position encoding. The interviewer isn't testing whether you've memorized this paper; they're testing whether you can go from mathematical notation to working code under time pressure, and whether you have the research taste to know which experiment matters.

This is what Research Engineers do every day: read papers, implement ideas, and design experiments to validate them.

What You Will Master

After reading this page, you will be able to:

Define the Research Engineer role and distinguish it from Research Scientist, MLE, and PhD researcher
Understand the RE interview loop at frontier labs (OpenAI, Anthropic, DeepMind, Meta FAIR)
Identify the mathematical and coding skills tested in RE interviews
Navigate the career trajectory from junior RE to Senior/Staff
Evaluate whether RE is the right fit for your background

Self-Assessment: Where Are You Now?

Skill Area	1 (Weak)	3 (Moderate)	5 (Strong)	Your Rating
Math (linear algebra, probability, optimization)	Can't do matrix multiplication	Comfortable with undergrad material	Derive proofs, read math-heavy papers	___
PyTorch (implementing from scratch)	Used high-level APIs only	Implement standard architectures	Implement papers from scratch, custom autograd	___
Paper reading	Never read an ML paper	Read 1-2 per month	Read 5+ per week, can critique methods	___
Experiment design	No experience	Basic training loops	Design ablations, hyperparameter sweeps	___
Distributed training	Never used multi-GPU	Basic DataParallel	FSDP, DeepSpeed, custom parallelism	___
Coding (DSA)	Can't solve Easy	Solve Medium in 30 min	Solve Hard consistently	___
Research taste	Don't know what's important	Follow trends	Can identify impactful research directions	___

Part 1 - What a Research Engineer Actually Does

RE vs. Research Scientist vs. MLE

RE vs Adjacent Roles

Dimension	Research Scientist	Research Engineer	ML Engineer
Primary output	Papers, new ideas	Working implementations, experiment results	Production models, systems
Publishes papers	Yes (first author)	Sometimes (co-author)	Rarely
PhD required	Usually	Sometimes	No
Math depth	Very deep	Deep	Moderate
Engineering depth	Moderate	Very deep	Deep
Reads papers	Daily (writes them)	Daily (implements them)	Weekly-monthly
Typical employer	Frontier labs, universities	Frontier labs, research teams	Any tech company

60-Second Answer

"A Research Engineer turns research ideas into reality. Research Scientists propose new methods, but someone needs to implement those ideas in code, run experiments at scale, and make the training infrastructure work. That's me. I need the math to read papers and the engineering to make them run. I implement papers from scratch in PyTorch, optimize training across hundreds of GPUs, and design experiments to validate whether a new idea actually improves over baselines."

Where REs Work

Employer	Focus	Interview Style
Google DeepMind	Fundamental research, reasoning, safety, foundation models	Academic-style and paper-discussion heavy, with a strong coding bar
OpenAI	Pre-training, alignment, product research	Coding-heavy, "implement this paper"
Anthropic	Safety, interpretability, alignment	Strong coding + research taste
Meta FAIR	Open research, vision, NLP	Academic + engineering hybrid
AI startups	Applied research for product	Full-stack: research to production

Common Trap

Research Engineer roles are rare - maybe 1,000–2,000 open RE positions globally at any time, compared to 50,000+ MLE openings. The bar is extremely high, and you'll compete against PhD candidates. If you don't have a strong math background and paper implementation experience, consider MLE first and transition later.

Part 2 - The RE Interview Loop

Typical Loop at Frontier Labs

[Figure: RE Interview Loop]

Round-by-Round Breakdown

Paper Implementation Round (Signature Round)

You're given a paper (or section) and asked to implement the core algorithm in PyTorch. 45-60 minutes.

What they're testing: Can you read math and translate it to code? Can you identify the key contribution vs. boilerplate? Can you debug when output doesn't match?

BAD approach: Start coding immediately without understanding the math.

GOOD approach: Spend 10 minutes reading and annotating. Identify the core equation. Write down tensor shapes on paper. Implement step by step, verifying shapes. Write a simple test case.
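The "write down tensor shapes, then verify them as you go" habit can be sketched roughly as follows. The helper `check_shape` and the toy step `single_head_scores` are hypothetical names introduced only for illustration, and the sketch assumes PyTorch is installed; it is one way to structure the check, not a prescribed interview answer.

```python
import torch


def check_shape(name, tensor, expected):
    """Fail loudly, naming the tensor, if its shape differs from the annotation."""
    actual = tuple(tensor.shape)
    if actual != tuple(expected):
        raise ValueError(f"{name}: expected shape {tuple(expected)}, got {actual}")
    return tensor


def single_head_scores(x, w_q, w_k):
    """Toy step: project inputs to queries/keys and form scaled scores.

    x:   (batch, seq_len, d_model)
    w_q: (d_model, d_k)
    w_k: (d_model, d_k)
    returns scores: (batch, seq_len, seq_len)
    """
    batch, seq_len, _ = x.shape
    d_k = w_q.shape[1]

    q = check_shape("q", x @ w_q, (batch, seq_len, d_k))
    k = check_shape("k", x @ w_k, (batch, seq_len, d_k))
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5
    return check_shape("scores", scores, (batch, seq_len, seq_len))


# Simple test case: small, distinct dimensions so a swapped axis is caught.
torch.manual_seed(0)
x = torch.randn(2, 5, 8)
scores = single_head_scores(x, torch.randn(8, 4), torch.randn(8, 4))
print(scores.shape)  # torch.Size([2, 5, 5])
```

Choosing a different size for every axis (here 2, 5, 8, 4) is deliberate: if two axes share a size, a transposed or mis-broadcast tensor can pass the shape check silently.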

Research Discussion Round

Present a paper you've read deeply and discuss it critically.

What they're testing: Research taste. Can you explain why a paper matters? Can you identify limitations? Can you propose follow-up experiments?

Strong structure:

What problem does this paper solve, and why does it matter?
What's the core technical contribution?
What are the key experiments, and are they convincing?
What are the limitations?
What would you do next if continuing this work?

Interviewer's Perspective

In the research discussion, I'm looking for taste. The strongest signal is when a candidate identifies a paper's limitations without being prompted and proposes concrete follow-up experiments. Anyone can summarize. The RE candidate should critique and extend.

Company Variations

Company Variation

OpenAI: Heavily coding-focused. May ask you to implement a full training loop from scratch. Speed matters.
Anthropic: Research taste is paramount. "What would you work on and why?" is a critical question.
DeepMind: Most academic. May include a whiteboard math round alongside coding.
Meta FAIR: Open research culture. Strong coding + paper discussion. Publication record helps but isn't required.

Part 3 - Career Trajectory

[Figure: RE Career Ladder]

Transition Paths

From	Difficulty of Move to RE	Key Advantage	Key Gap
PhD student	🟢 Natural	Deep math, paper experience	Production engineering, coding speed
MLE	🟡 Medium	Engineering skills, PyTorch	Paper reading, math depth, research taste
SWE	🟠 Hard	Strong coding	Math, ML fundamentals, research context
New Grad (CS)	🟡 Medium	Fresh knowledge	Research experience - need strong projects

Instant Rejection

Never say: "I want to be a Research Engineer because I want to work on cool AI stuff." Instead: "I want to be a Research Engineer because I love the cycle of reading a paper, understanding the math, implementing it, and seeing whether the improvements hold. I've implemented 15 papers this year on [specific area], and I want to do it full-time at a lab where the research matters."

Practice Problems

Problem 1: Paper Reading

Given this simplified, hypothetical abstract (written for practice, not quoted from the paper): "We propose FlashAttention-3, which reduces the memory complexity of attention from O(N²) to O(N) by tiling the computation and using online softmax. We achieve 2.5x speedup over FlashAttention-2 on H100 GPUs."

What is the core technical challenge? What experiment would you run first to validate the claims?

Hint 1 - Direction

The core challenge is that standard attention materializes a full N×N matrix in GPU HBM, which is memory-prohibitive for long sequences and bandwidth-bound.
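To see why the N×N matrix becomes prohibitive, a back-of-the-envelope calculation helps. The function name `attention_matrix_bytes` is hypothetical, and the defaults (one batch element, one head, 2 bytes per element for FP16/BF16) are assumptions chosen for illustration.

```python
def attention_matrix_bytes(seq_len, batch=1, heads=1, bytes_per_element=2):
    """Bytes needed to store the full (seq_len x seq_len) score matrix.

    bytes_per_element=2 corresponds to FP16/BF16 storage.
    """
    return batch * heads * seq_len * seq_len * bytes_per_element


for n in (1024, 4096, 16384, 65536):
    gib = attention_matrix_bytes(n) / 2**30
    print(f"N={n:>6}: {gib:.3f} GiB per head")
```

Under these assumptions a single head at N = 65,536 needs 8 GiB for the score matrix alone, before multiplying by batch size and head count. Quadrupling the sequence length multiplies the footprint by 16.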

Full Answer + Rubric

Strong answer: "Standard attention requires an N×N attention matrix in HBM, which is (1) memory-prohibitive for long sequences and (2) bandwidth-bound. FlashAttention addresses this by tiling so Q, K, V blocks fit in fast SRAM, computing block-by-block using online softmax.

First experiment: reproduce the headline speedup. Benchmark FA-3 vs FA-2 on H100 at sequence lengths [1K, 4K, 16K, 64K] with head dimension d=128. Verify that the outputs match within floating-point tolerance. Then profile: does the speedup come from better tiling, better memory access patterns, or hardware-specific optimization?"

Scoring:

Strong Hire: Explains the memory/compute trade-off, proposes concrete reproducibility experiment with specific parameters
Lean Hire: Understands high-level idea but vague experiment proposal
No Hire: Can't explain why attention is memory-intensive
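The online softmax mentioned in the strong answer can be illustrated with a deliberately tiny sketch: one query, scalar values, plain Python. It is not the FlashAttention kernel (which tiles Q, K, and V on the GPU); it only shows the running-maximum rescaling trick that lets the result be computed block by block without holding all scores at once. Both function names are hypothetical.

```python
import math


def softmax_weighted_sum(scores, values):
    """Reference: materialize every softmax weight, then average the values."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    return sum(e * v for e, v in zip(exps, values)) / sum(exps)


def online_softmax_weighted_sum(scores, values, block_size=2):
    """Same result, one block at a time, keeping only running statistics."""
    running_max = float("-inf")
    running_sum = 0.0  # sum of exp(score - running_max) seen so far
    running_out = 0.0  # sum of exp(score - running_max) * value seen so far

    for start in range(0, len(scores), block_size):
        block_s = scores[start:start + block_size]
        block_v = values[start:start + block_size]

        new_max = max(running_max, max(block_s))
        # Rescale old statistics so they are relative to the new maximum.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        running_out *= correction

        for s, v in zip(block_s, block_v):
            w = math.exp(s - new_max)
            running_sum += w
            running_out += w * v
        running_max = new_max

    return running_out / running_sum


scores = [0.5, 2.0, -1.0, 3.0, 0.0]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
assert math.isclose(
    online_softmax_weighted_sum(scores, values),
    softmax_weighted_sum(scores, values),
)
```

The only state carried between blocks is three numbers, which is the property that lets a tiled kernel avoid writing the full score matrix to slow memory.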

Problem 2: Implementation

Implement scaled dot-product attention in PyTorch from scratch. Input: Q, K, V tensors of shape (batch, heads, seq_len, d_k). Include mask support.

Hint 1 - Direction

The formula is: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k) + M) × V, where M is an additive mask: 0 where attention is allowed and -inf where it is blocked.

Full Answer + Rubric

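import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Q, K, V: (batch, heads, seq_len, d_k)
    # mask: broadcastable to (batch, heads, seq_len, seq_len); 0 marks blocked positions
    d_k = Q.size(-1)
    # Scale by sqrt(d_k) so the score variance does not grow with the head dimension
    scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)
    if mask is not None:
        # -inf becomes exactly 0 after softmax; a fully masked row would yield NaN
        scores = scores.masked_fill(mask == 0, float('-inf'))
    attn_weights = F.softmax(scores, dim=-1)  # normalize over the key axis
    output = torch.matmul(attn_weights, V)    # (batch, heads, seq_len, d_k)
    return output, attn_weights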
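A quick sanity check for a solution like the one above might look as follows. The function is repeated so the snippet runs on its own. The comparison at the end assumes PyTorch 2.0 or later, where `torch.nn.functional.scaled_dot_product_attention` is available and interprets a boolean `attn_mask` as "True means may attend". The tensor sizes are arbitrary.

```python
import torch
import torch.nn.functional as F


def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    attn_weights = F.softmax(scores, dim=-1)
    return torch.matmul(attn_weights, V), attn_weights


torch.manual_seed(0)
batch, heads, seq_len, d_k = 2, 3, 5, 4
Q, K, V = (torch.randn(batch, heads, seq_len, d_k) for _ in range(3))

# Causal mask: position i may attend to positions <= i (1 = keep, 0 = block).
causal = torch.tril(torch.ones(seq_len, seq_len))

out, weights = scaled_dot_product_attention(Q, K, V, mask=causal)

# Shapes match the problem statement.
assert out.shape == (batch, heads, seq_len, d_k)
assert weights.shape == (batch, heads, seq_len, seq_len)
# Each row of weights is a probability distribution.
assert torch.allclose(weights.sum(dim=-1), torch.ones(batch, heads, seq_len))
# No weight leaks to future (masked) positions.
assert torch.all(weights.masked_fill(causal.bool(), 0.0) == 0)
# Agreement with PyTorch's built-in implementation.
reference = F.scaled_dot_product_attention(Q, K, V, attn_mask=causal.bool())
assert torch.allclose(out, reference, atol=1e-5)
```

One edge case worth raising aloud in the interview: if a query row is masked everywhere, softmax over all `-inf` scores produces NaN, so padding-only rows need separate handling.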

Scoring:

Strong Hire: Correct implementation, explains scaling factor (prevents softmax saturation), handles mask with -inf, returns attention weights
Lean Hire: Correct but can't explain why we scale by sqrt(d_k)
No Hire: Gets matrix multiplication order wrong or doesn't understand masking
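The scaling-factor explanation in the rubric rests on a short argument: if the components of q and k are independent with mean 0 and variance 1, each product q_i·k_i has variance 1, so the dot product of d_k such terms has variance d_k. Dividing by sqrt(d_k) brings the variance back to roughly 1, which keeps softmax out of its saturated regime. The sketch below checks this empirically using only the standard library; the function name, trial count, and seed are illustrative assumptions.

```python
import random
import statistics


def dot_product_variance(d_k, trials=2000, seed=0):
    """Empirical variance of q.k for q, k with i.i.d. N(0, 1) components."""
    rng = random.Random(seed)
    dots = []
    for _ in range(trials):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        dots.append(sum(a * b for a, b in zip(q, k)))
    return statistics.pvariance(dots)


for d_k in (4, 64, 256):
    raw = dot_product_variance(d_k)
    scaled = raw / d_k  # dividing scores by sqrt(d_k) divides their variance by d_k
    print(f"d_k={d_k:>3}: var(q.k) ~ {raw:7.1f}, after scaling ~ {scaled:.2f}")
```

The unscaled variance should track d_k, while the scaled column should stay near 1 regardless of head dimension.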

Problem 3: Experiment Design

You've implemented a new positional encoding scheme. How do you evaluate whether it's better than RoPE (Rotary Position Embeddings)?

Hint 1 - Direction

Think about controlled experiments: same model, same data, same hyperparameters, only change the positional encoding. What metrics matter? What sequence lengths should you test at?

Full Answer + Rubric

Strong answer:

Controlled setup: Same transformer architecture, same training data, same hyperparameters. Only change: positional encoding (RoPE vs. yours).
Training: Train both for the same number of tokens. Monitor training loss curves - do they converge to the same loss? Faster? Lower?
Evaluation: Perplexity on a held-out test set at the trained context length. Then - critically - evaluate at longer context lengths than trained on (extrapolation). This is where positional encodings usually differ.
Downstream tasks: Perplexity isn't everything. Test on tasks that specifically require position awareness: copying, retrieval from long context, code completion with long files.
Ablations: What happens with different model sizes? Different sequence lengths during training? Is the improvement consistent or specific to one configuration?
Compute cost: Is the new encoding more expensive to compute? Is it compatible with FlashAttention kernels? What is the wall-clock impact?
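The controlled setup described above can be enforced in code rather than by discipline alone. The sketch below is one possible shape for that; `RunConfig`, its fields, and every default value are hypothetical placeholders, not recommended hyperparameters.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class RunConfig:
    # All field names and values here are hypothetical placeholders.
    positional_encoding: str
    n_layers: int = 12
    d_model: int = 768
    train_tokens: int = 10_000_000_000
    train_context_len: int = 2048
    learning_rate: float = 3e-4
    seed: int = 0


def differing_fields(a, b):
    """Names of the fields on which two configs disagree."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]


baseline = RunConfig(positional_encoding="rope")
candidate = replace(baseline, positional_encoding="new_scheme")

# Guard against accidental drift: the two arms may differ in exactly one field.
assert differing_fields(baseline, candidate) == ["positional_encoding"]

# Evaluate at the trained length and beyond it to probe extrapolation.
eval_context_lens = [baseline.train_context_len * m for m in (1, 2, 4, 8)]
print(eval_context_lens)  # [2048, 4096, 8192, 16384]
```

Deriving the candidate from the baseline with `replace` means a later change to the baseline propagates to both arms. Repeating each arm over several seeds helps separate a real effect from run-to-run noise.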

Scoring:

Strong Hire: Controlled experiment, tests extrapolation (key for positional encodings), considers downstream tasks and compute cost
Lean Hire: Good experiment design but misses extrapolation testing
No Hire: Only checks training loss without evaluation on held-out data
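The extrapolation check that separates Strong Hire from Lean Hire reduces to a small calculation: perplexity is the exponential of the mean per-token negative log-likelihood, and the quantity of interest is how it changes as the evaluation context grows past the trained length. The sketch below uses made-up loss values purely to exercise the functions; `perplexity` and `extrapolation_ratio` are hypothetical helper names.

```python
import math


def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token, in nats)."""
    if not token_nlls:
        raise ValueError("need at least one token")
    return math.exp(sum(token_nlls) / len(token_nlls))


def extrapolation_ratio(ppl_by_len, train_len):
    """Perplexity at each length relative to the trained length (1.0 = no degradation)."""
    base = ppl_by_len[train_len]
    return {n: ppl / base for n, ppl in sorted(ppl_by_len.items())}


# Made-up per-token losses keyed by evaluation context length (not real results).
fake_nlls = {2048: [2.0, 2.2, 1.8], 4096: [2.1, 2.3, 1.9], 8192: [3.0, 3.4, 3.2]}
ppl = {n: perplexity(losses) for n, losses in fake_nlls.items()}
ratios = extrapolation_ratio(ppl, train_len=2048)

for n, r in ratios.items():
    print(f"context {n:>5}: {r:.2f}x the trained-length perplexity")
```

Computing the same ratios for both the RoPE baseline and the new encoding gives a like-for-like comparison of how gracefully each degrades beyond its training length.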

Interview Cheat Sheet

Question Pattern	Framework	Key Phrases
"Implement this from the paper"	Read math → Identify shapes → Code step-by-step → Test	"Let me trace the tensor shapes first"
"Discuss a paper"	Problem → Contribution → Experiments → Limitations → Extensions	"The paper's strongest result is X, but I think the evaluation is limited because Y"
"What research excites you?"	Current state → Open problem → Your approach → Expected impact	"The gap I see is X, and I think combining Y and Z could address it"
"Scale this experiment"	Single GPU → Multi-GPU → Parallelism strategy → Bottleneck analysis	"The first bottleneck at scale will be X"

Spaced Repetition Checkpoints

Day 0: Read this page. Assess your math and implementation skills honestly.
Day 3: Pick a recent paper. Read it, identify the core contribution, implement it.
Day 7: Practice research discussion: present a paper to a friend in 10 minutes, then defend it.
Day 14: Implement multi-head attention from scratch. Then a full transformer encoder block.
Day 21: Time yourself: given a new paper section, implement the key algorithm in 45 minutes.

What's Next

If RE is your target → Paper Discussion - your most important prep section
Compare → MLE is the more accessible alternative
Coding prep → Coding Interviews
Deep learning depth → Deep Learning
