Research Engineer - The Paper Implementer
Reading time: ~22 min | Interview relevance: Critical | Roles: RE
The Real Interview Moment
You're in the on-site at a frontier AI lab. The interviewer says: "Here's a paper published last week on a new attention mechanism. You have 45 minutes. Read the key sections, identify the core algorithmic contribution, and implement it in PyTorch. Then tell me what experiment you'd run first to validate it."
You've never seen this paper before. The math is dense: einops-style notation, custom normalization schemes, a novel position encoding. The interviewer isn't testing whether you've memorized this paper; they're testing whether you can go from mathematical notation to working code under time pressure, and whether you have the research taste to know which experiment matters.
This is what Research Engineers do every day: read papers, implement ideas, and design experiments to validate them.
What You Will Master
After reading this page, you will be able to:
- Define the Research Engineer role and distinguish it from Research Scientist, MLE, and PhD researcher
- Understand the RE interview loop at frontier labs (OpenAI, Anthropic, DeepMind, Meta FAIR)
- Identify the mathematical and coding skills tested in RE interviews
- Navigate the career trajectory from junior RE to Senior/Staff
- Evaluate whether RE is the right fit for your background
Self-Assessment: Where Are You Now?
| Skill Area | 1 (Weak) | 3 (Moderate) | 5 (Strong) | Your Rating |
|---|---|---|---|---|
| Math (linear algebra, probability, optimization) | Can't do matrix multiplication | Comfortable with undergrad material | Derive proofs, read math-heavy papers | ___ |
| PyTorch (implementing from scratch) | Used high-level APIs only | Implement standard architectures | Implement papers from scratch, custom autograd | ___ |
| Paper reading | Never read an ML paper | Read 1-2 per month | Read 5+ per week, can critique methods | ___ |
| Experiment design | No experience | Basic training loops | Design ablations, hyperparameter sweeps | ___ |
| Distributed training | Never used multi-GPU | Basic DataParallel | FSDP, DeepSpeed, custom parallelism | ___ |
| Coding (DSA) | Can't solve Easy | Solve Medium in 30 min | Solve Hard consistently | ___ |
| Research taste | Don't know what's important | Follow trends | Can identify impactful research directions | ___ |
Part 1 - What a Research Engineer Actually Does
RE vs. Research Scientist vs. MLE
| Dimension | Research Scientist | Research Engineer | ML Engineer |
|---|---|---|---|
| Primary output | Papers, new ideas | Working implementations, experiment results | Production models, systems |
| Publishes papers | Yes (first author) | Sometimes (co-author) | Rarely |
| PhD required | Usually | Sometimes | No |
| Math depth | Very deep | Deep | Moderate |
| Engineering depth | Moderate | Very deep | Deep |
| Reads papers | Daily (writes them) | Daily (implements them) | Weekly-monthly |
| Typical employer | Frontier labs, universities | Frontier labs, research teams | Any tech company |
"A Research Engineer turns research ideas into reality. Research Scientists propose new methods, but someone needs to implement those ideas in code, run experiments at scale, and make the training infrastructure work. That's me. I need the math to read papers and the engineering to make them run. I implement papers from scratch in PyTorch, optimize training across hundreds of GPUs, and design experiments to validate whether a new idea actually improves over baselines."
Where REs Work
| Employer | Focus | Interview Style |
|---|---|---|
| DeepMind | Fundamental research, reasoning, safety | Academic-style, paper discussion heavy |
| OpenAI | Pre-training, alignment, product research | Coding-heavy, "implement this paper" |
| Anthropic | Safety, interpretability, alignment | Strong coding + research taste |
| Meta FAIR | Open research, vision, NLP | Academic + engineering hybrid |
| Google DeepMind | Foundation models, optimization | Strong coding bar, research depth |
| AI startups | Applied research for product | Full-stack: research to production |
Research Engineer roles are rare - maybe 1,000–2,000 open RE positions globally at any time, compared to 50,000+ MLE openings. The bar is extremely high, and you'll compete against PhD candidates. If you don't have a strong math background and paper implementation experience, consider MLE first and transition later.
Part 2 - The RE Interview Loop
Typical Loop at Frontier Labs
Round-by-Round Breakdown
Paper Implementation Round (Signature Round)
You're given a paper (or section) and asked to implement the core algorithm in PyTorch. 45-60 minutes.
What they're testing: Can you read math and translate it to code? Can you identify the key contribution vs. boilerplate? Can you debug when output doesn't match?
BAD approach: Start coding immediately without understanding the math.
GOOD approach: Spend 10 minutes reading and annotating. Identify the core equation. Write down tensor shapes on paper. Implement step by step, verifying shapes. Write a simple test case.
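The "write down tensor shapes, verify as you go" habit can be sketched as follows. This is a minimal illustration, not a prescribed workflow: `expect_shape` is a hypothetical helper, and the sizes `B, H, N, D` are arbitrary values chosen for the example.

```python
import torch


def expect_shape(t: torch.Tensor, *shape: int) -> torch.Tensor:
    """Fail fast with a readable message when a tensor has an unexpected shape."""
    assert tuple(t.shape) == shape, f"expected {shape}, got {tuple(t.shape)}"
    return t


# Hypothetical sizes, chosen only for illustration.
B, H, N, D = 2, 4, 8, 16  # batch, heads, sequence length, head dimension

torch.manual_seed(0)
q = expect_shape(torch.randn(B, H, N, D), B, H, N, D)
k = expect_shape(torch.randn(B, H, N, D), B, H, N, D)
v = expect_shape(torch.randn(B, H, N, D), B, H, N, D)

# Check the shape after every step of the equation being implemented.
scores = expect_shape(q @ k.transpose(-2, -1) / D ** 0.5, B, H, N, N)
weights = expect_shape(scores.softmax(dim=-1), B, H, N, N)
out = expect_shape(weights @ v, B, H, N, D)

# Simple test case: every attention row should be a probability distribution.
row_sums = weights.sum(dim=-1)
print(torch.allclose(row_sums, torch.ones(B, H, N), atol=1e-5))
```

Checking shapes after each line localizes a transposition or broadcasting mistake to the step that caused it, which is usually cheaper than debugging a wrong final output.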
Research Discussion Round
Present a paper you've read deeply and discuss it critically.
What they're testing: Research taste. Can you explain why a paper matters? Can you identify limitations? Can you propose follow-up experiments?
Strong structure:
- What problem does this paper solve, and why does it matter?
- What's the core technical contribution?
- What are the key experiments, and are they convincing?
- What are the limitations?
- What would you do next if continuing this work?
In the research discussion, I'm looking for taste. The strongest signal is when a candidate identifies a paper's limitations without being prompted and proposes concrete follow-up experiments. Anyone can summarize. The RE candidate should critique and extend.
Company Variations
- OpenAI: Heavily coding-focused. May ask you to implement a full training loop from scratch. Speed matters.
- Anthropic: Research taste is paramount. "What would you work on and why?" is a critical question.
- DeepMind: Most academic. May include a whiteboard math round alongside coding.
- Meta FAIR: Open research culture. Strong coding + paper discussion. Publication record helps but isn't required.
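As a reference point for the "training loop from scratch" request, here is a minimal sketch on a toy regression problem. The task, model, and hyperparameters are assumptions made for illustration; an interview version would typically add batching, evaluation, and logging.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy data (hypothetical task): y = 3x - 1 plus a little noise.
x = torch.randn(256, 1)
y = 3.0 * x - 1.0 + 0.01 * torch.randn(256, 1)

model = nn.Linear(1, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

losses = []
for step in range(200):
    optimizer.zero_grad()      # clear gradients from the previous step
    pred = model(x)            # forward pass
    loss = loss_fn(pred, y)    # scalar loss
    loss.backward()            # backward pass: populate .grad on parameters
    optimizer.step()           # parameter update
    losses.append(loss.item())

print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.6f}")
```

The order `zero_grad` → forward → loss → `backward` → `step` is the part worth being able to write without hesitation.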
Part 3 - Career Trajectory
Transition Paths
| From | To RE | Difficulty | Key Advantage | Key Gap |
|---|---|---|---|---|
| PhD student | 🟢 Natural | Deep math, paper experience | Production engineering, coding speed |
| MLE | 🟡 Medium | Engineering skills, PyTorch | Paper reading, math depth, research taste |
| SWE | 🟠 Hard | Strong coding | Math, ML fundamentals, research context |
| New Grad (CS) | 🟡 Medium | Fresh knowledge | Research experience - need strong projects |
Never say: "I want to be a Research Engineer because I want to work on cool AI stuff." Instead: "I want to be a Research Engineer because I love the cycle of reading a paper, understanding the math, implementing it, and seeing whether the improvements hold. I've implemented 15 papers this year on [specific area], and I want to do it full-time at a lab where the research matters."
Practice Problems
Problem 1: Paper Reading
Given this abstract: "We propose FlashAttention-3, which reduces the memory complexity of attention from O(N²) to O(N) by tiling the computation and using online softmax. We achieve 2.5x speedup over FlashAttention-2 on H100 GPUs."
What is the core technical challenge? What experiment would you run first to validate the claims?
Hint 1 - Direction
The core challenge is that standard attention materializes a full N×N matrix in GPU HBM, which is memory-prohibitive for long sequences and bandwidth-bound.
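To make "memory-prohibitive" concrete, the size of the score matrix can be worked out directly. The sketch below assumes 2 bytes per element (FP16/BF16) and counts a single head of a single sequence; `attention_matrix_bytes` is a hypothetical helper name.

```python
def attention_matrix_bytes(seq_len: int, batch: int = 1, heads: int = 1,
                           bytes_per_element: int = 2) -> int:
    """Bytes needed to store the full (seq_len x seq_len) attention score matrix."""
    return batch * heads * seq_len * seq_len * bytes_per_element


for n in (1_024, 4_096, 16_384, 65_536):
    gib = attention_matrix_bytes(n) / 2**30
    print(f"N={n:>6}: {gib:8.3f} GiB per head")
```

At N = 65,536 the matrix is 2^32 elements, or 8 GiB per head at 2 bytes each, before multiplying by batch size and head count.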
Full Answer + Rubric
Strong answer: "Standard attention requires an N×N attention matrix in HBM, which is (1) memory-prohibitive for long sequences and (2) bandwidth-bound. FlashAttention addresses this by tiling so Q, K, V blocks fit in fast SRAM, computing block-by-block using online softmax.
First experiment: reproduce the headline speedup. Benchmark FA-3 vs FA-2 on H100 at sequence lengths [1K, 4K, 16K, 64K] with d=128. Verify numerically identical outputs (within FP tolerance). Then profile: is the speedup from better tiling, memory patterns, or hardware-specific optimization?"
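The online softmax idea mentioned in the answer can be illustrated in plain Python for a single query row with scalar values. This is a sketch of the rescaling trick only, under those simplifying assumptions; it is not the FlashAttention kernel, and the function names are hypothetical.

```python
import math


def softmax_weighted_sum(scores, values):
    """Reference: materialize all softmax weights, then take the weighted sum."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return sum(e / total * v for e, v in zip(exps, values))


def online_softmax_weighted_sum(scores, values, block_size=2):
    """Process scores block by block, keeping only a running max, sum, and output."""
    running_max = -math.inf
    running_sum = 0.0
    running_out = 0.0
    for start in range(0, len(scores), block_size):
        s_blk = scores[start:start + block_size]
        v_blk = values[start:start + block_size]
        new_max = max(running_max, max(s_blk))
        # Rescale everything accumulated under the old max to the new max.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        running_out *= scale
        for s, v in zip(s_blk, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            running_out += w * v
        running_max = new_max
    return running_out / running_sum


scores = [0.5, -1.0, 3.0, 2.0, 0.1]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
reference = softmax_weighted_sum(scores, values)
blocked = online_softmax_weighted_sum(scores, values, block_size=2)
print(abs(reference - blocked) < 1e-12)
```

Because only the running max, running sum, and running output are kept, the full row of weights never has to be stored, which is what allows the computation to be tiled.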
Scoring:
- Strong Hire: Explains the memory/compute trade-off, proposes concrete reproducibility experiment with specific parameters
- Lean Hire: Understands high-level idea but vague experiment proposal
- No Hire: Can't explain why attention is memory-intensive
Problem 2: Implementation
Implement scaled dot-product attention in PyTorch from scratch. Input: Q, K, V tensors of shape (batch, heads, seq_len, d_k). Include mask support.
Hint 1 - Direction
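The formula is: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k) + mask) V, where mask is additive: 0 at positions that may be attended to and -inf at blocked positions. Filling blocked scores with -inf before the softmax is equivalent.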
Full Answer + Rubric
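```python
import torch
import torch.nn.functional as F


def scaled_dot_product_attention(Q, K, V, mask=None):
    # Q, K, V: (batch, heads, seq_len, d_k)
    d_k = Q.size(-1)
    # Scale by sqrt(d_k) so score variance stays near 1 and softmax does not saturate.
    scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)
    if mask is not None:
        # mask must broadcast to (batch, heads, seq_len, seq_len); 0 marks blocked positions.
        # Note: a fully masked row becomes all -inf and yields NaN after softmax.
        scores = scores.masked_fill(mask == 0, float('-inf'))
    attn_weights = F.softmax(scores, dim=-1)  # normalize over the key dimension
    output = torch.matmul(attn_weights, V)    # (batch, heads, seq_len, d_k)
    return output, attn_weights
```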
Scoring:
- Strong Hire: Correct implementation, explains scaling factor (prevents softmax saturation), handles mask with -inf, returns attention weights
- Lean Hire: Correct but can't explain why we scale by sqrt(d_k)
- No Hire: Gets matrix multiplication order wrong or doesn't understand masking
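The scaling-factor explanation can be checked empirically. The sketch below assumes query and key components are independent with zero mean and unit variance, in which case the dot product has standard deviation sqrt(d_k); `dot_product_std` is a hypothetical helper and the trial count is arbitrary.

```python
import math
import random


def dot_product_std(d_k: int, trials: int = 2000, seed: int = 0) -> float:
    """Empirical standard deviation of q.k for unit-variance Gaussian q and k."""
    rng = random.Random(seed)
    dots = []
    for _ in range(trials):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        dots.append(sum(a * b for a, b in zip(q, k)))
    mean = sum(dots) / trials
    return math.sqrt(sum((x - mean) ** 2 for x in dots) / trials)


d_k = 64
raw_std = dot_product_std(d_k)            # close to sqrt(64) = 8
scaled_std = raw_std / math.sqrt(d_k)     # close to 1 after scaling
print(f"raw std: {raw_std:.2f}, scaled std: {scaled_std:.2f}")
```

Unscaled scores with a spread of roughly 8 push the softmax toward a near one-hot output with very small gradients; dividing by sqrt(d_k) keeps the spread near 1 regardless of head dimension.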
Problem 3: Experiment Design
You've implemented a new positional encoding scheme. How do you evaluate whether it's better than RoPE (Rotary Position Embeddings)?
Hint 1 - Direction
Think about controlled experiments: same model, same data, same hyperparameters, only change the positional encoding. What metrics matter? What sequence lengths should you test at?
Full Answer + Rubric
Strong answer:
- Controlled setup: Same transformer architecture, same training data, same hyperparameters. Only change: positional encoding (RoPE vs. yours).
- Training: Train both for the same number of tokens. Monitor training loss curves - do they converge to the same loss? Faster? Lower?
- Evaluation: Perplexity on a held-out test set at the trained context length. Then - critically - evaluate at longer context lengths than trained on (extrapolation). This is where positional encodings usually differ.
- Downstream tasks: Perplexity isn't everything. Test on tasks that specifically require position awareness: copying, retrieval from long context, code completion with long files.
- Ablations: What happens with different model sizes? Different sequence lengths during training? Is the improvement consistent or specific to one configuration?
- Compute cost: Is the new encoding more expensive to compute? Is it compatible with FlashAttention-style fused kernels? What's the wall-clock time impact?
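The controlled setup and the extrapolation evaluation described above can be sketched as configuration code. Every field name and value in `RunConfig` is a hypothetical placeholder, and `"new_scheme"` stands in for whatever encoding is being tested.

```python
import math
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class RunConfig:
    # All fields and defaults are hypothetical placeholders.
    pos_encoding: str = "rope"
    n_layers: int = 12
    d_model: int = 768
    train_context: int = 2048
    train_tokens: int = 10_000_000_000
    learning_rate: float = 3e-4
    seed: int = 0


def differing_fields(a: RunConfig, b: RunConfig) -> list:
    """Names of the fields on which two run configurations disagree."""
    da, db = asdict(a), asdict(b)
    return sorted(k for k in da if da[k] != db[k])


def perplexity(token_nlls) -> float:
    """Perplexity = exp(mean per-token negative log-likelihood, in nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))


baseline = RunConfig()
candidate = replace(baseline, pos_encoding="new_scheme")

# Guard against accidental confounds: the runs may differ in exactly one field.
assert differing_fields(baseline, candidate) == ["pos_encoding"]

# Evaluate at the trained context length and at multiples of it (extrapolation).
eval_contexts = [baseline.train_context * m for m in (1, 2, 4, 8)]
print(eval_contexts)
```

Deriving the candidate from the baseline with `replace` and asserting on the diff makes the "only one thing changed" claim checkable rather than a matter of trust.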
Scoring:
- Strong Hire: Controlled experiment, tests extrapolation (key for positional encodings), considers downstream tasks and compute cost
- Lean Hire: Good experiment design but misses extrapolation testing
- No Hire: Only checks training loss without evaluation on held-out data
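For context on the baseline in this problem, the defining property of RoPE is that the attention score between a rotated query and a rotated key depends only on their relative offset. The plain-Python sketch below assumes the interleaved pairing convention (rotating `x[2i]` with `x[2i+1]`) and a base of 10000; some implementations pair dimensions differently.

```python
import math


def rope(x, pos, base=10000.0):
    """Rotate consecutive pairs (x[i], x[i+1]), i even, by angle pos * base**(-i/d)."""
    d = len(x)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out += [x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c]
    return out


def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


q = [0.3, -1.2, 0.7, 2.0]
k = [1.1, 0.4, -0.5, 0.9]

# Same relative offset (2) at different absolute positions gives the same score.
score_near = dot(rope(q, 3), rope(k, 1))
score_far = dot(rope(q, 10), rope(k, 8))
print(abs(score_near - score_far) < 1e-9)
```

A new encoding can be put through the same kind of property check before any training run, which is a cheap way to catch implementation bugs.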
Interview Cheat Sheet
| Question Pattern | Framework | Key Phrases |
|---|---|---|
| "Implement this from the paper" | Read math → Identify shapes → Code step-by-step → Test | "Let me trace the tensor shapes first" |
| "Discuss a paper" | Problem → Contribution → Experiments → Limitations → Extensions | "The paper's strongest result is X, but I think the evaluation is limited because Y" |
| "What research excites you?" | Current state → Open problem → Your approach → Expected impact | "The gap I see is X, and I think combining Y and Z could address it" |
| "Scale this experiment" | Single GPU → Multi-GPU → Parallelism strategy → Bottleneck analysis | "The first bottleneck at scale will be X" |
Spaced Repetition Checkpoints
- Day 0: Read this page. Assess your math and implementation skills honestly.
- Day 3: Pick a recent paper. Read it, identify the core contribution, implement it.
- Day 7: Practice research discussion: present a paper to a friend in 10 minutes, then defend it.
- Day 14: Implement multi-head attention from scratch. Then a full transformer encoder block.
- Day 21: Time yourself: given a new paper section, implement the key algorithm in 45 minutes.
What's Next
- If RE is your target → Paper Discussion - your most important prep section
- Compare → MLE is the more accessible alternative
- Coding prep → Coding Interviews
- Deep learning depth → Deep Learning
