Research Engineer - The Paper Implementer

Reading time: ~22 min | Interview relevance: Critical | Roles: RE

The Real Interview Moment

You're in the on-site at a frontier AI lab. The interviewer says: "Here's a paper published last week on a new attention mechanism. You have 45 minutes. Read the key sections, identify the core algorithmic contribution, and implement it in PyTorch. Then tell me what experiment you'd run first to validate it."

You've never seen this paper before. The math is dense: einops-style notation, custom normalization schemes, a novel position encoding. The interviewer isn't testing whether you've memorized the paper. They're testing whether you can go from mathematical notation to working code under time pressure, and whether you have the research taste to know which experiment matters.

This is what Research Engineers do every day: read papers, implement ideas, and design experiments to validate them.

What You Will Master

After reading this page, you will be able to:

  • Define the Research Engineer role and distinguish it from Research Scientist, MLE, and PhD researcher
  • Understand the RE interview loop at frontier labs (OpenAI, Anthropic, DeepMind, Meta FAIR)
  • Identify the mathematical and coding skills tested in RE interviews
  • Navigate the career trajectory from junior RE to Senior/Staff
  • Evaluate whether RE is the right fit for your background

Self-Assessment: Where Are You Now?

Skill Area | 1 (Weak) | 3 (Moderate) | 5 (Strong) | Your Rating
Math (linear algebra, probability, optimization) | Can't do matrix multiplication | Comfortable with undergrad material | Derive proofs, read math-heavy papers | ___
PyTorch (implementing from scratch) | Used high-level APIs only | Implement standard architectures | Implement papers from scratch, custom autograd | ___
Paper reading | Never read an ML paper | Read 1-2 per month | Read 5+ per week, can critique methods | ___
Experiment design | No experience | Basic training loops | Design ablations, hyperparameter sweeps | ___
Distributed training | Never used multi-GPU | Basic DataParallel | FSDP, DeepSpeed, custom parallelism | ___
Coding (DSA) | Can't solve Easy | Solve Medium in 30 min | Solve Hard consistently | ___
Research taste | Don't know what's important | Follow trends | Can identify impactful research directions | ___

Part 1 - What a Research Engineer Actually Does

RE vs. Research Scientist vs. MLE

RE vs Adjacent Roles

Dimension | Research Scientist | Research Engineer | ML Engineer
Primary output | Papers, new ideas | Working implementations, experiment results | Production models, systems
Publishes papers | Yes (first author) | Sometimes (co-author) | Rarely
PhD required | Usually | Sometimes | No
Math depth | Very deep | Deep | Moderate
Engineering depth | Moderate | Very deep | Deep
Reads papers | Daily (writes them) | Daily (implements them) | Weekly-monthly
Typical employer | Frontier labs, universities | Frontier labs, research teams | Any tech company

60-Second Answer

"A Research Engineer turns research ideas into reality. Research Scientists propose new methods, but someone needs to implement those ideas in code, run experiments at scale, and make the training infrastructure work. That's me. I need the math to read papers and the engineering to make them run. I implement papers from scratch in PyTorch, optimize training across hundreds of GPUs, and design experiments to validate whether a new idea actually improves over baselines."

Where REs Work

Employer | Focus | Interview Style
DeepMind | Fundamental research, reasoning, safety | Academic-style, paper discussion heavy
OpenAI | Pre-training, alignment, product research | Coding-heavy, "implement this paper"
Anthropic | Safety, interpretability, alignment | Strong coding + research taste
Meta FAIR | Open research, vision, NLP | Academic + engineering hybrid
Google DeepMind | Foundation models, optimization | Strong coding bar, research depth
AI startups | Applied research for product | Full-stack: research to production

Common Trap

Research Engineer roles are rare - maybe 1,000–2,000 open RE positions globally at any time, compared to 50,000+ MLE openings. The bar is extremely high, and you'll compete against PhD candidates. If you don't have a strong math background and paper implementation experience, consider MLE first and transition later.

Part 2 - The RE Interview Loop

Typical Loop at Frontier Labs

[Diagram: RE Interview Loop]

Round-by-Round Breakdown

Paper Implementation Round (Signature Round)

You're given a paper (or section) and asked to implement the core algorithm in PyTorch. 45-60 minutes.

What they're testing: Can you read math and translate it to code? Can you identify the key contribution vs. boilerplate? Can you debug when output doesn't match?

BAD approach: Start coding immediately without understanding the math.

GOOD approach: Spend 10 minutes reading and annotating. Identify the core equation. Write down tensor shapes on paper. Implement step by step, verifying shapes. Write a simple test case.
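The shape-first habit described in the GOOD approach can be sketched as assertions written next to each step. This is a minimal illustration under stated assumptions, not code from any particular paper: `split_heads` and the tensor sizes are hypothetical, and plain self-attention logits stand in for whatever the paper's core equation turns out to be.

```python
import torch


def split_heads(x, num_heads):
    """Reshape (batch, seq_len, d_model) -> (batch, num_heads, seq_len, d_head)."""
    batch, seq_len, d_model = x.shape
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_head = d_model // num_heads
    return x.view(batch, seq_len, num_heads, d_head).transpose(1, 2)


# Hypothetical small sizes so every intermediate shape can be checked by hand.
batch, seq_len, d_model, num_heads = 2, 5, 16, 4
torch.manual_seed(0)
x = torch.randn(batch, seq_len, d_model)

q = split_heads(x, num_heads)
assert q.shape == (batch, num_heads, seq_len, d_model // num_heads)

# Step 1: pairwise logits between positions, one (seq_len x seq_len) grid per head.
scores = q @ q.transpose(-2, -1)
assert scores.shape == (batch, num_heads, seq_len, seq_len)

# Step 2: normalize over the key axis; each row should be a probability distribution.
weights = scores.softmax(dim=-1)
row_sums = weights.sum(dim=-1)
assert torch.allclose(row_sums, torch.ones_like(row_sums), atol=1e-6)
```

Each assertion doubles as the "simple test case": when one fails, the mismatch points at the step where the code diverged from the shapes annotated on paper.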

Research Discussion Round

Present a paper you've read deeply and discuss it critically.

What they're testing: Research taste. Can you explain why a paper matters? Can you identify limitations? Can you propose follow-up experiments?

Strong structure:

  1. What problem does this paper solve, and why does it matter?
  2. What's the core technical contribution?
  3. What are the key experiments, and are they convincing?
  4. What are the limitations?
  5. What would you do next if continuing this work?

Interviewer's Perspective

In the research discussion, I'm looking for taste. The strongest signal is when a candidate identifies a paper's limitations without being prompted and proposes concrete follow-up experiments. Anyone can summarize. The RE candidate should critique and extend.

Company Variations

  • OpenAI: Heavily coding-focused. May ask you to implement a full training loop from scratch. Speed matters.
  • Anthropic: Research taste is paramount. "What would you work on and why?" is a critical question.
  • DeepMind: Most academic. May include a whiteboard math round alongside coding.
  • Meta FAIR: Open research culture. Strong coding + paper discussion. Publication record helps but isn't required.

Part 3 - Career Trajectory

[Diagram: RE Career Ladder]

Transition Paths

From | Difficulty (to RE) | Key Advantage | Key Gap
PhD student | 🟢 Natural | Deep math, paper experience | Production engineering, coding speed
MLE | 🟡 Medium | Engineering skills, PyTorch | Paper reading, math depth, research taste
SWE | 🟠 Hard | Strong coding | Math, ML fundamentals, research context
New Grad (CS) | 🟡 Medium | Fresh knowledge | Research experience - need strong projects

Instant Rejection

Never say: "I want to be a Research Engineer because I want to work on cool AI stuff." Instead: "I want to be a Research Engineer because I love the cycle of reading a paper, understanding the math, implementing it, and seeing whether the improvements hold. I've implemented 15 papers this year on [specific area], and I want to do it full-time at a lab where the research matters."

Practice Problems

Problem 1: Paper Reading

Given this illustrative abstract (written for the exercise, not quoted from the paper): "We propose FlashAttention-3, which reduces the memory complexity of attention from O(N²) to O(N) by tiling the computation and using online softmax. We achieve 2.5x speedup over FlashAttention-2 on H100 GPUs."

What is the core technical challenge? What experiment would you run first to validate the claims?

Hint 1 - Direction

The core challenge is that standard attention materializes the full N×N score matrix in GPU HBM. That is memory-prohibitive for long sequences, and reading and writing the matrix makes the computation memory-bandwidth-bound.

Full Answer + Rubric

Strong answer: "Standard attention materializes an N×N attention matrix in HBM, which is (1) memory-prohibitive for long sequences and (2) bandwidth-bound. FlashAttention addresses this by tiling so that blocks of Q, K, and V fit in fast on-chip SRAM, then computing the output block by block with online softmax, so the full matrix is never stored. Tiling and online softmax already appear in earlier FlashAttention versions, so the new claim to validate here is the speedup over FA-2.

First experiment: reproduce the headline speedup. Benchmark FA-3 vs FA-2 on H100 at sequence lengths [1K, 4K, 16K, 64K] with d=128. Verify that the outputs match within floating-point tolerance. Then profile: is the speedup from better tiling, memory patterns, or hardware-specific optimization?"
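The online softmax mentioned in the strong answer can be illustrated for a single query row. The sketch below is a plain-Python illustration of the rescaling trick only, not the FlashAttention kernel: values are scalars instead of vectors, and `block_size`, the function names, and the sample numbers are assumptions made for the example.

```python
import math


def naive_attention_row(scores, values):
    """Reference: softmax over all scores at once, then a weighted sum of values."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return sum(e * v for e, v in zip(exps, values)) / total


def online_attention_row(scores, values, block_size=2):
    """Same result, but scores are consumed block by block with running statistics."""
    running_max = float("-inf")
    running_sum = 0.0  # sum of exp(score - running_max) seen so far
    acc = 0.0          # sum of exp(score - running_max) * value seen so far
    for start in range(0, len(scores), block_size):
        s_blk = scores[start:start + block_size]
        v_blk = values[start:start + block_size]
        new_max = max(running_max, max(s_blk))
        # Rescale the old statistics to the new maximum (exp(-inf) is 0.0 on the first block).
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc *= scale
        for s, v in zip(s_blk, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc += w * v
        running_max = new_max
    return acc / running_sum


scores = [0.5, 2.0, -1.0, 3.0, 0.0]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
assert abs(naive_attention_row(scores, values) - online_attention_row(scores, values)) < 1e-9
```

Only one block of scores is live at a time, which is why the full row (and, across rows, the full N×N matrix) never needs to be stored.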

Scoring:

  • Strong Hire: Explains the memory/compute trade-off, proposes concrete reproducibility experiment with specific parameters
  • Lean Hire: Understands high-level idea but vague experiment proposal
  • No Hire: Can't explain why attention is memory-intensive
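A "concrete reproducibility experiment" usually starts with a small timing harness. The following is a hedged sketch of that structure: `benchmark`, `speedup_table`, and the two toy functions are hypothetical names, and the toy workload (summing squares) merely stands in for the two attention kernels. On a GPU, kernels launch asynchronously, so a real harness would also need to synchronize the device (for example with `torch.cuda.synchronize()`) before reading the clock.

```python
import statistics
import time


def benchmark(fn, *args, warmup=3, iters=10):
    """Return the median wall-clock time of fn(*args) in seconds."""
    for _ in range(warmup):
        fn(*args)  # warm-up runs absorb one-time costs such as caching
    timings = []
    for _ in range(iters):
        start = time.perf_counter()
        fn(*args)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def speedup_table(baseline_fn, candidate_fn, make_inputs, seq_lens):
    """Map each sequence length to baseline_time / candidate_time."""
    results = {}
    for n in seq_lens:
        inputs = make_inputs(n)
        results[n] = benchmark(baseline_fn, *inputs) / benchmark(candidate_fn, *inputs)
    return results


# Toy stand-ins (assumptions): two ways of summing squares, not attention kernels.
def baseline(xs):
    total = 0
    for x in xs:
        total += x * x
    return total


def candidate(xs):
    return sum(x * x for x in xs)


# Correctness check first: a speedup is meaningless if the outputs differ.
assert baseline(list(range(100))) == candidate(list(range(100)))

table = speedup_table(baseline, candidate, lambda n: (list(range(n)),), [1_000, 4_000])
for n, ratio in table.items():
    print(f"n={n}: speedup {ratio:.2f}x")
```

Reporting the ratio per sequence length, rather than a single number, shows whether a claimed speedup holds across the range or only at one size.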

Problem 2: Implementation

Implement scaled dot-product attention in PyTorch from scratch. Input: Q, K, V tensors of shape (batch, heads, seq_len, d_k). Include mask support.

Hint 1 - Direction

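The formula is: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k) + mask) × V, where mask is additive: 0 at allowed positions and -inf at blocked ones. A 0/1 keep-mask is equivalent if you fill the blocked positions of the scores with -inf before the softmax.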

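Full Answer + Rubric

import torch
import torch.nn.functional as F


def scaled_dot_product_attention(Q, K, V, mask=None):
    # Q, K, V: (batch, heads, seq_len, d_k)
    # mask (optional): broadcastable to (batch, heads, seq_len, seq_len); 0 marks blocked positions
    d_k = Q.size(-1)
    # Divide by sqrt(d_k) so the logit magnitude does not grow with the head dimension
    scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)  # (batch, heads, seq_len, seq_len)
    if mask is not None:
        # -inf logits become exactly zero weights after the softmax
        scores = scores.masked_fill(mask == 0, float('-inf'))
    attn_weights = F.softmax(scores, dim=-1)  # each row sums to 1 over the keys
    output = torch.matmul(attn_weights, V)  # (batch, heads, seq_len, d_k)
    return output, attn_weights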
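One way to sanity-check the solution is a causal mask, where the expected behavior is easy to state. The snippet repeats the function so it runs on its own; the lower-triangular mask built with `torch.tril` and the tensor sizes are illustrative choices, not part of the problem statement.

```python
import torch
import torch.nn.functional as F


def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    attn_weights = F.softmax(scores, dim=-1)
    return torch.matmul(attn_weights, V), attn_weights


torch.manual_seed(0)
batch, heads, seq_len, d_k = 2, 3, 4, 8
Q, K, V = (torch.randn(batch, heads, seq_len, d_k) for _ in range(3))

# Lower-triangular mask: position i may attend to positions 0..i only.
causal_mask = torch.tril(torch.ones(seq_len, seq_len))
output, attn_weights = scaled_dot_product_attention(Q, K, V, mask=causal_mask)

assert output.shape == (batch, heads, seq_len, d_k)
# Future positions (strictly above the diagonal) must receive exactly zero weight.
assert torch.all(attn_weights.triu(diagonal=1) == 0)
# The first query can only see itself, so its output equals V at position 0.
assert torch.allclose(output[:, :, 0], V[:, :, 0])
```

Note that a row whose positions are all blocked would produce NaN weights, since the softmax of an all -inf row is undefined; a causal mask never creates such a row.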

Scoring:

  • Strong Hire: Correct implementation, explains scaling factor (prevents softmax saturation), handles mask with -inf, returns attention weights
  • Lean Hire: Correct but can't explain why we scale by sqrt(d_k)
  • No Hire: Gets matrix multiplication order wrong or doesn't understand masking
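The "prevents softmax saturation" explanation in the rubric can be checked numerically. When the components of q and k are independent with unit variance, the dot product q·k has variance d_k, so unscaled logits have a standard deviation of roughly sqrt(d_k). The sketch below uses plain Python with a fixed random seed; the sizes `d_k = 64` and `num_keys = 2000` are arbitrary assumptions for the demonstration.

```python
import math
import random
import statistics


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


rng = random.Random(0)
d_k, num_keys = 64, 2000

# One query and many keys, all with independent standard-normal components.
q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
keys = [[rng.gauss(0.0, 1.0) for _ in range(d_k)] for _ in range(num_keys)]

raw_logits = [dot(q, k) for k in keys]
scaled_logits = [s / math.sqrt(d_k) for s in raw_logits]

raw_std = statistics.pstdev(raw_logits)
scaled_std = statistics.pstdev(scaled_logits)
raw_peak = max(softmax(raw_logits))        # largest attention weight without scaling
scaled_peak = max(softmax(scaled_logits))  # largest attention weight with scaling

print(f"std of logits:  raw={raw_std:.2f}  scaled={scaled_std:.2f}")
print(f"largest weight: raw={raw_peak:.3f}  scaled={scaled_peak:.3f}")
```

Without scaling, one key tends to absorb most of the probability mass, which pushes the softmax into a region where gradients with respect to the other logits are close to zero.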

Problem 3: Experiment Design

You've implemented a new positional encoding scheme. How do you evaluate whether it's better than RoPE (Rotary Position Embeddings)?

Hint 1 - Direction

Think about controlled experiments: same model, same data, same hyperparameters, only change the positional encoding. What metrics matter? What sequence lengths should you test at?

Full Answer + Rubric

Strong answer:

  1. Controlled setup: Same transformer architecture, same training data, same hyperparameters. Only change: positional encoding (RoPE vs. yours).
  2. Training: Train both for the same number of tokens. Monitor training loss curves - do they converge to the same loss? Faster? Lower?
  3. Evaluation: Perplexity on a held-out test set at the trained context length. Then - critically - evaluate at longer context lengths than trained on (extrapolation). This is where positional encodings usually differ.
  4. Downstream tasks: Perplexity isn't everything. Test on tasks that specifically require position awareness: copying, retrieval from long context, code completion with long files.
  5. Ablations: What happens with different model sizes? Different sequence lengths during training? Is the improvement consistent or specific to one configuration?
  6. Compute cost: Is the new encoding more expensive to compute? Is it compatible with FlashAttention-style fused kernels? What is the wall-clock impact?
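The controlled setup in step 1 can be enforced in code rather than by convention. In this sketch, `RunConfig`, its fields, and its default values are all hypothetical placeholders; the point is only the check that two runs differ in exactly one field.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunConfig:
    # All fields and defaults are hypothetical placeholders for illustration.
    positional_encoding: str
    num_layers: int = 12
    d_model: int = 768
    train_tokens: int = 1_000_000_000
    context_length: int = 2048
    learning_rate: float = 3e-4
    seed: int = 0


def differing_fields(a, b):
    """Return the set of field names whose values differ between two configs."""
    a_dict, b_dict = asdict(a), asdict(b)
    return {name for name in a_dict if a_dict[name] != b_dict[name]}


baseline = RunConfig(positional_encoding="rope")
candidate = RunConfig(positional_encoding="new_encoding")

# The comparison is only controlled if the encoding is the single difference.
assert differing_fields(baseline, candidate) == {"positional_encoding"}
```

Running this check before launching training catches accidental confounds, such as a changed learning rate or seed, that would make the comparison uninterpretable.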

Scoring:

  • Strong Hire: Controlled experiment, tests extrapolation (key for positional encodings), considers downstream tasks and compute cost
  • Lean Hire: Good experiment design but misses extrapolation testing
  • No Hire: Only checks training loss without evaluation on held-out data
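The extrapolation test in the rubric reduces to comparing perplexity at the trained context length with perplexity at longer ones. Below is a minimal sketch assuming per-token negative log-likelihoods (in nats) have already been collected for each evaluation length; the function names are hypothetical, and the numbers are toy values for illustration only, not measurements from any model.

```python
import math


def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token), NLL in nats."""
    return math.exp(sum(token_nlls) / len(token_nlls))


def extrapolation_ratios(nlls_by_length, trained_length):
    """Perplexity at each context length divided by perplexity at the trained length."""
    base = perplexity(nlls_by_length[trained_length])
    return {n: perplexity(nlls) / base for n, nlls in nlls_by_length.items()}


# Toy per-token NLLs keyed by evaluation context length (illustrative only).
toy_nlls = {
    2048: [2.0, 2.0, 2.0],
    4096: [2.0, 2.5, 3.0],
}
ratios = extrapolation_ratios(toy_nlls, trained_length=2048)
for length, ratio in sorted(ratios.items()):
    print(f"context {length}: {ratio:.2f}x the trained-length perplexity")
```

A ratio that stays near 1 beyond the trained length suggests the encoding extrapolates; a ratio that climbs sharply suggests it does not. Computing the same table for RoPE and for the new encoding gives a direct comparison.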

Interview Cheat Sheet

Question Pattern | Framework | Key Phrases
"Implement this from the paper" | Read math → Identify shapes → Code step-by-step → Test | "Let me trace the tensor shapes first"
"Discuss a paper" | Problem → Contribution → Experiments → Limitations → Extensions | "The paper's strongest result is X, but I think the evaluation is limited because Y"
"What research excites you?" | Current state → Open problem → Your approach → Expected impact | "The gap I see is X, and I think combining Y and Z could address it"
"Scale this experiment" | Single GPU → Multi-GPU → Parallelism strategy → Bottleneck analysis | "The first bottleneck at scale will be X"

Spaced Repetition Checkpoints

  • Day 0: Read this page. Assess your math and implementation skills honestly.
  • Day 3: Pick a recent paper. Read it, identify the core contribution, implement it.
  • Day 7: Practice research discussion: present a paper to a friend in 10 minutes, then defend it.
  • Day 14: Implement multi-head attention from scratch. Then a full transformer encoder block.
  • Day 21: Time yourself: given a new paper section, implement the key algorithm in 45 minutes.

© 2026 EngineersOfAI. All rights reserved.