How to Read ML Papers - The 3-Pass Method for Interview Mastery

Reading time: ~35 min | Interview relevance: Critical | Roles: MLE, AI Eng, Research Engineer, Data Scientist

The Real Interview Moment

You are in a DeepMind interview. The hiring manager slides a printed paper across the table - one you have never seen before. "Take 20 minutes to read this, then walk me through it." You stare at 12 pages of dense mathematics, unfamiliar notation, and eight figures. Your heart rate spikes. Where do you even start?

Twenty minutes later, you deliver a clear, structured summary: the problem the authors were solving, why existing approaches fell short, the key insight, the core method (with the main equation explained), the headline result, and two limitations you identified. The interviewer smiles - this is exactly what she was looking for.

The difference between panic and clarity is not intelligence. It is method. Experienced researchers have a systematic approach to reading papers that extracts maximum understanding in minimum time. This chapter teaches you that method and shows you how to adapt it specifically for interview preparation.

What You Will Master

Apply the 3-pass reading method to any ML paper
Extract interview-relevant information in 30-60 minutes
Build a note-taking system optimized for recall under pressure
Prioritize your reading list based on role and target company
Handle the "read a paper on the spot" interview format
Retain paper knowledge using spaced repetition

Self-Assessment: Where Are You Now?

Skill	1 - Cannot	2 - Vaguely	3 - Can Do	4 - Efficient	5 - Expert	Your Score
Read a paper abstract and identify the key contribution						___
Skim a paper in 5-10 minutes for high-level understanding						___
Read a paper in 30-60 minutes for interview-depth understanding						___
Understand mathematical notation in ML papers						___
Identify limitations not mentioned by the authors						___
Connect a paper to prior and subsequent work						___
Summarize a paper in 60 seconds						___
Retain paper knowledge for weeks or months						___

Target: All 4s and 5s before your interview.

Part 1 - Why Most People Read Papers Wrong

The Linear Reading Trap

Most people read papers like novels: start at page 1, read every word sequentially, get bogged down in the related work section, struggle through the math, and give up somewhere in the experiments. An hour later, they can barely remember what the paper was about.

This is the worst possible approach. ML papers are not designed to be read linearly. They are structured documents with predictable sections, and different sections serve different purposes at different stages of understanding.

[Figure: Three-Pass Reading Method]

The Keshav Method

The 3-pass approach was formalized by Srinivasan Keshav in "How to Read a Paper" (2007) and has been adapted by ML researchers worldwide. The core insight is that each pass has a specific goal, and you should decide after each pass whether the next pass is worth your time.

60-Second Answer

"I use a three-pass method for reading papers. The first pass takes five minutes - I read the title, the abstract, and the first and last paragraphs of the introduction, then scan the figures. This tells me what the paper does and whether it is worth a deeper read. The second pass takes thirty minutes - I read the full introduction, methods, and experiments while skipping proofs. This gives me interview-level understanding. The third pass takes one to two hours - I mentally reproduce the paper, verify the math, and identify gaps. I only do pass three for papers central to my work."

Part 2 - The First Pass (5 Minutes)

Goal: Decide Whether to Keep Reading

The first pass answers five questions:

What problem does this paper solve?
What is the claimed contribution?
Is the paper relevant to me?
Is it from a credible source?
Is it worth a second pass?

What to Read

Read these elements in this order:

Element	Time	What You Extract
Title	10 sec	Topic and claimed contribution
Abstract	1 min	Problem, method, key result
Introduction (first 2 paragraphs)	1 min	Problem context, motivation
Introduction (last paragraph)	30 sec	Contribution summary, paper outline
Section headings	30 sec	Paper structure, method name
All figures and captions	1 min	Architecture diagrams, result plots
Conclusion (first paragraph)	30 sec	Summary and main takeaway
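
As a quick sanity check, the per-element times in the table can be added up to confirm they fit the 5-minute budget. The sketch below simply restates the table; the variable names are illustrative.

```python
# Per-element time budget for the first pass, in seconds (copied from the table above)
first_pass_seconds = {
    "Title": 10,
    "Abstract": 60,
    "Introduction (first 2 paragraphs)": 60,
    "Introduction (last paragraph)": 30,
    "Section headings": 30,
    "All figures and captions": 60,
    "Conclusion (first paragraph)": 30,
}

total_seconds = sum(first_pass_seconds.values())
minutes, seconds = divmod(total_seconds, 60)
print(f"First pass total: {minutes} min {seconds} s")

# Leftover time before the 5-minute mark, usable for the keep-reading decision
slack_seconds = 5 * 60 - total_seconds
print(f"Slack: {slack_seconds} s")
```

The budget leaves a few seconds of slack, which is roughly the time needed to write down the decision.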

What to Skip

In the first pass, explicitly skip:

Related work section (it is written for reviewers, not for you)
Mathematical derivations
Experimental details
Appendices

First Pass Example: "Attention Is All You Need"

Here is what a first pass of the Transformer paper would yield:

Title: "Attention Is All You Need"
→ Bold claim. Suggests attention alone is sufficient (no RNNs, no CNNs).

Abstract:
→ New architecture called "Transformer" based solely on attention mechanisms
→ Achieves 28.4 BLEU on English-to-German translation (new SOTA)
→ Trains in 3.5 days on 8 GPUs (much faster than existing models)

Introduction (first paragraphs):
→ Recurrent models are the dominant approach for sequence modeling
→ Sequential nature prevents parallelization within training examples
→ Attention has been used with RNNs but always alongside recurrence

Introduction (last paragraph):
→ Transformer relies entirely on attention to draw global dependencies
→ Achieves new SOTA on translation with significantly less training time

Figures:
→ Figure 1: Architecture diagram - encoder-decoder with multi-head attention
→ Figure 2: Scaled dot-product attention mechanism

Conclusion:
→ First sequence transduction model based entirely on attention
→ Trains significantly faster than recurrent architectures

DECISION: Definitely worth a second pass. This is a foundational paper.

Common Trap

Do not skip the figures. In ML papers, figures often carry more information than the surrounding text. An architecture diagram can convey most of a method's structure in 30 seconds. Many interviewers will ask you to draw the architecture - knowing the figure is essential.

Part 3 - The Second Pass (30 Minutes)

Goal: Interview-Level Understanding

The second pass is where you build the understanding needed for most interview situations. After this pass, you should be able to:

Explain the paper to a colleague in 5 minutes
Answer "what" and "why" questions about the method
Describe the main results with approximate numbers
State 2-3 limitations

What to Read

Section	Time	Reading Strategy
Full Introduction	5 min	Understand the problem deeply. Note what specific prior work limitations the paper addresses.
Method Section	12 min	Read carefully. Understand the architecture or algorithm at a conceptual level. Note key equations but do not derive them yet. Focus on "why this design choice?"
Experiments: Setup	3 min	Note the datasets, baselines, and metrics. This tells you how the claims are evaluated.
Experiments: Main Results	5 min	Focus on the main comparison tables. How much better than baselines? On which metrics?
Experiments: Ablations	3 min	These are gold for interviews. Ablations show which components matter and why.
Conclusion + Limitations	2 min	Note what the authors themselves identify as future work.

The "Why" Notebook

During the second pass, maintain a running list of "why" questions. For every design choice, ask yourself: "Why did the authors do it this way?" If you can answer from the paper, great. If you cannot, that is a question to investigate.

# Example "why" notebook for the Transformer paper:

why_questions = {
    "Why scaled dot-product instead of additive attention?":
        "Dot-product is faster (optimized matrix multiply). "
        "Scaling by sqrt(d_k) prevents softmax saturation at large dimensions.",

    "Why multi-head instead of single large attention?":
        "Multiple heads let the model attend to information from different "
        "representation subspaces. Like having multiple 'perspectives'.",

    "Why sinusoidal positional encoding?":
        "Deterministic (no learned parameters), and PE(pos+k) can be represented "
        "as a linear function of PE(pos). The authors hypothesize that it may "
        "extrapolate to sequences longer than those seen in training.",

    "Why 6 layers in both encoder and decoder?":
        "Ablation in Table 3 shows diminishing returns beyond 6. "
        "This was likely a sweet spot for the WMT translation task.",

    "Why label smoothing of 0.1?":
        "Hurts perplexity but improves BLEU. Prevents overconfident predictions. "
        "The model learns a softer distribution over the vocabulary.",

    "Why warmup + inverse sqrt learning rate schedule?":
        "Warmup prevents divergence in early training when parameters are random. "
        "Decay prevents oscillation later when the model is near convergence.",
}
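
One of the answers above, that PE(pos+k) is a linear function of PE(pos), is easy to verify numerically. The following is a minimal sketch, assuming the paper's definition of the encoding (sine on even dimensions, cosine on odd dimensions, wavelengths based on 10000). The function names `sinusoidal_pe` and `shift_matrix` are hypothetical helpers, not from the paper.

```python
import numpy as np

def sinusoidal_pe(num_positions, d_model):
    """Sinusoidal positional encodings: sin on even dims, cos on odd dims."""
    positions = np.arange(num_positions)[:, None]
    freqs = 1.0 / (10000 ** (np.arange(0, d_model, 2) / d_model))
    angles = positions * freqs
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe, freqs

def shift_matrix(freqs, k):
    """Block-diagonal matrix M with M @ PE(pos) = PE(pos + k) for every pos."""
    d_model = 2 * len(freqs)
    m = np.zeros((d_model, d_model))
    for i, w in enumerate(freqs):
        c, s = np.cos(w * k), np.sin(w * k)
        # Angle-addition identities for each (sin, cos) pair
        m[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[c, s], [-s, c]]
    return m

pe, freqs = sinusoidal_pe(num_positions=50, d_model=16)
M = shift_matrix(freqs, k=5)

# The same matrix M maps every position to the position 5 steps later
shift_ok = np.allclose(pe[:45] @ M.T, pe[5:50])
print(f"PE(pos + 5) == M @ PE(pos) for all pos: {shift_ok}")
```

Note that `M` depends only on the offset `k`, not on `pos`, which is what makes relative offsets easy to express.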

Second Pass Note Card

After your second pass, fill in the interview note card:

Paper: Attention Is All You Need
Authors: Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, Polosukhin
Year: 2017
Venue: NeurIPS 2017

ONE-SENTENCE SUMMARY:
The Transformer replaces recurrence entirely with self-attention,
achieving SOTA translation quality with massively parallel training.

PROBLEM:
RNN-based sequence models are inherently sequential, preventing
parallelization and limiting practical training on long sequences.

KEY INSIGHT:
Self-attention can capture all pairwise dependencies in a sequence
simultaneously, trading sequential depth for parallel breadth.

METHOD:
- Encoder-decoder architecture with 6 layers each
- Multi-head self-attention with scaled dot-product
- Position-wise feed-forward networks
- Sinusoidal positional encoding
- Residual connections + layer normalization

RESULTS:
- 28.4 BLEU on EN-DE (SOTA by 2+ BLEU)
- 41.8 BLEU on EN-FR (SOTA)
- Trains in 3.5 days on 8 P100 GPUs (fraction of prior cost)

LIMITATIONS:
- O(n^2) memory in sequence length
- Positional encoding does not generalize well to longer sequences
- Fixed context window

FOLLOW-UP:
- BERT, GPT, T5, and essentially all modern LLMs
- Efficient attention variants (Linformer, Performer, Flash Attention)
- Better positional encodings (RoPE, ALiBi)

MY OPINION:
The paper's biggest insight is not attention itself (which existed)
but the courage to remove recurrence entirely. The ablation study
is excellent - Table 3 is one of the best ablation tables in ML.
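
The O(n^2) limitation on the note card becomes concrete with quick arithmetic. The sketch below assumes float32 scores and one fully materialized n x n score matrix per head, and ignores batch size and all other activations. The helper name `attention_matrix_bytes` is hypothetical.

```python
def attention_matrix_bytes(seq_len, bytes_per_element=4, num_heads=1):
    """Bytes needed to store the n x n attention score matrix (float32 by default)."""
    return num_heads * seq_len * seq_len * bytes_per_element

for n in (512, 2048, 8192):
    mib = attention_matrix_bytes(n) / 2**20
    print(f"n={n}: {mib:.0f} MiB per head")
# n=512: 1 MiB per head
# n=2048: 16 MiB per head
# n=8192: 256 MiB per head
```

Multiplying the sequence length by 16 multiplies this memory by 256, which is the pressure that motivates the efficient attention variants listed under FOLLOW-UP.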

Instant Rejection

If you claim to have read a paper but cannot name the main baseline it was compared against, or cannot recall whether the improvement was 2% or 20%, the interviewer will conclude you are bluffing. Always note specific numbers from the results section.

Part 4 - The Third Pass (1-2 Hours)

Goal: Deep Mastery (Research Roles Only)

The third pass is needed only for papers that are central to your work or for research engineer interviews where you will be grilled on every detail. In this pass, you mentally reproduce the paper.

The Reproduction Mindset

Ask yourself: "If I had to re-create this paper from scratch - with the same problem statement but no knowledge of their solution - could I arrive at a similar approach?"

This is the deepest form of understanding. It forces you to:

Verify every assumption. Why is this loss function appropriate? What happens if you change it?
Check the math. Derive the key equations yourself. Do the dimensions work out?
Question the experiments. Are the baselines fair? Are the datasets representative? Are the improvements statistically significant?
Identify gaps. What experiments are missing? What assumptions might not hold?

Third Pass Checklist

Example: Third-Pass Analysis of Self-Attention

During a third pass of the Transformer paper, you would derive the attention computation:

$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$

Dimensional analysis:

$Q \in \mathbb{R}^{n \times d_k}$ ($n$ tokens, each projected to dimension $d_k$)
$K \in \mathbb{R}^{n \times d_k}$
$V \in \mathbb{R}^{n \times d_v}$
$QK^T \in \mathbb{R}^{n \times n}$ (pairwise similarity matrix)
Softmax normalizes each row to sum to 1
Output $\in \mathbb{R}^{n \times d_v}$ (weighted combination of values)
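
The dimensional analysis above can be checked in a few lines of NumPy. This is a minimal single-head sketch with no batching and no masking. The function name `scaled_dot_product_attention` and the sizes are illustrative assumptions.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for a single head, without masking."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # (n, n)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)        # each row sums to 1
    return weights @ V, weights                           # (n, d_v), (n, n)

rng = np.random.default_rng(0)
n, d_k, d_v = 5, 64, 32
Q = rng.standard_normal((n, d_k))
K = rng.standard_normal((n, d_k))
V = rng.standard_normal((n, d_v))

output, weights = scaled_dot_product_attention(Q, K, V)
print(weights.shape)  # (5, 5)
print(output.shape)   # (5, 32)
```

Writing the shapes as comments next to each operation is a useful habit during a third pass: a shape mismatch usually means a misread equation.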

Why the scaling factor $\sqrt{d_k}$?

Without scaling, the dot products grow with $d_k$. If $q$ and $k$ have independent components with mean 0 and variance 1, then $q \cdot k = \sum_{i=1}^{d_k} q_i k_i$ has variance $d_k$. For $d_k = 64$, this means dot products are on the order of $\pm 8$, which pushes the softmax into saturation regions where gradients vanish.

import numpy as np

# Demonstration: why scaling matters
rng = np.random.default_rng(0)  # Fixed seed so the run is reproducible
d_k = 64
num_samples = 10_000

# Draw many (q, k) pairs with independent N(0, 1) components
q = rng.standard_normal((num_samples, d_k))
k = rng.standard_normal((num_samples, d_k))

# Without scaling: the standard deviation of q . k is about sqrt(d_k) = 8
raw_dots = np.sum(q * k, axis=1)
print(f"Std of raw dot products: {raw_dots.std():.2f}")

# With scaling: the standard deviation is about 1
scaled_dots = raw_dots / np.sqrt(d_k)
print(f"Std of scaled dot products: {scaled_dots.std():.2f}")

def softmax(scores):
    shifted = scores - scores.max(axis=-1, keepdims=True)  # Numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Effect on softmax over 10 keys, averaged over all sampled rows
scores_unscaled = rng.standard_normal((num_samples, 10)) * np.sqrt(d_k)  # Simulating unscaled
scores_scaled = scores_unscaled / np.sqrt(d_k)

print(f"Mean max attention weight (unscaled): {softmax(scores_unscaled).max(axis=1).mean():.4f}")
print(f"Mean max attention weight (scaled): {softmax(scores_scaled).max(axis=1).mean():.4f}")
# Unscaled attention is much closer to one-hot, where softmax gradients are tiny

Critical question an interviewer might ask: "Could you use a different scaling factor?"

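Yes. The $\sqrt{d_k}$ factor normalizes the variance of the dot product under the assumption that the components of $q$ and $k$ are independent with mean 0 and variance 1; no Gaussian assumption is required. If your inputs have a different variance (e.g., after certain activations), the variance-normalizing scale changes. In practice, some architectures learn the temperature parameter instead of fixing it.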
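
To make the "different variance" case concrete: for independent zero-mean components with standard deviations $\sigma_q$ and $\sigma_k$, each product $q_i k_i$ has variance $\sigma_q^2 \sigma_k^2$, so the dot product has standard deviation $\sqrt{d_k}\,\sigma_q \sigma_k$. The sketch below checks this empirically. The helper `variance_normalizing_scale` is a hypothetical name used only for illustration, not something from the paper.

```python
import numpy as np

def variance_normalizing_scale(d_k, sigma_q=1.0, sigma_k=1.0):
    """Divisor that gives q . k unit variance for independent zero-mean components."""
    return np.sqrt(d_k) * sigma_q * sigma_k

rng = np.random.default_rng(0)
d_k, sigma_q, sigma_k = 64, 2.0, 2.0
q = rng.standard_normal((20_000, d_k)) * sigma_q
k = rng.standard_normal((20_000, d_k)) * sigma_k
dots = np.sum(q * k, axis=1)

# Dividing by sqrt(d_k) alone leaves a std of about sigma_q * sigma_k = 4
std_sqrt_only = (dots / np.sqrt(d_k)).std()
# Dividing by the variance-aware scale brings the std back to about 1
std_general = (dots / variance_normalizing_scale(d_k, sigma_q, sigma_k)).std()

print(f"Std with sqrt(d_k) only: {std_sqrt_only:.2f}")
print(f"Std with variance-aware scale: {std_general:.2f}")
```

This is one way to motivate a learned temperature: the model can adapt the scale when the input statistics drift away from the unit-variance assumption.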

Part 5 - Reading Strategies for Different Situations

Situation 1: Candidate-Chosen Paper (Prepare in Advance)

You choose the paper and have days or weeks to prepare.

Strategy: Do all 3 passes. Build a complete note card. Practice presenting it out loud at least 3 times. Anticipate follow-up questions. Read 2-3 papers that cite it.

Time investment: 4-5 hours total over several sessions.

Situation 2: Assigned Paper (24-48 Hours Notice)

The company gives you a specific paper to present.

Strategy: Do passes 1 and 2 immediately. Do a targeted pass 3 focusing on the method and key equations. Write a presentation outline. Practice once.

Time investment: 2-3 hours.

Situation 3: On-the-Spot Paper (20 Minutes)

The interviewer hands you a paper during the interview.

Strategy: Do pass 1 in 3 minutes. Spend 12 minutes on a targeted pass 2 (introduction + method + one key result table). Spend 5 minutes organizing your thoughts. Present in the structure: problem, insight, method, result, limitation.

Time investment: 20 minutes.

# Time allocation for on-the-spot reading (20 minutes total)
time_allocation = {
    "Pass 1: Title, abstract, figures, conclusion": 3,
    "Introduction (full)": 3,
    "Method (key idea + main equation)": 6,
    "Results (main table only)": 3,
    "Organize thoughts + prepare 2 limitations": 5,
}

total = sum(time_allocation.values())
print(f"Total: {total} minutes")

for task, minutes in time_allocation.items():
    pct = (minutes / total) * 100
    print(f"  {task}: {minutes} min ({pct:.0f}%)")

Company Variation

Google and Meta typically give you 20-30 minutes to read an assigned paper during the interview. Anthropic and OpenAI are more likely to ask you to present a paper you have chosen. Hedge funds (Two Sigma, Citadel) sometimes send the paper 24 hours in advance with specific questions to prepare. Always ask your recruiter about the exact format.

Part 6 - Building Your Reading List

Priority Framework

Not all papers are equally important for interviews. Use this framework to prioritize:

[Figure: Paper Priority Framework]

The Canon by Role

All Roles (Must-Know):

#	Paper	Year	Why It Matters
1	Attention Is All You Need	2017	Foundation of modern NLP/LLMs
2	BERT	2018	Pre-training + fine-tuning paradigm
3	GPT-3	2020	In-context learning, scaling
4	ResNet	2015	Skip connections, deep networks
5	Batch Normalization	2015	Training stability
6	Adam	2014	Standard optimizer
7	Dropout	2014	Regularization in neural nets

MLE Additional Reading:

#	Paper	Year	Why It Matters
8	Word2Vec	2013	Embeddings, representation learning
9	GloVe	2014	Matrix factorization view of embeddings
10	Sequence to Sequence with Attention	2015	Attention mechanism origin
11	XGBoost	2016	Gradient boosting at scale
12	Neural Architecture Search	2017	AutoML foundations

AI Engineer Additional Reading:

#	Paper	Year	Why It Matters
8	InstructGPT / RLHF	2022	Alignment, instruction following
9	LoRA	2021	Efficient fine-tuning
10	RAG	2020	Retrieval-augmented generation
11	Chain-of-Thought Prompting	2022	Reasoning in LLMs
12	Constitutional AI	2022	AI safety, RLAIF

Research Engineer Additional Reading:

#	Paper	Year	Why It Matters
8	Scaling Laws (Kaplan)	2020	Understanding scale
9	Chinchilla	2022	Compute-optimal training
10	Denoising Diffusion (DDPM)	2020	Generative models
11	Vision Transformer (ViT)	2020	Transformers beyond NLP
12	Mixture of Experts	2017/2022	Sparse computation, scaling

Part 7 - Note-Taking Systems for Retention

The Spaced Repetition Approach

Reading a paper once is not enough. Without review, most of the details fade within a week. Here is a system that supports retention:

After Pass 2 - Create your note card (the template from the Overview chapter).

Day 1 after reading: Review the note card. Can you explain the paper in 60 seconds without looking? If not, re-read the sections you have forgotten.

Day 3: Review again. Practice explaining the paper out loud to an imaginary interviewer.

Day 7: Review once more. By now, the core should be solid. Note any remaining weak spots.

Day 14: Final review. At this point, the paper should be firmly in long-term memory.
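
The review schedule above is simple enough to generate automatically. This is a minimal sketch using only the standard library. The function name `review_schedule` and the example date are illustrative assumptions.

```python
from datetime import date, timedelta

# Review offsets in days after the reading day, as listed above
REVIEW_OFFSETS_DAYS = (1, 3, 7, 14)

def review_schedule(read_on):
    """Return the calendar dates on which to review a paper read on `read_on`."""
    return [read_on + timedelta(days=offset) for offset in REVIEW_OFFSETS_DAYS]

schedule = review_schedule(date(2025, 1, 6))
for offset, review_date in zip(REVIEW_OFFSETS_DAYS, schedule):
    print(f"Day {offset}: {review_date.isoformat()}")
```

Putting these dates in a calendar when you finish the note card removes the need to remember when each review is due.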

The Connection Map

For maximum interview impact, connect each paper to others you have read:

# Build a mental graph of paper connections
paper_connections = {
    "Transformer": {
        "builds_on": ["Bahdanau Attention", "Seq2Seq", "Layer Normalization"],
        "led_to": ["BERT", "GPT", "T5", "ViT"],
        "shares_ideas_with": ["Self-attention in images (Non-local Neural Networks)"],
        "key_technique_used_by": ["Every modern LLM"],
    },
    "BERT": {
        "builds_on": ["Transformer (encoder only)", "ELMo", "Semi-supervised learning"],
        "led_to": ["RoBERTa", "ALBERT", "DeBERTa", "SpanBERT"],
        "contrasts_with": ["GPT (autoregressive vs masked)"],
        "key_technique_used_by": ["Search engines, classification systems"],
    },
    "ResNet": {
        "builds_on": ["VGG (depth matters)", "Highway Networks (gating)"],
        "led_to": ["DenseNet", "ResNeXt", "EfficientNet"],
        "shares_ideas_with": ["LSTM gates (gradient flow)", "Transformer residual connections"],
        "key_insight": "Making identity mapping easy enables very deep networks",
    },
}

# In an interview, these connections show breadth
# "ResNet's skip connections are conceptually similar to LSTM gates -
#  both solve the vanishing gradient problem by providing shortcut paths
#  for gradient flow. This same idea appears in the Transformer as
#  residual connections around each attention and FFN sublayer."
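
Once the connections are stored as data, they can also be queried. The sketch below keeps only the `led_to` edges from the map above, inverts them, and walks back from a paper to its ancestors. The helper names `invert_edges` and `lineage` are hypothetical, and the graph is a small illustrative subset rather than a complete citation graph.

```python
from collections import defaultdict

# "led_to" edges copied from the connection map above
led_to = {
    "Transformer": ["BERT", "GPT", "T5", "ViT"],
    "BERT": ["RoBERTa", "ALBERT", "DeBERTa", "SpanBERT"],
    "ResNet": ["DenseNet", "ResNeXt", "EfficientNet"],
}

def invert_edges(edges):
    """Turn parent -> children edges into child -> parents edges."""
    inverted = defaultdict(list)
    for source, targets in edges.items():
        for target in targets:
            inverted[target].append(source)
    return dict(inverted)

def lineage(paper, parents):
    """Walk parent links breadth-first; returns ancestors, nearest first."""
    ancestors = []
    frontier = list(parents.get(paper, []))
    while frontier:
        current = frontier.pop(0)
        if current not in ancestors:
            ancestors.append(current)
            frontier.extend(parents.get(current, []))
    return ancestors

parents = invert_edges(led_to)
print(lineage("RoBERTa", parents))  # ['BERT', 'Transformer']
```

Answering "where does this paper come from?" in one breath, for example "RoBERTa refines BERT, which is the Transformer encoder", is the kind of breadth signal the connection map is meant to support.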

The One-Page Summary

For each paper, create a one-page summary that you can review in 2 minutes. This should include:

The Hook (1 sentence): What makes this paper important
The Problem (2 sentences): What was broken before
The Insight (1 sentence): The core innovation
The Method (3-5 bullets): How it works
The Key Equation: The one equation you must know
The Main Result (1 sentence with numbers): How much better
The Limitations (2-3 bullets): What does not work
The Legacy (1 sentence): What it led to
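
If you keep summaries digitally, the eight sections above map naturally onto a small record type that can report which sections are still empty. This is a sketch under that assumption. The class name `OnePageSummary` and its field names are illustrative, and the example text is taken from the note card earlier in this chapter.

```python
from dataclasses import dataclass, field, fields

@dataclass
class OnePageSummary:
    hook: str = ""
    problem: str = ""
    insight: str = ""
    method: list = field(default_factory=list)
    key_equation: str = ""
    main_result: str = ""
    limitations: list = field(default_factory=list)
    legacy: str = ""

    def missing_sections(self):
        """Names of sections that are still empty, in template order."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

summary = OnePageSummary(
    hook="Replaces recurrence entirely with self-attention.",
    problem="RNN-based sequence models are inherently sequential.",
    insight="Self-attention captures all pairwise dependencies in parallel.",
    method=["Encoder-decoder, 6 layers each", "Multi-head attention", "Sinusoidal PE"],
    main_result="28.4 BLEU on EN-DE.",
    limitations=["O(n^2) memory in sequence length"],
)
print(summary.missing_sections())  # ['key_equation', 'legacy']
```

A non-empty result from `missing_sections()` is a prompt to go back to the paper before the next review session.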

Part 8 - Common Mathematical Notation in ML Papers

Many candidates struggle with papers because the notation is unfamiliar. Here is a reference:

Notation	Meaning	Example
$\mathbb{R}^{n \times d}$	Real-valued matrix, $n$ rows, $d$ columns	Weight matrix
$\|x\|_2$	L2 norm (Euclidean length)	Regularization
$\nabla_\theta \mathcal{L}$	Gradient of loss $\mathcal{L}$ with respect to parameters $\theta$	Backpropagation
$\mathbb{E}_{x \sim p}[f(x)]$	Expected value of $f(x)$ when $x$ is drawn from distribution $p$	Loss functions
$\text{KL}(p \| q)$	KL divergence between distributions $p$ and $q$	VAEs, RLHF
$\sigma(\cdot)$	Sigmoid function	Gating mechanisms
$\text{softmax}(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$	Softmax function	Attention, classification
$\mathcal{O}(\cdot)$	Big-O notation (computational complexity)	Efficiency analysis
$\odot$	Element-wise (Hadamard) product	Gating in LSTMs
$\otimes$	Outer product or Kronecker product	Attention patterns
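
It often helps to map each symbol to the code that computes it. The sketch below pairs several rows of the table with NumPy equivalents. The helper functions are illustrative, and the KL divergence helper assumes discrete distributions with strictly positive entries.

```python
import numpy as np

x = np.array([3.0, 4.0])
l2_norm = np.linalg.norm(x)               # ||x||_2

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))       # sigma(z)

def softmax(z):
    e = np.exp(z - np.max(z))             # subtract the max for numerical stability
    return e / e.sum()                    # softmax(z)_i

def kl_divergence(p, q):
    # KL(p || q) for discrete distributions with strictly positive entries
    return np.sum(p * np.log(p / q))

a = np.array([1.0, 2.0])
b = np.array([3.0, 4.0])
hadamard = a * b                          # element-wise product
outer = np.outer(a, b)                    # outer product, shape (2, 2)
kron = np.kron(np.eye(2), outer)          # Kronecker product, shape (4, 4)

p = np.array([0.5, 0.5])
q = np.array([0.25, 0.75])
print(l2_norm, sigmoid(0.0), softmax(np.array([1.0, 2.0, 3.0])).sum())
print(hadamard, outer.shape, kron.shape)
print(kl_divergence(p, q) > 0, np.isclose(kl_divergence(p, p), 0.0))
```

Because $\otimes$ is overloaded in papers, check the shapes of the operands to tell whether the author means `np.outer` or `np.kron`.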

Reading Equations: A Step-by-Step Approach

When you encounter an equation in a paper:

Identify the output. What is on the left side of the equals sign?
Identify the inputs. What variables appear on the right side?
Check dimensions. What shape is each tensor?
Understand each operation. What does each function or operator do?
Build intuition. What is this equation computing, in plain English?

Example: The attention equation:

$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$

Output: A matrix of attention-weighted representations
Inputs: Query matrix $Q$, Key matrix $K$, Value matrix $V$
Dimensions: $Q$ is $n \times d_k$, $K$ is $n \times d_k$, $V$ is $n \times d_v$, output is $n \times d_v$
Operations: Matrix multiply ($QK^T$), scale ($\div \sqrt{d_k}$), normalize (softmax), weight ($\times V$)
Intuition: For each position, compute how much to attend to every position in the sequence, itself included (via $QK^T$), normalize these attention weights, then take a weighted sum of the value vectors

Part 9 - Handling the On-the-Spot Paper Read

The 20-Minute Protocol

When handed a paper you have never seen:

Minutes 0-3: First Pass

Read title and abstract
Scan all figures (this alone can reveal much of the method)
Read first and last paragraphs of introduction
Read first paragraph of conclusion

Minutes 3-15: Targeted Second Pass

Read the full introduction (understand the problem deeply)
Read the method section, focusing on the main algorithm or architecture
Identify the ONE key equation or diagram
Read the main results table (just the headline numbers)

Minutes 15-20: Organize

Write a one-sentence summary
List the key contribution (1 bullet)
List the method (3 bullets)
List the main result (1 bullet with a number)
Identify 2 limitations or questions

What Interviewers Are Looking For

In on-the-spot paper reads, interviewers evaluate:

Skill	How They Test It
Efficient reading	Can you extract the key ideas in 20 minutes?
Structured thinking	Do you present findings in a logical order?
Technical intuition	Can you understand the method even without following every detail?
Critical thinking	Can you identify at least one limitation or questionable assumption?
Intellectual honesty	Do you clearly state what you understood vs. what you did not?

60-Second Answer

When presenting an on-the-spot paper, always start with: "I had 20 minutes, so let me share what I was able to extract. The paper addresses [problem]. The key insight is [insight]. The method works by [2-3 sentences]. The main result is [number]. Two things I would want to dig into further are [limitation 1] and [limitation 2]."

This framing manages expectations while demonstrating competence.

Practice Problems

Problem 1: First-Pass Exercise

Choose any paper from arXiv (cs.LG) published this week. Set a 5-minute timer. After the timer, write down: (1) what problem it solves, (2) what the claimed contribution is, (3) whether you would do a second pass.

Hint

Focus on the abstract and figures. Do not get pulled into the text. If the abstract mentions a specific metric improvement, note the number. If there is an architecture diagram, spend 60 seconds studying it.

Problem 2: On-the-Spot Simulation

Have a friend select a paper from the NeurIPS 2024 proceedings. Give yourself 20 minutes to read it, then present it in 5 minutes. Record yourself and review.

Hint

The most common mistake is spending too long on the introduction and not reaching the method. Force yourself to start reading the method section by minute 5, no matter what.

Problem 3: Connection Building

Take three papers you have already read (e.g., Transformer, BERT, GPT-3). For each pair, identify: (1) what they share, (2) how they differ, (3) how one builds on the other.

Hint

The Transformer provides the architecture. BERT takes the encoder and trains it bidirectionally with MLM. GPT takes the decoder and trains it autoregressively. Both BERT and GPT build on the Transformer but make opposite architectural choices (bidirectional vs. unidirectional), which determines their downstream use cases (understanding vs. generation).
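
The bidirectional vs. unidirectional distinction in this hint comes down to the attention mask. The sketch below is deliberately simplified: it uses uniform scores and a single head, and it ignores the differences in training objective. The function name `attention_weights` is a hypothetical helper.

```python
import numpy as np

def attention_weights(scores, mask):
    """Row-wise softmax over the positions where mask is True."""
    masked = np.where(mask, scores, -np.inf)
    masked = masked - masked.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(masked)                                   # exp(-inf) is exactly 0
    return exp / exp.sum(axis=-1, keepdims=True)

n = 4
scores = np.zeros((n, n))  # uniform scores, so only the mask shapes the result

# Encoder-style (BERT): every token can attend to every token
bidirectional_mask = np.ones((n, n), dtype=bool)
# Decoder-style (GPT): token i can attend only to tokens 0..i
causal_mask = np.tril(np.ones((n, n), dtype=bool))

bidirectional = attention_weights(scores, bidirectional_mask)
causal = attention_weights(scores, causal_mask)

print(bidirectional[1])  # [0.25 0.25 0.25 0.25]
print(causal[1])         # [0.5 0.5 0.  0. ]
```

The causal mask is what lets a decoder-only model be trained to predict the next token without seeing it, while the full mask lets an encoder use context from both sides.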

Problem 4: Limitation Identification

Read the abstract of a paper and, before reading the rest, write down 3 potential limitations based only on what you know about the problem domain. Then read the paper and see if the authors addressed any of your concerns.

Hint

Common limitation categories: scalability (does it work at larger scales?), generalization (does it work on other datasets/domains?), computational cost (how expensive is it?), assumptions (what might not hold in practice?), and evaluation (are the benchmarks representative?).

Problem 5: Note Card from Memory

Read a paper today using the 3-pass method. Tomorrow, without re-reading, write a complete note card from memory. Compare it to your original notes. Where are the gaps?

Hint

Most people forget the specific numbers (BLEU score, accuracy improvement) and the ablation results first. These are the details that interviewers use to test whether you actually read the paper vs. read a blog post about it.

Interview Cheat Sheet

Situation	Strategy	Time
"Tell me about a paper"	Present your best-prepared paper using the 7-step skeleton	5-10 min
"Here, read this paper"	Use the 20-minute protocol: first pass (3 min), targeted second pass (12 min), organize (5 min)	20 min
"Have you read paper X?" (and you have)	Start with the one-sentence summary, then go to the problem and key insight	2-5 min
"Have you read paper X?" (and you have not)	Be honest. Say you have not, but connect to what you do know	1 min
"What papers have you read recently?"	Have 3 papers ready: 1 classic, 1 relevant to the role, 1 recent	2 min per paper
"What is the key equation?"	Write it, explain each term, explain why each design choice was made	3-5 min
"What are the limitations?"	State 2-3, propose how you might address each	2-3 min
"How would you extend this work?"	Give 1-2 concrete, technically grounded ideas	2-3 min

Spaced Repetition Checkpoints

Day 0 (Today)

Understand the 3-pass method thoroughly
Choose your first 3 papers to read from the canon
Set up your note card system (digital or physical)

Day 3

Complete a full 3-pass read of your first paper
Write a complete note card
Practice a 60-second summary out loud

Day 7

Complete your second paper
Review your first paper's note card (can you still explain it?)
Practice an on-the-spot read with a random arXiv paper

Day 14

Complete your third paper
Review all three note cards
Build a connection map between the three papers
Do a mock 5-minute presentation of each

Day 21

Review all note cards
Do a full mock paper discussion interview (45 minutes)
Identify gaps in your knowledge and plan additional reading

Next Steps

Now that you have a systematic reading method, move to Chapter 2: Presenting Papers in Interviews to learn how to structure your knowledge into compelling, interview-winning presentations. Then begin the deep dives with Chapter 3: Attention Is All You Need.
