Deep Learning for Interviews - Your Complete Roadmap

Reading time: ~25 min | Interview relevance: Critical | Roles: MLE, AI Eng, Research Engineer, Computer Vision Eng, NLP Eng

The Real Interview Moment

You are forty-five minutes into a Meta MLE on-site loop. The interviewer, a staff research engineer who has published at NeurIPS, slides a whiteboard marker to you and says: "Design me a neural network for this task. Walk me through every decision - architecture, activations, normalization, training strategy - and justify each choice." You know what a ResNet is. You know what batch normalization does. But when the interviewer asks "Why did you choose ReLU over GELU here?" or "What happens to the gradients in layer 47 during the backward pass?" - you realize that surface-level knowledge is not enough.

Deep learning interviews at top companies are not trivia contests. They test whether you have a unified mental model of how neural networks work - from the mathematics of backpropagation through the engineering of distributed training. The interviewer wants to see you reason through tradeoffs: Why can a ResNet train at 152 layers when a plain VGG-style network degrades long before that depth? Why do Transformers use layer norm instead of batch norm? Why do language models tend to favor GELU over ReLU? Each question probes a different node in the same interconnected graph of deep learning knowledge.

This section gives you that graph. Eleven topics, organized in dependency order, with the exact depth and connections you need to navigate any deep learning interview question with confidence and precision.

What You Will Master

After completing this section, you will be able to:

Derive backpropagation from first principles and trace gradients through any computational graph
Choose activation functions with mathematical justification for why GELU dominates Transformers and ReLU dominates CNNs
Design convolutional architectures and explain the evolution from LeNet to ConvNeXt
Explain recurrent architectures and why LSTMs mitigate the vanishing gradient problem that vanilla RNNs suffer from
Derive the attention mechanism from scratch and explain multi-head attention, cross-attention, and masked attention
Build a Transformer from components - embedding, positional encoding, multi-head attention, FFN, layer norm
Compare normalization techniques (batch, layer, group, RMS) and justify when to use each
Apply advanced training techniques - learning rate schedules, gradient clipping, mixed precision, label smoothing
Design distributed training strategies - data parallelism, model parallelism, pipeline parallelism, ZeRO
Explain generative models - VAEs, GANs, diffusion models, and flow-based models
Answer rapid-fire deep learning questions with structured, interview-ready responses
Connect all topics into a unified framework for architectural decision-making

Self-Assessment: Where Are You Now?

Before diving in, honestly rate yourself on each topic. This will help you prioritize your study time.

Topic	1 -- Cannot Explain	2 -- Vaguely Recall	3 -- Can Explain	4 -- Can Derive/Implement	5 -- Can Teach	Your Score
Backpropagation & chain rule						___
Activation functions & tradeoffs						___
CNN architectures & design						___
RNNs, LSTMs, GRUs						___
Attention mechanisms						___
Transformer architecture						___
Normalization techniques						___
Training techniques						___
Distributed training						___
Generative models						___
Interview-style rapid fire						___

Scoring guide:

Total 11-22: You need the full section. Start from the beginning and work sequentially.
Total 23-33: You have foundations but gaps. Use the dependency map below to identify and fill weak spots.
Total 34-44: You are close to interview-ready. Focus on practice problems and the rapid-fire section.
Total 45-55: You are ready. Do one final pass on the cheat sheets and spaced repetition checkpoints.

Topic Dependency Map

The eleven topics in this section are not independent. Understanding Transformers requires attention, which requires knowing why RNNs struggle, which requires understanding vanishing gradients from backpropagation. The diagram below shows the dependency structure.

[Figure: Topic Dependency Map - 11 Deep Learning Topics]

Legend:

Green (Topics 01-02): Foundational - must be solid before anything else
Yellow (Topics 03-05): Core architectures - the building blocks
Blue (Topics 06-08): Advanced architecture and training - where interviews go deep
Red (Topics 09-11): Specialized - tested at senior levels and specific roles

Study Paths by Role and Timeline

Not every role tests every topic at the same depth. Use the paths below to prioritize.

Path 1: MLE Generalist (Google, Meta, Amazon) - 3 weeks

Week	Topics	Focus
Week 1	Backprop, Activations, CNNs	Derive backprop by hand. Know activation tradeoffs cold. Explain ResNet skip connections.
Week 2	RNNs, Attention, Transformers	LSTM gating math. Derive scaled dot-product attention. Build a Transformer on the whiteboard.
Week 3	Normalization, Training, Distributed, Generative	BatchNorm vs LayerNorm. LR schedules. Data parallelism vs model parallelism. VAE vs GAN.

Path 2: NLP/LLM Engineer (OpenAI, Anthropic, Cohere) - 2 weeks

Week	Topics	Focus
Week 1	Backprop, Activations, Attention, Transformers	GELU specifically. Multi-head attention derivation. Positional encoding. Full Transformer build.
Week 2	Normalization, Training, Distributed, Generative	Pre-norm vs post-norm. Mixed precision. ZeRO stages. Autoregressive vs masked LM.

Path 3: Computer Vision Engineer (Tesla, Apple, NVIDIA) - 2 weeks

Week	Topics	Focus
Week 1	Backprop, Activations, CNNs, Normalization	Conv math (stride, padding, output size). Architecture evolution. GroupNorm for small batches.
Week 2	Training, Distributed, Generative, Transformers	Transfer learning. Multi-GPU training. Diffusion models. Vision Transformers.

Path 4: Research Engineer (DeepMind, FAIR, Brain) - 4 weeks

Week	Topics	Focus
Week 1	Backprop, Activations	Full derivations. Automatic differentiation (forward vs reverse mode). Activation landscapes.
Week 2	CNNs, RNNs, Attention	Every architecture variant. Receptive field math. Attention alternatives (linear, sparse).
Week 3	Transformers, Normalization, Training	Architecture ablations. RMSNorm. Curriculum learning. Hyperparameter sensitivity.
Week 4	Distributed, Generative, Interview Questions	Pipeline parallelism. Diffusion math. Timed practice on all topics.

Path 5: Startup ML Engineer - 1 week crash course

Day	Topics	Focus
Day 1-2	Backprop, Activations, CNNs	Enough to reason about architectures and debug training issues.
Day 3-4	Attention, Transformers, Training	Practical Transformer usage. Fine-tuning strategies. LR schedules.
Day 5	Interview Questions + Practice	Rapid-fire answers. System design integration.

Topic-to-Company Mapping

Different companies emphasize different deep learning topics. This table shows what to expect.

Topic	Google	Meta	Amazon	Apple	OpenAI/Anthropic	NVIDIA	Startups
Backpropagation	Derive on whiteboard	Derive on whiteboard	Conceptual	Conceptual	Derive + implement	Derive + CUDA	Conceptual
Activation Functions	Tradeoff analysis	Tradeoff analysis	Basic knowledge	Basic knowledge	GELU deep dive	Kernel optimization	Basic knowledge
CNNs	Architecture design	Architecture design	Transfer learning	On-device CNNs	Vision backbone	CUDA kernels	Transfer learning
RNNs & LSTMs	Full derivation	Full derivation	Sequence modeling	On-device RNNs	Historical context	Rarely tested	Rarely tested
Attention	Derive from scratch	Derive from scratch	Conceptual	Conceptual	Derive + variants	Kernel optimization	Conceptual
Transformers	Full architecture	Full architecture	Usage-level	Usage-level	Deep architectural	Optimization	Usage-level
Normalization	BN vs LN tradeoffs	BN vs LN tradeoffs	Basic knowledge	GroupNorm focus	RMSNorm deep dive	Fused kernels	Basic knowledge
Training Techniques	LR schedules, regularization	LR schedules, augmentation	Practical tuning	Quantization focus	Large-scale training	Mixed precision	Practical tuning
Distributed Training	Tested at senior+	Tested at senior+	Tested at senior+	Rarely tested	Core competency	Core competency	Rarely tested
Generative Models	GAN/Diffusion theory	GAN/Diffusion theory	Rarely tested	On-device generation	Core competency	Inference optimization	Application-level

Interviewer's Perspective

"When I interview for MLE roles, I use deep learning questions as a calibration tool. A candidate who can only recite definitions gets a lean no-hire. A candidate who can derive backprop, explain why ResNet skip connections help gradient flow, and reason about when to use LayerNorm vs BatchNorm - that candidate demonstrates the kind of first-principles thinking we need. I am not looking for memorization. I am looking for understanding."

Staff MLE, Google Brain

What Makes Deep Learning Interviews Different

Deep learning interviews differ from classical ML interviews in three important ways:

1. Derivation Depth

Classical ML interviews might ask you to explain regularization conceptually. Deep learning interviews ask you to derive the gradient of the loss with respect to weights in layer 3 of a network, trace it through batch normalization, and explain what happens numerically when the network has 100 layers.

2. Architecture Reasoning

You will not just be asked "what is a CNN?" You will be asked "why did the authors of ResNet add skip connections, and can you prove mathematically that they help gradient flow?" or "why does the Transformer use multi-head attention instead of a single large attention matrix?"

3. Scale Awareness

Modern deep learning operates at scales that fundamentally change the game. You need to know what breaks when you go from 1 GPU to 1,000 GPUs, why batch normalization fails with small batch sizes, and why mixed-precision training matters for billion-parameter models.

[Figure: Classical ML vs Deep Learning Interviews]

Section Overview: All 11 Topics

Topic 01 - Backpropagation

The mathematical backbone of all neural network training. You will derive the chain rule on computational graphs, trace gradients through a 2-layer network by hand, understand vanishing and exploding gradients, and learn the difference between forward-mode and reverse-mode automatic differentiation. This is tested in some form at every company.

Key interview questions: "Derive the gradient of cross-entropy loss with respect to the weights in the first layer." "Why does reverse-mode AD dominate deep learning?"
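As a warm-up for the hand derivation, here is a minimal sketch of the chain rule on a scalar two-layer network with one ReLU hidden unit. It uses squared-error loss rather than cross-entropy to keep the arithmetic short, and all names (`forward`, `backward`, `numerical_grad_w1`) are illustrative, not from any library. The analytic gradient is checked against a central finite difference.

```python
def forward(w1, w2, x, t):
    """Scalar 2-layer net: z = w1*x, h = ReLU(z), y = w2*h, L = 0.5*(y - t)^2."""
    z = w1 * x
    h = max(z, 0.0)
    y = w2 * h
    loss = 0.5 * (y - t) ** 2
    return z, h, y, loss


def backward(w1, w2, x, t):
    """Apply the chain rule from the loss back to each weight."""
    z, h, y, _ = forward(w1, w2, x, t)
    dL_dy = y - t                                # dL/dy
    dL_dw2 = dL_dy * h                           # y = w2*h  ->  dy/dw2 = h
    dL_dh = dL_dy * w2                           # y = w2*h  ->  dy/dh = w2
    dL_dz = dL_dh * (1.0 if z > 0 else 0.0)      # ReLU passes gradient only if z > 0
    dL_dw1 = dL_dz * x                           # z = w1*x  ->  dz/dw1 = x
    return dL_dw1, dL_dw2


def numerical_grad_w1(w1, w2, x, t, eps=1e-6):
    """Central finite difference, used only as a sanity check."""
    loss_plus = forward(w1 + eps, w2, x, t)[3]
    loss_minus = forward(w1 - eps, w2, x, t)[3]
    return (loss_plus - loss_minus) / (2 * eps)


analytic_w1, analytic_w2 = backward(0.5, -1.5, 2.0, 1.0)
print(analytic_w1, numerical_grad_w1(0.5, -1.5, 2.0, 1.0))
```

The same pattern (cache the forward values, then multiply local derivatives from the loss backward) carries over to the vector case, where the scalar products become matrix products.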

Topic 02 - Activation Functions

The nonlinear functions that give neural networks their power. You will learn every major activation function (sigmoid through Mish), understand why ReLU revolutionized training, diagnose the dying ReLU problem, and explain why modern Transformers use GELU. Includes a decision flowchart for choosing activations.

Key interview questions: "Why not just use sigmoid everywhere?" "What is the dying ReLU problem and how do you fix it?" "Why does GPT use GELU?"
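One way to ground the dying ReLU discussion is to compare gradients for a slightly negative input. The sketch below uses the exact erf-based form of GELU (many libraries also offer a tanh approximation); the function names are illustrative.

```python
import math


def relu_grad(x):
    """ReLU gradient: 1 for positive inputs, exactly 0 otherwise."""
    return 1.0 if x > 0 else 0.0


def gelu(x):
    """GELU(x) = x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_grad(x):
    """d/dx [x * Phi(x)] = Phi(x) + x * phi(x), with phi the standard normal PDF."""
    cdf = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return cdf + x * pdf


for x in (-0.5, 0.0, 0.5):
    print(x, relu_grad(x), gelu_grad(x))
```

A unit whose pre-activation is slightly negative receives no gradient under ReLU, which is how neurons can get stuck at zero. Under GELU the same unit still receives a small gradient.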

Topic 03 - Convolutional Neural Networks

The architecture that dominated computer vision for a decade. You will master convolution math, trace the evolution from LeNet to ConvNeXt, derive output dimensions, understand receptive fields, explain skip connections mathematically, and design transfer learning strategies.

Key interview questions: "Calculate the output size of this conv layer." "Why do ResNets train at 152 layers when plain networks degrade at far smaller depths?" "What are depthwise separable convolutions?"
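The output-size question reduces to one formula, floor((n + 2p - k_eff) / s) + 1, where k_eff accounts for dilation. A small sketch, with a hypothetical helper name and a 224x224 input chosen only as an example:

```python
def conv_output_size(n, kernel, stride=1, padding=0, dilation=1):
    """Output length along one spatial dimension of a convolution."""
    effective_kernel = dilation * (kernel - 1) + 1
    return (n + 2 * padding - effective_kernel) // stride + 1


# 7x7 conv, stride 2, padding 3 on a 224x224 input (applied per dimension)
print(conv_output_size(224, kernel=7, stride=2, padding=3))
# 3x3 conv, stride 1, padding 1 keeps the spatial size unchanged
print(conv_output_size(32, kernel=3, stride=1, padding=1))
```

Apply the function once per spatial dimension; the channel count of the output equals the number of filters and is independent of this formula.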

Topic 04 - RNNs and LSTMs

Sequential architectures and why they struggle with long-range dependencies. You will derive the RNN gradient and show where it vanishes, explain LSTM gating mechanisms mathematically, compare GRU simplifications, and understand why attention replaced recurrence.

Key interview questions: "Derive why vanilla RNNs have vanishing gradients." "Walk me through the LSTM cell - what does each gate do?" "Why did the field move away from RNNs?"

Topic 05 - Attention Mechanism

The mechanism that replaced recurrence and enabled modern AI. You will derive scaled dot-product attention from first principles, explain the scaling factor, understand multi-head attention, compare self-attention and cross-attention, and analyze computational complexity.

Key interview questions: "Derive attention from scratch on the whiteboard." "Why divide by the square root of $d_k$ ?" "What is the computational complexity of self-attention?"
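For whiteboard practice, scaled dot-product attention fits in a few lines. This is a minimal single-head sketch on plain Python lists, with no masking, batching, or learned projections; the function names are illustrative. The division by sqrt(d_k) keeps the score variance roughly constant as the key dimension grows, so the softmax does not saturate.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """Q: n_q x d_k, K: n_k x d_k, V: n_k x d_v (lists of lists)."""
    d_k = len(K[0])
    scale = 1.0 / math.sqrt(d_k)
    weights = []
    for q in Q:
        # One score per key: scaled dot product of the query with that key
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in K]
        weights.append(softmax(scores))
    d_v = len(V[0])
    # Each output row is a weighted average of the value rows
    output = [
        [sum(w[j] * V[j][c] for j in range(len(V))) for c in range(d_v)]
        for w in weights
    ]
    return output, weights


out, attn = scaled_dot_product_attention(
    Q=[[1.0, 0.0], [0.0, 1.0]],
    K=[[1.0, 0.0], [0.0, 1.0]],
    V=[[1.0, 2.0], [3.0, 4.0]],
)
print(attn)
print(out)
```

The nested loops make the cost visible: every query is compared with every key, which is where the quadratic dependence on sequence length comes from.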

Topic 06 - Transformer Architecture

The architecture behind GPT, BERT, and virtually all modern AI. You will build a complete Transformer from components, understand positional encoding, compare encoder-only vs decoder-only vs encoder-decoder variants, and explain pre-norm vs post-norm.

Key interview questions: "Draw the full Transformer architecture and explain every component." "Why does GPT use decoder-only?" "How does positional encoding work?"

Topic 07 - Normalization Techniques

Techniques that stabilize and accelerate training. You will compare BatchNorm, LayerNorm, GroupNorm, InstanceNorm, and RMSNorm, understand which axes each normalizes over, and explain why Transformers use LayerNorm instead of BatchNorm.

Key interview questions: "Why does BatchNorm fail with small batch sizes?" "Why do Transformers use LayerNorm?" "What is RMSNorm and why is it used in LLaMA?"

Topic 08 - Training Techniques

Advanced strategies for training deep networks effectively. You will master learning rate schedules (warmup, cosine annealing), gradient clipping, mixed precision training, data augmentation, label smoothing, and curriculum learning.

Key interview questions: "Design a training recipe for a 1B parameter model." "Why use warmup?" "Explain mixed precision training and where it can go wrong."
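A common way to combine warmup with cosine annealing is sketched below. The exact shape (linear warmup, then a half-cosine down to `min_lr`) and every parameter name are assumptions for illustration; real training recipes vary in the details.

```python
import math


def lr_at_step(step, base_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        # Ramp up linearly so early updates stay small
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (base_lr - min_lr) * cosine


schedule = [lr_at_step(s, base_lr=1e-3, warmup_steps=10, total_steps=100) for s in range(100)]
print(schedule[0], schedule[9], schedule[55], schedule[99])
```

The schedule is a pure function of the step index, which makes it easy to plot or unit-test before plugging it into an optimizer.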

Topic 09 - Distributed Training

Scaling training across multiple GPUs and nodes. You will understand data parallelism, model parallelism, pipeline parallelism, tensor parallelism, and ZeRO optimization stages. Tested primarily at senior levels and at companies training large models.

Key interview questions: "How does data parallelism work with gradient synchronization?" "What are the three ZeRO stages?" "When do you use model parallelism vs data parallelism?"
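The ZeRO stages are easier to remember as arithmetic. The sketch below assumes the accounting commonly used for mixed-precision Adam: 2 bytes per parameter for FP16 weights, 2 for FP16 gradients, and 12 for optimizer state (FP32 master weights, momentum, and variance). It ignores activations, buffers, and fragmentation, and the function name is hypothetical.

```python
def zero_bytes_per_gpu(num_params, num_gpus, stage):
    """Approximate model-state memory per GPU under ZeRO stages 0-3."""
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    params = 2.0 * num_params     # FP16 weights
    grads = 2.0 * num_params      # FP16 gradients
    optim = 12.0 * num_params     # FP32 weights + Adam momentum + variance
    if stage >= 1:
        optim /= num_gpus         # Stage 1 shards optimizer states
    if stage >= 2:
        grads /= num_gpus         # Stage 2 also shards gradients
    if stage >= 3:
        params /= num_gpus        # Stage 3 also shards parameters
    return params + grads + optim


for stage in range(4):
    gigabytes = zero_bytes_per_gpu(7.5e9, 64, stage) / 1e9
    print(f"stage {stage}: {gigabytes:.1f} GB per GPU")
```

Under these assumptions, a 7.5B-parameter model on 64 GPUs drops from 120 GB of model state per GPU with plain data parallelism to under 2 GB at stage 3, at the price of extra communication.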

Topic 10 - Generative Models

Models that generate new data. You will understand VAEs (ELBO derivation), GANs (minimax game, mode collapse), diffusion models (forward/reverse process), and flow-based models. Increasingly tested as generative AI becomes central to the industry.

Key interview questions: "Derive the ELBO for VAEs." "Why do GANs suffer from mode collapse?" "Explain the diffusion process mathematically."
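For reference, the ELBO identity that the first question asks for can be stated compactly. The notation here (encoder $q_\phi$, decoder $p_\theta$, prior $p(z)$) is a standard convention assumed for this sketch:

```latex
\log p_\theta(x)
  = \underbrace{\mathbb{E}_{q_\phi(z \mid x)}\!\left[\log \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right]}_{\text{ELBO}}
  + \mathrm{KL}\!\left(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\right)
  \;\ge\;
  \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right]
  - \mathrm{KL}\!\left(q_\phi(z \mid x) \,\|\, p(z)\right)
```

The inequality holds because the KL term on the left is non-negative, so maximizing the ELBO raises a lower bound on the log-likelihood while pulling the approximate posterior toward the true one.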

Topic 11 - DL Interview Questions

Rapid-fire practice across all topics. Fifty questions with structured answers, scoring rubrics, and time targets. Use this for final preparation and timed practice sessions.

Common DL Interview Patterns

Across hundreds of DL interviews at top companies, certain patterns emerge repeatedly. Knowing these patterns helps you anticipate follow-up questions.

Pattern 1: "Derive It, Then Scale It"

The interviewer starts with a derivation question, then asks what happens at scale.

Example flow:

"Derive backpropagation for a 2-layer network" (Backprop)
"What happens to gradients in a 100-layer network?" (Vanishing/Exploding gradients)
"How does ResNet solve this?" (Skip connections)
"How do you train a 100-layer ResNet on 8 GPUs?" (Distributed training)

Preparation: For every concept, know both the small-scale math AND the large-scale engineering.

Pattern 2: "Why Not X?"

The interviewer proposes a suboptimal approach and asks you to argue against it.

Example flow:

"Let's use sigmoid activations" -- Why not? (Vanishing gradients)
"Let's use BatchNorm in our Transformer" -- Why not? (LayerNorm is better for variable-length sequences)
"Let's train an ImageNet-scale model from scratch with 1000 labeled images" -- Why not? (Transfer learning)
"Let's use a single giant attention head" -- Why not? (Multi-head captures diverse relationships)

Preparation: For every design choice, know both what TO use and what NOT to use, with precise reasons.

Pattern 3: "Connect the Dots"

The interviewer asks you to link seemingly separate concepts.

Example flow:

"How does the choice of activation function affect what initialization you should use?" (Activation + Init)
"Why did the move from CNNs to Transformers also shift the default from BatchNorm to LayerNorm?" (Architecture + Normalization)
"How does the attention mechanism relate to the vanishing gradient problem in RNNs?" (Attention + Gradient flow)

Preparation: Study the topic dependency map above. Every edge in that graph is a potential interview question.

Pattern 4: "Debug This Training Run"

The interviewer describes symptoms and asks you to diagnose.

Symptom	Likely Cause	From Topic
Loss is NaN after 100 steps	Exploding gradients, no gradient clipping	Backprop, Training
Loss plateaus at high value	Vanishing gradients, bad initialization, LR too low	Backprop, Activations
Training loss is 0, test loss is high	Overfitting - no regularization, too much capacity	Training techniques
Training loss oscillates wildly	LR too high, no warmup	Training techniques
40% of neurons always output zero	Dying ReLU, LR too high or bad init	Activations
Model works on English, fails on German	Tokenizer issue, not an architecture issue	LLM-specific
Distributed training is 4x slower on 8 GPUs than 4 GPUs	Communication bottleneck, batch size issues	Distributed training
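For the first row of the table, the usual remedy is gradient clipping by global norm. A minimal sketch on a flat list of gradient values follows; the function name is illustrative, and deep learning frameworks ship their own equivalents.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients together so their combined L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm or total_norm == 0.0:
        return list(grads), total_norm
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm


clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(norm_before, clipped)
```

Clipping rescales the whole gradient vector rather than each element, so the update direction is preserved. Logging the pre-clip norm is also a cheap way to spot the step at which a run starts to diverge.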

Frequently Tested Cross-Topic Connections

These are the connections between topics that interviewers love to probe. If you can fluently explain these links, you demonstrate systems-level understanding.

Backpropagation + Activations

The gradient properties of activation functions directly determine whether backpropagation succeeds in deep networks. Sigmoid's maximum derivative of 0.25 causes exponential gradient decay. ReLU's gradient of 1 (for positive inputs) avoids this decay but creates dead neurons. GELU's smooth gradient near zero provides stability in Transformer feed-forward layers, which is where the activation is applied. The initialization scheme (Xavier vs He) must match the activation to maintain gradient magnitude across layers.
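The 0.25 bound is easy to turn into numbers. The sketch below computes only the activation factor in the backward pass, taking the best case where every sigmoid unit sits at its maximum slope; weight matrices also scale the gradient and are deliberately left out of this illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Derivative of the sigmoid, maximized at x = 0 where it equals 0.25."""
    s = sigmoid(x)
    return s * (1.0 - s)


def best_case_gradient_scale(n_layers):
    """Upper bound on the product of sigmoid derivatives across n_layers."""
    return sigmoid_grad(0.0) ** n_layers


for depth in (5, 10, 20):
    print(depth, best_case_gradient_scale(depth))
```

Even in this best case the factor is below one in a million after 10 layers. The corresponding factor for ReLU units with positive inputs stays at 1.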

CNNs + Skip Connections + Normalization

ResNet's skip connections provide gradient highways ( $\frac{\partial}{\partial x}(F(x) + x) = F'(x) + I$ ), but they are not sufficient alone. Batch normalization in each residual block keeps activations in a healthy range: it prevents the $F(x)$ branch from dominating (which would defeat the skip connection) and keeps the activations from drifting as depth grows. ConvNeXt showed that switching from BatchNorm to LayerNorm (borrowed from Transformers) can further improve performance.
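The single-block identity extends to a full stack. As a short worked derivation, assume $L$ residual blocks indexed as $x_{l+1} = x_l + F_l(x_l)$ (notation introduced here for illustration):

```latex
\frac{\partial x_L}{\partial x_0}
  = \prod_{l=0}^{L-1} \left( I + \frac{\partial F_l}{\partial x_l} \right)
  = I + \sum_{l=0}^{L-1} \frac{\partial F_l}{\partial x_l} + \text{(products of two or more Jacobians)}
\qquad \text{vs.} \qquad
\frac{\partial x_L}{\partial x_0}\bigg|_{\text{plain}}
  = \prod_{l=0}^{L-1} \frac{\partial F_l}{\partial x_l}
```

In the plain network every gradient path passes through all $L$ Jacobians, so small factors compound. In the residual network the expansion always contains the identity term, which gives the gradient a path that bypasses every block. The same expansion shows why normalization still matters: if the Jacobians of $F_l$ are large, the product can grow with depth instead of shrinking.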

RNNs + Attention + Transformers

This is a historical evolution driven by gradient flow. Vanilla RNNs suffer from vanishing gradients over long sequences (the product of $N$ Jacobians shrinks). LSTMs partially solve this with gated memory cells (additive gradient path). Attention sidesteps the problem by creating a direct connection between any two positions ( $O(1)$ gradient path length). Transformers remove recurrence entirely, using self-attention for all context - this enables parallelization and scales to billions of parameters.

Normalization + Training Techniques + Distributed Training

BatchNorm's behavior changes with batch size: effective with large batches ( $\geq 32$ ), noisy with small batches. In data-parallel training, BatchNorm statistics are typically computed per GPU, so splitting a fixed global batch across more GPUs shrinks the batch that each GPU normalizes over. This is why: (1) large-scale training often uses SyncBatchNorm (synchronize statistics across GPUs), (2) Transformers use LayerNorm (batch-size independent), and (3) LLaMA uses RMSNorm (simpler than LayerNorm, works at massive scale).
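The batch-size dependence comes down to which axis the statistics are computed over. Below is a minimal sketch on a batch stored as a list of feature rows. It covers training-mode statistics only and omits the learnable scale and shift; the function names are illustrative.

```python
import math


def _standardize(values, eps=1e-5):
    """Subtract the mean and divide by the standard deviation."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def batch_norm(batch, eps=1e-5):
    """Normalize each feature across the batch axis (depends on the other samples)."""
    columns = [_standardize(list(col), eps) for col in zip(*batch)]
    return [list(row) for row in zip(*columns)]


def layer_norm(batch, eps=1e-5):
    """Normalize each sample across its own features (batch-size independent)."""
    return [_standardize(row, eps) for row in batch]


def rms_norm(batch, eps=1e-5):
    """Like LayerNorm without mean subtraction: divide by the root mean square."""
    out = []
    for row in batch:
        rms = math.sqrt(sum(v * v for v in row) / len(row) + eps)
        out.append([v / rms for v in row])
    return out


sample = [1.0, 2.0, 3.0]
print(layer_norm([sample])[0] == layer_norm([sample, [10.0, 20.0, 30.0]])[0])
print(batch_norm([sample])[0])
```

With a batch of one, `batch_norm` has no variance to work with and collapses every feature to zero, while `layer_norm` and `rms_norm` return the same result for a sample regardless of what else is in the batch.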

Generative Models + Training Techniques + Distributed Training

Training generative models at scale requires the full toolkit: mixed precision training (FP16/BF16 for memory), gradient checkpointing (recompute activations to fit larger models), ZeRO optimization (shard optimizer states across GPUs), and careful learning rate scheduling (warmup to avoid early instability, cosine decay for convergence). A question like "describe how you would train a diffusion model from scratch" tests all three topics simultaneously.

[Figure: Frequently Tested Cross-Topic Connections]

Self-Assessment Practice: Quick Diagnostic Questions

Before you begin studying, try answering these 10 questions. If you can answer 7+ confidently, you may be able to skip some foundational topics. If fewer than 4, start from the beginning.

#	Question	Topic Tested
1	Derive $\frac{\partial L}{\partial W_1}$ for a 2-layer network with ReLU.	Backprop
2	Why does GELU outperform ReLU in Transformers? Give 2 reasons.	Activations
3	Calculate the output size of a 3x3 conv with stride 2, padding 1 on a 56x56 input.	CNNs
4	Why can't vanilla RNNs model long-range dependencies? Be specific.	RNNs
5	Derive scaled dot-product attention. Why divide by $\sqrt{d_k}$ ?	Attention
6	What are the 3 components of a Transformer encoder block?	Transformers
7	Why do Transformers use LayerNorm instead of BatchNorm?	Normalization
8	What is learning rate warmup and why is it necessary?	Training
9	Explain ZeRO Stage 2 in one sentence.	Distributed
10	What is the ELBO in VAEs and why do we maximize it?	Generative

Scoring:

0-3 correct: Start from Topic 01 and work sequentially
4-6 correct: You have foundations - focus on weak topics and connections
7-9 correct: You are nearly interview-ready - focus on practice problems and speed
10 correct: Move directly to the rapid-fire section and mock interviews

Interview Cheat Sheet - Section-Level Reference

Concept	One-Sentence Summary	When It Is Asked
Backpropagation	Chain rule on computational graphs - reverse-mode AD computes all gradients in one backward pass	Phone screen, on-site
Activation functions	Nonlinear functions enabling universal approximation - ReLU for CNNs, GELU for Transformers	Phone screen
CNNs	Local connectivity + weight sharing for spatial data - ResNet skip connections enable deep training	On-site, system design
RNNs/LSTMs	Sequential processing with hidden state - LSTM gates mitigate vanishing gradients	Phone screen (less common now)
Attention	Weighted sum based on query-key similarity - $O(n^2 d)$ complexity	On-site, core for LLM roles
Transformers	Self-attention + FFN in encoder/decoder blocks - parallelizable, scalable	On-site, core for all roles
Normalization	Stabilize activations across layers - LayerNorm for Transformers, BatchNorm for CNNs	On-site
Training techniques	LR schedules, gradient clipping, mixed precision - practical engineering for convergence	On-site, system design
Distributed training	Split data/model across GPUs - AllReduce for data parallel, ZeRO for memory efficiency	Senior on-site
Generative models	Learn data distribution to sample new examples - VAE, GAN, Diffusion, Flow	Specialized roles

Spaced Repetition Checkpoints

Use this schedule to reinforce your learning. Each checkpoint lists what you should be able to do from memory.

Day 0 - After First Read

Draw the topic dependency diagram from memory
List all 11 topics and their one-sentence summaries
Identify your 3 weakest topics from the self-assessment
State which topics your target company emphasizes

Day 3 - First Review

For each topic, state the most common interview question
Explain the difference between topics that are often confused (BatchNorm vs LayerNorm, RNN vs LSTM, attention vs self-attention)
Recite your study path and current progress

Day 7 - Connections Review

Explain how backpropagation connects to vanishing gradients, which connects to activation choice, which connects to skip connections
Trace how a single training step works end-to-end: forward pass, loss computation, backward pass, gradient update
Explain why Transformers use: GELU (not ReLU), LayerNorm (not BatchNorm), multi-head attention (not single-head)

Day 14 - Interview Simulation

Complete 5 practice problems from different topics under time pressure (10 minutes each)
Give a 60-second answer for each of the 11 topics
Design a complete architecture for a given task, justifying every component choice

Day 21 - Final Calibration

Do a full mock interview covering 4-5 deep learning topics in 45 minutes
Identify any remaining weak spots and do targeted review
Practice transitioning between topics smoothly (e.g., "the vanishing gradient problem in RNNs motivated the attention mechanism, which...")

Prerequisites from ML Fundamentals

This section builds directly on concepts from ML Fundamentals. Make sure you are comfortable with:

Prerequisite	From Topic	Why You Need It
Gradient descent and optimization	Optimization	Backpropagation computes gradients that optimizers consume
Loss functions (cross-entropy, MSE)	Loss Functions	Every derivation starts from the loss
Regularization (L1, L2, dropout)	Regularization	Deep networks need regularization - dropout, weight decay, data augmentation
Bias-variance tradeoff	Bias-Variance	Understanding overfitting in deep networks requires this framework
Evaluation metrics	Evaluation Metrics	You cannot train without knowing what you are optimizing

What Comes Next

Once you have completed the Deep Learning section, you will be ready for:

LLM Interviews - Builds directly on Transformers, attention, training techniques, and distributed training
ML System Design - Applies architectural decisions and training strategies to real-world systems
Paper Discussion - Many discussed papers are deep learning architecture papers (ResNet, Transformer, BERT, GPT)

Start with Backpropagation - the mathematical foundation that everything else builds on.
