Deep Learning for Interviews - Your Complete Roadmap
Reading time: ~25 min | Interview relevance: Critical | Roles: MLE, AI Eng, Research Engineer, Computer Vision Eng, NLP Eng
The Real Interview Moment
You are forty-five minutes into a Meta MLE on-site loop. The interviewer, a staff research engineer who has published at NeurIPS, slides a whiteboard marker to you and says: "Design me a neural network for this task. Walk me through every decision - architecture, activations, normalization, training strategy - and justify each choice." You know what a ResNet is. You know what batch normalization does. But when the interviewer asks "Why did you choose ReLU over GELU here?" or "What happens to the gradients in layer 47 during the backward pass?" - you realize that surface-level knowledge is not enough.
Deep learning interviews at top companies are not trivia contests. They test whether you have a unified mental model of how neural networks work - from the mathematics of backpropagation through the engineering of distributed training. The interviewer wants to see you reason through tradeoffs: Why can ResNet train at 152 layers when plain stacked networks like VGG degrade long before that depth? Why do Transformers use layer norm instead of batch norm? Why does GELU outperform ReLU in language models? Each question probes a different node in the same interconnected graph of deep learning knowledge.
This section gives you that graph. Eleven topics, organized in dependency order, with the exact depth and connections you need to navigate any deep learning interview question with confidence and precision.
What You Will Master
After completing this section, you will be able to:
- Derive backpropagation from first principles and trace gradients through any computational graph
- Choose activation functions with mathematical justification for why GELU dominates Transformers and ReLU dominates CNNs
- Design convolutional architectures and explain the evolution from LeNet to ConvNeXt
- Explain recurrent architectures and why LSTM gating mitigates the vanishing gradient problem that defeats vanilla RNNs
- Derive the attention mechanism from scratch and explain multi-head attention, cross-attention, and masked attention
- Build a Transformer from components - embedding, positional encoding, multi-head attention, FFN, layer norm
- Compare normalization techniques (batch, layer, group, RMS) and justify when to use each
- Apply advanced training techniques - learning rate schedules, gradient clipping, mixed precision, label smoothing
- Design distributed training strategies - data parallelism, model parallelism, pipeline parallelism, ZeRO
- Explain generative models - VAEs, GANs, diffusion models, and flow-based models
- Answer rapid-fire deep learning questions with structured, interview-ready responses
- Connect all topics into a unified framework for architectural decision-making
Self-Assessment: Where Are You Now?
Before diving in, honestly rate yourself on each topic. This will help you prioritize your study time.
| Topic | 1 - Cannot Explain | 2 - Vaguely Recall | 3 - Can Explain | 4 - Can Derive/Implement | 5 - Can Teach | Your Score |
|---|---|---|---|---|---|---|
| Backpropagation & chain rule | | | | | | ___ |
| Activation functions & tradeoffs | | | | | | ___ |
| CNN architectures & design | | | | | | ___ |
| RNNs, LSTMs, GRUs | | | | | | ___ |
| Attention mechanisms | | | | | | ___ |
| Transformer architecture | | | | | | ___ |
| Normalization techniques | | | | | | ___ |
| Training techniques | | | | | | ___ |
| Distributed training | | | | | | ___ |
| Generative models | | | | | | ___ |
| Interview-style rapid fire | | | | | | ___ |
Scoring guide:
- Total 11-22: You need the full section. Start from the beginning and work sequentially.
- Total 23-33: You have foundations but gaps. Use the dependency map below to identify and fill weak spots.
- Total 34-44: You are close to interview-ready. Focus on practice problems and the rapid-fire section.
- Total 45-55: You are ready. Do one final pass on the cheat sheets and spaced repetition checkpoints.
Topic Dependency Map
The eleven topics in this section are not independent. Understanding Transformers requires attention, which requires knowing why RNNs struggle, which requires understanding vanishing gradients from backpropagation. The diagram below shows the dependency structure.
Legend:
- Green (Topics 01-02): Foundational - must be solid before anything else
- Yellow (Topics 03-05): Core architectures - the building blocks
- Blue (Topics 06-08): Advanced architecture and training - where interviews go deep
- Red (Topics 09-11): Specialized - tested at senior levels and specific roles
Study Paths by Role and Timeline
Not every role tests every topic at the same depth. Use the paths below to prioritize.
Path 1: MLE Generalist (Google, Meta, Amazon) - 3 weeks
| Week | Topics | Focus |
|---|---|---|
| Week 1 | Backprop, Activations, CNNs | Derive backprop by hand. Know activation tradeoffs cold. Explain ResNet skip connections. |
| Week 2 | RNNs, Attention, Transformers | LSTM gating math. Derive scaled dot-product attention. Build a Transformer on the whiteboard. |
| Week 3 | Normalization, Training, Distributed, Generative | BatchNorm vs LayerNorm. LR schedules. Data parallelism vs model parallelism. VAE vs GAN. |
Path 2: NLP/LLM Engineer (OpenAI, Anthropic, Cohere) - 2 weeks
| Week | Topics | Focus |
|---|---|---|
| Week 1 | Backprop, Activations, Attention, Transformers | GELU specifically. Multi-head attention derivation. Positional encoding. Full Transformer build. |
| Week 2 | Normalization, Training, Distributed, Generative | Pre-norm vs post-norm. Mixed precision. ZeRO stages. Autoregressive vs masked LM. |
Path 3: Computer Vision Engineer (Tesla, Apple, NVIDIA) - 2 weeks
| Week | Topics | Focus |
|---|---|---|
| Week 1 | Backprop, Activations, CNNs, Normalization | Conv math (stride, padding, output size). Architecture evolution. GroupNorm for small batches. |
| Week 2 | Training, Distributed, Generative, Transformers | Transfer learning. Multi-GPU training. Diffusion models. Vision Transformers. |
Path 4: Research Engineer (DeepMind, FAIR, Brain) - 4 weeks
| Week | Topics | Focus |
|---|---|---|
| Week 1 | Backprop, Activations | Full derivations. Automatic differentiation (forward vs reverse mode). Activation landscapes. |
| Week 2 | CNNs, RNNs, Attention | Every architecture variant. Receptive field math. Attention alternatives (linear, sparse). |
| Week 3 | Transformers, Normalization, Training | Architecture ablations. RMSNorm. Curriculum learning. Hyperparameter sensitivity. |
| Week 4 | Distributed, Generative, Interview Questions | Pipeline parallelism. Diffusion math. Timed practice on all topics. |
Path 5: Startup ML Engineer - 1 week crash course
| Day | Topics | Focus |
|---|---|---|
| Day 1-2 | Backprop, Activations, CNNs | Enough to reason about architectures and debug training issues. |
| Day 3-4 | Attention, Transformers, Training | Practical Transformer usage. Fine-tuning strategies. LR schedules. |
| Day 5 | Interview Questions + Practice | Rapid-fire answers. System design integration. |
Topic-to-Company Mapping
Different companies emphasize different deep learning topics. This table shows what to expect.
| Topic | Google | Meta | Amazon | Apple | OpenAI/Anthropic | NVIDIA | Startups |
|---|---|---|---|---|---|---|---|
| Backpropagation | Derive on whiteboard | Derive on whiteboard | Conceptual | Conceptual | Derive + implement | Derive + CUDA | Conceptual |
| Activation Functions | Tradeoff analysis | Tradeoff analysis | Basic knowledge | Basic knowledge | GELU deep dive | Kernel optimization | Basic knowledge |
| CNNs | Architecture design | Architecture design | Transfer learning | On-device CNNs | Vision backbone | CUDA kernels | Transfer learning |
| RNNs & LSTMs | Full derivation | Full derivation | Sequence modeling | On-device RNNs | Historical context | Rarely tested | Rarely tested |
| Attention | Derive from scratch | Derive from scratch | Conceptual | Conceptual | Derive + variants | Kernel optimization | Conceptual |
| Transformers | Full architecture | Full architecture | Usage-level | Usage-level | Deep architectural | Optimization | Usage-level |
| Normalization | BN vs LN tradeoffs | BN vs LN tradeoffs | Basic knowledge | GroupNorm focus | RMSNorm deep dive | Fused kernels | Basic knowledge |
| Training Techniques | LR schedules, regularization | LR schedules, augmentation | Practical tuning | Quantization focus | Large-scale training | Mixed precision | Practical tuning |
| Distributed Training | Tested at senior+ | Tested at senior+ | Tested at senior+ | Rarely tested | Core competency | Core competency | Rarely tested |
| Generative Models | GAN/Diffusion theory | GAN/Diffusion theory | Rarely tested | On-device generation | Core competency | Inference optimization | Application-level |
"When I interview for MLE roles, I use deep learning questions as a calibration tool. A candidate who can only recite definitions gets a lean no-hire. A candidate who can derive backprop, explain why ResNet skip connections help gradient flow, and reason about when to use LayerNorm vs BatchNorm - that candidate demonstrates the kind of first-principles thinking we need. I am not looking for memorization. I am looking for understanding."
- Staff MLE, Google Brain
What Makes Deep Learning Interviews Different
Deep learning interviews differ from classical ML interviews in three important ways:
1. Derivation Depth
Classical ML interviews might ask you to explain regularization conceptually. Deep learning interviews ask you to derive the gradient of the loss with respect to weights in layer 3 of a network, trace it through batch normalization, and explain what happens numerically when the network has 100 layers.
2. Architecture Reasoning
You will not just be asked "what is a CNN?" You will be asked "why did the authors of ResNet add skip connections, and can you prove mathematically that they help gradient flow?" or "why does the Transformer use multi-head attention instead of a single large attention matrix?"
3. Scale Awareness
Modern deep learning operates at scales that fundamentally change the game. You need to know what breaks when you go from 1 GPU to 1,000 GPUs, why batch normalization fails with small batch sizes, and why mixed-precision training matters for billion-parameter models.
Section Overview: All 11 Topics
Topic 01 - Backpropagation
The mathematical backbone of all neural network training. You will derive the chain rule on computational graphs, trace gradients through a 2-layer network by hand, understand vanishing and exploding gradients, and learn the difference between forward-mode and reverse-mode automatic differentiation. This is tested in some form at every company.
Key interview questions: "Derive the gradient of cross-entropy loss with respect to the weights in the first layer." "Why does reverse-mode AD dominate deep learning?"
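The hand derivation described above can be checked numerically. The following is a minimal sketch, assuming a tiny 2-layer network with a ReLU hidden layer and a squared-error loss (chosen for brevity instead of cross-entropy); the weights, input, and function names are illustrative only. It applies the chain rule layer by layer and compares one entry of the result against a central finite difference.

```python
def forward(W1, w2, x, y):
    """2-layer net: z = W1 x, h = relu(z), y_hat = w2 . h, loss = 0.5 * (y_hat - y)^2."""
    z = [sum(W1[i][j] * x[j] for j in range(len(x))) for i in range(len(W1))]
    h = [max(0.0, zi) for zi in z]
    y_hat = sum(w2[i] * h[i] for i in range(len(h)))
    loss = 0.5 * (y_hat - y) ** 2
    return z, h, y_hat, loss


def backward(W1, w2, x, y):
    """Reverse-mode pass: start at the loss and apply the chain rule backwards."""
    z, h, y_hat, _ = forward(W1, w2, x, y)
    d_yhat = y_hat - y                                   # dL/dy_hat
    d_w2 = [d_yhat * hi for hi in h]                     # dL/dw2 = dL/dy_hat * h
    d_h = [d_yhat * w for w in w2]                       # dL/dh = dL/dy_hat * w2
    d_z = [d_h[i] * (1.0 if z[i] > 0 else 0.0)           # ReLU passes gradient only where z > 0
           for i in range(len(z))]
    d_W1 = [[d_z[i] * xj for xj in x] for i in range(len(z))]  # outer product d_z x^T
    return d_W1, d_w2


def numerical_grad_W1(W1, w2, x, y, i, j, eps=1e-6):
    """Central finite difference for a single entry of W1."""
    plus = [row[:] for row in W1]
    minus = [row[:] for row in W1]
    plus[i][j] += eps
    minus[i][j] -= eps
    return (forward(plus, w2, x, y)[3] - forward(minus, w2, x, y)[3]) / (2 * eps)


# Illustrative values, not taken from any real model.
W1 = [[0.5, -0.3], [0.8, 0.2]]
w2 = [1.0, -2.0]
x = [1.0, 2.0]
y = 1.0

d_W1, d_w2 = backward(W1, w2, x, y)
check = numerical_grad_W1(W1, w2, x, y, 1, 0)
print(d_W1, d_w2, check)
```

Note that the first hidden unit has a negative pre-activation, so its row of the first-layer gradient is zero. This is the same mechanism behind the dying ReLU problem.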
Topic 02 - Activation Functions
The nonlinear functions that give neural networks their power. You will learn every major activation function (sigmoid through Mish), understand why ReLU revolutionized training, diagnose the dying ReLU problem, and explain why modern Transformers use GELU. Includes a decision flowchart for choosing activations.
Key interview questions: "Why not just use sigmoid everywhere?" "What is the dying ReLU problem and how do you fix it?" "Why does GPT use GELU?"
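To make the gradient tradeoffs concrete, here is a small sketch of three activations and their derivatives using only the standard library. It uses the exact GELU definition (input times the standard normal CDF); many frameworks also offer a tanh approximation, which is not shown here. The function names are illustrative.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # peaks at 0.25 when x = 0


def relu(x):
    return max(0.0, x)


def d_relu(x):
    return 1.0 if x > 0 else 0.0  # exactly zero for every negative input


def gelu(x):
    # Exact GELU: x * Phi(x), where Phi is the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def d_gelu(x):
    cdf = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return cdf + x * pdf          # product rule on x * Phi(x)


for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(x, round(d_sigmoid(x), 4), d_relu(x), round(d_gelu(x), 4))
```

The printed table shows the three behaviors interviewers usually ask about: the sigmoid derivative never exceeds 0.25, the ReLU derivative is a hard 0/1 switch, and the GELU derivative is smooth and nonzero for moderately negative inputs.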
Topic 03 - Convolutional Neural Networks
The architecture that dominated computer vision for a decade. You will master convolution math, trace the evolution from LeNet to ConvNeXt, derive output dimensions, understand receptive fields, explain skip connections mathematically, and design transfer learning strategies.
Key interview questions: "Calculate the output size of this conv layer." "Why can ResNets train at 152 layers when plain VGG-style networks degrade well before that depth?" "What are depthwise separable convolutions?"
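Output-size questions reduce to one formula: floor((n + 2p - k_eff) / s) + 1, where k_eff = d(k - 1) + 1 accounts for dilation. A minimal sketch follows; the helper name `conv_output_size` is hypothetical, and the formula is the standard one for a single spatial dimension.

```python
def conv_output_size(n, kernel, stride=1, padding=0, dilation=1):
    """Output length along one spatial dimension of a convolution."""
    effective_kernel = dilation * (kernel - 1) + 1
    return (n + 2 * padding - effective_kernel) // stride + 1


# A 3x3 conv with stride 2 and padding 1 on a 56x56 input.
print(conv_output_size(56, kernel=3, stride=2, padding=1))

# A 7x7 conv with stride 2 and padding 3 on a 224x224 input.
print(conv_output_size(224, kernel=7, stride=2, padding=3))
```

In an interview, state the formula first and then plug in numbers; the floor division is the step most candidates forget when the stride does not divide evenly.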
Topic 04 - RNNs and LSTMs
Sequential architectures and why they struggle with long-range dependencies. You will derive the RNN gradient and show where it vanishes, explain LSTM gating mechanisms mathematically, compare GRU simplifications, and understand why attention replaced recurrence.
Key interview questions: "Derive why vanilla RNNs have vanishing gradients." "Walk me through the LSTM cell - what does each gate do?" "Why did the field move away from RNNs?"
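The vanishing-gradient argument can be seen in a one-dimensional toy. The sketch below assumes a scalar RNN h_t = tanh(w * h_{t-1} + u * x_t) with illustrative values for w, u, and the inputs. The gradient of the final state with respect to the initial state is the product of the per-step local derivatives w * (1 - h_t^2), each of which has magnitude at most |w|.

```python
import math


def rnn_gradient_through_time(w, u, xs, h0=0.0):
    """Return d h_T / d h_0 for the scalar RNN h_t = tanh(w * h_{t-1} + u * x_t)."""
    h = h0
    grad = 1.0
    for x in xs:
        h = math.tanh(w * h + u * x)
        grad *= w * (1.0 - h * h)   # local Jacobian d h_t / d h_{t-1}
    return grad


xs = [0.5] * 50
short_grad = rnn_gradient_through_time(0.9, 1.0, xs[:5])
long_grad = rnn_gradient_through_time(0.9, 1.0, xs)
print(short_grad, long_grad)
```

With |w| < 1 the product shrinks geometrically with sequence length. With |w| > 1 and unsaturated units it can grow instead, which is the exploding-gradient side of the same derivation.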
Topic 05 - Attention Mechanism
The mechanism that replaced recurrence and enabled modern AI. You will derive scaled dot-product attention from first principles, explain the scaling factor, understand multi-head attention, compare self-attention and cross-attention, and analyze computational complexity.
Key interview questions: "Derive attention from scratch on the whiteboard." "Why divide by $\sqrt{d_k}$?" "What is the computational complexity of self-attention?"
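As a reference point for the whiteboard derivation, here is a minimal sketch of single-head scaled dot-product attention, softmax(Q K^T / sqrt(d_k)) V, written with plain lists so every step is visible. It omits masking, batching, and learned projections, and the function names are illustrative.

```python
import math


def softmax(scores):
    m = max(scores)                              # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """Q: n_q x d_k, K: n_k x d_k, V: n_k x d_v (lists of lists)."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted sum of the value vectors.
        outputs.append([sum(w[j] * V[j][c] for j in range(len(V)))
                        for c in range(len(V[0]))])
    return outputs, weights


Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out, attn = scaled_dot_product_attention(Q, K, V)
print(out)
print(attn)
```

The nested loops make the cost visible: every query is compared with every key, so the score computation is quadratic in sequence length.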
Topic 06 - Transformer Architecture
The architecture behind GPT, BERT, and virtually all modern AI. You will build a complete Transformer from components, understand positional encoding, compare encoder-only vs decoder-only vs encoder-decoder variants, and explain pre-norm vs post-norm.
Key interview questions: "Draw the full Transformer architecture and explain every component." "Why does GPT use decoder-only?" "How does positional encoding work?"
Topic 07 - Normalization Techniques
Techniques that stabilize and accelerate training. You will compare BatchNorm, LayerNorm, GroupNorm, InstanceNorm, and RMSNorm, understand which axes each normalizes over, and explain why Transformers use LayerNorm instead of BatchNorm.
Key interview questions: "Why does BatchNorm fail with small batch sizes?" "Why do Transformers use LayerNorm?" "What is RMSNorm and why is it used in LLaMA?"
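The differences between these techniques come down to which statistics are computed and over which axis. The sketch below is a simplified illustration that omits the learned gain and bias parameters and uses training-mode statistics only; the function names are hypothetical. LayerNorm and RMSNorm operate on one example's feature vector, while BatchNorm applies the same mean/variance math to each feature across the batch.

```python
import math


def layer_norm(x, eps=1e-5):
    """Subtract the mean and divide by the standard deviation of one feature vector."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def rms_norm(x, eps=1e-5):
    """Divide by the root mean square only; no mean subtraction."""
    mean_sq = sum(v * v for v in x) / len(x)
    return [v / math.sqrt(mean_sq + eps) for v in x]


def batch_norm(batch, eps=1e-5):
    """Same math as layer_norm, but applied to each feature across the batch axis."""
    columns = [list(col) for col in zip(*batch)]
    normed_columns = [layer_norm(col, eps) for col in columns]
    return [list(row) for row in zip(*normed_columns)]


x = [1.0, 2.0, 3.0, 4.0]
ln_out = layer_norm(x)
rms_out = rms_norm(x)
single_example_bn = batch_norm([[1.0, 2.0, 3.0]])   # batch size 1
print(ln_out, rms_out, single_example_bn)
```

The batch-size-1 call shows the failure mode in its extreme form: with a single example, each feature equals its own batch mean, so the normalized output carries no information.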
Topic 08 - Training Techniques
Advanced strategies for training deep networks effectively. You will master learning rate schedules (warmup, cosine annealing), gradient clipping, mixed precision training, data augmentation, label smoothing, and curriculum learning.
Key interview questions: "Design a training recipe for a 1B parameter model." "Why use warmup?" "Explain mixed precision training and where it can go wrong."
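A warmup-plus-cosine schedule is short enough to write out in full. This is one common formulation, sketched under the assumption of linear warmup followed by cosine decay to a floor; the function name and parameter values are illustrative, and real training recipes vary in the details.

```python
import math


def lr_at_step(step, base_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        # Ramp up so early, noisy gradients are applied with small steps.
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


schedule = [lr_at_step(s, base_lr=1e-3, warmup_steps=100, total_steps=1000)
            for s in range(1001)]
print(schedule[0], schedule[99], schedule[550], schedule[1000])
```

The peak learning rate is reached at the last warmup step, and the cosine phase decays smoothly from there, reaching the floor at `total_steps`.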
Topic 09 - Distributed Training
Scaling training across multiple GPUs and nodes. You will understand data parallelism, model parallelism, pipeline parallelism, tensor parallelism, and ZeRO optimization stages. Tested primarily at senior levels and at companies training large models.
Key interview questions: "How does data parallelism work with gradient synchronization?" "What are the three ZeRO stages?" "When do you use model parallelism vs data parallelism?"
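The ZeRO stages are easiest to remember as arithmetic. The sketch below assumes mixed-precision Adam, where each parameter costs 2 bytes for fp16 weights, 2 bytes for fp16 gradients, and 12 bytes of optimizer state (fp32 master weights, momentum, and variance). It counts model states only and ignores activations, buffers, and fragmentation; the function name and the example model size are illustrative.

```python
def zero_memory_gb(num_params, num_gpus, stage):
    """Approximate per-GPU model-state memory in GB for mixed-precision Adam."""
    params = 2.0 * num_params    # fp16 weights
    grads = 2.0 * num_params     # fp16 gradients
    optim = 12.0 * num_params    # fp32 master weights + Adam momentum + variance
    if stage >= 1:
        optim /= num_gpus        # Stage 1: shard optimizer states
    if stage >= 2:
        grads /= num_gpus        # Stage 2: also shard gradients
    if stage >= 3:
        params /= num_gpus       # Stage 3: also shard parameters
    return (params + grads + optim) / 1e9


for stage in range(4):
    print(stage, round(zero_memory_gb(7.5e9, 64, stage), 1))
```

Stage 0 here is plain data parallelism, where every GPU holds a full replica. Each later stage shards one more component, trading extra communication for memory.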
Topic 10 - Generative Models
Models that generate new data. You will understand VAEs (ELBO derivation), GANs (minimax game, mode collapse), diffusion models (forward/reverse process), and flow-based models. Increasingly tested as generative AI becomes central to the industry.
Key interview questions: "Derive the ELBO for VAEs." "Why do GANs suffer from mode collapse?" "Explain the diffusion process mathematically."
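For the ELBO question, it helps to have the decomposition written out. The following uses one common notation, which is an assumption rather than something fixed by this section: encoder $q_\phi(z \mid x)$, decoder $p_\theta(x \mid z)$, and prior $p(z)$.

```latex
\begin{aligned}
\log p_\theta(x)
  &= \underbrace{\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
     - \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big)}_{\text{ELBO}}
   + \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p_\theta(z \mid x)\big) \\
  &\geq \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
     - \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big)
\end{aligned}
```

The inequality holds because the last KL term is non-negative. Maximizing the ELBO therefore raises a lower bound on the log-likelihood, and the gap is exactly how far the approximate posterior is from the true one.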
Topic 11 - DL Interview Questions
Rapid-fire practice across all topics. Fifty questions with structured answers, scoring rubrics, and time targets. Use this for final preparation and timed practice sessions.
Common DL Interview Patterns
Across hundreds of DL interviews at top companies, certain patterns emerge repeatedly. Knowing these patterns helps you anticipate follow-up questions.
Pattern 1: "Derive It, Then Scale It"
The interviewer starts with a derivation question, then asks what happens at scale.
Example flow:
- "Derive backpropagation for a 2-layer network" (Backprop)
- "What happens to gradients in a 100-layer network?" (Vanishing/Exploding gradients)
- "How does ResNet solve this?" (Skip connections)
- "How do you train a 100-layer ResNet on 8 GPUs?" (Distributed training)
Preparation: For every concept, know both the small-scale math AND the large-scale engineering.
Pattern 2: "Why Not X?"
The interviewer proposes a suboptimal approach and asks you to argue against it.
Example flow:
- "Let's use sigmoid activations" -- Why not? (Vanishing gradients)
- "Let's use BatchNorm in our Transformer" -- Why not? (LayerNorm is better for variable-length sequences)
- "Let's train the entire ImageNet model from scratch with 1000 labeled images" -- Why not? (Transfer learning)
- "Let's use a single giant attention head" -- Why not? (Multi-head captures diverse relationships)
Preparation: For every design choice, know both what TO use and what NOT to use, with precise reasons.
Pattern 3: "Connect the Dots"
The interviewer asks you to link seemingly separate concepts.
Example flow:
- "How does the choice of activation function affect what initialization you should use?" (Activation + Init)
- "Why did the move from CNNs to Transformers also change us from BatchNorm to LayerNorm?" (Architecture + Normalization)
- "How does the attention mechanism relate to the vanishing gradient problem in RNNs?" (Attention + Gradient flow)
Preparation: Study the connections map above. Every edge in that graph is a potential interview question.
Pattern 4: "Debug This Training Run"
The interviewer describes symptoms and asks you to diagnose.
| Symptom | Likely Cause | From Topic |
|---|---|---|
| Loss is NaN after 100 steps | Exploding gradients, no gradient clipping | Backprop, Training |
| Loss plateaus at high value | Vanishing gradients, bad initialization, LR too low | Backprop, Activations |
| Training loss is 0, test loss is high | Overfitting - no regularization, too much capacity | Training techniques |
| Training loss oscillates wildly | LR too high, no warmup | Training techniques |
| 40% of neurons output zero always | Dying ReLU, LR too high or bad init | Activations |
| Model works on English, fails on German | Tokenizer issue, not an architecture issue | LLM-specific |
| Distributed training is 4x slower on 8 GPUs than 4 GPUs | Communication bottleneck, batch size issues | Distributed training |
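For the NaN and oscillating-loss rows, the usual first remedy is clipping gradients by their global norm. A minimal sketch follows, assuming the gradients have been flattened into a single list; the helper name is hypothetical, and deep learning frameworks provide their own utilities for this.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Rescale the whole gradient vector if its L2 norm exceeds max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm


clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
untouched, _ = clip_by_global_norm([0.3, 0.4], max_norm=1.0)
print(norm_before, clipped, untouched)
```

Clipping rescales the whole vector, so the update direction is preserved and only its length changes. Logging the pre-clip norm is also a cheap diagnostic: a sudden spike often precedes a NaN loss.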
Frequently Tested Cross-Topic Connections
These are the connections between topics that interviewers love to probe. If you can fluently explain these links, you demonstrate systems-level understanding.
Backpropagation + Activations
The gradient properties of activation functions directly determine whether backpropagation succeeds in deep networks. Sigmoid's maximum derivative of 0.25 causes exponential gradient decay with depth. ReLU's gradient of 1 (for positive inputs) solves this but creates dead neurons. GELU's smooth gradient near zero provides stability in Transformer feed-forward layers, which is where the activation is applied. The initialization scheme (Xavier vs He) must match the activation to maintain gradient magnitude across layers.
CNNs + Skip Connections + Normalization
ResNet's skip connections provide gradient highways (for $y = x + F(x)$, $\partial y / \partial x = I + \partial F / \partial x$, so the identity term always passes gradient through), but they are not sufficient alone. Batch normalization in each residual block keeps activations in a healthy range, preventing both the $F(x)$ branch from dominating (which would defeat the skip connection) and the activations from drifting. ConvNeXt showed that switching from BatchNorm to LayerNorm (borrowed from Transformers) further improves performance.
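A scalar toy makes the gradient-highway argument tangible. Assume each layer's branch has a small local derivative d. A plain stack multiplies these together, while a residual stack with y = x + F(x) multiplies (1 + d) instead. The derivative values below are invented for illustration and do not come from a real network.

```python
def chain_gradient(branch_derivs, residual):
    """Gradient of the last layer's output with respect to the first layer's input."""
    grad = 1.0
    for d in branch_derivs:
        # Plain layer contributes d; residual layer contributes 1 + d.
        grad *= (1.0 + d) if residual else d
    return grad


branch_derivs = [0.1, -0.1] * 25          # 50 layers with small branch derivatives
plain_grad = chain_gradient(branch_derivs, residual=False)
residual_grad = chain_gradient(branch_derivs, residual=True)
print(plain_grad, residual_grad)
```

The plain product collapses toward zero, while the residual product stays on the order of 1. The same toy also shows why the branch must stay well-behaved: if every d were large and positive, the residual product would grow instead, which is one reason normalization inside the block matters.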
RNNs + Attention + Transformers
This is a historical evolution driven by gradient flow. Vanilla RNNs suffer from vanishing gradients over long sequences (the product of Jacobians shrinks). LSTMs partially solve this with gated memory cells (additive gradient path). Attention fully solves it by creating a direct connection between any two positions ($O(1)$ gradient path length). Transformers remove recurrence entirely, using self-attention for all context - this enables parallelization and scales to billions of parameters.
Normalization + Training Techniques + Distributed Training
BatchNorm's behavior changes with batch size: its statistics are reliable with large batches and noisy with small ones. In distributed training, the per-GPU batch size shrinks when a fixed global batch is split across more GPUs, and by default each GPU computes BatchNorm statistics only over its local shard. This is why: (1) large-scale training often uses SyncBatchNorm (synchronize statistics across GPUs), (2) Transformers use LayerNorm (batch-size independent), and (3) LLaMA uses RMSNorm (simpler than LayerNorm, works at massive scale).
Generative Models + Training Techniques + Distributed Training
Training generative models at scale requires the full toolkit: mixed precision training (FP16/BF16 for memory), gradient checkpointing (recompute activations to fit larger models), ZeRO optimization (shard optimizer states across GPUs), and careful learning rate scheduling (warmup to avoid early instability, cosine decay for convergence). A question like "describe how you would train a diffusion model from scratch" tests all three topics simultaneously.
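One place mixed precision goes wrong is gradient underflow in FP16, which loss scaling addresses. The sketch below simulates half precision with the standard library's `struct` module (format code `"e"` is IEEE 754 binary16). The gradient value and the loss scale are illustrative assumptions, and real frameworks adjust the scale dynamically rather than fixing it.

```python
import struct


def to_fp16(x):
    """Round a Python float to IEEE half precision and back."""
    return struct.unpack("e", struct.pack("e", x))[0]


tiny_grad = 1e-8                                # below the smallest fp16 subnormal

unscaled = to_fp16(tiny_grad)                   # the gradient is lost in fp16

loss_scale = 65536.0                            # scaling the loss scales every gradient
scaled = to_fp16(tiny_grad * loss_scale)        # now inside fp16's representable range
recovered = scaled / loss_scale                 # unscale in higher precision before the update

print(unscaled, scaled, recovered)
```

BF16 keeps the FP32 exponent range at the cost of mantissa bits, which is why BF16 training usually does not need loss scaling.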
Self-Assessment Practice: Quick Diagnostic Questions
Before you begin studying, try answering these 10 questions. If you can answer 7+ confidently, you may be able to skip some foundational topics. If fewer than 4, start from the beginning.
| # | Question | Topic Tested |
|---|---|---|
| 1 | Derive $\partial L / \partial W_1$ for a 2-layer network with ReLU. | Backprop |
| 2 | Why does GELU outperform ReLU in Transformers? Give 2 reasons. | Activations |
| 3 | Calculate the output size of a 3x3 conv with stride 2, padding 1 on a 56x56 input. | CNNs |
| 4 | Why can't vanilla RNNs model long-range dependencies? Be specific. | RNNs |
| 5 | Derive scaled dot-product attention. Why divide by $\sqrt{d_k}$? | Attention |
| 6 | What are the 3 components of a Transformer encoder block? | Transformers |
| 7 | Why do Transformers use LayerNorm instead of BatchNorm? | Normalization |
| 8 | What is learning rate warmup and why is it necessary? | Training |
| 9 | Explain ZeRO Stage 2 in one sentence. | Distributed |
| 10 | What is the ELBO in VAEs and why do we maximize it? | Generative |
Scoring:
- 0-3 correct: Start from Topic 01 and work sequentially
- 4-6 correct: You have foundations - focus on weak topics and connections
- 7-9 correct: You are nearly interview-ready - focus on practice problems and speed
- 10 correct: Move directly to the rapid-fire section and mock interviews
Interview Cheat Sheet - Section-Level Reference
| Concept | One-Sentence Summary | When It Is Asked |
|---|---|---|
| Backpropagation | Chain rule on computational graphs - reverse-mode AD computes all gradients in one backward pass | Phone screen, on-site |
| Activation functions | Nonlinear functions enabling universal approximation - ReLU for CNNs, GELU for Transformers | Phone screen |
| CNNs | Local connectivity + weight sharing for spatial data - ResNet skip connections enable deep training | On-site, system design |
| RNNs/LSTMs | Sequential processing with hidden state - LSTM gates mitigate vanishing gradients | Phone screen (less common now) |
| Attention | Weighted sum based on query-key similarity - $O(n^2 d)$ complexity in sequence length $n$ | On-site, core for LLM roles |
| Transformers | Self-attention + FFN in encoder/decoder blocks - parallelizable, scalable | On-site, core for all roles |
| Normalization | Stabilize activations across layers - LayerNorm for Transformers, BatchNorm for CNNs | On-site |
| Training techniques | LR schedules, gradient clipping, mixed precision - practical engineering for convergence | On-site, system design |
| Distributed training | Split data/model across GPUs - AllReduce for data parallel, ZeRO for memory efficiency | Senior on-site |
| Generative models | Learn data distribution to sample new examples - VAE, GAN, Diffusion, Flow | Specialized roles |
Spaced Repetition Checkpoints
Use this schedule to reinforce your learning. Each checkpoint lists what you should be able to do from memory.
Day 0 - After First Read
- Draw the topic dependency diagram from memory
- List all 11 topics and their one-sentence summaries
- Identify your 3 weakest topics from the self-assessment
- State which topics your target company emphasizes
Day 3 - First Review
- For each topic, state the most common interview question
- Explain the difference between topics that are often confused (BatchNorm vs LayerNorm, RNN vs LSTM, attention vs self-attention)
- Recite your study path and current progress
Day 7 - Connections Review
- Explain how backpropagation connects to vanishing gradients, which connects to activation choice, which connects to skip connections
- Trace how a single training step works end-to-end: forward pass, loss computation, backward pass, gradient update
- Explain why Transformers use: GELU (not ReLU), LayerNorm (not BatchNorm), multi-head attention (not single-head)
Day 14 - Interview Simulation
- Complete 5 practice problems from different topics under time pressure (10 minutes each)
- Give a 60-second answer for each of the 11 topics
- Design a complete architecture for a given task, justifying every component choice
Day 21 - Final Calibration
- Do a full mock interview covering 4-5 deep learning topics in 45 minutes
- Identify any remaining weak spots and do targeted review
- Practice transitioning between topics smoothly (e.g., "the vanishing gradient problem in RNNs motivated the attention mechanism, which...")
Prerequisites from ML Fundamentals
This section builds directly on concepts from ML Fundamentals. Make sure you are comfortable with:
| Prerequisite | From Topic | Why You Need It |
|---|---|---|
| Gradient descent and optimization | Optimization | Backpropagation computes gradients that optimizers consume |
| Loss functions (cross-entropy, MSE) | Loss Functions | Every derivation starts from the loss |
| Regularization (L1, L2, dropout) | Regularization | Deep networks need regularization - dropout, weight decay, data augmentation |
| Bias-variance tradeoff | Bias-Variance | Understanding overfitting in deep networks requires this framework |
| Evaluation metrics | Evaluation Metrics | You cannot train without knowing what you are optimizing |
What Comes Next
Once you have completed the Deep Learning section, you will be ready for:
- LLM Interviews - Builds directly on Transformers, attention, training techniques, and distributed training
- ML System Design - Applies architectural decisions and training strategies to real-world systems
- Paper Discussion - Many discussed papers are deep learning architecture papers (ResNet, Transformer, BERT, GPT)
Start with Backpropagation - the mathematical foundation that everything else builds on.
