Deep Learning for Interviews - Your Complete Roadmap

Reading time: ~25 min | Interview relevance: Critical | Roles: MLE, AI Eng, Research Engineer, Computer Vision Eng, NLP Eng

The Real Interview Moment

You are forty-five minutes into a Meta MLE on-site loop. The interviewer, a staff research engineer who has published at NeurIPS, slides a whiteboard marker to you and says: "Design me a neural network for this task. Walk me through every decision - architecture, activations, normalization, training strategy - and justify each choice." You know what a ResNet is. You know what batch normalization does. But when the interviewer asks "Why did you choose ReLU over GELU here?" or "What happens to the gradients in layer 47 during the backward pass?" - you realize that surface-level knowledge is not enough.

Deep learning interviews at top companies are not trivia contests. They test whether you have a unified mental model of how neural networks work - from the mathematics of backpropagation through the engineering of distributed training. The interviewer wants to see you reason through tradeoffs: Why can a ResNet train at 152 layers when plain VGG-style stacks degrade long before that depth? Why do Transformers use layer norm instead of batch norm? Why do language models favor GELU over ReLU? Each question probes a different node in the same interconnected graph of deep learning knowledge.

This section gives you that graph. Eleven topics, organized in dependency order, with the exact depth and connections you need to navigate any deep learning interview question with confidence and precision.

What You Will Master

After completing this section, you will be able to:

  • Derive backpropagation from first principles and trace gradients through any computational graph
  • Choose activation functions with mathematical justification for why GELU dominates Transformers and ReLU dominates CNNs
  • Design convolutional architectures and explain the evolution from LeNet to ConvNeXt
  • Explain recurrent architectures and why LSTMs mitigate the vanishing gradient problem that vanilla RNNs cannot avoid
  • Derive the attention mechanism from scratch and explain multi-head attention, cross-attention, and masked attention
  • Build a Transformer from components - embedding, positional encoding, multi-head attention, FFN, layer norm
  • Compare normalization techniques (batch, layer, group, RMS) and justify when to use each
  • Apply advanced training techniques - learning rate schedules, gradient clipping, mixed precision, label smoothing
  • Design distributed training strategies - data parallelism, model parallelism, pipeline parallelism, ZeRO
  • Explain generative models - VAEs, GANs, diffusion models, and flow-based models
  • Answer rapid-fire deep learning questions with structured, interview-ready responses
  • Connect all topics into a unified framework for architectural decision-making

Self-Assessment: Where Are You Now?

Before diving in, honestly rate yourself on each topic. This will help you prioritize your study time.

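| Topic | 1 -- Cannot Explain | 2 -- Vaguely Recall | 3 -- Can Explain | 4 -- Can Derive/Implement | 5 -- Can Teach | Your Score |
|---|---|---|---|---|---|---|
| Backpropagation & chain rule | | | | | | ___ |
| Activation functions & tradeoffs | | | | | | ___ |
| CNN architectures & design | | | | | | ___ |
| RNNs, LSTMs, GRUs | | | | | | ___ |
| Attention mechanisms | | | | | | ___ |
| Transformer architecture | | | | | | ___ |
| Normalization techniques | | | | | | ___ |
| Training techniques | | | | | | ___ |
| Distributed training | | | | | | ___ |
| Generative models | | | | | | ___ |
| Interview-style rapid fire | | | | | | ___ |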

Scoring guide:

  • Total 11-22: You need the full section. Start from the beginning and work sequentially.
  • Total 23-33: You have foundations but gaps. Use the dependency map below to identify and fill weak spots.
  • Total 34-44: You are close to interview-ready. Focus on practice problems and the rapid-fire section.
  • Total 45-55: You are ready. Do one final pass on the cheat sheets and spaced repetition checkpoints.

Topic Dependency Map

The eleven topics in this section are not independent. Understanding Transformers requires attention, which requires knowing why RNNs struggle, which requires understanding vanishing gradients from backpropagation. The diagram below shows the dependency structure.

[Diagram: Topic Dependency Map - 11 Deep Learning Topics]

Legend:

  • Green (Topics 01-02): Foundational - must be solid before anything else
  • Yellow (Topics 03-05): Core architectures - the building blocks
  • Blue (Topics 06-08): Advanced architecture and training - where interviews go deep
  • Red (Topics 09-11): Specialized - tested at senior levels and specific roles

Study Paths by Role and Timeline

Not every role tests every topic at the same depth. Use the paths below to prioritize.

Path 1: MLE Generalist (Google, Meta, Amazon) - 3 weeks

| Week | Topics | Focus |
|---|---|---|
| Week 1 | Backprop, Activations, CNNs | Derive backprop by hand. Know activation tradeoffs cold. Explain ResNet skip connections. |
| Week 2 | RNNs, Attention, Transformers | LSTM gating math. Derive scaled dot-product attention. Build a Transformer on the whiteboard. |
| Week 3 | Normalization, Training, Distributed, Generative | BatchNorm vs LayerNorm. LR schedules. Data parallelism vs model parallelism. VAE vs GAN. |

Path 2: NLP/LLM Engineer (OpenAI, Anthropic, Cohere) - 2 weeks

| Week | Topics | Focus |
|---|---|---|
| Week 1 | Backprop, Activations, Attention, Transformers | GELU specifically. Multi-head attention derivation. Positional encoding. Full Transformer build. |
| Week 2 | Normalization, Training, Distributed, Generative | Pre-norm vs post-norm. Mixed precision. ZeRO stages. Autoregressive vs masked LM. |

Path 3: Computer Vision Engineer (Tesla, Apple, NVIDIA) - 2 weeks

| Week | Topics | Focus |
|---|---|---|
| Week 1 | Backprop, Activations, CNNs, Normalization | Conv math (stride, padding, output size). Architecture evolution. GroupNorm for small batches. |
| Week 2 | Training, Distributed, Generative, Transformers | Transfer learning. Multi-GPU training. Diffusion models. Vision Transformers. |

Path 4: Research Engineer (DeepMind, FAIR, Brain) - 4 weeks

| Week | Topics | Focus |
|---|---|---|
| Week 1 | Backprop, Activations | Full derivations. Automatic differentiation (forward vs reverse mode). Activation landscapes. |
| Week 2 | CNNs, RNNs, Attention | Every architecture variant. Receptive field math. Attention alternatives (linear, sparse). |
| Week 3 | Transformers, Normalization, Training | Architecture ablations. RMSNorm. Curriculum learning. Hyperparameter sensitivity. |
| Week 4 | Distributed, Generative, Interview Questions | Pipeline parallelism. Diffusion math. Timed practice on all topics. |

Path 5: Startup ML Engineer - 1 week crash course

| Day | Topics | Focus |
|---|---|---|
| Day 1-2 | Backprop, Activations, CNNs | Enough to reason about architectures and debug training issues. |
| Day 3-4 | Attention, Transformers, Training | Practical Transformer usage. Fine-tuning strategies. LR schedules. |
| Day 5 | Interview Questions + Practice | Rapid-fire answers. System design integration. |

Topic-to-Company Mapping

Different companies emphasize different deep learning topics. This table shows what to expect.

| Topic | Google | Meta | Amazon | Apple | OpenAI/Anthropic | NVIDIA | Startups |
|---|---|---|---|---|---|---|---|
| Backpropagation | Derive on whiteboard | Derive on whiteboard | Conceptual | Conceptual | Derive + implement | Derive + CUDA | Conceptual |
| Activation Functions | Tradeoff analysis | Tradeoff analysis | Basic knowledge | Basic knowledge | GELU deep dive | Kernel optimization | Basic knowledge |
| CNNs | Architecture design | Architecture design | Transfer learning | On-device CNNs | Vision backbone | CUDA kernels | Transfer learning |
| RNNs & LSTMs | Full derivation | Full derivation | Sequence modeling | On-device RNNs | Historical context | Rarely tested | Rarely tested |
| Attention | Derive from scratch | Derive from scratch | Conceptual | Conceptual | Derive + variants | Kernel optimization | Conceptual |
| Transformers | Full architecture | Full architecture | Usage-level | Usage-level | Deep architectural | Optimization | Usage-level |
| Normalization | BN vs LN tradeoffs | BN vs LN tradeoffs | Basic knowledge | GroupNorm focus | RMSNorm deep dive | Fused kernels | Basic knowledge |
| Training Techniques | LR schedules, regularization | LR schedules, augmentation | Practical tuning | Quantization focus | Large-scale training | Mixed precision | Practical tuning |
| Distributed Training | Tested at senior+ | Tested at senior+ | Tested at senior+ | Rarely tested | Core competency | Core competency | Rarely tested |
| Generative Models | GAN/Diffusion theory | GAN/Diffusion theory | Rarely tested | On-device generation | Core competency | Inference optimization | Application-level |

Interviewer's Perspective

"When I interview for MLE roles, I use deep learning questions as a calibration tool. A candidate who can only recite definitions gets a lean no-hire. A candidate who can derive backprop, explain why ResNet skip connections help gradient flow, and reason about when to use LayerNorm vs BatchNorm - that candidate demonstrates the kind of first-principles thinking we need. I am not looking for memorization. I am looking for understanding."

  • Staff MLE, Google Brain

What Makes Deep Learning Interviews Different

Deep learning interviews differ from classical ML interviews in three important ways:

1. Derivation Depth

Classical ML interviews might ask you to explain regularization conceptually. Deep learning interviews ask you to derive the gradient of the loss with respect to weights in layer 3 of a network, trace it through batch normalization, and explain what happens numerically when the network has 100 layers.

2. Architecture Reasoning

You will not just be asked "what is a CNN?" You will be asked "why did the authors of ResNet add skip connections, and can you prove mathematically that they help gradient flow?" or "why does the Transformer use multi-head attention instead of a single large attention matrix?"

3. Scale Awareness

Modern deep learning operates at scales that fundamentally change the game. You need to know what breaks when you go from 1 GPU to 1,000 GPUs, why batch normalization fails with small batch sizes, and why mixed-precision training matters for billion-parameter models.

[Figure: Classical ML vs Deep Learning Interviews]

Section Overview: All 11 Topics

Topic 01 - Backpropagation

The mathematical backbone of all neural network training. You will derive the chain rule on computational graphs, trace gradients through a 2-layer network by hand, understand vanishing and exploding gradients, and learn the difference between forward-mode and reverse-mode automatic differentiation. This is tested in some form at every company.

Key interview questions: "Derive the gradient of cross-entropy loss with respect to the weights in the first layer." "Why does reverse-mode AD dominate deep learning?"
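To make the first question concrete, the following is a minimal sketch of a hand-derived first-layer gradient, checked against finite differences. The network (two inputs, two ReLU hidden units, one sigmoid output with binary cross-entropy), the weight values, and the function names `forward`, `grad_W1`, and `numeric_grad_W1` are all illustrative assumptions, not the derivation used later in this section.

```python
import math

def forward(W1, w2, x, y):
    """Loss for h = relu(W1 x), p = sigmoid(w2 . h), binary cross-entropy."""
    a = [sum(W1[j][i] * x[i] for i in range(len(x))) for j in range(len(W1))]
    h = [max(0.0, aj) for aj in a]
    z = sum(w2[j] * h[j] for j in range(len(h)))
    p = 1.0 / (1.0 + math.exp(-z))
    loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return loss, (a, p)

def grad_W1(W1, w2, x, y):
    """Chain rule: dL/dW1[j][i] = (p - y) * w2[j] * 1[a_j > 0] * x[i]."""
    _, (a, p) = forward(W1, w2, x, y)
    dz = p - y  # dL/dz for sigmoid followed by cross-entropy
    da = [dz * w2[j] * (1.0 if a[j] > 0 else 0.0) for j in range(len(a))]
    return [[da[j] * x[i] for i in range(len(x))] for j in range(len(a))]

def numeric_grad_W1(W1, w2, x, y, eps=1e-6):
    """Central finite differences, used only to check the analytic result."""
    grad = [[0.0] * len(x) for _ in W1]
    for j in range(len(W1)):
        for i in range(len(x)):
            W1[j][i] += eps
            up, _ = forward(W1, w2, x, y)
            W1[j][i] -= 2 * eps
            down, _ = forward(W1, w2, x, y)
            W1[j][i] += eps  # restore the original weight
            grad[j][i] = (up - down) / (2 * eps)
    return grad

# Hypothetical weights: hidden unit 0 is active, hidden unit 1 is inactive.
W1 = [[0.5, 0.3], [-0.8, 0.2]]
w2 = [1.0, -2.0]
x, y = [1.0, 2.0], 1.0
analytic = grad_W1(W1, w2, x, y)
numeric = numeric_grad_W1(W1, w2, x, y)
```

The row of `analytic` belonging to the inactive ReLU unit is exactly zero, which is the same mechanism behind the dying ReLU problem covered in Topic 02.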

Topic 02 - Activation Functions

The nonlinear functions that give neural networks their power. You will learn every major activation function (sigmoid through Mish), understand why ReLU revolutionized training, diagnose the dying ReLU problem, and explain why modern Transformers use GELU. Includes a decision flowchart for choosing activations.

Key interview questions: "Why not just use sigmoid everywhere?" "What is the dying ReLU problem and how do you fix it?" "Why does GPT use GELU?"

Topic 03 - Convolutional Neural Networks

The architecture that dominated computer vision for a decade. You will master convolution math, trace the evolution from LeNet to ConvNeXt, derive output dimensions, understand receptive fields, explain skip connections mathematically, and design transfer learning strategies.

Key interview questions: "Calculate the output size of this conv layer." "Why do ResNets train at 152 layers when plain VGG-style networks degrade well before that depth?" "What are depthwise separable convolutions?"
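The output-size question reduces to one formula per spatial dimension. The sketch below assumes the common convention of symmetric zero padding and floor division; the function name `conv_output_size` and the example layer shapes are illustrative.

```python
def conv_output_size(n, kernel, stride=1, padding=0, dilation=1):
    """Output length along one spatial dimension of a convolution.

    Uses floor((n + 2*padding - effective_kernel) / stride) + 1, where the
    effective kernel accounts for dilation.
    """
    effective_kernel = dilation * (kernel - 1) + 1
    return (n + 2 * padding - effective_kernel) // stride + 1

# A 7x7 kernel with stride 2 and padding 3 halves a 224-pixel side.
print(conv_output_size(224, kernel=7, stride=2, padding=3))  # 112
# A 5x5 kernel with no padding trims 4 pixels from a 32-pixel side.
print(conv_output_size(32, kernel=5))                        # 28
```

For a 2D layer, apply the function once to the height and once to the width.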

Topic 04 - RNNs and LSTMs

Sequential architectures and why they struggle with long-range dependencies. You will derive the RNN gradient and show where it vanishes, explain LSTM gating mechanisms mathematically, compare GRU simplifications, and understand why attention replaced recurrence.

Key interview questions: "Derive why vanilla RNNs have vanishing gradients." "Walk me through the LSTM cell - what does each gate do?" "Why did the field move away from RNNs?"

Topic 05 - Attention Mechanism

The mechanism that replaced recurrence and enabled modern AI. You will derive scaled dot-product attention from first principles, explain the scaling factor, understand multi-head attention, compare self-attention and cross-attention, and analyze computational complexity.

Key interview questions: "Derive attention from scratch on the whiteboard." "Why divide by the square root of $d_k$?" "What is the computational complexity of self-attention?"
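As a reference point for the whiteboard derivation, here is a minimal sketch of scaled dot-product attention in plain Python. The toy matrices `Q`, `K`, `V` and the helper names are assumptions made for illustration; a real implementation would use batched tensor operations.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, computed one query row at a time."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # One score per key: n queries x n keys x d_k multiplies overall.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        weights.append(w)
        outputs.append(
            [sum(w[j] * V[j][c] for j in range(len(V))) for c in range(len(V[0]))]
        )
    return outputs, weights

# Toy example: 2 queries attending over 3 key/value pairs, d_k = 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
out, weights = attention(Q, K, V)
```

The nested loop over queries and keys is where the quadratic cost in sequence length comes from. The division by `math.sqrt(d_k)` keeps the score variance roughly constant as `d_k` grows, since a dot product of `d_k` independent unit-variance terms has variance `d_k`; without it the softmax saturates.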

Topic 06 - Transformer Architecture

The architecture behind GPT, BERT, and virtually all modern AI. You will build a complete Transformer from components, understand positional encoding, compare encoder-only vs decoder-only vs encoder-decoder variants, and explain pre-norm vs post-norm.

Key interview questions: "Draw the full Transformer architecture and explain every component." "Why does GPT use decoder-only?" "How does positional encoding work?"
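For the positional encoding question, one answer worth knowing is the fixed sinusoidal scheme from the original Transformer paper. The sketch below assumes that scheme (learned and rotary encodings are alternatives it does not cover); the function name `sinusoidal_position` is illustrative.

```python
import math

def sinusoidal_position(pos, d_model):
    """Encoding vector for one position: sin on even indices, cos on odd.

    Each sin/cos pair shares a wavelength that grows geometrically
    with the dimension index, controlled by the base 10000.
    """
    pe = []
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        pe.append(math.sin(angle))
        if i + 1 < d_model:
            pe.append(math.cos(angle))
    return pe

# Position 0 encodes as alternating 0 (sin) and 1 (cos).
print(sinusoidal_position(0, 4))
```

The vector is added to the token embedding at the same position, which is how an otherwise order-agnostic self-attention layer receives position information.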

Topic 07 - Normalization Techniques

Techniques that stabilize and accelerate training. You will compare BatchNorm, LayerNorm, GroupNorm, InstanceNorm, and RMSNorm, understand which axes each normalizes over, and explain why Transformers use LayerNorm instead of BatchNorm.

Key interview questions: "Why does BatchNorm fail with small batch sizes?" "Why do Transformers use LayerNorm?" "What is RMSNorm and why is it used in LLaMA?"
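The difference between these techniques comes down to which axis the statistics are computed over. The following sketch shows that in plain Python, omitting the learned scale and shift parameters for brevity; the function names and the toy batch are assumptions for illustration.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one example across its own features."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def rms_norm(x, eps=1e-5):
    """Rescale by root-mean-square only; the mean is not subtracted."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def batch_norm(batch, eps=1e-5):
    """Normalize each feature across the batch (training-mode statistics)."""
    n = len(batch)
    cols = list(zip(*batch))
    means = [sum(c) / n for c in cols]
    variances = [sum((v - m) ** 2 for v in c) / n for c, m in zip(cols, means)]
    return [
        [(v - m) / math.sqrt(s + eps) for v, m, s in zip(row, means, variances)]
        for row in batch
    ]

batch = [[1.0, 2.0, 3.0], [4.0, 6.0, 8.0]]
ln = [layer_norm(row) for row in batch]  # each row depends only on itself
bn = batch_norm(batch)                   # each row depends on the other rows
single = batch_norm([batch[0]])          # batch of one: every output is zero
```

The `single` case illustrates the small-batch failure mode in its extreme form: with one example, the batch mean equals the example, so all information is normalized away, while `layer_norm` and `rms_norm` are unaffected by batch size.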

Topic 08 - Training Techniques

Advanced strategies for training deep networks effectively. You will master learning rate schedules (warmup, cosine annealing), gradient clipping, mixed precision training, data augmentation, label smoothing, and curriculum learning.

Key interview questions: "Design a training recipe for a 1B parameter model." "Why use warmup?" "Explain mixed precision training and where it can go wrong."
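A warmup-plus-cosine schedule is short enough to write out in full. The sketch below is one common formulation, with linear warmup followed by cosine decay to a floor; the function name `lr_at` and every default value (peak rate, step counts, floor) are hypothetical and not a recommended recipe.

```python
import math

def lr_at(step, peak_lr=3e-4, warmup_steps=1000, total_steps=10000, min_lr=3e-5):
    """Learning rate at a given step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        # Ramp linearly so early, noisy gradients take small steps.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)  # hold at min_lr past the final step
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

for step in (0, 500, 1000, 5500, 10000):
    print(step, f"{lr_at(step):.2e}")
```

In an interview, the useful part is explaining each piece: warmup guards against instability while optimizer statistics are still poorly estimated, and the cosine tail lets the model settle rather than stopping at a high rate.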

Topic 09 - Distributed Training

Scaling training across multiple GPUs and nodes. You will understand data parallelism, model parallelism, pipeline parallelism, tensor parallelism, and ZeRO optimization stages. Tested primarily at senior levels and at companies training large models.

Key interview questions: "How does data parallelism work with gradient synchronization?" "What are the three ZeRO stages?" "When do you use model parallelism vs data parallelism?"
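The three ZeRO stages are easiest to remember as a memory calculation. The sketch below follows the accounting commonly used for mixed-precision Adam (2 bytes per parameter for FP16 weights, 2 for FP16 gradients, 12 for FP32 optimizer state) and ignores activations and temporary buffers. The function name `zero_memory_gb` and the example model size and GPU count are assumptions for illustration.

```python
def zero_memory_gb(num_params, num_gpus, stage):
    """Approximate per-GPU model-state memory in GB for ZeRO stages 0-3.

    Stage 0 is plain data parallelism, where every GPU holds everything.
    """
    params = 2 * num_params   # FP16 weights
    grads = 2 * num_params    # FP16 gradients
    opt = 12 * num_params     # FP32 master weights, momentum, variance
    if stage >= 1:
        opt /= num_gpus       # Stage 1: shard optimizer states
    if stage >= 2:
        grads /= num_gpus     # Stage 2: also shard gradients
    if stage >= 3:
        params /= num_gpus    # Stage 3: also shard parameters
    return (params + grads + opt) / 1e9

# Hypothetical setup: 7.5B parameters on 64 GPUs.
for stage in range(4):
    print(f"stage {stage}: {zero_memory_gb(7.5e9, 64, stage):.1f} GB per GPU")
```

Because optimizer state is the largest of the three terms under these assumptions, Stage 1 alone gives most of the saving; Stages 2 and 3 trade extra communication for the remainder.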

Topic 10 - Generative Models

Models that generate new data. You will understand VAEs (ELBO derivation), GANs (minimax game, mode collapse), diffusion models (forward/reverse process), and flow-based models. Increasingly tested as generative AI becomes central to the industry.

Key interview questions: "Derive the ELBO for VAEs." "Why do GANs suffer from mode collapse?" "Explain the diffusion process mathematically."
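For the first question, the standard route to the ELBO is a single application of Jensen's inequality. The sketch below assumes the usual VAE notation, which this overview does not itself define: encoder $q_\phi(z \mid x)$, decoder $p_\theta(x \mid z)$, and prior $p(z)$.

```latex
\begin{align}
\log p_\theta(x)
  &= \log \int p_\theta(x, z)\, dz
   = \log \mathbb{E}_{q_\phi(z \mid x)}\!\left[\frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right] \\
  &\geq \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right]
   \quad \text{(Jensen's inequality)} \\
  &= \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
   - D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z)\big)
   = \mathrm{ELBO}(\theta, \phi; x)
\end{align}
```

The gap between the two sides is $D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p_\theta(z \mid x)\big) \geq 0$, so maximizing the ELBO both raises a lower bound on the log-likelihood and pulls the encoder toward the true posterior.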

Topic 11 - DL Interview Questions

Rapid-fire practice across all topics. Fifty questions with structured answers, scoring rubrics, and time targets. Use this for final preparation and timed practice sessions.

Common DL Interview Patterns

Across hundreds of DL interviews at top companies, certain patterns emerge repeatedly. Knowing these patterns helps you anticipate follow-up questions.

Pattern 1: "Derive It, Then Scale It"

The interviewer starts with a derivation question, then asks what happens at scale.

Example flow:

  1. "Derive backpropagation for a 2-layer network" (Backprop)
  2. "What happens to gradients in a 100-layer network?" (Vanishing/Exploding gradients)
  3. "How does ResNet solve this?" (Skip connections)
  4. "How do you train a 100-layer ResNet on 8 GPUs?" (Distributed training)

Preparation: For every concept, know both the small-scale math AND the large-scale engineering.

Pattern 2: "Why Not X?"

The interviewer proposes a suboptimal approach and asks you to argue against it.

Example flow:

  1. "Let's use sigmoid activations" -- Why not? (Vanishing gradients)
  2. "Let's use BatchNorm in our Transformer" -- Why not? (LayerNorm is better for variable-length sequences)
  3. "Let's train an ImageNet-scale model from scratch with 1000 labeled images" -- Why not? (Transfer learning)
  4. "Let's use a single giant attention head" -- Why not? (Multi-head captures diverse relationships)

Preparation: For every design choice, know both what TO use and what NOT to use, with precise reasons.

Pattern 3: "Connect the Dots"

The interviewer asks you to link seemingly separate concepts.

Example flow:

  1. "How does the choice of activation function affect what initialization you should use?" (Activation + Init)
  2. "Why did the move from CNNs to Transformers also shift the default from BatchNorm to LayerNorm?" (Architecture + Normalization)
  3. "How does the attention mechanism relate to the vanishing gradient problem in RNNs?" (Attention + Gradient flow)

Preparation: Study the connections map above. Every edge in that graph is a potential interview question.

Pattern 4: "Debug This Training Run"

The interviewer describes symptoms and asks you to diagnose.

| Symptom | Likely Cause | From Topic |
|---|---|---|
| Loss is NaN after 100 steps | Exploding gradients, no gradient clipping | Backprop, Training |
| Loss plateaus at high value | Vanishing gradients, bad initialization, LR too low | Backprop, Activations |
| Training loss is 0, test loss is high | Overfitting - no regularization, too much capacity | Training techniques |
| Training loss oscillates wildly | LR too high, no warmup | Training techniques |
| 40% of neurons output zero always | Dying ReLU, LR too high or bad init | Activations |
| Model works on English, fails on German | Tokenizer issue, not an architecture issue | LLM-specific |
| Distributed training is 4x slower on 8 GPUs than 4 GPUs | Communication bottleneck, batch size issues | Distributed training |
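For the first symptom in the table, the usual remedy is gradient clipping by global norm. The following is a minimal sketch that treats each parameter tensor as a flat list of floats; the function name `clip_by_global_norm` and the example gradients are illustrative assumptions.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Rescale all gradients together so their combined L2 norm is at most max_norm.

    Returns the clipped gradients and the norm measured before clipping,
    which is worth logging when diagnosing instability.
    """
    total_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    scale = min(1.0, max_norm / (total_norm + eps))
    return [[g * scale for g in tensor] for tensor in grads], total_norm

# Two hypothetical parameter tensors whose global norm is 13.
grads = [[3.0, 4.0], [12.0]]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
```

Scaling every tensor by the same factor preserves the gradient direction, which is why global-norm clipping is generally preferred over clipping each value independently.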

Frequently Tested Cross-Topic Connections

These are the connections between topics that interviewers love to probe. If you can fluently explain these links, you demonstrate systems-level understanding.

Backpropagation + Activations

The gradient properties of activation functions directly determine whether backpropagation succeeds in deep networks. Sigmoid's maximum derivative of 0.25 causes exponential gradient decay. ReLU's gradient of 1 (for positive inputs) solves this but creates dead neurons. GELU's smooth gradient near zero provides stability in the feed-forward layers of Transformers, where it is typically applied. The initialization scheme (Xavier vs He) must match the activation to maintain gradient magnitude across layers.
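The exponential decay claim is easy to verify numerically. The sketch below multiplies per-layer activation derivatives while ignoring the weight matrices, which is a deliberate simplification; the depth of 20 and the helper names are illustrative assumptions.

```python
import math

def sigmoid_grad(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)  # peaks at 0.25 when z = 0

def relu_grad(z):
    return 1.0 if z > 0 else 0.0

def chain_factor(local_grad, depth):
    """Product of identical per-layer activation derivatives (weights ignored)."""
    return local_grad ** depth

best_case_sigmoid = chain_factor(sigmoid_grad(0.0), 20)  # 0.25 ** 20
active_relu = chain_factor(relu_grad(1.0), 20)           # 1.0 ** 20
dead_relu = chain_factor(relu_grad(-1.0), 20)            # 0.0 ** 20
print(f"sigmoid: {best_case_sigmoid:.2e}, relu: {active_relu}, dead relu: {dead_relu}")
```

Even in sigmoid's best case the factor after 20 layers is below 1e-12, whereas an active ReLU path passes the gradient through unchanged and an inactive one blocks it entirely.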

CNNs + Skip Connections + Normalization

ResNet's skip connections provide gradient highways ($\frac{\partial}{\partial x}(F(x) + x) = F'(x) + I$), but they are not sufficient alone. Batch normalization in each residual block keeps activations in a healthy range, preventing both the $F(x)$ branch from dominating (which would defeat the skip connection) and the activations from drifting. ConvNeXt showed that switching from BatchNorm to LayerNorm (borrowed from Transformers) further improves performance.
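Stacking residual blocks shows why the identity term matters. The sketch below assumes each block has the form $x_{i+1} = x_i + F_i(x_i)$ with no normalization or activation after the addition, and introduces $J_i$ as shorthand for the block Jacobian.

```latex
\begin{align}
x_{i+1} &= x_i + F_i(x_i), \qquad J_i = \frac{\partial F_i}{\partial x_i} \\
\frac{\partial x_L}{\partial x_l} &= (I + J_{L-1})(I + J_{L-2}) \cdots (I + J_l) \\
  &= I + \sum_{i=l}^{L-1} J_i + \text{(products of two or more } J_i\text{)}
\end{align}
```

A plain network would keep only the final product $J_{L-1} \cdots J_l$, which shrinks when the individual Jacobians are small. The residual form always retains the leading $I$, so the gradient at layer $L$ reaches layer $l$ along a direct, unattenuated path.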

RNNs + Attention + Transformers

This is a historical evolution driven by gradient flow. Vanilla RNNs suffer from vanishing gradients over long sequences (the product of $N$ Jacobians shrinks). LSTMs partially solve this with gated memory cells (additive gradient path). Attention fully solves it by creating a direct connection between any two positions ($O(1)$ gradient path length). Transformers remove recurrence entirely, using self-attention for all context - this enables parallelization and scales to billions of parameters.

Normalization + Training Techniques + Distributed Training

BatchNorm's behavior changes with batch size: it is effective with large batches ($\geq 32$) and noisy with small ones. In data-parallel training, each GPU computes statistics over only its local share of the global batch, so the per-GPU batch shrinks as a fixed global batch is split across more GPUs. This is why: (1) large-scale training often uses SyncBatchNorm (synchronize statistics across GPUs), (2) Transformers use LayerNorm (batch-size independent), and (3) LLaMA uses RMSNorm (simpler than LayerNorm, works at massive scale).

Generative Models + Training Techniques + Distributed Training

Training generative models at scale requires the full toolkit: mixed precision training (FP16/BF16 for memory), gradient checkpointing (recompute activations to fit larger models), ZeRO optimization (shard optimizer states across GPUs), and careful learning rate scheduling (warmup to avoid early instability, cosine decay for convergence). A question like "describe how you would train a diffusion model from scratch" tests all three topics simultaneously.

[Figure: Frequently Tested Cross-Topic Connections]

Self-Assessment Practice: Quick Diagnostic Questions

Before you begin studying, try answering these 10 questions. If you can answer 7+ confidently, you may be able to skip some foundational topics. If fewer than 4, start from the beginning.

| # | Question | Topic Tested |
|---|---|---|
| 1 | Derive $\frac{\partial L}{\partial W_1}$ for a 2-layer network with ReLU. | Backprop |
| 2 | Why does GELU outperform ReLU in Transformers? Give 2 reasons. | Activations |
| 3 | Calculate the output size of a 3x3 conv with stride 2, padding 1 on a 56x56 input. | CNNs |
| 4 | Why can't vanilla RNNs model long-range dependencies? Be specific. | RNNs |
| 5 | Derive scaled dot-product attention. Why divide by $\sqrt{d_k}$? | Attention |
| 6 | What are the 3 components of a Transformer encoder block? | Transformers |
| 7 | Why do Transformers use LayerNorm instead of BatchNorm? | Normalization |
| 8 | What is learning rate warmup and why is it necessary? | Training |
| 9 | Explain ZeRO Stage 2 in one sentence. | Distributed |
| 10 | What is the ELBO in VAEs and why do we maximize it? | Generative |

Scoring:

  • 0-3 correct: Start from Topic 01 and work sequentially
  • 4-6 correct: You have foundations - focus on weak topics and connections
  • 7-9 correct: You are nearly interview-ready - focus on practice problems and speed
  • 10 correct: Move directly to the rapid-fire section and mock interviews

Interview Cheat Sheet - Section-Level Reference

| Concept | One-Sentence Summary | When It Is Asked |
|---|---|---|
| Backpropagation | Chain rule on computational graphs - reverse-mode AD computes all gradients in one backward pass | Phone screen, on-site |
| Activation functions | Nonlinear functions enabling universal approximation - ReLU for CNNs, GELU for Transformers | Phone screen |
| CNNs | Local connectivity + weight sharing for spatial data - ResNet skip connections enable deep training | On-site, system design |
| RNNs/LSTMs | Sequential processing with hidden state - LSTM gates mitigate vanishing gradients | Phone screen (less common now) |
| Attention | Weighted sum based on query-key similarity - $O(n^2 d)$ complexity | On-site, core for LLM roles |
| Transformers | Self-attention + FFN in encoder/decoder blocks - parallelizable, scalable | On-site, core for all roles |
| Normalization | Stabilize activations across layers - LayerNorm for Transformers, BatchNorm for CNNs | On-site |
| Training techniques | LR schedules, gradient clipping, mixed precision - practical engineering for convergence | On-site, system design |
| Distributed training | Split data/model across GPUs - AllReduce for data parallel, ZeRO for memory efficiency | Senior on-site |
| Generative models | Learn data distribution to sample new examples - VAE, GAN, Diffusion, Flow | Specialized roles |

Spaced Repetition Checkpoints

Use this schedule to reinforce your learning. Each checkpoint lists what you should be able to do from memory.

Day 0 - After First Read

  • Draw the topic dependency diagram from memory
  • List all 11 topics and their one-sentence summaries
  • Identify your 3 weakest topics from the self-assessment
  • State which topics your target company emphasizes

Day 3 - First Review

  • For each topic, state the most common interview question
  • Explain the difference between topics that are often confused (BatchNorm vs LayerNorm, RNN vs LSTM, attention vs self-attention)
  • Recite your study path and current progress

Day 7 - Connections Review

  • Explain how backpropagation connects to vanishing gradients, which connects to activation choice, which connects to skip connections
  • Trace how a single training step works end-to-end: forward pass, loss computation, backward pass, gradient update
  • Explain why Transformers use: GELU (not ReLU), LayerNorm (not BatchNorm), multi-head attention (not single-head)

Day 14 - Interview Simulation

  • Complete 5 practice problems from different topics under time pressure (10 minutes each)
  • Give a 60-second answer for each of the 11 topics
  • Design a complete architecture for a given task, justifying every component choice

Day 21 - Final Calibration

  • Do a full mock interview covering 4-5 deep learning topics in 45 minutes
  • Identify any remaining weak spots and do targeted review
  • Practice transitioning between topics smoothly (e.g., "the vanishing gradient problem in RNNs motivated the attention mechanism, which...")

Prerequisites from ML Fundamentals

This section builds directly on concepts from ML Fundamentals. Make sure you are comfortable with:

| Prerequisite | From Topic | Why You Need It |
|---|---|---|
| Gradient descent and optimization | Optimization | Backpropagation computes gradients that optimizers consume |
| Loss functions (cross-entropy, MSE) | Loss Functions | Every derivation starts from the loss |
| Regularization (L1, L2, dropout) | Regularization | Deep networks need regularization - dropout, weight decay, data augmentation |
| Bias-variance tradeoff | Bias-Variance | Understanding overfitting in deep networks requires this framework |
| Evaluation metrics | Evaluation Metrics | You cannot train without knowing what you are optimizing |

What Comes Next

Once you have completed the Deep Learning section, you will be ready for:

  • LLM Interviews - Builds directly on Transformers, attention, training techniques, and distributed training
  • ML System Design - Applies architectural decisions and training strategies to real-world systems
  • Paper Discussion - Many discussed papers are deep learning architecture papers (ResNet, Transformer, BERT, GPT)

Start with Backpropagation - the mathematical foundation that everything else builds on.

© 2026 EngineersOfAI. All rights reserved.