Deep Learning Interview Questions Bank - Your Complete Preparation Guide
Reading time: ~35 min | Interview relevance: Critical | Roles: MLE, AI Engineer, Research Engineer, Applied Scientist, Computer Vision Engineer, NLP Engineer
The Real Interview Moment
You are 20 minutes into a Meta MLE on-site. The interviewer has asked about backpropagation, batch normalization, and attention mechanisms - you handled them well. Then she pulls up a whiteboard and says: "You are training a 7B parameter language model. After 10K steps, the loss suddenly spikes to NaN and never recovers. Walk me through your debugging process, step by step."
This question synthesizes knowledge across initialization, gradient management, mixed precision, normalization, and distributed training. The interviewer is testing whether you can integrate deep learning fundamentals into a coherent debugging workflow. Candidates who list random ideas ("try a lower learning rate?") get a "no hire." Candidates who present a systematic investigation - check gradient norms, check for FP16 overflow, inspect activation statistics by layer, examine the data batch that caused the spike - get a "strong hire."
The questions in this bank build toward exactly this level of integration.
What You Will Master
- 65 questions spanning all deep learning topics in this section
- Structured model answers with the depth expected at each level
- Company-specific question patterns (Google, Meta, Amazon, Apple, OpenAI, Anthropic)
- Quick-fire rapid response format for screening rounds
- Cross-references to detailed explanations in other pages
How to Use This Question Bank
Practice method:
- Set a timer for 2 minutes per question (screening) or 5 minutes (deep dive)
- Answer aloud as if speaking to an interviewer
- Check your answer against the model answer
- Grade yourself: Strong Hire / Lean Hire / No Hire
- For any "No Hire," study the linked page before retrying
Section 1 - Screening Questions (Phone Screen Level)
These questions are asked in 30-minute phone screens. Expect 5-8 questions. Each answer should take 1-2 minutes. The interviewer wants crisp, accurate definitions with intuition.
Q1: Explain backpropagation in plain terms.
Asked at virtually every company. Google expects the chain rule derivation. Amazon is satisfied with intuition. OpenAI and Anthropic expect you to connect it to computational graphs.
Model Answer: "Backpropagation is an algorithm for computing gradients of the loss with respect to every parameter in a neural network. It works by applying the chain rule recursively. During the forward pass, we compute and cache intermediate activations. During the backward pass, we start from the loss and propagate gradients backward through the network - each layer computes its local gradient and multiplies it by the incoming gradient from the layer above. This gives us $\partial L / \partial w$ for every weight $w$. Computationally, it is just the chain rule applied to a computational graph - each node computes its local Jacobian, and we multiply along paths. The total compute for the backward pass is approximately 2x the forward pass."
Scoring: Strong Hire = chain rule + computational graph + forward/backward distinction + compute cost. Lean Hire = correct intuition but cannot explain the chain rule. No Hire = confuses backprop with gradient descent.
Deep dive: Backpropagation
Q2: Why do we use non-linear activation functions?
Model Answer: "Without non-linear activations, a neural network of any depth is equivalent to a single linear transformation - stacking linear layers gives $W_2(W_1 x) = (W_2 W_1)x = W'x$. Non-linearities break this collapse and give the network the ability to approximate any continuous function (universal approximation theorem). ReLU is the default choice: $\text{ReLU}(x) = \max(0, x)$ - it is simple, fast, and avoids the vanishing gradient problem that plagued sigmoid/tanh. Its downside is 'dying ReLU' where neurons stuck at zero never recover, which Leaky ReLU and GELU address."
Scoring: Strong Hire = linear collapse argument + universal approximation + specific activations with tradeoffs. Lean Hire = knows non-linearity is needed but cannot explain the linear collapse. No Hire = cannot explain why linearity is insufficient.
Deep dive: Activation Functions
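The linear-collapse argument in Q2 can be checked numerically. The following is a minimal sketch using plain Python lists as matrices; the helper names (`matvec`, `matmul`, `relu`) and the example weights are illustrative assumptions, not part of the original answer.

```python
# Minimal sketch: two stacked linear layers equal one linear layer,
# unless a non-linearity sits between them. All names are hypothetical.

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrices A (m x k) and B (k x n)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def relu(v):
    """Elementwise max(0, v)."""
    return [max(0.0, vi) for vi in v]

W1 = [[1.0, -2.0], [0.5, 1.0]]
W2 = [[2.0, 1.0], [-1.0, 3.0]]
x = [1.0, 1.0]

stacked = matvec(W2, matvec(W1, x))          # two linear layers in sequence
collapsed = matvec(matmul(W2, W1), x)        # one layer with W' = W2 W1
with_relu = matvec(W2, relu(matvec(W1, x)))  # ReLU between the layers

print(stacked, collapsed, with_relu)
```

The first two results coincide, while the ReLU version differs because the negative pre-activation is zeroed before the second layer.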
Q3: What is the vanishing gradient problem and how do you solve it?
Model Answer: "In deep networks, gradients are multiplied through layers during backpropagation. If the Jacobian of each layer has spectral norm less than 1, gradients shrink exponentially - after 50 layers they are effectively zero. Early layers stop learning. This is caused by saturating activations (sigmoid/tanh) and poor initialization. Solutions: (1) ReLU activations - gradient is 1 for positive inputs, (2) residual connections - provide gradient shortcuts that bypass layers, (3) proper initialization - He for ReLU, Xavier for tanh, (4) normalization layers - BatchNorm, LayerNorm keep activations in a good range, (5) gradient clipping for the exploding gradient variant."
Scoring: Strong Hire = cause + 4+ solutions with why each works. Lean Hire = knows the problem but only mentions 1-2 solutions. No Hire = confuses vanishing gradients with underfitting.
Deep dive: Backpropagation
Q4: Explain the difference between BatchNorm and LayerNorm. When do you use each?
Model Answer: "Both normalize activations but along different dimensions. BatchNorm normalizes across the batch dimension - for each feature, it computes mean and variance across all samples in the mini-batch. LayerNorm normalizes across the feature dimension - for each sample, it computes mean and variance across all features. BatchNorm depends on batch statistics, which makes it problematic for small batches, variable-length sequences, and inference (requires running statistics). LayerNorm is independent of batch size. Rule of thumb: BatchNorm for CNNs (spatial statistics are meaningful), LayerNorm for transformers (sequence length varies, batch independence is important). RMSNorm is a simpler variant of LayerNorm that skips the mean centering - used in Llama and other modern LLMs."
Scoring: Strong Hire = normalization dimension difference + when each fails + RMSNorm. Lean Hire = knows the difference but not the practical implications. No Hire = confuses the normalization dimensions.
Deep dive: Normalization
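To make the "different dimensions" point in Q4 concrete, here is a small sketch that normalizes the same 2 x 3 batch along each axis. It assumes training-mode statistics and omits the learnable scale and shift that real BatchNorm/LayerNorm layers carry; the function names are illustrative.

```python
import math

def normalize(values, eps=1e-5):
    """Zero-mean, unit-variance normalization of a list of numbers."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def batch_norm(batch):
    """Normalize each feature (column) across the samples in the batch."""
    columns = [normalize(list(col)) for col in zip(*batch)]
    return [list(row) for row in zip(*columns)]

def layer_norm(batch):
    """Normalize each sample (row) across its own features."""
    return [normalize(row) for row in batch]

batch = [[1.0, 10.0, 100.0],
         [3.0, 30.0, 300.0]]
bn = batch_norm(batch)   # statistics depend on the other sample in the batch
ln = layer_norm(batch)   # statistics depend only on the sample itself

# LayerNorm gives the same result for a sample regardless of batch size.
ln_single = layer_norm([batch[0]])[0]
```

Running `batch_norm` on a batch of one sample would produce all zeros (each feature equals its own mean), which is one way to see why BatchNorm struggles with very small batches.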
Q5: What is the attention mechanism? Why is it important?
Model Answer: "Attention computes a weighted sum of value vectors, where weights are determined by the similarity between a query and key vectors: $\text{Attention}(Q, K, V) = \text{softmax}\left(QK^\top / \sqrt{d_k}\right) V$. The $\sqrt{d_k}$ scaling prevents softmax saturation with large dimensions. Attention is important because it allows every position in a sequence to directly attend to every other position - solving the long-range dependency problem that plagued RNNs. Multi-head attention runs multiple attention operations in parallel with different learned projections, capturing different types of relationships. Self-attention (Q, K, V all come from the same sequence) is the building block of transformers."
Scoring: Strong Hire = full formula + scaling explanation + multi-head + self-attention. Lean Hire = knows the concept but cannot write the formula. No Hire = cannot explain attention beyond "it helps the model focus."
Deep dive: Attention Mechanism
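The scaled dot-product formula from Q5 can be sketched in a few lines of plain Python. This is a single-head, unmasked illustration with hypothetical function names and toy inputs, not a production implementation.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for lists of row vectors."""
    d_k = len(K[0])
    out, weights = [], []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of value vectors.
        out.append([sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))])
    return out, weights

K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out, weights = attention([[1.0, 0.0]], K, V)        # query aligned with the first key
uniform_out, _ = attention([[0.0, 0.0]], K, V)      # zero query: uniform weights
```

A query aligned with the first key places more weight on the first value, while an all-zero query yields uniform weights and returns the mean of the value vectors.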
Q6: What is a transformer? Walk me through its architecture.
Model Answer: "A transformer is a sequence-to-sequence architecture built entirely on attention - no recurrence or convolution. The encoder has N identical blocks, each with multi-head self-attention followed by a feed-forward network, with residual connections and layer normalization around each. The decoder adds cross-attention to attend to the encoder's output and uses masked self-attention to prevent attending to future tokens. Modern LLMs (GPT, Llama) use decoder-only transformers - just the decoder stack with causal masking. Encoder-only (BERT) is used for understanding tasks. The key innovations: positional encodings (sinusoidal or learned), needed because self-attention has no built-in notion of token order (it is permutation-equivariant), and the ability to process all positions in parallel (unlike RNNs)."
Scoring: Strong Hire = encoder/decoder structure + attention types + residual/norm + modern variants (decoder-only vs encoder-only). Lean Hire = knows the components but cannot explain the information flow. No Hire = cannot describe the architecture beyond "it uses attention."
Deep dive: Transformer Architecture
Q7: What is the difference between a CNN and an RNN? When do you use each?
Model Answer: "CNNs process data with spatial structure (images) using local receptive fields (convolution kernels) that share weights across positions. They capture translation invariance and build hierarchical features from local to global. RNNs process sequential data (text, time series) by maintaining a hidden state that is updated at each time step, capturing temporal dependencies. In practice, transformers have largely replaced RNNs for sequence tasks because RNNs have the vanishing gradient problem for long sequences and cannot be parallelized during training. CNNs remain dominant for low-level vision tasks but Vision Transformers (ViTs) are competitive for image classification when you have enough data."
Scoring: Strong Hire = key architectural differences + inductive biases + modern landscape (transformers replacing both). Lean Hire = knows the difference but does not discuss when transformers replaced them. No Hire = cannot articulate the structural difference.
Deep dive: CNNs, RNNs & LSTMs
Q8: What is dropout and why does it work?
Model Answer: "Dropout randomly sets a fraction of activations to zero during training (typically 10-50%). At inference, all neurons are active but outputs are scaled by the keep probability (most frameworks use the equivalent 'inverted' form, scaling by 1/keep probability during training so inference needs no change). It works as regularization by preventing co-adaptation - neurons cannot rely on specific other neurons being present, forcing each to learn independently useful features. There is a Bayesian interpretation: dropout training approximately performs variational inference over the weights, and averaging over dropout masks at inference approximates a Bayesian model average. In practice, dropout is used in fully connected layers but rarely in convolutional layers (use spatial dropout instead), and it is often reduced or disabled in large-scale transformer pre-training (which relies on other regularization)."
Scoring: Strong Hire = mechanism + co-adaptation + Bayesian interpretation + where it is not used. Lean Hire = knows the mechanism but not the why. No Hire = confuses dropout with pruning.
Deep dive: Activation Functions
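As a sketch of the dropout mechanism in Q8, the snippet below uses the "inverted" variant, which rescales surviving activations during training so that inference is a no-op. The function name and arguments are illustrative assumptions.

```python
import random

def dropout(activations, p_drop, training, rng):
    """Inverted dropout: zero each activation with probability p_drop
    during training and scale survivors by 1 / (1 - p_drop)."""
    if not training or p_drop == 0.0:
        return list(activations)          # inference: identity
    keep = 1.0 - p_drop
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)                     # seeded for reproducibility
x = [1.0] * 100_000
train_out = dropout(x, 0.5, True, rng)
eval_out = dropout([1.0, 2.0], 0.5, False, rng)

# The rescaling keeps the expected activation unchanged.
train_mean = sum(train_out) / len(train_out)
```

With `p_drop = 0.5`, each training output is either 0 or twice the input, and the mean over many units stays close to the original value.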
Q9: He initialization vs Xavier initialization - when do you use each?
Model Answer: "Both control the variance of weight initialization to prevent exploding/vanishing activations. Xavier sets variance to $2 / (n_{in} + n_{out})$, assuming linear activations - it preserves variance in both forward and backward passes. He sets variance to $2 / n_{in}$, accounting for ReLU zeroing out half the activations. Use Xavier for tanh/sigmoid, He for ReLU/LeakyReLU. Using Xavier with ReLU causes the activation variance to shrink by roughly half per layer - in a 50-layer network it would be about $(1/2)^{50} \approx 10^{-15}$ times that of the input. For transformers with LayerNorm, the choice matters less because normalization stabilizes variance regardless."
Scoring: Strong Hire = both formulas + why they differ (ReLU halving) + when normalization makes the choice less critical. Lean Hire = knows which to use when but not the derivation. No Hire = cannot distinguish them.
Deep dive: Training Techniques
Q10: What is mixed precision training?
Model Answer: "Mixed precision uses both FP16 (or BF16) and FP32 during training. Forward and backward passes use FP16/BF16 for speed (2-8x on Tensor Cores) and memory savings (half per parameter). Master weights and optimizer states stay in FP32 because weight updates are tiny and underflow in FP16. With FP16, you also need loss scaling - multiply the loss by a large constant to shift gradients into FP16's representable range, then unscale after backprop. BF16 has the same range as FP32 (8-bit exponent) so it does not need loss scaling, making it the preferred choice on A100+ GPUs. Certain operations - softmax, LayerNorm, loss computation - must stay in FP32 for numerical stability."
Scoring: Strong Hire = why mixed (not just FP16) + master weights in FP32 + loss scaling + BF16 advantage + which ops need FP32. Lean Hire = knows it uses lower precision for speed but misses the details. No Hire = says "just use FP16 everywhere."
Deep dive: Training Techniques
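The underflow problem that motivates loss scaling in Q10 can be reproduced with the standard library alone, since Python's `struct` module supports the IEEE half-precision format (`"e"`). The gradient value and the static scale of 2^16 below are illustrative assumptions; real frameworks typically adjust the scale dynamically.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

grad = 1e-8                          # below FP16's smallest subnormal (~6e-8)
unscaled = to_fp16(grad)             # underflows to 0.0: the update is lost

loss_scale = 65536.0                 # hypothetical static loss scale (2**16)
scaled = to_fp16(grad * loss_scale)  # ~6.6e-4, representable in FP16
recovered = scaled / loss_scale      # unscale in higher precision

print(unscaled, recovered)
```

Scaling the loss scales every gradient by the same factor (by linearity of backpropagation), which is why the gradient can be recovered after unscaling with only FP16 rounding error.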
Q11: What is knowledge distillation?
Model Answer: "Knowledge distillation trains a small student model to mimic a large teacher model. Instead of hard labels, the student learns from the teacher's soft probability distribution, which contains rich inter-class structure ('dark knowledge'). The temperature parameter T softens the distribution - higher T reveals more about class similarities. The loss combines KL divergence between temperature-scaled teacher and student distributions (multiplied by $T^2$) with standard cross-entropy on hard labels. Typical results: 3-10x model compression with 1-3% accuracy loss. Modern example: DistilBERT is a distilled version of BERT - 40% smaller, 60% faster, retaining 97% of performance."
Scoring: Strong Hire = soft labels + temperature + $T^2$ factor + practical results. Lean Hire = knows the concept but cannot explain temperature or the loss. No Hire = confuses with model pruning or quantization.
Deep dive: Training Techniques
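The distillation loss described in Q11 can be sketched as follows. The mixing weight `alpha` and the default temperature are assumptions for illustration; the answer above only specifies that the KL term is multiplied by $T^2$ and combined with hard-label cross-entropy.

```python
import math

def softmax_t(logits, T=1.0):
    """Softmax of logits / T (higher T gives a softer distribution)."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, label, T=4.0, alpha=0.5):
    """alpha * T^2 * KL(teacher || student) + (1 - alpha) * CE(student, label).
    alpha is a hypothetical mixing weight."""
    p_teacher = softmax_t(teacher_logits, T)
    p_student = softmax_t(student_logits, T)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student))
    ce = -math.log(softmax_t(student_logits)[label])   # hard-label term at T = 1
    return alpha * (T ** 2) * kl + (1 - alpha) * ce

teacher = [4.0, 1.0, 0.0]
matched = distillation_loss(teacher, teacher, label=0)            # KL term is zero
mismatched = distillation_loss([0.0, 1.0, 4.0], teacher, label=0)
```

The $T^2$ factor compensates for the soft-target gradients shrinking roughly as $1/T^2$, so the two terms stay on a comparable scale as the temperature changes.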
Q12: Explain data parallelism vs model parallelism.
Model Answer: "Data parallelism replicates the entire model on each GPU and splits the data. Each GPU processes a different mini-batch, then gradients are synchronized via AllReduce. It scales throughput but does not reduce per-GPU model memory. Model parallelism splits the model across GPUs: tensor parallelism splits individual layers (requires high-bandwidth NVLink, used within a node), pipeline parallelism assigns layer groups to different GPUs (tolerates lower bandwidth, used across nodes). For a 70B model that does not fit on one GPU, you need model parallelism. For a 1B model on 8 GPUs, data parallelism is sufficient. Large-scale training combines all three - tensor within node, pipeline across nodes, data across replica groups."
Scoring: Strong Hire = all three types + when each is needed + NVLink/bandwidth considerations. Lean Hire = knows data vs model parallelism but not tensor vs pipeline. No Hire = thinks data parallelism reduces model memory.
Deep dive: Distributed Training
Q13: What is the ELBO in VAEs?
Model Answer: "The ELBO (Evidence Lower Bound) is the VAE training objective - a lower bound on the log-likelihood $\log p_\theta(x)$. It decomposes into: a reconstruction term ($\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]$) which encourages the decoder to reconstruct the input, and a KL divergence ($D_{KL}(q_\phi(z|x) \,\|\, p(z))$) which regularizes the encoder's posterior to be close to the prior (standard Gaussian). The gap between the ELBO and the true log-likelihood is exactly $D_{KL}(q_\phi(z|x) \,\|\, p_\theta(z|x))$ - how well the encoder approximates the true posterior. Maximizing the ELBO simultaneously improves reconstruction and tightens this gap."
Scoring: Strong Hire = both terms + gap interpretation + direction of KL. Lean Hire = knows the two terms but not the gap. No Hire = cannot state the ELBO.
Deep dive: Generative Models
Q14: Why do GANs suffer from mode collapse?
Model Answer: "Mode collapse happens when the generator produces only a few outputs that fool the discriminator, ignoring the diversity of the real distribution. It occurs because the minimax objective does not explicitly encourage diversity - the generator can minimize its loss by finding one mode that maximally confuses the discriminator. The JSD-based objective makes this worse: when real and generated distributions do not overlap (common in high dimensions), JSD is constant and provides zero gradient, so the generator has no signal to explore new modes. Solutions include WGAN (Wasserstein distance gives meaningful gradients everywhere), spectral normalization (constrains discriminator strength), and minibatch discrimination (discriminator penalizes low diversity). Diffusion models avoid this entirely because their objective is per-sample regression, not an adversarial game."
Scoring: Strong Hire = cause (adversarial + JSD gradient issue) + 3+ solutions + why diffusion avoids it. Lean Hire = knows mode collapse exists but only 1 solution. No Hire = cannot explain mode collapse.
Deep dive: Generative Models
Q15: What is gradient clipping and when do you need it?
Model Answer: "Gradient clipping caps gradient magnitudes to prevent exploding gradients. The standard approach is clip-by-global-norm: compute the L2 norm of all gradients concatenated, and if it exceeds a threshold (typically 1.0), scale all gradients by threshold/norm. This preserves gradient direction while capping magnitude. It is essential for transformer training (always used), critical for RNN/LSTM training (exploding gradients through time), and helpful for any very deep network. It does NOT help with vanishing gradients - only exploding gradients. Clip-by-value (capping each element) is rarely used because it changes the gradient direction."
Scoring: Strong Hire = clip-by-norm formula + direction preservation + when needed + does not help vanishing. Lean Hire = knows what it does but not the implementation. No Hire = confuses with gradient scaling or normalization.
Deep dive: Training Techniques
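Clip-by-global-norm from Q15 is short enough to write out. This sketch treats each parameter's gradient as a flat list; the function name and the small epsilon in the denominator are illustrative choices.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0):
    """Scale all gradients by max_norm / total_norm when the global
    L2 norm exceeds max_norm. grads: list of per-parameter lists."""
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    scale = min(1.0, max_norm / (total_norm + 1e-6))   # never scale up
    clipped = [[g * scale for g in grad] for grad in grads]
    return clipped, total_norm

clipped, norm = clip_by_global_norm([[3.0], [4.0]], max_norm=1.0)
small, _ = clip_by_global_norm([[0.3], [0.4]], max_norm=1.0)

new_norm = math.sqrt(sum(g * g for grad in clipped for g in grad))
```

Because every element is multiplied by the same factor, the ratio between components (the gradient direction) is unchanged; gradients already under the threshold pass through untouched.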
Section 2 - Technical Deep Dive (On-Site Level)
These questions are asked in 45-60 minute technical interviews. Expect 3-5 questions with deep follow-ups. The interviewer wants mathematical depth, derivations, and practical implications.
Q16: Derive the backpropagation equations for a two-layer neural network.
Google and DeepMind expect full derivation on the whiteboard. Meta wants the derivation plus practical implications. OpenAI and Anthropic may ask you to extend it to attention or custom layers.
Model Answer:
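Network: $z_1 = W_1 x + b_1$, $a_1 = \text{ReLU}(z_1)$, $z_2 = W_2 a_1 + b_2$, $\hat{y} = \text{softmax}(z_2)$, $L = -\sum_i y_i \log \hat{y}_i$.
Output layer: $\delta_2 = \partial L / \partial z_2 = \hat{y} - y$ (softmax + cross-entropy simplification).
Hidden layer weights: $\partial L / \partial W_2 = \delta_2 a_1^\top$. $\partial L / \partial b_2 = \delta_2$.
Propagate to hidden layer: $\partial L / \partial a_1 = W_2^\top \delta_2$. $\delta_1 = \partial L / \partial z_1 = (W_2^\top \delta_2) \odot \mathbb{1}[z_1 > 0]$.
Input layer weights: $\partial L / \partial W_1 = \delta_1 x^\top$. $\partial L / \partial b_1 = \delta_1$.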
The pattern: at each layer, gradient = upstream gradient times local Jacobian. Weight gradients are outer product of upstream gradient and layer input.
Scoring: Strong Hire = complete derivation with correct dimensions, softmax-CE simplification, general pattern articulated. Lean Hire = mostly correct but makes a dimensional error. No Hire = cannot set up the chain rule.
Deep dive: Backpropagation
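One way to sanity-check a whiteboard derivation like Q16 is to compare it against a finite-difference estimate. The sketch below implements a two-layer ReLU network with softmax cross-entropy in plain Python; the weights and input are arbitrary illustrative values, and the function names are hypothetical.

```python
import math

def forward(W1, b1, W2, b2, x, y):
    """Forward pass: linear -> ReLU -> linear -> softmax -> cross-entropy."""
    z1 = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
    a1 = [max(0.0, z) for z in z1]
    z2 = [sum(w * a for w, a in zip(row, a1)) + b for row, b in zip(W2, b2)]
    m = max(z2)
    exps = [math.exp(z - m) for z in z2]
    total = sum(exps)
    y_hat = [e / total for e in exps]
    loss = -sum(yi * math.log(p) for yi, p in zip(y, y_hat))
    return z1, a1, y_hat, loss

def backward(W2, x, y, z1, a1, y_hat):
    """Analytic gradients: upstream gradient times local Jacobian at each layer."""
    delta2 = [p - yi for p, yi in zip(y_hat, y)]                   # dL/dz2
    dW2 = [[d * a for a in a1] for d in delta2]                    # delta2 a1^T
    da1 = [sum(W2[k][j] * delta2[k] for k in range(len(delta2)))   # W2^T delta2
           for j in range(len(a1))]
    delta1 = [g if z > 0 else 0.0 for g, z in zip(da1, z1)]        # ReLU mask
    dW1 = [[d * xi for xi in x] for d in delta1]                   # delta1 x^T
    return dW1, delta1, dW2, delta2

W1 = [[0.5, 0.3], [-0.8, 0.2]]
b1 = [0.1, -0.1]
W2 = [[1.0, -1.0], [0.5, 0.5]]
b2 = [0.0, 0.0]
x, y = [1.0, 2.0], [1.0, 0.0]

z1, a1, y_hat, loss = forward(W1, b1, W2, b2, x, y)
dW1, db1, dW2, db2 = backward(W2, x, y, z1, a1, y_hat)

def loss_with(w00):
    """Loss as a function of the single weight W1[0][0]."""
    W1_mod = [[w00, W1[0][1]], W1[1]]
    return forward(W1_mod, b1, W2, b2, x, y)[3]

eps = 1e-6
numeric = (loss_with(W1[0][0] + eps) - loss_with(W1[0][0] - eps)) / (2 * eps)
print(dW1[0][0], numeric)   # the two estimates should agree closely
```

The second hidden unit has a negative pre-activation here, so its row of `dW1` is exactly zero, which illustrates the ReLU mask in the derivation.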
Q17: Explain the self-attention computation step by step, including the complexity and how to reduce it.
Model Answer:
Given input $X \in \mathbb{R}^{n \times d}$, compute $Q = XW_Q$, $K = XW_K$, $V = XW_V$, where $W_Q, W_K, W_V \in \mathbb{R}^{d \times d_k}$.
Attention scores: $S = QK^\top / \sqrt{d_k}$ - this is an $n \times n$ matrix. Apply softmax row-wise: $A = \text{softmax}(S)$. Output: $O = AV$ - weighted combination of values.
Complexity: $O(n^2 d)$ for the computation. Memory: $O(n^2)$ for the attention matrix. For $n = 100{,}000$ tokens, this is 10 billion entries per head - impractical to materialize.
Reduction methods: (1) FlashAttention - exact attention but tiled computation, $O(n)$ memory via recomputation, no approximation. (2) Sparse attention - attend to local windows + global tokens. (3) Linear attention - kernel approximation to avoid materializing the $n \times n$ matrix. (4) Grouped-query attention (GQA) - share K,V heads across multiple Q heads, reducing KV cache size.
Scoring: Strong Hire = full computation + complexity + FlashAttention + GQA. Lean Hire = correct computation but cannot discuss efficiency. No Hire = cannot compute the attention output.
Deep dive: Attention Mechanism, Transformer Architecture
Q18: You are training a model and observe the loss curve plateau after a few epochs. Walk me through your debugging process.
Model Answer:
"I would investigate systematically:
Step 1: Check if it is a learning rate issue. Is the plateau at a high or low loss? High loss plateau = underfitting, model capacity may be insufficient, or learning rate too low to escape initial basin. Low loss plateau = may be near convergence, or stuck in a local minimum.
Step 2: Examine gradient statistics. Are gradients near zero? (vanishing gradients - check initialization, activation functions). Are gradients oscillating? (learning rate too high). Are specific layers not learning? (check per-layer gradient norms).
Step 3: Data check. Is the data pipeline feeding correctly? (log a batch, verify labels match inputs). Is there label noise causing an irreducible loss floor?
Step 4: Architecture check. Is the model expressive enough? (try a larger model). Are residual connections and normalization in place?
Step 5: Optimizer and schedule. Try a different optimizer (switch Adam to SGD with momentum or vice versa). Try a learning rate warmup. Try cosine annealing with warm restarts to periodically increase LR and escape plateaus.
Step 6: Regularization. Is dropout or weight decay too strong? (temporarily remove and check if training loss improves).
I would also check the validation loss alongside training loss to distinguish underfitting (both high) from a data quality issue (training loss at theoretical minimum)."
Scoring: Strong Hire = systematic approach covering LR, gradients, data, architecture, optimizer + distinguishes underfitting from convergence. Lean Hire = suggests 2-3 fixes but not systematically. No Hire = only suggests "lower the learning rate."
Q19: Explain how LSTM solves the vanishing gradient problem. Derive the gradient flow through the cell state.
Model Answer:
"LSTM introduces a cell state with additive updates - unlike RNN's multiplicative hidden state updates.
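The cell state update: $c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$
Where $f_t$ is the forget gate (sigmoid output in (0, 1)), $i_t$ is the input gate, and $\tilde{c}_t$ is the candidate cell state.
Gradient flow through cell state (along the direct path, treating the gates as constants): $\partial c_t / \partial c_{t-1} = \text{diag}(f_t)$
Over $k$ time steps: $\partial c_t / \partial c_{t-k} = \prod_{j=0}^{k-1} \text{diag}(f_{t-j})$
If $f_t \approx 1$ (forget gate nearly open), gradients flow unattenuated - the product is close to 1 regardless of $k$. This is the key insight: the forget gate can learn to preserve gradients over long sequences.
In contrast, vanilla RNN: $h_t = \tanh(W_h h_{t-1} + W_x x_t)$, giving $\partial h_t / \partial h_{t-k} = \prod_{j=0}^{k-1} \text{diag}(1 - h_{t-j}^2)\, W_h$. If the spectral norm of $W_h$ is less than 1, this product vanishes exponentially.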
The LSTM's additive cell state update is analogous to residual connections in deep networks - both provide a gradient highway that bypasses nonlinearities."
Scoring: Strong Hire = full gradient derivation, forget gate role, comparison to vanilla RNN, connection to residual networks. Lean Hire = knows LSTM helps but cannot derive gradient flow. No Hire = cannot explain the LSTM gates.
Deep dive: RNNs & LSTMs
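The product of forget gates in Q19 can be evaluated directly. This sketch follows only the direct cell-state path (gates treated as constants), and the gate values 0.99 and 0.5 are illustrative assumptions.

```python
def cell_state_gradient(forget_gates):
    """Elementwise product of forget-gate vectors over time, i.e. the
    diagonal of d c_t / d c_{t-k} along the direct cell-state path."""
    grad = [1.0] * len(forget_gates[0])
    for f in forget_gates:
        grad = [g * fi for g, fi in zip(grad, f)]
    return grad

# Two cell units over 100 steps: one remembers (f = 0.99), one forgets (f = 0.5).
steps = [[0.99, 0.5]] * 100
grad = cell_state_gradient(steps)
print(grad)
```

The unit with its forget gate near 1 still passes a sizeable fraction of the gradient after 100 steps, while the unit with a gate of 0.5 has effectively cut the gradient off. The LSTM can learn which behaviour each unit needs.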
Q20: What are ZeRO stages 1, 2, and 3? Calculate memory savings for a 13B model on 8 GPUs.
Model Answer:
"ZeRO eliminates memory redundancy in data parallelism. In standard DDP, every GPU stores: weights (2P bytes in FP16), gradients (2P), and optimizer states (12P for Adam with FP32 master weights) - 16P total.
ZeRO Stage 1: Partition optimizer states. Each GPU stores 1/N of optimizer states. Per GPU: $2P + 2P + 12P/N$.
ZeRO Stage 2: Also partition gradients. Per GPU: $2P + (2P + 12P)/N$.
ZeRO Stage 3: Also partition weights. Per GPU: $16P/N$.
For 13B model, 8 GPUs:
| Stage | Per-GPU memory (model states only) | Fit on A100-80GB |
|---|---|---|
| DDP | 16 x 13 = 208 GB | Does not fit |
| Stage 1 | 26 + 26 + 19.5 = 71.5 GB | Fits (tight) |
| Stage 2 | 26 + 3.25 + 19.5 = 48.75 GB | Fits comfortably |
| Stage 3 | 208/8 = 26 GB | Very comfortable |
Stage 3 increases communication volume to roughly 1.5x that of standard data parallelism (extra all-gather during forward/backward). FSDP is PyTorch's implementation of Stage 3."
Scoring: Strong Hire = all three stages with formulas + correct calculation + communication tradeoff + FSDP connection. Lean Hire = knows the stages but cannot calculate memory. No Hire = cannot explain what ZeRO does.
Deep dive: Distributed Training
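The per-GPU arithmetic in Q20 is easy to script. The sketch below follows the 2 + 2 + 12 bytes-per-parameter accounting from the answer, counts model states only (no activations), and uses 1 GB = 10^9 bytes; the function name is hypothetical.

```python
def zero_memory_gb(params_billion, num_gpus, stage):
    """Per-GPU model-state memory in GB for mixed-precision Adam.
    stage 0 = plain DDP, stages 1-3 = ZeRO. With 1 GB = 1e9 bytes,
    bytes-per-parameter times billions of parameters gives GB directly."""
    weights = 2.0 * params_billion    # FP16 weights
    grads = 2.0 * params_billion      # FP16 gradients
    optim = 12.0 * params_billion     # FP32 master weights + Adam moments
    if stage >= 1:
        optim /= num_gpus             # Stage 1: shard optimizer states
    if stage >= 2:
        grads /= num_gpus             # Stage 2: also shard gradients
    if stage >= 3:
        weights /= num_gpus           # Stage 3: also shard weights
    return weights + grads + optim

per_gpu = {stage: zero_memory_gb(13, 8, stage) for stage in range(4)}
print(per_gpu)
```

Changing `num_gpus` shows why Stage 1 alone eventually stops helping: the unsharded weights and gradients (4P bytes) set a floor that more GPUs cannot lower.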
Q21: Compare the training objectives of VAEs, GANs, and diffusion models.
Model Answer:
"VAE: Maximize the ELBO = reconstruction term - KL term. $\mathcal{L} = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}(q_\phi(z|x) \,\|\, p(z))$. Lower bound on log-likelihood. Training is stable but outputs can be blurry because the reconstruction loss (MSE) averages over modes.
GAN: Minimax game: $\min_G \max_D \; \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$. Minimizes Jensen-Shannon divergence between real and generated distributions (at the optimal discriminator). Training is unstable - generator and discriminator must be carefully balanced. Mode collapse is common.
Diffusion: Denoising score matching: $\mathbb{E}_{x_0, \epsilon, t}\left[\|\epsilon - \epsilon_\theta(x_t, t)\|^2\right]$. Simple MSE loss predicting added noise. Training is very stable (no adversarial dynamics) and covers all modes (every noise level forces full distribution learning). Slow sampling (hundreds of steps) is the main downside.
Key mathematical distinction: VAE optimizes a bound on likelihood (approximate). GAN optimizes a divergence via an adversary (implicit). Diffusion optimizes a score function via denoising (explicit, per-sample)."
Scoring: Strong Hire = all three objectives with formulas + what each minimizes + stability/quality tradeoffs + mathematical insight about bound vs divergence vs score. Lean Hire = knows all three but cannot state the precise objectives. No Hire = confuses the objectives.
Deep dive: Generative Models
Q22: Explain the Chinchilla scaling laws and how they changed LLM training.
Model Answer:
"Kaplan et al. (2020) found loss scales as power laws with parameters, data, and compute. They concluded: for a fixed compute budget, make the model as large as possible. This led to GPT-3 (175B params, 300B tokens).
Hoffmann et al. (2022, Chinchilla) challenged this. They showed the optimal allocation scales parameters and data equally with compute: $N_{opt} \propto C^{0.5}$, $D_{opt} \propto C^{0.5}$. Rule of thumb: 20 tokens per parameter. By this standard, GPT-3 was severely undertrained - 175B params should have seen ~3.5T tokens, not 300B.
They validated this by training Chinchilla (70B params, 1.4T tokens), which outperformed Gopher (280B params, 300B tokens) with 4x fewer parameters.
Modern nuance: Chinchilla optimizes for training compute. But inference cost matters too - a smaller model trained on more data is cheaper to deploy. Llama 3 (70B, 15T tokens) is deliberately 'overtrained' relative to Chinchilla because Meta optimizes for total cost including inference. The compute formula is $C \approx 6ND$."
Scoring: Strong Hire = Kaplan vs Chinchilla distinction + 20 tokens rule + Chinchilla vs Gopher example + inference-aware deviation + 6ND formula. Lean Hire = knows the scaling law exists but cannot state the specific relationship or numbers. No Hire = has not heard of Chinchilla or scaling laws.
Deep dive: Distributed Training
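The numbers in Q22 follow from two rules of thumb: roughly 20 tokens per parameter and roughly 6ND training FLOPs. A small sketch, with hypothetical function names, makes the Chinchilla vs Gopher comparison concrete.

```python
def chinchilla_tokens(params, tokens_per_param=20):
    """Approximate compute-optimal token count (20 tokens per parameter)."""
    return tokens_per_param * params

def training_flops(params, tokens):
    """Approximate training compute, C = 6 * N * D."""
    return 6 * params * tokens

gpt3_optimal_tokens = chinchilla_tokens(175e9)       # vs the 300B actually used
chinchilla_flops = training_flops(70e9, 1.4e12)
gopher_flops = training_flops(280e9, 300e9)

print(f"{gpt3_optimal_tokens:.2e} tokens")
print(f"{chinchilla_flops:.2e} vs {gopher_flops:.2e} FLOPs")
```

Under this approximation the two models land within about 20% of each other in training compute, which is what makes the comparison a test of allocation rather than of budget.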
Q23: What is the reparameterization trick and why does it matter?
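Model Answer:
"In VAEs, we need to backpropagate through sampling: $z \sim q_\phi(z|x) = \mathcal{N}(\mu_\phi(x), \sigma_\phi^2(x))$. Sampling is stochastic and non-differentiable - you cannot compute $\partial z / \partial \phi$.
The reparameterization trick rewrites: $z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$.
Now $z$ is a deterministic, differentiable function of $\mu$ and $\sigma$ (which depend on $\phi$) plus external noise $\epsilon$ that does not depend on $\phi$. Gradients flow through $\mu$ and $\sigma$ to $\phi$ via standard backpropagation.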
This generalizes beyond VAEs - any time you need to differentiate through a sample from a parameterized distribution (policy gradient in RL, stochastic computation graphs), reparameterization provides lower-variance gradient estimates than the REINFORCE alternative."
Scoring: Strong Hire = why sampling is non-differentiable + the trick formula + gradient flow + connection to RL/general stochastic computation. Lean Hire = knows the formula but cannot explain why it is needed. No Hire = cannot explain the trick.
Deep dive: Generative Models
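The gradient claim in Q23 can be verified with finite differences: once the noise is drawn externally, the sample is an ordinary differentiable function of the mean and standard deviation. The scalar setup and variable names below are illustrative assumptions.

```python
import random

def sample_z(mu, sigma, eps):
    """Reparameterized sample: deterministic in (mu, sigma) given eps."""
    return mu + sigma * eps

rng = random.Random(0)
eps = rng.gauss(0.0, 1.0)        # external noise, independent of parameters
mu, sigma = 0.5, 2.0

h = 1e-6
# Central finite differences with the same eps held fixed.
dz_dmu = (sample_z(mu + h, sigma, eps) - sample_z(mu - h, sigma, eps)) / (2 * h)
dz_dsigma = (sample_z(mu, sigma + h, eps) - sample_z(mu, sigma - h, eps)) / (2 * h)

print(dz_dmu, dz_dsigma, eps)    # expect dz/dmu = 1 and dz/dsigma = eps
```

Holding `eps` fixed is the whole trick: without it, two calls to a sampler would differ by random noise and no finite-difference (or autodiff) gradient would be meaningful.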
Q24: Explain how FlashAttention works and why it is important.
Model Answer:
"Standard attention materializes the $n \times n$ attention matrix, requiring $O(n^2)$ memory. For a 100K context, that is 10 billion scores, or about 40GB in FP32 per attention head - impractical to store for every head and layer.
FlashAttention (Dao et al., 2022) computes exact attention without materializing the full matrix. It tiles the Q, K, V matrices into blocks that fit in GPU SRAM (fast, small memory), computes attention one tile at a time, and uses the online softmax trick to accumulate results across tiles without storing the full matrix.
Key insight: GPUs have a memory hierarchy - HBM (slow, large) and SRAM (fast, tiny). Standard attention is memory-bandwidth bound - it reads/writes the huge attention matrix from HBM repeatedly. FlashAttention is compute-bound - it does more FLOPs (recomputation) but avoids HBM reads/writes, which is faster overall.
Results: 2-4x wall-clock speedup, $O(n)$ memory instead of $O(n^2)$, enables much longer context lengths. FlashAttention-2 further optimizes work partitioning. FlashAttention-3 targets H100 with asynchronous computation. This is now the default attention implementation in most major frameworks."
Scoring: Strong Hire = tiling + online softmax + SRAM vs HBM + IO-aware analysis + memory reduction. Lean Hire = knows it is faster and uses less memory but not how. No Hire = has not heard of FlashAttention.
Deep dive: Transformer Architecture
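The online softmax trick mentioned in Q24 is the piece that lets attention be accumulated block by block. The following sketch handles one query with scalar values and shows only the running-statistics update; it does not model GPU tiling or SRAM, and the names are hypothetical.

```python
import math

def reference(scores, values):
    """Standard softmax-weighted sum using all scores at once."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    return sum(e * v for e, v in zip(exps, values)) / sum(exps)

def online_softmax_weighted_sum(scores, values, block_size=2):
    """Same result, processing keys in blocks with running statistics."""
    m = float("-inf")   # running maximum of the scores seen so far
    l = 0.0             # running sum of exp(score - m)
    acc = 0.0           # running sum of exp(score - m) * value
    for start in range(0, len(scores), block_size):
        s_blk = scores[start:start + block_size]
        v_blk = values[start:start + block_size]
        m_new = max(m, max(s_blk))
        correction = math.exp(m - m_new)   # rescale old sums to the new max
        l = l * correction + sum(math.exp(s - m_new) for s in s_blk)
        acc = acc * correction + sum(math.exp(s - m_new) * v
                                     for s, v in zip(s_blk, v_blk))
        m = m_new
    return acc / l

scores = [0.1, 2.0, -1.0, 3.5, 0.7]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
tiled = online_softmax_weighted_sum(scores, values)
full = reference(scores, values)
```

Only three running quantities are kept per query regardless of sequence length, which is why the full row of attention scores never has to be stored.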
Q25: How does a diffusion model generate images? Walk through both training and inference.
Model Answer:
"Training:
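- Sample a clean image $x_0$ from the dataset
- Sample a random timestep $t \sim \text{Uniform}\{1, \dots, T\}$
- Sample noise $\epsilon \sim \mathcal{N}(0, I)$
- Create noisy image: $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$
- Train the network to predict $\epsilon$: minimize $\|\epsilon - \epsilon_\theta(x_t, t)\|^2$
The network (typically a U-Net) takes the noisy image and timestep as input and outputs the predicted noise.
Inference:
- Start with pure noise $x_T \sim \mathcal{N}(0, I)$
- For $t = T, T-1, \dots, 1$:
  - Predict the noise: $\hat{\epsilon} = \epsilon_\theta(x_t, t)$
  - Compute the denoised estimate and take a reverse diffusion step to obtain $x_{t-1}$
- Output $x_0$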
For text-to-image, the network is conditioned on text embeddings via cross-attention, and classifier-free guidance amplifies the conditional signal.
The training is simple and stable (just MSE), but inference requires T (typically 20-1000) network evaluations, making it slow."
Scoring: Strong Hire = both training and inference algorithms with formulas + conditioning + guidance + speed tradeoff. Lean Hire = correct high-level process but missing formulas. No Hire = cannot describe the training procedure.
Deep dive: Generative Models
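The forward noising step and the denoised estimate from Q25 can be sketched without a neural network by plugging in the true noise where the model's prediction would go. The linear beta schedule and its endpoints are assumptions for illustration; the answer above does not specify a schedule.

```python
import math
import random

def make_alpha_bar(T, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule
    (the schedule and its endpoints are illustrative assumptions)."""
    alpha_bar, running = [], 1.0
    for i in range(T):
        beta = beta_start + (beta_end - beta_start) * i / (T - 1)
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar

def add_noise(x0, eps, a_bar):
    """x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * eps."""
    return [math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * e
            for x, e in zip(x0, eps)]

def predict_x0(x_t, eps_hat, a_bar):
    """Denoised estimate obtained by inverting the noising equation."""
    return [(x - math.sqrt(1.0 - a_bar) * e) / math.sqrt(a_bar)
            for x, e in zip(x_t, eps_hat)]

rng = random.Random(0)
alpha_bar = make_alpha_bar(T=1000)
x0 = [0.2, -0.7, 1.0]                        # a toy "image" with three pixels
eps = [rng.gauss(0.0, 1.0) for _ in x0]

t = 500
x_t = add_noise(x0, eps, alpha_bar[t])
x0_hat = predict_x0(x_t, eps, alpha_bar[t])  # perfect noise prediction recovers x0
```

In training, `eps` would be the regression target for $\epsilon_\theta(x_t, t)$; at inference, the network's prediction replaces it in `predict_x0` before the sampler steps to $x_{t-1}$.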
Section 3 - Senior/Staff Level Questions
These questions test deep understanding, system-level thinking, and the ability to make architectural decisions. Expect 15-20 minute discussions per question.
Q26: Design the training infrastructure for a 100B parameter language model from scratch.
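Model Answer:
"Compute budget: Chinchilla-optimal training: $D = 20N = 20 \times 100\text{B} = 2\text{T}$ tokens. Compute: $C = 6ND = 6 \times 10^{11} \times 2 \times 10^{12} = 1.2 \times 10^{24}$ FLOPs.
Hardware: 1024 H100 GPUs (128 nodes x 8 GPUs). At 40% MFU (assuming roughly 1 PFLOPS of peak dense BF16 throughput per GPU): effective 400 PFLOPS. Time: $1.2 \times 10^{24} / (4 \times 10^{17}) = 3 \times 10^{6}$ seconds = 35 days.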
Parallelism: TP=8 within node (NVLink). PP=4 across node groups (InfiniBand). DP=1024/(8x4)=32. ZeRO Stage 1 within DP group.
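Memory per GPU (conservative estimate that counts only TP and ZeRO sharding; the PP=4 split reduces each model-state term by a further 4x): Weights (TP=8): 100B x 2 / 8 = 25GB. Optimizer (ZeRO-1, DP=32): 100B x 12 / (8 x 32) = ~4.7GB. Gradients: ~25GB. Total static: ~55GB. Activations with checkpointing: ~20GB. Total: ~75GB on 80GB H100.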
Training configuration: Batch size 2048 sequences x 4096 tokens. LR 3e-4 with cosine decay. Warmup 2000 steps. BF16 mixed precision. Gradient clipping 1.0.
Data pipeline: Tokenized data stored in memory-mapped files. WebDataset format for efficient streaming. Data mixing: web text (60%), books (15%), code (15%), academic (10%).
Failure recovery: Checkpoint every 500 steps to object storage. Elastic training handles up to 5% node failures. Automatic restart with learning rate rewind. Loss spike detection: if loss exceeds 2x running average, revert to last checkpoint.
Monitoring: Track per-layer gradient norms, activation magnitudes, learning rate, loss, MFU, GPU utilization, inter-node bandwidth. Alert on anomalies."
Scoring: Strong Hire = complete compute budget + parallelism config + memory calculation + data pipeline + failure recovery + monitoring. Lean Hire = reasonable parallelism but missing compute budget or failure recovery. No Hire = cannot design a multi-dimensional parallelism config.
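The compute-budget step of Q26 is worth being able to reproduce quickly. This sketch assumes a peak of 989 TFLOPS dense BF16 per H100 (an assumption not stated in the answer) and uses the 6ND approximation; the function name is hypothetical.

```python
def training_days(params, tokens, num_gpus, peak_flops_per_gpu, mfu):
    """Estimated wall-clock days: 6 * N * D FLOPs divided by effective throughput."""
    total_flops = 6 * params * tokens
    effective_flops_per_sec = num_gpus * peak_flops_per_gpu * mfu
    return total_flops / effective_flops_per_sec / 86_400   # seconds per day

params = 100e9
tokens = 20 * params               # Chinchilla rule of thumb: 2T tokens
days = training_days(
    params, tokens,
    num_gpus=1024,
    peak_flops_per_gpu=989e12,     # assumed H100 dense BF16 peak
    mfu=0.40,
)
print(f"{days:.1f} days")
```

The estimate is linear in every input, so halving MFU or the GPU count doubles the schedule; it also excludes restarts and evaluation, which a real plan would budget for separately.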
Q27: When would you choose a diffusion model over an autoregressive model for generation, and vice versa?
Model Answer:
"Diffusion for: (1) Continuous data - images, audio, video, 3D - where the output space is naturally continuous and high-dimensional. Diffusion's denoising objective is a natural fit. (2) When you need diversity - diffusion covers all modes by design. (3) When you want fine-grained control - classifier-free guidance, inpainting, style transfer are natural with diffusion.
Autoregressive for: (1) Discrete sequential data - text, code, music tokens - where the output has a natural left-to-right order. (2) When you need exact likelihood - autoregressive models provide tractable log-likelihood. (3) When you need reasoning - chain-of-thought requires sequential token generation where each token conditions on all previous tokens. (4) When you want a unified model - LLMs can handle many tasks with prompting.
Hybrid approaches are emerging: (1) Autoregressive models generating image tokens (DALL-E 1, Parti, Chameleon). (2) Diffusion models with discrete denoising for text (MDLM, SEDD). (3) Diffusion for planning/draft + autoregressive for refinement.
Key tradeoff: Autoregressive is sequential but exact. Diffusion is parallel per step but iterative across steps. For text, the sequential nature matches human language. For images, the parallel nature matches spatial structure."
Scoring: Strong Hire = clear criteria for each + examples + hybrid approaches + key tradeoff articulated. Lean Hire = reasonable comparison but misses one paradigm's strengths. No Hire = no clear framework for choosing.
Q28: Explain why pre-training with self-supervised objectives works so well. What does the model actually learn?
Model Answer:
"Self-supervised pre-training works because predicting the next token (or masked tokens, or denoised images) forces the model to learn a compressed representation of the training distribution - capturing syntax, semantics, world knowledge, and reasoning patterns.
What the model learns at different scales:
- Small models (~100M): Syntax, grammar, common collocations, basic factual associations.
- Medium models (~1-10B): Semantic understanding, analogy reasoning, multi-step factual chains, basic code generation.
- Large models (~100B+): Complex reasoning, chain-of-thought, few-shot learning, cross-domain transfer, theory of mind (emergent).
Why next-token prediction is so powerful: The loss function requires modeling ALL aspects of text to minimize - factual knowledge, logical reasoning, stylistic patterns, conversational structure. The model must build internal representations of all these phenomena to achieve low loss.
The scaling hypothesis: As models and data scale, the internal representations become richer and more general. At some point, the representations become useful for tasks the model was never explicitly trained on (emergent capabilities).
Limitations: Pre-training optimizes for distribution matching, not for truthfulness, helpfulness, or safety. This is why alignment (RLHF, constitutional AI) is needed as a second phase."
Scoring: Strong Hire = explains compression/representation argument + scale-dependent capabilities + why the objective is powerful + limitations requiring alignment. Lean Hire = knows pre-training works but cannot articulate why deeply. No Hire = "it just learns from lots of data."
Q29: You trained a model and it performs well on your benchmark but poorly in production. Debug this systematically.
Model Answer:
"This is a distribution shift problem. Systematic investigation:
1. Data distribution mismatch: Compare the production data distribution to the training/benchmark distribution. Are there new categories, languages, edge cases, or adversarial inputs? Collect production samples and measure feature drift using statistical tests (KS test, MMD, population stability index).
2. Evaluation metric mismatch: Does the benchmark metric correlate with the production success metric? A model optimized for accuracy may fail on latency, calibration, fairness, or user satisfaction. Identify the true production metric and evaluate against it.
3. Preprocessing discrepancies: Is the production data pipeline identical to the training pipeline? Different tokenization, normalization, image resizing, or feature extraction can cause silent failures. Run the same input through both pipelines and compare.
4. Temporal drift: Is the production data from a different time period? Language models trained on 2023 data may fail on 2024 events. Recommendation models trained in summer may fail in winter.
5. Adversarial/edge cases: Production users behave differently from benchmark creators. They find edge cases, provide unusual inputs, and attempt to break the model. Run adversarial evaluations.
6. Infrastructure issues: Quantization errors (if model was quantized for serving), batching effects, memory limits causing silent truncation, different library versions.
Mitigation: (1) Continuously monitor production metrics, (2) Maintain a golden evaluation set that mirrors production distribution, (3) A/B test before full deployment, (4) Set up automated retraining on production data."
Scoring: Strong Hire = 5+ systematic categories + specific tools/tests + mitigation plan. Lean Hire = identifies 2-3 causes but not systematically. No Hire = "train on more data."
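The population stability index mentioned in step 1 can be sketched as follows for a single numeric feature. The binning scheme (equal-width bins over the reference range, with out-of-range production values clamped into the edge bins) and the `eps` floor are assumptions made for this illustration. The 0.1 and 0.25 thresholds sometimes quoted for PSI are rules of thumb, not guarantees.

```python
import math
import random

def population_stability_index(expected, actual, n_bins=10, eps=1e-6):
    """PSI between a reference sample and a production sample of one feature."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant feature

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            # Clamp so out-of-range production values land in the edge bins.
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1
        # Floor each fraction at eps so the logarithm below is always defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = bin_fractions(expected), bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

rng = random.Random(0)
train = [rng.gauss(0.0, 1.0) for _ in range(5000)]
prod_same = [rng.gauss(0.0, 1.0) for _ in range(5000)]
prod_shifted = [rng.gauss(1.0, 1.0) for _ in range(5000)]

psi_same = population_stability_index(train, prod_same)
psi_shifted = population_stability_index(train, prod_shifted)
print(f"PSI, same distribution: {psi_same:.3f}")
print(f"PSI, mean shifted by 1: {psi_shifted:.3f}")
```

In practice you would compute this per feature (or per embedding dimension) and alert on the largest values.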
Q30: Explain the loss NaN debugging process for a large language model training run.
Model Answer:
"Loss NaN during LLM training is usually caused by numerical instability. Systematic debugging:
Immediate triage:
- When did it happen? Step number, learning rate at that step, data batch ID. Early (first 100 steps) vs late (after 10K steps) suggests different causes.
- Gradient norm history: Was there a spike before the NaN? Gradients growing over 10-100 steps then exploding suggests accumulating instability. A sudden spike suggests a bad data batch.
Common causes by timing:
Early training NaN:
- Bad initialization (weights too large/small)
- Learning rate too high (especially without warmup)
- Missing gradient clipping
- FP16 overflow (activations exceed 65504)
Late training NaN:
- A particularly bad data batch (corrupted text, extremely long sequences)
- Loss scaling failure (dynamic loss scale drops too low)
- Accumulating numerical error in attention softmax
- LayerNorm encountering zero variance
Debugging steps:
- Add gradient norm logging per layer - identify which layer diverges first
- Add activation magnitude logging - check for pre-softmax values exceeding FP16 range
- Check the specific data batch - is there a 100K token sequence or corrupted data?
- Check loss scaling - is the dynamic scale at its minimum?
- Try BF16 instead of FP16 (same exponent range as FP32, so overflow is far less likely)
- Try reducing learning rate by 2x
- Try tightening gradient clipping (lower the max norm from 1.0 to 0.5)
- If specific layer diverges: check its initialization, normalization, and residual connection
Prevention: Use BF16, gradient clipping at 1.0, learning rate warmup (1000-2000 steps), data quality filtering (remove extremely long or corrupted samples), and checkpoint frequently enough to resume without losing much work."
Scoring: Strong Hire = timing-based diagnosis + 5+ specific causes + systematic debugging steps + prevention strategy. Lean Hire = identifies some causes but not systematically. No Hire = "restart training with a lower learning rate."
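Two of the debugging steps above, per-layer gradient norm logging and global-norm clipping, can be sketched framework-free. This is an illustration only: gradients are plain lists keyed by hypothetical layer names, and skipping the step on a non-finite gradient is one possible policy, not the only one.

```python
import math

def global_grad_norm(grads_by_layer):
    """Return (global L2 norm, per-layer L2 norms)."""
    per_layer = {name: math.sqrt(sum(g * g for g in grads))
                 for name, grads in grads_by_layer.items()}
    total = math.sqrt(sum(n * n for n in per_layer.values()))
    return total, per_layer

def check_and_clip(grads_by_layer, max_norm=1.0):
    """Clip by global norm; report the offending layers if any norm is NaN/inf."""
    total, per_layer = global_grad_norm(grads_by_layer)
    bad_layers = [name for name, n in per_layer.items() if not math.isfinite(n)]
    if bad_layers:
        # Do not apply this update; log which layers diverged first.
        return None, {"skip_step": True, "bad_layers": bad_layers}
    scale = min(1.0, max_norm / (total + 1e-6))
    clipped = {name: [g * scale for g in grads]
               for name, grads in grads_by_layer.items()}
    return clipped, {"skip_step": False, "total_norm": total,
                     "per_layer": per_layer}

grads = {"layer0.attn": [3.0, 4.0], "layer1.mlp": [0.0, 12.0]}
clipped, report = check_and_clip(grads, max_norm=1.0)
clipped_norm, _ = global_grad_norm(clipped)

bad, bad_report = check_and_clip({"layer0.attn": [1.0],
                                  "layer1.mlp": [float("nan")]})
```

Logging `report["per_layer"]` every step gives the history you need to tell a slow build-up from a sudden spike.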
Q31: How would you implement efficient fine-tuning for a 70B model with limited compute?
Model Answer:
"Parameter-efficient fine-tuning (PEFT) methods avoid updating all 70B parameters:
LoRA (Low-Rank Adaptation): Add trainable low-rank matrices to attention weights: $W' = W + BA$ where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and $r \ll \min(d, k)$ (typically 8-64). Only train A and B (~0.1% of parameters). Memory: only need optimizer states for LoRA params. Can serve multiple tasks by swapping LoRA adapters.
QLoRA: Quantize the base model to 4-bit (NF4 quantization), then add LoRA adapters in BF16. Reduces memory from 140GB to ~35GB for a 70B model. Can fine-tune 70B on a single 48GB GPU.
Why these work: Pre-trained weights capture general knowledge. Fine-tuning only needs to adapt the model to a new task or style, which requires a low-rank update - the 'task-specific' information lives in a low-dimensional subspace.
Comparison:
| Method | Trainable Params | Memory (70B) | Quality |
|---|---|---|---|
| Full fine-tuning | 70B (100%) | ~1120 GB | Best |
| LoRA r=16 | ~170M (0.24%) | ~160 GB | 95-99% of full |
| QLoRA r=16 | ~170M (0.24%) | ~35 GB | 93-97% of full |
| Adapter layers | ~500M (0.7%) | ~180 GB | 94-98% of full |
| Prompt tuning | ~100K (0.0001%) | ~140 GB | 80-90% of full |
I would choose QLoRA for its memory efficiency. For production, merge the LoRA weights back into the base model (zero inference overhead)."
Scoring: Strong Hire = LoRA mechanism + QLoRA + memory calculations + comparison table + merge for serving + explanation of why low-rank works. Lean Hire = knows LoRA exists but cannot explain the mechanism. No Hire = "just fine-tune the last layer."
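The weight-memory figures in the answer (140GB, ~35GB, ~1120GB) follow from bytes-per-parameter arithmetic, sketched below. The 16 bytes per parameter for full fine-tuning (BF16 weights and gradients plus FP32 master weights and two Adam moments) and the 8192 x 8192 projection size are assumptions for illustration. The LoRA count covers a single adapted matrix, not the whole model.

```python
def lora_param_count(d_out, d_in, rank):
    """Trainable parameters for one adapted matrix: B is d_out x r, A is r x d_in."""
    return rank * (d_out + d_in)

def weight_memory_gb(n_params, bytes_per_param):
    """Memory in decimal GB for n_params stored at bytes_per_param each."""
    return n_params * bytes_per_param / 1e9

n = 70e9
print(f"BF16 weights:           {weight_memory_gb(n, 2):.0f} GB")
print(f"4-bit weights:          {weight_memory_gb(n, 0.5):.0f} GB")
print(f"Full FT at 16 B/param:  {weight_memory_gb(n, 16):.0f} GB")

full = 8192 * 8192
lora = lora_param_count(8192, 8192, rank=16)
print(f"LoRA r=16 on one 8192x8192 matrix: {lora:,} trainable params "
      f"({lora / full:.2%} of the full matrix)")
```

The totals for LoRA and QLoRA in the table also include activations and adapter optimizer state, which this sketch does not model.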
Q32: Explain the difference between pre-LayerNorm and post-LayerNorm transformers and why it matters.
Model Answer:
"Post-LayerNorm (original Transformer): $x_{l+1} = \mathrm{LN}(x_l + F(x_l))$. LayerNorm is applied after the residual addition, so it sits directly on the residual path and there is no clean identity route for gradients. At initialization the gradients of the layers near the output are large, which makes deep models unstable unless you use a small learning rate and careful warmup.
Pre-LayerNorm (GPT-2, most modern LLMs): $x_{l+1} = x_l + F(\mathrm{LN}(x_l))$. LayerNorm is applied before the attention/FFN sublayer. Each sublayer receives normalized inputs and the residual path stays an identity, so gradients are better behaved across depth. This makes very deep transformers (100+ layers) trainable with much less sensitivity to learning rate warmup.
Why it matters:
- Pre-LN is much more stable - almost always preferred for training large models
- Post-LN can achieve slightly better final performance with careful tuning (because the final layer output is normalized)
- Pre-LN requires an extra LayerNorm after the final layer (the residual stream is not normalized)
- Many recent models use RMSNorm (simplified LayerNorm without mean centering) in the pre-LN position
The choice between pre and post norm is one of the most practical architecture decisions with real impact on training stability. Getting this wrong can waste weeks of GPU time on unstable training runs."
Scoring: Strong Hire = both formulas + stability analysis + gradient flow explanation + practical implications + RMSNorm mention. Lean Hire = knows there is a difference but cannot explain the stability implications. No Hire = does not know pre vs post LayerNorm is a design choice.
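The two block layouts can be contrasted with a toy example. This is a sketch only: `sublayer` is a hypothetical stand-in for attention or the FFN, and LayerNorm has no learned scale or bias. It shows why a pre-LN stack needs a final normalization, since the residual stream grows with depth while the post-LN stream is renormalized after every block.

```python
import math

def layer_norm(x, eps=1e-5):
    """LayerNorm without learned parameters: zero mean, unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def sublayer(x):
    """Hypothetical stand-in for attention or FFN."""
    return [2.0 * v + 1.0 for v in x]

def post_ln_block(x):
    # x_{l+1} = LN(x_l + F(x_l)): normalization sits on the residual path.
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_ln_block(x):
    # x_{l+1} = x_l + F(LN(x_l)): the residual path is an untouched identity.
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

x = [1.0, 2.0, 3.0, 4.0]
post, pre = x, x
for _ in range(8):
    post, pre = post_ln_block(post), pre_ln_block(pre)

print("post-LN stream:", [round(v, 2) for v in post])
print("pre-LN stream: ", [round(v, 2) for v in pre])
```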
Section 4 - Company-Tagged Questions
These questions are frequently asked at specific companies. Study the ones for your target companies.
Q33: [Google] Explain the T5 architecture and how it differs from GPT and BERT.
Model Answer: "T5 (Text-to-Text Transfer Transformer) is an encoder-decoder model that frames ALL NLP tasks as text-to-text: input text maps to output text. Translation: 'translate English to French: cat' -> 'chat'. Summarization: 'summarize: [article]' -> '[summary]'. Classification: 'classify: [text]' -> 'positive'.
BERT is encoder-only (bidirectional attention, MLM pre-training, requires task-specific heads). GPT is decoder-only (causal attention, autoregressive pre-training, no encoder). T5 has both encoder (bidirectional) and decoder (causal with cross-attention).
T5 uses relative position biases instead of absolute position embeddings, which generalize better to unseen sequence lengths. It uses a span corruption pre-training objective (mask random spans, predict them) rather than BERT's random token masking or GPT's next-token prediction."
Scoring: Strong Hire = text-to-text framing + architecture comparison + position encoding difference + span corruption. Lean Hire = knows the architecture but not the design philosophy. No Hire = confuses T5 with BERT or GPT.
Q34: [Meta] How does Llama 2/3 differ from the original GPT architecture?
Model Answer: "Key modifications from GPT to Llama: (1) Pre-LayerNorm with RMSNorm instead of post-LayerNorm with full LayerNorm - simpler and more stable. (2) SwiGLU activation in FFN instead of GELU - $\mathrm{FFN}(x) = W_2\,(\mathrm{SiLU}(W_1 x) \odot W_3 x)$. Because the gate adds a third weight matrix, the hidden dimension is scaled to roughly $\frac{2}{3} \cdot 4d$ to keep the parameter count comparable. (3) Rotary Position Embeddings (RoPE) instead of absolute positional embeddings - encodes relative position through rotation matrices applied to Q and K, which helps with length generalization. (4) Grouped-Query Attention (GQA) in Llama 2 70B and Llama 3 - shares KV heads across multiple query heads, reducing KV cache memory. (5) No bias terms in linear layers - simplifies implementation, negligible performance impact. (6) Llama 3 training: 15T tokens (far beyond Chinchilla optimal) for inference efficiency."
Scoring: Strong Hire = RMSNorm + SwiGLU + RoPE + GQA + training scale explanation. Lean Hire = knows 2-3 differences. No Hire = "Llama is basically GPT with more data."
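RMSNorm and the SwiGLU gate from the answer above are both small enough to write out. This is a minimal sketch on plain lists: the linear projections of the FFN are omitted, so `swiglu_gate` only shows the elementwise gating, and the `eps` value is an assumption.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: rescale by the root-mean-square; no mean subtraction, no bias."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]

def silu(z):
    """SiLU (Swish): z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def swiglu_gate(gate_pre, up_pre):
    """Elementwise SiLU(gate) * up, the gating inside a SwiGLU FFN."""
    return [silu(g) * u for g, u in zip(gate_pre, up_pre)]

x = [3.0, 4.0]
y = rms_norm(x, weight=[1.0, 1.0])
mean_square = sum(v * v for v in y) / len(y)  # close to 1 after normalization
gated = swiglu_gate([0.0, 2.0], [5.0, 5.0])
```

Compared with LayerNorm, RMSNorm skips the mean computation and the bias, which is the simplification the answer refers to.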
Q35: [OpenAI/Anthropic] What is RLHF and why is it needed?
Model Answer: "Reinforcement Learning from Human Feedback aligns language models with human preferences. Three stages: (1) Supervised Fine-Tuning (SFT) - fine-tune on high-quality instruction-following data. (2) Reward Model Training - collect human comparisons (which of two outputs is better), train a model to predict human preference scores. (3) RL Optimization - use PPO (Proximal Policy Optimization) to fine-tune the LLM to maximize the reward model's score, with a KL penalty to prevent diverging too far from the SFT model.
Why it is needed: pre-training optimizes for next-token prediction, which produces models that are knowledgeable but not helpful, honest, or harmless. A model trained on internet text will happily produce toxic content, hallucinate facts, or refuse to answer simple questions. RLHF teaches the model to be helpful (follow instructions), honest (express uncertainty), and harmless (refuse dangerous requests).
Alternatives: DPO (Direct Preference Optimization) skips the reward model and directly optimizes the policy from preference data - simpler, no reward model training, but potentially less expressive. Constitutional AI (Anthropic) uses the model itself to generate critiques and revisions, reducing reliance on human labels."
Scoring: Strong Hire = three stages + why needed + KL penalty + DPO/CAI alternatives. Lean Hire = knows the concept but not the three-stage process. No Hire = cannot explain RLHF.
Q36: [Amazon] How would you serve a 70B model at low latency for a customer-facing product?
Model Answer: "Quantization: Apply GPTQ or AWQ 4-bit quantization - reduces weight memory from 140GB to ~35GB, which fits on a single 80GB A100 or, with limited headroom for the KV cache, on 2 A10Gs (24GB each). Quality loss: 1-3% on benchmarks. 8-bit is safer if quality is critical.
KV Cache Optimization: Use GQA architecture (Llama 2/3), PagedAttention (vLLM) for efficient cache management, or multi-query attention. KV cache is often the memory bottleneck for long contexts.
Batching: Continuous batching (not static batching) - as requests finish, immediately add new ones. vLLM or TensorRT-LLM handle this automatically.
Speculative Decoding: Use a small draft model (7B) to generate candidate tokens, verify with the 70B model in parallel. Can achieve 2-3x speedup without quality loss.
Infrastructure: A100/H100 GPUs with tensor parallelism for single-request latency reduction. Or multiple smaller GPUs (L4, A10G) with model sharding for cost efficiency.
Target latency: First token in under 500ms, subsequent tokens at 30+ tokens/second. For 200 tokens: roughly 7 seconds total (0.5 s + 200 / 30 ≈ 6.7 s).
Cost optimization: Spot instances for non-critical traffic, request batching for throughput, caching for repeated queries, and a smaller model (7B-13B) for simple queries with routing."
Scoring: Strong Hire = quantization + KV cache + batching + speculative decoding + specific latency targets + cost optimization. Lean Hire = mentions quantization and batching but no specifics. No Hire = "use a GPU."
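The KV cache and latency points in the answer lend themselves to quick arithmetic. The sketch below assumes a Llama-2-70B-like shape (80 layers, 128-dimensional heads, 64 query heads, 8 KV heads under GQA) and FP16 cache entries. These numbers are illustrative assumptions and should be replaced with the actual model config.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch,
                   bytes_per_elem=2):
    """Keys and values (factor 2) for every layer, KV head, and cached token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

def total_latency_s(ttft_s, n_tokens, tokens_per_s):
    """Time to first token plus steady-state decoding of the output."""
    return ttft_s + n_tokens / tokens_per_s

# Assumed 70B-class shape; full multi-head attention would use 64 KV heads.
gqa = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128,
                     seq_len=4096, batch=1)
mha = kv_cache_bytes(n_layers=80, n_kv_heads=64, head_dim=128,
                     seq_len=4096, batch=1)
latency = total_latency_s(ttft_s=0.5, n_tokens=200, tokens_per_s=30)

print(f"KV cache with GQA: {gqa / 2**30:.2f} GiB per 4096-token sequence")
print(f"KV cache with MHA: {mha / 2**30:.2f} GiB per 4096-token sequence")
print(f"200-token response: {latency:.1f} s")
```

Multiplying the per-sequence figure by the number of concurrent requests shows why the cache, not the weights, often limits batch size.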
Q37: [Apple] How do you train a model for on-device deployment with strict memory constraints?
Model Answer: "On-device means: typically 2-6GB memory budget, CPU or Apple Neural Engine (ANE), no internet required, under 100ms latency.
Architecture: Start with a small model (1-3B parameters). Use efficient attention (GQA or multi-query). Use SwiGLU activation. Minimize embedding table size.
Training approach: Knowledge distillation from a large teacher (70B) to a small student (3B). This preserves much of the quality in a deployable size. Train the student on task-specific data for the target use case.
Compression pipeline:
- Structured pruning (remove entire attention heads or FFN neurons that contribute least)
- Quantization-aware training (QAT) - simulate 4-bit or 8-bit during training so the model learns to be robust to quantization
- Post-training quantization to INT4/INT8 using GPTQ or similar
- Weight clustering for further compression
Optimization for Apple Silicon:
- CoreML or MLX for inference
- ANE-friendly operations (avoid dynamic shapes, use static graphs)
- Metal Performance Shaders for GPU fallback
Final model: ~1.5GB for a 3B INT4 model, runs at 20-30 tokens/second on iPhone 15 Pro."
Scoring: Strong Hire = distillation + pruning + QAT + platform-specific optimization (CoreML/ANE) + specific size/speed targets. Lean Hire = mentions quantization but not the full pipeline. No Hire = "use a smaller model."
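To make the quantization step concrete, here is a minimal sketch of symmetric absmax quantization on a handful of weights, plus the size arithmetic behind the "~1.5GB for a 3B INT4 model" figure. It is per-tensor and round-to-nearest. Methods such as GPTQ are considerably more sophisticated, so treat this as an illustration of the idea rather than of those algorithms.

```python
def quantize_absmax(weights, n_bits=8):
    """Symmetric per-tensor quantization onto signed integers."""
    qmax = 2 ** (n_bits - 1) - 1
    absmax = max(abs(w) for w in weights)
    scale = absmax / qmax if absmax > 0 else 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Map integer codes back to approximate float weights."""
    return [v * scale for v in q]

def model_size_gb(n_params, n_bits):
    """Weight storage in decimal GB, ignoring scales and other metadata."""
    return n_params * n_bits / 8 / 1e9

w = [0.02, -0.5, 0.31, 1.27, -1.0]
q8, s8 = quantize_absmax(w, n_bits=8)
q4, s4 = quantize_absmax(w, n_bits=4)
err8 = max(abs(a - b) for a, b in zip(w, dequantize(q8, s8)))
err4 = max(abs(a - b) for a, b in zip(w, dequantize(q4, s4)))

print(f"max abs error, INT8: {err8:.4f}")
print(f"max abs error, INT4: {err4:.4f}")
print(f"3B params at 4 bits: {model_size_gb(3e9, 4):.1f} GB")
```

The rounding error is bounded by half the scale, which is why a single outlier weight (large absmax) hurts low-bit quantization so much.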
Q38: [Google] Explain Mixture of Experts (MoE) and its training challenges.
Model Answer: "MoE replaces the dense FFN in a transformer with multiple 'expert' FFN networks and a gating mechanism that routes each token to the top-K experts (typically K=2). The total parameter count is large (expert count x expert size) but only K experts are active per token, so compute cost is much lower than a dense model of the same total size.
Gating: A learned linear layer maps the token embedding to a distribution over experts. Top-K experts are selected. Their outputs are weighted by the gate values and summed.
Training challenges:
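- Load balancing: Without intervention, the gate learns to route most tokens to a few experts, leaving others unused. Fix: auxiliary loss that penalizes uneven routing ($L_{\text{aux}} = \alpha \, N \sum_{i=1}^{N} f_i P_i$, where $f_i$ is the fraction of tokens routed to expert $i$ and $P_i$ is the mean gate probability for expert $i$).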
- Communication cost: In distributed training, tokens must be sent to the GPU holding their assigned expert (all-to-all communication). This can be a bottleneck.
- Instability: Expert routing can oscillate. Fixes: expert capacity factor (cap tokens per expert), noise in gating (explore different experts).
- Fine-tuning difficulty: Expert specialization during pre-training may not transfer well to fine-tuning tasks.
Examples: Switch Transformer (Google), Mixtral 8x7B (Mistral), GShard, ST-MoE. Mixtral has 8 experts with 2 active, giving ~13B active parameters from ~47B total."
Scoring: Strong Hire = routing mechanism + load balancing + communication cost + 3+ challenges + real examples with numbers. Lean Hire = knows the concept but not the challenges. No Hire = cannot explain MoE routing.
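Top-K routing and a load-balancing loss can be sketched in a few lines. The loss below follows the Switch Transformer form (number of experts times the sum of routed fraction times mean gate probability), adapted here to top-2 by counting routed slots. That adaptation, the omitted coefficient, and the toy router logits are assumptions for illustration.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k=2):
    """Return ([(expert, gate_weight), ...], full_probs) for one token."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)  # renormalize gates over the chosen experts
    return [(i, probs[i] / norm) for i in top], probs

def load_balance_loss(all_router_logits, k=2):
    """N * sum_i f_i * P_i; equals 1.0 when routing is perfectly uniform."""
    n_experts = len(all_router_logits[0])
    n_tokens = len(all_router_logits)
    f = [0.0] * n_experts  # fraction of routed slots per expert
    p = [0.0] * n_experts  # mean router probability per expert
    for logits in all_router_logits:
        chosen, probs = route_top_k(logits, k)
        for i, _ in chosen:
            f[i] += 1.0 / (n_tokens * k)
        for i, pi in enumerate(probs):
            p[i] += pi / n_tokens
    return n_experts * sum(fi * pi for fi, pi in zip(f, p))

balanced_logits = [[2, 2, 0, 0], [0, 2, 2, 0], [0, 0, 2, 2], [2, 0, 0, 2]]
collapsed_logits = [[2, 2, 0, 0]] * 4  # every token prefers experts 0 and 1

balanced = load_balance_loss(balanced_logits)
collapsed = load_balance_loss(collapsed_logits)
print(f"balanced routing:  {balanced:.3f}")
print(f"collapsed routing: {collapsed:.3f}")
```

The collapsed case scores higher, so adding this term (scaled by a small coefficient) to the training loss pushes the router toward even expert usage.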
Q39: [Anthropic] Explain constitutional AI and how it differs from RLHF.
Model Answer: "Constitutional AI (CAI) is Anthropic's approach to aligning language models using a set of principles ('constitution') rather than large amounts of human feedback.
Process:
- Start with a helpful RLHF-trained model
- Elicit harmful outputs with red-teaming prompts
- Ask the model itself to critique its output based on the constitution ('identify ways this response could be harmful')
- Ask the model to revise its output based on the critique
- Fine-tune the model on the revised responses (supervised stage)
- Sample pairs of responses, have the model judge which one better follows the constitution, and train a preference (reward) model on these AI-generated labels
- Fine-tune with RL using this reward model (RLAIF - RL from AI Feedback)
How it differs from RLHF:
- RLHF requires extensive human labeling of preferences. CAI uses the model's own judgment guided by principles.
- RLHF captures implicit human values. CAI makes values explicit through the constitution.
- CAI is more scalable (less human labor) and more transparent (values are written down).
- CAI can be iterated - refine the constitution based on observed failures.
Limitations: The model must already be capable enough to critique itself (bootstrapping problem). The constitution may not cover all edge cases. Human values are hard to fully codify in rules."
Scoring: Strong Hire = full CAI process + RLAIF + explicit differences from RLHF + limitations. Lean Hire = knows CAI exists but not the process. No Hire = confuses CAI with RLHF.
Q40: [Meta] How does text-to-image generation work in Stable Diffusion?
Model Answer: "Stable Diffusion is a latent diffusion model with three main components:
- VAE (Autoencoder): Compresses 512x512x3 images to 64x64x4 latent representations (48x compression). Trained separately with reconstruction + perceptual + adversarial losses.
- U-Net Denoiser: Operates in latent space. Architecture: downsampling blocks, middle block, upsampling blocks, all with ResNet blocks + self-attention + cross-attention. Cross-attention receives text embeddings from CLIP. Conditioned on timestep via sinusoidal embeddings + Adaptive Group Norm.
- CLIP Text Encoder: Converts text prompts to embedding sequences. These embeddings are injected into the U-Net via cross-attention at multiple resolution levels.
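Generation: Start with random noise in latent space. Iteratively denoise with the U-Net (conditioned on text). Apply classifier-free guidance: $\hat{\epsilon} = \epsilon_\theta(z_t, \varnothing) + w \, (\epsilon_\theta(z_t, c) - \epsilon_\theta(z_t, \varnothing))$ where $w$ is the guidance scale (7.5-12). Decode the final latent to an image with the VAE decoder.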
SDXL improvements: Larger U-Net, dual text encoders (CLIP ViT-L + OpenCLIP ViT-bigG), an optional refiner stage, and training at higher resolution."
Scoring: Strong Hire = all three components + cross-attention conditioning + CFG formula + latent space advantage + SDXL improvements. Lean Hire = knows it is a diffusion model but not the architecture details. No Hire = cannot explain the pipeline.
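The classifier-free guidance step and the latent compression ratio from the answer can be checked with a short sketch. The noise predictions here are made-up three-element lists standing in for full latent tensors, so only the combination rule is meaningful.

```python
def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale=7.5):
    """eps_hat = eps_uncond + w * (eps_cond - eps_uncond), elementwise."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

# Compression from pixel space to latent space.
pixel_values = 512 * 512 * 3
latent_values = 64 * 64 * 4
compression = pixel_values / latent_values

# Toy noise predictions (illustrative values, not real model outputs).
eps_u = [0.10, -0.20, 0.05]
eps_c = [0.30, -0.10, 0.05]
guided = classifier_free_guidance(eps_u, eps_c, guidance_scale=7.5)

print(f"compression factor: {compression:.0f}x")
print("guided prediction:", [round(v, 3) for v in guided])
```

A scale of 1 recovers the conditional prediction and a scale of 0 the unconditional one. Larger values extrapolate past the conditional prediction, which increases prompt adherence at some cost in diversity.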
Q41: [DeepMind] Explain the scaling laws for neural language models.
Model Answer: "Neural scaling laws describe power-law relationships between model performance and resources:
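$L(N) = (N_c / N)^{\alpha_N}$ where $\alpha_N \approx 0.076$ (Kaplan); $L(D) = (D_c / D)^{\alpha_D}$ where $\alpha_D \approx 0.095$; $L(C) = (C_c / C)^{\alpha_C}$ where $\alpha_C \approx 0.050$. Here $N$ is the parameter count, $D$ the number of training tokens, and $C$ the training compute.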
Kaplan (2020): For fixed compute, scale model size preferentially. Led to large but undertrained models.
Chinchilla (2022): Scale N and D equally. Optimal: ~20 tokens per parameter. Chinchilla (70B parameters, 1.4T tokens) beat Gopher (280B parameters, 300B tokens).
Compute formula: $C \approx 6ND$ FLOPs. This enables compute budgeting before training.
Beyond language: Similar scaling laws hold for vision (ViT), multimodal (CLIP), and code models, though the exponents differ.
Open questions: (1) Do scaling laws predict emergent capabilities? (debated). (2) When do scaling laws break down? (data quality bottleneck, architectural limitations). (3) How do they change with different training objectives (RLHF, instruction tuning)?"
Scoring: Strong Hire = all three power laws + Kaplan vs Chinchilla + 6ND formula + open questions. Lean Hire = knows the general trend but not specific formulas. No Hire = has not heard of scaling laws.
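The 6ND rule and the ~20 tokens-per-parameter heuristic combine into a quick compute-budgeting calculation. The sketch below takes both constants at face value, which is an approximation: the exact Chinchilla-optimal ratio depends on the fit and the compute scale.

```python
import math

def training_flops(n_params, n_tokens):
    """C ~= 6 * N * D for a dense transformer (forward plus backward)."""
    return 6.0 * n_params * n_tokens

def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    """Solve C = 6 * N * D with D = tokens_per_param * N."""
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    return n_params, tokens_per_param * n_params

# Chinchilla itself: 70B parameters trained on 1.4T tokens.
c = training_flops(70e9, 1.4e12)
n_opt, d_opt = chinchilla_optimal(c)

print(f"compute: {c:.2e} FLOPs")
print(f"optimal split: {n_opt / 1e9:.0f}B params, {d_opt / 1e12:.1f}T tokens")
```

Running the same function on a planned budget gives a first estimate of model size and dataset size before any training starts.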
Q42: [NVIDIA] How does tensor parallelism work in Megatron-LM?
Model Answer: "Megatron-LM splits individual transformer layers across GPUs using column and row parallelism.
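MLP block: $Z = \mathrm{GeLU}(XA)\,B$. Split $A$ column-wise across GPUs, $A = [A_1, \dots, A_p]$ - each GPU computes $Y_i = \mathrm{GeLU}(XA_i)$ independently (GeLU is element-wise, no communication). Split $B$ row-wise - each GPU computes a partial sum $Z_i = Y_i B_i$. One AllReduce to sum: $Z = \sum_i Z_i$.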
Attention block: Split Q, K, V projections column-wise across GPUs. Each GPU computes attention for its heads independently. Split output projection row-wise. One AllReduce.
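Result: 2 AllReduces per transformer layer in the forward pass (one for attention, one for MLP), and 2 more in the backward pass. Each AllReduce operates on an activation tensor of $b \times s \times h$ elements ($2bsh$ bytes in BF16), for batch size $b$, sequence length $s$, and hidden size $h$.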
Requirements: NVLink (900 GB/s on H100) is essential because communication happens every layer. This is why TP is limited to within a single node. TP degree is typically 2, 4, or 8 matching the node's GPU count.
Megatron also supports pipeline parallelism and sequence parallelism (splitting LayerNorm and dropout across the TP group to save activation memory)."
Scoring: Strong Hire = column/row split details + AllReduce count + NVLink requirement + sequence parallelism. Lean Hire = knows TP splits layers but not the details. No Hire = cannot explain how a layer is split.
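The column/row split of the MLP block can be simulated on one machine to confirm that the sharded computation matches the unsharded one. This is a pure-Python sketch with tiny made-up matrices: the "GPUs" are loop iterations and the AllReduce is an elementwise sum of the partial outputs.

```python
import math

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def gelu(z):
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))

def mlp(x, a, b):
    """Reference single-device MLP: Z = GeLU(X A) B."""
    y = [[gelu(v) for v in row] for row in matmul(x, a)]
    return matmul(y, b)

def tensor_parallel_mlp(x, a, b, n_gpus):
    """Column-split A, row-split B, then sum the partial outputs."""
    hidden = len(b)  # rows of B == columns of A
    assert hidden % n_gpus == 0, "hidden size must divide evenly across GPUs"
    shard = hidden // n_gpus
    partials = []
    for g in range(n_gpus):
        cols = range(g * shard, (g + 1) * shard)
        a_g = [[row[j] for j in cols] for row in a]  # column shard of A
        b_g = [b[j] for j in cols]                   # matching row shard of B
        partials.append(mlp(x, a_g, b_g))            # purely local compute
    # Simulated AllReduce(sum): combine the partial outputs elementwise.
    return [[sum(p[i][j] for p in partials) for j in range(len(b[0]))]
            for i in range(len(x))]

x = [[1.0, -2.0, 0.5], [0.0, 1.5, -1.0]]
a = [[0.2, -0.1, 0.4, 0.3], [0.5, 0.1, -0.3, 0.2], [-0.4, 0.6, 0.1, -0.2]]
b = [[0.1, 0.2, -0.1], [0.3, -0.2, 0.4], [-0.5, 0.1, 0.2], [0.2, 0.3, -0.3]]

reference = mlp(x, a, b)
sharded = tensor_parallel_mlp(x, a, b, n_gpus=2)
```

The split works because GeLU is applied per element, so each column shard of the hidden activation can be computed without seeing the others. Splitting the first matrix by rows instead would require a sum before the nonlinearity.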
Q43: [Startup] You have a GPU budget of $10K/month. What is the largest model you can fine-tune and serve?
Model Answer: "Budget allocation: $5K training + $5K serving (adjustable).
Training ($5K/month):
- A100 80GB spot instances: ~$1.50/hour = ~3,300 GPU-hours/month
- With QLoRA, a 70B model needs ~1 A100 for fine-tuning
- Fine-tuning 70B on 10K examples: ~4-8 hours = $6-12
- Can fine-tune 70B hundreds of times or run extensive hyperparameter searches
- Alternatively: 2-4 A100s for training a smaller model (7-13B) from scratch on domain data
Serving ($5K/month):
- A10G instances: ~$0.75/hour = ~6,600 GPU-hours/month
- 70B INT4: needs 2x A10G (24GB each) = $1.50/hour for one replica = 3,300 hours
- Throughput per replica: ~30 tokens/second at batch=1
- For ~1M tokens/day output: one replica is sufficient ($1,100/month)
- Budget allows 3-4 replicas for redundancy/throughput
Recommendation: Fine-tune Llama 3 70B with QLoRA for quality-critical tasks, or fine-tune 7-13B for latency-sensitive applications. Serve with vLLM on A10G instances with INT4 quantization. Use a 7B model as a router or for simple queries to reduce 70B usage."
Scoring: Strong Hire = specific cost calculations + QLoRA for training + quantization for serving + budget allocation strategy + practical deployment plan. Lean Hire = reasonable plan but no cost calculations. No Hire = "use an API."
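The budget arithmetic in the answer is easy to script, which helps when prices change. The hourly prices, the 30-day month, and the 30 tokens/second throughput are the answer's own assumptions, not current quotes.

```python
def gpu_hours(budget_usd, price_per_gpu_hour):
    """GPU-hours a budget buys at a given hourly price."""
    return budget_usd / price_per_gpu_hour

def monthly_replica_cost(gpus_per_replica, price_per_gpu_hour, hours=24 * 30):
    """Cost of keeping one serving replica up for a 30-day month."""
    return gpus_per_replica * price_per_gpu_hour * hours

train_hours = gpu_hours(5000, 1.50)           # A100 spot at $1.50/hour
replica_cost = monthly_replica_cost(2, 0.75)  # 2x A10G at $0.75/hour each
max_replicas = int(5000 // replica_cost)
tokens_per_day = 30 * 86_400                  # 30 tokens/s around the clock

print(f"training GPU-hours:  {train_hours:,.0f}")
print(f"cost per replica:    ${replica_cost:,.0f}/month")
print(f"replicas affordable: {max_replicas}")
print(f"tokens/day/replica:  {tokens_per_day:,}")
```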
Q44: [Google] How does the Mixture-of-Depths approach work?
Model Answer: "Mixture of Depths (MoD) applies the idea from MoE to transformer layers themselves - not all tokens need processing by every layer. A learned router decides which tokens skip a layer entirely (passing through only the residual connection) and which tokens get full computation.
Mechanism: At each layer, the router scores each token. Only the top-K tokens (based on a capacity ratio, e.g., 50%) get processed by the attention + FFN block. The remaining tokens pass through unchanged via the residual connection.
Benefits: Reduces compute by 30-50% with minimal quality loss. Particularly effective because many tokens (stop words, repeated context) do not need deep processing - the model learns to allocate compute where it is needed.
Comparison to early exit: Early exit (tokens stop processing partway through the network) is harder to batch efficiently. MoD maintains the full depth for all tokens but selectively applies computation, preserving batch structure."
Scoring: Strong Hire = routing mechanism + capacity ratio + comparison to early exit + compute savings. Lean Hire = knows the concept but not the details. No Hire = confuses with MoE.
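The routing mechanism described above can be sketched with scalar "tokens". This is a simplified illustration: `block` is a hypothetical stand-in for attention plus FFN, and scaling the block output by the router score (so the router receives gradient) is stated here as an assumption about the method rather than a confirmed detail.

```python
def mixture_of_depths_layer(tokens, router_scores, block, capacity_ratio=0.5):
    """Apply `block` to the top-scoring tokens; the rest skip via the residual."""
    k = max(1, int(len(tokens) * capacity_ratio))
    ranked = sorted(range(len(tokens)), key=lambda i: router_scores[i],
                    reverse=True)
    selected = set(ranked[:k])
    out = []
    for i, x in enumerate(tokens):
        if i in selected:
            # Processed token: residual plus router-weighted block output.
            out.append(x + router_scores[i] * block(x))
        else:
            # Skipped token: identity through the residual connection.
            out.append(x)
    return out, selected

tokens = [1.0, 2.0, 3.0, 4.0]
scores = [0.9, 0.1, 0.7, 0.2]
out, selected = mixture_of_depths_layer(tokens, scores,
                                        block=lambda x: 10.0 * x,
                                        capacity_ratio=0.5)
print("processed positions:", sorted(selected))
print("layer output:", out)
```

Because the number of processed tokens per layer is fixed by the capacity ratio, tensor shapes stay static, which is what makes this easier to batch than early exit.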
Q45: [Anthropic/OpenAI] Explain the difference between DPO and PPO for LLM alignment.
Model Answer: "Both align LLMs with human preferences, but differ in approach:
PPO (Proximal Policy Optimization): Three-stage process - SFT, train reward model on preference data, then RL to maximize reward while staying close to SFT policy (KL penalty). Requires training and maintaining a separate reward model. More flexible - can optimize arbitrary reward functions.
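DPO (Direct Preference Optimization): Shows that the optimal RL solution has a closed-form expression relating the policy to the reward. This means you can directly optimize the policy from preference data without training a reward model. The loss: $\mathcal{L}_{\text{DPO}} = -\mathbb{E}_{(x, y_w, y_l)}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$ where $y_w$ is the preferred response and $y_l$ is the dispreferred response.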
Tradeoffs:
- DPO: simpler (no reward model, no RL), more stable training, but limited to pairwise preferences, potentially less flexible
- PPO: more complex but can optimize for multi-dimensional rewards, can use iterative online data collection, handles reward hacking better with the explicit reward model
- In practice, DPO is used more often due to simplicity, but top labs still use PPO or variants for frontier models"
Scoring: Strong Hire = DPO derivation insight + loss formula + tradeoffs + when each is preferred. Lean Hire = knows both exist but cannot compare. No Hire = cannot explain either.
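The DPO loss for a single preference pair needs only four sequence log-probabilities. The sketch below assumes those log-probabilities have already been summed over response tokens. The `beta` default and the example numbers are illustrative.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigmoid(beta * [(logp_w - ref_logp_w) - (logp_l - ref_logp_l)])."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # softplus(-margin) equals -log(sigmoid(margin)); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Policy identical to the reference: zero margin, loss is log(2).
loss_at_init = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Policy raised the preferred response and lowered the dispreferred one.
loss_improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)

# Policy moved the wrong way.
loss_worse = dpo_loss(-12.0, -10.0, -10.0, -12.0)

print(f"at init:  {loss_at_init:.4f}")
print(f"improved: {loss_improved:.4f}")
print(f"worse:    {loss_worse:.4f}")
```

Only the difference between policy and reference log-ratios matters, so the reference model acts as the implicit KL anchor that PPO enforces with an explicit penalty.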
Section 5 - Quick-Fire Questions (30 Seconds Each)
These test rapid recall. Give a 1-2 sentence answer for each.
QF1: What is the universal approximation theorem? A neural network with one hidden layer and non-linear activation can approximate any continuous function on a compact domain to arbitrary precision, given enough neurons. It guarantees existence, not efficiency - you may need impractically many neurons.
QF2: What is a residual connection? $y = F(x) + x$ - adding the input directly to the output. Enables gradient flow in deep networks and allows layers to learn residual functions rather than full mappings.
QF3: What is positional encoding in transformers? Since self-attention is permutation-equivariant (it has no built-in notion of token order), positional encodings inject position information. Sinusoidal (original), learned (GPT-2), or rotary (RoPE, Llama) encodings. RoPE encodes relative position through rotation of Q and K vectors.
QF4: What is the softmax temperature? Dividing logits by $T$ before softmax: $p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$. $T > 1$ flattens the distribution (more random), $T < 1$ sharpens it (more deterministic).
QF5: What is beam search? A decoding strategy that maintains the top-B candidates at each generation step. Broader than greedy (B=1) but not as diverse as sampling. Commonly used for translation but not for open-ended generation.
QF6: What is gradient accumulation? Summing gradients over multiple mini-batches before updating weights. Simulates a larger batch size when GPU memory is limited. Update every K steps = effective batch size K times larger.
QF7: What is a skip connection vs a residual connection? Same concept in most contexts. In ResNets, residual connection means $y = F(x) + x$. In U-Nets, skip connections concatenate features from encoder to decoder. The term depends on the architecture.
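QF8: What is the KV cache in transformer inference? During autoregressive generation, key and value matrices from previous tokens are cached so they do not need to be recomputed. Memory: $2 \cdot n_{\text{layers}} \cdot n_{\text{kv\_heads}} \cdot d_{\text{head}} \cdot L_{\text{seq}} \cdot B \cdot b_{\text{elem}}$ bytes (the factor 2 covers keys and values; $B$ is the batch size and $b_{\text{elem}}$ the bytes per element). It is often the memory bottleneck.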
QF9: What is model quantization? Reducing weight precision from FP32/FP16 to INT8/INT4. Reduces model size by 2-8x and speeds up inference. Post-training quantization (GPTQ, AWQ) or quantization-aware training (QAT) for better quality.
QF10: What is the difference between encoder-only, decoder-only, and encoder-decoder models? Encoder-only (BERT): bidirectional, for classification/embedding. Decoder-only (GPT, Llama): causal, for generation. Encoder-decoder (T5, BART): bidirectional encoder + causal decoder, for seq-to-seq tasks.
QF11: What is tokenization in NLP? Converting text to integer sequences. BPE (Byte-Pair Encoding) starts with characters and iteratively merges frequent pairs. SentencePiece is a common implementation. Typical vocabulary: 32K-128K tokens. Subword tokenization handles unknown words.
QF12: What is the difference between pre-training and fine-tuning? Pre-training: learn general representations on large unlabeled data (next token, MLM). Fine-tuning: adapt to a specific task on smaller labeled data. Pre-training is expensive (weeks on hundreds of GPUs), fine-tuning is cheap (hours on 1-8 GPUs).
QF13: What is causal masking? In decoder transformers, a triangular mask prevents tokens from attending to future positions. Ensures autoregressive generation - each token only depends on previous tokens. Applied by setting the attention logits at future positions to $-\infty$ before softmax.
QF14: What is perplexity? $\mathrm{PPL} = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log p(x_i \mid x_{<i})\right)$ - the exponential of the average negative log-likelihood. Lower is better. PPL of 10 means the model is "as confused as if it were choosing between 10 equally likely options."
QF15: What is the curse of dimensionality? In high-dimensional spaces, data points become equidistant, volume concentrates near the surface of hyperspheres, and the amount of data needed to cover the space grows exponentially. Makes distance-based methods unreliable.
QF16: What is an embedding layer? A lookup table that maps discrete tokens to continuous vectors. Equivalent to a linear layer with one-hot input. Learned during training to capture semantic relationships (similar tokens have similar embeddings).
QF17: What is attention masking? Selectively preventing attention between certain positions. Causal mask (autoregressive), padding mask (ignore padding tokens), or custom masks (prefix LM, sliding window). Applied by adding $-\infty$ (in practice a large negative value) to the masked attention logits before softmax.
QF18: What is weight decay? Adding $\frac{\lambda}{2} \lVert w \rVert^2$ to the loss (L2 regularization) or, equivalently for plain SGD, multiplying weights by $(1 - \eta\lambda)$ each step. In AdamW, weight decay is decoupled from the adaptive learning rate, which is mathematically different from L2 regularization with Adam.
QF19: What is the lottery ticket hypothesis? Dense neural networks contain sparse subnetworks (winning tickets) that, when trained in isolation from the original initialization, achieve comparable performance. Suggests that sparse networks could in principle be trained from the start, if the winning ticket could be identified early.
QF20: What is contrastive learning? Learning representations by pulling positive pairs (augmented views of same data) together and pushing negative pairs apart in embedding space. SimCLR, MoCo, and CLIP are key examples. CLIP aligns images and text in a shared space.
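Two of the quick-fire formulas, temperature-scaled softmax (QF4) and perplexity (QF14), are short enough to verify by hand. The sketch below uses natural logarithms and made-up logits. It also confirms the "10 equally likely options" reading of a perplexity of 10.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / T, computed in a numerically stable way."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def perplexity(token_log_probs):
    """exp of the average negative log-likelihood (natural log)."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, temperature=0.5)
flat = softmax_with_temperature(logits, temperature=2.0)

# A model that assigns probability 0.1 to every observed token.
ppl_uniform_10 = perplexity([math.log(0.1)] * 5)

print("T=0.5:", [round(p, 3) for p in sharp])
print("T=2.0:", [round(p, 3) for p in flat])
print(f"perplexity: {ppl_uniform_10:.1f}")
```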
Cross-Reference Index
| Topic | Detailed Page |
|---|---|
| Backpropagation and gradient flow | 01 - Backpropagation |
| Activation functions (ReLU, GELU, etc.) | 02 - Activation Functions |
| Convolutional Neural Networks | 03 - CNNs |
| RNNs, LSTMs, GRUs | 04 - RNNs & LSTMs |
| Attention mechanism | 05 - Attention Mechanism |
| Transformer architecture | 06 - Transformer Architecture |
| Normalization (BatchNorm, LayerNorm, RMSNorm) | 07 - Normalization |
| Training techniques (init, clipping, mixed precision, distillation) | 08 - Training Techniques |
| Distributed training (parallelism, ZeRO, scaling laws) | 09 - Distributed Training |
| Generative models (VAE, GAN, diffusion) | 10 - Generative Models |
Interview Cheat Sheet
| Round Type | Questions | Time Per Q | Depth Expected |
|---|---|---|---|
| Phone Screen | 5-8 | 2-3 min | Definition + intuition + one example |
| Technical Deep Dive | 3-5 | 8-12 min | Full explanation + derivation + tradeoffs + follow-ups |
| Senior/Staff | 2-3 | 15-20 min | System design + mathematical depth + production considerations |
| Quick-Fire | 15-20 | 15-30 sec | Crisp 1-2 sentence answer |
Answer Framework for Every Question
- WHAT: Define the concept (1-2 sentences)
- WHY: Explain the intuition - why does this work or matter? (2-3 sentences)
- HOW: Mathematical formulation or algorithm (show equations if relevant)
- WHEN: When to use it and when NOT to (specific scenarios)
- TRADE-OFFS: Limitations, alternatives, what you would consider in production
Spaced Repetition Checkpoints
Day 0 - Initial Learning
- Answer all 15 screening questions aloud (time yourself: 2 min each)
- Grade yourself on each - identify any "No Hire" areas
- Answer all 20 quick-fire questions in under 5 minutes total
- Read the detailed pages for any topics where you scored "No Hire"
Day 3 - Recall
- Without looking, answer Q1-Q10 aloud again
- Attempt two technical deep dive questions (Q16-Q25)
- Review your weakest 3 topics from Day 0
- Practice the quick-fire round again - target under 30 seconds each
Day 7 - Application
- Answer all screening questions without preparation
- Attempt Q26 (training infrastructure design) with full system thinking
- Practice answering with follow-up questions (have a friend probe deeper)
- Score yourself against the rubrics
Day 14 - Integration
- Do a mock interview: 5 random questions from any section, 45 minutes total
- Practice company-specific questions for your target companies (Q33-Q45)
- Attempt staff-level questions Q26-Q32
- Identify gaps in your knowledge and fill them with the detailed pages
Day 21 - Mastery
- Full mock interview with all difficulty levels mixed
- Can you answer any question in this bank confidently?
- Can you handle 2-3 levels of follow-up on each answer?
- Practice the NaN debugging question (Q30) and infrastructure design (Q26) end-to-end in under 15 minutes each
