Deep Learning Interview Questions Bank - Your Complete Preparation Guide

Reading time: ~35 min | Interview relevance: Critical | Roles: MLE, AI Engineer, Research Engineer, Applied Scientist, Computer Vision Engineer, NLP Engineer

The Real Interview Moment

You are 20 minutes into a Meta MLE on-site. The interviewer has asked about backpropagation, batch normalization, and attention mechanisms - you handled them well. Then she pulls up a whiteboard and says: "You are training a 7B parameter language model. After 10K steps, the loss suddenly spikes to NaN and never recovers. Walk me through your debugging process, step by step."

This question synthesizes knowledge across initialization, gradient management, mixed precision, normalization, and distributed training. The interviewer is testing whether you can integrate deep learning fundamentals into a coherent debugging workflow. Candidates who list random ideas ("try a lower learning rate?") get a "no hire." Candidates who present a systematic investigation - check gradient norms, check for FP16 overflow, inspect activation statistics by layer, examine the data batch that caused the spike - get a "strong hire."

The questions in this bank build toward exactly this level of integration.

What You Will Master

65 questions spanning all deep learning topics in this section
Structured model answers with the depth expected at each level
Company-specific question patterns (Google, Meta, Amazon, Apple, OpenAI, Anthropic)
Quick-fire rapid response format for screening rounds
Cross-references to detailed explanations in other pages

How to Use This Question Bank

Practice method:

Set a timer for 2 minutes per question (screening) or 5 minutes (deep dive)
Answer aloud as if speaking to an interviewer
Check your answer against the model answer
Grade yourself: Strong Hire / Lean Hire / No Hire
For any "No Hire," study the linked page before retrying

Section 1 - Screening Questions (Phone Screen Level)

These questions are asked in 30-minute phone screens. Expect 5-8 questions. Each answer should take 1-2 minutes. The interviewer wants crisp, accurate definitions with intuition.

Q1: Explain backpropagation in plain terms.

Company Variation

Asked at virtually every company. Google expects the chain rule derivation. Amazon is satisfied with intuition. OpenAI and Anthropic expect you to connect it to computational graphs.

Model Answer: "Backpropagation is an algorithm for computing gradients of the loss with respect to every parameter in a neural network. It works by applying the chain rule recursively. During the forward pass, we compute and cache intermediate activations. During the backward pass, we start from the loss and propagate gradients backward through the network - each layer computes its local gradient and multiplies it by the incoming gradient from the layer above. This gives us $\partial L / \partial w$ for every weight $w$ . Computationally, it is just chain rule applied to a computational graph - each node computes its local Jacobian, and we multiply along paths. The total compute for the backward pass is approximately 2x the forward pass."

Scoring: Strong Hire = chain rule + computational graph + forward/backward distinction + compute cost. Lean Hire = correct intuition but cannot explain the chain rule. No Hire = confuses backprop with gradient descent.

Deep dive: Backpropagation

Q2: Why do we use non-linear activation functions?

Model Answer: "Without non-linear activations, a neural network of any depth is equivalent to a single linear transformation - stacking linear layers gives $W_n \cdot W_{n-1} \cdots W_1 \cdot x = W' \cdot x$ . Non-linearities break this collapse and give the network the ability to approximate any continuous function on a compact domain (universal approximation theorem). ReLU is the default choice: $\max(0, x)$ - it is simple, fast, and mitigates the vanishing gradient problem that plagued sigmoid/tanh because its gradient is exactly 1 for positive inputs. Its downside is 'dying ReLU', where neurons stuck at zero never recover, which Leaky ReLU and GELU address."

Scoring: Strong Hire = linear collapse argument + universal approximation + specific activations with tradeoffs. Lean Hire = knows non-linearity is needed but cannot explain the linear collapse. No Hire = cannot explain why linearity is insufficient.

Deep dive: Activation Functions

Q3: What is the vanishing gradient problem and how do you solve it?

Model Answer: "In deep networks, gradients are multiplied through layers during backpropagation. If the Jacobian of each layer has spectral norm less than 1, gradients shrink exponentially - after 50 layers they are effectively zero. Early layers stop learning. This is caused by saturating activations (sigmoid/tanh) and poor initialization. Solutions: (1) ReLU activations - gradient is 1 for positive inputs, (2) residual connections - provide gradient shortcuts that bypass layers, (3) proper initialization - He for ReLU, Xavier for tanh, (4) normalization layers - BatchNorm, LayerNorm keep activations in a good range, (5) gradient clipping for the exploding gradient variant."

Scoring: Strong Hire = cause + 4+ solutions with why each works. Lean Hire = knows the problem but only mentions 1-2 solutions. No Hire = confuses vanishing gradients with underfitting.

Deep dive: Backpropagation

Q4: Explain the difference between BatchNorm and LayerNorm. When do you use each?

Model Answer: "Both normalize activations but along different dimensions. BatchNorm normalizes across the batch dimension - for each feature, it computes mean and variance across all samples in the mini-batch. LayerNorm normalizes across the feature dimension - for each sample, it computes mean and variance across all features. BatchNorm depends on batch statistics, which makes it problematic for small batches, variable-length sequences, and inference (requires running statistics). LayerNorm is independent of batch size. Rule of thumb: BatchNorm for CNNs (spatial statistics are meaningful), LayerNorm for transformers (sequence length varies, batch independence is important). RMSNorm is a simpler variant of LayerNorm that skips the mean centering - used in Llama and other modern LLMs."
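The difference in normalization axis can be made concrete with a minimal sketch. It uses only the Python standard library and a made-up 3 x 4 toy batch; the learned scale/shift parameters and BatchNorm's running statistics are deliberately omitted, and the function names are illustrative rather than any framework's API.

```python
import math

def normalize(values, eps=1e-5):
    """Shift and scale a list of numbers to zero mean and unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def batch_norm(x):
    """Normalize each feature (column) across the samples in the batch."""
    columns = [normalize(list(col)) for col in zip(*x)]
    return [list(row) for row in zip(*columns)]

def layer_norm(x):
    """Normalize each sample (row) across its own features."""
    return [normalize(row) for row in x]

# Toy batch: 3 samples x 4 features (values are made up for illustration).
batch = [[1.0, 2.0, 3.0, 4.0],
         [2.0, 4.0, 6.0, 8.0],
         [0.0, 1.0, 0.0, 1.0]]

bn = batch_norm(batch)   # each column now has zero mean
ln = layer_norm(batch)   # each row now has zero mean
```

Note that `layer_norm` gives the same result for a sample whether it is processed alone or in a batch, which is the batch-independence property the answer relies on.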

Scoring: Strong Hire = normalization dimension difference + when each fails + RMSNorm. Lean Hire = knows the difference but not the practical implications. No Hire = confuses the normalization dimensions.

Deep dive: Normalization

Q5: What is the attention mechanism? Why is it important?

Model Answer: "Attention computes a weighted sum of value vectors, where weights are determined by the similarity between a query and key vectors: $\text{Attention}(Q, K, V) = \text{softmax}(QK^T / \sqrt{d_k}) V$ . The $\sqrt{d_k}$ scaling prevents softmax saturation with large dimensions. Attention is important because it allows every position in a sequence to directly attend to every other position - solving the long-range dependency problem that plagued RNNs. Multi-head attention runs multiple attention operations in parallel with different learned projections, capturing different types of relationships. Self-attention (Q, K, V all come from the same sequence) is the building block of transformers."

Scoring: Strong Hire = full formula + scaling explanation + multi-head + self-attention. Lean Hire = knows the concept but cannot write the formula. No Hire = cannot explain attention beyond "it helps the model focus."

Deep dive: Attention Mechanism

Q6: What is a transformer? Walk me through its architecture.

Model Answer: "A transformer is a sequence-to-sequence architecture built entirely on attention - no recurrence or convolution. The encoder has N identical blocks, each with multi-head self-attention followed by a feed-forward network, with residual connections and layer normalization around each. The decoder adds cross-attention to attend to the encoder's output and uses masked self-attention to prevent attending to future tokens. Modern LLMs (GPT, Llama) use decoder-only transformers - just the decoder stack with causal masking. Encoder-only (BERT) is used for understanding tasks. The key innovations: positional encodings (sinusoidal or learned) since attention is permutation-invariant, and the ability to process all positions in parallel (unlike RNNs)."

Scoring: Strong Hire = encoder/decoder structure + attention types + residual/norm + modern variants (decoder-only vs encoder-only). Lean Hire = knows the components but cannot explain the information flow. No Hire = cannot describe the architecture beyond "it uses attention."

Deep dive: Transformer Architecture

Q7: What is the difference between a CNN and an RNN? When do you use each?

Model Answer: "CNNs process data with spatial structure (images) using local receptive fields (convolution kernels) that share weights across positions. They capture translation invariance and build hierarchical features from local to global. RNNs process sequential data (text, time series) by maintaining a hidden state that is updated at each time step, capturing temporal dependencies. In practice, transformers have largely replaced RNNs for sequence tasks because RNNs have the vanishing gradient problem for long sequences and cannot be parallelized during training. CNNs remain dominant for low-level vision tasks but Vision Transformers (ViTs) are competitive for image classification when you have enough data."

Scoring: Strong Hire = key architectural differences + inductive biases + modern landscape (transformers replacing both). Lean Hire = knows the difference but does not discuss when transformers replaced them. No Hire = cannot articulate the structural difference.

Deep dive: CNNs, RNNs & LSTMs

Q8: What is dropout and why does it work?

Model Answer: "Dropout randomly sets a fraction of activations to zero during training (typically 10-50%). At inference, all neurons are active; in the original formulation outputs are scaled by the keep probability, while most frameworks use the equivalent 'inverted dropout', which scales surviving activations by 1/(keep probability) during training so inference needs no change. It works as regularization by preventing co-adaptation - neurons cannot rely on specific other neurons being present, forcing each to learn independently useful features. There is a Bayesian interpretation: dropout training approximately performs variational inference over the weights, and averaging over dropout masks at inference approximates a Bayesian model average. In practice, dropout is common in fully connected layers, rarely used in convolutional layers (spatial dropout is preferred there), and often reduced or disabled in large-scale LLM pretraining, even though the original transformer used a dropout rate of 0.1."
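A minimal sketch of the mechanism, written in the "inverted" form (rescale the survivors during training, do nothing at inference), which is mathematically equivalent in expectation to scaling by the keep probability at inference. The function name and the 30% drop rate are assumptions chosen for illustration.

```python
import random

def dropout(activations, p_drop, training, rng):
    """Inverted dropout: zero with probability p_drop, rescale survivors."""
    if not training or p_drop == 0.0:
        return list(activations)          # inference: identity
    keep = 1.0 - p_drop
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
acts = [1.0] * 10_000
train_out = dropout(acts, p_drop=0.3, training=True, rng=rng)
eval_out = dropout(acts, p_drop=0.3, training=False, rng=rng)

# The expected activation is preserved during training despite the zeros.
train_mean = sum(train_out) / len(train_out)
```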

Scoring: Strong Hire = mechanism + co-adaptation + Bayesian interpretation + where it is not used. Lean Hire = knows the mechanism but not the why. No Hire = confuses dropout with pruning.

Deep dive: Activation Functions

Q9: He initialization vs Xavier initialization - when do you use each?

Model Answer: "Both control the variance of weight initialization to prevent exploding/vanishing activations. Xavier sets variance to $2/(n_{in} + n_{out})$ , assuming linear activations - it preserves variance in both forward and backward passes. He sets variance to $2/n_{in}$ , accounting for ReLU zeroing out half the activations. Use Xavier for tanh/sigmoid, He for ReLU/LeakyReLU. Using Xavier with ReLU causes the activation variance to shrink by roughly half per layer - after 50 layers the variance would be about $2^{-50}$ times its initial value (a factor of $2^{-25}$ in magnitude). For transformers with LayerNorm, the choice matters less because normalization stabilizes variance regardless."
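The halving effect can be checked numerically. The sketch below pushes a random vector through a bias-free ReLU MLP and compares the mean squared pre-activation at the last layer with that at the first. The width (128), depth (20), and function names are assumptions for illustration; both runs share a seed, so they differ only in the weight scale.

```python
import math
import random

def relu_forward_ratio(weight_var_fn, width=128, depth=20, seed=0):
    """Return mean(z_L^2) / mean(z_1^2) for a random bias-free ReLU MLP."""
    rng = random.Random(seed)
    std = math.sqrt(weight_var_fn(width, width))
    a = [rng.gauss(0.0, 1.0) for _ in range(width)]   # random input vector
    first = last = None
    for _ in range(depth):
        # Each output unit is a dot product with a freshly sampled weight row.
        z = [sum(rng.gauss(0.0, std) * a_j for a_j in a) for _ in range(width)]
        last = sum(v * v for v in z) / width
        if first is None:
            first = last
        a = [max(0.0, v) for v in z]                  # ReLU
    return last / first

def xavier_var(n_in, n_out):
    return 2.0 / (n_in + n_out)

def he_var(n_in, n_out):
    return 2.0 / n_in

xavier_ratio = relu_forward_ratio(xavier_var)   # shrinks by ~2x per layer
he_ratio = relu_forward_ratio(he_var)           # stays within the same order of magnitude
```

With equal fan-in and fan-out, the Xavier variance is exactly half the He variance, so across the 19 layer transitions the Xavier ratio is smaller than the He ratio by a factor of $2^{19}$.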

Scoring: Strong Hire = both formulas + why they differ (ReLU halving) + when normalization makes the choice less critical. Lean Hire = knows which to use when but not the derivation. No Hire = cannot distinguish them.

Deep dive: Training Techniques

Q10: What is mixed precision training?

Model Answer: "Mixed precision uses both FP16 (or BF16) and FP32 during training. Forward and backward passes use FP16/BF16 for speed (2-8x on Tensor Cores) and memory savings (half the bytes per value). Master weights and optimizer states stay in FP32 because weight updates are tiny relative to the weights and would be rounded away if accumulated in FP16. With FP16, you also need loss scaling - multiply the loss by a large constant to shift gradients into FP16's representable range, then unscale after backprop. BF16 has the same range as FP32 (8-bit exponent) so it does not need loss scaling, making it the preferred choice on A100+ GPUs. Certain operations - softmax, LayerNorm, loss computation - should stay in FP32 for numerical stability."
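Loss scaling is easy to demonstrate without a GPU, because Python's `struct` module can round a float to IEEE half precision with the `"e"` format. The gradient value and the scale factor below are illustrative assumptions, not values from any particular training run.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

grad = 1e-8             # a tiny gradient value (illustrative)
loss_scale = 65536.0    # hypothetical static loss scale (2**16)

unscaled_fp16 = to_fp16(grad)               # below FP16's smallest subnormal: becomes 0.0
scaled_fp16 = to_fp16(grad * loss_scale)    # lands in FP16's normal range
recovered = scaled_fp16 / loss_scale        # unscale in higher precision
```

Scaling the loss scales every gradient by the same factor through the chain rule, so dividing by the scale afterwards recovers the original value up to FP16 rounding error.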

Scoring: Strong Hire = why mixed (not just FP16) + master weights in FP32 + loss scaling + BF16 advantage + which ops need FP32. Lean Hire = knows it uses lower precision for speed but misses the details. No Hire = says "just use FP16 everywhere."

Deep dive: Training Techniques

Q11: What is knowledge distillation?

Model Answer: "Knowledge distillation trains a small student model to mimic a large teacher model. Instead of hard labels, the student learns from the teacher's soft probability distribution, which contains rich inter-class structure ('dark knowledge'). The temperature parameter T softens the distribution - higher T reveals more about class similarities. The loss combines KL divergence between temperature-scaled teacher and student distributions (multiplied by $T^2$ to keep gradient magnitudes comparable across temperatures) with standard cross-entropy on hard labels. Typical results: 3-10x model compression with 1-3% accuracy loss. Modern example: DistilBERT is a distilled version of BERT - 40% smaller, 60% faster, retaining 97% of performance."
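The combined loss can be sketched in a few lines. The mixing weight `alpha`, the default temperature, and the toy logits are assumptions chosen for illustration; real implementations operate on batched tensors.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax with the usual max-subtraction for stability."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, label, T=4.0, alpha=0.5):
    """alpha * T^2 * KL(teacher_T || student_T) + (1 - alpha) * CE(label)."""
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student))
    ce = -math.log(softmax(student_logits)[label])   # hard-label term uses T = 1
    return alpha * (T ** 2) * kl + (1.0 - alpha) * ce

teacher = [5.0, 2.0, -1.0]   # toy logits
student = [3.0, 1.0, 0.5]
loss = distillation_loss(student, teacher, label=0)
```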

Scoring: Strong Hire = soft labels + temperature + $T^2$ factor + practical results. Lean Hire = knows the concept but cannot explain temperature or the loss. No Hire = confuses with model pruning or quantization.

Deep dive: Training Techniques

Q12: Explain data parallelism vs model parallelism.

Model Answer: "Data parallelism replicates the entire model on each GPU and splits the data. Each GPU processes a different mini-batch, then gradients are synchronized via AllReduce. It scales throughput but does not reduce per-GPU model memory. Model parallelism splits the model across GPUs: tensor parallelism splits individual layers (requires high-bandwidth NVLink, used within a node), pipeline parallelism assigns layer groups to different GPUs (tolerates lower bandwidth, used across nodes). For a 70B model that does not fit on one GPU, you need model parallelism. For a 1B model on 8 GPUs, data parallelism is sufficient. Large-scale training combines all three - tensor within node, pipeline across nodes, data across replica groups."

Scoring: Strong Hire = all three types + when each is needed + NVLink/bandwidth considerations. Lean Hire = knows data vs model parallelism but not tensor vs pipeline. No Hire = thinks data parallelism reduces model memory.

Deep dive: Distributed Training

Q13: What is the ELBO in VAEs?

Model Answer: "The ELBO (Evidence Lower Bound) is the VAE training objective - a lower bound on the log-likelihood $\log p(x)$ . It decomposes into: reconstruction loss ( $\mathbb{E}_{q(z|x)}[\log p(x|z)]$ ) which encourages the decoder to reconstruct the input, and a KL divergence ( $\text{KL}(q(z|x) \| p(z))$ ) which regularizes the encoder's posterior to be close to the prior (standard Gaussian). The gap between the ELBO and the true log-likelihood is exactly $\text{KL}(q(z|x) \| p(z|x))$ - how well the encoder approximates the true posterior. Maximizing the ELBO simultaneously improves reconstruction and tightens this gap."

Scoring: Strong Hire = both terms + gap interpretation + direction of KL. Lean Hire = knows the two terms but not the gap. No Hire = cannot state the ELBO.

Deep dive: Generative Models

Q14: Why do GANs suffer from mode collapse?

Model Answer: "Mode collapse happens when the generator produces only a few outputs that fool the discriminator, ignoring the diversity of the real distribution. It occurs because the minimax objective does not explicitly encourage diversity - the generator can minimize its loss by finding one mode that maximally confuses the discriminator. The JSD-based objective makes this worse: when real and generated distributions do not overlap (common in high dimensions), JSD is constant and provides zero gradient, so the generator has no signal to explore new modes. Solutions include WGAN (Wasserstein distance gives meaningful gradients everywhere), spectral normalization (constrains discriminator strength), and minibatch discrimination (discriminator penalizes low diversity). Diffusion models avoid this entirely because their objective is per-sample regression, not an adversarial game."

Scoring: Strong Hire = cause (adversarial + JSD gradient issue) + 3+ solutions + why diffusion avoids it. Lean Hire = knows mode collapse exists but only 1 solution. No Hire = cannot explain mode collapse.

Deep dive: Generative Models

Q15: What is gradient clipping and when do you need it?

Model Answer: "Gradient clipping caps gradient magnitudes to prevent exploding gradients. The standard approach is clip-by-global-norm: compute the L2 norm of all gradients concatenated, and if it exceeds a threshold (typically 1.0), scale all gradients by threshold/norm. This preserves gradient direction while capping magnitude. It is essential for transformer training (always used), critical for RNN/LSTM training (exploding gradients through time), and helpful for any very deep network. It does NOT help with vanishing gradients - only exploding gradients. Clip-by-value (capping each element) is rarely used because it changes the gradient direction."
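Clip-by-global-norm as described above, in a minimal sketch. Gradients are represented as a list of flat lists (one per parameter tensor); the small epsilon in the denominator is an assumption added to avoid division by zero.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0):
    """grads: list of per-parameter gradient lists. Returns (clipped, norm)."""
    total_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    # Scale is 1.0 when the norm is already within the threshold.
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [[g * scale for g in tensor] for tensor in grads], total_norm

grads = [[3.0, 0.0], [0.0, 4.0]]   # global L2 norm is 5.0
clipped, norm = clip_by_global_norm(grads, max_norm=1.0)
```

Because every element is multiplied by the same scalar, the ratio between any two gradient entries is unchanged, which is what "preserves direction" means here.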

Scoring: Strong Hire = clip-by-norm formula + direction preservation + when needed + does not help vanishing. Lean Hire = knows what it does but not the implementation. No Hire = confuses with gradient scaling or normalization.

Deep dive: Training Techniques

Section 2 - Technical Deep Dive (On-Site Level)

These questions are asked in 45-60 minute technical interviews. Expect 3-5 questions with deep follow-ups. The interviewer wants mathematical depth, derivations, and practical implications.

Q16: Derive the backpropagation equations for a two-layer neural network.

Company Variation

Google and DeepMind expect full derivation on the whiteboard. Meta wants the derivation plus practical implications. OpenAI and Anthropic may ask you to extend it to attention or custom layers.

Model Answer:

Network: $z_1 = W_1 x + b_1$ , $a_1 = \sigma(z_1)$ , $z_2 = W_2 a_1 + b_2$ , $\hat{y} = \text{softmax}(z_2)$ , $L = -\sum y_i \log \hat{y}_i$ .

Output layer: $\partial L / \partial z_2 = \hat{y} - y$ (softmax + cross-entropy simplification).

Output layer weights: $\partial L / \partial W_2 = (\hat{y} - y) a_1^T$ . $\partial L / \partial b_2 = \hat{y} - y$ .

Propagate to hidden layer: $\partial L / \partial a_1 = W_2^T (\hat{y} - y)$ . $\partial L / \partial z_1 = \partial L / \partial a_1 \odot \sigma'(z_1)$ .

First layer weights: $\partial L / \partial W_1 = (\partial L / \partial z_1) x^T$ . $\partial L / \partial b_1 = \partial L / \partial z_1$ .

The pattern: at each layer, gradient = upstream gradient times local Jacobian. Each weight gradient is the outer product of the upstream gradient and the layer's input.
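A quick way to validate a derivation like this is a finite-difference check. The sketch below implements the equations above for a tiny network, taking $\sigma$ to be the sigmoid (an assumption; the derivation holds for any differentiable activation), and compares one entry of $\partial L / \partial W_1$ with a numerical estimate. All weights and inputs are arbitrary toy values.

```python
import math

def forward(W1, b1, W2, b2, x, y):
    z1 = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
    a1 = [1.0 / (1.0 + math.exp(-z)) for z in z1]            # sigmoid
    z2 = [sum(w * ai for w, ai in zip(row, a1)) + b for row, b in zip(W2, b2)]
    m = max(z2)
    exps = [math.exp(z - m) for z in z2]
    total = sum(exps)
    y_hat = [e / total for e in exps]                         # softmax
    loss = -sum(yi * math.log(pi) for yi, pi in zip(y, y_hat))
    return a1, y_hat, loss

def backward(W2, x, y, a1, y_hat):
    dz2 = [p - t for p, t in zip(y_hat, y)]                   # y_hat - y
    dW2 = [[d * a for a in a1] for d in dz2]                  # outer(dz2, a1)
    da1 = [sum(W2[k][j] * dz2[k] for k in range(len(dz2)))    # W2^T dz2
           for j in range(len(a1))]
    dz1 = [d * a * (1.0 - a) for d, a in zip(da1, a1)]        # sigma'(z1) = a(1 - a)
    dW1 = [[d * xi for xi in x] for d in dz1]                 # outer(dz1, x)
    return dW1, dz1, dW2, dz2                                 # db1 = dz1, db2 = dz2

x, y = [0.5, -1.0], [0.0, 1.0]
W1, b1 = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]], [0.0, 0.1, -0.1]
W2, b2 = [[0.3, -0.1, 0.2], [-0.4, 0.5, 0.1]], [0.05, -0.05]

a1, y_hat, loss = forward(W1, b1, W2, b2, x, y)
dW1, db1, dW2, db2 = backward(W2, x, y, a1, y_hat)

# Central finite difference on one entry of W1 as a sanity check.
eps = 1e-6
W1[2][0] += eps
loss_plus = forward(W1, b1, W2, b2, x, y)[2]
W1[2][0] -= 2 * eps
loss_minus = forward(W1, b1, W2, b2, x, y)[2]
W1[2][0] += eps                                               # restore
numeric = (loss_plus - loss_minus) / (2 * eps)
```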

Scoring: Strong Hire = complete derivation with correct dimensions, softmax-CE simplification, general pattern articulated. Lean Hire = mostly correct but makes a dimensional error. No Hire = cannot set up the chain rule.

Deep dive: Backpropagation

Q17: Explain the self-attention computation step by step, including the complexity and how to reduce it.

Model Answer:

Given input $X \in \mathbb{R}^{n \times d}$ , compute $Q = XW_Q$ , $K = XW_K$ , $V = XW_V$ where $W_Q, W_K, W_V \in \mathbb{R}^{d \times d_k}$ .

Attention scores: $A = QK^T / \sqrt{d_k}$ - this is $n \times n$ matrix. Apply softmax row-wise: $\hat{A} = \text{softmax}(A)$ . Output: $O = \hat{A} V$ - weighted combination of values.

Complexity: $O(n^2 d)$ time for the $QK^T$ computation (and the same again for $\hat{A} V$ ). Memory: $O(n^2)$ for the attention matrix. For $n = 100\text{K}$ tokens, this is 10 billion entries per head - impractical to materialize.

Reduction methods: (1) FlashAttention - exact attention but tiled computation, O(n) memory via recomputation, no approximation. (2) Sparse attention - attend to local windows + global tokens. (3) Linear attention - kernel approximation to avoid materializing $n \times n$ matrix. (4) Grouped-query attention (GQA) - share K,V heads across multiple Q heads, reducing KV cache size.
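The basic computation (before any of the reduction methods) can be sketched directly from the formulas. This is a single-head version on plain Python lists, taking already-projected $Q$, $K$, $V$ as inputs; the toy values are arbitrary.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for one head; Q, K, V are lists of rows."""
    d_k = len(K[0])
    # n x n score matrix: this is the O(n^2) object the text refers to.
    scores = [[sum(q * k for q, k in zip(q_row, k_row)) / math.sqrt(d_k)
               for k_row in K] for q_row in Q]
    weights = [softmax(row) for row in scores]            # row-wise softmax
    out = [[sum(w * v_row[j] for w, v_row in zip(w_row, V))
            for j in range(len(V[0]))] for w_row in weights]
    return out, weights

# Toy inputs: n = 3 tokens, d_k = 2.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, weights = attention(Q, K, V)
```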

Scoring: Strong Hire = full computation + complexity + FlashAttention + GQA. Lean Hire = correct computation but cannot discuss efficiency. No Hire = cannot compute the attention output.

Deep dive: Attention Mechanism, Transformer Architecture

Q18: You are training a model and observe the loss curve plateau after a few epochs. Walk me through your debugging process.

Model Answer:

"I would investigate systematically:

Step 1: Check if it is a learning rate issue. Is the plateau at a high or low loss? High loss plateau = underfitting, model capacity may be insufficient, or learning rate too low to escape initial basin. Low loss plateau = may be near convergence, or stuck in a local minimum.

Step 2: Examine gradient statistics. Are gradients near zero? (vanishing gradients - check initialization, activation functions). Are gradients oscillating? (learning rate too high). Are specific layers not learning? (check per-layer gradient norms).

Step 3: Data check. Is the data pipeline feeding correctly? (log a batch, verify labels match inputs). Is there label noise causing an irreducible loss floor?

Step 4: Architecture check. Is the model expressive enough? (try a larger model). Are residual connections and normalization in place?

Step 5: Optimizer and schedule. Try a different optimizer (switch Adam to SGD with momentum or vice versa). Try a learning rate warmup. Try cosine annealing with warm restarts, which periodically raises the LR and can help escape plateaus.

Step 6: Regularization. Is dropout or weight decay too strong? (temporarily remove and check if training loss improves).

I would also compare the validation loss with the training loss: if both plateau at a high value, the model is underfitting; if the training loss is already at the floor imposed by label noise, the problem is data quality rather than optimization."

Scoring: Strong Hire = systematic approach covering LR, gradients, data, architecture, optimizer + distinguishes underfitting from convergence. Lean Hire = suggests 2-3 fixes but not systematically. No Hire = only suggests "lower the learning rate."

Q19: Explain how LSTM solves the vanishing gradient problem. Derive the gradient flow through the cell state.

Model Answer:

"LSTM introduces a cell state $c_t$ with additive updates - unlike RNN's multiplicative hidden state updates.

The cell state update: $c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$

Where $f_t$ is the forget gate (sigmoid output in [0,1]) and $i_t$ is the input gate.

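Gradient flow through cell state (direct path only, treating the gates as constants with respect to $c_{t-1}$ ):

$\frac{\partial c_t}{\partial c_{t-1}} = \text{diag}(f_t)$

Over $T$ time steps:

$\frac{\partial c_T}{\partial c_0} = \prod_{t=1}^{T} \text{diag}(f_t)$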

If $f_t \approx 1$ (forget gate nearly open), gradients flow unattenuated - the product is close to 1 regardless of $T$ . This is the key insight: the forget gate can learn to preserve gradients over long sequences.

In contrast, vanilla RNN: $h_t = \tanh(W_h h_{t-1} + W_x x_t)$ , giving $\partial h_t / \partial h_{t-1} = \text{diag}(\tanh'(\cdot)) \cdot W_h$ . If the spectral norm of $W_h$ is less than 1, this product vanishes exponentially.

The LSTM's additive cell state update is analogous to residual connections in deep networks - both provide a gradient highway that bypasses nonlinearities."
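A scalar back-of-the-envelope comparison makes the difference tangible. The forget-gate value, recurrent weight, and pre-activation below are assumed constants chosen for illustration; in a real network they vary per step and per unit.

```python
import math

T = 100
forget_gate = 0.99          # assumed constant forget-gate value
w_h, pre_act = 0.9, 0.5     # assumed scalar recurrent weight and pre-activation

# LSTM cell-state path: product of forget gates.
lstm_grad = forget_gate ** T

# Vanilla RNN path: product of tanh'(.) * w_h at every step.
tanh_prime = 1.0 - math.tanh(pre_act) ** 2
rnn_grad = (tanh_prime * w_h) ** T
```

Under these assumptions the LSTM path retains roughly a third of the gradient after 100 steps, while the vanilla RNN path is many orders of magnitude smaller.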

Scoring: Strong Hire = full gradient derivation, forget gate role, comparison to vanilla RNN, connection to residual networks. Lean Hire = knows LSTM helps but cannot derive gradient flow. No Hire = cannot explain the LSTM gates.

Deep dive: RNNs & LSTMs

Q20: What are ZeRO stages 1, 2, and 3? Calculate memory savings for a 13B model on 8 GPUs.

Model Answer:

"ZeRO eliminates memory redundancy in data parallelism. In standard DDP, every GPU stores: weights (2P bytes in FP16), gradients (2P), and optimizer states (12P for Adam with FP32 master weights) - 16P total.

ZeRO Stage 1: Partition optimizer states. Each GPU stores 1/N of optimizer states. Per GPU: $2P + 2P + 12P/N$ .

ZeRO Stage 2: Also partition gradients. Per GPU: $2P + 2P/N + 12P/N = 2P + 14P/N$ .

ZeRO Stage 3: Also partition weights. Per GPU: $(2P + 2P + 12P)/N = 16P/N$ .

For 13B model, 8 GPUs:

Stage	Per-GPU memory (model states)	Fits on A100-80GB?
DDP	16 x 13 = 208 GB	No
Stage 1	26 + 26 + 19.5 = 71.5 GB	Yes (tight)
Stage 2	26 + 3.25 + 19.5 = 48.75 GB	Yes, comfortably
Stage 3	208/8 = 26 GB	Yes, very comfortably

Stage 3 increases communication volume to roughly 1.5x that of standard DDP (extra all-gathers of the weights during forward and backward). FSDP is PyTorch's implementation of Stage 3-style full sharding."
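The per-stage formulas translate directly into a small calculator. This is a sketch of the arithmetic only: it counts model states (weights, gradients, optimizer states) and ignores activations and buffers, and the function and key names are illustrative.

```python
def zero_memory_gb(params_billion, n_gpus):
    """Per-GPU model-state memory in GB for mixed-precision Adam."""
    weights = 2 * params_billion      # FP16 weights: 2 bytes per parameter
    grads = 2 * params_billion        # FP16 gradients: 2 bytes per parameter
    optim = 12 * params_billion       # FP32 master weights + two Adam moments
    return {
        "ddp": weights + grads + optim,
        "stage1": weights + grads + optim / n_gpus,
        "stage2": weights + (grads + optim) / n_gpus,
        "stage3": (weights + grads + optim) / n_gpus,
    }

mem = zero_memory_gb(params_billion=13, n_gpus=8)
# mem["ddp"] = 208, mem["stage1"] = 71.5, mem["stage2"] = 48.75, mem["stage3"] = 26.0
```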

Scoring: Strong Hire = all three stages with formulas + correct calculation + communication tradeoff + FSDP connection. Lean Hire = knows the stages but cannot calculate memory. No Hire = cannot explain what ZeRO does.

Deep dive: Distributed Training

Q21: Compare the training objectives of VAEs, GANs, and diffusion models.

Model Answer:

"VAE: Maximize the ELBO = reconstruction term - KL term. $\mathbb{E}_{q(z|x)}[\log p(x|z)] - \text{KL}(q(z|x) \| p(z))$ . Lower bound on log-likelihood. Training is stable but outputs can be blurry because the reconstruction loss (MSE) averages over modes.

GAN: Minimax game: $\min_G \max_D \mathbb{E}[\log D(x)] + \mathbb{E}[\log(1-D(G(z)))]$ . Minimizes Jensen-Shannon divergence between real and generated distributions (at the optimal discriminator). Training is unstable - generator and discriminator must be carefully balanced. Mode collapse is common.

Diffusion: Denoising score matching: $\mathbb{E}_{t,x_0,\epsilon}[\|\epsilon - \epsilon_\theta(x_t, t)\|^2]$ . Simple MSE loss predicting added noise. Training is very stable (no adversarial dynamics) and covers all modes (every noise level forces full distribution learning). Slow sampling (hundreds of steps) is the main downside.

Key mathematical distinction: VAE optimizes a bound on likelihood (approximate). GAN optimizes a divergence via an adversary (implicit). Diffusion optimizes a score function via denoising (explicit, per-sample)."

Scoring: Strong Hire = all three objectives with formulas + what each minimizes + stability/quality tradeoffs + mathematical insight about bound vs divergence vs score. Lean Hire = knows all three but cannot state the precise objectives. No Hire = confuses the objectives.

Deep dive: Generative Models

Q22: Explain the Chinchilla scaling laws and how they changed LLM training.

Model Answer:

"Kaplan et al. (2020) found loss scales as power laws with parameters, data, and compute. They concluded that, for a fixed compute budget, most of the extra compute should go into a larger model, with the dataset growing much more slowly. This led to GPT-3 (175B params, 300B tokens).

Hoffmann et al. (2022, Chinchilla) challenged this. They showed the optimal allocation scales parameters and data equally with compute: $N \propto C^{0.5}$ , $D \propto C^{0.5}$ . Rule of thumb: 20 tokens per parameter. By this standard, GPT-3 was severely undertrained - 175B params should have seen ~3.5T tokens, not 300B.

They demonstrated this by training Chinchilla (70B params, 1.4T tokens), which outperformed Gopher (280B params, 300B tokens) at a similar compute budget with 4x fewer parameters.

Modern nuance: Chinchilla optimizes for training compute. But inference cost matters too - a smaller model trained on more data is cheaper to deploy. Llama 3 (70B, 15T tokens) is deliberately 'overtrained' relative to Chinchilla because Meta optimizes for total cost including inference. The compute formula is $C \approx 6ND$ ."
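The numbers quoted in this answer follow from two one-line formulas, sketched here. The helper names are illustrative; the inputs are the parameter and token counts stated above.

```python
def training_flops(n_params, n_tokens):
    """Approximate training compute, C ~ 6 * N * D."""
    return 6 * n_params * n_tokens

def chinchilla_tokens(n_params, tokens_per_param=20):
    """Rule-of-thumb compute-optimal token count."""
    return tokens_per_param * n_params

gpt3_optimal_tokens = chinchilla_tokens(175e9)    # ~3.5e12 tokens
chinchilla_c = training_flops(70e9, 1.4e12)       # ~5.9e23 FLOPs
gopher_c = training_flops(280e9, 300e9)           # ~5.0e23 FLOPs
```

Chinchilla and Gopher come out within about 20% of each other in training compute, which is why the comparison isolates the effect of how that compute is split between parameters and data.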

Scoring: Strong Hire = Kaplan vs Chinchilla distinction + 20 tokens rule + Chinchilla vs Gopher example + inference-aware deviation + 6ND formula. Lean Hire = knows the scaling law exists but cannot state the specific relationship or numbers. No Hire = has not heard of Chinchilla or scaling laws.

Deep dive: Distributed Training

Q23: What is the reparameterization trick and why does it matter?

Model Answer:

"In VAEs, we need to backpropagate through sampling: $z \sim q_\phi(z|x) = \mathcal{N}(\mu_\phi(x), \sigma_\phi^2(x))$ . Sampling is stochastic and non-differentiable - you cannot compute $\partial z / \partial \phi$ .

The reparameterization trick rewrites: $z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon$ , where $\epsilon \sim \mathcal{N}(0, I)$ .

Now $z$ is a deterministic, differentiable function of $\mu$ and $\sigma$ (which depend on $\phi$ ) plus external noise $\epsilon$ that does not depend on $\phi$ . Gradients flow through $\mu$ and $\sigma$ to $\phi$ via standard backpropagation.

This generalizes beyond VAEs - any time you need to differentiate through a sample from a parameterized distribution (policy gradient in RL, stochastic computation graphs), reparameterization provides lower-variance gradient estimates than the REINFORCE alternative."
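A small Monte Carlo sketch shows gradients flowing through the reparameterized sample. The objective $\mathbb{E}[z^2]$ is a toy choice made here because its exact gradients are known: $\mathbb{E}[z^2] = \mu^2 + \sigma^2$ , so the gradients are $2\mu$ and $2\sigma$ . The function name and sample count are assumptions.

```python
import random

def reparam_grads(mu, sigma, n_samples=50_000, seed=0):
    """Monte Carlo gradients of E[z^2] w.r.t. mu and sigma, z = mu + sigma * eps."""
    rng = random.Random(seed)
    g_mu = g_sigma = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)      # noise independent of the parameters
        z = mu + sigma * eps           # deterministic in (mu, sigma) given eps
        dz = 2.0 * z                   # d(z^2)/dz
        g_mu += dz * 1.0               # chain rule: dz/dmu = 1
        g_sigma += dz * eps            # chain rule: dz/dsigma = eps
    return g_mu / n_samples, g_sigma / n_samples

g_mu, g_sigma = reparam_grads(mu=1.5, sigma=0.8)
# Analytic values for comparison: 2 * mu = 3.0 and 2 * sigma = 1.6
```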

Scoring: Strong Hire = why sampling is non-differentiable + the trick formula + gradient flow + connection to RL/general stochastic computation. Lean Hire = knows the formula but cannot explain why it is needed. No Hire = cannot explain the trick.

Deep dive: Generative Models

Q24: Explain how FlashAttention works and why it is important.

Model Answer:

"Standard attention materializes the $n \times n$ attention matrix, requiring $O(n^2)$ memory. For a 100K context, that is 10 billion scores - about 40GB in FP32 for a single attention head - which is impractical.

FlashAttention (Dao et al., 2022) computes exact attention without materializing the full matrix. It tiles the Q, K, V matrices into blocks that fit in GPU SRAM (fast, small memory), computes attention one tile at a time, and uses the online softmax trick to accumulate results across tiles without storing the full matrix.

Key insight: GPUs have a memory hierarchy - HBM (slow, large) and SRAM (fast, tiny). Standard attention is memory-bandwidth bound - it reads/writes the huge attention matrix from HBM repeatedly. FlashAttention is compute-bound - it does more FLOPs (recomputation) but avoids HBM reads/writes, which is faster overall.

Results: 2-4x wall-clock speedup, O(n) memory instead of $O(n^2)$ , enables much longer context lengths. FlashAttention-2 further optimizes work partitioning. FlashAttention-3 targets H100 with asynchronous computation. It is now built into the major frameworks, for example as a backend of PyTorch's scaled_dot_product_attention."
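The online softmax trick is the part candidates most often struggle to explain, so here is a minimal sketch of it for a single query row. It is not the FlashAttention kernel (there is no SRAM tiling or GPU code here); it only shows that streaming over key/value tiles with a running max and running denominator reproduces exact attention. The tile size and toy data are assumptions.

```python
import math

def attention_row_tiled(q, K, V, tile=2):
    """One query's attention output, streaming over K/V in tiles."""
    scale = 1.0 / math.sqrt(len(q))
    m, denom = float("-inf"), 0.0            # running max and softmax denominator
    acc = [0.0] * len(V[0])                  # running unnormalized output
    for start in range(0, len(K), tile):
        k_tile, v_tile = K[start:start + tile], V[start:start + tile]
        s = [scale * sum(a * b for a, b in zip(q, k)) for k in k_tile]
        m_new = max(m, max(s))
        correction = math.exp(m - m_new)     # rescale earlier partial sums
        p = [math.exp(v - m_new) for v in s]
        denom = denom * correction + sum(p)
        acc = [a * correction + sum(pi * v[j] for pi, v in zip(p, v_tile))
               for j, a in enumerate(acc)]
        m = m_new
    return [a / denom for a in acc]

def attention_row_full(q, K, V):
    """Reference: materialize all scores for the query, then softmax."""
    scale = 1.0 / math.sqrt(len(q))
    s = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
    m = max(s)
    p = [math.exp(v - m) for v in s]
    total = sum(p)
    return [sum(pi * v[j] for pi, v in zip(p, V)) / total for j in range(len(V[0]))]

q = [0.3, -1.2]
K = [[1.0, 0.5], [-0.7, 2.0], [0.2, 0.2], [3.0, -1.0], [0.0, 1.5]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 0.5], [0.5, -0.5]]
tiled = attention_row_tiled(q, K, V, tile=2)
full = attention_row_full(q, K, V)
```

The tiled version never holds more than one tile of scores at a time, which is the source of the O(n) memory claim.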

Scoring: Strong Hire = tiling + online softmax + SRAM vs HBM + IO-aware analysis + memory reduction. Lean Hire = knows it is faster and uses less memory but not how. No Hire = has not heard of FlashAttention.

Deep dive: Transformer Architecture

Q25: How does a diffusion model generate images? Walk through both training and inference.

Model Answer:

"Training:

Sample a clean image $x_0$ from the dataset
Sample a random timestep $t \sim \text{Uniform}(1, T)$
Sample noise $\epsilon \sim \mathcal{N}(0, I)$
Create noisy image: $x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1-\bar{\alpha}_t} \epsilon$
Train the network to predict $\epsilon$ : minimize $\|\epsilon - \epsilon_\theta(x_t, t)\|^2$

The network (typically a U-Net) takes the noisy image and timestep as input and outputs the predicted noise.

Inference:

Start with pure noise $x_T \sim \mathcal{N}(0, I)$
For $t = T, T-1, \ldots, 1$ :
- Predict the noise: $\hat{\epsilon} = \epsilon_\theta(x_t, t)$
- Compute the denoised estimate and take a reverse diffusion step
- $x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}} \hat{\epsilon}\right) + \sigma_t z$ , where $z \sim \mathcal{N}(0, I)$ for $t > 1$ and $z = 0$ for $t = 1$
Output $x_0$

For text-to-image, the network is conditioned on text embeddings via cross-attention, and classifier-free guidance amplifies the conditional signal.

The training is simple and stable (just MSE), but inference requires T (typically 20-1000) network evaluations, making it slow."
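The forward noising formula and the reverse step can be exercised on a scalar "image" without any neural network. In this sketch the true noise stands in for the network's prediction, the linear beta schedule from 1e-4 to 0.02 is an assumed choice (the one used in the original DDPM paper), $\sigma_t^2 = \beta_t$ is one common option, and timesteps are 0-indexed in code where the text uses 1 to $T$ .

```python
import math
import random

T = 1000
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]  # linear schedule
alphas = [1.0 - b for b in betas]
alpha_bars = []
running = 1.0
for a in alphas:
    running *= a                      # cumulative product of alphas
    alpha_bars.append(running)

def q_sample(x0, t, eps):
    """Forward process: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    return math.sqrt(alpha_bars[t]) * x0 + math.sqrt(1.0 - alpha_bars[t]) * eps

def p_step(x_t, t, eps_hat, rng):
    """One reverse step using predicted noise eps_hat (sigma_t^2 = beta_t)."""
    mean = (x_t - betas[t] / math.sqrt(1.0 - alpha_bars[t]) * eps_hat) / math.sqrt(alphas[t])
    noise = rng.gauss(0.0, 1.0) if t > 0 else 0.0   # no noise on the final step
    return mean + math.sqrt(betas[t]) * noise

rng = random.Random(0)
x0, t = 0.7, 500
eps = rng.gauss(0.0, 1.0)
x_t = q_sample(x0, t, eps)

# With a perfect noise prediction, x0 is recoverable from x_t in closed form.
x0_hat = (x_t - math.sqrt(1.0 - alpha_bars[t]) * eps) / math.sqrt(alpha_bars[t])
x_prev = p_step(x_t, t, eps, rng)
```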

Scoring: Strong Hire = both training and inference algorithms with formulas + conditioning + guidance + speed tradeoff. Lean Hire = correct high-level process but missing formulas. No Hire = cannot describe the training procedure.

Deep dive: Generative Models

Section 3 - Senior/Staff Level Questions

These questions test deep understanding, system-level thinking, and the ability to make architectural decisions. Expect 15-20 minute discussions per question.

Q26: Design the training infrastructure for a 100B parameter language model from scratch.

Model Answer:

"Compute budget: Chinchilla-optimal training: $D = 20 \times 100\text{B} = 2\text{T tokens}$ . Compute: $C = 6 \times 100\text{B} \times 2\text{T} = 1.2 \times 10^{24}$ FLOPs.

Hardware: 1024 H100 GPUs (128 nodes x 8 GPUs). At 40% MFU: effective 400 PFLOPS. Time: $1.2 \times 10^{24} / (400 \times 10^{15}) = 3 \times 10^{6}$ seconds = 35 days.

Parallelism: TP=8 within node (NVLink). PP=4 across node groups (InfiniBand). DP=1024/(8x4)=32. ZeRO Stage 1 within DP group.

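Memory per GPU (conservative, counting only the TP=8 split): Weights: 100B x 2 bytes / 8 = 25GB. Optimizer (ZeRO-1, DP=32): ~4.7GB. Gradients: ~25GB. Total static: ~55GB. Activations with checkpointing: ~20GB. Total: ~75GB on 80GB H100. With PP=4, each GPU holds only a quarter of the layers, so the static terms shrink by roughly 4x and the real headroom is larger.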

Training configuration: Batch size 2048 sequences x 4096 tokens. LR 3e-4 with cosine decay. Warmup 2000 steps. BF16 mixed precision. Gradient clipping 1.0.

Data pipeline: Tokenized data stored in memory-mapped files. WebDataset format for efficient streaming. Data mixing: web text (60%), books (15%), code (15%), academic (10%).

Failure recovery: Checkpoint every 500 steps to object storage. Elastic training handles up to 5% node failures. Automatic restart with learning rate rewind. Loss spike detection: if loss exceeds 2x running average, revert to last checkpoint.

Monitoring: Track per-layer gradient norms, activation magnitudes, learning rate, loss, MFU, GPU utilization, inter-node bandwidth. Alert on anomalies."
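The budget arithmetic in this answer is easy to check in a few lines. The sketch below assumes roughly 989 TFLOPS dense BF16 peak per H100 and 12 bytes of Adam optimizer state per parameter; both values and the function names are assumptions for illustration.

```python
def training_budget(n_params, tokens_per_param=20, n_gpus=1024,
                    peak_flops_per_gpu=989e12, mfu=0.40):
    """Token count, FLOPs (C = 6ND), and wall-clock days at a given MFU."""
    tokens = tokens_per_param * n_params
    total_flops = 6 * n_params * tokens
    effective_flops_per_s = n_gpus * peak_flops_per_gpu * mfu
    days = total_flops / effective_flops_per_s / 86400
    return tokens, total_flops, days

def static_memory_gb(n_params, tp, pp, dp,
                     bytes_weight=2, bytes_grad=2, bytes_opt=12):
    """Per-GPU weights + gradients + ZeRO-1 sharded optimizer state, in GB."""
    shard = n_params / (tp * pp)               # parameters held by one GPU
    weights = shard * bytes_weight / 1e9
    grads = shard * bytes_grad / 1e9
    optimizer = shard * bytes_opt / dp / 1e9   # ZeRO-1 shards optimizer state over DP
    return weights + grads + optimizer

tokens, flops, days = training_budget(100e9)
tp_only = static_memory_gb(100e9, tp=8, pp=1, dp=32)   # matches the ~55GB figure
tp_and_pp = static_memory_gb(100e9, tp=8, pp=4, dp=32) # includes the PP=4 split
```

Counting only the TP split reproduces the roughly 55GB static figure; including PP=4 divides it by four, so the per-GPU numbers in the answer are an upper bound.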

Scoring: Strong Hire = complete compute budget + parallelism config + memory calculation + data pipeline + failure recovery + monitoring. Lean Hire = reasonable parallelism but missing compute budget or failure recovery. No Hire = cannot design a multi-dimensional parallelism config.

Q27: When would you choose a diffusion model over an autoregressive model for generation, and vice versa?

Model Answer:

"Diffusion for: (1) Continuous data - images, audio, video, 3D - where the output space is naturally continuous and high-dimensional. Diffusion's denoising objective is a natural fit. (2) When you need diversity - diffusion covers all modes by design. (3) When you want fine-grained control - classifier-free guidance, inpainting, style transfer are natural with diffusion.

Autoregressive for: (1) Discrete sequential data - text, code, music tokens - where the output has a natural left-to-right order. (2) When you need exact likelihood - autoregressive models provide tractable log-likelihood. (3) When you need reasoning - chain-of-thought requires sequential token generation where each token conditions on all previous tokens. (4) When you want a unified model - LLMs can handle many tasks with prompting.

Hybrid approaches are emerging: (1) Autoregressive models generating image tokens (DALL-E 1, Parti, Chameleon). (2) Diffusion models with discrete denoising for text (MDLM, SEDD). (3) Diffusion for planning/draft + autoregressive for refinement.

Key tradeoff: Autoregressive is sequential but exact. Diffusion is parallel per step but iterative across steps. For text, the sequential nature matches human language. For images, the parallel nature matches spatial structure."

Scoring: Strong Hire = clear criteria for each + examples + hybrid approaches + key tradeoff articulated. Lean Hire = reasonable comparison but misses one paradigm's strengths. No Hire = no clear framework for choosing.

Q28: Explain why pre-training with self-supervised objectives works so well. What does the model actually learn?

Model Answer:

"Self-supervised pre-training works because predicting the next token (or masked tokens, or denoised images) forces the model to learn a compressed representation of the training distribution - capturing syntax, semantics, world knowledge, and reasoning patterns.

What the model learns at different scales:

Small models (~100M): Syntax, grammar, common collocations, basic factual associations.
Medium models (~1-10B): Semantic understanding, analogy reasoning, multi-step factual chains, basic code generation.
Large models (~100B+): Complex reasoning, chain-of-thought, few-shot learning, cross-domain transfer, theory of mind (emergent).

Why next-token prediction is so powerful: The loss function $-\log p(x_t | x_{<t})$ requires modeling ALL aspects of text to minimize - factual knowledge, logical reasoning, stylistic patterns, conversational structure. The model must build internal representations of all these phenomena to achieve low loss.

The scaling hypothesis: As models and data scale, the internal representations become richer and more general. At some point, the representations become useful for tasks the model was never explicitly trained on (emergent capabilities).

Limitations: Pre-training optimizes for distribution matching, not for truthfulness, helpfulness, or safety. This is why alignment (RLHF, constitutional AI) is needed as a second phase."

Scoring: Strong Hire = explains compression/representation argument + scale-dependent capabilities + why the objective is powerful + limitations requiring alignment. Lean Hire = knows pre-training works but cannot articulate why deeply. No Hire = "it just learns from lots of data."

Q29: You trained a model and it performs well on your benchmark but poorly in production. Debug this systematically.

Model Answer:

"This is a distribution shift problem. Systematic investigation:

1. Data distribution mismatch: Compare the production data distribution to the training/benchmark distribution. Are there new categories, languages, edge cases, or adversarial inputs? Collect production samples and measure feature drift using statistical tests (KS test, MMD, population stability index).

2. Evaluation metric mismatch: Does the benchmark metric correlate with the production success metric? A model optimized for accuracy may fail on latency, calibration, fairness, or user satisfaction. Identify the true production metric and evaluate against it.

3. Preprocessing discrepancies: Is the production data pipeline identical to the training pipeline? Different tokenization, normalization, image resizing, or feature extraction can cause silent failures. Run the same input through both pipelines and compare.

4. Temporal drift: Is the production data from a different time period? Language models trained on 2023 data may fail on 2024 events. Recommendation models trained in summer may fail in winter.

5. Adversarial/edge cases: Production users behave differently from benchmark creators. They find edge cases, provide unusual inputs, and attempt to break the model. Run adversarial evaluations.

6. Infrastructure issues: Quantization errors (if model was quantized for serving), batching effects, memory limits causing silent truncation, different library versions.

Mitigation: (1) Continuously monitor production metrics, (2) Maintain a golden evaluation set that mirrors production distribution, (3) A/B test before full deployment, (4) Set up automated retraining on production data."
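As one concrete drift check from step 1, the population stability index can be computed with nothing but the standard library. This is a sketch under assumptions: equal-width bins taken from the training sample, a small floor to avoid log(0), and a hypothetical function name. A PSI above about 0.2 is a common rule of thumb for significant shift, not a hard threshold.

```python
import math

def population_stability_index(expected, actual, n_bins=10, floor=1e-6):
    """PSI between a reference sample (training) and a new sample (production)."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant feature

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1  # clip out-of-range values
        return [max(c / len(values), floor) for c in counts]

    p = bin_fractions(expected)
    q = bin_fractions(actual)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))

train_feature = [i / 100 for i in range(100)]
prod_feature = [0.5 + i / 200 for i in range(100)]  # shifted toward high values
psi = population_stability_index(train_feature, prod_feature)
```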

Scoring: Strong Hire = 5+ systematic categories + specific tools/tests + mitigation plan. Lean Hire = identifies 2-3 causes but not systematically. No Hire = "train on more data."

Q30: Explain the loss NaN debugging process for a large language model training run.

Model Answer:

"Loss NaN during LLM training is usually caused by numerical instability. Systematic debugging:

Immediate triage:

When did it happen? Step number, learning rate at that step, data batch ID. Early (first 100 steps) vs late (after 10K steps) suggests different causes.
Gradient norm history: Was there a spike before the NaN? Gradients growing over 10-100 steps then exploding suggests accumulating instability. A sudden spike suggests a bad data batch.

Common causes by timing:

Early training NaN:

Bad initialization (weights too large/small)
Learning rate too high (especially without warmup)
Missing gradient clipping
FP16 overflow (activations exceed 65504)

Late training NaN:

A particularly bad data batch (corrupted text, extremely long sequences)
Loss scaling failure (dynamic loss scale drops too low)
Accumulating numerical error in attention softmax
LayerNorm encountering zero variance

Debugging steps:

Add gradient norm logging per layer - identify which layer diverges first
Add activation magnitude logging - check for pre-softmax values exceeding FP16 range
Check the specific data batch - is there a 100K token sequence or corrupted data?
Check loss scaling - is the dynamic scale at its minimum?
Try BF16 instead of FP16 (largely eliminates overflow, since BF16 has the same exponent range as FP32)
Try reducing learning rate by 2x
Try tightening gradient clipping (lower the max norm from 1.0 to 0.5)
If specific layer diverges: check its initialization, normalization, and residual connection

Prevention: Use BF16, gradient clipping at 1.0, learning rate warmup (1000-2000 steps), data quality filtering (remove extremely long or corrupted samples), and checkpoint frequently enough to resume without losing much work."
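The first few debugging steps can be sketched framework-free. The helpers below are hypothetical and operate on plain dicts of floats (layer name to values); in a real run the same checks would read gradients and activations from the training framework.

```python
import math

FP16_MAX = 65504.0  # largest finite FP16 value

def first_nonfinite_layer(layer_grads):
    """Name of the first layer (in insertion order) whose gradients contain NaN/Inf."""
    for name, grads in layer_grads.items():
        if any(not math.isfinite(g) for g in grads):
            return name
    return None

def global_grad_norm(layer_grads):
    """L2 norm over all gradients, the quantity that clipping is applied to."""
    return math.sqrt(sum(g * g for grads in layer_grads.values() for g in grads))

def clip_coefficient(total_norm, max_norm=1.0, eps=1e-6):
    """Factor every gradient is scaled by under global-norm clipping."""
    return min(1.0, max_norm / (total_norm + eps))

def fp16_overflow_layers(layer_activations):
    """Layers whose activation magnitude would overflow in FP16."""
    return [name for name, acts in layer_activations.items()
            if max(abs(a) for a in acts) > FP16_MAX]
```

Logging `global_grad_norm` every step and `first_nonfinite_layer` on failure separates the "slow build-up" pattern from the "single bad batch" pattern described above.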

Scoring: Strong Hire = timing-based diagnosis + 5+ specific causes + systematic debugging steps + prevention strategy. Lean Hire = identifies some causes but not systematically. No Hire = "restart training with a lower learning rate."

Q31: How would you implement efficient fine-tuning for a 70B model with limited compute?

Model Answer:

"Parameter-efficient fine-tuning (PEFT) methods avoid updating all 70B parameters:

LoRA (Low-Rank Adaptation): Add trainable low-rank matrices to attention weights: $W' = W + BA$ where $B \in \mathbb{R}^{d \times r}$ , $A \in \mathbb{R}^{r \times d}$ , $r \ll d$ (typically 8-64). Only train A and B (~0.1% of parameters). Memory: only need optimizer states for LoRA params. Can serve multiple tasks by swapping LoRA adapters.

QLoRA: Quantize the base model to 4-bit (NF4 quantization), then add LoRA adapters in BF16. Reduces memory from 140GB to ~35GB for a 70B model. Can fine-tune 70B on a single 48GB GPU.

Why these work: Pre-trained weights capture general knowledge. Fine-tuning only needs to adapt the model to a new task or style, which requires a low-rank update - the 'task-specific' information lives in a low-dimensional subspace.

Comparison:

Method	Trainable Params	Memory (70B)	Quality
Full fine-tuning	70B (100%)	~1120 GB	Best
LoRA r=16	~170M (0.24%)	~160 GB	95-99% of full
QLoRA r=16	~170M (0.24%)	~35 GB	93-97% of full
Adapter layers	~500M (0.7%)	~180 GB	94-98% of full
Prompt tuning	~100K (0.0001%)	~140 GB	80-90% of full

I would choose QLoRA for its memory efficiency. For production, merge the LoRA weights back into the base model (zero inference overhead)."
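A toy example makes the merge claim concrete: the adapter path $Wx + B(Ax)$ and the merged weight $(W + BA)x$ give the same output. The sketch uses tiny pure-Python matrices and omits the usual $\alpha / r$ scaling factor for brevity; all names and values are illustrative.

```python
def matmul(A, B):
    """Plain nested-list matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def lora_param_count(d_out, d_in, r):
    """Trainable parameters for one adapted matrix: B is d_out x r, A is r x d_in."""
    return r * (d_out + d_in)

def weight_memory_gb(n_params, bits_per_param):
    return n_params * bits_per_param / 8 / 1e9

W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # frozen base weight
B = [[1.0], [2.0], [0.0]]                                # d x r with r = 1
A = [[0.5, 0.0, -1.0]]                                   # r x d
x = [[1.0], [2.0], [3.0]]                                # column vector

adapter_out = add(matmul(W, x), matmul(B, matmul(A, x)))  # training-time path
merged_out = matmul(add(W, matmul(B, A)), x)              # serving-time path
```

For an 8192 x 8192 projection at r=16 the adapter has 262,144 trainable parameters against about 67M in the full matrix, and `weight_memory_gb(70e9, 16)` versus `weight_memory_gb(70e9, 4)` reproduces the 140GB and 35GB weight figures.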

Scoring: Strong Hire = LoRA mechanism + QLoRA + memory calculations + comparison table + merge for serving + explanation of why low-rank works. Lean Hire = knows LoRA exists but cannot explain the mechanism. No Hire = "just fine-tune the last layer."

Q32: Explain the difference between pre-LayerNorm and post-LayerNorm transformers and why it matters.

Model Answer:

"Post-LayerNorm (original Transformer): $\text{LN}(x + \text{Attention}(x))$ . LayerNorm is applied after the residual addition, so every layer's output is normalized. The catch is that there is no clean identity path from output to input: gradients must pass through a LayerNorm at every layer, which makes deep models unstable to train and sensitive to learning rate warmup.

Pre-LayerNorm (GPT-2, most modern LLMs): $x + \text{Attention}(\text{LN}(x))$ . LayerNorm is applied before the attention/FFN sublayer, so each sublayer receives normalized inputs while the residual stream itself stays an unnormalized identity path. Gradients flow through that path more stably, which allows training very deep transformers (100+ layers) and reduces the dependence on learning rate warmup.

Why it matters:

Pre-LN is much more stable - almost always preferred for training large models
Post-LN can achieve slightly better final performance with careful tuning (because the final layer output is normalized)
Pre-LN requires an extra LayerNorm after the final layer (the residual stream is not normalized)
Many recent models use RMSNorm (simplified LayerNorm without mean centering) in the pre-LN position

The choice between pre and post norm is one of the most practical architecture decisions with real impact on training stability. Getting this wrong can waste weeks of GPU time on unstable training runs."
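The two orderings differ only in where the normalization sits relative to the residual addition. The sketch below uses a toy sublayer (a fixed scaling) in place of attention, and LayerNorm without learned gain or bias; both are simplifications for illustration.

```python
import math

def layer_norm(x, eps=1e-5):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def variance(x):
    mean = sum(x) / len(x)
    return sum((v - mean) ** 2 for v in x) / len(x)

def post_ln_block(x, sublayer):
    """Post-LN: LN(x + Sublayer(x)). The output is always normalized."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_ln_block(x, sublayer):
    """Pre-LN: x + Sublayer(LN(x)). The residual stream is left unnormalized."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

def stack(block, x, sublayer, n_layers):
    for _ in range(n_layers):
        x = block(x, sublayer)
    return x

toy_sublayer = lambda h: [2.0 * v for v in h]
x0 = [1.0, 2.0, 3.0, 4.0]
post_out = stack(post_ln_block, x0, toy_sublayer, 8)
pre_out = stack(pre_ln_block, x0, toy_sublayer, 8)
```

After eight blocks the post-LN output still has unit variance, while the pre-LN residual stream has grown substantially, which is why a final normalization layer is added in pre-LN models.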

Scoring: Strong Hire = both formulas + stability analysis + gradient flow explanation + practical implications + RMSNorm mention. Lean Hire = knows there is a difference but cannot explain the stability implications. No Hire = does not know pre vs post LayerNorm is a design choice.

Section 4 - Company-Tagged Questions

These questions are frequently asked at specific companies. Study the ones for your target companies.

Q33: [Google] Explain the T5 architecture and how it differs from GPT and BERT.

Model Answer: "T5 (Text-to-Text Transfer Transformer) is an encoder-decoder model that frames ALL NLP tasks as text-to-text: input text maps to output text. Translation: 'translate English to French: cat' -> 'chat'. Summarization: 'summarize: [article]' -> '[summary]'. Classification: 'classify: [text]' -> 'positive'.

BERT is encoder-only (bidirectional attention, MLM pre-training, requires task-specific heads). GPT is decoder-only (causal attention, autoregressive pre-training, no encoder). T5 has both encoder (bidirectional) and decoder (causal with cross-attention).

T5 uses relative position biases instead of absolute position embeddings, which generalize better to unseen sequence lengths. It uses a span corruption pre-training objective (mask random spans, predict them) rather than BERT's random token masking or GPT's next-token prediction."

Scoring: Strong Hire = text-to-text framing + architecture comparison + position encoding difference + span corruption. Lean Hire = knows the architecture but not the design philosophy. No Hire = confuses T5 with BERT or GPT.

Q34: [Meta] How does Llama 2/3 differ from the original GPT architecture?

Model Answer: "Key modifications from GPT to Llama: (1) Pre-LayerNorm with RMSNorm instead of post-LayerNorm with full LayerNorm - simpler and more stable. (2) SwiGLU activation in FFN instead of GELU - $\text{SwiGLU}(x) = (\text{Swish}(xW_1) \odot xW_3) W_2$ . Because the gate adds a third weight matrix, the hidden dimension is set to about $\frac{2}{3} \cdot 4d$ instead of $4d$ to keep the parameter count comparable. (3) Rotary Position Embeddings (RoPE) instead of absolute positional embeddings - encodes relative position through rotation matrices applied to Q and K, which helps length generalization. (4) Grouped-Query Attention (GQA) in Llama 2 70B and Llama 3 - shares KV heads across multiple query heads, reducing KV cache memory. (5) No bias terms in linear layers - simplifies implementation, negligible performance impact. (6) Llama 3 training: 15T tokens (far beyond Chinchilla optimal) for inference efficiency."
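The relative-position property of RoPE in point (3) can be checked numerically: after rotation, the query-key dot product depends only on the offset between the two positions. The sketch rotates adjacent dimension pairs with base 10000; production implementations may pair dimensions differently, so treat the layout as an assumption.

```python
import math

def rope_rotate(vec, position, base=10000.0):
    """Rotate each adjacent pair (vec[i], vec[i+1]) by an angle proportional to position."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)  # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 0.5, -0.3, 2.0]
k = [0.2, -1.0, 0.7, 0.4]
score_near_start = dot(rope_rotate(q, 5), rope_rotate(k, 2))   # offset 3
score_later = dot(rope_rotate(q, 13), rope_rotate(k, 10))      # offset 3 again
```

The two scores agree up to floating-point error, even though the absolute positions differ.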

Scoring: Strong Hire = RMSNorm + SwiGLU + RoPE + GQA + training scale explanation. Lean Hire = knows 2-3 differences. No Hire = "Llama is basically GPT with more data."

Q35: [OpenAI/Anthropic] What is RLHF and why is it needed?

Model Answer: "Reinforcement Learning from Human Feedback aligns language models with human preferences. Three stages: (1) Supervised Fine-Tuning (SFT) - fine-tune on high-quality instruction-following data. (2) Reward Model Training - collect human comparisons (which of two outputs is better), train a model to predict human preference scores. (3) RL Optimization - use PPO (Proximal Policy Optimization) to fine-tune the LLM to maximize the reward model's score, with a KL penalty to prevent diverging too far from the SFT model.

Why it is needed: pre-training optimizes for next-token prediction, which produces models that are knowledgeable but not helpful, honest, or harmless. A model trained on internet text will happily produce toxic content, hallucinate facts, or refuse to answer simple questions. RLHF teaches the model to be helpful (follow instructions), honest (express uncertainty), and harmless (refuse dangerous requests).

Alternatives: DPO (Direct Preference Optimization) skips the reward model and directly optimizes the policy from preference data - simpler, no reward model training, but potentially less expressive. Constitutional AI (Anthropic) uses the model itself to generate critiques and revisions, reducing reliance on human labels."

Scoring: Strong Hire = three stages + why needed + KL penalty + DPO/CAI alternatives. Lean Hire = knows the concept but not the three-stage process. No Hire = cannot explain RLHF.

Q36: [Amazon] How would you serve a 70B model at low latency for a customer-facing product?

Model Answer: "Quantization: Apply GPTQ or AWQ 4-bit quantization - reduces memory from 140GB to ~35GB, fits on a single A100 or 2 A10Gs. Quality loss: 1-3% on benchmarks. 8-bit is safer if quality is critical.

KV Cache Optimization: Use GQA architecture (Llama 2/3), PagedAttention (vLLM) for efficient cache management, or multi-query attention. KV cache is often the memory bottleneck for long contexts.

Batching: Continuous batching (not static batching) - as requests finish, immediately add new ones. vLLM or TensorRT-LLM handle this automatically.

Speculative Decoding: Use a small draft model (7B) to generate candidate tokens, verify with the 70B model in parallel. Can achieve 2-3x speedup without quality loss.

Infrastructure: A100/H100 GPUs with tensor parallelism for single-request latency reduction. Or multiple smaller GPUs (L4, A10G) with model sharding for cost efficiency.

Target latency: First token in under 500ms, subsequent tokens at 30+ tokens/second. For 200 tokens: about 7 seconds total (0.5 s + 200/30 s).

Cost optimization: Spot instances for non-critical traffic, request batching for throughput, caching for repeated queries, and a smaller model (7B-13B) for simple queries with routing."
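Two of the numbers in this answer are worth being able to derive on the spot: KV cache size and end-to-end latency. The sketch assumes a Llama-2-70B-like shape (80 layers, 64 query heads, 8 KV heads, head dimension 128) and an FP16 cache; the function names are hypothetical.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    """Keys and values (factor 2) for every layer, KV head, position, and sequence."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

def end_to_end_latency_s(time_to_first_token_s, tokens_per_s, n_tokens):
    return time_to_first_token_s + n_tokens / tokens_per_s

GIB = 1024 ** 3
gqa_cache = kv_cache_bytes(80, 8, 128, seq_len=4096, batch=1)   # 8 KV heads (GQA)
mha_cache = kv_cache_bytes(80, 64, 128, seq_len=4096, batch=1)  # 64 KV heads (full MHA)
latency = end_to_end_latency_s(0.5, 30, 200)
```

Under these assumptions one 4096-token sequence needs 1.25 GiB of cache with GQA versus 10 GiB with full multi-head attention, which is why GQA matters so much for batch size at long context.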

Scoring: Strong Hire = quantization + KV cache + batching + speculative decoding + specific latency targets + cost optimization. Lean Hire = mentions quantization and batching but no specifics. No Hire = "use a GPU."

Q37: [Apple] How do you train a model for on-device deployment with strict memory constraints?

Model Answer: "On-device means: typically 2-6GB memory budget, CPU or Apple Neural Engine (ANE), no internet required, under 100ms latency.

Architecture: Start with a small model (1-3B parameters). Use efficient attention (GQA or multi-query). Use SwiGLU activation. Minimize embedding table size.

Training approach: Knowledge distillation from a large teacher (70B) to a small student (3B). This preserves much of the quality in a deployable size. Train the student on task-specific data for the target use case.

Compression pipeline:

Structured pruning (remove entire attention heads or FFN neurons that contribute least)
Quantization-aware training (QAT) - simulate 4-bit or 8-bit during training so the model learns to be robust to quantization
Post-training quantization to INT4/INT8 using GPTQ or similar
Weight clustering for further compression

Optimization for Apple Silicon:

CoreML or MLX for inference
ANE-friendly operations (avoid dynamic shapes, use static graphs)
Metal Performance Shaders for GPU fallback

Final model: ~1.5GB for a 3B INT4 model, runs at 20-30 tokens/second on iPhone 15 Pro."
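The size arithmetic and the basic quantize/dequantize round trip can be sketched as follows. This is plain per-tensor, symmetric, round-to-nearest quantization, deliberately simpler than GPTQ or QAT, and the function names are illustrative.

```python
def quantize_symmetric(weights, n_bits=8):
    """Map floats to signed integers in [-qmax, qmax] with one shared scale."""
    qmax = 2 ** (n_bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

def model_size_gb(n_params, n_bits):
    return n_params * n_bits / 8 / 1e9

weights = [0.5, -1.0, 0.25, 0.0]
q8, scale8 = quantize_symmetric(weights, n_bits=8)
restored = dequantize(q8, scale8)
size = model_size_gb(3e9, 4)  # the 3B INT4 case from the answer
```

The reconstruction error per weight is bounded by half the scale, so a single outlier weight inflates the error for the whole tensor. That is one reason practical schemes quantize per channel or per group.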

Scoring: Strong Hire = distillation + pruning + QAT + platform-specific optimization (CoreML/ANE) + specific size/speed targets. Lean Hire = mentions quantization but not the full pipeline. No Hire = "use a smaller model."

Q38: [Google] Explain Mixture of Experts (MoE) and its training challenges.

Model Answer: "MoE replaces the dense FFN in a transformer with multiple 'expert' FFN networks and a gating mechanism that routes each token to the top-K experts (typically K=2). The total parameter count is large (expert count x expert size) but only K experts are active per token, so compute cost is much lower than a dense model of the same total size.

Gating: A learned linear layer maps the token embedding to a distribution over experts. Top-K experts are selected. Their outputs are weighted by the gate values and summed.

Training challenges:

Load balancing: Without intervention, the gate learns to route most tokens to a few experts, leaving others unused. Fix: auxiliary loss that penalizes uneven routing ($\alpha \cdot \text{CV}(\text{expert loads})^2$).
Communication cost: In distributed training, tokens must be sent to the GPU holding their assigned expert (all-to-all communication). This can be a bottleneck.
Instability: Expert routing can oscillate. Fixes: expert capacity factor (cap tokens per expert), noise in gating (explore different experts).
Fine-tuning difficulty: Expert specialization during pre-training may not transfer well to fine-tuning tasks.

Examples: Switch Transformer (Google), Mixtral 8x7B (Mistral), GShard, ST-MoE. Mixtral has 8 experts with 2 active, giving ~13B active parameters from ~47B total."
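Top-K gating and the CV-squared balance penalty fit in a short sketch. Experts here are scalar functions standing in for FFNs, and the gate weights are renormalized over the selected experts, which is one common convention rather than the only one; all names are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(gate_logits, k=2):
    """Return {expert_index: weight} for the k highest-probability experts."""
    probs = softmax(gate_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}

def moe_forward(x, gate_logits, experts, k=2):
    """Weighted sum of the selected experts' outputs; other experts are never run."""
    return sum(w * experts[i](x) for i, w in top_k_route(gate_logits, k).items())

def load_balance_cv2(expert_loads):
    """Squared coefficient of variation of tokens per expert (0 means perfectly even)."""
    mean = sum(expert_loads) / len(expert_loads)
    var = sum((load - mean) ** 2 for load in expert_loads) / len(expert_loads)
    return var / mean ** 2

experts = [lambda x: x, lambda x: 2 * x, lambda x: 3 * x, lambda x: 4 * x]
y = moe_forward(1.0, gate_logits=[2.0, 0.0, 1.0, -1.0], experts=experts)
```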

Scoring: Strong Hire = routing mechanism + load balancing + communication cost + 3+ challenges + real examples with numbers. Lean Hire = knows the concept but not the challenges. No Hire = cannot explain MoE routing.

Q39: [Anthropic] Explain constitutional AI and how it differs from RLHF.

Model Answer: "Constitutional AI (CAI) is Anthropic's approach to aligning language models using a set of principles ('constitution') rather than large amounts of human feedback.

Process:

Start with an RLHF-trained model
Generate harmful outputs by red-teaming
Ask the model itself to critique its output based on the constitution ('identify ways this response could be harmful')
Ask the model to revise its output based on the critique
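Fine-tune the model on the revised responses (supervised stage)
Have the model compare pairs of sampled responses against the constitution, and train a preference (reward) model on these AI-generated labels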
Fine-tune with RL using this reward model (RLAIF - RL from AI Feedback)

How it differs from RLHF:

RLHF requires extensive human labeling of preferences. CAI uses the model's own judgment guided by principles.
RLHF captures implicit human values. CAI makes values explicit through the constitution.
CAI is more scalable (less human labor) and more transparent (values are written down).
CAI can be iterated - refine the constitution based on observed failures.

Limitations: The model must already be capable enough to critique itself (bootstrapping problem). The constitution may not cover all edge cases. Human values are hard to fully codify in rules."

Scoring: Strong Hire = full CAI process + RLAIF + explicit differences from RLHF + limitations. Lean Hire = knows CAI exists but not the process. No Hire = confuses CAI with RLHF.

Q40: [Meta] How does text-to-image generation work in Stable Diffusion?

Model Answer: "Stable Diffusion is a latent diffusion model with three main components:

VAE (Autoencoder): Compresses 512x512x3 images to 64x64x4 latent representations (48x compression). Trained separately with reconstruction + perceptual + adversarial losses.
U-Net Denoiser: Operates in latent space. Architecture: downsampling blocks, middle block, upsampling blocks, all with ResNet blocks + self-attention + cross-attention. Cross-attention receives text embeddings from CLIP. Conditioned on timestep via sinusoidal embeddings + Adaptive Group Norm.
CLIP Text Encoder: Converts text prompts to embedding sequences. These embeddings are injected into the U-Net via cross-attention at multiple resolution levels.

Generation: Start with random noise in latent space. Iteratively denoise with the U-Net (conditioned on text). Apply classifier-free guidance: $\epsilon = \epsilon_u + s \cdot (\epsilon_c - \epsilon_u)$ where $s$ is the guidance scale (7.5-12). Decode the final latent to an image with the VAE decoder.

SDXL improvements: Larger U-Net, dual text encoders (CLIP ViT-L + OpenCLIP ViT-bigG), an optional second-stage refiner, and training at higher resolution."
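The guidance formula and the latent compression ratio are both one-liners. The sketch treats noise predictions as flat lists of floats; the function names are illustrative.

```python
def classifier_free_guidance(eps_uncond, eps_cond, scale):
    """eps = eps_u + s * (eps_c - eps_u), applied element-wise."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

def compression_ratio(image_shape, latent_shape):
    """Ratio of element counts between pixel space and latent space."""
    def numel(shape):
        n = 1
        for dim in shape:
            n *= dim
        return n
    return numel(image_shape) / numel(latent_shape)

eps_u, eps_c = [0.0, 1.0], [1.0, 1.0]
guided = classifier_free_guidance(eps_u, eps_c, scale=7.5)
ratio = compression_ratio((512, 512, 3), (64, 64, 4))
```

A scale of 1 returns the conditional prediction unchanged and 0 returns the unconditional one; values above 1 extrapolate past the conditional prediction, which is what strengthens prompt adherence.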

Scoring: Strong Hire = all three components + cross-attention conditioning + CFG formula + latent space advantage + SDXL improvements. Lean Hire = knows it is a diffusion model but not the architecture details. No Hire = cannot explain the pipeline.

Q41: [DeepMind] Explain the scaling laws for neural language models.

Model Answer: "Neural scaling laws describe power-law relationships between model performance and resources:

$L(N) = (N_c/N)^{\alpha_N}$ where $\alpha_N \approx 0.076$ (Kaplan)
$L(D) = (D_c/D)^{\alpha_D}$ where $\alpha_D \approx 0.095$
$L(C) = (C_c/C)^{\alpha_C}$ where $\alpha_C \approx 0.050$

Kaplan (2020): For fixed compute, scale model size preferentially. Led to large but undertrained models.

Chinchilla (2022): Scale N and D equally. Optimal: ~20 tokens per parameter. Chinchilla (70B parameters, 1.4T tokens) beat Gopher (280B parameters, 300B tokens).

Compute formula: $C \approx 6ND$ FLOPs. This enables compute budgeting before training.

Beyond language: Similar scaling laws hold for vision (ViT), multimodal (CLIP), and code models, though the exponents differ.

Open questions: (1) Do scaling laws predict emergent capabilities? (debated). (2) When do scaling laws break down? (data quality bottleneck, architectural limitations). (3) How do they change with different training objectives (RLHF, instruction tuning)?"
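Combining $C \approx 6ND$ with the roughly 20 tokens-per-parameter rule gives a closed form for compute-optimal sizing: $N = \sqrt{C / 120}$ and $D = 20N$. The sketch below applies it, and also shows how little a power law with a small exponent moves the loss; the function names are hypothetical.

```python
import math

def compute_optimal(compute_flops, tokens_per_param=20):
    """Solve C = 6 * N * D with D = tokens_per_param * N for N and D."""
    n_params = math.sqrt(compute_flops / (6 * tokens_per_param))
    return n_params, tokens_per_param * n_params

def loss_ratio_after_scaling(factor, alpha):
    """For L(X) = (X_c / X)^alpha, scaling X by `factor` multiplies the loss by factor^-alpha."""
    return factor ** (-alpha)

chinchilla_compute = 6 * 70e9 * 1.4e12
n_opt, d_opt = compute_optimal(chinchilla_compute)
ratio_10x_params = loss_ratio_after_scaling(10, alpha=0.076)
```

Plugging in Chinchilla's budget recovers 70B parameters and 1.4T tokens. With $\alpha_N \approx 0.076$, a 10x increase in parameters lowers the loss term by only about 16%, which is why scaling gains need orders of magnitude.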

Scoring: Strong Hire = all three power laws + Kaplan vs Chinchilla + 6ND formula + open questions. Lean Hire = knows the general trend but not specific formulas. No Hire = has not heard of scaling laws.

Q42: [NVIDIA] How does tensor parallelism work in Megatron-LM?

Model Answer: "Megatron-LM splits individual transformer layers across GPUs using column and row parallelism.

MLP block: $Y = \text{GeLU}(XA)B$ . Split $A$ column-wise across $N$ GPUs - each computes $\text{GeLU}(XA_i)$ independently (GeLU is element-wise, no communication). Split $B$ row-wise - each computes a partial sum $\text{GeLU}(XA_i)B_i$ . One AllReduce to sum: $Y = \sum_i \text{GeLU}(XA_i)B_i$ .

Attention block: Split Q, K, V projections column-wise across GPUs. Each GPU computes attention for its heads independently. Split output projection row-wise. One AllReduce.

Result: 2 AllReduces per transformer layer in the forward pass (one for attention, one for MLP), and 2 more in the backward pass. Each AllReduce transfers $O(\text{batch} \times \text{seq} \times \text{hidden})$ elements.

Requirements: NVLink (900 GB/s on H100) is essential because communication happens every layer. This is why TP is limited to within a single node. TP degree is typically 2, 4, or 8 matching the node's GPU count.

Megatron also supports pipeline parallelism and sequence parallelism (splitting LayerNorm and dropout across the TP group to save activation memory)."
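The column/row split of the MLP block can be verified on small matrices: summing the per-GPU partial outputs reproduces the unsplit result. In this sketch "GPUs" are just list slices and the AllReduce is a plain sum; it is an illustration of the math, not Megatron-LM code.

```python
import math

def matmul(X, W):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*W)] for row in X]

def gelu(M):
    """Exact GeLU, applied element-wise."""
    return [[0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0))) for v in row] for row in M]

def split_columns(W, n):
    width = len(W[0]) // n
    return [[row[i * width:(i + 1) * width] for row in W] for i in range(n)]

def split_rows(W, n):
    height = len(W) // n
    return [W[i * height:(i + 1) * height] for i in range(n)]

def mlp_reference(X, A, B):
    return matmul(gelu(matmul(X, A)), B)

def mlp_tensor_parallel(X, A, B, n_gpus):
    # Each "GPU" holds one column shard of A and the matching row shard of B.
    partials = [matmul(gelu(matmul(X, A_i)), B_i)
                for A_i, B_i in zip(split_columns(A, n_gpus), split_rows(B, n_gpus))]
    # AllReduce: element-wise sum of the partial outputs.
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]

X = [[1.0, -2.0], [0.5, 0.25]]
A = [[0.1, -0.2, 0.3, 0.4], [0.5, 0.6, -0.7, 0.8]]
B = [[1.0, 0.0], [0.5, -1.0], [-0.3, 0.2], [0.7, 0.9]]
Y_full = mlp_reference(X, A, B)
Y_tp = mlp_tensor_parallel(X, A, B, n_gpus=2)
```

The split works without communication in the middle because GeLU is element-wise: applying it to a column shard of $XA$ gives exactly the matching columns of $\text{GeLU}(XA)$.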

Scoring: Strong Hire = column/row split details + AllReduce count + NVLink requirement + sequence parallelism. Lean Hire = knows TP splits layers but not the details. No Hire = cannot explain how a layer is split.

Q43: [Startup] You have a GPU budget of $10K/month. What is the largest model you can fine-tune and serve?

Model Answer: "Budget allocation: $5K training, $5K serving (adjustable).

Training ($5K/month):

A100 80GB spot instances: ~$1.50/hour = ~3,300 GPU-hours/month
With QLoRA, a 70B model needs ~1 A100 for fine-tuning
Fine-tuning 70B on 10K examples: ~4-8 hours = $6-12
Can fine-tune 70B hundreds of times or run extensive hyperparameter searches
Alternatively: 2-4 A100s for training a smaller model (7-13B) from scratch on domain data

Serving ($5K/month):

A10G instances: ~$0.75/hour = ~6,600 GPU-hours/month
70B INT4: needs 2x A10G (24GB each) = $1.50/hour for one replica = ~3,300 replica-hours/month
Throughput per replica: ~30 tokens/second at batch=1
For ~1M tokens/day output: one replica is sufficient ($1,100/month)
Budget allows 3-4 replicas for redundancy/throughput

Recommendation: Fine-tune Llama 3 70B with QLoRA for quality-critical tasks, or fine-tune 7-13B for latency-sensitive applications. Serve with vLLM on A10G instances with INT4 quantization. Use a 7B model as a router or for simple queries to reduce 70B usage."
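The cost figures in this answer follow from a few divisions. The sketch below reruns them using the prices quoted above and an assumed 730 hours per month; the helper names are hypothetical, and real cloud prices vary.

```python
HOURS_PER_MONTH = 730  # assumed average month length in hours

def gpu_hours(budget_usd, price_per_gpu_hour):
    return budget_usd / price_per_gpu_hour

def replica_monthly_cost(gpus_per_replica, price_per_gpu_hour):
    return gpus_per_replica * price_per_gpu_hour * HOURS_PER_MONTH

def daily_token_capacity(tokens_per_s):
    return tokens_per_s * 86400

training_hours = gpu_hours(5000, 1.50)               # A100 spot
one_replica = replica_monthly_cost(2, 0.75)          # 2x A10G for 70B INT4
affordable_replicas = int(5000 // one_replica)
capacity = daily_token_capacity(30)                  # batch=1 throughput
```

At 30 tokens/second a single always-on replica can emit about 2.6M tokens/day, so the ~1M tokens/day target leaves headroom even before batching.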

Scoring: Strong Hire = specific cost calculations + QLoRA for training + quantization for serving + budget allocation strategy + practical deployment plan. Lean Hire = reasonable plan but no cost calculations. No Hire = "use an API."

Q44: [Google] How does the Mixture-of-Depths approach work?

Model Answer: "Mixture of Depths (MoD) applies the idea from MoE to transformer layers themselves - not all tokens need processing by every layer. A learned router decides which tokens skip a layer entirely (passing through only the residual connection) and which tokens get full computation.

Mechanism: At each layer, the router scores each token. Only the top-K tokens (based on a capacity ratio, e.g., 50%) get processed by the attention + FFN block. The remaining tokens pass through unchanged via the residual connection.

Benefits: Reduces compute by 30-50% with minimal quality loss. Particularly effective because many tokens (stop words, repeated context) do not need deep processing - the model learns to allocate compute where it is needed.

Comparison to early exit: Early exit (tokens stop processing partway through the network) is harder to batch efficiently. MoD maintains the full depth for all tokens but selectively applies computation, preserving batch structure."
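The routing step can be sketched with scalar "tokens" and a stand-in block. This simplified version omits details such as weighting the block output by the router score, and the names are hypothetical.

```python
def mixture_of_depths_layer(tokens, router_scores, block, capacity_ratio=0.5):
    """Apply `block` (with residual) to the top-scoring tokens; pass the rest through."""
    k = max(1, int(len(tokens) * capacity_ratio))
    ranked = sorted(range(len(tokens)), key=lambda i: router_scores[i], reverse=True)
    selected = set(ranked[:k])
    return [tok + block(tok) if i in selected else tok
            for i, tok in enumerate(tokens)]

tokens = [1.0, 2.0, 3.0, 4.0]
scores = [0.9, 0.1, 0.8, 0.2]
out = mixture_of_depths_layer(tokens, scores, block=lambda t: 10 * t)
```

Because exactly k tokens are processed per layer, the tensor shapes are known ahead of time, which is what keeps this approach batch-friendly compared with early exit.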

Scoring: Strong Hire = routing mechanism + capacity ratio + comparison to early exit + compute savings. Lean Hire = knows the concept but not the details. No Hire = confuses with MoE.

Q45: [Anthropic/OpenAI] Explain the difference between DPO and PPO for LLM alignment.

Model Answer: "Both align LLMs with human preferences, but differ in approach:

PPO (Proximal Policy Optimization): Three-stage process - SFT, train reward model on preference data, then RL to maximize reward while staying close to SFT policy (KL penalty). Requires training and maintaining a separate reward model. More flexible - can optimize arbitrary reward functions.

DPO (Direct Preference Optimization): Shows that the optimal RL solution has a closed-form expression relating the policy to the reward. This means you can directly optimize the policy from preference data without training a reward model. The loss: $L = -\log \sigma(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)})$ where $y_w$ is the preferred response and $y_l$ is the dispreferred response.

Tradeoffs:

DPO: simpler (no reward model, no RL), more stable training, but limited to pairwise preferences, potentially less flexible
PPO: more complex but can optimize for multi-dimensional rewards, can use iterative online data collection, handles reward hacking better with the explicit reward model
In practice, DPO is used more often due to simplicity, but top labs still use PPO or variants for frontier models"
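The DPO loss for a single preference pair needs only four sequence log-probabilities. The sketch below takes them as plain floats and uses a numerically stable form of $-\log \sigma(\cdot)$; the function name and the default $\beta = 0.1$ are assumptions for illustration.

```python
import math

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigmoid(beta * [(log-ratio of preferred) - (log-ratio of dispreferred)])."""
    margin = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    # Stable evaluation of log(1 + exp(-margin)) for either sign of margin.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

at_init = dpo_loss(-10.0, -12.0, -10.0, -12.0)   # policy equals reference
improved = dpo_loss(-9.0, -13.0, -10.0, -12.0)   # policy favors the preferred response
worse = dpo_loss(-11.0, -11.0, -10.0, -12.0)     # policy drifts toward the dispreferred one
```

When the policy equals the reference the loss is $\log 2$; it falls as the policy raises the preferred response relative to the reference and rises in the opposite case.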

Scoring: Strong Hire = DPO derivation insight + loss formula + tradeoffs + when each is preferred. Lean Hire = knows both exist but cannot compare. No Hire = cannot explain either.

Section 5 - Quick-Fire Questions (30 Seconds Each)

These test rapid recall. Give a 1-2 sentence answer for each.

QF1: What is the universal approximation theorem? A neural network with one hidden layer and non-linear activation can approximate any continuous function on a compact domain to arbitrary precision, given enough neurons. It guarantees existence, not efficiency - you may need impractically many neurons.

QF2: What is a residual connection? $y = f(x) + x$ - adding the input directly to the output. Enables gradient flow in deep networks and allows layers to learn residual functions rather than full mappings.

QF3: What is positional encoding in transformers? Since self-attention has no built-in notion of token order (it is permutation-equivariant), positional encodings inject position information. Sinusoidal (original), learned (GPT-2), or rotary (RoPE, Llama) encodings. RoPE encodes relative position through rotation of Q and K vectors.

QF4: What is the softmax temperature? Dividing logits by $T$ before softmax: $p_i = \exp(z_i/T) / \sum \exp(z_j/T)$ . $T > 1$ flattens the distribution (more random), $T < 1$ sharpens it (more deterministic).
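A short sketch of temperature scaling, with illustrative logits and a hypothetical function name:

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by T, then apply a numerically stable softmax."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, 0.5)
base = softmax_with_temperature(logits, 1.0)
flat = softmax_with_temperature(logits, 2.0)
```

The top token's probability is highest at T=0.5 and lowest at T=2.0, matching the sharpen/flatten description.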

QF5: What is beam search? A decoding strategy that maintains the top-B candidates at each generation step. Broader than greedy (B=1) but not as diverse as sampling. Commonly used for translation but not for open-ended generation.
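
The difference between greedy decoding and beam search can be sketched on a toy model. `TOY_MODEL` below is a hypothetical lookup table of next-token probabilities, standing in for a real language model, chosen so that the greedy first choice leads to a worse full sequence.

```python
import math

# Hypothetical next-token distributions, keyed by the prefix generated so far.
TOY_MODEL = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.5, "y": 0.5},
    ("B",): {"x": 0.9, "y": 0.1},
}

def beam_search(model, beam_width, steps):
    """Keep the beam_width highest-scoring prefixes at each step."""
    beams = [((), 0.0)]  # (prefix, cumulative log-probability)
    for _ in range(steps):
        candidates = []
        for prefix, score in beams:
            for token, prob in model[prefix].items():
                candidates.append((prefix + (token,), score + math.log(prob)))
        # Sort by cumulative log-probability and prune to the beam width.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]

greedy_seq, greedy_score = beam_search(TOY_MODEL, beam_width=1, steps=2)
beam_seq, beam_score = beam_search(TOY_MODEL, beam_width=2, steps=2)
```

With `beam_width=1` the search commits to "A" and ends with probability 0.3; with `beam_width=2` it keeps "B" alive and finds ("B", "x") with probability 0.36.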

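QF6: What is gradient accumulation? Summing gradients over multiple mini-batches before updating weights. Simulates a larger batch size when GPU memory is limited. Updating once every K mini-batches gives an effective batch size K times larger; each mini-batch loss is typically scaled by 1/K so the accumulated gradient matches the large-batch mean.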
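
The equivalence with a larger batch can be checked on a toy problem. The sketch below uses a 1-D linear model with a mean-squared-error loss and made-up data, and assumes equal-sized micro-batches; in a real framework the same idea is usually implemented by scaling the loss by 1/K before each backward pass.

```python
def mse_grad(w, xs, ys):
    """Gradient w.r.t. w of the mean squared error for the model y_hat = w * x."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n

# Hypothetical data: 8 examples, split into K = 4 micro-batches of 2.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]
w = 0.5
K = 4
micro = len(xs) // K

accumulated = 0.0
for k in range(K):
    batch = slice(k * micro, (k + 1) * micro)
    # Scale each micro-batch gradient by 1/K so the sum is a mean over all data.
    accumulated += mse_grad(w, xs[batch], ys[batch]) / K

full_batch = mse_grad(w, xs, ys)
```

`accumulated` and `full_batch` agree up to floating-point error, so a single optimizer step on the accumulated gradient matches a step on the full batch.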

QF7: What is a skip connection vs a residual connection? Same concept in most contexts. In ResNets, residual connection means $f(x) + x$ . In U-Nets, skip connections concatenate features from encoder to decoder. The term depends on the architecture.

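QF8: What is the KV cache in transformer inference? During autoregressive generation, key and value matrices from previous tokens are cached so they do not need to be recomputed. Size: $2 \times n_{\text{layers}} \times n_{\text{heads}} \times d_{\text{head}} \times \text{seq\_len} \times \text{batch}$ elements, multiplied by the bytes per element (2 for FP16/BF16) to get memory in bytes. The leading 2 covers keys and values, and $n_{\text{heads}}$ here is the number of key/value heads, which is smaller under grouped-query attention. It is often the memory bottleneck for long contexts and large batches.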
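
As a worked example of the KV cache size, the sketch below multiplies the element count by an explicit bytes-per-element factor. The configuration (32 layers, 32 KV heads, head dimension 128, FP16) is a hypothetical 7B-class setup chosen for round numbers, not the spec of a particular model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, batch, bytes_per_elem=2):
    """KV cache size in bytes; the leading 2 accounts for keys and values."""
    elements = 2 * n_layers * n_kv_heads * d_head * seq_len * batch
    return elements * bytes_per_elem

# Hypothetical config: 32 layers, 32 KV heads, d_head = 128, FP16 (2 bytes).
cache = kv_cache_bytes(32, 32, 128, seq_len=4096, batch=1)
cache_gib = cache / 2**30  # 2.0 GiB per sequence at 4096 tokens

# The cache grows linearly with batch size and sequence length.
cache_batch16 = kv_cache_bytes(32, 32, 128, seq_len=4096, batch=16)
```

At batch 16 this hypothetical setup needs 32 GiB for the cache alone, which is why KV cache size often limits serving throughput.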

QF9: What is model quantization? Reducing weight precision from FP32/FP16 to INT8/INT4. Reduces model size by 2-8x and speeds up inference. Post-training quantization (GPTQ, AWQ) or quantization-aware training (QAT) for better quality.
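
The basic idea behind INT8 quantization can be sketched with symmetric per-tensor rounding. This is only the simplest scheme; methods such as GPTQ and AWQ are considerably more involved. The weights and function names below are made up for illustration.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    # One scale for the whole tensor; fall back to 1.0 if all weights are zero.
    scale = max(abs(w) for w in weights) / 127.0 or 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Map the stored integers back to approximate float weights."""
    return [v * scale for v in q]

weights = [0.42, -1.27, 0.05, 0.9]  # hypothetical FP32 weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight is stored as one byte plus a shared scale, and the round-trip error is bounded by half the scale.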

QF10: What is the difference between encoder-only, decoder-only, and encoder-decoder models? Encoder-only (BERT): bidirectional, for classification/embedding. Decoder-only (GPT, Llama): causal, for generation. Encoder-decoder (T5, BART): bidirectional encoder + causal decoder, for seq-to-seq tasks.

QF11: What is tokenization in NLP? Converting text to integer sequences. BPE (Byte-Pair Encoding) starts with characters and iteratively merges frequent pairs. SentencePiece is a common implementation. Typical vocabulary: 32K-128K tokens. Subword tokenization handles unknown words.

QF12: What is the difference between pre-training and fine-tuning? Pre-training: learn general representations on large unlabeled data (next token, MLM). Fine-tuning: adapt to a specific task on smaller labeled data. Pre-training is expensive (weeks on hundreds of GPUs), fine-tuning is cheap (hours on 1-8 GPUs).

QF13: What is causal masking? In decoder transformers, a triangular mask prevents tokens from attending to future positions. Ensures autoregressive generation - each token only depends on previous tokens. Applied as $-\infty$ to future positions before softmax.
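
A minimal sketch of causal masking on a small score matrix, using plain Python lists rather than a tensor library. The all-zero scores are a hypothetical input chosen so the resulting weights are easy to read.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax over scores with future positions masked to -inf."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Position i may attend to positions 0..i only.
        masked = [scores[i][j] if j <= i else -math.inf for j in range(n)]
        m = max(masked)
        exps = [math.exp(s - m) for s in masked]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

scores = [[0.0] * 3 for _ in range(3)]  # hypothetical uniform attention logits
weights = causal_attention_weights(scores)
```

With uniform scores, row 0 puts all weight on itself, row 1 splits evenly over positions 0 and 1, and every masked future position receives exactly zero weight.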

QF14: What is perplexity? $\text{PPL} = \exp(-\frac{1}{N}\sum_{i=1}^{N} \log p(x_i \mid x_{<i}))$ - the exponential of the average per-token negative log-likelihood. Lower is better. PPL of 10 means the model is "as confused as if it were choosing between 10 equally likely options."
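
The "10 equally likely options" reading of perplexity can be verified directly. The sketch assumes you already have the probability the model assigned to each observed token; the probabilities are made up.

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-likelihood of the observed tokens."""
    n = len(token_probs)
    avg_nll = -sum(math.log(p) for p in token_probs) / n
    return math.exp(avg_nll)

# A model that assigns probability 0.1 to every observed token.
uniform_ppl = perplexity([0.1] * 5)

# A more confident (hypothetical) model has lower perplexity.
confident_ppl = perplexity([0.9, 0.8, 0.95, 0.7, 0.85])
```

`uniform_ppl` comes out at 10 up to floating-point error, matching the interpretation in the answer.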

QF15: What is the curse of dimensionality? In high-dimensional spaces, data points become equidistant, volume concentrates near the surface of hyperspheres, and the amount of data needed to cover the space grows exponentially. Makes distance-based methods unreliable.

QF16: What is an embedding layer? A lookup table that maps discrete tokens to continuous vectors. Equivalent to a linear layer with one-hot input. Learned during training to capture semantic relationships (similar tokens have similar embeddings).

QF17: What is attention masking? Selectively preventing attention between certain positions. Causal mask (autoregressive), padding mask (ignore padding tokens), or custom masks (prefix LM, sliding window). Applied by adding $-\infty$ to the attention logits before softmax.

QF18: What is weight decay? Adding $\frac{\lambda}{2} \|w\|^2$ to the loss (L2 regularization), which for plain SGD is equivalent to multiplying weights by $(1 - \lambda \cdot \text{lr})$ each step. With Adam the two are not equivalent, because the L2 term in the gradient is rescaled by the adaptive learning rate. AdamW decouples weight decay from that rescaling and applies it directly to the weights.
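
The difference between L2 regularization and decoupled weight decay shows up after a single step on one scalar weight. The sketch assumes the penalty is $\frac{\lambda}{2} w^2$ (so it adds $\lambda w$ to the gradient) and only models the first Adam step, where the bias-corrected moments reduce to the gradient and its square. All values are made up.

```python
import math

def sgd_l2_step(w, grad, lr, lam):
    """SGD with an L2 penalty (lam/2) * w**2 folded into the gradient."""
    return w - lr * (grad + lam * w)

def sgd_decay_step(w, grad, lr, lam):
    """SGD with multiplicative weight decay."""
    return (1 - lr * lam) * w - lr * grad

def adam_first_step(w, grad, lr, eps=1e-8):
    """First Adam step: bias-corrected m_hat = grad and v_hat = grad**2."""
    return w - lr * grad / (math.sqrt(grad * grad) + eps)

def adam_l2_first_step(w, grad, lr, lam):
    """Adam with L2: the penalty gradient passes through the adaptive scaling."""
    return adam_first_step(w, grad + lam * w, lr)

def adamw_first_step(w, grad, lr, lam):
    """AdamW: decay is applied to the weight directly, outside the scaling."""
    return adam_first_step(w, grad, lr) - lr * lam * w

w, grad, lr, lam = 2.0, 0.5, 0.1, 0.01  # hypothetical values
sgd_a = sgd_l2_step(w, grad, lr, lam)
sgd_b = sgd_decay_step(w, grad, lr, lam)
adam_l2 = adam_l2_first_step(w, grad, lr, lam)
adamw = adamw_first_step(w, grad, lr, lam)
```

The two SGD variants land on the same weight, while Adam with L2 and AdamW differ: under Adam the normalization largely cancels the penalty term, so the weight is barely decayed.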

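QF19: What is the lottery ticket hypothesis? Dense neural networks contain sparse subnetworks (winning tickets) that, when trained in isolation from the original initialization, achieve comparable performance. The tickets were originally found by iterative magnitude pruning of a trained dense network, so the result suggests sparse training from the start may be possible but does not by itself provide a cheap way to find the subnetwork.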

QF20: What is contrastive learning? Learning representations by pulling positive pairs (augmented views of same data) together and pushing negative pairs apart in embedding space. SimCLR, MoCo, and CLIP are key examples. CLIP aligns images and text in a shared space.

Cross-Reference Index

Topic	Detailed Page
Backpropagation and gradient flow	01 - Backpropagation
Activation functions (ReLU, GELU, etc.)	02 - Activation Functions
Convolutional Neural Networks	03 - CNNs
RNNs, LSTMs, GRUs	04 - RNNs & LSTMs
Attention mechanism	05 - Attention Mechanism
Transformer architecture	06 - Transformer Architecture
Normalization (BatchNorm, LayerNorm, RMSNorm)	07 - Normalization
Training techniques (init, clipping, mixed precision, distillation)	08 - Training Techniques
Distributed training (parallelism, ZeRO, scaling laws)	09 - Distributed Training
Generative models (VAE, GAN, diffusion)	10 - Generative Models

Interview Cheat Sheet

Round Type	Questions	Time Per Q	Depth Expected
Phone Screen	5-8	2-3 min	Definition + intuition + one example
Technical Deep Dive	3-5	8-12 min	Full explanation + derivation + tradeoffs + follow-ups
Senior/Staff	2-3	15-20 min	System design + mathematical depth + production considerations
Quick-Fire	15-20	15-30 sec	Crisp 1-2 sentence answer

Answer Framework for Every Question

WHAT: Define the concept (1-2 sentences)
WHY: Explain the intuition - why does this work or matter? (2-3 sentences)
HOW: Mathematical formulation or algorithm (show equations if relevant)
WHEN: When to use it and when NOT to (specific scenarios)
TRADE-OFFS: Limitations, alternatives, what you would consider in production

Spaced Repetition Checkpoints

Day 0 - Initial Learning

Answer all 15 screening questions aloud (time yourself: 2 min each)
Grade yourself on each - identify any "No Hire" areas
Answer all 20 quick-fire questions in under 5 minutes total
Read the detailed pages for any topics where you scored "No Hire"

Day 3 - Recall

Without looking, answer Q1-Q10 aloud again
Attempt two technical deep dive questions (Q16-Q25)
Review your weakest 3 topics from Day 0
Practice the quick-fire round again - target under 30 seconds each

Day 7 - Application

Answer all screening questions without preparation
Attempt Q26 (training infrastructure design) with full system thinking
Practice answering with follow-up questions (have a friend probe deeper)
Score yourself against the rubrics

Day 14 - Integration

Do a mock interview: 5 random questions from any section, 45 minutes total
Practice company-specific questions for your target companies (Q33-Q45)
Attempt staff-level questions Q26-Q32
Identify gaps in your knowledge and fill them with the detailed pages

Day 21 - Mastery

Full mock interview with all difficulty levels mixed
Can you answer any question in this bank confidently?
Can you handle 2-3 levels of follow-up on each answer?
Practice the NaN debugging question (Q30) and infrastructure design (Q26) end-to-end in under 15 minutes each
