Activation Functions - The Nonlinearity That Makes Deep Learning Work
Reading time: ~35 min | Interview relevance: High | Roles: MLE, AI Eng, Research Engineer, NLP Eng, CV Eng
The Real Interview Moment
You are in a Meta MLE phone screen. The interviewer asks what seems like a simple question: "Why do we need activation functions?" You answer "to add nonlinearity" - correct, but generic. She follows up: "Okay, so why not just use sigmoid? It is nonlinear." You mention vanishing gradients. She pushes further: "So we switched to ReLU. But GPT and BERT use GELU, not ReLU. Why? What is different about Transformers that makes GELU a better choice?"
Now you are in deep water. You know GELU is "smoother" but you cannot articulate why that matters for Transformers specifically, or why the smoothness at zero is thought to help gradient flow through the feed-forward sublayers where the activation actually sits. The interview shifts from "this candidate knows the basics" to "this candidate memorized names without understanding the design reasoning."
Activation function questions seem easy on the surface but quickly reveal the depth of your understanding. This page gives you the mathematical properties, the gradient analysis, the historical context, and - most importantly - the reasoning framework for choosing the right activation for any architecture.
What You Will Master
- Explain why neural networks need nonlinear activation functions (universal approximation theorem)
- State the formula, derivative, and range of every major activation function
- Analyze the gradient properties of each activation and connect them to vanishing/exploding gradient problems
- Explain the dying ReLU problem with a concrete numerical example
- Compare ReLU variants (Leaky ReLU, PReLU, ELU) and when each is appropriate
- Derive why GELU is preferred in Transformers (smooth approximation to a stochastic regularizer)
- Distinguish GELU vs SiLU/Swish vs Mish and state which modern architectures use which
- Apply the softmax function correctly and explain its temperature parameter
- Navigate a decision flowchart for choosing activation functions in any architecture
- Answer activation function interview questions with mathematical precision
Self-Assessment: Where Are You Now?
| Skill | 1 -- Cannot | 2 -- Vaguely | 3 -- Can Explain | 4 -- Can Derive | 5 -- Can Teach | Your Score |
|---|---|---|---|---|---|---|
| Explain why nonlinearity is needed | | | | | | ___ |
| Write sigmoid/tanh formulas and derivatives | | | | | | ___ |
| Explain vanishing gradients with sigmoid | | | | | | ___ |
| Explain ReLU advantages and dying ReLU | | | | | | ___ |
| Compare Leaky ReLU, PReLU, ELU | | | | | | ___ |
| Explain GELU formula and why Transformers use it | | | | | | ___ |
| Distinguish GELU vs SiLU vs Mish | | | | | | ___ |
| Apply softmax with temperature scaling | | | | | | ___ |
Target: All 4s and 5s before your interview.
Part 1 - Why Nonlinearity Matters
The Linear Collapse Problem
Without activation functions, a neural network is just a sequence of linear transformations:
$$h_1 = W_1 x + b_1, \qquad h_2 = W_2 h_1 + b_2$$
Composing these:
$$h_2 = W_2 (W_1 x + b_1) + b_2 = (W_2 W_1)\,x + (W_2 b_1 + b_2) = W' x + b'$$
No matter how many layers you stack, the result is equivalent to a single linear transformation $W' x + b'$. The network cannot learn any function more complex than linear regression. All the depth is wasted.
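The collapse can be checked numerically. The sketch below uses made-up weights for a tiny 2 -> 3 -> 2 network with no activation (all values are illustrative assumptions, not taken from any real model) and compares the layer-by-layer output with the single collapsed layer $W' = W_2 W_1$, $b' = W_2 b_1 + b_2$.

```python
def matvec(W, x):
    """Matrix-vector product for lists of lists."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Matrix-matrix product for lists of lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

# Illustrative parameters (assumed values).
W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]   # 3x2
b1 = [0.5, -0.5, 1.0]
W2 = [[1.0, 0.0, 2.0], [-1.0, 1.0, 0.0]]     # 2x3
b2 = [0.0, 1.0]
x = [2.0, -3.0]

# Two stacked linear layers, no activation in between.
h1 = add(matvec(W1, x), b1)
h2 = add(matvec(W2, h1), b2)

# One equivalent linear layer.
W_prime = matmul(W2, W1)
b_prime = add(matvec(W2, b1), b2)
collapsed = add(matvec(W_prime, x), b_prime)

print(h2, collapsed)  # both are [4.5, 7.0]
```

Inserting any nonlinearity between the two layers breaks this equality, which is the whole point of an activation function.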
"Without activation functions, a deep network collapses to a single linear transformation - stacking linear layers gives you another linear layer. Activation functions introduce nonlinearity, which is what allows neural networks to approximate any continuous function (the universal approximation theorem). The choice of activation affects training dynamics: sigmoid causes vanishing gradients because its derivative is at most 0.25, ReLU solves this with a constant gradient of 1 for positive inputs but creates dead neurons, and GELU provides a smooth approximation that works best in Transformers because its gentle curve near zero allows more nuanced gradient flow through attention layers."
Universal Approximation Theorem
With at least one hidden layer and a nonlinear (more precisely, non-polynomial) activation function, a neural network can approximate any continuous function on a compact domain to arbitrary precision (given enough neurons). The activation function is what unlocks this power.
However, universality says nothing about learnability. A network might theoretically be able to represent a function but fail to learn it due to vanishing gradients, dead neurons, or poor optimization landscape. This is why the choice of activation function matters enormously in practice.
Part 2 - The Classic Activations: Sigmoid and Tanh
Sigmoid
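$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
Properties:
- Range: $(0, 1)$ - outputs interpretable as probabilities
- Derivative: $\sigma'(x) = \sigma(x)\,(1 - \sigma(x))$
- Maximum derivative: $0.25$ at $x = 0$
- Saturates for large $|x|$: $\sigma(x) \to 1$ as $x \to \infty$, $\sigma(x) \to 0$ as $x \to -\infty$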
- Output is always positive (not zero-centered)
Why sigmoid fell out of favor:
1. Vanishing gradients: $\sigma'(x) \le 0.25$. Through $n$ layers, the activation derivatives alone scale the gradient by a factor of at most $0.25^{n}$ (ignoring the weight matrices). A 10-layer network attenuates gradients by $0.25^{10} \approx 9.5 \times 10^{-7}$.
2. Not zero-centered: Sigmoid outputs are always positive. This means the gradients on all weights feeding a given neuron have the same sign (all positive or all negative), causing zig-zag updates during optimization. This slows convergence.
3. Expensive: Requires computing $e^{-x}$, which is slower than a simple comparison.
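To make the attenuation argument concrete, here is a minimal sketch in plain Python. The helper name `best_case_attenuation` is an assumption introduced for illustration; it only multiplies the best-case derivative $0.25$ across layers and ignores the weight matrices.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of sigmoid: s * (1 - s)."""
    s = sigmoid(x)
    return s * (1.0 - s)

def best_case_attenuation(num_layers):
    """Upper bound on the product of sigmoid derivatives over num_layers."""
    return 0.25 ** num_layers

print(sigmoid_grad(0.0))          # 0.25, the maximum
print(sigmoid_grad(5.0))          # far smaller once the unit saturates
print(best_case_attenuation(10))  # about 9.5e-07
```

Real networks usually do worse than this bound, because most pre-activations are not exactly zero.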
Where sigmoid is still used:
- Binary classification output (single probability)
- Gate functions in LSTMs and GRUs (values must be in $(0, 1)$)
- Attention gates in some architectures
Tanh
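$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$
Properties:
- Range: $(-1, 1)$ - zero-centered (unlike sigmoid)
- Derivative: $\tanh'(x) = 1 - \tanh^{2}(x)$
- Maximum derivative: $1$ at $x = 0$ (4x larger than sigmoid's $0.25$)
- Still saturates for large $|x|$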
Tanh vs sigmoid:
- Tanh is generally preferred over sigmoid in hidden layers because it is zero-centered
- Maximum derivative of 1.0 instead of 0.25 means slower gradient decay
- Still suffers from vanishing gradients, just less severely than sigmoid
- Both are largely replaced by ReLU and its variants in modern networks
Do NOT say "tanh does not have vanishing gradients." It does - for large $|x|$. The improvement over sigmoid is the maximum derivative of 1.0 (vs 0.25) and zero-centered output, but the saturation problem remains. ReLU-family activations avoid saturation only on the positive side, where the derivative is exactly 1.
Part 3 - ReLU: The Revolution
Rectified Linear Unit
$$\text{ReLU}(x) = \max(0, x)$$
Properties:
- Range: $[0, \infty)$
- Derivative: $1$ if $x > 0$, $0$ if $x < 0$
- In practice, the derivative at $x = 0$ is set to 0 (or sometimes 1)
- No saturation for positive inputs
- Computationally trivial: just a comparison and a mask
Why ReLU Revolutionized Deep Learning
ReLU solved three problems simultaneously:
1. No vanishing gradients (for positive inputs): The derivative is exactly 1 for all positive inputs, regardless of magnitude. The gradient passes through unchanged. A 100-layer network has gradient factor $1^{100} = 1$ through the ReLU gates (for active neurons).
2. Sparsity: For typical inputs, roughly 50% of neurons output zero. This creates a sparse representation, which has two benefits:
- Computational efficiency (zeros can be skipped by sparsity-aware kernels, although dense hardware does not get this saving automatically)
- Better representations (sparse codes are more linearly separable)
3. Computational efficiency: $\max(0, x)$ is a single comparison - far cheaper per element than computing $e^{x}$ for sigmoid or tanh. This matters at scale.
The Dying ReLU Problem
When a neuron's pre-activation is negative for all training examples, the neuron outputs zero for every input and receives zero gradient. It can never recover - it is "dead."
How neurons die:
- A large negative bias or unfortunate weight initialization pushes the pre-activation negative
- A large learning rate causes a weight update that pushes the pre-activation far negative
- Once the pre-activation $z < 0$ for all inputs in the training set, $\partial\,\text{ReLU}(z) / \partial z = 0$ for all inputs
- With zero gradient, the weights never update, and the neuron stays dead forever
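Concrete example: Suppose a neuron has all-negative weights $w < 0$ and a negative bias $b < 0$. For any non-negative input $x \ge 0$ (for example, the output of a previous ReLU layer), the pre-activation is $z = w^{\top} x + b \le b < 0$. The neuron outputs zero and receives zero gradient on every input, so it is permanently dead.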
How common is it? In poorly initialized or aggressively trained networks, 10-40% of neurons can die. With proper initialization (He init) and moderate learning rates, it is usually manageable (less than 5%).
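A small sketch of the dead-neuron situation and of the monitoring idea. The weights, bias, and inputs are hypothetical numbers chosen for illustration, and `dead_fraction` is an assumed helper name, not a framework API.

```python
def relu(z):
    return max(0.0, z)

def relu_grad(z):
    return 1.0 if z > 0 else 0.0

def pre_activation(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

# Hypothetical dead neuron: all-negative weights, negative bias.
w_dead, b_dead = [-0.5, -1.0], -2.0
# Hypothetical healthy neuron for comparison.
w_live, b_live = [1.0, 1.0], 0.0

# Non-negative inputs, e.g. outputs of a previous ReLU layer.
inputs = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [2.0, 2.0]]

zs_dead = [pre_activation(w_dead, b_dead, x) for x in inputs]
grads_dead = [relu_grad(z) for z in zs_dead]

def dead_fraction(activations_per_neuron):
    """Fraction of neurons whose activation is zero on every input."""
    dead = sum(1 for acts in activations_per_neuron
               if all(a == 0.0 for a in acts))
    return dead / len(activations_per_neuron)

layer_acts = [
    [relu(z) for z in zs_dead],
    [relu(pre_activation(w_live, b_live, x)) for x in inputs],
]
print(zs_dead)                    # every pre-activation is negative
print(grads_dead)                 # every gradient is zero
print(dead_fraction(layer_acts))  # 0.5: one of the two neurons is dead
```

In a real training loop the same statistic would be computed per layer over a batch or a validation set and logged over time.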
"When a candidate mentions the dying ReLU problem, I immediately follow up with: 'How would you detect dead neurons in a trained network, and what would you do about it?' A strong candidate says: 'Monitor the fraction of zero activations per layer during training. If it exceeds 20-30%, reduce the learning rate, check initialization, or switch to Leaky ReLU. You can also use a small L2 penalty to prevent weights from growing too large in magnitude.' A weak candidate just says 'use Leaky ReLU' without diagnosing the root cause."
Part 4 - ReLU Variants: Fixing the Dying Neuron Problem
Leaky ReLU
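$$\text{LeakyReLU}(x) = \begin{cases} x & x > 0 \\ \alpha x & x \le 0 \end{cases}$$
where $\alpha$ is a small constant (typically 0.01).
Properties:
- Derivative for $x < 0$: $\alpha$ (not zero - neurons cannot die)
- The small negative slope allows gradient flow even for negative inputs
- Introduces one hyperparameter $\alpha$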
PReLU (Parametric ReLU)
Same formula as Leaky ReLU, but $\alpha$ is a learnable parameter (one per channel or one per neuron). The network learns the optimal negative slope.
Key result: He et al. (2015) reported that PReLU reduced ImageNet top-1 error relative to ReLU by roughly 1.2 percentage points in a like-for-like comparison - a significant margin at the time.
ELU (Exponential Linear Unit)
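$$\text{ELU}(x) = \begin{cases} x & x > 0 \\ \alpha\,(e^{x} - 1) & x \le 0 \end{cases}$$
Properties:
- Smooth at $x = 0$ when $\alpha = 1$ (unlike ReLU)
- Saturates to $-\alpha$ for large negative inputs (adds noise robustness)
- Mean activations closer to zero than ReLU, whose outputs are never negative
- More computationally expensive than ReLU (requires $e^{x}$)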
SELU (Scaled ELU)
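$$\text{SELU}(x) = \lambda \cdot \text{ELU}(x;\, \alpha)$$
with specific constants $\alpha \approx 1.6733$ and $\lambda \approx 1.0507$.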
Key property: SELU is self-normalizing - activations automatically converge to zero mean and unit variance during training, without needing batch normalization. This works only with specific conditions (fully connected networks, proper initialization, no skip connections).
Comparison Table
| Activation | Formula ($x < 0$ region) | Gradient ($x < 0$) | Dead Neurons? | Zero-Centered? | Compute Cost |
|---|---|---|---|---|---|
| ReLU | $0$ | $0$ | Yes | No | Lowest |
| Leaky ReLU | $\alpha x$ | $\alpha$ | No | Approximately | Low |
| PReLU | $\alpha x$ (learned) | $\alpha$ (learned) | No | Approximately | Low |
| ELU | $\alpha\,(e^{x} - 1)$ | $\alpha e^{x}$ | No | Yes | Medium |
| SELU | $\lambda \alpha\,(e^{x} - 1)$ | $\lambda \alpha e^{x}$ | No | Yes (self-normalizing) | Medium |
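The variants in the table differ only in how they treat negative inputs. A minimal scalar sketch follows (plain Python rather than a framework implementation; the SELU constants are the standard published values, written to full double precision here):

```python
import math

SELU_ALPHA = 1.6732632423543772
SELU_LAMBDA = 1.0507009873554805

def relu(x):
    return max(0.0, x)

def leaky_relu(x, alpha=0.01):
    return x if x > 0 else alpha * x

def elu(x, alpha=1.0):
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def selu(x):
    return SELU_LAMBDA * elu(x, SELU_ALPHA)

# All four agree for positive inputs (up to SELU's scale) and differ below zero.
for name, f in [("relu", relu), ("leaky_relu", leaky_relu),
                ("elu", elu), ("selu", selu)]:
    print(name, f(-2.0), f(2.0))
```

PReLU is not shown separately because its forward pass is the same as `leaky_relu`; the only difference is that `alpha` is updated by gradient descent.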
Part 5 - Modern Activations: GELU, SiLU/Swish, and Mish
These are the activations that dominate modern architectures (2018-present). Understanding them is critical for LLM and Transformer interview questions.
GELU (Gaussian Error Linear Unit)
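$$\text{GELU}(x) = x \cdot \Phi(x)$$
where $\Phi(x)$ is the CDF of the standard normal distribution: $\Phi(x) = \frac{1}{2}\left[1 + \text{erf}\!\left(\frac{x}{\sqrt{2}}\right)\right]$.
Practical approximation:
$$\text{GELU}(x) \approx 0.5\,x\left(1 + \tanh\!\left[\sqrt{2/\pi}\,\left(x + 0.044715\,x^{3}\right)\right]\right)$$
The intuition behind GELU:
GELU can be understood as a smooth, probabilistic version of ReLU. For a given input $x$:
- If $x$ is large positive: $\Phi(x) \approx 1$, so $\text{GELU}(x) \approx x$ (passed through, like ReLU)
- If $x$ is large negative: $\Phi(x) \approx 0$, so $\text{GELU}(x) \approx 0$ (blocked, like ReLU)
- If $x$ is near zero: the transition is smooth and probabilistic, unlike ReLU's hard kink
The key idea is that GELU is the expected value of a stochastic regularizer. Imagine multiplying the input by a Bernoulli mask $m \sim \text{Bernoulli}(\Phi(x))$: larger inputs are more likely to be passed through, smaller inputs are more likely to be dropped - similar to dropout but input-dependent. GELU itself is deterministic; it is the expectation of that masked output, $\mathbb{E}[x \cdot m] = x\,\Phi(x)$.
Derivative:
$$\text{GELU}'(x) = \Phi(x) + x\,\phi(x)$$
where $\phi(x) = \frac{1}{\sqrt{2\pi}}\,e^{-x^{2}/2}$ is the standard normal PDF. The derivative is smooth everywhere - no discontinuity at $x = 0$.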
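The exact form, the tanh approximation, and the derivative can be compared directly with the standard library's `math.erf`. This is a scalar sketch for intuition, not how a framework kernel is written; the function names are chosen here for illustration.

```python
import math

def phi(x):
    """Standard normal PDF."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def Phi(x):
    """Standard normal CDF via erf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu(x):
    return x * Phi(x)

def gelu_tanh(x):
    """Tanh approximation of GELU."""
    c = math.sqrt(2.0 / math.pi)
    return 0.5 * x * (1.0 + math.tanh(c * (x + 0.044715 * x ** 3)))

def gelu_grad(x):
    return Phi(x) + x * phi(x)

# Largest gap between the exact and approximate forms on a grid over [-5, 5].
grid = [i / 100.0 for i in range(-500, 501)]
max_gap = max(abs(gelu(x) - gelu_tanh(x)) for x in grid)
print(max_gap)

print(gelu_grad(-0.01), gelu_grad(0.0), gelu_grad(0.01))  # varies smoothly through 0.5
```

The gradient passes through $0.5$ at $x = 0$ without a jump, in contrast to ReLU's step from 0 to 1.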
SiLU / Swish
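$$\text{SiLU}(x) = x \cdot \sigma(x) = \frac{x}{1 + e^{-x}}$$
SiLU (Sigmoid Linear Unit) and Swish (with $\beta = 1$) are the same function. The SiLU name was coined in the GELU paper (Hendrycks & Gimpel, 2016) and the function was studied for reinforcement learning by Elfwing et al. (arXiv 2017, published 2018). Swish was found independently by Google Brain through automated search (Ramachandran et al., 2017).
Properties:
- Smooth, non-monotonic (has a small dip below zero near $x \approx -1.28$)
- Bounded below: minimum value $\approx -0.278$ at $x \approx -1.28$
- Unbounded above: $\text{SiLU}(x) \to x$ as $x \to \infty$
- Self-gating: the function gates itself using $\sigma(x)$
Derivative:
$$\text{SiLU}'(x) = \sigma(x) + x\,\sigma(x)\,(1 - \sigma(x)) = \sigma(x)\,\bigl(1 + x\,(1 - \sigma(x))\bigr)$$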
Mish
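$$\text{Mish}(x) = x \cdot \tanh(\text{softplus}(x)) = x \cdot \tanh\!\left(\ln(1 + e^{x})\right)$$
Properties:
- Smooth, non-monotonic (similar to SiLU)
- Bounded below: minimum $\approx -0.309$ at $x \approx -1.19$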
- Unbounded above
- Slightly better than SiLU on some vision benchmarks but with higher compute cost
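The "small dip below zero" that GELU, SiLU, and Mish all share can be located with a simple grid search. The sketch below is a rough numerical check (the `grid_min` helper and its search range are assumptions made for illustration), not an analytic derivation.

```python
import math

def gelu(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def silu(x):
    return x / (1.0 + math.exp(-x))

def softplus(x):
    return math.log1p(math.exp(x))

def mish(x):
    return x * math.tanh(softplus(x))

def grid_min(f, lo=-5.0, hi=0.0, steps=5000):
    """Return (x, f(x)) at the lowest point of f on an evenly spaced grid."""
    xs = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    x_best = min(xs, key=f)
    return x_best, f(x_best)

for name, f in [("GELU", gelu), ("SiLU", silu), ("Mish", mish)]:
    x_min, f_min = grid_min(f)
    print(f"{name}: min {f_min:.4f} at x = {x_min:.3f}")
```

On this grid the minima come out near $-0.170$ (GELU, $x \approx -0.75$), $-0.278$ (SiLU, $x \approx -1.28$), and $-0.309$ (Mish, $x \approx -1.19$), so all three are non-monotonic, with GELU's dip the shallowest.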
GELU vs SiLU vs Mish: The Modern Comparison
| Property | GELU | SiLU/Swish | Mish |
|---|---|---|---|
| Formula | $x\,\Phi(x)$ | $x\,\sigma(x)$ | $x\,\tanh(\text{softplus}(x))$ |
| Smooth at 0 | Yes | Yes | Yes |
| Non-monotonic | Yes, slightly (dip at $x \approx -0.75$) | Yes (dip at $x \approx -1.28$) | Yes (dip at $x \approx -1.19$) |
| Compute cost | Medium (erf) | Low (sigmoid) | High (softplus + tanh) |
| Used in | BERT, GPT-2/3, ViT | EfficientNet, YOLOv5, LLaMA/Mistral (inside the gated FFN) | YOLOv4, some vision models |
| Dominant domain | NLP / Transformers | Vision / efficient architectures, gated LLM FFNs | Vision |
Why GELU Dominates Transformers
Three reasons GELU is the default activation in Transformer-based models:
1. Smooth gradient near zero. In a Transformer the activation sits in the feed-forward sublayer, where many pre-activations land near zero. ReLU's derivative jumps from 0 to 1 at that point, while GELU's changes gradually. This smoother gradient is commonly credited with more stable training, although it is an empirical explanation rather than a proven mechanism.
2. Probabilistic interpretation. GELU is the expectation of an input-dependent, dropout-like mask built from the standard normal CDF, so it is often described as carrying a mild implicit regularization. In Transformers, which are prone to overfitting on smaller datasets, this is argued to be beneficial.
3. Empirical performance. GPT (Radford et al., 2018) and BERT (Devlin et al., 2018) used GELU, and reported comparisons have generally favored it over ReLU in the Transformer architecture. GPT-2, GPT-3, ViT, and many other landmark models followed suit.
At OpenAI/Anthropic, you will be asked specifically about GELU and why it is used in Transformers. At companies like NVIDIA, the question shifts to "how do you implement GELU efficiently in CUDA?" (answer: use the tanh approximation, which avoids the expensive erf function). At startups, knowing that "Transformers use GELU, CNNs use ReLU" is usually sufficient.
Part 6 - Softmax: The Output Activation
Definition
$$\text{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$
Softmax converts a vector of arbitrary real numbers (logits) into a probability distribution: all outputs are in $(0, 1)$ and sum to 1.
Temperature Scaling
$$\text{softmax}(z;\, T)_i = \frac{e^{z_i / T}}{\sum_{j} e^{z_j / T}}$$
| Temperature | Effect | Use Case |
|---|---|---|
| $T \to 0$ | Approaches argmax (one-hot) | Hard predictions, greedy decoding |
| $T = 1$ | Standard softmax | Training, standard inference |
| $T > 1$ | Flatter distribution | Knowledge distillation, creative sampling |
| $T \to \infty$ | Uniform distribution | Maximum entropy |
Numerical Stability
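Direct computation of $e^{z_i}$ overflows for large $z_i$ (e.g., $z_i = 1000$). The standard trick:
$$\text{softmax}(z)_i = \frac{e^{z_i - \max_k z_k}}{\sum_{j} e^{z_j - \max_k z_k}}$$
Subtracting $\max_k z_k$ does not change the result (it cancels in the fraction) but ensures the largest exponential term is $e^{0} = 1$, preventing overflow.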
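A minimal sketch of the max-subtraction trick combined with temperature scaling, in plain Python. The function signature is an assumption for illustration; frameworks expose the same idea through their own softmax routines.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with optional temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)                       # shift so the largest exponent is 0
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# math.exp(1000.0) would raise OverflowError; the shifted version is fine.
print(softmax([1000.0, 999.0, 998.0]))

# Lower temperature sharpens, higher temperature flattens.
for T in (0.5, 1.0, 2.0):
    print(T, softmax([2.0, 1.0, 0.1], temperature=T))
```

The first call gives the same probabilities as `softmax([2.0, 1.0, 0.0])`, because only differences between logits matter.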
Softmax Jacobian
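The Jacobian of softmax is:
$$\frac{\partial s_i}{\partial z_j} = s_i\,(\delta_{ij} - s_j)$$
In matrix form: $J = \text{diag}(s) - s\,s^{\top}$ where $s = \text{softmax}(z)$.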
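The Jacobian formula is easy to sanity-check numerically. The sketch below builds $J_{ij} = s_i(\delta_{ij} - s_j)$ for an example logit vector (the logits are arbitrary illustrative values) and confirms two structural properties: the matrix is symmetric and each row sums to zero, because the outputs always sum to 1.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(z):
    """J[i][j] = d s_i / d z_j = s_i * (delta_ij - s_j)."""
    s = softmax(z)
    n = len(s)
    return [[s[i] * ((1.0 if i == j else 0.0) - s[j]) for j in range(n)]
            for i in range(n)]

z = [2.0, 1.0, 0.1]   # illustrative logits
J = softmax_jacobian(z)
for row in J:
    print([round(v, 4) for v in row], "row sum:", round(sum(row), 12))
```

The zero row sums are why adding a constant to every logit leaves the output unchanged.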
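Never use softmax as a general-purpose hidden-layer activation. Softmax is an output activation for classification. Its normalizing property (outputs sum to 1) creates strong coupling between all neurons in the layer, which destroys the independent feature learning that hidden layers need. The deliberate exception is attention, where softmax is applied to attention scores precisely because a normalized weighting over keys is wanted. If an interviewer hears you suggest softmax as the nonlinearity between ordinary hidden layers, it signals a fundamental misunderstanding.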
Part 7 - Decision Framework: Choosing the Right Activation
Quick Reference: Activation by Architecture
| Architecture | Hidden Activation | Output Activation | Rationale |
|---|---|---|---|
| CNN (ResNet, VGG) | ReLU | Softmax (classification) | Speed, sparsity, proven track record |
| CNN (EfficientNet) | SiLU/Swish | Softmax | Smooth gradients, NAS-discovered |
| CNN (ConvNeXt) | GELU | Softmax | Modernized CNN borrowing from Transformers |
| Transformer (BERT, GPT) | GELU | Softmax / Linear | Smooth near zero, implicit regularization |
| LLaMA / Mistral | SiLU (in gated FFN) | Linear (LM head) | Gated FFN: $W_{\text{down}}\bigl(\text{SiLU}(W_{\text{gate}} x) \odot W_{\text{up}} x\bigr)$ |
| LSTM / GRU | Tanh (state), Sigmoid (gates) | Task-dependent | Historical, gates need $(0, 1)$ range |
| Diffusion model U-Net | SiLU/Swish | Linear | Smooth gradients for denoising |
| Generative (decoder) | Varies | Linear (logits) | Raw scores, softmax applied in loss |
The Gated FFN Pattern in Modern LLMs
LLaMA and Mistral use a gated feed-forward network (SwiGLU variant):
$$\text{FFN}(x) = W_{\text{down}}\bigl(\text{SiLU}(W_{\text{gate}}\,x) \odot W_{\text{up}}\,x\bigr)$$
This uses SiLU as a gating mechanism: one linear projection creates gate values (passed through SiLU), another creates the values to be gated, and they are multiplied element-wise. This has become the standard FFN design in modern LLMs, outperforming the original Transformer's $\text{FFN}(x) = W_2\,\text{ReLU}(W_1 x + b_1) + b_2$.
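The gating pattern described above can be sketched in a few lines of plain Python. The matrix names `W_gate`, `W_up`, and `W_down` and the tiny dimensions are assumptions for illustration; real implementations use batched tensor operations and much larger hidden sizes.

```python
import math

def silu(x):
    return x / (1.0 + math.exp(-x))

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def swiglu_ffn(x, W_gate, W_up, W_down):
    """Gated FFN: W_down @ (SiLU(W_gate @ x) * (W_up @ x))."""
    gate = [silu(g) for g in matvec(W_gate, x)]   # gate values
    up = matvec(W_up, x)                          # values to be gated
    hidden = [g * u for g, u in zip(gate, up)]    # element-wise product
    return matvec(W_down, hidden)

# Illustrative weights: model width 2, hidden width 3.
W_gate = [[0.5, -1.0], [1.0, 1.0], [-0.5, 0.25]]
W_up   = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_down = [[1.0, 0.5, -1.0], [0.0, 1.0, 2.0]]

y = swiglu_ffn([1.0, 2.0], W_gate, W_up, W_down)
print(y)
```

Note that the gated design has three weight matrices instead of two, which is why such models typically shrink the hidden width to keep the parameter count comparable.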
"I ask about activation functions not to hear a list of names, but to understand how a candidate thinks about architectural decisions. When they say 'I would use GELU,' I want to hear why - what property of GELU matters for this specific architecture? When they say 'ReLU for CNNs,' I want to hear about sparsity and computational efficiency. The best candidates connect the mathematical properties (smoothness, gradient magnitude, zero-centered outputs) to the architectural requirements (attention stability, training speed, representation quality)."
Part 8 - Comprehensive Mathematical Reference
All Activation Functions at a Glance
| Activation | Formula | Derivative | Range | Year |
|---|---|---|---|---|
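| Sigmoid | $\frac{1}{1 + e^{-x}}$ | $\sigma(x)\,(1 - \sigma(x))$ | $(0, 1)$ | 1943 |
| Tanh | $\frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$ | $1 - \tanh^{2}(x)$ | $(-1, 1)$ | 1986 |
| ReLU | $\max(0, x)$ | $1$ if $x > 0$, else $0$ | $[0, \infty)$ | 2010 |
| Leaky ReLU | $\max(\alpha x, x)$ | $1$ if $x > 0$, else $\alpha$ | $(-\infty, \infty)$ | 2013 |
| PReLU | $\max(0, x) + \alpha \min(0, x)$, $\alpha$ learned | $1$ if $x > 0$, else $\alpha$ | $(-\infty, \infty)$ | 2015 |
| ELU | $x$ if $x > 0$, $\alpha\,(e^{x} - 1)$ if $x \le 0$ | $1$ if $x > 0$, $\alpha e^{x}$ if $x \le 0$ | $(-\alpha, \infty)$ | 2015 |
| GELU | $x\,\Phi(x)$ | $\Phi(x) + x\,\phi(x)$ | $\approx [-0.17, \infty)$ | 2016 |
| SiLU/Swish | $x\,\sigma(x)$ | $\sigma(x)\,(1 + x\,(1 - \sigma(x)))$ | $\approx [-0.28, \infty)$ | 2017 |
| Mish | $x\,\tanh(\text{softplus}(x))$ | Complex expression | $\approx [-0.31, \infty)$ | 2019 |
| Softmax | $\frac{e^{z_i}}{\sum_j e^{z_j}}$ | $s_i\,(\delta_{ij} - s_j)$ | $(0, 1)$, sums to 1 | 1959 |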
Gradient Properties Summary
| Activation | Max gradient | Min gradient ($x \to -\infty$) | Vanishing? | Dead neurons? | Zero-centered? |
|---|---|---|---|---|---|
| Sigmoid | 0.25 | $\to 0$ | Yes, severe | No | No |
| Tanh | 1.0 | $\to 0$ | Yes, moderate | No | Yes |
| ReLU | 1.0 | 0 (exact) | No (positive) | Yes | No |
| Leaky ReLU | 1.0 | $\alpha$ (e.g., 0.01) | No | No | Approximately |
| GELU | $\approx 1.13$ | $\to 0$ | No | No (soft) | Approximately |
| SiLU | $\approx 1.10$ | $\to 0$ | No | No (soft) | Approximately |
Part 9 - Historical Timeline and Key Papers
Understanding the historical progression helps you answer "why did the field move from X to Y?" questions.
| Year | Activation | Paper / Context | Key Impact |
|---|---|---|---|
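| 1943 | Sigmoid (origins) | McCulloch & Pitts - mathematical neuron model | First artificial neuron, built on a hard threshold; the smooth sigmoid later replaced the step so that units could be trained by gradient descent |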
| 1986 | Tanh | Rumelhart et al. - backpropagation paper | Zero-centered improvement over sigmoid |
| 2010 | ReLU | Nair & Hinton - RBMs with ReLU | Solved vanishing gradients, enabled deep training |
| 2012 | ReLU | Krizhevsky et al. - AlexNet | ReLU + GPU training won ImageNet, launched modern DL |
| 2013 | Leaky ReLU | Maas et al. - rectifier nonlinearities | Fixed dying ReLU with small negative slope |
| 2015 | PReLU | He et al. - delving deep into rectifiers | Learned negative slope, reduced ImageNet top-1 error by roughly 1.2 points over ReLU |
| 2015 | ELU | Clevert et al. - fast and accurate DL | Smooth, zero-centered alternative to ReLU |
| 2016 | GELU | Hendrycks & Gimpel | Probabilistic activation, later adopted by BERT/GPT |
| 2017 | Swish/SiLU | Ramachandran et al. - searching for activations | NAS-discovered, outperformed ReLU on many benchmarks |
| 2017 | SELU | Klambauer et al. - self-normalizing networks | Self-normalizing property without BatchNorm |
| 2019 | Mish | Misra - Mish: a self-regularized activation | Slight improvement over Swish in some vision tasks |
| 2020 | SwiGLU | Shazeer - GLU variants improve Transformer | Gated FFN with SiLU, adopted by LLaMA and PaLM |
The general trend is clear: the field moved from biologically inspired (sigmoid) to optimization-friendly (ReLU) to smooth and self-regularizing (GELU/SiLU) to gated combinations (SwiGLU). Each transition was driven by specific training failures at scale that the previous activation could not handle.
Part 10 - Activation Functions and Initialization: The Critical Pairing
The choice of activation function directly determines the correct weight initialization scheme. Using the wrong pairing is a common source of training failures.
The Core Principle
We want the variance of activations to remain stable across layers. If activations grow layer by layer, they explode. If they shrink, gradients vanish. Initialization must account for the activation function's effect on variance.
Xavier/Glorot Initialization (for Sigmoid and Tanh)
For a layer $y = Wx$ with $n_{\text{in}}$ inputs and $n_{\text{out}}$ outputs, where $W$ has variance $\text{Var}(W)$ and the activation is symmetric around zero (like tanh):
$$\text{Var}(W) = \frac{2}{n_{\text{in}} + n_{\text{out}}}$$
This keeps both forward activations and backward gradients at stable variance. Named after Xavier Glorot, who derived it in 2010.
Why Xavier fails with ReLU: ReLU zeroes out the negative half of the distribution, cutting the variance roughly in half. Xavier initialization does not account for this, causing activations to shrink with depth.
He/Kaiming Initialization (for ReLU)
$$\text{Var}(W) = \frac{2}{n_{\text{in}}}$$
The factor of 2 compensates for ReLU zeroing out negative values. Named after Kaiming He, who derived it in 2015.
Why He fails with sigmoid: The variance is larger than sigmoid needs, pushing more pre-activations toward the saturated flat regions where $\sigma'(x) \approx 0$. In a deep network this compounds sigmoid's already small gradients, and training can stall early.
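The effect of the factor of 2 can be seen in a small simulation. The sketch below pushes one random input through a stack of square ReLU layers (width, depth, and seed are arbitrary assumptions) and records the mean squared activation after each layer, once with Xavier variance $2 / (n_{\text{in}} + n_{\text{out}})$ and once with He variance $2 / n_{\text{in}}$.

```python
import math
import random

def relu_second_moments(depth, width, weight_std, seed=0):
    """Mean squared activation after each of `depth` ReLU layers."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    history = []
    for _ in range(depth):
        # Fresh Gaussian weights for every unit in the layer, no bias.
        z = [sum(rng.gauss(0.0, weight_std) * xi for xi in x)
             for _ in range(width)]
        x = [max(0.0, zi) for zi in z]
        history.append(sum(v * v for v in x) / width)
    return history

width, depth = 64, 10
xavier = relu_second_moments(depth, width, math.sqrt(2.0 / (width + width)))
he = relu_second_moments(depth, width, math.sqrt(2.0 / width))

print("Xavier:", ["%.4f" % v for v in xavier])
print("He:    ", ["%.4f" % v for v in he])
```

With Xavier the signal roughly halves at every layer, so after 10 layers it is about three orders of magnitude smaller; with He it stays on the same scale, up to sampling noise at this small width.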
Initialization Cheat Sheet
| Activation | Initialization | Variance Formula | PyTorch Function |
|---|---|---|---|
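| Sigmoid | Xavier/Glorot | $\frac{2}{n_{\text{in}} + n_{\text{out}}}$ | nn.init.xavier_uniform_ |
| Tanh | Xavier/Glorot | $\frac{2}{n_{\text{in}} + n_{\text{out}}}$ | nn.init.xavier_uniform_ |
| ReLU | He/Kaiming | $\frac{2}{n_{\text{in}}}$ | nn.init.kaiming_normal_ |
| Leaky ReLU ($\alpha$) | Modified He | $\frac{2}{(1 + \alpha^{2})\,n_{\text{in}}}$ | nn.init.kaiming_normal_(a=alpha) |
| GELU | He (approximately) | $\frac{2}{n_{\text{in}}}$ | nn.init.kaiming_normal_ |
| SELU | LeCun | $\frac{1}{n_{\text{in}}}$ | nn.init.normal_(std=1/sqrt(n_in)) |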
What Happens with Wrong Pairings
| Wrong Pairing | Symptom | Why |
|---|---|---|
| Xavier + ReLU | Activations decay to zero in deep networks | ReLU halves variance, Xavier does not compensate |
| He + Sigmoid | More units saturate, gradients shrink faster with depth | Pre-activation variance larger than sigmoid tolerates, more mass in the flat regions |
| Zero init + any | All neurons learn the same thing (symmetry problem) | Identical neurons get identical gradients |
| Too large init + ReLU | Exploding activations, NaN loss | Variance compounds multiplicatively across layers |
"One of my favorite diagnostic questions is: 'Your network's loss is not decreasing from the first epoch. What do you check?' The top answer is 'check the initialization.' If you are using ReLU with Xavier init, or sigmoid with He init, training can fail from epoch zero. This is the kind of practical debugging knowledge that separates engineers from theorists."
Practice Problems
Problem 1: The "Why Not Sigmoid?" Question
An interviewer asks: "Your network has 20 hidden layers. You propose using sigmoid activations. Convince me this is a bad idea with a mathematical argument."
Hint 1 - Direction
Quantify the gradient attenuation through 20 layers. Use the maximum sigmoid derivative.
Hint 2 - Insight
$\sigma'(x) \le 0.25$. Through 20 layers (19 sigmoid operations), the activation derivatives scale the gradient by a factor of at most $0.25^{19}$. Compute this number.
Hint 3 - Full Solution + Rubric
The gradient from layer 20 to layer 1 passes through 19 sigmoid derivatives. Even in the best case (all pre-activations at zero):
$$0.25^{19} = 2^{-38} \approx 3.6 \times 10^{-12}$$
The gradient reaching layer 1 is a few trillionths of the gradient at layer 20 (ignoring the weight matrices). For practical purposes, the first 10+ layers receive zero useful gradient signal - they do not learn.
Additionally, sigmoid outputs are not zero-centered ($\sigma(x) \in (0, 1)$), causing all gradients on a neuron's incoming weights to have the same sign, which leads to zig-zag optimization paths.
Better alternatives: ReLU (gradient = 1 for positive), with He initialization and skip connections for a 20-layer network.
Scoring Rubric:
- Strong Hire: Computes $0.25^{19}$, gives the numerical value, mentions non-zero-centered outputs as a second problem, proposes ReLU + skip connections
- Lean Hire: Correctly states the vanishing gradient argument but does not compute the specific attenuation
- No Hire: Cannot quantify the problem or says "sigmoid is fine for any depth"
Problem 2: Dying ReLU Diagnosis
You are training a 10-layer CNN with ReLU activations. After 1000 epochs, you notice that 35% of neurons in layer 5 have zero activation for every input in the validation set. Diagnose the problem and propose 3 solutions.
Hint 1 - Direction
35% dead neurons is abnormally high. Think about what could cause so many neurons to have permanently negative pre-activations.
Hint 2 - Insight
Common causes: learning rate too high (large weight updates overshoot), poor initialization (biases initialized too negative), or input data not normalized (pushing activations negative). Each suggests a different fix.
Hint 3 - Full Solution + Rubric
Diagnosis: 35% dead neurons suggests aggressive weight updates pushed pre-activations permanently negative. This is the dying ReLU problem - once $z < 0$ for all inputs, the neuron receives zero gradient and cannot recover.
Root cause analysis:
- Learning rate may be too high, causing weights to overshoot during updates
- Initialization may have negative biases or large negative weights
- Batch normalization may be absent, allowing pre-activations to drift negative
Three solutions:
1. Switch to Leaky ReLU ($\alpha = 0.01$): Dead neurons now have gradient $\alpha$ instead of $0$, allowing recovery. Minimal computational overhead.
2. Reduce learning rate + use warmup: Start with a small learning rate (1/10th of current) for the first 5 epochs, then ramp up. This prevents early large updates that kill neurons.
3. Add batch normalization before ReLU: BN re-centers pre-activations around zero at each layer, ensuring roughly 50% of pre-activations are positive (keeping neurons alive).
Bonus solution: Re-initialize dead neurons during training (some frameworks support this). Monitor per-layer dead neuron percentage as a training diagnostic.
Scoring Rubric:
- Strong Hire: Diagnoses root cause (high LR or bad init), provides 3 distinct solutions with rationale, mentions monitoring dead neuron percentage as a training metric
- Lean Hire: Correctly identifies dying ReLU, proposes 2 solutions
- No Hire: Cannot explain why neurons die or only says "use a different activation"
Problem 3: GELU vs ReLU in Transformers
An interviewer asks: "We are building a Transformer for text classification. You propose using GELU instead of ReLU. I push back - ReLU is simpler and faster. Defend your choice."
Hint 1 - Direction
Think about what properties of GELU specifically help Transformers that ReLU does not provide. Consider the gradient behavior near zero and where the activation sits in the block (the feed-forward sublayer).
Hint 2 - Insight
Three arguments: (1) smooth gradient near zero gives steadier gradients through the feed-forward sublayers, (2) GELU has a probabilistic gating interpretation that acts like implicit regularization, (3) empirical results from models such as BERT and GPT generally favor GELU over ReLU in Transformers.
Hint 3 - Full Solution + Rubric
Argument 1 - Gradient smoothness at zero: ReLU has a discontinuous gradient at $x = 0$: the derivative jumps from 0 to 1. In a Transformer the activation lives in the feed-forward sublayer, and many FFN pre-activations fall near zero, so this kink is hit often. GELU has a smooth, continuous gradient everywhere: $\text{GELU}'(x) = \Phi(x) + x\,\phi(x)$, which transitions gradually. This is commonly credited with more stable training dynamics, although it is an empirical observation rather than a proven mechanism.
Argument 2 - Implicit regularization: GELU is the expected value of multiplying the input by a Bernoulli mask whose keep probability depends on the input: $\text{GELU}(x) = \mathbb{E}[x \cdot m] = x\,\Phi(x)$ where $m \sim \text{Bernoulli}(\Phi(x))$. This gives it an input-dependent, dropout-like interpretation, even though the function itself is deterministic. Transformers are parameter-heavy and prone to overfitting, so this soft gating is often argued to help.
Argument 3 - Empirical evidence: BERT, GPT-2, GPT-3, and ViT all use GELU (the original T5 used ReLU, and T5 v1.1 moved to a GELU-gated FFN). Published ablations tend to show a small but consistent gain over ReLU; the often-quoted figures (roughly 0.5-1.5% on language modeling perplexity, 10-15% extra time in the FFN layer) are rules of thumb that depend on the setup.
Concession: If inference latency is absolutely critical (edge deployment, real-time systems), ReLU is a reasonable choice. But for training quality, GELU is usually the better default for Transformers.
Scoring Rubric:
- Strong Hire: Gives all three arguments with mathematical backing, acknowledges the latency tradeoff, mentions specific models that use GELU
- Lean Hire: Correctly states GELU is smoother and cites empirical evidence but cannot explain the probabilistic interpretation
- No Hire: Can only say "GELU is better" without explaining why the smoothness matters for Transformers specifically
Problem 4: Softmax Temperature
You have a trained classifier that outputs logits $z = [2.0, 1.0, 0.1]$. Compute the softmax probabilities at temperatures $T = 1$, $T = 0.5$, and $T = 2.0$. Explain a practical use for each temperature.
Hint 1 - Direction
Divide logits by $T$ before applying softmax. Lower temperature sharpens the distribution, higher temperature flattens it.
Hint 2 - Insight
At $T = 1$: standard softmax. At $T = 0.5$: divide logits by 0.5 (double them), making differences larger. At $T = 2.0$: halve the logits, reducing differences.
Hint 3 - Full Solution + Rubric
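At $T = 1$:
- $e^{2.0} = 7.389$, $e^{1.0} = 2.718$, $e^{0.1} = 1.105$, sum = 11.212
- Probabilities: $[0.659, 0.242, 0.099]$
At $T = 0.5$ (scaled logits $[4.0, 2.0, 0.2]$):
- $e^{4.0} = 54.598$, $e^{2.0} = 7.389$, $e^{0.2} = 1.221$, sum = 63.21
- Probabilities: $[0.864, 0.117, 0.019]$ - much sharper
At $T = 2.0$ (scaled logits $[1.0, 0.5, 0.05]$):
- $e^{1.0} = 2.718$, $e^{0.5} = 1.649$, $e^{0.05} = 1.051$, sum = 5.418
- Probabilities: $[0.502, 0.304, 0.194]$ - much flatter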
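The worked numbers can be reproduced with a few lines of Python. This sketch assumes the logits are `[2.0, 1.0, 0.1]`, which is what the sums in the solution imply (for example, $e^{2.0} + e^{1.0} + e^{0.1} \approx 11.212$ at $T = 1$); the function name is chosen here for illustration.

```python
import math

def softmax_with_temperature(logits, T):
    """Stable softmax of logits / T."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]   # assumed from the sums in the solution
results = {}
for T in (1.0, 0.5, 2.0):
    results[T] = [round(p, 3) for p in softmax_with_temperature(logits, T)]
    print(T, results[T])
# 1.0 [0.659, 0.242, 0.099]
# 0.5 [0.864, 0.117, 0.019]
# 2.0 [0.502, 0.304, 0.194]
```

The ranking of the classes never changes with temperature; only how much probability mass the top class takes.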
Practical uses:
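- $T = 1$: Standard classification inference
- $T < 1$ (e.g., 0.5): Sharper, more confident sampling in LLMs (the limit $T \to 0$ is greedy decoding)
- $T > 1$ (e.g., 2): Knowledge distillation (Hinton et al., 2015) - softened probabilities from the teacher network reveal more information about class similarities ("dark knowledge"). Post-hoc calibration by temperature scaling also typically fits $T > 1$ for overconfident networks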
Scoring Rubric:
- Strong Hire: Computes all three correctly, explains knowledge distillation, mentions temperature in LLM sampling (top-k + temperature)
- Lean Hire: Correct computations and basic understanding of sharpening/flattening
- No Hire: Cannot compute softmax with temperature or does not know the direction of the effect
Interview Cheat Sheet
| Concept | Key Fact | Common Mistakes |
|---|---|---|
| Why nonlinearity | Without it, deep network = single linear layer | Saying "to make it more complex" without explaining the linear collapse |
| Sigmoid | $\sigma'(x) \le 0.25$, not zero-centered | Saying sigmoid "always causes vanishing gradients" (it does not if network is shallow) |
| Tanh | Max derivative 1.0, zero-centered, still saturates | Saying tanh solves vanishing gradients completely |
| ReLU | Gradient = 1 for $x > 0$, creates sparsity | Forgetting the dying ReLU problem |
| Dying ReLU | Neurons with $z < 0$ for all inputs get zero gradient forever | Saying "just use Leaky ReLU" without diagnosing root cause |
| Leaky ReLU | $f(x) = \alpha x$ for $x < 0$, prevents dead neurons | Confusing with PReLU (learned $\alpha$) |
| GELU | $\text{GELU}(x) = x\,\Phi(x)$, smooth, used in Transformers | Not knowing the tanh approximation formula |
| SiLU/Swish | $\text{SiLU}(x) = x\,\sigma(x)$, used in LLaMA's gated FFN | Confusing SiLU with GELU |
| Softmax | Converts logits to probabilities, use numerically stable version | Applying softmax to hidden layers |
| Temperature | $T < 1$ sharpens, $T > 1$ flattens | Getting the direction wrong |
Spaced Repetition Checkpoints
Day 0 - After First Read
- Write the formula and derivative for: sigmoid, tanh, ReLU, GELU, SiLU
- Explain the dying ReLU problem in 3 sentences
- State which activation is used in: BERT, GPT, ResNet, LSTM gates, LLaMA FFN
Day 3 - First Review
- Explain why sigmoid causes vanishing gradients with a specific numerical example
- Compare GELU and ReLU: give 3 differences that matter for Transformers
- Compute softmax with temperature for a 3-element vector without looking at notes
Day 7 - Connections Review
- Explain the connection between activation functions and gradient flow through a deep network
- Trace the decision flowchart: given an architecture, choose the activation and justify it
- Explain the SwiGLU FFN pattern used in LLaMA
Day 14 - Interview Simulation
- Give a 60-second answer on "why do we need activation functions?"
- Defend GELU over ReLU for a Transformer with 3 specific technical arguments
- Diagnose a dying ReLU scenario and propose 3 solutions with rationale
Day 21 - Final Calibration
- Complete all 4 practice problems under time pressure (8 minutes each)
- For any activation function named by the interviewer, immediately state: formula, derivative, range, where it is used, key advantage, key limitation
- Connect activation functions to: backpropagation (gradient properties), initialization (He vs Xavier), normalization (interaction effects)
