
Activation Functions - The Nonlinearity That Makes Deep Learning Work

Reading time: ~35 min | Interview relevance: High | Roles: MLE, AI Eng, Research Engineer, NLP Eng, CV Eng

The Real Interview Moment

You are in a Meta MLE phone screen. The interviewer asks what seems like a simple question: "Why do we need activation functions?" You answer "to add nonlinearity" - correct, but generic. She follows up: "Okay, so why not just use sigmoid? It is nonlinear." You mention vanishing gradients. She pushes further: "So we switched to ReLU. But GPT and BERT use GELU, not ReLU. Why? What is different about Transformers that makes GELU a better choice?"

Now you are in deep water. You know GELU is "smoother" but you cannot articulate why that matters for Transformers specifically, or why the smoothness at zero is important for gradient flow in attention layers. The interview shifts from "this candidate knows the basics" to "this candidate memorized names without understanding the design reasoning."

Activation function questions seem easy on the surface but quickly reveal the depth of your understanding. This page gives you the mathematical properties, the gradient analysis, the historical context, and - most importantly - the reasoning framework for choosing the right activation for any architecture.

What You Will Master

  • Explain why neural networks need nonlinear activation functions (universal approximation theorem)
  • State the formula, derivative, and range of every major activation function
  • Analyze the gradient properties of each activation and connect them to vanishing/exploding gradient problems
  • Explain the dying ReLU problem with a concrete numerical example
  • Compare ReLU variants (Leaky ReLU, PReLU, ELU) and when each is appropriate
  • Derive why GELU is preferred in Transformers (smooth approximation to a stochastic regularizer)
  • Distinguish GELU vs SiLU/Swish vs Mish and state which modern architectures use which
  • Apply the softmax function correctly and explain its temperature parameter
  • Navigate a decision flowchart for choosing activation functions in any architecture
  • Answer activation function interview questions with mathematical precision

Self-Assessment: Where Are You Now?

| Skill | 1 -- Cannot | 2 -- Vaguely | 3 -- Can Explain | 4 -- Can Derive | 5 -- Can Teach | Your Score |
|---|---|---|---|---|---|---|
| Explain why nonlinearity is needed | | | | | | ___ |
| Write sigmoid/tanh formulas and derivatives | | | | | | ___ |
| Explain vanishing gradients with sigmoid | | | | | | ___ |
| Explain ReLU advantages and dying ReLU | | | | | | ___ |
| Compare Leaky ReLU, PReLU, ELU | | | | | | ___ |
| Explain GELU formula and why Transformers use it | | | | | | ___ |
| Distinguish GELU vs SiLU vs Mish | | | | | | ___ |
| Apply softmax with temperature scaling | | | | | | ___ |

Target: All 4s and 5s before your interview.

Part 1 - Why Nonlinearity Matters

The Linear Collapse Problem

Without activation functions, a neural network is just a sequence of linear transformations:

$$\mathbf{h}_1 = W_1 \mathbf{x}, \quad \mathbf{h}_2 = W_2 \mathbf{h}_1, \quad \mathbf{y} = W_3 \mathbf{h}_2$$

Composing these: $\mathbf{y} = W_3 W_2 W_1 \mathbf{x} = W_{\text{eff}} \mathbf{x}$

No matter how many layers you stack, the result is equivalent to a single linear transformation $W_{\text{eff}}$. The network cannot learn any function more complex than linear regression. All the depth is wasted.
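The collapse is easy to check numerically. The sketch below uses arbitrary 2x2 weights (illustrative values, not taken from any real model) and compares a three-layer forward pass without activations to a single multiplication by the product matrix.

```python
# Minimal sketch: three stacked linear layers collapse into one matrix.
# The weights and input are arbitrary illustrative values.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def matvec(a, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]

W1 = [[1.0, 2.0], [0.0, 1.0]]
W2 = [[0.5, -1.0], [2.0, 0.0]]
W3 = [[1.0, 1.0], [-1.0, 3.0]]
x = [3.0, -2.0]

# Layer-by-layer forward pass with no activation function.
y_deep = matvec(W3, matvec(W2, matvec(W1, x)))

# Single equivalent layer: W_eff = W3 W2 W1.
W_eff = matmul(W3, matmul(W2, W1))
y_shallow = matvec(W_eff, x)

print(y_deep, y_shallow)  # both are [-0.5, -7.5]
```

Inserting any nonlinearity between the layers breaks this equivalence, which is the whole point of an activation function.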

60-Second Answer

"Without activation functions, a deep network collapses to a single linear transformation - stacking linear layers gives you another linear layer. Activation functions introduce nonlinearity, which is what allows neural networks to approximate any continuous function (the universal approximation theorem). The choice of activation affects training dynamics: sigmoid causes vanishing gradients because its derivative is at most 0.25, ReLU solves this with a constant gradient of 1 for positive inputs but creates dead neurons, and GELU provides a smooth approximation that works best in Transformers because its gentle curve near zero allows more nuanced gradient flow through attention layers."

Universal Approximation Theorem

With at least one hidden layer and a nonlinear activation function, a neural network can approximate any continuous function on a compact domain to arbitrary precision (given enough neurons). The activation function is what unlocks this power.

However, universality says nothing about learnability. A network might theoretically be able to represent a function but fail to learn it due to vanishing gradients, dead neurons, or poor optimization landscape. This is why the choice of activation function matters enormously in practice.

[Figure: Linear vs Nonlinear Networks: Why Activations Are Required]

Part 2 - The Classic Activations: Sigmoid and Tanh

Sigmoid

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

Properties:

  • Range: $(0, 1)$ - outputs interpretable as probabilities
  • Derivative: $\sigma'(z) = \sigma(z)(1 - \sigma(z))$
  • Maximum derivative: $0.25$ at $z = 0$
  • Saturates for large $|z|$: $\sigma(z) \to 0$ as $z \to -\infty$, $\sigma(z) \to 1$ as $z \to +\infty$
  • Output is always positive (not zero-centered)

Why sigmoid fell out of favor:

  1. Vanishing gradients: $\sigma'(z) \leq 0.25$. Through $N$ sigmoid layers, the activation derivatives alone scale the gradient by at most $(0.25)^N$. A 10-layer network attenuates gradients by a factor of $\sim 10^6$.

  2. Not zero-centered: Sigmoid outputs are always positive. This means gradients on weights all have the same sign (all positive or all negative), causing zig-zag updates during optimization. This slows convergence.

  3. Expensive: Requires computing $e^{-z}$, which is slower than a simple comparison.
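The attenuation claim can be checked with a few lines of arithmetic. This is a minimal sketch that looks only at the product of sigmoid derivatives and ignores the weight matrices, which also scale the gradient in a real network.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

# Best case: every pre-activation sits at z = 0, where the derivative peaks.
best_case = sigmoid_grad(0.0)  # 0.25

for n_layers in (5, 10, 19):
    print(n_layers, best_case ** n_layers)

# In a saturated region the per-layer factor is far smaller than 0.25.
print(sigmoid_grad(6.0))
```

Even the best case loses about six orders of magnitude over 10 layers; saturated units make it worse.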

Where sigmoid is still used:

  • Binary classification output (single probability)
  • Gate functions in LSTMs and GRUs (values must be in $[0, 1]$)
  • Attention gates in some architectures

Tanh

$$\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}} = 2\sigma(2z) - 1$$

Properties:

  • Range: $(-1, 1)$ - zero-centered (unlike sigmoid)
  • Derivative: $\tanh'(z) = 1 - \tanh^2(z)$
  • Maximum derivative: $1.0$ at $z = 0$ (4x better than sigmoid)
  • Still saturates for large $|z|$

Tanh vs sigmoid:

  • Tanh is generally preferred over sigmoid in hidden layers because it is zero-centered
  • Maximum derivative of 1.0 instead of 0.25 means slower gradient decay
  • Still suffers from vanishing gradients, just less severely than sigmoid
  • Both are largely replaced by ReLU and its variants in modern networks

Common Trap

Do NOT say "tanh does not have vanishing gradients." It does - $\tanh'(z) \to 0$ for large $|z|$. The improvement over sigmoid is the maximum derivative of 1.0 (vs 0.25) and zero-centered output, but the saturation problem remains. Only ReLU-family activations avoid saturation for positive inputs.

Part 3 - ReLU: The Revolution

Rectified Linear Unit

$$\text{ReLU}(z) = \max(0, z)$$

Properties:

  • Range: $[0, +\infty)$
  • Derivative: $\text{ReLU}'(z) = \begin{cases} 1 & z > 0 \\ 0 & z < 0 \\ \text{undefined} & z = 0 \end{cases}$
  • In practice, the derivative at $z = 0$ is set to 0 (or sometimes 1)
  • No saturation for positive inputs
  • Computationally trivial: just a comparison and a mask

Why ReLU Revolutionized Deep Learning

ReLU solved three problems simultaneously:

1. No vanishing gradients (for positive inputs): The derivative is exactly 1 for all positive inputs, regardless of magnitude. The gradient passes through unchanged. A 100-layer network has gradient factor $1^{100} = 1$ through the ReLU gates (for active neurons).

2. Sparsity: For roughly zero-centered pre-activations, about 50% of neurons output zero. This creates a sparse representation, which has two potential benefits:

  • Computational efficiency (zero activations can be skipped by kernels that exploit sparsity)
  • Better representations (sparse codes tend to be more linearly separable)

3. Computational efficiency: $\max(0, z)$ is a single comparison, far cheaper than computing $e^{-z}$ for sigmoid or tanh. This matters at scale.

[Figure: ReLU: Problems Solved and Problem Created (Dying ReLU)]

The Dying ReLU Problem

When a neuron's pre-activation is negative for all training examples, the neuron outputs zero for every input and receives zero gradient. It can never recover - it is "dead."

How neurons die:

  1. A large negative bias or unfortunate weight initialization pushes the pre-activation negative
  2. A large learning rate causes a weight update that pushes the pre-activation far negative
  3. Once $z < 0$ for all inputs in the training set, $\text{ReLU}'(z) = 0$ for all inputs
  4. With zero gradient, the weights never update, and the neuron stays dead forever

Concrete example: Suppose a neuron has weights $\mathbf{w} = [-1, -1]$ and bias $b = -5$. For any non-negative input $\mathbf{x}$, the pre-activation $\mathbf{w}^T \mathbf{x} + b \leq -5 < 0$. The neuron is permanently dead.
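The same example can be run as a quick sketch. The 1000 random non-negative inputs and the upstream gradient of 1 are assumptions made for illustration; the `dead_fraction` measure is one simple way to monitor a neuron, not a standard library function.

```python
import random

def relu_grad(z):
    """Derivative of ReLU, using the common convention ReLU'(0) = 0."""
    return 1.0 if z > 0 else 0.0

# The neuron from the example above.
w = [-1.0, -1.0]
b = -5.0

# Hypothetical non-negative inputs (e.g. outputs of a previous ReLU layer).
random.seed(0)
inputs = [[random.uniform(0.0, 10.0), random.uniform(0.0, 10.0)]
          for _ in range(1000)]

pre_activations = [w[0] * x[0] + w[1] * x[1] + b for x in inputs]

# Gradient for each weight is upstream * ReLU'(z) * x_i; upstream is taken as 1.
weight_grads = [[relu_grad(z) * x_i for x_i in x]
                for z, x in zip(pre_activations, inputs)]

# Fraction of inputs for which the neuron outputs zero.
dead_fraction = sum(1 for z in pre_activations if z <= 0) / len(pre_activations)

print(max(pre_activations), dead_fraction)
```

Every pre-activation is at most -5, so every weight gradient is exactly zero and no update can revive the neuron.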

How common is it? In poorly initialized or aggressively trained networks, 10-40% of neurons can die. With proper initialization (He init) and moderate learning rates, it is usually manageable (less than 5%).

Interviewer's Perspective

"When a candidate mentions the dying ReLU problem, I immediately follow up with: 'How would you detect dead neurons in a trained network, and what would you do about it?' A strong candidate says: 'Monitor the fraction of zero activations per layer during training. If it exceeds 20-30%, reduce the learning rate, check initialization, or switch to Leaky ReLU. You can also use a small L2 penalty to prevent weights from growing too large in magnitude.' A weak candidate just says 'use Leaky ReLU' without diagnosing the root cause."

Part 4 - ReLU Variants: Fixing the Dying Neuron Problem

Leaky ReLU

$$\text{LeakyReLU}(z) = \begin{cases} z & z > 0 \\ \alpha z & z \leq 0 \end{cases}$$

where $\alpha$ is a small constant (typically 0.01).

Properties:

  • Derivative for $z < 0$: $\alpha$ (not zero - neurons cannot die)
  • The small negative slope allows gradient flow even for negative inputs
  • Introduces one hyperparameter $\alpha$

PReLU (Parametric ReLU)

$$\text{PReLU}(z) = \begin{cases} z & z > 0 \\ \alpha z & z \leq 0 \end{cases}$$

Same formula as Leaky ReLU, but $\alpha$ is a learnable parameter (one per channel or one per neuron). The network learns the optimal negative slope.

Key result: He et al. (2015) showed PReLU improved ImageNet classification accuracy over ReLU by 1.1% - a significant margin at the time.

ELU (Exponential Linear Unit)

$$\text{ELU}(z) = \begin{cases} z & z > 0 \\ \alpha(e^z - 1) & z \leq 0 \end{cases}$$

Properties:

  • Smooth at $z = 0$ when $\alpha = 1$ (the derivative is continuous there, unlike ReLU)
  • Saturates to $-\alpha$ for large negative inputs (adds noise robustness)
  • Mean activations closer to zero (unlike ReLU, whose outputs are never negative)
  • More computationally expensive than ReLU (requires $e^z$)

SELU (Scaled ELU)

$$\text{SELU}(z) = \lambda \begin{cases} z & z > 0 \\ \alpha(e^z - 1) & z \leq 0 \end{cases}$$

with specific constants $\lambda \approx 1.0507$ and $\alpha \approx 1.6733$.

Key property: SELU is self-normalizing - activations automatically converge to zero mean and unit variance during training, without needing batch normalization. This works only with specific conditions (fully connected networks, proper initialization, no skip connections).

Comparison Table

| Activation | Formula ($z < 0$ region) | Gradient ($z < 0$) | Dead Neurons? | Zero-Centered? | Compute Cost |
|---|---|---|---|---|---|
| ReLU | $0$ | $0$ | Yes | No | Lowest |
| Leaky ReLU | $0.01z$ | $0.01$ | No | Approximately | Low |
| PReLU | $\alpha z$ (learned) | $\alpha$ (learned) | No | Approximately | Low |
| ELU | $\alpha(e^z - 1)$ | $\alpha e^z$ | No | Yes | Medium |
| SELU | $\lambda\alpha(e^z - 1)$ | $\lambda\alpha e^z$ | No | Yes (self-normalizing) | Medium |
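The variants differ only in how they treat negative inputs, which a scalar sketch makes concrete. The functions below follow the formulas in this part. PReLU is the same as `leaky_relu` with `alpha` treated as a trainable parameter, which is omitted here. The constant names are illustrative.

```python
import math

# SELU constants as quoted in the text (rounded to four decimals).
SELU_LAMBDA = 1.0507
SELU_ALPHA = 1.6733

def relu(z):
    return max(0.0, z)

def leaky_relu(z, alpha=0.01):
    """Small fixed negative slope, so the gradient is never exactly zero."""
    return z if z > 0 else alpha * z

def elu(z, alpha=1.0):
    """Exponential negative branch that saturates at -alpha."""
    return z if z > 0 else alpha * (math.exp(z) - 1.0)

def selu(z):
    """Scaled ELU with the fixed self-normalizing constants."""
    return SELU_LAMBDA * elu(z, SELU_ALPHA)

for z in (-5.0, -1.0, 0.0, 2.0):
    print(z, relu(z), leaky_relu(z), elu(z), selu(z))
```

For strongly negative inputs, ReLU returns 0, Leaky ReLU keeps shrinking linearly, and ELU and SELU level off at $-\alpha$ and $-\lambda\alpha$ respectively.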

Part 5 - Modern Activations: GELU, SiLU/Swish, and Mish

These are the activations that dominate modern architectures (2018-present). Understanding them is critical for LLM and Transformer interview questions.

GELU (Gaussian Error Linear Unit)

$$\text{GELU}(z) = z \cdot \Phi(z)$$

where $\Phi(z)$ is the CDF of the standard normal distribution: $\Phi(z) = \frac{1}{2}\left[1 + \text{erf}\left(\frac{z}{\sqrt{2}}\right)\right]$.

Practical approximation:

$$\text{GELU}(z) \approx 0.5z\left(1 + \tanh\left[\sqrt{\frac{2}{\pi}}\left(z + 0.044715z^3\right)\right]\right)$$

The intuition behind GELU:

GELU can be understood as a smooth, probabilistic version of ReLU. For a given input $z$:

  • If $z$ is large positive: $\Phi(z) \approx 1$, so $\text{GELU}(z) \approx z$ (passed through, like ReLU)
  • If $z$ is large negative: $\Phi(z) \approx 0$, so $\text{GELU}(z) \approx 0$ (blocked, like ReLU)
  • If $z$ is near zero: the transition is smooth and probabilistic, unlike ReLU's hard kink

The key idea is that GELU is motivated by a stochastic regularizer. Multiply the input by a Bernoulli mask $m \sim \text{Bernoulli}(\Phi(z))$; GELU is the expected value of that product, $\mathbb{E}[m \cdot z] = z \cdot \Phi(z)$. Larger inputs are more likely to be passed through and more negative inputs are more likely to be dropped - similar to dropout, but input-dependent and, in the final function, deterministic.

Derivative:

$$\text{GELU}'(z) = \Phi(z) + z \cdot \phi(z)$$

where $\phi(z) = \frac{1}{\sqrt{2\pi}}e^{-z^2/2}$ is the standard normal PDF. The derivative is smooth everywhere - no discontinuity at $z = 0$.
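A short sketch can confirm both the tanh approximation and the derivative formula. It uses only `math.erf` from the standard library; the function names are illustrative, and the finite-difference step size is an arbitrary choice.

```python
import math

def normal_cdf(z):
    """Standard normal CDF, Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def normal_pdf(z):
    """Standard normal PDF, phi(z)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def gelu(z):
    """Exact GELU: z * Phi(z)."""
    return z * normal_cdf(z)

def gelu_tanh(z):
    """Tanh-based approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (z + 0.044715 * z ** 3)
    return 0.5 * z * (1.0 + math.tanh(inner))

def gelu_grad(z):
    """Analytic derivative: Phi(z) + z * phi(z)."""
    return normal_cdf(z) + z * normal_pdf(z)

grid = [i / 100.0 for i in range(-500, 501)]

# Largest gap between the exact form and the approximation on [-5, 5].
max_gap = max(abs(gelu(z) - gelu_tanh(z)) for z in grid)

# Central finite difference as an independent check of the derivative.
h = 1e-5
max_grad_err = max(abs((gelu(z + h) - gelu(z - h)) / (2 * h) - gelu_grad(z))
                   for z in grid)

print(max_gap, max_grad_err, gelu_grad(0.0))
```

The derivative at zero is exactly 0.5, halfway between ReLU's two one-sided slopes, which is one way to describe how GELU smooths the kink.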

SiLU / Swish

$$\text{SiLU}(z) = z \cdot \sigma(z) = \frac{z}{1 + e^{-z}}$$

SiLU (Sigmoid Linear Unit) and Swish are the same function. The SiLU name appears in the GELU paper (Hendrycks & Gimpel, 2016), and the function was studied for reinforcement learning by Elfwing et al. (preprint 2017, published 2018). Google Brain arrived at the same function through automated search and named it Swish (Ramachandran et al., 2017).

Properties:

  • Smooth, non-monotonic (has a small dip below zero near $z \approx -1.28$)
  • Bounded below: minimum value $\approx -0.278$ at $z \approx -1.28$
  • Unbounded above: $\text{SiLU}(z) \to z$ as $z \to +\infty$
  • Self-gating: the function gates itself using $\sigma(z)$

Derivative:

$$\text{SiLU}'(z) = \sigma(z) + z \cdot \sigma(z)(1 - \sigma(z)) = \sigma(z)\left(1 + z(1 - \sigma(z))\right)$$
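The quoted minimum and the derivative formula can both be checked numerically. This is a minimal sketch using a grid search with step 0.001, which is an arbitrary resolution chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def silu(z):
    """SiLU / Swish: z * sigmoid(z)."""
    return z * sigmoid(z)

def silu_grad(z):
    """Analytic derivative: sigma(z) * (1 + z * (1 - sigma(z)))."""
    s = sigmoid(z)
    return s * (1.0 + z * (1.0 - s))

# Locate the dip below zero by brute-force search over [-5, 0].
grid = [i / 1000.0 for i in range(-5000, 1)]
z_min = min(grid, key=silu)
value_min = silu(z_min)

# Central finite difference as an independent check of the derivative.
h = 1e-5
z_test = 0.7
numeric = (silu(z_test + h) - silu(z_test - h)) / (2 * h)

print(z_min, value_min, silu_grad(z_min), abs(numeric - silu_grad(z_test)))
```

The derivative is approximately zero at the located minimum, which is consistent with the non-monotonic dip described above.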

Mish

$$\text{Mish}(z) = z \cdot \tanh(\text{softplus}(z)) = z \cdot \tanh(\ln(1 + e^z))$$

Properties:

  • Smooth, non-monotonic (similar to SiLU)
  • Bounded below: minimum $\approx -0.31$ at $z \approx -1.19$
  • Unbounded above
  • Slightly better than SiLU on some vision benchmarks but with higher compute cost

GELU vs SiLU vs Mish: The Modern Comparison

| Property | GELU | SiLU/Swish | Mish |
|---|---|---|---|
| Formula | $z \cdot \Phi(z)$ | $z \cdot \sigma(z)$ | $z \cdot \tanh(\text{softplus}(z))$ |
| Smooth at 0 | Yes | Yes | Yes |
| Non-monotonic | Yes (dip at $z \approx -0.75$) | Yes (dip at $z \approx -1.28$) | Yes (dip at $z \approx -1.19$) |
| Compute cost | Medium (erf) | Low (sigmoid) | High (softplus + tanh) |
| Used in | BERT, GPT, ViT | EfficientNet, YOLOv5, LLaMA (inside the gated FFN) | YOLOv4, some vision models |
| Dominant domain | NLP / Transformers | Vision / Efficient architectures, gated FFNs in LLMs | Vision |

Why GELU Dominates Transformers

Three reasons GELU is the default activation in Transformer-based models:

1. Smooth gradient near zero. In a Transformer block the activation sits inside the position-wise feed-forward sublayer (attention itself uses softmax), and layer normalization keeps many pre-activations close to zero. ReLU's derivative jumps from 0 to 1 at zero, while GELU's changes gradually. That smoother transition is commonly credited with more stable gradient flow through deep stacks of blocks.

2. Probabilistic interpretation. GELU weights each input by $\Phi(z)$, the expected value of an input-dependent Bernoulli mask. This ties it to dropout-style stochastic regularization, which is attractive for Transformers because they are prone to overfitting on smaller datasets. The function itself is deterministic, so treat the regularization argument as motivation rather than proof.

3. Empirical performance. The original GPT (Radford et al., 2018) and BERT (Devlin et al., 2018) both used GELU, and GPT-2, GPT-3, ViT, and many other landmark models followed suit. More recent LLMs such as LLaMA use SiLU inside a gated FFN instead.

[Figure: Activation Functions by Architecture: CNNs, Transformers, RNNs]

Company Variation

At OpenAI/Anthropic, you will be asked specifically about GELU and why it is used in Transformers. At companies like NVIDIA, the question shifts to "how do you implement GELU efficiently in CUDA?" (answer: use the tanh approximation, which avoids the expensive erf function). At startups, knowing that "Transformers use GELU, CNNs use ReLU" is usually sufficient.

Part 6 - Softmax: The Output Activation

Definition

$$\text{softmax}(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$

Softmax converts a vector of arbitrary real numbers (logits) into a probability distribution: all outputs are in $(0, 1)$ and sum to 1.

Temperature Scaling

$$\text{softmax}(\mathbf{z}/T)_i = \frac{e^{z_i / T}}{\sum_{j=1}^{K} e^{z_j / T}}$$

| Temperature | Effect | Use Case |
|---|---|---|
| $T \to 0$ | Approaches argmax (one-hot) | Hard predictions, greedy decoding |
| $T = 1$ | Standard softmax | Training, standard inference |
| $T > 1$ | Flatter distribution | Knowledge distillation, creative sampling |
| $T \to \infty$ | Uniform distribution | Maximum entropy |

Numerical Stability

Direct computation of $e^{z_i}$ overflows for large $z_i$ (e.g., $e^{1000}$ evaluates to $\infty$ in floating point). The standard trick:

$$\text{softmax}(\mathbf{z})_i = \frac{e^{z_i - \max(\mathbf{z})}}{\sum_j e^{z_j - \max(\mathbf{z})}}$$

Subtracting $\max(\mathbf{z})$ does not change the result (it cancels in the fraction) but ensures the largest exponent is $e^0 = 1$, preventing overflow.
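Temperature scaling and the max-subtraction trick fit in one small function. The sketch below is a plain-Python illustration (the `temperature` parameter name is an assumption, and production code would normally call a framework's softmax).

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    # After subtracting the max, the largest exponent is exp(0) = 1.
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]

for t in (0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax(logits, temperature=t)])

# A naive exp(1000.0) would raise OverflowError; the stable form does not.
print(softmax([1000.0, 1000.0]))  # [0.5, 0.5]
```

Lower temperatures concentrate mass on the largest logit and higher temperatures spread it out, matching the table above.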

Softmax Jacobian

The Jacobian of softmax is:

$$\frac{\partial\, \text{softmax}(\mathbf{z})_i}{\partial z_j} = \begin{cases} \text{softmax}(\mathbf{z})_i \left(1 - \text{softmax}(\mathbf{z})_i\right) & i = j \\ -\text{softmax}(\mathbf{z})_i \cdot \text{softmax}(\mathbf{z})_j & i \neq j \end{cases}$$

In matrix form: $J = \text{diag}(\mathbf{p}) - \mathbf{p}\mathbf{p}^T$ where $\mathbf{p} = \text{softmax}(\mathbf{z})$.
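The matrix form is short enough to verify directly. This sketch builds $J$ entry by entry and compares one off-diagonal entry against a finite difference; the example logits are arbitrary.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(logits):
    """J[i][j] = p_i * (delta_ij - p_j), i.e. diag(p) - p p^T."""
    p = softmax(logits)
    k = len(p)
    return [[p[i] * ((1.0 if i == j else 0.0) - p[j]) for j in range(k)]
            for i in range(k)]

logits = [2.0, 1.0, 0.1]
J = softmax_jacobian(logits)

# Finite-difference estimate of d softmax_0 / d z_1.
h = 1e-6
bumped = [logits[0], logits[1] + h, logits[2]]
numeric_01 = (softmax(bumped)[0] - softmax(logits)[0]) / h

row_sums = [sum(row) for row in J]
print(J[0][1], numeric_01, row_sums)
```

Each row sums to zero because the outputs always sum to 1: raising one probability must lower the others by the same total amount.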

Instant Rejection

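Never use softmax as a general hidden-layer activation in place of ReLU or GELU. Softmax is an output activation for classification. Its normalizing property (outputs sum to 1) creates strong coupling between all neurons in the layer, which destroys the independent feature learning that hidden layers need. The one deliberate exception is attention, where softmax normalizes scores over keys to produce mixing weights; that is a different role from a per-neuron nonlinearity. If an interviewer hears you suggest softmax as a hidden-layer activation, it signals a fundamental misunderstanding.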

Part 7 - Decision Framework: Choosing the Right Activation

[Figure: Activation Function Decision Flowchart by Layer Type]

Quick Reference: Activation by Architecture

| Architecture | Hidden Activation | Output Activation | Rationale |
|---|---|---|---|
| CNN (ResNet, VGG) | ReLU | Softmax (classification) | Speed, sparsity, proven track record |
| CNN (EfficientNet) | SiLU/Swish | Softmax | Smooth gradients, NAS-discovered |
| CNN (ConvNeXt) | GELU | Softmax | Modernized CNN borrowing from Transformers |
| Transformer (BERT, GPT) | GELU | Softmax / Linear | Smooth near zero, implicit regularization |
| LLaMA / Mistral | SiLU (in gated FFN) | Linear (LM head) | Gated FFN: $\text{SiLU}(W_1 x) \odot W_3 x$ |
| LSTM / GRU | Tanh (state), Sigmoid (gates) | Task-dependent | Historical, gates need $[0, 1]$ range |
| Diffusion model U-Net | SiLU/Swish | Linear | Smooth gradients for denoising |
| Generative (decoder) | Varies | Linear (logits) | Raw scores, softmax applied in loss |

The Gated FFN Pattern in Modern LLMs

LLaMA and Mistral use a gated feed-forward network (SwiGLU variant):

$$\text{FFN}(\mathbf{x}) = W_2 \left(\text{SiLU}(W_1 \mathbf{x}) \odot W_3 \mathbf{x}\right)$$

This uses SiLU as a gating mechanism: one linear projection creates gate values (passed through SiLU), another creates the values to be gated, and they are multiplied element-wise before the output projection $W_2$. This has become the standard FFN design in modern LLMs, outperforming the original Transformer's $W_2\,\text{ReLU}(W_1 \mathbf{x})$.
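The data flow of the gated FFN is easier to see in code. This is a minimal sketch with plain lists and no biases; the tiny weight matrices are made-up values, and real implementations operate on batched tensors.

```python
import math

def silu(z):
    return z / (1.0 + math.exp(-z))

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def swiglu_ffn(x, W1, W3, W2):
    """Gated FFN: W2 (SiLU(W1 x) * W3 x), with * taken element-wise."""
    gate = [silu(v) for v in matvec(W1, x)]    # gate branch
    up = matvec(W3, x)                         # value branch
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(W2, hidden)                  # project back to model width

# Hypothetical sizes: model width 2, hidden width 3.
W1 = [[0.5, -1.0], [1.0, 0.2], [-0.3, 0.8]]
W3 = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
W2 = [[1.0, -1.0, 0.5], [0.2, 0.4, -0.6]]

print(swiglu_ffn([1.0, 2.0], W1, W3, W2))
```

Note that the gated design needs three weight matrices rather than two, so models that use it typically shrink the hidden width to keep the parameter count comparable.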

Interviewer's Perspective

"I ask about activation functions not to hear a list of names, but to understand how a candidate thinks about architectural decisions. When they say 'I would use GELU,' I want to hear why - what property of GELU matters for this specific architecture? When they say 'ReLU for CNNs,' I want to hear about sparsity and computational efficiency. The best candidates connect the mathematical properties (smoothness, gradient magnitude, zero-centered outputs) to the architectural requirements (attention stability, training speed, representation quality)."

Part 8 - Comprehensive Mathematical Reference

All Activation Functions at a Glance

| Activation | Formula | Derivative | Range | Year |
|---|---|---|---|---|
| Sigmoid | $\frac{1}{1+e^{-z}}$ | $\sigma(z)(1-\sigma(z))$ | $(0, 1)$ | 1943 |
| Tanh | $\frac{e^z - e^{-z}}{e^z + e^{-z}}$ | $1 - \tanh^2(z)$ | $(-1, 1)$ | 1986 |
| ReLU | $\max(0, z)$ | $\mathbb{1}[z > 0]$ | $[0, +\infty)$ | 2010 |
| Leaky ReLU | $\max(\alpha z, z)$ | $\alpha$ if $z < 0$, else $1$ | $(-\infty, +\infty)$ | 2013 |
| PReLU | $\max(\alpha z, z)$, $\alpha$ learned | $\alpha$ if $z < 0$, else $1$ | $(-\infty, +\infty)$ | 2015 |
| ELU | $z$ if $z > 0$, $\alpha(e^z-1)$ if $z \leq 0$ | $1$ if $z > 0$, $\alpha e^z$ if $z \leq 0$ | $(-\alpha, +\infty)$ | 2015 |
| GELU | $z \cdot \Phi(z)$ | $\Phi(z) + z\phi(z)$ | $\approx [-0.17, +\infty)$ | 2016 |
| SiLU/Swish | $z \cdot \sigma(z)$ | $\sigma(z)(1 + z(1-\sigma(z)))$ | $\approx [-0.28, +\infty)$ | 2017 |
| Mish | $z \cdot \tanh(\text{softplus}(z))$ | Complex expression | $\approx [-0.31, +\infty)$ | 2019 |
| Softmax | $\frac{e^{z_i}}{\sum_j e^{z_j}}$ | $p_i(\delta_{ij} - p_j)$ | $(0, 1)$, sums to 1 | 1959 |

Gradient Properties Summary

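| Activation | Max gradient | Min gradient ($z \neq 0$) | Vanishing? | Dead neurons? | Zero-centered? |
|---|---|---|---|---|---|
| Sigmoid | 0.25 | $\to 0$ | Yes, severe | No | No |
| Tanh | 1.0 | $\to 0$ | Yes, moderate | No | Yes |
| ReLU | 1.0 | 0 (exact) | No (positive) | Yes | No |
| Leaky ReLU | 1.0 | $\alpha$ (e.g., 0.01) | No | No | Approximately |
| GELU | $\approx 1.13$ (at $z = \sqrt{2}$) | $\approx -0.13$ (at $z = -\sqrt{2}$) | No | No (soft) | Approximately |
| SiLU | $\approx 1.10$ | $\approx -0.10$ | No | No (soft) | Approximately |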

Part 9 - Historical Timeline and Key Papers

Understanding the historical progression helps you answer "why did the field move from X to Y?" questions.

| Year | Activation | Paper / Context | Key Impact |
|---|---|---|---|
| 1943 | Sigmoid (precursor) | McCulloch & Pitts - mathematical neuron model | Binary threshold neuron modeled on biological firing; smooth sigmoid units later replaced the hard threshold |
| 1986 | Tanh | Rumelhart et al. - backpropagation paper | Zero-centered improvement over sigmoid |
| 2010 | ReLU | Nair & Hinton - RBMs with ReLU | Solved vanishing gradients, enabled deep training |
| 2012 | ReLU | Krizhevsky et al. - AlexNet | ReLU + GPU training won ImageNet, launched modern DL |
| 2013 | Leaky ReLU | Maas et al. - rectifier nonlinearities | Fixed dying ReLU with small negative slope |
| 2015 | PReLU | He et al. - delving deep into rectifiers | Learned negative slope, improved ImageNet by 1.1% |
| 2015 | ELU | Clevert et al. - fast and accurate DL | Smooth, zero-centered alternative to ReLU |
| 2016 | GELU | Hendrycks & Gimpel | Probabilistic activation, later adopted by BERT/GPT |
| 2017 | Swish/SiLU | Ramachandran et al. - searching for activations | NAS-discovered, outperformed ReLU on many benchmarks |
| 2017 | SELU | Klambauer et al. - self-normalizing networks | Self-normalizing property without BatchNorm |
| 2019 | Mish | Misra - Mish: a self-regularized activation | Slight improvement over Swish in some vision tasks |
| 2020 | SwiGLU | Shazeer - GLU variants improve Transformer | Gated FFN with SiLU, adopted by LLaMA and PaLM |

The general trend is clear: the field moved from biologically inspired (sigmoid) to optimization-friendly (ReLU) to smooth and self-regularizing (GELU/SiLU) to gated combinations (SwiGLU). Each transition was driven by specific training failures at scale that the previous activation could not handle.

Part 10 - Activation Functions and Initialization: The Critical Pairing

The choice of activation function directly determines the correct weight initialization scheme. Using the wrong pairing is a common source of training failures.

The Core Principle

We want the variance of activations to remain stable across layers. If activations grow layer by layer, they explode. If they shrink, gradients vanish. Initialization must account for the activation function's effect on variance.

Xavier/Glorot Initialization (for Sigmoid and Tanh)

For a layer $z = Wx$ where $x$ has variance $v$ and the activation is symmetric around zero (like tanh):

$$\text{Var}(W_{ij}) = \frac{2}{n_{\text{in}} + n_{\text{out}}}$$

This keeps both forward activations and backward gradients at stable variance. Named after Xavier Glorot, who derived it in 2010.

Why Xavier fails with ReLU: ReLU zeroes out the negative half of the distribution, cutting the variance roughly in half. Xavier initialization does not account for this, causing activations to shrink with depth.

He/Kaiming Initialization (for ReLU)

$$\text{Var}(W_{ij}) = \frac{2}{n_{\text{in}}}$$

The factor of 2 compensates for ReLU zeroing out negative values. Named after Kaiming He, who derived it in 2015.

Why He fails with sigmoid: The variance is too large for sigmoid, causing pre-activations to land in the saturated flat regions where $\sigma'(z) \approx 0$. Training freezes immediately.
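The effect of the factor of 2 shows up clearly in a small simulation. The sketch below pushes a random vector through 10 bias-free ReLU layers of width 64 (both sizes are arbitrary choices for illustration) and reports the mean squared pre-activation at the last layer under each scheme. Both runs reuse the same seed so that only the weight scale differs.

```python
import math
import random

def final_mean_square(weight_std_fn, width=64, depth=10, seed=0):
    """Mean squared pre-activation after `depth` ReLU layers of size `width`."""
    rng = random.Random(seed)
    h = [rng.gauss(0.0, 1.0) for _ in range(width)]  # unit-variance input
    z = h
    for _ in range(depth):
        std = weight_std_fn(width, width)
        # Each output unit gets its own row of Gaussian weights.
        z = [sum(rng.gauss(0.0, 1.0) * std * h_j for h_j in h)
             for _ in range(width)]
        h = [max(0.0, v) for v in z]  # ReLU
    return sum(v * v for v in z) / width

def xavier_std(n_in, n_out):
    return math.sqrt(2.0 / (n_in + n_out))

def he_std(n_in, n_out):
    return math.sqrt(2.0 / n_in)

xavier_ms = final_mean_square(xavier_std)
he_ms = final_mean_square(he_std)

print(xavier_ms, he_ms, he_ms / xavier_ms)
```

With equal fan-in and fan-out, He weights are $\sqrt{2}$ times larger than Xavier weights, so after 10 ReLU layers the pre-activation scale differs by a factor of $2^{10} = 1024$: the Xavier signal has halved at every layer while the He signal has kept its scale.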

Initialization Cheat Sheet

| Activation | Initialization | Variance Formula | PyTorch Function |
|---|---|---|---|
| Sigmoid | Xavier/Glorot | $\frac{2}{n_\text{in} + n_\text{out}}$ | nn.init.xavier_uniform_ |
| Tanh | Xavier/Glorot | $\frac{2}{n_\text{in} + n_\text{out}}$ | nn.init.xavier_uniform_ |
| ReLU | He/Kaiming | $\frac{2}{n_\text{in}}$ | nn.init.kaiming_normal_ |
| Leaky ReLU ($\alpha$) | Modified He | $\frac{2}{(1+\alpha^2) \cdot n_\text{in}}$ | nn.init.kaiming_normal_(a=alpha) |
| GELU | He (approximately) | $\frac{2}{n_\text{in}}$ | nn.init.kaiming_normal_ |
| SELU | LeCun | $\frac{1}{n_\text{in}}$ | nn.init.normal_(std=1/sqrt(n_in)) |

What Happens with Wrong Pairings

| Wrong Pairing | Symptom | Why |
|---|---|---|
| Xavier + ReLU | Activations decay to zero in deep networks | ReLU halves variance, Xavier does not compensate |
| He + Sigmoid | Activations saturate, gradients vanish immediately | Pre-activations too large, sigmoid flat regions |
| Zero init + any | All neurons learn the same thing (symmetry problem) | Identical neurons get identical gradients |
| Too large init + ReLU | Exploding activations, NaN loss | Variance compounds multiplicatively across layers |

Interviewer's Perspective

"One of my favorite diagnostic questions is: 'Your network's loss is not decreasing from the first epoch. What do you check?' The top answer is 'check the initialization.' If you are using ReLU with Xavier init, or sigmoid with He init, training can fail from epoch zero. This is the kind of practical debugging knowledge that separates engineers from theorists."

Practice Problems

Problem 1: The "Why Not Sigmoid?" Question

An interviewer asks: "Your network has 20 hidden layers. You propose using sigmoid activations. Convince me this is a bad idea with a mathematical argument."

Hint 1 - Direction

Quantify the gradient attenuation through 20 layers. Use the maximum sigmoid derivative.

Hint 2 - Insight

$\sigma'(z) \leq 0.25$. Through 20 layers (19 sigmoid operations), the product of the sigmoid derivatives is at most $(0.25)^{19}$. Compute this number.

Hint 3 - Full Solution + Rubric

The gradient from layer 20 to layer 1 passes through 19 sigmoid derivatives. Even in the best case (all pre-activations at zero), and ignoring the weight matrices along the path:

$$\left|\frac{\partial L}{\partial W_1}\right| \leq (0.25)^{19} \cdot \left|\frac{\partial L}{\partial z_{20}}\right| \approx 3.6 \times 10^{-12} \cdot \left|\frac{\partial L}{\partial z_{20}}\right|$$

The gradient reaching layer 1 is only a few trillionths of the gradient at layer 20. For practical purposes, the first 10+ layers receive zero useful gradient signal - they do not learn.

Additionally, sigmoid outputs are not zero-centered ($\sigma(z) \in (0, 1)$), causing all gradients to have the same sign, which leads to zig-zag optimization paths.

Better alternatives: ReLU (gradient = 1 for positive), with He initialization and skip connections for a 20-layer network.

Scoring Rubric:

  • Strong Hire: Computes $(0.25)^{19}$, gives the numerical value, mentions non-zero-centered outputs as a second problem, proposes ReLU + skip connections
  • Lean Hire: Correctly states the vanishing gradient argument but does not compute the specific attenuation
  • No Hire: Cannot quantify the problem or says "sigmoid is fine for any depth"

Problem 2: Dying ReLU Diagnosis

You are training a 10-layer CNN with ReLU activations. After 1000 epochs, you notice that 35% of neurons in layer 5 have zero activation for every input in the validation set. Diagnose the problem and propose 3 solutions.

Hint 1 - Direction

35% dead neurons is abnormally high. Think about what could cause so many neurons to have permanently negative pre-activations.

Hint 2 - Insight

Common causes: learning rate too high (large weight updates overshoot), poor initialization (biases initialized too negative), or input data not normalized (pushing activations negative). Each suggests a different fix.

Hint 3 - Full Solution + Rubric

Diagnosis: 35% dead neurons suggests aggressive weight updates pushed pre-activations permanently negative. This is the dying ReLU problem - once $\mathbf{w}^T\mathbf{x} + b < 0$ for all inputs, the neuron receives zero gradient and cannot recover.

Root cause analysis:

  1. Learning rate may be too high, causing weights to overshoot during updates
  2. Initialization may have negative biases or large negative weights
  3. Batch normalization may be absent, allowing pre-activations to drift negative

Three solutions:

  1. Switch to Leaky ReLU ($\alpha = 0.01$): Neurons stuck in the negative region now have gradient $0.01$ instead of $0$, allowing recovery. Minimal computational overhead.

  2. Reduce learning rate + use warmup: Start with a small learning rate (1/10th of current) for the first 5 epochs, then ramp up. This prevents early large updates that kill neurons.

  3. Add batch normalization before ReLU: BN re-centers pre-activations around zero at each layer, ensuring roughly 50% of pre-activations are positive (keeping neurons alive).

Bonus solution: Re-initialize dead neurons during training (some frameworks support this). Monitor per-layer dead neuron percentage as a training diagnostic.

Scoring Rubric:

  • Strong Hire: Diagnoses root cause (high LR or bad init), provides 3 distinct solutions with rationale, mentions monitoring dead neuron percentage as a training metric
  • Lean Hire: Correctly identifies dying ReLU, proposes 2 solutions
  • No Hire: Cannot explain why neurons die or only says "use a different activation"

Problem 3: GELU vs ReLU in Transformers

An interviewer asks: "We are building a Transformer for text classification. You propose using GELU instead of ReLU. I push back - ReLU is simpler and faster. Defend your choice."

Hint 1 - Direction

Think about what properties of GELU specifically help Transformers that ReLU does not provide. Consider the gradient behavior near zero and the attention mechanism.

Hint 2 - Insight

Three arguments: (1) smooth gradient near zero helps attention score gradients, (2) GELU acts as implicit regularization (probabilistic gating), (3) empirical results from BERT, GPT consistently show GELU outperforms ReLU in Transformers.

Hint 3 - Full Solution + Rubric

Argument 1 - Gradient smoothness at zero: ReLU has a discontinuous gradient at z=0z = 0: the derivative jumps from 0 to 1. In Transformers, attention scores and FFN activations frequently pass through values near zero. The gradient discontinuity creates instability during backpropagation through these near-zero regions. GELU has a smooth, continuous gradient everywhere: GELU(z)=Φ(z)+zϕ(z)\text{GELU}'(z) = \Phi(z) + z\phi(z), which transitions gradually. This produces more stable training dynamics.

Argument 2 - Implicit regularization: GELU is equivalent to multiplying the input by a Bernoulli mask whose probability depends on the input magnitude: GELU(z)=zP(Zz)\text{GELU}(z) = z \cdot P(Z \leq z) where ZN(0,1)Z \sim \mathcal{N}(0,1). This provides input-dependent dropout-like regularization. Transformers are parameter-heavy and prone to overfitting - the implicit regularization from GELU helps.

Argument 3 - Empirical evidence: Many major Transformer models (BERT, GPT-2, GPT-3, ViT) use GELU. Not all do: the original Transformer and the original T5 used ReLU. Ablation studies typically report a 0.5-1.5% improvement over ReLU on language modeling perplexity. The computational overhead is modest (approximately 10-15% more time in the FFN layer, and the activation itself is a small share of total step time).

Concession: If inference latency is absolutely critical (edge deployment, real-time systems), ReLU is a reasonable choice. But for training quality, GELU is generally the stronger default for Transformers.

Scoring Rubric:

  • Strong Hire: Gives all three arguments with mathematical backing, acknowledges the latency tradeoff, mentions specific models that use GELU
  • Lean Hire: Correctly states GELU is smoother and cites empirical evidence but cannot explain the probabilistic interpretation
  • No Hire: Can only say "GELU is better" without explaining why the smoothness matters for Transformers specifically

Problem 4: Softmax Temperature

You have a trained classifier that outputs logits $\mathbf{z} = [2.0, 1.0, 0.1]$. Compute the softmax probabilities at temperatures $T = 1$, $T = 0.5$, and $T = 2$. Explain a practical use for each temperature.

Hint 1 - Direction

Divide logits by $T$ before applying softmax. Lower temperature sharpens the distribution, higher temperature flattens it.

Hint 2 - Insight

At $T = 1$: standard softmax. At $T = 0.5$: divide logits by 0.5 (double them), making differences larger. At $T = 2$: halve the logits, reducing differences.

Hint 3 - Full Solution + Rubric

At $T = 1$: $\mathbf{z}/T = [2.0, 1.0, 0.1]$

  • $e^{2.0} = 7.389$, $e^{1.0} = 2.718$, $e^{0.1} = 1.105$, sum = 11.212
  • Probabilities: $[0.659, 0.242, 0.099]$

At $T = 0.5$: $\mathbf{z}/T = [4.0, 2.0, 0.2]$

  • $e^{4.0} = 54.60$, $e^{2.0} = 7.389$, $e^{0.2} = 1.221$, sum = 63.21
  • Probabilities: $[0.864, 0.117, 0.019]$ - much sharper

At $T = 2$: $\mathbf{z}/T = [1.0, 0.5, 0.05]$

  • $e^{1.0} = 2.718$, $e^{0.5} = 1.649$, $e^{0.05} = 1.051$, sum = 5.418
  • Probabilities: $[0.502, 0.304, 0.194]$ - much flatter
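The three hand computations above can be reproduced with a short script. This is a minimal sketch using only the standard library; the function name `softmax_with_temperature` is a choice made here. It subtracts the maximum scaled logit before exponentiating, which leaves the result unchanged but avoids overflow for large logits.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    # Subtracting the max does not change the probabilities, because the
    # common factor exp(-max) cancels between numerator and denominator.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]
for T in (1.0, 0.5, 2.0):
    probs = softmax_with_temperature(logits, T)
    print(f"T={T}: {[round(p, 3) for p in probs]}")
# T=1.0: [0.659, 0.242, 0.099]
# T=0.5: [0.864, 0.117, 0.019]
# T=2.0: [0.502, 0.304, 0.194]
```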

Practical uses:

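  • $T = 1$: Standard classification inference
  • $T < 1$ (e.g., 0.5): More deterministic, near-greedy decoding in LLMs (as $T \to 0$, sampling approaches argmax)
  • $T > 1$ (e.g., 2): Knowledge distillation (Hinton et al., 2015) - softened probabilities from the teacher network reveal more information about class similarities ("dark knowledge"). Temperature scaling for model calibration also often ends up with $T > 1$, since it softens overconfident predictions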

Scoring Rubric:

  • Strong Hire: Computes all three correctly, explains knowledge distillation, mentions temperature in LLM sampling (top-k + temperature)
  • Lean Hire: Correct computations and basic understanding of sharpening/flattening
  • No Hire: Cannot compute softmax with temperature or does not know the direction of the effect

Interview Cheat Sheet

| Concept | Key Fact | Common Mistakes |
|---|---|---|
| Why nonlinearity | Without it, deep network = single linear layer | Saying "to make it more complex" without explaining the linear collapse |
| Sigmoid | $\sigma'(z) \leq 0.25$, not zero-centered | Saying sigmoid "always causes vanishing gradients" (it does not if network is shallow) |
| Tanh | Max derivative 1.0, zero-centered, still saturates | Saying tanh solves vanishing gradients completely |
| ReLU | Gradient = 1 for $z > 0$, creates sparsity | Forgetting the dying ReLU problem |
| Dying ReLU | Neurons with $z < 0$ for all inputs get zero gradient forever | Saying "just use Leaky ReLU" without diagnosing root cause |
| Leaky ReLU | Slope $\alpha = 0.01$ for $z < 0$, prevents dead neurons | Confusing with PReLU (learned $\alpha$) |
| GELU | $z \cdot \Phi(z)$, smooth, used in Transformers | Not knowing the tanh approximation formula |
| SiLU/Swish | $z \cdot \sigma(z)$, used in LLaMA's gated FFN | Confusing SiLU with GELU |
| Softmax | Converts logits to probabilities, use numerically stable version | Applying softmax to hidden layers |
| Temperature | $T < 1$ sharpens, $T > 1$ flattens | Getting the direction wrong |
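Since the cheat sheet flags the tanh approximation of GELU as a common gap, here is a small sketch comparing it against the exact form. The approximation used is the standard one from the GELU paper, $0.5\,z\,\bigl(1 + \tanh\bigl(\sqrt{2/\pi}\,(z + 0.044715\,z^3)\bigr)\bigr)$; the function names and the evaluation grid are illustrative choices.

```python
import math


def gelu_exact(z):
    # z * Phi(z), with the standard normal CDF written via erf.
    return z * 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def gelu_tanh(z):
    # Tanh-based approximation of GELU.
    inner = math.sqrt(2.0 / math.pi) * (z + 0.044715 * z ** 3)
    return 0.5 * z * (1.0 + math.tanh(inner))


# Compare the two forms on a grid from -5.0 to 5.0 in steps of 0.1.
grid = [i / 10.0 for i in range(-50, 51)]
max_gap = max(abs(gelu_exact(z) - gelu_tanh(z)) for z in grid)
print(f"max |exact - tanh approx| on the grid: {max_gap:.6f}")
```

The gap stays small across the grid, which is why the cheaper tanh form is a common drop-in for the exact erf form.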

Spaced Repetition Checkpoints

Day 0 - After First Read

  • Write the formula and derivative for: sigmoid, tanh, ReLU, GELU, SiLU
  • Explain the dying ReLU problem in 3 sentences
  • State which activation is used in: BERT, GPT, ResNet, LSTM gates, LLaMA FFN

Day 3 - First Review

  • Explain why sigmoid causes vanishing gradients with a specific numerical example
  • Compare GELU and ReLU: give 3 differences that matter for Transformers
  • Compute softmax with temperature for a 3-element vector without looking at notes

Day 7 - Connections Review

  • Explain the connection between activation functions and gradient flow through a deep network
  • Trace the decision flowchart: given an architecture, choose the activation and justify it
  • Explain the SwiGLU FFN pattern used in LLaMA

Day 14 - Interview Simulation

  • Give a 60-second answer on "why do we need activation functions?"
  • Defend GELU over ReLU for a Transformer with 3 specific technical arguments
  • Diagnose a dying ReLU scenario and propose 3 solutions with rationale

Day 21 - Final Calibration

  • Complete all 4 practice problems under time pressure (8 minutes each)
  • For any activation function named by the interviewer, immediately state: formula, derivative, range, where it is used, key advantage, key limitation
  • Connect activation functions to: backpropagation (gradient properties), initialization (He vs Xavier), normalization (interaction effects)
© 2026 EngineersOfAI. All rights reserved.