Activation Functions

The Real Interview Moment

You are implementing a custom transformer block and the pre-LayerNorm variant is training unstably. The loss oscillates, gradients spike at random steps, and the model fails to converge on runs where an equivalent model from a published codebase converges cleanly. You have tried adjusting the learning rate and weight decay. Nothing works. Your tech lead asks: "What activation function are you using, and why?"

You answer "ReLU" because that is the default and it worked in your last project, a CNN image classifier. Your lead responds: "ReLU in a transformer? That's the problem. Switch to GELU. It's in the original BERT paper for a reason." You make the change. The model converges.

After the call, you realize you do not actually know why GELU works better than ReLU in transformers, or what property of ReLU causes instability in that setting. You know the empirical answer but not the mechanistic one. This lesson fills that gap - completely.

Understanding activation functions is not academic. It is the difference between debugging a training run in 20 minutes versus 3 days. The choice of activation determines gradient flow, whether neurons die, whether training is stable, and ultimately what performance ceiling the model can reach. Different tasks and architectures require different activation functions, and the reasoning is precise, not intuitive.

By the end of this lesson you will know the mathematical properties of every major activation function, derive why sigmoid saturates and what that means for gradient flow, understand the dying ReLU problem at the code level, and make principled activation function choices for any architecture.

Why Activation Functions Exist

Without non-linearity, a stack of linear transformations collapses to a single linear transformation:

$$\mathbf{W}_3 (\mathbf{W}_2 (\mathbf{W}_1 \mathbf{x})) = (\mathbf{W}_3 \mathbf{W}_2 \mathbf{W}_1) \mathbf{x} = \mathbf{W}' \mathbf{x}$$

This means 100 linear layers are exactly as expressive as one. The activation function is what makes depth meaningful - it introduces non-linearity that cannot be absorbed into a single matrix multiplication.
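The collapse can be checked numerically. The following is a minimal sketch in plain Python; the 2x2 weight matrices and the input vector are arbitrary illustrative values, not taken from any real model.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def matvec(m, v):
    """Apply matrix m to vector v."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]


# Arbitrary illustrative weights (assumption: no biases, no activations).
W1 = [[1.0, 2.0], [0.0, -1.0]]
W2 = [[0.5, 0.0], [1.0, 1.0]]
W3 = [[2.0, -1.0], [0.0, 3.0]]
x = [1.0, -2.0]

# Three "layers" applied one after another.
layer_by_layer = matvec(W3, matvec(W2, matvec(W1, x)))

# One combined matrix W' = W3 W2 W1 applied once.
W_combined = matmul(W3, matmul(W2, W1))
single_layer = matvec(W_combined, x)

print(layer_by_layer, single_layer)
```

Both vectors agree, which is the sense in which depth without a non-linearity adds no expressive power.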

The ideal activation function has multiple desirable properties, but no single function satisfies all of them simultaneously:

| Property | Why It Matters |
| --- | --- |
| Non-linear | Makes depth useful |
| Non-saturating for large inputs | Prevents vanishing gradients in deep networks |
| Computationally cheap | Called billions of times per training run |
| Differentiable (almost everywhere) | Required for backpropagation |
| Zero-centered output | Prevents gradient bias in downstream layers |
| Non-dying | Neurons that become inactive can still recover |

The history of deep learning is partly a history of discovering better tradeoffs. Sigmoid dominated until 2012 (AlexNet), ReLU dominated until roughly 2018 (BERT), and GELU/SiLU now dominate in large language models.

Sigmoid: Historical Standard, Now Mostly Retired

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

The derivative follows a useful identity:

$$\sigma'(x) = \sigma(x)(1 - \sigma(x))$$

Proof of saturation: at $x = 0$, $\sigma(0) = 0.5$, so $\sigma'(0) = 0.5 \times 0.5 = 0.25$. This is the maximum. For any $x$, $\sigma(x)(1-\sigma(x)) \leq 0.25$ by the AM-GM inequality (the product of two numbers summing to 1 is maximized when both are 0.5). Therefore:

$$\sigma'(x) \leq 0.25 \quad \forall x$$

As $|x|$ increases, the gradient approaches 0 rapidly:

  • $|x| = 2$: $\sigma' \approx 0.105$
  • $|x| = 4$: $\sigma' \approx 0.018$
  • $|x| = 6$: $\sigma' \approx 0.002$

Compound decay across layers: in a network with $L$ sigmoid layers, a gradient entering the output layer and propagating backward is multiplied by each layer's local gradient. Counting only the activation derivatives (the weight matrices are ignored here), even the best case is bounded by:

$$\left\|\frac{\partial L}{\partial \mathbf{z}^{(1)}}\right\| \leq (0.25)^{L-1} \left\|\frac{\partial L}{\partial \mathbf{z}^{(L)}}\right\|$$

For $L = 20$ sigmoid layers: $(0.25)^{19} \approx 3.6 \times 10^{-12}$. Early layers receive essentially zero gradient and learn nothing. This is the vanishing gradient problem that made deep sigmoid networks so hard to train before ReLU, careful initialization, and residual connections.
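To get a feel for how fast this bound shrinks with depth, here is a small sketch that evaluates the factor $(0.25)^{L-1}$ for a few depths. It assumes, as the bound does, that only the sigmoid derivatives are counted and that weight matrices are ignored; the function name is illustrative.

```python
SIGMOID_MAX_GRAD = 0.25  # maximum of sigma'(x), attained at x = 0


def sigmoid_attenuation_bound(num_layers: int) -> float:
    """Largest possible product of (num_layers - 1) sigmoid derivatives."""
    return SIGMOID_MAX_GRAD ** (num_layers - 1)


for depth in (2, 5, 10, 20):
    print(f"L={depth:2d}: bound = {sigmoid_attenuation_bound(depth):.3e}")
```

Real networks usually do worse than this bound, because pre-activations are rarely all exactly zero.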

Output is not zero-centered: sigmoid outputs are always positive, $\sigma(x) \in (0, 1)$. This means for any layer computing $\mathbf{z} = \mathbf{W}\mathbf{a} + \mathbf{b}$ where $\mathbf{a}$ is a sigmoid output, the gradient $\partial L / \partial \mathbf{W}$ has the same sign for all entries in a row, i.e. for all weights feeding one neuron $j$. Since $\partial z_j / \partial W_{ji} = a_i > 0$ always, each entry $\partial L / \partial W_{ji} = \delta_j a_i$ takes the sign of $\delta_j = \partial L / \partial z_j$. This forces zig-zag updates in weight space and slows convergence.
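The sign constraint is easy to see for a single neuron. The sketch below assumes hypothetical pre-activations for the previous layer and a hypothetical upstream gradient `delta_j`; it only illustrates that every weight gradient for that neuron inherits the sign of `delta_j` when the inputs are sigmoid outputs.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical pre-activations of the previous layer; sigmoid maps them into (0, 1).
prev_layer_z = [-2.0, -0.5, 0.0, 1.5, 3.0]
a = [sigmoid(z) for z in prev_layer_z]


def weight_grads_for_neuron(delta_j: float, inputs: list) -> list:
    """dL/dW_ji = delta_j * a_i for every weight feeding neuron j (delta_j = dL/dz_j)."""
    return [delta_j * a_i for a_i in inputs]


grads_when_delta_positive = weight_grads_for_neuron(+0.7, a)
grads_when_delta_negative = weight_grads_for_neuron(-0.7, a)

print("delta_j > 0:", [f"{g:+.3f}" for g in grads_when_delta_positive])
print("delta_j < 0:", [f"{g:+.3f}" for g in grads_when_delta_negative])
```

With zero-centered inputs (for example tanh outputs) the `a_i` would have mixed signs, and so would the weight gradients.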

When to use: output layer for binary classification (not hidden layers). Gates in LSTM cells (where the bounded 0–1 range is semantically meaningful). Never in hidden layers of deep networks.

import numpy as np
import torch

# Demonstrating sigmoid saturation
x_vals = np.array([0, 1, 2, 3, 4, 5, 6])
sigmoid = lambda x: 1 / (1 + np.exp(-x))
sig_vals = sigmoid(x_vals)
grad_vals = sig_vals * (1 - sig_vals)

print("x | sigmoid(x) | gradient")
print("----|------------|----------")
for x, s, g in zip(x_vals, sig_vals, grad_vals):
    print(f"{x:3d} | {s:.6f} | {g:.6f}")
# x=0: gradient=0.250000 (maximum)
# x=4: gradient=0.017663
# x=6: gradient=0.002467
# Across 20 layers: 0.25^20 ≈ 9.1e-13

Tanh: Centered but Still Saturating

$$\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$$

The relationship to sigmoid:

$$\tanh(x) = 2\sigma(2x) - 1$$

This is a scaled and shifted sigmoid: it is approximately linear only near the origin (roughly $|x| < 1$) and saturates symmetrically to $\pm 1$ as $|x|$ grows. The derivative:

$$\tanh'(x) = 1 - \tanh^2(x)$$

Properties:

  • Output range: $(-1, 1)$ - zero-centered (unlike sigmoid)
  • Maximum gradient: $\tanh'(0) = 1$ (4x larger than sigmoid's maximum of 0.25)
  • Still saturates for $|x| > 2$: $\tanh'(2) \approx 0.07$
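The identity and the derivative values listed above can be verified with a few lines of standard-library Python. This is a sketch for checking the algebra, with helper names chosen for illustration; it prints both derivatives side by side so the peak and the tails can be compared.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def tanh_via_sigmoid(x: float) -> float:
    """tanh written as a scaled, shifted sigmoid: 2*sigmoid(2x) - 1."""
    return 2.0 * sigmoid(2.0 * x) - 1.0


def tanh_grad(x: float) -> float:
    return 1.0 - math.tanh(x) ** 2


def sigmoid_grad(x: float) -> float:
    s = sigmoid(x)
    return s * (1.0 - s)


for x in (-3.0, -1.0, 0.0, 0.5, 2.0, 4.0):
    gap = abs(math.tanh(x) - tanh_via_sigmoid(x))
    print(f"x={x:+.1f}  identity gap={gap:.1e}  "
          f"tanh'={tanh_grad(x):.4f}  sigmoid'={sigmoid_grad(x):.4f}")
```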

Why tanh is better than sigmoid for hidden layers: the zero-centering eliminates the zig-zag gradient update problem. The larger maximum gradient (1 vs 0.25) means gradients vanish more slowly across layers - for $L$ tanh layers, the best-case product of activation derivatives is $1^{L-1} = 1$, reached only when all pre-activations are near zero. But tanh still saturates for large $|x|$, causing the same fundamental vanishing gradient problem in deep networks.

When to use: hidden states in RNNs and LSTMs (architecturally required), when bounded zero-centered outputs are needed, in specific output layers requiring $(-1, 1)$ range.

import torch
import torch.nn as nn

x = torch.linspace(-4, 4, 9)
tanh_vals = torch.tanh(x)
tanh_grad = 1 - tanh_vals**2

print("x | tanh(x) | gradient")
for xi, ti, gi in zip(x, tanh_vals, tanh_grad):
    print(f"{xi:+.1f} | {ti:+.4f} | {gi:.4f}")
# x=0: gradient=1.0000 (maximum)
# x=2: gradient=0.0707
# x=3: gradient=0.0099
# Still saturates: the peak gradient is larger than sigmoid's, but the tails still vanish

ReLU: The Default That Changed Deep Learning

$$\text{ReLU}(x) = \max(0, x)$$

$$\text{ReLU}'(x) = \begin{cases} 1 & x > 0 \\ 0 & x \leq 0 \end{cases}$$

Introduced by Nair and Hinton (2010) for Restricted Boltzmann Machines, then popularized by Glorot et al. (2011) and the AlexNet paper (Krizhevsky et al., 2012). AlexNet's top-5 error was more than 10.8 percentage points lower than the runner-up's - partly because it used ReLU instead of saturating activations throughout.

Why ReLU changed deep learning: for positive inputs, $\text{ReLU}'(x) = 1$. The gradient passes through unchanged. No shrinkage. In a 20-layer ReLU network, a gradient at the output propagates backward through active neurons with an activation-derivative product of $1^{20} = 1$ - the activation itself causes no vanishing (the weight matrices still scale the gradient, which is what initialization schemes control). This single property enabled training deep networks that were effectively untrainable with sigmoid.

Advantages:

  • Computationally trivial: max(0, x) is a single comparison
  • Gradient is exactly 1 for positive activations - no saturation, no shrinkage
  • Sparse activations: negative inputs produce 0 - roughly half the neurons are inactive at any time. This sparsity has regularization-like effects and reduces computational cost

The dying ReLU problem: if a neuron's pre-activation $z = \mathbf{w}^T \mathbf{x} + b$ is negative for all inputs in the training set, the neuron's output is always 0. Because $\text{ReLU}'(z) = 0$ for $z \leq 0$, the gradient flowing backward through this neuron is also 0. The weight update is $\Delta \mathbf{w} = -\eta \cdot 0 = 0$ - the neuron never receives any gradient signal. It is permanently dead.

What causes neurons to die:

  1. Large learning rate: a single large gradient step pushes $\mathbf{w}$ into a region where all training inputs produce $z < 0$
  2. Poor initialization: weights initialized to values that produce negative pre-activations
  3. Large negative bias: a bias term initialized too negative shifts all pre-activations below zero
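The mechanism can be reproduced with a single ReLU neuron trained by plain SGD. The sketch below is illustrative only: the four-example dataset, the squared-error loss, and the learning rate are all assumptions made for the demonstration. It contrasts a neuron whose bias has been pushed far negative with the same neuron at zero bias.

```python
def relu(z: float) -> float:
    return max(0.0, z)


def relu_grad(z: float) -> float:
    return 1.0 if z > 0 else 0.0


# Hypothetical toy dataset: two features per example, one regression target each.
INPUTS = [[0.2, 0.9], [0.5, 0.1], [0.8, 0.4], [0.3, 0.3]]
TARGETS = [1.0, 0.5, 0.8, 0.2]


def sgd_epoch(w, b, lr=0.1):
    """One pass of per-example SGD on squared error for a single ReLU neuron.

    Returns the updated weights, the updated bias, and the summed |dL/dz| seen.
    """
    grad_signal = 0.0
    for x, t in zip(INPUTS, TARGETS):
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        dL_dz = 2.0 * (relu(z) - t) * relu_grad(z)  # exactly 0 whenever z <= 0
        w = [wi - lr * dL_dz * xi for wi, xi in zip(w, x)]
        b = b - lr * dL_dz
        grad_signal += abs(dL_dz)
    return w, b, grad_signal


# Bias far below zero: z < 0 for every example, so no parameter ever moves.
dead_w, dead_b, dead_signal = sgd_epoch([0.5, 0.5], -5.0)
# Same weights with zero bias: z > 0, so gradient flows and parameters change.
live_w, live_b, live_signal = sgd_epoch([0.5, 0.5], 0.0)

print("dead:", dead_w, dead_b, dead_signal)
print("live:", live_w, live_b, live_signal)
```

No number of further epochs changes the dead neuron, because every update is a multiple of the zero gradient.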

Forward pre-activation statistics: in a healthy network, roughly 50% of ReLU neurons are active at any given input (the inactive half are not dead, just currently zero). A dead neuron is one that is zero for every input - not just most. Monitoring the fraction of neurons with zero activation across a large validation set is a diagnostic for dying ReLU.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader


def count_dead_relu_neurons(model: nn.Module, loader: DataLoader, device: torch.device) -> dict:
    """
    Estimate the fraction of dead ReLU neurons in a model.
    A neuron is counted as dead if its activation is zero for every example seen.
    Normal ReLU neurons are zero for ~50% of inputs; dead neurons are zero for 100%.
    """
    model.eval()
    hooks = {}
    zero_counts = {}    # per neuron: number of batches in which it was zero for ALL examples
    total_batches = {}  # per layer: number of batches observed

    def make_hook(name):
        def hook(module, input, output):
            # For each neuron, check if ALL examples in this batch have zero activation
            # output shape: (batch_size, num_neurons)
            always_zero_in_batch = (output == 0).all(dim=0).float()  # (num_neurons,)

            if name not in zero_counts:
                zero_counts[name] = torch.zeros_like(always_zero_in_batch)
                total_batches[name] = 0

            zero_counts[name] += always_zero_in_batch
            total_batches[name] += 1
        return hook

    # Attach a forward hook to every nn.ReLU module
    for name, module in model.named_modules():
        if isinstance(module, nn.ReLU):
            hooks[name] = module.register_forward_hook(make_hook(name))

    n_batches = 0
    with torch.no_grad():
        for batch_x, _ in loader:
            model(batch_x.to(device))
            n_batches += 1
            if n_batches >= 20:  # only a sample of batches, so the result is an estimate
                break

    for hook in hooks.values():
        hook.remove()

    results = {}
    for name in zero_counts:
        # A neuron is dead if it was zero in ALL observed batches
        dead_mask = (zero_counts[name] == total_batches[name])
        total = dead_mask.numel()
        dead = dead_mask.sum().item()
        results[name] = {
            "dead_fraction": dead / total,
            "dead_count": dead,
            "total_neurons": total,
        }
        print(f"{name}: {dead}/{total} neurons dead ({100*dead/total:.1f}%)")

    return results

Leaky ReLU: Preventing Death

$$\text{LeakyReLU}(x) = \begin{cases} x & x > 0 \\ \alpha x & x \leq 0 \end{cases}$$

Typically $\alpha = 0.01$. The gradient is never zero - it is either 1 (positive region) or $\alpha$ (negative region). Neurons can no longer permanently die because gradient always flows through them, regardless of activation sign.

Tradeoff: it introduces a hyperparameter $\alpha$ that must be chosen. A small $\alpha$ such as 0.01 is conventional; a larger $\alpha$ such as 0.2 lets more of the negative signal through. For most applications, 0.01 works well.

import torch
import torch.nn as nn

x = torch.tensor([-3.0, -1.0, 0.0, 1.0, 3.0])
leaky = nn.LeakyReLU(negative_slope=0.01)
out = leaky(x)
# [-0.03, -0.01, 0.0, 1.0, 3.0]
# Negative inputs produce small negative outputs - gradient always flows

PReLU: Learnable Slope (He et al., 2015)

$$\text{PReLU}(x) = \begin{cases} x & x > 0 \\ \alpha x & x \leq 0 \end{cases}$$

Like Leaky ReLU, but $\alpha$ is a learned parameter - updated by gradient descent along with the weights. Each layer (or even each channel) can have its own $\alpha$.

Key result: He et al. (2015) reported that PReLU reduced top-1 error on ImageNet by about 1.2 percentage points relative to ReLU for a 14-layer plain (non-residual) convolutional network. The learned slopes differed by depth: the first convolutional layer learned comparatively large values of $\alpha$, while deeper layers learned progressively smaller ones. The learnable slope lets the network adapt the degree of negative-region gradient to its needs.

import torch
import torch.nn as nn

# PReLU - one alpha per channel (or one for the whole layer)
prelu = nn.PReLU(num_parameters=1) # shared alpha
prelu_per_channel = nn.PReLU(num_parameters=64) # alpha per channel

x = torch.randn(32, 64)
out = prelu_per_channel(x)
# Alpha is initialized to 0.25 by default in PyTorch (the `init` argument of nn.PReLU)
print(f"PReLU alpha range: [{prelu_per_channel.weight.min():.3f}, {prelu_per_channel.weight.max():.3f}]")

ELU: Smooth Negative Region

$$\text{ELU}(x) = \begin{cases} x & x > 0 \\ \alpha(e^x - 1) & x \leq 0 \end{cases}$$

Typically $\alpha = 1.0$.

Properties:

  • For $x > 0$: identical to ReLU, gradient = 1
  • For $x \leq 0$: smoothly approaches $-\alpha$ as $x \to -\infty$, gradient = $\alpha e^x$ (decays toward 0 but is never exactly zero)
  • Outputs can be negative: the mean activation is closer to zero than ReLU (which has non-negative mean), reducing the bias shift passed to the next layer even without batch normalization
  • Differentiable everywhere when $\alpha = 1$, unlike ReLU's kink at 0

Noise robustness: because ELU saturates for large negative inputs (approaches $-\alpha$), it is robust to large negative pre-activations caused by noise. A single corrupted input that gives $z = -1000$ will produce ELU output $\approx -1$ - bounded damage. ReLU clamps the same input to 0. Leaky ReLU with $\alpha = 0.01$ produces $-10$, and its output keeps growing linearly with the size of the outlier.
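A quick way to see the difference between bounded and unbounded negative regions is to feed increasingly extreme negative pre-activations through each function. This sketch uses scalar standard-library implementations with the conventional defaults ($\alpha = 1$ for ELU, slope 0.01 for Leaky ReLU); the outlier values are arbitrary.

```python
import math


def relu(x: float) -> float:
    return max(0.0, x)


def leaky_relu(x: float, alpha: float = 0.01) -> float:
    return x if x > 0 else alpha * x


def elu(x: float, alpha: float = 1.0) -> float:
    # math.exp underflows to 0.0 for very negative x, so the output tends to -alpha
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


print(f"{'z':>8} | {'ReLU':>6} | {'LeakyReLU':>9} | {'ELU':>8}")
for z in (-1.0, -10.0, -100.0, -1000.0):
    print(f"{z:8.0f} | {relu(z):6.2f} | {leaky_relu(z):9.2f} | {elu(z):8.4f}")
```

ELU and ReLU stay bounded no matter how extreme the outlier is, while the Leaky ReLU output scales with it.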

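Cost: the $e^x$ computation in the negative region makes ELU more expensive than ReLU's max(0, x). The exact ratio depends on hardware and framework, so measure it on your own setup rather than relying on a fixed multiplier.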

import torch
import torch.nn as nn
import numpy as np

elu = nn.ELU(alpha=1.0)
x = torch.linspace(-5, 3, 100)
out = elu(x)

# Key properties
print(f"ELU(-1) = {elu(torch.tensor(-1.0)):.4f}") # ≈ -0.6321 (alpha*(e^-1 - 1))
print(f"ELU(-5) = {elu(torch.tensor(-5.0)):.4f}") # ≈ -0.9933 (approaches -alpha)
print(f"ELU(0) = {elu(torch.tensor(0.0)):.4f}") # = 0.0
print(f"ELU(2) = {elu(torch.tensor(2.0)):.4f}") # = 2.0

SELU: Self-Normalizing Networks (Klambauer et al., 2017)

$$\text{SELU}(x) = \lambda \begin{cases} x & x > 0 \\ \alpha(e^x - 1) & x \leq 0 \end{cases}$$

With specific mathematically derived constants:

  • $\alpha \approx 1.6733$
  • $\lambda \approx 1.0507$

The self-normalizing property: for inputs that are approximately zero-mean and unit-variance, SELU outputs are also approximately zero-mean and unit-variance. This means activations do not drift across layers - the normalization is built into the activation function itself. No batch normalization needed.

The derivation involves fixed-point analysis: find $(\alpha, \lambda)$ such that if inputs have mean $\mu_0$ and variance $\nu_0$, outputs have the same statistics. Choosing the fixed point $\mu_0 = 0$, $\nu_0 = 1$ gives the constants above.
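The fixed point can be checked without any sampling noise by integrating SELU against the standard normal density. The sketch below uses a simple midpoint rule from the standard library; the constants are the commonly published full-precision values of $\alpha$ and $\lambda$, and the integration range and step count are assumptions chosen for convenience.

```python
import math

# Full-precision SELU constants (the text quotes them rounded to four decimals).
ALPHA = 1.6732632423543772
LAMBDA = 1.0507009873554805


def selu(x: float) -> float:
    return LAMBDA * (x if x > 0 else ALPHA * (math.exp(x) - 1.0))


def normal_pdf(x: float) -> float:
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def selu_output_moments(lo: float = -10.0, hi: float = 10.0, steps: int = 100_000):
    """Mean and variance of SELU(z) for z ~ N(0, 1), via the midpoint rule."""
    h = (hi - lo) / steps
    first = second = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h
        weight = normal_pdf(x) * h
        y = selu(x)
        first += y * weight
        second += y * y * weight
    return first, second - first * first


mean, var = selu_output_moments()
print(f"output mean = {mean:.6f}, output variance = {var:.6f}")
```

The mean comes out at essentially 0 and the variance at essentially 1, which is exactly the fixed point the constants were derived for.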

Requirements:

  1. Weights must be initialized with LeCun normal: $\mathcal{N}(0, 1/n_{\text{in}})$
  2. Must use nn.AlphaDropout (not standard nn.Dropout) to preserve self-normalizing properties
  3. Input features must be approximately normalized

When SELU is useful: deep MLPs on tabular data where you want to avoid batch normalization (which requires large batch sizes). In practice, SELU is less commonly used than the combination of ReLU + BatchNorm, which is simpler to reason about.

import torch
import torch.nn as nn

selu = nn.SELU()
# alpha and lambda are baked into the implementation
print(f"SELU(-3) = {selu(torch.tensor(-3.0)):.4f}") # ≈ -1.6706 (lambda * alpha * (e^-3 - 1))
print(f"SELU(0) = {selu(torch.tensor(0.0)):.4f}") # = 0.0
print(f"SELU(2) = {selu(torch.tensor(2.0)):.4f}") # ≈ 2.1014 (lambda * 2)

# Self-normalizing: verify mean and variance preservation
x = torch.randn(10000) # standard normal input
out = selu(x)
print(f"Input mean: {x.mean():.3f}, var: {x.var():.3f}") # ≈ 0, 1
print(f"Output mean: {out.mean():.3f}, var: {out.var():.3f}") # also ≈ 0, 1

# LeCun normal initialization for SELU networks
class SELUNetwork(nn.Module):
    def __init__(self, dims: list[int]):
        super().__init__()
        layers = []
        for i in range(len(dims) - 1):
            layer = nn.Linear(dims[i], dims[i+1])
            # LeCun normal: std = 1/sqrt(fan_in)
            nn.init.normal_(layer.weight, mean=0, std=(1.0 / dims[i]) ** 0.5)
            nn.init.zeros_(layer.bias)
            layers.extend([layer, nn.SELU()])
            if i < len(dims) - 2:
                layers.append(nn.AlphaDropout(0.1))  # not regular Dropout!
        # The final Linear is followed only by a SELU (no dropout is appended for the
        # last layer), so drop exactly one module to leave a linear output layer.
        self.net = nn.Sequential(*layers[:-1])

    def forward(self, x):
        return self.net(x)

GELU: The Transformer Standard (Hendrycks and Gimpel, 2016)

$$\text{GELU}(x) = x \cdot \Phi(x)$$

Where $\Phi(x)$ is the Gaussian cumulative distribution function: $\Phi(x) = P(X \leq x)$ for $X \sim \mathcal{N}(0, 1)$.

$$\Phi(x) = \frac{1}{2}\left[1 + \text{erf}\left(\frac{x}{\sqrt{2}}\right)\right]$$

Intuition: GELU weights the input $x$ by the probability that a standard Gaussian draw would be less than $x$. For large positive $x$, this probability is near 1 - GELU is approximately $x$ (linear). For large negative $x$, the probability is near 0 - GELU suppresses the input. For values near 0, the non-linear transition creates smooth stochastic-style gating. This is analogous to dropout applied based on input magnitude rather than randomly.
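The stochastic reading can be made concrete: keep $x$ with probability $\Phi(x)$, zero it otherwise, and average over many draws. The sketch below is a Monte Carlo illustration of that reading, not how GELU is computed in practice; the sample count and seed are arbitrary assumptions.

```python
import math
import random


def gelu(x: float) -> float:
    """Deterministic GELU: x * Phi(x)."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def stochastic_gate_mean(x: float, n_samples: int = 100_000, seed: int = 0) -> float:
    """Average of x * 1[Z <= x] over draws Z ~ N(0, 1).

    Each draw either passes x through (when Z <= x) or zeroes it, so the gate
    is open with probability Phi(x), like a dropout mask that depends on x.
    """
    rng = random.Random(seed)
    kept = sum(1 for _ in range(n_samples) if rng.gauss(0.0, 1.0) <= x)
    return x * kept / n_samples


for x in (-1.0, 0.5, 2.0):
    print(f"x={x:+.1f}  gated average={stochastic_gate_mean(x):+.4f}  GELU={gelu(x):+.4f}")
```

The averaged gate matches $x \cdot \Phi(x)$ up to sampling noise, which is the sense in which GELU is the expected value of an input-dependent dropout mask.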

The tanh approximation (widely used, originally because erf was slower to evaluate):

$$\text{GELU}(x) \approx 0.5x \left(1 + \tanh\left(\sqrt{\frac{2}{\pi}}\left(x + 0.044715 x^3\right)\right)\right)$$

The absolute error of this approximation stays below about $10^{-3}$ across the entire real line - small enough that the two forms are interchangeable for training purposes.
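The size of the approximation error is easy to measure directly. This sketch scans a grid with standard-library `math.erf` and `math.tanh` and reports the largest gap and where it occurs; the grid range and spacing are arbitrary choices.

```python
import math


def gelu_exact(x: float) -> float:
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x: float) -> float:
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


# Scan [-6, 6] in steps of 0.001 and keep the largest absolute gap.
grid = [i / 1000.0 for i in range(-6000, 6001)]
max_err, x_at_max = max((abs(gelu_exact(x) - gelu_tanh(x)), x) for x in grid)

print(f"largest |exact - tanh| on the grid: {max_err:.2e} at x = {x_at_max:+.3f}")
```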

Key differences from ReLU:

  • GELU is smooth everywhere (no kink at 0) - smoother loss landscape for gradient descent
  • GELU produces small negative outputs for moderate negative inputs rather than clamping them to zero. For example, $\text{GELU}(-1) \approx -0.159$
  • Stochastic interpretation: similar to dropout that is dependent on input magnitude, giving implicit regularization

Why transformers use GELU: the BERT paper (Devlin et al., 2018) adopted GELU following GPT, without a dedicated ablation against ReLU, so the case for it rests on later empirical comparisons. The theoretical explanation is still being developed. The usual arguments are that GELU's smoothness at 0 improves gradient flow, its non-zero negative outputs reduce information loss, and its stochastic-like properties add implicit regularization. GPT, BERT, ViT, and many later language models use GELU or a close variant (the original T5 used ReLU; T5 v1.1 moved to a GELU-gated FFN).

import torch
import torch.nn.functional as F
import math


def gelu_exact(x: torch.Tensor) -> torch.Tensor:
    """Exact GELU using the error function."""
    return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2)))


def gelu_approx(x: torch.Tensor) -> torch.Tensor:
    """Fast GELU approximation (used in most implementations)."""
    return x * 0.5 * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)))


# PyTorch built-in
# F.gelu(x) - uses exact version
# F.gelu(x, approximate='tanh') - uses tanh approximation
# nn.GELU(approximate='tanh') - for use in Sequential

# Verify closeness
x = torch.linspace(-4, 4, 1000)
max_diff = (gelu_exact(x) - gelu_approx(x)).abs().max().item()
print(f"Max difference exact vs approx GELU: {max_diff:.8f}") # small: below 1e-3

# Key property: non-zero for moderate negative inputs
print(f"GELU(-1.0): {F.gelu(torch.tensor(-1.0)):.4f}") # -0.1587
print(f"GELU(-0.5): {F.gelu(torch.tensor(-0.5)):.4f}") # -0.1543
print(f"GELU(0.0): {F.gelu(torch.tensor(0.0)):.4f}") # 0.0
print(f"ReLU(-1.0): {F.relu(torch.tensor(-1.0)):.4f}") # 0.0000

Swish / SiLU: Self-Gated Activation (Ramachandran et al., 2017)

$$\text{Swish}(x) = x \cdot \sigma(\beta x) = \frac{x}{1 + e^{-\beta x}}$$

With $\beta = 1$, this is called SiLU (Sigmoid Linear Unit), a form that had been proposed before the Swish paper. Swish itself was found by Google Brain in 2017 via an automated search over a large space of candidate activation functions.

Properties:

  • Non-monotonic: dips below zero for negative inputs, reaching its minimum of about $-0.28$ near $x \approx -1.28$. This non-monotonicity is unusual - most activations are monotonic - and appears to help gradient flow by maintaining non-zero gradients in the slightly-negative region
  • Self-gating: the input gates itself via the sigmoid - larger inputs get passed through more fully; smaller inputs are suppressed
  • Smooth everywhere: continuous gradient, no kink at 0
  • Gradient (for $\beta = 1$): $\text{Swish}'(x) = \text{Swish}(x) + \sigma(x)(1 - \text{Swish}(x))$, which slightly exceeds 1 for moderately positive $x$ (peaking at about 1.1 near $x \approx 2.4$) and tends to 1 as $x \to \infty$
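The closed-form gradient can be checked against a finite-difference estimate, and a grid scan locates both the peak of the gradient and the bottom of the dip. This is a standard-library sketch for $\beta = 1$; the grid, the step size `eps`, and the helper names are illustrative assumptions.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def silu(x: float) -> float:
    return x * sigmoid(x)


def silu_grad(x: float) -> float:
    """Closed form from the text: f(x) + sigmoid(x) * (1 - f(x))."""
    f = silu(x)
    return f + sigmoid(x) * (1.0 - f)


def numeric_grad(x: float, eps: float = 1e-6) -> float:
    """Central finite-difference estimate of the derivative."""
    return (silu(x + eps) - silu(x - eps)) / (2.0 * eps)


grid = [i / 100.0 for i in range(-800, 801)]  # [-8, 8] in steps of 0.01
max_grad, x_at_max_grad = max((silu_grad(x), x) for x in grid)
min_value, x_at_min_value = min((silu(x), x) for x in grid)

for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"x={x:+.1f}  closed form={silu_grad(x):+.6f}  numeric={numeric_grad(x):+.6f}")
print(f"largest gradient {max_grad:.4f} at x={x_at_max_grad:+.2f}")
print(f"lowest output {min_value:.4f} at x={x_at_min_value:+.2f}")
```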

Relationship to GELU: both are smooth, non-monotonic, approximately linear for large positive inputs, and produce small negative outputs for moderate negative inputs. Their gradients are nearly identical in the region that matters for training. The main practical difference is computation: SiLU uses a sigmoid (cheap), GELU uses erf (more expensive, though the tanh approximation reduces this gap).

Where used: LLaMA, LLaMA-2, PaLM, EfficientNet (MobileNetV3 uses the cheaper hard-swish approximation). In the LLaMA family, SiLU is used in the "SwiGLU" FFN variant: $\text{SwiGLU}(x, W, V) = \text{SiLU}(xW) \odot (xV)$.

import torch
import torch.nn as nn

silu = nn.SiLU() # PyTorch's SiLU = Swish with beta=1
gelu = nn.GELU()

x = torch.linspace(-5, 5, 1000)
silu_out = silu(x)
gelu_out = gelu(x)

max_diff = (silu_out - gelu_out).abs().max().item()
print(f"Max difference SiLU vs GELU: {max_diff:.4f}") # roughly 0.19, near |x| = 2

# Compute SiLU gradient
x_req = x.clone().requires_grad_(True)
silu(x_req).sum().backward()
print(f"SiLU gradient range: [{x_req.grad.min():.3f}, {x_req.grad.max():.3f}]")
# Gradient peaks at about 1.1 for moderately positive x and tends to 1 for large x

Mish: Smooth Self-Regularized Activation

$$\text{Mish}(x) = x \cdot \tanh(\text{softplus}(x)) = x \cdot \tanh(\ln(1 + e^x))$$

Introduced by Misra (2019). Properties:

  • Smooth and non-monotonic (like Swish)
  • Bounded below, like Swish/SiLU: both approach 0 as $x \to -\infty$, and Mish's minimum is about $-0.31$
  • Preserves small negative values
  • More computationally expensive than SiLU due to two transcendental functions

Mish achieved state-of-the-art results on several image classification benchmarks in 2019 and is used in YOLOv4. In practice, the performance difference between Mish, SiLU, and GELU is small - architecture and training recipe matter more.

Full NumPy Implementation: All Activations

import numpy as np
from scipy.special import erf


def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1 / (1 + np.exp(-np.clip(x, -500, 500)))

def sigmoid_grad(x: np.ndarray) -> np.ndarray:
    s = sigmoid(x)
    return s * (1 - s)

def tanh_fn(x: np.ndarray) -> np.ndarray:
    return np.tanh(x)

def tanh_grad(x: np.ndarray) -> np.ndarray:
    return 1 - np.tanh(x)**2

def relu(x: np.ndarray) -> np.ndarray:
    return np.maximum(0, x)

def relu_grad(x: np.ndarray) -> np.ndarray:
    return (x > 0).astype(float)

def leaky_relu(x: np.ndarray, alpha: float = 0.01) -> np.ndarray:
    return np.where(x > 0, x, alpha * x)

def leaky_relu_grad(x: np.ndarray, alpha: float = 0.01) -> np.ndarray:
    return np.where(x > 0, 1.0, alpha)

def elu(x: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    return np.where(x > 0, x, alpha * (np.exp(np.clip(x, -500, 0)) - 1))

def elu_grad(x: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    return np.where(x > 0, 1.0, alpha * np.exp(np.clip(x, -500, 0)))

def gelu(x: np.ndarray) -> np.ndarray:
    """Exact GELU using error function."""
    return x * 0.5 * (1.0 + erf(x / np.sqrt(2)))

def gelu_grad(x: np.ndarray) -> np.ndarray:
    cdf = 0.5 * (1.0 + erf(x / np.sqrt(2)))
    pdf = np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)
    return cdf + x * pdf

def silu(x: np.ndarray) -> np.ndarray:
    return x * sigmoid(x)

def silu_grad(x: np.ndarray) -> np.ndarray:
    s = sigmoid(x)
    return s + x * s * (1 - s)


# Activation function properties summary
def print_activation_table():
    x_test = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
    activations = {
        'Sigmoid': (sigmoid, sigmoid_grad),
        'Tanh': (tanh_fn, tanh_grad),
        'ReLU': (relu, relu_grad),
        'LeakyReLU(0.01)': (leaky_relu, leaky_relu_grad),
        'ELU': (elu, elu_grad),
        'GELU': (gelu, gelu_grad),
        'SiLU': (silu, silu_grad),
    }

    print("Activation values at key x:")
    print(f"{'Activation':<20} | {'x=-3':>6} | {'x=-1':>6} | {'x=0':>6} | {'x=1':>6} | {'x=3':>6}")
    print("-" * 65)
    for name, (fn, _) in activations.items():
        vals = fn(x_test)
        print(f"{name:<20} | {vals[0]:>6.3f} | {vals[1]:>6.3f} | {vals[2]:>6.3f} | {vals[3]:>6.3f} | {vals[4]:>6.3f}")

    print("\nGradient values at key x:")
    print(f"{'Activation':<20} | {'x=-3':>6} | {'x=-1':>6} | {'x=0':>6} | {'x=1':>6} | {'x=3':>6}")
    print("-" * 65)
    for name, (_, grad_fn) in activations.items():
        grads = grad_fn(x_test)
        print(f"{name:<20} | {grads[0]:>6.3f} | {grads[1]:>6.3f} | {grads[2]:>6.3f} | {grads[3]:>6.3f} | {grads[4]:>6.3f}")


print_activation_table()

Activation Function Decision Flowchart

Comprehensive Comparison Table

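| Activation | Range | Zero-centered | Saturating | Max gradient | Dead neurons | Compute |
| --- | --- | --- | --- | --- | --- | --- |
| Sigmoid | (0, 1) | No | Yes (both) | 0.25 | No | Medium |
| Tanh | (-1, 1) | Yes | Yes (both) | 1.0 | No | Medium |
| ReLU | [0, ∞) | No | Yes (neg) | 1.0 | Yes | Fastest |
| Leaky ReLU | (-∞, ∞) | No | No | 1.0 | No | Fast |
| PReLU | (-∞, ∞) | No | No | 1.0 | No | Fast + param |
| ELU | (-α, ∞) | ~Yes | Soft (neg) | 1.0 (for α = 1) | No | Slow (exp) |
| SELU | (−λα, ∞) | Self-norm | Soft (neg) | λα ≈ 1.76 just below 0 (λ ≈ 1.05 for x > 0) | No | Slow (exp) |
| GELU | ≈ [-0.17, ∞) | No | No | >1 possible | No | Medium (erf) |
| SiLU | ≈ [-0.28, ∞) | No | No | >1 possible | No | Medium (sigmoid) |
| Mish | ≈ [-0.31, ∞) | No | No | >1 possible | No | Slow (two ops) |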

PyTorch Benchmark: Speed Comparison

import torch
import torch.nn as nn
import time


def benchmark_activation(fn, x: torch.Tensor, n_runs: int = 1000) -> float:
    """Benchmark activation function forward pass time in microseconds."""
    for _ in range(50):  # warmup
        fn(x)
    if x.is_cuda:
        torch.cuda.synchronize()  # make sure warmup kernels have finished
    start = time.perf_counter()
    for _ in range(n_runs):
        fn(x)
    if x.is_cuda:
        torch.cuda.synchronize()  # wait for all queued kernels before stopping the clock
    return (time.perf_counter() - start) / n_runs * 1e6


def run_benchmarks():
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    x = torch.randn(512, 2048, device=device)

    activations = {
        "ReLU": nn.ReLU(),
        "LeakyReLU(0.01)": nn.LeakyReLU(0.01),
        "PReLU": nn.PReLU(),
        "ELU": nn.ELU(),
        "SELU": nn.SELU(),
        "GELU (exact)": nn.GELU(approximate="none"),
        "GELU (tanh)": nn.GELU(approximate="tanh"),
        "SiLU": nn.SiLU(),
        "Tanh": nn.Tanh(),
        "Sigmoid": nn.Sigmoid(),
    }

    relu_time = benchmark_activation(nn.ReLU().to(device), x)
    print(f"{'Activation':<22} | {'Time (μs)':<12} | Relative to ReLU")
    print("-" * 55)
    for name, fn in activations.items():
        fn = fn.to(device)
        t = benchmark_activation(fn, x)
        print(f"{name:<22} | {t:<12.1f} | {t/relu_time:.2f}x")


run_benchmarks()
# Approximate GPU results (hardware- and version-dependent; treat as indicative only):
# ReLU | 8.2 | 1.00x
# LeakyReLU(0.01) | 9.1 | 1.11x
# PReLU | 10.4 | 1.27x
# ELU | 15.3 | 1.87x
# SELU | 15.8 | 1.93x
# GELU (exact) | 28.7 | 3.50x
# GELU (tanh) | 19.4 | 2.37x
# SiLU | 17.1 | 2.09x
# Tanh | 19.9 | 2.43x
# Sigmoid | 13.8 | 1.68x

Activation Choice Guide by Task

| Task / Architecture | Recommended | Avoid | Notes |
| --- | --- | --- | --- |
| MLP hidden layers (general) | ReLU | Sigmoid | Fast, stable, well-understood |
| Transformer FFN layers | GELU or SiLU | ReLU | Empirically better for language |
| CNNs (ResNet-style) | ReLU | Sigmoid | Speed critical; dying ReLU manageable |
| Object detection (YOLO v4+) | Mish or SiLU | - | Smooth activations help small object features |
| RNN / LSTM hidden state | Tanh | ReLU | Architecture requirement |
| LSTM gates | Sigmoid | - | Gate range must be (0,1) |
| Binary classification output | Sigmoid | - | Probability output |
| Multi-class output | None (logits) | Manual softmax | CrossEntropyLoss handles it |
| Regression output | None | - | Unbounded |
| Self-normalizing MLP | SELU | ReLU | Requires LeCun init + AlphaDropout |

:::danger Sigmoid in Hidden Layers

Never use sigmoid activation in hidden layers of modern deep networks. The maximum gradient of 0.25 means gradients are reduced by at least 4x per layer. In a 10-layer network, gradients reaching the first layer are scaled by at most $0.25^{10} \approx 10^{-6}$ relative to the output - effectively zero. Training will stall or diverge. Use ReLU, GELU, or SiLU for hidden layers.

:::

:::warning Double Softmax in CrossEntropyLoss

nn.CrossEntropyLoss applies log_softmax internally before computing the NLL loss. If you apply softmax to your logits before passing them to CrossEntropyLoss, softmax is applied twice. The resulting distribution is too flat, not too sharp: the first softmax squeezes every value into (0, 1), so the second softmax sees inputs that differ by at most 1 and returns something close to uniform. The gradient is attenuated and no longer matches the intended loss. Loss will still decrease - just more slowly than it should - making this a silent, hard-to-detect bug. Always pass raw logits to nn.CrossEntropyLoss.

:::
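To see what a second softmax actually does to a distribution, the sketch below compares the output after one and two passes using a plain-Python softmax. The logits are arbitrary illustrative values, and entropy is used here only as a convenient measure of how peaked each distribution is.

```python
import math


def softmax(values: list) -> list:
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(p: list) -> float:
    """Shannon entropy in nats; higher means flatter."""
    return -sum(p_i * math.log(p_i) for p_i in p if p_i > 0)


logits = [4.0, 1.0, 0.0]   # illustrative raw model outputs
once = softmax(logits)     # what the loss should effectively see
twice = softmax(once)      # what it sees if softmax was already applied

print("one softmax :", [f"{p:.3f}" for p in once], f"entropy={entropy(once):.3f}")
print("two softmax :", [f"{p:.3f}" for p in twice], f"entropy={entropy(twice):.3f}")
```

The second pass pulls the probabilities toward uniform, so the model has to produce far more extreme logits to express the same confidence.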

Interview Q&A

Q1: Explain the vanishing gradient problem and prove that sigmoid is particularly bad for it.

The vanishing gradient problem occurs when gradients become exponentially small as they propagate backward through many layers, preventing early layers from receiving useful learning signal. For sigmoid, the derivative is $\sigma'(x) = \sigma(x)(1 - \sigma(x))$, which achieves its maximum of 0.25 at $x = 0$ and approaches 0 for large $|x|$. By the AM-GM inequality, $\sigma(x)(1-\sigma(x)) \leq \left(\frac{\sigma(x)+(1-\sigma(x))}{2}\right)^2 = 0.25$, so the gradient is bounded above by 0.25. In a network with $L$ sigmoid hidden layers, the gradient reaching layer 1 from layer $L$ is multiplied by each layer's local gradient, so the activation derivatives alone contribute a factor of at most $(0.25)^{L-1}$ - for $L=20$ this is $(0.25)^{19} \approx 3.6 \times 10^{-12}$. Early layers receive virtually zero gradient and cannot learn. ReLU addresses this by having gradient exactly 1 for positive inputs - no shrinkage from the activation during backpropagation.

Q2: What is the dying ReLU problem? How do you detect it, and what are three solutions?

A dying ReLU neuron is permanently inactive: its pre-activation $z = \mathbf{w}^T\mathbf{x} + b$ is negative for every training input. Since $\text{ReLU}'(z) = 0$ for $z \leq 0$, the gradient flowing back through this neuron is zero, meaning the weights receive zero update and can never recover. With an aggressive learning rate a substantial fraction of neurons can die during training; figures as high as 40% are often quoted, though the actual number depends heavily on the setup. Detection: compute the fraction of neurons with zero activation across a large validation set - dead neurons are zero for every input, unlike normally inactive neurons, which are zero for some inputs and active for others. Solutions: (1) Leaky ReLU - use slope $\alpha = 0.01$ for negative inputs, so the gradient is never zero; (2) Careful initialization with the Kaiming He scheme and a small positive bias, keeping initial pre-activations from starting out negative; (3) Reduce the learning rate or add gradient clipping to prevent the large parameter updates that push neurons into permanently negative territory.

Q3: Why do transformers use GELU instead of ReLU? What properties of GELU are theoretically preferable?

Three properties distinguish GELU from ReLU in the transformer context: (1) Smoothness - GELU is differentiable everywhere with a smooth gradient at zero, while ReLU has a kink at 0. In the feed-forward sublayer, where pre-activations can cluster near zero, GELU's smooth gradient enables more stable optimization. (2) Non-zero negative outputs - GELU produces small negative values for moderate negative inputs ($\text{GELU}(-1) \approx -0.16$), preserving gradient signal in the negative region and reducing information loss. (3) Stochastic interpretation - GELU can be interpreted as the expected value of a dropout-like gate whose keep probability grows with the input, adding implicit regularization that benefits large models trained on finite data. Empirically, BERT, GPT-2, and ViT all use GELU, and reported comparisons against ReLU variants tend to favor it. The gap is usually modest (on the order of a fraction of a percentage point on language benchmarks) but fairly consistent.

Q4: Compare SiLU (Swish) and GELU. When would you choose each?

SiLU and GELU are extremely similar - both are smooth, non-monotonic, produce small negative outputs for moderate negative inputs, and have gradients that can exceed 1. The practical differences: (1) Compute: SiLU uses sigmoid (cheap), GELU uses erf (more expensive). With the tanh approximation, GELU is about 1.2x slower than SiLU. (2) Origin: GELU comes from the language modeling literature (BERT, GPT). SiLU was popularized under the name Swish by an automated activation-function search, spread through vision models (EfficientNet; MobileNetV3 uses a hard-swish approximation), and is the gate in LLaMA's SwiGLU FFN. (3) Performance: multiple empirical comparisons show them essentially equivalent within noise. If building on a codebase that uses GELU (most transformer implementations), use GELU. If building a new architecture or using LLaMA-style code, use SiLU. The choice is rarely a deciding factor in model quality.
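To make "extremely similar" concrete, the following standard-library sketch evaluates both functions on a grid over $[-6, 6]$ (an arbitrary range chosen for illustration) and reports how far apart they get, where each one bottoms out, and how close the tanh approximation is to exact GELU.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def silu(x: float) -> float:
    """SiLU / Swish: x * sigmoid(x)."""
    return x * sigmoid(x)


def gelu(x: float) -> float:
    """Exact GELU via erf."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x: float) -> float:
    """Widely used tanh approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


grid = [i / 100.0 for i in range(-600, 601)]

max_gap = max(abs(silu(x) - gelu(x)) for x in grid)
approx_err = max(abs(gelu_tanh(x) - gelu(x)) for x in grid)
min_silu = min(silu(x) for x in grid)
min_gelu = min(gelu(x) for x in grid)

print(f"max |SiLU - GELU| on grid:      {max_gap:.4f}")
print(f"max |GELU_tanh - GELU| on grid: {approx_err:.5f}")
print(f"most negative SiLU output:      {min_silu:.4f}")
print(f"most negative GELU output:      {min_gelu:.4f}")
```

Both functions dip slightly below zero and then track the identity for large inputs. SiLU dips a little deeper and approaches the identity more slowly, which is the main visible difference between the two curves.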

Q5: What does SELU's self-normalizing property mean, and what are its requirements?

SELU (Scaled ELU) uses specific constants $\alpha \approx 1.6733$ and $\lambda \approx 1.0507$ chosen such that for approximately standard normal inputs, SELU outputs also have approximately zero mean and unit variance. This property maintains stable activation statistics across layers without batch normalization - the normalization is built into the activation function itself. The proof derives the constants as the fixed point of the mapping from input to output statistics. SELU requires three conditions: (1) Weights must be initialized with LeCun normal $\mathcal{N}(0, 1/n_{\text{in}})$ - other initializations break the self-normalizing property; (2) Dropout must be nn.AlphaDropout, which is designed to preserve the mean and variance through the dropout masking, rather than standard dropout; (3) Input features must be approximately standardized. In practice, SELU is useful for deep tabular MLPs where batch normalization is impractical (small batch sizes), but it is less flexible and requires more care than the ReLU + BatchNorm combination.
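The fixed-point claim is easy to check empirically. The sketch below draws standard normal samples with the standard library, applies SELU, and measures the output mean and variance. The sample size and seed are arbitrary choices for illustration.

```python
import math
import random

# SELU constants (the values quoted above, at higher precision)
ALPHA = 1.6732632423543772
LAMBDA = 1.0507009873554805


def selu(x: float) -> float:
    """SELU(x) = lambda * x for x > 0, lambda * alpha * (exp(x) - 1) otherwise."""
    return LAMBDA * (x if x > 0 else ALPHA * (math.exp(x) - 1.0))


rng = random.Random(0)
n = 200_000
outputs = [selu(rng.gauss(0.0, 1.0)) for _ in range(n)]

mean = sum(outputs) / n
var = sum((o - mean) ** 2 for o in outputs) / n

print(f"output mean:     {mean:+.4f}  (target 0)")
print(f"output variance: {var:.4f}  (target 1)")
```

With standard normal inputs the output statistics land close to zero mean and unit variance. Feeding in inputs with a different scale (for example `rng.gauss(0.0, 3.0)`) shows why the standardized-input and LeCun-initialization requirements matter.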

Q6: Explain temperature scaling in softmax and give three practical use cases.

Temperature scaling divides logits by $T$ before softmax: $\text{softmax}(\mathbf{z}/T)_i = e^{z_i/T} / \sum_j e^{z_j/T}$. For $T \to 0$, the distribution becomes one-hot (all probability mass on the argmax). For $T \to \infty$, the distribution becomes uniform. Use case 1 - language model sampling: $T = 0.7$–$0.9$ sharpens predictions for more coherent text; $T = 1.2$–$1.5$ increases diversity for creative generation. Use case 2 - knowledge distillation: the teacher model uses $T > 1$ (typically 3–5) to produce soft targets that reveal inter-class similarity information - class "cat" is more similar to "dog" than to "airplane" in the teacher's probability estimates, and this information trains the student better than hard labels. Use case 3 - model calibration: a single temperature scalar is optimized on a held-out set to minimize calibration error (expected calibration error, ECE) - overconfident models use $T > 1$ to flatten their predictions.
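A minimal sketch of the formula, using only the standard library. The function name and the example logits are illustrative; subtracting the maximum before exponentiating is a standard numerical-stability step and does not change the result.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], T: float = 1.0) -> List[float]:
    """softmax(z / T), computed stably by subtracting the max scaled logit."""
    if T <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # hypothetical logits for three classes
for T in (0.1, 0.7, 1.0, 5.0, 100.0):
    probs = softmax_with_temperature(logits, T)
    print(f"T={T:>5}: " + ", ".join(f"{p:.3f}" for p in probs))
```

Low temperatures concentrate almost all mass on the largest logit, and high temperatures push the distribution toward uniform, matching the two limits described above.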

SwiGLU: The LLaMA FFN Variant

The LLaMA and PaLM model families use a gated FFN variant called SwiGLU (Noam Shazeer, 2020):

$$\text{SwiGLU}(x, W, V, W_2) = (\text{SiLU}(xW) \odot xV)\, W_2$$

The key idea: the FFN has two parallel linear projections ($xW$ and $xV$). The first projection, passed through SiLU, gates the second element-wise, acting as a learned filter. This replaces the standard ungated two-layer FFN (which uses GELU in BERT and GPT-2) and consistently improves downstream task quality.

Why gating helps: the element-wise multiply creates a soft attention-like mechanism within the FFN. The gate learns to suppress dimensions that are irrelevant for the current token, while amplifying those that are relevant. This is especially powerful in large models where the FFN width (typically 4x the model dimension) contains many dimensions that are useful for different types of inputs.

Dimension adjustment: SwiGLU uses two input weight matrices of size $d_{\text{model}} \times d_{\text{ff}}$ instead of one, so to keep the parameter count equal to the standard FFN, the hidden dimension is multiplied by $2/3$ (from $4\,d_{\text{model}}$ to $\tfrac{8}{3}\,d_{\text{model}}$):

import torch
import torch.nn as nn


class SwiGLUFFN(nn.Module):
    """
    SwiGLU Feed-Forward Network used in LLaMA, PaLM, and related models.
    Replaces the standard ReLU/GELU FFN with a gated variant.
    """

    def __init__(self, d_model: int, expansion_factor: float = 8/3):
        super().__init__()
        # Use 8/3 * d_model hidden dim so total params ≈ standard 4x expansion
        # Standard FFN: d_model * 4 * d_model * 2 = 8 * d_model^2 parameters
        # SwiGLU: d_model * (8/3 * d_model) * 3 = 8 * d_model^2 parameters
        d_ff = int(d_model * expansion_factor)
        self.w1 = nn.Linear(d_model, d_ff, bias=False)  # gate projection
        self.w2 = nn.Linear(d_ff, d_model, bias=False)  # output projection
        self.w3 = nn.Linear(d_model, d_ff, bias=False)  # value projection
        self.silu = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Element-wise gate: SiLU(xW1) * xW3
        gate = self.silu(self.w1(x))
        value = self.w3(x)
        # Gated combination then output projection
        return self.w2(gate * value)


# Verify parameter count equivalence
d_model = 512
swiglu = SwiGLUFFN(d_model)
standard_ffn = nn.Sequential(
    nn.Linear(d_model, 4 * d_model, bias=False),
    nn.GELU(),
    nn.Linear(4 * d_model, d_model, bias=False),
)

swiglu_params = sum(p.numel() for p in swiglu.parameters())
standard_params = sum(p.numel() for p in standard_ffn.parameters())
print(f"SwiGLU params: {swiglu_params:,}")
print(f"Standard params: {standard_params:,}")
print(f"Ratio: {swiglu_params / standard_params:.2f}x") # Should be ≈ 1.0

GLU Variants: A Unifying Perspective

Gated Linear Units (Dauphin et al., 2017) and their variants form a family:

$$\text{GLU}(x, W, V, b, c) = \sigma(xW + b) \odot (xV + c)$$

Here $\sigma$ is the sigmoid in the original GLU; swapping in a different gate function gives the other members of the family:

| Variant | Gate function | Used in |
| --- | --- | --- |
| GLU | Sigmoid | Original GLU paper |
| ReGLU | ReLU | Some vision transformers |
| GEGLU | GELU | T5 v1.1 |
| SwiGLU | SiLU | PaLM, LLaMA, LLaMA-2, LLaMA-3, Mistral |
| Bilinear | Identity | Some research models |

The consistent finding across architectures: gated FFN variants outperform standard ReLU/GELU-FFN by 0.5–2% on downstream tasks for the same parameter budget.
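The whole family differs only in the gate function, which a small NumPy sketch makes explicit. This is an illustration under stated assumptions: the `GATES` dictionary and `glu_variant` helper are hypothetical names, the bias terms $b$ and $c$ are omitted for brevity, and the shapes are arbitrary.

```python
import math
import numpy as np


def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))


_erf = np.vectorize(math.erf)  # element-wise erf without extra dependencies

# One gate function per row of the table above
GATES = {
    "GLU": sigmoid,
    "ReGLU": lambda z: np.maximum(z, 0.0),
    "GEGLU": lambda z: 0.5 * z * (1.0 + _erf(z / np.sqrt(2.0))),
    "SwiGLU": lambda z: z * sigmoid(z),
    "Bilinear": lambda z: z,
}


def glu_variant(x: np.ndarray, W: np.ndarray, V: np.ndarray,
                gate: str = "GLU") -> np.ndarray:
    """gate(xW) * (xV), element-wise; biases omitted in this sketch."""
    return GATES[gate](x @ W) * (x @ V)


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))    # 4 tokens, d_model = 8 (hypothetical sizes)
W = rng.normal(size=(8, 16))   # gate projection
V = rng.normal(size=(8, 16))   # value projection

for name in GATES:
    out = glu_variant(x, W, V, gate=name)
    print(f"{name:<9} output shape {out.shape}, mean |out| = {np.abs(out).mean():.3f}")
```

A full FFN would follow this with an output projection back to `d_model`, as in the SwiGLU module above.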

Practical Activation Debugging Guide

When a model trains poorly, activation function issues are often to blame. Use this diagnostic checklist:

import torch
import torch.nn as nn
from typing import Dict


def activation_diagnostic(model: nn.Module, x: torch.Tensor,
                          verbose: bool = True) -> Dict:
    """
    Run a forward pass and collect activation statistics at every layer.
    Diagnoses: vanishing activations, dead neurons, saturation.
    """
    stats = {}
    hooks = []

    def make_hook(name: str):
        def hook(module, inputs, output):
            a = output.detach().float()
            # Saturation is a property of the pre-activation. Sigmoid/tanh
            # outputs never exceed 1 in magnitude, so measure the module input
            # for activation layers and the output for Linear layers.
            z = a if isinstance(module, nn.Linear) else inputs[0].detach().float()
            stats[name] = {
                "mean": a.mean().item(),
                "std": a.std().item(),
                "abs_max": a.abs().max().item(),
                "frac_zero": (a == 0).float().mean().item(),
                "frac_saturated": (z.abs() > 5).float().mean().item(),  # |pre-activation| > 5
            }
        return hook

    # Attach a hook to every activation module and every Linear layer
    for name, module in model.named_modules():
        if isinstance(module, (nn.ReLU, nn.GELU, nn.SiLU, nn.Sigmoid,
                               nn.Tanh, nn.ELU, nn.SELU, nn.Linear)):
            hooks.append(module.register_forward_hook(make_hook(name)))

    with torch.no_grad():
        model(x)

    # Always remove hooks so later forward passes are unaffected
    for h in hooks:
        h.remove()

    if verbose:
        print(f"\n{'Layer':<35} | {'Mean':>7} | {'Std':>7} | {'Max':>8} | {'%Zero':>6} | Issue?")
        print("-" * 82)
        for name, s in stats.items():
            issues = []
            if s["std"] < 0.01:
                issues.append("COLLAPSED")
            if s["std"] > 10:
                issues.append("EXPLODED")
            if s["frac_zero"] > 0.90:
                issues.append("DEAD(>90%)")
            if s["frac_saturated"] > 0.50:
                issues.append("SATURATED")
            flag = ", ".join(issues) if issues else "OK"
            print(f"{name[:33]:<35} | {s['mean']:>7.3f} | {s['std']:>7.3f} | "
                  f"{s['abs_max']:>8.3f} | {s['frac_zero']:>6.2%} | {flag}")

    return stats

How to interpret the output:

  • COLLAPSED (std < 0.01): the layer's activations have very low variance - signal is not propagating. Usually an initialization problem or wrong activation-init pairing.
  • EXPLODED (std > 10): the layer amplifies variance excessively - gradient explosion likely in backpropagation.
  • DEAD (>90%): more than 90% of ReLU neurons are zero for this batch - either dead neurons or a batch that uniformly produces negative pre-activations.
  • SATURATED: more than 50% of the measured values have magnitude > 5 - for sigmoid/tanh inputs of that size, the gradient is essentially zero in most neurons ($\sigma'(5) \approx 0.0066$).

Run this diagnostic before training, after 1 epoch, and whenever validation loss stops improving. It finds bugs in 5 minutes that would otherwise take 5 hours of training to diagnose.

Activation Function Gradient Visualization

The script below summarizes how gradients flow through different activation functions under a realistic distribution of pre-activations:

import numpy as np

def summarize_gradient_flow(activation_name: str, pre_activations: np.ndarray) -> dict:
    """
    Given a realistic distribution of pre-activation values,
    report statistics about the gradient magnitude.
    """
    if activation_name == "sigmoid":
        sig = 1 / (1 + np.exp(-np.clip(pre_activations, -500, 500)))
        grads = sig * (1 - sig)
    elif activation_name == "tanh":
        t = np.tanh(pre_activations)
        grads = 1 - t ** 2
    elif activation_name == "relu":
        grads = (pre_activations > 0).astype(float)
    elif activation_name == "leaky_relu":
        grads = np.where(pre_activations > 0, 1.0, 0.01)
    elif activation_name == "elu":
        # ELU with alpha = 1: derivative is exp(z) for z <= 0
        grads = np.where(pre_activations > 0, 1.0,
                         np.exp(np.clip(pre_activations, -500, 0)))
    elif activation_name == "gelu":
        from scipy.special import erf
        cdf = 0.5 * (1 + erf(pre_activations / np.sqrt(2)))
        pdf = np.exp(-0.5 * pre_activations**2) / np.sqrt(2 * np.pi)
        grads = cdf + pre_activations * pdf
    elif activation_name == "silu":
        sig = 1 / (1 + np.exp(-np.clip(pre_activations, -500, 500)))
        grads = sig + pre_activations * sig * (1 - sig)
    else:
        raise ValueError(f"Unknown: {activation_name}")

    return {
        "mean_grad": float(grads.mean()),
        "std_grad": float(grads.std()),
        "frac_zero": float((grads == 0).mean()),
        "frac_large": float((grads > 1).mean()),  # super-linear gradient
    }


# Simulate realistic pre-activations: approximately N(0, 1) after BN
rng = np.random.default_rng(42)
pre_acts = rng.normal(0, 1, 100000)

activations = ["sigmoid", "tanh", "relu", "leaky_relu", "elu", "gelu", "silu"]
print(f"\n{'Activation':<14} | {'Mean grad':>10} | {'Std grad':>9} | "
      f"{'% zero':>7} | {'% >1':>6}")
print("-" * 56)
for name in activations:
    s = summarize_gradient_flow(name, pre_acts)
    print(f"{name:<14} | {s['mean_grad']:>10.4f} | {s['std_grad']:>9.4f} | "
          f"{s['frac_zero']:>7.2%} | {s['frac_large']:>6.2%}")

# Key observations from this output (approximate, for N(0, 1) pre-activations):
# sigmoid: mean grad ~0.21 - always below the 0.25 cap; 0% zero but always small
# tanh: mean grad ~0.61 - larger than sigmoid, but never above 1
# relu: mean grad ~0.50 - about 50% are exactly 0 (negative half), 50% are exactly 1
# gelu: mean grad ~0.50 - same mean as ReLU (the x * pdf term averages to zero), no exact zeros, roughly 23% above 1
# silu: mean grad ~0.50 - comparable to GELU, no exact zeros, roughly 10% above 1

This empirical gradient analysis confirms the theoretical properties: sigmoid produces the smallest average gradient (most prone to vanishing), and ReLU produces a bimodal distribution (0 or 1 with no in-between). GELU and SiLU have the same mean gradient as ReLU on zero-mean inputs, but spread it smoothly: no exactly-zero gradients, small non-zero values in the negative region, and a super-linear tail above 1. Their advantage in deep transformer training comes from this smoothness, not from a larger average gradient.


© 2026 EngineersOfAI. All rights reserved.