Activation Functions

Activation functions introduce non-linearity into neural networks. Without them, any multi-layer network mathematically collapses into a single linear transformation, making depth meaningless.

To see why, consider what happens when you stack linear layers with no activation between them:

$$ y_1 = W_1 x + b_1 $$
$$ y_2 = W_2 y_1 + b_2 = W_2(W_1 x + b_1) + b_2 = (W_2 W_1)x + (W_2 b_1 + b_2) $$
$$ y_3 = W_3 y_2 + b_3 = (W_3 W_2 W_1)x + (W_3 W_2 b_1 + W_3 b_2 + b_3) $$

No matter how many layers you add, you can always multiply out all the weight matrices into one equivalent matrix W_final = W₃W₂W₁... and combine all the biases into one equivalent bias b_final.

The entire deep network becomes mathematically identical to: y = W_final × x + b_final

So a 100-layer network without activation functions is just a single linear transformation in disguise: the extra depth adds no expressive power. Activation functions are essential because they break this collapse and let each layer contribute genuine non-linear complexity.
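The collapse can be verified numerically. A minimal NumPy sketch (shapes and values are arbitrary illustrative choices) showing that two stacked linear layers equal one equivalent layer:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4,))                        # input vector
W1, b1 = rng.normal(size=(5, 4)), rng.normal(size=(5,))
W2, b2 = rng.normal(size=(3, 5)), rng.normal(size=(3,))

# Two linear layers applied in sequence (no activation between them)
y_stacked = W2 @ (W1 @ x + b1) + b2

# One equivalent layer: W_final = W2 W1, b_final = W2 b1 + b2
W_final = W2 @ W1
b_final = W2 @ b1 + b2
y_single = W_final @ x + b_final

assert np.allclose(y_stacked, y_single)          # identical up to float rounding
```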

Sigmoid

$$ h(x)=\sigma(x)=\frac{1}{1+e^{-x}} $$
$$ \frac{d}{dx} \sigma(x) = \sigma(x)(1 - \sigma(x)) $$
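The derivative identity above can be checked numerically. A minimal NumPy sketch (function names are my own) comparing the closed form $\sigma(x)(1-\sigma(x))$ against a central finite difference:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # Closed-form derivative: sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.linspace(-5, 5, 101)
eps = 1e-6
# Central finite difference approximation of the derivative
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
assert np.allclose(sigmoid_grad(x), numeric, atol=1e-8)
```

Note that the gradient peaks at 0.25 (at $x = 0$) and shrinks toward zero for large $|x|$, which is why deep sigmoid networks suffer from vanishing gradients.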

Hyperbolic Tangent (tanh)

$$ h(x)=\tanh (x)=\frac{e^{x}-e^{-x}}{e^{x}+e^{-x}} $$
$$ \frac{d}{dx} \tanh(x) = 1 - \tanh^2(x) $$
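The same finite-difference check works for tanh (a sketch; `tanh_grad` is my own name):

```python
import numpy as np

def tanh_grad(x):
    # Closed-form derivative: 1 - tanh(x)^2
    return 1.0 - np.tanh(x) ** 2

x = np.linspace(-5, 5, 101)
eps = 1e-6
# Central finite difference approximation of the derivative
numeric = (np.tanh(x + eps) - np.tanh(x - eps)) / (2 * eps)
assert np.allclose(tanh_grad(x), numeric, atol=1e-7)
```

Unlike sigmoid, tanh is zero-centered and its gradient peaks at 1 (at $x = 0$) rather than 0.25, though it still saturates for large $|x|$.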

ReLU and variants
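The standard ReLU is $h(x) = \max(0, x)$; a common variant, Leaky ReLU, replaces the flat negative region with a small slope $\alpha$ to avoid "dead" units. A minimal NumPy sketch (the $\alpha = 0.01$ default is a common convention, not taken from these notes):

```python
import numpy as np

def relu(x):
    # ReLU: max(0, x) elementwise
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # Leaky ReLU: pass x through unchanged when x >= 0, scale by alpha otherwise
    return np.where(x >= 0, x, alpha * x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
assert np.allclose(relu(x), [0.0, 0.0, 0.0, 1.5])
assert np.allclose(leaky_relu(x), [-0.02, -0.005, 0.0, 1.5])
```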

Softmax
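Softmax maps a vector of logits to a probability distribution: $\mathrm{softmax}(z)_i = e^{z_i} / \sum_j e^{z_j}$. A minimal NumPy sketch using the standard max-subtraction trick: softmax is shift-invariant, so subtracting $\max(z)$ leaves the output unchanged while preventing overflow in `exp`:

```python
import numpy as np

def softmax(z):
    # Shift by the max for numerical stability (does not change the result)
    z = z - np.max(z)
    e = np.exp(z)
    return e / np.sum(e)

p = softmax(np.array([1.0, 2.0, 3.0]))
assert np.isclose(np.sum(p), 1.0)   # outputs form a probability distribution
assert np.all(p > 0)                # every entry is strictly positive
```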