Activation Functions
Activation functions introduce non-linearity into a model. Without them, any multi-layer network mathematically collapses into a single linear transformation, making depth meaningless. To see why, consider what happens when you stack linear layers with no activations between them:
Layer 1: y₁ = W₁x + b₁
Layer 2: y₂ = W₂y₁ + b₂ = W₂(W₁x + b₁) + b₂ = (W₂W₁)x + (W₂b₁ + b₂)
Layer 3: y₃ = W₃y₂ + b₃ = W₃((W₂W₁)x + (W₂b₁ + b₂)) + b₃ = (W₃W₂W₁)x + ...
No matter how many layers you add, you can always multiply out all the weight matrices into one equivalent matrix W_final = W₃W₂W₁... and combine all the biases into one equivalent bias b_final.
The entire deep network becomes mathematically identical to: y = W_final × x + b_final
So a 100-layer network without activation functions is just a single linear transformation in disguise: its depth adds no expressive power. Applying a non-linear activation after each layer breaks this collapse and lets every layer contribute genuine complexity.
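The collapse above is easy to verify numerically. The sketch below (a minimal NumPy example; the layer shapes and random weights are illustrative, not from the text) stacks two bias-carrying linear layers and shows they equal the single collapsed layer W_final = W₂W₁, b_final = W₂b₁ + b₂:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "layers" with random weights and biases, no activations between them.
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal(4)
W2, b2 = rng.standard_normal((2, 4)), rng.standard_normal(2)

x = rng.standard_normal(3)

# Forward pass through the stacked layers: y2 = W2(W1 x + b1) + b2
y_stacked = W2 @ (W1 @ x + b1) + b2

# Collapsed single-layer equivalent: W_final = W2 W1, b_final = W2 b1 + b2
W_final = W2 @ W1
b_final = W2 @ b1 + b2
y_collapsed = W_final @ x + b_final

print(np.allclose(y_stacked, y_collapsed))  # True
```

The same collapse works for any number of layers, since each additional linear layer just multiplies another matrix into W_final.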
Sigmoid
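The sigmoid function is defined as σ(x) = 1 / (1 + e⁻ˣ); it squashes any real input into the open interval (0, 1). A minimal NumPy sketch:

```python
import numpy as np

def sigmoid(x):
    # sigma(x) = 1 / (1 + exp(-x)); maps any real x into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(0.0))                            # 0.5
print(sigmoid(np.array([-10.0, 0.0, 10.0])))   # ≈ [4.54e-05, 0.5, 0.99995]
```

Note that for large-magnitude negative inputs `np.exp(-x)` can overflow in float64; production implementations typically use a numerically stable branch for x < 0.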
Derivative of Sigmoid
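The derivative of the sigmoid has the convenient closed form σ′(x) = σ(x)(1 − σ(x)), so the forward activation can be reused in the backward pass. It peaks at 0.25 when x = 0 and decays toward 0 for large |x|, which is the root of sigmoid's vanishing-gradient problem. A sketch:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # sigma'(x) = sigma(x) * (1 - sigma(x)); maximum value 0.25 at x = 0
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid_grad(0.0))   # 0.25
print(sigmoid_grad(10.0))  # ≈ 4.54e-05 (gradient nearly vanishes)
```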
Hyperbolic Tan
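The hyperbolic tangent, tanh(x) = (eˣ − e⁻ˣ) / (eˣ + e⁻ˣ), squashes inputs into (−1, 1) and, unlike sigmoid, is zero-centered. It is a rescaled sigmoid: tanh(x) = 2σ(2x) − 1. A sketch verifying that identity with NumPy's built-in `np.tanh`:

```python
import numpy as np

x = np.array([-2.0, 0.0, 2.0])

# tanh(x) = (e^x - e^-x) / (e^x + e^-x); range (-1, 1), zero-centered
print(np.tanh(x))

# Identity check: tanh(x) = 2 * sigmoid(2x) - 1
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
print(np.allclose(np.tanh(x), 2 * sigmoid(2 * x) - 1))  # True
```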
Derivative of Tanh
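The derivative of tanh also has a closed form in terms of the forward value: d/dx tanh(x) = 1 − tanh²(x). It peaks at 1.0 when x = 0 (four times sigmoid's maximum gradient of 0.25), which is one reason tanh-based networks often train faster than sigmoid-based ones. A sketch with a finite-difference sanity check:

```python
import numpy as np

def tanh_grad(x):
    # d/dx tanh(x) = 1 - tanh(x)^2; maximum value 1.0 at x = 0
    return 1.0 - np.tanh(x) ** 2

print(tanh_grad(0.0))  # 1.0

# Finite-difference check of the analytic derivative at an arbitrary point
h = 1e-6
x0 = 0.7
numeric = (np.tanh(x0 + h) - np.tanh(x0 - h)) / (2 * h)
print(np.isclose(tanh_grad(x0), numeric))  # True
```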