Depth and Trainability
Trainability depends on model design choices:
- Neural network architecture
- Adaptive learning-rate optimizers
- Weight initialization in deep neural networks
- Hyperparameters
Smoothing the loss surface
Based on the paper: Li, Xu, Taylor, Studer, Goldstein, Visualizing the Loss Landscape of Neural Nets, NeurIPS, 2018
Why do residual connections make neural networks more trainable?
- Adding skip connections makes the loss surface less rough
- Gradients become more representative of the direction toward good local minima
Note: take these visualizations with a grain of salt, since they rely on dramatic dimensionality reduction (see the sketch below for how such 2D slices are computed)
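To make the visualization concrete, here is a minimal sketch (not the authors' code) of how such a 2D loss-surface slice can be computed in PyTorch: perturb the trained weights along two random, norm-rescaled directions and evaluate the loss on a grid. The function name `loss_surface_slice` and the per-tensor rescaling (a simplification of the paper's per-filter normalization) are assumptions for illustration.

```python
import torch

def loss_surface_slice(model, criterion, x, y, steps=25, span=1.0):
    """Loss on a 2D slice of weight space spanned by two random directions
    (per-tensor norm rescaling approximates the paper's filter normalization)."""
    params = list(model.parameters())
    dirs = []
    for _ in range(2):
        d = [torch.randn_like(p) for p in params]
        # Rescale each random direction to the norm of the corresponding parameter
        d = [di * (p.norm() / (di.norm() + 1e-10)) for di, p in zip(d, params)]
        dirs.append(d)

    originals = [p.detach().clone() for p in params]
    alphas = torch.linspace(-span, span, steps)
    surface = torch.zeros(steps, steps)

    with torch.no_grad():
        for i, a in enumerate(alphas):
            for j, b in enumerate(alphas):
                # Move the weights to theta + a*d0 + b*d1 and record the loss
                for p, p0, d0, d1 in zip(params, originals, dirs[0], dirs[1]):
                    p.copy_(p0 + a * d0 + b * d1)
                surface[i, j] = criterion(model(x), y).item()
        # Restore the original weights
        for p, p0 in zip(params, originals):
            p.copy_(p0)
    return surface
```

Plotting `surface` as a contour or surface plot gives the kind of figures discussed here; smoother, more convex-looking slices correspond to more trainable models.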
The effect of depth
- Deeper architectures have more uneven, chaotic surfaces and many minima
- Removing skip connections fragments and elongates the loss surface
- A fragmented surface makes good initialization more important
- Flatter minima are accompanied by lower test errors
The effect of depth in wider architectures
- Similar conclusions when increasing width
- Width makes the loss surface even smoother and flatter!
The effect of weight decay on optimizer trajectory
- Weight decay encourages the optimization trajectory to move perpendicular to the loss isocurves
- With weight decay turned off, the optimizer often moves parallel to the isocurves (see the sketch below)
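As a toy illustration of why this happens, consider the plain SGD update with weight decay: the extra $\lambda \boldsymbol{w}$ term always points toward the origin, so it pushes the iterate across level sets rather than letting it drift along them. The quadratic loss and all constants below are made-up assumptions, not taken from the paper.

```python
import torch

def sgd_step(w, grad, lr=0.01, weight_decay=0.0):
    # With weight_decay > 0, the extra lambda*w term pulls w toward the origin,
    # i.e. across the loss isocurves rather than along them.
    return w - lr * (grad + weight_decay * w)

# Toy quadratic loss L(w) = 0.5 * ||A w||^2 with strongly elongated isocurves
A = torch.diag(torch.tensor([5.0, 0.5]))
for wd in (0.0, 0.1):
    w = torch.tensor([1.0, 1.0])
    for _ in range(200):
        grad = A.T @ (A @ w)  # gradient of the toy loss
        w = sgd_step(w, grad, lr=0.01, weight_decay=wd)
    print(f"weight_decay={wd}: w = {w}")
```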
Why do skip connections make loss surfaces smoother?
With a skip connection, the block output is $\boldsymbol{h} = \boldsymbol{F}(\boldsymbol{x}) + \boldsymbol{x}$, so the gradient becomes
$$
\frac{\partial \mathcal{L}}{\partial \boldsymbol{x}}=\frac{\partial \mathcal{L}}{\partial \boldsymbol{h}} \cdot \frac{\partial \boldsymbol{h}}{\partial \boldsymbol{x}}=\frac{\partial \mathcal{L}}{\partial \boldsymbol{h}} \cdot\left(\frac{\partial \boldsymbol{F}}{\partial \boldsymbol{x}}+\frac{\partial \boldsymbol{x}}{\partial \boldsymbol{x}}\right)=\frac{\partial \mathcal{L}}{\partial \boldsymbol{h}} \cdot \frac{\partial \boldsymbol{F}}{\partial \boldsymbol{x}}+\frac{\partial \mathcal{L}}{\partial \boldsymbol{h}}
$$
This means the gradient from the layer above is carried back to the layer below untouched (the second term, $\frac{\partial \mathcal{L}}{\partial \boldsymbol{h}}$). Vanishing and exploding gradients therefore become less of a problem. Seen otherwise, the loss surface has stronger, better-behaved gradients, i.e., it is smoother.
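A minimal PyTorch sketch of this identity path (the block structure and layer sizes are illustrative, not from the referenced lecture):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """h = F(x) + x : the skip connection adds an identity path to the Jacobian."""
    def __init__(self, dim):
        super().__init__()
        self.F = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return self.F(x) + x  # the "+ x" term carries dL/dh back unchanged

block = ResidualBlock(4)
x = torch.randn(1, 4, requires_grad=True)
loss = block(x).sum()
loss.backward()
# Even if F's output did not depend on x at all, x.grad would still equal dL/dh
# (here a vector of ones), because the identity branch passes the upstream
# gradient through untouched.
print(x.grad)
```

The printed gradient always contains the upstream $\frac{\partial \mathcal{L}}{\partial \boldsymbol{h}}$ added on top of whatever $\boldsymbol{F}$ contributes, which is exactly the second term in the equation above.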
References
- Lecture 5.5, UvA DL course 2020
- Li, Xu, Taylor, Studer, Goldstein, Visualizing the Loss Landscape of Neural Nets, NeurIPS, 2018