Depth and Trainability
Trainability depends on model design choices:
- Neural network architecture
- Adaptive learning-rate optimizers
- Weight initialization in deep neural networks
- Hyperparameters
Smoothing the loss surface
Based on the paper: Li, Xu, Taylor, Studer, Goldstein, Visualizing the Loss Landscape of Neural Nets, NeurIPS, 2018
Why do residual connections make neural networks more trainable?
- Adding skip connections makes the loss surface less rough
- Gradients become more representative of the direction toward good local minima
Note: take these visualizations with a grain of salt, since they rely on dramatic dimensionality reduction (see the sketch below for how such 2D slices are computed)
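To make the visualization concrete, here is a minimal sketch (not the authors' code) of how such a 2D loss-surface slice can be computed in PyTorch: perturb the trained weights along two random, norm-rescaled directions and evaluate the loss on a grid. The function name `loss_surface_slice` and the per-tensor rescaling (a simplification of the paper's per-filter normalization) are assumptions for illustration.

```python
import torch

def loss_surface_slice(model, criterion, x, y, steps=25, span=1.0):
    """Loss on a 2D slice of weight space spanned by two random directions
    (per-tensor norm rescaling approximates the paper's filter normalization)."""
    params = list(model.parameters())
    dirs = []
    for _ in range(2):
        d = [torch.randn_like(p) for p in params]
        # Rescale each random direction to the norm of the corresponding parameter
        d = [di * (p.norm() / (di.norm() + 1e-10)) for di, p in zip(d, params)]
        dirs.append(d)

    originals = [p.detach().clone() for p in params]
    alphas = torch.linspace(-span, span, steps)
    surface = torch.zeros(steps, steps)

    with torch.no_grad():
        for i, a in enumerate(alphas):
            for j, b in enumerate(alphas):
                # Move the weights to theta + a*d0 + b*d1 and record the loss
                for p, p0, d0, d1 in zip(params, originals, dirs[0], dirs[1]):
                    p.copy_(p0 + a * d0 + b * d1)
                surface[i, j] = criterion(model(x), y).item()
        # Restore the original weights
        for p, p0 in zip(params, originals):
            p.copy_(p0)
    return surface
```

Plotting `surface` as a contour or surface plot gives the kind of figures discussed here; smoother, more convex-looking slices correspond to more trainable models.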
The effect of depth
- Deeper architectures have more uneven, chaotic surfaces and many minima
- Removing skip connections fragments and elongates the loss surface
- A fragmented surface makes good initialization more important
- Flatter minima are accompanied by lower test errors
The effect of depth in wider architectures
- Similar conclusions when increasing width
- Width makes the loss surface even smoother and flatter!
The effect of weight decay on optimizer trajectory
- Weight decay encourages the optimization trajectory to move perpendicular to the loss isocurves
- With weight decay turned off, the optimizer often moves parallel to the isocurves (see the sketch below)
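As a toy illustration of why this happens, consider the plain SGD update with weight decay: the extra $\lambda \boldsymbol{w}$ term always points toward the origin, so it pushes the iterate across level sets rather than letting it drift along them. The quadratic loss and all constants below are made-up assumptions, not taken from the paper.

```python
import torch

def sgd_step(w, grad, lr=0.01, weight_decay=0.0):
    # With weight_decay > 0, the extra lambda*w term pulls w toward the origin,
    # i.e. across the loss isocurves rather than along them.
    return w - lr * (grad + weight_decay * w)

# Toy quadratic loss L(w) = 0.5 * ||A w||^2 with strongly elongated isocurves
A = torch.diag(torch.tensor([5.0, 0.5]))
for wd in (0.0, 0.1):
    w = torch.tensor([1.0, 1.0])
    for _ in range(200):
        grad = A.T @ (A @ w)  # gradient of the toy loss
        w = sgd_step(w, grad, lr=0.01, weight_decay=wd)
    print(f"weight_decay={wd}: w = {w}")
```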
Why do skip connections make loss surfaces smoother?
With a skip connection, the block output is $\boldsymbol{h} = \boldsymbol{F}(\boldsymbol{x}) + \boldsymbol{x}$, so the gradient becomes
$$
\frac{\partial \mathcal{L}}{\partial \boldsymbol{x}}=\frac{\partial \mathcal{L}}{\partial \boldsymbol{h}} \cdot \frac{\partial \boldsymbol{h}}{\partial \boldsymbol{x}}=\frac{\partial \mathcal{L}}{\partial \boldsymbol{h}} \cdot\left(\frac{\partial \boldsymbol{F}}{\partial \boldsymbol{x}}+\frac{\partial \boldsymbol{x}}{\partial \boldsymbol{x}}\right)=\frac{\partial \mathcal{L}}{\partial \boldsymbol{h}} \cdot \frac{\partial \boldsymbol{F}}{\partial \boldsymbol{x}}+\frac{\partial \mathcal{L}}{\partial \boldsymbol{h}}
$$
This means the gradient from the layer above is carried back to the layer below untouched (the second term, $\frac{\partial \mathcal{L}}{\partial \boldsymbol{h}}$). Vanishing and exploding gradients therefore become less of a problem. Seen otherwise, the loss surface has stronger, better-behaved gradients, i.e., it is smoother.
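A minimal PyTorch sketch of this identity path (the block structure and layer sizes are illustrative, not from the referenced lecture):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """h = F(x) + x : the skip connection adds an identity path to the Jacobian."""
    def __init__(self, dim):
        super().__init__()
        self.F = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return self.F(x) + x  # the "+ x" term carries dL/dh back unchanged

block = ResidualBlock(4)
x = torch.randn(1, 4, requires_grad=True)
loss = block(x).sum()
loss.backward()
# Even if F's output did not depend on x at all, x.grad would still equal dL/dh
# (here a vector of ones), because the identity branch passes the upstream
# gradient through untouched.
print(x.grad)
```

The printed gradient always contains the upstream $\frac{\partial \mathcal{L}}{\partial \boldsymbol{h}}$ added on top of whatever $\boldsymbol{F}$ contributes, which is exactly the second term in the equation above.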
References
- Lecture 5.5, UvA DL course 2020
- Li, Xu, Taylor, Studer, Goldstein, Visualizing the Loss Landscape of Neural Nets, NeurIPS, 2018