Backpropagation Through Time (BPTT)

How are RNNs and MLPs different?

  • Main difference: layer-shared parameters (RNN) vs. layer-specific parameters (MLP). One just has to mentally switch from 'time steps' to 'layers'.
  • Question: how to train with shared parameters? Backpropagation... through time.
(Figure: unfolding RNNs)

BPTT is essentially the chain rule again, for $\frac{d \mathcal{L}}{d V}, \frac{d \mathcal{L}}{d U}, \frac{d \mathcal{L}}{d W}$, i.e. the same algorithm as normal backpropagation. But there is one caveat: shared computations complicate the chain rule:

$$ \begin{array}{l} y_{t}=\operatorname{softmax}\left(V \cdot \boldsymbol{s}_{t}\right) \\ \boldsymbol{s}_{t}=\tanh \left(U \cdot \boldsymbol{x}_{t}+W \cdot \boldsymbol{s}_{t-1}\right) \end{array} $$

$V, U, W$ are the same for $t, t+1, \ldots$

  • Gradients do not flow along a single path from the previous layer, as in an MLP
  • The recurrence in the chain rule 'hides' multiple dependencies
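The two equations above can be sketched as a plain-NumPy forward pass. This is a minimal toy sketch: the dimensions, random weights, and sequence length are made up for illustration; only the recurrence itself follows the formulas.

```python
import numpy as np

def rnn_forward(x_seq, U, W, V, s0=None):
    """Unrolled forward pass of a vanilla RNN.

    s_t = tanh(U @ x_t + W @ s_{t-1}),  y_t = softmax(V @ s_t).
    The same U, W, V are reused at every time step (shared parameters).
    """
    s = np.zeros(W.shape[0]) if s0 is None else s0
    states, outputs = [], []
    for x_t in x_seq:
        s = np.tanh(U @ x_t + W @ s)           # recurrence: depends on s_{t-1}
        z = V @ s
        y = np.exp(z - z.max()); y /= y.sum()  # numerically stable softmax
        states.append(s); outputs.append(y)
    return states, outputs

# toy dimensions (hypothetical): 3 inputs, 4 hidden units, 2 output classes
rng = np.random.default_rng(0)
U, W, V = rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), rng.normal(size=(2, 4))
x_seq = rng.normal(size=(5, 3))                # sequence of 5 input vectors
states, outputs = rnn_forward(x_seq, U, W, V)
```

Note that the loop reuses the same `U`, `W`, `V` at every step; that sharing is exactly what the backward pass below has to account for.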

Chain rule for dS/dW and dS/dV in unfolded graph

Treat the dependencies of $s$ on each state $s_{t}$ as separate covariate variables:

$$ s=h\left(s_{t}, s_{t-1}, \ldots, s_{0}\right) $$

All gradient paths flow from $s$ to $w$: via $s_{t}$, via $s_{t-1}$, and so on:
$$ \frac{d s}{d w}=\sum_{i=0}^{t} \frac{d s}{d s_{i}} \cdot \frac{d s_{i}}{d w} $$
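This sum-over-paths rule can be checked numerically on a scalar toy recurrence. The sketch below assumes $h = \tanh$ as in the equations above; the values of $u$, $x$, and $s_0$ are arbitrary. Each summand is the local partial $\partial s_i / \partial w$ (with $s_{i-1}$ held fixed) times the product of step-to-step Jacobians.

```python
import numpy as np

def states(w, x, u=0.5, s0=0.1):
    """Scalar recurrence s_t = tanh(u*x_t + w*s_{t-1}); returns [s_0, ..., s_T]."""
    s = [s0]
    for x_t in x:
        s.append(np.tanh(u * x_t + w * s[-1]))
    return s

def dsdw_chain(w, x, u=0.5, s0=0.1):
    """ds_T/dw = sum_i (prod_{j=i+1}^T ds_j/ds_{j-1}) * (partial s_i / partial w)."""
    s = states(w, x, u, s0)
    T = len(x)
    total = 0.0
    for i in range(1, T + 1):
        local = (1 - s[i] ** 2) * s[i - 1]   # partial s_i / partial w (s_{i-1} fixed)
        prod = 1.0
        for j in range(i + 1, T + 1):
            prod *= (1 - s[j] ** 2) * w      # ds_j / ds_{j-1}
        total += prod * local
    return total

x = np.array([0.3, -1.2, 0.8, 0.5])
w, eps = 0.9, 1e-6
numeric = (states(w + eps, x)[-1] - states(w - eps, x)[-1]) / (2 * eps)
print(abs(dsdw_chain(w, x) - numeric) < 1e-7)  # the summed chain rule matches
```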

Chain rule by change of variable differentiation

  • Same result as above but more involved
  • One must keep in mind that the nonlinearity $h$ and its derivative act within 'one layer': no recursion through $h$, only via the $s_{i}$

BPTT for memory and input parameters dL/dW, dL/dU

Putting everything together,

$$ \frac{\partial \mathcal{L}_{t}}{\partial W}=\sum_{i=0}^{t} \frac{\partial \mathcal{L}_{t}}{\partial y_{t}} \frac{\partial y_{t}}{\partial s_{t}} \frac{\partial s_{t}}{\partial s_{i}} \frac{\partial \boldsymbol{s}_{i}}{\partial \boldsymbol{W}}=\sum_{i=0}^{t} \frac{\partial \mathcal{L}_{t}}{\partial y_{t}} \frac{\partial y_{t}}{\partial \boldsymbol{s}_{t}}\left(\prod_{j=i+1}^{t} \frac{\partial s_{j}}{\partial s_{j-1}}\right) \frac{\partial \boldsymbol{s}_{i}}{\partial \boldsymbol{W}} $$

If there is a separate loss per time step, sum over time steps:
$$ \frac{\partial \mathcal{L}}{\partial \boldsymbol{W}}=\sum_{t} \frac{\partial \mathcal{L}_{t}}{\partial \boldsymbol{W}} $$

Similarly for the input parameters $\boldsymbol{U}$:
$$ \frac{\partial \mathcal{L}_{t}}{\partial \boldsymbol{U}}=\sum_{i=0}^{t} \frac{\partial \mathcal{L}_{t}}{\partial y_{t}} \frac{\partial y_{t}}{\partial \boldsymbol{s}_{t}} \frac{\partial \boldsymbol{s}_{t}}{\partial \boldsymbol{s}_{i}} \frac{\partial \boldsymbol{s}_{i}}{\partial \boldsymbol{U}} $$
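Putting the formulas into code: below is a minimal BPTT sketch in NumPy, verified against a finite difference. It assumes a single cross-entropy loss at the final step only (the per-step sum is just this repeated per $t$); all dimensions and weights are toy choices. The backward loop accumulates the sum over $i$ from the equations above by reusing the running gradient `ds`.

```python
import numpy as np

def loss(U, W, V, x_seq, target, s0):
    s = s0
    for x_t in x_seq:
        s = np.tanh(U @ x_t + W @ s)
    z = V @ s
    y = np.exp(z - z.max()); y /= y.sum()
    return -np.log(y[target])

def bptt_grads(U, W, V, x_seq, target, s0):
    """Backprop through time for a cross-entropy loss at the final step."""
    states = [s0]
    for x_t in x_seq:
        states.append(np.tanh(U @ x_t + W @ states[-1]))
    z = V @ states[-1]
    y = np.exp(z - z.max()); y /= y.sum()
    dz = y.copy(); dz[target] -= 1.0         # dL/dz for softmax + cross-entropy
    dV = np.outer(dz, states[-1])
    dU, dW = np.zeros_like(U), np.zeros_like(W)
    ds = V.T @ dz                            # dL/ds_T
    for t in range(len(x_seq), 0, -1):       # walk back through time
        da = ds * (1 - states[t] ** 2)       # through tanh
        dU += np.outer(da, x_seq[t - 1])     # partial s_t / partial U
        dW += np.outer(da, states[t - 1])    # partial s_t / partial W
        ds = W.T @ da                        # pass gradient on to s_{t-1}
    return dU, dW, dV

rng = np.random.default_rng(1)
U, W, V = rng.normal(size=(4, 3)), rng.normal(size=(4, 4)) * 0.5, rng.normal(size=(2, 4))
x_seq, s0, target = rng.normal(size=(5, 3)), np.zeros(4), 1
dU, dW, dV = bptt_grads(U, W, V, x_seq, target, s0)

# finite-difference check on one entry of W
eps = 1e-6
Wp = W.copy(); Wp[0, 1] += eps
Wm = W.copy(); Wm[0, 1] -= eps
numeric = (loss(U, Wp, V, x_seq, target, s0) - loss(U, Wm, V, x_seq, target, s0)) / (2 * eps)
print(abs(dW[0, 1] - numeric) < 1e-6)  # gradients match
```

The backward recursion is an efficient form of the double sum: rather than recomputing $\prod_j \partial s_j / \partial s_{j-1}$ for every $i$, it extends the product one factor at a time while accumulating `dU` and `dW`.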

Truncating BPTT

  • Unrolling forever is neither practical nor feasible
  • Truncating at $t_{\text {trunc}}$ is the usual strategy: replace all $t$ in the equations above with $t_{\text {trunc}}$
  • More focus on short-term terms
    • Not undesirable, as long-term terms may be irrelevant anyway
    • 'Biases' towards simpler models with shorter-term dependencies
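Truncation amounts to stopping the backward loop early. The sketch below reuses the tanh recurrence from before; `k`, the sequence length, and the toy weights are illustrative choices.

```python
import numpy as np

def truncated_bptt_dW(states, W, ds_T, k):
    """Backward pass for dL/dW, stopped after k steps (truncated BPTT).

    `states` is [s_0, ..., s_T] from the forward pass, `ds_T` is dL/ds_T.
    Gradient paths reaching more than k steps back in time are dropped.
    """
    T = len(states) - 1
    dW = np.zeros_like(W)
    ds = ds_T
    for t in range(T, max(T - k, 0), -1):    # only the last k time steps
        da = ds * (1 - states[t] ** 2)       # through tanh
        dW += np.outer(da, states[t - 1])
        ds = W.T @ da                        # discarded once the loop stops
    return dW

# toy forward pass: s_t = tanh(W s_{t-1} + noise)
rng = np.random.default_rng(2)
W = rng.normal(size=(3, 3)) * 0.5
states = [np.zeros(3)]
for _ in range(6):
    states.append(np.tanh(W @ states[-1] + rng.normal(size=3)))
ds_T = rng.normal(size=3)

full = truncated_bptt_dW(states, W, ds_T, k=6)   # k = T: full BPTT
trunc = truncated_bptt_dW(states, W, ds_T, k=2)  # short-term paths only
```

With $k \geq T$ this reduces to full BPTT; smaller $k$ keeps only the short-term contributions, which is exactly the bias described above.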

Note: there are good slides on recursive gradient computation for BPTT.

Challenges

Vanishing and Exploding Gradients

  • This is a big problem in RNNs: they are effectively MLPs with infinitely many layers and shared parameters, and that depth is the main issue

    $$ \frac{\partial \mathcal{L}_{t}}{\partial \boldsymbol{W}}=\sum_{i=0}^{t} \frac{\partial \mathcal{L}_{t}}{\partial y_{t}} \frac{\partial y_{t}}{\partial \boldsymbol{s}_{t}} \frac{\partial s_{t}}{\partial s_{i}} \frac{\partial \boldsymbol{s}_{i}}{\partial \boldsymbol{W}}=\sum_{i=0}^{t} \frac{\partial \mathcal{L}_{t}}{\partial y_{t}} \frac{\partial y_{t}}{\partial \boldsymbol{s}_{t}}\left(\prod_{j=i+1}^{t} \frac{\partial s_{j}}{\partial s_{j-1}}\right) \frac{\partial \boldsymbol{s}_{i}}{\partial \boldsymbol{W}} $$
    • If $\left\|\frac{\partial s_{j}}{\partial s_{j-1}}\right\|<1$, the product shrinks exponentially in $t-i$ $\Rightarrow \left\|\frac{\partial \mathcal{L}_{t}}{\partial W}\right\| \ll 1$: vanishing gradient
    • If $\left\|\frac{\partial s_{j}}{\partial s_{j-1}}\right\|>1$, the product grows exponentially in $t-i$ $\Rightarrow \left\|\frac{\partial \mathcal{L}_{t}}{\partial W}\right\| \gg 1$: exploding gradient
  • Exponentially smaller contributions from longer-term terms, so the model emphasizes short-term terms, as they have larger gradients.
  • A possible solution is to clip the exploding gradients, since they are an optimization issue; vanishing gradients cannot be rescaled, since they are a gradient-accuracy issue. In any case, rescaling keeps the focus on the short term.
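Clipping by global norm, the standard remedy for exploding gradients, can be sketched as follows. The `max_norm` value is an arbitrary choice; note that this helps only with exploding gradients, not vanishing ones.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their joint L2 norm is <= max_norm.

    Remedy for exploding gradients only: it caps the step size, but it
    cannot restore information lost to vanishing gradients.
    """
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total > max_norm:
        scale = max_norm / total
        grads = [g * scale for g in grads]
    return grads, total

grads = [np.full((2, 2), 10.0), np.full(3, -10.0)]   # deliberately huge
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(round(norm_after, 6))  # 1.0
```

Clipping the joint norm (rather than each array separately) preserves the direction of the overall update, only shrinking its length.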

Misalignment between gradient and weights

  • At every step we use 'different versions over time' of the various variables
  • The gradients are therefore only estimates, and they get worse the further back we propagate
  • Doing fewer updates helps but might slow down training

Bias due to truncation

  • Instead of computing the real gradient for all time steps $0,1,2, \ldots, t$, we compute a gradient approximation up to $t_{\text {trunc}}$
  • In practice, for many applications, it does not matter very much.
  • However, it would still be nice to have the model choose what to ignore.