Backpropagation Through Time (BPTT)

How are RNNs and MLPs different?

  • Main difference: layer-shared parameters (RNN) vs. layer-specific parameters (MLP). One just has to mentally switch from 'time steps' to 'layers'.
  • Question: how to train with shared parameters? Backpropagation... through time.
(Figure: unfolding RNNs)

BPTT is essentially the chain rule again, for $\frac{d \mathcal{L}}{d V}, \frac{d \mathcal{L}}{d U}, \frac{d \mathcal{L}}{d W}$, i.e. the same algorithm as normal backpropagation. But there is one caveat: shared computations complicate the chain rule:

$$ \begin{array}{l} y_{t}=\operatorname{softmax}\left(V \cdot \boldsymbol{s}_{t}\right) \\ \boldsymbol{s}_{t}=\tanh \left(U \cdot \boldsymbol{x}_{t}+W \cdot \boldsymbol{s}_{t-1}\right) \end{array} $$

$V, U, W$ are the same for $t, t+1, \ldots$

  • Gradients do not flow along a single path from the previous layer, as in an MLP
  • The recurrence in the chain rule 'hides' multiple dependencies
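The two equations above can be sketched as a plain-NumPy forward pass. This is a minimal toy sketch: the dimensions, random weights, and sequence length are made up for illustration; only the recurrence itself follows the formulas.

```python
import numpy as np

def rnn_forward(x_seq, U, W, V, s0=None):
    """Unrolled forward pass of a vanilla RNN.

    s_t = tanh(U @ x_t + W @ s_{t-1}),  y_t = softmax(V @ s_t).
    The same U, W, V are reused at every time step (shared parameters).
    """
    s = np.zeros(W.shape[0]) if s0 is None else s0
    states, outputs = [], []
    for x_t in x_seq:
        s = np.tanh(U @ x_t + W @ s)           # recurrence: depends on s_{t-1}
        z = V @ s
        y = np.exp(z - z.max()); y /= y.sum()  # numerically stable softmax
        states.append(s); outputs.append(y)
    return states, outputs

# toy dimensions (hypothetical): 3 inputs, 4 hidden units, 2 output classes
rng = np.random.default_rng(0)
U, W, V = rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), rng.normal(size=(2, 4))
x_seq = rng.normal(size=(5, 3))                # sequence of 5 input vectors
states, outputs = rnn_forward(x_seq, U, W, V)
```

Note that the loop reuses the same `U`, `W`, `V` at every step; that sharing is exactly what the backward pass below has to account for.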

Chain rule for dS/dW and dS/dV in unfolded graph

Treat the dependencies of $s$ on each state $s_{t}$ as separate covariate variables:

$$ s=h\left(s_{t}, s_{t-1}, \ldots, s_{0}\right) $$

All gradient paths flow from $s$ to $w$: via $s_{t}$, via $s_{t-1}$, and so on:
$$ \frac{d s}{d w}=\sum_{i=0}^{t} \frac{d s}{d s_{i}} \cdot \frac{d s_{i}}{d w} $$
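This sum-over-paths rule can be checked numerically on a scalar toy recurrence. The sketch below assumes $h = \tanh$ as in the equations above; the values of $u$, $x$, and $s_0$ are arbitrary. Each summand is the local partial $\partial s_i / \partial w$ (with $s_{i-1}$ held fixed) times the product of step-to-step Jacobians.

```python
import numpy as np

def states(w, x, u=0.5, s0=0.1):
    """Scalar recurrence s_t = tanh(u*x_t + w*s_{t-1}); returns [s_0, ..., s_T]."""
    s = [s0]
    for x_t in x:
        s.append(np.tanh(u * x_t + w * s[-1]))
    return s

def dsdw_chain(w, x, u=0.5, s0=0.1):
    """ds_T/dw = sum_i (prod_{j=i+1}^T ds_j/ds_{j-1}) * (partial s_i / partial w)."""
    s = states(w, x, u, s0)
    T = len(x)
    total = 0.0
    for i in range(1, T + 1):
        local = (1 - s[i] ** 2) * s[i - 1]   # partial s_i / partial w (s_{i-1} fixed)
        prod = 1.0
        for j in range(i + 1, T + 1):
            prod *= (1 - s[j] ** 2) * w      # ds_j / ds_{j-1}
        total += prod * local
    return total

x = np.array([0.3, -1.2, 0.8, 0.5])
w, eps = 0.9, 1e-6
numeric = (states(w + eps, x)[-1] - states(w - eps, x)[-1]) / (2 * eps)
print(abs(dsdw_chain(w, x) - numeric) < 1e-7)  # the summed chain rule matches
```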

Chain rule by change of variable differentiation

  • Same result as above but more involved
  • One must keep in mind that the nonlinearity $h$ and its derivative act within 'one layer': no recursion through $h$, only via the $s_{i}$

BPTT for memory and input parameters dL/dW, dL/dU

Putting everything together,

$$ \frac{\partial \mathcal{L}_{t}}{\partial W}=\sum_{i=0}^{t} \frac{\partial \mathcal{L}_{t}}{\partial y_{t}} \frac{\partial y_{t}}{\partial s_{t}} \frac{\partial s_{t}}{\partial s_{i}} \frac{\partial \boldsymbol{s}_{i}}{\partial \boldsymbol{W}}=\sum_{i=0}^{t} \frac{\partial \mathcal{L}_{t}}{\partial y_{t}} \frac{\partial y_{t}}{\partial \boldsymbol{s}_{t}}\left(\prod_{j=i+1}^{t} \frac{\partial s_{j}}{\partial s_{j-1}}\right) \frac{\partial \boldsymbol{s}_{i}}{\partial \boldsymbol{W}} $$

If there is a separate loss per time step, sum over time steps:
$$ \frac{\partial \mathcal{L}}{\partial \boldsymbol{W}}=\sum_{t} \frac{\partial \mathcal{L}_{t}}{\partial \boldsymbol{W}} $$

Similarly for the input parameters $\boldsymbol{U}$:
$$ \frac{\partial \mathcal{L}_{t}}{\partial \boldsymbol{U}}=\sum_{i=0}^{t} \frac{\partial \mathcal{L}_{t}}{\partial y_{t}} \frac{\partial y_{t}}{\partial \boldsymbol{s}_{t}} \frac{\partial \boldsymbol{s}_{t}}{\partial \boldsymbol{s}_{i}} \frac{\partial \boldsymbol{s}_{i}}{\partial \boldsymbol{U}} $$
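Putting the formulas into code: below is a minimal BPTT sketch in NumPy, verified against a finite difference. It assumes a single cross-entropy loss at the final step only (the per-step sum is just this repeated per $t$); all dimensions and weights are toy choices. The backward loop accumulates the sum over $i$ from the equations above by reusing the running gradient `ds`.

```python
import numpy as np

def loss(U, W, V, x_seq, target, s0):
    s = s0
    for x_t in x_seq:
        s = np.tanh(U @ x_t + W @ s)
    z = V @ s
    y = np.exp(z - z.max()); y /= y.sum()
    return -np.log(y[target])

def bptt_grads(U, W, V, x_seq, target, s0):
    """Backprop through time for a cross-entropy loss at the final step."""
    states = [s0]
    for x_t in x_seq:
        states.append(np.tanh(U @ x_t + W @ states[-1]))
    z = V @ states[-1]
    y = np.exp(z - z.max()); y /= y.sum()
    dz = y.copy(); dz[target] -= 1.0         # dL/dz for softmax + cross-entropy
    dV = np.outer(dz, states[-1])
    dU, dW = np.zeros_like(U), np.zeros_like(W)
    ds = V.T @ dz                            # dL/ds_T
    for t in range(len(x_seq), 0, -1):       # walk back through time
        da = ds * (1 - states[t] ** 2)       # through tanh
        dU += np.outer(da, x_seq[t - 1])     # partial s_t / partial U
        dW += np.outer(da, states[t - 1])    # partial s_t / partial W
        ds = W.T @ da                        # pass gradient on to s_{t-1}
    return dU, dW, dV

rng = np.random.default_rng(1)
U, W, V = rng.normal(size=(4, 3)), rng.normal(size=(4, 4)) * 0.5, rng.normal(size=(2, 4))
x_seq, s0, target = rng.normal(size=(5, 3)), np.zeros(4), 1
dU, dW, dV = bptt_grads(U, W, V, x_seq, target, s0)

# finite-difference check on one entry of W
eps = 1e-6
Wp = W.copy(); Wp[0, 1] += eps
Wm = W.copy(); Wm[0, 1] -= eps
numeric = (loss(U, Wp, V, x_seq, target, s0) - loss(U, Wm, V, x_seq, target, s0)) / (2 * eps)
print(abs(dW[0, 1] - numeric) < 1e-6)  # gradients match
```

The backward recursion is an efficient form of the double sum: rather than recomputing $\prod_j \partial s_j / \partial s_{j-1}$ for every $i$, it extends the product one factor at a time while accumulating `dU` and `dW`.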

Truncating BPTT

  • Unrolling forever is neither practical nor feasible
  • Truncating at $t_{\text {trunc}}$ is the usual strategy: replace all $t$ in the equations above with $t_{\text {trunc}}$
  • More focus on short-term terms
    • Not undesirable, as long-term terms may be irrelevant anyway
    • 'Biases' towards simpler models with shorter-term dependencies
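Truncation amounts to stopping the backward loop early. The sketch below reuses the tanh recurrence from before; `k`, the sequence length, and the toy weights are illustrative choices.

```python
import numpy as np

def truncated_bptt_dW(states, W, ds_T, k):
    """Backward pass for dL/dW, stopped after k steps (truncated BPTT).

    `states` is [s_0, ..., s_T] from the forward pass, `ds_T` is dL/ds_T.
    Gradient paths reaching more than k steps back in time are dropped.
    """
    T = len(states) - 1
    dW = np.zeros_like(W)
    ds = ds_T
    for t in range(T, max(T - k, 0), -1):    # only the last k time steps
        da = ds * (1 - states[t] ** 2)       # through tanh
        dW += np.outer(da, states[t - 1])
        ds = W.T @ da                        # discarded once the loop stops
    return dW

# toy forward pass: s_t = tanh(W s_{t-1} + noise)
rng = np.random.default_rng(2)
W = rng.normal(size=(3, 3)) * 0.5
states = [np.zeros(3)]
for _ in range(6):
    states.append(np.tanh(W @ states[-1] + rng.normal(size=3)))
ds_T = rng.normal(size=3)

full = truncated_bptt_dW(states, W, ds_T, k=6)   # k = T: full BPTT
trunc = truncated_bptt_dW(states, W, ds_T, k=2)  # short-term paths only
```

With $k \geq T$ this reduces to full BPTT; smaller $k$ keeps only the short-term contributions, which is exactly the bias described above.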

Note: there are good slides on recursive gradient computation for BPTT.

Challenges

Vanishing and Exploding Gradients

  • This is a big problem in RNNs: they are effectively MLPs with infinitely many layers and shared parameters, and that depth is the main issue

    $$ \frac{\partial \mathcal{L}_{t}}{\partial \boldsymbol{W}}=\sum_{i=0}^{t} \frac{\partial \mathcal{L}_{t}}{\partial y_{t}} \frac{\partial y_{t}}{\partial \boldsymbol{s}_{t}} \frac{\partial s_{t}}{\partial s_{i}} \frac{\partial \boldsymbol{s}_{i}}{\partial \boldsymbol{W}}=\sum_{i=0}^{t} \frac{\partial \mathcal{L}_{t}}{\partial y_{t}} \frac{\partial y_{t}}{\partial \boldsymbol{s}_{t}}\left(\prod_{j=i+1}^{t} \frac{\partial s_{j}}{\partial s_{j-1}}\right) \frac{\partial \boldsymbol{s}_{i}}{\partial \boldsymbol{W}} $$
    • If $\left\|\frac{\partial s_{j}}{\partial s_{j-1}}\right\|<1$, the product shrinks exponentially in $t-i$ $\Rightarrow \left\|\frac{\partial \mathcal{L}_{t}}{\partial W}\right\| \ll 1$: vanishing gradient
    • If $\left\|\frac{\partial s_{j}}{\partial s_{j-1}}\right\|>1$, the product grows exponentially in $t-i$ $\Rightarrow \left\|\frac{\partial \mathcal{L}_{t}}{\partial W}\right\| \gg 1$: exploding gradient
  • Exponentially smaller contributions from longer-term terms, so the model emphasizes short-term terms, as they have larger gradients.
  • A possible solution is to clip the exploding gradients, since they are an optimization issue; vanishing gradients cannot be rescaled, since they are a gradient-accuracy issue. In any case, rescaling keeps the focus on the short term.
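Clipping by global norm, the standard remedy for exploding gradients, can be sketched as follows. The `max_norm` value is an arbitrary choice; note that this helps only with exploding gradients, not vanishing ones.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their joint L2 norm is <= max_norm.

    Remedy for exploding gradients only: it caps the step size, but it
    cannot restore information lost to vanishing gradients.
    """
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total > max_norm:
        scale = max_norm / total
        grads = [g * scale for g in grads]
    return grads, total

grads = [np.full((2, 2), 10.0), np.full(3, -10.0)]   # deliberately huge
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(round(norm_after, 6))  # 1.0
```

Clipping the joint norm (rather than each array separately) preserves the direction of the overall update, only shrinking its length.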

Misalignment between gradient and weights

  • At every step we use 'different versions over time' of the various variables
  • The gradients are therefore only estimates, and they get worse the further back we propagate
  • Doing fewer updates helps but might slow down training

Bias due to truncation

  • Instead of computing the real gradient for all time steps $0,1,2, \ldots, t$, we compute a gradient approximation up to $t_{\text {trunc}}$
  • In practice, for many applications, it does not matter very much.
  • However, it would still be nice to have the model choose what to ignore.