LSTM

Vanilla Recurrent Neural Networks (RNNs) are defined as

$$ \begin{array}{l} y_{t}=\operatorname{softmax}\left(V \cdot s_{t}\right) \\ s_{t}=\tanh \left(U \cdot x_{t}+W \cdot s_{t-1}\right) \end{array} $$
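To make the definition concrete, here is a minimal NumPy sketch of one vanilla RNN step; the dimensions and random initialization are illustrative assumptions, not from the source.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid, d_out = 4, 8, 3                    # hypothetical sizes

U = rng.normal(scale=0.1, size=(d_hid, d_in))   # input-to-state weights
W = rng.normal(scale=0.1, size=(d_hid, d_hid))  # state-to-state weights
V = rng.normal(scale=0.1, size=(d_out, d_hid))  # state-to-output weights

def rnn_step(x_t, s_prev):
    """One step of the vanilla RNN defined above."""
    s_t = np.tanh(U @ x_t + W @ s_prev)         # s_t = tanh(U x_t + W s_{t-1})
    logits = V @ s_t
    y_t = np.exp(logits - logits.max())
    y_t /= y_t.sum()                            # y_t = softmax(V s_t)
    return y_t, s_t

y, s = rnn_step(rng.normal(size=d_in), np.zeros(d_hid))
```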

The key ideas behind LSTMs are:

Aim for $\frac{\partial s_{j}}{\partial s_{j-1}}=1$ to avoid vanishing and exploding gradients.

Remove the immediate nonlinear relation between $s_{t}$ and $s_{t-1}$, since squashing nonlinearities yield gradients smaller than 1 (see the derivation after the list below).

  • Replace the tanh between $s_{t}$ and $s_{t-1}$ with the identity
  • Apply the nonlinearity to the memory variable instead of the state variable
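To see why the gradients shrink, consider the Jacobian of the vanilla RNN state; this standard derivation follows directly from the definition above:

$$ \frac{\partial s_{t}}{\partial s_{t-1}} = \operatorname{diag}\left(1-s_{t}^{2}\right) W, \qquad \frac{\partial s_{t}}{\partial s_{k}} = \prod_{j=k+1}^{t} \operatorname{diag}\left(1-s_{j}^{2}\right) W $$

Since $1-\tanh^{2}(\cdot) \leq 1$, this product of $t-k$ factors shrinks exponentially in $t-k$ when $\|W\|$ is small (vanishing gradients) and can blow up when $\|W\|$ is large (exploding gradients).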

Also, avoid continuously overwriting the state (a minimal sketch of gating follows this list):

  • Modulate the importance of new input by a gate
  • Modulate the importance of new output by a gate
  • Modulate the importance of past memories by a gate
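For intuition, here is a tiny sketch of what a gate does; the vectors are hypothetical, chosen only to show the element-wise mechanics.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A sigmoid gate in (0, 1) scales how much of a signal passes through.
gate = sigmoid(np.array([-4.0, 0.0, 4.0]))   # ~[0.02, 0.5, 0.98]
signal = np.array([1.0, 1.0, 1.0])
print(gate * signal)                         # ~0 blocks, ~1 passes through
```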

Putting all these things together, at each time step the LSTM computes:

Input gate - determines how important the input is and selects which information from the input to add to the new cell state.

$$ i_t = \sigma \left( W^{(i)}x_t + U^{(i)}h_{t-1} \right) $$

Forget gate - determines how important the past state is and deletes information from the cell state that is no longer needed.

$$ f_t = \sigma \left( W^{(f)}x_t + U^{(f)}h_{t-1} \right) $$

New memory cell - what could be relevant for the new memory? Extracts information from the previous hidden state and the input to create a candidate memory.

$$ \hat{c}_t = \tanh\left( W^{(c)}x_t + U^{(c)}h_{t-1} \right) $$

Final memory cell - computes the new cell state by combining the gated past memory with the gated candidate.

$$ c_t = f_t \odot c_{t-1} + i_t \odot \hat{c}_t $$

Output gate - determines how much of the new cell state is useful for the output.

$$ o_t = \sigma \left( W^{(o)}x_t + U^{(o)}h_{t-1} \right) $$

Final hidden state - updates the hidden state from the output-gated, squashed cell state.

$$ h_t = o_t \odot \tanh(c_t) $$
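Putting the six equations together, here is a minimal NumPy sketch of one LSTM step; the sizes, random initialization, and omission of bias terms are illustrative assumptions matching the bias-free equations above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

d_in, d_hid = 4, 8                               # hypothetical sizes
rng = np.random.default_rng(0)
# One (W, U) pair per gate/candidate, as in the equations above.
params = {k: (rng.normal(scale=0.1, size=(d_hid, d_in)),
              rng.normal(scale=0.1, size=(d_hid, d_hid)))
          for k in ("i", "f", "c", "o")}

def lstm_step(x_t, h_prev, c_prev):
    W_i, U_i = params["i"]; W_f, U_f = params["f"]
    W_c, U_c = params["c"]; W_o, U_o = params["o"]
    i_t = sigmoid(W_i @ x_t + U_i @ h_prev)      # input gate
    f_t = sigmoid(W_f @ x_t + U_f @ h_prev)      # forget gate
    c_hat = np.tanh(W_c @ x_t + U_c @ h_prev)    # new memory cell (candidate)
    c_t = f_t * c_prev + i_t * c_hat             # final memory cell
    o_t = sigmoid(W_o @ x_t + U_o @ h_prev)      # output gate
    h_t = o_t * np.tanh(c_t)                     # final hidden state
    return h_t, c_t

h, c = lstm_step(rng.normal(size=d_in), np.zeros(d_hid), np.zeros(d_hid))
```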

LSTM insights

Comparing the state equations of the RNN and the LSTM (the LSTM's cell state $c_t$ plays the role of the RNN state $s_t$):
RNN: $s_{t}=\tanh \left(U \cdot x_{t}+W \cdot s_{t-1}\right)$
LSTM: $c_{t}=c_{t-1} \odot f_{t}+\hat{c}_{t} \odot i_{t}, \quad h_{t}=\tanh \left(c_{t}\right) \odot o_{t}$

  • The LSTM still has an indirect nonlinear relation between $c_{t}$ and $c_{t-1}$ via $h_{t-1}$, but there is also a direct linear relation $\rightarrow$ strong gradients are encouraged (see the derivation after this list).
  • Use sigmoids for gating, since they squash to $(0,1)$ values.
  • Use tanh as the module's recurrence nonlinearity instead.
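Concretely, differentiating the cell update $c_{t}=f_{t} \odot c_{t-1}+i_{t} \odot \hat{c}_{t}$ gives the following standard decomposition (not spelled out in the source):

$$ \frac{\partial c_{t}}{\partial c_{t-1}} = \operatorname{diag}\left(f_{t}\right) + \underbrace{\left(\text{terms through } i_{t}, f_{t}, \hat{c}_{t} \text{ via } h_{t-1}\right)}_{\text{indirect, nonlinear}} $$

The direct term $\operatorname{diag}(f_{t})$ involves no repeated squashing: when the forget gate is close to 1, gradients flow across many steps nearly unattenuated, unlike the product of $\operatorname{diag}(1-s_{j}^{2})\, W$ factors in the vanilla RNN.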
