Recurrent Neural Networks (RNN)
Characteristics of sequential data
Sequences are everywhere - videos, stock exchange data, climate measurements, market analysis, speech, music, user behaviour, computer games etc.
Sequence -> Data are no longer i.i.d.
- Each data point depends on the previous ones, and their distribution changes over time
- How to model temporal dependencies, especially in high-dim complex data?
Sequences might be of arbitrary length
- How to model dependencies at different temporal scales?
- How to make a model that works for 10 sec and 10 min sequences?
Cycles in our computational graph
- How to learn with cycles?
Unclear how to assign credit to past time steps
- Which of the infinite past time steps is responsible for this observed present?
Chaotic behavior
- How do we account for all possible effects of small changes in past sequence?
Temporal redundancy
- In temporally dense series there is lots of redundancy. Curse or blessing?
Inductive bias for sequences
A sequence of probabilities
Sequences are by definition non-i.i.d. Then how do we compute the likelihood? By decomposing with Probability Theory > Chain rule:
$$p\left(x_{1}, \ldots, x_{n}\right)=\prod_{i=1}^{n} p\left(x_{i} \mid x_{1}, \ldots, x_{i-1}\right)$$
And then model each term $p\left(x_{i} \mid x_{1}, \ldots, x_{i-1}\right)$ separately.
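The decomposition can be sketched numerically. As a toy example (my own, not from the notes), assume a first-order Markov simplification where each conditional $p(x_i \mid x_1, \ldots, x_{i-1})$ only looks at $x_{i-1}$; the table names `p_init` and `p_cond` are hypothetical:

```python
import numpy as np

# Toy first-order Markov model over a vocabulary of 3 symbols - an
# illustrative simplification of the conditionals p(x_i | x_1, ..., x_{i-1}).
p_init = np.array([0.5, 0.3, 0.2])          # p(x_1)
p_cond = np.array([[0.1, 0.6, 0.3],         # row i holds p(x_t = j | x_{t-1} = i)
                   [0.4, 0.4, 0.2],
                   [0.3, 0.3, 0.4]])

def log_likelihood(seq):
    # Chain rule: log p(x_1, ..., x_n) = log p(x_1) + sum_i log p(x_i | x_{<i})
    ll = np.log(p_init[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        ll += np.log(p_cond[prev, cur])
    return ll
```

Working in log space keeps the product of many small probabilities numerically stable.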
Memory
To have the past influence present we must keep a summary of the past: A state, a memory, a representation of the past.
Memory can take any shape, as long as it encodes the past. Otherwise it is hard to keep track of distributional shifts.
At timestep $t$, project all previous information $1, \ldots, t$ onto a state space $s_{t}$. Memory is controlled by a neural network $h$ with shared parameters $\theta$. Then, at timestep $t+1$, re-use the parameters $\theta$ and the previous state $s_{t}$ such that
$$s_{t+1}=h\left(s_{t}, x_{t+1} ; \theta\right)$$
Then apply this recursive operation, often with the Markov Chain assumption that any new state depends on the previous time step only.
In the simplest case with fully connected layers: $\theta=\{U, V, W\}$
- The matrix $U$ transforms a vector from the input space to the state space
- The matrix $V$ transforms a vector from the state space to the output space
- The matrix $W$ transforms the past state and input to the new state
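One step of this recursion can be sketched with NumPy. The dimensions and the $\tanh$ nonlinearity are my own illustrative assumptions (the notes only specify the roles of $U$, $V$, $W$):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_state, d_out = 4, 8, 3                      # illustrative sizes

U = rng.normal(scale=0.1, size=(d_state, d_in))     # input -> state
W = rng.normal(scale=0.1, size=(d_state, d_state))  # past state -> new state
V = rng.normal(scale=0.1, size=(d_out, d_state))    # state -> output

def rnn_step(s_prev, x):
    s = np.tanh(U @ x + W @ s_prev)  # combine input and past state (tanh assumed)
    o = V @ s                        # project state to output space
    return s, o
```

Note that $U$, $V$, $W$ are defined once outside the step function: the same parameters are reused at every call.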
One of the problems is cycles in the computation graph. How do we handle them?
Write down each time step explicitly -> no cycles anymore
- In theory, a problem with infinite time steps back
- Can we do backpropagation with infinite time steps?
In practice, we truncate the sequence after a while. After all, does the 'very distant' past really have influence on the present?
Sequences of arbitrary lengths
We cannot, and ideally should not, have a constraint over sequence length.
The model should have no problem with
- varying
- unknown
- or even infinite
sequence lengths
One logical solution: sharing parameters through time $\theta_{t}=\theta_{t-1}=\cdots=\theta_{0}$
Formulation of Recurrent Neural Networks
Putting things together, we have a simple RNN as follows:
$$s_{t}=\tanh\left(U x_{t}+W s_{t-1}\right), \quad o_{t}=V s_{t}$$
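A minimal sketch of the full unrolled forward pass, again assuming a $\tanh$ nonlinearity and hypothetical dimensions; because $\theta=\{U, V, W\}$ is shared across steps, the same function handles sequences of any length:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_state, d_out = 4, 8, 3

# Shared parameters theta = {U, V, W}, reused at every time step
U = rng.normal(scale=0.1, size=(d_state, d_in))
W = rng.normal(scale=0.1, size=(d_state, d_state))
V = rng.normal(scale=0.1, size=(d_out, d_state))

def rnn_forward(xs):
    s = np.zeros(d_state)        # initial state s_0
    outputs = []
    for x in xs:                 # unrolled recursion: s_t = tanh(U x_t + W s_{t-1})
        s = np.tanh(U @ x + W @ s)
        outputs.append(V @ s)    # o_t = V s_t
    return outputs, s
```

The loop makes the point from the previous sections concrete: nothing in the code constrains the sequence length, and unrolling the cycle into explicit time steps removes the cycle from the computation graph.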
Architectures
Computational Graph
The same weights are shared across all time steps. In a many-to-many architecture, the loss is accumulated from the output at each time step.

Sequence Classification with RNNs
- How should we connect the RNN sentence representation to a classification layer?
- The sentence is represented by a sequence of hidden states $\left\langle\mathbf{h}_{1}, \ldots, \mathbf{h}_{T}\right\rangle$
- There is not a single hidden state representing the entire sequence
Approach 1: Use the last hidden state of the unrolled RNN
- Connect $\mathbf{h}_{T}$ with classification layer: $\mathbf{o}=W_{o} \mathbf{h}_{T}+\mathbf{b}_{o}$
- How well does $\mathbf{h}_{T}$ capture information coming from input $x_{t}$ if $t \ll T$ ?
- in particular, if the correct classification depends on the actual input word $x_{t}(t \ll T)$
- how can we increase the potential impact of $x_{t}(t \ll T) ?$
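Approach 1 can be sketched as follows (all names and dimensions are hypothetical; the unrolled recursion and the classification layer $\mathbf{o}=W_{o}\mathbf{h}_{T}+\mathbf{b}_{o}$ follow the formulas above):

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_state, n_classes = 4, 8, 2          # illustrative sizes

U = rng.normal(scale=0.1, size=(d_state, d_in))
W = rng.normal(scale=0.1, size=(d_state, d_state))
W_o = rng.normal(scale=0.1, size=(n_classes, d_state))
b_o = np.zeros(n_classes)

def classify_last(xs):
    h = np.zeros(d_state)
    for x in xs:                 # unroll the RNN over the whole sequence
        h = np.tanh(U @ x + W @ h)
    return W_o @ h + b_o         # o = W_o h_T + b_o, using only the last state
```

Only $\mathbf{h}_{T}$ reaches the classifier, which is exactly why information from early inputs ($t \ll T$) may be diluted by the time it gets there.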
Approach 2: Average all hidden states of the unrolled RNN
- Average all $\mathbf{h}_{t}$ in $\left\langle\mathbf{h}_{1}, \ldots, \mathbf{h}_{T}\right\rangle$: $\mathbf{c}=\frac{1}{T}\sum_{t=1}^{T} \mathbf{h}_{t}, \quad \mathbf{o}=W_{o} \mathbf{c}+\mathbf{b}_{o}$
- All $\mathbf{h}_{t}$ are scaled by the same value $\frac{1}{T}$
	- but represent input prefixes of different lengths: $X_{\leq t}$
	- maybe scale by $\alpha_{t}$, but how to define $\alpha_{t}$?
	- what's the added value of averaging $\mathbf{h}_{t}$ instead of embeddings? (answer: contextual disambiguation of ambiguous input words)
- Can one fixed-sized vector really represent the full meaning of an entire sentence?
	- No.
	- ... but there are methods that allow for more fine-grained, flexible, on-demand meaning representations of entire sequences
- Do we really need to capture the full meaning for sequence classification?
	- No. ... well, at least not for most cases
	- but in some NLP applications this is needed: machine translation, summarization, ...
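Approach 2 differs from Approach 1 only in the pooling step. A sketch under the same hypothetical dimensions, keeping all hidden states and averaging them into $\mathbf{c}$:

```python
import numpy as np

rng = np.random.default_rng(2)
d_in, d_state, n_classes = 4, 8, 2          # illustrative sizes

U = rng.normal(scale=0.1, size=(d_state, d_in))
W = rng.normal(scale=0.1, size=(d_state, d_state))
W_o = rng.normal(scale=0.1, size=(n_classes, d_state))
b_o = np.zeros(n_classes)

def classify_mean(xs):
    h, states = np.zeros(d_state), []
    for x in xs:
        h = np.tanh(U @ x + W @ h)
        states.append(h)
    c = np.mean(states, axis=0)  # c = (1/T) sum_t h_t, uniform weights 1/T
    return W_o @ c + b_o         # o = W_o c + b_o
```

Replacing the uniform weights $\frac{1}{T}$ in the `np.mean` with learned weights $\alpha_{t}$ is precisely the question raised above, and the step toward attention-style pooling.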
Backpropagation
RNNs are trained with Backpropagation Through Time (BPTT).
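BPTT can be made concrete on the smallest possible case: a linear scalar RNN $s_{t}=w\,s_{t-1}+u\,x_{t}$ with squared-error loss accumulated over all steps. This toy setup is my own illustration, not from the notes; the backward loop walks the unrolled graph from $t=T$ down to $1$, carrying the gradient through the state:

```python
def forward(u, w, x):
    """Unroll the scalar RNN s_t = w*s_{t-1} + u*x_t, with s_0 = 0."""
    s = [0.0]
    for x_t in x:
        s.append(w * s[-1] + u * x_t)
    return s

def loss(u, w, x, y):
    """L = 0.5 * sum_t (s_t - y_t)^2, accumulated over all time steps."""
    s = forward(u, w, x)
    return 0.5 * sum((s_t - y_t) ** 2 for s_t, y_t in zip(s[1:], y))

def bptt(u, w, x, y):
    """Backpropagation Through Time: walk the unrolled graph backwards."""
    s = forward(u, w, x)
    du = dw = ds = 0.0           # ds: gradient flowing back through the state
    for t in range(len(x), 0, -1):
        ds += s[t] - y[t - 1]    # local gradient of the loss at step t
        du += ds * x[t - 1]      # ds_t/du = x_t (direct contribution)
        dw += ds * s[t - 1]      # ds_t/dw = s_{t-1} (direct contribution)
        ds *= w                  # chain through s_t = w*s_{t-1} + u*x_t
    return du, dw
```

The `ds *= w` line is where the repeated multiplication by the recurrent weight happens; for deep unrollings this is the source of the vanishing and exploding gradient problem.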