Recurrent Neural Networks (RNN)
Characteristics of sequential data
Sequences are everywhere - videos, stock exchange data, climate measurements, market analysis, speech, music, user behaviour, computer games etc.
Sequence -> Data are no longer i.i.d.
- Each data point depends on the previous ones, and their distribution changes over time
- How to model temporal dependencies, especially in high-dim complex data?
Sequences might be of arbitrary length
- How to model dependencies at different temporal scales?
- How to make a model that works for 10 sec and 10 min sequences?
Cycles in our computational graph
- How to learn with cycles?
Unclear how to assign credit to past time steps
- Which of the infinite past time steps is responsible for this observed present?
Chaotic behavior
- How do we account for all possible effects of small changes in past sequence?
Temporal redundancy
- In temporally dense series there is lots of redundancy. Curse or blessing?
Inductive bias for sequences
A sequence of probabilities
Sequences are by definition non-i.i.d. Then how do we compute the likelihood? By decomposing with Probability Theory > Chain rule:
$$p\left(x_{1}, \ldots, x_{n}\right)=\prod_{i=1}^{n} p\left(x_{i} \mid x_{1}, \ldots, x_{i-1}\right)$$
And then model each term $p\left(x_{i} \mid x_{1}, \ldots, x_{i-1}\right)$ separately.
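The decomposition can be sketched numerically. As a toy example (my own, not from the notes), assume a first-order Markov simplification where each conditional $p(x_i \mid x_1, \ldots, x_{i-1})$ only looks at $x_{i-1}$; the table names `p_init` and `p_cond` are hypothetical:

```python
import numpy as np

# Toy first-order Markov model over a vocabulary of 3 symbols - an
# illustrative simplification of the conditionals p(x_i | x_1, ..., x_{i-1}).
p_init = np.array([0.5, 0.3, 0.2])          # p(x_1)
p_cond = np.array([[0.1, 0.6, 0.3],         # row i holds p(x_t = j | x_{t-1} = i)
                   [0.4, 0.4, 0.2],
                   [0.3, 0.3, 0.4]])

def log_likelihood(seq):
    # Chain rule: log p(x_1, ..., x_n) = log p(x_1) + sum_i log p(x_i | x_{<i})
    ll = np.log(p_init[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        ll += np.log(p_cond[prev, cur])
    return ll
```

Working in log space keeps the product of many small probabilities numerically stable.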
Memory
To have the past influence present we must keep a summary of the past: A state, a memory, a representation of the past.
Memory can take any shape, as long as it encodes the past. Otherwise it is hard to keep track of distributional shifts.
At timestep $t$, project all previous information $1, \ldots, t$ onto a state space $s_{t}$. Memory is controlled by a neural network $h$ with shared parameters $\theta$. Then, at timestep $t+1$, re-use the parameters $\theta$ and the previous state $s_{t}$ such that
$$s_{t+1}=h\left(s_{t}, x_{t+1} ; \theta\right)$$
Then apply this recursive operation, often with the Markov Chain assumption that any new state depends on the previous time step only.
In the simplest case with fully connected layers: $\theta=\{U, V, W\}$
- The matrix $U$ transforms a vector from the input space to the state space
- The matrix $V$ transforms a vector from the state space to the output space
- The matrix $W$ transforms the past state and input to the new state
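One step of this recursion can be sketched with NumPy. The dimensions and the $\tanh$ nonlinearity are my own illustrative assumptions (the notes only specify the roles of $U$, $V$, $W$):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_state, d_out = 4, 8, 3                      # illustrative sizes

U = rng.normal(scale=0.1, size=(d_state, d_in))     # input -> state
W = rng.normal(scale=0.1, size=(d_state, d_state))  # past state -> new state
V = rng.normal(scale=0.1, size=(d_out, d_state))    # state -> output

def rnn_step(s_prev, x):
    s = np.tanh(U @ x + W @ s_prev)  # combine input and past state (tanh assumed)
    o = V @ s                        # project state to output space
    return s, o
```

Note that $U$, $V$, $W$ are defined once outside the step function: the same parameters are reused at every call.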
One of the problems is cycles in the computation graph. How do we handle them?
Write down each time step explicitly -> no cycles anymore
- In theory, a problem with infinite time steps back
- Can we do backpropagation with infinite time steps?
In practice, we truncate the sequence after a while. After all, does the 'very distant' past really have influence on the present?
Sequences of arbitrary lengths
We cannot, and ideally should not, have a constraint over sequence length.
The model should have no problem with
- varying
- unknown
- or even infinite
sequence lengths
One logical solution: sharing parameters through time $\theta_{t}=\theta_{t-1}=\cdots=\theta_{0}$
Formulation of Recurrent Neural Networks
Putting things together, we have a simple RNN as follows:
$$s_{t}=\tanh\left(U x_{t}+W s_{t-1}\right), \quad o_{t}=V s_{t}$$
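A minimal sketch of the full unrolled forward pass, again assuming a $\tanh$ nonlinearity and hypothetical dimensions; because $\theta=\{U, V, W\}$ is shared across steps, the same function handles sequences of any length:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_state, d_out = 4, 8, 3

# Shared parameters theta = {U, V, W}, reused at every time step
U = rng.normal(scale=0.1, size=(d_state, d_in))
W = rng.normal(scale=0.1, size=(d_state, d_state))
V = rng.normal(scale=0.1, size=(d_out, d_state))

def rnn_forward(xs):
    s = np.zeros(d_state)        # initial state s_0
    outputs = []
    for x in xs:                 # unrolled recursion: s_t = tanh(U x_t + W s_{t-1})
        s = np.tanh(U @ x + W @ s)
        outputs.append(V @ s)    # o_t = V s_t
    return outputs, s
```

The loop makes the point from the previous sections concrete: nothing in the code constrains the sequence length, and unrolling the cycle into explicit time steps removes the cycle from the computation graph.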
Architectures
Computational Graph
The same weights are shared across all time steps. In a many-to-many architecture, the loss is accumulated from the output at each time step.

Sequence Classification with RNNs
- How should we connect the RNN sentence representation to a classification layer?
- The sentence is represented by a sequence of hidden states $\left\langle\mathbf{h}_{1}, \ldots, \mathbf{h}_{T}\right\rangle$
- There is not a single hidden state representing the entire sequence
Approach 1: Use the last hidden state of the unrolled RNN
- Connect $\mathbf{h}_{T}$ with classification layer: $\mathbf{o}=W_{o} \mathbf{h}_{T}+\mathbf{b}_{o}$
- How well does $\mathbf{h}_{T}$ capture information coming from input $x_{t}$ if $t \ll T$ ?
- in particular, if the correct classification depends on the actual input word $x_{t}(t \ll T)$
- how can we increase the potential impact of $x_{t}(t \ll T) ?$
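Approach 1 can be sketched as follows (all names and dimensions are hypothetical; the unrolled recursion and the classification layer $\mathbf{o}=W_{o}\mathbf{h}_{T}+\mathbf{b}_{o}$ follow the formulas above):

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_state, n_classes = 4, 8, 2          # illustrative sizes

U = rng.normal(scale=0.1, size=(d_state, d_in))
W = rng.normal(scale=0.1, size=(d_state, d_state))
W_o = rng.normal(scale=0.1, size=(n_classes, d_state))
b_o = np.zeros(n_classes)

def classify_last(xs):
    h = np.zeros(d_state)
    for x in xs:                 # unroll the RNN over the whole sequence
        h = np.tanh(U @ x + W @ h)
    return W_o @ h + b_o         # o = W_o h_T + b_o, using only the last state
```

Only $\mathbf{h}_{T}$ reaches the classifier, which is exactly why information from early inputs ($t \ll T$) may be diluted by the time it gets there.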
Approach 2: Average all hidden states of the unrolled RNN
- Average all $\mathbf{h}_{t}$ in $\left\langle\mathbf{h}_{1}, \ldots, \mathbf{h}_{T}\right\rangle$: $\mathbf{c}=\frac{1}{T}\sum_{t=1}^{T} \mathbf{h}_{t}, \quad \mathbf{o}=W_{o} \mathbf{c}+\mathbf{b}_{o}$
- All $\mathbf{h}_{t}$ are scaled by the same value $\frac{1}{T}$
	- but represent input prefixes of different lengths: $X_{\leq t}$
	- maybe scale by $\alpha_{t}$, but how to define $\alpha_{t}$?
	- what's the added value of averaging $\mathbf{h}_{t}$ instead of embeddings? (answer: contextual disambiguation of ambiguous input words)
- Can one fixed-sized vector really represent the full meaning of an entire sentence?
	- No.
	- ... but there are methods that allow for more fine-grained, flexible, on-demand meaning representations of entire sequences
- Do we really need to capture the full meaning for sequence classification?
	- No. ... well, at least not for most cases
	- but in some NLP applications this is needed: machine translation, summarization, ...
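Approach 2 differs from Approach 1 only in the pooling step. A sketch under the same hypothetical dimensions, keeping all hidden states and averaging them into $\mathbf{c}$:

```python
import numpy as np

rng = np.random.default_rng(2)
d_in, d_state, n_classes = 4, 8, 2          # illustrative sizes

U = rng.normal(scale=0.1, size=(d_state, d_in))
W = rng.normal(scale=0.1, size=(d_state, d_state))
W_o = rng.normal(scale=0.1, size=(n_classes, d_state))
b_o = np.zeros(n_classes)

def classify_mean(xs):
    h, states = np.zeros(d_state), []
    for x in xs:
        h = np.tanh(U @ x + W @ h)
        states.append(h)
    c = np.mean(states, axis=0)  # c = (1/T) sum_t h_t, uniform weights 1/T
    return W_o @ c + b_o         # o = W_o c + b_o
```

Replacing the uniform weights $\frac{1}{T}$ in the `np.mean` with learned weights $\alpha_{t}$ is precisely the question raised above, and the step toward attention-style pooling.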
Backpropagation
RNNs are trained with Backpropagation Through Time (BPTT).
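BPTT can be made concrete on the smallest possible case: a linear scalar RNN $s_{t}=w\,s_{t-1}+u\,x_{t}$ with squared-error loss accumulated over all steps. This toy setup is my own illustration, not from the notes; the backward loop walks the unrolled graph from $t=T$ down to $1$, carrying the gradient through the state:

```python
def forward(u, w, x):
    """Unroll the scalar RNN s_t = w*s_{t-1} + u*x_t, with s_0 = 0."""
    s = [0.0]
    for x_t in x:
        s.append(w * s[-1] + u * x_t)
    return s

def loss(u, w, x, y):
    """L = 0.5 * sum_t (s_t - y_t)^2, accumulated over all time steps."""
    s = forward(u, w, x)
    return 0.5 * sum((s_t - y_t) ** 2 for s_t, y_t in zip(s[1:], y))

def bptt(u, w, x, y):
    """Backpropagation Through Time: walk the unrolled graph backwards."""
    s = forward(u, w, x)
    du = dw = ds = 0.0           # ds: gradient flowing back through the state
    for t in range(len(x), 0, -1):
        ds += s[t] - y[t - 1]    # local gradient of the loss at step t
        du += ds * x[t - 1]      # ds_t/du = x_t (direct contribution)
        dw += ds * s[t - 1]      # ds_t/dw = s_{t-1} (direct contribution)
        ds *= w                  # chain through s_t = w*s_{t-1} + u*x_t
    return du, dw
```

The `ds *= w` line is where the repeated multiplication by the recurrent weight happens; for deep unrollings this is the source of the vanishing and exploding gradient problem.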