Recurrent Neural Networks (RNN)

Characteristics of sequential data

Sequences are everywhere: videos, stock exchanges, climate measurements, market analysis, speech, music, user behaviour, computer games, etc.

Sequence -> data are no longer i.i.d.

  • Each data point depends on the previous ones, and their distribution changes over time
  • How to model temporal dependencies, especially in high-dim complex data?

Sequences might be of arbitrary length

  • How to model dependencies at different temporal scales?
  • How to make a model that works for 10 sec and 10 min sequences?

Cycles in our computational graph

  • How to learn with cycles?

Unclear how to assign credit to past time steps

  • Which of the infinitely many past time steps is responsible for the observed present?

Chaotic behavior

  • How do we account for all possible effects of small changes in the past sequence?

Temporal redundancy

  • In temporally dense series there is a lot of redundancy. Curse or blessing?

Inductive bias for sequences

| Challenge | Inductive bias |
| --- | --- |
| Non-i.i.d. data | State models, chain rule of probabilities |
| Arbitrary lengths | Sharing weights |
| Credit assignment problem | Backpropagation through time |
| Chaotic behavior | LSTM, GRU |
| Temporal redundancy | Spiking neural nets, slow feature learning |

A sequence of probabilities

Sequences are by definition non-i.i.d. How, then, do we compute the likelihood? By decomposing it with the chain rule (Probability Theory > Chain rule):

$$ p(x)=\prod_{i} p\left(x_{i} \mid x_{1}, \ldots, x_{i-1}\right) $$

And then model each term $p\left(x_{i} \mid x_{1}, \ldots, x_{i-1}\right)$ separately.
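
A minimal sketch of this decomposition in code, using a made-up first-order Markov model so that each conditional reduces to $p(x_i \mid x_{i-1})$; the probability tables are purely illustrative:

```python
import numpy as np

# Toy autoregressive model over a 3-token vocabulary: a first-order Markov
# model, so p(x_i | x_1, ..., x_{i-1}) reduces to p(x_i | x_{i-1}).
prior = np.array([0.5, 0.3, 0.2])    # p(x_1)
trans = np.array([[0.7, 0.2, 0.1],   # p(x_i | x_{i-1}), rows = previous token
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4]])

def sequence_log_likelihood(x):
    """log p(x) = sum_i log p(x_i | x_1, ..., x_{i-1})."""
    logp = np.log(prior[x[0]])
    for prev, cur in zip(x[:-1], x[1:]):
        logp += np.log(trans[prev, cur])
    return logp

print(sequence_log_likelihood([0, 1, 1, 2]))
```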

Memory

To have the past influence the present, we must keep a summary of the past: a state, a memory, a representation of the past.

Memory can take many shapes, as long as it encodes the past. Otherwise it is hard to keep track of distributional shifts.

At timestep $t$, project all the information from steps $1, \ldots, t$ onto a state $s_{t}$. Memory is controlled by a neural network $h$ with shared parameters $\theta$. Then, at timestep $t+1$, re-use the parameters $\theta$ and the previous state $s_{t}$ such that

$$ s_{t+1}=h_{t+1}\left(x_{t+1}, s_{t}\right) $$

Then apply the operation recursively, often with the Markov assumption that any new state depends only on the previous time step:

$$ \begin{array}{l} s_{t}=h_{t}\left(x_{t}, s_{t-1}\right) \\ s_{t-1}=h_{t-1}\left(x_{t-1}, s_{t-2}\right) \end{array} $$
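
A minimal sketch of this recursion, where the transition function `h` and the initial state `s0` are placeholders, not from the source:

```python
def unroll(h, xs, s0):
    """Run s_t = h(x_t, s_{t-1}) over a sequence, collecting every state."""
    states, s = [], s0
    for x in xs:
        s = h(x, s)          # Markov: new state depends only on x_t and s_{t-1}
        states.append(s)
    return states

# Toy usage: a scalar "memory" that decays and accumulates inputs.
print(unroll(lambda x, s: 0.9 * s + x, [1.0, 0.0, 2.0], 0.0))
```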

In the simplest case with fully connected layers: $\theta=\{U, V, W\}$

  • The matrix $U$ transforms a vector from the input space to the state space
  • The matrix $V$ transforms a vector from the state space to the output space
  • The matrix $W$ transforms the past state and input to the new state
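
As a shape check, a minimal sketch of the three maps; all dimensions are made up:

```python
import numpy as np

d_in, d_state, d_out = 5, 8, 3      # illustrative dimensions
U = np.zeros((d_state, d_in))       # input space -> state space
V = np.zeros((d_out, d_state))      # state space -> output space
W = np.zeros((d_state, d_state))    # past state -> new state (combined with U x_t)
```
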
[Figure: memory as cycles in the computational graph]

One of the problems is cycles in the computation graph. How do we handle them?
Write down each time step explicitly (unroll the graph) -> no cycles anymore

  • In theory, the unrolled graph reaches infinitely many time steps back
  • Can we do backpropagation with infinite time steps?

In practice, we cut the sequence short after a while. After all, does the 'very distant' past really influence the present?

Sequences of arbitrary lengths

We cannot, and ideally should not, impose a constraint on sequence length.
The model should have no problem with varying, unknown, or even infinite sequence lengths.

One logical solution: sharing parameters through time $\theta_{t}=\theta_{t-1}=\cdots=\theta_{0}$

Formulation of Recurrent Neural Networks

Putting things together, we have a simple RNN as follows:

$$ \begin{array}{l} s_{t}=\tanh \left(U x_{t}+W s_{t-1}\right) \\ y_{t}=\operatorname{softmax}\left(V s_{t}\right) \\ \mathcal{L}=\sum_{t} \mathcal{L}_{t}\left(y_{t}, l_{t}\right) \end{array} $$
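
A minimal NumPy sketch of this forward pass with loss accumulation; the dimensions, initialization scale, and toy data are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_state, d_out = 5, 8, 3
U = 0.1 * rng.normal(size=(d_state, d_in))     # input -> state
W = 0.1 * rng.normal(size=(d_state, d_state))  # past state -> state
V = 0.1 * rng.normal(size=(d_out, d_state))    # state -> output

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(xs, labels):
    """Accumulate cross-entropy loss L = sum_t L_t(y_t, l_t) over the sequence."""
    s, loss = np.zeros(d_state), 0.0
    for x, l in zip(xs, labels):
        s = np.tanh(U @ x + W @ s)   # s_t = tanh(U x_t + W s_{t-1})
        y = softmax(V @ s)           # y_t = softmax(V s_t)
        loss += -np.log(y[l])        # cross-entropy against label l_t
    return loss

xs = [rng.normal(size=d_in) for _ in range(4)]
print(forward(xs, labels=[0, 2, 1, 1]))
```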

Architectures

[Figure: RNN architectures]

Computational Graph

The same weights are shared across all time steps. In a many-to-many architecture, the loss is accumulated from the output at each time step.
[Figure: RNN computational graph with shared weights]

Sequence Classification with RNNs

  • How should we connect the RNN sentence representation to a classification layer?
  • The sentence is represented by a sequence of hidden states $\left\langle\mathbf{h}_{1}, \ldots, \mathbf{h}_{T}\right\rangle$
  • There is not a single hidden state representing the entire sequence

Approach 1: Use the last hidden state of the unrolled RNN (both approaches are sketched in code below)

  • Connect $\mathbf{h}_{T}$ with classification layer: $\mathbf{o}=W_{o} \mathbf{h}_{T}+\mathbf{b}_{o}$
  • How well does $\mathbf{h}_{T}$ capture information coming from input $x_{t}$ if $t \ll T$?
    • in particular, if the correct classification depends on the actual input word $x_{t}$ $(t \ll T)$
    • how can we increase the potential impact of $x_{t}$ $(t \ll T)$?

Approach 2: Average all hidden states of the unrolled RNN

  • Average all $\mathbf{h}_{t}$ in $\left\langle\mathbf{h}_{1}, \ldots, \mathbf{h}_{T}\right\rangle$: $\mathbf{c}=\frac{1}{T} \sum_{t=1}^{T} \mathbf{h}_{t}, \quad \mathbf{o}=W_{o} \mathbf{c}+\mathbf{b}_{o}$

  • All $\mathbf{h}_{t}$ are scaled by the same value $\frac{1}{T}$

    • but represent input prefixes of different lengths: $X_{\leq t}$
    • maybe scale by $\alpha_{t}$, but how to define $\alpha_{t}$?
    • what's the added value of averaging $\mathbf{h}_{t}$ instead of embeddings? (answer: contextual disambiguation of ambiguous input words)
  • Can one fixed-sized vector really represent the full meaning of an entire sentence?

    • No.
    • ... but there are methods that allow for more fine-grained, flexible, on-demand meaning representations of entire sequences
  • Do we really need to capture the full meaning for sequence classification?

    • No. ... well, at least not for most cases
    • but in some NLP applications this is needed: machine translation, summarization, ...
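
A minimal PyTorch-style sketch contrasting the two approaches; the class and all dimensions are illustrative, not from the source:

```python
import torch
import torch.nn as nn

class RNNClassifier(nn.Module):
    """Classify a sequence from the last hidden state (Approach 1)
    or the mean of all hidden states (Approach 2)."""
    def __init__(self, d_in, d_hidden, n_classes, pooling="last"):
        super().__init__()
        self.rnn = nn.RNN(d_in, d_hidden, batch_first=True)
        self.out = nn.Linear(d_hidden, n_classes)   # o = W_o c + b_o
        self.pooling = pooling

    def forward(self, x):            # x: (batch, T, d_in)
        h, _ = self.rnn(x)           # h: (batch, T, d_hidden)
        c = h[:, -1] if self.pooling == "last" else h.mean(dim=1)
        return self.out(c)

logits = RNNClassifier(5, 8, 3, pooling="mean")(torch.randn(2, 10, 5))
```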

Backpropagation

RNNs are trained with Backpropagation Through Time (BPTT): unroll the network over the sequence, run standard backpropagation on the unrolled graph, and sum each shared weight's gradients over all the time steps at which it is used.
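
In practice the unrolling is truncated: the sequence is processed in windows of $k$ steps and the state is detached between windows, so gradients stop flowing further into the past. A hedged PyTorch-style sketch with made-up dimensions and data:

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=5, hidden_size=8, batch_first=True)
head = nn.Linear(8, 3)
opt = torch.optim.SGD(list(rnn.parameters()) + list(head.parameters()), lr=0.1)

x = torch.randn(1, 100, 5)               # one long toy sequence
targets = torch.randint(0, 3, (1, 100))
k, state = 20, None                      # truncation window of k steps

for start in range(0, 100, k):
    h, state = rnn(x[:, start:start + k], state)
    loss = nn.functional.cross_entropy(
        head(h).flatten(0, 1), targets[:, start:start + k].flatten())
    opt.zero_grad()
    loss.backward()
    opt.step()
    state = state.detach()               # cut the graph: truncated BPTT
```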

Variants

  • LSTM
  • GRU