PixelRNN

Decompose the data likelihood of an $n \times n$ image $p(x)=\prod_{i=1}^{n^{2}} p\left(x_{i} \mid x_{<i}\right)$

Each pixel conditional corresponds to a triplet of colors -> further decompose per color (same chain rule as above)

$$ p\left(x_{i} \mid x_{<i}\right)=p\left(x_{i, R} \mid x_{<i}\right) \cdot p\left(x_{i, G} \mid x_{<i}, x_{i, R}\right) \cdot p\left(x_{i, B} \mid x_{<i}, x_{i, R}, x_{i, G}\right) $$

Model the conditionals $p\left(x_{i, R} \mid x_{<i}\right), \ldots$ with a 12-layer convolutional RNN. The MLP from NADE cannot easily scale to images, and its statistics are not shared across pixel positions.

Model the output as a categorical distribution with a 256-way softmax.
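A minimal sketch of this output head: the network emits 256 unnormalized scores (logits) per sub-pixel, a softmax turns them into a categorical distribution over intensities 0..255, and the pixel value is sampled from it. The logits here are random stand-ins for real network output:

```python
import numpy as np

rng = np.random.default_rng(0)

logits = rng.normal(size=256)            # hypothetical network output
probs = np.exp(logits - logits.max())    # numerically stable softmax
probs /= probs.sum()

value = rng.choice(256, p=probs)         # sampled intensity in [0, 255]
print(probs.sum(), 0 <= value < 256)
```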

PixelRNN uses two variants of LSTM. Why not a regular LSTM? It would require sequential, pixel-wise computations, which means less parallelization and slower training.

With Row LSTM we process one row at a time (and with Diagonal BiLSTM one diagonal at a time), so parallelization within each step is possible.

pixelrnn-lstms 1

Row LSTM

Row LSTM has a 'causal' triangular receptive field.

  • Per new pixel (row $i$), use a 1-d convolution (size 3) to aggregate the pixels above (row $i-1$)
  • The effective receptive field spans a triangle
  • The convolution covers only 'past' pixels (row $i-1$), not 'future' pixels -> causal
  • Loses some context because of the triangular shape of the receptive field
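The triangle can be made concrete: with a size-3 convolution (1 column on each side of the centre) per row-to-row step, the span of columns a pixel can see widens by one on each side for every row further up. A small sketch under that assumption:

```python
# For a pixel at (row, col), compute the column span visible at each
# row above it, assuming a size-k 1-d convolution per row step.
def row_lstm_receptive_field(row, col, k=3):
    half = k // 2
    field = {}
    for r in range(row, -1, -1):
        d = row - r                                  # rows above the pixel
        field[r] = (col - d * half, col + d * half)  # visible column span
    return field

# Pixel at (3, 5): its own row contributes only column 5, while row 0
# contributes columns 2..8 -- the triangle widens going up, so columns
# outside it (e.g. column 0 of row 0) are lost context.
rf = row_lstm_receptive_field(3, 5)
print(rf[0])  # (2, 8)
```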

Diagonal BiLSTM

Proposed to address the lost context. Key idea: run two LSTMs moving along opposite diagonals.
First diagonal: the convolutional 'past' of pixel $(i, j)$ is $(i-1, j), (i, j-1)$

By combining the two LSTMs, the recursion captures the entirety of the past context.
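A standard implementation trick from the PixelRNN paper makes the diagonal sweep efficient: 'skew' the input map by offsetting each row $r$ by $r$ positions, so the diagonals of the original image line up as columns, and each diagonal can then be processed in parallel as one column. A NumPy sketch:

```python
import numpy as np

def skew(x):
    """Skew an n x n map into n x (2n - 1) so diagonals become columns."""
    n = x.shape[0]
    out = np.zeros((n, 2 * n - 1), dtype=x.dtype)
    for r in range(n):
        out[r, r:r + n] = x[r]  # shift row r to the right by r positions
    return out

x = np.arange(9).reshape(3, 3)
s = skew(x)
# Column 2 of the skewed map collects the cells with row + col == 2,
# i.e. one whole diagonal of the original image: values 2, 4, 6.
print(s[:, 2])  # [2 4 6]
```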

Architecture

  • Use 12 layers of LSTMs
  • Add residual connections to speed up learning
  • Good modelling of $p(x)$ and nice image generation, but slow training and generation because of the LSTMs
pixelrnn-layers

Generations

No collapse to single mode, lots of variation for same occluded images.

pixelrnn-generation

PixelCNN

Replace the LSTMs with a fully convolutional network of 15 layers; no pooling layers, to preserve spatial resolution

Use masks to hide future pixels in the convolutions; otherwise 'access to the future' would break the autoregressive property.
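A sketch of how such a causal mask for a $k \times k$ kernel can be built (per-channel R/G/B masking omitted for brevity). In the PixelCNN paper's terminology, mask type 'A' (first layer) also zeroes the centre so a pixel never sees its own value, while type 'B' (later layers) keeps the centre, which by then holds features of already-seen context only:

```python
import numpy as np

def causal_mask(k, mask_type='B'):
    """Binary mask zeroing kernel positions at/after the centre pixel."""
    mask = np.ones((k, k))
    mask[k // 2, k // 2 + (mask_type == 'B'):] = 0  # centre row, from centre on
    mask[k // 2 + 1:, :] = 0                        # all rows below the centre
    return mask

print(causal_mask(3, 'A'))
# [[1. 1. 1.]
#  [1. 0. 0.]
#  [0. 0. 0.]]
```

Multiplying each convolution kernel elementwise by this mask before applying it is what keeps the stacked convolutions autoregressive.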

Faster training, as no recurrent steps are required -> better parallelization. But pixel generation is still sequential and thus slow
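The generation bottleneck can be seen directly: each sampled pixel feeds back into the input, so the $n^2$ forward passes cannot be batched. A toy sketch with a hypothetical single-channel `model` that returns 256 logits per call:

```python
import numpy as np

rng = np.random.default_rng(0)

def generate(model, n):
    img = np.zeros((n, n), dtype=np.uint8)   # single channel for brevity
    for i in range(n):
        for j in range(n):
            logits = model(img)              # one full forward pass per pixel
            p = np.exp(logits - logits.max())
            p /= p.sum()
            img[i, j] = rng.choice(256, p=p) # sampled value feeds the next step
    return img

dummy = lambda img: np.zeros(256)            # uniform toy "model" (assumption)
out = generate(dummy, 4)
print(out.shape)  # (4, 4)
```

During training, by contrast, all ground-truth pixels are available at once, so every conditional can be evaluated in a single masked forward pass.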

Advantages and disadvantages

  • Faster training
  • Performance is worse than PixelRNN, as context is discarded
  • The cascaded convolutions create a 'blind spot'; Gated PixelCNN fixes this
  • No latent space