PixelRNN
Decompose the data likelihood of an $n \times n$ image $p(x)=\prod_{i=1}^{n^{2}} p\left(x_{i} \mid x_{<i}\right)$
Each pixel conditional corresponds to a triplet of colors -> further decompose per color: $p(x_i \mid x_{<i}) = p(x_{i,R} \mid x_{<i})\, p(x_{i,G} \mid x_{<i}, x_{i,R})\, p(x_{i,B} \mid x_{<i}, x_{i,R}, x_{i,G})$
Model the conditionals $p\left(x_{i, R} \mid x_{<i}\right), \ldots$ with a 12-layer convolutional RNN. The MLP from NADE cannot easily scale to images, and its statistics are not shared across pixel positions.
Model each output as a categorical distribution over the 256 pixel values, i.e. a 256-way softmax.
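The factorization above turns density estimation into per-pixel classification: the log-likelihood is the sum of per-pixel log-softmax terms. A minimal numpy sketch (hypothetical `log_likelihood` helper; `logits` stands in for the network output, with causality enforced by the model, not here):

```python
import numpy as np

def log_likelihood(logits, x):
    """logits: (n*n, 256) unnormalized scores; x: (n*n,) ints in [0, 255]."""
    # numerically stable log-softmax per pixel
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    # pick the log-probability of the observed value at each position, sum over pixels
    return log_probs[np.arange(len(x)), x].sum()

rng = np.random.default_rng(0)
logits = rng.normal(size=(16, 256))        # 4x4 toy image, flattened in raster order
x = rng.integers(0, 256, size=16)
ll = log_likelihood(logits, x)             # log p(x) = sum_i log p(x_i | x_<i)
```

With uniform logits every pixel contributes $-\log 256$, a useful sanity check.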
PixelRNN uses two variants of the LSTM. Why not a regular LSTM? It would require sequential, pixel-by-pixel computation, which means less parallelization and slower training.
With the Row LSTM and the Diagonal BiLSTM, we process one row at a time, so parallelization is possible.
Row LSTM
Row LSTM with 'causal' triangular receptive field.
- Per new pixel (row $i$), use a 1-d conv (size 3) to aggregate the pixels above (row $i-1$)
- The effective receptive field spans a triangle
- Convolution only over 'past' pixels (row $i-1$), not 'future' pixels -> causal
- Loses some context because of the triangular receptive field
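The triangle, and the context it misses, can be made concrete by propagating a boolean dependency mask row by row instead of real LSTM states (a sketch with a hypothetical `receptive_field` helper):

```python
import numpy as np

def receptive_field(n, row, col):
    """Mask of pixels in rows above (row, col) that can influence it."""
    sees = np.zeros((n, n), dtype=bool)
    frontier = np.zeros(n, dtype=bool)
    frontier[col] = True                      # columns feeding the target state
    for r in range(row - 1, -1, -1):          # walk upward one row at a time
        widened = np.zeros(n, dtype=bool)     # size-3 conv: 3 parents per column
        for c in range(n):
            widened[c] = any(frontier[c + dc] for dc in (-1, 0, 1)
                             if 0 <= c + dc < n)
        frontier = widened
        sees[r] = frontier
    return sees

sees = receptive_field(7, 4, 3)
# Row 3 sees 3 columns, row 2 sees 5, ... : a triangle. Pixel (3, 0) comes
# earlier in raster order but lies outside the triangle -> lost context.
```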
Diagonal BiLSTM
Proposed to address the lost context. Key idea: have two LSTMs moving along opposite diagonals.
First diagonal: the recurrence at $(i, j)$ aggregates $(i-1, j)$ and $(i, j-1)$
Combining the two LSTMs, the recursion captures the entirety of the past context.
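Unrolling the first diagonal's recursion shows it covers the whole rectangle of pixels above and to the left of $(i, j)$; the mirrored LSTM on the opposite diagonal contributes the remaining past pixels above-right. A sketch (hypothetical `context` helper):

```python
def context(i, j):
    """All pixels whose values can reach state (i, j) through the recursion."""
    seen, stack = set(), [(i, j)]
    while stack:
        r, c = stack.pop()
        if (r, c) in seen or r < 0 or c < 0:
            continue
        seen.add((r, c))
        stack += [(r - 1, c), (r, c - 1)]     # the two recursive dependencies
    seen.discard((i, j))                      # the pixel itself is not context
    return seen

# context(2, 3) is exactly the rectangle {r <= 2, c <= 3} minus (2, 3) itself
```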
Architecture
- Use 12 layers of LSTMs
- Add residual connections to speed up learning
- Good modelling of $p(x)$ and nice image generation, but slow training and generation because of the LSTMs.
Generations
No collapse to a single mode; lots of variation for the same occluded image.
PixelCNN
Replace the LSTMs with a fully convolutional network of 15 layers; no pooling layers, to preserve spatial resolution.
Use masks to zero out future pixels in the convolutions; otherwise 'access to the future' would break autoregressiveness.
Faster training since no recurrent steps are required -> better parallelization. But pixel generation is still sequential and thus slow.
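The masking described above can be sketched for a single-channel $k \times k$ kernel (hypothetical `causal_mask` helper): mask 'A' (first layer) zeroes the centre weight and everything after it in raster order, mask 'B' (later layers) keeps the centre, so a convolution multiplied by the mask can only read past pixels.

```python
import numpy as np

def causal_mask(k, mask_type):
    m = np.ones((k, k))
    m[k // 2 + 1:, :] = 0.0                          # all rows below the centre
    m[k // 2, k // 2 + (mask_type == 'B'):] = 0.0    # centre row: at/after the centre
    return m

# For k = 3:
#   mask A: [[1, 1, 1],      mask B: [[1, 1, 1],
#            [1, 0, 0],               [1, 1, 0],
#            [0, 0, 0]]               [0, 0, 0]]
```

Mask 'A' is needed only at the input layer, where the centre weight would read the very pixel being predicted; deeper layers see already-masked features, so 'B' may keep it.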
Advantages and disadvantages
- Faster training
- Performance is worse than PixelRNN's, as context is discarded
- The cascaded convolutions create a 'blind spot'; use Gated PixelCNN to fix it
- No latent space
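The blind spot can be demonstrated by propagating a dependency mask through stacked 3x3 masked convolutions (a sketch, hypothetical `reachable` helper): a 'B'-masked 3x3 kernel reads the full row above plus the left neighbour, so per layer a dependency may move up(-left/-right), left, or stay; no number of layers reaches past pixels further up-and-right than one column per row.

```python
import numpy as np

def reachable(n, layers, target):
    """Which input pixels can influence `target` through `layers` masked convs."""
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 0)]  # 3x3 'B' mask
    dep = np.zeros((n, n), dtype=bool)
    dep[target] = True
    for _ in range(layers):
        new = dep.copy()
        for r, c in zip(*np.where(dep)):
            for dr, dc in offsets:
                if 0 <= r + dr < n and 0 <= c + dc < n:
                    new[r + dr, c + dc] = True
        dep = new
    return dep

dep = reachable(7, 10, (6, 3))
# (5, 5) precedes (6, 3) in raster order, yet is never reached: a blind spot.
```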