PixelRNN

Decompose the data likelihood of an $n \times n$ image $p(x)=\prod_{i=1}^{n^{2}} p\left(x_{i} \mid x_{<i}\right)$

Each pixel conditional corresponds to a triplet of colors -> further decompose per color (same chain rule as above)

$$ p\left(x_{i} \mid x_{<i}\right)=p\left(x_{i, R} \mid x_{<i}\right) \cdot p\left(x_{i, G} \mid x_{<i}, x_{i, R}\right) \cdot p\left(x_{i, B} \mid x_{<i}, x_{i, R}, x_{i, G}\right) $$

Model the conditionals $p\left(x_{i, R} \mid x_{<i}\right), \ldots$ with a 12-layer convolutional RNN. The MLP from NADE cannot easily scale to images, and its statistics are not shared across pixel positions.

Model the output as a categorical distribution with a 256-way softmax.
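A minimal sketch of this output head: the network emits 256 unnormalized scores (logits) per sub-pixel, a softmax turns them into a categorical distribution over intensities 0..255, and the pixel value is sampled from it. The logits here are random stand-ins for real network output:

```python
import numpy as np

rng = np.random.default_rng(0)

logits = rng.normal(size=256)            # hypothetical network output
probs = np.exp(logits - logits.max())    # numerically stable softmax
probs /= probs.sum()

value = rng.choice(256, p=probs)         # sampled intensity in [0, 255]
print(probs.sum(), 0 <= value < 256)
```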

PixelRNN uses two variants of LSTM. Why not a regular LSTM? It would require sequential, pixel-wise computations, which means less parallelization and slower training.

With Row LSTM we process one row at a time (and with Diagonal BiLSTM one diagonal at a time), so parallelization within each step is possible.

pixelrnn-lstms 1

Row LSTM

Row LSTM has a 'causal' triangular receptive field.

  • Per new pixel (row $i$), use a 1-d convolution (size 3) to aggregate the pixels above (row $i-1$)
  • The effective receptive field spans a triangle
  • The convolution covers only 'past' pixels (row $i-1$), not 'future' pixels -> causal
  • Loses some context because of the triangular shape of the receptive field
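The triangle can be made concrete: with a size-3 convolution (1 column on each side of the centre) per row-to-row step, the span of columns a pixel can see widens by one on each side for every row further up. A small sketch under that assumption:

```python
# For a pixel at (row, col), compute the column span visible at each
# row above it, assuming a size-k 1-d convolution per row step.
def row_lstm_receptive_field(row, col, k=3):
    half = k // 2
    field = {}
    for r in range(row, -1, -1):
        d = row - r                                  # rows above the pixel
        field[r] = (col - d * half, col + d * half)  # visible column span
    return field

# Pixel at (3, 5): its own row contributes only column 5, while row 0
# contributes columns 2..8 -- the triangle widens going up, so columns
# outside it (e.g. column 0 of row 0) are lost context.
rf = row_lstm_receptive_field(3, 5)
print(rf[0])  # (2, 8)
```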

Diagonal BiLSTM

Proposed to address the lost context. Key idea: run two LSTMs moving along opposite diagonals.
First diagonal: the convolutional 'past' of pixel $(i, j)$ is $(i-1, j), (i, j-1)$

By combining the two LSTMs, the recursion captures the entirety of the past context.
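A standard implementation trick from the PixelRNN paper makes the diagonal sweep efficient: 'skew' the input map by offsetting each row $r$ by $r$ positions, so the diagonals of the original image line up as columns, and each diagonal can then be processed in parallel as one column. A NumPy sketch:

```python
import numpy as np

def skew(x):
    """Skew an n x n map into n x (2n - 1) so diagonals become columns."""
    n = x.shape[0]
    out = np.zeros((n, 2 * n - 1), dtype=x.dtype)
    for r in range(n):
        out[r, r:r + n] = x[r]  # shift row r to the right by r positions
    return out

x = np.arange(9).reshape(3, 3)
s = skew(x)
# Column 2 of the skewed map collects the cells with row + col == 2,
# i.e. one whole diagonal of the original image: values 2, 4, 6.
print(s[:, 2])  # [2 4 6]
```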

Architecture

  • Use 12 layers of LSTMs
  • Add residual connections to speed up learning
  • Good modelling of $p(x)$ and nice image generation, but slow training and generation because of the LSTMs
pixelrnn-layers

Generations

No collapse to single mode, lots of variation for same occluded images.

pixelrnn-generation

PixelCNN

Replace the LSTMs with a fully convolutional network of 15 layers; no pooling layers, to preserve spatial resolution

Use masks to hide future pixels in the convolutions; otherwise 'access to the future' would break the autoregressive property.
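A sketch of how such a causal mask for a $k \times k$ kernel can be built (per-channel R/G/B masking omitted for brevity). In the PixelCNN paper's terminology, mask type 'A' (first layer) also zeroes the centre so a pixel never sees its own value, while type 'B' (later layers) keeps the centre, which by then holds features of already-seen context only:

```python
import numpy as np

def causal_mask(k, mask_type='B'):
    """Binary mask zeroing kernel positions at/after the centre pixel."""
    mask = np.ones((k, k))
    mask[k // 2, k // 2 + (mask_type == 'B'):] = 0  # centre row, from centre on
    mask[k // 2 + 1:, :] = 0                        # all rows below the centre
    return mask

print(causal_mask(3, 'A'))
# [[1. 1. 1.]
#  [1. 0. 0.]
#  [0. 0. 0.]]
```

Multiplying each convolution kernel elementwise by this mask before applying it is what keeps the stacked convolutions autoregressive.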

Faster training, as no recurrent steps are required -> better parallelization. But pixel generation is still sequential and thus slow
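The generation bottleneck can be seen directly: each sampled pixel feeds back into the input, so the $n^2$ forward passes cannot be batched. A toy sketch with a hypothetical single-channel `model` that returns 256 logits per call:

```python
import numpy as np

rng = np.random.default_rng(0)

def generate(model, n):
    img = np.zeros((n, n), dtype=np.uint8)   # single channel for brevity
    for i in range(n):
        for j in range(n):
            logits = model(img)              # one full forward pass per pixel
            p = np.exp(logits - logits.max())
            p /= p.sum()
            img[i, j] = rng.choice(256, p=p) # sampled value feeds the next step
    return img

dummy = lambda img: np.zeros(256)            # uniform toy "model" (assumption)
out = generate(dummy, 4)
print(out.shape)  # (4, 4)
```

During training, by contrast, all ground-truth pixels are available at once, so every conditional can be evaluated in a single masked forward pass.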

Advantages and disadvantages

  • Faster training
  • Performance is worse than PixelRNN, as context is discarded
  • The cascaded convolutions create a 'blind spot'; Gated PixelCNN fixes this
  • No latent space