Contrastive Divergence

To motivate contrastive divergence, we revisit Maximum Likelihood Estimation (Note: KL Divergence > Relationship with MLE and Cross Entropy). Minimizing the KL divergence between the data distribution $p_{0}$ and the model distribution $p_{\infty}$ (the equilibrium distribution of the Markov chain) is equivalent to MLE:

$$ \mathrm{KL}\left(p_{0} \| p_{\infty}\right)=\int p_{0} \log p_{0}-\int p_{0} \log p_{\infty} \propto-\int p_{0} \log p_{\infty} $$

Contrastive divergence minimizes

$$ \mathrm{CD}_{n}=\mathrm{KL}\left(p_{0} \| p_{\infty}\right)-\mathrm{KL}\left(p_{n} \| p_{\infty}\right) $$

Weights are updated using the $\mathrm{CD}_{n}$ gradient instead of the ML gradient:

$$ -\frac{\partial}{\partial \boldsymbol{\theta}} \mathrm{CD}_{n}=-\mathbb{E}_{0}\left[\frac{\partial E_{\boldsymbol{\theta}}(\boldsymbol{x})}{\partial \boldsymbol{\theta}}\right]+\mathbb{E}_{n}\left[\frac{\partial E_{\boldsymbol{\theta}}\left(\boldsymbol{x}^{\prime}\right)}{\partial \boldsymbol{\theta}}\right]+\frac{\partial}{\partial \boldsymbol{\theta}}[\ldots] $$

where $\mathbb{E}_{n}$ denotes the expectation under $p_{n}$, i.e., over samples drawn after $n$ steps of the Markov chain started at the data. The last term is small and can be ignored.
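As a concrete illustration, the two expectations can be estimated for a toy scalar energy model (our choice, not from the text): $E_{\theta}(x)=(x-\theta)^{2}/2$, whose equilibrium distribution $p_{\infty}$ is $\mathcal{N}(\theta, 1)$. The chain is run with simple Metropolis steps, started at the data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy energy model: E_theta(x) = (x - theta)^2 / 2,
# so the equilibrium distribution p_inf is N(theta, 1).
def dE_dtheta(x, theta):
    return -(x - theta)  # derivative of (x - theta)^2 / 2 w.r.t. theta

def metropolis_step(x, theta, step=0.5):
    """One Markov-chain step targeting p_inf proportional to exp(-E_theta)."""
    prop = x + rng.normal(0.0, step, size=x.shape)
    dE = 0.5 * (prop - theta) ** 2 - 0.5 * (x - theta) ** 2
    accept = rng.random(x.shape) < np.exp(-dE)
    return np.where(accept, prop, x)

def cd_update(data, theta, n=1):
    """-dCD_n/dtheta ~= -E_0[dE/dtheta] + E_n[dE/dtheta] (last term dropped)."""
    x = data.copy()                 # the chain starts at the data (p_0)
    for _ in range(n):
        x = metropolis_step(x, theta)
    return -dE_dtheta(data, theta).mean() + dE_dtheta(x, theta).mean()

data = rng.normal(2.0, 1.0, size=5000)  # samples whose true mean is 2
theta = 0.0
for _ in range(200):
    theta += 0.5 * cd_update(data, theta, n=1)  # ascend -CD_n
```

Note that the negative phase $\mathbb{E}_{n}$ is cheap: it only requires $n$ chain steps per update rather than running the chain to equilibrium as ML would.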

Intuition

Ensure that after $n$ sampling steps the chain, started at the data, has not moved far from the data distribution.

  • Usually a single step $(n=1)$ is enough
  • The procedure is similar to minimizing a reconstruction error

Because the components of $\boldsymbol{x} \mid \boldsymbol{v}$ and of $\boldsymbol{v} \mid \boldsymbol{x}$ are conditionally independent, each sampling step can be computed in parallel across units:

  • Sample a data point $\boldsymbol{x}$
  • Compute the posterior $p(\boldsymbol{v} \mid \boldsymbol{x})$
  • Sample latents $\boldsymbol{v} \sim p(\boldsymbol{v} \mid \boldsymbol{x})$
  • Compute the conditional $p(\boldsymbol{x} \mid \boldsymbol{v})$
  • Sample $\boldsymbol{x}^{\prime} \sim p(\boldsymbol{x} \mid \boldsymbol{v})$
  • Minimize the difference between $\boldsymbol{x}$ and $\boldsymbol{x}^{\prime}$ (apply the $\mathrm{CD}_{n}$ weight update)
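The steps above can be sketched for a Bernoulli RBM (an assumed model; the section does not fix one), where both conditionals factorize over units, so each sampling step is a single parallel matrix operation and $\mathrm{CD}_1$ reduces to one Gibbs sweep:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cd1_step(x, W, b, c, lr=0.1):
    """One CD-1 update for a Bernoulli RBM (a sketch; names are ours).

    x : (batch, n_vis) binary data; W : (n_vis, n_hid); b, c : biases.
    """
    # Posterior over latents, then a sample (all units in parallel).
    p_v = sigmoid(x @ W + c)                         # p(v=1 | x)
    v = (rng.random(p_v.shape) < p_v).astype(float)
    # Conditional over visibles, then the reconstruction x'.
    p_x = sigmoid(v @ W.T + b)                       # p(x=1 | v)
    x_prime = (rng.random(p_x.shape) < p_x).astype(float)
    p_v_prime = sigmoid(x_prime @ W + c)
    # CD-1 update: data ("positive") phase minus reconstruction phase.
    batch = x.shape[0]
    W = W + lr * (x.T @ p_v - x_prime.T @ p_v_prime) / batch
    b = b + lr * (x - x_prime).mean(axis=0)
    c = c + lr * (p_v - p_v_prime).mean(axis=0)
    return W, b, c

# Toy data: two repeated binary patterns.
data = np.array([[1, 1, 1, 0, 0, 0],
                 [0, 0, 0, 1, 1, 1]] * 50, dtype=float)
W = 0.01 * rng.normal(size=(6, 4))
b = np.zeros(6)
c = np.zeros(4)
for _ in range(2000):
    W, b, c = cd1_step(data, W, b, c)
```

After training, reconstructing the data through the latents ($\boldsymbol{x} \to p(\boldsymbol{v}\mid\boldsymbol{x}) \to p(\boldsymbol{x}\mid\boldsymbol{v})$) should return probabilities close to the original patterns, which is the 'minimizing reconstruction error' intuition above.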