Boltzmann Machines
Hopfield Networks minimize the quadratic energy function $E(x)=-\frac{1}{2} x^{T} W x$.
Boltzmann machines are stochastic Hopfield networks: instead of a deterministic threshold, a neuron's response to its activation $a_{i}$ is probabilistic, switching on with a sigmoid probability of $a_{i}$.
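A minimal NumPy sketch of the contrast (the $\pm 1$ state convention and the helper names are illustrative assumptions, not from the original notes):

```python
import numpy as np

def hopfield_update(x, W, i):
    """Deterministic Hopfield update of unit i: threshold its activation."""
    a_i = W[i] @ x                        # activation a_i = sum_j W_ij x_j
    return 1 if a_i >= 0 else -1

def boltzmann_update(x, W, i, rng):
    """Stochastic Boltzmann update of unit i: fire with a sigmoid probability.
    For +/-1 units the 'on' probability works out to sigma(2 * a_i)."""
    a_i = W[i] @ x
    p_on = 1.0 / (1.0 + np.exp(-2.0 * a_i))
    return 1 if rng.random() < p_on else -1
```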
Restricted Boltzmann Machines
We can interpret Boltzmann networks as implementing Gibbs sampling for the pdf $p(x)=\frac{1}{Z} \exp \left(\frac{1}{2} x^{T} W x\right)$.
Boltzmann machines are too parameter-heavy: for an $x$ with $256 \times 256=65{,}536$ pixels, $W$ has $65{,}536^{2} \approx 4.3$ billion parameters.
Boltzmann machines learn no features. The Restricted Boltzmann Machine instead adds bottleneck latent variables $v$:
$x_{i}$ and $v_{j}$ are still binary variables, as in the original model.
- The quadratic term captures correlations
- The unary terms capture priors: how likely is a (latent) pixel to be +1 or -1
Energy function: $E(x, v)=-v^{T} W x-b^{T} x-c^{T} v$
The marginal $p(x)=\frac{1}{Z} \sum_{v} \exp (-E(x, v))$ is not directly of the form $\exp (-E(x)) / Z$ because of the sum over $v$.
Rewriting it via the free energy $F(x)$, so that $p(x)=\frac{1}{Z} \exp (-F(x))$: $F(x)=-\boldsymbol{b}^{T} \boldsymbol{x}-\sum_{i} \log \sum_{v_{i}} \exp \left(v_{i}\left(c_{i}+\boldsymbol{W}_{i} \boldsymbol{x}\right)\right)$
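Writing out the marginalization over the independent latent units makes this explicit:
$$ p(x)=\frac{1}{Z} \sum_{v} \exp \left(v^{T} W x+b^{T} x+c^{T} v\right)=\frac{1}{Z} \exp \left(b^{T} x\right) \prod_{i} \sum_{v_{i}} \exp \left(v_{i}\left(c_{i}+\boldsymbol{W}_{i} \boldsymbol{x}\right)\right)=\frac{1}{Z} \exp (-F(x)) $$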
The RBM defines a bipartite graph with undirected connections: information flows both forward and backward.

The hidden variables $v_{j}$ are independent conditioned on the visible variables
The visible variables $x_{i}$ are independent conditioned on the hidden variables
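In formulas, the bipartite structure gives the factorizations
$$ p(v \mid x)=\prod_{j} p\left(v_{j} \mid x\right), \qquad p(x \mid v)=\prod_{i} p\left(x_{i} \mid v\right) $$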
Training RBMs: conditional probabilities
The conditional probabilities are sigmoids; e.g., for $0/1$ units, $p\left(v_{j}=1 \mid x\right)=\sigma\left(c_{j}+\boldsymbol{W}_{j} \boldsymbol{x}\right)$ and $p\left(x_{i}=1 \mid v\right)=\sigma\left(b_{i}+\boldsymbol{W}_{:, i}^{T} \boldsymbol{v}\right)$
Since RBMs are bidirectional, sampling "loops" between visible and latent units, going back and forth repeatedly.
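A minimal NumPy sketch of one such back-and-forth (block Gibbs) sweep, assuming $0/1$ units and a weight matrix of shape (latent, visible); all names are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gibbs_step(x, W, b, c, rng):
    """One visible -> latent -> visible sweep of an RBM with 0/1 units.
    W has shape (num_latent, num_visible); b, c are visible/latent biases."""
    p_v = sigmoid(c + W @ x)            # p(v_j = 1 | x), all latents at once
    v = (rng.random(p_v.shape) < p_v).astype(float)
    p_x = sigmoid(b + W.T @ v)          # p(x_i = 1 | v), all visibles at once
    x_new = (rng.random(p_x.shape) < p_x).astype(float)
    return x_new, v
```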
Training any energy model
Maximizing log-likelihood
The expectation w.r.t. a pdf is equivalent to
- sampling from the pdf and
- then taking the average
$$ \mathbb{E}_{x \sim p_{0}}[\log p(x \mid \theta)]=\mathbb{E}_{x \sim p_{0}}\left[-E_{\theta}(x)\right]-\log Z(\theta) $$
- where $\log Z(\boldsymbol{\theta})=\log \sum_{x^{\prime}} \exp \left(-E_{\boldsymbol{\theta}}\left(\boldsymbol{x}^{\prime}\right)\right)$
- and $p_{0}(x)$ is the data distribution
Taking gradients of any energy model
$\frac{d}{d \boldsymbol{\theta}} \log p_{\boldsymbol{\theta}}(\boldsymbol{x})=-\frac{d}{d \boldsymbol{\theta}} E_{\boldsymbol{\theta}}(\boldsymbol{x})-\frac{d}{d \boldsymbol{\theta}} \log Z(\boldsymbol{\theta})$
$=-\frac{d}{d \boldsymbol{\theta}} E_{\boldsymbol{\theta}}(\boldsymbol{x})-\frac{1}{Z(\boldsymbol{\theta})} \frac{d}{d \boldsymbol{\theta}} Z(\boldsymbol{\theta})$
$=-\frac{d}{d \boldsymbol{\theta}} E_{\boldsymbol{\theta}}(\boldsymbol{x})-\sum_{\boldsymbol{x}^{\prime}} \frac{1}{Z(\boldsymbol{\theta})} \exp \left(-E_{\boldsymbol{\theta}}\left(\boldsymbol{x}^{\prime}\right)\right)\left(-\frac{d}{d \boldsymbol{\theta}} E_{\boldsymbol{\theta}}\left(\boldsymbol{x}^{\prime}\right)\right)$
$=-\frac{d}{d \boldsymbol{\theta}} E_{\boldsymbol{\theta}}(\boldsymbol{x})+\sum_{\boldsymbol{x}^{\prime}} p_{\boldsymbol{\theta}}\left(\boldsymbol{x}^{\prime}\right) \frac{d}{d \boldsymbol{\theta}} E_{\boldsymbol{\theta}}\left(\boldsymbol{x}^{\prime}\right)$
$=-\frac{d}{d \boldsymbol{\theta}} E_{\boldsymbol{\theta}}(\boldsymbol{x})+\mathbb{E}_{\boldsymbol{x}^{\prime} \sim p_{\boldsymbol{\theta}}}\left[\frac{d}{d \boldsymbol{\theta}} E_{\boldsymbol{\theta}}\left(\boldsymbol{x}^{\prime}\right)\right]$
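In code, this last line is the familiar "positive phase minus negative phase" Monte Carlo estimate. A hedged sketch for a generic energy model, where grad_energy and the source of model_samples are assumed helpers, not part of the original text:

```python
import numpy as np

def log_likelihood_grad(theta, data_batch, model_samples, grad_energy):
    """Monte Carlo estimate of d/dtheta log p_theta(x):
    -dE/dtheta averaged over data  +  dE/dtheta averaged over model samples."""
    positive = np.mean([grad_energy(theta, x) for x in data_batch], axis=0)
    negative = np.mean([grad_energy(theta, x) for x in model_samples], axis=0)
    return -positive + negative
```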
Taking gradients in an RBM
For an RBM we must additionally integrate out the latent variables $v$.
Since for an RBM $E_{\theta}(x, v)=-v^{T} W x-b^{T} x-c^{T} v$:
Easy part (data term): substitute the data sample $x_{n}$ and sum over $v$ (tractable, since the $v_{j}$ are conditionally independent).
Hard part (normalization): sum over all $2^{m+d}$ combinations of images and latents.
- Intractable due to exponential complexity w.r.t. $m+d$
- Evaluating and optimizing $p_{\theta}(x, v)$ takes a long time
- If we had only the unnormalized part we would have no problem
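For reference, with $0/1$ units the easy data term has a closed form (the unit convention is an assumption here; the expression follows directly from the free energy above):
$$ -\frac{\partial F(x)}{\partial W_{j i}}=x_{i}\, \sigma\left(c_{j}+\boldsymbol{W}_{j} \boldsymbol{x}\right)=x_{i}\, p\left(v_{j}=1 \mid x\right) $$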
Tackling intractability by sampling
$\mathbb{E}_{x^{\prime}, v \sim p_{\theta}(x, v)}\left[\frac{d}{d \theta} E_{\theta}\left(x^{\prime}, v\right)\right]$ stands for an expectation
- One can sample very many $x^{\prime}, v$ from $p_{\theta}(x, v)$
- Take average instead of computing analytically (Monte Carlo sampling)
Question: how to even sample from a hard pdf?
- Markov Chain Monte Carlo with Gibbs sampling
- Convergence after many rounds
Sampling the normalization constant
We can rewrite the gradient as $\frac{d}{d \boldsymbol{\theta}} \mathbb{E}_{0}\left[\log p_{\boldsymbol{\theta}}(\boldsymbol{x})\right]=-\mathbb{E}_{0}\left[\frac{d}{d \boldsymbol{\theta}} E_{\boldsymbol{\theta}}(\boldsymbol{x})\right]+\mathbb{E}_{\infty}\left[\frac{d}{d \boldsymbol{\theta}} E_{\boldsymbol{\theta}}(\boldsymbol{x})\right]$
where $\mathbb{E}_{0} \equiv \mathbb{E}_{x \sim p_{0}}$ means sampling from the training data and averaging the gradients
and $\mathbb{E}_{\infty} \equiv \mathbb{E}_{x \sim p_{\theta}(x)}$ means sampling from the model and averaging the gradients
Unfortunately, MCMC can be very slow $\rightarrow 2^{\text {nd }}$ source of intractability.
Contrastive divergence for RBMs
Contrastive Divergence approximates the gradient with a $k$-step Gibbs sampler started at the data.
The data term pushes the numerator up (raising probability at data points), while the model term pushes the denominator (the partition function) down.
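A minimal sketch of CD-$k$ for one training example, reusing sigmoid and gibbs_step from the block-Gibbs sketch above; the learning-rate handling and the use of latent probabilities rather than samples are common choices, not prescribed by the notes:

```python
import numpy as np

def cd_k_update(x_data, W, b, c, k, lr, rng):
    """One contrastive-divergence (CD-k) parameter update for a single example."""
    # Positive phase: latent probabilities driven by the data.
    p_v_data = sigmoid(c + W @ x_data)
    # Negative phase: run k Gibbs steps starting from the data sample.
    x_model = x_data.copy()
    for _ in range(k):
        x_model, _ = gibbs_step(x_model, W, b, c, rng)
    p_v_model = sigmoid(c + W @ x_model)
    # Gradient estimate: <v x^T>_data - <v x^T>_model (and likewise for biases).
    W += lr * (np.outer(p_v_data, x_data) - np.outer(p_v_model, x_model))
    b += lr * (x_data - x_model)
    c += lr * (p_v_data - p_v_model)
    return W, b, c
```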
Sampling with Markov Chain Monte Carlo
We want to sample an $x$ from a pdf $p_{\theta}(x)$ with MCMC (here a Metropolis-style accept/reject sampler):
Step 1. Initialize $x^{0}$ randomly
Step 2. Let $\hat{x}=x^{t}+$ noise
- If the negative energy $f_{\theta}(\hat{x})>f_{\theta}\left(x^{t}\right)$, where $f_{\theta}=-E_{\theta}$, set $x^{t+1}=\hat{x}$
- Otherwise, still set $x^{t+1}=\hat{x}$ with probability $\frac{p(\hat{x})}{p\left(x^{t}\right)}=\exp \left(f_{\theta}(\hat{x})-f_{\theta}\left(x^{t}\right)\right)$; else keep $x^{t+1}=x^{t}$
Go to step 2
Note: Because of the ratio of likelihoods $\rightarrow$ no $Z(\boldsymbol{\theta})$
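A compact sketch of this accept/reject loop for a generic negative-energy function $f_{\theta}$, with an assumed Gaussian proposal (noise scale and names are illustrative):

```python
import numpy as np

def mcmc_sample(f, x0, num_steps, noise_scale=0.1, rng=None):
    """Metropolis-style sampling from p(x) proportional to exp(f(x)).
    Only the ratio p(x_hat)/p(x_t) = exp(f(x_hat) - f(x_t)) is used, so no Z."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    for _ in range(num_steps):
        x_hat = x + noise_scale * rng.standard_normal(x.shape)   # propose
        log_ratio = f(x_hat) - f(x)
        if log_ratio >= 0 or rng.random() < np.exp(log_ratio):   # accept
            x = x_hat
    return x
```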
Using RBMs
- Some of the first models to show nice generations of images
- Use RBMs to pretrain networks for classification afterward
