Boltzmann Machines
Hopfield Networks minimize the quadratic energy function $E(x)=-\frac{1}{2} x^{T} W x$.
Boltzmann machines are stochastic Hopfield networks: instead of a deterministic threshold, a neuron's response to its activation $a_{i}$ is probabilistic, switching on with a sigmoid probability of $a_{i}$.
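A minimal NumPy sketch of the contrast (the $\pm 1$ state convention and the helper names are illustrative assumptions, not from the original notes):

```python
import numpy as np

def hopfield_update(x, W, i):
    """Deterministic Hopfield update of unit i: threshold its activation."""
    a_i = W[i] @ x                        # activation a_i = sum_j W_ij x_j
    return 1 if a_i >= 0 else -1

def boltzmann_update(x, W, i, rng):
    """Stochastic Boltzmann update of unit i: fire with a sigmoid probability.
    For +/-1 units the 'on' probability works out to sigma(2 * a_i)."""
    a_i = W[i] @ x
    p_on = 1.0 / (1.0 + np.exp(-2.0 * a_i))
    return 1 if rng.random() < p_on else -1
```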
Restricted Boltzmann Machines
We can interpret Boltzmann networks as implementing Gibbs sampling for the pdf $p(x)=\frac{1}{Z} \exp \left(\frac{1}{2} x^{T} W x\right)$.
Boltzmann machines are too parameter-heavy: for an $x$ with $256 \times 256=65{,}536$ pixels, $W$ has $65{,}536^{2} \approx 4.3$ billion parameters.
Boltzmann machines learn no features. The Restricted Boltzmann Machine instead adds bottleneck latent variables $v$:
$x_{i}$ and $v_{j}$ are still binary variables, as in the original model.
- The quadratic term captures correlations
- The unary terms capture priors: how likely is a (latent) pixel to be +1 or -1
Energy function: $E(x, v)=-v^{T} W x-b^{T} x-c^{T} v$
The marginal $p(x)=\frac{1}{Z} \sum_{v} \exp (-E(x, v))$ is not directly of the form $\exp (-E(x)) / Z$ because of the sum over $v$.
Rewriting it via the free energy $F(x)$, so that $p(x)=\frac{1}{Z} \exp (-F(x))$: $F(x)=-\boldsymbol{b}^{T} \boldsymbol{x}-\sum_{i} \log \sum_{v_{i}} \exp \left(v_{i}\left(c_{i}+\boldsymbol{W}_{i} \boldsymbol{x}\right)\right)$
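Writing out the marginalization over the independent latent units makes this explicit:
$$ p(x)=\frac{1}{Z} \sum_{v} \exp \left(v^{T} W x+b^{T} x+c^{T} v\right)=\frac{1}{Z} \exp \left(b^{T} x\right) \prod_{i} \sum_{v_{i}} \exp \left(v_{i}\left(c_{i}+\boldsymbol{W}_{i} \boldsymbol{x}\right)\right)=\frac{1}{Z} \exp (-F(x)) $$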
The RBM defines a bipartite graph with undirected connections: information flows both forward and backward.

The hidden variables $v_{j}$ are independent conditioned on the visible variables
The visible variables $x_{i}$ are independent conditioned on the hidden variables
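In formulas, the bipartite structure gives the factorizations
$$ p(v \mid x)=\prod_{j} p\left(v_{j} \mid x\right), \qquad p(x \mid v)=\prod_{i} p\left(x_{i} \mid v\right) $$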
Training RBMs: conditional probabilities
The conditional probabilities are sigmoids; e.g., for $0/1$ units, $p\left(v_{j}=1 \mid x\right)=\sigma\left(c_{j}+\boldsymbol{W}_{j} \boldsymbol{x}\right)$ and $p\left(x_{i}=1 \mid v\right)=\sigma\left(b_{i}+\boldsymbol{W}_{:, i}^{T} \boldsymbol{v}\right)$
Since RBMs are bidirectional, sampling "loops" between visible and latent units, going back and forth repeatedly.
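A minimal NumPy sketch of one such back-and-forth (block Gibbs) sweep, assuming $0/1$ units and a weight matrix of shape (latent, visible); all names are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gibbs_step(x, W, b, c, rng):
    """One visible -> latent -> visible sweep of an RBM with 0/1 units.
    W has shape (num_latent, num_visible); b, c are visible/latent biases."""
    p_v = sigmoid(c + W @ x)            # p(v_j = 1 | x), all latents at once
    v = (rng.random(p_v.shape) < p_v).astype(float)
    p_x = sigmoid(b + W.T @ v)          # p(x_i = 1 | v), all visibles at once
    x_new = (rng.random(p_x.shape) < p_x).astype(float)
    return x_new, v
```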
Training any energy model
Maximizing log-likelihood
The expectation w.r.t. a pdf is equivalent to
- sampling from the pdf and
- then taking the average
$$ \mathbb{E}_{x \sim p_{0}}[\log p(x \mid \theta)]=\mathbb{E}_{x \sim p_{0}}\left[-E_{\theta}(x)\right]-\log Z(\theta) $$
- where $\log Z(\boldsymbol{\theta})=\log \sum_{x^{\prime}} \exp \left(-E_{\boldsymbol{\theta}}\left(\boldsymbol{x}^{\prime}\right)\right)$
- and $p_{0}(x)$ is the data distribution
Taking gradients of any energy model
$\frac{d}{d \boldsymbol{\theta}} \log p_{\boldsymbol{\theta}}(\boldsymbol{x})=-\frac{d}{d \boldsymbol{\theta}} E_{\boldsymbol{\theta}}(\boldsymbol{x})-\frac{d}{d \boldsymbol{\theta}} \log Z(\boldsymbol{\theta})$
$=-\frac{d}{d \boldsymbol{\theta}} E_{\boldsymbol{\theta}}(\boldsymbol{x})-\frac{1}{Z(\boldsymbol{\theta})} \frac{d}{d \boldsymbol{\theta}} Z(\boldsymbol{\theta})$
$=-\frac{d}{d \boldsymbol{\theta}} E_{\boldsymbol{\theta}}(\boldsymbol{x})-\sum_{\boldsymbol{x}^{\prime}} \frac{1}{Z(\boldsymbol{\theta})} \exp \left(-E_{\boldsymbol{\theta}}\left(\boldsymbol{x}^{\prime}\right)\right)\left(-\frac{d}{d \boldsymbol{\theta}} E_{\boldsymbol{\theta}}\left(\boldsymbol{x}^{\prime}\right)\right)$
$=-\frac{d}{d \boldsymbol{\theta}} E_{\boldsymbol{\theta}}(\boldsymbol{x})+\sum_{\boldsymbol{x}^{\prime}} p_{\boldsymbol{\theta}}\left(\boldsymbol{x}^{\prime}\right) \frac{d}{d \boldsymbol{\theta}} E_{\boldsymbol{\theta}}\left(\boldsymbol{x}^{\prime}\right)$
$=-\frac{d}{d \boldsymbol{\theta}} E_{\boldsymbol{\theta}}(\boldsymbol{x})+\mathbb{E}_{\boldsymbol{x}^{\prime} \sim p_{\boldsymbol{\theta}}}\left[\frac{d}{d \boldsymbol{\theta}} E_{\boldsymbol{\theta}}\left(\boldsymbol{x}^{\prime}\right)\right]$
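In code, this last line is the familiar "positive phase minus negative phase" Monte Carlo estimate. A hedged sketch for a generic energy model, where grad_energy and the source of model_samples are assumed helpers, not part of the original text:

```python
import numpy as np

def log_likelihood_grad(theta, data_batch, model_samples, grad_energy):
    """Monte Carlo estimate of d/dtheta log p_theta(x):
    -dE/dtheta averaged over data  +  dE/dtheta averaged over model samples."""
    positive = np.mean([grad_energy(theta, x) for x in data_batch], axis=0)
    negative = np.mean([grad_energy(theta, x) for x in model_samples], axis=0)
    return -positive + negative
```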
Taking gradients in an RBM
For an RBM we must additionally integrate out the latent variables $v$.
Since for an RBM $E_{\theta}(x, v)=-v^{T} W x-b^{T} x-c^{T} v$:
Easy part (data term): substitute the data sample $x_{n}$ and sum over $v$ (tractable, since the $v_{j}$ are conditionally independent).
Hard part (normalization): sum over all $2^{m+d}$ combinations of images and latents.
- Intractable due to exponential complexity w.r.t. $m+d$
- Evaluating and optimizing $p_{\theta}(x, v)$ takes a long time
- If we had only the unnormalized part we would have no problem
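For reference, with $0/1$ units the easy data term has a closed form (the unit convention is an assumption here; the expression follows directly from the free energy above):
$$ -\frac{\partial F(x)}{\partial W_{j i}}=x_{i}\, \sigma\left(c_{j}+\boldsymbol{W}_{j} \boldsymbol{x}\right)=x_{i}\, p\left(v_{j}=1 \mid x\right) $$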
Tackling intractability by sampling
$\mathbb{E}_{x^{\prime}, v \sim p_{\theta}(x, v)}\left[\frac{d}{d \theta} E_{\theta}\left(x^{\prime}, v\right)\right]$ stands for an expectation
- One can sample very many $x^{\prime}, v$ from $p_{\theta}(x, v)$
- Take average instead of computing analytically (Monte Carlo sampling)
Question: how to even sample from a hard pdf?
- Markov Chain Monte Carlo with Gibbs sampling
- Convergence after many rounds
Sampling the normalization constant
We can rewrite the gradient as $\frac{d}{d \boldsymbol{\theta}} \mathbb{E}_{0}\left[\log p_{\boldsymbol{\theta}}(\boldsymbol{x})\right]=-\mathbb{E}_{0}\left[\frac{d}{d \boldsymbol{\theta}} E_{\boldsymbol{\theta}}(\boldsymbol{x})\right]+\mathbb{E}_{\infty}\left[\frac{d}{d \boldsymbol{\theta}} E_{\boldsymbol{\theta}}(\boldsymbol{x})\right]$
where $\mathbb{E}_{0} \equiv \mathbb{E}_{x \sim p_{0}}$ means sampling from the training data and averaging the gradients
and $\mathbb{E}_{\infty} \equiv \mathbb{E}_{x \sim p_{\theta}(x)}$ means sampling from the model and averaging the gradients
Unfortunately, MCMC can be very slow $\rightarrow 2^{\text {nd }}$ source of intractability.
Contrastive divergence for RBMs
Contrastive Divergence approximates the gradient with a $k$-step Gibbs sampler started at the data.
The data term pushes the numerator up (raising probability at data points), while the model term pushes the denominator (the partition function) down.
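A minimal sketch of CD-$k$ for one training example, reusing sigmoid and gibbs_step from the block-Gibbs sketch above; the learning-rate handling and the use of latent probabilities rather than samples are common choices, not prescribed by the notes:

```python
import numpy as np

def cd_k_update(x_data, W, b, c, k, lr, rng):
    """One contrastive-divergence (CD-k) parameter update for a single example."""
    # Positive phase: latent probabilities driven by the data.
    p_v_data = sigmoid(c + W @ x_data)
    # Negative phase: run k Gibbs steps starting from the data sample.
    x_model = x_data.copy()
    for _ in range(k):
        x_model, _ = gibbs_step(x_model, W, b, c, rng)
    p_v_model = sigmoid(c + W @ x_model)
    # Gradient estimate: <v x^T>_data - <v x^T>_model (and likewise for biases).
    W += lr * (np.outer(p_v_data, x_data) - np.outer(p_v_model, x_model))
    b += lr * (x_data - x_model)
    c += lr * (p_v_data - p_v_model)
    return W, b, c
```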
Sampling with Markov Chain Monte Carlo
We want to sample an $x$ from a pdf $p_{\theta}(x)$ with MCMC (here a Metropolis-style accept/reject sampler):
Step 1. Initialize $x^{0}$ randomly
Step 2. Let $\hat{x}=x^{t}+$ noise
- If the negative energy $f_{\theta}(\hat{x})>f_{\theta}\left(x^{t}\right)$, where $f_{\theta}=-E_{\theta}$, set $x^{t+1}=\hat{x}$
- Otherwise, still set $x^{t+1}=\hat{x}$ with probability $\frac{p(\hat{x})}{p\left(x^{t}\right)}=\exp \left(f_{\theta}(\hat{x})-f_{\theta}\left(x^{t}\right)\right)$; else keep $x^{t+1}=x^{t}$
Go to step 2
Note: Because of the ratio of likelihoods $\rightarrow$ no $Z(\boldsymbol{\theta})$
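A compact sketch of this accept/reject loop for a generic negative-energy function $f_{\theta}$, with an assumed Gaussian proposal (noise scale and names are illustrative):

```python
import numpy as np

def mcmc_sample(f, x0, num_steps, noise_scale=0.1, rng=None):
    """Metropolis-style sampling from p(x) proportional to exp(f(x)).
    Only the ratio p(x_hat)/p(x_t) = exp(f(x_hat) - f(x_t)) is used, so no Z."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    for _ in range(num_steps):
        x_hat = x + noise_scale * rng.standard_normal(x.shape)   # propose
        log_ratio = f(x_hat) - f(x)
        if log_ratio >= 0 or rng.random() < np.exp(log_ratio):   # accept
            x = x_hat
    return x
```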
Using RBMs
- Some of the first models to show nice generations of images
- Use RBMs to pretrain networks for classification afterward
