Gaussian Distribution
In the univariate case, the Gaussian distribution is given by
$$p(x ; \mu, \sigma^{2})=\frac{1}{\sqrt{2 \pi} \sigma} \exp \left(-\frac{(x-\mu)^{2}}{2 \sigma^{2}}\right)$$
The coefficient in front, $\frac{1}{\sqrt{2 \pi} \sigma},$ is a constant that does not depend on $x$; hence, we can think of it as simply a "normalization factor" used to ensure that
$$\int_{-\infty}^{\infty} \frac{1}{\sqrt{2 \pi} \sigma} \exp \left(-\frac{(x-\mu)^{2}}{2 \sigma^{2}}\right) d x=1$$
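A quick numerical check of this normalization (a minimal sketch with illustrative $\mu$ and $\sigma$; `scipy` is assumed to be available):

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# Illustrative parameters (not from the text)
mu, sigma = 1.0, 2.0

def pdf(x):
    """Univariate Gaussian density, written out from the formula above."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

# The normalization factor guarantees the density integrates to 1
total, _ = quad(pdf, -np.inf, np.inf)
print(total)  # ~1.0

# Sanity check against scipy's implementation
print(np.isclose(pdf(0.5), norm.pdf(0.5, loc=mu, scale=sigma)))  # True
```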
Expectation
The expectation of a Gaussian random variable is its mean, $\mathbb{E}[X]=\int_{-\infty}^{\infty} x \, p(x) \, dx=\mu$, and its variance is $\operatorname{Var}[X]=\sigma^{2}$.
Multivariate Gaussian Distribution
A vector-valued random variable $X=\left[X_{1} \cdots X_{n}\right]^{T}$ is said to have a multivariate normal (or Gaussian) distribution with mean $\mu \in \mathbf{R}^{n}$ and covariance matrix $\Sigma \in \mathbf{S}_{++}^{n}$ if its probability density function is given by
$$p(x ; \mu, \Sigma)=\frac{1}{(2 \pi)^{n / 2}|\Sigma|^{1 / 2}} \exp \left(-\frac{1}{2}(x-\mu)^{T} \Sigma^{-1}(x-\mu)\right)$$
Note that $\mathbf{S}_{++}^{n}$ is the space of symmetric positive definite $n \times n$ matrices, defined as $\mathbf{S}_{++}^{n}=\left\{A \in \mathbf{R}^{n \times n}: A=A^{T}\right.$ and $x^{T} A x>0$ for all $x \in \mathbf{R}^{n}$ such that $\left.x \neq 0\right\}$.
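As a sanity check, the sketch below evaluates this density directly from the closed-form expression and compares it against `scipy.stats.multivariate_normal` (the mean, covariance, and query point are invented for illustration):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical 2-D parameters: a mean vector and a symmetric positive definite covariance
mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
x = np.array([0.5, 0.5])

# Density evaluated directly from the closed-form expression above
n = len(mu)
diff = x - mu
pdf_manual = np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) \
             / np.sqrt((2 * np.pi) ** n * np.linalg.det(Sigma))

# The same density via scipy
pdf_scipy = multivariate_normal(mean=mu, cov=Sigma).pdf(x)
print(np.isclose(pdf_manual, pdf_scipy))  # True
```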
Properties of Gaussian Distributions
Approximating other distributions
Gaussian distributions are to probability what sine waves are to signal processing: fundamental components from which more complex structures can be built. This shows up in two ways:
Single Gaussian + Nonlinear Transformation
Any absolutely continuous distribution in $n$ dimensions can be approximated arbitrarily well by transforming a normally distributed $n$-dimensional vector through a sufficiently complex neural network.
This principle underlies many modern generative models such as Normalizing Flows, Variational Autoencoders, and some types of Generative Adversarial Networks. It is a powerful theoretical foundation, but practical implementations must respect mathematical constraints such as invertibility and a tractable Jacobian determinant (as in normalizing flows).
Multiple Gaussians + Linear Combinations
Any continuous density can be approximated to arbitrary accuracy by a linear combination of a sufficient number of Gaussians, by adjusting their means and covariances as well as the mixture coefficients.
This is the universal approximation property of the Gaussian Mixture Model. It follows from the fact that Gaussians are smooth, can be positioned anywhere (via means), scaled and oriented (via covariances), and combined in any proportion (via mixture weights).
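As a concrete illustration, the sketch below fits a two-component mixture to hypothetical bimodal data with scikit-learn (the data and parameters are invented for the example):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Hypothetical bimodal target: samples from two well-separated modes
data = np.concatenate([rng.normal(-3.0, 0.5, 500),
                       rng.normal(2.0, 1.0, 500)]).reshape(-1, 1)

# A two-component mixture recovers the component means, variances, and weights
gmm = GaussianMixture(n_components=2, random_state=0).fit(data)
print(gmm.means_.ravel())   # ~[-3, 2]
print(gmm.weights_)         # ~[0.5, 0.5]
```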
Reparameterization Trick
If we sample a vector $\mathbf{x}$ from a Gaussian $\mathbf{x} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, and if $\mathbf{y}=\boldsymbol{\mu}+\mathbf{A} \mathbf{x}$, then we have
$$\mathbf{y} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$$
where $\boldsymbol{\Sigma}=\mathbf{A} \mathbf{A}^{T}$.
This means that if you have access to a sampler for uncorrelated Gaussian variables, you can create correlated samples with any desired mean $\boldsymbol{\mu}$ and covariance $\boldsymbol{\Sigma}$.
For a given $\boldsymbol{\Sigma}$, you can compute $\boldsymbol{\Sigma}=\mathbf{A} \mathbf{A}^{T}$ with a Cholesky decomposition, such that $\mathbf{A}$ is lower triangular. Alternatively, you can compute the eigendecomposition $\boldsymbol{\Sigma}=\mathbf{U} \boldsymbol{\Lambda} \mathbf{U}^{T}$ and take $\mathbf{A}=\mathbf{U} \boldsymbol{\Lambda}^{1 / 2}$.
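A minimal NumPy sketch of this recipe, using invented values for $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical target mean and covariance
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

# Factor Sigma = A A^T with a Cholesky decomposition (A is lower triangular)
A = np.linalg.cholesky(Sigma)

# Draw uncorrelated standard normal samples and transform them: y = mu + A x
x = rng.standard_normal((100000, 2))
y = mu + x @ A.T

# Empirical moments should match mu and Sigma
print(y.mean(axis=0))            # ~[1, -2]
print(np.cov(y, rowvar=False))   # ~Sigma
```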
This principle is a key idea in Stochastic Gradients and is referred to as the Pathwise Gradient Estimator.
Marginalization property
If two sets of variables are jointly Gaussian, then the conditional distribution of one set given the other is again Gaussian. Similarly, the marginal distribution of either set is also Gaussian.
Consider a joint distribution:
$$\begin{bmatrix} \mathbf{x}_a \\ \mathbf{x}_b \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \boldsymbol{\mu}_a \\ \boldsymbol{\mu}_b \end{bmatrix}, \begin{bmatrix} \boldsymbol{\Sigma}_{aa} & \boldsymbol{\Sigma}_{ab} \\ \boldsymbol{\Sigma}_{ba} & \boldsymbol{\Sigma}_{bb} \end{bmatrix} \right)$$
Then the marginals are given by
$$\mathbf{x}_a \sim \mathcal{N}(\boldsymbol{\mu}_a, \boldsymbol{\Sigma}_{aa}), \qquad \mathbf{x}_b \sim \mathcal{N}(\boldsymbol{\mu}_b, \boldsymbol{\Sigma}_{bb})$$
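The sketch below (a hypothetical 2-D example) checks empirically that the marginal of one coordinate has exactly the mean and variance read off from the corresponding blocks:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical joint Gaussian over (x_a, x_b), each 1-D here
mu = np.array([0.0, 1.0])              # [mu_a, mu_b]
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])         # [[S_aa, S_ab], [S_ba, S_bb]]

samples = rng.multivariate_normal(mu, Sigma, size=200000)

# Marginalizing x_a just reads off the corresponding blocks: N(mu_a, S_aa)
print(samples[:, 0].mean())   # ~0.0 (mu_a)
print(samples[:, 0].var())    # ~2.0 (Sigma_aa)
```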
Conditioning property
Similarly, the conditional for the above distribution is given as another Gaussian
$$p(\mathbf{x}_a \mid \mathbf{x}_b) = \mathcal{N}(\mathbf{x}_a \mid \boldsymbol{\mu}_{a \mid b}, \boldsymbol{\Sigma}_{a \mid b})$$
with
$$\boldsymbol{\mu}_{a \mid b} = \boldsymbol{\mu}_a + \boldsymbol{\Sigma}_{ab} \boldsymbol{\Sigma}_{bb}^{-1} (\mathbf{x}_b - \boldsymbol{\mu}_b), \qquad \boldsymbol{\Sigma}_{a \mid b} = \boldsymbol{\Sigma}_{aa} - \boldsymbol{\Sigma}_{ab} \boldsymbol{\Sigma}_{bb}^{-1} \boldsymbol{\Sigma}_{ba}$$
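A small sketch computing the conditional moments from these formulas, with invented block values:

```python
import numpy as np

# Block partition of a hypothetical joint Gaussian over (x_a, x_b)
mu_a, mu_b = np.array([0.0]), np.array([1.0])
S_aa = np.array([[2.0]])
S_ab = np.array([[0.6]])
S_bb = np.array([[1.0]])

x_b = np.array([2.0])   # observed value of x_b

# Conditional mean and covariance from the formulas above
S_bb_inv = np.linalg.inv(S_bb)
mu_cond = mu_a + S_ab @ S_bb_inv @ (x_b - mu_b)
S_cond = S_aa - S_ab @ S_bb_inv @ S_ab.T

print(mu_cond)  # [0.6]
print(S_cond)   # [[1.64]]
```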
Summation property
The sum of two independent Gaussian random variables is also a Gaussian random variable.
If
$$x \sim \mathcal{N}(\mu, \Sigma), \qquad y \sim \mathcal{N}(\mu', \Sigma')$$
Then $z=x+y \quad \rightarrow \quad z \sim \mathcal{N}\left(\mu+\mu^{\prime}, \Sigma+\Sigma^{\prime}\right)$
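A quick empirical check of this property, with invented parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two independent hypothetical Gaussians
x = rng.multivariate_normal([0.0, 0.0], np.eye(2), size=200000)
y = rng.multivariate_normal([1.0, -1.0], 2 * np.eye(2), size=200000)

z = x + y   # sum of independent Gaussians

# Moments match mu + mu' and Sigma + Sigma'
print(z.mean(axis=0))            # ~[1, -1]
print(np.cov(z, rowvar=False))   # ~3 * I
```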
Some useful results
The derivative of the density with respect to $\boldsymbol{\mu}_k$ (when the covariance matrix is positive definite) is given by
$$\frac{\partial \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\partial \boldsymbol{\mu}_k} = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \, \boldsymbol{\Sigma}_k^{-1} (\mathbf{x} - \boldsymbol{\mu}_k)$$
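The sketch below verifies this identity against central finite differences (all parameter values are invented for illustration):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical component parameters
mu_k = np.array([0.5, -0.5])
Sigma_k = np.array([[1.5, 0.3],
                    [0.3, 1.0]])
x = np.array([1.0, 0.2])

pdf = lambda m: multivariate_normal(mean=m, cov=Sigma_k).pdf(x)

# Analytic gradient: N(x | mu_k, Sigma_k) * Sigma_k^{-1} (x - mu_k)
grad_analytic = pdf(mu_k) * np.linalg.inv(Sigma_k) @ (x - mu_k)

# Central finite differences for comparison
eps = 1e-6
grad_fd = np.array([
    (pdf(mu_k + eps * e) - pdf(mu_k - eps * e)) / (2 * eps)
    for e in np.eye(2)
])
print(np.allclose(grad_analytic, grad_fd))  # True
```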
Limitations of the Gaussian Distribution
- Very sensitive to outliers, since the maximum-likelihood estimates of the mean and covariance are strongly affected by them.
- Unimodal, so it cannot capture multimodal data on its own (this can be handled with latent variables, giving rise to the Gaussian Mixture Model).
References
- CS229 section notes, *The Multivariate Gaussian Distribution*: http://cs229.stanford.edu/section/gaussians.pdf
- Bishop, C. M. (2006). *Pattern Recognition and Machine Learning*, Section 2.3.