Gaussian Distribution
In the univariate case, the Gaussian distribution is given by
$$p(x ; \mu, \sigma^{2})=\frac{1}{\sqrt{2 \pi} \sigma} \exp \left(-\frac{(x-\mu)^{2}}{2 \sigma^{2}}\right)$$
The coefficient in front, $\frac{1}{\sqrt{2 \pi} \sigma},$ is a constant that does not depend on $x$; hence, we can think of it as simply a "normalization factor" used to ensure that
$$\int_{-\infty}^{\infty} \frac{1}{\sqrt{2 \pi} \sigma} \exp \left(-\frac{(x-\mu)^{2}}{2 \sigma^{2}}\right) d x=1$$
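A quick numerical check of this normalization (a minimal sketch with illustrative $\mu$ and $\sigma$; `scipy` is assumed to be available):

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# Illustrative parameters (not from the text)
mu, sigma = 1.0, 2.0

def pdf(x):
    """Univariate Gaussian density, written out from the formula above."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

# The normalization factor guarantees the density integrates to 1
total, _ = quad(pdf, -np.inf, np.inf)
print(total)  # ~1.0

# Sanity check against scipy's implementation
print(np.isclose(pdf(0.5), norm.pdf(0.5, loc=mu, scale=sigma)))  # True
```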
Expectation
The expectation of a Gaussian random variable is its mean, $\mathbb{E}[X]=\int_{-\infty}^{\infty} x \, p(x) \, dx=\mu$, and its variance is $\operatorname{Var}[X]=\sigma^{2}$.
Multivariate Gaussian Distribution
A vector-valued random variable $X=\left[X_{1} \cdots X_{n}\right]^{T}$ is said to have a multivariate normal (or Gaussian) distribution with mean $\mu \in \mathbf{R}^{n}$ and covariance matrix $\Sigma \in \mathbf{S}_{++}^{n}$ if its probability density function is given by
$$p(x ; \mu, \Sigma)=\frac{1}{(2 \pi)^{n / 2}|\Sigma|^{1 / 2}} \exp \left(-\frac{1}{2}(x-\mu)^{T} \Sigma^{-1}(x-\mu)\right)$$
Note that $\mathbf{S}_{++}^{n}$ is the space of symmetric positive definite $n \times n$ matrices, defined as $\mathbf{S}_{++}^{n}=\left\{A \in \mathbf{R}^{n \times n}: A=A^{T}\right.$ and $x^{T} A x>0$ for all $x \in \mathbf{R}^{n}$ such that $\left.x \neq 0\right\}$.
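As a sanity check, the sketch below evaluates this density directly from the closed-form expression and compares it against `scipy.stats.multivariate_normal` (the mean, covariance, and query point are invented for illustration):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical 2-D parameters: a mean vector and a symmetric positive definite covariance
mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
x = np.array([0.5, 0.5])

# Density evaluated directly from the closed-form expression above
n = len(mu)
diff = x - mu
pdf_manual = np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) \
             / np.sqrt((2 * np.pi) ** n * np.linalg.det(Sigma))

# The same density via scipy
pdf_scipy = multivariate_normal(mean=mu, cov=Sigma).pdf(x)
print(np.isclose(pdf_manual, pdf_scipy))  # True
```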
Properties of Gaussian Distributions
Approximating other distributions
Gaussian distributions are to probability what sine waves are to signal processing: fundamental components from which more complex structures can be built. This shows up in two ways:
Single Gaussian + Nonlinear Transformation
Any absolutely continuous distribution in $n$ dimensions can be approximated arbitrarily well by transforming a normally distributed $n$-dimensional vector through a sufficiently complex neural network.
This principle underlies many modern generative models such as Normalizing Flows, Variational Autoencoders, and some types of Generative Adversarial Networks. It is a powerful theoretical foundation, but practical implementations must respect mathematical constraints such as invertibility and a tractable Jacobian determinant (as in normalizing flows).
Multiple Gaussians + Linear Combinations
Any continuous density can be approximated to arbitrary accuracy by a linear combination of a sufficient number of Gaussians, by adjusting their means and covariances as well as the mixture coefficients.
This is the universal approximation property of the Gaussian Mixture Model. It follows from the fact that Gaussians are smooth, can be positioned anywhere (via means), scaled and oriented (via covariances), and combined in any proportion (via mixture weights).
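As a concrete illustration, the sketch below fits a two-component mixture to hypothetical bimodal data with scikit-learn (the data and parameters are invented for the example):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Hypothetical bimodal target: samples from two well-separated modes
data = np.concatenate([rng.normal(-3.0, 0.5, 500),
                       rng.normal(2.0, 1.0, 500)]).reshape(-1, 1)

# A two-component mixture recovers the component means, variances, and weights
gmm = GaussianMixture(n_components=2, random_state=0).fit(data)
print(gmm.means_.ravel())   # ~[-3, 2]
print(gmm.weights_)         # ~[0.5, 0.5]
```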
Reparameterization Trick
If we sample a vector $\mathbf{x}$ from a Gaussian $\mathbf{x} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, and if $\mathbf{y}=\boldsymbol{\mu}+\mathbf{A} \mathbf{x}$, then we have
$$\mathbf{y} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$$
where $\boldsymbol{\Sigma}=\mathbf{A} \mathbf{A}^{T}$.
This means that if you have access to a sampler for uncorrelated Gaussian variables, you can create correlated samples with any desired mean $\boldsymbol{\mu}$ and covariance $\boldsymbol{\Sigma}$.
For a given $\boldsymbol{\Sigma}$, you can compute $\boldsymbol{\Sigma}=\mathbf{A} \mathbf{A}^{T}$ with a Cholesky decomposition, such that $\mathbf{A}$ is lower triangular. Alternatively, you can compute the eigendecomposition $\boldsymbol{\Sigma}=\mathbf{U} \boldsymbol{\Lambda} \mathbf{U}^{T}$ and take $\mathbf{A}=\mathbf{U} \boldsymbol{\Lambda}^{1 / 2}$.
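A minimal NumPy sketch of this recipe, using invented values for $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical target mean and covariance
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

# Factor Sigma = A A^T with a Cholesky decomposition (A is lower triangular)
A = np.linalg.cholesky(Sigma)

# Draw uncorrelated standard normal samples and transform them: y = mu + A x
x = rng.standard_normal((100000, 2))
y = mu + x @ A.T

# Empirical moments should match mu and Sigma
print(y.mean(axis=0))            # ~[1, -2]
print(np.cov(y, rowvar=False))   # ~Sigma
```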
This principle is a key idea in Stochastic Gradients and is referred to as the Pathwise Gradient Estimator.
Marginalization property
If two sets of variables are jointly Gaussian, then the conditional distribution of one set given the other is again Gaussian. Similarly, the marginal distribution of either set is also Gaussian.
Consider a joint distribution:
$$\begin{bmatrix} \mathbf{x}_a \\ \mathbf{x}_b \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \boldsymbol{\mu}_a \\ \boldsymbol{\mu}_b \end{bmatrix}, \begin{bmatrix} \boldsymbol{\Sigma}_{aa} & \boldsymbol{\Sigma}_{ab} \\ \boldsymbol{\Sigma}_{ba} & \boldsymbol{\Sigma}_{bb} \end{bmatrix} \right)$$
Then the marginals are given by
$$\mathbf{x}_a \sim \mathcal{N}(\boldsymbol{\mu}_a, \boldsymbol{\Sigma}_{aa}), \qquad \mathbf{x}_b \sim \mathcal{N}(\boldsymbol{\mu}_b, \boldsymbol{\Sigma}_{bb})$$
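The sketch below (a hypothetical 2-D example) checks empirically that the marginal of one coordinate has exactly the mean and variance read off from the corresponding blocks:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical joint Gaussian over (x_a, x_b), each 1-D here
mu = np.array([0.0, 1.0])              # [mu_a, mu_b]
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])         # [[S_aa, S_ab], [S_ba, S_bb]]

samples = rng.multivariate_normal(mu, Sigma, size=200000)

# Marginalizing x_a just reads off the corresponding blocks: N(mu_a, S_aa)
print(samples[:, 0].mean())   # ~0.0 (mu_a)
print(samples[:, 0].var())    # ~2.0 (Sigma_aa)
```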
Conditioning property
Similarly, the conditional for the above distribution is given as another Gaussian
$$p(\mathbf{x}_a \mid \mathbf{x}_b) = \mathcal{N}(\mathbf{x}_a \mid \boldsymbol{\mu}_{a \mid b}, \boldsymbol{\Sigma}_{a \mid b})$$
with
$$\boldsymbol{\mu}_{a \mid b} = \boldsymbol{\mu}_a + \boldsymbol{\Sigma}_{ab} \boldsymbol{\Sigma}_{bb}^{-1} (\mathbf{x}_b - \boldsymbol{\mu}_b), \qquad \boldsymbol{\Sigma}_{a \mid b} = \boldsymbol{\Sigma}_{aa} - \boldsymbol{\Sigma}_{ab} \boldsymbol{\Sigma}_{bb}^{-1} \boldsymbol{\Sigma}_{ba}$$
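A small sketch computing the conditional moments from these formulas, with invented block values:

```python
import numpy as np

# Block partition of a hypothetical joint Gaussian over (x_a, x_b)
mu_a, mu_b = np.array([0.0]), np.array([1.0])
S_aa = np.array([[2.0]])
S_ab = np.array([[0.6]])
S_bb = np.array([[1.0]])

x_b = np.array([2.0])   # observed value of x_b

# Conditional mean and covariance from the formulas above
S_bb_inv = np.linalg.inv(S_bb)
mu_cond = mu_a + S_ab @ S_bb_inv @ (x_b - mu_b)
S_cond = S_aa - S_ab @ S_bb_inv @ S_ab.T

print(mu_cond)  # [0.6]
print(S_cond)   # [[1.64]]
```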
Summation property
The sum of two independent Gaussian random variables is also a Gaussian random variable.
If
$$x \sim \mathcal{N}(\mu, \Sigma), \qquad y \sim \mathcal{N}(\mu', \Sigma')$$
Then $z=x+y \quad \rightarrow \quad z \sim \mathcal{N}\left(\mu+\mu^{\prime}, \Sigma+\Sigma^{\prime}\right)$
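A quick empirical check of this property, with invented parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two independent hypothetical Gaussians
x = rng.multivariate_normal([0.0, 0.0], np.eye(2), size=200000)
y = rng.multivariate_normal([1.0, -1.0], 2 * np.eye(2), size=200000)

z = x + y   # sum of independent Gaussians

# Moments match mu + mu' and Sigma + Sigma'
print(z.mean(axis=0))            # ~[1, -1]
print(np.cov(z, rowvar=False))   # ~3 * I
```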
Some useful results
The derivative of the density with respect to $\boldsymbol{\mu}_k$ (when the covariance matrix is positive definite) is given by
$$\frac{\partial \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\partial \boldsymbol{\mu}_k} = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \, \boldsymbol{\Sigma}_k^{-1} (\mathbf{x} - \boldsymbol{\mu}_k)$$
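The sketch below verifies this identity against central finite differences (all parameter values are invented for illustration):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical component parameters
mu_k = np.array([0.5, -0.5])
Sigma_k = np.array([[1.5, 0.3],
                    [0.3, 1.0]])
x = np.array([1.0, 0.2])

pdf = lambda m: multivariate_normal(mean=m, cov=Sigma_k).pdf(x)

# Analytic gradient: N(x | mu_k, Sigma_k) * Sigma_k^{-1} (x - mu_k)
grad_analytic = pdf(mu_k) * np.linalg.inv(Sigma_k) @ (x - mu_k)

# Central finite differences for comparison
eps = 1e-6
grad_fd = np.array([
    (pdf(mu_k + eps * e) - pdf(mu_k - eps * e)) / (2 * eps)
    for e in np.eye(2)
])
print(np.allclose(grad_analytic, grad_fd))  # True
```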
Limitations of the Gaussian Distribution
- Very sensitive to outliers, since the maximum-likelihood estimates of the mean and covariance are strongly affected by them.
- Unimodal, so it cannot capture multimodal data on its own (this can be handled with latent variables, giving rise to the Gaussian Mixture Model).
References
- CS229 section notes, *The Multivariate Gaussian Distribution*: http://cs229.stanford.edu/section/gaussians.pdf
- Bishop, C. M. (2006). *Pattern Recognition and Machine Learning*, Section 2.3.