Probabilistic Generative Models

A model is generative if it places a joint distribution over all observed dimensions of the data. In generative models, we model the class-conditional densities $p\left(\mathbf{x} \mid \mathcal{C}_{k}\right)$ and the class priors $p\left(\mathcal{C}_{k}\right)$, and then use these to compute the posterior probabilities $p\left(\mathcal{C}_{k} \mid \mathbf{x}\right)$ through Bayes' theorem.

Let's first consider binary classification, i.e. $K=2$.

Class-conditional density: $p\left(\mathbf{x} \mid C_{k}\right)$
Prior class probabilities: $p\left(C_{k}\right)$
Joint distribution: $p\left(\mathbf{x}, C_{k}\right)=p\left(\mathbf{x} \mid C_{k}\right) p\left(C_{k}\right)$

The posterior distribution is given by

$$ \begin{align} p\left(C_{1} \mid \mathbf{x}\right) &=\frac{p\left(\mathbf{x} \mid C_{1}\right) p\left(C_{1}\right)}{p\left(\mathbf{x} \mid C_{1}\right) p\left(C_{1}\right)+p\left(\mathbf{x} \mid C_{2}\right) p\left(C_{2}\right)} \\ &= \frac{1}{1+\frac{p\left(\mathbf{x} \mid C_{2}\right) p\left(C_{2}\right)}{p\left(\mathbf{x} \mid C_{1}\right) p\left(C_{1}\right)}} \\ &= \frac{1}{1+e^{-a}} \\ &= \sigma(a) \end{align} $$

where $\sigma(a)$ is the logistic sigmoid function and $a=\ln \frac{p\left(\mathbf{x} \mid C_{1}\right) p\left(C_{1}\right)}{p\left(\mathbf{x} \mid C_{2}\right) p\left(C_{2}\right)}$. Equivalently, $a=\ln \frac{\sigma}{1-\sigma}$; this log of the ratio of probabilities is known as the log odds.
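
As a quick illustration, here is a minimal Python sketch of this computation. The function name and the choice to pass log densities (which is numerically safer than raw densities) are my own, not from the text:

```python
import math

def posterior_c1(log_px_c1, log_px_c2, prior_c1):
    """p(C1|x) = sigma(a), where a is the log odds.

    log_px_c1, log_px_c2: ln p(x|C1) and ln p(x|C2) for a single point x.
    prior_c1: p(C1); the other prior is p(C2) = 1 - p(C1).
    """
    a = (log_px_c1 + math.log(prior_c1)) - (log_px_c2 + math.log(1.0 - prior_c1))
    return 1.0 / (1.0 + math.exp(-a))  # sigma(a)
```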

For multiple classes, i.e. general $K$,

$$ p\left(C_{k} \mid \mathbf{x}\right)=\frac{p\left(\mathbf{x} \mid C_{k}\right) p\left(C_{k}\right)}{\sum_{j=1}^{K} p\left(\mathbf{x} \mid C_{j}\right) p\left(C_{j}\right)} = \frac{\exp(a_k)}{\sum_{j=1}^{K}\exp(a_j)} $$

which is known as the softmax function, with $a_{k}=\ln \left(p\left(\mathbf{x} \mid C_{k}\right) p\left(C_{k}\right)\right)$.
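
A hedged sketch of the softmax computation in Python. Subtracting the maximum before exponentiating is a standard numerical trick (the log-sum-exp shift) and is my addition, not part of the derivation:

```python
import numpy as np

def class_posteriors(log_joint):
    """Softmax over a_k = ln(p(x|C_k) p(C_k)), given as a length-K array."""
    a = np.asarray(log_joint, dtype=float)
    a = a - a.max()          # shift for numerical stability; posteriors are unchanged
    e = np.exp(a)
    return e / e.sum()
```

For example, `class_posteriors([-3.2, -1.0, -2.5])` returns a length-3 vector of posteriors summing to one.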

Continuous inputs: Linear Discriminant Analysis

Let's assume the inputs are continuous and use a Gaussian distribution to model the class-conditional densities:

$$ p\left(\mathbf{x} \mid C_{k}\right)=\frac{1}{(2 \pi)^{D / 2}} \frac{1}{\left|\mathbf{\Sigma}_{k}\right|^{1 / 2}} \exp \left\{-\frac{1}{2}\left(\mathbf{x}-\boldsymbol{\mu}_{k}\right)^{T} \mathbf{\Sigma}_{k}^{-1}\left(\mathbf{x}-\boldsymbol{\mu}_{k}\right)\right\} $$

Assuming the classes share a covariance matrix, $\mathbf{\Sigma}_k = \mathbf{\Sigma}$, this probabilistic generative model is called Linear Discriminant Analysis (LDA).

For the case of $K=2$, the class posterior is given by

$$ p\left(C_{1} \mid \mathbf{x}\right)=\frac{1}{1+\exp (-a)}=\sigma(a) $$

where

$$ \begin{align} a&=\ln \frac{p\left(\mathbf{x} \mid C_{1}\right) p\left(C_{1}\right)}{p\left(\mathbf{x} \mid C_{2}\right) p\left(C_{2}\right)} \\ &= \ln \mathcal{N}\left(\mathbf{x} \mid \boldsymbol{\mu}_{1}, \mathbf{\Sigma}\right)-\ln \mathcal{N}\left(\mathbf{x} \mid \boldsymbol{\mu}_{2}, \mathbf{\Sigma}\right)+\ln \frac{p\left(C_{1}\right)}{p\left(C_{2}\right)} \\ &= -\frac{1}{2} \ln |\mathbf{\Sigma}|-\frac{1}{2}\left(\mathbf{x}-\boldsymbol{\mu}_{1}\right)^{T} \mathbf{\Sigma}^{-1}\left(\mathbf{x}-\boldsymbol{\mu}_{1}\right)+\frac{1}{2} \ln |\mathbf{\Sigma}|+\frac{1}{2}\left(\mathbf{x}-\boldsymbol{\mu}_{2}\right)^{T} \mathbf{\Sigma}^{-1}\left(\mathbf{x}-\boldsymbol{\mu}_{2}\right)+\ln \frac{p\left(C_{1}\right)}{p\left(C_{2}\right)} \\ &= \left(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2\right)^T\mathbf{\Sigma}^{-1}\mathbf{x} - \frac{1}{2}\boldsymbol{\mu}_1^T\mathbf{\Sigma}^{-1}\boldsymbol{\mu}_1 + \frac{1}{2}\boldsymbol{\mu}_2^T\mathbf{\Sigma}^{-1}\boldsymbol{\mu}_2 + \ln \frac{p\left(C_{1}\right)}{p\left(C_{2}\right)} \end{align} $$

Writing this in the form of a generalized linear model, we have

$$ p\left(C_{1} \mid \mathbf{x}\right)=\sigma\left(\mathbf{w}^{T} \mathbf{x}+w_{0}\right) $$

where

$$ \begin{array}{l} \mathbf{w}=\mathbf{\Sigma}^{-1}\left(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2}\right) \\ w_{0}=-\frac{1}{2} \boldsymbol{\mu}_{1}^{T} \mathbf{\Sigma}^{-1} \boldsymbol{\mu}_{1}+\frac{1}{2} \boldsymbol{\mu}_{2}^{T} \mathbf{\Sigma}^{-1} \boldsymbol{\mu}_{2}+\ln \frac{p\left(C_{1}\right)}{p\left(C_{2}\right)} \end{array} $$

We see that the quadratic terms in $\mathbf{x}$ from the exponents of the Gaussian densities have cancelled due to the assumption of a common covariance matrix, leaving a linear function of $\mathbf{x}$ in the argument of the logistic sigmoid. The resulting decision boundaries correspond to surfaces along which the posterior probabilities $p\left(\mathcal{C}_{k} \mid \mathbf{x}\right)$ are constant; since these are given by linear functions of $\mathbf{x}$, the decision boundaries are linear in input space. The prior probabilities enter only through the bias parameter $w_0$, so changes in the priors have the effect of making parallel shifts of the decision boundary and, more generally, of the parallel contours of constant posterior probability.
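
The mapping from the Gaussian parameters to $(\mathbf{w}, w_0)$ is easy to write down directly. A minimal NumPy sketch (the function name is illustrative, and a production implementation would solve a linear system rather than invert $\mathbf{\Sigma}$ explicitly):

```python
import numpy as np

def lda_glm_params(mu1, mu2, Sigma, prior1):
    """w and w0 such that p(C1|x) = sigmoid(w @ x + w0), shared covariance Sigma."""
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu1 - mu2)
    w0 = (-0.5 * mu1 @ Sigma_inv @ mu1
          + 0.5 * mu2 @ Sigma_inv @ mu2
          + np.log(prior1 / (1.0 - prior1)))  # priors enter only through the bias
    return w, w0
```

Note that changing `prior1` only changes `w0`, which is exactly the parallel shift of the decision boundary described above.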


For the general case of $K$ classes we have

$$ p\left(\mathbf{x} \mid C_{k}\right)=\frac{1}{(2 \pi)^{D / 2}} \frac{1}{|\mathbf{\Sigma}|^{1 / 2}} \exp \left\{-\frac{1}{2}\left(\mathbf{x}-\boldsymbol{\mu}_{k}\right)^{T} \mathbf{\Sigma}^{-1}\left(\mathbf{x}-\boldsymbol{\mu}_{k}\right)\right\} $$
$$ p\left(C_{k} \mid \mathbf{x}\right)=\frac{\exp \left(a_{k}(\mathbf{x})\right)}{\sum_{j=1}^{K} \exp \left(a_{j}(\mathbf{x})\right)} $$
$$ a_{k}(\mathbf{x})=\mathbf{w}_{k}^{\mathrm{T}} \mathbf{x}+w_{k 0} $$

where we have

$$ \begin{aligned} \mathbf{w}_{k} &=\boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_{k} \\ w_{k 0} &=-\frac{1}{2} \boldsymbol{\mu}_{k}^{\mathrm{T}} \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_{k}+\ln p\left(\mathcal{C}_{k}\right) \end{aligned} $$

The decision boundary between classes $C_k$ and $C_j$ is where $p\left(C_{k} \mid \mathbf{x}\right)=p\left(C_{j} \mid \mathbf{x}\right)$, i.e. $a_{k}(\mathbf{x})=a_{j}(\mathbf{x})$, which is again linear in $\mathbf{x}$.

If we relax the assumption of a shared covariance matrix and allow each class-conditional density $p(\mathbf{x}|\mathcal{C}_k)$ to have its own covariance matrix $\boldsymbol{\Sigma}_{k}$, then the cancellations do not occur and we obtain quadratic terms in $\mathbf{x}$, giving rise to a quadratic discriminant (QDA).

(Figure: LDA gives linear decision boundaries, QDA quadratic ones.)
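
To see the linear-versus-quadratic difference empirically, one option (assuming scikit-learn is available; the synthetic data below is made up purely for illustration) is:

```python
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)

rng = np.random.default_rng(0)
# two Gaussian classes with different covariances, so a quadratic boundary helps
X1 = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.3], [0.3, 1.0]], size=200)
X2 = rng.multivariate_normal([2.0, 2.0], [[0.3, 0.0], [0.0, 2.0]], size=200)
X = np.vstack([X1, X2])
y = np.array([0] * 200 + [1] * 200)

for model in (LinearDiscriminantAnalysis(), QuadraticDiscriminantAnalysis()):
    print(type(model).__name__, model.fit(X, y).score(X, y))
```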

Maximum Likelihood Solution

Let's use maximum likelihood to estimate $\boldsymbol{\mu}_k$, $\mathbf{\Sigma}$ and the priors $p(C_k)$.
Denote $p\left(C_{1}\right)=\pi$ and $p\left(C_{2}\right)=1-\pi$.
For this binary classification problem we use targets $t_n \in \{0, 1\}$, with $t_n = 1$ denoting class $C_1$ and $t_n = 0$ denoting class $C_2$, so that $t_n$ selects the appropriate class for each data point. Then for $C_1$ we have

$$ p\left(\mathbf{x}_{n}, \mathcal{C}_{1}\right)=p\left(\mathcal{C}_{1}\right) p\left(\mathbf{x}_{n} \mid \mathcal{C}_{1}\right)=\pi \mathcal{N}\left(\mathbf{x}_{n} \mid \boldsymbol{\mu}_{1}, \mathbf{\Sigma}\right) $$

And for $C_2$

$$ p\left(\mathbf{x}_{n}, \mathcal{C}_{2}\right)=p\left(\mathcal{C}_{2}\right) p\left(\mathbf{x}_{n} \mid \mathcal{C}_{2}\right)=(1-\pi) \mathcal{N}\left(\mathbf{x}_{n} \mid \boldsymbol{\mu}_{2}, \mathbf{\Sigma}\right) $$

Thus the likelihood is given by

$$ \begin{align} p\left(\mathbf{t}, \mathbf{X} \mid \pi, \boldsymbol{\mu}_{1}, \boldsymbol{\mu}_{2}, \mathbf{\Sigma}\right) &= \prod_{n=1}^{N}\left[\pi \mathcal{N}\left(\mathbf{x}_{n} \mid \boldsymbol{\mu}_{1}, \mathbf{\Sigma}\right)\right]^{t_{n}}\left[(1-\pi) \mathcal{N}\left(\mathbf{x}_{n} \mid \boldsymbol{\mu}_{2}, \mathbf{\Sigma}\right)\right]^{1-t_{n}} \end{align} $$

Taking the log of the likelihood,

$$ \begin{align} \ln p\left(\mathbf{t}, \mathbf{X} \mid \pi, \boldsymbol{\mu}_{1}, \boldsymbol{\mu}_{2}, \mathbf{\Sigma}\right) &= \sum_{n=1}^{N}\left[ t_{n} \ln \pi+t_{n} \ln \mathcal{N}\left(\mathbf{x}_{n} \mid \boldsymbol{\mu}_{1}, \mathbf{\Sigma}\right)+ \left(1-t_{n}\right) \ln (1-\pi)+\left(1-t_{n}\right) \ln \mathcal{N}\left(\mathbf{x}_{n} \mid \boldsymbol{\mu}_{2}, \mathbf{\Sigma}\right)\right] \end{align} $$

Taking only the terms that depend on $\pi$, we get

$$ \sum_{n=1}^{N}\left\{t_{n} \ln \pi+\left(1-t_{n}\right) \ln (1-\pi)\right\} $$

Taking the derivative with respect to $\pi$, setting it to zero and rearranging,

$$ \pi_{ML}=\frac{1}{N} \sum_{n=1}^{N} t_{n}=\frac{N_{1}}{N}=\frac{N_{1}}{N_{1}+N_{2}} $$

where $N_1$ is the total number of points in class $C_1$, and similarly for $N_2$.

Now to estimate $\boldsymbol{\mu}_1$, let's pick out the terms dependent on $\boldsymbol{\mu}_1$ from the log likelihood function and take the derivative:

$$ \begin{align} \frac{\partial}{\partial \boldsymbol{\mu}_{1}} \sum_{n=1}^{N} t_{n} \ln \mathcal{N}\left(\mathbf{x}_{n} \mid \boldsymbol{\mu}_{1}, \mathbf{\Sigma}\right) &= -\frac{1}{2} \frac{\partial}{\partial \boldsymbol{\mu}_{1}} \sum_{n=1}^{N} t_{n}\left(\mathbf{x}_{n}-\boldsymbol{\mu}_{1}\right)^{T} \boldsymbol{\Sigma}^{-1}\left(\mathbf{x}_{n}-\boldsymbol{\mu}_{1}\right) \end{align} $$

Setting this to zero and solving, we have

$$ \boldsymbol{\mu}_{1}=\frac{1}{N_{1}} \sum_{n=1}^{N} t_{n} \mathbf{x}_{n} $$

Similarly, $\boldsymbol{\mu}_2$ is given by

$$ \boldsymbol{\mu}_{2}=\frac{1}{N_{2}} \sum_{n=1}^{N}\left(1-t_{n}\right) \mathbf{x}_{n} $$

For the maximum likelihood solution of the shared covariance matrix $\mathbf{\Sigma}$, we have

$$ \begin{align} \frac{\partial}{\partial \boldsymbol{\Sigma}} \ln p\left(\mathbf{t}, \mathbf{X} \mid \pi, \boldsymbol{\mu}_{1}, \boldsymbol{\mu}_{2}, \mathbf{\Sigma}\right)&=\frac{\partial}{\partial \boldsymbol{\Sigma}}\left[-\frac{N}{2} \ln |\boldsymbol{\Sigma}|-\frac{1}{2} \sum_{n=1}^{N} t_{n}\left(\mathbf{x}_{n}-\boldsymbol{\mu}_{1}\right)^{T} \boldsymbol{\Sigma}^{-1}\left(\mathbf{x}_{n}-\boldsymbol{\mu}_{1}\right)\right.\\ &\left.\quad-\frac{1}{2} \sum_{n=1}^{N}\left(1-t_{n}\right)\left(\mathbf{x}_{n}-\boldsymbol{\mu}_{2}\right)^{T} \boldsymbol{\Sigma}^{-1}\left(\mathbf{x}_{n}-\boldsymbol{\mu}_{2}\right)\right] = 0 \end{align} $$

Rearranging, we obtain $\boldsymbol{\Sigma}_{\mathrm{ML}}=\mathbf{S}$, a weighted average of the two per-class sample covariances:

$$ \begin{aligned} \mathbf{S} &=\frac{N_{1}}{N} \mathbf{S}_{1}+\frac{N_{2}}{N} \mathbf{S}_{2} \\ \mathbf{S}_{1} &=\frac{1}{N_{1}} \sum_{n \in \mathcal{C}_{1}}\left(\mathbf{x}_{n}-\boldsymbol{\mu}_{1}\right)\left(\mathbf{x}_{n}-\boldsymbol{\mu}_{1}\right)^{\mathrm{T}} \\ \mathbf{S}_{2} &=\frac{1}{N_{2}} \sum_{n \in \mathcal{C}_{2}}\left(\mathbf{x}_{n}-\boldsymbol{\mu}_{2}\right)\left(\mathbf{x}_{n}-\boldsymbol{\mu}_{2}\right)^{\mathrm{T}} \end{aligned} $$
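
Since all four estimators have closed forms, the whole fit is a few lines of NumPy. A sketch under the $t_n \in \{0,1\}$ convention above (the function name is mine):

```python
import numpy as np

def fit_lda_ml(X, t):
    """Closed-form ML estimates pi, mu1, mu2 and the pooled covariance S.

    X: (N, D) array of inputs; t: (N,) integer array, t_n = 1 for C1, 0 for C2.
    """
    N = len(t)
    N1, N2 = t.sum(), N - t.sum()
    pi = N1 / N                                     # pi_ML = N1 / N
    mu1 = X[t == 1].mean(axis=0)                    # class C1 mean
    mu2 = X[t == 0].mean(axis=0)                    # class C2 mean
    S1 = (X[t == 1] - mu1).T @ (X[t == 1] - mu1) / N1
    S2 = (X[t == 0] - mu2).T @ (X[t == 0] - mu2) / N2
    Sigma = (N1 / N) * S1 + (N2 / N) * S2           # S = (N1/N) S1 + (N2/N) S2
    return pi, mu1, mu2, Sigma
```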

Prediction

For a new data point $\mathbf{x}^{\prime}$,

$$ \begin{align} p\left(C_{1} \mid \mathbf{x}^{\prime}\right)=\sigma\left(\mathbf{w}_{\mathrm{ML}}^{T} \mathbf{x}^{\prime}+w_{0, \mathrm{ML}}\right) \end{align} $$

where,

$$ \begin{aligned} \mathbf{w}_{\mathrm{ML}} &=\mathbf{\Sigma}_{\mathrm{ML}}^{-1}\left(\boldsymbol{\mu}_{1, \mathrm{ML}}-\boldsymbol{\mu}_{2, \mathrm{ML}}\right) \\ w_{0, \mathrm{ML}} &=-\frac{1}{2} \boldsymbol{\mu}_{1, \mathrm{ML}}^{T} \mathbf{\Sigma}_{\mathrm{ML}}^{-1} \boldsymbol{\mu}_{1, \mathrm{ML}}+\frac{1}{2} \boldsymbol{\mu}_{2, \mathrm{ML}}^{T} \mathbf{\Sigma}_{\mathrm{ML}}^{-1} \boldsymbol{\mu}_{2, \mathrm{ML}}+\ln \frac{\pi_{\mathrm{ML}}}{1-\pi_{\mathrm{ML}}} \end{aligned} $$

and we assign $\mathbf{x}^{\prime}$ to $C_1$ if $p\left(C_{1} \mid \mathbf{x}^{\prime}\right) \geq \frac{1}{2}$.
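
Putting the ML estimates together, prediction might look like the following sketch (again, the names are illustrative, not a fixed API):

```python
import numpy as np

def predict_c1(x_new, pi, mu1, mu2, Sigma):
    """Return p(C1|x') and the hard assignment (1 means C1)."""
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu1 - mu2)
    w0 = (-0.5 * mu1 @ Sigma_inv @ mu1 + 0.5 * mu2 @ Sigma_inv @ mu2
          + np.log(pi / (1.0 - pi)))
    p = 1.0 / (1.0 + np.exp(-(w @ x_new + w0)))     # sigma(w^T x' + w0)
    return p, int(p >= 0.5)                         # assign to C1 if p >= 1/2
```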

Disadvantages of LDA

  1. Gaussian distribution is sensitive to outliers
  2. Linearity/handcrafted features restrict application
  3. Maximum likelihood is prone to overfitting

Discrete inputs: Naive Bayes

Let's consider discrete feature vectors $\mathbf{x}_{n}=\left(x_{1}, \ldots, x_{D}\right)^{T}$ with $x_{i} \in\{0,1\}$. For a $D$-dimensional input, a general distribution over the $2^D$ possible feature vectors requires $2^{D}-1$ independent parameters per class, since it must capture all the interactions between the features.

The likelihood can be expressed as

$$ \begin{align} p(\mathbf{T}, \mathbf{X} \mid \lambda)&=\prod_{n=1}^{N} p\left(t_{n}, \mathbf{x}_{n} \mid \lambda\right) \\ &=\prod_{n=1}^{N} p\left(t_{n} \mid \lambda\right) \cdot p\left(\mathbf{x}_{n} \mid t_{n}, \lambda\right) \\ &=\prod_{n=1}^{N} \prod_{k=1}^{K}\left[p\left(\mathcal{C}_{k} \mid \lambda\right) \cdot p\left(\mathbf{x}_{n} \mid \mathcal{C}_{k}, \lambda\right)\right]^{t_{n k}} \end{align} $$

To obtain a more restricted representation, we make the naive Bayes assumption that the feature values are independent when conditioned on the class $C_k$:

$$ \begin{align} &=\prod_{n=1}^{N} \prod_{k=1}^{K}\left[p\left(\mathcal{C}_{k}\right) \cdot \prod_{d=1}^{D} p\left(x_{n d} \mid \mathcal{C}_{k}, \lambda_{d k}\right)\right]^{t_{n k}} \\ &= \prod_{n=1}^{N} \prod_{k=1}^{K}\left[\pi_{k} \cdot \prod_{d=1}^{D} p\left(x_{n d} \mid \mathcal{C}_{k}, \lambda_{d k}\right)\right]^{t_{n k}} \end{align} $$

In this case, the number of parameters per class is $D$, so the total number of parameters is $K \cdot D$. Note again that the features are not marginally independent; they are independent only when conditioned on the class.

For binary features, we can model the class-conditional densities with Bernoulli distributions:

$$ p\left(\mathbf{x} \mid \mathcal{C}_{k}\right)=\prod_{i=1}^{D} \mu_{k i}^{x_{i}}\left(1-\mu_{k i}\right)^{1-x_{i}} $$

This gives a posterior class probability of the form

$$ p\left(C_{k} \mid \mathbf{x}\right)=\frac{\exp \left(a_{k}(\mathbf{x})\right)}{\sum_{j=1}^{K} \exp \left(a_{j}(\mathbf{x})\right)} $$

where,

$$ \begin{align} a_{k}(\mathbf{x})&=\ln p\left(\mathbf{x} \mid C_{k}\right) p\left(C_{k}\right)\\ &= \sum_{i=1}^{D}\left\{x_{i} \ln \mu_{k i}+\left(1-x_{i}\right) \ln \left(1-\mu_{k i}\right)\right\}+\ln p\left(\mathcal{C}_{k}\right) \end{align} $$
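
A compact NumPy sketch of this Bernoulli naive Bayes model. The Laplace smoothing term `alpha` is my addition (it avoids $\ln 0$ for features never seen in a class) and is not part of the derivation above:

```python
import numpy as np

def fit_bernoulli_nb(X, T, alpha=1.0):
    """Estimate class priors pi_k and mu_ki = p(x_i = 1 | C_k).

    X: (N, D) binary matrix; T: (N, K) one-hot class indicators.
    """
    Nk = T.sum(axis=0)                                   # points per class
    pi = Nk / len(X)                                     # p(C_k)
    mu = (T.T @ X + alpha) / (Nk[:, None] + 2 * alpha)   # smoothed mu_ki
    return pi, mu

def nb_posterior(x, pi, mu):
    """Softmax over a_k(x) = sum_i [x_i ln mu_ki + (1 - x_i) ln(1 - mu_ki)] + ln pi_k."""
    a = x @ np.log(mu).T + (1 - x) @ np.log(1 - mu).T + np.log(pi)
    a = a - a.max()                                      # numerical stability
    e = np.exp(a)
    return e / e.sum()
```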

References

  1. Bishop, C. M., *Pattern Recognition and Machine Learning*, Section 4.2.
  2. MLSS 2018 slides: http://mlss2018.net.ar/slides/Adams-1.pdf