Maximum A Posteriori

With MAP, we fit the parameters of a distribution by maximizing the posterior probability of the model weights given the observed data. This is formalized as:

$$ \mathbf{w}_{MAP} = \underset{\mathbf{w}}{\arg \max }\ p(\mathbf{w}|D) $$

where $p(\mathbf{w}|D)$ is the posterior distribution.

In maximum likelihood estimation (MLE), we choose $\mathbf{w}$ such that the data likelihood $p(D|\mathbf{w})$ is maximized. In MAP, by contrast, we choose the most probable $\mathbf{w}$ given the data, i.e., the one that maximizes the posterior.

Dataset: $D = \{x,t\}$

Model: $p(t \mid x, \mathbf{w}, \beta)=\mathcal{N}\left(t \mid y(x, \mathbf{w}), \beta^{-1}\right)=\sqrt{\frac{\beta}{2 \pi}} \exp \left[-\frac{\beta}{2}(t-y(x, \mathbf{w}))^{2}\right]$

Given a prior $p(\mathbf{w}|\alpha)$, the posterior distribution is given by Bayes' theorem as:

$$ \begin{align} p(\mathbf{w} \mid \mathbf{x}, \mathbf{t}, \beta, \alpha)= \frac{p(\mathbf{t}|\mathbf{x},\mathbf{w}, \beta)\, p(\mathbf{w}|\alpha)}{p(\mathbf{t}|\mathbf{x},\beta,\alpha)} \end{align} $$

Note that the denominator does not depend on $\mathbf{w}$.

Now MAP is formulated as:

$$ \begin{align} \mathbf{w}_{M A P}=\underset{\mathbf{w}}{\operatorname{argmax}}\ p(\mathbf{w} \mid \mathbf{x}, \mathbf{t}, \beta, \alpha)&=\underset{\mathbf{w}}{\operatorname{argmax} }\ \log p\left(\mathbf{w} \mid \mathbf{x}, \mathbf{t}, \beta, \alpha\right)\\ &= \underset{\mathbf{w}}{\operatorname{argmax} } \log p(\mathbf{t}|\mathbf{x},\mathbf{w},\beta) + \log p(\mathbf{w}|\alpha) - \log p(\mathbf{t}|\mathbf{x},\beta,\alpha) \end{align} $$

The third term does not depend on $\mathbf{w}$, so it does not affect the maximizer. Thus,

$$ \mathbf{w}_{M A P}= \underset{\mathbf{w}}{\operatorname{argmax} } \log p(t|x,\mathbf{w},\beta) + \log p(\mathbf{w}|\alpha) $$

Maximum A Posteriori Estimation for Gaussian Distributions

Let's model the prior as a zero-mean isotropic Gaussian over the $M$ weights:

$$ \begin{align} p(\mathbf{w} \mid \alpha)&=\prod_{i=1}^{M} \mathcal{N}\left(w_{i} \mid 0, \alpha^{-1}\right)\\ &=\left(\frac{\alpha}{2 \pi}\right)^{\frac{M}{2}} \prod_{i=1}^{M} e^{-\frac{\alpha}{2} w_{i}^{2}}\\ &=\left(\frac{\alpha}{2 \pi}\right)^{\frac{M}{2}} e^{-\frac{\alpha}{2} \mathbf{w}^{\top} \mathbf{w}} \end{align} $$
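As a quick numerical sanity check of this factorization (with an arbitrary illustrative weight vector and precision $\alpha$), the product of univariate Gaussian densities should match the closed form $(\alpha/2\pi)^{M/2} e^{-\frac{\alpha}{2}\mathbf{w}^\top\mathbf{w}}$:

```python
import numpy as np

alpha = 2.0                       # prior precision (assumed value for illustration)
w = np.array([0.3, -1.2, 0.7])    # arbitrary weight vector, M = 3
M = w.size

# Product of M univariate densities N(w_i | 0, alpha^{-1})
univariate = np.prod(np.sqrt(alpha / (2 * np.pi)) * np.exp(-0.5 * alpha * w**2))

# Closed form: (alpha / 2 pi)^{M/2} exp(-alpha/2 * w^T w)
closed_form = (alpha / (2 * np.pi)) ** (M / 2) * np.exp(-0.5 * alpha * (w @ w))

print(np.isclose(univariate, closed_form))  # the two expressions agree
```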

Substituting this prior into the MAP objective,

$$ \begin{align} \mathbf{w}_{M A P}&= \underset{\mathbf{w}}{\operatorname{argmax} } \log p(t|x,\mathbf{w},\beta) + \log p(\mathbf{w}|\alpha)\\ &= \underset{\mathbf{w}}{\operatorname{argmin} } -\log p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta)-\log p(\mathbf{w} \mid \alpha)\\ &= \underset{\mathbf{w}}{\operatorname{argmin} } -\log p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) + \frac{\alpha}{2}\mathbf{w}^T\mathbf{w} \end{align} $$

Modeling the data distribution as Gaussian,

$$ p(t \mid x, \mathbf{w}, \beta)=\sqrt{\frac{\beta}{2 \pi}} \exp \left[-\frac{\beta}{2}(t-y(x, \mathbf{w}))^{2}\right] $$

Thus,

$$ \begin{align} \mathbf{w}_{M A P}&=\underset{\mathbf{w}}{\operatorname{argmin} } \frac{\beta}{2}\sum_{i=1}^N \left(t_i - y(x_i,\mathbf{w})\right)^2 + \frac{\alpha}{2}\mathbf{w}^T\mathbf{w} \end{align} $$

Therefore, MAP reduces to minimizing a quadratic (sum-of-squares) loss plus a quadratic penalty on the weights, i.e., L2-regularized (ridge) least squares with regularization coefficient $\lambda = \alpha/\beta$.
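For a linear-in-parameters model $y(x,\mathbf{w}) = \mathbf{w}^\top\boldsymbol{\phi}(x)$ with design matrix $\Phi$, this objective has the closed-form minimizer $\mathbf{w}_{MAP} = (\Phi^\top\Phi + \frac{\alpha}{\beta}I)^{-1}\Phi^\top\mathbf{t}$. A minimal sketch, assuming polynomial basis functions and synthetic data (the target function, precisions, and degree are illustrative choices, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: noisy samples of sin(2 pi x) (illustrative choice)
N = 20
x = rng.uniform(0, 1, N)
beta = 25.0                               # noise precision (assumed)
t = np.sin(2 * np.pi * x) + rng.normal(0, 1 / np.sqrt(beta), N)

# Polynomial design matrix Phi with columns x^0, ..., x^{M-1}
M = 6
Phi = np.vander(x, M, increasing=True)

alpha = 1e-3                              # prior precision (assumed)

# MAP / ridge solution: minimizes beta/2 ||t - Phi w||^2 + alpha/2 w^T w
w_map = np.linalg.solve(Phi.T @ Phi + (alpha / beta) * np.eye(M), Phi.T @ t)

# Unregularized MLE (least-squares) solution, for comparison
w_mle = np.linalg.lstsq(Phi, t, rcond=None)[0]

print(np.linalg.norm(w_map), np.linalg.norm(w_mle))  # the prior shrinks the weights
```

Note that only the ratio $\alpha/\beta$ enters the solution, which is exactly the ridge coefficient $\lambda$ above.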

The predictive distribution is given by,

$$ p\left(t^{\prime} \mid x^{\prime}, \mathbf{w}_{\mathrm{MAP}}, \beta\right)=\mathcal{N}\left(t^{\prime} \mid y\left(x^{\prime}, \mathbf{w}_{\mathrm{MAP}}\right), \beta^{-1}\right) $$

To obtain a point estimate from the predictive distribution, we take its expected value:

$$ \mathbb{E}\left[t^{\prime} \mid x^{\prime}, \mathbf{w}_{\mathrm{MAP}}, \beta\right]=y\left(x^{\prime}, \mathbf{w}_{\mathrm{MAP}}\right) $$
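Since the predictive distribution is Gaussian, its mean, mode, and median all coincide at $y(x', \mathbf{w}_{MAP})$, and the predictive standard deviation is $\beta^{-1/2}$. A small illustration, with assumed (hypothetical) MAP weights for a degree-2 polynomial model:

```python
import numpy as np

# Assumed MAP weights for y(x, w) = w0 + w1*x + w2*x^2 (illustrative values)
w_map = np.array([0.5, -1.0, 2.0])
beta = 10.0                        # noise precision (assumed)

def y(x, w):
    """Polynomial model w0 + w1*x + ... evaluated at x."""
    return np.polyval(w[::-1], x)  # polyval expects highest degree first

x_new = 0.3
mean = y(x_new, w_map)             # E[t' | x', w_MAP, beta]
std = 1 / np.sqrt(beta)            # predictive standard deviation beta^{-1/2}

print(mean, std)                   # point estimate and its predictive spread
```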