Maximum A Posteriori
Created November 11, 2020 · Updated November 24, 2025
With MAP, we fit a parametric distribution by maximizing the posterior probability of the model weights given the observed data. This is formalized as:
$$
\mathbf{w}_{MAP} = \underset{\mathbf{w}}{\arg \max }\ p(\mathbf{w}|D)
$$
where $p(\mathbf{w}|D)$ is the posterior distribution.
In Maximum Likelihood Estimation, we choose $\mathbf{w}$ such that the data likelihood $p(D|\mathbf{w})$ is maximized. In MAP, by contrast, we choose the most probable $\mathbf{w}$ given the data, i.e. the $\mathbf{w}$ that maximizes the posterior.
Dataset: $D = \{\mathbf{x},\mathbf{t}\}$
Model: $p(t \mid x, \mathbf{w}, \beta)=\mathcal{N}\left(t \mid y(x, \mathbf{w}), \beta^{-1}\right)=\sqrt{\frac{\beta}{2 \pi}} \exp \left[-\frac{\beta}{2}(t-y(x, \mathbf{w}))^{2}\right]$
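As a concrete reference, here is a minimal NumPy sketch of this per-sample Gaussian log-likelihood (the function name and vectorized interface are illustrative, not part of the original model):

```python
import numpy as np

def gaussian_log_likelihood(t, y, beta):
    """Log of N(t | y, beta^{-1}) for targets t and predictions y = y(x, w).

    beta is the (assumed known) noise precision; t and y may be arrays,
    in which case per-sample log-likelihoods are returned.
    """
    return 0.5 * np.log(beta / (2.0 * np.pi)) - 0.5 * beta * (t - y) ** 2
```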
Given a prior $p(\mathbf{w}|\alpha)$, the posterior distribution is given by Bayes' theorem as:
$$
\begin{align}
p(\mathbf{w} \mid \mathbf{x}, \mathbf{t}, \beta, \alpha)= \frac{p(\mathbf{t}\mid\mathbf{x},\mathbf{w}, \beta)\, p(\mathbf{w}\mid\alpha)}{p(\mathbf{t}\mid\mathbf{x},\beta,\alpha)}
\end{align}
$$
Note that the denominator does not depend on $\mathbf{w}$.
Now MAP is formulated as:
$$
\begin{align}
\mathbf{w}_{MAP}=\underset{\mathbf{w}}{\operatorname{argmax}}\ p(\mathbf{w} \mid \mathbf{x}, \mathbf{t}, \beta, \alpha)&=\underset{\mathbf{w}}{\operatorname{argmax}}\ \log p(\mathbf{w} \mid \mathbf{x}, \mathbf{t}, \beta, \alpha)\\
&= \underset{\mathbf{w}}{\operatorname{argmax}}\ \log p(\mathbf{t}\mid\mathbf{x},\mathbf{w},\beta) + \log p(\mathbf{w}\mid\alpha) - \log p(\mathbf{t}\mid\mathbf{x},\beta,\alpha)
\end{align}
$$
The third term does not depend on $\mathbf{w}$, so it does not contribute to the solution. Thus,
$$
\mathbf{w}_{MAP}= \underset{\mathbf{w}}{\operatorname{argmax}}\ \log p(\mathbf{t}\mid\mathbf{x},\mathbf{w},\beta) + \log p(\mathbf{w}\mid\alpha)
$$
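In code, this objective can be maximized numerically whenever the log-likelihood and log-prior can be evaluated. A minimal sketch using SciPy's general-purpose optimizer (the helper name and callable interface are assumptions for illustration):

```python
import numpy as np
from scipy.optimize import minimize

def map_estimate(log_likelihood, log_prior, w0):
    """Generic MAP estimate: maximize log p(t|x,w,beta) + log p(w|alpha).

    log_likelihood(w) and log_prior(w) are user-supplied callables
    returning scalars; w0 is the initial weight vector. Minimizing the
    negative log-posterior is equivalent to maximizing the posterior.
    """
    neg_log_posterior = lambda w: -(log_likelihood(w) + log_prior(w))
    return minimize(neg_log_posterior, w0, method="L-BFGS-B").x
```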
Maximum A Posteriori Estimation for Gaussian Distributions
Let's model the prior distribution over the $M$ weights as a zero-mean Gaussian with precision $\alpha$:
$$
\begin{align}
p(\mathbf{w} \mid \alpha)&=\prod_{i=1}^{M} \mathcal{N}\left(w_{i} \mid 0, \alpha^{-1}\right)\\
&=\left(\frac{\alpha}{2 \pi}\right)^{\frac{M}{2}} \prod_{i=1}^{M} e^{-\frac{\alpha}{2} w_{i}^{2}}\\
&=\left(\frac{\alpha}{2 \pi}\right)^{\frac{M}{2}} e^{-\frac{\alpha}{2} \mathbf{w}^{\top} \mathbf{w}}
\end{align}
$$
Taking the log of this prior gives $\log p(\mathbf{w} \mid \alpha)=\frac{M}{2} \log \frac{\alpha}{2 \pi}-\frac{\alpha}{2} \mathbf{w}^{\top} \mathbf{w}$, where the first term is constant in $\mathbf{w}$. Substituting into the MAP objective,
$$
\begin{align}
\mathbf{w}_{MAP}&= \underset{\mathbf{w}}{\operatorname{argmax}}\ \log p(\mathbf{t}\mid\mathbf{x},\mathbf{w},\beta) + \log p(\mathbf{w}\mid\alpha)\\
&= \underset{\mathbf{w}}{\operatorname{argmin}}\ -\log p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta)-\log p(\mathbf{w} \mid \alpha)\\
&= \underset{\mathbf{w}}{\operatorname{argmin}}\ -\log p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) + \frac{\alpha}{2}\mathbf{w}^{\top}\mathbf{w}
\end{align}
$$
Modeling the data distribution as Gaussian,
$$
p(t \mid x, \mathbf{w}, \beta)=\sqrt{\frac{\beta}{2 \pi}} \exp \left[-\frac{\beta}{2}(t-y(x, \mathbf{w}))^{2}\right]
$$
For $N$ i.i.d. samples the negative log-likelihood is a sum over data points, so
$$
\begin{align}
\mathbf{w}_{MAP}&=\underset{\mathbf{w}}{\operatorname{argmin}}\ \frac{\beta}{2}\sum_{i=1}^N \left(t_i - y(x_i,\mathbf{w})\right)^2 + \frac{\alpha}{2}\mathbf{w}^{\top}\mathbf{w}
\end{align}
$$
Therefore, MAP with a Gaussian prior reduces to minimizing a sum-of-squares loss plus a quadratic penalty on the weights, i.e. ridge-regularized least squares with effective regularization coefficient $\lambda = \alpha/\beta$.
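For a linear-in-parameters model $y(x, \mathbf{w}) = \mathbf{w}^{\top} \phi(x)$ this objective is quadratic in $\mathbf{w}$ and has the closed-form minimizer $\mathbf{w}_{MAP} = \beta\,(\alpha \mathbf{I} + \beta \boldsymbol{\Phi}^{\top} \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^{\top} \mathbf{t}$, where $\boldsymbol{\Phi}$ is the design matrix. A minimal NumPy sketch, assuming a polynomial basis (the basis choice and function names are illustrative):

```python
import numpy as np

def polynomial_features(x, degree):
    """Design matrix Phi with columns x^0, x^1, ..., x^degree."""
    return np.vander(x, degree + 1, increasing=True)

def fit_map(x, t, alpha, beta, degree=3):
    """Closed-form MAP weights for a linear-in-parameters Gaussian model.

    Minimizes beta/2 * ||t - Phi w||^2 + alpha/2 * w^T w, whose unique
    minimizer is w = beta * (alpha I + beta Phi^T Phi)^{-1} Phi^T t.
    """
    Phi = polynomial_features(x, degree)
    A = alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi
    return beta * np.linalg.solve(A, Phi.T @ t)
```

The same weights would be obtained from plain ridge regression with $\lambda = \alpha/\beta$, which is exactly the equivalence stated above.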
The predictive distribution is given by,
$$
p\left(t^{\prime} \mid x^{\prime}, \mathbf{w}_{\mathrm{MAP}}, \beta\right)=\mathcal{N}\left(t^{\prime} \mid y\left(x^{\prime}, \mathbf{w}_{\mathrm{MAP}}\right), \beta^{-1}\right)
$$
A point estimate is obtained from the predictive distribution by taking its expected value:
$$
\mathbb{E}\left[t^{\prime} \mid x^{\prime}, \mathbf{w}_{\mathrm{MAP}}, \beta\right]=y\left(x^{\prime}, \mathbf{w}_{\mathrm{MAP}}\right)
$$
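Continuing the sketch above (same hypothetical helpers), the point prediction is just the model evaluated at $\mathbf{w}_{MAP}$:

```python
def predict(x_new, w_map, degree=3):
    """Point estimate E[t' | x', w_MAP, beta] = y(x', w_MAP)."""
    return polynomial_features(x_new, degree) @ w_map

# Illustrative usage on synthetic data: t = sin(2*pi*x) + noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 30)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, 30)
w_map = fit_map(x, t, alpha=1e-2, beta=25.0)  # beta = 1 / 0.2^2
t_hat = predict(np.array([0.5]), w_map)
```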