Regularized Least Squares

When the learned parameters take very large values, the model tends to overfit the training data. To discourage this, we can add a heuristically motivated term to the error function that penalizes large values in the weight vector.

$$ \tilde{E}(\mathbf{w})=\frac{1}{2} \sum_{i=1}^{N}\left\{t_{i}-y\left(\mathbf{x}_{i}, \mathbf{w}\right)\right\}^{2}+\frac{1}{2} \lambda \mathbf{w}^{T} \mathbf{w} $$
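Because the penalty is quadratic, $\tilde{E}(\mathbf{w})$ still has a closed-form minimizer: $\mathbf{w} = (\lambda \mathbf{I} + \Phi^T \Phi)^{-1} \Phi^T \mathbf{t}$, where $\Phi$ is the design matrix. A minimal numpy sketch, using a hypothetical polynomial basis and synthetic data for illustration:

```python
import numpy as np

# Synthetic data: noisy samples of sin(2*pi*x) (an illustrative choice)
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=20)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=20)

# Polynomial design matrix: row i is [1, x_i, x_i^2, ..., x_i^9]
Phi = np.vander(x, N=10, increasing=True)

lam = 1e-3  # regularization coefficient lambda
# Regularized solution: w = (lambda*I + Phi^T Phi)^{-1} Phi^T t
w = np.linalg.solve(lam * np.eye(Phi.shape[1]) + Phi.T @ Phi, Phi.T @ t)

# Unregularized least squares for comparison
w_ols, *_ = np.linalg.lstsq(Phi, t, rcond=None)

# The regularized weight vector has a strictly smaller norm
print(np.linalg.norm(w), np.linalg.norm(w_ols))
```

Note that adding $\lambda \mathbf{I}$ also makes the matrix being inverted well conditioned even when $\Phi^T \Phi$ is nearly singular, which is a practical benefit beyond controlling overfitting.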

Note that minimizing $\tilde{E}(\mathbf{w})$ is equivalent to Maximum A Posteriori (MAP) estimation of $\mathbf{w}$ under a zero-mean Gaussian prior. We can also observe that

$$ \lambda = \frac{\alpha}{\beta} $$

This can be interpreted as $\alpha$ representing how much confidence we have in our prior, and $\beta$ representing how much confidence we have in our model.
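To see where $\lambda = \alpha/\beta$ comes from, take a Gaussian likelihood with noise precision $\beta$ and a zero-mean Gaussian prior $p(\mathbf{w} \mid \alpha) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1}\mathbf{I})$. The negative log posterior is then

$$ -\ln p(\mathbf{w} \mid \mathbf{t}) = \frac{\beta}{2} \sum_{i=1}^{N}\left\{t_{i}-y\left(\mathbf{x}_{i}, \mathbf{w}\right)\right\}^{2} + \frac{\alpha}{2} \mathbf{w}^{T} \mathbf{w} + \text{const} $$

Dividing through by $\beta$ (which does not change the minimizer) recovers $\tilde{E}(\mathbf{w})$ with $\lambda = \alpha/\beta$.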

In more general terms, the regularized error function can be written as:

$$ \hat{E}(\mathbf{w})=\frac{1}{2} \sum_{i=1}^{N}\left(t_{i}-\mathbf{w}^{T} \phi\left(\mathbf{x}_{i}\right)\right)^{2}+\frac{\lambda}{2} \sum_{j=1}^{M}\left|w_{j}\right|^{q} $$

When $q=1$, the regularization term is known as the lasso. It encourages sparsity in $\mathbf{w}$, driving some weights exactly to zero.

When $q=2$, the regularizer is called ridge regression in the statistics literature and weight decay in machine learning. It shrinks large weights towards zero but does not set them exactly to zero.
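The difference in sparsity can be seen directly. A small numpy sketch, solving the $q=1$ case with coordinate descent and the $q=2$ case in closed form, on synthetic data where only the first two of ten features matter (all data and values here are illustrative assumptions):

```python
import numpy as np

def lasso_cd(Phi, t, lam, n_iter=200):
    """Minimize 0.5*||t - Phi w||^2 + lam*||w||_1 by coordinate descent."""
    n, m = Phi.shape
    w = np.zeros(m)
    col_sq = np.sum(Phi ** 2, axis=0)  # ||Phi_j||^2 for each column
    for _ in range(n_iter):
        for j in range(m):
            # Residual with feature j's contribution removed
            r = t - Phi @ w + Phi[:, j] * w[j]
            rho = Phi[:, j] @ r
            # Soft-thresholding update: exact zeros when |rho| <= lam
            w[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
    return w

rng = np.random.default_rng(1)
Phi = rng.normal(size=(100, 10))
t = 3 * Phi[:, 0] - 2 * Phi[:, 1] + rng.normal(scale=0.1, size=100)

w_lasso = lasso_cd(Phi, t, lam=10.0)                                # q = 1
w_ridge = np.linalg.solve(10.0 * np.eye(10) + Phi.T @ Phi, Phi.T @ t)  # q = 2

print(np.sum(w_lasso == 0), np.sum(w_ridge == 0))
```

The lasso sets the irrelevant weights exactly to zero via the soft-threshold, while ridge only shrinks them to small nonzero values.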