Regularized Least Squares

When the learned parameters take very large values, the model tends to overfit the training data. To discourage this, we can add a heuristically motivated term to the error function that penalizes large values in the weight vector.

$$ \tilde{E}(\mathbf{w})=\frac{1}{2} \sum_{i=1}^{N}\left\{t_{i}-y\left(\mathbf{x}_{i}, \mathbf{w}\right)\right\}^{2}+\frac{1}{2} \lambda \mathbf{w}^{T} \mathbf{w} $$
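Because the penalty is quadratic, $\tilde{E}(\mathbf{w})$ still has a closed-form minimizer: $\mathbf{w} = (\lambda \mathbf{I} + \Phi^T \Phi)^{-1} \Phi^T \mathbf{t}$, where $\Phi$ is the design matrix. A minimal numpy sketch, using a hypothetical polynomial basis and synthetic data for illustration:

```python
import numpy as np

# Synthetic data: noisy samples of sin(2*pi*x) (an illustrative choice)
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=20)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=20)

# Polynomial design matrix: row i is [1, x_i, x_i^2, ..., x_i^9]
Phi = np.vander(x, N=10, increasing=True)

lam = 1e-3  # regularization coefficient lambda
# Regularized solution: w = (lambda*I + Phi^T Phi)^{-1} Phi^T t
w = np.linalg.solve(lam * np.eye(Phi.shape[1]) + Phi.T @ Phi, Phi.T @ t)

# Unregularized least squares for comparison
w_ols, *_ = np.linalg.lstsq(Phi, t, rcond=None)

# The regularized weight vector has a strictly smaller norm
print(np.linalg.norm(w), np.linalg.norm(w_ols))
```

Note that adding $\lambda \mathbf{I}$ also makes the matrix being inverted well conditioned even when $\Phi^T \Phi$ is nearly singular, which is a practical benefit beyond controlling overfitting.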

Note that minimizing $\tilde{E}(\mathbf{w})$ is equivalent to Maximum A Posteriori (MAP) estimation of $\mathbf{w}$ under a zero-mean Gaussian prior. We can also observe that

$$ \lambda = \frac{\alpha}{\beta} $$

This can be interpreted as $\alpha$ representing how much confidence we have in our prior, and $\beta$ representing how much confidence we have in our model.
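To see where $\lambda = \alpha/\beta$ comes from, take a Gaussian likelihood with noise precision $\beta$ and a zero-mean Gaussian prior $p(\mathbf{w} \mid \alpha) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1}\mathbf{I})$. The negative log posterior is then

$$ -\ln p(\mathbf{w} \mid \mathbf{t}) = \frac{\beta}{2} \sum_{i=1}^{N}\left\{t_{i}-y\left(\mathbf{x}_{i}, \mathbf{w}\right)\right\}^{2} + \frac{\alpha}{2} \mathbf{w}^{T} \mathbf{w} + \text{const} $$

Dividing through by $\beta$ (which does not change the minimizer) recovers $\tilde{E}(\mathbf{w})$ with $\lambda = \alpha/\beta$.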

In more general terms, the regularized error function can be written as:

$$ \hat{E}(\mathbf{w})=\frac{1}{2} \sum_{i=1}^{N}\left(t_{i}-\mathbf{w}^{T} \phi\left(\mathbf{x}_{i}\right)\right)^{2}+\frac{\lambda}{2} \sum_{j=1}^{M}\left|w_{j}\right|^{q} $$

When $q=1$, the regularization term is known as the lasso. It encourages sparsity in $\mathbf{w}$, driving some weights exactly to zero.

When $q=2$, the regularizer is called ridge regression in the statistics literature and weight decay in machine learning. It shrinks large weights towards zero but does not set them exactly to zero.
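The difference in sparsity can be seen directly. A small numpy sketch, solving the $q=1$ case with coordinate descent and the $q=2$ case in closed form, on synthetic data where only the first two of ten features matter (all data and values here are illustrative assumptions):

```python
import numpy as np

def lasso_cd(Phi, t, lam, n_iter=200):
    """Minimize 0.5*||t - Phi w||^2 + lam*||w||_1 by coordinate descent."""
    n, m = Phi.shape
    w = np.zeros(m)
    col_sq = np.sum(Phi ** 2, axis=0)  # ||Phi_j||^2 for each column
    for _ in range(n_iter):
        for j in range(m):
            # Residual with feature j's contribution removed
            r = t - Phi @ w + Phi[:, j] * w[j]
            rho = Phi[:, j] @ r
            # Soft-thresholding update: exact zeros when |rho| <= lam
            w[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
    return w

rng = np.random.default_rng(1)
Phi = rng.normal(size=(100, 10))
t = 3 * Phi[:, 0] - 2 * Phi[:, 1] + rng.normal(scale=0.1, size=100)

w_lasso = lasso_cd(Phi, t, lam=10.0)                                # q = 1
w_ridge = np.linalg.solve(10.0 * np.eye(10) + Phi.T @ Phi, Phi.T @ t)  # q = 2

print(np.sum(w_lasso == 0), np.sum(w_ridge == 0))
```

The lasso sets the irrelevant weights exactly to zero via the soft-threshold, while ridge only shrinks them to small nonzero values.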