Regularized Least Squares
When the learned parameters take on very large values, the model tends to overfit the training data. To prevent this, we can add a heuristically motivated term to the error function that penalizes large values in the weight vector:

$$\tilde{E}(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\left\{t_n - \mathbf{w}^{\mathsf{T}}\boldsymbol{\phi}(\mathbf{x}_n)\right\}^2 + \frac{\lambda}{2}\mathbf{w}^{\mathsf{T}}\mathbf{w}$$
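Minimizing this regularized error still has a closed-form solution, $\mathbf{w} = (\lambda\mathbf{I} + \boldsymbol{\Phi}^{\mathsf{T}}\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^{\mathsf{T}}\mathbf{t}$. A minimal NumPy sketch; the sinusoidal data and degree-9 polynomial features are illustrative assumptions, not part of the original text:

```python
import numpy as np

def regularized_ls(Phi, t, lam):
    """Closed-form regularized least squares: w = (lam*I + Phi^T Phi)^{-1} Phi^T t."""
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

# Illustrative data: noisy samples of sin(2*pi*x) (an assumption for this demo)
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=20)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, size=20)
Phi = np.vander(x, N=10, increasing=True)  # degree-9 polynomial design matrix

w_unreg = regularized_ls(Phi, t, lam=0.0)   # ordinary least squares
w_reg = regularized_ls(Phi, t, lam=1e-3)    # regularized solution

# The penalty shrinks the weight vector's norm
print(np.linalg.norm(w_unreg), np.linalg.norm(w_reg))
```

Increasing $\lambda$ monotonically shrinks $\|\mathbf{w}\|$, trading a worse fit on the training data for smoother, smaller-magnitude weights.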
Note that this is equivalent to maximum a posteriori (MAP) estimation of $\mathbf{w}$ under a zero-mean Gaussian prior with precision $\alpha$, when the noise precision is $\beta$. We can also observe

$$\lambda = \frac{\alpha}{\beta}$$
This can be interpreted as $\alpha$ representing how much confidence we have in our prior, and $\beta$ representing how much confidence we have in our model.
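The correspondence $\lambda = \alpha/\beta$ can be checked numerically: the MAP estimate (posterior mean) $\mathbf{m}_N = \beta(\alpha\mathbf{I} + \beta\boldsymbol{\Phi}^{\mathsf{T}}\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^{\mathsf{T}}\mathbf{t}$ coincides with the regularized least-squares solution at $\lambda = \alpha/\beta$. The data and the values of $\alpha$ and $\beta$ below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, size=15)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, size=15)
Phi = np.vander(x, N=6, increasing=True)  # degree-5 polynomial features

alpha, beta = 0.5, 25.0  # prior precision and noise precision (illustrative values)

# MAP estimate with Gaussian prior N(0, alpha^{-1} I) and noise precision beta
m_N = beta * np.linalg.solve(alpha * np.eye(6) + beta * Phi.T @ Phi, Phi.T @ t)

# Regularized least squares with lambda = alpha / beta
lam = alpha / beta
w_ridge = np.linalg.solve(lam * np.eye(6) + Phi.T @ Phi, Phi.T @ t)

print(np.allclose(m_N, w_ridge))  # → True: the two solutions coincide
```

Algebraically this is immediate: multiplying $(\alpha\mathbf{I} + \beta\boldsymbol{\Phi}^{\mathsf{T}}\boldsymbol{\Phi})$ through by $1/\beta$ gives $(\lambda\mathbf{I} + \boldsymbol{\Phi}^{\mathsf{T}}\boldsymbol{\Phi})$ with $\lambda = \alpha/\beta$.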
In more general terms, the regularization term can be written as:

$$\frac{\lambda}{2}\sum_{j=1}^{M}|w_j|^q$$
When $q=1$, the regularization term is known as the lasso. It encourages sparsity in $\mathbf{w}$, driving some weights exactly to zero.
When $q=2$, the term is called ridge in the statistics literature, or weight decay in machine learning. It shrinks large weights toward zero without making them exactly zero.
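The sparsity contrast between the two penalties can be seen numerically. Below is a sketch that solves ridge in closed form and the lasso with iterative soft-thresholding (ISTA, a standard proximal-gradient method; this particular solver and the synthetic sparse data are assumptions made for the demo, not from the original text):

```python
import numpy as np

def ridge(Phi, t, lam):
    """Closed-form q=2 (ridge) solution."""
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

def lasso_ista(Phi, t, lam, n_iter=5000):
    """q=1 (lasso) via iterative soft-thresholding (proximal gradient descent)."""
    w = np.zeros(Phi.shape[1])
    step = 1.0 / np.linalg.norm(Phi.T @ Phi, 2)  # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        z = w - step * (Phi.T @ (Phi @ w - t))           # gradient step on the data term
        w = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft-threshold (prox of L1)
    return w

# Synthetic problem with a sparse ground-truth weight vector (illustrative)
rng = np.random.default_rng(2)
Phi = rng.normal(size=(50, 10))
w_true = np.array([3.0, -2.0, 0, 0, 0, 0, 0, 0, 0, 0])
t = Phi @ w_true + rng.normal(0, 0.1, size=50)

w_l1 = lasso_ista(Phi, t, lam=5.0)
w_l2 = ridge(Phi, t, lam=5.0)
print("lasso exact zeros:", int(np.sum(np.abs(w_l1) < 1e-6)))
print("ridge exact zeros:", int(np.sum(np.abs(w_l2) < 1e-6)))
```

The soft-threshold operator sets small coordinates exactly to zero, so the lasso recovers the sparsity pattern, while ridge only shrinks the irrelevant weights toward zero without eliminating them.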