Least squares for classification
With Linear Regression via Maximum Likelihood, we saw that minimizing the sum-of-squares error led to a simple closed-form solution for its parameters. It is tempting to apply a similar treatment to discriminant functions for classification problems.
Consider a classification problem with $K$ classes, with a 1-of-$K$ binary coding scheme for the target vector $\mathbf{t}$. Each class $C_k$ is described by its own linear model, so that

$$y_k(\mathbf{x}) = \mathbf{w}_k^{\mathrm{T}}\mathbf{x} + w_{k0}, \qquad k = 1, \ldots, K.$$
Conveniently, in vector notation,

$$\mathbf{y}(\mathbf{x}) = \widetilde{\mathbf{W}}^{\mathrm{T}}\widetilde{\mathbf{x}}$$
where $\widetilde{\mathbf{W}}$ is a matrix whose $k^{\mathrm{th}}$ column comprises the $D+1$-dimensional vector $\widetilde{\mathbf{w}}_{k}=\left(w_{k 0}, \mathbf{w}_{k}^{\mathrm{T}}\right)^{\mathrm{T}}$ and $\widetilde{\mathbf{x}}$ is the corresponding augmented input vector $\left(1, \mathbf{x}^{\mathrm{T}}\right)^{\mathrm{T}}$ with a dummy input $x_{0}=1$. Assign a new input $\mathbf{x}$ to the class $C_k$ with $k=\operatorname{argmax}_j \ y_{j}(\mathbf{x})$.
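The decision rule above can be sketched in a few lines of NumPy; the names `W_tilde` and `classify` are illustrative, not from the text:

```python
import numpy as np

def classify(W_tilde, x):
    """Assign x to the class with the largest linear discriminant value.

    W_tilde: (D+1) x K parameter matrix, k-th column is w_tilde_k.
    x: D-dimensional input vector.
    """
    x_tilde = np.concatenate(([1.0], x))  # augment with dummy input x0 = 1
    y = W_tilde.T @ x_tilde               # y_k(x) = w_tilde_k^T x_tilde
    return int(np.argmax(y))              # class index with the largest y_k
```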
Consider a training data set $\left\{\mathbf{x}_{n}, \mathbf{t}_{n}\right\}$ where $n=1, \ldots, N$, and define a matrix $\mathbf{T}$ whose $n^{\text{th}}$ row is the vector $\mathbf{t}_{n}^{\mathrm{T}}$, together with a matrix $\widetilde{\mathbf{X}}$ whose $n^{\text{th}}$ row is $\widetilde{\mathbf{x}}_{n}^{\mathrm{T}}$.
The sum-of-squares error function can then be written as

$$E_D(\widetilde{\mathbf{W}}) = \frac{1}{2}\operatorname{Tr}\left\{(\widetilde{\mathbf{X}}\widetilde{\mathbf{W}} - \mathbf{T})^{\mathrm{T}}(\widetilde{\mathbf{X}}\widetilde{\mathbf{W}} - \mathbf{T})\right\}.$$
Setting the derivative with respect to $\widetilde{\mathbf{W}}$ to zero and rearranging, we obtain the solution

$$\widetilde{\mathbf{W}} = \left(\widetilde{\mathbf{X}}^{\mathrm{T}}\widetilde{\mathbf{X}}\right)^{-1}\widetilde{\mathbf{X}}^{\mathrm{T}}\mathbf{T} = \widetilde{\mathbf{X}}^{\dagger}\mathbf{T}$$

where $\widetilde{\mathbf{X}}^{\dagger}$ is the pseudo-inverse of $\widetilde{\mathbf{X}}$.
We then obtain the discriminant functions as

$$\mathbf{y}(\mathbf{x}) = \widetilde{\mathbf{W}}^{\mathrm{T}}\widetilde{\mathbf{x}} = \mathbf{T}^{\mathrm{T}}\left(\widetilde{\mathbf{X}}^{\dagger}\right)^{\mathrm{T}}\widetilde{\mathbf{x}}.$$
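A minimal sketch of the closed-form fit via the pseudo-inverse, assuming `X` holds the raw $N \times D$ inputs and `T` the $N \times K$ one-of-$K$ target matrix (the function names are illustrative):

```python
import numpy as np

def fit_least_squares(X, T):
    """Closed-form least-squares fit of the (D+1) x K parameter matrix."""
    N = X.shape[0]
    X_tilde = np.hstack([np.ones((N, 1)), X])  # prepend dummy input x0 = 1
    return np.linalg.pinv(X_tilde) @ T         # W_tilde = pinv(X_tilde) @ T

def predict(W_tilde, X):
    """Assign each row of X to the class with the largest discriminant."""
    N = X.shape[0]
    X_tilde = np.hstack([np.ones((N, 1)), X])
    return np.argmax(X_tilde @ W_tilde, axis=1)
```

On well-separated toy data this recovers the training labels exactly, though as discussed below the fitted scores themselves should not be read as probabilities.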
The least-squares approach gives an exact closed-form solution for the discriminant function parameters. However, it has several problems:
- It lacks robustness to outliers. Additional data points far from the decision boundary produce a significant change in its location.

- For $K>2$, some decision regions can become very small or are ignored entirely. This is known as masking, and it commonly occurs when regression is used for multi-class classification.

- The components of $\mathbf{y}_{LS}(\mathbf{x})$ are not true probabilities: even though they sum to one, they can take negative values.
These failure cases arise because least squares corresponds to maximum likelihood under the assumption of a Gaussian conditional distribution, whereas binary target vectors have a distribution that is clearly very far from Gaussian.
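The last point is easy to demonstrate numerically. In this toy 1-D sketch (all data and names are illustrative), the least-squares scores at a query point far from the training data sum to one yet contain a negative component, so they cannot be probabilities:

```python
import numpy as np

# Toy 1-D training set with one-of-K (here K=2) targets.
X = np.array([[0.0], [1.0], [10.0], [11.0]])
T = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)

# Closed-form least-squares solution W = pinv(X_tilde) @ T.
X_tilde = np.hstack([np.ones((4, 1)), X])
W = np.linalg.pinv(X_tilde) @ T

# Evaluate the scores at an augmented query point far outside the data.
y = np.array([1.0, -10.0]) @ W
print(y.sum())  # ~1.0: the components sum to one
print(y.min())  # negative: not a valid probability
```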