Support Vector Machines
Learning algorithms based on kernel methods have a significant limitation: the kernel function $k\left(\mathbf{x}_{n}, \mathbf{x}_{m}\right)$ must be evaluated for every pair of input points, which is computationally expensive during both training and inference. SVMs are a kernel-based method that has sparse solutions, i.e. their predictions for new inputs depend only on the kernel function evaluated at a subset of the training data points.
Another important property of SVMs is that determining the model parameters is a convex optimization problem, so any local solution is also a global optimum.
Most ML problems (neural networks, logistic regression with complex features) are non-convex - they have multiple local minima that might not be globally optimal. With SVMs, any solution you find is THE best solution possible. No getting stuck in bad local minima!
This is why SVMs were so revolutionary in the 90s/2000s - they offered strong theoretical guarantees that were rare in machine learning at the time. However, SVMs are not probabilistic models and provide no uncertainty estimates. (See Gaussian Processes for that).
Maximum Margin Classifier
Generally, we have seen that there are multiple solutions, all of which classify the training data set exactly. In such cases, we should try to find the one that gives the smallest generalization error.
Support vector machines approach this problem with the concept of margin, which is the smallest distance between the decision boundary and any of the samples. The goal of the algorithm then is to maximize the margin.
Assume a point $\mathbf{x}$ and a hyperplane defined by $y(\mathbf{x})=\mathbf{w}^{T} \mathbf{x}+b=0$. The perpendicular distance of $\mathbf{x}$ from the hyperplane is given as $\frac{|y(\mathbf{x})|}{\|\mathbf{w}\|}$. Since we only consider solutions that classify all training points correctly, $t_{n} y\left(\mathbf{x}_{n}\right)>0$ for all $n$, and thus the distance of a point $\mathbf{x}_n$ to the decision surface is given by

$$\frac{t_{n} y\left(\mathbf{x}_{n}\right)}{\|\mathbf{w}\|}=\frac{t_{n}\left(\mathbf{w}^{T} \mathbf{x}_{n}+b\right)}{\|\mathbf{w}\|}$$
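As a quick sanity check, the point-to-hyperplane distance can be computed directly; a minimal sketch with made-up numbers:

```python
import numpy as np

# Hypothetical 2-D example: hyperplane y(x) = w^T x + b = 0
w = np.array([3.0, 4.0])   # weight vector, ||w|| = 5
b = -5.0                   # bias
x = np.array([3.0, 4.0])   # query point

# Signed distance of x from the hyperplane: y(x) / ||w||
distance = (w @ x + b) / np.linalg.norm(w)
print(distance)  # (9 + 16 - 5) / 5 = 4.0
```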
And thus, the maximum margin solution is found by solving

$$\underset{\mathbf{w}, b}{\arg \max }\left\{\frac{1}{\|\mathbf{w}\|} \min _{n}\left[t_{n}\left(\mathbf{w}^{T} \mathbf{x}_{n}+b\right)\right]\right\}$$
A direct solution of this optimization problem would be very complex, so we convert it to an equivalent problem by rescaling $\mathbf{w}$ and $b$ so that $t_{n}\left(\mathbf{w}^{T} \mathbf{x}_{n}+b\right)=1$ for the point nearest to the boundary. Then the size of the margin is given by $\frac{1}{\|\mathbf{w}\|}$, and all data points satisfy $t_{n}\left(\mathbf{w}^{T} \mathbf{x}_{n}+b\right) \geq 1$.
Then the optimization simply boils down to

$$\underset{\mathbf{w}, b}{\arg \min } \frac{1}{2}\|\mathbf{w}\|^{2}$$

where maximizing $\frac{1}{\|\mathbf{w}\|}$ has been replaced by minimizing $\frac{1}{2}\|\mathbf{w}\|^{2}$.
This is an example of a quadratic programming problem, in which we minimize a quadratic function subject to a set of linear inequality constraints.
We can solve this constrained optimization problem using Lagrange multipliers, giving the Lagrangian function

$$L(\mathbf{w}, b, \mathbf{a})=\frac{1}{2}\|\mathbf{w}\|^{2}-\sum_{n=1}^{N} a_{n}\left\{t_{n}\left(\mathbf{w}^{T} \mathbf{x}_{n}+b\right)-1\right\}$$
where $f(\mathbf{w}) = \frac{1}{2}\|\mathbf{w}\|^{2}$ is the objective, $g_{n}(\mathbf{w}, b) = t_{n}\left(\mathbf{w}^{T} \mathbf{x}_{n}+b\right)-1$ are the constraints, and $a_{n} \geq 0$ for $n=1, \ldots, N$ are the Lagrange multipliers. Note the minus sign in front of the Lagrange multiplier terms is because we are minimizing with respect to $\mathbf{w}$ and $b$, and maximizing with respect to $\mathbf{a}$.
Recall the KKT conditions, which hold at the optimum:

$$a_{n} \geq 0, \qquad t_{n} y\left(\mathbf{x}_{n}\right)-1 \geq 0, \qquad a_{n}\left\{t_{n} y\left(\mathbf{x}_{n}\right)-1\right\}=0$$
Setting the derivatives of the Lagrangian with respect to $\mathbf{w}$ and $b$ to zero, we obtain the stationarity conditions

$$\mathbf{w}=\sum_{n=1}^{N} a_{n} t_{n} \mathbf{x}_{n}, \qquad \sum_{n=1}^{N} a_{n} t_{n}=0$$
Now, using these stationarity conditions together with the KKT conditions, we can get the dual representation of the Lagrangian by eliminating $\mathbf{w}$ and $b$:

$$\tilde{L}(\mathbf{a})=\sum_{n=1}^{N} a_{n}-\frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} a_{n} a_{m} t_{n} t_{m} \mathbf{x}_{n}^{T} \mathbf{x}_{m}$$
which we maximize with respect to $\mathbf{a}$, subject to the constraints $a_{n} \geq 0$ for $n=1, \ldots, N$ and $\sum_{n=1}^{N} a_{n} t_{n}=0$.
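This dual is itself a small QP and can be handed to a general-purpose solver. Below is a minimal sketch using `scipy.optimize.minimize` (SLSQP) on a made-up two-point data set with a linear kernel; for serious use, a dedicated QP solver such as CVXOPT (see references) is preferable:

```python
import numpy as np
from scipy.optimize import minimize

# Made-up linearly separable data: one point per class on the x-axis
X = np.array([[1.0, 0.0], [-1.0, 0.0]])
t = np.array([1.0, -1.0])

# Q_nm = t_n t_m x_n^T x_m
Q = (t[:, None] * t[None, :]) * (X @ X.T)

# Negate the dual (solvers minimize): -sum_n a_n + 1/2 a^T Q a
dual = lambda a: -a.sum() + 0.5 * a @ Q @ a

res = minimize(dual, x0=np.zeros(2), method="SLSQP",
               bounds=[(0.0, None)] * 2,                            # a_n >= 0
               constraints={"type": "eq", "fun": lambda a: a @ t})  # sum_n a_n t_n = 0
a = res.x
w = (a * t) @ X  # recover w = sum_n a_n t_n x_n
print(a, w)      # a ~ [0.5, 0.5], w ~ [1, 0], margin 1/||w|| = 1
```

Both points end up with $a_n > 0$, i.e. both are support vectors, as expected for this symmetric toy set.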
Applying the kernel trick by replacing $\mathbf{x}_{n}^{T} \mathbf{x}_{m}$ with $k\left(\mathbf{x}_{n}, \mathbf{x}_{m}\right)$ we get

$$\tilde{L}(\mathbf{a})=\sum_{n=1}^{N} a_{n}-\frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} a_{n} a_{m} t_{n} t_{m} k\left(\mathbf{x}_{n}, \mathbf{x}_{m}\right)$$
The predicted class of a new data point $\mathbf{x}$ is then given by the sign of

$$y(\mathbf{x})=\sum_{n=1}^{N} a_{n} t_{n} k\left(\mathbf{x}, \mathbf{x}_{n}\right)+b$$
From KKT conditions above, we know that for every data point, either $a_{n}=0$ or $t_{n} y\left(\mathbf{x}_{n}\right)=1$. Data points for which $a_{n}=0$ will not be used in the prediction. The remaining data points lie on the maximum margin hyperplanes and are called support vectors.
After solving for the value of $\mathbf{a}$, we can determine the bias term by noting that any support vector $\mathbf{x}_n$ satisfies $t_{n} y\left(\mathbf{x}_{n}\right)=1$, which gives

$$t_{n}\left(\sum_{m \in \mathcal{S}} a_{m} t_{m} k\left(\mathbf{x}_{n}, \mathbf{x}_{m}\right)+b\right)=1$$

where $\mathcal{S}$ denotes the set of support vectors.
For numerical stability, we can average over all the support vectors $\mathcal{S}$ to give

$$b=\frac{1}{N_{\mathcal{S}}} \sum_{n \in \mathcal{S}}\left(t_{n}-\sum_{m \in \mathcal{S}} a_{m} t_{m} k\left(\mathbf{x}_{n}, \mathbf{x}_{m}\right)\right)$$

where $N_{\mathcal{S}}$ is the number of support vectors.
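Putting this together, the bias and a prediction can be computed from a dual solution. A minimal sketch with a made-up two-point data set, a linear kernel, and the dual solution $a = [0.5, 0.5]$ (assumed known, e.g. from a QP solver):

```python
import numpy as np

# Made-up symmetric toy data and an assumed dual solution
X = np.array([[1.0, 0.0], [-1.0, 0.0]])
t = np.array([1.0, -1.0])
a = np.array([0.5, 0.5])            # both points are support vectors here

K = X @ X.T                         # linear kernel matrix k(x_n, x_m)
S = a > 1e-8                        # support-vector mask

# b averaged over support vectors: b = mean_n( t_n - sum_m a_m t_m k(x_n, x_m) )
b = np.mean(t[S] - K[S][:, S] @ (a[S] * t[S]))
print(b)  # 0.0 for this symmetric toy set

# Predict the class of a new point as sign( sum_n a_n t_n k(x, x_n) + b )
x_new = np.array([2.0, 3.0])
y = (a[S] * t[S]) @ (X[S] @ x_new) + b
print(np.sign(y))  # 1.0
```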
Soft margin classifiers
In the maximum margin classifier setup above, we assumed that the training data points are linearly separable, either with a linear decision boundary or with a nonlinear decision boundary using a nonlinear kernel. But in the real world, class-conditional distributions usually overlap.
The maximum margin classifier would severely overfit the decision hyperplane in such situations, leading to a drastic reduction in generalizability. Therefore, we need to modify it to allow some training points to be misclassified.
This can be done by allowing the points to be on the "wrong" side of the margin boundary, but have to pay a penalty proportional to the distance to the margin boundary.
Soft margin classifiers introduce slack variables $\xi_{n} \geq 0$ for $n=1, \ldots, N$ to allow data points to be misclassified at the cost of a penalty. If data point $n$ is on or outside the correct side of the margin boundary, $\xi_{n}=0$; otherwise $\xi_{n}=\left|t_{n}-y\left(\mathbf{x}_{n}\right)\right|$. This means points with $0<\xi_{n} \leq 1$ lie inside the margin but on the correct side of the decision boundary, while points with $\xi_{n}>1$ lie on the wrong side of the decision boundary and are misclassified.
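Since $t_{n} \in \{-1, +1\}$, the slack can equivalently be written as $\xi_{n}=\max \left(0, 1-t_{n} y\left(\mathbf{x}_{n}\right)\right)$; a minimal sketch with made-up margin values:

```python
import numpy as np

# Made-up values of t_n * y(x_n) for four points:
# outside the margin, on the margin, inside the margin, misclassified
margins = np.array([1.5, 1.0, 0.4, -0.2])

# Slack variables: zero on/outside the correct margin boundary,
# in (0, 1] inside the margin, > 1 when misclassified
xi = np.maximum(0.0, 1.0 - margins)
print(xi)  # [0.  0.  0.6 1.2]
```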
We now replace the previous hard constraint $t_{n} y\left(\mathbf{x}_{n}\right) \geq 1$ with the soft constraint $t_{n} y\left(\mathbf{x}_{n}\right) \geq 1-\xi_{n}$, so the objective becomes

$$C \sum_{n=1}^{N} \xi_{n}+\frac{1}{2}\|\mathbf{w}\|^{2}$$
subject to the constraints $t_{n} y\left(\mathbf{x}_{n}\right) \geq 1-\xi_{n}$ and $\xi_{n} \geq 0$ for $n=1, \ldots, N$.
The parameter $C > 0$ controls the trade-off between the slack variable penalty and the margin, and is therefore analogous to (the inverse of) a regularization coefficient.
The Lagrangian now becomes

$$L=\frac{1}{2}\|\mathbf{w}\|^{2}+C \sum_{n=1}^{N} \xi_{n}-\sum_{n=1}^{N} a_{n}\left\{t_{n} y\left(\mathbf{x}_{n}\right)-1+\xi_{n}\right\}-\sum_{n=1}^{N} \mu_{n} \xi_{n}$$
where Lagrange multipliers are $a_{n} \geq 0, \quad \mu_{n} \geq 0$.
Minimizing $L$ w.r.t. the primal variables $\mathbf{w}, b, \xi_{n}$ gives

$$\mathbf{w}=\sum_{n=1}^{N} a_{n} t_{n} \mathbf{x}_{n}, \qquad \sum_{n=1}^{N} a_{n} t_{n}=0, \qquad a_{n}=C-\mu_{n}$$
The above primal Lagrangian has the following KKT conditions:

$$a_{n} \geq 0, \qquad t_{n} y\left(\mathbf{x}_{n}\right)-1+\xi_{n} \geq 0, \qquad a_{n}\left\{t_{n} y\left(\mathbf{x}_{n}\right)-1+\xi_{n}\right\}=0,$$

$$\mu_{n} \geq 0, \qquad \xi_{n} \geq 0, \qquad \mu_{n} \xi_{n}=0.$$
Note: the stationarity conditions obtained above by optimizing the Lagrangian with respect to the primal variables are also considered part of the KKT conditions.
Now we eliminate $\mathbf{w}, b, \xi_{n}$ from the primal Lagrangian using the above conditions to obtain the dual formulation

$$\tilde{L}(\mathbf{a})=\sum_{n=1}^{N} a_{n}-\frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} a_{n} a_{m} t_{n} t_{m} k\left(\mathbf{x}_{n}, \mathbf{x}_{m}\right)$$
which is identical to the hard margin case, except that the constraints are now $0 \leq a_{n} \leq C$ and $\sum_{n=1}^{N} a_{n} t_{n}=0$; the former are called *box constraints*. The dual formulation again allows the use of the kernel trick.
We solve the dual problem using a numerical optimizer for the values of $a_n$. Then we can classify new points using the sign of

$$y(\mathbf{x})=\sum_{n=1}^{N} a_{n} t_{n} k\left(\mathbf{x}, \mathbf{x}_{n}\right)+b$$
We are also interested in which points finally give the sparse solution, i.e. which points are the support vectors. The support vectors are the points for which $a_{n}>0$; for these, $t_{n} y\left(\mathbf{x}_{n}\right)=1-\xi_{n}$. Then we have two scenarios:
- If $a_{n}<C$ then $\mu_{n}>0$, so $\xi_{n}=0$: the points lie on the margin.
- If $a_{n}=C$ then $\mu_{n}=0$, so $\xi_{n} \geq 0$:
  - If $\xi_{n} \leq 1$, the point lies within the margin and is classified correctly.
  - If $\xi_{n}>1$, the point is misclassified.
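These cases can be observed numerically. A sketch of the soft margin dual handed to `scipy.optimize.minimize` (SLSQP), on made-up non-separable data with a linear kernel and $C=1$, where only the bounds encode the box constraints:

```python
import numpy as np
from scipy.optimize import minimize

# Made-up non-separable data on the x-axis: the third point (t = +1 at x = -0.5)
# lies on the wrong side of any sensible decision boundary
X = np.array([[1.0, 0.0], [-1.0, 0.0], [-0.5, 0.0]])
t = np.array([1.0, -1.0, 1.0])
C = 1.0

Q = (t[:, None] * t[None, :]) * (X @ X.T)
dual = lambda a: -a.sum() + 0.5 * a @ Q @ a  # negated soft margin dual

res = minimize(dual, np.zeros(3), method="SLSQP",
               bounds=[(0.0, C)] * 3,                            # box constraints 0 <= a_n <= C
               constraints={"type": "eq", "fun": lambda a: a @ t})
a = res.x
print(a)  # the wrong-side point is driven to the box bound a_n = C
```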
When $C \rightarrow \infty$, we recover the hard margin classifier above, since the penalty for misclassification goes to infinity; in that case there are at least 2 support vectors. When $C \rightarrow 0$, we get an infinite margin and every point becomes a support vector.
For large $C$ we expect a more complex decision boundary than for small $C$.
The classifier is more sensitive to outliers for large $C$ than for small $C$.
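The effect of $C$ on sparsity can be checked empirically; a quick sketch using scikit-learn's `SVC` on made-up overlapping Gaussian blobs (assuming scikit-learn is available):

```python
import numpy as np
from sklearn.svm import SVC

# Made-up overlapping classes: two Gaussian blobs with means -1 and +1
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (20, 2)), rng.normal(1.0, 1.0, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)

# Count support vectors for a small and a large C
n_sv = {C: int(SVC(kernel="linear", C=C).fit(X, y).n_support_.sum())
        for C in (0.01, 100.0)}
print(n_sv)  # small C -> wide margin, many support vectors; large C -> few
```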
References
- Bishop, *Pattern Recognition and Machine Learning*, 2006, Sections 7.0 and 7.1
- Quadratic Programming with Python and CVXOPT https://courses.csail.mit.edu/6.867/wiki/images/a/a7/Qp-cvxopt.pdf