Kernel Methods
Created October 11, 2020 · Updated March 4, 2026
A category of pattern-matching algorithms, such as nearest neighbors, are examples of memory-based methods: they store the entire training set and use a similarity metric to make predictions for future data points.
Many linear parametric models can be re-cast into an equivalent 'dual representation' in which the predictions are also based on linear combinations of a kernel function evaluated at the training data points.
$$
k\left(\mathbf{x}, \mathbf{x}^{\prime}\right)=\phi(\mathbf{x})^{\mathrm{T}} \boldsymbol{\phi}\left(\mathbf{x}^{\prime}\right)
$$
The power of kernels comes from the fact that they can be substituted into any model involving an inner product in feature space. A kernel has no explicitly defined parameters, but implicitly works with a finite or even infinite number of basis functions, which makes it very powerful.
An important contribution of the kernel viewpoint has been the extension to inputs that are symbolic, rather than simply vectors of real numbers. Kernel functions can be defined over objects as diverse as graphs, sets, strings, and text documents.
Kernelized Ridge Regression
The goal is to minimize the sum of squared errors with quadratic weight penalty.
$$
J(\mathbf{w})=\frac{1}{2} \sum_{n=1}^{N}\left\{\mathbf{w}^{T} \phi\left(\mathbf{x}_{n}\right)-t_{n}\right\}^{2}+\frac{\lambda}{2} \mathbf{w}^{T} \mathbf{w}
$$
To obtain the solution, we take the derivative of the error function with respect to the parameters and set it to zero.
$$
\begin{align}
\frac{\partial J(\mathbf{w})}{\partial \mathbf{w}}=\sum_{n=1}^{N}\left\{\mathbf{w}^{T} \phi\left(\mathbf{x}_{n}\right)-t_{n}\right\} \phi\left(\mathbf{x}_{n}\right)^{T}+\lambda \mathbf{w}^{T}&=\mathbf{0} \\
\mathbf{w}^{T}\left(\sum_{n=1}^{N} \phi\left(\mathbf{x}_{n}\right) \phi\left(\mathbf{x}_{n}\right)^{T}+\lambda \mathbf{I}\right)&=\sum_{n=1}^{N} t_{n} \phi\left(\mathbf{x}_{n}\right)^{T} \\
\left(\sum_{n=1}^{N} \phi\left(\mathbf{x}_{n}\right) \phi\left(\mathbf{x}_{n}\right)^{T}+\lambda \mathbf{I}\right) \mathbf{w}&=\sum_{n=1}^{N} t_{n} \phi\left(\mathbf{x}_{n}\right)
\end{align}
$$
$$
\begin{align}
\mathbf{w}&=\left(\sum_{n=1}^{N} \phi\left(\mathbf{x}_{n}\right) \phi\left(\mathbf{x}_{n}\right)^{T}+\lambda \mathbf{I}\right)^{-1} \sum_{n=1}^{N} t_{n} \phi\left(\mathbf{x}_{n}\right) \\
&=\left(\mathbf{\Phi}^{T} \mathbf{\Phi}+\lambda \mathbf{I}\right)^{-1} \mathbf{\Phi}^{T} \mathbf{t}
\end{align}
$$
We now define the Gram matrix $\mathbf{K}=\mathbf{\Phi} \mathbf{\Phi}^{\mathrm{T}}$, a symmetric $N \times N$ matrix with elements
$$
K_{n m}=\phi\left(\mathbf{x}_{n}\right)^{\mathrm{T}} \boldsymbol{\phi}\left(\mathbf{x}_{m}\right)=k\left(\mathbf{x}_{n}, \mathbf{x}_{m}\right)
$$
So,
$$
\mathbf{w}=\mathbf{\Phi}^{T}\left(\mathbf{\Phi} \mathbf{\Phi}^{T}+\lambda \mathbf{I}_{N}\right)^{-1} \mathbf{t}=\mathbf{\Phi}^{T}\left(\boldsymbol{K}+\lambda \mathbf{I}_{N}\right)^{-1} \boldsymbol{t}
$$
This gives rise to a dual representation $\mathbf{w}=\mathbf{\Phi}^{T} \mathbf{a}$ with dual variables $\mathbf{a}=\left(\boldsymbol{K}+\lambda \mathbf{I}_{N}\right)^{-1} \boldsymbol{t}$, so the solution is expressed entirely in terms of the kernel function.
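The equivalence between the primal and dual solutions can be checked numerically. The sketch below (plain NumPy; the data, dimensions, and regularization strength are illustrative choices, not from the source) computes $\mathbf{w}$ both the standard way and as $\mathbf{\Phi}^{T}\mathbf{a}$:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 20, 4                               # N data points, M basis functions
Phi = rng.normal(size=(N, M))              # design matrix (rows are phi(x_n)^T)
t = rng.normal(size=N)                     # targets
lam = 0.1

# Primal: w = (Phi^T Phi + lam I_M)^{-1} Phi^T t  -- inverts an M x M matrix
w_primal = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ t)

# Dual: a = (K + lam I_N)^{-1} t with K = Phi Phi^T, then w = Phi^T a
K = Phi @ Phi.T
a = np.linalg.solve(K + lam * np.eye(N), t)
w_dual = Phi.T @ a

print(np.allclose(w_primal, w_dual))  # both routes give the same weights
```

The agreement follows from the push-through identity $(\mathbf{\Phi}^T\mathbf{\Phi}+\lambda\mathbf{I})^{-1}\mathbf{\Phi}^T = \mathbf{\Phi}^T(\mathbf{\Phi}\mathbf{\Phi}^T+\lambda\mathbf{I})^{-1}$.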
In this dual formulation, the closed-form solution requires inverting an $N\times N$ matrix, which has cubic complexity $O(N^3)$, whereas the primal form requires inverting an $M\times M$ matrix with complexity $O(M^3)$. Why, then, is the dual representation desirable?
- Allows us to implicitly use feature spaces of high, even infinite, dimensionality.
- Does not rely on explicit Basis Functions but on similarity kernel functions.
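To illustrate both points, here is a minimal kernelized ridge regression with a Gaussian kernel: the corresponding feature space is infinite-dimensional, yet fitting and prediction only ever touch the $N \times N$ Gram matrix. (The data, kernel width, and regularization strength are illustrative choices.)

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(30, 1))              # 1-D inputs (illustrative)
t = np.sin(X[:, 0]) + 0.1 * rng.normal(size=30)   # noisy targets
lam, sigma = 1e-3, 0.5                            # illustrative hyperparameters

def k(A, B):
    # Gaussian-kernel Gram matrix between the rows of A and B
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * sigma**2))

a = np.linalg.solve(k(X, X) + lam * np.eye(len(X)), t)  # dual variables
X_new = np.linspace(-3, 3, 5)[:, None]
y_new = k(X_new, X) @ a          # prediction: y(x) = sum_n a_n k(x, x_n)

train_mse = np.mean((k(X, X) @ a - t) ** 2)
print(train_mse < np.var(t))     # the kernel fit beats predicting the mean
```

No basis-function locations were handpicked: the kernel alone determines the hypothesis space.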
The Kernel Trick
Formulate your optimization problem in such a way that the input vectors $\mathbf{x}_{n}$ enter only in the form of scalar products:
$$
\mathbf{x}_{n}^{T} \mathbf{x}_{m} \quad \text { (or, when using basis functions, } \left.\boldsymbol{\phi}\left(\mathbf{x}_{n}\right)^{T} \boldsymbol{\phi}\left(\mathbf{x}_{m}\right)\right)
$$
Replace all instances of $\mathbf{x}_{n}^{T} \mathbf{x}_{m}$ with a kernel function $k\left(\mathbf{x}_{n}, \mathbf{x}_{m}\right)$.
In order to exploit this kernel substitution, the kernel must be a valid kernel function, meaning the Gram matrix $\mathbf{K}$ must be symmetric positive semidefinite for all possible choices of $\left\{\mathbf{x}_{n}\right\}_{n=1}^{N}$.
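This condition can be spot-checked numerically for a candidate kernel: build its Gram matrix on a random sample of points and confirm symmetry and nonnegative eigenvalues (a necessary check, not a proof; the kernel, sample, and tolerance below are illustrative):

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    # candidate kernel to validate (illustrative choice)
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 3))  # arbitrary sample of input points
K = np.array([[gaussian_kernel(xn, xm) for xm in X] for xn in X])

print(np.allclose(K, K.T))                      # symmetric
print(np.all(np.linalg.eigvalsh(K) >= -1e-10))  # eigenvalues >= 0 => PSD
```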
One approach to constructing valid kernel functions is to choose an explicit set of basis functions, for example:
$$
k\left(\mathbf{x}, \mathbf{x}^{\prime}\right)=\phi(\mathbf{x})^{T} \phi\left(\mathbf{x}^{\prime}\right)=\sum_{i=1}^{M} \phi_{i}(\mathbf{x}) \phi_{i}\left(\mathbf{x}^{\prime}\right)
$$
Or, similarly, using the equivalent kernel with an eigendecomposition of the covariance matrix:
$$
k\left(\mathbf{x}, \mathbf{x}^{\prime}\right)=\phi(\mathbf{x})^{T} \mathbf{\Sigma} \phi\left(\mathbf{x}^{\prime}\right)=\psi(\mathbf{x})^{T} \psi\left(\mathbf{x}^{\prime}\right)=\sum_{i=1}^{M} \psi_{i}(\mathbf{x}) \psi_{i}\left(\mathbf{x}^{\prime}\right)
$$
For every positive-definite kernel there exists a feature map $\phi: \mathbb{R}^{d} \rightarrow \mathbb{R}^{M}$ such that
$$
k\left(\mathbf{x}, \mathbf{x}^{\prime}\right)=\phi(\mathbf{x})^{T} \phi\left(\mathbf{x}^{\prime}\right)
$$
For example, in the case of a two-dimensional input space $\mathbf{x}=\left(x_{1}, x_{2}\right)$, we can identify the nonlinear feature mapping explicitly:
$$
\begin{aligned}
k(\mathbf{x}, \mathbf{z}) &=\left(\mathbf{x}^{\mathrm{T}} \mathbf{z}\right)^{2}=\left(x_{1} z_{1}+x_{2} z_{2}\right)^{2} \\
&=x_{1}^{2} z_{1}^{2}+2 x_{1} z_{1} x_{2} z_{2}+x_{2}^{2} z_{2}^{2} \\
&=\left(x_{1}^{2}, \sqrt{2} x_{1} x_{2}, x_{2}^{2}\right)\left(z_{1}^{2}, \sqrt{2} z_{1} z_{2}, z_{2}^{2}\right)^{\mathrm{T}} \\
&=\phi(\mathbf{x})^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{z})
\end{aligned}
$$
Therefore, a valid kernel always corresponds to some nonlinear mapping function, but we do not necessarily need to know what it is. In general it is difficult to recover the corresponding $\phi(\mathbf{x})$ for a given kernel.
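The worked example above can be confirmed numerically; the feature map `phi` below is exactly the one identified in the expansion, while the input points are arbitrary:

```python
import numpy as np

def phi(x):
    # explicit feature map identified above for k(x, z) = (x^T z)^2
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

x = np.array([1.5, -0.5])
z = np.array([0.3, 2.0])

k_implicit = (x @ z) ** 2      # kernel evaluated directly
k_explicit = phi(x) @ phi(z)   # inner product in feature space
print(np.isclose(k_implicit, k_explicit))  # True
```

The kernel evaluation needs only the 2-D inner product, never the 3-D features.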
Constructing Kernels from Other Kernels
Given valid kernels $k_1$ and $k_2$, each of the following constructions yields a valid kernel, where $c>0$ is a constant, $f(\cdot)$ is any function, $q(\cdot)$ is a polynomial with nonnegative coefficients, $\mathbf{A}$ is a symmetric positive semidefinite matrix, and $\mathbf{x}_a$, $\mathbf{x}_b$ are variables (not necessarily disjoint) with $\mathbf{x}=\left(\mathbf{x}_a, \mathbf{x}_b\right)$:
$$
\begin{aligned}
k\left(\mathbf{x}, \mathbf{x}^{\prime}\right) &=c k_{1}\left(\mathbf{x}, \mathbf{x}^{\prime}\right) \\
k\left(\mathbf{x}, \mathbf{x}^{\prime}\right) &=f(\mathbf{x}) k_{1}\left(\mathbf{x}, \mathbf{x}^{\prime}\right) f\left(\mathbf{x}^{\prime}\right) \\
k\left(\mathbf{x}, \mathbf{x}^{\prime}\right) &=q\left(k_{1}\left(\mathbf{x}, \mathbf{x}^{\prime}\right)\right) \\
k\left(\mathbf{x}, \mathbf{x}^{\prime}\right) &=\exp \left(k_{1}\left(\mathbf{x}, \mathbf{x}^{\prime}\right)\right) \\
k\left(\mathbf{x}, \mathbf{x}^{\prime}\right) &=k_{1}\left(\mathbf{x}, \mathbf{x}^{\prime}\right)+k_{2}\left(\mathbf{x}, \mathbf{x}^{\prime}\right) \\
k\left(\mathbf{x}, \mathbf{x}^{\prime}\right) &=k_{1}\left(\mathbf{x}, \mathbf{x}^{\prime}\right) k_{2}\left(\mathbf{x}, \mathbf{x}^{\prime}\right) \\
k\left(\mathbf{x}, \mathbf{x}^{\prime}\right) &=k_{3}\left(\boldsymbol{\phi}(\mathbf{x}), \boldsymbol{\phi}\left(\mathbf{x}^{\prime}\right)\right) \\
k\left(\mathbf{x}, \mathbf{x}^{\prime}\right) &=\mathbf{x}^{\mathrm{T}} \mathbf{A} \mathbf{x}^{\prime} \\
k\left(\mathbf{x}, \mathbf{x}^{\prime}\right) &=k_{a}\left(\mathbf{x}_{a}, \mathbf{x}_{a}^{\prime}\right)+k_{b}\left(\mathbf{x}_{b}, \mathbf{x}_{b}^{\prime}\right) \\
k\left(\mathbf{x}, \mathbf{x}^{\prime}\right) &=k_{a}\left(\mathbf{x}_{a}, \mathbf{x}_{a}^{\prime}\right) k_{b}\left(\mathbf{x}_{b}, \mathbf{x}_{b}^{\prime}\right)
\end{aligned}
$$
Examples
Generalized polynomial kernel
$$
k\left(\mathbf{x}, \mathbf{x}^{\prime}\right)=\left(c+\mathbf{x}^{T} \mathbf{x}^{\prime}\right)^{M}
$$
Gaussian kernel
Also known as the squared exponential kernel; its feature space is infinite-dimensional!
$$
k\left(\mathbf{x}, \mathbf{x}^{\prime}\right)=\exp \left(-\frac{1}{2 \sigma^{2}}|| \mathbf{x}-\mathbf{x}^{\prime}||^{2}\right)
$$
Radial basis functions
RBFs are a general class of homogeneous kernels that depend only on the magnitude of the distance between the arguments (the term is often used synonymously with the Gaussian kernel):
$$
k\left(\mathbf{x}, \mathbf{x}^{\prime}\right)=k\left(\left\|\mathbf{x}-\mathbf{x}^{\prime}\right\|\right)
$$
Pros and Cons
Pros:
- We have no explicit parameters/features anymore, only implicit by the kernel function $k\left(\boldsymbol{x}, \boldsymbol{x}^{\prime}\right)$
- No need to handpick locations of basis functions
Cons:
- Retrieving the dual variables $\mathbf{a}$ costs $\mathcal{O}\left(N^3\right)$, since $\boldsymbol{K} \in \mathbb{R}^{N \times N}$ must be inverted, compared to $\mathcal{O}\left(M^3\right)$ for computing $\boldsymbol{w}$ in the standard (primal) way
- During prediction, we need $\mathcal{O}(N \cdot M)$ to compute the output for a new point (one kernel evaluation per training point), but would only need $\mathcal{O}(M)$ with the primal parameters $\boldsymbol{w} \Rightarrow$ slow prediction for large datasets
References
- 6.0-6.2, Bishop 2006