Principal Component Analysis (PCA)

Principal component analysis, or PCA, is a dimensionality reduction algorithm.

There are two equivalent deterministic formulations/derivations of PCA:

Maximal variance formulation

Given the data $\mathbf{X}=\left\{\mathbf{x}_{1}, \ldots, \mathbf{x}_{N}\right\}, \mathbf{x}_{n} \in \mathbb{R}^{D}$, the goal of PCA is to project the data onto an $M<D$ dimensional space while maximizing the variance of the projected data.

Let's consider the projection onto a one-dimensional space ($M=1$). We can define the direction of this space using a D-dimensional vector $\mathbf{u}_1$, which for convenience we shall choose to be a unit vector so that $\mathbf{u}_{1}^{\mathrm{T}} \mathbf{u}_{1}=1$ (we are only interested in the direction, not the magnitude). Each data point $\mathbf{x}_n$ is projected onto a scalar value $\mathbf{u}_1^T\mathbf{x}_n$. The mean of the projected data is $\mathbf{u}_{1}^{\mathrm{T}} \overline{\mathbf{x}}$ where $\overline{\mathbf{x}}$ is the sample mean given by,

$$ \overline{\mathrm{x}}=\frac{1}{N} \sum_{n=1}^{N} \mathrm{x}_{n} $$

and the variance of the projected data is given by,

$$ \begin{aligned} \frac{1}{N} \sum_{n=1}^{N}\left(\mathbf{u}_{1}^{T} \mathbf{x}_{n}-\mathbf{u}_{1}^{T} \overline{\mathbf{x}}\right)^{2} &=\frac{1}{N} \sum_{n=1}^{N}\left(\mathbf{u}_{1}^{T}\left(\mathbf{x}_{n}-\overline{\mathbf{x}}\right)\right)^{2} \\ &=\frac{1}{N} \sum_{n=1}^{N} \mathbf{u}_{1}^{T}\left(\mathbf{x}_{n}-\overline{\mathbf{x}}\right)\left(\mathbf{x}_{n}-\overline{\mathbf{x}}\right)^{T} \mathbf{u}_{1} \\ &=\mathbf{u}_{1}^{T}\left(\frac{1}{N} \sum_{n=1}^{N}\left(\mathbf{x}_{n}-\overline{\mathbf{x}}\right)\left(\mathbf{x}_{n}-\overline{\mathbf{x}}\right)^{T}\right) \mathbf{u}_{1}\\ &={\mathbf{u}_{1}^{T} \mathbf{S u}_{1}} \end{aligned} $$

where $\mathbf{S}$ is the data covariance matrix defined by

$$ \mathbf{S}=\frac{1}{N} \sum_{n=1}^{N}\left(\mathbf{x}_{n}-\overline{\mathbf{x}}\right)\left(\mathbf{x}_{n}-\overline{\mathbf{x}}\right)^{\mathrm{T}} $$
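As a quick numerical check, the variance of the scalar projections $\mathbf{u}_1^T\mathbf{x}_n$ indeed equals $\mathbf{u}_1^T\mathbf{S}\mathbf{u}_1$. A minimal sketch in NumPy (the notes themselves assume no particular library; the variable names are my own):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))              # N=500 points in D=3 dimensions

x_bar = X.mean(axis=0)                     # sample mean
S = (X - x_bar).T @ (X - x_bar) / len(X)   # covariance matrix (1/N convention)

u1 = np.array([1.0, 0.0, 0.0])             # an arbitrary unit direction
proj = X @ u1                              # scalar projections u1^T x_n

# variance of the projected data equals u1^T S u1
print(np.isclose(proj.var(), u1 @ S @ u1))  # True
```

Note that `proj.var()` uses the same $1/N$ normalization as the covariance definition above, which is why the identity holds exactly.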

We now maximize the projected variance $\mathbf{u}_{1}^{\mathrm{T}} \mathbf{S} \mathbf{u}_{1}$ with respect to $\mathbf{u}_1$. This has to be a constrained maximization to prevent $\left\|\mathbf{u}_{1}\right\| \rightarrow \infty$. The constraint comes from the normalization condition $\mathbf{u}_{1}^{T} \mathbf{u}_{1}=1$. Then, from the method of Lagrange multipliers,

$$ L\left(\mathbf{u}_{1}, \lambda_{1}\right)=\mathbf{u}_{1}^{T} \mathbf{S} \mathbf{u}_{1}+\lambda_{1}\left(1-\mathbf{u}_{1}^{T} \mathbf{u}_{1}\right) $$

Then,

$$ \frac{\partial}{\partial \mathbf{u}_{1}} L\left(\mathbf{u}_{1}, \lambda_{1}\right)=2\mathbf{S} \mathbf{u}_{1}-2\lambda_{1} \mathbf{u}_{1}=0 $$

Thus a stationary point must satisfy

$$ \mathbf{S u}_{1}=\lambda_{1} \mathbf{u}_{1} $$

which means that $\mathbf{u}_{1}$ must be an eigenvector of $\mathbf{S}$. Left-multiplying by $\mathbf{u}_1^T$ and using $\mathbf{u}_1^T\mathbf{u}_1=1$ shows that the projected variance equals $\lambda_1$, so the variance is maximized when $\mathbf{u}_1$ is the eigenvector with the largest eigenvalue $\lambda_1$. This eigenvector is known as the first principal component.

If we project onto the $M$ leading eigenvectors, obtaining $\mathbf{z}_n = (\mathbf{u}_1^T\mathbf{x}_n, \ldots, \mathbf{u}_M^T\mathbf{x}_n)$, the total variance of the projected data is $\operatorname{Tr}[\operatorname{Cov}[\mathbf{z}]]=\sum_{j=1}^{M} \lambda_{j}$.

Thus principal component analysis involves evaluating the mean and the covariance matrix of the data set and then finding the M eigenvectors of $\mathbf{S}$ with the M largest eigenvalues.
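The whole procedure can be sketched in a few lines of NumPy (an illustrative implementation under the $1/N$ covariance convention above; the function name `pca` and the centering of the data before projection are my own choices):

```python
import numpy as np

def pca(X, M):
    """Project X (N x D) onto its M leading principal components.

    Returns the projected data Z (N x M), the components U (D x M),
    and the corresponding eigenvalues.
    """
    x_bar = X.mean(axis=0)
    Xc = X - x_bar                             # center the data
    S = Xc.T @ Xc / len(X)                     # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)       # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:M]      # indices of the M largest
    U = eigvecs[:, order]                      # principal components as columns
    return Xc @ U, U, eigvals[order]

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4)) * np.array([3.0, 1.0, 0.5, 0.1])
Z, U, lam = pca(X, M=2)

# total variance of the projected data is the sum of retained eigenvalues
print(np.isclose(Z.var(axis=0).sum(), lam.sum()))  # True
```

The final check confirms the identity $\operatorname{Tr}[\operatorname{Cov}[\mathbf{z}]]=\sum_{j=1}^{M} \lambda_{j}$ stated above.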

Minimum error formulation

PCA can also be obtained by minimizing the reconstruction error. To do this, we represent the points in a complete orthonormal basis $\left\{\mathbf{u}_{i}\right\}_{i=1}^{D}$ of $\mathbb{R}^D$, where,

$$ \mathbf{u}_{i}^{T} \mathbf{u}_{j}=\left\{\begin{array}{ll} 1 & \text { if } i=j \\ 0 & \text { otherwise } \end{array}\right. $$

Because this basis is complete, each data point can be represented exactly by a linear combination of the basis vectors.

$$ \mathbf{x}_{n}=\sum_{i=1}^{D} \alpha_{n i} \mathbf{u}_{i} $$

This means each original data point $\mathbf{x}_n = \left(x_{n 1}, \ldots, x_{n D}\right)$ is replaced by its coefficients $\left(\alpha_{n 1}, \ldots, \alpha_{n D}\right)$. Using the orthonormality of the basis, we have $\alpha_{n i}=\mathbf{x}_{n}^{T} \mathbf{u}_{i}$, and so,

$$ \mathbf{x}_{n}=\sum_{i=1}^{D}\left(\mathbf{x}_{n}^{\mathrm{T}} \mathbf{u}_{i}\right) \mathbf{u}_{i} $$
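Completeness is easy to verify numerically: any orthogonal matrix supplies such a basis, and reconstructing from all $D$ coefficients recovers the point exactly. A small NumPy sketch (the QR-based basis construction is my own illustration, not part of the derivation):

```python
import numpy as np

rng = np.random.default_rng(2)
D = 5
x = rng.normal(size=D)

# a random orthogonal matrix provides a complete orthonormal basis
Q, _ = np.linalg.qr(rng.normal(size=(D, D)))
U = Q                                   # columns u_1 .. u_D are orthonormal

alpha = U.T @ x                         # coefficients alpha_i = x^T u_i
x_rec = U @ alpha                       # sum_i alpha_i u_i

print(np.allclose(x, x_rec))            # True: complete basis, exact reconstruction
```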

Our goal here is to approximate each data point using a representation restricted to $M < D$ of the basis vectors, with the remaining directions replaced by constants $b_i$ shared across all points. So,

$$ \tilde{\mathbf{x}}_{n}=\sum_{i=1}^{M}\left(\mathbf{x}_{n}^{T} \mathbf{u}_{i}\right) \mathbf{u}_{i}+\sum_{i=M+1}^{D} b_{i} \mathbf{u}_{i} $$

We are free to choose the $\mathbf{u}_i$, the coefficients $z_{n i} = \mathbf{x}_{n}^{\mathrm{T}} \mathbf{u}_{i}$, and the constants $b_i$ so as to minimize the reconstruction error

$$ \frac{1}{N} \sum_{n=1}^{N}\left\|\mathbf{x}_{n}-\tilde{\mathbf{x}}_{n}\right\|^{2}=\frac{1}{N} \sum_{n=1}^{N}\left\|\sum_{i=M+1}^{D}\left(\mathbf{x}_{n}^{T} \mathbf{u}_{i}-b_{i}\right) \mathbf{u}_{i}\right\|^{2} $$

Setting the derivative with respect to $b_i$ to zero gives $b_{i}=\frac{1}{N}\sum_{n=1}^{N}\mathbf{x}_{n}^{T}\mathbf{u}_{i}=\overline{\mathbf{x}}^{T} \mathbf{u}_{i}$.

Substituting this value of $b_i$ back, the error depends only on the discarded directions $\mathbf{u}_i$:

$$ \begin{aligned} &\frac{1}{N} \sum_{n=1}^{N}\left\|\sum_{i=M+1}^{D}\left(\left(\mathbf{x}_{n}-\overline{\mathbf{x}}\right)^{T} \mathbf{u}_{i}\right) \mathbf{u}_{i}\right\|^{2}\\ &= \frac{1}{N} \sum_{n=1}^{N}\left(\sum_{i=M+1}^{D}\left(\left(\mathbf{x}_{n}-\overline{\mathbf{x}}\right)^{T} \mathbf{u}_{i}\right) \mathbf{u}_{i}\right)^{T}\left(\sum_{j=M+1}^{D}\left(\left(\mathbf{x}_{n}-\overline{\mathbf{x}}\right)^{T} \mathbf{u}_{j}\right) \mathbf{u}_{j}\right) \\ &= \frac{1}{N} \sum_{n=1}^{N} \sum_{i=M+1}^{D} \sum_{j=M+1}^{D}\left(\left(\mathbf{x}_{n}-\overline{\mathbf{x}}\right)^{T} \mathbf{u}_{i}\right) \mathbf{u}_{i}^{T} \mathbf{u}_{j}\left(\left(\mathbf{x}_{n}-\overline{\mathbf{x}}\right)^{T} \mathbf{u}_{j}\right) \\ &= \frac{1}{N} \sum_{n=1}^{N} \sum_{i=M+1}^{D} \mathbf{u}_{i}^{T}\left(\mathbf{x}_{n}-\overline{\mathbf{x}}\right)\left(\mathbf{x}_{n}-\overline{\mathbf{x}}\right)^{T} \mathbf{u}_{i} \\ &= \sum_{i=M+1}^{D} \mathbf{u}_{i}^{T} \mathbf{S} \mathbf{u}_{i} \end{aligned} $$

Minimizing with respect to $\mathbf{u}_{i}$ under the constraint $\mathbf{u}_{i}^{T} \mathbf{u}_{i}=1$ (without which we would get the trivial solution $\mathbf{u}_i=0$) with the method of Lagrange multipliers, we get the general solution

$$ \mathbf{S} \mathbf{u}_{i}=\lambda_{i} \mathbf{u}_{i} $$

Thus the error is minimized by choosing the discarded directions $\mathbf{u}_{M+1}, \ldots, \mathbf{u}_D$ to be the eigenvectors of the covariance matrix $\mathbf{S}$ with the $D-M$ smallest eigenvalues; equivalently, the retained subspace is spanned by the eigenvectors with the $M$ largest eigenvalues, recovering the maximal variance solution.
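A numerical sanity check of this result (a NumPy sketch under the $1/N$ covariance convention used above): the mean squared reconstruction error from keeping the top $M$ components, with $b_i = \overline{\mathbf{x}}^T\mathbf{u}_i$, equals the sum of the discarded eigenvalues $\sum_{i=M+1}^{D}\lambda_i$.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 4)) * np.array([2.0, 1.0, 0.5, 0.2])
x_bar = X.mean(axis=0)
S = (X - x_bar).T @ (X - x_bar) / len(X)

eigvals, U = np.linalg.eigh(S)          # eigenvalues in ascending order
M = 2
U_keep = U[:, -M:]                      # top-M eigenvectors are kept

# reconstruct each point from the M retained components plus the mean
# (the mean term comes from the optimal b_i = x_bar^T u_i)
X_tilde = x_bar + (X - x_bar) @ U_keep @ U_keep.T
err = np.mean(np.sum((X - X_tilde) ** 2, axis=1))

# the mean squared error equals the sum of the D-M discarded eigenvalues
print(np.isclose(err, eigvals[:-M].sum()))  # True
```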

Linear Projection for dimensionality reduction

The final $k$-dimensional linear projection $\mathbf{y}_n$ (with $k$ playing the role of $M$ above) is obtained as

$$ \mathbf{y}_n = \boldsymbol{L}\mathbf{x}_n $$

where $\boldsymbol{L}=\boldsymbol{\Lambda}^{\frac{1}{2}}_k \boldsymbol{U}_k^T$, $\boldsymbol{U}_k$ is the matrix whose columns are the $k$ leading eigenvectors of $\mathbf{S}$, and $\boldsymbol{\Lambda}_k$ is the diagonal matrix of the corresponding eigenvalues.
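A sketch of this projection in NumPy (the matrix names mirror the formula; note that another common convention omits the $\boldsymbol{\Lambda}_k^{1/2}$ scaling and simply uses $\boldsymbol{U}_k^T$ applied to the centered data):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 5))
x_bar = X.mean(axis=0)
S = (X - x_bar).T @ (X - x_bar) / len(X)

eigvals, eigvecs = np.linalg.eigh(S)    # ascending order
k = 2
U_k = eigvecs[:, -k:][:, ::-1]          # top-k eigenvectors as columns
lam_k = eigvals[-k:][::-1]              # corresponding eigenvalues

L = np.diag(np.sqrt(lam_k)) @ U_k.T     # L = Lambda_k^{1/2} U_k^T, shape (k, D)
Y = X @ L.T                             # y_n = L x_n for every point

print(Y.shape)                          # (100, 2)
```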


References

  1. Bishop, *Pattern Recognition and Machine Learning* (2006), Section 12.1.1; Lecture 10, ML1
  2. PCA slides, CMU 10-601 (Fall 2012): http://www.cs.cmu.edu/~tom/10601_fall2012/slides/pca.pdf
  3. Tutorial on PCA, University of Otago: http://www.cs.otago.ac.nz/research/publications/OUCS-2002-12.pdf