Principal Component Analysis (PCA)
Principal component analysis, or PCA, is a dimensionality reduction algorithm.
There are two deterministic definitions/derivations of PCA:
Maximal variance formulation
Given the data $\mathbf{X}=\left\{\mathbf{x}_{1}, \ldots, \mathbf{x}_{N}\right\}$, $\mathbf{x}_{n} \in \mathbb{R}^{D}$, the goal of PCA is to project the data onto an $M<D$ dimensional space while maximizing the variance of the projected data.
Let's consider the projection onto a one-dimensional space ($M=1$). We can define the direction of this space using a $D$-dimensional vector $\mathbf{u}_1$, which for convenience we choose to be a unit vector so that $\mathbf{u}_{1}^{\mathrm{T}} \mathbf{u}_{1}=1$ (we are only interested in the direction, not the magnitude). Each data point $\mathbf{x}_n$ is projected onto a scalar value $\mathbf{u}_1^{\mathrm{T}}\mathbf{x}_n$. The mean of the projected data is $\mathbf{u}_{1}^{\mathrm{T}} \overline{\mathbf{x}}$, where $\overline{\mathbf{x}}$ is the sample mean given by
$$\overline{\mathbf{x}}=\frac{1}{N} \sum_{n=1}^{N} \mathbf{x}_{n},$$
and the variance of the projected data is given by
$$\frac{1}{N} \sum_{n=1}^{N}\left(\mathbf{u}_{1}^{\mathrm{T}} \mathbf{x}_{n}-\mathbf{u}_{1}^{\mathrm{T}} \overline{\mathbf{x}}\right)^{2}=\mathbf{u}_{1}^{\mathrm{T}} \mathbf{S} \mathbf{u}_{1},$$
where $\mathbf{S}$ is the data covariance matrix defined by
$$\mathbf{S}=\frac{1}{N} \sum_{n=1}^{N}\left(\mathbf{x}_{n}-\overline{\mathbf{x}}\right)\left(\mathbf{x}_{n}-\overline{\mathbf{x}}\right)^{\mathrm{T}}.$$
We now maximize the projected variance $\mathbf{u}_{1}^{\mathrm{T}} \mathbf{S} \mathbf{u}_{1}$ with respect to $\mathbf{u}_1$. This has to be a constrained maximization to prevent $\left\|\mathbf{u}_{1}\right\| \rightarrow \infty$. The constraint comes from the normalization condition $\mathbf{u}_{1}^{\mathrm{T}} \mathbf{u}_{1}=1$. From the method of Lagrange multipliers, we maximize
$$\mathbf{u}_{1}^{\mathrm{T}} \mathbf{S} \mathbf{u}_{1}+\lambda_{1}\left(1-\mathbf{u}_{1}^{\mathrm{T}} \mathbf{u}_{1}\right).$$
Setting the derivative with respect to $\mathbf{u}_{1}$ equal to zero, we see that this quantity has a stationary point when
$$\mathbf{S} \mathbf{u}_{1}=\lambda_{1} \mathbf{u}_{1},$$
which means that $\mathbf{u}_{1}$ must be an eigenvector of $\mathbf{S}$, and the projected variance $\mathbf{u}_{1}^{\mathrm{T}} \mathbf{S} \mathbf{u}_{1}=\lambda_{1}$ is maximized when $\mathbf{u}_{1}$ is the eigenvector with the largest eigenvalue $\lambda_1$. This eigenvector is known as the first principal component.
The (total) variance of the projected data $\mathbf{z}$, obtained by projecting onto the first $M$ principal components, is $\operatorname{Tr}[\operatorname{Cov}[\mathbf{z}]]=\sum_{j=1}^{M} \lambda_{j}$.
Thus principal component analysis involves evaluating the mean and the covariance matrix of the data set and then finding the $M$ eigenvectors of $\mathbf{S}$ with the $M$ largest eigenvalues.
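The recipe above (center the data, form $\mathbf{S}$, take the top-$M$ eigenvectors) can be sketched in NumPy; the synthetic Gaussian data and the names `X`, `M` are illustrative assumptions, not from the source.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))        # N=200 synthetic points in D=5 dimensions
M = 2                                # target dimensionality

x_bar = X.mean(axis=0)               # sample mean
Xc = X - x_bar                       # centered data
S = Xc.T @ Xc / len(X)               # covariance matrix S (1/N convention)

eigvals, eigvecs = np.linalg.eigh(S) # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]    # sort descending
U_M = eigvecs[:, order[:M]]          # M principal components as columns

Z = Xc @ U_M                         # projected data, shape (200, 2)
```

As a sanity check, the per-component variances of `Z` equal the $M$ largest eigenvalues, so their sum matches the total projected variance $\sum_{j=1}^{M}\lambda_j$ stated above.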
Minimum error formulation
PCA can also be obtained by minimizing the reconstruction error. To do this, we represent the points in a complete orthonormal basis $\left\{\mathbf{u}_{i}\right\}_{i=1}^{D}$ where
$$\mathbf{u}_{i}^{\mathrm{T}} \mathbf{u}_{j}=\delta_{i j}.$$
Because this basis is complete, each data point can be represented exactly by a linear combination of the basis vectors,
$$\mathbf{x}_{n}=\sum_{i=1}^{D} \alpha_{n i} \mathbf{u}_{i}.$$
This means each original data point $\mathbf{x}_n=\left(x_{n 1}, \ldots, x_{n D}\right)^{\mathrm{T}}$ is replaced by the coefficients $\left\{\alpha_{n 1}, \ldots, \alpha_{n D}\right\}$. Using orthonormality of the basis, we have $\alpha_{n i}=\mathbf{x}_{n}^{\mathrm{T}} \mathbf{u}_{i}$, and so
$$\mathbf{x}_{n}=\sum_{i=1}^{D}\left(\mathbf{x}_{n}^{\mathrm{T}} \mathbf{u}_{i}\right) \mathbf{u}_{i}.$$
Our goal here is to approximate each data point using a representation restricted to $M < D$ directions,
$$\tilde{\mathbf{x}}_{n}=\sum_{i=1}^{M} z_{n i} \mathbf{u}_{i}+\sum_{i=M+1}^{D} b_{i} \mathbf{u}_{i},$$
where the constants $b_i$ are shared by all data points.
We are free to choose $\mathbf{u}_i$, $z_{n i}$, and $b_i$ so as to minimize the reconstruction error
$$J=\frac{1}{N} \sum_{n=1}^{N}\left\|\mathbf{x}_{n}-\tilde{\mathbf{x}}_{n}\right\|^{2}.$$
Minimizing with respect to $z_{n i}$ and $b_i$, we get $z_{n i}=\mathbf{x}_{n}^{\mathrm{T}} \mathbf{u}_{i}$ and $b_{i}=\overline{\mathbf{x}}^{\mathrm{T}} \mathbf{u}_{i}$.
Substituting back, the error depends only on the discarded directions:
$$J=\frac{1}{N} \sum_{n=1}^{N} \sum_{i=M+1}^{D}\left(\mathbf{x}_{n}^{\mathrm{T}} \mathbf{u}_{i}-\overline{\mathbf{x}}^{\mathrm{T}} \mathbf{u}_{i}\right)^{2}=\sum_{i=M+1}^{D} \mathbf{u}_{i}^{\mathrm{T}} \mathbf{S} \mathbf{u}_{i}.$$
Minimizing over $\mathbf{u}_{i}$ under the constraint $\mathbf{u}_{i}^{\mathrm{T}} \mathbf{u}_{i}=1$ (otherwise we get $\mathbf{u}_i=0$) with the method of Lagrange multipliers, we get the general solution
$$\mathbf{S} \mathbf{u}_{i}=\lambda_{i} \mathbf{u}_{i}, \quad \text { so that } \quad J=\sum_{i=M+1}^{D} \lambda_{i}.$$
Thus the solution is obtained by choosing the $M$ eigenvectors of the covariance matrix $\mathbf{S}$ with the largest eigenvalues, so that the discarded $D-M$ eigenvalues are the smallest.
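A minimal sketch of the minimum-error view, verifying numerically that the reconstruction error $J$ equals the sum of the $D-M$ smallest eigenvalues; the synthetic data and variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))             # N=100 synthetic points, D=4
M = 2

x_bar = X.mean(axis=0)
Xc = X - x_bar
S = Xc.T @ Xc / len(X)                    # covariance with 1/N convention

eigvals, eigvecs = np.linalg.eigh(S)      # eigenvalues in ascending order
U_M = eigvecs[:, -M:]                     # M eigenvectors with largest eigenvalues

# Reconstruct each point from its M retained coefficients z_ni = x_n^T u_i,
# with the discarded directions fixed at b_i = x_bar^T u_i:
X_rec = x_bar + (Xc @ U_M) @ U_M.T
J = np.mean(np.sum((X - X_rec) ** 2, axis=1))

# J should match the sum of the D-M smallest eigenvalues
print(np.isclose(J, eigvals[:-M].sum()))
```

This makes the trade-off concrete: the error you pay for dropping a direction is exactly that direction's eigenvalue.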
Linear Projection for dimensionality reduction
The final $k$-dimensional linear projection $\mathbf{y}_n$ is obtained as
$$\mathbf{y}_{n}=\boldsymbol{L}\left(\mathbf{x}_{n}-\overline{\mathbf{x}}\right),$$
where $\boldsymbol{L}=\boldsymbol{\Lambda}^{\frac{1}{2}}_k \boldsymbol{U}_k^{\mathrm{T}}$, $\boldsymbol{\Lambda}_k$ is the diagonal matrix of the $k$ largest eigenvalues, and $\boldsymbol{U}_k$ holds the corresponding eigenvectors as columns.
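A sketch of this projection in NumPy, assuming $\boldsymbol{L}$ is applied to the centered data as in the derivations above; the data and names (`X`, `k`) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 6))           # N=50 synthetic points, D=6
k = 3

x_bar = X.mean(axis=0)
Xc = X - x_bar
S = Xc.T @ Xc / len(X)

eigvals, eigvecs = np.linalg.eigh(S)
idx = np.argsort(eigvals)[::-1][:k]    # indices of the k largest eigenvalues
Lam_half = np.diag(np.sqrt(eigvals[idx]))  # Lambda_k^{1/2}
U_k = eigvecs[:, idx]                  # corresponding eigenvectors as columns

L = Lam_half @ U_k.T                   # L = Lambda_k^{1/2} U_k^T
Y = (L @ Xc.T).T                       # y_n = L (x_n - x_bar), shape (50, 3)
```

With this choice of $\boldsymbol{L}$, the covariance of the projected data is $\boldsymbol{\Lambda}_k^{1/2}\boldsymbol{\Lambda}_k\boldsymbol{\Lambda}_k^{1/2}=\boldsymbol{\Lambda}_k^2$, i.e. the components are decorrelated but rescaled by their eigenvalues rather than whitened.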
References
- Bishop, C. M. (2006), *Pattern Recognition and Machine Learning*, Section 12.1.1; Lecture 10, ML1
- PCA slides, CMU 10-601: http://www.cs.cmu.edu/~tom/10601_fall2012/slides/pca.pdf
- Tutorial on PCA, University of Otago: http://www.cs.otago.ac.nz/research/publications/OUCS-2002-12.pdf