Principal Component Analysis (PCA)

Principal component analysis, or PCA, is a dimensionality reduction algorithm.

There are two equivalent deterministic formulations/derivations of PCA:

Maximal variance formulation

Given the data $\mathbf{X}=\left\{\mathbf{x}_{1}, \ldots, \mathbf{x}_{N}\right\}, \mathbf{x}_{n} \in \mathbb{R}^{D}$, the goal of PCA is to project the data onto an $M<D$ dimensional space while maximizing the variance of the projected data.

Let's consider the projection onto a one-dimensional space ($M=1$). We can define the direction of this space using a D-dimensional vector $\mathbf{u}_1$, which for convenience we shall choose to be a unit vector so that $\mathbf{u}_{1}^{\mathrm{T}} \mathbf{u}_{1}=1$ (we are only interested in the direction, not the magnitude). Each data point $\mathbf{x}_n$ is projected onto a scalar value $\mathbf{u}_1^T\mathbf{x}_n$. The mean of the projected data is $\mathbf{u}_{1}^{\mathrm{T}} \overline{\mathbf{x}}$ where $\overline{\mathbf{x}}$ is the sample mean given by,

$$ \overline{\mathrm{x}}=\frac{1}{N} \sum_{n=1}^{N} \mathrm{x}_{n} $$

and the variance of the projected data is given by,

$$ \begin{aligned} \frac{1}{N} \sum_{n=1}^{N}\left(\mathbf{u}_{1}^{T} \mathbf{x}_{n}-\mathbf{u}_{1}^{T} \overline{\mathbf{x}}\right)^{2} &=\frac{1}{N} \sum_{n=1}^{N}\left(\mathbf{u}_{1}^{T}\left(\mathbf{x}_{n}-\overline{\mathbf{x}}\right)\right)^{2} \\ &=\frac{1}{N} \sum_{n=1}^{N} \mathbf{u}_{1}^{T}\left(\mathbf{x}_{n}-\overline{\mathbf{x}}\right)\left(\mathbf{x}_{n}-\overline{\mathbf{x}}\right)^{T} \mathbf{u}_{1} \\ &=\mathbf{u}_{1}^{T}\left(\frac{1}{N} \sum_{n=1}^{N}\left(\mathbf{x}_{n}-\overline{\mathbf{x}}\right)\left(\mathbf{x}_{n}-\overline{\mathbf{x}}\right)^{T}\right) \mathbf{u}_{1}\\ &={\mathbf{u}_{1}^{T} \mathbf{S u}_{1}} \end{aligned} $$

where $\mathbf{S}$ is the data covariance matrix defined by

$$ \mathbf{S}=\frac{1}{N} \sum_{n=1}^{N}\left(\mathbf{x}_{n}-\overline{\mathbf{x}}\right)\left(\mathbf{x}_{n}-\overline{\mathbf{x}}\right)^{\mathrm{T}} $$
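As a quick numerical check, the variance of the scalar projections $\mathbf{u}_1^T\mathbf{x}_n$ indeed equals $\mathbf{u}_1^T\mathbf{S}\mathbf{u}_1$. A minimal sketch in NumPy (the notes themselves assume no particular library; the variable names are my own):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))              # N=500 points in D=3 dimensions

x_bar = X.mean(axis=0)                     # sample mean
S = (X - x_bar).T @ (X - x_bar) / len(X)   # covariance matrix (1/N convention)

u1 = np.array([1.0, 0.0, 0.0])             # an arbitrary unit direction
proj = X @ u1                              # scalar projections u1^T x_n

# variance of the projected data equals u1^T S u1
print(np.isclose(proj.var(), u1 @ S @ u1))  # True
```

Note that `proj.var()` uses the same $1/N$ normalization as the covariance definition above, which is why the identity holds exactly.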

We now maximize the projected variance $\mathbf{u}_{1}^{\mathrm{T}} \mathbf{S} \mathbf{u}_{1}$ with respect to $\mathbf{u}_1$. This has to be a constrained maximization to prevent $\left\|\mathbf{u}_{1}\right\| \rightarrow \infty$. The constraint comes from the normalization condition $\mathbf{u}_{1}^{T} \mathbf{u}_{1}=1$. Then, from the method of Lagrange multipliers,

$$ L\left(\mathbf{u}_{1}, \lambda_{1}\right)=\mathbf{u}_{1}^{T} \mathbf{S} \mathbf{u}_{1}+\lambda_{1}\left(1-\mathbf{u}_{1}^{T} \mathbf{u}_{1}\right) $$

Then,

$$ \frac{\partial}{\partial \mathbf{u}_{1}} L\left(\mathbf{u}_{1}, \lambda_{1}\right)=2\mathbf{S} \mathbf{u}_{1}-2\lambda_{1} \mathbf{u}_{1}=0 $$

Thus a stationary point must satisfy

$$ \mathbf{S u}_{1}=\lambda_{1} \mathbf{u}_{1} $$

which means that $\mathbf{u}_{1}$ must be an eigenvector of $\mathbf{S}$. Left-multiplying by $\mathbf{u}_1^T$ and using $\mathbf{u}_1^T\mathbf{u}_1=1$ shows that the projected variance equals $\lambda_1$, so the variance is maximized when $\mathbf{u}_1$ is the eigenvector with the largest eigenvalue $\lambda_1$. This eigenvector is known as the first principal component.

If we project onto the $M$ leading eigenvectors, obtaining $\mathbf{z}_n = (\mathbf{u}_1^T\mathbf{x}_n, \ldots, \mathbf{u}_M^T\mathbf{x}_n)$, the total variance of the projected data is $\operatorname{Tr}[\operatorname{Cov}[\mathbf{z}]]=\sum_{j=1}^{M} \lambda_{j}$.

Thus principal component analysis involves evaluating the mean and the covariance matrix of the data set and then finding the M eigenvectors of $\mathbf{S}$ with the M largest eigenvalues.
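The whole procedure can be sketched in a few lines of NumPy (an illustrative implementation under the $1/N$ covariance convention above; the function name `pca` and the centering of the data before projection are my own choices):

```python
import numpy as np

def pca(X, M):
    """Project X (N x D) onto its M leading principal components.

    Returns the projected data Z (N x M), the components U (D x M),
    and the corresponding eigenvalues.
    """
    x_bar = X.mean(axis=0)
    Xc = X - x_bar                             # center the data
    S = Xc.T @ Xc / len(X)                     # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)       # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:M]      # indices of the M largest
    U = eigvecs[:, order]                      # principal components as columns
    return Xc @ U, U, eigvals[order]

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4)) * np.array([3.0, 1.0, 0.5, 0.1])
Z, U, lam = pca(X, M=2)

# total variance of the projected data is the sum of retained eigenvalues
print(np.isclose(Z.var(axis=0).sum(), lam.sum()))  # True
```

The final check confirms the identity $\operatorname{Tr}[\operatorname{Cov}[\mathbf{z}]]=\sum_{j=1}^{M} \lambda_{j}$ stated above.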

Minimum error formulation

PCA can also be obtained by minimizing the reconstruction error. To do this, we represent the points in a complete orthonormal basis $\left\{\mathbf{u}_{i}\right\}_{i=1}^{D}$ of $\mathbb{R}^D$, where,

$$ \mathbf{u}_{i}^{T} \mathbf{u}_{j}=\left\{\begin{array}{ll} 1 & \text { if } i=j \\ 0 & \text { otherwise } \end{array}\right. $$

Because this basis is complete, each data point can be represented exactly by a linear combination of the basis vectors.

$$ \mathbf{x}_{n}=\sum_{i=1}^{D} \alpha_{n i} \mathbf{u}_{i} $$

This means each original data point $\mathbf{x}_n = \left(x_{n 1}, \ldots, x_{n D}\right)$ is replaced by its coefficients $\left(\alpha_{n 1}, \ldots, \alpha_{n D}\right)$. Using the orthonormality of the basis, we have $\alpha_{n i}=\mathbf{x}_{n}^{T} \mathbf{u}_{i}$, and so,

$$ \mathbf{x}_{n}=\sum_{i=1}^{D}\left(\mathbf{x}_{n}^{\mathrm{T}} \mathbf{u}_{i}\right) \mathbf{u}_{i} $$
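Completeness is easy to verify numerically: any orthogonal matrix supplies such a basis, and reconstructing from all $D$ coefficients recovers the point exactly. A small NumPy sketch (the QR-based basis construction is my own illustration, not part of the derivation):

```python
import numpy as np

rng = np.random.default_rng(2)
D = 5
x = rng.normal(size=D)

# a random orthogonal matrix provides a complete orthonormal basis
Q, _ = np.linalg.qr(rng.normal(size=(D, D)))
U = Q                                   # columns u_1 .. u_D are orthonormal

alpha = U.T @ x                         # coefficients alpha_i = x^T u_i
x_rec = U @ alpha                       # sum_i alpha_i u_i

print(np.allclose(x, x_rec))            # True: complete basis, exact reconstruction
```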

Our goal here is to approximate each data point using a representation restricted to $M < D$ of the basis vectors, with the remaining directions replaced by constants $b_i$ shared across all points. So,

$$ \tilde{\mathbf{x}}_{n}=\sum_{i=1}^{M}\left(\mathbf{x}_{n}^{T} \mathbf{u}_{i}\right) \mathbf{u}_{i}+\sum_{i=M+1}^{D} b_{i} \mathbf{u}_{i} $$

We are free to choose the $\mathbf{u}_i$, the coefficients $z_{n i} = \mathbf{x}_{n}^{\mathrm{T}} \mathbf{u}_{i}$, and the constants $b_i$ so as to minimize the reconstruction error

$$ \frac{1}{N} \sum_{n=1}^{N}\left\|\mathbf{x}_{n}-\tilde{\mathbf{x}}_{n}\right\|^{2}=\frac{1}{N} \sum_{n=1}^{N}\left\|\sum_{i=M+1}^{D}\left(\mathbf{x}_{n}^{T} \mathbf{u}_{i}-b_{i}\right) \mathbf{u}_{i}\right\|^{2} $$

Setting the derivative with respect to $b_i$ to zero gives $b_{i}=\frac{1}{N}\sum_{n=1}^{N}\mathbf{x}_{n}^{T}\mathbf{u}_{i}=\overline{\mathbf{x}}^{T} \mathbf{u}_{i}$.

Substituting this value of $b_i$ back, the error depends only on the discarded directions $\mathbf{u}_i$:

$$ \begin{aligned} &\frac{1}{N} \sum_{n=1}^{N}\left\|\sum_{i=M+1}^{D}\left(\left(\mathbf{x}_{n}-\overline{\mathbf{x}}\right)^{T} \mathbf{u}_{i}\right) \mathbf{u}_{i}\right\|^{2}\\ &= \frac{1}{N} \sum_{n=1}^{N}\left(\sum_{i=M+1}^{D}\left(\left(\mathbf{x}_{n}-\overline{\mathbf{x}}\right)^{T} \mathbf{u}_{i}\right) \mathbf{u}_{i}\right)^{T}\left(\sum_{j=M+1}^{D}\left(\left(\mathbf{x}_{n}-\overline{\mathbf{x}}\right)^{T} \mathbf{u}_{j}\right) \mathbf{u}_{j}\right) \\ &= \frac{1}{N} \sum_{n=1}^{N} \sum_{i=M+1}^{D} \sum_{j=M+1}^{D}\left(\left(\mathbf{x}_{n}-\overline{\mathbf{x}}\right)^{T} \mathbf{u}_{i}\right) \mathbf{u}_{i}^{T} \mathbf{u}_{j}\left(\left(\mathbf{x}_{n}-\overline{\mathbf{x}}\right)^{T} \mathbf{u}_{j}\right) \\ &= \frac{1}{N} \sum_{n=1}^{N} \sum_{i=M+1}^{D} \mathbf{u}_{i}^{T}\left(\mathbf{x}_{n}-\overline{\mathbf{x}}\right)\left(\mathbf{x}_{n}-\overline{\mathbf{x}}\right)^{T} \mathbf{u}_{i} \\ &= \sum_{i=M+1}^{D} \mathbf{u}_{i}^{T} \mathbf{S} \mathbf{u}_{i} \end{aligned} $$

Minimizing with respect to $\mathbf{u}_{i}$ under the constraint $\mathbf{u}_{i}^{T} \mathbf{u}_{i}=1$ (without which we would get the trivial solution $\mathbf{u}_i=0$) with the method of Lagrange multipliers, we get the general solution

$$ \mathbf{S} \mathbf{u}_{i}=\lambda_{i} \mathbf{u}_{i} $$

Thus the error is minimized by choosing the discarded directions $\mathbf{u}_{M+1}, \ldots, \mathbf{u}_D$ to be the eigenvectors of the covariance matrix $\mathbf{S}$ with the $D-M$ smallest eigenvalues; equivalently, the retained subspace is spanned by the eigenvectors with the $M$ largest eigenvalues, recovering the maximal variance solution.
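A numerical sanity check of this result (a NumPy sketch under the $1/N$ covariance convention used above): the mean squared reconstruction error from keeping the top $M$ components, with $b_i = \overline{\mathbf{x}}^T\mathbf{u}_i$, equals the sum of the discarded eigenvalues $\sum_{i=M+1}^{D}\lambda_i$.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 4)) * np.array([2.0, 1.0, 0.5, 0.2])
x_bar = X.mean(axis=0)
S = (X - x_bar).T @ (X - x_bar) / len(X)

eigvals, U = np.linalg.eigh(S)          # eigenvalues in ascending order
M = 2
U_keep = U[:, -M:]                      # top-M eigenvectors are kept

# reconstruct each point from the M retained components plus the mean
# (the mean term comes from the optimal b_i = x_bar^T u_i)
X_tilde = x_bar + (X - x_bar) @ U_keep @ U_keep.T
err = np.mean(np.sum((X - X_tilde) ** 2, axis=1))

# the mean squared error equals the sum of the D-M discarded eigenvalues
print(np.isclose(err, eigvals[:-M].sum()))  # True
```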

Linear Projection for dimensionality reduction

The final $k$-dimensional linear projection $\mathbf{y}_n$ (with $k$ playing the role of $M$ above) is obtained as

$$ \mathbf{y}_n = \boldsymbol{L}\mathbf{x}_n $$

where $\boldsymbol{L}=\boldsymbol{\Lambda}^{\frac{1}{2}}_k \boldsymbol{U}_k^T$, $\boldsymbol{U}_k$ is the matrix whose columns are the $k$ leading eigenvectors of $\mathbf{S}$, and $\boldsymbol{\Lambda}_k$ is the diagonal matrix of the corresponding eigenvalues.
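A sketch of this projection in NumPy (the matrix names mirror the formula; note that another common convention omits the $\boldsymbol{\Lambda}_k^{1/2}$ scaling and simply uses $\boldsymbol{U}_k^T$ applied to the centered data):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 5))
x_bar = X.mean(axis=0)
S = (X - x_bar).T @ (X - x_bar) / len(X)

eigvals, eigvecs = np.linalg.eigh(S)    # ascending order
k = 2
U_k = eigvecs[:, -k:][:, ::-1]          # top-k eigenvectors as columns
lam_k = eigvals[-k:][::-1]              # corresponding eigenvalues

L = np.diag(np.sqrt(lam_k)) @ U_k.T     # L = Lambda_k^{1/2} U_k^T, shape (k, D)
Y = X @ L.T                             # y_n = L x_n for every point

print(Y.shape)                          # (100, 2)
```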


References

  1. Bishop, *Pattern Recognition and Machine Learning* (2006), Section 12.1.1; Lecture 10, ML1
  2. PCA slides, CMU 10-601 (Fall 2012): http://www.cs.cmu.edu/~tom/10601_fall2012/slides/pca.pdf
  3. Tutorial on PCA, University of Otago: http://www.cs.otago.ac.nz/research/publications/OUCS-2002-12.pdf