**Elan Ding**

*Last Modified: June 18, 2018*

In this post, we will discuss PCA in mathematical perspective. If you are not familiar with linear algebra, I recommend reading the *Appendix* section first.
$\newcommand{\bs}{\boldsymbol}$
$\newcommand{\argmin}[1]{\underset{\bs{#1}}{\text{arg min}}\,}$
$\newcommand{\argmax}[1]{\underset{\bs{#1}}{\text{arg max}}\,}$
$\newcommand{\tr}{^{\top}}$
$\newcommand{\norm}[1]{\left|\left|\,#1\,\right|\right|}$

Suppose we have $p$ features and $n$ data points, $\{\bs{x}^{(1)}, \bs{x}^{(2)},...,\bs{x}^{(n)}\}$, such that each $\bs{x}^{(i)}$ is a vector in $\mathbb{R}^p$.

The main idea is that for each point $\bs{x}^{(i)} \in \mathbb{R}^p$, we will *encode* it into a new vector $\bs{c}^{(i)} \in \mathbb{R}^l, l<n$, so that when we *decode* it back to a vector in $\mathbb{R}^p$, we will lose as little information as possible. (Recall that in Post 1, we chose $n=200$, $p=3$, and $l=2$.)

Let $f: \mathbb{R}^p \to \mathbb{R}^l$ be an *encoding function* (not necessarily linear) such that $f(\bs{x})= \bs{c}$. We also define the *decoding function* as $g:\mathbb{R}^l\to \mathbb{R}^p$. For PCA, the underlying assumption is that the *decoding function* is linear and the columns of the *decoding matrix* are orthonormal. Hence, we have that

$$ g(\bs{c}) = \bs{D}\bs{c}, \quad \text{ where } \bs{D} \in \mathbb{R}^{p\times l},\,\, \bs{D}\tr \bs{D} = I $$

PCA poses a strong assumption on the decoding function. Now suppose that the matrix $\bs{D}$ is already given, how can we choose the optimum encoding function $f$? In other words, given a data point $\bs{x} \in \mathbb{R}^p$, we need to generate the optimal encoded point $\bs{c}^* = f(\bs{x}) \in \mathbb{R}^l$, so that

$$ \bs{c}^* = \argmin{\bs{c}}\norm{\bs{x}-g(\bs{c})}_2 $$

Since minimizing the $L^2$ norm is the same as minimizing the squared $L^2$ norm, the function we are minimizing can be written as

$$ \begin{aligned} (\bs{x}-g(\bs{c}))\tr(\bs{x}-g(\bs{c})) &= \bs{x}\tr\bs{x}-\bs{x}\tr g(\bs{c}) - g(\bs{c})\tr \bs{x} + g(\bs{c})\tr g(\bs{c}) \\ & = \bs{x}\tr \bs{x} - 2\bs{x}\tr g(\bs{c}) + g(\bs{c})\tr g(\bs{c}) \end{aligned} $$

Discarding the terms that do not depend on $\bs{c}$ and substituting $\bs{D}\bs{c}$ for $g(\bs{c})$, we have

$$ \begin{aligned} \bs{c}^* &= \argmin{\bs{c}} \left[-2\bs{x}\tr\bs{D}\bs{c} + \bs{c}\tr\bs{D}\tr\bs{D}\bs{c}\right]\\ &=\argmin{\bs{c}} \left[-2\bs{x}\tr \bs{D}\bs{c} + \bs{c}\tr\bs{c}\right] \end{aligned} $$

From vector calculus, the interior minimum occurs only when the gradient vector is $\bs{0}$. That is,

$$ \nabla_{\bs{c}} (-2\bs{x}\tr \bs{D}\bs{c}+\bs{c}\tr\bs{c})=\bs{0} $$ $$ -2\bs{D}\tr\bs{x}+2\bs{c}=\bs{0} $$ $$ \bs{c}=\bs{D}\tr\bs{x} $$

This is great news because we did not know anything about our *encoding function* $f$ before hand. Once we restrict our *decoding function* to be $g(\bs{c}) = \bs{D}\bs{c}$, we can immediately conclude that $f(\bs{x})=\bs{D}\tr\bs{x}$ is the optimum choice in terms of minimizing recovering error for each data point $\bs{x}$.

Now we have a clear idea of the procedure. First starting with a data point $\bs{x}$, we apply the encoding function $f(\bs{x})=\bs{D}\tr\bs{x}$. Then we reconstruct our original vector back by applying the *reconstruction function*:

$$ r(\bs{x}) = g(f(\bs{x})) = \bs{D}\bs{D}\tr \bs{x}. $$

Note that $\bs{D}\bs{D}\tr$ is the projection matrix that projects $\bs{x}$ onto the column space of $\bs{D}$. Amazing! We can already sense the great geometric significance here.

Next, we need to choose the encoding matrix $\bs{D}$. This time, since $\bs{D}$ is used for all points, instead of minimizing the $L^2$ distance between $\bs{x}$ and $r(\bs{x})$ for a single $\bs{x}$, we minimize it over *all* points $\{\bs{x}^{(1)}, \bs{x}^{(2)},...,\bs{x}^{(n)}\}$. That is we need to solve

$$ \bs{D}^* = \argmin{\bs{D}} \sum_i \norm{\bs{x}^{(i)} - \bs{D}\bs{D}\tr \bs{x}^{(i)}}_2^2 $$

To find $\bs{D}^* \in \mathbb{R}^{p\times l}$ we start by considering the case where $l=1$. In this case, $\bs{D} \in \mathbb{R}^{p\times 1}$, a $p$-vector, is renamed as $\bs{d}$. By orthogonality assumption, $\norm{\bs{d}}_2 =1$. Under this constraint, the function we are trying to mimimize becomes

$$ \begin{aligned} \sum_i\norm{ \bs{x}^{(i)}-\bs{d}\bs{d}\tr\bs{x}^{(i)} }_2^2 &= \sum_i\norm{ \bs{x}^{(i)}-\bs{d}\tr\bs{x}^{(i)}\bs{d} }_2^2 \\ &=\sum_i \norm{ \bs{x}^{(i)} - {\bs{x}^{(i)}}\tr \bs{d}\bs{d} }_2^2 \\ &=\norm{ \bs{X} - \bs{X}\bs{d}\bs{d}\tr }_F^2 \end{aligned} $$

where $\bs{X} \in \mathbb{R}^{n\times p}$ is a matrix formed by putting ${\bs{x}^{(i)}}\tr$ as its rows, and $\norm{\cdots}_F$ is the **Frobenius norm** defined in *Definition 1 (Appendix)*.

Now by *Theorem 2 (Appendix)*, we can use a very powerful technique by rewriting the Frobenius norm in terms of trace:

$$ \begin{aligned} \norm{\bs{X}-\bs{X}\bs{d}\bs{d}\tr}_F^2 &= \text{Tr}\left[\left(\bs{X}-\bs{X}\bs{d}\bs{d}\tr\right)\tr \left(\bs{X}-\bs{X}\bs{d}\bs{d}\tr\right) \right] \\ &= \text{Tr}\left(\bs{X}\tr\bs{X} - \bs{X}\tr \bs{X}\bs{d}\bs{d}\tr-\bs{d}\bs{d}\tr\bs{X}\tr\bs{X}+\bs{d}\bs{d}\tr\bs{X}\tr\bs{X}\bs{d}\bs{d}\tr \right) \\ &= \text{Tr}(\bs{X}\tr\bs{X}) - \text{Tr}(\bs{X}\tr\bs{X}\bs{d}\bs{d}\tr)-\text{Tr}(\bs{d}\bs{d}\tr\bs{X}\tr\bs{X}) + \text{Tr}(\bs{d}\bs{d}\tr\bs{X}\tr\bs{X}\bs{d}\bs{d}\tr) \end{aligned} $$

Discarding terms that do not depend on $\bs{d}$ and applying *Theorem 2 (Appendix)*, we have that

$$ \begin{aligned} \bs{d}^* &= \argmin{\bs{d}}\left[ -2\text{Tr}(\bs{X}\tr\bs{X}\bs{d}\bs{d}\tr)+\text{Tr}(\bs{d}\bs{d}\tr\bs{X}\tr\bs{X}\bs{d}\bs{d}\tr) \right]\\ &= \argmin{\bs{d}}\left[-2\text{Tr}(\bs{X}\tr\bs{X}\bs{d}\bs{d}\tr)+\text{Tr}(\bs{X}\tr\bs{X}\bs{d}\bs{d}\tr\bs{d}\bs{d}\tr) \right] \\ &= \argmin{\bs{d}}\left[-2\text{Tr}(\bs{X}\tr\bs{X}\bs{d}\bs{d}\tr) + \text{Tr}(\bs{X}\tr\bs{X}\bs{d}\bs{d}\tr) \right] \\ &= \argmin{\bs{d}}\left[-\text{Tr}(\bs{X}\tr\bs{X}\bs{d}\bs{d}\tr) \right] \\ &= \argmax{\bs{d}}\left[\text{Tr}(\bs{X}\tr\bs{X}\bs{d}\bs{d}\tr) \right] \\ &= \argmax{\bs{d}}\left[\text{Tr}(\bs{d}\tr\bs{X}\tr\bs{X}\bs{d})\right] \\ &= \argmax{\bs{d}}\left[\bs{d}\tr\bs{X}\tr\bs{X}\bs{d} \right] \end{aligned} $$

Since the correlation matrix $\bs{X}\tr \bs{X}$ is symmetric, by *Theorem 1 (Appendix)*, the optimal $\bs{d}$ is given by the unit eigenvector of $\bs{X}\tr\bs{X}$ corresponding to the largest eigenvalue!

This solves the encoding problem for $l=1$. For arbitrary $l$, we state the following theorem.

Theorem. Let $\bs{X} \in \mathbb{R}^{n\times p}$ be the the matrix obtained by stacking vectors $\{\bs{x}^{(1)},...,\bs{x}^{(n)}\}$ as rows. For $l \geq 1$, the encoding matrix $\bs{D} \in \mathbb{R}^{p\times l}$ is given by the $l$ eigenvectors of $\bs{X}\tr\bs{X}$ corresponding to the largest eigenvalues.

*Proof*. We prove this by induction. Suppose that for $l=k$ the statement holds. That is, let $\bs{D}^* = [\bs{d}_1, \bs{d}_2,...,\bs{d}_k]$ where the $\bs{d}_i$'s are the unit eigenvectors of $\bs{X}\tr\bs{X}$ corresponding to the first $k$ largest eigenvalues, such that

$$ \bs{D}^* = \argmin{\bs{D}} \sum_i \norm{\bs{x}^{(i)} - \bs{D}\bs{D}\tr \bs{x}^{(i)}}_2^2, \quad \bs{D} \in \mathbb{R}^{p\times k}. $$

Now suppose that we add onto it one more unit eigenvector $\bs{d}_{k+1}$ corresponding to the largest of the remaining unselected eigenvalues. So now we get a new matrix $\bs{E}^* = [\bs{D}^*, \bs{d}_{k+1}]$. We want to show that

$$ \bs{E}^* = \argmin{\bs{E}} \sum_i \norm{\bs{x}^{(i)} - \bs{E}\bs{E}\tr \bs{x}^{(i)}}_2^2, \quad \bs{E} \in \mathbb{R}^{p\times (k+1)}. $$

Let $\bs{E} = [\bs{D}, \bs{d}]$ where $\bs{D} \in \mathbb{R}^{p\times k}$ and $\bs{d} \in \mathbb{R}^p$. We can rewrite the matrix product in the above as

$$ \bs{E}\bs{E}\tr = \begin{bmatrix} \bs{D} & \bs{d} \end{bmatrix}\begin{bmatrix} \bs{D}\tr \\ \bs{d}\tr \end{bmatrix} = \bs{D}\bs{D}\tr + \bs{d}\bs{d}\tr $$

By the induction hypothesis, the problem becomes clear. It is left to the reader to verify that the problem reduces to the 1 dimensional case. This proves the theorem.

So we started with a collection of $p$ random vectors in $\mathbb{R}^n$ (as columns of $\bs{X}$), and we reduced them by a linear map to a collection of $l$ vectors. Recall that the *reconstruction* function is given by

$$ r(\bs{x}) = \bs{D}\bs{D}\tr \bs{x} $$

where the columns of $\bs{D}$ consist of the $l$ eigenvectors of the *correlation matrix* $\bs{X}\tr \bs{X}$ of the data points corresponding to the largest eigenvalues. Hence, we are essentially making a projection from a $p$ dimensional vector space down to a $l$ dimensional vector space spanned by the orthonormal eigenvectors corresponding to the largest eigenvalues of the data's *correlation* matrix.

Question. Why is it called thecorrelation matrix?

Suppose $p=2$, and $n=1$, so that $\bs{X} = [x_1, x_2]$ where both $x_1$ and $x_2$ are a single realization of the random vector $\bs{X} = [X_1, X_2]$. Then,

$$ \bs{X}\tr \bs{X} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}\begin{bmatrix} x_1 & x_2 \\ \end{bmatrix} = \begin{bmatrix} x_1^2 & x_1 x_2 \\ x_1 x_2 & x_2^2 \end{bmatrix} $$

Hence it is an *estimator* of how much the random variable $X_1$ and $X_2$ are correlated.

Fun Fact. The eigenvector of the correlation matrix with the largest eigenvalue is the direction along which the data set has the maximum variance.

*Sketch of Proof*. Suppose $p=1$. That is, we want to reduce our data points of $p$ dimensions down to only a single dimension. This means that we want to pick a unit vector $u$, and replace each data point $x_i$ with the its *projection* onto this unit vector, $u\tr x_i$. Naturally, we want to pick the vector $u$ such that we retain as much variance of the data as possible. (Otherwise, suppose all the data lies on a line, and we happend to pick $u$ orthogonal to that line. Then all the data points will be projected to 0. Not very interesting!)

Suppose that the correlation matrix of the original data set is $R$. That is, suppose $\bs{X} = [x_1, x_2,...,x_n]\tr$, such that $\bs{X}\tr \bs{X} = R$. After projecting onto $u$, the matrix $\bs{X}$ is changed to $\bs{Y}$, where

$$ \bs{Y} = [u\tr x_1, u\tr x_2,..., u\tr x_n]\tr \\ $$

Since $\bs{Y}$ is only a vector, correlation matrix of $\bs{Y}$ is simply a scaler

$$ \bs{Y}\tr \bs{Y} = \sum_{i=1}^n u\tr x_i {x_i}\tr u = u\tr \bs{X}\tr \bs{X} u = u\tr R u $$

We want to maximize this quantity since this gives a measure of the variance of the projected data. Using *Theorem 1* (Appendix) again, we see that the maximum occurs at the unit eigenvector corresponding to the maximum eigenvalue. This is the first principal component. Next, we "flatten" our dataset by projecting along the subspace orthogonal to $u_1$ so that it has no variance along the direction of $u_1$ and repeat the process. We will get our second principal component, and so on.

Lets generate some data points from a bivariate Gaussian distribution centered at the origin, and the correlation matrix is

$$ \begin{bmatrix} 1 & 0.5 \\ 0.5 & 1 \end{bmatrix} $$

The two eigenvectors are

$$ \begin{aligned} v_1 &= (1, 1) \\ v_2 &= (-1, 1) \end{aligned} $$

In [1]:

```
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
```

In [2]:

```
mean = [0, 0]
cov = [[1, 0.5], [0.5, 1]]
```

In [3]:

```
x, y = np.random.multivariate_normal(mean, cov, 400).T
plt.axis('equal')
plt.scatter(x, y)
plt.plot([0,2], [0,2], 'ro-')
plt.plot([0, -1 ], [0, 1], 'ro-')
```

Out[3]:

The eigenvectors of the correlation matrix happens to be pointing in diagonal directions. Indeed these are the two principal components.

One last fun observation. What if our correlation matrix is the identity matrix? Clearly, ALL vectors on the plane are eigenvectors of the correlation matrix. The distribution on the plane forms a circle, so the first principal component can be arbitrarily chosen.

We state and prove several theorems that we used earlier.

Theorem 1. Let $\bs{A} \in \mathbb{R}^{n\times n}$ be a real valued symmetric matrix, and $\bs{x}$ be a real vector in $\mathbb{R}^n$. Then the quadratic expression $f(\bs{x}) = \bs{x}\tr \bs{A} \bs{x}$ subject to $||\bs{x}||_2=1$ takes on maximum (minimum) values at its unit eigenvector $\bs{x}$ corresponding to its maximum (minimum) eigenvalue.

Below is an visualization of the constraint optimization. Basically, we are optimizing a quadratic function over a sphere.

*Proof*. Since $\bs{A}$ is symmetric, then by the `spectral theorem`

, it is orthogonally diagonalizable. This means that there exists orthogonal matrix $\bs{Q}$ and diagonal matrix $\bs{\Lambda}$ such that

$$\bs{A} = \bs{Q} \bs{\Lambda} \bs{Q}\tr.$$

Let $\bs{\Lambda} = \text{diag}(\bs{\lambda})$, where $\lambda_i \in \bs{\lambda}$ is the eigenvalue corresponding to the $i$th column of the matrix $\bs{Q}$.

Let $\bs{x}_i$ be an unit eigenvector of $\bs{A}$ corresponding to the eigenvector $\lambda_i$. Then

$$f(\bs{x}_i) = \bs{x}_i\tr\bs{A}\bs{x}_i = \bs{x}_i\tr\lambda_i\bs{x}_i = \lambda_i\bs{x}_i\tr\bs{x}_i = \lambda_i.$$

Let $\bs{Q} = [\bs{q}_1,...,\bs{q}_n]$ where $\bs{q}_i \in \mathbb{R}^n$ and the set of $\bs{q}_i$'s are orthogonal unit vectors; hence form an orthonormal basis of $\mathbb{R}^n$. This means that for all unit vector $\bs{x} \in \mathbb{R}^n$, $\bs{x}$ can be expressed as a linear combination of the orthonormal set $\{\bs{q}_1,...,\bs{q}_n\}$,

$$ \bs{x} = c_1 \bs{q}_1 + c_2\bs{q}_2 + \cdots + c_n\bs{q}_n. $$

Furthermore, since $||x||_2=1$, we have that $c_1^1+x_2^2+\cdots + c_n^2=1$.

Let $\lambda_{\text{max}}$ and $\lambda_{\text{min}}$ be the maximum and minimum of all eigenvalues $\{\lambda_1,...,\lambda_n\}$. Then we have

$$ \begin{aligned} f(\bs{x}) & = (c_1\bs{q}_1 + \cdots + c_n\bs{q}_n)\tr \bs{A}(c_1\bs{q}_1 + \cdots + c_n\bs{q}_n) \\ & = c_1^2 \bs{q}_1\tr\bs{A}\bs{q}_1 + c_1 c_2 \bs{q}_1\tr \bs{A}\bs{q}_2 + \cdots c_n^2 \bs{q}_n\tr \bs{A}\bs{q}_n \end{aligned} $$

Since $\bs{q}_i\tr \bs{q}_j = 0$ whenever $i\neq j$, the above expression simplifies to

$$ \begin{aligned} f(\bs{x}) &= c_1^2 \bs{q}_1\tr \bs{A}\bs{q}_1 + c_2^2 \bs{q}_2 \bs{A}\bs{q}_2 + \cdots c_n^2 \bs{q}_n\tr \bs{A}\bs{q}_n \\ &= c_1^2 \lambda_1 + c_2^2 \lambda_2 + \cdots + c_n^2 \lambda_n \\ &\leq \lambda_{\text{max}}(c_1^2+c_2^2 + \cdots + c_n^2) \\ & = \lambda_{\text{max}} \end{aligned} $$

Likewise, we can show that $f(\bs{x}) \geq \lambda_{\text{min}}$. This proves the theorem. $\Box$

Next we introduce the concepts of `Frobenius norm`

and `trace`

.

Definition 1. TheFrobenius normof a matrix $A$ is defined to be $$ ||A||_F = \sqrt{\sum_{i,j}A_{i,j}^2}. $$

Definition 2. Thetrace operatorof a matrix $A$ is defined to be $$ \text{Tr}(\bs{A}) = \sum_i \bs{A}_{i,i}. $$

This leads to the following theorem.

Theorem 2. The following properties hold for a matrix $A$:

- $\displaystyle ||A||_F = \sqrt{\text{Tr}(\bs{A}\bs{A}\tr)}$
- $\text{Tr}(\bs{A}) = \text{Tr}(\bs{A}\tr)$
- If $\bs{A}\in \mathbb{R}^{m\times n}$ and $\bs{B}\in \mathbb{R}^{n\times m}$, then $$ \text{Tr}(\bs{A}\bs{B}) = \text{Tr}(\bs{B}\bs{A}) $$

*Proof of 1*. This follows easiliy from the definition of matrix multiplication,

$$ \begin{aligned} \text{Tr}(\bs{A}\bs{A}\tr) & = \sum_i \bs{A}_{1,i}^2 + \sum_i \bs{A}_{2,i}^2 + \cdots + \sum_i \bs{A}_{n,i}^2 \\ &= \sum_{i,j} \bs{A}_{i,j}^2 \end{aligned} $$

*Proof of 2*. Obviously the diagonal elements are invariant to transpose.

*Proof of 3*. Again we use the definition of matrix multiplication:

$$ \begin{aligned} \text{Tr}(\bs{A}\bs{B}) &= \sum_i \bs{A}_{1,i}\bs{B}_{i,1} + \sum_i \bs{A}_{2,i}\bs{B}_{i,2} + \cdots + \sum_i \bs{A}_{m,i}\bs{B}_{i,m} \\ &= \sum_{i,j} \bs{A}_{j,i} \bs{B}_{i,j} \\ &= \sum_{i,j} \bs{B}_{j,i}\bs{A}_{i,j} \\ &= \text{Tr}(\bs{B}\bs{A}) \end{aligned} $$

**Remark**. Theorem 2 (3) can be generalized. If the shape of the corresponding matrices allow multiplication, then trace is invariant to cyclic permutation of the order:
$$
\text{Tr}\left(\prod_{i=1}^n \bs{A}^{(i)}\right) = \text{Tr}\left(\bs{A}^{(n)}\prod_{i=1}^{n-1} \bs{A}^{(i)}\right).
$$