**Elan Ding**

*Modified*: June 8, 2018

In this post, we will discuss PCA from a practical perspective.

PCA is a technique used to reduce the number of features, making it easier to visualize a high-dimensional dataset. By using the first two principal components, we can visualize a high-dimensional dataset in a new two-dimensional space. In exchange for the visualization, we lose some interpretability.

First, let's import some libraries. The `mplot3d` toolkit is used for creating 3D plots.

In [1]:

```
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d
import pandas as pd
import numpy as np
import seaborn as sns
%matplotlib inline
```

Now we are ready to generate a dataset using `sklearn`'s `make_blobs`. Here we generate a dataset with three features and binary labels. We can think of each data point as a patient who is either positive or negative for a given disease.

In [2]:

```
from sklearn.datasets import make_blobs
data = make_blobs(n_samples=200, n_features=3,
                  centers=2, cluster_std=5, random_state=101)
```

In [3]:

```
fig = plt.figure(figsize=(8,6))
ax = plt.axes(projection='3d')
ax.scatter(data[0][:,0], data[0][:,1], data[0][:,2], c=data[1], cmap='rainbow')
```

Out[3]:

As you can see, there are two clusters of data. The overlap is due to the standard deviation of the clusters, since I set `cluster_std=5` in the code above.

In [4]:

```
df = pd.DataFrame(data[0], columns=['x', 'y', 'z'])
df.head()
```

Out[4]:

PCA is sensitive to the scale of the features, so for it to perform well we standardize every feature to zero mean and unit variance.
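To see why scaling matters, here is a small side example (not one of the notebook cells; the data here is made up for illustration): with two independent features on very different scales, unscaled PCA loads almost all of the variance onto the large-scale feature, while standardizing shares it roughly evenly.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
# Two independent features with very different scales.
X = np.column_stack([rng.normal(scale=100, size=500),
                     rng.normal(scale=1, size=500)])

# Without scaling, the first component is dominated by the large-scale feature.
raw_ratio = PCA(n_components=2).fit(X).explained_variance_ratio_

# After standardization, the variance is shared roughly equally.
scaled_ratio = PCA(n_components=2).fit(
    StandardScaler().fit_transform(X)).explained_variance_ratio_

print(raw_ratio)     # first entry close to 1
print(scaled_ratio)  # both entries close to 0.5
```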

In [5]:

```
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(df)
scaled_data = scaler.transform(df)
```

Next, we fit `sklearn`'s `PCA` class on `scaled_data`.

In [6]:

```
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca.fit(scaled_data)
x_pca = pca.transform(scaled_data)
```

After calling the `transform` method, we have reduced the number of features to 2.

In [7]:

```
print(scaled_data.shape)  # (200, 3)
print(x_pca.shape)        # (200, 2)
```

We plot the first and second principal components colored by the label.

In [8]:

```
plt.figure(figsize=(8,6))
plt.scatter(x_pca[:,0],x_pca[:,1],c=data[1],cmap='rainbow')
plt.xlabel('First principal component')
plt.ylabel('Second Principal Component')
```

Out[8]:

The principal components are simply linear combinations of the original feature vectors. The coefficients of each linear combination can be found as follows.

In [9]:

```
pca.components_
```

Out[9]:

Let $x,y,z$ be the original features. The array above means that approximately,

$$ \begin{aligned} \text{First Principal Component} &= -0.68x + 0.26y + 0.68z \\ \text{Second Principal Component} &= -0.18x - 0.97y + 0.19z \end{aligned} $$
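As a sanity check (a sketch, not one of the original cells), we can verify that `pca.transform` computes exactly these linear combinations: each principal component score is a dot product of the centered data with the corresponding row of `pca.components_`.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Recreate the data from the cells above.
X, _ = make_blobs(n_samples=200, n_features=3, centers=2,
                  cluster_std=5, random_state=101)
scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2).fit(scaled)

# Project manually: center the data, then take dot products with the
# rows of pca.components_.
manual = (scaled - pca.mean_) @ pca.components_.T
assert np.allclose(manual, pca.transform(scaled))
```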

Using a machine learning technique called linear discriminant analysis (LDA), we can estimate the so-called Bayes decision boundary, which minimizes the total classification error. This is easy to do with `sklearn`.

In [10]:

```
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
clf2 = LinearDiscriminantAnalysis()
clf2.fit(data[0], data[1])
print(clf2.coef_)
print(clf2.intercept_)
```

In [11]:

```
fig = plt.figure(figsize=(8,6))
ax = plt.axes(projection='3d')
ax.scatter(data[0][:,0], data[0][:,1], data[0][:,2], c=data[1], cmap='rainbow')
xx, yy = np.meshgrid(range(-20,15,4), range(-10,20,4))
z = (0.20763429/0.52744419)*xx - (0.06824359/0.52744419)*yy + 0.14878286
ax.plot_wireframe(xx, yy, z, alpha=0.4)
```

Out[11]:

The surface above is the estimated Bayes decision boundary. It is linear, and given that the two classes share the same covariance matrix, it is the best we can do in terms of minimizing total classification error.
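The wireframe cell above hard-codes the fitted numbers. As a sketch, the same plane can be computed directly from `clf2.coef_` and `clf2.intercept_`, so it stays correct if the fit changes: the boundary is the set of points where $w \cdot x + b = 0$, which we can solve for the third coordinate.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Recreate the data and the LDA fit from the cells above.
X, y = make_blobs(n_samples=200, n_features=3, centers=2,
                  cluster_std=5, random_state=101)
clf2 = LinearDiscriminantAnalysis().fit(X, y)

# The boundary is the plane w . x + b = 0; solve for the z coordinate.
w, b = clf2.coef_[0], clf2.intercept_[0]
xx, yy = np.meshgrid(range(-20, 15, 4), range(-10, 20, 4))
zz = -(w[0] * xx + w[1] * yy + b) / w[2]
# zz can now be passed to ax.plot_wireframe(xx, yy, zz, alpha=0.4).
```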

Here comes the advantage of PCA. What if our original data has more than three dimensions? It is impossible to directly visualize anything higher than three dimensions, so by reducing the dimension down to 2 using PCA, we get a nice visualization of the classification problem.
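For instance (a hypothetical side example, not one of the cells above), the same pipeline handles a 10-feature blob dataset that we could never plot directly:

```python
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# A 10-dimensional dataset: impossible to visualize as-is.
X10, y10 = make_blobs(n_samples=200, n_features=10, centers=2,
                      cluster_std=5, random_state=101)

# Standardize, then project onto the first two principal components.
X10_pca = PCA(n_components=2).fit_transform(
    StandardScaler().fit_transform(X10))
print(X10_pca.shape)  # (200, 2) -- now plottable with plt.scatter
```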

In [12]:

```
clf = LinearDiscriminantAnalysis()
clf.fit(x_pca, data[1])
print(clf.coef_)
print(clf.intercept_)
```

In [13]:

```
plt.figure(figsize=(8,6))
plt.scatter(x_pca[:,0],x_pca[:,1],c=data[1],cmap='rainbow')
plt.xlabel('First principal component')
plt.ylabel('Second Principal Component')
x = np.linspace(-.5, .5, 100)
y = -(3.4014186/0.6482249)*x
plt.plot(x,y, c='y')
```

Out[13]:

The yellow line is the Bayes decision boundary in terms of the principal components.
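The slope in the plotting cell above is again hard-coded. A sketch that derives the line from `clf.coef_` and `clf.intercept_` (including the small intercept term, which the hard-coded line drops):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Recreate the projected data and the LDA fit from the cells above.
X, y = make_blobs(n_samples=200, n_features=3, centers=2,
                  cluster_std=5, random_state=101)
x_pca = PCA(n_components=2).fit_transform(
    StandardScaler().fit_transform(X))
clf = LinearDiscriminantAnalysis().fit(x_pca, y)

# Boundary in the PC plane: w0*pc1 + w1*pc2 + b = 0, solved for pc2.
(w0, w1), b = clf.coef_[0], clf.intercept_[0]
pc1 = np.linspace(-0.5, 0.5, 100)
pc2 = -(w0 * pc1 + b) / w1
# (pc1, pc2) can now be passed to plt.plot in place of the hard-coded line.
```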

That was fun, right? In the next post, we will discuss the theoretical background to better understand why PCA works.