Principal Component Analysis#

Principal Component Analysis (PCA) is a statistical method that converts a set of observations of possibly correlated features into a set of values of linearly uncorrelated features called principal components. The number of principal components is less than or equal to the number of original features. The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for as much of the remaining variability as possible.

It is commonly used for dimensionality reduction by projecting each data point onto only the first few principal components to obtain lower-dimensional data while preserving as much of the data’s variation as possible. The first principal component can equivalently be defined as a direction that maximizes the variance of the projected data.

  1. PCA identifies orthogonal axes (principal components) in the original feature space.

  2. These axes are ordered by the amount of variance they explain.

  3. By projecting the data onto these axes, we reduce the dimensionality while preserving the most important information.

To understand what we are doing and why, we need a short excursion into the field of statistics.

import numpy as np
import matplotlib.pyplot as plt
from pprint import pprint
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

Expected Value, Variance and Covariance#

Features are the columns in our table, while data points are the rows:

| L=Sepal Length (cm) | W=Sepal Width (cm) | S=Petal Length (cm) | T=Petal Width (cm) | Species |
|---|---|---|---|---|
| 5.1 | 3.5 | 1.4 | 0.2 | Setosa |
| 4.9 | 3.0 | 1.4 | 0.2 | Setosa |
| 4.7 | 3.2 | 1.3 | 0.2 | Setosa |
| 4.6 | 3.1 | 1.5 | 0.2 | Setosa |
| 5.0 | 3.6 | 1.4 | 0.2 | Setosa |

Expected value#

\[ E[X] = \bar x = \frac{\sum_{i=1}^{n} x_i}{n} \]

Let us calculate the expected values of all features:

\[ E[L] = \bar l = \frac{5.1 + 4.9 + 4.7 + 4.6 + 5.0}{5} = 4.86 \]
\[ E[W] = \bar w = \frac{3.5 + 3 + 3.2 + 3.1 + 3.6}{5} = 3.28 \]
\[ E[S] = \bar s = \frac{1.4 + 1.4 + 1.3 + 1.5 + 1.4}{5} = 1.4 \]
\[ E[T] = \bar t = \frac{5 \times 0.2}{5} = 0.2 \]

Variance#

\[ V[X] = E[(X - \bar x)^2] = E[(X - \bar x)(X - \bar x)] = \frac{\sum_{i=1}^{n} (x_i - \bar x)^2}{n-1}\]

It measures how far each value is from the mean (average) and thus from every other value in the set. A high variance indicates that the feature is very spread out, while a low variance indicates the opposite. We divide by \(n-1\) (Bessel's correction) so that we do not underestimate the variance when estimating it from a sample.

\[ V[L] = \frac{(5.1-4.86)^2 + (4.9-4.86)^2 + (4.7-4.86)^2 + (4.6-4.86)^2 + (5-4.86)^2}{4} = 0.043 \]
\[ V[W] = \frac{(3.5-3.28)^2 + (3-3.28)^2 + (3.2-3.28)^2 + (3.1-3.28)^2 + (3.6-3.28)^2}{4} = 0.067 \]
\[ V[S] = \frac{(1.4-1.4)^2 + (1.4-1.4)^2 + (1.3-1.4)^2 + (1.5-1.4)^2 + (1.4-1.4)^2}{4} = 0.005 \]
\[ V[T] = \frac{5 (0.2-0.2)^2}{4} = 0 \]

The sepal width is more spread out than the sepal length, which becomes obvious when looking at the numbers.
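The effect of dividing by \(n-1\) instead of \(n\) is easy to see numerically. A minimal sketch, reusing the numpy import from above and the five sepal lengths from the table:

sepal_length = np.array([5.1, 4.9, 4.7, 4.6, 5.0])
print(np.var(sepal_length, ddof=0))  # divides by n:   ≈ 0.0344 (tends to underestimate)
print(np.var(sepal_length, ddof=1))  # divides by n-1: ≈ 0.043, matching V[L] above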

Covariance#

\[ Cov[X, Y] = E[(X - \bar x)(Y - \bar y)] = \frac{\sum_{i=1}^{n} (x_i - \bar x)(y_i - \bar y)}{n-1} \]

The covariance extends the variance to a pair of features. It measures the linear relationship between two features. If the covariance is positive, the two features tend to increase or decrease together — that is, they’re directly proportional. If one feature tends to increase when the other decreases, the covariance is negative, indicating an inverse relationship. A covariance of zero means there is no linear relationship between the features (it does not necessarily imply independence).

\[ Cov[L, W] = \frac{(5.1-4.86)(3.5-3.28) + (4.9-4.86)(3-3.28) + (4.7-4.86)(3.2-3.28) + (4.6-4.86)(3.1-3.28) + (5-4.86)(3.6-3.28)}{4} = 0.0365 \]
\[ Cov[L, S] = \frac{(5.1-4.86)(1.4-1.4) + (4.9-4.86)(1.4-1.4) + (4.7-4.86)(1.3-1.4) + (4.6-4.86)(1.5-1.4) + (5-4.86)(1.4-1.4)}{4} = -0.0025 \]
\[ Cov[W, S] = \frac{(3.5-3.28)(1.4-1.4) + (3-3.28)(1.4-1.4) + (3.2-3.28)(1.3-1.4) + (3.1-3.28)(1.5-1.4) + (3.6-3.28)(1.4-1.4)}{4} = -0.0025 \]
\[ Cov[L, T] = \frac{(5.1-4.86)(0.2-0.2) + \dots}{4} = 0 = Cov[S, T] = Cov[W, T] \]

We have only looked at a subset of the iris flower dataset, so the values here do not necessarily represent the full dataset. But for this subset we can say the following:

  • The petal width (T) is uncorrelated with the other features (it does not vary at all in this subset).

  • The sepal length (L) and sepal width (W) are proportional to one another.

  • The other pairs (those involving the petal length S) tend to behave inversely to one another.
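As a quick sanity check, the covariance of L and W can be computed directly from the definition. A minimal sketch (sepal_length and sepal_width are just throwaway arrays holding the five samples):

sepal_length = np.array([5.1, 4.9, 4.7, 4.6, 5.0])
sepal_width = np.array([3.5, 3.0, 3.2, 3.1, 3.6])
# Sum of the products of the deviations from the means, divided by n-1
cov_lw = np.sum((sepal_length - sepal_length.mean()) * (sepal_width - sepal_width.mean())) / (len(sepal_length) - 1)
print(cov_lw)  # ≈ 0.0365, matching Cov[L, W] above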

# Let's validate our results
x = np.array([[5.1, 3.5, 1.4, 0.2],
              [4.9, 3.0, 1.4, 0.2],
              [4.7, 3.2, 1.3, 0.2],
              [4.6, 3.1, 1.5, 0.2],
              [5.0, 3.6, 1.4, 0.2]])
pprint(x)
print('mean:', np.mean(x, axis=0))
print('variance:', np.var(x, ddof=1, axis=0))
array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2]])
mean: [4.86 3.28 1.4  0.2 ]
variance: [0.043 0.067 0.005 0.   ]

Covariance Matrix#

pprint(np.cov(x, rowvar=False))
array([[ 0.043 ,  0.0365, -0.0025,  0.    ],
       [ 0.0365,  0.067 , -0.0025,  0.    ],
       [-0.0025, -0.0025,  0.005 ,  0.    ],
       [ 0.    ,  0.    ,  0.    ,  0.    ]])

The covariance matrix is a square (\( n\times n\)) matrix that contains the covariance between each pair of features in a dataset. Each element \(C_{ij}\) is the covariance between the features \(x_i\) and \(x_j\). The diagonal elements (where \(i = j\)) are the variances of the individual features.

On the diagonal, larger values in the covariance matrix indicate more variance.
On the non-diagonal, larger absolute values indicate a stronger linear relationship between two features.
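The same matrix can be built by hand from the centered data, which is essentially what the PCA implementation later in this section does. A minimal sketch (note that np.cov divides by \(n-1\)):

X_centered = x - np.mean(x, axis=0)               # subtract the mean of each feature
C = X_centered.T @ X_centered / (x.shape[0] - 1)  # covariance matrix from the centered data
print(np.allclose(C, np.cov(x, rowvar=False)))    # True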

def noise(x):
    # Add Gaussian noise (mean 0, standard deviation 0.5) to the input
    return x + np.random.normal(0, 0.5, 100)


def plot(ax, y):
    ax.plot(x, y, 'o')
    ax.set_xlim(0, 10)
    ax.set_ylim(0, 25)


x = np.linspace(0, 10, 100)
# Three relationships: increasing (y = 2x), decreasing (y = -2x + 20), and constant (y = 10)
ys = [noise(2*x), noise(-2*x+20), noise(10)]

fig, ax = plt.subplots(1, len(ys), figsize=(15, 5))
for i, y in enumerate(ys):
    plot(ax[i], y)
    pprint(np.cov(x, y))
    print('\n')
array([[ 8.58755909, 17.18595444],
       [17.18595444, 34.63822168]])


array([[  8.58755909, -17.09688854],
       [-17.09688854,  34.26791914]])


array([[8.58755909e+00, 8.36984957e-03],
       [8.36984957e-03, 2.92629696e-01]])
[Figure: scatter plots of the three noisy datasets — increasing, decreasing, and constant]

For simplicity, here are the three covariance matrices again with the values rounded:

\[\begin{split} \begin{bmatrix} 8.6 & 17.2 \\ 17.2 & 34.6 \end{bmatrix} \quad \begin{bmatrix} 8.6 & -17.1 \\ -17.1 & 34.3 \end{bmatrix} \quad \begin{bmatrix} 8.6 & 0.01 \\ 0.01 & 0.3 \end{bmatrix} \end{split}\]
  • In all three cases we have variance along the x-axis (the first diagonal element).

  • In the first two cases we also have a large variance along the y-axis (the second diagonal element).

  • In the first example, x and y have a linear relationship: if x increases, so does y (positive off-diagonal values).

  • In the second example, the relationship is inverted (negative off-diagonal values).

As stated above, larger absolute values off the diagonal indicate a stronger linear relationship between two features.
Geometrically, this means the off-diagonal values describe the orientation (“direction”) of the data.

Eigenvalues and eigenvectors#

Let \(A\) be an \(n\times n\) matrix. We call a scalar \(\lambda\) an eigenvalue and a non-zero vector \(v \in \mathbb{R}^n\) a corresponding eigenvector of \(A\) if \( Av = \lambda v \).

\[\begin{split} \begin{bmatrix} -6 & 3\\ 4 & 5 \end{bmatrix} \begin{bmatrix} 1 \\ 4 \end{bmatrix} = 6 \begin{bmatrix} 1 \\ 4 \end{bmatrix} \end{split}\]

Multiplying the matrix \(A\) with an eigenvector merely scales the vector! The eigenvalue is the scaling factor.
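We can verify the example with numpy. A minimal sketch (np.linalg.eig returns unit-length eigenvectors, so they may differ from \([1, 4]\) by a scale factor or sign):

A = np.array([[-6, 3], [4, 5]])
v = np.array([1, 4])
print(A @ v)                                   # [ 6 24], i.e. 6 * [1 4]
eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)                             # 6 and -7 (A has a second eigenvalue/eigenvector pair)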

For the covariance matrix this means:

The eigenvector with the largest eigenvalue is the direction along which the data set has the maximum variance.
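A minimal sketch illustrating this claim: for two-dimensional data scattered along the line \(y = 2x\) (like the first noisy example above), the eigenvector belonging to the largest eigenvalue of the covariance matrix points roughly along that line.

rng = np.random.default_rng(0)                              # seeded for reproducibility
x2 = np.linspace(0, 10, 100)
y2 = 2 * x2 + rng.normal(0, 0.5, 100)                       # data spread mainly along y = 2x
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(x2, y2))  # eigh: eigenvalues in ascending order
print(eigenvectors[:, -1])  # eigenvector of the largest eigenvalue, ≈ ±(0.45, 0.89), i.e. the direction (1, 2)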

Algorithm#

Finally, we can piece everything together:

  1. PCA identifies orthogonal axes (principal components) in the original feature space.

  2. These axes are ordered by the amount of variance they explain.

  3. By projecting the data onto these axes, we reduce the dimensionality while preserving the most important information.

def pca(X):
    # Center the data (subtract the mean of each feature)
    X = X - np.mean(X, axis=0)
    # Compute the covariance matrix
    # (dividing by n instead of n-1 only rescales the eigenvalues;
    #  the principal directions stay the same)
    cov = np.dot(X.T, X) / X.shape[0]
    # Perform a singular value decomposition of the covariance matrix
    # U holds the eigenvectors (principal components) as columns
    # S holds the corresponding eigenvalues, sorted in descending order
    # V is the transpose of U, since the covariance matrix is symmetric
    U, S, V = np.linalg.svd(cov)
    return U, S, V


def project_data(X, U, k):
    # k is the number of dimensions to reduce to
    # U_reduced contains the first k eigenvectors
    U_reduced = U[:, :k]
    # Z is the projection of X onto the reduced dimensions
    # (projecting the uncentered X only shifts Z by a constant offset)
    Z = np.dot(X, U_reduced)
    return Z
iris = load_iris()
print(iris.data.shape)

fig, axs = plt.subplots(4, 4, figsize=(20, 20))
for i in range(4):
    for j in range(4):
        if i != j:
            axs[i, j].scatter(iris.data[:, i], iris.data[:, j], c=iris.target)
            axs[i, j].set_xlabel(iris.feature_names[i])
            axs[i, j].set_ylabel(iris.feature_names[j])
(150, 4)
[Figure: pairwise scatter plots of the four iris features, colored by species]
U, S, V = pca(iris.data)
# Project the four-dimensional data to two dimensions
Z = project_data(iris.data, U, 2)
print(Z.shape)
# We don't need to plot a grid, since we only have two dimensions left
plt.scatter(Z[:, 0], Z[:, 1], c=iris.target)
(150, 2)
<matplotlib.collections.PathCollection at 0x7f2097fbb490>
[Figure: the iris data projected onto the first two principal components, colored by species]
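How much of the variance do the two remaining dimensions retain? Since S holds the eigenvalues of the covariance matrix, the explained variance ratio follows directly from it; a minimal sketch:

explained = S / S.sum()
print(explained)            # fraction of the total variance explained by each component
print(explained[:2].sum())  # the first two components explain almost all of the variance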
# First, we are going to see how a logistic regression model performs on the data without dimensionality reduction
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.5)
clf = LogisticRegression()
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
0.96
# Next, we are going to see the performance with dimensionality reduction
X_train, X_test, y_train, y_test = train_test_split(
    Z, iris.target, test_size=0.5)
clf = LogisticRegression()
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
0.96
# Let us go to the extreme and project the data to one dimension
Z = project_data(iris.data, U, 1)
X_train, X_test, y_train, y_test = train_test_split(
    Z, iris.target, test_size=0.5)
clf = LogisticRegression()
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
plt.plot(Z)
0.9333333333333333
[<matplotlib.lines.Line2D at 0x7f2097fa3c50>]
[Figure: the one-dimensional projection plotted against the sample index]

We’ve explored how PCA can effectively reduce the dimensionality of the iris flower dataset, preserving its fundamental characteristics. Although the iris flower dataset is a simple example, it demonstrates the significant role PCA can play in tasks like feature extraction or visualization. As with most tasks, it’s not necessary to build your own implementation from scratch. There are excellent libraries available that can perform PCA for you. However, understanding the underlying mathematics can provide valuable intuition about the process.
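For example, scikit-learn ships a ready-made implementation in sklearn.decomposition.PCA. A minimal sketch of the equivalent two-component projection (the result may differ from ours by sign flips of the components and a constant shift, because our project_data projects the uncentered data):

from sklearn.decomposition import PCA

pca_model = PCA(n_components=2)
Z_sklearn = pca_model.fit_transform(iris.data)  # centers the data and projects it onto the first two components
print(Z_sklearn.shape)                          # (150, 2)
print(pca_model.explained_variance_ratio_)      # variance explained by each component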

Sources#