This notebook contains an excerpt from the Python Data Science Handbook by Jake VanderPlas. The text is released under the CC-BY-NC-ND license, and code is released under the MIT license.

Up until now, we have been looking in depth at supervised learning estimators: those estimators that predict labels based on labeled training data. Here we begin looking at several unsupervised estimators, which can highlight interesting aspects of the data without reference to any known labels. In this section, we explore what is perhaps one of the most broadly used of unsupervised algorithms, principal component analysis (PCA). PCA is fundamentally a dimensionality reduction algorithm, but it can also be useful as a tool for visualization, for noise filtering, for feature extraction and engineering, and much more. After a brief conceptual discussion of the PCA algorithm, we will see a couple of examples of these further applications. We begin with the standard imports:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()

Introducing Principal Component Analysis

Principal component analysis is a fast and flexible unsupervised method for dimensionality reduction in data. Its behavior is easiest to visualize by looking at a two-dimensional dataset. Consider the following 200 points:
rng = np.random.RandomState(1)
X = np.dot(rng.rand(2, 2), rng.randn(2, 200)).T
plt.scatter(X[:, 0], X[:, 1])
plt.axis('equal');
By eye, it is clear that there is a nearly linear relationship between the x and y variables. This is reminiscent of the regression problem, but the problem setting here is different: rather than attempting to predict the y values from the x values, the unsupervised learning problem attempts to learn about the relationship between the x and y values. In principal component analysis, this relationship is quantified by finding a list of the principal axes in the data, and using those axes to describe the dataset. Using Scikit-Learn’s PCA estimator, we can compute this as follows:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca.fit(X)
The fit learns some quantities from the data, most importantly the “components” and “explained variance”:
print(pca.components_)
# [[-0.94446029 -0.32862557]
#  [-0.32862557  0.94446029]]

print(pca.explained_variance_)
# [0.7625315 0.0184779]
To see what these numbers mean, let’s visualize them as vectors over the input data, using the “components” to define the direction of the vector, and the “explained variance” to define the squared-length of the vector:
def draw_vector(v0, v1, ax=None, color='black'):
    ax = ax or plt.gca()
    arrowprops=dict(arrowstyle='->',
                    linewidth=2,
                    shrinkA=0, shrinkB=0, color=color)
    ax.annotate('', v1, v0, arrowprops=arrowprops)

# plot data
plt.scatter(X[:, 0], X[:, 1], alpha=0.2)
for length, vector in zip(pca.explained_variance_, pca.components_):
    v = vector * 3 * np.sqrt(length)
    draw_vector(pca.mean_, pca.mean_ + v)
plt.axis('equal');
These vectors represent the principal axes of the data, and the length of each vector is an indication of how “important” that axis is in describing the distribution of the data—more precisely, it is a measure of the variance of the data when projected onto that axis. The projections of each data point onto the principal axes are the “principal components” of the data.
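To make the connection between these quantities concrete, here is a minimal sanity check (a sketch reusing the two-component pca object and X fitted above): projecting the mean-centered data onto the principal axes reproduces the output of pca.transform, and the sample variance of each projection matches explained_variance_:
# Sketch: verify the meaning of components_ and explained_variance_
X_centered = X - pca.mean_
projections = X_centered @ pca.components_.T        # shape (200, 2): the principal components
print(projections.var(axis=0, ddof=1))              # approximately [0.7625 0.0185], matching explained_variance_
print(np.allclose(projections, pca.transform(X)))   # True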

PCA as Dimensionality Reduction

Using PCA for dimensionality reduction involves zeroing out one or more of the smallest principal components, resulting in a lower-dimensional projection of the data that preserves the maximal data variance. Here is an example of using PCA as a dimensionality reduction transform:
pca = PCA(n_components=1)
pca.fit(X)
X_pca = pca.transform(X)
print("original shape:   ", X.shape)
print("transformed shape:", X_pca.shape)
# original shape:    (200, 2)
# transformed shape: (200, 1)
The transformed data has been reduced to a single dimension. To understand the effect of this dimensionality reduction, we can perform the inverse transform of this reduced data and plot it along with the original data:
X_new = pca.inverse_transform(X_pca)
plt.scatter(X[:, 0], X[:, 1], alpha=0.2)
plt.scatter(X_new[:, 0], X_new[:, 1], alpha=0.8)
plt.axis('equal');
The light points are the original data, while the dark points are the projected version. This makes clear what a PCA dimensionality reduction means: the information along the least important principal axis or axes is removed, leaving only the component(s) of the data with the highest variance.
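As a quick check (a sketch reusing the one-component pca, X_pca, and X_new defined above), the inverse transform is simply the kept component score times the first principal axis, shifted back by the data mean:
# Sketch: reconstruct the projected points by hand
axis = pca.components_[0]              # unit vector along the retained axis
X_manual = pca.mean_ + X_pca * axis    # broadcast (200, 1) * (2,) -> (200, 2)
print(np.allclose(X_manual, X_new))    # True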

PCA for Visualization: Hand-written Digits

The usefulness of the dimensionality reduction may not be entirely apparent in only two dimensions, but becomes much more clear when looking at high-dimensional data. Let’s take a quick look at the application of PCA to the digits dataset.
from sklearn.datasets import load_digits
digits = load_digits()
digits.data.shape  # (1797, 64)
The data consists of 8×8 pixel images, meaning that they are 64-dimensional. To gain some intuition into the relationships between these points, we can use PCA to project them to a more manageable number of dimensions:
pca = PCA(2)  # project from 64 to 2 dimensions
projected = pca.fit_transform(digits.data)
print(digits.data.shape)   # (1797, 64)
print(projected.shape)      # (1797, 2)
We can now plot the first two principal components of each point to learn about the data:
plt.scatter(projected[:, 0], projected[:, 1],
            c=digits.target, edgecolor='none',
            alpha=0.5, cmap=plt.cm.Spectral)
plt.xlabel('component 1')
plt.ylabel('component 2')
plt.colorbar();
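Each component here is a 64-dimensional direction in pixel space, and each plotted coordinate is the dot product of a mean-centered image with that direction. A short sketch, reusing the pca and projected objects from above, confirms this for the first image:
# Sketch: each plotted coordinate is a linear combination of the (mean-subtracted) pixels
x0 = digits.data[0] - pca.mean_
print(x0 @ pca.components_.T)   # two numbers ...
print(projected[0])             # ... matching the first plotted point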

Choosing the Number of Components

A vital part of using PCA in practice is the ability to estimate how many components are needed to describe the data. This can be determined by looking at the cumulative explained variance ratio as a function of the number of components:
pca = PCA().fit(digits.data)
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance');
This curve quantifies how much of the total, 64-dimensional variance is contained within the first N components. For example, we see that with the digits the first 10 components contain approximately 75% of the variance, while you need around 50 components to describe close to 100% of the variance.
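In practice you rarely read the count off the plot by hand. Here is a sketch of two common ways to choose the number of components programmatically, reusing the full pca fit from above; the 90% threshold is just an illustrative choice:
# Sketch: pick the number of components needed to retain 90% of the variance
cumulative = np.cumsum(pca.explained_variance_ratio_)
print(np.argmax(cumulative >= 0.90) + 1)   # smallest number of components reaching 90%

pca_90 = PCA(0.90).fit(digits.data)        # passing a float asks PCA to pick the count itself
print(pca_90.n_components_)                # comparable count, chosen automatically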

PCA as Noise Filtering

PCA can also be used as a filtering approach for noisy data. The idea is this: any components with variance much larger than the effect of the noise should be relatively unaffected by the noise. So if you reconstruct the data using just the largest subset of principal components, you should be preferentially keeping the signal and throwing out the noise.
def plot_digits(data):
    fig, axes = plt.subplots(4, 10, figsize=(10, 4),
                             subplot_kw={'xticks':[], 'yticks':[]},
                             gridspec_kw=dict(hspace=0.1, wspace=0.1))
    for i, ax in enumerate(axes.flat):
        ax.imshow(data[i].reshape(8, 8),
                  cmap='binary', interpolation='nearest',
                  clim=(0, 16))

# Original digits
plot_digits(digits.data)

# Add noise
np.random.seed(42)
noisy = np.random.normal(digits.data, 4)
plot_digits(noisy)

# Filter with PCA (preserving 50% variance)
pca = PCA(0.50).fit(noisy)
print(pca.n_components_)  # 12 components

components = pca.transform(noisy)
filtered = pca.inverse_transform(components)
plot_digits(filtered)
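Beyond the visual comparison, a rough way to quantify the effect (a sketch, using the arrays defined above) is to compare the noisy images and the PCA-filtered reconstructions against the clean digits:
# Sketch: mean squared error relative to the clean digits
print("noisy vs. original:   ", np.mean((noisy - digits.data) ** 2))
print("filtered vs. original:", np.mean((filtered - digits.data) ** 2))
# The filtered reconstruction should sit noticeably closer to the clean data.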

Example: Eigenfaces

Let’s explore using PCA projection as a feature selector for facial recognition. We use the Labeled Faces in the Wild dataset:
from sklearn.datasets import fetch_lfw_people
faces = fetch_lfw_people(min_faces_per_person=60)
print(faces.target_names)
print(faces.images.shape)  # (1348, 62, 47)
Let’s take a look at the first 150 principal components:
from sklearn.decomposition import PCA

pca = PCA(150, svd_solver='randomized', random_state=42)
pca.fit(faces.data)
We can visualize the images associated with the first several principal components (these components are technically known as “eigenvectors,” so these types of images are often called “eigenfaces”):
fig, axes = plt.subplots(3, 8, figsize=(9, 4),
                         subplot_kw={'xticks':[], 'yticks':[]},
                         gridspec_kw=dict(hspace=0.1, wspace=0.1))
for i, ax in enumerate(axes.flat):
    ax.imshow(pca.components_[i].reshape(62, 47), cmap='bone')
These 150 components account for just over 90% of the variance, which means that using these 150 components, we would recover most of the essential characteristics of the data.
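To see what this means in practice, we can check the variance captured and reconstruct the faces from just 150 numbers each; the following is a sketch reusing the pca and faces objects fitted above:
# Sketch: variance retained by 150 components, and reconstruction from them
print(np.cumsum(pca.explained_variance_ratio_)[-1])   # fraction of variance in 150 components

components = pca.transform(faces.data)
projected = pca.inverse_transform(components)

# Compare a few original images with their 150-component reconstructions
fig, ax = plt.subplots(2, 10, figsize=(10, 2.5),
                       subplot_kw={'xticks': [], 'yticks': []},
                       gridspec_kw=dict(hspace=0.1, wspace=0.1))
for i in range(10):
    ax[0, i].imshow(faces.data[i].reshape(62, 47), cmap='binary_r')
    ax[1, i].imshow(projected[i].reshape(62, 47), cmap='binary_r')
ax[0, 0].set_ylabel('full-dim\ninput')
ax[1, 0].set_ylabel('150-dim\nreconstruction');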

Summary

In this tutorial we have discussed the use of principal component analysis for:
  • Dimensionality reduction - Project high-dimensional data to lower dimensions
  • Visualization - View high-dimensional data in 2D or 3D
  • Noise filtering - Reconstruct data using only the largest principal components
  • Feature selection - Extract the most important directions of variance
PCA’s main weakness is that it tends to be highly affected by outliers in the data. Scikit-Learn contains several variants on PCA, including SparsePCA and a randomized SVD solver built into the PCA estimator itself (svd_solver='randomized'). Key references: (Anselmi et al., 2013; Bronstein et al., 2016; Sun et al., 2016)

References

  • Anselmi, F., Leibo, J., Rosasco, L., Mutch, J., Tacchetti, A., et al. (2013). Unsupervised Learning of Invariant Representations in Hierarchical Architectures.
  • Bronstein, M., Bruna, J., LeCun, Y., Szlam, A., Vandergheynst, P. (2016). Geometric deep learning: going beyond Euclidean data.
  • Sun, B., Feng, J., Saenko, K. (2016). Correlation Alignment for Unsupervised Domain Adaptation.