This notebook contains an excerpt from the Python Data Science Handbook by Jake VanderPlas with additional 3D visualization examples. The text is released under the CC-BY-NC-ND license, and code is released under the MIT license.


Open this tutorial in Google Colab to execute the code interactively.
This notebook demonstrates PCA using 3D data, showing how principal component analysis finds the directions of maximum variance and projects data onto lower-dimensional subspaces.

Setup

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import seaborn as sns; sns.set()

Generating 3D Gaussian Data

First, let’s create a 3D Gaussian distribution with a specific covariance structure:
# Define the covariance matrix
covariance_matrix = np.array([
    [4, 2, 1],
    [2, 3, 1.5],
    [1, 1.5, 2]
])

# Mean vector (assuming a mean of 0 for simplicity)
mean = np.zeros(3)

# Generate 1000 samples
samples = np.random.multivariate_normal(mean, covariance_matrix, 1000)
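As a quick sanity check (a minimal sketch using only NumPy), we can confirm that the specified covariance matrix is symmetric and positive definite, and that the empirical covariance of the samples is close to it:
# The covariance must be symmetric positive definite for multivariate_normal;
# its eigenvalues should all be positive, and with 1000 samples the empirical
# covariance should be close to the one we specified
assert np.allclose(covariance_matrix, covariance_matrix.T)
print("Eigenvalues:", np.linalg.eigvalsh(covariance_matrix))
print("Empirical covariance:\n", np.round(np.cov(samples, rowvar=False), 2))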

Visualizing 3D Data

fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')

# Scatter plot for samples
ax.scatter(samples[:, 0], samples[:, 1], samples[:, 2])

ax.set_xlabel('X axis')
ax.set_ylabel('Y axis')
ax.set_zlabel('Z axis')
ax.set_title('3D plot of Gaussian Samples')
plt.show()

Computing Principal Components via SVD

Because the sample covariance matrix is symmetric and positive semi-definite, its singular value decomposition coincides with its eigendecomposition: the columns of U are the principal axes and the singular values in S are the variances along them. We can therefore find the principal components by performing SVD on the covariance matrix:
# Compute the covariance matrix of the samples
sample_covariance_matrix = np.cov(samples, rowvar=False)

# Perform SVD to find the principal components
U, S, Vt = np.linalg.svd(sample_covariance_matrix)

# Center the samples and project them onto the first two principal components
projected_samples = np.dot(samples - samples.mean(axis=0), U[:, :2])
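Since S holds the variances along the principal axes, we can also check how much of the total variance the two-dimensional projection retains (a small sketch reusing the S computed above):
# Fraction of the total variance captured by the first two principal components
explained_fraction = S[:2].sum() / S.sum()
print("Variance retained by the first two components: {:.1%}".format(explained_fraction))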

2D Projection

plt.figure(figsize=(8, 6))
plt.scatter(projected_samples[:, 0], projected_samples[:, 1])
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.title('Projection onto the First Two Principal Components')
plt.grid(True)
plt.show()

Verifying Decorrelation

A key property of PCA is that the projected data has uncorrelated components:
# Calculate the correlation matrix of the projected samples
correlation_matrix = np.corrcoef(projected_samples.T)
print("Correlation matrix:\n", correlation_matrix)

# Extract the correlation between the first two components
correlation_between_components = correlation_matrix[0, 1]
print("Correlation between components:", correlation_between_components)
# Should be essentially zero (e.g., 2.97e-17)
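Beyond being uncorrelated, the projected components should carry variances that approximately match the two largest eigenvalues found above, which is worth verifying (a sketch assuming projected_samples and S from the previous cells):
# Variance along each projected component vs. the two largest eigenvalues
print("Projected variances:", np.var(projected_samples, axis=0, ddof=1))
print("Top two eigenvalues:", S[:2])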

Visualizing the Principal Plane in 3D

We can visualize how the 2D projection relates to the original 3D space:
# Mean of the original samples
mean_sample = np.mean(samples, axis=0)

# Map the 2D projections back into the original 3D space
# (reconstruction onto the plane spanned by the first two principal axes)
extended_samples = np.dot(projected_samples, U[:, :2].T) + mean_sample

fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')

# Scatter plot for the extended samples in the principal component plane
ax.scatter(extended_samples[:, 0], extended_samples[:, 1],
           extended_samples[:, 2], alpha=0.6)

# Plot the principal axes
for i in range(2):
    axis = U[:, i] * max(S)
    ax.plot([mean_sample[0], mean_sample[0] + axis[0]],
            [mean_sample[1], mean_sample[1] + axis[1]],
            [mean_sample[2], mean_sample[2] + axis[2]],
            linewidth=2, color='r')

ax.set_xlabel('X axis')
ax.set_ylabel('Y axis')
ax.set_zlabel('Z axis')
ax.set_title('2D Projections in Original 3D Space')
plt.show()
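Since the principal plane is the best-fitting plane in the least-squares sense, the average squared distance from the original points to their reconstructions should be close to the smallest eigenvalue. A brief check, assuming samples, extended_samples, and S from above:
# Mean squared distance from each sample to its projection on the principal plane;
# this residual variance should be close to the smallest eigenvalue S[2]
residuals = samples - extended_samples
print("Mean squared residual:", np.mean(np.sum(residuals**2, axis=1)))
print("Smallest eigenvalue: ", S[2])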

Interactive 3D Visualization with Plotly

For interactive exploration, we can use Plotly:
import plotly.graph_objects as go

# Create scatter trace for the extended samples
scatter = go.Scatter3d(
    x=extended_samples[:, 0],
    y=extended_samples[:, 1],
    z=extended_samples[:, 2],
    mode='markers',
    marker=dict(size=2)
)

# Create traces for the principal axes
axes = []
for i in range(2):
    axis = U[:, i] * max(S)
    axes.append(go.Scatter3d(
        x=[mean_sample[0], mean_sample[0] + axis[0]],
        y=[mean_sample[1], mean_sample[1] + axis[1]],
        z=[mean_sample[2], mean_sample[2] + axis[2]],
        mode='lines',
        line=dict(color='red', width=4)
    ))

# Create a mesh for the projected plane
corner_points = np.array([
    mean_sample + U[:, 0] * max(S) + U[:, 1] * max(S),
    mean_sample - U[:, 0] * max(S) + U[:, 1] * max(S),
    mean_sample - U[:, 0] * max(S) - U[:, 1] * max(S),
    mean_sample + U[:, 0] * max(S) - U[:, 1] * max(S)
])

plane = go.Mesh3d(
    x=corner_points[:, 0],
    y=corner_points[:, 1],
    z=corner_points[:, 2],
    opacity=0.5,
    color='yellow'
)

# Layout
layout = go.Layout(
    title='2D Projections in Original 3D Space',
    scene=dict(
        xaxis_title='X axis',
        yaxis_title='Y axis',
        zaxis_title='Z axis'
    )
)

fig = go.Figure(data=[scatter, plane] + axes, layout=layout)
fig.show()

Using Scikit-Learn PCA

We can also use scikit-learn’s PCA implementation:
from sklearn.decomposition import PCA

# Create 2D data with correlation
rng = np.random.RandomState(1)
X = np.dot(rng.rand(2, 2), rng.randn(2, 200)).T

# Fit PCA
pca = PCA(n_components=2)
pca.fit(X)

print("Components:\n", pca.components_)
print("Explained variance:", pca.explained_variance_)

Visualizing Principal Axes

# Helper: draw an arrow from point v0 to point v1 on the given axes
def draw_vector(v0, v1, ax=None, color='black'):
    ax = ax or plt.gca()
    arrowprops=dict(arrowstyle='->',
                    linewidth=2,
                    shrinkA=0, shrinkB=0, color=color)
    ax.annotate('', v1, v0, arrowprops=arrowprops)

# Plot data with principal axes
plt.scatter(X[:, 0], X[:, 1], alpha=0.2)
for length, vector in zip(pca.explained_variance_, pca.components_):
    v = vector * 3 * np.sqrt(length)
    draw_vector(pca.mean_, pca.mean_ + v)
plt.axis('equal');

Input vs Principal Components Comparison

rng = np.random.RandomState(1)
X = np.dot(rng.rand(2, 2), rng.randn(2, 200)).T
pca = PCA(n_components=2, whiten=True)
pca.fit(X)

fig, ax = plt.subplots(1, 2, figsize=(16, 6))
fig.subplots_adjust(left=0.0625, right=0.95, wspace=0.1)

# Plot original data
ax[0].scatter(X[:, 0], X[:, 1], alpha=0.2)
for length, vector in zip(pca.explained_variance_, pca.components_):
    v = vector * 3 * np.sqrt(length)
    draw_vector(pca.mean_, pca.mean_ + v, ax=ax[0])
ax[0].axis('equal')
ax[0].set(xlabel='x_1', ylabel='x_2', title='input')

# Plot principal components
X_pca = pca.transform(X)
ax[1].scatter(X_pca[:, 0], X_pca[:, 1], alpha=0.2)
draw_vector([0, 0], [0, 3], ax=ax[1])
draw_vector([0, 0], [3, 0], ax=ax[1])
ax[1].axis('equal')
ax[1].set(xlabel='component 1', ylabel='component 2',
          title='principal components',
          xlim=(-5, 5), ylim=(-3, 3.1))
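The whiten=True option rescales each transformed component to (approximately) unit variance, which is why the two arrows in the right-hand panel have the same length. A minimal check, assuming the X_pca computed above:
# With whiten=True, each principal component has roughly unit standard deviation
print("Component standard deviations:", X_pca.std(axis=0))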

Key Insights

  1. Decorrelation: PCA transforms correlated variables into uncorrelated principal components
  2. Variance Maximization: The first principal component captures the direction of maximum variance
  3. Orthogonality: Principal components are orthogonal to each other
  4. Dimensionality Reduction: We can project high-dimensional data onto a lower-dimensional subspace while preserving maximum variance
Key references: (Zeng et al., 2016; McInnes et al., 2018)

References

  • McInnes, L., Healy, J., Melville, J. (2018). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction.
  • Zeng, A., Song, S., Nießner, M., Fisher, M., Xiao, J., et al. (2016). 3DMatch: Learning Local Geometric Descriptors from RGB-D Reconstructions.