Chapter 21

Principal Component Analysis

Multivariate Statistical Methods

Learning Objectives

By the end of this section, you will be able to:

📚 Core Knowledge

  • Explain what PCA does and why it's useful
  • Understand the connection between PCA and eigendecomposition
  • Derive how eigenvalues relate to explained variance
  • Interpret principal components geometrically and statistically

🔧 Practical Skills

  • Implement PCA from scratch using NumPy
  • Choose the optimal number of components using scree plots
  • Apply PCA for dimensionality reduction in real datasets
  • Understand reconstruction error and information loss

🧠 Deep Learning Connections

  • Autoencoders: an optimal linear autoencoder learns the same subspace as PCA; neural autoencoders learn nonlinear versions
  • Embedding layers: Word2Vec and similar embeddings are learned low-dimensional representations
  • Feature preprocessing: PCA whitening decorrelates and rescales features before deep learning
  • Latent spaces: VAEs and GANs learn latent representations conceptually similar to PCA
Where You'll Apply This: Data visualization, noise reduction, feature extraction, image compression, face recognition, recommendation systems, and as preprocessing for machine learning models.

The Big Picture

Principal Component Analysis (PCA) is one of the most fundamental techniques in statistics and machine learning. It answers a deceptively simple question: What are the most important directions of variation in my data?

The Core Insight

High-dimensional data often lies near a lower-dimensional subspace. PCA finds this subspace by identifying the directions along which the data varies the most. These directions—the principal components—form a new coordinate system that's optimal for representing your data compactly.

  • 📊 Dimensionality Reduction: 1000 features → 50 components
  • 🔍 Noise Removal: Low-variance directions often contain noise
  • 👁️ Visualization: See 100D data in 2D or 3D

Why We Need PCA

Real-world data presents several challenges that PCA directly addresses:

  1. The Curse of Dimensionality: As dimensions increase, data becomes sparse and pairwise distances become less informative. A dataset with 1000 features but only 100 samples is essentially empty space.
  2. Correlated Features: Many features measure similar things. In image data, adjacent pixels are highly correlated. PCA finds uncorrelated directions (which are independent only under additional assumptions, such as Gaussian data).
  3. Computational Cost: Training ML models on 10,000 features is expensive. Reducing to 100 components speeds up training dramatically.
  4. Visualization: Humans can only see 2D or 3D. PCA projects high-dimensional data into visualizable space while preserving as much structure as possible.

Historical Context

📜
Karl Pearson (1901)

Pearson introduced PCA as a method for fitting lines and planes to systems of points in space. He framed it as finding the "line of closest fit" to a set of points—minimizing the perpendicular distances from points to the line.

🔬
Harold Hotelling (1933)

Hotelling independently developed PCA and connected it to the eigendecomposition of the covariance matrix. He showed that principal components are the eigenvectors, and eigenvalues represent variance—the formulation we use today.


Geometric Intuition

Before diving into the mathematics, let's build intuition. Imagine you have a cloud of points in 2D space. These points form an elongated ellipse—they vary more along one direction than another.

The Ellipse Analogy

The contours of constant density of a 2D Gaussian distribution are ellipses. The major axis of such an ellipse points in the direction of maximum variance: this is PC1. The minor axis, perpendicular to the major axis, points in the direction of least variance: this is PC2. In higher dimensions, we have a hyper-ellipsoid with axes pointing in each principal component direction.

Key geometric insights:

  • PC1 is the direction of maximum variance: If you could only keep one axis, this is the one that preserves the most information about how your data is spread out.
  • PC2 is orthogonal to PC1 and has the next most variance: Among all directions perpendicular to PC1, PC2 captures the most of the variance that remains.
  • Principal components are orthogonal: They form a new coordinate system in which the data's coordinates are uncorrelated.
  • Keeping all components is a rotation: PCA rotates (and possibly reflects) your data so the new axes align with the directions of maximum spread. Dimensionality reduction then drops the low-variance axes.
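The geometric claims above can be checked numerically. The following is a minimal sketch, assuming a hypothetical tilted 2D Gaussian cloud (the covariance values and sample size are made up for illustration): it finds the two principal directions, confirms they are orthogonal, and shows that the covariance becomes diagonal in the rotated coordinates.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2D data: an elongated, tilted Gaussian cloud.
true_cov = np.array([[3.0, 1.2],
                     [1.2, 1.0]])
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=true_cov, size=2000)

# Center, then eigendecompose the sample covariance matrix.
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)   # eigenvalues in ascending order
pc2, pc1 = eigenvectors[:, 0], eigenvectors[:, 1]

# Express the data in the PC coordinate system (a rotation, possibly with a reflection).
Z = X_centered @ np.column_stack([pc1, pc2])
cov_rotated = np.cov(Z, rowvar=False)

print("PC1 direction:", pc1)
print("PC1 . PC2 =", pc1 @ pc2)                        # close to 0: axes are orthogonal
print("Covariance in PC coordinates:\n", cov_rotated)  # close to diagonal
```

The diagonal entries of `cov_rotated` are the variances along PC1 and PC2, and the off-diagonal entries vanish up to floating-point error, which is what "uncorrelated axes" means in practice.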

Interactive: PCA Explorer

Explore how PCA finds the principal components. Adjust the data parameters and observe how PC1 always points in the direction of maximum variance. Toggle "Projections" to see how each point maps onto the first principal component.


Mathematical Foundation

Now let's formalize these intuitions. PCA can be derived from two equivalent perspectives:

  1. Maximum Variance: Find the direction that maximizes variance of projected data.
  2. Minimum Reconstruction Error: Find the low-dimensional representation that minimizes squared reconstruction error.

Both lead to the same solution: eigendecomposition of the covariance matrix.

The Covariance Matrix

Given a dataset $X$ with $n$ samples and $d$ features, the covariance matrix $\Sigma$ is:

$$\Sigma = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^T = \frac{1}{n-1} X_c^T X_c$$

where $X_c$ is the centered data (mean subtracted) and $\bar{x}$ is the sample mean.

The covariance matrix $\Sigma$ is $d \times d$ and symmetric. Its entries tell us:

  • Diagonal entries $\Sigma_{ii} = \text{Var}(X_i)$: The variance of each feature
  • Off-diagonal entries $\Sigma_{ij} = \text{Cov}(X_i, X_j)$: How features covary
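As a quick sanity check, the two forms of the covariance formula can be compared against NumPy's built-in estimator. This is a small sketch on hypothetical random data; the sizes `n` and `d` are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 3
X = rng.normal(size=(n, d))            # hypothetical data: n samples, d features

x_bar = X.mean(axis=0)                 # sample mean, shape (d,)
X_c = X - x_bar                        # centered data, shape (n, d)

# Sum-of-outer-products form of the definition
cov_sum = sum(np.outer(x - x_bar, x - x_bar) for x in X) / (n - 1)

# Matrix form: X_c^T X_c / (n - 1)
cov_mat = X_c.T @ X_c / (n - 1)

# NumPy's estimator (rowvar=False means columns are features)
cov_np = np.cov(X, rowvar=False)

print(np.allclose(cov_sum, cov_mat), np.allclose(cov_mat, cov_np))
print("diagonal:", np.diag(cov_mat))   # per-feature sample variances
```

The matrix form is the one to use in practice; the explicit sum is shown only to connect the code to the definition.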

Eigendecomposition

The key insight: eigenvectors of the covariance matrix are the principal components, and eigenvalues are the variances along each direction.

The Eigenvalue Equation

$$\Sigma v = \lambda v$$

$v$ is an eigenvector (principal component direction)
$\lambda$ is the eigenvalue (variance along that direction)

Why does this work? Consider projecting data onto a unit vector $v$. The variance of the projected data is:

$$\text{Var}(X_c v) = v^T \Sigma v$$

To maximize this variance subject to $\|v\| = 1$, we use Lagrange multipliers:

$$\mathcal{L} = v^T \Sigma v - \lambda(v^T v - 1)$$

Taking the gradient with respect to $v$ and setting it to zero gives $2\Sigma v - 2\lambda v = 0$, that is, $\Sigma v = \lambda v$: the eigenvalue equation. At any such stationary point the projected variance is $v^T \Sigma v = \lambda v^T v = \lambda$, so the maximum variance is achieved when $v$ is the eigenvector corresponding to the largest eigenvalue.
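The variance identity and the maximization result can be verified numerically. The sketch below assumes hypothetical 3D Gaussian data (the covariance values are illustrative only) and compares the top eigenvector against many random unit directions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical 3D data with correlated features.
X = rng.multivariate_normal(
    [0.0, 0.0, 0.0],
    [[4.0, 1.5, 0.5],
     [1.5, 2.0, 0.3],
     [0.5, 0.3, 1.0]],
    size=500,
)
X_c = X - X.mean(axis=0)
Sigma = np.cov(X_c, rowvar=False)

def projected_variance(v):
    """Sample variance of the data projected onto the unit vector v."""
    return np.var(X_c @ v, ddof=1)

eigenvalues, eigenvectors = np.linalg.eigh(Sigma)   # ascending order
v_top, lambda_top = eigenvectors[:, -1], eigenvalues[-1]

# Identity: Var(X_c v) = v^T Sigma v for an arbitrary unit vector v.
v = rng.normal(size=3)
v /= np.linalg.norm(v)
lhs, rhs = projected_variance(v), v @ Sigma @ v

# No random unit direction should beat the top eigenvector.
random_dirs = rng.normal(size=(1000, 3))
random_dirs /= np.linalg.norm(random_dirs, axis=1, keepdims=True)
random_vars = np.einsum("ij,jk,ik->i", random_dirs, Sigma, random_dirs)

print("identity holds:", np.isclose(lhs, rhs))
print("variance along top eigenvector:", projected_variance(v_top))
print("largest eigenvalue:            ", lambda_top)
print("best random direction:         ", random_vars.max())
```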

Example: Covariance Matrix Eigendecomposition

Data generation settings (values as displayed): 2.00, 0.50, 0.70, 200

Sample covariance matrix:

$$\Sigma = \begin{bmatrix} 1.934 & 0.743 \\ 0.743 & 0.560 \end{bmatrix}$$

Note: Covariance matrices are always symmetric and positive semi-definite.

Eigendecomposition (PCA):

  • PC1 (first principal component): $\lambda_1 = 2.258$, direction $[0.916, 0.401]$, 90.6% of variance
  • PC2 (second principal component): $\lambda_2 = 0.235$, direction $[-0.401, 0.916]$, 9.4% of variance

The PCA Connection

This is PCA in action! The eigenvectors of the covariance matrix are the principal components, and the eigenvalues tell us how much variance each component explains. PC1 (the eigenvector with the largest eigenvalue) points in the direction of maximum variance in the data. Adjusting correlation rotates the ellipse, while adjusting variances changes its shape.
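The numbers in this example can be reproduced directly from the 2×2 sample covariance matrix. This is a minimal sketch; because the matrix entries are rounded to three decimals, the results agree with the displayed values only up to rounding. Eigenvectors are defined up to sign, so the sign flip below is just a display convention.

```python
import numpy as np

# Sample covariance matrix from the example (entries rounded to 3 decimals).
Sigma = np.array([[1.934, 0.743],
                  [0.743, 0.560]])

eigenvalues, eigenvectors = np.linalg.eigh(Sigma)   # ascending order
order = np.argsort(eigenvalues)[::-1]               # reorder to descending
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Flip PC1 so that its first coordinate is positive (sign is arbitrary).
pc1 = eigenvectors[:, 0] * np.sign(eigenvectors[0, 0])
explained_ratio = eigenvalues / eigenvalues.sum()

print("eigenvalues:     ", np.round(eigenvalues, 3))
print("PC1 direction:   ", np.round(pc1, 3))
print("explained ratio: ", np.round(explained_ratio, 3))
```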


The PCA Algorithm

Now that we understand the theory, here's the complete algorithm:

Step | Operation | Result
1. Center | Subtract mean from each feature | Zero-mean data X_c
2. Covariance | Compute Σ = X_c^T X_c / (n-1) | d×d covariance matrix
3. Eigendecomposition | Solve Σv = λv for all (λ, v) pairs | d eigenvalues, d eigenvectors
4. Sort | Order by eigenvalue descending | λ₁ ≥ λ₂ ≥ ... ≥ λ_d
5. Select | Keep top k eigenvectors | V_k: d×k matrix
6. Project | Z = X_c · V_k | n×k transformed data
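The six steps map almost line for line onto NumPy. Here is a compact sketch that tracks the array shapes at each step; the function name `pca_steps` and the random test data are illustrative choices, not part of any library.

```python
import numpy as np

def pca_steps(X, k):
    """Run the six steps in order; returns (Z, V_k, eigenvalues)."""
    n = X.shape[0]
    X_c = X - X.mean(axis=0)                       # 1. center            (n, d)
    Sigma = X_c.T @ X_c / (n - 1)                  # 2. covariance        (d, d)
    eigenvalues, V = np.linalg.eigh(Sigma)         # 3. eigendecomposition
    order = np.argsort(eigenvalues)[::-1]          # 4. sort descending
    eigenvalues, V = eigenvalues[order], V[:, order]
    V_k = V[:, :k]                                 # 5. select            (d, k)
    Z = X_c @ V_k                                  # 6. project           (n, k)
    return Z, V_k, eigenvalues

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))   # hypothetical correlated data
Z, V_k, eigenvalues = pca_steps(X, k=2)
print(Z.shape, V_k.shape, eigenvalues.shape)   # (100, 2) (5, 2) (5,)
```

A useful check on any implementation is that the sample variance of each column of `Z` equals the corresponding eigenvalue.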

Interactive: Step-by-Step

Walk through each step of the PCA algorithm with a concrete example. See how the data transforms at each stage.


Choosing the Number of Components

A critical decision in PCA is how many components to keep. Keeping more components preserves more information but gives less dimensionality reduction. Several methods help guide this choice:

1. Elbow Method

Plot eigenvalues in order. Look for the "elbow" where they drop sharply. Components after the elbow add little value.

2. Variance Threshold

Keep enough components to explain a target percentage (often 95%) of total variance.

3. Kaiser Criterion

Keep components with eigenvalue > 1 (when using the correlation matrix). Each such component explains more variance than a single standardized feature, which has variance 1.

4. Cross-Validation

For ML tasks, choose k that minimizes cross-validation error on the downstream task.
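Two of these rules are easy to express in code. The sketch below uses a hypothetical set of eigenvalues from a 6-feature correlation matrix (so they sum to 6); the 95% target is the common default mentioned above, not a universal constant. The elbow method is visual and cross-validation depends on the downstream model, so neither is shown here.

```python
import numpy as np

# Hypothetical eigenvalues of a 6-feature correlation matrix, sorted descending.
eigenvalues = np.array([3.1, 1.4, 0.7, 0.4, 0.25, 0.15])

ratio = eigenvalues / eigenvalues.sum()   # explained variance ratio per component
cumulative = np.cumsum(ratio)             # cumulative explained variance

# Variance threshold: smallest k whose cumulative ratio reaches the target.
target = 0.95
k_threshold = int(np.searchsorted(cumulative, target) + 1)

# Kaiser criterion: count eigenvalues strictly greater than 1.
k_kaiser = int(np.sum(eigenvalues > 1.0))

print("cumulative variance:", np.round(cumulative, 3))
print("k for 95% variance: ", k_threshold)   # 5
print("k by Kaiser:        ", k_kaiser)      # 2
```

The two rules disagree here (5 versus 2 components), which is typical: they answer different questions, so the choice depends on whether fidelity or compactness matters more for the task.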

Interactive: Scree Plot Explorer

Experiment with different eigenvalue patterns and selection criteria. Notice how the "elbow" becomes clearer with certain patterns, while others make it harder to choose.


Reconstruction and Information Loss

When we project to fewer dimensions, we lose information. Understanding this loss is crucial:

Reconstruction Formula

$$\hat{X} = Z V_k^T + \bar{x}$$

where $Z$ is the projected data and $V_k$ contains the top $k$ eigenvectors

The reconstruction error for a single point is:

$$\|\hat{x} - x\|^2 = \sum_{j=k+1}^{d} \left((x - \bar{x})^T v_j\right)^2 = \sum_{j=k+1}^{d} z_j^2$$

where $z_j = (x - \bar{x})^T v_j$ is the coordinate of the centered point along the $j$-th principal component.

Averaged over the dataset (with the same $\frac{1}{n-1}$ normalization as the covariance matrix), the reconstruction error equals the sum of discarded eigenvalues: $\sum_{j=k+1}^{d} \lambda_j$. This is why keeping components with larger eigenvalues minimizes error.
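The link between reconstruction error and discarded eigenvalues can be confirmed on data. This is a minimal sketch on hypothetical correlated data; the sizes `n`, `d`, and `k` are arbitrary, and the error is normalized by `n - 1` to match the covariance estimate.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, k = 300, 6, 2
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))   # hypothetical correlated data

x_bar = X.mean(axis=0)
X_c = X - x_bar
eigenvalues, V = np.linalg.eigh(np.cov(X_c, rowvar=False))
order = np.argsort(eigenvalues)[::-1]
eigenvalues, V = eigenvalues[order], V[:, order]

V_k = V[:, :k]
Z = X_c @ V_k                      # project onto the top k components
X_hat = Z @ V_k.T + x_bar          # reconstruct in the original space

# Per-point squared error, and the same quantity from the discarded coordinates.
squared_errors = np.sum((X - X_hat) ** 2, axis=1)
discarded_coords = X_c @ V[:, k:]

avg_error = squared_errors.sum() / (n - 1)
discarded = eigenvalues[k:].sum()

print("average reconstruction error:", avg_error)
print("sum of discarded eigenvalues:", discarded)
```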

Interactive: Reconstruction Demo

See how reconstruction quality changes with the number of components. Compare original and reconstructed values dimension by dimension.


Applications in Machine Learning

PCA is ubiquitous in machine learning. Here are the most important applications:

🖼️ Image Compression (Eigenfaces)

A 100×100 grayscale image has 10,000 pixels. PCA can represent faces with ~100 components (about 1% of the original dimensions) with recognizable quality. The principal components are called "eigenfaces" because they look like ghostly face templates.

🔇 Noise Reduction

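When the signal is concentrated in a few high-variance directions, noise dominates the low-variance directions. By reconstructing from only the top components, we project out much of the noise while keeping the signal. The same idea is used in signal processing and data cleaning, and it is closely related to what denoising autoencoders learn.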
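A small experiment illustrates the effect. The sketch below assumes a synthetic setup in which the clean signal lives in a 2-dimensional subspace of a 20-dimensional space and the noise is isotropic; all sizes and scales are hypothetical, and real data will rarely separate this cleanly.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, k = 500, 20, 2

# Clean signal confined to a k-dimensional subspace, plus isotropic noise.
basis = np.linalg.qr(rng.normal(size=(d, k)))[0]     # (d, k) orthonormal basis
clean = rng.normal(scale=3.0, size=(n, k)) @ basis.T
noisy = clean + rng.normal(scale=0.5, size=(n, d))

# PCA on the noisy data, keeping only the top k directions.
mean = noisy.mean(axis=0)
centered = noisy - mean
eigenvalues, V = np.linalg.eigh(np.cov(centered, rowvar=False))
V_k = V[:, ::-1][:, :k]                              # top-k eigenvectors

# Project onto the top components and map back: this discards d - k noisy directions.
denoised = centered @ V_k @ V_k.T + mean

mse_noisy = np.mean((noisy - clean) ** 2)
mse_denoised = np.mean((denoised - clean) ** 2)
print("MSE before:", mse_noisy)
print("MSE after: ", mse_denoised)
```

The noise that falls inside the kept subspace cannot be removed this way, so the error drops but does not reach zero.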

⚡ Feature Preprocessing

PCA Whitening: Transform data so each component has unit variance and components are uncorrelated. This can accelerate neural network training by making the loss surface better conditioned (more spherical). It has been used in image preprocessing before CNNs.
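Whitening amounts to projecting onto the principal components and dividing each coordinate by the square root of its eigenvalue. The following is a minimal sketch; the mixing matrix `A` and the small constant `eps` (added to avoid dividing by near-zero eigenvalues) are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical correlated features, built from a fixed mixing matrix.
A = np.array([[2.0, 0.5, 0.0, 0.0],
              [0.0, 1.0, 0.3, 0.0],
              [0.0, 0.0, 1.5, 0.4],
              [0.0, 0.0, 0.0, 0.5]])
X = rng.normal(size=(1000, 4)) @ A

X_c = X - X.mean(axis=0)
eigenvalues, V = np.linalg.eigh(np.cov(X_c, rowvar=False))

eps = 1e-8                                        # assumed guard against tiny eigenvalues
X_white = (X_c @ V) / np.sqrt(eigenvalues + eps)  # rotate, then rescale each axis

cov_white = np.cov(X_white, rowvar=False)
print(np.round(cov_white, 4))                     # approximately the identity matrix
```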

📊 Exploratory Data Analysis

Project high-dimensional data to 2D or 3D for visualization. Look for clusters, outliers, and patterns. Common first step in analyzing genomics, customer data, or any high-dimensional dataset.

🧠 Connection to Autoencoders

PCA is closely tied to linear autoencoders. With linear activations and MSE loss, an optimal encoder/decoder pair spans the same subspace as the top $k$ principal components, although the learned weight vectors need not be the orthonormal eigenvectors themselves. Neural autoencoders extend this to learn nonlinear representations.


Python Implementation

Let's implement PCA from scratch to solidify understanding. We'll build a class that follows the scikit-learn API.

PCA Implementation from Scratch (pca_from_scratch.py)

```python
import numpy as np


class PCA:
    """PCA via eigendecomposition of the covariance matrix."""

    def __init__(self, n_components=None):
        # Number of components to keep; None keeps all of them
        self.n_components = n_components
        self.components_ = None
        self.mean_ = None
        self.explained_variance_ = None
        self.total_variance_ = None

    def fit(self, X):
        X = np.asarray(X, dtype=float)

        # Step 1: Compute the mean of each feature (one value per column)
        self.mean_ = np.mean(X, axis=0)

        # Step 2: Center the data
        X_centered = X - self.mean_

        # Step 3: Compute covariance matrix (columns are features; divides by n-1)
        cov_matrix = np.cov(X_centered, rowvar=False)

        # Step 4: Eigendecomposition (eigh is meant for symmetric matrices)
        eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

        # Step 5: Sort eigenvectors by eigenvalue (descending)
        idx = np.argsort(eigenvalues)[::-1]
        eigenvalues = eigenvalues[idx]
        eigenvectors = eigenvectors[:, idx]

        # Total variance over all directions, needed for explained_variance_ratio_
        self.total_variance_ = np.sum(eigenvalues)

        # Step 6: Select top n_components (each column is one component)
        if self.n_components is not None:
            self.components_ = eigenvectors[:, :self.n_components]
            self.explained_variance_ = eigenvalues[:self.n_components]
        else:
            self.components_ = eigenvectors
            self.explained_variance_ = eigenvalues

        return self

    def transform(self, X):
        # Center with the mean learned in fit(), then project
        X_centered = np.asarray(X, dtype=float) - self.mean_
        return X_centered @ self.components_

    def inverse_transform(self, X_transformed):
        # Map back to the original feature space and add the mean back
        return X_transformed @ self.components_.T + self.mean_

    @property
    def explained_variance_ratio_(self):
        # Fraction of the total variance (all directions) captured by each kept component
        return self.explained_variance_ / self.total_variance_
```

Notes on the code:

  • Import NumPy: NumPy provides the linear algebra operations we need for PCA, including eigendecomposition and matrix multiplication.
  • Class definition: We implement PCA as a class to store the fitted parameters (mean, components, eigenvalues) and reuse them for transforming new data.
  • Constructor: n_components specifies how many principal components to keep. If None, we keep all components.
  • fit method: Learns the principal components from training data X. This is where eigendecomposition happens.
  • Compute mean: Calculate the mean of each feature (column). axis=0 averages over the rows (samples), giving one mean per column.
  • Center the data: Critical step! Subtract the mean from each data point so the data is centered at the origin. This ensures we find directions of variance, not directions from the origin.
  • Compute covariance matrix: The covariance matrix captures how each feature varies with every other feature. Its shape is (n_features, n_features), and np.cov divides by (n-1) for unbiased estimation. Example: for 10 features, cov_matrix is 10×10.
  • Eigendecomposition: np.linalg.eigh is designed for symmetric matrices (which covariance matrices always are). It returns eigenvalues in ascending order, with the matching eigenvectors as columns.
  • Sort by eigenvalue: We want the largest eigenvalues first (they explain the most variance). np.argsort(eigenvalues)[::-1] gives the indices in descending order, and the same indices reorder the eigenvector columns.
  • Select top components: Keep only the top n_components eigenvectors. These become our principal component directions. Each column of components_ is a principal component (scikit-learn stores them as rows instead).
  • transform method: Project new data onto the principal components. First center using the learned mean, then project by matrix multiplication.
  • Center new data: Use the same mean computed during fit() to center new data. This ensures a consistent transformation.
  • Project onto components: Matrix multiplication: (n_samples, n_features) @ (n_features, n_components) = (n_samples, n_components). Each row becomes a k-dimensional representation. Example: 100 samples with 10 features → 100 samples with 2 components.
  • inverse_transform method: Reconstruct the original data from the low-dimensional representation. Useful for understanding what information was lost.
  • Reconstruction formula: Multiply the projected data by the transpose of components to go back to the original feature space, then add the mean back.
  • Explained variance ratio: Returns the proportion of the total variance explained by each kept component. The ratios sum to 1 only if all components are kept, which is why the total is computed from all eigenvalues before truncation. Essential for choosing how many components to keep.
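Using scikit-learn: In practice, use from sklearn.decomposition import PCA. It uses an SVD-based solver, which is more numerically stable than forming the covariance matrix explicitly, and the same module provides IncrementalPCA for datasets too large to fit in memory. It does not handle missing values, so impute or drop them first.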

The Eigenface Connection

One of the most famous applications of PCA is face recognition using eigenfaces. This technique, popularized by Turk and Pentland in 1991, illustrates the power of PCA for both compression and pattern recognition.

How Eigenfaces Work

  1. Flatten each face image into a vector. A 100×100 image becomes a 10,000-dimensional vector.
  2. Compute PCA on the dataset of face vectors. The principal components are called "eigenfaces."
  3. Each eigenface looks like a ghostly face—it represents a mode of variation in the face space.
  4. Any face can be reconstructed as a weighted sum of eigenfaces: $\text{face} = \bar{x} + \sum_i w_i \cdot \text{eigenface}_i$
  5. For recognition, compare the weights $w$: similar faces have similar weight vectors.
Key Insight: The first few eigenfaces capture lighting, orientation, and major facial features. Later eigenfaces capture finer details. With just 50-100 eigenfaces (from 10,000+ dimensions), we can represent and recognize faces remarkably well.
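The pipeline above can be sketched end to end on synthetic data. The example below does not use real faces: it assumes tiny 8×8 random "images" for three hypothetical people (one random template per person plus pixel noise), which is enough to show the flatten, project, and compare-weights steps.

```python
import numpy as np

rng = np.random.default_rng(7)
h, w = 8, 8                         # tiny stand-in images (real faces might be 100x100)
n_people, n_per_person = 3, 20

# Synthetic stand-in for a face dataset: one random template per person plus noise.
templates = rng.normal(size=(n_people, h, w))
images = np.concatenate(
    [t + 0.3 * rng.normal(size=(n_per_person, h, w)) for t in templates]
)
labels = np.repeat(np.arange(n_people), n_per_person)

# 1. Flatten each image into a vector.
X = images.reshape(len(images), -1)              # shape (60, 64)

# 2. PCA on the image vectors; columns of `eigenfaces` are the principal components.
mean_face = X.mean(axis=0)
X_c = X - mean_face
eigenvalues, V = np.linalg.eigh(np.cov(X_c, rowvar=False))
k = 5
eigenfaces = V[:, ::-1][:, :k]                   # shape (64, k)

# 4. Weights: coordinates of each image in eigenface space.
weights = X_c @ eigenfaces                       # shape (60, k)
reconstructed = mean_face + weights @ eigenfaces.T

# 5. Recognition: project a new image and find the nearest stored weight vector.
probe = templates[1] + 0.3 * rng.normal(size=(h, w))
probe_w = (probe.reshape(-1) - mean_face) @ eigenfaces
nearest = np.argmin(np.linalg.norm(weights - probe_w, axis=1))
predicted = labels[nearest]
print("predicted person:", predicted)
```

Nearest-neighbor matching in weight space is only one simple way to compare faces; the point is that the comparison happens in `k` dimensions rather than in pixel space.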

Summary

Key Takeaways

  • PCA finds orthogonal directions of maximum variance in data: the principal components.
  • Mathematically, PCA is eigendecomposition of the covariance matrix.
  • Eigenvalues represent variance explained; eigenvectors are the new coordinate axes.
  • The scree plot and cumulative variance help choose how many components to keep.
  • Average reconstruction error (normalized by n-1) equals the sum of discarded eigenvalues.
  • PCA is used for dimensionality reduction, noise removal, and visualization.
  • A linear autoencoder trained with MSE loss learns the same subspace as PCA.
  • Eigenfaces demonstrate PCA's power for both compression and pattern recognition.

Looking Ahead

In the next section, we'll explore Factor Analysis—a related technique that models latent factors generating the data. While PCA finds principal components that maximize variance, Factor Analysis assumes observed variables are linear combinations of hidden factors plus noise—a more probabilistic perspective with connections to modern generative models.
