Chapter 21

Linear Discriminant Analysis

Multivariate Statistical Methods

Learning Objectives

By the end of this section, you will master Linear Discriminant Analysis (LDA) - the supervised counterpart to PCA that finds optimal projections for classification. You will:

  1. Understand the fundamental difference between supervised (LDA) and unsupervised (PCA) dimensionality reduction - when and why class labels matter.
  2. Master the mathematical formulation: within-class scatter, between-class scatter, Fisher's criterion, and the generalized eigenvalue problem.
  3. Visualize and build intuition for why maximizing the ratio of between-class to within-class variance leads to optimal class separation.
  4. Implement LDA from scratch in Python and understand each computational step from class statistics to projection.
  5. Use LDA as a classifier and understand its connection to Gaussian discriminant analysis and linear decision boundaries.
  6. Connect LDA to modern deep learning: embedding spaces, center loss, contrastive learning, and metric learning.
  7. Apply LDA to real-world problems: face recognition (Fisherfaces), bioinformatics, and feature extraction for classification.

Why This Matters: LDA is one of the most important techniques for supervised dimensionality reduction and classification. It introduces the crucial insight that the best representation depends on your task - for classification, you should maximize class separability, not just total variance. This principle underlies modern deep learning approaches like contrastive learning, center loss, and learned embeddings.

The Big Picture

Historical Context

Linear Discriminant Analysis was developed by Sir Ronald A. Fisher in 1936, published in his landmark paper "The Use of Multiple Measurements in Taxonomic Problems." Fisher was working on a practical problem: given measurements of iris flowers (sepal length, sepal width, petal length, petal width), how can we best distinguish between species?

Fisher's key insight was brilliant in its simplicity: the best way to separate classes is to find a direction that maximizes the distance between class means while minimizing the spread within each class. This ratio - between-class variance divided by within-class variance - became known as Fisher's criterion or the Fisher ratio.

What makes Fisher's work remarkable is that he solved this problem decades before computers could efficiently perform eigendecomposition. His mathematical formulation remains the standard today, used in everything from face recognition to gene expression analysis.

Supervised vs Unsupervised Dimensionality Reduction

To understand LDA's value, we must first appreciate the fundamental difference between supervised and unsupervised dimensionality reduction:

| Aspect | PCA (Unsupervised) | LDA (Supervised) |
|---|---|---|
| Input | Data matrix X only | Data matrix X + labels y |
| Objective | Maximize total variance | Maximize class separability |
| What it finds | Directions of maximum spread | Directions that separate classes |
| Max components | min(n_samples, n_features) | min(n_features, n_classes - 1) |
| Best for | Compression, visualization, noise reduction | Classification preprocessing, feature extraction |
| Ignores | Class structure entirely | Variance that does not help separate classes |

The Core Difference: PCA asks "Where does the data vary most?" LDA asks "Where do the classes differ most?" These are fundamentally different questions with different optimal answers.

This distinction is crucial in machine learning. If your goal is classification, you should use the labels you have! PCA might find directions where the data varies a lot, but that variation might be within classes (noise) rather than between classes (signal).


The Core Intuition

Geometric View

Imagine you have two groups of people - basketball players and jockeys - and you want to classify new individuals. You measure many features: height, weight, arm span, shoe size, etc.

PCA's approach: Find the direction where people vary most. This might be dominated by weight variation, since weight varies more than height in absolute terms. But weight alone might not separate the groups well - there are heavy basketball players and light jockeys, and vice versa.

LDA's approach: Find the direction where basketball players and jockeys differ most, while being most similar within each group. This might be a combination of height and weight that perfectly separates the two groups.

The key geometric insight is that LDA finds a projection where:

  1. Class means are as far apart as possible
  2. Points within each class are as close together as possible

This creates the ideal situation for classification: well-separated, compact clusters.

PCA vs LDA: When Direction Matters

Consider the classic example where PCA and LDA give dramatically different results:

Imagine two elongated elliptical clusters (like cigars) that are separated along their short axis but aligned along their long axis. Think of two cigars lying side by side.

  • PCA finds the direction of maximum variance - the long axis of the combined data cloud. Projecting onto this direction, the two classes completely overlap!
  • LDA finds the direction of maximum separation - perpendicular to PCA, along the short axis that separates the cigars. Projecting here, classes are perfectly separated.

This is not a contrived example - it happens frequently in real data where within-class variance dominates between-class variance.


Mathematical Formulation

Now let's build LDA rigorously from first principles.

Class Statistics

Given data matrix X with n samples, p features, and K classes, we first compute class-specific statistics:

Class mean for class k:

$$\boldsymbol{\mu}_k = \frac{1}{n_k} \sum_{i: y_i = k} \mathbf{x}_i$$

Overall mean:

$$\boldsymbol{\mu} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{x}_i = \sum_{k=1}^{K} \frac{n_k}{n} \boldsymbol{\mu}_k$$

Class prior probabilities:

$$\pi_k = \frac{n_k}{n}$$

Where $n_k$ is the number of samples in class k.
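The three quantities above can be computed in a few lines of NumPy. The following is a minimal sketch on made-up toy data; the helper name `class_statistics` and the data are illustrative assumptions, not part of the original text.

```python
import numpy as np

def class_statistics(X, y):
    """Return class labels, per-class means, the overall mean, and class priors."""
    classes, counts = np.unique(y, return_counts=True)
    means = np.stack([X[y == c].mean(axis=0) for c in classes])  # shape (K, p)
    overall_mean = X.mean(axis=0)                                # shape (p,)
    priors = counts / counts.sum()                               # shape (K,)
    return classes, means, overall_mean, priors

# Hypothetical toy data: 3 classes of unequal size, 2 features each.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], size=(20, 2)),
    rng.normal(loc=[3.0, 1.0], size=(30, 2)),
    rng.normal(loc=[1.0, 4.0], size=(50, 2)),
])
y = np.repeat([0, 1, 2], [20, 30, 50])

classes, means, overall_mean, priors = class_statistics(X, y)

# The overall mean equals the prior-weighted average of the class means,
# which is the second equality in the overall-mean formula.
print(np.allclose(overall_mean, priors @ means))
```

Note that with unequal class sizes the overall mean is not the plain average of the class means; the weights are the priors.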

Scatter Matrices

LDA's optimization requires two key matrices:

Within-class scatter matrix (SW) - measures the spread of points around their respective class means:

$$\mathbf{S}_W = \sum_{k=1}^{K} \sum_{i: y_i = k} (\mathbf{x}_i - \boldsymbol{\mu}_k)(\mathbf{x}_i - \boldsymbol{\mu}_k)^T$$

This is the sum of class covariance matrices (scaled by class sizes). We want to minimize this - compact classes are easier to separate.

Between-class scatter matrix (SB) - measures the spread of class means around the overall mean:

$$\mathbf{S}_B = \sum_{k=1}^{K} n_k (\boldsymbol{\mu}_k - \boldsymbol{\mu})(\boldsymbol{\mu}_k - \boldsymbol{\mu})^T$$

This measures how far apart the class centroids are. We want to maximize this - well-separated class means make classification easy.
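Both definitions translate directly into code. Here is a minimal NumPy sketch, assuming rows of `X` are samples; the function name `scatter_matrices` and the toy data are illustrative choices.

```python
import numpy as np

def scatter_matrices(X, y):
    """Compute the within-class (S_W) and between-class (S_B) scatter matrices."""
    n_features = X.shape[1]
    overall_mean = X.mean(axis=0)
    S_W = np.zeros((n_features, n_features))
    S_B = np.zeros((n_features, n_features))
    for c in np.unique(y):
        X_c = X[y == c]
        mu_c = X_c.mean(axis=0)
        centered = X_c - mu_c
        S_W += centered.T @ centered                 # sum of outer products around mu_c
        diff = mu_c - overall_mean
        S_B += X_c.shape[0] * np.outer(diff, diff)   # weighted by the class size n_k
    return S_W, S_B

# Hypothetical toy data: two unit-variance classes with different means.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=1.0, size=(50, 2)),
    rng.normal(loc=[4.0, 1.0], scale=1.0, size=(50, 2)),
])
y = np.repeat([0, 1], 50)

S_W, S_B = scatter_matrices(X, y)
print("S_W =\n", S_W)
print("S_B =\n", S_B)
```

Both matrices are symmetric and positive semidefinite by construction, since each is a sum of outer products.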

Understanding Scatter Matrices

Within-Class Scatter (SW)

Measures spread of points around their class mean. Minimize this to make classes compact.

Between-Class Scatter (SB)

Measures distance between class means. Maximize this to push classes apart.

Total scatter satisfies:

$$\mathbf{S}_T = \mathbf{S}_W + \mathbf{S}_B$$

Where $\mathbf{S}_T$ is the total scatter (covariance) matrix. This decomposition is fundamental - total variance equals within-class plus between-class variance.
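One way to see why the decomposition holds is to split each deviation from the overall mean into a within-class part and a between-class part. The sketch below uses only the symbols defined above and assumes the unnormalized definition $\mathbf{S}_T = \sum_{i=1}^{n} (\mathbf{x}_i - \boldsymbol{\mu})(\mathbf{x}_i - \boldsymbol{\mu})^T$, which the text does not spell out.

```latex
\begin{aligned}
\mathbf{S}_T
  &= \sum_{k=1}^{K} \sum_{i: y_i = k}
     \bigl[(\mathbf{x}_i - \boldsymbol{\mu}_k) + (\boldsymbol{\mu}_k - \boldsymbol{\mu})\bigr]
     \bigl[(\mathbf{x}_i - \boldsymbol{\mu}_k) + (\boldsymbol{\mu}_k - \boldsymbol{\mu})\bigr]^T \\
  &= \underbrace{\sum_{k=1}^{K} \sum_{i: y_i = k}
       (\mathbf{x}_i - \boldsymbol{\mu}_k)(\mathbf{x}_i - \boldsymbol{\mu}_k)^T}_{\mathbf{S}_W}
   + \underbrace{\sum_{k=1}^{K} n_k
       (\boldsymbol{\mu}_k - \boldsymbol{\mu})(\boldsymbol{\mu}_k - \boldsymbol{\mu})^T}_{\mathbf{S}_B}
   + \text{cross terms}
\end{aligned}
```

The cross terms vanish because $\sum_{i: y_i = k} (\mathbf{x}_i - \boldsymbol{\mu}_k) = \mathbf{0}$ for every class k, which leaves exactly the within-class and between-class parts.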

Fisher's Criterion

Fisher's brilliant insight was to optimize the ratio of between-class to within-class variance. For a projection direction w, the projected data has:

  • Between-class variance: $\mathbf{w}^T \mathbf{S}_B \mathbf{w}$
  • Within-class variance: $\mathbf{w}^T \mathbf{S}_W \mathbf{w}$

Fisher's criterion maximizes the ratio:

$$J(\mathbf{w}) = \frac{\mathbf{w}^T \mathbf{S}_B \mathbf{w}}{\mathbf{w}^T \mathbf{S}_W \mathbf{w}}$$

This is the Rayleigh quotient of the matrix pencil $(\mathbf{S}_B, \mathbf{S}_W)$.

Why a Ratio? We can't just maximize between-class variance (we'd get arbitrarily large values by scaling w). We can't just minimize within-class variance (we'd project to zero). The ratio is scale-invariant and balances both objectives.
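The scale invariance is easy to check numerically. The snippet below is a small sketch with hand-picked 2×2 scatter matrices (hypothetical values chosen only so the arithmetic is easy to follow); `fisher_criterion` is an illustrative helper name.

```python
import numpy as np

def fisher_criterion(w, S_B, S_W):
    """Evaluate J(w) = (w^T S_B w) / (w^T S_W w) for a direction w."""
    w = np.asarray(w, dtype=float)
    return float(w @ S_B @ w) / float(w @ S_W @ w)

# Hypothetical scatter matrices, chosen only for illustration.
S_W = np.array([[5.5, 0.0],
                [0.0, 0.5]])
S_B = np.array([[1.0, 2.0],
                [2.0, 4.0]])

w = np.array([1.0, 3.0])
J = fisher_criterion(w, S_B, S_W)
J_scaled = fisher_criterion(10.0 * w, S_B, S_W)   # rescaling w leaves J unchanged
J_flipped = fisher_criterion(-w, S_B, S_W)        # flipping the sign of w leaves J unchanged

print(J, J_scaled, J_flipped)
```

Because only the direction of w matters, implementations are free to normalize w however is convenient, for example to unit length or so that $\mathbf{w}^T \mathbf{S}_W \mathbf{w} = 1$.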

1D Projection: Class Separability

LDA finds the projection direction that maximizes the Fisher criterion: maximize between-class variance while minimizing within-class variance.

The Generalized Eigenvalue Problem

Taking the derivative of $J(\mathbf{w})$ and setting it to zero leads to the generalized eigenvalue problem:

$$\mathbf{S}_B \mathbf{w} = \lambda \mathbf{S}_W \mathbf{w}$$
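For readers who want the intermediate step, here is a sketch of the derivation using the quotient rule and the symmetry of both scatter matrices:

```latex
\nabla_{\mathbf{w}} J(\mathbf{w})
  = \frac{2\,\mathbf{S}_B \mathbf{w}\,(\mathbf{w}^T \mathbf{S}_W \mathbf{w})
        - 2\,\mathbf{S}_W \mathbf{w}\,(\mathbf{w}^T \mathbf{S}_B \mathbf{w})}
         {(\mathbf{w}^T \mathbf{S}_W \mathbf{w})^2}
  = \mathbf{0}
\;\Longrightarrow\;
\mathbf{S}_B \mathbf{w}
  = \underbrace{\frac{\mathbf{w}^T \mathbf{S}_B \mathbf{w}}
                     {\mathbf{w}^T \mathbf{S}_W \mathbf{w}}}_{\lambda \,=\, J(\mathbf{w})}
    \,\mathbf{S}_W \mathbf{w}
```

At a stationary point the eigenvalue $\lambda$ equals the criterion value $J(\mathbf{w})$ itself, so the eigenvector with the largest eigenvalue is the maximizer.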

If $\mathbf{S}_W$ is invertible, this becomes a standard eigenvalue problem:

$$\mathbf{S}_W^{-1} \mathbf{S}_B \mathbf{w} = \lambda \mathbf{w}$$

The discriminant directions are the eigenvectors of $\mathbf{S}_W^{-1} \mathbf{S}_B$, sorted by eigenvalue magnitude. Larger eigenvalues correspond to more discriminative directions.
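In code, the generalized problem can be handed to `scipy.linalg.eigh(S_B, S_W)` without forming the inverse explicitly. The sketch below assumes toy data with 3 classes and 4 features (made up for illustration) and checks that the top eigenpair satisfies the equation and that its eigenvalue equals the Fisher criterion.

```python
import numpy as np
from scipy import linalg

# Hypothetical toy data: 3 classes, 4 features, class means differ in the first two features.
rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0, 0.0, 0.0],
                    [3.0, 1.0, 0.0, 0.0],
                    [1.0, 4.0, 0.0, 0.0]])
X = np.vstack([rng.normal(loc=c, size=(40, 4)) for c in centers])
y = np.repeat([0, 1, 2], 40)

mu = X.mean(axis=0)
S_W = np.zeros((4, 4))
S_B = np.zeros((4, 4))
for c in np.unique(y):
    X_c = X[y == c]
    mu_c = X_c.mean(axis=0)
    S_W += (X_c - mu_c).T @ (X_c - mu_c)
    S_B += len(X_c) * np.outer(mu_c - mu, mu_c - mu)

# eigh(A, B) solves A w = lambda B w for symmetric A and positive definite B.
# Eigenvalues come back in ascending order, so reverse them.
eigvals, eigvecs = linalg.eigh(S_B, S_W)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

w1 = eigvecs[:, 0]
residual = S_B @ w1 - eigvals[0] * (S_W @ w1)       # should be numerically zero
J_w1 = (w1 @ S_B @ w1) / (w1 @ S_W @ w1)            # Fisher criterion at w1

print("eigenvalues:", eigvals)
print("J(w1) equals top eigenvalue:", np.isclose(J_w1, eigvals[0]))
```

With three classes, only the first two eigenvalues should be meaningfully above zero; the rest are round-off.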

The optimal direction (Fisher's Linear Discriminant for two classes) has a simple closed form:

$$\mathbf{w}^* = \mathbf{S}_W^{-1} (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$$

This is the direction from one class mean to the other, transformed by the inverse of the within-class scatter. If classes have spherical covariance, this points directly between the means. If classes are elongated, it adjusts to account for within-class correlations.
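The closed form and the eigenvalue route should agree up to scale and sign. The following sketch checks this on synthetic two-class data with elongated clusters; the data and variable names are assumptions made for illustration.

```python
import numpy as np
from scipy import linalg

# Hypothetical two-class data with a shared, elongated covariance.
rng = np.random.default_rng(2)
cov = np.array([[3.0, 2.5],
                [2.5, 3.0]])
X1 = rng.multivariate_normal([-1.0, 1.0], cov, size=100)
X2 = rng.multivariate_normal([1.0, -1.0], cov, size=100)

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
S_W = (X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)

# Closed form: w* = S_W^{-1} (mu1 - mu2). Solving the linear system
# avoids computing the inverse explicitly.
w_closed = np.linalg.solve(S_W, mu1 - mu2)

# Eigenvalue route: top generalized eigenvector of (S_B, S_W).
mu = np.vstack([X1, X2]).mean(axis=0)
S_B = len(X1) * np.outer(mu1 - mu, mu1 - mu) + len(X2) * np.outer(mu2 - mu, mu2 - mu)
eigvals, eigvecs = linalg.eigh(S_B, S_W)
w_eig = eigvecs[:, np.argmax(eigvals)]

# Absolute cosine of the angle between the two directions (1 means parallel).
cosine = abs(w_closed @ w_eig) / (np.linalg.norm(w_closed) * np.linalg.norm(w_eig))
print("directions are parallel:", np.isclose(cosine, 1.0))
```

The difference of the raw means points along (-1, 1) here, but the sample-based direction is tilted slightly by the estimated within-class correlations, which is the adjustment described above.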

Multi-Class LDA

For K > 2 classes, we solve the same generalized eigenvalue problem but extract multiple discriminant directions. A critical constraint emerges:

Maximum Components: LDA can produce at most $\min(p, K-1)$ meaningful components, where p is the number of features and K is the number of classes. This is because $\mathbf{S}_B$ has rank at most $K-1$.

For binary classification (K=2), there is only one LDA component! This is a fundamental limitation compared to PCA.

| Number of Classes | Max LDA Components | Rank of S_B |
|---|---|---|
| 2 (binary) | 1 | 1 |
| 3 | 2 | 2 |
| 10 | 9 | 9 |
| 100 | 99 | 99 |
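The rank bound can be checked empirically. The sketch below builds the between-class scatter for randomly placed class centers (hypothetical data) and counts singular values above a relative tolerance; the helper name `between_class_rank` and the tolerance are illustrative choices.

```python
import numpy as np

def between_class_rank(n_classes, n_features, n_per_class=30, seed=0):
    """Numerical rank of S_B for randomly generated, well-separated classes."""
    rng = np.random.default_rng(seed)
    centers = rng.normal(scale=5.0, size=(n_classes, n_features))
    X = np.vstack([rng.normal(loc=c, size=(n_per_class, n_features)) for c in centers])
    y = np.repeat(np.arange(n_classes), n_per_class)

    mu = X.mean(axis=0)
    S_B = np.zeros((n_features, n_features))
    for c in range(n_classes):
        diff = X[y == c].mean(axis=0) - mu
        S_B += n_per_class * np.outer(diff, diff)

    # Count singular values that are not round-off relative to the largest one.
    s = np.linalg.svd(S_B, compute_uv=False)
    return int(np.sum(s > 1e-10 * s[0]))

for K in (2, 3, 5):
    print(f"K={K}, p=8 -> rank(S_B) = {between_class_rank(K, n_features=8)}")

# With fewer features than K - 1, the number of features is the binding limit.
print(f"K=5, p=2 -> rank(S_B) = {between_class_rank(5, n_features=2)}")
```

The rank is capped at K-1 because the weighted deviations of the class means from the overall mean sum to zero, which removes one degree of freedom.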

Interactive LDA Explorer

Use this interactive visualization to explore the difference between LDA and PCA. Adjust the class configuration to see when they agree and when they dramatically differ.

[Interactive figure: LDA vs PCA Explorer. Legend: Class 1, Class 2, LDA direction, PCA direction. Controls: distance between class centers, within-class variance, rotation of the class configuration.]

What to Observe:

  • LDA (orange) maximizes class separation
  • PCA (green) maximizes total variance
  • Rotate to 90° to see them differ dramatically
  • Increase spread to see LDA's advantage

Key Experiment: Set the rotation angle to make the classes align diagonally, then increase class spread. Watch how PCA finds the diagonal (maximum variance) while LDA finds the perpendicular direction (maximum separation). This is the fundamental insight of supervised dimensionality reduction.

Decision Boundaries

LDA as a Classifier

LDA is not just a dimensionality reduction technique - it's also a powerful classifier. When used for classification, LDA assumes:

  1. Each class follows a multivariate Gaussian distribution
  2. All classes have the same covariance matrix (homoscedasticity)
  3. Observations are independent

Under these assumptions, the optimal decision rule (minimizing classification error) is to assign x to the class with highest discriminant function:

$$\delta_k(\mathbf{x}) = \mathbf{x}^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_k - \frac{1}{2} \boldsymbol{\mu}_k^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_k + \log \pi_k$$

Where $\boldsymbol{\Sigma}$ is the common covariance matrix (estimated by $\mathbf{S}_W / (n - K)$).

The decision boundary between classes j and k is where $\delta_j(\mathbf{x}) = \delta_k(\mathbf{x})$. Because the quadratic terms cancel (same covariance), this is a linear function of x - hence "Linear" Discriminant Analysis.
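The discriminant function can be evaluated for all samples and classes at once. Below is a minimal NumPy sketch, assuming the pooled covariance estimate $\mathbf{S}_W / (n - K)$ from the text; the function names and toy data are illustrative, not a reference implementation.

```python
import numpy as np

def fit_lda_classifier(X, y):
    """Estimate class means, the pooled covariance S_W / (n - K), and class priors."""
    classes, counts = np.unique(y, return_counts=True)
    n, p = X.shape
    means = np.stack([X[y == c].mean(axis=0) for c in classes])   # shape (K, p)
    S_W = np.zeros((p, p))
    for c, mu in zip(classes, means):
        centered = X[y == c] - mu
        S_W += centered.T @ centered
    pooled_cov = S_W / (n - len(classes))
    priors = counts / n
    return classes, means, pooled_cov, priors

def lda_discriminant_scores(X, means, pooled_cov, priors):
    """delta_k(x) = x^T Sigma^-1 mu_k - 0.5 mu_k^T Sigma^-1 mu_k + log pi_k."""
    A = np.linalg.solve(pooled_cov, means.T)          # column k holds Sigma^-1 mu_k
    linear = X @ A                                    # x^T Sigma^-1 mu_k
    constant = -0.5 * np.sum(means.T * A, axis=0)     # -0.5 mu_k^T Sigma^-1 mu_k
    return linear + constant + np.log(priors)

# Hypothetical toy data: three unit-variance classes in 2D.
rng = np.random.default_rng(5)
centers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
X = np.vstack([rng.normal(loc=c, size=(60, 2)) for c in centers])
y = np.repeat([0, 1, 2], 60)

classes, means, pooled_cov, priors = fit_lda_classifier(X, y)
scores = lda_discriminant_scores(X, means, pooled_cov, priors)
y_pred = classes[np.argmax(scores, axis=1)]
accuracy = np.mean(y_pred == y)
print(f"training accuracy: {accuracy:.3f}")
```

Each score is linear in x, so the set where two class scores are equal is a hyperplane, which is the linear boundary described above.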

Quadratic Discriminant Analysis

If the equal covariance assumption doesn't hold, we can use Quadratic Discriminant Analysis (QDA), which allows each class to have its own covariance:

$$\delta_k(\mathbf{x}) = -\frac{1}{2} \log |\boldsymbol{\Sigma}_k| - \frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_k)^T \boldsymbol{\Sigma}_k^{-1} (\mathbf{x} - \boldsymbol{\mu}_k) + \log \pi_k$$

QDA produces quadratic (curved) decision boundaries. It's more flexible but requires more data to estimate class-specific covariances reliably.

| Method | Covariance Assumption | Decision Boundary | Parameters to Estimate |
|---|---|---|---|
| LDA | Shared across classes | Linear (hyperplane) | O(p²) for covariance |
| QDA | Different per class | Quadratic (curved) | O(K × p²) for covariances |
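To make the QDA score concrete, here is a small sketch on data where the classes share a mean but differ in spread, a case no linear boundary can handle. The data, the equal priors, and the function name are assumptions for illustration.

```python
import numpy as np

def qda_discriminant_scores(X, means, covs, priors):
    """delta_k(x) = -0.5 log|Sigma_k| - 0.5 (x - mu_k)^T Sigma_k^-1 (x - mu_k) + log pi_k."""
    scores = np.empty((X.shape[0], len(priors)))
    for k, (mu, cov, prior) in enumerate(zip(means, covs, priors)):
        diff = X - mu
        _, logdet = np.linalg.slogdet(cov)                            # stable log-determinant
        maha = np.sum(diff * np.linalg.solve(cov, diff.T).T, axis=1)  # squared Mahalanobis distance
        scores[:, k] = -0.5 * logdet - 0.5 * maha + np.log(prior)
    return scores

# Hypothetical data: same mean, very different spread per class.
rng = np.random.default_rng(6)
X = np.vstack([
    rng.normal(scale=0.5, size=(200, 2)),   # tight class 0
    rng.normal(scale=3.0, size=(200, 2)),   # wide class 1
])
y = np.repeat([0, 1], 200)

means = [X[y == c].mean(axis=0) for c in (0, 1)]
covs = [np.cov(X[y == c], rowvar=False) for c in (0, 1)]   # one covariance per class
priors = [0.5, 0.5]

scores = qda_discriminant_scores(X, means, covs, priors)
accuracy = np.mean(np.argmax(scores, axis=1) == y)
print(f"QDA training accuracy: {accuracy:.3f}")
```

The resulting boundary is roughly a circle around the shared mean: points near the center go to the tight class, points far away go to the wide one.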

Assumptions and Violations

LDA makes several assumptions. Understanding them helps you know when LDA will work well and when it might fail:

  1. Gaussian class distributions: Classes should be approximately normally distributed. Violations can lead to suboptimal boundaries.
  2. Equal covariance matrices: All classes share the same covariance structure. If this fails badly, consider QDA.
  3. Linear separability: Classes should be separable by a linear boundary. For complex boundaries, use kernelized or nonlinear methods.
  4. No multicollinearity: $\mathbf{S}_W$ should be invertible. With highly correlated features or p > n, regularization is needed.

High-Dimensional Data: When the number of features exceeds the number of samples (p > n), $\mathbf{S}_W$ is singular. Solutions include: (1) regularization, (2) PCA preprocessing to reduce dimensions first, or (3) pseudoinverse.
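The singularity and the regularization fix can be seen directly on a small p > n example. This is a sketch with made-up data; the ridge strength `alpha` is a hypothetical choice that would normally be tuned, for example by cross-validation.

```python
import numpy as np

# Hypothetical p > n setting: 20 samples, 50 features, 2 classes.
rng = np.random.default_rng(3)
n_per_class, p = 10, 50
X = np.vstack([
    rng.normal(0.0, 1.0, size=(n_per_class, p)),
    rng.normal(0.5, 1.0, size=(n_per_class, p)),
])
y = np.repeat([0, 1], n_per_class)

S_W = np.zeros((p, p))
for c in (0, 1):
    centered = X[y == c] - X[y == c].mean(axis=0)
    S_W += centered.T @ centered

# The rank is at most n - K = 18, far below p = 50, so S_W cannot be inverted.
rank = np.linalg.matrix_rank(S_W, tol=1e-8)

# Ridge-style regularization: add a multiple of the identity so that
# every eigenvalue becomes strictly positive.
alpha = 1e-2 * np.trace(S_W) / p      # assumed scale, not a recommended default
S_W_reg = S_W + alpha * np.eye(p)
min_eig_reg = np.linalg.eigvalsh(S_W_reg).min()

print("rank of S_W:", rank, "out of", p)
print("smallest eigenvalue after regularization:", min_eig_reg)
```

In scikit-learn, a comparable effect is available through `LinearDiscriminantAnalysis(solver="lsqr", shrinkage="auto")`, which shrinks the covariance estimate toward a scaled identity.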

In practice, LDA is remarkably robust to moderate violations of these assumptions. It often works well even when data is not perfectly Gaussian, as long as classes are reasonably well-separated.


Real-World Applications

Fisherfaces for Face Recognition

One of LDA's most famous applications is Fisherfaces - a method for face recognition developed by Belhumeur, Hespanha, and Kriegman in 1997.

The problem: Face images are high-dimensional (e.g., 100×100 = 10,000 pixels), but the number of training images per person is small (maybe 5-10). Direct LDA would fail because $\mathbf{S}_W$ is singular.

The Fisherfaces solution:

  1. First, apply PCA to reduce dimensionality to (n - K) dimensions, where $\mathbf{S}_W$ is generally nonsingular
  2. Then, apply LDA in the reduced PCA space to find the most discriminative directions
  3. For recognition, project new faces and find the nearest class centroid

This PCA+LDA pipeline is remarkably effective. Fisherfaces are robust to lighting variations that confuse eigenfaces (PCA alone), because LDA focuses on identity-related variation rather than lighting-related variation.
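The three steps map naturally onto a scikit-learn pipeline. The sketch below uses random vectors as a stand-in for face images (they are not real faces), and the sizes are arbitrary assumptions; it only illustrates the shape of the pipeline, not the accuracy of Fisherfaces on real data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import NearestCentroid
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for face images: each "person" is a random template plus noise.
rng = np.random.default_rng(4)
n_people, imgs_per_person, n_pixels = 10, 6, 400
templates = rng.normal(size=(n_people, n_pixels))
X = np.vstack([
    templates[k] + rng.normal(size=(imgs_per_person, n_pixels))
    for k in range(n_people)
])
y = np.repeat(np.arange(n_people), imgs_per_person)

n, K = len(y), n_people
fisherfaces = make_pipeline(
    PCA(n_components=n - K),                          # step 1: reduce to n - K dimensions
    LinearDiscriminantAnalysis(n_components=K - 1),   # step 2: LDA in the PCA space
    NearestCentroid(),                                # step 3: nearest class centroid
)
fisherfaces.fit(X, y)

# Project the training images into the (K - 1)-dimensional discriminant space.
Z = fisherfaces[:-1].transform(X)
train_accuracy = fisherfaces.score(X, y)
print("projected shape:", Z.shape)
print(f"training accuracy: {train_accuracy:.3f}")
```

Training accuracy is an optimistic figure here because so few images per class are available; a held-out set is needed for any real evaluation.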

Bioinformatics and Gene Expression

In genomics, researchers often measure expression of 20,000+ genes but have only 100-200 patient samples. Tasks include:

  • Cancer subtyping: Classify tumors into molecular subtypes based on gene expression
  • Treatment response prediction: Predict which patients will respond to therapy
  • Biomarker discovery: Identify genes that best distinguish disease states

LDA (with regularization or PCA preprocessing) excels here because:

  • It uses class labels to find disease-relevant gene combinations
  • The discriminant loadings identify which genes contribute most to classification
  • Unlike pure prediction models (SVM, neural nets), LDA provides interpretable directions

Deep Learning Connections

LDA's principles - maximize between-class variance, minimize within-class variance - directly inspire modern deep learning techniques.

Embedding Spaces and Metric Learning

Deep learning models often learn embedding spaces where similar items are close and dissimilar items are far. This is exactly LDA's goal, but learned through neural networks rather than linear algebra.

Key examples:

  • Face embeddings: FaceNet, ArcFace learn 128-512 dimensional representations where same-person faces cluster together
  • Sentence embeddings: BERT, Sentence-BERT produce representations where semantically similar sentences are close
  • Image embeddings: CLIP learns joint image-text embeddings with high between-concept separability

These learned embeddings solve the same problem as LDA - just with nonlinear transformations and much more data.

Center Loss and Intra-Class Compactness

Center loss (introduced by Wen et al., 2016) explicitly encourages LDA-like properties in neural network embeddings:

$$\mathcal{L}_{center} = \frac{1}{2} \sum_{i=1}^{n} \| \mathbf{f}_i - \mathbf{c}_{y_i} \|^2$$

Where $\mathbf{f}_i$ is the embedding of sample i and $\mathbf{c}_{y_i}$ is the learned center for class $y_i$.

This directly minimizes within-class scatter - exactly like the $\mathbf{S}_W$ term in LDA! Combined with softmax cross-entropy (which increases between-class separation), this creates LDA-like embeddings through gradient descent.
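The link to within-class scatter can be made exact: when the centers equal the class means, the center loss is half the trace of $\mathbf{S}_W$ computed on the embeddings. The sketch below checks this in NumPy with random vectors standing in for network embeddings (an assumption; no network is trained here).

```python
import numpy as np

def center_loss(features, labels, centers):
    """0.5 * sum_i ||f_i - c_{y_i}||^2, with integer labels indexing rows of `centers`."""
    diffs = features - centers[labels]
    return 0.5 * np.sum(diffs ** 2)

# Random vectors standing in for embeddings of 90 samples from 3 classes.
rng = np.random.default_rng(7)
features = rng.normal(size=(90, 8))
labels = np.repeat([0, 1, 2], 30)

# Within-class scatter of the embeddings around their class means.
class_means = np.stack([features[labels == k].mean(axis=0) for k in range(3)])
S_W = np.zeros((8, 8))
for k in range(3):
    centered = features[labels == k] - class_means[k]
    S_W += centered.T @ centered

loss_at_means = center_loss(features, labels, class_means)
loss_elsewhere = center_loss(features, labels, class_means + 0.1)   # shifted centers

print("loss equals 0.5 * trace(S_W):", np.isclose(loss_at_means, 0.5 * np.trace(S_W)))
print("class means give the smaller loss:", loss_at_means < loss_elsewhere)
```

In training, the centers are updated alongside the network weights, so the loss pulls embeddings and centers toward each other.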

Contrastive and Triplet Loss

Contrastive loss and triplet loss operationalize Fisher's criterion in a sampling framework:

Triplet loss:

$$\mathcal{L}_{triplet} = \max(0, \| \mathbf{f}_a - \mathbf{f}_p \|^2 - \| \mathbf{f}_a - \mathbf{f}_n \|^2 + \alpha)$$

Where (anchor, positive, negative) are a triplet with anchor and positive from the same class, negative from a different class. This directly implements:

  • Minimize distance to same-class samples (positive) → reduce within-class scatter
  • Maximize distance to different-class samples (negative) → increase between-class scatter
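A direct NumPy transcription of the triplet loss shows how the margin behaves. The embeddings and the margin value below are made-up numbers for illustration.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """max(0, ||f_a - f_p||^2 - ||f_a - f_n||^2 + margin), computed per triplet."""
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)   # squared distance to same-class sample
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)   # squared distance to other-class sample
    return np.maximum(0.0, d_pos - d_neg + margin)

# Two hypothetical triplets sharing the same anchor and positive.
anchor = np.array([[0.0, 0.0], [0.0, 0.0]])
positive = np.array([[0.1, 0.0], [0.1, 0.0]])
negative = np.array([[3.0, 0.0],     # easy negative: already far away
                     [0.5, 0.0]])    # hard negative: closer than the margin allows

losses = triplet_loss(anchor, positive, negative, margin=1.0)
print(losses)
```

The first triplet contributes zero loss because the negative is already farther than the positive by more than the margin; only the second triplet produces a gradient signal.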

The Deep Learning Perspective: Under Fisher's criterion, LDA is the optimal linear solution to the embedding problem. Modern contrastive learning pursues the same goal with nonlinear maps. If a linear projection separates your classes well, LDA gives the answer in closed form. If not, neural networks can learn nonlinear embeddings that achieve better separation.

| Method | Era | Approach | Transformation |
|---|---|---|---|
| Fisher's LDA | 1936 | Eigenvalue problem | Linear |
| Center Loss | 2016 | Gradient descent + centers | Deep neural network |
| Triplet Loss | 2015 | Sampling-based optimization | Deep neural network |
| Contrastive Learning | 2020+ | Self-supervised + InfoNCE | Deep neural network |

Python Implementation

LDA from Scratch

Let's implement LDA from first principles to understand every computational step.

Complete LDA Implementation from Scratch

Notes on the listing (lda_from_scratch.py), in the order the steps appear in the code:

  • Import NumPy: NumPy provides efficient array operations for computing scatter matrices and solving eigenvalue problems.
  • SciPy for eigendecomposition: scipy.linalg.eigh() solves the symmetric generalized eigenvalue problem Sb*w = λ*Sw*w directly, which is more stable than forming inv(Sw) @ Sb with NumPy.
  • LDA class definition: LDA is a supervised dimensionality reduction technique. Unlike PCA, it uses class labels to find optimal projections.
  • Component limit: LDA can produce at most (n_classes - 1) meaningful components. For binary classification, that is only 1 component. Example: 3 classes → max 2 LDA components; 10 classes → max 9 components.
  • Scalings (discriminants): scalings_ stores the discriminant directions - the projection vectors that maximize class separability.
  • Class means: means_ stores the centroid of each class. These are needed for the between-class scatter.
  • Class priors: priors_ stores P(class), the prior probability of each class. Used for classification decisions.
  • Compute class statistics: First, we compute the mean and count for each class. This is the foundation for both scatter matrices.
  • Class priors from data: Prior probabilities are estimated from training data frequencies (this implementation always estimates them from the data).
  • Within-class scatter matrix: Sw measures the spread of points within each class. Formula: Sw = Σ_c (X_c - μ_c)^T(X_c - μ_c). We want this to be small. Example: low Sw means compact clusters around each class mean.
  • Between-class scatter matrix: Sb measures the spread between class means. Formula: Sb = Σ_c n_c(μ_c - μ)(μ_c - μ)^T. We want this to be large. Example: high Sb means class means are far from the overall mean.
  • Generalized eigenvalue problem: LDA solves Sb*w = λ*Sw*w. This finds directions w that maximize J(w) = w^T*Sb*w / w^T*Sw*w (Fisher criterion).
  • Regularization: Add a small value to the diagonal of Sw for numerical stability. Without this, Sw might be singular (non-invertible) with small datasets.
  • Solve with scipy.linalg.eigh: eigh(Sb, Sw) solves the generalized eigenvalue problem. It returns eigenvalues (separability measures) and eigenvectors (directions).
  • Sort by eigenvalue: Larger eigenvalues correspond to more discriminative directions. The first component gives the best class separation.
  • Maximum components limit: Sb has rank at most (n_classes - 1), so only that many non-zero eigenvalues exist. This is LDA's fundamental constraint.
  • Transform method: Projects data onto discriminant components. It centers using the training mean (never the test mean) to avoid data leakage.
  • LDA as classifier: LDA can classify directly using Gaussian class-conditional densities with shared covariance. This gives linear decision boundaries.
  • Discriminant score: For each class, the score decreases with the squared distance to the projected class mean and increases with log(prior). A higher score means the sample is more likely to belong to that class.
  • Iris dataset example: Classic dataset with 3 classes (iris species) and 4 features (petal/sepal dimensions). A convenient dataset for demonstrating LDA.
  • n_components=2: With 3 classes, we can have at most 2 discriminant components. This allows 2D visualization of the separated classes.
  • Explained variance ratio: Shows how much discriminative information each component captures. The first component usually captures most of it.
import numpy as np
from typing import Optional
from scipy import linalg


class LinearDiscriminantAnalysis:
    """
    Linear Discriminant Analysis from scratch.

    LDA finds directions that maximize the ratio of between-class
    variance to within-class variance, optimizing for class separability
    rather than total variance (like PCA).
    """

    def __init__(self, n_components: Optional[int] = None):
        """
        Initialize LDA.

        Args:
            n_components: Number of discriminant components to keep.
                         Maximum is min(n_features, n_classes - 1).
        """
        self.n_components = n_components
        self.scalings_ = None      # Projection vectors (discriminants)
        self.means_ = None         # Class means
        self.priors_ = None        # Class prior probabilities
        self.classes_ = None       # Unique class labels
        self.xbar_ = None          # Overall mean
        self.explained_variance_ratio_ = None
        self._dof = None           # n_samples - n_classes, set in fit()

    def fit(self, X: np.ndarray, y: np.ndarray) -> 'LinearDiscriminantAnalysis':
        """
        Fit LDA on training data with class labels.

        Args:
            X: Data matrix of shape (n_samples, n_features)
            y: Class labels of shape (n_samples,)

        Returns:
            self: Fitted LDA instance
        """
        n_samples, n_features = X.shape
        self.classes_ = np.unique(y)
        n_classes = len(self.classes_)
        self._dof = n_samples - n_classes

        # Step 1: Compute class statistics
        self.means_ = {}
        class_counts = {}

        for c in self.classes_:
            mask = (y == c)
            self.means_[c] = np.mean(X[mask], axis=0)
            class_counts[c] = np.sum(mask)

        # Compute overall mean
        self.xbar_ = np.mean(X, axis=0)

        # Compute class priors
        self.priors_ = {c: count / n_samples
                        for c, count in class_counts.items()}

        # Step 2: Compute within-class scatter matrix Sw
        # Sw = sum over classes of (X_c - mean_c)^T @ (X_c - mean_c)
        Sw = np.zeros((n_features, n_features))

        for c in self.classes_:
            mask = (y == c)
            X_c = X[mask] - self.means_[c]
            Sw += X_c.T @ X_c

        # Step 3: Compute between-class scatter matrix Sb
        # Sb = sum over classes of n_c * (mean_c - xbar) @ (mean_c - xbar)^T
        Sb = np.zeros((n_features, n_features))

        for c in self.classes_:
            n_c = class_counts[c]
            mean_diff = (self.means_[c] - self.xbar_).reshape(-1, 1)
            Sb += n_c * (mean_diff @ mean_diff.T)

        # Step 4: Solve generalized eigenvalue problem
        # Sb @ w = lambda * Sw @ w
        # Equivalent to: Sw^(-1) @ Sb @ w = lambda * w

        # Add regularization for numerical stability
        Sw += np.eye(n_features) * 1e-6

        # scipy solves the symmetric generalized problem directly and
        # normalizes the eigenvectors so that w^T @ Sw @ w = 1
        eigenvalues, eigenvectors = linalg.eigh(Sb, Sw)

        # Step 5: Sort by eigenvalue (descending)
        idx = np.argsort(eigenvalues)[::-1]
        eigenvalues = eigenvalues[idx]
        eigenvectors = eigenvectors[:, idx]

        # Step 6: Select top k components
        # Maximum meaningful components = min(n_features, n_classes - 1)
        max_components = min(n_features, n_classes - 1)

        if self.n_components is None:
            self.n_components = max_components
        else:
            self.n_components = min(self.n_components, max_components)

        self.scalings_ = eigenvectors[:, :self.n_components]
        self.explained_variance_ratio_ = (
            eigenvalues[:self.n_components] /
            np.sum(eigenvalues[:max_components])
        )

        return self

    def transform(self, X: np.ndarray) -> np.ndarray:
        """
        Project data onto discriminant components.

        Args:
            X: Data to transform of shape (n_samples, n_features)

        Returns:
            X_transformed: Projected data of shape (n_samples, n_components)
        """
        # Center using training mean
        X_centered = X - self.xbar_

        # Project onto discriminant components
        return X_centered @ self.scalings_

    def fit_transform(self, X: np.ndarray, y: np.ndarray) -> np.ndarray:
        """Fit LDA and transform data in one step."""
        return self.fit(X, y).transform(X)

    def predict(self, X: np.ndarray) -> np.ndarray:
        """
        Predict class labels using LDA as a classifier.

        Uses Gaussian class-conditional densities with
        shared covariance (linear decision boundaries).

        Args:
            X: Data to classify of shape (n_samples, n_features)

        Returns:
            y_pred: Predicted class labels
        """
        # Simplified version: work in the discriminant space instead of
        # evaluating x @ Sigma^(-1) @ mean_c in the original space.
        # Because eigh scales the directions so that w^T @ Sw @ w = 1,
        # the pooled covariance Sw / (n - K) becomes I / (n - K) after
        # projection, so the Gaussian log-density is
        # -0.5 * (n - K) * squared Euclidean distance (up to a constant).
        X_proj = self.transform(X)  # project once, reuse for every class
        scores = np.zeros((X.shape[0], len(self.classes_)))

        for i, c in enumerate(self.classes_):
            mean_proj = self.transform(self.means_[c].reshape(1, -1))

            # Scaled negative squared distance + log prior
            dist_sq = np.sum((X_proj - mean_proj) ** 2, axis=1)
            scores[:, i] = -0.5 * self._dof * dist_sq + np.log(self.priors_[c])

        # Predict class with highest score
        return self.classes_[np.argmax(scores, axis=1)]


# Example Usage
if __name__ == "__main__":
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    # Load iris dataset (3 classes, 4 features)
    iris = load_iris()
    X, y = iris.data, iris.target

    # Split data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42
    )

    # Fit LDA
    lda = LinearDiscriminantAnalysis(n_components=2)
    X_train_lda = lda.fit_transform(X_train, y_train)
    X_test_lda = lda.transform(X_test)

    print("Original data shape:", X_train.shape)
    print("Transformed data shape:", X_train_lda.shape)
    print("\nExplained variance ratio:", lda.explained_variance_ratio_)
    print("\nDiscriminant directions (first 2 features):")
    print(lda.scalings_[:2, :])

    # Classification accuracy
    y_pred = lda.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    print(f"\nClassification accuracy: {acc:.2%}")

Now let's see LDA vs PCA in action on data designed to highlight their differences:

LDA vs PCA: Practical Comparison

Notes on the listing (lda_vs_pca_comparison.py), in the order the steps appear in the code:

  • Synthetic data with structure: We create data where classes are separated along one diagonal but elongated along the other. This highlights the PCA vs LDA difference.
  • Elongated covariance: The covariance matrix creates elliptical clusters. PCA will find the long axis of the ellipse, but LDA finds the direction separating classes. Example: [[3, 2.5], [2.5, 3]] creates 45-degree elongated ellipses.
  • Label array: Class labels are essential for LDA. PCA ignores them entirely - that's the fundamental difference.
  • LDA requires labels: LDA.fit_transform(X, y) needs both data and labels. PCA.fit_transform(X) only needs data. LDA is supervised.
  • Visualize original data: Plotting both classes shows their overlap and orientation. The ellipses are tilted - this is where PCA gets 'distracted'.
  • PCA direction (green): PCA finds the direction of maximum total variance. This often aligns with the elongation of the data cloud.
  • LDA direction (orange): LDA finds the direction of maximum class separability. It is roughly perpendicular to the PCA direction here.
  • PCA projection histogram: When projected onto PC1, classes overlap significantly because PCA found variance, not separation.
  • LDA projection histogram: When projected onto LD1, classes separate cleanly because LDA optimized for this exact goal.
  • Angle between directions: Computing the angle shows how different the methods are. In this example, they can be nearly perpendicular.
  • Fisher criterion comparison: On the training data, the Fisher criterion (between/within variance ratio) of the LDA projection is at least as high as that of the PCA projection, because LDA maximizes exactly this ratio.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.decomposition import PCA

# Generate synthetic data with clear class structure.
# Classes are elongated in one direction (to show the PCA vs LDA difference).
np.random.seed(42)

# Two classes with different means but the same covariance.
# The means differ across the elongated direction. Keep the separation modest:
# if the means are pushed much further apart, between-class variance dominates
# the total variance and PCA picks the same direction as LDA.
n_samples = 200
mean1 = np.array([-1, 1])
mean2 = np.array([1, -1])

# Covariance: elongated along the 45-degree diagonal
cov = np.array([[3, 2.5],
                [2.5, 3]])

class1 = np.random.multivariate_normal(mean1, cov, n_samples // 2)
class2 = np.random.multivariate_normal(mean2, cov, n_samples // 2)

X = np.vstack([class1, class2])
y = np.array([0] * (n_samples // 2) + [1] * (n_samples // 2))

# Fit PCA and LDA
pca = PCA(n_components=1)
lda = LinearDiscriminantAnalysis(n_components=1)

X_pca = pca.fit_transform(X).ravel()
X_lda = lda.fit_transform(X, y).ravel()  # Note: LDA needs labels!

# Unit-length projection directions
pca_dir = pca.components_[0]
lda_dir = lda.scalings_[:, 0] / np.linalg.norm(lda.scalings_[:, 0])

# Visualize the difference
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Original data
ax = axes[0]
ax.scatter(X[y == 0, 0], X[y == 0, 1], c='blue', alpha=0.6, label='Class 0')
ax.scatter(X[y == 1, 0], X[y == 1, 1], c='red', alpha=0.6, label='Class 1')

# Draw PCA direction
ax.arrow(0, 0, pca_dir[0] * 3, pca_dir[1] * 3,
         head_width=0.2, color='green', linewidth=2, label='PCA')

# Draw LDA direction
ax.arrow(0, 0, lda_dir[0] * 3, lda_dir[1] * 3,
         head_width=0.2, color='orange', linewidth=2, label='LDA')

ax.set_xlabel('Feature 1')
ax.set_ylabel('Feature 2')
ax.set_title('Original 2D Data with Projection Directions')
ax.legend()
ax.set_aspect('equal')
ax.grid(True, alpha=0.3)

# PCA projection
ax = axes[1]
ax.hist(X_pca[y == 0], bins=20, alpha=0.6, color='blue', label='Class 0')
ax.hist(X_pca[y == 1], bins=20, alpha=0.6, color='red', label='Class 1')
ax.set_xlabel('PCA Component 1')
ax.set_title('PCA Projection (Unsupervised)')
ax.legend()

# LDA projection
ax = axes[2]
ax.hist(X_lda[y == 0], bins=20, alpha=0.6, color='blue', label='Class 0')
ax.hist(X_lda[y == 1], bins=20, alpha=0.6, color='red', label='Class 1')
ax.set_xlabel('LDA Component 1')
ax.set_title('LDA Projection (Supervised)')
ax.legend()

plt.tight_layout()
plt.show()

# Print analysis. The sign of a direction is arbitrary, so compare |cos|;
# clip guards against rounding slightly above 1 before arccos.
cos_angle = np.clip(np.abs(pca_dir @ lda_dir), 0.0, 1.0)
print("PCA direction:", pca_dir)
print("LDA direction:", lda_dir)
print("\nAngle between PCA and LDA:", np.degrees(np.arccos(cos_angle)))

# Calculate class separability after projection
def fisher_criterion(X_proj, y):
    """Squared distance between class means over summed class variances (1D)."""
    class0 = X_proj[y == 0]
    class1 = X_proj[y == 1]
    between = (np.mean(class0) - np.mean(class1)) ** 2
    within = np.var(class0) + np.var(class1)
    return between / within

print(f"\nFisher criterion after PCA: {fisher_criterion(X_pca, y):.3f}")
print(f"Fisher criterion after LDA: {fisher_criterion(X_lda, y):.3f}")
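
One way to see why no other projection beats LDA on this ratio is to sweep over every direction in the plane and evaluate the criterion for each. The following is a minimal sketch under stated assumptions: the data are freshly generated two-class Gaussians (the means, sample size, and seed are illustrative choices), and `fisher_ratio` is a hypothetical helper written for this check, not a library function.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
cov = np.array([[3.0, 2.5], [2.5, 3.0]])   # elongated along the diagonal
X = np.vstack([
    rng.multivariate_normal([-1.0, 1.0], cov, 200),
    rng.multivariate_normal([1.0, -1.0], cov, 200),
])
y = np.repeat([0, 1], 200)

def fisher_ratio(w, X, y):
    """Between-class over within-class variance of the 1D projection X @ w."""
    z0, z1 = X[y == 0] @ w, X[y == 1] @ w
    return (z0.mean() - z1.mean()) ** 2 / (z0.var() + z1.var())

# Evaluate the ratio for unit vectors at angles from 0 to 180 degrees
angles = np.linspace(0.0, np.pi, 721)
dirs = np.column_stack([np.cos(angles), np.sin(angles)])
ratios = np.array([fisher_ratio(w, X, y) for w in dirs])
best_dir = dirs[np.argmax(ratios)]

# LDA direction from scikit-learn, normalized to unit length
lda = LinearDiscriminantAnalysis().fit(X, y)
lda_dir = lda.scalings_[:, 0] / np.linalg.norm(lda.scalings_[:, 0])
lda_ratio = fisher_ratio(lda_dir, X, y)

print(f"best ratio on the sweep:   {ratios.max():.3f}")
print(f"ratio along LDA direction: {lda_ratio:.3f}")
print(f"|cos| between them:        {abs(best_dir @ lda_dir):.4f}")
```

The sweep is a brute-force check that only works in two dimensions; its best direction should agree with the LDA direction up to the grid resolution and an arbitrary sign.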

Practical Tips and Gotchas

  1. Always check the maximum components limit. With K classes and p features, you get at most min(K-1, p) discriminant directions. Binary classification therefore yields a single component.
  2. Handle high-dimensional data carefully. If p > n-K, the within-class scatter matrix is singular, so use regularized (shrinkage) LDA or apply PCA first to reduce to at most n-K dimensions.
  3. Consider class imbalance. LDA weights classes by their size when computing scatter matrices. Highly imbalanced data can bias results toward the majority class.
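  4. Standardize features when you regularize. Unlike PCA, plain LDA is invariant to rescaling the features: the discriminant coefficients change, but the projections and predictions do not. Standardizing still matters when you add shrinkage or a PCA step before LDA, and it makes the coefficients comparable across features.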
  5. Check the Gaussian assumption. LDA works best with roughly Gaussian classes. For heavy-tailed or multimodal distributions, consider other methods.
  6. Use LDA for interpretability. Unlike black-box classifiers, LDA discriminants are linear combinations of features - you can interpret which features contribute most.
  7. Compare with logistic regression. For pure classification, logistic regression makes fewer assumptions and often performs similarly. LDA shines when you need the low-dimensional representation.
  8. Combine with PCA strategically. PCA+LDA (Fisherfaces approach) is powerful: PCA handles singularity and removes noise, LDA maximizes class separation in the cleaned space.
When to Choose LDA over PCA: If you have class labels and your goal is classification or class-based visualization, LDA is usually the better choice. The main exception is when classes are defined arbitrarily and don't reflect true data structure.
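
Tips 1 and 2 can be tried directly in scikit-learn. The sketch below is illustrative only: the data are random synthetic features (the sizes and the `shift` that separates the classes are arbitrary assumptions), and it regularizes with `solver="eigen"` and `shrinkage="auto"` (Ledoit-Wolf shrinkage), which is one option among several rather than a recommended default.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
n_classes, n_per_class, n_features = 3, 10, 50   # p = 50 > n = 30 (hypothetical sizes)

# Random features; each class is shifted along its own coordinate axis
y = np.repeat(np.arange(n_classes), n_per_class)
X = rng.normal(size=(n_classes * n_per_class, n_features))
shift = 3.0
X[np.arange(len(y)), y] += shift

# Shrinkage keeps the within-class covariance estimate invertible when p > n
lda = LinearDiscriminantAnalysis(solver="eigen", shrinkage="auto")
Z = lda.fit_transform(X, y)
print(Z.shape)   # K - 1 = 2 columns, however many features went in

# Asking for more than K - 1 components is rejected at fit time
try:
    LinearDiscriminantAnalysis(n_components=n_classes).fit(X, y)
    too_many_ok = True
except ValueError as err:
    too_many_ok = False
    print("ValueError:", err)
```

The `"eigen"` solver is used here because it supports both shrinkage and `transform`; the `"lsqr"` solver also accepts shrinkage but only classifies.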

Practice Problems

  1. Conceptual: Explain why LDA can produce at most K-1 components for K classes. Hint: What is the rank of the between-class scatter matrix?
  2. Mathematical: Derive the Fisher Linear Discriminant for binary classification. Show that the optimal direction is $\mathbf{w}^* = \mathbf{S}_W^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$.
  3. Comparison: Create a 2D dataset where PCA and LDA find similar directions, and another where they are nearly perpendicular. What determines when they agree?
  4. Implementation: Implement LDA from scratch and verify your results match sklearn.discriminant_analysis.LinearDiscriminantAnalysis on the Iris dataset.
  5. Classification: Compare LDA, logistic regression, and QDA on a dataset with unequal class covariances. Which method handles this best?
  6. High-Dimensional: Apply LDA to a text classification problem with TF-IDF features. How do you handle the fact that p > n?
  7. Deep Learning Connection: Implement center loss and show that minimizing it reduces within-class scatter in the embedding space.
  8. Fisherfaces: Implement the PCA+LDA pipeline for face recognition. Compare with PCA alone (eigenfaces) on a face dataset.

Key Insights

  • LDA is supervised dimensionality reduction. Unlike PCA, it uses class labels to find directions that maximize class separability, not just total variance.
  • Fisher's criterion is the key: Maximize between-class variance / within-class variance. This ratio is scale-invariant and balances separation with compactness.
  • LDA solves a generalized eigenvalue problem: $\mathbf{S}_B \mathbf{w} = \lambda \mathbf{S}_W \mathbf{w}$. Eigenvectors are discriminant directions, eigenvalues measure discriminative power.
  • Maximum K-1 components for K classes. This fundamental limit comes from the rank of $\mathbf{S}_B$. Binary classification yields only 1 LDA component.
  • LDA is also a classifier. Under Gaussian assumptions with shared covariance, the Bayes-optimal decision boundaries are linear, and LDA estimates them.
  • LDA principles echo through modern deep learning. Center loss, triplet loss, and contrastive learning all pursue Fisher's goal: compact classes, well-separated means.
  • PCA + LDA is a powerful combination. Use PCA to handle singularity and noise, then LDA for class separation (the Fisherfaces approach).
  • LDA assumes Gaussian, equal-covariance classes. When these assumptions fail badly, consider QDA, logistic regression, or nonlinear methods.
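
The generalized eigenvalue problem and the K-1 limit from the list above can be seen together in a few lines. This is a minimal sketch, assuming the Iris data bundled with scikit-learn and SciPy's symmetric generalized solver `scipy.linalg.eigh`; the variable names are illustrative.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
classes = np.unique(y)
overall_mean = X.mean(axis=0)
p = X.shape[1]

S_W = np.zeros((p, p))   # within-class scatter
S_B = np.zeros((p, p))   # between-class scatter
for c in classes:
    Xc = X[y == c]
    mu_c = Xc.mean(axis=0)
    S_W += (Xc - mu_c).T @ (Xc - mu_c)
    d = (mu_c - overall_mean)[:, None]
    S_B += len(Xc) * (d @ d.T)

# Solve S_B w = lambda S_W w; eigh returns eigenvalues in ascending order
eigvals, eigvecs = eigh(S_B, S_W)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

print("eigenvalues:", np.round(eigvals, 4))
print("rank of S_B:", np.linalg.matrix_rank(S_B))   # at most K - 1

W = eigvecs[:, : len(classes) - 1]   # keep the K - 1 discriminant directions
Z = X @ W                            # projected data
```

With three classes and four features, only the first two eigenvalues should be meaningfully above zero, so the remaining directions carry no discriminative information.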
The Essence of LDA: If you know what you're trying to distinguish (class labels), use that information! PCA ignores labels and finds general patterns. LDA focuses on the patterns that matter for your classification task. This is the fundamental insight of supervised learning: task-specific representations outperform generic ones.