Learning Objectives
By the end of this section, you will master Linear Discriminant Analysis (LDA) - the supervised counterpart to PCA that finds optimal projections for classification. You will:
- Understand the fundamental difference between supervised (LDA) and unsupervised (PCA) dimensionality reduction - when and why class labels matter.
- Master the mathematical formulation: within-class scatter, between-class scatter, Fisher's criterion, and the generalized eigenvalue problem.
- Visualize and build intuition for why maximizing the ratio of between-class to within-class variance leads to optimal class separation.
- Implement LDA from scratch in Python and understand each computational step from class statistics to projection.
- Use LDA as a classifier and understand its connection to Gaussian discriminant analysis and linear decision boundaries.
- Connect LDA to modern deep learning: embedding spaces, center loss, contrastive learning, and metric learning.
- Apply LDA to real-world problems: face recognition (Fisherfaces), bioinformatics, and feature extraction for classification.
Why This Matters: LDA is one of the most important techniques for supervised dimensionality reduction and classification. It introduces the crucial insight that the best representation depends on your task - for classification, you should maximize class separability, not just total variance. This principle underlies modern deep learning approaches like contrastive learning, center loss, and learned embeddings.
The Big Picture
Historical Context
Linear Discriminant Analysis was developed by Sir Ronald A. Fisher in 1936, published in his landmark paper "The Use of Multiple Measurements in Taxonomic Problems." Fisher was working on a practical problem: given measurements of iris flowers (sepal length, sepal width, petal length, petal width), how can we best distinguish between species?
Fisher's key insight was brilliant in its simplicity: the best way to separate classes is to find a direction that maximizes the distance between class means while minimizing the spread within each class. This ratio - between-class variance divided by within-class variance - became known as Fisher's criterion or the Fisher ratio.
What makes Fisher's work remarkable is that he solved this problem decades before computers could efficiently perform eigendecomposition. His mathematical formulation remains the standard today, used in everything from face recognition to gene expression analysis.
Supervised vs Unsupervised Dimensionality Reduction
To understand LDA's value, we must first appreciate the fundamental difference between supervised and unsupervised dimensionality reduction:
| Aspect | PCA (Unsupervised) | LDA (Supervised) |
|---|---|---|
| Input | Data matrix X only | Data matrix X + labels y |
| Objective | Maximize total variance | Maximize class separability |
| What it finds | Directions of maximum spread | Directions that separate classes |
| Max components | min(n_samples, n_features) | min(n_features, n_classes - 1) |
| Best for | Compression, visualization, noise reduction | Classification preprocessing, feature extraction |
| Ignores | Class structure entirely | Variance that does not help separate classes |
The Core Difference: PCA asks "Where does the data vary most?" LDA asks "Where do the classes differ most?" These are fundamentally different questions with different optimal answers.
This distinction is crucial in machine learning. If your goal is classification, you should use the labels you have! PCA might find directions where the data varies a lot, but that variation might be within classes (noise) rather than between classes (signal).
The Core Intuition
Geometric View
Imagine you have two groups of people - basketball players and jockeys - and you want to classify new individuals. You measure many features: height, weight, arm span, shoe size, etc.
PCA's approach: Find the direction where people vary most. This might be dominated by weight variation, since weight varies more than height in absolute terms. But weight alone might not separate the groups well - there are heavy basketball players and light jockeys, and vice versa.
LDA's approach: Find the direction where basketball players and jockeys differ most, while being most similar within each group. This might be a combination of height and weight that separates the two groups far more cleanly than either feature alone.
The key geometric insight is that LDA finds a projection where:
- Class means are as far apart as possible
- Points within each class are as close together as possible
This creates the ideal situation for classification: well-separated, compact clusters.
PCA vs LDA: When Direction Matters
Consider the classic example where PCA and LDA give dramatically different results:
Imagine two elongated elliptical clusters (like cigars) that are separated along their short axis but aligned along their long axis. Think of two cigars lying side by side.
- PCA finds the direction of maximum variance - the long axis of the combined data cloud. Projecting onto this direction, the two classes completely overlap!
- LDA finds the direction of maximum separation - perpendicular to PCA, along the short axis that separates the cigars. Projecting here, classes are perfectly separated.
This is not a contrived example - it happens frequently in real data where within-class variance dominates between-class variance.
Mathematical Formulation
Now let's build LDA rigorously from first principles.
Class Statistics
Given data matrix X with n samples, p features, and K classes, we first compute class-specific statistics:
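Class mean for class k:
$$\mu_k = \frac{1}{n_k} \sum_{i:\, y_i = k} x_i$$
Overall mean:
$$\mu = \frac{1}{n} \sum_{i=1}^{n} x_i = \sum_{k=1}^{K} \pi_k \mu_k$$
Class prior probabilities:
$$\pi_k = \frac{n_k}{n}$$
Where $n_k$ is the number of samples in class k.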
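As a minimal sketch of these class statistics, the snippet below computes class means, the overall mean, and the priors with NumPy. The toy array `X`, the labels `y`, and all variable names are assumptions made for illustration.

```python
import numpy as np

# Toy data (assumed for illustration): 6 samples, 2 features, 2 classes.
X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 4.0],
              [6.0, 5.0], [7.0, 8.0], [8.0, 8.0]])
y = np.array([0, 0, 0, 1, 1, 1])

classes, counts = np.unique(y, return_counts=True)
n = X.shape[0]

# Class means: one row per class.
class_means = np.vstack([X[y == k].mean(axis=0) for k in classes])
# Overall mean of all samples.
overall_mean = X.mean(axis=0)
# Class priors: fraction of samples in each class.
priors = counts / n

print(class_means)   # rows [2, 3] and [7, 7]
print(overall_mean)  # [4.5, 5.0]
print(priors)        # [0.5, 0.5]
```

Note that the overall mean equals the prior-weighted average of the class means, which is why the two can be used interchangeably.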
Scatter Matrices
LDA's optimization requires two key matrices:
Within-class scatter matrix ($S_W$) - measures the spread of points around their respective class means:
$$S_W = \sum_{k=1}^{K} \sum_{i:\, y_i = k} (x_i - \mu_k)(x_i - \mu_k)^T$$
This is the sum of the class covariance matrices, each scaled by its class size. We want to minimize this - compact classes are easier to separate.
Between-class scatter matrix ($S_B$) - measures the spread of class means around the overall mean:
$$S_B = \sum_{k=1}^{K} n_k (\mu_k - \mu)(\mu_k - \mu)^T$$
This measures how far apart the class centroids are. We want to maximize this - well-separated class means make classification easy.
Understanding Scatter Matrices
Within-Class Scatter ($S_W$)
Measures spread of points around their class mean. Minimize this to make classes compact.
Between-Class Scatter ($S_B$)
Measures distance between class means. Maximize this to push classes apart.
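Total scatter satisfies:
$$S_T = S_W + S_B$$
Where $S_T = \sum_{i=1}^{n} (x_i - \mu)(x_i - \mu)^T$ is the total scatter matrix (the covariance matrix up to a factor of $1/n$). This decomposition is fundamental - total variance equals within-class plus between-class variance.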
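The decomposition of total scatter into within-class and between-class parts can be checked numerically. The sketch below assumes a hypothetical helper `scatter_matrices` and synthetic two-class data; neither comes from the original text.

```python
import numpy as np

def scatter_matrices(X, y):
    """Return (S_W, S_B, S_T) for data X (n x p) and integer labels y."""
    mu = X.mean(axis=0)
    p = X.shape[1]
    S_W = np.zeros((p, p))
    S_B = np.zeros((p, p))
    for k in np.unique(y):
        X_k = X[y == k]
        mu_k = X_k.mean(axis=0)
        centered = X_k - mu_k
        S_W += centered.T @ centered              # spread around the class mean
        d = mu_k - mu
        S_B += X_k.shape[0] * np.outer(d, d)      # class mean vs overall mean
    S_T = (X - mu).T @ (X - mu)                   # spread around the overall mean
    return S_W, S_B, S_T

# Synthetic data (assumed): two Gaussian classes of unequal size.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0.0, 0.0], 1.0, size=(40, 2)),
               rng.normal([3.0, 1.0], 1.0, size=(60, 2))])
y = np.array([0] * 40 + [1] * 60)

S_W, S_B, S_T = scatter_matrices(X, y)
print(np.allclose(S_T, S_W + S_B))  # True
```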
Fisher's Criterion
Fisher's brilliant insight was to optimize the ratio of between-class to within-class variance. For a projection direction w, the projected data has:
- Between-class variance: $w^T S_B w$
- Within-class variance: $w^T S_W w$
Fisher's criterion maximizes the ratio:
$$J(w) = \frac{w^T S_B w}{w^T S_W w}$$
This is the generalized Rayleigh quotient of the matrix pencil $(S_B, S_W)$.
Why a Ratio? We can't just maximize between-class variance (we'd get arbitrarily large values by scaling w). We can't just minimize within-class variance (we'd project to zero). The ratio is scale-invariant and balances both objectives.
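To make the scale-invariance concrete, here is a small sketch that evaluates the Fisher ratio (between-class over within-class variance of the projection) for a few directions. The function name `fisher_ratio` and the two hand-picked scatter matrices are hypothetical.

```python
import numpy as np

def fisher_ratio(w, S_B, S_W):
    """Fisher criterion: (w^T S_B w) / (w^T S_W w)."""
    w = np.asarray(w, dtype=float)
    return float(w @ S_B @ w) / float(w @ S_W @ w)

# Hypothetical scatter matrices: class means differ along x only,
# while most within-class spread lies along y.
S_B = np.array([[4.0, 0.0], [0.0, 0.0]])
S_W = np.array([[1.0, 0.0], [0.0, 9.0]])

w_x = np.array([1.0, 0.0])
w_y = np.array([0.0, 1.0])

print(fisher_ratio(w_x, S_B, S_W))       # 4.0
print(fisher_ratio(w_y, S_B, S_W))       # 0.0
print(fisher_ratio(10 * w_x, S_B, S_W))  # 4.0, rescaling w changes nothing
```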
1D Projection: Class Separability
LDA finds the projection direction that maximizes the Fisher criterion: maximize between-class variance while minimizing within-class variance.
The Generalized Eigenvalue Problem
Taking the derivative of $J(w)$ with respect to w and setting it to zero leads to the generalized eigenvalue problem:
$$S_B w = \lambda S_W w$$
If $S_W$ is invertible, this becomes a standard eigenvalue problem:
$$S_W^{-1} S_B w = \lambda w$$
The discriminant directions are the eigenvectors of $S_W^{-1} S_B$, sorted by decreasing eigenvalue. Each eigenvalue equals the Fisher ratio $J(w)$ attained by its eigenvector, so larger eigenvalues correspond to more discriminative directions.
The optimal direction (Fisher's Linear Discriminant for two classes) has a simple closed form:
$$w^* \propto S_W^{-1} (\mu_1 - \mu_2)$$
This is the direction from one class mean to the other, transformed by the inverse of the within-class scatter. If classes have spherical covariance, this points along the line joining the means. If classes are elongated, it adjusts to account for within-class correlations.
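For two classes, the leading eigenvector of the within-scatter-inverse times between-scatter matrix and the closed-form direction (inverse within-class scatter applied to the mean difference) should agree up to sign and scale. The sketch below checks this on synthetic data; the data, the seed, and the helper `unit` are assumptions.

```python
import numpy as np

# Synthetic two-class data with a shared, correlated covariance (assumed).
rng = np.random.default_rng(1)
cov = np.array([[3.0, 1.5], [1.5, 1.0]])
X0 = rng.multivariate_normal([0.0, 0.0], cov, size=200)
X1 = rng.multivariate_normal([2.0, 0.0], cov, size=200)

mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
mu = np.vstack([X0, X1]).mean(axis=0)

S_W = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)
S_B = (len(X0) * np.outer(mu0 - mu, mu0 - mu)
       + len(X1) * np.outer(mu1 - mu, mu1 - mu))

# Route 1: leading eigenvector of S_W^{-1} S_B.
eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
top = np.argmax(eigvals.real)
w_eig = eigvecs[:, top].real

# Route 2: closed form for two classes, S_W^{-1} (mu1 - mu0).
w_closed = np.linalg.solve(S_W, mu1 - mu0)

def unit(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)

# Absolute cosine of the angle between the two directions.
cosine = abs(unit(w_eig) @ unit(w_closed))
print(cosine)  # approximately 1: same direction up to sign
```

`np.linalg.solve` is used instead of forming an explicit inverse, which is usually the more numerically stable choice.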
Multi-Class LDA
For K > 2 classes, we solve the same generalized eigenvalue problem but extract multiple discriminant directions. A critical constraint emerges:
$$\text{number of components} \le \operatorname{rank}(S_B) \le \min(p, K - 1)$$
For binary classification (K=2), there is only one LDA component! This is a fundamental limitation compared to PCA.
| Number of Classes (K) | Max LDA Components (if p ≥ K - 1) | Max Rank of S_B |
|---|---|---|
| 2 (binary) | 1 | 1 |
| 3 | 2 | 2 |
| 10 | 9 | 9 |
| 100 | 99 | 99 |
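The rank limit in the table can be observed directly. This sketch builds a between-class scatter matrix from randomly drawn class means in 10 dimensions and reports its numerical rank; the class means, class sizes, and the tolerance `1e-8` are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
p = 10  # number of features (assumed)
ranks = {}

for K in (2, 3, 5):
    means = rng.normal(size=(K, p))            # hypothetical class means
    counts = rng.integers(5, 20, size=K)       # hypothetical class sizes
    mu = (counts[:, None] * means).sum(axis=0) / counts.sum()
    D = means - mu                             # deviations of class means
    S_B = (D * counts[:, None]).T @ D          # sum_k n_k d_k d_k^T
    ranks[K] = int(np.linalg.matrix_rank(S_B, tol=1e-8))

print(ranks)  # {2: 1, 3: 2, 5: 4}
```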
Interactive LDA Explorer
[Interactive visualization: LDA vs PCA explorer, with controls for the distance between class centers, the within-class variance, and the rotation of the class configuration.]
What to Observe:
- LDA (orange) maximizes class separation
- PCA (green) maximizes total variance
- Rotate to 90° to see them differ dramatically
- Increase spread to see LDA's advantage
Key Experiment: Set the rotation angle to make the classes align diagonally, then increase class spread. Watch how PCA finds the diagonal (maximum variance) while LDA finds the perpendicular direction (maximum separation). This is the fundamental insight of supervised dimensionality reduction.
Decision Boundaries
LDA as a Classifier
LDA is not just a dimensionality reduction technique - it's also a powerful classifier. When used for classification, LDA assumes:
- Each class follows a multivariate Gaussian distribution
- All classes have the same covariance matrix (homoscedasticity)
- Observations are independent
Under these assumptions, the optimal decision rule (minimizing classification error) is to assign x to the class with the highest discriminant function:
$$\delta_k(x) = x^T \Sigma^{-1} \mu_k - \frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \log \pi_k$$
Where $\Sigma$ is the common covariance matrix (estimated by the pooled estimate $S_W / (n - K)$).
The decision boundary between classes j and k is where $\delta_j(x) = \delta_k(x)$. Because the quadratic terms $x^T \Sigma^{-1} x$ cancel (same covariance), this is a linear function of x - hence "Linear" Discriminant Analysis.
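A minimal sketch of the shared-covariance decision rule follows. It scores each sample with the linear discriminant x^T Σ^-1 μ_k - ½ μ_k^T Σ^-1 μ_k + log π_k and picks the highest-scoring class. The function name `lda_discriminants` and the hand-set means, covariance, and priors are assumptions for illustration, not fitted values.

```python
import numpy as np

def lda_discriminants(X, means, cov, priors):
    """Linear discriminant scores, one column per class (shared covariance)."""
    A = np.linalg.solve(cov, means.T)     # p x K, column k is cov^{-1} mu_k
    linear = X @ A                        # x^T cov^{-1} mu_k for every sample
    const = -0.5 * np.sum(means.T * A, axis=0) + np.log(priors)
    return linear + const

# Hypothetical parameters: two classes, identity covariance, equal priors.
means = np.array([[0.0, 0.0], [2.0, 2.0]])
cov = np.eye(2)
priors = np.array([0.5, 0.5])

X_new = np.array([[0.2, -0.1], [1.9, 2.3], [0.5, 0.8]])
scores = lda_discriminants(X_new, means, cov, priors)
pred = scores.argmax(axis=1)
print(pred)  # [0 1 0]
```

Because each score is linear in x, the set of points where two scores are equal is a hyperplane.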
Quadratic Discriminant Analysis
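If the equal covariance assumption doesn't hold, we can use Quadratic Discriminant Analysis (QDA), which allows each class to have its own covariance $\Sigma_k$:
$$\delta_k(x) = -\frac{1}{2} \log |\Sigma_k| - \frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) + \log \pi_k$$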
QDA produces quadratic (curved) decision boundaries. It's more flexible but requires more data to estimate class-specific covariances reliably.
| Method | Covariance Assumption | Decision Boundary | Parameters to Estimate |
|---|---|---|---|
| LDA | Shared across classes | Linear (hyperplane) | O(p²) for covariance |
| QDA | Different per class | Quadratic (curved) | O(K × p²) for covariances |
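To illustrate the difference summarized in the table, the sketch below scores samples with per-class covariances, using the standard Gaussian log-posterior form (up to a constant). The two classes share a mean but differ in spread, a case no linear boundary can handle. The function name `qda_discriminants` and all parameter values are assumptions.

```python
import numpy as np

def qda_discriminants(X, means, covs, priors):
    """Quadratic discriminant scores, one column per class."""
    scores = np.empty((X.shape[0], len(means)))
    for k, (mu_k, cov_k, pi_k) in enumerate(zip(means, covs, priors)):
        diff = X - mu_k
        # Squared Mahalanobis distance of each sample to class k.
        maha = np.sum(diff * np.linalg.solve(cov_k, diff.T).T, axis=1)
        _, logdet = np.linalg.slogdet(cov_k)
        scores[:, k] = -0.5 * logdet - 0.5 * maha + np.log(pi_k)
    return scores

# Hypothetical setup: same mean, tight class 0 and wide class 1.
means = [np.zeros(2), np.zeros(2)]
covs = [0.25 * np.eye(2), 4.0 * np.eye(2)]
priors = [0.5, 0.5]

X_new = np.array([[0.1, 0.1], [3.0, 3.0]])
pred = qda_discriminants(X_new, means, covs, priors).argmax(axis=1)
print(pred)  # [0 1]: near the origin -> tight class, far away -> wide class
```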
Assumptions and Violations
LDA makes several assumptions. Understanding them helps you know when LDA will work well and when it might fail:
- Gaussian class distributions: Classes should be approximately normally distributed. Violations can lead to suboptimal boundaries.
- Equal covariance matrices: All classes share the same covariance structure. If this fails badly, consider QDA.
- Linear separability: Classes should be separable by a linear boundary. For complex boundaries, use kernelized or nonlinear methods.
- No multicollinearity: $S_W$ should be invertible. With highly correlated features or p > n, regularization is needed.
In practice, LDA is remarkably robust to moderate violations of these assumptions. It often works well even when data is not perfectly Gaussian, as long as classes are reasonably well-separated.
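When there are more features than samples, the within-class scatter matrix is singular and one common workaround is to add a small multiple of the identity before solving. The sketch below shows this ridge-style regularization on synthetic data; the strength `lam` and its scaling by the average diagonal entry are arbitrary illustrative choices, not a recommended default.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, K = 20, 50, 2                      # fewer samples than features (assumed)
X = rng.normal(size=(n, p))
y = np.array([0] * 10 + [1] * 10)
X[y == 1, 0] += 2.0                      # classes differ along feature 0

mu0, mu1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
Xc = np.vstack([X[y == 0] - mu0, X[y == 1] - mu1])   # class-centered data
S_W = Xc.T @ Xc

# rank(S_W) equals rank(Xc), which is at most n - K.
rank = int(np.linalg.matrix_rank(Xc))
print(rank, "<", p)                      # S_W is singular

# Ridge-style fix: S_W + lam * I is positive definite for any lam > 0.
lam = 0.1 * np.trace(S_W) / p            # hypothetical regularization strength
S_W_reg = S_W + lam * np.eye(p)
w = np.linalg.solve(S_W_reg, mu1 - mu0)  # regularized Fisher direction
print(np.isfinite(w).all())
```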
Real-World Applications
Fisherfaces for Face Recognition
One of LDA's most famous applications is Fisherfaces - a method for face recognition developed by Belhumeur, Hespanha, and Kriegman in 1997.
The problem: Face images are high-dimensional (e.g., 100×100 = 10,000 pixels), but the number of training images per person is small (maybe 5-10). Direct LDA would fail because $S_W$ is singular: its rank is at most n - K, far below the number of pixels.
The Fisherfaces solution:
- First, apply PCA to reduce dimensionality to (n - K) dimensions, where $S_W$ is no longer singular in general
- Then, apply LDA in the reduced PCA space to find the most discriminative directions
- For recognition, project new faces and find the nearest class centroid
This PCA+LDA pipeline is remarkably effective. Fisherfaces are robust to lighting variations that confuse eigenfaces (PCA alone), because LDA focuses on identity-related variation rather than lighting-related variation.
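The three steps above can be sketched with NumPy alone. This is an illustrative outline on synthetic stand-in data (random "images" around per-person centers), not the original Fisherfaces implementation; the sizes, the noise model, and the helper `project` are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
K, per_class, p = 3, 8, 400              # stand-in: 3 people, 8 images, 400 pixels
n = K * per_class
centers = rng.normal(scale=2.0, size=(K, p))
y = np.repeat(np.arange(K), per_class)
X = centers[y] + rng.normal(size=(n, p))

# Step 1: PCA down to n - K dimensions via SVD of the centered data.
mean = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
W_pca = Vt[: n - K].T                    # p x (n - K)
Z = (X - mean) @ W_pca

# Step 2: LDA in the reduced space.
q = n - K
mu = Z.mean(axis=0)
S_W, S_B = np.zeros((q, q)), np.zeros((q, q))
for k in range(K):
    Z_k = Z[y == k]
    m_k = Z_k.mean(axis=0)
    S_W += (Z_k - m_k).T @ (Z_k - m_k)
    S_B += len(Z_k) * np.outer(m_k - mu, m_k - mu)
evals, evecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
top = np.argsort(evals.real)[::-1][: K - 1]
W_lda = evecs[:, top].real               # (n - K) x (K - 1)

def project(X_new):
    """Map raw samples into the (K - 1)-dimensional discriminant space."""
    return (X_new - mean) @ W_pca @ W_lda

# Step 3: classify new samples by the nearest class centroid.
centroids = np.vstack([project(X[y == k]).mean(axis=0) for k in range(K)])
X_test = centers[y] + rng.normal(size=(n, p))
dists = np.linalg.norm(project(X_test)[:, None, :] - centroids[None, :, :], axis=2)
pred = dists.argmin(axis=1)
accuracy = (pred == y).mean()
print(accuracy)
```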
Bioinformatics and Gene Expression
In genomics, researchers often measure expression of 20,000+ genes but have only 100-200 patient samples. Tasks include:
- Cancer subtyping: Classify tumors into molecular subtypes based on gene expression
- Treatment response prediction: Predict which patients will respond to therapy
- Biomarker discovery: Identify genes that best distinguish disease states
LDA (with regularization or PCA preprocessing) excels here because:
- It uses class labels to find disease-relevant gene combinations
- The discriminant loadings identify which genes contribute most to classification
- Unlike pure prediction models (SVM, neural nets), LDA provides interpretable directions
Deep Learning Connections
LDA's principles - maximize between-class variance, minimize within-class variance - directly inspire modern deep learning techniques.
Embedding Spaces and Metric Learning
Deep learning models often learn embedding spaces where similar items are close and dissimilar items are far. This is exactly LDA's goal, but learned through neural networks rather than linear algebra.
Key examples:
- Face embeddings: FaceNet, ArcFace learn 128-512 dimensional representations where same-person faces cluster together
- Sentence embeddings: BERT, Sentence-BERT produce representations where semantically similar sentences are close
- Image embeddings: CLIP learns joint image-text embeddings with high between-concept separability
These learned embeddings solve the same problem as LDA - just with nonlinear transformations and much more data.
Center Loss and Intra-Class Compactness
Center loss (introduced by Wen et al., 2016) explicitly encourages LDA-like properties in neural network embeddings:
$$L_C = \frac{1}{2} \sum_{i=1}^{m} \lVert x_i - c_{y_i} \rVert_2^2$$
Where $x_i$ is the embedding of sample i and $c_{y_i}$ is the learned center for class $y_i$.
This directly minimizes within-class scatter - much like the $S_W$ term in LDA! When the centers equal the class means, $L_C = \frac{1}{2} \operatorname{tr}(S_W)$. Combined with softmax cross-entropy (which increases between-class separation), this creates LDA-like embeddings through gradient descent.
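The link between center loss and within-class scatter can be checked numerically. The sketch below assumes the centers are set to the class means (in training they are learned and only approximate the means) and uses random stand-in embeddings; the function name `center_loss` is hypothetical.

```python
import numpy as np

def center_loss(embeddings, labels, centers):
    """Half the summed squared distance of each embedding to its class center."""
    diffs = embeddings - centers[labels]
    return 0.5 * np.sum(diffs ** 2)

# Stand-in embeddings (assumed): 3 classes, 20 samples each, 4 dimensions.
rng = np.random.default_rng(5)
labels = np.repeat(np.arange(3), 20)
emb = rng.normal(size=(60, 4)) + 3.0 * np.eye(3, 4)[labels]

# Assumption: centers equal the class means.
centers = np.vstack([emb[labels == k].mean(axis=0) for k in range(3)])
S_W = sum((emb[labels == k] - centers[k]).T @ (emb[labels == k] - centers[k])
          for k in range(3))

loss = center_loss(emb, labels, centers)
print(np.isclose(loss, 0.5 * np.trace(S_W)))  # True

# Pulling every embedding halfway toward its center quarters the loss.
emb_tight = centers[labels] + 0.5 * (emb - centers[labels])
loss_tight = center_loss(emb_tight, labels, centers)
print(np.isclose(loss_tight, 0.25 * loss))    # True
```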
Contrastive and Triplet Loss
Contrastive loss and triplet loss operationalize Fisher's criterion in a sampling framework:
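Triplet loss:
$$L = \max\left(0,\ \lVert f(a) - f(p) \rVert^2 - \lVert f(a) - f(n) \rVert^2 + \alpha\right)$$
Where $(a, p, n)$ (anchor, positive, negative) are a triplet with anchor and positive from the same class and negative from a different class, $f$ is the embedding function, and $\alpha$ is the margin. This directly implements: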
- Minimize distance to same-class samples (positive) → reduce within-class scatter
- Maximize distance to different-class samples (negative) → increase between-class scatter
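The two bullets above correspond to the two distance terms in a hinge-style triplet loss. A minimal NumPy sketch follows, operating directly on embedding vectors; the function name, the margin value `0.2`, and the example points are assumptions.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(0, ||a - p||^2 - ||a - n||^2 + margin) on embedding vectors."""
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)   # same-class distance
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)   # different-class distance
    return np.maximum(0.0, d_pos - d_neg + margin)

a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])        # same class, close to the anchor
n_far = np.array([2.0, 0.0])    # different class, already far away
n_near = np.array([0.2, 0.0])   # different class, too close

loss_easy = triplet_loss(a, p, n_far)
loss_hard = triplet_loss(a, p, n_near)
print(loss_easy)  # 0.0: the negative is beyond the margin, nothing to fix
print(loss_hard)  # about 0.17: the negative must be pushed away
```

Only triplets that violate the margin contribute a gradient, which is why sampling hard negatives matters in practice.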
The Deep Learning Perspective: LDA is the optimal linear solution to the embedding problem under Fisher's criterion. Modern contrastive learning finds nonlinear solutions using the same objective. If a linear projection is enough to separate your classes, LDA gives the answer in closed form. If not, neural networks can learn curved embeddings that achieve even better separation.
| Method | Era | Approach | Transformation |
|---|---|---|---|
| Fisher's LDA | 1936 | Eigenvalue problem | Linear |
| Center Loss | 2016 | Gradient descent + centers | Deep neural network |
| Triplet Loss | 2015 | Sampling-based optimization | Deep neural network |
| Contrastive Learning | 2020+ | Self-supervised + InfoNCE | Deep neural network |
Python Implementation
LDA from Scratch
Let's implement LDA from first principles to understand every computational step.
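The original listing is not reproduced here, so what follows is a minimal sketch of one way to do it with NumPy. The class name `SimpleLDA`, its attribute names, and the synthetic three-class data are assumptions; it covers only the projection part (class statistics, scatter matrices, eigenproblem) and assumes the within-class scatter is invertible.

```python
import numpy as np

class SimpleLDA:
    """Minimal Fisher LDA for dimensionality reduction (illustrative sketch)."""

    def __init__(self, n_components=None):
        self.n_components = n_components

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        self.classes_ = np.unique(y)
        p = X.shape[1]
        max_comp = min(p, len(self.classes_) - 1)   # at most K - 1 directions
        q = max_comp if self.n_components is None else min(self.n_components, max_comp)

        # Step 1: class statistics.
        self.mean_ = X.mean(axis=0)
        self.means_ = np.vstack([X[y == k].mean(axis=0) for k in self.classes_])

        # Step 2: within-class and between-class scatter matrices.
        S_W = np.zeros((p, p))
        S_B = np.zeros((p, p))
        for k, mu_k in zip(self.classes_, self.means_):
            X_k = X[y == k]
            S_W += (X_k - mu_k).T @ (X_k - mu_k)
            d = mu_k - self.mean_
            S_B += len(X_k) * np.outer(d, d)

        # Step 3: eigenvectors of S_W^{-1} S_B, sorted by decreasing eigenvalue.
        evals, evecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
        order = np.argsort(evals.real)[::-1][:q]
        self.eigenvalues_ = evals.real[order]
        self.scalings_ = evecs.real[:, order]       # p x q projection matrix
        return self

    def transform(self, X):
        # Step 4: center and project onto the discriminant directions.
        return (np.asarray(X, dtype=float) - self.mean_) @ self.scalings_

# Synthetic data (assumed): 3 classes in 4 dimensions, 50 samples each.
rng = np.random.default_rng(0)
centers = np.array([[0, 0, 0, 0], [3, 3, 0, 0], [0, 3, 3, 0]], dtype=float)
y = np.repeat(np.arange(3), 50)
X = centers[y] + rng.normal(size=(150, 4))

lda = SimpleLDA().fit(X, y)
Z = lda.transform(X)
print(Z.shape)            # (150, 2)
print(lda.eigenvalues_)   # two positive values, largest first
```

The eigenvectors here are unit-normalized; libraries may scale them differently, so projections can differ from a library's output by a per-component scale and sign.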
Now let's see LDA vs PCA in action on data designed to highlight their differences:
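A minimal sketch of such a comparison, assuming two elongated Gaussian "cigars" that are long along x and separated along y. The data, the seed, and the `separation` score (gap between projected class means in units of pooled within-class standard deviation) are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
cov = np.array([[9.0, 0.0], [0.0, 0.25]])       # long axis x, short axis y
X0 = rng.multivariate_normal([0.0, -1.5], cov, size=300)
X1 = rng.multivariate_normal([0.0, 1.5], cov, size=300)
X = np.vstack([X0, X1])

# PCA direction: top eigenvector of the total scatter (labels ignored).
Xc = X - X.mean(axis=0)
evals, evecs = np.linalg.eigh(Xc.T @ Xc)        # eigenvalues in ascending order
w_pca = evecs[:, -1]

# LDA direction for two classes: S_W^{-1} (mu1 - mu0).
mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
S_W = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)
w_lda = np.linalg.solve(S_W, mu1 - mu0)
w_lda /= np.linalg.norm(w_lda)

def separation(w):
    """Gap between projected class means, in pooled within-class std units."""
    z0, z1 = X0 @ w, X1 @ w
    pooled_std = np.sqrt(0.5 * (z0.var() + z1.var()))
    return abs(z1.mean() - z0.mean()) / pooled_std

print("PCA direction:", w_pca, "separation:", separation(w_pca))
print("LDA direction:", w_lda, "separation:", separation(w_lda))
print("|cos(angle)| :", abs(w_pca @ w_lda))
```

On data like this, PCA should pick (roughly) the x axis, where the classes overlap, while LDA should pick (roughly) the y axis, where they separate.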
Practical Tips and Gotchas
- Always check the maximum components limit. With K classes, you get at most K-1 discriminant directions. Don't expect more than 1 component for binary classification!
- Handle high-dimensional data carefully. If p > n, use regularized LDA or apply PCA first to reduce to n-K dimensions.
- Consider class imbalance. LDA weights classes by their size when computing scatter matrices. Highly imbalanced data can bias results toward the majority class.
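- Consider standardizing features. Unlike PCA, classical LDA is invariant to rescaling the features: the projected data and the predictions do not change. Standardization still matters once you add regularization or shrinkage or run PCA first, and it makes discriminant loadings comparable across features.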
- Check the Gaussian assumption. LDA works best with roughly Gaussian classes. For heavy-tailed or multimodal distributions, consider other methods.
- Use LDA for interpretability. Unlike black-box classifiers, LDA discriminants are linear combinations of features - you can interpret which features contribute most.
- Compare with logistic regression. For pure classification, logistic regression makes fewer assumptions and often performs similarly. LDA shines when you need the low-dimensional representation.
- Combine with PCA strategically. PCA+LDA (Fisherfaces approach) is powerful: PCA handles singularity and removes noise, LDA maximizes class separation in the cleaned space.
Practice Problems
- Conceptual: Explain why LDA can produce at most K-1 components for K classes. Hint: What is the rank of the between-class scatter matrix?
- Mathematical: Derive the Fisher Linear Discriminant for binary classification. Show that the optimal direction is $w^* \propto S_W^{-1}(\mu_1 - \mu_2)$.
- Comparison: Create a 2D dataset where PCA and LDA find similar directions, and another where they are nearly perpendicular. What determines when they agree?
- Implementation: Implement LDA from scratch and verify your results match sklearn.discriminant_analysis.LinearDiscriminantAnalysis on the Iris dataset.
- Classification: Compare LDA, logistic regression, and QDA on a dataset with unequal class covariances. Which method handles this best?
- High-Dimensional: Apply LDA to a text classification problem with TF-IDF features. How do you handle the fact that p > n?
- Deep Learning Connection: Implement center loss and show that minimizing it reduces within-class scatter in the embedding space.
- Fisherfaces: Implement the PCA+LDA pipeline for face recognition. Compare with PCA alone (eigenfaces) on a face dataset.
Key Insights
- LDA is supervised dimensionality reduction. Unlike PCA, it uses class labels to find directions that maximize class separability, not just total variance.
- Fisher's criterion is the key: Maximize between-class variance / within-class variance. This ratio is scale-invariant and balances separation with compactness.
- LDA solves a generalized eigenvalue problem: $S_B w = \lambda S_W w$. Eigenvectors are discriminant directions, eigenvalues measure discriminative power.
- Maximum K-1 components for K classes. This fundamental limit comes from the rank of $S_B$. Binary classification yields only 1 LDA component.
- LDA is also a classifier. Under Gaussian assumptions with shared covariance, LDA produces optimal linear decision boundaries.
- LDA principles underlie modern deep learning. Center loss, triplet loss, and contrastive learning all implement Fisher's idea: compact classes, separated means.
- PCA + LDA is a powerful combination. Use PCA to handle singularity and noise, then LDA for class separation (the Fisherfaces approach).
- LDA assumes Gaussian, equal-covariance classes. When these assumptions fail badly, consider QDA, logistic regression, or nonlinear methods.
The Essence of LDA: If you know what you're trying to distinguish (class labels), use that information! PCA ignores labels and finds general patterns. LDA focuses on the patterns that matter for your classification task. This is the fundamental insight of supervised learning: task-specific representations outperform generic ones.