Learning Objectives
By the end of this section, you will master Factor Analysis - a powerful technique for discovering latent structures that explain correlations among observed variables. You will:
- Understand the conceptual foundation of Factor Analysis: what latent variables are, why we care about them, and how they differ fundamentally from principal components.
- Master the mathematical formulation of the factor model, including factor loadings, communalities, uniqueness, and the covariance structure it implies.
- Apply estimation methods including the Principal Factor method and Maximum Likelihood estimation to extract factors from data.
- Perform and interpret factor rotation using Varimax (orthogonal) and Oblimin (oblique) methods to achieve simple structure for interpretable solutions.
- Connect Factor Analysis to modern deep learning: understand its relationship to VAEs, topic models, embeddings, and other latent variable models.
- Implement Factor Analysis from scratch and apply it to real-world problems in psychology, finance, and marketing.
Why This Matters: Factor Analysis addresses a fundamental question in science and machine learning: what are the underlying causes that explain the correlations we observe? Unlike PCA (which summarizes data), Factor Analysis models data generation - it hypothesizes that hidden factors create the patterns we see. This causal perspective is essential for psychological measurement, economics, biology, and now powers modern latent variable models in deep learning including VAEs, topic models, and word embeddings.
The Big Picture
Historical Context
Factor Analysis was born from a fundamental question in psychology: What is intelligence? In 1904, Charles Spearman noticed something remarkable - students who did well on one cognitive test tended to do well on others, even when the tests seemed very different (vocabulary, mathematics, spatial reasoning).
Spearman hypothesized that a single underlying factor - which he called "g" (general intelligence) - caused performance on all tests. The correlations between tests existed because they all measured this common factor, plus some test-specific abilities. This was the birth of Factor Analysis.
Later, Louis Thurstone extended this work, proposing that intelligence wasn't a single factor but multiple "primary mental abilities." His work on factor rotation (1930s-1940s) established the mathematical framework we still use today. Factor Analysis became the dominant method in psychology for developing personality tests, intelligence measures, and attitude scales.
The Key Insight: Correlations between observed variables might exist because those variables share common causes. Factor Analysis formalizes this intuition mathematically: latent factors "generate" observed correlations.
Why Factor Analysis?
Consider this scenario: You administer a 50-question personality survey to 1,000 people. The questions cover various topics - social preferences, work habits, emotional responses. When you compute correlations between questions, you find patterns:
- Questions about enjoying parties correlate with questions about meeting new people
- Questions about organization correlate with questions about punctuality
- Questions about anxiety correlate with questions about worry
Why do these correlations exist? Factor Analysis proposes an answer: there are latent traits (extroversion, conscientiousness, neuroticism) that influence how people answer multiple questions. The correlations are symptoms of these underlying causes.
This perspective differs fundamentally from PCA:
- PCA asks: "How can I summarize these 50 questions into fewer numbers?"
- Factor Analysis asks: "What underlying traits explain why these questions correlate?"
The Fundamental Question
Factor Analysis addresses a causal question: What latent structure generated the data we observe?
The model assumes:
- There exist latent (hidden) factors that we cannot directly observe
- Each observed variable is a linear combination of these factors plus unique variance
- The correlations among observed variables arise because they share common factor influences
This generative perspective is powerful because it:
- Explains why correlations exist (shared factors)
- Distinguishes common variance from unique variance
- Provides a model of the data-generating process
- Enables theory testing about latent structures
PCA vs Factor Analysis
PCA and Factor Analysis are often confused because they both reduce dimensionality. But they answer fundamentally different questions and make different assumptions about the data.
Key Differences
| Aspect | PCA | Factor Analysis |
|---|---|---|
| Goal | Maximize variance explained | Model latent structure generating correlations |
| Model | No model (data transformation) | X = Λf + ε (generative model) |
| Components/Factors | Observed (exact linear combinations) | Latent (unobserved, estimated) |
| Unique Variance | Not modeled (all variance is explained) | Explicitly modeled (ε term) |
| Covariance Structure | Σ = VDV^T (full eigendecomposition) | Σ = ΛΛ^T + Ψ (common + unique) |
| Rotation | Typically none (variance-ordered) | Essential for interpretation |
| Number of Components | Can use all (ordered by variance) | Must specify number of factors |
| Interpretability | Components are data summaries | Factors are hypothetical causes |
The mathematical difference is crucial. In PCA:
$$\Sigma = V D V^\top$$
where the columns of $V$ are the eigenvectors of $\Sigma$ and $D$ is the diagonal matrix of eigenvalues. The covariance matrix is fully decomposed - all variance is accounted for by the components. In Factor Analysis:
$$\Sigma = \Lambda \Lambda^\top + \Psi$$
Where $\Lambda$ is the $p \times k$ matrix of factor loadings and $\Psi$ is a diagonal matrix of unique variances. Not all variance is common - some is specific to each variable.
When to Use Which
| Use PCA When... | Use Factor Analysis When... |
|---|---|
| You want to reduce dimensionality for modeling | You want to understand latent structure |
| Components don't need interpretation | Factors must be interpretable constructs |
| You're preprocessing for ML algorithms | You're developing psychological scales |
| You want data compression | You want to explain correlations |
| You don't need to separate measurement error from signal | You want to separate common from unique variance |
| You need exact reconstruction | You accept that factors are estimates |
Rule of Thumb: Use PCA for dimensionality reduction and preprocessing. Use Factor Analysis when you believe there are underlying constructs causing correlations, and you want to identify and interpret those constructs.
The Factor Model
Mathematical Formulation
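The factor model expresses each observed variable as a linear combination of latent factors:
$$x_i = \lambda_{i1} f_1 + \lambda_{i2} f_2 + \cdots + \lambda_{ik} f_k + \varepsilon_i$$
In matrix notation for all p variables:
$$\mathbf{x} = \Lambda \mathbf{f} + \boldsymbol{\varepsilon}$$
Where:
- $\mathbf{x}$ = vector of p observed variables (standardized)
- $\Lambda$ = p × k matrix of factor loadings
- $\mathbf{f}$ = vector of k latent common factors (k < p)
- $\boldsymbol{\varepsilon}$ = vector of p unique factors (specific + error)
The model makes these key assumptions:
- Factors are uncorrelated (for orthogonal solutions): $\mathrm{Cov}(\mathbf{f}) = I$
- Unique factors are uncorrelated with each other: $\mathrm{Cov}(\boldsymbol{\varepsilon}) = \Psi$ (diagonal)
- Factors and unique factors are uncorrelated: $\mathrm{Cov}(\mathbf{f}, \boldsymbol{\varepsilon}) = 0$
Given these assumptions, the implied covariance structure is:
$$\Sigma = \Lambda \Lambda^\top + \Psi$$
This decomposes total variance into:
- $\Lambda \Lambda^\top$ = common variance (explained by factors)
- $\Psi$ = unique variance (specific + error)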
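As a quick numerical check of the implied covariance structure, the sketch below simulates data from a two-factor model and compares the sample covariance with the loadings-times-transpose plus diagonal uniqueness matrix. The loading values, sample size, and variable names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical loadings: 6 variables, 2 factors (illustrative values only)
Lambda = np.array([
    [0.8, 0.0],
    [0.7, 0.1],
    [0.6, 0.0],
    [0.0, 0.8],
    [0.1, 0.7],
    [0.0, 0.6],
])
# Unique variances chosen so that every variable has unit total variance
psi = 1.0 - np.sum(Lambda**2, axis=1)

n = 200_000
f = rng.standard_normal((n, 2))                    # common factors, Cov(f) = I
eps = rng.standard_normal((n, 6)) * np.sqrt(psi)   # unique factors, Cov(eps) = diag(psi)
X = f @ Lambda.T + eps                             # x = Lambda f + eps, one row per observation

Sigma_model = Lambda @ Lambda.T + np.diag(psi)     # covariance implied by the model
Sigma_sample = np.cov(X, rowvar=False)             # covariance estimated from the data

max_abs_diff = np.max(np.abs(Sigma_sample - Sigma_model))
print(f"largest |sample - implied| entry: {max_abs_diff:.4f}")
```

The gap shrinks as `n` grows, which is what the model predicts when its assumptions hold exactly.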
Geometric Interpretation
Geometrically, each factor defines a direction in variable space. The loading $\lambda_{ij}$ is the correlation between variable i and factor j (for standardized variables with orthogonal factors).
Think of it this way: if you could directly measure the hidden factors, the loading would tell you how strongly each observed variable correlates with each factor. Variables that load highly on the same factor are correlated because they share that common cause.
Model Assumptions
Factor Analysis makes strong assumptions that should be checked:
- Linearity: Observed variables are linear combinations of factors. Nonlinear relationships aren't captured.
- No factor-unique correlation: Common factors don't correlate with unique factors. Violated if measurement error depends on the construct.
- Uncorrelated unique factors: Unique variances are independent. Violated if two variables share variance not captured by the factors.
- Multivariate normality: For MLE estimation and certain tests. Violations affect standard errors and fit indices.
Communality and Uniqueness
Two critical concepts in Factor Analysis:
Communality ($h_i^2$) is the proportion of variance in variable i explained by the common factors:
$$h_i^2 = \sum_{j=1}^{k} \lambda_{ij}^2$$
Uniqueness ($\psi_i$) is the remaining variance not explained by common factors:
$$\psi_i = 1 - h_i^2$$
For standardized variables (variance = 1), the total variance equals communality plus uniqueness:
$$h_i^2 + \psi_i = 1$$
Interpretation: High communality means the variable fits well into the factor structure - most of its variance is explained by common factors. Low communality suggests the variable has substantial unique variance not shared with other variables - consider removing it from the analysis.
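A small worked example may help here. The sketch below computes communalities and uniquenesses from a hypothetical loading matrix (the numbers are invented); the fourth variable is deliberately weak so that its low communality stands out.

```python
import numpy as np

# Hypothetical loading matrix: 4 standardized variables, 2 orthogonal factors
Lambda = np.array([
    [0.8, 0.1],
    [0.7, 0.2],
    [0.1, 0.9],
    [0.3, 0.2],   # only weakly related to either factor
])

communality = np.sum(Lambda**2, axis=1)   # row-wise sum of squared loadings
uniqueness = 1.0 - communality            # remaining variance of a standardized variable

for i, (h2, u) in enumerate(zip(communality, uniqueness), start=1):
    print(f"variable {i}: communality={h2:.2f}, uniqueness={u:.2f}")
```

Under the rule of thumb in the text, the fourth variable (communality 0.13) would be a candidate for removal.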
Interactive Factor Model Demo
Explore the factor model structure interactively. Observe how latent factors influence observed variables through factor loadings, and how communality and uniqueness partition variance.
[Interactive visualization: Factor Model. Legend: latent factors (hidden constructs that explain correlations), observed variables (measured variables in your dataset), communality (portion of variance explained by factors).]
Key Concepts:
- Factors (blue circles) are latent variables - unobserved causes
- Loadings (purple arrows) show how strongly each factor influences each variable
- Communality (green) = variance explained by common factors
- Unique variance (orange) = specific + error variance
Try This: Set 2 factors and 6 variables to see a clean "simple structure" where each variable loads primarily on one factor. Then increase the common variance slider to see how communality increases (less orange in each variable).
Estimation Methods
Given observed data, how do we estimate the factor loadings $\Lambda$ and unique variances $\Psi$? There are several approaches.
Principal Factor Method
The Principal Factor (PF) method is an iterative approach that extends PCA to account for unique variance:
- Estimate initial communalities: Use Squared Multiple Correlations (SMC) - the R² of each variable predicted from all others. This gives the portion of variance shared with other variables.
- Construct reduced correlation matrix: Replace the 1s on the diagonal of R with communality estimates. This removes unique variance before factoring.
- Extract factors: Perform eigendecomposition of the reduced matrix. Keep the top k eigenvectors, scaled by the square roots of their eigenvalues, as the loadings.
- Update communalities: Compute new communalities as sum of squared loadings for each variable.
- Iterate: Repeat steps 2-4 until communalities converge.
The key insight: by using communalities instead of 1s on the diagonal, we factor only the common variance, not the unique variance. This distinguishes PF from PCA.
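The iterative procedure above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `principal_factor`, the convergence tolerance, and the test correlation matrix (built from invented loadings) are all assumptions.

```python
import numpy as np

def principal_factor(R, k, max_iter=500, tol=1e-6):
    """Iterated principal factor extraction from a correlation matrix R."""
    R = np.asarray(R, dtype=float)
    # Initial communalities from squared multiple correlations:
    # SMC_i = 1 - 1 / (R^{-1})_ii
    h2 = 1.0 - 1.0 / np.diag(np.linalg.inv(R))
    for _ in range(max_iter):
        # Reduced correlation matrix: communalities replace the 1s on the diagonal
        R_reduced = R.copy()
        np.fill_diagonal(R_reduced, h2)
        # Eigendecomposition; keep the k largest eigenpairs
        eigvals, eigvecs = np.linalg.eigh(R_reduced)
        top = np.argsort(eigvals)[::-1][:k]
        # Loadings are eigenvectors scaled by sqrt(eigenvalue); the clip guards
        # against small negative eigenvalues of the reduced matrix
        loadings = eigvecs[:, top] * np.sqrt(np.clip(eigvals[top], 0.0, None))
        # Updated communalities: row sums of squared loadings
        h2_new = np.sum(loadings**2, axis=1)
        converged = np.max(np.abs(h2_new - h2)) < tol
        h2 = h2_new
        if converged:
            break
    return loadings, h2

# Usage on a correlation matrix constructed from made-up loadings
true_L = np.array([[0.8, 0.0], [0.7, 0.0], [0.6, 0.0],
                   [0.0, 0.8], [0.0, 0.7], [0.0, 0.6]])
R = true_L @ true_L.T
np.fill_diagonal(R, 1.0)

loadings, h2 = principal_factor(R, k=2)
print(np.round(h2, 3))
```

Because the example matrix follows the factor model exactly, the communalities should settle near the squared true loadings. The loadings themselves are only determined up to rotation.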
Maximum Likelihood Estimation
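MLE finds loadings and unique variances that maximize the probability of observing the data, assuming multivariate normality:
$$(\hat{\Lambda}, \hat{\Psi}) = \arg\max_{\Lambda, \Psi} \; \ell(\Lambda, \Psi)$$
Where the log-likelihood (ignoring constants) is:
$$\ell(\Lambda, \Psi) = -\frac{n}{2} \left[ \log |\Sigma| + \mathrm{tr}\left( \Sigma^{-1} S \right) \right]$$
With $\Sigma = \Lambda \Lambda^\top + \Psi$, where $S$ is the sample covariance matrix and $n$ is the number of observations.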
MLE has advantages over PF:
- Provides standard errors for loadings
- Allows statistical tests (chi-square test of model fit)
- Scale-invariant (rescaling the variables rescales the loadings accordingly, so the correlation and covariance matrices give equivalent solutions)
- Generally more efficient estimates
However, MLE requires larger samples (n > 100-200) and assumes normality.
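To make the objective concrete, here is a small sketch of the Gaussian log-likelihood as a function of the loadings and unique variances. The helper name `fa_log_likelihood` and the toy one-factor model are assumptions for illustration; a real fit would maximize this function numerically rather than compare two hand-picked candidates.

```python
import numpy as np

def fa_log_likelihood(Lambda, psi, S, n):
    """Gaussian log-likelihood of the factor model, up to an additive constant.

    Lambda: (p, k) loadings; psi: (p,) unique variances;
    S: (p, p) sample covariance; n: number of observations.
    """
    Sigma = Lambda @ Lambda.T + np.diag(psi)
    sign, logdet = np.linalg.slogdet(Sigma)
    if sign <= 0:
        return -np.inf  # the implied covariance must be positive definite
    return -0.5 * n * (logdet + np.trace(np.linalg.solve(Sigma, S)))

# Made-up one-factor model; the "sample" covariance equals its implied covariance
Lambda_true = np.array([[0.8], [0.7], [0.6], [0.5]])
psi_true = 1.0 - np.sum(Lambda_true**2, axis=1)
S = Lambda_true @ Lambda_true.T + np.diag(psi_true)

ll_true = fa_log_likelihood(Lambda_true, psi_true, S, n=500)
ll_wrong = fa_log_likelihood(0.5 * Lambda_true, psi_true, S, n=500)
print(ll_true > ll_wrong)
```

The parameters that reproduce `S` exactly score higher than the shrunken loadings, since the likelihood peaks where the implied covariance matches the sample covariance.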
How Many Factors to Extract?
Choosing the number of factors is crucial and often subjective. Common approaches:
- Kaiser Criterion: Keep factors with eigenvalues > 1 (from the correlation matrix). Rationale: each factor should explain more variance than a single variable.
- Scree Test: Plot eigenvalues and look for an "elbow." Factors before the elbow are retained; those after explain little additional variance.
- Parallel Analysis: Compare eigenvalues to those from random data with the same dimensions. Keep factors with eigenvalues exceeding the random baseline.
- Chi-Square Test (MLE only): Test whether adding another factor significantly improves fit.
- Interpretability: Can you meaningfully interpret the factors? If a factor doesn't make sense, you may have extracted too many.
- Theory: Prior knowledge about the number of constructs (e.g., Big Five personality theory predicts 5 factors).
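Two of these criteria are easy to sketch in code. The example below applies the Kaiser criterion and a simple form of parallel analysis (eigenvalues of the unreduced correlation matrix, compared with the 95th percentile from random normal data). The function name `parallel_analysis`, the number of simulations, and the synthetic two-factor data are assumptions made for this illustration.

```python
import numpy as np

def parallel_analysis(X, n_sims=100, percentile=95, seed=0):
    """Count leading eigenvalues that exceed those of random data of the same shape."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    observed = np.sort(np.linalg.eigvalsh(np.corrcoef(X, rowvar=False)))[::-1]
    rng = np.random.default_rng(seed)
    random_eigs = np.empty((n_sims, p))
    for s in range(n_sims):
        Z = rng.standard_normal((n, p))
        random_eigs[s] = np.sort(np.linalg.eigvalsh(np.corrcoef(Z, rowvar=False)))[::-1]
    threshold = np.percentile(random_eigs, percentile, axis=0)
    exceeds = observed > threshold
    # Index of the first eigenvalue that fails to beat the random baseline
    n_factors = p if exceeds.all() else int(np.argmin(exceeds))
    return n_factors, observed, threshold

# Synthetic data generated from two factors (illustrative loadings)
rng = np.random.default_rng(1)
L = np.array([[0.8, 0.0], [0.7, 0.0], [0.6, 0.0],
              [0.0, 0.8], [0.0, 0.7], [0.0, 0.6]])
psi = 1.0 - np.sum(L**2, axis=1)
X = rng.standard_normal((500, 2)) @ L.T + rng.standard_normal((500, 6)) * np.sqrt(psi)

n_pa, observed, threshold = parallel_analysis(X)
n_kaiser = int(np.sum(observed > 1.0))   # Kaiser criterion: eigenvalues above 1
print("Kaiser:", n_kaiser, "Parallel analysis:", n_pa)
```

On clean synthetic data like this the two criteria tend to agree; on real data they often differ, which is one reason to combine several criteria.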
Factor Rotation
The Rotation Problem
Factor solutions are not unique. Any orthogonal transformation of the factors yields an equally valid solution with the same communalities and reproduced correlations.
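Mathematically, if $\Lambda$ is a solution, then so is $\Lambda T$ for any orthogonal matrix T:
$$(\Lambda T)(\Lambda T)^\top = \Lambda T T^\top \Lambda^\top = \Lambda \Lambda^\top$$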
This is called rotational indeterminacy. It's both a curse (no unique answer) and a blessing (we can choose the most interpretable orientation).
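Rotational indeterminacy is easy to verify numerically. In this sketch, a made-up loading matrix is rotated by an arbitrary 30° and the common part of the covariance, along with every communality, is unchanged.

```python
import numpy as np

# Hypothetical loadings: 4 variables, 2 factors
Lambda = np.array([[0.8, 0.1], [0.7, 0.2], [0.1, 0.9], [0.2, 0.6]])

theta = np.deg2rad(30.0)
T = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # orthogonal: T @ T.T = I

Lambda_rot = Lambda @ T

same_common_part = np.allclose(Lambda @ Lambda.T, Lambda_rot @ Lambda_rot.T)
same_communalities = np.allclose(np.sum(Lambda**2, axis=1),
                                 np.sum(Lambda_rot**2, axis=1))
print(same_common_part, same_communalities)
```

The individual loadings differ between `Lambda` and `Lambda_rot`, yet no data could distinguish the two solutions. That is why a rotation criterion is needed to pick one.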
Simple Structure is the guiding principle for rotation. Thurstone defined simple structure as:
- Each variable loads highly on one factor
- Each variable has near-zero loadings on other factors
- Each factor has some variables with high loadings and many with near-zero
Simple structure makes factors interpretable: each factor represents a distinct construct defined by specific variables.
Orthogonal Rotation (Varimax)
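Varimax is the most popular orthogonal rotation. It maximizes the variance of squared loadings within each factor:
$$V = \sum_{j=1}^{k} \left[ \frac{1}{p} \sum_{i=1}^{p} \tilde{\lambda}_{ij}^4 - \left( \frac{1}{p} \sum_{i=1}^{p} \tilde{\lambda}_{ij}^2 \right)^2 \right]$$
Where $\tilde{\lambda}_{ij} = \lambda_{ij} / h_i$ is the loading normalized by the square root of the communality.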
Effect of Varimax: Pushes loadings toward 0 or ±1, making each variable load clearly on one factor. Factors remain uncorrelated (90° apart).
Other orthogonal rotations:
- Quartimax: Maximizes variance within variables (each variable loads on fewer factors)
- Equamax: Balances Varimax and Quartimax criteria
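Returning to Varimax, one way to implement it is Kaiser's pairwise scheme: rotate two factors at a time by the angle that maximizes the criterion for that pair, and sweep over all pairs until the angles vanish. The sketch below is a minimal version that works on raw loadings (it skips the communality normalization); the function names and the test loadings are assumptions.

```python
import numpy as np

def varimax_criterion(L):
    """Sum over factors of the variance of squared loadings."""
    return float(np.sum(np.var(L**2, axis=0)))

def varimax(L, max_sweeps=50, tol=1e-10):
    """Raw varimax rotation by successive planar (pairwise) rotations."""
    L = np.array(L, dtype=float)   # work on a copy
    p, k = L.shape
    for _ in range(max_sweeps):
        max_angle = 0.0
        for a in range(k - 1):
            for b in range(a + 1, k):
                x, y = L[:, a].copy(), L[:, b].copy()
                u, v = x**2 - y**2, 2.0 * x * y
                A, B = u.sum(), v.sum()
                C, D = np.sum(u**2 - v**2), 2.0 * np.sum(u * v)
                # Closed-form angle that maximizes the criterion for this pair
                phi = 0.25 * np.arctan2(D - 2.0 * A * B / p,
                                        C - (A**2 - B**2) / p)
                L[:, a] = x * np.cos(phi) + y * np.sin(phi)
                L[:, b] = -x * np.sin(phi) + y * np.cos(phi)
                max_angle = max(max_angle, abs(phi))
        if max_angle < tol:
            break
    return L

# Start from a near-simple structure that has been rotated away by 40 degrees
simple = np.array([[0.8, 0.05], [0.7, 0.0], [0.6, 0.1],
                   [0.05, 0.8], [0.0, 0.7], [0.1, 0.6]])
theta = np.deg2rad(40.0)
mix = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
unrotated = simple @ mix

rotated = varimax(unrotated)
print(varimax_criterion(unrotated), varimax_criterion(rotated))
print(np.round(rotated, 2))
```

The rotated loadings may come back with columns swapped or signs flipped relative to `simple`; those variants have the same criterion value and the same interpretation.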
Oblique Rotation (Oblimin)
Oblique rotations allow factors to correlate. This often achieves better simple structure because factors in reality may be related.
Oblimin (with parameter γ=0, called "Quartimin") minimizes:
$$Q = \sum_{j < l} \sum_{i=1}^{p} \lambda_{ij}^2 \, \lambda_{il}^2$$
This penalizes variables with high loadings on multiple factors.
Trade-offs of oblique rotation:
- Pro: Often achieves cleaner simple structure
- Pro: More realistic if constructs are truly correlated
- Con: Must interpret factor correlations in addition to loadings
- Con: Need two matrices: pattern matrix (loadings) and structure matrix (correlations)
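The relationship between the two matrices can be shown with a short sketch. It does not perform an Oblimin rotation; it assumes an oblique solution is already available (the pattern matrix `P` and factor correlation matrix `Phi` below are invented) and derives the structure matrix, the implied uniquenesses, and the value of the Quartimin criterion.

```python
import numpy as np

# Hypothetical oblique solution: pattern loadings P and factor correlations Phi
P = np.array([[0.8, 0.0], [0.7, 0.1], [0.6, 0.0],
              [0.0, 0.8], [0.1, 0.7], [0.0, 0.6]])
Phi = np.array([[1.0, 0.4],
                [0.4, 1.0]])          # factors correlate at 0.4 (assumed value)

structure = P @ Phi                   # correlations between variables and factors
common = P @ Phi @ P.T                # common part of the implied correlation matrix
uniqueness = 1.0 - np.diag(common)    # unique variances for standardized variables

def quartimin(L):
    """Quartimin criterion: sum over factor pairs of sum_i L_ij^2 * L_il^2."""
    sq = L**2
    cross = sq.T @ sq                 # entry (j, l) holds sum_i L_ij^2 * L_il^2
    return float(np.sum(np.triu(cross, k=1)))

print(np.round(structure, 2))
print(round(quartimin(P), 4))
```

Note how the first variable has a pattern loading of zero on the second factor but a structure correlation of 0.32 with it, purely because the factors themselves correlate. This is why both matrices are reported.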
Interactive Rotation Demo
Explore how rotation transforms the loading space. Watch how variables cluster onto distinct factors as you rotate toward simple structure.
[Interactive visualization: Factor Rotation. Rotate the axes to achieve simple structure.]
Goal of Rotation:
Achieve simple structure where each variable loads highly on one factor and near zero on others.
Orthogonal vs Oblique:
- Orthogonal: Factors remain uncorrelated (90° apart)
- Oblique: Factors can be correlated (more flexible but complex)
- Oblique often gives better simple structure but harder interpretation
Try This: Rotate to approximately 45° to see how variables separate into two clusters. Notice how simple structure emerges when each variable loads highly on one axis and near-zero on the other.
Factor Scores
Once we have loadings, we often want to compute factor scores - estimated values of the latent factors for each observation. However, since factors are unobserved, scores must be estimated rather than computed exactly.
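The regression method (Thomson's method) estimates factor scores as:
$$\hat{\mathbf{f}} = \Lambda^\top (\Lambda \Lambda^\top + \Psi)^{-1} \mathbf{x}$$
Or equivalently using the correlation matrix $R$ (which the model reproduces as $\Lambda \Lambda^\top + \Psi$ for standardized variables):
$$\hat{\mathbf{f}} = \Lambda^\top R^{-1} \mathbf{x}$$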
This gives the best linear prediction of factor scores given the observed data.
Properties of factor scores:
- Factor scores are estimates with uncertainty (not exact like PC scores)
- Even for orthogonal factors, estimated scores may be slightly correlated
- Scores depend on the estimation method (regression, Bartlett's, etc.)
- Adding/removing variables changes scores (they're not invariant)
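The sketch below illustrates these properties on simulated data, computing regression scores alongside Bartlett scores for comparison. For simplicity it plugs in the true (invented) loadings rather than estimated ones, which is an assumption a real analysis could not make.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate standardized data from a made-up two-factor model
Lambda = np.array([[0.8, 0.0], [0.7, 0.0], [0.6, 0.0],
                   [0.0, 0.8], [0.0, 0.7], [0.0, 0.6]])
psi = 1.0 - np.sum(Lambda**2, axis=1)
n = 5000
F_true = rng.standard_normal((n, 2))
X = F_true @ Lambda.T + rng.standard_normal((n, 6)) * np.sqrt(psi)

# Regression (Thomson) scores: f_hat = Lambda^T Sigma^{-1} x, applied to each row
Sigma = Lambda @ Lambda.T + np.diag(psi)
W_reg = np.linalg.solve(Sigma, Lambda)            # (p, k) weights Sigma^{-1} Lambda
F_reg = X @ W_reg

# Bartlett scores: f_hat = (Lambda^T Psi^{-1} Lambda)^{-1} Lambda^T Psi^{-1} x
PinvL = Lambda / psi[:, None]                     # Psi^{-1} Lambda
W_bart = PinvL @ np.linalg.inv(Lambda.T @ PinvL)
F_bart = X @ W_bart

corr_true = np.corrcoef(F_true[:, 0], F_reg[:, 0])[0, 1]
print("corr(true, regression):", round(corr_true, 2))
print("score variances, regression:", np.round(F_reg.var(axis=0), 2))
print("score variances, Bartlett:  ", np.round(F_bart.var(axis=0), 2))
```

Regression scores are shrunk (variance below 1) while Bartlett scores are inflated (variance above 1), and neither correlates perfectly with the true factors, which is the indeterminacy the text describes.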
Real-World Applications
Psychology: Intelligence and Personality
Factor Analysis revolutionized psychology:
- Intelligence Testing: Spearman's g-factor explains correlations among cognitive tests. Modern IQ tests are built on factor-analytic foundations, distinguishing general intelligence from specific abilities.
- Big Five Personality: The dominant personality model emerged from factor analysis of thousands of personality descriptors. Five factors (Openness, Conscientiousness, Extraversion, Agreeableness, Neuroticism) explain most variance in personality.
- Scale Development: Questionnaire items are factor analyzed to ensure they measure intended constructs. Items loading poorly are revised or removed.
Finance: Factor Models
Factor models are fundamental in quantitative finance:
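- Fama-French Factors: Stock returns are explained by market, size, and value factors. These factors are constructed from observable portfolio sorts rather than extracted statistically; statistical factor models, by contrast, apply factor analysis or PCA to returns to estimate latent risk exposures.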
- Risk Decomposition: Portfolio risk is decomposed into factor exposures (market risk, sector risk, style risk) and idiosyncratic risk.
- Statistical Arbitrage: Factor models identify mispricings when returns deviate from factor-predicted values.
Marketing: Customer Segmentation
Factor Analysis helps understand customer behavior:
- Brand Perception: Survey items about brands are factor analyzed to identify underlying perception dimensions (quality, value, prestige, etc.).
- Purchase Behavior: Product purchase patterns reveal latent shopping styles or need states.
- Customer Segmentation: Factor scores define segments based on underlying motivations, not just demographic characteristics.
Deep Learning Connections
Factor Analysis is the classical ancestor of modern latent variable models in deep learning. The conceptual framework - latent factors generating observed data - underlies many contemporary methods.
Latent Variable Models
The factor model is a linear latent variable model. Modern deep learning extends this in two ways:
- Nonlinear decoders: Replace $\Lambda \mathbf{f}$ with a neural network: $\mathbf{x} = g_\theta(\mathbf{z}) + \boldsymbol{\varepsilon}$
- Nonlinear encoders: Map observations to latent distributions rather than point estimates
VAEs and Factor Analysis
Variational Autoencoders (VAEs) are the deep learning generalization of probabilistic Factor Analysis:
| Aspect | Factor Analysis | VAE |
|---|---|---|
| Decoder | Linear: Λf | Neural network: g_θ(z) |
| Encoder | Regression: Λ^T R^(-1) x | Neural network: q_φ(z ∣ x) |
| Prior | Standard normal | Standard normal (typically) |
| Inference | Closed-form (linear) | Amortized variational |
| Training | MLE or PF | ELBO maximization |
| Latent space | Continuous, uncorrelated | Continuous, regularized |
When the VAE decoder is linear and the observation noise is isotropic Gaussian, the generative model is exactly probabilistic PCA; with a diagonal (per-variable) noise covariance it is the Factor Analysis model itself.
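To make the linear-Gaussian case concrete, the block below writes out the generative model, the marginal distribution of the data, and the exact posterior over the latent variables. These are standard Gaussian identities rather than anything specific to this text; the posterior mean is the regression-method factor score, which is why no learned encoder is needed in the linear case.

```latex
\begin{aligned}
\mathbf{z} &\sim \mathcal{N}(\mathbf{0}, I), \qquad
\mathbf{x} \mid \mathbf{z} \sim \mathcal{N}(\Lambda \mathbf{z}, \Psi) \\
\Rightarrow\ \mathbf{x} &\sim \mathcal{N}\left(\mathbf{0},\ \Lambda \Lambda^\top + \Psi\right) \\
\mathbf{z} \mid \mathbf{x} &\sim \mathcal{N}\left( \Lambda^\top \Sigma^{-1} \mathbf{x},\ I - \Lambda^\top \Sigma^{-1} \Lambda \right),
\qquad \Sigma = \Lambda \Lambda^\top + \Psi
\end{aligned}
```

Setting $\Psi = \sigma^2 I$ gives probabilistic PCA; a general diagonal $\Psi$ gives Factor Analysis.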
Topic Models (LDA)
Latent Dirichlet Allocation (LDA) is "Factor Analysis for text":
- Documents (observations) are mixtures of topics (factors)
- Each topic is a distribution over words (like a loading pattern)
- Observed word counts are generated from the topic mixture
- Topics are latent - inferred from word co-occurrences
Just as Factor Analysis discovers personality factors from correlated questionnaire items, LDA discovers topics from words that co-occur in documents.
Word and Sentence Embeddings
Word embeddings (Word2Vec, GloVe) and contextual embeddings from models such as BERT can be viewed through the lens of latent factor models:
- Words are represented as vectors in a latent space
- Dimensions capture semantic factors (though not interpretable like FA factors)
- Co-occurrence patterns are explained by inner products of embeddings (like correlations explained by factor loadings)
The key difference: deep learning learns these latent spaces with massive scale and nonlinearity, but the conceptual framework echoes Factor Analysis.
The Bigger Picture: Factor Analysis established that observed correlations can arise from shared latent causes. This insight powers modern representation learning, from VAEs to transformers. Understanding Factor Analysis provides the conceptual foundation for interpreting what neural networks might be learning in their hidden layers.
Python Implementation
Factor Analysis from Scratch
Let's implement Factor Analysis from first principles to understand every step of the algorithm.
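One compact way to do this, sketched here with NumPy only, is maximum-likelihood estimation via the EM algorithm. The function name `fit_factor_analysis_em`, the small random initialization, the stopping rule, and the simulated data are choices made for this sketch; EM is one of several possible estimation routes and is not prescribed by the text above.

```python
import numpy as np

def fit_factor_analysis_em(X, k, n_iter=1000, tol=1e-8, seed=0):
    """Maximum-likelihood factor analysis via EM, working on the sample covariance."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc / n                              # ML estimate of the covariance
    rng = np.random.default_rng(seed)
    Lambda = 0.1 * rng.standard_normal((p, k))     # small random starting loadings
    psi = np.diag(S).copy()                        # start with all variance unique
    prev_ll = -np.inf
    for _ in range(n_iter):
        Sigma = Lambda @ Lambda.T + np.diag(psi)
        # E-step: beta maps an observation to its posterior factor mean, E[f | x] = beta @ x
        beta = np.linalg.solve(Sigma, Lambda).T    # (k, p), equals Lambda^T Sigma^{-1}
        Eff = np.eye(k) - beta @ Lambda + beta @ S @ beta.T   # average of E[f f^T | x]
        # M-step: closed-form updates for loadings and unique variances
        Lambda = S @ beta.T @ np.linalg.inv(Eff)
        psi = np.diag(S - Lambda @ beta @ S).copy()
        # Per-observation log-likelihood (constants dropped), used as stopping rule
        Sigma = Lambda @ Lambda.T + np.diag(psi)
        ll = -0.5 * (np.linalg.slogdet(Sigma)[1] + np.trace(np.linalg.solve(Sigma, S)))
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return Lambda, psi

# Usage on data simulated from made-up loadings
rng = np.random.default_rng(1)
true_L = np.array([[0.8, 0.0], [0.7, 0.0], [0.6, 0.0],
                   [0.0, 0.8], [0.0, 0.7], [0.0, 0.6]])
true_psi = 1.0 - np.sum(true_L**2, axis=1)
X = rng.standard_normal((5000, 2)) @ true_L.T \
    + rng.standard_normal((5000, 6)) * np.sqrt(true_psi)

Lambda_hat, psi_hat = fit_factor_analysis_em(X, k=2)
communalities = np.sum(Lambda_hat**2, axis=1)
print(np.round(communalities, 2))
print(np.round(psi_hat, 2))
```

The estimated loadings are unrotated, so they need not resemble `true_L` column by column; the communalities and unique variances are rotation-invariant and are the quantities to compare.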
Now let's see Factor Analysis applied to a personality survey example:
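As an illustration, the sketch below builds a synthetic six-item survey and runs a simplified pipeline on it: one-pass principal factor extraction, followed by a rotation found by searching over a single angle (possible because there are only two factors). The item names, sample size, and generating loadings are invented and do not come from real survey data.

```python
import numpy as np

rng = np.random.default_rng(42)
items = ["enjoys parties", "likes meeting people", "talkative",
         "organized", "punctual", "plans ahead"]

# Synthetic respondents: two independent traits drive three items each (assumed structure)
n = 1000
traits = rng.standard_normal((n, 2))   # columns: extraversion, conscientiousness
true_loadings = np.array([[0.8, 0.0], [0.7, 0.0], [0.6, 0.0],
                          [0.0, 0.8], [0.0, 0.7], [0.0, 0.6]])
noise_sd = np.sqrt(1.0 - np.sum(true_loadings**2, axis=1))
answers = traits @ true_loadings.T + rng.standard_normal((n, 6)) * noise_sd

# One-pass principal factor extraction with SMC communality estimates
R = np.corrcoef(answers, rowvar=False)
smc = 1.0 - 1.0 / np.diag(np.linalg.inv(R))
R_reduced = R.copy()
np.fill_diagonal(R_reduced, smc)
eigvals, eigvecs = np.linalg.eigh(R_reduced)
top = np.argsort(eigvals)[::-1][:2]
unrotated = eigvecs[:, top] * np.sqrt(eigvals[top])

def rotate(L, theta):
    """Rotate a two-column loading matrix by the angle theta (radians)."""
    c, s = np.cos(theta), np.sin(theta)
    return L @ np.array([[c, -s], [s, c]])

# The varimax criterion repeats every 90 degrees, so a grid on [0, pi/2] suffices
angles = np.linspace(0.0, np.pi / 2, 901)
scores = [np.sum(np.var(rotate(unrotated, t)**2, axis=0)) for t in angles]
rotated = rotate(unrotated, angles[int(np.argmax(scores))])

for name, row in zip(items, rotated):
    print(f"{name:22s} {row[0]: .2f} {row[1]: .2f}")
```

In the printed table, the three social items should load mainly on one factor and the three orderliness items on the other, which is the pattern one would label extraversion and conscientiousness. Signs and column order are arbitrary.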
Practical Tips and Gotchas
- Sample size matters: Minimum 100-200 observations. Ideally, 5-10 observations per variable. For stable solutions, larger is always better.
- Check communalities: Variables with communality < 0.3-0.4 don't fit the factor structure well. Consider removing them.
- Look for Heywood cases: Communalities ≥ 1 (equivalently, zero or negative unique variances) indicate model problems (overfitting, wrong number of factors, or problematic variables).
- Use multiple extraction methods: Compare Principal Factor and MLE results. Consistent solutions across methods are more trustworthy.
- Try multiple rotations: Varimax is a starting point, but Oblimin may give cleaner structure. Compare solutions.
- Interpret factors substantively: A factor is only meaningful if you can name it based on the high-loading variables. If it doesn't make sense, reconsider the number of factors.
- Report loadings completely: Don't hide low loadings - they matter for understanding simple structure.
- Consider theoretical constraints: If theory suggests correlated factors, use oblique rotation. If uncorrelated factors make sense, use orthogonal.
Limitations of Factor Analysis
- Rotational Indeterminacy: No unique solution exists. Different rotations give different (but mathematically equivalent) answers. This is both a feature and a limitation.
- Factor Score Indeterminacy: Factor scores are estimates with uncertainty. Different methods give different scores, and scores aren't invariant under variable additions.
- Linearity Assumption: Factor Analysis assumes linear relationships. Nonlinear latent structures require extensions (nonlinear FA, kernel methods, deep latent models).
- Number of Factors: Must be specified in advance. Different numbers can dramatically change interpretation.
- Sample Size Requirements: Requires substantial data for stable solutions. Small samples yield unstable, non-replicable factors.
- Assumes Common Factor Model: If the true structure isn't well-approximated by common factors, results may be misleading.
- Subjectivity in Interpretation: Naming factors requires judgment. Two researchers might interpret the same loadings differently.
- Not a Causal Discovery Method: Factor Analysis finds statistical patterns, not causal relationships. Factors are hypothetical constructs, not proven causes.
| Limitation | Consequence | Mitigation |
|---|---|---|
| Rotational indeterminacy | No unique solution | Use simple structure criteria, report interpretable solution |
| Factor score indeterminacy | Scores are estimates | Use scores cautiously, consider multiple methods |
| Linearity | Misses nonlinear relationships | Try nonlinear FA, kernel methods, or deep latent models |
| Factor number choice | Results change with k | Use multiple criteria, validate with theory |
| Sample size needs | Unstable solutions | Collect more data, use Bayesian FA for small n |
Practice Problems
- Conceptual: Explain the difference between communality and uniqueness. Why does Factor Analysis distinguish them while PCA does not?
- Mathematical: Show that the covariance structure $\Sigma = \Lambda \Lambda^\top + \Psi$ follows from the factor model assumptions.
- Computation: Implement the Principal Factor method from scratch. Compare your results with sklearn's FactorAnalysis on the Iris dataset.
- Rotation: Implement Varimax rotation. Start with an unrotated solution and verify that your rotation achieves simpler structure (higher variance of squared loadings).
- Comparison: Run both PCA and Factor Analysis on the same dataset. Compare the loadings/components. When do they differ most?
- Application: Find a personality survey dataset (e.g., from OpenPsychometrics) and perform Factor Analysis. Can you recover the Big Five?
- Factor Scores: Compute factor scores using both the regression method and Bartlett's method. How different are they? When does it matter?
- Deep Learning: Train a linear VAE on MNIST. Compare the learned decoder weights with Factor Analysis loadings. How similar are the latent structures?
Key Insights
- Factor Analysis models causation: Unlike PCA (which summarizes), FA hypothesizes that latent factors cause observed correlations. This causal perspective is powerful but requires theoretical justification.
- Communality distinguishes shared from unique: Only the common variance is factored; unique variance (specific + error) is separated. This is the key difference from PCA.
- Rotation is essential: Initial solutions are arbitrary. Rotation toward simple structure makes factors interpretable and meaningful.
- Orthogonal vs Oblique is a choice: Orthogonal (Varimax) keeps factors uncorrelated but may not achieve the cleanest structure. Oblique allows correlation but adds complexity.
- Factor scores are estimates: Unlike PCA scores which are exact, factor scores have uncertainty. They're indeterminate and method-dependent.
- FA connects to modern deep learning: VAEs, topic models, and embeddings are all descendants of the latent factor perspective. Understanding FA provides intuition for these methods.
- Interpretation requires judgment: Factor Analysis is not purely mechanical. Naming factors, choosing rotations, and deciding on the number of factors all require substantive knowledge.
- Limitations are real: Indeterminacy, linearity assumptions, and sample size requirements constrain what FA can discover. Know when to use alternatives.
The Essence of Factor Analysis: Correlations between observed variables are symptoms; latent factors are causes. FA asks: "What hidden constructs explain why these variables move together?" It's a fundamentally causal question, even if the answer is statistical. This perspective - that simple underlying causes generate complex observed patterns - is the conceptual foundation of modern representation learning. From Spearman's g-factor to VAE latent spaces, the insight remains: find the hidden structure that explains what we see.