Boo-AI — Master Artificial Intelligence by Building from Scratch

Learning Objectives

By the end of this section, you will master Factor Analysis - a powerful technique for discovering latent structures that explain correlations among observed variables. You will:

Understand the conceptual foundation of Factor Analysis: what latent variables are, why we care about them, and how they differ fundamentally from principal components.
Master the mathematical formulation of the factor model, including factor loadings, communalities, uniqueness, and the covariance structure it implies.
Apply estimation methods including the Principal Factor method and Maximum Likelihood estimation to extract factors from data.
Perform and interpret factor rotation using Varimax (orthogonal) and Oblimin (oblique) methods to achieve simple structure for interpretable solutions.
Connect Factor Analysis to modern deep learning: understand its relationship to VAEs, topic models, embeddings, and other latent variable models.
Implement Factor Analysis from scratch and apply it to real-world problems in psychology, finance, and marketing.

Why This Matters: Factor Analysis addresses a fundamental question in science and machine learning: what are the underlying causes that explain the correlations we observe? Unlike PCA (which summarizes data), Factor Analysis models data generation - it hypothesizes that hidden factors create the patterns we see. This causal perspective is essential for psychological measurement, economics, biology, and now powers modern latent variable models in deep learning including VAEs, topic models, and word embeddings.

The Big Picture

Historical Context

Factor Analysis was born from a fundamental question in psychology: What is intelligence? In 1904, Charles Spearman noticed something remarkable - students who did well on one cognitive test tended to do well on others, even when the tests seemed very different (vocabulary, mathematics, spatial reasoning).

Spearman hypothesized that a single underlying factor - which he called "g" (general intelligence) - caused performance on all tests. The correlations between tests existed because they all measured this common factor, plus some test-specific abilities. This was the birth of Factor Analysis.

Later, Louis Thurstone extended this work, proposing that intelligence wasn't a single factor but multiple "primary mental abilities." His work on factor rotation (1930s-1940s) established the mathematical framework we still use today. Factor Analysis became the dominant method in psychology for developing personality tests, intelligence measures, and attitude scales.

The Key Insight: Correlations between observed variables might exist because those variables share common causes. Factor Analysis formalizes this intuition mathematically: latent factors "generate" observed correlations.

Why Factor Analysis?

Consider this scenario: You administer a 50-question personality survey to 1,000 people. The questions cover various topics - social preferences, work habits, emotional responses. When you compute correlations between questions, you find patterns:

Questions about enjoying parties correlate with questions about meeting new people
Questions about organization correlate with questions about punctuality
Questions about anxiety correlate with questions about worry

Why do these correlations exist? Factor Analysis proposes an answer: there are latent traits (extroversion, conscientiousness, neuroticism) that influence how people answer multiple questions. The correlations are symptoms of these underlying causes.

This perspective differs fundamentally from PCA:

PCA asks: "How can I summarize these 50 questions into fewer numbers?"
Factor Analysis asks: "What underlying traits explain why these questions correlate?"

The Fundamental Question

Factor Analysis addresses a causal question: What latent structure generated the data we observe?

The model assumes:

There exist latent (hidden) factors that we cannot directly observe
Each observed variable is a linear combination of these factors plus unique variance
The correlations among observed variables arise because they share common factor influences

This generative perspective is powerful because it:

Explains why correlations exist (shared factors)
Distinguishes common variance from unique variance
Provides a model of the data-generating process
Enables theory testing about latent structures

PCA vs Factor Analysis

PCA and Factor Analysis are often confused because they both reduce dimensionality. But they answer fundamentally different questions and make different assumptions about the data.

Key Differences

Aspect	PCA	Factor Analysis
Goal	Maximize variance explained	Model latent structure generating correlations
Model	No model (data transformation)	X = Λf + ε (generative model)
Components/Factors	Observed (exact linear combinations)	Latent (unobserved, estimated)
Unique Variance	Not modeled (all variance is explained)	Explicitly modeled (ε term)
Covariance Structure	Σ = VΛV^T (full decomposition)	Σ = ΛΛ^T + Ψ (common + unique)
Rotation	Typically none (variance-ordered)	Essential for interpretation
Number of Components	Can use all (ordered by variance)	Must specify number of factors
Interpretability	Components are data summaries	Factors are hypothetical causes

The mathematical difference is crucial. In PCA:

\boldsymbol{\Sigma} = \mathbf{V} \boldsymbol{\Lambda} \mathbf{V}^T

The covariance matrix is fully decomposed - all variance is accounted for by the components. In Factor Analysis:

\boldsymbol{\Sigma} = \boldsymbol{\Lambda} \boldsymbol{\Lambda}^T + \boldsymbol{\Psi}

Where $\boldsymbol{\Psi}$ is a diagonal matrix of unique variances. Not all variance is common - some is specific to each variable.

When to Use Which

Use PCA When...	Use Factor Analysis When...
You want to reduce dimensionality for modeling	You want to understand latent structure
Components don't need interpretation	Factors must be interpretable constructs
You're preprocessing for ML algorithms	You're developing psychological scales
You want data compression	You want to explain correlations
Variables have measurement error mixed in	You want to separate common from unique variance
You need exact reconstruction	You accept that factors are estimates

Rule of Thumb: Use PCA for dimensionality reduction and preprocessing. Use Factor Analysis when you believe there are underlying constructs causing correlations, and you want to identify and interpret those constructs.

The Factor Model

Mathematical Formulation

The factor model expresses each observed variable as a linear combination of latent factors:

X_i = \lambda_{i1} f_1 + \lambda_{i2} f_2 + \cdots + \lambda_{ik} f_k + \varepsilon_i

In matrix notation for all p variables:

\mathbf{X} = \boldsymbol{\Lambda} \mathbf{f} + \boldsymbol{\varepsilon}

Where:

$\mathbf{X}$ = vector of p observed variables (standardized)
$\boldsymbol{\Lambda}$ = p × k matrix of factor loadings
$\mathbf{f}$ = vector of k latent common factors (k < p)
$\boldsymbol{\varepsilon}$ = vector of p unique factors (specific + error)

The model makes these key assumptions:

Factors are uncorrelated (for orthogonal solutions): $\text{Cov}(\mathbf{f}) = \mathbf{I}$
Unique factors are uncorrelated with each other: $\text{Cov}(\boldsymbol{\varepsilon}) = \boldsymbol{\Psi}$ (diagonal)
Factors and unique factors are uncorrelated: $\text{Cov}(\mathbf{f}, \boldsymbol{\varepsilon}) = \mathbf{0}$

Given these assumptions, the implied covariance structure is:

\boldsymbol{\Sigma} = \boldsymbol{\Lambda} \boldsymbol{\Lambda}^T + \boldsymbol{\Psi}

This decomposes total variance into:

$\boldsymbol{\Lambda} \boldsymbol{\Lambda}^T$ = common variance (explained by factors)
$\boldsymbol{\Psi}$ = unique variance (specific + error)

Geometric Interpretation

Geometrically, each factor defines a direction in variable space. The loading $\lambda_{ij}$ is the correlation between variable i and factor j (for standardized variables with orthogonal factors).

Think of it this way: if you could directly measure the hidden factors, the loading would tell you how strongly each observed variable correlates with each factor. Variables that load highly on the same factor are correlated because they share that common cause.

Model Assumptions

Factor Analysis makes strong assumptions that should be checked:

Linearity: Observed variables are linear combinations of factors. Nonlinear relationships aren't captured.
No factor-unique correlation: Common factors don't correlate with unique factors. Violated if measurement error depends on the construct.
Uncorrelated unique factors: Unique variances are independent. Violated if two variables share variance not captured by the factors.
Multivariate normality: For MLE estimation and certain tests. Violations affect standard errors and fit indices.

Communality and Uniqueness

Two critical concepts in Factor Analysis:

Communality ( $h_i^2$ ) is the proportion of variance in variable i explained by the common factors:

h_i^2 = \sum_{j=1}^{k} \lambda_{ij}^2

Uniqueness ( $\psi_i$ ) is the remaining variance not explained by common factors:

\psi_i = 1 - h_i^2

For standardized variables (variance = 1), the total variance equals communality plus uniqueness:

1 = h_i^2 + \psi_i

Interpretation: High communality means the variable fits well into the factor structure - most of its variance is explained by common factors. Low communality suggests the variable has substantial unique variance not shared with other variables - consider removing it from the analysis.

Interactive Factor Model Demo

Explore the factor model structure interactively. Observe how latent factors influence observed variables through factor loadings, and how communality and uniqueness partition variance.

Interactive Factor Model Visualization

Number of Factors: 2

Hidden constructs that explain correlations

Number of Observed Variables: 6

Measured variables in your dataset

Common Variance (Communality): 70%

Portion of variance explained by factors

Show Factor LoadingsShow Unique Variance

Key Concepts:

Factors (blue circles) are latent variables - unobserved causes
Loadings (purple arrows) show how strongly each factor influences each variable
Communality (green) = variance explained by common factors
Unique variance (orange) = specific + error variance

Try This: Set 2 factors and 6 variables to see a clean "simple structure" where each variable loads primarily on one factor. Then increase the common variance slider to see how communality increases (less orange in each variable).

Estimation Methods

Given observed data, how do we estimate the factor loadings $\boldsymbol{\Lambda}$ and unique variances $\boldsymbol{\Psi}$ ? There are several approaches.

Principal Factor Method

The Principal Factor (PF) method is an iterative approach that extends PCA to account for unique variance:

Estimate initial communalities: Use Squared Multiple Correlations (SMC) - the R² of each variable predicted from all others. This gives the portion of variance shared with other variables.
Construct reduced correlation matrix: Replace the 1s on the diagonal of R with communality estimates. This removes unique variance before factoring.
Extract factors: Perform eigendecomposition of the reduced matrix. Keep the top k eigenvectors.
Update communalities: Compute new communalities as sum of squared loadings for each variable.
Iterate: Repeat steps 2-4 until communalities converge.

The key insight: by using communalities instead of 1s on the diagonal, we factor only the common variance, not the unique variance. This distinguishes PF from PCA.

Maximum Likelihood Estimation

MLE finds loadings and unique variances that maximize the probability of observing the data, assuming multivariate normality:

\hat{\boldsymbol{\Lambda}}, \hat{\boldsymbol{\Psi}} = \arg\max_{\boldsymbol{\Lambda}, \boldsymbol{\Psi}} \mathcal{L}(\boldsymbol{\Lambda}, \boldsymbol{\Psi} | \mathbf{X})

Where the log-likelihood (ignoring constants) is:

\log \mathcal{L} = -\frac{n}{2} \left[ \log|\boldsymbol{\Sigma}| + \text{tr}(\mathbf{S} \boldsymbol{\Sigma}^{-1}) \right]

With $\boldsymbol{\Sigma} = \boldsymbol{\Lambda} \boldsymbol{\Lambda}^T + \boldsymbol{\Psi}$ .

MLE has advantages over PF:

Provides standard errors for loadings
Allows statistical tests (chi-square test of model fit)
Scale-invariant (same results for correlation or covariance)
Generally more efficient estimates

However, MLE requires larger samples (n > 100-200) and assumes normality.

How Many Factors to Extract?

Choosing the number of factors is crucial and often subjective. Common approaches:

Kaiser Criterion: Keep factors with eigenvalues > 1 (from the correlation matrix). Rationale: each factor should explain more variance than a single variable.
Scree Test: Plot eigenvalues and look for an "elbow." Factors before the elbow are retained; those after explain little additional variance.
Parallel Analysis: Compare eigenvalues to those from random data with the same dimensions. Keep factors with eigenvalues exceeding the random baseline.
Chi-Square Test (MLE only): Test whether adding another factor significantly improves fit.
Interpretability: Can you meaningfully interpret the factors? If a factor doesn't make sense, you may have extracted too many.
Theory: Prior knowledge about the number of constructs (e.g., Big Five personality theory predicts 5 factors).

Practical Advice: Use multiple criteria together. If the scree test, parallel analysis, and Kaiser criterion all suggest 4 factors, and those 4 factors are interpretable, you have convergent evidence. If methods disagree, try different numbers and see which solution is most interpretable and theoretically sensible.

Factor Rotation

The Rotation Problem

Factor solutions are not unique. Any orthogonal transformation of the factors yields an equally valid solution with the same communalities and reproduced correlations.

Mathematically, if $\boldsymbol{\Lambda}$ is a solution, then so is $\boldsymbol{\Lambda} \mathbf{T}$ for any orthogonal matrix T:

\boldsymbol{\Sigma} = \boldsymbol{\Lambda} \boldsymbol{\Lambda}^T = (\boldsymbol{\Lambda} \mathbf{T})(\boldsymbol{\Lambda} \mathbf{T})^T

This is called rotational indeterminacy. It's both a curse (no unique answer) and a blessing (we can choose the most interpretable orientation).

Simple Structure is the guiding principle for rotation. Thurstone defined simple structure as:

Each variable loads highly on one factor
Each variable has near-zero loadings on other factors
Each factor has some variables with high loadings and many with near-zero

Simple structure makes factors interpretable: each factor represents a distinct construct defined by specific variables.

Orthogonal Rotation (Varimax)

Varimax is the most popular orthogonal rotation. It maximizes the variance of squared loadings within each factor:

V = \frac{1}{p} \sum_{j=1}^{k} \left[ \sum_{i=1}^{p} \tilde{\lambda}_{ij}^4 - \frac{1}{p}\left(\sum_{i=1}^{p} \tilde{\lambda}_{ij}^2\right)^2 \right]

Where $\tilde{\lambda}_{ij} = \lambda_{ij} / h_i$ is the loading normalized by communality.

Effect of Varimax: Pushes loadings toward 0 or ±1, making each variable load clearly on one factor. Factors remain uncorrelated (90° apart).

Other orthogonal rotations:

Quartimax: Maximizes variance within variables (each variable loads on fewer factors)
Equamax: Balances Varimax and Quartimax criteria

Oblique Rotation (Oblimin)

Oblique rotations allow factors to correlate. This often achieves better simple structure because factors in reality may be related.

Oblimin (with parameter γ=0, called "Quartimin") minimizes:

Q = \sum_{j < k} \sum_i \tilde{\lambda}_{ij}^2 \tilde{\lambda}_{ik}^2

This penalizes variables with high loadings on multiple factors.

Trade-offs of oblique rotation:

Pro: Often achieves cleaner simple structure
Pro: More realistic if constructs are truly correlated
Con: Must interpret factor correlations in addition to loadings
Con: Need two matrices: pattern matrix (loadings) and structure matrix (correlations)

Interactive Rotation Demo

Explore how rotation transforms the loading space. Watch how variables cluster onto distinct factors as you rotate toward simple structure.

Interactive Factor Rotation

Variables

F1' axis

F2' axis

Rotation Type

Orthogonal (Varimax)Oblique (Oblimin)

Rotation Angle: 0°

Rotate axes to achieve simple structure

Goal of Rotation:

Achieve simple structure where each variable loads highly on one factor and near zero on others.

Try rotating ~45° to see variables cluster onto distinct factors!

Orthogonal vs Oblique:

Orthogonal: Factors remain uncorrelated (90° apart)
Oblique: Factors can be correlated (more flexible but complex)
Oblique often gives better simple structure but harder interpretation

Try This: Rotate to approximately 45° to see how variables separate into two clusters. Notice how simple structure emerges when each variable loads highly on one axis and near-zero on the other.

Factor Scores

Once we have loadings, we often want to compute factor scores - estimated values of the latent factors for each observation. However, since factors are unobserved, scores must be estimated rather than computed exactly.

The regression method (Thomson's method) estimates factor scores as:

\hat{\mathbf{f}} = \boldsymbol{\Lambda}^T (\boldsymbol{\Lambda} \boldsymbol{\Lambda}^T + \boldsymbol{\Psi})^{-1} \mathbf{x}

Or equivalently using the correlation matrix:

\hat{\mathbf{f}} = \boldsymbol{\Lambda}^T \mathbf{R}^{-1} \mathbf{x}

This gives the best linear prediction of factor scores given the observed data.

Properties of factor scores:

Factor scores are estimates with uncertainty (not exact like PC scores)
Even for orthogonal factors, estimated scores may be slightly correlated
Scores depend on the estimation method (regression, Bartlett's, etc.)
Adding/removing variables changes scores (they're not invariant)

Indeterminacy Warning: Factor scores are indeterminate - there is no unique "true" score. Different estimation methods give different scores, and even the same method on slightly different data gives different scores. This is a fundamental limitation of Factor Analysis versus PCA.

Real-World Applications

Psychology: Intelligence and Personality

Factor Analysis revolutionized psychology:

Intelligence Testing: Spearman's g-factor explains correlations among cognitive tests. Modern IQ tests are built on factor-analytic foundations, distinguishing general intelligence from specific abilities.
Big Five Personality: The dominant personality model emerged from factor analysis of thousands of personality descriptors. Five factors (Openness, Conscientiousness, Extraversion, Agreeableness, Neuroticism) explain most variance in personality.
Scale Development: Questionnaire items are factor analyzed to ensure they measure intended constructs. Items loading poorly are revised or removed.

Finance: Factor Models

Factor models are fundamental in quantitative finance:

Fama-French Factors: Stock returns are explained by market, size, and value factors. Factor analysis identifies these latent risk exposures.
Risk Decomposition: Portfolio risk is decomposed into factor exposures (market risk, sector risk, style risk) and idiosyncratic risk.
Statistical Arbitrage: Factor models identify mispricings when returns deviate from factor-predicted values.

Marketing: Customer Segmentation

Factor Analysis helps understand customer behavior:

Brand Perception: Survey items about brands are factor analyzed to identify underlying perception dimensions (quality, value, prestige, etc.).
Purchase Behavior: Product purchase patterns reveal latent shopping styles or need states.
Customer Segmentation: Factor scores define segments based on underlying motivations, not just demographic characteristics.

Deep Learning Connections

Factor Analysis is the classical ancestor of modern latent variable models in deep learning. The conceptual framework - latent factors generating observed data - underlies many contemporary methods.

Latent Variable Models

The factor model $\mathbf{X} = \boldsymbol{\Lambda} \mathbf{f} + \boldsymbol{\varepsilon}$ is a linear latent variable model. Modern deep learning extends this in two ways:

Nonlinear decoders: Replace $\boldsymbol{\Lambda} \mathbf{f}$ with a neural network: $\mathbf{X} = g_\theta(\mathbf{f}) + \boldsymbol{\varepsilon}$
Nonlinear encoders: Map observations to latent distributions rather than point estimates

VAEs and Factor Analysis

Variational Autoencoders (VAEs) are the deep learning generalization of probabilistic Factor Analysis:

Aspect	Factor Analysis	VAE
Decoder	Linear: Λf	Neural network: g_θ(z)
Encoder	Regression: Λ^T R^(-1) x	Neural network: q_φ(z\|x)
Prior	Standard normal	Standard normal (typically)
Inference	Closed-form (linear)	Amortized variational
Training	MLE or PF	ELBO maximization
Latent space	Continuous, uncorrelated	Continuous, regularized

When the VAE decoder is linear and the noise is Gaussian, the VAE reduces to probabilistic PCA (closely related to Factor Analysis).

Topic Models (LDA)

Latent Dirichlet Allocation (LDA) is "Factor Analysis for text":

Documents (observations) are mixtures of topics (factors)
Each topic is a distribution over words (like a loading pattern)
Observed word counts are generated from the topic mixture
Topics are latent - inferred from word co-occurrences

Just as Factor Analysis discovers personality factors from correlated questionnaire items, LDA discovers topics from words that co-occur in documents.

Word and Sentence Embeddings

Word embeddings (Word2Vec, GloVe) and sentence embeddings (BERT) can be viewed through the lens of latent factor models:

Words are represented as vectors in a latent space
Dimensions capture semantic factors (though not interpretable like FA factors)
Co-occurrence patterns are explained by inner products of embeddings (like correlations explained by factor loadings)

The key difference: deep learning learns these latent spaces with massive scale and nonlinearity, but the conceptual framework echoes Factor Analysis.

The Bigger Picture: Factor Analysis established that observed correlations can arise from shared latent causes. This insight powers modern representation learning, from VAEs to transformers. Understanding Factor Analysis provides the conceptual foundation for interpreting what neural networks might be learning in their hidden layers.

Python Implementation

Factor Analysis from Scratch

Let's implement Factor Analysis from first principles to understand every step of the algorithm.

Complete Factor Analysis Implementation

🐍factor_analysis_from_scratch.py

Explanation(21)

Code(292)

1Import NumPy

NumPy provides the matrix operations essential for Factor Analysis: eigendecomposition, matrix multiplication, and linear algebra.

6Factor Analysis Class

Encapsulates the Factor Analysis algorithm following scikit-learn API patterns. Factor Analysis is fundamentally different from PCA - it explicitly models latent factors.

14The Factor Model Equation

X = Λf + ε is the core equation. Observed variables X are linear combinations of latent factors f with loadings Λ, plus unique errors ε.

EXAMPLE

If X1 = 0.8*F1 + 0.2*F2 + ε₁, then loadings are [0.8, 0.2]

21Constructor Parameters

n_factors specifies how many latent dimensions to extract. rotation determines how to orient factors for interpretability. Unlike PCA, factor number must be chosen upfront.

35Key Fitted Attributes

loadings_ stores the Λ matrix showing variable-factor relationships. communalities_ shows variance explained by factors. uniquenesses_ captures specific variance not explained by common factors.

50Correlation Matrix

Factor Analysis uses correlation (not covariance) matrix. This standardizes all variables to unit variance, making loadings comparable across variables with different scales.

54Initial Communality Estimates

Squared Multiple Correlations (SMC) provide good initial estimates. SMC = R² when predicting each variable from all others. This starts the iterative algorithm.

EXAMPLE

If variable X1 has SMC=0.7, then 70% of its variance is shared with other variables

61Reduced Correlation Matrix

Key difference from PCA: diagonal contains communalities, not 1s. This removes unique variance before extracting factors - we only factor the common variance.

64Eigendecomposition

Like PCA, we use eigendecomposition. But on the REDUCED matrix. Only common variance is decomposed, not total variance.

74Factor Loadings Formula

Λ = V * √Λ_eigenvalues. Each column is a factor, rows are variables. Loading Λ_ij shows how strongly variable i relates to factor j.

77Update Communalities

Communality h² = sum of squared loadings for each variable. This is the proportion of variance explained by all factors combined.

EXAMPLE

If loadings are [0.8, 0.3], communality = 0.64 + 0.09 = 0.73

90Factor Rotation

After extraction, factors are rotated for interpretability. Rotation changes loadings but not communalities - it redistributes variance among factors.

100Varimax Rotation

Most popular orthogonal rotation. Maximizes variance of squared loadings WITHIN each factor, pushing variables to load highly on one factor only (simple structure).

107Normalize by Communality

Kaiser normalization: divide loadings by √(communality). This gives equal weight to all variables during rotation, regardless of communality differences.

115Pairwise Rotation

Varimax rotates factors pairwise. For each pair (i,j), find the angle φ that maximizes the criterion. This is analytically solvable.

125Optimal Rotation Angle

The angle φ is computed from the quartic criterion. The formula involves sums of squared and cross-products of loadings.

152Oblimin Rotation

Oblique rotation allows correlated factors. Often achieves better simple structure but harder to interpret. Common in psychology where constructs may correlate.

165Factor Scores

Regression method: F = X @ R⁻¹ @ Λ. Unlike PCA, factor scores are ESTIMATED (not exact) because factors are unobserved latent variables.

182Variance Explained

Each factor explains variance = sum of its squared loadings. Unlike PCA, factors don't naturally order by variance - rotation changes this.

195Synthetic Data Generation

We create data with known factor structure to verify our implementation. True loadings show 3 groups of variables, each loading on one factor.

219Data Generation Formula

X = F @ Λ^T + ε generates observations. Factors F are latent, loadings Λ define structure, ε adds unique/error variance.

271 lines without explanation

1import numpy as np
2from scipy import stats
3from typing import Optional, Tuple
4
5class FactorAnalysis:
6    """
7    Factor Analysis implementation from scratch.
8
9    Factor Analysis models observed variables as linear combinations
10    of latent factors plus unique variance components.
11
12    Model: X = Λf + ε
13    Where:
14      - X: observed variables (p-dimensional)
15      - Λ: factor loading matrix (p × k)
16      - f: latent factors (k-dimensional, k < p)
17      - ε: unique factors/errors (p-dimensional, uncorrelated)
18    """
19
20    def __init__(
21        self,
22        n_factors: int,
23        rotation: str = "varimax",
24        max_iter: int = 1000,
25        tol: float = 1e-6
26    ):
27        """
28        Initialize Factor Analysis.
29
30        Args:
31            n_factors: Number of latent factors to extract
32            rotation: Rotation method ('none', 'varimax', 'oblimin')
33            max_iter: Maximum iterations for iterative estimation
34            tol: Convergence tolerance
35        """
36        self.n_factors = n_factors
37        self.rotation = rotation
38        self.max_iter = max_iter
39        self.tol = tol
40
41        # Fitted attributes
42        self.loadings_ = None        # Factor loading matrix Λ
43        self.communalities_ = None   # h² for each variable
44        self.uniquenesses_ = None    # ψ² = 1 - h² (unique variance)
45        self.rotation_matrix_ = None # Rotation matrix T
46        self.mean_ = None            # Variable means
47
48    def fit(self, X: np.ndarray) -> 'FactorAnalysis':
49        """
50        Fit Factor Analysis model using Principal Factor method.
51
52        Args:
53            X: Data matrix of shape (n_samples, n_features)
54
55        Returns:
56            self: Fitted FactorAnalysis instance
57        """
58        n_samples, n_features = X.shape
59        k = self.n_factors
60
61        # Step 1: Center the data
62        self.mean_ = np.mean(X, axis=0)
63        X_centered = X - self.mean_
64
65        # Step 2: Compute correlation matrix
66        # (Use correlation for standardized analysis)
67        R = np.corrcoef(X_centered, rowvar=False)
68
69        # Step 3: Initial communality estimates
70        # Use squared multiple correlations (SMC)
71        R_inv = np.linalg.inv(R)
72        communalities = 1 - 1 / np.diag(R_inv)
73
74        # Step 4: Iterative Principal Factor extraction
75        for iteration in range(self.max_iter):
76            old_communalities = communalities.copy()
77
78            # Replace diagonal of R with communalities
79            R_reduced = R.copy()
80            np.fill_diagonal(R_reduced, communalities)
81
82            # Eigendecomposition of reduced correlation matrix
83            eigenvalues, eigenvectors = np.linalg.eigh(R_reduced)
84
85            # Sort descending
86            idx = np.argsort(eigenvalues)[::-1]
87            eigenvalues = eigenvalues[idx]
88            eigenvectors = eigenvectors[:, idx]
89
90            # Keep only positive eigenvalues and top k factors
91            positive_mask = eigenvalues > 0
92            eigenvalues = eigenvalues[positive_mask][:k]
93            eigenvectors = eigenvectors[:, positive_mask][:, :k]
94
95            # Compute factor loadings: Λ = V * sqrt(Λ_eigenvalues)
96            loadings = eigenvectors * np.sqrt(eigenvalues)
97
98            # Update communalities: h² = sum of squared loadings
99            communalities = np.sum(loadings ** 2, axis=1)
100            communalities = np.clip(communalities, 0, 0.999)
101
102            # Check convergence
103            if np.max(np.abs(communalities - old_communalities)) < self.tol:
104                break
105
106        self.loadings_ = loadings
107        self.communalities_ = communalities
108        self.uniquenesses_ = 1 - communalities
109
110        # Step 5: Apply rotation if specified
111        if self.rotation != "none" and k > 1:
112            self._rotate_factors()
113
114        return self
115
116    def _rotate_factors(self):
117        """Apply factor rotation to achieve simple structure."""
118        if self.rotation == "varimax":
119            self._varimax_rotation()
120        elif self.rotation == "oblimin":
121            self._oblimin_rotation()
122
123    def _varimax_rotation(self, gamma: float = 1.0):
124        """
125        Orthogonal Varimax rotation.
126
127        Maximizes variance of squared loadings within factors,
128        pushing each variable to load highly on one factor.
129        """
130        loadings = self.loadings_.copy()
131        n_vars, n_factors = loadings.shape
132
133        # Normalize loadings by communality
134        h = np.sqrt(self.communalities_)
135        normalized = loadings / h[:, np.newaxis]
136
137        rotation_matrix = np.eye(n_factors)
138
139        for _ in range(self.max_iter):
140            old_loadings = normalized.copy()
141
142            for i in range(n_factors - 1):
143                for j in range(i + 1, n_factors):
144                    # Extract columns i and j
145                    x = normalized[:, i]
146                    y = normalized[:, j]
147
148                    # Compute rotation angle
149                    u = x ** 2 - y ** 2
150                    v = 2 * x * y
151                    A = np.sum(u)
152                    B = np.sum(v)
153                    C = np.sum(u ** 2 - v ** 2)
154                    D = 2 * np.sum(u * v)
155
156                    # Optimal rotation angle
157                    num = D - 2 * A * B / n_vars
158                    denom = C - (A ** 2 - B ** 2) / n_vars
159                    phi = 0.25 * np.arctan2(num, denom)
160
161                    # Apply rotation
162                    cos_phi = np.cos(phi)
163                    sin_phi = np.sin(phi)
164
165                    normalized[:, i] = x * cos_phi + y * sin_phi
166                    normalized[:, j] = -x * sin_phi + y * cos_phi
167
168                    # Update rotation matrix
169                    rot = np.eye(n_factors)
170                    rot[i, i] = cos_phi
171                    rot[j, j] = cos_phi
172                    rot[i, j] = sin_phi
173                    rot[j, i] = -sin_phi
174                    rotation_matrix = rotation_matrix @ rot
175
176            # Check convergence
177            if np.max(np.abs(normalized - old_loadings)) < self.tol:
178                break
179
180        # Denormalize and store results
181        self.loadings_ = normalized * h[:, np.newaxis]
182        self.rotation_matrix_ = rotation_matrix
183
184    def _oblimin_rotation(self, gamma: float = 0.0):
185        """
186        Oblique Oblimin rotation (simplified version).
187
188        Allows factors to be correlated for better simple structure.
189        gamma=0 gives quartimin rotation.
190        """
191        # For simplicity, use varimax as starting point
192        # and then allow small correlations
193        self._varimax_rotation()
194
195        # Note: Full oblimin implementation would require
196        # gradient-based optimization with factor correlations.
197        # This is a placeholder for the full algorithm.
198
199    def transform(self, X: np.ndarray) -> np.ndarray:
200        """
201        Compute factor scores using regression method.
202
203        Args:
204            X: Data to transform of shape (n_samples, n_features)
205
206        Returns:
207            Factor scores of shape (n_samples, n_factors)
208        """
209        X_centered = X - self.mean_
210
211        # Standardize
212        std = np.std(X_centered, axis=0)
213        X_standardized = X_centered / std
214
215        # Compute correlation matrix
216        R = np.corrcoef(X_centered, rowvar=False)
217
218        # Factor scores: F = X @ R^(-1) @ Λ
219        R_inv = np.linalg.inv(R)
220        factor_scores = X_standardized @ R_inv @ self.loadings_
221
222        return factor_scores
223
224    def fit_transform(self, X: np.ndarray) -> np.ndarray:
225        """Fit and transform in one step."""
226        return self.fit(X).transform(X)
227
228    def get_factor_variance(self) -> Tuple[np.ndarray, np.ndarray]:
229        """
230        Compute variance explained by each factor.
231
232        Returns:
233            Tuple of (variance per factor, cumulative variance proportion)
234        """
235        # Variance = sum of squared loadings for each factor
236        variance = np.sum(self.loadings_ ** 2, axis=0)
237        total_variance = self.loadings_.shape[0]  # Total = number of variables
238        proportion = variance / total_variance
239        cumulative = np.cumsum(proportion)
240        return proportion, cumulative
241
242
243# Example usage
244if __name__ == "__main__":
245    np.random.seed(42)
246
247    # Generate data with known factor structure
248    n_samples = 500
249    n_variables = 9
250    n_true_factors = 3
251
252    # True factor loadings (3 groups of variables)
253    true_loadings = np.array([
254        [0.9, 0.1, 0.1],  # Variables 1-3 load on Factor 1
255        [0.8, 0.2, 0.1],
256        [0.85, 0.15, 0.1],
257        [0.1, 0.9, 0.1],  # Variables 4-6 load on Factor 2
258        [0.2, 0.8, 0.15],
259        [0.1, 0.85, 0.1],
260        [0.1, 0.1, 0.9],  # Variables 7-9 load on Factor 3
261        [0.15, 0.2, 0.8],
262        [0.1, 0.15, 0.85],
263    ])
264
265    # Generate latent factors
266    factors = np.random.randn(n_samples, n_true_factors)
267
268    # Generate unique variances
269    unique_var = 1 - np.sum(true_loadings ** 2, axis=1)
270    unique_noise = np.random.randn(n_samples, n_variables) * np.sqrt(unique_var)
271
272    # Generate observed data: X = F @ Λ^T + ε
273    X = factors @ true_loadings.T + unique_noise
274
275    # Fit Factor Analysis
276    fa = FactorAnalysis(n_factors=3, rotation="varimax")
277    factor_scores = fa.fit_transform(X)
278
279    # Results
280    print("Factor Analysis Results")
281    print("=" * 50)
282    print(f"\nExtracted {fa.n_factors} factors")
283    print(f"\nFactor Loadings (after Varimax rotation):")
284    print(np.round(fa.loadings_, 3))
285    print(f"\nCommunalities (variance explained per variable):")
286    print(np.round(fa.communalities_, 3))
287    print(f"\nUniqueness (specific + error variance):")
288    print(np.round(fa.uniquenesses_, 3))
289
290    var_prop, var_cum = fa.get_factor_variance()
291    print(f"\nVariance explained per factor: {np.round(var_prop * 100, 1)}%")
292    print(f"Cumulative variance: {np.round(var_cum * 100, 1)}%")

Now let's see Factor Analysis applied to a personality survey example:

Practical Example: Big Five Personality

🐍factor_analysis_personality.py

Explanation(13)

Code(159)

5Big Five Personality Model

Classic FA application: personality questionnaires measure 5 latent traits (OCEAN). Each trait manifests through multiple correlated survey items.

15True Loading Structure

We simulate the ground truth: 3 questions per factor with strong loadings (~0.7-0.85). This represents a clean factor structure.

21Cross-Loadings

Real data has cross-loadings: questions partially relate to multiple factors. We add small (0.15) cross-loadings for realism.

28Unique Variance

Uniqueness = 1 - communality = variance specific to each question plus measurement error. This is what distinguishes FA from PCA.

37Standardization

Factor Analysis requires standardized data (mean=0, std=1). This ensures loadings are correlation-based and comparable.

41Sklearn Factor Analysis

sklearn provides FactorAnalysis with varimax rotation. n_components is the number of factors to extract.

49Loading Matrix

loadings = fa.components_.T gives the (questions × factors) matrix. Each column is a factor, rows are variables.

63Interpreting Loadings

Loadings > 0.5 (marked with *) indicate strong variable-factor relationships. This is the pattern we interpret.

75Variance Per Factor

Sum of squared loadings for each factor. Unlike PCA, factors aren't ordered by variance after rotation.

87Communalities

h² > 0.5 is considered good. Low communality means the variable doesn't fit the factor model well - consider removing it.

95Loading Heatmap

Visualization shows simple structure: each question loads highly (red/blue) on one factor and near zero (white) on others.

115Factor Variance Plot

Bar chart shows variance per factor; line shows cumulative. Unlike PCA, factors may have similar variance after rotation.

124Factor Score Scatter

Each respondent has a score on each factor. Plotting shows individual differences on latent dimensions.

146 lines without explanation

1import numpy as np
2import matplotlib.pyplot as plt
3from sklearn.decomposition import FactorAnalysis
4from sklearn.preprocessing import StandardScaler
5
6# Example: Factor Analysis on Personality Data
7# Big Five personality traits manifest through specific questions
8
9# Simulate survey data (500 respondents, 15 questions)
10# True structure: 5 factors (Openness, Conscientiousness, Extraversion,
11#                          Agreeableness, Neuroticism - "OCEAN")
12np.random.seed(42)
13n_respondents = 500
14n_questions = 15
15n_true_factors = 5
16
17# Each factor influences 3 questions
18true_loadings = np.zeros((n_questions, n_true_factors))
19for i in range(n_true_factors):
20    for j in range(3):
21        question_idx = i * 3 + j
22        true_loadings[question_idx, i] = 0.7 + 0.15 * np.random.rand()
23
24# Add small cross-loadings
25true_loadings += 0.15 * np.random.rand(n_questions, n_true_factors)
26
27# Generate latent factor scores for each person
28factor_scores_true = np.random.randn(n_respondents, n_true_factors)
29
30# Generate unique/error variance
31uniquenesses = 1 - np.sum(true_loadings ** 2, axis=1)
32uniquenesses = np.clip(uniquenesses, 0.1, 0.9)
33errors = np.random.randn(n_respondents, n_questions) * np.sqrt(uniquenesses)
34
35# Generate survey responses: X = F @ Λ' + ε
36X = factor_scores_true @ true_loadings.T + errors
37
38# Standardize data (important for Factor Analysis)
39scaler = StandardScaler()
40X_scaled = scaler.fit_transform(X)
41
42# Fit Factor Analysis with sklearn
43fa = FactorAnalysis(n_components=5, rotation='varimax', random_state=42)
44factor_scores = fa.fit_transform(X_scaled)
45
46print("=" * 60)
47print("FACTOR ANALYSIS: Personality Survey Example")
48print("=" * 60)
49
50# Display factor loadings as a heatmap
51loadings = fa.components_.T  # Shape: (n_questions, n_factors)
52question_labels = [f"Q{i+1}" for i in range(n_questions)]
53factor_labels = ["Openness", "Conscientiousness", "Extraversion",
54                "Agreeableness", "Neuroticism"]
55
56print("\nFactor Loadings Matrix:")
57print("-" * 60)
58print(f"{'Question':<10}", end="")
59for f in factor_labels:
60    print(f"{f[:6]:>10}", end="")
61print()
62print("-" * 60)
63
64for i, q in enumerate(question_labels):
65    print(f"{q:<10}", end="")
66    for j in range(5):
67        loading = loadings[i, j]
68        # Highlight strong loadings
69        if abs(loading) > 0.5:
70            print(f"{loading:>10.3f}*", end="")
71        else:
72            print(f"{loading:>10.3f}", end="")
73    print()
74
75# Compute variance explained
76variance_explained = np.sum(loadings ** 2, axis=0)
77total_variance = n_questions
78variance_prop = variance_explained / total_variance
79
80print("\n" + "=" * 60)
81print("Variance Explained by Each Factor:")
82print("-" * 60)
83for i, (name, var) in enumerate(zip(factor_labels, variance_prop)):
84    bar = "█" * int(var * 50)
85    print(f"{name:<15}: {var*100:5.1f}% {bar}")
86print(f"{'Total':<15}: {sum(variance_prop)*100:5.1f}%")
87
88# Communalities (variance explained per question)
89communalities = np.sum(loadings ** 2, axis=1)
90print("\n" + "=" * 60)
91print("Communalities (% variance explained per question):")
92print("-" * 60)
93for i, (q, h2) in enumerate(zip(question_labels, communalities)):
94    bar = "█" * int(h2 * 30)
95    status = "Good" if h2 > 0.5 else "Low"
96    print(f"{q}: {h2*100:5.1f}% {bar} ({status})")
97
98# Visualization
99fig, axes = plt.subplots(1, 3, figsize=(15, 5))
100
101# 1. Factor Loadings Heatmap
102ax = axes[0]
103im = ax.imshow(loadings, cmap='RdBu_r', aspect='auto', vmin=-1, vmax=1)
104ax.set_xticks(range(5))
105ax.set_xticklabels([f[:3] for f in factor_labels], rotation=45)
106ax.set_yticks(range(n_questions))
107ax.set_yticklabels(question_labels)
108ax.set_title('Factor Loadings Heatmap')
109ax.set_xlabel('Factors')
110ax.set_ylabel('Questions')
111plt.colorbar(im, ax=ax, label='Loading')
112
113# Add loading values as text
114for i in range(n_questions):
115    for j in range(5):
116        if abs(loadings[i, j]) > 0.3:
117            ax.text(j, i, f'{loadings[i, j]:.2f}',
118                   ha='center', va='center', fontsize=8,
119                   color='white' if abs(loadings[i, j]) > 0.5 else 'black')
120
121# 2. Scree-like plot for factors
122ax = axes[1]
123ax.bar(range(1, 6), variance_prop, color='steelblue', alpha=0.7)
124ax.plot(range(1, 6), np.cumsum(variance_prop), 'ro-', linewidth=2)
125ax.axhline(y=0.1, color='gray', linestyle='--', alpha=0.5)
126ax.set_xlabel('Factor')
127ax.set_ylabel('Proportion of Variance')
128ax.set_title('Variance Explained by Factors')
129ax.set_xticks(range(1, 6))
130ax.legend(['Cumulative', 'Individual'], loc='center right')
131
132# 3. Factor scores scatter (first two factors)
133ax = axes[2]
134scatter = ax.scatter(factor_scores[:, 0], factor_scores[:, 1],
135                    c=factor_scores[:, 2], cmap='viridis', alpha=0.6, s=30)
136ax.set_xlabel(f'Factor 1: {factor_labels[0]}')
137ax.set_ylabel(f'Factor 2: {factor_labels[1]}')
138ax.set_title('Respondents in Factor Space')
139ax.axhline(y=0, color='gray', linestyle='-', alpha=0.3)
140ax.axvline(x=0, color='gray', linestyle='-', alpha=0.3)
141plt.colorbar(scatter, ax=ax, label=f'Factor 3 ({factor_labels[2][:3]})')
142
143plt.tight_layout()
144plt.savefig('factor_analysis_results.png', dpi=150, bbox_inches='tight')
145plt.show()
146
147print("\n" + "=" * 60)
148print("INTERPRETATION GUIDE:")
149print("=" * 60)
150print("""
151- Questions Q1-Q3 load on Factor 1 (Openness)
152- Questions Q4-Q6 load on Factor 2 (Conscientiousness)
153- Questions Q7-Q9 load on Factor 3 (Extraversion)
154- Questions Q10-Q12 load on Factor 4 (Agreeableness)
155- Questions Q13-Q15 load on Factor 5 (Neuroticism)
156
157Each factor represents a latent personality dimension that
158manifests through correlated responses to related questions.
159""")

Practical Tips and Gotchas

Sample size matters: Minimum 100-200 observations. Ideally, 5-10 observations per variable. For stable solutions, larger is always better.
Check communalities: Variables with communality < 0.3-0.4 don't fit the factor structure well. Consider removing them.
Look for Heywood cases: Communalities > 1 indicate model problems (overfitting, wrong number of factors, or problematic variables).
Use multiple extraction methods: Compare Principal Factor and MLE results. Consistent solutions across methods are more trustworthy.
Try multiple rotations: Varimax is a starting point, but Oblimin may give cleaner structure. Compare solutions.
Interpret factors substantively: A factor is only meaningful if you can name it based on the high-loading variables. If it doesn't make sense, reconsider the number of factors.
Report loadings completely: Don't hide low loadings - they matter for understanding simple structure.
Consider theoretical constraints: If theory suggests correlated factors, use oblique rotation. If uncorrelated factors make sense, use orthogonal.

Common Mistake: Treating Factor Analysis like PCA. They have different goals, different models, and different assumptions. PCA components are exact; factors are estimates. PCA explains all variance; FA separates common from unique variance. Don't interchange them.

Limitations of Factor Analysis

Rotational Indeterminacy: No unique solution exists. Different rotations give different (but mathematically equivalent) answers. This is both a feature and a limitation.
Factor Score Indeterminacy: Factor scores are estimates with uncertainty. Different methods give different scores, and scores aren't invariant under variable additions.
Linearity Assumption: Factor Analysis assumes linear relationships. Nonlinear latent structures require extensions (ICA, nonlinear FA, deep latent models).
Number of Factors: Must be specified in advance. Different numbers can dramatically change interpretation.
Sample Size Requirements: Requires substantial data for stable solutions. Small samples yield unstable, non-replicable factors.
Assumes Common Factor Model: If the true structure isn't well-approximated by common factors, results may be misleading.
Subjectivity in Interpretation: Naming factors requires judgment. Two researchers might interpret the same loadings differently.
Not a Causal Discovery Method: Factor Analysis finds statistical patterns, not causal relationships. Factors are hypothetical constructs, not proven causes.

Limitation	Consequence	Mitigation
Rotational indeterminacy	No unique solution	Use simple structure criteria, report interpretable solution
Factor score indeterminacy	Scores are estimates	Use scores cautiously, consider multiple methods
Linearity	Misses nonlinear relationships	Try ICA, kernel methods, or deep latent models
Factor number choice	Results change with k	Use multiple criteria, validate with theory
Sample size needs	Unstable solutions	Collect more data, use Bayesian FA for small n

Practice Problems

Conceptual: Explain the difference between communality and uniqueness. Why does Factor Analysis distinguish them while PCA does not?
Mathematical: Show that the covariance structure $\boldsymbol{\Sigma} = \boldsymbol{\Lambda} \boldsymbol{\Lambda}^T + \boldsymbol{\Psi}$ follows from the factor model assumptions.
Computation: Implement the Principal Factor method from scratch. Compare your results with sklearn's FactorAnalysis on the Iris dataset.
Rotation: Implement Varimax rotation. Start with an unrotated solution and verify that your rotation achieves simpler structure (higher variance of squared loadings).
Comparison: Run both PCA and Factor Analysis on the same dataset. Compare the loadings/components. When do they differ most?
Application: Find a personality survey dataset (e.g., from OpenPsychometrics) and perform Factor Analysis. Can you recover the Big Five?
Factor Scores: Compute factor scores using both the regression method and Bartlett's method. How different are they? When does it matter?
Deep Learning: Train a linear VAE on MNIST. Compare the learned decoder weights with Factor Analysis loadings. How similar are the latent structures?

Key Insights

Factor Analysis models causation: Unlike PCA (which summarizes), FA hypothesizes that latent factors cause observed correlations. This causal perspective is powerful but requires theoretical justification.
Communality distinguishes shared from unique: Only the common variance is factored; unique variance (specific + error) is separated. This is the key difference from PCA.
Rotation is essential: Initial solutions are arbitrary. Rotation toward simple structure makes factors interpretable and meaningful.
Orthogonal vs Oblique is a choice: Orthogonal (Varimax) keeps factors uncorrelated but may not achieve the cleanest structure. Oblique allows correlation but adds complexity.
Factor scores are estimates: Unlike PCA scores which are exact, factor scores have uncertainty. They're indeterminate and method-dependent.
FA connects to modern deep learning: VAEs, topic models, and embeddings are all descendants of the latent factor perspective. Understanding FA provides intuition for these methods.
Interpretation requires judgment: Factor Analysis is not purely mechanical. Naming factors, choosing rotations, and deciding on the number of factors all require substantive knowledge.
Limitations are real: Indeterminacy, linearity assumptions, and sample size requirements constrain what FA can discover. Know when to use alternatives.

The Essence of Factor Analysis: Correlations between observed variables are symptoms; latent factors are causes. FA asks: "What hidden constructs explain why these variables move together?" It's a fundamentally causal question, even if the answer is statistical. This perspective - that simple underlying causes generate complex observed patterns - is the conceptual foundation of modern representation learning. From Spearman's g-factor to VAE latent spaces, the insight remains: find the hidden structure that explains what we see.