Chapter 22
30 min read
Section 138 of 175

SVD and Statistical Applications

Dimensionality Reduction Deep Dive

Learning Objectives

By the end of this section, you will master Singular Value Decomposition (SVD) - the most fundamental matrix decomposition in applied mathematics and machine learning. You will:

  1. Understand the geometric meaning of SVD - how any linear transformation can be decomposed into rotation, scaling, and another rotation.
  2. Master the mathematical formulation from first principles: the relationship to eigendecomposition, the singular values and vectors, and why the decomposition always exists.
  3. Apply the Eckart-Young theorem to compute optimal low-rank approximations for compression, denoising, and dimensionality reduction.
  4. Connect SVD to PCA and understand exactly when they are equivalent and when they differ.
  5. Implement statistical applications: Latent Semantic Analysis for text, collaborative filtering for recommendations, image compression, and pseudoinverse for least squares.
  6. Apply SVD to modern deep learning: LoRA for efficient fine-tuning, weight matrix analysis, and neural network compression.
Why This Matters: SVD is arguably the most important matrix decomposition in all of applied mathematics. It appears in machine learning (PCA, LSA, recommender systems), numerical computing (solving linear systems, pseudoinverse), signal processing (noise reduction), data compression (images, video), and modern AI (LoRA, model compression). Understanding SVD deeply gives you a unified framework for thinking about matrices, transformations, and data structure.

The Big Picture

Historical Context

The Singular Value Decomposition has a rich history spanning over a century. The theoretical foundations were laid in the late 19th century by Eugenio Beltrami (1873) and Camille Jordan (1874), who independently discovered it for square matrices. The extension to rectangular matrices came from mathematicians including Eckart and Young, whose 1936 theorem on optimal low-rank approximation remains central to modern applications.

However, SVD remained primarily a theoretical tool until the 1960s when Gene Golub and William Kahan developed stable numerical algorithms for computing it. This breakthrough transformed SVD from a mathematical curiosity into a practical workhorse. Today, SVD is computed billions of times daily across the world's computing infrastructure.

The applications explosion came in the 1990s and 2000s. Latent Semantic Analysis (1990) revolutionized information retrieval. The Netflix Prize (2006-2009) showcased SVD-based recommender systems. And in the 2020s, LoRA (Low-Rank Adaptation) has made efficient fine-tuning of large language models practical.

Why SVD Matters

SVD answers a fundamental question: What is the essential structure of a matrix? Every matrix, regardless of shape or properties, can be understood as three simple operations: a rotation, a scaling along coordinate axes, and another rotation.

  • Universal applicability: Unlike eigendecomposition (which requires square matrices), SVD works on any matrix - rectangular, sparse, rank-deficient.
  • Optimal approximation: The truncated SVD gives the best possible low-rank approximation in both Frobenius and spectral norms. No other rank-k matrix is closer.
  • Numerical stability: SVD is the foundation for solving ill-conditioned linear systems, computing pseudoinverses, and determining matrix rank.
  • Interpretability: Singular values measure the "importance" of each dimension. Singular vectors reveal the fundamental directions in the data.
Core Insight: SVD reveals that any linear transformation, no matter how complex, is just rotation + axis-aligned scaling + rotation. This decomposition exposes the true geometry hiding inside any matrix.

What Is SVD?

Intuitive Understanding

Imagine you have a rubber sheet with a circle drawn on it. You can stretch it, rotate it, and transform it in any way that keeps straight lines straight (a linear transformation). What happens to the circle? It becomes an ellipse!

SVD tells us exactly how this transformation works:

  1. First, rotate the sheet to align with the transformation's "natural axes" (the V matrix)
  2. Then, stretch along each axis by different amounts - these stretching factors are the singular values (\u03c3)
  3. Finally, rotate again to the output orientation (the U matrix)

The singular values tell us how much the transformation stretches in each direction. The largest singular value is the maximum stretching factor, the smallest is the minimum. If any singular value is zero, the transformation "collapses" that dimension entirely.

The Geometric View

Consider a matrix A that transforms vectors. What does it do to the unit circle? SVD tells us:

  • The unit circle (all vectors of length 1) gets mapped to an ellipse
  • The principal axes of this ellipse are the left singular vectors (u)
  • The lengths of the principal axes are the singular values (\u03c3)
  • The directions in the input that map to these axes are the right singular vectors (v)

This geometric picture is powerful: A transforms the unit sphere into an ellipsoid. SVD tells us the shape, orientation, and alignment of that ellipsoid.


Mathematical Formulation

The Decomposition Structure

For any m \u00d7 n matrix A (real or complex), there exists a factorization:

A=UΣVT\mathbf{A} = \mathbf{U} \boldsymbol{\Sigma} \mathbf{V}^T

Where:

  • U\mathbf{U} is an m \u00d7 m orthogonal matrix (U\u1d40U = I) - the left singular vectors
  • Σ\boldsymbol{\Sigma} is an m \u00d7 n diagonal matrix with non-negative entries - the singular values
  • V\mathbf{V} is an n \u00d7 n orthogonal matrix (V\u1d40V = I) - the right singular vectors

The diagonal entries of \u03a3 are called singular values:

σ1σ2σr>0=σr+1=\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r > 0 = \sigma_{r+1} = \cdots

Where r = rank(A) is the number of non-zero singular values. The singular values are conventionally ordered from largest to smallest.

In terms of column vectors:

A=i=1rσiuiviT\mathbf{A} = \sum_{i=1}^{r} \sigma_i \mathbf{u}_i \mathbf{v}_i^T

This sum-of-outer-products form shows that A is a weighted sum of rank-1 matrices. Each term σiuiviT\sigma_i \mathbf{u}_i \mathbf{v}_i^T is a rank-1 matrix, and the weights are the singular values.

Computing the SVD

The SVD is computed by relating it to eigendecomposition:

  1. Right singular vectors V: The columns of V are eigenvectors of ATA\mathbf{A}^T\mathbf{A} (an n \u00d7 n symmetric matrix)
  2. Singular values \u03c3: The singular values are the square roots of eigenvalues of ATA\mathbf{A}^T\mathbf{A}: σi=λi\sigma_i = \sqrt{\lambda_i}
  3. Left singular vectors U: Computed as ui=Avi/σi\mathbf{u}_i = \mathbf{A}\mathbf{v}_i / \sigma_i for non-zero singular values

Alternatively, U can be computed from the eigendecomposition of AAT\mathbf{A}\mathbf{A}^T (an m \u00d7 m matrix).

Computational Complexity: Computing the full SVD of an m \u00d7 n matrix takes O(min(m\u00b2n, mn\u00b2)) time. For large matrices, randomized algorithms and truncated SVD are preferred.

Properties of SVD

PropertyFormulaInterpretation
ExistenceAlways existsWorks for ANY matrix (rectangular, singular, complex)
UniquenessSingular values uniqueVectors unique up to sign (if σ distinct)
Rankr = #{i : σᵢ > 0}Number of non-zero singular values
Frobenius norm‖A‖ᶠ = √(Σσᵢ²)Square root of sum of squared singular values
Spectral norm‖A‖₂ = σ₁Largest singular value
Condition numberκ = σ₁/σᵣRatio of largest to smallest (numerical stability)
PseudoinverseA⁺ = VΣ⁺UᵀInvert non-zero singular values

Interactive SVD Explorer

Explore how SVD decomposes a 2\u00d72 matrix into its fundamental components. Adjust the matrix entries and watch how the transformation changes.

Interactive SVD Transformation Visualizer

Unit Circle (Input)
Transformed Ellipse
First Singular Vector
Second Singular Vector

Matrix A:

SVD Decomposition:

Singular Values:\u03c3\u2081 = 2.558, \u03c3\u2082 = 0.977
Condition Number:\u03ba = 2.618
Rank:2

What You're Seeing:

  • \u2022 Any matrix A transforms a circle into an ellipse
  • \u2022 \u03c3\u2081 and \u03c3\u2082 are the semi-axes of the ellipse
  • \u2022 U vectors point along the ellipse axes
  • \u2022 V vectors are the pre-images of U under A
  • \u2022 If \u03c3\u2082 \u2248 0, the matrix is nearly singular (rank 1)
Try This: Set the matrix to [[1, 0], [0, 1]] (identity) - the ellipse stays a circle. Then try [[2, 0], [0, 0.5]] - you get an ellipse with axes along coordinates. Now try [[1, 1], [0, 1]] (shear) - the axes rotate!

Low-Rank Approximation

The Eckart-Young Theorem

One of the most important results about SVD is the Eckart-Young theorem (1936), which states that the best rank-k approximation to a matrix is obtained by truncating the SVD:

Ak=i=1kσiuiviT=UkΣkVkT\mathbf{A}_k = \sum_{i=1}^{k} \sigma_i \mathbf{u}_i \mathbf{v}_i^T = \mathbf{U}_k \boldsymbol{\Sigma}_k \mathbf{V}_k^T

This approximation is optimal in both the Frobenius norm and the spectral (operator) norm:

minrank(B)=kABF=AAkF=i=k+1rσi2\min_{\text{rank}(B) = k} \|\mathbf{A} - \mathbf{B}\|_F = \|\mathbf{A} - \mathbf{A}_k\|_F = \sqrt{\sum_{i=k+1}^{r} \sigma_i^2}
minrank(B)=kAB2=AAk2=σk+1\min_{\text{rank}(B) = k} \|\mathbf{A} - \mathbf{B}\|_2 = \|\mathbf{A} - \mathbf{A}_k\|_2 = \sigma_{k+1}

In plain English: No other rank-k matrix is closer to A than the truncated SVD. The error equals the sum of discarded singular values squared (Frobenius) or just the next singular value (spectral).

Truncated SVD

Truncated SVD keeps only the top k components, dramatically reducing storage and computation:

  • Full matrix: m \u00d7 n = mn values
  • Truncated SVD: Uk (m \u00d7 k) + Sk (k) + Vk (n \u00d7 k) = k(m + n + 1) values
  • Compression ratio: mn / k(m + n + 1) \u2248 mn / k(m+n) for large m, n

For a 1000 \u00d7 1000 matrix with rank-50 approximation: 1,000,000 \u2192 100,050 values (10x compression).

Low-Rank Image Approximation

Key insight: Smooth images (gradient) need fewer ranks than complex patterns (checkerboard). SVD finds the most efficient representation for each.

SVD vs PCA: The Deep Connection

PCA and SVD are intimately related, but they are not the same thing. Understanding the precise relationship is crucial.

Key relationship: PCA is SVD applied to the centered data matrix.

Let X be the centered data matrix (mean subtracted from each column). Then:

X=UΣVT\mathbf{X} = \mathbf{U} \boldsymbol{\Sigma} \mathbf{V}^T
  • Principal components = Right singular vectors V
  • Principal component scores = XV = U\u03a3
  • Variance explained = \u03c3\u00b2/(n-1) = eigenvalues of covariance matrix
  • Covariance matrix = (1/(n-1)) X\u1d40X = V (\u03a3\u00b2/(n-1)) V\u1d40

SVD vs PCA: The Critical Difference

SVD on X

Computes orthogonal basis that minimizes reconstruction error from origin.

X = U \u03a3 V\u1d40
\u2713 With centered data, SVD = PCA

PCA on X

First centers data, then computes SVD on X - \u03bc.

X\u0305 = X - \u03bc
X\u0305 = U \u03a3 V\u1d40
\u2713 Always finds directions of maximum variance

Key Takeaway

PCA = SVD of centered data. When the data is already centered (mean = 0), SVD and PCA give identical results. The right singular vectors V are the principal components, and the singular values are the square roots of eigenvalues of X\u1d40X.

AspectSVDPCA
InputAny matrix ACentered data matrix X - μ
CenteringNot requiredRequired (fundamental)
OutputU, Σ, VᵀPrincipal components (V), scores (XV)
InterpretationMatrix factorizationVariance maximization
Sparse dataWorks directlyCentering destroys sparsity
New dataProject via VᵀCenter with training mean, then project
Common Mistake: Applying SVD directly to uncentered data and calling it PCA. This is wrong! Without centering, the first singular vector points toward the data mean, not the direction of maximum variance.

Statistical Applications

Latent Semantic Analysis

Latent Semantic Analysis (LSA) applies SVD to the term-document matrix to discover hidden semantic structure in text. It was one of the first successful applications of dimensionality reduction to natural language.

The problem: Simple keyword matching fails to capture meaning. "Car" and "automobile" are different words but same concept. "Bank" can mean a financial institution or a river bank.

LSA solution: Represent documents in a lower-dimensional "semantic" space where synonyms are close and polysemy is resolved by context.

Latent Semantic Analysis Implementation
🐍latent_semantic_analysis.py
4LSA Class

Latent Semantic Analysis uses SVD to discover hidden semantic structure in text. It was a precursor to modern embeddings like Word2Vec.

12Why LSA Works

Words that appear in similar contexts have similar meanings. SVD finds these latent relationships by decomposing co-occurrence patterns.

28Term-Document Matrix

A matrix where rows are words, columns are documents, and entries are word frequencies (or TF-IDF). This is the input to SVD.

EXAMPLE
If 'machine' appears 3 times in doc 2, entry [machine, doc2] = 3
46TF-IDF Weighting

TF-IDF downweights common words ('the', 'a') and upweights distinctive words. TF = term frequency, IDF = inverse document frequency.

67SVD on TF-IDF

SVD decomposes into U (term-topic), Σ (topic importance), Vᵀ (topic-document). Low-rank approximation captures semantic structure.

87Transform to Semantic Space

Projects new documents onto the learned topic space. Similar documents will have similar projections regardless of exact word matches.

100Semantic Similarity

Cosine similarity in latent space captures meaning similarity. 'Car' and 'automobile' will be similar even without exact word match.

113Topic Extraction

Each column of U represents a 'topic' - words with high weights define that topic. This is interpretable dimensionality reduction.

173 lines without explanation
1import numpy as np
2from collections import Counter
3from typing import List, Tuple
4
5class LatentSemanticAnalysis:
6    """
7    Latent Semantic Analysis (LSA) using SVD.
8
9    LSA discovers hidden semantic structure in text by:
10    1. Building a term-document matrix
11    2. Applying SVD to find latent topics
12    3. Representing documents in reduced semantic space
13
14    This enables:
15    - Semantic similarity (beyond keyword matching)
16    - Document classification
17    - Information retrieval
18    - Topic modeling
19    """
20
21    def __init__(self, n_components: int = 100):
22        """
23        Args:
24            n_components: Number of latent dimensions (topics)
25        """
26        self.n_components = n_components
27        self.vocabulary_ = None
28        self.U_ = None
29        self.S_ = None
30        self.Vt_ = None
31
32    def _build_term_document_matrix(
33        self,
34        documents: List[str]
35    ) -> Tuple[np.ndarray, dict]:
36        """
37        Build term-document matrix with TF-IDF weighting.
38
39        Rows = terms (vocabulary)
40        Columns = documents
41        Entry (i,j) = TF-IDF weight of term i in document j
42        """
43        # Simple tokenization (in practice, use better preprocessing)
44        doc_tokens = [doc.lower().split() for doc in documents]
45
46        # Build vocabulary
47        all_words = set()
48        for tokens in doc_tokens:
49            all_words.update(tokens)
50        vocabulary = {word: idx for idx, word in enumerate(sorted(all_words))}
51
52        n_terms = len(vocabulary)
53        n_docs = len(documents)
54
55        # Compute term frequencies
56        tf_matrix = np.zeros((n_terms, n_docs))
57        for j, tokens in enumerate(doc_tokens):
58            word_counts = Counter(tokens)
59            for word, count in word_counts.items():
60                if word in vocabulary:
61                    tf_matrix[vocabulary[word], j] = count
62
63        # Normalize TF (prevent bias toward long documents)
64        doc_lengths = tf_matrix.sum(axis=0) + 1e-10
65        tf_matrix = tf_matrix / doc_lengths
66
67        # Compute IDF (inverse document frequency)
68        doc_freq = np.sum(tf_matrix > 0, axis=1)
69        idf = np.log(n_docs / (doc_freq + 1)) + 1
70
71        # TF-IDF matrix
72        tfidf_matrix = tf_matrix * idf[:, np.newaxis]
73
74        return tfidf_matrix, vocabulary
75
76    def fit(self, documents: List[str]) -> 'LatentSemanticAnalysis':
77        """
78        Fit LSA model to documents.
79
80        Applies SVD to the term-document matrix:
81            TF-IDF = U @ Σ @ V^T
82
83        Where:
84        - U: term-topic matrix (which terms belong to which topics)
85        - Σ: topic strengths (importance of each topic)
86        - V^T: topic-document matrix (which documents express which topics)
87        """
88        # Build term-document matrix
89        tfidf_matrix, vocabulary = self._build_term_document_matrix(documents)
90        self.vocabulary_ = vocabulary
91
92        # Compute truncated SVD
93        U, S, Vt = np.linalg.svd(tfidf_matrix, full_matrices=False)
94
95        # Keep top k components
96        k = min(self.n_components, len(S))
97        self.U_ = U[:, :k]
98        self.S_ = S[:k]
99        self.Vt_ = Vt[:k, :]
100
101        return self
102
103    def transform(self, documents: List[str]) -> np.ndarray:
104        """
105        Transform documents to semantic space.
106
107        Projects documents onto latent topics.
108        """
109        tfidf_matrix, _ = self._build_term_document_matrix(documents)
110
111        # Project: doc_vector = Σ^(-1) @ U^T @ tfidf_vector
112        # This gives the representation in topic space
113        return np.diag(1/self.S_) @ self.U_.T @ tfidf_matrix
114
115    def document_similarity(
116        self,
117        doc1_idx: int,
118        doc2_idx: int
119    ) -> float:
120        """
121        Compute semantic similarity between two documents.
122
123        Uses cosine similarity in the latent space.
124        """
125        vec1 = self.Vt_[:, doc1_idx]
126        vec2 = self.Vt_[:, doc2_idx]
127
128        # Cosine similarity
129        return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
130
131    def get_topics(self, n_words: int = 10) -> List[List[str]]:
132        """
133        Get top words for each topic.
134
135        Returns list of topics, each containing top words.
136        """
137        idx_to_word = {idx: word for word, idx in self.vocabulary_.items()}
138        topics = []
139
140        for k in range(len(self.S_)):
141            # Words with highest weight in this topic
142            top_indices = np.argsort(np.abs(self.U_[:, k]))[::-1][:n_words]
143            top_words = [idx_to_word[i] for i in top_indices]
144            topics.append(top_words)
145
146        return topics
147
148
149# Example usage
150if __name__ == "__main__":
151    # Sample documents
152    documents = [
153        "machine learning algorithms data science",
154        "deep learning neural networks artificial intelligence",
155        "natural language processing text analysis",
156        "computer vision image recognition deep learning",
157        "statistics probability bayesian inference",
158        "regression classification supervised learning",
159        "clustering dimensionality reduction unsupervised",
160        "neural networks backpropagation gradient descent",
161    ]
162
163    # Fit LSA
164    lsa = LatentSemanticAnalysis(n_components=3)
165    lsa.fit(documents)
166
167    print("Top words per topic:")
168    topics = lsa.get_topics(n_words=5)
169    for i, topic in enumerate(topics):
170        print(f"  Topic {i+1}: {', '.join(topic)}")
171
172    # Document similarities
173    print("\nDocument similarities (semantic space):")
174    for i in range(len(documents)):
175        for j in range(i+1, len(documents)):
176            sim = lsa.document_similarity(i, j)
177            if sim > 0.3:  # Show only similar pairs
178                print(f"  Doc {i+1} vs Doc {j+1}: {sim:.3f}")
179
180    print("\nSingular values (topic importance):")
181    print(f"  {lsa.S_.round(3)}")

Recommender Systems

SVD-based collaborative filtering powers recommendation engines at Netflix, Amazon, Spotify, and countless other platforms. The key insight: users and items can be represented as vectors in a shared latent space.

The model: Rating = global bias + user bias + item bias + user vector \u00b7 item vector

SVD-Based Recommender System
🐍svd_recommender.py
4SVD Recommender

Matrix factorization discovers latent factors that explain user preferences. The Netflix Prize was won using SVD-based approaches.

16Latent Factors

n_factors is like the embedding dimension. Factors might represent genres (action, romance), or abstract preferences learned from data.

32Funk SVD

Unlike vanilla SVD, Funk SVD only trains on observed ratings. This handles the sparse matrix problem (most ratings are missing).

42Global Mean

Center ratings to handle different rating scales. Some users rate harshly (all 2-3), others generously (all 4-5).

48Factor Initialization

Small random initialization prevents symmetry breaking issues. Too large causes divergence, too small causes slow learning.

64Prediction Model

Rating = global_mean + user_bias + item_bias + user_factors · item_factors. This decomposes ratings into biases and interactions.

EXAMPLE
User prefers action (+bias), movie is popular (+bias), factors capture nuanced match
72Gradient Descent Updates

Update factors to reduce prediction error. Regularization prevents overfitting by penalizing large factor values.

102Batch Predictions

Vectorized prediction for all items at once. The dot product user_factors @ item_factors.T gives all predicted ratings.

119Similar Items

Items with similar latent representations are similar. Cosine similarity in factor space finds related items.

177 lines without explanation
1import numpy as np
2from typing import Tuple, List
3
4class SVDRecommender:
5    """
6    Collaborative Filtering Recommender using SVD.
7
8    Used in:
9    - Netflix Prize (winning solution used SVD variants)
10    - Amazon product recommendations
11    - Spotify music recommendations
12
13    The idea: Users and items exist in a shared latent space.
14    Similar users like similar items.
15    """
16
17    def __init__(self, n_factors: int = 50, regularization: float = 0.1):
18        """
19        Args:
20            n_factors: Number of latent factors (like hidden dimensions)
21            regularization: L2 regularization strength
22        """
23        self.n_factors = n_factors
24        self.reg = regularization
25        self.user_factors_ = None  # U matrix (users × factors)
26        self.item_factors_ = None  # V matrix (items × factors)
27        self.user_bias_ = None
28        self.item_bias_ = None
29        self.global_mean_ = None
30
31    def fit(self, ratings: np.ndarray, n_epochs: int = 20) -> 'SVDRecommender':
32        """
33        Fit recommender to rating matrix.
34
35        Uses Funk SVD (gradient descent on observed ratings only).
36        Unlike vanilla SVD which requires imputing missing values.
37
38        Args:
39            ratings: Matrix of shape (n_users, n_items) with NaN for missing
40            n_epochs: Training iterations
41        """
42        n_users, n_items = ratings.shape
43
44        # Global mean (for bias calculation)
45        self.global_mean_ = np.nanmean(ratings)
46
47        # Initialize biases
48        self.user_bias_ = np.zeros(n_users)
49        self.item_bias_ = np.zeros(n_items)
50
51        # Initialize factors randomly (small values)
52        self.user_factors_ = np.random.normal(0, 0.1, (n_users, self.n_factors))
53        self.item_factors_ = np.random.normal(0, 0.1, (n_items, self.n_factors))
54
55        # Find observed ratings (non-NaN entries)
56        observed = ~np.isnan(ratings)
57
58        # Learning rate schedule
59        lr = 0.01
60
61        for epoch in range(n_epochs):
62            total_error = 0
63            count = 0
64
65            # Iterate over observed ratings
66            for u in range(n_users):
67                for i in range(n_items):
68                    if observed[u, i]:
69                        # Predicted rating
70                        pred = (self.global_mean_ +
71                               self.user_bias_[u] +
72                               self.item_bias_[i] +
73                               self.user_factors_[u] @ self.item_factors_[i])
74
75                        # Error
76                        error = ratings[u, i] - pred
77                        total_error += error ** 2
78                        count += 1
79
80                        # Update biases
81                        self.user_bias_[u] += lr * (error - self.reg * self.user_bias_[u])
82                        self.item_bias_[i] += lr * (error - self.reg * self.item_bias_[i])
83
84                        # Update factors (gradient descent)
85                        user_update = error * self.item_factors_[i] - self.reg * self.user_factors_[u]
86                        item_update = error * self.user_factors_[u] - self.reg * self.item_factors_[i]
87
88                        self.user_factors_[u] += lr * user_update
89                        self.item_factors_[i] += lr * item_update
90
91            rmse = np.sqrt(total_error / count)
92            if epoch % 5 == 0:
93                print(f"Epoch {epoch}: RMSE = {rmse:.4f}")
94
95        return self
96
97    def predict(self, user_id: int, item_id: int) -> float:
98        """Predict rating for user-item pair."""
99        return (self.global_mean_ +
100                self.user_bias_[user_id] +
101                self.item_bias_[item_id] +
102                self.user_factors_[user_id] @ self.item_factors_[item_id])
103
104    def recommend(self, user_id: int, n_items: int = 10,
105                  exclude_rated: bool = True,
106                  ratings: np.ndarray = None) -> List[Tuple[int, float]]:
107        """
108        Get top-N recommendations for a user.
109
110        Args:
111            user_id: User to get recommendations for
112            n_items: Number of items to recommend
113            exclude_rated: Whether to exclude already-rated items
114            ratings: Original rating matrix (for exclusion)
115
116        Returns:
117            List of (item_id, predicted_rating) tuples
118        """
119        # Predict ratings for all items
120        predictions = (self.global_mean_ +
121                      self.user_bias_[user_id] +
122                      self.item_bias_ +
123                      self.user_factors_[user_id] @ self.item_factors_.T)
124
125        # Exclude already-rated items
126        if exclude_rated and ratings is not None:
127            rated_mask = ~np.isnan(ratings[user_id])
128            predictions[rated_mask] = -np.inf
129
130        # Get top-N
131        top_indices = np.argsort(predictions)[::-1][:n_items]
132
133        return [(idx, predictions[idx]) for idx in top_indices]
134
135    def similar_items(self, item_id: int, n: int = 5) -> List[Tuple[int, float]]:
136        """Find similar items using cosine similarity in latent space."""
137        item_vec = self.item_factors_[item_id]
138
139        # Cosine similarity with all items
140        norms = np.linalg.norm(self.item_factors_, axis=1)
141        similarities = (self.item_factors_ @ item_vec) / (norms * np.linalg.norm(item_vec) + 1e-10)
142
143        # Exclude self
144        similarities[item_id] = -np.inf
145
146        top_indices = np.argsort(similarities)[::-1][:n]
147        return [(idx, similarities[idx]) for idx in top_indices]
148
149
150# Example
151if __name__ == "__main__":
152    np.random.seed(42)
153
154    # Create synthetic rating matrix (100 users, 50 items)
155    n_users, n_items = 100, 50
156    true_factors = 5
157
158    # Generate from latent factor model
159    U_true = np.random.randn(n_users, true_factors)
160    V_true = np.random.randn(n_items, true_factors)
161    ratings_full = 3 + U_true @ V_true.T + np.random.randn(n_users, n_items) * 0.5
162    ratings_full = np.clip(ratings_full, 1, 5)
163
164    # Make sparse (80% missing)
165    mask = np.random.rand(n_users, n_items) < 0.2
166    ratings = np.where(mask, ratings_full, np.nan)
167
168    print(f"Rating matrix: {n_users} users × {n_items} items")
169    print(f"Observed ratings: {np.sum(mask)} / {n_users * n_items} ({100*np.mean(mask):.1f}%)")
170
171    # Fit recommender
172    recommender = SVDRecommender(n_factors=10)
173    recommender.fit(ratings, n_epochs=30)
174
175    # Get recommendations for user 0
176    print("\nTop recommendations for User 0:")
177    recs = recommender.recommend(0, n_items=5, ratings=ratings)
178    for item_id, pred_rating in recs:
179        true_rating = ratings_full[0, item_id]
180        print(f"  Item {item_id}: predicted={pred_rating:.2f}, true={true_rating:.2f}")
181
182    # Find similar items to item 0
183    print("\nItems similar to Item 0:")
184    similar = recommender.similar_items(0, n=3)
185    for item_id, similarity in similar:
186        print(f"  Item {item_id}: similarity={similarity:.3f}")

Image Compression

Images can be compressed by computing the SVD of the pixel matrix and keeping only the top-k singular values/vectors. This exploits the fact that natural images have low effective rank - neighboring pixels are highly correlated.

For a grayscale image of size m \u00d7 n:

  • Original: mn values
  • Rank-k SVD: k(m + n + 1) values
  • Compression: If k = 50 and m = n = 1000, we get 10x compression

Pseudoinverse and Least Squares

SVD provides the numerically stable way to solve least squares problems and compute the Moore-Penrose pseudoinverse.

For the system Ax = b:

x=A+b=VΣ+UTb\mathbf{x} = \mathbf{A}^+ \mathbf{b} = \mathbf{V} \boldsymbol{\Sigma}^+ \mathbf{U}^T \mathbf{b}

Where \u03a3\u207a inverts non-zero singular values and sets zeros to zero. This is the minimum-norm least-squares solution.


Deep Learning Connections

SVD has become increasingly important in modern deep learning, from analyzing trained networks to efficient fine-tuning techniques.

Analyzing Weight Matrices

SVD reveals the effective rank and structure of neural network weight matrices:

  • Rank analysis: Many weight matrices have surprisingly low effective rank - most singular values are near zero
  • Compression: Low-rank approximation can reduce model size 10-100x with minimal accuracy loss
  • Regularization: Nuclear norm regularization (sum of singular values) encourages low-rank solutions
  • Initialization: Orthogonal initialization (from SVD) improves gradient flow

LoRA: Low-Rank Adaptation

LoRA (Low-Rank Adaptation) is a revolutionary technique for efficient fine-tuning of large language models. Instead of updating the full weight matrix W, LoRA learns a low-rank update:

W=W+ΔW=W+BA\mathbf{W}' = \mathbf{W} + \Delta\mathbf{W} = \mathbf{W} + \mathbf{B}\mathbf{A}

Where B is m \u00d7 r and A is r \u00d7 n, with r \u226a min(m, n).

  • Parameters: Instead of mn parameters, only (m + n) \u00d7 r
  • Example: For a 4096 \u00d7 4096 weight (16M params), rank-8 LoRA uses only 65K params (250x reduction)
  • Why it works: Fine-tuning typically requires low-rank updates - the essential changes live in a small subspace
Connection to SVD: LoRA learns the same structure as a rank-r SVD update. The matrices B and A are like U\u03a3 and V\u1d40 respectively, but learned end-to-end rather than computed.

Neural Network Compression

SVD enables dramatic compression of neural networks:

  1. Weight matrix SVD: Replace W (m \u00d7 n) with U (m \u00d7 k) and V\u1d40 (k \u00d7 n)
  2. Implementation: Split one linear layer into two smaller layers
  3. Trade-off: Slight accuracy loss for significant speedup and memory reduction
  4. Iterative refinement: Fine-tune after compression to recover accuracy

This technique is used to deploy large models on mobile devices and edge hardware where memory and computation are constrained.


Python Implementation

SVD from Scratch

Let's implement SVD from first principles to understand every computational step. We compute it via eigendecomposition of A\u1d40A.

Complete SVD Implementation from Scratch
🐍svd_from_scratch.py
1Import NumPy

NumPy provides efficient numerical computing including linear algebra operations essential for SVD.

4SVD Class Definition

Encapsulates SVD functionality. Unlike PCA, SVD works on ANY matrix (not just square or symmetric), making it more general.

10The SVD Decomposition

A = UΣVᵀ where U contains left singular vectors (column space), Σ contains singular values, and V contains right singular vectors (row space).

18Full vs Economy SVD

Full SVD returns complete U (m×m) and V (n×n). Economy SVD returns only the non-trivial parts - usually what you want.

EXAMPLE
For 1000×10 matrix: full U is 1000×1000, economy U is 1000×10
41Form AᵀA

AᵀA is always symmetric positive semi-definite. Its eigenvalues are σ² (squared singular values) and eigenvectors are V (right singular vectors).

44Eigendecomposition

np.linalg.eigh() efficiently computes eigendecomposition of symmetric matrices. This gives us V and σ².

48Sort Descending

Eigenvalues from eigh are ascending, but we want descending order (largest singular values first for importance ranking).

53Singular Values

σᵢ = √λᵢ where λᵢ are eigenvalues of AᵀA. We use max(0, λ) to handle tiny negative values from numerical errors.

56Matrix Rank

The rank equals the number of non-zero singular values. This tells us the true dimensionality of the column/row space.

61Left Singular Vectors

Uᵢ = A*Vᵢ/σᵢ. Each left singular vector is the normalized result of applying A to the corresponding right singular vector.

90Transform Method

Projects data onto top-k right singular vectors. This is equivalent to dimensionality reduction, finding the best k-dimensional representation.

105Inverse Transform

Reconstructs the original space from the reduced representation. Information lost depends on how many components were dropped.

116Low-Rank Approximation

A_k = U_kΣ_kV_kᵀ gives the BEST rank-k approximation in Frobenius norm (Eckart-Young theorem). No other rank-k matrix is closer to A.

134Explained Variance

Variance is proportional to σ². If first singular value captures 90% of total σ², that component explains 90% of the data's structure.

143Condition Number

κ = σ_max/σ_min measures numerical stability. Large κ means small input changes cause large output changes (ill-conditioned).

EXAMPLE
κ = 10 is well-conditioned, κ = 10^10 is numerically unstable
153Pseudoinverse

A⁺ = VΣ⁺Uᵀ where Σ⁺ inverts non-zero singular values. Solves least-squares Ax=b even when A is non-square or singular.

206 lines without explanation
1import numpy as np
2from typing import Tuple, Optional
3
4class SVD:
5    """
6    Singular Value Decomposition from scratch.
7
8    Decomposes any m×n matrix A into:
9        A = U @ Σ @ V^T
10
11    Where:
12        - U is m×m orthogonal (left singular vectors)
13        - Σ is m×n diagonal (singular values)
14        - V is n×n orthogonal (right singular vectors)
15    """
16
17    def __init__(self, full_matrices: bool = False):
18        """
19        Initialize SVD.
20
21        Args:
22            full_matrices: If True, return full U and V.
23                          If False, return economy (thin) SVD.
24        """
25        self.full_matrices = full_matrices
26        self.U_ = None
27        self.S_ = None  # Singular values (1D array)
28        self.Vt_ = None  # V^T (transposed right singular vectors)
29
30    def fit(self, A: np.ndarray) -> 'SVD':
31        """
32        Compute SVD of matrix A.
33
34        The algorithm:
35        1. Compute A^T @ A (symmetric positive semi-definite)
36        2. Find eigenvalues λ and eigenvectors V of A^T @ A
37        3. Singular values σ = sqrt(λ)
38        4. Right singular vectors = V
39        5. Left singular vectors U = A @ V @ Σ^(-1)
40
41        Args:
42            A: Input matrix of shape (m, n)
43
44        Returns:
45            self: Fitted SVD instance
46        """
47        m, n = A.shape
48
49        # Step 1: Form A^T @ A (n×n matrix)
50        AtA = A.T @ A
51
52        # Step 2: Eigendecomposition of A^T @ A
53        # Eigenvalues are σ^2, eigenvectors are V
54        eigenvalues, V = np.linalg.eigh(AtA)
55
56        # Step 3: Sort in descending order
57        idx = np.argsort(eigenvalues)[::-1]
58        eigenvalues = eigenvalues[idx]
59        V = V[:, idx]
60
61        # Step 4: Singular values (avoid negative due to numerical errors)
62        singular_values = np.sqrt(np.maximum(eigenvalues, 0))
63
64        # Step 5: Determine rank (non-zero singular values)
65        rank = np.sum(singular_values > 1e-10)
66
67        # Step 6: Compute left singular vectors
68        # U = A @ V @ Σ^(-1) for non-zero singular values
69        if self.full_matrices:
70            U = np.zeros((m, m))
71            # Fill in columns corresponding to non-zero singular values
72            for i in range(rank):
73                U[:, i] = (A @ V[:, i]) / singular_values[i]
74            # Complete orthonormal basis for null space
75            # (For full implementation, use QR decomposition on remaining)
76        else:
77            # Economy SVD: only compute min(m,n) vectors
78            k = min(m, n)
79            U = np.zeros((m, k))
80            for i in range(min(k, rank)):
81                U[:, i] = (A @ V[:, i]) / singular_values[i]
82
83        # Store results
84        self.U_ = U
85        self.S_ = singular_values[:min(m, n)] if not self.full_matrices else singular_values
86        self.Vt_ = V.T
87
88        return self
89
90    def transform(self, A: np.ndarray, k: Optional[int] = None) -> np.ndarray:
91        """
92        Project data onto top-k singular vectors.
93
94        This gives the optimal rank-k approximation in the
95        Frobenius norm sense (Eckart-Young theorem).
96
97        Args:
98            A: Matrix to transform
99            k: Number of components to keep (default: all)
100
101        Returns:
102            Transformed data: A @ V_k (projections onto right singular vectors)
103        """
104        if k is None:
105            k = len(self.S_)
106        return A @ self.Vt_[:k].T
107
108    def inverse_transform(self, Z: np.ndarray, k: Optional[int] = None) -> np.ndarray:
109        """
110        Reconstruct from reduced representation.
111
112        Args:
113            Z: Reduced data of shape (m, k)
114            k: Number of components used
115
116        Returns:
117            Reconstructed matrix
118        """
119        if k is None:
120            k = Z.shape[1]
121        return Z @ self.Vt_[:k]
122
123    def low_rank_approximation(self, A: np.ndarray, k: int) -> np.ndarray:
124        """
125        Compute optimal rank-k approximation of A.
126
127        A_k = U_k @ Σ_k @ V_k^T
128
129        This minimizes ||A - A_k||_F (Frobenius norm).
130
131        Args:
132            A: Original matrix
133            k: Desired rank
134
135        Returns:
136            Rank-k approximation of A
137        """
138        # Recompute SVD to ensure correct dimensions
139        U, S, Vt = self.U_, self.S_, self.Vt_
140
141        # Truncate to rank k
142        U_k = U[:, :k]
143        S_k = S[:k]
144        Vt_k = Vt[:k, :]
145
146        # Reconstruct: A_k = U_k @ diag(S_k) @ V_k^T
147        return U_k @ np.diag(S_k) @ Vt_k
148
149    def explained_variance_ratio(self) -> np.ndarray:
150        """
151        Proportion of variance explained by each component.
152
153        Since singular values are sqrt of eigenvalues of A^T @ A,
154        the variance explained is proportional to σ_i^2.
155        """
156        total = np.sum(self.S_ ** 2)
157        return (self.S_ ** 2) / total
158
159    def condition_number(self) -> float:
160        """
161        Compute condition number κ = σ_max / σ_min.
162
163        Measures numerical sensitivity of the matrix.
164        Large κ indicates an ill-conditioned matrix.
165        """
166        nonzero = self.S_[self.S_ > 1e-10]
167        if len(nonzero) == 0:
168            return np.inf
169        return nonzero[0] / nonzero[-1]
170
171    def pseudoinverse(self, A: np.ndarray) -> np.ndarray:
172        """
173        Compute Moore-Penrose pseudoinverse using SVD.
174
175        A^+ = V @ Σ^+ @ U^T
176
177        Where Σ^+ inverts non-zero singular values.
178        """
179        # Invert non-zero singular values
180        S_inv = np.zeros_like(self.S_)
181        mask = self.S_ > 1e-10
182        S_inv[mask] = 1.0 / self.S_[mask]
183
184        # A^+ = V @ Σ^+ @ U^T
185        return self.Vt_.T @ np.diag(S_inv) @ self.U_.T
186
187
188# Practical example: Image compression
189if __name__ == "__main__":
190    # Create a simple test image (gradient pattern)
191    np.random.seed(42)
192
193    # Generate a 100x100 image with structure
194    n = 100
195    x = np.linspace(0, 4*np.pi, n)
196    y = np.linspace(0, 4*np.pi, n)
197    X, Y = np.meshgrid(x, y)
198    image = np.sin(X) + np.cos(Y) + 0.1 * np.random.randn(n, n)
199
200    print(f"Original image shape: {image.shape}")
201    print(f"Original storage: {image.size} values")
202
203    # Compute SVD
204    svd = SVD(full_matrices=False)
205    svd.fit(image)
206
207    print(f"\nSingular values (first 10): {svd.S_[:10].round(2)}")
208    print(f"Variance explained (first 10): {(svd.explained_variance_ratio()[:10]*100).round(1)}%")
209    print(f"Condition number: {svd.condition_number():.2f}")
210
211    # Low-rank approximations
212    for k in [5, 10, 20, 50]:
213        approx = svd.low_rank_approximation(image, k)
214        error = np.linalg.norm(image - approx, 'fro')
215        variance_kept = np.sum(svd.S_[:k]**2) / np.sum(svd.S_**2)
216        storage = k * (n + n + 1)  # U_k, V_k, S_k
217        compression = 100 * (1 - storage / image.size)
218
219        print(f"\nRank-{k} approximation:")
220        print(f"  Frobenius error: {error:.4f}")
221        print(f"  Variance retained: {variance_kept*100:.1f}%")
222        print(f"  Storage: {storage} values ({compression:.1f}% compression)")

Practical Tips and Gotchas

  1. Use truncated SVD for large matrices. Full SVD of an m \u00d7 n matrix costs O(min(m\u00b2n, mn\u00b2)). Truncated SVD finding k components costs O(mnk).
  2. Randomized SVD is your friend. For very large matrices, randomized algorithms (like sklearn's TruncatedSVD) are 10-100x faster with negligible error.
  3. Don't invert small singular values. When computing pseudoinverse, only invert singular values above a threshold. Small values amplify noise.
  4. Center for PCA, don't for SVD on sparse data. Centering destroys sparsity. For sparse matrices like text, use SVD directly (TruncatedSVD).
  5. Check condition number. Large \u03ba = \u03c3\u2081/\u03c3\u1d63 indicates numerical instability. Consider regularization or low-rank approximation.
  6. Singular vectors are not unique. If singular values are repeated, the corresponding singular vectors are only unique up to rotation within that subspace.
  7. Watch memory for full SVD. Full U and V can be huge. Use full_matrices=False in NumPy to get economy SVD.
  8. For recommenders, handle bias separately. Funk SVD learns biases explicitly rather than relying on SVD to capture them.
NumPy vs SciPy: Use scipy.linalg.svd for general SVD and scipy.sparse.linalg.svds for sparse matrices. NumPy's np.linalg.svd is fine for dense matrices but doesn't support sparse input.

Limitations of SVD

  1. Linear only: SVD captures linear relationships. For nonlinear structure, consider kernel methods or neural network autoencoders.
  2. Memory intensive: Full SVD requires storing U, \u03a3, V. For very large matrices, only iterative/randomized methods are practical.
  3. Batch computation: Standard SVD requires the full matrix in memory. Incremental/online updates are possible but complex.
  4. Interpretation challenges: Singular vectors are linear combinations of all features. They may not be interpretable.
  5. No probability model: SVD is a deterministic factorization. For uncertainty quantification, use probabilistic PCA or Bayesian methods.
  6. Sensitivity to outliers: Large values in the matrix disproportionately affect singular values. Consider robust alternatives for noisy data.
LimitationWhen It MattersAlternative
LinearityData lies on curved manifoldKernel PCA, t-SNE, UMAP, autoencoders
MemoryMatrix too large for memoryRandomized SVD, iterative methods
Streaming dataData arrives continuouslyIncremental SVD, online PCA
InterpretabilityNeed to explain factorsNMF (Non-negative Matrix Factorization)
OutliersNoisy or corrupted dataRobust PCA, median-based methods

Practice Problems

  1. Conceptual: Explain geometrically what happens when a 3\u00d72 matrix has rank 1. What do the singular vectors tell us?
  2. Mathematical: Prove that the singular values of A are the square roots of eigenvalues of A\u1d40A. Show that U, V are orthogonal.
  3. Computational: Implement SVD from scratch. Verify your results match np.linalg.svd. Test on random matrices of various shapes.
  4. Image compression: Load an image, compute its SVD, and visualize reconstructions at different ranks. Plot PSNR vs rank.
  5. Text analysis: Implement LSA on a corpus of your choice. Find semantically similar documents using cosine similarity in the reduced space.
  6. Recommender: Train an SVD recommender on MovieLens data. Compare with simple baselines (user mean, item mean).
  7. Pseudoinverse: Solve an overdetermined system Ax = b using SVD pseudoinverse. Compare with np.linalg.lstsq.
  8. Deep learning: Analyze the singular value distribution of weight matrices in a pre-trained CNN. What does the spectrum tell you?
  9. LoRA simulation: Take a weight matrix W, compute its SVD, and simulate LoRA by adding low-rank perturbations. How does rank affect approximation quality?
  10. Condition number: Create matrices with varying condition numbers. Observe how solving Ax = b becomes unstable as \u03ba increases.

Key Insights

  • SVD = rotation + scaling + rotation. Any matrix transformation can be understood as this sequence of simple operations.
  • Singular values measure importance. Larger singular values correspond to more important dimensions. The spectrum reveals the effective rank.
  • Truncated SVD is optimal. The Eckart-Young theorem guarantees no other rank-k matrix is closer to the original.
  • PCA is SVD of centered data. This connection unifies the two most important matrix decompositions in statistics.
  • Applications are everywhere: Text (LSA), recommendations (collaborative filtering), images (compression), linear systems (pseudoinverse), deep learning (LoRA).
  • Condition number matters. Large \u03ba = \u03c3\u2081/\u03c3\u1d63 indicates numerical instability and potential issues.
  • Low-rank structure is common. Real-world data often has much lower effective rank than its nominal dimension.
  • SVD is fundamental to modern AI. From LoRA fine-tuning to model compression, SVD concepts permeate contemporary machine learning.
The Essence of SVD: Every matrix tells a story - SVD reads that story. It reveals the essential structure hiding inside any matrix: how many dimensions truly matter (rank), how important each is (singular values), and what directions define them (singular vectors). This universal decomposition is the Swiss Army knife of numerical linear algebra, equally at home in theoretical proofs and billion-scale production systems. Master SVD, and you've mastered the language in which matrices speak.
Loading comments...