By the end of this section, you will master Singular Value Decomposition (SVD) - the most fundamental matrix decomposition in applied mathematics and machine learning. You will:
Understand the geometric meaning of SVD - how any linear transformation can be decomposed into rotation, scaling, and another rotation.
Master the mathematical formulation from first principles: the relationship to eigendecomposition, the singular values and vectors, and why the decomposition always exists.
Apply the Eckart-Young theorem to compute optimal low-rank approximations for compression, denoising, and dimensionality reduction.
Connect SVD to PCA and understand exactly when they are equivalent and when they differ.
Implement statistical applications: Latent Semantic Analysis for text, collaborative filtering for recommendations, image compression, and pseudoinverse for least squares.
Apply SVD to modern deep learning: LoRA for efficient fine-tuning, weight matrix analysis, and neural network compression.
Why This Matters: SVD is arguably the most important matrix decomposition in all of applied mathematics. It appears in machine learning (PCA, LSA, recommender systems), numerical computing (solving linear systems, pseudoinverse), signal processing (noise reduction), data compression (images, video), and modern AI (LoRA, model compression). Understanding SVD deeply gives you a unified framework for thinking about matrices, transformations, and data structure.
The Big Picture
Historical Context
The Singular Value Decomposition has a rich history spanning over a century. The theoretical foundations were laid in the late 19th century by Eugenio Beltrami (1873) and Camille Jordan (1874), who independently discovered it for square matrices. The extension to rectangular matrices came from mathematicians including Eckart and Young, whose 1936 theorem on optimal low-rank approximation remains central to modern applications.
However, SVD remained primarily a theoretical tool until the 1960s when Gene Golub and William Kahan developed stable numerical algorithms for computing it. This breakthrough transformed SVD from a mathematical curiosity into a practical workhorse. Today, SVD is computed billions of times daily across the world's computing infrastructure.
The applications explosion came in the 1990s and 2000s. Latent Semantic Analysis (1990) revolutionized information retrieval. The Netflix Prize (2006-2009) showcased SVD-based recommender systems. And in the 2020s, LoRA (Low-Rank Adaptation) has made efficient fine-tuning of large language models practical.
Why SVD Matters
SVD answers a fundamental question: What is the essential structure of a matrix? Every matrix, regardless of shape or properties, can be understood as three simple operations: a rotation, a scaling along coordinate axes, and another rotation.
Universal applicability: Unlike eigendecomposition (which requires square matrices), SVD works on any matrix - rectangular, sparse, rank-deficient.
Optimal approximation: The truncated SVD gives the best possible low-rank approximation in both Frobenius and spectral norms. No other rank-k matrix is closer.
Numerical stability: SVD is the foundation for solving ill-conditioned linear systems, computing pseudoinverses, and determining matrix rank.
Interpretability: Singular values measure the "importance" of each dimension. Singular vectors reveal the fundamental directions in the data.
Core Insight: SVD reveals that any linear transformation, no matter how complex, is just rotation + axis-aligned scaling + rotation. This decomposition exposes the true geometry hiding inside any matrix.
What Is SVD?
Intuitive Understanding
Imagine you have a rubber sheet with a circle drawn on it. You can stretch it, rotate it, and transform it in any way that keeps straight lines straight (a linear transformation). What happens to the circle? It becomes an ellipse!
SVD tells us exactly how this transformation works:
First, rotate the sheet to align with the transformation's "natural axes" (the V matrix)
Then, stretch along each axis by different amounts - these stretching factors are the singular values (\u03c3)
Finally, rotate again to the output orientation (the U matrix)
The singular values tell us how much the transformation stretches in each direction. The largest singular value is the maximum stretching factor, the smallest is the minimum. If any singular value is zero, the transformation "collapses" that dimension entirely.
The Geometric View
Consider a matrix A that transforms vectors. What does it do to the unit circle? SVD tells us:
The unit circle (all vectors of length 1) gets mapped to an ellipse
The principal axes of this ellipse are the left singular vectors (u)
The lengths of the principal axes are the singular values (\u03c3)
The directions in the input that map to these axes are the right singular vectors (v)
This geometric picture is powerful: A transforms the unit sphere into an ellipsoid. SVD tells us the shape, orientation, and alignment of that ellipsoid.
Mathematical Formulation
The Decomposition Structure
For any m \u00d7 n matrix A (real or complex), there exists a factorization:
A=UΣVT
Where:
U is an m \u00d7 morthogonal matrix (U\u1d40U = I) - the left singular vectors
Σ is an m \u00d7 ndiagonal matrix with non-negative entries - the singular values
V is an n \u00d7 northogonal matrix (V\u1d40V = I) - the right singular vectors
The diagonal entries of \u03a3 are called singular values:
σ1≥σ2≥⋯≥σr>0=σr+1=⋯
Where r = rank(A) is the number of non-zero singular values. The singular values are conventionally ordered from largest to smallest.
In terms of column vectors:
A=i=1∑rσiuiviT
This sum-of-outer-products form shows that A is a weighted sum of rank-1 matrices. Each term σiuiviT is a rank-1 matrix, and the weights are the singular values.
Computing the SVD
The SVD is computed by relating it to eigendecomposition:
Right singular vectors V: The columns of V are eigenvectors of ATA (an n \u00d7 n symmetric matrix)
Singular values \u03c3: The singular values are the square roots of eigenvalues of ATA: σi=λi
Left singular vectors U: Computed as ui=Avi/σi for non-zero singular values
Alternatively, U can be computed from the eigendecomposition of AAT (an m \u00d7 m matrix).
Computational Complexity: Computing the full SVD of an m \u00d7 n matrix takes O(min(m\u00b2n, mn\u00b2)) time. For large matrices, randomized algorithms and truncated SVD are preferred.
Properties of SVD
Property
Formula
Interpretation
Existence
Always exists
Works for ANY matrix (rectangular, singular, complex)
Uniqueness
Singular values unique
Vectors unique up to sign (if σ distinct)
Rank
r = #{i : σᵢ > 0}
Number of non-zero singular values
Frobenius norm
‖A‖ᶠ = √(Σσᵢ²)
Square root of sum of squared singular values
Spectral norm
‖A‖₂ = σ₁
Largest singular value
Condition number
κ = σ₁/σᵣ
Ratio of largest to smallest (numerical stability)
Pseudoinverse
A⁺ = VΣ⁺Uᵀ
Invert non-zero singular values
Interactive SVD Explorer
Explore how SVD decomposes a 2\u00d72 matrix into its fundamental components. Adjust the matrix entries and watch how the transformation changes.
\u2022 Any matrix A transforms a circle into an ellipse
\u2022 \u03c3\u2081 and \u03c3\u2082 are the semi-axes of the ellipse
\u2022 U vectors point along the ellipse axes
\u2022 V vectors are the pre-images of U under A
\u2022 If \u03c3\u2082 \u2248 0, the matrix is nearly singular (rank 1)
Try This: Set the matrix to [[1, 0], [0, 1]] (identity) - the ellipse stays a circle. Then try [[2, 0], [0, 0.5]] - you get an ellipse with axes along coordinates. Now try [[1, 1], [0, 1]] (shear) - the axes rotate!
Low-Rank Approximation
The Eckart-Young Theorem
One of the most important results about SVD is the Eckart-Young theorem (1936), which states that the best rank-k approximation to a matrix is obtained by truncating the SVD:
Ak=i=1∑kσiuiviT=UkΣkVkT
This approximation is optimal in both the Frobenius norm and the spectral (operator) norm:
rank(B)=kmin∥A−B∥F=∥A−Ak∥F=i=k+1∑rσi2
rank(B)=kmin∥A−B∥2=∥A−Ak∥2=σk+1
In plain English: No other rank-k matrix is closer to A than the truncated SVD. The error equals the sum of discarded singular values squared (Frobenius) or just the next singular value (spectral).
Truncated SVD
Truncated SVD keeps only the top k components, dramatically reducing storage and computation:
Full matrix: m \u00d7 n = mn values
Truncated SVD: Uk (m \u00d7 k) + Sk (k) + Vk (n \u00d7 k) = k(m + n + 1) values
Compression ratio: mn / k(m + n + 1) \u2248 mn / k(m+n) for large m, n
For a 1000 \u00d7 1000 matrix with rank-50 approximation: 1,000,000 \u2192 100,050 values (10x compression).
Low-Rank Image Approximation
Key insight: Smooth images (gradient) need fewer ranks than complex patterns (checkerboard). SVD finds the most efficient representation for each.
SVD vs PCA: The Deep Connection
PCA and SVD are intimately related, but they are not the same thing. Understanding the precise relationship is crucial.
Key relationship: PCA is SVD applied to the centered data matrix.
Let X be the centered data matrix (mean subtracted from each column). Then:
X=UΣVT
Principal components = Right singular vectors V
Principal component scores = XV = U\u03a3
Variance explained = \u03c3\u00b2/(n-1) = eigenvalues of covariance matrix
Covariance matrix = (1/(n-1)) X\u1d40X = V (\u03a3\u00b2/(n-1)) V\u1d40
SVD vs PCA: The Critical Difference
SVD on X
Computes orthogonal basis that minimizes reconstruction error from origin.
X = U \u03a3 V\u1d40
\u2713 With centered data, SVD = PCA
PCA on X
First centers data, then computes SVD on X - \u03bc.
X\u0305 = X - \u03bc
X\u0305 = U \u03a3 V\u1d40
\u2713 Always finds directions of maximum variance
Key Takeaway
PCA = SVD of centered data. When the data is already centered (mean = 0), SVD and PCA give identical results. The right singular vectors V are the principal components, and the singular values are the square roots of eigenvalues of X\u1d40X.
Aspect
SVD
PCA
Input
Any matrix A
Centered data matrix X - μ
Centering
Not required
Required (fundamental)
Output
U, Σ, Vᵀ
Principal components (V), scores (XV)
Interpretation
Matrix factorization
Variance maximization
Sparse data
Works directly
Centering destroys sparsity
New data
Project via Vᵀ
Center with training mean, then project
Common Mistake: Applying SVD directly to uncentered data and calling it PCA. This is wrong! Without centering, the first singular vector points toward the data mean, not the direction of maximum variance.
Statistical Applications
Latent Semantic Analysis
Latent Semantic Analysis (LSA) applies SVD to the term-document matrix to discover hidden semantic structure in text. It was one of the first successful applications of dimensionality reduction to natural language.
The problem: Simple keyword matching fails to capture meaning. "Car" and "automobile" are different words but same concept. "Bank" can mean a financial institution or a river bank.
LSA solution: Represent documents in a lower-dimensional "semantic" space where synonyms are close and polysemy is resolved by context.
Latent Semantic Analysis Implementation
🐍latent_semantic_analysis.py
Explanation(8)
Code(181)
4LSA Class
Latent Semantic Analysis uses SVD to discover hidden semantic structure in text. It was a precursor to modern embeddings like Word2Vec.
12Why LSA Works
Words that appear in similar contexts have similar meanings. SVD finds these latent relationships by decomposing co-occurrence patterns.
28Term-Document Matrix
A matrix where rows are words, columns are documents, and entries are word frequencies (or TF-IDF). This is the input to SVD.
EXAMPLE
If 'machine' appears 3 times in doc 2, entry [machine, doc2] = 3
46TF-IDF Weighting
TF-IDF downweights common words ('the', 'a') and upweights distinctive words. TF = term frequency, IDF = inverse document frequency.
67SVD on TF-IDF
SVD decomposes into U (term-topic), Σ (topic importance), Vᵀ (topic-document). Low-rank approximation captures semantic structure.
87Transform to Semantic Space
Projects new documents onto the learned topic space. Similar documents will have similar projections regardless of exact word matches.
100Semantic Similarity
Cosine similarity in latent space captures meaning similarity. 'Car' and 'automobile' will be similar even without exact word match.
113Topic Extraction
Each column of U represents a 'topic' - words with high weights define that topic. This is interpretable dimensionality reduction.
173 lines without explanation
1import numpy as np
2from collections import Counter
3from typing import List, Tuple
45classLatentSemanticAnalysis:6"""
7 Latent Semantic Analysis (LSA) using SVD.
89 LSA discovers hidden semantic structure in text by:
10 1. Building a term-document matrix
11 2. Applying SVD to find latent topics
12 3. Representing documents in reduced semantic space
1314 This enables:
15 - Semantic similarity (beyond keyword matching)
16 - Document classification
17 - Information retrieval
18 - Topic modeling
19 """2021def__init__(self, n_components:int=100):22"""
23 Args:
24 n_components: Number of latent dimensions (topics)
25 """26 self.n_components = n_components
27 self.vocabulary_ =None28 self.U_ =None29 self.S_ =None30 self.Vt_ =None3132def_build_term_document_matrix(33 self,34 documents: List[str]35)-> Tuple[np.ndarray,dict]:36"""
37 Build term-document matrix with TF-IDF weighting.
3839 Rows = terms (vocabulary)
40 Columns = documents
41 Entry (i,j) = TF-IDF weight of term i in document j
42 """43# Simple tokenization (in practice, use better preprocessing)44 doc_tokens =[doc.lower().split()for doc in documents]4546# Build vocabulary47 all_words =set()48for tokens in doc_tokens:49 all_words.update(tokens)50 vocabulary ={word: idx for idx, word inenumerate(sorted(all_words))}5152 n_terms =len(vocabulary)53 n_docs =len(documents)5455# Compute term frequencies56 tf_matrix = np.zeros((n_terms, n_docs))57for j, tokens inenumerate(doc_tokens):58 word_counts = Counter(tokens)59for word, count in word_counts.items():60if word in vocabulary:61 tf_matrix[vocabulary[word], j]= count
6263# Normalize TF (prevent bias toward long documents)64 doc_lengths = tf_matrix.sum(axis=0)+1e-1065 tf_matrix = tf_matrix / doc_lengths
6667# Compute IDF (inverse document frequency)68 doc_freq = np.sum(tf_matrix >0, axis=1)69 idf = np.log(n_docs /(doc_freq +1))+17071# TF-IDF matrix72 tfidf_matrix = tf_matrix * idf[:, np.newaxis]7374return tfidf_matrix, vocabulary
7576deffit(self, documents: List[str])->'LatentSemanticAnalysis':77"""
78 Fit LSA model to documents.
7980 Applies SVD to the term-document matrix:
81 TF-IDF = U @ Σ @ V^T
8283 Where:
84 - U: term-topic matrix (which terms belong to which topics)
85 - Σ: topic strengths (importance of each topic)
86 - V^T: topic-document matrix (which documents express which topics)
87 """88# Build term-document matrix89 tfidf_matrix, vocabulary = self._build_term_document_matrix(documents)90 self.vocabulary_ = vocabulary
9192# Compute truncated SVD93 U, S, Vt = np.linalg.svd(tfidf_matrix, full_matrices=False)9495# Keep top k components96 k =min(self.n_components,len(S))97 self.U_ = U[:,:k]98 self.S_ = S[:k]99 self.Vt_ = Vt[:k,:]100101return self
102103deftransform(self, documents: List[str])-> np.ndarray:104"""
105 Transform documents to semantic space.
106107 Projects documents onto latent topics.
108 """109 tfidf_matrix, _ = self._build_term_document_matrix(documents)110111# Project: doc_vector = Σ^(-1) @ U^T @ tfidf_vector112# This gives the representation in topic space113return np.diag(1/self.S_) @ self.U_.T @ tfidf_matrix
114115defdocument_similarity(116 self,117 doc1_idx:int,118 doc2_idx:int119)->float:120"""
121 Compute semantic similarity between two documents.
122123 Uses cosine similarity in the latent space.
124 """125 vec1 = self.Vt_[:, doc1_idx]126 vec2 = self.Vt_[:, doc2_idx]127128# Cosine similarity129return np.dot(vec1, vec2)/(np.linalg.norm(vec1)* np.linalg.norm(vec2))130131defget_topics(self, n_words:int=10)-> List[List[str]]:132"""
133 Get top words for each topic.
134135 Returns list of topics, each containing top words.
136 """137 idx_to_word ={idx: word for word, idx in self.vocabulary_.items()}138 topics =[]139140for k inrange(len(self.S_)):141# Words with highest weight in this topic142 top_indices = np.argsort(np.abs(self.U_[:, k]))[::-1][:n_words]143 top_words =[idx_to_word[i]for i in top_indices]144 topics.append(top_words)145146return topics
147148149# Example usage150if __name__ =="__main__":151# Sample documents152 documents =[153"machine learning algorithms data science",154"deep learning neural networks artificial intelligence",155"natural language processing text analysis",156"computer vision image recognition deep learning",157"statistics probability bayesian inference",158"regression classification supervised learning",159"clustering dimensionality reduction unsupervised",160"neural networks backpropagation gradient descent",161]162163# Fit LSA164 lsa = LatentSemanticAnalysis(n_components=3)165 lsa.fit(documents)166167print("Top words per topic:")168 topics = lsa.get_topics(n_words=5)169for i, topic inenumerate(topics):170print(f" Topic {i+1}: {', '.join(topic)}")171172# Document similarities173print("\nDocument similarities (semantic space):")174for i inrange(len(documents)):175for j inrange(i+1,len(documents)):176 sim = lsa.document_similarity(i, j)177if sim >0.3:# Show only similar pairs178print(f" Doc {i+1} vs Doc {j+1}: {sim:.3f}")179180print("\nSingular values (topic importance):")181print(f" {lsa.S_.round(3)}")
Recommender Systems
SVD-based collaborative filtering powers recommendation engines at Netflix, Amazon, Spotify, and countless other platforms. The key insight: users and items can be represented as vectors in a shared latent space.
The model: Rating = global bias + user bias + item bias + user vector \u00b7 item vector
SVD-Based Recommender System
🐍svd_recommender.py
Explanation(9)
Code(186)
4SVD Recommender
Matrix factorization discovers latent factors that explain user preferences. The Netflix Prize was won using SVD-based approaches.
16Latent Factors
n_factors is like the embedding dimension. Factors might represent genres (action, romance), or abstract preferences learned from data.
32Funk SVD
Unlike vanilla SVD, Funk SVD only trains on observed ratings. This handles the sparse matrix problem (most ratings are missing).
42Global Mean
Center ratings to handle different rating scales. Some users rate harshly (all 2-3), others generously (all 4-5).
48Factor Initialization
Small random initialization prevents symmetry breaking issues. Too large causes divergence, too small causes slow learning.
64Prediction Model
Rating = global_mean + user_bias + item_bias + user_factors · item_factors. This decomposes ratings into biases and interactions.
EXAMPLE
User prefers action (+bias), movie is popular (+bias), factors capture nuanced match
72Gradient Descent Updates
Update factors to reduce prediction error. Regularization prevents overfitting by penalizing large factor values.
102Batch Predictions
Vectorized prediction for all items at once. The dot product user_factors @ item_factors.T gives all predicted ratings.
119Similar Items
Items with similar latent representations are similar. Cosine similarity in factor space finds related items.
177 lines without explanation
1import numpy as np
2from typing import Tuple, List
34classSVDRecommender:5"""
6 Collaborative Filtering Recommender using SVD.
78 Used in:
9 - Netflix Prize (winning solution used SVD variants)
10 - Amazon product recommendations
11 - Spotify music recommendations
1213 The idea: Users and items exist in a shared latent space.
14 Similar users like similar items.
15 """1617def__init__(self, n_factors:int=50, regularization:float=0.1):18"""
19 Args:
20 n_factors: Number of latent factors (like hidden dimensions)
21 regularization: L2 regularization strength
22 """23 self.n_factors = n_factors
24 self.reg = regularization
25 self.user_factors_ =None# U matrix (users × factors)26 self.item_factors_ =None# V matrix (items × factors)27 self.user_bias_ =None28 self.item_bias_ =None29 self.global_mean_ =None3031deffit(self, ratings: np.ndarray, n_epochs:int=20)->'SVDRecommender':32"""
33 Fit recommender to rating matrix.
3435 Uses Funk SVD (gradient descent on observed ratings only).
36 Unlike vanilla SVD which requires imputing missing values.
3738 Args:
39 ratings: Matrix of shape (n_users, n_items) with NaN for missing
40 n_epochs: Training iterations
41 """42 n_users, n_items = ratings.shape
4344# Global mean (for bias calculation)45 self.global_mean_ = np.nanmean(ratings)4647# Initialize biases48 self.user_bias_ = np.zeros(n_users)49 self.item_bias_ = np.zeros(n_items)5051# Initialize factors randomly (small values)52 self.user_factors_ = np.random.normal(0,0.1,(n_users, self.n_factors))53 self.item_factors_ = np.random.normal(0,0.1,(n_items, self.n_factors))5455# Find observed ratings (non-NaN entries)56 observed =~np.isnan(ratings)5758# Learning rate schedule59 lr =0.016061for epoch inrange(n_epochs):62 total_error =063 count =06465# Iterate over observed ratings66for u inrange(n_users):67for i inrange(n_items):68if observed[u, i]:69# Predicted rating70 pred =(self.global_mean_ +71 self.user_bias_[u]+72 self.item_bias_[i]+73 self.user_factors_[u] @ self.item_factors_[i])7475# Error76 error = ratings[u, i]- pred
77 total_error += error **278 count +=17980# Update biases81 self.user_bias_[u]+= lr *(error - self.reg * self.user_bias_[u])82 self.item_bias_[i]+= lr *(error - self.reg * self.item_bias_[i])8384# Update factors (gradient descent)85 user_update = error * self.item_factors_[i]- self.reg * self.user_factors_[u]86 item_update = error * self.user_factors_[u]- self.reg * self.item_factors_[i]8788 self.user_factors_[u]+= lr * user_update
89 self.item_factors_[i]+= lr * item_update
9091 rmse = np.sqrt(total_error / count)92if epoch %5==0:93print(f"Epoch {epoch}: RMSE = {rmse:.4f}")9495return self
9697defpredict(self, user_id:int, item_id:int)->float:98"""Predict rating for user-item pair."""99return(self.global_mean_ +100 self.user_bias_[user_id]+101 self.item_bias_[item_id]+102 self.user_factors_[user_id] @ self.item_factors_[item_id])103104defrecommend(self, user_id:int, n_items:int=10,105 exclude_rated:bool=True,106 ratings: np.ndarray =None)-> List[Tuple[int,float]]:107"""
108 Get top-N recommendations for a user.
109110 Args:
111 user_id: User to get recommendations for
112 n_items: Number of items to recommend
113 exclude_rated: Whether to exclude already-rated items
114 ratings: Original rating matrix (for exclusion)
115116 Returns:
117 List of (item_id, predicted_rating) tuples
118 """119# Predict ratings for all items120 predictions =(self.global_mean_ +121 self.user_bias_[user_id]+122 self.item_bias_ +123 self.user_factors_[user_id] @ self.item_factors_.T)124125# Exclude already-rated items126if exclude_rated and ratings isnotNone:127 rated_mask =~np.isnan(ratings[user_id])128 predictions[rated_mask]=-np.inf
129130# Get top-N131 top_indices = np.argsort(predictions)[::-1][:n_items]132133return[(idx, predictions[idx])for idx in top_indices]134135defsimilar_items(self, item_id:int, n:int=5)-> List[Tuple[int,float]]:136"""Find similar items using cosine similarity in latent space."""137 item_vec = self.item_factors_[item_id]138139# Cosine similarity with all items140 norms = np.linalg.norm(self.item_factors_, axis=1)141 similarities =(self.item_factors_ @ item_vec)/(norms * np.linalg.norm(item_vec)+1e-10)142143# Exclude self144 similarities[item_id]=-np.inf
145146 top_indices = np.argsort(similarities)[::-1][:n]147return[(idx, similarities[idx])for idx in top_indices]148149150# Example151if __name__ =="__main__":152 np.random.seed(42)153154# Create synthetic rating matrix (100 users, 50 items)155 n_users, n_items =100,50156 true_factors =5157158# Generate from latent factor model159 U_true = np.random.randn(n_users, true_factors)160 V_true = np.random.randn(n_items, true_factors)161 ratings_full =3+ U_true @ V_true.T + np.random.randn(n_users, n_items)*0.5162 ratings_full = np.clip(ratings_full,1,5)163164# Make sparse (80% missing)165 mask = np.random.rand(n_users, n_items)<0.2166 ratings = np.where(mask, ratings_full, np.nan)167168print(f"Rating matrix: {n_users} users × {n_items} items")169print(f"Observed ratings: {np.sum(mask)} / {n_users * n_items} ({100*np.mean(mask):.1f}%)")170171# Fit recommender172 recommender = SVDRecommender(n_factors=10)173 recommender.fit(ratings, n_epochs=30)174175# Get recommendations for user 0176print("\nTop recommendations for User 0:")177 recs = recommender.recommend(0, n_items=5, ratings=ratings)178for item_id, pred_rating in recs:179 true_rating = ratings_full[0, item_id]180print(f" Item {item_id}: predicted={pred_rating:.2f}, true={true_rating:.2f}")181182# Find similar items to item 0183print("\nItems similar to Item 0:")184 similar = recommender.similar_items(0, n=3)185for item_id, similarity in similar:186print(f" Item {item_id}: similarity={similarity:.3f}")
Image Compression
Images can be compressed by computing the SVD of the pixel matrix and keeping only the top-k singular values/vectors. This exploits the fact that natural images have low effective rank - neighboring pixels are highly correlated.
For a grayscale image of size m \u00d7 n:
Original: mn values
Rank-k SVD: k(m + n + 1) values
Compression: If k = 50 and m = n = 1000, we get 10x compression
Pseudoinverse and Least Squares
SVD provides the numerically stable way to solve least squares problems and compute the Moore-Penrose pseudoinverse.
For the system Ax = b:
x=A+b=VΣ+UTb
Where \u03a3\u207a inverts non-zero singular values and sets zeros to zero. This is the minimum-norm least-squares solution.
Deep Learning Connections
SVD has become increasingly important in modern deep learning, from analyzing trained networks to efficient fine-tuning techniques.
Analyzing Weight Matrices
SVD reveals the effective rank and structure of neural network weight matrices:
Rank analysis: Many weight matrices have surprisingly low effective rank - most singular values are near zero
Compression: Low-rank approximation can reduce model size 10-100x with minimal accuracy loss
LoRA (Low-Rank Adaptation) is a revolutionary technique for efficient fine-tuning of large language models. Instead of updating the full weight matrix W, LoRA learns a low-rank update:
W′=W+ΔW=W+BA
Where B is m \u00d7 r and A is r \u00d7 n, with r \u226a min(m, n).
Parameters: Instead of mn parameters, only (m + n) \u00d7 r
Example: For a 4096 \u00d7 4096 weight (16M params), rank-8 LoRA uses only 65K params (250x reduction)
Why it works: Fine-tuning typically requires low-rank updates - the essential changes live in a small subspace
Connection to SVD: LoRA learns the same structure as a rank-r SVD update. The matrices B and A are like U\u03a3 and V\u1d40 respectively, but learned end-to-end rather than computed.
Neural Network Compression
SVD enables dramatic compression of neural networks:
Weight matrix SVD: Replace W (m \u00d7 n) with U (m \u00d7 k) and V\u1d40 (k \u00d7 n)
Implementation: Split one linear layer into two smaller layers
Trade-off: Slight accuracy loss for significant speedup and memory reduction
Iterative refinement: Fine-tune after compression to recover accuracy
This technique is used to deploy large models on mobile devices and edge hardware where memory and computation are constrained.
Python Implementation
SVD from Scratch
Let's implement SVD from first principles to understand every computational step. We compute it via eigendecomposition of A\u1d40A.
Complete SVD Implementation from Scratch
🐍svd_from_scratch.py
Explanation(16)
Code(222)
1Import NumPy
NumPy provides efficient numerical computing including linear algebra operations essential for SVD.
4SVD Class Definition
Encapsulates SVD functionality. Unlike PCA, SVD works on ANY matrix (not just square or symmetric), making it more general.
10The SVD Decomposition
A = UΣVᵀ where U contains left singular vectors (column space), Σ contains singular values, and V contains right singular vectors (row space).
18Full vs Economy SVD
Full SVD returns complete U (m×m) and V (n×n). Economy SVD returns only the non-trivial parts - usually what you want.
EXAMPLE
For 1000×10 matrix: full U is 1000×1000, economy U is 1000×10
41Form AᵀA
AᵀA is always symmetric positive semi-definite. Its eigenvalues are σ² (squared singular values) and eigenvectors are V (right singular vectors).
44Eigendecomposition
np.linalg.eigh() efficiently computes eigendecomposition of symmetric matrices. This gives us V and σ².
48Sort Descending
Eigenvalues from eigh are ascending, but we want descending order (largest singular values first for importance ranking).
53Singular Values
σᵢ = √λᵢ where λᵢ are eigenvalues of AᵀA. We use max(0, λ) to handle tiny negative values from numerical errors.
56Matrix Rank
The rank equals the number of non-zero singular values. This tells us the true dimensionality of the column/row space.
61Left Singular Vectors
Uᵢ = A*Vᵢ/σᵢ. Each left singular vector is the normalized result of applying A to the corresponding right singular vector.
90Transform Method
Projects data onto top-k right singular vectors. This is equivalent to dimensionality reduction, finding the best k-dimensional representation.
105Inverse Transform
Reconstructs the original space from the reduced representation. Information lost depends on how many components were dropped.
116Low-Rank Approximation
A_k = U_kΣ_kV_kᵀ gives the BEST rank-k approximation in Frobenius norm (Eckart-Young theorem). No other rank-k matrix is closer to A.
134Explained Variance
Variance is proportional to σ². If first singular value captures 90% of total σ², that component explains 90% of the data's structure.
143Condition Number
κ = σ_max/σ_min measures numerical stability. Large κ means small input changes cause large output changes (ill-conditioned).
EXAMPLE
κ = 10 is well-conditioned, κ = 10^10 is numerically unstable
153Pseudoinverse
A⁺ = VΣ⁺Uᵀ where Σ⁺ inverts non-zero singular values. Solves least-squares Ax=b even when A is non-square or singular.
206 lines without explanation
1import numpy as np
2from typing import Tuple, Optional
34classSVD:5"""
6 Singular Value Decomposition from scratch.
78 Decomposes any m×n matrix A into:
9 A = U @ Σ @ V^T
1011 Where:
12 - U is m×m orthogonal (left singular vectors)
13 - Σ is m×n diagonal (singular values)
14 - V is n×n orthogonal (right singular vectors)
15 """1617def__init__(self, full_matrices:bool=False):18"""
19 Initialize SVD.
2021 Args:
22 full_matrices: If True, return full U and V.
23 If False, return economy (thin) SVD.
24 """25 self.full_matrices = full_matrices
26 self.U_ =None27 self.S_ =None# Singular values (1D array)28 self.Vt_ =None# V^T (transposed right singular vectors)2930deffit(self, A: np.ndarray)->'SVD':31"""
32 Compute SVD of matrix A.
3334 The algorithm:
35 1. Compute A^T @ A (symmetric positive semi-definite)
36 2. Find eigenvalues λ and eigenvectors V of A^T @ A
37 3. Singular values σ = sqrt(λ)
38 4. Right singular vectors = V
39 5. Left singular vectors U = A @ V @ Σ^(-1)
4041 Args:
42 A: Input matrix of shape (m, n)
4344 Returns:
45 self: Fitted SVD instance
46 """47 m, n = A.shape
4849# Step 1: Form A^T @ A (n×n matrix)50 AtA = A.T @ A
5152# Step 2: Eigendecomposition of A^T @ A53# Eigenvalues are σ^2, eigenvectors are V54 eigenvalues, V = np.linalg.eigh(AtA)5556# Step 3: Sort in descending order57 idx = np.argsort(eigenvalues)[::-1]58 eigenvalues = eigenvalues[idx]59 V = V[:, idx]6061# Step 4: Singular values (avoid negative due to numerical errors)62 singular_values = np.sqrt(np.maximum(eigenvalues,0))6364# Step 5: Determine rank (non-zero singular values)65 rank = np.sum(singular_values >1e-10)6667# Step 6: Compute left singular vectors68# U = A @ V @ Σ^(-1) for non-zero singular values69if self.full_matrices:70 U = np.zeros((m, m))71# Fill in columns corresponding to non-zero singular values72for i inrange(rank):73 U[:, i]=(A @ V[:, i])/ singular_values[i]74# Complete orthonormal basis for null space75# (For full implementation, use QR decomposition on remaining)76else:77# Economy SVD: only compute min(m,n) vectors78 k =min(m, n)79 U = np.zeros((m, k))80for i inrange(min(k, rank)):81 U[:, i]=(A @ V[:, i])/ singular_values[i]8283# Store results84 self.U_ = U
85 self.S_ = singular_values[:min(m, n)]ifnot self.full_matrices else singular_values
86 self.Vt_ = V.T
8788return self
8990deftransform(self, A: np.ndarray, k: Optional[int]=None)-> np.ndarray:91"""
92 Project data onto top-k singular vectors.
9394 This gives the optimal rank-k approximation in the
95 Frobenius norm sense (Eckart-Young theorem).
9697 Args:
98 A: Matrix to transform
99 k: Number of components to keep (default: all)
100101 Returns:
102 Transformed data: A @ V_k (projections onto right singular vectors)
103 """104if k isNone:105 k =len(self.S_)106return A @ self.Vt_[:k].T
107108definverse_transform(self, Z: np.ndarray, k: Optional[int]=None)-> np.ndarray:109"""
110 Reconstruct from reduced representation.
111112 Args:
113 Z: Reduced data of shape (m, k)
114 k: Number of components used
115116 Returns:
117 Reconstructed matrix
118 """119if k isNone:120 k = Z.shape[1]121return Z @ self.Vt_[:k]122123deflow_rank_approximation(self, A: np.ndarray, k:int)-> np.ndarray:124"""
125 Compute optimal rank-k approximation of A.
126127 A_k = U_k @ Σ_k @ V_k^T
128129 This minimizes ||A - A_k||_F (Frobenius norm).
130131 Args:
132 A: Original matrix
133 k: Desired rank
134135 Returns:
136 Rank-k approximation of A
137 """138# Recompute SVD to ensure correct dimensions139 U, S, Vt = self.U_, self.S_, self.Vt_
140141# Truncate to rank k142 U_k = U[:,:k]143 S_k = S[:k]144 Vt_k = Vt[:k,:]145146# Reconstruct: A_k = U_k @ diag(S_k) @ V_k^T147return U_k @ np.diag(S_k) @ Vt_k
148149defexplained_variance_ratio(self)-> np.ndarray:150"""
151 Proportion of variance explained by each component.
152153 Since singular values are sqrt of eigenvalues of A^T @ A,
154 the variance explained is proportional to σ_i^2.
155 """156 total = np.sum(self.S_ **2)157return(self.S_ **2)/ total
158159defcondition_number(self)->float:160"""
161 Compute condition number κ = σ_max / σ_min.
162163 Measures numerical sensitivity of the matrix.
164 Large κ indicates an ill-conditioned matrix.
165 """166 nonzero = self.S_[self.S_ >1e-10]167iflen(nonzero)==0:168return np.inf
169return nonzero[0]/ nonzero[-1]170171defpseudoinverse(self, A: np.ndarray)-> np.ndarray:172"""
173 Compute Moore-Penrose pseudoinverse using SVD.
174175 A^+ = V @ Σ^+ @ U^T
176177 Where Σ^+ inverts non-zero singular values.
178 """179# Invert non-zero singular values180 S_inv = np.zeros_like(self.S_)181 mask = self.S_ >1e-10182 S_inv[mask]=1.0/ self.S_[mask]183184# A^+ = V @ Σ^+ @ U^T185return self.Vt_.T @ np.diag(S_inv) @ self.U_.T
186187188# Practical example: Image compression189if __name__ =="__main__":190# Create a simple test image (gradient pattern)191 np.random.seed(42)192193# Generate a 100x100 image with structure194 n =100195 x = np.linspace(0,4*np.pi, n)196 y = np.linspace(0,4*np.pi, n)197 X, Y = np.meshgrid(x, y)198 image = np.sin(X)+ np.cos(Y)+0.1* np.random.randn(n, n)199200print(f"Original image shape: {image.shape}")201print(f"Original storage: {image.size} values")202203# Compute SVD204 svd = SVD(full_matrices=False)205 svd.fit(image)206207print(f"\nSingular values (first 10): {svd.S_[:10].round(2)}")208print(f"Variance explained (first 10): {(svd.explained_variance_ratio()[:10]*100).round(1)}%")209print(f"Condition number: {svd.condition_number():.2f}")210211# Low-rank approximations212for k in[5,10,20,50]:213 approx = svd.low_rank_approximation(image, k)214 error = np.linalg.norm(image - approx,'fro')215 variance_kept = np.sum(svd.S_[:k]**2)/ np.sum(svd.S_**2)216 storage = k *(n + n +1)# U_k, V_k, S_k217 compression =100*(1- storage / image.size)218219print(f"\nRank-{k} approximation:")220print(f" Frobenius error: {error:.4f}")221print(f" Variance retained: {variance_kept*100:.1f}%")222print(f" Storage: {storage} values ({compression:.1f}% compression)")
Practical Tips and Gotchas
Use truncated SVD for large matrices. Full SVD of an m \u00d7 n matrix costs O(min(m\u00b2n, mn\u00b2)). Truncated SVD finding k components costs O(mnk).
Randomized SVD is your friend. For very large matrices, randomized algorithms (like sklearn's TruncatedSVD) are 10-100x faster with negligible error.
Don't invert small singular values. When computing pseudoinverse, only invert singular values above a threshold. Small values amplify noise.
Center for PCA, don't for SVD on sparse data. Centering destroys sparsity. For sparse matrices like text, use SVD directly (TruncatedSVD).
Check condition number. Large \u03ba = \u03c3\u2081/\u03c3\u1d63 indicates numerical instability. Consider regularization or low-rank approximation.
Singular vectors are not unique. If singular values are repeated, the corresponding singular vectors are only unique up to rotation within that subspace.
Watch memory for full SVD. Full U and V can be huge. Use full_matrices=False in NumPy to get economy SVD.
For recommenders, handle bias separately. Funk SVD learns biases explicitly rather than relying on SVD to capture them.
NumPy vs SciPy: Use scipy.linalg.svd for general SVD and scipy.sparse.linalg.svds for sparse matrices. NumPy's np.linalg.svd is fine for dense matrices but doesn't support sparse input.
Limitations of SVD
Linear only: SVD captures linear relationships. For nonlinear structure, consider kernel methods or neural network autoencoders.
Memory intensive: Full SVD requires storing U, \u03a3, V. For very large matrices, only iterative/randomized methods are practical.
Batch computation: Standard SVD requires the full matrix in memory. Incremental/online updates are possible but complex.
Interpretation challenges: Singular vectors are linear combinations of all features. They may not be interpretable.
No probability model: SVD is a deterministic factorization. For uncertainty quantification, use probabilistic PCA or Bayesian methods.
Sensitivity to outliers: Large values in the matrix disproportionately affect singular values. Consider robust alternatives for noisy data.
Limitation
When It Matters
Alternative
Linearity
Data lies on curved manifold
Kernel PCA, t-SNE, UMAP, autoencoders
Memory
Matrix too large for memory
Randomized SVD, iterative methods
Streaming data
Data arrives continuously
Incremental SVD, online PCA
Interpretability
Need to explain factors
NMF (Non-negative Matrix Factorization)
Outliers
Noisy or corrupted data
Robust PCA, median-based methods
Practice Problems
Conceptual: Explain geometrically what happens when a 3\u00d72 matrix has rank 1. What do the singular vectors tell us?
Mathematical: Prove that the singular values of A are the square roots of eigenvalues of A\u1d40A. Show that U, V are orthogonal.
Computational: Implement SVD from scratch. Verify your results match np.linalg.svd. Test on random matrices of various shapes.
Image compression: Load an image, compute its SVD, and visualize reconstructions at different ranks. Plot PSNR vs rank.
Text analysis: Implement LSA on a corpus of your choice. Find semantically similar documents using cosine similarity in the reduced space.
Recommender: Train an SVD recommender on MovieLens data. Compare with simple baselines (user mean, item mean).
Pseudoinverse: Solve an overdetermined system Ax = b using SVD pseudoinverse. Compare with np.linalg.lstsq.
Deep learning: Analyze the singular value distribution of weight matrices in a pre-trained CNN. What does the spectrum tell you?
LoRA simulation: Take a weight matrix W, compute its SVD, and simulate LoRA by adding low-rank perturbations. How does rank affect approximation quality?
Condition number: Create matrices with varying condition numbers. Observe how solving Ax = b becomes unstable as \u03ba increases.
Key Insights
SVD = rotation + scaling + rotation. Any matrix transformation can be understood as this sequence of simple operations.
Singular values measure importance. Larger singular values correspond to more important dimensions. The spectrum reveals the effective rank.
Truncated SVD is optimal. The Eckart-Young theorem guarantees no other rank-k matrix is closer to the original.
PCA is SVD of centered data. This connection unifies the two most important matrix decompositions in statistics.
Applications are everywhere: Text (LSA), recommendations (collaborative filtering), images (compression), linear systems (pseudoinverse), deep learning (LoRA).
Condition number matters. Large \u03ba = \u03c3\u2081/\u03c3\u1d63 indicates numerical instability and potential issues.
Low-rank structure is common. Real-world data often has much lower effective rank than its nominal dimension.
SVD is fundamental to modern AI. From LoRA fine-tuning to model compression, SVD concepts permeate contemporary machine learning.
The Essence of SVD: Every matrix tells a story - SVD reads that story. It reveals the essential structure hiding inside any matrix: how many dimensions truly matter (rank), how important each is (singular values), and what directions define them (singular vectors). This universal decomposition is the Swiss Army knife of numerical linear algebra, equally at home in theoretical proofs and billion-scale production systems. Master SVD, and you've mastered the language in which matrices speak.