Chapter 0

Linear Algebra Essentials

Prerequisites & Mathematical Foundations

Introduction

Linear algebra is the mathematics of vectors and matrices. It provides the language for describing multivariate data, covariance structures, and transformations. Nearly every ML algorithm—from linear regression to deep learning—relies heavily on linear algebra.

Why This Matters for ML: Datasets are matrices, model parameters are vectors, and operations like PCA, SVD, and neural network computations are all linear algebra. Understanding these concepts is essential for implementing and optimizing ML algorithms.

Vectors

A vector is an ordered list of numbers. In probability and statistics, vectors represent data points, parameters, or random variable realizations.

\mathbf{x} = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} \in \mathbb{R}^n

Vector Operations

| Operation | Definition | Result |
| --- | --- | --- |
| Addition | x + y | (x₁+y₁, x₂+y₂, ..., xₙ+yₙ) |
| Scalar Multiplication | cx | (cx₁, cx₂, ..., cxₙ) |
| Dot Product | x · y | x₁y₁ + x₂y₂ + ... + xₙyₙ |
| Norm (length) | ‖x‖ | √(x₁² + x₂² + ... + xₙ²) |

Dot Product Properties

The dot product (inner product) is fundamental:

\mathbf{x} \cdot \mathbf{y} = \mathbf{x}^T \mathbf{y} = \sum_{i=1}^{n} x_i y_i

  • Geometric interpretation: \mathbf{x} \cdot \mathbf{y} = ||\mathbf{x}|| \cdot ||\mathbf{y}|| \cos\theta
  • Orthogonal vectors: x · y = 0 when x ⊥ y
  • Norm from dot product: ||\mathbf{x}|| = \sqrt{\mathbf{x} \cdot \mathbf{x}}
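
To make the geometric interpretation concrete, here is a minimal NumPy sketch (values chosen purely for illustration) that recovers the angle between two vectors from their dot product and norms:

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

# cos(theta) = (x · y) / (||x|| ||y||)
cos_theta = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
theta = np.arccos(cos_theta)
print(f"cos(theta) = {cos_theta:.4f}")              # ~0.9746
print(f"theta = {np.degrees(theta):.2f} degrees")   # ~12.93 degrees

# Orthogonal vectors have zero dot product
u = np.array([1.0, 0.0])
v = np.array([0.0, 1.0])
print(f"u · v = {np.dot(u, v)}")  # 0.0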

Matrices

A matrix is a rectangular array of numbers. An m × n matrix has m rows and n columns:

\mathbf{A} = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix}

Matrix Terminology

| Term | Definition | Example |
| --- | --- | --- |
| Square | m = n | 3×3 matrix |
| Diagonal | aᵢⱼ = 0 for i ≠ j | Only main diagonal non-zero |
| Identity (I) | δᵢⱼ (1 on diagonal) | AI = IA = A |
| Zero Matrix (0) | All elements are 0 | A + 0 = A |
| Symmetric | A = Aᵀ | Covariance matrices |

Matrix Operations

Matrix Multiplication

For matrices A (m × n) and B (n × p), the product C = AB is (m × p):

c_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}

Dimension Rule

A(m×n) × B(n×p) = C(m×p). The inner dimensions must match! Matrix multiplication is NOT commutative: AB ≠ BA in general.
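
A short NumPy sketch (shapes chosen only for illustration) makes the dimension rule and non-commutativity concrete:

import numpy as np

# Inner dimensions must match: (2x3) @ (3x4) -> (2x4)
A = np.random.randn(2, 3)
B = np.random.randn(3, 4)
C = A @ B
print(C.shape)  # (2, 4)

# B @ A would raise an error: (3x4) @ (2x3) has mismatched inner dimensions (4 != 2)

# Even for square matrices, AB != BA in general
P = np.array([[1, 2], [3, 4]])
Q = np.array([[0, 1], [1, 0]])
print(np.allclose(P @ Q, Q @ P))  # False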

Transpose

The transpose switches rows and columns:

(\mathbf{A}^T)_{ij} = a_{ji}

Properties:

  • (\mathbf{A}^T)^T = \mathbf{A}
  • (\mathbf{A}\mathbf{B})^T = \mathbf{B}^T\mathbf{A}^T
  • (\mathbf{A} + \mathbf{B})^T = \mathbf{A}^T + \mathbf{B}^T
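
These identities are easy to verify numerically; a minimal sketch with random matrices:

import numpy as np

A = np.random.randn(3, 2)
B = np.random.randn(2, 4)

# (A^T)^T = A
print(np.allclose(A.T.T, A))              # True
# (AB)^T = B^T A^T  (note the reversed order)
print(np.allclose((A @ B).T, B.T @ A.T))  # True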

Matrix Inverse

For a square matrix A, the inverse A⁻¹ satisfies:

\mathbf{A}\mathbf{A}^{-1} = \mathbf{A}^{-1}\mathbf{A} = \mathbf{I}

Not all square matrices are invertible. A matrix is singular (non-invertible) if its determinant is zero.

Determinant

The determinant is a scalar value that characterizes the matrix. For a 2×2 matrix:

\det \begin{pmatrix} a & b \\ c & d \end{pmatrix} = ad - bc
  • det(A) = 0 ⟺ A is singular (not invertible)
  • det(AB) = det(A) · det(B)
  • det(A⁻¹) = 1/det(A)
  • det(Aᵀ) = det(A)
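
A quick numerical check of these properties, including a singular matrix whose determinant is zero (values chosen for illustration):

import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[2.0, 0.0], [1.0, 3.0]])

# 2x2 formula: det = ad - bc
print(np.linalg.det(A))  # -2.0
# det(AB) = det(A) * det(B)
print(np.isclose(np.linalg.det(A @ B), np.linalg.det(A) * np.linalg.det(B)))  # True
# det(A^{-1}) = 1 / det(A)
print(np.isclose(np.linalg.det(np.linalg.inv(A)), 1 / np.linalg.det(A)))      # True

# A singular matrix (second row is twice the first) has det = 0 and no inverse
S = np.array([[1.0, 2.0], [2.0, 4.0]])
print(np.linalg.det(S))  # ~0.0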

Probability Application

The determinant appears in the multivariate normal distribution:
f(\mathbf{x}) = \frac{1}{(2\pi)^{n/2}|\boldsymbol{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)
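
As a sketch, the density can be evaluated directly from this formula and, assuming SciPy is available, cross-checked against scipy.stats.multivariate_normal (the mean, covariance, and evaluation point below are illustrative):

import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])
x = np.array([1.0, 0.5])

# Density evaluated directly from the formula
n = len(mu)
diff = x - mu
norm_const = 1.0 / ((2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma)))
density = norm_const * np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)

print(density)
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))  # should match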

Special Matrices

Covariance Matrix

For a random vector X, the covariance matrix is:

\boldsymbol{\Sigma} = E[(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})^T]

Properties:

  • Symmetric: Σ = Σᵀ
  • Positive semi-definite: xᵀΣx ≥ 0 for all x
  • Diagonal elements: Variances of individual variables
  • Off-diagonal elements: Covariances between variables

Positive Definite Matrices

A matrix A is positive definite if:

\mathbf{x}^T\mathbf{A}\mathbf{x} > 0 \quad \text{for all } \mathbf{x} \neq \mathbf{0}
Properties:

  • All eigenvalues are positive
  • Always invertible
  • Covariance matrices are positive semi-definite
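
In practice, positive definiteness of a symmetric matrix is usually checked via its eigenvalues or by attempting a Cholesky factorization (introduced later in this section); a minimal sketch with an illustrative matrix:

import numpy as np

A = np.array([[4.0, 2.0],
              [2.0, 3.0]])

# For a symmetric matrix: all eigenvalues positive <=> positive definite
eigenvalues = np.linalg.eigvalsh(A)
print(eigenvalues)              # both > 0
print(np.all(eigenvalues > 0))  # True

# Cholesky succeeds only for positive definite matrices,
# so it doubles as a practical test
try:
    np.linalg.cholesky(A)
    print("positive definite")
except np.linalg.LinAlgError:
    print("not positive definite")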

Orthogonal Matrices

A matrix Q is orthogonal if:

\mathbf{Q}^T\mathbf{Q} = \mathbf{Q}\mathbf{Q}^T = \mathbf{I}

Orthogonal matrices preserve lengths and angles. Their columns (and rows) are orthonormal vectors.
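
A rotation matrix is a familiar example; the sketch below (angle chosen for illustration) verifies QᵀQ = I and that lengths are preserved:

import numpy as np

# A 2D rotation matrix is orthogonal
theta = np.pi / 4
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

print(np.allclose(Q.T @ Q, np.eye(2)))  # True: Q^T Q = I

# Lengths are preserved under an orthogonal transformation
x = np.array([3.0, 4.0])
print(np.linalg.norm(x), np.linalg.norm(Q @ x))  # both ≈ 5.0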


Eigenvalues and Eigenvectors

For a square matrix A, if there exists a scalar λ and non-zero vector v such that:

\mathbf{A}\mathbf{v} = \lambda\mathbf{v}

Then λ is an eigenvalue and v is the corresponding eigenvector.

Computing Eigenvalues

Eigenvalues are found by solving the characteristic equation:

\det(\mathbf{A} - \lambda\mathbf{I}) = 0

Properties

  • Sum of eigenvalues = trace(A) = Σᵢ aᵢᵢ
  • Product of eigenvalues = det(A)
  • Symmetric matrices have real eigenvalues and orthogonal eigenvectors
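
A quick numerical check of the trace and determinant properties (illustrative matrix):

import numpy as np

A = np.array([[4.0, 2.0],
              [2.0, 3.0]])
eigenvalues = np.linalg.eigvals(A)

# Sum of eigenvalues equals the trace
print(np.isclose(eigenvalues.sum(), np.trace(A)))        # True (7.0)
# Product of eigenvalues equals the determinant
print(np.isclose(eigenvalues.prod(), np.linalg.det(A)))  # True (8.0)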

PCA Connection

Principal Component Analysis (PCA) finds the eigenvectors of the covariance matrix. The eigenvalues represent the variance explained by each principal component.
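
A minimal PCA sketch on toy data (the covariance and sample size below are illustrative), projecting onto the leading eigenvector of the estimated covariance:

import numpy as np

# Toy data: correlated 2D Gaussian samples
rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[3.0, 1.5], [1.5, 1.0]], size=500)

cov = np.cov(X.T)
eigenvalues, eigenvectors = np.linalg.eigh(cov)  # eigh: for symmetric matrices

# Sort components by decreasing variance
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

print(f"Variance explained: {eigenvalues / eigenvalues.sum()}")

# Project the data onto the first principal component;
# the variance of the projection equals the first eigenvalue
scores = X @ eigenvectors[:, 0]
print(f"{scores.var(ddof=1):.3f} vs {eigenvalues[0]:.3f}")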


Matrix Decompositions

Eigendecomposition

A symmetric matrix can be decomposed as:

\mathbf{A} = \mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^T

Where Q contains eigenvectors and Λ is diagonal with eigenvalues.
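
A quick numerical check of the reconstruction A = QΛQᵀ, using np.linalg.eigh for a symmetric matrix (illustrative values):

import numpy as np

A = np.array([[4.0, 2.0],
              [2.0, 3.0]])  # symmetric

# eigh is specialized for symmetric/Hermitian matrices
eigenvalues, Q = np.linalg.eigh(A)
Lambda = np.diag(eigenvalues)

# Reconstruct A = Q Λ Q^T
print(np.allclose(Q @ Lambda @ Q.T, A))  # True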

Singular Value Decomposition (SVD)

Any m × n matrix can be decomposed as:

\mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T
  • U: m × m orthogonal (left singular vectors)
  • Σ: m × n diagonal (singular values)
  • V: n × n orthogonal (right singular vectors)
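
A quick shape check (random rectangular matrix for illustration), rebuilding the m × n Σ and reconstructing A:

import numpy as np

A = np.random.randn(4, 3)    # rectangular matrix
U, s, Vt = np.linalg.svd(A)  # s holds the singular values

# Rebuild the m x n "diagonal" Sigma and reconstruct A
Sigma = np.zeros((4, 3))
Sigma[:3, :3] = np.diag(s)
print(np.allclose(U @ Sigma @ Vt, A))  # True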

Cholesky Decomposition

For a positive definite matrix A:

\mathbf{A} = \mathbf{L}\mathbf{L}^T

Where L is lower triangular. This is used for sampling from multivariate normal distributions.


Python Implementation

🐍linear_algebra_basics.py

import numpy as np

# Vectors
x = np.array([1, 2, 3])
y = np.array([4, 5, 6])

# Vector operations
print(f"x + y = {x + y}")           # [5, 7, 9]
print(f"2 * x = {2 * x}")           # [2, 4, 6]
print(f"x · y = {np.dot(x, y)}")    # 32
print(f"||x|| = {np.linalg.norm(x)}")  # 3.74...

# Matrix creation
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

# Matrix operations
print(f"A + B = \n{A + B}")
print(f"A @ B = \n{A @ B}")        # Matrix multiplication
print(f"A.T = \n{A.T}")            # Transpose

# Inverse and determinant
print(f"det(A) = {np.linalg.det(A)}")
print(f"A^(-1) = \n{np.linalg.inv(A)}")

Eigenvalues and Decompositions

🐍decompositions.py

import numpy as np

# Symmetric positive definite matrix
A = np.array([[4, 2], [2, 3]])

# Eigendecomposition
eigenvalues, eigenvectors = np.linalg.eig(A)
print(f"Eigenvalues: {eigenvalues}")
print(f"Eigenvectors:\n{eigenvectors}")

# Verify: A @ v = λ * v for each eigenpair
for i in range(len(eigenvalues)):
    v = eigenvectors[:, i]
    lam = eigenvalues[i]
    print(f"A @ v{i} = {A @ v}")
    print(f"λ{i} * v{i} = {lam * v}")

# SVD
U, S, Vt = np.linalg.svd(A)
print("\nSVD:")
print(f"U = \n{U}")
print(f"S = {S}")
print(f"V^T = \n{Vt}")

# Cholesky decomposition
L = np.linalg.cholesky(A)
print(f"\nCholesky L:\n{L}")
print(f"L @ L.T = \n{L @ L.T}")  # Should equal A

Covariance Matrix Example

🐍covariance.py

import numpy as np

# Generate correlated data
np.random.seed(42)
n_samples = 1000

# True covariance structure
true_cov = np.array([[1.0, 0.8],
                     [0.8, 1.0]])

# Generate samples using Cholesky
L = np.linalg.cholesky(true_cov)
z = np.random.randn(n_samples, 2)
X = z @ L.T  # Transform standard normal to desired covariance

# Estimate covariance from data
estimated_cov = np.cov(X.T)
print(f"True covariance:\n{true_cov}")
print(f"\nEstimated covariance:\n{estimated_cov}")

# Eigendecomposition of covariance
eigenvalues, eigenvectors = np.linalg.eig(estimated_cov)
print(f"\nVariances (eigenvalues): {eigenvalues}")
print(f"Principal directions:\n{eigenvectors}")

Summary

This section covered the linear algebra essentials for statistics:

  • Vectors represent data points and enable geometric interpretations
  • Matrices store data and linear transformations
  • Matrix operations (transpose, inverse, determinant) are fundamental building blocks
  • Covariance matrices capture relationships between variables
  • Eigendecomposition reveals principal directions and is central to PCA
  • SVD and Cholesky decompositions have important statistical applications

In the next section, we'll set up our Python environment with NumPy, SciPy, and other tools we'll use throughout this book.