Chapter 1

The Power of Linearity

The Geometric Universe

Learning Objectives

By the end of this section, you will:

  • Understand the precise mathematical definition of linearity: what the two properties (additivity and homogeneity) mean, and why they matter
  • Be able to test whether a given function is linear by checking both properties, and recognize common nonlinear traps (like translations)
  • Grasp the superposition principle and see why it makes complex problems tractable: you only need to understand the building blocks to understand everything
  • See the deep connection between linearity and matrices: how the linearity constraint forces every linear function to be representable as a matrix
  • Understand linearization as the bridge between nonlinear reality and linear tools, and why this makes linear algebra the universal first-order approximation toolkit
  • Recognize how linearity appears in circuits, springs, signals, neural networks, attention mechanisms, and gradient descent

Why This Matters

Linearity is not just a mathematical property. It is the single idea that makes vast swaths of science, engineering, and computing tractable. When a system is linear, we can decompose complex inputs into simple pieces, process each piece independently, and reassemble the results. When a system is not linear, we can still use linear algebra by approximating locally. Understanding linearity is understanding why linear algebra appears everywhere.

The Big Picture: Why Linearity Changes Everything

Imagine you are an engineer designing a bridge. The bridge must withstand dozens of different forces simultaneously: the weight of the deck, the tension in the cables, the push of wind, the rumble of traffic, the thermal expansion from the sun. How can you possibly analyze all of these forces acting together?

Here is the remarkable answer: if the bridge's response to force is linear, then you can analyze each force separately and add the results. You compute the deflection from gravity alone, the deflection from wind alone, the deflection from traffic alone, and simply add them up. The total deflection equals the sum of the individual deflections. You have decomposed an impossibly complex problem into manageable pieces.

This is not a trick specific to bridges. It is the superposition principle, and it works whenever the system you are studying is linear. It works for electrical circuits (each voltage source can be analyzed independently), for sound waves (each instrument in an orchestra adds to the total waveform), for light (the colors in a prism add to form white), and for neural networks (each input feature contributes independently through the weight matrix).

The concept of linearity was not invented in a vacuum. Joseph Fourier showed in 1822 that any periodic signal can be decomposed into a sum of simple sine waves. This only works because the wave equation is linear. Oliver Heaviside used superposition to analyze telegraph circuits in the 1880s. Today, every signal processor, every equalizer, every noise cancellation system relies on the same principle. The mathematics of linearity is the mathematics of breaking complex things into simple parts.

The Core Insight: Linearity means "the whole equals the sum of its parts." If a function is linear, then you can understand it completely by understanding what it does to simple building blocks. This one property is what separates tractable problems from intractable ones across all of science and engineering.

What Linearity Really Means

In Section 1, we met the "golden rule of linearity" as a single formula. Now let us examine it carefully, symbol by symbol, and understand exactly what it requires.

The Two Properties of Linearity

A function f is linear if and only if it satisfies two conditions for all vectors u and v and all scalars c:

Property 1 — Additivity (preserves addition):

f(u + v) = f(u) + f(v)

In words: applying f to the sum of two vectors gives the same result as applying f to each vector separately and then adding the results. The function does not "care" whether you add before or after applying it.

Property 2 — Homogeneity (preserves scaling):

f(cu) = c·f(u)

In words: scaling a vector before applying f is the same as applying f first and scaling afterward. If you double the input, the output doubles. If you halve the input, the output halves.

These two properties can be combined into a single elegant statement. For all vectors u, v and all scalars α, β:

f(αu + βv) = α·f(u) + β·f(v)

This combined form is called the superposition property. It says that f preserves linear combinations. Whatever linear combination of inputs you feed in, you get the same linear combination of outputs.

Dimensions and Types

A linear function f maps vectors from one space to another: f: ℝⁿ → ℝᵐ. The input is a vector with n components, and the output is a vector with m components. The dimensions n and m can be different. For example, a projection from 3D to 2D has n = 3, m = 2.
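The dimension bookkeeping is easy to check in NumPy. A minimal sketch of the 3D-to-2D projection mentioned above (the matrix P here is one illustrative choice: keep x and y, drop z):

```python
import numpy as np

# A projection from R^3 to R^2: n = 3 inputs, m = 2 outputs.
# The matrix is m x n = 2 x 3; it keeps x and y and drops z.
P = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])

v = np.array([3.0, -1.0, 7.0])   # a vector in R^3
w = P @ v                         # its image in R^2

print(P.shape)  # (2, 3)
print(w)        # [ 3. -1.]
```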

What Linearity Does NOT Mean

There is a crucial distinction that trips up almost every beginner. In everyday language and in high school algebra, a "linear function" means a function whose graph is a straight line: f(x) = mx + b. But in linear algebra, this function is not linear (unless b = 0). It is called affine.

Why? Because f(0) = m·0 + b = b ≠ 0 when b ≠ 0. A truly linear function must satisfy f(0) = 0. This follows immediately from homogeneity: set c = 0, and you get f(0·u) = 0·f(u) = 0.

Linear vs. Affine

  • Linear: f(x) = 3x — passes through the origin, preserves addition and scaling
  • Affine (NOT linear): f(x) = 3x + 2 — shifts the origin, violates f(0) = 0
  • Rule of thumb: A linear function maps the zero vector to the zero vector. If f(0) ≠ 0, the function cannot be linear.
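The rule of thumb is a one-line check in code. A minimal sketch comparing the linear map f(x) = 3x with the affine map g(x) = 3x + 2:

```python
f = lambda x: 3 * x          # linear: f(0) = 0
g = lambda x: 3 * x + 2      # affine: g(0) = 2, so g cannot be linear

print(f(0.0))  # 0.0
print(g(0.0))  # 2.0

# g also fails homogeneity: g(2*1) = 8 but 2*g(1) = 10
print(g(2 * 1.0), 2 * g(1.0))  # 8.0 10.0
```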

The Linearity Test: Who Passes?

Let us apply the definition to several functions and see which ones are linear. The interactive tool below lets you test functions yourself by choosing specific input vectors and checking whether additivity and homogeneity hold.

Examples that PASS:

  • Doubling: f(x, y) = (2x, 2y). Check: f(u + v) = 2(u + v) = 2u + 2v = f(u) + f(v). Passes both tests.
  • Rotation by 90°: f(x, y) = (−y, x). Rotating the sum of two vectors is the same as rotating each and adding. Rotating a scaled vector is the same as scaling the rotated vector.
  • Projection onto the x-axis: f(x, y) = (x, 0). Dropping the y-component preserves addition and scaling.

Examples that FAIL:

  • Squaring: f(x, y) = (x², y²). Consider u = (2, 0) and v = (3, 0). Then f(u + v) = f(5, 0) = (25, 0), but f(u) + f(v) = (4, 0) + (9, 0) = (13, 0). Since 25 ≠ 13, the function fails additivity.
  • Translation: f(x, y) = (x + 1, y + 1). We have f(0, 0) = (1, 1) ≠ (0, 0). Instant disqualification: the zero vector does not map to zero.
  • Absolute value: f(x, y) = (|x|, |y|). Consider u = (1, 0) and v = (−1, 0). Then f(u + v) = f(0, 0) = (0, 0), but f(u) + f(v) = (1, 0) + (1, 0) = (2, 0). Fails.

Try it yourself: select each function below, adjust the vectors, and watch whether the orange arrow (direct computation) matches the purple dashed arrow (component-wise computation). When they diverge, the function is not linear.

The Linearity Tester

Interactive: select a function and test whether it satisfies the linearity conditions. The input space shows u, v, and u + v; the output space shows f(u), f(v), f(u) + f(v), and f(u + v). For f(x, y) = (2x, 2y), a pure stretch that scales every vector by 2, both computation paths give the same result for every input, so the verdict is LINEAR: it passes both tests.

Quick Linearity Checklist

Before doing a full test, check two quick necessary conditions: (1) Does f(0) = 0? If not, the function is not linear. (2) Does f(2u) = 2f(u) for some test vector? If not, it is not linear. These sanity checks often settle the question before you attempt a full proof.

The Superposition Principle

The two properties of linearity combine into a single, extraordinarily powerful idea: the superposition principle. Let us state it precisely and then see why it changes everything.

If f is linear, and a vector v can be written as a linear combination of other vectors:

v = c₁v₁ + c₂v₂ + ⋯ + cₙvₙ

then:

f(v) = c₁f(v₁) + c₂f(v₂) + ⋯ + cₙf(vₙ)

Read that carefully. It says: if you know what f does to each building block v₁, …, vₙ, then you automatically know what f does to any linear combination of those building blocks. You do not need to compute f from scratch for every possible input. You just need a small number of "test cases," and superposition gives you the rest for free.

The Basis Determines Everything

Here is where superposition becomes truly magical. Recall from Section 2 that every vector in ℝ² can be written as a linear combination of the two standard basis vectors e₁ = (1, 0) and e₂ = (0, 1):

v = v₁e₁ + v₂e₂

By superposition, a linear function f applied to v gives:

f(v) = v₁f(e₁) + v₂f(e₂)

This is a stunning result: if you know where the two basis vectors land, you know where every vector in the entire plane lands. The function f is completely determined by just two pieces of information: f(e₁) and f(e₂).

The interactive visualization below demonstrates this concretely. You choose a vector v by setting its coefficients a and b, and a transformation T. The visualization shows three steps:

  1. Decompose: Write v = a·e₁ + b·e₂
  2. Transform the basis: Compute T(e₁) and T(e₂)
  3. Recombine with the same coefficients: a·T(e₁) + b·T(e₂) = T(v)

Both paths always give the same result. This is the superposition principle at work.

The Superposition Principle in Action

Interactive: decompose a vector into basis components, transform each piece, then recombine; compare with transforming the original directly. Example: for v = 2.0·e₁ + 1.5·e₂ with T(e₁) = (2.00, 0.00) and T(e₂) = (0.00, 2.00), recombining gives 2.0·T(e₁) + 1.5·T(e₂) = (4.00, 3.00), which matches T(v) = (4.00, 3.00) computed directly. You only needed to know where the two basis vectors land to determine where any vector goes. This is the power of linearity.

The Deep Takeaway: Superposition is what makes linear algebra tractable. Instead of understanding a function on infinitely many inputs, you only need to understand it on a finite basis. This is the fundamental reason linear algebra is computationally powerful: finite data (the matrix columns) encodes infinite behavior (the entire transformation).

From Linearity to Matrices

We have just seen that a linear function from ℝ² to ℝ² is completely determined by where it sends e₁ and e₂. Suppose:

f(e₁) = (a, c),   f(e₂) = (b, d)

Then for any vector v = (v₁, v₂):

f(v) = v₁(a, c) + v₂(b, d) = (av₁ + bv₂, cv₁ + dv₂)

We can write this compactly as a matrix-vector product:

f(v) = ⎛ a  b ⎞ ⎛ v₁ ⎞ = Av
       ⎝ c  d ⎠ ⎝ v₂ ⎠

This is a profound connection: every linear function can be represented as a matrix, and every matrix represents a linear function. The columns of the matrix are exactly the images of the basis vectors under the transformation. This is not a coincidence or a convention. It is a mathematical necessity forced by the linearity property.

This explains why matrix multiplication is defined the way it is. When you multiply a matrix by a vector, you are computing v₁ times the first column plus v₂ times the second column. You are using the input's coordinates as weights to combine the transformed basis vectors. The definition of matrix multiplication is not arbitrary. It is the unique way to encode a linear transformation.

The same logic extends to higher dimensions. A linear function f: ℝⁿ → ℝᵐ is represented by an m × n matrix whose n columns are the images of the n standard basis vectors, each of which lives in ℝᵐ.
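This column-by-column construction works in any dimension. A sketch for a map from ℝ³ to ℝ² (the function h here is a made-up linear example, not one from the text):

```python
import numpy as np

# A linear map R^3 -> R^2, written out by hand (illustrative example)
def h(v):
    x, y, z = v
    return np.array([2*x - y, y + 3*z])

# Build its matrix column by column: column j is the image h(e_j)
n = 3
A = np.column_stack([h(np.eye(n)[:, j]) for j in range(n)])
print(A)   # columns are h(e1) = (2, 0), h(e2) = (-1, 1), h(e3) = (0, 3)

# The matrix now reproduces h on any input
v = np.array([1.0, 2.0, -1.0])
print(A @ v, h(v))   # both give the same vector
```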

Concept | Meaning | Matrix Representation
Linear function | Preserves addition and scaling | Encoded as a matrix A
Columns of A | Where basis vectors land | A = [f(e₁) f(e₂) … f(eₙ)]
Matrix-vector product Av | Apply transformation to v | v₁·col₁ + v₂·col₂ + ⋯
Matrix multiplication AB | Compose transformations: first B, then A | Apply B, then A, to every basis vector

Explore this connection interactively. In the visualization below, edit the matrix entries and watch how the two basis vectors (red and blue arrows) move. The entire grid deformation is determined by those two arrows.

Interactive: 2D Linear Transformation

Edit the entries of the matrix A (initially the identity, with det(A) = 1.000) or choose a preset. The red arrow shows where e₁ = (1, 0) lands, and the blue arrow shows where e₂ = (0, 1) lands. Together, they completely determine the transformation. The grid shows how the entire plane deforms.


Linearity in Real-World Systems

Linearity is not just a mathematical abstraction. It is a property of many physical systems, and recognizing it is the key to analyzing them.

Electrical Circuits: Ohm's Law

Ohm's law states that the voltage V across a resistor is proportional to the current I flowing through it: V = IR. This is a linear relationship. If you double the current, the voltage doubles. If you have two current sources feeding the same circuit, the total voltage is the sum of the voltages each source would produce alone. This is why circuit engineers can use superposition to analyze complex circuits with multiple sources: turn off all sources except one, compute the result, repeat for each source, and add up the answers.
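This source-by-source analysis can be sketched numerically. The node equations of a linear resistive network take the form Gv = i (a conductance matrix G times node voltages v equals injected currents i); the matrix below is a toy example for illustration, not a specific real circuit:

```python
import numpy as np

# Toy nodal equations G v = i for a linear resistive network
G = np.array([[ 3.0, -1.0],
              [-1.0,  2.0]])           # conductance matrix (siemens)

i1 = np.array([1.0, 0.0])             # current source 1 acting alone
i2 = np.array([0.0, 0.5])             # current source 2 acting alone

v1 = np.linalg.solve(G, i1)           # node voltages from source 1 only
v2 = np.linalg.solve(G, i2)           # node voltages from source 2 only
v_both = np.linalg.solve(G, i1 + i2)  # both sources on at once

# Superposition: the combined response is the sum of the individual ones
print(np.allclose(v_both, v1 + v2))   # True
```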

Mechanical Systems: Hooke's Law

A spring obeys Hooke's law: F = −kx, where F is the restoring force, k is the spring constant, and x is the displacement. This is linear in x: doubling the displacement doubles the force. Two separate displacements add up to produce the sum of their individual forces. This is why structural engineers can use linear algebra to analyze buildings and bridges under multiple loads.

Signal Processing: Superposition of Waves

Sound is the sum of pressure waves. When two speakers play different notes, the resulting waveform is the sum of the individual waveforms. This works because the wave equation is linear. The entire field of Fourier analysis rests on decomposing complex signals into sums of simple sine waves, processing each frequency independently (filtering, compression, equalization), and recombining. Your phone's noise cancellation, your music streaming service's compression algorithm, and every digital audio effect are applications of linearity.

Computer Graphics: Composing Transformations

Every frame of a 3D video game applies dozens of transformations to millions of vertices: rotation, scaling, projection, camera movement. Because each transformation is linear (representable as a matrix), they can be composed by matrix multiplication. Instead of applying 20 separate transformations to each vertex, the game engine multiplies the 20 matrices into a single matrix and applies it once. The linearity of each transformation guarantees that the composition is also linear.
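The "multiply the matrices once, apply once" trick is easy to verify. A sketch composing a rotation and a scaling (two transformations rather than twenty, but the principle is identical):

```python
import numpy as np

theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # rotate by 45 degrees
S = np.diag([2.0, 0.5])                           # scale x by 2, y by 0.5

M = S @ R          # precomputed composition: rotate first, then scale

v = np.array([1.0, 1.0])
step_by_step = S @ (R @ v)   # apply each transformation in turn
all_at_once  = M @ v         # apply the single composed matrix

print(np.allclose(step_by_step, all_at_once))  # True
```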

System | Linear Law | Superposition Application
Circuits | V = IR (Ohm's law) | Analyze each source independently, add results
Springs | F = −kx (Hooke's law) | Sum forces from multiple loads
Waves | Wave equation | Fourier: decompose signal into frequencies
Optics | Maxwell's equations (linear) | Interference, diffraction patterns
Economics | Input–output models (Leontief) | Industry interdependencies as matrix equations
Structural Eng. | Linear elasticity | Finite element method — solve huge sparse linear systems

The Limits of Linearity: When the World Curves

If linearity is so powerful, why doesn't everything just work with linear algebra? The answer is simple: most real-world systems are nonlinear.

  • Gravity follows an inverse-square law: F = G·m₁m₂/r². Double the distance, and the force drops by a factor of 4, not 2.
  • Fluid dynamics is governed by the Navier-Stokes equations, which are nonlinear. That is why weather prediction is so difficult.
  • Neural network activation functions like ReLU (max(0, x)) and sigmoid are intentionally nonlinear, because purely linear networks can only compute linear functions.
  • Population growth, chemical reactions, and economic markets all exhibit nonlinear behavior.

So if the world is mostly nonlinear, why study linearity? Because of one of the most powerful ideas in all of mathematics: linearization.

Linearization: The Universal Bridge

Every smooth function, no matter how complex, looks linear if you zoom in close enough. This is the geometric meaning of the derivative.

For a function of one variable, the tangent line at a point x₀ gives the best linear approximation:

f(x) ≈ f(x₀) + f′(x₀)(x − x₀)
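The tangent-line formula can be checked numerically for f(x) = x² at x₀ = 1, where f′(x₀) = 2:

```python
# Linear approximation of f(x) = x^2 around x0 = 1
f = lambda x: x**2
x0, fp = 1.0, 2.0                       # f'(x) = 2x, so f'(1) = 2
approx = lambda x: f(x0) + fp * (x - x0)

# Shrink the step and watch the approximation error vanish:
# here the error is exactly dx^2, quadratically small in dx
for dx in (1.0, 0.1, 0.01):
    x = x0 + dx
    print(dx, abs(f(x) - approx(x)))
```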

For a vector-valued function f: ℝⁿ → ℝᵐ, the derivative is the Jacobian matrix J, and the linearization becomes:

f(x) ≈ f(x₀) + J(x₀)(x − x₀)

This is a linear approximation, and it can be studied with all the tools of linear algebra. The Jacobian J is a matrix whose entries are partial derivatives. It tells you how the function behaves locally: which directions it stretches, which it compresses, and which it leaves unchanged.
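The same idea in several variables, sketched for the made-up map f(x, y) = (x² + y, xy): each row of the Jacobian holds the partial derivatives of one output component, and the linearization predicts nearby values.

```python
import numpy as np

# A nonlinear map f: R^2 -> R^2 (an illustrative example)
def f(p):
    x, y = p
    return np.array([x**2 + y, x * y])

def jacobian(p):
    x, y = p
    # Row 1: partials of x^2 + y; row 2: partials of x*y
    return np.array([[2*x, 1.0],
                     [y,   x  ]])

p0 = np.array([1.0, 2.0])
dp = np.array([0.01, -0.02])

exact  = f(p0 + dp)
linear = f(p0) + jacobian(p0) @ dp   # first-order approximation

print(np.abs(exact - linear).max() < 1e-3)  # True: agreement to ~1e-4
```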

Explore this idea in the interactive visualization below. Select a nonlinear curve, move the tangent point, and increase the zoom. At high zoom, the curve and its tangent line become indistinguishable. This is linearization in action.

Linear Approximation: Every Curve Is Locally a Line

Zoom into any smooth curve and it becomes indistinguishable from its tangent line. This is why linearization works.

Interactive readout for f(x) = x² at x₀ = 1.00: f(x₀) = 1.0000, slope f′(x₀) = 2.0000, so f(x) ≈ 1.00 + 2.00·(x − 1.00). Try increasing the zoom to see the curve become indistinguishable from its tangent line.

This is why linear algebra appears in every branch of science and engineering: even when the underlying system is nonlinear, we can always approximate it locally with a linear model. The Jacobian matrix captures the local behavior, eigenvalues of the Jacobian determine stability, and linear algebra provides the toolkit for analyzing all of it.

Newton's Method: Linearization in Action

Newton's method for finding roots of equations works by repeatedly linearizing the function at the current guess and solving the linear approximation. Each step replaces a hard nonlinear problem with an easy linear one. The method converges rapidly because the linear approximation becomes more accurate as you get closer to the root. This is linearization as a computational algorithm.
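A minimal sketch of this loop, finding √2 as the root of f(x) = x² − 2: each update solves the linearization f(x) + f′(x)·Δx = 0 for the step Δx.

```python
# Newton's method: repeatedly linearize f at the current guess and
# solve the linear model for the next guess.
f  = lambda x: x**2 - 2
fp = lambda x: 2 * x           # derivative of f

x = 1.0                        # starting guess
for _ in range(5):
    x = x - f(x) / fp(x)       # solve f(x) + f'(x)*dx = 0 for dx
    print(x)

print(abs(x - 2**0.5) < 1e-10)  # True: rapid (quadratic) convergence
```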

Linearity in Modern AI and Computing

Modern AI systems are built on a delicate interplay between linear and nonlinear operations. Understanding this interplay requires understanding linearity.

Neural Networks: Linear Layers + Nonlinear Activations

Each layer of a neural network performs a linear transformation followed by a nonlinear activation:

h = σ(Wx + b)

The matrix W is a linear transformation. It projects the input x into a new representation space. The activation function σ (ReLU, sigmoid, tanh) adds the nonlinearity needed to learn complex patterns.

Why do we need both? A composition of linear functions is still linear: A₂(A₁x) = (A₂A₁)x. Stacking 100 linear layers is equivalent to a single linear layer. The nonlinear activation between layers is what gives deep networks their expressive power. But the linear layers are where most of the computation happens, and understanding their geometry (the weight matrices, their rank, their eigenvalues) is essential for understanding what the network has learned.
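The collapse of stacked linear layers is easy to demonstrate: two random weight matrices applied in sequence act exactly like their single product.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 3))   # first "layer": R^3 -> R^4
W2 = rng.standard_normal((2, 4))   # second "layer": R^4 -> R^2

x = rng.standard_normal(3)

two_layers = W2 @ (W1 @ x)   # stack of two linear layers, no activation
one_layer  = (W2 @ W1) @ x   # the single equivalent linear layer

print(np.allclose(two_layers, one_layer))  # True
```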

Attention Is Linear Algebra

The attention mechanism in transformers (GPT, Claude, BERT) is built entirely from linear operations:

  1. Each token is projected into query, key, and value vectors using weight matrices: Q = XW_Q, K = XW_K, V = XW_V. These are linear transformations.
  2. Attention scores are computed as dot products: QKᵀ. The dot product is a bilinear operation.
  3. The output is a weighted sum of value vectors. A weighted sum is a linear combination.

The only nonlinear step is the softmax that normalizes the attention scores. Everything else is matrix multiplication, i.e., linear algebra.
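Those steps fit in a few lines. This is a bare single-head attention sketch with made-up dimensions and random weights, for illustration only, not any particular model's implementation:

```python
import numpy as np

rng = np.random.default_rng(42)
n_tokens, d_model, d_k = 4, 8, 8   # made-up sizes for illustration

X   = rng.standard_normal((n_tokens, d_model))   # token representations
W_Q = rng.standard_normal((d_model, d_k))
W_K = rng.standard_normal((d_model, d_k))
W_V = rng.standard_normal((d_model, d_k))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V              # linear projections

scores  = Q @ K.T / np.sqrt(d_k)                 # bilinear dot products
weights = np.exp(scores)                          # softmax: the one
weights /= weights.sum(axis=1, keepdims=True)     # nonlinear step

output = weights @ V                              # linear combination of values
print(output.shape)                               # (4, 8)
```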

Gradient Descent Is Linearization

Training a neural network means minimizing a loss function L(θ) with respect to the parameters θ. Gradient descent works by linearizing the loss function at the current parameter values:

L(θ + Δθ) ≈ L(θ) + ∇L · Δθ

The gradient ∇L is a vector of partial derivatives. It tells you the direction of steepest increase. The update step Δθ = −η∇L moves the parameters in the direction of steepest decrease. This is linearization: you approximate the nonlinear loss surface with a linear model (a tangent hyperplane), take a step, and repeat.
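The linearize-step-repeat loop fits in a few lines. The loss L(θ) = (θ − 3)² here is a toy one-parameter example chosen so the answer is obvious:

```python
# Gradient descent on the toy loss L(theta) = (theta - 3)^2
L    = lambda t: (t - 3.0) ** 2
grad = lambda t: 2.0 * (t - 3.0)   # dL/dtheta

theta, eta = 0.0, 0.1              # initial parameter, learning rate
for _ in range(100):
    theta -= eta * grad(theta)     # step along -gradient of the linearization

print(round(theta, 6))             # converges to the minimizer 3.0
```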

Backpropagation Is the Chain Rule for Matrices

Computing the gradient of a deep network uses the chain rule. Each layer's contribution to the gradient involves the Jacobian matrix of that layer. For a linear layer h = Wx, the Jacobian is simply W itself. The gradient flows backward through the network as a sequence of matrix multiplications. Backpropagation is linear algebra applied to the chain rule.

AI Concept | Linear Algebra Operation | Why Linearity Matters
Neural network layer | Matrix-vector multiply Wx | Projects input to new representation
Attention mechanism | QKᵀ and weighted sum of V | Computes relevance scores via dot products
Word embeddings | Vectors in ℝⁿ | Semantic similarity = cosine of angle
Gradient descent | Linearize loss, step in −∇L | Linear approximation of loss surface
Backpropagation | Chain of Jacobian matrices | Gradient flows via matrix multiplication
PCA / Dimensionality reduction | Eigenvalue decomposition | Find directions of maximum variance

Why GPUs Are Matrix Machines

The reason modern AI runs on GPUs (Graphics Processing Units) is precisely because GPUs are optimized for the most common operation in linear algebra: matrix multiplication. A single GPU can perform trillions of multiply-add operations per second. The entire AI hardware ecosystem, from NVIDIA's tensor cores to Google's TPUs, exists because linearity makes the core computations massively parallelizable. Each row of the output matrix can be computed independently, which is exactly the kind of parallelism that GPUs exploit.

The Computational View

Let us bring all these ideas together in code. The following Python program systematically tests functions for linearity and demonstrates the superposition principle:

Testing Linearity and the Superposition Principle

linearity_test.py — what the code below demonstrates:

  • A linear function: doubling is linear because 2(u + v) = 2u + 2v (additivity) and 2(cu) = c(2u) (homogeneity). Both properties hold algebraically for any input.
  • Rotation is linear: a 90° rotation is encoded as a matrix, and matrix multiplication is always linear. The key insight: any function that can be written as Av for some fixed matrix A is automatically linear.
  • Squaring is NOT linear: it violates additivity, since (u + v)² ≠ u² + v² in general (the cross term 2uv is missing). This is the classic example of a nonlinear operation.
  • Translation is NOT linear: adding a constant vector violates f(0) = 0. The function f(v) = v + (1, 1) is affine, not linear. This is the most common trap: a straight-line graph ≠ linear in the mathematical sense.
  • Randomized linearity test: we test both properties on 1000 random vectors. If ANY pair fails, the function is not linear. If all pass, it is very likely linear (though a proof requires checking all possible inputs, not just random samples).
  • Superposition in action: we decompose v = 3e₁ + (−2)e₂, transform each basis vector with A, then recombine with the same coefficients. The result matches the direct computation A @ v. Knowing A @ e₁ and A @ e₂ is enough to compute A @ v for ANY v.
  • Columns are transformed basis vectors: the columns of A are exactly A @ e₁ and A @ e₂. This is not a coincidence — it is the fundamental relationship between linear functions and matrices. The matrix IS the function, compactly recorded as column vectors.
import numpy as np

# Define some functions to test
def doubling(v):
    """A linear function: scales by 2"""
    return 2 * v

def rotation_90(v):
    """A linear function: 90-degree rotation"""
    R = np.array([[0, -1], [1, 0]], dtype=float)
    return R @ v

def squaring(v):
    """A nonlinear function: squares each component"""
    return v ** 2

def translation(v):
    """An affine function (NOT linear): shifts by (1, 1)"""
    return v + np.array([1.0, 1.0])

# Test linearity: check both properties
def test_linearity(f, name, num_trials=1000):
    passes = True
    for _ in range(num_trials):
        u = np.random.randn(2)
        v = np.random.randn(2)
        c = np.random.randn()

        # Test additivity: f(u + v) == f(u) + f(v)
        if not np.allclose(f(u + v), f(u) + f(v)):
            passes = False
            break

        # Test homogeneity: f(c*u) == c*f(u)
        if not np.allclose(f(c * u), c * f(u)):
            passes = False
            break

    status = "LINEAR" if passes else "NOT LINEAR"
    print(f"{name}: {status}")

# Run the tests
test_linearity(doubling, "Doubling f(v) = 2v")
test_linearity(rotation_90, "Rotation 90°")
test_linearity(squaring, "Squaring f(v) = v²")
test_linearity(translation, "Translation f(v) = v + (1,1)")

# Demonstrate superposition with a matrix
print("\n--- Superposition Principle ---")
A = np.array([[2, -1], [1, 3]], dtype=float)
e1 = np.array([1.0, 0.0])
e2 = np.array([0.0, 1.0])

# An arbitrary vector
v = np.array([3.0, -2.0])

# Path 1: Apply A directly to v
direct = A @ v
print(f"Direct:     A @ v = {direct}")

# Path 2: Decompose, transform basis, recombine
Ae1 = A @ e1  # Where e1 lands
Ae2 = A @ e2  # Where e2 lands
recombined = v[0] * Ae1 + v[1] * Ae2
print(f"Recombined: {v[0]}*A@e1 + {v[1]}*A@e2 = {recombined}")
print(f"Match? {np.allclose(direct, recombined)}")

# The columns of A ARE the transformed basis vectors
print(f"\nColumn 1 of A: {A[:, 0]} = A @ e1: {Ae1}")
print(f"Column 2 of A: {A[:, 1]} = A @ e2: {Ae2}")

Summary

This section has explored the single most important property in all of linear algebra: linearity. Let us summarize the key ideas:

  1. Linearity has two properties: additivity (f(u + v) = f(u) + f(v)) and homogeneity (f(cu) = c·f(u)). These combine into the superposition property: f(αu + βv) = α·f(u) + β·f(v).
  2. Linear ≠ straight line. The function f(x) = mx + b is affine, not linear (when b ≠ 0). Linear functions must map zero to zero.
  3. The superposition principle says that you can decompose any input into simple building blocks, process each one, and reassemble the results. This is the fundamental reason linear problems are tractable.
  4. A linear function is fully determined by its basis images. In ℝⁿ, knowing where the n basis vectors land tells you where every vector lands. These images become the columns of the matrix.
  5. Every linear function is a matrix, and every matrix is a linear function. This equivalence is the central bridge between abstract functions and concrete computation.
  6. Real-world linear systems (circuits, springs, waves, optics) can be analyzed by superposition, breaking complex problems into manageable pieces.
  7. Linearization bridges the nonlinear world with linear tools. The Jacobian matrix captures local behavior, enabling gradient descent, Newton's method, and stability analysis.
  8. Modern AI is built on linearity: weight matrices are linear transformations, attention is computed with dot products and linear combinations, and gradient descent is repeated linearization of the loss surface.
The road ahead: We have now seen that every linear function can be encoded as a matrix. In the next section, we will survey the vast landscape of applications where these ideas appear in practice, from solving systems of equations to compressing images to training the neural networks behind modern AI. The tools of linear algebra are not just elegant mathematics. They are the operational machinery of the modern world.