Chapter 3

What is a Linear Transformation?

Linear Transformations - The Big Idea

By the end of this section you will see a linear transformation as a geometric reshaping of all of space that keeps the origin fixed, keeps grid lines straight and evenly spaced, and is fully determined by what it does to two basis vectors. No matrices required.

A Little Idea That Ate Geometry

Through the first half of the nineteenth century, people who worked with equations kept bumping into the same kind of object. A change of coordinates in astronomy would look like $x' = ax + by$, $y' = cx + dy$. A rotation in mechanics would look like that too. A projection onto a plane in surveying would look like that too. Different motivations, same four-number object. Nobody had a name for it yet.

In 1858, Arthur Cayley published a short paper called A Memoir on the Theory of Matrices and did something quietly radical. He stopped thinking of $x' = ax + by,\; y' = cx + dy$ as a pair of equations to be solved, and started thinking of it as a single object that takes a point in the plane and returns a new point. Two such objects can be composed to get a third. Some of them have inverses. Some collapse the plane onto a line. They have an algebra of their own.

That shift — from “a system of equations to solve” to “a transformation of space” — is the entire reason linear algebra exists as a field. The matrices came later, as bookkeeping. The object came first. That object is what this section is about, and we are going to meet it without writing down a single matrix.


The Mental Picture First

Before we define anything, play with the widget below. It applies various functions to the plane and shows what happens to grid lines.

[Interactive widget: The Linearity Tester. Select a function and test whether it satisfies the linearity conditions. Example state: f(x, y) = (2x, 2y), which scales every vector by 2 (a pure stretch). Additivity check: f(u + v) = (2.00, 5.00) and f(u) + f(v) = (2.00, 5.00), so both paths give the same result and the function preserves addition. Verdict: LINEAR, passes both tests for all inputs.]

Something striking is going on. The functions marked linear all do the same three things, no matter how different they look numerically.

One. The origin stays put. Whatever else the transformation does, the point $(0, 0)$ lands on $(0, 0)$.

Two. Grid lines stay straight. A straight line in the input plane maps to a straight line in the output plane. No curves, no kinks, no bending.

Three. Grid lines that were parallel and evenly spaced stay parallel and evenly spaced. The grid might be stretched, rotated, sheared, or even flattened onto a line — but it is still a regular grid of parallelograms.

The functions marked not linear break at least one of those rules. Translation slides the origin somewhere else. Squaring bends the grid into curves. Absolute value folds the plane along an axis. Whatever informal notion of “linear” you arrived with, those three visual invariants are the notion that pays off.

The visual definition. A transformation of the plane is linear when it keeps the origin fixed, keeps lines straight, and keeps parallel lines parallel and evenly spaced. Everything else in this chapter is translating that picture into symbols.

The Formal Definition, Earned

Here is how a mathematician writes down the pictures you just played with. Let $T: \mathbb{R}^2 \to \mathbb{R}^2$ be a function from the plane to itself. We call $T$ a linear transformation when, for every pair of vectors $\mathbf{u}, \mathbf{v} \in \mathbb{R}^2$ and every scalar $c \in \mathbb{R}$, both of the following hold.

Additivity. $T(\mathbf{u} + \mathbf{v}) = T(\mathbf{u}) + T(\mathbf{v})$. If you add two vectors first and then apply $T$, you get the same answer as applying $T$ to each one separately and then adding.

Homogeneity. $T(c\,\mathbf{u}) = c\,T(\mathbf{u})$. If you scale a vector first and then apply $T$, you get the same answer as applying $T$ first and then scaling.

You can fold both axioms into one statement: $T(a\mathbf{u} + b\mathbf{v}) = a\,T(\mathbf{u}) + b\,T(\mathbf{v})$ for all scalars $a, b$ and all vectors $\mathbf{u}, \mathbf{v}$. The slogan: T commutes with addition and scalar multiplication. Those two operations are the entire structure of a vector space, so another way to say it is that $T$ preserves the structure.
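The combined axiom can be spot-checked numerically. The following is a minimal sketch rather than a proof: the helper name `passes_linearity_check`, the sample vectors, and the scalars are all assumptions chosen for illustration, and a passing result only means no counterexample was found among the samples.

```python
import math

def passes_linearity_check(f, samples, scalars, tol=1e-9):
    """Spot-check f(a*u + b*v) == a*f(u) + b*f(v) on a few sample inputs."""
    for u in samples:
        for v in samples:
            for a in scalars:
                for b in scalars:
                    combo = (a * u[0] + b * v[0], a * u[1] + b * v[1])
                    lhs = f(combo)
                    fu, fv = f(u), f(v)
                    rhs = (a * fu[0] + b * fv[0], a * fu[1] + b * fv[1])
                    same = (math.isclose(lhs[0], rhs[0], abs_tol=tol)
                            and math.isclose(lhs[1], rhs[1], abs_tol=tol))
                    if not same:
                        return False  # found a counterexample
    return True

# Hypothetical sample inputs; any spread of vectors and scalars would do.
samples = [(1.0, 2.0), (0.0, 0.5), (-3.0, 1.0)]
scalars = [-2.0, 0.0, 1.0, 2.5]

functions = {
    "scale": lambda p: (2 * p[0], 2 * p[1]),      # pure stretch by 2
    "translate": lambda p: (p[0] + 1.0, p[1]),    # slides the origin
    "square": lambda p: (p[0] ** 2, p[1] ** 2),   # bends the grid
    "absolute": lambda p: (abs(p[0]), p[1]),      # folds the plane
}

results = {name: passes_linearity_check(f, samples, scalars)
           for name, f in functions.items()}
print(results)
```

Only the stretch survives the check; the other three each fail on at least one sampled combination.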

Consequences you get for free

The origin is fixed. Plug $c = 0$ into homogeneity: $T(0 \cdot \mathbf{u}) = 0 \cdot T(\mathbf{u})$, which says $T(\mathbf{0}) = \mathbf{0}$. There is no choice about this. Any transformation that moves the origin is not linear.

Two vectors tell you everything. Every vector in the plane can be written as $\mathbf{v} = x\,\mathbf{e}_1 + y\,\mathbf{e}_2$ where $\mathbf{e}_1 = (1, 0)$ and $\mathbf{e}_2 = (0, 1)$ are the standard basis vectors. Apply linearity:

$$T(\mathbf{v}) = T(x\,\mathbf{e}_1 + y\,\mathbf{e}_2) = x\,T(\mathbf{e}_1) + y\,T(\mathbf{e}_2).$$

Read that slowly. On the left, we do not know what $T$ does to an arbitrary $\mathbf{v}$. On the right, we only need two things: $T(\mathbf{e}_1)$ and $T(\mathbf{e}_2)$. The whole transformation on an infinite plane is compressed into two output vectors.

The two-vector theorem. A linear transformation $T: \mathbb{R}^2 \to \mathbb{R}^2$ is completely determined by the pair $(T(\mathbf{e}_1), T(\mathbf{e}_2))$. In Chapter 4 we will give that pair a name and learn to compute with it efficiently. For now, the pair itself is $T$.

Grid lines stay straight because the image of the line $\mathbf{a} + t\,\mathbf{d}$ is $T(\mathbf{a}) + t\,T(\mathbf{d})$, a straight line through $T(\mathbf{a})$ in the direction $T(\mathbf{d})$. Parallel lines stay parallel because two lines in direction $\mathbf{d}$ both map to lines in direction $T(\mathbf{d})$. Equal spacing survives because homogeneity preserves the scalar $t$: equal steps in $t$ give equal steps along the image line. Every picture is now a theorem.
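As a small numerical illustration of the argument above, the sketch below pushes four evenly spaced points of a line through a linear map and confirms that consecutive images differ by the same step $T(\mathbf{d})$. The particular map (given by two basis images), the base point `a`, and the direction `d` are arbitrary choices made for this example.

```python
# Hypothetical linear map, specified only by where it sends e1 and e2.
Te1 = (2.0, 1.0)
Te2 = (-1.0, 1.0)

def T(v):
    """Apply the map via linearity: T(x*e1 + y*e2) = x*T(e1) + y*T(e2)."""
    x, y = v
    return (x * Te1[0] + y * Te2[0], x * Te1[1] + y * Te2[1])

a = (1.0, -2.0)   # a point on the line (arbitrary)
d = (0.5, 1.5)    # direction of the line (arbitrary)

# Evenly spaced points a + t*d for t = 0, 1, 2, 3, and their images.
points = [(a[0] + t * d[0], a[1] + t * d[1]) for t in range(4)]
images = [T(p) for p in points]

# Steps between consecutive images.
steps = [(q[0] - p[0], q[1] - p[1]) for p, q in zip(images, images[1:])]

print("T(d)  =", T(d))
print("steps =", steps)
```

Every step equals `T(d)`, so the images lie on one straight line and stay evenly spaced.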


Worked Example — By Hand

Let us fix a specific $T$ and compute with it the way a human would on paper. Choose $T(\mathbf{e}_1) = (2, 1)$ and $T(\mathbf{e}_2) = (-1, 1)$.

Those two outputs are all we have committed to. No formula, no matrix. Yet we can now compute $T(\mathbf{v})$ for any $\mathbf{v}$, just by using linearity.

Take $\mathbf{v} = (3, 2)$. Decompose it in terms of the basis: $\mathbf{v} = 3\,\mathbf{e}_1 + 2\,\mathbf{e}_2$. Apply linearity:

$$T(\mathbf{v}) = 3\,T(\mathbf{e}_1) + 2\,T(\mathbf{e}_2) = 3(2, 1) + 2(-1, 1).$$

Distribute the scalars: $3(2, 1) = (6, 3)$ and $2(-1, 1) = (-2, 2)$. Add the two: $(6, 3) + (-2, 2) = (4, 5)$. So $T(3, 2) = (4, 5)$. No matrix multiplication, no determinants. Just linearity, twice.

Drive the widget below to the state $T(\mathbf{e}_1) = (2, 1)$ (set $a = 2, c = 1$) and $T(\mathbf{e}_2) = (-1, 1)$ (set $b = -1, d = 1$). The blue arrow lands where our by-hand calculation says it should, and the grid shows the full reshape.

[Interactive widget: 2D Linear Transformation. An editable matrix A (initially the identity, det(A) = 1.000), arrows for Ae₁ and Ae₂, and a set of presets.]

The red arrow shows where e₁ = (1, 0) lands, and the blue arrow shows where e₂ = (0, 1) lands. Together, they completely determine the transformation. The grid shows how the entire plane deforms.

Notice: the origin stays at the origin, the light-colored grid lines have all become straight lines in the darker image grid, and every cell of the image grid is the same parallelogram. That is the visual fingerprint of linearity.


Python From Scratch

We now write $T$ as a Python function that takes a vector and returns its image. No NumPy, no matrices: just the two output vectors $T(\mathbf{e}_1)$ and $T(\mathbf{e}_2)$ and the linearity rule, exactly as you computed by hand.

A linear transformation, implemented from its basis images
linear_transformation.py
Line 1: # T is defined by where it sends the two basis vectors.

This single comment is the whole thesis of the section. A linear transformation T on the plane is completely determined once you say where T sends e1 = (1, 0) and where it sends e2 = (0, 1). Every other vector is a combination of those two, and linearity forces T to respect that combination.

Line 2: Te1 = (2.0, 1.0)

Stores the image of the first basis vector e1 = (1, 0). T sends (1, 0) to (2, 1). The horizontal unit arrow gets reshaped into the arrow from origin to (2, 1).

EXECUTION STATE
Te1 = (2.0, 1.0)
→ meaning = T(e1) = T(1, 0) = (2, 1). The e1 basis arrow (red in the widget above) lands at (2, 1) after the transformation.
Line 3: Te2 = (-1.0, 1.0)

Stores the image of the second basis vector e2 = (0, 1). T sends (0, 1) to (-1, 1). Together with Te1 this fixes T everywhere on the plane.

EXECUTION STATE
Te2 = (-1.0, 1.0)
Line 5: def T(v):

A Python function that implements the linear transformation. It takes one 2D vector and returns its image under T. We are NOT using a matrix — just the two output vectors Te1 and Te2 plus the linearity rule.

EXECUTION STATE
⬇ input: v = A tuple (x, y) representing a 2D vector.
→ example value on line 12 = v = (3.0, 2.0)
⬆ returns = A tuple (out_x, out_y) — the image T(v) in the output plane.
Line 6: x, y = v

Unpack the tuple into its two coordinates. For v = (3.0, 2.0), this assigns x = 3.0 and y = 2.0. These are the coefficients in the combination v = x·e1 + y·e2.

EXECUTION STATE
x = 3.0 (the coefficient of e1 in v)
y = 2.0 (the coefficient of e2 in v)
→ key identity = v = x·e1 + y·e2 = 3.0·(1,0) + 2.0·(0,1)
Line 7: out_x = x * Te1[0] + y * Te2[0]

Linearity in action. The first output coordinate is x times the first coordinate of T(e1), plus y times the first coordinate of T(e2). This is T(x·e1 + y·e2) = x·T(e1) + y·T(e2) projected onto the first output axis.

EXECUTION STATE
x * Te1[0] = 3.0 * 2.0 = 6.0
y * Te2[0] = 2.0 * (-1.0) = -2.0
out_x = 6.0 + (-2.0) = 4.0
Line 8: out_y = x * Te1[1] + y * Te2[1]

Same idea for the second output coordinate. x times Te1's y-component, plus y times Te2's y-component.

EXECUTION STATE
x * Te1[1] = 3.0 * 1.0 = 3.0
y * Te2[1] = 2.0 * 1.0 = 2.0
out_y = 3.0 + 2.0 = 5.0
Line 9: return (out_x, out_y)

Return the image of v as a tuple. For v = (3.0, 2.0), we return (4.0, 5.0). The arrow from the origin to (3, 2) in the input plane corresponds to an arrow from the origin to (4, 5) in the output plane — exactly what the interactive grid above shows when you set T(e1)=(2,1), T(e2)=(-1,1).

EXECUTION STATE
⬆ return = (4.0, 5.0)
Line 11: v = (3.0, 2.0)

The test vector from our by-hand worked example. In terms of the basis: v = 3·e1 + 2·e2.

EXECUTION STATE
v = (3.0, 2.0)
Line 12: Tv = T(v)

Call T on v. Internally this runs lines 6–9. The result should match the number we computed by hand: (4, 5).

EXECUTION STATE
Tv = (4.0, 5.0)
→ matches hand calculation = 3·(2,1) + 2·(-1,1) = (6,3) + (-2,2) = (4,5) ✓
Line 13: print("T(v) =", Tv)

Prints: T(v) = (4.0, 5.0)

Line 15: u = (1.0, 1.0)

A test vector for the additivity axiom. Picked so the numbers are tractable by eye.

EXECUTION STATE
u = (1.0, 1.0)
Line 16: w = (2.0, -1.0)

A second test vector. u and w are deliberately not parallel so the check is non-trivial.

EXECUTION STATE
w = (2.0, -1.0)
Line 17: c = 3.0

A scalar for the homogeneity axiom T(c·u) = c·T(u).

EXECUTION STATE
c = 3.0
Line 19: u_plus_w = (u[0] + w[0], u[1] + w[1])

Compute u + w coordinate-wise. This is the vector we feed into T on the next line.

EXECUTION STATE
u[0] + w[0] = 1.0 + 2.0 = 3.0
u[1] + w[1] = 1.0 + (-1.0) = 0.0
u_plus_w = (3.0, 0.0)
Line 20: lhs_add = T(u_plus_w)

Left-hand side of the additivity axiom: T applied AFTER adding. We compute T(3, 0).

EXECUTION STATE
inside T: x = 3.0, y = 0.0
out_x = 3.0*2.0 + 0.0*(-1.0) = 6.0
out_y = 3.0*1.0 + 0.0*1.0 = 3.0
lhs_add = (6.0, 3.0)
Line 21: rhs_add = (T(u)[0] + T(w)[0], T(u)[1] + T(w)[1])

Right-hand side: T applied to u and w separately, then summed. If T is linear, this must match lhs_add.

EXECUTION STATE
T(u) = x,y=1,1 → out_x = 1*2 + 1*(-1) = 1, out_y = 1*1 + 1*1 = 2 → (1.0, 2.0)
T(w) = x,y=2,-1 → out_x = 2*2 + (-1)*(-1) = 5, out_y = 2*1 + (-1)*1 = 1 → (5.0, 1.0)
rhs_add = (1.0 + 5.0, 2.0 + 1.0) = (6.0, 3.0)
→ check = lhs_add == rhs_add ✓
Line 22: print("Additivity: ...")

Prints: Additivity: T(u+w) = (6.0, 3.0) | T(u)+T(w) = (6.0, 3.0)

Line 24: cu = (c * u[0], c * u[1])

Scale u by c. For c = 3 and u = (1, 1), this gives (3, 3).

EXECUTION STATE
cu = (3.0, 3.0)
Line 25: lhs_hom = T(cu)

T applied to the scaled vector.

EXECUTION STATE
inside T = x,y = 3,3 → out_x = 3*2 + 3*(-1) = 3, out_y = 3*1 + 3*1 = 6
lhs_hom = (3.0, 6.0)
Line 26: rhs_hom = (c * T(u)[0], c * T(u)[1])

Scale T(u) by c. If T is homogeneous, this equals lhs_hom.

EXECUTION STATE
T(u) = (1.0, 2.0) (computed above)
rhs_hom = (3.0 * 1.0, 3.0 * 2.0) = (3.0, 6.0)
→ check = lhs_hom == rhs_hom ✓
Line 27: print("Homogeneity: ...")

Prints: Homogeneity: T(c*u) = (3.0, 6.0) | c*T(u) = (3.0, 6.0). Both axioms verified numerically.

```python
# T is defined by where it sends the two basis vectors.
Te1 = (2.0, 1.0)
Te2 = (-1.0, 1.0)

def T(v):
    x, y = v
    out_x = x * Te1[0] + y * Te2[0]
    out_y = x * Te1[1] + y * Te2[1]
    return (out_x, out_y)

v = (3.0, 2.0)
Tv = T(v)
print("T(v) =", Tv)

u = (1.0, 1.0)
w = (2.0, -1.0)
c = 3.0

u_plus_w = (u[0] + w[0], u[1] + w[1])
lhs_add = T(u_plus_w)
rhs_add = (T(u)[0] + T(w)[0], T(u)[1] + T(w)[1])
print("Additivity:  T(u+w) =", lhs_add, " |  T(u)+T(w) =", rhs_add)

cu = (c * u[0], c * u[1])
lhs_hom = T(cu)
rhs_hom = (c * T(u)[0], c * T(u)[1])
print("Homogeneity: T(c*u) =", lhs_hom, " |  c*T(u) =", rhs_hom)
```

Lines 7–8 of the listing are the linearity rule in pure arithmetic. The additivity and homogeneity checks at the bottom are a numerical sanity check: if either printed pair had mismatched, the function would not be linear. Matching on one choice of u, w, and c is evidence, not a proof.


The PyTorch Way

In production, almost nobody writes the above. Hand-written loops produce the right answer, but once you have a thousand vectors to transform, or a million, Python loops are the wrong tool. PyTorch lets us bundle $T(\mathbf{e}_1)$ and $T(\mathbf{e}_2)$ into a single $2 \times 2$ tensor whose columns are those images, and then apply $T$ to any vector with one multiplication.

That $2 \times 2$ tensor is what Chapter 4 will formally name the matrix of T. We are not there yet, so we treat the tensor purely as bookkeeping that packs the two output vectors into one object. The @ operator does the same two arithmetic lines the previous code did, but as one fused operation.

The same transformation, packaged as a tensor
linear_transformation_torch.py
Line 1: import torch

PyTorch is a tensor library with GPU acceleration and automatic differentiation. Here we only need two things: the tensor type and the matrix-multiply operator @. Everything runs as fused kernels on CPU or GPU — not Python loops.

EXECUTION STATE
torch = PyTorch library. Provides torch.tensor, torch.stack, the @ operator, and much more.
Line 3: # Where the two basis vectors get sent.

Same idea as the pure-Python version. T is still defined by T(e1) and T(e2).

Line 4: Te1 = torch.tensor([2.0, 1.0])

Create a 1-D tensor (a vector) holding the image of e1. Identical values to the Python version.

EXECUTION STATE
📚 torch.tensor() = PyTorch factory function: turns a Python list into a tensor — a multi-dimensional array that supports GPU computation and autograd.
⬇ arg: [2.0, 1.0] = The list of numbers. PyTorch infers dtype=float32 from the floats.
⬆ result: Te1 = tensor([2., 1.]) shape (2,)
Line 5: Te2 = torch.tensor([-1.0, 1.0])

Same as line 4, but for the image of e2.

EXECUTION STATE
Te2 = tensor([-1., 1.]) shape (2,)
Line 7: # Stack the images as COLUMNS. In Ch. 4 we will call this the matrix of T.

We are about to bundle Te1 and Te2 into a single 2×2 tensor whose columns are the images of the basis vectors. That tensor is, literally, what Chapter 4 will name the matrix of T. For now we use it only as compact bookkeeping.

Line 8: A = torch.stack([Te1, Te2], dim=1)

Combine the two image vectors into a single 2×2 tensor whose columns are Te1 and Te2.

EXECUTION STATE
📚 torch.stack() = Concatenates tensors along a NEW axis. Given N tensors of shape S, stack returns a tensor of shape S with N inserted at position dim.
⬇ arg 1: [Te1, Te2] = A Python list of two 1-D tensors, each of shape (2,).
⬇ arg 2: dim=1 = Insert the new axis at position 1 (columns). dim=0 would stack them as rows. We want them as COLUMNS.
→ dim=0 vs dim=1 = dim=0: tensor([[ 2., 1.], [-1., 1.]]), so Te1 is row 0; dim=1: tensor([[ 2., -1.], [ 1., 1.]]), so Te1 is column 0
⬆ result: A =
        col0  col1
row0:  [ 2.,  -1.]
row1:  [ 1.,   1.]
shape (2, 2)
Line 9: print("A =")

Label line for the print below.

Line 10: print(A)

Prints: tensor([[ 2., -1.], [ 1., 1.]])

Line 12: # Apply T to a vector with one multiplication: A @ v

The payoff. Instead of two separate arithmetic lines for out_x and out_y, PyTorch's @ operator does both at once.

Line 13: v = torch.tensor([3.0, 2.0])

Same test vector as the hand-worked example.

EXECUTION STATE
v = tensor([3., 2.]) shape (2,)
Line 14: Tv = A @ v

The matrix-vector multiply. Under the hood, PyTorch computes the same two dot products our Python version did, but as one fused BLAS call.

EXECUTION STATE
@ operator = Python's matrix-multiply operator (PEP 465), implemented by torch.matmul. For (2,2) @ (2,) it produces a (2,) result.
→ row 0 of A · v = [2, -1] · [3, 2] = 2*3 + (-1)*2 = 4.0
→ row 1 of A · v = [1, 1] · [3, 2] = 1*3 + 1*2 = 5.0
⬆ result: Tv = tensor([4., 5.])
→ matches the pure-Python version = Both paths produce (4.0, 5.0). Same math, different bookkeeping.
Line 15: print("T(v) =", Tv)

Prints: T(v) = tensor([4., 5.])

Line 17: # Batched: send ten vectors at once

The real reason engineers use the matrix bookkeeping: it parallelizes. Thousands of vectors, one kernel launch.

Line 18: V = torch.randn(10, 2)

Ten random 2-D vectors stacked as rows of a (10, 2) tensor.

EXECUTION STATE
📚 torch.randn() = Samples from the standard normal distribution. torch.randn(10, 2) returns a (10, 2) tensor of independent N(0,1) samples.
⬇ args: (10, 2) = The shape. 10 rows = 10 vectors, 2 columns = 2 components each.
⬆ result: V =
A (10, 2) tensor. Values differ per run because randn is random, but the shape is always (10, 2).
Line 19: TV = V @ A.T

Apply T to every row of V. We use A.T because V has vectors as ROWS, and we want each output row to be T(input row).

EXECUTION STATE
A.T =
Transpose of A — swaps rows and columns. A was [[2,-1],[1,1]], so A.T is [[2,1],[-1,1]].
→ why transpose? = When vectors are rows: T(row) = row @ A.T puts the two image-coordinates into the two output columns.
⬆ result: TV =
A (10, 2) tensor. Row i of TV is T applied to row i of V. Ten applications of T in one instruction.
Line 20: print("TV shape:", TV.shape)

Prints: TV shape: torch.Size([10, 2]). Ten inputs, ten outputs — computed in parallel by PyTorch's BLAS backend.

```python
import torch

# Where the two basis vectors get sent.
Te1 = torch.tensor([2.0, 1.0])
Te2 = torch.tensor([-1.0, 1.0])

# Stack the images as COLUMNS. In Ch. 4 we will call this the matrix of T.
A = torch.stack([Te1, Te2], dim=1)
print("A =")
print(A)

# Apply T to a vector with one multiplication: A @ v
v = torch.tensor([3.0, 2.0])
Tv = A @ v
print("T(v) =", Tv)

# Batched: send ten vectors at once
V = torch.randn(10, 2)
TV = V @ A.T
print("TV shape:", TV.shape)
```

What PyTorch hid, and why that hiding matters. The two dot products are still happening on line 14. PyTorch just computes them inside a compiled kernel, and on a GPU many such products run side by side. When we scale up to layers of a neural network with thousands of neurons, the gap between a Python loop and a fused kernel can reach several orders of magnitude, depending on hardware and problem size. That is why the matrix bookkeeping is more than a notational convenience. It is the thing that makes linear algebra computationally real.


Where Engineers Actually Use This: Italic Fonts

One thing your eyes are doing right now, whether you notice it or not, is reading the output of a linear transformation. Every vector font (TrueType, OpenType, PostScript Type 1, the fonts Apple and Microsoft ship with their operating systems) stores each letter as a set of outline points. When the rendering engine needs to draw that letter on your screen, it applies a $2 \times 2$ transformation to every one of those outline points before rasterizing. In PostScript fonts these numbers live in the FontMatrix entry, a six-number array whose first four entries are the linear part and whose last two are a translation. Those four numbers are exactly the two images $T(\mathbf{e}_1)$ and $T(\mathbf{e}_2)$ of our two-vector theorem.

The most recognizable use of the font matrix is synthetic italic. When a typeface does not ship with its own hand-drawn italic, graphics libraries synthesize one by shearing the upright glyphs: leave the horizontal direction alone, and tilt the vertical direction to the right by a fixed angle. The italic transformation sends $\mathbf{e}_1$ to $(1, 0)$ (horizontal stays horizontal) and sends $\mathbf{e}_2$ to $(\tan\theta, 1)$ where $\theta$ is the italic angle. For $\theta = 12^\circ$ this gives $T(\mathbf{e}_2) \approx (0.213, 1)$. That one pair of output vectors defines the italic transformation for the entire alphabet.

[Interactive demo: synthetic italic. A shear-angle slider (0° is upright; typical italic is 10–15°) applied to any text, as a linear transformation of the glyph outlines. At 12° the map sends the basis to T(e₁) = (1, 0), so horizontal stays horizontal, and T(e₂) = (0.213, 1), so vertical tilts right by tan(12°). That pair of images is the whole transformation.]

The demo is the same math as the worked example, with different numbers. Increase the shear angle, and because $T$ is linear, every glyph outline point transforms consistently: the overall letter leans, but it does not bend, because straight lines stay straight. If the transformation were not linear (say, if the engine used $T(x, y) = (x + y^2, y)$) the letter would warp instead of tilt, and typography would be ruined. Linear is a surprisingly strong word.
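The shear can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not how any particular font engine is implemented: the function names `italic_shear` and `apply` and the four-point "stem" outline are hypothetical, invented for this example.

```python
import math

def italic_shear(theta_degrees):
    """Return (T(e1), T(e2)) for a horizontal shear by the given angle."""
    slant = math.tan(math.radians(theta_degrees))
    return (1.0, 0.0), (slant, 1.0)

def apply(images, point):
    """Apply the linear map defined by its two basis images to one point."""
    Te1, Te2 = images
    x, y = point
    return (x * Te1[0] + y * Te2[0], x * Te1[1] + y * Te2[1])

# Hypothetical outline: the corners of an upright stem 1 unit wide, 7 tall.
stem = [(0.0, 0.0), (1.0, 0.0), (1.0, 7.0), (0.0, 7.0)]

images = italic_shear(12.0)
slanted = [apply(images, p) for p in stem]

print("T(e2) =", images[1])
for before, after in zip(stem, slanted):
    print(before, "->", after)
```

Points on the baseline do not move, heights are unchanged, and each point slides right in proportion to its height, which is why the stem leans without bending.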

The same idea, expanded from $2 \times 2$ to $4 \times 4$ and promoted to three dimensions, is how every polygon in every 3D game gets positioned on your screen, how every photo gets scaled and rotated in an image editor, and how every robot arm figures out where its gripper is. Synthetic italic is the simplest honest example of a linear transformation doing real work, but it is not a toy: it ships on every font rendering stack in the world.


Pitfalls & What to Watch For

“Linear” does not mean $y = mx + b$

This is the single most common point of confusion coming from high-school algebra. The function $f(x) = mx + b$, whose graph is a straight line, is not a linear transformation unless $b = 0$. Plug $x = 0$ into it: $f(0) = b$, which is the origin only when $b = 0$. A genuine linear transformation must fix the origin. Functions that “look linear on a graph” but shift the origin are called affine. Neural network layers are affine (they add a bias), not linear, which matters the moment you start proving things about them.
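A quick numerical sketch makes the failure concrete. The slope `m = 2.0`, intercept `b = 3.0`, and test inputs are arbitrary values picked for illustration.

```python
def f(x, m=2.0, b=3.0):
    """The high-school 'line' y = m*x + b (affine, not linear, when b != 0)."""
    return m * x + b

u, v = 1.0, 4.0

origin_image = f(0.0)                      # equals b, so the origin moves
additivity_gap = f(u + v) - (f(u) + f(v))  # works out to -b for any u, v

# With the bias removed, the same function passes the additivity test.
gap_without_bias = f(u + v, b=0.0) - (f(u, b=0.0) + f(v, b=0.0))

print("f(0) =", origin_image)
print("f(u+v) - (f(u)+f(v)) =", additivity_gap)
print("same gap with b = 0:", gap_without_bias)
```

The additivity gap is exactly the negative of the intercept, so it vanishes only when `b` is zero.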

Translation is not linear

Moving every point by a fixed nonzero vector $\mathbf{t}$, the transformation $T(\mathbf{v}) = \mathbf{v} + \mathbf{t}$, is the most useful non-linear operation in graphics. It feels linear. It is not. It moves the origin to $\mathbf{t}$. In Chapter 6 we will see how computer graphics gets around this with homogeneous coordinates, but for now the rule is hard: a shift by a nonzero vector is never linear.

The determinant can be zero

A linear transformation is free to collapse the plane onto a line, or even to a single point. The projection $T(x, y) = (x, 0)$ that flattens everything onto the x-axis is perfectly linear (it satisfies both axioms) but it is not invertible, because once you have flattened the plane you cannot recover the lost information. We will call such transformations singular, and the number that detects them is the determinant. Not every linear transformation can be undone.

Why the image of $(1,1)$ is not a third independent fact

Students sometimes ask: if I tell you $T(\mathbf{e}_1)$, $T(\mathbf{e}_2)$, and $T(1, 1)$, is that more information than giving the first two? It is not. By linearity $T(1, 1) = T(\mathbf{e}_1) + T(\mathbf{e}_2)$, so the third value is forced. Give any basis-pair's images and the rest of $T$ is pinned down.
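The last claim, that the images of any basis pin down $T$, can be sketched as follows. The basis $(1, 1)$, $(1, -1)$ and the helper names `coords_in_basis` and `extend_linearly` are assumptions made for this example; the images used are those of the worked example's $T$, which sends $(1, 1)$ to $(1, 2)$ and $(1, -1)$ to $(3, 0)$.

```python
def coords_in_basis(v, b1, b2):
    """Solve s*b1 + t*b2 = v for (s, t) using Cramer's rule."""
    det = b1[0] * b2[1] - b2[0] * b1[1]
    if det == 0:
        raise ValueError("b1 and b2 are parallel, so they do not form a basis")
    s = (v[0] * b2[1] - b2[0] * v[1]) / det
    t = (b1[0] * v[1] - v[0] * b1[1]) / det
    return s, t

def extend_linearly(b1, b2, Tb1, Tb2):
    """Build T from the images of an arbitrary basis pair."""
    def T(v):
        s, t = coords_in_basis(v, b1, b2)
        return (s * Tb1[0] + t * Tb2[0], s * Tb1[1] + t * Tb2[1])
    return T

# A non-standard basis and the images the worked example's T assigns to it.
b1, b2 = (1.0, 1.0), (1.0, -1.0)
T = extend_linearly(b1, b2, Tb1=(1.0, 2.0), Tb2=(3.0, 0.0))

print("T(e1)   =", T((1.0, 0.0)))
print("T(e2)   =", T((0.0, 1.0)))
print("T(3, 2) =", T((3.0, 2.0)))
```

The recovered values agree with the worked example: $T(\mathbf{e}_1) = (2, 1)$, $T(\mathbf{e}_2) = (-1, 1)$, and $T(3, 2) = (4, 5)$.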


What This Unlocks Next

You now own the object. In the next section, we catalogue the 2D linear transformations that matter in practice (rotations, scalings, shears, reflections, projections) and learn to recognize each one from the pair $(T(\mathbf{e}_1), T(\mathbf{e}_2))$ alone. Then Chapter 4 finally gives that pair the name it has been earning: the matrix of the transformation.

Recall that a transformation $T$ is linear when it satisfies two properties:

  • It preserves vector addition: $T(\vec{u} + \vec{v}) = T(\vec{u}) + T(\vec{v})$.
  • It preserves scalar multiplication: $T(c\,\vec{v}) = c\,T(\vec{v})$.

Geometrically, this means three things: the origin never moves, every straight line stays a straight line, and the spacing between parallel lines stays uniform. Under a linear transformation, the whole space gets stretched, squashed, rotated, sheared, or flipped — but never curved.

A Matrix IS a Transformation

Every linear transformation of $\mathbb{R}^n$ can be written as a matrix. The key idea is deceptively simple:

The columns of a matrix are the images of the standard basis vectors.

In 2D, the standard basis is $\hat{\imath} = (1, 0)$ and $\hat{\jmath} = (0, 1)$. If a matrix

$$A = \begin{bmatrix} a & b \\ c & d \end{bmatrix}$$

represents a linear transformation, then its first column $(a, c)$ is where $\hat{\imath}$ lands, and its second column $(b, d)$ is where $\hat{\jmath}$ lands. That's the entire picture: once you know what happens to the two basis vectors, the rest of space follows because every vector is a linear combination of $\hat{\imath}$ and $\hat{\jmath}$.
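The columns-as-images idea can be checked directly. In this minimal sketch the matrix entries and the test vector are arbitrary example values, and `matvec` is a hand-written helper rather than a library call.

```python
def matvec(A, v):
    """Multiply a 2x2 matrix (given as a list of rows) by a 2-vector."""
    return (A[0][0] * v[0] + A[0][1] * v[1],
            A[1][0] * v[0] + A[1][1] * v[1])

A = [[3.0, 1.0],
     [0.0, 2.0]]

i_hat, j_hat = (1.0, 0.0), (0.0, 1.0)
col1 = matvec(A, i_hat)   # where i-hat lands: the first column (a, c)
col2 = matvec(A, j_hat)   # where j-hat lands: the second column (b, d)

# Any other vector follows from the two columns by linearity.
v = (2.0, -1.0)
direct = matvec(A, v)
from_columns = (v[0] * col1[0] + v[1] * col2[0],
                v[0] * col1[1] + v[1] * col2[1])

print("columns:", col1, col2)
print("A v directly:", direct, "| from columns:", from_columns)
```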

2D: Play with a 2×2 Matrix

Type any $2 \times 2$ matrix below. The grid you see is the image of the integer lattice under that matrix. The original square grid is not drawn, so you're seeing pure output. The red arrow is where $\hat{\imath}$ lands (column 1); the green arrow is where $\hat{\jmath}$ lands (column 2). Try each preset in turn, then mix them.

[Interactive viewer: 2D transform]
Try the “Collapse” preset. It uses a matrix whose columns are parallel (one is a scalar multiple of the other), so the whole 2D plane gets squashed onto a single line. This is what it looks like when the determinant is zero: the transformation destroys a dimension.

3D: Play with a 3×3 Matrix

In three dimensions the story is identical, just with one more basis vector. A $3 \times 3$ matrix has three columns, one for each transformed basis vector:

$$A = \begin{bmatrix} \vert & \vert & \vert \\ \hat{\imath}' & \hat{\jmath}' & \hat{k}' \\ \vert & \vert & \vert \end{bmatrix}$$

Below you can set all nine entries. The purple wireframe is the image of the unit cube $[0,1]^3$ under the matrix. Its volume equals $|\det A|$, and when the cube flattens to a plane or a line, you're watching $\det A = 0$ in action.
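For readers who want the volume claim in numbers, here is a small sketch of a $3 \times 3$ determinant by cofactor expansion along the first row. The two example matrices are made up for illustration: one stretches the axes, the other has a third column equal to the sum of the first two, so the cube flattens.

```python
def det3(M):
    """Determinant of a 3x3 matrix (list of rows), expanded along row 0."""
    (a, b, c), (d, e, f), (g, h, i) = M
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

# Stretch x by 2 and y by 3: the unit cube becomes a 2 x 3 x 1 box.
stretch = [[2, 0, 0],
           [0, 3, 0],
           [0, 0, 1]]

# Third column = first column + second column, so the image is a flat sheet.
flat = [[1, 0, 1],
        [0, 1, 1],
        [0, 0, 0]]

print("volume of stretched cube:", abs(det3(stretch)))
print("volume of flattened cube:", abs(det3(flat)))
```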

[Interactive viewer: 3D transform]

Ten Numbers That Tell the Story

A matrix is a machine that bends, stretches, rotates, squashes, or collapses space. People extract special numbers from that machine to understand what it does. No single number tells the full story — each one answers a different question. For each of the ten most-used invariants below you get: why we compute it, how it is computed with a concrete numerical walk-through, and an interactive viewer pre-loaded with a matrix that shows the behaviour at a glance.

Our running example for the non-degenerate cases will be $A = \begin{bmatrix} 3 & 1 \\ 0 & 2 \end{bmatrix}$. It shears, it stretches, and it keeps its two dimensions, so it is a good workhorse. For degenerate cases we'll switch to $B = \begin{bmatrix} 1 & 2 \\ 2 & 4 \end{bmatrix}$, whose second column is two times its first, so it crushes the plane onto a single line.

Rank & Nullity — Dimensions Surviving vs. Lost

Why we compute it. Rank tells you how many independent output directions the matrix can still reach, and nullity tells you how many input directions it throws away. If rank is full, the matrix is invertible and nothing is lost. If rank drops, some entire subspace of inputs collapses to a single point — and that subspace is the null space. Rank is the first number you should ever look at: it is the sentence “does this transformation lose information?” in numeric form.

How it's computed. Reduce the matrix to row-echelon form and count the non-zero rows. For $B = \begin{bmatrix} 1 & 2 \\ 2 & 4 \end{bmatrix}$ subtract 2 × row 1 from row 2:

$$\begin{bmatrix} 1 & 2 \\ 2 & 4 \end{bmatrix} \longrightarrow \begin{bmatrix} 1 & 2 \\ 0 & 0 \end{bmatrix}$$

One non-zero row ⇒ $\text{rank}(B) = 1$, so by the rank-nullity theorem $\text{nullity}(B) = 2 - 1 = 1$.
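The row-reduction recipe above can be sketched in plain Python. This is a minimal illustration, assuming exact arithmetic with `fractions.Fraction` so that "non-zero" is unambiguous; the function name `rank` is a hand-written helper, not a library routine, and production code would typically use a numerically robust method instead.

```python
from fractions import Fraction

def rank(matrix):
    """Count pivots after forward elimination to row-echelon form."""
    rows = [[Fraction(x) for x in row] for row in matrix]
    n_rows, n_cols = len(rows), len(rows[0])
    r = 0  # index of the row where the next pivot will be placed
    for col in range(n_cols):
        # Find a row at or below r with a non-zero entry in this column.
        pivot = next((i for i in range(r, n_rows) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        # Clear the entries below the pivot.
        for i in range(r + 1, n_rows):
            factor = rows[i][col] / rows[r][col]
            rows[i] = [x - factor * p for x, p in zip(rows[i], rows[r])]
        r += 1
        if r == n_rows:
            break
    return r

A = [[3, 1], [0, 2]]
B = [[1, 2], [2, 4]]

for name, M in (("A", A), ("B", B)):
    rk = rank(M)
    print(name, "rank =", rk, "nullity =", len(M[0]) - rk)
```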

See it in action. The red dashed line in the viewer below is the null space: the entire line $y = -x/2$ of inputs getting squashed to the origin. Every amber grid line has collapsed onto a single line instead of covering the plane.

[Interactive viewer: 2D transform]

Determinant — Volume Scale Factor

Why we compute it. The determinant is the signed factor by which the transformation scales $n$-dimensional volume. Its sign tells you whether the space got flipped (a left hand becomes a right hand). Its magnitude tells you whether the transformation is expanding, compressing, or (at zero) collapsing. Along with rank, this is the single most useful scalar summary of a matrix.

How it's computed. For a $2 \times 2$ matrix,

$$\det \begin{bmatrix} a & b \\ c & d \end{bmatrix} = ad - bc.$$

Plug in $A = \begin{bmatrix} 3 & 1 \\ 0 & 2 \end{bmatrix}$:

$$\det A = (3)(2) - (1)(0) = 6.$$

So every area gets multiplied by 6. For the singular case $B = \begin{bmatrix} 1 & 2 \\ 2 & 4 \end{bmatrix}$: $\det B = (1)(4) - (2)(2) = 0$, so area becomes zero, exactly the collapse we saw under rank.
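The area interpretation can be cross-checked by computing the image of the unit square and measuring it with the shoelace formula. This is a sketch for illustration: `det2`, `image_of_unit_square`, and `signed_area` are hand-written helpers, and the reflection matrix is an extra example included to show the sign.

```python
def det2(M):
    """Determinant of a 2x2 matrix given as a list of rows."""
    (a, b), (c, d) = M
    return a * d - b * c

def image_of_unit_square(M):
    """Corners of the unit square after applying M, in traversal order."""
    (a, b), (c, d) = M
    return [(0, 0), (a, c), (a + b, c + d), (b, d)]

def signed_area(polygon):
    """Shoelace formula: positive when the vertices run counterclockwise."""
    total = 0
    for (x1, y1), (x2, y2) in zip(polygon, polygon[1:] + polygon[:1]):
        total += x1 * y2 - x2 * y1
    return total / 2

A = [[3, 1], [0, 2]]
B = [[1, 2], [2, 4]]
flip_x = [[-1, 0], [0, 1]]   # reflection that reverses orientation

for name, M in (("A", A), ("B", B), ("flip_x", flip_x)):
    print(name, "det =", det2(M),
          "signed area =", signed_area(image_of_unit_square(M)))
```

In each case the signed area of the image matches the determinant, including the zero for the collapsed case and the negative sign for the reflection.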

See it in action. The purple parallelogram is the image of the unit square. With A its area is exactly 6; with Flip X (next viewer) its area is 1 but orientation is reversed: in the usual y-up orientation, the parallelogram's vertices now run clockwise instead of counterclockwise.

[Interactive viewer: 2D transform]
Sign matters. Try the Flip X preset ($\begin{bmatrix} -1 & 0 \\ 0 & 1 \end{bmatrix}$). Det is $-1$: area scale factor 1 (unchanged), but the amber $\hat\imath'$ arrow now points left, flipping orientation.

Trace — Sum of Eigenvalues

Why we compute it. Trace is a cheap algebraic summary that turns out to equal the sum of eigenvalues, counted with multiplicity. It shows up everywhere: in probability as the expected value of a quadratic form, in mechanics as a conserved moment, in ML as the sum of diagonal variances. If you only have a second to look at a matrix, trace + determinant already narrows down a 2×2 matrix's behaviour to a small number of cases.

How it's computed. $\text{tr}(A) = a_{11} + a_{22} + \cdots + a_{nn}$. For our $A$,

$$\text{tr}(A) = 3 + 2 = 5.$$

We'll see in a moment that A's eigenvalues are 3 and 2, so the sum of eigenvalues = trace identity is instantly verified: $3 + 2 = 5 = \text{tr}(A)$. This is a cheap “sanity check” that your eigenvalue computation didn't go sideways.
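The sanity check can be scripted. The sketch below leans on a standard fact that this text has not derived at this point: for a $2 \times 2$ matrix the eigenvalues are the roots of $\lambda^2 - \text{tr}(A)\,\lambda + \det(A) = 0$. The helper names are hypothetical, and only the real-eigenvalue case is handled.

```python
import math

def trace(M):
    """Sum of the diagonal entries of a square matrix (list of rows)."""
    return sum(M[i][i] for i in range(len(M)))

def eigenvalues_2x2(M):
    """Real eigenvalues of a 2x2 matrix from its trace and determinant."""
    (a, b), (c, d) = M
    tr, det = a + d, a * d - b * c
    disc = tr * tr - 4 * det
    if disc < 0:
        raise ValueError("complex eigenvalues; this sketch covers the real case only")
    root = math.sqrt(disc)
    return ((tr + root) / 2, (tr - root) / 2)

A = [[3, 1], [0, 2]]
lams = eigenvalues_2x2(A)

print("trace       =", trace(A))
print("eigenvalues =", lams)
print("their sum   =", sum(lams))
```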

Eigenvalues & Eigenvectors — Invariant Directions

Definition. Given a square matrix $A$, a non-zero vector $\vec{v}$ is called an eigenvector if applying $A$ keeps it on its own line, i.e. the output is a scalar multiple of $\vec{v}$ itself. The scalar is the eigenvalue $\lambda$. The defining equation is deceptively short:

$$A\,\vec{v} = \lambda\,\vec{v} \qquad \text{with } \vec{v} \neq \vec{0}.$$

Three things make this the most important pair of quantities derived from a matrix. First, an eigenvector is a direction the transformation does not bend: it only stretches or flips it. Second, any matrix that has enough independent eigenvectors can be rewritten in the eigenbasis, where it acts as a pure diagonal scaling, the cleanest possible description of a linear map. Third, when you apply $A$ repeatedly (to model a dynamical system, a Markov chain, or a deep network), the eigenvalues directly control growth, decay, oscillation, and stability.

Intuition — four ways to see it.

  • Invariant lines. Most input directions get rotated by the matrix. Eigenvectors span the handful of lines that don't rotate: they only scale. A real $n \times n$ matrix can have up to $n$ linearly independent ones (though sometimes fewer, or complex ones).
  • Natural modes. Think of $A$ acting on a vibrating drum or a coupled spring system. The eigenvectors are the pure modes of vibration; the eigenvalues are their frequencies (or growth rates). Any motion decomposes into a sum of these independent modes.
  • Diagonalisation. If $A$ has $n$ linearly independent eigenvectors, then $A = P D P^{-1}$, where $P$ has the eigenvectors as columns and $D$ is diagonal with the eigenvalues on it. That's the algebraic punchline: every diagonalisable matrix is secretly a diagonal matrix in disguise, and $P$ is the change of basis you apply to uncover it (a rotation or reflection only when the eigenvectors happen to be orthonormal).
  • Long-term behaviour. Applying $A$ repeatedly gives $A^k \vec{v}$. In the eigenbasis this becomes $D^k$, which is trivial: $\lambda_i^k$ on the diagonal. So whichever eigenvalue has the largest absolute value dominates; the trajectory pulls onto the corresponding eigenvector. (This is the iteration animation further down.)

How eigenvalues are computed. The equation $A\vec{v} = \lambda\vec{v}$ rearranges to $(A - \lambda I)\vec{v} = \vec{0}$. A non-zero $\vec{v}$ can satisfy this only when the matrix $A - \lambda I$ is singular, i.e. its determinant is zero. So:

$$\det(A - \lambda I) = 0 \quad \text{(the characteristic equation)}.$$

For our running matrix $A = \begin{bmatrix} 3 & 1 \\ 0 & 2 \end{bmatrix}$,

$$A - \lambda I = \begin{bmatrix} 3-\lambda & 1 \\ 0 & 2-\lambda \end{bmatrix}, \qquad \det(A - \lambda I) = (3-\lambda)(2-\lambda) = 0.$$

Two roots:

$$\lambda_1 = 3, \qquad \lambda_2 = 2.$$

How eigenvectors are computed. For each eigenvalue, solve $(A - \lambda I)\,\vec{v} = \vec{0}$. For $\lambda_1 = 3$:

$$A - 3I = \begin{bmatrix} 0 & 1 \\ 0 & -1 \end{bmatrix} \implies y = 0 \implies \vec{v}_1 = \begin{bmatrix} 1 \\ 0 \end{bmatrix}.$$

For $\lambda_2 = 2$:

$$A - 2I = \begin{bmatrix} 1 & 1 \\ 0 & 0 \end{bmatrix} \implies x + y = 0 \implies \vec{v}_2 = \begin{bmatrix} 1 \\ -1 \end{bmatrix}.$$

Sanity check: $A\vec{v}_1 = (3, 0) = 3\vec{v}_1$ ✓ and $A\vec{v}_2 = (2, -2) = 2\vec{v}_2$ ✓. A second cheap cross-check: $\lambda_1 + \lambda_2 = 5 = \text{tr}(A)$ and $\lambda_1\,\lambda_2 = 6 = \det(A)$. Those two identities always hold for a $2 \times 2$ matrix.
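
The hand computation above can be replayed with a short Python sketch. It relies on the fact that, for a 2×2 matrix, $\det(A - \lambda I)$ expands to $\lambda^2 - \text{tr}(A)\,\lambda + \det(A)$, so the quadratic formula gives both eigenvalues. The helper names `eigenvalues_2x2` and `matvec` are illustrative, and the approach is specific to the 2×2 case.

```python
import cmath

def eigenvalues_2x2(m):
    """Roots of lambda^2 - tr*lambda + det = 0 for m = [[a, b], [c, d]]."""
    (a, b), (c, d) = m
    tr = a + d
    det = a * d - b * c
    root = cmath.sqrt(tr * tr - 4 * det)  # complex sqrt also covers negative discriminants
    return (tr + root) / 2, (tr - root) / 2

def matvec(m, v):
    """Apply a 2x2 matrix to a 2-vector."""
    (a, b), (c, d) = m
    x, y = v
    return (a * x + b * y, c * x + d * y)

A = [[3, 1], [0, 2]]
lam1, lam2 = eigenvalues_2x2(A)

print(lam1.real, lam2.real)   # 3.0 2.0
print(matvec(A, (1, 0)))      # (3, 0)  = 3 * v1
print(matvec(A, (1, -1)))     # (2, -2) = 2 * v2
```

The sum and product of the two returned values reproduce the trace and determinant cross-checks.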

See it in action. The amber dashed lines in the viewer below are exactly those two eigenvector directions. The red $\hat\imath'$ arrow (image of $\hat\imath$) sits on the first amber line and is three times as long: that's $\lambda_1 = 3$. The direction $\vec{v}_2 = (1, -1)$ sits on the second amber line and gets scaled by 2. The whole transformation decomposes into “stretch by 3 along the first amber line and by 2 along the second”, which is diagonal in the eigenbasis.

[Interactive viewer: 2D transform]
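
The "diagonal in the eigenbasis" claim can be checked directly. Below is a minimal sketch, assuming the eigenvectors $(1, 0)$ and $(1, -1)$ found above are placed as the columns of $P$; the helper names are illustrative, and exact fractions are used so the comparisons involve no rounding.

```python
from fractions import Fraction

def matmul(x, y):
    """Product of two 2x2 matrices given as lists of rows."""
    return [[sum(x[i][k] * y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def inverse2(m):
    """Inverse of an invertible 2x2 matrix via the adjugate formula."""
    (a, b), (c, d) = m
    det = Fraction(a * d - b * c)
    return [[d / det, -b / det], [-c / det, a / det]]

def power(m, k):
    """m multiplied by itself k times (k >= 0)."""
    result = [[1, 0], [0, 1]]
    for _ in range(k):
        result = matmul(result, m)
    return result

A = [[3, 1], [0, 2]]
P = [[1, 1], [0, -1]]          # columns: v1 = (1, 0), v2 = (1, -1)
D = [[3, 0], [0, 2]]           # eigenvalues, in the same order as the columns of P
D5 = [[3 ** 5, 0], [0, 2 ** 5]]
P_inv = inverse2(P)

print(matmul(matmul(P, D), P_inv) == A)             # True: A = P D P^-1
print(matmul(matmul(P, D5), P_inv) == power(A, 5))  # True: A^5 = P D^5 P^-1
```

The second check is the practical payoff: powers of $A$ reduce to powers of two scalars.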
Complex eigenvalues mean rotation. The 90° rotation matrix $R = \begin{bmatrix} 0 & -1 \\ 1 & 0 \end{bmatrix}$ has no real eigenvectors. Its characteristic equation is $\lambda^2 + 1 = 0$, giving $\lambda = \pm i$. The viewer below shows those complex eigenvalues in the Matrix-invariants panel, and no amber eigenvector lines appear, because every real direction gets rotated somewhere else. The magnitude $|\lambda| = 1$ tells you this is a pure rotation (no stretching); the argument $\arg\lambda = 90^\circ$ (for $\lambda = i$) tells you the rotation angle.
[Interactive viewer: 2D transform]
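
The same quadratic-formula approach shows the complex pair for the rotation matrix. This is a sketch for the 2×2 case only; `eigenvalues_2x2` is an illustrative helper, not part of the viewer.

```python
import cmath
import math

def eigenvalues_2x2(m):
    """Roots of lambda^2 - tr*lambda + det = 0 for m = [[a, b], [c, d]]."""
    (a, b), (c, d) = m
    tr = a + d
    det = a * d - b * c
    root = cmath.sqrt(tr * tr - 4 * det)
    return (tr + root) / 2, (tr - root) / 2

R = [[0, -1], [1, 0]]   # rotation by 90 degrees
lam_plus, lam_minus = eigenvalues_2x2(R)

magnitude = abs(lam_plus)                          # stretch factor r
angle_deg = math.degrees(cmath.phase(lam_plus))    # rotation angle theta

print(lam_plus, lam_minus)
print(magnitude, angle_deg)
```

Here the magnitude comes out as 1 and the angle as 90 degrees, matching the "pure rotation" reading of $\lambda = r e^{i\theta}$.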

Watch eigenvectors emerge from iteration. The deepest reason eigenvectors matter is this: pick almost any starting vector (one with a non-zero component along the dominant eigenvector) and apply $A$ to it repeatedly. The result $A^k \vec{v}$ rotates toward the eigenvector whose eigenvalue has the largest absolute value, while its length grows like $|\lambda_1|^k$. This is called power iteration, and it's the engine behind PageRank (find the dominant eigenvector of the web's link matrix), PCA (find the dominant eigenvector of the data's covariance), and countless numerical algorithms.

The viewer below turns on iteration mode: click step → to apply $A$ once, and watch the pink trajectory $\vec{v} \to A\vec{v} \to A^2\vec{v} \to \ldots$ line up with the amber eigenvector line whose eigenvalue is biggest. The live ratio $\|A^{k+1}\vec{v}\| / \|A^k\vec{v}\|$ converges to $|\lambda_1|$.

[Interactive viewer: matrix-vector action]
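
What the iteration viewer animates can be sketched in a few lines of Python. This is a minimal version of power iteration under stated assumptions: the starting vector `(0.3, 1.0)` and the step count are arbitrary choices, and the vector is renormalised each step (the viewer may not do this) so the numbers stay bounded.

```python
import math

def matvec(m, v):
    """Apply a 2x2 matrix to a 2-vector."""
    (a, b), (c, d) = m
    x, y = v
    return (a * x + b * y, c * x + d * y)

def power_iteration(m, v, steps=50):
    """Return the last stretch ratio |m v| / |v| and the final unit direction."""
    ratio = 0.0
    for _ in range(steps):
        w = matvec(m, v)
        length = math.hypot(*w)
        ratio = length / math.hypot(*v)
        v = (w[0] / length, w[1] / length)  # keep only the direction
    return ratio, v

A = [[3, 1], [0, 2]]
ratio, direction = power_iteration(A, (0.3, 1.0))

print(round(ratio, 6))                                 # 3.0
print(round(direction[0], 6), round(direction[1], 6))
```

The ratio settles at the dominant eigenvalue 3 and the direction lines up with $\vec{v}_1 = (1, 0)$; the error shrinks roughly like $(2/3)^k$, the ratio of the two eigenvalues.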

Algebraic vs. geometric multiplicity. A matrix can have a repeated eigenvalue but still only one eigenvector direction. The classic example is the shear $\begin{bmatrix} 2 & 1 \\ 0 & 2 \end{bmatrix}$, whose only eigenvalue is $\lambda = 2$ (algebraic multiplicity 2) but whose eigenspace is the single line along $\hat\imath$ (geometric multiplicity 1). Such a matrix is called defective: you cannot diagonalise it, and it instead has a Jordan block structure. The viewer correctly shows only one amber line in that case, a useful signal that the transformation is hiding coupling beyond its eigenvalues.
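
For readers who want the arithmetic behind the defective example, here is the same two-step recipe used for the running matrix, written out as a worked equation. The label $S$ for the shear is introduced here only for convenience.

```latex
\begin{aligned}
S &= \begin{bmatrix} 2 & 1 \\ 0 & 2 \end{bmatrix}, \qquad
\det(S - \lambda I) = (2 - \lambda)^2 = 0 \implies \lambda = 2 \ \text{(a double root)}, \\
S - 2I &= \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix}
\implies y = 0 \implies \vec{v} = \begin{bmatrix} 1 \\ 0 \end{bmatrix}.
\end{aligned}
```

The single constraint $y = 0$ leaves only one independent direction, so there is no second eigenvector to fill out a basis.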

When you'll reach for eigenvalues and eigenvectors.

| Area | What eigenvectors / eigenvalues give you |
| --- | --- |
| Principal Component Analysis | The leading eigenvectors of the covariance matrix are the directions of maximal variance in the data. |
| PageRank | The dominant eigenvector of the web's stochastic link matrix is the page-ranking vector. |
| Spectral clustering | Eigenvectors of a graph's Laplacian group nodes into clusters in low dimensions. |
| Dynamical systems / stability | Linearise at a fixed point; eigenvalues of the Jacobian tell you if it's stable ($\lvert\lambda\rvert < 1$ for a discrete-time map, $\operatorname{Re}\lambda < 0$ for a continuous-time flow), unstable, or oscillatory. |
| Markov chains | The eigenvector for $\lambda = 1$ is the stationary distribution; modes with $\lvert\lambda\rvert < 1$ decay away at rate $\lambda^k$. |
| Differential equations | Solving $\dot{\vec{x}} = A\vec{x}$: the solutions are sums of $e^{\lambda_i t}\,\vec{v}_i$, so eigenvalues give the growth/decay modes. |
| Quantum mechanics | Observables are matrices; eigenvalues are the possible measured values; eigenvectors are the measurement states. |
| Vibrations / buckling | Eigenvalues of stiffness matrices are (squares of) natural frequencies; eigenvectors are mode shapes. |
| Deep learning | Eigenvalues of weight matrices and Jacobians control gradient vanishing/explosion and learning dynamics. |

What eigenvalues tell you about the transformation.

  • $|\lambda| > 1$ on some eigenvector: that direction is amplified under repeated application.
  • $|\lambda| < 1$ for every eigenvalue: repeated application drives every vector to the origin, although a single step can still stretch some vectors unless $\|A\|_2 < 1$ as well.
  • $\lambda < 0$: the direction is flipped at each application.
  • Complex $\lambda = r e^{i\theta}$: that pair of directions is rotated by angle $\theta$ and scaled by $r$. $r = 1$ is a pure rotation; $r > 1$ is a spiral out; $r < 1$ is a spiral in.
  • $\lambda = 0$: that eigenvector is the null direction, and the matrix kills it. If present, the matrix is singular.

Together, the full spectrum of eigenvalues answers the deepest question you can ask about a linear map: over time, where does this transformation send things?

Singular Values — True Stretching Strengths

Why we compute them. Eigenvalues work best for well-behaved square matrices. Singular values work for every matrix — square, rectangular, defective — and they are always real and non-negative. They tell you the actual stretching strengths along orthogonal principal directions, which is what you need for PCA, for SVD-based compression, for solving least squares, and for defining matrix norms that behave properly.

How they're computed. Singular values are the square roots of the eigenvalues of $A^\top A$. For our A,

$$A^\top A = \begin{bmatrix} 3 & 0 \\ 1 & 2 \end{bmatrix} \begin{bmatrix} 3 & 1 \\ 0 & 2 \end{bmatrix} = \begin{bmatrix} 9 & 3 \\ 3 & 5 \end{bmatrix}.$$

Solve $\det(A^\top A - \mu I) = 0$: $\mu^2 - 14\mu + 36 = 0 \implies \mu = 7 \pm \sqrt{13}$. So the singular values are

$$\sigma_1 = \sqrt{7 + \sqrt{13}} \approx 3.26, \qquad \sigma_2 = \sqrt{7 - \sqrt{13}} \approx 1.84.$$

Sanity checks: (i) the product $\sigma_1 \sigma_2 = \sqrt{36} = 6 = |\det A|$ ✓; (ii) the sum of squares $\sigma_1^2 + \sigma_2^2 = 14 = \|A\|_F^2$ ✓. These identities link SVD back to the determinant and Frobenius norm.
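
The recipe above (form $A^\top A$, take its eigenvalues, take square roots) can be sketched in plain Python for the 2×2 case. The function name `singular_values_2x2` is illustrative; a numerical library would normally compute an SVD directly rather than go through $A^\top A$.

```python
import math

def singular_values_2x2(m):
    """Singular values of m = [[a, b], [c, d]] via the eigenvalues of m^T m."""
    (a, b), (c, d) = m
    p = a * a + c * c   # (m^T m)[0][0]
    q = a * b + c * d   # (m^T m)[0][1] == (m^T m)[1][0]
    r = b * b + d * d   # (m^T m)[1][1]
    tr = p + r
    det = p * r - q * q
    root = math.sqrt(tr * tr - 4 * det)   # equals sqrt((p - r)^2 + 4 q^2), never negative
    mu1 = (tr + root) / 2
    mu2 = (tr - root) / 2
    return math.sqrt(mu1), math.sqrt(max(mu2, 0.0))  # clamp guards against tiny negative rounding

A = [[3, 1], [0, 2]]
s1, s2 = singular_values_2x2(A)

print(round(s1, 2), round(s2, 2))   # 3.26 1.84
print(round(s1 * s2, 9))            # 6.0  = |det A|
print(round(s1 ** 2 + s2 ** 2, 9))  # 14.0 = squared Frobenius norm
```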

See it in action. The viewer below turns on the singular ellipse: the dashed white unit circle is the set of all input unit vectors, and the solid blue ellipse is its image under A. The ellipse's long semi-axis has length $\sigma_1$, the short one $\sigma_2$, so every unit input produces an output whose length lies between those two numbers.

[Interactive viewer: 2D transform]

Spectral Norm — Maximum Amplification

Definition. The spectral norm (also called the operator 2-norm) of a matrix $A$ is the largest factor by which it can stretch the length of any vector:

$$\|A\|_2 = \max_{\vec{x} \neq \vec{0}} \frac{\|A\vec{x}\|}{\|\vec{x}\|} = \max_{\|\vec{x}\| = 1} \|A\vec{x}\|.$$

Equivalently, it is the largest singular value of $A$, or the square root of the largest eigenvalue of $A^\top A$:

$$\|A\|_2 = \sigma_{\max}(A) = \sqrt{\lambda_{\max}(A^\top A)}.$$

All three definitions give exactly the same number. The first is the honest meaning (“how much can this matrix amplify?”). The second and third are how you compute it from SVD or from eigenvalues of $A^\top A$.

Three ways to see it.

1. Worst-case amplifier. Feed every unit vector into $A$ and measure the output's length. The biggest number you find is $\|A\|_2$. Every output length lies between $\sigma_{\min}$ and $\sigma_{\max}$, so the spectral norm caps the amplification.

2. Geometric: the ellipse's longest reach. The unit sphere maps to an ellipsoid whose semi-axes are the singular values. The spectral norm is literally the length of the longest semi-axis. In the viewer below, it is the distance from the origin to the blue arrowtip.

3. Lipschitz constant. For any two inputs,

$$\|A\vec{x} - A\vec{y}\| \leq \|A\|_2 \cdot \|\vec{x} - \vec{y}\|.$$

The matrix cannot separate two points by more than $\|A\|_2$ times their original distance. If $\|A\|_2 < 1$ the matrix is a contraction (everything gets closer). If $\|A\|_2 = 1$ it is an isometry or non-expansive map (rotations and reflections live here). If $\|A\|_2 > 1$ some direction gets amplified.

How it's computed. Take the largest singular value we already computed for $A = \begin{bmatrix} 3 & 1 \\ 0 & 2 \end{bmatrix}$:

$$\|A\|_2 = \sigma_1 = \sqrt{7 + \sqrt{13}} \approx 3.26.$$

Concrete sanity check: feed in $\hat\imath = (1, 0)$, get $(3, 0)$, which has length 3, below the 3.26 bound. Try $\hat\jmath = (0, 1)$, get $(1, 2)$, which has length $\sqrt{5} \approx 2.24$, also below. The only unit input that reaches the bound (up to sign) is the first right singular vector of A, which maps onto the ellipse's long axis.
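
The "feed every unit vector into $A$" definition can be approximated by brute force. The sketch below samples unit vectors around the circle and records the stretch of each; the sample count `N` is an arbitrary choice, and the result is only as accurate as the sampling is fine.

```python
import math

A = [[3, 1], [0, 2]]

def stretch(m, theta):
    """Length of m applied to the unit vector at angle theta."""
    (a, b), (c, d) = m
    x, y = math.cos(theta), math.sin(theta)
    return math.hypot(a * x + b * y, c * x + d * y)

N = 3600  # one sample every 0.1 degrees
ratios = [stretch(A, 2 * math.pi * k / N) for k in range(N)]

sigma1 = math.sqrt(7 + math.sqrt(13))  # closed form from the text
sigma2 = math.sqrt(7 - math.sqrt(13))

print(round(max(ratios), 2), round(sigma1, 2))  # 3.26 3.26
print(round(min(ratios), 2), round(sigma2, 2))  # 1.84 1.84
```

The sampled maximum approaches $\sigma_1$ from below and the minimum approaches $\sigma_2$ from above, which is the "ratio bobbing between the two singular values" behaviour in numeric form.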

When you'll reach for it.

  • Dynamical systems and stability. If a system evolves via $\vec{x}_{t+1} = A\vec{x}_t$, then $\|\vec{x}_t\| \leq \|A\|_2^t \cdot \|\vec{x}_0\|$. So $\|A\|_2 < 1$ guarantees the trajectory decays to zero (stable); $\|A\|_2 > 1$ can blow up. This is the linear version of Lyapunov stability.
  • Gradient flow / exploding gradients in deep learning. Each layer of a neural network multiplies the gradient by some $W_i^\top$. If $\prod_i \|W_i\|_2 \gg 1$, gradients can explode; if $\ll 1$, they vanish. Spectral normalization in GANs (Miyato et al., 2018) divides each weight matrix by its spectral norm to pin it to 1, ensuring 1-Lipschitz layers.
  • Numerical error bounds. Solving $A\vec{x} = \vec{b}$ with a perturbed $\vec{b} + \Delta\vec{b}$ gives a relative output error bounded by $\kappa(A) \cdot \|\Delta\vec{b}\|/\|\vec{b}\|$, and the condition number $\kappa(A) = \sigma_{\max}/\sigma_{\min}$ puts the spectral norm on top of the ratio.
  • Robust control and $H^\infty$ design. In control theory the worst-case input-to-output amplification of a stable linear system is its $H^\infty$ norm $\|G\|_\infty$, the largest spectral norm of the transfer matrix $G(j\omega)$ over all frequencies; designing a controller that minimises $\|G\|_\infty$ directly bounds disturbance amplification.
  • Approximation and compression. The best rank-k approximation of $A$ (the Eckart-Young theorem) has error exactly $\sigma_{k+1}$ in spectral norm, the (k+1)-th singular value. This is why SVD-based low-rank compression uses the spectral norm as its natural error measure.
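
The stability bound in the first bullet can be watched numerically. The sketch below uses a hypothetical matrix `M`, the running $A$ divided by 4, chosen only because its spectral norm $\sigma_1(A)/4 \approx 0.814$ is below 1; the starting vector is arbitrary.

```python
import math

M = [[0.75, 0.25], [0.0, 0.5]]                     # hypothetical example: A / 4
spectral_norm = math.sqrt(7 + math.sqrt(13)) / 4   # sigma_1(A) / 4, about 0.814

def matvec(m, v):
    """Apply a 2x2 matrix to a 2-vector."""
    (a, b), (c, d) = m
    x, y = v
    return (a * x + b * y, c * x + d * y)

x0 = (1.0, 1.0)
x = x0
norms, bounds = [], []
for t in range(1, 21):
    x = matvec(M, x)                                     # x_t = M x_{t-1}
    norms.append(math.hypot(*x))                         # actual length after t steps
    bounds.append(spectral_norm ** t * math.hypot(*x0))  # guaranteed ceiling

print(all(n <= b + 1e-12 for n, b in zip(norms, bounds)))  # True
print(norms[-1] < 0.01)                                    # True: the trajectory has decayed
```

The bound is usually loose: the actual decay rate is set by the largest eigenvalue magnitude (0.75 here), while the spectral norm gives a ceiling that holds at every single step.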

What it tells you about the transformation. The spectral norm isolates one specific question: what is the worst thing this matrix does? If it is < 1, the matrix is shrinking space everywhere (and repeated application will drive everything to the origin). If it equals 1, the matrix at best preserves lengths — rotations, reflections, and orthogonal projections all have spectral norm exactly 1. If it is > 1, at least one direction is being actively amplified, and that amplification compounds under iteration. Knowing just this one number lets you reason about stability, convergence of iterative solvers, robustness of ML models, and the propagation of measurement error — without ever computing the full transformation.

See it in action. The viewer below reproduces the singular ellipse for $A = \begin{bmatrix} 3 & 1 \\ 0 & 2 \end{bmatrix}$. The solid blue arrow points from the origin to the farthest point on the ellipse, and its length is the spectral norm ≈ 3.26. Every unit input vector lands somewhere on that blue ellipse, so no output ever exceeds that length.

[Interactive viewer: 2D transform]

Feel the stretch on a single vector. The viewer above shows the whole ellipse of possible outputs. The one below isolates one vector at a time: drag anywhere on the canvas (or type coordinates) to set an input v, and watch the amber arrow Av land wherever the matrix carries it. The panel reports $\|v\|$, $\|Av\|$, and the stretch ratio $\|Av\|/\|v\|$. Try spinning $v$ around the origin and notice the ratio bobbing between $\sigma_{\min}$ and $\sigma_{\max}$, never exceeding the spectral norm.

[Interactive viewer: matrix-vector action]

Frobenius Norm — Total Energy

Why we compute it. It is the simplest possible matrix norm: just the Euclidean length of the matrix treated as a vector of its entries. It's what you get when you compute an “L2 loss” over a matrix in ML, and it bounds the spectral norm from above, so if $\|A\|_F$ is small, the matrix can't do anything too dramatic in any direction.

How it's computed. $\|A\|_F = \sqrt{\sum_{i,j} a_{ij}^2}$. For A,

$$\|A\|_F = \sqrt{3^2 + 1^2 + 0^2 + 2^2} = \sqrt{14} \approx 3.742.$$

It equals $\sqrt{\sigma_1^2 + \sigma_2^2}$, so it pools the two stretching strengths. For our A, $\sqrt{3.26^2 + 1.84^2} \approx 3.74$ ✓.
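
Both routes to the Frobenius norm (entries, or singular values) fit in a few lines. This is a minimal sketch; `frobenius` is an illustrative helper name, and the singular values are taken from the closed forms derived earlier.

```python
import math

def frobenius(m):
    """Euclidean length of the matrix read as one long vector of entries."""
    return math.sqrt(sum(x * x for row in m for x in row))

A = [[3, 1], [0, 2]]
fro = frobenius(A)

sigma1 = math.sqrt(7 + math.sqrt(13))
sigma2 = math.sqrt(7 - math.sqrt(13))

print(round(fro, 3))                                  # 3.742
print(math.isclose(fro, math.hypot(sigma1, sigma2)))  # True: pools both stretching strengths
print(sigma1 <= fro)                                  # True: bounds the spectral norm from above
```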

Condition Number — Numerical Fragility

Why we compute it. When you solve $A\vec{x} = \vec{b}$, a small relative change in $\vec{b}$ can produce a relative change in $\vec{x}$ up to $\kappa(A)$ times larger. A condition number close to 1 means inputs and outputs change in lockstep. A huge condition number means small measurement errors can ruin the answer. In numerical linear algebra, this is the single most important diagnostic for “can I trust this computation?”

How it's computed. Just a ratio of singular values:

$$\kappa(A) = \frac{\sigma_{\max}}{\sigma_{\min}}.$$

For our A:

$$\kappa(A) = \frac{3.26}{1.84} \approx 1.77.$$

That's well-conditioned: inputs and outputs stay roughly proportional. The viewer below shows an ill-conditioned matrix instead: one direction stretches enormously while the other barely moves, so $\sigma_1 / \sigma_2$ is huge. Watch the amber grid: it gets badly flattened, and that squeezing is the geometric face of “ill-conditioned.”

[Interactive viewer: 2D transform]
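
The error-amplification reading of $\kappa(A)$ can be tested on the running matrix. In the sketch below, the right-hand side `b` and the perturbation `db` are made-up values standing in for a measurement and its error; `solve2` is an illustrative helper that uses the explicit 2×2 inverse.

```python
import math

def solve2(m, rhs):
    """Solve the 2x2 system m x = rhs using the adjugate formula."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return ((d * rhs[0] - b * rhs[1]) / det, (a * rhs[1] - c * rhs[0]) / det)

def norm(v):
    return math.hypot(*v)

A = [[3, 1], [0, 2]]
kappa = math.sqrt((7 + math.sqrt(13)) / (7 - math.sqrt(13)))  # sigma_max / sigma_min

b = (1.0, 1.0)
db = (1e-3, -1e-3)                       # hypothetical measurement error
x = solve2(A, b)
x_pert = solve2(A, (b[0] + db[0], b[1] + db[1]))

rel_in = norm(db) / norm(b)                                     # relative change in b
rel_out = norm((x_pert[0] - x[0], x_pert[1] - x[1])) / norm(x)  # relative change in x

print(round(kappa, 2))            # 1.77
print(rel_out / rel_in <= kappa)  # True: amplification stays under the condition number
```

For this well-conditioned matrix the amplification can never exceed about 1.77; an ill-conditioned matrix would allow a far larger ratio for an unlucky perturbation direction.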

Summary Table — Which Number Answers Which Question

| Quantity | Formula | Geometric Meaning | Degenerate When |
| --- | --- | --- | --- |
| Rank | # of independent columns | Dimensions that survive | Rank < n |
| Nullity | n − rank | Directions crushed to origin | Nullity > 0 |
| det A | signed n-volume scale | Volume multiplier (sign = orientation) | det = 0 |
| tr(A) | sum of diagonals = Σ λᵢ | Sum of natural scaling modes | |
| λᵢ (eigenvalues) | roots of det(A−λI) = 0 | Stretch factors along invariant directions | complex ⇒ no real invariant |
| vᵢ (eigenvectors) | (A−λᵢ I) v = 0 | Directions that don't rotate | |
| σᵢ (singular values) | √ eigenvalues of AᵀA | Principal stretching strengths | σ = 0 ⇒ dim lost |
| ‖A‖₂ | σ_max | Maximum amplification | |
| ‖A‖_F | √Σaᵢⱼ² | Total entry energy | |
| κ(A) | σ_max / σ_min | Numerical fragility | κ = ∞ ⇒ singular |

Why so many numbers? Because each one answers a different question, and the matrix's behaviour is rich enough that no single summary captures it all. Rank tells you what's lost; the determinant tells you how volume scales; eigenvalues tell you the natural modes; singular values tell you real stretching strength; the condition number tells you whether you can trust the computation. Together they form a toolbox — and once you've played with enough matrices in the viewers above, you start to feel which one the situation calls for.
