1.12 — Singular Value Decomposition (SVD)

Date: 2026-03-02 | Block: 1 — Linear Algebra

The idea in plain English

Eigendecomposition is powerful but limited — it only works for square matrices. SVD is the generalisation that works for any matrix, any shape, any rank. It decomposes any matrix into three simple pieces: a rotation, a scaling, and another rotation. It is arguably the single most useful matrix factorisation in all of data science.

The intuition

Remember from Topic 1.3: any matrix transformation can be understood geometrically. SVD makes that picture precise: any linear transformation is secretly just three simple operations chained together.

Take the unit circle (all vectors of length 1). Apply any matrix A to it. You always get an ellipse. The SVD tells you:
- How much the ellipse is stretched in each direction (the singular values σ₁, σ₂, ...)
- Which directions in the output space correspond to those stretches (the left singular vectors — columns of U)
- Which directions in the input space get mapped to those stretches (the right singular vectors — columns of V)

Unit circle  →  [Vᵀ]  →  [Σ]  →  [U]  →  Ellipse

Step 1 (Vᵀ): rotate input to align natural directions with axes
Step 2 (Σ):  scale each axis independently (σ₁ stretches most, σ₂ next, ...)
Step 3 (U):  rotate output to the right orientation

Any transformation = rotate → scale → rotate. Always.
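This rotate → scale → rotate picture can be checked numerically. A minimal NumPy sketch, using a small 2×2 matrix chosen only for illustration: the image of the unit circle under A is an ellipse whose longest and shortest radii match the singular values.

```python
import numpy as np

# Hypothetical 2x2 matrix, used only for illustration.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])

# Sample points on the unit circle.
theta = np.linspace(0, 2 * np.pi, 1000)
circle = np.vstack([np.cos(theta), np.sin(theta)])  # shape (2, 1000)

# The image of the circle under A is an ellipse.
ellipse = A @ circle

# The ellipse's longest and shortest radii are the singular values of A.
radii = np.linalg.norm(ellipse, axis=0)
sigma = np.linalg.svd(A, compute_uv=False)
print(radii.max(), sigma[0])  # longest radius ≈ σ₁
print(radii.min(), sigma[1])  # shortest radius ≈ σ₂
```

The match is only approximate because the circle is sampled at 1000 points, but the discretisation error is tiny.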

The math

The decomposition:

A = U · Σ · Vᵀ
  • U (m×m): orthogonal — left singular vectors (directions in output space)
  • Σ (m×n): diagonal — singular values σ₁ ≥ σ₂ ≥ ... ≥ 0 (the stretch factors)
  • Vᵀ (n×n): orthogonal — right singular vectors (directions in input space)

Works for any matrix (rectangular, rank-deficient, anything).
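A quick NumPy sketch on a rectangular matrix (random data, chosen only for illustration) confirms the shapes above and the reconstruction A = U·Σ·Vᵀ:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))    # rectangular: m=4, n=3

U, s, Vt = np.linalg.svd(A)        # full SVD by default
print(U.shape, s.shape, Vt.shape)  # (4, 4) (3,) (3, 3)

# Rebuild the m×n "diagonal" Σ and check A = U Σ Vᵀ.
Sigma = np.zeros((4, 3))
np.fill_diagonal(Sigma, s)
assert np.allclose(A, U @ Sigma @ Vt)

# U and V are orthogonal: their columns are orthonormal.
assert np.allclose(U.T @ U, np.eye(4))
assert np.allclose(Vt @ Vt.T, np.eye(3))
```

Note that NumPy returns Σ as a 1-D array of singular values (descending) and Vᵀ rather than V.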

Connection to eigendecomposition:

AᵀA = VΣ²Vᵀ   →   V = eigenvectors of AᵀA,   σᵢ = √(eigenvalue of AᵀA)
AAᵀ = UΣ²Uᵀ   →   U = eigenvectors of AAᵀ

AᵀA is always symmetric and positive semi-definite → its eigenvalues are always real and ≥ 0 → the singular values are always real and ≥ 0. This is why the SVD always exists.
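The connection can be verified directly: for a random test matrix, the singular values equal the square roots of the eigenvalues of AᵀA.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 3))

s = np.linalg.svd(A, compute_uv=False)      # singular values, descending
evals = np.linalg.eigvalsh(A.T @ A)[::-1]   # eigenvalues of AᵀA, descending
evals = np.maximum(evals, 0.0)              # clip tiny negative rounding errors
assert np.allclose(s, np.sqrt(evals))
```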

Rank-1 expansion (the most useful form):

A = σ₁·u₁v₁ᵀ + σ₂·u₂v₂ᵀ + ... + σᵣ·uᵣvᵣᵀ

Each term is a rank-1 matrix (an outer product). The terms are sorted from most important (σ₁ is largest) to least important (σᵣ is smallest).
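A sketch of the expansion in NumPy (random matrix for illustration): summing the r outer products σᵢ·uᵢvᵢᵀ rebuilds A exactly.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 3))

# Thin SVD: only the first r = min(m, n) singular vectors.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Each σᵢ · uᵢvᵢᵀ is a rank-1 matrix; their sum reconstructs A.
A_rebuilt = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(len(s)))
assert np.allclose(A, A_rebuilt)
```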

Truncated SVD — the best low-rank approximation: Drop all but the top k terms:

Aₖ = σ₁·u₁v₁ᵀ + ... + σₖ·uₖvₖᵀ

The Eckart–Young theorem guarantees this is the best possible rank-k approximation of A, in both the spectral and Frobenius norms — no other rank-k matrix is closer.
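Eckart–Young can be observed numerically (random matrix for illustration): the spectral-norm error of the rank-k truncation is exactly σₖ₊₁, and the Frobenius-norm error is the root-sum-square of the dropped singular values.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((6, 5))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # best rank-k approximation

# Spectral-norm error equals the first dropped singular value.
err_2 = np.linalg.norm(A - A_k, ord=2)
assert np.isclose(err_2, s[k])

# Frobenius-norm error equals sqrt(σ_{k+1}² + ... + σ_r²).
err_f = np.linalg.norm(A - A_k, 'fro')
assert np.isclose(err_f, np.sqrt(np.sum(s[k:] ** 2)))
```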

Key facts:

rank(A) = number of non-zero singular values
condition number = σ_max / σ_min   (large = near-singular = numerically dangerous)
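Both facts in code, using a deliberately rank-deficient matrix (row 3 = row 1 + row 2); the rank tolerance below mirrors NumPy's default cutoff for "numerically zero" singular values.

```python
import numpy as np

# Rank-deficient by construction: row 3 = row 1 + row 2.
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [5.0, 7.0, 9.0]])

s = np.linalg.svd(A, compute_uv=False)

# Count singular values above a floating-point tolerance (NumPy's default).
tol = max(A.shape) * np.finfo(float).eps * s[0]
rank = int(np.sum(s > tol))
assert rank == np.linalg.matrix_rank(A)   # 2, not 3

# Condition number σ_max / σ_min: huge here, flagging near-singularity.
cond = s[0] / s[-1]
print(rank, cond)
```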

A worked example — image compression

An image is a matrix of pixel values, say 512×512. SVD gives 512 singular values. Plot them — the first few are enormous (they capture coarse structure: shapes, edges, gradients) and they decay rapidly (fine texture, noise are in the small ones).

Keep only the top 50:

Original: 512×512 = 262,144 numbers
Compressed: 50×(512+1+512) = 51,250 numbers  →  80% smaller
Visual quality: often nearly indistinguishable
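The arithmetic above can be reproduced on a synthetic 512×512 "image" — smooth gradients plus a little noise, a stand-in for a real photo, which this page does not ship:

```python
import numpy as np

# Synthetic 512×512 image: a smooth low-rank pattern plus pixel noise.
rng = np.random.default_rng(4)
x = np.linspace(0, 1, 512)
img = (np.outer(np.sin(4 * np.pi * x), np.cos(3 * np.pi * x))
       + 0.01 * rng.standard_normal((512, 512)))

U, s, Vt = np.linalg.svd(img, full_matrices=False)

k = 50
img_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

stored = k * (512 + 1 + 512)   # 51,250 numbers vs 262,144 originally
rel_err = np.linalg.norm(img - img_k, 'fro') / np.linalg.norm(img, 'fro')
print(stored, f"relative error {rel_err:.3%}")
```

Because the structure is low-rank and only the noise lives in the small singular values, the top-50 reconstruction is visually near-identical.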

Why this matters for ML

Recommendation systems (Netflix Prize): you have a huge user×movie rating matrix — mostly empty (most users haven't rated most movies). SVD fills in the gaps by finding the low-rank structure: a few "taste dimensions" that explain most viewing patterns. The 2009 Netflix Prize was won with matrix factorisation — essentially truncated SVD.

Numerical stability: when you invert a matrix to solve linear regression, small singular values cause instability (dividing by near-zero). The condition number σ_max/σ_min tells you how bad the problem is. High condition number → need regularisation.

Pseudoinverse via SVD: A⁺ = VΣ⁺Uᵀ where Σ⁺ inverts the non-zero singular values and keeps zeros as zero. This is how numpy computes lstsq — it never explicitly inverts anything, it uses SVD.
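A sketch of that construction, checked against np.linalg.pinv and np.linalg.lstsq (random data for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((6, 3))
b = rng.standard_normal(6)

# Pseudoinverse via SVD: invert non-zero singular values, keep zeros as zero.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
s_inv = np.where(s > 1e-12 * s[0], 1.0 / s, 0.0)
A_pinv = Vt.T @ np.diag(s_inv) @ U.T

assert np.allclose(A_pinv, np.linalg.pinv(A))

# The least-squares solution x = A⁺ b matches lstsq — no explicit inverse.
x = A_pinv @ b
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)
assert np.allclose(x, x_ls)
```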

PCA (Topic 1.13): SVD of the centred data matrix directly gives you the principal components. No need to compute the covariance matrix — just SVD of X̃.
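A sketch of that shortcut on synthetic data with three different variance scales: SVD of the centred data matrix yields the same principal directions (up to sign) and the same explained variances as eigendecomposition of the covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(6)
# 200 samples, 3 features with very different variances.
X = rng.standard_normal((200, 3)) @ np.diag([2.0, 1.0, 0.1])

# Centre the data, then SVD — no covariance matrix needed.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Compare with the covariance route: rows of Vt are the principal
# components, and s²/(n-1) are the explained variances.
cov = Xc.T @ Xc / (len(X) - 1)
evals, evecs = np.linalg.eigh(cov)        # ascending order

assert np.allclose(s[::-1] ** 2 / (len(X) - 1), evals)
for i in range(3):
    # Eigenvectors are only defined up to sign, so compare magnitudes.
    assert np.allclose(np.abs(Vt[i]), np.abs(evecs[:, 2 - i]))
```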

The one thing to remember

Any matrix = rotate → scale → rotate. The singular values are the scale factors. Keep only the top k singular values and you have the best rank-k approximation of the matrix — this is truncated SVD, the foundation of PCA and recommendation systems.

← Previous 1.11 — Eigendecomposition & Diagonalization Next → 1.13 — PCA Derived from SVD