1.12 — Singular Value Decomposition (SVD)
Date: 2026-03-02 | Block: 1 — Linear Algebra
The idea in plain English
Eigendecomposition is powerful but limited — it only works for square matrices. SVD is the generalisation that works for any matrix, any shape, any rank. It decomposes any matrix into three simple pieces: a rotation, a scaling, and another rotation. It is arguably the single most useful matrix factorisation in all of data science.
The intuition
Remember from Topic 1.3: any matrix transformation can be understood geometrically. SVD makes that geometrically precise. It says: any linear transformation is secretly just three simple operations chained together.
Take the unit circle (all vectors of length 1). Apply any matrix A to it. You always get an ellipse. The SVD tells you:
- How much the ellipse is stretched in each direction (the singular values σ₁, σ₂, ...)
- Which directions in the output space correspond to those stretches (the left singular vectors — columns of U)
- Which directions in the input space get mapped to those stretches (the right singular vectors — columns of V)
Unit circle → [Vᵀ] → [Σ] → [U] → Ellipse
Step 1 (Vᵀ): rotate input to align natural directions with axes
Step 2 (Σ): scale each axis independently (σ₁ stretches most, σ₂ next, ...)
Step 3 (U): rotate output to the right orientation
Any transformation = rotate → scale → rotate. Always.
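The rotate → scale → rotate claim can be checked directly in NumPy. A sketch, using an arbitrary 2×2 example matrix (not from the text): U and Vᵀ are orthogonal (pure rotations/reflections, so transpose = inverse), and chaining the three pieces reproduces A.

```python
import numpy as np

# Arbitrary example matrix (illustrative, not from the notes)
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])

U, s, Vt = np.linalg.svd(A)

# U and Vt are orthogonal: transpose is inverse
assert np.allclose(U @ U.T, np.eye(2))
assert np.allclose(Vt @ Vt.T, np.eye(2))

# rotate (Vt) -> scale (diag(s)) -> rotate (U) reproduces A exactly
assert np.allclose(U @ np.diag(s) @ Vt, A)
```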
The math
The decomposition:
A = U · Σ · Vᵀ
- U (m×m): orthogonal — left singular vectors (directions in output space)
- Σ (m×n): diagonal — singular values σ₁ ≥ σ₂ ≥ ... ≥ 0 (the stretch factors)
- Vᵀ (n×n): orthogonal — its rows are the right singular vectors, i.e. the columns of V (directions in input space)
Works for any matrix (rectangular, rank-deficient, anything).
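The shapes above are easy to verify on a rectangular matrix. A minimal sketch with a random 4×3 matrix (any shape works): U is m×m, Σ is m×n with the singular values on its diagonal, Vᵀ is n×n.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))          # m=4, n=3 — rectangular on purpose

U, s, Vt = np.linalg.svd(A, full_matrices=True)
print(U.shape, s.shape, Vt.shape)        # (4, 4) (3,) (3, 3)

# numpy returns s as a vector; rebuild the full m x n diagonal Sigma
Sigma = np.zeros((4, 3))
np.fill_diagonal(Sigma, s)
assert np.allclose(U @ Sigma @ Vt, A)
```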
Connection to eigendecomposition:
AᵀA = VΣ²Vᵀ → V = eigenvectors of AᵀA, σᵢ = √(eigenvalue of AᵀA)
AAᵀ = UΣ²Uᵀ → U = eigenvectors of AAᵀ
AᵀA is always symmetric and positive semi-definite → its eigenvalues are always real and ≥ 0 → the singular values are always real and ≥ 0. This is why SVD always works, even when eigendecomposition doesn't.
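This connection is cheap to sanity-check numerically. A sketch with a random matrix: the eigenvalues of AᵀA are non-negative, and their square roots match the singular values.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 3))

_, s, _ = np.linalg.svd(A)
eigvals = np.linalg.eigvalsh(A.T @ A)    # eigvalsh: for symmetric matrices, ascending order

assert np.all(eigvals > -1e-12)          # positive semi-definite: eigenvalues >= 0
assert np.allclose(np.sort(s**2), np.sort(eigvals))   # sigma_i = sqrt(eigenvalue of A^T A)
```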
Rank-1 expansion (the most useful form):
A = σ₁·u₁v₁ᵀ + σ₂·u₂v₂ᵀ + ... + σᵣ·uᵣvᵣᵀ
Each term is a rank-1 matrix (an outer product). The terms are sorted from most important (σ₁ is largest) to least important (σᵣ is smallest).
Truncated SVD — the best low-rank approximation: Drop all but the top k terms:
Aₖ = σ₁·u₁v₁ᵀ + ... + σₖ·uₖvₖᵀ
The Eckart-Young theorem guarantees this is the best possible rank-k approximation of A — no other rank-k matrix is closer.
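Both the rank-1 expansion and the Eckart–Young guarantee can be seen in code. A sketch on a random matrix: summing the top-k outer products equals the truncated matrix product, and the Frobenius error of Aₖ is exactly the root-sum-square of the dropped singular values.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((20, 15))
k = 5

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Rank-1 expansion: sum of the top-k outer products sigma_i * u_i v_i^T
A_k = sum(s[i] * np.outer(U[:, i], Vt[i]) for i in range(k))

# Same thing in matrix form
assert np.allclose(A_k, U[:, :k] @ np.diag(s[:k]) @ Vt[:k])

# Eckart-Young: the best rank-k error is sqrt(sum of the dropped sigma^2)
err = np.linalg.norm(A - A_k, "fro")
assert np.isclose(err, np.sqrt(np.sum(s[k:] ** 2)))
```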
Key facts:
rank(A) = number of non-zero singular values
condition number = σ_max / σ_min (large = near-singular = numerically dangerous)
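Both facts can be read straight off the singular values. A sketch: a matrix built to be rank-deficient (third row = first row + second row) has exactly one near-zero singular value, and a tiny σ_min blows up the condition number.

```python
import numpy as np

# Deliberately rank-2: the third row is the sum of the first two
A = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 3.0]])

s = np.linalg.svd(A, compute_uv=False)
rank = int(np.sum(s > 1e-10))            # count non-zero sigmas (with a tolerance)
assert rank == np.linalg.matrix_rank(A) == 2

# Condition number: sigma_max / sigma_min; a tiny sigma_min makes it huge
B = np.diag([1.0, 1e-8])
assert np.isclose(np.linalg.cond(B), 1e8)
```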
A worked example — image compression
An image is a matrix of pixel values, say 512×512. SVD gives 512 singular values. Plot them — the first few are enormous (they capture coarse structure: shapes, edges, gradients) and they decay rapidly (fine texture, noise are in the small ones).
Keep only the top 50:
Original: 512×512 = 262,144 numbers
Compressed: 50×(512+1+512) = 51,250 numbers → 80% smaller
Visual quality: often nearly indistinguishable
Why this matters for ML
Recommendation systems (Netflix Prize): you have a huge user×movie rating matrix — mostly empty (most users haven't rated most movies). SVD fills in the gaps by finding the low-rank structure: a few "taste dimensions" that explain most viewing patterns. The 2009 Netflix Prize was won with matrix factorisation — essentially truncated SVD.
Numerical stability: when you invert a matrix to solve linear regression, small singular values cause instability (dividing by near-zero). The condition number σ_max/σ_min tells you how bad the problem is. High condition number → need regularisation.
Pseudoinverse via SVD: A⁺ = VΣ⁺Uᵀ, where Σ⁺ inverts the non-zero singular values and keeps the zeros as zeros. This is how NumPy's lstsq solves least-squares problems — it never explicitly inverts anything; it works through the SVD.
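A sketch of the pseudoinverse built by hand from the SVD, checked against np.linalg.pinv and np.linalg.lstsq on a random overdetermined system:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((6, 3))          # tall: overdetermined least squares
b = rng.standard_normal(6)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
s_inv = np.where(s > 1e-12, 1.0 / s, 0.0)   # invert non-zero sigmas, keep zeros as zeros
A_pinv = Vt.T @ np.diag(s_inv) @ U.T        # A+ = V Sigma+ U^T

assert np.allclose(A_pinv, np.linalg.pinv(A))

# The least-squares solution is just A+ b
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
assert np.allclose(A_pinv @ b, x_lstsq)
```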
PCA (Topic 1.13): SVD of the centred data matrix directly gives you the principal components. No need to compute the covariance matrix — just SVD of X̃.
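A sketch of that equivalence on synthetic data: the rows of Vᵀ from the SVD of the centred data match the covariance matrix's eigenvectors (up to a sign flip per component), and the covariance eigenvalues are σᵢ²/(n−1).

```python
import numpy as np

rng = np.random.default_rng(5)
# Synthetic data with distinct variances along each axis (illustrative)
X = rng.standard_normal((100, 3)) * np.array([2.0, 1.0, 0.1])
Xc = X - X.mean(axis=0)                  # centre each column

# Route 1: SVD of the centred data -> rows of Vt are the principal directions
_, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Route 2: eigendecomposition of the covariance matrix
cov = Xc.T @ Xc / (len(X) - 1)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh sorts ascending
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # flip to descending

# Same directions up to sign: |<v_i, e_i>| = 1 for unit vectors
for i in range(3):
    assert np.isclose(abs(Vt[i] @ eigvecs[:, i]), 1.0)

# Covariance eigenvalues are sigma^2 / (n - 1)
assert np.allclose(s**2 / (len(X) - 1), eigvals)
```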
The one thing to remember
Any matrix = rotate → scale → rotate. The singular values are the scale factors. Keep only the top k singular values and you have the best rank-k approximation of the matrix — this is truncated SVD, the foundation of PCA and recommendation systems.