1.13 — PCA Derived from SVD
Date: 2026-03-02 | Block: 1 — Linear Algebra
The idea in plain English
PCA (Principal Component Analysis) finds the most important directions of variation in your data, and lets you project the data onto those directions to reduce its dimensionality. It answers: "If I can only keep k directions, which k directions preserve the most information?" It is derived from the geometry of projections and the algebra of SVD.
The intuition
Imagine a cloud of data points in 2D — they form an elongated ellipse:
      *  *
   *  *  *
 *  *  *  *     ← data cloud (elongated)
   *  *  *
      *  *
The cloud is longer in one direction than another. PCA finds that long axis (the direction of most spread/variance) and calls it the first principal component. The perpendicular direction is the second component.
If you project all points onto just the first component, you've gone from 2D to 1D — losing minimal information, because the dropped direction (perpendicular) has little variance. You've compressed the data.
In high dimensions (hundreds of features), the same idea applies: find the few directions that capture most of the variation, project onto them, ignore the rest.
The math — derived from first principles
Setup: data matrix X is (n×p) — n samples, p features.
Step 1 — centre the data. Subtract the mean of each column:
X̃ = X − mean(X, axis=0)
This makes every feature have mean 0. PCA finds directions of spread, not of location, so centring is essential.
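A quick sketch of the centring step in NumPy (the data here is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))       # n=100 samples, p=3 features

X_tilde = X - X.mean(axis=0)        # subtract each column's mean

# every feature now has mean 0 (up to floating-point error)
print(np.allclose(X_tilde.mean(axis=0), 0))  # True
```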
Step 2 — formulate the problem. Find a unit vector w that maximises the variance of the projected data:
Var(X̃w) = (1/n)‖X̃w‖² = wᵀCw where C = (1/n)X̃ᵀX̃ (covariance matrix)
Maximise wᵀCw subject to ‖w‖² = 1
Step 3 — solve with Lagrange multipliers (constrained optimisation — fully covered in Block 2):
∂/∂w [wᵀCw − λ(wᵀw − 1)] = 0
→ Cw = λw
The optimal direction w must be an eigenvector of C. The variance achieved equals the eigenvalue λ. To maximise variance → take the eigenvector with the largest eigenvalue.
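You can check this numerically: take the top eigenvector of C and confirm that the variance of the projected data equals the top eigenvalue. A sketch with synthetic data (stretched so one direction clearly dominates):

```python
import numpy as np

rng = np.random.default_rng(1)
# stretch a Gaussian cloud: 3x spread along the first axis, 0.5x along the second
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [0.0, 0.5]])
X_tilde = X - X.mean(axis=0)
n = X_tilde.shape[0]

C = X_tilde.T @ X_tilde / n           # covariance matrix

eigvals, eigvecs = np.linalg.eigh(C)  # eigh: for symmetric matrices, eigenvalues ascending
w = eigvecs[:, -1]                    # eigenvector with the largest eigenvalue

# variance of the projection equals that eigenvalue: w'Cw = λ
proj_var = np.var(X_tilde @ w)
print(np.isclose(proj_var, eigvals[-1]))  # True
```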
Step 4 — get all components at once via SVD. Instead of computing C and its eigendecomposition, take SVD of X̃ directly:
X̃ = U · Σ · Vᵀ
→ C = (1/n)X̃ᵀX̃ = V·(Σ²/n)·Vᵀ
Reading off: the columns of V are the principal components. The variance along component i is σᵢ²/n.
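This equivalence is easy to verify numerically: the eigenvalues of C should equal σᵢ²/n, and each column of V should be an eigenvector of C. A sketch:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 4))
X_tilde = X - X.mean(axis=0)
n = X_tilde.shape[0]

U, s, Vt = np.linalg.svd(X_tilde, full_matrices=False)  # note: returns V transposed
C = X_tilde.T @ X_tilde / n

# eigenvalues of C are sigma_i^2 / n ...
eigvals = np.linalg.eigvalsh(C)[::-1]           # flip to descending order
print(np.allclose(eigvals, s**2 / n))           # True

# ... and each column of V is an eigenvector of C: C v = (sigma^2 / n) v
v1 = Vt[0]
print(np.allclose(C @ v1, (s[0]**2 / n) * v1))  # True
```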
Step 5 — project to k dimensions. Take the first k columns of V:
Vₖ = [v₁ | v₂ | ... | vₖ] (p×k)
Z = X̃ · Vₖ (n×k) ← your compressed data
To approximately reconstruct: X̃_approx = Z · Vₖᵀ — a projection (Topic 1.9) onto the principal subspace.
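In code, projection and reconstruction are two matrix multiplications. A sketch (k=2 is an arbitrary choice here); note that projecting the reconstruction again changes nothing, because projection is idempotent:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))
X_tilde = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(X_tilde, full_matrices=False)
k = 2
Vk = Vt[:k].T                  # (p x k): first k principal components as columns

Z = X_tilde @ Vk               # (n x k) compressed data
X_approx = Z @ Vk.T            # (n x p) reconstruction in the original space

print(Z.shape, X_approx.shape)  # (100, 2) (100, 5)

# projecting the reconstruction is a no-op: it already lies in the principal subspace
print(np.allclose((X_approx @ Vk) @ Vk.T, X_approx))  # True
```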
Choosing k — how many components?
Variance explained by component i = σᵢ² / (σ₁² + σ₂² + ... + σₚ²)
Plot these in order (scree plot). Choose k at the "elbow," or where cumulative variance hits 95%.
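The 95% rule is a one-liner once you have the singular values. A sketch (the 0.95 threshold is the conventional choice, not a law):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 6))
X_tilde = X - X.mean(axis=0)

s = np.linalg.svd(X_tilde, compute_uv=False)  # singular values only

explained = s**2 / np.sum(s**2)   # fraction of variance explained by each component
cumulative = np.cumsum(explained)

# smallest k whose cumulative explained variance reaches 95%
k = int(np.searchsorted(cumulative, 0.95)) + 1
print(explained.round(3), "-> keep k =", k)
```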
Why this matters for ML
Dimensionality reduction: 1000 features → 20 principal components capturing 95% of variance. This speeds up any downstream model, reduces overfitting, and removes noise.
Visualisation: project your high-dimensional data to 2 or 3 PCA dimensions and you can see it. Clusters, outliers, and structure become visible. When you meet a new dataset, a PCA visualisation is often a worthwhile first step before training any model.
Noise removal: the small singular values capture noise (random fluctuations with no pattern). Discarding them and reconstructing from the top components gives a denoised version of the data.
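A sketch of denoising on synthetic data: build a signal that genuinely lives in 2 dimensions, add noise in all 10, then reconstruct from the top 2 components. The reconstruction keeps only the noise inside the principal subspace and discards the rest, so it lands closer to the clean signal:

```python
import numpy as np

rng = np.random.default_rng(5)
# low-rank signal plus noise: 200 samples that really live in 2 dimensions
signal = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 10))
noisy = signal + 0.1 * rng.normal(size=signal.shape)

X_tilde = noisy - noisy.mean(axis=0)
U, s, Vt = np.linalg.svd(X_tilde, full_matrices=False)

k = 2
denoised = (X_tilde @ Vt[:k].T) @ Vt[:k] + noisy.mean(axis=0)

err_noisy = np.linalg.norm(noisy - signal)
err_denoised = np.linalg.norm(denoised - signal)
print(err_denoised < err_noisy)   # True: the reconstruction is closer to the clean signal
```

In real data you don't know the true rank, which is exactly where the scree plot / 95% rule from the previous section comes in.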
Connection to everything in Block 1: every topic contributed to this derivation — vectors are data points, span and independence explain what PCA is doing, matrices transform the data, rank is the intrinsic dimension, projections are the core operation, eigenvalues measure variance, eigendecomposition gives all components, and SVD provides the clean practical algorithm.
The complete algorithm in 4 steps
1. X̃ = X − mean(X, axis=0) ← centre
2. X̃ = U·Σ·Vᵀ ← SVD
3. Vₖ = first k columns of V ← pick components
4. Z = X̃·Vₖ ← project (compress)
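The four steps above fit in a few lines of NumPy. A minimal sketch (not an optimised library implementation; for real work a library such as scikit-learn's PCA does the same thing with more care):

```python
import numpy as np

def pca(X, k):
    """The 4-step PCA algorithm: centre, SVD, pick k components, project."""
    mu = X.mean(axis=0)
    X_tilde = X - mu                                         # 1. centre
    U, s, Vt = np.linalg.svd(X_tilde, full_matrices=False)   # 2. SVD
    Vk = Vt[:k].T                                            # 3. first k columns of V
    Z = X_tilde @ Vk                                         # 4. project (compress)
    return Z, Vk, mu

rng = np.random.default_rng(6)
X = rng.normal(size=(30, 4))
Z, Vk, mu = pca(X, k=2)
print(Z.shape, Vk.shape)   # (30, 2) (4, 2)
```

Returning `mu` alongside `Z` and `Vk` lets you reconstruct in the original coordinates later: `Z @ Vk.T + mu`.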
The one thing to remember
PCA finds the directions of maximum variance. Via SVD, these are the right singular vectors of the centred data matrix. Project onto the top k of them → best possible k-dimensional compression of your data.