1.4 — Matrix Multiplication: Composition of Transformations

Date: 2026-02-27 | Block: 1 — Linear Algebra

The idea in plain English

Multiplying two matrices produces a new matrix that represents applying both transformations one after the other. It's not element-wise multiplication — it's function composition. A·B means "first do B, then do A."

The intuition

If matrix B rotates space 90°, and matrix A scales everything by 2, then A·B is the single transformation that does both: rotate first, then scale. The result is a brand new matrix that "remembers" both operations baked in.

Think of it like combining two instructions into one. Instead of saying "first turn left, then go forward," you combine them into "take this diagonal path."

Why does order matter? Because "rotate then scale" is geometrically different from "scale then rotate" — just like "put on shoes then socks" differs from "put on socks then shoes."

The math

How to compute A·B: Each entry Cᵢⱼ of the result = dot product of row i of A with column j of B.

A = [ a  b ]    B = [ e  f ]
    [ c  d ]        [ g  h ]

A·B = [ a·e + b·g    a·f + b·h ]
      [ c·e + d·g    c·f + d·h ]

Dimension rule: (m×n) · (n×p) = (m×p) — the inner dimensions must match.

Key properties:

- NOT commutative: AB ≠ BA in general (order matters!)
- Associative: (AB)C = A(BC) (grouping doesn't matter)
- Identity: A·I = I·A = A (the identity matrix does nothing)
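The row-by-column rule and the dimension check can be written out directly. This is a naive Python sketch straight from the definition (`matmul` is a name chosen here, not a library function):

```python
def matmul(A, B):
    """Multiply matrices given as lists of rows, straight from the definition:
    C[i][j] = dot product of row i of A with column j of B."""
    m, n = len(A), len(A[0])
    n2, p = len(B), len(B[0])
    if n != n2:  # (m×n)·(n×p): inner dimensions must match
        raise ValueError(f"inner dimensions must match: {n} != {n2}")
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(p)]
            for i in range(m)]

# The worked example from this section:
print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

Real code would use `numpy` (`A @ B`), which runs the same triple loop in optimised compiled code.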

A worked example

A = [ 1  2 ]    B = [ 5  6 ]
    [ 3  4 ]        [ 7  8 ]

Top-left: row 1 of A · col 1 of B = [1,2]·[5,7] = 1·5 + 2·7 = 19
Top-right: row 1 of A · col 2 of B = [1,2]·[6,8] = 1·6 + 2·8 = 22
Bottom-left: row 2 of A · col 1 of B = [3,4]·[5,7] = 3·5 + 4·7 = 43
Bottom-right: row 2 of A · col 2 of B = [3,4]·[6,8] = 3·6 + 4·8 = 50

A·B = [ 19  22 ]
      [ 43  50 ]

Showing AB ≠ BA with a counterexample:

Rotation A = [[0,-1],[1,0]], Scale B = [[2,0],[0,1]]
A·B = [[0,-1],[2,0]]   (scale first, then rotate)
B·A = [[0,-2],[1,0]]   (rotate first, then scale)
Different! ✗
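The same counterexample, checked in NumPy (matrix values taken from above; the variable names are illustrative):

```python
import numpy as np

rotate = np.array([[0, -1], [1, 0]])   # A: rotate 90° counter-clockwise
scale  = np.array([[2, 0], [0, 1]])    # B: stretch x by 2

AB = rotate @ scale   # apply scale first, then rotate
BA = scale @ rotate   # apply rotate first, then scale

print(AB)  # [[ 0 -1] [ 2  0]]
print(BA)  # [[ 0 -2] [ 1  0]]
print(np.array_equal(AB, BA))  # False — order matters
```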

Why this matters for ML

Neural networks are chains of matrix multiplications. The full forward pass is W₃·(W₂·(W₁·x)). By associativity this equals (W₃·W₂·W₁)·x — without nonlinearities, a 10-layer network collapses into a single matrix multiplication. This is exactly why activation functions (ReLU, etc.) are essential between layers — they break the chain.
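The collapse argument is easy to verify numerically. A small sketch with random weights (shapes and names are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4)
W1, W2, W3 = (rng.standard_normal((4, 4)) for _ in range(3))

# "Three-layer" forward pass, one multiplication at a time:
layer_by_layer = W3 @ (W2 @ (W1 @ x))

# By associativity, the whole linear network is one matrix:
W_collapsed = W3 @ W2 @ W1
collapsed = W_collapsed @ x

print(np.allclose(layer_by_layer, collapsed))  # True
```

Inserting a nonlinearity between layers, e.g. `W2 @ np.maximum(W1 @ x, 0)`, breaks this equality, which is the point of activation functions.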

GPUs exist for matrix multiplication. Modern deep learning hardware is optimised for one thing: doing billions of matrix multiplications quickly in parallel. The entire deep learning revolution is partly a story about fast hardware for this one operation.

Attention in Transformers (Q·Kᵀ) is a matrix multiplication: entry (i, j) of the result is the dot product of query i with key j — a similarity score measuring how much they agree.
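A shape-level sketch of the score computation (dimensions here are made up for illustration; this omits the scaling and softmax a real attention layer applies):

```python
import numpy as np

rng = np.random.default_rng(1)
Q = rng.standard_normal((3, 8))  # 3 queries, each of dimension 8
K = rng.standard_normal((5, 8))  # 5 keys, each of dimension 8

scores = Q @ K.T                 # shape (3, 5): scores[i, j] = query i · key j
print(scores.shape)              # (3, 5)
```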

The one thing to remember

A·B = apply B first, then A. Order matters — this is not like multiplying numbers.

← Previous 1.3 — Matrices as Transformations Next → 1.5 — Rank, Column Space & Null Space