
1.8 — Dot Product, Norms & Distance Metrics


Date: 2026-03-01 | Block: 1 — Linear Algebra

The idea in plain English

The dot product measures how much two vectors point in the same direction — it's a measure of alignment. Norms measure the size of a vector. Distance metrics measure how far apart two vectors are. Together, these tools let us measure relationships between vectors, which is the foundation of similarity, loss functions, and clustering.

The intuition

Dot product: imagine two people pulling a box. If they pull in exactly the same direction, their efforts fully add up (large dot product). If they pull at right angles, neither helps the other (dot product = 0). If they pull in opposite directions, they cancel out (negative dot product).

Norms: different norms measure "size" in different ways. The L2 norm (Euclidean) is the straight-line distance — like a bird flying. The L1 norm is the block-walking distance — like a taxi in a city grid.
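The taxi-versus-bird picture can be checked directly. A minimal sketch (assuming NumPy is available) comparing the two distances between the same pair of points:

```python
import numpy as np

# Two points on a city grid: 3 blocks east, 4 blocks north apart.
p = np.array([0.0, 0.0])
q = np.array([3.0, 4.0])

l2_dist = np.linalg.norm(q - p)         # bird's flight: straight line
l1_dist = np.linalg.norm(q - p, ord=1)  # taxi: total blocks walked

print(l2_dist)  # 5.0  (the 3-4-5 right triangle)
print(l1_dist)  # 7.0  (3 + 4 blocks)
```

The L1 distance is never smaller than the L2 distance: walking the grid can only be longer than flying straight.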

The shapes of unit balls (all points with norm = 1) show the difference intuitively:

L1 (diamond ◇)      L2 (circle ○)      L∞ (square □)

      *                  ***             *─────*
     / \                *   *            |     |
    *   *               *   *            |     |
     \ /                 ***             |     |
      *                                  *─────*

The L1 diamond has sharp corners on the axes — this is why L1 regularization produces sparse solutions (the optimum lands on a corner, zeroing out weights).

The math

Dot product (two equivalent forms):

a · b = a₁b₁ + a₂b₂ + ... + aₙbₙ        ← algebraic (sum of products)
a · b = ‖a‖ · ‖b‖ · cos(θ)               ← geometric (angle between them)

These always give the same number. θ is the angle between the vectors.

What the sign tells you:

- a·b > 0 → pointing broadly in the same direction (θ < 90°)
- a·b = 0 → perpendicular / orthogonal (θ = 90°) — share no direction at all
- a·b < 0 → pointing broadly opposite (θ > 90°)
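A short sketch (assuming NumPy) that verifies the two forms agree and demonstrates the three sign cases:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 1.0])

# Algebraic form: sum of elementwise products.
dot_ab = np.dot(a, b)  # 1*4 + 2*0 + 3*1 = 7

# Geometric form: recover θ, then check that ‖a‖‖b‖cos(θ) is the same number.
theta = np.arccos(dot_ab / (np.linalg.norm(a) * np.linalg.norm(b)))
geometric = np.linalg.norm(a) * np.linalg.norm(b) * np.cos(theta)
assert np.isclose(dot_ab, geometric)

# The sign tracks direction:
print(np.dot(a, 2 * a))                       # 28.0  positive: same direction
print(np.dot(a, np.array([-2.0, 1.0, 0.0])))  # 0.0   orthogonal
print(np.dot(a, -a))                          # -14.0 negative: opposite
```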

Norms:

L1: ‖v‖₁ = |v₁| + |v₂| + ... + |vₙ|       (sum of absolute values)
L2: ‖v‖₂ = √(v₁² + v₂² + ... + vₙ²)       (Euclidean length)
L∞: ‖v‖∞ = max(|v₁|, |v₂|, ..., |vₙ|)    (largest absolute value)

Cosine similarity (direction only, ignores magnitude):

cos_sim(a, b) = (a · b) / (‖a‖ · ‖b‖)   ∈ [−1, 1]
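Because cosine similarity divides out both norms, scaling a vector changes nothing. A minimal sketch (the `cos_sim` helper is defined here for illustration, assuming NumPy):

```python
import numpy as np

def cos_sim(a, b):
    """Cosine of the angle between a and b; ignores magnitude."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
print(cos_sim(a, 2 * a))  # 1.0  — same direction, different length
print(cos_sim(a, -a))     # -1.0 — opposite direction
```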

A worked example

a = [1, 2, 3],  b = [4, 0, 1]
a · b = 1·4 + 2·0 + 3·1 = 7   (positive → broadly same direction)

v = [3, -4]
‖v‖₁ = |3| + |−4| = 3 + 4 = 7
‖v‖₂ = √(9 + 16) = 5
‖v‖∞ = max(|3|, |−4|) = 4
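The same worked example, checked with NumPy's `linalg.norm` (the `ord` argument selects the norm):

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 0, 1])
print(np.dot(a, b))                   # 7

v = np.array([3, -4])
print(np.linalg.norm(v, ord=1))       # 7.0  L1: sum of absolute values
print(np.linalg.norm(v))              # 5.0  L2: Euclidean length (default)
print(np.linalg.norm(v, ord=np.inf))  # 4.0  L∞: largest absolute value
```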

Why this matters for ML

Attention mechanism in Transformers: Attention(Q,K,V) = softmax(Q·Kᵀ/√d)·V. The Q·Kᵀ computes dot products between every query and every key. A high dot product = high attention = "these tokens are related." The entire mechanism of attention is the dot product measuring directional alignment.
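A toy sketch of scaled dot-product attention in NumPy (the `softmax` and `attention` helpers and the random toy shapes are for illustration only, not a production implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along an axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: scores are query-key dot products."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)        # every query dotted with every key
    return softmax(scores, axis=-1) @ V  # weighted mix of the values

# Toy example: 2 queries, 3 keys/values, dimension 4.
rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
out = attention(Q, K, V)
print(out.shape)  # (2, 4): one output vector per query
```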

Word embeddings and cosine similarity: "King" and "Queen" have a high cosine similarity in embedding space — they point in similar directions even though their exact vectors differ. Cosine similarity is used because we care about the direction of meaning, not the magnitude.

L1 vs L2 regularization: L1 penalises ‖w‖₁ and produces sparse weights (many exactly zero, because the diamond corners land on axes). L2 penalises ‖w‖₂² and smoothly shrinks all weights but rarely reaches zero. The geometry of the unit ball explains the behavior.
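One concrete way to see the sparsity difference is a single shrinkage step per penalty. This is an illustrative sketch, not any particular library's optimizer: the L1 step is soft-thresholding (the proximal operator of ‖w‖₁), which snaps small weights to exactly zero, while the L2 step scales every weight toward zero without ever reaching it:

```python
import numpy as np

w = np.array([0.05, -0.3, 1.2, -0.02])
lam = 0.1  # penalty strength (hypothetical value for illustration)

# L1 proximal step (soft-thresholding): weights below lam become exactly 0.
w_l1 = np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

# L2 shrinkage: uniform scaling toward 0; nothing becomes exactly 0.
w_l2 = w / (1.0 + lam)

print(w_l1)  # the 0.05 and -0.02 entries are now exactly 0
print(w_l2)  # every entry shrunk, none exactly 0
```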

The one thing to remember

The dot product measures alignment. Zero dot product = completely perpendicular = no shared direction. This is the foundation of attention, similarity search, and regularization.
