Vectors and Vector Spaces - The Language of Embeddings

Reading time: ~22 minutes | Level: Mathematical Foundations → ML Engineering

A 512-dimensional vector represents a sentence. A 1,536-dimensional vector represents a document. The distance between two such vectors determines whether a RAG chatbot retrieves the right context or returns garbage.

Every semantic search engine, every recommendation system, every embedding-based retrieval pipeline lives or dies by vector arithmetic. If you do not know what a vector space is - if you treat vectors as "just arrays" - you cannot reason about why cosine similarity works where Euclidean distance sometimes fails, or why the embedding of "king" minus "man" plus "woman" lands close to the embedding of "queen."

This lesson builds the foundation. Not abstract mathematics - the mathematics that runs inside every production ML system you will build.

What You Will Learn

What a vector really is: algebraic definition vs. geometric intuition
What a vector space is and why the 8 axioms matter in ML
L1, L2, and L∞ norms: geometric meaning and ML use
Inner products: the algebraic heart of attention and cosine similarity
High-dimensional geometry surprises: why 3D intuition betrays you at 512 dimensions
NumPy: vector creation, norms, dot products, broadcasting
How embeddings, cosine similarity, and KNN connect to vector space theory

Prerequisites

Python and NumPy arrays (you can write np.array([1, 2, 3]))
Basic algebra (variables, functions)
No prior linear algebra required

Part 1 - What a Vector Really Is

Most engineers learn vectors as "arrays of numbers." This is true but incomplete. A vector has two interpretations that both matter:

The algebraic view

A vector is an ordered list of numbers called components or coordinates:

v = [v₁, v₂, ..., vₙ]

In Python and NumPy, this is exactly np.array([v1, v2, ..., vn]). The number of components is the dimension n.

The geometric view

A vector is a direction and magnitude in n-dimensional space. The vector [3, 4] points 3 units right and 4 units up. It has a magnitude (length) of 5.

   y
 4 ┤           ●  ← tip of vector [3, 4]
 3 ┤        /
 2 ┤     /        ← magnitude = √(3² + 4²) = 5
 1 ┤  /
 0 ┼───┬───┬───┬── x
   0   1   2   3

The geometric view is critical for ML: embeddings are not just arrays, they are points in a high-dimensional space where proximity means semantic similarity.
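To make the two views concrete, here is a minimal sketch that reads [3, 4] first as an array (components, dimension) and then as a geometric object (length, direction, angle). The variable names are illustrative only.

```python
import numpy as np

v = np.array([3.0, 4.0])

# Algebraic view: an ordered list of components
dimension = v.shape[0]                           # number of components

# Geometric view: a length and a direction
magnitude = np.linalg.norm(v)                    # sqrt(3^2 + 4^2)
direction = v / magnitude                        # unit vector pointing the same way
angle_deg = np.degrees(np.arctan2(v[1], v[0]))   # angle measured from the x-axis

print(dimension, magnitude, direction, round(float(angle_deg), 2))
```

The same array supports both readings: `dimension` and the components are what the code manipulates, while `magnitude` and `direction` are what the geometric arguments in this lesson talk about.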

Why both views matter in ML

The algebraic view lets you write code. The geometric view lets you reason:

Why is the embedding for "cat" close to the embedding for "kitten"? Because they are nearby points in vector space - their components are numerically similar.
Why does word2vec arithmetic work? Because vector subtraction and addition correspond to geometric operations (translation) in the embedding space.
Why does cosine similarity measure "semantic similarity"? Because it measures the angle between two vectors - vectors pointing in similar directions represent similar concepts, regardless of their magnitude.

Part 2 - Vector Spaces: The 8 Axioms (and Why They Matter)

A vector space (over the real numbers) is a set V of objects (vectors) with two operations:

Vector addition: v + w
Scalar multiplication: αv (where α is a real number)

...satisfying 8 axioms.

The 8 axioms

Let u, v, w ∈ V and α, β ∈ ℝ:

#	Axiom	What it means
1	v + w = w + v	Addition is commutative
2	(u + v) + w = u + (v + w)	Addition is associative
3	∃ 0 such that v + 0 = v	Zero vector exists
4	∃ -v such that v + (-v) = 0	Additive inverse exists
5	1·v = v	Scalar identity
6	α(βv) = (αβ)v	Scalar multiplication associativity
7	α(v + w) = αv + αw	Distributivity over vector addition
8	(α + β)v = αv + βv	Distributivity over scalar addition

Why do these axioms matter for ML?

These axioms are what allow us to do algebra with embeddings.

When word2vec gives you king - man + woman ≈ queen, this works because:

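king, man, woman, queen are all vectors in the same vector space ℝ³⁰⁰
Vector subtraction (king - man) means adding the additive inverse of man, which axiom 4 guarantees exists
Vector addition (+ woman) is the space's addition operation; axioms 1-2 say the order and grouping of the terms do not matter
The result lands in the same space where queen lives, because a vector space is closed under addition and scalar multiplication (closure is part of the definition of the two operations, not one of the 8 listed axioms)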

If embeddings did not live in a proper vector space, this arithmetic would not have geometric meaning.
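The analogy arithmetic can be sketched with hand-built toy vectors. Everything here is hypothetical: the two axes ("royalty" and "gender"), the vocabulary, and the numbers are invented for illustration and are not real word2vec embeddings, where the relation only holds approximately.

```python
import numpy as np

# Hypothetical 2-D embeddings: axis 0 = "royalty", axis 1 = "gender"
vocab = {
    "king":   np.array([1.0,  1.0]),
    "queen":  np.array([1.0, -1.0]),
    "man":    np.array([0.0,  1.0]),
    "woman":  np.array([0.0, -1.0]),
    "prince": np.array([0.8,  0.9]),
    "girl":   np.array([-0.2, -1.0]),
}

# king - man + woman, computed component-wise
target = vocab["king"] - vocab["man"] + vocab["woman"]

def nearest(vec, vocab, exclude=()):
    """Return the vocabulary word whose vector is closest (L2) to vec."""
    candidates = {w: e for w, e in vocab.items() if w not in exclude}
    return min(candidates, key=lambda w: np.linalg.norm(candidates[w] - vec))

# Analogy lookups conventionally exclude the three input words
answer = nearest(target, vocab, exclude=("king", "man", "woman"))
print(target, answer)
```

In this toy space the result matches "queen" exactly; with learned embeddings the target vector is only near the answer, which is why a nearest-neighbor lookup is needed.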

:::tip Why ℝⁿ always satisfies all 8 axioms

The standard ML embedding space ℝⁿ (n-dimensional real-number vectors) satisfies all 8 axioms automatically. NumPy implements the two operations for you - + is vector addition and * with a scalar is scalar multiplication - and the axioms hold up to floating-point rounding. The reason we study the axioms is to recognize other vector spaces (function spaces, polynomial spaces) and to understand which operations are valid.

:::
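As a sanity check, the 8 axioms can be verified numerically for random vectors in ℝ⁵. This is a sketch rather than a proof: it tests one random sample, and it uses `np.allclose` because floating-point arithmetic satisfies the axioms only up to rounding.

```python
import numpy as np

rng = np.random.default_rng(0)
u, v, w = rng.standard_normal((3, 5))   # three random vectors in R^5
alpha, beta = 2.5, -0.75
zero = np.zeros(5)

axiom_checks = {
    "1 commutativity":            np.allclose(v + w, w + v),
    "2 associativity":            np.allclose((u + v) + w, u + (v + w)),
    "3 zero vector":              np.allclose(v + zero, v),
    "4 additive inverse":         np.allclose(v + (-v), zero),
    "5 scalar identity":          np.allclose(1 * v, v),
    "6 scalar associativity":     np.allclose(alpha * (beta * v), (alpha * beta) * v),
    "7 distributivity (vectors)": np.allclose(alpha * (v + w), alpha * v + alpha * w),
    "8 distributivity (scalars)": np.allclose((alpha + beta) * v, alpha * v + beta * v),
}

for name, ok in axiom_checks.items():
    print(f"axiom {name}: {ok}")
```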

Vector subspaces

A subspace of V is a subset W ⊆ V that is itself a vector space. The key test: W is a subspace if and only if:

0 ∈ W (contains zero vector)
If v, w ∈ W, then v + w ∈ W (closed under addition)
If v ∈ W and α ∈ ℝ, then αv ∈ W (closed under scalar multiplication)

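ML relevance: The span of the columns of a layer's weight matrix W (its column space) is a subspace, and every output Wx of the linear part of the layer lies in it. The input directions that W maps to zero form another subspace, the kernel; any component of the input lying in the kernel is destroyed by the layer. This is the geometric meaning of rank (covered in Lesson 02) and kernel (Lesson 05).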
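A small sketch of the subspace test and of information loss in a linear layer. The 2×3 weight matrix and the vectors are hypothetical values chosen so that the kernel is easy to see by hand.

```python
import numpy as np

# Hypothetical layer weights: maps R^3 inputs to R^2 outputs
W = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])

k = np.array([1.0, 1.0, -1.0])   # W @ k = 0, so k lies in the kernel of W
x = np.array([2.0, 3.0, 5.0])

def in_kernel(vec: np.ndarray) -> bool:
    """True if W maps vec to the zero vector."""
    return bool(np.allclose(W @ vec, 0.0))

# Subspace test: the zero vector, sums, and scalar multiples all stay in the kernel
kernel_is_closed = in_kernel(np.zeros(3)) and in_kernel(k + 2 * k) and in_kernel(-4.0 * k)

# Adding any kernel vector to the input leaves the layer output unchanged
y_original = W @ x
y_shifted = W @ (x + 10.0 * k)

print(np.linalg.matrix_rank(W), kernel_is_closed, y_original, y_shifted)
```

Because `y_original` and `y_shifted` agree, nothing downstream of this layer can tell `x` apart from `x + 10 * k`: the kernel component of the input is gone.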

Part 3 - Norms: Measuring Vector Size

A norm assigns a non-negative real number to every vector, measuring its "size" or "length." Different norms create different geometries, and different geometries produce different ML behaviors.

The three most important norms

L1 norm (Manhattan distance):

‖v‖₁ = |v₁| + |v₂| + ... + |vₙ|

Interpretation: sum of absolute values of all components. If you think of each component as blocks in a city grid, this is the taxi distance.

L2 norm (Euclidean distance):

‖v‖₂ = √(v₁² + v₂² + ... + vₙ²)

Interpretation: the straight-line distance from the origin to the tip of the vector. This is the everyday notion of "length" (not to be confused with Python's len(), which returns the number of components).

L∞ norm (Max norm, Chebyshev):

‖v‖∞ = max(|v₁|, |v₂|, ..., |vₙ|)

Interpretation: the largest absolute component. Used in minimax problems and certain robotics/control applications.

Geometric visualization of unit balls

The "unit ball" is the set of all vectors with norm ≤ 1. The shape of this ball reveals the geometry of the norm:

L1 unit ball (diamond):        L2 unit ball (circle):
        (0,1)                         (0,1)
          │                          /     \
    (-1,0)┼──(1,0)             (-1,0)       (1,0)
          │                          \     /
        (0,-1)                        (0,-1)

L∞ unit ball (square):
    (-1,1)───(1,1)
       │         │
    (-1,-1)──(1,-1)

Why this matters enormously for ML: The shape of the norm ball determines the shape of the regularization constraint region. L1 regularization (Lasso) has corners at the axes - when the loss function's level sets touch the constraint, they most often touch at a corner, which corresponds to a sparse solution (many weights = 0). L2 regularization (Ridge) has a smooth sphere - solutions are pushed toward zero but rarely exactly zero. (Full treatment in Lesson 08.)
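A one-dimensional special case makes the contrast concrete. For the problem of minimizing ½(w − a)² plus a penalty, the L1 penalty λ|w| has the closed-form solution sign(a)·max(|a| − λ, 0) (soft-thresholding), while the L2 penalty (λ/2)w² gives a/(1 + λ). The sketch below applies both formulas component-wise; the weight vector `a` and the strength `lam` are illustrative values, not taken from any model.

```python
import numpy as np

a = np.array([3.0, -0.4, 0.2, -2.0])   # hypothetical unregularized weights
lam = 0.5                               # regularization strength (illustrative)

# L1 penalty: soft-thresholding sets small components exactly to zero
w_l1 = np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

# L2 penalty: proportional shrinkage, never exactly zero unless a is zero
w_l2 = a / (1.0 + lam)

print("L1:", w_l1, "zeros:", int(np.sum(w_l1 == 0)))
print("L2:", w_l2, "zeros:", int(np.sum(w_l2 == 0)))
```

The two components with |a| below `lam` become exactly zero under L1, while L2 only scales every component by the same factor.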

Norms in NumPy

import numpy as np

v = np.array([3.0, -4.0, 0.0, 1.0])

# L1 norm
l1 = np.linalg.norm(v, ord=1)      # 3 + 4 + 0 + 1 = 8.0
l1_manual = np.sum(np.abs(v))       # same result

# L2 norm (default)
l2 = np.linalg.norm(v)             # √(9 + 16 + 0 + 1) = √26 ≈ 5.099
l2_manual = np.sqrt(np.sum(v**2))  # same result

# L∞ norm
linf = np.linalg.norm(v, ord=np.inf)  # max(3, 4, 0, 1) = 4.0
linf_manual = np.max(np.abs(v))        # same result

print(f"L1: {l1:.3f}")   # 8.000
print(f"L2: {l2:.3f}")   # 5.099
print(f"L∞: {linf:.3f}") # 4.000

Normalizing vectors

A unit vector has L2 norm = 1. Normalizing divides by the L2 norm:

# Normalize a vector to unit length
v = np.array([3.0, 4.0])
v_normalized = v / np.linalg.norm(v)  # [0.6, 0.8]

# Verify: norm of normalized vector = 1
print(np.linalg.norm(v_normalized))  # 1.0

# For a batch of embedding vectors (rows are embeddings)
embeddings = np.random.randn(1000, 512)  # 1000 embeddings, 512-dim
norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
normalized_embeddings = embeddings / norms  # each row has L2 norm = 1

Part 4 - Inner Products and the Angle Between Vectors

The inner product (or dot product) of two vectors u and v in ℝⁿ is:

u · v = u₁v₁ + u₂v₂ + ... + uₙvₙ = Σᵢ uᵢvᵢ

It also equals:

u · v = ‖u‖₂ · ‖v‖₂ · cos(θ)

where θ is the angle between the vectors.

This is the most important formula in this lesson. It connects algebraic computation (sum of products) to geometry (angle between directions).

What the dot product encodes

Value of u·v	Geometric meaning	Example
u·v > 0	Angle θ < 90°, vectors point in similar direction	Similar embeddings
u·v = 0	Angle θ = 90°, vectors are perpendicular (orthogonal)	Unrelated concepts
u·v < 0	Angle θ > 90°, vectors point in opposing directions	Opposed features (real antonym embeddings often still score positive, because antonyms share contexts)
u·v = ‖u‖·‖v‖	θ = 0°, vectors are parallel (same direction)	Identical meaning

Cosine similarity

Cosine similarity normalizes the dot product by the magnitudes:

cos_sim(u, v) = (u · v) / (‖u‖₂ · ‖v‖₂)

This gives a value in [-1, 1] that measures directional alignment independent of vector length.

Why use cosine similarity for embeddings?

In NLP, a document repeated three times should have the same meaning as the original document. With count-based features (word occurrences), however, its vector is three times as long. Cosine similarity is invariant to this scaling because it divides by the magnitudes. L2 distance is not scale-invariant and would treat the repeated document as different from the original.

import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Compute cosine similarity between two vectors."""
    dot_product = np.dot(u, v)
    magnitude_u = np.linalg.norm(u)
    magnitude_v = np.linalg.norm(v)

    if magnitude_u == 0 or magnitude_v == 0:
        return 0.0  # Handle zero vectors

    return dot_product / (magnitude_u * magnitude_v)

# Simulate word embeddings (in practice, these come from a model)
king = np.array([0.5, 0.3, 0.8, -0.2, 0.6])
queen = np.array([0.4, 0.4, 0.7, -0.1, 0.7])  # similar direction
random_vec = np.array([0.1, -0.9, 0.0, 0.5, -0.3])  # different direction

print(f"cos_sim(king, queen) = {cosine_similarity(king, queen):.4f}")  # high
print(f"cos_sim(king, random) = {cosine_similarity(king, random_vec):.4f}")  # low

# For batch operations: compute all pairwise similarities efficiently
def cosine_similarity_matrix(embeddings: np.ndarray) -> np.ndarray:
    """
    Compute all pairwise cosine similarities for a batch of embeddings.

    Args:
        embeddings: (n, d) array of n embeddings of dimension d
    Returns:
        (n, n) similarity matrix
    """
    # Normalize each embedding to unit length
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    normalized = embeddings / norms
    # Dot product of unit vectors = cosine similarity
    return normalized @ normalized.T  # (n, n) matrix

embeddings = np.random.randn(100, 512)
sim_matrix = cosine_similarity_matrix(embeddings)
print(f"Similarity matrix shape: {sim_matrix.shape}")  # (100, 100)
print(f"Diagonal (self-similarity): {sim_matrix[0, 0]:.6f}")  # ≈ 1.0
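The scale-invariance argument for repeated documents can be checked directly. The count vector below is a hypothetical bag-of-words representation, used only to illustrate the effect of scaling.

```python
import numpy as np

doc = np.array([2.0, 0.0, 1.0, 3.0])   # hypothetical word counts for one document
doc_x3 = 3 * doc                        # the same document repeated three times

cos = np.dot(doc, doc_x3) / (np.linalg.norm(doc) * np.linalg.norm(doc_x3))
l2 = np.linalg.norm(doc - doc_x3)

print(f"cosine similarity: {cos:.4f}")  # direction is unchanged by scaling
print(f"L2 distance:       {l2:.4f}")   # equals 2 * ||doc||, grows with the scale factor
```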

The Cauchy-Schwarz inequality

The inner product satisfies: |u · v| ≤ ‖u‖₂ · ‖v‖₂

This is why cosine similarity is bounded in [-1, 1] - we are dividing by the maximum possible value of |u·v|.
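A quick numerical check of the inequality on random vectors, plus the equality case for parallel vectors. This is an empirical sketch with arbitrary sizes and seed, not a proof.

```python
import numpy as np

rng = np.random.default_rng(42)
U = rng.standard_normal((1000, 64))
V = rng.standard_normal((1000, 64))

dots = np.sum(U * V, axis=1)                                   # row-wise dot products
bounds = np.linalg.norm(U, axis=1) * np.linalg.norm(V, axis=1)

# |u . v| <= ||u|| ||v|| for every pair (small tolerance for rounding)
holds = bool(np.all(np.abs(dots) <= bounds + 1e-12))
cosines = dots / bounds

# Equality case: (anti)parallel vectors reach the bound
u = U[0]
parallel_gap = abs(np.dot(u, -2.0 * u)) - np.linalg.norm(u) * np.linalg.norm(-2.0 * u)

print(holds, cosines.min(), cosines.max(), parallel_gap)
```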

Part 5 - High-Dimensional Geometry Surprises

Human intuition is built for 2D and 3D. ML embeddings live in 512D, 768D, 1536D. The geometry at these scales is radically different from what your intuition expects.

Surprise 1: All points are far apart

In 3 dimensions, if you place 1000 random points in the unit cube, many pairs will be close together. In 512 dimensions, random points are almost all approximately the same distance from each other.

The expected L2 distance between two random points in ℝⁿ (each sampled from N(0,1)) grows as √(2n):

import numpy as np

for n_dims in [2, 10, 100, 512, 1536]:
    # Sample two random points
    n_trials = 10000
    distances = []
    for _ in range(n_trials):
        u = np.random.randn(n_dims)
        v = np.random.randn(n_dims)
        distances.append(np.linalg.norm(u - v))

    expected = np.sqrt(2 * n_dims)
    actual_mean = np.mean(distances)
    actual_std = np.std(distances)

    print(f"d={n_dims:5d}: expected≈{expected:.1f}, "
          f"actual={actual_mean:.1f}±{actual_std:.2f}, "
          f"relative_std={actual_std/actual_mean:.4f}")

Typical output, rounded (digits vary slightly from run to run because the points are random):

d=    2: expected≈2.0, actual≈1.8±0.93, relative_std≈0.52
d=   10: expected≈4.5, actual≈4.4±0.99, relative_std≈0.23
d=  100: expected≈14.1, actual≈14.1±1.00, relative_std≈0.07
d=  512: expected≈32.0, actual≈32.0±1.00, relative_std≈0.03
d= 1536: expected≈55.4, actual≈55.4±1.00, relative_std≈0.02

The absolute spread of the distances stays close to 1 in every dimension while the mean grows like √(2n), so the relative standard deviation shrinks roughly as 1/√(2n). At 512 dimensions, pairwise distances between random points differ by only a few percent. This is why exact nearest neighbor search becomes difficult in high dimensions - there is little distance contrast.

Surprise 2: Volume concentrates at the surface

Most of the volume of a high-dimensional sphere is in a thin shell near the surface:

Fraction of volume within ε of surface: 1 - (1-ε/r)ⁿ

For n=512, ε/r=0.01 (1% shell): 1 - 0.99^512 ≈ 1 - 0.006 ≈ 0.994

This means: if you sample random points from a high-dimensional ball, almost all of them are near the surface. The "interior" is essentially empty.
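The shell formula follows from the fact that the volume of an n-ball scales as rⁿ, so the inner ball of radius r − ε holds a fraction (1 − ε/r)ⁿ of the total. A minimal sketch evaluating it across dimensions (the function name is illustrative):

```python
def shell_fraction(n_dims: int, rel_thickness: float) -> float:
    """Fraction of an n-ball's volume within rel_thickness * r of its surface."""
    return 1.0 - (1.0 - rel_thickness) ** n_dims

for n in [2, 3, 10, 100, 512, 1536]:
    print(f"n={n:5d}: {shell_fraction(n, 0.01):.4f} of the volume is in the outer 1% shell")
```

In 3 dimensions the outer 1% shell holds about 3% of the volume; at 512 dimensions it holds more than 99%.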

ML implication: Random initialization of neural network weights places them on approximately a sphere in high-dimensional weight space. Gradient descent moves them through a landscape that looks very different from our 3D intuition.

Surprise 3: Random vectors are nearly orthogonal

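In ℝ², the angle between two random vectors averages 90° by symmetry, but individual angles are spread uniformly between 0° and 180°. In ℝ⁵¹², the average is still 90°, and now almost every pair is within a few degrees of it: random vectors are nearly orthogonal, and there are exponentially many nearly-orthogonal directions available.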

import numpy as np

def average_angle(n_dims: int, n_samples: int = 1000) -> float:
    """Compute average angle between pairs of random unit vectors."""
    u = np.random.randn(n_samples, n_dims)
    v = np.random.randn(n_samples, n_dims)
    u /= np.linalg.norm(u, axis=1, keepdims=True)
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    cos_angles = np.sum(u * v, axis=1)
    angles_deg = np.degrees(np.arccos(np.clip(cos_angles, -1, 1)))
    return float(np.mean(angles_deg))

for d in [2, 10, 100, 512]:
    angle = average_angle(d)
    print(f"d={d:4d}: average angle = {angle:.2f}°")  # All near 90°

ML implication: High-dimensional embeddings have more capacity than you might expect - you can pack many nearly-orthogonal concepts into the same space. This is why 512D embeddings can encode semantic distinctions between millions of concepts.
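The mean angle between random vectors is about 90° in every dimension, so the mean alone does not show the effect. What changes is the spread around 90°, which this sketch measures. The function name, sample size, and seed are arbitrary choices.

```python
import numpy as np

def angle_stats(n_dims: int, n_samples: int = 2000, seed: int = 0) -> tuple[float, float]:
    """Mean and standard deviation (degrees) of the angle between random vector pairs."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal((n_samples, n_dims))
    v = rng.standard_normal((n_samples, n_dims))
    cos = np.sum(u * v, axis=1) / (np.linalg.norm(u, axis=1) * np.linalg.norm(v, axis=1))
    angles = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
    return float(np.mean(angles)), float(np.std(angles))

for d in [2, 10, 100, 512]:
    mean_deg, std_deg = angle_stats(d)
    print(f"d={d:4d}: mean angle = {mean_deg:6.2f}°, std = {std_deg:5.2f}°")
```

In 2 dimensions the standard deviation is roughly 50°, so a random pair can easily be almost parallel; at 512 dimensions it drops to a few degrees, which is the sense in which random high-dimensional vectors are "nearly orthogonal."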

The curse of dimensionality

The "curse of dimensionality" refers to how many phenomena break in high dimensions:

Nearest neighbor distances become meaningless (low contrast)
Data becomes sparse: exponentially more data needed to fill the space
Euclidean distance becomes less informative than angular distance

:::danger KNN in high dimensions

Exact K-Nearest Neighbors (KNN) loses discriminative power in high dimensions when the data has little structure, because all points become approximately equidistant. Brute-force search is also O(n·d) per query, which is why vector databases and libraries (Pinecone, Weaviate, Faiss) use approximate nearest neighbor algorithms. They often work with cosine similarity (angular distance) rather than raw Euclidean distance, since it ignores vector magnitude; on unit-normalized vectors the two metrics produce the same ranking.

:::

Part 6 - NumPy: Vector Operations for ML

import numpy as np

# ── Vector creation ────────────────────────────────────────────────────────
# Standard vector
v = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Zero vector (often needed as baseline)
zero = np.zeros(512)

# Random unit vector (useful for initialization)
random_vec = np.random.randn(512)
random_unit = random_vec / np.linalg.norm(random_vec)

# ── Basic arithmetic ───────────────────────────────────────────────────────
u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])

addition = u + v           # [5, 7, 9]
subtraction = u - v        # [-3, -3, -3]
scalar_mult = 3 * u        # [3, 6, 9]

# ── Norms ──────────────────────────────────────────────────────────────────
l1 = np.linalg.norm(u, ord=1)        # 6.0
l2 = np.linalg.norm(u)               # √14 ≈ 3.742
linf = np.linalg.norm(u, ord=np.inf) # 3.0

# ── Dot product ────────────────────────────────────────────────────────────
dot = np.dot(u, v)                   # 1*4 + 2*5 + 3*6 = 32
dot_alt = u @ v                      # same, @ is the matmul operator
dot_manual = np.sum(u * v)           # same

# ── Angle between vectors ──────────────────────────────────────────────────
cos_theta = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
theta_radians = np.arccos(np.clip(cos_theta, -1, 1))
theta_degrees = np.degrees(theta_radians)
print(f"Angle between u and v: {theta_degrees:.2f}°")

# ── Cosine similarity for embeddings ───────────────────────────────────────
def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Production-ready cosine similarity with zero-vector guard."""
    a_norm = np.linalg.norm(a)
    b_norm = np.linalg.norm(b)
    if a_norm < 1e-10 or b_norm < 1e-10:
        return 0.0
    return float(np.dot(a, b) / (a_norm * b_norm))

# ── Broadcasting: batch operations ────────────────────────────────────────
# Query embedding vs. 10,000 document embeddings
query = np.random.randn(512)
documents = np.random.randn(10000, 512)

# Normalize all at once
query_norm = query / np.linalg.norm(query)
doc_norms = np.linalg.norm(documents, axis=1, keepdims=True)
docs_normalized = documents / doc_norms

# Compute all 10,000 cosine similarities in one operation
similarities = docs_normalized @ query_norm  # shape: (10000,)

# Find top-5 most similar documents
top5_indices = np.argsort(similarities)[-5:][::-1]
print(f"Top-5 document indices: {top5_indices}")
print(f"Top-5 similarities: {similarities[top5_indices]}")

Broadcasting rules (essential for ML)

# Broadcasting: NumPy stretches size-1 (or missing) dimensions automatically
# Rule: shapes are compared from the trailing (rightmost) dimension;
# two dimensions are compatible if they are equal or one of them is 1

query = np.random.randn(512)          # shape: (512,)
docs = np.random.randn(100, 512)      # shape: (100, 512)

# Element-wise subtraction with broadcasting
# query (512,) → broadcasts to (100, 512)
differences = docs - query            # shape: (100, 512)

# L2 distance from query to each document
l2_distances = np.linalg.norm(differences, axis=1)  # shape: (100,)

# This is equivalent to, but much faster than:
l2_distances_slow = np.array([
    np.linalg.norm(doc - query) for doc in docs
])

# Verify equivalence
assert np.allclose(l2_distances, l2_distances_slow)
print("Broadcasting and loop give identical results ✓")
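The same rule extends to many queries at once. The sketch below inserts size-1 axes with `None` so that a `(3, 1, 8)` array broadcasts against a `(1, 5, 8)` array, giving all query-document distances without a Python loop. The shapes and seed are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
queries = rng.standard_normal((3, 8))    # shape: (3, 8)
docs = rng.standard_normal((5, 8))       # shape: (5, 8)

# (3, 1, 8) - (1, 5, 8): shapes align from the right, size-1 axes are stretched
diffs = queries[:, None, :] - docs[None, :, :]   # shape: (3, 5, 8)
pairwise = np.linalg.norm(diffs, axis=2)          # shape: (3, 5)

# Reference result computed with explicit loops
pairwise_loop = np.array([[np.linalg.norm(q - d) for d in docs] for q in queries])

print(diffs.shape, pairwise.shape, np.allclose(pairwise, pairwise_loop))
```

One design caveat: the intermediate `diffs` array has one entry per (query, document, dimension) triple, so for large batches this approach trades memory for speed.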

Part 7 - ML Connections: Where Vectors Appear

Embeddings as vectors

Every modern NLP and vision model maps its inputs to vectors:

# Conceptual: what an embedding model does
import numpy as np

# OpenAI text-embedding-3-small produces 1536-dim vectors (by default)
# Voyage AI models, which Anthropic's docs point to for embeddings (Anthropic
# does not offer an embedding model of its own), commonly produce 1024-dim vectors
# sentence-transformers all-MiniLM-L6-v2 produces 384-dim vectors

def retrieve_top_k(
    query_embedding: np.ndarray,
    document_embeddings: np.ndarray,
    k: int = 5
) -> tuple[np.ndarray, np.ndarray]:
    """
    RAG retrieval: find top-k documents by cosine similarity.

    This is the core of every RAG system. The vector algebra is:
    1. Normalize all embeddings to unit length
    2. Compute dot products (= cosine similarity for unit vectors)
    3. Return top-k by score
    """
    # Normalize
    q_norm = query_embedding / np.linalg.norm(query_embedding)
    d_norms = np.linalg.norm(document_embeddings, axis=1, keepdims=True)
    d_normalized = document_embeddings / d_norms

    # Cosine similarities: matrix-vector multiplication
    # Shape: (n_docs, dim) @ (dim,) = (n_docs,)
    scores = d_normalized @ q_norm

    # Top-k indices
    top_k = np.argsort(scores)[-k:][::-1]
    return top_k, scores[top_k]

KNN: distance metrics matter

import numpy as np

def knn_predict(
    train_X: np.ndarray,
    train_y: np.ndarray,
    test_X: np.ndarray,
    k: int = 5,
    metric: str = 'euclidean'
) -> np.ndarray:
    """
    K-Nearest Neighbors with different distance metrics.

    In low dimensions: Euclidean distance works well.
    In high dimensions (embeddings): cosine similarity often better.
    """
    predictions = []

    for test_point in test_X:
        if metric == 'euclidean':
            # L2 distance: ‖x - y‖₂
            distances = np.linalg.norm(train_X - test_point, axis=1)
            nearest = np.argsort(distances)[:k]

        elif metric == 'cosine':
            # Cosine similarity: larger means closer, so take the k largest
            test_norm = test_point / np.linalg.norm(test_point)
            train_norms = train_X / np.linalg.norm(train_X, axis=1, keepdims=True)
            similarities = train_norms @ test_norm
            nearest = np.argsort(similarities)[-k:][::-1]

        elif metric == 'manhattan':
            # L1 distance: Σ|xᵢ - yᵢ|
            distances = np.linalg.norm(train_X - test_point, ord=1, axis=1)
            nearest = np.argsort(distances)[:k]

        else:
            # Fail loudly instead of hitting an undefined `nearest` below
            raise ValueError(f"Unknown metric: {metric!r}")

        # Majority vote among k nearest neighbors
        neighbor_labels = train_y[nearest]
        prediction = np.bincount(neighbor_labels).argmax()
        predictions.append(prediction)

    return np.array(predictions)

Part 8 - Common Failure Modes and Engineering Red Flags

:::danger Do not mix L2 distance and cosine similarity

In a vector database or embedding search system, you must use the same distance metric at index time and at query time. If you index for one metric but query with the other (cosine vs. L2 on unnormalized vectors), you get no error, just silently wrong rankings.

# Illustrative pseudocode: `index` and its `metric` argument are placeholders,
# not the API of a specific library
# WRONG: inconsistent metrics
index.add(embeddings)  # indexed with L2
results = index.search(query, k=5, metric='cosine')  # queried with cosine

# RIGHT: consistent metrics - always normalize if using cosine
normalized_embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
index.add(normalized_embeddings)
# Normalize the query the same way before searching.
# Squared L2 distance on unit vectors == 2*(1 - cosine_similarity),
# so an L2 search over normalized vectors returns the cosine-nearest neighbors

:::
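The relationship between L2 distance and cosine similarity on unit vectors follows from expanding ‖u − v‖² = ‖u‖² + ‖v‖² − 2u·v with ‖u‖ = ‖v‖ = 1. A minimal sketch on random data (sizes and seed are arbitrary) confirms both the identity and the matching rankings:

```python
import numpy as np

rng = np.random.default_rng(1)
docs = rng.standard_normal((200, 64))
query = rng.standard_normal(64)

# Normalize documents and query to unit length
docs_unit = docs / np.linalg.norm(docs, axis=1, keepdims=True)
query_unit = query / np.linalg.norm(query)

cos_sim = docs_unit @ query_unit                        # higher = closer
sq_l2 = np.sum((docs_unit - query_unit) ** 2, axis=1)   # lower = closer

identity_holds = bool(np.allclose(sq_l2, 2.0 * (1.0 - cos_sim)))
top5_by_cosine = np.argsort(-cos_sim)[:5]
top5_by_l2 = np.argsort(sq_l2)[:5]

print(identity_holds, top5_by_cosine, top5_by_l2)
```

Note that the identity involves the squared distance; since squaring preserves order for non-negative values, plain L2 distance yields the same ranking.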

:::danger Zero vectors are silent bugs

A zero vector has no direction. Computing cosine similarity with a zero vector gives 0/0 (NaN). This can happen when:

An embedding model returns zeros for empty or corrupted input
You accidentally divide by a norm of 0 during normalization

Always guard against zero vectors in production:

def safe_normalize(v: np.ndarray, eps: float = 1e-10) -> np.ndarray:
    """Normalize with zero-vector guard."""
    norm = np.linalg.norm(v)
    if norm < eps:
        # Return zero vector rather than NaN
        return np.zeros_like(v)
    return v / norm

:::

:::tip Use pre-normalized embeddings in vector databases

Many vector databases (Pinecone, Weaviate, Qdrant) support cosine similarity natively. But they often compute it more efficiently by storing normalized vectors and using dot product search. Pre-normalize your embeddings before inserting them to avoid re-normalization overhead at query time.

:::

:::tip For text embeddings, prefer cosine similarity over L2 distance

L2 distance is sensitive to vector magnitude. Cosine similarity is not. For text embeddings, where the same concept might have different magnitude depending on how the model scales its outputs, cosine similarity is almost always the right choice.

:::

Interview Questions

Q1: What is the geometric meaning of the dot product?

The dot product u·v = ‖u‖·‖v‖·cos(θ) where θ is the angle between the vectors.

Geometrically, it equals the product of:

The signed length of the projection of u onto v (negative when the angle exceeds 90°)
The length of v

The roles of u and v can be swapped.

When u·v > 0: vectors point in similar directions (θ < 90°)
When u·v = 0: vectors are orthogonal/perpendicular (θ = 90°)
When u·v < 0: vectors point in opposite directions (θ > 90°)

In ML, the dot product in the attention mechanism QKᵀ computes how much each query "aligns" with each key - high dot product = high attention weight.
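A minimal sketch of how QKᵀ turns dot products into attention weights. The division by √d and the row-wise softmax are the standard scaled dot-product attention form and are an assumption here, since the answer above only mentions QKᵀ; the matrices are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
n_queries, n_keys, d = 2, 4, 8
Q = rng.standard_normal((n_queries, d))   # one row per query
K = rng.standard_normal((n_keys, d))      # one row per key

# One dot product per (query, key) pair, scaled by sqrt(d)
scores = Q @ K.T / np.sqrt(d)             # shape: (2, 4)

# Row-wise softmax, shifted by the row max for numerical stability
shifted = scores - scores.max(axis=1, keepdims=True)
weights = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

print(weights.round(3))
print(weights.sum(axis=1))                # each row sums to 1
```

Because softmax is monotonic, the key with the largest dot product for a given query always receives the largest weight in that row.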

Q2: Why does the L1 norm induce sparsity in regularized models?

Visualize the L1 constraint region in 2D: it's a diamond (rotated square) with corners at (±C, 0) and (0, ±C).

During optimization with L1 regularization, the loss function's level sets (ellipses) expand from the minimum until they touch the L1 constraint boundary. Because the L1 boundary is a diamond with corners on the coordinate axes, the first contact point is most likely at a corner - where one coordinate is zero (sparse solution).

The L2 constraint region is a smooth circle. Its boundary has no corners, so the level sets touch it at a smooth point where no coordinate is forced to exactly zero.

This is why Lasso (L1 regularization) produces sparse models (many weights exactly 0) while Ridge (L2 regularization) produces small but non-zero weights.

Q3: Why do high-dimensional spaces break nearest neighbor search?

In high dimensions (the "curse of dimensionality"):

Distance concentration: As dimensionality grows, the ratio (max_distance - min_distance) / min_distance → 0. All points appear nearly equidistant, making nearest neighbor meaningless.
Exponential data sparsity: To maintain the same density, you need exponentially more data as dimensions increase. Even if each axis is split into just two bins, there are 2^d cells, so n training points give on the order of n/2^d points per cell.
Consequence for ML: KNN with Euclidean distance fails in high-dimensional embedding spaces (512D, 1536D). Solutions include:
- Use cosine similarity (more robust to concentration)
- Use approximate nearest neighbor algorithms (HNSW, IVF)
- Reduce dimensionality with PCA first
- Use learned similarity metrics
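The distance-concentration ratio can be measured on synthetic data. This sketch uses unstructured Gaussian points, which is the worst case; real embeddings have structure and usually retain more contrast. The function name, point count, and seed are illustrative.

```python
import numpy as np

def relative_contrast(n_dims: int, n_points: int = 1000, seed: int = 0) -> float:
    """(max_distance - min_distance) / min_distance from one random query to random points."""
    rng = np.random.default_rng(seed)
    points = rng.standard_normal((n_points, n_dims))
    query = rng.standard_normal(n_dims)
    distances = np.linalg.norm(points - query, axis=1)
    return float((distances.max() - distances.min()) / distances.min())

for d in [2, 10, 100, 512, 1536]:
    print(f"d={d:5d}: relative contrast = {relative_contrast(d):.2f}")
```

In low dimensions the nearest point is many times closer than the farthest one; in hundreds of dimensions the ratio falls well below 1, meaning the "nearest" neighbor is only slightly nearer than everything else.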

Q4: What is a vector space, and why do embeddings live in one?

A vector space is a set with addition and scalar multiplication satisfying 8 axioms (closure, associativity, commutativity, identity, inverse, distributivity).

Embeddings live in ℝⁿ, which is a vector space, because this gives us:

Meaningful arithmetic: king - man + woman ≈ queen works because vector subtraction and addition are well-defined
Similarity measures: Dot products and cosine similarity have geometric meaning (angle between directions)
Linear algebra tools: We can apply PCA, SVD, and matrix operations to batches of embeddings

If embeddings were just arbitrary arrays (not a vector space), operations like subtraction and averaging would have no semantic meaning.

Practice Challenges

Level 1: Predict

Challenge: Without running code, predict whether the following cosine similarities are positive, negative, or near zero:

cos_sim([1, 0, 0], [0, 1, 0])
cos_sim([1, 2, 3], [2, 4, 6])
cos_sim([1, 1, 1], [-1, -1, -1])

Answer

cos_sim([1,0,0], [0,1,0]) = 0 (perpendicular vectors, angle = 90°)
cos_sim([1,2,3], [2,4,6]) = 1.0 (parallel vectors: [2,4,6] = 2·[1,2,3], same direction)
cos_sim([1,1,1], [-1,-1,-1]) = -1.0 (antiparallel: [-1,-1,-1] = -1·[1,1,1])

Level 2: Debug

Challenge: The following cosine similarity function returns NaN for some inputs. Find and fix the bug:

def broken_cosine_sim(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

Answer

The bug: if either u or v is the zero vector, np.linalg.norm() returns 0, causing division by zero → NaN.

def fixed_cosine_sim(u, v, eps=1e-10):
    norm_u = np.linalg.norm(u)
    norm_v = np.linalg.norm(v)
    if norm_u < eps or norm_v < eps:
        return 0.0  # Convention: zero vectors have no similarity
    return float(np.dot(u, v) / (norm_u * norm_v))

Also note: due to floating-point arithmetic, np.dot(u, v) / (norm_u * norm_v) can slightly exceed 1.0 or fall below -1.0. For downstream use in np.arccos(), add np.clip(result, -1, 1).

Level 3: Design

Challenge: You are building a semantic search system. You have 1 million documents, each represented as a 1536-dimensional embedding. Describe (with pseudocode or NumPy) an efficient approach to find the top-10 most similar documents to a query embedding. Address: (1) what distance metric to use and why, (2) how to handle the scale efficiently.

Answer

import numpy as np

# ── Indexing phase (done once) ─────────────────────────────────────────────
# 1M documents × 1536 dims × 4 bytes = ~6 GB in float32 → fits in RAM on many servers, not on small ones
# float16 storage halves this to ~3 GB; upcast to float32 for the arithmetic

doc_embeddings = load_embeddings()  # (1_000_000, 1536) float32

# Use cosine similarity (angular distance) - more robust in high dims
# Pre-normalize at index time to avoid repeated normalization at query time
norms = np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
norms = np.maximum(norms, 1e-10)  # zero-vector guard
normalized_docs = doc_embeddings / norms  # shape: (1_000_000, 1536)

# ── Query phase ────────────────────────────────────────────────────────────
def search(query_embedding: np.ndarray, k: int = 10) -> tuple:
    # Normalize query
    q_norm = np.linalg.norm(query_embedding)
    q_normalized = query_embedding / max(q_norm, 1e-10)

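    # Cosine similarities: one matrix-vector multiply
    # ~1.5 billion multiply-adds and a full pass over the ~6 GB matrix per query:
    # expect tens to hundreds of milliseconds on a CPU, a few milliseconds on a GPU
    # (memory-bandwidth bound; measure on your own hardware)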
    scores = normalized_docs @ q_normalized  # (1_000_000,)

    # Partial sort for top-k (faster than full sort for large arrays)
    top_k_indices = np.argpartition(scores, -k)[-k:]
    top_k_sorted = top_k_indices[np.argsort(scores[top_k_indices])[::-1]]

    return top_k_sorted, scores[top_k_sorted]

# ── Production note ────────────────────────────────────────────────────────
# For true production scale, use a vector database or ANN library (Qdrant, Pinecone, Faiss)
# These use HNSW or IVF-PQ approximate nearest neighbor for sub-linear search time
# Pure NumPy brute force is O(n·d) per query - simplest for small corpora (<100K docs)
# and still workable at 1M docs if the latency above is acceptable

Why cosine similarity: Scale-invariant (important for text embeddings), more robust to the high-dimensional concentration phenomenon than Euclidean distance, and well-supported by vector databases.

Efficiency techniques: Pre-normalization, matrix-vector multiply instead of loops, np.argpartition for top-k.
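The storage arithmetic for the index can be checked directly. The helper name is illustrative, and "GB" here means 10⁹ bytes.

```python
import numpy as np

n_docs, dim = 1_000_000, 1536

def index_size_gb(dtype) -> float:
    """Size of an (n_docs, dim) embedding matrix in gigabytes (10^9 bytes)."""
    return n_docs * dim * np.dtype(dtype).itemsize / 1e9

size_f32 = index_size_gb(np.float32)   # 4 bytes per component
size_f16 = index_size_gb(np.float16)   # 2 bytes per component

print(f"float32: {size_f32:.2f} GB, float16: {size_f16:.2f} GB")
```

Halving the precision halves the footprint, which is the motivation for storing in float16 and upcasting only for the similarity computation.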

Quick Reference Cheatsheet

Operation	Math notation	NumPy	Notes
Vector creation	v ∈ ℝⁿ	`np.array([...])`	1-D array with n components
L1 norm	‖v‖₁	`np.linalg.norm(v, ord=1)`	Sparsity-inducing
L2 norm	‖v‖₂	`np.linalg.norm(v)`	Most common default
L∞ norm	‖v‖∞	`np.linalg.norm(v, ord=np.inf)`	Max absolute component
Unit vector	v/‖v‖	`v / np.linalg.norm(v)`	Normalize
Dot product	u·v	`np.dot(u, v)` or `u @ v`	Sum of products
Cosine similarity	u·v/(‖u‖‖v‖)	See code above	Angle-based
Angle	θ = arccos(u·v/(‖u‖‖v‖))	`np.arccos(cos_sim)`	In radians
Vector addition	u + v	`u + v`	Component-wise
Scalar multiply	αv	`alpha * v`	Each component × α
Batch normalize	vᵢ/‖vᵢ‖₂ for each row vᵢ of V	`V / np.linalg.norm(V, axis=1, keepdims=True)`	For matrix of row-vectors
Pairwise cosine sim	VVᵀ (normalized)	`normalized @ normalized.T`	(n, n) similarity matrix

Key Takeaways

A vector is simultaneously an ordered array of numbers (algebraic) and a direction + magnitude in space (geometric) - both views are needed for ML
The 8 vector space axioms are what make embedding arithmetic (king - man + woman ≈ queen) geometrically meaningful
L1 and L2 norms create different constraint geometries - L1's diamond shape induces sparsity, while L2's sphere shape shrinks weights toward zero without zeroing them
The dot product u·v = ‖u‖·‖v‖·cos(θ) connects algebra to geometry and is the foundation of attention mechanisms
Cosine similarity is usually preferable to Euclidean distance for high-dimensional embeddings because it is scale-invariant; on unit-normalized vectors the two produce the same ranking
High-dimensional geometry is counterintuitive: distances concentrate, volume lives near the surface, and random vectors are approximately orthogonal
NumPy's broadcasting normalizes a whole batch at once, and a single matrix-vector multiply then computes all cosine similarities - no Python loops needed

Next: Matrix Operations - The Engine of the Neural Network Forward Pass →

