Master matrix factorization for recommendations - SVD, Funk SVD, SGD and ALS optimization, biases, regularization, and implicit feedback with BPR. The algorithm that won the Netflix Prize.

How does matrix factorization work in practice?

Matrix Factorization - Discovering Hidden Taste Dimensions covers SVD, matrix factorization, latent factors from first principles with code examples. Free lesson at https://engineersofai.com/docs/ml/recommender-systems/matrix-factorization

What is the difference between SVD and latent factors?

See the full breakdown at https://engineersofai.com/docs/ml/recommender-systems/matrix-factorization

Matrix Factorization - Discovering Hidden Taste Dimensions

Reading time: ~40 minutes | Level: Recommender Systems | Role: MLE, Data Scientist, AI Engineer

The Real Interview Moment

It is 2006. Netflix has just announced the Netflix Prize: one million dollars to whoever can beat their in-house recommendation algorithm, Cinematch, by 10% on RMSE. The competition attracts hundreds of teams - PhDs from top universities, research labs, seasoned data scientists.

A few months into the competition, an anonymous researcher posting as "Simon Funk" publishes a blog post describing how he beat Cinematch by over 8% using a technique he called SVD - though it was not quite the classical linear algebra operation by that name. It was something simpler and more elegant: learn a low-dimensional factorization of the rating matrix directly from observed ratings using gradient descent, a few thousand lines of C++, and a remarkably clear intuition.

The post went viral in the ML community overnight. Teams that had been agonizing over sophisticated feature engineering and ensemble methods suddenly pivoted. Matrix factorization had been theorized before - the connection between collaborative filtering and linear algebra was known - but this was the first time anyone had shown it working spectacularly well on a real, sparse, messy, 100-million-rating dataset.

The technique Funk described became the foundation of virtually every industrial recommendation system built over the next decade. Understanding it thoroughly - not just as a formula to memorize but as a set of decisions about parameterization, optimization, and regularization - is one of the most useful skills an ML engineer can carry into a system design interview.

Why Matrix Factorization Beats Memory-Based CF

In the previous lesson we saw that collaborative filtering - user-based or item-based - works by finding similar users or items and aggregating their ratings. This approach has fundamental scalability problems.

With $m$ users and $n$ items:

Computing all user-user similarities costs $O(m^2 \cdot n)$
Computing all item-item similarities costs $O(n^2 \cdot m)$
At Netflix scale (480K users, 17K movies), this becomes computationally impractical

More critically, memory-based CF makes no generalization - it memorizes pairwise similarities and cannot extract any underlying structure. If user A and user B both like Inception and The Dark Knight, memory-based CF treats this as a lucky overlap. It cannot represent the underlying cause: both users like Christopher Nolan films, which tend to be cerebral, visually complex action movies.

Matrix factorization discovers this structure automatically. It learns that there are latent dimensions - "Christopher Nolan-ness", "action intensity", "sci-fi orientation" - and represents each user and item as a low-dimensional vector in that latent space. Recommendations become inner products in this compact representation.

This is the fundamental insight: the observed rating matrix is a noisy, sparse projection of a low-dimensional latent structure. Our job is to recover that structure.

The Decomposition

The rating matrix $R$ is a matrix with $m$ rows (users) and $n$ columns (items). Entry $r_{ui}$ is user $u$ 's rating for item $i$ . In practice, $R$ is extremely sparse - on Netflix, the average user rates fewer than 200 of 17,000 movies. Observed entries are perhaps 1–2% of the full matrix.

The core idea: approximate $R$ by the product of two low-rank matrices:

$R \approx P Q^\top$

where:

$P \in \mathbb{R}^{m \times k}$ is the user factor matrix - row $\vec{p}_u$ is user $u$ 's latent taste vector
$Q \in \mathbb{R}^{n \times k}$ is the item factor matrix - row $\vec{q}_i$ is item $i$ 's latent attribute vector
$k$ is the number of latent dimensions, typically 20–200 (much smaller than $m$ or $n$ )

The predicted rating for user $u$ on item $i$ is:

$\hat{r}_{ui} = \vec{p}_u \cdot \vec{q}_i = \sum_{f=1}^{k} p_{uf} q_{if}$

This is a dot product: how much does user $u$ 's taste profile align with item $i$ 's attribute profile?

Intuitively: if $k=3$ and the latent dimensions correspond (loosely) to "action", "sci-fi", and "romance", then:

A user who loves action and sci-fi but hates romance has $\vec{p}_u \approx [0.9, 0.8, -0.7]$
The Dark Knight has $\vec{q}_i \approx [0.95, 0.3, 0.1]$
Their dot product is $0.9 \times 0.95 + 0.8 \times 0.3 + (-0.7) \times 0.1 = 0.855 + 0.24 - 0.07 = 1.025$ - high

In practice the latent dimensions do not correspond to human-interpretable concepts - they are learned purely to minimize prediction error. But the geometry is the same: similar users and items cluster in the same neighborhood of the latent space.

Adding Biases

Raw dot products miss an important source of variation: some users rate everything highly (lenient raters), some rate everything harshly (strict raters). Some items are universally loved, others universally disliked. These biases are independent of taste - they are structural properties of users and items.

A better model includes three bias terms:

$\hat{r}_{ui} = \mu + b_u + b_i + \vec{p}_u \cdot \vec{q}_i$

where:

$\mu$ is the global mean rating across all observed ratings
$b_u$ is the user bias - how much user $u$ rates above or below average
$b_i$ is the item bias - how much item $i$ is rated above or below average

The dot product $\vec{p}_u \cdot \vec{q}_i$ now only needs to model the residual variation beyond what biases explain. In practice, biases account for roughly 25–30% of the explainable variance in typical rating datasets. Including them is not optional - it materially improves RMSE.

Example: Roger always rates 0.5 stars above average ( $b_{Roger} = +0.5$ ). Titanic is universally loved, 0.3 stars above average ( $b_{Titanic} = +0.3$ ). The global mean is 3.5. The bias-only prediction for Roger on Titanic is $3.5 + 0.5 + 0.3 = 4.3$ before considering any latent taste alignment.

The Loss Function

We learn $P$ , $Q$ , $\{b_u\}$ , and $\{b_i\}$ by minimizing squared error on observed ratings, plus an L2 regularization term to prevent overfitting:

$\mathcal{L} = \sum_{(u,i) \in \mathcal{K}} \left(r_{ui} - \hat{r}_{ui}\right)^2 + \lambda\left(\|\vec{p}_u\|^2 + \|\vec{q}_i\|^2 + b_u^2 + b_i^2\right)$

where $\mathcal{K}$ is the set of observed (user, item) rating pairs and $\lambda$ is the regularization strength.

Key design decisions embedded in this loss:

Only sum over observed ratings. We do not treat missing ratings as zeros. This is crucial - most missing ratings are missing at random (user has not seen the item), not zero (user hated it). Training on missing-as-zero would catastrophically bias the model toward low predictions.
L2 regularization on all parameters. Without regularization, the model can memorize training ratings perfectly by making $P$ and $Q$ arbitrarily large, collapsing to zero loss on training data and garbage on test data. Regularization keeps parameters small, which forces the model to learn generalizable structure rather than noise.
No constraint on the sign or scale of latent factors. The latent dimensions are free to be anything. This is what makes MF powerful - it discovers whatever structure best explains the data, not structure we imposed.

Optimization: SGD

The standard optimization algorithm is Stochastic Gradient Descent (SGD). For each observed rating $(u, i, r_{ui})$ , we compute the prediction error:

$e_{ui} = r_{ui} - \hat{r}_{ui}$

Then update all parameters in the direction that reduces the loss:

$b_u \leftarrow b_u + \alpha\left(e_{ui} - \lambda b_u\right)$

$b_i \leftarrow b_i + \alpha\left(e_{ui} - \lambda b_i\right)$

$p_{uf} \leftarrow p_{uf} + \alpha\left(e_{ui} \cdot q_{if} - \lambda \cdot p_{uf}\right)$

$q_{if} \leftarrow q_{if} + \alpha\left(e_{ui} \cdot p_{uf} - \lambda \cdot q_{if}\right)$

for $f = 1, \ldots, k$ . Here $\alpha$ is the learning rate. Note the cross-dependency in the factor updates: the gradient for $\vec{p}_u$ depends on the current value of $\vec{q}_i$ , and vice versa. This means both must be updated using the values from before the step - the updates are not interchangeable.

SGD on matrix factorization has some nice properties:

Each update costs $O(k)$ - proportional only to the number of latent factors
Total cost per epoch is $O(|\mathcal{K}| \cdot k)$ - linear in observed ratings
Works naturally on streaming data - new ratings can be incorporated with more SGD steps

Optimization: ALS

Alternating Least Squares (ALS) takes a different approach. Instead of gradient descent, it alternates between two closed-form optimization steps:

Fix $Q$ , solve for $P$ : With $Q$ fixed, the loss is a sum of independent quadratic problems - one for each user. Each can be solved analytically.
Fix $P$ , solve for $Q$ : With $P$ fixed, same thing - one quadratic problem per item.

For user $u$ with rating history $I_u$ (items rated by user $u$ ), the closed-form solution is:

$\vec{p}_u = \left(Q_{I_u}^\top Q_{I_u} + \lambda |I_u| I_k\right)^{-1} Q_{I_u}^\top \vec{r}_u$

where $Q_{I_u}$ is the submatrix of $Q$ containing only rows corresponding to items in $I_u$ , $\vec{r}_u$ is the vector of user $u$ 's ratings, and $I_k$ is the $k \times k$ identity matrix.

ALS advantages:

Parallelizable: all user updates are independent given $Q$ ; all item updates are independent given $P$ . This maps perfectly to distributed computing.
Numerically stable: closed-form solution avoids learning rate tuning and convergence issues.
Handles implicit feedback naturally: implicit ALS weights every (user, item) pair (not just observed ones) but downweights unobserved pairs. This is the default in production implicit feedback systems.

ALS disadvantages:

Each step requires solving a $k \times k$ linear system per user/item - cost is $O(k^3)$ per user, $O(m k^3)$ per epoch. For large $k$ or $m$ , this can be expensive.
Not naturally online - hard to update a single user's vector as new ratings arrive.

note

Apache Spark's spark.ml.recommendation.ALS implements distributed ALS and is the standard for training on 100M+ rating datasets. It partitions users and items across workers and uses block-structured communication to minimize data shuffling.

System Architecture

NumPy From Scratch: Full SGD Matrix Factorization

import numpy as np
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class MatrixFactorization:
    """
    Matrix Factorization with biases, L2 regularization, and early stopping.
    Trained with SGD (Funk SVD approach).
    """
    n_factors: int = 50          # number of latent dimensions k
    n_epochs: int = 50           # max training epochs
    lr: float = 0.005            # learning rate alpha
    reg: float = 0.02            # L2 regularization lambda
    early_stop_patience: int = 5 # stop if val RMSE doesn't improve
    verbose: bool = True

    # Learned parameters (set during fit)
    P: Optional[np.ndarray] = field(default=None, repr=False)  # user factors (m, k)
    Q: Optional[np.ndarray] = field(default=None, repr=False)  # item factors (n, k)
    bu: Optional[np.ndarray] = field(default=None, repr=False) # user biases (m,)
    bi: Optional[np.ndarray] = field(default=None, repr=False) # item biases (n,)
    global_mean: float = 0.0

    # Index mappings
    user_idx: dict = field(default_factory=dict)
    item_idx: dict = field(default_factory=dict)

    def fit(
        self,
        train_ratings: list,  # [(user_id, item_id, rating), ...]
        val_ratings: Optional[list] = None
    ):
        """Train matrix factorization on a list of (user, item, rating) triples."""
        # Build index mappings
        users = sorted(set(u for u, i, r in train_ratings))
        items = sorted(set(i for u, i, r in train_ratings))
        self.user_idx = {u: idx for idx, u in enumerate(users)}
        self.item_idx = {i: idx for idx, i in enumerate(items)}

        m, n, k = len(users), len(items), self.n_factors

        # Initialize parameters
        rng = np.random.default_rng(42)
        self.P = rng.normal(0, 0.1, (m, k))    # user factors
        self.Q = rng.normal(0, 0.1, (n, k))    # item factors
        self.bu = np.zeros(m)                   # user biases
        self.bi = np.zeros(n)                   # item biases
        self.global_mean = np.mean([r for _, _, r in train_ratings])

        # Convert to index-based triples for speed
        train_data = [
            (self.user_idx[u], self.item_idx[i], r)
            for u, i, r in train_ratings
        ]

        best_val_rmse = float("inf")
        patience_counter = 0

        for epoch in range(self.n_epochs):
            # Shuffle training data each epoch
            rng.shuffle(train_data)

            train_loss = 0.0

            for u_idx, i_idx, r_ui in train_data:
                # Prediction
                r_hat = (
                    self.global_mean
                    + self.bu[u_idx]
                    + self.bi[i_idx]
                    + self.P[u_idx] @ self.Q[i_idx]
                )

                # Error
                e_ui = r_ui - r_hat
                train_loss += e_ui ** 2

                # Bias updates
                self.bu[u_idx] += self.lr * (e_ui - self.reg * self.bu[u_idx])
                self.bi[i_idx] += self.lr * (e_ui - self.reg * self.bi[i_idx])

                # Factor updates (store old values to avoid cross-contamination)
                p_u = self.P[u_idx].copy()
                q_i = self.Q[i_idx].copy()
                self.P[u_idx] += self.lr * (e_ui * q_i - self.reg * p_u)
                self.Q[i_idx] += self.lr * (e_ui * p_u - self.reg * q_i)

            train_rmse = np.sqrt(train_loss / len(train_data))

            # Validation
            if val_ratings is not None:
                val_rmse = self._compute_rmse(val_ratings)
                if self.verbose:
                    print(f"Epoch {epoch+1:3d} | Train RMSE: {train_rmse:.4f} | Val RMSE: {val_rmse:.4f}")

                # Early stopping
                if val_rmse < best_val_rmse - 1e-4:
                    best_val_rmse = val_rmse
                    patience_counter = 0
                else:
                    patience_counter += 1
                    if patience_counter >= self.early_stop_patience:
                        if self.verbose:
                            print(f"Early stopping at epoch {epoch+1}")
                        break
            else:
                if self.verbose:
                    print(f"Epoch {epoch+1:3d} | Train RMSE: {train_rmse:.4f}")

        return self

    def predict(self, user_id, item_id) -> float:
        """Predict rating for a (user, item) pair."""
        if user_id not in self.user_idx or item_id not in self.item_idx:
            return self.global_mean  # fallback for unknown users/items

        u_idx = self.user_idx[user_id]
        i_idx = self.item_idx[item_id]
        r_hat = (
            self.global_mean
            + self.bu[u_idx]
            + self.bi[i_idx]
            + self.P[u_idx] @ self.Q[i_idx]
        )
        return float(np.clip(r_hat, 1.0, 5.0))  # clip to valid rating range

    def recommend(self, user_id, all_item_ids: list, seen_item_ids: set, k: int = 10) -> list:
        """Return top-k unseen item recommendations for a user."""
        if user_id not in self.user_idx:
            return []

        u_idx = self.user_idx[user_id]
        scores = []
        for item_id in all_item_ids:
            if item_id in seen_item_ids or item_id not in self.item_idx:
                continue
            i_idx = self.item_idx[item_id]
            score = (
                self.global_mean
                + self.bu[u_idx]
                + self.bi[i_idx]
                + self.P[u_idx] @ self.Q[i_idx]
            )
            scores.append((item_id, float(score)))

        scores.sort(key=lambda x: x[1], reverse=True)
        return scores[:k]

    def _compute_rmse(self, ratings: list) -> float:
        """Compute RMSE on a rating list."""
        errors = []
        for u, i, r in ratings:
            if u in self.user_idx and i in self.item_idx:
                r_hat = self.predict(u, i)
                errors.append((r - r_hat) ** 2)
        return float(np.sqrt(np.mean(errors))) if errors else float("inf")


# ── Demo on synthetic data ────────────────────────────────────────────────────
rng = np.random.default_rng(0)

# 200 users, 100 items, 5000 ratings
n_users, n_items, n_ratings = 200, 100, 5000
user_ids = rng.integers(0, n_users, n_ratings)
item_ids = rng.integers(0, n_items, n_ratings)
true_ratings = np.clip(
    rng.normal(3.5, 1.0, n_ratings),  # noisy ratings centered at 3.5
    1.0, 5.0
).round(1)

all_ratings = list(zip(user_ids.tolist(), item_ids.tolist(), true_ratings.tolist()))

# Train/val split (90/10)
split = int(0.9 * len(all_ratings))
train, val = all_ratings[:split], all_ratings[split:]

mf = MatrixFactorization(n_factors=20, n_epochs=30, lr=0.005, reg=0.02, verbose=True)
mf.fit(train, val)

# Single prediction
print(f"\nPredicted rating for user 0, item 5: {mf.predict(0, 5):.2f}")

# Top-5 recommendations for user 0
seen = {i for u, i, r in train if u == 0}
all_items = list(range(n_items))
recs = mf.recommend(0, all_items, seen, k=5)
print("\nTop-5 recs for user 0:")
for item_id, score in recs:
    print(f"  item {item_id:3d}  predicted={score:.2f}")

Practical Code: surprise and implicit Libraries

Explicit Ratings with surprise

from surprise import SVD, Dataset, Reader, accuracy
from surprise.model_selection import train_test_split, cross_validate
import pandas as pd

# ── Load data ─────────────────────────────────────────────────────────────────
# Using built-in MovieLens-100K
from surprise import Dataset
data = Dataset.load_builtin("ml-100k")

# ── Train/test split ──────────────────────────────────────────────────────────
trainset, testset = train_test_split(data, test_size=0.2, random_state=42)

# ── Funk SVD with biases ──────────────────────────────────────────────────────
algo = SVD(
    n_factors=100,      # latent dimensions
    n_epochs=30,        # training epochs
    lr_all=0.005,       # learning rate (same for all params)
    reg_all=0.02,       # regularization (same for all params)
    biased=True,        # include user + item biases
    random_state=42,
    verbose=True
)
algo.fit(trainset)

# ── Evaluate ──────────────────────────────────────────────────────────────────
predictions = algo.test(testset)
rmse = accuracy.rmse(predictions)
mae  = accuracy.mae(predictions)
print(f"\nRMSE: {rmse:.4f} | MAE: {mae:.4f}")

# ── Recommend for a specific user ─────────────────────────────────────────────
from surprise import Trainset

def get_top_n(algo, trainset: Trainset, user_id: str, n: int = 10) -> list:
    """Return top-n unseen item predictions for a user."""
    all_items = set(trainset.all_items())
    seen = set(j for (j, _) in trainset.ur[trainset.to_inner_uid(user_id)])
    unseen = all_items - seen

    preds = [
        (trainset.to_raw_iid(i), algo.predict(user_id, trainset.to_raw_iid(i)).est)
        for i in unseen
    ]
    return sorted(preds, key=lambda x: x[1], reverse=True)[:n]

recs = get_top_n(algo, trainset, user_id="196", n=10)
print("\nTop-10 for user 196:")
for item_id, score in recs:
    print(f"  item {item_id:5s}  predicted={score:.2f}")

# ── Cross-validation ──────────────────────────────────────────────────────────
results = cross_validate(SVD(n_factors=100), data, measures=["RMSE", "MAE"], cv=5)
print(f"\nCV RMSE: {results['test_rmse'].mean():.4f} +/- {results['test_rmse'].std():.4f}")

Implicit Feedback with the implicit Library

For scenarios without explicit ratings - page views, clicks, plays, purchases - ALS on implicit data is the standard:

import implicit
import numpy as np
import scipy.sparse as sparse

# ── Build user-item interaction matrix ───────────────────────────────────────
# Each entry: number of times user played the track (implicit confidence)
n_users, n_items = 500, 2000
rng = np.random.default_rng(42)

# Simulate sparse interactions: ~5% density
interactions_row = rng.integers(0, n_users, 10000)
interactions_col = rng.integers(0, n_items, 10000)
interactions_data = rng.integers(1, 50, 10000).astype(np.float32)  # play counts

# Sparse user-item matrix (CSR format)
user_item = sparse.csr_matrix(
    (interactions_data, (interactions_row, interactions_col)),
    shape=(n_users, n_items)
)
print(f"User-item matrix shape: {user_item.shape}")
print(f"Density: {user_item.nnz / (n_users * n_items):.2%}")

# ── Train ALS model ───────────────────────────────────────────────────────────
model = implicit.als.AlternatingLeastSquares(
    factors=64,         # latent dimensions
    iterations=20,      # ALS iterations
    regularization=0.1, # L2 regularization
    alpha=40.0,         # confidence scaling: c_ui = 1 + alpha * r_ui
    random_state=42,
    use_gpu=False
)

# implicit expects item-user matrix (transposed)
item_user = user_item.T.tocsr()
model.fit(item_user)

# ── Recommend for user 0 ──────────────────────────────────────────────────────
user_id = 0
ids, scores = model.recommend(
    user_id,
    user_item[user_id],   # user's interaction history (for filtering seen items)
    N=10,                 # return top-10
    filter_already_liked_items=True
)
print(f"\nTop-10 recommendations for user {user_id}:")
for item_id, score in zip(ids, scores):
    print(f"  item {item_id:4d}  score={score:.4f}")

# ── Similar items ─────────────────────────────────────────────────────────────
similar_ids, similar_scores = model.similar_items(item_id=42, N=5)
print(f"\nItems similar to item 42:")
for sid, sscore in zip(similar_ids, similar_scores):
    print(f"  item {sid:4d}  similarity={sscore:.4f}")

The alpha parameter is critical for implicit feedback. The confidence that user $u$ has a preference for item $i$ is defined as $c_{ui} = 1 + \alpha \cdot r_{ui}$ where $r_{ui}$ is the raw interaction count. Without alpha, all interactions are treated with equal confidence regardless of how many times the user interacted. With alpha=40, a track played 10 times has confidence 401, while a track played once has confidence 41 - a 10x stronger signal.

Implicit Feedback: Bayesian Personalized Ranking (BPR)

ALS on implicit data optimizes for prediction of interaction confidence - which is a proxy for what we actually want: a good ranking. BPR (Bayesian Personalized Ranking, Rendle et al. 2009) directly optimizes for ranking by training on user preference pairs.

The assumption: if user $u$ interacted with item $i$ but not item $j$ , then user $u$ prefers $i$ over $j$ . Formally, we maximize the posterior probability of the observed preference ordering:

$\text{BPR-OPT} = \sum_{(u, i, j) \in D_S} \ln \sigma\left(\hat{r}_{uij}\right) - \lambda \|\Theta\|^2$

where $D_S = \{(u, i, j) \mid i \in I^+_u, j \notin I^+_u\}$ is the set of user-item-nonitem triples, $\hat{r}_{uij} = \hat{r}_{ui} - \hat{r}_{uj}$ is the predicted preference gap, and $\sigma$ is the sigmoid function.

The gradient for a single triple $(u, i, j)$ :

$\frac{\partial}{\partial \Theta} \ln \sigma(\hat{r}_{uij}) = (1 - \sigma(\hat{r}_{uij})) \frac{\partial \hat{r}_{uij}}{\partial \Theta}$

In practice, BPR consistently outperforms ALS on ranking metrics (NDCG, MAP, recall@K) even though ALS often has lower RMSE on interaction reconstruction. RMSE on implicit feedback is the wrong metric - ranking quality is what users experience.

import torch
import torch.nn as nn
import numpy as np

class BPRMatrixFactorization(nn.Module):
    """BPR-optimized matrix factorization."""

    def __init__(self, n_users: int, n_items: int, n_factors: int = 64):
        super().__init__()
        self.user_factors = nn.Embedding(n_users, n_factors)
        self.item_factors = nn.Embedding(n_items, n_factors)

        # Initialize with small random values
        nn.init.normal_(self.user_factors.weight, mean=0, std=0.01)
        nn.init.normal_(self.item_factors.weight, mean=0, std=0.01)

    def forward(self, user_ids, pos_item_ids, neg_item_ids):
        """Compute BPR loss for a batch of (user, positive, negative) triples."""
        p_u = self.user_factors(user_ids)         # (batch, k)
        q_i = self.item_factors(pos_item_ids)     # (batch, k)
        q_j = self.item_factors(neg_item_ids)     # (batch, k)

        # Predicted scores
        r_ui = (p_u * q_i).sum(dim=1)             # (batch,)
        r_uj = (p_u * q_j).sum(dim=1)             # (batch,)

        # BPR loss: -log(sigma(r_ui - r_uj))
        loss = -torch.log(torch.sigmoid(r_ui - r_uj)).mean()
        return loss

    def recommend(self, user_id: int, item_ids: list, k: int = 10) -> list:
        """Get top-k items for a user."""
        with torch.no_grad():
            u = torch.tensor([user_id])
            items = torch.tensor(item_ids)
            p_u = self.user_factors(u)          # (1, k)
            q_all = self.item_factors(items)    # (n, k)
            scores = (p_u * q_all).sum(dim=1)   # (n,)
            top_k = torch.topk(scores, k=min(k, len(item_ids)))
            return [(item_ids[i], float(s)) for i, s in
                    zip(top_k.indices.tolist(), top_k.values.tolist())]


# ── Training loop ──────────────────────────────────────────────────────────────
def train_bpr(model, interactions: dict, n_items: int, n_epochs: int = 20,
              lr: float = 0.01, reg: float = 0.01, batch_size: int = 512):
    """
    interactions: {user_id: set of positive item_ids}
    """
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=reg)
    all_items = set(range(n_items))

    for epoch in range(n_epochs):
        model.train()
        total_loss = 0.0
        n_batches = 0

        # Generate random triples
        users, pos_items, neg_items = [], [], []
        for user_id, pos_set in interactions.items():
            neg_pool = list(all_items - pos_set)
            for pos in pos_set:
                neg = np.random.choice(neg_pool)
                users.append(user_id)
                pos_items.append(pos)
                neg_items.append(neg)

        # Batch training
        indices = np.random.permutation(len(users))
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start+batch_size]
            u_batch  = torch.tensor([users[i]      for i in batch])
            pi_batch = torch.tensor([pos_items[i]  for i in batch])
            ni_batch = torch.tensor([neg_items[i]  for i in batch])

            loss = model(u_batch, pi_batch, ni_batch)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            total_loss += loss.item()
            n_batches += 1

        avg_loss = total_loss / max(n_batches, 1)
        print(f"Epoch {epoch+1:3d} | BPR Loss: {avg_loss:.4f}")

    return model

Production Engineering Notes

Spark ALS for Distributed Training

At scale (millions of users, millions of items, billions of interactions), ALS in Spark is the standard:

from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

spark = SparkSession.builder.appName("ALSRecommender").getOrCreate()

# ratings_df: Spark DataFrame with columns [userId, movieId, rating]
ratings_df = spark.read.parquet("s3://your-bucket/ratings.parquet")

(train_df, test_df) = ratings_df.randomSplit([0.9, 0.1], seed=42)

als = ALS(
    maxIter=15,
    rank=100,               # number of latent factors
    regParam=0.1,           # L2 regularization
    userCol="userId",
    itemCol="movieId",
    ratingCol="rating",
    coldStartStrategy="drop",  # handle unknown users/items in test
    implicitPrefs=False,       # set True for implicit feedback
)

model = als.fit(train_df)
predictions = model.transform(test_df)

evaluator = RegressionEvaluator(
    metricName="rmse",
    labelCol="rating",
    predictionCol="prediction"
)
rmse = evaluator.evaluate(predictions.dropna())
print(f"RMSE on test set: {rmse:.4f}")

# Recommend top-10 for all users
user_recs = model.recommendForAllUsers(10)
user_recs.show(5, truncate=False)

Spark ALS uses block-structured communication - it partitions users and items into blocks and distributes factor updates across workers, minimizing network communication. At 100M users and 10M items with k=128 factors, training completes in 15–30 minutes on a 20-node cluster.

Handling New Users and Items: Warm Start

Matrix factorization has an inherent cold-start problem: new users and items have no latent factors.

For new items with content features: initialize $\vec{q}_i$ using content-based embeddings mapped to the latent space via a learned projection matrix $W$ : $\vec{q}_i^{init} = W \cdot \vec{v}_i^{content}$ . This hybrid warm-start gives new items a reasonable initial position in latent space immediately on launch.

For new users with a few interactions: run 1–5 additional SGD steps holding $Q$ fixed and only updating $\vec{p}_u$ and $b_u$ . This is fast (seconds) and can run online as new ratings arrive.

For completely new users with zero history: fall back to popularity-based recommendations or content-based filtering. Track when each user crosses a history threshold (e.g., 5 interactions) and switch to MF-based recommendations automatically.

Time-Aware Matrix Factorization

User tastes and item popularity drift over time. The BellKor solution (which eventually won the Netflix Prize) added a temporal component to the bias terms:

$\hat{r}_{ui}(t) = \mu + b_u(t) + b_i(t) + \vec{p}_u(t) \cdot \vec{q}_i$

where $b_u(t) = b_u + \alpha_u \cdot \text{dev}_u(t)$ tracks how a user's baseline rating changes over time, and $b_i(t)$ captures trending items. This time-aware model reduced Netflix Prize RMSE by an additional ~1% - significant at that level of competition.

The Relationship Between MF and Word2Vec

There is a deep connection between matrix factorization and word2vec that is worth understanding. Word2vec's skip-gram model, when trained with negative sampling, implicitly factorizes the pointwise mutual information (PMI) matrix of word co-occurrences. Collaborative filtering MF factorizes a rating (or interaction) matrix.

This means:

Tricks that work for word2vec (subsampling frequent words, negative sampling) have direct analogues in recommendation MF (downweighting popular items, using BPR sampling).
Pre-trained word embeddings can be used to initialize item factors in content-aware MF.
The two-tower model (upcoming lesson) is essentially a neural generalization of MF that replaces dot products with learned nonlinear transformations - the latent factors become the output of deep encoders.

Common Mistakes

danger

Treating missing ratings as 0 in the loss function. This is the most catastrophic mistake in matrix factorization. If you include all (user, item) pairs in your loss with unobserved entries set to 0, you are telling the model: "this user rates every movie they have never seen a 0 out of 5." The model learns to predict near-zero for everything. Always train on observed ratings only - sum the loss only over $(u, i) \in \mathcal{K}$ where $r_{ui}$ is actually observed.

danger

Using classical SVD directly on the full sparse rating matrix. Classical SVD requires a dense matrix and computes an exact decomposition. If you impute missing values with zeros or means and run np.linalg.svd, you get a matrix that does not reflect actual user preferences - you have introduced millions of fake "zero ratings" that dominate the signal. Use SGD or ALS instead, which naturally operate only on observed ratings.

warning

Setting k (number of latent factors) too high. Beginners often think more factors = better recommendations. In practice, 50–200 latent factors is usually sufficient for most datasets. Beyond that, the model starts overfitting - it memorizes the training ratings rather than learning generalizable taste structure. The optimal k depends on dataset size: ~20–50 for small datasets (1M ratings), ~100–200 for large datasets (100M+ ratings). Always tune k via validation RMSE and watch for train/val divergence.

warning

Ignoring regularization. Without L2 regularization on the factor matrices and biases, SGD will overfit severely on training data. The hyperparameter $\lambda$ must be tuned - typically via grid search over [0.001, 0.01, 0.02, 0.05, 0.1]. Regularization has a bigger impact on final RMSE than the learning rate in most settings. Start with $\lambda = 0.02$ as a baseline.

tip

Use biases - they matter a lot. Adding user and item bias terms typically reduces RMSE by 5–10% compared to pure dot product MF. The bias absorbs the "easy" variation (user leniency, item popularity) and lets the latent factors focus on the harder task of modeling taste alignment. Never deploy MF without biases.

YouTube Resources

Video	Channel	What You Will Learn
Matrix Factorization for Recommendations	ritvikmath	Excellent visual walkthrough of the core algorithm
SVD and Recommender Systems	StatQuest	SVD intuition and connection to matrix factorization
Netflix Prize and BellKor	Yannic Kilcher	History of MF in recommendations and the winning solution
ALS for Collaborative Filtering	Stanford CS246	ALS derivation and distributed implementation

Interview Q&A

Q1: Why did matrix factorization beat memory-based collaborative filtering on the Netflix Prize?

A: Memory-based CF (user-user or item-item) has several fundamental weaknesses that MF addresses:

Scalability: memory-based CF requires computing pairwise similarities between all users or all items. At Netflix scale (480K users, 17K movies), computing all user-user similarities is computationally impractical and requires storing a 480K x 480K matrix.

Sparsity: when rating matrices are sparse (each user rates fewer than 0.5% of items), user-user similarity estimates are based on very few common ratings and are statistically noisy. MF learns a shared low-dimensional structure that enables generalization even from sparse observations.

Generalization: memory-based CF memorizes pairwise similarities and cannot generalize beyond observed co-ratings. MF discovers latent structure (implicit preference dimensions) that generalizes to (user, item) pairs where neither has been seen together before.

Biases: MF naturally models and corrects for user leniency and item popularity biases. Memory-based CF does not have a principled way to handle these.

Offline training: MF produces compact user and item vectors ( $m \times k$ and $n \times k$ matrices) that support fast prediction (one dot product per candidate item). Memory-based CF requires computing similarities at query time - slow for real-time serving.

Simon Funk's key insight was that you do not need the full SVD - you can minimize the reconstruction error on observed entries only using SGD, which is both tractable and highly effective.

Q2: When should you use ALS vs SGD for matrix factorization?

A: The choice depends on your data and infrastructure:

Use SGD when:

You have explicit ratings data (1–5 stars, thumbs up/down)
You need online updates - new ratings can be incorporated by running more SGD steps
You are running on a single machine or want a simple implementation
Training data fits in memory
You want flexibility in loss function (e.g., switch to BPR for ranking)

Use ALS when:

You have implicit feedback (clicks, plays, purchases) - ALS with weighted confidence handles this naturally
You have a distributed computing environment (Spark) - ALS parallelizes perfectly across a cluster because all user updates given fixed $Q$ are independent
Training speed is critical - ALS often converges in fewer iterations than SGD
You need a mature, well-tested production implementation - Spark ALS is battle-tested at petabyte scale

In practice: start with surprise's SGD-based SVD for explicit ratings experiments, move to Spark ALS for production at scale or when switching to implicit feedback.

Q3: How do you handle implicit feedback in matrix factorization?

A: Implicit feedback (plays, clicks, views) is fundamentally different from explicit ratings:

No negative feedback: you only observe interactions, not non-interactions. A missing entry could mean "user has not seen the item" (neutral) or "user saw and disliked" (negative). You cannot tell.
ALS with confidence weights (Hu et al. 2008): define a confidence value $c_{ui} = 1 + \alpha \cdot r_{ui}$ where $r_{ui}$ is the raw interaction count and $\alpha$ is a scaling parameter (typically 40). Define a binary preference $p_{ui} = 1$ if the user interacted, $p_{ui} = 0$ otherwise. Minimize:

$\mathcal{L} = \sum_{u,i} c_{ui}(p_{ui} - \vec{p}_u \cdot \vec{q}_i)^2 + \lambda(\|P\|^2 + \|Q\|^2)$

The sum is over ALL (user, item) pairs, not just observed ones. Unobserved pairs get $p_{ui}=0$ and $c_{ui}=1$ - low confidence that the user does not prefer the item. Observed pairs get $c_{ui} = 1 + \alpha \cdot r_{ui}$ - higher confidence for more interactions.
BPR: directly optimize ranking by sampling user-positive-negative triples and maximizing the probability that the user prefers the positive item over the negative. Better for ranking metrics than ALS.
Sampling strategy: for BPR, negative sampling is critical. Uniform random negatives work but sampling popularity-weighted negatives (harder negatives) improves representation quality.

Q4: How does regularization prevent overfitting in MF and how do you tune it?

A: Without regularization, the model can perfectly memorize all observed ratings by setting $\vec{p}_u \cdot \vec{q}_i$ to exactly $r_{ui} - \mu - b_u - b_i$ for each observed pair, with arbitrarily large values for $P$ and $Q$ . The L2 regularization term $\lambda(\|P\|^2 + \|Q\|^2 + \|b\|^2)$ penalizes large parameter values, preventing this memorization. It creates a bias-variance tradeoff: too little regularization overfits, too much underfits.

Tuning $\lambda$ :

Start with $\lambda \in \{0.001, 0.005, 0.01, 0.02, 0.05, 0.1\}$
Use a time-based validation split (train on older data, validate on newer) to simulate production conditions
Monitor training RMSE vs validation RMSE across epochs - overfitting looks like training RMSE decreasing while validation RMSE plateaus or increases
Separate $\lambda$ values for user factors, item factors, user biases, and item biases can be useful (four hyperparameters instead of one), but one shared $\lambda$ is a reasonable starting point

In practice, $\lambda = 0.02$ works well as a starting point on most datasets. Use early stopping on validation RMSE as a complementary regularization mechanism.

Q5: How do you handle the cold-start problem in matrix factorization?

A: Cold start manifests in two forms: new users and new items. MF struggles with both because there are no latent factors to look up.

New items (more tractable):

If the item has content features (description, category, audio features), learn a linear or nonlinear mapping from content features to the latent factor space. Initialize $\vec{q}_{new} = W \vec{v}_{content}$ where $W$ is trained alongside the MF model or as a separate regression.
Items with at least a few ratings: run a few ALS/SGD steps to estimate $\vec{q}_{new}$ while keeping all other parameters fixed.

New users (harder):

Zero interactions: fall back to content-based filtering (from the previous lesson) or popularity-based recommendations.
1–5 interactions: run online SGD steps to estimate $\vec{p}_{new}$ and $b_{new}$ from the small rating set, holding $Q$ fixed.
This "user factor estimation" step takes milliseconds and can run synchronously on the first few user actions.
Once the user has 15–20 interactions, the MF-estimated factors are statistically reliable.

The deeper fix is a hybrid model that uses content features for representation when interaction history is sparse. Two-tower models (the next lesson) solve this elegantly by encoding both user and item through deep networks that can incorporate arbitrary side features.

Q6: What is the relationship between matrix factorization and classical SVD?

A: Classical SVD decomposes a matrix $M$ as $M = U \Sigma V^\top$ where $U$ and $V$ are orthonormal and $\Sigma$ is diagonal with singular values. Taking the top- $k$ components gives the best rank- $k$ approximation in Frobenius norm.

The issue: SVD requires a fully observed, dense matrix. The rating matrix is 99%+ missing. If you impute missing entries (with 0 or the global mean) and run SVD, you get a decomposition of your imputation choices, not of actual user preferences.

Funk SVD (what the recommender community calls "SVD") solves only for the factors $P$ and $Q$ such that their product approximates the observed entries. It ignores missing entries entirely. The name "SVD" is misleading - it is not the classical decomposition. A better name is "low-rank matrix factorization trained on observed entries via SGD."

Key differences:

Classical SVD: exact solution, requires dense matrix, $O(mn^2)$ or $O(m^2 n)$ complexity
Funk SVD / MF: approximate solution, trains only on observed entries, $O(|\mathcal{K}| k)$ per epoch
Classical SVD: orthogonality constraints on $U$ and $V$
Funk SVD: no orthogonality constraints - factors are free to be anything

The connection: if the rating matrix were fully observed and you ran Funk SVD to convergence, the result would approximate the classical SVD truncation. But for sparse recommendation matrices, the two diverge significantly because SVD fills in the blanks with imputed values while Funk SVD ignores them.

Key Takeaways

Matrix factorization is the most important algorithm in the history of recommender systems. The core idea - approximate a sparse rating matrix as a product of two low-dimensional factor matrices - is elegant, scalable, and effective. Adding biases, regularization, and temporal dynamics produces models that can account for the major sources of variation in real rating data.

The optimization choices matter enormously. SGD and ALS are both correct but have different tradeoffs: SGD is simpler and more flexible, ALS is parallelizable and naturally handles implicit feedback. For production implicit feedback systems, ALS (via Spark) and BPR (via SGD on ranking triples) are the two standard approaches.

The fundamental limitations - cold start and filter bubble - are not solved by more sophisticated matrix factorization. They require hybrid architectures that combine MF with content-based signals and diversity constraints. This is why the field moved to neural collaborative filtering and two-tower models, which we will cover in the next two lessons.

But MF remains the baseline every production recommender system must beat. If your neural model cannot outperform a well-tuned ALS on your offline evaluation metrics, your neural model is not ready for production.

Hyperparameter Reference

A practical guide to tuning matrix factorization hyperparameters, organized by impact:

Hyperparameter	Typical Range	Impact	Notes
`n_factors` (k)	20–200	High	Start at 50. Increase if underfitting, decrease if overfitting.
`reg` ( $\lambda$ )	0.001–0.1	High	Most important hyperparameter. Grid search: [0.001, 0.005, 0.01, 0.02, 0.05, 0.1]
`lr` ( $\alpha$ )	0.001–0.05	Medium	For SGD. Start at 0.005. Use learning rate decay: $\alpha_t = \alpha_0 / (1 + \text{decay} \cdot t)$
`n_epochs`	20–100	Medium	Use early stopping on validation RMSE (patience=5) - actual epochs will be fewer
`biased`	True/False	High	Almost always True. Biases account for 25–30% of explainable variance.
`alpha` (ALS implicit)	1–100	High	Confidence scaling. Netflix paper used 40. Tune on downstream ranking metrics.
`nprobe` (Faiss IVF)	1–100	Medium	Speed-recall tradeoff for approximate search. Higher = slower but more accurate.

Quick-Reference: MF Algorithm Comparison

Algorithm        Data Type      Parallelizable   Online Updates   Typical RMSE (ML-100K)
─────────────────────────────────────────────────────────────────────────────────────────
Funk SVD (SGD)   Explicit       No               Yes (online SGD)  0.91–0.93
ALS              Both           Yes (Spark)      Partial           0.91–0.94
BPR              Implicit       Partial           Yes              N/A (ranking metric)
SVD++ (Koren)    Explicit       No               No               0.89–0.91
TimeSVD++        Explicit       No               No               0.88–0.90
─────────────────────────────────────────────────────────────────────────────────────────
Neural CF (NCF)  Both           Yes (GPU)        Yes              0.87–0.89
Two-Tower        Both           Yes (GPU)        Yes              Varies

SVD++ extends Funk SVD by incorporating implicit feedback signals (which items the user has rated, regardless of rating value) as additional user features. TimeSVD++ adds temporal dynamics to user biases and factors. Both improve RMSE at the cost of significantly more complex training and serving.

For most production systems, a well-tuned Funk SVD or ALS with biases achieves 90–95% of the improvement that SVD++ or TimeSVD++ would deliver, at a fraction of the engineering complexity. Reach for the complex variants only after you have exhausted simpler improvements: better regularization, richer negative sampling, and improved feature engineering.

Glossary

Term	Definition
Latent factor	A hidden dimension in the factorized space that the model learns automatically. Does not correspond to any named feature but captures shared structure across users and items.
User factor matrix P	Matrix of shape (m, k) where each row is the k-dimensional latent representation of a user.
Item factor matrix Q	Matrix of shape (n, k) where each row is the k-dimensional latent representation of an item.
User bias	A scalar $b_u$ capturing how much user $u$ rates above or below the global mean, independent of item quality or taste alignment.
Item bias	A scalar $b_i$ capturing the inherent popularity or quality of item $i$ independent of individual user taste.
SGD	Stochastic Gradient Descent. Processes one (or a small batch of) training examples at a time and updates all parameters incrementally.
ALS	Alternating Least Squares. Alternates between analytically solving for user factors (with item factors fixed) and item factors (with user factors fixed).
BPR	Bayesian Personalized Ranking. An optimization criterion that directly maximizes AUC of the ranking by training on user-positive-negative item triples.
Implicit feedback	Interaction signals without explicit preference labels: clicks, plays, purchases, page views. Requires different modeling assumptions than explicit ratings.
Confidence weight	In implicit ALS, $c_{ui} = 1 + \alpha \cdot r_{ui}$ - the strength of evidence that user $u$ prefers item $i$ , scaled by raw interaction count.
Cold start	The inability of a trained model to make predictions for new users or items with no interaction history, since their factor vectors are unknown.
Rank-k approximation	An approximation to a matrix using only k singular values/components. Matrix factorization learns the best rank-k approximation that minimizes reconstruction error on observed entries.

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Matrix Factorization demo on the EngineersOfAI Playground - no code required.

:::

The Real Interview Moment​

Why Matrix Factorization Beats Memory-Based CF​

The Decomposition​

Adding Biases​

The Loss Function​

Optimization: SGD​

Optimization: ALS​

System Architecture​

NumPy From Scratch: Full SGD Matrix Factorization​

Practical Code: surprise and implicit Libraries​

Explicit Ratings with surprise​

Implicit Feedback with the implicit Library​

Implicit Feedback: Bayesian Personalized Ranking (BPR)​

Production Engineering Notes​

Spark ALS for Distributed Training​

Handling New Users and Items: Warm Start​

Time-Aware Matrix Factorization​

The Relationship Between MF and Word2Vec​

Common Mistakes​

YouTube Resources​

Interview Q&A​

Q1: Why did matrix factorization beat memory-based collaborative filtering on the Netflix Prize?​

Q2: When should you use ALS vs SGD for matrix factorization?​

Q3: How do you handle implicit feedback in matrix factorization?​

Q4: How does regularization prevent overfitting in MF and how do you tune it?​

Q5: How do you handle the cold-start problem in matrix factorization?​

Q6: What is the relationship between matrix factorization and classical SVD?​

Key Takeaways​

Hyperparameter Reference​

Quick-Reference: MF Algorithm Comparison​

Glossary​