Neural Collaborative Filtering - Beyond the Dot Product

Reading time: ~35 minutes | Level: Recommender Systems | Role: MLE, AI Engineer, Data Scientist

The Moment That Changed Recommendations

Singapore, 2017. Xiangnan He sits in his office at the National University of Singapore, staring at a matrix factorization model that just posted state-of-the-art results on MovieLens. The numbers are good. The paper reviewers would likely accept. He could have stopped there.

But something nagged at him. The entire field had spent a decade refining the same fundamental operation: $\hat{r}_{ui} = \vec{p}_u \cdot \vec{q}_i$ . A dot product. User embedding dotted with item embedding. It worked. Everyone used it. Netflix Prize teams used it. Spotify used it. Amazon used it. But He noticed something that seemed obvious once you wrote it down: a dot product is just a weighted sum. It is, at its core, a linear operation. And linear models cannot capture all the patterns that exist between users and items.

He imagined two users - call them Alice and Bob. Alice loves cerebral sci-fi and hates rom-coms. Bob loves action-packed blockbusters and hates arthouse films. In an embedding space, their taste vectors point in completely different directions. Their dot product should be near zero. But both have watched hundreds of Marvel movies. Due to the magnitude of their Marvel-preference vectors, their dot product could actually be quite high - fooling the model into thinking they have similar taste when they fundamentally do not. A linear model cannot distinguish between "similar in all dimensions" and "different in most dimensions but aligned on one high-magnitude feature."

He wrote one sentence in his notebook that would become the core thesis of a paper cited over 7,000 times: "The inner product, which simply combines the multiplication of latent features linearly, may not be sufficient to capture the complex structure of user interaction data." The solution he proposed - Neural Collaborative Filtering (NCF) - would replace the rigid dot product with a learned nonlinear function, giving the model the expressiveness to capture interactions that linear algebra fundamentally cannot represent.

That paper, "Neural Collaborative Filtering" at WWW 2017, became a reference point for the deep recommendation architectures that followed at companies like YouTube, TikTok, and Amazon. Understanding NCF is not just an academic exercise - it is understanding ideas that still shape the systems deciding what a billion people watch, buy, and read every day.

Why This Exists

To understand why NCF matters, you need to feel the constraint of matrix factorization (MF) rather than just know it abstractly.

In standard MF, each user $u$ gets a latent vector $\vec{p}_u \in \mathbb{R}^k$ and each item $i$ gets a latent vector $\vec{q}_i \in \mathbb{R}^k$ . The predicted rating is:

$\hat{r}_{ui} = \vec{p}_u \cdot \vec{q}_i = \sum_{f=1}^{k} p_{uf} \cdot q_{if}$

This is elegant and efficient. But consider what happens in the embedding space when you have three users: Alice ( $\vec{p}_A$ ), Bob ( $\vec{p}_B$ ), and Charlie ( $\vec{p}_C$ ).

Suppose Alice and Bob have overlapping taste - they both love the same obscure genre. So their embeddings are similar: $\text{sim}(A, B)$ is high. Now suppose Bob and Charlie also have overlapping taste in a different genre: $\text{sim}(B, C)$ is high. What does MF predict about Alice and Charlie?

Because the similarity in MF is defined entirely by the dot product (or cosine similarity) in a shared low-dimensional space, Alice and Charlie will be forced to have relatively high similarity too - even if their actual preferences are completely unrelated. This is the transitivity problem: the geometry of the embedding space imposes similarity relationships that may not exist in reality. If $\vec{p}_A \approx \vec{p}_B$ and $\vec{p}_B \approx \vec{p}_C$ , then $\vec{p}_A \approx \vec{p}_C$ must hold to some degree in Euclidean space.

Real user-item interactions do not obey this transitivity. You can love both jazz and metal without implying that jazz fans will love metal. But a dot-product model, trained to minimize reconstruction error, will inevitably distort embeddings to accommodate the transitivity that the geometry demands.

The fix is conceptually simple: instead of fixing the interaction function to be a dot product, let the model learn the interaction function. Replace the dot product with a neural network. Give the model the capacity to learn whatever function best explains the data - linear or not, transitive or not.

This is the core insight of NCF. Everything else is details.

Historical Context

The Paper

"Neural Collaborative Filtering" - Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, Tat-Seng Chua. WWW 2017.

He and his team at NUS proposed a general framework called NCF that subsumes matrix factorization as a special case and extends it through neural networks. The paper introduced three instantiations of the framework:

Generalized Matrix Factorization (GMF) - generalizes MF by adding a learned output layer over the element-wise product of embeddings.
Multi-Layer Perceptron (MLP) - replaces the dot product entirely with a deep network that learns from concatenated embeddings.
NeuMF - the crown jewel: a fusion model that runs both GMF and MLP in parallel and combines their outputs for the final prediction.

The Data Setting

NCF was designed for implicit feedback - the kind of data that actually exists at scale. You do not have star ratings for most items. You have clicks, views, purchases, time-spent. A click is a weak positive signal. The absence of a click is not necessarily a negative - the user might not have seen the item. This framing forced the team to think carefully about training signals, leading to the binary cross-entropy formulation with negative sampling that the paper uses throughout.

The Impact

The paper triggered a wave of neural recommendation research, and in the years around it several large platforms published neural recommendation architectures of their own:

YouTube (2016): Covington et al. used deep neural networks for candidate generation and ranking. This work predates the NCF paper and developed in parallel.
Alibaba (2018): Deep Interest Network (DIN) added attention over the user's behavior history to a learned interaction model.
Pinterest (2018): PinSage applied graph convolutional networks to learn item embeddings at web scale.

NCF did not just propose one model. It established a language and a framework for thinking about recommendation as a function approximation problem - and that language is still used today.

Core Concepts

Concept 1: The Interaction Function Abstraction

NCF frames recommendation as learning a function:

$\hat{y}_{ui} = f(u, i \mid \Theta)$

where $\hat{y}_{ui} \in [0, 1]$ is the predicted probability that user $u$ interacts with item $i$ , and $\Theta$ is the set of all learnable parameters.

The key insight: the choice of $f$ is the choice of inductive bias you impose on user-item interactions. MF chooses $f = \text{dot product}$ , which assumes that the interaction can be fully captured by a sum of pairwise latent factor products. NCF chooses $f = \text{neural network}$ , which makes almost no assumption beyond the representational capacity of the architecture.

In NCF, the pipeline begins identically to MF: user $u$ and item $i$ are each represented as a one-hot vector and passed through embedding layers:

$\vec{p}_u = \mathbf{P}^\top \cdot \vec{v}_u^U, \quad \vec{q}_i = \mathbf{Q}^\top \cdot \vec{v}_i^I$

where $\mathbf{P} \in \mathbb{R}^{M \times k}$ and $\mathbf{Q} \in \mathbb{R}^{N \times k}$ are the user and item embedding matrices, $M$ is the number of users, $N$ is the number of items, and $k$ is the embedding dimension. The divergence from MF happens in what comes next.
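The embedding step can be checked numerically. This is a minimal NumPy sketch with toy sizes (`M = 4`, `k = 3`) chosen only for illustration; the variable names are assumptions, not part of the paper. It shows that multiplying by a one-hot vector simply selects one row of the embedding matrix, which is why frameworks implement this step as a table lookup.

```python
import numpy as np

rng = np.random.default_rng(0)
M, k = 4, 3                      # toy sizes (assumed for illustration)
P = rng.normal(size=(M, k))      # user embedding matrix, one row per user

u = 2                            # index of the user we want to embed
one_hot = np.zeros(M)
one_hot[u] = 1.0                 # the one-hot vector v_u

p_u_matmul = P.T @ one_hot       # P^T v_u, as written in the formula
p_u_lookup = P[u]                # direct row lookup

print(np.allclose(p_u_matmul, p_u_lookup))
```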

Concept 2: Generalized Matrix Factorization (GMF)

Standard MF computes $\hat{r}_{ui} = \vec{p}_u \cdot \vec{q}_i$ . You can rewrite this as an element-wise product followed by a sum:

$\hat{r}_{ui} = \mathbf{1}^\top (\vec{p}_u \odot \vec{q}_i)$

where $\odot$ denotes element-wise (Hadamard) product. In this form, you can see that MF is just a fixed linear combination of the element-wise interactions between latent dimensions.

GMF generalizes this by replacing the fixed sum with a learnable output layer:

$\hat{y}_{ui}^{GMF} = \sigma\left(\vec{h}^\top (\vec{p}_u^{(G)} \odot \vec{q}_i^{(G)})\right)$

where $\vec{h} \in \mathbb{R}^k$ is a learnable weight vector and $\sigma$ is the sigmoid function. The superscript $(G)$ indicates these are the GMF-specific embedding matrices (separate from the MLP embeddings).

This single change has a profound effect: instead of treating each latent dimension's contribution equally (as MF does with the unit weight vector $\mathbf{1}$ ), GMF learns to weight the importance of each dimension's interaction. Some dimensions of the user-item product may matter much more than others. GMF learns which ones.

When $\vec{h} = \mathbf{1}$ and no activation is applied, GMF reduces exactly to standard MF - making MF a special case of this framework.
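A small NumPy sketch can make the special case concrete. The function name `gmf_score` and the random toy vectors are assumptions made for illustration; in a trained model `h` would be learned rather than sampled.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gmf_score(p_u, q_i, h, activation=sigmoid):
    """GMF: activation(h^T (p_u * q_i)), with * the element-wise product."""
    return activation(h @ (p_u * q_i))

rng = np.random.default_rng(1)
k = 8
p_u, q_i = rng.normal(size=k), rng.normal(size=k)

# With h = 1 and an identity activation, GMF is exactly the MF dot product.
mf = p_u @ q_i
gmf_as_mf = gmf_score(p_u, q_i, h=np.ones(k), activation=lambda x: x)

# With a non-uniform h (random here, learned in practice), each latent
# dimension contributes with its own weight.
h = rng.normal(size=k)
gmf = gmf_score(p_u, q_i, h)

print(np.isclose(mf, gmf_as_mf), gmf)
```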

Concept 3: The MLP Branch

The MLP branch takes a fundamentally different approach. Instead of computing an element-wise product, it concatenates the user and item embeddings and feeds the concatenation through a series of fully connected layers:

$\phi_0 = \begin{bmatrix} \vec{p}_u^{(M)} \\ \vec{q}_i^{(M)} \end{bmatrix}, \quad \phi_\ell = \text{ReLU}\left(W_\ell \phi_{\ell-1} + \vec{b}_\ell\right) \text{ for } \ell = 1, \ldots, L$

$\hat{y}_{ui}^{MLP} = \sigma\left(\vec{h}_{MLP}^\top \phi_L\right)$

where $\vec{h}_{MLP}$ is the weight vector of the output layer. The input $\phi_0$ has dimension $2k$ (user embedding dimension plus item embedding dimension). Each subsequent layer reduces the dimension by a factor, typically halving it - a tower architecture - until you reach the output.

Why concatenation rather than element-wise product? The element-wise product requires both embeddings to have the same dimensionality and only ever multiplies dimension $f$ of the user with dimension $f$ of the item. Concatenation makes no such assumption - the MLP can, in principle, learn interactions between any combination of user and item embedding dimensions. That makes it more flexible, although the extra flexibility has to be learned from data.

The MLP can learn interactions that a dot product cannot represent. For example: "user embedding dimension 3 is high AND item embedding dimension 7 is low" - a conjunction of conditions across different dimensions. A dot product cannot express this, because it only multiplies matching dimensions (3 with 3, 7 with 7) and then adds the results; it has no term that couples user dimension 3 with item dimension 7.
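To illustrate the conjunction, here is a sketch of a single ReLU hidden unit over the concatenated embeddings. The weights are set by hand rather than learned, indices are 0-based, and embedding values are assumed to lie in [0, 1]; all names are hypothetical.

```python
import numpy as np

k = 8
W = np.zeros(2 * k)      # weights of one hidden unit over [p_u ; q_i]
W[3] = 1.0               # reads user dimension 3
W[k + 7] = -1.0          # reads item dimension 7 (offset by k in the concat)
b = -0.5                 # threshold: stays silent unless the condition holds

def hidden_unit(p_u, q_i):
    """ReLU(W . [p_u ; q_i] + b): positive only if p_u[3] is high AND q_i[7] is low."""
    return max(0.0, float(W @ np.concatenate([p_u, q_i]) + b))

def make(dim, value):
    """Toy embedding that is zero everywhere except at one dimension."""
    v = np.zeros(k)
    v[dim] = value
    return v

high_low = hidden_unit(make(3, 1.0), make(7, 0.0))    # condition met
high_high = hidden_unit(make(3, 1.0), make(7, 1.0))   # item dimension 7 is high
low_low = hidden_unit(make(3, 0.0), make(7, 0.0))     # user dimension 3 is low

print(high_low, high_high, low_low)
```

The unit is active only in the first case. A trained MLP would have to discover weights like these from data; the sketch only shows that the architecture can represent the pattern.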

The practical downside is computational: each MLP layer adds $d_\ell \cdot d_{\ell-1}$ weights (the product of adjacent layer widths), whereas GMF adds only the $k$ weights of $\vec{h}$ on top of its embeddings. This is the fundamental tension NCF navigates.

Concept 4: NeuMF - The Fusion Model

The full NeuMF model runs GMF and MLP in parallel with separate embedding matrices, then concatenates their final hidden representations and passes them through a single output layer:

$\hat{y}_{ui} = \sigma\left(\vec{h}^\top \begin{bmatrix} \phi^{GMF} \\ \phi^{MLP} \end{bmatrix}\right)$

where $\phi^{GMF} = \vec{p}_u^{(G)} \odot \vec{q}_i^{(G)}$ and $\phi^{MLP}$ is the output of the last MLP hidden layer (before the sigmoid).

The intuition: GMF is good at capturing linear correlations between latent factors. MLP is good at capturing complex nonlinear interactions. Combining them lets each branch do what it does best, and the output layer learns how to weight the contribution of each.

Crucially, the GMF and MLP branches use separate embedding matrices. If they shared embeddings, they would constrain each other - the embedding that is optimal for linear interactions (GMF) may not be optimal for nonlinear interactions (MLP), and vice versa. Separate embeddings give each branch the freedom to learn the representation that works best for its own interaction function.

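The total parameter count of NeuMF is: $|\Theta| = M \cdot k_G + N \cdot k_G + M \cdot k_M + N \cdot k_M + \sum_{\ell=1}^{L}(d_\ell \cdot d_{\ell-1} + d_\ell) + (k_G + d_L)$

where $k_G$ is the GMF embedding dimension, $k_M$ is the MLP embedding dimension, $d_\ell$ is the width of MLP layer $\ell$ , and $d_0 = 2k_M$ is the width of the concatenated MLP input. The terms are, in order, the GMF embeddings, the MLP embeddings, the MLP weights and biases, and the output vector $\vec{h}$ over the fused representation.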
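As a rough worked example, the sketch below counts the embedding tables, the MLP weights and biases, and one output weight per fused feature. The function name is hypothetical, and the sizes are the MovieLens-1M figures and layer widths used later in this lesson.

```python
def neumf_param_count(M, N, k_G, k_M, mlp_widths):
    """Count embedding, MLP, and output-layer weights of a NeuMF model."""
    gmf_embeddings = (M + N) * k_G
    mlp_embeddings = (M + N) * k_M
    mlp = 0
    d_prev = 2 * k_M                     # MLP input is the concatenation [p_u ; q_i]
    for d in mlp_widths:
        mlp += d * d_prev + d            # weights + biases of one layer
        d_prev = d
    output = k_G + d_prev                # one output weight per fused feature
    return gmf_embeddings + mlp_embeddings + mlp + output

total = neumf_param_count(M=6040, N=3706, k_G=64, k_M=64, mlp_widths=[128, 64, 32])
print(f"{total:,}")
```

This gives 1,274,432 parameters, almost all of them in the embedding tables. An implementation that adds BatchNorm layers or an output bias will report a slightly higher number.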

Concept 5: Training on Implicit Feedback

NCF is trained on implicit feedback - binary signals where $y_{ui} = 1$ if user $u$ interacted with item $i$ and $y_{ui} = 0$ for non-interacted pairs (negatives).

The loss function is binary cross-entropy:

$\mathcal{L} = -\sum_{(u,i) \in \mathcal{O}^+} \log \hat{y}_{ui} - \sum_{(u,j) \in \mathcal{O}^-} \log(1 - \hat{y}_{uj})$

where $\mathcal{O}^+$ is the set of observed positive interactions and $\mathcal{O}^-$ is the set of sampled negative examples.
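The loss can be evaluated by hand on a toy batch. This NumPy sketch assumes the model outputs are already probabilities in (0, 1); the function name and the example scores are illustrative only.

```python
import numpy as np

def bce_loss(pos_scores, neg_scores, eps=1e-12):
    """Summed binary cross-entropy over positives and sampled negatives."""
    pos_scores = np.clip(pos_scores, eps, 1 - eps)   # avoid log(0)
    neg_scores = np.clip(neg_scores, eps, 1 - eps)
    return -(np.log(pos_scores).sum() + np.log(1.0 - neg_scores).sum())

# Toy predictions (assumed values): two positives, four sampled negatives.
confident = bce_loss(np.array([0.9, 0.8]), np.array([0.1, 0.2, 0.1, 0.05]))
confused = bce_loss(np.array([0.5, 0.5]), np.array([0.5, 0.5, 0.5, 0.5]))

print(confident, confused)
```

A model that predicts 0.5 everywhere pays $\log 2$ per sample, while a model that separates positives from negatives pays much less.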

:::note
You cannot train on all non-interactions as negatives. For a user with 100 interactions in a catalog of 1 million items, there are 999,900 non-interactions. Training on all of them would create a massive class imbalance and make training prohibitively slow. Instead, you sample a fixed number of negatives per positive.
:::

Negative sampling strategies (ordered from weakest to strongest):

Uniform random: sample items the user has not interacted with uniformly at random. Fast and simple, but easy for the model - most random items are obviously irrelevant.
Popularity-weighted: sample items in proportion to a smoothed interaction frequency, $q(i) \propto \text{freq}(i)^\alpha$ where $0 < \alpha \leq 1$ . Popular items make harder negatives: the user is more likely to have seen them, so the missing interaction carries more signal than it does for an obscure item.
Hard negatives: sample items the model currently scores highly but the user has not interacted with. These are the most informative training signals but require an extra forward pass to generate, which can make training roughly 2x more expensive.

The original NCF paper uses uniform random sampling with 4 negatives per positive as the default. In practice, popularity-weighted sampling with $\alpha = 0.75$ (the same trick word2vec uses) is a simple upgrade that often improves ranking metrics.
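Hard negative selection is the least obvious of the three strategies, so here is a minimal sketch for a single user. It assumes you already have one model score per item; the function name and the toy scores are hypothetical, and a real pipeline would usually score a sampled candidate pool rather than the full catalog.

```python
import numpy as np

def hard_negatives(scores, positives, n):
    """Return the n highest-scoring items the user has not interacted with."""
    masked = scores.astype(float).copy()
    pos_idx = np.array(sorted(positives), dtype=int)
    masked[pos_idx] = -np.inf                # never pick known positives
    order = np.argsort(-masked)              # item ids, descending by model score
    return order[:n].tolist()

# Toy model scores for 6 items (assumed values); the user interacted with items 0 and 3.
scores = np.array([0.95, 0.10, 0.70, 0.90, 0.40, 0.65])
negs = hard_negatives(scores, positives={0, 3}, n=2)

print(negs)
```

Items 2 and 5 are selected: they are the items the model currently ranks highest among those the user never touched.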

Architecture Diagram

Implementation: NeuMF from Scratch

Dataset: MovieLens-1M

Before writing the model, let's set up the data pipeline. MovieLens-1M has ~1 million ratings from 6,040 users on 3,706 movies. We convert explicit ratings to implicit feedback (any rating = interaction).

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# ─── Data Loading ────────────────────────────────────────────────────────────

def load_movielens_1m(path: str) -> pd.DataFrame:
    """Load MovieLens 1M ratings and convert to implicit feedback."""
    ratings = pd.read_csv(
        path,
        sep="::",
        engine="python",
        names=["user_id", "movie_id", "rating", "timestamp"],
    )
    # Implicit feedback: every rating counts as a positive interaction
    # (the 0/1 labels are assigned later, in NCFDataset).
    # Re-index users and items to 0-based contiguous integers
    ratings["user_idx"] = ratings["user_id"].astype("category").cat.codes
    ratings["item_idx"] = ratings["movie_id"].astype("category").cat.codes
    return ratings


class NCFDataset(Dataset):
    """
    Implicit feedback dataset with negative sampling.

    For each positive (user, item) pair, we sample `num_negatives`
    items the user has NOT interacted with.
    """

    def __init__(
        self,
        interactions: pd.DataFrame,
        num_users: int,
        num_items: int,
        num_negatives: int = 4,
        negative_sampling: str = "uniform",  # "uniform" or "popularity"
    ):
        self.num_users = num_users
        self.num_items = num_items
        self.num_negatives = num_negatives
        self.negative_sampling = negative_sampling

        # Build per-user positive item set for efficient negative sampling
        self.user_positives: dict[int, set[int]] = (
            interactions.groupby("user_idx")["item_idx"]
            .apply(set)
            .to_dict()
        )

        # Compute item popularity for popularity-weighted sampling
        counts = np.bincount(
            interactions["item_idx"].to_numpy(), minlength=num_items
        ).astype(np.float64)
        # Smooth with alpha=0.75 (word2vec trick)
        probs = counts ** 0.75
        self.item_probs = probs / probs.sum()

        # Build flat list of (user, item, label) samples
        self.samples = self._build_samples(interactions)

    def _sample_negatives(self, user_idx: int, n: int) -> list[int]:
        positives = self.user_positives.get(user_idx, set())
        negatives = []
        attempts = 0
        while len(negatives) < n and attempts < n * 20:
            if self.negative_sampling == "popularity":
                candidate = np.random.choice(self.num_items, p=self.item_probs)
            else:
                candidate = np.random.randint(0, self.num_items)
            if candidate not in positives:
                negatives.append(candidate)
            attempts += 1
        return negatives

    def _build_samples(self, interactions: pd.DataFrame) -> list[tuple]:
        samples = []
        for _, row in interactions.iterrows():
            u, i = int(row["user_idx"]), int(row["item_idx"])
            # Positive sample
            samples.append((u, i, 1.0))
            # Negative samples
            for neg_item in self._sample_negatives(u, self.num_negatives):
                samples.append((u, neg_item, 0.0))
        return samples

    def __len__(self) -> int:
        return len(self.samples)

    def __getitem__(self, idx: int):
        user, item, label = self.samples[idx]
        return (
            torch.tensor(user, dtype=torch.long),
            torch.tensor(item, dtype=torch.long),
            torch.tensor(label, dtype=torch.float32),
        )

NeuMF Model

class NeuMF(nn.Module):
    """
    Neural Matrix Factorization (NeuMF).

    Combines a Generalized Matrix Factorization (GMF) branch and
    a Multi-Layer Perceptron (MLP) branch, with separate embedding
    matrices for each branch.

    Reference: He et al. (2017) "Neural Collaborative Filtering"
    """

    def __init__(
        self,
        num_users: int,
        num_items: int,
        gmf_dim: int = 64,
        mlp_dim: int = 64,
        mlp_layers: tuple[int, ...] = (128, 64, 32),  # immutable default
        dropout: float = 0.2,
    ):
        super().__init__()

        self.num_users = num_users
        self.num_items = num_items

        # ── GMF embeddings ─────────────────────────────────────────────────
        self.gmf_user_emb = nn.Embedding(num_users, gmf_dim)
        self.gmf_item_emb = nn.Embedding(num_items, gmf_dim)

        # ── MLP embeddings ─────────────────────────────────────────────────
        # Input to first MLP layer is 2 * mlp_dim (concatenation)
        self.mlp_user_emb = nn.Embedding(num_users, mlp_dim)
        self.mlp_item_emb = nn.Embedding(num_items, mlp_dim)

        # ── MLP layers ─────────────────────────────────────────────────────
        mlp_input_dim = mlp_dim * 2
        layers = []
        in_dim = mlp_input_dim
        for out_dim in mlp_layers:
            layers.extend([
                nn.Linear(in_dim, out_dim),
                nn.BatchNorm1d(out_dim),
                nn.ReLU(),
                nn.Dropout(p=dropout),
            ])
            in_dim = out_dim
        self.mlp = nn.Sequential(*layers)

        # ── Output layer ───────────────────────────────────────────────────
        # Input = GMF output (gmf_dim) + MLP output (mlp_layers[-1])
        self.output_layer = nn.Linear(gmf_dim + mlp_layers[-1], 1)

        self._init_weights()

    def _init_weights(self):
        """
        He et al. recommend initializing NeuMF from pre-trained
        GMF and MLP models. As a fallback, use normal init for
        embeddings and Xavier for linear layers.
        """
        for module in self.modules():
            if isinstance(module, nn.Embedding):
                nn.init.normal_(module.weight, std=0.01)
            elif isinstance(module, nn.Linear):
                nn.init.xavier_uniform_(module.weight)
                if module.bias is not None:
                    nn.init.zeros_(module.bias)

    def forward(
        self,
        user_ids: torch.Tensor,
        item_ids: torch.Tensor,
    ) -> torch.Tensor:
        # ── GMF branch ─────────────────────────────────────────────────────
        p_gmf = self.gmf_user_emb(user_ids)   # (B, gmf_dim)
        q_gmf = self.gmf_item_emb(item_ids)   # (B, gmf_dim)
        gmf_out = p_gmf * q_gmf               # element-wise product (B, gmf_dim)

        # ── MLP branch ─────────────────────────────────────────────────────
        p_mlp = self.mlp_user_emb(user_ids)   # (B, mlp_dim)
        q_mlp = self.mlp_item_emb(item_ids)   # (B, mlp_dim)
        mlp_input = torch.cat([p_mlp, q_mlp], dim=-1)  # (B, 2*mlp_dim)
        mlp_out = self.mlp(mlp_input)          # (B, mlp_layers[-1])

        # ── Fusion and output ──────────────────────────────────────────────
        fused = torch.cat([gmf_out, mlp_out], dim=-1)  # (B, gmf_dim + mlp_layers[-1])
        logit = self.output_layer(fused).squeeze(-1)    # (B,)
        return torch.sigmoid(logit)

    def get_user_embeddings(self, user_ids: torch.Tensor) -> torch.Tensor:
        """Useful for analysis and visualization."""
        return torch.cat([
            self.gmf_user_emb(user_ids),
            self.mlp_user_emb(user_ids),
        ], dim=-1)

Training Loop

def train_neumf(
    model: NeuMF,
    train_loader: DataLoader,
    val_interactions: pd.DataFrame,
    num_users: int,
    num_items: int,
    num_epochs: int = 20,
    lr: float = 1e-3,
    device: str = "cuda" if torch.cuda.is_available() else "cpu",
) -> dict:
    """
    Train NeuMF with binary cross-entropy loss.
    Evaluates Hit Rate @ 10 and NDCG @ 10 on the validation set.
    """
    model = model.to(device)
    optimizer = optim.Adam(model.parameters(), lr=lr, weight_decay=1e-6)
    criterion = nn.BCELoss()
    scheduler = optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="max", patience=3, factor=0.5
    )

    history = {"train_loss": [], "hr@10": [], "ndcg@10": []}

    for epoch in range(num_epochs):
        # ── Training ───────────────────────────────────────────────────────
        model.train()
        total_loss = 0.0

        for batch_users, batch_items, batch_labels in train_loader:
            batch_users = batch_users.to(device)
            batch_items = batch_items.to(device)
            batch_labels = batch_labels.to(device)

            optimizer.zero_grad()
            preds = model(batch_users, batch_items)
            loss = criterion(preds, batch_labels)
            loss.backward()

            # Gradient clipping helps with embedding layers
            nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimizer.step()
            total_loss += loss.item() * len(batch_labels)

        avg_loss = total_loss / len(train_loader.dataset)

        # ── Evaluation (Hit Rate and NDCG @ 10) ───────────────────────────
        hr, ndcg = evaluate_model(
            model, val_interactions, num_users, num_items, k=10, device=device
        )
        scheduler.step(hr)

        history["train_loss"].append(avg_loss)
        history["hr@10"].append(hr)
        history["ndcg@10"].append(ndcg)

        print(
            f"Epoch {epoch+1:>2}/{num_epochs} | "
            f"Loss: {avg_loss:.4f} | "
            f"HR@10: {hr:.4f} | "
            f"NDCG@10: {ndcg:.4f}"
        )

    return history


def evaluate_model(
    model: NeuMF,
    val_interactions: pd.DataFrame,
    num_users: int,
    num_items: int,
    k: int = 10,
    num_neg_eval: int = 99,
    device: str = "cpu",
) -> tuple[float, float]:
    """
    Leave-one-out evaluation protocol.

    For each user:
    1. Take their most recent interaction as the ground truth positive.
    2. Sample 99 random negatives.
    3. Score all 100 items with the model.
    4. Check if the ground truth item appears in the top-k.
    """
    model.eval()
    hit_count = 0
    ndcg_total = 0.0
    num_users_evaluated = 0

    with torch.no_grad():
        # Get the last interaction per user as ground truth
        val_df = (
            val_interactions
            .sort_values("timestamp")
            .groupby("user_idx")
            .last()
            .reset_index()
        )

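        # NOTE: only items in `val_interactions` are excluded below, so a sampled
        # "negative" may be an item the user interacted with in the training set.
        user_positives = (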
            val_interactions.groupby("user_idx")["item_idx"]
            .apply(set)
            .to_dict()
        )

        for _, row in val_df.iterrows():
            u = int(row["user_idx"])
            gt_item = int(row["item_idx"])
            known_positives = user_positives.get(u, set())

            # Sample negative items not in user's history
            neg_items = []
            while len(neg_items) < num_neg_eval:
                candidate = np.random.randint(0, num_items)
                if candidate not in known_positives:
                    neg_items.append(candidate)

            all_items = [gt_item] + neg_items
            users_tensor = torch.full(
                (len(all_items),), u, dtype=torch.long, device=device
            )
            items_tensor = torch.tensor(all_items, dtype=torch.long, device=device)

            scores = model(users_tensor, items_tensor).cpu().numpy()
            # Rank items by score descending
            ranked_indices = np.argsort(-scores)
            ranked_items = [all_items[idx] for idx in ranked_indices]

            # Hit Rate @ k
            if gt_item in ranked_items[:k]:
                hit_count += 1
                # NDCG @ k: log2(2) = 1 for rank 1, log2(3) for rank 2, ...
                rank = ranked_items.index(gt_item) + 1
                ndcg_total += 1.0 / np.log2(rank + 1)

            num_users_evaluated += 1

    hr = hit_count / num_users_evaluated
    ndcg = ndcg_total / num_users_evaluated
    return hr, ndcg
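To make the two metrics concrete, here is a pure-Python sketch for a single user with a single ground-truth item. The helper name and the toy item ids are assumptions for illustration; the formulas match the ones used in `evaluate_model`.

```python
import math

def hit_and_ndcg(ranked_items, gt_item, k=10):
    """Leave-one-out metrics for one user with one ground-truth item."""
    top_k = ranked_items[:k]
    if gt_item not in top_k:
        return 0, 0.0
    rank = top_k.index(gt_item) + 1          # 1-based position in the ranking
    return 1, 1.0 / math.log2(rank + 1)

# Ground truth ranked third among the candidates (toy item ids).
hit, ndcg = hit_and_ndcg([42, 7, 99, 13, 5], gt_item=99, k=3)
# Ground truth ranked fourth, so it falls outside the top 3.
miss = hit_and_ndcg([42, 7, 99, 13, 5], gt_item=13, k=3)

print(hit, ndcg, miss)
```

With one relevant item the ideal DCG is 1, so NDCG reduces to $1 / \log_2(\text{rank} + 1)$: 1.0 at rank 1, 0.5 at rank 3, and 0 outside the top k.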

Putting It All Together

if __name__ == "__main__":
    # Load data
    ratings = load_movielens_1m("ml-1m/ratings.dat")

    num_users = ratings["user_idx"].nunique()
    num_items = ratings["item_idx"].nunique()
    print(f"Users: {num_users}, Items: {num_items}")

    # Leave-one-out split by time: hold out each user's most recent
    # interaction for validation and train on everything before it.
    ratings = ratings.sort_values("timestamp")
    val_df = ratings.groupby("user_idx").tail(1)
    train_df = ratings.drop(val_df.index)

    # Build dataset with negative sampling
    train_dataset = NCFDataset(
        train_df,
        num_users=num_users,
        num_items=num_items,
        num_negatives=4,
        negative_sampling="popularity",  # Better than uniform
    )
    train_loader = DataLoader(
        train_dataset,
        batch_size=256,
        shuffle=True,
        num_workers=4,
        pin_memory=True,
    )

    # Initialize NeuMF
    model = NeuMF(
        num_users=num_users,
        num_items=num_items,
        gmf_dim=64,
        mlp_dim=64,
        mlp_layers=[128, 64, 32],
        dropout=0.2,
    )

    print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")

    # Train
    history = train_neumf(
        model,
        train_loader,
        val_df,
        num_users=num_users,
        num_items=num_items,
        num_epochs=20,
        lr=1e-3,
    )

    # Save
    torch.save(model.state_dict(), "neumf_movielens.pth")

:::tip Pre-training Strategy (He et al.'s Recommendation)
The paper recommends pre-training the GMF and MLP branches separately, then initializing NeuMF from their weights before fine-tuning jointly. The authors report that this improves both convergence and final performance.

def init_from_pretrained(
    neumf: NeuMF,
    gmf_state_dict: dict,
    mlp_state_dict: dict,
):
    """Initialize NeuMF weights from separately pre-trained GMF and MLP models."""
    neumf_state = neumf.state_dict()

    # Copy GMF weights
    for key in ["gmf_user_emb.weight", "gmf_item_emb.weight"]:
        neumf_state[key] = gmf_state_dict[key]

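    # Copy MLP embeddings and hidden layers. This assumes the pre-trained MLP
    # model uses the same parameter names as NeuMF's MLP branch.
    mlp_prefixes = ("mlp_user_emb.", "mlp_item_emb.", "mlp.")
    for key, value in mlp_state_dict.items():
        if key.startswith(mlp_prefixes):
            neumf_state[key] = value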

    neumf.load_state_dict(neumf_state, strict=False)
    return neumf

:::
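The `init_from_pretrained` helper above copies the embeddings and MLP layers but leaves the output layer at its fresh initialization. He et al. also initialize the output weights by concatenating the two pre-trained output vectors, scaled by a trade-off $\alpha$ (0.5 in the paper). The sketch below shows only that arithmetic in NumPy; the function name and toy vectors are assumptions, and mapping the result onto `output_layer.weight` is left to the reader.

```python
import numpy as np

def fuse_output_weights(h_gmf, h_mlp, alpha=0.5):
    """Concatenate pre-trained output weights, traded off by alpha."""
    return np.concatenate([alpha * h_gmf, (1.0 - alpha) * h_mlp])

# Toy pre-trained output vectors (assumed shapes: gmf_dim=4, last MLP width=2).
h_gmf = np.array([0.2, -0.4, 0.6, 0.8])
h_mlp = np.array([1.0, -1.0])
h = fuse_output_weights(h_gmf, h_mlp)

print(h)
```

The GMF weights come first because the model concatenates the GMF output before the MLP output when building the fused vector.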

Production Engineering Notes

Why Big Tech Moved from NeuMF to Two-Tower

NeuMF was a research breakthrough, but it has a critical production limitation: it requires a forward pass for every user-item pair at serving time. To score all 1 million items for a single user, you need to evaluate the network on 1 million (user, item) pairs, and because the score is not a plain inner product you cannot shortcut the search with a nearest-neighbor index. Even with batching, this is usually too slow for real-time recommendation.

The solution that big tech converged on is the two-tower model (covered in the next lesson), which decouples user and item computations so that item embeddings can be precomputed offline and looked up in milliseconds using approximate nearest neighbor search.

NeuMF is not useless in production - it is often used as the ranking model in a two-stage pipeline where a two-tower retrieval model first narrows 10 million items down to 1,000 candidates, and then NeuMF-style models re-rank those 1,000. At this scale, 1,000 forward passes is very fast.
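The two-stage idea can be sketched in a few lines of NumPy. Everything here is a stand-in: brute-force dot products replace approximate nearest neighbor search, and `toy_ranker` is a hypothetical placeholder for a NeuMF-style ranking model.

```python
import numpy as np

def two_stage_recommend(user_vec, item_vecs, rerank_fn, n_candidates=100, k=10):
    """Retrieve candidates by dot product, then re-rank them with a heavier scorer."""
    # Stage 1: cheap retrieval over the whole catalog.
    retrieval_scores = item_vecs @ user_vec
    candidates = np.argsort(-retrieval_scores)[:n_candidates]
    # Stage 2: expensive scoring only on the shortlist.
    rerank_scores = rerank_fn(user_vec, item_vecs[candidates])
    return candidates[np.argsort(-rerank_scores)[:k]]

def toy_ranker(user_vec, cand_vecs):
    """Hypothetical stand-in for a learned nonlinear ranker."""
    return np.tanh(cand_vecs @ user_vec) + 0.1 * np.abs(cand_vecs).sum(axis=1)

rng = np.random.default_rng(0)
items = rng.normal(size=(10_000, 16))    # toy item embeddings
user = rng.normal(size=16)               # toy user embedding
top = two_stage_recommend(user, items, toy_ranker)

print(top)
```

The expensive scorer runs on 100 candidates instead of 10,000 items, which is the whole point of the split: the ranking model's cost scales with the shortlist, not the catalog.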

Embedding Dimension Choice

Empirically:

GMF dimension: 32–128 works well. Larger is not always better - the element-wise product structure limits how much the GMF branch can benefit from extra dimensions.
MLP dimension: 64–256. The MLP benefits more from larger embeddings because the concatenation allows the network to explore cross-dimensional interactions.
Layer widths: a tower structure (halving dimensions each layer) is standard and works well: [256, 128, 64, 32].
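A small helper can generate tower widths like these. The function name is hypothetical, and halving is just the convention described above, not a requirement of the model.

```python
def tower_widths(first_width, num_layers):
    """Layer widths that halve at each step, e.g. 256 -> 128 -> 64 -> 32."""
    widths = [first_width // (2 ** i) for i in range(num_layers)]
    if widths[-1] < 1:
        raise ValueError("too many layers for this starting width")
    return widths

widths = tower_widths(256, 4)
print(widths)
```

The resulting list can be passed as `mlp_layers` to the `NeuMF` constructor; the MLP embedding dimension is chosen separately, since the first layer accepts an input of width `2 * mlp_dim`.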

Batch Size and Learning Rate

NCF is sensitive to the batch size/learning rate combination:

Too small a batch: gradient estimates are noisy and BatchNorm statistics become unreliable, leading to slow or unstable convergence. Note that the negatives-per-positive ratio is fixed by the sampler, not by the batch size.
Too large a batch: the model takes fewer update steps per epoch, so the learning rate usually has to be retuned to keep the same ranking quality.
A batch size of 256 with Adam at lr=1e-3 is a reasonable starting point for MovieLens-scale data.
For larger datasets (tens of millions of interactions), batch sizes of 1024–4096 are common.

Handling Popularity Bias

NeuMF, like all collaborative filtering methods, inherits popularity bias from the data. Popular items receive more training signal and tend to be recommended more frequently than their actual relevance warrants. Mitigation strategies:

Popularity-weighted negative sampling: oversample popular items as negatives.
IPS (Inverse Propensity Scoring): upweight losses on interactions with less popular items.
Explicit debiasing layers: some production systems add a counterfactual debiasing component at the output layer.
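As an illustration of the IPS idea, the sketch below weights each sample's BCE loss by a clipped inverse propensity and normalizes by the total weight. It assumes the propensities (exposure probabilities) are already estimated, which is the hard part in practice; the function name, the clipping threshold, and the toy values are all assumptions.

```python
import numpy as np

def ips_weighted_bce(preds, labels, propensities, max_weight=10.0, eps=1e-12):
    """Mean BCE where each sample is weighted by a clipped inverse propensity."""
    preds = np.clip(preds, eps, 1 - eps)
    weights = np.minimum(1.0 / propensities, max_weight)   # clip to limit variance
    losses = -(labels * np.log(preds) + (1 - labels) * np.log(1 - preds))
    return float((weights * losses).sum() / weights.sum())

# Toy batch (assumed values): the same prediction on a popular and a niche item.
preds = np.array([0.6, 0.6])
labels = np.array([1.0, 1.0])
propensities = np.array([0.5, 0.05])     # the popular item is exposed 10x more often
weights = np.minimum(1.0 / propensities, 10.0)
loss = ips_weighted_bce(preds, labels, propensities)

print(weights, loss)
```

The niche item receives five times the weight of the popular one after clipping, so errors on rarely exposed items contribute more to the gradient.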

Common Mistakes

:::danger Using MSE Loss on Implicit Data
A common error when implementing NCF for the first time is using mean squared error loss as if you had explicit ratings.

Wrong:

loss = nn.MSELoss()(predictions, labels)

Right:

loss = nn.BCELoss()(predictions, labels)  # labels are 0 or 1

Implicit feedback is not a rating. There is no ground truth "correct score" for a non-interaction. The BCE formulation frames the problem as binary classification: "given this user and item, what is the probability of interaction?" Squared loss instead implicitly assumes real-valued targets with Gaussian noise, which is a poor match for labels that are only ever 0 or 1, many of which are unobserved preferences rather than true negatives.
:::

:::danger Not Doing Negative Sampling at All

Some implementations train only on positive samples, treating it as a one-class problem. This produces a model that predicts high scores for everything because it is never penalized for false positives.

Wrong:

# Training loop that only uses positive interactions
for user, item in positive_pairs:
    pred = model(user, item)
    loss = -torch.log(pred)  # only positive loss

Always sample and include negatives. The ratio of 4 negatives per positive from the original paper is a solid default.

:::
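For contrast with the positives-only loop, here is a minimal sketch of building a training set that does include negatives. It uses uniform sampling and plain Python so it runs without a model. `build_training_pairs` and its parameters are hypothetical names chosen for illustration.

```python
import random


def build_training_pairs(positive_pairs, num_items, neg_ratio=4, seed=0):
    """Return shuffled (user, item, label) triples with neg_ratio negatives per positive."""
    rng = random.Random(seed)

    # Collect each user's interacted items so they are never drawn as negatives.
    seen = {}
    for user, item in positive_pairs:
        seen.setdefault(user, set()).add(item)

    samples = []
    for user, item in positive_pairs:
        samples.append((user, item, 1.0))
        if len(seen[user]) >= num_items:
            continue  # this user interacted with every item; no negatives exist
        for _ in range(neg_ratio):
            neg = rng.randrange(num_items)
            while neg in seen[user]:  # reject items the user has interacted with
                neg = rng.randrange(num_items)
            samples.append((user, neg, 0.0))

    rng.shuffle(samples)
    return samples


pairs = build_training_pairs([(0, 1), (0, 2), (1, 3)], num_items=50)
print(len(pairs))  # 15 = 3 positives * (1 + 4 negatives)
```

The float labels can be passed straight to a BCE loss. Calling the function again with a different seed each epoch gives the model fresh negatives, which is a common choice.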

:::warning Random Negatives Are Too Easy

Uniform random negatives work but leave significant performance on the table. A random item from a million-item catalog is almost certainly irrelevant to the user, so the model learns almost nothing from these easy negatives after the first few epochs.

Upgrade to popularity-weighted negatives with $\alpha = 0.75$, which often improves ranking quality at negligible extra cost. For an even larger gain (at the cost of training complexity), implement hard negative mining: periodically score all items and use the high-scoring non-interactions as negatives.

:::
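A popularity-weighted sampler with $\alpha = 0.75$ can be sketched with the standard library alone. The names below (`make_popularity_sampler`, `sample`) are illustrative, and item frequencies are assumed to come from the training interactions.

```python
import random
from collections import Counter


def make_popularity_sampler(train_items, alpha=0.75, seed=0):
    """Build a sampler that draws item i with probability proportional to freq(i)**alpha."""
    counts = Counter(train_items)
    items = list(counts)
    weights = [counts[i] ** alpha for i in items]
    rng = random.Random(seed)

    def sample(user_history, k):
        """Draw k negatives for one user, skipping items in user_history."""
        if all(i in user_history for i in items):
            raise ValueError("user has interacted with every known item")
        negatives = []
        while len(negatives) < k:
            candidate = rng.choices(items, weights=weights, k=1)[0]
            if candidate not in user_history:
                negatives.append(candidate)
        return negatives

    return sample


# Toy log: item 0 is very popular, items 1 and 2 are rare.
sampler = make_popularity_sampler([0] * 1000 + [1] * 10 + [2] * 10)
negatives = sampler(user_history={1}, k=200)
print(negatives.count(0) > negatives.count(2))  # True: popular items dominate the draws
```

Setting `alpha=0` recovers uniform sampling over known items and `alpha=1` samples in direct proportion to frequency, so the exponent is a dial between the two.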

:::tip Separate Embedding Matrices Are Important

Some implementations use shared embeddings for GMF and MLP to save memory. This can degrade performance. The original paper lets GMF and MLP learn separate embeddings, noting that sharing might limit the fused model: for example, it forces both branches to use the same embedding size even when the best size for a linear interaction (GMF) differs from the best size for a nonlinear interaction (MLP).

If memory is a constraint, it is better to reduce the embedding dimensions than to share the matrices.

:::

YouTube Resources

| Video | Channel | Why Watch |
| --- | --- | --- |
| Neural Collaborative Filtering | Deepak Sekar | NCF paper walkthrough, equation by equation |
| Deep Learning for Recommendations | Stanford CS246 | DL recommendations overview from first principles |
| Embedding Layers Explained | StatQuest | Embedding intuition without the jargon |
| YouTube DNN Recommender | Yannic Kilcher | Real production system analysis - what NCF looks like at scale |

Interview Q&A

Q1: Why does NCF outperform matrix factorization on implicit feedback benchmarks?

Answer: MF uses a dot product as its interaction function, which is linear in each embedding and symmetric. It cannot capture nonlinear, asymmetric, or conjunction-based interactions between user and item latent factors. NCF replaces the dot product with a learned neural network, giving it the capacity, in principle, to approximate any continuous interaction function over the user-item feature space.

More specifically: MF's embedding space enforces geometric constraints (transitivity of similarity) that real user-item interactions do not obey. If user A is similar to user B, and user B is similar to user C, MF geometrically forces A and C to also be somewhat similar. NCF's MLP branch has no such constraint - it can represent arbitrary similarity relationships.

On implicit feedback in particular, the NCF training objective (BCE with negative sampling) is better matched to the task than the MSE objective commonly used with MF on explicit ratings. Both the architecture and the training objective contribute to NCF's advantage.

Q2: How does NeuMF handle implicit versus explicit feedback differently?

Answer: The key differences are in the loss function and the target variable.

Explicit feedback (ratings 1–5):

Target: the actual rating value
Loss: typically MSE or MAE
The model predicts a score on a continuous scale

Implicit feedback (clicks, views, purchases):

Target: binary - did the user interact (1) or not (0)?
Loss: binary cross-entropy
The model predicts a probability of interaction
Requires negative sampling because "not interacted" $\neq$ "dislikes"

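NCF is designed for implicit feedback. With a sigmoid output and BCE loss, the score can be read as a probability of interaction. Because negative sampling changes the positive-to-negative ratio seen during training, these probabilities are calibrated to the sampled data rather than to true interaction rates, which matters little when only the ranking is used. MSE on implicit data is problematic because it treats the absence of interaction as a strong negative signal (score = 0), conflating "unobserved" with "disliked."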

Q3: Walk me through negative sampling strategies and when you would use each.

Answer: Negative sampling is critical to NCF performance - the model can only learn what "bad" recommendations look like if it is trained on negative examples.

Uniform random sampling: sample negatives with probability $1/N$ for each item $i$ not in the user's history. Fast, simple, works as a baseline. Weakness: most random items are obviously irrelevant (e.g., a Swahili-language documentary for a user who only watches English comedies). The model converges quickly on these easy negatives and stops improving.

Popularity-weighted sampling: $q(i) \propto \text{freq}(i)^{0.75}$ . Popular items appear more often as negatives. These are "harder" because the model might plausibly recommend them based on their high prior probability - but the user has not interacted. This forces the model to learn finer-grained user preferences. The $0.75$ smoothing exponent keeps the most popular items from completely dominating the negative pool.

Hard negative mining: score all items, take the highly-scored non-interactions as negatives. These are the hardest possible negatives - items the model wants to recommend but should not. Training on these maximally accelerates learning but requires generating negatives dynamically (extra inference pass per epoch). Used in production systems where accuracy is critical (e.g., Pinterest's PinSage).

Rule of thumb: start with popularity-weighted (free improvement over uniform), add hard negatives if you have the compute budget and accuracy matters over training speed.
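The hard negative mining step described above can be sketched for a single user as follows. It assumes the model's scores for all items are already available as a dictionary. `mine_hard_negatives` and `skip_top` are hypothetical names, and skipping the very top ranks is an optional heuristic for avoiding items the user might genuinely like.

```python
import heapq


def mine_hard_negatives(scores, user_history, k, skip_top=0):
    """Return the k highest-scoring items the user has NOT interacted with.

    scores: dict mapping item -> model score for one user.
    skip_top: number of top-ranked non-interactions to skip, since the very
        highest-scoring ones may be unobserved positives rather than true negatives.
    """
    candidates = ((s, item) for item, s in scores.items() if item not in user_history)
    ranked = heapq.nlargest(k + skip_top, candidates)  # sorted by score, descending
    return [item for _, item in ranked[skip_top:]]


scores = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.1}
print(mine_hard_negatives(scores, user_history={"a"}, k=2))  # ['b', 'c']
```

Because the scores come from the current model, the mined negatives go stale as training progresses, which is why the text describes this as a periodic step rather than a one-off preprocessing pass.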

Q4: Why does NeuMF use separate embedding matrices for the GMF and MLP branches rather than shared embeddings?

Answer: The two branches have fundamentally different interaction functions, and the optimal embedding for one function may not be optimal for the other.

The GMF branch computes element-wise products of embeddings and then applies a linear output layer. This encourages embeddings where aligned dimensions encode correlated user-item preferences - geometrically, it pushes similar users and items to have similar vectors.

The MLP branch concatenates embeddings and passes them through nonlinear layers. It can learn any interaction between any combination of user and item embedding dimensions. The optimal MLP embeddings may look very different from the optimal GMF embeddings - the MLP might benefit from embeddings that span a rich variety of orthogonal features rather than ones that align dimension-by-dimension.

Sharing embeddings forces a single representation to serve two different purposes simultaneously, which can constrain both branches. The original paper argues that sharing might limit the fused model's performance (for example, it forces GMF and MLP to use the same embedding size) and therefore lets the two branches learn separate embeddings.

The cost is a doubling of embedding memory, which is not negligible because embedding tables hold most of NeuMF's parameters. If memory is tight, shrinking the embedding dimension is usually a better trade-off than sharing the matrices.

Q5: How would you scale NeuMF to a production system with 100 million users and 10 million items?

Answer: NeuMF as described cannot scale to this regime for real-time inference - the core bottleneck is the requirement for a forward pass per user-item pair.

Two-stage pipeline (the standard production solution):

Retrieval stage: use a two-tower model (described in the next lesson) to reduce 10 million items to ~1,000 candidates in milliseconds using approximate nearest neighbor search. The two-tower constraint (independent user and item towers) allows item embeddings to be precomputed offline.
Ranking stage: apply a NeuMF-style model to rank the 1,000 candidates. Scoring 1,000 user-item pairs is a single batched forward pass, which typically fits within a latency budget of tens of milliseconds on a GPU.
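The two stages above can be sketched end to end in a few lines. This is a toy illustration, not a production design: the brute-force dot-product search stands in for an approximate nearest neighbor index, and `score_fn` is a placeholder for a trained NeuMF-style ranker. All function names are hypothetical.

```python
import heapq


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def retrieve(user_vec, item_vecs, n_candidates):
    """Stage 1: cheap top-n by dot product (brute force here, ANN in production)."""
    return heapq.nlargest(
        n_candidates, item_vecs, key=lambda item: dot(user_vec, item_vecs[item])
    )


def rank(user_id, candidates, score_fn, top_k):
    """Stage 2: run the expensive pairwise model only on the retrieved candidates."""
    scored = [(score_fn(user_id, item), item) for item in candidates]
    return [item for _, item in heapq.nlargest(top_k, scored)]


# Toy data: four items with 2-d precomputed embeddings.
item_vecs = {0: [1.0, 0.0], 1: [0.9, 0.1], 2: [0.0, 1.0], 3: [-1.0, 0.0]}
candidates = retrieve([1.0, 0.0], item_vecs, n_candidates=2)
print(candidates)  # [0, 1]

# Placeholder ranker that happens to prefer item 1 over item 0.
toy_scores = {0: 0.2, 1: 0.8}
print(rank(user_id=42, candidates=candidates, score_fn=lambda u, i: toy_scores[i], top_k=1))  # [1]
```

The split matters because stage 1 touches every item but only with a dot product against precomputed vectors, while stage 2 runs the full interaction model on a candidate set small enough to score in one batch.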

Additional engineering concerns at this scale:

Embedding sharding: embedding tables of 100M users × 128 dimensions × 4 bytes = 51 GB. Must be sharded across multiple servers or stored in parameter servers (e.g., Meta's DLRM uses custom parameter server infrastructure).
Freshness: user behavior changes faster than you can retrain. Use online learning or frequent incremental updates for user embeddings.
Popularity bias correction: at 100M users, popularity biases become extreme. Explicit debiasing is necessary.
Training data: at 100M users and 10M items, the interaction log is huge. Distributed training is required, e.g., PyTorch DDP or FSDP (both of which synchronize gradients every step), or a parameter-server setup if asynchronous updates are acceptable.
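The embedding memory arithmetic above is easy to reproduce and extend. This sketch assumes float32 values and, for illustration, that both the GMF and MLP branches use 128 dimensions. The 16 GB per-server budget is a made-up number used only to show the shard calculation.

```python
def embedding_table_bytes(num_rows: int, dim: int, bytes_per_value: int = 4) -> int:
    """Size of one dense float32 embedding table in bytes."""
    return num_rows * dim * bytes_per_value


user_table = embedding_table_bytes(100_000_000, 128)  # figures from the text
item_table = embedding_table_bytes(10_000_000, 128)

# NeuMF keeps separate GMF and MLP embeddings, so each table exists twice
# (assuming both branches use 128 dimensions).
total = 2 * (user_table + item_table)

shard_budget = 16 * 10**9  # hypothetical: 16 GB of embedding memory per server
num_shards = -(-total // shard_budget)  # ceiling division

print(user_table / 1e9)  # 51.2
print(total / 1e9)       # 112.64
print(num_shards)        # 8
```

The user table dominates, so techniques that shrink it (lower dimension, reduced precision, hashing) have the largest effect on the total.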

Q6: What is the transitivity problem in matrix factorization and how does NCF address it?

Answer: In MF, user similarity is measured by the dot product (or cosine similarity) in the shared embedding space. The geometry of that space imposes a transitivity constraint: if $\text{sim}(A, B)$ is high and $\text{sim}(B, C)$ is high, then $\text{sim}(A, C)$ must also be relatively high - because in a Euclidean space, if $\vec{p}_A \approx \vec{p}_B$ and $\vec{p}_B \approx \vec{p}_C$ , then $\vec{p}_A \approx \vec{p}_C$ .

Real user preferences do not obey this constraint. Users A and C can have similar taste to user B in completely different dimensions (A loves B's jazz picks, C loves B's action picks), with no overlap between A and C.

MF must distort embeddings to accommodate conflicting similarity requirements, degrading the quality of all representations.

NCF's MLP branch addresses this because the MLP does not define similarity via a distance function in a shared space. Instead, it learns an arbitrary mapping from the joint user-item feature space to a probability. It can represent "A is similar to B in jazz dimensions, B is similar to C in action dimensions, but A and C are dissimilar" without any geometric contradiction. The MLP has no transitivity constraint built into its architecture.
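For cosine similarity the constraint can be made quantitative. The bound below is a standard consequence of the triangle inequality for angles between vectors. It is added here as a worked illustration and is not taken from the NCF paper. Write $\theta_{XY}$ for the angle between $\vec{p}_X$ and $\vec{p}_Y$:

$$
\theta_{AC} \le \theta_{AB} + \theta_{BC}
\quad\Longrightarrow\quad
\cos\theta_{AC} \;\ge\; \cos\theta_{AB}\,\cos\theta_{BC} - \sin\theta_{AB}\,\sin\theta_{BC}
\qquad (\theta_{AB} + \theta_{BC} \le \pi)
$$

For example, if $\cos\theta_{AB} = \cos\theta_{BC} = 0.9$ , then $\sin\theta_{AB}\sin\theta_{BC} = 1 - 0.81 = 0.19$ , so $\cos\theta_{AC} \ge 0.81 - 0.19 = 0.62$ . No placement of the three embeddings can make A and C dissimilar while both stay that close to B.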

Key Takeaways

MF is a special case of NCF - the dot product is a linear interaction function, and NCF generalizes it to any learnable function.
NeuMF = GMF + MLP - the GMF branch captures linear patterns efficiently; the MLP branch captures nonlinear patterns expressively; the fusion captures both.
Implicit feedback requires BCE + negative sampling - not MSE, not just positive examples.
Separate embeddings matter - the optimal representation for linear interaction differs from the optimal representation for nonlinear interaction.
NeuMF is a ranking model, not a retrieval model - at production scale, it lives in the second stage of a retrieval → ranking pipeline.
The field moved on to two-tower for retrieval - but NeuMF's ideas (deep interaction functions, negative sampling, embedding separation) live on in every modern recommendation system.

